* [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative
@ 2016-11-02 19:33 Andrea Arcangeli
  2016-11-02 19:33 ` [PATCH 01/33] userfaultfd: document _IOR/_IOW Andrea Arcangeli
                   ` (33 more replies)
  0 siblings, 34 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

Hello,

these userfaultfd features are finished and are ready for larger
exposure in -mm and upstream merging.

1) tmpfs non present userfault
2) hugetlbfs non present userfault
3) non cooperative userfault for fork/madvise/mremap

QEMU development code is already exercising 2), and container postcopy
live migration needs 3).

1) is not currently used, but there's a selftest for it, and we know
some QEMU users use tmpfs as the backing memory for KVM for various
reasons, so they will need it too in order to use postcopy live
migration with tmpfs memory.
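
For readers who haven't used the interface before, here is a minimal
sketch (error handling omitted, untested) of the existing
anonymous-memory flow that 1) and 2) extend to tmpfs and hugetlbfs; it
assumes another thread touches "area" to trigger the fault:

#include <fcntl.h>
#include <poll.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/*
 * Register "area" for missing-page faults and resolve the first fault
 * by copying in "page", a pre-filled buffer of one page.
 */
static void resolve_one_fault(void *area, size_t len, void *page)
{
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	long page_size = sysconf(_SC_PAGESIZE);
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)area, .len = len },
		.mode = UFFDIO_REGISTER_MODE_MISSING,
	};
	struct pollfd pfd = { .fd = uffd, .events = POLLIN };
	struct uffd_msg msg;
	struct uffdio_copy copy;

	ioctl(uffd, UFFDIO_API, &api);
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	poll(&pfd, 1, -1);
	read(uffd, &msg, sizeof(msg));	/* msg.event == UFFD_EVENT_PAGEFAULT */

	copy.dst = msg.arg.pagefault.address & ~((__u64)page_size - 1);
	copy.src = (unsigned long)page;
	copy.len = page_size;
	copy.mode = 0;
	ioctl(uffd, UFFDIO_COPY, &copy);	/* wakes the faulting thread */
}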

In addition there are a few related pending fixes and cleanups.

The "mm: mprotect: use pmd_trans_unstable instead of taking the
pmd_lock" patch is actually required for the WP support that will come
later, but it looks a nice cleanup + optimization for upstream too so
I'm sending it already.

Andrea Arcangeli (10):
  userfaultfd: document _IOR/_IOW
  userfaultfd: correct comment about UFFD_FEATURE_PAGEFAULT_FLAG_WP
  userfaultfd: convert BUG() to WARN_ON_ONCE()
  userfaultfd: use vma_is_anonymous
  userfaultfd: non-cooperative: report all available features to
    userland
  userfaultfd: non-cooperative: Add fork() event, build warning fix
  userfaultfd: shmem: add tlbflush.h header for microblaze
  userfaultfd: shmem: lock the page before adding it to pagecache
  userfaultfd: shmem: avoid leaking blocks and used blocks in
    UFFDIO_COPY
  mm: mprotect: use pmd_trans_unstable instead of taking the pmd_lock

Mike Kravetz (7):
  userfaultfd: hugetlbfs: add copy_huge_page_from_user for hugetlb
    userfaultfd support
  userfaultfd: hugetlbfs: add hugetlb_mcopy_atomic_pte for userfaultfd
    support
  userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page
    UFFDIO_COPY
  userfaultfd: hugetlbfs: add userfaultfd hugetlb hook
  userfaultfd: hugetlbfs: allow registration of ranges containing huge
    pages
  userfaultfd: hugetlbfs: add userfaultfd_hugetlb test
  userfaultfd: hugetlbfs: userfaultfd_huge_must_wait for hugepmd ranges

Mike Rapoport (11):
  userfaultfd: non-cooperative: dup_userfaultfd: use mm_count instead of
    mm_users
  userfaultfd: introduce vma_can_userfault
  userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support
  userfaultfd: shmem: introduce vma_is_shmem
  userfaultfd: shmem: use shmem_mcopy_atomic_pte for shared memory
  userfaultfd: shmem: add userfaultfd hook for shared memory faults
  userfaultfd: shmem: allow registration of shared memory ranges
  userfaultfd: shmem: add userfaultfd_shmem test
  userfaultfd: non-cooperative: selftest: introduce userfaultfd_open
  userfaultfd: non-cooperative: selftest: add ufd parameter to copy_page
  userfaultfd: non-cooperative: selftest: add test for FORK,
    MADVDONTNEED and REMAP events

Pavel Emelyanov (5):
  userfaultfd: non-cooperative: Split the find_userfault() routine
  userfaultfd: non-cooperative: Add ability to report non-PF events from
    uffd descriptor
  userfaultfd: non-cooperative: Add fork() event
  userfaultfd: non-cooperative: Add mremap() event
  userfaultfd: non-cooperative: Add madvise() event for MADV_DONTNEED
    request

 fs/userfaultfd.c                         | 445 +++++++++++++++++++++++++++++--
 include/linux/hugetlb.h                  |   8 +-
 include/linux/mm.h                       |  13 +
 include/linux/shmem_fs.h                 |  11 +
 include/linux/userfaultfd_k.h            |  42 +++
 include/uapi/asm-generic/ioctl.h         |  10 +-
 include/uapi/linux/userfaultfd.h         |  39 ++-
 kernel/fork.c                            |  10 +-
 mm/hugetlb.c                             | 114 ++++++++
 mm/madvise.c                             |   2 +
 mm/memory.c                              |  25 ++
 mm/mprotect.c                            |  44 ++-
 mm/mremap.c                              |  17 +-
 mm/shmem.c                               | 159 ++++++++++-
 mm/userfaultfd.c                         | 211 ++++++++++++++-
 tools/testing/selftests/vm/Makefile      |   8 +
 tools/testing/selftests/vm/run_vmtests   |  24 ++
 tools/testing/selftests/vm/userfaultfd.c | 405 +++++++++++++++++++++++++---
 18 files changed, 1453 insertions(+), 134 deletions(-)


* [PATCH 01/33] userfaultfd: document _IOR/_IOW
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
@ 2016-11-02 19:33 ` Andrea Arcangeli
  2016-11-02 19:33 ` [PATCH 02/33] userfaultfd: correct comment about UFFD_FEATURE_PAGEFAULT_FLAG_WP Andrea Arcangeli
                   ` (32 subsequent siblings)
  33 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

This adds proper documentation (inline) to avoid the risk of further
misunderstandings about the semantics of _IOW/_IOR, and it also reminds
whoever will bump the UFFDIO_API in the future to change the two
ioctls to _IOW.

This was found while implementing strace support for those ioctls; we
could never have found it by just reviewing the kernel code and testing
it.

_IOC_READ or _IOC_WRITE alters nothing but the ioctl number itself, so
it's only worth fixing if the UFFDIO_API is bumped someday.
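
To make the naming concrete, here is an illustrative sketch of how the
direction macros are meant to be picked (the FOO_* ioctls and struct
are hypothetical, not part of the uffd API):

/*
 * Direction is named from userland's point of view:
 *
 *   _IOW: userland writes the argument, the kernel reads it
 *   _IOR: userland reads the argument, the kernel writes it
 *  _IOWR: both directions
 */
struct foo_range {
	__u64 start;
	__u64 len;
};

/* userland fills in foo_range, the kernel only reads it -> _IOW */
#define FOO_INVALIDATE	_IOW('F', 0x01, struct foo_range)
/* the kernel fills in foo_range, userland reads it back -> _IOR */
#define FOO_QUERY	_IOR('F', 0x02, struct foo_range)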

Reported-by: "Dmitry V. Levin" <ldv@altlinux.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/uapi/asm-generic/ioctl.h | 10 +++++++++-
 include/uapi/linux/userfaultfd.h |  6 ++++++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/uapi/asm-generic/ioctl.h b/include/uapi/asm-generic/ioctl.h
index 7e7c11b..749b32f 100644
--- a/include/uapi/asm-generic/ioctl.h
+++ b/include/uapi/asm-generic/ioctl.h
@@ -48,6 +48,9 @@
 /*
  * Direction bits, which any architecture can choose to override
  * before including this file.
+ *
+ * NOTE: _IOC_WRITE means userland is writing and kernel is
+ * reading. _IOC_READ means userland is reading and kernel is writing.
  */
 
 #ifndef _IOC_NONE
@@ -72,7 +75,12 @@
 #define _IOC_TYPECHECK(t) (sizeof(t))
 #endif
 
-/* used to create numbers */
+/*
+ * Used to create numbers.
+ *
+ * NOTE: _IOW means userland is writing and kernel is reading. _IOR
+ * means userland is reading and kernel is writing.
+ */
 #define _IO(type,nr)		_IOC(_IOC_NONE,(type),(nr),0)
 #define _IOR(type,nr,size)	_IOC(_IOC_READ,(type),(nr),(_IOC_TYPECHECK(size)))
 #define _IOW(type,nr,size)	_IOC(_IOC_WRITE,(type),(nr),(_IOC_TYPECHECK(size)))
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 9057d7a..94046b8 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -11,6 +11,12 @@
 
 #include <linux/types.h>
 
+/*
+ * If the UFFDIO_API is upgraded someday, the UFFDIO_UNREGISTER and
+ * UFFDIO_WAKE ioctls should be defined as _IOW and not as _IOR.  In
+ * userfaultfd.h we assumed the kernel was reading (instead _IOC_READ
+ * means the userland is reading).
+ */
 #define UFFD_API ((__u64)0xAA)
 /*
  * After implementing the respective features it will become:


* [PATCH 02/33] userfaultfd: correct comment about UFFD_FEATURE_PAGEFAULT_FLAG_WP
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
  2016-11-02 19:33 ` [PATCH 01/33] userfaultfd: document _IOR/_IOW Andrea Arcangeli
@ 2016-11-02 19:33 ` Andrea Arcangeli
  2016-11-02 19:33 ` [PATCH 03/33] userfaultfd: convert BUG() to WARN_ON_ONCE() Andrea Arcangeli
                   ` (31 subsequent siblings)
  33 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

Minor comment correction.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 85959d8..501784e 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -162,7 +162,7 @@ static inline struct uffd_msg userfault_msg(unsigned long address,
 	msg.arg.pagefault.address = address;
 	if (flags & FAULT_FLAG_WRITE)
 		/*
-		 * If UFFD_FEATURE_PAGEFAULT_FLAG_WRITE was set in the
+		 * If UFFD_FEATURE_PAGEFAULT_FLAG_WP was set in the
 		 * uffdio_api.features and UFFD_PAGEFAULT_FLAG_WRITE
 		 * was not set in a UFFD_EVENT_PAGEFAULT, it means it
 		 * was a read fault, otherwise if set it means it's


* [PATCH 03/33] userfaultfd: convert BUG() to WARN_ON_ONCE()
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
  2016-11-02 19:33 ` [PATCH 01/33] userfaultfd: document _IOR/_IOW Andrea Arcangeli
  2016-11-02 19:33 ` [PATCH 02/33] userfaultfd: correct comment about UFFD_FEATURE_PAGEFAULT_FLAG_WP Andrea Arcangeli
@ 2016-11-02 19:33 ` Andrea Arcangeli
  2016-11-02 19:33 ` [PATCH 04/33] userfaultfd: use vma_is_anonymous Andrea Arcangeli
                   ` (30 subsequent siblings)
  33 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

Avoid BUG() and only WARN instead. This is just a cleanup and can't
make any runtime difference: this BUG() has never triggered and cannot
trigger.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 501784e..5a1c3cf 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -539,7 +539,8 @@ static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
 			ret = POLLIN;
 		return ret;
 	default:
-		BUG();
+		WARN_ON_ONCE(1);
+		return POLLERR;
 	}
 }
 


* [PATCH 04/33] userfaultfd: use vma_is_anonymous
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (2 preceding siblings ...)
  2016-11-02 19:33 ` [PATCH 03/33] userfaultfd: convert BUG() to WARN_ON_ONCE() Andrea Arcangeli
@ 2016-11-02 19:33 ` Andrea Arcangeli
  2016-11-02 19:33 ` [PATCH 05/33] userfaultfd: non-cooperative: Split the find_userfault() routine Andrea Arcangeli
                   ` (29 subsequent siblings)
  33 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

Clean up the vma->vm_ops usage.

Side note: it would be more robust if vma_is_anonymous() also checked
that vm_flags doesn't have VM_PFNMAP set.
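
For illustration only (not part of this patch), the stricter helper
hinted at above could look like:

static inline bool vma_is_anonymous_strict(struct vm_area_struct *vma)
{
	/* no vm_ops and not a pure PFN mapping */
	return !vma->vm_ops && !(vma->vm_flags & VM_PFNMAP);
}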

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 8 ++++----
 mm/userfaultfd.c | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 5a1c3cf..4161f99 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -795,7 +795,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 
 		/* check not compatible vmas */
 		ret = -EINVAL;
-		if (cur->vm_ops)
+		if (!vma_is_anonymous(cur))
 			goto out_unlock;
 
 		/*
@@ -820,7 +820,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 	do {
 		cond_resched();
 
-		BUG_ON(vma->vm_ops);
+		BUG_ON(!vma_is_anonymous(vma));
 		BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
 		       vma->vm_userfaultfd_ctx.ctx != ctx);
 
@@ -946,7 +946,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 		 * provides for more strict behavior to notice
 		 * unregistration errors.
 		 */
-		if (cur->vm_ops)
+		if (!vma_is_anonymous(cur))
 			goto out_unlock;
 
 		found = true;
@@ -960,7 +960,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 	do {
 		cond_resched();
 
-		BUG_ON(vma->vm_ops);
+		BUG_ON(!vma_is_anonymous(vma));
 
 		/*
 		 * Nothing to do: this vma is already registered into this
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index af817e5..9c2ed70 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -197,7 +197,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	 * FIXME: only allow copying on anonymous vmas, tmpfs should
 	 * be added.
 	 */
-	if (dst_vma->vm_ops)
+	if (!vma_is_anonymous(dst_vma))
 		goto out_unlock;
 
 	/*


* [PATCH 05/33] userfaultfd: non-cooperative: Split the find_userfault() routine
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (3 preceding siblings ...)
  2016-11-02 19:33 ` [PATCH 04/33] userfaultfd: use vma_is_anonymous Andrea Arcangeli
@ 2016-11-02 19:33 ` Andrea Arcangeli
  2016-11-02 19:33 ` [PATCH 06/33] userfaultfd: non-cooperative: Add ability to report non-PF events from uffd descriptor Andrea Arcangeli
                   ` (28 subsequent siblings)
  33 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

From: Pavel Emelyanov <xemul@parallels.com>

I will need a routine to look up userfaultfd_wait_queue-s in a
different wait queue.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 4161f99..b4f790f 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -487,25 +487,30 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
 }
 
 /* fault_pending_wqh.lock must be hold by the caller */
-static inline struct userfaultfd_wait_queue *find_userfault(
-	struct userfaultfd_ctx *ctx)
+static inline struct userfaultfd_wait_queue *find_userfault_in(
+		wait_queue_head_t *wqh)
 {
 	wait_queue_t *wq;
 	struct userfaultfd_wait_queue *uwq;
 
-	VM_BUG_ON(!spin_is_locked(&ctx->fault_pending_wqh.lock));
+	VM_BUG_ON(!spin_is_locked(&wqh->lock));
 
 	uwq = NULL;
-	if (!waitqueue_active(&ctx->fault_pending_wqh))
+	if (!waitqueue_active(wqh))
 		goto out;
 	/* walk in reverse to provide FIFO behavior to read userfaults */
-	wq = list_last_entry(&ctx->fault_pending_wqh.task_list,
-			     typeof(*wq), task_list);
+	wq = list_last_entry(&wqh->task_list, typeof(*wq), task_list);
 	uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
 out:
 	return uwq;
 }
 
+static inline struct userfaultfd_wait_queue *find_userfault(
+		struct userfaultfd_ctx *ctx)
+{
+	return find_userfault_in(&ctx->fault_pending_wqh);
+}
+
 static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
 {
 	struct userfaultfd_ctx *ctx = file->private_data;


* [PATCH 06/33] userfaultfd: non-cooperative: Add ability to report non-PF events from uffd descriptor
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (4 preceding siblings ...)
  2016-11-02 19:33 ` [PATCH 05/33] userfaultfd: non-cooperative: Split the find_userfault() routine Andrea Arcangeli
@ 2016-11-02 19:33 ` Andrea Arcangeli
  2016-11-02 19:33 ` [PATCH 07/33] userfaultfd: non-cooperative: report all available features to userland Andrea Arcangeli
                   ` (27 subsequent siblings)
  33 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

From: Pavel Emelyanov <xemul@parallels.com>

The custom events are queued in ctx->event_wqh so as not to disturb the
fast-pathed PF queue-wait-wakeup functions.

The events to be generated (other than PFs) are requested in the
UFFD_API ioctl with the uffdio_api.features bits. Those known by the
kernel are then turned on and reported back to user-space.
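
From userland, requesting an event feature at UFFD_API handshake time
looks roughly like this (simplified sketch, no error handling; uffd is
an already-open userfaultfd descriptor and UFFD_FEATURE_EVENT_FORK is
used as an example, it only becomes available later in this series):

struct uffdio_api api = {
	.api = UFFD_API,
	/* request the fork event in addition to page faults */
	.features = UFFD_FEATURE_EVENT_FORK,
};

if (!ioctl(uffd, UFFDIO_API, &api) &&
    (api.features & UFFD_FEATURE_EVENT_FORK)) {
	/*
	 * The kernel knows the feature; because we requested it,
	 * fork events will now be delivered on this uffd.
	 */
}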

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 96 insertions(+), 2 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index b4f790f..76205b3 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -12,6 +12,7 @@
  *  mm/ksm.c (mm hashing).
  */
 
+#include <linux/list.h>
 #include <linux/hashtable.h>
 #include <linux/sched.h>
 #include <linux/mm.h>
@@ -45,12 +46,16 @@ struct userfaultfd_ctx {
 	wait_queue_head_t fault_wqh;
 	/* waitqueue head for the pseudo fd to wakeup poll/read */
 	wait_queue_head_t fd_wqh;
+	/* waitqueue head for events */
+	wait_queue_head_t event_wqh;
 	/* a refile sequence protected by fault_pending_wqh lock */
 	struct seqcount refile_seq;
 	/* pseudo fd refcounting */
 	atomic_t refcount;
 	/* userfaultfd syscall flags */
 	unsigned int flags;
+	/* features requested from the userspace */
+	unsigned int features;
 	/* state machine */
 	enum userfaultfd_state state;
 	/* released */
@@ -135,6 +140,8 @@ static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
 		VM_BUG_ON(waitqueue_active(&ctx->fault_pending_wqh));
 		VM_BUG_ON(spin_is_locked(&ctx->fault_wqh.lock));
 		VM_BUG_ON(waitqueue_active(&ctx->fault_wqh));
+		VM_BUG_ON(spin_is_locked(&ctx->event_wqh.lock));
+		VM_BUG_ON(waitqueue_active(&ctx->event_wqh));
 		VM_BUG_ON(spin_is_locked(&ctx->fd_wqh.lock));
 		VM_BUG_ON(waitqueue_active(&ctx->fd_wqh));
 		mmdrop(ctx->mm);
@@ -423,6 +430,59 @@ int handle_userfault(struct fault_env *fe, unsigned long reason)
 	return ret;
 }
 
+static int __maybe_unused userfaultfd_event_wait_completion(
+		struct userfaultfd_ctx *ctx,
+		struct userfaultfd_wait_queue *ewq)
+{
+	int ret = 0;
+
+	ewq->ctx = ctx;
+	init_waitqueue_entry(&ewq->wq, current);
+
+	spin_lock(&ctx->event_wqh.lock);
+	/*
+	 * After the __add_wait_queue the uwq is visible to userland
+	 * through poll/read().
+	 */
+	__add_wait_queue(&ctx->event_wqh, &ewq->wq);
+	for (;;) {
+		set_current_state(TASK_KILLABLE);
+		if (ewq->msg.event == 0)
+			break;
+		if (ACCESS_ONCE(ctx->released) ||
+		    fatal_signal_pending(current)) {
+			ret = -1;
+			__remove_wait_queue(&ctx->event_wqh, &ewq->wq);
+			break;
+		}
+
+		spin_unlock(&ctx->event_wqh.lock);
+
+		wake_up_poll(&ctx->fd_wqh, POLLIN);
+		schedule();
+
+		spin_lock(&ctx->event_wqh.lock);
+	}
+	__set_current_state(TASK_RUNNING);
+	spin_unlock(&ctx->event_wqh.lock);
+
+	/*
+	 * ctx may go away after this if the userfault pseudo fd is
+	 * already released.
+	 */
+
+	userfaultfd_ctx_put(ctx);
+	return ret;
+}
+
+static void userfaultfd_event_complete(struct userfaultfd_ctx *ctx,
+				       struct userfaultfd_wait_queue *ewq)
+{
+	ewq->msg.event = 0;
+	wake_up_locked(&ctx->event_wqh);
+	__remove_wait_queue(&ctx->event_wqh, &ewq->wq);
+}
+
 static int userfaultfd_release(struct inode *inode, struct file *file)
 {
 	struct userfaultfd_ctx *ctx = file->private_data;
@@ -511,6 +571,12 @@ static inline struct userfaultfd_wait_queue *find_userfault(
 	return find_userfault_in(&ctx->fault_pending_wqh);
 }
 
+static inline struct userfaultfd_wait_queue *find_userfault_evt(
+		struct userfaultfd_ctx *ctx)
+{
+	return find_userfault_in(&ctx->event_wqh);
+}
+
 static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
 {
 	struct userfaultfd_ctx *ctx = file->private_data;
@@ -542,6 +608,9 @@ static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
 		smp_mb();
 		if (waitqueue_active(&ctx->fault_pending_wqh))
 			ret = POLLIN;
+		else if (waitqueue_active(&ctx->event_wqh))
+			ret = POLLIN;
+
 		return ret;
 	default:
 		WARN_ON_ONCE(1);
@@ -606,6 +675,19 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
 			break;
 		}
 		spin_unlock(&ctx->fault_pending_wqh.lock);
+
+		spin_lock(&ctx->event_wqh.lock);
+		uwq = find_userfault_evt(ctx);
+		if (uwq) {
+			*msg = uwq->msg;
+
+			userfaultfd_event_complete(ctx, uwq);
+			spin_unlock(&ctx->event_wqh.lock);
+			ret = 0;
+			break;
+		}
+		spin_unlock(&ctx->event_wqh.lock);
+
 		if (signal_pending(current)) {
 			ret = -ERESTARTSYS;
 			break;
@@ -1149,6 +1231,14 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
 	return ret;
 }
 
+static inline unsigned int uffd_ctx_features(__u64 user_features)
+{
+	/*
+	 * For the current set of features the bits just coincide
+	 */
+	return (unsigned int)user_features;
+}
+
 /*
  * userland asks for a certain API version and we return which bits
  * and ioctl commands are implemented in this kernel for such API
@@ -1167,19 +1257,21 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
 	ret = -EFAULT;
 	if (copy_from_user(&uffdio_api, buf, sizeof(uffdio_api)))
 		goto out;
-	if (uffdio_api.api != UFFD_API || uffdio_api.features) {
+	if (uffdio_api.api != UFFD_API ||
+	    (uffdio_api.features & ~UFFD_API_FEATURES)) {
 		memset(&uffdio_api, 0, sizeof(uffdio_api));
 		if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
 			goto out;
 		ret = -EINVAL;
 		goto out;
 	}
-	uffdio_api.features = UFFD_API_FEATURES;
+	uffdio_api.features &= UFFD_API_FEATURES;
 	uffdio_api.ioctls = UFFD_API_IOCTLS;
 	ret = -EFAULT;
 	if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
 		goto out;
 	ctx->state = UFFD_STATE_RUNNING;
+	ctx->features = uffd_ctx_features(uffdio_api.features);
 	ret = 0;
 out:
 	return ret;
@@ -1266,6 +1358,7 @@ static void init_once_userfaultfd_ctx(void *mem)
 
 	init_waitqueue_head(&ctx->fault_pending_wqh);
 	init_waitqueue_head(&ctx->fault_wqh);
+	init_waitqueue_head(&ctx->event_wqh);
 	init_waitqueue_head(&ctx->fd_wqh);
 	seqcount_init(&ctx->refile_seq);
 }
@@ -1306,6 +1399,7 @@ static struct file *userfaultfd_file_create(int flags)
 
 	atomic_set(&ctx->refcount, 1);
 	ctx->flags = flags;
+	ctx->features = 0;
 	ctx->state = UFFD_STATE_WAIT_API;
 	ctx->released = false;
 	ctx->mm = current->mm;


* [PATCH 07/33] userfaultfd: non-cooperative: report all available features to userland
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (5 preceding siblings ...)
  2016-11-02 19:33 ` [PATCH 06/33] userfaultfd: non-cooperative: Add ability to report non-PF events from uffd descriptor Andrea Arcangeli
@ 2016-11-02 19:33 ` Andrea Arcangeli
  2016-11-02 19:33 ` [PATCH 08/33] userfaultfd: non-cooperative: Add fork() event Andrea Arcangeli
                   ` (26 subsequent siblings)
  33 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

This will allow userland to probe all the features available in the
kernel. It will, however, only enable the requested features in the
open userfaultfd context.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 76205b3..b164c40 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1250,6 +1250,7 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
 	struct uffdio_api uffdio_api;
 	void __user *buf = (void __user *)arg;
 	int ret;
+	__u64 features;
 
 	ret = -EINVAL;
 	if (ctx->state != UFFD_STATE_WAIT_API)
@@ -1257,21 +1258,23 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
 	ret = -EFAULT;
 	if (copy_from_user(&uffdio_api, buf, sizeof(uffdio_api)))
 		goto out;
-	if (uffdio_api.api != UFFD_API ||
-	    (uffdio_api.features & ~UFFD_API_FEATURES)) {
+	features = uffdio_api.features;
+	if (uffdio_api.api != UFFD_API || (features & ~UFFD_API_FEATURES)) {
 		memset(&uffdio_api, 0, sizeof(uffdio_api));
 		if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
 			goto out;
 		ret = -EINVAL;
 		goto out;
 	}
-	uffdio_api.features &= UFFD_API_FEATURES;
+	/* report all available features and ioctls to userland */
+	uffdio_api.features = UFFD_API_FEATURES;
 	uffdio_api.ioctls = UFFD_API_IOCTLS;
 	ret = -EFAULT;
 	if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
 		goto out;
 	ctx->state = UFFD_STATE_RUNNING;
-	ctx->features = uffd_ctx_features(uffdio_api.features);
+	/* only enable the requested features for this uffd context */
+	ctx->features = uffd_ctx_features(features);
 	ret = 0;
 out:
 	return ret;


* [PATCH 08/33] userfaultfd: non-cooperative: Add fork() event
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (6 preceding siblings ...)
  2016-11-02 19:33 ` [PATCH 07/33] userfaultfd: non-cooperative: report all available features to userland Andrea Arcangeli
@ 2016-11-02 19:33 ` Andrea Arcangeli
  2016-11-02 19:33 ` [PATCH 09/33] userfaultfd: non-cooperative: Add fork() event, build warning fix Andrea Arcangeli
                   ` (25 subsequent siblings)
  33 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

From: Pavel Emelyanov <xemul@parallels.com>

When an mm with uffd-ed vmas fork()-s, the respective vmas
notify their uffds with an event which contains a descriptor
of the new uffd. This new descriptor can then be used to get
events from the child and populate its mm with data. Note
that there can be different uffd-s controlling different
vmas within one mm, so first we should collect all those
uffds (and ctx-s) in a list and then notify them all one by
one, but only once per fork().

The context is created at fork() time but the descriptor, file
struct and anon inode object are created at event read time. So
some trickery is added to userfaultfd_ctx_read() to handle
the ctx queues' locking vs file creation.

Another thing worth noticing is that the task that fork()-s
waits for the uffd event to get processed WITHOUT the mmap sem.
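
On the reader side the new event shows up in the normal read() stream;
a simplified sketch of consuming it (error handling omitted) could be:

static int handle_uffd_event(int uffd)
{
	struct uffd_msg msg;
	int child_uffd = -1;

	if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
		return -1;

	switch (msg.event) {
	case UFFD_EVENT_PAGEFAULT:
		/* resolve with UFFDIO_COPY/UFFDIO_ZEROPAGE as usual */
		break;
	case UFFD_EVENT_FORK:
		/*
		 * msg.arg.fork.ufd is a new userfaultfd tracking the
		 * child's mm: poll/read it like the parent's uffd.
		 */
		child_uffd = msg.arg.fork.ufd;
		break;
	}
	return child_uffd;
}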

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c                 | 146 ++++++++++++++++++++++++++++++++++++++-
 include/linux/userfaultfd_k.h    |  13 ++++
 include/uapi/linux/userfaultfd.h |  15 ++--
 kernel/fork.c                    |  10 ++-
 4 files changed, 168 insertions(+), 16 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index b164c40..1de16c9 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -64,6 +64,12 @@ struct userfaultfd_ctx {
 	struct mm_struct *mm;
 };
 
+struct userfaultfd_fork_ctx {
+	struct userfaultfd_ctx *orig;
+	struct userfaultfd_ctx *new;
+	struct list_head list;
+};
+
 struct userfaultfd_wait_queue {
 	struct uffd_msg msg;
 	wait_queue_t wq;
@@ -430,9 +436,8 @@ int handle_userfault(struct fault_env *fe, unsigned long reason)
 	return ret;
 }
 
-static int __maybe_unused userfaultfd_event_wait_completion(
-		struct userfaultfd_ctx *ctx,
-		struct userfaultfd_wait_queue *ewq)
+static int userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
+					     struct userfaultfd_wait_queue *ewq)
 {
 	int ret = 0;
 
@@ -483,6 +488,79 @@ static void userfaultfd_event_complete(struct userfaultfd_ctx *ctx,
 	__remove_wait_queue(&ctx->event_wqh, &ewq->wq);
 }
 
+int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
+{
+	struct userfaultfd_ctx *ctx = NULL, *octx;
+	struct userfaultfd_fork_ctx *fctx;
+
+	octx = vma->vm_userfaultfd_ctx.ctx;
+	if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
+		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+		vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
+		return 0;
+	}
+
+	list_for_each_entry(fctx, fcs, list)
+		if (fctx->orig == octx) {
+			ctx = fctx->new;
+			break;
+		}
+
+	if (!ctx) {
+		fctx = kmalloc(sizeof(*fctx), GFP_KERNEL);
+		if (!fctx)
+			return -ENOMEM;
+
+		ctx = kmem_cache_alloc(userfaultfd_ctx_cachep, GFP_KERNEL);
+		if (!ctx) {
+			kfree(fctx);
+			return -ENOMEM;
+		}
+
+		atomic_set(&ctx->refcount, 1);
+		ctx->flags = octx->flags;
+		ctx->state = UFFD_STATE_RUNNING;
+		ctx->features = octx->features;
+		ctx->released = false;
+		ctx->mm = vma->vm_mm;
+		atomic_inc(&ctx->mm->mm_users);
+
+		userfaultfd_ctx_get(octx);
+		fctx->orig = octx;
+		fctx->new = ctx;
+		list_add_tail(&fctx->list, fcs);
+	}
+
+	vma->vm_userfaultfd_ctx.ctx = ctx;
+	return 0;
+}
+
+static int dup_fctx(struct userfaultfd_fork_ctx *fctx)
+{
+	struct userfaultfd_ctx *ctx = fctx->orig;
+	struct userfaultfd_wait_queue ewq;
+
+	msg_init(&ewq.msg);
+
+	ewq.msg.event = UFFD_EVENT_FORK;
+	ewq.msg.arg.reserved.reserved1 = (__u64)fctx->new;
+
+	return userfaultfd_event_wait_completion(ctx, &ewq);
+}
+
+void dup_userfaultfd_complete(struct list_head *fcs)
+{
+	int ret = 0;
+	struct userfaultfd_fork_ctx *fctx, *n;
+
+	list_for_each_entry_safe(fctx, n, fcs, list) {
+		if (!ret)
+			ret = dup_fctx(fctx);
+		list_del(&fctx->list);
+		kfree(fctx);
+	}
+}
+
 static int userfaultfd_release(struct inode *inode, struct file *file)
 {
 	struct userfaultfd_ctx *ctx = file->private_data;
@@ -618,12 +696,49 @@ static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
 	}
 }
 
+static const struct file_operations userfaultfd_fops;
+
+static int resolve_userfault_fork(struct userfaultfd_ctx *ctx,
+				  struct userfaultfd_ctx *new,
+				  struct uffd_msg *msg)
+{
+	int fd;
+	struct file *file;
+	unsigned int flags = new->flags & UFFD_SHARED_FCNTL_FLAGS;
+
+	fd = get_unused_fd_flags(flags);
+	if (fd < 0)
+		return fd;
+
+	file = anon_inode_getfile("[userfaultfd]", &userfaultfd_fops, new,
+				  O_RDWR | flags);
+	if (IS_ERR(file)) {
+		put_unused_fd(fd);
+		return PTR_ERR(file);
+	}
+
+	fd_install(fd, file);
+	msg->arg.reserved.reserved1 = 0;
+	msg->arg.fork.ufd = fd;
+
+	return 0;
+}
+
 static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
 				    struct uffd_msg *msg)
 {
 	ssize_t ret;
 	DECLARE_WAITQUEUE(wait, current);
 	struct userfaultfd_wait_queue *uwq;
+	/*
+	 * Handling fork event requires sleeping operations, so
+	 * we drop the event_wqh lock, then do these ops, then
+	 * lock it back and wake up the waiter. While the lock is
+	 * dropped the ewq may go away so we keep track of it
+	 * carefully.
+	 */
+	LIST_HEAD(fork_event);
+	struct userfaultfd_ctx *fork_nctx = NULL;
 
 	/* always take the fd_wqh lock before the fault_pending_wqh lock */
 	spin_lock(&ctx->fd_wqh.lock);
@@ -681,6 +796,14 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
 		if (uwq) {
 			*msg = uwq->msg;
 
+			if (uwq->msg.event == UFFD_EVENT_FORK) {
+				fork_nctx = (struct userfaultfd_ctx *)uwq->msg.arg.reserved.reserved1;
+				list_move(&uwq->wq.task_list, &fork_event);
+				spin_unlock(&ctx->event_wqh.lock);
+				ret = 0;
+				break;
+			}
+
 			userfaultfd_event_complete(ctx, uwq);
 			spin_unlock(&ctx->event_wqh.lock);
 			ret = 0;
@@ -704,6 +827,23 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
 	__set_current_state(TASK_RUNNING);
 	spin_unlock(&ctx->fd_wqh.lock);
 
+	if (!ret && msg->event == UFFD_EVENT_FORK) {
+		ret = resolve_userfault_fork(ctx, fork_nctx, msg);
+
+		if (!ret) {
+			spin_lock(&ctx->event_wqh.lock);
+			if (!list_empty(&fork_event)) {
+				uwq = list_first_entry(&fork_event,
+						       typeof(*uwq),
+						       wq.task_list);
+				list_del(&uwq->wq.task_list);
+				__add_wait_queue(&ctx->event_wqh, &uwq->wq);
+				userfaultfd_event_complete(ctx, uwq);
+			}
+			spin_unlock(&ctx->event_wqh.lock);
+		}
+	}
+
 	return ret;
 }
 
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index dd66a95..bf42f20 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -52,6 +52,9 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 	return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
 }
 
+extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *);
+extern void dup_userfaultfd_complete(struct list_head *);
+
 #else /* CONFIG_USERFAULTFD */
 
 /* mm helpers */
@@ -76,6 +79,16 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 	return false;
 }
 
+static inline int dup_userfaultfd(struct vm_area_struct *vma,
+				  struct list_head *l)
+{
+	return 0;
+}
+
+static inline void dup_userfaultfd_complete(struct list_head *l)
+{
+}
+
 #endif /* CONFIG_USERFAULTFD */
 
 #endif /* _LINUX_USERFAULTFD_K_H */
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 94046b8..c8953c8 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -18,12 +18,7 @@
  * means the userland is reading).
  */
 #define UFFD_API ((__u64)0xAA)
-/*
- * After implementing the respective features it will become:
- * #define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP | \
- *			      UFFD_FEATURE_EVENT_FORK)
- */
-#define UFFD_API_FEATURES (0)
+#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -78,6 +73,10 @@ struct uffd_msg {
 		} pagefault;
 
 		struct {
+			__u32	ufd;
+		} fork;
+
+		struct {
 			/* unused reserved fields */
 			__u64	reserved1;
 			__u64	reserved2;
@@ -90,9 +89,7 @@ struct uffd_msg {
  * Start at 0x12 and not at 0 to be more strict against bugs.
  */
 #define UFFD_EVENT_PAGEFAULT	0x12
-#if 0 /* not available yet */
 #define UFFD_EVENT_FORK		0x13
-#endif
 
 /* flags for UFFD_EVENT_PAGEFAULT */
 #define UFFD_PAGEFAULT_FLAG_WRITE	(1<<0)	/* If this was a write fault */
@@ -111,10 +108,8 @@ struct uffdio_api {
 	 * are to be considered implicitly always enabled in all kernels as
 	 * long as the uffdio_api.api requested matches UFFD_API.
 	 */
-#if 0 /* not available yet */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
-#endif
 	__u64 features;
 
 	__u64 ioctls;
diff --git a/kernel/fork.c b/kernel/fork.c
index 623259f..3f01dd4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -55,6 +55,7 @@
 #include <linux/rmap.h>
 #include <linux/ksm.h>
 #include <linux/acct.h>
+#include <linux/userfaultfd_k.h>
 #include <linux/tsacct_kern.h>
 #include <linux/cn_proc.h>
 #include <linux/freezer.h>
@@ -554,6 +555,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	struct rb_node **rb_link, *rb_parent;
 	int retval;
 	unsigned long charge;
+	LIST_HEAD(uf);
 
 	uprobe_start_dup_mmap();
 	if (down_write_killable(&oldmm->mmap_sem)) {
@@ -610,12 +612,13 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		if (retval)
 			goto fail_nomem_policy;
 		tmp->vm_mm = mm;
+		retval = dup_userfaultfd(tmp, &uf);
+		if (retval)
+			goto fail_nomem_anon_vma_fork;
 		if (anon_vma_fork(tmp, mpnt))
 			goto fail_nomem_anon_vma_fork;
-		tmp->vm_flags &=
-			~(VM_LOCKED|VM_LOCKONFAULT|VM_UFFD_MISSING|VM_UFFD_WP);
+		tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
 		tmp->vm_next = tmp->vm_prev = NULL;
-		tmp->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
 		file = tmp->vm_file;
 		if (file) {
 			struct inode *inode = file_inode(file);
@@ -671,6 +674,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	up_write(&mm->mmap_sem);
 	flush_tlb_mm(oldmm);
 	up_write(&oldmm->mmap_sem);
+	dup_userfaultfd_complete(&uf);
 fail_uprobe_end:
 	uprobe_end_dup_mmap();
 	return retval;


* [PATCH 09/33] userfaultfd: non-cooperative: Add fork() event, build warning fix
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (7 preceding siblings ...)
  2016-11-02 19:33 ` [PATCH 08/33] userfaultfd: non-cooperative: Add fork() event Andrea Arcangeli
@ 2016-11-02 19:33 ` Andrea Arcangeli
  2016-11-02 19:33 ` [PATCH 10/33] userfaultfd: non-cooperative: dup_userfaultfd: use mm_count instead of mm_users Andrea Arcangeli
                   ` (24 subsequent siblings)
  33 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

It was harmless, but 32bit kernel builds would emit warnings unless the
pointer is passed through an (unsigned long) cast before being stored
in a __u64.

Warning found by the kbuild test robot.
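
For reference, a tiny example of the pattern (pointers are 32bit on
those builds, __u64 is not):

static __u64 ptr_to_u64(void *p)
{
	/*
	 * (__u64)p warns on 32bit: cast from pointer to integer of
	 * different size.  Casting through unsigned long (same width
	 * as a pointer) and letting the compiler widen to 64bit is
	 * warning-free on both 32bit and 64bit.
	 */
	return (unsigned long)p;
}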

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 1de16c9..07b1c25 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -543,7 +543,7 @@ static int dup_fctx(struct userfaultfd_fork_ctx *fctx)
 	msg_init(&ewq.msg);
 
 	ewq.msg.event = UFFD_EVENT_FORK;
-	ewq.msg.arg.reserved.reserved1 = (__u64)fctx->new;
+	ewq.msg.arg.reserved.reserved1 = (unsigned long)fctx->new;
 
 	return userfaultfd_event_wait_completion(ctx, &ewq);
 }
@@ -797,7 +797,9 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
 			*msg = uwq->msg;
 
 			if (uwq->msg.event == UFFD_EVENT_FORK) {
-				fork_nctx = (struct userfaultfd_ctx *)uwq->msg.arg.reserved.reserved1;
+				fork_nctx = (struct userfaultfd_ctx *)
+					(unsigned long)
+					uwq->msg.arg.reserved.reserved1;
 				list_move(&uwq->wq.task_list, &fork_event);
 				spin_unlock(&ctx->event_wqh.lock);
 				ret = 0;


* [PATCH 10/33] userfaultfd: non-cooperative: dup_userfaultfd: use mm_count instead of mm_users
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (8 preceding siblings ...)
  2016-11-02 19:33 ` [PATCH 09/33] userfaultfd: non-cooperative: Add fork() event, build warning fix Andrea Arcangeli
@ 2016-11-02 19:33 ` Andrea Arcangeli
  2016-11-02 19:33 ` [PATCH 11/33] userfaultfd: non-cooperative: Add mremap() event Andrea Arcangeli
                   ` (23 subsequent siblings)
  33 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

From: Mike Rapoport <rppt@linux.vnet.ibm.com>

Since commit d2005e3f41d4 ("userfaultfd: don't pin the user memory in
userfaultfd_file_create()") userfaultfd uses mm_count rather than
mm_users to pin the mm_struct. Make dup_userfaultfd consistent with
this behaviour.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 07b1c25..e0bb733 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -523,7 +523,7 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
 		ctx->features = octx->features;
 		ctx->released = false;
 		ctx->mm = vma->vm_mm;
-		atomic_inc(&ctx->mm->mm_users);
+		atomic_inc(&ctx->mm->mm_count);
 
 		userfaultfd_ctx_get(octx);
 		fctx->orig = octx;


* [PATCH 11/33] userfaultfd: non-cooperative: Add mremap() event
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (9 preceding siblings ...)
  2016-11-02 19:33 ` [PATCH 10/33] userfaultfd: non-cooperative: dup_userfaultfd: use mm_count instead of mm_users Andrea Arcangeli
@ 2016-11-02 19:33 ` Andrea Arcangeli
  2016-11-03  7:41   ` Hillf Danton
  2016-11-02 19:33 ` [PATCH 12/33] userfaultfd: non-cooperative: Add madvise() event for MADV_DONTNEED request Andrea Arcangeli
                   ` (22 subsequent siblings)
  33 siblings, 1 reply; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

From: Pavel Emelyanov <xemul@parallels.com>

The event denotes that an area [start:end] moves to a different
location. A length change isn't reported as "new" addresses: if they
appear on the uffd reader side they will not contain any data and the
reader can just zeromap them.

Waiting for the event ACK is also done outside of the mmap sem, as for
the fork event.
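
On the reader side a rough sketch of handling the event could be
(update_tracked_range() is a placeholder for whatever bookkeeping the
monitor keeps, not an existing helper):

static void handle_remap_event(const struct uffd_msg *msg)
{
	/*
	 * [from, from + len) moved to [to, to + len); if the area also
	 * grew, the extra addresses carry no data and the monitor can
	 * simply zeromap them when they fault.
	 */
	update_tracked_range(msg->arg.remap.from,	/* placeholder helper */
			     msg->arg.remap.to,
			     msg->arg.remap.len);
}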

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c                 | 37 +++++++++++++++++++++++++++++++++++++
 include/linux/userfaultfd_k.h    | 17 +++++++++++++++++
 include/uapi/linux/userfaultfd.h | 11 ++++++++++-
 mm/mremap.c                      | 17 ++++++++++++-----
 4 files changed, 76 insertions(+), 6 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index e0bb733..2fcbd6b 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -561,6 +561,43 @@ void dup_userfaultfd_complete(struct list_head *fcs)
 	}
 }
 
+void mremap_userfaultfd_prep(struct vm_area_struct *vma,
+			     struct vm_userfaultfd_ctx *vm_ctx)
+{
+	struct userfaultfd_ctx *ctx;
+
+	ctx = vma->vm_userfaultfd_ctx.ctx;
+	if (ctx && (ctx->features & UFFD_FEATURE_EVENT_REMAP)) {
+		vm_ctx->ctx = ctx;
+		userfaultfd_ctx_get(ctx);
+	}
+}
+
+void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx vm_ctx,
+				 unsigned long from, unsigned long to,
+				 unsigned long len)
+{
+	struct userfaultfd_ctx *ctx = vm_ctx.ctx;
+	struct userfaultfd_wait_queue ewq;
+
+	if (!ctx)
+		return;
+
+	if (to & ~PAGE_MASK) {
+		userfaultfd_ctx_put(ctx);
+		return;
+	}
+
+	msg_init(&ewq.msg);
+
+	ewq.msg.event = UFFD_EVENT_REMAP;
+	ewq.msg.arg.remap.from = from;
+	ewq.msg.arg.remap.to = to;
+	ewq.msg.arg.remap.len = len;
+
+	userfaultfd_event_wait_completion(ctx, &ewq);
+}
+
 static int userfaultfd_release(struct inode *inode, struct file *file)
 {
 	struct userfaultfd_ctx *ctx = file->private_data;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index bf42f20..bfab4ef 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -55,6 +55,12 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *);
 extern void dup_userfaultfd_complete(struct list_head *);
 
+extern void mremap_userfaultfd_prep(struct vm_area_struct *,
+				    struct vm_userfaultfd_ctx *);
+extern void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx,
+					unsigned long from, unsigned long to,
+					unsigned long len);
+
 #else /* CONFIG_USERFAULTFD */
 
 /* mm helpers */
@@ -89,6 +95,17 @@ static inline void dup_userfaultfd_complete(struct list_head *l)
 {
 }
 
+static inline void mremap_userfaultfd_prep(struct vm_area_struct *vma,
+					   struct vm_userfaultfd_ctx *ctx)
+{
+}
+
+static inline void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx ctx,
+					       unsigned long from,
+					       unsigned long to,
+					       unsigned long len)
+{
+}
 #endif /* CONFIG_USERFAULTFD */
 
 #endif /* _LINUX_USERFAULTFD_K_H */
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index c8953c8..79a85e5 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -18,7 +18,8 @@
  * means the userland is reading).
  */
 #define UFFD_API ((__u64)0xAA)
-#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK)
+#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK |	    \
+			   UFFD_FEATURE_EVENT_REMAP)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -77,6 +78,12 @@ struct uffd_msg {
 		} fork;
 
 		struct {
+			__u64	from;
+			__u64	to;
+			__u64	len;
+		} remap;
+
+		struct {
 			/* unused reserved fields */
 			__u64	reserved1;
 			__u64	reserved2;
@@ -90,6 +97,7 @@ struct uffd_msg {
  */
 #define UFFD_EVENT_PAGEFAULT	0x12
 #define UFFD_EVENT_FORK		0x13
+#define UFFD_EVENT_REMAP	0x14
 
 /* flags for UFFD_EVENT_PAGEFAULT */
 #define UFFD_PAGEFAULT_FLAG_WRITE	(1<<0)	/* If this was a write fault */
@@ -110,6 +118,7 @@ struct uffdio_api {
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
+#define UFFD_FEATURE_EVENT_REMAP		(1<<2)
 	__u64 features;
 
 	__u64 ioctls;
diff --git a/mm/mremap.c b/mm/mremap.c
index da22ad2..450e811 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -22,6 +22,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/uaccess.h>
 #include <linux/mm-arch-hooks.h>
+#include <linux/userfaultfd_k.h>
 
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
@@ -234,7 +235,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 
 static unsigned long move_vma(struct vm_area_struct *vma,
 		unsigned long old_addr, unsigned long old_len,
-		unsigned long new_len, unsigned long new_addr, bool *locked)
+		unsigned long new_len, unsigned long new_addr,
+		bool *locked, struct vm_userfaultfd_ctx *uf)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct vm_area_struct *new_vma;
@@ -293,6 +295,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		old_addr = new_addr;
 		new_addr = err;
 	} else {
+		mremap_userfaultfd_prep(new_vma, uf);
 		arch_remap(mm, old_addr, old_addr + old_len,
 			   new_addr, new_addr + new_len);
 	}
@@ -397,7 +400,8 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
 }
 
 static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
-		unsigned long new_addr, unsigned long new_len, bool *locked)
+		unsigned long new_addr, unsigned long new_len, bool *locked,
+		struct vm_userfaultfd_ctx *uf)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
@@ -442,7 +446,7 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
 	if (offset_in_page(ret))
 		goto out1;
 
-	ret = move_vma(vma, addr, old_len, new_len, new_addr, locked);
+	ret = move_vma(vma, addr, old_len, new_len, new_addr, locked, uf);
 	if (!(offset_in_page(ret)))
 		goto out;
 out1:
@@ -481,6 +485,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 	unsigned long ret = -EINVAL;
 	unsigned long charged = 0;
 	bool locked = false;
+	struct vm_userfaultfd_ctx uf = NULL_VM_UFFD_CTX;
 
 	if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
 		return ret;
@@ -507,7 +512,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 
 	if (flags & MREMAP_FIXED) {
 		ret = mremap_to(addr, old_len, new_addr, new_len,
-				&locked);
+				&locked, &uf);
 		goto out;
 	}
 
@@ -576,7 +581,8 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 			goto out;
 		}
 
-		ret = move_vma(vma, addr, old_len, new_len, new_addr, &locked);
+		ret = move_vma(vma, addr, old_len, new_len, new_addr,
+			       &locked, &uf);
 	}
 out:
 	if (offset_in_page(ret)) {
@@ -586,5 +592,6 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 	up_write(&current->mm->mmap_sem);
 	if (locked && new_len > old_len)
 		mm_populate(new_addr + old_len, new_len - old_len);
+	mremap_userfaultfd_complete(uf, addr, new_addr, old_len);
 	return ret;
 }


* [PATCH 12/33] userfaultfd: non-cooperative: Add madvise() event for MADV_DONTNEED request
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (10 preceding siblings ...)
  2016-11-02 19:33 ` [PATCH 11/33] userfaultfd: non-cooperative: Add mremap() event Andrea Arcangeli
@ 2016-11-02 19:33 ` Andrea Arcangeli
  2016-11-03  8:01   ` Hillf Danton
  2016-11-02 19:33 ` [PATCH 13/33] userfaultfd: hugetlbfs: add copy_huge_page_from_user for hugetlb userfaultfd support Andrea Arcangeli
                   ` (21 subsequent siblings)
  33 siblings, 1 reply; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

From: Pavel Emelyanov <xemul@parallels.com>

If the page is punched out of the address space, the uffd reader
should know about it and zeromap the respective area in case of a
later #PF event.
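
A rough sketch of the reader side (mark_range_zeromap() is a
placeholder for the monitor's own bookkeeping, not an existing helper):

static void handle_madv_dontneed_event(const struct uffd_msg *msg)
{
	/*
	 * [start, end) was zapped in the target mm: remember it so a
	 * later fault in this range is resolved with UFFDIO_ZEROPAGE
	 * instead of copying stale data.
	 */
	mark_range_zeromap(msg->arg.madv_dn.start,	/* placeholder helper */
			   msg->arg.madv_dn.end);
}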

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c                 | 26 ++++++++++++++++++++++++++
 include/linux/userfaultfd_k.h    | 12 ++++++++++++
 include/uapi/linux/userfaultfd.h | 10 +++++++++-
 mm/madvise.c                     |  2 ++
 4 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 2fcbd6b..9357dcf 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -598,6 +598,32 @@ void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx vm_ctx,
 	userfaultfd_event_wait_completion(ctx, &ewq);
 }
 
+void madvise_userfault_dontneed(struct vm_area_struct *vma,
+				struct vm_area_struct **prev,
+				unsigned long start, unsigned long end)
+{
+	struct userfaultfd_ctx *ctx;
+	struct userfaultfd_wait_queue ewq;
+
+	ctx = vma->vm_userfaultfd_ctx.ctx;
+	if (!ctx || !(ctx->features & UFFD_FEATURE_EVENT_MADVDONTNEED))
+		return;
+
+	userfaultfd_ctx_get(ctx);
+	*prev = NULL; /* We wait for ACK w/o the mmap semaphore */
+	up_read(&vma->vm_mm->mmap_sem);
+
+	msg_init(&ewq.msg);
+
+	ewq.msg.event = UFFD_EVENT_MADVDONTNEED;
+	ewq.msg.arg.madv_dn.start = start;
+	ewq.msg.arg.madv_dn.end = end;
+
+	userfaultfd_event_wait_completion(ctx, &ewq);
+
+	down_read(&vma->vm_mm->mmap_sem);
+}
+
 static int userfaultfd_release(struct inode *inode, struct file *file)
 {
 	struct userfaultfd_ctx *ctx = file->private_data;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index bfab4ef..efb06d7 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -61,6 +61,11 @@ extern void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx,
 					unsigned long from, unsigned long to,
 					unsigned long len);
 
+extern void madvise_userfault_dontneed(struct vm_area_struct *vma,
+				       struct vm_area_struct **prev,
+				       unsigned long start,
+				       unsigned long end);
+
 #else /* CONFIG_USERFAULTFD */
 
 /* mm helpers */
@@ -106,6 +111,13 @@ static inline void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx ctx,
 					       unsigned long len)
 {
 }
+
+static inline void madvise_userfault_dontneed(struct vm_area_struct *vma,
+					      struct vm_area_struct **prev,
+					      unsigned long start,
+					      unsigned long end)
+{
+}
 #endif /* CONFIG_USERFAULTFD */
 
 #endif /* _LINUX_USERFAULTFD_K_H */
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 79a85e5..2bbf323 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -19,7 +19,8 @@
  */
 #define UFFD_API ((__u64)0xAA)
 #define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK |	    \
-			   UFFD_FEATURE_EVENT_REMAP)
+			   UFFD_FEATURE_EVENT_REMAP |	    \
+			   UFFD_FEATURE_EVENT_MADVDONTNEED)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -84,6 +85,11 @@ struct uffd_msg {
 		} remap;
 
 		struct {
+			__u64	start;
+			__u64	end;
+		} madv_dn;
+
+		struct {
 			/* unused reserved fields */
 			__u64	reserved1;
 			__u64	reserved2;
@@ -98,6 +104,7 @@ struct uffd_msg {
 #define UFFD_EVENT_PAGEFAULT	0x12
 #define UFFD_EVENT_FORK		0x13
 #define UFFD_EVENT_REMAP	0x14
+#define UFFD_EVENT_MADVDONTNEED	0x15
 
 /* flags for UFFD_EVENT_PAGEFAULT */
 #define UFFD_PAGEFAULT_FLAG_WRITE	(1<<0)	/* If this was a write fault */
@@ -119,6 +126,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
 #define UFFD_FEATURE_EVENT_REMAP		(1<<2)
+#define UFFD_FEATURE_EVENT_MADVDONTNEED		(1<<3)
 	__u64 features;
 
 	__u64 ioctls;
diff --git a/mm/madvise.c b/mm/madvise.c
index 93fb63e..7168bc6 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -10,6 +10,7 @@
 #include <linux/syscalls.h>
 #include <linux/mempolicy.h>
 #include <linux/page-isolation.h>
+#include <linux/userfaultfd_k.h>
 #include <linux/hugetlb.h>
 #include <linux/falloc.h>
 #include <linux/sched.h>
@@ -476,6 +477,7 @@ static long madvise_dontneed(struct vm_area_struct *vma,
 		return -EINVAL;
 
 	zap_page_range(vma, start, end - start, NULL);
+	madvise_userfault_dontneed(vma, prev, start, end);
 	return 0;
 }
 


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 13/33] userfaultfd: hugetlbfs: add copy_huge_page_from_user for hugetlb userfaultfd support
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (11 preceding siblings ...)
  2016-11-02 19:33 ` [PATCH 12/33] userfaultfd: non-cooperative: Add madvise() event for MADV_DONTNEED request Andrea Arcangeli
@ 2016-11-02 19:33 ` Andrea Arcangeli
  2016-11-02 19:33 ` [PATCH 14/33] userfaultfd: hugetlbfs: add hugetlb_mcopy_atomic_pte for " Andrea Arcangeli
                   ` (20 subsequent siblings)
  33 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

From: Mike Kravetz <mike.kravetz@oracle.com>

userfaultfd UFFDIO_COPY allows user level code to copy data to a page
at fault time.  The data is copied from user space to a newly allocated
huge page.  The new routine copy_huge_page_from_user performs this copy.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mm.h |  3 +++
 mm/memory.c        | 25 +++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a92c8d7..ec18964 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2404,6 +2404,9 @@ extern void clear_huge_page(struct page *page,
 extern void copy_user_huge_page(struct page *dst, struct page *src,
 				unsigned long addr, struct vm_area_struct *vma,
 				unsigned int pages_per_huge_page);
+extern long copy_huge_page_from_user(struct page *dst_page,
+				const void __user *usr_src,
+				unsigned int pages_per_huge_page);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
 
 extern struct page_ext_operations debug_guardpage_ops;
diff --git a/mm/memory.c b/mm/memory.c
index e18c57b..00dc2fe 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4093,6 +4093,31 @@ void copy_user_huge_page(struct page *dst, struct page *src,
 		copy_user_highpage(dst + i, src + i, addr + i*PAGE_SIZE, vma);
 	}
 }
+
+long copy_huge_page_from_user(struct page *dst_page,
+				const void __user *usr_src,
+				unsigned int pages_per_huge_page)
+{
+	void *src = (void *)usr_src;
+	void *page_kaddr;
+	unsigned long i, rc = 0;
+	unsigned long ret_val = pages_per_huge_page * PAGE_SIZE;
+
+	for (i = 0; i < pages_per_huge_page; i++) {
+		page_kaddr = kmap_atomic(dst_page + i);
+		rc = copy_from_user(page_kaddr,
+				(const void __user *)(src + i * PAGE_SIZE),
+				PAGE_SIZE);
+		kunmap_atomic(page_kaddr);
+
+		ret_val -= (PAGE_SIZE - rc);
+		if (rc)
+			break;
+
+		cond_resched();
+	}
+	return ret_val;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
 
 #if USE_SPLIT_PTE_PTLOCKS && ALLOC_SPLIT_PTLOCKS


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 14/33] userfaultfd: hugetlbfs: add hugetlb_mcopy_atomic_pte for userfaultfd support
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (12 preceding siblings ...)
  2016-11-02 19:33 ` [PATCH 13/33] userfaultfd: hugetlbfs: add copy_huge_page_from_user for hugetlb userfaultfd support Andrea Arcangeli
@ 2016-11-02 19:33 ` Andrea Arcangeli
  2016-11-02 19:33 ` [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY Andrea Arcangeli
                   ` (19 subsequent siblings)
  33 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

From: Mike Kravetz <mike.kravetz@oracle.com>

hugetlb_mcopy_atomic_pte is the low level routine that implements
the userfaultfd UFFDIO_COPY command.  It is based on the existing
mcopy_atomic_pte routine with modifications for huge pages.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/hugetlb.h |  8 ++++-
 mm/hugetlb.c            | 81 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 88 insertions(+), 1 deletion(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 48c76d6..fc27b66 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -81,6 +81,11 @@ void hugetlb_show_meminfo(void);
 unsigned long hugetlb_total_pages(void);
 int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, unsigned int flags);
+int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
+				struct vm_area_struct *dst_vma,
+				unsigned long dst_addr,
+				unsigned long src_addr,
+				struct page **pagep);
 int hugetlb_reserve_pages(struct inode *inode, long from, long to,
 						struct vm_area_struct *vma,
 						vm_flags_t vm_flags);
@@ -149,6 +154,8 @@ static inline void hugetlb_show_meminfo(void)
 #define is_hugepage_only_range(mm, addr, len)	0
 #define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; })
 #define hugetlb_fault(mm, vma, addr, flags)	({ BUG(); 0; })
+#define hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma, dst_addr, \
+				src_addr, pagep)	({ BUG(); 0; })
 #define huge_pte_offset(mm, address)	0
 static inline int dequeue_hwpoisoned_huge_page(struct page *page)
 {
@@ -272,7 +279,6 @@ static inline bool is_file_hugepages(struct file *file)
 	return is_file_shm_hugepages(file);
 }
 
-
 #else /* !CONFIG_HUGETLBFS */
 
 #define is_file_hugepages(file)			false
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ec49d9e..baf7fd4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3857,6 +3857,87 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	return ret;
 }
 
+/*
+ * Used by userfaultfd UFFDIO_COPY.  Based on mcopy_atomic_pte with
+ * modifications for huge pages.
+ */
+int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
+			    pte_t *dst_pte,
+			    struct vm_area_struct *dst_vma,
+			    unsigned long dst_addr,
+			    unsigned long src_addr,
+			    struct page **pagep)
+{
+	struct hstate *h = hstate_vma(dst_vma);
+	pte_t _dst_pte;
+	spinlock_t *ptl;
+	int ret;
+	struct page *page;
+
+	if (!*pagep) {
+		ret = -ENOMEM;
+		page = alloc_huge_page(dst_vma, dst_addr, 0);
+		if (IS_ERR(page))
+			goto out;
+
+		ret = copy_huge_page_from_user(page,
+						(const void __user *) src_addr,
+						pages_per_huge_page(h));
+
+		/* fallback to copy_from_user outside mmap_sem */
+		if (unlikely(ret)) {
+			ret = -EFAULT;
+			*pagep = page;
+			/* don't free the page */
+			goto out;
+		}
+	} else {
+		page = *pagep;
+		*pagep = NULL;
+	}
+
+	/*
+	 * The memory barrier inside __SetPageUptodate makes sure that
+	 * preceding stores to the page contents become visible before
+	 * the set_pte_at() write.
+	 */
+	__SetPageUptodate(page);
+	set_page_huge_active(page);
+
+	ptl = huge_pte_lockptr(h, dst_mm, dst_pte);
+	spin_lock(ptl);
+
+	ret = -EEXIST;
+	if (!huge_pte_none(huge_ptep_get(dst_pte)))
+		goto out_release_unlock;
+
+	ClearPagePrivate(page);
+	hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
+
+	_dst_pte = make_huge_pte(dst_vma, page, dst_vma->vm_flags & VM_WRITE);
+	if (dst_vma->vm_flags & VM_WRITE)
+		_dst_pte = huge_pte_mkdirty(_dst_pte);
+	_dst_pte = pte_mkyoung(_dst_pte);
+
+	set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+
+	(void)huge_ptep_set_access_flags(dst_vma, dst_addr, dst_pte, _dst_pte,
+					dst_vma->vm_flags & VM_WRITE);
+	hugetlb_count_add(pages_per_huge_page(h), dst_mm);
+
+	/* No need to invalidate - it was non-present before */
+	update_mmu_cache(dst_vma, dst_addr, dst_pte);
+
+	spin_unlock(ptl);
+	ret = 0;
+out:
+	return ret;
+out_release_unlock:
+	spin_unlock(ptl);
+	put_page(page);
+	goto out;
+}
+
 long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			 struct page **pages, struct vm_area_struct **vmas,
 			 unsigned long *position, unsigned long *nr_pages,


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (13 preceding siblings ...)
  2016-11-02 19:33 ` [PATCH 14/33] userfaultfd: hugetlbfs: add hugetlb_mcopy_atomic_pte for " Andrea Arcangeli
@ 2016-11-02 19:33 ` Andrea Arcangeli
  2016-11-03 10:15   ` Hillf Danton
  2016-11-02 19:33 ` [PATCH 16/33] userfaultfd: hugetlbfs: add userfaultfd hugetlb hook Andrea Arcangeli
                   ` (18 subsequent siblings)
  33 siblings, 1 reply; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

From: Mike Kravetz <mike.kravetz@oracle.com>

__mcopy_atomic_hugetlb performs the UFFDIO_COPY operation for huge
pages.  It is based on the existing __mcopy_atomic routine for normal
pages.  Unlike the normal page case, there is no UFFDIO_ZEROPAGE
support for huge pages.
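
For context (not part of the patch), filling one huge page of a registered
hugetlbfs mapping from userland could look roughly like the sketch below;
fill_huge_page() is a hypothetical helper, and the caller is assumed to pass
a destination already rounded down to the huge page boundary:

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Resolve a missing fault by copying one huge page into a registered
 * hugetlbfs mapping.  'dst' must be huge page aligned and 'hpage_size'
 * must be the huge page size; UFFDIO_ZEROPAGE is rejected for such
 * ranges by this patch, so only UFFDIO_COPY is attempted.
 */
int fill_huge_page(int uffd, unsigned long dst, void *src,
		   unsigned long hpage_size)
{
	struct uffdio_copy copy;

	memset(&copy, 0, sizeof(copy));
	copy.dst = dst;			/* must be hpage_size aligned */
	copy.src = (unsigned long)src;
	copy.len = hpage_size;		/* must be a multiple of hpage_size */
	copy.mode = 0;

	if (ioctl(uffd, UFFDIO_COPY, &copy) == -1) {
		perror("UFFDIO_COPY");
		return -1;
	}
	return 0;
}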

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/userfaultfd.c | 180 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 180 insertions(+)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 9c2ed70..ba9adff 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -14,6 +14,8 @@
 #include <linux/swapops.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/mmu_notifier.h>
+#include <linux/hugetlb.h>
+#include <linux/pagemap.h>
 #include <asm/tlbflush.h>
 #include "internal.h"
 
@@ -139,6 +141,177 @@ static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
 	return pmd;
 }
 
+
+#ifdef CONFIG_HUGETLB_PAGE
+/*
+ * __mcopy_atomic processing for HUGETLB vmas.  Note that this routine is
+ * called with mmap_sem held, it will release mmap_sem before returning.
+ */
+static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
+					      struct vm_area_struct *dst_vma,
+					      unsigned long dst_start,
+					      unsigned long src_start,
+					      unsigned long len,
+					      bool zeropage)
+{
+	ssize_t err;
+	pte_t *dst_pte;
+	unsigned long src_addr, dst_addr;
+	long copied;
+	struct page *page;
+	struct hstate *h;
+	unsigned long vma_hpagesize;
+	pgoff_t idx;
+	u32 hash;
+	struct address_space *mapping;
+
+	/*
+	 * There is no default zero huge page for all huge page sizes as
+	 * supported by hugetlb.  A PMD_SIZE huge pages may exist as used
+	 * by THP.  Since we can not reliably insert a zero page, this
+	 * feature is not supported.
+	 */
+	if (zeropage)
+		return -EINVAL;
+
+	src_addr = src_start;
+	dst_addr = dst_start;
+	copied = 0;
+	page = NULL;
+	vma_hpagesize = vma_kernel_pagesize(dst_vma);
+
+retry:
+	/*
+	 * On routine entry dst_vma is set.  If we had to drop mmap_sem and
+	 * retry, dst_vma will be set to NULL and we must lookup again.
+	 */
+	err = -EINVAL;
+	if (!dst_vma) {
+		dst_vma = find_vma(dst_mm, dst_start);
+		vma_hpagesize = vma_kernel_pagesize(dst_vma);
+
+		/*
+		 * Make sure the vma is not shared, that the dst range is
+		 * both valid and fully within a single existing vma.
+		 */
+		if (dst_vma->vm_flags & VM_SHARED)
+			goto out_unlock;
+		if (dst_start < dst_vma->vm_start ||
+		    dst_start + len > dst_vma->vm_end)
+			goto out_unlock;
+	}
+
+	/*
+	 * Validate alignment based on huge page size
+	 */
+	if (dst_start & (vma_hpagesize - 1) || len & (vma_hpagesize - 1))
+		goto out_unlock;
+
+	/*
+	 * Only allow __mcopy_atomic_hugetlb on userfaultfd registered ranges.
+	 */
+	if (!dst_vma->vm_userfaultfd_ctx.ctx)
+		goto out_unlock;
+
+	/*
+	 * Ensure the dst_vma has a anon_vma.
+	 */
+	err = -ENOMEM;
+	if (unlikely(anon_vma_prepare(dst_vma)))
+		goto out_unlock;
+
+	h = hstate_vma(dst_vma);
+
+	while (src_addr < src_start + len) {
+		pte_t dst_pteval;
+
+		BUG_ON(dst_addr >= dst_start + len);
+		dst_addr &= huge_page_mask(h);
+
+		/*
+		 * Serialize via hugetlb_fault_mutex
+		 */
+		idx = linear_page_index(dst_vma, dst_addr);
+		mapping = dst_vma->vm_file->f_mapping;
+		hash = hugetlb_fault_mutex_hash(h, dst_mm, dst_vma, mapping,
+								idx, dst_addr);
+		mutex_lock(&hugetlb_fault_mutex_table[hash]);
+
+		err = -ENOMEM;
+		dst_pte = huge_pte_alloc(dst_mm, dst_addr, huge_page_size(h));
+		if (!dst_pte) {
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			goto out_unlock;
+		}
+
+		err = -EEXIST;
+		dst_pteval = huge_ptep_get(dst_pte);
+		if (!huge_pte_none(dst_pteval)) {
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			goto out_unlock;
+		}
+
+		err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
+						dst_addr, src_addr, &page);
+
+		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+
+		cond_resched();
+
+		if (unlikely(err == -EFAULT)) {
+			up_read(&dst_mm->mmap_sem);
+			BUG_ON(!page);
+
+			err = copy_huge_page_from_user(page,
+						(const void __user *)src_addr,
+						pages_per_huge_page(h));
+			if (unlikely(err)) {
+				err = -EFAULT;
+				goto out;
+			}
+			down_read(&dst_mm->mmap_sem);
+
+			dst_vma = NULL;
+			goto retry;
+		} else
+			BUG_ON(page);
+
+		if (!err) {
+			dst_addr += vma_hpagesize;
+			src_addr += vma_hpagesize;
+			copied += vma_hpagesize;
+
+			if (fatal_signal_pending(current))
+				err = -EINTR;
+		}
+		if (err)
+			break;
+	}
+
+out_unlock:
+	up_read(&dst_mm->mmap_sem);
+out:
+	if (page)
+		put_page(page);
+	BUG_ON(copied < 0);
+	BUG_ON(err > 0);
+	BUG_ON(!copied && !err);
+	return copied ? copied : err;
+}
+#else /* !CONFIG_HUGETLB_PAGE */
+static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
+					      struct vm_area_struct *dst_vma,
+					      unsigned long dst_start,
+					      unsigned long src_start,
+					      unsigned long len,
+					      bool zeropage)
+{
+	up_read(&dst_mm->mmap_sem);	/* HUGETLB not configured */
+	BUG();
+	return -EINVAL;
+}
+#endif /* CONFIG_HUGETLB_PAGE */
+
 static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 					      unsigned long dst_start,
 					      unsigned long src_start,
@@ -182,6 +355,13 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 		goto out_unlock;
 
 	/*
+	 * If this is a HUGETLB vma, pass off to appropriate routine
+	 */
+	if (dst_vma->vm_flags & VM_HUGETLB)
+		return  __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
+						src_start, len, false);
+
+	/*
 	 * Be strict and only allow __mcopy_atomic on userfaultfd
 	 * registered ranges to prevent userland errors going
 	 * unnoticed. As far as the VM consistency is concerned, it


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 16/33] userfaultfd: hugetlbfs: add userfaultfd hugetlb hook
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (14 preceding siblings ...)
  2016-11-02 19:33 ` [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY Andrea Arcangeli
@ 2016-11-02 19:33 ` Andrea Arcangeli
  2016-11-04  7:02   ` Hillf Danton
  2016-11-02 19:33 ` [PATCH 17/33] userfaultfd: hugetlbfs: allow registration of ranges containing huge pages Andrea Arcangeli
                   ` (17 subsequent siblings)
  33 siblings, 1 reply; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

From: Mike Kravetz <mike.kravetz@oracle.com>

When processing a hugetlb fault for a not-present page, check the vma to
determine whether faults are to be handled via userfaultfd.  If so, drop the
hugetlb_fault_mutex and call handle_userfault().
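
For illustration (not part of the patch), the monitor loop on the other side
of this hook might look roughly like the sketch below; fill_huge_page() is the
hypothetical helper from the earlier sketch, and rounding the reported fault
address down to the huge page boundary is an assumed userland convention:

#include <poll.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper from the earlier sketch, defined elsewhere. */
int fill_huge_page(int uffd, unsigned long dst, void *src,
		   unsigned long hpage_size);

/*
 * Wait for missing-page events on a registered hugetlbfs range and
 * resolve each one with UFFDIO_COPY of a whole huge page.
 */
static void hugetlb_fault_loop(int uffd, void *staging,
			       unsigned long hpage_size)
{
	struct pollfd pollfd = { .fd = uffd, .events = POLLIN };
	struct uffd_msg msg;
	unsigned long base;

	for (;;) {
		if (poll(&pollfd, 1, -1) <= 0)
			continue;
		if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
			continue;
		if (msg.event != UFFD_EVENT_PAGEFAULT)
			continue;

		/* the kernel reports the faulting address, not the base */
		base = msg.arg.pagefault.address & ~(hpage_size - 1);
		fill_huge_page(uffd, base, staging, hpage_size);
	}
}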

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/hugetlb.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index baf7fd4..7247f8c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -32,6 +32,7 @@
 #include <linux/hugetlb.h>
 #include <linux/hugetlb_cgroup.h>
 #include <linux/node.h>
+#include <linux/userfaultfd_k.h>
 #include "internal.h"
 
 int hugepages_treat_as_movable;
@@ -3589,6 +3590,38 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		size = i_size_read(mapping->host) >> huge_page_shift(h);
 		if (idx >= size)
 			goto out;
+
+		/*
+		 * Check for page in userfault range
+		 */
+		if (userfaultfd_missing(vma)) {
+			u32 hash;
+			struct fault_env fe = {
+				.vma = vma,
+				.address = address,
+				.flags = flags,
+				/*
+				 * Hard to debug if it ends up being
+				 * used by a callee that assumes
+				 * something about the other
+				 * uninitialized fields... same as in
+				 * memory.c
+				 */
+			};
+
+			/*
+			 * hugetlb_fault_mutex must be dropped before
+			 * handling userfault.  Reacquire after handling
+			 * fault to make calling code simpler.
+			 */
+			hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping,
+							idx, address);
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			ret = handle_userfault(&fe, VM_UFFD_MISSING);
+			mutex_lock(&hugetlb_fault_mutex_table[hash]);
+			goto out;
+		}
+
 		page = alloc_huge_page(vma, address, 0);
 		if (IS_ERR(page)) {
 			ret = PTR_ERR(page);


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 17/33] userfaultfd: hugetlbfs: allow registration of ranges containing huge pages
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (15 preceding siblings ...)
  2016-11-02 19:33 ` [PATCH 16/33] userfaultfd: hugetlbfs: add userfaultfd hugetlb hook Andrea Arcangeli
@ 2016-11-02 19:33 ` Andrea Arcangeli
  2016-11-02 19:33 ` [PATCH 18/33] userfaultfd: hugetlbfs: add userfaultfd_hugetlb test Andrea Arcangeli
                   ` (16 subsequent siblings)
  33 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

From: Mike Kravetz <mike.kravetz@oracle.com>

Expand the userfaultfd_register/unregister routines to allow VM_HUGETLB
vmas.  Huge page alignment checking is performed after a VM_HUGETLB
vma is encountered.

Also, since there is no UFFDIO_ZEROPAGE support for huge pages, do not
return it as a valid ioctl method for huge page ranges.
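
As a rough userland illustration (not part of the patch), registering a huge
page aligned range and checking the granted ioctls could look like the sketch
below; register_hugetlb_range() is a hypothetical helper and error handling is
simplified:

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Register a huge page aligned range of a hugetlbfs mapping for missing
 * faults and check which range ioctls the kernel granted.  With this
 * patch a hugetlb range reports UFFD_API_RANGE_IOCTLS_HPAGE, i.e. the
 * _UFFDIO_ZEROPAGE bit is not set.
 */
int register_hugetlb_range(int uffd, void *addr, unsigned long len)
{
	struct uffdio_register reg;

	reg.range.start = (unsigned long)addr;
	reg.range.len = len;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;

	if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1) {
		perror("UFFDIO_REGISTER");
		return -1;
	}
	if (reg.ioctls & ((__u64)1 << _UFFDIO_ZEROPAGE))
		fprintf(stderr, "unexpected: zeropage offered on hugetlb\n");
	return 0;
}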

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c                 | 55 ++++++++++++++++++++++++++++++++++++----
 include/uapi/linux/userfaultfd.h |  3 +++
 2 files changed, 53 insertions(+), 5 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 9357dcf..a73e999 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -27,6 +27,7 @@
 #include <linux/mempolicy.h>
 #include <linux/ioctl.h>
 #include <linux/security.h>
+#include <linux/hugetlb.h>
 
 static struct kmem_cache *userfaultfd_ctx_cachep __read_mostly;
 
@@ -1021,6 +1022,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 	struct uffdio_register __user *user_uffdio_register;
 	unsigned long vm_flags, new_flags;
 	bool found;
+	bool huge_pages;
 	unsigned long start, end, vma_end;
 
 	user_uffdio_register = (struct uffdio_register __user *) arg;
@@ -1072,6 +1074,17 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 		goto out_unlock;
 
 	/*
+	 * If the first vma contains huge pages, make sure start address
+	 * is aligned to huge page size.
+	 */
+	if (is_vm_hugetlb_page(vma)) {
+		unsigned long vma_hpagesize = vma_kernel_pagesize(vma);
+
+		if (start & (vma_hpagesize - 1))
+			goto out_unlock;
+	}
+
+	/*
 	 * Search for not compatible vmas.
 	 *
 	 * FIXME: this shall be relaxed later so that it doesn't fail
@@ -1079,6 +1092,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 	 * on anonymous vmas).
 	 */
 	found = false;
+	huge_pages = false;
 	for (cur = vma; cur && cur->vm_start < end; cur = cur->vm_next) {
 		cond_resched();
 
@@ -1087,8 +1101,21 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 
 		/* check not compatible vmas */
 		ret = -EINVAL;
-		if (!vma_is_anonymous(cur))
+		if (!vma_is_anonymous(cur) && !is_vm_hugetlb_page(cur))
 			goto out_unlock;
+		/*
+		 * If this vma contains ending address, and huge pages
+		 * check alignment.
+		 */
+		if (is_vm_hugetlb_page(cur) && end <= cur->vm_end &&
+		    end > cur->vm_start) {
+			unsigned long vma_hpagesize = vma_kernel_pagesize(cur);
+
+			ret = -EINVAL;
+
+			if (end & (vma_hpagesize - 1))
+				goto out_unlock;
+		}
 
 		/*
 		 * Check that this vma isn't already owned by a
@@ -1101,6 +1128,12 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 		    cur->vm_userfaultfd_ctx.ctx != ctx)
 			goto out_unlock;
 
+		/*
+		 * Note vmas containing huge pages
+		 */
+		if (is_vm_hugetlb_page(cur))
+			huge_pages = true;
+
 		found = true;
 	}
 	BUG_ON(!found);
@@ -1112,7 +1145,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 	do {
 		cond_resched();
 
-		BUG_ON(!vma_is_anonymous(vma));
+		BUG_ON(!vma_is_anonymous(vma) && !is_vm_hugetlb_page(vma));
 		BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
 		       vma->vm_userfaultfd_ctx.ctx != ctx);
 
@@ -1170,7 +1203,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 		 * userland which ioctls methods are guaranteed to
 		 * succeed on this range.
 		 */
-		if (put_user(UFFD_API_RANGE_IOCTLS,
+		if (put_user(huge_pages ? UFFD_API_RANGE_IOCTLS_HPAGE :
+			     UFFD_API_RANGE_IOCTLS,
 			     &user_uffdio_register->ioctls))
 			ret = -EFAULT;
 	}
@@ -1217,6 +1251,17 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 		goto out_unlock;
 
 	/*
+	 * If the first vma contains huge pages, make sure start address
+	 * is aligned to huge page size.
+	 */
+	if (is_vm_hugetlb_page(vma)) {
+		unsigned long vma_hpagesize = vma_kernel_pagesize(vma);
+
+		if (start & (vma_hpagesize - 1))
+			goto out_unlock;
+	}
+
+	/*
 	 * Search for not compatible vmas.
 	 *
 	 * FIXME: this shall be relaxed later so that it doesn't fail
@@ -1238,7 +1283,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 		 * provides for more strict behavior to notice
 		 * unregistration errors.
 		 */
-		if (!vma_is_anonymous(cur))
+		if (!vma_is_anonymous(cur) && !is_vm_hugetlb_page(cur))
 			goto out_unlock;
 
 		found = true;
@@ -1252,7 +1297,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 	do {
 		cond_resched();
 
-		BUG_ON(!vma_is_anonymous(vma));
+		BUG_ON(!vma_is_anonymous(vma) && !is_vm_hugetlb_page(vma));
 
 		/*
 		 * Nothing to do: this vma is already registered into this
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 2bbf323..a3828a9 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -29,6 +29,9 @@
 	((__u64)1 << _UFFDIO_WAKE |		\
 	 (__u64)1 << _UFFDIO_COPY |		\
 	 (__u64)1 << _UFFDIO_ZEROPAGE)
+#define UFFD_API_RANGE_IOCTLS_HPAGE		\
+	((__u64)1 << _UFFDIO_WAKE |		\
+	 (__u64)1 << _UFFDIO_COPY)
 
 /*
  * Valid ioctl command number range with this API is from 0x00 to


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 18/33] userfaultfd: hugetlbfs: add userfaultfd_hugetlb test
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (16 preceding siblings ...)
  2016-11-02 19:33 ` [PATCH 17/33] userfaultfd: hugetlbfs: allow registration of ranges containing huge pages Andrea Arcangeli
@ 2016-11-02 19:33 ` Andrea Arcangeli
  2016-11-02 19:33 ` [PATCH 19/33] userfaultfd: hugetlbfs: userfaultfd_huge_must_wait for hugepmd ranges Andrea Arcangeli
                   ` (15 subsequent siblings)
  33 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

From: Mike Kravetz <mike.kravetz@oracle.com>

Test userfaultfd hugetlb functionality by using the existing testing
method (in userfaultfd.c).  Instead of anonymous memory, a hugetlbfs
file is mmap'ed private.  In this way fallocate hole punch can be used
to release pages, because madvise(MADV_DONTNEED) is not supported for
huge pages.

Use the same file, but create wrappers for allocating ranges and
releasing pages.  Compile userfaultfd.c with HUGETLB_TEST defined to
produce an executable to test userfaultfd hugetlb functionality.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 tools/testing/selftests/vm/Makefile      |   4 +
 tools/testing/selftests/vm/run_vmtests   |  13 +++
 tools/testing/selftests/vm/userfaultfd.c | 161 +++++++++++++++++++++++++++----
 3 files changed, 161 insertions(+), 17 deletions(-)

diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index bbab7f4..0114aac 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -10,6 +10,7 @@ BINARIES += on-fault-limit
 BINARIES += thuge-gen
 BINARIES += transhuge-stress
 BINARIES += userfaultfd
+BINARIES += userfaultfd_hugetlb
 BINARIES += mlock-random-test
 
 all: $(BINARIES)
@@ -18,6 +19,9 @@ all: $(BINARIES)
 userfaultfd: userfaultfd.c ../../../../usr/include/linux/kernel.h
 	$(CC) $(CFLAGS) -O2 -o $@ $< -lpthread
 
+userfaultfd_hugetlb: userfaultfd.c ../../../../usr/include/linux/kernel.h
+	$(CC) $(CFLAGS) -DHUGETLB_TEST -O2 -o $@ $< -lpthread
+
 mlock-random-test: mlock-random-test.c
 	$(CC) $(CFLAGS) -o $@ $< -lcap
 
diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests
index e11968b..14d697e 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -103,6 +103,19 @@ else
 	echo "[PASS]"
 fi
 
+echo "----------------------------"
+echo "running userfaultfd_hugetlb"
+echo "----------------------------"
+# 256MB total huge pages == 128MB src and 128MB dst
+./userfaultfd_hugetlb 128 32 $mnt/ufd_test_file
+if [ $? -ne 0 ]; then
+	echo "[FAIL]"
+	exitcode=1
+else
+	echo "[PASS]"
+fi
+rm -f $mnt/ufd_test_file
+
 #cleanup
 umount $mnt
 rm -rf $mnt
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index d77ed41..3011711 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -76,6 +76,10 @@ static unsigned long nr_cpus, nr_pages, nr_pages_per_cpu, page_size;
 #define BOUNCE_POLL		(1<<3)
 static int bounces;
 
+#ifdef HUGETLB_TEST
+static int huge_fd;
+static char *huge_fd_off0;
+#endif
 static unsigned long long *count_verify;
 static int uffd, finished, *pipefd;
 static char *area_src, *area_dst;
@@ -97,6 +101,69 @@ pthread_attr_t attr;
 				 ~(unsigned long)(sizeof(unsigned long long) \
 						  -  1)))
 
+#ifndef HUGETLB_TEST
+
+#define EXPECTED_IOCTLS		((1 << _UFFDIO_WAKE) | \
+				 (1 << _UFFDIO_COPY) | \
+				 (1 << _UFFDIO_ZEROPAGE))
+
+static int release_pages(char *rel_area)
+{
+	int ret = 0;
+
+	if (madvise(rel_area, nr_pages * page_size, MADV_DONTNEED)) {
+		perror("madvise");
+		ret = 1;
+	}
+
+	return ret;
+}
+
+static void allocate_area(void **alloc_area)
+{
+	if (posix_memalign(alloc_area, page_size, nr_pages * page_size)) {
+		fprintf(stderr, "out of memory\n");
+		*alloc_area = NULL;
+	}
+}
+
+#else /* HUGETLB_TEST */
+
+#define EXPECTED_IOCTLS		UFFD_API_RANGE_IOCTLS_HPAGE
+
+static int release_pages(char *rel_area)
+{
+	int ret = 0;
+
+	if (fallocate(huge_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+				rel_area == huge_fd_off0 ? 0 :
+				nr_pages * page_size,
+				nr_pages * page_size)) {
+		perror("fallocate");
+		ret = 1;
+	}
+
+	return ret;
+}
+
+
+static void allocate_area(void **alloc_area)
+{
+	*alloc_area = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
+				MAP_PRIVATE | MAP_HUGETLB, huge_fd,
+				*alloc_area == area_src ? 0 :
+				nr_pages * page_size);
+	if (*alloc_area == MAP_FAILED) {
+		fprintf(stderr, "mmap of hugetlbfs file failed\n");
+		*alloc_area = NULL;
+	}
+
+	if (*alloc_area == area_src)
+		huge_fd_off0 = *alloc_area;
+}
+
+#endif /* HUGETLB_TEST */
+
 static int my_bcmp(char *str1, char *str2, size_t n)
 {
 	unsigned long i;
@@ -384,10 +451,8 @@ static int stress(unsigned long *userfaults)
 	 * UFFDIO_COPY without writing zero pages into area_dst
 	 * because the background threads already completed).
 	 */
-	if (madvise(area_src, nr_pages * page_size, MADV_DONTNEED)) {
-		perror("madvise");
+	if (release_pages(area_src))
 		return 1;
-	}
 
 	for (cpu = 0; cpu < nr_cpus; cpu++) {
 		char c;
@@ -425,16 +490,12 @@ static int userfaultfd_stress(void)
 	int uffd_flags, err;
 	unsigned long userfaults[nr_cpus];
 
-	if (posix_memalign(&area, page_size, nr_pages * page_size)) {
-		fprintf(stderr, "out of memory\n");
+	allocate_area((void **)&area_src);
+	if (!area_src)
 		return 1;
-	}
-	area_src = area;
-	if (posix_memalign(&area, page_size, nr_pages * page_size)) {
-		fprintf(stderr, "out of memory\n");
+	allocate_area((void **)&area_dst);
+	if (!area_dst)
 		return 1;
-	}
-	area_dst = area;
 
 	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
 	if (uffd < 0) {
@@ -528,9 +589,7 @@ static int userfaultfd_stress(void)
 			fprintf(stderr, "register failure\n");
 			return 1;
 		}
-		expected_ioctls = (1 << _UFFDIO_WAKE) |
-				  (1 << _UFFDIO_COPY) |
-				  (1 << _UFFDIO_ZEROPAGE);
+		expected_ioctls = EXPECTED_IOCTLS;
 		if ((uffdio_register.ioctls & expected_ioctls) !=
 		    expected_ioctls) {
 			fprintf(stderr,
@@ -562,10 +621,8 @@ static int userfaultfd_stress(void)
 		 * MADV_DONTNEED only after the UFFDIO_REGISTER, so it's
 		 * required to MADV_DONTNEED here.
 		 */
-		if (madvise(area_dst, nr_pages * page_size, MADV_DONTNEED)) {
-			perror("madvise 2");
+		if (release_pages(area_dst))
 			return 1;
-		}
 
 		/* bounce pass */
 		if (stress(userfaults))
@@ -606,6 +663,8 @@ static int userfaultfd_stress(void)
 	return err;
 }
 
+#ifndef HUGETLB_TEST
+
 int main(int argc, char **argv)
 {
 	if (argc < 3)
@@ -632,6 +691,74 @@ int main(int argc, char **argv)
 	return userfaultfd_stress();
 }
 
+#else /* HUGETLB_TEST */
+
+/*
+ * Copied from mlock2-tests.c
+ */
+unsigned long default_huge_page_size(void)
+{
+	unsigned long hps = 0;
+	char *line = NULL;
+	size_t linelen = 0;
+	FILE *f = fopen("/proc/meminfo", "r");
+
+	if (!f)
+		return 0;
+	while (getline(&line, &linelen, f) > 0) {
+		if (sscanf(line, "Hugepagesize:       %lu kB", &hps) == 1) {
+			hps <<= 10;
+			break;
+		}
+	}
+
+	free(line);
+	fclose(f);
+	return hps;
+}
+
+int main(int argc, char **argv)
+{
+	if (argc < 4)
+		fprintf(stderr, "Usage: <MiB> <bounces> <hugetlbfs_file>\n"),
+				exit(1);
+	nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
+	page_size = default_huge_page_size();
+	if (!page_size)
+		fprintf(stderr, "Unable to determine huge page size\n"),
+				exit(2);
+	if ((unsigned long) area_count(NULL, 0) + sizeof(unsigned long long) * 2
+	    > page_size)
+		fprintf(stderr, "Impossible to run this test\n"), exit(2);
+	nr_pages_per_cpu = atol(argv[1]) * 1024*1024 / page_size /
+		nr_cpus;
+	if (!nr_pages_per_cpu) {
+		fprintf(stderr, "invalid MiB\n");
+		fprintf(stderr, "Usage: <MiB> <bounces>\n"), exit(1);
+	}
+	bounces = atoi(argv[2]);
+	if (bounces <= 0) {
+		fprintf(stderr, "invalid bounces\n");
+		fprintf(stderr, "Usage: <MiB> <bounces>\n"), exit(1);
+	}
+	nr_pages = nr_pages_per_cpu * nr_cpus;
+	huge_fd = open(argv[3], O_CREAT | O_RDWR, 0755);
+	if (huge_fd < 0) {
+		fprintf(stderr, "Open of %s failed", argv[3]);
+		perror("open");
+		exit(1);
+	}
+	if (ftruncate(huge_fd, 0)) {
+		fprintf(stderr, "ftruncate %s to size 0 failed", argv[3]);
+		perror("ftruncate");
+		exit(1);
+	}
+	printf("nr_pages: %lu, nr_pages_per_cpu: %lu\n",
+	       nr_pages, nr_pages_per_cpu);
+	return userfaultfd_stress();
+}
+
+#endif
 #else /* __NR_userfaultfd */
 
 #warning "missing __NR_userfaultfd definition"


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 19/33] userfaultfd: hugetlbfs: userfaultfd_huge_must_wait for hugepmd ranges
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (17 preceding siblings ...)
  2016-11-02 19:33 ` [PATCH 18/33] userfaultfd: hugetlbfs: add userfaultfd_hugetlb test Andrea Arcangeli
@ 2016-11-02 19:33 ` Andrea Arcangeli
  2016-11-02 19:33 ` [PATCH 20/33] userfaultfd: introduce vma_can_userfault Andrea Arcangeli
                   ` (14 subsequent siblings)
  33 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

From: Mike Kravetz <mike.kravetz@oracle.com>

Add the routine userfaultfd_huge_must_wait, which has the same
functionality as the existing userfaultfd_must_wait routine.  The only
difference is that the new routine must handle the page table structure
for hugepmd vmas.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 49 insertions(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index a73e999..9552734 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -195,6 +195,49 @@ static inline struct uffd_msg userfault_msg(unsigned long address,
 	return msg;
 }
 
+#ifdef CONFIG_HUGETLB_PAGE
+/*
+ * Same functionality as userfaultfd_must_wait below with modifications for
+ * hugepmd ranges.
+ */
+static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
+					 unsigned long address,
+					 unsigned long flags,
+					 unsigned long reason)
+{
+	struct mm_struct *mm = ctx->mm;
+	pte_t *pte;
+	bool ret = true;
+
+	VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+
+	pte = huge_pte_offset(mm, address);
+	if (!pte)
+		goto out;
+
+	ret = false;
+
+	/*
+	 * Lockless access: we're in a wait_event so it's ok if it
+	 * changes under us.
+	 */
+	if (huge_pte_none(*pte))
+		ret = true;
+	if (!huge_pte_write(*pte) && (reason & VM_UFFD_WP))
+		ret = true;
+out:
+	return ret;
+}
+#else
+static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
+					 unsigned long address,
+					 unsigned long flags,
+					 unsigned long reason)
+{
+	return false;	/* should never get here */
+}
+#endif /* CONFIG_HUGETLB_PAGE */
+
 /*
  * Verify the pagetables are still not ok after having reigstered into
  * the fault_pending_wqh to avoid userland having to UFFDIO_WAKE any
@@ -367,7 +410,12 @@ int handle_userfault(struct fault_env *fe, unsigned long reason)
 			  TASK_KILLABLE);
 	spin_unlock(&ctx->fault_pending_wqh.lock);
 
-	must_wait = userfaultfd_must_wait(ctx, fe->address, fe->flags, reason);
+	if (!is_vm_hugetlb_page(fe->vma))
+		must_wait = userfaultfd_must_wait(ctx, fe->address, fe->flags,
+						  reason);
+	else
+		must_wait = userfaultfd_huge_must_wait(ctx, fe->address,
+						       fe->flags, reason);
 	up_read(&mm->mmap_sem);
 
 	if (likely(must_wait && !ACCESS_ONCE(ctx->released) &&


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 20/33] userfaultfd: introduce vma_can_userfault
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (18 preceding siblings ...)
  2016-11-02 19:33 ` [PATCH 19/33] userfaultfd: hugetlbfs: userfaultfd_huge_must_wait for hugepmd ranges Andrea Arcangeli
@ 2016-11-02 19:33 ` Andrea Arcangeli
  2016-11-04  7:39   ` Hillf Danton
  2016-11-02 19:33 ` [PATCH 21/33] userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support Andrea Arcangeli
                   ` (13 subsequent siblings)
  33 siblings, 1 reply; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

From: Mike Rapoport <rppt@linux.vnet.ibm.com>

Check whether a VMA can be used with userfault in a more compact way.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 9552734..387fe77 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1060,6 +1060,11 @@ static __always_inline int validate_range(struct mm_struct *mm,
 	return 0;
 }
 
+static inline bool vma_can_userfault(struct vm_area_struct *vma)
+{
+	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma);
+}
+
 static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 				unsigned long arg)
 {
@@ -1149,7 +1154,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 
 		/* check not compatible vmas */
 		ret = -EINVAL;
-		if (!vma_is_anonymous(cur) && !is_vm_hugetlb_page(cur))
+		if (!vma_can_userfault(cur))
 			goto out_unlock;
 		/*
 		 * If this vma contains ending address, and huge pages
@@ -1193,7 +1198,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 	do {
 		cond_resched();
 
-		BUG_ON(!vma_is_anonymous(vma) && !is_vm_hugetlb_page(vma));
+		BUG_ON(!vma_can_userfault(vma));
 		BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
 		       vma->vm_userfaultfd_ctx.ctx != ctx);
 
@@ -1331,7 +1336,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 		 * provides for more strict behavior to notice
 		 * unregistration errors.
 		 */
-		if (!vma_is_anonymous(cur) && !is_vm_hugetlb_page(cur))
+		if (!vma_can_userfault(cur))
 			goto out_unlock;
 
 		found = true;
@@ -1345,7 +1350,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 	do {
 		cond_resched();
 
-		BUG_ON(!vma_is_anonymous(vma) && !is_vm_hugetlb_page(vma));
+		BUG_ON(!vma_can_userfault(vma));
 
 		/*
 		 * Nothing to do: this vma is already registered into this


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 21/33] userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (19 preceding siblings ...)
  2016-11-02 19:33 ` [PATCH 20/33] userfaultfd: introduce vma_can_userfault Andrea Arcangeli
@ 2016-11-02 19:33 ` Andrea Arcangeli
  2016-11-02 19:33 ` [PATCH 22/33] userfaultfd: shmem: introduce vma_is_shmem Andrea Arcangeli
                   ` (12 subsequent siblings)
  33 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

From: Mike Rapoport <rppt@linux.vnet.ibm.com>

shmem_mcopy_atomic_pte is the low level routine that implements
the userfaultfd UFFDIO_COPY command.  It is based on the existing
mcopy_atomic_pte routine with modifications for shared memory pages.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/shmem_fs.h |  11 +++++
 mm/shmem.c               | 110 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 121 insertions(+)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index ff078e7..fdaac9d4 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -124,4 +124,15 @@ static inline bool shmem_huge_enabled(struct vm_area_struct *vma)
 }
 #endif
 
+#ifdef CONFIG_SHMEM
+extern int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
+				  struct vm_area_struct *dst_vma,
+				  unsigned long dst_addr,
+				  unsigned long src_addr,
+				  struct page **pagep);
+#else
+#define shmem_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma, dst_addr, \
+			       src_addr, pagep)        ({ BUG(); 0; })
+#endif
+
 #endif
diff --git a/mm/shmem.c b/mm/shmem.c
index ad7813d..6e5fe90 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -70,6 +70,7 @@ static struct vfsmount *shm_mnt;
 #include <linux/syscalls.h>
 #include <linux/fcntl.h>
 #include <uapi/linux/memfd.h>
+#include <linux/rmap.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -2135,6 +2136,115 @@ bool shmem_mapping(struct address_space *mapping)
 	return mapping->host->i_sb->s_op == &shmem_ops;
 }
 
+int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
+			   pmd_t *dst_pmd,
+			   struct vm_area_struct *dst_vma,
+			   unsigned long dst_addr,
+			   unsigned long src_addr,
+			   struct page **pagep)
+{
+	struct inode *inode = file_inode(dst_vma->vm_file);
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+	struct address_space *mapping = inode->i_mapping;
+	gfp_t gfp = mapping_gfp_mask(mapping);
+	pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
+	struct mem_cgroup *memcg;
+	spinlock_t *ptl;
+	void *page_kaddr;
+	struct page *page;
+	pte_t _dst_pte, *dst_pte;
+	int ret;
+
+	if (!*pagep) {
+		ret = -ENOMEM;
+		if (shmem_acct_block(info->flags, 1))
+			goto out;
+		if (sbinfo->max_blocks) {
+			if (percpu_counter_compare(&sbinfo->used_blocks,
+						   sbinfo->max_blocks) >= 0)
+				goto out_unacct_blocks;
+			percpu_counter_inc(&sbinfo->used_blocks);
+		}
+
+		page = shmem_alloc_page(gfp, info, pgoff);
+		if (!page)
+			goto out_dec_used_blocks;
+
+		page_kaddr = kmap_atomic(page);
+		ret = copy_from_user(page_kaddr, (const void __user *)src_addr,
+				     PAGE_SIZE);
+		kunmap_atomic(page_kaddr);
+
+		/* fallback to copy_from_user outside mmap_sem */
+		if (unlikely(ret)) {
+			*pagep = page;
+			/* don't free the page */
+			return -EFAULT;
+		}
+	} else {
+		page = *pagep;
+		*pagep = NULL;
+	}
+
+	ret = mem_cgroup_try_charge(page, dst_mm, gfp, &memcg, false);
+	if (ret)
+		goto out_release;
+
+	ret = radix_tree_maybe_preload(gfp & GFP_RECLAIM_MASK);
+	if (!ret) {
+		ret = shmem_add_to_page_cache(page, mapping, pgoff, NULL);
+		radix_tree_preload_end();
+	}
+	if (ret)
+		goto out_release_uncharge;
+
+	mem_cgroup_commit_charge(page, memcg, false, false);
+
+	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
+	if (dst_vma->vm_flags & VM_WRITE)
+		_dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
+
+	ret = -EEXIST;
+	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+	if (!pte_none(*dst_pte))
+		goto out_release_uncharge_unlock;
+
+	__SetPageUptodate(page);
+
+	lru_cache_add_anon(page);
+
+	spin_lock(&info->lock);
+	info->alloced++;
+	inode->i_blocks += BLOCKS_PER_PAGE;
+	shmem_recalc_inode(inode);
+	spin_unlock(&info->lock);
+
+	inc_mm_counter(dst_mm, mm_counter_file(page));
+	page_add_file_rmap(page, false);
+	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+
+	/* No need to invalidate - it was non-present before */
+	update_mmu_cache(dst_vma, dst_addr, dst_pte);
+	unlock_page(page);
+	pte_unmap_unlock(dst_pte, ptl);
+	ret = 0;
+out:
+	return ret;
+out_release_uncharge_unlock:
+	pte_unmap_unlock(dst_pte, ptl);
+out_release_uncharge:
+	mem_cgroup_cancel_charge(page, memcg, false);
+out_release:
+	put_page(page);
+out_dec_used_blocks:
+	if (sbinfo->max_blocks)
+		percpu_counter_add(&sbinfo->used_blocks, -1);
+out_unacct_blocks:
+	shmem_unacct_blocks(info->flags, 1);
+	goto out;
+}
+
 #ifdef CONFIG_TMPFS
 static const struct inode_operations shmem_symlink_inode_operations;
 static const struct inode_operations shmem_short_symlink_operations;


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 22/33] userfaultfd: shmem: introduce vma_is_shmem
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (20 preceding siblings ...)
  2016-11-02 19:33 ` [PATCH 21/33] userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support Andrea Arcangeli
@ 2016-11-02 19:33 ` Andrea Arcangeli
  2016-11-02 19:33 ` [PATCH 23/33] userfaultfd: shmem: add tlbflush.h header for microblaze Andrea Arcangeli
                   ` (11 subsequent siblings)
  33 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

From: Mike Rapoport <rppt@linux.vnet.ibm.com>

Currently userfault relies on vma_is_anonymous and vma_is_hugetlb to
ensure compatibility of a VMA with userfault.  Introducing vma_is_shmem
allows detection of tmpfs-backed VMAs, so that they may be used with
userfaultfd.  The current implementation presumes vma_is_shmem is used
only by slow path routines in userfaultfd, therefore vma_is_shmem is
not made inline, sparing the few remaining free bits in vm_flags.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mm.h | 10 ++++++++++
 mm/shmem.c         |  5 +++++
 2 files changed, 15 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ec18964..fd98303 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1365,6 +1365,16 @@ static inline bool vma_is_anonymous(struct vm_area_struct *vma)
 	return !vma->vm_ops;
 }
 
+#ifdef CONFIG_SHMEM
+/*
+ * The vma_is_shmem is not inline because it is used only by slow
+ * paths in userfault.
+ */
+bool vma_is_shmem(struct vm_area_struct *vma);
+#else
+static inline bool vma_is_shmem(struct vm_area_struct *vma) { return false; }
+#endif
+
 static inline int stack_guard_page_start(struct vm_area_struct *vma,
 					     unsigned long addr)
 {
diff --git a/mm/shmem.c b/mm/shmem.c
index 6e5fe90..66deb90 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -191,6 +191,11 @@ static const struct inode_operations shmem_special_inode_operations;
 static const struct vm_operations_struct shmem_vm_ops;
 static struct file_system_type shmem_fs_type;
 
+bool vma_is_shmem(struct vm_area_struct *vma)
+{
+	return vma->vm_ops == &shmem_vm_ops;
+}
+
 static LIST_HEAD(shmem_swaplist);
 static DEFINE_MUTEX(shmem_swaplist_mutex);
 


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 23/33] userfaultfd: shmem: add tlbflush.h header for microblaze
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (21 preceding siblings ...)
  2016-11-02 19:33 ` [PATCH 22/33] userfaultfd: shmem: introduce vma_is_shmem Andrea Arcangeli
@ 2016-11-02 19:33 ` Andrea Arcangeli
  2016-11-02 19:33 ` [PATCH 24/33] userfaultfd: shmem: use shmem_mcopy_atomic_pte for shared memory Andrea Arcangeli
                   ` (10 subsequent siblings)
  33 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

It resolves this build error:

All errors (new ones prefixed by >>):

   mm/shmem.c: In function 'shmem_mcopy_atomic_pte':
   >> mm/shmem.c:2228:2: error: implicit declaration of function
   'update_mmu_cache' [-Werror=implicit-function-declaration]
        update_mmu_cache(dst_vma, dst_addr, dst_pte);

microblaze may also have to be updated to define it in asm/pgtable.h
like the other archs; then this header inclusion can be removed.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/shmem.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index 66deb90..acf80c2 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -34,6 +34,8 @@
 #include <linux/uio.h>
 #include <linux/khugepaged.h>
 
+#include <asm/tlbflush.h> /* for arch/microblaze update_mmu_cache() */
+
 static struct vfsmount *shm_mnt;
 
 #ifdef CONFIG_SHMEM


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 24/33] userfaultfd: shmem: use shmem_mcopy_atomic_pte for shared memory
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (22 preceding siblings ...)
  2016-11-02 19:33 ` [PATCH 23/33] userfaultfd: shmem: add tlbflush.h header for microblaze Andrea Arcangeli
@ 2016-11-02 19:33 ` Andrea Arcangeli
  2016-11-02 19:33 ` [PATCH 25/33] userfaultfd: shmem: add userfaultfd hook for shared memory faults Andrea Arcangeli
                   ` (9 subsequent siblings)
  33 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

From: Mike Rapoport <rppt@linux.vnet.ibm.com>

shmem_mcopy_atomic_pte implements the low level part of the UFFDIO_COPY
operation for shared memory VMAs.  It is based on mcopy_atomic_pte with
the adjustments necessary for shared memory pages.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/userfaultfd.c | 31 ++++++++++++++++++-------------
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index ba9adff..7b27d1e 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -16,6 +16,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/hugetlb.h>
 #include <linux/pagemap.h>
+#include <linux/shmem_fs.h>
 #include <asm/tlbflush.h>
 #include "internal.h"
 
@@ -348,7 +349,9 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	 */
 	err = -EINVAL;
 	dst_vma = find_vma(dst_mm, dst_start);
-	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
+	if (!dst_vma)
+		goto out_unlock;
+	if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED)
 		goto out_unlock;
 	if (dst_start < dst_vma->vm_start ||
 	    dst_start + len > dst_vma->vm_end)
@@ -373,11 +376,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	if (!dst_vma->vm_userfaultfd_ctx.ctx)
 		goto out_unlock;
 
-	/*
-	 * FIXME: only allow copying on anonymous vmas, tmpfs should
-	 * be added.
-	 */
-	if (!vma_is_anonymous(dst_vma))
+	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
 		goto out_unlock;
 
 	/*
@@ -386,7 +385,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	 * dst_vma.
 	 */
 	err = -ENOMEM;
-	if (unlikely(anon_vma_prepare(dst_vma)))
+	if (vma_is_anonymous(dst_vma) && unlikely(anon_vma_prepare(dst_vma)))
 		goto out_unlock;
 
 	while (src_addr < src_start + len) {
@@ -423,12 +422,18 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 		BUG_ON(pmd_none(*dst_pmd));
 		BUG_ON(pmd_trans_huge(*dst_pmd));
 
-		if (!zeropage)
-			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
-					       dst_addr, src_addr, &page);
-		else
-			err = mfill_zeropage_pte(dst_mm, dst_pmd, dst_vma,
-						 dst_addr);
+		if (vma_is_anonymous(dst_vma)) {
+			if (!zeropage)
+				err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
+						       dst_addr, src_addr,
+						       &page);
+			else
+				err = mfill_zeropage_pte(dst_mm, dst_pmd,
+							 dst_vma, dst_addr);
+		} else {
+			err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
+						     dst_addr, src_addr, &page);
+		}
 
 		cond_resched();
 


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 25/33] userfaultfd: shmem: add userfaultfd hook for shared memory faults
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (23 preceding siblings ...)
  2016-11-02 19:33 ` [PATCH 24/33] userfaultfd: shmem: use shmem_mcopy_atomic_pte for shared memory Andrea Arcangeli
@ 2016-11-02 19:33 ` Andrea Arcangeli
  2016-11-04  8:59   ` Hillf Danton
  2016-11-02 19:33 ` [PATCH 26/33] userfaultfd: shmem: allow registration of shared memory ranges Andrea Arcangeli
                   ` (8 subsequent siblings)
  33 siblings, 1 reply; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

From: Mike Rapoport <rppt@linux.vnet.ibm.com>

When processing a page fault in a shared memory area for a not-present
page, check the VMA to determine whether the fault is to be handled by
userfaultfd. If so, delegate the page fault to handle_userfault.
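
For illustration only (not part of this patch), the user-visible effect is
that a monitor now receives UFFD_EVENT_PAGEFAULT messages for registered
shmem areas exactly as it does for anonymous memory. A minimal sketch of
reading one such event, assuming uffd was opened without O_NONBLOCK and
already registered; the function name is made up.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* illustrative sketch only: block until one missing-page fault is
 * reported and return the faulting address; resolving it with
 * UFFDIO_COPY is left to the caller */
static unsigned long long wait_for_missing_fault(int uffd)
{
	struct uffd_msg msg;

	if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
		perror("read"), exit(1);
	if (msg.event != UFFD_EVENT_PAGEFAULT)
		fprintf(stderr, "unexpected event %u\n", msg.event), exit(1);
	if (msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE)
		fprintf(stderr, "unexpected write fault\n"), exit(1);
	return msg.arg.pagefault.address;
}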

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/shmem.c | 34 +++++++++++++++++++++++++++-------
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index acf80c2..fe469e5 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -72,6 +72,7 @@ static struct vfsmount *shm_mnt;
 #include <linux/syscalls.h>
 #include <linux/fcntl.h>
 #include <uapi/linux/memfd.h>
+#include <linux/userfaultfd_k.h>
 #include <linux/rmap.h>
 
 #include <asm/uaccess.h>
@@ -118,13 +119,14 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp,
 				struct shmem_inode_info *info, pgoff_t index);
 static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 		struct page **pagep, enum sgp_type sgp,
-		gfp_t gfp, struct mm_struct *fault_mm, int *fault_type);
+		gfp_t gfp, struct vm_area_struct *vma,
+		struct vm_fault *vmf, int *fault_type);
 
 int shmem_getpage(struct inode *inode, pgoff_t index,
 		struct page **pagep, enum sgp_type sgp)
 {
 	return shmem_getpage_gfp(inode, index, pagep, sgp,
-		mapping_gfp_mask(inode->i_mapping), NULL, NULL);
+		mapping_gfp_mask(inode->i_mapping), NULL, NULL, NULL);
 }
 
 static inline struct shmem_sb_info *SHMEM_SB(struct super_block *sb)
@@ -1542,7 +1544,7 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp,
  */
 static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 	struct page **pagep, enum sgp_type sgp, gfp_t gfp,
-	struct mm_struct *fault_mm, int *fault_type)
+	struct vm_area_struct *vma, struct vm_fault *vmf, int *fault_type)
 {
 	struct address_space *mapping = inode->i_mapping;
 	struct shmem_inode_info *info;
@@ -1597,7 +1599,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 	 */
 	info = SHMEM_I(inode);
 	sbinfo = SHMEM_SB(inode->i_sb);
-	charge_mm = fault_mm ? : current->mm;
+	charge_mm = vma ? vma->vm_mm : current->mm;
 
 	if (swap.val) {
 		/* Look it up and read it in.. */
@@ -1607,7 +1609,8 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 			if (fault_type) {
 				*fault_type |= VM_FAULT_MAJOR;
 				count_vm_event(PGMAJFAULT);
-				mem_cgroup_count_vm_event(fault_mm, PGMAJFAULT);
+				mem_cgroup_count_vm_event(vma->vm_mm,
+							  PGMAJFAULT);
 			}
 			/* Here we actually start the io */
 			page = shmem_swapin(swap, gfp, info, index);
@@ -1676,6 +1679,23 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 		swap_free(swap);
 
 	} else {
+		if (vma && userfaultfd_missing(vma)) {
+			struct fault_env fe = {
+				.vma = vma,
+				.address = (unsigned long)vmf->virtual_address,
+				.flags = vmf->flags,
+				/*
+				 * Hard to debug if it ends up being
+				 * used by a callee that assumes
+				 * something about the other
+				 * uninitialized fields... same as in
+				 * memory.c
+				 */
+			};
+			*fault_type = handle_userfault(&fe, VM_UFFD_MISSING);
+			return 0;
+		}
+
 		/* shmem_symlink() */
 		if (mapping->a_ops != &shmem_aops)
 			goto alloc_nohuge;
@@ -1927,7 +1947,7 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 		sgp = SGP_NOHUGE;
 
 	error = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, sgp,
-				  gfp, vma->vm_mm, &ret);
+				  gfp, vma, vmf, &ret);
 	if (error)
 		return ((error == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS);
 	return ret;
@@ -4212,7 +4232,7 @@ struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
 
 	BUG_ON(mapping->a_ops != &shmem_aops);
 	error = shmem_getpage_gfp(inode, index, &page, SGP_CACHE,
-				  gfp, NULL, NULL);
+				  gfp, NULL, NULL, NULL);
 	if (error)
 		page = ERR_PTR(error);
 	else


* [PATCH 26/33] userfaultfd: shmem: allow registration of shared memory ranges
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (24 preceding siblings ...)
  2016-11-02 19:33 ` [PATCH 25/33] userfaultfd: shmem: add userfaultfd hook for shared memory faults Andrea Arcangeli
@ 2016-11-02 19:33 ` Andrea Arcangeli
  2016-11-02 19:33 ` [PATCH 27/33] userfaultfd: shmem: add userfaultfd_shmem test Andrea Arcangeli
                   ` (7 subsequent siblings)
  33 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

From: Mike Rapoport <rppt@linux.vnet.ibm.com>

Expand the userfaultfd_register/unregister routines to allow shared memory
VMAs. Currently there is no UFFDIO_ZEROPAGE and no write-protection support
for shared memory VMAs, which is reflected in the set of ioctls reported
back by uffdio_register.
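
For illustration only (not part of this patch), a sketch of how userland can
register a shmem backed range after this change and verify the reported
ioctls; the helper name is made up and uffd/area/len are assumed to be set
up by the caller.

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* illustrative sketch only: register a shmem range for missing faults
 * and check what the kernel reports back; shmem ranges are expected to
 * report UFFD_API_RANGE_IOCTLS_BASIC, i.e. COPY and WAKE but no
 * ZEROPAGE and no write-protection yet */
static int register_shmem_range(int uffd, void *area, unsigned long len)
{
	struct uffdio_register reg;

	reg.range.start = (unsigned long) area;
	reg.range.len = len;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
	if (ioctl(uffd, UFFDIO_REGISTER, &reg)) {
		perror("UFFDIO_REGISTER");
		return 1;
	}
	if (!(reg.ioctls & ((__u64)1 << _UFFDIO_COPY)))
		fprintf(stderr, "UFFDIO_COPY not reported\n");
	if (reg.ioctls & ((__u64)1 << _UFFDIO_ZEROPAGE))
		fprintf(stderr, "unexpected UFFDIO_ZEROPAGE support\n");
	return 0;
}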

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c                         | 21 +++++++--------------
 include/uapi/linux/userfaultfd.h         |  2 +-
 tools/testing/selftests/vm/userfaultfd.c |  2 +-
 3 files changed, 9 insertions(+), 16 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 387fe77..53de9e7 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1062,7 +1062,8 @@ static __always_inline int validate_range(struct mm_struct *mm,
 
 static inline bool vma_can_userfault(struct vm_area_struct *vma)
 {
-	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma);
+	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
+		vma_is_shmem(vma);
 }
 
 static int userfaultfd_register(struct userfaultfd_ctx *ctx,
@@ -1075,7 +1076,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 	struct uffdio_register __user *user_uffdio_register;
 	unsigned long vm_flags, new_flags;
 	bool found;
-	bool huge_pages;
+	bool non_anon_pages;
 	unsigned long start, end, vma_end;
 
 	user_uffdio_register = (struct uffdio_register __user *) arg;
@@ -1139,13 +1140,9 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 
 	/*
 	 * Search for not compatible vmas.
-	 *
-	 * FIXME: this shall be relaxed later so that it doesn't fail
-	 * on tmpfs backed vmas (in addition to the current allowance
-	 * on anonymous vmas).
 	 */
 	found = false;
-	huge_pages = false;
+	non_anon_pages = false;
 	for (cur = vma; cur && cur->vm_start < end; cur = cur->vm_next) {
 		cond_resched();
 
@@ -1184,8 +1181,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 		/*
 		 * Note vmas containing huge pages
 		 */
-		if (is_vm_hugetlb_page(cur))
-			huge_pages = true;
+		if (is_vm_hugetlb_page(cur) || vma_is_shmem(cur))
+			non_anon_pages = true;
 
 		found = true;
 	}
@@ -1256,7 +1253,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 		 * userland which ioctls methods are guaranteed to
 		 * succeed on this range.
 		 */
-		if (put_user(huge_pages ? UFFD_API_RANGE_IOCTLS_HPAGE :
+		if (put_user(non_anon_pages ? UFFD_API_RANGE_IOCTLS_BASIC :
 			     UFFD_API_RANGE_IOCTLS,
 			     &user_uffdio_register->ioctls))
 			ret = -EFAULT;
@@ -1316,10 +1313,6 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 
 	/*
 	 * Search for not compatible vmas.
-	 *
-	 * FIXME: this shall be relaxed later so that it doesn't fail
-	 * on tmpfs backed vmas (in addition to the current allowance
-	 * on anonymous vmas).
 	 */
 	found = false;
 	ret = -EINVAL;
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index a3828a9..378dd27 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -29,7 +29,7 @@
 	((__u64)1 << _UFFDIO_WAKE |		\
 	 (__u64)1 << _UFFDIO_COPY |		\
 	 (__u64)1 << _UFFDIO_ZEROPAGE)
-#define UFFD_API_RANGE_IOCTLS_HPAGE		\
+#define UFFD_API_RANGE_IOCTLS_BASIC		\
 	((__u64)1 << _UFFDIO_WAKE |		\
 	 (__u64)1 << _UFFDIO_COPY)
 
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 3011711..d753a91 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -129,7 +129,7 @@ static void allocate_area(void **alloc_area)
 
 #else /* HUGETLB_TEST */
 
-#define EXPECTED_IOCTLS		UFFD_API_RANGE_IOCTLS_HPAGE
+#define EXPECTED_IOCTLS		UFFD_API_RANGE_IOCTLS_BASIC
 
 static int release_pages(char *rel_area)
 {


* [PATCH 27/33] userfaultfd: shmem: add userfaultfd_shmem test
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (25 preceding siblings ...)
  2016-11-02 19:33 ` [PATCH 26/33] userfaultfd: shmem: allow registration of shared memory ranges Andrea Arcangeli
@ 2016-11-02 19:33 ` Andrea Arcangeli
  2016-11-02 19:34 ` [PATCH 28/33] userfaultfd: shmem: lock the page before adding it to pagecache Andrea Arcangeli
                   ` (6 subsequent siblings)
  33 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

From: Mike Rapoport <rppt@linux.vnet.ibm.com>

The test verifies that an anonymous shared mapping can be used with
userfaultfd using the existing testing method. The shared memory area
is allocated with mmap(..., MAP_SHARED | MAP_ANONYMOUS, ...) and
released with madvise(MADV_REMOVE).
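
For illustration only (not part of this patch), the same allocation/release
pattern as a standalone sketch; the 128-page area size is arbitrary.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

/* illustrative sketch only: MADV_REMOVE punches a hole in the shmem
 * object backing the MAP_ANONYMOUS|MAP_SHARED mapping, so the next
 * access faults in fresh zero pages (or, once the range is registered
 * with userfaultfd, notifies the monitor) */
int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	size_t len = 128 * page_size;
	char *area;

	area = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_ANONYMOUS | MAP_SHARED, -1, 0);
	if (area == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(area, 0xaa, len);
	if (madvise(area, len, MADV_REMOVE)) {
		perror("madvise");
		return 1;
	}
	return area[0] != 0;	/* released pages read back as zero */
}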

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 tools/testing/selftests/vm/Makefile      |  4 ++++
 tools/testing/selftests/vm/run_vmtests   | 11 ++++++++++
 tools/testing/selftests/vm/userfaultfd.c | 37 ++++++++++++++++++++++++++++++--
 3 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 0114aac..900dfaf 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -11,6 +11,7 @@ BINARIES += thuge-gen
 BINARIES += transhuge-stress
 BINARIES += userfaultfd
 BINARIES += userfaultfd_hugetlb
+BINARIES += userfaultfd_shmem
 BINARIES += mlock-random-test
 
 all: $(BINARIES)
@@ -22,6 +23,9 @@ userfaultfd: userfaultfd.c ../../../../usr/include/linux/kernel.h
 userfaultfd_hugetlb: userfaultfd.c ../../../../usr/include/linux/kernel.h
 	$(CC) $(CFLAGS) -DHUGETLB_TEST -O2 -o $@ $< -lpthread
 
+userfaultfd_shmem: userfaultfd.c ../../../../usr/include/linux/kernel.h
+	$(CC) $(CFLAGS) -DSHMEM_TEST -O2 -o $@ $< -lpthread
+
 mlock-random-test: mlock-random-test.c
 	$(CC) $(CFLAGS) -o $@ $< -lcap
 
diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests
index 14d697e..c92f6cf 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -116,6 +116,17 @@ else
 fi
 rm -f $mnt/ufd_test_file
 
+echo "----------------------------"
+echo "running userfaultfd_shmem"
+echo "----------------------------"
+./userfaultfd_shmem 128 32
+if [ $? -ne 0 ]; then
+	echo "[FAIL]"
+	exitcode=1
+else
+	echo "[PASS]"
+fi
+
 #cleanup
 umount $mnt
 rm -rf $mnt
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index d753a91..a5e5808 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -101,8 +101,9 @@ pthread_attr_t attr;
 				 ~(unsigned long)(sizeof(unsigned long long) \
 						  -  1)))
 
-#ifndef HUGETLB_TEST
+#if !defined(HUGETLB_TEST) && !defined(SHMEM_TEST)
 
+/* Anonymous memory */
 #define EXPECTED_IOCTLS		((1 << _UFFDIO_WAKE) | \
 				 (1 << _UFFDIO_COPY) | \
 				 (1 << _UFFDIO_ZEROPAGE))
@@ -127,10 +128,13 @@ static void allocate_area(void **alloc_area)
 	}
 }
 
-#else /* HUGETLB_TEST */
+#else /* HUGETLB_TEST or SHMEM_TEST */
 
 #define EXPECTED_IOCTLS		UFFD_API_RANGE_IOCTLS_BASIC
 
+#ifdef HUGETLB_TEST
+
+/* HugeTLB memory */
 static int release_pages(char *rel_area)
 {
 	int ret = 0;
@@ -162,8 +166,37 @@ static void allocate_area(void **alloc_area)
 		huge_fd_off0 = *alloc_area;
 }
 
+#elif defined(SHMEM_TEST)
+
+/* Shared memory */
+static int release_pages(char *rel_area)
+{
+	int ret = 0;
+
+	if (madvise(rel_area, nr_pages * page_size, MADV_REMOVE)) {
+		perror("madvise");
+		ret = 1;
+	}
+
+	return ret;
+}
+
+static void allocate_area(void **alloc_area)
+{
+	*alloc_area = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
+			   MAP_ANONYMOUS | MAP_SHARED, -1, 0);
+	if (*alloc_area == MAP_FAILED) {
+		fprintf(stderr, "shared memory mmap failed\n");
+		*alloc_area = NULL;
+	}
+}
+
+#else /* SHMEM_TEST */
+#error "Undefined test type"
 #endif /* HUGETLB_TEST */
 
+#endif /* !defined(HUGETLB_TEST) && !defined(SHMEM_TEST) */
+
 static int my_bcmp(char *str1, char *str2, size_t n)
 {
 	unsigned long i;


* [PATCH 28/33] userfaultfd: shmem: lock the page before adding it to pagecache
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (26 preceding siblings ...)
  2016-11-02 19:33 ` [PATCH 27/33] userfaultfd: shmem: add userfaultfd_shmem test Andrea Arcangeli
@ 2016-11-02 19:34 ` Andrea Arcangeli
  2016-11-02 19:34 ` [PATCH 29/33] userfaultfd: shmem: avoid leaking blocks and used blocks in UFFDIO_COPY Andrea Arcangeli
                   ` (5 subsequent siblings)
  33 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

A VM_BUG_ON triggered in the shmem selftest: the page must be locked
before it is added to the pagecache.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/shmem.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index fe469e5..5d39f88 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2214,6 +2214,10 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		*pagep = NULL;
 	}
 
+	VM_BUG_ON(PageLocked(page) || PageSwapBacked(page));
+	__SetPageLocked(page);
+	__SetPageSwapBacked(page);
+
 	ret = mem_cgroup_try_charge(page, dst_mm, gfp, &memcg, false);
 	if (ret)
 		goto out_release;
@@ -2263,6 +2267,7 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
 out_release_uncharge:
 	mem_cgroup_cancel_charge(page, memcg, false);
 out_release:
+	unlock_page(page);
 	put_page(page);
 out_dec_used_blocks:
 	if (sbinfo->max_blocks)


* [PATCH 29/33] userfaultfd: shmem: avoid leaking blocks and used blocks in UFFDIO_COPY
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (27 preceding siblings ...)
  2016-11-02 19:34 ` [PATCH 28/33] userfaultfd: shmem: lock the page before adding it to pagecache Andrea Arcangeli
@ 2016-11-02 19:34 ` Andrea Arcangeli
  2016-11-02 19:34 ` [PATCH 30/33] userfaultfd: non-cooperative: selftest: introduce userfaultfd_open Andrea Arcangeli
                   ` (4 subsequent siblings)
  33 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

If the atomic copy_user fails because of a real dangling userland
pointer, the caller will not call back into the shmem method (the copy
is retried outside the mmap_sem and, if it fails again, the operation
is aborted), so when the method returns it must not leave anything
charged up, except the page itself.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/shmem.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 5d39f88..578622e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2183,17 +2183,17 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	pte_t _dst_pte, *dst_pte;
 	int ret;
 
-	if (!*pagep) {
-		ret = -ENOMEM;
-		if (shmem_acct_block(info->flags, 1))
-			goto out;
-		if (sbinfo->max_blocks) {
-			if (percpu_counter_compare(&sbinfo->used_blocks,
-						   sbinfo->max_blocks) >= 0)
-				goto out_unacct_blocks;
-			percpu_counter_inc(&sbinfo->used_blocks);
-		}
+	ret = -ENOMEM;
+	if (shmem_acct_block(info->flags, 1))
+		goto out;
+	if (sbinfo->max_blocks) {
+		if (percpu_counter_compare(&sbinfo->used_blocks,
+					   sbinfo->max_blocks) >= 0)
+			goto out_unacct_blocks;
+		percpu_counter_inc(&sbinfo->used_blocks);
+	}
 
+	if (!*pagep) {
 		page = shmem_alloc_page(gfp, info, pgoff);
 		if (!page)
 			goto out_dec_used_blocks;
@@ -2206,6 +2206,9 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		/* fallback to copy_from_user outside mmap_sem */
 		if (unlikely(ret)) {
 			*pagep = page;
+			if (sbinfo->max_blocks)
+				percpu_counter_add(&sbinfo->used_blocks, -1);
+			shmem_unacct_blocks(info->flags, 1);
 			/* don't free the page */
 			return -EFAULT;
 		}


* [PATCH 30/33] userfaultfd: non-cooperative: selftest: introduce userfaultfd_open
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (28 preceding siblings ...)
  2016-11-02 19:34 ` [PATCH 29/33] userfaultfd: shmem: avoid leaking blocks and used blocks in UFFDIO_COPY Andrea Arcangeli
@ 2016-11-02 19:34 ` Andrea Arcangeli
  2016-11-02 19:34 ` [PATCH 31/33] userfaultfd: non-cooperative: selftest: add ufd parameter to copy_page Andrea Arcangeli
                   ` (3 subsequent siblings)
  33 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

From: Mike Rapoport <rppt@linux.vnet.ibm.com>

userfaultfd_open will be needed by the non-cooperative selftest.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 tools/testing/selftests/vm/userfaultfd.c | 41 +++++++++++++++++++-------------
 1 file changed, 25 insertions(+), 16 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index a5e5808..75540e7 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -81,7 +81,7 @@ static int huge_fd;
 static char *huge_fd_off0;
 #endif
 static unsigned long long *count_verify;
-static int uffd, finished, *pipefd;
+static int uffd, uffd_flags, finished, *pipefd;
 static char *area_src, *area_dst;
 static char *zeropage;
 pthread_attr_t attr;
@@ -512,23 +512,9 @@ static int stress(unsigned long *userfaults)
 	return 0;
 }
 
-static int userfaultfd_stress(void)
+static int userfaultfd_open(void)
 {
-	void *area;
-	char *tmp_area;
-	unsigned long nr;
-	struct uffdio_register uffdio_register;
 	struct uffdio_api uffdio_api;
-	unsigned long cpu;
-	int uffd_flags, err;
-	unsigned long userfaults[nr_cpus];
-
-	allocate_area((void **)&area_src);
-	if (!area_src)
-		return 1;
-	allocate_area((void **)&area_dst);
-	if (!area_dst)
-		return 1;
 
 	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
 	if (uffd < 0) {
@@ -549,6 +535,29 @@ static int userfaultfd_stress(void)
 		return 1;
 	}
 
+	return 0;
+}
+
+static int userfaultfd_stress(void)
+{
+	void *area;
+	char *tmp_area;
+	unsigned long nr;
+	struct uffdio_register uffdio_register;
+	unsigned long cpu;
+	int err;
+	unsigned long userfaults[nr_cpus];
+
+	allocate_area((void **)&area_src);
+	if (!area_src)
+		return 1;
+	allocate_area((void **)&area_dst);
+	if (!area_dst)
+		return 1;
+
+	if (userfaultfd_open() < 0)
+		return 1;
+
 	count_verify = malloc(nr_pages * sizeof(unsigned long long));
 	if (!count_verify) {
 		perror("count_verify");


* [PATCH 31/33] userfaultfd: non-cooperative: selftest: add ufd parameter to copy_page
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (29 preceding siblings ...)
  2016-11-02 19:34 ` [PATCH 30/33] userfaultfd: non-cooperative: selftest: introduce userfaultfd_open Andrea Arcangeli
@ 2016-11-02 19:34 ` Andrea Arcangeli
  2016-11-02 19:34 ` [PATCH 32/33] userfaultfd: non-cooperative: selftest: add test for FORK, MADVDONTNEED and REMAP events Andrea Arcangeli
                   ` (2 subsequent siblings)
  33 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

From: Mike Rapoport <rppt@linux.vnet.ibm.com>

With the future addition of the event tests, copy_page will be called
with different userfaultfd file descriptors.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 tools/testing/selftests/vm/userfaultfd.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 75540e7..c79c372 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -317,7 +317,7 @@ static void *locking_thread(void *arg)
 	return NULL;
 }
 
-static int copy_page(unsigned long offset)
+static int copy_page(int ufd, unsigned long offset)
 {
 	struct uffdio_copy uffdio_copy;
 
@@ -329,7 +329,7 @@ static int copy_page(unsigned long offset)
 	uffdio_copy.len = page_size;
 	uffdio_copy.mode = 0;
 	uffdio_copy.copy = 0;
-	if (ioctl(uffd, UFFDIO_COPY, &uffdio_copy)) {
+	if (ioctl(ufd, UFFDIO_COPY, &uffdio_copy)) {
 		/* real retval in ufdio_copy.copy */
 		if (uffdio_copy.copy != -EEXIST)
 			fprintf(stderr, "UFFDIO_COPY error %Ld\n",
@@ -386,7 +386,7 @@ static void *uffd_poll_thread(void *arg)
 		offset = (char *)(unsigned long)msg.arg.pagefault.address -
 			 area_dst;
 		offset &= ~(page_size-1);
-		if (copy_page(offset))
+		if (copy_page(uffd, offset))
 			userfaults++;
 	}
 	return (void *)userfaults;
@@ -424,7 +424,7 @@ static void *uffd_read_thread(void *arg)
 		offset = (char *)(unsigned long)msg.arg.pagefault.address -
 			 area_dst;
 		offset &= ~(page_size-1);
-		if (copy_page(offset))
+		if (copy_page(uffd, offset))
 			(*this_cpu_userfaults)++;
 	}
 	return (void *)NULL;
@@ -438,7 +438,7 @@ static void *background_thread(void *arg)
 	for (page_nr = cpu * nr_pages_per_cpu;
 	     page_nr < (cpu+1) * nr_pages_per_cpu;
 	     page_nr++)
-		copy_page(page_nr * page_size);
+		copy_page(uffd, page_nr * page_size);
 
 	return NULL;
 }


* [PATCH 32/33] userfaultfd: non-cooperative: selftest: add test for FORK, MADVDONTNEED and REMAP events
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (30 preceding siblings ...)
  2016-11-02 19:34 ` [PATCH 31/33] userfaultfd: non-cooperative: selftest: add ufd parameter to copy_page Andrea Arcangeli
@ 2016-11-02 19:34 ` Andrea Arcangeli
  2016-11-02 19:34 ` [PATCH 33/33] mm: mprotect: use pmd_trans_unstable instead of taking the pmd_lock Andrea Arcangeli
  2016-11-02 20:07 ` [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
  33 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

From: Mike Rapoport <rppt@linux.vnet.ibm.com>

Add a test for the userfaultfd events used in the non-cooperative
scenario, where the process that monitors the userfaultfd and handles
the user faults is not the same process that causes the page faults.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 tools/testing/selftests/vm/userfaultfd.c | 174 ++++++++++++++++++++++++++++---
 1 file changed, 162 insertions(+), 12 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index c79c372..fed2119 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -63,6 +63,7 @@
 #include <sys/mman.h>
 #include <sys/syscall.h>
 #include <sys/ioctl.h>
+#include <sys/wait.h>
 #include <pthread.h>
 #include <linux/userfaultfd.h>
 
@@ -347,6 +348,7 @@ static void *uffd_poll_thread(void *arg)
 	unsigned long cpu = (unsigned long) arg;
 	struct pollfd pollfd[2];
 	struct uffd_msg msg;
+	struct uffdio_register uffd_reg;
 	int ret;
 	unsigned long offset;
 	char tmp_chr;
@@ -378,16 +380,35 @@ static void *uffd_poll_thread(void *arg)
 				continue;
 			perror("nonblocking read error"), exit(1);
 		}
-		if (msg.event != UFFD_EVENT_PAGEFAULT)
+		switch (msg.event) {
+		default:
 			fprintf(stderr, "unexpected msg event %u\n",
 				msg.event), exit(1);
-		if (msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE)
-			fprintf(stderr, "unexpected write fault\n"), exit(1);
-		offset = (char *)(unsigned long)msg.arg.pagefault.address -
-			 area_dst;
-		offset &= ~(page_size-1);
-		if (copy_page(uffd, offset))
-			userfaults++;
+			break;
+		case UFFD_EVENT_PAGEFAULT:
+			if (msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE)
+				fprintf(stderr, "unexpected write fault\n"), exit(1);
+			offset = (char *)(unsigned long)msg.arg.pagefault.address -
+				area_dst;
+			offset &= ~(page_size-1);
+			if (copy_page(uffd, offset))
+				userfaults++;
+			break;
+		case UFFD_EVENT_FORK:
+			uffd = msg.arg.fork.ufd;
+			pollfd[0].fd = uffd;
+			break;
+		case UFFD_EVENT_MADVDONTNEED:
+			uffd_reg.range.start = msg.arg.madv_dn.start;
+			uffd_reg.range.len = msg.arg.madv_dn.end -
+				msg.arg.madv_dn.start;
+			if (ioctl(uffd, UFFDIO_UNREGISTER, &uffd_reg.range))
+				fprintf(stderr, "madv_dn failure\n"), exit(1);
+			break;
+		case UFFD_EVENT_REMAP:
+			area_dst = (char *)(unsigned long)msg.arg.remap.to;
+			break;
+		}
 	}
 	return (void *)userfaults;
 }
@@ -512,7 +533,7 @@ static int stress(unsigned long *userfaults)
 	return 0;
 }
 
-static int userfaultfd_open(void)
+static int userfaultfd_open(int features)
 {
 	struct uffdio_api uffdio_api;
 
@@ -525,7 +546,7 @@ static int userfaultfd_open(void)
 	uffd_flags = fcntl(uffd, F_GETFD, NULL);
 
 	uffdio_api.api = UFFD_API;
-	uffdio_api.features = 0;
+	uffdio_api.features = features;
 	if (ioctl(uffd, UFFDIO_API, &uffdio_api)) {
 		fprintf(stderr, "UFFDIO_API\n");
 		return 1;
@@ -538,6 +559,131 @@ static int userfaultfd_open(void)
 	return 0;
 }
 
+/*
+ * For non-cooperative userfaultfd test we fork() a process that will
+ * generate pagefaults, will mremap the area monitored by the
+ * userfaultfd and at last this process will release the monitored
+ * area.
+ * For the anonymous and shared memory the area is divided into two
+ * parts, the first part is accessed before mremap, and the second
+ * part is accessed after mremap. Since hugetlbfs does not support
+ * mremap, the entire monitored area is accessed in a single pass for
+ * HUGETLB_TEST.
+ * The release of the pages currently generates event only for
+ * anonymous memory (UFFD_EVENT_MADVDONTNEED), hence it is not checked
+ * for hugetlb and shmem.
+ */
+static int faulting_process(void)
+{
+	unsigned long nr;
+	unsigned long long count;
+
+#ifndef HUGETLB_TEST
+	unsigned long split_nr_pages = (nr_pages + 1) / 2;
+#else
+	unsigned long split_nr_pages = nr_pages;
+#endif
+
+	for (nr = 0; nr < split_nr_pages; nr++) {
+		count = *area_count(area_dst, nr);
+		if (count != count_verify[nr]) {
+			fprintf(stderr,
+				"nr %lu memory corruption %Lu %Lu\n",
+				nr, count,
+				count_verify[nr]), exit(1);
+		}
+	}
+
+#ifndef HUGETLB_TEST
+	area_dst = mremap(area_dst, nr_pages * page_size,  nr_pages * page_size,
+			  MREMAP_MAYMOVE | MREMAP_FIXED, area_src);
+	if (area_dst == MAP_FAILED)
+		perror("mremap"), exit(1);
+
+	for (; nr < nr_pages; nr++) {
+		count = *area_count(area_dst, nr);
+		if (count != count_verify[nr]) {
+			fprintf(stderr,
+				"nr %lu memory corruption %Lu %Lu\n",
+				nr, count,
+				count_verify[nr]), exit(1);
+		}
+	}
+
+#ifndef SHMEM_TEST
+	if (release_pages(area_dst))
+		return 1;
+
+	for (nr = 0; nr < nr_pages; nr++) {
+		if (my_bcmp(area_dst + nr * page_size, zeropage, page_size))
+			fprintf(stderr, "nr %lu is not zero\n", nr), exit(1);
+	}
+#endif /* SHMEM_TEST */
+
+#endif /* HUGETLB_TEST */
+
+	return 0;
+}
+
+static int userfaultfd_events_test(void)
+{
+	struct uffdio_register uffdio_register;
+	unsigned long expected_ioctls;
+	unsigned long userfaults;
+	pthread_t uffd_mon;
+	int err, features;
+	pid_t pid;
+	char c;
+
+	printf("testing events (fork, remap, madv_dn): ");
+	fflush(stdout);
+
+	if (release_pages(area_dst))
+		return 1;
+
+	features = UFFD_FEATURE_EVENT_FORK | UFFD_FEATURE_EVENT_REMAP |
+		UFFD_FEATURE_EVENT_MADVDONTNEED;
+	if (userfaultfd_open(features) < 0)
+		return 1;
+	fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK);
+
+	uffdio_register.range.start = (unsigned long) area_dst;
+	uffdio_register.range.len = nr_pages * page_size;
+	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
+		fprintf(stderr, "register failure\n"), exit(1);
+
+	expected_ioctls = EXPECTED_IOCTLS;
+	if ((uffdio_register.ioctls & expected_ioctls) !=
+	    expected_ioctls)
+		fprintf(stderr,
+			"unexpected missing ioctl for anon memory\n"),
+			exit(1);
+
+	if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, NULL))
+		perror("uffd_poll_thread create"), exit(1);
+
+	pid = fork();
+	if (pid < 0)
+		perror("fork"), exit(1);
+
+	if (!pid)
+		return faulting_process();
+
+	waitpid(pid, &err, 0);
+	if (err)
+		fprintf(stderr, "faulting process failed\n"), exit(1);
+
+	if (write(pipefd[1], &c, sizeof(c)) != sizeof(c))
+		perror("pipe write"), exit(1);
+	if (pthread_join(uffd_mon, (void **)&userfaults))
+		return 1;
+
+	printf("userfaults: %ld\n", userfaults);
+
+	return userfaults != nr_pages;
+}
+
 static int userfaultfd_stress(void)
 {
 	void *area;
@@ -555,7 +701,7 @@ static int userfaultfd_stress(void)
 	if (!area_dst)
 		return 1;
 
-	if (userfaultfd_open() < 0)
+	if (userfaultfd_open(0) < 0)
 		return 1;
 
 	count_verify = malloc(nr_pages * sizeof(unsigned long long));
@@ -702,7 +848,11 @@ static int userfaultfd_stress(void)
 		printf("\n");
 	}
 
-	return err;
+	if (err)
+		return err;
+
+	close(uffd);
+	return userfaultfd_events_test();
 }
 
 #ifndef HUGETLB_TEST


* [PATCH 33/33] mm: mprotect: use pmd_trans_unstable instead of taking the pmd_lock
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (31 preceding siblings ...)
  2016-11-02 19:34 ` [PATCH 32/33] userfaultfd: non-cooperative: selftest: add test for FORK, MADVDONTNEED and REMAP events Andrea Arcangeli
@ 2016-11-02 19:34 ` Andrea Arcangeli
  2016-11-02 20:07 ` [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
  33 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 19:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michael Rapoport, Dr. David Alan Gilbert,
	 <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

pmd_trans_unstable does an atomic read on the pmd so it doesn't
require the pmd_lock for the same check.

This also removes the special assumption that the mmap_sem is held for
writing if prot_numa is not set. userfaultfd will hold the mmap_sem
only for reading in change_pte_range like prot_numa, but it will not
set prot_numa.

This is always a valid micro-optimization regardless of userfaultfd.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/mprotect.c | 44 +++++++++++++++-----------------------------
 1 file changed, 15 insertions(+), 29 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 1193652..6d4c89a 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -33,34 +33,6 @@
 
 #include "internal.h"
 
-/*
- * For a prot_numa update we only hold mmap_sem for read so there is a
- * potential race with faulting where a pmd was temporarily none. This
- * function checks for a transhuge pmd under the appropriate lock. It
- * returns a pte if it was successfully locked or NULL if it raced with
- * a transhuge insertion.
- */
-static pte_t *lock_pte_protection(struct vm_area_struct *vma, pmd_t *pmd,
-			unsigned long addr, int prot_numa, spinlock_t **ptl)
-{
-	pte_t *pte;
-	spinlock_t *pmdl;
-
-	/* !prot_numa is protected by mmap_sem held for write */
-	if (!prot_numa)
-		return pte_offset_map_lock(vma->vm_mm, pmd, addr, ptl);
-
-	pmdl = pmd_lock(vma->vm_mm, pmd);
-	if (unlikely(pmd_trans_huge(*pmd) || pmd_none(*pmd))) {
-		spin_unlock(pmdl);
-		return NULL;
-	}
-
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, ptl);
-	spin_unlock(pmdl);
-	return pte;
-}
-
 static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		int dirty_accountable, int prot_numa)
@@ -70,7 +42,21 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	spinlock_t *ptl;
 	unsigned long pages = 0;
 
-	pte = lock_pte_protection(vma, pmd, addr, prot_numa, &ptl);
+	/*
+	 * Can be called with only the mmap_sem for reading by
+	 * prot_numa so we must check the pmd isn't constantly
+	 * changing from under us from pmd_none to pmd_trans_huge
+	 * and/or the other way around.
+	 */
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+	/*
+	 * The pmd points to a regular pte so the pmd can't change
+	 * from under us even if the mmap_sem is only hold for
+	 * reading.
+	 */
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	if (!pte)
 		return 0;
 


* Re: [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative
  2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
                   ` (32 preceding siblings ...)
  2016-11-02 19:34 ` [PATCH 33/33] mm: mprotect: use pmd_trans_unstable instead of taking the pmd_lock Andrea Arcangeli
@ 2016-11-02 20:07 ` Andrea Arcangeli
  33 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-02 20:07 UTC (permalink / raw)
  To: linux-mm

FYI: apparently I hit a git bug with this submission... reproducible
with the command below:

git send-email -1 --to '"what ever" <--your--@--email--.com>"'

after replacing --your--@email--.com with your own email.

/crypto/home/andrea/tmp/tmp/ftuVw5S7Vf/0001-userfaultfd-wp-use-uffd_wp-information-in-userfaultf.patch
Dry-OK. Log says:
Sendmail: /usr/sbin/sendmail *snip* -i --your--@--email--.com andrea@cpushare.com
From: Andrea Arcangeli <aarcange@redhat.com>
To: "what ever" " <--your--@--email--.com>
Subject: [PATCH 1/1] userfaultfd: wp: use uffd_wp information in userfaultfd_must_wait
Date: Wed,  2 Nov 2016 20:59:43 +0100
Message-Id: <1478116783-578-1-git-send-email-aarcange@redhat.com>
X-Mailer: git-send-email 2.7.3

Result: OK

It's not ok that --dry-run outputs the above with a fine header while
the actual header in the email data is different. Of course I tested
with --dry-run twice before sending and it looked fine, just as the
above output is fine.

The submission is still valid for review so I'm not re-sending. I may
resend it privately to Andrew post-review if needed.

Thanks,
Andrea


* Re: [PATCH 11/33] userfaultfd: non-cooperative: Add mremap() event
  2016-11-02 19:33 ` [PATCH 11/33] userfaultfd: non-cooperative: Add mremap() event Andrea Arcangeli
@ 2016-11-03  7:41   ` Hillf Danton
  2016-11-03 17:52     ` Mike Rapoport
  2016-11-04 15:40     ` Mike Rapoport
  0 siblings, 2 replies; 69+ messages in thread
From: Hillf Danton @ 2016-11-03  7:41 UTC (permalink / raw)
  To: 'Andrea Arcangeli', 'Andrew Morton'
  Cc: linux-mm, 'Michael Rapoport',
	Dr. David Alan Gilbert,  <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

On Thursday, November 03, 2016 3:34 AM Andrea Arcangeli wrote:
> @@ -576,7 +581,8 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
>  			goto out;
>  		}
> 
> -		ret = move_vma(vma, addr, old_len, new_len, new_addr, &locked);
> +		ret = move_vma(vma, addr, old_len, new_len, new_addr,
> +			       &locked, &uf);
>  	}
>  out:
>  	if (offset_in_page(ret)) {
> @@ -586,5 +592,6 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
>  	up_write(&current->mm->mmap_sem);
>  	if (locked && new_len > old_len)
>  		mm_populate(new_addr + old_len, new_len - old_len);
> +	mremap_userfaultfd_complete(uf, addr, new_addr, old_len);

nit: s/uf/&uf/

>  	return ret;
>  }
> 


* Re: [PATCH 12/33] userfaultfd: non-cooperative: Add madvise() event for MADV_DONTNEED request
  2016-11-02 19:33 ` [PATCH 12/33] userfaultfd: non-cooperative: Add madvise() event for MADV_DONTNEED request Andrea Arcangeli
@ 2016-11-03  8:01   ` Hillf Danton
  2016-11-03 17:24     ` Mike Rapoport
  2016-11-04 15:42     ` [PATCH 12/33] userfaultfd: non-cooperative: Add madvise() event for MADV_DONTNEED request Mike Rapoport
  0 siblings, 2 replies; 69+ messages in thread
From: Hillf Danton @ 2016-11-03  8:01 UTC (permalink / raw)
  To: 'Andrea Arcangeli', 'Andrew Morton'
  Cc: linux-mm, 'Michael Rapoport',
	Dr. David Alan Gilbert,  <dgilbert@redhat.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	 Shaohua Li <shli@fb.com>,
	 Pavel Emelyanov <xemul@parallels.com>

On Thursday, November 03, 2016 3:34 AM Andrea Arcangeli wrote:
> +void madvise_userfault_dontneed(struct vm_area_struct *vma,
> +				struct vm_area_struct **prev,
> +				unsigned long start, unsigned long end)
> +{
> +	struct userfaultfd_ctx *ctx;
> +	struct userfaultfd_wait_queue ewq;
> +
> +	ctx = vma->vm_userfaultfd_ctx.ctx;
> +	if (!ctx || !(ctx->features & UFFD_FEATURE_EVENT_MADVDONTNEED))
> +		return;
> +
> +	userfaultfd_ctx_get(ctx);
> +	*prev = NULL; /* We wait for ACK w/o the mmap semaphore */
> +	up_read(&vma->vm_mm->mmap_sem);
> +
> +	msg_init(&ewq.msg);
> +
> +	ewq.msg.event = UFFD_EVENT_MADVDONTNEED;
> +	ewq.msg.arg.madv_dn.start = start;
> +	ewq.msg.arg.madv_dn.end = end;
> +
> +	userfaultfd_event_wait_completion(ctx, &ewq);
> +
> +	down_read(&vma->vm_mm->mmap_sem);

After napping with mmap_sem released, is vma still valid?

> +}
> +


* Re: [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY
  2016-11-02 19:33 ` [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY Andrea Arcangeli
@ 2016-11-03 10:15   ` Hillf Danton
  2016-11-03 17:33     ` Mike Kravetz
  0 siblings, 1 reply; 69+ messages in thread
From: Hillf Danton @ 2016-11-03 10:15 UTC (permalink / raw)
  To: 'Andrea Arcangeli', 'Andrew Morton'
  Cc: linux-mm, 'Dr. David Alan Gilbert',
	'Mike Kravetz', 'Shaohua Li',
	'Pavel Emelyanov', 'Mike Rapoport'

[out of topic] Cc list is edited to quiet a mail agent warning:
-"Dr. David Alan Gilbert"@v2.random; " <dgilbert@redhat.com> 
+"Dr. David Alan Gilbert" <dgilbert@redhat.com>
-Pavel Emelyanov <xemul@parallels.com>"@v2.random
+Pavel Emelyanov <xemul@parallels.com>
-Michael Rapoport <RAPOPORT@il.ibm.com>
+Mike Rapoport <rppt@linux.vnet.ibm.com>


On Thursday, November 03, 2016 3:34 AM Andrea Arcangeli wrote: 
> +
> +#ifdef CONFIG_HUGETLB_PAGE
> +/*
> + * __mcopy_atomic processing for HUGETLB vmas.  Note that this routine is
> + * called with mmap_sem held, it will release mmap_sem before returning.
> + */
> +static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> +					      struct vm_area_struct *dst_vma,
> +					      unsigned long dst_start,
> +					      unsigned long src_start,
> +					      unsigned long len,
> +					      bool zeropage)
> +{
> +	ssize_t err;
> +	pte_t *dst_pte;
> +	unsigned long src_addr, dst_addr;
> +	long copied;
> +	struct page *page;
> +	struct hstate *h;
> +	unsigned long vma_hpagesize;
> +	pgoff_t idx;
> +	u32 hash;
> +	struct address_space *mapping;
> +
> +	/*
> +	 * There is no default zero huge page for all huge page sizes as
> +	 * supported by hugetlb.  A PMD_SIZE huge pages may exist as used
> +	 * by THP.  Since we can not reliably insert a zero page, this
> +	 * feature is not supported.
> +	 */
> +	if (zeropage)
> +		return -EINVAL;

Release mmap_sem before return?

> +
> +	src_addr = src_start;
> +	dst_addr = dst_start;
> +	copied = 0;
> +	page = NULL;
> +	vma_hpagesize = vma_kernel_pagesize(dst_vma);
> +
> +retry:
> +	/*
> +	 * On routine entry dst_vma is set.  If we had to drop mmap_sem and
> +	 * retry, dst_vma will be set to NULL and we must lookup again.
> +	 */
> +	err = -EINVAL;
> +	if (!dst_vma) {
> +		dst_vma = find_vma(dst_mm, dst_start);

In case of retry, s/dst_start/dst_addr/?
And check if we find a valid vma?

> @@ -182,6 +355,13 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
>  		goto out_unlock;
> 
>  	/*
> +	 * If this is a HUGETLB vma, pass off to appropriate routine
> +	 */
> +	if (dst_vma->vm_flags & VM_HUGETLB)
> +		return  __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
> +						src_start, len, false);

Use is_vm_hugetlb_page()? 



* Re: [PATCH 12/33] userfaultfd: non-cooperative: Add madvise() event for MADV_DONTNEED request
  2016-11-03  8:01   ` Hillf Danton
@ 2016-11-03 17:24     ` Mike Rapoport
  2016-11-04 16:40       ` [PATCH 12/33] userfaultfd: non-cooperative: Add madvise() event for MADV_DONTNEED requestg Andrea Arcangeli
  2016-11-04 15:42     ` [PATCH 12/33] userfaultfd: non-cooperative: Add madvise() event for MADV_DONTNEED request Mike Rapoport
  1 sibling, 1 reply; 69+ messages in thread
From: Mike Rapoport @ 2016-11-03 17:24 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Andrea Arcangeli, Andrew Morton, linux-mm,
	Dr. David Alan Gilbert, Mike Kravetz, Shaohua Li,
	Pavel Emelyanov

(changed 'CC:
- Michael Rapoport <RAPOPORT@il.ibm.com>,
- Dr. David Alan Gilbert@v2.random,  <dgilbert@redhat.com>,
+ Dr. David Alan Gilbert  <dgilbert@redhat.com>,
- Pavel Emelyanov <xemul@parallels.com>@v2.random
+ Pavel Emelyanov <xemul@virtuozzo.com>
)

On Thu, Nov 03, 2016 at 04:01:12PM +0800, Hillf Danton wrote:
> On Thursday, November 03, 2016 3:34 AM Andrea Arcangeli wrote:
> > +void madvise_userfault_dontneed(struct vm_area_struct *vma,
> > +				struct vm_area_struct **prev,
> > +				unsigned long start, unsigned long end)
> > +{
> > +	struct userfaultfd_ctx *ctx;
> > +	struct userfaultfd_wait_queue ewq;
> > +
> > +	ctx = vma->vm_userfaultfd_ctx.ctx;
> > +	if (!ctx || !(ctx->features & UFFD_FEATURE_EVENT_MADVDONTNEED))
> > +		return;
> > +
> > +	userfaultfd_ctx_get(ctx);
> > +	*prev = NULL; /* We wait for ACK w/o the mmap semaphore */
> > +	up_read(&vma->vm_mm->mmap_sem);
> > +
> > +	msg_init(&ewq.msg);
> > +
> > +	ewq.msg.event = UFFD_EVENT_MADVDONTNEED;
> > +	ewq.msg.arg.madv_dn.start = start;
> > +	ewq.msg.arg.madv_dn.end = end;
> > +
> > +	userfaultfd_event_wait_completion(ctx, &ewq);
> > +
> > +	down_read(&vma->vm_mm->mmap_sem);
> 
> After napping with mmap_sem released, is vma still valid?

You are right, vma may be invalid at that point. Thanks for spotting.

Andrea, how do you prefer the fix, incremental or the entire patch updated?

> > +}
> > +

--
Sincerely yours,
Mike.


* Re: [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY
  2016-11-03 10:15   ` Hillf Danton
@ 2016-11-03 17:33     ` Mike Kravetz
  2016-11-03 19:14       ` Mike Kravetz
  2016-11-04 16:35       ` Andrea Arcangeli
  0 siblings, 2 replies; 69+ messages in thread
From: Mike Kravetz @ 2016-11-03 17:33 UTC (permalink / raw)
  To: Hillf Danton, 'Andrea Arcangeli', 'Andrew Morton'
  Cc: linux-mm, 'Dr. David Alan Gilbert', 'Shaohua Li',
	'Pavel Emelyanov', 'Mike Rapoport'

On 11/03/2016 03:15 AM, Hillf Danton wrote:
> [out of topic] Cc list is edited to quiet a mail agent warning:
> -"Dr. David Alan Gilbert"@v2.random; " <dgilbert@redhat.com> 
> +"Dr. David Alan Gilbert" <dgilbert@redhat.com>
> -Pavel Emelyanov <xemul@parallels.com>"@v2.random
> +Pavel Emelyanov <xemul@parallels.com>
> -Michael Rapoport <RAPOPORT@il.ibm.com>
> +Mike Rapoport <rppt@linux.vnet.ibm.com>
> 
> 
> On Thursday, November 03, 2016 3:34 AM Andrea Arcangeli wrote: 
>> +
>> +#ifdef CONFIG_HUGETLB_PAGE
>> +/*
>> + * __mcopy_atomic processing for HUGETLB vmas.  Note that this routine is
>> + * called with mmap_sem held, it will release mmap_sem before returning.
>> + */
>> +static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>> +					      struct vm_area_struct *dst_vma,
>> +					      unsigned long dst_start,
>> +					      unsigned long src_start,
>> +					      unsigned long len,
>> +					      bool zeropage)
>> +{
>> +	ssize_t err;
>> +	pte_t *dst_pte;
>> +	unsigned long src_addr, dst_addr;
>> +	long copied;
>> +	struct page *page;
>> +	struct hstate *h;
>> +	unsigned long vma_hpagesize;
>> +	pgoff_t idx;
>> +	u32 hash;
>> +	struct address_space *mapping;
>> +
>> +	/*
>> +	 * There is no default zero huge page for all huge page sizes as
>> +	 * supported by hugetlb.  A PMD_SIZE huge pages may exist as used
>> +	 * by THP.  Since we can not reliably insert a zero page, this
>> +	 * feature is not supported.
>> +	 */
>> +	if (zeropage)
>> +		return -EINVAL;
> 
> Release mmap_sem before return?
> 
>> +
>> +	src_addr = src_start;
>> +	dst_addr = dst_start;
>> +	copied = 0;
>> +	page = NULL;
>> +	vma_hpagesize = vma_kernel_pagesize(dst_vma);
>> +
>> +retry:
>> +	/*
>> +	 * On routine entry dst_vma is set.  If we had to drop mmap_sem and
>> +	 * retry, dst_vma will be set to NULL and we must lookup again.
>> +	 */
>> +	err = -EINVAL;
>> +	if (!dst_vma) {
>> +		dst_vma = find_vma(dst_mm, dst_start);
> 
> In case of retry, s/dst_start/dst_addr/?
> And check if we find a valid vma?
> 
>> @@ -182,6 +355,13 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
>>  		goto out_unlock;
>>
>>  	/*
>> +	 * If this is a HUGETLB vma, pass off to appropriate routine
>> +	 */
>> +	if (dst_vma->vm_flags & VM_HUGETLB)
>> +		return  __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
>> +						src_start, len, false);
> 
> Use is_vm_hugetlb_page()? 
> 
> 

Thanks Hillf, all valid points.  I will create another version of
this patch.

-- 
Mike Kravetz


* Re: [PATCH 11/33] userfaultfd: non-cooperative: Add mremap() event
  2016-11-03  7:41   ` Hillf Danton
@ 2016-11-03 17:52     ` Mike Rapoport
  2016-11-04 15:40     ` Mike Rapoport
  1 sibling, 0 replies; 69+ messages in thread
From: Mike Rapoport @ 2016-11-03 17:52 UTC (permalink / raw)
  To: Hillf Danton
  Cc: 'Andrea Arcangeli', 'Andrew Morton',
	linux-mm, Dr. David Alan Gilbert, Mike Kravetz, Shaohua Li,
	Pavel Emelyanov

(changed 'CC:
- Michael Rapoport <RAPOPORT@il.ibm.com>,
- Dr. David Alan Gilbert@v2.random,  <dgilbert@redhat.com>,
+ Dr. David Alan Gilbert  <dgilbert@redhat.com>,
- Pavel Emelyanov <xemul@parallels.com>@v2.random
+ Pavel Emelyanov <xemul@virtuozzo.com>
)

On Thu, Nov 03, 2016 at 03:41:15PM +0800, Hillf Danton wrote:
> On Thursday, November 03, 2016 3:34 AM Andrea Arcangeli wrote:
> > @@ -576,7 +581,8 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
> >  			goto out;
> >  		}
> > 
> > -		ret = move_vma(vma, addr, old_len, new_len, new_addr, &locked);
> > +		ret = move_vma(vma, addr, old_len, new_len, new_addr,
> > +			       &locked, &uf);
> >  	}
> >  out:
> >  	if (offset_in_page(ret)) {
> > @@ -586,5 +592,6 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
> >  	up_write(&current->mm->mmap_sem);
> >  	if (locked && new_len > old_len)
> >  		mm_populate(new_addr + old_len, new_len - old_len);
> > +	mremap_userfaultfd_complete(uf, addr, new_addr, old_len);
> 
> nit: s/uf/&uf/

Thanks, will fix.

> 
> >  	return ret;
> >  }
> > 
> 

--
Sincerely yours,
Mike.


* Re: [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY
  2016-11-03 17:33     ` Mike Kravetz
@ 2016-11-03 19:14       ` Mike Kravetz
  2016-11-04  6:43         ` Hillf Danton
  2016-11-04 19:36         ` Andrea Arcangeli
  2016-11-04 16:35       ` Andrea Arcangeli
  1 sibling, 2 replies; 69+ messages in thread
From: Mike Kravetz @ 2016-11-03 19:14 UTC (permalink / raw)
  To: Hillf Danton, 'Andrea Arcangeli', 'Andrew Morton'
  Cc: linux-mm, 'Dr. David Alan Gilbert', 'Shaohua Li',
	'Pavel Emelyanov', 'Mike Rapoport'

On 11/03/2016 10:33 AM, Mike Kravetz wrote:
> On 11/03/2016 03:15 AM, Hillf Danton wrote:
>> [out of topic] Cc list is edited to quiet a mail agent warning:
>> -"Dr. David Alan Gilbert"@v2.random; " <dgilbert@redhat.com> 
>> +"Dr. David Alan Gilbert" <dgilbert@redhat.com>
>> -Pavel Emelyanov <xemul@parallels.com>"@v2.random
>> +Pavel Emelyanov <xemul@parallels.com>
>> -Michael Rapoport <RAPOPORT@il.ibm.com>
>> +Mike Rapoport <rppt@linux.vnet.ibm.com>
>>
>>
>> On Thursday, November 03, 2016 3:34 AM Andrea Arcangeli wrote: 
>>> +
>>> +#ifdef CONFIG_HUGETLB_PAGE
>>> +/*
>>> + * __mcopy_atomic processing for HUGETLB vmas.  Note that this routine is
>>> + * called with mmap_sem held, it will release mmap_sem before returning.
>>> + */
>>> +static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>>> +					      struct vm_area_struct *dst_vma,
>>> +					      unsigned long dst_start,
>>> +					      unsigned long src_start,
>>> +					      unsigned long len,
>>> +					      bool zeropage)
>>> +{
>>> +	ssize_t err;
>>> +	pte_t *dst_pte;
>>> +	unsigned long src_addr, dst_addr;
>>> +	long copied;
>>> +	struct page *page;
>>> +	struct hstate *h;
>>> +	unsigned long vma_hpagesize;
>>> +	pgoff_t idx;
>>> +	u32 hash;
>>> +	struct address_space *mapping;
>>> +
>>> +	/*
>>> +	 * There is no default zero huge page for all huge page sizes as
>>> +	 * supported by hugetlb.  A PMD_SIZE huge pages may exist as used
>>> +	 * by THP.  Since we can not reliably insert a zero page, this
>>> +	 * feature is not supported.
>>> +	 */
>>> +	if (zeropage)
>>> +		return -EINVAL;
>>
>> Release mmap_sem before return?
>>
>>> +
>>> +	src_addr = src_start;
>>> +	dst_addr = dst_start;
>>> +	copied = 0;
>>> +	page = NULL;
>>> +	vma_hpagesize = vma_kernel_pagesize(dst_vma);
>>> +
>>> +retry:
>>> +	/*
>>> +	 * On routine entry dst_vma is set.  If we had to drop mmap_sem and
>>> +	 * retry, dst_vma will be set to NULL and we must lookup again.
>>> +	 */
>>> +	err = -EINVAL;
>>> +	if (!dst_vma) {
>>> +		dst_vma = find_vma(dst_mm, dst_start);
>>
>> In case of retry, s/dst_start/dst_addr/?
>> And check if we find a valid vma?
>>
>>> @@ -182,6 +355,13 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
>>>  		goto out_unlock;
>>>
>>>  	/*
>>> +	 * If this is a HUGETLB vma, pass off to appropriate routine
>>> +	 */
>>> +	if (dst_vma->vm_flags & VM_HUGETLB)
>>> +		return  __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
>>> +						src_start, len, false);
>>
>> Use is_vm_hugetlb_page()? 
>>
>>
> 
> Thanks Hillf, all valid points.  I will create another version of
> this patch.

Below is an updated patch addressing Hillf's comments.  Tested with error
injection code to hit the retry path.

Andrea, let me know if you prefer a delta from the original patch.

From: Mike Kravetz <mike.kravetz@oracle.com>

userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY

__mcopy_atomic_hugetlb performs the UFFDIO_COPY operation for huge
pages.  It is based on the existing __mcopy_atomic routine for normal
pages.  Unlike normal pages, there is no huge page support for the
UFFDIO_ZEROPAGE operation.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/userfaultfd.c | 186 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 186 insertions(+)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 9c2ed70..e01d013 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -14,6 +14,8 @@
 #include <linux/swapops.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/mmu_notifier.h>
+#include <linux/hugetlb.h>
+#include <linux/pagemap.h>
 #include <asm/tlbflush.h>
 #include "internal.h"

@@ -139,6 +141,183 @@ static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
 	return pmd;
 }

+
+#ifdef CONFIG_HUGETLB_PAGE
+/*
+ * __mcopy_atomic processing for HUGETLB vmas.  Note that this routine is
+ * called with mmap_sem held, it will release mmap_sem before returning.
+ */
+static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
+					      struct vm_area_struct *dst_vma,
+					      unsigned long dst_start,
+					      unsigned long src_start,
+					      unsigned long len,
+					      bool zeropage)
+{
+	ssize_t err;
+	pte_t *dst_pte;
+	unsigned long src_addr, dst_addr;
+	long copied;
+	struct page *page;
+	struct hstate *h;
+	unsigned long vma_hpagesize;
+	pgoff_t idx;
+	u32 hash;
+	struct address_space *mapping;
+
+	/*
+	 * There is no default zero huge page for all huge page sizes as
+	 * supported by hugetlb.  A PMD_SIZE huge pages may exist as used
+	 * by THP.  Since we can not reliably insert a zero page, this
+	 * feature is not supported.
+	 */
+	if (zeropage) {
+		up_read(&dst_mm->mmap_sem);
+		return -EINVAL;
+	}
+
+	src_addr = src_start;
+	dst_addr = dst_start;
+	copied = 0;
+	page = NULL;
+	vma_hpagesize = vma_kernel_pagesize(dst_vma);
+
+retry:
+	/*
+	 * On routine entry dst_vma is set.  If we had to drop mmap_sem and
+	 * retry, dst_vma will be set to NULL and we must lookup again.
+	 */
+	err = -EINVAL;
+	if (!dst_vma) {
+		/* lookup dst_addr as we may have copied some pages */
+		dst_vma = find_vma(dst_mm, dst_addr);
+		if (!dst_vma || !is_vm_hugetlb_page(dst_vma))
+			goto out_unlock;
+
+		vma_hpagesize = vma_kernel_pagesize(dst_vma);
+
+		/*
+		 * Make sure the vma is not shared, that the remaining dst
+		 * range is both valid and fully within a single existing vma.
+		 */
+		if (dst_vma->vm_flags & VM_SHARED)
+			goto out_unlock;
+		if (dst_addr < dst_vma->vm_start ||
+		    dst_addr + len - (copied * vma_hpagesize) > dst_vma->vm_end)
+			goto out_unlock;
+	}
+
+	/*
+	 * Validate alignment based on huge page size
+	 */
+	if (dst_addr & (vma_hpagesize - 1) || len & (vma_hpagesize - 1))
+		goto out_unlock;
+
+	/*
+	 * Only allow __mcopy_atomic_hugetlb on userfaultfd registered ranges.
+	 */
+	if (!dst_vma->vm_userfaultfd_ctx.ctx)
+		goto out_unlock;
+
+	/*
+	 * Ensure the dst_vma has a anon_vma.
+	 */
+	err = -ENOMEM;
+	if (unlikely(anon_vma_prepare(dst_vma)))
+		goto out_unlock;
+
+	h = hstate_vma(dst_vma);
+
+	while (src_addr < src_start + len) {
+		pte_t dst_pteval;
+
+		BUG_ON(dst_addr >= dst_start + len);
+		dst_addr &= huge_page_mask(h);
+
+		/*
+		 * Serialize via hugetlb_fault_mutex
+		 */
+		idx = linear_page_index(dst_vma, dst_addr);
+		mapping = dst_vma->vm_file->f_mapping;
+		hash = hugetlb_fault_mutex_hash(h, dst_mm, dst_vma, mapping,
+								idx, dst_addr);
+		mutex_lock(&hugetlb_fault_mutex_table[hash]);
+
+		err = -ENOMEM;
+		dst_pte = huge_pte_alloc(dst_mm, dst_addr, huge_page_size(h));
+		if (!dst_pte) {
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			goto out_unlock;
+		}
+
+		err = -EEXIST;
+		dst_pteval = huge_ptep_get(dst_pte);
+		if (!huge_pte_none(dst_pteval)) {
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			goto out_unlock;
+		}
+
+		err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
+						dst_addr, src_addr, &page);
+
+		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+
+		cond_resched();
+
+		if (unlikely(err == -EFAULT)) {
+			up_read(&dst_mm->mmap_sem);
+			BUG_ON(!page);
+
+			err = copy_huge_page_from_user(page,
+						(const void __user *)src_addr,
+						pages_per_huge_page(h));
+			if (unlikely(err)) {
+				err = -EFAULT;
+				goto out;
+			}
+			down_read(&dst_mm->mmap_sem);
+
+			dst_vma = NULL;
+			goto retry;
+		} else
+			BUG_ON(page);
+
+		if (!err) {
+			dst_addr += vma_hpagesize;
+			src_addr += vma_hpagesize;
+			copied += vma_hpagesize;
+
+			if (fatal_signal_pending(current))
+				err = -EINTR;
+		}
+		if (err)
+			break;
+	}
+
+out_unlock:
+	up_read(&dst_mm->mmap_sem);
+out:
+	if (page)
+		put_page(page);
+	BUG_ON(copied < 0);
+	BUG_ON(err > 0);
+	BUG_ON(!copied && !err);
+	return copied ? copied : err;
+}
+#else /* !CONFIG_HUGETLB_PAGE */
+static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
+					      struct vm_area_struct *dst_vma,
+					      unsigned long dst_start,
+					      unsigned long src_start,
+					      unsigned long len,
+					      bool zeropage)
+{
+	up_read(&dst_mm->mmap_sem);	/* HUGETLB not configured */
+	BUG();
+	return -EINVAL;
+}
+#endif /* CONFIG_HUGETLB_PAGE */
+
 static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 					      unsigned long dst_start,
 					      unsigned long src_start,
@@ -182,6 +361,13 @@ retry:
 		goto out_unlock;

 	/*
+	 * If this is a HUGETLB vma, pass off to appropriate routine
+	 */
+	if (is_vm_hugetlb_page(dst_vma))
+		return  __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
+						src_start, len, false);
+
+	/*
 	 * Be strict and only allow __mcopy_atomic on userfaultfd
 	 * registered ranges to prevent userland errors going
 	 * unnoticed. As far as the VM consistency is concerned, it
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY
  2016-11-03 19:14       ` Mike Kravetz
@ 2016-11-04  6:43         ` Hillf Danton
  2016-11-04 19:36         ` Andrea Arcangeli
  1 sibling, 0 replies; 69+ messages in thread
From: Hillf Danton @ 2016-11-04  6:43 UTC (permalink / raw)
  To: 'Mike Kravetz', 'Andrea Arcangeli',
	'Andrew Morton'
  Cc: linux-mm, 'Dr. David Alan Gilbert', 'Shaohua Li',
	'Pavel Emelyanov', 'Mike Rapoport'

On Friday, November 04, 2016 3:14 AM, Mike Kravetz wrote: 
> 
> Andrea, let me know if you prefer a delta from original patch.
> 
> From: Mike Kravetz <mike.kravetz@oracle.com>
> 
> userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY
> 
> __mcopy_atomic_hugetlb performs the UFFDIO_COPY operation for huge
> pages.  It is based on the existing __mcopy_atomic routine for normal
> pages.  Unlike normal pages, there is no huge page support for the
> UFFDIO_ZEROPAGE operation.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

>  mm/userfaultfd.c | 186 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 186 insertions(+)
> 
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 9c2ed70..e01d013 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -14,6 +14,8 @@
>  #include <linux/swapops.h>
>  #include <linux/userfaultfd_k.h>
>  #include <linux/mmu_notifier.h>
> +#include <linux/hugetlb.h>
> +#include <linux/pagemap.h>
>  #include <asm/tlbflush.h>
>  #include "internal.h"
> 
> @@ -139,6 +141,183 @@ static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
>  	return pmd;
>  }
> 
> +
> +#ifdef CONFIG_HUGETLB_PAGE
> +/*
> + * __mcopy_atomic processing for HUGETLB vmas.  Note that this routine is
> + * called with mmap_sem held, it will release mmap_sem before returning.
> + */
> +static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> +					      struct vm_area_struct *dst_vma,
> +					      unsigned long dst_start,
> +					      unsigned long src_start,
> +					      unsigned long len,
> +					      bool zeropage)
> +{
> +	ssize_t err;
> +	pte_t *dst_pte;
> +	unsigned long src_addr, dst_addr;
> +	long copied;
> +	struct page *page;
> +	struct hstate *h;
> +	unsigned long vma_hpagesize;
> +	pgoff_t idx;
> +	u32 hash;
> +	struct address_space *mapping;
> +
> +	/*
> +	 * There is no default zero huge page for all huge page sizes as
> +	 * supported by hugetlb.  A PMD_SIZE huge pages may exist as used
> +	 * by THP.  Since we can not reliably insert a zero page, this
> +	 * feature is not supported.
> +	 */
> +	if (zeropage) {
> +		up_read(&dst_mm->mmap_sem);
> +		return -EINVAL;
> +	}
> +
> +	src_addr = src_start;
> +	dst_addr = dst_start;
> +	copied = 0;
> +	page = NULL;
> +	vma_hpagesize = vma_kernel_pagesize(dst_vma);
> +
> +retry:
> +	/*
> +	 * On routine entry dst_vma is set.  If we had to drop mmap_sem and
> +	 * retry, dst_vma will be set to NULL and we must lookup again.
> +	 */
> +	err = -EINVAL;
> +	if (!dst_vma) {
> +		/* lookup dst_addr as we may have copied some pages */
> +		dst_vma = find_vma(dst_mm, dst_addr);
> +		if (!dst_vma || !is_vm_hugetlb_page(dst_vma))
> +			goto out_unlock;
> +
> +		vma_hpagesize = vma_kernel_pagesize(dst_vma);
> +
> +		/*
> +		 * Make sure the vma is not shared, that the remaining dst
> +		 * range is both valid and fully within a single existing vma.
> +		 */
> +		if (dst_vma->vm_flags & VM_SHARED)
> +			goto out_unlock;
> +		if (dst_addr < dst_vma->vm_start ||
> +		    dst_addr + len - (copied * vma_hpagesize) > dst_vma->vm_end)
> +			goto out_unlock;
> +	}
> +
> +	/*
> +	 * Validate alignment based on huge page size
> +	 */
> +	if (dst_addr & (vma_hpagesize - 1) || len & (vma_hpagesize - 1))
> +		goto out_unlock;
> +
> +	/*
> +	 * Only allow __mcopy_atomic_hugetlb on userfaultfd registered ranges.
> +	 */
> +	if (!dst_vma->vm_userfaultfd_ctx.ctx)
> +		goto out_unlock;
> +
> +	/*
> +	 * Ensure the dst_vma has a anon_vma.
> +	 */
> +	err = -ENOMEM;
> +	if (unlikely(anon_vma_prepare(dst_vma)))
> +		goto out_unlock;
> +
> +	h = hstate_vma(dst_vma);
> +
> +	while (src_addr < src_start + len) {
> +		pte_t dst_pteval;
> +
> +		BUG_ON(dst_addr >= dst_start + len);
> +		dst_addr &= huge_page_mask(h);
> +
> +		/*
> +		 * Serialize via hugetlb_fault_mutex
> +		 */
> +		idx = linear_page_index(dst_vma, dst_addr);
> +		mapping = dst_vma->vm_file->f_mapping;
> +		hash = hugetlb_fault_mutex_hash(h, dst_mm, dst_vma, mapping,
> +								idx, dst_addr);
> +		mutex_lock(&hugetlb_fault_mutex_table[hash]);
> +
> +		err = -ENOMEM;
> +		dst_pte = huge_pte_alloc(dst_mm, dst_addr, huge_page_size(h));
> +		if (!dst_pte) {
> +			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> +			goto out_unlock;
> +		}
> +
> +		err = -EEXIST;
> +		dst_pteval = huge_ptep_get(dst_pte);
> +		if (!huge_pte_none(dst_pteval)) {
> +			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> +			goto out_unlock;
> +		}
> +
> +		err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
> +						dst_addr, src_addr, &page);
> +
> +		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> +
> +		cond_resched();
> +
> +		if (unlikely(err == -EFAULT)) {
> +			up_read(&dst_mm->mmap_sem);
> +			BUG_ON(!page);
> +
> +			err = copy_huge_page_from_user(page,
> +						(const void __user *)src_addr,
> +						pages_per_huge_page(h));
> +			if (unlikely(err)) {
> +				err = -EFAULT;
> +				goto out;
> +			}
> +			down_read(&dst_mm->mmap_sem);
> +
> +			dst_vma = NULL;
> +			goto retry;
> +		} else
> +			BUG_ON(page);
> +
> +		if (!err) {
> +			dst_addr += vma_hpagesize;
> +			src_addr += vma_hpagesize;
> +			copied += vma_hpagesize;
> +
> +			if (fatal_signal_pending(current))
> +				err = -EINTR;
> +		}
> +		if (err)
> +			break;
> +	}
> +
> +out_unlock:
> +	up_read(&dst_mm->mmap_sem);
> +out:
> +	if (page)
> +		put_page(page);
> +	BUG_ON(copied < 0);
> +	BUG_ON(err > 0);
> +	BUG_ON(!copied && !err);
> +	return copied ? copied : err;
> +}
> +#else /* !CONFIG_HUGETLB_PAGE */
> +static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> +					      struct vm_area_struct *dst_vma,
> +					      unsigned long dst_start,
> +					      unsigned long src_start,
> +					      unsigned long len,
> +					      bool zeropage)
> +{
> +	up_read(&dst_mm->mmap_sem);	/* HUGETLB not configured */
> +	BUG();
> +	return -EINVAL;
> +}
> +#endif /* CONFIG_HUGETLB_PAGE */
> +
>  static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
>  					      unsigned long dst_start,
>  					      unsigned long src_start,
> @@ -182,6 +361,13 @@ retry:
>  		goto out_unlock;
> 
>  	/*
> +	 * If this is a HUGETLB vma, pass off to appropriate routine
> +	 */
> +	if (is_vm_hugetlb_page(dst_vma))
> +		return  __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
> +						src_start, len, false);
> +
> +	/*
>  	 * Be strict and only allow __mcopy_atomic on userfaultfd
>  	 * registered ranges to prevent userland errors going
>  	 * unnoticed. As far as the VM consistency is concerned, it
> --
> 2.7.4


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 16/33] userfaultfd: hugetlbfs: add userfaultfd hugetlb hook
  2016-11-02 19:33 ` [PATCH 16/33] userfaultfd: hugetlbfs: add userfaultfd hugetlb hook Andrea Arcangeli
@ 2016-11-04  7:02   ` Hillf Danton
  0 siblings, 0 replies; 69+ messages in thread
From: Hillf Danton @ 2016-11-04  7:02 UTC (permalink / raw)
  To: 'Andrea Arcangeli', 'Andrew Morton', Mike Kravetz
  Cc: linux-mm, 'Dr. David Alan Gilbert', 'Shaohua Li',
	'Pavel Emelyanov', 'Mike Rapoport'

> 
> From: Mike Kravetz <mike.kravetz@oracle.com>
> 
> When processing a hugetlb fault for no page present, check the vma to
> determine if faults are to be handled via userfaultfd.  If so, drop the
> hugetlb_fault_mutex and call handle_userfault().
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> 

>  mm/hugetlb.c | 33 +++++++++++++++++++++++++++++++++
>  1 file changed, 33 insertions(+)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index baf7fd4..7247f8c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -32,6 +32,7 @@
>  #include <linux/hugetlb.h>
>  #include <linux/hugetlb_cgroup.h>
>  #include <linux/node.h>
> +#include <linux/userfaultfd_k.h>
>  #include "internal.h"
> 
>  int hugepages_treat_as_movable;
> @@ -3589,6 +3590,38 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		size = i_size_read(mapping->host) >> huge_page_shift(h);
>  		if (idx >= size)
>  			goto out;
> +
> +		/*
> +		 * Check for page in userfault range
> +		 */
> +		if (userfaultfd_missing(vma)) {
> +			u32 hash;
> +			struct fault_env fe = {
> +				.vma = vma,
> +				.address = address,
> +				.flags = flags,
> +				/*
> +				 * Hard to debug if it ends up being
> +				 * used by a callee that assumes
> +				 * something about the other
> +				 * uninitialized fields... same as in
> +				 * memory.c
> +				 */
> +			};
> +
> +			/*
> +			 * hugetlb_fault_mutex must be dropped before
> +			 * handling userfault.  Reacquire after handling
> +			 * fault to make calling code simpler.
> +			 */
> +			hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping,
> +							idx, address);
> +			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> +			ret = handle_userfault(&fe, VM_UFFD_MISSING);
> +			mutex_lock(&hugetlb_fault_mutex_table[hash]);
> +			goto out;
> +		}
> +
>  		page = alloc_huge_page(vma, address, 0);
>  		if (IS_ERR(page)) {
>  			ret = PTR_ERR(page);
> 


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 20/33] userfaultfd: introduce vma_can_userfault
  2016-11-02 19:33 ` [PATCH 20/33] userfaultfd: introduce vma_can_userfault Andrea Arcangeli
@ 2016-11-04  7:39   ` Hillf Danton
  0 siblings, 0 replies; 69+ messages in thread
From: Hillf Danton @ 2016-11-04  7:39 UTC (permalink / raw)
  To: 'Andrea Arcangeli', 'Andrew Morton'
  Cc: linux-mm, 'Mike Kravetz',
	'Dr. David Alan Gilbert', 'Shaohua Li',
	'Pavel Emelyanov', 'Mike Rapoport'

> 
> From: Mike Rapoport <rppt@linux.vnet.ibm.com>
> 
> Check whether a VMA can be used with userfault in more compact way
> 
> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> 

>  fs/userfaultfd.c | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 9552734..387fe77 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1060,6 +1060,11 @@ static __always_inline int validate_range(struct mm_struct *mm,
>  	return 0;
>  }
> 
> +static inline bool vma_can_userfault(struct vm_area_struct *vma)
> +{
> +	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma);
> +}
> +
>  static int userfaultfd_register(struct userfaultfd_ctx *ctx,
>  				unsigned long arg)
>  {
> @@ -1149,7 +1154,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> 
>  		/* check not compatible vmas */
>  		ret = -EINVAL;
> -		if (!vma_is_anonymous(cur) && !is_vm_hugetlb_page(cur))
> +		if (!vma_can_userfault(cur))
>  			goto out_unlock;
>  		/*
>  		 * If this vma contains ending address, and huge pages
> @@ -1193,7 +1198,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
>  	do {
>  		cond_resched();
> 
> -		BUG_ON(!vma_is_anonymous(vma) && !is_vm_hugetlb_page(vma));
> +		BUG_ON(!vma_can_userfault(vma));
>  		BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
>  		       vma->vm_userfaultfd_ctx.ctx != ctx);
> 
> @@ -1331,7 +1336,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
>  		 * provides for more strict behavior to notice
>  		 * unregistration errors.
>  		 */
> -		if (!vma_is_anonymous(cur) && !is_vm_hugetlb_page(cur))
> +		if (!vma_can_userfault(cur))
>  			goto out_unlock;
> 
>  		found = true;
> @@ -1345,7 +1350,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
>  	do {
>  		cond_resched();
> 
> -		BUG_ON(!vma_is_anonymous(vma) && !is_vm_hugetlb_page(vma));
> +		BUG_ON(!vma_can_userfault(vma));
> 
>  		/*
>  		 * Nothing to do: this vma is already registered into this
> 


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 25/33] userfaultfd: shmem: add userfaultfd hook for shared memory faults
  2016-11-02 19:33 ` [PATCH 25/33] userfaultfd: shmem: add userfaultfd hook for shared memory faults Andrea Arcangeli
@ 2016-11-04  8:59   ` Hillf Danton
  2016-11-04 14:53     ` Mike Rapoport
  2016-11-04 15:44     ` Mike Rapoport
  0 siblings, 2 replies; 69+ messages in thread
From: Hillf Danton @ 2016-11-04  8:59 UTC (permalink / raw)
  To: 'Andrea Arcangeli', 'Andrew Morton'
  Cc: linux-mm, 'Mike Kravetz',
	'Dr. David Alan Gilbert', 'Shaohua Li',
	'Pavel Emelyanov', 'Mike Rapoport'

> @@ -1542,7 +1544,7 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp,
>   */
>  static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>  	struct page **pagep, enum sgp_type sgp, gfp_t gfp,
> -	struct mm_struct *fault_mm, int *fault_type)
> +	struct vm_area_struct *vma, struct vm_fault *vmf, int *fault_type)
>  {
>  	struct address_space *mapping = inode->i_mapping;
>  	struct shmem_inode_info *info;
> @@ -1597,7 +1599,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>  	 */
>  	info = SHMEM_I(inode);
>  	sbinfo = SHMEM_SB(inode->i_sb);
> -	charge_mm = fault_mm ? : current->mm;
> +	charge_mm = vma ? vma->vm_mm : current->mm;
> 
>  	if (swap.val) {
>  		/* Look it up and read it in.. */
> @@ -1607,7 +1609,8 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>  			if (fault_type) {
>  				*fault_type |= VM_FAULT_MAJOR;
>  				count_vm_event(PGMAJFAULT);
> -				mem_cgroup_count_vm_event(fault_mm, PGMAJFAULT);
> +				mem_cgroup_count_vm_event(vma->vm_mm,
> +							  PGMAJFAULT);
Seems vma is not valid in some cases.

>  			}
>  			/* Here we actually start the io */
>  			page = shmem_swapin(swap, gfp, info, index);


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 25/33] userfaultfd: shmem: add userfaultfd hook for shared memory faults
  2016-11-04  8:59   ` Hillf Danton
@ 2016-11-04 14:53     ` Mike Rapoport
  2016-11-04 15:44     ` Mike Rapoport
  1 sibling, 0 replies; 69+ messages in thread
From: Mike Rapoport @ 2016-11-04 14:53 UTC (permalink / raw)
  To: Hillf Danton
  Cc: 'Andrea Arcangeli', 'Andrew Morton',
	linux-mm, 'Mike Kravetz',
	'Dr. David Alan Gilbert', 'Shaohua Li',
	'Pavel Emelyanov'

On Fri, Nov 04, 2016 at 04:59:32PM +0800, Hillf Danton wrote:
> > @@ -1542,7 +1544,7 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp,
> >   */
> >  static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
> >  	struct page **pagep, enum sgp_type sgp, gfp_t gfp,
> > -	struct mm_struct *fault_mm, int *fault_type)
> > +	struct vm_area_struct *vma, struct vm_fault *vmf, int *fault_type)
> >  {
> >  	struct address_space *mapping = inode->i_mapping;
> >  	struct shmem_inode_info *info;
> > @@ -1597,7 +1599,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
> >  	 */
> >  	info = SHMEM_I(inode);
> >  	sbinfo = SHMEM_SB(inode->i_sb);
> > -	charge_mm = fault_mm ? : current->mm;
> > +	charge_mm = vma ? vma->vm_mm : current->mm;
> > 
> >  	if (swap.val) {
> >  		/* Look it up and read it in.. */
> > @@ -1607,7 +1609,8 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
> >  			if (fault_type) {
> >  				*fault_type |= VM_FAULT_MAJOR;
> >  				count_vm_event(PGMAJFAULT);
> > -				mem_cgroup_count_vm_event(fault_mm, PGMAJFAULT);
> > +				mem_cgroup_count_vm_event(vma->vm_mm,
> > +							  PGMAJFAULT);
> Seems vma is not valid in some cases.

Well, currently, when fault_type != NULL, the vma is valid. Still it would
be better to use charge_mm here.
Will repost soon.
 
> >  			}
> >  			/* Here we actually start the io */
> >  			page = shmem_swapin(swap, gfp, info, index);
 
--
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 11/33] userfaultfd: non-cooperative: Add mremap() event
  2016-11-03  7:41   ` Hillf Danton
  2016-11-03 17:52     ` Mike Rapoport
@ 2016-11-04 15:40     ` Mike Rapoport
  1 sibling, 0 replies; 69+ messages in thread
From: Mike Rapoport @ 2016-11-04 15:40 UTC (permalink / raw)
  To: Hillf Danton, 'Andrea Arcangeli', 'Andrew Morton'
  Cc: linux-mm, Dr. David Alan Gilbert, Mike Kravetz, Shaohua Li,
	Pavel Emelyanov

On Thu, Nov 03, 2016 at 03:41:15PM +0800, Hillf Danton wrote:
> On Thursday, November 03, 2016 3:34 AM Andrea Arcangeli wrote:
> > @@ -576,7 +581,8 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
> >  			goto out;
> >  		}
> > 
> > -		ret = move_vma(vma, addr, old_len, new_len, new_addr, &locked);
> > +		ret = move_vma(vma, addr, old_len, new_len, new_addr,
> > +			       &locked, &uf);
> >  	}
> >  out:
> >  	if (offset_in_page(ret)) {
> > @@ -586,5 +592,6 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
> >  	up_write(&current->mm->mmap_sem);
> >  	if (locked && new_len > old_len)
> >  		mm_populate(new_addr + old_len, new_len - old_len);
> > +	mremap_userfaultfd_complete(uf, addr, new_addr, old_len);
> 
> nit: s/uf/&uf/
> 
> >  	return ret;
> >  }
> > 

Below is the updated patch.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 12/33] userfaultfd: non-cooperative: Add madvise() event for MADV_DONTNEED request
  2016-11-03  8:01   ` Hillf Danton
  2016-11-03 17:24     ` Mike Rapoport
@ 2016-11-04 15:42     ` Mike Rapoport
  1 sibling, 0 replies; 69+ messages in thread
From: Mike Rapoport @ 2016-11-04 15:42 UTC (permalink / raw)
  To: Hillf Danton, 'Andrea Arcangeli', 'Andrew Morton'
  Cc: linux-mm, 'Michael Rapoport',
	Dr.David.Alan.Gilbert, dgilbert, Mike Kravetz, Shaohua Li,
	Pavel Emelyanov,

On Thu, Nov 03, 2016 at 04:01:12PM +0800, Hillf Danton wrote:
> On Thursday, November 03, 2016 3:34 AM Andrea Arcangeli wrote:
> > +void madvise_userfault_dontneed(struct vm_area_struct *vma,
> > +				struct vm_area_struct **prev,
> > +				unsigned long start, unsigned long end)
> > +{
> > +	struct userfaultfd_ctx *ctx;
> > +	struct userfaultfd_wait_queue ewq;
> > +
> > +	ctx = vma->vm_userfaultfd_ctx.ctx;
> > +	if (!ctx || !(ctx->features & UFFD_FEATURE_EVENT_MADVDONTNEED))
> > +		return;
> > +
> > +	userfaultfd_ctx_get(ctx);
> > +	*prev = NULL; /* We wait for ACK w/o the mmap semaphore */
> > +	up_read(&vma->vm_mm->mmap_sem);
> > +
> > +	msg_init(&ewq.msg);
> > +
> > +	ewq.msg.event = UFFD_EVENT_MADVDONTNEED;
> > +	ewq.msg.arg.madv_dn.start = start;
> > +	ewq.msg.arg.madv_dn.end = end;
> > +
> > +	userfaultfd_event_wait_completion(ctx, &ewq);
> > +
> > +	down_read(&vma->vm_mm->mmap_sem);
> 
> After napping with mmap_sem released, is vma still valid?
> 
> > +}
> > +

Below is the updated patch that accesses mmap_sem via a local
reference to the mm_struct rather than via the vma.
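
Sketching just the pattern the fix follows (this is not the full
patch, only the shape of it): the mm reference is taken while the vma
is still known to be valid, and the vma is never dereferenced again
after mmap_sem is dropped:

	struct mm_struct *mm = vma->vm_mm;	/* taken while vma is valid */
	...
	up_read(&mm->mmap_sem);
	/* vma may be unmapped/reused from here on, only mm is used */
	userfaultfd_event_wait_completion(ctx, &ewq);
	down_read(&mm->mmap_sem);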

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 25/33] userfaultfd: shmem: add userfaultfd hook for shared memory faults
  2016-11-04  8:59   ` Hillf Danton
  2016-11-04 14:53     ` Mike Rapoport
@ 2016-11-04 15:44     ` Mike Rapoport
  2016-11-04 16:56       ` Andrea Arcangeli
  2016-11-18  0:37       ` Andrea Arcangeli
  1 sibling, 2 replies; 69+ messages in thread
From: Mike Rapoport @ 2016-11-04 15:44 UTC (permalink / raw)
  To: Hillf Danton, 'Andrea Arcangeli', 'Andrew Morton'
  Cc: linux-mm, 'Mike Kravetz',
	'Dr. David Alan Gilbert', 'Shaohua Li',
	'Pavel Emelyanov'

On Fri, Nov 04, 2016 at 04:59:32PM +0800, Hillf Danton wrote:
> > @@ -1542,7 +1544,7 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp,
> >   */
> >  static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
> >  	struct page **pagep, enum sgp_type sgp, gfp_t gfp,
> > -	struct mm_struct *fault_mm, int *fault_type)
> > +	struct vm_area_struct *vma, struct vm_fault *vmf, int *fault_type)
> >  {
> >  	struct address_space *mapping = inode->i_mapping;
> >  	struct shmem_inode_info *info;
> > @@ -1597,7 +1599,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
> >  	 */
> >  	info = SHMEM_I(inode);
> >  	sbinfo = SHMEM_SB(inode->i_sb);
> > -	charge_mm = fault_mm ? : current->mm;
> > +	charge_mm = vma ? vma->vm_mm : current->mm;
> > 
> >  	if (swap.val) {
> >  		/* Look it up and read it in.. */
> > @@ -1607,7 +1609,8 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
> >  			if (fault_type) {
> >  				*fault_type |= VM_FAULT_MAJOR;
> >  				count_vm_event(PGMAJFAULT);
> > -				mem_cgroup_count_vm_event(fault_mm, PGMAJFAULT);
> > +				mem_cgroup_count_vm_event(vma->vm_mm,
> > +							  PGMAJFAULT);
> Seems vma is not valid in some cases.
> 
> >  			}
> >  			/* Here we actually start the io */
> >  			page = shmem_swapin(swap, gfp, info, index);
> 

Below is the updated patch that uses charge_mm instead of the vma,
which might not be valid.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY
  2016-11-03 17:33     ` Mike Kravetz
  2016-11-03 19:14       ` Mike Kravetz
@ 2016-11-04 16:35       ` Andrea Arcangeli
  1 sibling, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-04 16:35 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Hillf Danton, 'Andrew Morton',
	linux-mm, 'Dr. David Alan Gilbert', 'Shaohua Li',
	'Pavel Emelyanov', 'Mike Rapoport'

On Thu, Nov 03, 2016 at 10:33:09AM -0700, Mike Kravetz wrote:
> On 11/03/2016 03:15 AM, Hillf Danton wrote:
> >> +	if (zeropage)
> >> +		return -EINVAL;
> > 
> > Release mmap_sem before return?

This shows we need to extend the selftest to execute UFFDIO_ZEROPAGE
on the tmpfs and hugetlbfs cases too, and verify it returns -EINVAL.
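
To make it concrete, a sketch of what the extra check could look like
(illustrative only, reusing the existing selftest conventions): after
registering a tmpfs or hugetlbfs backed area_dst, UFFDIO_ZEROPAGE is
expected to fail with -EINVAL reported back in the zeropage field.

	struct uffdio_zeropage uffdio_zeropage;

	uffdio_zeropage.range.start = (unsigned long) area_dst;
	uffdio_zeropage.range.len = page_size;
	uffdio_zeropage.mode = 0;
	if (!ioctl(uffd, UFFDIO_ZEROPAGE, &uffdio_zeropage) ||
	    uffdio_zeropage.zeropage != -EINVAL)
		fprintf(stderr, "UFFDIO_ZEROPAGE not -EINVAL\n"), exit(1);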

> >> +
> >> +	src_addr = src_start;
> >> +	dst_addr = dst_start;
> >> +	copied = 0;
> >> +	page = NULL;
> >> +	vma_hpagesize = vma_kernel_pagesize(dst_vma);
> >> +
> >> +retry:
> >> +	/*
> >> +	 * On routine entry dst_vma is set.  If we had to drop mmap_sem and
> >> +	 * retry, dst_vma will be set to NULL and we must lookup again.
> >> +	 */
> >> +	err = -EINVAL;
> >> +	if (!dst_vma) {
> >> +		dst_vma = find_vma(dst_mm, dst_start);
> > 
> > In case of retry, s/dst_start/dst_addr/?
> > And check if we find a valid vma?

I don't think that's needed. Yes intuitively if a munmap zaps the
start of the vma during the copy we could continue, but userfaultfd
generally is as strict as it can get.

This is why UFFDIO_COPY does not behave like mremap, which just wipes
whatever existed in the destination silently. UFFDIO_COPY returns
-EEXIST whenever something is already mapped there during a
UFFDIO_COPY.

When it's userland managing the faults, I think being more strict is
safer.

Running a copy concurrently with a munmap or any other vma mangling
leads to an undefined result. I think it's preferable to generate an
error to userland if it ever does an undefined operation, considering
the risk if something goes wrong here while userland is managing the
faults. Furthermore this keeps the code simpler.

This is also why the revalidation code then does:

		if (dst_start < dst_vma->vm_start ||
		    dst_start + len > dst_vma->vm_end)
			goto out_unlock;

and it pretends the vma is still there for the whole range being
copied.

So I tend to prefer the current version: letting it succeed silently,
while correct and valid in theory, in practice sounds worse than the
stricter behavior.

In any case if we change this for hugetlbfs, the non-hugetlbfs variant
should also be updated.
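
Just to illustrate the same point from the monitor side, here is a
sketch in the style of the selftest's copy_page() (the helper name is
made up, it's not code from this patchset): whatever negative value
comes back in uffdio_copy.copy, including -EEXIST when something is
already mapped in the destination, the monitor should treat it as a
hard error instead of silently continuing.

	static void strict_uffdio_copy(int ufd, unsigned long dst,
				       unsigned long src, unsigned long len)
	{
		struct uffdio_copy uffdio_copy;

		uffdio_copy.dst = dst;
		uffdio_copy.src = src;
		uffdio_copy.len = len;
		uffdio_copy.mode = 0;
		uffdio_copy.copy = 0;
		if (ioctl(ufd, UFFDIO_COPY, &uffdio_copy))
			/* real retval in uffdio_copy.copy: -EEXIST, -EINVAL, ... */
			fprintf(stderr, "UFFDIO_COPY error %Ld\n",
				uffdio_copy.copy), exit(1);
	}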

> >> @@ -182,6 +355,13 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
> >>  		goto out_unlock;
> >>
> >>  	/*
> >> +	 * If this is a HUGETLB vma, pass off to appropriate routine
> >> +	 */
> >> +	if (dst_vma->vm_flags & VM_HUGETLB)
> >> +		return  __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
> >> +						src_start, len, false);
> > 
> > Use is_vm_hugetlb_page()? 
> > 
> > 
> 
> Thanks Hillf, all valid points.  I will create another version of
> this patch.

Nice cleanup yes.

Thanks!
Andrea


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 12/33] userfaultfd: non-cooperative: Add madvise() event for MADV_DONTNEED requestg
  2016-11-03 17:24     ` Mike Rapoport
@ 2016-11-04 16:40       ` Andrea Arcangeli
  0 siblings, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-04 16:40 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Hillf Danton, Andrew Morton, linux-mm, Dr. David Alan Gilbert,
	Mike Kravetz, Shaohua Li, Pavel Emelyanov

On Thu, Nov 03, 2016 at 11:24:46AM -0600, Mike Rapoport wrote:
> (changed 'CC:
> - Michael Rapoport <RAPOPORT@il.ibm.com>,
> - Dr. David Alan Gilbert@v2.random,  <dgilbert@redhat.com>,
> + Dr. David Alan Gilbert  <dgilbert@redhat.com>,
> - Pavel Emelyanov <xemul@parallels.com>@v2.random
> + Pavel Emelyanov <xemul@virtuozzo.com>

Sorry for this mess, so it turns out git will crunch a non-rfc2822
compliant email address just fine, but postfix will not be happy and
it rewrites the header in a best-effort way. The email is still
delivered because send-email specifies the addresses that git can cope
with on the sendmail command line instead of using -t; that's why the
email is delivered but the header is garbled.

On the git list they're discussing if the parsing of the email
addresses can be made more strict to follow rfc2822; otherwise with
--dry-run things look ok, but then when you remove --dry-run you find
out the hard way that you left a trailing " in an email address...
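
For reference, the kind of damage is exactly what Hillf had to fix up
by hand earlier in the thread, e.g.:

	"Dr. David Alan Gilbert"@v2.random; " <dgilbert@redhat.com>	(mangled)
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>			(rfc2822)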

> On Thu, Nov 03, 2016 at 04:01:12PM +0800, Hillf Danton wrote:
> > On Thursday, November 03, 2016 3:34 AM Andrea Arcangeli wrote:
> > > +void madvise_userfault_dontneed(struct vm_area_struct *vma,
> > > +				struct vm_area_struct **prev,
> > > +				unsigned long start, unsigned long end)
> > > +{
> > > +	struct userfaultfd_ctx *ctx;
> > > +	struct userfaultfd_wait_queue ewq;
> > > +
> > > +	ctx = vma->vm_userfaultfd_ctx.ctx;
> > > +	if (!ctx || !(ctx->features & UFFD_FEATURE_EVENT_MADVDONTNEED))
> > > +		return;
> > > +
> > > +	userfaultfd_ctx_get(ctx);
> > > +	*prev = NULL; /* We wait for ACK w/o the mmap semaphore */
> > > +	up_read(&vma->vm_mm->mmap_sem);
> > > +
> > > +	msg_init(&ewq.msg);
> > > +
> > > +	ewq.msg.event = UFFD_EVENT_MADVDONTNEED;
> > > +	ewq.msg.arg.madv_dn.start = start;
> > > +	ewq.msg.arg.madv_dn.end = end;
> > > +
> > > +	userfaultfd_event_wait_completion(ctx, &ewq);
> > > +
> > > +	down_read(&vma->vm_mm->mmap_sem);
> > 
> > After napping with mmap_sem released, is vma still valid?

Wow, nice catch Hillf. There was zero chance to catch this at runtime:
we don't munmap the vma while the testcase runs, and even if we did
such a thing, to notice it the vma would need to be reused fast
enough. It was just a single instruction window for a pointer
dereference...

> You are right, vma may be invalid at that point. Thanks for spotting.
> 
> Andrea, how do you prefer the fix, incremental or the entire patch updated?

I'm applying your updated patch, the fix you sent is correct.

I will also move *prev = NULL to just after up_read: doing it before
up_read makes it look like it has to be done before releasing the
lock, which is not the case. Furthermore it's a micro-optimization for
scalability to do it after, but it won't make any runtime difference
of course.

Thanks,
Andrea


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 25/33] userfaultfd: shmem: add userfaultfd hook for shared memory faults
  2016-11-04 15:44     ` Mike Rapoport
@ 2016-11-04 16:56       ` Andrea Arcangeli
  2016-11-18  0:37       ` Andrea Arcangeli
  1 sibling, 0 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-04 16:56 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Hillf Danton, 'Andrew Morton',
	linux-mm, 'Mike Kravetz',
	'Dr. David Alan Gilbert', 'Shaohua Li',
	'Pavel Emelyanov'

On Fri, Nov 04, 2016 at 09:44:40AM -0600, Mike Rapoport wrote:
> Below is the updated patch that uses charge_mm instead of vma which might
> be not valid.

Like you said earlier, the vma couldn't be NULL if fault_type wasn't
NULL, but I applied it as a cleanup.
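
For reference, the cleanup boils down to this one-liner in
shmem_getpage_gfp(), the same hunk that is in the consolidated diff on
the PATCH 15 subthread:

	-				mem_cgroup_count_vm_event(vma->vm_mm,
	+				mem_cgroup_count_vm_event(charge_mm,
	 							  PGMAJFAULT);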

Thanks,
Andrea


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY
  2016-11-03 19:14       ` Mike Kravetz
  2016-11-04  6:43         ` Hillf Danton
@ 2016-11-04 19:36         ` Andrea Arcangeli
  2016-11-04 20:34           ` Mike Kravetz
  2016-11-08 21:06           ` Mike Kravetz
  1 sibling, 2 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-04 19:36 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Hillf Danton, 'Andrew Morton',
	linux-mm, 'Dr. David Alan Gilbert', 'Shaohua Li',
	'Pavel Emelyanov', 'Mike Rapoport'

On Thu, Nov 03, 2016 at 12:14:15PM -0700, Mike Kravetz wrote:
> +		/* lookup dst_addr as we may have copied some pages */
> +		dst_vma = find_vma(dst_mm, dst_addr);

I put back dst_start here.

> +		if (dst_addr < dst_vma->vm_start ||
> +		    dst_addr + len - (copied * vma_hpagesize) > dst_vma->vm_end)
> +			goto out_unlock;

Actually this introduces a bug: copied * vma_hpagesize in the new
patch is wrong, copied is already in byte units. I rolled this one
back because of the dst_start change commented on above anyway.

> +	/*
> +	 * Validate alignment based on huge page size
> +	 */
> +	if (dst_addr & (vma_hpagesize - 1) || len & (vma_hpagesize - 1))
> +		goto out_unlock;

If the vma changes under us we may as well fail. So I moved the
alignment checks on dst_start/len before the retry loop and I added a
further WARN_ON check inside the loop on dst_addr/len-copied just in
case, but that cannot trigger as we abort if the vma_hpagesize changed
(hence WARN_ON).

If we need to relax this later and handle a change of vma_hpagesize,
it'll be a backwards compatible change. I don't think it's needed and
this is more strict behavior.

> +	while (src_addr < src_start + len) {
> +		pte_t dst_pteval;
> +
> +		BUG_ON(dst_addr >= dst_start + len);
> +		dst_addr &= huge_page_mask(h);

The additional mask is superflous here, it was already enforced by the
alignment checks so I turned it into a bugcheck.

This is the current status; I'm sending a full diff against the
previous submission for review of the latest updates. It's easier to
review incrementally I think.

Please test it, I updated the aa.git tree userfault branch in sync
with this.

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 063ccc7..8a0ee3ba 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -628,11 +628,11 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma,
 	}
 }
 
-void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx vm_ctx,
+void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx *vm_ctx,
 				 unsigned long from, unsigned long to,
 				 unsigned long len)
 {
-	struct userfaultfd_ctx *ctx = vm_ctx.ctx;
+	struct userfaultfd_ctx *ctx = vm_ctx->ctx;
 	struct userfaultfd_wait_queue ewq;
 
 	if (!ctx)
@@ -657,6 +657,7 @@ void madvise_userfault_dontneed(struct vm_area_struct *vma,
 				struct vm_area_struct **prev,
 				unsigned long start, unsigned long end)
 {
+	struct mm_struct *mm = vma->vm_mm;
 	struct userfaultfd_ctx *ctx;
 	struct userfaultfd_wait_queue ewq;
 
@@ -665,8 +666,9 @@ void madvise_userfault_dontneed(struct vm_area_struct *vma,
 		return;
 
 	userfaultfd_ctx_get(ctx);
+	up_read(&mm->mmap_sem);
+
 	*prev = NULL; /* We wait for ACK w/o the mmap semaphore */
-	up_read(&vma->vm_mm->mmap_sem);
 
 	msg_init(&ewq.msg);
 
@@ -676,7 +678,7 @@ void madvise_userfault_dontneed(struct vm_area_struct *vma,
 
 	userfaultfd_event_wait_completion(ctx, &ewq);
 
-	down_read(&vma->vm_mm->mmap_sem);
+	down_read(&mm->mmap_sem);
 }
 
 static int userfaultfd_release(struct inode *inode, struct file *file)
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 5caf97f..01a4e98 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -77,7 +77,7 @@ extern void dup_userfaultfd_complete(struct list_head *);
 
 extern void mremap_userfaultfd_prep(struct vm_area_struct *,
 				    struct vm_userfaultfd_ctx *);
-extern void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx,
+extern void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx *,
 					unsigned long from, unsigned long to,
 					unsigned long len);
 
@@ -143,7 +143,7 @@ static inline void mremap_userfaultfd_prep(struct vm_area_struct *vma,
 {
 }
 
-static inline void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx ctx,
+static inline void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx *ctx,
 					       unsigned long from,
 					       unsigned long to,
 					       unsigned long len)
diff --git a/mm/mremap.c b/mm/mremap.c
index 450e811..cef4967 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -592,6 +592,6 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 	up_write(&current->mm->mmap_sem);
 	if (locked && new_len > old_len)
 		mm_populate(new_addr + old_len, new_len - old_len);
-	mremap_userfaultfd_complete(uf, addr, new_addr, old_len);
+	mremap_userfaultfd_complete(&uf, addr, new_addr, old_len);
 	return ret;
 }
diff --git a/mm/shmem.c b/mm/shmem.c
index 578622e..5d3e8bf 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1609,7 +1609,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 			if (fault_type) {
 				*fault_type |= VM_FAULT_MAJOR;
 				count_vm_event(PGMAJFAULT);
-				mem_cgroup_count_vm_event(vma->vm_mm,
+				mem_cgroup_count_vm_event(charge_mm,
 							  PGMAJFAULT);
 			}
 			/* Here we actually start the io */
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index d47b743..e8d7a89 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -172,8 +172,10 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 	 * by THP.  Since we can not reliably insert a zero page, this
 	 * feature is not supported.
 	 */
-	if (zeropage)
+	if (zeropage) {
+		up_read(&dst_mm->mmap_sem);
 		return -EINVAL;
+	}
 
 	src_addr = src_start;
 	dst_addr = dst_start;
@@ -181,6 +183,12 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 	page = NULL;
 	vma_hpagesize = vma_kernel_pagesize(dst_vma);
 
+	/*
+	 * Validate alignment based on huge page size
+	 */
+	if (dst_start & (vma_hpagesize - 1) || len & (vma_hpagesize - 1))
+		goto out_unlock;
+
 retry:
 	/*
 	 * On routine entry dst_vma is set.  If we had to drop mmap_sem and
@@ -189,11 +197,15 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 	err = -EINVAL;
 	if (!dst_vma) {
 		dst_vma = find_vma(dst_mm, dst_start);
-		vma_hpagesize = vma_kernel_pagesize(dst_vma);
+		if (!dst_vma || !is_vm_hugetlb_page(dst_vma))
+			goto out_unlock;
+
+		if (vma_hpagesize != vma_kernel_pagesize(dst_vma))
+			goto out_unlock;
 
 		/*
-		 * Make sure the vma is not shared, that the dst range is
-		 * both valid and fully within a single existing vma.
+		 * Make sure the vma is not shared, that the remaining dst
+		 * range is both valid and fully within a single existing vma.
 		 */
 		if (dst_vma->vm_flags & VM_SHARED)
 			goto out_unlock;
@@ -202,10 +214,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 			goto out_unlock;
 	}
 
-	/*
-	 * Validate alignment based on huge page size
-	 */
-	if (dst_start & (vma_hpagesize - 1) || len & (vma_hpagesize - 1))
+	if (WARN_ON(dst_addr & (vma_hpagesize - 1) ||
+		    (len - copied) & (vma_hpagesize - 1)))
 		goto out_unlock;
 
 	/*
@@ -227,7 +237,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		pte_t dst_pteval;
 
 		BUG_ON(dst_addr >= dst_start + len);
-		dst_addr &= huge_page_mask(h);
+		VM_BUG_ON(dst_addr & ~huge_page_mask(h));
 
 		/*
 		 * Serialize via hugetlb_fault_mutex
@@ -300,17 +310,13 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 	return copied ? copied : err;
 }
 #else /* !CONFIG_HUGETLB_PAGE */
-static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
-					      struct vm_area_struct *dst_vma,
-					      unsigned long dst_start,
-					      unsigned long src_start,
-					      unsigned long len,
-					      bool zeropage)
-{
-	up_read(&dst_mm->mmap_sem);	/* HUGETLB not configured */
-	BUG();
-	return -EINVAL;
-}
+/* fail at build time if gcc attempts to use this */
+extern ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
+				      struct vm_area_struct *dst_vma,
+				      unsigned long dst_start,
+				      unsigned long src_start,
+				      unsigned long len,
+				      bool zeropage);
 #endif /* CONFIG_HUGETLB_PAGE */
 
 static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
@@ -360,9 +366,9 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	/*
 	 * If this is a HUGETLB vma, pass off to appropriate routine
 	 */
-	if (dst_vma->vm_flags & VM_HUGETLB)
+	if (is_vm_hugetlb_page(dst_vma))
 		return  __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
-						src_start, len, false);
+						src_start, len, zeropage);
 
 	/*
 	 * Be strict and only allow __mcopy_atomic on userfaultfd
@@ -431,8 +437,11 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 				err = mfill_zeropage_pte(dst_mm, dst_pmd,
 							 dst_vma, dst_addr);
 		} else {
-			err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
-						     dst_addr, src_addr, &page);
+			err = -EINVAL; /* if zeropage is true return -EINVAL */
+			if (likely(!zeropage))
+				err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd,
+							     dst_vma, dst_addr,
+							     src_addr, &page);
 		}
 
 		cond_resched();
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index fed2119..5a840a6 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -625,6 +625,86 @@ static int faulting_process(void)
 	return 0;
 }
 
+static int uffdio_zeropage(int ufd, unsigned long offset)
+{
+	struct uffdio_zeropage uffdio_zeropage;
+	int ret;
+	unsigned long has_zeropage = EXPECTED_IOCTLS & (1 << _UFFDIO_ZEROPAGE);
+
+	if (offset >= nr_pages * page_size)
+		fprintf(stderr, "unexpected offset %lu\n",
+			offset), exit(1);
+	uffdio_zeropage.range.start = (unsigned long) area_dst + offset;
+	uffdio_zeropage.range.len = page_size;
+	uffdio_zeropage.mode = 0;
+	ret = ioctl(ufd, UFFDIO_ZEROPAGE, &uffdio_zeropage);
+	if (ret) {
+		/* real retval in ufdio_zeropage.zeropage */
+		if (has_zeropage) {
+			if (uffdio_zeropage.zeropage == -EEXIST)
+				fprintf(stderr, "UFFDIO_ZEROPAGE -EEXIST\n"),
+					exit(1);
+			else
+				fprintf(stderr, "UFFDIO_ZEROPAGE error %Ld\n",
+					uffdio_zeropage.zeropage), exit(1);
+		} else {
+			if (uffdio_zeropage.zeropage != -EINVAL)
+				fprintf(stderr,
+					"UFFDIO_ZEROPAGE not -EINVAL %Ld\n",
+					uffdio_zeropage.zeropage), exit(1);
+		}
+	} else if (has_zeropage) {
+		if (uffdio_zeropage.zeropage != page_size) {
+			fprintf(stderr, "UFFDIO_ZEROPAGE unexpected %Ld\n",
+				uffdio_zeropage.zeropage), exit(1);
+		} else
+			return 1;
+	} else {
+		fprintf(stderr,
+			"UFFDIO_ZEROPAGE succeeded %Ld\n",
+			uffdio_zeropage.zeropage), exit(1);
+	}
+
+	return 0;
+}
+
+/* exercise UFFDIO_ZEROPAGE */
+static int userfaultfd_zeropage_test(void)
+{
+	struct uffdio_register uffdio_register;
+	unsigned long expected_ioctls;
+
+	printf("testing UFFDIO_ZEROPAGE: ");
+	fflush(stdout);
+
+	if (release_pages(area_dst))
+		return 1;
+
+	if (userfaultfd_open(0) < 0)
+		return 1;
+	uffdio_register.range.start = (unsigned long) area_dst;
+	uffdio_register.range.len = nr_pages * page_size;
+	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
+		fprintf(stderr, "register failure\n"), exit(1);
+
+	expected_ioctls = EXPECTED_IOCTLS;
+	if ((uffdio_register.ioctls & expected_ioctls) !=
+	    expected_ioctls)
+		fprintf(stderr,
+			"unexpected missing ioctl for anon memory\n"),
+			exit(1);
+
+	if (uffdio_zeropage(uffd, 0)) {
+		if (my_bcmp(area_dst, zeropage, page_size))
+			fprintf(stderr, "zeropage is not zero\n"), exit(1);
+	}
+
+	close(uffd);
+	printf("done.\n");
+	return 0;
+}
+
 static int userfaultfd_events_test(void)
 {
 	struct uffdio_register uffdio_register;
@@ -679,6 +759,7 @@ static int userfaultfd_events_test(void)
 	if (pthread_join(uffd_mon, (void **)&userfaults))
 		return 1;
 
+	close(uffd);
 	printf("userfaults: %ld\n", userfaults);
 
 	return userfaults != nr_pages;
@@ -852,7 +933,7 @@ static int userfaultfd_stress(void)
 		return err;
 
 	close(uffd);
-	return userfaultfd_events_test();
+	return userfaultfd_zeropage_test() || userfaultfd_events_test();
 }
 
 #ifndef HUGETLB_TEST


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY
  2016-11-04 19:36         ` Andrea Arcangeli
@ 2016-11-04 20:34           ` Mike Kravetz
  2016-11-08 21:06           ` Mike Kravetz
  1 sibling, 0 replies; 69+ messages in thread
From: Mike Kravetz @ 2016-11-04 20:34 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hillf Danton, 'Andrew Morton',
	linux-mm, 'Dr. David Alan Gilbert', 'Shaohua Li',
	'Pavel Emelyanov', 'Mike Rapoport'

On 11/04/2016 12:36 PM, Andrea Arcangeli wrote:
> On Thu, Nov 03, 2016 at 12:14:15PM -0700, Mike Kravetz wrote:
>> +		/* lookup dst_addr as we may have copied some pages */
>> +		dst_vma = find_vma(dst_mm, dst_addr);
> 
> I put back dst_start here.
> 
>> +		if (dst_addr < dst_vma->vm_start ||
>> +		    dst_addr + len - (copied * vma_hpagesize) > dst_vma->vm_end)
>> +			goto out_unlock;
> 
> Actually this introduces a bug: copied * vma_hpagesize in the new
> patch is wrong, copied is already in byte units. I rolled back this
> one because of the dst_start commented above anyway.
> 
>> +	/*
>> +	 * Validate alignment based on huge page size
>> +	 */
>> +	if (dst_addr & (vma_hpagesize - 1) || len & (vma_hpagesize - 1))
>> +		goto out_unlock;
> 
> If the vma changes under us we an as well fail. So I moved the
> alignment checks on dst_start/len before the retry loop and I added a
> further WARN_ON check inside the loop on dst_addr/len-copied just in
> case but that cannot trigger as we abort if the vma_hpagesize changed
> (hence WARN_ON).
> 
> If we need to relax this later and handle a change of vma_hpagesize,
> it'll be backwards compatible change. I don't think it's needed and
> this is more strict behavior.
> 
>> +	while (src_addr < src_start + len) {
>> +		pte_t dst_pteval;
>> +
>> +		BUG_ON(dst_addr >= dst_start + len);
>> +		dst_addr &= huge_page_mask(h);
> 
> The additional mask is superflous here, it was already enforced by the
> alignment checks so I turned it into a bugcheck.

Thanks,

I had made similar hugetlb changes and was testing.  I'll perform hugetlb
testing with the full diff patch below.

-- 
Mike Kravetz

> 
> This is the current status, I'm sending a full diff against the
> previous submit for review of the latest updates. It's easier to
> review incrementally I think.
> 
> Please test it, I updated the aa.git tree userfault branch in sync
> with this.
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 063ccc7..8a0ee3ba 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -628,11 +628,11 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma,
>  	}
>  }
>  
> -void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx vm_ctx,
> +void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx *vm_ctx,
>  				 unsigned long from, unsigned long to,
>  				 unsigned long len)
>  {
> -	struct userfaultfd_ctx *ctx = vm_ctx.ctx;
> +	struct userfaultfd_ctx *ctx = vm_ctx->ctx;
>  	struct userfaultfd_wait_queue ewq;
>  
>  	if (!ctx)
> @@ -657,6 +657,7 @@ void madvise_userfault_dontneed(struct vm_area_struct *vma,
>  				struct vm_area_struct **prev,
>  				unsigned long start, unsigned long end)
>  {
> +	struct mm_struct *mm = vma->vm_mm;
>  	struct userfaultfd_ctx *ctx;
>  	struct userfaultfd_wait_queue ewq;
>  
> @@ -665,8 +666,9 @@ void madvise_userfault_dontneed(struct vm_area_struct *vma,
>  		return;
>  
>  	userfaultfd_ctx_get(ctx);
> +	up_read(&mm->mmap_sem);
> +
>  	*prev = NULL; /* We wait for ACK w/o the mmap semaphore */
> -	up_read(&vma->vm_mm->mmap_sem);
>  
>  	msg_init(&ewq.msg);
>  
> @@ -676,7 +678,7 @@ void madvise_userfault_dontneed(struct vm_area_struct *vma,
>  
>  	userfaultfd_event_wait_completion(ctx, &ewq);
>  
> -	down_read(&vma->vm_mm->mmap_sem);
> +	down_read(&mm->mmap_sem);
>  }
>  
>  static int userfaultfd_release(struct inode *inode, struct file *file)
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index 5caf97f..01a4e98 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -77,7 +77,7 @@ extern void dup_userfaultfd_complete(struct list_head *);
>  
>  extern void mremap_userfaultfd_prep(struct vm_area_struct *,
>  				    struct vm_userfaultfd_ctx *);
> -extern void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx,
> +extern void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx *,
>  					unsigned long from, unsigned long to,
>  					unsigned long len);
>  
> @@ -143,7 +143,7 @@ static inline void mremap_userfaultfd_prep(struct vm_area_struct *vma,
>  {
>  }
>  
> -static inline void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx ctx,
> +static inline void mremap_userfaultfd_complete(struct vm_userfaultfd_ctx *ctx,
>  					       unsigned long from,
>  					       unsigned long to,
>  					       unsigned long len)
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 450e811..cef4967 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -592,6 +592,6 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
>  	up_write(&current->mm->mmap_sem);
>  	if (locked && new_len > old_len)
>  		mm_populate(new_addr + old_len, new_len - old_len);
> -	mremap_userfaultfd_complete(uf, addr, new_addr, old_len);
> +	mremap_userfaultfd_complete(&uf, addr, new_addr, old_len);
>  	return ret;
>  }
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 578622e..5d3e8bf 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1609,7 +1609,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>  			if (fault_type) {
>  				*fault_type |= VM_FAULT_MAJOR;
>  				count_vm_event(PGMAJFAULT);
> -				mem_cgroup_count_vm_event(vma->vm_mm,
> +				mem_cgroup_count_vm_event(charge_mm,
>  							  PGMAJFAULT);
>  			}
>  			/* Here we actually start the io */
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index d47b743..e8d7a89 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -172,8 +172,10 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  	 * by THP.  Since we can not reliably insert a zero page, this
>  	 * feature is not supported.
>  	 */
> -	if (zeropage)
> +	if (zeropage) {
> +		up_read(&dst_mm->mmap_sem);
>  		return -EINVAL;
> +	}
>  
>  	src_addr = src_start;
>  	dst_addr = dst_start;
> @@ -181,6 +183,12 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  	page = NULL;
>  	vma_hpagesize = vma_kernel_pagesize(dst_vma);
>  
> +	/*
> +	 * Validate alignment based on huge page size
> +	 */
> +	if (dst_start & (vma_hpagesize - 1) || len & (vma_hpagesize - 1))
> +		goto out_unlock;
> +
>  retry:
>  	/*
>  	 * On routine entry dst_vma is set.  If we had to drop mmap_sem and
> @@ -189,11 +197,15 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  	err = -EINVAL;
>  	if (!dst_vma) {
>  		dst_vma = find_vma(dst_mm, dst_start);
> -		vma_hpagesize = vma_kernel_pagesize(dst_vma);
> +		if (!dst_vma || !is_vm_hugetlb_page(dst_vma))
> +			goto out_unlock;
> +
> +		if (vma_hpagesize != vma_kernel_pagesize(dst_vma))
> +			goto out_unlock;
>  
>  		/*
> -		 * Make sure the vma is not shared, that the dst range is
> -		 * both valid and fully within a single existing vma.
> +		 * Make sure the vma is not shared, that the remaining dst
> +		 * range is both valid and fully within a single existing vma.
>  		 */
>  		if (dst_vma->vm_flags & VM_SHARED)
>  			goto out_unlock;
> @@ -202,10 +214,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  			goto out_unlock;
>  	}
>  
> -	/*
> -	 * Validate alignment based on huge page size
> -	 */
> -	if (dst_start & (vma_hpagesize - 1) || len & (vma_hpagesize - 1))
> +	if (WARN_ON(dst_addr & (vma_hpagesize - 1) ||
> +		    (len - copied) & (vma_hpagesize - 1)))
>  		goto out_unlock;
>  
>  	/*
> @@ -227,7 +237,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  		pte_t dst_pteval;
>  
>  		BUG_ON(dst_addr >= dst_start + len);
> -		dst_addr &= huge_page_mask(h);
> +		VM_BUG_ON(dst_addr & ~huge_page_mask(h));
>  
>  		/*
>  		 * Serialize via hugetlb_fault_mutex
> @@ -300,17 +310,13 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  	return copied ? copied : err;
>  }
>  #else /* !CONFIG_HUGETLB_PAGE */
> -static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> -					      struct vm_area_struct *dst_vma,
> -					      unsigned long dst_start,
> -					      unsigned long src_start,
> -					      unsigned long len,
> -					      bool zeropage)
> -{
> -	up_read(&dst_mm->mmap_sem);	/* HUGETLB not configured */
> -	BUG();
> -	return -EINVAL;
> -}
> +/* fail at build time if gcc attempts to use this */
> +extern ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> +				      struct vm_area_struct *dst_vma,
> +				      unsigned long dst_start,
> +				      unsigned long src_start,
> +				      unsigned long len,
> +				      bool zeropage);
>  #endif /* CONFIG_HUGETLB_PAGE */
>  
>  static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
> @@ -360,9 +366,9 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
>  	/*
>  	 * If this is a HUGETLB vma, pass off to appropriate routine
>  	 */
> -	if (dst_vma->vm_flags & VM_HUGETLB)
> +	if (is_vm_hugetlb_page(dst_vma))
>  		return  __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
> -						src_start, len, false);
> +						src_start, len, zeropage);
>  
>  	/*
>  	 * Be strict and only allow __mcopy_atomic on userfaultfd
> @@ -431,8 +437,11 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
>  				err = mfill_zeropage_pte(dst_mm, dst_pmd,
>  							 dst_vma, dst_addr);
>  		} else {
> -			err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
> -						     dst_addr, src_addr, &page);
> +			err = -EINVAL; /* if zeropage is true return -EINVAL */
> +			if (likely(!zeropage))
> +				err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd,
> +							     dst_vma, dst_addr,
> +							     src_addr, &page);
>  		}
>  
>  		cond_resched();
> diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
> index fed2119..5a840a6 100644
> --- a/tools/testing/selftests/vm/userfaultfd.c
> +++ b/tools/testing/selftests/vm/userfaultfd.c
> @@ -625,6 +625,86 @@ static int faulting_process(void)
>  	return 0;
>  }
>  
> +static int uffdio_zeropage(int ufd, unsigned long offset)
> +{
> +	struct uffdio_zeropage uffdio_zeropage;
> +	int ret;
> +	unsigned long has_zeropage = EXPECTED_IOCTLS & (1 << _UFFDIO_ZEROPAGE);
> +
> +	if (offset >= nr_pages * page_size)
> +		fprintf(stderr, "unexpected offset %lu\n",
> +			offset), exit(1);
> +	uffdio_zeropage.range.start = (unsigned long) area_dst + offset;
> +	uffdio_zeropage.range.len = page_size;
> +	uffdio_zeropage.mode = 0;
> +	ret = ioctl(ufd, UFFDIO_ZEROPAGE, &uffdio_zeropage);
> +	if (ret) {
> +		/* real retval in ufdio_zeropage.zeropage */
> +		if (has_zeropage) {
> +			if (uffdio_zeropage.zeropage == -EEXIST)
> +				fprintf(stderr, "UFFDIO_ZEROPAGE -EEXIST\n"),
> +					exit(1);
> +			else
> +				fprintf(stderr, "UFFDIO_ZEROPAGE error %Ld\n",
> +					uffdio_zeropage.zeropage), exit(1);
> +		} else {
> +			if (uffdio_zeropage.zeropage != -EINVAL)
> +				fprintf(stderr,
> +					"UFFDIO_ZEROPAGE not -EINVAL %Ld\n",
> +					uffdio_zeropage.zeropage), exit(1);
> +		}
> +	} else if (has_zeropage) {
> +		if (uffdio_zeropage.zeropage != page_size) {
> +			fprintf(stderr, "UFFDIO_ZEROPAGE unexpected %Ld\n",
> +				uffdio_zeropage.zeropage), exit(1);
> +		} else
> +			return 1;
> +	} else {
> +		fprintf(stderr,
> +			"UFFDIO_ZEROPAGE succeeded %Ld\n",
> +			uffdio_zeropage.zeropage), exit(1);
> +	}
> +
> +	return 0;
> +}
> +
> +/* exercise UFFDIO_ZEROPAGE */
> +static int userfaultfd_zeropage_test(void)
> +{
> +	struct uffdio_register uffdio_register;
> +	unsigned long expected_ioctls;
> +
> +	printf("testing UFFDIO_ZEROPAGE: ");
> +	fflush(stdout);
> +
> +	if (release_pages(area_dst))
> +		return 1;
> +
> +	if (userfaultfd_open(0) < 0)
> +		return 1;
> +	uffdio_register.range.start = (unsigned long) area_dst;
> +	uffdio_register.range.len = nr_pages * page_size;
> +	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
> +	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
> +		fprintf(stderr, "register failure\n"), exit(1);
> +
> +	expected_ioctls = EXPECTED_IOCTLS;
> +	if ((uffdio_register.ioctls & expected_ioctls) !=
> +	    expected_ioctls)
> +		fprintf(stderr,
> +			"unexpected missing ioctl for anon memory\n"),
> +			exit(1);
> +
> +	if (uffdio_zeropage(uffd, 0)) {
> +		if (my_bcmp(area_dst, zeropage, page_size))
> +			fprintf(stderr, "zeropage is not zero\n"), exit(1);
> +	}
> +
> +	close(uffd);
> +	printf("done.\n");
> +	return 0;
> +}
> +
>  static int userfaultfd_events_test(void)
>  {
>  	struct uffdio_register uffdio_register;
> @@ -679,6 +759,7 @@ static int userfaultfd_events_test(void)
>  	if (pthread_join(uffd_mon, (void **)&userfaults))
>  		return 1;
>  
> +	close(uffd);
>  	printf("userfaults: %ld\n", userfaults);
>  
>  	return userfaults != nr_pages;
> @@ -852,7 +933,7 @@ static int userfaultfd_stress(void)
>  		return err;
>  
>  	close(uffd);
> -	return userfaultfd_events_test();
> +	return userfaultfd_zeropage_test() || userfaultfd_events_test();
>  }
>  
>  #ifndef HUGETLB_TEST
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY
  2016-11-04 19:36         ` Andrea Arcangeli
  2016-11-04 20:34           ` Mike Kravetz
@ 2016-11-08 21:06           ` Mike Kravetz
  2016-11-16 18:28             ` Andrea Arcangeli
  1 sibling, 1 reply; 69+ messages in thread
From: Mike Kravetz @ 2016-11-08 21:06 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hillf Danton, 'Andrew Morton',
	linux-mm, 'Dr. David Alan Gilbert', 'Shaohua Li',
	'Pavel Emelyanov', 'Mike Rapoport'

On 11/04/2016 12:36 PM, Andrea Arcangeli wrote:
> This is the current status, I'm sending a full diff against the
> previous submit for review of the latest updates. It's easier to
> review incrementally I think.
> 
> Please test it, I updated the aa.git tree userfault branch in sync
> with this.
> 

Hello Andrea,

I found a couple more issues with hugetlbfs support.  The below patch
describes and addresses the issues.  It is against your aa tree. Do note
that there is a patch going into mm tree that is a pre-req for this
patch.  The patch is "mm/hugetlb: fix huge page reservation leak in
private mapping error paths".
http://marc.info/?l=linux-mm&m=147693310409312&w=2

-- 
Mike Kravetz

From: Mike Kravetz <mike.kravetz@oracle.com>

userfaultfd: hugetlbfs: fix __mcopy_atomic_hugetlb retry/error processing

The new routine copy_huge_page_from_user() uses kmap_atomic() to map
PAGE_SIZE pages.  However, this prevents page faults in the subsequent
call to copy_from_user().  This is OK in the case where the routine
is called with mmap_sem held.  However, in another case we want to
allow page faults.  So, add a new argument allow_pagefault to indicate
if the routine should allow page faults.

A patch (mm/hugetlb: fix huge page reservation leak in private mapping
error paths) was recently submitted and is being added to -mm tree.  It
addresses an issue with huge page reservations when a huge page is
allocated and freed before being instantiated in an address space.  This
would typically happen in error paths.  The routine __mcopy_atomic_hugetlb
has such an error path, so it will need to call restore_reserve_on_error()
before freeing the huge page.  restore_reserve_on_error is currently
only visible in mm/hugetlb.c.  So, add it to a header file so that it
can be used in mm/userfaultfd.c.  Another option would be to move
__mcopy_atomic_hugetlb into mm/hugetlb.c

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h |  2 ++
 include/linux/mm.h      |  3 ++-
 mm/hugetlb.c            |  7 +++----
 mm/memory.c             | 13 ++++++++++---
 mm/userfaultfd.c        |  6 ++++--
 5 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index fc27b66..bf02b7e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -101,6 +101,8 @@ u32 hugetlb_fault_mutex_hash(struct hstate *h,
struct mm_struct *mm,
 				struct vm_area_struct *vma,
 				struct address_space *mapping,
 				pgoff_t idx, unsigned long address);
+void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
+				unsigned long address, struct page *page);

 pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t
*pud);

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 39157f5..7c73a05 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2417,7 +2417,8 @@ extern void copy_user_huge_page(struct page *dst,
struct page *src,
 				unsigned int pages_per_huge_page);
 extern long copy_huge_page_from_user(struct page *dst_page,
 				const void __user *usr_src,
-				unsigned int pages_per_huge_page);
+				unsigned int pages_per_huge_page,
+				bool allow_pagefault);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */

 extern struct page_ext_operations debug_guardpage_ops;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7bfeee3..9ce8ecb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1935,9 +1935,8 @@ static long vma_add_reservation(struct hstate *h,
  * reserve map here to be consistent with global reserve count adjustments
  * to be made by free_huge_page.
  */
-static void restore_reserve_on_error(struct hstate *h,
-			struct vm_area_struct *vma, unsigned long address,
-			struct page *page)
+void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
+				unsigned long address, struct page *page)
 {
 	if (unlikely(PagePrivate(page))) {
 		long rc = vma_needs_reservation(h, vma, address);
@@ -3981,7 +3980,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,

 		ret = copy_huge_page_from_user(page,
 						(const void __user *) src_addr,
-						pages_per_huge_page(h));
+						pages_per_huge_page(h), false);

 		/* fallback to copy_from_user outside mmap_sem */
 		if (unlikely(ret)) {
diff --git a/mm/memory.c b/mm/memory.c
index b911110..0137c4a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4106,7 +4106,8 @@ void copy_user_huge_page(struct page *dst, struct
page *src,

 long copy_huge_page_from_user(struct page *dst_page,
 				const void __user *usr_src,
-				unsigned int pages_per_huge_page)
+				unsigned int pages_per_huge_page,
+				bool allow_pagefault)
 {
 	void *src = (void *)usr_src;
 	void *page_kaddr;
@@ -4114,11 +4115,17 @@ long copy_huge_page_from_user(struct page *dst_page,
 	unsigned long ret_val = pages_per_huge_page * PAGE_SIZE;

 	for (i = 0; i < pages_per_huge_page; i++) {
-		page_kaddr = kmap_atomic(dst_page + i);
+		if (allow_pagefault)
+			page_kaddr = kmap(dst_page + i);
+		else
+			page_kaddr = kmap_atomic(dst_page + i);
 		rc = copy_from_user(page_kaddr,
 				(const void __user *)(src + i * PAGE_SIZE),
 				PAGE_SIZE);
-		kunmap_atomic(page_kaddr);
+		if (allow_pagefault)
+			kunmap(page_kaddr);
+		else
+			kunmap_atomic(page_kaddr);

 		ret_val -= (PAGE_SIZE - rc);
 		if (rc)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e8d7a89..c8588aa 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -275,7 +275,7 @@ static __always_inline ssize_t
__mcopy_atomic_hugetlb(struct mm_struct *dst_mm,

 			err = copy_huge_page_from_user(page,
 						(const void __user *)src_addr,
-						pages_per_huge_page(h));
+						pages_per_huge_page(h), true);
 			if (unlikely(err)) {
 				err = -EFAULT;
 				goto out;
@@ -302,8 +302,10 @@ static __always_inline ssize_t
__mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 out_unlock:
 	up_read(&dst_mm->mmap_sem);
 out:
-	if (page)
+	if (page) {
+		restore_reserve_on_error(h, dst_vma, dst_addr, page);
 		put_page(page);
+	}
 	BUG_ON(copied < 0);
 	BUG_ON(err > 0);
 	BUG_ON(!copied && !err);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY
  2016-11-08 21:06           ` Mike Kravetz
@ 2016-11-16 18:28             ` Andrea Arcangeli
  2016-11-16 18:53               ` Mike Kravetz
  2016-11-17 19:41               ` Mike Kravetz
  0 siblings, 2 replies; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-16 18:28 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Hillf Danton, 'Andrew Morton',
	linux-mm, 'Dr. David Alan Gilbert', 'Shaohua Li',
	'Pavel Emelyanov', 'Mike Rapoport'

Hello Mike,

On Tue, Nov 08, 2016 at 01:06:06PM -0800, Mike Kravetz wrote:
> -- 
> Mike Kravetz
> 
> From: Mike Kravetz <mike.kravetz@oracle.com>
> 
> userfaultfd: hugetlbfs: fix __mcopy_atomic_hugetlb retry/error processing
> 
> The new routine copy_huge_page_from_user() uses kmap_atomic() to map
> PAGE_SIZE pages.  However, this prevents page faults in the subsequent
> call to copy_from_user().  This is OK in the case where the routine
> is copied with mmap_sema held.  However, in another case we want to
> allow page faults.  So, add a new argument allow_pagefault to indicate
> if the routine should allow page faults.
> 
> A patch (mm/hugetlb: fix huge page reservation leak in private mapping
> error paths) was recently submitted and is being added to -mm tree.  It
> addresses the issue huge page reservations when a huge page is allocated,
> and free'ed before being instantiated in an address space.  This would
> typically happen in error paths.  The routine __mcopy_atomic_hugetlb has
> such an error path, so it will need to call restore_reserve_on_error()
> before free'ing the huge page.  restore_reserve_on_error is currently
> only visible in mm/hugetlb.c.  So, add it to a header file so that it
> can be used in mm/userfaultfd.c.  Another option would be to move
> __mcopy_atomic_hugetlb into mm/hugetlb.c

It would have been better to split this into two patches.

> @@ -302,8 +302,10 @@ static __always_inline ssize_t
> __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  out_unlock:
>  	up_read(&dst_mm->mmap_sem);
>  out:
> -	if (page)
> +	if (page) {
> +		restore_reserve_on_error(h, dst_vma, dst_addr, page);
>  		put_page(page);
> +	}
>  	BUG_ON(copied < 0);

If the revalidation fails dst_vma could even be NULL.

We get there with page not NULL only if something in the revalidation
effectively fails... I'll have to drop the above change, as the fix
would hurt more than the vma reservation not being restored. I didn't
think too much about it, but there was no obvious way to restore the
reservation of a vma after we drop the mmap_sem. However if we don't
drop the mmap_sem, we'd recurse into it, and that deadlocks in the
current implementation if a down_write is already pending somewhere
else. In this specific case fairness is not an issue, but the lock
doesn't check whether it's the same thread taking it again, so it
doesn't allow recursion (checking for the same thread would make it
slower).

I also fixed the gup support for userfaultfd, could you review it?
Beware, untested... will test it shortly with qemu postcopy live
migration with hugetlbfs instead of THP (that currently gracefully
complains about FAULT_FLAG_ALLOW_RETRY missing, KVM ioctl returns
badaddr and DEBUG_VM=y clearly showed the stack trace of where
FAULT_FLAG_ALLOW_RETRY was missing).

I think this enhancement is needed by Oracle too, so that you don't
get an error from I/O syscalls and instead get a userfault.

We need to update the selftest to trigger userfaults not only with the
CPU but with O_DIRECT too.
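
Something along these lines could go into the selftest; this is only a
rough, untested sketch (the temporary file name, the helper name and the
error handling are made up, and it assumes area_dst is already registered
with the uffd):

/*
 * Trigger the userfault through the kernel gup path instead of a CPU
 * access, by making the kernel DMA (O_DIRECT read) into the
 * uffd-registered destination area.
 */
static int uffd_odirect_read(unsigned long offset)
{
	int fd;
	ssize_t n;

	fd = open("./uffd-odirect-tmp", O_RDWR | O_CREAT | O_DIRECT, 0600);
	if (fd < 0)
		perror("open O_DIRECT"), exit(1);
	/* make sure there is page_size worth of file data to read */
	if (ftruncate(fd, page_size))
		perror("ftruncate"), exit(1);
	/*
	 * The read destination is the missing page, so the fault is
	 * delivered from the gup path (e.g. follow_hugetlb_page() for
	 * the hugetlbfs test) rather than from a CPU access.
	 */
	n = pread(fd, area_dst + offset, page_size, 0);
	if (n != page_size)
		fprintf(stderr, "O_DIRECT read %zd\n", n), exit(1);
	close(fd);
	return 0;
}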

Note, the FOLL_NOWAIT is needed to offload the userfaults to async
page faults. KVM tries an async fault first (FOLL_NOWAIT, nonblocking
= NULL); if that fails it offloads a blocking (*nonblocking = 1) fault
through the async page fault kernel thread while the guest scheduler
schedules away the blocked process. So the userfaults behave like SSD
swapins from disk hitting a single guest thread and not the whole host
vcpu thread. Clearly hugetlbfs cannot ever block for I/O; FOLL_NOWAIT
is only useful to avoid blocking the vcpu thread in
handle_userfault().

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY
  2016-11-16 18:28             ` Andrea Arcangeli
@ 2016-11-16 18:53               ` Mike Kravetz
  2016-11-17 15:40                 ` Andrea Arcangeli
  2016-11-17 19:41               ` Mike Kravetz
  1 sibling, 1 reply; 69+ messages in thread
From: Mike Kravetz @ 2016-11-16 18:53 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hillf Danton, 'Andrew Morton',
	linux-mm, 'Dr. David Alan Gilbert', 'Shaohua Li',
	'Pavel Emelyanov', 'Mike Rapoport'

On 11/16/2016 10:28 AM, Andrea Arcangeli wrote:
> Hello Mike,
> 
> On Tue, Nov 08, 2016 at 01:06:06PM -0800, Mike Kravetz wrote:
>> -- 
>> Mike Kravetz
>>
>> From: Mike Kravetz <mike.kravetz@oracle.com>
>>
>> userfaultfd: hugetlbfs: fix __mcopy_atomic_hugetlb retry/error processing
>>
>> The new routine copy_huge_page_from_user() uses kmap_atomic() to map
>> PAGE_SIZE pages.  However, this prevents page faults in the subsequent
>> call to copy_from_user().  This is OK in the case where the routine
>> is copied with mmap_sema held.  However, in another case we want to
>> allow page faults.  So, add a new argument allow_pagefault to indicate
>> if the routine should allow page faults.
>>
>> A patch (mm/hugetlb: fix huge page reservation leak in private mapping
>> error paths) was recently submitted and is being added to -mm tree.  It
>> addresses the issue huge page reservations when a huge page is allocated,
>> and free'ed before being instantiated in an address space.  This would
>> typically happen in error paths.  The routine __mcopy_atomic_hugetlb has
>> such an error path, so it will need to call restore_reserve_on_error()
>> before free'ing the huge page.  restore_reserve_on_error is currently
>> only visible in mm/hugetlb.c.  So, add it to a header file so that it
>> can be used in mm/userfaultfd.c.  Another option would be to move
>> __mcopy_atomic_hugetlb into mm/hugetlb.c
> 
> It would have been better to split this in two patches.
> 
>> @@ -302,8 +302,10 @@ static __always_inline ssize_t
>> __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>>  out_unlock:
>>  	up_read(&dst_mm->mmap_sem);
>>  out:
>> -	if (page)
>> +	if (page) {
>> +		restore_reserve_on_error(h, dst_vma, dst_addr, page);
>>  		put_page(page);
>> +	}
>>  	BUG_ON(copied < 0);
> 
> If the revalidation fails dst_vma could even be NULL.
> 
> We get there with page not NULL only if something in the revalidation
> fails effectively... I'll have to drop the above change as the fix
> will hurt more than the vma reservation not being restored. Didn't
> think too much about it, but there was no obvious way to restore the
> reservation of a vma, after we drop the mmap_sem. However if we don't
> drop the mmap_sem, we'd recurse into it, and it'll deadlock in current
> implementation if a down_write is already pending somewhere else. In
> this specific case fairness is not an issue, but it's not checking
> it's the same thread taking it again, so it's doesn't allow to recurse
> (checking it's the same thread would make it slower).

Thanks for reviewing this Andrea.  You are correct, we can not call
restore_reserve_on_error without mmap_sem held.

I was running some tests with error injection to exercise the error
path and noticed the reservation leaks as the system eventually ran
out of huge pages.  I need to think about it some more, but we may
want to at least do something like the following before put_page (with
a BIG comment):

	if (unlikely(PagePrivate(page)))
		ClearPagePrivate(page);

That would at least keep the global reservation count from increasing.
Let me look into that.
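
In the __mcopy_atomic_hugetlb out: path quoted above that would look
roughly like this (only a sketch, not tested; the comment wording is what
I mean by a BIG comment):

out:
	if (page) {
		/*
		 * BIG comment: we can't call restore_reserve_on_error()
		 * here because mmap_sem was dropped and dst_vma may be
		 * stale or NULL.  Clearing PagePrivate only keeps
		 * free_huge_page() from incrementing the global reserve
		 * count for a reservation that was already consumed from
		 * the old vma's reserve map; the downside is that a later
		 * fault on this range may have to allocate without a
		 * reservation.
		 */
		if (unlikely(PagePrivate(page)))
			ClearPagePrivate(page);
		put_page(page);
	}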

> I also fixed the gup support for userfaultfd, could you review it?

I will take a look, and 'may' have a test that can be modified for this.

-- 
Mike Kravetz

> Beware, untested... will test it shortly with qemu postcopy live
> migration with hugetlbfs instead of THP (that currently gracefully
> complains about FAULT_FLAG_ALLOW_RETRY missing, KVM ioctl returns
> badaddr and DEBUG_VM=y clearly showed the stack trace of where
> FAULT_FLAG_ALLOW_RETRY was missing).
> 
> I think this enhancement is needed by Oracle too, so that you don't
> get an error from I/O syscalls, and you instead get an userfault.
> 
> We need to update the selftest to trigger userfaults not only with the
> CPU but with O_DIRECT too.
> 
> Note, the FOLL_NOWAIT is needed to offload the userfaults to async
> page faults. KVM tries an async fault first (FOLL_NOWAIT, nonblocking
> = NULL), if that fails it offload a blocking (*nonblocking = 1) fault
> through async page fault kernel thread while guest scheduler schedule
> away the blocked process. So the userfaults behave like SSD swapins
> from disk hitting on a single guest thread and not the whole host vcpu
> thread. Clearly hugetlbfs cannot ever block for I/O, FOLL_NOWAIT is
> only useful to avoid blocking in the vcpu thread in
> handle_userfault().
> 
> From ff1ce62ee0acb14ed71621ba99f01f008a5d212d Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Wed, 16 Nov 2016 18:34:20 +0100
> Subject: [PATCH 1/1] userfaultfd: hugetlbfs: gup: support VM_FAULT_RETRY
> 
> Add support for VM_FAULT_RETRY to follow_hugetlb_page() so that
> get_user_pages_unlocked/locked and "nonblocking/FOLL_NOWAIT" features
> will work on hugetlbfs. This is required for fully functional
> userfaultfd non-present support on hugetlbfs.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/linux/hugetlb.h |  5 +++--
>  mm/gup.c                |  2 +-
>  mm/hugetlb.c            | 48 ++++++++++++++++++++++++++++++++++++++++--------
>  3 files changed, 44 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index bf02b7e..542416d 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -65,7 +65,8 @@ int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int,
>  int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
>  long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
>  			 struct page **, struct vm_area_struct **,
> -			 unsigned long *, unsigned long *, long, unsigned int);
> +			 unsigned long *, unsigned long *, long, unsigned int,
> +			 int *);
>  void unmap_hugepage_range(struct vm_area_struct *,
>  			  unsigned long, unsigned long, struct page *);
>  void __unmap_hugepage_range_final(struct mmu_gather *tlb,
> @@ -138,7 +139,7 @@ static inline unsigned long hugetlb_total_pages(void)
>  	return 0;
>  }
>  
> -#define follow_hugetlb_page(m,v,p,vs,a,b,i,w)	({ BUG(); 0; })
> +#define follow_hugetlb_page(m,v,p,vs,a,b,i,w,n)	({ BUG(); 0; })
>  #define follow_huge_addr(mm, addr, write)	ERR_PTR(-EINVAL)
>  #define copy_hugetlb_page_range(src, dst, vma)	({ BUG(); 0; })
>  static inline void hugetlb_report_meminfo(struct seq_file *m)
> diff --git a/mm/gup.c b/mm/gup.c
> index ec4f827..36e88a9 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -572,7 +572,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  			if (is_vm_hugetlb_page(vma)) {
>  				i = follow_hugetlb_page(mm, vma, pages, vmas,
>  						&start, &nr_pages, i,
> -						gup_flags);
> +						gup_flags, nonblocking);
>  				continue;
>  			}
>  		}
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 9ce8ecb..022750d 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4039,7 +4039,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  			 struct page **pages, struct vm_area_struct **vmas,
>  			 unsigned long *position, unsigned long *nr_pages,
> -			 long i, unsigned int flags)
> +			 long i, unsigned int flags, int *nonblocking)
>  {
>  	unsigned long pfn_offset;
>  	unsigned long vaddr = *position;
> @@ -4102,16 +4102,43 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		    ((flags & FOLL_WRITE) &&
>  		      !huge_pte_write(huge_ptep_get(pte)))) {
>  			int ret;
> +			unsigned int fault_flags = 0;
>  
>  			if (pte)
>  				spin_unlock(ptl);
> -			ret = hugetlb_fault(mm, vma, vaddr,
> -				(flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0);
> -			if (!(ret & VM_FAULT_ERROR))
> -				continue;
> -
> -			remainder = 0;
> -			break;
> +			if (flags & FOLL_WRITE)
> +				fault_flags |= FAULT_FLAG_WRITE;
> +			if (nonblocking)
> +				fault_flags |= FAULT_FLAG_ALLOW_RETRY;
> +			if (flags & FOLL_NOWAIT)
> +				fault_flags |= FAULT_FLAG_ALLOW_RETRY |
> +					FAULT_FLAG_RETRY_NOWAIT;
> +			if (flags & FOLL_TRIED) {
> +				VM_WARN_ON_ONCE(fault_flags &
> +						FAULT_FLAG_ALLOW_RETRY);
> +				fault_flags |= FAULT_FLAG_TRIED;
> +			}
> +			ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
> +			if (ret & VM_FAULT_ERROR) {
> +				remainder = 0;
> +				break;
> +			}
> +			if (ret & VM_FAULT_RETRY) {
> +				if (nonblocking)
> +					*nonblocking = 0;
> +				*nr_pages = 0;
> +				/*
> +				 * VM_FAULT_RETRY must not return an
> +				 * error, it will return zero
> +				 * instead.
> +				 *
> +				 * No need to update "position" as the
> +				 * caller will not check it after
> +				 * *nr_pages is set to 0.
> +				 */
> +				return i;
> +			}
> +			continue;
>  		}
>  
>  		pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
> @@ -4140,6 +4167,11 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		spin_unlock(ptl);
>  	}
>  	*nr_pages = remainder;
> +	/*
> +	 * setting position is actually required only if remainder is
> +	 * not zero but it's faster not to add a "if (remainder)"
> +	 * branch.
> +	 */
>  	*position = vaddr;
>  
>  	return i ? i : -EFAULT;
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY
  2016-11-16 18:53               ` Mike Kravetz
@ 2016-11-17 15:40                 ` Andrea Arcangeli
  2016-11-17 19:26                   ` Mike Kravetz
  0 siblings, 1 reply; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-17 15:40 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Hillf Danton, 'Andrew Morton',
	linux-mm, 'Dr. David Alan Gilbert', 'Shaohua Li',
	'Pavel Emelyanov', 'Mike Rapoport'

On Wed, Nov 16, 2016 at 10:53:39AM -0800, Mike Kravetz wrote:
> I was running some tests with error injection to exercise the error
> path and noticed the reservation leaks as the system eventually ran
> out of huge pages.  I need to think about it some more, but we may
> want to at least do something like the following before put_page (with
> a BIG comment):
> 
> 	if (unlikely(PagePrivate(page)))
> 		ClearPagePrivate(page);
> 
> That would at least keep the global reservation count from increasing.
> Let me look into that.

However, what happens if the old vma got munmapped and a new compatible
vma was instantiated and passes revalidation fine? Does the reserved page
of the old vma then go to a different vma?

This reservation code is complex and has lots of special cases anyway,
but the main concern at this point is the
set_page_private(subpool_vma(vma)) released by
hugetlb_vm_op_close->unlock_or_release_subpool.

Aside from the accounting, what about the page_private(page) subpool? It's
used by free_huge_page which would get out of sync with vma/inode
destruction if we release the mmap_sem.

	struct hugepage_subpool *spool =
		(struct hugepage_subpool *)page_private(page);

I think in the revalidation code we need to check whether
page_private(page) still matches subpool_vma(vma); if it doesn't
and it's a stale pointer, we can't even call put_page before first
fixing up the page_private.

The other way to solve this is not to release the mmap_sem at all and
in the slow path call __get_user_pages(nonblocking=NULL). That's
slower than using the CPU TLB to reach the source data and it'd
also prevent handling a userfault in the source address of
UFFDIO_COPY, because a userfault in the source would require the
mmap_sem to be released (i.e. it'd require get_user_pages_fast that
would again recurse on the mmap_sem and in turn we could as well stick
to the current nonblocking copy-user). We currently don't handle
nesting with non-cooperative anyway so it'd be ok for now not to
release the mmap_sem while copying in UFFDIO_COPY.


Off topic here, but while reading this code I also noticed that
free_huge_page is wasting CPU, and then noticed other places wasting
CPU too.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY
  2016-11-17 15:40                 ` Andrea Arcangeli
@ 2016-11-17 19:26                   ` Mike Kravetz
  2016-11-18  0:05                     ` Andrea Arcangeli
  0 siblings, 1 reply; 69+ messages in thread
From: Mike Kravetz @ 2016-11-17 19:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hillf Danton, 'Andrew Morton',
	linux-mm, 'Dr. David Alan Gilbert', 'Shaohua Li',
	'Pavel Emelyanov', 'Mike Rapoport'

On 11/17/2016 07:40 AM, Andrea Arcangeli wrote:
> On Wed, Nov 16, 2016 at 10:53:39AM -0800, Mike Kravetz wrote:
>> I was running some tests with error injection to exercise the error
>> path and noticed the reservation leaks as the system eventually ran
>> out of huge pages.  I need to think about it some more, but we may
>> want to at least do something like the following before put_page (with
>> a BIG comment):
>>
>> 	if (unlikely(PagePrivate(page)))
>> 		ClearPagePrivate(page);
>>
>> That would at least keep the global reservation count from increasing.
>> Let me look into that.
> 
> However what happens if the old vma got munmapped

When the huge page was allocated, the reservation map associated with
the vma was marked to indicate the reservation was consumed.  In addition
the global reservation count and subpool count were adjusted to account
for the page allocation.  So, when the vma gets unmapped the reservation
map will be examined.  Since the map indicates the reservation was consumed,
no adjustment will be made to the global or subpool reservation count.

>                                                   and a new compatible
> vma was instantiated and passes revalidation fine? The reserved page
> of the old vma goes to a different vma then?

No, the new vma should get a new reservation.  It can not use the old
reservation as it was associated with the old vma.  This is at least
the case for private mappings where the reservation maps are associated
with the vma.

> This reservation code is complex and has lots of special cases anyway,
> but the main concern at this point is the
> set_page_private(subpool_vma(vma)) released by
> hugetlb_vm_op_close->unlock_or_release_subpool.

Do note that set_page_private(subpool_vma(vma)) just indicates which
subpool was used when the huge page was allocated.  I do not believe
there is any connection made to the vma.  The vma is only used to get
to the inode and superblock which contains subpool information.  With
the subpool stored in page_private, the subpool count can be adjusted
at free_huge_page time.  Also note that the subpool cannot be freed
in unlock_or_release_subpool until put_page is complete for the page.
This is because the page is accounted for in spool->used_hpages.

> Aside the accounting, what about the page_private(page) subpool? It's
> used by huge_page_free which would get out of sync with vma/inode
> destruction if we release the mmap_sem.

I do not think that is the case.  Reservation and subpool adjustments
made at vma/inode destruction time are based on entries in the reservation
map.  Those entries are created/destroyed when holding mmap_sem.

> 	struct hugepage_subpool *spool =
> 		(struct hugepage_subpool *)page_private(page);
> 
> I think in the revalidation code we need to check if
> page_private(page) still matches the subpool_vma(vma), if it doesn't
> and it's a stale pointer, we can't even call put_page before fixing up
> the page_private first.

I do not think that is correct.  page_private(page) points to the subpool
used when the page was allocated.  Therefore, adjustments were made to that
subpool when the page was allocated.  We need to adjust the same subpool
when calling put_page.  I don't think there is any need to look at the
vma/subpool_vma(vma).  If it doesn't match, we certainly do not want to
adjust counts in a potentially different subpool when calling put_page.

As you said, this reservation code is complex.  It might be good if
Hillf could comment as he understands this code.

I still believe a simple call to ClearPagePrivate(page) may be all we
need to do in the error path.  If this is the case, the only downside
is that it would appear the reservation was consumed for that page.
So, subsequent faults 'might' not get a huge page.

> The other way to solve this is not to release the mmap_sem at all and
> in the slow path call __get_user_pages(nonblocking=NULL). That's
> slower than using the CPU TLB to reach the source data and it'd
> prevent also to handle a userfault in the source address of
> UFFDIO_COPY, because an userfault in the source would require the
> mmap_sem to be released (i.e. it'd require get_user_pages_fast that
> would again recurse on the mmap_sem and in turn we could as well stick
> to the current nonblocking copy-user). We currently don't handle
> nesting with non-cooperative anyway so it'd be ok for now not to
> release the mmap_sem while copying in UFFDIO_COPY.
> 
> 
> Offtopic here but while reading this code I also noticed
> free_huge_page is wasting CPU and then noticed other places wasting
> CPU.

Good catch.

> 
> From 9f3966c5bbf88cb8f702393d6a78abf1b8f960f9 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Thu, 17 Nov 2016 15:28:20 +0100
> Subject: [PATCH 1/1] hugetlbfs: use non atomic ops when the page is private
> 
> After the page has been freed it's fully private and no other CPU can
> manipulate the page structure anymore (other than get_page_unless_zero
> from speculative lookups, but those will fail because of the zero
> refcount).
> 
> The same is true when the page has been newly allocated.
> 
> So we can use faster non atomic ops for those cases.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  mm/hugetlb.c | 44 ++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 36 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 022750d..7c422c1 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1216,16 +1216,44 @@ bool page_huge_active(struct page *page)
>  }
>  
>  /* never called for tail page */
> +static __always_inline void ____set_page_huge_active(struct page *page,
> +						     bool atomic)
> +{
> +	VM_BUG_ON_PAGE(!PageHeadHuge(page), page);
> +	if (atomic)
> +		SetPagePrivate(&page[1]);
> +	else
> +		__SetPagePrivate(&page[1]);
> +}
> +
>  static void set_page_huge_active(struct page *page)
>  {
> +	____set_page_huge_active(page, true);
> +}
> +
> +static void __set_page_huge_active(struct page *page)
> +{
> +	____set_page_huge_active(page, false);
> +}
> +
> +static __always_inline void ____clear_page_huge_active(struct page *page,
> +						       bool atomic)
> +{
>  	VM_BUG_ON_PAGE(!PageHeadHuge(page), page);
> -	SetPagePrivate(&page[1]);
> +	if (atomic)
> +		ClearPagePrivate(&page[1]);
> +	else
> +		__ClearPagePrivate(&page[1]);
>  }
>  
>  static void clear_page_huge_active(struct page *page)
>  {
> -	VM_BUG_ON_PAGE(!PageHeadHuge(page), page);
> -	ClearPagePrivate(&page[1]);
> +	____clear_page_huge_active(page, true);
> +}
> +
> +static void __clear_page_huge_active(struct page *page)
> +{
> +	____clear_page_huge_active(page, false);
>  }
>  
>  void free_huge_page(struct page *page)
> @@ -1245,7 +1273,7 @@ void free_huge_page(struct page *page)
>  	VM_BUG_ON_PAGE(page_count(page), page);
>  	VM_BUG_ON_PAGE(page_mapcount(page), page);
>  	restore_reserve = PagePrivate(page);
> -	ClearPagePrivate(page);
> +	__ClearPagePrivate(page);
>  
>  	/*
>  	 * A return code of zero implies that the subpool will be under its
> @@ -1256,7 +1284,7 @@ void free_huge_page(struct page *page)
>  		restore_reserve = true;
>  
>  	spin_lock(&hugetlb_lock);
> -	clear_page_huge_active(page);
> +	__clear_page_huge_active(page);
>  	hugetlb_cgroup_uncharge_page(hstate_index(h),
>  				     pages_per_huge_page(h), page);
>  	if (restore_reserve)
> @@ -3534,7 +3562,7 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
>  	copy_user_huge_page(new_page, old_page, address, vma,
>  			    pages_per_huge_page(h));
>  	__SetPageUptodate(new_page);
> -	set_page_huge_active(new_page);
> +	__set_page_huge_active(new_page);
>  
>  	mmun_start = address & huge_page_mask(h);
>  	mmun_end = mmun_start + huge_page_size(h);
> @@ -3697,7 +3725,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		}
>  		clear_huge_page(page, address, pages_per_huge_page(h));
>  		__SetPageUptodate(page);
> -		set_page_huge_active(page);
> +		__set_page_huge_active(page);
>  
>  		if (vma->vm_flags & VM_MAYSHARE) {
>  			int err = huge_add_to_page_cache(page, mapping, idx);
> @@ -4000,7 +4028,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  	 * the set_pte_at() write.
>  	 */
>  	__SetPageUptodate(page);
> -	set_page_huge_active(page);
> +	__set_page_huge_active(page);
>  
>  	ptl = huge_pte_lockptr(h, dst_mm, dst_pte);
>  	spin_lock(ptl);
> 
> 
> 
>> I will take a look, and 'may' have a test that can be modified for this.
> 
> Overnight stress of postcopy live migration over hugetlbfs passed
> without a single glitch with the patch applied, so it's tested
> now. It'd still be good to add an O_DIRECT test to the selftest.

Great.  I did review the patch, but did not test as planned.

> The previous issue with the mmap_sem release and accounting and
> potential subpool use after free, is only about malicious apps, it'd
> be impossible to reproduce it with qemu or the current selftest, but
> we've to take care of it before I can resubmit for upstream.

Of course.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY
  2016-11-16 18:28             ` Andrea Arcangeli
  2016-11-16 18:53               ` Mike Kravetz
@ 2016-11-17 19:41               ` Mike Kravetz
  1 sibling, 0 replies; 69+ messages in thread
From: Mike Kravetz @ 2016-11-17 19:41 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hillf Danton, 'Andrew Morton',
	linux-mm, 'Dr. David Alan Gilbert', 'Shaohua Li',
	'Pavel Emelyanov', 'Mike Rapoport'

On 11/16/2016 10:28 AM, Andrea Arcangeli wrote:
> Hello Mike,
> 
> On Tue, Nov 08, 2016 at 01:06:06PM -0800, Mike Kravetz wrote:
>> -- 
>> Mike Kravetz
>>
>> From: Mike Kravetz <mike.kravetz@oracle.com>
>>
>> userfaultfd: hugetlbfs: fix __mcopy_atomic_hugetlb retry/error processing
>>
>> The new routine copy_huge_page_from_user() uses kmap_atomic() to map
>> PAGE_SIZE pages.  However, this prevents page faults in the subsequent
>> call to copy_from_user().  This is OK in the case where the routine
>> is copied with mmap_sema held.  However, in another case we want to
>> allow page faults.  So, add a new argument allow_pagefault to indicate
>> if the routine should allow page faults.
>>
>> A patch (mm/hugetlb: fix huge page reservation leak in private mapping
>> error paths) was recently submitted and is being added to -mm tree.  It
>> addresses the issue huge page reservations when a huge page is allocated,
>> and free'ed before being instantiated in an address space.  This would
>> typically happen in error paths.  The routine __mcopy_atomic_hugetlb has
>> such an error path, so it will need to call restore_reserve_on_error()
>> before free'ing the huge page.  restore_reserve_on_error is currently
>> only visible in mm/hugetlb.c.  So, add it to a header file so that it
>> can be used in mm/userfaultfd.c.  Another option would be to move
>> __mcopy_atomic_hugetlb into mm/hugetlb.c
> 
> It would have been better to split this in two patches.
> 
>> @@ -302,8 +302,10 @@ static __always_inline ssize_t
>> __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>>  out_unlock:
>>  	up_read(&dst_mm->mmap_sem);
>>  out:
>> -	if (page)
>> +	if (page) {
>> +		restore_reserve_on_error(h, dst_vma, dst_addr, page);
>>  		put_page(page);
>> +	}
>>  	BUG_ON(copied < 0);
> 
> If the revalidation fails dst_vma could even be NULL.
> 
> We get there with page not NULL only if something in the revalidation
> fails effectively... I'll have to drop the above change as the fix
> will hurt more than the vma reservation not being restored. Didn't
> think too much about it, but there was no obvious way to restore the
> reservation of a vma, after we drop the mmap_sem. However if we don't
> drop the mmap_sem, we'd recurse into it, and it'll deadlock in current
> implementation if a down_write is already pending somewhere else. In
> this specific case fairness is not an issue, but it's not checking
> it's the same thread taking it again, so it's doesn't allow to recurse
> (checking it's the same thread would make it slower).
> 
> I also fixed the gup support for userfaultfd, could you review it?
> Beware, untested... will test it shortly with qemu postcopy live
> migration with hugetlbfs instead of THP (that currently gracefully
> complains about FAULT_FLAG_ALLOW_RETRY missing, KVM ioctl returns
> badaddr and DEBUG_VM=y clearly showed the stack trace of where
> FAULT_FLAG_ALLOW_RETRY was missing).
> 
> I think this enhancement is needed by Oracle too, so that you don't
> get an error from I/O syscalls, and you instead get an userfault.
> 
> We need to update the selftest to trigger userfaults not only with the
> CPU but with O_DIRECT too.
> 
> Note, the FOLL_NOWAIT is needed to offload the userfaults to async
> page faults. KVM tries an async fault first (FOLL_NOWAIT, nonblocking
> = NULL), if that fails it offload a blocking (*nonblocking = 1) fault
> through async page fault kernel thread while guest scheduler schedule
> away the blocked process. So the userfaults behave like SSD swapins
> from disk hitting on a single guest thread and not the whole host vcpu
> thread. Clearly hugetlbfs cannot ever block for I/O, FOLL_NOWAIT is
> only useful to avoid blocking in the vcpu thread in
> handle_userfault().
> 
> From ff1ce62ee0acb14ed71621ba99f01f008a5d212d Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Wed, 16 Nov 2016 18:34:20 +0100
> Subject: [PATCH 1/1] userfaultfd: hugetlbfs: gup: support VM_FAULT_RETRY
> 
> Add support for VM_FAULT_RETRY to follow_hugetlb_page() so that
> get_user_pages_unlocked/locked and "nonblocking/FOLL_NOWAIT" features
> will work on hugetlbfs. This is required for fully functional
> userfaultfd non-present support on hugetlbfs.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
-- 
Mike Kravetz

> ---
>  include/linux/hugetlb.h |  5 +++--
>  mm/gup.c                |  2 +-
>  mm/hugetlb.c            | 48 ++++++++++++++++++++++++++++++++++++++++--------
>  3 files changed, 44 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index bf02b7e..542416d 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -65,7 +65,8 @@ int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int,
>  int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
>  long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
>  			 struct page **, struct vm_area_struct **,
> -			 unsigned long *, unsigned long *, long, unsigned int);
> +			 unsigned long *, unsigned long *, long, unsigned int,
> +			 int *);
>  void unmap_hugepage_range(struct vm_area_struct *,
>  			  unsigned long, unsigned long, struct page *);
>  void __unmap_hugepage_range_final(struct mmu_gather *tlb,
> @@ -138,7 +139,7 @@ static inline unsigned long hugetlb_total_pages(void)
>  	return 0;
>  }
>  
> -#define follow_hugetlb_page(m,v,p,vs,a,b,i,w)	({ BUG(); 0; })
> +#define follow_hugetlb_page(m,v,p,vs,a,b,i,w,n)	({ BUG(); 0; })
>  #define follow_huge_addr(mm, addr, write)	ERR_PTR(-EINVAL)
>  #define copy_hugetlb_page_range(src, dst, vma)	({ BUG(); 0; })
>  static inline void hugetlb_report_meminfo(struct seq_file *m)
> diff --git a/mm/gup.c b/mm/gup.c
> index ec4f827..36e88a9 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -572,7 +572,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  			if (is_vm_hugetlb_page(vma)) {
>  				i = follow_hugetlb_page(mm, vma, pages, vmas,
>  						&start, &nr_pages, i,
> -						gup_flags);
> +						gup_flags, nonblocking);
>  				continue;
>  			}
>  		}
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 9ce8ecb..022750d 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4039,7 +4039,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  			 struct page **pages, struct vm_area_struct **vmas,
>  			 unsigned long *position, unsigned long *nr_pages,
> -			 long i, unsigned int flags)
> +			 long i, unsigned int flags, int *nonblocking)
>  {
>  	unsigned long pfn_offset;
>  	unsigned long vaddr = *position;
> @@ -4102,16 +4102,43 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		    ((flags & FOLL_WRITE) &&
>  		      !huge_pte_write(huge_ptep_get(pte)))) {
>  			int ret;
> +			unsigned int fault_flags = 0;
>  
>  			if (pte)
>  				spin_unlock(ptl);
> -			ret = hugetlb_fault(mm, vma, vaddr,
> -				(flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0);
> -			if (!(ret & VM_FAULT_ERROR))
> -				continue;
> -
> -			remainder = 0;
> -			break;
> +			if (flags & FOLL_WRITE)
> +				fault_flags |= FAULT_FLAG_WRITE;
> +			if (nonblocking)
> +				fault_flags |= FAULT_FLAG_ALLOW_RETRY;
> +			if (flags & FOLL_NOWAIT)
> +				fault_flags |= FAULT_FLAG_ALLOW_RETRY |
> +					FAULT_FLAG_RETRY_NOWAIT;
> +			if (flags & FOLL_TRIED) {
> +				VM_WARN_ON_ONCE(fault_flags &
> +						FAULT_FLAG_ALLOW_RETRY);
> +				fault_flags |= FAULT_FLAG_TRIED;
> +			}
> +			ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
> +			if (ret & VM_FAULT_ERROR) {
> +				remainder = 0;
> +				break;
> +			}
> +			if (ret & VM_FAULT_RETRY) {
> +				if (nonblocking)
> +					*nonblocking = 0;
> +				*nr_pages = 0;
> +				/*
> +				 * VM_FAULT_RETRY must not return an
> +				 * error, it will return zero
> +				 * instead.
> +				 *
> +				 * No need to update "position" as the
> +				 * caller will not check it after
> +				 * *nr_pages is set to 0.
> +				 */
> +				return i;
> +			}
> +			continue;
>  		}
>  
>  		pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
> @@ -4140,6 +4167,11 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		spin_unlock(ptl);
>  	}
>  	*nr_pages = remainder;
> +	/*
> +	 * setting position is actually required only if remainder is
> +	 * not zero but it's faster not to add a "if (remainder)"
> +	 * branch.
> +	 */
>  	*position = vaddr;
>  
>  	return i ? i : -EFAULT;
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY
  2016-11-17 19:26                   ` Mike Kravetz
@ 2016-11-18  0:05                     ` Andrea Arcangeli
  2016-11-18  5:52                       ` Mike Kravetz
  0 siblings, 1 reply; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-18  0:05 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Hillf Danton, 'Andrew Morton',
	linux-mm, 'Dr. David Alan Gilbert', 'Shaohua Li',
	'Pavel Emelyanov', 'Mike Rapoport'

On Thu, Nov 17, 2016 at 11:26:17AM -0800, Mike Kravetz wrote:
> On 11/17/2016 07:40 AM, Andrea Arcangeli wrote:
> > On Wed, Nov 16, 2016 at 10:53:39AM -0800, Mike Kravetz wrote:
> >> I was running some tests with error injection to exercise the error
> >> path and noticed the reservation leaks as the system eventually ran
> >> out of huge pages.  I need to think about it some more, but we may
> >> want to at least do something like the following before put_page (with
> >> a BIG comment):
> >>
> >> 	if (unlikely(PagePrivate(page)))
> >> 		ClearPagePrivate(page);
> >>
> >> That would at least keep the global reservation count from increasing.
> >> Let me look into that.
> > 
> > However what happens if the old vma got munmapped
> 
> When the huge page was allocated, the reservation map associated with
> the vma was marked to indicate the reservation was consumed.  In addition
> the global reservation count and subpool count were adjusted to account
> for the page allocation.  So, when the vma gets unmapped the reservation
> map will be examined.  Since the map indicates the reservation was consumed,
> no adjustment will be made to the global or subpool reservation count.

ClearPagePrivate before put_page will simply avoid running
h->resv_huge_pages++?

Not increasing resv_huge_pages means more non-reserved allocations
will pass. That is a global value though; how is it ok to leave it
permanently lower?

If PagePrivate was set, it means alloc_huge_page already run this:

			SetPagePrivate(page);
			h->resv_huge_pages--;

But it would also have set a reserve map on the vma for that range
before that.

When the vma is destroyed the reserve is flushed back to global, minus
the consumed pages (reserve = (end - start) - region_count(resv,
start, end)).

Why then should we skip h->resv_huge_pages++ for the consumed pages by
running ClearPagePrivate?

It's not clear what was wrong in the first place, considering
put_page->free_huge_page() acts on the global stuff only?

void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
				unsigned long address, struct page *page)
{
	if (unlikely(PagePrivate(page))) {
		long rc = vma_needs_reservation(h, vma, address);

		if (unlikely(rc < 0)) {
			/*
			 * Rare out of memory condition in reserve map
			 * manipulation.  Clear PagePrivate so that
			 * global reserve count will not be incremented
			 * by free_huge_page.  This will make it appear
			 * as though the reservation for this page was
			 * consumed.  This may prevent the task from
			 * faulting in the page at a later time.  This
			 * is better than inconsistent global huge page
			 * accounting of reserve counts.
			 */
			ClearPagePrivate(page);

The ClearPagePrivate was run above because vma_needs_reservation ran
out of memory and the entry couldn't be added?

So I suppose the vma reservation wasn't possible in the above case; in
our allocation case alloc_huge_page succeeded at those reserve map
allocations:

	map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
	if (map_chg < 0)
		return ERR_PTR(-ENOMEM);
	[..]
		if (!avoid_reserve && vma_has_reserves(vma, gbl_chg)) {
			SetPagePrivate(page);
			h->resv_huge_pages--;
		}

> >                                                   and a new compatible
> > vma was instantiated and passes revalidation fine? The reserved page
> > of the old vma goes to a different vma then?
> 
> No, the new vma should get a new reservation.  It can not use the old
> reservation as it was associated with the old vma.  This is at least
> the case for private mappings where the reservation maps are associated
> with the vma.

You're not suggesting calling ClearPagePrivate in the second pass of
the "retry" loop if all goes fine and the second pass succeeds, but only
if we end up in a revalidation error at the second pass?

So the page with PagePrivate set could go to a different vma even though
the vma reserve map was accounted for in the original vma? Is that ok?

> > This reservation code is complex and has lots of special cases anyway,
> > but the main concern at this point is the
> > set_page_private(subpool_vma(vma)) released by
> > hugetlb_vm_op_close->unlock_or_release_subpool.
> 
> Do note that set_page_private(subpool_vma(vma)) just indicates which
> subpool was used when the huge page was allocated.  I do not believe
> there is any connection made to the vma.  The vma is only used to get
> to the inode and superblock which contains subpool information.  With
> the subpool stored in page_private, the subpool count can be adjusted
> at free_huge_page time.  Also note that the subpool can not be free'ed
> in unlock_or_release_subpool until put_page is complete for the page.
> This is because the page is accounted for in spool->used_hpages.

Yes, I figured out used_hpages myself shortly after. So there's no risk
of use after free on the subpool pointed to by the page, at least.

I also considered shutting down this accounting entirely by calling
alloc_huge_page(allow_reserve = 0) in hugetlbfs mcopy atomic... Can't
we start that way so we don't have to worry about the reservation
accounting at all?
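
Just to spell out the idea (a sketch only, assuming the existing
avoid_reserve parameter of alloc_huge_page() is the right knob and that
"allow_reserve = 0" above meant the same thing):

	/*
	 * hugetlb_mcopy_atomic_pte(): allocate the destination page
	 * without consuming a reservation, so the UFFDIO_COPY error
	 * paths never have a reservation to give back (error handling
	 * elided here).
	 */
	page = alloc_huge_page(dst_vma, dst_addr, 1 /* avoid_reserve */);
	if (IS_ERR(page))
		goto out;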

> > Aside the accounting, what about the page_private(page) subpool? It's
> > used by huge_page_free which would get out of sync with vma/inode
> > destruction if we release the mmap_sem.
> 
> I do not think that is the case.  Reservation and subpool adjustments
> made at vma/inode destruction time are based on entries in the reservation
> map.  Those entries are created/destroyed when holding mmap_sem.
> 
> > 	struct hugepage_subpool *spool =
> > 		(struct hugepage_subpool *)page_private(page);
> > 
> > I think in the revalidation code we need to check if
> > page_private(page) still matches the subpool_vma(vma), if it doesn't
> > and it's a stale pointer, we can't even call put_page before fixing up
> > the page_private first.
> 
> I do not think that is correct.  page_private(page) points to the subpool
> used when the page was allocated.  Therefore, adjustments were made to that
> subpool when the page was allocated.  We need to adjust the same subpool
> when calling put_page.  I don't think there is any need to look at the
> vma/subpool_vma(vma).  If it doesn't match, we certainly do not want to
> adjust counts in a potentially different subpool when calling page_put.

Isn't the subpool different for every mountpoint of hugetlbfs?

The old vma subpool can't be a stale pointer because of used_hpages,
but if there are two different hugetlbfs mounts the subpool comes from
the superblock, so it may change after we release the mmap_sem.

Don't we have to add a check for the new vma subpool change against
the page->private?

Otherwise we'd be putting the page in some other subpool than the one
it was allocated from, as long as they pass the vma_hpagesize !=
vma_kernel_pagesize(dst_vma) check.

> As you said, this reservation code is complex.  It might be good if
> Hillf could comment as he understands this code.
> 
> I still believe a simple call to ClearPagePrivate(page) may be all we
> need to do in the error path.  If this is the case, the only downside
> is that it would appear the reservation was consumed for that page.
> So, subsequent faults 'might' not get a huge page.

I thought running out of hugepages was what you already experienced
with the current code when using error injection.

> Good catch.

Eh, that was an easy part :).

> Great.  I did review the patch, but did not test as planned.

Thanks,
Andrea


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 25/33] userfaultfd: shmem: add userfaultfd hook for shared memory faults
  2016-11-04 15:44     ` Mike Rapoport
  2016-11-04 16:56       ` Andrea Arcangeli
@ 2016-11-18  0:37       ` Andrea Arcangeli
  2016-11-20 12:10         ` Mike Rapoport
  1 sibling, 1 reply; 69+ messages in thread
From: Andrea Arcangeli @ 2016-11-18  0:37 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Hillf Danton, 'Andrew Morton',
	linux-mm, 'Mike Kravetz',
	'Dr. David Alan Gilbert', 'Shaohua Li',
	'Pavel Emelyanov'

Hello,

I found a minor issue with the non-cooperative testcase: sometimes a
userfault would trigger in between UFFD_EVENT_MADVDONTNEED and
UFFDIO_UNREGISTER:

		case UFFD_EVENT_MADVDONTNEED:
			uffd_reg.range.start = msg.arg.madv_dn.start;
			uffd_reg.range.len = msg.arg.madv_dn.end -
				msg.arg.madv_dn.start;
			if (ioctl(uffd, UFFDIO_UNREGISTER, &uffd_reg.range))

It always triggered at the nr == 0 iteration:

	for (nr = 0; nr < nr_pages; nr++) {
		if (my_bcmp(area_dst + nr * page_size, zeropage, page_size))

The userfault was still pending after UFFDIO_UNREGISTER returned,
leading to poll() getting a UFFD_EVENT_PAGEFAULT and trying to do a
UFFDIO_COPY into the unregistered range, which gracefully results in -EINVAL.

So this could all be handled in userland, by storing the MADV_DONTNEED
range and calling UFFDIO_WAKE instead of UFFDIO_COPY (see the sketch
below)... but I think it's more reliable to fix it in the kernel.
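
For illustration, a rough and untested userland-only sketch of that
idea. The helper names and the static range variables are made up; only
the uffd ioctls and the uffd_msg fields used by this series are real:

#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Remember the last MADV_DONTNEED range reported on the uffd and
 * resolve faults falling inside it with UFFDIO_WAKE instead of
 * UFFDIO_COPY.
 */
static unsigned long dontneed_start, dontneed_len;

static void note_madvdontneed(const struct uffd_msg *msg)
{
	dontneed_start = msg->arg.madv_dn.start;
	dontneed_len = msg->arg.madv_dn.end - msg->arg.madv_dn.start;
}

/* returns 0 if the fault was woken, -1 if it must be UFFDIO_COPY'd */
static int maybe_wake_fault(int uffd, const struct uffd_msg *msg,
			    unsigned long page_size)
{
	unsigned long addr = msg->arg.pagefault.address & ~(page_size - 1);
	struct uffdio_range range;

	if (addr < dontneed_start || addr >= dontneed_start + dontneed_len)
		return -1;

	range.start = addr;
	range.len = page_size;
	return ioctl(uffd, UFFDIO_WAKE, &range);
}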

If a pending userfault happens before UFFDIO_UNREGISTER, it'll just
behave as if it had happened after.

I also noticed the order of the uffd notification of MADV_DONTNEED and
the pagetable zap was wrong: we have to notify userland first, so it
won't risk calling UFFDIO_COPY while the process runs zap_page_range.

With the two patches appended below the -EINVAL error out of
UFFDIO_COPY is gone.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY
  2016-11-18  0:05                     ` Andrea Arcangeli
@ 2016-11-18  5:52                       ` Mike Kravetz
  2016-11-22  1:16                         ` Mike Kravetz
  0 siblings, 1 reply; 69+ messages in thread
From: Mike Kravetz @ 2016-11-18  5:52 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hillf Danton, 'Andrew Morton',
	linux-mm, 'Dr. David Alan Gilbert', 'Shaohua Li',
	'Pavel Emelyanov', 'Mike Rapoport'

On 11/17/2016 04:05 PM, Andrea Arcangeli wrote:
> On Thu, Nov 17, 2016 at 11:26:17AM -0800, Mike Kravetz wrote:
>> On 11/17/2016 07:40 AM, Andrea Arcangeli wrote:
>>> On Wed, Nov 16, 2016 at 10:53:39AM -0800, Mike Kravetz wrote:
>>>> I was running some tests with error injection to exercise the error
>>>> path and noticed the reservation leaks as the system eventually ran
>>>> out of huge pages.  I need to think about it some more, but we may
>>>> want to at least do something like the following before put_page (with
>>>> a BIG comment):
>>>>
>>>> 	if (unlikely(PagePrivate(page)))
>>>> 		ClearPagePrivate(page);
>>>>
>>>> That would at least keep the global reservation count from increasing.
>>>> Let me look into that.
>>>
>>> However what happens if the old vma got munmapped
>>
>> When the huge page was allocated, the reservation map associated with
>> the vma was marked to indicate the reservation was consumed.  In addition
>> the global reservation count and subpool count were adjusted to account
>> for the page allocation.  So, when the vma gets unmapped the reservation
>> map will be examined.  Since the map indicates the reservation was consumed,
>> no adjustment will be made to the global or subpool reservation count.
> 
> ClearPagePrivate before put_page, will simply avoid to run
> h->resv_huge_pages++?

Correct

> Not increasing resv_huge_pages means more non reserved allocations
> will pass. That is a global value though, how is it ok to leave it
> permanently lower?

Recall that h->resv_huge_pages is incremented when a mapping/vma is
created to indicate that reservations exist. It is decremented when
a huge page is allocated.  In addition, the reservation map is
initially set up to indicate that reservations exist for the
pages in the mapping. When a page is allocated the map is modified
to indicate a reservation was consumed.

So in this path, resv_huge_pages was decremented as the page was allocated
and the reserve map indicates the reservation was consumed.  At this
time, resv_huge_pages and the reserve map accurately reflect the
state of reservations in the vma and globally.

In this path, we were unable to instantiate the page in the mapping/vma
so we must free the page.  The "ideal" solution would be to adjust the
reserve map to indicate the reservation was not consumed, and increment
resv_huge_pages.  However, since we dropped mmap_sem we can not adjust
the reserve map.  So, the question is what should be done with
resv_huge_pages?  There are only two options.
1) Leave it "as is" when the page is free'ed.
   In this case, the reserve map will indicate the reservation was consumed
   but there will not be an associated page in the mapping.  Therefore,
   it is possible that a subsequent fault can happen on the page.  At that
   time, it will appear as though the reservation for the page was already
   consumed.  Therefore, if there are not any available pages the fault will
   fail (ENOMEM).  Note that this only impacts subsequent faults of this
   specific mapping/page.
2) Increment it when the page is free'ed.
   As above, the reserve map will still indicate the reservation was
   consumed, and subsequent faults can only be satisfied if there are
   available pages.  However, there is now a global reservation, as
   indicated by resv_huge_pages, that is not associated with any mapping.
   What this means is that there is a reserved page that nobody can use.
   The reservation is being held for a mapping that has already consumed
   the reservation.

> If PagePrivate was set, it means alloc_huge_page already run this:
> 
> 			SetPagePrivate(page);
> 			h->resv_huge_pages--;
> 
> But it would also have set a reserve map on the vma for that range
> before that.
> 
> When the vma is destroyed the reserve is flushed back to global, minus
> the consumed pages (reserve = (end - start) - region_count(resv,
> start, end)).

Consider a very simple 1-page vma.  Obviously, (end - start) = 1 and,
as mentioned above, the reserve map was adjusted to indicate the reservation
was consumed.  So, region_count(resv, start, end) is also going to be 1
and reserve will be 0.  Therefore, no adjustment of resv_huge_pages will
be made when the vma is destroyed.

The same happens for the single page within a larger vma.
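
To make that arithmetic concrete, here is a tiny standalone toy (plain
userspace C, not kernel code) that just mirrors the quoted formula
reserve = (end - start) - region_count(resv, start, end) for such a
hypothetical 1-page vma:

#include <stdio.h>

int main(void)
{
	/* all numbers are illustrative, in units of huge pages */
	long npages = 1;	/* (end - start) for the 1-page vma      */
	long consumed = 1;	/* region_count(): reservation was used  */
	long reserve = npages - consumed;

	/*
	 * 0 means hugetlb_vm_op_close() flushes nothing back to the
	 * global pool, so resv_huge_pages stays wherever the error
	 * path left it.
	 */
	printf("flushed back at unmap: %ld huge pages\n", reserve);
	return 0;
}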

> Why should then we skip h->resv_huge_pages++ for the consumed pages by
> running ClearPagePrivate?
> 
> It's not clear was wrong in the first place considering
> put_page->free_huge_page() acts on the global stuff only?

See the description of 2) above.

> void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
> 				unsigned long address, struct page *page)
> {
> 	if (unlikely(PagePrivate(page))) {
> 		long rc = vma_needs_reservation(h, vma, address);
> 
> 		if (unlikely(rc < 0)) {
> 			/*
> 			 * Rare out of memory condition in reserve map
> 			 * manipulation.  Clear PagePrivate so that
> 			 * global reserve count will not be incremented
> 			 * by free_huge_page.  This will make it appear
> 			 * as though the reservation for this page was
> 			 * consumed.  This may prevent the task from
> 			 * faulting in the page at a later time.  This
> 			 * is better than inconsistent global huge page
> 			 * accounting of reserve counts.
> 			 */
> 			ClearPagePrivate(page);
> 
> The ClearPagePrivate was run above because vma_needs_reservation run
> out of memory and couldn't be added?

restore_reserve_on_error tries to do the "ideal" cleanup by adjusting
the reserve map to indicate the reservation was not consumed.  If it
can do this, then it does not ClearPagePrivate(page), as the global
reservation count SHOULD be incremented now that the reserve map
indicates the reservation was not consumed.

The only time it can not adjust the reserve map is if the reserve map
manipulation routines fail.  They would only fail if they can not
allocate a 32 byte structure used to track reservations.  As noted in
the comments this is rare, and if we can't allocate 32 bytes I suspect
there are other issues.  But, in this case ClearPagePrivate is run so
that when the page is free'ed the global count will be consistent with
the reserve map.  This is essentially why I think we should do the same
in __mcopy_atomic_hugetlb.

> So I suppose the vma reservation wasn't possible in the above case, in
> our allocation case alloc_huge_page succeeded at those reserve maps
> allocations:
> 
> 	map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
> 	if (map_chg < 0)
> 		return ERR_PTR(-ENOMEM);
> 	[..]
> 		if (!avoid_reserve && vma_has_reserves(vma, gbl_chg)) {
> 			SetPagePrivate(page);
> 			h->resv_huge_pages--;
> 		}
> 

It is slightly different.  In alloc_huge_page we are adjusting the reserve
map to indicate the reservation was consumed.  In restore_reserve_on_error
we are adjusting the map to indicate the reservation was not consumed.

>>>                                                   and a new compatible
>>> vma was instantiated and passes revalidation fine? The reserved page
>>> of the old vma goes to a different vma then?
>>
>> No, the new vma should get a new reservation.  It can not use the old
>> reservation as it was associated with the old vma.  This is at least
>> the case for private mappings where the reservation maps are associated
>> with the vma.
> 
> You're not suggesting to call ClearPagePrivate in the second pass of
> the "retry" loop if all goes fine and second pass succeeds, but only if
> we end up in a error of revalidation at the second pass?

That is correct.  Only on the error path.  We will only call put_page
on the error path and we only call ClearPagePrivate before put_page.

> 
> So the page with PagePrivate set could go to a different vma despite
> the vma reserve map was accounted for in the original vma? Is that ok?

Yes, see the potential issue with subsequent faults.  It is not the
"ideal" case that would be done in restore_reserve_on_error, but I
think it is the best alternative.  And, I'm guessing this error
path is not something we will hit frequently?  Or, do you think it
will be exercised often?

>>> This reservation code is complex and has lots of special cases anyway,
>>> but the main concern at this point is the
>>> set_page_private(subpool_vma(vma)) released by
>>> hugetlb_vm_op_close->unlock_or_release_subpool.
>>
>> Do note that set_page_private(subpool_vma(vma)) just indicates which
>> subpool was used when the huge page was allocated.  I do not believe
>> there is any connection made to the vma.  The vma is only used to get
>> to the inode and superblock which contains subpool information.  With
>> the subpool stored in page_private, the subpool count can be adjusted
>> at free_huge_page time.  Also note that the subpool can not be free'ed
>> in unlock_or_release_subpool until put_page is complete for the page.
>> This is because the page is accounted for in spool->used_hpages.
> 
> Yes I figured myself shortly later used_hpages. So there's no risk of
> use after free on the subpool pointed by the page at least.
> 
> I also considered shutting down this accounting entirely by calling
> alloc_huge_page(allow_reserve = 0) in hugetlbfs mcopy atomic... Can't
> we start that way so we don't have to worry about the reservation
> accounting at all?

Does "allow_reserve = 0" indicate alloc_huge_page(... avoid_reserve = 1)?
I think, that is what you are asking.

Well we could do that.  But, it could cause failures.  Again consider
an overly simple case of a 1 page vma.  Also, suppose there is only one
huge page in the system.  When the vma is created/mapped the one huge
page is reserved.  However, we call alloc_huge_page(...avoid_reserve = 1)
the allocation will fail as we indicated that reservations should not
be used.  If there was another page in the system, or we were configured
to allocate surplus pages then it may succeed.
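
A self-contained toy of that worst case (plain C, no kernel calls, all
names and numbers made up), just to show why an allocation that refuses
to touch reserves fails when the only free huge page is already
reserved:

#include <stdio.h>
#include <stdbool.h>

static long free_huge_pages = 1;	/* one huge page in the system */
static long resv_huge_pages = 1;	/* reserved at mmap time       */

/*
 * Crude model: an allocation avoiding reserves only succeeds if there
 * are free pages beyond the reserved ones; an allocation allowed to
 * consume the vma's reserve can take the reserved page.
 */
static bool toy_alloc(bool avoid_reserve)
{
	if (avoid_reserve)
		return free_huge_pages - resv_huge_pages > 0;
	return free_huge_pages > 0;
}

int main(void)
{
	printf("avoid_reserve=1 -> %s\n", toy_alloc(true) ? "ok" : "fails");
	printf("avoid_reserve=0 -> %s\n", toy_alloc(false) ? "ok" : "fails");
	return 0;
}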

>>> Aside the accounting, what about the page_private(page) subpool? It's
>>> used by huge_page_free which would get out of sync with vma/inode
>>> destruction if we release the mmap_sem.
>>
>> I do not think that is the case.  Reservation and subpool adjustments
>> made at vma/inode destruction time are based on entries in the reservation
>> map.  Those entries are created/destroyed when holding mmap_sem.
>>
>>> 	struct hugepage_subpool *spool =
>>> 		(struct hugepage_subpool *)page_private(page);
>>>
>>> I think in the revalidation code we need to check if
>>> page_private(page) still matches the subpool_vma(vma), if it doesn't
>>> and it's a stale pointer, we can't even call put_page before fixing up
>>> the page_private first.
>>
>> I do not think that is correct.  page_private(page) points to the subpool
>> used when the page was allocated.  Therefore, adjustments were made to that
>> subpool when the page was allocated.  We need to adjust the same subpool
>> when calling put_page.  I don't think there is any need to look at the
>> vma/subpool_vma(vma).  If it doesn't match, we certainly do not want to
>> adjust counts in a potentially different subpool when calling page_put.
> 
> Isn't the subpool different for every mountpoint of hugetlbfs?

yes.

> 
> The old vma subpool can't be a stale pointer, because of the
> used_hpages but if there are two different hugetlbfs mounts the
> subpool seems to come from the superblock so it may change after we
> release the mmap_sem.
> 
> Don't we have to add a check for the new vma subpool change against
> the page->private?
> 
> Otherwise we'd be putting the page in some other subpool than the one
> it was allocated from, as long as they pass the vma_hpagesize !=
> vma_kernel_pagesize(dst_vma) check.

I don't think there is an issue.  page->private will be set to point to
the subpool where the page was originally charged.  That will not change
until the put_page, and the same subpool will be used to adjust the
count.  The page can't go to another vma without first passing through
free_huge_page and doing proper subpool accounting.  If it does go to
another vma, then alloc_huge_page will set it to the proper subpool.
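
A small standalone model of that invariant (plain C, not the kernel
source; the struct and function names are made up): the free path
always uncharges the subpool recorded at allocation time, and clearing
the private flag only suppresses the global reserve increment:

#include <stdio.h>
#include <stdbool.h>

struct subpool { long used_hpages; };

struct hpage {
	struct subpool *spool;	/* stands in for page_private(page) */
	bool private_flag;	/* stands in for PagePrivate(page)  */
};

static long resv_huge_pages;

static void model_free_huge_page(struct hpage *page)
{
	if (page->private_flag)		/* skipped if cleared on the error path */
		resv_huge_pages++;
	page->spool->used_hpages--;	/* always the subpool it was charged to */
}

int main(void)
{
	struct subpool mount_a = { .used_hpages = 1 };
	struct hpage page = { .spool = &mount_a, .private_flag = true };

	page.private_flag = false;	/* the error-path ClearPagePrivate */
	model_free_huge_page(&page);

	printf("resv_huge_pages=%ld, mount_a.used_hpages=%ld\n",
	       resv_huge_pages, mount_a.used_hpages);
	return 0;
}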

>> As you said, this reservation code is complex.  It might be good if
>> Hillf could comment as he understands this code.
>>
>> I still believe a simple call to ClearPagePrivate(page) may be all we
>> need to do in the error path.  If this is the case, the only downside
>> is that it would appear the reservation was consumed for that page.
>> So, subsequent faults 'might' not get a huge page.
> 
> I thought running out of hugepages is what you experienced already
> with the current code if using error injection.

Yes, what I experienced is described in 2) of my first comment in this
e-mail.  I would eventually end up with all huge pages being "reserved"
and unable to use/allocate them.  So, none of the huge pages in the system
(or memory associated with them) could be used until reboot. :(

-- 
Mike Kravetz

> 
>> Good catch.
> 
> Eh, that was an easy part :).
> 
>> Great.  I did review the patch, but did not test as planned.
> 
> Thanks,
> Andrea
> 


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 25/33] userfaultfd: shmem: add userfaultfd hook for shared memory faults
  2016-11-18  0:37       ` Andrea Arcangeli
@ 2016-11-20 12:10         ` Mike Rapoport
  0 siblings, 0 replies; 69+ messages in thread
From: Mike Rapoport @ 2016-11-20 12:10 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hillf Danton, 'Andrew Morton',
	linux-mm, 'Mike Kravetz',
	'Dr. David Alan Gilbert', 'Shaohua Li',
	'Pavel Emelyanov'

On Fri, Nov 18, 2016 at 01:37:34AM +0100, Andrea Arcangeli wrote:
> Hello,
> 
> I found a minor issue with the non cooperative testcase, sometime an
> userfault would trigger in between UFFD_EVENT_MADVDONTNEED and
> UFFDIO_UNREGISTER:
> 
> 		case UFFD_EVENT_MADVDONTNEED:
> 			uffd_reg.range.start = msg.arg.madv_dn.start;
> 			uffd_reg.range.len = msg.arg.madv_dn.end -
> 				msg.arg.madv_dn.start;
> 			if (ioctl(uffd, UFFDIO_UNREGISTER, &uffd_reg.range))
> 
> It always triggered at the nr == 0:
> 
> 	for (nr = 0; nr < nr_pages; nr++) {
> 		if (my_bcmp(area_dst + nr * page_size, zeropage, page_size))
> 
> The userfault still pending after UFFDIO_UNREGISTER returned, lead to
> poll() getting a UFFD_EVENT_PAGEFAULT and trying to do a UFFDIO_COPY
> into the unregistered range, which gracefully results in -EINVAL.
> 
> So this could be all handled in userland, by storing the MADV_DONTNEED
> range and calling UFFDIO_WAKE instead of UFFDIO_COPY... but I think
> it's more reliable to fix it into the kernel.
> 
> If a pending userfault happens before UFFDIO_UNREGISTER it'll just
> behave like if it happened after.
> 
> I also noticed the order of uffd notification of MADV_DONTNEED and the
> pagetable zap was wrong, we've to notify userland first so it won't
> risk to call UFFDIO_COPY while the process runs zap_page_range.
> 
> With the two patches appended below the -EINVAL error out of
> UFFDIO_COPY is gone.
> 
> From fc27d209e566d95e8ae0eb83a703aa4e02316b4c Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Thu, 17 Nov 2016 20:15:50 +0100
> Subject: [PATCH 1/2] userfaultfd: non-cooperative: avoid MADV_DONTNEED race
>  condition
> 
> MADV_DONTNEED must be notified to userland before the pages are
> zapped. This allows userland to immediately stop adding pages to the
> userfaultfd ranges before the pages are actually zapped or there could
> be non-zeropage leftovers as result of concurrent UFFDIO_COPY run in
> between zap_page_range and madvise_userfault_dontneed (both
> MADV_DONTNEED and UFFDIO_COPY runs under the mmap_sem for reading, so
> they can run concurrently).
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>

> ---
>  mm/madvise.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 7168bc6..4d4c7f8 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -476,8 +476,8 @@ static long madvise_dontneed(struct vm_area_struct *vma,
>  	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
>  		return -EINVAL;
> 
> -	zap_page_range(vma, start, end - start, NULL);
>  	madvise_userfault_dontneed(vma, prev, start, end);
> +	zap_page_range(vma, start, end - start, NULL);
>  	return 0;
>  }
> 
> 
> 
> From 18e7b30cf82c927af4c0323a6caac20184a03ff4 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Thu, 17 Nov 2016 20:20:40 +0100
> Subject: [PATCH 2/2] userfaultfd: non-cooperative: wake userfaults after
>  UFFDIO_UNREGISTER
> 
> Userfaults may still happen after the userfaultfd monitor thread
> received a UFFD_EVENT_MADVDONTNEED until UFFDIO_UNREGISTER is run.
> 
> Wake any pending userfault within UFFDIO_UNREGISTER protected by the
> mmap_sem for writing, so they will not be reported to userland leading
> to UFFDIO_COPY returning -EINVAL (as the range was already
> unregistered) and they will not hang permanently either.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>

> ---
>  fs/userfaultfd.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 2b75fab..42168d3 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1267,6 +1267,19 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
>  			start = vma->vm_start;
>  		vma_end = min(end, vma->vm_end);
> 
> +		if (userfaultfd_missing(vma)) {
> +			/*
> +			 * Wake any concurrent pending userfault while
> +			 * we unregister, so they will not hang
> +			 * permanently and it avoids userland to call
> +			 * UFFDIO_WAKE explicitly.
> +			 */
> +			struct userfaultfd_wake_range range;
> +			range.start = start;
> +			range.len = vma_end - start;
> +			wake_userfault(vma->vm_userfaultfd_ctx.ctx, &range);
> +		}
> +
>  		new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
>  		prev = vma_merge(mm, prev, start, vma_end, new_flags,
>  				 vma->anon_vma, vma->vm_file, vma->vm_pgoff,
> 


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY
  2016-11-18  5:52                       ` Mike Kravetz
@ 2016-11-22  1:16                         ` Mike Kravetz
  2016-11-23  6:38                           ` Hillf Danton
  0 siblings, 1 reply; 69+ messages in thread
From: Mike Kravetz @ 2016-11-22  1:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hillf Danton, 'Andrew Morton',
	linux-mm, 'Dr. David Alan Gilbert', 'Shaohua Li',
	'Pavel Emelyanov', 'Mike Rapoport'

On 11/17/2016 09:52 PM, Mike Kravetz wrote:
>>> I still believe a simple call to ClearPagePrivate(page) may be all we
>>> need to do in the error path.  If this is the case, the only downside
>>> is that it would appear the reservation was consumed for that page.
>>> So, subsequent faults 'might' not get a huge page.
>>
>> I thought running out of hugepages is what you experienced already
>> with the current code if using error injection.
> 
> Yes, what I experienced is described in 2) of my first comment in this
> e-mail.  I would eventually end up with all huge pages being "reserved"
> and unable to use/allocate them.  So, none of the huge pages in the system
> (or memory associated with them) could be used until reboot. :(

I am not sure if you are convinced ClearPagePrivate is an acceptable
solution to this issue.  If you are, here is the simple patch to add
it, with an appropriate comment.

From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Mon, 21 Nov 2016 14:16:33 -0800
Subject: [PATCH] userfaultfd: hugetlbfs: reserve count on error in
 __mcopy_atomic_hugetlb

If __mcopy_atomic_hugetlb exits with an error, put_page will be called
if a huge page was allocated and needs to be freed.  If a reservation
was associated with the huge page, the PagePrivate flag will be set.
Clear PagePrivate before calling put_page/free_huge_page so that the
global reservation count is not incremented.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/userfaultfd.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index b565481..d56ba83 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -303,8 +303,23 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 out_unlock:
 	up_read(&dst_mm->mmap_sem);
 out:
-	if (page)
+	if (page) {
+		/*
+		 * We encountered an error and are about to free a newly
+		 * allocated huge page.  It is possible that there was a
+		 * reservation associated with the page that has been
+		 * consumed.  See the routine restore_reserve_on_error
+		 * for details.  Unfortunately, we can not call
+		 * restore_reserve_on_error now as it would require holding
+		 * mmap_sem.  Clear the PagePrivate flag so that the global
+		 * reserve count will not be incremented in free_huge_page.
+		 * The reservation map will still indicate the reservation
+		 * was consumed and possibly prevent later page allocation.
+		 * This is better than leaking a global reservation.
+		 */
+		ClearPagePrivate(page);
 		put_page(page);
+	}
 	BUG_ON(copied < 0);
 	BUG_ON(err > 0);
 	BUG_ON(!copied && !err);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY
  2016-11-22  1:16                         ` Mike Kravetz
@ 2016-11-23  6:38                           ` Hillf Danton
  2016-12-15 19:02                             ` Andrea Arcangeli
  0 siblings, 1 reply; 69+ messages in thread
From: Hillf Danton @ 2016-11-23  6:38 UTC (permalink / raw)
  To: 'Mike Kravetz', 'Andrea Arcangeli'
  Cc: 'Andrew Morton',
	linux-mm, 'Dr. David Alan Gilbert', 'Shaohua Li',
	'Pavel Emelyanov', 'Mike Rapoport'

On Tuesday, November 22, 2016 9:17 AM Mike Kravetz wrote:
> I am not sure if you are convinced ClearPagePrivate is an acceptable
> solution to this issue.  If you do, here is the simple patch to add
> it and an appropriate comment.
> 
Hi Mike and Andrea

Sorry for my jumping in.

In commit 07443a85ad
("mm, hugetlb: return a reserved page to a reserved pool if failed")
the newly allocated huge page gets cleared for a successful COW.

I'm wondering if we can handle our error path along those lines?

Obviously I could be missing the points you are concerned about.

thanks
Hillf
> 
> If __mcopy_atomic_hugetlb exits with an error, put_page will be called
> if a huge page was allocated and needs to be freed.  If a reservation
> was associated with the huge page, the PagePrivate flag will be set.
> Clear PagePrivate before calling put_page/free_huge_page so that the
> global reservation count is not incremented.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  mm/userfaultfd.c | 17 ++++++++++++++++-
>  1 file changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index b565481..d56ba83 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -303,8 +303,23 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  out_unlock:
>  	up_read(&dst_mm->mmap_sem);
>  out:
> -	if (page)
> +	if (page) {
> +		/*
> +		 * We encountered an error and are about to free a newly
> +		 * allocated huge page.  It is possible that there was a
> +		 * reservation associated with the page that has been
> +		 * consumed.  See the routine restore_reserve_on_error
> +		 * for details.  Unfortunately, we can not call
> +		 * restore_reserve_on_error now as it would require holding
> +		 * mmap_sem.  Clear the PagePrivate flag so that the global
> +		 * reserve count will not be incremented in free_huge_page.
> +		 * The reservation map will still indicate the reservation
> +		 * was consumed and possibly prevent later page allocation.
> +		 * This is better than leaking a global reservation.
> +		 */
> +		ClearPagePrivate(page);
>  		put_page(page);
> +	}
>  	BUG_ON(copied < 0);
>  	BUG_ON(err > 0);
>  	BUG_ON(!copied && !err);
> --
> 2.7.4


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY
  2016-11-23  6:38                           ` Hillf Danton
@ 2016-12-15 19:02                             ` Andrea Arcangeli
  2016-12-16  3:54                               ` Hillf Danton
  0 siblings, 1 reply; 69+ messages in thread
From: Andrea Arcangeli @ 2016-12-15 19:02 UTC (permalink / raw)
  To: Hillf Danton
  Cc: 'Mike Kravetz', 'Andrew Morton',
	linux-mm, 'Dr. David Alan Gilbert', 'Shaohua Li',
	'Pavel Emelyanov', 'Mike Rapoport'

On Wed, Nov 23, 2016 at 02:38:37PM +0800, Hillf Danton wrote:
> On Tuesday, November 22, 2016 9:17 AM Mike Kravetz wrote:
> > I am not sure if you are convinced ClearPagePrivate is an acceptable
> > solution to this issue.  If you do, here is the simple patch to add
> > it and an appropriate comment.
> > 
> Hi Mike and Andrea
> 
> Sorry for my jumping in.
> 
> In commit 07443a85ad
> ("mm, hugetlb: return a reserved page to a reserved pool if failed")
> newly allocated huge page gets cleared for a successful COW.
> 
> I'm wondering if we can handle our error path along that way?
> 
> Obvious I could miss the points you are concerning.

The hugepage allocation toggles the region covering the page in the
vma reservations, so when the vma is virtually unmapped, those regions
that got toggled are considered not reserved and the global
reservation is not decreased.

Because the global reservation is decreased by the same page
allocation that sets the page private flag after toggling the virtual
regions, the page private flag shall be cleared when the page is
finally mapped in userland, as it's not reserved anymore. This way
when the page is freed, the global reservation will not be increased
(and when the vma is unmapped the reservation will not be decreased
either, because of the region toggling above).

hugetlb_mcopy_atomic_pte is already correctly doing:

	ClearPagePrivate(page);
	hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);

while mapping the hugepage in userland.

The issue is that if we can't reach hugetlb_mcopy_atomic_pte because
userland screws with the vmas while UFFDIO_COPY has released the
mmap_sem, then at the point where we error out the vma is out of sync,
because we had to drop the mmap_sem in the first place. So we can't
toggle the vma virtual region covering the page back to its original
state (i.e. reserved). That's what restore_reserve_on_error would try
to achieve, but we can't run it as the vma we got in the error path is
stale.

All we know is that one more page will be considered not reserved when
the vma is unmapped, so the global reservation will be decreased by one
page less at that point. In turn, when freeing such a hugepage in the
error path, we have to prevent the global reserve from being increased
once again, and to do so we have to clear the page private flag before
freeing the hugepage.

I already applied Mike's patch that clears the page private flag in
the error path. If anything is incorrect in the explanation above, let
me know.

Thanks,
Andrea


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY
  2016-12-15 19:02                             ` Andrea Arcangeli
@ 2016-12-16  3:54                               ` Hillf Danton
  0 siblings, 0 replies; 69+ messages in thread
From: Hillf Danton @ 2016-12-16  3:54 UTC (permalink / raw)
  To: 'Andrea Arcangeli'
  Cc: 'Mike Kravetz', 'Andrew Morton',
	linux-mm, 'Dr. David Alan Gilbert', 'Shaohua Li',
	'Pavel Emelyanov', 'Mike Rapoport'

On Friday, December 16, 2016 3:03 AM Andrea Arcangeli wrote:
> I already applied Mike's patch that clears the page private flag in
> the error path. 
> 
Glad to hear it:)

Happy Christmas Andrea.

thanks
Hillf


^ permalink raw reply	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2016-12-16  3:54 UTC | newest]

Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 01/33] userfaultfd: document _IOR/_IOW Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 02/33] userfaultfd: correct comment about UFFD_FEATURE_PAGEFAULT_FLAG_WP Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 03/33] userfaultfd: convert BUG() to WARN_ON_ONCE() Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 04/33] userfaultfd: use vma_is_anonymous Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 05/33] userfaultfd: non-cooperative: Split the find_userfault() routine Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 06/33] userfaultfd: non-cooperative: Add ability to report non-PF events from uffd descriptor Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 07/33] userfaultfd: non-cooperative: report all available features to userland Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 08/33] userfaultfd: non-cooperative: Add fork() event Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 09/33] userfaultfd: non-cooperative: Add fork() event, build warning fix Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 10/33] userfaultfd: non-cooperative: dup_userfaultfd: use mm_count instead of mm_users Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 11/33] userfaultfd: non-cooperative: Add mremap() event Andrea Arcangeli
2016-11-03  7:41   ` Hillf Danton
2016-11-03 17:52     ` Mike Rapoport
2016-11-04 15:40     ` Mike Rapoport
2016-11-02 19:33 ` [PATCH 12/33] userfaultfd: non-cooperative: Add madvise() event for MADV_DONTNEED request Andrea Arcangeli
2016-11-03  8:01   ` Hillf Danton
2016-11-03 17:24     ` Mike Rapoport
2016-11-04 16:40       ` [PATCH 12/33] userfaultfd: non-cooperative: Add madvise() event for MADV_DONTNEED requestg Andrea Arcangeli
2016-11-04 15:42     ` [PATCH 12/33] userfaultfd: non-cooperative: Add madvise() event for MADV_DONTNEED request Mike Rapoport
2016-11-02 19:33 ` [PATCH 13/33] userfaultfd: hugetlbfs: add copy_huge_page_from_user for hugetlb userfaultfd support Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 14/33] userfaultfd: hugetlbfs: add hugetlb_mcopy_atomic_pte for " Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY Andrea Arcangeli
2016-11-03 10:15   ` Hillf Danton
2016-11-03 17:33     ` Mike Kravetz
2016-11-03 19:14       ` Mike Kravetz
2016-11-04  6:43         ` Hillf Danton
2016-11-04 19:36         ` Andrea Arcangeli
2016-11-04 20:34           ` Mike Kravetz
2016-11-08 21:06           ` Mike Kravetz
2016-11-16 18:28             ` Andrea Arcangeli
2016-11-16 18:53               ` Mike Kravetz
2016-11-17 15:40                 ` Andrea Arcangeli
2016-11-17 19:26                   ` Mike Kravetz
2016-11-18  0:05                     ` Andrea Arcangeli
2016-11-18  5:52                       ` Mike Kravetz
2016-11-22  1:16                         ` Mike Kravetz
2016-11-23  6:38                           ` Hillf Danton
2016-12-15 19:02                             ` Andrea Arcangeli
2016-12-16  3:54                               ` Hillf Danton
2016-11-17 19:41               ` Mike Kravetz
2016-11-04 16:35       ` Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 16/33] userfaultfd: hugetlbfs: add userfaultfd hugetlb hook Andrea Arcangeli
2016-11-04  7:02   ` Hillf Danton
2016-11-02 19:33 ` [PATCH 17/33] userfaultfd: hugetlbfs: allow registration of ranges containing huge pages Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 18/33] userfaultfd: hugetlbfs: add userfaultfd_hugetlb test Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 19/33] userfaultfd: hugetlbfs: userfaultfd_huge_must_wait for hugepmd ranges Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 20/33] userfaultfd: introduce vma_can_userfault Andrea Arcangeli
2016-11-04  7:39   ` Hillf Danton
2016-11-02 19:33 ` [PATCH 21/33] userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 22/33] userfaultfd: shmem: introduce vma_is_shmem Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 23/33] userfaultfd: shmem: add tlbflush.h header for microblaze Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 24/33] userfaultfd: shmem: use shmem_mcopy_atomic_pte for shared memory Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 25/33] userfaultfd: shmem: add userfaultfd hook for shared memory faults Andrea Arcangeli
2016-11-04  8:59   ` Hillf Danton
2016-11-04 14:53     ` Mike Rapoport
2016-11-04 15:44     ` Mike Rapoport
2016-11-04 16:56       ` Andrea Arcangeli
2016-11-18  0:37       ` Andrea Arcangeli
2016-11-20 12:10         ` Mike Rapoport
2016-11-02 19:33 ` [PATCH 26/33] userfaultfd: shmem: allow registration of shared memory ranges Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 27/33] userfaultfd: shmem: add userfaultfd_shmem test Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 28/33] userfaultfd: shmem: lock the page before adding it to pagecache Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 29/33] userfaultfd: shmem: avoid leaking blocks and used blocks in UFFDIO_COPY Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 30/33] userfaultfd: non-cooperative: selftest: introduce userfaultfd_open Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 31/33] userfaultfd: non-cooperative: selftest: add ufd parameter to copy_page Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 32/33] userfaultfd: non-cooperative: selftest: add test for FORK, MADVDONTNEED and REMAP events Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 33/33] mm: mprotect: use pmd_trans_unstable instead of taking the pmd_lock Andrea Arcangeli
2016-11-02 20:07 ` [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
