linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/7] userfault21 update
@ 2015-06-15 17:22 Andrea Arcangeli
  2015-06-15 17:22 ` [PATCH 1/7] userfaultfd: require UFFDIO_API before other ioctls Andrea Arcangeli
                   ` (7 more replies)
  0 siblings, 8 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2015-06-15 17:22 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

This is an incremental update to the userfaultfd code in -mm.

This fixes two bugs that could cause some malfunction (but nothing
that could cause memory corruption or kernel crashes of any sort,
neither in kernel nor userland).

This also introduces some enhancement: gdb now runs fine, signals can
interrupt userfaults (userfaults are retried when signal returns),
read blocking got wakeone behavior (with benchmark results in commit
header), the UFFDIO_API invocation is enforced before other ioctl can
run (to enforce future backwards compatibility just in case of API
bumps), one dependency on a scheduler change has been reverted.

Notably this introduces the testsuite as well. A good way to run the
testsuite is:

# it will use 10MiB-~6GiB 999 bounces, continue forever unless an error triggers
while ./userfaultfd $[RANDOM % 6000 + 10] 999; do true; done

What caused a significant amount of time wasted, had nothing to do
with userfaultfd. The testsuite exposed erratic memcmp/bcmp retvals if
part of the strings compared can change under memcmp/bcmp (while still
being different in other parts of the string that aren't actually
changing). I will provide a separate standalone testcase for this not
using userfaultfd (I already created it to be sure it isn't a bug in
userfaultfd, and nevertheless my my_bcmp works fine even with
userfaultfd). Insisting memcmp/bcmp would eventually lead to the
correct result that in kernel-speak to be initially (but erroneously)
translated to missing TLB flush (or cache flush but on x86 unlikely)
or a pagefault hitting on the zeropage somehow, or some other subtle
kernel bug. Eventually I had to consider the possibiltity memcmp or
bcmp core library functions were broken, despite how unlikely this
sounds. It might be possible that this only happens if the memory
changing is inside the "len" range being compared and that nothing
goes wrong if the data changing is beyond the end of the "len" even if
in the same cacheline. So it might be possible that it's perfectly
correct in C standard terms, but the total erratic result is
unacceptable to me and it makes memcmp/bcmp very risky to use in
multithreaded programs. I will ensure this gets fixed in my systems
with perhaps slower versions of memcpy/bcmp. If the two pages never
actually are the same at any given time (no matter if they're
changing) both bcmp and memcmp can't keep returning an erratic racy 0
here. If this is safe by C standard, this still wouldn't be safe
enough for me. It's unclear how this erratic result materializes at
this point and if SIMD instructions have special restrictions on
memory that is modified by other CPUs. CPU bugs in SIMD cannot be
ruled out either yet.

Andrea Arcangeli (7):
  userfaultfd: require UFFDIO_API before other ioctls
  userfaultfd: propagate the full address in THP faults
  userfaultfd: allow signals to interrupt a userfault
  userfaultfd: avoid missing wakeups during refile in userfaultfd_read
  userfaultfd: switch to exclusive wakeup for blocking reads
  userfaultfd: Revert "userfaultfd: waitqueue: add nr wake parameter to
    __wake_up_locked_key"
  userfaultfd: selftest

 fs/userfaultfd.c                         |  78 +++-
 include/linux/wait.h                     |   5 +-
 kernel/sched/wait.c                      |   7 +-
 mm/huge_memory.c                         |  10 +-
 net/sunrpc/sched.c                       |   2 +-
 tools/testing/selftests/vm/Makefile      |   4 +-
 tools/testing/selftests/vm/userfaultfd.c | 669 +++++++++++++++++++++++++++++++
 7 files changed, 752 insertions(+), 23 deletions(-)
 create mode 100644 tools/testing/selftests/vm/userfaultfd.c

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/7] userfaultfd: require UFFDIO_API before other ioctls
  2015-06-15 17:22 [PATCH 0/7] userfault21 update Andrea Arcangeli
@ 2015-06-15 17:22 ` Andrea Arcangeli
  2015-06-15 18:11   ` Linus Torvalds
  2015-06-15 17:22 ` [PATCH 2/7] userfaultfd: propagate the full address in THP faults Andrea Arcangeli
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Andrea Arcangeli @ 2015-06-15 17:22 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

UFFDIO_API was already forced before read/poll could work. This
makes the code more strict to force it also for all other ioctls.

All users would already have been required to call UFFDIO_API before
invoking other ioctls but this makes it more explicit.

This will ensure we can change all ioctls (all but UFFDIO_API/struct
uffdio_api) with a bump of uffdio_api.api.

There's no actual plan or need to change the API or the ioctl, the
current API already should cover fine even the non cooperative usage,
but this is just for the longer term future just in case.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 5f11678..b69d236 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1115,6 +1115,12 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
 	int ret = -EINVAL;
 	struct userfaultfd_ctx *ctx = file->private_data;
 
+	if (cmd != UFFDIO_API) {
+		if (ctx->state == UFFD_STATE_WAIT_API)
+			return -EINVAL;
+		BUG_ON(ctx->state != UFFD_STATE_RUNNING);
+	}
+
 	switch(cmd) {
 	case UFFDIO_API:
 		ret = userfaultfd_api(ctx, arg);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/7] userfaultfd: propagate the full address in THP faults
  2015-06-15 17:22 [PATCH 0/7] userfault21 update Andrea Arcangeli
  2015-06-15 17:22 ` [PATCH 1/7] userfaultfd: require UFFDIO_API before other ioctls Andrea Arcangeli
@ 2015-06-15 17:22 ` Andrea Arcangeli
  2015-06-15 17:22 ` [PATCH 3/7] userfaultfd: allow signals to interrupt a userfault Andrea Arcangeli
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2015-06-15 17:22 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

The THP faults were not propagating the original fault address. The latest
version of the API with uffd.arg.pagefault.address is supposed to propagate the
full address through THP faults.

This was not a kernel crashing bug and it wouldn't risk to corrupt
user memory, but it would cause a SIGBUS failure because the wrong page was
being copied.

For various reasons this wasn't easily reproducible in the qemu
workload, but the strestest exposed the problem immediately.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 80d4ae1..73eb404 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -717,13 +717,14 @@ static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
 
 static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 					struct vm_area_struct *vma,
-					unsigned long haddr, pmd_t *pmd,
+					unsigned long address, pmd_t *pmd,
 					struct page *page, gfp_t gfp,
 					unsigned int flags)
 {
 	struct mem_cgroup *memcg;
 	pgtable_t pgtable;
 	spinlock_t *ptl;
+	unsigned long haddr = address & HPAGE_PMD_MASK;
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
@@ -765,7 +766,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 			mem_cgroup_cancel_charge(page, memcg);
 			put_page(page);
 			pte_free(mm, pgtable);
-			ret = handle_userfault(vma, haddr, flags,
+			ret = handle_userfault(vma, address, flags,
 					       VM_UFFD_MISSING);
 			VM_BUG_ON(ret & VM_FAULT_FALLBACK);
 			return ret;
@@ -841,7 +842,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (pmd_none(*pmd)) {
 			if (userfaultfd_missing(vma)) {
 				spin_unlock(ptl);
-				ret = handle_userfault(vma, haddr, flags,
+				ret = handle_userfault(vma, address, flags,
 						       VM_UFFD_MISSING);
 				VM_BUG_ON(ret & VM_FAULT_FALLBACK);
 			} else {
@@ -865,7 +866,8 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
 	}
-	return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page, gfp, flags);
+	return __do_huge_pmd_anonymous_page(mm, vma, address, pmd, page, gfp,
+					    flags);
 }
 
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 3/7] userfaultfd: allow signals to interrupt a userfault
  2015-06-15 17:22 [PATCH 0/7] userfault21 update Andrea Arcangeli
  2015-06-15 17:22 ` [PATCH 1/7] userfaultfd: require UFFDIO_API before other ioctls Andrea Arcangeli
  2015-06-15 17:22 ` [PATCH 2/7] userfaultfd: propagate the full address in THP faults Andrea Arcangeli
@ 2015-06-15 17:22 ` Andrea Arcangeli
  2015-06-15 17:22 ` [PATCH 4/7] userfaultfd: avoid missing wakeups during refile in userfaultfd_read Andrea Arcangeli
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2015-06-15 17:22 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

This is only simple to achieve if the userfault is going to return to
userland (not to the kernel) because we can avoid returning
VM_FAULT_RETRY despite we temporarily released the mmap_sem. The fault
would just be retried by userland then. This is safe at least on x86
and powerpc (the two archs with the syscall implemented so far).

Hint to verify for which archs this is safe: after handle_mm_fault returns, no
access to data structures protected by the mmap_sem must be done by the fault
code in arch/*/mm/fault.c until up_read(&mm->mmap_sem) is called.

This has two main benefits: signals can run with lower latency in production
(signals aren't blocked by userfaults and userfaults are immediately repeated
after signal processing) and gdb can then trivially debug the threads blocked
in this kind of userfaults coming directly from userland.

On a side note: while gdb has a need to get signal processed, coredumps always
worked perfectly with userfaults, no matter if the userfault is triggered by
GUP a kernel copy_user or directly from userland.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 35 ++++++++++++++++++++++++++++++++---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index b69d236..8286ec8 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -262,7 +262,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 	struct userfaultfd_ctx *ctx;
 	struct userfaultfd_wait_queue uwq;
 	int ret;
-	bool must_wait;
+	bool must_wait, return_to_userland;
 
 	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
 
@@ -327,6 +327,9 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 	uwq.msg = userfault_msg(address, flags, reason);
 	uwq.ctx = ctx;
 
+	return_to_userland = (flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
+		(FAULT_FLAG_USER|FAULT_FLAG_KILLABLE);
+
 	spin_lock(&ctx->fault_pending_wqh.lock);
 	/*
 	 * After the __add_wait_queue the uwq is visible to userland
@@ -338,14 +341,16 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 	 * following the spin_unlock to happen before the list_add in
 	 * __add_wait_queue.
 	 */
-	set_current_state(TASK_KILLABLE);
+	set_current_state(return_to_userland ? TASK_INTERRUPTIBLE :
+			  TASK_KILLABLE);
 	spin_unlock(&ctx->fault_pending_wqh.lock);
 
 	must_wait = userfaultfd_must_wait(ctx, address, flags, reason);
 	up_read(&mm->mmap_sem);
 
 	if (likely(must_wait && !ACCESS_ONCE(ctx->released) &&
-		   !fatal_signal_pending(current))) {
+		   (return_to_userland ? !signal_pending(current) :
+		    !fatal_signal_pending(current)))) {
 		wake_up_poll(&ctx->fd_wqh, POLLIN);
 		schedule();
 		ret |= VM_FAULT_MAJOR;
@@ -353,6 +358,30 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 
 	__set_current_state(TASK_RUNNING);
 
+	if (return_to_userland) {
+		if (signal_pending(current) &&
+		    !fatal_signal_pending(current)) {
+			/*
+			 * If we got a SIGSTOP or SIGCONT and this is
+			 * a normal userland page fault, just let
+			 * userland return so the signal will be
+			 * handled and gdb debugging works.  The page
+			 * fault code immediately after we return from
+			 * this function is going to release the
+			 * mmap_sem and it's not depending on it
+			 * (unlike gup would if we were not to return
+			 * VM_FAULT_RETRY).
+			 *
+			 * If a fatal signal is pending we still take
+			 * the streamlined VM_FAULT_RETRY failure path
+			 * and there's no need to retake the mmap_sem
+			 * in such case.
+			 */
+			down_read(&mm->mmap_sem);
+			ret = 0;
+		}
+	}
+
 	/*
 	 * Here we race with the list_del; list_add in
 	 * userfaultfd_ctx_read(), however because we don't ever run

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 4/7] userfaultfd: avoid missing wakeups during refile in userfaultfd_read
  2015-06-15 17:22 [PATCH 0/7] userfault21 update Andrea Arcangeli
                   ` (2 preceding siblings ...)
  2015-06-15 17:22 ` [PATCH 3/7] userfaultfd: allow signals to interrupt a userfault Andrea Arcangeli
@ 2015-06-15 17:22 ` Andrea Arcangeli
  2015-06-15 17:22 ` [PATCH 5/7] userfaultfd: switch to exclusive wakeup for blocking reads Andrea Arcangeli
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2015-06-15 17:22 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

During the refile in userfaultfd_read both waitqueues could look empty
to the lockless wake_userfault(). Use a seqcount to prevent this false
negative that could leave an userfault blocked.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 8286ec8..f9e11ec 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -45,6 +45,8 @@ struct userfaultfd_ctx {
 	wait_queue_head_t fault_wqh;
 	/* waitqueue head for the pseudo fd to wakeup poll/read */
 	wait_queue_head_t fd_wqh;
+	/* a refile sequence protected by fault_pending_wqh lock */
+	struct seqcount refile_seq;
 	/* pseudo fd refcounting */
 	atomic_t refcount;
 	/* userfaultfd syscall flags */
@@ -547,6 +549,15 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
 		uwq = find_userfault(ctx);
 		if (uwq) {
 			/*
+			 * Use a seqcount to repeat the lockless check
+			 * in wake_userfault() to avoid missing
+			 * wakeups because during the refile both
+			 * waitqueue could become empty if this is the
+			 * only userfault.
+			 */
+			write_seqcount_begin(&ctx->refile_seq);
+
+			/*
 			 * The fault_pending_wqh.lock prevents the uwq
 			 * to disappear from under us.
 			 *
@@ -570,6 +581,8 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
 			list_del(&uwq->wq.task_list);
 			__add_wait_queue(&ctx->fault_wqh, &uwq->wq);
 
+			write_seqcount_end(&ctx->refile_seq);
+
 			/* careful to always initialize msg if ret == 0 */
 			*msg = uwq->msg;
 			spin_unlock(&ctx->fault_pending_wqh.lock);
@@ -648,6 +661,9 @@ static void __wake_userfault(struct userfaultfd_ctx *ctx,
 static __always_inline void wake_userfault(struct userfaultfd_ctx *ctx,
 					   struct userfaultfd_wake_range *range)
 {
+	unsigned seq;
+	bool need_wakeup;
+
 	/*
 	 * To be sure waitqueue_active() is not reordered by the CPU
 	 * before the pagetable update, use an explicit SMP memory
@@ -663,8 +679,13 @@ static __always_inline void wake_userfault(struct userfaultfd_ctx *ctx,
 	 * userfaults yet. So we take the spinlock only when we're
 	 * sure we've userfaults to wake.
 	 */
-	if (waitqueue_active(&ctx->fault_pending_wqh) ||
-	    waitqueue_active(&ctx->fault_wqh))
+	do {
+		seq = read_seqcount_begin(&ctx->refile_seq);
+		need_wakeup = waitqueue_active(&ctx->fault_pending_wqh) ||
+			waitqueue_active(&ctx->fault_wqh);
+		cond_resched();
+	} while (read_seqcount_retry(&ctx->refile_seq, seq));
+	if (need_wakeup)
 		__wake_userfault(ctx, range);
 }
 
@@ -1223,6 +1244,7 @@ static void init_once_userfaultfd_ctx(void *mem)
 	init_waitqueue_head(&ctx->fault_pending_wqh);
 	init_waitqueue_head(&ctx->fault_wqh);
 	init_waitqueue_head(&ctx->fd_wqh);
+	seqcount_init(&ctx->refile_seq);
 }
 
 /**

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 5/7] userfaultfd: switch to exclusive wakeup for blocking reads
  2015-06-15 17:22 [PATCH 0/7] userfault21 update Andrea Arcangeli
                   ` (3 preceding siblings ...)
  2015-06-15 17:22 ` [PATCH 4/7] userfaultfd: avoid missing wakeups during refile in userfaultfd_read Andrea Arcangeli
@ 2015-06-15 17:22 ` Andrea Arcangeli
  2015-06-15 18:19   ` Linus Torvalds
  2015-06-15 17:22 ` [PATCH 6/7] userfaultfd: Revert "userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key" Andrea Arcangeli
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Andrea Arcangeli @ 2015-06-15 17:22 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

Blocking reads can easily use exclusive wakeups. Poll in theory could
too but there's no poll_wait_exclusive in common code yet.

If a poll() non-exclusive waitqueue is encountered before the
exclusive readblocking waitqueue, then everything will be waken. If a
read exclusive waitqueue is encountered before the poll waitqueue,
only the read will be waken (poll or any other blocked read waiting
will not be waken). In short the improvement is only available if
using pure blocked reads and never poll.

Here a benchmark showing the performance before/after the patch. This
is with the userfaultfd stresstest suite.

Here the relevant points:

wakeall:      48380.184730      task-clock (msec)         #   17.266 CPUs utilized            ( +-  0.47% )
wakeone:      45333.241105      task-clock (msec)         #   17.430 CPUs utilized            ( +-  0.72% )
wakeall:   121,667,046,528      cycles                    #    2.515 GHz                      ( +-  0.56% ) [83.44%]
wakeone:   113,920,663,471      cycles                    #    2.513 GHz                      ( +-  0.66% ) [83.32%]
wakeall:       2.801974022 seconds time elapsed ( +-  0.43% )
wakeone:       2.600912438 seconds time elapsed ( +-  0.02% )

Note the above number refers to the full testsuite and half of the
time it using only poll which remains wake-all, so for pure read
blocking workload the improvement is more significant. And this is
only with 24 uffd threads on only 24 CPUs.

Here the raw userfault numbers referring to those passes only using
blocking reads (never poll mixed in here):

wakeall:mode: rnd racing, userfaults: 206 158 203 194 144 190 102 165 92 80 138 98 76 108 55 64 50 46 27 36 29 8 11 3
wakeone:mode: rnd racing, userfaults: 212 153 144 139 215 116 138 119 118 102 65 93 81 75 63 59 49 44 46 40 19 14 21 6

wakeall:mode: rnd, userfaults: 499 439 546 492 452 403 355 375 318 296 246 230 182 200 194 110 91 125 63 47 41 46 20 11
wakeone:mode: rnd, userfaults: 544 523 490 501 457 449 477 256 471 353 299 323 227 228 222 191 172 79 71 77 65 25 38 0

Full data below.

===
wakeall:

nr_pages: 25584, nr_pages_per_cpu: 1066
bounces: 15, mode: rnd racing ver poll, userfaults: 108 92 116 107 82 97 85 109 88 74 77 56 71 51 41 41 34 24 23 36 18 27 8 3
bounces: 14, mode: racing ver poll, userfaults: 38 26 15 27 27 27 23 17 25 26 17 15 19 12 17 13 3 10 3 5 3 2 4 1
bounces: 13, mode: rnd ver poll, userfaults: 454 456 434 458 394 349 435 370 375 355 364 325 260 168 263 188 159 199 81 113 89 33 36 20
bounces: 12, mode: ver poll, userfaults: 88 59 79 66 52 64 49 40 51 42 29 25 17 24 24 22 19 27 12 4 15 13 9 1
bounces: 11, mode: rnd racing poll, userfaults: 110 128 129 82 102 93 89 62 75 48 66 94 71 40 32 49 29 22 35 24 21 19 10 6
bounces: 10, mode: racing poll, userfaults: 31 28 19 33 23 23 20 17 19 22 14 18 9 8 12 13 14 15 8 2 0 0 0 2
bounces: 9, mode: rnd poll, userfaults: 380 553 558 401 379 351 326 228 283 362 211 223 250 173 189 190 129 133 201 226 143 33 0 0
bounces: 8, mode: poll, userfaults: 75 57 58 46 41 42 46 27 41 52 29 20 27 7 16 6 4 5 7 2 2 1 0 0
bounces: 7, mode: rnd racing ver, userfaults: 192 137 129 114 106 144 89 79 75 54 52 64 77 36 44 39 24 31 27 25 16 8 2 5
bounces: 6, mode: racing ver, userfaults: 28 14 17 26 18 28 12 8 11 30 9 4 21 13 14 13 6 18 19 6 11 2 3 0
bounces: 5, mode: rnd ver, userfaults: 569 477 604 497 515 333 403 368 331 314 328 271 180 223 205 152 125 108 79 29 44 43 12 30
bounces: 4, mode: ver, userfaults: 97 103 90 93 90 43 74 60 58 54 51 41 37 16 31 27 24 27 13 5 5 6 2 1
bounces: 3, mode: rnd racing, userfaults: 241 159 164 169 114 136 133 114 90 69 115 88 111 98 77 64 30 66 23 16 19 12 11 7
bounces: 2, mode: racing, userfaults: 30 34 26 40 20 24 29 38 20 21 17 19 22 13 8 12 7 14 11 5 3 3 1 3
bounces: 1, mode: rnd, userfaults: 595 509 478 371 372 429 397 388 390 269 228 290 205 137 144 157 131 115 148 79 53 46 10 8
bounces: 0, mode:, userfaults: 108 100 83 69 50 64 52 55 52 41 39 27 36 49 28 38 18 30 27 6 8 10 7 2
nr_pages: 25584, nr_pages_per_cpu: 1066
bounces: 15, mode: rnd racing ver poll, userfaults: 167 117 118 92 75 89 112 89 96 103 86 95 78 72 42 72 49 32 73 27 26 20 16 15
bounces: 14, mode: racing ver poll, userfaults: 30 32 18 29 18 24 16 17 11 16 17 11 9 6 17 11 9 8 5 5 7 3 0 0
bounces: 13, mode: rnd ver poll, userfaults: 552 437 468 410 425 432 407 363 388 306 377 298 282 320 289 203 102 132 75 76 97 22 8 5
bounces: 12, mode: ver poll, userfaults: 89 92 92 72 64 68 55 44 61 42 42 28 28 23 26 19 9 6 11 8 1 5 1 1
bounces: 11, mode: rnd racing poll, userfaults: 113 79 101 65 61 67 92 73 52 65 60 46 41 45 27 26 33 21 25 26 3 13 10 5
bounces: 10, mode: racing poll, userfaults: 47 32 43 41 47 26 38 35 32 24 47 24 33 24 33 18 15 15 9 28 13 5 1 0
bounces: 9, mode: rnd poll, userfaults: 487 517 489 376 295 416 294 463 418 438 371 391 246 259 264 263 166 206 181 103 85 23 0 0
bounces: 8, mode: poll, userfaults: 97 95 82 99 65 67 60 61 40 60 60 41 40 40 24 26 22 33 10 12 8 4 4 0
bounces: 7, mode: rnd racing ver, userfaults: 171 127 127 114 137 101 106 103 72 90 69 61 39 44 29 37 29 36 21 33 2 9 1 1
bounces: 6, mode: racing ver, userfaults: 23 13 20 13 10 6 11 10 9 4 5 8 7 5 9 11 7 15 8 3 2 1 3 2
bounces: 5, mode: rnd ver, userfaults: 450 496 410 398 503 372 342 298 306 294 221 184 191 156 80 102 108 74 69 29 63 36 23 38
bounces: 4, mode: ver, userfaults: 126 106 125 82 100 69 96 75 51 55 46 49 61 41 43 49 29 25 30 25 12 0 0 0
bounces: 3, mode: rnd racing, userfaults: 142 126 124 104 105 104 122 69 77 92 52 52 77 40 37 35 30 19 13 19 11 12 11 2
bounces: 2, mode: racing, userfaults: 24 22 28 18 6 21 18 27 20 19 18 21 7 15 15 14 14 7 3 4 4 2 3 5
bounces: 1, mode: rnd, userfaults: 448 451 388 433 355 283 304 421 359 260 251 277 239 276 242 167 89 114 144 65 29 27 23 7
bounces: 0, mode:, userfaults: 66 71 53 62 43 57 30 31 37 25 38 8 29 10 20 23 13 8 6 10 2 3 1 0
nr_pages: 25584, nr_pages_per_cpu: 1066
bounces: 15, mode: rnd racing ver poll, userfaults: 117 80 92 70 114 88 86 65 82 67 72 55 68 48 49 24 48 31 21 32 27 22 5 12
bounces: 14, mode: racing ver poll, userfaults: 19 28 26 23 20 21 24 28 17 30 32 14 10 13 8 16 17 3 18 15 7 3 8 0
bounces: 13, mode: rnd ver poll, userfaults: 402 390 379 330 449 441 416 436 469 249 376 327 321 332 318 380 297 139 110 105 41 29 19 0
bounces: 12, mode: ver poll, userfaults: 111 80 79 75 78 70 71 55 42 37 32 48 41 24 41 27 24 30 16 21 24 9 24 1
bounces: 11, mode: rnd racing poll, userfaults: 114 85 129 95 96 93 80 66 44 80 55 62 65 65 76 34 54 48 22 41 44 8 21 5
bounces: 10, mode: racing poll, userfaults: 19 14 17 9 8 10 15 11 15 7 4 3 9 8 7 2 9 4 5 6 2 1 9 0
bounces: 9, mode: rnd poll, userfaults: 425 362 355 359 334 327 336 341 286 224 204 216 197 126 162 97 93 181 91 134 63 63 29 33
bounces: 8, mode: poll, userfaults: 82 53 65 43 48 48 31 25 28 25 41 13 10 12 7 13 6 3 4 4 3 1 1 1
bounces: 7, mode: rnd racing ver, userfaults: 146 143 118 142 124 152 123 103 97 99 94 72 100 51 59 52 40 38 19 32 13 7 4 6
bounces: 6, mode: racing ver, userfaults: 24 31 22 24 14 18 19 36 24 22 36 22 19 10 2 0 6 2 4 2 1 1 4 2
bounces: 5, mode: rnd ver, userfaults: 532 478 425 541 433 355 278 360 267 267 257 229 157 174 174 136 163 69 107 79 37 39 36 23
bounces: 4, mode: ver, userfaults: 110 122 74 69 102 61 81 71 43 65 44 29 26 10 26 23 9 11 18 5 15 2 3 3
bounces: 3, mode: rnd racing, userfaults: 206 158 203 194 144 190 102 165 92 80 138 98 76 108 55 64 50 46 27 36 29 8 11 3
bounces: 2, mode: racing, userfaults: 66 32 30 25 28 24 19 21 35 23 23 12 25 14 11 39 12 4 6 3 1 0 0 0
bounces: 1, mode: rnd, userfaults: 499 439 546 492 452 403 355 375 318 296 246 230 182 200 194 110 91 125 63 47 41 46 20 11
bounces: 0, mode:, userfaults: 77 84 103 69 56 40 31 36 48 41 30 45 34 20 15 17 20 4 7 4 4 3 1 1

 Performance counter stats for './userfaultfd 100 16' (3 runs):

      48380.184730      task-clock (msec)         #   17.266 CPUs utilized            ( +-  0.47% )
           193,120      context-switches          #    0.004 M/sec                    ( +-  0.42% )
            29,131      cpu-migrations            #    0.602 K/sec                    ( +-  0.31% )
           124,534      page-faults               #    0.003 M/sec                    ( +-  0.73% )
   121,667,046,528      cycles                    #    2.515 GHz                      ( +-  0.56% ) [83.44%]
    88,715,214,401      stalled-cycles-frontend   #   72.92% frontend cycles idle     ( +-  0.74% ) [83.59%]
    38,649,861,616      stalled-cycles-backend    #   31.77% backend  cycles idle     ( +-  1.06% ) [67.25%]
    63,497,981,871      instructions              #    0.52  insns per cycle
                                                  #    1.40  stalled cycles per insn  ( +-  0.21% ) [83.86%]
    12,494,182,458      branches                  #  258.250 M/sec                    ( +-  0.25% ) [83.55%]
        57,700,648      branch-misses             #    0.46% of all branches          ( +-  2.64% ) [83.41%]

       2.801974022 seconds time elapsed                                          ( +-  0.43% )

wakeone:

nr_pages: 25584, nr_pages_per_cpu: 1066
bounces: 15, mode: rnd racing ver poll, userfaults: 125 94 71 95 73 81 84 69 61 57 60 72 57 39 38 55 28 24 26 8 11 21 5 1
bounces: 14, mode: racing ver poll, userfaults: 24 16 8 12 30 21 16 22 22 19 14 10 11 6 8 6 10 9 6 7 3 8 3 3
bounces: 13, mode: rnd ver poll, userfaults: 548 353 409 354 423 425 387 316 350 282 379 203 210 214 228 180 192 187 166 75 37 45 42 0
bounces: 12, mode: ver poll, userfaults: 91 75 67 65 52 68 44 43 58 29 27 29 20 27 13 11 12 9 5 1 2 2 0 1
bounces: 11, mode: rnd racing poll, userfaults: 142 94 130 117 128 91 96 84 59 53 88 55 53 74 22 52 32 24 22 41 34 12 4 6
bounces: 10, mode: racing poll, userfaults: 22 29 23 22 26 20 14 21 15 11 23 20 23 16 4 15 15 11 6 11 5 6 0 4
bounces: 9, mode: rnd poll, userfaults: 333 402 361 374 345 344 297 338 208 307 222 171 196 208 134 137 97 87 191 133 120 68 45 3
bounces: 8, mode: poll, userfaults: 84 71 67 56 57 55 41 39 41 27 33 46 33 26 13 20 14 11 9 4 1 7 1 0
bounces: 7, mode: rnd racing ver, userfaults: 152 127 123 133 121 126 93 124 99 80 81 67 59 67 51 51 46 34 27 17 8 7 9 7
bounces: 6, mode: racing ver, userfaults: 36 34 20 17 15 33 18 20 13 19 31 17 19 9 16 22 10 21 6 5 6 1 1 2
bounces: 5, mode: rnd ver, userfaults: 534 448 498 402 400 454 396 357 287 262 266 281 239 233 235 258 230 100 106 95 82 55 29 3
bounces: 4, mode: ver, userfaults: 106 114 91 79 83 88 54 33 63 56 78 41 40 42 35 16 26 38 26 14 12 27 11 8
bounces: 3, mode: rnd racing, userfaults: 191 147 186 124 131 136 147 110 95 104 88 85 103 56 56 74 43 42 29 47 30 7 9 14
bounces: 2, mode: racing, userfaults: 34 48 28 45 46 39 64 39 32 22 20 16 32 37 18 44 36 14 20 11 18 11 18 1
bounces: 1, mode: rnd, userfaults: 568 488 434 492 473 343 295 271 364 329 306 240 299 330 212 240 163 151 144 88 60 52 8 5
bounces: 0, mode:, userfaults: 112 103 67 47 56 57 32 60 55 46 41 51 64 31 25 35 15 17 25 8 7 10 0 1
nr_pages: 25584, nr_pages_per_cpu: 1066
bounces: 15, mode: rnd racing ver poll, userfaults: 120 184 139 91 102 89 92 97 100 67 127 92 69 53 53 64 55 51 39 13 12 11 6 4
bounces: 14, mode: racing ver poll, userfaults: 52 21 23 39 22 19 9 20 12 14 13 22 8 16 21 13 10 12 8 12 2 3 4 0
bounces: 13, mode: rnd ver poll, userfaults: 509 453 392 469 467 413 422 434 384 227 298 197 318 236 213 273 210 320 144 62 58 35 3 0
bounces: 12, mode: ver poll, userfaults: 82 85 64 67 68 45 61 32 29 35 32 46 22 17 29 22 7 18 4 15 8 8 10 4
bounces: 11, mode: rnd racing poll, userfaults: 170 110 107 116 135 99 91 88 65 78 53 61 47 47 64 46 33 31 20 15 7 8 6 3
bounces: 10, mode: racing poll, userfaults: 34 20 23 22 19 19 19 24 22 24 13 27 26 23 8 7 14 8 6 7 6 9 5 2
bounces: 9, mode: rnd poll, userfaults: 522 482 455 502 361 606 363 365 259 302 325 449 333 424 308 470 124 142 147 129 28 2 0 0
bounces: 8, mode: poll, userfaults: 112 90 86 71 71 68 63 65 48 33 27 33 23 18 30 16 26 25 13 16 10 8 3 0
bounces: 7, mode: rnd racing ver, userfaults: 211 147 125 114 150 108 121 136 117 86 96 80 60 56 58 49 45 36 51 27 19 31 7 3
bounces: 6, mode: racing ver, userfaults: 35 35 42 12 21 27 20 22 22 43 23 23 27 27 23 24 35 21 19 12 5 11 14 13
bounces: 5, mode: rnd ver, userfaults: 413 473 383 326 282 300 349 240 283 223 220 208 248 197 196 155 136 68 74 62 32 24 41 33
bounces: 4, mode: ver, userfaults: 129 85 78 77 59 49 61 35 56 31 24 32 25 16 20 15 24 8 19 4 10 8 12 3
bounces: 3, mode: rnd racing, userfaults: 207 125 161 147 158 156 103 116 125 89 113 78 75 39 51 49 46 18 27 18 14 20 13 2
bounces: 2, mode: racing, userfaults: 15 30 21 12 17 14 11 15 9 19 12 12 9 14 17 13 11 7 4 13 2 6 6 9
bounces: 1, mode: rnd, userfaults: 567 489 490 463 526 485 365 465 394 416 362 281 263 212 164 143 129 141 122 42 65 30 37 25
bounces: 0, mode:, userfaults: 96 80 91 66 74 71 51 43 56 47 45 49 24 28 18 43 20 6 15 29 2 27 10 11
nr_pages: 25584, nr_pages_per_cpu: 1066
bounces: 15, mode: rnd racing ver poll, userfaults: 141 97 67 113 101 90 98 87 76 66 79 46 53 53 43 34 51 26 25 25 22 3 4 2
bounces: 14, mode: racing ver poll, userfaults: 19 16 12 19 17 19 13 10 15 11 16 5 7 13 9 8 4 0 1 3 3 0 2 1
bounces: 13, mode: rnd ver poll, userfaults: 439 369 337 357 417 306 286 328 376 342 242 190 321 181 166 223 304 165 124 133 45 99 0 0
bounces: 12, mode: ver poll, userfaults: 130 86 69 80 81 60 70 60 100 86 62 66 66 41 63 54 30 30 15 10 15 3 1 0
bounces: 11, mode: rnd racing poll, userfaults: 141 105 103 95 109 82 61 56 100 72 75 60 68 60 35 34 21 43 11 18 19 4 7 12
bounces: 10, mode: racing poll, userfaults: 39 24 22 25 19 20 20 10 20 6 16 11 14 9 8 8 10 5 7 3 3 3 4 0
bounces: 9, mode: rnd poll, userfaults: 527 589 542 524 599 525 329 478 417 373 351 412 317 392 385 237 186 126 166 106 33 38 0 0
bounces: 8, mode: poll, userfaults: 77 65 62 54 60 48 41 29 42 31 20 23 14 18 11 11 11 8 6 1 1 2 0 0
bounces: 7, mode: rnd racing ver, userfaults: 195 90 96 159 99 89 107 81 65 61 57 47 71 45 46 25 26 24 34 27 17 15 13 4
bounces: 6, mode: racing ver, userfaults: 40 19 29 20 21 13 23 11 12 16 12 17 16 17 13 12 17 7 23 5 4 7 2 2
bounces: 5, mode: rnd ver, userfaults: 420 455 283 460 361 416 270 274 304 152 354 296 252 154 167 196 148 180 129 67 117 46 24 11
bounces: 4, mode: ver, userfaults: 116 79 58 60 62 45 44 58 43 53 31 33 43 33 18 16 4 13 7 2 9 6 5 0
bounces: 3, mode: rnd racing, userfaults: 212 153 144 139 215 116 138 119 118 102 65 93 81 75 63 59 49 44 46 40 19 14 21 6
bounces: 2, mode: racing, userfaults: 45 30 42 26 28 39 32 31 20 36 43 20 26 11 12 11 22 14 11 4 20 13 0 3
bounces: 1, mode: rnd, userfaults: 544 523 490 501 457 449 477 256 471 353 299 323 227 228 222 191 172 79 71 77 65 25 38 0
bounces: 0, mode:, userfaults: 117 82 83 81 77 61 46 50 33 66 46 28 18 34 35 20 8 8 19 12 12 18 2 3

 Performance counter stats for './userfaultfd 100 16' (3 runs):

      45333.241105      task-clock (msec)         #   17.430 CPUs utilized            ( +-  0.72% )
           190,388      context-switches          #    0.004 M/sec                    ( +-  0.85% )
            26,983      cpu-migrations            #    0.595 K/sec                    ( +-  1.02% )
           135,232      page-faults               #    0.003 M/sec                    ( +-  1.46% )
   113,920,663,471      cycles                    #    2.513 GHz                      ( +-  0.66% ) [83.32%]
    83,823,483,951      stalled-cycles-frontend   #   73.58% frontend cycles idle     ( +-  0.65% ) [83.41%]
    35,786,661,114      stalled-cycles-backend    #   31.41% backend  cycles idle     ( +-  0.86% ) [67.29%]
    59,478,650,192      instructions              #    0.52  insns per cycle
                                                  #    1.41  stalled cycles per insn  ( +-  0.65% ) [84.04%]
    11,635,219,658      branches                  #  256.660 M/sec                    ( +-  0.71% ) [83.69%]
        59,203,898      branch-misses             #    0.51% of all branches          ( +-  2.03% ) [83.54%]

       2.600912438 seconds time elapsed                                          ( +-  0.02% )

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index f9e11ec..a66c4be 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -501,6 +501,7 @@ static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
 	struct userfaultfd_ctx *ctx = file->private_data;
 	unsigned int ret;
 
+	/* FIXME: poll_wait_exclusive doesn't exist yet in common code */
 	poll_wait(file, &ctx->fd_wqh, wait);
 
 	switch (ctx->state) {
@@ -542,7 +543,7 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
 
 	/* always take the fd_wqh lock before the fault_pending_wqh lock */
 	spin_lock(&ctx->fd_wqh.lock);
-	__add_wait_queue(&ctx->fd_wqh, &wait);
+	__add_wait_queue_exclusive(&ctx->fd_wqh, &wait);
 	for (;;) {
 		set_current_state(TASK_INTERRUPTIBLE);
 		spin_lock(&ctx->fault_pending_wqh.lock);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 6/7] userfaultfd: Revert "userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key"
  2015-06-15 17:22 [PATCH 0/7] userfault21 update Andrea Arcangeli
                   ` (4 preceding siblings ...)
  2015-06-15 17:22 ` [PATCH 5/7] userfaultfd: switch to exclusive wakeup for blocking reads Andrea Arcangeli
@ 2015-06-15 17:22 ` Andrea Arcangeli
  2015-06-15 17:22 ` [PATCH 7/7] userfaultfd: selftest Andrea Arcangeli
  2015-10-12 15:04 ` [PATCH 0/7] userfault21 update Patrick Donnelly
  7 siblings, 0 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2015-06-15 17:22 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

This reverts commit 855c5a9026b0fce58c8de5382ef8ce00f74c1331 and
adapts fs/userfaultfd.c to use the old version of that function.

It didn't look robust to call __wake_up_common with "nr == 1" when we
absolutely require wakeall semantics, but we've full control of what
we insert in the two waitqueue heads of the blocked userfaults. No
exclusive waitqueue risks to be inserted into those two waitqueue
heads so we can as well stick to "nr == 1" of the old code and we can
rely purely on the fact no waitqueue inserted in one of the two
waitqueue heads we must enforce as wakeall, has wait->flags
WQ_FLAG_EXCLUSIVE set.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c     | 8 ++++----
 include/linux/wait.h | 5 ++---
 kernel/sched/wait.c  | 7 +++----
 net/sunrpc/sched.c   | 2 +-
 4 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index a66c4be..e601dd8 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -467,8 +467,8 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
 	 * the fault_*wqh.
 	 */
 	spin_lock(&ctx->fault_pending_wqh.lock);
-	__wake_up_locked_key(&ctx->fault_pending_wqh, TASK_NORMAL, 0, &range);
-	__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0, &range);
+	__wake_up_locked_key(&ctx->fault_pending_wqh, TASK_NORMAL, &range);
+	__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, &range);
 	spin_unlock(&ctx->fault_pending_wqh.lock);
 
 	wake_up_poll(&ctx->fd_wqh, POLLHUP);
@@ -652,10 +652,10 @@ static void __wake_userfault(struct userfaultfd_ctx *ctx,
 	spin_lock(&ctx->fault_pending_wqh.lock);
 	/* wake all in the range and autoremove */
 	if (waitqueue_active(&ctx->fault_pending_wqh))
-		__wake_up_locked_key(&ctx->fault_pending_wqh, TASK_NORMAL, 0,
+		__wake_up_locked_key(&ctx->fault_pending_wqh, TASK_NORMAL,
 				     range);
 	if (waitqueue_active(&ctx->fault_wqh))
-		__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0, range);
+		__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, range);
 	spin_unlock(&ctx->fault_pending_wqh.lock);
 }
 
diff --git a/include/linux/wait.h b/include/linux/wait.h
index cf884cf..2db8334 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -147,8 +147,7 @@ __remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old)
 
 typedef int wait_bit_action_f(struct wait_bit_key *);
 void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
-			  void *key);
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
 void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
 void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
 void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
@@ -180,7 +179,7 @@ wait_queue_head_t *bit_waitqueue(void *, int);
 #define wake_up_poll(x, m)						\
 	__wake_up(x, TASK_NORMAL, 1, (void *) (m))
 #define wake_up_locked_poll(x, m)					\
-	__wake_up_locked_key((x), TASK_NORMAL, 1, (void *) (m))
+	__wake_up_locked_key((x), TASK_NORMAL, (void *) (m))
 #define wake_up_interruptible_poll(x, m)				\
 	__wake_up(x, TASK_INTERRUPTIBLE, 1, (void *) (m))
 #define wake_up_interruptible_sync_poll(x, m)				\
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 6da208dd2..852143a 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -106,10 +106,9 @@ void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr)
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked);
 
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
-			  void *key)
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key)
 {
-	__wake_up_common(q, mode, nr, 0, key);
+	__wake_up_common(q, mode, 1, 0, key);
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked_key);
 
@@ -284,7 +283,7 @@ void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait,
 	if (!list_empty(&wait->task_list))
 		list_del_init(&wait->task_list);
 	else if (waitqueue_active(q))
-		__wake_up_locked_key(q, mode, 1, key);
+		__wake_up_locked_key(q, mode, key);
 	spin_unlock_irqrestore(&q->lock, flags);
 }
 EXPORT_SYMBOL(abort_exclusive_wait);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index b140c09..337ca85 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -297,7 +297,7 @@ static int rpc_complete_task(struct rpc_task *task)
 	clear_bit(RPC_TASK_ACTIVE, &task->tk_runstate);
 	ret = atomic_dec_and_test(&task->tk_count);
 	if (waitqueue_active(wq))
-		__wake_up_locked_key(wq, TASK_NORMAL, 1, &k);
+		__wake_up_locked_key(wq, TASK_NORMAL, &k);
 	spin_unlock_irqrestore(&wq->lock, flags);
 	return ret;
 }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 7/7] userfaultfd: selftest
  2015-06-15 17:22 [PATCH 0/7] userfault21 update Andrea Arcangeli
                   ` (5 preceding siblings ...)
  2015-06-15 17:22 ` [PATCH 6/7] userfaultfd: Revert "userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key" Andrea Arcangeli
@ 2015-06-15 17:22 ` Andrea Arcangeli
  2015-10-12 15:04 ` [PATCH 0/7] userfault21 update Patrick Donnelly
  7 siblings, 0 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2015-06-15 17:22 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, qemu-devel, kvm
  Cc: Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

This test allocates two virtual areas and bounces the physical memory
across the two virtual areas using only userfaultfd.

This exposed a race condition in the refile of the userfault in
userfaultfd_read and an alignment issue with the address returned to
userland with THP enabled. It also allowed to test the interruption of
userfaults by signals (like running the testcase under gdb).

As expected no sign of memory corruption has ever materialized no
matter how I changed the stress test while developing it. The two bugs
had no impact on the safety and correctness of the memory being
tracked by userfaultfd. The fix for those two bugs was also
strightforward and required no design change of any sort.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 tools/testing/selftests/vm/Makefile      |   4 +-
 tools/testing/selftests/vm/userfaultfd.c | 669 +++++++++++++++++++++++++++++++
 2 files changed, 672 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/vm/userfaultfd.c

diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index a5ce953..6f19ecc 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -1,12 +1,14 @@
 # Makefile for vm selftests
 
 CFLAGS = -Wall
-BINARIES = hugepage-mmap hugepage-shm map_hugetlb thuge-gen hugetlbfstest
+BINARIES = hugepage-mmap hugepage-shm map_hugetlb thuge-gen hugetlbfstest userfaultfd
 BINARIES += transhuge-stress
 
 all: $(BINARIES)
 %: %.c
 	$(CC) $(CFLAGS) -o $@ $^ -lrt
+userfaultfd: userfaultfd.c
+	$(CC) $(CFLAGS) -o $@ $^ -g -lrt -lpthread
 
 TEST_PROGS := run_vmtests
 TEST_FILES := $(BINARIES)
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
new file mode 100644
index 0000000..418dc33
--- /dev/null
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -0,0 +1,669 @@
+/*
+ * Stress userfaultfd syscall.
+ *
+ *  Copyright (C) 2015  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ *
+ * This test allocates two virtual areas and bounces the physical
+ * memory across the two virtual areas (from area_src to area_dst)
+ * using userfaultfd.
+ *
+ * There are three threads running per CPU:
+ *
+ * 1) one per-CPU thread takes a per-page pthread_mutex in a random
+ *    page of the area_dst (while the physical page may still be in
+ *    area_src), and increments a per-page counter in the same page,
+ *    and checks its value against a verification region.
+ *
+ * 2) another per-CPU thread handles the userfaults generated by
+ *    thread 1 above. userfaultfd blocking reads or poll() modes are
+ *    exercised interleaved.
+ *
+ * 3) one last per-CPU thread transfers the memory in the background
+ *    at maximum bandwidth (if not already transferred by thread
+ *    2). Each cpu thread takes cares of transferring a portion of the
+ *    area.
+ *
+ * When all threads of type 3 completed the transfer, one bounce is
+ * complete. area_src and area_dst are then swapped. All threads are
+ * respawned and so the bounce is immediately restarted in the
+ * opposite direction.
+ *
+ * per-CPU threads 1 by triggering userfaults inside
+ * pthread_mutex_lock will also verify the atomicity of the memory
+ * transfer (UFFDIO_COPY).
+ *
+ * The program takes two parameters: the amounts of physical memory in
+ * megabytes (MiB) of the area and the number of bounces to execute.
+ *
+ * # 100MiB 99999 bounces
+ * ./userfaultfd 100 99999
+ *
+ * # 1GiB 99 bounces
+ * ./userfaultfd 1000 99
+ *
+ * # 10MiB-~6GiB 999 bounces, continue forever unless an error triggers
+ * while ./userfaultfd $[RANDOM % 6000 + 10] 999; do true; done
+ */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <errno.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <time.h>
+#include <signal.h>
+#include <poll.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+#include <pthread.h>
+#include "../../../../include/uapi/linux/userfaultfd.h"
+
+#ifdef __x86_64__
+#define __NR_userfaultfd 323
+#elif defined(__i386__)
+#define __NR_userfaultfd 359
+#elif defined(__powewrpc__)
+#define __NR_userfaultfd 364
+#else
+#error "missing __NR_userfaultfd definition"
+#endif
+
+static unsigned long nr_cpus, nr_pages, nr_pages_per_cpu, page_size;
+
+#define BOUNCE_RANDOM		(1<<0)
+#define BOUNCE_RACINGFAULTS	(1<<1)
+#define BOUNCE_VERIFY		(1<<2)
+#define BOUNCE_POLL		(1<<3)
+static int bounces;
+
+static unsigned long long *count_verify;
+static int uffd, finished, *pipefd;
+static char *area_src, *area_dst;
+static char *zeropage;
+pthread_attr_t attr;
+
+/* pthread_mutex_t starts at page offset 0 */
+#define area_mutex(___area, ___nr)					\
+	((pthread_mutex_t *) ((___area) + (___nr)*page_size))
+/*
+ * count is placed in the page after pthread_mutex_t naturally aligned
+ * to avoid non alignment faults on non-x86 archs.
+ */
+#define area_count(___area, ___nr)					\
+	((volatile unsigned long long *) ((unsigned long)		\
+				 ((___area) + (___nr)*page_size +	\
+				  sizeof(pthread_mutex_t) +		\
+				  sizeof(unsigned long long) - 1) &	\
+				 ~(unsigned long)(sizeof(unsigned long long) \
+						  -  1)))
+
+static int my_bcmp(char *str1, char *str2, size_t n)
+{
+	unsigned long i;
+	for (i = 0; i < n; i++)
+		if (str1[i] != str2[i])
+			return 1;
+	return 0;
+}
+
+static void *locking_thread(void *arg)
+{
+	unsigned long cpu = (unsigned long) arg;
+	struct random_data rand;
+	unsigned long page_nr;
+	int32_t rand_nr;
+	unsigned long long count;
+	char randstate[64];
+	unsigned int seed;
+	time_t start;
+
+	if (bounces & BOUNCE_RANDOM) {
+		seed = (unsigned int) time(NULL) - bounces;
+		if (!(bounces & BOUNCE_RACINGFAULTS))
+			seed += cpu;
+		bzero(&rand, sizeof(rand));
+		bzero(&randstate, sizeof(randstate));
+		if (initstate_r(seed, randstate, sizeof(randstate), &rand))
+			fprintf(stderr, "srandom_r error\n"), exit(1);
+	} else {
+		page_nr = -bounces;
+		if (!(bounces & BOUNCE_RACINGFAULTS))
+			page_nr += cpu * nr_pages_per_cpu;
+	}
+
+	while (!finished) {
+		if (bounces & BOUNCE_RANDOM) {
+			if (random_r(&rand, &rand_nr))
+				fprintf(stderr, "random_r 1 error\n"), exit(1);
+			page_nr = rand_nr;
+			if (sizeof(page_nr) > sizeof(rand_nr)) {
+				if (random_r(&rand, &rand_nr))
+					fprintf(stderr, "random_r 2 error\n"), exit(1);
+				page_nr |= ((unsigned long) rand_nr) << 32;
+			}
+		} else
+			page_nr += 1;
+		page_nr %= nr_pages;
+
+		start = time(NULL);
+		if (bounces & BOUNCE_VERIFY) {
+			count = *area_count(area_dst, page_nr);
+			if (!count)
+				fprintf(stderr,
+					"page_nr %lu wrong count %Lu %Lu\n",
+					page_nr, count,
+					count_verify[page_nr]), exit(1);
+
+
+			/*
+			 * NOTE: why is s/my_bcmp/bcmp/ or
+			 * s/my_bcmp/memcmp/ failing?  This has
+			 * nothing to do with userfaultfd: it's
+			 * possible to reproduce the same failure with
+			 * a testcase that won't invoke the
+			 * userfaultfd syscalls at all. Apparently if
+			 * part of the memory to be compared is
+			 * changing under memcmp/bcmp, that library
+			 * call will misbehave. This is concerning and
+			 * it caused confusion while developing this
+			 * testcase.
+			 *
+			 * If this check is moved inside the
+			 * pthread_mutex protected region (by
+			 * uncommenting the two mutex lock/unlock
+			 * lines), then "memset/bcmp" will behave
+			 * correctly.
+			 *
+			 * As far as single threaded C standard is
+			 * concerned, memset may even be perfectly
+			 * correct, but in multithreaded world it
+			 * looks an unacceptable risk that memset
+			 * returns 0 just because part of the memory
+			 * is changing under it.
+			 *
+			 * These two pages are never equal at any
+			 * given time.
+			 *
+			 * If the code is changed so that the second
+			 * part of the page that is always different
+			 * than the zeropage is never written to,
+			 * still memcmp/bcmp returns zero because the
+			 * first part is still changing.
+			 *
+			 * Insisting multiple times eventually leads
+			 * to the correct result confirming it must be
+			 * some CPU cache effect with SIMD
+			 * instructions of the default memcmp/bcmp.
+			 */
+#if 1
+			if (!my_bcmp(area_dst + page_nr * page_size, zeropage,
+				     page_size))
+				fprintf(stderr,
+					"my_bcmp page_nr %lu wrong count %Lu %Lu\n",
+					page_nr, count,
+					count_verify[page_nr]), exit(1);
+#else
+			unsigned long loops;
+
+			loops = 0;
+			// uncomment the below line to test with mutex
+			//pthread_mutex_lock(area_mutex(area_dst, page_nr));
+			while (!bcmp(area_dst + page_nr * page_size, zeropage,
+				     page_size)) {
+				loops += 1;
+				if (loops > 10)
+					break;
+			}
+			// uncomment below line to test with mutex
+			//pthread_mutex_unlock(area_mutex(area_dst, page_nr));
+			if (loops) {
+				fprintf(stderr,
+					"page_nr %lu all zero thread %lu %p %lu\n",
+					page_nr, cpu, area_dst + page_nr * page_size,
+					loops);
+				if (loops > 10)
+					exit(1);
+			}
+#endif
+		}
+
+		pthread_mutex_lock(area_mutex(area_dst, page_nr));
+		count = *area_count(area_dst, page_nr);
+		if (count != count_verify[page_nr]) {
+			fprintf(stderr,
+				"page_nr %lu memory corruption %Lu %Lu\n",
+				page_nr, count,
+				count_verify[page_nr]), exit(1);
+		}
+		count++;
+		*area_count(area_dst, page_nr) = count_verify[page_nr] = count;
+		pthread_mutex_unlock(area_mutex(area_dst, page_nr));
+
+		if (time(NULL) - start > 1)
+			fprintf(stderr,
+				"userfault too slow %ld "
+				"possible false positive with overcommit\n",
+				time(NULL) - start);
+	}
+
+	return NULL;
+}
+
+static int copy_page(unsigned long offset)
+{
+	struct uffdio_copy uffdio_copy;
+
+	if (offset >= nr_pages * page_size)
+		fprintf(stderr, "unexpected offset %lu\n",
+			offset), exit(1);
+	uffdio_copy.dst = (unsigned long) area_dst + offset;
+	uffdio_copy.src = (unsigned long) area_src + offset;
+	uffdio_copy.len = page_size;
+	uffdio_copy.mode = 0;
+	uffdio_copy.copy = 0;
+	if (ioctl(uffd, UFFDIO_COPY, &uffdio_copy)) {
+		/* real retval in ufdio_copy.copy */
+		if (uffdio_copy.copy != -EEXIST)
+			fprintf(stderr, "UFFDIO_COPY error %Ld\n",
+				uffdio_copy.copy), exit(1);
+	} else if (uffdio_copy.copy != page_size) {
+		fprintf(stderr, "UFFDIO_COPY unexpected copy %Ld\n",
+			uffdio_copy.copy), exit(1);
+	} else
+		return 1;
+	return 0;
+}
+
+static void *uffd_poll_thread(void *arg)
+{
+	unsigned long cpu = (unsigned long) arg;
+	struct pollfd pollfd[2];
+	struct uffd_msg msg;
+	int ret;
+	unsigned long offset;
+	char tmp_chr;
+	unsigned long long userfaults = 0;
+
+	pollfd[0].fd = uffd;
+	pollfd[0].events = POLLIN;
+	pollfd[1].fd = pipefd[cpu*2];
+	pollfd[1].events = POLLIN;
+
+	for (;;) {
+		ret = poll(pollfd, 2, -1);
+		if (!ret)
+			fprintf(stderr, "poll error %d\n", ret), exit(1);
+		if (ret < 0)
+			perror("poll"), exit(1);
+		if (pollfd[1].revents & POLLIN) {
+			if (read(pollfd[1].fd, &tmp_chr, 1) != 1)
+				fprintf(stderr, "read pipefd error\n"),
+					exit(1);
+			break;
+		}
+		if (!(pollfd[0].revents & POLLIN))
+			fprintf(stderr, "pollfd[0].revents %d\n",
+				pollfd[0].revents), exit(1);
+		ret = read(uffd, &msg, sizeof(msg));
+		if (ret < 0) {
+			if (errno == EAGAIN)
+				continue;
+			perror("nonblocking read error"), exit(1);
+		}
+		if (msg.event != UFFD_EVENT_PAGEFAULT)
+			fprintf(stderr, "unexpected msg event %u\n",
+				msg.event), exit(1);
+		if (msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE)
+			fprintf(stderr, "unexpected write fault\n"), exit(1);
+		offset = (char *)msg.arg.pagefault.address - area_dst;
+		offset &= ~(page_size-1);
+		if (copy_page(offset))
+			userfaults++;
+	}
+	return (void *)userfaults;
+}
+
+pthread_mutex_t uffd_read_mutex = PTHREAD_MUTEX_INITIALIZER;
+
+static void *uffd_read_thread(void *arg)
+{
+	unsigned long long *this_cpu_userfaults;
+	struct uffd_msg msg;
+	unsigned long offset;
+	int ret;
+
+	this_cpu_userfaults = (unsigned long long *) arg;
+	*this_cpu_userfaults = 0;
+
+	pthread_mutex_unlock(&uffd_read_mutex);
+	/* from here cancellation is ok */
+
+	for (;;) {
+		ret = read(uffd, &msg, sizeof(msg));
+		if (ret != sizeof(msg)) {
+			if (ret < 0)
+				perror("blocking read error"), exit(1);
+			else
+				fprintf(stderr, "short read\n"), exit(1);
+		}
+		if (msg.event != UFFD_EVENT_PAGEFAULT)
+			fprintf(stderr, "unexpected msg event %u\n",
+				msg.event), exit(1);
+		if (msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE)
+			fprintf(stderr, "unexpected write fault\n"), exit(1);
+		offset = (char *)msg.arg.pagefault.address - area_dst;
+		offset &= ~(page_size-1);
+		if (copy_page(offset))
+			(*this_cpu_userfaults)++;
+	}
+	return (void *)NULL;
+}
+
+static void *background_thread(void *arg)
+{
+	unsigned long cpu = (unsigned long) arg;
+	unsigned long page_nr;
+
+	for (page_nr = cpu * nr_pages_per_cpu;
+	     page_nr < (cpu+1) * nr_pages_per_cpu;
+	     page_nr++)
+		copy_page(page_nr * page_size);
+
+	return NULL;
+}
+
+static int stress(unsigned long long *userfaults)
+{
+	unsigned long cpu;
+	pthread_t locking_threads[nr_cpus];
+	pthread_t uffd_threads[nr_cpus];
+	pthread_t background_threads[nr_cpus];
+	void **_userfaults = (void **) userfaults;
+
+	finished = 0;
+	for (cpu = 0; cpu < nr_cpus; cpu++) {
+		if (pthread_create(&locking_threads[cpu], &attr,
+				   locking_thread, (void *)cpu))
+			return 1;
+		if (bounces & BOUNCE_POLL) {
+			if (pthread_create(&uffd_threads[cpu], &attr,
+					   uffd_poll_thread, (void *)cpu))
+				return 1;
+		} else {
+			if (pthread_create(&uffd_threads[cpu], &attr,
+					   uffd_read_thread,
+					   &_userfaults[cpu]))
+				return 1;
+			pthread_mutex_lock(&uffd_read_mutex);
+		}
+		if (pthread_create(&background_threads[cpu], &attr,
+				   background_thread, (void *)cpu))
+			return 1;
+	}
+	for (cpu = 0; cpu < nr_cpus; cpu++)
+		if (pthread_join(background_threads[cpu], NULL))
+			return 1;
+
+	/*
+	 * Be strict and immediately zap area_src, the whole area has
+	 * been transferred already by the background treads. The
+	 * area_src could then be faulted in in a racy way by still
+	 * running uffdio_threads reading zeropages after we zapped
+	 * area_src (but they're guaranteed to get -EEXIST from
+	 * UFFDIO_COPY without writing zero pages into area_dst
+	 * because the background threads already completed).
+	 */
+	if (madvise(area_src, nr_pages * page_size, MADV_DONTNEED)) {
+		perror("madvise");
+		return 1;
+	}
+
+	for (cpu = 0; cpu < nr_cpus; cpu++) {
+		char c;
+		if (bounces & BOUNCE_POLL) {
+			if (write(pipefd[cpu*2+1], &c, 1) != 1) {
+				fprintf(stderr, "pipefd write error\n");
+				return 1;
+			}
+			if (pthread_join(uffd_threads[cpu], &_userfaults[cpu]))
+				return 1;
+		} else {
+			if (pthread_cancel(uffd_threads[cpu]))
+				return 1;
+			if (pthread_join(uffd_threads[cpu], NULL))
+				return 1;
+		}
+	}
+
+	finished = 1;
+	for (cpu = 0; cpu < nr_cpus; cpu++)
+		if (pthread_join(locking_threads[cpu], NULL))
+			return 1;
+
+	return 0;
+}
+
+static int userfaultfd_stress(void)
+{
+	void *area;
+	char *tmp_area;
+	unsigned long nr;
+	struct uffdio_register uffdio_register;
+	struct uffdio_api uffdio_api;
+	unsigned long cpu;
+	int uffd_flags;
+	unsigned long long userfaults[nr_cpus];
+
+	if (posix_memalign(&area, page_size, nr_pages * page_size)) {
+		fprintf(stderr, "out of memory\n");
+		return 1;
+	}
+	area_src = area;
+	if (posix_memalign(&area, page_size, nr_pages * page_size)) {
+		fprintf(stderr, "out of memory\n");
+		return 1;
+	}
+	area_dst = area;
+
+	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+	if (uffd < 0) {
+		fprintf(stderr,
+			"userfaultfd syscall not available in this kernel\n");
+		return 1;
+	}
+	uffd_flags = fcntl(uffd, F_GETFD, NULL);
+
+	uffdio_api.api = UFFD_API;
+	uffdio_api.features = 0;
+	if (ioctl(uffd, UFFDIO_API, &uffdio_api)) {
+		fprintf(stderr, "UFFDIO_API\n");
+		return 1;
+	}
+	if (uffdio_api.api != UFFD_API) {
+		fprintf(stderr, "UFFDIO_API error %Lu\n", uffdio_api.api);
+		return 1;
+	}
+
+	count_verify = malloc(nr_pages * sizeof(unsigned long long));
+	if (!count_verify) {
+		perror("count_verify");
+		return 1;
+	}
+
+	for (nr = 0; nr < nr_pages; nr++) {
+		*area_mutex(area_src, nr) = (pthread_mutex_t)
+			PTHREAD_MUTEX_INITIALIZER;
+		count_verify[nr] = *area_count(area_src, nr) = 1;
+	}
+
+	pipefd = malloc(sizeof(int) * nr_cpus * 2);
+	if (!pipefd) {
+		perror("pipefd");
+		return 1;
+	}
+	for (cpu = 0; cpu < nr_cpus; cpu++) {
+		if (pipe2(&pipefd[cpu*2], O_CLOEXEC | O_NONBLOCK)) {
+			perror("pipe");
+			return 1;
+		}
+	}
+
+	if (posix_memalign(&area, page_size, page_size)) {
+		fprintf(stderr, "out of memory\n");
+		return 1;
+	}
+	zeropage = area;
+	bzero(zeropage, page_size);
+
+	pthread_mutex_lock(&uffd_read_mutex);
+
+	pthread_attr_init(&attr);
+	pthread_attr_setstacksize(&attr, 16*1024*1024);
+
+	while (bounces--) {
+		unsigned long expected_ioctls;
+
+		printf("bounces: %d, mode:", bounces);
+		if (bounces & BOUNCE_RANDOM)
+			printf(" rnd");
+		if (bounces & BOUNCE_RACINGFAULTS)
+			printf(" racing");
+		if (bounces & BOUNCE_VERIFY)
+			printf(" ver");
+		if (bounces & BOUNCE_POLL)
+			printf(" poll");
+		printf(", ");
+		fflush(stdout);
+
+		if (bounces & BOUNCE_POLL)
+			fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK);
+		else
+			fcntl(uffd, F_SETFL, uffd_flags & ~O_NONBLOCK);
+
+		/* register */
+		uffdio_register.range.start = (unsigned long) area_dst;
+		uffdio_register.range.len = nr_pages * page_size;
+		uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+		if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) {
+			fprintf(stderr, "register failure\n");
+			return 1;
+		}
+		expected_ioctls = (1 << _UFFDIO_WAKE) |
+				  (1 << _UFFDIO_COPY) |
+				  (1 << _UFFDIO_ZEROPAGE);
+		if ((uffdio_register.ioctls & expected_ioctls) !=
+		    expected_ioctls) {
+			fprintf(stderr,
+				"unexpected missing ioctl for anon memory\n");
+			return 1;
+		}
+
+		/*
+		 * The madvise done previously isn't enough: some
+		 * uffd_thread could have read userfaults (one of
+		 * those already resolved by the background thread)
+		 * and it may be in the process of calling
+		 * UFFDIO_COPY. UFFDIO_COPY will read the zapped
+		 * area_src and it would map a zero page in it (of
+		 * course such a UFFDIO_COPY is perfectly safe as it'd
+		 * return -EEXIST). The problem comes at the next
+		 * bounce though: that racing UFFDIO_COPY would
+		 * generate zeropages in the area_src, so invalidating
+		 * the previous MADV_DONTNEED. Without this additional
+		 * MADV_DONTNEED those zeropages leftovers in the
+		 * area_src would lead to -EEXIST failure during the
+		 * next bounce, effectively leaving a zeropage in the
+		 * area_dst.
+		 *
+		 * Try to comment this out madvise to see the memory
+		 * corruption being caught pretty quick.
+		 *
+		 * khugepaged is also inhibited to collapse THP after
+		 * MADV_DONTNEED only after the UFFDIO_REGISTER, so it's
+		 * required to MADV_DONTNEED here.
+		 */
+		if (madvise(area_dst, nr_pages * page_size, MADV_DONTNEED)) {
+			perror("madvise 2");
+			return 1;
+		}
+
+		/* bounce pass */
+		if (stress(userfaults))
+			return 1;
+
+		/* unregister */
+		if (ioctl(uffd, UFFDIO_UNREGISTER, &uffdio_register.range)) {
+			fprintf(stderr, "register failure\n");
+			return 1;
+		}
+
+		/* verification */
+		if (bounces & BOUNCE_VERIFY) {
+			for (nr = 0; nr < nr_pages; nr++) {
+				if (my_bcmp(area_dst,
+					    area_dst + nr * page_size,
+					    sizeof(pthread_mutex_t))) {
+					fprintf(stderr,
+						"error mutex 2 %lu\n",
+						nr);
+					bounces = 0;
+				}
+				if (*area_count(area_dst, nr) != count_verify[nr]) {
+					fprintf(stderr,
+						"error area_count %Lu %Lu %lu\n",
+						*area_count(area_src, nr),
+						count_verify[nr],
+						nr);
+					bounces = 0;
+				}
+			}
+		}
+
+		/* prepare next bounce */
+		tmp_area = area_src;
+		area_src = area_dst;
+		area_dst = tmp_area;
+
+		printf("userfaults:");
+		for (cpu = 0; cpu < nr_cpus; cpu++)
+			printf(" %Lu", userfaults[cpu]);
+		printf("\n");
+	}
+
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	if (argc < 3)
+		fprintf(stderr, "Usage: <MiB> <bounces>\n"), exit(1);
+	nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
+	page_size = sysconf(_SC_PAGE_SIZE);
+	if ((unsigned long) area_count((char *)0 +
+				       sizeof(unsigned long long), 0) >
+	    page_size)
+		fprintf(stderr, "Impossible to run this test\n"), exit(2);
+	nr_pages_per_cpu = atol(argv[1]) * 1024*1024 / page_size /
+		nr_cpus;
+	if (!nr_pages_per_cpu) {
+		fprintf(stderr, "invalid MiB\n");
+		fprintf(stderr, "Usage: <MiB> <bounces>\n"), exit(1);
+	}
+	bounces = atoi(argv[2]);
+	if (bounces <= 0) {
+		fprintf(stderr, "invalid bounces\n");
+		fprintf(stderr, "Usage: <MiB> <bounces>\n"), exit(1);
+	}
+	nr_pages = nr_pages_per_cpu * nr_cpus;
+	printf("nr_pages: %lu, nr_pages_per_cpu: %lu\n",
+	       nr_pages, nr_pages_per_cpu);
+	return userfaultfd_stress();
+}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/7] userfaultfd: require UFFDIO_API before other ioctls
  2015-06-15 17:22 ` [PATCH 1/7] userfaultfd: require UFFDIO_API before other ioctls Andrea Arcangeli
@ 2015-06-15 18:11   ` Linus Torvalds
  2015-06-15 21:43     ` Andrea Arcangeli
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2015-06-15 18:11 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Huangpeng (Peter),
	Paolo Bonzini, qemu-devel, Pavel Emelyanov, Hugh Dickins,
	Andrew Morton, Dr. David Alan Gilbert, Andres Lagar-Cavilla,
	Andy Lutomirski, linux-mm, Johannes Weiner, Rik van Riel,
	Kirill A. Shutemov, linux-kernel, zhang.zhanghailiang,
	Sanidhya Kashyap, Dave Hansen, Peter Feiner, Mel Gorman, kvm

[-- Attachment #1: Type: text/plain, Size: 990 bytes --]

On Jun 15, 2015 7:22 AM, "Andrea Arcangeli" <aarcange@redhat.com> wrote:
>
> +       if (cmd != UFFDIO_API) {
> +               if (ctx->state == UFFD_STATE_WAIT_API)
> +                       return -EINVAL;
> +               BUG_ON(ctx->state != UFFD_STATE_RUNNING);
> +       }

NAK.

Once again: we don't add BUG_ON() as some kind of assert. If your
non-critical code has s bug in it, you do WARN_ONCE() and you return. You
don't kill the machine just because of some "this can't happen" situation.

It turns out "this can't happen" happens way too often, just because code
changes, or programmers didn't think all the cases through. And killing the
machine is just NOT ACCEPTABLE.

People need to stop adding machine-killing checks to code that just doesn't
merit killing the machine.

And if you are so damn sure that it really cannot happen ever, then you
damn well had better remove the test too!

BUG_ON is not a debugging tool, or a "I think this would be bad" helper.

    Linus

[-- Attachment #2: Type: text/html, Size: 1338 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 5/7] userfaultfd: switch to exclusive wakeup for blocking reads
  2015-06-15 17:22 ` [PATCH 5/7] userfaultfd: switch to exclusive wakeup for blocking reads Andrea Arcangeli
@ 2015-06-15 18:19   ` Linus Torvalds
  2015-06-15 22:19     ` Andrea Arcangeli
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2015-06-15 18:19 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Huangpeng (Peter),
	Paolo Bonzini, qemu-devel, Pavel Emelyanov, Hugh Dickins,
	Andrew Morton, Dr. David Alan Gilbert, Andres Lagar-Cavilla,
	Andy Lutomirski, linux-mm, Johannes Weiner, Rik van Riel,
	Kirill A. Shutemov, linux-kernel, zhang.zhanghailiang,
	Sanidhya Kashyap, Dave Hansen, Peter Feiner, Mel Gorman, kvm

[-- Attachment #1: Type: text/plain, Size: 1033 bytes --]

On Jun 15, 2015 7:22 AM, "Andrea Arcangeli" <aarcange@redhat.com> wrote:
>
> Blocking reads can easily use exclusive wakeups. Poll in theory could
> too but there's no poll_wait_exclusive in common code yet.

NAK.

Tie while commit message is crap, and so us the comment

No, your really cannot "easily use exclusive waits", and no, using them for
polling isn't about a lack of interface, it's about the fact that it would
be buggy shit.

What if the process doing the polling never doors anything with the end
result? Maybe it meant to, but it got killed before it could? Are you going
to leave everybody else blocked, even though there are pending events?

The same us try of read() too. What if the reader only reads party of the
message? The wake didn't wake anybody else, so now people are (again)
blocked despite there being data.

So no, exclusive waiting is never "simple". You have to 100% guarantee that
you will consume all the data that caused the wake event (or perhaps wake
the next person up if you don't).

    Linus

[-- Attachment #2: Type: text/html, Size: 1301 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/7] userfaultfd: require UFFDIO_API before other ioctls
  2015-06-15 18:11   ` Linus Torvalds
@ 2015-06-15 21:43     ` Andrea Arcangeli
  2015-06-15 21:55       ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Andrea Arcangeli @ 2015-06-15 21:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Huangpeng (Peter),
	Paolo Bonzini, qemu-devel, Pavel Emelyanov, Hugh Dickins,
	Andrew Morton, Dr. David Alan Gilbert, Andres Lagar-Cavilla,
	Andy Lutomirski, linux-mm, Johannes Weiner, Rik van Riel,
	Kirill A. Shutemov, linux-kernel, zhang.zhanghailiang,
	Sanidhya Kashyap, Dave Hansen, Peter Feiner, Mel Gorman, kvm

On Mon, Jun 15, 2015 at 08:11:50AM -1000, Linus Torvalds wrote:
> On Jun 15, 2015 7:22 AM, "Andrea Arcangeli" <aarcange@redhat.com> wrote:
> >
> > +       if (cmd != UFFDIO_API) {
> > +               if (ctx->state == UFFD_STATE_WAIT_API)
> > +                       return -EINVAL;
> > +               BUG_ON(ctx->state != UFFD_STATE_RUNNING);
> > +       }
> 
> NAK.
> 
> Once again: we don't add BUG_ON() as some kind of assert. If your
> non-critical code has s bug in it, you do WARN_ONCE() and you return. You
> don't kill the machine just because of some "this can't happen" situation.
> 
> It turns out "this can't happen" happens way too often, just because code
> changes, or programmers didn't think all the cases through. And killing the
> machine is just NOT ACCEPTABLE.
> 
> People need to stop adding machine-killing checks to code that just doesn't
> merit killing the machine.
> 
> And if you are so damn sure that it really cannot happen ever, then you
> damn well had better remove the test too!
> 
> BUG_ON is not a debugging tool, or a "I think this would be bad" helper.

Several times I got very hardly reproducible bugs noticed purely
because of BUG_ON (not VM_BUG_ON) inserted out of pure paranoia, so I
know as a matter of fact that they're worth the little cost. It's hard
to tell if things didn't get worse, if the workload continued, or even
if I ended up getting a bugreport in the first place with only a
WARN_ON variant, precisely because a WARN_ON isn't necessarily a bug.

Example: when a WARN_ON in the network code showup (and they do once
in a while as there are so many), nobody panics because we assume it
may not actually be a bug so we can cross finger it goes away at the
next git fetch... not even sure if they all get reported in the first
place.

BUG_ONs are terribly annoying when they trigger, and even worse if
they're false positives, but they're worth the pain in my view.

Of course what's unacceptable is that BUG_ON can be triggered at will
by userland, that would be a security issue. Just in case I verified
to run two UFFDIO_API in a row and a UFFDIO_REGISTER without an
UFFDIO_API before it, and no BUG_ON triggers with this code inserted.

Said that it's your choice, so I'm not going to argue further about
this and I'm sure fine with WARN_ONCE too, there were a few more to
convert in the state machine invariant checks. While at it I can also
use VM_WARN_ONCE to cover my performance concern.

Thanks,
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/7] userfaultfd: require UFFDIO_API before other ioctls
  2015-06-15 21:43     ` Andrea Arcangeli
@ 2015-06-15 21:55       ` Linus Torvalds
  0 siblings, 0 replies; 18+ messages in thread
From: Linus Torvalds @ 2015-06-15 21:55 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Huangpeng (Peter),
	qemu-devel, Paolo Bonzini, Pavel Emelyanov, Hugh Dickins,
	Andrew Morton, Dr. David Alan Gilbert, Andres Lagar-Cavilla,
	Andy Lutomirski, linux-mm, Kirill A. Shutemov, Rik van Riel,
	Johannes Weiner, linux-kernel, Sanidhya Kashyap,
	zhang.zhanghailiang, Dave Hansen, Mel Gorman, Peter Feiner, kvm

[-- Attachment #1: Type: text/plain, Size: 1036 bytes --]

On Jun 15, 2015 11:43 AM, "Andrea Arcangeli" <aarcange@redhat.com> wrote:
>
> Several times I got very hardly reproducible bugs noticed purely
> because of BUG_ON (not VM_BUG_ON)

Feel free to use them while developing. Don't send me patches with your
broken debug code, though.

For users, a dead machine means that it is less likely you will ever get a
bug report. People set "reboot on oops", and when running X is not always
something that can be seen anyway. They'll just see a rebooting or a dead
machine, and not send you any debug output.

This is not negotiable. Seriously. Get rid of the BUG_ON if you expect your
patches to be merged mainline.

Also, even for debugging, using something like

    if (WARN_ON_ONCE(...))
         return -EINVAL;

is the right thing to do. Then you can unwind locks etc, and return
cleanly, and you'll only get one warning rather than a stream of them etc.

So stop making excuses for your bad BUG_ON use. Do it in private where
nobody can see your perversions, or do it right.

        Linus

[-- Attachment #2: Type: text/html, Size: 1322 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 5/7] userfaultfd: switch to exclusive wakeup for blocking reads
  2015-06-15 18:19   ` Linus Torvalds
@ 2015-06-15 22:19     ` Andrea Arcangeli
  2015-06-16  6:41       ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Andrea Arcangeli @ 2015-06-15 22:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Huangpeng (Peter),
	Paolo Bonzini, qemu-devel, Pavel Emelyanov, Hugh Dickins,
	Andrew Morton, Dr. David Alan Gilbert, Andres Lagar-Cavilla,
	Andy Lutomirski, linux-mm, Johannes Weiner, Rik van Riel,
	Kirill A. Shutemov, linux-kernel, zhang.zhanghailiang,
	Sanidhya Kashyap, Dave Hansen, Peter Feiner, Mel Gorman, kvm

On Mon, Jun 15, 2015 at 08:19:07AM -1000, Linus Torvalds wrote:
> What if the process doing the polling never doors anything with the end
> result? Maybe it meant to, but it got killed before it could? Are you going
> to leave everybody else blocked, even though there are pending events?

Yes, it would leave the other blocked, how is it different from having
just 1 reader and it gets killed?

If any qemu thread gets killed the thing is going to be noticeable,
there's no fault-tolerance-double-thread for anything. If one wants to
use more threads for fault tolerance of this scenario with
userfaultfd, one just needs to add a feature flag to the
uffdio_api.features to request it and change the behavior to wakeall
but by default if we can do wakeone I think we should.

> The same us try of read() too. What if the reader only reads party of the
> message? The wake didn't wake anybody else, so now people are (again)
> blocked despite there being data.

I totally agree that for a normal read that would be a concern, but
the wakeone only applies to the uffd. I'm not even trying to change
other read methods.

The uffd can't short-read. Lengths not multiple of sizeof(struct
uffd_msg) immediately return -EINVAL. read will return one or more
events, sizeof(struct uffd_msg). Signal interruptions only are
reported if it's about to block and it found nothing.

> So no, exclusive waiting is never "simple". You have to 100% guarantee that
> you will consume all the data that caused the wake event (or perhaps wake
> the next person up if you don't).

I don't see where it goes wrong.

Now if __wake_up_common didn't check the retval of
default_wake_function->try_to_wake_up before decrements and checking
nr_exclusive I would where the problem about the next guy is, but it
does this:

		if (curr->func(curr, mode, wake_flags, key) &&
				(flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
			break;

Every new userfault blocking (and at max 1 event to read is generated
for each new blocking userfaults) wakes one more reader, and each
reader is guaranteed to be blocked only if the pending (pending as not
read yet) waitqueue is truly empty. Where does it misbehave?

Yes each reader is required then to handle whatever userfault event it
got from read (or to pass it to another thread before quitting), but
this is a must anyway. This is because after the userfault is read it
is moved from pending fault queue to normal fault queue, so it won't
ever be read again, if it wasn't the case read would infinite loop and
it couldn't block (the same applies to poll, poll blocks after the
pending event has been read).

The testsuite can reproduce the bug fixed in 4/7 in like 3 seconds,
and it's 100% reproducible. And the window for such a bug is really
small: exactly in between list_del(); list_add the two
waitqueue_active must run in the other CPU. So it's hard to imagine if
this had some major issue, the testsuite wouldn't show it. In fact the
load seems to scale more evenly across all uffd threads too without no
apparent downside.

qemu uses just one reader, and it's even using poll, so this is not
needed for the short term production code, and it's totally fine to
defer this patch.

I'm not saying doing wakeone is easy and it's enough to flip a switch
everywhere to get it everywhere, and perhaps there's something wrong
still, I just I don't see where the actual bug is and how it should
work better without this patch but it's certainly fine to drop the
patch anyway (at least for now).

Thanks,
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 5/7] userfaultfd: switch to exclusive wakeup for blocking reads
  2015-06-15 22:19     ` Andrea Arcangeli
@ 2015-06-16  6:41       ` Linus Torvalds
  2015-06-16 12:17         ` Andrea Arcangeli
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2015-06-16  6:41 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Huangpeng (Peter),
	Paolo Bonzini, qemu-devel, Pavel Emelyanov, Hugh Dickins,
	Andrew Morton, Dr. David Alan Gilbert, Andres Lagar-Cavilla,
	Andy Lutomirski, linux-mm, Johannes Weiner, Rik van Riel,
	Kirill A. Shutemov, Linux Kernel Mailing List,
	zhang.zhanghailiang, Sanidhya Kashyap, Dave Hansen, Peter Feiner,
	Mel Gorman, KVM list

On Mon, Jun 15, 2015 at 12:19 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> Yes, it would leave the other blocked, how is it different from having
> just 1 reader and it gets killed?

Either is completely wrong. But the read() code can at least see that
"I'm returning early due to a signal, so I'll wake up any other
waiters".

Poll simply *cannot* do that. Because by definition poll always
returns without actually clearing the thing that caused the wakeup.

So for "poll()", using exclusive waits is wrong very much by
definition. For read(), you *can* use exclusive waits correctly, it
just requires you to wake up others if you don't read all the data
(either due to being killed by a signal, or because the read was
incomplete).

> If any qemu thread gets killed the thing is going to be noticeable,
> there's no fault-tolerance-double-thread for anything.

What does qemu have to do with anything?

We don't implement kernel interfaces that are broken, and that can
leave processes blocked when they shouldn't be blocked. We also don't
implement kernel interfaces that only work with one program and then
say "if that program is broken, it's not our problem".

> I'm not saying doing wakeone is easy [...]

Bullshit, Andrea.

That's *exactly* what you said in the commit message for the broken
patch that I complained about. And I quote:

  "Blocking reads can easily use exclusive wakeups. Poll in theory
could too but there's no poll_wait_exclusive in common code yet"

and I pointed out that your commit message was garbage, and that it's
not at all as easy as you claim, and that your patch was broken, and
your description was even more broken.

The whole "poll cannot use exclsive waits" has _nothing_ to do with us
not having "poll_wait_exclusive()". Poll *fundamentally* cannot use
exclusive waits. Your commit message was garbage, and actively
misleading. Don't make excuses.

                 Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 5/7] userfaultfd: switch to exclusive wakeup for blocking reads
  2015-06-16  6:41       ` Linus Torvalds
@ 2015-06-16 12:17         ` Andrea Arcangeli
  0 siblings, 0 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2015-06-16 12:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Huangpeng (Peter),
	Paolo Bonzini, qemu-devel, Pavel Emelyanov, Hugh Dickins,
	Andrew Morton, Dr. David Alan Gilbert, Andres Lagar-Cavilla,
	Andy Lutomirski, linux-mm, Johannes Weiner, Rik van Riel,
	Kirill A. Shutemov, Linux Kernel Mailing List,
	zhang.zhanghailiang, Sanidhya Kashyap, Dave Hansen, Peter Feiner,
	Mel Gorman, KVM list

On Mon, Jun 15, 2015 at 08:41:24PM -1000, Linus Torvalds wrote:
> On Mon, Jun 15, 2015 at 12:19 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > Yes, it would leave the other blocked, how is it different from having
> > just 1 reader and it gets killed?
> 
> Either is completely wrong. But the read() code can at least see that
> "I'm returning early due to a signal, so I'll wake up any other
> waiters".
> 
> Poll simply *cannot* do that. Because by definition poll always
> returns without actually clearing the thing that caused the wakeup.
> 
> So for "poll()", using exclusive waits is wrong very much by
> definition. For read(), you *can* use exclusive waits correctly, it
> just requires you to wake up others if you don't read all the data
> (either due to being killed by a signal, or because the read was
> incomplete).

There's no interface to do wakeone with poll so I haven't thought much
about it frankly but intuitively it didn't look radically different as
long as poll checks every fd revent it gets. If I was to patch it to
introduce wakeone in poll I'd think more about it of course. Perhaps
I've been overoptimistic here.

> What does qemu have to do with anything?
> 
> We don't implement kernel interfaces that are broken, and that can
> leave processes blocked when they shouldn't be blocked. We also don't
> implement kernel interfaces that only work with one program and then
> say "if that program is broken, it's not our problem".

I'm testing with the stresstest application not with qemu, qemu cannot
take advantage of this anyway because it uses a single thread so far
and it uses poll not blocking reads. The stresstest suite listens to
events with one thread per CPU and it interleaves poll usage with
blocking reads at every bounce, and it's working correctly so far.

> > I'm not saying doing wakeone is easy [...]
> 
> Bullshit, Andrea.
> 
> That's *exactly* what you said in the commit message for the broken
> patch that I complained about. And I quote:

Please don't quote me out of context, and quote me in full if you
quote me:

"I'm not saying doing wakeone is easy and it's enough to flip a switch
everywhere to get it everywhere"

In the above paragraph (that you quoted in a truncated version of it)
I was talking in general, not specific to userfaultfd. I meant in
general wakeone is not easy.

> patch that I complained about. And I quote:
> 
>   "Blocking reads can easily use exclusive wakeups. Poll in theory
> could too but there's no poll_wait_exclusive in common code yet"

With "I'm not saying doing wakeone is easy and it's enough to flip a
switch everywhere to get it everywhere" I intended exactly to clarify
that "Blocking reads can easily use exclusive wakeups" was in the
context of userfaultfd only.

With regard to the patch you still haven't told what exactly what
runtime breakage I shall expect from my simple change. The case you
mentioned about a thread that gets killed is irrelevant because an
userfault would get missed anyway, if a task listening to userfaultfd
get killed after it received any uffd_msg structure. Wakeone or
wakeall won't move the needle for that case. There's no broadcast of
userfaults to all readers, even with wakeall, only the first reader
that wakes up, gets the messages, the others returns to sleep
immediately.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/7] userfault21 update
  2015-06-15 17:22 [PATCH 0/7] userfault21 update Andrea Arcangeli
                   ` (6 preceding siblings ...)
  2015-06-15 17:22 ` [PATCH 7/7] userfaultfd: selftest Andrea Arcangeli
@ 2015-10-12 15:04 ` Patrick Donnelly
  2015-10-19 21:42   ` Andrea Arcangeli
  7 siblings, 1 reply; 18+ messages in thread
From: Patrick Donnelly @ 2015-10-12 15:04 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, open list, linux-mm, qemu-devel, kvm,
	Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

Hello Andrea,

On Mon, Jun 15, 2015 at 1:22 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> This is an incremental update to the userfaultfd code in -mm.

Sorry I'm late to this party. I'm curious how a ptrace monitor might
use a userfaultfd to handle faults in all of its tracees. Is this
possible without having each (newly forked) tracee "cooperate" by
creating a userfaultfd and passing that to the tracer?

Have you considered using one userfaultfd for an entire tree of
processes (signaled through a flag)? Would not a process id included
in the include/uapi/linux/userfaultfd.h:struct uffd_msg be sufficient
to disambiguate faults?

-- 
Patrick Donnelly

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/7] userfault21 update
  2015-10-12 15:04 ` [PATCH 0/7] userfault21 update Patrick Donnelly
@ 2015-10-19 21:42   ` Andrea Arcangeli
  2015-10-20 13:44     ` Patrick Donnelly
  0 siblings, 1 reply; 18+ messages in thread
From: Andrea Arcangeli @ 2015-10-19 21:42 UTC (permalink / raw)
  To: Patrick Donnelly
  Cc: Andrew Morton, open list, linux-mm, qemu-devel, kvm,
	Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

Hello Patrick,

On Mon, Oct 12, 2015 at 11:04:11AM -0400, Patrick Donnelly wrote:
> Hello Andrea,
> 
> On Mon, Jun 15, 2015 at 1:22 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> > This is an incremental update to the userfaultfd code in -mm.
> 
> Sorry I'm late to this party. I'm curious how a ptrace monitor might
> use a userfaultfd to handle faults in all of its tracees. Is this
> possible without having each (newly forked) tracee "cooperate" by
> creating a userfaultfd and passing that to the tracer?

To make the non cooperative usage work, userfaulfd also needs more
features to track fork() and mremap() syscalls and such, as the
monitor needs to be aware about modifications to the address space of
each "mm" is managing and of new forked "mm" as well. So fork() won't
need to call userfaultfd once we add those features, but it still
doesn't need to know about the "pid". The uffd_msg already has padding
to add the features you need for that.

Pavel invented and developed those features for the non cooperative
usage to implement postcopy live migration of containers. He posted
some patchset on the lists too, but it probably needs to be rebased on
upstream.

The ptrace monitor thread can also fault into the userfault area if it
wants to (but only if it's not the userfault manager thread as well).
I didn't expect the ptrace monitor to want to be a userfault manager
too though.

On a side note, the signals the ptrace monitor sends to the tracee
(SIGCONT|STOP included) will only be executed by the tracee without
waiting for userfault resolution from the userfault manager, if the
tracees userfault wasn't triggered in kernel context (and in a non
cooperative usage that's not an assumption you can make). If the
tracee hits an userfault while running in kernel context, the
userfault manager must resolve the userfault before any signal (except
SIGKILL of course) can be executed by the tracee. Only SIGKILL is
instantly executed by all tracees no matter if it was an userfault in
kernel or user context. That may be another reason for not wanting the
ptrace monitor and the userfault manager in the same thread (they can
still be running in two different threads of the same external
process).

> Have you considered using one userfaultfd for an entire tree of
> processes (signaled through a flag)? Would not a process id included
> in the include/uapi/linux/userfaultfd.h:struct uffd_msg be sufficient
> to disambiguate faults?

I got a private email asking a corollary question about having the
faulting IP address in the uffd_msg recently, which I answered and I
take opportunity to quote it as well below, as it's somewhat connected
with your "pid" question and this adds more context.

===

At times it's the kernel accessing the page (copy-user get user pages)
like if the buffer is a parameter to the write or read syscalls, just
to make an example.

The IP address triggering the fault isn't necessarily a userland
address. Furthermore not even the pid is known, so you don't know
which process accessed it.

userfaultfd only notifies userland that a certain page is requested
and must be mapped ASAP. You don't know why or who touched it.

===

Now about adding the "pid": the association between "pid" and "mm"
isn't so strict in the kernel. You can tell which "pid" shares the
same "mm" but if you look from userland, you can't always tell which
"mm"(/process) the pid belongs to. At times async io threads or
vhost-net threads can impersonate the "mm" and in effect become part
of the process and you'd get those random "pid" of kernel threads.

It could also be a ptrace that triggers an userfault, with a "pid" that
isn't part of the application and the manager must still work
seamlessly no matter who or which "pid" triggered the userfault.

So overall dealing the "pid"s sounds like not very clean as the same
kernel thread "pid" can impersonate multiple "mm" and you wouldn't get
the information of which "mm" the "address" belongs to.

When userfaultfd() is called, it literally binds to the "mm" the
process is running on and it's pid agnostic. Then when a kernel thread
impersonating the "mm" faults into the "mm" with get_user_pages or
copy_user or when a ptrace faults into the "mm", the userafult manager
won't even see the difference.

Thanks,
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/7] userfault21 update
  2015-10-19 21:42   ` Andrea Arcangeli
@ 2015-10-20 13:44     ` Patrick Donnelly
  0 siblings, 0 replies; 18+ messages in thread
From: Patrick Donnelly @ 2015-10-20 13:44 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, open list, linux-mm, qemu-devel, kvm,
	Pavel Emelyanov, Sanidhya Kashyap, zhang.zhanghailiang,
	Linus Torvalds, Kirill A. Shutemov, Andres Lagar-Cavilla,
	Dave Hansen, Paolo Bonzini, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Hugh Dickins, Peter Feiner,
	Dr. David Alan Gilbert, Johannes Weiner, Huangpeng (Peter)

On Mon, Oct 19, 2015 at 5:42 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> Hello Patrick,
>
> On Mon, Oct 12, 2015 at 11:04:11AM -0400, Patrick Donnelly wrote:
>> Hello Andrea,
>>
>> On Mon, Jun 15, 2015 at 1:22 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>> > This is an incremental update to the userfaultfd code in -mm.
>>
>> Sorry I'm late to this party. I'm curious how a ptrace monitor might
>> use a userfaultfd to handle faults in all of its tracees. Is this
>> possible without having each (newly forked) tracee "cooperate" by
>> creating a userfaultfd and passing that to the tracer?
>
> To make the non cooperative usage work, userfaulfd also needs more
> features to track fork() and mremap() syscalls and such, as the
> monitor needs to be aware about modifications to the address space of
> each "mm" is managing and of new forked "mm" as well. So fork() won't
> need to call userfaultfd once we add those features, but it still
> doesn't need to know about the "pid". The uffd_msg already has padding
> to add the features you need for that.
>
> Pavel invented and developed those features for the non cooperative
> usage to implement postcopy live migration of containers. He posted
> some patchset on the lists too, but it probably needs to be rebased on
> upstream.
>
> The ptrace monitor thread can also fault into the userfault area if it
> wants to (but only if it's not the userfault manager thread as well).
> I didn't expect the ptrace monitor to want to be a userfault manager
> too though.
> [...]

Okay, it's definitely tricky to make this work for a tree of
non-cooperative processes. Brainstorming some ideas:

o If we are using ptrace, then we can add a ptrace event for receiving
the userfaultfd associated with the tracee, via waitpid (!). The
ptrace monitor can deduplicate userfaultfds by looking at the inode.
It can also associate a userfaultfd with a group of threads sharing a
mm. [For my possible use-case with Parrot[1], we already track the
shared address spaces of tracees in order to implement an mmap hook.]

o The userfaultfd can have a flag for tracking a tree of processes
(which can be sent via unix sockets to the userfault handler) and use
an opaque tag (the mm pointer?) to disambiguate the faults, instead of
a pid. There would need to be some kind of message to notify about
newly cloned threads and the mm associated with them? Yes, you
wouldn't be able to know which pid (or kernel/ptrace thread) generated
a fault but at least you would know which pids the mm belongs to.

I didn't see the patchset Pavel posted in a quick search of the
archives. Only this [2].

[1] http://ccl.cse.nd.edu/software/parrot/
[2] https://lkml.org/lkml/2015/1/15/103

-- 
Patrick Donnelly

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2015-10-20 13:44 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-15 17:22 [PATCH 0/7] userfault21 update Andrea Arcangeli
2015-06-15 17:22 ` [PATCH 1/7] userfaultfd: require UFFDIO_API before other ioctls Andrea Arcangeli
2015-06-15 18:11   ` Linus Torvalds
2015-06-15 21:43     ` Andrea Arcangeli
2015-06-15 21:55       ` Linus Torvalds
2015-06-15 17:22 ` [PATCH 2/7] userfaultfd: propagate the full address in THP faults Andrea Arcangeli
2015-06-15 17:22 ` [PATCH 3/7] userfaultfd: allow signals to interrupt a userfault Andrea Arcangeli
2015-06-15 17:22 ` [PATCH 4/7] userfaultfd: avoid missing wakeups during refile in userfaultfd_read Andrea Arcangeli
2015-06-15 17:22 ` [PATCH 5/7] userfaultfd: switch to exclusive wakeup for blocking reads Andrea Arcangeli
2015-06-15 18:19   ` Linus Torvalds
2015-06-15 22:19     ` Andrea Arcangeli
2015-06-16  6:41       ` Linus Torvalds
2015-06-16 12:17         ` Andrea Arcangeli
2015-06-15 17:22 ` [PATCH 6/7] userfaultfd: Revert "userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key" Andrea Arcangeli
2015-06-15 17:22 ` [PATCH 7/7] userfaultfd: selftest Andrea Arcangeli
2015-10-12 15:04 ` [PATCH 0/7] userfault21 update Patrick Donnelly
2015-10-19 21:42   ` Andrea Arcangeli
2015-10-20 13:44     ` Patrick Donnelly

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).