All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrea Arcangeli <aarcange@redhat.com>
To: Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	qemu-devel@nongnu.org, kvm@vger.kernel.org
Cc: Pavel Emelyanov <xemul@parallels.com>,
	Sanidhya Kashyap <sanidhya.gatech@gmail.com>,
	zhang.zhanghailiang@huawei.com,
	Linus Torvalds <torvalds@linux-foundation.org>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	Andres Lagar-Cavilla <andreslc@google.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Rik van Riel <riel@redhat.com>, Mel Gorman <mgorman@suse.de>,
	Andy Lutomirski <luto@amacapital.net>,
	Hugh Dickins <hughd@google.com>,
	Peter Feiner <pfeiner@google.com>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	"Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Subject: [PATCH 3/7] userfaultfd: allow signals to interrupt a userfault
Date: Mon, 15 Jun 2015 19:22:07 +0200	[thread overview]
Message-ID: <1434388931-24487-4-git-send-email-aarcange@redhat.com> (raw)
In-Reply-To: <1434388931-24487-1-git-send-email-aarcange@redhat.com>

This is only simple to achieve if the userfault is going to return to
userland (not to the kernel) because we can avoid returning
VM_FAULT_RETRY despite we temporarily released the mmap_sem. The fault
would just be retried by userland then. This is safe at least on x86
and powerpc (the two archs with the syscall implemented so far).

Hint to verify for which archs this is safe: after handle_mm_fault returns, no
access to data structures protected by the mmap_sem must be done by the fault
code in arch/*/mm/fault.c until up_read(&mm->mmap_sem) is called.

This has two main benefits: signals can run with lower latency in production
(signals aren't blocked by userfaults and userfaults are immediately repeated
after signal processing) and gdb can then trivially debug the threads blocked
in this kind of userfaults coming directly from userland.

On a side note: while gdb has a need to get signal processed, coredumps always
worked perfectly with userfaults, no matter if the userfault is triggered by
GUP a kernel copy_user or directly from userland.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 35 ++++++++++++++++++++++++++++++++---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index b69d236..8286ec8 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -262,7 +262,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 	struct userfaultfd_ctx *ctx;
 	struct userfaultfd_wait_queue uwq;
 	int ret;
-	bool must_wait;
+	bool must_wait, return_to_userland;
 
 	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
 
@@ -327,6 +327,9 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 	uwq.msg = userfault_msg(address, flags, reason);
 	uwq.ctx = ctx;
 
+	return_to_userland = (flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
+		(FAULT_FLAG_USER|FAULT_FLAG_KILLABLE);
+
 	spin_lock(&ctx->fault_pending_wqh.lock);
 	/*
 	 * After the __add_wait_queue the uwq is visible to userland
@@ -338,14 +341,16 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 	 * following the spin_unlock to happen before the list_add in
 	 * __add_wait_queue.
 	 */
-	set_current_state(TASK_KILLABLE);
+	set_current_state(return_to_userland ? TASK_INTERRUPTIBLE :
+			  TASK_KILLABLE);
 	spin_unlock(&ctx->fault_pending_wqh.lock);
 
 	must_wait = userfaultfd_must_wait(ctx, address, flags, reason);
 	up_read(&mm->mmap_sem);
 
 	if (likely(must_wait && !ACCESS_ONCE(ctx->released) &&
-		   !fatal_signal_pending(current))) {
+		   (return_to_userland ? !signal_pending(current) :
+		    !fatal_signal_pending(current)))) {
 		wake_up_poll(&ctx->fd_wqh, POLLIN);
 		schedule();
 		ret |= VM_FAULT_MAJOR;
@@ -353,6 +358,30 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 
 	__set_current_state(TASK_RUNNING);
 
+	if (return_to_userland) {
+		if (signal_pending(current) &&
+		    !fatal_signal_pending(current)) {
+			/*
+			 * If we got a SIGSTOP or SIGCONT and this is
+			 * a normal userland page fault, just let
+			 * userland return so the signal will be
+			 * handled and gdb debugging works.  The page
+			 * fault code immediately after we return from
+			 * this function is going to release the
+			 * mmap_sem and it's not depending on it
+			 * (unlike gup would if we were not to return
+			 * VM_FAULT_RETRY).
+			 *
+			 * If a fatal signal is pending we still take
+			 * the streamlined VM_FAULT_RETRY failure path
+			 * and there's no need to retake the mmap_sem
+			 * in such case.
+			 */
+			down_read(&mm->mmap_sem);
+			ret = 0;
+		}
+	}
+
 	/*
 	 * Here we race with the list_del; list_add in
 	 * userfaultfd_ctx_read(), however because we don't ever run

WARNING: multiple messages have this Message-ID (diff)
From: Andrea Arcangeli <aarcange@redhat.com>
To: Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	qemu-devel@nongnu.org, kvm@vger.kernel.org
Cc: Pavel Emelyanov <xemul@parallels.com>,
	Sanidhya Kashyap <sanidhya.gatech@gmail.com>,
	zhang.zhanghailiang@huawei.com,
	Linus Torvalds <torvalds@linux-foundation.org>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	Andres Lagar-Cavilla <andreslc@google.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Rik van Riel <riel@redhat.com>, Mel Gorman <mgorman@suse.de>,
	Andy Lutomirski <luto@amacapital.net>,
	Hugh Dickins <hughd@google.com>,
	Peter Feiner <pfeiner@google.com>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	"Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Subject: [PATCH 3/7] userfaultfd: allow signals to interrupt a userfault
Date: Mon, 15 Jun 2015 19:22:07 +0200	[thread overview]
Message-ID: <1434388931-24487-4-git-send-email-aarcange@redhat.com> (raw)
In-Reply-To: <1434388931-24487-1-git-send-email-aarcange@redhat.com>

This is only simple to achieve if the userfault is going to return to
userland (not to the kernel) because we can avoid returning
VM_FAULT_RETRY despite we temporarily released the mmap_sem. The fault
would just be retried by userland then. This is safe at least on x86
and powerpc (the two archs with the syscall implemented so far).

Hint to verify for which archs this is safe: after handle_mm_fault returns, no
access to data structures protected by the mmap_sem must be done by the fault
code in arch/*/mm/fault.c until up_read(&mm->mmap_sem) is called.

This has two main benefits: signals can run with lower latency in production
(signals aren't blocked by userfaults and userfaults are immediately repeated
after signal processing) and gdb can then trivially debug the threads blocked
in this kind of userfaults coming directly from userland.

On a side note: while gdb has a need to get signal processed, coredumps always
worked perfectly with userfaults, no matter if the userfault is triggered by
GUP a kernel copy_user or directly from userland.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 35 ++++++++++++++++++++++++++++++++---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index b69d236..8286ec8 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -262,7 +262,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 	struct userfaultfd_ctx *ctx;
 	struct userfaultfd_wait_queue uwq;
 	int ret;
-	bool must_wait;
+	bool must_wait, return_to_userland;
 
 	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
 
@@ -327,6 +327,9 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 	uwq.msg = userfault_msg(address, flags, reason);
 	uwq.ctx = ctx;
 
+	return_to_userland = (flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
+		(FAULT_FLAG_USER|FAULT_FLAG_KILLABLE);
+
 	spin_lock(&ctx->fault_pending_wqh.lock);
 	/*
 	 * After the __add_wait_queue the uwq is visible to userland
@@ -338,14 +341,16 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 	 * following the spin_unlock to happen before the list_add in
 	 * __add_wait_queue.
 	 */
-	set_current_state(TASK_KILLABLE);
+	set_current_state(return_to_userland ? TASK_INTERRUPTIBLE :
+			  TASK_KILLABLE);
 	spin_unlock(&ctx->fault_pending_wqh.lock);
 
 	must_wait = userfaultfd_must_wait(ctx, address, flags, reason);
 	up_read(&mm->mmap_sem);
 
 	if (likely(must_wait && !ACCESS_ONCE(ctx->released) &&
-		   !fatal_signal_pending(current))) {
+		   (return_to_userland ? !signal_pending(current) :
+		    !fatal_signal_pending(current)))) {
 		wake_up_poll(&ctx->fd_wqh, POLLIN);
 		schedule();
 		ret |= VM_FAULT_MAJOR;
@@ -353,6 +358,30 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 
 	__set_current_state(TASK_RUNNING);
 
+	if (return_to_userland) {
+		if (signal_pending(current) &&
+		    !fatal_signal_pending(current)) {
+			/*
+			 * If we got a SIGSTOP or SIGCONT and this is
+			 * a normal userland page fault, just let
+			 * userland return so the signal will be
+			 * handled and gdb debugging works.  The page
+			 * fault code immediately after we return from
+			 * this function is going to release the
+			 * mmap_sem and it's not depending on it
+			 * (unlike gup would if we were not to return
+			 * VM_FAULT_RETRY).
+			 *
+			 * If a fatal signal is pending we still take
+			 * the streamlined VM_FAULT_RETRY failure path
+			 * and there's no need to retake the mmap_sem
+			 * in such case.
+			 */
+			down_read(&mm->mmap_sem);
+			ret = 0;
+		}
+	}
+
 	/*
 	 * Here we race with the list_del; list_add in
 	 * userfaultfd_ctx_read(), however because we don't ever run

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)
From: Andrea Arcangeli <aarcange@redhat.com>
To: Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	qemu-devel@nongnu.org, kvm@vger.kernel.org
Cc: zhang.zhanghailiang@huawei.com,
	Pavel Emelyanov <xemul@parallels.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Hugh Dickins <hughd@google.com>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	Sanidhya Kashyap <sanidhya.gatech@gmail.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Andres Lagar-Cavilla <andreslc@google.com>,
	Mel Gorman <mgorman@suse.de>, Paolo Bonzini <pbonzini@redhat.com>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	"Huangpeng (Peter)" <peter.huangpeng@huawei.com>,
	Andy Lutomirski <luto@amacapital.net>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Peter Feiner <pfeiner@google.com>
Subject: [Qemu-devel] [PATCH 3/7] userfaultfd: allow signals to interrupt a userfault
Date: Mon, 15 Jun 2015 19:22:07 +0200	[thread overview]
Message-ID: <1434388931-24487-4-git-send-email-aarcange@redhat.com> (raw)
In-Reply-To: <1434388931-24487-1-git-send-email-aarcange@redhat.com>

This is only simple to achieve if the userfault is going to return to
userland (not to the kernel) because we can avoid returning
VM_FAULT_RETRY despite we temporarily released the mmap_sem. The fault
would just be retried by userland then. This is safe at least on x86
and powerpc (the two archs with the syscall implemented so far).

Hint to verify for which archs this is safe: after handle_mm_fault returns, no
access to data structures protected by the mmap_sem must be done by the fault
code in arch/*/mm/fault.c until up_read(&mm->mmap_sem) is called.

This has two main benefits: signals can run with lower latency in production
(signals aren't blocked by userfaults and userfaults are immediately repeated
after signal processing) and gdb can then trivially debug the threads blocked
in this kind of userfaults coming directly from userland.

On a side note: while gdb has a need to get signal processed, coredumps always
worked perfectly with userfaults, no matter if the userfault is triggered by
GUP a kernel copy_user or directly from userland.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 35 ++++++++++++++++++++++++++++++++---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index b69d236..8286ec8 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -262,7 +262,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 	struct userfaultfd_ctx *ctx;
 	struct userfaultfd_wait_queue uwq;
 	int ret;
-	bool must_wait;
+	bool must_wait, return_to_userland;
 
 	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
 
@@ -327,6 +327,9 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 	uwq.msg = userfault_msg(address, flags, reason);
 	uwq.ctx = ctx;
 
+	return_to_userland = (flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
+		(FAULT_FLAG_USER|FAULT_FLAG_KILLABLE);
+
 	spin_lock(&ctx->fault_pending_wqh.lock);
 	/*
 	 * After the __add_wait_queue the uwq is visible to userland
@@ -338,14 +341,16 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 	 * following the spin_unlock to happen before the list_add in
 	 * __add_wait_queue.
 	 */
-	set_current_state(TASK_KILLABLE);
+	set_current_state(return_to_userland ? TASK_INTERRUPTIBLE :
+			  TASK_KILLABLE);
 	spin_unlock(&ctx->fault_pending_wqh.lock);
 
 	must_wait = userfaultfd_must_wait(ctx, address, flags, reason);
 	up_read(&mm->mmap_sem);
 
 	if (likely(must_wait && !ACCESS_ONCE(ctx->released) &&
-		   !fatal_signal_pending(current))) {
+		   (return_to_userland ? !signal_pending(current) :
+		    !fatal_signal_pending(current)))) {
 		wake_up_poll(&ctx->fd_wqh, POLLIN);
 		schedule();
 		ret |= VM_FAULT_MAJOR;
@@ -353,6 +358,30 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 
 	__set_current_state(TASK_RUNNING);
 
+	if (return_to_userland) {
+		if (signal_pending(current) &&
+		    !fatal_signal_pending(current)) {
+			/*
+			 * If we got a SIGSTOP or SIGCONT and this is
+			 * a normal userland page fault, just let
+			 * userland return so the signal will be
+			 * handled and gdb debugging works.  The page
+			 * fault code immediately after we return from
+			 * this function is going to release the
+			 * mmap_sem and it's not depending on it
+			 * (unlike gup would if we were not to return
+			 * VM_FAULT_RETRY).
+			 *
+			 * If a fatal signal is pending we still take
+			 * the streamlined VM_FAULT_RETRY failure path
+			 * and there's no need to retake the mmap_sem
+			 * in such case.
+			 */
+			down_read(&mm->mmap_sem);
+			ret = 0;
+		}
+	}
+
 	/*
 	 * Here we race with the list_del; list_add in
 	 * userfaultfd_ctx_read(), however because we don't ever run

  parent reply	other threads:[~2015-06-15 17:22 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-06-15 17:22 [PATCH 0/7] userfault21 update Andrea Arcangeli
2015-06-15 17:22 ` [Qemu-devel] " Andrea Arcangeli
2015-06-15 17:22 ` Andrea Arcangeli
2015-06-15 17:22 ` [PATCH 1/7] userfaultfd: require UFFDIO_API before other ioctls Andrea Arcangeli
2015-06-15 17:22   ` [Qemu-devel] " Andrea Arcangeli
2015-06-15 17:22   ` Andrea Arcangeli
2015-06-15 18:11   ` Linus Torvalds
2015-06-15 18:11     ` [Qemu-devel] " Linus Torvalds
2015-06-15 21:43     ` Andrea Arcangeli
2015-06-15 21:43       ` [Qemu-devel] " Andrea Arcangeli
2015-06-15 21:43       ` Andrea Arcangeli
2015-06-15 21:55       ` Linus Torvalds
2015-06-15 21:55         ` [Qemu-devel] " Linus Torvalds
2015-06-15 17:22 ` [PATCH 2/7] userfaultfd: propagate the full address in THP faults Andrea Arcangeli
2015-06-15 17:22   ` [Qemu-devel] " Andrea Arcangeli
2015-06-15 17:22   ` Andrea Arcangeli
2015-06-15 17:22 ` Andrea Arcangeli [this message]
2015-06-15 17:22   ` [Qemu-devel] [PATCH 3/7] userfaultfd: allow signals to interrupt a userfault Andrea Arcangeli
2015-06-15 17:22   ` Andrea Arcangeli
2015-06-15 17:22 ` [PATCH 4/7] userfaultfd: avoid missing wakeups during refile in userfaultfd_read Andrea Arcangeli
2015-06-15 17:22   ` [Qemu-devel] " Andrea Arcangeli
2015-06-15 17:22   ` Andrea Arcangeli
2015-06-15 17:22 ` [PATCH 5/7] userfaultfd: switch to exclusive wakeup for blocking reads Andrea Arcangeli
2015-06-15 17:22   ` [Qemu-devel] " Andrea Arcangeli
2015-06-15 17:22   ` Andrea Arcangeli
2015-06-15 18:19   ` Linus Torvalds
2015-06-15 18:19     ` [Qemu-devel] " Linus Torvalds
2015-06-15 22:19     ` Andrea Arcangeli
2015-06-15 22:19       ` [Qemu-devel] " Andrea Arcangeli
2015-06-15 22:19       ` Andrea Arcangeli
2015-06-16  6:41       ` Linus Torvalds
2015-06-16  6:41         ` [Qemu-devel] " Linus Torvalds
2015-06-16  6:41         ` Linus Torvalds
2015-06-16 12:17         ` Andrea Arcangeli
2015-06-16 12:17           ` [Qemu-devel] " Andrea Arcangeli
2015-06-16 12:17           ` Andrea Arcangeli
2015-06-15 17:22 ` [PATCH 6/7] userfaultfd: Revert "userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key" Andrea Arcangeli
2015-06-15 17:22   ` [Qemu-devel] " Andrea Arcangeli
2015-06-15 17:22   ` Andrea Arcangeli
2015-06-15 17:22 ` [PATCH 7/7] userfaultfd: selftest Andrea Arcangeli
2015-06-15 17:22   ` [Qemu-devel] " Andrea Arcangeli
2015-06-15 17:22   ` Andrea Arcangeli
2015-10-12 15:04 ` [PATCH 0/7] userfault21 update Patrick Donnelly
2015-10-12 15:04   ` [Qemu-devel] " Patrick Donnelly
2015-10-12 15:04   ` Patrick Donnelly
2015-10-19 21:42   ` Andrea Arcangeli
2015-10-19 21:42     ` [Qemu-devel] " Andrea Arcangeli
2015-10-19 21:42     ` Andrea Arcangeli
2015-10-19 21:42     ` Andrea Arcangeli
2015-10-20 13:44     ` Patrick Donnelly
2015-10-20 13:44       ` [Qemu-devel] " Patrick Donnelly
2015-10-20 13:44       ` Patrick Donnelly

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1434388931-24487-4-git-send-email-aarcange@redhat.com \
    --to=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=andreslc@google.com \
    --cc=dave.hansen@intel.com \
    --cc=dgilbert@redhat.com \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=kirill@shutemov.name \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@amacapital.net \
    --cc=mgorman@suse.de \
    --cc=pbonzini@redhat.com \
    --cc=peter.huangpeng@huawei.com \
    --cc=pfeiner@google.com \
    --cc=qemu-devel@nongnu.org \
    --cc=riel@redhat.com \
    --cc=sanidhya.gatech@gmail.com \
    --cc=torvalds@linux-foundation.org \
    --cc=xemul@parallels.com \
    --cc=zhang.zhanghailiang@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.