* [RFC] userfaultfd: fix SIGBUS resulting from false rwsem wakeups
@ 2017-01-11  0:55 Andrea Arcangeli
  2017-01-11  0:55 ` [PATCH 1/1] " Andrea Arcangeli
  0 siblings, 1 reply; 4+ messages in thread
From: Andrea Arcangeli @ 2017-01-11  0:55 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michael Rapoport, Dr. David Alan Gilbert, Mike Kravetz,
	Pavel Emelyanov, Hillf Danton

Hello everyone,

Mike noticed the userfaultfd selftest on 32-way SMP bare metal
triggers a SIGBUS, and I could reproduce it with KVM -smp 32 Fedora
guests. This never happened with fewer CPUs.

I grabbed the stack trace of the false wakeup. It looks like the rwsem
code can wake the task unexpectedly. We knew the scheduler didn't like
the idea of providing perfect wakeups, and the current way of relying
on scheduler wakeups being perfect has already been questioned, but
this seems a somewhat wider issue that affects parts outside of the
scheduler too. I was undecided about whether the scheduler could
really ignore a __set_current_state(TASK_INTERRUPTIBLE); schedule();
and in fact it never did, but the rwsem wakeup is external to the
scheduler, so we need to fix this regardless.

The SMP race condition is so rare it doesn't even happen on fewer than
32 CPUs, and it most certainly won't happen on anything but the
selftest itself. If it happens, the side effect is just a SIGBUS
killing the task gracefully; there's zero risk of memory corruption,
zero fs corruption, no memleaks, and no security issues associated
with this race condition either. So it's not a major concern, but it
must still be fixed ASAP. For those usages where userfaultfd is used
for vma-less enforcement of no access over memory holes, it would be
even less of a concern.

This is very lightly tested so far, comments welcome.

Note: this is a fully orthogonal issue to the new userfaultfd features
in -mm, so I'm posting this against upstream.

Note2: my userfault aa.git branch has the one-more-time VM_FAULT_RETRY
feature for the WP support, which hides the problem 100% successfully,
but only by luck. That feature is needed for another reason and it'll
still be needed on top of this. The chance of hitting the race twice
in a row is practically zero, which is why it gets hidden, but in
theory it could happen twice in a row and the fix must handle that
case too. So this patch has to be tested against current upstream
only, or you can't reproduce any problem in the first place. Once we
converge on a reviewed solution I'll forward port it to the
-mm/userfault branch (I doubt it applies cleanly).

Andrea Arcangeli (1):
  userfaultfd: fix SIGBUS resulting from false rwsem wakeups

 fs/userfaultfd.c | 37 +++++++++++++++++++++++++++++++++++--
 1 file changed, 35 insertions(+), 2 deletions(-)


* [PATCH 1/1] userfaultfd: fix SIGBUS resulting from false rwsem wakeups
  2017-01-11  0:55 [RFC] userfaultfd: fix SIGBUS resulting from false rwsem wakeups Andrea Arcangeli
@ 2017-01-11  0:55 ` Andrea Arcangeli
  2017-01-12  3:13   ` Hillf Danton
  2017-01-12 21:19   ` Mike Kravetz
  0 siblings, 2 replies; 4+ messages in thread
From: Andrea Arcangeli @ 2017-01-11  0:55 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Michael Rapoport, Dr. David Alan Gilbert, Mike Kravetz,
	Pavel Emelyanov, Hillf Danton

With >=32 CPUs the userfaultfd selftest triggered a graceful but
unexpected SIGBUS because VM_FAULT_RETRY was returned by
handle_userfault() even though the UFFDIO_COPY wasn't completed.

This seems to be caused by rwsem waking the thread blocked in
handle_userfault(), and we can't run up_read() before the wait_event
sequence is complete.

Keeping the wait_event sequence identical to the first one would
require running userfaultfd_must_wait() again to know whether the loop
should be repeated, and it would also require retaking the rwsem and
revalidating the whole vma status.

It seems simpler to wait for the targeted wakeup, so that if false
wakeups materialize we keep waiting for our specific wakeup event,
unless of course there are pending signals or the uffd was released.
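
In other words (a simplified sketch of what the patch implements, with
names matching the patch; the signal check is simplified and error
paths are omitted), the waker publishes a per-waiter flag before
issuing the wakeup, and the faulting thread loops until that flag is
set:

	/* waker side, in userfaultfd_wake_function() */
	WRITE_ONCE(uwq->waken, true);
	/* try_to_wake_up() makes the store visible before the task runs */
	wake_up_state(wq->private, mode);

	/* faulting side, in handle_userfault() after the first schedule() */
	while (!READ_ONCE(uwq.waken)) {
		set_current_state(blocking_state);
		if (READ_ONCE(uwq.waken) || READ_ONCE(ctx->released) ||
		    signal_pending(current))
			break;
		schedule();
	}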

Debug code collecting the stack trace of the wakeup showed this:

$ ./userfaultfd 100 99999
nr_pages: 25600, nr_pages_per_cpu: 800
bounces: 99998, mode: racing ver poll, userfaults: 32 35 90 232 30 138 69 82 34 30 139 40 40 31 20 19 43 13 15 28 27 38 21 43 56 22 1 17 31 8 4 2
bounces: 99997, mode: rnd ver poll, Bus error (core dumped)

   [<ffffffff8102e19b>] save_stack_trace+0x2b/0x50
   [<ffffffff8110b1d6>] try_to_wake_up+0x2a6/0x580
   [<ffffffff8110b502>] wake_up_q+0x32/0x70
   [<ffffffff8113d7a0>] rwsem_wake+0xe0/0x120
   [<ffffffff8148361b>] call_rwsem_wake+0x1b/0x30
   [<ffffffff81131d5b>] up_write+0x3b/0x40
   [<ffffffff812280fc>] vm_mmap_pgoff+0x9c/0xc0
   [<ffffffff81244b79>] SyS_mmap_pgoff+0x1a9/0x240
   [<ffffffff810228f2>] SyS_mmap+0x22/0x30
   [<ffffffff81842dfc>] entry_SYSCALL_64_fastpath+0x1f/0xbd
   [<ffffffffffffffff>] 0xffffffffffffffff
FAULT_FLAG_ALLOW_RETRY missing 70
CPU: 24 PID: 1054 Comm: userfaultfd Tainted: G        W       4.8.0+ #30
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
 0000000000000000 ffff880218027d40 ffffffff814749f6 ffff8802180c4a80
 ffff880231ead4c0 ffff880218027e18 ffffffff812e0102 ffffffff81842b1c
 ffff880218027dd0 ffff880218082960 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff814749f6>] dump_stack+0xb8/0x112
 [<ffffffff812e0102>] handle_userfault+0x572/0x650
 [<ffffffff81842b1c>] ? _raw_spin_unlock_irq+0x2c/0x50
 [<ffffffff812407b4>] ? handle_mm_fault+0xcf4/0x1520
 [<ffffffff81240d7e>] ? handle_mm_fault+0x12be/0x1520
 [<ffffffff81240d8b>] handle_mm_fault+0x12cb/0x1520
 [<ffffffff81483588>] ? call_rwsem_down_read_failed+0x18/0x30
 [<ffffffff810572c5>] __do_page_fault+0x175/0x500
 [<ffffffff810576f1>] trace_do_page_fault+0x61/0x270
 [<ffffffff81050739>] do_async_page_fault+0x19/0x90
 [<ffffffff81843ef5>] async_page_fault+0x25/0x30

This always happens when the main userfault selftest thread is running
clone() while glibc runs either mprotect or mmap (both taking the
mmap_sem for writing) to allocate the thread stack of the background
threads, while the locking/userfault threads already run at full
throttle and are susceptible to false wakeups that may cause
handle_userfault() to return earlier than expected (which results in a
graceful SIGBUS at the next attempt).

This was reproduced only with >=32 CPUs because with fewer CPUs the
loop starting the background threads with clone() completes too
quickly, while with 32 CPUs there's already significant activity on
~32 locking and userfault threads when the last background threads are
started with clone().

This >=32 CPU SMP race condition is likely reproducible only with the
selftest because of the much heavier userfault load it generates
compared to real apps.

We'll have to allow "one more" VM_FAULT_RETRY for the WP support, and
a patch floating around that provides it also hid this problem, but in
reality it only succeeds at hiding it. False wakeups could still
happen the second time handle_userfault() is invoked, even if the race
condition is so rare that getting false wakeups twice in a row is
impossible to reproduce. This full fix is needed for correctness; the
only alternative would be to allow VM_FAULT_RETRY to be returned
indefinitely. With this fix the WP support can stick to a strict "one
more" VM_FAULT_RETRY logic (no need to return it an unlimited number
of times to avoid the SIGBUS).

Reported-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 37 +++++++++++++++++++++++++++++++++++--
 1 file changed, 35 insertions(+), 2 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index d96e2f3..43953e0 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -63,6 +63,7 @@ struct userfaultfd_wait_queue {
 	struct uffd_msg msg;
 	wait_queue_t wq;
 	struct userfaultfd_ctx *ctx;
+	bool waken;
 };
 
 struct userfaultfd_wake_range {
@@ -86,6 +87,12 @@ static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,
 	if (len && (start > uwq->msg.arg.pagefault.address ||
 		    start + len <= uwq->msg.arg.pagefault.address))
 		goto out;
+	WRITE_ONCE(uwq->waken, true);
+	/*
+	 * The implicit smp_mb__before_spinlock in try_to_wake_up()
+	 * renders uwq->waken visible to other CPUs before the task is
+	 * waken.
+	 */
 	ret = wake_up_state(wq->private, mode);
 	if (ret)
 		/*
@@ -264,6 +271,7 @@ int handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	struct userfaultfd_wait_queue uwq;
 	int ret;
 	bool must_wait, return_to_userland;
+	long blocking_state;
 
 	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
 
@@ -334,10 +342,13 @@ int handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	uwq.wq.private = current;
 	uwq.msg = userfault_msg(vmf->address, vmf->flags, reason);
 	uwq.ctx = ctx;
+	uwq.waken = false;
 
 	return_to_userland =
 		(vmf->flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
 		(FAULT_FLAG_USER|FAULT_FLAG_KILLABLE);
+	blocking_state = return_to_userland ? TASK_INTERRUPTIBLE :
+			 TASK_KILLABLE;
 
 	spin_lock(&ctx->fault_pending_wqh.lock);
 	/*
@@ -350,8 +361,7 @@ int handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	 * following the spin_unlock to happen before the list_add in
 	 * __add_wait_queue.
 	 */
-	set_current_state(return_to_userland ? TASK_INTERRUPTIBLE :
-			  TASK_KILLABLE);
+	set_current_state(blocking_state);
 	spin_unlock(&ctx->fault_pending_wqh.lock);
 
 	must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
@@ -364,6 +374,29 @@ int handle_userfault(struct vm_fault *vmf, unsigned long reason)
 		wake_up_poll(&ctx->fd_wqh, POLLIN);
 		schedule();
 		ret |= VM_FAULT_MAJOR;
+
+		/*
+		 * False wakeups can originate even from rwsem before
+		 * up_read() however userfaults will wait either for a
+		 * targeted wakeup on the specific uwq waitqueue from
+		 * wake_userfault() or for signals or for uffd
+		 * release.
+		 */
+		while (!READ_ONCE(uwq.waken)) {
+			/*
+			 * This needs the full smp_store_mb()
+			 * guarantee as the state write must be
+			 * visible to other CPUs before reading
+			 * uwq.waken from other CPUs.
+			 */
+			set_current_state(blocking_state);
+			if (READ_ONCE(uwq.waken) ||
+			    READ_ONCE(ctx->released) ||
+			    (return_to_userland ? signal_pending(current) :
+			     fatal_signal_pending(current)))
+				break;
+			schedule();
+		}
 	}
 
 	__set_current_state(TASK_RUNNING);


* Re: [PATCH 1/1] userfaultfd: fix SIGBUS resulting from false rwsem wakeups
  2017-01-11  0:55 ` [PATCH 1/1] " Andrea Arcangeli
@ 2017-01-12  3:13   ` Hillf Danton
  2017-01-12 21:19   ` Mike Kravetz
  1 sibling, 0 replies; 4+ messages in thread
From: Hillf Danton @ 2017-01-12  3:13 UTC (permalink / raw)
  To: 'Andrea Arcangeli', linux-mm, 'Andrew Morton'
  Cc: 'Michael Rapoport', 'Dr. David Alan Gilbert',
	'Mike Kravetz', 'Pavel Emelyanov'

On Wednesday, January 11, 2017 8:56 AM Andrea Arcangeli wrote: 
> [full patch quoted; snipped]
> ---
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

* Re: [PATCH 1/1] userfaultfd: fix SIGBUS resulting from false rwsem wakeups
  2017-01-11  0:55 ` [PATCH 1/1] " Andrea Arcangeli
  2017-01-12  3:13   ` Hillf Danton
@ 2017-01-12 21:19   ` Mike Kravetz
  1 sibling, 0 replies; 4+ messages in thread
From: Mike Kravetz @ 2017-01-12 21:19 UTC (permalink / raw)
  To: Andrea Arcangeli, linux-mm, Andrew Morton
  Cc: Michael Rapoport, Dr. David Alan Gilbert, Pavel Emelyanov, Hillf Danton

On 01/10/2017 04:55 PM, Andrea Arcangeli wrote:
> [commit message quoted in full; snipped]
> Reported-by: Mike Kravetz <mike.kravetz@oracle.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Tests are now successful

Tested-by: Mike Kravetz <mike.kravetz@oracle.com>

If anyone is going to the trouble of adding the Tested-by, please
also change the Reported-by to:
Reported-by: Shubham Kumar Sharma <shubham.kumar.sharma@oracle.com>

Shubham is the one who originally discovered the issue.
-- 
Mike Kravetz


