[RFC,03/13] selftests/vm/userfaultfd: wake after copy failure
diff mbox series

Message ID 20201129004548.1619714-4-namit@vmware.com
State New, archived
Headers show
Series
  • fs/userfaultfd: support iouring and polling
Related show

Commit Message

Nadav Amit Nov. 29, 2020, 12:45 a.m. UTC
From: Nadav Amit <namit@vmware.com>

When userfaultfd copy-ioctl fails since the PTE already exists, an
-EEXIST error is returned and the faulting thread is not woken. The
current userfaultfd test does not wake the faulting thread in such case.
The assumption is presumably that another thread set the PTE through
copy/wp ioctl and would wake the faulting thread or that alternatively
the fault handler would realize there is no need to "must_wait" and
continue. This is not necessarily true.

There is an assumption that the "must_wait" tests in handle_userfault()
are sufficient to provide definitive answer whether the offending PTE is
populated or not. However, userfaultfd_must_wait() test is lockless.
Consequently, concurrent calls to ptep_modify_prot_start(), for
instance, can clear the PTE and can cause userfaultfd_must_wait()
to wrongly assume it is not populated and a wait is needed.

There are therefore 3 options:
(1) Change the tests to wake on copy failure.
(2) Wake faulting thread unconditionally on zero/copy ioctls before
    returning -EEXIST.
(3) Change the userfaultfd_must_wait() to hold locks.

This patch took the first approach, but the others are valid solutions
with different tradeoffs.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: io-uring@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 tools/testing/selftests/vm/userfaultfd.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

Comments

Peter Xu Dec. 21, 2020, 7:28 p.m. UTC | #1
On Sat, Nov 28, 2020 at 04:45:38PM -0800, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> When userfaultfd copy-ioctl fails since the PTE already exists, an
> -EEXIST error is returned and the faulting thread is not woken. The
> current userfaultfd test does not wake the faulting thread in such case.
> The assumption is presumably that another thread set the PTE through
> copy/wp ioctl and would wake the faulting thread or that alternatively
> the fault handler would realize there is no need to "must_wait" and
> continue. This is not necessarily true.
> 
> There is an assumption that the "must_wait" tests in handle_userfault()
> are sufficient to provide definitive answer whether the offending PTE is
> populated or not. However, userfaultfd_must_wait() test is lockless.
> Consequently, concurrent calls to ptep_modify_prot_start(), for
> instance, can clear the PTE and can cause userfaultfd_must_wait()
> to wrongly assume it is not populated and a wait is needed.

Yes userfaultfd_must_wait() is lockless, however my understanding is that we'll
enqueue before reading the page table, which seems to me that we'll always get
notified even the race happens.  Should apply to either UFFDIO_WRITEPROTECT or
UFFDIO_COPY, iiuc, as long as we follow the order of (1) modify pgtable (2)
wake sleeping threads.  Then it also means that when must_wait() returned true,
it should always get waked up when fault resolved.

Taking UFFDIO_COPY as example, even if UFFDIO_COPY happen right before
must_wait() calls:

       worker thread                       uffd thread
       -------------                       -----------

   handle_userfault
    spin_lock(fault_pending_wqh)
    enqueue()
    set_current_state(INTERRUPTIBLE)
    spin_unlock(fault_pending_wqh)
    must_wait()
      lockless walk page table
                                           UFFDIO_COPY
                                             fill in the hole
                                             wake up threads
                                               (this will wake up worker thread too?)
    schedule()
      (which may return immediately?)

While here fault_pending_wqh is lock protected. I just feel like there's some
other reason to cause the thread to stall.  Or did I miss something?

Thanks,
Nadav Amit Dec. 21, 2020, 7:51 p.m. UTC | #2
> On Dec 21, 2020, at 11:28 AM, Peter Xu <peterx@redhat.com> wrote:
> 
> On Sat, Nov 28, 2020 at 04:45:38PM -0800, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> When userfaultfd copy-ioctl fails since the PTE already exists, an
>> -EEXIST error is returned and the faulting thread is not woken. The
>> current userfaultfd test does not wake the faulting thread in such case.
>> The assumption is presumably that another thread set the PTE through
>> copy/wp ioctl and would wake the faulting thread or that alternatively
>> the fault handler would realize there is no need to "must_wait" and
>> continue. This is not necessarily true.
>> 
>> There is an assumption that the "must_wait" tests in handle_userfault()
>> are sufficient to provide definitive answer whether the offending PTE is
>> populated or not. However, userfaultfd_must_wait() test is lockless.
>> Consequently, concurrent calls to ptep_modify_prot_start(), for
>> instance, can clear the PTE and can cause userfaultfd_must_wait()
>> to wrongly assume it is not populated and a wait is needed.
> 
> Yes userfaultfd_must_wait() is lockless, however my understanding is that we'll
> enqueue before reading the page table, which seems to me that we'll always get
> notified even the race happens.  Should apply to either UFFDIO_WRITEPROTECT or
> UFFDIO_COPY, iiuc, as long as we follow the order of (1) modify pgtable (2)
> wake sleeping threads.  Then it also means that when must_wait() returned true,
> it should always get waked up when fault resolved.
> 
> Taking UFFDIO_COPY as example, even if UFFDIO_COPY happen right before
> must_wait() calls:
> 
>       worker thread                       uffd thread
>       -------------                       -----------
> 
>   handle_userfault
>    spin_lock(fault_pending_wqh)
>    enqueue()
>    set_current_state(INTERRUPTIBLE)
>    spin_unlock(fault_pending_wqh)
>    must_wait()
>      lockless walk page table
>                                           UFFDIO_COPY
>                                             fill in the hole
>                                             wake up threads
>                                               (this will wake up worker thread too?)
>    schedule()
>      (which may return immediately?)
> 
> While here fault_pending_wqh is lock protected. I just feel like there's some
> other reason to cause the thread to stall.  Or did I miss something?

But what happens if the copy completed before the enqueuing? Assume
the page is write-protected during UFFDIO_COPY:


cpu0					cpu1		
----					----			
handle_userfault
					UFFDIO_COPY
					[ write-protected ]
				 	 fill in the hole
				 	 wake up threads
				 	 [nothing to wake]
							
					UFFD_WP (unprotect)
					 logically marks as unprotected
					 [nothing to wake]

 spin_lock(fault_pending_wqh)
  enqueue()
  set_current_state(INTERRUPTIBLE)
  spin_unlock(fault_pending_wqh)
  must_wait()

					[ #PF on the same PTE
					 due to write-protection ]

					...
					 wp_page_copy()
					  ptep_clear_flush_notify()
					  [ PTE is clear ]
					
   lockless walk page table
    pte_none(*pte) -> must wait

Note that additional scenarios are possible. For instance, instead of
wp_page_copy(), we can have other change_pte_range() (due to worker’s
mprotect() or NUMA balancing), calling ptep_modify_prot_start() and clearing
the PTE.

Am I missing something?
Peter Xu Dec. 21, 2020, 8:52 p.m. UTC | #3
On Mon, Dec 21, 2020 at 07:51:52PM +0000, Nadav Amit wrote:
> > On Dec 21, 2020, at 11:28 AM, Peter Xu <peterx@redhat.com> wrote:
> > 
> > On Sat, Nov 28, 2020 at 04:45:38PM -0800, Nadav Amit wrote:
> >> From: Nadav Amit <namit@vmware.com>
> >> 
> >> When userfaultfd copy-ioctl fails since the PTE already exists, an
> >> -EEXIST error is returned and the faulting thread is not woken. The
> >> current userfaultfd test does not wake the faulting thread in such case.
> >> The assumption is presumably that another thread set the PTE through
> >> copy/wp ioctl and would wake the faulting thread or that alternatively
> >> the fault handler would realize there is no need to "must_wait" and
> >> continue. This is not necessarily true.
> >> 
> >> There is an assumption that the "must_wait" tests in handle_userfault()
> >> are sufficient to provide definitive answer whether the offending PTE is
> >> populated or not. However, userfaultfd_must_wait() test is lockless.
> >> Consequently, concurrent calls to ptep_modify_prot_start(), for
> >> instance, can clear the PTE and can cause userfaultfd_must_wait()
> >> to wrongly assume it is not populated and a wait is needed.
> > 
> > Yes userfaultfd_must_wait() is lockless, however my understanding is that we'll
> > enqueue before reading the page table, which seems to me that we'll always get
> > notified even the race happens.  Should apply to either UFFDIO_WRITEPROTECT or
> > UFFDIO_COPY, iiuc, as long as we follow the order of (1) modify pgtable (2)
> > wake sleeping threads.  Then it also means that when must_wait() returned true,
> > it should always get waked up when fault resolved.
> > 
> > Taking UFFDIO_COPY as example, even if UFFDIO_COPY happen right before
> > must_wait() calls:
> > 
> >       worker thread                       uffd thread
> >       -------------                       -----------
> > 
> >   handle_userfault
> >    spin_lock(fault_pending_wqh)
> >    enqueue()
> >    set_current_state(INTERRUPTIBLE)
> >    spin_unlock(fault_pending_wqh)
> >    must_wait()
> >      lockless walk page table
> >                                           UFFDIO_COPY
> >                                             fill in the hole
> >                                             wake up threads
> >                                               (this will wake up worker thread too?)
> >    schedule()
> >      (which may return immediately?)
> > 
> > While here fault_pending_wqh is lock protected. I just feel like there's some
> > other reason to cause the thread to stall.  Or did I miss something?
> 
> But what happens if the copy completed before the enqueuing? Assume
> the page is write-protected during UFFDIO_COPY:
> 
> 
> cpu0					cpu1		
> ----					----			
> handle_userfault
> 					UFFDIO_COPY
> 					[ write-protected ]
> 				 	 fill in the hole
> 				 	 wake up threads
> 				 	 [nothing to wake]
> 							
> 					UFFD_WP (unprotect)
> 					 logically marks as unprotected
> 					 [nothing to wake]
> 
>  spin_lock(fault_pending_wqh)
>   enqueue()
>   set_current_state(INTERRUPTIBLE)
>   spin_unlock(fault_pending_wqh)
>   must_wait()
> 
> 					[ #PF on the same PTE
> 					 due to write-protection ]
> 
> 					...
> 					 wp_page_copy()
> 					  ptep_clear_flush_notify()
> 					  [ PTE is clear ]
> 					
>    lockless walk page table
>     pte_none(*pte) -> must wait
> 
> Note that additional scenarios are possible. For instance, instead of
> wp_page_copy(), we can have other change_pte_range() (due to worker’s
> mprotect() or NUMA balancing), calling ptep_modify_prot_start() and clearing
> the PTE.
> 
> Am I missing something?

Ah I see your point, thanks.  I think you're right:

Reviewed-by: Peter Xu <peterx@redhat.com>

Would you mind adding something like above into the commit message if you're
going to repost?  IMHO it would even be nicer to mention why
UFFDIO_WRITEPROTECT does not need this extra wakeup (I think it's because it'll
do the wakeup unconditionally anyway).
Nadav Amit Dec. 21, 2020, 8:54 p.m. UTC | #4
> On Dec 21, 2020, at 12:52 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> On Mon, Dec 21, 2020 at 07:51:52PM +0000, Nadav Amit wrote:
>>> On Dec 21, 2020, at 11:28 AM, Peter Xu <peterx@redhat.com> wrote:
>>> 
>>> On Sat, Nov 28, 2020 at 04:45:38PM -0800, Nadav Amit wrote:
>>>> From: Nadav Amit <namit@vmware.com>
>>>> 
>>>> When userfaultfd copy-ioctl fails since the PTE already exists, an
>>>> -EEXIST error is returned and the faulting thread is not woken. The
>>>> current userfaultfd test does not wake the faulting thread in such case.
>>>> The assumption is presumably that another thread set the PTE through
>>>> copy/wp ioctl and would wake the faulting thread or that alternatively
>>>> the fault handler would realize there is no need to "must_wait" and
>>>> continue. This is not necessarily true.
>>>> 
>>>> There is an assumption that the "must_wait" tests in handle_userfault()
>>>> are sufficient to provide definitive answer whether the offending PTE is
>>>> populated or not. However, userfaultfd_must_wait() test is lockless.
>>>> Consequently, concurrent calls to ptep_modify_prot_start(), for
>>>> instance, can clear the PTE and can cause userfaultfd_must_wait()
>>>> to wrongly assume it is not populated and a wait is needed.
>>> 
>>> Yes userfaultfd_must_wait() is lockless, however my understanding is that we'll
>>> enqueue before reading the page table, which seems to me that we'll always get
>>> notified even the race happens.  Should apply to either UFFDIO_WRITEPROTECT or
>>> UFFDIO_COPY, iiuc, as long as we follow the order of (1) modify pgtable (2)
>>> wake sleeping threads.  Then it also means that when must_wait() returned true,
>>> it should always get waked up when fault resolved.
>>> 
>>> Taking UFFDIO_COPY as example, even if UFFDIO_COPY happen right before
>>> must_wait() calls:
>>> 
>>>      worker thread                       uffd thread
>>>      -------------                       -----------
>>> 
>>>  handle_userfault
>>>   spin_lock(fault_pending_wqh)
>>>   enqueue()
>>>   set_current_state(INTERRUPTIBLE)
>>>   spin_unlock(fault_pending_wqh)
>>>   must_wait()
>>>     lockless walk page table
>>>                                          UFFDIO_COPY
>>>                                            fill in the hole
>>>                                            wake up threads
>>>                                              (this will wake up worker thread too?)
>>>   schedule()
>>>     (which may return immediately?)
>>> 
>>> While here fault_pending_wqh is lock protected. I just feel like there's some
>>> other reason to cause the thread to stall.  Or did I miss something?
>> 
>> But what happens if the copy completed before the enqueuing? Assume
>> the page is write-protected during UFFDIO_COPY:
>> 
>> 
>> cpu0					cpu1		
>> ----					----			
>> handle_userfault
>> 					UFFDIO_COPY
>> 					[ write-protected ]
>> 				 	 fill in the hole
>> 				 	 wake up threads
>> 				 	 [nothing to wake]
>> 							
>> 					UFFD_WP (unprotect)
>> 					 logically marks as unprotected
>> 					 [nothing to wake]
>> 
>> spin_lock(fault_pending_wqh)
>>  enqueue()
>>  set_current_state(INTERRUPTIBLE)
>>  spin_unlock(fault_pending_wqh)
>>  must_wait()
>> 
>> 					[ #PF on the same PTE
>> 					 due to write-protection ]
>> 
>> 					...
>> 					 wp_page_copy()
>> 					  ptep_clear_flush_notify()
>> 					  [ PTE is clear ]
>> 					
>>   lockless walk page table
>>    pte_none(*pte) -> must wait
>> 
>> Note that additional scenarios are possible. For instance, instead of
>> wp_page_copy(), we can have other change_pte_range() (due to worker’s
>> mprotect() or NUMA balancing), calling ptep_modify_prot_start() and clearing
>> the PTE.
>> 
>> Am I missing something?
> 
> Ah I see your point, thanks.  I think you're right:
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> 
> Would you mind adding something like above into the commit message if you're
> going to repost?  IMHO it would even be nicer to mention why
> UFFDIO_WRITEPROTECT does not need this extra wakeup (I think it's because it'll
> do the wakeup unconditionally anyway).

Yes, the commit log needs to be fixed.

I will update it based on your feedback on RFC-v2.

Thanks,
Nadav

Patch
diff mbox series

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 9b0912a01777..f7e6cf43db71 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -484,6 +484,18 @@  static void retry_copy_page(int ufd, struct uffdio_copy *uffdio_copy,
 	}
 }
 
+static void wake_range(int ufd, unsigned long addr, unsigned long len)
+{
+	struct uffdio_range uffdio_wake;
+
+	uffdio_wake.start = addr;
+	uffdio_wake.len = len;
+
+	if (ioctl(ufd, UFFDIO_WAKE, &uffdio_wake))
+		fprintf(stderr, "error waking %lu\n",
+			addr), exit(1);
+}
+
 static int __copy_page(int ufd, unsigned long offset, bool retry)
 {
 	struct uffdio_copy uffdio_copy;
@@ -507,6 +519,7 @@  static int __copy_page(int ufd, unsigned long offset, bool retry)
 				uffdio_copy.copy);
 			exit(1);
 		}
+		wake_range(ufd, uffdio_copy.dst, page_size);
 	} else if (uffdio_copy.copy != page_size) {
 		fprintf(stderr, "UFFDIO_COPY unexpected copy %Ld\n",
 			uffdio_copy.copy); exit(1);