linux-kernel.vger.kernel.org archive mirror
* Using userfaultfd with KVM's async page fault handling causes processes to hang waiting for mmap_lock to be released
@ 2023-07-18 14:33 Dimitris Siakavaras
  2023-07-19 21:16 ` Axel Rasmussen
  0 siblings, 1 reply; 5+ messages in thread
From: Dimitris Siakavaras @ 2023-07-18 14:33 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, linux-kernel

Hi, this is my first bug report, so I apologise in advance for any 
missing information or unclear explanations. I am happy to provide any 
further details or revise my report as needed.

Problem: Using userfaultfd with a process that uses KVM and triggers the 
asynchronous page fault handling results in both processes hanging forever.
Processor: AMD EPYC 7402 24-Core Processor
Kernel version: 5.13 (the problem also occurs on 6.4.3 and 6.5-rc2)

Unfortunately, my execution environment involves a fairly complex set of 
components to set up, so it is not straightforward for me to share code 
that reproduces the issue; instead, I will try to explain the problem as 
clearly as possible.

I have two processes:
1. A Firecracker VM process (https://firecracker-microvm.github.io/), 
which uses KVM.
2. A second process that handles the user page faults of the Firecracker 
process.
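For concreteness, process 2 boils down to the standard userfaultfd 
registration flow. The sketch below is a minimal, hypothetical stand-in 
for the real handler (which is far more involved): it only opens a uffd, 
registers an anonymous region for missing-page faults, and closes the fd, 
which is the close() that triggers userfaultfd_release() in the kernel. 
It degrades gracefully where unprivileged userfaultfd is disabled.

```c
/* Minimal, hypothetical sketch of the manager side (process 2); the real
 * fault handler is far more involved. Only the registration/close flow
 * relevant to the bug is shown. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns 0 on success, or when userfaultfd is unavailable on this
 * system (EPERM/ENOSYS); -1 on unexpected errors. */
static int demo_uffd_register(void)
{
	long uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd < 0)
		return (errno == EPERM || errno == ENOSYS) ? 0 : -1;

	struct uffdio_api api = { .api = UFFD_API };
	if (ioctl(uffd, UFFDIO_API, &api)) {
		close(uffd);
		return -1;
	}

	size_t len = 16 * 4096;
	void *area = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (area == MAP_FAILED) {
		close(uffd);
		return -1;
	}

	struct uffdio_register reg = {
		.range = { .start = (unsigned long)area, .len = len },
		.mode  = UFFDIO_REGISTER_MODE_MISSING,
	};
	if (ioctl(uffd, UFFDIO_REGISTER, &reg)) {
		munmap(area, len);
		close(uffd);
		return -1;
	}

	munmap(area, len);
	/* This close() is what runs userfaultfd_release() in the kernel:
	 * it sets ctx->released and then takes mmap_lock for writing. */
	close(uffd);
	return 0;
}
```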

The race condition involves the released field of the userfaultfd_ctx 
structure.
More specifically:

* Process 2 invokes the close() system call for the userfaultfd 
descriptor, thus triggering the execution of userfaultfd_release() in 
the kernel.
   userfaultfd_release() contains the following lines of code:

	WRITE_ONCE(ctx->released, true);

	if (!mmget_not_zero(mm))
		goto wakeup;

	/*
	 * Flush page faults out of all CPUs. NOTE: all page faults
	 * must be retried without returning VM_FAULT_SIGBUS if
	 * userfaultfd_ctx_get() succeeds but vma->vma_userfault_ctx
	 * changes while handle_userfault released the mmap_lock. So
	 * it's critical that released is set to true (above), before
	 * taking the mmap_lock for writing.
	 */
	mmap_write_lock(mm);

* Process 1 takes a page fault while running in guest mode (i.e. during 
the KVM_RUN ioctl). This triggers the execution of kvm_tdp_page_fault(), 
and the following function call chain is executed:

kvm_tdp_page_fault() -> direct_page_fault() -> try_async_pf() -> 
kvm_arch_setup_async_pf() -> kvm_setup_async_pf()

kvm_setup_async_pf() queues async_pf_execute() on a workqueue:
	INIT_WORK(&work->work, async_pf_execute);

Then, the following function call chain is executed:
async_pf_execute() -> get_user_pages_remote() -> 
__get_user_pages_remote() -> __get_user_pages_locked() -> __get_user_pages()

__get_user_pages() is called with mmap_lock held, and contains the 
following code:

retry:
	/*
	 * If we have a pending SIGKILL, don't keep faulting pages and
	 * potentially allocating memory.
	 */
	if (fatal_signal_pending(current)) {
		ret = -EINTR;
		goto out;
	}
	cond_resched();

	page = follow_page_mask(vma, start, foll_flags, &ctx);
	if (!page) {
		ret = faultin_page(vma, start, &foll_flags, locked);
		switch (ret) {
		case 0:
			goto retry;

When faultin_page() is called here it will in turn call the following 
chain of functions:

faultin_page() -> handle_mm_fault() -> __handle_mm_fault() -> 
handle_pte_fault() -> do_anonymous_page() -> handle_userfault()

The final handle_userfault() function is the function used by 
userfaultfd to handle the userfault. In this function we can find the 
following code:

	if (unlikely(READ_ONCE(ctx->released))) {
		/*
		 * Don't return VM_FAULT_SIGBUS in this case, so a non
		 * cooperative manager can close the uffd after the
		 * last UFFDIO_COPY, without risking to trigger an
		 * involuntary SIGBUS if the process was starting the
		 * userfaultfd while the userfaultfd was still armed
		 * (but after the last UFFDIO_COPY). If the uffd
		 * wasn't already closed when the userfault reached
		 * this point, that would normally be solved by
		 * userfaultfd_must_wait returning 'false'.
		 *
		 * If we were to return VM_FAULT_SIGBUS here, the non
		 * cooperative manager would be instead forced to
		 * always call UFFDIO_UNREGISTER before it can safely
		 * close the uffd.
		 */
		ret = VM_FAULT_NOPAGE;
		goto out;
	}

The problem is that once ctx->released has been set to true by 
userfaultfd_release() (called by Process 2), handle_userfault() returns 
VM_FAULT_NOPAGE due to the if statement above. handle_mm_fault() 
therefore returns VM_FAULT_NOPAGE to faultin_page(), which in turn 
returns 0. Back at the invocation of faultin_page() in 
__get_user_pages(), the "case 0:" statement sends execution back to the 
retry label. Since ctx->released never becomes false again, this loop 
continues forever: Process 1 is stuck calling faultin_page(), getting 0 
back, and retrying.

Given that Process 1 still holds the mmap_lock and will never release 
it, Process 2 also hangs in its call to mmap_write_lock(mm).

This results in both processes being stuck in a deadlock/livelock situation.
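The interaction can be condensed into a toy userspace model (plain C, 
not kernel code; all names are invented for illustration): once the 
released flag is set, the modeled GUP retry loop can never make progress.

```c
#include <stdbool.h>

/* Toy model of the stuck retry loop; all names are invented for
 * illustration and nothing here is real kernel code. */

/* Models faultin_page(): once ctx->released is set, handle_userfault()
 * yields VM_FAULT_NOPAGE and faultin_page() returns 0. */
static int model_faultin_page(bool released)
{
	return released ? 0 : 1;	/* 1 stands for "page installed" */
}

/* Models the "case 0: goto retry" loop in __get_user_pages(), bounded
 * so the model terminates. Returns the number of retries before
 * progress, or -1 if no progress is ever made (the livelock). */
static int model_gup_loop(bool released, int bound)
{
	for (int tries = 0; tries < bound; tries++) {
		if (model_faultin_page(released) != 0)
			return tries;	/* progress: loop exits */
		/* case 0: goto retry; -- spins while holding mmap_lock */
	}
	return -1;	/* released never goes back to false */
}
```

With released == false the loop exits immediately; with released == true 
it retries until the bound, mirroring Process 1 spinning under mmap_lock 
while Process 2 blocks on mmap_write_lock().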

Unfortunately, I have only limited knowledge of the mm kernel subsystem, 
so I cannot propose a solution myself, but I hope someone with kernel 
development experience can come up with a proper fix.

Thank you very much,
Best Regards,
Dimitris Siakavaras


* Re: Using userfaultfd with KVM's async page fault handling causes processes to hang waiting for mmap_lock to be released
  2023-07-18 14:33 Using userfaultfd with KVM's async page fault handling causes processes to hang waiting for mmap_lock to be released Dimitris Siakavaras
@ 2023-07-19 21:16 ` Axel Rasmussen
  2023-07-19 21:54   ` Axel Rasmussen
       [not found]   ` <20230720103534.312-1-hdanton@sina.com>
  0 siblings, 2 replies; 5+ messages in thread
From: Axel Rasmussen @ 2023-07-19 21:16 UTC (permalink / raw)
  To: Dimitris Siakavaras
  Cc: viro, linux-fsdevel, linux-kernel, Peter Xu, linux-mm, Axel Rasmussen

Thanks for the detailed report Dimitris! I've CCed the MM mailing list and some
folks who work on userfaultfd.

I took a look at this today, but I haven't quite come up with a solution.

I thought it might be as easy as changing userfaultfd_release() to set released
*after* taking the lock. But no such luck, the ordering is what it is to deal
with another subtle case:


	WRITE_ONCE(ctx->released, true);

	if (!mmget_not_zero(mm))
		goto wakeup;

	/*
	 * Flush page faults out of all CPUs. NOTE: all page faults
	 * must be retried without returning VM_FAULT_SIGBUS if
	 * userfaultfd_ctx_get() succeeds but vma->vma_userfault_ctx
	 * changes while handle_userfault released the mmap_lock. So
	 * it's critical that released is set to true (above), before
	 * taking the mmap_lock for writing.
	 */
	mmap_write_lock(mm);

I think perhaps the right thing to do is to have handle_userfault() release
mmap_lock when it returns VM_FAULT_NOPAGE, and to have GUP deal with that
appropriately? But, some investigation is required to be sure that's okay to do
in the other non-GUP ways we can end up in handle_userfault().


* Re: Using userfaultfd with KVM's async page fault handling causes processes to hang waiting for mmap_lock to be released
  2023-07-19 21:16 ` Axel Rasmussen
@ 2023-07-19 21:54   ` Axel Rasmussen
  2023-07-20 20:06     ` Peter Xu
       [not found]   ` <20230720103534.312-1-hdanton@sina.com>
  1 sibling, 1 reply; 5+ messages in thread
From: Axel Rasmussen @ 2023-07-19 21:54 UTC (permalink / raw)
  To: Dimitris Siakavaras; +Cc: viro, linux-fsdevel, linux-kernel, Peter Xu, linux-mm

On Wed, Jul 19, 2023 at 2:16 PM Axel Rasmussen <axelrasmussen@google.com> wrote:
>
> Thanks for the detailed report Dimitris! I've CCed the MM mailing list and some
> folks who work on userfaultfd.

Apologies, I should have quoted the original message for the others I
added to CC: https://lore.kernel.org/lkml/79375b71-db2e-3e66-346b-254c90d915e2@cslab.ece.ntua.gr/T/#u

>
> I took a look at this today, but I haven't quite come up with a solution.
>
> I thought it might be as easy as changing userfaultfd_release() to set released
> *after* taking the lock. But no such luck, the ordering is what it is to deal
> with another subtle case:
>
>
>         WRITE_ONCE(ctx->released, true);
>
>         if (!mmget_not_zero(mm))
>                 goto wakeup;
>
>         /*
>          * Flush page faults out of all CPUs. NOTE: all page faults
>          * must be retried without returning VM_FAULT_SIGBUS if
>          * userfaultfd_ctx_get() succeeds but vma->vma_userfault_ctx
>          * changes while handle_userfault released the mmap_lock. So
>          * it's critical that released is set to true (above), before
>          * taking the mmap_lock for writing.
>          */
>         mmap_write_lock(mm);
>
> I think perhaps the right thing to do is to have handle_userfault() release
> mmap_lock when it returns VM_FAULT_NOPAGE, and to have GUP deal with that
> appropriately? But, some investigation is required to be sure that's okay to do
> in the other non-GUP ways we can end up in handle_userfault().


* Re: Using userfaultfd with KVM's async page fault handling causes processes to hang waiting for mmap_lock to be released
  2023-07-19 21:54   ` Axel Rasmussen
@ 2023-07-20 20:06     ` Peter Xu
  0 siblings, 0 replies; 5+ messages in thread
From: Peter Xu @ 2023-07-20 20:06 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: Dimitris Siakavaras, viro, linux-fsdevel, linux-kernel, linux-mm

Hello, Axel, Dimitris,

On Wed, Jul 19, 2023 at 02:54:21PM -0700, Axel Rasmussen wrote:
> On Wed, Jul 19, 2023 at 2:16 PM Axel Rasmussen <axelrasmussen@google.com> wrote:
> >
> > Thanks for the detailed report Dimitris! I've CCed the MM mailing list and some
> > folks who work on userfaultfd.
> 
> Apologies, I should have quoted the original message for the others I
> added to CC: https://lore.kernel.org/lkml/79375b71-db2e-3e66-346b-254c90d915e2@cslab.ece.ntua.gr/T/#u
> 
> >
> > I took a look at this today, but I haven't quite come up with a solution.
> >
> > I thought it might be as easy as changing userfaultfd_release() to set released
> > *after* taking the lock. But no such luck, the ordering is what it is to deal
> > with another subtle case:
> >
> >
> >         WRITE_ONCE(ctx->released, true);
> >
> >         if (!mmget_not_zero(mm))
> >                 goto wakeup;
> >
> >         /*
> >          * Flush page faults out of all CPUs. NOTE: all page faults
> >          * must be retried without returning VM_FAULT_SIGBUS if
> >          * userfaultfd_ctx_get() succeeds but vma->vma_userfault_ctx
> >          * changes while handle_userfault released the mmap_lock. So
> >          * it's critical that released is set to true (above), before
> >          * taking the mmap_lock for writing.
> >          */
> >         mmap_write_lock(mm);
> >
> > I think perhaps the right thing to do is to have handle_userfault() release
> > mmap_lock when it returns VM_FAULT_NOPAGE, and to have GUP deal with that
> > appropriately? But, some investigation is required to be sure that's okay to do
> > in the other non-GUP ways we can end up in handle_userfault().

Heh, this is also what I thought after reading. :)

If we look at the very early commit from Andrea, it seems it would not hang
gup but just sigbus it (see the comment, which describes almost exactly what
Dimitris hit here):

commit 86039bd3b4e6a1129318cbfed4e0a6e001656635
Author: Andrea Arcangeli <aarcange@redhat.com>
Date:   Fri Sep 4 15:46:31 2015 -0700

    userfaultfd: add new syscall to provide memory externalization

+	/*
+	 * If it's already released don't get it. This avoids to loop
+	 * in __get_user_pages if userfaultfd_release waits on the
+	 * caller of handle_userfault to release the mmap_sem.
+	 */
+	if (unlikely(ACCESS_ONCE(ctx->released)))
+		return VM_FAULT_SIGBUS;
+

Then we switched over to the friendly way, assuming CRIU could close() the
uffd while the monitored process is running, in:

commit 656710a60e3693911bee3a355d2f2bbae3faba33
Author: Andrea Arcangeli <aarcange@redhat.com>
Date:   Fri Sep 8 16:12:42 2017 -0700

    userfaultfd: non-cooperative: closing the uffd without triggering SIGBUS

I had a feeling that after that we didn't test gup (I assume the normal page
fault path will still work).  Let me copy Mike too just in case he has
anything to say.  Pasting the thread again:

https://lore.kernel.org/lkml/79375b71-db2e-3e66-346b-254c90d915e2@cslab.ece.ntua.gr/T/#u

My understanding is that releasing the mmap lock here should work, but we need
to move the code a bit.  Dimitris, please feel free to try the patch
attached here if you want.  It's probably not a major use case of uffd over
kvm (IIUC unregister before close() will also work?), but if it's trivial
to fix we should probably fix it.
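For reference, the unregister-before-close() workaround alluded to above 
would look roughly like this on the manager side (a hypothetical helper; 
'uffd', 'start' and 'len' describe a region previously registered with 
UFFDIO_REGISTER):

```c
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper: detach the registered range before closing the
 * uffd, so no fault can reach handle_userfault() with ctx->released
 * already set. Returns -1 on failure. */
static int uffd_teardown(int uffd, unsigned long start, unsigned long len)
{
	struct uffdio_range range = { .start = start, .len = len };

	/* UFFDIO_UNREGISTER clears the vma's userfaultfd association, so
	 * later faults on the range never consult this context at all. */
	if (ioctl(uffd, UFFDIO_UNREGISTER, &range))
		return -1;

	/* userfaultfd_release() now finds no armed vmas to flush. */
	return close(uffd);
}
```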

Thanks,

===8<===
From 7e9ef050b487220463fa77a7aa97259ffe9bb15e Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Thu, 20 Jul 2023 15:33:55 -0400
Subject: [PATCH] mm/uffd: Fix release hang over GUP

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 fs/userfaultfd.c | 57 ++++++++++++++++++++++++++----------------------
 1 file changed, 31 insertions(+), 26 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index bbfaf5837a08..2358e6b00315 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -455,32 +455,6 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	if (!(vmf->flags & FAULT_FLAG_USER) && (ctx->flags & UFFD_USER_MODE_ONLY))
 		goto out;
 
-	/*
-	 * If it's already released don't get it. This avoids to loop
-	 * in __get_user_pages if userfaultfd_release waits on the
-	 * caller of handle_userfault to release the mmap_lock.
-	 */
-	if (unlikely(READ_ONCE(ctx->released))) {
-		/*
-		 * Don't return VM_FAULT_SIGBUS in this case, so a non
-		 * cooperative manager can close the uffd after the
-		 * last UFFDIO_COPY, without risking to trigger an
-		 * involuntary SIGBUS if the process was starting the
-		 * userfaultfd while the userfaultfd was still armed
-		 * (but after the last UFFDIO_COPY). If the uffd
-		 * wasn't already closed when the userfault reached
-		 * this point, that would normally be solved by
-		 * userfaultfd_must_wait returning 'false'.
-		 *
-		 * If we were to return VM_FAULT_SIGBUS here, the non
-		 * cooperative manager would be instead forced to
-		 * always call UFFDIO_UNREGISTER before it can safely
-		 * close the uffd.
-		 */
-		ret = VM_FAULT_NOPAGE;
-		goto out;
-	}
-
 	/*
 	 * Check that we can return VM_FAULT_RETRY.
 	 *
@@ -517,6 +491,37 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
 		goto out;
 
+	/*
+	 * If it's already released don't get it. This avoids to loop
+	 * in __get_user_pages if userfaultfd_release waits on the
+	 * caller of handle_userfault to release the mmap_lock.
+	 */
+	if (unlikely(READ_ONCE(ctx->released))) {
+		/*
+		 * Don't return VM_FAULT_SIGBUS in this case, so a non
+		 * cooperative manager can close the uffd after the
+		 * last UFFDIO_COPY, without risking to trigger an
+		 * involuntary SIGBUS if the process was starting the
+		 * userfaultfd while the userfaultfd was still armed
+		 * (but after the last UFFDIO_COPY). If the uffd
+		 * wasn't already closed when the userfault reached
+		 * this point, that would normally be solved by
+		 * userfaultfd_must_wait returning 'false'.
+		 *
+		 * If we were to return VM_FAULT_SIGBUS here, the non
+		 * cooperative manager would be instead forced to
+		 * always call UFFDIO_UNREGISTER before it can safely
+		 * close the uffd.
+		 *
+		 * We release the mmap lock in this special case, just in
+		 * case we're in a gup to not dead loop, so the other uffd
+		 * handler thread/process can have a chance to take the
+		 * write lock and do the unregistration.
+		 */
+		release_fault_lock(vmf);
+		goto out;
+	}
+
 	/* take the reference before dropping the mmap_lock */
 	userfaultfd_ctx_get(ctx);
 
-- 
2.41.0
===8<===

-- 
Peter Xu



* Re: Using userfaultfd with KVM's async page fault handling causes processes to hang waiting for mmap_lock to be released
       [not found]   ` <20230720103534.312-1-hdanton@sina.com>
@ 2023-07-20 20:07     ` Peter Xu
  0 siblings, 0 replies; 5+ messages in thread
From: Peter Xu @ 2023-07-20 20:07 UTC (permalink / raw)
  To: Hillf Danton; +Cc: Dimitris Siakavaras, Axel Rasmussen, linux-kernel, linux-mm

On Thu, Jul 20, 2023 at 06:35:34PM +0800, Hillf Danton wrote:
> On Wed, Jul 19, 2023 at 2:16 PM Axel Rasmussen <axelrasmussen@google.com> wrote:
> >
> > I think perhaps the right thing to do is to have handle_userfault() release
> > mmap_lock when it returns VM_FAULT_NOPAGE, and to have GUP deal with that
> > appropriately? But, some investigation is required to be sure that's okay to do
> > in the other non-GUP ways we can end up in handle_userfault().
> 
> See if making kworker special works.
> 
> --- x/fs/userfaultfd.c
> +++ y/fs/userfaultfd.c
> @@ -457,6 +457,8 @@ vm_fault_t handle_userfault(struct vm_fa
>  		 * close the uffd.
>  		 */
>  		ret = VM_FAULT_NOPAGE;
> +		if (current->flags & PF_WQ_WORKER)
> +			ret = VM_FAULT_OOM;
>  		goto out;
>  	}

Sorry this won't work - we need userfault to work with all forms of
kworkers, especially including kvm async pf. Thanks.

-- 
Peter Xu


