Using userfaultfd with KVM's async page fault handling causes processes to hang waiting for mmap_lock to be released

From: Dimitris Siakavaras <jimsiak@cslab.ece.ntua.gr>
To: viro@zeniv.linux.org.uk
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Using userfaultfd with KVM's async page fault handling causes processes to hang waiting for mmap_lock to be released
Date: Tue, 18 Jul 2023 17:33:12 +0300
Message-ID: <79375b71-db2e-3e66-346b-254c90d915e2@cslab.ece.ntua.gr>

Hi, this is my first bug report, so I apologise in advance for any
missing information or unclear explanations in this email. I am happy
to provide any further information or to revise the report as needed.

Problem: Using userfaultfd for a process that uses KVM and triggers
asynchronous page fault handling causes processes to hang forever.
Processor: AMD EPYC 7402 24-Core Processor
Kernel version: 5.13 (the problem also occurs on 6.4.3 and 6.5-rc2)

Unfortunately, my execution environment involves a fairly complex set of
components to set up, so it is not straightforward for me to share code
that reproduces the issue. I will instead try to explain the problem as
clearly as possible.

I have two processes:
1. A Firecracker VM process (https://firecracker-microvm.github.io/)
which uses KVM.
2. A second process that handles the userfaultfd page faults of the
Firecracker process.

The race condition involves the released field of the userfaultfd_ctx 
structure.
More specifically:

* Process 2 invokes the close() system call for the userfaultfd 
descriptor, thus triggering the execution of userfaultfd_release() in 
the kernel.
   userfaultfd_release() contains the following lines of code:

    WRITE_ONCE(ctx->released, true);

    if (!mmget_not_zero(mm))
        goto wakeup;

    /*
     * Flush page faults out of all CPUs. NOTE: all page faults
     * must be retried without returning VM_FAULT_SIGBUS if
     * userfaultfd_ctx_get() succeeds but vma->vma_userfault_ctx
     * changes while handle_userfault released the mmap_lock. So
     * it's critical that released is set to true (above), before
     * taking the mmap_lock for writing.
     */
    mmap_write_lock(mm);

* Process 1 takes a page fault while running inside KVM_ENTRY. This
triggers the execution of kvm_tdp_page_fault(), and the following
function call chain is executed:

kvm_tdp_page_fault() -> direct_page_fault() -> try_async_pf() -> 
kvm_arch_setup_async_pf() -> kvm_setup_async_pf()

kvm_setup_async_pf() queues async_pf_execute() as a work item on a workqueue:
     INIT_WORK(&work->work, async_pf_execute);

Then, the following function call chain is executed:
async_pf_execute() -> get_user_pages_remote() -> 
__get_user_pages_remote() -> __get_user_pages_locked() -> __get_user_pages()

__get_user_pages() is called with mmap_lock held and contains the
following code:
retry:
         /*
          * If we have a pending SIGKILL, don't keep faulting pages and
          * potentially allocating memory.
          */
         if (fatal_signal_pending(current)) {
             ret = -EINTR;
             goto out;
         }
         cond_resched();

         page = follow_page_mask(vma, start, foll_flags, &ctx);
         if (!page) {
             ret = faultin_page(vma, start, &foll_flags, locked);
             switch (ret) {
             case 0:
                 goto retry;

When faultin_page() is called here, it in turn executes the following
chain of function calls:

faultin_page() -> handle_mm_fault() -> __handle_mm_fault() ->
handle_pte_fault() -> do_anonymous_page() -> handle_userfault()

The last function in the chain, handle_userfault(), is the one
userfaultfd uses to handle the userfault. It contains the following
code:

if (unlikely(READ_ONCE(ctx->released))) {
         /*
          * Don't return VM_FAULT_SIGBUS in this case, so a non
          * cooperative manager can close the uffd after the
          * last UFFDIO_COPY, without risking to trigger an
          * involuntary SIGBUS if the process was starting the
          * userfaultfd while the userfaultfd was still armed
          * (but after the last UFFDIO_COPY). If the uffd
          * wasn't already closed when the userfault reached
          * this point, that would normally be solved by
          * userfaultfd_must_wait returning 'false'.
          *
          * If we were to return VM_FAULT_SIGBUS here, the non
          * cooperative manager would be instead forced to
          * always call UFFDIO_UNREGISTER before it can safely
          * close the uffd.
          */
         ret = VM_FAULT_NOPAGE;
         goto out;
}

The problem is that once userfaultfd_release(), called by Process 2,
has set ctx->released to true, handle_userfault() returns
VM_FAULT_NOPAGE because of the if statement above.
handle_mm_fault() therefore returns VM_FAULT_NOPAGE to faultin_page(),
which in turn returns 0.
Back in __get_user_pages(), the "case 0:" branch sends execution back
to the retry label. Since ctx->released is never reset to false, this
loop continues forever: Process 1 keeps calling faultin_page(), getting
0 as the return value, and going back to retry.
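
The retry behaviour described above can be illustrated with a minimal
user-space sketch in C. This is not kernel code: every model_* name and
the MODEL_FAULT_* values are hypothetical stand-ins, and the
non-released path is simplified to "the page becomes present" (the real
handle_userfault() blocks until the handler resolves the fault). The
retry budget exists only so that the sketch terminates.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for kernel fault codes; not the real values. */
enum model_fault { MODEL_FAULT_NOPAGE = 1, MODEL_FAULT_HANDLED = 2 };

struct model_ctx {
    bool released;      /* models userfaultfd_ctx->released */
    bool page_present;  /* models whether the page lookup would succeed */
};

/*
 * Models handle_userfault(): once released is set, it reports NOPAGE
 * without making the page present.
 */
static enum model_fault model_handle_userfault(struct model_ctx *ctx)
{
    if (ctx->released)
        return MODEL_FAULT_NOPAGE;
    ctx->page_present = true;  /* simplification: fault gets resolved */
    return MODEL_FAULT_HANDLED;
}

/*
 * Models faultin_page(): in this sketch both outcomes are reported to
 * the caller as 0, meaning "retry the page lookup".
 */
static int model_faultin_page(struct model_ctx *ctx)
{
    (void)model_handle_userfault(ctx);
    return 0;
}

/*
 * Models the retry loop in __get_user_pages(). Returns the number of
 * retries needed, or -1 if the page never appeared within max_retries.
 */
long model_gup_retries(struct model_ctx *ctx, long max_retries)
{
    long retries = 0;

    while (!ctx->page_present) {
        if (retries == max_retries)
            return -1;  /* the real loop has no such budget */
        if (model_faultin_page(ctx) == 0)
            retries++;
    }
    return retries;
}
```

With released == false the model finishes after one retry. With
released == true it exhausts whatever budget it is given, which
corresponds to the endless loop in the real code.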

Since Process 1 still holds mmap_lock and will never release it,
Process 2 also hangs in its call to mmap_write_lock(mm).

As a result, both processes are stuck in a deadlock/livelock situation.
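
The interleaving can also be written down as a small single-threaded
model in C. Again this is only a sketch under stated assumptions: the
model_* names are hypothetical, the lock is reduced to a reader count,
and the write-lock step merely reports whether the lock could be taken
(the real mmap_write_lock() would block instead of returning).

```c
#include <stdbool.h>

/* Hypothetical model state; not a kernel structure. */
struct model_state {
    bool released;  /* models userfaultfd_ctx->released */
    int  readers;   /* models holders of mmap_lock in read mode */
};

/* Fault path, step 1: take the lock for reading. */
void model_fault_begin(struct model_state *s)
{
    s->readers++;
}

/*
 * Fault path, step 2: try to finish the fault. If released is set, the
 * retry loop never exits, so the read lock is never dropped.
 * Returns true if the fault finished and the lock was dropped.
 */
bool model_fault_try_finish(struct model_state *s)
{
    if (s->released)
        return false;
    s->readers--;
    return true;
}

/* Close path, step 1: set released before touching the lock. */
void model_close_set_released(struct model_state *s)
{
    s->released = true;
}

/*
 * Close path, step 2: the write lock is only available when no reader
 * holds the lock. Returns true if it could be taken.
 */
bool model_close_try_write_lock(const struct model_state *s)
{
    return s->readers == 0;
}
```

Calling model_fault_begin(), then model_close_set_released(), leaves
both model_fault_try_finish() and model_close_try_write_lock()
returning false, which is the stuck state described above. If the fault
finishes before released is set, both steps succeed.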

Unfortunately, I have only limited knowledge of the mm kernel subsystem,
so I am not able to propose a solution, but I hope someone with more
experience in kernel development can come up with a proper fix.

Thank you very much,
Best Regards,
Dimitris Siakavaras