linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/4] KVM: Async #PF fixes and cleanups
@ 2024-01-10  1:15 Sean Christopherson
  2024-01-10  1:15 ` [PATCH 1/4] KVM: Always flush async #PF workqueue when vCPU is being destroyed Sean Christopherson
                   ` (4 more replies)
  0 siblings, 5 replies; 25+ messages in thread
From: Sean Christopherson @ 2024-01-10  1:15 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-kernel, David Matlack, Xu Yilun, Sean Christopherson

Fix a use-after-module-unload bug in the async #PF code by ensuring all
workqueue items fully complete before tearing down vCPUs.  Do a bit of
cleanup to try and make the code slightly more readable.

Side topic, I'm pretty sure s390's flic_set_attr() is broken/racy.  The async #PF
code assumes that only the vCPU can invoke
kvm_clear_async_pf_completion_queue(), as there are multiple assets that
are effectively protected by vcpu->mutex.  I don't think any real world VMMs
trigger the race(s), but AFAICT it's a bug.  I think/assume taking all
vCPUs' mutexes would plug the hole?
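
For illustration only (not part of this series; the helper name is made up, and
lockdep nesting annotations plus error handling are omitted), a rough sketch of
the "take all vCPUs' mutexes" idea on the s390 side:

	static void kvm_s390_clear_all_async_pf_queues(struct kvm *kvm)
	{
		struct kvm_vcpu *vcpu;
		unsigned long i;

		/* Hold every vCPU's mutex to serialize against vCPU ioctls. */
		kvm_for_each_vcpu(i, vcpu, kvm)
			mutex_lock(&vcpu->mutex);

		kvm_for_each_vcpu(i, vcpu, kvm)
			kvm_clear_async_pf_completion_queue(vcpu);

		kvm_for_each_vcpu(i, vcpu, kvm)
			mutex_unlock(&vcpu->mutex);
	}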

Sean Christopherson (4):
  KVM: Always flush async #PF workqueue when vCPU is being destroyed
  KVM: Put mm immediately after async #PF worker completes remote gup()
  KVM: Get reference to VM's address space in the async #PF worker
  KVM: Nullify async #PF worker's "apf" pointer as soon as it might be
    freed

 include/linux/kvm_host.h |  1 -
 virt/kvm/async_pf.c      | 79 ++++++++++++++++++++++++++++------------
 2 files changed, 55 insertions(+), 25 deletions(-)


base-commit: 1c6d984f523f67ecfad1083bb04c55d91977bb15
-- 
2.43.0.472.g3155946c3a-goog


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/4] KVM: Always flush async #PF workqueue when vCPU is being destroyed
  2024-01-10  1:15 [PATCH 0/4] KVM: Async #PF fixes and cleanups Sean Christopherson
@ 2024-01-10  1:15 ` Sean Christopherson
  2024-01-20 12:40   ` Xu Yilun
                     ` (2 more replies)
  2024-01-10  1:15 ` [PATCH 2/4] KVM: Put mm immediately after async #PF worker completes remote gup() Sean Christopherson
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 25+ messages in thread
From: Sean Christopherson @ 2024-01-10  1:15 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-kernel, David Matlack, Xu Yilun, Sean Christopherson

Always flush the per-vCPU async #PF workqueue when a vCPU is clearing its
completion queue, e.g. when a VM and all its vCPUs are being destroyed.
KVM must ensure that none of its workqueue callbacks is running when the
last reference to the KVM _module_ is put.  Gifting a reference to the
associated VM prevents the workqueue callback from dereferencing freed
vCPU/VM memory, but does not prevent the KVM module from being unloaded
before the callback completes.

Drop the misguided VM refcount gifting, as calling kvm_put_kvm() from
async_pf_execute() if kvm_put_kvm() flushes the async #PF workqueue will
result in deadlock.  async_pf_execute() can't return until kvm_put_kvm()
finishes, and kvm_put_kvm() can't return until async_pf_execute() finishes:

 WARNING: CPU: 8 PID: 251 at virt/kvm/kvm_main.c:1435 kvm_put_kvm+0x2d/0x320 [kvm]
 Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm irqbypass
 CPU: 8 PID: 251 Comm: kworker/8:1 Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
 Workqueue: events async_pf_execute [kvm]
 RIP: 0010:kvm_put_kvm+0x2d/0x320 [kvm]
 Call Trace:
  <TASK>
  async_pf_execute+0x198/0x260 [kvm]
  process_one_work+0x145/0x2d0
  worker_thread+0x27e/0x3a0
  kthread+0xba/0xe0
  ret_from_fork+0x2d/0x50
  ret_from_fork_asm+0x11/0x20
  </TASK>
 ---[ end trace 0000000000000000 ]---
 INFO: task kworker/8:1:251 blocked for more than 120 seconds.
       Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/8:1     state:D stack:0     pid:251   ppid:2      flags:0x00004000
 Workqueue: events async_pf_execute [kvm]
 Call Trace:
  <TASK>
  __schedule+0x33f/0xa40
  schedule+0x53/0xc0
  schedule_timeout+0x12a/0x140
  __wait_for_common+0x8d/0x1d0
  __flush_work.isra.0+0x19f/0x2c0
  kvm_clear_async_pf_completion_queue+0x129/0x190 [kvm]
  kvm_arch_destroy_vm+0x78/0x1b0 [kvm]
  kvm_put_kvm+0x1c1/0x320 [kvm]
  async_pf_execute+0x198/0x260 [kvm]
  process_one_work+0x145/0x2d0
  worker_thread+0x27e/0x3a0
  kthread+0xba/0xe0
  ret_from_fork+0x2d/0x50
  ret_from_fork_asm+0x11/0x20
  </TASK>

If kvm_clear_async_pf_completion_queue() actually flushes the workqueue,
then there's no need to gift async_pf_execute() a reference because all
invocations of async_pf_execute() will be forced to complete before the
vCPU and its VM are destroyed/freed.  And that in turn fixes the module
unloading bug as __fput() won't do module_put() on the last vCPU reference
until the vCPU has been freed, e.g. if closing the vCPU file also puts the
last reference to the KVM module.

Note that kvm_check_async_pf_completion() may also take the work item off
the completion queue and so also needs to flush the work queue, as the
work will not be seen by kvm_clear_async_pf_completion_queue().  Waiting
on the workqueue could theoretically delay a vCPU due to waiting for the
work to complete, but that's a very, very small chance, and likely a very
small delay.  kvm_arch_async_page_present_queued() unconditionally makes a
new request, i.e. will effectively delay entering the guest, so the
remaining work is really just:

        trace_kvm_async_pf_completed(addr, cr2_or_gpa);

        __kvm_vcpu_wake_up(vcpu);

        mmput(mm);

and mmput() can't drop the last reference to the page tables if the vCPU is
still alive, i.e. the vCPU won't get stuck tearing down page tables.

Add a helper to do the flushing, specifically to deal with "wakeup all"
work items, as they aren't actually work items, i.e. are never placed in a
workqueue.  Trying to flush a bogus workqueue entry rightly makes
__flush_work() complain (kudos to whoever added that sanity check).
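
For reference, the sanity check being referred to is (roughly) this, near the
top of __flush_work() in kernel/workqueue.c:

	if (WARN_ON(!work->func))
		return false;

A "wakeup all" item is never passed to INIT_WORK(), so its work.func is NULL
and flushing it would trip that warning.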

Note, commit 5f6de5cbebee ("KVM: Prevent module exit until all VMs are
freed") *tried* to fix the module refcounting issue by having VMs grab a
reference to the module, but that only made the bug slightly harder to hit
as it gave async_pf_execute() a bit more time to complete before the KVM
module could be unloaded.

Fixes: af585b921e5d ("KVM: Halt vcpu if page it tries to access is swapped out")
Cc: stable@vger.kernel.org
Cc: David Matlack <dmatlack@google.com>
Cc: Xu Yilun <yilun.xu@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/async_pf.c | 37 ++++++++++++++++++++++++++++++++-----
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index e033c79d528e..876927a558ad 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -87,7 +87,25 @@ static void async_pf_execute(struct work_struct *work)
 	__kvm_vcpu_wake_up(vcpu);
 
 	mmput(mm);
-	kvm_put_kvm(vcpu->kvm);
+}
+
+static void kvm_flush_and_free_async_pf_work(struct kvm_async_pf *work)
+{
+	/*
+	 * The async #PF is "done", but KVM must wait for the work item itself,
+	 * i.e. async_pf_execute(), to run to completion.  If KVM is a module,
+	 * KVM must ensure *no* code owned by the KVM (the module) can be run
+	 * after the last call to module_put(), i.e. after the last reference
+	 * to the last vCPU's file is put.
+	 *
+	 * Wake-all events skip the queue and go straight to done, i.e. don't
+	 * need to be flushed (but sanity check that the work wasn't queued).
+	 */
+	if (work->wakeup_all)
+		WARN_ON_ONCE(work->work.func);
+	else
+		flush_work(&work->work);
+	kmem_cache_free(async_pf_cache, work);
 }
 
 void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
@@ -114,7 +132,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
 #else
 		if (cancel_work_sync(&work->work)) {
 			mmput(work->mm);
-			kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
 			kmem_cache_free(async_pf_cache, work);
 		}
 #endif
@@ -126,7 +143,18 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
 			list_first_entry(&vcpu->async_pf.done,
 					 typeof(*work), link);
 		list_del(&work->link);
-		kmem_cache_free(async_pf_cache, work);
+
+		spin_unlock(&vcpu->async_pf.lock);
+
+		/*
+		 * The async #PF is "done", but KVM must wait for the work item
+		 * itself, i.e. async_pf_execute(), to run to completion.  If
+		 * KVM is a module, KVM must ensure *no* code owned by the KVM
+		 * (the module) can be run after the last call to module_put(),
+		 * i.e. after the last reference to the last vCPU's file is put.
+		 */
+		kvm_flush_and_free_async_pf_work(work);
+		spin_lock(&vcpu->async_pf.lock);
 	}
 	spin_unlock(&vcpu->async_pf.lock);
 
@@ -151,7 +179,7 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
 
 		list_del(&work->queue);
 		vcpu->async_pf.queued--;
-		kmem_cache_free(async_pf_cache, work);
+		kvm_flush_and_free_async_pf_work(work);
 	}
 }
 
@@ -186,7 +214,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	work->arch = *arch;
 	work->mm = current->mm;
 	mmget(work->mm);
-	kvm_get_kvm(work->vcpu->kvm);
 
 	INIT_WORK(&work->work, async_pf_execute);
 
-- 
2.43.0.472.g3155946c3a-goog


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 2/4] KVM: Put mm immediately after async #PF worker completes remote gup()
  2024-01-10  1:15 [PATCH 0/4] KVM: Async #PF fixes and cleanups Sean Christopherson
  2024-01-10  1:15 ` [PATCH 1/4] KVM: Always flush async #PF workqueue when vCPU is being destroyed Sean Christopherson
@ 2024-01-10  1:15 ` Sean Christopherson
  2024-01-20 15:24   ` Xu Yilun
  2024-01-26 16:23   ` Vitaly Kuznetsov
  2024-01-10  1:15 ` [PATCH 3/4] KVM: Get reference to VM's address space in the async #PF worker Sean Christopherson
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 25+ messages in thread
From: Sean Christopherson @ 2024-01-10  1:15 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-kernel, David Matlack, Xu Yilun, Sean Christopherson

Put the async #PF worker's reference to the VM's address space as soon as
the worker is done with the mm.  This will allow deferring getting the mm
reference to the worker itself without having to track whether or not
getting a reference succeeded.

Note, if the vCPU is still alive, there is no danger of the worker getting
stuck with tearing down the host page tables, as userspace also holds a
reference (obviously), i.e. there is no risk of delaying the page-present
notification due to triggering the slow path in mmput().
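
For context, mmput() (kernel/fork.c) is roughly:

	void mmput(struct mm_struct *mm)
	{
		might_sleep();

		/* Only dropping the final reference takes the slow path. */
		if (atomic_dec_and_test(&mm->mm_users))
			__mmput(mm);	/* exit_mmap(), page table teardown, etc. */
	}

i.e. as long as the vCPU task (userspace) still holds an mm_users reference,
the worker's mmput() is just an atomic decrement.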

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/async_pf.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index 876927a558ad..d5dc50318aa6 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -64,6 +64,7 @@ static void async_pf_execute(struct work_struct *work)
 	get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, &locked);
 	if (locked)
 		mmap_read_unlock(mm);
+	mmput(mm);
 
 	if (IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC))
 		kvm_arch_async_page_present(vcpu, apf);
@@ -85,8 +86,6 @@ static void async_pf_execute(struct work_struct *work)
 	trace_kvm_async_pf_completed(addr, cr2_or_gpa);
 
 	__kvm_vcpu_wake_up(vcpu);
-
-	mmput(mm);
 }
 
 static void kvm_flush_and_free_async_pf_work(struct kvm_async_pf *work)
-- 
2.43.0.472.g3155946c3a-goog


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 3/4] KVM: Get reference to VM's address space in the async #PF worker
  2024-01-10  1:15 [PATCH 0/4] KVM: Async #PF fixes and cleanups Sean Christopherson
  2024-01-10  1:15 ` [PATCH 1/4] KVM: Always flush async #PF workqueue when vCPU is being destroyed Sean Christopherson
  2024-01-10  1:15 ` [PATCH 2/4] KVM: Put mm immediately after async #PF worker completes remote gup() Sean Christopherson
@ 2024-01-10  1:15 ` Sean Christopherson
  2024-01-20 15:16   ` Xu Yilun
  2024-01-26 16:21   ` Vitaly Kuznetsov
  2024-01-10  1:15 ` [PATCH 4/4] KVM: Nullify async #PF worker's "apf" pointer as soon as it might be freed Sean Christopherson
  2024-02-06 21:36 ` [PATCH 0/4] KVM: Async #PF fixes and cleanups Sean Christopherson
  4 siblings, 2 replies; 25+ messages in thread
From: Sean Christopherson @ 2024-01-10  1:15 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-kernel, David Matlack, Xu Yilun, Sean Christopherson

Get a reference to the target VM's address space in async_pf_execute()
instead of gifting a reference from kvm_setup_async_pf().  Keeping the
address space alive just to service an async #PF is counter-productive,
i.e. if the process is exiting and all vCPUs are dead, then NOT doing
get_user_pages_remote() and freeing the address space asap is desirable.

Handling the mm reference entirely within async_pf_execute() also
simplifies the async #PF flows as a whole, e.g. it's not immediately
obvious when the worker task vs. the vCPU task is responsible for putting
the gifted mm reference.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 include/linux/kvm_host.h |  1 -
 virt/kvm/async_pf.c      | 32 ++++++++++++++++++--------------
 2 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7e7fd25b09b3..bbfefd7e612f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -238,7 +238,6 @@ struct kvm_async_pf {
 	struct list_head link;
 	struct list_head queue;
 	struct kvm_vcpu *vcpu;
-	struct mm_struct *mm;
 	gpa_t cr2_or_gpa;
 	unsigned long addr;
 	struct kvm_arch_async_pf arch;
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index d5dc50318aa6..c3f4f351a2ae 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -46,8 +46,8 @@ static void async_pf_execute(struct work_struct *work)
 {
 	struct kvm_async_pf *apf =
 		container_of(work, struct kvm_async_pf, work);
-	struct mm_struct *mm = apf->mm;
 	struct kvm_vcpu *vcpu = apf->vcpu;
+	struct mm_struct *mm = vcpu->kvm->mm;
 	unsigned long addr = apf->addr;
 	gpa_t cr2_or_gpa = apf->cr2_or_gpa;
 	int locked = 1;
@@ -56,16 +56,24 @@ static void async_pf_execute(struct work_struct *work)
 	might_sleep();
 
 	/*
-	 * This work is run asynchronously to the task which owns
-	 * mm and might be done in another context, so we must
-	 * access remotely.
+	 * Attempt to pin the VM's host address space, and simply skip gup() if
+	 * acquiring a pin fails, i.e. if the process is exiting.  Note, KVM
+	 * holds a reference to its associated mm_struct until the very end of
+	 * kvm_destroy_vm(), i.e. the struct itself won't be freed before this
+	 * work item is fully processed.
 	 */
-	mmap_read_lock(mm);
-	get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, &locked);
-	if (locked)
-		mmap_read_unlock(mm);
-	mmput(mm);
+	if (mmget_not_zero(mm)) {
+		mmap_read_lock(mm);
+		get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, &locked);
+		if (locked)
+			mmap_read_unlock(mm);
+		mmput(mm);
+	}
 
+	/*
+	 * Notify and kick the vCPU even if faulting in the page failed, e.g.
+	 * so that the vCPU can retry the fault synchronously.
+	 */
 	if (IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC))
 		kvm_arch_async_page_present(vcpu, apf);
 
@@ -129,10 +137,8 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
 #ifdef CONFIG_KVM_ASYNC_PF_SYNC
 		flush_work(&work->work);
 #else
-		if (cancel_work_sync(&work->work)) {
-			mmput(work->mm);
+		if (cancel_work_sync(&work->work))
 			kmem_cache_free(async_pf_cache, work);
-		}
 #endif
 		spin_lock(&vcpu->async_pf.lock);
 	}
@@ -211,8 +217,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	work->cr2_or_gpa = cr2_or_gpa;
 	work->addr = hva;
 	work->arch = *arch;
-	work->mm = current->mm;
-	mmget(work->mm);
 
 	INIT_WORK(&work->work, async_pf_execute);
 
-- 
2.43.0.472.g3155946c3a-goog


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 4/4] KVM: Nullify async #PF worker's "apf" pointer as soon as it might be freed
  2024-01-10  1:15 [PATCH 0/4] KVM: Async #PF fixes and cleanups Sean Christopherson
                   ` (2 preceding siblings ...)
  2024-01-10  1:15 ` [PATCH 3/4] KVM: Get reference to VM's address space in the async #PF worker Sean Christopherson
@ 2024-01-10  1:15 ` Sean Christopherson
  2024-01-20 15:24   ` Xu Yilun
  2024-01-26 16:30   ` Vitaly Kuznetsov
  2024-02-06 21:36 ` [PATCH 0/4] KVM: Async #PF fixes and cleanups Sean Christopherson
  4 siblings, 2 replies; 25+ messages in thread
From: Sean Christopherson @ 2024-01-10  1:15 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-kernel, David Matlack, Xu Yilun, Sean Christopherson

Nullify the async #PF worker's local "apf" pointer immediately after the
point where the structure can be freed by the vCPU.  The existing comment
is helpful, but easy to overlook as there is no associated code.

Update the comment to clarify that it can be freed as soon as the lock
is dropped, as "after this point" isn't strictly accurate, nor does it
help understand what prevents the structure from being freed earlier.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/async_pf.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index c3f4f351a2ae..1088c6628de9 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -83,13 +83,14 @@ static void async_pf_execute(struct work_struct *work)
 	apf->vcpu = NULL;
 	spin_unlock(&vcpu->async_pf.lock);
 
-	if (!IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC) && first)
-		kvm_arch_async_page_present_queued(vcpu);
-
 	/*
-	 * apf may be freed by kvm_check_async_pf_completion() after
-	 * this point
+	 * The apf struct may be freed by kvm_check_async_pf_completion() as soon
+	 * as the lock is dropped.  Nullify it to prevent improper usage.
 	 */
+	apf = NULL;
+
+	if (!IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC) && first)
+		kvm_arch_async_page_present_queued(vcpu);
 
 	trace_kvm_async_pf_completed(addr, cr2_or_gpa);
 
-- 
2.43.0.472.g3155946c3a-goog


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/4] KVM: Always flush async #PF workqueue when vCPU is being destroyed
  2024-01-10  1:15 ` [PATCH 1/4] KVM: Always flush async #PF workqueue when vCPU is being destroyed Sean Christopherson
@ 2024-01-20 12:40   ` Xu Yilun
  2024-01-24 19:04     ` Sean Christopherson
  2024-02-06 19:06     ` Sean Christopherson
  2024-01-26 16:51   ` Vitaly Kuznetsov
  2024-02-19 13:59   ` Xu Yilun
  2 siblings, 2 replies; 25+ messages in thread
From: Xu Yilun @ 2024-01-20 12:40 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel, David Matlack

On Tue, Jan 09, 2024 at 05:15:30PM -0800, Sean Christopherson wrote:
> Always flush the per-vCPU async #PF workqueue when a vCPU is clearing its
> completion queue, e.g. when a VM and all its vCPUs are being destroyed.
> KVM must ensure that none of its workqueue callbacks is running when the
> last reference to the KVM _module_ is put.  Gifting a reference to the
> associated VM prevents the workqueue callback from dereferencing freed
> vCPU/VM memory, but does not prevent the KVM module from being unloaded
> before the callback completes.
> 
> Drop the misguided VM refcount gifting, as calling kvm_put_kvm() from
> async_pf_execute() if kvm_put_kvm() flushes the async #PF workqueue will
> result in deadlock.  async_pf_execute() can't return until kvm_put_kvm()
> finishes, and kvm_put_kvm() can't return until async_pf_execute() finishes:
> 
>  WARNING: CPU: 8 PID: 251 at virt/kvm/kvm_main.c:1435 kvm_put_kvm+0x2d/0x320 [kvm]
>  Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm irqbypass
>  CPU: 8 PID: 251 Comm: kworker/8:1 Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>  Workqueue: events async_pf_execute [kvm]
>  RIP: 0010:kvm_put_kvm+0x2d/0x320 [kvm]
>  Call Trace:
>   <TASK>
>   async_pf_execute+0x198/0x260 [kvm]
>   process_one_work+0x145/0x2d0
>   worker_thread+0x27e/0x3a0
>   kthread+0xba/0xe0
>   ret_from_fork+0x2d/0x50
>   ret_from_fork_asm+0x11/0x20
>   </TASK>
>  ---[ end trace 0000000000000000 ]---
>  INFO: task kworker/8:1:251 blocked for more than 120 seconds.
>        Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>  task:kworker/8:1     state:D stack:0     pid:251   ppid:2      flags:0x00004000
>  Workqueue: events async_pf_execute [kvm]
>  Call Trace:
>   <TASK>
>   __schedule+0x33f/0xa40
>   schedule+0x53/0xc0
>   schedule_timeout+0x12a/0x140
>   __wait_for_common+0x8d/0x1d0
>   __flush_work.isra.0+0x19f/0x2c0
>   kvm_clear_async_pf_completion_queue+0x129/0x190 [kvm]
>   kvm_arch_destroy_vm+0x78/0x1b0 [kvm]
>   kvm_put_kvm+0x1c1/0x320 [kvm]
>   async_pf_execute+0x198/0x260 [kvm]
>   process_one_work+0x145/0x2d0
>   worker_thread+0x27e/0x3a0
>   kthread+0xba/0xe0
>   ret_from_fork+0x2d/0x50
>   ret_from_fork_asm+0x11/0x20
>   </TASK>
> 
> If kvm_clear_async_pf_completion_queue() actually flushes the workqueue,
> then there's no need to gift async_pf_execute() a reference because all
> invocations of async_pf_execute() will be forced to complete before the
> vCPU and its VM are destroyed/freed.  And that in turn fixes the module
> unloading bug as __fput() won't do module_put() on the last vCPU reference
> until the vCPU has been freed, e.g. if closing the vCPU file also puts the

I'm not sure why __fput() of vCPU fd should be mentioned here. I assume
we just need to say that vCPUs are freed before module_put(KVM the module)
in kvm_destroy_vm(), then the whole logic for module unloading fix is:

  1. All workqueue callbacks complete when kvm_clear_async_pf_completion_queue(vcpu)
  2. kvm_clear_async_pf_completion_queue(vcpu) must be executed before vCPU free.
  3. vCPUs must be freed before module_put(KVM the module).

  So all workqueue callbacks complete before module_put(KVM the module).


__fput() of vCPU fd is not the only trigger of kvm_destroy_vm(), that
makes me distracted from reason of the fix.

> last reference to the KVM module.
> 
> Note that kvm_check_async_pf_completion() may also take the work item off
> the completion queue and so also needs to flush the work queue, as the
> work will not be seen by kvm_clear_async_pf_completion_queue().  Waiting
> on the workqueue could theoretically delay a vCPU due to waiting for the
> work to complete, but that's a very, very small chance, and likely a very
> small delay.  kvm_arch_async_page_present_queued() unconditionally makes a
> new request, i.e. will effectively delay entering the guest, so the
> remaining work is really just:
> 
>         trace_kvm_async_pf_completed(addr, cr2_or_gpa);
> 
>         __kvm_vcpu_wake_up(vcpu);
> 
>         mmput(mm);
> 
> and mmput() can't drop the last reference to the page tables if the vCPU is
> still alive, i.e. the vCPU won't get stuck tearing down page tables.
> 
> Add a helper to do the flushing, specifically to deal with "wakeup all"
> work items, as they aren't actually work items, i.e. are never placed in a
> workqueue.  Trying to flush a bogus workqueue entry rightly makes
> __flush_work() complain (kudos to whoever added that sanity check).
> 
> Note, commit 5f6de5cbebee ("KVM: Prevent module exit until all VMs are
> freed") *tried* to fix the module refcounting issue by having VMs grab a
> reference to the module, but that only made the bug slightly harder to hit
> as it gave async_pf_execute() a bit more time to complete before the KVM
> module could be unloaded.
> 
> Fixes: af585b921e5d ("KVM: Halt vcpu if page it tries to access is swapped out")
> Cc: stable@vger.kernel.org
> Cc: David Matlack <dmatlack@google.com>
> Cc: Xu Yilun <yilun.xu@linux.intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  virt/kvm/async_pf.c | 37 ++++++++++++++++++++++++++++++++-----
>  1 file changed, 32 insertions(+), 5 deletions(-)
> 
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index e033c79d528e..876927a558ad 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -87,7 +87,25 @@ static void async_pf_execute(struct work_struct *work)
>  	__kvm_vcpu_wake_up(vcpu);
>  
>  	mmput(mm);
> -	kvm_put_kvm(vcpu->kvm);
> +}
> +
> +static void kvm_flush_and_free_async_pf_work(struct kvm_async_pf *work)
> +{
> +	/*
> +	 * The async #PF is "done", but KVM must wait for the work item itself,
> +	 * i.e. async_pf_execute(), to run to completion.  If KVM is a module,
> +	 * KVM must ensure *no* code owned by the KVM (the module) can be run
> +	 * after the last call to module_put(), i.e. after the last reference
> +	 * to the last vCPU's file is put.

Maybe drop the i.e? It is not exactly true, other components like VFIO
may also be the last one to put KVM reference?

> +	 *
> +	 * Wake-all events skip the queue and go straight to done, i.e. don't
> +	 * need to be flushed (but sanity check that the work wasn't queued).
> +	 */
> +	if (work->wakeup_all)
> +		WARN_ON_ONCE(work->work.func);
> +	else
> +		flush_work(&work->work);
> +	kmem_cache_free(async_pf_cache, work);
>  }
>  
>  void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> @@ -114,7 +132,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
>  #else
>  		if (cancel_work_sync(&work->work)) {
>  			mmput(work->mm);
> -			kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
>  			kmem_cache_free(async_pf_cache, work);
>  		}
>  #endif
> @@ -126,7 +143,18 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
>  			list_first_entry(&vcpu->async_pf.done,
>  					 typeof(*work), link);
>  		list_del(&work->link);
> -		kmem_cache_free(async_pf_cache, work);
> +
> +		spin_unlock(&vcpu->async_pf.lock);
> +
> +		/*
> +		 * The async #PF is "done", but KVM must wait for the work item
> +		 * itself, i.e. async_pf_execute(), to run to completion.  If
> +		 * KVM is a module, KVM must ensure *no* code owned by the KVM
> +		 * (the module) can be run after the last call to module_put(),
> +		 * i.e. after the last reference to the last vCPU's file is put.
> +		 */
> +		kvm_flush_and_free_async_pf_work(work);
> +		spin_lock(&vcpu->async_pf.lock);
>  	}
>  	spin_unlock(&vcpu->async_pf.lock);
>  
> @@ -151,7 +179,7 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
>  
>  		list_del(&work->queue);
>  		vcpu->async_pf.queued--;
> -		kmem_cache_free(async_pf_cache, work);
> +		kvm_flush_and_free_async_pf_work(work);
>  	}
>  }
>  
> @@ -186,7 +214,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  	work->arch = *arch;
>  	work->mm = current->mm;
>  	mmget(work->mm);
> -	kvm_get_kvm(work->vcpu->kvm);
>  
>  	INIT_WORK(&work->work, async_pf_execute);

Overall LGTM, Reviewed-by: Xu Yilun <yilun.xu@intel.com>

>  
> -- 
> 2.43.0.472.g3155946c3a-goog
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/4] KVM: Get reference to VM's address space in the async #PF worker
  2024-01-10  1:15 ` [PATCH 3/4] KVM: Get reference to VM's address space in the async #PF worker Sean Christopherson
@ 2024-01-20 15:16   ` Xu Yilun
  2024-01-24 18:52     ` Sean Christopherson
  2024-01-26 16:21   ` Vitaly Kuznetsov
  1 sibling, 1 reply; 25+ messages in thread
From: Xu Yilun @ 2024-01-20 15:16 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel, David Matlack

On Tue, Jan 09, 2024 at 05:15:32PM -0800, Sean Christopherson wrote:
> Get a reference to the target VM's address space in async_pf_execute()
> instead of gifting a reference from kvm_setup_async_pf().  Keeping the
> address space alive just to service an async #PF is counter-productive,
> i.e. if the process is exiting and all vCPUs are dead, then NOT doing
> get_user_pages_remote() and freeing the address space asap is desirable.
> 
> Handling the mm reference entirely within async_pf_execute() also
> simplifies the async #PF flows as a whole, e.g. it's not immediately
> obvious when the worker task vs. the vCPU task is responsible for putting
> the gifted mm reference.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  include/linux/kvm_host.h |  1 -
>  virt/kvm/async_pf.c      | 32 ++++++++++++++++++--------------
>  2 files changed, 18 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 7e7fd25b09b3..bbfefd7e612f 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -238,7 +238,6 @@ struct kvm_async_pf {
>  	struct list_head link;
>  	struct list_head queue;
>  	struct kvm_vcpu *vcpu;
> -	struct mm_struct *mm;
>  	gpa_t cr2_or_gpa;
>  	unsigned long addr;
>  	struct kvm_arch_async_pf arch;
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index d5dc50318aa6..c3f4f351a2ae 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -46,8 +46,8 @@ static void async_pf_execute(struct work_struct *work)
>  {
>  	struct kvm_async_pf *apf =
>  		container_of(work, struct kvm_async_pf, work);
> -	struct mm_struct *mm = apf->mm;
>  	struct kvm_vcpu *vcpu = apf->vcpu;
> +	struct mm_struct *mm = vcpu->kvm->mm;
>  	unsigned long addr = apf->addr;
>  	gpa_t cr2_or_gpa = apf->cr2_or_gpa;
>  	int locked = 1;
> @@ -56,16 +56,24 @@ static void async_pf_execute(struct work_struct *work)
>  	might_sleep();
>  
>  	/*
> -	 * This work is run asynchronously to the task which owns
> -	 * mm and might be done in another context, so we must
> -	 * access remotely.
> +	 * Attempt to pin the VM's host address space, and simply skip gup() if
> +	 * acquiring a pin fails, i.e. if the process is exiting.  Note, KVM
> +	 * holds a reference to its associated mm_struct until the very end of
> +	 * kvm_destroy_vm(), i.e. the struct itself won't be freed before this
> +	 * work item is fully processed.
>  	 */
> -	mmap_read_lock(mm);
> -	get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, &locked);
> -	if (locked)
> -		mmap_read_unlock(mm);
> -	mmput(mm);
> +	if (mmget_not_zero(mm)) {
> +		mmap_read_lock(mm);
> +		get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, &locked);
> +		if (locked)
> +			mmap_read_unlock(mm);
> +		mmput(mm);
> +	}
>  
> +	/*
> +	 * Notify and kick the vCPU even if faulting in the page failed, e.g.

How about when the process is exiting? Could we just skip the following?

Thanks,
Yilun

> +	 * so that the vCPU can retry the fault synchronously.
> +	 */
>  	if (IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC))
>  		kvm_arch_async_page_present(vcpu, apf);
>  
> @@ -129,10 +137,8 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
>  #ifdef CONFIG_KVM_ASYNC_PF_SYNC
>  		flush_work(&work->work);
>  #else
> -		if (cancel_work_sync(&work->work)) {
> -			mmput(work->mm);
> +		if (cancel_work_sync(&work->work))
>  			kmem_cache_free(async_pf_cache, work);
> -		}
>  #endif
>  		spin_lock(&vcpu->async_pf.lock);
>  	}
> @@ -211,8 +217,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  	work->cr2_or_gpa = cr2_or_gpa;
>  	work->addr = hva;
>  	work->arch = *arch;
> -	work->mm = current->mm;
> -	mmget(work->mm);
>  
>  	INIT_WORK(&work->work, async_pf_execute);
>  
> -- 
> 2.43.0.472.g3155946c3a-goog
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/4] KVM: Put mm immediately after async #PF worker completes remote gup()
  2024-01-10  1:15 ` [PATCH 2/4] KVM: Put mm immediately after async #PF worker completes remote gup() Sean Christopherson
@ 2024-01-20 15:24   ` Xu Yilun
  2024-01-26 16:23   ` Vitaly Kuznetsov
  1 sibling, 0 replies; 25+ messages in thread
From: Xu Yilun @ 2024-01-20 15:24 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel, David Matlack

On Tue, Jan 09, 2024 at 05:15:31PM -0800, Sean Christopherson wrote:
> Put the async #PF worker's reference to the VM's address space as soon as
> the worker is done with the mm.  This will allow deferring getting the mm
> reference to the worker itself without having to track whether or not
> getting a reference succeeded.
> 
> Note, if the vCPU is still alive, there is no danger of the worker getting
> stuck with tearing down the host page tables, as userspace also holds a
> reference (obviously), i.e. there is no risk of delaying the page-present
> notification due to triggering the slow path in mmput().
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Xu Yilun <yilun.xu@intel.com>

> ---
>  virt/kvm/async_pf.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index 876927a558ad..d5dc50318aa6 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -64,6 +64,7 @@ static void async_pf_execute(struct work_struct *work)
>  	get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, &locked);
>  	if (locked)
>  		mmap_read_unlock(mm);
> +	mmput(mm);
>  
>  	if (IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC))
>  		kvm_arch_async_page_present(vcpu, apf);
> @@ -85,8 +86,6 @@ static void async_pf_execute(struct work_struct *work)
>  	trace_kvm_async_pf_completed(addr, cr2_or_gpa);
>  
>  	__kvm_vcpu_wake_up(vcpu);
> -
> -	mmput(mm);
>  }
>  
>  static void kvm_flush_and_free_async_pf_work(struct kvm_async_pf *work)
> -- 
> 2.43.0.472.g3155946c3a-goog
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 4/4] KVM: Nullify async #PF worker's "apf" pointer as soon as it might be freed
  2024-01-10  1:15 ` [PATCH 4/4] KVM: Nullify async #PF worker's "apf" pointer as soon as it might be freed Sean Christopherson
@ 2024-01-20 15:24   ` Xu Yilun
  2024-01-26 16:30   ` Vitaly Kuznetsov
  1 sibling, 0 replies; 25+ messages in thread
From: Xu Yilun @ 2024-01-20 15:24 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel, David Matlack

On Tue, Jan 09, 2024 at 05:15:33PM -0800, Sean Christopherson wrote:
> Nullify the async #PF worker's local "apf" pointer immediately after the
> point where the structure can be freed by the vCPU.  The existing comment
> is helpful, but easy to overlook as there is no associated code.
> 
> Update the comment to clarify that it can be freed as soon as the lock
> is dropped, as "after this point" isn't strictly accurate, nor does it
> help understand what prevents the structure from being freed earlier.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Xu Yilun <yilun.xu@intel.com>

> ---
>  virt/kvm/async_pf.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index c3f4f351a2ae..1088c6628de9 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -83,13 +83,14 @@ static void async_pf_execute(struct work_struct *work)
>  	apf->vcpu = NULL;
>  	spin_unlock(&vcpu->async_pf.lock);
>  
> -	if (!IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC) && first)
> -		kvm_arch_async_page_present_queued(vcpu);
> -
>  	/*
> -	 * apf may be freed by kvm_check_async_pf_completion() after
> -	 * this point
> +	 * The apf struct may be freed by kvm_check_async_pf_completion() as soon
> +	 * as the lock is dropped.  Nullify it to prevent improper usage.
>  	 */
> +	apf = NULL;
> +
> +	if (!IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC) && first)
> +		kvm_arch_async_page_present_queued(vcpu);
>  
>  	trace_kvm_async_pf_completed(addr, cr2_or_gpa);
>  
> -- 
> 2.43.0.472.g3155946c3a-goog
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/4] KVM: Get reference to VM's address space in the async #PF worker
  2024-01-20 15:16   ` Xu Yilun
@ 2024-01-24 18:52     ` Sean Christopherson
  2024-01-26  8:06       ` Xu Yilun
  0 siblings, 1 reply; 25+ messages in thread
From: Sean Christopherson @ 2024-01-24 18:52 UTC (permalink / raw)
  To: Xu Yilun; +Cc: Paolo Bonzini, kvm, linux-kernel, David Matlack

On Sat, Jan 20, 2024, Xu Yilun wrote:
> On Tue, Jan 09, 2024 at 05:15:32PM -0800, Sean Christopherson wrote:
> > Get a reference to the target VM's address space in async_pf_execute()
> > instead of gifting a reference from kvm_setup_async_pf().  Keeping the
> > address space alive just to service an async #PF is counter-productive,
> > i.e. if the process is exiting and all vCPUs are dead, then NOT doing
> > get_user_pages_remote() and freeing the address space asap is desirable.
> > 
> > Handling the mm reference entirely within async_pf_execute() also
> > simplifies the async #PF flows as a whole, e.g. it's not immediately
> > obvious when the worker task vs. the vCPU task is responsible for putting
> > the gifted mm reference.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  include/linux/kvm_host.h |  1 -
> >  virt/kvm/async_pf.c      | 32 ++++++++++++++++++--------------
> >  2 files changed, 18 insertions(+), 15 deletions(-)
> > 
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 7e7fd25b09b3..bbfefd7e612f 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -238,7 +238,6 @@ struct kvm_async_pf {
> >  	struct list_head link;
> >  	struct list_head queue;
> >  	struct kvm_vcpu *vcpu;
> > -	struct mm_struct *mm;
> >  	gpa_t cr2_or_gpa;
> >  	unsigned long addr;
> >  	struct kvm_arch_async_pf arch;
> > diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> > index d5dc50318aa6..c3f4f351a2ae 100644
> > --- a/virt/kvm/async_pf.c
> > +++ b/virt/kvm/async_pf.c
> > @@ -46,8 +46,8 @@ static void async_pf_execute(struct work_struct *work)
> >  {
> >  	struct kvm_async_pf *apf =
> >  		container_of(work, struct kvm_async_pf, work);
> > -	struct mm_struct *mm = apf->mm;
> >  	struct kvm_vcpu *vcpu = apf->vcpu;
> > +	struct mm_struct *mm = vcpu->kvm->mm;
> >  	unsigned long addr = apf->addr;
> >  	gpa_t cr2_or_gpa = apf->cr2_or_gpa;
> >  	int locked = 1;
> > @@ -56,16 +56,24 @@ static void async_pf_execute(struct work_struct *work)
> >  	might_sleep();
> >  
> >  	/*
> > -	 * This work is run asynchronously to the task which owns
> > -	 * mm and might be done in another context, so we must
> > -	 * access remotely.
> > +	 * Attempt to pin the VM's host address space, and simply skip gup() if
> > +	 * acquiring a pin fails, i.e. if the process is exiting.  Note, KVM
> > +	 * holds a reference to its associated mm_struct until the very end of
> > +	 * kvm_destroy_vm(), i.e. the struct itself won't be freed before this
> > +	 * work item is fully processed.
> >  	 */
> > -	mmap_read_lock(mm);
> > -	get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, &locked);
> > -	if (locked)
> > -		mmap_read_unlock(mm);
> > -	mmput(mm);
> > +	if (mmget_not_zero(mm)) {
> > +		mmap_read_lock(mm);
> > +		get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, &locked);
> > +		if (locked)
> > +			mmap_read_unlock(mm);
> > +		mmput(mm);
> > +	}
> >  
> > +	/*
> > +	 * Notify and kick the vCPU even if faulting in the page failed, e.g.
> 
> How about when the process is exiting? Could we just skip the following?

Maybe?  I'm not opposed to trimming this down even more, but I doubt it will make
much of a difference.  The vCPU can't be running so async_pf.lock shouldn't be
contended, no IPIs will be issued for kicks, etc.  So for this patch at least,
I want to take the most conservative approach while still cleaning up the mm_struct
usage.

> > +	 * so that the vCPU can retry the fault synchronously.
> > +	 */
> >  	if (IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC))
> >  		kvm_arch_async_page_present(vcpu, apf);
> >  
> > @@ -129,10 +137,8 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> >  #ifdef CONFIG_KVM_ASYNC_PF_SYNC
> >  		flush_work(&work->work);
> >  #else
> > -		if (cancel_work_sync(&work->work)) {
> > -			mmput(work->mm);
> > +		if (cancel_work_sync(&work->work))
> >  			kmem_cache_free(async_pf_cache, work);
> > -		}
> >  #endif
> >  		spin_lock(&vcpu->async_pf.lock);
> >  	}
> > @@ -211,8 +217,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >  	work->cr2_or_gpa = cr2_or_gpa;
> >  	work->addr = hva;
> >  	work->arch = *arch;
> > -	work->mm = current->mm;
> > -	mmget(work->mm);
> >  
> >  	INIT_WORK(&work->work, async_pf_execute);
> >  
> > -- 
> > 2.43.0.472.g3155946c3a-goog
> > 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/4] KVM: Always flush async #PF workqueue when vCPU is being destroyed
  2024-01-20 12:40   ` Xu Yilun
@ 2024-01-24 19:04     ` Sean Christopherson
  2024-01-26  7:36       ` Xu Yilun
  2024-02-06 19:06     ` Sean Christopherson
  1 sibling, 1 reply; 25+ messages in thread
From: Sean Christopherson @ 2024-01-24 19:04 UTC (permalink / raw)
  To: Xu Yilun; +Cc: Paolo Bonzini, kvm, linux-kernel, David Matlack

On Sat, Jan 20, 2024, Xu Yilun wrote:
> On Tue, Jan 09, 2024 at 05:15:30PM -0800, Sean Christopherson wrote:
> > Always flush the per-vCPU async #PF workqueue when a vCPU is clearing its
> > completion queue, e.g. when a VM and all its vCPUs are being destroyed.
> > KVM must ensure that none of its workqueue callbacks is running when the
> > last reference to the KVM _module_ is put.  Gifting a reference to the
> > associated VM prevents the workqueue callback from dereferencing freed
> > vCPU/VM memory, but does not prevent the KVM module from being unloaded
> > before the callback completes.
> > 
> > Drop the misguided VM refcount gifting, as calling kvm_put_kvm() from
> > async_pf_execute() if kvm_put_kvm() flushes the async #PF workqueue will
> > result in deadlock.  async_pf_execute() can't return until kvm_put_kvm()
> > finishes, and kvm_put_kvm() can't return until async_pf_execute() finishes:
> > 
> >  WARNING: CPU: 8 PID: 251 at virt/kvm/kvm_main.c:1435 kvm_put_kvm+0x2d/0x320 [kvm]
> >  Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm irqbypass
> >  CPU: 8 PID: 251 Comm: kworker/8:1 Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
> >  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> >  Workqueue: events async_pf_execute [kvm]
> >  RIP: 0010:kvm_put_kvm+0x2d/0x320 [kvm]
> >  Call Trace:
> >   <TASK>
> >   async_pf_execute+0x198/0x260 [kvm]
> >   process_one_work+0x145/0x2d0
> >   worker_thread+0x27e/0x3a0
> >   kthread+0xba/0xe0
> >   ret_from_fork+0x2d/0x50
> >   ret_from_fork_asm+0x11/0x20
> >   </TASK>
> >  ---[ end trace 0000000000000000 ]---
> >  INFO: task kworker/8:1:251 blocked for more than 120 seconds.
> >        Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
> >  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >  task:kworker/8:1     state:D stack:0     pid:251   ppid:2      flags:0x00004000
> >  Workqueue: events async_pf_execute [kvm]
> >  Call Trace:
> >   <TASK>
> >   __schedule+0x33f/0xa40
> >   schedule+0x53/0xc0
> >   schedule_timeout+0x12a/0x140
> >   __wait_for_common+0x8d/0x1d0
> >   __flush_work.isra.0+0x19f/0x2c0
> >   kvm_clear_async_pf_completion_queue+0x129/0x190 [kvm]
> >   kvm_arch_destroy_vm+0x78/0x1b0 [kvm]
> >   kvm_put_kvm+0x1c1/0x320 [kvm]
> >   async_pf_execute+0x198/0x260 [kvm]
> >   process_one_work+0x145/0x2d0
> >   worker_thread+0x27e/0x3a0
> >   kthread+0xba/0xe0
> >   ret_from_fork+0x2d/0x50
> >   ret_from_fork_asm+0x11/0x20
> >   </TASK>
> > 
> > If kvm_clear_async_pf_completion_queue() actually flushes the workqueue,
> > then there's no need to gift async_pf_execute() a reference because all
> > invocations of async_pf_execute() will be forced to complete before the
> > vCPU and its VM are destroyed/freed.  And that in turn fixes the module
> > unloading bug as __fput() won't do module_put() on the last vCPU reference
> > until the vCPU has been freed, e.g. if closing the vCPU file also puts the
> 
> I'm not sure why __fput() of vCPU fd should be mentioned here. I assume
> we just need to say that vCPUs are freed before module_put(KVM the module)
> in kvm_destroy_vm(), then the whole logic for module unloading fix is:
> 
>   1. All workqueue callbacks complete when kvm_clear_async_pf_completion_queue(vcpu)
>   2. kvm_clear_async_pf_completion_queue(vcpu) must be executed before vCPU free.
>   3. vCPUs must be freed before module_put(KVM the module).
> 
>   So all workqueue callbacks complete before module_put(KVM the module).
> 
> 
> __fput() of vCPU fd is not the only trigger of kvm_destroy_vm(), that
> makes me distracted from reason of the fix.

My goal was to call out that (a) the vCPU file descriptor is what ensures kvm.ko
is alive at this point and (b) that __fput() very deliberately ensures module_put()
is called after all module function callbacks/hooks complete, as there was quite
a bit of confusion around who/what can safely do module_put().
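
For reference, a simplified sketch of the relevant part of __fput()
(fs/file_table.c, unrelated work elided): the ->release() callback, and thus
kvm_vcpu_release() -> kvm_put_kvm(), runs before the module reference pinned
via the fops' .owner is dropped.

	static void __fput(struct file *file)
	{
		struct inode *inode = file->f_inode;

		/* ... */
		if (file->f_op->release)
			file->f_op->release(inode, file);	/* kvm_vcpu_release() */
		/* ... */
		fops_put(file->f_op);	/* module_put(file->f_op->owner) */
		/* ... */
	}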

> > last reference to the KVM module.
> > 
> > Note that kvm_check_async_pf_completion() may also take the work item off
> > the completion queue and so also needs to flush the work queue, as the
> > work will not be seen by kvm_clear_async_pf_completion_queue().  Waiting
> > on the workqueue could theoretically delay a vCPU due to waiting for the
> > work to complete, but that's a very, very small chance, and likely a very
> > small delay.  kvm_arch_async_page_present_queued() unconditionally makes a
> > new request, i.e. will effectively delay entering the guest, so the
> > remaining work is really just:
> > 
> >         trace_kvm_async_pf_completed(addr, cr2_or_gpa);
> > 
> >         __kvm_vcpu_wake_up(vcpu);
> > 
> >         mmput(mm);
> > 
> > and mmput() can't drop the last reference to the page tables if the vCPU is
> > still alive, i.e. the vCPU won't get stuck tearing down page tables.
> > 
> > Add a helper to do the flushing, specifically to deal with "wakeup all"
> > work items, as they aren't actually work items, i.e. are never placed in a
> > workqueue.  Trying to flush a bogus workqueue entry rightly makes
> > __flush_work() complain (kudos to whoever added that sanity check).
> > 
> > Note, commit 5f6de5cbebee ("KVM: Prevent module exit until all VMs are
> > freed") *tried* to fix the module refcounting issue by having VMs grab a
> > reference to the module, but that only made the bug slightly harder to hit
> > as it gave async_pf_execute() a bit more time to complete before the KVM
> > module could be unloaded.
> > 
> > Fixes: af585b921e5d ("KVM: Halt vcpu if page it tries to access is swapped out")
> > Cc: stable@vger.kernel.org
> > Cc: David Matlack <dmatlack@google.com>
> > Cc: Xu Yilun <yilun.xu@linux.intel.com>
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  virt/kvm/async_pf.c | 37 ++++++++++++++++++++++++++++++++-----
> >  1 file changed, 32 insertions(+), 5 deletions(-)
> > 
> > diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> > index e033c79d528e..876927a558ad 100644
> > --- a/virt/kvm/async_pf.c
> > +++ b/virt/kvm/async_pf.c
> > @@ -87,7 +87,25 @@ static void async_pf_execute(struct work_struct *work)
> >  	__kvm_vcpu_wake_up(vcpu);
> >  
> >  	mmput(mm);
> > -	kvm_put_kvm(vcpu->kvm);
> > +}
> > +
> > +static void kvm_flush_and_free_async_pf_work(struct kvm_async_pf *work)
> > +{
> > +	/*
> > +	 * The async #PF is "done", but KVM must wait for the work item itself,
> > +	 * i.e. async_pf_execute(), to run to completion.  If KVM is a module,
> > +	 * KVM must ensure *no* code owned by the KVM (the module) can be run
> > +	 * after the last call to module_put(), i.e. after the last reference
> > +	 * to the last vCPU's file is put.
> 
> Maybe drop the i.e? It is not exactly true, other components like VFIO
> may also be the last one to put KVM reference?

Ah, yeah, agreed.  I'll drop that last snippet, it doesn't add much value.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/4] KVM: Always flush async #PF workqueue when vCPU is being destroyed
  2024-01-24 19:04     ` Sean Christopherson
@ 2024-01-26  7:36       ` Xu Yilun
  0 siblings, 0 replies; 25+ messages in thread
From: Xu Yilun @ 2024-01-26  7:36 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel, David Matlack

On Wed, Jan 24, 2024 at 11:04:55AM -0800, Sean Christopherson wrote:
> On Sat, Jan 20, 2024, Xu Yilun wrote:
> > On Tue, Jan 09, 2024 at 05:15:30PM -0800, Sean Christopherson wrote:
> > > Always flush the per-vCPU async #PF workqueue when a vCPU is clearing its
> > > completion queue, e.g. when a VM and all its vCPUs are being destroyed.
> > > KVM must ensure that none of its workqueue callbacks is running when the
> > > last reference to the KVM _module_ is put.  Gifting a reference to the
> > > associated VM prevents the workqueue callback from dereferencing freed
> > > vCPU/VM memory, but does not prevent the KVM module from being unloaded
> > > before the callback completes.
> > > 
> > > Drop the misguided VM refcount gifting, as calling kvm_put_kvm() from
> > > async_pf_execute() if kvm_put_kvm() flushes the async #PF workqueue will
> > > result in deadlock.  async_pf_execute() can't return until kvm_put_kvm()
> > > finishes, and kvm_put_kvm() can't return until async_pf_execute() finishes:
> > > 
> > >  WARNING: CPU: 8 PID: 251 at virt/kvm/kvm_main.c:1435 kvm_put_kvm+0x2d/0x320 [kvm]
> > >  Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm irqbypass
> > >  CPU: 8 PID: 251 Comm: kworker/8:1 Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
> > >  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> > >  Workqueue: events async_pf_execute [kvm]
> > >  RIP: 0010:kvm_put_kvm+0x2d/0x320 [kvm]
> > >  Call Trace:
> > >   <TASK>
> > >   async_pf_execute+0x198/0x260 [kvm]
> > >   process_one_work+0x145/0x2d0
> > >   worker_thread+0x27e/0x3a0
> > >   kthread+0xba/0xe0
> > >   ret_from_fork+0x2d/0x50
> > >   ret_from_fork_asm+0x11/0x20
> > >   </TASK>
> > >  ---[ end trace 0000000000000000 ]---
> > >  INFO: task kworker/8:1:251 blocked for more than 120 seconds.
> > >        Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
> > >  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > >  task:kworker/8:1     state:D stack:0     pid:251   ppid:2      flags:0x00004000
> > >  Workqueue: events async_pf_execute [kvm]
> > >  Call Trace:
> > >   <TASK>
> > >   __schedule+0x33f/0xa40
> > >   schedule+0x53/0xc0
> > >   schedule_timeout+0x12a/0x140
> > >   __wait_for_common+0x8d/0x1d0
> > >   __flush_work.isra.0+0x19f/0x2c0
> > >   kvm_clear_async_pf_completion_queue+0x129/0x190 [kvm]
> > >   kvm_arch_destroy_vm+0x78/0x1b0 [kvm]
> > >   kvm_put_kvm+0x1c1/0x320 [kvm]
> > >   async_pf_execute+0x198/0x260 [kvm]
> > >   process_one_work+0x145/0x2d0
> > >   worker_thread+0x27e/0x3a0
> > >   kthread+0xba/0xe0
> > >   ret_from_fork+0x2d/0x50
> > >   ret_from_fork_asm+0x11/0x20
> > >   </TASK>
> > > 
> > > If kvm_clear_async_pf_completion_queue() actually flushes the workqueue,
> > > then there's no need to gift async_pf_execute() a reference because all
> > > invocations of async_pf_execute() will be forced to complete before the
> > > vCPU and its VM are destroyed/freed.  And that in turn fixes the module
> > > unloading bug as __fput() won't do module_put() on the last vCPU reference
> > > until the vCPU has been freed, e.g. if closing the vCPU file also puts the
> > 
> > I'm not sure why __fput() of vCPU fd should be mentioned here. I assume
> > we just need to say that vCPUs are freed before module_put(KVM the module)
> > in kvm_destroy_vm(), then the whole logic for module unloading fix is:
> > 
> >   1. All workqueue callbacks complete when kvm_clear_async_pf_completion_queue(vcpu)
> >   2. kvm_clear_async_pf_completion_queue(vcpu) must be executed before vCPU free.
> >   3. vCPUs must be freed before module_put(KVM the module).
> > 
> >   So all workqueue callbacks complete before module_put(KVM the module).
> > 
> > 
> > __fput() of vCPU fd is not the only trigger of kvm_destroy_vm(), that
> > makes me distracted from reason of the fix.
> 
> My goal was to call out that (a) the vCPU file descriptor is what ensures kvm.ko
> is alive at this point and (b) that __fput() very deliberately ensures module_put()
> is called after all module function callbacks/hooks complete, as there was quite

Ah, I see. These are ensured by your previous fix which grants
kvm_vcpu_fops the module owner. LGTM now.

Thanks,
Yilun

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/4] KVM: Get reference to VM's address space in the async #PF worker
  2024-01-24 18:52     ` Sean Christopherson
@ 2024-01-26  8:06       ` Xu Yilun
  0 siblings, 0 replies; 25+ messages in thread
From: Xu Yilun @ 2024-01-26  8:06 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel, David Matlack

On Wed, Jan 24, 2024 at 10:52:12AM -0800, Sean Christopherson wrote:
> On Sat, Jan 20, 2024, Xu Yilun wrote:
> > On Tue, Jan 09, 2024 at 05:15:32PM -0800, Sean Christopherson wrote:
> > > Get a reference to the target VM's address space in async_pf_execute()
> > > instead of gifting a reference from kvm_setup_async_pf().  Keeping the
> > > address space alive just to service an async #PF is counter-productive,
> > > i.e. if the process is exiting and all vCPUs are dead, then NOT doing
> > > get_user_pages_remote() and freeing the address space asap is desirable.
> > > 
> > > Handling the mm reference entirely within async_pf_execute() also
> > > simplifies the async #PF flows as a whole, e.g. it's not immediately
> > > obvious when the worker task vs. the vCPU task is responsible for putting
> > > the gifted mm reference.
> > > 
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > ---
> > >  include/linux/kvm_host.h |  1 -
> > >  virt/kvm/async_pf.c      | 32 ++++++++++++++++++--------------
> > >  2 files changed, 18 insertions(+), 15 deletions(-)
> > > 
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 7e7fd25b09b3..bbfefd7e612f 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -238,7 +238,6 @@ struct kvm_async_pf {
> > >  	struct list_head link;
> > >  	struct list_head queue;
> > >  	struct kvm_vcpu *vcpu;
> > > -	struct mm_struct *mm;
> > >  	gpa_t cr2_or_gpa;
> > >  	unsigned long addr;
> > >  	struct kvm_arch_async_pf arch;
> > > diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> > > index d5dc50318aa6..c3f4f351a2ae 100644
> > > --- a/virt/kvm/async_pf.c
> > > +++ b/virt/kvm/async_pf.c
> > > @@ -46,8 +46,8 @@ static void async_pf_execute(struct work_struct *work)
> > >  {
> > >  	struct kvm_async_pf *apf =
> > >  		container_of(work, struct kvm_async_pf, work);
> > > -	struct mm_struct *mm = apf->mm;
> > >  	struct kvm_vcpu *vcpu = apf->vcpu;
> > > +	struct mm_struct *mm = vcpu->kvm->mm;
> > >  	unsigned long addr = apf->addr;
> > >  	gpa_t cr2_or_gpa = apf->cr2_or_gpa;
> > >  	int locked = 1;
> > > @@ -56,16 +56,24 @@ static void async_pf_execute(struct work_struct *work)
> > >  	might_sleep();
> > >  
> > >  	/*
> > > -	 * This work is run asynchronously to the task which owns
> > > -	 * mm and might be done in another context, so we must
> > > -	 * access remotely.
> > > +	 * Attempt to pin the VM's host address space, and simply skip gup() if
> > > +	 * acquiring a pin fails, i.e. if the process is exiting.  Note, KVM
> > > +	 * holds a reference to its associated mm_struct until the very end of
> > > +	 * kvm_destroy_vm(), i.e. the struct itself won't be freed before this
> > > +	 * work item is fully processed.
> > >  	 */
> > > -	mmap_read_lock(mm);
> > > -	get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, &locked);
> > > -	if (locked)
> > > -		mmap_read_unlock(mm);
> > > -	mmput(mm);
> > > +	if (mmget_not_zero(mm)) {
> > > +		mmap_read_lock(mm);
> > > +		get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, &locked);
> > > +		if (locked)
> > > +			mmap_read_unlock(mm);
> > > +		mmput(mm);
> > > +	}
> > >  
> > > +	/*
> > > +	 * Notify and kick the vCPU even if faulting in the page failed, e.g.
> > 
> > How about when the process is exiting? Could we just skip the following?
> 
> Maybe?  I'm not opposed to trimming this down even more, but I doubt it will make
> much of a difference.  The vCPU can't be running so async_pf.lock shouldn't be
> contended, no IPIs will be issued for kicks, etc.  So for this patch at least,
> I want to take the most conservative approach while still cleaning up the mm_struct
> usage.

It's good to me.

Reviewed-by: Xu Yilun <yilun.xu@intel.com>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/4] KVM: Get reference to VM's address space in the async #PF worker
  2024-01-10  1:15 ` [PATCH 3/4] KVM: Get reference to VM's address space in the async #PF worker Sean Christopherson
  2024-01-20 15:16   ` Xu Yilun
@ 2024-01-26 16:21   ` Vitaly Kuznetsov
  2024-01-26 16:39     ` Sean Christopherson
  1 sibling, 1 reply; 25+ messages in thread
From: Vitaly Kuznetsov @ 2024-01-26 16:21 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, David Matlack, Xu Yilun, Sean Christopherson

Sean Christopherson <seanjc@google.com> writes:

> Get a reference to the target VM's address space in async_pf_execute()
> instead of gifting a reference from kvm_setup_async_pf().  Keeping the
> address space alive just to service an async #PF is counter-productive,
> i.e. if the process is exiting and all vCPUs are dead, then NOT doing
> get_user_pages_remote() and freeing the address space asap is
> desirable.

It took me a while to realize why all vCPU fds are managed by the same
mm which did KVM_CREATE_VM as (AFAIU) fds can be passed around. Turns
out, we explicitly forbid this in kvm_vcpu_ioctl():

        if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_dead)
                return -EIO;

so this indeed means that grabbing current->mm in kvm_setup_async_pf()
can be avoided. I'm not sure whether it's just me or an "all vCPUs are
required to be managed by the same mm" comment somewhere would be helpful.

>
> Handling the mm reference entirely within async_pf_execute() also
> simplifies the async #PF flows as a whole, e.g. it's not immediately
> obvious when the worker task vs. the vCPU task is responsible for putting
> the gifted mm reference.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  include/linux/kvm_host.h |  1 -
>  virt/kvm/async_pf.c      | 32 ++++++++++++++++++--------------
>  2 files changed, 18 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 7e7fd25b09b3..bbfefd7e612f 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -238,7 +238,6 @@ struct kvm_async_pf {
>  	struct list_head link;
>  	struct list_head queue;
>  	struct kvm_vcpu *vcpu;
> -	struct mm_struct *mm;
>  	gpa_t cr2_or_gpa;
>  	unsigned long addr;
>  	struct kvm_arch_async_pf arch;
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index d5dc50318aa6..c3f4f351a2ae 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -46,8 +46,8 @@ static void async_pf_execute(struct work_struct *work)
>  {
>  	struct kvm_async_pf *apf =
>  		container_of(work, struct kvm_async_pf, work);
> -	struct mm_struct *mm = apf->mm;
>  	struct kvm_vcpu *vcpu = apf->vcpu;
> +	struct mm_struct *mm = vcpu->kvm->mm;
>  	unsigned long addr = apf->addr;
>  	gpa_t cr2_or_gpa = apf->cr2_or_gpa;
>  	int locked = 1;
> @@ -56,16 +56,24 @@ static void async_pf_execute(struct work_struct *work)
>  	might_sleep();
>  
>  	/*
> -	 * This work is run asynchronously to the task which owns
> -	 * mm and might be done in another context, so we must
> -	 * access remotely.
> +	 * Attempt to pin the VM's host address space, and simply skip gup() if
> +	 * acquiring a pin fails, i.e. if the process is exiting.  Note, KVM
> +	 * holds a reference to its associated mm_struct until the very end of
> +	 * kvm_destroy_vm(), i.e. the struct itself won't be freed before this
> +	 * work item is fully processed.
>  	 */
> -	mmap_read_lock(mm);
> -	get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, &locked);
> -	if (locked)
> -		mmap_read_unlock(mm);
> -	mmput(mm);
> +	if (mmget_not_zero(mm)) {
> +		mmap_read_lock(mm);
> +		get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, &locked);
> +		if (locked)
> +			mmap_read_unlock(mm);
> +		mmput(mm);
> +	}
>  
> +	/*
> +	 * Notify and kick the vCPU even if faulting in the page failed, e.g.
> +	 * so that the vCPU can retry the fault synchronously.
> +	 */
>  	if (IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC))
>  		kvm_arch_async_page_present(vcpu, apf);
>  
> @@ -129,10 +137,8 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
>  #ifdef CONFIG_KVM_ASYNC_PF_SYNC
>  		flush_work(&work->work);
>  #else
> -		if (cancel_work_sync(&work->work)) {
> -			mmput(work->mm);
> +		if (cancel_work_sync(&work->work))
>  			kmem_cache_free(async_pf_cache, work);
> -		}
>  #endif
>  		spin_lock(&vcpu->async_pf.lock);
>  	}
> @@ -211,8 +217,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  	work->cr2_or_gpa = cr2_or_gpa;
>  	work->addr = hva;
>  	work->arch = *arch;
> -	work->mm = current->mm;
> -	mmget(work->mm);
>  
>  	INIT_WORK(&work->work, async_pf_execute);

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>

-- 
Vitaly


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/4] KVM: Put mm immediately after async #PF worker completes remote gup()
  2024-01-10  1:15 ` [PATCH 2/4] KVM: Put mm immediately after async #PF worker completes remote gup() Sean Christopherson
  2024-01-20 15:24   ` Xu Yilun
@ 2024-01-26 16:23   ` Vitaly Kuznetsov
  1 sibling, 0 replies; 25+ messages in thread
From: Vitaly Kuznetsov @ 2024-01-26 16:23 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, David Matlack, Xu Yilun, Sean Christopherson

Sean Christopherson <seanjc@google.com> writes:

> Put the async #PF worker's reference to the VM's address space as soon as
> the worker is done with the mm.  This will allow deferring getting a
> reference to the worker itself without having to track whether or not
> getting a reference succeeded.
>
> Note, if the vCPU is still alive, there is no danger of the worker getting
> stuck with tearing down the host page tables, as userspace also holds a
> reference (obviously), i.e. there is no risk of delaying the page-present
> notification due to triggering the slow path in mmput().
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  virt/kvm/async_pf.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index 876927a558ad..d5dc50318aa6 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -64,6 +64,7 @@ static void async_pf_execute(struct work_struct *work)
>  	get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, &locked);
>  	if (locked)
>  		mmap_read_unlock(mm);
> +	mmput(mm);
>  
>  	if (IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC))
>  		kvm_arch_async_page_present(vcpu, apf);
> @@ -85,8 +86,6 @@ static void async_pf_execute(struct work_struct *work)
>  	trace_kvm_async_pf_completed(addr, cr2_or_gpa);
>  
>  	__kvm_vcpu_wake_up(vcpu);
> -
> -	mmput(mm);
>  }
>  
>  static void kvm_flush_and_free_async_pf_work(struct kvm_async_pf *work)

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>

-- 
Vitaly


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 4/4] KVM: Nullify async #PF worker's "apf" pointer as soon as it might be freed
  2024-01-10  1:15 ` [PATCH 4/4] KVM: Nullify async #PF worker's "apf" pointer as soon as it might be freed Sean Christopherson
  2024-01-20 15:24   ` Xu Yilun
@ 2024-01-26 16:30   ` Vitaly Kuznetsov
  1 sibling, 0 replies; 25+ messages in thread
From: Vitaly Kuznetsov @ 2024-01-26 16:30 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, David Matlack, Xu Yilun, Sean Christopherson

Sean Christopherson <seanjc@google.com> writes:

> Nullify the async #PF worker's local "apf" pointer immediately after the
> point where the structure can be freed by the vCPU.  The existing comment
> is helpful, but easy to overlook as there is no associated code.
>
> Update the comment to clarify that it can be freed by as soon as the lock
> is dropped, as "after this point" isn't strictly accurate, nor does it
> help understand what prevents the structure from being freed earlier.
>

"No functional change intended." must be made a requirement, especially
for those who made it their trademark)

> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  virt/kvm/async_pf.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index c3f4f351a2ae..1088c6628de9 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -83,13 +83,14 @@ static void async_pf_execute(struct work_struct *work)
>  	apf->vcpu = NULL;
>  	spin_unlock(&vcpu->async_pf.lock);
>  
> -	if (!IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC) && first)
> -		kvm_arch_async_page_present_queued(vcpu);
> -
>  	/*
> -	 * apf may be freed by kvm_check_async_pf_completion() after
> -	 * this point
> +	 * The apf struct may freed by kvm_check_async_pf_completion() as soon

Nit: "may be freed"/"may get freed" maybe?

> +	 * as the lock is dropped.  Nullify it to prevent improper usage.
>  	 */
> +	apf = NULL;
> +
> +	if (!IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC) && first)
> +		kvm_arch_async_page_present_queued(vcpu);
>  
>  	trace_kvm_async_pf_completed(addr, cr2_or_gpa);

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>

-- 
Vitaly


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/4] KVM: Get reference to VM's address space in the async #PF worker
  2024-01-26 16:21   ` Vitaly Kuznetsov
@ 2024-01-26 16:39     ` Sean Christopherson
  0 siblings, 0 replies; 25+ messages in thread
From: Sean Christopherson @ 2024-01-26 16:39 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Paolo Bonzini, kvm, linux-kernel, David Matlack, Xu Yilun

On Fri, Jan 26, 2024, Vitaly Kuznetsov wrote:
> Sean Christopherson <seanjc@google.com> writes:
> 
> > Get a reference to the target VM's address space in async_pf_execute()
> > instead of gifting a reference from kvm_setup_async_pf().  Keeping the
> > address space alive just to service an async #PF is counter-productive,
> > i.e. if the process is exiting and all vCPUs are dead, then NOT doing
> > get_user_pages_remote() and freeing the address space asap is
> > desirable.
> 
> It took me a while to realize why all vCPU fds are managed by the same
> mm which did KVM_CREATE_VM as (AFAIU) fds can be passed around. Turns
> out, we explicitly forbid this in kvm_vcpu_ioctl():
> 
>         if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_dead)
>                 return -EIO;
> 
> so this indeed means that grabbing current->mm in kvm_setup_async_pf()
> can be avoided. I'm not sure whether it's just me or an "all vCPUs are
> required to be managed by the same mm" comment somewhere would be helpful.

It's definitely not just you.  Documentation/virt/kvm/* could use thorough
documentation of what all in KVM relies on vCPUs, and all meaningful ioctls(),
to be executed in the same mm_struct (address space).  Because that requirement
is pervasive throughout KVM.  E.g. sharing KVM page tables across vCPUs is safe
iff all vCPUs are in the same address space, otherwise the hva=>pfn translations
through the memslot would diverge, mmu_notifiers would be all kinds of broken, etc.
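
As a strawman, such a comment could live right next to the existing mm check
itself.  A rough sketch (placement and wording purely illustrative, not an
actual patch):

	static long kvm_vcpu_ioctl(struct file *filp,
				   unsigned int ioctl, unsigned long arg)
	{
		struct kvm_vcpu *vcpu = filp->private_data;
		...
		/*
		 * All vCPU ioctls must be issued from the mm that created the
		 * VM (kvm->mm).  KVM pervasively relies on every vCPU being
		 * managed by that one address space, e.g. for hva=>pfn
		 * translations through memslots, for mmu_notifiers, and for
		 * the async #PF worker's use of kvm->mm.
		 */
		if (vcpu->kvm->mm != current->mm || vcpu->kvm->vm_dead)
			return -EIO;
		...
	}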

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/4] KVM: Always flush async #PF workqueue when vCPU is being destroyed
  2024-01-10  1:15 ` [PATCH 1/4] KVM: Always flush async #PF workqueue when vCPU is being destroyed Sean Christopherson
  2024-01-20 12:40   ` Xu Yilun
@ 2024-01-26 16:51   ` Vitaly Kuznetsov
  2024-01-26 17:19     ` Sean Christopherson
  2024-02-19 13:59   ` Xu Yilun
  2 siblings, 1 reply; 25+ messages in thread
From: Vitaly Kuznetsov @ 2024-01-26 16:51 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, David Matlack, Xu Yilun, Sean Christopherson

Sean Christopherson <seanjc@google.com> writes:

> Always flush the per-vCPU async #PF workqueue when a vCPU is clearing its
> completion queue, e.g. when a VM and all its vCPUs is being destroyed.
> KVM must ensure that none of its workqueue callbacks is running when the
> last reference to the KVM _module_ is put.  Gifting a reference to the
> associated VM prevents the workqueue callback from dereferencing freed
> vCPU/VM memory, but does not prevent the KVM module from being unloaded
> before the callback completes.
>
> Drop the misguided VM refcount gifting, as calling kvm_put_kvm() from
> async_pf_execute() if kvm_put_kvm() flushes the async #PF workqueue will
> result in deadlock.  async_pf_execute() can't return until kvm_put_kvm()
> finishes, and kvm_put_kvm() can't return until async_pf_execute() finishes:
>
>  WARNING: CPU: 8 PID: 251 at virt/kvm/kvm_main.c:1435 kvm_put_kvm+0x2d/0x320 [kvm]
>  Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm irqbypass
>  CPU: 8 PID: 251 Comm: kworker/8:1 Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>  Workqueue: events async_pf_execute [kvm]
>  RIP: 0010:kvm_put_kvm+0x2d/0x320 [kvm]
>  Call Trace:
>   <TASK>
>   async_pf_execute+0x198/0x260 [kvm]
>   process_one_work+0x145/0x2d0
>   worker_thread+0x27e/0x3a0
>   kthread+0xba/0xe0
>   ret_from_fork+0x2d/0x50
>   ret_from_fork_asm+0x11/0x20
>   </TASK>
>  ---[ end trace 0000000000000000 ]---
>  INFO: task kworker/8:1:251 blocked for more than 120 seconds.
>        Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>  task:kworker/8:1     state:D stack:0     pid:251   ppid:2      flags:0x00004000
>  Workqueue: events async_pf_execute [kvm]
>  Call Trace:
>   <TASK>
>   __schedule+0x33f/0xa40
>   schedule+0x53/0xc0
>   schedule_timeout+0x12a/0x140
>   __wait_for_common+0x8d/0x1d0
>   __flush_work.isra.0+0x19f/0x2c0
>   kvm_clear_async_pf_completion_queue+0x129/0x190 [kvm]
>   kvm_arch_destroy_vm+0x78/0x1b0 [kvm]
>   kvm_put_kvm+0x1c1/0x320 [kvm]
>   async_pf_execute+0x198/0x260 [kvm]
>   process_one_work+0x145/0x2d0
>   worker_thread+0x27e/0x3a0
>   kthread+0xba/0xe0
>   ret_from_fork+0x2d/0x50
>   ret_from_fork_asm+0x11/0x20
>   </TASK>
>
> If kvm_clear_async_pf_completion_queue() actually flushes the workqueue,
> then there's no need to gift async_pf_execute() a reference because all
> invocations of async_pf_execute() will be forced to complete before the
> vCPU and its VM are destroyed/freed.  And that in turn fixes the module
> unloading bug as __fput() won't do module_put() on the last vCPU reference
> until the vCPU has been freed, e.g. if closing the vCPU file also puts the
> last reference to the KVM module.
>
> Note that kvm_check_async_pf_completion() may also take the work item off
> the completion queue and so also needs to flush the work queue, as the
> work will not be seen by kvm_clear_async_pf_completion_queue().  Waiting
> on the workqueue could theoretically delay a vCPU due to waiting for the
> work to complete, but that's a very, very small chance, and likely a very
> small delay.  kvm_arch_async_page_present_queued() unconditionally makes a
> new request, i.e. will effectively delay entering the guest, so the
> remaining work is really just:
>
>         trace_kvm_async_pf_completed(addr, cr2_or_gpa);
>
>         __kvm_vcpu_wake_up(vcpu);
>
>         mmput(mm);
>
> and mmput() can't drop the last reference to the page tables if the vCPU is
> still alive, i.e. the vCPU won't get stuck tearing down page tables.
>
> Add a helper to do the flushing, specifically to deal with "wakeup all"
> work items, as they aren't actually work items, i.e. are never placed in a
> workqueue.  Trying to flush a bogus workqueue entry rightly makes
> __flush_work() complain (kudos to whoever added that sanity check).
>
> Note, commit 5f6de5cbebee ("KVM: Prevent module exit until all VMs are
> freed") *tried* to fix the module refcounting issue by having VMs grab a
> reference to the module, but that only made the bug slightly harder to hit
> as it gave async_pf_execute() a bit more time to complete before the KVM
> module could be unloaded.
>
> Fixes: af585b921e5d ("KVM: Halt vcpu if page it tries to access is swapped out")
> Cc: stable@vger.kernel.org
> Cc: David Matlack <dmatlack@google.com>
> Cc: Xu Yilun <yilun.xu@linux.intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  virt/kvm/async_pf.c | 37 ++++++++++++++++++++++++++++++++-----
>  1 file changed, 32 insertions(+), 5 deletions(-)
>
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index e033c79d528e..876927a558ad 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -87,7 +87,25 @@ static void async_pf_execute(struct work_struct *work)
>  	__kvm_vcpu_wake_up(vcpu);
>  
>  	mmput(mm);
> -	kvm_put_kvm(vcpu->kvm);
> +}
> +
> +static void kvm_flush_and_free_async_pf_work(struct kvm_async_pf *work)
> +{
> +	/*
> +	 * The async #PF is "done", but KVM must wait for the work item itself,
> +	 * i.e. async_pf_execute(), to run to completion.  If KVM is a module,
> +	 * KVM must ensure *no* code owned by the KVM (the module) can be run
> +	 * after the last call to module_put(), i.e. after the last reference
> +	 * to the last vCPU's file is put.
> +	 *

Do I understand correctly that the problem is also present on the
"normal" path, i.e.:

  KVM_REQ_APF_READY
     kvm_check_async_pf_completion()
         kmem_cache_free(,work)

on one CPU can actually finish _before_ work is fully flushed on the
other (async_pf_execute() has already added an item to 'done' list but
hasn't completed)? Is it just the fact that the window of opportunity
to get the freed item re-purposed is so short that no real issue was
ever noticed? In that case I'd suggest we emphasize that in the comment
as currently it sounds like kvm_arch_destroy_vm() is the only
problematic path.

> +	 * Wake all events skip the queue and go straight done, i.e. don't
> +	 * need to be flushed (but sanity check that the work wasn't queued).
> +	 */
> +	if (work->wakeup_all)
> +		WARN_ON_ONCE(work->work.func);
> +	else
> +		flush_work(&work->work);
> +	kmem_cache_free(async_pf_cache, work);
>  }
>  
>  void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> @@ -114,7 +132,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
>  #else
>  		if (cancel_work_sync(&work->work)) {
>  			mmput(work->mm);
> -			kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
>  			kmem_cache_free(async_pf_cache, work);
>  		}
>  #endif
> @@ -126,7 +143,18 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
>  			list_first_entry(&vcpu->async_pf.done,
>  					 typeof(*work), link);
>  		list_del(&work->link);
> -		kmem_cache_free(async_pf_cache, work);
> +
> +		spin_unlock(&vcpu->async_pf.lock);
> +
> +		/*
> +		 * The async #PF is "done", but KVM must wait for the work item
> +		 * itself, i.e. async_pf_execute(), to run to completion.  If
> +		 * KVM is a module, KVM must ensure *no* code owned by the KVM
> +		 * (the module) can be run after the last call to module_put(),
> +		 * i.e. after the last reference to the last vCPU's file is put.
> +		 */
> +		kvm_flush_and_free_async_pf_work(work);
> +		spin_lock(&vcpu->async_pf.lock);
>  	}
>  	spin_unlock(&vcpu->async_pf.lock);
>  
> @@ -151,7 +179,7 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
>  
>  		list_del(&work->queue);
>  		vcpu->async_pf.queued--;
> -		kmem_cache_free(async_pf_cache, work);
> +		kvm_flush_and_free_async_pf_work(work);
>  	}
>  }
>  
> @@ -186,7 +214,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  	work->arch = *arch;
>  	work->mm = current->mm;
>  	mmget(work->mm);
> -	kvm_get_kvm(work->vcpu->kvm);
>  
>  	INIT_WORK(&work->work, async_pf_execute);

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>

-- 
Vitaly


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/4] KVM: Always flush async #PF workqueue when vCPU is being destroyed
  2024-01-26 16:51   ` Vitaly Kuznetsov
@ 2024-01-26 17:19     ` Sean Christopherson
  2024-01-29  9:02       ` Vitaly Kuznetsov
  0 siblings, 1 reply; 25+ messages in thread
From: Sean Christopherson @ 2024-01-26 17:19 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Paolo Bonzini, kvm, linux-kernel, David Matlack, Xu Yilun

On Fri, Jan 26, 2024, Vitaly Kuznetsov wrote:
> > +static void kvm_flush_and_free_async_pf_work(struct kvm_async_pf *work)
> > +{
> > +	/*
> > +	 * The async #PF is "done", but KVM must wait for the work item itself,
> > +	 * i.e. async_pf_execute(), to run to completion.  If KVM is a module,
> > +	 * KVM must ensure *no* code owned by the KVM (the module) can be run
> > +	 * after the last call to module_put(), i.e. after the last reference
> > +	 * to the last vCPU's file is put.
> > +	 *
> 
> Do I understand correctly that the problem is also present on the
> "normal" path, i.e.:
> 
>   KVM_REQ_APF_READY
>      kvm_check_async_pf_completion()
>          kmem_cache_free(,work)
> 
> on one CPU can actually finish _before_ work is fully flushed on the
> other (async_pf_execute() has already added an item to 'done' list but
> hasn't completed)? Is it just the fact that the window of opportunity
> to get the freed item re-purposed is so short that no real issue was
> ever noticed?

Sort of?  It's not a problem with accessing a freed object, the issue is that
async_pf_execute() can still be executing while KVM-the-module is unloaded.

The reason the "normal" path is problematic is because it removes the
work entry from async_pf.done and from async_pf.queue, i.e. makes it invisible
to kvm_arch_destroy_vm().  So to hit the bug where the "normal" path processes
the completed work, userspace would need to terminate the VM (i.e. exit() or
close all fds), and KVM would need to completely tear down the VM, all before
async_pf_execute() finished its last few instructions.

Which is possible, just comically unlikely :-)

> In that case I'd suggest we emphasize that in the comment as currently it
> sounds like kvm_arch_destroy_vm() is the only problematic path.

How's this?

	/*
	 * The async #PF is "done", but KVM must wait for the work item itself,
	 * i.e. async_pf_execute(), to run to completion.  If KVM is a module,
	 * KVM must ensure *no* code owned by the KVM (the module) can be run
	 * after the last call to module_put().  Note, flushing the work item
	 * is always required when the item is taken off the completion queue.
	 * E.g. even if the vCPU handles the item in the "normal" path, the VM
	 * could be terminated before async_pf_execute() completes.
	 *
	 * Wake all events skip the queue and go straight done, i.e. don't
	 * need to be flushed (but sanity check that the work wasn't queued).
	 */

> > +	 * Wake all events skip the queue and go straight done, i.e. don't
> > +	 * need to be flushed (but sanity check that the work wasn't queued).
> > +	 */
> > +	if (work->wakeup_all)
> > +		WARN_ON_ONCE(work->work.func);
> > +	else
> > +		flush_work(&work->work);
> > +	kmem_cache_free(async_pf_cache, work);
> >  }
> >  
> >  void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> > @@ -114,7 +132,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> >  #else
> >  		if (cancel_work_sync(&work->work)) {
> >  			mmput(work->mm);
> > -			kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
> >  			kmem_cache_free(async_pf_cache, work);
> >  		}
> >  #endif
> > @@ -126,7 +143,18 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> >  			list_first_entry(&vcpu->async_pf.done,
> >  					 typeof(*work), link);
> >  		list_del(&work->link);
> > -		kmem_cache_free(async_pf_cache, work);
> > +
> > +		spin_unlock(&vcpu->async_pf.lock);
> > +
> > +		/*
> > +		 * The async #PF is "done", but KVM must wait for the work item
> > +		 * itself, i.e. async_pf_execute(), to run to completion.  If
> > +		 * KVM is a module, KVM must ensure *no* code owned by the KVM
> > +		 * (the module) can be run after the last call to module_put(),
> > +		 * i.e. after the last reference to the last vCPU's file is put.
> > +		 */

Doh, this is a duplicate comment, I'll delete it.
 
> > +		kvm_flush_and_free_async_pf_work(work);
> > +		spin_lock(&vcpu->async_pf.lock);
> >  	}
> >  	spin_unlock(&vcpu->async_pf.lock);
> >  
> > @@ -151,7 +179,7 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
> >  
> >  		list_del(&work->queue);
> >  		vcpu->async_pf.queued--;
> > -		kmem_cache_free(async_pf_cache, work);
> > +		kvm_flush_and_free_async_pf_work(work);
> >  	}
> >  }
> >  
> > @@ -186,7 +214,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >  	work->arch = *arch;
> >  	work->mm = current->mm;
> >  	mmget(work->mm);
> > -	kvm_get_kvm(work->vcpu->kvm);
> >  
> >  	INIT_WORK(&work->work, async_pf_execute);
> 
> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> 
> -- 
> Vitaly
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/4] KVM: Always flush async #PF workqueue when vCPU is being destroyed
  2024-01-26 17:19     ` Sean Christopherson
@ 2024-01-29  9:02       ` Vitaly Kuznetsov
  0 siblings, 0 replies; 25+ messages in thread
From: Vitaly Kuznetsov @ 2024-01-29  9:02 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, David Matlack, Xu Yilun

Sean Christopherson <seanjc@google.com> writes:

> On Fri, Jan 26, 2024, Vitaly Kuznetsov wrote:
>> > +static void kvm_flush_and_free_async_pf_work(struct kvm_async_pf *work)
>> > +{
>> > +	/*
>> > +	 * The async #PF is "done", but KVM must wait for the work item itself,
>> > +	 * i.e. async_pf_execute(), to run to completion.  If KVM is a module,
>> > +	 * KVM must ensure *no* code owned by the KVM (the module) can be run
>> > +	 * after the last call to module_put(), i.e. after the last reference
>> > +	 * to the last vCPU's file is put.
>> > +	 *
>> 
>> Do I understand correctly that the problem is also present on the
>> "normal" path, i.e.:
>> 
>>   KVM_REQ_APF_READY
>>      kvm_check_async_pf_completion()
>>          kmem_cache_free(,work)
>> 
>> on one CPU can actually finish _before_ work is fully flushed on the
>> other (async_pf_execute() has already added an item to 'done' list but
>> hasn't completed)? Is it just the fact that the window of opportunity
>> to get the freed item re-purposed is so short that no real issue was
>> ever noticed?
>
> Sort of?  It's not a problem with accessing a freed object, the issue is that
> async_pf_execute() can still be executing while KVM-the-module is unloaded.
>
> The reason the "normal" path is problematic is because it removes the
> work entry from async_pf.done and from async_pf.queue, i.e. makes it invisible
> to kvm_arch_destroy_vm().  So to hit the bug where the "normal" path processes
> the completed work, userspace would need to terminate the VM (i.e. exit() or
> close all fds), and KVM would need to completely tear down the VM, all before
> async_pf_execute() finished its last few instructions.
>
> Which is possible, just comically unlikely :-)
>
>> In that case I'd suggest we emphasize that in the comment as currently it
>> sounds like kvm_arch_destroy_vm() is the only problematic path.
>
> How's this?
>
> 	/*
> 	 * The async #PF is "done", but KVM must wait for the work item itself,
> 	 * i.e. async_pf_execute(), to run to completion.  If KVM is a module,
> 	 * KVM must ensure *no* code owned by the KVM (the module) can be run
> 	 * after the last call to module_put().  Note, flushing the work item
> 	 * is always required when the item is taken off the completion queue.
> 	 * E.g. even if the vCPU handles the item in the "normal" path, the VM
> 	 * could be terminated before async_pf_execute() completes.
> 	 *
> 	 * Wake all events skip the queue and go straight done, i.e. don't
> 	 * need to be flushed (but sanity check that the work wasn't queued).
> 	 */
>

Sounds good to me, thanks!

>> > +	 * Wake all events skip the queue and go straight done, i.e. don't
>> > +	 * need to be flushed (but sanity check that the work wasn't queued).
>> > +	 */
>> > +	if (work->wakeup_all)
>> > +		WARN_ON_ONCE(work->work.func);
>> > +	else
>> > +		flush_work(&work->work);
>> > +	kmem_cache_free(async_pf_cache, work);
>> >  }
>> >  
>> >  void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
>> > @@ -114,7 +132,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
>> >  #else
>> >  		if (cancel_work_sync(&work->work)) {
>> >  			mmput(work->mm);
>> > -			kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
>> >  			kmem_cache_free(async_pf_cache, work);
>> >  		}
>> >  #endif
>> > @@ -126,7 +143,18 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
>> >  			list_first_entry(&vcpu->async_pf.done,
>> >  					 typeof(*work), link);
>> >  		list_del(&work->link);
>> > -		kmem_cache_free(async_pf_cache, work);
>> > +
>> > +		spin_unlock(&vcpu->async_pf.lock);
>> > +
>> > +		/*
>> > +		 * The async #PF is "done", but KVM must wait for the work item
>> > +		 * itself, i.e. async_pf_execute(), to run to completion.  If
>> > +		 * KVM is a module, KVM must ensure *no* code owned by the KVM
>> > +		 * (the module) can be run after the last call to module_put(),
>> > +		 * i.e. after the last reference to the last vCPU's file is put.
>> > +		 */
>
> Doh, this is a duplicate comment, I'll delete it.
>  
>> > +		kvm_flush_and_free_async_pf_work(work);
>> > +		spin_lock(&vcpu->async_pf.lock);
>> >  	}
>> >  	spin_unlock(&vcpu->async_pf.lock);
>> >  
>> > @@ -151,7 +179,7 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
>> >  
>> >  		list_del(&work->queue);
>> >  		vcpu->async_pf.queued--;
>> > -		kmem_cache_free(async_pf_cache, work);
>> > +		kvm_flush_and_free_async_pf_work(work);
>> >  	}
>> >  }
>> >  
>> > @@ -186,7 +214,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>> >  	work->arch = *arch;
>> >  	work->mm = current->mm;
>> >  	mmget(work->mm);
>> > -	kvm_get_kvm(work->vcpu->kvm);
>> >  
>> >  	INIT_WORK(&work->work, async_pf_execute);
>> 
>> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> 
>> -- 
>> Vitaly
>> 
>

-- 
Vitaly


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/4] KVM: Always flush async #PF workqueue when vCPU is being destroyed
  2024-01-20 12:40   ` Xu Yilun
  2024-01-24 19:04     ` Sean Christopherson
@ 2024-02-06 19:06     ` Sean Christopherson
  1 sibling, 0 replies; 25+ messages in thread
From: Sean Christopherson @ 2024-02-06 19:06 UTC (permalink / raw)
  To: Xu Yilun; +Cc: Paolo Bonzini, kvm, linux-kernel, David Matlack

On Sat, Jan 20, 2024, Xu Yilun wrote:
> Overall LGTM, Reviewed-by: Xu Yilun <yilun.xu@intel.com>

A very random thing.  AFAICT, b4 doesn't appear to pick up tags that are "buried"
partway through a line.  You didn't do anything wrong (IMO, at least :-D), and
it was easy enough to apply manually, just an FYI for the future.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/4] KVM: Async #PF fixes and cleanups
  2024-01-10  1:15 [PATCH 0/4] KVM: Async #PF fixes and cleanups Sean Christopherson
                   ` (3 preceding siblings ...)
  2024-01-10  1:15 ` [PATCH 4/4] KVM: Nullify async #PF worker's "apf" pointer as soon as it might be freed Sean Christopherson
@ 2024-02-06 21:36 ` Sean Christopherson
  4 siblings, 0 replies; 25+ messages in thread
From: Sean Christopherson @ 2024-02-06 21:36 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, David Matlack, Xu Yilun

On Tue, 09 Jan 2024 17:15:29 -0800, Sean Christopherson wrote:
> Fix a use-after-module-unload bug in the async #PF code by ensuring all
> workqueue items fully complete before tearing down vCPUs.  Do a bit of
> cleanup to try and make the code slightly more readable.
> 
> Side topic, I'm pretty sure s390's flic_set_attr() is broken/racy.  The async #PF
> code assumes that only the vCPU can invoke
> kvm_clear_async_pf_completion_queue(), as there are multiple assets that
> are effectively protected by vcpu->mutex.  I don't think any real world VMMs
> trigger the race(s), but AFAICT it's a bug.  I think/assume taking all
> vCPUs' mutexes would plug the hole?
> 
> [...]

Applied to kvm-x86 asyncpf, with comment tweaks as per Vitaly.  Thanks!

[1/4] KVM: Always flush async #PF workqueue when vCPU is being destroyed
      https://github.com/kvm-x86/linux/commit/3d75b8aa5c29
[2/4] KVM: Put mm immediately after async #PF worker completes remote gup()
      https://github.com/kvm-x86/linux/commit/422eeb543ac9
[3/4] KVM: Get reference to VM's address space in the async #PF worker
      https://github.com/kvm-x86/linux/commit/8284765f03b7
[4/4] KVM: Nullify async #PF worker's "apf" pointer as soon as it might be freed
      https://github.com/kvm-x86/linux/commit/c2744ed2230a

--
https://github.com/kvm-x86/linux/tree/next

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/4] KVM: Always flush async #PF workqueue when vCPU is being destroyed
  2024-01-10  1:15 ` [PATCH 1/4] KVM: Always flush async #PF workqueue when vCPU is being destroyed Sean Christopherson
  2024-01-20 12:40   ` Xu Yilun
  2024-01-26 16:51   ` Vitaly Kuznetsov
@ 2024-02-19 13:59   ` Xu Yilun
  2024-02-19 15:51     ` Sean Christopherson
  2 siblings, 1 reply; 25+ messages in thread
From: Xu Yilun @ 2024-02-19 13:59 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel, David Matlack

>  void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> @@ -114,7 +132,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
>  #else
>  		if (cancel_work_sync(&work->work)) {
>  			mmput(work->mm);
> -			kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
>  			kmem_cache_free(async_pf_cache, work);
>  		}
>  #endif
> @@ -126,7 +143,18 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
>  			list_first_entry(&vcpu->async_pf.done,
>  					 typeof(*work), link);
>  		list_del(&work->link);
> -		kmem_cache_free(async_pf_cache, work);
> +
> +		spin_unlock(&vcpu->async_pf.lock);
> +
> +		/*
> +		 * The async #PF is "done", but KVM must wait for the work item
> +		 * itself, i.e. async_pf_execute(), to run to completion.  If
> +		 * KVM is a module, KVM must ensure *no* code owned by the KVM
> +		 * (the module) can be run after the last call to module_put(),
> +		 * i.e. after the last reference to the last vCPU's file is put.
> +		 */
> +		kvm_flush_and_free_async_pf_work(work);

I have a new concern when I re-visit this patchset.

From kvm_check_async_pf_completion(), I see async_pf.queue is always a
superset of async_pf.done (except wake-all work, which is not within
concern).  And done work would be skipped from sync (cancel_work_sync()) by:

                if (!work->vcpu)
                        continue;

But now with this patch we also sync done work items, so how about we just
sync all queued work instead?

diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index e033c79d528e..2268f16a36c2 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -71,7 +71,6 @@ static void async_pf_execute(struct work_struct *work)
        spin_lock(&vcpu->async_pf.lock);
        first = list_empty(&vcpu->async_pf.done);
        list_add_tail(&apf->link, &vcpu->async_pf.done);
-       apf->vcpu = NULL;
        spin_unlock(&vcpu->async_pf.lock);

        if (!IS_ENABLED(CONFIG_KVM_ASYNC_PF_SYNC) && first)
@@ -101,13 +100,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
                                         typeof(*work), queue);
                list_del(&work->queue);

-               /*
-                * We know it's present in vcpu->async_pf.done, do
-                * nothing here.
-                */
-               if (!work->vcpu)
-                       continue;
-
                spin_unlock(&vcpu->async_pf.lock);
 #ifdef CONFIG_KVM_ASYNC_PF_SYNC
                flush_work(&work->work);


This way we don't have to sync twice for the same purpose, and we could also
avoid reusing work->vcpu as a "done" flag, which confused me a bit.

Thanks,
Yilun

> +		spin_lock(&vcpu->async_pf.lock);
>  	}
>  	spin_unlock(&vcpu->async_pf.lock);
>  

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/4] KVM: Always flush async #PF workqueue when vCPU is being destroyed
  2024-02-19 13:59   ` Xu Yilun
@ 2024-02-19 15:51     ` Sean Christopherson
  2024-02-20  3:02       ` Xu Yilun
  0 siblings, 1 reply; 25+ messages in thread
From: Sean Christopherson @ 2024-02-19 15:51 UTC (permalink / raw)
  To: Xu Yilun; +Cc: Paolo Bonzini, kvm, linux-kernel, David Matlack

On Mon, Feb 19, 2024, Xu Yilun wrote:
> >  void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> > @@ -114,7 +132,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> >  #else
> >  		if (cancel_work_sync(&work->work)) {
> >  			mmput(work->mm);
> > -			kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
> >  			kmem_cache_free(async_pf_cache, work);
> >  		}
> >  #endif
> > @@ -126,7 +143,18 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> >  			list_first_entry(&vcpu->async_pf.done,
> >  					 typeof(*work), link);
> >  		list_del(&work->link);
> > -		kmem_cache_free(async_pf_cache, work);
> > +
> > +		spin_unlock(&vcpu->async_pf.lock);
> > +
> > +		/*
> > +		 * The async #PF is "done", but KVM must wait for the work item
> > +		 * itself, i.e. async_pf_execute(), to run to completion.  If
> > +		 * KVM is a module, KVM must ensure *no* code owned by the KVM
> > +		 * (the module) can be run after the last call to module_put(),
> > +		 * i.e. after the last reference to the last vCPU's file is put.
> > +		 */
> > +		kvm_flush_and_free_async_pf_work(work);
> 
> I have a new concern when I re-visit this patchset.
> 
> > From kvm_check_async_pf_completion(), I see async_pf.queue is always a
> superset of async_pf.done (except wake-all work, which is not within
> concern).  And done work would be skipped from sync (cancel_work_sync()) by:
> 
>                 if (!work->vcpu)
>                         continue;
> 
> But now with this patch we also sync done work items, so how about we just
> sync all queued work instead?

Hmm, IIUC, I think we can simply revert commit 22583f0d9c85 ("KVM: async_pf: avoid
recursive flushing of work items").

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/4] KVM: Always flush async #PF workqueue when vCPU is being destroyed
  2024-02-19 15:51     ` Sean Christopherson
@ 2024-02-20  3:02       ` Xu Yilun
  0 siblings, 0 replies; 25+ messages in thread
From: Xu Yilun @ 2024-02-20  3:02 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel, David Matlack

On Mon, Feb 19, 2024 at 07:51:24AM -0800, Sean Christopherson wrote:
> On Mon, Feb 19, 2024, Xu Yilun wrote:
> > >  void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> > > @@ -114,7 +132,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> > >  #else
> > >  		if (cancel_work_sync(&work->work)) {
> > >  			mmput(work->mm);
> > > -			kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
> > >  			kmem_cache_free(async_pf_cache, work);
> > >  		}
> > >  #endif
> > > @@ -126,7 +143,18 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> > >  			list_first_entry(&vcpu->async_pf.done,
> > >  					 typeof(*work), link);
> > >  		list_del(&work->link);
> > > -		kmem_cache_free(async_pf_cache, work);
> > > +
> > > +		spin_unlock(&vcpu->async_pf.lock);
> > > +
> > > +		/*
> > > +		 * The async #PF is "done", but KVM must wait for the work item
> > > +		 * itself, i.e. async_pf_execute(), to run to completion.  If
> > > +		 * KVM is a module, KVM must ensure *no* code owned by the KVM
> > > +		 * (the module) can be run after the last call to module_put(),
> > > +		 * i.e. after the last reference to the last vCPU's file is put.
> > > +		 */
> > > +		kvm_flush_and_free_async_pf_work(work);
> > 
> > I have a new concern when I re-visit this patchset.
> > 
> > Form kvm_check_async_pf_completion(), I see async_pf.queue is always a
> > superset of async_pf.done (except wake-all work, which is not within
> > concern).  And done work would be skipped from sync (cancel_work_sync()) by:
> > 
> >                 if (!work->vcpu)
> >                         continue;
> > 
> > But now with this patch we also sync done works, how about we just sync all
> > queued work instead.
> 
> Hmm, IIUC, I think we can simply revert commit 22583f0d9c85 ("KVM: async_pf: avoid
> recursive flushing of work items").

Ah, yes. This also makes the history of the confusing spin_lock clear to me.
Reverting is good to me.

Thanks,
Yilun

^ permalink raw reply	[flat|nested] 25+ messages in thread
