All of lore.kernel.org
 help / color / mirror / Atom feed
From: Paolo Bonzini <pbonzini@redhat.com>
To: Sean Christopherson <seanjc@google.com>,
	Christian Borntraeger <borntraeger@linux.ibm.com>,
	Janosch Frank <frankja@linux.ibm.com>,
	Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>,
	David Hildenbrand <david@redhat.com>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	David Matlack <dmatlack@google.com>,
	Ben Gardon <bgardon@google.com>,
	Mingwei Zhang <mizhang@google.com>
Subject: Re: [PATCH v3 22/28] KVM: x86/mmu: Zap defunct roots via asynchronous worker
Date: Wed, 2 Mar 2022 18:25:14 +0100	[thread overview]
Message-ID: <b9270432-4ee8-be8e-8aa1-4b09992f82b8@redhat.com> (raw)
In-Reply-To: <20220226001546.360188-23-seanjc@google.com>

On 2/26/22 01:15, Sean Christopherson wrote:
> Zap defunct roots, a.k.a. roots that have been invalidated after their
> last reference was initially dropped, asynchronously via the system work
> queue instead of forcing the work upon the unfortunate task that happened
> to drop the last reference.
> 
> If a vCPU task drops the last reference, the vCPU is effectively blocked
> by the host for the entire duration of the zap.  If the root being zapped
> happens be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging
> being active, the zap can take several hundred seconds.  Unsurprisingly,
> most guests are unhappy if a vCPU disappears for hundreds of seconds.
> 
> E.g. running a synthetic selftest that triggers a vCPU root zap with
> ~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
> Offloading the zap to a worker drops the block time to <100ms.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---

Do we even need kvm_tdp_mmu_zap_invalidated_roots() now?  That is,
something like the following:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index bd3625a875ef..5fd8bc858c6f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5698,6 +5698,16 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
  {
  	lockdep_assert_held(&kvm->slots_lock);
  
+	/*
+	 * kvm_tdp_mmu_invalidate_all_roots() needs a nonzero reference
+	 * count.  If we're dying, zap everything as it's going to happen
+	 * soon anyway.
+	 */
+	if (!refcount_read(&kvm->users_count)) {
+		kvm_mmu_zap_all(kvm);
+		return;
+	}
+
  	write_lock(&kvm->mmu_lock);
  	trace_kvm_mmu_zap_all_fast(kvm);
  
@@ -5732,20 +5742,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
  	kvm_zap_obsolete_pages(kvm);
  
  	write_unlock(&kvm->mmu_lock);
-
-	/*
-	 * Zap the invalidated TDP MMU roots, all SPTEs must be dropped before
-	 * returning to the caller, e.g. if the zap is in response to a memslot
-	 * deletion, mmu_notifier callbacks will be unable to reach the SPTEs
-	 * associated with the deleted memslot once the update completes, and
-	 * Deferring the zap until the final reference to the root is put would
-	 * lead to use-after-free.
-	 */
-	if (is_tdp_mmu_enabled(kvm)) {
-		read_lock(&kvm->mmu_lock);
-		kvm_tdp_mmu_zap_invalidated_roots(kvm);
-		read_unlock(&kvm->mmu_lock);
-	}
  }
  
  static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index cd1bf68e7511..af9db5b8f713 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -142,10 +142,12 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
  	WARN_ON(!root->tdp_mmu_page);
  
  	/*
-	 * The root now has refcount=0 and is valid.  Readers cannot acquire
-	 * a reference to it (they all visit valid roots only, except for
-	 * kvm_tdp_mmu_zap_invalidated_roots() which however does not acquire
-	 * any reference itself.
+	 * The root now has refcount=0.  It is valid, but readers already
+	 * cannot acquire a reference to it because kvm_tdp_mmu_get_root()
+	 * rejects it.  This remains true for the rest of the execution
+	 * of this function, because readers visit valid roots only
+	 * (except for tdp_mmu_zap_root_work(), which however operates only
+	 * on one specific root and does not acquire any reference itself).

  	 *
  	 * Even though there are flows that need to visit all roots for
  	 * correctness, they all take mmu_lock for write, so they cannot yet
@@ -996,103 +994,16 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
  	}
  }
  
-static struct kvm_mmu_page *next_invalidated_root(struct kvm *kvm,
-						  struct kvm_mmu_page *prev_root)
-{
-	struct kvm_mmu_page *next_root;
-
-	if (prev_root)
-		next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
-						  &prev_root->link,
-						  typeof(*prev_root), link);
-	else
-		next_root = list_first_or_null_rcu(&kvm->arch.tdp_mmu_roots,
-						   typeof(*next_root), link);
-
-	while (next_root && !(next_root->role.invalid &&
-			      refcount_read(&next_root->tdp_mmu_root_count)))
-		next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
-						  &next_root->link,
-						  typeof(*next_root), link);
-
-	return next_root;
-}
-
-/*
- * Zap all invalidated roots to ensure all SPTEs are dropped before the "fast
- * zap" completes.  Since kvm_tdp_mmu_invalidate_all_roots() has acquired a
- * reference to each invalidated root, roots will not be freed until after this
- * function drops the gifted reference, e.g. so that vCPUs don't get stuck with
- * tearing paging structures.
- */
-void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
-{
-	struct kvm_mmu_page *next_root;
-	struct kvm_mmu_page *root;
-
-	lockdep_assert_held_read(&kvm->mmu_lock);
-
-	rcu_read_lock();
-
-	root = next_invalidated_root(kvm, NULL);
-
-	while (root) {
-		next_root = next_invalidated_root(kvm, root);
-
-		rcu_read_unlock();
-
-		/*
-		 * Zap the root regardless of what marked it invalid, e.g. even
-		 * if the root was marked invalid by kvm_tdp_mmu_put_root() due
-		 * to its last reference being put.  All SPTEs must be dropped
-		 * before returning to the caller, e.g. if a memslot is deleted
-		 * or moved, the memslot's associated SPTEs are unreachable via
-		 * the mmu_notifier once the memslot update completes.
-		 *
-		 * A TLB flush is unnecessary, invalidated roots are guaranteed
-		 * to be unreachable by the guest (see kvm_tdp_mmu_put_root()
-		 * for more details), and unlike the legacy MMU, no vCPU kick
-		 * is needed to play nice with lockless shadow walks as the TDP
-		 * MMU protects its paging structures via RCU.  Note, zapping
-		 * will still flush on yield, but that's a minor performance
-		 * blip and not a functional issue.
-		 */
-		tdp_mmu_zap_root(kvm, root, true);
-
-		/*
-		 * Put the reference acquired in
-		 * kvm_tdp_mmu_invalidate_roots
-		 */
-		kvm_tdp_mmu_put_root(kvm, root, true);
-
-		root = next_root;
-
-		rcu_read_lock();
-	}
-
-	rcu_read_unlock();
-}
-
  /*
   * Mark each TDP MMU root as invalid to prevent vCPUs from reusing a root that
- * is about to be zapped, e.g. in response to a memslots update.  The caller is
- * responsible for invoking kvm_tdp_mmu_zap_invalidated_roots() to the actual
- * zapping.
- *
- * Take a reference on all roots to prevent the root from being freed before it
- * is zapped by this thread.  Freeing a root is not a correctness issue, but if
- * a vCPU drops the last reference to a root prior to the root being zapped, it
- * will get stuck with tearing down the entire paging structure.
- *
- * Get a reference even if the root is already invalid,
- * kvm_tdp_mmu_zap_invalidated_roots() assumes it was gifted a reference to all
- * invalid roots, e.g. there's no epoch to identify roots that were invalidated
- * by a previous call.  Roots stay on the list until the last reference is
- * dropped, so even though all invalid roots are zapped, a root may not go away
- * for quite some time, e.g. if a vCPU blocks across multiple memslot updates.
+ * is about to be zapped, e.g. in response to a memslots update.  The actual
+ * zapping is performed asynchronously, so a reference is taken on all roots
+ * as well as (once per root) on the struct kvm.
   *
- * Because mmu_lock is held for write, it should be impossible to observe a
- * root with zero refcount, i.e. the list of roots cannot be stale.
+ * Get a reference even if the root is already invalid, the asynchronous worker
+ * assumes it was gifted a reference to the root it processes.  Because mmu_lock
+ * is held for write, it should be impossible to observe a root with zero refcount,
+ * i.e. the list of roots cannot be stale.
   *
   * This has essentially the same effect for the TDP MMU
   * as updating mmu_valid_gen does for the shadow MMU.
@@ -1103,8 +1014,11 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
  
  	lockdep_assert_held_write(&kvm->mmu_lock);
  	list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
-		if (!WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root)))
+		if (!WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root))) {
  			root->role.invalid = true;
+			kvm_get_kvm(kvm);
+			tdp_mmu_schedule_zap_root(kvm, root);
+		}
  	}
  }
  

It passes a smoke test, and also resolves the debate on the fate of patch 1.

However, I think we now need a module_get/module_put when creating/destroying
a VM; the workers can outlive kvm_vm_release and therefore any reference
automatically taken by VFS's fops_get/fops_put.

Paolo

  parent reply	other threads:[~2022-03-02 17:25 UTC|newest]

Thread overview: 79+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-26  0:15 [PATCH v3 00/28] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing Sean Christopherson
2022-02-26  0:15 ` [PATCH v3 01/28] KVM: x86/mmu: Use common iterator for walking invalid TDP MMU roots Sean Christopherson
2022-03-02 19:08   ` Mingwei Zhang
2022-03-02 19:51     ` Sean Christopherson
2022-03-03  0:57       ` Mingwei Zhang
2022-02-26  0:15 ` [PATCH v3 02/28] KVM: x86/mmu: Check for present SPTE when clearing dirty bit in TDP MMU Sean Christopherson
2022-03-02 19:50   ` Mingwei Zhang
2022-02-26  0:15 ` [PATCH v3 03/28] KVM: x86/mmu: Fix wrong/misleading comments in TDP MMU fast zap Sean Christopherson
2022-02-28 23:15   ` Ben Gardon
2022-02-26  0:15 ` [PATCH v3 04/28] KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush logic Sean Christopherson
2022-03-02 23:59   ` Mingwei Zhang
2022-03-03  0:12     ` Sean Christopherson
2022-03-03  1:20       ` Mingwei Zhang
2022-03-03  1:41         ` Sean Christopherson
2022-03-03  4:50           ` Mingwei Zhang
2022-03-03 16:45             ` Sean Christopherson
2022-02-26  0:15 ` [PATCH v3 05/28] KVM: x86/mmu: Document that zapping invalidated roots doesn't need to flush Sean Christopherson
2022-02-28 23:17   ` Ben Gardon
2022-02-26  0:15 ` [PATCH v3 06/28] KVM: x86/mmu: Require mmu_lock be held for write in unyielding root iter Sean Christopherson
2022-02-28 23:26   ` Ben Gardon
2022-02-26  0:15 ` [PATCH v3 07/28] KVM: x86/mmu: Check for !leaf=>leaf, not PFN change, in TDP MMU SP removal Sean Christopherson
2022-03-01  0:11   ` Ben Gardon
2022-03-03 18:02   ` Mingwei Zhang
2022-02-26  0:15 ` [PATCH v3 08/28] KVM: x86/mmu: Batch TLB flushes from TDP MMU for MMU notifier change_spte Sean Christopherson
2022-03-03 18:08   ` Mingwei Zhang
2022-02-26  0:15 ` [PATCH v3 09/28] KVM: x86/mmu: Drop RCU after processing each root in MMU notifier hooks Sean Christopherson
2022-03-03 18:24   ` Mingwei Zhang
2022-03-03 18:32   ` Mingwei Zhang
2022-02-26  0:15 ` [PATCH v3 10/28] KVM: x86/mmu: Add helpers to read/write TDP MMU SPTEs and document RCU Sean Christopherson
2022-03-03 18:34   ` Mingwei Zhang
2022-02-26  0:15 ` [PATCH v3 11/28] KVM: x86/mmu: WARN if old _or_ new SPTE is REMOVED in non-atomic path Sean Christopherson
2022-03-03 18:37   ` Mingwei Zhang
2022-02-26  0:15 ` [PATCH v3 12/28] KVM: x86/mmu: Refactor low-level TDP MMU set SPTE helper to take raw vals Sean Christopherson
2022-03-03 18:47   ` Mingwei Zhang
2022-02-26  0:15 ` [PATCH v3 13/28] KVM: x86/mmu: Zap only the target TDP MMU shadow page in NX recovery Sean Christopherson
2022-02-26  0:15 ` [PATCH v3 14/28] KVM: x86/mmu: Skip remote TLB flush when zapping all of TDP MMU Sean Christopherson
2022-03-01  0:19   ` Ben Gardon
2022-03-03 18:50   ` Mingwei Zhang
2022-02-26  0:15 ` [PATCH v3 15/28] KVM: x86/mmu: Add dedicated helper to zap TDP MMU root shadow page Sean Christopherson
2022-03-01  0:32   ` Ben Gardon
2022-03-03 21:19   ` Mingwei Zhang
2022-03-03 21:24     ` Mingwei Zhang
2022-03-03 23:06       ` Sean Christopherson
2022-02-26  0:15 ` [PATCH v3 16/28] KVM: x86/mmu: Require mmu_lock be held for write to zap TDP MMU range Sean Christopherson
2022-02-26  0:15 ` [PATCH v3 17/28] KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range() Sean Christopherson
2022-02-26  0:15 ` [PATCH v3 18/28] KVM: x86/mmu: Do remote TLB flush before dropping RCU in TDP MMU resched Sean Christopherson
2022-02-26  0:15 ` [PATCH v3 19/28] KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow pages Sean Christopherson
2022-02-26  0:15 ` [PATCH v3 20/28] KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root Sean Christopherson
2022-03-01 18:21   ` Paolo Bonzini
2022-03-01 19:43     ` Sean Christopherson
2022-03-01 20:12       ` Paolo Bonzini
2022-03-02  2:13         ` Sean Christopherson
2022-03-02 14:54           ` Paolo Bonzini
2022-03-02 17:43             ` Sean Christopherson
2022-02-26  0:15 ` [PATCH v3 21/28] KVM: x86/mmu: Zap roots in two passes to avoid inducing RCU stalls Sean Christopherson
2022-03-01  0:43   ` Ben Gardon
2022-02-26  0:15 ` [PATCH v3 22/28] KVM: x86/mmu: Zap defunct roots via asynchronous worker Sean Christopherson
2022-03-01 17:57   ` Ben Gardon
2022-03-02 17:25   ` Paolo Bonzini [this message]
2022-03-02 17:35     ` Sean Christopherson
2022-03-02 18:33       ` David Matlack
2022-03-02 18:36         ` Paolo Bonzini
2022-03-02 18:01     ` Sean Christopherson
2022-03-02 18:20       ` Paolo Bonzini
2022-03-02 19:33         ` Sean Christopherson
2022-03-02 20:14           ` Paolo Bonzini
2022-03-02 20:47             ` Sean Christopherson
2022-03-02 21:22               ` Paolo Bonzini
2022-03-02 22:25                 ` Sean Christopherson
2022-02-26  0:15 ` [PATCH v3 23/28] KVM: x86/mmu: Check for a REMOVED leaf SPTE before making the SPTE Sean Christopherson
2022-03-01 18:06   ` Ben Gardon
2022-02-26  0:15 ` [PATCH v3 24/28] KVM: x86/mmu: WARN on any attempt to atomically update REMOVED SPTE Sean Christopherson
2022-02-26  0:15 ` [PATCH v3 25/28] KVM: selftests: Move raw KVM_SET_USER_MEMORY_REGION helper to utils Sean Christopherson
2022-02-26  0:15 ` [PATCH v3 26/28] KVM: selftests: Split out helper to allocate guest mem via memfd Sean Christopherson
2022-02-28 23:36   ` David Woodhouse
2022-03-02 18:36     ` Paolo Bonzini
2022-03-02 21:55       ` David Woodhouse
2022-02-26  0:15 ` [PATCH v3 27/28] KVM: selftests: Define cpu_relax() helpers for s390 and x86 Sean Christopherson
2022-02-26  0:15 ` [PATCH v3 28/28] KVM: selftests: Add test to populate a VM with the max possible guest mem Sean Christopherson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b9270432-4ee8-be8e-8aa1-4b09992f82b8@redhat.com \
    --to=pbonzini@redhat.com \
    --cc=bgardon@google.com \
    --cc=borntraeger@linux.ibm.com \
    --cc=david@redhat.com \
    --cc=dmatlack@google.com \
    --cc=frankja@linux.ibm.com \
    --cc=imbrenda@linux.ibm.com \
    --cc=jmattson@google.com \
    --cc=joro@8bytes.org \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mizhang@google.com \
    --cc=seanjc@google.com \
    --cc=vkuznets@redhat.com \
    --cc=wanpengli@tencent.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.