From: Sean Christopherson <seanjc@google.com>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>,
	Janosch Frank <frankja@linux.ibm.com>,
	Claudio Imbrenda <imbrenda@linux.ibm.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>,
	David Hildenbrand <david@redhat.com>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	David Matlack <dmatlack@google.com>,
	Ben Gardon <bgardon@google.com>,
	Mingwei Zhang <mizhang@google.com>
Subject: Re: [PATCH v3 22/28] KVM: x86/mmu: Zap defunct roots via asynchronous worker
Date: Wed, 2 Mar 2022 18:01:39 +0000
Message-ID: <Yh+xA31FrfGoxXLB@google.com>
In-Reply-To: <b9270432-4ee8-be8e-8aa1-4b09992f82b8@redhat.com>

On Wed, Mar 02, 2022, Paolo Bonzini wrote:
> On 2/26/22 01:15, Sean Christopherson wrote:
> > Zap defunct roots, a.k.a. roots that have been invalidated after their
> > last reference was initially dropped, asynchronously via the system work
> > queue instead of forcing the work upon the unfortunate task that happened
> > to drop the last reference.
> > 
> > If a vCPU task drops the last reference, the vCPU is effectively blocked
> > by the host for the entire duration of the zap.  If the root being zapped
> > happens to be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging
> > being active, the zap can take several hundred seconds.  Unsurprisingly,
> > most guests are unhappy if a vCPU disappears for hundreds of seconds.
> > 
> > E.g. running a synthetic selftest that triggers a vCPU root zap with
> > ~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
> > Offloading the zap to a worker drops the block time to <100ms.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> 
> Do we even need kvm_tdp_mmu_zap_invalidated_roots() now?  That is,
> something like the following:

Nice!  I initially did something similar (moving invalidated roots to a separate
list), but never circled back to the idea after implementing the worker stuff.
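
For anyone skimming the thread, the core idea from the quoted changelog (whoever
drops the last reference schedules the zap instead of performing it) can be
sketched in userspace roughly as below.  This is only an illustration under
loud assumptions: every toy_* name is hypothetical, and a dedicated pthread
stands in for the system workqueue that the actual patch uses.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* All toy_* names are hypothetical stand-ins, not KVM symbols. */
struct toy_root {
	atomic_int refcount;
	atomic_bool zapped;	/* set by the worker once teardown is done */
	pthread_t worker;
	bool worker_started;
};

static void toy_root_init(struct toy_root *root)
{
	atomic_init(&root->refcount, 1);
	atomic_init(&root->zapped, false);
	root->worker_started = false;
}

/* The expensive teardown; runs on the worker thread, not in the caller. */
static void *toy_zap_root_work(void *arg)
{
	struct toy_root *root = arg;

	/* A real zap would walk and free page tables here. */
	atomic_store(&root->zapped, true);
	return NULL;
}

/*
 * Drop one reference.  On the final put, hand the teardown to a worker and
 * return immediately.  Returns true if this call scheduled the zap.
 */
static bool toy_put_root(struct toy_root *root)
{
	if (atomic_fetch_sub(&root->refcount, 1) != 1)
		return false;

	root->worker_started =
		pthread_create(&root->worker, NULL, toy_zap_root_work, root) == 0;
	return root->worker_started;
}

/* Wait for the scheduled teardown, loosely analogous to flushing a workqueue. */
static void toy_flush_root_work(struct toy_root *root)
{
	if (root->worker_started) {
		pthread_join(root->worker, NULL);
		root->worker_started = false;
	}
}
```

The caller of toy_put_root() only pays for scheduling the work, which is the
property the changelog is after for vCPU tasks.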

> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index bd3625a875ef..5fd8bc858c6f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5698,6 +5698,16 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
>  {
>  	lockdep_assert_held(&kvm->slots_lock);
> +	/*
> +	 * kvm_tdp_mmu_invalidate_all_roots() needs a nonzero reference
> +	 * count.  If we're dying, zap everything as it's going to happen
> +	 * soon anyway.
> +	 */
> +	if (!refcount_read(&kvm->users_count)) {
> +		kvm_mmu_zap_all(kvm);
> +		return;
> +	}

I'd prefer we make this an assertion and shove this logic to set_nx_huge_pages(),
because in that case there's no need to zap anything, the guest can never run
again.  E.g. (I'm trying to remember why I didn't do this before...)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b2c1c4eb6007..d4d25ab88ae7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6132,7 +6132,8 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
 
                list_for_each_entry(kvm, &vm_list, vm_list) {
                        mutex_lock(&kvm->slots_lock);
-                       kvm_mmu_zap_all_fast(kvm);
+                       if (refcount_read(&kvm->users_count))
+                               kvm_mmu_zap_all_fast(kvm);
                        mutex_unlock(&kvm->slots_lock);
 
                        wake_up_process(kvm->arch.nx_lpage_recovery_thread);
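
To spell out the split of responsibilities suggested above (the caller skips
dying VMs, the callee merely asserts the precondition), here is a minimal
userspace sketch.  The toy_* names and the zap_calls counter are hypothetical
and exist only for illustration; assert() stands in for whatever assertion the
kernel code would actually use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct kvm; only the fields the sketch needs. */
struct toy_vm {
	atomic_uint users_count;	/* plays the role of kvm->users_count */
	unsigned int zap_calls;		/* counts fast zaps, for illustration */
};

/* Callee: requires a live VM and asserts it rather than handling a dying one. */
static void toy_zap_all_fast(struct toy_vm *vm)
{
	assert(atomic_load(&vm->users_count) != 0);
	vm->zap_calls++;
}

/* Caller: a VM with no users can never run again, so there is nothing to zap. */
static void toy_zap_live_vms(struct toy_vm *vms, size_t nr_vms)
{
	for (size_t i = 0; i < nr_vms; i++) {
		if (atomic_load(&vms[i].users_count))
			toy_zap_all_fast(&vms[i]);
	}
}
```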


> +
>  	write_lock(&kvm->mmu_lock);
>  	trace_kvm_mmu_zap_all_fast(kvm);
> @@ -5732,20 +5742,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
>  	kvm_zap_obsolete_pages(kvm);
>  	write_unlock(&kvm->mmu_lock);
> -
> -	/*
> -	 * Zap the invalidated TDP MMU roots, all SPTEs must be dropped before
> -	 * returning to the caller, e.g. if the zap is in response to a memslot
> -	 * deletion, mmu_notifier callbacks will be unable to reach the SPTEs
> -	 * associated with the deleted memslot once the update completes, and
> -	 * Deferring the zap until the final reference to the root is put would
> -	 * lead to use-after-free.
> -	 */
> -	if (is_tdp_mmu_enabled(kvm)) {
> -		read_lock(&kvm->mmu_lock);
> -		kvm_tdp_mmu_zap_invalidated_roots(kvm);
> -		read_unlock(&kvm->mmu_lock);
> -	}
>  }
>  static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index cd1bf68e7511..af9db5b8f713 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -142,10 +142,12 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
>  	WARN_ON(!root->tdp_mmu_page);
>  	/*
> -	 * The root now has refcount=0 and is valid.  Readers cannot acquire
> -	 * a reference to it (they all visit valid roots only, except for
> -	 * kvm_tdp_mmu_zap_invalidated_roots() which however does not acquire
> -	 * any reference itself.
> +	 * The root now has refcount=0.  It is valid, but readers already
> +	 * cannot acquire a reference to it because kvm_tdp_mmu_get_root()
> +	 * rejects it.  This remains true for the rest of the execution
> +	 * of this function, because readers visit valid roots only

One thing that keeps tripping me up is the "readers" verbiage.  I get confused
because taking mmu_lock for read vs. write doesn't really have anything to do with
reading or writing state, e.g. "readers" still write SPTEs, and so I keep thinking
"readers" means anything iterating over the set of roots.  Not sure if there's a
shorthand that won't be confusing.

> +	 * (except for tdp_mmu_zap_root_work(), which however operates only
> +	 * on one specific root and does not acquire any reference itself).
> 
>  	 *
>  	 * Even though there are flows that need to visit all roots for
>  	 * correctness, they all take mmu_lock for write, so they cannot yet

...

> It passes a smoke test, and also resolves the debate on the fate of patch 1.

+1000, I love this approach.  Do you want me to work on a v3, or shall I let you
have the honors?
