From: Sean Christopherson <seanjc@google.com>
To: Ben Gardon <bgardon@google.com>
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	Paolo Bonzini <pbonzini@redhat.com>, Peter Xu <peterx@redhat.com>,
	David Matlack <dmatlack@google.com>,
	Jim Mattson <jmattson@google.com>,
	David Dunn <daviddunn@google.com>,
	Jing Zhang <jingzhangos@google.com>,
	Junaid Shahid <junaids@google.com>
Subject: Re: [PATCH v2 9/9] KVM: x86/mmu: Promote pages in-place when disabling dirty logging
Date: Tue, 12 Apr 2022 16:43:14 +0000
Message-ID: <YlWsImxP0C01BUtM@google.com>
In-Reply-To: <20220321224358.1305530-10-bgardon@google.com>

On Mon, Mar 21, 2022, Ben Gardon wrote:
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 1bff453f7cbe..6c08a5731fcb 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -171,4 +171,10 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>  void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>  void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>  
> +void
> +build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
> +				int shadow_root_level);

Same comments from the earlier patch.

> +extern int max_huge_page_level __read_mostly;

Can you put this at the top of the header?  x86.h somehow ended up with extern
variables being declared in the middle of the file and I find it very jarring,
e.g. global definitions are pretty much never buried in the middle of a .c file.
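
E.g. something like this (placement doesn't matter all that much, the point is just
to keep it with the other top-of-file declarations instead of burying it):

	#ifndef __KVM_X86_MMU_INTERNAL_H
	#define __KVM_X86_MMU_INTERNAL_H

	...

	extern int max_huge_page_level __read_mostly;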

>  #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index af60922906ef..eb8929e394ec 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1709,6 +1709,66 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
>  		clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot);
>  }
>  
> +static bool try_promote_lpage(struct kvm *kvm,

I believe we've settled on huge_page instead of lpage.

And again, I strongly prefer a 0/-errno return instead of a boolean, as seeing
-EBUSY or whatever makes it super obvious that the early returns are failure
paths.
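
E.g. something like (completely untested):

	static int try_promote_huge_page(struct kvm *kvm,
					 const struct kvm_memory_slot *slot,
					 struct tdp_iter *iter)
	{
		...

		if (unlikely(kvm->mmu_notifier_count))
			return -EBUSY;

		...
	}

so that every bail-out reads as an obvious error path, and success is a plain
"return 0" (or the result of tdp_mmu_set_spte_atomic(), per below).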

> +			      const struct kvm_memory_slot *slot,
> +			      struct tdp_iter *iter)
> +{
> +	struct kvm_mmu_page *sp = sptep_to_sp(iter->sptep);
> +	struct rsvd_bits_validate shadow_zero_check;
> +	bool map_writable;
> +	kvm_pfn_t pfn;
> +	u64 new_spte;
> +	u64 mt_mask;
> +
> +	/*
> +	 * If addresses are being invalidated, don't do in-place promotion to
> +	 * avoid accidentally mapping an invalidated address.
> +	 */
> +	if (unlikely(kvm->mmu_notifier_count))
> +		return false;
> +
> +	if (iter->level > max_huge_page_level || iter->gfn < slot->base_gfn ||
> +	    iter->gfn >= slot->base_gfn + slot->npages)
> +		return false;
> +
> +	pfn = __gfn_to_pfn_memslot(slot, iter->gfn, true, NULL, true,
> +				   &map_writable, NULL);
> +	if (is_error_noslot_pfn(pfn))
> +		return false;
> +
> +	/*
> +	 * Can't reconstitute an lpage if the constituent pages can't be

"huge page", though honestly I'd just drop the comment, IMO this is more intuitive
then say the checks against the slot stuff above.

> +	 * mapped higher.
> +	 */
> +	if (iter->level > kvm_mmu_max_mapping_level(kvm, slot, iter->gfn,
> +						    pfn, PG_LEVEL_NUM))
> +		return false;
> +
> +	build_tdp_shadow_zero_bits_mask(&shadow_zero_check, iter->root_level);
> +
> +	/*
> +	 * In some cases, a vCPU pointer is required to get the MT mask,
> +	 * however in most cases it can be generated without one. If a
> +	 * vCPU pointer is needed kvm_x86_try_get_mt_mask will fail.
> +	 * In that case, bail on in-place promotion.
> +	 */
> +	if (unlikely(!static_call(kvm_x86_try_get_mt_mask)(kvm, iter->gfn,

I wouldn't bother with the "unlikely".  It's wrong for a VM with non-coherent DMA,
and it's very unlikely (heh) to actually be a meaningful optimization in any case.

> +							   kvm_is_mmio_pfn(pfn),
> +							   &mt_mask)))
> +		return false;
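
I.e. with the "unlikely" dropped, just:

	if (!static_call(kvm_x86_try_get_mt_mask)(kvm, iter->gfn,
						   kvm_is_mmio_pfn(pfn), &mt_mask))
		return false;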
> +
> +	__make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true,

A comment stating the return value is intentionally ignored would be helpful.  Not
strictly necessary because it's mostly obvious after looking at the details, but
it'd save someone from having to dig into said details.

> +		  map_writable, mt_mask, &shadow_zero_check, &new_spte);

Bad indentation.
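
E.g., with the indentation fixed up, something like (exact comment wording per
whatever the return value ends up meaning):

	/* The return value is intentionally ignored, see __make_spte(). */
	__make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true,
		    map_writable, mt_mask, &shadow_zero_check, &new_spte);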

> +
> +	if (tdp_mmu_set_spte_atomic(kvm, iter, new_spte))
> +		return true;

And by returning an int, and because the failure path rereads the SPTE for you,
this becomes:

	return tdp_mmu_set_spte_atomic(kvm, iter, new_spte);

> +
> +	/* Re-read the SPTE as it must have been changed by another thread. */
> +	iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
> +
> +	return false;
> +}
> +
>  /*
>   * Clear leaf entries which could be replaced by large mappings, for
>   * GFNs within the slot.

This comment needs to be updated to include the huge page promotion behavior. And
maybe rename the function too?  E.g.

static void zap_or_promote_collapsible_sptes(struct kvm *kvm,
					     struct kvm_mmu_page *root,
					     const struct kvm_memory_slot *slot)

> @@ -1729,8 +1789,17 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
>  		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
>  			continue;
>  
> -		if (!is_shadow_present_pte(iter.old_spte) ||
> -		    !is_last_spte(iter.old_spte, iter.level))
> +		if (iter.level > max_huge_page_level ||
> +		    iter.gfn < slot->base_gfn ||
> +		    iter.gfn >= slot->base_gfn + slot->npages)

Isn't this exact check in try_promote_lpage()?  Ditto for the kvm_mmu_max_mapping_level()
check that's just out of sight.  That one in particular can be somewhat expensive,
especially when KVM is fixed to use a helper that disables IRQs so the host page tables
aren't freed while they're being walked.  Oh, and the huge page promotion path
doesn't incorporate the reserved pfn check.

In other words, shouldn't this be:


		if (!is_shadow_present_pte(iter.old_spte))
			continue;

		if (iter.level > max_huge_page_level ||
		    iter.gfn < slot->base_gfn ||
		    iter.gfn >= slot->base_gfn + slot->npages)
			continue;

		pfn = spte_to_pfn(iter.old_spte);
		if (kvm_is_reserved_pfn(pfn) ||
		    iter.level >= kvm_mmu_max_mapping_level(kvm, slot, iter.gfn,
							    pfn, PG_LEVEL_NUM))
			continue;

Followed by the promotion stuff.  And then unless I'm overlooking something, "pfn"
can be passed into try_promote_huge_page(), it just needs to be masked appropriately.
I.e. the promotion path can avoid the __gfn_to_pfn_memslot() lookup and also drop
its is_error_noslot_pfn() check since the pfn is pulled from the SPTE and KVM should
never install garbage into the SPTE (emulated/noslot MMIO pfns fail the shadow
present check).

> +			continue;
> +
> +		if (!is_shadow_present_pte(iter.old_spte))
> +			continue;

I strongly prefer to keep the !is_shadow_present_pte() check first, it really
should be the first thing any of these flows check.

> +
> +		/* Try to promote the constituent pages to an lpage. */
> +		if (!is_last_spte(iter.old_spte, iter.level) &&
> +		    try_promote_lpage(kvm, slot, &iter))

There is an undocumented functional change here, and I can't tell if it's intentional.
If the promotion fails, KVM continues on and zaps the non-leaf shadow page.  If that
is intentional behavior, it should be done in a follow-up patch, e.g. so that it can
be easily reverted if it turns out that zapping e.g. a PUD is bad for performance.

I.e. shouldn't this be:

		if (!is_last_spte(iter.old_spte, iter.level)) {
			try_promote_huge_page(...);
			continue;
		}

and then converted to the current variant in a follow-up?

>  			continue;
>  
>  		pfn = spte_to_pfn(iter.old_spte);
> -- 
> 2.35.1.894.gb6a874cedc-goog
> 
