From: Sean Christopherson <seanjc@google.com>
To: Ben Gardon <bgardon@google.com>
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	Paolo Bonzini <pbonzini@redhat.com>, Peter Xu <peterx@redhat.com>,
	David Matlack <dmatlack@google.com>,
	Jim Mattson <jmattson@google.com>,
	David Dunn <daviddunn@google.com>,
	Jing Zhang <jingzhangos@google.com>,
	Junaid Shahid <junaids@google.com>
Subject: Re: [PATCH v2 9/9] KVM: x86/mmu: Promote pages in-place when disabling dirty logging
Date: Tue, 12 Apr 2022 16:43:14 +0000
Message-ID: <YlWsImxP0C01BUtM@google.com>
In-Reply-To: <20220321224358.1305530-10-bgardon@google.com>

On Mon, Mar 21, 2022, Ben Gardon wrote:
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 1bff453f7cbe..6c08a5731fcb 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -171,4 +171,10 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>  void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>  void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>  
> +void
> +build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
> +				int shadow_root_level);

Same comments from the earlier patch.

> +extern int max_huge_page_level __read_mostly;

Can you put this at the top of the header?  x86.h somehow ended up with extern
variables being declared in the middle of the file and I find it very jarring,
e.g. global definitions are pretty much never buried in the middle of a .c file.
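
E.g. something like this (placement doesn't matter all that much, the point is just
to keep it with the other top-of-file declarations instead of burying it):

	#ifndef __KVM_X86_MMU_INTERNAL_H
	#define __KVM_X86_MMU_INTERNAL_H

	...

	extern int max_huge_page_level __read_mostly;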

>  #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index af60922906ef..eb8929e394ec 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1709,6 +1709,66 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
>  		clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot);
>  }
>  
> +static bool try_promote_lpage(struct kvm *kvm,

I believe we've settled on huge_page instead of lpage.

And again, I strongly prefer a 0/-errno return instead of a boolean, as seeing
-EBUSY or whatever makes it super obvious that the early returns are failure
paths.
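
E.g. something like (completely untested):

	static int try_promote_huge_page(struct kvm *kvm,
					 const struct kvm_memory_slot *slot,
					 struct tdp_iter *iter)
	{
		...

		if (unlikely(kvm->mmu_notifier_count))
			return -EBUSY;

		...
	}

so that every bail-out reads as an obvious error path, and success is a plain
"return 0" (or the result of tdp_mmu_set_spte_atomic(), per below).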

> +			      const struct kvm_memory_slot *slot,
> +			      struct tdp_iter *iter)
> +{
> +	struct kvm_mmu_page *sp = sptep_to_sp(iter->sptep);
> +	struct rsvd_bits_validate shadow_zero_check;
> +	bool map_writable;
> +	kvm_pfn_t pfn;
> +	u64 new_spte;
> +	u64 mt_mask;
> +
> +	/*
> +	 * If addresses are being invalidated, don't do in-place promotion to
> +	 * avoid accidentally mapping an invalidated address.
> +	 */
> +	if (unlikely(kvm->mmu_notifier_count))
> +		return false;
> +
> +	if (iter->level > max_huge_page_level || iter->gfn < slot->base_gfn ||
> +	    iter->gfn >= slot->base_gfn + slot->npages)
> +		return false;
> +
> +	pfn = __gfn_to_pfn_memslot(slot, iter->gfn, true, NULL, true,
> +				   &map_writable, NULL);
> +	if (is_error_noslot_pfn(pfn))
> +		return false;
> +
> +	/*
> +	 * Can't reconstitute an lpage if the constituent pages can't be

"huge page", though honestly I'd just drop the comment, IMO this is more intuitive
then say the checks against the slot stuff above.

> +	 * mapped higher.
> +	 */
> +	if (iter->level > kvm_mmu_max_mapping_level(kvm, slot, iter->gfn,
> +						    pfn, PG_LEVEL_NUM))
> +		return false;
> +
> +	build_tdp_shadow_zero_bits_mask(&shadow_zero_check, iter->root_level);
> +
> +	/*
> +	 * In some cases, a vCPU pointer is required to get the MT mask,
> +	 * however in most cases it can be generated without one. If a
> +	 * vCPU pointer is needed kvm_x86_try_get_mt_mask will fail.
> +	 * In that case, bail on in-place promotion.
> +	 */
> +	if (unlikely(!static_call(kvm_x86_try_get_mt_mask)(kvm, iter->gfn,

I wouldn't bother with the "unlikely".  It's wrong for a VM with non-coherent DMA,
and it's very unlikely (heh) to actually be a meaningful optimization in any case.

> +							   kvm_is_mmio_pfn(pfn),
> +							   &mt_mask)))
> +		return false;
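
I.e. with the "unlikely" dropped, just:

	if (!static_call(kvm_x86_try_get_mt_mask)(kvm, iter->gfn,
						   kvm_is_mmio_pfn(pfn), &mt_mask))
		return false;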
> +
> +	__make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true,

A comment stating the return value is intentionally ignored would be helpful.  Not
strictly necessary because it's mostly obvious after looking at the details, but
it'd save someone from having to dig into said details.

> +		  map_writable, mt_mask, &shadow_zero_check, &new_spte);

Bad indentation.
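
E.g., with the indentation fixed up, something like (exact comment wording per
whatever the return value ends up meaning):

	/* The return value is intentionally ignored, see __make_spte(). */
	__make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true,
		    map_writable, mt_mask, &shadow_zero_check, &new_spte);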

> +
> +	if (tdp_mmu_set_spte_atomic(kvm, iter, new_spte))
> +		return true;

And by returning an int, and because the failure path rereads the SPTE for you,
this becomes:

	return tdp_mmu_set_spte_atomic(kvm, iter, new_spte);

> +
> +	/* Re-read the SPTE as it must have been changed by another thread. */
> +	iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
> +
> +	return false;
> +}
> +
>  /*
>   * Clear leaf entries which could be replaced by large mappings, for
>   * GFNs within the slot.

This comment needs to be updated to include the huge page promotion behavior. And
maybe rename the function too?  E.g.

static void zap_or_promote_collapsible_sptes(struct kvm *kvm,
					     struct kvm_mmu_page *root,
					     const struct kvm_memory_slot *slot)

> @@ -1729,8 +1789,17 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
>  		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
>  			continue;
>  
> -		if (!is_shadow_present_pte(iter.old_spte) ||
> -		    !is_last_spte(iter.old_spte, iter.level))
> +		if (iter.level > max_huge_page_level ||
> +		    iter.gfn < slot->base_gfn ||
> +		    iter.gfn >= slot->base_gfn + slot->npages)

Isn't this exact check in try_promote_lpage()?  Ditto for the kvm_mmu_max_mapping_level()
check that's just out of sight.  That one in particular can be somewhat expensive,
especially when KVM is fixed to use a helper that disables IRQs so the host page tables
aren't freed while they're being walked.  Oh, and the huge page promotion path
doesn't incorporate the reserved pfn check.

In other words, shouldn't this be:


		if (!is_shadow_present_pte(iter.old_spte))
			continue;

		if (iter.level > max_huge_page_level ||
		    iter.gfn < slot->base_gfn ||
		    iter.gfn >= slot->base_gfn + slot->npages)
			continue;

		pfn = spte_to_pfn(iter.old_spte);
		if (kvm_is_reserved_pfn(pfn) ||
		    iter.level >= kvm_mmu_max_mapping_level(kvm, slot, iter.gfn,
							    pfn, PG_LEVEL_NUM))
			continue;

Followed by the promotion stuff.  And then unless I'm overlooking something, "pfn"
can be passed into try_promote_huge_page(), it just needs to be masked appropriately.
I.e. the promotion path can avoid the __gfn_to_pfn_memslot() lookup and also drop
its is_error_noslot_pfn() check since the pfn is pulled from the SPTE and KVM should
never install garbage into the SPTE (emulated/noslot MMIO pfns fail the shadow
present check).

> +			continue;
> +
> +		if (!is_shadow_present_pte(iter.old_spte))
> +			continue;

I strongly prefer to keep the !is_shadow_present_pte() check first, it really
should be the first thing any of these flows check.

> +
> +		/* Try to promote the constituent pages to an lpage. */
> +		if (!is_last_spte(iter.old_spte, iter.level) &&
> +		    try_promote_lpage(kvm, slot, &iter))

There is an undocumented functional change here, and I can't tell if it's intentional.
If the promotion fails, KVM continues on and zaps the non-leaf shadow page.  If that
is intentional behavior, it should be done in a follow-up patch, e.g. so that it can
be easily reverted if it turns out that zapping e.g. a PUD is bad for performance.

I.e. shouldn't this be:

		if (!is_last_spte(iter.old_spte, iter.level)) {
			try_promote_huge_page(...);
			continue;
		}

and then converted to the current variant in a follow-up?

>  			continue;
>  
>  		pfn = spte_to_pfn(iter.old_spte);
> -- 
> 2.35.1.894.gb6a874cedc-goog
> 
