linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Ben Gardon <bgardon@google.com>
To: Sean Christopherson <seanjc@google.com>
Cc: LKML <linux-kernel@vger.kernel.org>, kvm <kvm@vger.kernel.org>,
	Paolo Bonzini <pbonzini@redhat.com>, Peter Xu <peterx@redhat.com>,
	David Matlack <dmatlack@google.com>,
	Jim Mattson <jmattson@google.com>,
	David Dunn <daviddunn@google.com>,
	Jing Zhang <jingzhangos@google.com>,
	Junaid Shahid <junaids@google.com>
Subject: Re: [PATCH v2 9/9] KVM: x86/mmu: Promote pages in-place when disabling dirty logging
Date: Mon, 25 Apr 2022 11:09:37 -0700	[thread overview]
Message-ID: <CANgfPd-xxxzV3fp55Gx3Y_5ugfkcXMiipvgBMvYHVvAEdhrAMQ@mail.gmail.com> (raw)
In-Reply-To: <YlWsImxP0C01BUtM@google.com>

On Tue, Apr 12, 2022 at 9:43 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Mar 21, 2022, Ben Gardon wrote:
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index 1bff453f7cbe..6c08a5731fcb 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -171,4 +171,10 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> >  void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >  void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >
> > +void
> > +build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
> > +                             int shadow_root_level);
>
> Same comments from the earlier patch.
>
> > +extern int max_huge_page_level __read_mostly;
>
> Can you put this at the top of the heaader?  x86.h somehow ended up with extern
> variables being declared in the middle of the file and I find it very jarring,
> e.g. global definitions are pretty much never buried in the middle of a .c file.

Will do. I'm working on a v3 of this series now.

>
>
> >  #endif /* __KVM_X86_MMU_INTERNAL_H */
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index af60922906ef..eb8929e394ec 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1709,6 +1709,66 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
> >               clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot);
> >  }
> >
> > +static bool try_promote_lpage(struct kvm *kvm,
>
> I believe we've settled on huge_page instead of lpage.
>
> And again, I strongly prefer a 0/-errno return instead of a boolean as seeing
> -EBUSY or whatever makes it super obviously that the early returns are failure
> paths.

Will do. To your and David's comments about retries, this makes the
retry scheme really nice and clean.

>
> > +                           const struct kvm_memory_slot *slot,
> > +                           struct tdp_iter *iter)
> > +{
> > +     struct kvm_mmu_page *sp = sptep_to_sp(iter->sptep);
> > +     struct rsvd_bits_validate shadow_zero_check;
> > +     bool map_writable;
> > +     kvm_pfn_t pfn;
> > +     u64 new_spte;
> > +     u64 mt_mask;
> > +
> > +     /*
> > +      * If addresses are being invalidated, don't do in-place promotion to
> > +      * avoid accidentally mapping an invalidated address.
> > +      */
> > +     if (unlikely(kvm->mmu_notifier_count))
> > +             return false;
> > +
> > +     if (iter->level > max_huge_page_level || iter->gfn < slot->base_gfn ||
> > +         iter->gfn >= slot->base_gfn + slot->npages)
> > +             return false;
> > +
> > +     pfn = __gfn_to_pfn_memslot(slot, iter->gfn, true, NULL, true,
> > +                                &map_writable, NULL);
> > +     if (is_error_noslot_pfn(pfn))
> > +             return false;
> > +
> > +     /*
> > +      * Can't reconstitute an lpage if the consituent pages can't be
>
> "huge page", though honestly I'd just drop the comment, IMO this is more intuitive
> then say the checks against the slot stuff above.
>
> > +      * mapped higher.
> > +      */
> > +     if (iter->level > kvm_mmu_max_mapping_level(kvm, slot, iter->gfn,
> > +                                                 pfn, PG_LEVEL_NUM))
> > +             return false;
> > +
> > +     build_tdp_shadow_zero_bits_mask(&shadow_zero_check, iter->root_level);
> > +
> > +     /*
> > +      * In some cases, a vCPU pointer is required to get the MT mask,
> > +      * however in most cases it can be generated without one. If a
> > +      * vCPU pointer is needed kvm_x86_try_get_mt_mask will fail.
> > +      * In that case, bail on in-place promotion.
> > +      */
> > +     if (unlikely(!static_call(kvm_x86_try_get_mt_mask)(kvm, iter->gfn,
>
> I wouldn't bother with the "unlikely".  It's wrong for a VM with non-coherent DMA,
> and it's very unlikely (heh) to actually be a meaningful optimization in any case.
>
> > +                                                        kvm_is_mmio_pfn(pfn),
> > +                                                        &mt_mask)))
> > +             return false;
> > +
> > +     __make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true,
>
> A comment stating the return type is intentionally ignore would be helpful.  Not
> strictly necessary because it's mostly obvious after looking at the details, but
> it'd save someone from having to dig into said details.
>
> > +               map_writable, mt_mask, &shadow_zero_check, &new_spte);
>
> Bad indentation.
>
> > +
> > +     if (tdp_mmu_set_spte_atomic(kvm, iter, new_spte))
> > +             return true;
>
> And by returning an int, and because the failure path rereads the SPTE for you,
> this becomes:
>
>         return tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
>
> > +
> > +     /* Re-read the SPTE as it must have been changed by another thread. */
> > +     iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
> > +
> > +     return false;
> > +}
> > +
> >  /*
> >   * Clear leaf entries which could be replaced by large mappings, for
> >   * GFNs within the slot.
>
> This comment needs to be updated to include the huge page promotion behavior. And
> maybe renamed the function too?  E.g.
>
> static void zap_or_promote_collapsible_sptes(struct kvm *kvm,
>                                              struct kvm_mmu_page *root,
>                                              const struct kvm_memory_slot *slot)
>
> > @@ -1729,8 +1789,17 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> >               if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
> >                       continue;
> >
> > -             if (!is_shadow_present_pte(iter.old_spte) ||
> > -                 !is_last_spte(iter.old_spte, iter.level))
> > +             if (iter.level > max_huge_page_level ||
> > +                 iter.gfn < slot->base_gfn ||
> > +                 iter.gfn >= slot->base_gfn + slot->npages)
>
> Isn't this exact check in try_promote_lpage()?  Ditto for the kvm_mmu_max_mapping_level()
> check that's just out of sight.  That one in particular can be somewhat expsensive,
> especially when KVM is fixed to use a helper that disable IRQs so the host page tables
> aren't freed while they're being walked.  Oh, and the huge page promotion path
> doesn't incorporate the reserved pfn check.
>
> In other words, shouldn't this be:
>
>
>                 if (!is_shadow_present_pte(iter.old_spte))
>                         continue;
>
>                 if (iter.level > max_huge_page_level ||
>                     iter.gfn < slot->base_gfn ||
>                     iter.gfn >= slot->base_gfn + slot->npages)
>                         continue;
>
>                 pfn = spte_to_pfn(iter.old_spte);
>                 if (kvm_is_reserved_pfn(pfn) ||
>                     iter.level >= kvm_mmu_max_mapping_level(kvm, slot, iter.gfn,
>                                                             pfn, PG_LEVEL_NUM))
>                         continue;
>
> Followed by the promotion stuff.  And then unless I'm overlooking something, "pfn"
> can be passed into try_promote_huge_page(), it just needs to be masked appropriately.
> I.e. the promotion path can avoid the __gfn_to_pfn_memslot() lookup and also drop
> its is_error_noslot_pfn() check since the pfn is pulled from the SPTE and KVM should
> never install garbage into the SPTE (emulated/noslot MMIO pfns fail the shadow
> present check).

I'll work on deduplicating the checks. The big distinction is that in
the promotion function, the iterator is still at the non-leaf SPTE, so
if we get the PFN from the SPTE, it'll be the PFN of a page table, not
a PFN backing guest memory. I could use the same GFN to PFN memslot
conversion in both cases, but it seems more expensive than extracting
the PFN from the SPTE.

__gfn_to_pfn_memslot should never return a reserved PFN right?

>
> > +                     continue;
> > +
> > +             if (!is_shadow_present_pte(iter.old_spte))
> > +                     continue;
>
> I strongly prefer to keep the !is_shadow_present_pte() check first, it really
> should be the first thing any of these flows check.
>
> > +
> > +             /* Try to promote the constitutent pages to an lpage. */
> > +             if (!is_last_spte(iter.old_spte, iter.level) &&
> > +                 try_promote_lpage(kvm, slot, &iter))
>
> There is an undocumented function change here, and I can't tell if it's intentional.
> If the promotion fails, KVM continues on an zaps the non-leaf shadow page.  If that
> is intentional behavior, it should be done in a follow-up patch, e.g. so that it can
> be easily reverted if it turns out that zappping e.g. a PUD is bad for performance.
>
> I.e. shouldn't this be:
>
>                 if (!is_last_spte(iter.old_spte, iter.level)) {
>                         try_promote_huge_page(...);
>                         continue;
>                 }
>
> and then converted to the current variant in a follow-up?

Ah, good point.

>
> >                       continue;
> >
> >               pfn = spte_to_pfn(iter.old_spte);
> > --
> > 2.35.1.894.gb6a874cedc-goog
> >

  reply	other threads:[~2022-04-25 18:09 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-21 22:43 [PATCH v2 0/9] KVM: x86/MMU: Optimize disabling dirty logging Ben Gardon
2022-03-21 22:43 ` [PATCH v2 1/9] KVM: x86/mmu: Move implementation of make_spte to a helper Ben Gardon
2022-03-21 22:43 ` [PATCH v2 2/9] KVM: x86/mmu: Factor mt_mask out of __make_spte Ben Gardon
2022-03-21 22:43 ` [PATCH v2 3/9] KVM: x86/mmu: Factor shadow_zero_check " Ben Gardon
2022-04-12 15:52   ` Sean Christopherson
2022-03-21 22:43 ` [PATCH v2 4/9] KVM: x86/mmu: Replace vcpu argument with kvm pointer in make_spte Ben Gardon
2022-03-21 22:43 ` [PATCH v2 5/9] KVM: x86/mmu: Factor out the meat of reset_tdp_shadow_zero_bits_mask Ben Gardon
2022-04-12 15:46   ` Sean Christopherson
2022-04-21 18:50     ` Ben Gardon
2022-04-21 19:09       ` Ben Gardon
2022-03-21 22:43 ` [PATCH v2 6/9] KVM: x86/mmu: Factor out part of vmx_get_mt_mask which does not depend on vcpu Ben Gardon
2022-03-28 18:04   ` David Matlack
2022-03-21 22:43 ` [PATCH v2 7/9] KVM: x86/mmu: Add try_get_mt_mask to x86_ops Ben Gardon
2022-04-11 23:00   ` Sean Christopherson
2022-04-11 23:24     ` Ben Gardon
2022-04-11 23:33     ` Sean Christopherson
2022-04-12 19:30     ` Sean Christopherson
2022-03-21 22:43 ` [PATCH v2 8/9] KVM: x86/mmu: Make kvm_is_mmio_pfn usable outside of spte.c Ben Gardon
2022-04-12 19:39   ` Sean Christopherson
2022-03-21 22:43 ` [PATCH v2 9/9] KVM: x86/mmu: Promote pages in-place when disabling dirty logging Ben Gardon
2022-03-28 17:45   ` David Matlack
2022-03-28 18:07     ` Ben Gardon
2022-03-28 18:20       ` David Matlack
2022-07-12 23:21       ` Sean Christopherson
2022-07-13 16:20         ` Sean Christopherson
2022-03-28 18:21   ` David Matlack
2022-04-12 16:43   ` Sean Christopherson
2022-04-25 18:09     ` Ben Gardon [this message]
2022-03-25 12:00 ` [PATCH v2 0/9] KVM: x86/MMU: Optimize " Paolo Bonzini
2022-07-12  1:37   ` Sean Christopherson
2022-07-14  7:55     ` Paolo Bonzini
2022-07-14 15:27       ` Sean Christopherson
2022-03-28 17:49 ` David Matlack

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CANgfPd-xxxzV3fp55Gx3Y_5ugfkcXMiipvgBMvYHVvAEdhrAMQ@mail.gmail.com \
    --to=bgardon@google.com \
    --cc=daviddunn@google.com \
    --cc=dmatlack@google.com \
    --cc=jingzhangos@google.com \
    --cc=jmattson@google.com \
    --cc=junaids@google.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=peterx@redhat.com \
    --cc=seanjc@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).