+Yosry

On Wed, Mar 23, 2022, Mingwei Zhang wrote:
> Add extra check to specify the case of nx hugepage and allow KVM to
> reconstruct large mapping after dirty logging is disabled. Existing code
> works only for nx hugepage but the condition is too general in that does
> not consider other usage case (such as dirty logging). Note that
> when dirty logging is disabled, KVM calls kvm_mmu_zap_collapsible_sptes()
> which only zaps leaf SPTEs.

This opening is very, very confusing.  A big part of the confusion is due to
poor naming in KVM, where kvm_mmu_page.lpage_disallowed really should be
nx_huge_page_disallowed.  Enabling dirty logging also disables huge pages, but
that goes through the memslot's disallow_lpage.  Reading through this
paragraph, and looking at the code, it sounds like disallowed_hugepage_adjust()
is explicitly checking if a page is disallowed due to dirty logging, which is
not the case.

I'd prefer the changelog lead with the bug it's fixing and only briefly mention
dirty logging as a scenario where KVM can end up with shadow pages without
child SPTEs.  Such scenarios can also happen with mmu_notifier calls, etc...

E.g.

  Explicitly check if a NX huge page is disallowed when determining if a page
  fault needs to be forced to use a smaller sized page.  KVM incorrectly
  assumes that the NX huge page mitigation is the only scenario where KVM will
  create a shadow page instead of a huge page.  Any scenario that causes KVM
  to zap leaf SPTEs may result in having a SP that can be made huge without
  violating the NX huge page mitigation.  E.g. disabling of dirty logging,
  zapping from mmu_notifier due to page migration, guest MTRR changes that
  affect the viability of a huge page, etc...

> Moreover, existing code assumes that a present PMD or PUD indicates that
> there exist 'smaller SPTEs' under the paging structure. This assumption may
> no be true if KVM zaps only leafs in MMU.
>
> Missing the check causes KVM incorrectly regards the faulting page as a NX
> huge page and refuse to map it at desired level. And this leads to back
> performance issue in shadow mmu and potentially in TDP mmu as well.
>
> Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation")
> Cc: stable@vger.kernel.org

I vote to keep the Fixes tag, but drop the Cc: stable@ and NAK any
MANUALSEL/AUTOSEL for backporting this to stable trees.  Yes, it's a bug, but
the NX huge page zapping kthread will (eventually) reclaim the lost
performance.  On the flip side, if there's an edge case we mess up (see below),
then we've introduced a far worse bug.

> Reviewed-by: Ben Gardon
> Signed-off-by: Mingwei Zhang
> ---
>  arch/x86/kvm/mmu/mmu.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 5628d0ba637e..d9b2001d8217 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2919,6 +2919,16 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
>  	    cur_level == fault->goal_level &&
>  	    is_shadow_present_pte(spte) &&
>  	    !is_large_pte(spte)) {
> +		struct kvm_mmu_page *sp;
> +		u64 page_mask;
> +		/*
> +		 * When nx hugepage flag is not set, there is no reason to go
> +		 * down to another level. This helps KVM re-generate large
> +		 * mappings after dirty logging disabled.
> +		 */

Honestly, I'd just omit the comment.  Again, IMO the blurb about dirty logging
does more harm than good because it makes it sound like this is somehow unique
to dirty logging, which it is not.  Lack of a check was really just a simple
goof.
> +		sp = to_shadow_page(spte & PT64_BASE_ADDR_MASK);
> +		if (!sp->lpage_disallowed)

This is unsafe for the TDP MMU.  If mmu_lock is held for read,
tdp_mmu_link_sp() could be in the process of creating the shadow page for a
different fault.  Critically, tdp_mmu_link_sp() sets lpage_disallowed _after_
installing the SPTE.  Thus this code could see the present shadow page, with
the correct old_spte (and thus succeed its eventual tdp_mmu_set_spte_atomic()),
but with the wrong lpage_disallowed.

To fix, tdp_mmu_link_sp() needs to tag the shadow page with lpage_disallowed
before setting the SPTE, and also needs appropriate memory barriers.

It's a bit messy at first glance, but it actually presents an opportunity to
further improve TDP MMU performance: tdp_mmu_pages can be turned into a
counter, at which point the link/unlink paths don't need to acquire the
spinlock except for the NX huge page case.

Yosry (Cc'd) is also looking into adding stats for the number of page table
pages consumed by KVM; turning tdp_mmu_pages into a counter would trivialize
that for the TDP MMU (the stats for the TDP MMU and old MMU are best kept
separate, see the details in the attached patch).

Unlike the legacy MMU, the TDP MMU never re-links existing shadow pages.  This
means it doesn't need to worry about the scenario where lpage_disallowed was
already set, so setting the flag prior to the SPTE is trivial and doesn't need
to be unwound.

Disclaimer: attached patches are lightly tested.

On top, this patch would need to add barriers, e.g.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5cb845fae56e..0bf85bf3da64 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2896,6 +2896,12 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
 	    cur_level == fault->goal_level &&
 	    is_shadow_present_pte(spte) &&
 	    !is_large_pte(spte)) {
+		/* comment goes here. */
+		smp_rmb();
+
+		if (!sp->lpage_disallowed)
+			return;
+
 		/*
 		 * A small SPTE exists for this pfn, but FNAME(fetch)
 		 * and __direct_map would like to create a large PTE

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 5ca78a89d8ed..9a0bc19179b0 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1207,6 +1207,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		tdp_mmu_init_child_sp(sp, &iter);

 		sp->lpage_disallowed = account_nx;
+		/* comment goes here. */
+		smp_wmb();

 		if (tdp_mmu_link_sp(kvm, &iter, sp, true)) {
 			tdp_mmu_free_sp(sp);
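
For anyone following along, the ordering the smp_wmb()/smp_rmb() pair enforces
is the standard publish/consume pattern: the writer must make lpage_disallowed
visible before the SPTE that publishes the shadow page, and the reader must
observe the SPTE before it looks at the flag.  A rough standalone sketch of
that pairing (all names below are invented for illustration, and C11
release/acquire stands in for the kernel's smp_wmb()/smp_rmb()):

/*
 * Illustration only, not kernel code: shows why the flag has to be stored
 * before the publishing store and read after the publishing load.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct shadow_page {
	bool lpage_disallowed;
};

/* Stands in for the parent SPTE that makes the child page table visible. */
static _Atomic(struct shadow_page *) parent_spte;

/* Producer side, i.e. the tdp_mmu_link_sp() analogue. */
static void link_sp(struct shadow_page *sp, bool account_nx)
{
	sp->lpage_disallowed = account_nx;

	/* The flag store is ordered before the store that publishes the page. */
	atomic_store_explicit(&parent_spte, sp, memory_order_release);
}

/* Consumer side, i.e. the disallowed_hugepage_adjust() analogue. */
static bool nx_huge_page_disallowed(void)
{
	/* If the page is visible, the flag written before it is visible too. */
	struct shadow_page *sp =
		atomic_load_explicit(&parent_spte, memory_order_acquire);

	return sp && sp->lpage_disallowed;
}

int main(void)
{
	static struct shadow_page sp;

	link_sp(&sp, true);
	printf("disallowed = %d\n", nx_huge_page_disallowed());
	return 0;
}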