* [PATCH 00/16] KVM: x86: MMU page fault clean-up
@ 2019-12-06 23:57 Sean Christopherson
  2019-12-06 23:57 ` [PATCH 01/16] KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVM Sean Christopherson
                   ` (16 more replies)
  0 siblings, 17 replies; 18+ messages in thread
From: Sean Christopherson @ 2019-12-06 23:57 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel

The original purpose of this series was to call thp_adjust() from
__direct_map() and FNAME(fetch) to eliminate a page refcounting quirk[*].
Before doing that, I wanted to clean up the large page handling so that
the map/fetch functions weren't being passed multiple booleans that tracked
the same basic info.  While trying to decipher all the interactions,
I stumbled across a handful of fun things:

  - 32-bit KVM w/ TDP is completely broken with respect to 64-bit GPAs due
    to the page fault handlers and all related flows dropping bits 63:32
    of the GPA.  As a result, KVM inserts the wrong GPA and the guest hangs
    because it generates EPT/NPT faults until it's killed.

  - The TDP and non-paging page fault flows are identical except for
    one-off constraints on guest page size.

  - The !VALID_PAGE(root_hpa) checks in the page fault flows are bogus.
    They were added a few years ago to "fix" an nVMX bug and are no longer
    needed now that nVMX is in much better shape.

Patch 1 fixes the 32-bit KVM w/ TDP issue.  More details below.

Patches 2-12 are 99% refactoring to merge TDP and non-paging page fault
handling, and to do the thp_adjust() move.  These are basically nops from
a functional perspective.  There are technically functional changes in a
few patches, but they are very superficial and in theory won't be
observable in normal usage.

Patches 13-16 add WARNs on the !VALID_PAGE(root_hpa) checks to make it
clear that root_hpa is expected to be valid when handling page faults,
e.g. for the longest time I thought KVM relied on the checks in map/fetch
to correctly handle kvm_mmu_zap_all().


32-bit KVM w/ TDP:

I marked this patch for stable because it's obviously a bug fix, but I'm
not entirely sure we want to backport the fix.  Obviously no userspace VMM
is actually exposing 64-bit GPAs to its guests, i.e. odds are this won't
actually fix any real world use cases.  And the scope of the changes is
likely going to make backporting a pain.  But, on the other hand, if it's
not backported then future bug fixes in related code are likely to
conflict, and it does fix the case where a buggy guest kernel accesses a
non-existent 64-bit GPA (crashes instead of hanging indefinitely).

I'm also not confident I found all the cases where KVM is truncating the
GPA.  AFAIK, 32-bit Qemu simply doesn't support 64-bit GPAs.  To confirm
the bug and verify the fix, I hacked KVM and the guest kernel to generate
64-bit GPAs when remapping MMIO, which covers a tiny fragment of KVM.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fa46fbed60013..49a59bcb32117 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5737,7 +5737,7 @@ static int emulator_read_write(struct x86_emulate_ctxt *ctxt,
        vcpu->run->mmio.len = min(8u, vcpu->mmio_fragments[0].len);
        vcpu->run->mmio.is_write = vcpu->mmio_is_write = ops->write;
        vcpu->run->exit_reason = KVM_EXIT_MMIO;
-       vcpu->run->mmio.phys_addr = gpa;
+       vcpu->run->mmio.phys_addr = gpa & 0xffffffffull;
 
        return ops->read_write_exit_mmio(vcpu, gpa, val, bytes);
 }


diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index b9c78f3bcd673..e22f254987bea 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -184,7 +184,10 @@ static void __iomem *__ioremap_caller(resource_size_t phys_addr,
        if (kernel_map_sync_memtype(phys_addr, size, pcm))
                goto err_free_area;
 
-       if (ioremap_page_range(vaddr, vaddr + size, phys_addr, prot))
+       BUG_ON(!boot_cpu_data.x86_phys_bits);
+
+       if (ioremap_page_range(vaddr, vaddr + size,
+                              phys_addr | BIT_ULL(boot_cpu_data.x86_phys_bits - 1), prot))
                goto err_free_area;
 
        ret_addr = (void __iomem *) (vaddr + offset);

[*] https://lkml.kernel.org/r/20191126174603.GB22233@linux.intel.com

Sean Christopherson (16):
  KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVM
  KVM: x86/mmu: Move definition of make_mmu_pages_available() up
  KVM: x86/mmu: Fold nonpaging_map() into nonpaging_page_fault()
  KVM: x86/mmu: Move nonpaging_page_fault() below try_async_pf()
  KVM: x86/mmu: Refactor handling of cache consistency with TDP
  KVM: x86/mmu: Refactor the per-slot level calculation in
    mapping_level()
  KVM: x86/mmu: Refactor handling of forced 4k pages in page faults
  KVM: x86/mmu: Incorporate guest's page level into max level for shadow
    MMU
  KVM: x86/mmu: Persist gfn_lpage_is_disallowed() to max_level
  KVM: x86/mmu: Rename lpage_disallowed to account_disallowed_nx_lpage
  KVM: x86/mmu: Consolidate tdp_page_fault() and nonpaging_page_fault()
  KVM: x86/mmu: Move transparent_hugepage_adjust() above __direct_map()
  KVM: x86/mmu: Move calls to thp_adjust() down a level
  KVM: x86/mmu: Move root_hpa validity checks to top of page fault
    handler
  KVM: x86/mmu: WARN on an invalid root_hpa
  KVM: x86/mmu: WARN if root_hpa is invalid when handling a page fault

 arch/x86/include/asm/kvm_host.h |   8 +-
 arch/x86/kvm/mmu/mmu.c          | 438 ++++++++++++++------------------
 arch/x86/kvm/mmu/paging_tmpl.h  |  58 +++--
 arch/x86/kvm/mmutrace.h         |  12 +-
 arch/x86/kvm/x86.c              |  40 ++-
 arch/x86/kvm/x86.h              |   2 +-
 include/linux/kvm_host.h        |   6 +-
 virt/kvm/async_pf.c             |  10 +-
 8 files changed, 259 insertions(+), 315 deletions(-)

-- 
2.24.0



* [PATCH 01/16] KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVM
  2019-12-06 23:57 [PATCH 00/16] KVM: x86: MMU page fault clean-up Sean Christopherson
@ 2019-12-06 23:57 ` Sean Christopherson
  2019-12-06 23:57 ` [PATCH 02/16] KVM: x86/mmu: Move definition of make_mmu_pages_available() up Sean Christopherson
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2019-12-06 23:57 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel

Convert a plethora of parameters and variables in the MMU and page fault
flows from type gva_t to gpa_t to properly handle TDP on 32-bit KVM.

Thanks to PSE and PAE paging, 32-bit kernels can access 64-bit physical
addresses.  When TDP is enabled, the fault address is a guest physical
address and thus can be a 64-bit value, even when both KVM and its guest
are using 32-bit virtual addressing, e.g. VMX's VMCS.GUEST_PHYSICAL is a
64-bit field, not a natural width field.

Using a gva_t for the fault address means KVM will incorrectly drop the
upper 32-bits of the GPA.  Ditto for gva_to_gpa() when it is used to
translate L2 GPAs to L1 GPAs.
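
As a quick illustration (not part of the patch; it only assumes the stock
kvm_types.h typedefs), the truncation looks like this:

	/* include/linux/kvm_types.h */
	typedef unsigned long gva_t;	/* 32 bits on a 32-bit kernel */
	typedef u64 gpa_t;		/* 64 bits everywhere */

	gpa_t gpa = 0x100002000ULL;	/* TDP fault on a GPA above 4GB */
	gva_t cr2 = gpa;		/* silently truncated to 0x00002000 */

Any page fault hook that takes a gva_t therefore operates on only the low
32 bits of the GPA, which is the source of the guest hang described in the
cover letter.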

Opportunistically rename variables and parameters to better reflect the
dual address modes, e.g. use "cr2_or_gpa" for fault addresses and plain
"addr" instead of "vaddr" when the address may be either a GVA or an L2
GPA.  Similarly, use "gpa" in the nonpaging_page_fault() flows to avoid
a confusing "gpa_t gva" declaration; this also sets the stage for a
future patch to combine nonpaging_page_fault() and tdp_page_fault() with
minimal churn.

Sprinkle in a few comments to document flows where an address is known
to be a GVA and thus can be safely truncated to a 32-bit value.  Add
WARNs in kvm_handle_page_fault() and FNAME(gva_to_gpa_nested)() to help
document such cases and detect bugs.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  8 ++--
 arch/x86/kvm/mmu/mmu.c          | 69 +++++++++++++++++++--------------
 arch/x86/kvm/mmu/paging_tmpl.h  | 25 +++++++-----
 arch/x86/kvm/mmutrace.h         | 12 +++---
 arch/x86/kvm/x86.c              | 40 +++++++++----------
 arch/x86/kvm/x86.h              |  2 +-
 include/linux/kvm_host.h        |  6 +--
 virt/kvm/async_pf.c             | 10 ++---
 8 files changed, 94 insertions(+), 78 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b79cd6aa4075..8233ffa9db6a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -378,12 +378,12 @@ struct kvm_mmu {
 	void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long root);
 	unsigned long (*get_cr3)(struct kvm_vcpu *vcpu);
 	u64 (*get_pdptr)(struct kvm_vcpu *vcpu, int index);
-	int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err,
+	int (*page_fault)(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u32 err,
 			  bool prefault);
 	void (*inject_page_fault)(struct kvm_vcpu *vcpu,
 				  struct x86_exception *fault);
-	gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t gva, u32 access,
-			    struct x86_exception *exception);
+	gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gpa_t gva_or_gpa,
+			    u32 access, struct x86_exception *exception);
 	gpa_t (*translate_gpa)(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access,
 			       struct x86_exception *exception);
 	int (*sync_page)(struct kvm_vcpu *vcpu,
@@ -1468,7 +1468,7 @@ void kvm_vcpu_deactivate_apicv(struct kvm_vcpu *vcpu);
 
 int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
 
-int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t gva, u64 error_code,
+int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
 		       void *insn, int insn_len);
 void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva);
 void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6f92b40d798c..3977c2b7e8e5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3528,7 +3528,7 @@ static bool is_access_allowed(u32 fault_err_code, u64 spte)
  * - true: let the vcpu to access on the same address again.
  * - false: let the real page fault path to fix it.
  */
-static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
+static bool fast_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, int level,
 			    u32 error_code)
 {
 	struct kvm_shadow_walk_iterator iterator;
@@ -3548,7 +3548,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 	do {
 		u64 new_spte;
 
-		for_each_shadow_entry_lockless(vcpu, gva, iterator, spte)
+		for_each_shadow_entry_lockless(vcpu, cr2_or_gpa, iterator, spte)
 			if (!is_shadow_present_pte(spte) ||
 			    iterator.level < level)
 				break;
@@ -3626,7 +3626,7 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 
 	} while (true);
 
-	trace_fast_page_fault(vcpu, gva, error_code, iterator.sptep,
+	trace_fast_page_fault(vcpu, cr2_or_gpa, error_code, iterator.sptep,
 			      spte, fault_handled);
 	walk_shadow_page_lockless_end(vcpu);
 
@@ -3634,10 +3634,11 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 }
 
 static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
-			 gva_t gva, kvm_pfn_t *pfn, bool write, bool *writable);
+			 gpa_t cr2_or_gpa, kvm_pfn_t *pfn, bool write,
+			 bool *writable);
 static int make_mmu_pages_available(struct kvm_vcpu *vcpu);
 
-static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, u32 error_code,
+static int nonpaging_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 			 gfn_t gfn, bool prefault)
 {
 	int r;
@@ -3663,16 +3664,16 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, u32 error_code,
 		gfn &= ~(KVM_PAGES_PER_HPAGE(level) - 1);
 	}
 
-	if (fast_page_fault(vcpu, v, level, error_code))
+	if (fast_page_fault(vcpu, gpa, level, error_code))
 		return RET_PF_RETRY;
 
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (try_async_pf(vcpu, prefault, gfn, v, &pfn, write, &map_writable))
+	if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable))
 		return RET_PF_RETRY;
 
-	if (handle_abnormal_pfn(vcpu, v, gfn, pfn, ACC_ALL, &r))
+	if (handle_abnormal_pfn(vcpu, gpa, gfn, pfn, ACC_ALL, &r))
 		return r;
 
 	r = RET_PF_RETRY;
@@ -3683,7 +3684,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, u32 error_code,
 		goto out_unlock;
 	if (likely(!force_pt_level))
 		transparent_hugepage_adjust(vcpu, gfn, &pfn, &level);
-	r = __direct_map(vcpu, v, write, map_writable, level, pfn,
+	r = __direct_map(vcpu, gpa, write, map_writable, level, pfn,
 			 prefault, false);
 out_unlock:
 	spin_unlock(&vcpu->kvm->mmu_lock);
@@ -3981,7 +3982,7 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_sync_roots);
 
-static gpa_t nonpaging_gva_to_gpa(struct kvm_vcpu *vcpu, gva_t vaddr,
+static gpa_t nonpaging_gva_to_gpa(struct kvm_vcpu *vcpu, gpa_t vaddr,
 				  u32 access, struct x86_exception *exception)
 {
 	if (exception)
@@ -3989,7 +3990,7 @@ static gpa_t nonpaging_gva_to_gpa(struct kvm_vcpu *vcpu, gva_t vaddr,
 	return vaddr;
 }
 
-static gpa_t nonpaging_gva_to_gpa_nested(struct kvm_vcpu *vcpu, gva_t vaddr,
+static gpa_t nonpaging_gva_to_gpa_nested(struct kvm_vcpu *vcpu, gpa_t vaddr,
 					 u32 access,
 					 struct x86_exception *exception)
 {
@@ -4149,13 +4150,14 @@ static void shadow_page_table_clear_flood(struct kvm_vcpu *vcpu, gva_t addr)
 	walk_shadow_page_lockless_end(vcpu);
 }
 
-static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
+static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa,
 				u32 error_code, bool prefault)
 {
-	gfn_t gfn = gva >> PAGE_SHIFT;
+	gfn_t gfn = gpa >> PAGE_SHIFT;
 	int r;
 
-	pgprintk("%s: gva %lx error %x\n", __func__, gva, error_code);
+	/* Note, paging is disabled, ergo gva == gpa. */
+	pgprintk("%s: gva %lx error %x\n", __func__, gpa, error_code);
 
 	if (page_fault_handle_page_track(vcpu, error_code, gfn))
 		return RET_PF_EMULATE;
@@ -4167,11 +4169,12 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
 	MMU_WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa));
 
 
-	return nonpaging_map(vcpu, gva & PAGE_MASK,
+	return nonpaging_map(vcpu, gpa & PAGE_MASK,
 			     error_code, gfn, prefault);
 }
 
-static int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn)
+static int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
+				   gfn_t gfn)
 {
 	struct kvm_arch_async_pf arch;
 
@@ -4180,11 +4183,13 @@ static int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn)
 	arch.direct_map = vcpu->arch.mmu->direct_map;
 	arch.cr3 = vcpu->arch.mmu->get_cr3(vcpu);
 
-	return kvm_setup_async_pf(vcpu, gva, kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
+	return kvm_setup_async_pf(vcpu, cr2_or_gpa,
+				  kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
 }
 
 static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
-			 gva_t gva, kvm_pfn_t *pfn, bool write, bool *writable)
+			 gpa_t cr2_or_gpa, kvm_pfn_t *pfn, bool write,
+			 bool *writable)
 {
 	struct kvm_memory_slot *slot;
 	bool async;
@@ -4204,12 +4209,12 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 		return false; /* *pfn has correct page already */
 
 	if (!prefault && kvm_can_do_async_pf(vcpu)) {
-		trace_kvm_try_async_get_page(gva, gfn);
+		trace_kvm_try_async_get_page(cr2_or_gpa, gfn);
 		if (kvm_find_async_pf_gfn(vcpu, gfn)) {
-			trace_kvm_async_pf_doublefault(gva, gfn);
+			trace_kvm_async_pf_doublefault(cr2_or_gpa, gfn);
 			kvm_make_request(KVM_REQ_APF_HALT, vcpu);
 			return true;
-		} else if (kvm_arch_setup_async_pf(vcpu, gva, gfn))
+		} else if (kvm_arch_setup_async_pf(vcpu, cr2_or_gpa, gfn))
 			return true;
 	}
 
@@ -4222,6 +4227,12 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 {
 	int r = 1;
 
+#ifndef CONFIG_X86_64
+	/* A 64-bit CR2 should be impossible on 32-bit KVM. */
+	if (WARN_ON_ONCE(fault_address >> 32))
+		return -EFAULT;
+#endif
+
 	vcpu->arch.l1tf_flush_l1d = true;
 	switch (vcpu->arch.apf.host_apf_reason) {
 	default:
@@ -4259,7 +4270,7 @@ check_hugepage_cache_consistency(struct kvm_vcpu *vcpu, gfn_t gfn, int level)
 	return kvm_mtrr_check_gfn_range_consistency(vcpu, gfn, page_num);
 }
 
-static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
+static int tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 			  bool prefault)
 {
 	kvm_pfn_t pfn;
@@ -5516,7 +5527,7 @@ static int make_mmu_pages_available(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
-int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u64 error_code,
+int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
 		       void *insn, int insn_len)
 {
 	int r, emulation_type = 0;
@@ -5525,18 +5536,18 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u64 error_code,
 	/* With shadow page tables, fault_address contains a GVA or nGPA.  */
 	if (vcpu->arch.mmu->direct_map) {
 		vcpu->arch.gpa_available = true;
-		vcpu->arch.gpa_val = cr2;
+		vcpu->arch.gpa_val = cr2_or_gpa;
 	}
 
 	r = RET_PF_INVALID;
 	if (unlikely(error_code & PFERR_RSVD_MASK)) {
-		r = handle_mmio_page_fault(vcpu, cr2, direct);
+		r = handle_mmio_page_fault(vcpu, cr2_or_gpa, direct);
 		if (r == RET_PF_EMULATE)
 			goto emulate;
 	}
 
 	if (r == RET_PF_INVALID) {
-		r = vcpu->arch.mmu->page_fault(vcpu, cr2,
+		r = vcpu->arch.mmu->page_fault(vcpu, cr2_or_gpa,
 					       lower_32_bits(error_code),
 					       false);
 		WARN_ON(r == RET_PF_INVALID);
@@ -5556,7 +5567,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u64 error_code,
 	 */
 	if (vcpu->arch.mmu->direct_map &&
 	    (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE) {
-		kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2));
+		kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa));
 		return 1;
 	}
 
@@ -5571,7 +5582,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u64 error_code,
 	 * explicitly shadowing L1's page tables, i.e. unprotecting something
 	 * for L1 isn't going to magically fix whatever issue cause L2 to fail.
 	 */
-	if (!mmio_info_in_cache(vcpu, cr2, direct) && !is_guest_mode(vcpu))
+	if (!mmio_info_in_cache(vcpu, cr2_or_gpa, direct) && !is_guest_mode(vcpu))
 		emulation_type = EMULTYPE_ALLOW_RETRY;
 emulate:
 	/*
@@ -5586,7 +5597,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u64 error_code,
 			return 1;
 	}
 
-	return x86_emulate_instruction(vcpu, cr2, emulation_type, insn,
+	return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn,
 				       insn_len);
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_page_fault);
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 97b21e7fd013..c1d7b866a03f 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -291,11 +291,11 @@ static inline unsigned FNAME(gpte_pkeys)(struct kvm_vcpu *vcpu, u64 gpte)
 }
 
 /*
- * Fetch a guest pte for a guest virtual address
+ * Fetch a guest pte for a guest virtual address, or for an L2's GPA.
  */
 static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 				    struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
-				    gva_t addr, u32 access)
+				    gpa_t addr, u32 access)
 {
 	int ret;
 	pt_element_t pte;
@@ -496,7 +496,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 }
 
 static int FNAME(walk_addr)(struct guest_walker *walker,
-			    struct kvm_vcpu *vcpu, gva_t addr, u32 access)
+			    struct kvm_vcpu *vcpu, gpa_t addr, u32 access)
 {
 	return FNAME(walk_addr_generic)(walker, vcpu, vcpu->arch.mmu, addr,
 					access);
@@ -611,7 +611,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
  * If the guest tries to write a write-protected page, we need to
  * emulate this operation, return 1 to indicate this case.
  */
-static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
+static int FNAME(fetch)(struct kvm_vcpu *vcpu, gpa_t addr,
 			 struct guest_walker *gw,
 			 int write_fault, int hlevel,
 			 kvm_pfn_t pfn, bool map_writable, bool prefault,
@@ -765,7 +765,7 @@ FNAME(is_self_change_mapping)(struct kvm_vcpu *vcpu,
  *  Returns: 1 if we need to emulate the instruction, 0 otherwise, or
  *           a negative value on error.
  */
-static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
+static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
 			     bool prefault)
 {
 	int write_fault = error_code & PFERR_WRITE_MASK;
@@ -945,18 +945,19 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa)
 	spin_unlock(&vcpu->kvm->mmu_lock);
 }
 
-static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr, u32 access,
+/* Note, @addr is a GPA when gva_to_gpa() translates an L2 GPA to an L1 GPA. */
+static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gpa_t addr, u32 access,
 			       struct x86_exception *exception)
 {
 	struct guest_walker walker;
 	gpa_t gpa = UNMAPPED_GVA;
 	int r;
 
-	r = FNAME(walk_addr)(&walker, vcpu, vaddr, access);
+	r = FNAME(walk_addr)(&walker, vcpu, addr, access);
 
 	if (r) {
 		gpa = gfn_to_gpa(walker.gfn);
-		gpa |= vaddr & ~PAGE_MASK;
+		gpa |= addr & ~PAGE_MASK;
 	} else if (exception)
 		*exception = walker.fault;
 
@@ -964,7 +965,8 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr, u32 access,
 }
 
 #if PTTYPE != PTTYPE_EPT
-static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
+/* Note, gva_to_gpa_nested() is only used to translate L2 GVAs. */
+static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gpa_t vaddr,
 				      u32 access,
 				      struct x86_exception *exception)
 {
@@ -972,6 +974,11 @@ static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
 	gpa_t gpa = UNMAPPED_GVA;
 	int r;
 
+#ifndef CONFIG_X86_64
+	/* A 64-bit GVA should be impossible on 32-bit KVM. */
+	WARN_ON_ONCE(vaddr >> 32);
+#endif
+
 	r = FNAME(walk_addr_nested)(&walker, vcpu, vaddr, access);
 
 	if (r) {
diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
index 7ca8831c7d1a..3c6522b84ff1 100644
--- a/arch/x86/kvm/mmutrace.h
+++ b/arch/x86/kvm/mmutrace.h
@@ -249,13 +249,13 @@ TRACE_EVENT(
 
 TRACE_EVENT(
 	fast_page_fault,
-	TP_PROTO(struct kvm_vcpu *vcpu, gva_t gva, u32 error_code,
+	TP_PROTO(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u32 error_code,
 		 u64 *sptep, u64 old_spte, bool retry),
-	TP_ARGS(vcpu, gva, error_code, sptep, old_spte, retry),
+	TP_ARGS(vcpu, cr2_or_gpa, error_code, sptep, old_spte, retry),
 
 	TP_STRUCT__entry(
 		__field(int, vcpu_id)
-		__field(gva_t, gva)
+		__field(gpa_t, cr2_or_gpa)
 		__field(u32, error_code)
 		__field(u64 *, sptep)
 		__field(u64, old_spte)
@@ -265,7 +265,7 @@ TRACE_EVENT(
 
 	TP_fast_assign(
 		__entry->vcpu_id = vcpu->vcpu_id;
-		__entry->gva = gva;
+		__entry->cr2_or_gpa = cr2_or_gpa;
 		__entry->error_code = error_code;
 		__entry->sptep = sptep;
 		__entry->old_spte = old_spte;
@@ -273,9 +273,9 @@ TRACE_EVENT(
 		__entry->retry = retry;
 	),
 
-	TP_printk("vcpu %d gva %lx error_code %s sptep %p old %#llx"
+	TP_printk("vcpu %d gva %llx error_code %s sptep %p old %#llx"
 		  " new %llx spurious %d fixed %d", __entry->vcpu_id,
-		  __entry->gva, __print_flags(__entry->error_code, "|",
+		  __entry->cr2_or_gpa, __print_flags(__entry->error_code, "|",
 		  kvm_mmu_trace_pferr_flags), __entry->sptep,
 		  __entry->old_spte, __entry->new_spte,
 		  __spte_satisfied(old_spte), __spte_satisfied(new_spte)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3ed167e039e5..fa46fbed6001 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6377,11 +6377,11 @@ static int handle_emulation_failure(struct kvm_vcpu *vcpu, int emulation_type)
 	return 1;
 }
 
-static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t cr2,
+static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 				  bool write_fault_to_shadow_pgtable,
 				  int emulation_type)
 {
-	gpa_t gpa = cr2;
+	gpa_t gpa = cr2_or_gpa;
 	kvm_pfn_t pfn;
 
 	if (!(emulation_type & EMULTYPE_ALLOW_RETRY))
@@ -6395,7 +6395,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t cr2,
 		 * Write permission should be allowed since only
 		 * write access need to be emulated.
 		 */
-		gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2, NULL);
+		gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2_or_gpa, NULL);
 
 		/*
 		 * If the mapping is invalid in guest, let cpu retry
@@ -6452,10 +6452,10 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t cr2,
 }
 
 static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
-			      unsigned long cr2,  int emulation_type)
+			      gpa_t cr2_or_gpa,  int emulation_type)
 {
 	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
-	unsigned long last_retry_eip, last_retry_addr, gpa = cr2;
+	unsigned long last_retry_eip, last_retry_addr, gpa = cr2_or_gpa;
 
 	last_retry_eip = vcpu->arch.last_retry_eip;
 	last_retry_addr = vcpu->arch.last_retry_addr;
@@ -6484,14 +6484,14 @@ static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
 	if (x86_page_table_writing_insn(ctxt))
 		return false;
 
-	if (ctxt->eip == last_retry_eip && last_retry_addr == cr2)
+	if (ctxt->eip == last_retry_eip && last_retry_addr == cr2_or_gpa)
 		return false;
 
 	vcpu->arch.last_retry_eip = ctxt->eip;
-	vcpu->arch.last_retry_addr = cr2;
+	vcpu->arch.last_retry_addr = cr2_or_gpa;
 
 	if (!vcpu->arch.mmu->direct_map)
-		gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2, NULL);
+		gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2_or_gpa, NULL);
 
 	kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
 
@@ -6637,11 +6637,8 @@ static bool is_vmware_backdoor_opcode(struct x86_emulate_ctxt *ctxt)
 	return false;
 }
 
-int x86_emulate_instruction(struct kvm_vcpu *vcpu,
-			    unsigned long cr2,
-			    int emulation_type,
-			    void *insn,
-			    int insn_len)
+int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
+			    int emulation_type, void *insn, int insn_len)
 {
 	int r;
 	struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt;
@@ -6687,8 +6684,9 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
 				kvm_queue_exception(vcpu, UD_VECTOR);
 				return 1;
 			}
-			if (reexecute_instruction(vcpu, cr2, write_fault_to_spt,
-						emulation_type))
+			if (reexecute_instruction(vcpu, cr2_or_gpa,
+						  write_fault_to_spt,
+						  emulation_type))
 				return 1;
 			if (ctxt->have_exception) {
 				/*
@@ -6722,7 +6720,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
 		return 1;
 	}
 
-	if (retry_instruction(ctxt, cr2, emulation_type))
+	if (retry_instruction(ctxt, cr2_or_gpa, emulation_type))
 		return 1;
 
 	/* this is needed for vmware backdoor interface to work since it
@@ -6734,7 +6732,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
 
 restart:
 	/* Save the faulting GPA (cr2) in the address field */
-	ctxt->exception.address = cr2;
+	ctxt->exception.address = cr2_or_gpa;
 
 	r = x86_emulate_insn(ctxt);
 
@@ -6742,7 +6740,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
 		return 1;
 
 	if (r == EMULATION_FAILED) {
-		if (reexecute_instruction(vcpu, cr2, write_fault_to_spt,
+		if (reexecute_instruction(vcpu, cr2_or_gpa, write_fault_to_spt,
 					emulation_type))
 			return 1;
 
@@ -10012,7 +10010,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 	      work->arch.cr3 != vcpu->arch.mmu->get_cr3(vcpu))
 		return;
 
-	vcpu->arch.mmu->page_fault(vcpu, work->gva, 0, true);
+	vcpu->arch.mmu->page_fault(vcpu, work->cr2_or_gpa, 0, true);
 }
 
 static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
@@ -10125,7 +10123,7 @@ void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
 {
 	struct x86_exception fault;
 
-	trace_kvm_async_pf_not_present(work->arch.token, work->gva);
+	trace_kvm_async_pf_not_present(work->arch.token, work->cr2_or_gpa);
 	kvm_add_async_pf_gfn(vcpu, work->arch.gfn);
 
 	if (kvm_can_deliver_async_pf(vcpu) &&
@@ -10160,7 +10158,7 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 		work->arch.token = ~0; /* broadcast wakeup */
 	else
 		kvm_del_async_pf_gfn(vcpu, work->arch.gfn);
-	trace_kvm_async_pf_ready(work->arch.token, work->gva);
+	trace_kvm_async_pf_ready(work->arch.token, work->cr2_or_gpa);
 
 	if (vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED &&
 	    !apf_get_user(vcpu, &val)) {
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 29391af8871d..cab5e71f0f0f 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -289,7 +289,7 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
 bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn,
 					  int page_num);
 bool kvm_vector_hashing_enabled(void);
-int x86_emulate_instruction(struct kvm_vcpu *vcpu, unsigned long cr2,
+int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 			    int emulation_type, void *insn, int insn_len);
 
 #define KVM_SUPPORTED_XCR0     (XFEATURE_MASK_FP | XFEATURE_MASK_SSE \
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7ed1e2f8641e..0b564a928774 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -204,7 +204,7 @@ struct kvm_async_pf {
 	struct list_head queue;
 	struct kvm_vcpu *vcpu;
 	struct mm_struct *mm;
-	gva_t gva;
+	gpa_t cr2_or_gpa;
 	unsigned long addr;
 	struct kvm_arch_async_pf arch;
 	bool   wakeup_all;
@@ -212,8 +212,8 @@ struct kvm_async_pf {
 
 void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu);
 void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu);
-int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, unsigned long hva,
-		       struct kvm_arch_async_pf *arch);
+int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
+		       unsigned long hva, struct kvm_arch_async_pf *arch);
 int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
 
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index 35305d6e68cc..d8ef708a2ef6 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -64,7 +64,7 @@ static void async_pf_execute(struct work_struct *work)
 	struct mm_struct *mm = apf->mm;
 	struct kvm_vcpu *vcpu = apf->vcpu;
 	unsigned long addr = apf->addr;
-	gva_t gva = apf->gva;
+	gpa_t cr2_or_gpa = apf->cr2_or_gpa;
 	int locked = 1;
 
 	might_sleep();
@@ -92,7 +92,7 @@ static void async_pf_execute(struct work_struct *work)
 	 * this point
 	 */
 
-	trace_kvm_async_pf_completed(addr, gva);
+	trace_kvm_async_pf_completed(addr, cr2_or_gpa);
 
 	if (swq_has_sleeper(&vcpu->wq))
 		swake_up_one(&vcpu->wq);
@@ -165,8 +165,8 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
 	}
 }
 
-int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, unsigned long hva,
-		       struct kvm_arch_async_pf *arch)
+int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
+		       unsigned long hva, struct kvm_arch_async_pf *arch)
 {
 	struct kvm_async_pf *work;
 
@@ -185,7 +185,7 @@ int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, unsigned long hva,
 
 	work->wakeup_all = false;
 	work->vcpu = vcpu;
-	work->gva = gva;
+	work->cr2_or_gpa = cr2_or_gpa;
 	work->addr = hva;
 	work->arch = *arch;
 	work->mm = current->mm;
-- 
2.24.0



* [PATCH 02/16] KVM: x86/mmu: Move definition of make_mmu_pages_available() up
  2019-12-06 23:57 [PATCH 00/16] KVM: x86: MMU page fault clean-up Sean Christopherson
  2019-12-06 23:57 ` [PATCH 01/16] KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVM Sean Christopherson
@ 2019-12-06 23:57 ` Sean Christopherson
  2019-12-06 23:57 ` [PATCH 03/16] KVM: x86/mmu: Fold nonpaging_map() into nonpaging_page_fault() Sean Christopherson
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2019-12-06 23:57 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel

Move make_mmu_pages_available() above its first user to put it closer
to related code and eliminate a forward declaration.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 41 ++++++++++++++++++++---------------------
 1 file changed, 20 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3977c2b7e8e5..e2792305ce32 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2899,6 +2899,26 @@ static bool prepare_zap_oldest_mmu_page(struct kvm *kvm,
 	return kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
 }
 
+static int make_mmu_pages_available(struct kvm_vcpu *vcpu)
+{
+	LIST_HEAD(invalid_list);
+
+	if (likely(kvm_mmu_available_pages(vcpu->kvm) >= KVM_MIN_FREE_MMU_PAGES))
+		return 0;
+
+	while (kvm_mmu_available_pages(vcpu->kvm) < KVM_REFILL_PAGES) {
+		if (!prepare_zap_oldest_mmu_page(vcpu->kvm, &invalid_list))
+			break;
+
+		++vcpu->kvm->stat.mmu_recycled;
+	}
+	kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
+
+	if (!kvm_mmu_available_pages(vcpu->kvm))
+		return -ENOSPC;
+	return 0;
+}
+
 /*
  * Changing the number of mmu pages allocated to the vm
  * Note: if goal_nr_mmu_pages is too small, you will get dead lock
@@ -3636,7 +3656,6 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, int level,
 static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 			 gpa_t cr2_or_gpa, kvm_pfn_t *pfn, bool write,
 			 bool *writable);
-static int make_mmu_pages_available(struct kvm_vcpu *vcpu);
 
 static int nonpaging_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 			 gfn_t gfn, bool prefault)
@@ -5507,26 +5526,6 @@ int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva)
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_unprotect_page_virt);
 
-static int make_mmu_pages_available(struct kvm_vcpu *vcpu)
-{
-	LIST_HEAD(invalid_list);
-
-	if (likely(kvm_mmu_available_pages(vcpu->kvm) >= KVM_MIN_FREE_MMU_PAGES))
-		return 0;
-
-	while (kvm_mmu_available_pages(vcpu->kvm) < KVM_REFILL_PAGES) {
-		if (!prepare_zap_oldest_mmu_page(vcpu->kvm, &invalid_list))
-			break;
-
-		++vcpu->kvm->stat.mmu_recycled;
-	}
-	kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
-
-	if (!kvm_mmu_available_pages(vcpu->kvm))
-		return -ENOSPC;
-	return 0;
-}
-
 int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
 		       void *insn, int insn_len)
 {
-- 
2.24.0



* [PATCH 03/16] KVM: x86/mmu: Fold nonpaging_map() into nonpaging_page_fault()
  2019-12-06 23:57 [PATCH 00/16] KVM: x86: MMU page fault clean-up Sean Christopherson
  2019-12-06 23:57 ` [PATCH 01/16] KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVM Sean Christopherson
  2019-12-06 23:57 ` [PATCH 02/16] KVM: x86/mmu: Move definition of make_mmu_pages_available() up Sean Christopherson
@ 2019-12-06 23:57 ` Sean Christopherson
  2019-12-06 23:57 ` [PATCH 04/16] KVM: x86/mmu: Move nonpaging_page_fault() below try_async_pf() Sean Christopherson
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2019-12-06 23:57 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel

Fold nonpaging_map() into its sole caller, nonpaging_page_fault(), in
preparation for combining the bulk of nonpaging_page_fault() and
tdp_page_fault() into a common helper.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 106 +++++++++++++++++++----------------------
 1 file changed, 49 insertions(+), 57 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e2792305ce32..8217cdd6beac 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3657,60 +3657,6 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 			 gpa_t cr2_or_gpa, kvm_pfn_t *pfn, bool write,
 			 bool *writable);
 
-static int nonpaging_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
-			 gfn_t gfn, bool prefault)
-{
-	int r;
-	int level;
-	bool force_pt_level;
-	kvm_pfn_t pfn;
-	unsigned long mmu_seq;
-	bool map_writable, write = error_code & PFERR_WRITE_MASK;
-	bool lpage_disallowed = (error_code & PFERR_FETCH_MASK) &&
-				is_nx_huge_page_enabled();
-
-	force_pt_level = lpage_disallowed;
-	level = mapping_level(vcpu, gfn, &force_pt_level);
-	if (likely(!force_pt_level)) {
-		/*
-		 * This path builds a PAE pagetable - so we can map
-		 * 2mb pages at maximum. Therefore check if the level
-		 * is larger than that.
-		 */
-		if (level > PT_DIRECTORY_LEVEL)
-			level = PT_DIRECTORY_LEVEL;
-
-		gfn &= ~(KVM_PAGES_PER_HPAGE(level) - 1);
-	}
-
-	if (fast_page_fault(vcpu, gpa, level, error_code))
-		return RET_PF_RETRY;
-
-	mmu_seq = vcpu->kvm->mmu_notifier_seq;
-	smp_rmb();
-
-	if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable))
-		return RET_PF_RETRY;
-
-	if (handle_abnormal_pfn(vcpu, gpa, gfn, pfn, ACC_ALL, &r))
-		return r;
-
-	r = RET_PF_RETRY;
-	spin_lock(&vcpu->kvm->mmu_lock);
-	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
-		goto out_unlock;
-	if (make_mmu_pages_available(vcpu) < 0)
-		goto out_unlock;
-	if (likely(!force_pt_level))
-		transparent_hugepage_adjust(vcpu, gfn, &pfn, &level);
-	r = __direct_map(vcpu, gpa, write, map_writable, level, pfn,
-			 prefault, false);
-out_unlock:
-	spin_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(pfn);
-	return r;
-}
-
 static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
 			       struct list_head *invalid_list)
 {
@@ -4172,12 +4118,21 @@ static void shadow_page_table_clear_flood(struct kvm_vcpu *vcpu, gva_t addr)
 static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa,
 				u32 error_code, bool prefault)
 {
-	gfn_t gfn = gpa >> PAGE_SHIFT;
 	int r;
+	int level;
+	kvm_pfn_t pfn;
+	unsigned long mmu_seq;
+	gfn_t gfn = gpa >> PAGE_SHIFT;
+	bool write = error_code & PFERR_WRITE_MASK;
+	bool force_pt_level, map_writable;
+	bool exec = error_code & PFERR_FETCH_MASK;
+	bool lpage_disallowed = exec && is_nx_huge_page_enabled();
 
 	/* Note, paging is disabled, ergo gva == gpa. */
 	pgprintk("%s: gva %lx error %x\n", __func__, gpa, error_code);
 
+	gpa &= PAGE_MASK;
+
 	if (page_fault_handle_page_track(vcpu, error_code, gfn))
 		return RET_PF_EMULATE;
 
@@ -4187,9 +4142,46 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa,
 
 	MMU_WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa));
 
+	force_pt_level = lpage_disallowed;
+	level = mapping_level(vcpu, gfn, &force_pt_level);
+	if (likely(!force_pt_level)) {
+		/*
+		 * This path builds a PAE pagetable - so we can map
+		 * 2mb pages at maximum. Therefore check if the level
+		 * is larger than that.
+		 */
+		if (level > PT_DIRECTORY_LEVEL)
+			level = PT_DIRECTORY_LEVEL;
 
-	return nonpaging_map(vcpu, gpa & PAGE_MASK,
-			     error_code, gfn, prefault);
+		gfn &= ~(KVM_PAGES_PER_HPAGE(level) - 1);
+	}
+
+	if (fast_page_fault(vcpu, gpa, level, error_code))
+		return RET_PF_RETRY;
+
+	mmu_seq = vcpu->kvm->mmu_notifier_seq;
+	smp_rmb();
+
+	if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable))
+		return RET_PF_RETRY;
+
+	if (handle_abnormal_pfn(vcpu, gpa, gfn, pfn, ACC_ALL, &r))
+		return r;
+
+	r = RET_PF_RETRY;
+	spin_lock(&vcpu->kvm->mmu_lock);
+	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
+		goto out_unlock;
+	if (make_mmu_pages_available(vcpu) < 0)
+		goto out_unlock;
+	if (likely(!force_pt_level))
+		transparent_hugepage_adjust(vcpu, gfn, &pfn, &level);
+	r = __direct_map(vcpu, gpa, write, map_writable, level, pfn,
+			 prefault, false);
+out_unlock:
+	spin_unlock(&vcpu->kvm->mmu_lock);
+	kvm_release_pfn_clean(pfn);
+	return r;
 }
 
 static int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
-- 
2.24.0



* [PATCH 04/16] KVM: x86/mmu: Move nonpaging_page_fault() below try_async_pf()
  2019-12-06 23:57 [PATCH 00/16] KVM: x86: MMU page fault clean-up Sean Christopherson
                   ` (2 preceding siblings ...)
  2019-12-06 23:57 ` [PATCH 03/16] KVM: x86/mmu: Fold nonpaging_map() into nonpaging_page_fault() Sean Christopherson
@ 2019-12-06 23:57 ` Sean Christopherson
  2019-12-06 23:57 ` [PATCH 05/16] KVM: x86/mmu: Refactor handling of cache consistency with TDP Sean Christopherson
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2019-12-06 23:57 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel

Move nonpaging_page_fault() below try_async_pf() to eliminate the
forward declaration of try_async_pf() and to prepare for combining the
bulk of nonpaging_page_fault() and tdp_page_fault() into a common
helper.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 142 ++++++++++++++++++++---------------------
 1 file changed, 69 insertions(+), 73 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8217cdd6beac..ef1b7663f6ea 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3653,10 +3653,6 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, int level,
 	return fault_handled;
 }
 
-static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
-			 gpa_t cr2_or_gpa, kvm_pfn_t *pfn, bool write,
-			 bool *writable);
-
 static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
 			       struct list_head *invalid_list)
 {
@@ -4115,75 +4111,6 @@ static void shadow_page_table_clear_flood(struct kvm_vcpu *vcpu, gva_t addr)
 	walk_shadow_page_lockless_end(vcpu);
 }
 
-static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa,
-				u32 error_code, bool prefault)
-{
-	int r;
-	int level;
-	kvm_pfn_t pfn;
-	unsigned long mmu_seq;
-	gfn_t gfn = gpa >> PAGE_SHIFT;
-	bool write = error_code & PFERR_WRITE_MASK;
-	bool force_pt_level, map_writable;
-	bool exec = error_code & PFERR_FETCH_MASK;
-	bool lpage_disallowed = exec && is_nx_huge_page_enabled();
-
-	/* Note, paging is disabled, ergo gva == gpa. */
-	pgprintk("%s: gva %lx error %x\n", __func__, gpa, error_code);
-
-	gpa &= PAGE_MASK;
-
-	if (page_fault_handle_page_track(vcpu, error_code, gfn))
-		return RET_PF_EMULATE;
-
-	r = mmu_topup_memory_caches(vcpu);
-	if (r)
-		return r;
-
-	MMU_WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa));
-
-	force_pt_level = lpage_disallowed;
-	level = mapping_level(vcpu, gfn, &force_pt_level);
-	if (likely(!force_pt_level)) {
-		/*
-		 * This path builds a PAE pagetable - so we can map
-		 * 2mb pages at maximum. Therefore check if the level
-		 * is larger than that.
-		 */
-		if (level > PT_DIRECTORY_LEVEL)
-			level = PT_DIRECTORY_LEVEL;
-
-		gfn &= ~(KVM_PAGES_PER_HPAGE(level) - 1);
-	}
-
-	if (fast_page_fault(vcpu, gpa, level, error_code))
-		return RET_PF_RETRY;
-
-	mmu_seq = vcpu->kvm->mmu_notifier_seq;
-	smp_rmb();
-
-	if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable))
-		return RET_PF_RETRY;
-
-	if (handle_abnormal_pfn(vcpu, gpa, gfn, pfn, ACC_ALL, &r))
-		return r;
-
-	r = RET_PF_RETRY;
-	spin_lock(&vcpu->kvm->mmu_lock);
-	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
-		goto out_unlock;
-	if (make_mmu_pages_available(vcpu) < 0)
-		goto out_unlock;
-	if (likely(!force_pt_level))
-		transparent_hugepage_adjust(vcpu, gfn, &pfn, &level);
-	r = __direct_map(vcpu, gpa, write, map_writable, level, pfn,
-			 prefault, false);
-out_unlock:
-	spin_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(pfn);
-	return r;
-}
-
 static int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 				   gfn_t gfn)
 {
@@ -4233,6 +4160,75 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 	return false;
 }
 
+static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa,
+				u32 error_code, bool prefault)
+{
+	int r;
+	int level;
+	kvm_pfn_t pfn;
+	unsigned long mmu_seq;
+	gfn_t gfn = gpa >> PAGE_SHIFT;
+	bool write = error_code & PFERR_WRITE_MASK;
+	bool force_pt_level, map_writable;
+	bool exec = error_code & PFERR_FETCH_MASK;
+	bool lpage_disallowed = exec && is_nx_huge_page_enabled();
+
+	/* Note, paging is disabled, ergo gva == gpa. */
+	pgprintk("%s: gva %lx error %x\n", __func__, gpa, error_code);
+
+	gpa &= PAGE_MASK;
+
+	if (page_fault_handle_page_track(vcpu, error_code, gfn))
+		return RET_PF_EMULATE;
+
+	r = mmu_topup_memory_caches(vcpu);
+	if (r)
+		return r;
+
+	MMU_WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa));
+
+	force_pt_level = lpage_disallowed;
+	level = mapping_level(vcpu, gfn, &force_pt_level);
+	if (likely(!force_pt_level)) {
+		/*
+		 * This path builds a PAE pagetable - so we can map
+		 * 2mb pages at maximum. Therefore check if the level
+		 * is larger than that.
+		 */
+		if (level > PT_DIRECTORY_LEVEL)
+			level = PT_DIRECTORY_LEVEL;
+
+		gfn &= ~(KVM_PAGES_PER_HPAGE(level) - 1);
+	}
+
+	if (fast_page_fault(vcpu, gpa, level, error_code))
+		return RET_PF_RETRY;
+
+	mmu_seq = vcpu->kvm->mmu_notifier_seq;
+	smp_rmb();
+
+	if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable))
+		return RET_PF_RETRY;
+
+	if (handle_abnormal_pfn(vcpu, gpa, gfn, pfn, ACC_ALL, &r))
+		return r;
+
+	r = RET_PF_RETRY;
+	spin_lock(&vcpu->kvm->mmu_lock);
+	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
+		goto out_unlock;
+	if (make_mmu_pages_available(vcpu) < 0)
+		goto out_unlock;
+	if (likely(!force_pt_level))
+		transparent_hugepage_adjust(vcpu, gfn, &pfn, &level);
+	r = __direct_map(vcpu, gpa, write, map_writable, level, pfn,
+			 prefault, false);
+out_unlock:
+	spin_unlock(&vcpu->kvm->mmu_lock);
+	kvm_release_pfn_clean(pfn);
+	return r;
+}
+
 int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 				u64 fault_address, char *insn, int insn_len)
 {
-- 
2.24.0



* [PATCH 05/16] KVM: x86/mmu: Refactor handling of cache consistency with TDP
  2019-12-06 23:57 [PATCH 00/16] KVM: x86: MMU page fault clean-up Sean Christopherson
                   ` (3 preceding siblings ...)
  2019-12-06 23:57 ` [PATCH 04/16] KVM: x86/mmu: Move nonpaging_page_fault() below try_async_pf() Sean Christopherson
@ 2019-12-06 23:57 ` Sean Christopherson
  2019-12-06 23:57 ` [PATCH 06/16] KVM: x86/mmu: Refactor the per-slot level calculation in mapping_level() Sean Christopherson
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2019-12-06 23:57 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel

Pre-calculate the max level for a TDP page with respect to MTRR cache
consistency in preparation for replacing force_pt_level with max_level,
and eventually combining the bulk of nonpaging_page_fault() and
tdp_page_fault() into a common helper.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 30 ++++++++++++++----------------
 1 file changed, 14 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ef1b7663f6ea..e4a96236aa14 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4267,16 +4267,6 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 }
 EXPORT_SYMBOL_GPL(kvm_handle_page_fault);
 
-static bool
-check_hugepage_cache_consistency(struct kvm_vcpu *vcpu, gfn_t gfn, int level)
-{
-	int page_num = KVM_PAGES_PER_HPAGE(level);
-
-	gfn &= ~(page_num - 1);
-
-	return kvm_mtrr_check_gfn_range_consistency(vcpu, gfn, page_num);
-}
-
 static int tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 			  bool prefault)
 {
@@ -4290,6 +4280,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	bool map_writable;
 	bool lpage_disallowed = (error_code & PFERR_FETCH_MASK) &&
 				is_nx_huge_page_enabled();
+	int max_level;
 
 	MMU_WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa));
 
@@ -4300,14 +4291,21 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	if (r)
 		return r;
 
-	force_pt_level =
-		lpage_disallowed ||
-		!check_hugepage_cache_consistency(vcpu, gfn, PT_DIRECTORY_LEVEL);
+	for (max_level = PT_MAX_HUGEPAGE_LEVEL;
+	     max_level > PT_PAGE_TABLE_LEVEL;
+	     max_level--) {
+		int page_num = KVM_PAGES_PER_HPAGE(max_level);
+		gfn_t base = gfn & ~(page_num - 1);
+
+		if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
+			break;
+	}
+
+	force_pt_level = lpage_disallowed || max_level == PT_PAGE_TABLE_LEVEL;
 	level = mapping_level(vcpu, gfn, &force_pt_level);
 	if (likely(!force_pt_level)) {
-		if (level > PT_DIRECTORY_LEVEL &&
-		    !check_hugepage_cache_consistency(vcpu, gfn, level))
-			level = PT_DIRECTORY_LEVEL;
+		if (level > max_level)
+			level = max_level;
 		gfn &= ~(KVM_PAGES_PER_HPAGE(level) - 1);
 	}
 
-- 
2.24.0



* [PATCH 06/16] KVM: x86/mmu: Refactor the per-slot level calculation in mapping_level()
  2019-12-06 23:57 [PATCH 00/16] KVM: x86: MMU page fault clean-up Sean Christopherson
                   ` (4 preceding siblings ...)
  2019-12-06 23:57 ` [PATCH 05/16] KVM: x86/mmu: Refactor handling of cache consistency with TDP Sean Christopherson
@ 2019-12-06 23:57 ` Sean Christopherson
  2019-12-06 23:57 ` [PATCH 07/16] KVM: x86/mmu: Refactor handling of forced 4k pages in page faults Sean Christopherson
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2019-12-06 23:57 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel

Invert the loop which adjusts the allowed page level based on what's
compatible with the associated memslot to use a largest-to-smallest
page size walk.  This paves the way for passing around a "max level"
variable instead of having redundant checks and/or multiple booleans.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e4a96236aa14..bd1711201181 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1326,7 +1326,7 @@ gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t gfn,
 static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn,
 			 bool *force_pt_level)
 {
-	int host_level, level, max_level;
+	int host_level, max_level;
 	struct kvm_memory_slot *slot;
 
 	if (unlikely(*force_pt_level))
@@ -1343,12 +1343,12 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn,
 		return host_level;
 
 	max_level = min(kvm_x86_ops->get_lpage_level(), host_level);
-
-	for (level = PT_DIRECTORY_LEVEL; level <= max_level; ++level)
-		if (__mmu_gfn_lpage_is_disallowed(large_gfn, level, slot))
+	for ( ; max_level > PT_PAGE_TABLE_LEVEL; max_level--) {
+		if (!__mmu_gfn_lpage_is_disallowed(large_gfn, max_level, slot))
 			break;
+	}
 
-	return level - 1;
+	return max_level;
 }
 
 /*
-- 
2.24.0



* [PATCH 07/16] KVM: x86/mmu: Refactor handling of forced 4k pages in page faults
  2019-12-06 23:57 [PATCH 00/16] KVM: x86: MMU page fault clean-up Sean Christopherson
                   ` (5 preceding siblings ...)
  2019-12-06 23:57 ` [PATCH 06/16] KVM: x86/mmu: Refactor the per-slot level calculation in mapping_level() Sean Christopherson
@ 2019-12-06 23:57 ` Sean Christopherson
  2019-12-06 23:57 ` [PATCH 08/16] KVM: x86/mmu: Incorporate guest's page level into max level for shadow MMU Sean Christopherson
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2019-12-06 23:57 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel

Refactor the page fault handlers and mapping_level() to track the max
allowed page level instead of only tracking if a 4k page is mandatory
due to one restriction or another.  This paves the way for cleanly
consolidating tdp_page_fault() and nonpaging_page_fault(), and for
eliminating a redundant check on mmu_gfn_lpage_is_disallowed().

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/mmu.c         | 45 ++++++++++++++--------------------
 arch/x86/kvm/mmu/paging_tmpl.h | 16 +++++++-----
 2 files changed, 29 insertions(+), 32 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index bd1711201181..877924cbb75b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1324,18 +1324,19 @@ gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t gfn,
 }
 
 static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn,
-			 bool *force_pt_level)
+			 int *max_levelp)
 {
-	int host_level, max_level;
+	int host_level, max_level = *max_levelp;
 	struct kvm_memory_slot *slot;
 
-	if (unlikely(*force_pt_level))
+	if (unlikely(max_level == PT_PAGE_TABLE_LEVEL))
 		return PT_PAGE_TABLE_LEVEL;
 
 	slot = kvm_vcpu_gfn_to_memslot(vcpu, large_gfn);
-	*force_pt_level = !memslot_valid_for_gpte(slot, true);
-	if (unlikely(*force_pt_level))
+	if (!memslot_valid_for_gpte(slot, true)) {
+		*max_levelp = PT_PAGE_TABLE_LEVEL;
 		return PT_PAGE_TABLE_LEVEL;
+	}
 
 	host_level = host_mapping_level(vcpu->kvm, large_gfn);
 
@@ -4169,9 +4170,10 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa,
 	unsigned long mmu_seq;
 	gfn_t gfn = gpa >> PAGE_SHIFT;
 	bool write = error_code & PFERR_WRITE_MASK;
-	bool force_pt_level, map_writable;
+	bool map_writable;
 	bool exec = error_code & PFERR_FETCH_MASK;
 	bool lpage_disallowed = exec && is_nx_huge_page_enabled();
+	int max_level;
 
 	/* Note, paging is disabled, ergo gva == gpa. */
 	pgprintk("%s: gva %lx error %x\n", __func__, gpa, error_code);
@@ -4187,19 +4189,12 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa,
 
 	MMU_WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa));
 
-	force_pt_level = lpage_disallowed;
-	level = mapping_level(vcpu, gfn, &force_pt_level);
-	if (likely(!force_pt_level)) {
-		/*
-		 * This path builds a PAE pagetable - so we can map
-		 * 2mb pages at maximum. Therefore check if the level
-		 * is larger than that.
-		 */
-		if (level > PT_DIRECTORY_LEVEL)
-			level = PT_DIRECTORY_LEVEL;
+	/* This path builds a PAE pagetable, we can map 2mb pages at maximum. */
+	max_level = lpage_disallowed ? PT_PAGE_TABLE_LEVEL : PT_DIRECTORY_LEVEL;
 
+	level = mapping_level(vcpu, gfn, &max_level);
+	if (level > PT_PAGE_TABLE_LEVEL)
 		gfn &= ~(KVM_PAGES_PER_HPAGE(level) - 1);
-	}
 
 	if (fast_page_fault(vcpu, gpa, level, error_code))
 		return RET_PF_RETRY;
@@ -4219,7 +4214,7 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa,
 		goto out_unlock;
 	if (make_mmu_pages_available(vcpu) < 0)
 		goto out_unlock;
-	if (likely(!force_pt_level))
+	if (likely(max_level > PT_PAGE_TABLE_LEVEL))
 		transparent_hugepage_adjust(vcpu, gfn, &pfn, &level);
 	r = __direct_map(vcpu, gpa, write, map_writable, level, pfn,
 			 prefault, false);
@@ -4273,7 +4268,6 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	kvm_pfn_t pfn;
 	int r;
 	int level;
-	bool force_pt_level;
 	gfn_t gfn = gpa >> PAGE_SHIFT;
 	unsigned long mmu_seq;
 	int write = error_code & PFERR_WRITE_MASK;
@@ -4301,13 +4295,12 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 			break;
 	}
 
-	force_pt_level = lpage_disallowed || max_level == PT_PAGE_TABLE_LEVEL;
-	level = mapping_level(vcpu, gfn, &force_pt_level);
-	if (likely(!force_pt_level)) {
-		if (level > max_level)
-			level = max_level;
+	if (lpage_disallowed)
+		max_level = PT_PAGE_TABLE_LEVEL;
+
+	level = mapping_level(vcpu, gfn, &max_level);
+	if (level > PT_PAGE_TABLE_LEVEL)
 		gfn &= ~(KVM_PAGES_PER_HPAGE(level) - 1);
-	}
 
 	if (fast_page_fault(vcpu, gpa, level, error_code))
 		return RET_PF_RETRY;
@@ -4327,7 +4320,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 		goto out_unlock;
 	if (make_mmu_pages_available(vcpu) < 0)
 		goto out_unlock;
-	if (likely(!force_pt_level))
+	if (likely(max_level > PT_PAGE_TABLE_LEVEL))
 		transparent_hugepage_adjust(vcpu, gfn, &pfn, &level);
 	r = __direct_map(vcpu, gpa, write, map_writable, level, pfn,
 			 prefault, lpage_disallowed);
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index c1d7b866a03f..1938a6e4e631 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -778,7 +778,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
 	bool map_writable, is_self_change_mapping;
 	bool lpage_disallowed = (error_code & PFERR_FETCH_MASK) &&
 				is_nx_huge_page_enabled();
-	bool force_pt_level = lpage_disallowed;
+	int max_level;
 
 	pgprintk("%s: addr %lx err %x\n", __func__, addr, error_code);
 
@@ -818,14 +818,18 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
 	is_self_change_mapping = FNAME(is_self_change_mapping)(vcpu,
 	      &walker, user_fault, &vcpu->arch.write_fault_to_shadow_pgtable);
 
+	max_level = lpage_disallowed ? PT_PAGE_TABLE_LEVEL :
+				       PT_MAX_HUGEPAGE_LEVEL;
+
 	if (walker.level >= PT_DIRECTORY_LEVEL && !is_self_change_mapping) {
-		level = mapping_level(vcpu, walker.gfn, &force_pt_level);
-		if (likely(!force_pt_level)) {
+		level = mapping_level(vcpu, walker.gfn, &max_level);
+		if (likely(max_level > PT_DIRECTORY_LEVEL)) {
 			level = min(walker.level, level);
 			walker.gfn = walker.gfn & ~(KVM_PAGES_PER_HPAGE(level) - 1);
 		}
-	} else
-		force_pt_level = true;
+	} else {
+		max_level = PT_PAGE_TABLE_LEVEL;
+	}
 
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
@@ -865,7 +869,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
 	kvm_mmu_audit(vcpu, AUDIT_PRE_PAGE_FAULT);
 	if (make_mmu_pages_available(vcpu) < 0)
 		goto out_unlock;
-	if (!force_pt_level)
+	if (max_level > PT_PAGE_TABLE_LEVEL)
 		transparent_hugepage_adjust(vcpu, walker.gfn, &pfn, &level);
 	r = FNAME(fetch)(vcpu, addr, &walker, write_fault,
 			 level, pfn, map_writable, prefault, lpage_disallowed);
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 08/16] KVM: x86/mmu: Incorporate guest's page level into max level for shadow MMU
  2019-12-06 23:57 [PATCH 00/16] KVM: x86: MMU page fault clean-up Sean Christopherson
                   ` (6 preceding siblings ...)
  2019-12-06 23:57 ` [PATCH 07/16] KVM: x86/mmu: Refactor handling of forced 4k pages in page faults Sean Christopherson
@ 2019-12-06 23:57 ` Sean Christopherson
  2019-12-06 23:57 ` [PATCH 09/16] KVM: x86/mmu: Persist gfn_lpage_is_disallowed() to max_level Sean Christopherson
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2019-12-06 23:57 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel

Restrict the max level for a shadow page based on the guest's level
instead of capping the level after the fact for host-mapped huge pages,
e.g. hugetlbfs pages.  Explicitly capping the max level using the guest
mapping level also eliminates FNAME(page_fault)'s subtle dependency on
THP only supporting 2mb pages.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/paging_tmpl.h | 20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 1938a6e4e631..7d57ec576df0 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -773,7 +773,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
 	struct guest_walker walker;
 	int r;
 	kvm_pfn_t pfn;
-	int level = PT_PAGE_TABLE_LEVEL;
+	int level;
 	unsigned long mmu_seq;
 	bool map_writable, is_self_change_mapping;
 	bool lpage_disallowed = (error_code & PFERR_FETCH_MASK) &&
@@ -818,18 +818,14 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
 	is_self_change_mapping = FNAME(is_self_change_mapping)(vcpu,
 	      &walker, user_fault, &vcpu->arch.write_fault_to_shadow_pgtable);
 
-	max_level = lpage_disallowed ? PT_PAGE_TABLE_LEVEL :
-				       PT_MAX_HUGEPAGE_LEVEL;
-
-	if (walker.level >= PT_DIRECTORY_LEVEL && !is_self_change_mapping) {
-		level = mapping_level(vcpu, walker.gfn, &max_level);
-		if (likely(max_level > PT_DIRECTORY_LEVEL)) {
-			level = min(walker.level, level);
-			walker.gfn = walker.gfn & ~(KVM_PAGES_PER_HPAGE(level) - 1);
-		}
-	} else {
+	if (lpage_disallowed || is_self_change_mapping)
 		max_level = PT_PAGE_TABLE_LEVEL;
-	}
+	else
+		max_level = walker.level;
+
+	level = mapping_level(vcpu, walker.gfn, &max_level);
+	if (level > PT_PAGE_TABLE_LEVEL)
+		walker.gfn = walker.gfn & ~(KVM_PAGES_PER_HPAGE(level) - 1);
 
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 09/16] KVM: x86/mmu: Persist gfn_lpage_is_disallowed() to max_level
  2019-12-06 23:57 [PATCH 00/16] KVM: x86: MMU page fault clean-up Sean Christopherson
                   ` (7 preceding siblings ...)
  2019-12-06 23:57 ` [PATCH 08/16] KVM: x86/mmu: Incorporate guest's page level into max level for shadow MMU Sean Christopherson
@ 2019-12-06 23:57 ` Sean Christopherson
  2019-12-06 23:57 ` [PATCH 10/16] KVM: x86/mmu: Rename lpage_disallowed to account_disallowed_nx_lpage Sean Christopherson
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2019-12-06 23:57 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel

Persist the max page level calculated via gfn_lpage_is_disallowed() to
the max level "returned" by mapping_level() so that it's naturally taken
into account by the max level check that gates the call to
transparent_hugepage_adjust().

Drop the gfn_lpage_is_disallowed() check in thp_adjust() as it's now
handled by mapping_level() and its callers.

Add a comment to document the behavior of host_mapping_level() and its
interaction with max level and transparent huge pages.

Note, transferring the gfn_lpage_is_disallowed() check from thp_adjust()
to mapping_level() superficially affects how changes to a memslot's
disallow_lpage count are handled, because thp_adjust() runs while holding
mmu_lock whereas mapping_level() does not.

In the more common case where a different vCPU increments the count via
account_shadowed(), gfn_lpage_is_disallowed() is rechecked by set_spte()
to ensure a writable large page isn't created.

In the less common case where the count is decremented to zero due to
all shadow pages in the memslot being zapped, THP behavior now matches
hugetlbfs behavior in the sense that a small page will be created even
though a large page could be used, if the count reaches zero in the
minuscule window between mapping_level() and acquiring mmu_lock.

Lastly, the new THP behavior also follows hugetlbfs behavior in the
absurdly unlikely scenario of a memslot being moved such that the
memslot's compatibility with respect to large pages changes, but without
changing the validity of the gfn->pfn walk.  I.e. if a memslot is moved
between mapping_level() and snapshotting mmu_seq, it's theoretically
possible to consume a stale disallow_lpage count.  But, since KVM zaps
all shadow pages when moving a memslot and forces all vCPUs to reload a
new MMU, the inserted spte will always be thrown away prior to
completing the memslot move, i.e. whether or not the spte accurately
reflects disallow_lpage is irrelevant.
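
For reference, a condensed sketch of the resulting caller flow, pieced
together from the page fault handlers elsewhere in this series (it is
illustrative only, not new code): max_level is an in/out cap that
mapping_level() may lower, while its return value reflects only the host
mapping level, so the existing max_level check naturally gates the call
to thp_adjust():

	max_level = lpage_disallowed ? PT_PAGE_TABLE_LEVEL :
				       PT_MAX_HUGEPAGE_LEVEL;
	level = mapping_level(vcpu, gfn, &max_level);	/* may lower max_level */
	...
	spin_lock(&vcpu->kvm->mmu_lock);
	...
	if (likely(max_level > PT_PAGE_TABLE_LEVEL))
		transparent_hugepage_adjust(vcpu, gfn, &pfn, &level);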

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 877924cbb75b..8782a70abe78 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1326,7 +1326,7 @@ gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t gfn,
 static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn,
 			 int *max_levelp)
 {
-	int host_level, max_level = *max_levelp;
+	int max_level = *max_levelp;
 	struct kvm_memory_slot *slot;
 
 	if (unlikely(max_level == PT_PAGE_TABLE_LEVEL))
@@ -1338,18 +1338,27 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn,
 		return PT_PAGE_TABLE_LEVEL;
 	}
 
-	host_level = host_mapping_level(vcpu->kvm, large_gfn);
-
-	if (host_level == PT_PAGE_TABLE_LEVEL)
-		return host_level;
-
-	max_level = min(kvm_x86_ops->get_lpage_level(), host_level);
+	max_level = min(max_level, kvm_x86_ops->get_lpage_level());
 	for ( ; max_level > PT_PAGE_TABLE_LEVEL; max_level--) {
 		if (!__mmu_gfn_lpage_is_disallowed(large_gfn, max_level, slot))
 			break;
 	}
 
-	return max_level;
+	*max_levelp = max_level;
+
+	if (max_level == PT_PAGE_TABLE_LEVEL)
+		return PT_PAGE_TABLE_LEVEL;
+
+	/*
+	 * Note, host_mapping_level() does *not* handle transparent huge pages.
+	 * As suggested by "mapping", it reflects the page size established by
+	 * the associated vma, if there is one, i.e. host_mapping_level() will
+	 * return a huge page level if and only if a vma exists and the backing
+	 * implementation for the vma uses huge pages, e.g. hugetlbfs and dax.
+	 * So, do not propagate host_mapping_level() to max_level as KVM can
+	 * still promote the guest mapping to a huge page in the THP case.
+	 */
+	return host_mapping_level(vcpu->kvm, large_gfn);
 }
 
 /*
@@ -3420,8 +3429,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
 	 */
 	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
 	    !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
-	    PageTransCompoundMap(pfn_to_page(pfn)) &&
-	    !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
+	    PageTransCompoundMap(pfn_to_page(pfn))) {
 		unsigned long mask;
 		/*
 		 * mmu_notifier_retry was successful and we hold the
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 10/16] KVM: x86/mmu: Rename lpage_disallowed to account_disallowed_nx_lpage
  2019-12-06 23:57 [PATCH 00/16] KVM: x86: MMU page fault clean-up Sean Christopherson
                   ` (8 preceding siblings ...)
  2019-12-06 23:57 ` [PATCH 09/16] KVM: x86/mmu: Persist gfn_lpage_is_disallowed() to max_level Sean Christopherson
@ 2019-12-06 23:57 ` Sean Christopherson
  2019-12-06 23:57 ` [PATCH 11/16] KVM: x86/mmu: Consolidate tdp_page_fault() and nonpaging_page_fault() Sean Christopherson
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2019-12-06 23:57 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel

Rename __direct_map()'s param that controls whether or not a disallowed
NX large page should be accounted, so that the name matches what the
param actually does.  The nonpaging_page_fault() case unconditionally
passes %false for the param even though it locally sets lpage_disallowed.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8782a70abe78..a09a3bc0b6ae 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3349,7 +3349,7 @@ static void disallowed_hugepage_adjust(struct kvm_shadow_walk_iterator it,
 
 static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, int write,
 			int map_writable, int level, kvm_pfn_t pfn,
-			bool prefault, bool lpage_disallowed)
+			bool prefault, bool account_disallowed_nx_lpage)
 {
 	struct kvm_shadow_walk_iterator it;
 	struct kvm_mmu_page *sp;
@@ -3378,7 +3378,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, int write,
 					      it.level - 1, true, ACC_ALL);
 
 			link_shadow_page(vcpu, it.sptep, sp);
-			if (lpage_disallowed)
+			if (account_disallowed_nx_lpage)
 				account_huge_nx_page(vcpu->kvm, sp);
 		}
 	}
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 11/16] KVM: x86/mmu: Consolidate tdp_page_fault() and nonpaging_page_fault()
  2019-12-06 23:57 [PATCH 00/16] KVM: x86: MMU page fault clean-up Sean Christopherson
                   ` (9 preceding siblings ...)
  2019-12-06 23:57 ` [PATCH 10/16] KVM: x86/mmu: Rename lpage_disallowed to account_disallowed_nx_lpage Sean Christopherson
@ 2019-12-06 23:57 ` Sean Christopherson
  2019-12-06 23:57 ` [PATCH 12/16] KVM: x86/mmu: Move transparent_hugepage_adjust() above __direct_map() Sean Christopherson
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2019-12-06 23:57 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel

Consolidate the direct MMU page fault handlers into a common helper,
direct_page_fault().  Except for unique max level conditions, the tdp
and nonpaging fault handlers are functionally identical.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 166 +++++++++++++++--------------------------
 1 file changed, 61 insertions(+), 105 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a09a3bc0b6ae..929fca02b075 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4169,67 +4169,72 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 	return false;
 }
 
-static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa,
-				u32 error_code, bool prefault)
+static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
+			     bool prefault, int max_level, bool is_tdp)
 {
-	int r;
-	int level;
-	kvm_pfn_t pfn;
-	unsigned long mmu_seq;
-	gfn_t gfn = gpa >> PAGE_SHIFT;
 	bool write = error_code & PFERR_WRITE_MASK;
-	bool map_writable;
 	bool exec = error_code & PFERR_FETCH_MASK;
 	bool lpage_disallowed = exec && is_nx_huge_page_enabled();
-	int max_level;
+	bool map_writable;
 
-	/* Note, paging is disabled, ergo gva == gpa. */
+	gfn_t gfn = gpa >> PAGE_SHIFT;
+	unsigned long mmu_seq;
+	kvm_pfn_t pfn;
+	int level, r;
+
+	MMU_WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa));
+
+	if (page_fault_handle_page_track(vcpu, error_code, gfn))
+		return RET_PF_EMULATE;
+
+	r = mmu_topup_memory_caches(vcpu);
+	if (r)
+		return r;
+
+	if (lpage_disallowed)
+		max_level = PT_PAGE_TABLE_LEVEL;
+
+	level = mapping_level(vcpu, gfn, &max_level);
+	if (level > PT_PAGE_TABLE_LEVEL)
+		gfn &= ~(KVM_PAGES_PER_HPAGE(level) - 1);
+
+	if (fast_page_fault(vcpu, gpa, level, error_code))
+		return RET_PF_RETRY;
+
+	mmu_seq = vcpu->kvm->mmu_notifier_seq;
+	smp_rmb();
+
+	if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable))
+		return RET_PF_RETRY;
+
+	if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r))
+		return r;
+
+	r = RET_PF_RETRY;
+	spin_lock(&vcpu->kvm->mmu_lock);
+	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
+		goto out_unlock;
+	if (make_mmu_pages_available(vcpu) < 0)
+		goto out_unlock;
+	if (likely(max_level > PT_PAGE_TABLE_LEVEL))
+		transparent_hugepage_adjust(vcpu, gfn, &pfn, &level);
+	r = __direct_map(vcpu, gpa, write, map_writable, level, pfn, prefault,
+			 is_tdp && lpage_disallowed);
+
+out_unlock:
+	spin_unlock(&vcpu->kvm->mmu_lock);
+	kvm_release_pfn_clean(pfn);
+	return r;
+}
+
+static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa,
+				u32 error_code, bool prefault)
+{
 	pgprintk("%s: gva %lx error %x\n", __func__, gpa, error_code);
 
-	gpa &= PAGE_MASK;
-
-	if (page_fault_handle_page_track(vcpu, error_code, gfn))
-		return RET_PF_EMULATE;
-
-	r = mmu_topup_memory_caches(vcpu);
-	if (r)
-		return r;
-
-	MMU_WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa));
-
 	/* This path builds a PAE pagetable, we can map 2mb pages at maximum. */
-	max_level = lpage_disallowed ? PT_PAGE_TABLE_LEVEL : PT_DIRECTORY_LEVEL;
-
-	level = mapping_level(vcpu, gfn, &max_level);
-	if (level > PT_PAGE_TABLE_LEVEL)
-		gfn &= ~(KVM_PAGES_PER_HPAGE(level) - 1);
-
-	if (fast_page_fault(vcpu, gpa, level, error_code))
-		return RET_PF_RETRY;
-
-	mmu_seq = vcpu->kvm->mmu_notifier_seq;
-	smp_rmb();
-
-	if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable))
-		return RET_PF_RETRY;
-
-	if (handle_abnormal_pfn(vcpu, gpa, gfn, pfn, ACC_ALL, &r))
-		return r;
-
-	r = RET_PF_RETRY;
-	spin_lock(&vcpu->kvm->mmu_lock);
-	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
-		goto out_unlock;
-	if (make_mmu_pages_available(vcpu) < 0)
-		goto out_unlock;
-	if (likely(max_level > PT_PAGE_TABLE_LEVEL))
-		transparent_hugepage_adjust(vcpu, gfn, &pfn, &level);
-	r = __direct_map(vcpu, gpa, write, map_writable, level, pfn,
-			 prefault, false);
-out_unlock:
-	spin_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(pfn);
-	return r;
+	return direct_page_fault(vcpu, gpa & PAGE_MASK, error_code, prefault,
+				 PT_DIRECTORY_LEVEL, false);
 }
 
 int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
@@ -4273,69 +4278,20 @@ EXPORT_SYMBOL_GPL(kvm_handle_page_fault);
 static int tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 			  bool prefault)
 {
-	kvm_pfn_t pfn;
-	int r;
-	int level;
-	gfn_t gfn = gpa >> PAGE_SHIFT;
-	unsigned long mmu_seq;
-	int write = error_code & PFERR_WRITE_MASK;
-	bool map_writable;
-	bool lpage_disallowed = (error_code & PFERR_FETCH_MASK) &&
-				is_nx_huge_page_enabled();
 	int max_level;
 
-	MMU_WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa));
-
-	if (page_fault_handle_page_track(vcpu, error_code, gfn))
-		return RET_PF_EMULATE;
-
-	r = mmu_topup_memory_caches(vcpu);
-	if (r)
-		return r;
-
 	for (max_level = PT_MAX_HUGEPAGE_LEVEL;
 	     max_level > PT_PAGE_TABLE_LEVEL;
 	     max_level--) {
 		int page_num = KVM_PAGES_PER_HPAGE(max_level);
-		gfn_t base = gfn & ~(page_num - 1);
+		gfn_t base = (gpa >> PAGE_SHIFT) & ~(page_num - 1);
 
 		if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
 			break;
 	}
 
-	if (lpage_disallowed)
-		max_level = PT_PAGE_TABLE_LEVEL;
-
-	level = mapping_level(vcpu, gfn, &max_level);
-	if (level > PT_PAGE_TABLE_LEVEL)
-		gfn &= ~(KVM_PAGES_PER_HPAGE(level) - 1);
-
-	if (fast_page_fault(vcpu, gpa, level, error_code))
-		return RET_PF_RETRY;
-
-	mmu_seq = vcpu->kvm->mmu_notifier_seq;
-	smp_rmb();
-
-	if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable))
-		return RET_PF_RETRY;
-
-	if (handle_abnormal_pfn(vcpu, 0, gfn, pfn, ACC_ALL, &r))
-		return r;
-
-	r = RET_PF_RETRY;
-	spin_lock(&vcpu->kvm->mmu_lock);
-	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
-		goto out_unlock;
-	if (make_mmu_pages_available(vcpu) < 0)
-		goto out_unlock;
-	if (likely(max_level > PT_PAGE_TABLE_LEVEL))
-		transparent_hugepage_adjust(vcpu, gfn, &pfn, &level);
-	r = __direct_map(vcpu, gpa, write, map_writable, level, pfn,
-			 prefault, lpage_disallowed);
-out_unlock:
-	spin_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(pfn);
-	return r;
+	return direct_page_fault(vcpu, gpa, error_code, prefault,
+				 max_level, true);
 }
 
 static void nonpaging_init_context(struct kvm_vcpu *vcpu,
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 12/16] KVM: x86/mmu: Move transparent_hugepage_adjust() above __direct_map()
  2019-12-06 23:57 [PATCH 00/16] KVM: x86: MMU page fault clean-up Sean Christopherson
                   ` (10 preceding siblings ...)
  2019-12-06 23:57 ` [PATCH 11/16] KVM: x86/mmu: Consolidate tdp_page_fault() and nonpaging_page_fault() Sean Christopherson
@ 2019-12-06 23:57 ` Sean Christopherson
  2019-12-06 23:57 ` [PATCH 13/16] KVM: x86/mmu: Move calls to thp_adjust() down a level Sean Christopherson
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2019-12-06 23:57 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel

Move thp_adjust() above __direct_map() in preparation for calling
thp_adjust() from __direct_map() and FNAME(fetch).

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 76 +++++++++++++++++++++---------------------
 1 file changed, 38 insertions(+), 38 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 929fca02b075..49e5d48e7327 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3324,6 +3324,44 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
 	__direct_pte_prefetch(vcpu, sp, sptep);
 }
 
+static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
+					gfn_t gfn, kvm_pfn_t *pfnp,
+					int *levelp)
+{
+	kvm_pfn_t pfn = *pfnp;
+	int level = *levelp;
+
+	/*
+	 * Check if it's a transparent hugepage. If this would be an
+	 * hugetlbfs page, level wouldn't be set to
+	 * PT_PAGE_TABLE_LEVEL and there would be no adjustment done
+	 * here.
+	 */
+	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
+	    !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
+	    PageTransCompoundMap(pfn_to_page(pfn))) {
+		unsigned long mask;
+		/*
+		 * mmu_notifier_retry was successful and we hold the
+		 * mmu_lock here, so the pmd can't become splitting
+		 * from under us, and in turn
+		 * __split_huge_page_refcount() can't run from under
+		 * us and we can safely transfer the refcount from
+		 * PG_tail to PG_head as we switch the pfn to tail to
+		 * head.
+		 */
+		*levelp = level = PT_DIRECTORY_LEVEL;
+		mask = KVM_PAGES_PER_HPAGE(level) - 1;
+		VM_BUG_ON((gfn & mask) != (pfn & mask));
+		if (pfn & mask) {
+			kvm_release_pfn_clean(pfn);
+			pfn &= ~mask;
+			kvm_get_pfn(pfn);
+			*pfnp = pfn;
+		}
+	}
+}
+
 static void disallowed_hugepage_adjust(struct kvm_shadow_walk_iterator it,
 				       gfn_t gfn, kvm_pfn_t *pfnp, int *levelp)
 {
@@ -3414,44 +3452,6 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
 	return -EFAULT;
 }
 
-static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
-					gfn_t gfn, kvm_pfn_t *pfnp,
-					int *levelp)
-{
-	kvm_pfn_t pfn = *pfnp;
-	int level = *levelp;
-
-	/*
-	 * Check if it's a transparent hugepage. If this would be an
-	 * hugetlbfs page, level wouldn't be set to
-	 * PT_PAGE_TABLE_LEVEL and there would be no adjustment done
-	 * here.
-	 */
-	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
-	    !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
-	    PageTransCompoundMap(pfn_to_page(pfn))) {
-		unsigned long mask;
-		/*
-		 * mmu_notifier_retry was successful and we hold the
-		 * mmu_lock here, so the pmd can't become splitting
-		 * from under us, and in turn
-		 * __split_huge_page_refcount() can't run from under
-		 * us and we can safely transfer the refcount from
-		 * PG_tail to PG_head as we switch the pfn to tail to
-		 * head.
-		 */
-		*levelp = level = PT_DIRECTORY_LEVEL;
-		mask = KVM_PAGES_PER_HPAGE(level) - 1;
-		VM_BUG_ON((gfn & mask) != (pfn & mask));
-		if (pfn & mask) {
-			kvm_release_pfn_clean(pfn);
-			pfn &= ~mask;
-			kvm_get_pfn(pfn);
-			*pfnp = pfn;
-		}
-	}
-}
-
 static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
 				kvm_pfn_t pfn, unsigned access, int *ret_val)
 {
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 13/16] KVM: x86/mmu: Move calls to thp_adjust() down a level
  2019-12-06 23:57 [PATCH 00/16] KVM: x86: MMU page fault clean-up Sean Christopherson
                   ` (11 preceding siblings ...)
  2019-12-06 23:57 ` [PATCH 12/16] KVM: x86/mmu: Move transparent_hugepage_adjust() above __direct_map() Sean Christopherson
@ 2019-12-06 23:57 ` Sean Christopherson
  2019-12-06 23:57 ` [PATCH 14/16] KVM: x86/mmu: Move root_hpa validity checks to top of page fault handler Sean Christopherson
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2019-12-06 23:57 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel

Move the calls to thp_adjust() down a level from the page fault handlers
to the map/fetch helpers and remove the page count shuffling done in
thp_adjust().

Despite holding a reference to the underlying page while processing a
page fault, the page fault flows don't actually rely on holding a
reference to the page when thp_adjust() is called.  At that point, the
fault handlers hold mmu_lock, which prevents mmu_notifier from completing
any invalidations, and have verified no invalidations from mmu_notifier
have occurred since the page reference was acquired (which is done prior
to taking mmu_lock).

The kvm_release_pfn_clean()/kvm_get_pfn() dance in thp_adjust() is a
quirk that is necessitated because thp_adjust() modifies the pfn that is
consumed by its caller.  Because the page fault handlers call
kvm_release_pfn_clean() on said pfn, thp_adjust() needs to transfer the
reference to the correct pfn purely for correctness when the pfn is
released.

Calling thp_adjust() from __direct_map() and FNAME(fetch) means the pfn
adjustment doesn't change the pfn as seen by the page fault handlers,
i.e. the pfn released by the page fault handlers is the same pfn that
was returned by gfn_to_pfn().
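
As a rough illustration (a condensed pseudo-flow, not the actual code),
the quirk stemmed from the fault handler releasing whichever pfn it ends
up holding:

	pfn = gfn_to_pfn(vcpu->kvm, gfn);	/* takes a reference on pfn */
	spin_lock(&vcpu->kvm->mmu_lock);
	...
	transparent_hugepage_adjust(vcpu, gfn, &pfn, &level); /* may change pfn */
	r = __direct_map(vcpu, gpa, ..., pfn, ...);
	spin_unlock(&vcpu->kvm->mmu_lock);
	kvm_release_pfn_clean(pfn);	/* drops the ref on the (possibly new) pfn */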

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/mmu.c         | 31 ++++++++++++-------------------
 arch/x86/kvm/mmu/paging_tmpl.h | 11 ++++++-----
 2 files changed, 18 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 49e5d48e7327..904fb466dd24 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3341,24 +3341,15 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
 	    !kvm_is_zone_device_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL &&
 	    PageTransCompoundMap(pfn_to_page(pfn))) {
 		unsigned long mask;
+
 		/*
-		 * mmu_notifier_retry was successful and we hold the
-		 * mmu_lock here, so the pmd can't become splitting
-		 * from under us, and in turn
-		 * __split_huge_page_refcount() can't run from under
-		 * us and we can safely transfer the refcount from
-		 * PG_tail to PG_head as we switch the pfn to tail to
-		 * head.
+		 * mmu_notifier_retry() was successful and mmu_lock is held, so
+		 * the pmd can't be split from under us.
 		 */
 		*levelp = level = PT_DIRECTORY_LEVEL;
 		mask = KVM_PAGES_PER_HPAGE(level) - 1;
 		VM_BUG_ON((gfn & mask) != (pfn & mask));
-		if (pfn & mask) {
-			kvm_release_pfn_clean(pfn);
-			pfn &= ~mask;
-			kvm_get_pfn(pfn);
-			*pfnp = pfn;
-		}
+		*pfnp = pfn & ~mask;
 	}
 }
 
@@ -3386,8 +3377,9 @@ static void disallowed_hugepage_adjust(struct kvm_shadow_walk_iterator it,
 }
 
 static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, int write,
-			int map_writable, int level, kvm_pfn_t pfn,
-			bool prefault, bool account_disallowed_nx_lpage)
+			int map_writable, int level, int max_level,
+			kvm_pfn_t pfn, bool prefault,
+			bool account_disallowed_nx_lpage)
 {
 	struct kvm_shadow_walk_iterator it;
 	struct kvm_mmu_page *sp;
@@ -3398,6 +3390,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, int write,
 	if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
 		return RET_PF_RETRY;
 
+	if (likely(max_level > PT_PAGE_TABLE_LEVEL))
+		transparent_hugepage_adjust(vcpu, gfn, &pfn, &level);
+
 	trace_kvm_mmu_spte_requested(gpa, level, pfn);
 	for_each_shadow_entry(vcpu, gpa, it) {
 		/*
@@ -4216,10 +4211,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 		goto out_unlock;
 	if (make_mmu_pages_available(vcpu) < 0)
 		goto out_unlock;
-	if (likely(max_level > PT_PAGE_TABLE_LEVEL))
-		transparent_hugepage_adjust(vcpu, gfn, &pfn, &level);
-	r = __direct_map(vcpu, gpa, write, map_writable, level, pfn, prefault,
-			 is_tdp && lpage_disallowed);
+	r = __direct_map(vcpu, gpa, write, map_writable, level, max_level, pfn,
+			 prefault, is_tdp && lpage_disallowed);
 
 out_unlock:
 	spin_unlock(&vcpu->kvm->mmu_lock);
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 7d57ec576df0..3b0ba2a77e28 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -613,7 +613,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
  */
 static int FNAME(fetch)(struct kvm_vcpu *vcpu, gpa_t addr,
 			 struct guest_walker *gw,
-			 int write_fault, int hlevel,
+			 int write_fault, int hlevel, int max_level,
 			 kvm_pfn_t pfn, bool map_writable, bool prefault,
 			 bool lpage_disallowed)
 {
@@ -673,6 +673,9 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gpa_t addr,
 	gfn = gw->gfn | ((addr & PT_LVL_OFFSET_MASK(gw->level)) >> PAGE_SHIFT);
 	base_gfn = gfn;
 
+	if (max_level > PT_PAGE_TABLE_LEVEL)
+		transparent_hugepage_adjust(vcpu, gw->gfn, &pfn, &hlevel);
+
 	trace_kvm_mmu_spte_requested(addr, gw->level, pfn);
 
 	for (; shadow_walk_okay(&it); shadow_walk_next(&it)) {
@@ -865,10 +868,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
 	kvm_mmu_audit(vcpu, AUDIT_PRE_PAGE_FAULT);
 	if (make_mmu_pages_available(vcpu) < 0)
 		goto out_unlock;
-	if (max_level > PT_PAGE_TABLE_LEVEL)
-		transparent_hugepage_adjust(vcpu, walker.gfn, &pfn, &level);
-	r = FNAME(fetch)(vcpu, addr, &walker, write_fault,
-			 level, pfn, map_writable, prefault, lpage_disallowed);
+	r = FNAME(fetch)(vcpu, addr, &walker, write_fault, level, max_level,
+			 pfn, map_writable, prefault, lpage_disallowed);
 	kvm_mmu_audit(vcpu, AUDIT_POST_PAGE_FAULT);
 
 out_unlock:
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 14/16] KVM: x86/mmu: Move root_hpa validity checks to top of page fault handler
  2019-12-06 23:57 [PATCH 00/16] KVM: x86: MMU page fault clean-up Sean Christopherson
                   ` (12 preceding siblings ...)
  2019-12-06 23:57 ` [PATCH 13/16] KVM: x86/mmu: Move calls to thp_adjust() down a level Sean Christopherson
@ 2019-12-06 23:57 ` Sean Christopherson
  2019-12-06 23:57 ` [PATCH 15/16] KVM: x86/mmu: WARN on an invalid root_hpa Sean Christopherson
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2019-12-06 23:57 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel

Add a check on root_hpa at the beginning of the page fault handler to
consolidate several checks on root_hpa that are scattered throughout the
page fault code.  This is a preparatory step towards eventually removing
such checks altogether, or at the very least WARNing if an invalid root
is encountered.  Remove only the checks that can be easily audited to
confirm that root_hpa cannot be invalidated between their current
location and the new check in kvm_mmu_page_fault(), and aren't currently
protected by mmu_lock, i.e. keep the checks in __direct_map() and
FNAME(fetch) for the time being.

The root_hpa checks that are consolidated were all added by commit

  37f6a4e237303 ("KVM: x86: handle invalid root_hpa everywhere")

which was a follow up to a bug fix for __direct_map(), commit

  989c6b34f6a94 ("KVM: MMU: handle invalid root_hpa at __direct_map")

At the time, nested VMX had, in hindsight, crazy handling of nested
interrupts and would trigger a nested VM-Exit in ->interrupt_allowed(),
and thus unexpectedly reset the MMU in flows such as can_do_async_pf().

Now that the wonky nested VM-Exit behavior is gone, the root_hpa checks
are bogus and confusing, e.g. it's not at all obvious what they actually
protect against, and at first glance they appear to be broken since many
of them run without holding mmu_lock.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 904fb466dd24..5e8666d25053 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3561,9 +3561,6 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, int level,
 	u64 spte = 0ull;
 	uint retry_count = 0;
 
-	if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
-		return false;
-
 	if (!page_fault_can_be_fast(error_code))
 		return false;
 
@@ -4007,9 +4004,6 @@ walk_shadow_page_get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
 	int root, leaf;
 	bool reserved = false;
 
-	if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
-		goto exit;
-
 	walk_shadow_page_lockless_begin(vcpu);
 
 	for (shadow_walk_init(&iterator, vcpu, addr),
@@ -4039,7 +4033,7 @@ walk_shadow_page_get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
 			root--;
 		}
 	}
-exit:
+
 	*sptep = spte;
 	return reserved;
 }
@@ -4103,9 +4097,6 @@ static void shadow_page_table_clear_flood(struct kvm_vcpu *vcpu, gva_t addr)
 	struct kvm_shadow_walk_iterator iterator;
 	u64 spte;
 
-	if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
-		return;
-
 	walk_shadow_page_lockless_begin(vcpu);
 	for_each_shadow_entry_lockless(vcpu, addr, iterator, spte) {
 		clear_sp_write_flooding_count(iterator.sptep);
@@ -5468,6 +5459,9 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
 	int r, emulation_type = 0;
 	bool direct = vcpu->arch.mmu->direct_map;
 
+	if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
+		return RET_PF_RETRY;
+
 	/* With shadow page tables, fault_address contains a GVA or nGPA.  */
 	if (vcpu->arch.mmu->direct_map) {
 		vcpu->arch.gpa_available = true;
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 15/16] KVM: x86/mmu: WARN on an invalid root_hpa
  2019-12-06 23:57 [PATCH 00/16] KVM: x86: MMU page fault clean-up Sean Christopherson
                   ` (13 preceding siblings ...)
  2019-12-06 23:57 ` [PATCH 14/16] KVM: x86/mmu: Move root_hpa validity checks to top of page fault handler Sean Christopherson
@ 2019-12-06 23:57 ` Sean Christopherson
  2019-12-06 23:57 ` [PATCH 16/16] KVM: x86/mmu: WARN if root_hpa is invalid when handling a page fault Sean Christopherson
  2019-12-09 15:31 ` [PATCH 00/16] KVM: x86: MMU page fault clean-up Paolo Bonzini
  16 siblings, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2019-12-06 23:57 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel

WARN on the existing invalid root_hpa checks in __direct_map() and
FNAME(fetch).  The "legitimate" path that invalidated root_hpa in the
middle of a page fault is long since gone, i.e. it should no longer be
possible to invalidate root_hpa in the middle of a page fault[*].

The root_hpa checks were added by two related commits

  989c6b34f6a94 ("KVM: MMU: handle invalid root_hpa at __direct_map")
  37f6a4e237303 ("KVM: x86: handle invalid root_hpa everywhere")

to fix a bug where nested_vmx_vmexit() could be called *in the middle*
of a page fault.  At the time, vmx_interrupt_allowed(), which was and
still is used by kvm_can_do_async_pf() via ->interrupt_allowed(),
directly invoked nested_vmx_vmexit() to switch from L2 to L1 to emulate
a VM-Exit on a pending interrupt.  Emulating the nested VM-Exit resulted
in root_hpa being invalidated by kvm_mmu_reset_context() without
explicitly terminating the page fault.

Now that root_hpa is checked for validity by kvm_mmu_page_fault(), WARN
on an invalid root_hpa to detect any flows that reset the MMU while
handling a page fault.  The broken vmx_interrupt_allowed() behavior has
long since been fixed and resetting the MMU during a page fault should
not be considered legal behavior.

[*] It's actually technically possible in FNAME(page_fault)() because it
    calls inject_page_fault() when the guest translation is invalid, but
    in that case the page fault handling is immediately terminated.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/mmu.c         | 2 +-
 arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5e8666d25053..88fd1022731f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3387,7 +3387,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, int write,
 	gfn_t gfn = gpa >> PAGE_SHIFT;
 	gfn_t base_gfn = gfn;
 
-	if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
+	if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
 		return RET_PF_RETRY;
 
 	if (likely(max_level > PT_PAGE_TABLE_LEVEL))
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 3b0ba2a77e28..b53bed3c901c 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -637,7 +637,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gpa_t addr,
 	if (FNAME(gpte_changed)(vcpu, gw, top_level))
 		goto out_gpte_changed;
 
-	if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
+	if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
 		goto out_gpte_changed;
 
 	for (shadow_walk_init(&it, vcpu, addr);
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 16/16] KVM: x86/mmu: WARN if root_hpa is invalid when handling a page fault
  2019-12-06 23:57 [PATCH 00/16] KVM: x86: MMU page fault clean-up Sean Christopherson
                   ` (14 preceding siblings ...)
  2019-12-06 23:57 ` [PATCH 15/16] KVM: x86/mmu: WARN on an invalid root_hpa Sean Christopherson
@ 2019-12-06 23:57 ` Sean Christopherson
  2019-12-09 15:31 ` [PATCH 00/16] KVM: x86: MMU page fault clean-up Paolo Bonzini
  16 siblings, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2019-12-06 23:57 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel

WARN if root_hpa is invalid when handling a page fault.  The check on
root_hpa exists for historical reasons that no longer apply to the
current KVM code base.

Remove an equivalent debug-only warning in direct_page_fault(), whose
existence more or less confirms that root_hpa should always be valid
when handling a page fault.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 88fd1022731f..32534ea1efd0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4168,8 +4168,6 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	kvm_pfn_t pfn;
 	int level, r;
 
-	MMU_WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa));
-
 	if (page_fault_handle_page_track(vcpu, error_code, gfn))
 		return RET_PF_EMULATE;
 
@@ -5459,7 +5457,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
 	int r, emulation_type = 0;
 	bool direct = vcpu->arch.mmu->direct_map;
 
-	if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
+	if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
 		return RET_PF_RETRY;
 
 	/* With shadow page tables, fault_address contains a GVA or nGPA.  */
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 00/16] KVM: x86: MMU page fault clean-up
  2019-12-06 23:57 [PATCH 00/16] KVM: x86: MMU page fault clean-up Sean Christopherson
                   ` (15 preceding siblings ...)
  2019-12-06 23:57 ` [PATCH 16/16] KVM: x86/mmu: WARN if root_hpa is invalid when handling a page fault Sean Christopherson
@ 2019-12-09 15:31 ` Paolo Bonzini
  16 siblings, 0 replies; 18+ messages in thread
From: Paolo Bonzini @ 2019-12-09 15:31 UTC (permalink / raw)
  To: Sean Christopherson, Radim Krčmář
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel

On 07/12/19 00:57, Sean Christopherson wrote:
> The original purpose of this series was to call thp_adjust() from
> __direct_map() and FNAME(fetch) to eliminate a page refcounting quirk[*].
> Before doing that, I wanted to clean up the large page handling so that
> the map/fetch functions weren't being passed multiple booleans that tracked
> the same basic info.  While trying to decipher all the interactions,
> I stumbled across a handful of fun things:
> 
>   - 32-bit KVM w/ TDP is completely broken with respect to 64-bit GPAs due
>     to the page fault handlers and all related flows dropping bits 63:32
>     of the GPA.  As a result, KVM inserts the wrong GPA and the guest hangs
>     because it generates EPT/NPT faults until it's killed.
> 
>   - The TDP and non-paging page fault flows are identical except for
>     one-off constraints on guest page size.
> 
>   - The !VALID_PAGE(root_hpa) checks in the page fault flows are bogus.
>     They were added a few years ago to "fix" a nVMX bug and are no longer
>     needed now that nVMX is in much better shape.
> 
> Patch 1 fixes the 32-bit KVM w/ TDP issue.  More details below.
> 
> Patches 2-12 are 99% refactoring to merge TDP and non-paging page fault
> handling, and to do the thp_adjust() move.  These are basically nops from
> a functional perspective.  There are technically functional changes in a
> few patches, but they are very superficial and in theory won't be
> observable in normal usage.
> 
> Patches 13-16 add WARNs on the !VALID_PAGE(root_hpa) checks to make it
> clear that root_hpa is expected to be valid when handling page faults,
> e.g. for the longest time I thought KVM relied on the checks in map/fetch
> to correctly handle kvm_mmu_zap_all().
> 
> 
> 32-bit KVM w/ TDP:
> 
> I marked this patch for stable because it's obviously a bug fix, but I'm
> entirely not sure we want to backport the fix.  Obviously no userspace VMM
> is actually exposing 64-bit GPAs to its guests, i.e. odds are this won't
> actually fix any real world use cases.  And, the scope of the changes are
> likely going to make backporting a pain.  But, on the other hand, if it's
> not backported then future bug fixes in related code are likely to
> conflict, and it does fix the case where a buggy guest kernel accesses a
> non-existent 64-bit GPA (crashes instead of hanging indefinitely).
> 
> I'm also not confident I found all the cases where KVM is truncating the
> GPA.  AFAIK, 32-bit Qemu simply doesn't support 64-bit GPAs.  To confirm
> the bug and verify the fix, I hacked KVM and the guest kernel to generate
> 64-bit GPAs when remapping MMIO, which covers a tiny fragment of KVM.

Queued, thanks!

Paolo

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index fa46fbed60013..49a59bcb32117 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5737,7 +5737,7 @@ static int emulator_read_write(struct x86_emulate_ctxt *ctxt,
>         vcpu->run->mmio.len = min(8u, vcpu->mmio_fragments[0].len);
>         vcpu->run->mmio.is_write = vcpu->mmio_is_write = ops->write;
>         vcpu->run->exit_reason = KVM_EXIT_MMIO;
> -       vcpu->run->mmio.phys_addr = gpa;
> +       vcpu->run->mmio.phys_addr = gpa & 0xffffffffull;
>  
>         return ops->read_write_exit_mmio(vcpu, gpa, val, bytes);
>  }
> 
> 
> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
> index b9c78f3bcd673..e22f254987bea 100644
> --- a/arch/x86/mm/ioremap.c
> +++ b/arch/x86/mm/ioremap.c
> @@ -184,7 +184,10 @@ static void __iomem *__ioremap_caller(resource_size_t phys_addr,
>         if (kernel_map_sync_memtype(phys_addr, size, pcm))
>                 goto err_free_area;
>  
> -       if (ioremap_page_range(vaddr, vaddr + size, phys_addr, prot))
> +       BUG_ON(!boot_cpu_data.x86_phys_bits);
> +
> +       if (ioremap_page_range(vaddr, vaddr + size,
> +                              phys_addr | BIT_ULL(boot_cpu_data.x86_phys_bits - 1), prot))
>                 goto err_free_area;
>  
>         ret_addr = (void __iomem *) (vaddr + offset);
> 
> [*] https://lkml.kernel.org/r/20191126174603.GB22233@linux.intel.com
> 
> Sean Christopherson (16):
>   KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVM
>   KVM: x86/mmu: Move definition of make_mmu_pages_available() up
>   KVM: x86/mmu: Fold nonpaging_map() into nonpaging_page_fault()
>   KVM: x86/mmu: Move nonpaging_page_fault() below try_async_pf()
>   KVM: x86/mmu: Refactor handling of cache consistency with TDP
>   KVM: x86/mmu: Refactor the per-slot level calculation in
>     mapping_level()
>   KVM: x86/mmu: Refactor handling of forced 4k pages in page faults
>   KVM: x86/mmu: Incorporate guest's page level into max level for shadow
>     MMU
>   KVM: x86/mmu: Persist gfn_lpage_is_disallowed() to max_level
>   KVM: x86/mmu: Rename lpage_disallowed to account_disallowed_nx_lpage
>   KVM: x86/mmu: Consolidate tdp_page_fault() and nonpaging_page_fault()
>   KVM: x86/mmu: Move transparent_hugepage_adjust() above __direct_map()
>   KVM: x86/mmu: Move calls to thp_adjust() down a level
>   KVM: x86/mmu: Move root_hpa validity checks to top of page fault
>     handler
>   KVM: x86/mmu: WARN on an invalid root_hpa
>   KVM: x86/mmu: WARN if root_hpa is invalid when handling a page fault
> 
>  arch/x86/include/asm/kvm_host.h |   8 +-
>  arch/x86/kvm/mmu/mmu.c          | 438 ++++++++++++++------------------
>  arch/x86/kvm/mmu/paging_tmpl.h  |  58 +++--
>  arch/x86/kvm/mmutrace.h         |  12 +-
>  arch/x86/kvm/x86.c              |  40 ++-
>  arch/x86/kvm/x86.h              |   2 +-
>  include/linux/kvm_host.h        |   6 +-
>  virt/kvm/async_pf.c             |  10 +-
>  8 files changed, 259 insertions(+), 315 deletions(-)
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread
