* [PATCH 00/22] Introduce the TDP MMU
From: Ben Gardon @ 2020-09-25 21:22 UTC
  To: linux-kernel, kvm
  Cc: Cannon Matthews, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong,
	Ben Gardon

Over the years, the demands on KVM's x86 MMU have grown from running small
guests to live migrating multi-terabyte VMs with hundreds of vCPUs. Where
we previously depended on shadow paging to run all guests, we now have
two-dimensional paging (TDP). This patch set introduces a new
implementation of much of the KVM MMU, optimized for running guests with
TDP. We have re-implemented many of the MMU functions to take advantage of
the relative simplicity of TDP and eliminate the need for an rmap.
Building on this simplified implementation, a future patch set will change
the synchronization model for this "TDP MMU" to enable more parallelism
than the monolithic MMU lock. A TDP MMU is currently in use at Google
and has given us the performance necessary to live migrate our 416 vCPU,
12TiB m2-ultramem-416 VMs.

This work was motivated by the need to handle page faults in parallel for
very large VMs. When VMs have hundreds of vCPUs and terabytes of memory,
KVM's MMU lock suffers extreme contention, resulting in soft-lockups and
long latency on guest page faults. This contention is easy to observe by
running the KVM selftest demand_paging_test with a couple hundred vCPUs.
In a 1-second profile of demand_paging_test, with 416 vCPUs and 4G
per vCPU, 98% of the time was spent waiting for the MMU lock. At Google,
the TDP MMU reduced the test duration by 89%, and execution was
dominated by get_user_pages and the userfaultfd ioctl instead of the
MMU lock.

This series is the first of two. In this series we add a basic
implementation of the TDP MMU. In the next series we will improve the
performance of the TDP MMU and allow it to execute MMU operations
in parallel.

The overall purpose of the KVM MMU is to program paging structures
(CR3/EPT/NPT) to encode the mapping of guest addresses to host physical
addresses (HPA), and to provide utilities for other KVM features, for
example dirty logging. The definition of the L1 guest physical address
(GPA) to HPA mapping comes in two parts: KVM's memslots map GPA to HVA,
and the kernel MM/x86 host page tables map HVA -> HPA. Without TDP, the
MMU must program the x86 page tables to encode the full translation of
guest virtual addresses (GVA) to HPA. This requires "shadowing" the
guest's page tables to create a composite x86 paging structure. This
solution is complicated, requires separate paging structures for each
guest CR3, and requires emulating guest page table changes. The TDP case
is much simpler. In this case, KVM lets the guest control CR3 and programs
the EPT/NPT paging structures with the GPA -> HPA mapping. The guest has
no way to change this mapping and only one version of the paging structure
is needed per L1 paging mode. Here, the paging mode is some
combination of the number of levels in the paging structure, the address
space (normal execution or system management mode, on x86), and other
attributes. Most VMs only ever use one paging mode and so only ever need one
TDP structure.
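
As a rough illustration of the first half of the two-part translation
described above (memslots mapping GPA to HVA), here is a minimal
userspace sketch. The toy_memslot type and toy_gpa_to_hva() are
hypothetical simplifications written for this example, not KVM's actual
data structures or lookup code, and the HVA -> HPA half (the host page
tables) is not modeled at all.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12

/* Hypothetical, simplified stand-in for a memslot (not KVM's struct). */
struct toy_memslot {
	uint64_t base_gfn;       /* first guest frame number covered */
	uint64_t npages;         /* number of 4 KiB pages covered */
	uint64_t userspace_addr; /* HVA that backs base_gfn */
};

/*
 * Translate a GPA to an HVA with a linear search over the slots.
 * Returns false if no slot covers the GPA.
 */
bool toy_gpa_to_hva(const struct toy_memslot *slots, size_t nslots,
		    uint64_t gpa, uint64_t *hva)
{
	uint64_t gfn = gpa >> TOY_PAGE_SHIFT;
	uint64_t offset = gpa & ((1ULL << TOY_PAGE_SHIFT) - 1);
	size_t i;

	for (i = 0; i < nslots; i++) {
		const struct toy_memslot *s = &slots[i];

		if (gfn >= s->base_gfn && gfn - s->base_gfn < s->npages) {
			*hva = s->userspace_addr +
			       ((gfn - s->base_gfn) << TOY_PAGE_SHIFT) +
			       offset;
			return true;
		}
	}

	return false;
}
```

With TDP, the MMU only has to encode the composition of this lookup and
the host's HVA -> HPA mapping into the EPT/NPT structures; it does not
need to track the guest's own page tables.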

This series implements a "TDP MMU" through alternative implementations of
MMU functions for running L1 guests with TDP. The TDP MMU falls back to
the existing shadow paging implementation when TDP is not available, and
interoperates with the existing shadow paging implementation for nesting.
The use of the TDP MMU can be controlled by a module parameter which is
snapshotted at VM creation and held for the life of the VM. This snapshot
is used in many functions to decide whether or not to use TDP MMU handlers
for a given operation.

This series can also be viewed in Gerrit here:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538
(Thanks to Dmitry Vyukov <dvyukov@google.com> for setting up the
Gerrit instance)

Ben Gardon (22):
  kvm: mmu: Separate making SPTEs from set_spte
  kvm: mmu: Introduce tdp_iter
  kvm: mmu: Init / Uninit the TDP MMU
  kvm: mmu: Allocate and free TDP MMU roots
  kvm: mmu: Add functions to handle changed TDP SPTEs
  kvm: mmu: Make address space ID a property of memslots
  kvm: mmu: Support zapping SPTEs in the TDP MMU
  kvm: mmu: Separate making non-leaf sptes from link_shadow_page
  kvm: mmu: Remove disallowed_hugepage_adjust shadow_walk_iterator arg
  kvm: mmu: Add TDP MMU PF handler
  kvm: mmu: Factor out allocating a new tdp_mmu_page
  kvm: mmu: Allocate struct kvm_mmu_pages for all pages in TDP MMU
  kvm: mmu: Support invalidate range MMU notifier for TDP MMU
  kvm: mmu: Add access tracking for tdp_mmu
  kvm: mmu: Support changed pte notifier in tdp MMU
  kvm: mmu: Add dirty logging handler for changed sptes
  kvm: mmu: Support dirty logging for the TDP MMU
  kvm: mmu: Support disabling dirty logging for the tdp MMU
  kvm: mmu: Support write protection for nesting in tdp MMU
  kvm: mmu: NX largepage recovery for TDP MMU
  kvm: mmu: Support MMIO in the TDP MMU
  kvm: mmu: Don't clear write flooding count for direct roots

 arch/x86/include/asm/kvm_host.h |   17 +
 arch/x86/kvm/Makefile           |    3 +-
 arch/x86/kvm/mmu/mmu.c          |  437 ++++++----
 arch/x86/kvm/mmu/mmu_internal.h |   98 +++
 arch/x86/kvm/mmu/paging_tmpl.h  |    3 +-
 arch/x86/kvm/mmu/tdp_iter.c     |  198 +++++
 arch/x86/kvm/mmu/tdp_iter.h     |   55 ++
 arch/x86/kvm/mmu/tdp_mmu.c      | 1315 +++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.h      |   52 ++
 include/linux/kvm_host.h        |    2 +
 virt/kvm/kvm_main.c             |    7 +-
 11 files changed, 2022 insertions(+), 165 deletions(-)
 create mode 100644 arch/x86/kvm/mmu/tdp_iter.c
 create mode 100644 arch/x86/kvm/mmu/tdp_iter.h
 create mode 100644 arch/x86/kvm/mmu/tdp_mmu.c
 create mode 100644 arch/x86/kvm/mmu/tdp_mmu.h

-- 
2.28.0.709.gb0816b6eb0-goog



* [PATCH 01/22] kvm: mmu: Separate making SPTEs from set_spte
From: Ben Gardon @ 2020-09-25 21:22 UTC
  To: linux-kernel, kvm
  Cc: Cannon Matthews, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong,
	Ben Gardon

Separate the functions for generating leaf page table entries from the
function that inserts them into the paging structure. This refactoring
will facilitate changes to the MMU synchronization model to use atomic
compare-exchange operations (which are not guaranteed to succeed) instead of a
monolithic MMU lock.
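
To illustrate why the split helps, here is a minimal userspace sketch of
the general pattern: a pure function computes the new entry value, and a
separate step installs it with an atomic compare-exchange that can fail.
The names, bit layout, and use of C11 <stdatomic.h> are assumptions made
for this example only; this is not the code from this patch or from the
follow-up series.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout for a toy page table entry. */
#define TOY_PRESENT   (1ULL << 0)
#define TOY_WRITABLE  (1ULL << 1)
#define TOY_PFN_SHIFT 12

/* Pure: computes the new entry value without touching the table. */
uint64_t toy_make_entry(uint64_t pfn, bool writable)
{
	uint64_t entry = (pfn << TOY_PFN_SHIFT) | TOY_PRESENT;

	if (writable)
		entry |= TOY_WRITABLE;

	return entry;
}

/*
 * Installs new_val only if the slot still holds old_val. Returns false
 * if another thread changed the slot first, in which case the caller
 * must re-read the slot and either retry or give up.
 */
bool toy_try_install(_Atomic uint64_t *slot, uint64_t old_val,
		     uint64_t new_val)
{
	return atomic_compare_exchange_strong(slot, &old_val, new_val);
}
```

Because toy_make_entry() has no side effects, a caller that loses the
race can simply discard the computed value, which is not possible when
computing and installing the entry happen in a single function.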

No functional change expected.

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This commit introduced no new failures.

This series can be viewed in Gerrit at:
	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 52 +++++++++++++++++++++++++++---------------
 1 file changed, 34 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 71aa3da2a0b7b..81240b558d67f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2971,20 +2971,14 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 #define SET_SPTE_WRITE_PROTECTED_PT	BIT(0)
 #define SET_SPTE_NEED_REMOTE_TLB_FLUSH	BIT(1)
 
-static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
-		    unsigned int pte_access, int level,
-		    gfn_t gfn, kvm_pfn_t pfn, bool speculative,
-		    bool can_unsync, bool host_writable)
+static u64 make_spte(struct kvm_vcpu *vcpu, unsigned int pte_access, int level,
+		     gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool speculative,
+		     bool can_unsync, bool host_writable, bool ad_disabled,
+		     int *ret)
 {
 	u64 spte = 0;
-	int ret = 0;
-	struct kvm_mmu_page *sp;
-
-	if (set_mmio_spte(vcpu, sptep, gfn, pfn, pte_access))
-		return 0;
 
-	sp = sptep_to_sp(sptep);
-	if (sp_ad_disabled(sp))
+	if (ad_disabled)
 		spte |= SPTE_AD_DISABLED_MASK;
 	else if (kvm_vcpu_ad_need_write_protect(vcpu))
 		spte |= SPTE_AD_WRPROT_ONLY_MASK;
@@ -3037,27 +3031,49 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		 * is responsibility of mmu_get_page / kvm_sync_page.
 		 * Same reasoning can be applied to dirty page accounting.
 		 */
-		if (!can_unsync && is_writable_pte(*sptep))
-			goto set_pte;
+		if (!can_unsync && is_writable_pte(old_spte))
+			return spte;
 
 		if (mmu_need_write_protect(vcpu, gfn, can_unsync)) {
 			pgprintk("%s: found shadow page for %llx, marking ro\n",
 				 __func__, gfn);
-			ret |= SET_SPTE_WRITE_PROTECTED_PT;
+			*ret |= SET_SPTE_WRITE_PROTECTED_PT;
 			pte_access &= ~ACC_WRITE_MASK;
 			spte &= ~(PT_WRITABLE_MASK | SPTE_MMU_WRITEABLE);
 		}
 	}
 
-	if (pte_access & ACC_WRITE_MASK) {
-		kvm_vcpu_mark_page_dirty(vcpu, gfn);
+	if (pte_access & ACC_WRITE_MASK)
 		spte |= spte_shadow_dirty_mask(spte);
-	}
 
 	if (speculative)
 		spte = mark_spte_for_access_track(spte);
 
-set_pte:
+	return spte;
+}
+
+static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
+		    unsigned int pte_access, int level,
+		    gfn_t gfn, kvm_pfn_t pfn, bool speculative,
+		    bool can_unsync, bool host_writable)
+{
+	u64 spte = 0;
+	struct kvm_mmu_page *sp;
+	int ret = 0;
+
+	if (set_mmio_spte(vcpu, sptep, gfn, pfn, pte_access))
+		return 0;
+
+	sp = sptep_to_sp(sptep);
+
+	spte = make_spte(vcpu, pte_access, level, gfn, pfn, *sptep, speculative,
+			 can_unsync, host_writable, sp_ad_disabled(sp), &ret);
+	if (!spte)
+		return 0;
+
+	if (spte & PT_WRITABLE_MASK)
+		kvm_vcpu_mark_page_dirty(vcpu, gfn);
+
 	if (mmu_spte_update(sptep, spte))
 		ret |= SET_SPTE_NEED_REMOTE_TLB_FLUSH;
 	return ret;
-- 
2.28.0.709.gb0816b6eb0-goog



* [PATCH 02/22] kvm: mmu: Introduce tdp_iter
From: Ben Gardon @ 2020-09-25 21:22 UTC
  To: linux-kernel, kvm
  Cc: Cannon Matthews, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong,
	Ben Gardon

The TDP iterator implements a pre-order traversal of a TDP paging
structure. This iterator will be used in future patches to create
an efficient implementation of the KVM MMU for the TDP case.
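
For readers unfamiliar with the term, the following userspace sketch
shows the visit order of a pre-order traversal over a toy multi-level
table: each entry is visited before any entry in the table it points to.
The toy_table type and toy_preorder() are hypothetical, and the sketch
uses recursion purely for brevity, whereas the iterator in this patch
keeps explicit state and steps down, sideways, and up.

```c
#include <stddef.h>

#define TOY_ENTRIES 4

/* Hypothetical toy table: an entry is empty or points to a child. */
struct toy_table {
	struct toy_table *child[TOY_ENTRIES];
};

/*
 * Visits every entry of the table at the given level, descending into a
 * child table immediately after visiting the entry that points to it.
 * The level of each visited entry is recorded in out[] (up to cap
 * entries). Returns the running count of visited entries.
 */
size_t toy_preorder(const struct toy_table *t, int level,
		    int *out, size_t n, size_t cap)
{
	size_t i;

	for (i = 0; i < TOY_ENTRIES; i++) {
		if (n < cap)
			out[n] = level;
		n++;

		/* Level 1 entries are leaves and have no child table. */
		if (level > 1 && t->child[i])
			n = toy_preorder(t->child[i], level - 1, out, n, cap);
	}

	return n;
}
```

For a two-level table whose second root entry points to a child table,
the recorded levels are 2, 2, 1, 1, 1, 1, 2, 2: the child's entries
appear directly after the root entry that references them.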

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/Makefile           |   3 +-
 arch/x86/kvm/mmu/mmu.c          |  19 +---
 arch/x86/kvm/mmu/mmu_internal.h |  15 +++
 arch/x86/kvm/mmu/tdp_iter.c     | 163 ++++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/tdp_iter.h     |  53 +++++++++++
 5 files changed, 237 insertions(+), 16 deletions(-)
 create mode 100644 arch/x86/kvm/mmu/tdp_iter.c
 create mode 100644 arch/x86/kvm/mmu/tdp_iter.h

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 4a3081e9f4b5d..cf6a9947955f7 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -15,7 +15,8 @@ kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
 
 kvm-y			+= x86.o emulate.o i8259.o irq.o lapic.o \
 			   i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
-			   hyperv.o debugfs.o mmu/mmu.o mmu/page_track.o
+			   hyperv.o debugfs.o mmu/mmu.o mmu/page_track.o \
+			   mmu/tdp_iter.o
 
 kvm-intel-y		+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o vmx/evmcs.o vmx/nested.o
 kvm-amd-y		+= svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o svm/sev.o
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 81240b558d67f..b48b00c8cde65 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -134,15 +134,6 @@ module_param(dbg, bool, 0644);
 #define SPTE_AD_WRPROT_ONLY_MASK (2ULL << 52)
 #define SPTE_MMIO_MASK (3ULL << 52)
 
-#define PT64_LEVEL_BITS 9
-
-#define PT64_LEVEL_SHIFT(level) \
-		(PAGE_SHIFT + (level - 1) * PT64_LEVEL_BITS)
-
-#define PT64_INDEX(address, level)\
-	(((address) >> PT64_LEVEL_SHIFT(level)) & ((1 << PT64_LEVEL_BITS) - 1))
-
-
 #define PT32_LEVEL_BITS 10
 
 #define PT32_LEVEL_SHIFT(level) \
@@ -192,8 +183,6 @@ module_param(dbg, bool, 0644);
 #define SPTE_HOST_WRITEABLE	(1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
 #define SPTE_MMU_WRITEABLE	(1ULL << (PT_FIRST_AVAIL_BITS_SHIFT + 1))
 
-#define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
-
 /* make pte_list_desc fit well in cache line */
 #define PTE_LIST_EXT 3
 
@@ -346,7 +335,7 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 access_mask)
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
 
-static bool is_mmio_spte(u64 spte)
+bool is_mmio_spte(u64 spte)
 {
 	return (spte & SPTE_SPECIAL_MASK) == SPTE_MMIO_MASK;
 }
@@ -623,7 +612,7 @@ static int is_nx(struct kvm_vcpu *vcpu)
 	return vcpu->arch.efer & EFER_NX;
 }
 
-static int is_shadow_present_pte(u64 pte)
+int is_shadow_present_pte(u64 pte)
 {
 	return (pte != 0) && !is_mmio_spte(pte);
 }
@@ -633,7 +622,7 @@ static int is_large_pte(u64 pte)
 	return pte & PT_PAGE_SIZE_MASK;
 }
 
-static int is_last_spte(u64 pte, int level)
+int is_last_spte(u64 pte, int level)
 {
 	if (level == PG_LEVEL_4K)
 		return 1;
@@ -647,7 +636,7 @@ static bool is_executable_pte(u64 spte)
 	return (spte & (shadow_x_mask | shadow_nx_mask)) == shadow_x_mask;
 }
 
-static kvm_pfn_t spte_to_pfn(u64 pte)
+kvm_pfn_t spte_to_pfn(u64 pte)
 {
 	return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
 }
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 3acf3b8eb469d..65bb110847858 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -60,4 +60,19 @@ void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
 bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 				    struct kvm_memory_slot *slot, u64 gfn);
 
+#define PT64_LEVEL_BITS 9
+
+#define PT64_LEVEL_SHIFT(level) \
+		(PAGE_SHIFT + (level - 1) * PT64_LEVEL_BITS)
+
+#define PT64_INDEX(address, level)\
+	(((address) >> PT64_LEVEL_SHIFT(level)) & ((1 << PT64_LEVEL_BITS) - 1))
+#define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
+
+/* Functions for interpreting SPTEs */
+kvm_pfn_t spte_to_pfn(u64 pte);
+bool is_mmio_spte(u64 spte);
+int is_shadow_present_pte(u64 pte);
+int is_last_spte(u64 pte, int level);
+
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
new file mode 100644
index 0000000000000..ee90d62d2a9b1
--- /dev/null
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -0,0 +1,163 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include "mmu_internal.h"
+#include "tdp_iter.h"
+
+/*
+ * Recalculates the pointer to the SPTE for the current GFN and level and
+ * rereads the SPTE.
+ */
+static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
+{
+	iter->sptep = iter->pt_path[iter->level - 1] +
+		SHADOW_PT_INDEX(iter->gfn << PAGE_SHIFT, iter->level);
+	iter->old_spte = READ_ONCE(*iter->sptep);
+}
+
+/*
+ * Sets a TDP iterator to walk a pre-order traversal of the paging structure
+ * rooted at root_pt, starting with the walk to translate goal_gfn.
+ */
+void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
+		    gfn_t goal_gfn)
+{
+	WARN_ON(root_level < 1);
+	WARN_ON(root_level > PT64_ROOT_MAX_LEVEL);
+
+	iter->goal_gfn = goal_gfn;
+	iter->root_level = root_level;
+	iter->level = root_level;
+	iter->pt_path[iter->level - 1] = root_pt;
+
+	iter->gfn = iter->goal_gfn -
+		(iter->goal_gfn % KVM_PAGES_PER_HPAGE(iter->level));
+	tdp_iter_refresh_sptep(iter);
+
+	iter->valid = true;
+}
+
+/*
+ * Given an SPTE and its level, returns a pointer containing the host virtual
+ * address of the child page table referenced by the SPTE. Returns null if
+ * there is no such entry.
+ */
+u64 *spte_to_child_pt(u64 spte, int level)
+{
+	u64 *pt;
+	/* There's no child entry if this entry isn't present */
+	if (!is_shadow_present_pte(spte))
+		return NULL;
+
+	/* There is no child page table if this is a leaf entry. */
+	if (is_last_spte(spte, level))
+		return NULL;
+
+	pt = (u64 *)__va(spte_to_pfn(spte) << PAGE_SHIFT);
+	return pt;
+}
+
+/*
+ * Steps down one level in the paging structure towards the goal GFN. Returns
+ * true if the iterator was able to step down a level, false otherwise.
+ */
+static bool try_step_down(struct tdp_iter *iter)
+{
+	u64 *child_pt;
+
+	if (iter->level == PG_LEVEL_4K)
+		return false;
+
+	/*
+	 * Reread the SPTE before stepping down to avoid traversing into page
+	 * tables that are no longer linked from this entry.
+	 */
+	iter->old_spte = READ_ONCE(*iter->sptep);
+
+	child_pt = spte_to_child_pt(iter->old_spte, iter->level);
+	if (!child_pt)
+		return false;
+
+	iter->level--;
+	iter->pt_path[iter->level - 1] = child_pt;
+	iter->gfn = iter->goal_gfn -
+		(iter->goal_gfn % KVM_PAGES_PER_HPAGE(iter->level));
+	tdp_iter_refresh_sptep(iter);
+
+	return true;
+}
+
+/*
+ * Steps to the next entry in the current page table, at the current page table
+ * level. The next entry could point to a page backing guest memory or another
+ * page table, or it could be non-present. Returns true if the iterator was
+ * able to step to the next entry in the page table, false if the iterator was
+ * already at the end of the current page table.
+ */
+static bool try_step_side(struct tdp_iter *iter)
+{
+	/*
+	 * Check if the iterator is already at the end of the current page
+	 * table.
+	 */
+	if (!((iter->gfn + KVM_PAGES_PER_HPAGE(iter->level)) %
+	      KVM_PAGES_PER_HPAGE(iter->level + 1)))
+		return false;
+
+	iter->gfn += KVM_PAGES_PER_HPAGE(iter->level);
+	iter->goal_gfn = iter->gfn;
+	iter->sptep++;
+	iter->old_spte = READ_ONCE(*iter->sptep);
+
+	return true;
+}
+
+/*
+ * Tries to traverse back up a level in the paging structure so that the walk
+ * can continue from the next entry in the parent page table. Returns true on a
+ * successful step up, false if already in the root page.
+ */
+static bool try_step_up(struct tdp_iter *iter)
+{
+	if (iter->level == iter->root_level)
+		return false;
+
+	iter->level++;
+	iter->gfn =  iter->gfn - (iter->gfn % KVM_PAGES_PER_HPAGE(iter->level));
+	tdp_iter_refresh_sptep(iter);
+
+	return true;
+}
+
+/*
+ * Step to the next SPTE in a pre-order traversal of the paging structure.
+ * To get to the next SPTE, the iterator either steps down towards the goal
+ * GFN, if at a present, non-last-level SPTE, or over to a SPTE mapping a
+ * higher GFN.
+ *
+ * The basic algorithm is as follows:
+ * 1. If the current SPTE is a non-last-level SPTE, step down into the page
+ *    table it points to.
+ * 2. If the iterator cannot step down, it will try to step to the next SPTE
+ *    in the current page of the paging structure.
+ * 3. If the iterator cannot step to the next entry in the current page, it will
+ *    try to step up to the parent paging structure page. In this case, that
+ *    SPTE will have already been visited, and so the iterator must also step
+ *    to the side again.
+ */
+void tdp_iter_next(struct tdp_iter *iter)
+{
+	bool done;
+
+	done = try_step_down(iter);
+	if (done)
+		return;
+
+	done = try_step_side(iter);
+	while (!done) {
+		if (!try_step_up(iter)) {
+			iter->valid = false;
+			break;
+		}
+		done = try_step_side(iter);
+	}
+}
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
new file mode 100644
index 0000000000000..b102109778eac
--- /dev/null
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -0,0 +1,53 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __KVM_X86_MMU_TDP_ITER_H
+#define __KVM_X86_MMU_TDP_ITER_H
+
+#include <linux/kvm_host.h>
+
+#include "mmu.h"
+
+/*
+ * A TDP iterator performs a pre-order walk over a TDP paging structure.
+ */
+struct tdp_iter {
+	/*
+	 * The iterator will traverse the paging structure towards the mapping
+	 * for this GFN.
+	 */
+	gfn_t goal_gfn;
+	/* Pointers to the page tables traversed to reach the current SPTE */
+	u64 *pt_path[PT64_ROOT_MAX_LEVEL];
+	/* A pointer to the current SPTE */
+	u64 *sptep;
+	/* The lowest GFN mapped by the current SPTE */
+	gfn_t gfn;
+	/* The level of the root page given to the iterator */
+	int root_level;
+	/* The iterator's current level within the paging structure */
+	int level;
+	/* A snapshot of the value at sptep */
+	u64 old_spte;
+	/*
+	 * Whether the iterator has a valid state. This will be false if the
+	 * iterator walks off the end of the paging structure.
+	 */
+	bool valid;
+};
+
+/*
+ * Iterates over every SPTE mapping the GFN range [start, end) in a
+ * preorder traversal.
+ */
+#define for_each_tdp_pte(iter, root, root_level, start, end) \
+	for (tdp_iter_start(&iter, root, root_level, start); \
+	     iter.valid && iter.gfn < end;		     \
+	     tdp_iter_next(&iter))
+
+u64 *spte_to_child_pt(u64 pte, int level);
+
+void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
+		    gfn_t goal_gfn);
+void tdp_iter_next(struct tdp_iter *iter);
+
+#endif /* __KVM_X86_MMU_TDP_ITER_H */
-- 
2.28.0.709.gb0816b6eb0-goog



* [PATCH 03/22] kvm: mmu: Init / Uninit the TDP MMU
From: Ben Gardon @ 2020-09-25 21:22 UTC
  To: linux-kernel, kvm
  Cc: Cannon Matthews, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong,
	Ben Gardon

The TDP MMU offers an alternative mode of operation to the x86 shadow
paging based MMU, optimized for running an L1 guest with TDP. The TDP MMU
will require new fields that need to be initialized and torn down. Add
hooks into the existing KVM MMU initialization process to do that
initialization / cleanup. Currently the initialization and cleanup
functions do very little; more operations will be added in
future patches.

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/include/asm/kvm_host.h |  9 +++++++++
 arch/x86/kvm/Makefile           |  2 +-
 arch/x86/kvm/mmu/mmu.c          |  5 +++++
 arch/x86/kvm/mmu/tdp_mmu.c      | 34 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.h      | 10 ++++++++++
 5 files changed, 59 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kvm/mmu/tdp_mmu.c
 create mode 100644 arch/x86/kvm/mmu/tdp_mmu.h

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5303dbc5c9bce..35107819f48ae 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -963,6 +963,15 @@ struct kvm_arch {
 
 	struct kvm_pmu_event_filter *pmu_event_filter;
 	struct task_struct *nx_lpage_recovery_thread;
+
+	/*
+	 * Whether the TDP MMU is enabled for this VM. This contains a
+	 * snapshot of the TDP MMU module parameter from when the VM was
+	 * created and remains unchanged for the life of the VM. If this is
+	 * true, TDP MMU handler functions will run for various MMU
+	 * operations.
+	 */
+	bool tdp_mmu_enabled;
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index cf6a9947955f7..e5b33938f86ed 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -16,7 +16,7 @@ kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
 kvm-y			+= x86.o emulate.o i8259.o irq.o lapic.o \
 			   i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
 			   hyperv.o debugfs.o mmu/mmu.o mmu/page_track.o \
-			   mmu/tdp_iter.o
+			   mmu/tdp_iter.o mmu/tdp_mmu.o
 
 kvm-intel-y		+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o vmx/evmcs.o vmx/nested.o
 kvm-amd-y		+= svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o svm/sev.o
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b48b00c8cde65..0cb0c26939dfc 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -19,6 +19,7 @@
 #include "ioapic.h"
 #include "mmu.h"
 #include "mmu_internal.h"
+#include "tdp_mmu.h"
 #include "x86.h"
 #include "kvm_cache_regs.h"
 #include "kvm_emulate.h"
@@ -5865,6 +5866,8 @@ void kvm_mmu_init_vm(struct kvm *kvm)
 {
 	struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
 
+	kvm_mmu_init_tdp_mmu(kvm);
+
 	node->track_write = kvm_mmu_pte_write;
 	node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
 	kvm_page_track_register_notifier(kvm, node);
@@ -5875,6 +5878,8 @@ void kvm_mmu_uninit_vm(struct kvm *kvm)
 	struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
 
 	kvm_page_track_unregister_notifier(kvm, node);
+
+	kvm_mmu_uninit_tdp_mmu(kvm);
 }
 
 void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
new file mode 100644
index 0000000000000..8241e18c111e6
--- /dev/null
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include "tdp_mmu.h"
+
+static bool __read_mostly tdp_mmu_enabled = true;
+module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0644);
+
+static bool is_tdp_mmu_enabled(void)
+{
+	if (!READ_ONCE(tdp_mmu_enabled))
+		return false;
+
+	if (WARN_ONCE(!tdp_enabled,
+		      "Creating a VM with TDP MMU enabled requires TDP."))
+		return false;
+
+	return true;
+}
+
+/* Initializes the TDP MMU for the VM, if enabled. */
+void kvm_mmu_init_tdp_mmu(struct kvm *kvm)
+{
+	if (!is_tdp_mmu_enabled())
+		return;
+
+	/* This should not be changed for the lifetime of the VM. */
+	kvm->arch.tdp_mmu_enabled = true;
+}
+
+void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
+{
+	if (!kvm->arch.tdp_mmu_enabled)
+		return;
+}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
new file mode 100644
index 0000000000000..dd3764f5a9aa3
--- /dev/null
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __KVM_X86_MMU_TDP_MMU_H
+#define __KVM_X86_MMU_TDP_MMU_H
+
+#include <linux/kvm_host.h>
+
+void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
+void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
+#endif /* __KVM_X86_MMU_TDP_MMU_H */
-- 
2.28.0.709.gb0816b6eb0-goog



* [PATCH 04/22] kvm: mmu: Allocate and free TDP MMU roots
From: Ben Gardon @ 2020-09-25 21:22 UTC
  To: linux-kernel, kvm
  Cc: Cannon Matthews, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong,
	Ben Gardon

The TDP MMU must be able to allocate paging structure root pages and track
the usage of those pages. Implement a system for root page allocation
that is similar to, but separate from, that of the x86 shadow paging
implementation. When
future patches add synchronization model changes to allow for parallel
page faults, these pages will need to be handled differently from the
x86 shadow paging based MMU's root pages.
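
As a generic illustration of what "tracking the usage" of a root page
can look like, here is a minimal userspace sketch of reference-counted
get/put. All names are hypothetical, and the sketch is an assumption
about the general technique only; it does not reproduce this patch's
data structures, locking, or list handling.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical toy root: only the usage count is modeled. */
struct toy_root {
	int root_count; /* number of users currently holding the root */
};

/* Allocates a root held by one user. Returns NULL on failure. */
struct toy_root *toy_root_alloc(void)
{
	struct toy_root *root = calloc(1, sizeof(*root));

	if (root)
		root->root_count = 1;

	return root;
}

/* Takes an additional reference on an existing root. */
void toy_root_get(struct toy_root *root)
{
	root->root_count++;
}

/*
 * Drops one reference. Frees the root and returns true when the last
 * user lets go; otherwise returns false.
 */
bool toy_root_put(struct toy_root *root)
{
	if (--root->root_count > 0)
		return false;

	free(root);
	return true;
}
```

This toy version is not thread-safe; real code would need to protect
the count with a lock or make it atomic.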

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/include/asm/kvm_host.h |   1 +
 arch/x86/kvm/mmu/mmu.c          |  27 +++---
 arch/x86/kvm/mmu/mmu_internal.h |   9 ++
 arch/x86/kvm/mmu/tdp_mmu.c      | 157 ++++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.h      |   5 +
 5 files changed, 188 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 35107819f48ae..9ce6b35ecb33a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -972,6 +972,7 @@ struct kvm_arch {
 	 * operations.
 	 */
 	bool tdp_mmu_enabled;
+	struct list_head tdp_mmu_roots;
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0cb0c26939dfc..0f871e36394da 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -170,11 +170,6 @@ module_param(dbg, bool, 0644);
 #define PT64_PERM_MASK (PT_PRESENT_MASK | PT_WRITABLE_MASK | shadow_user_mask \
 			| shadow_x_mask | shadow_nx_mask | shadow_me_mask)
 
-#define ACC_EXEC_MASK    1
-#define ACC_WRITE_MASK   PT_WRITABLE_MASK
-#define ACC_USER_MASK    PT_USER_MASK
-#define ACC_ALL          (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)
-
 /* The mask for the R/X bits in EPT PTEs */
 #define PT64_EPT_READABLE_MASK			0x1ull
 #define PT64_EPT_EXECUTABLE_MASK		0x4ull
@@ -232,7 +227,7 @@ struct kvm_shadow_walk_iterator {
 	     __shadow_walk_next(&(_walker), spte))
 
 static struct kmem_cache *pte_list_desc_cache;
-static struct kmem_cache *mmu_page_header_cache;
+struct kmem_cache *mmu_page_header_cache;
 static struct percpu_counter kvm_total_used_mmu_pages;
 
 static u64 __read_mostly shadow_nx_mask;
@@ -3597,10 +3592,14 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
 	if (!VALID_PAGE(*root_hpa))
 		return;
 
-	sp = to_shadow_page(*root_hpa & PT64_BASE_ADDR_MASK);
-	--sp->root_count;
-	if (!sp->root_count && sp->role.invalid)
-		kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
+	if (is_tdp_mmu_root(kvm, *root_hpa)) {
+		kvm_tdp_mmu_put_root_hpa(kvm, *root_hpa);
+	} else {
+		sp = to_shadow_page(*root_hpa & PT64_BASE_ADDR_MASK);
+		--sp->root_count;
+		if (!sp->root_count && sp->role.invalid)
+			kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
+	}
 
 	*root_hpa = INVALID_PAGE;
 }
@@ -3691,7 +3690,13 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 	unsigned i;
 
 	if (shadow_root_level >= PT64_ROOT_4LEVEL) {
-		root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level, true);
+		if (vcpu->kvm->arch.tdp_mmu_enabled) {
+			root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
+		} else {
+			root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level,
+					      true);
+		}
+
 		if (!VALID_PAGE(root))
 			return -ENOSPC;
 		vcpu->arch.mmu->root_hpa = root;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 65bb110847858..530b7d893c7b3 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -41,8 +41,12 @@ struct kvm_mmu_page {
 
 	/* Number of writes since the last time traversal visited this page.  */
 	atomic_t write_flooding_count;
+
+	bool tdp_mmu_page;
 };
 
+extern struct kmem_cache *mmu_page_header_cache;
+
 static inline struct kvm_mmu_page *to_shadow_page(hpa_t shadow_page)
 {
 	struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
@@ -69,6 +73,11 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 	(((address) >> PT64_LEVEL_SHIFT(level)) & ((1 << PT64_LEVEL_BITS) - 1))
 #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
 
+#define ACC_EXEC_MASK    1
+#define ACC_WRITE_MASK   PT_WRITABLE_MASK
+#define ACC_USER_MASK    PT_USER_MASK
+#define ACC_ALL          (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)
+
 /* Functions for interpreting SPTEs */
 kvm_pfn_t spte_to_pfn(u64 pte);
 bool is_mmio_spte(u64 spte);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 8241e18c111e6..cdca829e42040 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1,5 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
+#include "mmu.h"
+#include "mmu_internal.h"
 #include "tdp_mmu.h"
 
 static bool __read_mostly tdp_mmu_enabled = true;
@@ -25,10 +27,165 @@ void kvm_mmu_init_tdp_mmu(struct kvm *kvm)
 
 	/* This should not be changed for the lifetime of the VM. */
 	kvm->arch.tdp_mmu_enabled = true;
+
+	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
 }
 
 void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
 {
 	if (!kvm->arch.tdp_mmu_enabled)
 		return;
+
+	WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
+}
+
+#define for_each_tdp_mmu_root(_kvm, _root)			    \
+	list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link)
+
+bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
+{
+	struct kvm_mmu_page *root;
+
+	if (!kvm->arch.tdp_mmu_enabled)
+		return false;
+
+	root = to_shadow_page(hpa);
+
+	if (WARN_ON(!root))
+		return false;
+
+	return root->tdp_mmu_page;
+}
+
+static void free_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
+{
+	lockdep_assert_held(&kvm->mmu_lock);
+
+	WARN_ON(root->root_count);
+	WARN_ON(!root->tdp_mmu_page);
+
+	list_del(&root->link);
+
+	free_page((unsigned long)root->spt);
+	kmem_cache_free(mmu_page_header_cache, root);
+}
+
+static void put_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
+{
+	lockdep_assert_held(&kvm->mmu_lock);
+
+	root->root_count--;
+	if (!root->root_count)
+		free_tdp_mmu_root(kvm, root);
+}
+
+static void get_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
+{
+	lockdep_assert_held(&kvm->mmu_lock);
+	WARN_ON(!root->root_count);
+
+	root->root_count++;
+}
+
+void kvm_tdp_mmu_put_root_hpa(struct kvm *kvm, hpa_t root_hpa)
+{
+	struct kvm_mmu_page *root;
+
+	root = to_shadow_page(root_hpa);
+
+	if (WARN_ON(!root))
+		return;
+
+	put_tdp_mmu_root(kvm, root);
+}
+
+static struct kvm_mmu_page *find_tdp_mmu_root_with_role(
+		struct kvm *kvm, union kvm_mmu_page_role role)
+{
+	struct kvm_mmu_page *root;
+
+	lockdep_assert_held(&kvm->mmu_lock);
+	for_each_tdp_mmu_root(kvm, root) {
+		WARN_ON(!root->root_count);
+
+		if (root->role.word == role.word)
+			return root;
+	}
+
+	return NULL;
+}
+
+static struct kvm_mmu_page *alloc_tdp_mmu_root(struct kvm_vcpu *vcpu,
+					       union kvm_mmu_page_role role)
+{
+	struct kvm_mmu_page *new_root;
+	struct kvm_mmu_page *root;
+
+	new_root = kvm_mmu_memory_cache_alloc(
+			&vcpu->arch.mmu_page_header_cache);
+	new_root->spt = kvm_mmu_memory_cache_alloc(
+			&vcpu->arch.mmu_shadow_page_cache);
+	set_page_private(virt_to_page(new_root->spt), (unsigned long)new_root);
+
+	new_root->role.word = role.word;
+	new_root->root_count = 1;
+	new_root->gfn = 0;
+	new_root->tdp_mmu_page = true;
+
+	spin_lock(&vcpu->kvm->mmu_lock);
+
+	/* Check that no matching root exists before adding this one. */
+	root = find_tdp_mmu_root_with_role(vcpu->kvm, role);
+	if (root) {
+		get_tdp_mmu_root(vcpu->kvm, root);
+		spin_unlock(&vcpu->kvm->mmu_lock);
+		free_page((unsigned long)new_root->spt);
+		kmem_cache_free(mmu_page_header_cache, new_root);
+		return root;
+	}
+
+	list_add(&new_root->link, &vcpu->kvm->arch.tdp_mmu_roots);
+	spin_unlock(&vcpu->kvm->mmu_lock);
+
+	return new_root;
+}
+
+static struct kvm_mmu_page *get_tdp_mmu_vcpu_root(struct kvm_vcpu *vcpu)
+{
+	struct kvm_mmu_page *root;
+	union kvm_mmu_page_role role;
+
+	role = vcpu->arch.mmu->mmu_role.base;
+	role.level = vcpu->arch.mmu->shadow_root_level;
+	role.direct = true;
+	role.gpte_is_8_bytes = true;
+	role.access = ACC_ALL;
+
+	spin_lock(&vcpu->kvm->mmu_lock);
+
+	/* Search for an already allocated root with the same role. */
+	root = find_tdp_mmu_root_with_role(vcpu->kvm, role);
+	if (root) {
+		get_tdp_mmu_root(vcpu->kvm, root);
+		spin_unlock(&vcpu->kvm->mmu_lock);
+		return root;
+	}
+
+	spin_unlock(&vcpu->kvm->mmu_lock);
+
+	/* If there is no appropriate root, allocate one. */
+	root = alloc_tdp_mmu_root(vcpu, role);
+
+	return root;
+}
+
+hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
+{
+	struct kvm_mmu_page *root;
+
+	root = get_tdp_mmu_vcpu_root(vcpu);
+	if (!root)
+		return INVALID_PAGE;
+
+	return __pa(root->spt);
 }
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index dd3764f5a9aa3..9274debffeaa1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -7,4 +7,9 @@
 
 void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
 void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
+
+bool is_tdp_mmu_root(struct kvm *kvm, hpa_t root);
+hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
+void kvm_tdp_mmu_put_root_hpa(struct kvm *kvm, hpa_t root_hpa);
+
 #endif /* __KVM_X86_MMU_TDP_MMU_H */
-- 
2.28.0.709.gb0816b6eb0-goog



* [PATCH 05/22] kvm: mmu: Add functions to handle changed TDP SPTEs
  2020-09-25 21:22 [PATCH 00/22] Introduce the TDP MMU Ben Gardon
                   ` (3 preceding siblings ...)
  2020-09-25 21:22 ` [PATCH 04/22] kvm: mmu: Allocate and free TDP MMU roots Ben Gardon
@ 2020-09-25 21:22 ` Ben Gardon
  2020-09-26  0:39   ` Paolo Bonzini
  2020-09-25 21:22 ` [PATCH 06/22] kvm: mmu: Make address space ID a property of memslots Ben Gardon
                   ` (19 subsequent siblings)
  24 siblings, 1 reply; 105+ messages in thread
From: Ben Gardon @ 2020-09-25 21:22 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Cannon Matthews, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong,
	Ben Gardon

The existing bookkeeping done by KVM when a PTE is changed is spread
around several functions. This makes it difficult to remember all the
stats, bitmaps, and other subsystems that need to be updated whenever a
PTE is modified. When a non-leaf PTE is marked non-present or becomes a
leaf PTE, page table memory must also be freed. To simplify the MMU and
facilitate the use of atomic operations on SPTEs in future patches, create
functions to handle some of the bookkeeping required as a result of
a change.
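
As a rough userspace illustration of the decision order described above (not
the patch's code), the checks can be sketched as a classifier. The bit layout
and every demo_* name below are hypothetical assumptions made for the
example; the real code uses is_shadow_present_pte(), is_last_spte() and
spte_to_pfn(), and also handles dirty-bit bookkeeping, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified SPTE layout used only for this sketch. */
#define DEMO_PRESENT   (1ULL << 0)
#define DEMO_HUGE      (1ULL << 7)
#define DEMO_PFN_SHIFT 12
#define DEMO_PFN_MASK  (((1ULL << 52) - 1) & ~((1ULL << DEMO_PFN_SHIFT) - 1))

enum demo_change {
	DEMO_NO_CHANGE,         /* old and new values are identical */
	DEMO_INVALID_LEAF_SWAP, /* present leaf replaced by leaf with new PFN */
	DEMO_NONPRESENT_ONLY,   /* non-present to non-present (e.g. MMIO) */
	DEMO_SUBTREE_REMOVED,   /* a child page table must be torn down */
	DEMO_SIMPLE_UPDATE,     /* anything else */
};

static bool demo_is_present(uint64_t spte)
{
	return spte & DEMO_PRESENT;
}

static bool demo_is_leaf(uint64_t spte, int level)
{
	/* Level 1 entries are always leaves; higher levels need the huge bit. */
	return level == 1 || (spte & DEMO_HUGE);
}

static uint64_t demo_pfn(uint64_t spte)
{
	return (spte & DEMO_PFN_MASK) >> DEMO_PFN_SHIFT;
}

/* Mirror the order of the checks: invalid swap, no-op, non-present, subtree. */
static enum demo_change demo_classify(uint64_t old_spte, uint64_t new_spte,
				      int level)
{
	bool was_present = demo_is_present(old_spte);
	bool is_present = demo_is_present(new_spte);
	bool was_leaf = was_present && demo_is_leaf(old_spte, level);
	bool is_leaf = is_present && demo_is_leaf(new_spte, level);
	bool pfn_changed = demo_pfn(old_spte) != demo_pfn(new_spte);

	assert(level >= 1 && level <= 5);

	if (was_leaf && is_leaf && pfn_changed)
		return DEMO_INVALID_LEAF_SWAP;
	if (old_spte == new_spte)
		return DEMO_NO_CHANGE;
	if (!was_present && !is_present)
		return DEMO_NONPRESENT_ONLY;
	if (was_present && !was_leaf && (pfn_changed || !is_present))
		return DEMO_SUBTREE_REMOVED;
	return DEMO_SIMPLE_UPDATE;
}
```

Only the DEMO_SUBTREE_REMOVED case corresponds to freeing page table memory;
in the patch that case also recurses into each child entry and flushes the
TLBs before the page is freed.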

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/mmu.c          |   6 +-
 arch/x86/kvm/mmu/mmu_internal.h |   3 +
 arch/x86/kvm/mmu/tdp_mmu.c      | 105 ++++++++++++++++++++++++++++++++
 3 files changed, 111 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0f871e36394da..f09081f9137b0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -310,8 +310,8 @@ static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm,
 		kvm_flush_remote_tlbs(kvm);
 }
 
-static void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
-		u64 start_gfn, u64 pages)
+void kvm_flush_remote_tlbs_with_address(struct kvm *kvm, u64 start_gfn,
+					u64 pages)
 {
 	struct kvm_tlb_range range;
 
@@ -819,7 +819,7 @@ static bool is_accessed_spte(u64 spte)
 			     : !is_access_track_spte(spte);
 }
 
-static bool is_dirty_spte(u64 spte)
+bool is_dirty_spte(u64 spte)
 {
 	u64 dirty_mask = spte_shadow_dirty_mask(spte);
 
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 530b7d893c7b3..ff1fe0e04fba5 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -83,5 +83,8 @@ kvm_pfn_t spte_to_pfn(u64 pte);
 bool is_mmio_spte(u64 spte);
 int is_shadow_present_pte(u64 pte);
 int is_last_spte(u64 pte, int level);
+bool is_dirty_spte(u64 spte);
 
+void kvm_flush_remote_tlbs_with_address(struct kvm *kvm, u64 start_gfn,
+					u64 pages);
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index cdca829e42040..653507773b42c 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -189,3 +189,108 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 
 	return __pa(root->spt);
 }
+
+static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
+				u64 old_spte, u64 new_spte, int level);
+
+/**
+ * handle_changed_spte - handle bookkeeping associated with an SPTE change
+ * @kvm: kvm instance
+ * @as_id: the address space of the paging structure the SPTE was a part of
+ * @gfn: the base GFN that was mapped by the SPTE
+ * @old_spte: The value of the SPTE before the change
+ * @new_spte: The value of the SPTE after the change
+ * @level: the level of the PT the SPTE is part of in the paging structure
+ *
+ * Handle bookkeeping that might result from the modification of a SPTE.
+ * This function must be called for all TDP SPTE modifications.
+ */
+static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
+				u64 old_spte, u64 new_spte, int level)
+{
+	bool was_present = is_shadow_present_pte(old_spte);
+	bool is_present = is_shadow_present_pte(new_spte);
+	bool was_leaf = was_present && is_last_spte(old_spte, level);
+	bool is_leaf = is_present && is_last_spte(new_spte, level);
+	bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
+	u64 *pt;
+	u64 old_child_spte;
+	int i;
+
+	WARN_ON(level > PT64_ROOT_MAX_LEVEL);
+	WARN_ON(level < PG_LEVEL_4K);
+	WARN_ON(gfn % KVM_PAGES_PER_HPAGE(level));
+
+	/*
+	 * If this warning were to trigger it would indicate that there was a
+	 * missing MMU notifier or a race with some notifier handler.
+	 * A present, leaf SPTE should never be directly replaced with another
+	 * present leaf SPTE pointing to a different PFN. A notifier handler
+	 * should be zapping the SPTE before the main MM's page table is
+	 * changed, or the SPTE should be zeroed, and the TLBs flushed by the
+	 * thread before replacement.
+	 */
+	if (was_leaf && is_leaf && pfn_changed) {
+		pr_err("Invalid SPTE change: cannot replace a present leaf\n"
+		       "SPTE with another present leaf SPTE mapping a\n"
+		       "different PFN!\n"
+		       "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
+		       as_id, gfn, old_spte, new_spte, level);
+
+		/*
+		 * Crash the host to prevent error propagation and guest data
+		 * corruption.
+		 */
+		BUG();
+	}
+
+	if (old_spte == new_spte)
+		return;
+
+	/*
+	 * The only times a SPTE should be changed from a non-present to
+	 * non-present state is when an MMIO entry is installed/modified/
+	 * removed. In that case, there is nothing to do here.
+	 */
+	if (!was_present && !is_present) {
+		/*
+		 * If this change does not involve a MMIO SPTE, it is
+		 * unexpected. Log the change, though it should not impact the
+		 * guest since both the former and current SPTEs are nonpresent.
+		 */
+		if (WARN_ON(!is_mmio_spte(old_spte) && !is_mmio_spte(new_spte)))
+			pr_err("Unexpected SPTE change! Nonpresent SPTEs\n"
+			       "should not be replaced with another,\n"
+			       "different nonpresent SPTE, unless one or both\n"
+			       "are MMIO SPTEs.\n"
+			       "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
+			       as_id, gfn, old_spte, new_spte, level);
+		return;
+	}
+
+
+	if (was_leaf && is_dirty_spte(old_spte) &&
+	    (!is_dirty_spte(new_spte) || pfn_changed))
+		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+
+	/*
+	 * Recursively handle child PTs if the change removed a subtree from
+	 * the paging structure.
+	 */
+	if (was_present && !was_leaf && (pfn_changed || !is_present)) {
+		pt = spte_to_child_pt(old_spte, level);
+
+		for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
+			old_child_spte = *(pt + i);
+			*(pt + i) = 0;
+			handle_changed_spte(kvm, as_id,
+				gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)),
+				old_child_spte, 0, level - 1);
+		}
+
+		kvm_flush_remote_tlbs_with_address(kvm, gfn,
+						   KVM_PAGES_PER_HPAGE(level));
+
+		free_page((unsigned long)pt);
+	}
+}
-- 
2.28.0.709.gb0816b6eb0-goog



* [PATCH 06/22] kvm: mmu: Make address space ID a property of memslots
  2020-09-25 21:22 [PATCH 00/22] Introduce the TDP MMU Ben Gardon
                   ` (4 preceding siblings ...)
  2020-09-25 21:22 ` [PATCH 05/22] kvm: mmu: Add functions to handle changed TDP SPTEs Ben Gardon
@ 2020-09-25 21:22 ` Ben Gardon
  2020-09-30  6:10   ` Sean Christopherson
  2020-09-25 21:22 ` [PATCH 07/22] kvm: mmu: Support zapping SPTEs in the TDP MMU Ben Gardon
                   ` (18 subsequent siblings)
  24 siblings, 1 reply; 105+ messages in thread
From: Ben Gardon @ 2020-09-25 21:22 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Cannon Matthews, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong,
	Ben Gardon

Save address space ID as a field in each memslot so that functions that
do not use rmaps (which implicitly encode the id) can handle multiple
address spaces correctly.
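
To illustrate why the stored ID helps, here is a minimal userspace sketch
(an assumption-laden example, not kernel code). The demo_memslot struct and
demo_find_slot() helper are hypothetical; they only show that two slots may
cover the same GFN range in different address spaces, so a walker that has
no rmap to consult needs as_id on the slot to tell them apart.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields of struct kvm_memory_slot used here. */
struct demo_memslot {
	uint64_t base_gfn;
	uint64_t npages;
	int as_id;	/* address space this slot belongs to */
	short id;
};

/* Return the slot covering gfn in address space as_id, or NULL if none. */
static const struct demo_memslot *
demo_find_slot(const struct demo_memslot *slots, size_t n, int as_id,
	       uint64_t gfn)
{
	size_t i;

	assert(slots != NULL || n == 0);

	for (i = 0; i < n; i++) {
		const struct demo_memslot *s = &slots[i];

		if (s->as_id == as_id && gfn >= s->base_gfn &&
		    gfn - s->base_gfn < s->npages)
			return s;
	}
	return NULL;
}
```

With slots {base 0x100, 16 pages} present in both address space 0 and 1, a
lookup of GFN 0x105 returns a different slot depending on the as_id passed.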

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 include/linux/kvm_host.h | 1 +
 virt/kvm/kvm_main.c      | 1 +
 2 files changed, 2 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 05e3c2fb3ef78..a460bc712a81c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -345,6 +345,7 @@ struct kvm_memory_slot {
 	struct kvm_arch_memory_slot arch;
 	unsigned long userspace_addr;
 	u32 flags;
+	int as_id;
 	short id;
 };
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cf88233b819a0..f9c80351c9efd 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1318,6 +1318,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	new.npages = mem->memory_size >> PAGE_SHIFT;
 	new.flags = mem->flags;
 	new.userspace_addr = mem->userspace_addr;
+	new.as_id = as_id;
 
 	if (new.npages > KVM_MEM_MAX_NR_PAGES)
 		return -EINVAL;
-- 
2.28.0.709.gb0816b6eb0-goog



* [PATCH 07/22] kvm: mmu: Support zapping SPTEs in the TDP MMU
  2020-09-25 21:22 [PATCH 00/22] Introduce the TDP MMU Ben Gardon
                   ` (5 preceding siblings ...)
  2020-09-25 21:22 ` [PATCH 06/22] kvm: mmu: Make address space ID a property of memslots Ben Gardon
@ 2020-09-25 21:22 ` Ben Gardon
  2020-09-26  0:14   ` Paolo Bonzini
  2020-09-30  6:15   ` Sean Christopherson
  2020-09-25 21:22 ` [PATCH 08/22] kvm: mmu: Separate making non-leaf sptes from link_shadow_page Ben Gardon
                   ` (17 subsequent siblings)
  24 siblings, 2 replies; 105+ messages in thread
From: Ben Gardon @ 2020-09-25 21:22 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Cannon Matthews, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong,
	Ben Gardon

Add functions to zap SPTEs to the TDP MMU. These are needed to tear down
TDP MMU roots properly and implement other MMU functions which require
tearing down mappings. Future patches will add functions to populate the
page tables, but as of this patch there will not be any work for these
functions to do.
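
The core of the range zap is deciding, for each present entry, whether to
clear it or to descend and zap at a lower level. The following is a small
userspace sketch of that test under stated assumptions: levels are numbered
from 1 (4K) with 9 address bits per level, and the demo_* helpers are
hypothetical names, not functions from this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Number of 4K pages mapped by one entry at the given level (1 = 4K). */
static uint64_t demo_pages_per_level(int level)
{
	assert(level >= 1 && level <= 5);
	return 1ULL << ((level - 1) * 9);
}

/*
 * Decide whether a present entry at (gfn, level) should be zapped for the
 * range [start, end). A non-leaf entry that reaches outside the range is
 * skipped so that only its in-range children are zapped; a leaf entry is
 * zapped even if it extends past the range.
 */
static bool demo_zap_here(uint64_t gfn, int level, bool is_leaf,
			  uint64_t start, uint64_t end)
{
	bool reaches_outside = gfn < start ||
			       gfn + demo_pages_per_level(level) > end;

	return is_leaf || !reaches_outside;
}
```

For example, a level-2 non-leaf entry at GFN 0 covers 512 pages: it is
zapped whole for the range [0, 512) but skipped for [0, 100), where the walk
continues into its level-1 children instead.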

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/mmu.c      |  15 +++++
 arch/x86/kvm/mmu/tdp_iter.c |  17 ++++++
 arch/x86/kvm/mmu/tdp_iter.h |   1 +
 arch/x86/kvm/mmu/tdp_mmu.c  | 106 ++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.h  |   2 +
 5 files changed, 141 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f09081f9137b0..7a17cca19b0c1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5852,6 +5852,10 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
 	kvm_reload_remote_mmus(kvm);
 
 	kvm_zap_obsolete_pages(kvm);
+
+	if (kvm->arch.tdp_mmu_enabled)
+		kvm_tdp_mmu_zap_all(kvm);
+
 	spin_unlock(&kvm->mmu_lock);
 }
 
@@ -5892,6 +5896,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 	struct kvm_memslots *slots;
 	struct kvm_memory_slot *memslot;
 	int i;
+	bool flush;
 
 	spin_lock(&kvm->mmu_lock);
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
@@ -5911,6 +5916,12 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 		}
 	}
 
+	if (kvm->arch.tdp_mmu_enabled) {
+		flush = kvm_tdp_mmu_zap_gfn_range(kvm, gfn_start, gfn_end);
+		if (flush)
+			kvm_flush_remote_tlbs(kvm);
+	}
+
 	spin_unlock(&kvm->mmu_lock);
 }
 
@@ -6077,6 +6088,10 @@ void kvm_mmu_zap_all(struct kvm *kvm)
 	}
 
 	kvm_mmu_commit_zap_page(kvm, &invalid_list);
+
+	if (kvm->arch.tdp_mmu_enabled)
+		kvm_tdp_mmu_zap_all(kvm);
+
 	spin_unlock(&kvm->mmu_lock);
 }
 
diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index ee90d62d2a9b1..6c1a38429c81a 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -161,3 +161,20 @@ void tdp_iter_next(struct tdp_iter *iter)
 		done = try_step_side(iter);
 	}
 }
+
+/*
+ * Restart the walk over the paging structure from the root, starting from the
+ * highest gfn the iterator had previously reached. Assumes that the entire
+ * paging structure, except the root page, may have been completely torn down
+ * and rebuilt.
+ */
+void tdp_iter_refresh_walk(struct tdp_iter *iter)
+{
+	gfn_t goal_gfn = iter->goal_gfn;
+
+	if (iter->gfn > goal_gfn)
+		goal_gfn = iter->gfn;
+
+	tdp_iter_start(iter, iter->pt_path[iter->root_level - 1],
+		       iter->root_level, goal_gfn);
+}
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index b102109778eac..34da3bdada436 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -49,5 +49,6 @@ u64 *spte_to_child_pt(u64 pte, int level);
 void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
 		    gfn_t goal_gfn);
 void tdp_iter_next(struct tdp_iter *iter);
+void tdp_iter_refresh_walk(struct tdp_iter *iter);
 
 #endif /* __KVM_X86_MMU_TDP_ITER_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 653507773b42c..d96fc182c8497 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -2,6 +2,7 @@
 
 #include "mmu.h"
 #include "mmu_internal.h"
+#include "tdp_iter.h"
 #include "tdp_mmu.h"
 
 static bool __read_mostly tdp_mmu_enabled = true;
@@ -57,8 +58,13 @@ bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
 	return root->tdp_mmu_page;
 }
 
+static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
+			  gfn_t start, gfn_t end);
+
 static void free_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
 {
+	gfn_t max_gfn = 1ULL << (boot_cpu_data.x86_phys_bits - PAGE_SHIFT);
+
 	lockdep_assert_held(&kvm->mmu_lock);
 
 	WARN_ON(root->root_count);
@@ -66,6 +72,8 @@ static void free_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
 
 	list_del(&root->link);
 
+	zap_gfn_range(kvm, root, 0, max_gfn);
+
 	free_page((unsigned long)root->spt);
 	kmem_cache_free(mmu_page_header_cache, root);
 }
@@ -193,6 +201,11 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 				u64 old_spte, u64 new_spte, int level);
 
+static int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
+{
+	return sp->role.smm ? 1 : 0;
+}
+
 /**
  * handle_changed_spte - handle bookkeeping associated with an SPTE change
  * @kvm: kvm instance
@@ -294,3 +307,96 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 		free_page((unsigned long)pt);
 	}
 }
+
+#define for_each_tdp_pte_root(_iter, _root, _start, _end) \
+	for_each_tdp_pte(_iter, _root->spt, _root->role.level, _start, _end)
+
+/*
+ * If the MMU lock is contended or this thread needs to yield, flushes
+ * the TLBs, releases the MMU lock, yields, reacquires the MMU lock,
+ * restarts the tdp_iter's walk from the root, and returns true.
+ * If no yield is needed, returns false.
+ */
+static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
+{
+	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+		kvm_flush_remote_tlbs(kvm);
+		cond_resched_lock(&kvm->mmu_lock);
+		tdp_iter_refresh_walk(iter);
+		return true;
+	} else {
+		return false;
+	}
+}
+
+/*
+ * Tears down the mappings for the range of gfns, [start, end), and frees the
+ * non-root pages mapping GFNs strictly within that range. Returns true if
+ * SPTEs have been cleared and a TLB flush is needed before releasing the
+ * MMU lock.
+ */
+static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
+			  gfn_t start, gfn_t end)
+{
+	struct tdp_iter iter;
+	bool flush_needed = false;
+	int as_id = kvm_mmu_page_as_id(root);
+
+	for_each_tdp_pte_root(iter, root, start, end) {
+		if (!is_shadow_present_pte(iter.old_spte))
+			continue;
+
+		/*
+		 * If this is a non-last-level SPTE that covers a larger range
+		 * than should be zapped, continue, and zap the mappings at a
+		 * lower level.
+		 */
+		if ((iter.gfn < start ||
+		     iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
+		    !is_last_spte(iter.old_spte, iter.level))
+			continue;
+
+		*iter.sptep = 0;
+		handle_changed_spte(kvm, as_id, iter.gfn, iter.old_spte, 0,
+				    iter.level);
+
+		flush_needed = !tdp_mmu_iter_cond_resched(kvm, &iter);
+	}
+	return flush_needed;
+}
+
+/*
+ * Tears down the mappings for the range of gfns, [start, end), and frees the
+ * non-root pages mapping GFNs strictly within that range. Returns true if
+ * SPTEs have been cleared and a TLB flush is needed before releasing the
+ * MMU lock.
+ */
+bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	struct kvm_mmu_page *root;
+	bool flush = false;
+
+	for_each_tdp_mmu_root(kvm, root) {
+		/*
+		 * Take a reference on the root so that it cannot be freed if
+		 * this thread releases the MMU lock and yields in this loop.
+		 */
+		get_tdp_mmu_root(kvm, root);
+
+		flush = zap_gfn_range(kvm, root, start, end) || flush;
+
+		put_tdp_mmu_root(kvm, root);
+	}
+
+	return flush;
+}
+
+void kvm_tdp_mmu_zap_all(struct kvm *kvm)
+{
+	gfn_t max_gfn = 1ULL << (boot_cpu_data.x86_phys_bits - PAGE_SHIFT);
+	bool flush;
+
+	flush = kvm_tdp_mmu_zap_gfn_range(kvm, 0, max_gfn);
+	if (flush)
+		kvm_flush_remote_tlbs(kvm);
+}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 9274debffeaa1..cb86f9fe69017 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -12,4 +12,6 @@ bool is_tdp_mmu_root(struct kvm *kvm, hpa_t root);
 hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
 void kvm_tdp_mmu_put_root_hpa(struct kvm *kvm, hpa_t root_hpa);
 
+bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end);
+void kvm_tdp_mmu_zap_all(struct kvm *kvm);
 #endif /* __KVM_X86_MMU_TDP_MMU_H */
-- 
2.28.0.709.gb0816b6eb0-goog



* [PATCH 08/22] kvm: mmu: Separate making non-leaf sptes from link_shadow_page
  2020-09-25 21:22 [PATCH 00/22] Introduce the TDP MMU Ben Gardon
                   ` (6 preceding siblings ...)
  2020-09-25 21:22 ` [PATCH 07/22] kvm: mmu: Support zapping SPTEs in the TDP MMU Ben Gardon
@ 2020-09-25 21:22 ` Ben Gardon
  2020-09-25 21:22 ` [PATCH 09/22] kvm: mmu: Remove disallowed_hugepage_adjust shadow_walk_iterator arg Ben Gardon
                   ` (16 subsequent siblings)
  24 siblings, 0 replies; 105+ messages in thread
From: Ben Gardon @ 2020-09-25 21:22 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Cannon Matthews, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong,
	Ben Gardon

The TDP MMU page fault handler will need to be able to create non-leaf
SPTEs to build up the paging structures. Rather than re-implementing the
function, factor the SPTE creation out of link_shadow_page.
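
A minimal sketch of what the factored-out helper computes, assuming
placeholder bit positions: the DEMO_* masks below are hypothetical stand-ins
for the runtime masks (shadow_present_mask, shadow_user_mask, and so on)
that the real helper uses, and the physical address is passed in directly
rather than derived with __pa().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit assignments, chosen only for this illustration. */
#define DEMO_PRESENT     (1ULL << 0)
#define DEMO_WRITABLE    (1ULL << 1)
#define DEMO_USER        (1ULL << 2)
#define DEMO_ACCESSED    (1ULL << 5)
#define DEMO_AD_DISABLED (1ULL << 62)

/*
 * Build a non-leaf entry pointing at a child page table. The value depends
 * only on the child's address and on whether A/D bits are disabled, which
 * is why it can be shared by callers that have no struct kvm_mmu_page.
 */
static uint64_t demo_make_nonleaf_spte(uint64_t child_pt_pa, bool ad_disabled)
{
	uint64_t spte;

	/* A page table must be page aligned. */
	assert((child_pt_pa & 0xfffULL) == 0);

	spte = child_pt_pa | DEMO_PRESENT | DEMO_WRITABLE | DEMO_USER;

	if (ad_disabled)
		spte |= DEMO_AD_DISABLED;
	else
		spte |= DEMO_ACCESSED;

	return spte;
}
```

The design point is that the helper takes plain values instead of a shadow
page, so link_shadow_page keeps the rmap bookkeeping while the value
computation becomes reusable.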

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7a17cca19b0c1..6344e7863a0f5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2555,21 +2555,30 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
 	__shadow_walk_next(iterator, *iterator->sptep);
 }
 
-static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
-			     struct kvm_mmu_page *sp)
+static u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
 {
 	u64 spte;
 
-	BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK);
-
-	spte = __pa(sp->spt) | shadow_present_mask | PT_WRITABLE_MASK |
+	spte = __pa(child_pt) | shadow_present_mask | PT_WRITABLE_MASK |
 	       shadow_user_mask | shadow_x_mask | shadow_me_mask;
 
-	if (sp_ad_disabled(sp))
+	if (ad_disabled)
 		spte |= SPTE_AD_DISABLED_MASK;
 	else
 		spte |= shadow_accessed_mask;
 
+	return spte;
+}
+
+static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
+			     struct kvm_mmu_page *sp)
+{
+	u64 spte;
+
+	BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK);
+
+	spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
+
 	mmu_spte_set(sptep, spte);
 
 	mmu_page_add_parent_pte(vcpu, sp, sptep);
-- 
2.28.0.709.gb0816b6eb0-goog



* [PATCH 09/22] kvm: mmu: Remove disallowed_hugepage_adjust shadow_walk_iterator arg
  2020-09-25 21:22 [PATCH 00/22] Introduce the TDP MMU Ben Gardon
                   ` (7 preceding siblings ...)
  2020-09-25 21:22 ` [PATCH 08/22] kvm: mmu: Separate making non-leaf sptes from link_shadow_page Ben Gardon
@ 2020-09-25 21:22 ` Ben Gardon
  2020-09-30 16:19   ` Sean Christopherson
  2020-09-25 21:22 ` [PATCH 10/22] kvm: mmu: Add TDP MMU PF handler Ben Gardon
                   ` (15 subsequent siblings)
  24 siblings, 1 reply; 105+ messages in thread
From: Ben Gardon @ 2020-09-25 21:22 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Cannon Matthews, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong,
	Ben Gardon

In order to avoid creating executable hugepages in the TDP MMU PF
handler, remove the dependency between disallowed_hugepage_adjust and
the shadow_walk_iterator. This will open the function up to being used
by the TDP MMU PF handler in a future patch.
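
The arithmetic inside the adjustment can be checked in isolation. The sketch
below is a userspace approximation that assumes 9 address bits per level and
level 1 = 4K; the demo_* names are hypothetical, and the guard conditions of
the real function (NX huge pages enabled, SPTE present and not large) are
left to the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 4K pages covered by one entry at the given level (1 = 4K). */
static uint64_t demo_hpage_pages(int level)
{
	return 1ULL << ((level - 1) * 9);
}

/*
 * Drop the goal level by one and patch back into the PFN the 9 GFN bits
 * that select the smaller page inside the huge page that was disallowed.
 */
static void demo_disallowed_hugepage_adjust(uint64_t gfn, uint64_t *pfnp,
					    int *goal_levelp)
{
	uint64_t page_mask;

	assert(*goal_levelp > 1);

	page_mask = demo_hpage_pages(*goal_levelp) -
		    demo_hpage_pages(*goal_levelp - 1);
	*pfnp |= gfn & page_mask;
	(*goal_levelp)--;
}
```

At goal level 2 the mask is 512 - 1 = 0x1ff (GFN bits 0 to 8); at goal level
3 it is 0x40000 - 0x200 = 0x3fe00 (GFN bits 9 to 17). Passing the SPTE and
current level as plain values, as this patch does, is what lets an iterator
other than the shadow walk iterator drive the same logic.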

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/mmu.c         | 17 +++++++++--------
 arch/x86/kvm/mmu/paging_tmpl.h |  3 ++-
 2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6344e7863a0f5..f6e6fc9959c04 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3295,13 +3295,12 @@ static int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
 	return level;
 }
 
-static void disallowed_hugepage_adjust(struct kvm_shadow_walk_iterator it,
-				       gfn_t gfn, kvm_pfn_t *pfnp, int *levelp)
+static void disallowed_hugepage_adjust(u64 spte, gfn_t gfn, int cur_level,
+					kvm_pfn_t *pfnp, int *goal_levelp)
 {
-	int level = *levelp;
-	u64 spte = *it.sptep;
+	int goal_level = *goal_levelp;
 
-	if (it.level == level && level > PG_LEVEL_4K &&
+	if (cur_level == goal_level && goal_level > PG_LEVEL_4K &&
 	    is_nx_huge_page_enabled() &&
 	    is_shadow_present_pte(spte) &&
 	    !is_large_pte(spte)) {
@@ -3312,9 +3311,10 @@ static void disallowed_hugepage_adjust(struct kvm_shadow_walk_iterator it,
 		 * patching back for them into pfn the next 9 bits of
 		 * the address.
 		 */
-		u64 page_mask = KVM_PAGES_PER_HPAGE(level) - KVM_PAGES_PER_HPAGE(level - 1);
+		u64 page_mask = KVM_PAGES_PER_HPAGE(goal_level) -
+				KVM_PAGES_PER_HPAGE(goal_level - 1);
 		*pfnp |= gfn & page_mask;
-		(*levelp)--;
+		(*goal_levelp)--;
 	}
 }
 
@@ -3339,7 +3339,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, int write,
 		 * We cannot overwrite existing page tables with an NX
 		 * large page, as the leaf could be executable.
 		 */
-		disallowed_hugepage_adjust(it, gfn, &pfn, &level);
+		disallowed_hugepage_adjust(*it.sptep, gfn, it.level,
+					   &pfn, &level);
 
 		base_gfn = gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1);
 		if (it.level == level)
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 4dd6b1e5b8cf7..6a8666cb0d24b 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -690,7 +690,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gpa_t addr,
 		 * We cannot overwrite existing page tables with an NX
 		 * large page, as the leaf could be executable.
 		 */
-		disallowed_hugepage_adjust(it, gw->gfn, &pfn, &hlevel);
+		disallowed_hugepage_adjust(*it.sptep, gw->gfn, it.level,
+					   &pfn, &hlevel);
 
 		base_gfn = gw->gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1);
 		if (it.level == hlevel)
-- 
2.28.0.709.gb0816b6eb0-goog



* [PATCH 10/22] kvm: mmu: Add TDP MMU PF handler
  2020-09-25 21:22 [PATCH 00/22] Introduce the TDP MMU Ben Gardon
                   ` (8 preceding siblings ...)
  2020-09-25 21:22 ` [PATCH 09/22] kvm: mmu: Remove disallowed_hugepage_adjust shadow_walk_iterator arg Ben Gardon
@ 2020-09-25 21:22 ` Ben Gardon
  2020-09-26  0:24   ` Paolo Bonzini
  2020-09-30 16:37   ` Sean Christopherson
  2020-09-25 21:22 ` [PATCH 11/22] kvm: mmu: Factor out allocating a new tdp_mmu_page Ben Gardon
                   ` (14 subsequent siblings)
  24 siblings, 2 replies; 105+ messages in thread
From: Ben Gardon @ 2020-09-25 21:22 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Cannon Matthews, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong,
	Ben Gardon

Add functions to handle page faults in the TDP MMU. These page faults
are currently handled in much the same way as in the x86 shadow paging
based MMU; however, the ordering of some operations is slightly
different. Future patches will add eager NX splitting, a fast page fault
handler, and parallel page faults.

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/mmu.c          |  66 ++++++-----------
 arch/x86/kvm/mmu/mmu_internal.h |  45 +++++++++++
 arch/x86/kvm/mmu/tdp_mmu.c      | 127 ++++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.h      |   4 +
 4 files changed, 200 insertions(+), 42 deletions(-)
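
The walk performed by the fault handler described above can be sketched
outside the kernel as follows. This is only a toy model, not the code in
this patch: the toy_* names, the bit layout and the use of calloc() are
assumptions made for illustration. It descends from the root to the
target level, replaces a leaf found above the target with a non-leaf
entry, allocates missing intermediate tables, and finally installs the
leaf.

```c
#include <stdint.h>
#include <stdlib.h>

#define TOY_LEVEL_BITS	9
#define TOY_ENTRIES	(1u << TOY_LEVEL_BITS)
#define TOY_PRESENT	1ull	/* hypothetical "present" bit */
#define TOY_LEAF	2ull	/* hypothetical "leaf" bit */

/* One page table page; child[] stands in for the address held in an entry. */
struct toy_table {
	uint64_t entry[TOY_ENTRIES];
	struct toy_table *child[TOY_ENTRIES];
};

static unsigned int toy_index(uint64_t gfn, int level)
{
	return (gfn >> ((level - 1) * TOY_LEVEL_BITS)) & (TOY_ENTRIES - 1);
}

/*
 * Walk from root_level down to target_level and install a leaf for pfn.
 * Returns the level of the installed leaf, or -1 if allocation fails.
 */
static int toy_map(struct toy_table *root, int root_level, int target_level,
		   uint64_t gfn, uint64_t pfn)
{
	struct toy_table *pt = root;
	int level;

	for (level = root_level; level > target_level; level--) {
		unsigned int i = toy_index(gfn, level);

		/* A leaf above the target level must be cleared first. */
		if ((pt->entry[i] & TOY_PRESENT) && (pt->entry[i] & TOY_LEAF))
			pt->entry[i] = 0;

		/* Install a non-leaf entry pointing at a zeroed child table. */
		if (!(pt->entry[i] & TOY_PRESENT)) {
			pt->child[i] = calloc(1, sizeof(*pt->child[i]));
			if (!pt->child[i])
				return -1;
			pt->entry[i] = TOY_PRESENT;
		}
		pt = pt->child[i];
	}

	pt->entry[toy_index(gfn, level)] = (pfn << 12) | TOY_PRESENT | TOY_LEAF;
	return level;
}

/* Return the leaf entry translating gfn, or 0 if nothing is mapped. */
static uint64_t toy_lookup(const struct toy_table *root, int root_level,
			   uint64_t gfn)
{
	const struct toy_table *pt = root;
	int level;

	for (level = root_level; level >= 1; level--) {
		unsigned int i = toy_index(gfn, level);
		uint64_t e = pt->entry[i];

		if (!(e & TOY_PRESENT))
			return 0;
		if (e & TOY_LEAF)
			return e;
		pt = pt->child[i];
	}
	return 0;
}
```

The real handler additionally reports each change through
handle_changed_spte() and flushes remote TLBs after clearing a large
page; the toy model leaves both out.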

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f6e6fc9959c04..52d661a758585 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -153,12 +153,6 @@ module_param(dbg, bool, 0644);
 #else
 #define PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
 #endif
-#define PT64_LVL_ADDR_MASK(level) \
-	(PT64_BASE_ADDR_MASK & ~((1ULL << (PAGE_SHIFT + (((level) - 1) \
-						* PT64_LEVEL_BITS))) - 1))
-#define PT64_LVL_OFFSET_MASK(level) \
-	(PT64_BASE_ADDR_MASK & ((1ULL << (PAGE_SHIFT + (((level) - 1) \
-						* PT64_LEVEL_BITS))) - 1))
 
 #define PT32_BASE_ADDR_MASK PAGE_MASK
 #define PT32_DIR_BASE_ADDR_MASK \
@@ -182,20 +176,6 @@ module_param(dbg, bool, 0644);
 /* make pte_list_desc fit well in cache line */
 #define PTE_LIST_EXT 3
 
-/*
- * Return values of handle_mmio_page_fault and mmu.page_fault:
- * RET_PF_RETRY: let CPU fault again on the address.
- * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
- *
- * For handle_mmio_page_fault only:
- * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
- */
-enum {
-	RET_PF_RETRY = 0,
-	RET_PF_EMULATE = 1,
-	RET_PF_INVALID = 2,
-};
-
 struct pte_list_desc {
 	u64 *sptes[PTE_LIST_EXT];
 	struct pte_list_desc *more;
@@ -233,7 +213,7 @@ static struct percpu_counter kvm_total_used_mmu_pages;
 static u64 __read_mostly shadow_nx_mask;
 static u64 __read_mostly shadow_x_mask;	/* mutual exclusive with nx_mask */
 static u64 __read_mostly shadow_user_mask;
-static u64 __read_mostly shadow_accessed_mask;
+u64 __read_mostly shadow_accessed_mask;
 static u64 __read_mostly shadow_dirty_mask;
 static u64 __read_mostly shadow_mmio_value;
 static u64 __read_mostly shadow_mmio_access_mask;
@@ -364,7 +344,7 @@ static inline bool spte_ad_need_write_protect(u64 spte)
 	return (spte & SPTE_SPECIAL_MASK) != SPTE_AD_ENABLED_MASK;
 }
 
-static bool is_nx_huge_page_enabled(void)
+bool is_nx_huge_page_enabled(void)
 {
 	return READ_ONCE(nx_huge_pages);
 }
@@ -381,7 +361,7 @@ static inline u64 spte_shadow_dirty_mask(u64 spte)
 	return spte_ad_enabled(spte) ? shadow_dirty_mask : 0;
 }
 
-static inline bool is_access_track_spte(u64 spte)
+inline bool is_access_track_spte(u64 spte)
 {
 	return !spte_ad_enabled(spte) && (spte & shadow_acc_track_mask) == 0;
 }
@@ -433,7 +413,7 @@ static u64 get_mmio_spte_generation(u64 spte)
 	return gen;
 }
 
-static u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
+u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
 {
 
 	u64 gen = kvm_vcpu_memslots(vcpu)->generation & MMIO_SPTE_GEN_MASK;
@@ -613,7 +593,7 @@ int is_shadow_present_pte(u64 pte)
 	return (pte != 0) && !is_mmio_spte(pte);
 }
 
-static int is_large_pte(u64 pte)
+int is_large_pte(u64 pte)
 {
 	return pte & PT_PAGE_SIZE_MASK;
 }
@@ -2555,7 +2535,7 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
 	__shadow_walk_next(iterator, *iterator->sptep);
 }
 
-static u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
+u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
 {
 	u64 spte;
 
@@ -2961,14 +2941,9 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 				     E820_TYPE_RAM);
 }
 
-/* Bits which may be returned by set_spte() */
-#define SET_SPTE_WRITE_PROTECTED_PT	BIT(0)
-#define SET_SPTE_NEED_REMOTE_TLB_FLUSH	BIT(1)
-
-static u64 make_spte(struct kvm_vcpu *vcpu, unsigned int pte_access, int level,
-		     gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool speculative,
-		     bool can_unsync, bool host_writable, bool ad_disabled,
-		     int *ret)
+u64 make_spte(struct kvm_vcpu *vcpu, unsigned int pte_access, int level,
+	      gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool speculative,
+	      bool can_unsync, bool host_writable, bool ad_disabled, int *ret)
 {
 	u64 spte = 0;
 
@@ -3249,8 +3224,8 @@ static int host_pfn_mapping_level(struct kvm_vcpu *vcpu, gfn_t gfn,
 	return level;
 }
 
-static int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
-				   int max_level, kvm_pfn_t *pfnp)
+int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
+			    int max_level, kvm_pfn_t *pfnp)
 {
 	struct kvm_memory_slot *slot;
 	struct kvm_lpage_info *linfo;
@@ -3295,8 +3270,8 @@ static int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
 	return level;
 }
 
-static void disallowed_hugepage_adjust(u64 spte, gfn_t gfn, int cur_level,
-					kvm_pfn_t *pfnp, int *goal_levelp)
+void disallowed_hugepage_adjust(u64 spte, gfn_t gfn, int cur_level,
+				kvm_pfn_t *pfnp, int *goal_levelp)
 {
 	int goal_level = *goal_levelp;
 
@@ -4113,8 +4088,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	if (page_fault_handle_page_track(vcpu, error_code, gfn))
 		return RET_PF_EMULATE;
 
-	if (fast_page_fault(vcpu, gpa, error_code))
-		return RET_PF_RETRY;
+	if (!is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
+		if (fast_page_fault(vcpu, gpa, error_code))
+			return RET_PF_RETRY;
 
 	r = mmu_topup_memory_caches(vcpu, false);
 	if (r)
@@ -4139,8 +4115,14 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	r = make_mmu_pages_available(vcpu);
 	if (r)
 		goto out_unlock;
-	r = __direct_map(vcpu, gpa, write, map_writable, max_level, pfn,
-			 prefault, is_tdp && lpage_disallowed);
+
+	if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
+		r = kvm_tdp_mmu_page_fault(vcpu, write, map_writable, max_level,
+					   gpa, pfn, prefault,
+					   is_tdp && lpage_disallowed);
+	else
+		r = __direct_map(vcpu, gpa, write, map_writable, max_level, pfn,
+				 prefault, is_tdp && lpage_disallowed);
 
 out_unlock:
 	spin_unlock(&vcpu->kvm->mmu_lock);
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index ff1fe0e04fba5..4cef9da051847 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -73,6 +73,15 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 	(((address) >> PT64_LEVEL_SHIFT(level)) & ((1 << PT64_LEVEL_BITS) - 1))
 #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
 
+#define PT64_LVL_ADDR_MASK(level) \
+	(PT64_BASE_ADDR_MASK & ~((1ULL << (PAGE_SHIFT + (((level) - 1) \
+						* PT64_LEVEL_BITS))) - 1))
+#define PT64_LVL_OFFSET_MASK(level) \
+	(PT64_BASE_ADDR_MASK & ((1ULL << (PAGE_SHIFT + (((level) - 1) \
+						* PT64_LEVEL_BITS))) - 1))
+
+extern u64 shadow_accessed_mask;
+
 #define ACC_EXEC_MASK    1
 #define ACC_WRITE_MASK   PT_WRITABLE_MASK
 #define ACC_USER_MASK    PT_USER_MASK
@@ -84,7 +93,43 @@ bool is_mmio_spte(u64 spte);
 int is_shadow_present_pte(u64 pte);
 int is_last_spte(u64 pte, int level);
 bool is_dirty_spte(u64 spte);
+int is_large_pte(u64 pte);
+bool is_access_track_spte(u64 spte);
 
 void kvm_flush_remote_tlbs_with_address(struct kvm *kvm, u64 start_gfn,
 					u64 pages);
+
+/*
+ * Return values of handle_mmio_page_fault and mmu.page_fault:
+ * RET_PF_RETRY: let CPU fault again on the address.
+ * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
+ *
+ * For handle_mmio_page_fault only:
+ * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
+ */
+enum {
+	RET_PF_RETRY = 0,
+	RET_PF_EMULATE = 1,
+	RET_PF_INVALID = 2,
+};
+
+/* Bits which may be returned by set_spte() */
+#define SET_SPTE_WRITE_PROTECTED_PT	BIT(0)
+#define SET_SPTE_NEED_REMOTE_TLB_FLUSH	BIT(1)
+
+u64 make_spte(struct kvm_vcpu *vcpu, unsigned int pte_access, int level,
+	      gfn_t gfn, kvm_pfn_t pfn, u64 old_spte, bool speculative,
+	      bool can_unsync, bool host_writable, bool ad_disabled, int *ret);
+u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
+u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
+
+int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
+			    int max_level, kvm_pfn_t *pfnp);
+void disallowed_hugepage_adjust(u64 spte, gfn_t gfn, int cur_level,
+				kvm_pfn_t *pfnp, int *goal_levelp);
+
+bool is_nx_huge_page_enabled(void);
+
+void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
+
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index d96fc182c8497..37bdebc2592ea 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -311,6 +311,10 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 #define for_each_tdp_pte_root(_iter, _root, _start, _end) \
 	for_each_tdp_pte(_iter, _root->spt, _root->role.level, _start, _end)
 
+#define for_each_tdp_pte_vcpu(_iter, _vcpu, _start, _end)		   \
+	for_each_tdp_pte(_iter, __va(_vcpu->arch.mmu->root_hpa),	   \
+			 _vcpu->arch.mmu->shadow_root_level, _start, _end)
+
 /*
  * If the MMU lock is contended or this thread needs to yield, flushes
  * the TLBs, releases, the MMU lock, yields, reacquires the MMU lock,
@@ -400,3 +404,126 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
 	if (flush)
 		kvm_flush_remote_tlbs(kvm);
 }
+
+/*
+ * Installs a last-level SPTE to handle a TDP page fault.
+ * (NPT/EPT violation/misconfiguration)
+ */
+static int page_fault_handle_target_level(struct kvm_vcpu *vcpu, int write,
+					  int map_writable, int as_id,
+					  struct tdp_iter *iter,
+					  kvm_pfn_t pfn, bool prefault)
+{
+	u64 new_spte;
+	int ret = 0;
+	int make_spte_ret = 0;
+
+	if (unlikely(is_noslot_pfn(pfn)))
+		new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
+	else
+		new_spte = make_spte(vcpu, ACC_ALL, iter->level, iter->gfn,
+				     pfn, iter->old_spte, prefault, true,
+				     map_writable, !shadow_accessed_mask,
+				     &make_spte_ret);
+
+	/*
+	 * If the page fault was caused by a write but the page is write
+	 * protected, emulation is needed. If the emulation was skipped,
+	 * the vCPU would have the same fault again.
+	 */
+	if ((make_spte_ret & SET_SPTE_WRITE_PROTECTED_PT) && write)
+		ret = RET_PF_EMULATE;
+
+	/* If a MMIO SPTE is installed, the MMIO will need to be emulated. */
+	if (unlikely(is_mmio_spte(new_spte)))
+		ret = RET_PF_EMULATE;
+
+	*iter->sptep = new_spte;
+	handle_changed_spte(vcpu->kvm, as_id, iter->gfn, iter->old_spte,
+			    new_spte, iter->level);
+
+	if (!prefault)
+		vcpu->stat.pf_fixed++;
+
+	return ret;
+}
+
+/*
+ * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
+ * page tables and SPTEs to translate the faulting guest physical address.
+ */
+int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, int write, int map_writable,
+			   int max_level, gpa_t gpa, kvm_pfn_t pfn,
+			   bool prefault, bool account_disallowed_nx_lpage)
+{
+	struct tdp_iter iter;
+	struct kvm_mmu_memory_cache *pf_pt_cache =
+			&vcpu->arch.mmu_shadow_page_cache;
+	u64 *child_pt;
+	u64 new_spte;
+	int ret;
+	int as_id = kvm_arch_vcpu_memslots_id(vcpu);
+	gfn_t gfn = gpa >> PAGE_SHIFT;
+	int level;
+
+	if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
+		return RET_PF_RETRY;
+
+	if (WARN_ON(!is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa)))
+		return RET_PF_RETRY;
+
+	level = kvm_mmu_hugepage_adjust(vcpu, gfn, max_level, &pfn);
+
+	for_each_tdp_pte_vcpu(iter, vcpu, gfn, gfn + 1) {
+		disallowed_hugepage_adjust(iter.old_spte, gfn, iter.level,
+					   &pfn, &level);
+
+		if (iter.level == level)
+			break;
+
+		/*
+		 * If there is an SPTE mapping a large page at a higher level
+		 * than the target, that SPTE must be cleared and replaced
+		 * with a non-leaf SPTE.
+		 */
+		if (is_shadow_present_pte(iter.old_spte) &&
+		    is_large_pte(iter.old_spte)) {
+			*iter.sptep = 0;
+			handle_changed_spte(vcpu->kvm, as_id, iter.gfn,
+					    iter.old_spte, 0, iter.level);
+			kvm_flush_remote_tlbs_with_address(vcpu->kvm, iter.gfn,
+					KVM_PAGES_PER_HPAGE(iter.level));
+
+			/*
+			 * The iter must explicitly re-read the spte here
+			 * because the new value is needed before the next
+			 * iteration of the loop.
+			 */
+			iter.old_spte = READ_ONCE(*iter.sptep);
+		}
+
+		if (!is_shadow_present_pte(iter.old_spte)) {
+			child_pt = kvm_mmu_memory_cache_alloc(pf_pt_cache);
+			clear_page(child_pt);
+			new_spte = make_nonleaf_spte(child_pt,
+						     !shadow_accessed_mask);
+
+			*iter.sptep = new_spte;
+			handle_changed_spte(vcpu->kvm, as_id, iter.gfn,
+					    iter.old_spte, new_spte,
+					    iter.level);
+		}
+	}
+
+	if (WARN_ON(iter.level != level))
+		return RET_PF_RETRY;
+
+	ret = page_fault_handle_target_level(vcpu, write, map_writable,
+					     as_id, &iter, pfn, prefault);
+
+	/* If emulating, flush this vcpu's TLB. */
+	if (ret == RET_PF_EMULATE)
+		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
+
+	return ret;
+}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index cb86f9fe69017..abf23dc0ab7ad 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -14,4 +14,8 @@ void kvm_tdp_mmu_put_root_hpa(struct kvm *kvm, hpa_t root_hpa);
 
 bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end);
 void kvm_tdp_mmu_zap_all(struct kvm *kvm);
+
+int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, int write, int map_writable,
+			   int level, gpa_t gpa, kvm_pfn_t pfn, bool prefault,
+			   bool lpage_disallowed);
 #endif /* __KVM_X86_MMU_TDP_MMU_H */
-- 
2.28.0.709.gb0816b6eb0-goog



* [PATCH 11/22] kvm: mmu: Factor out allocating a new tdp_mmu_page
  2020-09-25 21:22 [PATCH 00/22] Introduce the TDP MMU Ben Gardon
                   ` (9 preceding siblings ...)
  2020-09-25 21:22 ` [PATCH 10/22] kvm: mmu: Add TDP MMU PF handler Ben Gardon
@ 2020-09-25 21:22 ` Ben Gardon
  2020-09-26  0:22   ` Paolo Bonzini
  2020-09-25 21:22 ` [PATCH 12/22] kvm: mmu: Allocate struct kvm_mmu_pages for all pages in TDP MMU Ben Gardon
                   ` (13 subsequent siblings)
  24 siblings, 1 reply; 105+ messages in thread
From: Ben Gardon @ 2020-09-25 21:22 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Cannon Matthews, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong,
	Ben Gardon

Move the code to allocate a struct kvm_mmu_page for the TDP MMU out of the
root allocation code to support allocating a struct kvm_mmu_page for every
page of page table memory used by the TDP MMU, in the next commit.

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 59 ++++++++++++++++++++++++--------------
 1 file changed, 38 insertions(+), 21 deletions(-)
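
The shape of this refactoring can be sketched in user space as below.
This is an illustration only, not the kernel code: struct toy_page, the
toy_* helpers and the calloc()-based allocation are assumptions. One
helper allocates and initializes the metadata for any page table page,
and the root allocator becomes a thin wrapper that only adds the
root-specific state.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-page metadata. */
struct toy_page {
	uint64_t *spt;		/* the page table page itself */
	uint64_t gfn;		/* first GFN covered by this page */
	int level;		/* level of the page in the paging structure */
	bool tdp_mmu_page;
	int root_count;		/* only non-zero for roots */
};

/* Common path: allocate metadata plus a zeroed 512-entry table. */
static struct toy_page *toy_alloc_page(uint64_t gfn, int level)
{
	struct toy_page *sp = calloc(1, sizeof(*sp));

	if (!sp)
		return NULL;

	sp->spt = calloc(512, sizeof(*sp->spt));
	if (!sp->spt) {
		free(sp);
		return NULL;
	}

	sp->gfn = gfn;
	sp->level = level;
	sp->tdp_mmu_page = true;
	return sp;
}

/* Root path: reuse the common helper, then add root-only state. */
static struct toy_page *toy_alloc_root(int root_level)
{
	struct toy_page *root = toy_alloc_page(0, root_level);

	if (root)
		root->root_count = 1;
	return root;
}

static void toy_free_page(struct toy_page *sp)
{
	free(sp->spt);
	free(sp);
}
```

Keeping the level as an explicit argument of the common helper is what
lets the same function serve non-root pages later on.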

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 37bdebc2592ea..a3bcee6bf30e8 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -123,27 +123,50 @@ static struct kvm_mmu_page *find_tdp_mmu_root_with_role(
 	return NULL;
 }
 
-static struct kvm_mmu_page *alloc_tdp_mmu_root(struct kvm_vcpu *vcpu,
-					       union kvm_mmu_page_role role)
+static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
+						   int level)
+{
+	union kvm_mmu_page_role role;
+
+	role = vcpu->arch.mmu->mmu_role.base;
+	role.level = level;
+	role.direct = true;
+	role.gpte_is_8_bytes = true;
+	role.access = ACC_ALL;
+
+	return role;
+}
+
+static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
+					       int level)
+{
+	struct kvm_mmu_page *sp;
+
+	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
+	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
+
+	sp->role.word = page_role_for_level(vcpu, level).word;
+	sp->gfn = gfn;
+	sp->tdp_mmu_page = true;
+
+	return sp;
+}
+
+static struct kvm_mmu_page *alloc_tdp_mmu_root(struct kvm_vcpu *vcpu)
 {
 	struct kvm_mmu_page *new_root;
 	struct kvm_mmu_page *root;
 
-	new_root = kvm_mmu_memory_cache_alloc(
-			&vcpu->arch.mmu_page_header_cache);
-	new_root->spt = kvm_mmu_memory_cache_alloc(
-			&vcpu->arch.mmu_shadow_page_cache);
-	set_page_private(virt_to_page(new_root->spt), (unsigned long)new_root);
-
-	new_root->role.word = role.word;
+	new_root = alloc_tdp_mmu_page(vcpu, 0,
+				      vcpu->arch.mmu->shadow_root_level);
 	new_root->root_count = 1;
-	new_root->gfn = 0;
-	new_root->tdp_mmu_page = true;
 
 	spin_lock(&vcpu->kvm->mmu_lock);
 
 	/* Check that no matching root exists before adding this one. */
-	root = find_tdp_mmu_root_with_role(vcpu->kvm, role);
+	root = find_tdp_mmu_root_with_role(vcpu->kvm,
+		page_role_for_level(vcpu, vcpu->arch.mmu->shadow_root_level));
 	if (root) {
 		get_tdp_mmu_root(vcpu->kvm, root);
 		spin_unlock(&vcpu->kvm->mmu_lock);
@@ -161,18 +184,12 @@ static struct kvm_mmu_page *alloc_tdp_mmu_root(struct kvm_vcpu *vcpu,
 static struct kvm_mmu_page *get_tdp_mmu_vcpu_root(struct kvm_vcpu *vcpu)
 {
 	struct kvm_mmu_page *root;
-	union kvm_mmu_page_role role;
-
-	role = vcpu->arch.mmu->mmu_role.base;
-	role.level = vcpu->arch.mmu->shadow_root_level;
-	role.direct = true;
-	role.gpte_is_8_bytes = true;
-	role.access = ACC_ALL;
 
 	spin_lock(&vcpu->kvm->mmu_lock);
 
 	/* Search for an already allocated root with the same role. */
-	root = find_tdp_mmu_root_with_role(vcpu->kvm, role);
+	root = find_tdp_mmu_root_with_role(vcpu->kvm,
+		page_role_for_level(vcpu, vcpu->arch.mmu->shadow_root_level));
 	if (root) {
 		get_tdp_mmu_root(vcpu->kvm, root);
 		spin_unlock(&vcpu->kvm->mmu_lock);
@@ -182,7 +199,7 @@ static struct kvm_mmu_page *get_tdp_mmu_vcpu_root(struct kvm_vcpu *vcpu)
 	spin_unlock(&vcpu->kvm->mmu_lock);
 
 	/* If there is no appropriate root, allocate one. */
-	root = alloc_tdp_mmu_root(vcpu, role);
+	root = alloc_tdp_mmu_root(vcpu);
 
 	return root;
 }
-- 
2.28.0.709.gb0816b6eb0-goog



* [PATCH 12/22] kvm: mmu: Allocate struct kvm_mmu_pages for all pages in TDP MMU
  2020-09-25 21:22 [PATCH 00/22] Introduce the TDP MMU Ben Gardon
                   ` (10 preceding siblings ...)
  2020-09-25 21:22 ` [PATCH 11/22] kvm: mmu: Factor out allocating a new tdp_mmu_page Ben Gardon
@ 2020-09-25 21:22 ` Ben Gardon
  2020-09-25 21:22 ` [PATCH 13/22] kvm: mmu: Support invalidate range MMU notifier for " Ben Gardon
                   ` (12 subsequent siblings)
  24 siblings, 0 replies; 105+ messages in thread
From: Ben Gardon @ 2020-09-25 21:22 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Cannon Matthews, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong,
	Ben Gardon

Attach struct kvm_mmu_pages to every page in the TDP MMU to track
metadata, facilitate NX reclaim, and enable improved parallelism of MMU
operations in future patches.

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/include/asm/kvm_host.h |  4 ++++
 arch/x86/kvm/mmu/tdp_mmu.c      | 13 ++++++++++---
 2 files changed, 14 insertions(+), 3 deletions(-)
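
The bookkeeping added here, where each non-root page is put on a per-VM
list when it is linked into the paging structure and removed when it is
freed, can be sketched with a minimal intrusive list. The toy_* names
below are assumptions for illustration; the patch itself uses the
kernel's list_head API (list_add() and list_del()).

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct toy_list {
	struct toy_list *prev, *next;
};

static void toy_list_init(struct toy_list *head)
{
	head->prev = head->next = head;
}

/* Insert node right after head. */
static void toy_list_add(struct toy_list *node, struct toy_list *head)
{
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}

/* Unlink node and leave it pointing at itself. */
static void toy_list_del(struct toy_list *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
	node->prev = node->next = node;
}

static size_t toy_list_len(const struct toy_list *head)
{
	const struct toy_list *pos;
	size_t n = 0;

	for (pos = head->next; pos != head; pos = pos->next)
		n++;
	return n;
}

/* Hypothetical per-page metadata carrying the list linkage. */
struct toy_page {
	struct toy_list link;
	uint64_t gfn;
	int level;
};
```

Embedding the linkage in the metadata struct means no extra allocation
is needed to track a page, and removal is O(1) when the page is freed.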

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9ce6b35ecb33a..a76bcb51d43d8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -972,7 +972,11 @@ struct kvm_arch {
 	 * operations.
 	 */
 	bool tdp_mmu_enabled;
+
+	/* List of struct kvm_mmu_pages being used as roots */
 	struct list_head tdp_mmu_roots;
+	/* List of struct kvm_mmu_pages not being used as roots */
+	struct list_head tdp_mmu_pages;
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index a3bcee6bf30e8..557e780bdf9f9 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -30,6 +30,7 @@ void kvm_mmu_init_tdp_mmu(struct kvm *kvm)
 	kvm->arch.tdp_mmu_enabled = true;
 
 	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
+	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_pages);
 }
 
 void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
@@ -244,6 +245,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	bool is_leaf = is_present && is_last_spte(new_spte, level);
 	bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
 	u64 *pt;
+	struct kvm_mmu_page *sp;
 	u64 old_child_spte;
 	int i;
 
@@ -309,6 +311,9 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	 */
 	if (was_present && !was_leaf && (pfn_changed || !is_present)) {
 		pt = spte_to_child_pt(old_spte, level);
+		sp = sptep_to_sp(pt);
+
+		list_del(&sp->link);
 
 		for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
 			old_child_spte = *(pt + i);
@@ -322,6 +327,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 						   KVM_PAGES_PER_HPAGE(level));
 
 		free_page((unsigned long)pt);
+		kmem_cache_free(mmu_page_header_cache, sp);
 	}
 }
 
@@ -474,8 +480,7 @@ int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, int write, int map_writable,
 			   bool prefault, bool account_disallowed_nx_lpage)
 {
 	struct tdp_iter iter;
-	struct kvm_mmu_memory_cache *pf_pt_cache =
-			&vcpu->arch.mmu_shadow_page_cache;
+	struct kvm_mmu_page *sp;
 	u64 *child_pt;
 	u64 new_spte;
 	int ret;
@@ -520,7 +525,9 @@ int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, int write, int map_writable,
 		}
 
 		if (!is_shadow_present_pte(iter.old_spte)) {
-			child_pt = kvm_mmu_memory_cache_alloc(pf_pt_cache);
+			sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level);
+			list_add(&sp->link, &vcpu->kvm->arch.tdp_mmu_pages);
+			child_pt = sp->spt;
 			clear_page(child_pt);
 			new_spte = make_nonleaf_spte(child_pt,
 						     !shadow_accessed_mask);
-- 
2.28.0.709.gb0816b6eb0-goog



* [PATCH 13/22] kvm: mmu: Support invalidate range MMU notifier for TDP MMU
  2020-09-25 21:22 [PATCH 00/22] Introduce the TDP MMU Ben Gardon
                   ` (11 preceding siblings ...)
  2020-09-25 21:22 ` [PATCH 12/22] kvm: mmu: Allocate struct kvm_mmu_pages for all pages in TDP MMU Ben Gardon
@ 2020-09-25 21:22 ` Ben Gardon
  2020-09-30 17:03   ` Sean Christopherson
  2020-09-25 21:22 ` [PATCH 14/22] kvm: mmu: Add access tracking for tdp_mmu Ben Gardon
                   ` (11 subsequent siblings)
  24 siblings, 1 reply; 105+ messages in thread
From: Ben Gardon @ 2020-09-25 21:22 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Cannon Matthews, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong,
	Ben Gardon

In order to interoperate correctly with the rest of KVM and other Linux
subsystems, the TDP MMU must correctly handle various MMU notifiers. Add
hooks to handle the invalidate range family of MMU notifiers.

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/mmu.c     |  9 ++++-
 arch/x86/kvm/mmu/tdp_mmu.c | 80 +++++++++++++++++++++++++++++++++++---
 arch/x86/kvm/mmu/tdp_mmu.h |  3 ++
 3 files changed, 86 insertions(+), 6 deletions(-)
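
The notifier hands KVM a host virtual address range, which has to be
clamped to each memslot and converted to a GFN range before anything can
be zapped. The arithmetic can be sketched on its own as below. The
toy_* names and the reduced memslot struct are assumptions for
illustration; only the clamping and the round-up of the end address
mirror the loop in this patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT	12
#define TOY_PAGE_SIZE	(1ull << TOY_PAGE_SHIFT)

/* Hypothetical, reduced view of a memslot. */
struct toy_memslot {
	uint64_t base_gfn;		/* first guest frame in the slot */
	uint64_t npages;		/* slot size in pages */
	uint64_t userspace_addr;	/* HVA backing base_gfn */
};

static uint64_t toy_hva_to_gfn(uint64_t hva, const struct toy_memslot *slot)
{
	return slot->base_gfn +
	       ((hva - slot->userspace_addr) >> TOY_PAGE_SHIFT);
}

/*
 * Convert the HVA range [start, end) to the GFN range [*gfn_start, *gfn_end)
 * of pages in the slot that intersect it. Returns false if the ranges do
 * not overlap.
 */
static bool toy_hva_range_to_gfn_range(const struct toy_memslot *slot,
				       uint64_t start, uint64_t end,
				       uint64_t *gfn_start, uint64_t *gfn_end)
{
	uint64_t slot_end = slot->userspace_addr +
			    (slot->npages << TOY_PAGE_SHIFT);
	uint64_t hva_start = start > slot->userspace_addr ?
			     start : slot->userspace_addr;
	uint64_t hva_end = end < slot_end ? end : slot_end;

	if (hva_start >= hva_end)
		return false;

	*gfn_start = toy_hva_to_gfn(hva_start, slot);
	/* Round up so a partially covered last page is included. */
	*gfn_end = toy_hva_to_gfn(hva_end + TOY_PAGE_SIZE - 1, slot);
	return true;
}
```

For a slot with base_gfn 0x100, an HVA range starting 0x1800 bytes into
the slot and ending 0x3001 bytes into it touches pages 1, 2 and 3, so the
resulting GFN range is [0x101, 0x104).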

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 52d661a758585..0ddfdab942554 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1884,7 +1884,14 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
 int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end,
 			unsigned flags)
 {
-	return kvm_handle_hva_range(kvm, start, end, 0, kvm_unmap_rmapp);
+	int r;
+
+	r = kvm_handle_hva_range(kvm, start, end, 0, kvm_unmap_rmapp);
+
+	if (kvm->arch.tdp_mmu_enabled)
+		r |= kvm_tdp_mmu_zap_hva_range(kvm, start, end);
+
+	return r;
 }
 
 int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 557e780bdf9f9..1cea58db78a13 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -60,7 +60,7 @@ bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
 }
 
 static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
-			  gfn_t start, gfn_t end);
+			  gfn_t start, gfn_t end, bool can_yield);
 
 static void free_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
 {
@@ -73,7 +73,7 @@ static void free_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
 
 	list_del(&root->link);
 
-	zap_gfn_range(kvm, root, 0, max_gfn);
+	zap_gfn_range(kvm, root, 0, max_gfn, false);
 
 	free_page((unsigned long)root->spt);
 	kmem_cache_free(mmu_page_header_cache, root);
@@ -361,9 +361,14 @@ static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
  * non-root pages mapping GFNs strictly within that range. Returns true if
  * SPTEs have been cleared and a TLB flush is needed before releasing the
  * MMU lock.
+ * If can_yield is true, will release the MMU lock and reschedule if the
+ * scheduler needs the CPU or there is contention on the MMU lock. If this
+ * function cannot yield, it will not release the MMU lock or reschedule and
+ * the caller must ensure it does not supply too large a GFN range, or the
+ * operation can cause a soft lockup.
  */
 static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
-			  gfn_t start, gfn_t end)
+			  gfn_t start, gfn_t end, bool can_yield)
 {
 	struct tdp_iter iter;
 	bool flush_needed = false;
@@ -387,7 +392,10 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 		handle_changed_spte(kvm, as_id, iter.gfn, iter.old_spte, 0,
 				    iter.level);
 
-		flush_needed = !tdp_mmu_iter_cond_resched(kvm, &iter);
+		if (can_yield)
+			flush_needed = !tdp_mmu_iter_cond_resched(kvm, &iter);
+		else
+			flush_needed = true;
 	}
 	return flush_needed;
 }
@@ -410,7 +418,7 @@ bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end)
 		 */
 		get_tdp_mmu_root(kvm, root);
 
-		flush = zap_gfn_range(kvm, root, start, end) || flush;
+		flush = zap_gfn_range(kvm, root, start, end, true) || flush;
 
 		put_tdp_mmu_root(kvm, root);
 	}
@@ -551,3 +559,65 @@ int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, int write, int map_writable,
 
 	return ret;
 }
+
+static int kvm_tdp_mmu_handle_hva_range(struct kvm *kvm, unsigned long start,
+		unsigned long end, unsigned long data,
+		int (*handler)(struct kvm *kvm, struct kvm_memory_slot *slot,
+			       struct kvm_mmu_page *root, gfn_t start,
+			       gfn_t end, unsigned long data))
+{
+	struct kvm_memslots *slots;
+	struct kvm_memory_slot *memslot;
+	struct kvm_mmu_page *root;
+	int ret = 0;
+	int as_id;
+
+	for_each_tdp_mmu_root(kvm, root) {
+		/*
+		 * Take a reference on the root so that it cannot be freed if
+		 * this thread releases the MMU lock and yields in this loop.
+		 */
+		get_tdp_mmu_root(kvm, root);
+
+		as_id = kvm_mmu_page_as_id(root);
+		slots = __kvm_memslots(kvm, as_id);
+		kvm_for_each_memslot(memslot, slots) {
+			unsigned long hva_start, hva_end;
+			gfn_t gfn_start, gfn_end;
+
+			hva_start = max(start, memslot->userspace_addr);
+			hva_end = min(end, memslot->userspace_addr +
+				      (memslot->npages << PAGE_SHIFT));
+			if (hva_start >= hva_end)
+				continue;
+			/*
+			 * {gfn(page) | page intersects with [hva_start, hva_end)} =
+			 * {gfn_start, gfn_start+1, ..., gfn_end-1}.
+			 */
+			gfn_start = hva_to_gfn_memslot(hva_start, memslot);
+			gfn_end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, memslot);
+
+			ret |= handler(kvm, memslot, root, gfn_start,
+				       gfn_end, data);
+		}
+
+		put_tdp_mmu_root(kvm, root);
+	}
+
+	return ret;
+}
+
+static int zap_gfn_range_hva_wrapper(struct kvm *kvm,
+				     struct kvm_memory_slot *slot,
+				     struct kvm_mmu_page *root, gfn_t start,
+				     gfn_t end, unsigned long unused)
+{
+	return zap_gfn_range(kvm, root, start, end, false);
+}
+
+int kvm_tdp_mmu_zap_hva_range(struct kvm *kvm, unsigned long start,
+			      unsigned long end)
+{
+	return kvm_tdp_mmu_handle_hva_range(kvm, start, end, 0,
+					    zap_gfn_range_hva_wrapper);
+}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index abf23dc0ab7ad..ce804a97bfa1d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -18,4 +18,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm);
 int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, int write, int map_writable,
 			   int level, gpa_t gpa, kvm_pfn_t pfn, bool prefault,
 			   bool lpage_disallowed);
+
+int kvm_tdp_mmu_zap_hva_range(struct kvm *kvm, unsigned long start,
+			      unsigned long end);
 #endif /* __KVM_X86_MMU_TDP_MMU_H */
-- 
2.28.0.709.gb0816b6eb0-goog



* [PATCH 14/22] kvm: mmu: Add access tracking for tdp_mmu
  2020-09-25 21:22 [PATCH 00/22] Introduce the TDP MMU Ben Gardon
                   ` (12 preceding siblings ...)
  2020-09-25 21:22 ` [PATCH 13/22] kvm: mmu: Support invalidate range MMU notifier for " Ben Gardon
@ 2020-09-25 21:22 ` Ben Gardon
  2020-09-26  0:32   ` Paolo Bonzini
  2020-09-30 17:48   ` Sean Christopherson
  2020-09-25 21:22 ` [PATCH 15/22] kvm: mmu: Support changed pte notifier in tdp MMU Ben Gardon
                   ` (10 subsequent siblings)
  24 siblings, 2 replies; 105+ messages in thread
From: Ben Gardon @ 2020-09-25 21:22 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Cannon Matthews, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong,
	Ben Gardon

In order to interoperate correctly with the rest of KVM and other Linux
subsystems, the TDP MMU must correctly handle various MMU notifiers. The
main Linux MM uses the access tracking MMU notifiers for swap and other
features. Add hooks to handle the test/flush HVA (range) family of
MMU notifiers.

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/mmu.c          |  29 ++++++---
 arch/x86/kvm/mmu/mmu_internal.h |   7 +++
 arch/x86/kvm/mmu/tdp_mmu.c      | 103 +++++++++++++++++++++++++++++++-
 arch/x86/kvm/mmu/tdp_mmu.h      |   4 ++
 4 files changed, 133 insertions(+), 10 deletions(-)
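
The two notifier flavours handled here differ only in whether they
modify the entries: "test" reports whether any entry in the range was
accessed, while "age" also clears the accessed state so that the next
access can be observed. A toy sketch over a flat array of entries
follows. The toy_* names and the bit position are assumptions, and only
the case where hardware accessed bits are enabled is modelled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_ACCESSED	(1ull << 8)	/* hypothetical accessed-bit position */

/* "test young": report whether any entry was accessed, without changes. */
static bool toy_test_age(const uint64_t *sptes, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (sptes[i] & TOY_ACCESSED)
			return true;
	return false;
}

/* "age": clear the accessed bit everywhere and report whether any was set. */
static bool toy_age_range(uint64_t *sptes, size_t n)
{
	bool young = false;
	size_t i;

	for (i = 0; i < n; i++) {
		if (sptes[i] & TOY_ACCESSED) {
			sptes[i] &= ~TOY_ACCESSED;
			young = true;
		}
	}
	return young;
}
```

Calling toy_age_range() twice in a row on the same entries returns true
at most once unless the entries are accessed again in between, which is
the property the MM relies on when deciding what to reclaim.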

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0ddfdab942554..8c1e806b3d53f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -212,12 +212,12 @@ static struct percpu_counter kvm_total_used_mmu_pages;
 
 static u64 __read_mostly shadow_nx_mask;
 static u64 __read_mostly shadow_x_mask;	/* mutual exclusive with nx_mask */
-static u64 __read_mostly shadow_user_mask;
+u64 __read_mostly shadow_user_mask;
 u64 __read_mostly shadow_accessed_mask;
 static u64 __read_mostly shadow_dirty_mask;
 static u64 __read_mostly shadow_mmio_value;
 static u64 __read_mostly shadow_mmio_access_mask;
-static u64 __read_mostly shadow_present_mask;
+u64 __read_mostly shadow_present_mask;
 static u64 __read_mostly shadow_me_mask;
 
 /*
@@ -265,7 +265,6 @@ static u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
 static u8 __read_mostly shadow_phys_bits;
 
 static void mmu_spte_set(u64 *sptep, u64 spte);
-static bool is_executable_pte(u64 spte);
 static union kvm_mmu_page_role
 kvm_mmu_calc_root_page_role(struct kvm_vcpu *vcpu);
 
@@ -332,7 +331,7 @@ static inline bool kvm_vcpu_ad_need_write_protect(struct kvm_vcpu *vcpu)
 	return vcpu->arch.mmu == &vcpu->arch.guest_mmu;
 }
 
-static inline bool spte_ad_enabled(u64 spte)
+inline bool spte_ad_enabled(u64 spte)
 {
 	MMU_WARN_ON(is_mmio_spte(spte));
 	return (spte & SPTE_SPECIAL_MASK) != SPTE_AD_DISABLED_MASK;
@@ -607,7 +606,7 @@ int is_last_spte(u64 pte, int level)
 	return 0;
 }
 
-static bool is_executable_pte(u64 spte)
+bool is_executable_pte(u64 spte)
 {
 	return (spte & (shadow_x_mask | shadow_nx_mask)) == shadow_x_mask;
 }
@@ -791,7 +790,7 @@ static bool spte_has_volatile_bits(u64 spte)
 	return false;
 }
 
-static bool is_accessed_spte(u64 spte)
+bool is_accessed_spte(u64 spte)
 {
 	u64 accessed_mask = spte_shadow_accessed_mask(spte);
 
@@ -941,7 +940,7 @@ static u64 mmu_spte_get_lockless(u64 *sptep)
 	return __get_spte_lockless(sptep);
 }
 
-static u64 mark_spte_for_access_track(u64 spte)
+u64 mark_spte_for_access_track(u64 spte)
 {
 	if (spte_ad_enabled(spte))
 		return spte & ~shadow_accessed_mask;
@@ -1945,12 +1944,24 @@ static void rmap_recycle(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
 
 int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
 {
-	return kvm_handle_hva_range(kvm, start, end, 0, kvm_age_rmapp);
+	int young = false;
+
+	young = kvm_handle_hva_range(kvm, start, end, 0, kvm_age_rmapp);
+	if (kvm->arch.tdp_mmu_enabled)
+		young |= kvm_tdp_mmu_age_hva_range(kvm, start, end);
+
+	return young;
 }
 
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
 {
-	return kvm_handle_hva(kvm, hva, 0, kvm_test_age_rmapp);
+	int young = false;
+
+	young = kvm_handle_hva(kvm, hva, 0, kvm_test_age_rmapp);
+	if (kvm->arch.tdp_mmu_enabled)
+		young |= kvm_tdp_mmu_test_age_hva(kvm, hva);
+
+	return young;
 }
 
 #ifdef MMU_DEBUG
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 4cef9da051847..228bda0885552 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -80,7 +80,9 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 	(PT64_BASE_ADDR_MASK & ((1ULL << (PAGE_SHIFT + (((level) - 1) \
 						* PT64_LEVEL_BITS))) - 1))
 
+extern u64 shadow_user_mask;
 extern u64 shadow_accessed_mask;
+extern u64 shadow_present_mask;
 
 #define ACC_EXEC_MASK    1
 #define ACC_WRITE_MASK   PT_WRITABLE_MASK
@@ -95,6 +97,9 @@ int is_last_spte(u64 pte, int level);
 bool is_dirty_spte(u64 spte);
 int is_large_pte(u64 pte);
 bool is_access_track_spte(u64 spte);
+bool is_accessed_spte(u64 spte);
+bool spte_ad_enabled(u64 spte);
+bool is_executable_pte(u64 spte);
 
 void kvm_flush_remote_tlbs_with_address(struct kvm *kvm, u64 start_gfn,
 					u64 pages);
@@ -132,4 +137,6 @@ bool is_nx_huge_page_enabled(void);
 
 void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 
+u64 mark_spte_for_access_track(u64 spte);
+
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 1cea58db78a13..0a4b98669b3ef 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -224,6 +224,18 @@ static int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
 	return sp->role.smm ? 1 : 0;
 }
 
+static void handle_changed_spte_acc_track(u64 old_spte, u64 new_spte, int level)
+{
+	bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
+
+	if (!is_shadow_present_pte(old_spte) || !is_last_spte(old_spte, level))
+		return;
+
+	if (is_accessed_spte(old_spte) &&
+	    (!is_accessed_spte(new_spte) || pfn_changed))
+		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
+}
+
 /**
  * handle_changed_spte - handle bookkeeping associated with an SPTE change
  * @kvm: kvm instance
@@ -236,7 +248,7 @@ static int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
  * Handle bookkeeping that might result from the modification of a SPTE.
  * This function must be called for all TDP SPTE modifications.
  */
-static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
+static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 				u64 old_spte, u64 new_spte, int level)
 {
 	bool was_present = is_shadow_present_pte(old_spte);
@@ -331,6 +343,13 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	}
 }
 
+static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
+				u64 old_spte, u64 new_spte, int level)
+{
+	__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level);
+	handle_changed_spte_acc_track(old_spte, new_spte, level);
+}
+
 #define for_each_tdp_pte_root(_iter, _root, _start, _end) \
 	for_each_tdp_pte(_iter, _root->spt, _root->role.level, _start, _end)
 
@@ -621,3 +640,85 @@ int kvm_tdp_mmu_zap_hva_range(struct kvm *kvm, unsigned long start,
 	return kvm_tdp_mmu_handle_hva_range(kvm, start, end, 0,
 					    zap_gfn_range_hva_wrapper);
 }
+
+/*
+ * Mark the SPTEs mapping GFNs [start, end) as unaccessed and return non-zero
+ * if any of the GFNs in the range have been accessed.
+ */
+static int age_gfn_range(struct kvm *kvm, struct kvm_memory_slot *slot,
+			 struct kvm_mmu_page *root, gfn_t start, gfn_t end,
+			 unsigned long unused)
+{
+	struct tdp_iter iter;
+	int young = 0;
+	u64 new_spte = 0;
+	int as_id = kvm_mmu_page_as_id(root);
+
+	for_each_tdp_pte_root(iter, root, start, end) {
+		if (!is_shadow_present_pte(iter.old_spte) ||
+		    !is_last_spte(iter.old_spte, iter.level))
+			continue;
+
+		/*
+		 * If we have a non-accessed entry we don't need to change the
+		 * pte.
+		 */
+		if (!is_accessed_spte(iter.old_spte))
+			continue;
+
+		new_spte = iter.old_spte;
+
+		if (spte_ad_enabled(new_spte)) {
+			clear_bit((ffs(shadow_accessed_mask) - 1),
+				  (unsigned long *)&new_spte);
+		} else {
+			/*
+			 * Capture the dirty status of the page, so that it doesn't get
+			 * lost when the SPTE is marked for access tracking.
+			 */
+			if (is_writable_pte(new_spte))
+				kvm_set_pfn_dirty(spte_to_pfn(new_spte));
+
+			new_spte = mark_spte_for_access_track(new_spte);
+		}
+
+		*iter.sptep = new_spte;
+		__handle_changed_spte(kvm, as_id, iter.gfn, iter.old_spte,
+				      new_spte, iter.level);
+		young = true;
+	}
+
+	return young;
+}
+
+int kvm_tdp_mmu_age_hva_range(struct kvm *kvm, unsigned long start,
+			      unsigned long end)
+{
+	return kvm_tdp_mmu_handle_hva_range(kvm, start, end, 0,
+					    age_gfn_range);
+}
+
+static int test_age_gfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+			struct kvm_mmu_page *root, gfn_t gfn, gfn_t unused,
+			unsigned long unused2)
+{
+	struct tdp_iter iter;
+	int young = 0;
+
+	for_each_tdp_pte_root(iter, root, gfn, gfn + 1) {
+		if (!is_shadow_present_pte(iter.old_spte) ||
+		    !is_last_spte(iter.old_spte, iter.level))
+			continue;
+
+		if (is_accessed_spte(iter.old_spte))
+			young = true;
+	}
+
+	return young;
+}
+
+int kvm_tdp_mmu_test_age_hva(struct kvm *kvm, unsigned long hva)
+{
+	return kvm_tdp_mmu_handle_hva_range(kvm, hva, hva + 1, 0,
+					    test_age_gfn);
+}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index ce804a97bfa1d..f316773b7b5a8 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -21,4 +21,8 @@ int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, int write, int map_writable,
 
 int kvm_tdp_mmu_zap_hva_range(struct kvm *kvm, unsigned long start,
 			      unsigned long end);
+
+int kvm_tdp_mmu_age_hva_range(struct kvm *kvm, unsigned long start,
+			      unsigned long end);
+int kvm_tdp_mmu_test_age_hva(struct kvm *kvm, unsigned long hva);
 #endif /* __KVM_X86_MMU_TDP_MMU_H */
-- 
2.28.0.709.gb0816b6eb0-goog
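
Editor's illustration (not part of the patch): the aging and test-age hooks
added above can be modelled on a flat array of entries. This is a minimal
sketch, assuming a made-up bit layout (TOY_PRESENT, TOY_ACCESSED) and the
hypothetical helpers toy_test_age() and toy_age_range(). It omits the page
table walk, the access-tracking path taken when A/D bits are disabled, and
all locking.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit layout for this sketch only; not KVM's real SPTE format. */
#define TOY_PRESENT  (1ULL << 0)
#define TOY_ACCESSED (1ULL << 5)

/* Report whether any present entry is marked accessed; modifies nothing. */
static bool toy_test_age(const uint64_t *sptes, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if ((sptes[i] & TOY_PRESENT) && (sptes[i] & TOY_ACCESSED))
			return true;
	return false;
}

/*
 * Clear the accessed bit on every present, accessed entry.
 * Returns true if at least one entry had been accessed ("young").
 */
static bool toy_age_range(uint64_t *sptes, size_t n)
{
	bool young = false;

	for (size_t i = 0; i < n; i++) {
		/* Non-present entries are skipped, as in the patch. */
		if (!(sptes[i] & TOY_PRESENT))
			continue;
		/* Entries that are already unaccessed need no change. */
		if (!(sptes[i] & TOY_ACCESSED))
			continue;
		sptes[i] &= ~TOY_ACCESSED;
		young = true;
	}
	return young;
}
```

A second call to toy_age_range() on the same array returns false, which is
the property the MM relies on when it polls a range for recent use.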



* [PATCH 15/22] kvm: mmu: Support changed pte notifier in tdp MMU
  2020-09-25 21:22 [PATCH 00/22] Introduce the TDP MMU Ben Gardon
                   ` (13 preceding siblings ...)
  2020-09-25 21:22 ` [PATCH 14/22] kvm: mmu: Add access tracking for tdp_mmu Ben Gardon
@ 2020-09-25 21:22 ` Ben Gardon
  2020-09-26  0:33   ` Paolo Bonzini
  2020-09-28 15:11   ` Paolo Bonzini
  2020-09-25 21:22 ` [PATCH 16/22] kvm: mmu: Add dirty logging handler for changed sptes Ben Gardon
                   ` (9 subsequent siblings)
  24 siblings, 2 replies; 105+ messages in thread
From: Ben Gardon @ 2020-09-25 21:22 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Cannon Matthews, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong,
	Ben Gardon

In order to interoperate correctly with the rest of KVM and other Linux
subsystems, the TDP MMU must correctly handle various MMU notifiers. Add
a hook and handle the change_pte MMU notifier.

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/mmu.c          | 46 +++++++++++++------------
 arch/x86/kvm/mmu/mmu_internal.h | 13 +++++++
 arch/x86/kvm/mmu/tdp_mmu.c      | 61 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.h      |  3 ++
 4 files changed, 102 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8c1e806b3d53f..0d80abe82ca93 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -122,9 +122,6 @@ module_param(dbg, bool, 0644);
 
 #define PTE_PREFETCH_NUM		8
 
-#define PT_FIRST_AVAIL_BITS_SHIFT 10
-#define PT64_SECOND_AVAIL_BITS_SHIFT 54
-
 /*
  * The mask used to denote special SPTEs, which can be either MMIO SPTEs or
  * Access Tracking SPTEs.
@@ -147,13 +144,6 @@ module_param(dbg, bool, 0644);
 #define PT32_INDEX(address, level)\
 	(((address) >> PT32_LEVEL_SHIFT(level)) & ((1 << PT32_LEVEL_BITS) - 1))
 
-
-#ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
-#define PT64_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
-#else
-#define PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
-#endif
-
 #define PT32_BASE_ADDR_MASK PAGE_MASK
 #define PT32_DIR_BASE_ADDR_MASK \
 	(PAGE_MASK & ~((1ULL << (PAGE_SHIFT + PT32_LEVEL_BITS)) - 1))
@@ -170,9 +160,6 @@ module_param(dbg, bool, 0644);
 
 #include <trace/events/kvm.h>
 
-#define SPTE_HOST_WRITEABLE	(1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
-#define SPTE_MMU_WRITEABLE	(1ULL << (PT_FIRST_AVAIL_BITS_SHIFT + 1))
-
 /* make pte_list_desc fit well in cache line */
 #define PTE_LIST_EXT 3
 
@@ -1708,6 +1695,21 @@ static int kvm_unmap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 	return kvm_zap_rmapp(kvm, rmap_head);
 }
 
+u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn)
+{
+	u64 new_spte;
+
+	new_spte = old_spte & ~PT64_BASE_ADDR_MASK;
+	new_spte |= (u64)new_pfn << PAGE_SHIFT;
+
+	new_spte &= ~PT_WRITABLE_MASK;
+	new_spte &= ~SPTE_HOST_WRITEABLE;
+
+	new_spte = mark_spte_for_access_track(new_spte);
+
+	return new_spte;
+}
+
 static int kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 			     struct kvm_memory_slot *slot, gfn_t gfn, int level,
 			     unsigned long data)
@@ -1733,13 +1735,8 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 			pte_list_remove(rmap_head, sptep);
 			goto restart;
 		} else {
-			new_spte = *sptep & ~PT64_BASE_ADDR_MASK;
-			new_spte |= (u64)new_pfn << PAGE_SHIFT;
-
-			new_spte &= ~PT_WRITABLE_MASK;
-			new_spte &= ~SPTE_HOST_WRITEABLE;
-
-			new_spte = mark_spte_for_access_track(new_spte);
+			new_spte = kvm_mmu_changed_pte_notifier_make_spte(
+					*sptep, new_pfn);
 
 			mmu_spte_clear_track_bits(sptep);
 			mmu_spte_set(sptep, new_spte);
@@ -1895,7 +1892,14 @@ int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end,
 
 int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
 {
-	return kvm_handle_hva(kvm, hva, (unsigned long)&pte, kvm_set_pte_rmapp);
+	int r;
+
+	r = kvm_handle_hva(kvm, hva, (unsigned long)&pte, kvm_set_pte_rmapp);
+
+	if (kvm->arch.tdp_mmu_enabled)
+		r |= kvm_tdp_mmu_set_spte_hva(kvm, hva, &pte);
+
+	return r;
 }
 
 static int kvm_age_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 228bda0885552..8eaa6e4764bce 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -80,6 +80,12 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 	(PT64_BASE_ADDR_MASK & ((1ULL << (PAGE_SHIFT + (((level) - 1) \
 						* PT64_LEVEL_BITS))) - 1))
 
+#ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
+#define PT64_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
+#else
+#define PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
+#endif
+
 extern u64 shadow_user_mask;
 extern u64 shadow_accessed_mask;
 extern u64 shadow_present_mask;
@@ -89,6 +95,12 @@ extern u64 shadow_present_mask;
 #define ACC_USER_MASK    PT_USER_MASK
 #define ACC_ALL          (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)
 
+#define PT_FIRST_AVAIL_BITS_SHIFT 10
+#define PT64_SECOND_AVAIL_BITS_SHIFT 54
+
+#define SPTE_HOST_WRITEABLE	(1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
+#define SPTE_MMU_WRITEABLE	(1ULL << (PT_FIRST_AVAIL_BITS_SHIFT + 1))
+
 /* Functions for interpreting SPTEs */
 kvm_pfn_t spte_to_pfn(u64 pte);
 bool is_mmio_spte(u64 spte);
@@ -138,5 +150,6 @@ bool is_nx_huge_page_enabled(void);
 void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 
 u64 mark_spte_for_access_track(u64 spte);
+u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn);
 
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 0a4b98669b3ef..3119583409131 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -722,3 +722,64 @@ int kvm_tdp_mmu_test_age_hva(struct kvm *kvm, unsigned long hva)
 	return kvm_tdp_mmu_handle_hva_range(kvm, hva, hva + 1, 0,
 					    test_age_gfn);
 }
+
+/*
+ * Handle the change_pte MMU notifier for the TDP MMU.
+ * data is a pointer to the new pte_t mapping the HVA specified by the MMU
+ * notifier.
+ * Flushes remote TLBs itself as needed, so it always returns 0.
+ */
+static int set_tdp_spte(struct kvm *kvm, struct kvm_memory_slot *slot,
+			struct kvm_mmu_page *root, gfn_t gfn, gfn_t unused,
+			unsigned long data)
+{
+	struct tdp_iter iter;
+	pte_t *ptep = (pte_t *)data;
+	kvm_pfn_t new_pfn;
+	u64 new_spte;
+	int need_flush = 0;
+	int as_id = kvm_mmu_page_as_id(root);
+
+	WARN_ON(pte_huge(*ptep));
+
+	new_pfn = pte_pfn(*ptep);
+
+	for_each_tdp_pte_root(iter, root, gfn, gfn + 1) {
+		if (iter.level != PG_LEVEL_4K)
+			continue;
+
+		if (!is_shadow_present_pte(iter.old_spte))
+			break;
+
+		*iter.sptep = 0;
+		handle_changed_spte(kvm, as_id, iter.gfn, iter.old_spte,
+				    0, iter.level);
+
+		kvm_flush_remote_tlbs_with_address(kvm, iter.gfn, 1);
+
+		if (!pte_write(*ptep)) {
+			new_spte = kvm_mmu_changed_pte_notifier_make_spte(
+					iter.old_spte, new_pfn);
+
+			*iter.sptep = new_spte;
+			handle_changed_spte(kvm, as_id, iter.gfn, iter.old_spte,
+					    new_spte, iter.level);
+		}
+
+		need_flush = 1;
+	}
+
+	if (need_flush)
+		kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
+
+	return 0;
+}
+
+int kvm_tdp_mmu_set_spte_hva(struct kvm *kvm, unsigned long address,
+			     pte_t *host_ptep)
+{
+	return kvm_tdp_mmu_handle_hva_range(kvm, address, address + 1,
+					    (unsigned long)host_ptep,
+					    set_tdp_spte);
+}
+
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index f316773b7b5a8..5a399aa60b8d8 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -25,4 +25,7 @@ int kvm_tdp_mmu_zap_hva_range(struct kvm *kvm, unsigned long start,
 int kvm_tdp_mmu_age_hva_range(struct kvm *kvm, unsigned long start,
 			      unsigned long end);
 int kvm_tdp_mmu_test_age_hva(struct kvm *kvm, unsigned long hva);
+
+int kvm_tdp_mmu_set_spte_hva(struct kvm *kvm, unsigned long address,
+			     pte_t *host_ptep);
 #endif /* __KVM_X86_MMU_TDP_MMU_H */
-- 
2.28.0.709.gb0816b6eb0-goog
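
Editor's illustration (not part of the patch): the bit manipulation in
kvm_mmu_changed_pte_notifier_make_spte() can be tried in isolation. This is
a minimal sketch, assuming the non-dynamic 52-bit address mask shown in the
diff, bit 1 as the writable bit and bit 10 as the host-writable bit. The
TOY_* names and toy_remap_spte() are hypothetical, and the final
mark_spte_for_access_track() step is left out.

```c
#include <stdint.h>

/* Assumed layout for this sketch; the real masks live in the KVM MMU headers. */
#define TOY_PAGE_SHIFT     12
#define TOY_BASE_ADDR_MASK \
	(((1ULL << 52) - 1) & ~((1ULL << TOY_PAGE_SHIFT) - 1))
#define TOY_WRITABLE       (1ULL << 1)
#define TOY_HOST_WRITEABLE (1ULL << 10)

/*
 * Point an existing entry at a new page frame and drop write access, so the
 * next guest write faults and the mapping is rebuilt with fresh permissions.
 */
static uint64_t toy_remap_spte(uint64_t old_spte, uint64_t new_pfn)
{
	/* Keep every non-address bit of the old entry. */
	uint64_t new_spte = old_spte & ~TOY_BASE_ADDR_MASK;

	/* Install the new page frame number. */
	new_spte |= new_pfn << TOY_PAGE_SHIFT;

	/* Remove both the hardware and the host-level writable bits. */
	new_spte &= ~TOY_WRITABLE;
	new_spte &= ~TOY_HOST_WRITEABLE;

	return new_spte;
}
```

Bits outside the address field and the two writable bits pass through
unchanged, which is why the helper can be shared by the rmap-based MMU and
the TDP MMU.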



* [PATCH 16/22] kvm: mmu: Add dirty logging handler for changed sptes
  2020-09-25 21:22 [PATCH 00/22] Introduce the TDP MMU Ben Gardon
                   ` (14 preceding siblings ...)
  2020-09-25 21:22 ` [PATCH 15/22] kvm: mmu: Support changed pte notifier in tdp MMU Ben Gardon
@ 2020-09-25 21:22 ` Ben Gardon
  2020-09-26  0:45   ` Paolo Bonzini
  2020-09-25 21:22 ` [PATCH 17/22] kvm: mmu: Support dirty logging for the TDP MMU Ben Gardon
                   ` (8 subsequent siblings)
  24 siblings, 1 reply; 105+ messages in thread
From: Ben Gardon @ 2020-09-25 21:22 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Cannon Matthews, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong,
	Ben Gardon

Add a function to handle the dirty logging bookkeeping associated with
SPTE changes. This will be important for future commits which will allow
the TDP MMU to log dirty pages the same way the x86 shadow paging based
MMU does.

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 21 +++++++++++++++++++++
 include/linux/kvm_host.h   |  1 +
 virt/kvm/kvm_main.c        |  6 ++----
 3 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 3119583409131..bbe973d3f8084 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -236,6 +236,24 @@ static void handle_changed_spte_acc_track(u64 old_spte, u64 new_spte, int level)
 		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
 }
 
+static void handle_changed_spte_dlog(struct kvm *kvm, int as_id, gfn_t gfn,
+				    u64 old_spte, u64 new_spte, int level)
+{
+	bool pfn_changed;
+	struct kvm_memory_slot *slot;
+
+	if (level > PG_LEVEL_4K)
+		return;
+
+	pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
+
+	if ((!is_writable_pte(old_spte) || pfn_changed) &&
+	    is_writable_pte(new_spte)) {
+		slot = __gfn_to_memslot(__kvm_memslots(kvm, as_id), gfn);
+		mark_page_dirty_in_slot(slot, gfn);
+	}
+}
+
 /**
  * handle_changed_spte - handle bookkeeping associated with an SPTE change
  * @kvm: kvm instance
@@ -348,6 +366,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 {
 	__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level);
 	handle_changed_spte_acc_track(old_spte, new_spte, level);
+	handle_changed_spte_dlog(kvm, as_id, gfn, old_spte, new_spte, level);
 }
 
 #define for_each_tdp_pte_root(_iter, _root, _start, _end) \
@@ -685,6 +704,8 @@ static int age_gfn_range(struct kvm *kvm, struct kvm_memory_slot *slot,
 		*iter.sptep = new_spte;
 		__handle_changed_spte(kvm, as_id, iter.gfn, iter.old_spte,
 				      new_spte, iter.level);
+		handle_changed_spte_dlog(kvm, as_id, iter.gfn, iter.old_spte,
+					 new_spte, iter.level);
 		young = true;
 	}
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a460bc712a81c..2f8c3f644d809 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -798,6 +798,7 @@ struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
 bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
 bool kvm_vcpu_is_visible_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
 unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn);
+void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot, gfn_t gfn);
 void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
 
 struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f9c80351c9efd..b5082ce60a33f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -143,8 +143,6 @@ static void hardware_disable_all(void);
 
 static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
 
-static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot, gfn_t gfn);
-
 __visible bool kvm_rebooting;
 EXPORT_SYMBOL_GPL(kvm_rebooting);
 
@@ -2640,8 +2638,7 @@ int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len)
 }
 EXPORT_SYMBOL_GPL(kvm_clear_guest);
 
-static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot,
-				    gfn_t gfn)
+void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot, gfn_t gfn)
 {
 	if (memslot && memslot->dirty_bitmap) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;
@@ -2649,6 +2646,7 @@ static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot,
 		set_bit_le(rel_gfn, memslot->dirty_bitmap);
 	}
 }
+EXPORT_SYMBOL_GPL(mark_page_dirty_in_slot);
 
 void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
 {
-- 
2.28.0.709.gb0816b6eb0-goog
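
Editor's illustration (not part of the patch): the rule applied by
handle_changed_spte_dlog() is that a 4K page is logged as dirty when the new
SPTE is writable and either the old one was not writable or the PFN changed.
This is a minimal sketch, assuming a 52-bit address mask and bit 1 as the
writable bit. struct toy_slot and the toy_* helpers are hypothetical, and
the bitmap uses native bit order instead of the kernel's set_bit_le().

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout for this sketch only. */
#define TOY_PAGE_SHIFT     12
#define TOY_BASE_ADDR_MASK \
	(((1ULL << 52) - 1) & ~((1ULL << TOY_PAGE_SHIFT) - 1))
#define TOY_WRITABLE       (1ULL << 1)
#define TOY_LEVEL_4K       1
#define TOY_BITS_PER_LONG  (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for a memslot: a base GFN plus a dirty bitmap. */
struct toy_slot {
	uint64_t base_gfn;
	unsigned long *dirty_bitmap;
};

static uint64_t toy_pfn(uint64_t spte)
{
	return (spte & TOY_BASE_ADDR_MASK) >> TOY_PAGE_SHIFT;
}

/* True if this SPTE change makes the page newly writable at 4K granularity. */
static bool toy_needs_dirty_log(uint64_t old_spte, uint64_t new_spte, int level)
{
	bool pfn_changed;

	if (level > TOY_LEVEL_4K)
		return false;

	pfn_changed = toy_pfn(old_spte) != toy_pfn(new_spte);
	return (!(old_spte & TOY_WRITABLE) || pfn_changed) &&
	       (new_spte & TOY_WRITABLE);
}

/* Set the bit for gfn, relative to the start of the slot. */
static void toy_mark_dirty(struct toy_slot *slot, uint64_t gfn)
{
	if (slot && slot->dirty_bitmap) {
		uint64_t rel = gfn - slot->base_gfn;

		slot->dirty_bitmap[rel / TOY_BITS_PER_LONG] |=
			1UL << (rel % TOY_BITS_PER_LONG);
	}
}

static bool toy_test_dirty(const struct toy_slot *slot, uint64_t gfn)
{
	uint64_t rel = gfn - slot->base_gfn;

	return slot->dirty_bitmap[rel / TOY_BITS_PER_LONG] &
	       (1UL << (rel % TOY_BITS_PER_LONG));
}
```

A caller would pair the two steps: if toy_needs_dirty_log() returns true for
a change, call toy_mark_dirty() for that GFN.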



* [PATCH 17/22] kvm: mmu: Support dirty logging for the TDP MMU
  2020-09-25 21:22 [PATCH 00/22] Introduce the TDP MMU Ben Gardon
                   ` (15 preceding siblings ...)
  2020-09-25 21:22 ` [PATCH 16/22] kvm: mmu: Add dirty logging handler for changed sptes Ben Gardon
@ 2020-09-25 21:22 ` Ben Gardon
  2020-09-26  1:04   ` Paolo Bonzini
                     ` (2 more replies)
  2020-09-25 21:22 ` [PATCH 18/22] kvm: mmu: Support disabling dirty logging for the tdp MMU Ben Gardon
                   ` (7 subsequent siblings)
  24 siblings, 3 replies; 105+ messages in thread
From: Ben Gardon @ 2020-09-25 21:22 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Cannon Matthews, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong,
	Ben Gardon

Dirty logging is a key feature of the KVM MMU and must be supported by
the TDP MMU. Add support for both the write protection and PML dirty
logging modes.
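
Editor's illustration (not part of the patch): the two modes differ in which
bit is cleared to reset a page's dirty state. When write protection is
needed, the writable bit is cleared so the next write faults. Otherwise the
dirty bit is cleared and hardware sets it again on the next write. This is a
minimal sketch, assuming a made-up bit layout; toy_clear_dirty() and the
TOY_* masks are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions for this sketch; not KVM's real SPTE layout. */
#define TOY_PRESENT  (1ULL << 0)
#define TOY_WRITABLE (1ULL << 1)
#define TOY_DIRTY    (1ULL << 6)

/*
 * Reset the dirty state of one entry.
 * Returns true if the entry changed, meaning a TLB flush would be needed.
 */
static bool toy_clear_dirty(uint64_t *spte, bool need_write_protect)
{
	/* Pick the bit that records "dirty" in the selected mode. */
	uint64_t bit = need_write_protect ? TOY_WRITABLE : TOY_DIRTY;

	/* Nothing to do for non-present or already-clean entries. */
	if (!(*spte & TOY_PRESENT) || !(*spte & bit))
		return false;

	*spte &= ~bit;
	return true;
}
```

The return value mirrors the spte_set flag in the patch, which callers use
to decide whether remote TLBs must be flushed.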

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/mmu.c          |  19 +-
 arch/x86/kvm/mmu/mmu_internal.h |   2 +
 arch/x86/kvm/mmu/tdp_iter.c     |  18 ++
 arch/x86/kvm/mmu/tdp_iter.h     |   1 +
 arch/x86/kvm/mmu/tdp_mmu.c      | 295 ++++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.h      |  10 ++
 6 files changed, 343 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0d80abe82ca93..b9074603f9df1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -201,7 +201,7 @@ static u64 __read_mostly shadow_nx_mask;
 static u64 __read_mostly shadow_x_mask;	/* mutual exclusive with nx_mask */
 u64 __read_mostly shadow_user_mask;
 u64 __read_mostly shadow_accessed_mask;
-static u64 __read_mostly shadow_dirty_mask;
+u64 __read_mostly shadow_dirty_mask;
 static u64 __read_mostly shadow_mmio_value;
 static u64 __read_mostly shadow_mmio_access_mask;
 u64 __read_mostly shadow_present_mask;
@@ -324,7 +324,7 @@ inline bool spte_ad_enabled(u64 spte)
 	return (spte & SPTE_SPECIAL_MASK) != SPTE_AD_DISABLED_MASK;
 }
 
-static inline bool spte_ad_need_write_protect(u64 spte)
+inline bool spte_ad_need_write_protect(u64 spte)
 {
 	MMU_WARN_ON(is_mmio_spte(spte));
 	return (spte & SPTE_SPECIAL_MASK) != SPTE_AD_ENABLED_MASK;
@@ -1591,6 +1591,9 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
 {
 	struct kvm_rmap_head *rmap_head;
 
+	if (kvm->arch.tdp_mmu_enabled)
+		kvm_tdp_mmu_clear_dirty_pt_masked(kvm, slot,
+				slot->base_gfn + gfn_offset, mask, true);
 	while (mask) {
 		rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
 					  PG_LEVEL_4K, slot);
@@ -1617,6 +1620,9 @@ void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
 {
 	struct kvm_rmap_head *rmap_head;
 
+	if (kvm->arch.tdp_mmu_enabled)
+		kvm_tdp_mmu_clear_dirty_pt_masked(kvm, slot,
+				slot->base_gfn + gfn_offset, mask, false);
 	while (mask) {
 		rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
 					  PG_LEVEL_4K, slot);
@@ -5954,6 +5960,8 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 	spin_lock(&kvm->mmu_lock);
 	flush = slot_handle_level(kvm, memslot, slot_rmap_write_protect,
 				start_level, KVM_MAX_HUGEPAGE_LEVEL, false);
+	if (kvm->arch.tdp_mmu_enabled)
+		flush = kvm_tdp_mmu_wrprot_slot(kvm, memslot, false) || flush;
 	spin_unlock(&kvm->mmu_lock);
 
 	/*
@@ -6034,6 +6042,7 @@ void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm,
 	kvm_flush_remote_tlbs_with_address(kvm, memslot->base_gfn,
 					   memslot->npages);
 }
+EXPORT_SYMBOL_GPL(kvm_arch_flush_remote_tlbs_memslot);
 
 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
 				   struct kvm_memory_slot *memslot)
@@ -6042,6 +6051,8 @@ void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
 
 	spin_lock(&kvm->mmu_lock);
 	flush = slot_handle_leaf(kvm, memslot, __rmap_clear_dirty, false);
+	if (kvm->arch.tdp_mmu_enabled)
+		flush = kvm_tdp_mmu_clear_dirty_slot(kvm, memslot) || flush;
 	spin_unlock(&kvm->mmu_lock);
 
 	/*
@@ -6063,6 +6074,8 @@ void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
 	spin_lock(&kvm->mmu_lock);
 	flush = slot_handle_large_level(kvm, memslot, slot_rmap_write_protect,
 					false);
+	if (kvm->arch.tdp_mmu_enabled)
+		flush = kvm_tdp_mmu_wrprot_slot(kvm, memslot, true) || flush;
 	spin_unlock(&kvm->mmu_lock);
 
 	if (flush)
@@ -6077,6 +6090,8 @@ void kvm_mmu_slot_set_dirty(struct kvm *kvm,
 
 	spin_lock(&kvm->mmu_lock);
 	flush = slot_handle_all_level(kvm, memslot, __rmap_set_dirty, false);
+	if (kvm->arch.tdp_mmu_enabled)
+		flush = kvm_tdp_mmu_slot_set_dirty(kvm, memslot) || flush;
 	spin_unlock(&kvm->mmu_lock);
 
 	if (flush)
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 8eaa6e4764bce..1a777ccfde44e 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -89,6 +89,7 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 extern u64 shadow_user_mask;
 extern u64 shadow_accessed_mask;
 extern u64 shadow_present_mask;
+extern u64 shadow_dirty_mask;
 
 #define ACC_EXEC_MASK    1
 #define ACC_WRITE_MASK   PT_WRITABLE_MASK
@@ -112,6 +113,7 @@ bool is_access_track_spte(u64 spte);
 bool is_accessed_spte(u64 spte);
 bool spte_ad_enabled(u64 spte);
 bool is_executable_pte(u64 spte);
+bool spte_ad_need_write_protect(u64 spte);
 
 void kvm_flush_remote_tlbs_with_address(struct kvm *kvm, u64 start_gfn,
 					u64 pages);
diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index 6c1a38429c81a..132e286150856 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -178,3 +178,21 @@ void tdp_iter_refresh_walk(struct tdp_iter *iter)
 	tdp_iter_start(iter, iter->pt_path[iter->root_level - 1],
 		       iter->root_level, goal_gfn);
 }
+
+/*
+ * Move on to the next SPTE, but do not move down into a child page table even
+ * if the current SPTE leads to one.
+ */
+void tdp_iter_next_no_step_down(struct tdp_iter *iter)
+{
+	bool done;
+
+	done = try_step_side(iter);
+	while (!done) {
+		if (!try_step_up(iter)) {
+			iter->valid = false;
+			break;
+		}
+		done = try_step_side(iter);
+	}
+}
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 34da3bdada436..d0e65a62ea7d9 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -50,5 +50,6 @@ void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
 		    gfn_t goal_gfn);
 void tdp_iter_next(struct tdp_iter *iter);
 void tdp_iter_refresh_walk(struct tdp_iter *iter);
+void tdp_iter_next_no_step_down(struct tdp_iter *iter);
 
 #endif /* __KVM_X86_MMU_TDP_ITER_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index bbe973d3f8084..e5cb7f0ec23e8 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -700,6 +700,7 @@ static int age_gfn_range(struct kvm *kvm, struct kvm_memory_slot *slot,
 
 			new_spte = mark_spte_for_access_track(new_spte);
 		}
+		new_spte &= ~shadow_dirty_mask;
 
 		*iter.sptep = new_spte;
 		__handle_changed_spte(kvm, as_id, iter.gfn, iter.old_spte,
@@ -804,3 +805,297 @@ int kvm_tdp_mmu_set_spte_hva(struct kvm *kvm, unsigned long address,
 					    set_tdp_spte);
 }
 
+/*
+ * Remove write access from all the SPTEs mapping GFNs [start, end). If
+ * skip_4k is set, SPTEs that map 4k pages will not be write-protected.
+ * Returns true if an SPTE has been changed and the TLBs need to be flushed.
+ */
+static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
+			     gfn_t start, gfn_t end, bool skip_4k)
+{
+	struct tdp_iter iter;
+	u64 new_spte;
+	bool spte_set = false;
+	int as_id = kvm_mmu_page_as_id(root);
+
+	for_each_tdp_pte_root(iter, root, start, end) {
+iteration_start:
+		if (!is_shadow_present_pte(iter.old_spte))
+			continue;
+
+		/*
+		 * If this entry points to a page of 4K entries, and 4k entries
+		 * should be skipped, skip the whole page. If the non-leaf
+		 * entry is at a higher level, move on to the next
+		 * (lower level) entry.
+		 */
+		if (!is_last_spte(iter.old_spte, iter.level)) {
+			if (skip_4k && iter.level == PG_LEVEL_2M) {
+				tdp_iter_next_no_step_down(&iter);
+				if (iter.valid && iter.gfn < end)
+					goto iteration_start;
+				else
+					break;
+			} else {
+				continue;
+			}
+		}
+
+		WARN_ON(skip_4k && iter.level == PG_LEVEL_4K);
+
+		new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
+
+		*iter.sptep = new_spte;
+		__handle_changed_spte(kvm, as_id, iter.gfn, iter.old_spte,
+				      new_spte, iter.level);
+		handle_changed_spte_acc_track(iter.old_spte, new_spte,
+					      iter.level);
+		spte_set = true;
+
+		tdp_mmu_iter_cond_resched(kvm, &iter);
+	}
+	return spte_set;
+}
+
+/*
+ * Remove write access from all the SPTEs mapping GFNs in the memslot. If
+ * skip_4k is set, SPTEs that map 4k pages will not be write-protected.
+ * Returns true if an SPTE has been changed and the TLBs need to be flushed.
+ */
+bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm, struct kvm_memory_slot *slot,
+			     bool skip_4k)
+{
+	struct kvm_mmu_page *root;
+	int root_as_id;
+	bool spte_set = false;
+
+	for_each_tdp_mmu_root(kvm, root) {
+		root_as_id = kvm_mmu_page_as_id(root);
+		if (root_as_id != slot->as_id)
+			continue;
+
+		/*
+		 * Take a reference on the root so that it cannot be freed if
+		 * this thread releases the MMU lock and yields in this loop.
+		 */
+		get_tdp_mmu_root(kvm, root);
+
+		spte_set = wrprot_gfn_range(kvm, root, slot->base_gfn,
+				slot->base_gfn + slot->npages, skip_4k) ||
+			   spte_set;
+
+		put_tdp_mmu_root(kvm, root);
+	}
+
+	return spte_set;
+}
+
+/*
+ * Clear the dirty status of all the SPTEs mapping GFNs in the memslot. If
+ * AD bits are enabled, this will involve clearing the dirty bit on each SPTE.
+ * If AD bits are not enabled, this will require clearing the writable bit on
+ * each SPTE. Returns true if an SPTE has been changed and the TLBs need to
+ * be flushed.
+ */
+static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
+			   gfn_t start, gfn_t end)
+{
+	struct tdp_iter iter;
+	u64 new_spte;
+	bool spte_set = false;
+	int as_id = kvm_mmu_page_as_id(root);
+
+	for_each_tdp_pte_root(iter, root, start, end) {
+		if (!is_shadow_present_pte(iter.old_spte) ||
+		    !is_last_spte(iter.old_spte, iter.level))
+			continue;
+
+		if (spte_ad_need_write_protect(iter.old_spte)) {
+			if (is_writable_pte(iter.old_spte))
+				new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
+			else
+				continue;
+		} else {
+			if (iter.old_spte & shadow_dirty_mask)
+				new_spte = iter.old_spte & ~shadow_dirty_mask;
+			else
+				continue;
+		}
+
+		*iter.sptep = new_spte;
+		__handle_changed_spte(kvm, as_id, iter.gfn, iter.old_spte,
+				      new_spte, iter.level);
+		handle_changed_spte_acc_track(iter.old_spte, new_spte,
+					      iter.level);
+		spte_set = true;
+
+		tdp_mmu_iter_cond_resched(kvm, &iter);
+	}
+	return spte_set;
+}
+
+/*
+ * Clear the dirty status of all the SPTEs mapping GFNs in the memslot. If
+ * AD bits are enabled, this will involve clearing the dirty bit on each SPTE.
+ * If AD bits are not enabled, this will require clearing the writable bit on
+ * each SPTE. Returns true if an SPTE has been changed and the TLBs need to
+ * be flushed.
+ */
+bool kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+	struct kvm_mmu_page *root;
+	int root_as_id;
+	bool spte_set = false;
+
+	for_each_tdp_mmu_root(kvm, root) {
+		root_as_id = kvm_mmu_page_as_id(root);
+		if (root_as_id != slot->as_id)
+			continue;
+
+		/*
+		 * Take a reference on the root so that it cannot be freed if
+		 * this thread releases the MMU lock and yields in this loop.
+		 */
+		get_tdp_mmu_root(kvm, root);
+
+		spte_set = clear_dirty_gfn_range(kvm, root, slot->base_gfn,
+				slot->base_gfn + slot->npages) || spte_set;
+
+		put_tdp_mmu_root(kvm, root);
+	}
+
+	return spte_set;
+}
+
+/*
+ * Clears the dirty status of all the 4k SPTEs mapping GFNs for which a bit is
+ * set in mask, starting at gfn. The given memslot is expected to contain all
+ * the GFNs represented by set bits in the mask. If AD bits are enabled,
+ * clearing the dirty status will involve clearing the dirty bit on each SPTE
+ * or, if AD bits are not enabled, clearing the writable bit on each SPTE.
+ */
+static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
+				  gfn_t gfn, unsigned long mask, bool wrprot)
+{
+	struct tdp_iter iter;
+	u64 new_spte;
+	int as_id = kvm_mmu_page_as_id(root);
+
+	for_each_tdp_pte_root(iter, root, gfn + __ffs(mask),
+			      gfn + BITS_PER_LONG) {
+		if (!mask)
+			break;
+
+		if (!is_shadow_present_pte(iter.old_spte) ||
+		    !is_last_spte(iter.old_spte, iter.level) ||
+		    iter.level > PG_LEVEL_4K ||
+		    !(mask & (1UL << (iter.gfn - gfn))))
+			continue;
+
+		if (wrprot || spte_ad_need_write_protect(iter.old_spte)) {
+			if (is_writable_pte(iter.old_spte))
+				new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
+			else
+				continue;
+		} else {
+			if (iter.old_spte & shadow_dirty_mask)
+				new_spte = iter.old_spte & ~shadow_dirty_mask;
+			else
+				continue;
+		}
+
+		*iter.sptep = new_spte;
+		__handle_changed_spte(kvm, as_id, iter.gfn, iter.old_spte,
+				      new_spte, iter.level);
+		handle_changed_spte_acc_track(iter.old_spte, new_spte,
+					      iter.level);
+
+		mask &= ~(1UL << (iter.gfn - gfn));
+	}
+}
+
+/*
+ * Clears the dirty status of all the 4k SPTEs mapping GFNs for which a bit is
+ * set in mask, starting at gfn. The given memslot is expected to contain all
+ * the GFNs represented by set bits in the mask. If AD bits are enabled,
+ * clearing the dirty status will involve clearing the dirty bit on each SPTE
+ * or, if AD bits are not enabled, clearing the writable bit on each SPTE.
+ */
+void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
+				       struct kvm_memory_slot *slot,
+				       gfn_t gfn, unsigned long mask,
+				       bool wrprot)
+{
+	struct kvm_mmu_page *root;
+	int root_as_id;
+
+	lockdep_assert_held(&kvm->mmu_lock);
+	for_each_tdp_mmu_root(kvm, root) {
+		root_as_id = kvm_mmu_page_as_id(root);
+		if (root_as_id != slot->as_id)
+			continue;
+
+		clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot);
+	}
+}
+
+/*
+ * Set the dirty status of all the SPTEs mapping GFNs in the memslot. This is
+ * only used for PML, and so will involve setting the dirty bit on each SPTE.
+ * Returns true if an SPTE has been changed and the TLBs need to be flushed.
+ */
+static bool set_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
+				gfn_t start, gfn_t end)
+{
+	struct tdp_iter iter;
+	u64 new_spte;
+	bool spte_set = false;
+	int as_id = kvm_mmu_page_as_id(root);
+
+	for_each_tdp_pte_root(iter, root, start, end) {
+		if (!is_shadow_present_pte(iter.old_spte))
+			continue;
+
+		new_spte = iter.old_spte | shadow_dirty_mask;
+
+		*iter.sptep = new_spte;
+		handle_changed_spte(kvm, as_id, iter.gfn, iter.old_spte,
+				    new_spte, iter.level);
+		spte_set = true;
+
+		tdp_mmu_iter_cond_resched(kvm, &iter);
+	}
+
+	return spte_set;
+}
+
+/*
+ * Set the dirty status of all the SPTEs mapping GFNs in the memslot. This is
+ * only used for PML, and so will involve setting the dirty bit on each SPTE.
+ * Returns true if an SPTE has been changed and the TLBs need to be flushed.
+ */
+bool kvm_tdp_mmu_slot_set_dirty(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+	struct kvm_mmu_page *root;
+	int root_as_id;
+	bool spte_set = false;
+
+	for_each_tdp_mmu_root(kvm, root) {
+		root_as_id = kvm_mmu_page_as_id(root);
+		if (root_as_id != slot->as_id)
+			continue;
+
+		/*
+		 * Take a reference on the root so that it cannot be freed if
+		 * this thread releases the MMU lock and yields in this loop.
+		 */
+		get_tdp_mmu_root(kvm, root);
+
+		spte_set = set_dirty_gfn_range(kvm, root, slot->base_gfn,
+				slot->base_gfn + slot->npages) || spte_set;
+
+		put_tdp_mmu_root(kvm, root);
+	}
+	return spte_set;
+}
+
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 5a399aa60b8d8..2c9322ba3462b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -28,4 +28,14 @@ int kvm_tdp_mmu_test_age_hva(struct kvm *kvm, unsigned long hva);
 
 int kvm_tdp_mmu_set_spte_hva(struct kvm *kvm, unsigned long address,
 			     pte_t *host_ptep);
+
+bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm, struct kvm_memory_slot *slot,
+			     bool skip_4k);
+bool kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm,
+				  struct kvm_memory_slot *slot);
+void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
+				       struct kvm_memory_slot *slot,
+				       gfn_t gfn, unsigned long mask,
+				       bool wrprot);
+bool kvm_tdp_mmu_slot_set_dirty(struct kvm *kvm, struct kvm_memory_slot *slot);
 #endif /* __KVM_X86_MMU_TDP_MMU_H */
-- 
2.28.0.709.gb0816b6eb0-goog



* [PATCH 18/22] kvm: mmu: Support disabling dirty logging for the tdp MMU
  2020-09-25 21:22 [PATCH 00/22] Introduce the TDP MMU Ben Gardon
                   ` (16 preceding siblings ...)
  2020-09-25 21:22 ` [PATCH 17/22] kvm: mmu: Support dirty logging for the TDP MMU Ben Gardon
@ 2020-09-25 21:22 ` Ben Gardon
  2020-09-26  1:09   ` Paolo Bonzini
  2020-09-25 21:22 ` [PATCH 19/22] kvm: mmu: Support write protection for nesting in " Ben Gardon
                   ` (6 subsequent siblings)
  24 siblings, 1 reply; 105+ messages in thread
From: Ben Gardon @ 2020-09-25 21:22 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Cannon Matthews, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong,
	Ben Gardon

Dirty logging ultimately breaks down MMU mappings to 4k granularity.
When dirty logging is no longer needed, these granular mappings
represent a useless performance penalty. When dirty logging is disabled,
search the paging structure for mappings that could be re-constituted
into a large page mapping. Zap those mappings so that they can be
faulted in again at a higher mapping level.

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/mmu.c     |  3 ++
 arch/x86/kvm/mmu/tdp_mmu.c | 62 ++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.h |  2 ++
 3 files changed, 67 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b9074603f9df1..12892fc4f146d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6025,6 +6025,9 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 	spin_lock(&kvm->mmu_lock);
 	slot_handle_leaf(kvm, (struct kvm_memory_slot *)memslot,
 			 kvm_mmu_zap_collapsible_spte, true);
+
+	if (kvm->arch.tdp_mmu_enabled)
+		kvm_tdp_mmu_zap_collapsible_sptes(kvm, memslot);
 	spin_unlock(&kvm->mmu_lock);
 }
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index e5cb7f0ec23e8..a2895119655ac 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1099,3 +1099,65 @@ bool kvm_tdp_mmu_slot_set_dirty(struct kvm *kvm, struct kvm_memory_slot *slot)
 	return spte_set;
 }
 
+/*
+ * Clear non-leaf entries (and free associated page tables) which could
+ * be replaced by large mappings, for GFNs within the slot.
+ */
+static void zap_collapsible_spte_range(struct kvm *kvm,
+				       struct kvm_mmu_page *root,
+				       gfn_t start, gfn_t end)
+{
+	struct tdp_iter iter;
+	kvm_pfn_t pfn;
+	bool spte_set = false;
+	int as_id = kvm_mmu_page_as_id(root);
+
+	for_each_tdp_pte_root(iter, root, start, end) {
+		if (!is_shadow_present_pte(iter.old_spte) ||
+		    is_last_spte(iter.old_spte, iter.level))
+			continue;
+
+		pfn = spte_to_pfn(iter.old_spte);
+		if (kvm_is_reserved_pfn(pfn) ||
+		    !PageTransCompoundMap(pfn_to_page(pfn)))
+			continue;
+
+		*iter.sptep = 0;
+		handle_changed_spte(kvm, as_id, iter.gfn, iter.old_spte,
+				    0, iter.level);
+		spte_set = true;
+
+		spte_set = !tdp_mmu_iter_cond_resched(kvm, &iter);
+	}
+
+	if (spte_set)
+		kvm_flush_remote_tlbs(kvm);
+}
+
+/*
+ * Clear non-leaf entries (and free associated page tables) which could
+ * be replaced by large mappings, for GFNs within the slot.
+ */
+void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
+				       const struct kvm_memory_slot *slot)
+{
+	struct kvm_mmu_page *root;
+	int root_as_id;
+
+	for_each_tdp_mmu_root(kvm, root) {
+		root_as_id = kvm_mmu_page_as_id(root);
+		if (root_as_id != slot->as_id)
+			continue;
+
+		/*
+		 * Take a reference on the root so that it cannot be freed if
+		 * this thread releases the MMU lock and yields in this loop.
+		 */
+		get_tdp_mmu_root(kvm, root);
+
+		zap_collapsible_spte_range(kvm, root, slot->base_gfn,
+					   slot->base_gfn + slot->npages);
+
+		put_tdp_mmu_root(kvm, root);
+	}
+}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 2c9322ba3462b..10e70699c5372 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -38,4 +38,6 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
 				       gfn_t gfn, unsigned long mask,
 				       bool wrprot);
 bool kvm_tdp_mmu_slot_set_dirty(struct kvm *kvm, struct kvm_memory_slot *slot);
+void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
+				       const struct kvm_memory_slot *slot);
 #endif /* __KVM_X86_MMU_TDP_MMU_H */
-- 
2.28.0.709.gb0816b6eb0-goog



* [PATCH 19/22] kvm: mmu: Support write protection for nesting in tdp MMU
  2020-09-25 21:22 [PATCH 00/22] Introduce the TDP MMU Ben Gardon
                   ` (17 preceding siblings ...)
  2020-09-25 21:22 ` [PATCH 18/22] kvm: mmu: Support disabling dirty logging for the tdp MMU Ben Gardon
@ 2020-09-25 21:22 ` Ben Gardon
  2020-09-30 18:06   ` Sean Christopherson
  2020-09-25 21:23 ` [PATCH 20/22] kvm: mmu: NX largepage recovery for TDP MMU Ben Gardon
                   ` (5 subsequent siblings)
  24 siblings, 1 reply; 105+ messages in thread
From: Ben Gardon @ 2020-09-25 21:22 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Cannon Matthews, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong,
	Ben Gardon

To support nested virtualization, KVM will sometimes need to write
protect pages which are part of a shadowed paging structure or are not
writable in the shadowed paging structure. Add a function to write
protect GFN mappings for this purpose.
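
As a rough userspace illustration of the write-protect step described
above (not the KVM implementation), the sketch below clears both a
hardware-writable bit and a software "MMU writeable" bit in a single
64-bit entry. The bit positions and the name toy_write_protect() are
assumptions made for the example; the real masks are defined in KVM's
MMU headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions only, chosen for this example. */
#define TOY_WRITABLE_MASK      (1ULL << 1)
#define TOY_MMU_WRITEABLE_MASK (1ULL << 61)

/*
 * Hypothetical helper: clear the hardware-writable bit and the software
 * "MMU writeable" bit of one entry. Returns true if the entry changed,
 * which is the case in which a caller would need a TLB flush.
 */
static inline bool toy_write_protect(uint64_t *spte)
{
	assert(spte);

	/* Already read-only: nothing to change, no flush needed. */
	if (!(*spte & TOY_WRITABLE_MASK))
		return false;

	*spte &= ~(TOY_WRITABLE_MASK | TOY_MMU_WRITEABLE_MASK);
	return true;
}
```

Clearing the software bit along with the hardware bit is what keeps a
later fault from simply restoring write access, so writes to the GFN
continue to be intercepted.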

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/mmu.c     |  5 ++++
 arch/x86/kvm/mmu/tdp_mmu.c | 57 ++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.h |  3 ++
 3 files changed, 65 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 12892fc4f146d..e6f5093ba8f6f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1667,6 +1667,11 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 		write_protected |= __rmap_write_protect(kvm, rmap_head, true);
 	}
 
+	if (kvm->arch.tdp_mmu_enabled)
+		write_protected =
+			kvm_tdp_mmu_write_protect_gfn(kvm, slot, gfn) ||
+			write_protected;
+
 	return write_protected;
 }
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index a2895119655ac..931cb469b1f2f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1161,3 +1161,60 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
 		put_tdp_mmu_root(kvm, root);
 	}
 }
+
+/*
+ * Removes write access on the last level SPTE mapping this GFN and unsets the
+ * SPTE_MMU_WRITEABLE bit to ensure future writes continue to be intercepted.
+ * Returns true if an SPTE was set and a TLB flush is needed.
+ */
+static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root,
+			      gfn_t gfn)
+{
+	struct tdp_iter iter;
+	u64 new_spte;
+	bool spte_set = false;
+	int as_id = kvm_mmu_page_as_id(root);
+
+	for_each_tdp_pte_root(iter, root, gfn, gfn + 1) {
+		if (!is_shadow_present_pte(iter.old_spte) ||
+		    !is_last_spte(iter.old_spte, iter.level))
+			continue;
+
+		if (!is_writable_pte(iter.old_spte))
+			break;
+
+		new_spte = iter.old_spte &
+			~(PT_WRITABLE_MASK | SPTE_MMU_WRITEABLE);
+
+		*iter.sptep = new_spte;
+		handle_changed_spte(kvm, as_id, iter.gfn, iter.old_spte,
+				    new_spte, iter.level);
+		spte_set = true;
+	}
+
+	return spte_set;
+}
+
+/*
+ * Removes write access on the last level SPTE mapping this GFN and unsets the
+ * SPTE_MMU_WRITEABLE bit to ensure future writes continue to be intercepted.
+ * Returns true if an SPTE was set and a TLB flush is needed.
+ */
+bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
+				   struct kvm_memory_slot *slot, gfn_t gfn)
+{
+	struct kvm_mmu_page *root;
+	int root_as_id;
+	bool spte_set = false;
+
+	lockdep_assert_held(&kvm->mmu_lock);
+	for_each_tdp_mmu_root(kvm, root) {
+		root_as_id = kvm_mmu_page_as_id(root);
+		if (root_as_id != slot->as_id)
+			continue;
+
+		spte_set = write_protect_gfn(kvm, root, gfn) || spte_set;
+	}
+	return spte_set;
+}
+
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 10e70699c5372..2ecb047211a6d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -40,4 +40,7 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
 bool kvm_tdp_mmu_slot_set_dirty(struct kvm *kvm, struct kvm_memory_slot *slot);
 void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				       const struct kvm_memory_slot *slot);
+
+bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
+				   struct kvm_memory_slot *slot, gfn_t gfn);
 #endif /* __KVM_X86_MMU_TDP_MMU_H */
-- 
2.28.0.709.gb0816b6eb0-goog



* [PATCH 20/22] kvm: mmu: NX largepage recovery for TDP MMU
  2020-09-25 21:22 [PATCH 00/22] Introduce the TDP MMU Ben Gardon
                   ` (18 preceding siblings ...)
  2020-09-25 21:22 ` [PATCH 19/22] kvm: mmu: Support write protection for nesting in " Ben Gardon
@ 2020-09-25 21:23 ` Ben Gardon
  2020-09-26  1:14   ` Paolo Bonzini
                     ` (2 more replies)
  2020-09-25 21:23 ` [PATCH 21/22] kvm: mmu: Support MMIO in the " Ben Gardon
                   ` (4 subsequent siblings)
  24 siblings, 3 replies; 105+ messages in thread
From: Ben Gardon @ 2020-09-25 21:23 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Cannon Matthews, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong,
	Ben Gardon

When KVM maps a largepage backed region at a lower level in order to
make it executable (i.e. NX large page shattering), it reduces the TLB
performance of that region. In order to avoid making this degradation
permanent, KVM must periodically reclaim shattered NX largepages by
zapping them and allowing them to be rebuilt in the page fault handler.

With this patch, the TDP MMU does not respect KVM's rate limiting on
reclaim. It traverses the entire TDP structure every time. This will be
addressed in a future patch.
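
For reference, the recovery function in the diff below derives a
per-pass zap budget from nx_huge_pages_recovery_ratio and the count of
NX large page splits. A minimal userspace sketch of that arithmetic
follows; nx_recovery_budget() is a hypothetical helper name, and
DIV_ROUND_UP() is redefined locally as a stand-in for the kernel macro.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's DIV_ROUND_UP() macro. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Hypothetical helper: number of shattered pages to zap in one recovery
 * pass. A ratio of 0 disables recovery, mirroring the check in the patch.
 */
static inline uint64_t nx_recovery_budget(uint64_t nx_lpage_splits,
					  unsigned int ratio)
{
	return ratio ? DIV_ROUND_UP(nx_lpage_splits, ratio) : 0;
}
```

With the default ratio of 60, 120 outstanding splits give a budget of 2
pages per pass, and any non-zero split count gives a budget of at least
1 because the division rounds up.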

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/include/asm/kvm_host.h |  3 ++
 arch/x86/kvm/mmu/mmu.c          | 27 +++++++++++---
 arch/x86/kvm/mmu/mmu_internal.h |  4 ++
 arch/x86/kvm/mmu/tdp_mmu.c      | 66 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.h      |  2 +
 5 files changed, 97 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a76bcb51d43d8..cf00b1c837708 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -963,6 +963,7 @@ struct kvm_arch {
 
 	struct kvm_pmu_event_filter *pmu_event_filter;
 	struct task_struct *nx_lpage_recovery_thread;
+	struct task_struct *nx_lpage_tdp_mmu_recovery_thread;
 
 	/*
 	 * Whether the TDP MMU is enabled for this VM. This contains a
@@ -977,6 +978,8 @@ struct kvm_arch {
 	struct list_head tdp_mmu_roots;
 	/* List of struct tdp_mmu_pages not being used as roots */
 	struct list_head tdp_mmu_pages;
+	struct list_head tdp_mmu_lpage_disallowed_pages;
+	u64 tdp_mmu_lpage_disallowed_page_count;
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e6f5093ba8f6f..6101c696e92d3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -54,12 +54,12 @@
 
 extern bool itlb_multihit_kvm_mitigation;
 
-static int __read_mostly nx_huge_pages = -1;
+int __read_mostly nx_huge_pages = -1;
 #ifdef CONFIG_PREEMPT_RT
 /* Recovery can cause latency spikes, disable it for PREEMPT_RT.  */
-static uint __read_mostly nx_huge_pages_recovery_ratio = 0;
+uint __read_mostly nx_huge_pages_recovery_ratio = 0;
 #else
-static uint __read_mostly nx_huge_pages_recovery_ratio = 60;
+uint __read_mostly nx_huge_pages_recovery_ratio = 60;
 #endif
 
 static int set_nx_huge_pages(const char *val, const struct kernel_param *kp);
@@ -6455,7 +6455,7 @@ static long get_nx_lpage_recovery_timeout(u64 start_time)
 		: MAX_SCHEDULE_TIMEOUT;
 }
 
-static int kvm_nx_lpage_recovery_worker(struct kvm *kvm, uintptr_t data)
+static int kvm_nx_lpage_recovery_worker(struct kvm *kvm, uintptr_t tdp_mmu)
 {
 	u64 start_time;
 	long remaining_time;
@@ -6476,7 +6476,10 @@ static int kvm_nx_lpage_recovery_worker(struct kvm *kvm, uintptr_t data)
 		if (kthread_should_stop())
 			return 0;
 
-		kvm_recover_nx_lpages(kvm);
+		if (tdp_mmu)
+			kvm_tdp_mmu_recover_nx_lpages(kvm);
+		else
+			kvm_recover_nx_lpages(kvm);
 	}
 }
 
@@ -6489,6 +6492,17 @@ int kvm_mmu_post_init_vm(struct kvm *kvm)
 					  &kvm->arch.nx_lpage_recovery_thread);
 	if (!err)
 		kthread_unpark(kvm->arch.nx_lpage_recovery_thread);
+	else
+		return err;
+
+	if (!kvm->arch.tdp_mmu_enabled)
+		return err;
+
+	err = kvm_vm_create_worker_thread(kvm, kvm_nx_lpage_recovery_worker, 1,
+			"kvm-nx-lpage-tdp-mmu-recovery",
+			&kvm->arch.nx_lpage_tdp_mmu_recovery_thread);
+	if (!err)
+		kthread_unpark(kvm->arch.nx_lpage_tdp_mmu_recovery_thread);
 
 	return err;
 }
@@ -6497,4 +6511,7 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 {
 	if (kvm->arch.nx_lpage_recovery_thread)
 		kthread_stop(kvm->arch.nx_lpage_recovery_thread);
+
+	if (kvm->arch.nx_lpage_tdp_mmu_recovery_thread)
+		kthread_stop(kvm->arch.nx_lpage_tdp_mmu_recovery_thread);
 }
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 1a777ccfde44e..567e119da424f 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -43,6 +43,7 @@ struct kvm_mmu_page {
 	atomic_t write_flooding_count;
 
 	bool tdp_mmu_page;
+	u64 *parent_sptep;
 };
 
 extern struct kmem_cache *mmu_page_header_cache;
@@ -154,4 +155,7 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 u64 mark_spte_for_access_track(u64 spte);
 u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn);
 
+extern int nx_huge_pages;
+extern uint nx_huge_pages_recovery_ratio;
+
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 931cb469b1f2f..b83c18e29f9c6 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -578,10 +578,18 @@ int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, int write, int map_writable,
 			new_spte = make_nonleaf_spte(child_pt,
 						     !shadow_accessed_mask);
 
+			if (iter.level <= max_level &&
+			    account_disallowed_nx_lpage) {
+				list_add(&sp->lpage_disallowed_link,
+					 &vcpu->kvm->arch.tdp_mmu_lpage_disallowed_pages);
+				vcpu->kvm->arch.tdp_mmu_lpage_disallowed_page_count++;
+			}
+
 			*iter.sptep = new_spte;
 			handle_changed_spte(vcpu->kvm, as_id, iter.gfn,
 					    iter.old_spte, new_spte,
 					    iter.level);
+			sp->parent_sptep = iter.sptep;
 		}
 	}
 
@@ -1218,3 +1226,61 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
 	return spte_set;
 }
 
+/*
+ * Clear non-leaf SPTEs and free the page tables they point to, if those SPTEs
+ * exist in order to allow execute access on a region that would otherwise be
+ * mapped as a large page.
+ */
+void kvm_tdp_mmu_recover_nx_lpages(struct kvm *kvm)
+{
+	struct kvm_mmu_page *sp;
+	bool flush;
+	int rcu_idx;
+	unsigned int ratio;
+	ulong to_zap;
+	u64 old_spte;
+
+	rcu_idx = srcu_read_lock(&kvm->srcu);
+	spin_lock(&kvm->mmu_lock);
+
+	ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
+	to_zap = ratio ? DIV_ROUND_UP(kvm->stat.nx_lpage_splits, ratio) : 0;
+
+	while (to_zap &&
+	       !list_empty(&kvm->arch.tdp_mmu_lpage_disallowed_pages)) {
+		/*
+		 * We use a separate list instead of just using active_mmu_pages
+		 * because the number of lpage_disallowed pages is expected to
+		 * be relatively small compared to the total.
+		 */
+		sp = list_first_entry(&kvm->arch.tdp_mmu_lpage_disallowed_pages,
+				      struct kvm_mmu_page,
+				      lpage_disallowed_link);
+
+		old_spte = *sp->parent_sptep;
+		*sp->parent_sptep = 0;
+
+		list_del(&sp->lpage_disallowed_link);
+		kvm->arch.tdp_mmu_lpage_disallowed_page_count--;
+
+		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), sp->gfn,
+				    old_spte, 0, sp->role.level + 1);
+
+		flush = true;
+
+		if (!--to_zap || need_resched() ||
+		    spin_needbreak(&kvm->mmu_lock)) {
+			flush = false;
+			kvm_flush_remote_tlbs(kvm);
+			if (to_zap)
+				cond_resched_lock(&kvm->mmu_lock);
+		}
+	}
+
+	if (flush)
+		kvm_flush_remote_tlbs(kvm);
+
+	spin_unlock(&kvm->mmu_lock);
+	srcu_read_unlock(&kvm->srcu, rcu_idx);
+}
+
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 2ecb047211a6d..45ea2d44545db 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -43,4 +43,6 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
 
 bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
 				   struct kvm_memory_slot *slot, gfn_t gfn);
+
+void kvm_tdp_mmu_recover_nx_lpages(struct kvm *kvm);
 #endif /* __KVM_X86_MMU_TDP_MMU_H */
-- 
2.28.0.709.gb0816b6eb0-goog



* [PATCH 21/22] kvm: mmu: Support MMIO in the TDP MMU
  2020-09-25 21:22 [PATCH 00/22] Introduce the TDP MMU Ben Gardon
                   ` (19 preceding siblings ...)
  2020-09-25 21:23 ` [PATCH 20/22] kvm: mmu: NX largepage recovery for TDP MMU Ben Gardon
@ 2020-09-25 21:23 ` Ben Gardon
  2020-09-30 18:19   ` Sean Christopherson
  2020-09-25 21:23 ` [PATCH 22/22] kvm: mmu: Don't clear write flooding count for direct roots Ben Gardon
                   ` (3 subsequent siblings)
  24 siblings, 1 reply; 105+ messages in thread
From: Ben Gardon @ 2020-09-25 21:23 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Cannon Matthews, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong,
	Ben Gardon

In order to support MMIO, KVM must be able to walk the TDP paging
structures to find mappings for a given GFN. Support this walk for
the TDP MMU.
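
The walk contract used by this patch (record the entry seen at each
level, return the level of the lowest entry recorded, which may be
non-present) can be sketched in userspace as follows. This is a toy
model under stated assumptions: toy_get_walk(), TOY_ROOT_LEVEL and
TOY_PRESENT are hypothetical names, the "page table" is a flat array
holding the one entry visited per level, and large-page leaves are not
modelled.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_ROOT_LEVEL 4
#define TOY_PRESENT    (1ULL << 0)

/*
 * Hypothetical walk: path[level - 1] is the entry that would be visited
 * at that level for some address. Copy entries into sptes[] from the
 * root downwards, stop after the first non-present entry, and return
 * the level of the lowest entry recorded.
 */
static inline int toy_get_walk(const uint64_t *path, uint64_t *sptes)
{
	int level, leaf = TOY_ROOT_LEVEL;

	assert(path && sptes);

	for (level = TOY_ROOT_LEVEL; level >= 1; level--) {
		leaf = level;
		sptes[level - 1] = path[level - 1];

		/* A non-present entry has nothing below it to walk. */
		if (!(path[level - 1] & TOY_PRESENT))
			break;
	}

	return leaf;
}
```

A caller can then inspect sptes[root - 1] down to sptes[leaf - 1] for
reserved bits without holding a reference to the iterator that produced
them, which is what lets the same checking code serve both MMUs.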

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/mmu.c     | 70 ++++++++++++++++++++++++++------------
 arch/x86/kvm/mmu/tdp_mmu.c | 17 +++++++++
 arch/x86/kvm/mmu/tdp_mmu.h |  2 ++
 3 files changed, 68 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6101c696e92d3..0ce7720a72d4e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3939,54 +3939,82 @@ static bool mmio_info_in_cache(struct kvm_vcpu *vcpu, u64 addr, bool direct)
 	return vcpu_match_mmio_gva(vcpu, addr);
 }
 
-/* return true if reserved bit is detected on spte. */
-static bool
-walk_shadow_page_get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
+/*
+ * Return the level of the lowest level SPTE added to sptes.
+ * That SPTE may be non-present.
+ */
+static int get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes)
 {
 	struct kvm_shadow_walk_iterator iterator;
-	u64 sptes[PT64_ROOT_MAX_LEVEL], spte = 0ull;
-	struct rsvd_bits_validate *rsvd_check;
-	int root, leaf;
-	bool reserved = false;
+	int leaf = vcpu->arch.mmu->root_level;
+	u64 spte;
 
-	rsvd_check = &vcpu->arch.mmu->shadow_zero_check;
 
 	walk_shadow_page_lockless_begin(vcpu);
 
-	for (shadow_walk_init(&iterator, vcpu, addr),
-		 leaf = root = iterator.level;
+	for (shadow_walk_init(&iterator, vcpu, addr);
 	     shadow_walk_okay(&iterator);
 	     __shadow_walk_next(&iterator, spte)) {
+		leaf = iterator.level;
 		spte = mmu_spte_get_lockless(iterator.sptep);
 
 		sptes[leaf - 1] = spte;
-		leaf--;
 
 		if (!is_shadow_present_pte(spte))
 			break;
 
+	}
+
+	walk_shadow_page_lockless_end(vcpu);
+
+	return leaf;
+}
+
+/* return true if reserved bit is detected on spte. */
+static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
+{
+	u64 sptes[PT64_ROOT_MAX_LEVEL];
+	struct rsvd_bits_validate *rsvd_check;
+	int root = vcpu->arch.mmu->shadow_root_level;
+	int leaf;
+	int level;
+	bool reserved = false;
+
+	if (!VALID_PAGE(vcpu->arch.mmu->root_hpa)) {
+		*sptep = 0ull;
+		return reserved;
+	}
+
+	if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
+		leaf = kvm_tdp_mmu_get_walk(vcpu, addr, sptes);
+	else
+		leaf = get_walk(vcpu, addr, sptes);
+
+	rsvd_check = &vcpu->arch.mmu->shadow_zero_check;
+
+	for (level = root; level >= leaf; level--) {
+		if (!is_shadow_present_pte(sptes[level - 1]))
+			break;
 		/*
 		 * Use a bitwise-OR instead of a logical-OR to aggregate the
 		 * reserved bit and EPT's invalid memtype/XWR checks to avoid
 		 * adding a Jcc in the loop.
 		 */
-		reserved |= __is_bad_mt_xwr(rsvd_check, spte) |
-			    __is_rsvd_bits_set(rsvd_check, spte, iterator.level);
+		reserved |= __is_bad_mt_xwr(rsvd_check, sptes[level - 1]) |
+			    __is_rsvd_bits_set(rsvd_check, sptes[level - 1],
+					       level);
 	}
 
-	walk_shadow_page_lockless_end(vcpu);
-
 	if (reserved) {
 		pr_err("%s: detect reserved bits on spte, addr 0x%llx, dump hierarchy:\n",
 		       __func__, addr);
-		while (root > leaf) {
+		for (level = root; level >= leaf; level--)
 			pr_err("------ spte 0x%llx level %d.\n",
-			       sptes[root - 1], root);
-			root--;
-		}
+			       sptes[level - 1], level);
 	}
 
-	*sptep = spte;
+	*sptep = sptes[leaf - 1];
+
 	return reserved;
 }
 
@@ -3998,7 +4026,7 @@ static int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr, bool direct)
 	if (mmio_info_in_cache(vcpu, addr, direct))
 		return RET_PF_EMULATE;
 
-	reserved = walk_shadow_page_get_mmio_spte(vcpu, addr, &spte);
+	reserved = get_mmio_spte(vcpu, addr, &spte);
 	if (WARN_ON(reserved))
 		return -EINVAL;
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b83c18e29f9c6..42dde27decd75 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1284,3 +1284,20 @@ void kvm_tdp_mmu_recover_nx_lpages(struct kvm *kvm)
 	srcu_read_unlock(&kvm->srcu, rcu_idx);
 }
 
+/*
+ * Return the level of the lowest level SPTE added to sptes.
+ * That SPTE may be non-present.
+ */
+int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes)
+{
+	struct tdp_iter iter;
+	int leaf = vcpu->arch.mmu->shadow_root_level;
+	gfn_t gfn = addr >> PAGE_SHIFT;
+
+	for_each_tdp_pte_vcpu(iter, vcpu, gfn, gfn + 1) {
+		leaf = iter.level;
+		sptes[leaf - 1] = iter.old_spte;
+	}
+
+	return leaf;
+}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 45ea2d44545db..cc0b7241975aa 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -45,4 +45,6 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
 				   struct kvm_memory_slot *slot, gfn_t gfn);
 
 void kvm_tdp_mmu_recover_nx_lpages(struct kvm *kvm);
+
+int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes);
 #endif /* __KVM_X86_MMU_TDP_MMU_H */
-- 
2.28.0.709.gb0816b6eb0-goog



* [PATCH 22/22] kvm: mmu: Don't clear write flooding count for direct roots
  2020-09-25 21:22 [PATCH 00/22] Introduce the TDP MMU Ben Gardon
                   ` (20 preceding siblings ...)
  2020-09-25 21:23 ` [PATCH 21/22] kvm: mmu: Support MMIO in the " Ben Gardon
@ 2020-09-25 21:23 ` Ben Gardon
  2020-09-26  1:25   ` Paolo Bonzini
  2020-09-26  1:14 ` [PATCH 00/22] Introduce the TDP MMU Paolo Bonzini
                   ` (2 subsequent siblings)
  24 siblings, 1 reply; 105+ messages in thread
From: Ben Gardon @ 2020-09-25 21:23 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Cannon Matthews, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong,
	Ben Gardon

Direct roots don't have a write flooding count because the guest can't
affect that paging structure. Thus there's no need to clear the write
flooding count on a fast CR3 switch for direct roots.

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Signed-off-by: Ben Gardon <bgardon@google.com>
---
 arch/x86/kvm/mmu/mmu.c     | 15 +++++++++++----
 arch/x86/kvm/mmu/tdp_mmu.c | 12 ++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.h |  2 ++
 3 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0ce7720a72d4e..345c934fabf4c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4267,7 +4267,8 @@ static void nonpaging_init_context(struct kvm_vcpu *vcpu,
 	context->nx = false;
 }
 
-static inline bool is_root_usable(struct kvm_mmu_root_info *root, gpa_t pgd,
+static inline bool is_root_usable(struct kvm *kvm,
+				  struct kvm_mmu_root_info *root, gpa_t pgd,
 				  union kvm_mmu_page_role role)
 {
 	return (role.direct || pgd == root->pgd) &&
@@ -4293,13 +4294,13 @@ static bool cached_root_available(struct kvm_vcpu *vcpu, gpa_t new_pgd,
 	root.pgd = mmu->root_pgd;
 	root.hpa = mmu->root_hpa;
 
-	if (is_root_usable(&root, new_pgd, new_role))
+	if (is_root_usable(vcpu->kvm, &root, new_pgd, new_role))
 		return true;
 
 	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
 		swap(root, mmu->prev_roots[i]);
 
-		if (is_root_usable(&root, new_pgd, new_role))
+		if (is_root_usable(vcpu->kvm, &root, new_pgd, new_role))
 			break;
 	}
 
@@ -4356,7 +4357,13 @@ static void __kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd,
 	 */
 	vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
 
-	__clear_sp_write_flooding_count(to_shadow_page(vcpu->arch.mmu->root_hpa));
+	/*
+	 * If this is a direct root page, it doesn't have a write flooding
+	 * count. Otherwise, clear the write flooding count.
+	 */
+	if (!new_role.direct)
+		__clear_sp_write_flooding_count(
+				to_shadow_page(vcpu->arch.mmu->root_hpa));
 }
 
 void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd, bool skip_tlb_flush,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 42dde27decd75..c07831b0c73e1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -124,6 +124,18 @@ static struct kvm_mmu_page *find_tdp_mmu_root_with_role(
 	return NULL;
 }
 
+hpa_t kvm_tdp_mmu_root_hpa_for_role(struct kvm *kvm,
+				    union kvm_mmu_page_role role)
+{
+	struct kvm_mmu_page *root;
+
+	root = find_tdp_mmu_root_with_role(kvm, role);
+	if (root)
+		return __pa(root->spt);
+
+	return INVALID_PAGE;
+}
+
 static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
 						   int level)
 {
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index cc0b7241975aa..2395ffa71bb05 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -9,6 +9,8 @@ void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
 void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
 
 bool is_tdp_mmu_root(struct kvm *kvm, hpa_t root);
+hpa_t kvm_tdp_mmu_root_hpa_for_role(struct kvm *kvm,
+				    union kvm_mmu_page_role role);
 hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
 void kvm_tdp_mmu_put_root_hpa(struct kvm *kvm, hpa_t root_hpa);
 
-- 
2.28.0.709.gb0816b6eb0-goog


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 02/22] kvm: mmu: Introduce tdp_iter
  2020-09-25 21:22 ` [PATCH 02/22] kvm: mmu: Introduce tdp_iter Ben Gardon
@ 2020-09-26  0:04   ` Paolo Bonzini
  2020-09-30  5:06     ` Sean Christopherson
  2020-09-26  0:54   ` Paolo Bonzini
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-26  0:04 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Cannon Matthews, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 25/09/20 23:22, Ben Gardon wrote:
>  EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
>  
> -static bool is_mmio_spte(u64 spte)
> +bool is_mmio_spte(u64 spte)
>  {
>  	return (spte & SPTE_SPECIAL_MASK) == SPTE_MMIO_MASK;
>  }
> @@ -623,7 +612,7 @@ static int is_nx(struct kvm_vcpu *vcpu)
>  	return vcpu->arch.efer & EFER_NX;
>  }
>  
> -static int is_shadow_present_pte(u64 pte)
> +int is_shadow_present_pte(u64 pte)
>  {
>  	return (pte != 0) && !is_mmio_spte(pte);
>  }
> @@ -633,7 +622,7 @@ static int is_large_pte(u64 pte)
>  	return pte & PT_PAGE_SIZE_MASK;
>  }
>  
> -static int is_last_spte(u64 pte, int level)
> +int is_last_spte(u64 pte, int level)
>  {
>  	if (level == PG_LEVEL_4K)
>  		return 1;
> @@ -647,7 +636,7 @@ static bool is_executable_pte(u64 spte)
>  	return (spte & (shadow_x_mask | shadow_nx_mask)) == shadow_x_mask;
>  }
>  
> -static kvm_pfn_t spte_to_pfn(u64 pte)
> +kvm_pfn_t spte_to_pfn(u64 pte)
>  {
>  	return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
>  }

Should these be inlines in mmu_internal.h instead?

Paolo


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 03/22] kvm: mmu: Init / Uninit the TDP MMU
  2020-09-25 21:22 ` [PATCH 03/22] kvm: mmu: Init / Uninit the TDP MMU Ben Gardon
@ 2020-09-26  0:06   ` Paolo Bonzini
  2020-09-30  5:34   ` Sean Christopherson
  2020-09-30 16:57   ` Sean Christopherson
  2 siblings, 0 replies; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-26  0:06 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Cannon Matthews, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 25/09/20 23:22, Ben Gardon wrote:
> +static bool __read_mostly tdp_mmu_enabled = true;
> +module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0644);
> +

This would need some custom callbacks to avoid the warning in
is_tdp_mmu_enabled().

Paolo


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 07/22] kvm: mmu: Support zapping SPTEs in the TDP MMU
  2020-09-25 21:22 ` [PATCH 07/22] kvm: mmu: Support zapping SPTEs in the TDP MMU Ben Gardon
@ 2020-09-26  0:14   ` Paolo Bonzini
  2020-09-30  6:15   ` Sean Christopherson
  1 sibling, 0 replies; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-26  0:14 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Cannon Matthews, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 25/09/20 23:22, Ben Gardon wrote:
> +/*
> + * If the MMU lock is contended or this thread needs to yield, flushes
> + * the TLBs, releases the MMU lock, yields, reacquires the MMU lock,
> + * restarts the tdp_iter's walk from the root, and returns true.
> + * If no yield is needed, returns false.
> + */

The comment is not really necessary. :)

Paolo

> +static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
> +{
> +	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
> +		kvm_flush_remote_tlbs(kvm);
> +		cond_resched_lock(&kvm->mmu_lock);
> +		tdp_iter_refresh_walk(iter);
> +		return true;
> +	} else {
> +		return false;
> +	}


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 11/22] kvm: mmu: Factor out allocating a new tdp_mmu_page
  2020-09-25 21:22 ` [PATCH 11/22] kvm: mmu: Factor out allocating a new tdp_mmu_page Ben Gardon
@ 2020-09-26  0:22   ` Paolo Bonzini
  2020-09-30 18:53     ` Ben Gardon
  0 siblings, 1 reply; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-26  0:22 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Cannon Matthews, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 25/09/20 23:22, Ben Gardon wrote:
> Move the code to allocate a struct kvm_mmu_page for the TDP MMU out of the
> root allocation code to support allocating a struct kvm_mmu_page for every
> page of page table memory used by the TDP MMU, in the next commit.
> 
> Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
> machine. This series introduced no new failures.
> 
> This series can be viewed in Gerrit at:
> 	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

Maybe worth squashing into the earlier patch.

Paolo


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 10/22] kvm: mmu: Add TDP MMU PF handler
  2020-09-25 21:22 ` [PATCH 10/22] kvm: mmu: Add TDP MMU PF handler Ben Gardon
@ 2020-09-26  0:24   ` Paolo Bonzini
  2020-09-30 16:37   ` Sean Christopherson
  1 sibling, 0 replies; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-26  0:24 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Cannon Matthews, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 25/09/20 23:22, Ben Gardon wrote:
>  
> -static bool is_nx_huge_page_enabled(void)
> +bool is_nx_huge_page_enabled(void)
>  {
>  	return READ_ONCE(nx_huge_pages);
>  }
> @@ -381,7 +361,7 @@ static inline u64 spte_shadow_dirty_mask(u64 spte)
>  	return spte_ad_enabled(spte) ? shadow_dirty_mask : 0;
>  }
>  
> -static inline bool is_access_track_spte(u64 spte)
> +inline bool is_access_track_spte(u64 spte)
>  {
>  	return !spte_ad_enabled(spte) && (spte & shadow_acc_track_mask) == 0;
>  }
> @@ -433,7 +413,7 @@ static u64 get_mmio_spte_generation(u64 spte)
>  	return gen;
>  }
>  
> -static u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
> +u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
>  {
>  
>  	u64 gen = kvm_vcpu_memslots(vcpu)->generation & MMIO_SPTE_GEN_MASK;
> @@ -613,7 +593,7 @@ int is_shadow_present_pte(u64 pte)
>  	return (pte != 0) && !is_mmio_spte(pte);
>  }
>  
> -static int is_large_pte(u64 pte)
> +int is_large_pte(u64 pte)
>  {
>  	return pte & PT_PAGE_SIZE_MASK;
>  }

All candidates for inlining too

(Also probably we'll create a common.c file for stuff that is common to
the shadow and TDP MMU, but that can come later).

Paolo


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 14/22] kvm: mmu: Add access tracking for tdp_mmu
  2020-09-25 21:22 ` [PATCH 14/22] kvm: mmu: Add access tracking for tdp_mmu Ben Gardon
@ 2020-09-26  0:32   ` Paolo Bonzini
  2020-09-30 17:48   ` Sean Christopherson
  1 sibling, 0 replies; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-26  0:32 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Cannon Matthews, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 25/09/20 23:22, Ben Gardon wrote:
> @@ -332,7 +331,7 @@ static inline bool kvm_vcpu_ad_need_write_protect(struct kvm_vcpu *vcpu)
>  	return vcpu->arch.mmu == &vcpu->arch.guest_mmu;
>  }
>  
> -static inline bool spte_ad_enabled(u64 spte)
> +inline bool spte_ad_enabled(u64 spte)
>  {
>  	MMU_WARN_ON(is_mmio_spte(spte));
>  	return (spte & SPTE_SPECIAL_MASK) != SPTE_AD_DISABLED_MASK;
> @@ -607,7 +606,7 @@ int is_last_spte(u64 pte, int level)
>  	return 0;
>  }
>  
> -static bool is_executable_pte(u64 spte)
> +bool is_executable_pte(u64 spte)
>  {
>  	return (spte & (shadow_x_mask | shadow_nx_mask)) == shadow_x_mask;
>  }
> @@ -791,7 +790,7 @@ static bool spte_has_volatile_bits(u64 spte)
>  	return false;
>  }
>  
> -static bool is_accessed_spte(u64 spte)
> +bool is_accessed_spte(u64 spte)
>  {
>  	u64 accessed_mask = spte_shadow_accessed_mask(spte);
>  
> @@ -941,7 +940,7 @@ static u64 mmu_spte_get_lockless(u64 *sptep)
>  	return __get_spte_lockless(sptep);
>  }
>  
> -static u64 mark_spte_for_access_track(u64 spte)
> +u64 mark_spte_for_access_track(u64 spte)
>  {
>  	if (spte_ad_enabled(spte))
>  		return spte & ~shadow_accessed_mask;

More candidates for inlining, of course.

Paolo


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 15/22] kvm: mmu: Support changed pte notifier in tdp MMU
  2020-09-25 21:22 ` [PATCH 15/22] kvm: mmu: Support changed pte notifier in tdp MMU Ben Gardon
@ 2020-09-26  0:33   ` Paolo Bonzini
  2020-09-28 15:11   ` Paolo Bonzini
  1 sibling, 0 replies; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-26  0:33 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Cannon Matthews, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 25/09/20 23:22, Ben Gardon wrote:
> @@ -1708,6 +1695,21 @@ static int kvm_unmap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
>  	return kvm_zap_rmapp(kvm, rmap_head);
>  }
>  
> +u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn)
> +{
> +	u64 new_spte;
> +
> +	new_spte = old_spte & ~PT64_BASE_ADDR_MASK;
> +	new_spte |= (u64)new_pfn << PAGE_SHIFT;
> +
> +	new_spte &= ~PT_WRITABLE_MASK;
> +	new_spte &= ~SPTE_HOST_WRITEABLE;
> +
> +	new_spte = mark_spte_for_access_track(new_spte);
> +
> +	return new_spte;
> +}
> +

And another. :)


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 05/22] kvm: mmu: Add functions to handle changed TDP SPTEs
  2020-09-25 21:22 ` [PATCH 05/22] kvm: mmu: Add functions to handle changed TDP SPTEs Ben Gardon
@ 2020-09-26  0:39   ` Paolo Bonzini
  2020-09-28 17:23     ` Paolo Bonzini
  0 siblings, 1 reply; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-26  0:39 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Cannon Matthews, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 25/09/20 23:22, Ben Gardon wrote:
> +
> +	/*
> +	 * Recursively handle child PTs if the change removed a subtree from
> +	 * the paging structure.
> +	 */
> +	if (was_present && !was_leaf && (pfn_changed || !is_present)) {
> +		pt = spte_to_child_pt(old_spte, level);
> +
> +		for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
> +			old_child_spte = *(pt + i);
> +			*(pt + i) = 0;
> +			handle_changed_spte(kvm, as_id,
> +				gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)),
> +				old_child_spte, 0, level - 1);
> +		}

Is it worth returning a "flush" value to the caller, to avoid multiple
kvm_flush_remote_tlbs_with_address when e.g. zapping a 3rd-level PTE?

Also I prefer if we already include here a "stupid" version of
handle_changed_spte that just calls __handle_changed_spte.  (If my
suggestion is accepted, handle_changed_spte could actually handle the
flushing).

Paolo

> +
> +		kvm_flush_remote_tlbs_with_address(kvm, gfn,
> +						   KVM_PAGES_PER_HPAGE(level));
> +
> +		free_page((unsigned long)pt);
> +	}


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 16/22] kvm: mmu: Add dirty logging handler for changed sptes
  2020-09-25 21:22 ` [PATCH 16/22] kvm: mmu: Add dirty logging handler for changed sptes Ben Gardon
@ 2020-09-26  0:45   ` Paolo Bonzini
  0 siblings, 0 replies; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-26  0:45 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Cannon Matthews, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 25/09/20 23:22, Ben Gardon wrote:
> +static void handle_changed_spte_dlog(struct kvm *kvm, int as_id, gfn_t gfn,
> +				    u64 old_spte, u64 new_spte, int level)
> +{
> +	bool pfn_changed;
> +	struct kvm_memory_slot *slot;

_dlog is a new abbreviation... I think I prefer the full _dirty_log,
it's not any longer than _acc_track.

I wonder if it's worth trying to commonize code between the TDP MMU,
mmu_spte_update and fast_pf_fix_direct_spte.  But that can be left for
later since this is only a first step.

Paolo


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 02/22] kvm: mmu: Introduce tdp_iter
  2020-09-25 21:22 ` [PATCH 02/22] kvm: mmu: Introduce tdp_iter Ben Gardon
  2020-09-26  0:04   ` Paolo Bonzini
@ 2020-09-26  0:54   ` Paolo Bonzini
  2020-09-30  5:08   ` Sean Christopherson
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-26  0:54 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Cannon Matthews, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 25/09/20 23:22, Ben Gardon wrote:
> +	bool done;
> +
> +	done = try_step_down(iter);
> +	if (done)
> +		return;
> +
> +	done = try_step_side(iter);
> +	while (!done) {
> +		if (!try_step_up(iter)) {
> +			iter->valid = false;
> +			break;
> +		}
> +		done = try_step_side(iter);

Seems easier to read without the "done" boolean:

	if (try_step_down(iter))
		return;

	do {
		/* Maybe try_step_right? :) */
		if (try_step_side(iter))
			return;
	} while (try_step_up(iter));
	iter->valid = false;

Also it may be worth adding an "end_level" argument to the constructor,
and checking against it in try_step_down instead of using PG_LEVEL_4K.
By passing in PG_LEVEL_2M, you can avoid adding
tdp_iter_next_no_step_down in patch 17 and generally simplify the logic
there.

Paolo
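
A minimal, self-contained sketch of the loop shape suggested above, for
readers who want to see it run.  It walks a toy tree in pre-order and is
not KVM's tdp_iter: struct toy_node, struct toy_iter, toy_iter_next and
toy_walk are hypothetical names, and the three try_step_* helpers only
mimic the roles of the ones in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy node; this is not KVM's page-table layout. */
struct toy_node {
	struct toy_node *parent;
	struct toy_node *first_child;
	struct toy_node *next_sibling;
	int id;
};

/* Hypothetical stand-in for struct tdp_iter. */
struct toy_iter {
	struct toy_node *cur;
	bool valid;
};

/* Move to the first child, if there is one. */
static bool try_step_down(struct toy_iter *iter)
{
	if (!iter->cur->first_child)
		return false;
	iter->cur = iter->cur->first_child;
	return true;
}

/* Move to the next sibling, if there is one. */
static bool try_step_side(struct toy_iter *iter)
{
	if (!iter->cur->next_sibling)
		return false;
	iter->cur = iter->cur->next_sibling;
	return true;
}

/* Move to the parent; fails at the root. */
static bool try_step_up(struct toy_iter *iter)
{
	if (!iter->cur->parent)
		return false;
	iter->cur = iter->cur->parent;
	return true;
}

/* Pre-order step written without the "done" boolean. */
static void toy_iter_next(struct toy_iter *iter)
{
	if (try_step_down(iter))
		return;

	do {
		if (try_step_side(iter))
			return;
	} while (try_step_up(iter));
	iter->valid = false;
}

/* Record the ids visited from root, up to max entries. */
static size_t toy_walk(struct toy_node *root, int *out, size_t max)
{
	struct toy_iter iter = { .cur = root, .valid = root != NULL };
	size_t n = 0;

	assert(out != NULL);
	for (; iter.valid && n < max; toy_iter_next(&iter))
		out[n++] = iter.cur->id;
	return n;
}
```

The iterator is invalidated only after try_step_up fails at the root, so
every exit from the do/while either lands on a new node or ends the walk.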


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 17/22] kvm: mmu: Support dirty logging for the TDP MMU
  2020-09-25 21:22 ` [PATCH 17/22] kvm: mmu: Support dirty logging for the TDP MMU Ben Gardon
@ 2020-09-26  1:04   ` Paolo Bonzini
  2020-10-08 18:27     ` Ben Gardon
  2020-09-29 15:07   ` Paolo Bonzini
  2020-09-30 18:04   ` Sean Christopherson
  2 siblings, 1 reply; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-26  1:04 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Cannon Matthews, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 25/09/20 23:22, Ben Gardon wrote:
>  				start_level, KVM_MAX_HUGEPAGE_LEVEL, false);
> +	if (kvm->arch.tdp_mmu_enabled)
> +		flush = kvm_tdp_mmu_wrprot_slot(kvm, memslot, false) || flush;
>  	spin_unlock(&kvm->mmu_lock);
>  

In fact you can just pass down the end-level KVM_MAX_HUGEPAGE_LEVEL or
PG_LEVEL_4K here to kvm_tdp_mmu_wrprot_slot and from there to
wrprot_gfn_range.

> 
> +		/*
> +		 * Take a reference on the root so that it cannot be freed if
> +		 * this thread releases the MMU lock and yields in this loop.
> +		 */
> +		get_tdp_mmu_root(kvm, root);
> +
> +		spte_set = wrprot_gfn_range(kvm, root, slot->base_gfn,
> +				slot->base_gfn + slot->npages, skip_4k) ||
> +			   spte_set;
> +
> +		put_tdp_mmu_root(kvm, root);


Generally, using "|=" is the more common idiom in mmu.c.

> +static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> +			   gfn_t start, gfn_t end)
> ...
> +		__handle_changed_spte(kvm, as_id, iter.gfn, iter.old_spte,
> +				      new_spte, iter.level);
> +		handle_changed_spte_acc_track(iter.old_spte, new_spte,
> +					      iter.level);

Is it worth not calling handle_changed_spte?  handle_changed_spte_dlog
obviously will never fire but duplicating the code is a bit ugly.

I guess this patch is the first one that really gives the "feeling" of
what the data structures look like.  The main difference with the shadow
MMU is that you have the tdp_iter instead of the callback-based code of
slot_handle_level_range, but otherwise it's not hard to follow one if
you know the other.  Reorganizing the code so that mmu.c is little more
than a wrapper around the two will help as well in this respect.

Paolo
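
On the "|=" remark above, a small sketch showing that the two spellings
accumulate the same result and both call the helper on every iteration.
This is a toy under stated assumptions: toy_wrprot, toy_calls and the two
accumulate functions are hypothetical names and nothing here is KVM code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Counts how often the hypothetical helper below is called. */
static int toy_calls;

/* Hypothetical stand-in for a write-protect helper that reports a change. */
static bool toy_wrprot(bool changed)
{
	toy_calls++;
	return changed;
}

/* Spelling used in the patch: the call comes first, so it always runs. */
static bool accumulate_logical_or(const bool *results, size_t n)
{
	bool spte_set = false;

	assert(results != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		spte_set = toy_wrprot(results[i]) || spte_set;
	return spte_set;
}

/* Spelling suggested in the review: shorter, same behaviour for bool. */
static bool accumulate_or_assign(const bool *results, size_t n)
{
	bool spte_set = false;

	assert(results != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		spte_set |= toy_wrprot(results[i]);
	return spte_set;
}
```

Writing the operands the other way round, spte_set || toy_wrprot(...),
would short-circuit and skip the call once spte_set is true, which is one
reason the "|=" form is harder to get wrong.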


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 18/22] kvm: mmu: Support disabling dirty logging for the tdp MMU
  2020-09-25 21:22 ` [PATCH 18/22] kvm: mmu: Support disabling dirty logging for the tdp MMU Ben Gardon
@ 2020-09-26  1:09   ` Paolo Bonzini
  2020-10-07 16:30     ` Ben Gardon
  0 siblings, 1 reply; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-26  1:09 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Cannon Matthews, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 25/09/20 23:22, Ben Gardon wrote:
> +	for_each_tdp_pte_root(iter, root, start, end) {
> +		if (!is_shadow_present_pte(iter.old_spte) ||
> +		    is_last_spte(iter.old_spte, iter.level))
> +			continue;
> +

I'm starting to wonder if another iterator like
for_each_tdp_leaf_pte_root would be clearer, since this idiom repeats
itself quite often.  The tdp_iter_next_leaf function would be easily
implemented as

	while (likely(iter->valid) &&
	       (!is_shadow_present_pte(iter->old_spte) ||
		is_last_spte(iter->old_spte, iter->level)))
		tdp_iter_next(iter);

Paolo
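
A minimal sketch of the filtering-iterator idea above, assuming a flat
array of toy entries instead of a real paging structure.  All names
(struct toy_entry, struct flat_iter, for_each_filtered_entry and the
helpers) are hypothetical; the skip condition simply mirrors the one
quoted from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy entry; the flags stand in for the two SPTE predicates. */
struct toy_entry {
	bool present;
	bool last;
};

/* Hypothetical iterator over a flat array of entries. */
struct flat_iter {
	const struct toy_entry *entries;
	size_t n;
	size_t pos;
	bool valid;
};

/* Same condition as the "continue" in the quoted loop. */
static bool toy_should_skip(const struct toy_entry *e)
{
	return !e->present || e->last;
}

/* Plain step: advance by one entry and invalidate at the end. */
static void flat_iter_next(struct flat_iter *it)
{
	if (++it->pos >= it->n)
		it->valid = false;
}

/* Advance until the current entry passes the filter or the walk ends. */
static void flat_iter_skip(struct flat_iter *it)
{
	while (it->valid && toy_should_skip(&it->entries[it->pos]))
		flat_iter_next(it);
}

static void flat_iter_start(struct flat_iter *it,
			    const struct toy_entry *entries, size_t n)
{
	assert(entries != NULL || n == 0);
	it->entries = entries;
	it->n = n;
	it->pos = 0;
	it->valid = n > 0;
	flat_iter_skip(it);
}

/* Filtered step: one plain step, then skip to the next match. */
static void flat_iter_next_filtered(struct flat_iter *it)
{
	flat_iter_next(it);
	flat_iter_skip(it);
}

/* The loop body only ever sees entries that pass the filter. */
#define for_each_filtered_entry(it, entries, n)		\
	for (flat_iter_start(&(it), (entries), (n));	\
	     (it).valid;				\
	     flat_iter_next_filtered(&(it)))

/* Example user: count the entries that the filter lets through. */
static size_t toy_count_filtered(const struct toy_entry *entries, size_t n)
{
	struct flat_iter it;
	size_t count = 0;

	for_each_filtered_entry(it, entries, n)
		count++;
	return count;
}
```

With the filter folded into the iterator, callers drop the repeated
"if (...) continue;" preamble and the condition lives in one place.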


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 00/22] Introduce the TDP MMU
  2020-09-25 21:22 [PATCH 00/22] Introduce the TDP MMU Ben Gardon
                   ` (21 preceding siblings ...)
  2020-09-25 21:23 ` [PATCH 22/22] kvm: mmu: Don't clear write flooding count for direct roots Ben Gardon
@ 2020-09-26  1:14 ` Paolo Bonzini
  2020-09-28 17:31 ` Paolo Bonzini
  2020-09-30  6:19 ` Sean Christopherson
  24 siblings, 0 replies; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-26  1:14 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Cannon Matthews, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 25/09/20 23:22, Ben Gardon wrote:
> Over the years, the needs for KVM's x86 MMU have grown from running small
> guests to live migrating multi-terabyte VMs with hundreds of vCPUs. Where
> we previously depended on shadow paging to run all guests, we now have
> two dimensional paging (TDP). This patch set introduces a new
> implementation of much of the KVM MMU, optimized for running guests with
> TDP. We have re-implemented many of the MMU functions to take advantage of
> the relative simplicity of TDP and eliminate the need for an rmap.
> Building on this simplified implementation, a future patch set will change
> the synchronization model for this "TDP MMU" to enable more parallelism
> than the monolithic MMU lock. A TDP MMU is currently in use at Google
> and has given us the performance necessary to live migrate our 416 vCPU,
> 12TiB m2-ultramem-416 VMs.
> 
> This work was motivated by the need to handle page faults in parallel for
> very large VMs. When VMs have hundreds of vCPUs and terabytes of memory,
> KVM's MMU lock suffers extreme contention, resulting in soft-lockups and
> long latency on guest page faults. This contention can be easily seen
> running the KVM selftests demand_paging_test with a couple hundred vCPUs.
> Over a 1 second profile of the demand_paging_test, with 416 vCPUs and 4G
> per vCPU, 98% of the time was spent waiting for the MMU lock. At Google,
> the TDP MMU reduced the test duration by 89% and the execution was
> dominated by get_user_pages and the user fault FD ioctl instead of the
> MMU lock.
> 
> This series is the first of two. In this series we add a basic
> implementation of the TDP MMU. In the next series we will improve the
> performance of the TDP MMU and allow it to execute MMU operations
> in parallel.
> 
> The overall purpose of the KVM MMU is to program paging structures
> (CR3/EPT/NPT) to encode the mapping of guest addresses to host physical
> addresses (HPA), and to provide utilities for other KVM features, for
> example dirty logging. The definition of the L1 guest physical address
> (GPA) to HPA mapping comes in two parts: KVM's memslots map GPA to HVA,
> and the kernel MM/x86 host page tables map HVA -> HPA. Without TDP, the
> MMU must program the x86 page tables to encode the full translation of
> guest virtual addresses (GVA) to HPA. This requires "shadowing" the
> guest's page tables to create a composite x86 paging structure. This
> solution is complicated, requires separate paging structures for each
> guest CR3, and requires emulating guest page table changes. The TDP case
> is much simpler. In this case, KVM lets the guest control CR3 and programs
> the EPT/NPT paging structures with the GPA -> HPA mapping. The guest has
> no way to change this mapping and only one version of the paging structure
> is needed per L1 paging mode. In this case the paging mode is some
> combination of the number of levels in the paging structure, the address
> space (normal execution or system management mode, on x86), and other
> attributes. Most VMs only ever use 1 paging mode and so only ever need one
> TDP structure.
> 
> This series implements a "TDP MMU" through alternative implementations of
> MMU functions for running L1 guests with TDP. The TDP MMU falls back to
> the existing shadow paging implementation when TDP is not available, and
> interoperates with the existing shadow paging implementation for nesting.
> The use of the TDP MMU can be controlled by a module parameter which is
> snapshot on VM creation and follows the life of the VM. This snapshot
> is used in many functions to decide whether or not to use TDP MMU handlers
> for a given operation.
> 
> This series can also be viewed in Gerrit here:
> https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538
> (Thanks to Dmitry Vyukov <dvyukov@google.com> for setting up the
> Gerrit instance)
> 
> Ben Gardon (22):
>   kvm: mmu: Separate making SPTEs from set_spte
>   kvm: mmu: Introduce tdp_iter
>   kvm: mmu: Init / Uninit the TDP MMU
>   kvm: mmu: Allocate and free TDP MMU roots
>   kvm: mmu: Add functions to handle changed TDP SPTEs
>   kvm: mmu: Make address space ID a property of memslots
>   kvm: mmu: Support zapping SPTEs in the TDP MMU
>   kvm: mmu: Separate making non-leaf sptes from link_shadow_page
>   kvm: mmu: Remove disallowed_hugepage_adjust shadow_walk_iterator arg
>   kvm: mmu: Add TDP MMU PF handler
>   kvm: mmu: Factor out allocating a new tdp_mmu_page
>   kvm: mmu: Allocate struct kvm_mmu_pages for all pages in TDP MMU
>   kvm: mmu: Support invalidate range MMU notifier for TDP MMU
>   kvm: mmu: Add access tracking for tdp_mmu
>   kvm: mmu: Support changed pte notifier in tdp MMU
>   kvm: mmu: Add dirty logging handler for changed sptes
>   kvm: mmu: Support dirty logging for the TDP MMU
>   kvm: mmu: Support disabling dirty logging for the tdp MMU
>   kvm: mmu: Support write protection for nesting in tdp MMU
>   kvm: mmu: NX largepage recovery for TDP MMU
>   kvm: mmu: Support MMIO in the TDP MMU
>   kvm: mmu: Don't clear write flooding count for direct roots
> 
>  arch/x86/include/asm/kvm_host.h |   17 +
>  arch/x86/kvm/Makefile           |    3 +-
>  arch/x86/kvm/mmu/mmu.c          |  437 ++++++----
>  arch/x86/kvm/mmu/mmu_internal.h |   98 +++
>  arch/x86/kvm/mmu/paging_tmpl.h  |    3 +-
>  arch/x86/kvm/mmu/tdp_iter.c     |  198 +++++
>  arch/x86/kvm/mmu/tdp_iter.h     |   55 ++
>  arch/x86/kvm/mmu/tdp_mmu.c      | 1315 +++++++++++++++++++++++++++++++
>  arch/x86/kvm/mmu/tdp_mmu.h      |   52 ++
>  include/linux/kvm_host.h        |    2 +
>  virt/kvm/kvm_main.c             |    7 +-
>  11 files changed, 2022 insertions(+), 165 deletions(-)
>  create mode 100644 arch/x86/kvm/mmu/tdp_iter.c
>  create mode 100644 arch/x86/kvm/mmu/tdp_iter.h
>  create mode 100644 arch/x86/kvm/mmu/tdp_mmu.c
>  create mode 100644 arch/x86/kvm/mmu/tdp_mmu.h
> 

OK, I have not finished reading the code, but I already have an idea of
what it's like.  I really think we should fast track this as the basis
for more 5.11 work.  I'll finish reviewing it and, if you don't mind, I
might make some of the changes myself so that I get a chance to play with
the code and get accustomed to it; speak up if you disagree with them though!
Another thing I'd like to add is a few tracepoints.

Paolo


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 20/22] kvm: mmu: NX largepage recovery for TDP MMU
  2020-09-25 21:23 ` [PATCH 20/22] kvm: mmu: NX largepage recovery for TDP MMU Ben Gardon
@ 2020-09-26  1:14   ` Paolo Bonzini
  2020-09-30 22:23     ` Ben Gardon
  2020-09-29 18:24   ` Paolo Bonzini
  2020-09-30 18:15   ` Sean Christopherson
  2 siblings, 1 reply; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-26  1:14 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Cannon Matthews, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 25/09/20 23:23, Ben Gardon wrote:
> +
> +	if (!kvm->arch.tdp_mmu_enabled)
> +		return err;
> +
> +	err = kvm_vm_create_worker_thread(kvm, kvm_nx_lpage_recovery_worker, 1,
> +			"kvm-nx-lpage-tdp-mmu-recovery",
> +			&kvm->arch.nx_lpage_tdp_mmu_recovery_thread);

Any reason to have two threads?

Paolo


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 22/22] kvm: mmu: Don't clear write flooding count for direct roots
  2020-09-25 21:23 ` [PATCH 22/22] kvm: mmu: Don't clear write flooding count for direct roots Ben Gardon
@ 2020-09-26  1:25   ` Paolo Bonzini
  2020-10-05 22:48     ` Ben Gardon
  0 siblings, 1 reply; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-26  1:25 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Cannon Matthews, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 25/09/20 23:23, Ben Gardon wrote:
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 42dde27decd75..c07831b0c73e1 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -124,6 +124,18 @@ static struct kvm_mmu_page *find_tdp_mmu_root_with_role(
>  	return NULL;
>  }
>  
> +hpa_t kvm_tdp_mmu_root_hpa_for_role(struct kvm *kvm,
> +				    union kvm_mmu_page_role role)
> +{
> +	struct kvm_mmu_page *root;
> +
> +	root = find_tdp_mmu_root_with_role(kvm, role);
> +	if (root)
> +		return __pa(root->spt);
> +
> +	return INVALID_PAGE;
> +}
> +
>  static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
>  						   int level)
>  {
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index cc0b7241975aa..2395ffa71bb05 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -9,6 +9,8 @@ void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
>  void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
>  
>  bool is_tdp_mmu_root(struct kvm *kvm, hpa_t root);
> +hpa_t kvm_tdp_mmu_root_hpa_for_role(struct kvm *kvm,
> +				    union kvm_mmu_page_role role);
>  hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
>  void kvm_tdp_mmu_put_root_hpa(struct kvm *kvm, hpa_t root_hpa);
>  

Probably missing a piece since this code is not used and neither is the
new argument to is_root_usable.

I'm a bit confused by is_root_usable since there should be only one PGD
for the TDP MMU (the one for the root_mmu).

Paolo


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 15/22] kvm: mmu: Support changed pte notifier in tdp MMU
  2020-09-25 21:22 ` [PATCH 15/22] kvm: mmu: Support changed pte notifier in tdp MMU Ben Gardon
  2020-09-26  0:33   ` Paolo Bonzini
@ 2020-09-28 15:11   ` Paolo Bonzini
  2020-10-07 16:53     ` Ben Gardon
  1 sibling, 1 reply; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-28 15:11 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Cannon Matthews, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 25/09/20 23:22, Ben Gardon wrote:
> +		*iter.sptep = 0;
> +		handle_changed_spte(kvm, as_id, iter.gfn, iter.old_spte,
> +				    new_spte, iter.level);
> +

Can you explain why new_spte is passed here instead of 0?

All calls to handle_changed_spte are preceded by "*something = 
new_spte" except this one, so I'm thinking of having a change_spte 
function like

static void change_spte(struct kvm *kvm, int as_id, gfn_t gfn,
                        u64 *sptep, u64 new_spte, int level)
{
        u64 old_spte = *sptep;
        *sptep = new_spte;

        __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level);
        handle_changed_spte_acc_track(old_spte, new_spte, level);
        handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte, new_spte, level);
}

in addition to the previously-mentioned cleanup of always calling
handle_changed_spte instead of special-casing calls to two of the
three functions.  It would be a nice place to add the
trace_kvm_mmu_set_spte tracepoint, too.

Paolo


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 05/22] kvm: mmu: Add functions to handle changed TDP SPTEs
  2020-09-26  0:39   ` Paolo Bonzini
@ 2020-09-28 17:23     ` Paolo Bonzini
  0 siblings, 0 replies; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-28 17:23 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Cannon Matthews, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 26/09/20 02:39, Paolo Bonzini wrote:
>> +			handle_changed_spte(kvm, as_id,
>> +				gfn + (i * KVM_PAGES_PER_HPAGE(level - 1)),
>> +				old_child_spte, 0, level - 1);
>> +		}
> Is it worth returning a "flush" value to the caller, to avoid multiple
> kvm_flush_remote_tlbs_with_address when e.g. zapping a 3rd-level PTE?

Never mind, this is not possible since you are freeing the page
immediately after kvm_flush_remote_tlbs_with_address.

Paolo


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 00/22] Introduce the TDP MMU
  2020-09-25 21:22 [PATCH 00/22] Introduce the TDP MMU Ben Gardon
                   ` (22 preceding siblings ...)
  2020-09-26  1:14 ` [PATCH 00/22] Introduce the TDP MMU Paolo Bonzini
@ 2020-09-28 17:31 ` Paolo Bonzini
  2020-09-29 17:40   ` Ben Gardon
  2020-09-30  6:19 ` Sean Christopherson
  24 siblings, 1 reply; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-28 17:31 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Cannon Matthews, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 25/09/20 23:22, Ben Gardon wrote:
> This series is the first of two. In this series we add a basic
> implementation of the TDP MMU. In the next series we will improve the
> performance of the TDP MMU and allow it to execute MMU operations
> in parallel.

I have finished rebasing and adding a few cleanups on top, but I don't
have time to test it today.  I think the changes shouldn't get too much
in the way of the second series, but I've also pushed your v1 unmodified
to kvm/tdp-mmu for future convenience.  I'll wait for your feedback in
the meantime!

One feature that I noticed is missing is the shrinker.  What are your
plans (or opinions) around it?

Also, the code generally assumes a 64-bit CPU (i.e. that writes to 64-bit
PTEs are atomic).  That is not a big issue, it just needs a small change
on top to make the TDP MMU conditional on CONFIG_X86_64.

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 17/22] kvm: mmu: Support dirty logging for the TDP MMU
  2020-09-25 21:22 ` [PATCH 17/22] kvm: mmu: Support dirty logging for the TDP MMU Ben Gardon
  2020-09-26  1:04   ` Paolo Bonzini
@ 2020-09-29 15:07   ` Paolo Bonzini
  2020-09-30 18:04   ` Sean Christopherson
  2 siblings, 0 replies; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-29 15:07 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Cannon Matthews, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 25/09/20 23:22, Ben Gardon wrote:
> +	for_each_tdp_pte_root(iter, root, start, end) {
> +iteration_start:
> +		if (!is_shadow_present_pte(iter.old_spte))
> +			continue;
> +
> +		/*
> +		 * If this entry points to a page of 4K entries, and 4k entries
> +		 * should be skipped, skip the whole page. If the non-leaf
> +		 * entry is at a higher level, move on to the next,
> +		 * (lower level) entry.
> +		 */
> +		if (!is_last_spte(iter.old_spte, iter.level)) {
> +			if (skip_4k && iter.level == PG_LEVEL_2M) {
> +				tdp_iter_next_no_step_down(&iter);
> +				if (iter.valid && iter.gfn >= end)
> +					goto iteration_start;
> +				else
> +					break;

The iteration_start label confuses me mightily. :)  That would be a case
where iter.gfn >= end (so for_each_tdp_pte_root would exit) but you want
to proceed anyway with the gfn that was found by
tdp_iter_next_no_step_down.  Are you sure you didn't mean

	if (iter.valid && iter.gfn < end)
		goto iteration_start;
	else
		break;

because that would make much more sense: basically a "continue" that
skips the tdp_iter_next.  With the min_level change I suggested on
Friday, it would become something like this:

        for_each_tdp_pte_root_level(iter, root, start, end, min_level) {
                if (!is_shadow_present_pte(iter.old_spte) ||
                    !is_last_spte(iter.old_spte, iter.level))
                        continue;

                new_spte = iter.old_spte & ~PT_WRITABLE_MASK;

                *iter.sptep = new_spte;
                handle_changed_spte(kvm, as_id, iter.gfn, iter.old_spte,
                                    new_spte, iter.level);

                spte_set = true;
                tdp_mmu_iter_cond_resched(kvm, &iter);
        }

which is all nice and understandable.

Also, related to this function, why ignore the return value of
tdp_mmu_iter_cond_resched?  It does make sense to assign spte_set =
true since, just like in kvm_mmu_slot_largepage_remove_write_access's
instance of slot_handle_large_level, you don't even need to flush on
cond_resched.  However, in order to do that you would have to add some
kind of "bool flush_on_resched" argument to tdp_mmu_iter_cond_resched,
or have two separate functions tdp_mmu_iter_cond_{flush_and_,}resched.

The same is true of clear_dirty_gfn_range and set_dirty_gfn_range.

Paolo


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 00/22] Introduce the TDP MMU
  2020-09-28 17:31 ` Paolo Bonzini
@ 2020-09-29 17:40   ` Ben Gardon
  2020-09-29 18:10     ` Paolo Bonzini
  0 siblings, 1 reply; 105+ messages in thread
From: Ben Gardon @ 2020-09-29 17:40 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: LKML, kvm, Cannon Matthews, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Mon, Sep 28, 2020 at 10:31 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 25/09/20 23:22, Ben Gardon wrote:
> > This series is the first of two. In this series we add a basic
> > implementation of the TDP MMU. In the next series we will improve the
> > performance of the TDP MMU and allow it to execute MMU operations
> > in parallel.
>
> I have finished rebasing and adding a few cleanups on top, but I don't
> have time to test it today.  I think the changes shouldn't get too much
> in the way of the second series, but I've also pushed your v1 unmodified
> to kvm/tdp-mmu for future convenience.  I'll wait for your feedback in
> the meantime!

Awesome, thank you for the reviews and positive response! I'll get to
work responding to your comments and preparing a v2.

>
> One feature that I noticed is missing is the shrinker.  What are your
> plans (or opinions) around it?

I assume by the shrinker you mean the page table quota that controls
how many pages the MMU can use at a time to back guest memory?
I think the shrinker is less important for the TDP MMU as there is an
implicit limit on how much memory it will use to back guest memory.
You could set the limit smaller than the number of pages required to
fully map the guest's memory, but I'm not really sure why you would
want to in a practical scenario. I understand the quota's importance
for x86 shadow paging and nested TDP scenarios where the guest could
cause KVM to allocate an unbounded amount of memory for page tables,
but the guest does not have this power in the non-nested TDP scenario.
Really, I didn't include it in this series because we haven't needed
it at Google and so I never implemented the quota enforcement. It
shouldn't be difficult to implement if you think it's worth having,
and it'll be needed to support nested on the TDP MMU (without using
the shadow MMU) anyway. If you're okay with leaving it out of the
initial patch set though, I'm inclined to do that.

>
> Also, the code generally assumes a 64-bit CPU (i.e. that writes to 64-bit
> PTEs are atomic).  That is not a big issue, it just needs a small change
> on top to make the TDP MMU conditional on CONFIG_X86_64.

Ah, that didn't occur to me. Thank you for pointing that out. I'll fix
that oversight in v2.

>
> Thanks,
>
> Paolo
>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 00/22] Introduce the TDP MMU
  2020-09-29 17:40   ` Ben Gardon
@ 2020-09-29 18:10     ` Paolo Bonzini
  0 siblings, 0 replies; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-29 18:10 UTC (permalink / raw)
  To: Ben Gardon
  Cc: LKML, kvm, Cannon Matthews, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 29/09/20 19:40, Ben Gardon wrote:
> I'll get to work responding to your comments and preparing a v2.

Please do respond to the comments, but I've actually already done most
of the changes (I'm bad at reviewing code without tinkering).  NX
recovery seems broken, but we can leave it out in the beginning as it's
fairly self-contained.

I was going to post today, but I was undecided about whether to leave
out NX or try and fix it.

>> One feature that I noticed is missing is the shrinker.  What are your
>> plans (or opinions) around it?
> I assume by the shrinker you mean the page table quota that controls
> how many pages the MMU can use at a time to back guest memory?
> I think the shrinker is less important for the TDP MMU as there is an
> implicit limit on how much memory it will use to back guest memory.

Good point.  That's why I asked for opinions too.

Paolo


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 20/22] kvm: mmu: NX largepage recovery for TDP MMU
  2020-09-25 21:23 ` [PATCH 20/22] kvm: mmu: NX largepage recovery for TDP MMU Ben Gardon
  2020-09-26  1:14   ` Paolo Bonzini
@ 2020-09-29 18:24   ` Paolo Bonzini
  2020-09-30 18:15   ` Sean Christopherson
  2 siblings, 0 replies; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-29 18:24 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Cannon Matthews, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 25/09/20 23:23, Ben Gardon wrote:
> +	struct list_head tdp_mmu_lpage_disallowed_pages;

This list is never INIT_LIST_HEAD-ed, but I see other issues if I do so
(or maybe it's just too late).

Paolo

> +	u64 tdp_mmu_lpage_disallowed_page_count;


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 01/22] kvm: mmu: Separate making SPTEs from set_spte
  2020-09-25 21:22 ` [PATCH 01/22] kvm: mmu: Separate making SPTEs from set_spte Ben Gardon
@ 2020-09-30  4:55   ` Sean Christopherson
  2020-09-30 23:03     ` Ben Gardon
  0 siblings, 1 reply; 105+ messages in thread
From: Sean Christopherson @ 2020-09-30  4:55 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Cannon Matthews, Paolo Bonzini, Peter Xu,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Fri, Sep 25, 2020 at 02:22:41PM -0700, Ben Gardon wrote:
> +static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
> +		    unsigned int pte_access, int level,
> +		    gfn_t gfn, kvm_pfn_t pfn, bool speculative,
> +		    bool can_unsync, bool host_writable)
> +{
> +	u64 spte = 0;
> +	struct kvm_mmu_page *sp;
> +	int ret = 0;
> +
> +	if (set_mmio_spte(vcpu, sptep, gfn, pfn, pte_access))
> +		return 0;
> +
> +	sp = sptep_to_sp(sptep);
> +
> +	spte = make_spte(vcpu, pte_access, level, gfn, pfn, *sptep, speculative,
> +			 can_unsync, host_writable, sp_ad_disabled(sp), &ret);
> +	if (!spte)
> +		return 0;

This is an impossible condition.  Well, maybe it's theoretically possible
if page track is active, with EPT exec-only support (shadow_present_mask is
zero), and pfn==0.  But in that case, returning early is wrong.

Rather than return the spte, what about returning 'ret', passing 'new_spte'
as a u64 *, and dropping the bail early path?  That would also eliminate
the minor wart of make_spte() relying on the caller to initialize 'ret'.
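
A minimal user-space sketch of that calling convention, for illustration
only.  Everything prefixed toy_/TOY_ is hypothetical and is not the real
make_spte(): the point is just that the status flags are returned, the
SPTE comes back through a pointer, and a zero SPTE is no longer treated
as a failure.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real masks and return flags. */
#define TOY_PAGE_SHIFT			12
#define TOY_WRITABLE_MASK		(1ULL << 1)
#define TOY_SET_SPTE_WRITE_PROTECTED_PT	(1 << 0)

/*
 * Build an SPTE and hand it back through @new_spte.  The return value
 * carries only status flags, so 'ret' is initialized here rather than
 * by the caller, and a result of 0 in *new_spte is a valid value.
 */
static inline int toy_make_spte(uint64_t pfn, bool want_write,
				bool must_write_protect, uint64_t *new_spte)
{
	uint64_t spte = pfn << TOY_PAGE_SHIFT;
	int ret = 0;

	if (want_write) {
		if (must_write_protect)
			ret |= TOY_SET_SPTE_WRITE_PROTECTED_PT;
		else
			spte |= TOY_WRITABLE_MASK;
	}

	*new_spte = spte;
	return ret;
}
```

With this shape the caller no longer needs an "if (!spte) return 0;"
check, because the flags and the value travel separately.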

> +
> +	if (spte & PT_WRITABLE_MASK)
> +		kvm_vcpu_mark_page_dirty(vcpu, gfn);
> +
>  	if (mmu_spte_update(sptep, spte))
>  		ret |= SET_SPTE_NEED_REMOTE_TLB_FLUSH;
>  	return ret;
> -- 
> 2.28.0.709.gb0816b6eb0-goog
> 

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 02/22] kvm: mmu: Introduce tdp_iter
  2020-09-26  0:04   ` Paolo Bonzini
@ 2020-09-30  5:06     ` Sean Christopherson
  0 siblings, 0 replies; 105+ messages in thread
From: Sean Christopherson @ 2020-09-30  5:06 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Ben Gardon, linux-kernel, kvm, Cannon Matthews, Peter Xu,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Sat, Sep 26, 2020 at 02:04:49AM +0200, Paolo Bonzini wrote:
> On 25/09/20 23:22, Ben Gardon wrote:
> >  EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
> >  
> > -static bool is_mmio_spte(u64 spte)
> > +bool is_mmio_spte(u64 spte)
> >  {
> >  	return (spte & SPTE_SPECIAL_MASK) == SPTE_MMIO_MASK;
> >  }
> > @@ -623,7 +612,7 @@ static int is_nx(struct kvm_vcpu *vcpu)
> >  	return vcpu->arch.efer & EFER_NX;
> >  }
> >  
> > -static int is_shadow_present_pte(u64 pte)
> > +int is_shadow_present_pte(u64 pte)
> >  {
> >  	return (pte != 0) && !is_mmio_spte(pte);
> >  }
> > @@ -633,7 +622,7 @@ static int is_large_pte(u64 pte)
> >  	return pte & PT_PAGE_SIZE_MASK;
> >  }
> >  
> > -static int is_last_spte(u64 pte, int level)
> > +int is_last_spte(u64 pte, int level)
> >  {
> >  	if (level == PG_LEVEL_4K)
> >  		return 1;
> > @@ -647,7 +636,7 @@ static bool is_executable_pte(u64 spte)
> >  	return (spte & (shadow_x_mask | shadow_nx_mask)) == shadow_x_mask;
> >  }
> >  
> > -static kvm_pfn_t spte_to_pfn(u64 pte)
> > +kvm_pfn_t spte_to_pfn(u64 pte)
> >  {
> >  	return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
> >  }
> 
> Should these be inlines in mmu_internal.h instead?

Ya, that would be my preference as well.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 02/22] kvm: mmu: Introduce tdp_iter
  2020-09-25 21:22 ` [PATCH 02/22] kvm: mmu: Introduce tdp_iter Ben Gardon
  2020-09-26  0:04   ` Paolo Bonzini
  2020-09-26  0:54   ` Paolo Bonzini
@ 2020-09-30  5:08   ` Sean Christopherson
  2020-09-30  5:24   ` Sean Christopherson
  2020-09-30 23:20   ` Eric van Tassell
  4 siblings, 0 replies; 105+ messages in thread
From: Sean Christopherson @ 2020-09-30  5:08 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Cannon Matthews, Paolo Bonzini, Peter Xu,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Fri, Sep 25, 2020 at 02:22:42PM -0700, Ben Gardon wrote:
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> new file mode 100644
> index 0000000000000..b102109778eac
> --- /dev/null
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -0,0 +1,53 @@
> +/* SPDX-License-Identifier: GPL-2.0 */

Kernel style is to use C++ comments for the SPDX headers in .c and .h (and
maybe .S too?).

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 02/22] kvm: mmu: Introduce tdp_iter
  2020-09-25 21:22 ` [PATCH 02/22] kvm: mmu: Introduce tdp_iter Ben Gardon
                     ` (2 preceding siblings ...)
  2020-09-30  5:08   ` Sean Christopherson
@ 2020-09-30  5:24   ` Sean Christopherson
  2020-09-30  6:24     ` Paolo Bonzini
  2020-09-30 23:20   ` Eric van Tassell
  4 siblings, 1 reply; 105+ messages in thread
From: Sean Christopherson @ 2020-09-30  5:24 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Cannon Matthews, Paolo Bonzini, Peter Xu,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Fri, Sep 25, 2020 at 02:22:42PM -0700, Ben Gardon wrote:
> diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
> new file mode 100644
> index 0000000000000..ee90d62d2a9b1
> --- /dev/null
> +++ b/arch/x86/kvm/mmu/tdp_iter.c
> @@ -0,0 +1,163 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#include "mmu_internal.h"
> +#include "tdp_iter.h"
> +
> +/*
> + * Recalculates the pointer to the SPTE for the current GFN and level and
> + * reread the SPTE.
> + */
> +static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
> +{
> +	iter->sptep = iter->pt_path[iter->level - 1] +
> +		SHADOW_PT_INDEX(iter->gfn << PAGE_SHIFT, iter->level);
> +	iter->old_spte = READ_ONCE(*iter->sptep);
> +}
> +
> +/*
> + * Sets a TDP iterator to walk a pre-order traversal of the paging structure
> + * rooted at root_pt, starting with the walk to translate goal_gfn.
> + */
> +void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
> +		    gfn_t goal_gfn)
> +{
> +	WARN_ON(root_level < 1);
> +	WARN_ON(root_level > PT64_ROOT_MAX_LEVEL);
> +
> +	iter->goal_gfn = goal_gfn;
> +	iter->root_level = root_level;
> +	iter->level = root_level;
> +	iter->pt_path[iter->level - 1] = root_pt;
> +
> +	iter->gfn = iter->goal_gfn -
> +		(iter->goal_gfn % KVM_PAGES_PER_HPAGE(iter->level));

Maybe use the params, if only to avoid the line wrap?

	iter->gfn = goal_gfn - (goal_gfn % KVM_PAGES_PER_HPAGE(root_level));

Actually, peeking further into the file, this calculation is repeated in both
try_step_up and try_step_down,  probably worth adding a helper of some form.
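
One possible shape for such a helper, as a user-space sketch.  The names
round_gfn_for_level() and pages_per_entry() are made up for this example,
and the 9 index bits per level are an assumption that matches 64-bit x86
paging (512 entries per table); the kernel would use KVM_PAGES_PER_HPAGE()
instead.

```c
#include <stdint.h>

typedef uint64_t gfn_t;

/* Assumption: 512 entries (9 index bits) per page-table level. */
#define TOY_PT_LEVEL_BITS 9

/* Number of 4K pages covered by one entry at @level (level 1 == 4K). */
static inline gfn_t pages_per_entry(int level)
{
	return (gfn_t)1 << ((level - 1) * TOY_PT_LEVEL_BITS);
}

/*
 * Hypothetical helper: round @gfn down to the first GFN mapped by the
 * entry that covers it at @level.
 */
static inline gfn_t round_gfn_for_level(gfn_t gfn, int level)
{
	return gfn - (gfn % pages_per_entry(level));
}
```

tdp_iter_start() and the step functions could then all compute the GFN
with a single call, e.g. round_gfn_for_level(goal_gfn, root_level).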

> +	tdp_iter_refresh_sptep(iter);
> +
> +	iter->valid = true;
> +}
> +
> +/*
> + * Given an SPTE and its level, returns a pointer containing the host virtual
> + * address of the child page table referenced by the SPTE. Returns null if
> + * there is no such entry.
> + */
> +u64 *spte_to_child_pt(u64 spte, int level)
> +{
> +	u64 *pt;
> +	/* There's no child entry if this entry isn't present */
> +	if (!is_shadow_present_pte(spte))
> +		return NULL;
> +
> +	/* There is no child page table if this is a leaf entry. */
> +	if (is_last_spte(spte, level))
> +		return NULL;

I'd collapse the checks and their comments.

> +
> +	pt = (u64 *)__va(spte_to_pfn(spte) << PAGE_SHIFT);
> +	return pt;

No need for the local variable or the explicit cast.

	/* There's no child if this entry is non-present or a leaf entry. */
	if (!is_shadow_present_pte(spte) || is_last_spte(spte, level))
		return NULL;

	return __va(spte_to_pfn(spte) << PAGE_SHIFT);

> +}

...

> +void tdp_iter_next(struct tdp_iter *iter)
> +{
> +	bool done;
> +
> +	done = try_step_down(iter);
> +	if (done)
> +		return;
> +
> +	done = try_step_side(iter);
> +	while (!done) {
> +		if (!try_step_up(iter)) {
> +			iter->valid = false;
> +			break;
> +		}
> +		done = try_step_side(iter);
> +	}

At the risk of being too clever:

	bool done;

	if (try_step_down(iter))
		return;

	do {
		done = try_step_side(iter);
	} while (!done && try_step_up(iter));

	iter->valid = done;
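
To check that the two loop shapes really are interchangeable, here is a
toy user-space model.  It is only a sketch: the toy_* names, the 3-level
tree and the fanout of 2 are all invented for the example, the step-down
part is left out, and it assumes iter->valid is true on entry (which is
what makes "iter->valid = done" equivalent to only clearing it on
failure).

```c
#include <stdbool.h>

#define TOY_ROOT_LEVEL 3	/* hypothetical depth of the toy tree */
#define TOY_FANOUT     2	/* hypothetical entries per toy table */

struct toy_iter {
	int level;			/* 1 is the leaf level */
	int index[TOY_ROOT_LEVEL + 1];	/* index[l]: slot in the level-l table */
	bool valid;
};

/* Move to the next slot in the current table, if there is one. */
static bool toy_step_side(struct toy_iter *it)
{
	if (it->index[it->level] == TOY_FANOUT - 1)
		return false;
	it->index[it->level]++;
	return true;
}

/* Move to the parent table, unless already at the root. */
static bool toy_step_up(struct toy_iter *it)
{
	if (it->level == TOY_ROOT_LEVEL)
		return false;
	it->level++;
	return true;
}

/* Loop shape from the patch (step-down part omitted). */
static void toy_next_original(struct toy_iter *it)
{
	bool done = toy_step_side(it);

	while (!done) {
		if (!toy_step_up(it)) {
			it->valid = false;
			break;
		}
		done = toy_step_side(it);
	}
}

/* Loop shape from the suggestion above. */
static void toy_next_rewritten(struct toy_iter *it)
{
	bool done;

	do {
		done = toy_step_side(it);
	} while (!done && toy_step_up(it));

	it->valid = done;
}

/* Returns true if both forms agree for every starting position. */
static bool toy_forms_agree(void)
{
	for (int level = 1; level <= TOY_ROOT_LEVEL; level++) {
		for (int bits = 0; bits < (1 << TOY_ROOT_LEVEL); bits++) {
			struct toy_iter a = { .level = level, .valid = true };
			struct toy_iter b;

			for (int l = 1; l <= TOY_ROOT_LEVEL; l++)
				a.index[l] = (bits >> (l - 1)) & 1;
			b = a;

			toy_next_original(&a);
			toy_next_rewritten(&b);

			if (a.level != b.level || a.valid != b.valid)
				return false;
			for (int l = 1; l <= TOY_ROOT_LEVEL; l++)
				if (a.index[l] != b.index[l])
					return false;
		}
	}
	return true;
}
```

toy_forms_agree() walks every starting level and index combination and
compares the resulting iterator state from both versions.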

> +}

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 03/22] kvm: mmu: Init / Uninit the TDP MMU
  2020-09-25 21:22 ` [PATCH 03/22] kvm: mmu: Init / Uninit the TDP MMU Ben Gardon
  2020-09-26  0:06   ` Paolo Bonzini
@ 2020-09-30  5:34   ` Sean Christopherson
  2020-09-30 18:36     ` Ben Gardon
  2020-09-30 16:57   ` Sean Christopherson
  2 siblings, 1 reply; 105+ messages in thread
From: Sean Christopherson @ 2020-09-30  5:34 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Cannon Matthews, Paolo Bonzini, Peter Xu,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

Nit on all the shortlogs, can you use "KVM: x86/mmu" instead of "kvm: mmu"?

On Fri, Sep 25, 2020 at 02:22:43PM -0700, Ben Gardon wrote:
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> new file mode 100644
> index 0000000000000..8241e18c111e6
> --- /dev/null
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -0,0 +1,34 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#include "tdp_mmu.h"
> +
> +static bool __read_mostly tdp_mmu_enabled = true;
> +module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0644);

Do y'all actually toggle tdp_mmu_enabled while VMs are running?  I can see
having a per-VM capability, or a read-only module param, but a writable
module param is... interesting.

> +static bool is_tdp_mmu_enabled(void)
> +{
> +	if (!READ_ONCE(tdp_mmu_enabled))
> +		return false;
> +
> +	if (WARN_ONCE(!tdp_enabled,
> +		      "Creating a VM with TDP MMU enabled requires TDP."))

This should be enforced, i.e. clear tdp_mmu_enabled if !tdp_enabled.  As is,
it's a user triggerable WARN, which is not good, e.g. with PANIC_ON_WARN.

> +		return false;
> +
> +	return true;
> +}

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 04/22] kvm: mmu: Allocate and free TDP MMU roots
  2020-09-25 21:22 ` [PATCH 04/22] kvm: mmu: Allocate and free TDP MMU roots Ben Gardon
@ 2020-09-30  6:06   ` Sean Christopherson
  2020-09-30  6:26     ` Paolo Bonzini
  2020-10-12 22:59     ` Ben Gardon
  0 siblings, 2 replies; 105+ messages in thread
From: Sean Christopherson @ 2020-09-30  6:06 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Cannon Matthews, Paolo Bonzini, Peter Xu,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Fri, Sep 25, 2020 at 02:22:44PM -0700, Ben Gardon wrote:
>  static u64 __read_mostly shadow_nx_mask;
> @@ -3597,10 +3592,14 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
>  	if (!VALID_PAGE(*root_hpa))
>  		return;
>  
> -	sp = to_shadow_page(*root_hpa & PT64_BASE_ADDR_MASK);
> -	--sp->root_count;
> -	if (!sp->root_count && sp->role.invalid)
> -		kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
> +	if (is_tdp_mmu_root(kvm, *root_hpa)) {
> +		kvm_tdp_mmu_put_root_hpa(kvm, *root_hpa);
> +	} else {
> +		sp = to_shadow_page(*root_hpa & PT64_BASE_ADDR_MASK);
> +		--sp->root_count;
> +		if (!sp->root_count && sp->role.invalid)
> +			kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);

Hmm, I see that future patches use put_tdp_mmu_root()/get_tdp_mmu_root(),
but the code itself isn't specific to the TDP MMU.  Even if this ends up
being the only non-TDP user of get/put, I think it'd be worth making them
common helpers, e.g.

	sp = to_shadow_page(*root_hpa & PT64_BASE_ADDR_MASK);
	if (mmu_put_root(sp)) {
		if (is_tdp_mmu(...))
			kvm_tdp_mmu_free_root(kvm, sp);
		else if (sp->role.invalid)
			kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
	}
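
A rough user-space sketch of what the common get/put helpers could look
like.  The struct and the toy_* names are placeholders invented here, not
the real kvm_mmu_page API, and locking is ignored; the only point is that
"put" reports whether the last reference was dropped so the caller can
pick the TDP or shadow teardown path.

```c
#include <stdbool.h>

/* Hypothetical, stripped-down stand-in for struct kvm_mmu_page. */
struct toy_mmu_page {
	int root_count;
	bool invalid;
	bool tdp_mmu_page;
};

/* Take a reference on a root that is already in use. */
static inline void toy_mmu_get_root(struct toy_mmu_page *sp)
{
	sp->root_count++;
}

/* Drop a reference; returns true if that was the last one. */
static inline bool toy_mmu_put_root(struct toy_mmu_page *sp)
{
	return --sp->root_count == 0;
}
```

The caller stays in charge of what "last reference" means: free the root
for the TDP MMU, or zap it only when role.invalid is set for the shadow
MMU.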

> +	}
>  
>  	*root_hpa = INVALID_PAGE;
>  }
> @@ -3691,7 +3690,13 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
>  	unsigned i;
>  
>  	if (shadow_root_level >= PT64_ROOT_4LEVEL) {
> -		root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level, true);
> +		if (vcpu->kvm->arch.tdp_mmu_enabled) {

I believe this will break 32-bit NPT.  Or at a minimum, look weird.  It'd
be better to explicitly disable the TDP MMU on 32-bit KVM, then this becomes

	if (vcpu->kvm->arch.tdp_mmu_enabled) {

	} else if (shadow_root_level >= PT64_ROOT_4LEVEL) {

	} else {

	}

> +			root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
> +		} else {
> +			root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level,
> +					      true);
> +		}

May not matter in the end, but the braces aren't needed.

> +
>  		if (!VALID_PAGE(root))
>  			return -ENOSPC;
>  		vcpu->arch.mmu->root_hpa = root;
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 65bb110847858..530b7d893c7b3 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -41,8 +41,12 @@ struct kvm_mmu_page {
>  
>  	/* Number of writes since the last time traversal visited this page.  */
>  	atomic_t write_flooding_count;
> +
> +	bool tdp_mmu_page;
>  };
>  
> +extern struct kmem_cache *mmu_page_header_cache;
> +
>  static inline struct kvm_mmu_page *to_shadow_page(hpa_t shadow_page)
>  {
>  	struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
> @@ -69,6 +73,11 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
>  	(((address) >> PT64_LEVEL_SHIFT(level)) & ((1 << PT64_LEVEL_BITS) - 1))
>  #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
>  
> +#define ACC_EXEC_MASK    1
> +#define ACC_WRITE_MASK   PT_WRITABLE_MASK
> +#define ACC_USER_MASK    PT_USER_MASK
> +#define ACC_ALL          (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)
> +
>  /* Functions for interpreting SPTEs */
>  kvm_pfn_t spte_to_pfn(u64 pte);
>  bool is_mmio_spte(u64 spte);
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 8241e18c111e6..cdca829e42040 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1,5 +1,7 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>  
> +#include "mmu.h"
> +#include "mmu_internal.h"
>  #include "tdp_mmu.h"
>  
>  static bool __read_mostly tdp_mmu_enabled = true;
> @@ -25,10 +27,165 @@ void kvm_mmu_init_tdp_mmu(struct kvm *kvm)
>  
>  	/* This should not be changed for the lifetime of the VM. */
>  	kvm->arch.tdp_mmu_enabled = true;
> +
> +	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
>  }
>  
>  void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
>  {
>  	if (!kvm->arch.tdp_mmu_enabled)
>  		return;
> +
> +	WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
> +}
> +
> +#define for_each_tdp_mmu_root(_kvm, _root)			    \
> +	list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link)
> +
> +bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
> +{
> +	struct kvm_mmu_page *root;
> +
> +	if (!kvm->arch.tdp_mmu_enabled)
> +		return false;
> +
> +	root = to_shadow_page(hpa);
> +
> +	if (WARN_ON(!root))
> +		return false;
> +
> +	return root->tdp_mmu_page;

Why all the extra checks?

> +}
> +
> +static void free_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
> +{
> +	lockdep_assert_held(&kvm->mmu_lock);
> +
> +	WARN_ON(root->root_count);
> +	WARN_ON(!root->tdp_mmu_page);
> +
> +	list_del(&root->link);
> +
> +	free_page((unsigned long)root->spt);
> +	kmem_cache_free(mmu_page_header_cache, root);
> +}
> +
> +static void put_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
> +{
> +	lockdep_assert_held(&kvm->mmu_lock);
> +
> +	root->root_count--;
> +	if (!root->root_count)
> +		free_tdp_mmu_root(kvm, root);
> +}
> +
> +static void get_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
> +{
> +	lockdep_assert_held(&kvm->mmu_lock);
> +	WARN_ON(!root->root_count);
> +
> +	root->root_count++;
> +}
> +
> +void kvm_tdp_mmu_put_root_hpa(struct kvm *kvm, hpa_t root_hpa)
> +{
> +	struct kvm_mmu_page *root;
> +
> +	root = to_shadow_page(root_hpa);
> +
> +	if (WARN_ON(!root))
> +		return;
> +
> +	put_tdp_mmu_root(kvm, root);
> +}
> +
> +static struct kvm_mmu_page *find_tdp_mmu_root_with_role(
> +		struct kvm *kvm, union kvm_mmu_page_role role)
> +{
> +	struct kvm_mmu_page *root;
> +
> +	lockdep_assert_held(&kvm->mmu_lock);
> +	for_each_tdp_mmu_root(kvm, root) {
> +		WARN_ON(!root->root_count);
> +
> +		if (root->role.word == role.word)
> +			return root;
> +	}
> +
> +	return NULL;
> +}
> +
> +static struct kvm_mmu_page *alloc_tdp_mmu_root(struct kvm_vcpu *vcpu,
> +					       union kvm_mmu_page_role role)
> +{
> +	struct kvm_mmu_page *new_root;
> +	struct kvm_mmu_page *root;
> +
> +	new_root = kvm_mmu_memory_cache_alloc(
> +			&vcpu->arch.mmu_page_header_cache);
> +	new_root->spt = kvm_mmu_memory_cache_alloc(
> +			&vcpu->arch.mmu_shadow_page_cache);
> +	set_page_private(virt_to_page(new_root->spt), (unsigned long)new_root);
> +
> +	new_root->role.word = role.word;
> +	new_root->root_count = 1;
> +	new_root->gfn = 0;
> +	new_root->tdp_mmu_page = true;
> +
> +	spin_lock(&vcpu->kvm->mmu_lock);
> +
> +	/* Check that no matching root exists before adding this one. */
> +	root = find_tdp_mmu_root_with_role(vcpu->kvm, role);
> +	if (root) {
> +		get_tdp_mmu_root(vcpu->kvm, root);
> +		spin_unlock(&vcpu->kvm->mmu_lock);

Hrm, I'm not a big fan of dropping locks in the middle of functions, but the
alternatives aren't great.  :-/  Best I can come up with is

	if (root)
		get_tdp_mmu_root()
	else
		list_add();

	spin_unlock();

	if (root) {
		free_page()
		kmem_cache_free()
	} else {
		root = new_root;
	}

	return root;

Not sure that's any better.

> +		free_page((unsigned long)new_root->spt);
> +		kmem_cache_free(mmu_page_header_cache, new_root);
> +		return root;
> +	}
> +
> +	list_add(&new_root->link, &vcpu->kvm->arch.tdp_mmu_roots);
> +	spin_unlock(&vcpu->kvm->mmu_lock);
> +
> +	return new_root;
> +}
> +
> +static struct kvm_mmu_page *get_tdp_mmu_vcpu_root(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_mmu_page *root;
> +	union kvm_mmu_page_role role;
> +
> +	role = vcpu->arch.mmu->mmu_role.base;
> +	role.level = vcpu->arch.mmu->shadow_root_level;
> +	role.direct = true;
> +	role.gpte_is_8_bytes = true;
> +	role.access = ACC_ALL;
> +
> +	spin_lock(&vcpu->kvm->mmu_lock);
> +
> +	/* Search for an already allocated root with the same role. */
> +	root = find_tdp_mmu_root_with_role(vcpu->kvm, role);
> +	if (root) {
> +		get_tdp_mmu_root(vcpu->kvm, root);
> +		spin_unlock(&vcpu->kvm->mmu_lock);

Rather than manually unlock and return, this can be

	if (root)
		get_tdp_mmu_root();

	spin_unlock()

	if (!root)
		root = alloc_tdp_mmu_root();

	return root;

You could also add a helper to do the "get" along with the "find".  Not sure
if that's worth the code.
	
> +		return root;
> +	}
> +
> +	spin_unlock(&vcpu->kvm->mmu_lock);
> +
> +	/* If there is no appropriate root, allocate one. */
> +	root = alloc_tdp_mmu_root(vcpu, role);
> +
> +	return root;
> +}
> +
> +hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_mmu_page *root;
> +
> +	root = get_tdp_mmu_vcpu_root(vcpu);
> +	if (!root)
> +		return INVALID_PAGE;
> +
> +	return __pa(root->spt);
>  }
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index dd3764f5a9aa3..9274debffeaa1 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -7,4 +7,9 @@
>  
>  void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
>  void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
> +
> +bool is_tdp_mmu_root(struct kvm *kvm, hpa_t root);
> +hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
> +void kvm_tdp_mmu_put_root_hpa(struct kvm *kvm, hpa_t root_hpa);
> +
>  #endif /* __KVM_X86_MMU_TDP_MMU_H */
> -- 
> 2.28.0.709.gb0816b6eb0-goog
> 

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 06/22] kvm: mmu: Make address space ID a property of memslots
  2020-09-25 21:22 ` [PATCH 06/22] kvm: mmu: Make address space ID a property of memslots Ben Gardon
@ 2020-09-30  6:10   ` Sean Christopherson
  2020-09-30 23:11     ` Ben Gardon
  0 siblings, 1 reply; 105+ messages in thread
From: Sean Christopherson @ 2020-09-30  6:10 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Cannon Matthews, Paolo Bonzini, Peter Xu,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Fri, Sep 25, 2020 at 02:22:46PM -0700, Ben Gardon wrote:
> Save address space ID as a field in each memslot so that functions that
> do not use rmaps (which implicitly encode the id) can handle multiple
> address spaces correctly.
> 
> Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
> machine. This series introduced no new failures.
> 
> This series can be viewed in Gerrit at:
> 	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  include/linux/kvm_host.h | 1 +
>  virt/kvm/kvm_main.c      | 1 +
>  2 files changed, 2 insertions(+)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 05e3c2fb3ef78..a460bc712a81c 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -345,6 +345,7 @@ struct kvm_memory_slot {
>  	struct kvm_arch_memory_slot arch;
>  	unsigned long userspace_addr;
>  	u32 flags;
> +	int as_id;

Ha!  Peter Xu's dirty ring series also added this.  This should be a u16, it'll
save 8 bytes per memslot (oooooooh).  Any chance you want to include Peter's
patch[*]?  It has some nitpicking from Peter and me regarding what to do
with as_id on deletion.  That would also avoid silent merge conflicts on
Peter's end.

[*] https://lkml.kernel.org/r/20200708193408.242909-2-peterx@redhat.com
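
The size claim can be checked with a standalone sketch.  The two structs below
are simplified, hypothetical stand-ins for the tail of struct kvm_memory_slot
(the real struct has more fields in front), and the numbers assume a typical
LP64 ABI such as x86-64.  With a 16-bit as_id, the three small fields pack
into one 8-byte slot; with an int, 'id' spills over and drags in tail padding.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in: tail of the memslot as posted, with 'int as_id'. */
struct slot_tail_int {
	unsigned long userspace_addr;
	uint32_t flags;
	int as_id;
	short id;	/* followed by padding up to the struct's alignment */
};

/* Same fields, but with a 16-bit as_id as suggested in the review. */
struct slot_tail_u16 {
	unsigned long userspace_addr;
	uint32_t flags;
	uint16_t as_id;
	short id;	/* flags + as_id + id now fill exactly 8 bytes */
};

/* Bytes saved per slot by narrowing as_id; depends on the target ABI. */
static size_t bytes_saved(void)
{
	return sizeof(struct slot_tail_int) - sizeof(struct slot_tail_u16);
}
```

Where unsigned long is 8 bytes with 8-byte alignment, the first struct is 24
bytes and the second is 16, which matches the 8 bytes quoted above.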

>  	short id;
>  };
>  
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index cf88233b819a0..f9c80351c9efd 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1318,6 +1318,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
>  	new.npages = mem->memory_size >> PAGE_SHIFT;
>  	new.flags = mem->flags;
>  	new.userspace_addr = mem->userspace_addr;
> +	new.as_id = as_id;
>  
>  	if (new.npages > KVM_MEM_MAX_NR_PAGES)
>  		return -EINVAL;
> -- 
> 2.28.0.709.gb0816b6eb0-goog
> 


* Re: [PATCH 07/22] kvm: mmu: Support zapping SPTEs in the TDP MMU
  2020-09-25 21:22 ` [PATCH 07/22] kvm: mmu: Support zapping SPTEs in the TDP MMU Ben Gardon
  2020-09-26  0:14   ` Paolo Bonzini
@ 2020-09-30  6:15   ` Sean Christopherson
  2020-09-30  6:28     ` Paolo Bonzini
  1 sibling, 1 reply; 105+ messages in thread
From: Sean Christopherson @ 2020-09-30  6:15 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Cannon Matthews, Paolo Bonzini, Peter Xu,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Fri, Sep 25, 2020 at 02:22:47PM -0700, Ben Gardon wrote:
> Add functions to zap SPTEs to the TDP MMU. These are needed to tear down
> TDP MMU roots properly and implement other MMU functions which require
> tearing down mappings. Future patches will add functions to populate the
> page tables, but as for this patch there will not be any work for these
> functions to do.
> 
> Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
> machine. This series introduced no new failures.
> 
> This series can be viewed in Gerrit at:
> 	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c      |  15 +++++
>  arch/x86/kvm/mmu/tdp_iter.c |  17 ++++++
>  arch/x86/kvm/mmu/tdp_iter.h |   1 +
>  arch/x86/kvm/mmu/tdp_mmu.c  | 106 ++++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/mmu/tdp_mmu.h  |   2 +
>  5 files changed, 141 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f09081f9137b0..7a17cca19b0c1 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5852,6 +5852,10 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
>  	kvm_reload_remote_mmus(kvm);
>  
>  	kvm_zap_obsolete_pages(kvm);
> +
> +	if (kvm->arch.tdp_mmu_enabled)
> +		kvm_tdp_mmu_zap_all(kvm);

Haven't looked into how this works; is kvm_tdp_mmu_zap_all() additive to
what is done by the legacy zapping, or is it a replacement?

> +
>  	spin_unlock(&kvm->mmu_lock);
>  }
> @@ -57,8 +58,13 @@ bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
>  	return root->tdp_mmu_page;
>  }
>  
> +static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> +			  gfn_t start, gfn_t end);
> +
>  static void free_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
>  {
> +	gfn_t max_gfn = 1ULL << (boot_cpu_data.x86_phys_bits - PAGE_SHIFT);

BIT_ULL(...)

> +
>  	lockdep_assert_held(&kvm->mmu_lock);
>  
>  	WARN_ON(root->root_count);
> @@ -66,6 +72,8 @@ static void free_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
>  
>  	list_del(&root->link);
>  
> +	zap_gfn_range(kvm, root, 0, max_gfn);
> +
>  	free_page((unsigned long)root->spt);
>  	kmem_cache_free(mmu_page_header_cache, root);
>  }
> @@ -193,6 +201,11 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
>  static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>  				u64 old_spte, u64 new_spte, int level);
>  
> +static int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
> +{
> +	return sp->role.smm ? 1 : 0;
> +}
> +
>  /**
>   * handle_changed_spte - handle bookkeeping associated with an SPTE change
>   * @kvm: kvm instance
> @@ -294,3 +307,96 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>  		free_page((unsigned long)pt);
>  	}
>  }
> +
> +#define for_each_tdp_pte_root(_iter, _root, _start, _end) \
> +	for_each_tdp_pte(_iter, _root->spt, _root->role.level, _start, _end)
> +
> +/*
> + * If the MMU lock is contended or this thread needs to yield, flushes
> + * the TLBs, releases the MMU lock, yields, reacquires the MMU lock,
> + * restarts the tdp_iter's walk from the root, and returns true.
> + * If no yield is needed, returns false.
> + */
> +static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
> +{
> +	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
> +		kvm_flush_remote_tlbs(kvm);
> +		cond_resched_lock(&kvm->mmu_lock);
> +		tdp_iter_refresh_walk(iter);
> +		return true;
> +	} else {
> +		return false;
> +	}

Kernel style is to not bother with an "else" if the "if" returns.

> +}
> +
> +/*
> + * Tears down the mappings for the range of gfns, [start, end), and frees the
> + * non-root pages mapping GFNs strictly within that range. Returns true if
> + * SPTEs have been cleared and a TLB flush is needed before releasing the
> + * MMU lock.
> + */
> +static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> +			  gfn_t start, gfn_t end)
> +{
> +	struct tdp_iter iter;
> +	bool flush_needed = false;
> +	int as_id = kvm_mmu_page_as_id(root);
> +
> +	for_each_tdp_pte_root(iter, root, start, end) {
> +		if (!is_shadow_present_pte(iter.old_spte))
> +			continue;
> +
> +		/*
> +		 * If this is a non-last-level SPTE that covers a larger range
> +		 * than should be zapped, continue, and zap the mappings at a
> +		 * lower level.
> +		 */
> +		if ((iter.gfn < start ||
> +		     iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
> +		    !is_last_spte(iter.old_spte, iter.level))
> +			continue;
> +
> +		*iter.sptep = 0;
> +		handle_changed_spte(kvm, as_id, iter.gfn, iter.old_spte, 0,
> +				    iter.level);
> +
> +		flush_needed = !tdp_mmu_iter_cond_resched(kvm, &iter);
> +	}
> +	return flush_needed;
> +}
> +
> +/*
> + * Tears down the mappings for the range of gfns, [start, end), and frees the
> + * non-root pages mapping GFNs strictly within that range. Returns true if
> + * SPTEs have been cleared and a TLB flush is needed before releasing the
> + * MMU lock.
> + */
> +bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +	struct kvm_mmu_page *root;
> +	bool flush = false;
> +
> +	for_each_tdp_mmu_root(kvm, root) {
> +		/*
> +		 * Take a reference on the root so that it cannot be freed if
> +		 * this thread releases the MMU lock and yields in this loop.
> +		 */
> +		get_tdp_mmu_root(kvm, root);
> +
> +		flush = zap_gfn_range(kvm, root, start, end) || flush;
> +
> +		put_tdp_mmu_root(kvm, root);
> +	}
> +
> +	return flush;
> +}
> +
> +void kvm_tdp_mmu_zap_all(struct kvm *kvm)
> +{
> +	gfn_t max_gfn = 1ULL << (boot_cpu_data.x86_phys_bits - PAGE_SHIFT);

BIT_ULL

> +	bool flush;
> +
> +	flush = kvm_tdp_mmu_zap_gfn_range(kvm, 0, max_gfn);
> +	if (flush)
> +		kvm_flush_remote_tlbs(kvm);
> +}
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 9274debffeaa1..cb86f9fe69017 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -12,4 +12,6 @@ bool is_tdp_mmu_root(struct kvm *kvm, hpa_t root);
>  hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
>  void kvm_tdp_mmu_put_root_hpa(struct kvm *kvm, hpa_t root_hpa);
>  
> +bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end);
> +void kvm_tdp_mmu_zap_all(struct kvm *kvm);
>  #endif /* __KVM_X86_MMU_TDP_MMU_H */
> -- 
> 2.28.0.709.gb0816b6eb0-goog
> 


* Re: [PATCH 00/22] Introduce the TDP MMU
  2020-09-25 21:22 [PATCH 00/22] Introduce the TDP MMU Ben Gardon
                   ` (23 preceding siblings ...)
  2020-09-28 17:31 ` Paolo Bonzini
@ 2020-09-30  6:19 ` Sean Christopherson
  2020-09-30  6:30   ` Paolo Bonzini
  24 siblings, 1 reply; 105+ messages in thread
From: Sean Christopherson @ 2020-09-30  6:19 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Cannon Matthews, Paolo Bonzini, Peter Xu,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

In case Paolo is feeling trigger happy, I'm going to try and get through the
second half of this series tomorrow.


* Re: [PATCH 02/22] kvm: mmu: Introduce tdp_iter
  2020-09-30  5:24   ` Sean Christopherson
@ 2020-09-30  6:24     ` Paolo Bonzini
  0 siblings, 0 replies; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-30  6:24 UTC (permalink / raw)
  To: Sean Christopherson, Ben Gardon
  Cc: linux-kernel, kvm, Cannon Matthews, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 30/09/20 07:24, Sean Christopherson wrote:
> Maybe use the params, if only to avoid the line wrap?
> 
> 	iter->gfn = goal_gfn - (goal_gfn % KVM_PAGES_PER_HPAGE(root_level));
> 
> Actually, peeking further into the file, this calculation is repeated in both
> try_step_up and try_step_down,  probably worth adding a helper of some form.

Also it's written more concisely as

	iter->gfn = goal_gfn & -KVM_PAGES_PER_HPAGE(iter->level);
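
A minimal userspace sketch of why the two spellings agree, assuming the page
count is a power of two (as KVM_PAGES_PER_HPAGE() is).  The function names and
the local gfn_t typedef are illustrative only and are not part of the series.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Round down by subtracting the remainder, as in the first suggestion. */
static gfn_t align_down_mod(gfn_t gfn, gfn_t pages)
{
	return gfn - (gfn % pages);
}

/*
 * Round down with a mask.  For an unsigned power of two, -pages equals
 * ~(pages - 1), i.e. all bits set except the low log2(pages) bits.
 */
static gfn_t align_down_mask(gfn_t gfn, gfn_t pages)
{
	assert(pages != 0 && (pages & (pages - 1)) == 0);
	return gfn & -pages;
}
```

The mask form relies on unsigned arithmetic; negating a signed page count
would not be a safe substitute.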

> 
> 
> 	bool done;
> 
> 	if (try_step_down(iter))
> 		return;
> 
> 	do {
> 		done = try_step_side(iter);
> 	} while (!done && try_step_up(iter));
> 
> 	iter->valid = done;

I pointed out something similar in my review; my version was

	if (try_step_down(iter))
		return;

	do {
		if (try_step_side(iter))
			return;
	} while (try_step_up(iter));
	iter->valid = false;
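
To compare the two control flows outside the kernel, here is a toy pre-order
walk over a small, fully populated table hierarchy.  Everything in it
(struct toy_iter, FANOUT, the level constants) is a made-up stand-in for the
real tdp_iter, which only descends through present non-leaf SPTEs; the sketch
just shows that the flag-based and early-return forms visit the same entries.

```c
#include <stdbool.h>

#define FANOUT		2	/* entries per table; 512 on real x86 */
#define ROOT_LEVEL	3
#define MIN_LEVEL	1

struct toy_iter {
	int level;		/* ROOT_LEVEL at the top, MIN_LEVEL at the leaves */
	unsigned int index;	/* position among all entries at this level */
	bool valid;
};

static bool try_step_down(struct toy_iter *it)
{
	if (it->level == MIN_LEVEL)
		return false;
	it->level--;
	it->index *= FANOUT;	/* first entry of the child table */
	return true;
}

static bool try_step_side(struct toy_iter *it)
{
	if ((it->index + 1) % FANOUT == 0)	/* last entry in this table */
		return false;
	it->index++;
	return true;
}

static bool try_step_up(struct toy_iter *it)
{
	if (it->level == ROOT_LEVEL)
		return false;
	it->level++;
	it->index /= FANOUT;	/* the entry that points at this table */
	return true;
}

/* Flag-based form. */
static void next_with_flag(struct toy_iter *it)
{
	bool done;

	if (try_step_down(it))
		return;

	do {
		done = try_step_side(it);
	} while (!done && try_step_up(it));

	it->valid = done;
}

/* Early-return form. */
static void next_with_return(struct toy_iter *it)
{
	if (try_step_down(it))
		return;

	do {
		if (try_step_side(it))
			return;
	} while (try_step_up(it));

	it->valid = false;
}

/* Count the entries visited by a full walk with the given step function. */
static int count_entries(void (*next)(struct toy_iter *))
{
	struct toy_iter it = { .level = ROOT_LEVEL, .index = 0, .valid = true };
	int n = 0;

	for (; it.valid; next(&it))
		n++;
	return n;
}
```

With FANOUT 2 and three levels there are 2 + 4 + 8 = 14 entries, and both
step functions visit all of them.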

Paolo



* Re: [PATCH 04/22] kvm: mmu: Allocate and free TDP MMU roots
  2020-09-30  6:06   ` Sean Christopherson
@ 2020-09-30  6:26     ` Paolo Bonzini
  2020-09-30 15:38       ` Sean Christopherson
  2020-10-12 22:59     ` Ben Gardon
  1 sibling, 1 reply; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-30  6:26 UTC (permalink / raw)
  To: Sean Christopherson, Ben Gardon
  Cc: linux-kernel, kvm, Cannon Matthews, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 30/09/20 08:06, Sean Christopherson wrote:
>> +static struct kvm_mmu_page *alloc_tdp_mmu_root(struct kvm_vcpu *vcpu,
>> +					       union kvm_mmu_page_role role)
>> +{
>> +	struct kvm_mmu_page *new_root;
>> +	struct kvm_mmu_page *root;
>> +
>> +	new_root = kvm_mmu_memory_cache_alloc(
>> +			&vcpu->arch.mmu_page_header_cache);
>> +	new_root->spt = kvm_mmu_memory_cache_alloc(
>> +			&vcpu->arch.mmu_shadow_page_cache);
>> +	set_page_private(virt_to_page(new_root->spt), (unsigned long)new_root);
>> +
>> +	new_root->role.word = role.word;
>> +	new_root->root_count = 1;
>> +	new_root->gfn = 0;
>> +	new_root->tdp_mmu_page = true;
>> +
>> +	spin_lock(&vcpu->kvm->mmu_lock);
>> +
>> +	/* Check that no matching root exists before adding this one. */
>> +	root = find_tdp_mmu_root_with_role(vcpu->kvm, role);
>> +	if (root) {
>> +		get_tdp_mmu_root(vcpu->kvm, root);
>> +		spin_unlock(&vcpu->kvm->mmu_lock);
> Hrm, I'm not a big fan of dropping locks in the middle of functions, but the
> alternatives aren't great.  :-/  Best I can come up with is
> 
> 	if (root)
> 		get_tdp_mmu_root()
> 	else
> 		list_add();
> 
> 	spin_unlock();
> 
> 	if (root) {
> 		free_page()
> 		kmem_cache_free()
> 	} else {
> 		root = new_root;
> 	}
> 
> 	return root;
> 
> Not sure that's any better.
> 
>> +		free_page((unsigned long)new_root->spt);
>> +		kmem_cache_free(mmu_page_header_cache, new_root);
>> +		return root;
>> +	}
>> +
>> +	list_add(&new_root->link, &vcpu->kvm->arch.tdp_mmu_roots);
>> +	spin_unlock(&vcpu->kvm->mmu_lock);
>> +
>> +	return new_root;
>> +}
>> +
>> +static struct kvm_mmu_page *get_tdp_mmu_vcpu_root(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_mmu_page *root;
>> +	union kvm_mmu_page_role role;
>> +
>> +	role = vcpu->arch.mmu->mmu_role.base;
>> +	role.level = vcpu->arch.mmu->shadow_root_level;
>> +	role.direct = true;
>> +	role.gpte_is_8_bytes = true;
>> +	role.access = ACC_ALL;
>> +
>> +	spin_lock(&vcpu->kvm->mmu_lock);
>> +
>> +	/* Search for an already allocated root with the same role. */
>> +	root = find_tdp_mmu_root_with_role(vcpu->kvm, role);
>> +	if (root) {
>> +		get_tdp_mmu_root(vcpu->kvm, root);
>> +		spin_unlock(&vcpu->kvm->mmu_lock);
> Rather than manually unlock and return, this can be
> 
> 	if (root)
> 		get_tdp_mmu_root();
> 
> 	spin_unlock()
> 
> 	if (!root)
> 		root = alloc_tdp_mmu_root();
> 
> 	return root;
> 
> You could also add a helper to do the "get" along with the "find".  Not sure
> if that's worth the code.

All in all I don't think it's any clearer than Ben's code.  At least in
his case the "if"s clearly point at the double-checked locking pattern.

Paolo



* Re: [PATCH 07/22] kvm: mmu: Support zapping SPTEs in the TDP MMU
  2020-09-30  6:15   ` Sean Christopherson
@ 2020-09-30  6:28     ` Paolo Bonzini
  0 siblings, 0 replies; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-30  6:28 UTC (permalink / raw)
  To: Sean Christopherson, Ben Gardon
  Cc: linux-kernel, kvm, Cannon Matthews, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 30/09/20 08:15, Sean Christopherson wrote:
>>  	kvm_zap_obsolete_pages(kvm);
>> +
>> +	if (kvm->arch.tdp_mmu_enabled)
>> +		kvm_tdp_mmu_zap_all(kvm);
> 
> Haven't looked into how this works; is kvm_tdp_mmu_zap_all() additive to
> what is done by the legacy zapping, or is it a replacement?

It's additive because the shadow MMU is still used for nesting.

>> +
>>  	spin_unlock(&kvm->mmu_lock);
>>  }
>> @@ -57,8 +58,13 @@ bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
>>  	return root->tdp_mmu_page;
>>  }
>>  
>> +static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>> +			  gfn_t start, gfn_t end);
>> +
>>  static void free_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
>>  {
>> +	gfn_t max_gfn = 1ULL << (boot_cpu_data.x86_phys_bits - PAGE_SHIFT);
> 
> BIT_ULL(...)

Not sure about that.  Here the point is not to have a single bit, but to
do a power of two.  Same for the version below.
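
For reference, both spellings compute the same value, so the question is only
which one reads better.  The sketch below is a userspace approximation:
PAGE_SHIFT and BIT_ULL() are redefined locally, the function names are
invented for illustration, and phys_bits is assumed to be at least PAGE_SHIFT.

```c
#include <stdint.h>

#define PAGE_SHIFT	12
#define BIT_ULL(nr)	(1ULL << (nr))	/* local copy for illustration */

typedef uint64_t gfn_t;

/* Number of 4K frames below 2^phys_bits, written as a power of two. */
static gfn_t max_gfn_shift(unsigned int phys_bits)
{
	return 1ULL << (phys_bits - PAGE_SHIFT);
}

/* The same value written with the bit macro. */
static gfn_t max_gfn_bit(unsigned int phys_bits)
{
	return BIT_ULL(phys_bits - PAGE_SHIFT);
}
```

The shift form reads as "two to the power of N frames", which is the intent
here; the macro form reads as "bit N", which suggests a flag.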

>> + * If the MMU lock is contended or this thread needs to yield, flushes
>> + * the TLBs, releases the MMU lock, yields, reacquires the MMU lock,
>> + * restarts the tdp_iter's walk from the root, and returns true.
>> + * If no yield is needed, returns false.
>> + */
>> +static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
>> +{
>> +	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
>> +		kvm_flush_remote_tlbs(kvm);
>> +		cond_resched_lock(&kvm->mmu_lock);
>> +		tdp_iter_refresh_walk(iter);
>> +		return true;
>> +	} else {
>> +		return false;
>> +	}
> 
> Kernel style is to not bother with an "else" if the "if" returns.

I have rewritten all of this in my version anyway. :)

Paolo



* Re: [PATCH 00/22] Introduce the TDP MMU
  2020-09-30  6:19 ` Sean Christopherson
@ 2020-09-30  6:30   ` Paolo Bonzini
  0 siblings, 0 replies; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-30  6:30 UTC (permalink / raw)
  To: Sean Christopherson, Ben Gardon
  Cc: linux-kernel, kvm, Cannon Matthews, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 30/09/20 08:19, Sean Christopherson wrote:
> In case Paolo is feeling trigger happy, I'm going to try and get through the
> second half of this series tomorrow.

I'm indeed feeling trigger happy about this series, but I wasn't
planning to include it in kvm.git this week.  I'll have my version
posted by tomorrow, and I'll include some of your feedback already when
it does not make incremental review too much harder.

Paolo



* Re: [PATCH 04/22] kvm: mmu: Allocate and free TDP MMU roots
  2020-09-30  6:26     ` Paolo Bonzini
@ 2020-09-30 15:38       ` Sean Christopherson
  0 siblings, 0 replies; 105+ messages in thread
From: Sean Christopherson @ 2020-09-30 15:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Ben Gardon, linux-kernel, kvm, Cannon Matthews, Peter Xu,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Wed, Sep 30, 2020 at 08:26:28AM +0200, Paolo Bonzini wrote:
> On 30/09/20 08:06, Sean Christopherson wrote:
> >> +static struct kvm_mmu_page *get_tdp_mmu_vcpu_root(struct kvm_vcpu *vcpu)
> >> +{
> >> +	struct kvm_mmu_page *root;
> >> +	union kvm_mmu_page_role role;
> >> +
> >> +	role = vcpu->arch.mmu->mmu_role.base;
> >> +	role.level = vcpu->arch.mmu->shadow_root_level;
> >> +	role.direct = true;
> >> +	role.gpte_is_8_bytes = true;
> >> +	role.access = ACC_ALL;
> >> +
> >> +	spin_lock(&vcpu->kvm->mmu_lock);
> >> +
> >> +	/* Search for an already allocated root with the same role. */
> >> +	root = find_tdp_mmu_root_with_role(vcpu->kvm, role);
> >> +	if (root) {
> >> +		get_tdp_mmu_root(vcpu->kvm, root);
> >> +		spin_unlock(&vcpu->kvm->mmu_lock);
> > Rather than manually unlock and return, this can be
> > 
> > 	if (root)
> > 		get_tdp_mmu_root();
> > 
> > 	spin_unlock()
> > 
> > 	if (!root)
> > 		root = alloc_tdp_mmu_root();
> > 
> > 	return root;
> > 
> > You could also add a helper to do the "get" along with the "find".  Not sure
> > if that's worth the code.
> 
> All in all I don't think it's any clearer than Ben's code.  At least in
> his case the "if"s clearly point at the double-checked locking pattern.

Actually, why is this even dropping the lock to do the alloc?  The allocs are
coming from the caches, which are designed to be invoked while holding the
spin lock.

Also relevant is that, other than this code, the only user of
find_tdp_mmu_root_with_role() is kvm_tdp_mmu_root_hpa_for_role(), and that
helper is itself unused.  I.e. the "find" can be open coded.

Putting those two together yields this, which IMO is much cleaner.

static struct kvm_mmu_page *get_tdp_mmu_vcpu_root(struct kvm_vcpu *vcpu)
{
	union kvm_mmu_page_role role;
	struct kvm *kvm = vcpu->kvm;
	struct kvm_mmu_page *root;

	role = page_role_for_level(vcpu, vcpu->arch.mmu->shadow_root_level);

	spin_lock(&kvm->mmu_lock);

	/* Check for an existing root before allocating a new one. */
	for_each_tdp_mmu_root(kvm, root) {
		if (root->role.word == role.word) {
			get_tdp_mmu_root(root);
			spin_unlock(&kvm->mmu_lock);
			return root;
		}
	}

	root = alloc_tdp_mmu_page(vcpu, 0, vcpu->arch.mmu->shadow_root_level);
	root->root_count = 1;

	list_add(&root->link, &kvm->arch.tdp_mmu_roots);

	spin_unlock(&kvm->mmu_lock);

	return root;
}
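
The same find-or-create shape can be tried in userspace.  This is only a
sketch under loud assumptions: a pthread mutex stands in for mmu_lock, a
singly linked list for tdp_mmu_roots, and calloc() for the pre-filled MMU
caches (calloc() is not something to call under a real spinlock; the point in
the kernel is that the caches never sleep).  All toy_* names are hypothetical.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

struct toy_root {
	uint32_t role_word;	/* stands in for role.word */
	int root_count;		/* reference count */
	struct toy_root *next;
};

struct toy_vm {
	pthread_mutex_t lock;	/* stands in for kvm->mmu_lock */
	struct toy_root *roots;	/* stands in for kvm->arch.tdp_mmu_roots */
};

/* Return a referenced root for the role, creating it if none exists. */
static struct toy_root *toy_get_root(struct toy_vm *vm, uint32_t role_word)
{
	struct toy_root *root;

	pthread_mutex_lock(&vm->lock);

	/* Check for an existing root before allocating a new one. */
	for (root = vm->roots; root; root = root->next) {
		if (root->role_word == role_word) {
			root->root_count++;
			goto out;
		}
	}

	/* Lookup and insert happen under one lock hold, so no re-check. */
	root = calloc(1, sizeof(*root));
	if (root) {
		root->role_word = role_word;
		root->root_count = 1;
		root->next = vm->roots;
		vm->roots = root;
	}
out:
	pthread_mutex_unlock(&vm->lock);
	return root;
}
```

Because the lock is never dropped between the search and the insert, the
second lookup from the double-checked version is not needed.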



* Re: [PATCH 09/22] kvm: mmu: Remove disallowed_hugepage_adjust shadow_walk_iterator arg
  2020-09-25 21:22 ` [PATCH 09/22] kvm: mmu: Remove disallowed_hugepage_adjust shadow_walk_iterator arg Ben Gardon
@ 2020-09-30 16:19   ` Sean Christopherson
  0 siblings, 0 replies; 105+ messages in thread
From: Sean Christopherson @ 2020-09-30 16:19 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Cannon Matthews, Paolo Bonzini, Peter Xu,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Fri, Sep 25, 2020 at 02:22:49PM -0700, Ben Gardon wrote:
> In order to avoid creating executable hugepages in the TDP MMU PF
> handler, remove the dependency between disallowed_hugepage_adjust and
> the shadow_walk_iterator. This will open the function up to being used
> by the TDP MMU PF handler in a future patch.
> 
> Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
> machine. This series introduced no new failures.
> 
> This series can be viewed in Gerrit at:
> 	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c         | 17 +++++++++--------
>  arch/x86/kvm/mmu/paging_tmpl.h |  3 ++-
>  2 files changed, 11 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6344e7863a0f5..f6e6fc9959c04 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3295,13 +3295,12 @@ static int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
>  	return level;
>  }
>  
> -static void disallowed_hugepage_adjust(struct kvm_shadow_walk_iterator it,
> -				       gfn_t gfn, kvm_pfn_t *pfnp, int *levelp)
> +static void disallowed_hugepage_adjust(u64 spte, gfn_t gfn, int cur_level,
> +					kvm_pfn_t *pfnp, int *goal_levelp)
>  {
> -	int level = *levelp;
> -	u64 spte = *it.sptep;
> +	int goal_level = *goal_levelp;

More bikeshedding...  Can we keep 'level' and 'levelp' instead of prepending
'goal_'?  I get the intent, but goal_level isn't necessarily accurate as the
original requested level (referred to as req_level in the caller in kvm/queue)
may be different than goal_level, i.e. this helper may have already decremented
the level.  IMO it's more accurate/correct to keep the plain 'level'.

> -	if (it.level == level && level > PG_LEVEL_4K &&
> +	if (cur_level == goal_level && goal_level > PG_LEVEL_4K &&
>  	    is_nx_huge_page_enabled() &&
>  	    is_shadow_present_pte(spte) &&
>  	    !is_large_pte(spte)) {
> @@ -3312,9 +3311,10 @@ static void disallowed_hugepage_adjust(struct kvm_shadow_walk_iterator it,
>  		 * patching back for them into pfn the next 9 bits of
>  		 * the address.
>  		 */
> -		u64 page_mask = KVM_PAGES_PER_HPAGE(level) - KVM_PAGES_PER_HPAGE(level - 1);
> +		u64 page_mask = KVM_PAGES_PER_HPAGE(goal_level) -
> +				KVM_PAGES_PER_HPAGE(goal_level - 1);
>  		*pfnp |= gfn & page_mask;
> -		(*levelp)--;
> +		(*goal_levelp)--;
>  	}
>  }
>  
> @@ -3339,7 +3339,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, int write,
>  		 * We cannot overwrite existing page tables with an NX
>  		 * large page, as the leaf could be executable.
>  		 */
> -		disallowed_hugepage_adjust(it, gfn, &pfn, &level);
> +		disallowed_hugepage_adjust(*it.sptep, gfn, it.level,
> +					   &pfn, &level);
>  
>  		base_gfn = gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1);
>  		if (it.level == level)
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 4dd6b1e5b8cf7..6a8666cb0d24b 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -690,7 +690,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gpa_t addr,
>  		 * We cannot overwrite existing page tables with an NX
>  		 * large page, as the leaf could be executable.
>  		 */
> -		disallowed_hugepage_adjust(it, gw->gfn, &pfn, &hlevel);
> +		disallowed_hugepage_adjust(*it.sptep, gw->gfn, it.level,
> +					   &pfn, &hlevel);
>  
>  		base_gfn = gw->gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1);
>  		if (it.level == hlevel)
> -- 
> 2.28.0.709.gb0816b6eb0-goog
> 


* Re: [PATCH 10/22] kvm: mmu: Add TDP MMU PF handler
  2020-09-25 21:22 ` [PATCH 10/22] kvm: mmu: Add TDP MMU PF handler Ben Gardon
  2020-09-26  0:24   ` Paolo Bonzini
@ 2020-09-30 16:37   ` Sean Christopherson
  2020-09-30 16:55     ` Paolo Bonzini
                       ` (2 more replies)
  1 sibling, 3 replies; 105+ messages in thread
From: Sean Christopherson @ 2020-09-30 16:37 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Cannon Matthews, Paolo Bonzini, Peter Xu,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Fri, Sep 25, 2020 at 02:22:50PM -0700, Ben Gardon wrote:
> @@ -4113,8 +4088,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>  	if (page_fault_handle_page_track(vcpu, error_code, gfn))
>  		return RET_PF_EMULATE;
>  
> -	if (fast_page_fault(vcpu, gpa, error_code))
> -		return RET_PF_RETRY;
> +	if (!is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
> +		if (fast_page_fault(vcpu, gpa, error_code))
> +			return RET_PF_RETRY;

It'll probably be easier to handle is_tdp_mmu() in fast_page_fault().

>  
>  	r = mmu_topup_memory_caches(vcpu, false);
>  	if (r)
> @@ -4139,8 +4115,14 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>  	r = make_mmu_pages_available(vcpu);
>  	if (r)
>  		goto out_unlock;
> -	r = __direct_map(vcpu, gpa, write, map_writable, max_level, pfn,
> -			 prefault, is_tdp && lpage_disallowed);
> +
> +	if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
> +		r = kvm_tdp_mmu_page_fault(vcpu, write, map_writable, max_level,
> +					   gpa, pfn, prefault,
> +					   is_tdp && lpage_disallowed);
> +	else
> +		r = __direct_map(vcpu, gpa, write, map_writable, max_level, pfn,
> +				 prefault, is_tdp && lpage_disallowed);
>  
>  out_unlock:
>  	spin_unlock(&vcpu->kvm->mmu_lock);

...

> +/*
> + * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
> + * page tables and SPTEs to translate the faulting guest physical address.
> + */
> +int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, int write, int map_writable,
> +			   int max_level, gpa_t gpa, kvm_pfn_t pfn,
> +			   bool prefault, bool account_disallowed_nx_lpage)
> +{
> +	struct tdp_iter iter;
> +	struct kvm_mmu_memory_cache *pf_pt_cache =
> +			&vcpu->arch.mmu_shadow_page_cache;
> +	u64 *child_pt;
> +	u64 new_spte;
> +	int ret;
> +	int as_id = kvm_arch_vcpu_memslots_id(vcpu);
> +	gfn_t gfn = gpa >> PAGE_SHIFT;
> +	int level;
> +
> +	if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
> +		return RET_PF_RETRY;

I feel like we should kill off these silly WARNs in the existing code instead
of adding more.  If they actually fired, I'm pretty sure that they would
continue firing and spamming the kernel log until the VM is killed as I don't
see how restarting the guest will magically fix anything.

> +
> +	if (WARN_ON(!is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa)))
> +		return RET_PF_RETRY;

This seems especially gratuitous; this has exactly one caller that explicitly
checks is_tdp_mmu_root().  Again, if this fires it will spam the kernel log
into submission.

> +
> +	level = kvm_mmu_hugepage_adjust(vcpu, gfn, max_level, &pfn);
> +
> +	for_each_tdp_pte_vcpu(iter, vcpu, gfn, gfn + 1) {
> +		disallowed_hugepage_adjust(iter.old_spte, gfn, iter.level,
> +					   &pfn, &level);
> +
> +		if (iter.level == level)
> +			break;
> +
> +		/*
> +		 * If there is an SPTE mapping a large page at a higher level
> +		 * than the target, that SPTE must be cleared and replaced
> +		 * with a non-leaf SPTE.
> +		 */
> +		if (is_shadow_present_pte(iter.old_spte) &&
> +		    is_large_pte(iter.old_spte)) {
> +			*iter.sptep = 0;
> +			handle_changed_spte(vcpu->kvm, as_id, iter.gfn,
> +					    iter.old_spte, 0, iter.level);
> +			kvm_flush_remote_tlbs_with_address(vcpu->kvm, iter.gfn,
> +					KVM_PAGES_PER_HPAGE(iter.level));
> +
> +			/*
> +			 * The iter must explicitly re-read the spte here
> +			 * because the new is needed before the next iteration
> +			 * of the loop.
> +			 */

I think it'd be better to explicitly, and simply, call out that iter.old_spte
is consumed below.  It's subtle enough to warrant a comment, but the comment
didn't actually help me.  Maybe something like:

			/*
			 * Refresh iter.old_spte, it will trigger the !present
			 * path below.
			 */

> +			iter.old_spte = READ_ONCE(*iter.sptep);
> +		}
> +
> +		if (!is_shadow_present_pte(iter.old_spte)) {
> +			child_pt = kvm_mmu_memory_cache_alloc(pf_pt_cache);
> +			clear_page(child_pt);
> +			new_spte = make_nonleaf_spte(child_pt,
> +						     !shadow_accessed_mask);
> +
> +			*iter.sptep = new_spte;
> +			handle_changed_spte(vcpu->kvm, as_id, iter.gfn,
> +					    iter.old_spte, new_spte,
> +					    iter.level);
> +		}
> +	}
> +
> +	if (WARN_ON(iter.level != level))
> +		return RET_PF_RETRY;

This also seems unnecessary.  Or maybe these are all good candidates for
KVM_BUG_ON...

> +
> +	ret = page_fault_handle_target_level(vcpu, write, map_writable,
> +					     as_id, &iter, pfn, prefault);
> +
> +	/* If emulating, flush this vcpu's TLB. */

Why?  It's obvious _what_ the code is doing, the comment should explain _why_.

> +	if (ret == RET_PF_EMULATE)
> +		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
> +
> +	return ret;
> +}
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index cb86f9fe69017..abf23dc0ab7ad 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -14,4 +14,8 @@ void kvm_tdp_mmu_put_root_hpa(struct kvm *kvm, hpa_t root_hpa);
>  
>  bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end);
>  void kvm_tdp_mmu_zap_all(struct kvm *kvm);
> +
> +int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, int write, int map_writable,
> +			   int level, gpa_t gpa, kvm_pfn_t pfn, bool prefault,
> +			   bool lpage_disallowed);
>  #endif /* __KVM_X86_MMU_TDP_MMU_H */
> -- 
> 2.28.0.709.gb0816b6eb0-goog
> 


* Re: [PATCH 10/22] kvm: mmu: Add TDP MMU PF handler
  2020-09-30 16:37   ` Sean Christopherson
@ 2020-09-30 16:55     ` Paolo Bonzini
  2020-09-30 17:37     ` Paolo Bonzini
  2020-10-06 22:33     ` Ben Gardon
  2 siblings, 0 replies; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-30 16:55 UTC (permalink / raw)
  To: Sean Christopherson, Ben Gardon
  Cc: linux-kernel, kvm, Cannon Matthews, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 30/09/20 18:37, Sean Christopherson wrote:
>> +
>> +	if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
>> +		return RET_PF_RETRY;
> I feel like we should kill off these silly WARNs in the existing code instead
> of adding more.  If they actually fired, I'm pretty sure that they would
> continue firing and spamming the kernel log until the VM is killed as I don't
> see how restarting the guest will magically fix anything.

This is true, but I think it's better to be defensive.  They're
certainly all candidates for KVM_BUG_ON.
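
To make the trade-off concrete, here is a userspace sketch of the behaviour
being asked for: report the broken invariant once, mark the VM as dead, and
fail later faults instead of retrying them.  The toy_* names are hypothetical
and this is not an existing kernel interface; it only illustrates the idea.

```c
#include <stdbool.h>
#include <stdio.h>

struct toy_vm {
	bool bugged;	/* once set, the VM refuses further work */
	int warnings;	/* how many times a warning was actually printed */
};

enum { TOY_PF_FIXED, TOY_PF_ERROR };

/* Hypothetical helper: warn on the first violation only, then mark the VM. */
static bool toy_bug_on(struct toy_vm *vm, bool cond, const char *what)
{
	if (!cond)
		return false;
	if (!vm->bugged) {
		fprintf(stderr, "toy VM bug: %s\n", what);
		vm->warnings++;
		vm->bugged = true;
	}
	return true;
}

static int toy_page_fault(struct toy_vm *vm, bool root_valid)
{
	/* A dead VM fails fast without re-checking or re-warning. */
	if (vm->bugged)
		return TOY_PF_ERROR;

	/* Do not retry: a broken invariant will not repair itself. */
	if (toy_bug_on(vm, !root_valid, "invalid root"))
		return TOY_PF_ERROR;

	return TOY_PF_FIXED;
}
```

A thousand faults against a broken VM then produce one log line rather than a
thousand, which addresses the spam concern while keeping the defensive check.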

Paolo



* Re: [PATCH 03/22] kvm: mmu: Init / Uninit the TDP MMU
  2020-09-25 21:22 ` [PATCH 03/22] kvm: mmu: Init / Uninit the TDP MMU Ben Gardon
  2020-09-26  0:06   ` Paolo Bonzini
  2020-09-30  5:34   ` Sean Christopherson
@ 2020-09-30 16:57   ` Sean Christopherson
  2020-09-30 17:39     ` Paolo Bonzini
  2 siblings, 1 reply; 105+ messages in thread
From: Sean Christopherson @ 2020-09-30 16:57 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Cannon Matthews, Paolo Bonzini, Peter Xu,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Fri, Sep 25, 2020 at 02:22:43PM -0700, Ben Gardon wrote:
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> new file mode 100644
> index 0000000000000..8241e18c111e6
> --- /dev/null
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -0,0 +1,34 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#include "tdp_mmu.h"
> +
> +static bool __read_mostly tdp_mmu_enabled = true;
> +module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0644);

This param should not exist until the TDP MMU is fully functional, e.g. running
KVM against "kvm: mmu: Support zapping SPTEs in the TDP MMU" immediately hits a
BUG() in the rmap code.  I haven't wrapped my head around the entire series to
grok whether it makes sense to incrementally enable the TDP MMU, but my gut says
that's probably nonsensical.  The local variable can exist (default to false)
so that you can flip a single switch to enable the code instead of having to
plumb in the variable to its consumers.

  kernel BUG at arch/x86/kvm/mmu/mmu.c:1427!
  invalid opcode: 0000 [#1] SMP
  CPU: 4 PID: 1218 Comm: stable Not tainted 5.9.0-rc4+ #44
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:rmap_get_first.isra.0+0x51/0x60 [kvm]
  Code: <0f> 0b 45 31 c0 4c 89 c0 c3 66 0f 1f 44 00 00 0f 1f 44 00 00 49 b9
  RSP: 0018:ffffc9000099fb50 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000000
  RDX: ffffc9000099fb60 RSI: ffffc9000099fb58 RDI: ffff88816b1a7ec8
  RBP: ffff88816b1a7e70 R08: ffff888173c95000 R09: ffff88816b1a7448
  R10: 00000000000000f8 R11: ffff88817bd29c70 R12: ffffc90000981000
  R13: ffffc9000099fbac R14: ffffc90000989a88 R15: ffff88816b1a7ec8
  FS:  00007f7a755fd700(0000) GS:ffff88817bd00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f7a60141000 CR3: 000000016b031004 CR4: 0000000000172ea0
  Call Trace:
   __kvm_mmu_prepare_zap_page+0x98/0x330 [kvm]
   kvm_mmu_zap_all_fast+0x100/0x190 [kvm]
   kvm_page_track_flush_slot+0x54/0x80 [kvm]
   kvm_set_memslot+0x198/0x640 [kvm]
   kvm_delete_memslot+0x59/0xc0 [kvm]
   __kvm_set_memory_region+0x494/0x560 [kvm]
   ? khugepaged+0x470/0x2230
   ? mem_cgroup_charge_statistics.isra.0+0x1c/0x40
   kvm_set_memory_region+0x27/0x40 [kvm]
   kvm_vm_ioctl+0x379/0xca0 [kvm]
   ? do_user_addr_fault+0x1ad/0x3a7
   __x64_sys_ioctl+0x83/0xb0
   do_syscall_64+0x33/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 13/22] kvm: mmu: Support invalidate range MMU notifier for TDP MMU
  2020-09-25 21:22 ` [PATCH 13/22] kvm: mmu: Support invalidate range MMU notifier for " Ben Gardon
@ 2020-09-30 17:03   ` Sean Christopherson
  2020-09-30 23:15     ` Ben Gardon
  0 siblings, 1 reply; 105+ messages in thread
From: Sean Christopherson @ 2020-09-30 17:03 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Cannon Matthews, Paolo Bonzini, Peter Xu,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Fri, Sep 25, 2020 at 02:22:53PM -0700, Ben Gardon wrote:
> In order to interoperate correctly with the rest of KVM and other Linux
> subsystems, the TDP MMU must correctly handle various MMU notifiers. Add
> hooks to handle the invalidate range family of MMU notifiers.
> 
> Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
> machine. This series introduced no new failures.
> 
> This series can be viewed in Gerrit at:
> 	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c     |  9 ++++-
>  arch/x86/kvm/mmu/tdp_mmu.c | 80 +++++++++++++++++++++++++++++++++++---
>  arch/x86/kvm/mmu/tdp_mmu.h |  3 ++
>  3 files changed, 86 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 52d661a758585..0ddfdab942554 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1884,7 +1884,14 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
>  int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end,
>  			unsigned flags)
>  {
> -	return kvm_handle_hva_range(kvm, start, end, 0, kvm_unmap_rmapp);
> +	int r;
> +
> +	r = kvm_handle_hva_range(kvm, start, end, 0, kvm_unmap_rmapp);
> +
> +	if (kvm->arch.tdp_mmu_enabled)
> +		r |= kvm_tdp_mmu_zap_hva_range(kvm, start, end);

Similar to an earlier question, is this intentionally additive, or can this
instead be:

	if (kvm->arch.tdp_mmu_enabled)
		r = kvm_tdp_mmu_zap_hva_range(kvm, start, end);
	else
		r = kvm_handle_hva_range(kvm, start, end, 0, kvm_unmap_rmapp);

> +
> +	return r;
>  }
>  
>  int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 557e780bdf9f9..1cea58db78a13 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -60,7 +60,7 @@ bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
>  }
>  
>  static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> -			  gfn_t start, gfn_t end);
> +			  gfn_t start, gfn_t end, bool can_yield);
>  
>  static void free_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
>  {
> @@ -73,7 +73,7 @@ static void free_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
>  
>  	list_del(&root->link);
>  
> -	zap_gfn_range(kvm, root, 0, max_gfn);
> +	zap_gfn_range(kvm, root, 0, max_gfn, false);
>  
>  	free_page((unsigned long)root->spt);
>  	kmem_cache_free(mmu_page_header_cache, root);
> @@ -361,9 +361,14 @@ static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
>   * non-root pages mapping GFNs strictly within that range. Returns true if
>   * SPTEs have been cleared and a TLB flush is needed before releasing the
>   * MMU lock.
> + * If can_yield is true, will release the MMU lock and reschedule if the
> + * scheduler needs the CPU or there is contention on the MMU lock. If this
> + * function cannot yield, it will not release the MMU lock or reschedule and
> + * the caller must ensure it does not supply too large a GFN range, or the
> + * operation can cause a soft lockup.
>   */
>  static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> -			  gfn_t start, gfn_t end)
> +			  gfn_t start, gfn_t end, bool can_yield)
>  {
>  	struct tdp_iter iter;
>  	bool flush_needed = false;
> @@ -387,7 +392,10 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>  		handle_changed_spte(kvm, as_id, iter.gfn, iter.old_spte, 0,
>  				    iter.level);
>  
> -		flush_needed = !tdp_mmu_iter_cond_resched(kvm, &iter);
> +		if (can_yield)
> +			flush_needed = !tdp_mmu_iter_cond_resched(kvm, &iter);

		flush_needed = !can_yield || !tdp_mmu_iter_cond_resched(kvm, &iter);
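
For what it's worth, here is a small user-space sketch of why the one-liner
behaves the same as the if/else in the patch.  fake_cond_resched() is a
hypothetical stand-in for tdp_mmu_iter_cond_resched() (it just returns whether
it "yielded" and counts calls); none of this is kernel code.

```c
#include <stdbool.h>

/* Hypothetical stand-in: returns true if it yielded, counts invocations. */
static int resched_calls;

static bool fake_cond_resched(bool will_yield)
{
	resched_calls++;
	return will_yield;
}

/* Shape of the code in the patch. */
static bool flush_needed_branchy(bool can_yield, bool will_yield)
{
	bool flush_needed;

	if (can_yield)
		flush_needed = !fake_cond_resched(will_yield);
	else
		flush_needed = true;

	return flush_needed;
}

/* Shape of the suggested one-liner; || skips the call when !can_yield. */
static bool flush_needed_compact(bool can_yield, bool will_yield)
{
	return !can_yield || !fake_cond_resched(will_yield);
}
```

Because || short circuits, the resched helper is still only called when
can_yield is true, so the two forms agree on both the result and the side
effect.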

> +		else
> +			flush_needed = true;
>  	}
>  	return flush_needed;
>  }
> @@ -410,7 +418,7 @@ bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end)
>  		 */
>  		get_tdp_mmu_root(kvm, root);
>  
> -		flush = zap_gfn_range(kvm, root, start, end) || flush;
> +		flush = zap_gfn_range(kvm, root, start, end, true) || flush;
>  
>  		put_tdp_mmu_root(kvm, root);
>  	}
> @@ -551,3 +559,65 @@ int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, int write, int map_writable,
>  
>  	return ret;
>  }
> +
> +static int kvm_tdp_mmu_handle_hva_range(struct kvm *kvm, unsigned long start,
> +		unsigned long end, unsigned long data,
> +		int (*handler)(struct kvm *kvm, struct kvm_memory_slot *slot,
> +			       struct kvm_mmu_page *root, gfn_t start,
> +			       gfn_t end, unsigned long data))
> +{
> +	struct kvm_memslots *slots;
> +	struct kvm_memory_slot *memslot;
> +	struct kvm_mmu_page *root;
> +	int ret = 0;
> +	int as_id;
> +
> +	for_each_tdp_mmu_root(kvm, root) {
> +		/*
> +		 * Take a reference on the root so that it cannot be freed if
> +		 * this thread releases the MMU lock and yields in this loop.
> +		 */
> +		get_tdp_mmu_root(kvm, root);
> +
> +		as_id = kvm_mmu_page_as_id(root);
> +		slots = __kvm_memslots(kvm, as_id);
> +		kvm_for_each_memslot(memslot, slots) {
> +			unsigned long hva_start, hva_end;
> +			gfn_t gfn_start, gfn_end;
> +
> +			hva_start = max(start, memslot->userspace_addr);
> +			hva_end = min(end, memslot->userspace_addr +
> +				      (memslot->npages << PAGE_SHIFT));
> +			if (hva_start >= hva_end)
> +				continue;
> +			/*
> +			 * {gfn(page) | page intersects with [hva_start, hva_end)} =
> +			 * {gfn_start, gfn_start+1, ..., gfn_end-1}.
> +			 */
> +			gfn_start = hva_to_gfn_memslot(hva_start, memslot);
> +			gfn_end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, memslot);
> +
> +			ret |= handler(kvm, memslot, root, gfn_start,
> +				       gfn_end, data);

Eh, I'd say let this one poke out, the above hva_to_gfn_memslot() already
overruns 80 chars.  IMO it's more readable without the wraps.

> +		}
> +
> +		put_tdp_mmu_root(kvm, root);
> +	}
> +
> +	return ret;
> +}
> +
> +static int zap_gfn_range_hva_wrapper(struct kvm *kvm,
> +				     struct kvm_memory_slot *slot,
> +				     struct kvm_mmu_page *root, gfn_t start,
> +				     gfn_t end, unsigned long unused)
> +{
> +	return zap_gfn_range(kvm, root, start, end, false);
> +}
> +
> +int kvm_tdp_mmu_zap_hva_range(struct kvm *kvm, unsigned long start,
> +			      unsigned long end)
> +{
> +	return kvm_tdp_mmu_handle_hva_range(kvm, start, end, 0,
> +					    zap_gfn_range_hva_wrapper);
> +}
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index abf23dc0ab7ad..ce804a97bfa1d 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -18,4 +18,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm);
>  int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, int write, int map_writable,
>  			   int level, gpa_t gpa, kvm_pfn_t pfn, bool prefault,
>  			   bool lpage_disallowed);
> +
> +int kvm_tdp_mmu_zap_hva_range(struct kvm *kvm, unsigned long start,
> +			      unsigned long end);
>  #endif /* __KVM_X86_MMU_TDP_MMU_H */
> -- 
> 2.28.0.709.gb0816b6eb0-goog
> 

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 10/22] kvm: mmu: Add TDP MMU PF handler
  2020-09-30 16:37   ` Sean Christopherson
  2020-09-30 16:55     ` Paolo Bonzini
@ 2020-09-30 17:37     ` Paolo Bonzini
  2020-10-06 22:35       ` Ben Gardon
  2020-10-06 22:33     ` Ben Gardon
  2 siblings, 1 reply; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-30 17:37 UTC (permalink / raw)
  To: Sean Christopherson, Ben Gardon
  Cc: linux-kernel, kvm, Cannon Matthews, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 30/09/20 18:37, Sean Christopherson wrote:
>> +	ret = page_fault_handle_target_level(vcpu, write, map_writable,
>> +					     as_id, &iter, pfn, prefault);
>> +
>> +	/* If emulating, flush this vcpu's TLB. */
> Why?  It's obvious _what_ the code is doing, the comment should explain _why_.
> 
>> +	if (ret == RET_PF_EMULATE)
>> +		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
>> +
>> +	return ret;
>> +}

In particular it seems to be only needed in this case...

+	/*
+	 * If the page fault was caused by a write but the page is write
+	 * protected, emulation is needed. If the emulation was skipped,
+	 * the vCPU would have the same fault again.
+	 */
+	if ((make_spte_ret & SET_SPTE_WRITE_PROTECTED_PT) && write)
+		ret = RET_PF_EMULATE;
+

... corresponding to this code in mmu.c

        if (set_spte_ret & SET_SPTE_WRITE_PROTECTED_PT) {
                if (write_fault)
                        ret = RET_PF_EMULATE;
                kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
        }

So it should indeed be better to make the code in
page_fault_handle_target_level look the same as mmu/mmu.c.

Paolo


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 03/22] kvm: mmu: Init / Uninit the TDP MMU
  2020-09-30 16:57   ` Sean Christopherson
@ 2020-09-30 17:39     ` Paolo Bonzini
  2020-09-30 18:42       ` Ben Gardon
  0 siblings, 1 reply; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-30 17:39 UTC (permalink / raw)
  To: Sean Christopherson, Ben Gardon
  Cc: linux-kernel, kvm, Cannon Matthews, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 30/09/20 18:57, Sean Christopherson wrote:
>> +
>> +static bool __read_mostly tdp_mmu_enabled = true;
>> +module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0644);
> This param should not exist until the TDP MMU is fully functional, e.g. running
> KVM against "kvm: mmu: Support zapping SPTEs in the TDP MMU" immediately hits a
> BUG() in the rmap code.  I haven't wrapped my head around the entire series to
> grok whether it makes sense to incrementally enable the TDP MMU, but my gut says
> that's probably nonsensical.

No, it doesn't.  Whether to add the module parameter is kind of
secondary, but I agree it shouldn't be true---not even at the end of
this series, since fast page fault for example is not implemented yet.

Paolo


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 14/22] kvm: mmu: Add access tracking for tdp_mmu
  2020-09-25 21:22 ` [PATCH 14/22] kvm: mmu: Add access tracking for tdp_mmu Ben Gardon
  2020-09-26  0:32   ` Paolo Bonzini
@ 2020-09-30 17:48   ` Sean Christopherson
  2020-10-06 23:38     ` Ben Gardon
  1 sibling, 1 reply; 105+ messages in thread
From: Sean Christopherson @ 2020-09-30 17:48 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Cannon Matthews, Paolo Bonzini, Peter Xu,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Fri, Sep 25, 2020 at 02:22:54PM -0700, Ben Gardon wrote:
> @@ -1945,12 +1944,24 @@ static void rmap_recycle(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
>  
>  int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
>  {
> -	return kvm_handle_hva_range(kvm, start, end, 0, kvm_age_rmapp);
> +	int young = false;
> +
> +	young = kvm_handle_hva_range(kvm, start, end, 0, kvm_age_rmapp);
> +	if (kvm->arch.tdp_mmu_enabled)

If we end up with a per-VM flag, would it make sense to add a static key
wrapper similar to the in-kernel lapic?  I assume once this lands the vast
majority of VMs will use the TDP MMU.

> +		young |= kvm_tdp_mmu_age_hva_range(kvm, start, end);
> +
> +	return young;
>  }

...

> +
> +/*
> + * Mark the SPTEs range of GFNs [start, end) unaccessed and return non-zero
> + * if any of the GFNs in the range have been accessed.
> + */
> +static int age_gfn_range(struct kvm *kvm, struct kvm_memory_slot *slot,
> +			 struct kvm_mmu_page *root, gfn_t start, gfn_t end,
> +			 unsigned long unused)
> +{
> +	struct tdp_iter iter;
> +	int young = 0;
> +	u64 new_spte = 0;
> +	int as_id = kvm_mmu_page_as_id(root);
> +
> +	for_each_tdp_pte_root(iter, root, start, end) {

Ah, I think we should follow the existing shadow iterators by naming this

	for_each_tdp_pte_using_root()

My first reaction was that this was iterating over TDP roots, which was a bit
confusing.  I suspect others will make the same mistake unless they look at the
implementation of for_each_tdp_pte_root().

Similar comments on the _vcpu() variant.  For that one I think it'd be
preferable to take the struct kvm_mmu, i.e. have for_each_tdp_pte_using_mmu(),
as both kvm_tdp_mmu_page_fault() and kvm_tdp_mmu_get_walk() explicitly
reference vcpu->arch.mmu in the surrounding code.

E.g. I find this more intuitive

	struct kvm_mmu *mmu = vcpu->arch.mmu;
	int leaf = mmu->shadow_root_level;

	for_each_tdp_pte_using_mmu(iter, mmu, gfn, gfn + 1) {
		leaf = iter.level;
		sptes[leaf - 1] = iter.old_spte;
	}

	return leaf;

versus this, which makes me want to look at the implementation of for_each().


	int leaf = vcpu->arch.mmu->shadow_root_level;

	for_each_tdp_pte_vcpu(iter, vcpu, gfn, gfn + 1) {
		...
	}

> +		if (!is_shadow_present_pte(iter.old_spte) ||
> +		    !is_last_spte(iter.old_spte, iter.level))
> +			continue;
> +
> +		/*
> +		 * If we have a non-accessed entry we don't need to change the
> +		 * pte.
> +		 */
> +		if (!is_accessed_spte(iter.old_spte))
> +			continue;
> +
> +		new_spte = iter.old_spte;
> +
> +		if (spte_ad_enabled(new_spte)) {
> +			clear_bit((ffs(shadow_accessed_mask) - 1),
> +				  (unsigned long *)&new_spte);
> +		} else {
> +			/*
> +			 * Capture the dirty status of the page, so that it doesn't get
> +			 * lost when the SPTE is marked for access tracking.
> +			 */
> +			if (is_writable_pte(new_spte))
> +				kvm_set_pfn_dirty(spte_to_pfn(new_spte));
> +
> +			new_spte = mark_spte_for_access_track(new_spte);
> +		}
> +
> +		*iter.sptep = new_spte;
> +		__handle_changed_spte(kvm, as_id, iter.gfn, iter.old_spte,
> +				      new_spte, iter.level);
> +		young = true;

young is an int, not a bool.  Not really your fault as KVM has a really bad
habit of using ints instead of bools.

> +	}
> +
> +	return young;
> +}
> +
> +int kvm_tdp_mmu_age_hva_range(struct kvm *kvm, unsigned long start,
> +			      unsigned long end)
> +{
> +	return kvm_tdp_mmu_handle_hva_range(kvm, start, end, 0,
> +					    age_gfn_range);
> +}
> +
> +static int test_age_gfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> +			struct kvm_mmu_page *root, gfn_t gfn, gfn_t unused,
> +			unsigned long unused2)
> +{
> +	struct tdp_iter iter;
> +	int young = 0;
> +
> +	for_each_tdp_pte_root(iter, root, gfn, gfn + 1) {
> +		if (!is_shadow_present_pte(iter.old_spte) ||
> +		    !is_last_spte(iter.old_spte, iter.level))
> +			continue;
> +
> +		if (is_accessed_spte(iter.old_spte))
> +			young = true;

Same bool vs. int weirdness here.  Also, |= doesn't short circuit for ints
or bools, so this can be

		young |= is_accessed_spte(...)

Actually, can't we just return true immediately?
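
A rough user-space sketch of the early-return shape, in case it helps.  The
array of u64 values stands in for the SPTEs the iterator would visit, and
FAKE_ACCESSED_BIT and the visited counter are made up for illustration; the
real check would be is_accessed_spte().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical accessed bit; the real mask depends on the SPTE format. */
#define FAKE_ACCESSED_BIT (1ull << 5)

/*
 * Return true as soon as any entry has the accessed bit set.  *visited
 * records how many entries were examined, to show the walk stops early.
 */
static bool any_accessed(const uint64_t *sptes, size_t n, size_t *visited)
{
	for (size_t i = 0; i < n; i++) {
		*visited = i + 1;
		if (sptes[i] & FAKE_ACCESSED_BIT)
			return true;
	}

	return false;
}
```

The function also gets to return a bool directly, which sidesteps the int
vs. bool mixing above.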

> +	}
> +
> +	return young;
> +}
> +
> +int kvm_tdp_mmu_test_age_hva(struct kvm *kvm, unsigned long hva)
> +{
> +	return kvm_tdp_mmu_handle_hva_range(kvm, hva, hva + 1, 0,
> +					    test_age_gfn);
> +}
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index ce804a97bfa1d..f316773b7b5a8 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -21,4 +21,8 @@ int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, int write, int map_writable,
>  
>  int kvm_tdp_mmu_zap_hva_range(struct kvm *kvm, unsigned long start,
>  			      unsigned long end);
> +
> +int kvm_tdp_mmu_age_hva_range(struct kvm *kvm, unsigned long start,
> +			      unsigned long end);
> +int kvm_tdp_mmu_test_age_hva(struct kvm *kvm, unsigned long hva);
>  #endif /* __KVM_X86_MMU_TDP_MMU_H */
> -- 
> 2.28.0.709.gb0816b6eb0-goog
> 

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 17/22] kvm: mmu: Support dirty logging for the TDP MMU
  2020-09-25 21:22 ` [PATCH 17/22] kvm: mmu: Support dirty logging for the TDP MMU Ben Gardon
  2020-09-26  1:04   ` Paolo Bonzini
  2020-09-29 15:07   ` Paolo Bonzini
@ 2020-09-30 18:04   ` Sean Christopherson
  2020-09-30 18:08     ` Paolo Bonzini
  2 siblings, 1 reply; 105+ messages in thread
From: Sean Christopherson @ 2020-09-30 18:04 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Cannon Matthews, Paolo Bonzini, Peter Xu,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Fri, Sep 25, 2020 at 02:22:57PM -0700, Ben Gardon wrote:
> +/*
> + * Remove write access from all the SPTEs mapping GFNs in the memslot. If
> + * skip_4k is set, SPTEs that map 4k pages, will not be write-protected.
> + * Returns true if an SPTE has been changed and the TLBs need to be flushed.
> + */
> +bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm, struct kvm_memory_slot *slot,
> +			     bool skip_4k)
> +{
> +	struct kvm_mmu_page *root;
> +	int root_as_id;
> +	bool spte_set = false;
> +
> +	for_each_tdp_mmu_root(kvm, root) {
> +		root_as_id = kvm_mmu_page_as_id(root);
> +		if (root_as_id != slot->as_id)
> +			continue;

This pattern pops up quite a few times, probably worth adding

#define for_each_tdp_mmu_root_using_memslot(...)	\
	for_each_tdp_mmu_root(...)			\
		if (kvm_mmu_page_as_id(root) != slot->as_id) {	\
		} else
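
To illustrate the "if (...) { } else" trick outside the kernel, here is a
minimal sketch.  struct fake_root and both for_each_fake_root*() macros are
hypothetical stand-ins for kvm_mmu_page and the TDP MMU iterators, assuming a
plain array instead of a list.

```c
#include <stddef.h>

/* Hypothetical stand-in for a TDP MMU root; only the address space ID. */
struct fake_root {
	int as_id;
};

/* Unfiltered walk over an array of roots. */
#define for_each_fake_root(root, roots, n)				\
	for ((root) = (roots); (root) < (roots) + (n); (root)++)

/*
 * Filtered walk: the empty "if" body skips non-matching roots and the
 * trailing "else" makes the caller's statement the loop body.
 */
#define for_each_fake_root_using_as_id(root, roots, n, id)		\
	for_each_fake_root(root, roots, n)				\
		if ((root)->as_id != (id)) {				\
		} else

static size_t count_roots_in_as(struct fake_root *roots, size_t n, int as_id)
{
	struct fake_root *root;
	size_t count = 0;

	for_each_fake_root_using_as_id(root, roots, n, as_id)
		count++;

	return count;
}
```

Ending the macro with "else" instead of a bare "if (match)" means a stray
"else" written after the loop body by the caller cannot silently pair with
the macro's hidden "if".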

> +
> +		/*
> +		 * Take a reference on the root so that it cannot be freed if
> +		 * this thread releases the MMU lock and yields in this loop.
> +		 */
> +		get_tdp_mmu_root(kvm, root);
> +
> +		spte_set = wrprot_gfn_range(kvm, root, slot->base_gfn,
> +				slot->base_gfn + slot->npages, skip_4k) ||
> +			   spte_set;
> +
> +		put_tdp_mmu_root(kvm, root);
> +	}
> +
> +	return spte_set;
> +}
> +
> +/*
> + * Clear the dirty status of all the SPTEs mapping GFNs in the memslot. If
> + * AD bits are enabled, this will involve clearing the dirty bit on each SPTE.
> + * If AD bits are not enabled, this will require clearing the writable bit on
> + * each SPTE. Returns true if an SPTE has been changed and the TLBs need to
> + * be flushed.
> + */
> +static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> +			   gfn_t start, gfn_t end)
> +{
> +	struct tdp_iter iter;
> +	u64 new_spte;
> +	bool spte_set = false;
> +	int as_id = kvm_mmu_page_as_id(root);
> +
> +	for_each_tdp_pte_root(iter, root, start, end) {
> +		if (!is_shadow_present_pte(iter.old_spte) ||
> +		    !is_last_spte(iter.old_spte, iter.level))
> +			continue;

Same thing here, extra wrappers would probably be helpful.  At least add one
for the present case, e.g.

  #define for_each_present_tdp_pte_using_root()

and maybe even

  #define for_each_leaf_tdp_pte_using_root()

since the "!present || !last" pops up 4 or 5 times.

> +
> +		if (spte_ad_need_write_protect(iter.old_spte)) {
> +			if (is_writable_pte(iter.old_spte))
> +				new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
> +			else
> +				continue;

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 19/22] kvm: mmu: Support write protection for nesting in tdp MMU
  2020-09-25 21:22 ` [PATCH 19/22] kvm: mmu: Support write protection for nesting in " Ben Gardon
@ 2020-09-30 18:06   ` Sean Christopherson
  0 siblings, 0 replies; 105+ messages in thread
From: Sean Christopherson @ 2020-09-30 18:06 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Cannon Matthews, Paolo Bonzini, Peter Xu,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Fri, Sep 25, 2020 at 02:22:59PM -0700, Ben Gardon wrote:
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 12892fc4f146d..e6f5093ba8f6f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1667,6 +1667,11 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
>  		write_protected |= __rmap_write_protect(kvm, rmap_head, true);
>  	}
>  
> +	if (kvm->arch.tdp_mmu_enabled)
> +		write_protected =
> +			kvm_tdp_mmu_write_protect_gfn(kvm, slot, gfn) ||
> +			write_protected;

Similar to other comments, this can use |=.
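
A quick user-space sketch of why |= is a safe substitute here.
side_effect() is a hypothetical stand-in for
kvm_tdp_mmu_write_protect_gfn(); the only point is how many times it gets
called in each form.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a function that must always run. */
static int calls;

static bool side_effect(bool ret)
{
	calls++;
	return ret;
}

/* Suggested form: |= always evaluates the right-hand side. */
static bool or_assign(bool acc, bool ret)
{
	acc |= side_effect(ret);
	return acc;
}

/* Form in the patch: the call is the left operand, so it always runs. */
static bool call_first(bool acc, bool ret)
{
	acc = side_effect(ret) || acc;
	return acc;
}

/* The form to avoid: the call is skipped whenever acc is already true. */
static bool short_circuit(bool acc, bool ret)
{
	acc = acc || side_effect(ret);
	return acc;
}
```

The first two always invoke the callee and differ only in verbosity; only
the third can skip the call.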

> +
>  	return write_protected;
>  }

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 17/22] kvm: mmu: Support dirty logging for the TDP MMU
  2020-09-30 18:04   ` Sean Christopherson
@ 2020-09-30 18:08     ` Paolo Bonzini
  0 siblings, 0 replies; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-30 18:08 UTC (permalink / raw)
  To: Sean Christopherson, Ben Gardon
  Cc: linux-kernel, kvm, Cannon Matthews, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 30/09/20 20:04, Sean Christopherson wrote:
>> +	for_each_tdp_mmu_root(kvm, root) {
>> +		root_as_id = kvm_mmu_page_as_id(root);
>> +		if (root_as_id != slot->as_id)
>> +			continue;
> This pattern pops up quite a few times, probably worth adding
> 
> #define for_each_tdp_mmu_root_using_memslot(...)	\
> 	for_each_tdp_mmu_root(...)			\
> 		if (kvm_mmu_page_as_id(root) != slot->as_id) {	\
> 		} else
> 

It's not really relevant that it's a memslot, but

	for_each_tdp_mmu_root_using_as_id

makes sense too.

Paolo


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 20/22] kvm: mmu: NX largepage recovery for TDP MMU
  2020-09-25 21:23 ` [PATCH 20/22] kvm: mmu: NX largepage recovery for TDP MMU Ben Gardon
  2020-09-26  1:14   ` Paolo Bonzini
  2020-09-29 18:24   ` Paolo Bonzini
@ 2020-09-30 18:15   ` Sean Christopherson
  2020-09-30 19:56     ` Paolo Bonzini
  2020-09-30 22:27     ` Ben Gardon
  2 siblings, 2 replies; 105+ messages in thread
From: Sean Christopherson @ 2020-09-30 18:15 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Cannon Matthews, Paolo Bonzini, Peter Xu,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Fri, Sep 25, 2020 at 02:23:00PM -0700, Ben Gardon wrote:
> +/*
> + * Clear non-leaf SPTEs and free the page tables they point to, if those SPTEs
> + * exist in order to allow execute access on a region that would otherwise be
> + * mapped as a large page.
> + */
> +void kvm_tdp_mmu_recover_nx_lpages(struct kvm *kvm)
> +{
> +	struct kvm_mmu_page *sp;
> +	bool flush;
> +	int rcu_idx;
> +	unsigned int ratio;
> +	ulong to_zap;
> +	u64 old_spte;
> +
> +	rcu_idx = srcu_read_lock(&kvm->srcu);
> +	spin_lock(&kvm->mmu_lock);
> +
> +	ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
> +	to_zap = ratio ? DIV_ROUND_UP(kvm->stat.nx_lpage_splits, ratio) : 0;

This is broken, and possibly related to Paolo's INIT_LIST_HEAD issue.  The TDP
MMU never increments nx_lpage_splits, it instead has its own counter,
tdp_mmu_lpage_disallowed_page_count.  Unless I'm missing something, to_zap is
guaranteed to be zero and thus this is completely untested.

I don't see any reason for a separate tdp_mmu_lpage_disallowed_page_count,
a single VM can't have both a legacy MMU and a TDP MMU, so it's not like there
will be collisions with other code incrementing nx_lpage_splits.   And the TDP
MMU should be updating stats anyways.
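
To spell out the arithmetic, a user-space sketch of the to_zap computation.
DIV_ROUND_UP is reproduced with the kernel's usual definition and
to_zap_for() is a hypothetical helper name; splits plays the role of
kvm->stat.nx_lpage_splits.

```c
/* Same shape as the kernel's DIV_ROUND_UP(). */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Mirrors: to_zap = ratio ? DIV_ROUND_UP(splits, ratio) : 0; */
static unsigned long to_zap_for(unsigned long splits, unsigned int ratio)
{
	return ratio ? DIV_ROUND_UP(splits, ratio) : 0;
}
```

With splits stuck at zero the result is zero for any ratio, so the while
loop below would never execute, whereas a single counted split is enough to
zap one page.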

> +
> +	while (to_zap &&
> +	       !list_empty(&kvm->arch.tdp_mmu_lpage_disallowed_pages)) {
> +		/*
> +		 * We use a separate list instead of just using active_mmu_pages
> +		 * because the number of lpage_disallowed pages is expected to
> +		 * be relatively small compared to the total.
> +		 */
> +		sp = list_first_entry(&kvm->arch.tdp_mmu_lpage_disallowed_pages,
> +				      struct kvm_mmu_page,
> +				      lpage_disallowed_link);
> +
> +		old_spte = *sp->parent_sptep;
> +		*sp->parent_sptep = 0;
> +
> +		list_del(&sp->lpage_disallowed_link);
> +		kvm->arch.tdp_mmu_lpage_disallowed_page_count--;
> +
> +		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), sp->gfn,
> +				    old_spte, 0, sp->role.level + 1);
> +
> +		flush = true;
> +
> +		if (!--to_zap || need_resched() ||
> +		    spin_needbreak(&kvm->mmu_lock)) {
> +			flush = false;
> +			kvm_flush_remote_tlbs(kvm);
> +			if (to_zap)
> +				cond_resched_lock(&kvm->mmu_lock);
> +		}
> +	}
> +
> +	if (flush)
> +		kvm_flush_remote_tlbs(kvm);
> +
> +	spin_unlock(&kvm->mmu_lock);
> +	srcu_read_unlock(&kvm->srcu, rcu_idx);
> +}
> +
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 2ecb047211a6d..45ea2d44545db 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -43,4 +43,6 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
>  
>  bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
>  				   struct kvm_memory_slot *slot, gfn_t gfn);
> +
> +void kvm_tdp_mmu_recover_nx_lpages(struct kvm *kvm);
>  #endif /* __KVM_X86_MMU_TDP_MMU_H */
> -- 
> 2.28.0.709.gb0816b6eb0-goog
> 

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 21/22] kvm: mmu: Support MMIO in the TDP MMU
  2020-09-25 21:23 ` [PATCH 21/22] kvm: mmu: Support MMIO in the " Ben Gardon
@ 2020-09-30 18:19   ` Sean Christopherson
  0 siblings, 0 replies; 105+ messages in thread
From: Sean Christopherson @ 2020-09-30 18:19 UTC (permalink / raw)
  To: Ben Gardon
  Cc: linux-kernel, kvm, Cannon Matthews, Paolo Bonzini, Peter Xu,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Fri, Sep 25, 2020 at 02:23:01PM -0700, Ben Gardon wrote:
> In order to support MMIO, KVM must be able to walk the TDP paging

Probably worth clarifying that this is for emulated MMIO, as opposed to
mapping MMIO host addresses.

> structures to find mappings for a given GFN. Support this walk for
> the TDP MMU.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 03/22] kvm: mmu: Init / Uninit the TDP MMU
  2020-09-30  5:34   ` Sean Christopherson
@ 2020-09-30 18:36     ` Ben Gardon
  0 siblings, 0 replies; 105+ messages in thread
From: Ben Gardon @ 2020-09-30 18:36 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, kvm, Cannon Matthews, Paolo Bonzini, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Tue, Sep 29, 2020 at 10:35 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> Nit on all the shortlogs, can you use "KVM: x86/mmu" instead of "kvm: mmu"?

Will do.

>
> On Fri, Sep 25, 2020 at 02:22:43PM -0700, Ben Gardon wrote:
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > new file mode 100644
> > index 0000000000000..8241e18c111e6
> > --- /dev/null
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -0,0 +1,34 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +
> > +#include "tdp_mmu.h"
> > +
> > +static bool __read_mostly tdp_mmu_enabled = true;
> > +module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0644);
>
> Do y'all actually toggle tdp_mmu_enabled while VMs are running?  I can see
> having a per-VM capability, or a read-only module param, but a writable
> module param is... interesting.

We don't use this much, but it is useful when running tests to be able
to go back and forth between running with and without the TDP MMU. I
should have added a note that the module parameter is mostly for
development purposes.

>
> > +static bool is_tdp_mmu_enabled(void)
> > +{
> > +     if (!READ_ONCE(tdp_mmu_enabled))
> > +             return false;
> > +
> > +     if (WARN_ONCE(!tdp_enabled,
> > +                   "Creating a VM with TDP MMU enabled requires TDP."))
>
> This should be enforced, i.e. clear tdp_mmu_enabled if !tdp_enabled.  As is,
> it's a user triggerable WARN, which is not good, e.g. with PANIC_ON_WARN.

That's a good point.

>
> > +             return false;
> > +
> > +     return true;
> > +}

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 03/22] kvm: mmu: Init / Uninit the TDP MMU
  2020-09-30 17:39     ` Paolo Bonzini
@ 2020-09-30 18:42       ` Ben Gardon
  0 siblings, 0 replies; 105+ messages in thread
From: Ben Gardon @ 2020-09-30 18:42 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, LKML, kvm, Cannon Matthews, Peter Xu,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Wed, Sep 30, 2020 at 10:39 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 30/09/20 18:57, Sean Christopherson wrote:
> >> +
> >> +static bool __read_mostly tdp_mmu_enabled = true;
> >> +module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0644);
> > This param should not exist until the TDP MMU is fully functional, e.g. running
> > KVM against "kvm: mmu: Support zapping SPTEs in the TDP MMU" immediately hits a
> > BUG() in the rmap code.  I haven't wrapped my head around the entire series to
> > grok whether it makes sense to incrementally enable the TDP MMU, but my gut says
> > that's probably nonsensical.
>
> No, it doesn't.  Whether to add the module parameter is kind of
> secondary, but I agree it shouldn't be true---not even at the end of
> this series, since fast page fault for example is not implemented yet.
>
> Paolo
>
I fully agree, sorry about that. I should have at least defaulted the
module parameter to false before sending the series out. I'll remedy
that in the next patch set. (Unless you beat me to it, Paolo)

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 11/22] kvm: mmu: Factor out allocating a new tdp_mmu_page
  2020-09-26  0:22   ` Paolo Bonzini
@ 2020-09-30 18:53     ` Ben Gardon
  0 siblings, 0 replies; 105+ messages in thread
From: Ben Gardon @ 2020-09-30 18:53 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: LKML, kvm, Cannon Matthews, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Fri, Sep 25, 2020 at 5:22 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 25/09/20 23:22, Ben Gardon wrote:
> > Move the code to allocate a struct kvm_mmu_page for the TDP MMU out of the
> > root allocation code to support allocating a struct kvm_mmu_page for every
> > page of page table memory used by the TDP MMU, in the next commit.
> >
> > Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
> > machine. This series introduced no new failures.
> >
> > This series can be viewed in Gerrit at:
> >       https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538
>
> Maybe worth squashing into the earlier patch.
>
> Paolo
>

That sounds good to me. Definitely reduces churn in the series.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 20/22] kvm: mmu: NX largepage recovery for TDP MMU
  2020-09-30 18:15   ` Sean Christopherson
@ 2020-09-30 19:56     ` Paolo Bonzini
  2020-09-30 22:33       ` Ben Gardon
  2020-09-30 22:27     ` Ben Gardon
  1 sibling, 1 reply; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-30 19:56 UTC (permalink / raw)
  To: Sean Christopherson, Ben Gardon
  Cc: linux-kernel, kvm, Cannon Matthews, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 30/09/20 20:15, Sean Christopherson wrote:
> On Fri, Sep 25, 2020 at 02:23:00PM -0700, Ben Gardon wrote:
>> +/*
>> + * Clear non-leaf SPTEs and free the page tables they point to, if those SPTEs
>> + * exist in order to allow execute access on a region that would otherwise be
>> + * mapped as a large page.
>> + */
>> +void kvm_tdp_mmu_recover_nx_lpages(struct kvm *kvm)
>> +{
>> +	struct kvm_mmu_page *sp;
>> +	bool flush;
>> +	int rcu_idx;
>> +	unsigned int ratio;
>> +	ulong to_zap;
>> +	u64 old_spte;
>> +
>> +	rcu_idx = srcu_read_lock(&kvm->srcu);
>> +	spin_lock(&kvm->mmu_lock);
>> +
>> +	ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
>> +	to_zap = ratio ? DIV_ROUND_UP(kvm->stat.nx_lpage_splits, ratio) : 0;
> 
> This is broken, and possibly related to Paolo's INIT_LIST_HEAD issue.  The TDP
> MMU never increments nx_lpage_splits, it instead has its own counter,
> tdp_mmu_lpage_disallowed_page_count.  Unless I'm missing something, to_zap is
> guaranteed to be zero and thus this is completely untested.

Except if you do shadow paging (through nested EPT) and then it bombs
immediately. :)

> I don't see any reason for a separate tdp_mmu_lpage_disallowed_page_count,
> a single VM can't have both a legacy MMU and a TDP MMU, so it's not like there
> will be collisions with other code incrementing nx_lpage_splits.   And the TDP
> MMU should be updating stats anyways.

This is true, but having two counters is necessary (in the current
implementation) because otherwise you zap more than the requested ratio
of pages.

The simplest solution is to add a "bool tdp_page" to struct
kvm_mmu_page, so that you can have a single list of
lpage_disallowed_pages and a single thread.  The while loop can then
dispatch to the right "zapper" code.

Anyway this patch is completely broken, so let's kick it away to the
next round.

Paolo

>> +
>> +	while (to_zap &&
>> +	       !list_empty(&kvm->arch.tdp_mmu_lpage_disallowed_pages)) {
>> +		/*
>> +		 * We use a separate list instead of just using active_mmu_pages
>> +		 * because the number of lpage_disallowed pages is expected to
>> +		 * be relatively small compared to the total.
>> +		 */
>> +		sp = list_first_entry(&kvm->arch.tdp_mmu_lpage_disallowed_pages,
>> +				      struct kvm_mmu_page,
>> +				      lpage_disallowed_link);
>> +
>> +		old_spte = *sp->parent_sptep;
>> +		*sp->parent_sptep = 0;
>> +
>> +		list_del(&sp->lpage_disallowed_link);
>> +		kvm->arch.tdp_mmu_lpage_disallowed_page_count--;
>> +
>> +		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), sp->gfn,
>> +				    old_spte, 0, sp->role.level + 1);
>> +
>> +		flush = true;
>> +
>> +		if (!--to_zap || need_resched() ||
>> +		    spin_needbreak(&kvm->mmu_lock)) {
>> +			flush = false;
>> +			kvm_flush_remote_tlbs(kvm);
>> +			if (to_zap)
>> +				cond_resched_lock(&kvm->mmu_lock);
>> +		}
>> +	}
>> +
>> +	if (flush)
>> +		kvm_flush_remote_tlbs(kvm);
>> +
>> +	spin_unlock(&kvm->mmu_lock);
>> +	srcu_read_unlock(&kvm->srcu, rcu_idx);
>> +}
>> +
>> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
>> index 2ecb047211a6d..45ea2d44545db 100644
>> --- a/arch/x86/kvm/mmu/tdp_mmu.h
>> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
>> @@ -43,4 +43,6 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
>>  
>>  bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
>>  				   struct kvm_memory_slot *slot, gfn_t gfn);
>> +
>> +void kvm_tdp_mmu_recover_nx_lpages(struct kvm *kvm);
>>  #endif /* __KVM_X86_MMU_TDP_MMU_H */
>> -- 
>> 2.28.0.709.gb0816b6eb0-goog
>>
> 


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 20/22] kvm: mmu: NX largepage recovery for TDP MMU
  2020-09-26  1:14   ` Paolo Bonzini
@ 2020-09-30 22:23     ` Ben Gardon
  0 siblings, 0 replies; 105+ messages in thread
From: Ben Gardon @ 2020-09-30 22:23 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: LKML, kvm, Cannon Matthews, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Fri, Sep 25, 2020 at 6:15 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 25/09/20 23:23, Ben Gardon wrote:
> > +
> > +     if (!kvm->arch.tdp_mmu_enabled)
> > +             return err;
> > +
> > +     err = kvm_vm_create_worker_thread(kvm, kvm_nx_lpage_recovery_worker, 1,
> > +                     "kvm-nx-lpage-tdp-mmu-recovery",
> > +                     &kvm->arch.nx_lpage_tdp_mmu_recovery_thread);
>
> Any reason to have two threads?
>
> Paolo

At some point it felt cleaner. In this patch set NX reclaim is pretty
similar between the "shadow MMU" and TDP MMU so they don't really need
to be separate threads.

>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 20/22] kvm: mmu: NX largepage recovery for TDP MMU
  2020-09-30 18:15   ` Sean Christopherson
  2020-09-30 19:56     ` Paolo Bonzini
@ 2020-09-30 22:27     ` Ben Gardon
  1 sibling, 0 replies; 105+ messages in thread
From: Ben Gardon @ 2020-09-30 22:27 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, kvm, Cannon Matthews, Paolo Bonzini, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Wed, Sep 30, 2020 at 11:16 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Fri, Sep 25, 2020 at 02:23:00PM -0700, Ben Gardon wrote:
> > +/*
> > + * Clear non-leaf SPTEs and free the page tables they point to, if those SPTEs
> > + * exist in order to allow execute access on a region that would otherwise be
> > + * mapped as a large page.
> > + */
> > +void kvm_tdp_mmu_recover_nx_lpages(struct kvm *kvm)
> > +{
> > +     struct kvm_mmu_page *sp;
> > +     bool flush;
> > +     int rcu_idx;
> > +     unsigned int ratio;
> > +     ulong to_zap;
> > +     u64 old_spte;
> > +
> > +     rcu_idx = srcu_read_lock(&kvm->srcu);
> > +     spin_lock(&kvm->mmu_lock);
> > +
> > +     ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
> > +     to_zap = ratio ? DIV_ROUND_UP(kvm->stat.nx_lpage_splits, ratio) : 0;
>
> This is broken, and possibly related to Paolo's INIT_LIST_HEAD issue.  The TDP
> MMU never increments nx_lpage_splits, it instead has its own counter,
> tdp_mmu_lpage_disallowed_page_count.  Unless I'm missing something, to_zap is
> guaranteed to be zero and thus this is completely untested.

Good catch, I should write some NX reclaim selftests.
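
To make the arithmetic concrete, here is a minimal userspace sketch (not
kernel code) of the quoted to_zap computation. DIV_ROUND_UP is copied from
the kernel macro; nx_to_zap(), nx_pages_zapped() and the ratio value of 60
are illustrative assumptions, not taken from the patch. It shows that a
counter that is never incremented yields a budget of zero, so the loop body
never runs regardless of how long the list is.

```c
#include <assert.h>

/* Userspace copy of the kernel's DIV_ROUND_UP() helper. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Same arithmetic as the quoted to_zap computation. */
static unsigned long nx_to_zap(unsigned long nr_splits, unsigned int ratio)
{
	return ratio ? DIV_ROUND_UP(nr_splits, ratio) : 0;
}

/* Model of the while loop: zap until the budget or the list runs out. */
static unsigned long nx_pages_zapped(unsigned long list_len,
				     unsigned long to_zap)
{
	return to_zap < list_len ? to_zap : list_len;
}

static void nx_to_zap_example(void)
{
	unsigned int ratio = 60;	/* illustrative value only */

	/* Counter never incremented: budget is 0, nothing is zapped. */
	assert(nx_to_zap(0, ratio) == 0);
	assert(nx_pages_zapped(500, nx_to_zap(0, ratio)) == 0);

	/* Counter tracks the list: DIV_ROUND_UP(500, 60) == 9. */
	assert(nx_to_zap(500, ratio) == 9);
	assert(nx_pages_zapped(500, nx_to_zap(500, ratio)) == 9);

	/* A ratio of 0 disables recovery entirely. */
	assert(nx_to_zap(500, 0) == 0);
}
```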

>
> I don't see any reason for a separate tdp_mmu_lpage_disallowed_page_count,
> a single VM can't have both a legacy MMU and a TDP MMU, so it's not like there
> will be collisions with other code incrementing nx_lpage_splits.   And the TDP
> MMU should be updating stats anyways.

A VM actually can have both the legacy MMU and TDP MMU, by design. The
legacy MMU handles nested. Eventually I'd like the TDP MMU to be
responsible for building nested shadow TDP tables, but haven't
implemented it.

>
> > +
> > +     while (to_zap &&
> > +            !list_empty(&kvm->arch.tdp_mmu_lpage_disallowed_pages)) {
> > +             /*
> > +              * We use a separate list instead of just using active_mmu_pages
> > +              * because the number of lpage_disallowed pages is expected to
> > +              * be relatively small compared to the total.
> > +              */
> > +             sp = list_first_entry(&kvm->arch.tdp_mmu_lpage_disallowed_pages,
> > +                                   struct kvm_mmu_page,
> > +                                   lpage_disallowed_link);
> > +
> > +             old_spte = *sp->parent_sptep;
> > +             *sp->parent_sptep = 0;
> > +
> > +             list_del(&sp->lpage_disallowed_link);
> > +             kvm->arch.tdp_mmu_lpage_disallowed_page_count--;
> > +
> > +             handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), sp->gfn,
> > +                                 old_spte, 0, sp->role.level + 1);
> > +
> > +             flush = true;
> > +
> > +             if (!--to_zap || need_resched() ||
> > +                 spin_needbreak(&kvm->mmu_lock)) {
> > +                     flush = false;
> > +                     kvm_flush_remote_tlbs(kvm);
> > +                     if (to_zap)
> > +                             cond_resched_lock(&kvm->mmu_lock);
> > +             }
> > +     }
> > +
> > +     if (flush)
> > +             kvm_flush_remote_tlbs(kvm);
> > +
> > +     spin_unlock(&kvm->mmu_lock);
> > +     srcu_read_unlock(&kvm->srcu, rcu_idx);
> > +}
> > +
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> > index 2ecb047211a6d..45ea2d44545db 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.h
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> > @@ -43,4 +43,6 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
> >
> >  bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
> >                                  struct kvm_memory_slot *slot, gfn_t gfn);
> > +
> > +void kvm_tdp_mmu_recover_nx_lpages(struct kvm *kvm);
> >  #endif /* __KVM_X86_MMU_TDP_MMU_H */
> > --
> > 2.28.0.709.gb0816b6eb0-goog
> >

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 20/22] kvm: mmu: NX largepage recovery for TDP MMU
  2020-09-30 19:56     ` Paolo Bonzini
@ 2020-09-30 22:33       ` Ben Gardon
  0 siblings, 0 replies; 105+ messages in thread
From: Ben Gardon @ 2020-09-30 22:33 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, LKML, kvm, Cannon Matthews, Peter Xu,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Wed, Sep 30, 2020 at 12:56 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 30/09/20 20:15, Sean Christopherson wrote:
> > On Fri, Sep 25, 2020 at 02:23:00PM -0700, Ben Gardon wrote:
> >> +/*
> >> + * Clear non-leaf SPTEs and free the page tables they point to, if those SPTEs
> >> + * exist in order to allow execute access on a region that would otherwise be
> >> + * mapped as a large page.
> >> + */
> >> +void kvm_tdp_mmu_recover_nx_lpages(struct kvm *kvm)
> >> +{
> >> +    struct kvm_mmu_page *sp;
> >> +    bool flush;
> >> +    int rcu_idx;
> >> +    unsigned int ratio;
> >> +    ulong to_zap;
> >> +    u64 old_spte;
> >> +
> >> +    rcu_idx = srcu_read_lock(&kvm->srcu);
> >> +    spin_lock(&kvm->mmu_lock);
> >> +
> >> +    ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
> >> +    to_zap = ratio ? DIV_ROUND_UP(kvm->stat.nx_lpage_splits, ratio) : 0;
> >
> > This is broken, and possibly related to Paolo's INIT_LIST_HEAD issue.  The TDP
> > MMU never increments nx_lpage_splits, it instead has its own counter,
> > tdp_mmu_lpage_disallowed_page_count.  Unless I'm missing something, to_zap is
> > guaranteed to be zero and thus this is completely untested.
>
> Except if you do shadow paging (through nested EPT) and then it bombs
> immediately. :)
>
> > I don't see any reason for a separate tdp_mmu_lpage_disallowed_page_count,
> > a single VM can't have both a legacy MMU and a TDP MMU, so it's not like there
> > will be collisions with other code incrementing nx_lpage_splits.   And the TDP
> > MMU should be updating stats anyways.
>
> This is true, but having two counters is necessary (in the current
> implementation) because otherwise you zap more than the requested ratio
> of pages.
>
> The simplest solution is to add a "bool tdp_page" to struct
> kvm_mmu_page, so that you can have a single list of
> lpage_disallowed_pages and a single thread.  The while loop can then
> dispatch to the right "zapper" code.

I actually did add that bool in patch 4: kvm: mmu: Allocate and free
TDP MMU roots.
I'm a little nervous about putting them in the same list, but I agree
it would definitely simplify the implementation of reclaim.
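
A minimal userspace sketch of the single-list idea, assuming a simplified
page structure: one list, one budget, and a per-page flag that selects the
zapper. struct mmu_page, struct zap_stats, zap_shadow_page(), zap_tdp_page()
and recover_nx_lpages() are hypothetical stand-ins, not the kernel's
definitions, and locking and TLB flushing are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for struct kvm_mmu_page. */
struct mmu_page {
	bool tdp_page;		/* true if owned by the TDP MMU */
	struct mmu_page *next;	/* single shared lpage_disallowed list */
};

struct zap_stats {
	unsigned long shadow;
	unsigned long tdp;
};

/* Placeholders for the two MMU-specific zappers. */
static void zap_shadow_page(struct zap_stats *s) { s->shadow++; }
static void zap_tdp_page(struct zap_stats *s) { s->tdp++; }

/*
 * Pop up to to_zap pages from the head of the list and dispatch each one
 * to the matching zapper. Returns the new list head.
 */
static struct mmu_page *recover_nx_lpages(struct mmu_page *head,
					  unsigned long to_zap,
					  struct zap_stats *stats)
{
	while (to_zap && head) {
		struct mmu_page *sp = head;

		head = sp->next;
		sp->next = NULL;

		if (sp->tdp_page)
			zap_tdp_page(stats);
		else
			zap_shadow_page(stats);
		to_zap--;
	}
	return head;
}

static void recover_example(void)
{
	struct mmu_page pages[4] = {
		{ .tdp_page = true },
		{ .tdp_page = false },
		{ .tdp_page = true },
		{ .tdp_page = false },
	};
	struct zap_stats stats = { 0, 0 };
	struct mmu_page *head;
	int i;

	for (i = 0; i < 3; i++)
		pages[i].next = &pages[i + 1];

	/* One shared budget of 3 covers both kinds of page. */
	head = recover_nx_lpages(&pages[0], 3, &stats);
	assert(stats.tdp == 2 && stats.shadow == 1);
	assert(head == &pages[3]);
}
```

Because there is a single budget, the total number of pages zapped per pass
cannot exceed to_zap, whichever MMU owns them.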

>
> Anyway this patch is completely broken, so let's kick it away to the
> next round.

Understood, sorry I didn't test this one better. I'll incorporate your
feedback and include it in the next series.

>
> Paolo
>
> >> +
> >> +    while (to_zap &&
> >> +           !list_empty(&kvm->arch.tdp_mmu_lpage_disallowed_pages)) {
> >> +            /*
> >> +             * We use a separate list instead of just using active_mmu_pages
> >> +             * because the number of lpage_disallowed pages is expected to
> >> +             * be relatively small compared to the total.
> >> +             */
> >> +            sp = list_first_entry(&kvm->arch.tdp_mmu_lpage_disallowed_pages,
> >> +                                  struct kvm_mmu_page,
> >> +                                  lpage_disallowed_link);
> >> +
> >> +            old_spte = *sp->parent_sptep;
> >> +            *sp->parent_sptep = 0;
> >> +
> >> +            list_del(&sp->lpage_disallowed_link);
> >> +            kvm->arch.tdp_mmu_lpage_disallowed_page_count--;
> >> +
> >> +            handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), sp->gfn,
> >> +                                old_spte, 0, sp->role.level + 1);
> >> +
> >> +            flush = true;
> >> +
> >> +            if (!--to_zap || need_resched() ||
> >> +                spin_needbreak(&kvm->mmu_lock)) {
> >> +                    flush = false;
> >> +                    kvm_flush_remote_tlbs(kvm);
> >> +                    if (to_zap)
> >> +                            cond_resched_lock(&kvm->mmu_lock);
> >> +            }
> >> +    }
> >> +
> >> +    if (flush)
> >> +            kvm_flush_remote_tlbs(kvm);
> >> +
> >> +    spin_unlock(&kvm->mmu_lock);
> >> +    srcu_read_unlock(&kvm->srcu, rcu_idx);
> >> +}
> >> +
> >> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> >> index 2ecb047211a6d..45ea2d44545db 100644
> >> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> >> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> >> @@ -43,4 +43,6 @@ void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
> >>
> >>  bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
> >>                                 struct kvm_memory_slot *slot, gfn_t gfn);
> >> +
> >> +void kvm_tdp_mmu_recover_nx_lpages(struct kvm *kvm);
> >>  #endif /* __KVM_X86_MMU_TDP_MMU_H */
> >> --
> >> 2.28.0.709.gb0816b6eb0-goog
> >>
> >
>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 01/22] kvm: mmu: Separate making SPTEs from set_spte
  2020-09-30  4:55   ` Sean Christopherson
@ 2020-09-30 23:03     ` Ben Gardon
  0 siblings, 0 replies; 105+ messages in thread
From: Ben Gardon @ 2020-09-30 23:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, kvm, Cannon Matthews, Paolo Bonzini, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Tue, Sep 29, 2020 at 9:55 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Fri, Sep 25, 2020 at 02:22:41PM -0700, Ben Gardon wrote:
> > +static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
> > +                 unsigned int pte_access, int level,
> > +                 gfn_t gfn, kvm_pfn_t pfn, bool speculative,
> > +                 bool can_unsync, bool host_writable)
> > +{
> > +     u64 spte = 0;
> > +     struct kvm_mmu_page *sp;
> > +     int ret = 0;
> > +
> > +     if (set_mmio_spte(vcpu, sptep, gfn, pfn, pte_access))
> > +             return 0;
> > +
> > +     sp = sptep_to_sp(sptep);
> > +
> > +     spte = make_spte(vcpu, pte_access, level, gfn, pfn, *sptep, speculative,
> > +                      can_unsync, host_writable, sp_ad_disabled(sp), &ret);
> > +     if (!spte)
> > +             return 0;
>
> This is an impossible condition.  Well, maybe it's theoretically possible
> if page track is active, with EPT exec-only support (shadow_present_mask is
> zero), and pfn==0.  But in that case, returning early is wrong.
>
> Rather than return the spte, what about returning 'ret', passing 'new_spte'
> as a u64 *, and dropping the bail early path?  That would also eliminate
> the minor wart of make_spte() relying on the caller to initialize 'ret'.

I agree that would make this much cleaner.
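
A rough userspace sketch of the calling convention suggested above:
make_spte() returns the status flags and hands back the new SPTE through a
pointer, and set_spte() has no early bail-out. The function bodies, the
parameter lists and the flag values below are toy assumptions for
illustration; only the shape of the interface reflects the suggestion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values; treat them as assumptions in this sketch. */
#define SET_SPTE_WRITE_PROTECTED_PT	(1 << 0)
#define SET_SPTE_NEED_REMOTE_TLB_FLUSH	(1 << 1)
#define PT_WRITABLE_MASK		(1ULL << 1)

/*
 * Toy make_spte(): builds the SPTE into *new_spte and returns the status
 * flags itself, so the caller no longer has to pre-initialize 'ret'.
 */
static int make_spte(uint64_t pfn, bool writable, bool must_wrprot,
		     uint64_t *new_spte)
{
	uint64_t spte = pfn << 12;
	int ret = 0;

	if (writable) {
		if (must_wrprot)
			ret |= SET_SPTE_WRITE_PROTECTED_PT;
		else
			spte |= PT_WRITABLE_MASK;
	}

	*new_spte = spte;
	return ret;
}

/* Toy set_spte(): always installs the result, even if it is zero. */
static int set_spte(uint64_t *sptep, uint64_t pfn, bool writable,
		    bool must_wrprot)
{
	uint64_t spte;
	int ret = make_spte(pfn, writable, must_wrprot, &spte);

	*sptep = spte;	/* stand-in for mmu_spte_update() */
	return ret;
}
```

With this shape a zero SPTE is just another value to install, so the
"if (!spte) return 0;" path is not needed.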

>
> > +
> > +     if (spte & PT_WRITABLE_MASK)
> > +             kvm_vcpu_mark_page_dirty(vcpu, gfn);
> > +
> >       if (mmu_spte_update(sptep, spte))
> >               ret |= SET_SPTE_NEED_REMOTE_TLB_FLUSH;
> >       return ret;
> > --
> > 2.28.0.709.gb0816b6eb0-goog
> >

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 06/22] kvm: mmu: Make address space ID a property of memslots
  2020-09-30  6:10   ` Sean Christopherson
@ 2020-09-30 23:11     ` Ben Gardon
  0 siblings, 0 replies; 105+ messages in thread
From: Ben Gardon @ 2020-09-30 23:11 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, kvm, Cannon Matthews, Paolo Bonzini, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Tue, Sep 29, 2020 at 11:11 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Fri, Sep 25, 2020 at 02:22:46PM -0700, Ben Gardon wrote:
> > Save address space ID as a field in each memslot so that functions that
> > do not use rmaps (which implicitly encode the id) can handle multiple
> > address spaces correctly.
> >
> > Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
> > machine. This series introduced no new failures.
> >
> > This series can be viewed in Gerrit at:
> >       https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538
> >
> > Signed-off-by: Ben Gardon <bgardon@google.com>
> > ---
> >  include/linux/kvm_host.h | 1 +
> >  virt/kvm/kvm_main.c      | 1 +
> >  2 files changed, 2 insertions(+)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 05e3c2fb3ef78..a460bc712a81c 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -345,6 +345,7 @@ struct kvm_memory_slot {
> >       struct kvm_arch_memory_slot arch;
> >       unsigned long userspace_addr;
> >       u32 flags;
> > +     int as_id;
>
> Ha!  Peter Xu's dirty ring also added this.  This should be a u16, it'll
> save 8 bytes per memslot (oooooooh).  Any chance you want to include Peter's
> patch[*]?  It has some nitpicking from Peter and me regarding what to do
> with as_id on deletion.  That would also avoid silent merge conflicts on
> Peter's end.
> Peter's end.
>
> [*] https://lkml.kernel.org/r/20200708193408.242909-2-peterx@redhat.com

Oh that's great! Yes, let's use Peter's patch in place of this one.
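
For illustration, the padding effect behind the "8 bytes" remark can be
checked in userspace. The two structs below are hypothetical and model only
the tail fields shown in the quoted hunk (the real struct kvm_memory_slot
has more members), and the 8-byte figure assumes an LP64 ABI such as
x86-64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tail of the memslot as in the quoted hunk: int as_id before short id. */
struct memslot_int_as_id {
	unsigned long userspace_addr;
	uint32_t flags;
	int as_id;
	short id;	/* forces tail padding up to the next 8-byte boundary */
};

/* Same fields with a 16-bit as_id, which packs next to the short id. */
struct memslot_u16_as_id {
	unsigned long userspace_addr;
	uint32_t flags;
	uint16_t as_id;
	short id;
};

/* Bytes saved per memslot by narrowing as_id in this simplified model. */
static size_t as_id_bytes_saved(void)
{
	return sizeof(struct memslot_int_as_id) -
	       sizeof(struct memslot_u16_as_id);
}
```

On LP64 the first layout is 24 bytes and the second is 16, so the narrower
field removes a whole 8-byte slot rather than just 2 bytes.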


>
> >       short id;
> >  };
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index cf88233b819a0..f9c80351c9efd 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -1318,6 +1318,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
> >       new.npages = mem->memory_size >> PAGE_SHIFT;
> >       new.flags = mem->flags;
> >       new.userspace_addr = mem->userspace_addr;
> > +     new.as_id = as_id;
> >
> >       if (new.npages > KVM_MEM_MAX_NR_PAGES)
> >               return -EINVAL;
> > --
> > 2.28.0.709.gb0816b6eb0-goog
> >

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 13/22] kvm: mmu: Support invalidate range MMU notifier for TDP MMU
  2020-09-30 17:03   ` Sean Christopherson
@ 2020-09-30 23:15     ` Ben Gardon
  2020-09-30 23:24       ` Sean Christopherson
  0 siblings, 1 reply; 105+ messages in thread
From: Ben Gardon @ 2020-09-30 23:15 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, kvm, Cannon Matthews, Paolo Bonzini, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Wed, Sep 30, 2020 at 10:04 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Fri, Sep 25, 2020 at 02:22:53PM -0700, Ben Gardon wrote:
> > In order to interoperate correctly with the rest of KVM and other Linux
> > subsystems, the TDP MMU must correctly handle various MMU notifiers. Add
> > hooks to handle the invalidate range family of MMU notifiers.
> >
> > Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
> > machine. This series introduced no new failures.
> >
> > This series can be viewed in Gerrit at:
> >       https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538
> >
> > Signed-off-by: Ben Gardon <bgardon@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c     |  9 ++++-
> >  arch/x86/kvm/mmu/tdp_mmu.c | 80 +++++++++++++++++++++++++++++++++++---
> >  arch/x86/kvm/mmu/tdp_mmu.h |  3 ++
> >  3 files changed, 86 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 52d661a758585..0ddfdab942554 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1884,7 +1884,14 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
> >  int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end,
> >                       unsigned flags)
> >  {
> > -     return kvm_handle_hva_range(kvm, start, end, 0, kvm_unmap_rmapp);
> > +     int r;
> > +
> > +     r = kvm_handle_hva_range(kvm, start, end, 0, kvm_unmap_rmapp);
> > +
> > +     if (kvm->arch.tdp_mmu_enabled)
> > +             r |= kvm_tdp_mmu_zap_hva_range(kvm, start, end);
>
> Similar to an earlier question, is this intentionally additive, or can this
> instead be:
>
>         if (kvm->arch.tdp_mmu_enabled)
>                 r = kvm_tdp_mmu_zap_hva_range(kvm, start, end);
>         else
>                 r = kvm_handle_hva_range(kvm, start, end, 0, kvm_unmap_rmapp);
>

It is intentionally additive so the legacy/shadow MMU can handle nested.
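
A toy userspace model of why the additive form matters when one VM has
mappings in both MMUs (for example, a nested guest alongside the TDP MMU).
struct toy_vm and all four functions are hypothetical; they only count
mappings and do not resemble the real handlers.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_vm {
	bool tdp_mmu_enabled;
	int shadow_sptes;	/* held by the legacy/shadow MMU, e.g. nested */
	int tdp_sptes;		/* held by the TDP MMU */
};

/* Each handler drops its mappings and reports whether it had any. */
static int unmap_shadow(struct toy_vm *vm)
{
	int had = vm->shadow_sptes > 0;

	vm->shadow_sptes = 0;
	return had;
}

static int unmap_tdp(struct toy_vm *vm)
{
	int had = vm->tdp_sptes > 0;

	vm->tdp_sptes = 0;
	return had;
}

/* Additive form, as in the patch: always run the legacy path too. */
static int unmap_additive(struct toy_vm *vm)
{
	int r = unmap_shadow(vm);

	if (vm->tdp_mmu_enabled)
		r |= unmap_tdp(vm);
	return r;
}

/* Either/or form from the question: skips the legacy path entirely. */
static int unmap_exclusive(struct toy_vm *vm)
{
	if (vm->tdp_mmu_enabled)
		return unmap_tdp(vm);
	return unmap_shadow(vm);
}
```

In this model the either/or form leaves shadow_sptes untouched whenever the
TDP MMU is enabled, which is the stale-mapping case the additive form
avoids.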

> > +
> > +     return r;
> >  }
> >
> >  int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 557e780bdf9f9..1cea58db78a13 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -60,7 +60,7 @@ bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
> >  }
> >
> >  static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> > -                       gfn_t start, gfn_t end);
> > +                       gfn_t start, gfn_t end, bool can_yield);
> >
> >  static void free_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
> >  {
> > @@ -73,7 +73,7 @@ static void free_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
> >
> >       list_del(&root->link);
> >
> > -     zap_gfn_range(kvm, root, 0, max_gfn);
> > +     zap_gfn_range(kvm, root, 0, max_gfn, false);
> >
> >       free_page((unsigned long)root->spt);
> >       kmem_cache_free(mmu_page_header_cache, root);
> > @@ -361,9 +361,14 @@ static bool tdp_mmu_iter_cond_resched(struct kvm *kvm, struct tdp_iter *iter)
> >   * non-root pages mapping GFNs strictly within that range. Returns true if
> >   * SPTEs have been cleared and a TLB flush is needed before releasing the
> >   * MMU lock.
> > + * If can_yield is true, will release the MMU lock and reschedule if the
> > + * scheduler needs the CPU or there is contention on the MMU lock. If this
> > + * function cannot yield, it will not release the MMU lock or reschedule and
> > + * the caller must ensure it does not supply too large a GFN range, or the
> > + * operation can cause a soft lockup.
> >   */
> >  static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> > -                       gfn_t start, gfn_t end)
> > +                       gfn_t start, gfn_t end, bool can_yield)
> >  {
> >       struct tdp_iter iter;
> >       bool flush_needed = false;
> > @@ -387,7 +392,10 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> >               handle_changed_spte(kvm, as_id, iter.gfn, iter.old_spte, 0,
> >                                   iter.level);
> >
> > -             flush_needed = !tdp_mmu_iter_cond_resched(kvm, &iter);
> > +             if (can_yield)
> > +                     flush_needed = !tdp_mmu_iter_cond_resched(kvm, &iter);
>
>                 flush_needed = !can_yield || !tdp_mmu_iter_cond_resched(kvm, &iter);
>
> > +             else
> > +                     flush_needed = true;
> >       }
> >       return flush_needed;
> >  }
> > @@ -410,7 +418,7 @@ bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end)
> >                */
> >               get_tdp_mmu_root(kvm, root);
> >
> > -             flush = zap_gfn_range(kvm, root, start, end) || flush;
> > +             flush = zap_gfn_range(kvm, root, start, end, true) || flush;
> >
> >               put_tdp_mmu_root(kvm, root);
> >       }
> > @@ -551,3 +559,65 @@ int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, int write, int map_writable,
> >
> >       return ret;
> >  }
> > +
> > +static int kvm_tdp_mmu_handle_hva_range(struct kvm *kvm, unsigned long start,
> > +             unsigned long end, unsigned long data,
> > +             int (*handler)(struct kvm *kvm, struct kvm_memory_slot *slot,
> > +                            struct kvm_mmu_page *root, gfn_t start,
> > +                            gfn_t end, unsigned long data))
> > +{
> > +     struct kvm_memslots *slots;
> > +     struct kvm_memory_slot *memslot;
> > +     struct kvm_mmu_page *root;
> > +     int ret = 0;
> > +     int as_id;
> > +
> > +     for_each_tdp_mmu_root(kvm, root) {
> > +             /*
> > +              * Take a reference on the root so that it cannot be freed if
> > +              * this thread releases the MMU lock and yields in this loop.
> > +              */
> > +             get_tdp_mmu_root(kvm, root);
> > +
> > +             as_id = kvm_mmu_page_as_id(root);
> > +             slots = __kvm_memslots(kvm, as_id);
> > +             kvm_for_each_memslot(memslot, slots) {
> > +                     unsigned long hva_start, hva_end;
> > +                     gfn_t gfn_start, gfn_end;
> > +
> > +                     hva_start = max(start, memslot->userspace_addr);
> > +                     hva_end = min(end, memslot->userspace_addr +
> > +                                   (memslot->npages << PAGE_SHIFT));
> > +                     if (hva_start >= hva_end)
> > +                             continue;
> > +                     /*
> > +                      * {gfn(page) | page intersects with [hva_start, hva_end)} =
> > +                      * {gfn_start, gfn_start+1, ..., gfn_end-1}.
> > +                      */
> > +                     gfn_start = hva_to_gfn_memslot(hva_start, memslot);
> > +                     gfn_end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, memslot);
> > +
> > +                     ret |= handler(kvm, memslot, root, gfn_start,
> > +                                    gfn_end, data);
>
> Eh, I'd say let this one poke out, the above hva_to_gfn_memslot() already
> overruns 80 chars.  IMO it's more readable without the wraps.

Will do.
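
As a worked example of the quoted range arithmetic, here is a userspace
sketch that clamps an HVA range to one memslot and converts it to a
half-open GFN range. struct toy_memslot and hva_range_to_gfn_range() are
hypothetical helpers; hva_to_gfn_memslot() follows the usual
base_gfn-plus-page-offset definition, and a 4 KiB page size is assumed.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	((uint64_t)1 << PAGE_SHIFT)

typedef uint64_t gfn_t;

/* Hypothetical, trimmed-down memslot. */
struct toy_memslot {
	gfn_t base_gfn;
	uint64_t npages;
	uint64_t userspace_addr;
};

static gfn_t hva_to_gfn_memslot(uint64_t hva, const struct toy_memslot *slot)
{
	return slot->base_gfn + ((hva - slot->userspace_addr) >> PAGE_SHIFT);
}

/*
 * Clamp [start, end) to the slot and convert to [*gfn_start, *gfn_end).
 * Returns 0 if the ranges do not overlap, 1 otherwise.
 */
static int hva_range_to_gfn_range(const struct toy_memslot *slot,
				  uint64_t start, uint64_t end,
				  gfn_t *gfn_start, gfn_t *gfn_end)
{
	uint64_t slot_end = slot->userspace_addr +
			    (slot->npages << PAGE_SHIFT);
	uint64_t hva_start = start > slot->userspace_addr ?
			     start : slot->userspace_addr;
	uint64_t hva_end = end < slot_end ? end : slot_end;

	if (hva_start >= hva_end)
		return 0;

	/* Round the end up so a partially covered last page is included. */
	*gfn_start = hva_to_gfn_memslot(hva_start, slot);
	*gfn_end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
	return 1;
}

static void hva_range_example(void)
{
	struct toy_memslot slot = {
		.base_gfn = 0x100,
		.npages = 16,
		.userspace_addr = 0x7f0000000000ULL,
	};
	uint64_t ua = slot.userspace_addr;
	gfn_t s, e;

	/* Bytes 0x1800..0x3000 touch pages 1, 2 and 3 of the slot. */
	assert(hva_range_to_gfn_range(&slot, ua + 0x1800, ua + 0x3001, &s, &e));
	assert(s == 0x101 && e == 0x104);

	/* A range starting at the end of the slot does not overlap. */
	assert(!hva_range_to_gfn_range(&slot, ua + 16 * PAGE_SIZE,
				       ua + 17 * PAGE_SIZE, &s, &e));
}
```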

>
> > +             }
> > +
> > +             put_tdp_mmu_root(kvm, root);
> > +     }
> > +
> > +     return ret;
> > +}
> > +
> > +static int zap_gfn_range_hva_wrapper(struct kvm *kvm,
> > +                                  struct kvm_memory_slot *slot,
> > +                                  struct kvm_mmu_page *root, gfn_t start,
> > +                                  gfn_t end, unsigned long unused)
> > +{
> > +     return zap_gfn_range(kvm, root, start, end, false);
> > +}
> > +
> > +int kvm_tdp_mmu_zap_hva_range(struct kvm *kvm, unsigned long start,
> > +                           unsigned long end)
> > +{
> > +     return kvm_tdp_mmu_handle_hva_range(kvm, start, end, 0,
> > +                                         zap_gfn_range_hva_wrapper);
> > +}
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> > index abf23dc0ab7ad..ce804a97bfa1d 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.h
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> > @@ -18,4 +18,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm);
> >  int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, int write, int map_writable,
> >                          int level, gpa_t gpa, kvm_pfn_t pfn, bool prefault,
> >                          bool lpage_disallowed);
> > +
> > +int kvm_tdp_mmu_zap_hva_range(struct kvm *kvm, unsigned long start,
> > +                           unsigned long end);
> >  #endif /* __KVM_X86_MMU_TDP_MMU_H */
> > --
> > 2.28.0.709.gb0816b6eb0-goog
> >

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 02/22] kvm: mmu: Introduce tdp_iter
  2020-09-25 21:22 ` [PATCH 02/22] kvm: mmu: Introduce tdp_iter Ben Gardon
                     ` (3 preceding siblings ...)
  2020-09-30  5:24   ` Sean Christopherson
@ 2020-09-30 23:20   ` Eric van Tassell
  2020-09-30 23:34     ` Paolo Bonzini
  4 siblings, 1 reply; 105+ messages in thread
From: Eric van Tassell @ 2020-09-30 23:20 UTC (permalink / raw)
  To: Ben Gardon, linux-kernel, kvm
  Cc: Cannon Matthews, Paolo Bonzini, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong



On 9/25/20 4:22 PM, Ben Gardon wrote:
> The TDP iterator implements a pre-order traversal of a TDP paging
> structure. This iterator will be used in future patches to create
> an efficient implementation of the KVM MMU for the TDP case.
> 
> Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
> machine. This series introduced no new failures.
> 
> This series can be viewed in Gerrit at:
> 	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538
> 
> Signed-off-by: Ben Gardon <bgardon@google.com>
> ---
>   arch/x86/kvm/Makefile           |   3 +-
>   arch/x86/kvm/mmu/mmu.c          |  19 +---
>   arch/x86/kvm/mmu/mmu_internal.h |  15 +++
>   arch/x86/kvm/mmu/tdp_iter.c     | 163 ++++++++++++++++++++++++++++++++
>   arch/x86/kvm/mmu/tdp_iter.h     |  53 +++++++++++
>   5 files changed, 237 insertions(+), 16 deletions(-)
>   create mode 100644 arch/x86/kvm/mmu/tdp_iter.c
>   create mode 100644 arch/x86/kvm/mmu/tdp_iter.h
> 
> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index 4a3081e9f4b5d..cf6a9947955f7 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -15,7 +15,8 @@ kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
>   
>   kvm-y			+= x86.o emulate.o i8259.o irq.o lapic.o \
>   			   i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
> -			   hyperv.o debugfs.o mmu/mmu.o mmu/page_track.o
> +			   hyperv.o debugfs.o mmu/mmu.o mmu/page_track.o \
> +			   mmu/tdp_iter.o
>   
>   kvm-intel-y		+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o vmx/evmcs.o vmx/nested.o
>   kvm-amd-y		+= svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o svm/sev.o
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 81240b558d67f..b48b00c8cde65 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -134,15 +134,6 @@ module_param(dbg, bool, 0644);
>   #define SPTE_AD_WRPROT_ONLY_MASK (2ULL << 52)
>   #define SPTE_MMIO_MASK (3ULL << 52)
>   
> -#define PT64_LEVEL_BITS 9
> -
> -#define PT64_LEVEL_SHIFT(level) \
> -		(PAGE_SHIFT + (level - 1) * PT64_LEVEL_BITS)
> -
> -#define PT64_INDEX(address, level)\
> -	(((address) >> PT64_LEVEL_SHIFT(level)) & ((1 << PT64_LEVEL_BITS) - 1))
> -
> -
>   #define PT32_LEVEL_BITS 10
>   
>   #define PT32_LEVEL_SHIFT(level) \
> @@ -192,8 +183,6 @@ module_param(dbg, bool, 0644);
>   #define SPTE_HOST_WRITEABLE	(1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
>   #define SPTE_MMU_WRITEABLE	(1ULL << (PT_FIRST_AVAIL_BITS_SHIFT + 1))
>   
> -#define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
> -
>   /* make pte_list_desc fit well in cache line */
>   #define PTE_LIST_EXT 3
>   
> @@ -346,7 +335,7 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 access_mask)
>   }
>   EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
>   
> -static bool is_mmio_spte(u64 spte)
> +bool is_mmio_spte(u64 spte)
>   {
>   	return (spte & SPTE_SPECIAL_MASK) == SPTE_MMIO_MASK;
>   }
> @@ -623,7 +612,7 @@ static int is_nx(struct kvm_vcpu *vcpu)
>   	return vcpu->arch.efer & EFER_NX;
>   }
>   
> -static int is_shadow_present_pte(u64 pte)
> +int is_shadow_present_pte(u64 pte)
>   {
>   	return (pte != 0) && !is_mmio_spte(pte);
From "Figure 28-1: Formats of EPTP and EPT Paging-Structure Entries" of
the manual I don't have at my fingertips right now, I believe you should
only check the low 3 bits (mask = 0x7). Since the upper bits are ignored,
might that not mean they're not guaranteed to be 0?
>   }
> @@ -633,7 +622,7 @@ static int is_large_pte(u64 pte)
>   	return pte & PT_PAGE_SIZE_MASK;
>   }
>   
> -static int is_last_spte(u64 pte, int level)
> +int is_last_spte(u64 pte, int level)
>   {
>   	if (level == PG_LEVEL_4K)
>   		return 1;
> @@ -647,7 +636,7 @@ static bool is_executable_pte(u64 spte)
>   	return (spte & (shadow_x_mask | shadow_nx_mask)) == shadow_x_mask;
>   }
>   
> -static kvm_pfn_t spte_to_pfn(u64 pte)
> +kvm_pfn_t spte_to_pfn(u64 pte)
>   {
>   	return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
>   }
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 3acf3b8eb469d..65bb110847858 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -60,4 +60,19 @@ void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
>   bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
>   				    struct kvm_memory_slot *slot, u64 gfn);
>   
> +#define PT64_LEVEL_BITS 9
> +
> +#define PT64_LEVEL_SHIFT(level) \
> +		(PAGE_SHIFT + (level - 1) * PT64_LEVEL_BITS)
> +
> +#define PT64_INDEX(address, level)\
> +	(((address) >> PT64_LEVEL_SHIFT(level)) & ((1 << PT64_LEVEL_BITS) - 1))
> +#define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
> +
> +/* Functions for interpreting SPTEs */
> +kvm_pfn_t spte_to_pfn(u64 pte);
> +bool is_mmio_spte(u64 spte);
> +int is_shadow_present_pte(u64 pte);
> +int is_last_spte(u64 pte, int level);
> +
>   #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
> new file mode 100644
> index 0000000000000..ee90d62d2a9b1
> --- /dev/null
> +++ b/arch/x86/kvm/mmu/tdp_iter.c
> @@ -0,0 +1,163 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#include "mmu_internal.h"
> +#include "tdp_iter.h"
> +
> +/*
> + * Recalculates the pointer to the SPTE for the current GFN and level and
> + * rereads the SPTE.
> + */
> +static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
> +{
> +	iter->sptep = iter->pt_path[iter->level - 1] +
> +		SHADOW_PT_INDEX(iter->gfn << PAGE_SHIFT, iter->level);
> +	iter->old_spte = READ_ONCE(*iter->sptep);
> +}
> +
> +/*
> + * Sets a TDP iterator to walk a pre-order traversal of the paging structure
> + * rooted at root_pt, starting with the walk to translate goal_gfn.
> + */
> +void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
> +		    gfn_t goal_gfn)
> +{
> +	WARN_ON(root_level < 1);
> +	WARN_ON(root_level > PT64_ROOT_MAX_LEVEL);
> +
> +	iter->goal_gfn = goal_gfn;
> +	iter->root_level = root_level;
> +	iter->level = root_level;
> +	iter->pt_path[iter->level - 1] = root_pt;
> +
> +	iter->gfn = iter->goal_gfn -
> +		(iter->goal_gfn % KVM_PAGES_PER_HPAGE(iter->level));
> +	tdp_iter_refresh_sptep(iter);
> +
> +	iter->valid = true;
> +}
> +
> +/*
> + * Given an SPTE and its level, returns a pointer containing the host virtual
> + * address of the child page table referenced by the SPTE. Returns null if
> + * there is no such entry.
> + */
> +u64 *spte_to_child_pt(u64 spte, int level)
> +{
> +	u64 *pt;
> +	/* There's no child entry if this entry isn't present */
> +	if (!is_shadow_present_pte(spte))
> +		return NULL;
> +
> +	/* There is no child page table if this is a leaf entry. */
> +	if (is_last_spte(spte, level))
> +		return NULL;
> +
> +	pt = (u64 *)__va(spte_to_pfn(spte) << PAGE_SHIFT);
> +	return pt;
> +}
> +
> +/*
> + * Steps down one level in the paging structure towards the goal GFN. Returns
> + * true if the iterator was able to step down a level, false otherwise.
> + */
> +static bool try_step_down(struct tdp_iter *iter)
> +{
> +	u64 *child_pt;
> +
> +	if (iter->level == PG_LEVEL_4K)
> +		return false;
> +
> +	/*
> +	 * Reread the SPTE before stepping down to avoid traversing into page
> +	 * tables that are no longer linked from this entry.
> +	 */
> +	iter->old_spte = READ_ONCE(*iter->sptep);
> +
> +	child_pt = spte_to_child_pt(iter->old_spte, iter->level);
> +	if (!child_pt)
> +		return false;
> +
> +	iter->level--;
> +	iter->pt_path[iter->level - 1] = child_pt;
> +	iter->gfn = iter->goal_gfn -
> +		(iter->goal_gfn % KVM_PAGES_PER_HPAGE(iter->level));
> +	tdp_iter_refresh_sptep(iter);
> +
> +	return true;
> +}
> +
> +/*
> + * Steps to the next entry in the current page table, at the current page table
> + * level. The next entry could point to a page backing guest memory or another
> + * page table, or it could be non-present. Returns true if the iterator was
> + * able to step to the next entry in the page table, false if the iterator was
> + * already at the end of the current page table.
> + */
> +static bool try_step_side(struct tdp_iter *iter)
> +{
> +	/*
> +	 * Check if the iterator is already at the end of the current page
> +	 * table.
> +	 */
> +	if (!((iter->gfn + KVM_PAGES_PER_HPAGE(iter->level)) %
> +	      KVM_PAGES_PER_HPAGE(iter->level + 1)))
> +		return false;
> +
> +	iter->gfn += KVM_PAGES_PER_HPAGE(iter->level);
> +	iter->goal_gfn = iter->gfn;
> +	iter->sptep++;
> +	iter->old_spte = READ_ONCE(*iter->sptep);
> +
> +	return true;
> +}
> +
> +/*
> + * Tries to traverse back up a level in the paging structure so that the walk
> + * can continue from the next entry in the parent page table. Returns true on a
> + * successful step up, false if already in the root page.
> + */
> +static bool try_step_up(struct tdp_iter *iter)
> +{
> +	if (iter->level == iter->root_level)
> +		return false;
> +
> +	iter->level++;
> +	iter->gfn =  iter->gfn - (iter->gfn % KVM_PAGES_PER_HPAGE(iter->level));
> +	tdp_iter_refresh_sptep(iter);
> +
> +	return true;
> +}
> +
> +/*
> + * Step to the next SPTE in a pre-order traversal of the paging structure.
> + * To get to the next SPTE, the iterator either steps down towards the goal
> + * GFN, if at a present, non-last-level SPTE, or over to a SPTE mapping a
> + * higher GFN.
> + *
> + * The basic algorithm is as follows:
> + * 1. If the current SPTE is a non-last-level SPTE, step down into the page
> + *    table it points to.
> + * 2. If the iterator cannot step down, it will try to step to the next SPTE
> + *    in the current page of the paging structure.
> + * 3. If the iterator cannot step to the next entry in the current page, it will
> + *    try to step up to the parent paging structure page. In this case, that
> + *    SPTE will have already been visited, and so the iterator must also step
> + *    to the side again.
> + */
> +void tdp_iter_next(struct tdp_iter *iter)
> +{
> +	bool done;
> +
> +	done = try_step_down(iter);
> +	if (done)
> +		return;
> +
> +	done = try_step_side(iter);
> +	while (!done) {
> +		if (!try_step_up(iter)) {
> +			iter->valid = false;
> +			break;
> +		}
> +		done = try_step_side(iter);
> +	}
> +}
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> new file mode 100644
> index 0000000000000..b102109778eac
> --- /dev/null
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -0,0 +1,53 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef __KVM_X86_MMU_TDP_ITER_H
> +#define __KVM_X86_MMU_TDP_ITER_H
> +
> +#include <linux/kvm_host.h>
> +
> +#include "mmu.h"
> +
> +/*
> + * A TDP iterator performs a pre-order walk over a TDP paging structure.
> + */
> +struct tdp_iter {
> +	/*
> +	 * The iterator will traverse the paging structure towards the mapping
> +	 * for this GFN.
> +	 */
> +	gfn_t goal_gfn;
> +	/* Pointers to the page tables traversed to reach the current SPTE */
> +	u64 *pt_path[PT64_ROOT_MAX_LEVEL];
> +	/* A pointer to the current SPTE */
> +	u64 *sptep;
> +	/* The lowest GFN mapped by the current SPTE */
> +	gfn_t gfn;
> +	/* The level of the root page given to the iterator */
> +	int root_level;
> +	/* The iterator's current level within the paging structure */
> +	int level;
> +	/* A snapshot of the value at sptep */
> +	u64 old_spte;
> +	/*
> +	 * Whether the iterator has a valid state. This will be false if the
> +	 * iterator walks off the end of the paging structure.
> +	 */
> +	bool valid;
> +};
> +
> +/*
> + * Iterates over every SPTE mapping the GFN range [start, end) in a
> + * preorder traversal.
> + */
> +#define for_each_tdp_pte(iter, root, root_level, start, end) \
> +	for (tdp_iter_start(&iter, root, root_level, start); \
> +	     iter.valid && iter.gfn < end;		     \
> +	     tdp_iter_next(&iter))
> +
> +u64 *spte_to_child_pt(u64 pte, int level);
> +
> +void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
> +		    gfn_t goal_gfn);
> +void tdp_iter_next(struct tdp_iter *iter);
> +
> +#endif /* __KVM_X86_MMU_TDP_ITER_H */
> 
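
For readers who want to experiment with the walk outside the kernel, the
down/side/up stepping described in the tdp_iter_next() comment can be
sketched in user space. This is only an illustration under stated
assumptions: every toy_* and TOY_* name is hypothetical, tables have 4
entries instead of 512, an entry is just a child pointer (NULL stands for
both leaf and non-present entries), and there is no READ_ONCE() or any
other concurrency handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_LEVEL_BITS 2	/* 4 entries per table; the patch uses 9 */
#define TOY_ENTRIES    (1u << TOY_LEVEL_BITS)
#define TOY_MAX_LEVEL  3

/* Number of GFNs covered by one entry at level (level 1 covers one page). */
#define TOY_PAGES(level) (1ull << (((level) - 1) * TOY_LEVEL_BITS))
/* Index of gfn within a table at level, analogous to PT64_INDEX. */
#define TOY_INDEX(gfn, level) \
	(((gfn) >> (((level) - 1) * TOY_LEVEL_BITS)) & (TOY_ENTRIES - 1))

struct toy_entry {
	struct toy_entry *child;	/* lower-level table, or NULL */
};

struct toy_iter {
	uint64_t goal_gfn;
	struct toy_entry *pt_path[TOY_MAX_LEVEL];
	struct toy_entry *entry;
	uint64_t gfn;
	int root_level;
	int level;
	bool valid;
};

/* Recompute the entry pointer for the current gfn and level. */
static void toy_refresh(struct toy_iter *it)
{
	it->entry = it->pt_path[it->level - 1] + TOY_INDEX(it->gfn, it->level);
}

void toy_iter_start(struct toy_iter *it, struct toy_entry *root,
		    int root_level, uint64_t goal_gfn)
{
	it->goal_gfn = goal_gfn;
	it->root_level = root_level;
	it->level = root_level;
	it->pt_path[root_level - 1] = root;
	it->gfn = goal_gfn - goal_gfn % TOY_PAGES(root_level);
	toy_refresh(it);
	it->valid = true;
}

/* Descend towards goal_gfn if the current entry links a child table. */
static bool toy_step_down(struct toy_iter *it)
{
	struct toy_entry *child = it->entry->child;

	if (it->level == 1 || !child)
		return false;

	it->level--;
	it->pt_path[it->level - 1] = child;
	it->gfn = it->goal_gfn - it->goal_gfn % TOY_PAGES(it->level);
	toy_refresh(it);
	return true;
}

/* Move to the next entry of the current table, if there is one. */
static bool toy_step_side(struct toy_iter *it)
{
	if (TOY_INDEX(it->gfn, it->level) == TOY_ENTRIES - 1)
		return false;

	it->gfn += TOY_PAGES(it->level);
	it->goal_gfn = it->gfn;
	it->entry++;
	return true;
}

/* Return to the parent entry, which has already been visited. */
static bool toy_step_up(struct toy_iter *it)
{
	if (it->level == it->root_level)
		return false;

	it->level++;
	it->gfn -= it->gfn % TOY_PAGES(it->level);
	toy_refresh(it);
	return true;
}

void toy_iter_next(struct toy_iter *it)
{
	if (toy_step_down(it))
		return;

	while (!toy_step_side(it)) {
		if (!toy_step_up(it)) {
			it->valid = false;
			return;
		}
	}
}

/* Count the entries visited for GFNs in [start, end). */
int toy_count_visits(struct toy_entry *root, int root_level,
		     uint64_t start, uint64_t end)
{
	struct toy_iter it;
	int n = 0;

	for (toy_iter_start(&it, root, root_level, start);
	     it.valid && it.gfn < end; toy_iter_next(&it))
		n++;
	return n;
}
```

With a two-level table whose first root entry points at a child table,
walking GFNs [0, 16) visits the root entry, then the four child entries,
then the three remaining root entries.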

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 13/22] kvm: mmu: Support invalidate range MMU notifier for TDP MMU
  2020-09-30 23:15     ` Ben Gardon
@ 2020-09-30 23:24       ` Sean Christopherson
  2020-09-30 23:27         ` Ben Gardon
  0 siblings, 1 reply; 105+ messages in thread
From: Sean Christopherson @ 2020-09-30 23:24 UTC (permalink / raw)
  To: Ben Gardon
  Cc: LKML, kvm, Cannon Matthews, Paolo Bonzini, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Wed, Sep 30, 2020 at 04:15:17PM -0700, Ben Gardon wrote:
> On Wed, Sep 30, 2020 at 10:04 AM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 52d661a758585..0ddfdab942554 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -1884,7 +1884,14 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
> > >  int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end,
> > >                       unsigned flags)
> > >  {
> > > -     return kvm_handle_hva_range(kvm, start, end, 0, kvm_unmap_rmapp);
> > > +     int r;
> > > +
> > > +     r = kvm_handle_hva_range(kvm, start, end, 0, kvm_unmap_rmapp);
> > > +
> > > +     if (kvm->arch.tdp_mmu_enabled)
> > > +             r |= kvm_tdp_mmu_zap_hva_range(kvm, start, end);
> >
> > Similar to an earlier question, is this intentionally additive, or can this
> > instead be:
> >
> >         if (kvm->arch.tdp_mmu_enabled)
> >                 r = kvm_tdp_mmu_zap_hva_range(kvm, start, end);
> >         else
> >                 r = kvm_handle_hva_range(kvm, start, end, 0, kvm_unmap_rmapp);
> >
> 
> It is intentionally additive so the legacy/shadow MMU can handle nested.

Duh.  Now everything makes sense.  I completely spaced on nested EPT.

I wonder if it would be worth adding a per-VM sticky bit that is set when an
rmap is added so that all of these flows can skip the rmap walks when using
the TDP MMU without a nested guest.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 13/22] kvm: mmu: Support invalidate range MMU notifier for TDP MMU
  2020-09-30 23:24       ` Sean Christopherson
@ 2020-09-30 23:27         ` Ben Gardon
  0 siblings, 0 replies; 105+ messages in thread
From: Ben Gardon @ 2020-09-30 23:27 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, kvm, Cannon Matthews, Paolo Bonzini, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Wed, Sep 30, 2020 at 4:24 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Wed, Sep 30, 2020 at 04:15:17PM -0700, Ben Gardon wrote:
> > On Wed, Sep 30, 2020 at 10:04 AM Sean Christopherson
> > <sean.j.christopherson@intel.com> wrote:
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index 52d661a758585..0ddfdab942554 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -1884,7 +1884,14 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
> > > >  int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end,
> > > >                       unsigned flags)
> > > >  {
> > > > -     return kvm_handle_hva_range(kvm, start, end, 0, kvm_unmap_rmapp);
> > > > +     int r;
> > > > +
> > > > +     r = kvm_handle_hva_range(kvm, start, end, 0, kvm_unmap_rmapp);
> > > > +
> > > > +     if (kvm->arch.tdp_mmu_enabled)
> > > > +             r |= kvm_tdp_mmu_zap_hva_range(kvm, start, end);
> > >
> > > Similar to an earlier question, is this intentionally additive, or can this
> > > instead be:
> > >
> > >         if (kvm->arch.tdp_mmu_enabled)
> > >                 r = kvm_tdp_mmu_zap_hva_range(kvm, start, end);
> > >         else
> > >                 r = kvm_handle_hva_range(kvm, start, end, 0, kvm_unmap_rmapp);
> > >
> >
> > It is intentionally additive so the legacy/shadow MMU can handle nested.
>
> Duh.  Now everything makes sense.  I completely spaced on nested EPT.
>
> I wonder if it would be worth adding a per-VM sticky bit that is set when an
> rmap is added so that all of these flows can skip the rmap walks when using
> the TDP MMU without a nested guest.

We actually do that in the full version of this whole TDP MMU scheme.
It works very well.
I'm not sure why I didn't include that in this patch set - probably
just complexity. I'll definitely include that as an optimization along
with the lazy rmap allocation in the followup patch set.
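
A rough user-space sketch of the sticky-bit idea discussed above, assuming
a hypothetical rmap_in_use flag that is set the first time an rmap entry is
added and never cleared. None of the toy_* names or fields exist in KVM;
the two handler stubs only bump counters so the control flow is observable.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_kvm {
	bool tdp_mmu_enabled;
	bool rmap_in_use;	/* hypothetical sticky bit: set once, never cleared */
	int rmap_walks;		/* counters that make the behaviour observable */
	int tdp_zaps;
};

/* Would be called wherever the shadow MMU adds an rmap entry. */
void toy_rmap_add(struct toy_kvm *kvm)
{
	kvm->rmap_in_use = true;
}

/* Stand-in for the rmap-based handler of the legacy/shadow MMU. */
static int toy_handle_hva_range_rmap(struct toy_kvm *kvm)
{
	kvm->rmap_walks++;
	return 0;
}

/* Stand-in for the TDP MMU zap path. */
static int toy_tdp_mmu_zap_hva_range(struct toy_kvm *kvm)
{
	kvm->tdp_zaps++;
	return 1;
}

int toy_unmap_hva_range(struct toy_kvm *kvm)
{
	int r = 0;

	/*
	 * The rmap walk is skipped only when the TDP MMU is enabled and no
	 * rmap entry was ever added (e.g. no nested guest has run).
	 */
	if (!kvm->tdp_mmu_enabled || kvm->rmap_in_use)
		r = toy_handle_hva_range_rmap(kvm);

	/* Still additive: both MMUs are handled once rmaps exist. */
	if (kvm->tdp_mmu_enabled)
		r |= toy_tdp_mmu_zap_hva_range(kvm);

	return r;
}
```

The flag only ever moves from false to true, so a reader that sees it clear
while holding the MMU lock can skip the walk; how it would be published
safely without that lock is outside the scope of this sketch.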

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 02/22] kvm: mmu: Introduce tdp_iter
  2020-09-30 23:20   ` Eric van Tassell
@ 2020-09-30 23:34     ` Paolo Bonzini
  2020-10-01  0:07       ` Sean Christopherson
  0 siblings, 1 reply; 105+ messages in thread
From: Paolo Bonzini @ 2020-09-30 23:34 UTC (permalink / raw)
  To: Eric van Tassell, Ben Gardon, linux-kernel, kvm
  Cc: Cannon Matthews, Peter Xu, Sean Christopherson, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 01/10/20 01:20, Eric van Tassell wrote:
>>
>> +int is_shadow_present_pte(u64 pte)
>>   {
>>       return (pte != 0) && !is_mmio_spte(pte);
> From "Figure 28-1: Formats of EPTP and EPT Paging-Structure Entries" of
> the manual I don't have at my fingertips right now, I believe you should
> only check the low 3 bits (mask = 0x7). Since the upper bits are ignored,
> might that not mean they're not guaranteed to be 0?

No, this is a property of the KVM MMU (and how it builds its PTEs) rather
than the hardware present check.

Paolo


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 02/22] kvm: mmu: Introduce tdp_iter
  2020-09-30 23:34     ` Paolo Bonzini
@ 2020-10-01  0:07       ` Sean Christopherson
  0 siblings, 0 replies; 105+ messages in thread
From: Sean Christopherson @ 2020-10-01  0:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Eric van Tassell, Ben Gardon, linux-kernel, kvm, Cannon Matthews,
	Peter Xu, Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Thu, Oct 01, 2020 at 01:34:53AM +0200, Paolo Bonzini wrote:
> On 01/10/20 01:20, Eric van Tassell wrote:
> >>
> >> +int is_shadow_present_pte(u64 pte)
> >>   {
> >>       return (pte != 0) && !is_mmio_spte(pte);
> > From "Figure 28-1: Formats of EPTP and EPT Paging-Structure Entries" of
> > the manual I don't have at my fingertips right now, I believe you should
> > only check the low 3 bits (mask = 0x7). Since the upper bits are ignored,
> > might that not mean they're not guaranteed to be 0?
> 
> No, this is a property of the KVM MMU (and how it builds its PTEs) rather
> than the hardware present check.

Ya, I found out the hard way that "present" in is_shadow_present_pte() really
means "valid", or "may be present".  The most notable case is EPT without A/D
bits (I think this is the only case where a valid SPTE can be fully not-present
in hardware).  Accessed tracking will clear all RWX bits to make the EPT entry
not-present, but from KVM's perspective it's treated as valid/present because
it can be made present in hardware without taking the MMU lock.
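
The difference between the hardware check and the KVM check can be sketched
in user-space C. This is a simplification, not the kernel code: the toy_*
names are hypothetical, TOY_SPECIAL_MASK is assumed to cover the same two
bits as the SPTE_MMIO_MASK value shown in the patch context, and the
hardware test looks only at bits 2:0 (it ignores mode-based execute
control).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_EPT_RWX_MASK  0x7ull	/* bits 2:0: read, write, execute */
#define TOY_SPECIAL_MASK  (3ull << 52)	/* assumed software-available bits */
#define TOY_MMIO_MASK     (3ull << 52)	/* value shown in the patch context */

/* What the hardware walk cares about: any of R/W/X set. */
bool toy_hw_present(uint64_t spte)
{
	return (spte & TOY_EPT_RWX_MASK) != 0;
}

bool toy_is_mmio_spte(uint64_t spte)
{
	return (spte & TOY_SPECIAL_MASK) == TOY_MMIO_MASK;
}

/* What the quoted helper checks: non-zero and not an MMIO SPTE. */
bool toy_is_shadow_present(uint64_t spte)
{
	return spte != 0 && !toy_is_mmio_spte(spte);
}
```

An entry such as 0x1234000, with a page frame but all of R/W/X cleared (as
access tracking would leave it), is not present to hardware yet still
counts as present for toy_is_shadow_present(). A mask-with-0x7 test would
treat it as absent.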

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 22/22] kvm: mmu: Don't clear write flooding count for direct roots
  2020-09-26  1:25   ` Paolo Bonzini
@ 2020-10-05 22:48     ` Ben Gardon
  2020-10-05 23:44       ` Sean Christopherson
  0 siblings, 1 reply; 105+ messages in thread
From: Ben Gardon @ 2020-10-05 22:48 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: LKML, kvm, Cannon Matthews, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Fri, Sep 25, 2020 at 6:25 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 25/09/20 23:23, Ben Gardon wrote:
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 42dde27decd75..c07831b0c73e1 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -124,6 +124,18 @@ static struct kvm_mmu_page *find_tdp_mmu_root_with_role(
> >       return NULL;
> >  }
> >
> > +hpa_t kvm_tdp_mmu_root_hpa_for_role(struct kvm *kvm,
> > +                                 union kvm_mmu_page_role role)
> > +{
> > +     struct kvm_mmu_page *root;
> > +
> > +     root = find_tdp_mmu_root_with_role(kvm, role);
> > +     if (root)
> > +             return __pa(root->spt);
> > +
> > +     return INVALID_PAGE;
> > +}
> > +
> >  static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
> >                                                  int level)
> >  {
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> > index cc0b7241975aa..2395ffa71bb05 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.h
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> > @@ -9,6 +9,8 @@ void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
> >  void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
> >
> >  bool is_tdp_mmu_root(struct kvm *kvm, hpa_t root);
> > +hpa_t kvm_tdp_mmu_root_hpa_for_role(struct kvm *kvm,
> > +                                 union kvm_mmu_page_role role);
> >  hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
> >  void kvm_tdp_mmu_put_root_hpa(struct kvm *kvm, hpa_t root_hpa);
> >
>
> Probably missing a piece since this code is not used and neither is the
> new argument to is_root_usable.
>
> I'm a bit confused by is_root_usable since there should be only one PGD
> for the TDP MMU (the one for the root_mmu).

*facepalm* sorry about that. This commit used to be titled "Implement
fast CR3 switching for the TDP MMU" but several refactors later most
of it was not useful. The only change that should be part of this
patch is the one to avoid clearing the write flooding counts. I must
have failed to revert the other changes.

>
> Paolo
>

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 22/22] kvm: mmu: Don't clear write flooding count for direct roots
  2020-10-05 22:48     ` Ben Gardon
@ 2020-10-05 23:44       ` Sean Christopherson
  2020-10-06 16:19         ` Ben Gardon
  0 siblings, 1 reply; 105+ messages in thread
From: Sean Christopherson @ 2020-10-05 23:44 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Paolo Bonzini, LKML, kvm, Cannon Matthews, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Mon, Oct 05, 2020 at 03:48:09PM -0700, Ben Gardon wrote:
> On Fri, Sep 25, 2020 at 6:25 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > On 25/09/20 23:23, Ben Gardon wrote:
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index 42dde27decd75..c07831b0c73e1 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -124,6 +124,18 @@ static struct kvm_mmu_page *find_tdp_mmu_root_with_role(
> > >       return NULL;
> > >  }
> > >
> > > +hpa_t kvm_tdp_mmu_root_hpa_for_role(struct kvm *kvm,
> > > +                                 union kvm_mmu_page_role role)
> > > +{
> > > +     struct kvm_mmu_page *root;
> > > +
> > > +     root = find_tdp_mmu_root_with_role(kvm, role);
> > > +     if (root)
> > > +             return __pa(root->spt);
> > > +
> > > +     return INVALID_PAGE;
> > > +}
> > > +
> > >  static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
> > >                                                  int level)
> > >  {
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> > > index cc0b7241975aa..2395ffa71bb05 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.h
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> > > @@ -9,6 +9,8 @@ void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
> > >  void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
> > >
> > >  bool is_tdp_mmu_root(struct kvm *kvm, hpa_t root);
> > > +hpa_t kvm_tdp_mmu_root_hpa_for_role(struct kvm *kvm,
> > > +                                 union kvm_mmu_page_role role);
> > >  hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
> > >  void kvm_tdp_mmu_put_root_hpa(struct kvm *kvm, hpa_t root_hpa);
> > >
> >
> > Probably missing a piece since this code is not used and neither is the
> > new argument to is_root_usable.
> >
> > I'm a bit confused by is_root_usable since there should be only one PGD
> > for the TDP MMU (the one for the root_mmu).
> 
> *facepalm* sorry about that. This commit used to be titled "Implement
> fast CR3 switching for the TDP MMU" but several refactors later most
> of it was not useful. The only change that should be part of this
> patch is the one to avoid clearing the write flooding counts. I must
> have failed to revert the other changes.

Tangentially related, isn't it possible to end up with multiple roots if the
MAXPHYSADDR is different between vCPUs?  I.e. if userspace coerces KVM into
using a mix of 4-level and 5-level EPT?

Not saying that's a remotely valid config...

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 22/22] kvm: mmu: Don't clear write flooding count for direct roots
  2020-10-05 23:44       ` Sean Christopherson
@ 2020-10-06 16:19         ` Ben Gardon
  0 siblings, 0 replies; 105+ messages in thread
From: Ben Gardon @ 2020-10-06 16:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, LKML, kvm, Cannon Matthews, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Mon, Oct 5, 2020 at 5:07 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Mon, Oct 05, 2020 at 03:48:09PM -0700, Ben Gardon wrote:
> > On Fri, Sep 25, 2020 at 6:25 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> > >
> > > On 25/09/20 23:23, Ben Gardon wrote:
> > > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > > index 42dde27decd75..c07831b0c73e1 100644
> > > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > > @@ -124,6 +124,18 @@ static struct kvm_mmu_page *find_tdp_mmu_root_with_role(
> > > >       return NULL;
> > > >  }
> > > >
> > > > +hpa_t kvm_tdp_mmu_root_hpa_for_role(struct kvm *kvm,
> > > > +                                 union kvm_mmu_page_role role)
> > > > +{
> > > > +     struct kvm_mmu_page *root;
> > > > +
> > > > +     root = find_tdp_mmu_root_with_role(kvm, role);
> > > > +     if (root)
> > > > +             return __pa(root->spt);
> > > > +
> > > > +     return INVALID_PAGE;
> > > > +}
> > > > +
> > > >  static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
> > > >                                                  int level)
> > > >  {
> > > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> > > > index cc0b7241975aa..2395ffa71bb05 100644
> > > > --- a/arch/x86/kvm/mmu/tdp_mmu.h
> > > > +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> > > > @@ -9,6 +9,8 @@ void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
> > > >  void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
> > > >
> > > >  bool is_tdp_mmu_root(struct kvm *kvm, hpa_t root);
> > > > +hpa_t kvm_tdp_mmu_root_hpa_for_role(struct kvm *kvm,
> > > > +                                 union kvm_mmu_page_role role);
> > > >  hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
> > > >  void kvm_tdp_mmu_put_root_hpa(struct kvm *kvm, hpa_t root_hpa);
> > > >
> > >
> > > Probably missing a piece since this code is not used and neither is the
> > > new argument to is_root_usable.
> > >
> > > I'm a bit confused by is_root_usable since there should be only one PGD
> > > for the TDP MMU (the one for the root_mmu).
> >
> > *facepalm* sorry about that. This commit used to be titled "Implement
> > fast CR3 switching for the TDP MMU" but several refactors later most
> > of it was not useful. The only change that should be part of this
> > patch is the one to avoid clearing the write flooding counts. I must
> > have failed to revert the other changes.
>
> Tangentially related, isn't it possible to end up with multiple roots if the
> MAXPHYSADDR is different between vCPUs?  I.e. if userspace coerces KVM into
> using a mix of 4-level and 5-level EPT?
>
> Not saying that's a remotely valid config...

We'll also end up with multiple TDP MMU roots if using SMM, and being
able to switch back and forth between "legacy/shadow MMU" roots and
TDP MMU roots improves nested performance since we can use the TDP MMU
for L1.
Since the TDP MMU associates struct kvm_mmu_pages with all its roots,
no special casing should be needed for root switching.
At one point in this patch set I was using some alternative data
structure to replace struct kvm_mmu_page for the TDP MMU, but I
abandoned that approach.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH 10/22] kvm: mmu: Add TDP MMU PF handler
  2020-09-30 16:37   ` Sean Christopherson
  2020-09-30 16:55     ` Paolo Bonzini
  2020-09-30 17:37     ` Paolo Bonzini
@ 2020-10-06 22:33     ` Ben Gardon
  2020-10-07 20:55       ` Sean Christopherson
  2 siblings, 1 reply; 105+ messages in thread
From: Ben Gardon @ 2020-10-06 22:33 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, kvm, Cannon Matthews, Paolo Bonzini, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Wed, Sep 30, 2020 at 9:37 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Fri, Sep 25, 2020 at 02:22:50PM -0700, Ben Gardon wrote:
> > @@ -4113,8 +4088,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> >       if (page_fault_handle_page_track(vcpu, error_code, gfn))
> >               return RET_PF_EMULATE;
> >
> > -     if (fast_page_fault(vcpu, gpa, error_code))
> > -             return RET_PF_RETRY;
> > +     if (!is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
> > +             if (fast_page_fault(vcpu, gpa, error_code))
> > +                     return RET_PF_RETRY;
>
> It'll probably be easier to handle is_tdp_mmu() in fast_page_fault().

I'd prefer to keep this check here because then in the fast page fault
path, we can just handle the case where we do have a tdp mmu root with
the tdp mmu fast pf handler and it'll mirror the split below with
__direct_map and the TDP MMU PF handler.

>
> >
> >       r = mmu_topup_memory_caches(vcpu, false);
> >       if (r)
> > @@ -4139,8 +4115,14 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> >       r = make_mmu_pages_available(vcpu);
> >       if (r)
> >               goto out_unlock;
> > -     r = __direct_map(vcpu, gpa, write, map_writable, max_level, pfn,
> > -                      prefault, is_tdp && lpage_disallowed);
> > +
> > +     if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
> > +             r = kvm_tdp_mmu_page_fault(vcpu, write, map_writable, max_level,
> > +                                        gpa, pfn, prefault,
> > +                                        is_tdp && lpage_disallowed);
> > +     else
> > +             r = __direct_map(vcpu, gpa, write, map_writable, max_level, pfn,
> > +                              prefault, is_tdp && lpage_disallowed);
> >
> >  out_unlock:
> >       spin_unlock(&vcpu->kvm->mmu_lock);
>
> ...
>
> > +/*
> > + * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
> > + * page tables and SPTEs to translate the faulting guest physical address.
> > + */
> > +int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, int write, int map_writable,
> > +                        int max_level, gpa_t gpa, kvm_pfn_t pfn,
> > +                        bool prefault, bool account_disallowed_nx_lpage)
> > +{
> > +     struct tdp_iter iter;
> > +     struct kvm_mmu_memory_cache *pf_pt_cache =
> > +                     &vcpu->arch.mmu_shadow_page_cache;
> > +     u64 *child_pt;
> > +     u64 new_spte;
> > +     int ret;
> > +     int as_id = kvm_arch_vcpu_memslots_id(vcpu);
> > +     gfn_t gfn = gpa >> PAGE_SHIFT;
> > +     int level;
> > +
> > +     if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
> > +             return RET_PF_RETRY;
>
> I feel like we should kill off these silly WARNs in the existing code instead
> of adding more.  If they actually fired, I'm pretty sure that they would
> continue firing and spamming the kernel log until the VM is killed as I don't
> see how restarting the guest will magically fix anything.
>
> > +
> > +     if (WARN_ON(!is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa)))
> > +             return RET_PF_RETRY;
>
> This seems especially gratuitous, this has exactly one caller that explicitly
> checks is_tdp_mmu_root().  Again, if this fires it will spam the kernel log
> into submission.
>
> > +
> > +     level = kvm_mmu_hugepage_adjust(vcpu, gfn, max_level, &pfn);
> > +
> > +     for_each_tdp_pte_vcpu(iter, vcpu, gfn, gfn + 1) {
> > +             disallowed_hugepage_adjust(iter.old_spte, gfn, iter.level,
> > +                                        &pfn, &level);
> > +
> > +             if (iter.level == level)
> > +                     break;
> > +
> > +             /*
> > +              * If there is an SPTE mapping a large page at a higher level
> > +              * than the target, that SPTE must be cleared and replaced
> > +              * with a non-leaf SPTE.
> > +              */
> > +             if (is_shadow_present_pte(iter.old_spte) &&
> > +                 is_large_pte(iter.old_spte)) {
> > +                     *iter.sptep = 0;
> > +                     handle_changed_spte(vcpu->kvm, as_id, iter.gfn,
> > +                                         iter.old_spte, 0, iter.level);
> > +                     kvm_flush_remote_tlbs_with_address(vcpu->kvm, iter.gfn,
> > +                                     KVM_PAGES_PER_HPAGE(iter.level));
> > +
> > +                     /*
> > +                      * The iter must explicitly re-read the spte here
> > +                      * because the new value is needed before the next iteration
> > +                      * of the loop.
> > +                      */
>
> I think it'd be better to explicitly, and simply, call out that iter.old_spte
> is consumed below.  It's subtle enough to warrant a comment, but the comment
> didn't actually help me.  Maybe something like:
>
>                         /*
>                          * Refresh iter.old_spte, it will trigger the !present
>                          * path below.
>                          */
>

That's a good point and calling out the relation to the present check
below is much clearer.


> > +                     iter.old_spte = READ_ONCE(*iter.sptep);
> > +             }
> > +
> > +             if (!is_shadow_present_pte(iter.old_spte)) {
> > +                     child_pt = kvm_mmu_memory_cache_alloc(pf_pt_cache);
> > +                     clear_page(child_pt);
> > +                     new_spte = make_nonleaf_spte(child_pt,
> > +                                                  !shadow_accessed_mask);
> > +
> > +                     *iter.sptep = new_spte;
> > +                     handle_changed_spte(vcpu->kvm, as_id, iter.gfn,
> > +                                         iter.old_spte, new_spte,
> > +                                         iter.level);
> > +             }
> > +     }
> > +
> > +     if (WARN_ON(iter.level != level))
> > +             return RET_PF_RETRY;
>
> This also seems unnecessary.  Or maybe these are all good candidates for
> KVM_BUG_ON...
>

I've replaced all these warnings with KVM_BUG_ONs.

> > +
> > +     ret = page_fault_handle_target_level(vcpu, write, map_writable,
> > +                                          as_id, &iter, pfn, prefault);
> > +
> > +     /* If emulating, flush this vcpu's TLB. */
>
> Why?  It's obvious _what_ the code is doing, the comment should explain _why_.
>
> > +     if (ret == RET_PF_EMULATE)
> > +             kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
> > +
> > +     return ret;
> > +}
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> > index cb86f9fe69017..abf23dc0ab7ad 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.h
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> > @@ -14,4 +14,8 @@ void kvm_tdp_mmu_put_root_hpa(struct kvm *kvm, hpa_t root_hpa);
> >
> >  bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, gfn_t start, gfn_t end);
> >  void kvm_tdp_mmu_zap_all(struct kvm *kvm);
> > +
> > +int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, int write, int map_writable,
> > +                        int level, gpa_t gpa, kvm_pfn_t pfn, bool prefault,
> > +                        bool lpage_disallowed);
> >  #endif /* __KVM_X86_MMU_TDP_MMU_H */
> > --
> > 2.28.0.709.gb0816b6eb0-goog
> >

* Re: [PATCH 10/22] kvm: mmu: Add TDP MMU PF handler
  2020-09-30 17:37     ` Paolo Bonzini
@ 2020-10-06 22:35       ` Ben Gardon
  0 siblings, 0 replies; 105+ messages in thread
From: Ben Gardon @ 2020-10-06 22:35 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, LKML, kvm, Cannon Matthews, Peter Xu,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Wed, Sep 30, 2020 at 10:38 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 30/09/20 18:37, Sean Christopherson wrote:
> >> +    ret = page_fault_handle_target_level(vcpu, write, map_writable,
> >> +                                         as_id, &iter, pfn, prefault);
> >> +
> >> +    /* If emulating, flush this vcpu's TLB. */
> > Why?  It's obvious _what_ the code is doing, the comment should explain _why_.
> >
> >> +    if (ret == RET_PF_EMULATE)
> >> +            kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
> >> +
> >> +    return ret;
> >> +}
>
> In particular it seems to be only needed in this case...
>
> +       /*
> +        * If the page fault was caused by a write but the page is write
> +        * protected, emulation is needed. If the emulation was skipped,
> +        * the vCPU would have the same fault again.
> +        */
> +       if ((make_spte_ret & SET_SPTE_WRITE_PROTECTED_PT) && write)
> +               ret = RET_PF_EMULATE;
> +
>
> ... corresponding to this code in mmu.c
>
>         if (set_spte_ret & SET_SPTE_WRITE_PROTECTED_PT) {
>                 if (write_fault)
>                         ret = RET_PF_EMULATE;
>                 kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
>         }
>
> So it should indeed be better to make the code in
> page_fault_handle_target_level look the same as mmu/mmu.c.

That's an excellent point. I've made an effort to make them more
similar. I think this difference arose from the synchronization
changes I was working back from, but this will be much more elegant in
either case.

>
> Paolo
>

* Re: [PATCH 14/22] kvm: mmu: Add access tracking for tdp_mmu
  2020-09-30 17:48   ` Sean Christopherson
@ 2020-10-06 23:38     ` Ben Gardon
  0 siblings, 0 replies; 105+ messages in thread
From: Ben Gardon @ 2020-10-06 23:38 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, kvm, Cannon Matthews, Paolo Bonzini, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Wed, Sep 30, 2020 at 10:49 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Fri, Sep 25, 2020 at 02:22:54PM -0700, Ben Gardon wrote:
> > @@ -1945,12 +1944,24 @@ static void rmap_recycle(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
> >
> >  int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
> >  {
> > -     return kvm_handle_hva_range(kvm, start, end, 0, kvm_age_rmapp);
> > +     int young = false;
> > +
> > +     young = kvm_handle_hva_range(kvm, start, end, 0, kvm_age_rmapp);
> > +     if (kvm->arch.tdp_mmu_enabled)
>
> If we end up with a per-VM flag, would it make sense to add a static key
> wrapper similar to the in-kernel lapic?  I assume once this lands the vast
> majority of VMs will use the TDP MMU.
>
> > +             young |= kvm_tdp_mmu_age_hva_range(kvm, start, end);
> > +
> > +     return young;
> >  }
>
> ...
>
> > +
> > +/*
> > + * Mark the SPTEs for the range of GFNs [start, end) unaccessed and return non-zero
> > + * if any of the GFNs in the range have been accessed.
> > + */
> > +static int age_gfn_range(struct kvm *kvm, struct kvm_memory_slot *slot,
> > +                      struct kvm_mmu_page *root, gfn_t start, gfn_t end,
> > +                      unsigned long unused)
> > +{
> > +     struct tdp_iter iter;
> > +     int young = 0;
> > +     u64 new_spte = 0;
> > +     int as_id = kvm_mmu_page_as_id(root);
> > +
> > +     for_each_tdp_pte_root(iter, root, start, end) {
>
> Ah, I think we should follow the existing shadow iterators by naming this
>
>         for_each_tdp_pte_using_root()
>
> My first reaction was that this was iterating over TDP roots, which was a bit
> confusing.  I suspect others will make the same mistake unless they look at the
> implementation of for_each_tdp_pte_root().
>
> Similar comments on the _vcpu() variant.  For that one I think it'd be
> preferable to take the struct kvm_mmu, i.e. have for_each_tdp_pte_using_mmu(),
> as both kvm_tdp_mmu_page_fault() and kvm_tdp_mmu_get_walk() explicitly
> reference vcpu->arch.mmu in the surrounding code.
>
> E.g. I find this more intuitive
>
>         struct kvm_mmu *mmu = vcpu->arch.mmu;
>         int leaf = mmu->shadow_root_level;
>
>         for_each_tdp_pte_using_mmu(iter, mmu, gfn, gfn + 1) {
>                 leaf = iter.level;
>                 sptes[leaf - 1] = iter.old_spte;
>         }
>
>         return leaf;
>
> versus this, which makes me want to look at the implementation of for_each().
>
>
>         int leaf = vcpu->arch.mmu->shadow_root_level;
>
>         for_each_tdp_pte_vcpu(iter, vcpu, gfn, gfn + 1) {
>                 ...
>         }

I will change these macros as you suggested. I agree adding _using_
makes them clearer.

>
> > +             if (!is_shadow_present_pte(iter.old_spte) ||
> > +                 !is_last_spte(iter.old_spte, iter.level))
> > +                     continue;
> > +
> > +             /*
> > +              * If we have a non-accessed entry we don't need to change the
> > +              * pte.
> > +              */
> > +             if (!is_accessed_spte(iter.old_spte))
> > +                     continue;
> > +
> > +             new_spte = iter.old_spte;
> > +
> > +             if (spte_ad_enabled(new_spte)) {
> > +                     clear_bit((ffs(shadow_accessed_mask) - 1),
> > +                               (unsigned long *)&new_spte);
> > +             } else {
> > +                     /*
> > +                      * Capture the dirty status of the page, so that it doesn't get
> > +                      * lost when the SPTE is marked for access tracking.
> > +                      */
> > +                     if (is_writable_pte(new_spte))
> > +                             kvm_set_pfn_dirty(spte_to_pfn(new_spte));
> > +
> > +                     new_spte = mark_spte_for_access_track(new_spte);
> > +             }
> > +
> > +             *iter.sptep = new_spte;
> > +             __handle_changed_spte(kvm, as_id, iter.gfn, iter.old_spte,
> > +                                   new_spte, iter.level);
> > +             young = true;
>
> young is an int, not a bool.  Not really your fault as KVM has a really bad
> habit of using ints instead of bools.

Yeah, I saw that too. In mmu.c young ends up being set to true as
well, just through a function return so it's less obvious. Do you think
it would be preferable to set young to 1 or convert it to a bool?

>
> > +     }
> > +
> > +     return young;
> > +}
> > +
> > +int kvm_tdp_mmu_age_hva_range(struct kvm *kvm, unsigned long start,
> > +                           unsigned long end)
> > +{
> > +     return kvm_tdp_mmu_handle_hva_range(kvm, start, end, 0,
> > +                                         age_gfn_range);
> > +}
> > +
> > +static int test_age_gfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> > +                     struct kvm_mmu_page *root, gfn_t gfn, gfn_t unused,
> > +                     unsigned long unused2)
> > +{
> > +     struct tdp_iter iter;
> > +     int young = 0;
> > +
> > +     for_each_tdp_pte_root(iter, root, gfn, gfn + 1) {
> > +             if (!is_shadow_present_pte(iter.old_spte) ||
> > +                 !is_last_spte(iter.old_spte, iter.level))
> > +                     continue;
> > +
> > +             if (is_accessed_spte(iter.old_spte))
> > +                     young = true;
>
> Same bool vs. int weirdness here.  Also, |= doesn't short circuit for ints
> or bools, so this can be
>
>                 young |= is_accessed_spte(...)
>
> Actually, can't we just return true immediately?

Great point, I'll do that.
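
A minimal, self-contained sketch of the early-return idea, written outside
the kernel. The FAKE_* bit layout and test_age_sketch() are hypothetical
stand-ins for illustration and are not KVM definitions; the visited counter
exists only so the early exit can be observed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit layout for illustration only; not KVM's SPTE format. */
#define FAKE_PRESENT	(1ull << 0)
#define FAKE_LEAF	(1ull << 1)
#define FAKE_ACCESSED	(1ull << 2)

/*
 * Return true as soon as a present, accessed leaf entry is found.
 * *visited reports how many entries were examined before returning.
 */
bool test_age_sketch(const uint64_t *sptes, size_t n, size_t *visited)
{
	size_t i;

	*visited = 0;

	for (i = 0; i < n; i++) {
		*visited = i + 1;

		/* Skip entries that are not present leaf mappings. */
		if (!(sptes[i] & FAKE_PRESENT) || !(sptes[i] & FAKE_LEAF))
			continue;

		/* Later entries cannot change the answer, so stop here. */
		if (sptes[i] & FAKE_ACCESSED)
			return true;
	}

	return false;
}
```

Returning a bool directly also sidesteps the int-versus-bool question for
this helper, since no accumulator variable is left to type.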

>
> > +     }
> > +
> > +     return young;
> > +}
> > +
> > +int kvm_tdp_mmu_test_age_hva(struct kvm *kvm, unsigned long hva)
> > +{
> > +     return kvm_tdp_mmu_handle_hva_range(kvm, hva, hva + 1, 0,
> > +                                         test_age_gfn);
> > +}
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> > index ce804a97bfa1d..f316773b7b5a8 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.h
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> > @@ -21,4 +21,8 @@ int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, int write, int map_writable,
> >
> >  int kvm_tdp_mmu_zap_hva_range(struct kvm *kvm, unsigned long start,
> >                             unsigned long end);
> > +
> > +int kvm_tdp_mmu_age_hva_range(struct kvm *kvm, unsigned long start,
> > +                           unsigned long end);
> > +int kvm_tdp_mmu_test_age_hva(struct kvm *kvm, unsigned long hva);
> >  #endif /* __KVM_X86_MMU_TDP_MMU_H */
> > --
> > 2.28.0.709.gb0816b6eb0-goog
> >

* Re: [PATCH 18/22] kvm: mmu: Support disabling dirty logging for the tdp MMU
  2020-09-26  1:09   ` Paolo Bonzini
@ 2020-10-07 16:30     ` Ben Gardon
  2020-10-07 17:21       ` Paolo Bonzini
  0 siblings, 1 reply; 105+ messages in thread
From: Ben Gardon @ 2020-10-07 16:30 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: LKML, kvm, Cannon Matthews, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Fri, Sep 25, 2020 at 6:09 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 25/09/20 23:22, Ben Gardon wrote:
> > +     for_each_tdp_pte_root(iter, root, start, end) {
> > +             if (!is_shadow_present_pte(iter.old_spte) ||
> > +                 is_last_spte(iter.old_spte, iter.level))
> > +                     continue;
> > +
>
> I'm starting to wonder if another iterator like
> for_each_tdp_leaf_pte_root would be clearer, since this idiom repeats
> itself quite often.  The tdp_iter_next_leaf function would be easily
> implemented as
>
>         while (likely(iter->valid) &&
>                (!is_shadow_present_pte(iter->old_spte) ||
>                 is_last_spte(iter->old_spte, iter->level)))
>                 tdp_iter_next(iter);

Do you see a substantial efficiency difference between adding a
tdp_iter_next_leaf and building on for_each_tdp_pte_using_root with
something like:

#define for_each_tdp_leaf_pte_using_root(_iter, _root, _start, _end)    \
        for_each_tdp_pte_using_root(_iter, _root, _start, _end)         \
                if (!is_shadow_present_pte(_iter.old_spte) ||           \
                    !is_last_spte(_iter.old_spte, _iter.level))         \
                        continue;                                       \
                else

I agree that putting those checks in a wrapper makes the code more concise.
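
The "if (...) continue; else" wrapper idiom can be tried in isolation. The
sketch below is a user-space illustration under assumed names: struct
fake_iter, the FAKE_* bits and both for_each_fake_* macros are hypothetical
and only mimic the shape of the macro above, not the real tdp_iter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; not KVM's SPTE encoding or iterator. */
#define FAKE_PRESENT	(1ull << 0)
#define FAKE_LEAF	(1ull << 1)

struct fake_iter {
	const uint64_t *table;
	size_t idx;
	uint64_t old_spte;
};

/* Visit every entry in table[0..n), caching the entry in old_spte. */
#define for_each_fake_pte(_iter, _table, _n)				\
	for ((_iter).table = (_table), (_iter).idx = 0;			\
	     (_iter).idx < (_n) &&					\
	     ((_iter).old_spte = (_iter).table[(_iter).idx], true);	\
	     (_iter).idx++)

/* Visit only present leaf entries; the caller's body is the else arm. */
#define for_each_fake_leaf_pte(_iter, _table, _n)			\
	for_each_fake_pte(_iter, _table, _n)				\
		if (!((_iter).old_spte & FAKE_PRESENT) ||		\
		    !((_iter).old_spte & FAKE_LEAF))			\
			continue;					\
		else

/* Count the present leaf entries using the filtering wrapper. */
size_t count_leaves(const uint64_t *table, size_t n)
{
	struct fake_iter iter;
	size_t leaves = 0;

	for_each_fake_leaf_pte(iter, table, n)
		leaves++;

	return leaves;
}
```

The trailing else is what lets the caller's statement or braced block attach
to the macro as if it were a plain loop header; the continue applies to the
inner for loop supplied by the base macro.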

>
> Paolo
>

* Re: [PATCH 15/22] kvm: mmu: Support changed pte notifier in tdp MMU
  2020-09-28 15:11   ` Paolo Bonzini
@ 2020-10-07 16:53     ` Ben Gardon
  2020-10-07 17:18       ` Paolo Bonzini
  0 siblings, 1 reply; 105+ messages in thread
From: Ben Gardon @ 2020-10-07 16:53 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: LKML, kvm, Cannon Matthews, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Mon, Sep 28, 2020 at 8:11 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 25/09/20 23:22, Ben Gardon wrote:
> > +             *iter.sptep = 0;
> > +             handle_changed_spte(kvm, as_id, iter.gfn, iter.old_spte,
> > +                                 new_spte, iter.level);
> > +
>
> Can you explain why new_spte is passed here instead of 0?

That's just a bug. Thank you for catching it.

>
> All calls to handle_changed_spte are preceded by "*something =
> new_spte" except this one, so I'm thinking of having a change_spte
> function like
>
> static void change_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>                         u64 *sptep, u64 new_spte, int level)
> {
>         u64 old_spte = *sptep;
>         *sptep = new_spte;
>
>         __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level);
>         handle_changed_spte_acc_track(old_spte, new_spte, level);
>         handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte, new_spte, level);
> }
>
> in addition to the previously-mentioned cleanup of always calling
> handle_changed_spte instead of special-casing calls to two of the
> three functions.  It would be a nice place to add the
> trace_kvm_mmu_set_spte tracepoint, too.

I'm not sure we can avoid special casing calls to the access tracking
and dirty logging handler functions. At least in the past that's
created bugs with things being marked dirty or accessed when they
shouldn't be. I'll revisit those assumptions. It would certainly be
nice to get rid of that complexity.

I agree that putting the SPTE assignment and handler functions in a
helper function would clean up the code. I'll do that. I got some
feedback on the RFC I sent last year which led me to open-code a lot
more, but I think this is still a good cleanup.
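
As a rough illustration of why such a helper rules out the bug above (a
value written to the SPTE that differs from the value reported to the
handler), here is a user-space sketch. struct change_log and both fake_*
functions are hypothetical; the real handlers take more arguments and do
real bookkeeping.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical record of what the change handler was told. */
struct change_log {
	uint64_t old_spte;
	uint64_t new_spte;
	int calls;
};

/* Stand-in for the handle_changed_spte() family discussed above. */
void fake_handle_changed_spte(struct change_log *log, uint64_t old_spte,
			      uint64_t new_spte)
{
	log->old_spte = old_spte;
	log->new_spte = new_spte;
	log->calls++;
}

/*
 * Store new_spte and report the change in one place, so the value written
 * and the value reported cannot diverge.
 */
void fake_change_spte(struct change_log *log, uint64_t *sptep,
		      uint64_t new_spte)
{
	uint64_t old_spte = *sptep;

	*sptep = new_spte;
	fake_handle_changed_spte(log, old_spte, new_spte);
}
```

With this shape, zapping an entry is fake_change_spte(&log, sptep, 0), and
there is no second argument that could accidentally carry a stale new_spte.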

Re tracepoints, I was planning to just insert them all once this code
is stabilized, if that's alright.

>
> Paolo
>

* Re: [PATCH 15/22] kvm: mmu: Support changed pte notifier in tdp MMU
  2020-10-07 16:53     ` Ben Gardon
@ 2020-10-07 17:18       ` Paolo Bonzini
  2020-10-07 17:30         ` Ben Gardon
  0 siblings, 1 reply; 105+ messages in thread
From: Paolo Bonzini @ 2020-10-07 17:18 UTC (permalink / raw)
  To: Ben Gardon
  Cc: LKML, kvm, Cannon Matthews, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 07/10/20 18:53, Ben Gardon wrote:
>> in addition to the previously-mentioned cleanup of always calling
>> handle_changed_spte instead of special-casing calls to two of the
>> three functions.  It would be a nice place to add the
>> trace_kvm_mmu_set_spte tracepoint, too.
> I'm not sure we can avoid special casing calls to the access tracking
> and dirty logging handler functions. At least in the past that's
> created bugs with things being marked dirty or accessed when they
> shouldn't be. I'll revisit those assumptions. It would certainly be
> nice to get rid of that complexity.
> 
> I agree that putting the SPTE assignment and handler functions in a
> helper function would clean up the code. I'll do that.

Well that's not easy if you have to think of which functions have to be
called.

I'll take a closer look at the access tracking and dirty logging cases
to try and understand what those bugs can be.  Apart from that I have my
suggested changes and I can probably finish testing them and send them
out tomorrow.

Paolo

> I got some
> feedback on the RFC I sent last year which led me to open-code a lot
> more, but I think this is still a good cleanup.


* Re: [PATCH 18/22] kvm: mmu: Support disabling dirty logging for the tdp MMU
  2020-10-07 16:30     ` Ben Gardon
@ 2020-10-07 17:21       ` Paolo Bonzini
  2020-10-07 17:28         ` Ben Gardon
  0 siblings, 1 reply; 105+ messages in thread
From: Paolo Bonzini @ 2020-10-07 17:21 UTC (permalink / raw)
  To: Ben Gardon
  Cc: LKML, kvm, Cannon Matthews, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 07/10/20 18:30, Ben Gardon wrote:
>> I'm starting to wonder if another iterator like
>> for_each_tdp_leaf_pte_root would be clearer, since this idiom repeats
>> itself quite often.  The tdp_iter_next_leaf function would be easily
>> implemented as
>>
>>         while (likely(iter->valid) &&
>>                (!is_shadow_present_pte(iter->old_spte) ||
>>                 is_last_spte(iter->old_spte, iter->level)))
>>                 tdp_iter_next(iter);
> Do you see a substantial efficiency difference between adding a
> tdp_iter_next_leaf and building on for_each_tdp_pte_using_root with
> something like:
> 
> #define for_each_tdp_leaf_pte_using_root(_iter, _root, _start, _end)    \
>         for_each_tdp_pte_using_root(_iter, _root, _start, _end)         \
>                 if (!is_shadow_present_pte(_iter.old_spte) ||           \
>                     !is_last_spte(_iter.old_spte, _iter.level))         \
>                         continue;                                       \
>                 else
> 
> I agree that putting those checks in a wrapper makes the code more concise.
> 

No, that would be just another way to write the same thing.  That said,
making the iteration API more complicated also has disadvantages because
you get a Cartesian explosion of changes.

Regarding the naming, I'm leaning towards

    tdp_root_for_each_pte
    tdp_vcpu_for_each_pte

which is shorter than the version with "using" and still clarifies that
"root" and "vcpu" are the thing that the iteration works on.

Paolo


* Re: [PATCH 18/22] kvm: mmu: Support disabling dirty logging for the tdp MMU
  2020-10-07 17:21       ` Paolo Bonzini
@ 2020-10-07 17:28         ` Ben Gardon
  2020-10-07 17:53           ` Paolo Bonzini
  0 siblings, 1 reply; 105+ messages in thread
From: Ben Gardon @ 2020-10-07 17:28 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: LKML, kvm, Cannon Matthews, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Wed, Oct 7, 2020 at 10:21 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 07/10/20 18:30, Ben Gardon wrote:
> >> I'm starting to wonder if another iterator like
> >> for_each_tdp_leaf_pte_root would be clearer, since this idiom repeats
> >> itself quite often.  The tdp_iter_next_leaf function would be easily
> >> implemented as
> >>
> >>         while (likely(iter->valid) &&
> >>                (!is_shadow_present_pte(iter->old_spte) ||
> >>                 is_last_spte(iter->old_spte, iter->level)))
> >>                 tdp_iter_next(iter);
> > Do you see a substantial efficiency difference between adding a
> > tdp_iter_next_leaf and building on for_each_tdp_pte_using_root with
> > something like:
> >
> > #define for_each_tdp_leaf_pte_using_root(_iter, _root, _start, _end)    \
> >         for_each_tdp_pte_using_root(_iter, _root, _start, _end)         \
> >                 if (!is_shadow_present_pte(_iter.old_spte) ||           \
> >                     !is_last_spte(_iter.old_spte, _iter.level))         \
> >                         continue;                                       \
> >                 else
> >
> > I agree that putting those checks in a wrapper makes the code more concise.
> >
>
> No, that would be just another way to write the same thing.  That said,
> making the iteration API more complicated also has disadvantages because
> you get a Cartesian explosion of changes.

I wouldn't be too worried about that. The only things I ever found
worth making an iterator case for were:
Every SPTE
Every present SPTE
Every present leaf SPTE

And really there aren't many cases that use the middle one.

>
> Regarding the naming, I'm leaning towards
>
>     tdp_root_for_each_pte
>     tdp_vcpu_for_each_pte
>
> which is shorter than the version with "using" and still clarifies that
> "root" and "vcpu" are the thing that the iteration works on.

That sounds good to me. I agree it's similarly clear.

>
> Paolo
>

* Re: [PATCH 15/22] kvm: mmu: Support changed pte notifier in tdp MMU
  2020-10-07 17:18       ` Paolo Bonzini
@ 2020-10-07 17:30         ` Ben Gardon
  2020-10-07 17:54           ` Paolo Bonzini
  0 siblings, 1 reply; 105+ messages in thread
From: Ben Gardon @ 2020-10-07 17:30 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: LKML, kvm, Cannon Matthews, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Wed, Oct 7, 2020 at 10:18 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 07/10/20 18:53, Ben Gardon wrote:
> >> in addition to the previously-mentioned cleanup of always calling
> >> handle_changed_spte instead of special-casing calls to two of the
> >> three functions.  It would be a nice place to add the
> >> trace_kvm_mmu_set_spte tracepoint, too.
> > I'm not sure we can avoid special casing calls to the access tracking
> > and dirty logging handler functions. At least in the past that's
> > created bugs with things being marked dirty or accessed when they
> > shouldn't be. I'll revisit those assumptions. It would certainly be
> > nice to get rid of that complexity.
> >
> > I agree that putting the SPTE assignment and handler functions in a
> > helper function would clean up the code. I'll do that.
>
> Well that's not easy if you have to think of which functions have to be
> called.
>
> I'll take a closer look at the access tracking and dirty logging cases
> to try and understand what those bugs can be.  Apart from that I have my
> suggested changes and I can probably finish testing them and send them
> out tomorrow.

Awesome, thank you. I'll look forward to seeing them. Will you be
applying those changes to the tdp_mmu branch you created as well?

>
> Paolo
>
> > I got some
> > feedback on the RFC I sent last year which led me to open-code a lot
> > more, but I think this is still a good cleanup.
>

* Re: [PATCH 18/22] kvm: mmu: Support disabling dirty logging for the tdp MMU
  2020-10-07 17:28         ` Ben Gardon
@ 2020-10-07 17:53           ` Paolo Bonzini
  0 siblings, 0 replies; 105+ messages in thread
From: Paolo Bonzini @ 2020-10-07 17:53 UTC (permalink / raw)
  To: Ben Gardon
  Cc: LKML, kvm, Cannon Matthews, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 07/10/20 19:28, Ben Gardon wrote:
>> No, that would be just another way to write the same thing.  That said,
>> making the iteration API more complicated also has disadvantages because
>> you get a Cartesian explosion of changes.
> I wouldn't be too worried about that. The only things I ever found
> worth making an iterator case for were:
> Every SPTE
> Every present SPTE
> Every present leaf SPTE

* (vcpu, root) * (all levels, large only)

We only need a small subset of these, but the naming would be more
complex at least.

Paolo


* Re: [PATCH 15/22] kvm: mmu: Support changed pte notifier in tdp MMU
  2020-10-07 17:30         ` Ben Gardon
@ 2020-10-07 17:54           ` Paolo Bonzini
  0 siblings, 0 replies; 105+ messages in thread
From: Paolo Bonzini @ 2020-10-07 17:54 UTC (permalink / raw)
  To: Ben Gardon
  Cc: LKML, kvm, Cannon Matthews, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On 07/10/20 19:30, Ben Gardon wrote:
>> Well that's not easy if you have to think of which functions have to be
>> called.
>>
>> I'll take a closer look at the access tracking and dirty logging cases
>> to try and understand what those bugs can be.  Apart from that I have my
>> suggested changes and I can probably finish testing them and send them
>> out tomorrow.
> Awesome, thank you. I'll look forward to seeing them. Will you be
> applying those changes to the tdp_mmu branch you created as well?
> 

No, only to the rebased version.

Paolo


* Re: [PATCH 10/22] kvm: mmu: Add TDP MMU PF handler
  2020-10-06 22:33     ` Ben Gardon
@ 2020-10-07 20:55       ` Sean Christopherson
  0 siblings, 0 replies; 105+ messages in thread
From: Sean Christopherson @ 2020-10-07 20:55 UTC (permalink / raw)
  To: Ben Gardon
  Cc: LKML, kvm, Cannon Matthews, Paolo Bonzini, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Tue, Oct 06, 2020 at 03:33:21PM -0700, Ben Gardon wrote:
> On Wed, Sep 30, 2020 at 9:37 AM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> >
> > On Fri, Sep 25, 2020 at 02:22:50PM -0700, Ben Gardon wrote:
> > > @@ -4113,8 +4088,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> > >       if (page_fault_handle_page_track(vcpu, error_code, gfn))
> > >               return RET_PF_EMULATE;
> > >
> > > -     if (fast_page_fault(vcpu, gpa, error_code))
> > > -             return RET_PF_RETRY;
> > > +     if (!is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
> > > +             if (fast_page_fault(vcpu, gpa, error_code))
> > > +                     return RET_PF_RETRY;
> >
> > It'll probably be easier to handle is_tdp_mmu() in fast_page_fault().
> 
> I'd prefer to keep this check here because then in the fast page fault
> path, we can just handle the case where we do have a tdp mmu root with
> the tdp mmu fast pf handler and it'll mirror the split below with
> __direct_map and the TDP MMU PF handler.

Hmm, what about adding wrappers for these few cases where TDP MMU splits
cleanly from the existing paths?  The thought being that it would keep the
control flow somewhat straightforward, and might also help us keep the two
paths aligned (more below).

> > >
> > >       r = mmu_topup_memory_caches(vcpu, false);
> > >       if (r)
> > > @@ -4139,8 +4115,14 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> > >       r = make_mmu_pages_available(vcpu);
> > >       if (r)
> > >               goto out_unlock;
> > > -     r = __direct_map(vcpu, gpa, write, map_writable, max_level, pfn,
> > > -                      prefault, is_tdp && lpage_disallowed);
> > > +
> > > +     if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
> > > +             r = kvm_tdp_mmu_page_fault(vcpu, write, map_writable, max_level,
> > > +                                        gpa, pfn, prefault,
> > > +                                        is_tdp && lpage_disallowed);
> > > +     else
> > > +             r = __direct_map(vcpu, gpa, write, map_writable, max_level, pfn,
> > > +                              prefault, is_tdp && lpage_disallowed);

Somewhat tangentially related to the above, it feels like the TDP MMU helper
here would be better named tdp_mmu_map() or so.  KVM has already done the
"fault" part, in that it has faulted in the page (if relevant) and obtained
a pfn.  What's left is the actual insertion into the TDP page tables.

And again related to the helper, ideally tdp_mmu_map() and __direct_map()
would have identical prototypes.  Ditto for the fast page fault paths.  In
theory, that would allow the compiler to generate identical preamble, with
only the final check being different.  And if the compiler isn't smart enough
to do that on its own, we might even make the wrapper non-inline, with an
"unlikely" annotation to coerce the compiler to generate a tail call for the
preferred path.
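
A rough user-space sketch of that wrapper shape, under stated assumptions:
struct fake_fault, the FAKE_RET_* values and all three fake_* functions are
hypothetical, the real helpers take many more arguments, and
__builtin_expect is the GCC/Clang builtin that the kernel's unlikely() wraps.
Whether a compiler actually emits a tail call here is not guaranteed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical fault description standing in for the real argument list. */
struct fake_fault {
	bool is_tdp_mmu_root;
	unsigned long gpa;
};

enum { FAKE_RET_LEGACY = 1, FAKE_RET_TDP = 2 };

#define fake_unlikely(x)	__builtin_expect(!!(x), 0)

/* Both back ends share one prototype, so the wrapper is a pure dispatch. */
int fake_direct_map(const struct fake_fault *fault)
{
	(void)fault;
	return FAKE_RET_LEGACY;
}

int fake_tdp_mmu_map(const struct fake_fault *fault)
{
	(void)fault;
	return FAKE_RET_TDP;
}

/* Mark the legacy path unlikely so the TDP MMU path is the fall-through. */
int fake_map(const struct fake_fault *fault)
{
	if (fake_unlikely(!fault->is_tdp_mmu_root))
		return fake_direct_map(fault);

	return fake_tdp_mmu_map(fault);
}
```

Because the two back ends take the same arguments, the caller's setup code
is shared and only the final branch differs between the two MMUs.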

> > >
> > >  out_unlock:
> > >       spin_unlock(&vcpu->kvm->mmu_lock);

* Re: [PATCH 17/22] kvm: mmu: Support dirty logging for the TDP MMU
  2020-09-26  1:04   ` Paolo Bonzini
@ 2020-10-08 18:27     ` Ben Gardon
  0 siblings, 0 replies; 105+ messages in thread
From: Ben Gardon @ 2020-10-08 18:27 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: LKML, kvm, Cannon Matthews, Peter Xu, Sean Christopherson,
	Peter Shier, Peter Feiner, Junaid Shahid, Jim Mattson,
	Yulei Zhang, Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Fri, Sep 25, 2020 at 6:04 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 25/09/20 23:22, Ben Gardon wrote:
> >                               start_level, KVM_MAX_HUGEPAGE_LEVEL, false);
> > +     if (kvm->arch.tdp_mmu_enabled)
> > +             flush = kvm_tdp_mmu_wrprot_slot(kvm, memslot, false) || flush;
> >       spin_unlock(&kvm->mmu_lock);
> >
>
> In fact you can just pass down the end-level KVM_MAX_HUGEPAGE_LEVEL or
> PG_LEVEL_4K here to kvm_tdp_mmu_wrprot_slot and from there to
> wrprot_gfn_range.

That makes sense. My only worry there is the added complexity of error
handling for values besides PG_LEVEL_2M and PG_LEVEL_4K. Since there are
only two callers, I don't think that will be too much of a problem
though. I don't think KVM_MAX_HUGEPAGE_LEVEL would actually be a good
value to pass in as I don't think that would write protect 2M
mappings. KVM_MAX_HUGEPAGE_LEVEL is defined as PG_LEVEL_1G, or 3.
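
A small userspace sketch of that concern, under the assumption that the
passed-in level acts as a minimum level for write protection.  The
toy_leaf structure and toy_wrprot() are hypothetical stand-ins, not the
code from the patch; only the enum values match the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same numeric values as the kernel's enum pg_level. */
enum pg_level { PG_LEVEL_NONE, PG_LEVEL_4K, PG_LEVEL_2M, PG_LEVEL_1G };
#define KVM_MAX_HUGEPAGE_LEVEL PG_LEVEL_1G

/* Hypothetical flattened view of one leaf mapping. */
struct toy_leaf {
	enum pg_level level;
	bool writable;
};

/*
 * Write protect every writable leaf mapped at min_level or above and
 * return how many mappings changed.  Leaves below min_level are skipped.
 */
size_t toy_wrprot(struct toy_leaf *leaves, size_t n, enum pg_level min_level)
{
	size_t changed = 0;

	assert(leaves != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (leaves[i].level < min_level || !leaves[i].writable)
			continue;

		leaves[i].writable = false;
		changed++;
	}

	return changed;
}
```

With min_level set to KVM_MAX_HUGEPAGE_LEVEL (3), a 2M leaf (level 2) is
below the threshold and stays writable, so a caller that wants 2M mappings
protected would pass PG_LEVEL_2M instead.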

>
> >
> > +             /*
> > +              * Take a reference on the root so that it cannot be freed if
> > +              * this thread releases the MMU lock and yields in this loop.
> > +              */
> > +             get_tdp_mmu_root(kvm, root);
> > +
> > +             spte_set = wrprot_gfn_range(kvm, root, slot->base_gfn,
> > +                             slot->base_gfn + slot->npages, skip_4k) ||
> > +                        spte_set;
> > +
> > +             put_tdp_mmu_root(kvm, root);
>
>
> Generally using "|=" is the more common idiom in mmu.c.

I changed to this in response to some feedback on the RFC, about
mixing bitwise ops and bools, but I like the |= syntax more as well.
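
For reference, the two spellings behave the same for bool operands as long
as the function call sits on the left of the "||".  The sketch below is a
userspace illustration with hypothetical names (toy_wrprot_root() and the
toy_calls counter), not code from the series.

```c
#include <assert.h>
#include <stdbool.h>

int toy_calls;	/* counts how often the helper actually ran */

/* Hypothetical stand-in for a per-root write-protect helper. */
bool toy_wrprot_root(bool result)
{
	toy_calls++;
	return result;
}

/* Form used in the posted patch: call first, then logical OR. */
bool accumulate_logical_or(const bool *results, int n)
{
	bool spte_set = false;

	assert(n >= 0);
	for (int i = 0; i < n; i++)
		spte_set = toy_wrprot_root(results[i]) || spte_set;

	return spte_set;
}

/* The "|=" idiom: the helper is also called on every iteration. */
bool accumulate_bitwise_or(const bool *results, int n)
{
	bool spte_set = false;

	assert(n >= 0);
	for (int i = 0; i < n; i++)
		spte_set |= toy_wrprot_root(results[i]);

	return spte_set;
}
```

The ordering matters only for the "||" form: writing
"spte_set || toy_wrprot_root(...)" would short-circuit and skip the call
once spte_set is true, which is not what the loop wants.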

>
> > +static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> > +                        gfn_t start, gfn_t end)
> > ...
> > +             __handle_changed_spte(kvm, as_id, iter.gfn, iter.old_spte,
> > +                                   new_spte, iter.level);
> > +             handle_changed_spte_acc_track(iter.old_spte, new_spte,
> > +                                           iter.level);
>
> Is it worth not calling handle_changed_spte?  handle_changed_spte_dlog
> obviously will never fire but duplicating the code is a bit ugly.
>
> I guess this patch is the first one that really gives the "feeling" of
> what the data structures look like.  The main difference with the shadow
> MMU is that you have the tdp_iter instead of the callback-based code of
> slot_handle_level_range, but otherwise it's not hard to follow one if
> you know the other.  Reorganizing the code so that mmu.c is little more
> than a wrapper around the two will help as well in this respect.
>
> Paolo
>


* Re: [PATCH 04/22] kvm: mmu: Allocate and free TDP MMU roots
  2020-09-30  6:06   ` Sean Christopherson
  2020-09-30  6:26     ` Paolo Bonzini
@ 2020-10-12 22:59     ` Ben Gardon
  2020-10-12 23:59       ` Sean Christopherson
  1 sibling, 1 reply; 105+ messages in thread
From: Ben Gardon @ 2020-10-12 22:59 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, kvm, Cannon Matthews, Paolo Bonzini, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

On Tue, Sep 29, 2020 at 11:06 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Fri, Sep 25, 2020 at 02:22:44PM -0700, Ben Gardon wrote:
>   static u64 __read_mostly shadow_nx_mask;
> > @@ -3597,10 +3592,14 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
> >       if (!VALID_PAGE(*root_hpa))
> >               return;
> >
> > -     sp = to_shadow_page(*root_hpa & PT64_BASE_ADDR_MASK);
> > -     --sp->root_count;
> > -     if (!sp->root_count && sp->role.invalid)
> > -             kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
> > +     if (is_tdp_mmu_root(kvm, *root_hpa)) {
> > +             kvm_tdp_mmu_put_root_hpa(kvm, *root_hpa);
> > +     } else {
> > +             sp = to_shadow_page(*root_hpa & PT64_BASE_ADDR_MASK);
> > +             --sp->root_count;
> > +             if (!sp->root_count && sp->role.invalid)
> > +                     kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
>
> Hmm, I see that future patches use put_tdp_mmu_root()/get_tdp_mmu_root(),
> but the code itself isn't specific to the TDP MMU.  Even if this ends up
> being the only non-TDP user of get/put, I think it'd be worth making them
> common helpers, e.g.
>
>         sp = to_shadow_page(*root_hpa & PT64_BASE_ADDR_MASK);
>         if (mmu_put_root(sp)) {
>                 if (is_tdp_mmu(...))
>                         kvm_tdp_mmu_free_root(kvm, sp);
>                 else if (sp->role.invalid)
>                         kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
>         }
>
> > +     }
> >
> >       *root_hpa = INVALID_PAGE;
> >  }
> > @@ -3691,7 +3690,13 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> >       unsigned i;
> >
> >       if (shadow_root_level >= PT64_ROOT_4LEVEL) {
> > -             root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level, true);
> > +             if (vcpu->kvm->arch.tdp_mmu_enabled) {
>
> I believe this will break 32-bit NPT.  Or at a minimum, look weird.  It'd
> be better to explicitly disable the TDP MMU on 32-bit KVM, then this becomes
>
>         if (vcpu->kvm->arch.tdp_mmu_enabled) {
>
>         } else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
>
>         } else {
>
>         }
>

How does this break 32-bit NPT? I'm not sure I understand how we would
get into a bad state here because I'm not familiar with the specifics
of 32-bit NPT.

> > +                     root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
> > +             } else {
> > +                     root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level,
> > +                                           true);
> > +             }
>
> May not matter in the end, but the braces aren't needed.
>
> > +
> >               if (!VALID_PAGE(root))
> >                       return -ENOSPC;
> >               vcpu->arch.mmu->root_hpa = root;
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index 65bb110847858..530b7d893c7b3 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -41,8 +41,12 @@ struct kvm_mmu_page {
> >
> >       /* Number of writes since the last time traversal visited this page.  */
> >       atomic_t write_flooding_count;
> > +
> > +     bool tdp_mmu_page;
> >  };
> >
> > +extern struct kmem_cache *mmu_page_header_cache;
> > +
> >  static inline struct kvm_mmu_page *to_shadow_page(hpa_t shadow_page)
> >  {
> >       struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
> > @@ -69,6 +73,11 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
> >       (((address) >> PT64_LEVEL_SHIFT(level)) & ((1 << PT64_LEVEL_BITS) - 1))
> >  #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
> >
> > +#define ACC_EXEC_MASK    1
> > +#define ACC_WRITE_MASK   PT_WRITABLE_MASK
> > +#define ACC_USER_MASK    PT_USER_MASK
> > +#define ACC_ALL          (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)
> > +
> >  /* Functions for interpreting SPTEs */
> >  kvm_pfn_t spte_to_pfn(u64 pte);
> >  bool is_mmio_spte(u64 spte);
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 8241e18c111e6..cdca829e42040 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1,5 +1,7 @@
> >  /* SPDX-License-Identifier: GPL-2.0 */
> >
> > +#include "mmu.h"
> > +#include "mmu_internal.h"
> >  #include "tdp_mmu.h"
> >
> >  static bool __read_mostly tdp_mmu_enabled = true;
> > @@ -25,10 +27,165 @@ void kvm_mmu_init_tdp_mmu(struct kvm *kvm)
> >
> >       /* This should not be changed for the lifetime of the VM. */
> >       kvm->arch.tdp_mmu_enabled = true;
> > +
> > +     INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
> >  }
> >
> >  void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
> >  {
> >       if (!kvm->arch.tdp_mmu_enabled)
> >               return;
> > +
> > +     WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
> > +}
> > +
> > +#define for_each_tdp_mmu_root(_kvm, _root)                       \
> > +     list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link)
> > +
> > +bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
> > +{
> > +     struct kvm_mmu_page *root;
> > +
> > +     if (!kvm->arch.tdp_mmu_enabled)
> > +             return false;
> > +
> > +     root = to_shadow_page(hpa);
> > +
> > +     if (WARN_ON(!root))
> > +             return false;
> > +
> > +     return root->tdp_mmu_page;
>
> Why all the extra checks?
>
> > +}
> > +
> > +static void free_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
> > +{
> > +     lockdep_assert_held(&kvm->mmu_lock);
> > +
> > +     WARN_ON(root->root_count);
> > +     WARN_ON(!root->tdp_mmu_page);
> > +
> > +     list_del(&root->link);
> > +
> > +     free_page((unsigned long)root->spt);
> > +     kmem_cache_free(mmu_page_header_cache, root);
> > +}
> > +
> > +static void put_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
> > +{
> > +     lockdep_assert_held(&kvm->mmu_lock);
> > +
> > +     root->root_count--;
> > +     if (!root->root_count)
> > +             free_tdp_mmu_root(kvm, root);
> > +}
> > +
> > +static void get_tdp_mmu_root(struct kvm *kvm, struct kvm_mmu_page *root)
> > +{
> > +     lockdep_assert_held(&kvm->mmu_lock);
> > +     WARN_ON(!root->root_count);
> > +
> > +     root->root_count++;
> > +}
> > +
> > +void kvm_tdp_mmu_put_root_hpa(struct kvm *kvm, hpa_t root_hpa)
> > +{
> > +     struct kvm_mmu_page *root;
> > +
> > +     root = to_shadow_page(root_hpa);
> > +
> > +     if (WARN_ON(!root))
> > +             return;
> > +
> > +     put_tdp_mmu_root(kvm, root);
> > +}
> > +
> > +static struct kvm_mmu_page *find_tdp_mmu_root_with_role(
> > +             struct kvm *kvm, union kvm_mmu_page_role role)
> > +{
> > +     struct kvm_mmu_page *root;
> > +
> > +     lockdep_assert_held(&kvm->mmu_lock);
> > +     for_each_tdp_mmu_root(kvm, root) {
> > +             WARN_ON(!root->root_count);
> > +
> > +             if (root->role.word == role.word)
> > +                     return root;
> > +     }
> > +
> > +     return NULL;
> > +}
> > +
> > +static struct kvm_mmu_page *alloc_tdp_mmu_root(struct kvm_vcpu *vcpu,
> > +                                            union kvm_mmu_page_role role)
> > +{
> > +     struct kvm_mmu_page *new_root;
> > +     struct kvm_mmu_page *root;
> > +
> > +     new_root = kvm_mmu_memory_cache_alloc(
> > +                     &vcpu->arch.mmu_page_header_cache);
> > +     new_root->spt = kvm_mmu_memory_cache_alloc(
> > +                     &vcpu->arch.mmu_shadow_page_cache);
> > +     set_page_private(virt_to_page(new_root->spt), (unsigned long)new_root);
> > +
> > +     new_root->role.word = role.word;
> > +     new_root->root_count = 1;
> > +     new_root->gfn = 0;
> > +     new_root->tdp_mmu_page = true;
> > +
> > +     spin_lock(&vcpu->kvm->mmu_lock);
> > +
> > +     /* Check that no matching root exists before adding this one. */
> > +     root = find_tdp_mmu_root_with_role(vcpu->kvm, role);
> > +     if (root) {
> > +             get_tdp_mmu_root(vcpu->kvm, root);
> > +             spin_unlock(&vcpu->kvm->mmu_lock);
>
> Hrm, I'm not a big fan of dropping locks in the middle of functions, but the
> alternatives aren't great.  :-/  Best I can come up with is
>
>         if (root)
>                 get_tdp_mmu_root()
>         else
>                 list_add();
>
>         spin_unlock();
>
>         if (root) {
>                 free_page()
>                 kmem_cache_free()
>         } else {
>                 root = new_root;
>         }
>
>         return root;
>
> Not sure that's any better.
>
> > +             free_page((unsigned long)new_root->spt);
> > +             kmem_cache_free(mmu_page_header_cache, new_root);
> > +             return root;
> > +     }
> > +
> > +     list_add(&new_root->link, &vcpu->kvm->arch.tdp_mmu_roots);
> > +     spin_unlock(&vcpu->kvm->mmu_lock);
> > +
> > +     return new_root;
> > +}
> > +
> > +static struct kvm_mmu_page *get_tdp_mmu_vcpu_root(struct kvm_vcpu *vcpu)
> > +{
> > +     struct kvm_mmu_page *root;
> > +     union kvm_mmu_page_role role;
> > +
> > +     role = vcpu->arch.mmu->mmu_role.base;
> > +     role.level = vcpu->arch.mmu->shadow_root_level;
> > +     role.direct = true;
> > +     role.gpte_is_8_bytes = true;
> > +     role.access = ACC_ALL;
> > +
> > +     spin_lock(&vcpu->kvm->mmu_lock);
> > +
> > +     /* Search for an already allocated root with the same role. */
> > +     root = find_tdp_mmu_root_with_role(vcpu->kvm, role);
> > +     if (root) {
> > +             get_tdp_mmu_root(vcpu->kvm, root);
> > +             spin_unlock(&vcpu->kvm->mmu_lock);
>
> Rather than manually unlock and return, this can be
>
>         if (root)
>                 get_tdp_mmu_root();
>
>         spin_unlock()
>
>         if (!root)
>                 root = alloc_tdp_mmu_root();
>
>         return root;
>
> You could also add a helper to do the "get" along with the "find".  Not sure
> if that's worth the code.
>
> > +             return root;
> > +     }
> > +
> > +     spin_unlock(&vcpu->kvm->mmu_lock);
> > +
> > +     /* If there is no appropriate root, allocate one. */
> > +     root = alloc_tdp_mmu_root(vcpu, role);
> > +
> > +     return root;
> > +}
> > +
> > +hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> > +{
> > +     struct kvm_mmu_page *root;
> > +
> > +     root = get_tdp_mmu_vcpu_root(vcpu);
> > +     if (!root)
> > +             return INVALID_PAGE;
> > +
> > +     return __pa(root->spt);
> >  }
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> > index dd3764f5a9aa3..9274debffeaa1 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.h
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> > @@ -7,4 +7,9 @@
> >
> >  void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
> >  void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
> > +
> > +bool is_tdp_mmu_root(struct kvm *kvm, hpa_t root);
> > +hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
> > +void kvm_tdp_mmu_put_root_hpa(struct kvm *kvm, hpa_t root_hpa);
> > +
> >  #endif /* __KVM_X86_MMU_TDP_MMU_H */
> > --
> > 2.28.0.709.gb0816b6eb0-goog
> >


* Re: [PATCH 04/22] kvm: mmu: Allocate and free TDP MMU roots
  2020-10-12 22:59     ` Ben Gardon
@ 2020-10-12 23:59       ` Sean Christopherson
  0 siblings, 0 replies; 105+ messages in thread
From: Sean Christopherson @ 2020-10-12 23:59 UTC (permalink / raw)
  To: Ben Gardon
  Cc: LKML, kvm, Cannon Matthews, Paolo Bonzini, Peter Xu, Peter Shier,
	Peter Feiner, Junaid Shahid, Jim Mattson, Yulei Zhang,
	Wanpeng Li, Vitaly Kuznetsov, Xiao Guangrong

Heads up, you may get this multiple times, our mail servers got "upgraded"
recently and are giving me troubles...

On Mon, Oct 12, 2020 at 03:59:35PM -0700, Ben Gardon wrote:
> On Tue, Sep 29, 2020 at 11:06 PM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> > > @@ -3691,7 +3690,13 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> > >       unsigned i;
> > >
> > >       if (shadow_root_level >= PT64_ROOT_4LEVEL) {
> > > -             root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level, true);
> > > +             if (vcpu->kvm->arch.tdp_mmu_enabled) {
> >
> > I believe this will break 32-bit NPT.  Or at a minimum, look weird.  It'd
> > be better to explicitly disable the TDP MMU on 32-bit KVM, then this becomes
> >
> >         if (vcpu->kvm->arch.tdp_mmu_enabled) {
> >
> >         } else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
> >
> >         } else {
> >
> >         }
> >
>
> How does this break 32-bit NPT? I'm not sure I understand how we would
> get into a bad state here because I'm not familiar with the specifics
> of 32-bit NPT.

32-bit NPT will have a max TDP level of PT32E_ROOT_LEVEL (3), i.e. will
fail the "shadow_root_level >= PT64_ROOT_4LEVEL" check, and thus won't get
to the tdp_mmu_enabled check.  That would likely break as some parts of KVM
would see tdp_mmu_enabled, but this root allocation would continue using
the legacy MMU.

It's somewhat of a moot point, because IIRC there are other things that will
break with 32-bit KVM, i.e. TDP MMU will be 64-bit only.  But burying that
assumption/dependency in these flows is weird.
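
As a rough userspace illustration of the two branch structures (the enum and
the pick_root_*() helpers are hypothetical and only model which path gets
chosen, nothing is allocated):

```c
#include <assert.h>
#include <stdbool.h>

#define PT32E_ROOT_LEVEL 3
#define PT64_ROOT_4LEVEL 4

enum toy_root_kind {
	TOY_ROOT_TDP_MMU,
	TOY_ROOT_LEGACY_64,
	TOY_ROOT_LEGACY_PAE,
};

/* Nesting as in the posted patch: the flag is only checked for 4+ levels. */
enum toy_root_kind pick_root_nested(bool tdp_mmu_enabled, int shadow_root_level)
{
	if (shadow_root_level >= PT64_ROOT_4LEVEL) {
		if (tdp_mmu_enabled)
			return TOY_ROOT_TDP_MMU;
		return TOY_ROOT_LEGACY_64;
	}

	assert(shadow_root_level == PT32E_ROOT_LEVEL);
	return TOY_ROOT_LEGACY_PAE;
}

/* Ordering suggested above: the flag alone selects the TDP MMU. */
enum toy_root_kind pick_root_flat(bool tdp_mmu_enabled, int shadow_root_level)
{
	if (tdp_mmu_enabled)
		return TOY_ROOT_TDP_MMU;
	else if (shadow_root_level >= PT64_ROOT_4LEVEL)
		return TOY_ROOT_LEGACY_64;

	assert(shadow_root_level == PT32E_ROOT_LEVEL);
	return TOY_ROOT_LEGACY_PAE;
}
```

With tdp_mmu_enabled set and a root level of 3, the nested form picks the
legacy PAE path while the rest of the code would still see the flag set.
The flat form keys only off the flag, which is why it assumes the flag is
never set on 32-bit KVM in the first place.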

> > > +                     root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
> > > +             } else {
> > > +                     root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level,
> > > +                                           true);
> > > +             }




Thread overview: 105+ messages
2020-09-25 21:22 [PATCH 00/22] Introduce the TDP MMU Ben Gardon
2020-09-25 21:22 ` [PATCH 01/22] kvm: mmu: Separate making SPTEs from set_spte Ben Gardon
2020-09-30  4:55   ` Sean Christopherson
2020-09-30 23:03     ` Ben Gardon
2020-09-25 21:22 ` [PATCH 02/22] kvm: mmu: Introduce tdp_iter Ben Gardon
2020-09-26  0:04   ` Paolo Bonzini
2020-09-30  5:06     ` Sean Christopherson
2020-09-26  0:54   ` Paolo Bonzini
2020-09-30  5:08   ` Sean Christopherson
2020-09-30  5:24   ` Sean Christopherson
2020-09-30  6:24     ` Paolo Bonzini
2020-09-30 23:20   ` Eric van Tassell
2020-09-30 23:34     ` Paolo Bonzini
2020-10-01  0:07       ` Sean Christopherson
2020-09-25 21:22 ` [PATCH 03/22] kvm: mmu: Init / Uninit the TDP MMU Ben Gardon
2020-09-26  0:06   ` Paolo Bonzini
2020-09-30  5:34   ` Sean Christopherson
2020-09-30 18:36     ` Ben Gardon
2020-09-30 16:57   ` Sean Christopherson
2020-09-30 17:39     ` Paolo Bonzini
2020-09-30 18:42       ` Ben Gardon
2020-09-25 21:22 ` [PATCH 04/22] kvm: mmu: Allocate and free TDP MMU roots Ben Gardon
2020-09-30  6:06   ` Sean Christopherson
2020-09-30  6:26     ` Paolo Bonzini
2020-09-30 15:38       ` Sean Christopherson
2020-10-12 22:59     ` Ben Gardon
2020-10-12 23:59       ` Sean Christopherson
2020-09-25 21:22 ` [PATCH 05/22] kvm: mmu: Add functions to handle changed TDP SPTEs Ben Gardon
2020-09-26  0:39   ` Paolo Bonzini
2020-09-28 17:23     ` Paolo Bonzini
2020-09-25 21:22 ` [PATCH 06/22] kvm: mmu: Make address space ID a property of memslots Ben Gardon
2020-09-30  6:10   ` Sean Christopherson
2020-09-30 23:11     ` Ben Gardon
2020-09-25 21:22 ` [PATCH 07/22] kvm: mmu: Support zapping SPTEs in the TDP MMU Ben Gardon
2020-09-26  0:14   ` Paolo Bonzini
2020-09-30  6:15   ` Sean Christopherson
2020-09-30  6:28     ` Paolo Bonzini
2020-09-25 21:22 ` [PATCH 08/22] kvm: mmu: Separate making non-leaf sptes from link_shadow_page Ben Gardon
2020-09-25 21:22 ` [PATCH 09/22] kvm: mmu: Remove disallowed_hugepage_adjust shadow_walk_iterator arg Ben Gardon
2020-09-30 16:19   ` Sean Christopherson
2020-09-25 21:22 ` [PATCH 10/22] kvm: mmu: Add TDP MMU PF handler Ben Gardon
2020-09-26  0:24   ` Paolo Bonzini
2020-09-30 16:37   ` Sean Christopherson
2020-09-30 16:55     ` Paolo Bonzini
2020-09-30 17:37     ` Paolo Bonzini
2020-10-06 22:35       ` Ben Gardon
2020-10-06 22:33     ` Ben Gardon
2020-10-07 20:55       ` Sean Christopherson
2020-09-25 21:22 ` [PATCH 11/22] kvm: mmu: Factor out allocating a new tdp_mmu_page Ben Gardon
2020-09-26  0:22   ` Paolo Bonzini
2020-09-30 18:53     ` Ben Gardon
2020-09-25 21:22 ` [PATCH 12/22] kvm: mmu: Allocate struct kvm_mmu_pages for all pages in TDP MMU Ben Gardon
2020-09-25 21:22 ` [PATCH 13/22] kvm: mmu: Support invalidate range MMU notifier for " Ben Gardon
2020-09-30 17:03   ` Sean Christopherson
2020-09-30 23:15     ` Ben Gardon
2020-09-30 23:24       ` Sean Christopherson
2020-09-30 23:27         ` Ben Gardon
2020-09-25 21:22 ` [PATCH 14/22] kvm: mmu: Add access tracking for tdp_mmu Ben Gardon
2020-09-26  0:32   ` Paolo Bonzini
2020-09-30 17:48   ` Sean Christopherson
2020-10-06 23:38     ` Ben Gardon
2020-09-25 21:22 ` [PATCH 15/22] kvm: mmu: Support changed pte notifier in tdp MMU Ben Gardon
2020-09-26  0:33   ` Paolo Bonzini
2020-09-28 15:11   ` Paolo Bonzini
2020-10-07 16:53     ` Ben Gardon
2020-10-07 17:18       ` Paolo Bonzini
2020-10-07 17:30         ` Ben Gardon
2020-10-07 17:54           ` Paolo Bonzini
2020-09-25 21:22 ` [PATCH 16/22] kvm: mmu: Add dirty logging handler for changed sptes Ben Gardon
2020-09-26  0:45   ` Paolo Bonzini
2020-09-25 21:22 ` [PATCH 17/22] kvm: mmu: Support dirty logging for the TDP MMU Ben Gardon
2020-09-26  1:04   ` Paolo Bonzini
2020-10-08 18:27     ` Ben Gardon
2020-09-29 15:07   ` Paolo Bonzini
2020-09-30 18:04   ` Sean Christopherson
2020-09-30 18:08     ` Paolo Bonzini
2020-09-25 21:22 ` [PATCH 18/22] kvm: mmu: Support disabling dirty logging for the tdp MMU Ben Gardon
2020-09-26  1:09   ` Paolo Bonzini
2020-10-07 16:30     ` Ben Gardon
2020-10-07 17:21       ` Paolo Bonzini
2020-10-07 17:28         ` Ben Gardon
2020-10-07 17:53           ` Paolo Bonzini
2020-09-25 21:22 ` [PATCH 19/22] kvm: mmu: Support write protection for nesting in " Ben Gardon
2020-09-30 18:06   ` Sean Christopherson
2020-09-25 21:23 ` [PATCH 20/22] kvm: mmu: NX largepage recovery for TDP MMU Ben Gardon
2020-09-26  1:14   ` Paolo Bonzini
2020-09-30 22:23     ` Ben Gardon
2020-09-29 18:24   ` Paolo Bonzini
2020-09-30 18:15   ` Sean Christopherson
2020-09-30 19:56     ` Paolo Bonzini
2020-09-30 22:33       ` Ben Gardon
2020-09-30 22:27     ` Ben Gardon
2020-09-25 21:23 ` [PATCH 21/22] kvm: mmu: Support MMIO in the " Ben Gardon
2020-09-30 18:19   ` Sean Christopherson
2020-09-25 21:23 ` [PATCH 22/22] kvm: mmu: Don't clear write flooding count for direct roots Ben Gardon
2020-09-26  1:25   ` Paolo Bonzini
2020-10-05 22:48     ` Ben Gardon
2020-10-05 23:44       ` Sean Christopherson
2020-10-06 16:19         ` Ben Gardon
2020-09-26  1:14 ` [PATCH 00/22] Introduce the TDP MMU Paolo Bonzini
2020-09-28 17:31 ` Paolo Bonzini
2020-09-29 17:40   ` Ben Gardon
2020-09-29 18:10     ` Paolo Bonzini
2020-09-30  6:19 ` Sean Christopherson
2020-09-30  6:30   ` Paolo Bonzini
