* [PATCH V2 0/7] KVM: X86/MMU: Use one-off special shadow page for special roots
@ 2022-05-03 15:07 Lai Jiangshan
  2022-05-03 15:07 ` [PATCH V2 1/7] KVM: X86/MMU: Add using_special_root_page() Lai Jiangshan
                   ` (7 more replies)
  0 siblings, 8 replies; 23+ messages in thread
From: Lai Jiangshan @ 2022-05-03 15:07 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini, Sean Christopherson; +Cc: Lai Jiangshan

From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

Current code uses mmu->pae_root, mmu->pml4_root, and mmu->pml5_root to
set up special roots.  The initialization code is complex, and because
the roots are not associated with a struct kvm_mmu_page, the rest of the
code becomes more complex as well.

So add new special shadow pages to simplify it.

The special shadow pages are associated with a struct kvm_mmu_page and
are VCPU-local.

The special shadow pages are created and freed when the roots are
changed (i.e. they are one-off).  This could be optimized, but not in
this patchset, since re-creation is lightweight: in the normal case only
the struct kvm_mmu_page needs to be re-allocated, not sp->spt, because
sp->spt is likely to be mmu->pae_root.
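
As a rough sketch of the structural change (illustration only; the
details are in patches 2 and 4):

	/*
	 * Before: mmu->root.hpa may point at a bare page table such as
	 *         mmu->pae_root, mmu->pml4_root or mmu->pml5_root, with no
	 *         struct kvm_mmu_page behind it.
	 *
	 * After:  mmu->root.hpa always points at a page table backed by a
	 *         struct kvm_mmu_page; for special roots the page is
	 *         VCPU-local, and sp->spt reuses mmu->pae_root when the
	 *         root is the PAE root itself.
	 */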

Changed from v1:
	Rebase to newest kvm/queue. Slightly update patch4.

[V1]: https://lore.kernel.org/lkml/20220420132605.3813-1-jiangshanlai@gmail.com/

Lai Jiangshan (7):
  KVM: X86/MMU: Add using_special_root_page()
  KVM: X86/MMU: Add special shadow pages
  KVM: X86/MMU: Link PAE root pagetable with its children
  KVM: X86/MMU: Activate special shadow pages and remove old logic
  KVM: X86/MMU: Remove the check of the return value of to_shadow_page()
  KVM: X86/MMU: Allocate mmu->pae_root for PAE paging on-demand
  KVM: X86/MMU: Remove mmu_alloc_special_roots()

 arch/x86/include/asm/kvm_host.h |   3 -
 arch/x86/kvm/mmu/mmu.c          | 487 ++++++++++----------------------
 arch/x86/kvm/mmu/mmu_internal.h |  10 -
 arch/x86/kvm/mmu/paging_tmpl.h  |  14 +-
 arch/x86/kvm/mmu/spte.c         |   7 +
 arch/x86/kvm/mmu/spte.h         |   1 +
 arch/x86/kvm/mmu/tdp_mmu.h      |   7 +-
 7 files changed, 178 insertions(+), 351 deletions(-)

-- 
2.19.1.6.gb485710b


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH V2 1/7] KVM: X86/MMU: Add using_special_root_page()
  2022-05-03 15:07 [PATCH V2 0/7] KVM: X86/MMU: Use one-off special shadow page for special roots Lai Jiangshan
@ 2022-05-03 15:07 ` Lai Jiangshan
  2022-05-13 22:53   ` David Matlack
  2022-05-03 15:07 ` [PATCH V2 2/7] KVM: X86/MMU: Add special shadow pages Lai Jiangshan
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Lai Jiangshan @ 2022-05-03 15:07 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini, Sean Christopherson
  Cc: Lai Jiangshan, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin

From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

In some cases, special roots are used in the MMU.  The check for
whether special roots are in use is often done via
to_shadow_page(mmu->root.hpa).

Add using_special_root_page() to directly check whether special roots
are used, or need to be used, even when mmu->root.hpa is not set.

This prepares for making to_shadow_page(mmu->root.hpa) return non-NULL
by using special shadow pages.
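
For reference, a standalone sketch (not part of the patch) of which
combinations take the special-root path; the level constants mirror
KVM's definitions in mmu.h:

	#include <stdbool.h>

	#define PT32_ROOT_LEVEL		2	/* 32-bit 2-level paging */
	#define PT32E_ROOT_LEVEL	3	/* PAE paging */
	#define PT64_ROOT_4LEVEL	4
	#define PT64_ROOT_5LEVEL	5

	/*
	 * Direct roots are special only when KVM itself builds a PAE
	 * (3-level) root; indirect roots are special whenever the guest uses
	 * 2-level or PAE paging, including shadowed 32-bit/PAE NPT.
	 */
	static bool special_root(bool direct, int root_level, int cpu_level)
	{
		if (direct)
			return root_level == PT32E_ROOT_LEVEL;
		return cpu_level <= PT32E_ROOT_LEVEL;
	}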

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
 arch/x86/kvm/mmu/mmu.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 909372762363..7f20796af351 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1711,6 +1711,14 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
 	mmu_spte_clear_no_track(parent_pte);
 }
 
+static bool using_special_root_page(struct kvm_mmu *mmu)
+{
+	if (mmu->root_role.direct)
+		return mmu->root_role.level == PT32E_ROOT_LEVEL;
+	else
+		return mmu->cpu_role.base.level <= PT32E_ROOT_LEVEL;
+}
+
 static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct)
 {
 	struct kvm_mmu_page *sp;
@@ -4241,10 +4249,10 @@ static bool fast_pgd_switch(struct kvm *kvm, struct kvm_mmu *mmu,
 {
 	/*
 	 * For now, limit the caching to 64-bit hosts+VMs in order to avoid
-	 * having to deal with PDPTEs. We may add support for 32-bit hosts/VMs
-	 * later if necessary.
+	 * having to deal with PDPTEs.  Special roots can not be put into
+	 * mmu->prev_roots[].
 	 */
-	if (VALID_PAGE(mmu->root.hpa) && !to_shadow_page(mmu->root.hpa))
+	if (VALID_PAGE(mmu->root.hpa) && using_special_root_page(mmu))
 		kvm_mmu_free_roots(kvm, mmu, KVM_MMU_ROOT_CURRENT);
 
 	if (VALID_PAGE(mmu->root.hpa))
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH V2 2/7] KVM: X86/MMU: Add special shadow pages
  2022-05-03 15:07 [PATCH V2 0/7] KVM: X86/MMU: Use one-off special shadow page for special roots Lai Jiangshan
  2022-05-03 15:07 ` [PATCH V2 1/7] KVM: X86/MMU: Add using_special_root_page() Lai Jiangshan
@ 2022-05-03 15:07 ` Lai Jiangshan
  2022-05-13 23:43   ` David Matlack
  2022-05-03 15:07 ` [PATCH V2 3/7] KVM: X86/MMU: Link PAE root pagetable with its children Lai Jiangshan
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Lai Jiangshan @ 2022-05-03 15:07 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini, Sean Christopherson
  Cc: Lai Jiangshan, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin

From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

Special pages are pages that hold the PDPTEs for a 32-bit guest, or
higher-level pages that are linked to a special page when shadowing NPT.

Current code uses mmu->pae_root, mmu->pml4_root, and mmu->pml5_root to
set up the special roots.  The initialization code is complex, and
because the roots are not associated with a struct kvm_mmu_page, the
rest of the code becomes more complex as well.

Add kvm_mmu_alloc_special_page() and mmu_free_special_root_page() to
allocate and free special shadow pages, preparing to replace the current
logic with special shadow pages that share most of their logic with
normal shadow pages.

The code is not activated yet since the conditions guarding it are never
true at the places where it is inserted.
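
For illustration (not from the patch itself), the resulting hierarchy
when shadowing NPT for a PAE L1 on a 4-level host looks roughly like:

	/*
	 * special root (level 4, per-vCPU, not in the hash, not gfn-keyed)
	 *   -> special page (level 3, holds the shadowed PDPTEs)
	 *        -> normal shadow page (level 2, gfn-keyed, in the hash)
	 *             -> normal shadow page (level 1)
	 */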

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
 arch/x86/kvm/mmu/mmu.c | 91 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 90 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7f20796af351..126f0cd07f98 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1719,6 +1719,58 @@ static bool using_special_root_page(struct kvm_mmu *mmu)
 		return mmu->cpu_role.base.level <= PT32E_ROOT_LEVEL;
 }
 
+/*
+ * Special pages are pages that hold the PAE PDPTEs for a 32-bit guest, or
+ * higher-level pages linked to a special page when shadowing NPT.
+ *
+ * Special pages are specially allocated.  If sp->spt needs to sit below 4GB
+ * (i.e. it backs a 32-bit PAE CR3), it uses the preallocated mmu->pae_root.
+ *
+ * Special pages are only visible to the local VCPU, except through the rmap
+ * of their children, so they are in neither kvm->arch.active_mmu_pages nor the hash.
+ *
+ * And they are neither accounted nor write-protected since they have no
+ * associated gfn.
+ *
+ * Because of the above, special pages can not be freed nor zapped like normal
+ * shadow pages.  They are freed directly when the special root is freed, see
+ * mmu_free_special_root_page().
+ *
+ * A special root page can not be put on mmu->prev_roots because the comparison
+ * must use PDPTEs instead of CR3, and mmu->pae_root can not be shared for
+ * multiple root pages.
+ *
+ * Apart from the above limitations, all the other abilities are the same as
+ * for other shadow pages, like link, parent rmap, sync, unsync etc.
+ *
+ * Special pages can be obsoleted, but might possibly be reused later.  When
+ * the obsoleting process is done, all the obsoleted shadow pages are unlinked
+ * from the special pages with the help of the parent rmap of the children,
+ * and the special pages become theoretically valid again.  If no other event
+ * causes a VCPU to free its root while it is preempted by the host across an
+ * obsoleting process, the VCPU can reuse its special pages when it is
+ * scheduled back in.
+ */
+static struct kvm_mmu_page *kvm_mmu_alloc_special_page(struct kvm_vcpu *vcpu,
+		union kvm_mmu_page_role role)
+{
+	struct kvm_mmu_page *sp;
+
+	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
+	sp->gfn = 0;
+	sp->role = role;
+	if (role.level == PT32E_ROOT_LEVEL &&
+	    vcpu->arch.mmu->root_role.level == PT32E_ROOT_LEVEL)
+		sp->spt = vcpu->arch.mmu->pae_root;
+	else
+		sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+	/* sp->gfns is not used for special shadow page */
+	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
+	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
+
+	return sp;
+}
+
 static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct)
 {
 	struct kvm_mmu_page *sp;
@@ -2076,6 +2128,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	if (level <= vcpu->arch.mmu->cpu_role.base.level)
 		role.passthrough = 0;
 
+	if (unlikely(level >= PT32E_ROOT_LEVEL && using_special_root_page(vcpu->arch.mmu)))
+		return kvm_mmu_alloc_special_page(vcpu, role);
+
 	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
 		if (sp->gfn != gfn) {
@@ -3290,6 +3345,37 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
 	*root_hpa = INVALID_PAGE;
 }
 
+static void mmu_free_special_root_page(struct kvm *kvm, struct kvm_mmu *mmu)
+{
+	u64 spte = mmu->root.hpa;
+	struct kvm_mmu_page *sp = to_shadow_page(spte & PT64_BASE_ADDR_MASK);
+	int i;
+
+	/* Free level 5 or 4 roots for shadow NPT for 32 bit L1 */
+	while (sp->role.level > PT32E_ROOT_LEVEL)
+	{
+		spte = sp->spt[0];
+		mmu_page_zap_pte(kvm, sp, sp->spt + 0, NULL);
+		free_page((unsigned long)sp->spt);
+		kmem_cache_free(mmu_page_header_cache, sp);
+		if (!is_shadow_present_pte(spte))
+			return;
+		sp = to_shadow_page(spte & PT64_BASE_ADDR_MASK);
+	}
+
+	if (WARN_ON_ONCE(sp->role.level != PT32E_ROOT_LEVEL))
+		return;
+
+	/* Free PAE roots */
+	for (i = 0; i < 4; i++)
+		mmu_page_zap_pte(kvm, sp, sp->spt + i, NULL);
+
+	if (sp->spt != mmu->pae_root)
+		free_page((unsigned long)sp->spt);
+
+	kmem_cache_free(mmu_page_header_cache, sp);
+}
+
 /* roots_to_free must be some combination of the KVM_MMU_ROOT_* flags */
 void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu,
 			ulong roots_to_free)
@@ -3323,7 +3409,10 @@ void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu,
 
 	if (free_active_root) {
 		if (to_shadow_page(mmu->root.hpa)) {
-			mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
+			if (using_special_root_page(mmu))
+				mmu_free_special_root_page(kvm, mmu);
+			else
+				mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
 		} else if (mmu->pae_root) {
 			for (i = 0; i < 4; ++i) {
 				if (!IS_VALID_PAE_ROOT(mmu->pae_root[i]))
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH V2 3/7] KVM: X86/MMU: Link PAE root pagetable with its children
  2022-05-03 15:07 [PATCH V2 0/7] KVM: X86/MMU: Use one-off special shadow page for special roots Lai Jiangshan
  2022-05-03 15:07 ` [PATCH V2 1/7] KVM: X86/MMU: Add using_special_root_page() Lai Jiangshan
  2022-05-03 15:07 ` [PATCH V2 2/7] KVM: X86/MMU: Add special shadow pages Lai Jiangshan
@ 2022-05-03 15:07 ` Lai Jiangshan
  2022-05-17  0:01   ` David Matlack
  2022-05-03 15:07 ` [PATCH V2 4/7] KVM: X86/MMU: Activate special shadow pages and remove old logic Lai Jiangshan
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Lai Jiangshan @ 2022-05-03 15:07 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini, Sean Christopherson
  Cc: Lai Jiangshan, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin

From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

When special shadow pages are activated, link_shadow_page() might link
a special shadow page, which is the PAE root for PAE paging, with its
children.

Add make_pae_pdpte() to handle it.

The code is not activated since special shadow pages are not activated
yet.
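
For context, a simplified sketch of the PDPTE bits involved
(illustration only; the authoritative definitions are PT_PRESENT_MASK,
SPTE_MMU_PRESENT_MASK and shadow_me_value in the KVM sources):

	/*
	 * PAE PDPTE, simplified:
	 *   bit  0       P                    -> PT_PRESENT_MASK
	 *   bits 11:9    ignored by hardware  -> room for SPTE_MMU_PRESENT_MASK,
	 *                                        which the BUILD_BUG_ON() below
	 *                                        asserts
	 *   bits 51:12   physical address of the page directory
	 *   plus the memory-encryption bit (shadow_me_value) when applicable
	 */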

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
 arch/x86/kvm/mmu/mmu.c  | 6 +++++-
 arch/x86/kvm/mmu/spte.c | 7 +++++++
 arch/x86/kvm/mmu/spte.h | 1 +
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 126f0cd07f98..3fe70ad3bda2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2277,7 +2277,11 @@ static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
 
 	BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK);
 
-	spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
+	if (unlikely(sp->role.level == PT32_ROOT_LEVEL &&
+		     vcpu->arch.mmu->root_role.level == PT32E_ROOT_LEVEL))
+		spte = make_pae_pdpte(sp->spt);
+	else
+		spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
 
 	mmu_spte_set(sptep, spte);
 
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 75c9e87d446a..ccd9267a58ca 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -251,6 +251,13 @@ u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index)
 	return child_spte;
 }
 
+u64 make_pae_pdpte(u64 *child_pt)
+{
+	/* The only ignored bits in a PDPTE are 11:9. */
+	BUILD_BUG_ON(!(GENMASK(11,9) & SPTE_MMU_PRESENT_MASK));
+	return __pa(child_pt) | PT_PRESENT_MASK | SPTE_MMU_PRESENT_MASK |
+		shadow_me_value;
+}
 
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
 {
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index fbbab180395e..09a7e4ba017a 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -413,6 +413,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       u64 old_spte, bool prefetch, bool can_unsync,
 	       bool host_writable, u64 *new_spte);
 u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index);
+u64 make_pae_pdpte(u64 *child_pt);
 u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
 u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
 u64 mark_spte_for_access_track(u64 spte);
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH V2 4/7] KVM: X86/MMU: Activate special shadow pages and remove old logic
  2022-05-03 15:07 [PATCH V2 0/7] KVM: X86/MMU: Use one-off special shadow page for special roots Lai Jiangshan
                   ` (2 preceding siblings ...)
  2022-05-03 15:07 ` [PATCH V2 3/7] KVM: X86/MMU: Link PAE root pagetable with its children Lai Jiangshan
@ 2022-05-03 15:07 ` Lai Jiangshan
  2022-05-17  0:16   ` David Matlack
  2022-05-03 15:07 ` [PATCH V2 5/7] KVM: X86/MMU: Remove the check of the return value of to_shadow_page() Lai Jiangshan
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Lai Jiangshan @ 2022-05-03 15:07 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini, Sean Christopherson
  Cc: Lai Jiangshan, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin

From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

Activate special shadow pages by allocating special shadow pages in
mmu_alloc_direct_roots() and mmu_alloc_shadow_roots().

Make shadow-page walks start from the topmost shadow page even when it
is a special shadow page, so that it can be walked like a normal root
and the shadowed PDPTEs can be made and installed on demand.

Walking from the topmost page means FNAME(fetch) needs to visit
high-level special shadow pages and allocate special shadow pages when
shadowing NPT for a 32-bit L1 on a 64-bit host, so change FNAME(fetch)
and FNAME(walk_addr_generic) to handle the affected code.

Sync from the topmost page in kvm_mmu_sync_roots(), which simplifies
the code.

Now all the root pages and page tables pointed to by a present spte in
struct kvm_mmu are associated with a struct kvm_mmu_page, and
to_shadow_page() is guaranteed to return non-NULL.

The affected cases are those where using_special_root_page() returns
true.
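
Roughly (illustration only, helper names as used in this series), for a
PAE guest under shadow paging the walk now looks like:

	/*
	 * shadow_walk_init() starts at mmu->root.hpa, which is the special
	 * level-3 page (its spt is mmu->pae_root), instead of indexing
	 * mmu->pae_root by hand:
	 *
	 *   level 3: special page -- the shadowed PDPTE is installed here
	 *            on demand via link_shadow_page()/make_pae_pdpte()
	 *   level 2: normal shadow page (the guest page directory's shadow)
	 *   level 1: normal shadow page table
	 */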

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
 arch/x86/kvm/mmu/mmu.c         | 168 +++------------------------------
 arch/x86/kvm/mmu/paging_tmpl.h |  14 ++-
 2 files changed, 24 insertions(+), 158 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3fe70ad3bda2..6f626d7e8ebb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2214,26 +2214,6 @@ static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterato
 	iterator->addr = addr;
 	iterator->shadow_addr = root;
 	iterator->level = vcpu->arch.mmu->root_role.level;
-
-	if (iterator->level >= PT64_ROOT_4LEVEL &&
-	    vcpu->arch.mmu->cpu_role.base.level < PT64_ROOT_4LEVEL &&
-	    !vcpu->arch.mmu->root_role.direct)
-		iterator->level = PT32E_ROOT_LEVEL;
-
-	if (iterator->level == PT32E_ROOT_LEVEL) {
-		/*
-		 * prev_root is currently only used for 64-bit hosts. So only
-		 * the active root_hpa is valid here.
-		 */
-		BUG_ON(root != vcpu->arch.mmu->root.hpa);
-
-		iterator->shadow_addr
-			= vcpu->arch.mmu->pae_root[(addr >> 30) & 3];
-		iterator->shadow_addr &= PT64_BASE_ADDR_MASK;
-		--iterator->level;
-		if (!iterator->shadow_addr)
-			iterator->level = 0;
-	}
 }
 
 static void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
@@ -3412,21 +3392,10 @@ void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu,
 					   &invalid_list);
 
 	if (free_active_root) {
-		if (to_shadow_page(mmu->root.hpa)) {
-			if (using_special_root_page(mmu))
-				mmu_free_special_root_page(kvm, mmu);
-			else
-				mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
-		} else if (mmu->pae_root) {
-			for (i = 0; i < 4; ++i) {
-				if (!IS_VALID_PAE_ROOT(mmu->pae_root[i]))
-					continue;
-
-				mmu_free_root_page(kvm, &mmu->pae_root[i],
-						   &invalid_list);
-				mmu->pae_root[i] = INVALID_PAE_ROOT;
-			}
-		}
+		if (using_special_root_page(mmu))
+			mmu_free_special_root_page(kvm, mmu);
+		else
+			mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
 		mmu->root.hpa = INVALID_PAGE;
 		mmu->root.pgd = 0;
 	}
@@ -3491,7 +3460,6 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
 	u8 shadow_root_level = mmu->root_role.level;
 	hpa_t root;
-	unsigned i;
 	int r;
 
 	write_lock(&vcpu->kvm->mmu_lock);
@@ -3502,24 +3470,9 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 	if (is_tdp_mmu_enabled(vcpu->kvm)) {
 		root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
 		mmu->root.hpa = root;
-	} else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
+	} else if (shadow_root_level >= PT32E_ROOT_LEVEL) {
 		root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level, true);
 		mmu->root.hpa = root;
-	} else if (shadow_root_level == PT32E_ROOT_LEVEL) {
-		if (WARN_ON_ONCE(!mmu->pae_root)) {
-			r = -EIO;
-			goto out_unlock;
-		}
-
-		for (i = 0; i < 4; ++i) {
-			WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
-
-			root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT),
-					      i << 30, PT32_ROOT_LEVEL, true);
-			mmu->pae_root[i] = root | PT_PRESENT_MASK |
-					   shadow_me_mask;
-		}
-		mmu->root.hpa = __pa(mmu->pae_root);
 	} else {
 		WARN_ONCE(1, "Bad TDP root level = %d\n", shadow_root_level);
 		r = -EIO;
@@ -3597,10 +3550,8 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm)
 static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 {
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
-	u64 pdptrs[4], pm_mask;
 	gfn_t root_gfn, root_pgd;
 	hpa_t root;
-	unsigned i;
 	int r;
 
 	root_pgd = mmu->get_guest_pgd(vcpu);
@@ -3609,21 +3560,6 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	if (mmu_check_root(vcpu, root_gfn))
 		return 1;
 
-	/*
-	 * On SVM, reading PDPTRs might access guest memory, which might fault
-	 * and thus might sleep.  Grab the PDPTRs before acquiring mmu_lock.
-	 */
-	if (mmu->cpu_role.base.level == PT32E_ROOT_LEVEL) {
-		for (i = 0; i < 4; ++i) {
-			pdptrs[i] = mmu->get_pdptr(vcpu, i);
-			if (!(pdptrs[i] & PT_PRESENT_MASK))
-				continue;
-
-			if (mmu_check_root(vcpu, pdptrs[i] >> PAGE_SHIFT))
-				return 1;
-		}
-	}
-
 	r = mmu_first_shadow_root_alloc(vcpu->kvm);
 	if (r)
 		return r;
@@ -3633,70 +3569,9 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	if (r < 0)
 		goto out_unlock;
 
-	/*
-	 * Do we shadow a long mode page table? If so we need to
-	 * write-protect the guests page table root.
-	 */
-	if (mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) {
-		root = mmu_alloc_root(vcpu, root_gfn, 0,
-				      mmu->root_role.level, false);
-		mmu->root.hpa = root;
-		goto set_root_pgd;
-	}
-
-	if (WARN_ON_ONCE(!mmu->pae_root)) {
-		r = -EIO;
-		goto out_unlock;
-	}
-
-	/*
-	 * We shadow a 32 bit page table. This may be a legacy 2-level
-	 * or a PAE 3-level page table. In either case we need to be aware that
-	 * the shadow page table may be a PAE or a long mode page table.
-	 */
-	pm_mask = PT_PRESENT_MASK | shadow_me_value;
-	if (mmu->root_role.level >= PT64_ROOT_4LEVEL) {
-		pm_mask |= PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
-
-		if (WARN_ON_ONCE(!mmu->pml4_root)) {
-			r = -EIO;
-			goto out_unlock;
-		}
-		mmu->pml4_root[0] = __pa(mmu->pae_root) | pm_mask;
-
-		if (mmu->root_role.level == PT64_ROOT_5LEVEL) {
-			if (WARN_ON_ONCE(!mmu->pml5_root)) {
-				r = -EIO;
-				goto out_unlock;
-			}
-			mmu->pml5_root[0] = __pa(mmu->pml4_root) | pm_mask;
-		}
-	}
-
-	for (i = 0; i < 4; ++i) {
-		WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
-
-		if (mmu->cpu_role.base.level == PT32E_ROOT_LEVEL) {
-			if (!(pdptrs[i] & PT_PRESENT_MASK)) {
-				mmu->pae_root[i] = INVALID_PAE_ROOT;
-				continue;
-			}
-			root_gfn = pdptrs[i] >> PAGE_SHIFT;
-		}
-
-		root = mmu_alloc_root(vcpu, root_gfn, i << 30,
-				      PT32_ROOT_LEVEL, false);
-		mmu->pae_root[i] = root | pm_mask;
-	}
-
-	if (mmu->root_role.level == PT64_ROOT_5LEVEL)
-		mmu->root.hpa = __pa(mmu->pml5_root);
-	else if (mmu->root_role.level == PT64_ROOT_4LEVEL)
-		mmu->root.hpa = __pa(mmu->pml4_root);
-	else
-		mmu->root.hpa = __pa(mmu->pae_root);
-
-set_root_pgd:
+	root = mmu_alloc_root(vcpu, root_gfn, 0,
+			      mmu->root_role.level, false);
+	mmu->root.hpa = root;
 	mmu->root.pgd = root_pgd;
 out_unlock:
 	write_unlock(&vcpu->kvm->mmu_lock);
@@ -3813,8 +3688,7 @@ static bool is_unsync_root(hpa_t root)
 
 void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
 {
-	int i;
-	struct kvm_mmu_page *sp;
+	hpa_t root = vcpu->arch.mmu->root.hpa;
 
 	if (vcpu->arch.mmu->root_role.direct)
 		return;
@@ -3824,31 +3698,11 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
 
 	vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
 
-	if (vcpu->arch.mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) {
-		hpa_t root = vcpu->arch.mmu->root.hpa;
-		sp = to_shadow_page(root);
-
-		if (!is_unsync_root(root))
-			return;
-
-		write_lock(&vcpu->kvm->mmu_lock);
-		mmu_sync_children(vcpu, sp, true);
-		write_unlock(&vcpu->kvm->mmu_lock);
+	if (!is_unsync_root(root))
 		return;
-	}
 
 	write_lock(&vcpu->kvm->mmu_lock);
-
-	for (i = 0; i < 4; ++i) {
-		hpa_t root = vcpu->arch.mmu->pae_root[i];
-
-		if (IS_VALID_PAE_ROOT(root)) {
-			root &= PT64_BASE_ADDR_MASK;
-			sp = to_shadow_page(root);
-			mmu_sync_children(vcpu, sp, true);
-		}
-	}
-
+	mmu_sync_children(vcpu, to_shadow_page(root), true);
 	write_unlock(&vcpu->kvm->mmu_lock);
 }
 
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index b025decf610d..19ef31a078fa 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -316,6 +316,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 	u16 errcode = 0;
 	gpa_t real_gpa;
 	gfn_t gfn;
+	int i;
 
 	trace_kvm_mmu_pagetable_walk(addr, access);
 retry_walk:
@@ -323,6 +324,16 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 	pte           = mmu->get_guest_pgd(vcpu);
 	have_ad       = PT_HAVE_ACCESSED_DIRTY(mmu);
 
+	/*
+	 * FNAME(fetch) might pass these values to allocate a special shadow
+	 * page.  Although the gfn is not used in the end, it is better not
+	 * to pass an uninitialized value to kvm_mmu_get_page().
+	 */
+	for (i = 2; i < PT_MAX_FULL_LEVELS; i++) {
+		walker->table_gfn[i] = 0;
+		walker->pt_access[i] = ACC_ALL;
+	}
+
 #if PTTYPE == 64
 	walk_nx_mask = 1ULL << PT64_NX_SHIFT;
 	if (walker->level == PT32E_ROOT_LEVEL) {
@@ -675,7 +686,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		 * Verify that the gpte in the page we've just write
 		 * protected is still there.
 		 */
-		if (FNAME(gpte_changed)(vcpu, gw, it.level - 1))
+		if (it.level - 1 < top_level &&
+		    FNAME(gpte_changed)(vcpu, gw, it.level - 1))
 			goto out_gpte_changed;
 
 		if (sp)
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH V2 5/7] KVM: X86/MMU: Remove the check of the return value of to_shadow_page()
  2022-05-03 15:07 [PATCH V2 0/7] KVM: X86/MMU: Use one-off special shadow page for special roots Lai Jiangshan
                   ` (3 preceding siblings ...)
  2022-05-03 15:07 ` [PATCH V2 4/7] KVM: X86/MMU: Activate special shadow pages and remove old logic Lai Jiangshan
@ 2022-05-03 15:07 ` Lai Jiangshan
  2022-05-17 16:47   ` David Matlack
  2022-05-03 15:07 ` [PATCH V2 6/7] KVM: X86/MMU: Allocate mmu->pae_root for PAE paging on-demand Lai Jiangshan
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Lai Jiangshan @ 2022-05-03 15:07 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini, Sean Christopherson
  Cc: Lai Jiangshan, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin

From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

Remove the check of the return value of to_shadow_page() in
mmu_free_root_page(), kvm_mmu_free_guest_mode_roots(), is_unsync_root()
and is_tdp_mmu() because it cannot return NULL.

Remove the check of the return value of to_shadow_page() in
is_page_fault_stale() and is_obsolete_root() because it cannot return
NULL and the obsoleting of special shadow pages is already handled in a
different way.

When the obsoleting process is done, all the obsoleted shadow pages are
already unlinked from the special pages with the help of the parent rmap
of the children, and the special pages become theoretically valid again.
A special shadow page can be freed if is_obsolete_sp() returns true, or
reused if it returns false.
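
In other words (illustration only), the decision after this series is
simply:

	/*
	 * Every valid root now has a struct kvm_mmu_page, so
	 * is_obsolete_sp(kvm, to_shadow_page(root_hpa)) covers normal and
	 * special roots alike; the old "no shadow page => treat as obsolete"
	 * fallback and its false positives go away.
	 */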

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
 arch/x86/kvm/mmu/mmu.c     | 44 +++-----------------------------------
 arch/x86/kvm/mmu/tdp_mmu.h |  7 +-----
 2 files changed, 4 insertions(+), 47 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6f626d7e8ebb..bcb3e2730277 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3318,8 +3318,6 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
 		return;
 
 	sp = to_shadow_page(*root_hpa & PT64_BASE_ADDR_MASK);
-	if (WARN_ON(!sp))
-		return;
 
 	if (is_tdp_mmu_page(sp))
 		kvm_tdp_mmu_put_root(kvm, sp, false);
@@ -3422,8 +3420,7 @@ void kvm_mmu_free_guest_mode_roots(struct kvm *kvm, struct kvm_mmu *mmu)
 		if (!VALID_PAGE(root_hpa))
 			continue;
 
-		if (!to_shadow_page(root_hpa) ||
-			to_shadow_page(root_hpa)->role.guest_mode)
+		if (to_shadow_page(root_hpa)->role.guest_mode)
 			roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
 	}
 
@@ -3673,13 +3670,6 @@ static bool is_unsync_root(hpa_t root)
 	smp_rmb();
 	sp = to_shadow_page(root);
 
-	/*
-	 * PAE roots (somewhat arbitrarily) aren't backed by shadow pages, the
-	 * PDPTEs for a given PAE root need to be synchronized individually.
-	 */
-	if (WARN_ON_ONCE(!sp))
-		return false;
-
 	if (sp->unsync || sp->unsync_children)
 		return true;
 
@@ -3975,21 +3965,7 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
 				struct kvm_page_fault *fault, int mmu_seq)
 {
-	struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root.hpa);
-
-	/* Special roots, e.g. pae_root, are not backed by shadow pages. */
-	if (sp && is_obsolete_sp(vcpu->kvm, sp))
-		return true;
-
-	/*
-	 * Roots without an associated shadow page are considered invalid if
-	 * there is a pending request to free obsolete roots.  The request is
-	 * only a hint that the current root _may_ be obsolete and needs to be
-	 * reloaded, e.g. if the guest frees a PGD that KVM is tracking as a
-	 * previous root, then __kvm_mmu_prepare_zap_page() signals all vCPUs
-	 * to reload even if no vCPU is actively using the root.
-	 */
-	if (!sp && kvm_test_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
+	if (is_obsolete_sp(vcpu->kvm, to_shadow_page(vcpu->arch.mmu->root.hpa)))
 		return true;
 
 	return fault->slot &&
@@ -5094,24 +5070,10 @@ void kvm_mmu_unload(struct kvm_vcpu *vcpu)
 
 static bool is_obsolete_root(struct kvm *kvm, hpa_t root_hpa)
 {
-	struct kvm_mmu_page *sp;
-
 	if (!VALID_PAGE(root_hpa))
 		return false;
 
-	/*
-	 * When freeing obsolete roots, treat roots as obsolete if they don't
-	 * have an associated shadow page.  This does mean KVM will get false
-	 * positives and free roots that don't strictly need to be freed, but
-	 * such false positives are relatively rare:
-	 *
-	 *  (a) only PAE paging and nested NPT has roots without shadow pages
-	 *  (b) remote reloads due to a memslot update obsoletes _all_ roots
-	 *  (c) KVM doesn't track previous roots for PAE paging, and the guest
-	 *      is unlikely to zap an in-use PGD.
-	 */
-	sp = to_shadow_page(root_hpa);
-	return !sp || is_obsolete_sp(kvm, sp);
+	return is_obsolete_sp(kvm, to_shadow_page(root_hpa));
 }
 
 static void __kvm_mmu_free_obsolete_roots(struct kvm *kvm, struct kvm_mmu *mmu)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index c163f7cc23ca..5779a2a7161e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -78,13 +78,8 @@ static inline bool is_tdp_mmu(struct kvm_mmu *mmu)
 	if (WARN_ON(!VALID_PAGE(hpa)))
 		return false;
 
-	/*
-	 * A NULL shadow page is legal when shadowing a non-paging guest with
-	 * PAE paging, as the MMU will be direct with root_hpa pointing at the
-	 * pae_root page, not a shadow page.
-	 */
 	sp = to_shadow_page(hpa);
-	return sp && is_tdp_mmu_page(sp) && sp->root_count;
+	return is_tdp_mmu_page(sp) && sp->root_count;
 }
 #else
 static inline int kvm_mmu_init_tdp_mmu(struct kvm *kvm) { return 0; }
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH V2 6/7] KVM: X86/MMU: Allocate mmu->pae_root for PAE paging on-demand
  2022-05-03 15:07 [PATCH V2 0/7] KVM: X86/MMU: Use one-off special shadow page for special roots Lai Jiangshan
                   ` (4 preceding siblings ...)
  2022-05-03 15:07 ` [PATCH V2 5/7] KVM: X86/MMU: Remove the check of the return value of to_shadow_page() Lai Jiangshan
@ 2022-05-03 15:07 ` Lai Jiangshan
  2022-05-17 16:57   ` David Matlack
  2022-05-03 15:07 ` [PATCH V2 7/7] KVM: X86/MMU: Remove mmu_alloc_special_roots() Lai Jiangshan
  2022-05-13  8:22 ` [PATCH V2 0/7] KVM: X86/MMU: Use one-off special shadow page for special roots Lai Jiangshan
  7 siblings, 1 reply; 23+ messages in thread
From: Lai Jiangshan @ 2022-05-03 15:07 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini, Sean Christopherson
  Cc: Lai Jiangshan, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin

From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

mmu->pae_root for non-PAE paging is allocated on demand, but
mmu->pae_root for PAE paging is allocated early, when the struct kvm_mmu
is created.

Simplify the code by allocating mmu->pae_root for PAE paging on demand
as well.
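
The on-demand pattern itself is small; a simplified sketch (the real
code is mmu_alloc_pae_root() in the hunk below) is:

	if (!mmu->pae_root) {
		/* A PAE CR3 is only 32 bits wide, so the PDPT must sit below 4GiB. */
		struct page *page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_DMA32);

		if (!page)
			return -ENOMEM;
		mmu->pae_root = page_address(page);
	}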

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
 arch/x86/kvm/mmu/mmu.c          | 99 ++++++++++++++-------------------
 arch/x86/kvm/mmu/mmu_internal.h | 10 ----
 2 files changed, 42 insertions(+), 67 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index bcb3e2730277..c97f830c5f8c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -691,6 +691,41 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
 	}
 }
 
+static int mmu_alloc_pae_root(struct kvm_vcpu *vcpu)
+{
+	struct page *page;
+
+	if (vcpu->arch.mmu->root_role.level != PT32E_ROOT_LEVEL)
+		return 0;
+	if (vcpu->arch.mmu->pae_root)
+		return 0;
+
+	/*
+	 * Allocate a page to hold the four PDPTEs for PAE paging when emulating
+	 * 32-bit mode.  CR3 is only 32 bits even on x86_64 in this case.
+	 * Therefore we need to allocate the PDP table in the first 4GB of
+	 * memory, which happens to fit the DMA32 zone.
+	 */
+	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_DMA32);
+	if (!page)
+		return -ENOMEM;
+	vcpu->arch.mmu->pae_root = page_address(page);
+
+	/*
+	 * CR3 is only 32 bits when PAE paging is used, thus it's impossible to
+	 * get the CPU to treat the PDPTEs as encrypted.  Decrypt the page so
+	 * that KVM's writes and the CPU's reads get along.  Note, this is
+	 * only necessary when using shadow paging, as 64-bit NPT can get at
+	 * the C-bit even when shadowing 32-bit NPT, and SME isn't supported
+	 * by 32-bit kernels (when KVM itself uses 32-bit NPT).
+	 */
+	if (!tdp_enabled)
+		set_memory_decrypted((unsigned long)vcpu->arch.mmu->pae_root, 1);
+	else
+		WARN_ON_ONCE(shadow_me_value);
+	return 0;
+}
+
 static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 {
 	int r;
@@ -5031,6 +5066,9 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
 	r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->root_role.direct);
 	if (r)
 		goto out;
+	r = mmu_alloc_pae_root(vcpu);
+	if (r)
+		return r;
 	r = mmu_alloc_special_roots(vcpu);
 	if (r)
 		goto out;
@@ -5495,63 +5533,18 @@ static void free_mmu_pages(struct kvm_mmu *mmu)
 	free_page((unsigned long)mmu->pml5_root);
 }
 
-static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
+static void __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
 {
-	struct page *page;
 	int i;
 
 	mmu->root.hpa = INVALID_PAGE;
 	mmu->root.pgd = 0;
 	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
 		mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
-
-	/* vcpu->arch.guest_mmu isn't used when !tdp_enabled. */
-	if (!tdp_enabled && mmu == &vcpu->arch.guest_mmu)
-		return 0;
-
-	/*
-	 * When using PAE paging, the four PDPTEs are treated as 'root' pages,
-	 * while the PDP table is a per-vCPU construct that's allocated at MMU
-	 * creation.  When emulating 32-bit mode, cr3 is only 32 bits even on
-	 * x86_64.  Therefore we need to allocate the PDP table in the first
-	 * 4GB of memory, which happens to fit the DMA32 zone.  TDP paging
-	 * generally doesn't use PAE paging and can skip allocating the PDP
-	 * table.  The main exception, handled here, is SVM's 32-bit NPT.  The
-	 * other exception is for shadowing L1's 32-bit or PAE NPT on 64-bit
-	 * KVM; that horror is handled on-demand by mmu_alloc_special_roots().
-	 */
-	if (tdp_enabled && kvm_mmu_get_tdp_level(vcpu) > PT32E_ROOT_LEVEL)
-		return 0;
-
-	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_DMA32);
-	if (!page)
-		return -ENOMEM;
-
-	mmu->pae_root = page_address(page);
-
-	/*
-	 * CR3 is only 32 bits when PAE paging is used, thus it's impossible to
-	 * get the CPU to treat the PDPTEs as encrypted.  Decrypt the page so
-	 * that KVM's writes and the CPU's reads get along.  Note, this is
-	 * only necessary when using shadow paging, as 64-bit NPT can get at
-	 * the C-bit even when shadowing 32-bit NPT, and SME isn't supported
-	 * by 32-bit kernels (when KVM itself uses 32-bit NPT).
-	 */
-	if (!tdp_enabled)
-		set_memory_decrypted((unsigned long)mmu->pae_root, 1);
-	else
-		WARN_ON_ONCE(shadow_me_value);
-
-	for (i = 0; i < 4; ++i)
-		mmu->pae_root[i] = INVALID_PAE_ROOT;
-
-	return 0;
 }
 
 int kvm_mmu_create(struct kvm_vcpu *vcpu)
 {
-	int ret;
-
 	vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
 	vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
 
@@ -5563,18 +5556,10 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
 	vcpu->arch.mmu = &vcpu->arch.root_mmu;
 	vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
 
-	ret = __kvm_mmu_create(vcpu, &vcpu->arch.guest_mmu);
-	if (ret)
-		return ret;
-
-	ret = __kvm_mmu_create(vcpu, &vcpu->arch.root_mmu);
-	if (ret)
-		goto fail_allocate_root;
+	__kvm_mmu_create(vcpu, &vcpu->arch.guest_mmu);
+	__kvm_mmu_create(vcpu, &vcpu->arch.root_mmu);
 
-	return ret;
- fail_allocate_root:
-	free_mmu_pages(&vcpu->arch.guest_mmu);
-	return ret;
+	return 0;
 }
 
 #define BATCH_ZAP_PAGES	10
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 1bff453f7cbe..d5673a42680f 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -20,16 +20,6 @@ extern bool dbg;
 #define MMU_WARN_ON(x) do { } while (0)
 #endif
 
-/*
- * Unlike regular MMU roots, PAE "roots", a.k.a. PDPTEs/PDPTRs, have a PRESENT
- * bit, and thus are guaranteed to be non-zero when valid.  And, when a guest
- * PDPTR is !PRESENT, its corresponding PAE root cannot be set to INVALID_PAGE,
- * as the CPU would treat that as PRESENT PDPTR with reserved bits set.  Use
- * '0' instead of INVALID_PAGE to indicate an invalid PAE root.
- */
-#define INVALID_PAE_ROOT	0
-#define IS_VALID_PAE_ROOT(x)	(!!(x))
-
 typedef u64 __rcu *tdp_ptep_t;
 
 struct kvm_mmu_page {
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH V2 7/7] KVM: X86/MMU: Remove mmu_alloc_special_roots()
  2022-05-03 15:07 [PATCH V2 0/7] KVM: X86/MMU: Use one-off special shadow page for special roots Lai Jiangshan
                   ` (5 preceding siblings ...)
  2022-05-03 15:07 ` [PATCH V2 6/7] KVM: X86/MMU: Allocate mmu->pae_root for PAE paging on-demand Lai Jiangshan
@ 2022-05-03 15:07 ` Lai Jiangshan
  2022-05-13  8:22 ` [PATCH V2 0/7] KVM: X86/MMU: Use one-off special shadow page for special roots Lai Jiangshan
  7 siblings, 0 replies; 23+ messages in thread
From: Lai Jiangshan @ 2022-05-03 15:07 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini, Sean Christopherson
  Cc: Lai Jiangshan, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin

From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

mmu_alloc_special_roots() allocates mmu->pae_root for non-PAE paging
(e.g. for shadowing 32-bit NPT on a 64-bit host), as well as
mmu->pml4_root and mmu->pml5_root.

But mmu->pml4_root and mmu->pml5_root are no longer used, and neither is
mmu->pae_root for non-PAE paging.

So remove mmu_alloc_special_roots(), mmu->pml4_root and mmu->pml5_root.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
 arch/x86/include/asm/kvm_host.h |  3 --
 arch/x86/kvm/mmu/mmu.c          | 77 ---------------------------------
 2 files changed, 80 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c59fea4bdb6e..e647587d1d1f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -458,9 +458,6 @@ struct kvm_mmu {
 	u8 permissions[16];
 
 	u64 *pae_root;
-	u64 *pml4_root;
-	u64 *pml5_root;
-
 	/*
 	 * check zero bits on shadow page table entries, these
 	 * bits include not only hardware reserved bits but also
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c97f830c5f8c..9c04acf9728a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3611,78 +3611,6 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	return r;
 }
 
-static int mmu_alloc_special_roots(struct kvm_vcpu *vcpu)
-{
-	struct kvm_mmu *mmu = vcpu->arch.mmu;
-	bool need_pml5 = mmu->root_role.level > PT64_ROOT_4LEVEL;
-	u64 *pml5_root = NULL;
-	u64 *pml4_root = NULL;
-	u64 *pae_root;
-
-	/*
-	 * When shadowing 32-bit or PAE NPT with 64-bit NPT, the PML4 and PDP
-	 * tables are allocated and initialized at root creation as there is no
-	 * equivalent level in the guest's NPT to shadow.  Allocate the tables
-	 * on demand, as running a 32-bit L1 VMM on 64-bit KVM is very rare.
-	 */
-	if (mmu->root_role.direct ||
-	    mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL ||
-	    mmu->root_role.level < PT64_ROOT_4LEVEL)
-		return 0;
-
-	/*
-	 * NPT, the only paging mode that uses this horror, uses a fixed number
-	 * of levels for the shadow page tables, e.g. all MMUs are 4-level or
-	 * all MMus are 5-level.  Thus, this can safely require that pml5_root
-	 * is allocated if the other roots are valid and pml5 is needed, as any
-	 * prior MMU would also have required pml5.
-	 */
-	if (mmu->pae_root && mmu->pml4_root && (!need_pml5 || mmu->pml5_root))
-		return 0;
-
-	/*
-	 * The special roots should always be allocated in concert.  Yell and
-	 * bail if KVM ends up in a state where only one of the roots is valid.
-	 */
-	if (WARN_ON_ONCE(!tdp_enabled || mmu->pae_root || mmu->pml4_root ||
-			 (need_pml5 && mmu->pml5_root)))
-		return -EIO;
-
-	/*
-	 * Unlike 32-bit NPT, the PDP table doesn't need to be in low mem, and
-	 * doesn't need to be decrypted.
-	 */
-	pae_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
-	if (!pae_root)
-		return -ENOMEM;
-
-#ifdef CONFIG_X86_64
-	pml4_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
-	if (!pml4_root)
-		goto err_pml4;
-
-	if (need_pml5) {
-		pml5_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
-		if (!pml5_root)
-			goto err_pml5;
-	}
-#endif
-
-	mmu->pae_root = pae_root;
-	mmu->pml4_root = pml4_root;
-	mmu->pml5_root = pml5_root;
-
-	return 0;
-
-#ifdef CONFIG_X86_64
-err_pml5:
-	free_page((unsigned long)pml4_root);
-err_pml4:
-	free_page((unsigned long)pae_root);
-	return -ENOMEM;
-#endif
-}
-
 static bool is_unsync_root(hpa_t root)
 {
 	struct kvm_mmu_page *sp;
@@ -5069,9 +4997,6 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
 	r = mmu_alloc_pae_root(vcpu);
 	if (r)
 		return r;
-	r = mmu_alloc_special_roots(vcpu);
-	if (r)
-		goto out;
 	if (vcpu->arch.mmu->root_role.direct)
 		r = mmu_alloc_direct_roots(vcpu);
 	else
@@ -5529,8 +5454,6 @@ static void free_mmu_pages(struct kvm_mmu *mmu)
 	if (!tdp_enabled && mmu->pae_root)
 		set_memory_encrypted((unsigned long)mmu->pae_root, 1);
 	free_page((unsigned long)mmu->pae_root);
-	free_page((unsigned long)mmu->pml4_root);
-	free_page((unsigned long)mmu->pml5_root);
 }
 
 static void __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH V2 0/7] KVM: X86/MMU: Use one-off special shadow page for special roots
  2022-05-03 15:07 [PATCH V2 0/7] KVM: X86/MMU: Use one-off special shadow page for special roots Lai Jiangshan
                   ` (6 preceding siblings ...)
  2022-05-03 15:07 ` [PATCH V2 7/7] KVM: X86/MMU: Remove mmu_alloc_special_roots() Lai Jiangshan
@ 2022-05-13  8:22 ` Lai Jiangshan
  2022-05-13 18:31   ` David Matlack
  7 siblings, 1 reply; 23+ messages in thread
From: Lai Jiangshan @ 2022-05-13  8:22 UTC (permalink / raw)
  To: LKML, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Paolo Bonzini, Sean Christopherson
  Cc: Lai Jiangshan

On Tue, May 3, 2022 at 11:06 PM Lai Jiangshan <jiangshanlai@gmail.com> wrote:
>
> From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
>
> Current code uses mmu->pae_root, mmu->pml4_root, and mmu->pml5_root to
> setup special roots.  The initialization code is complex and the roots
> are not associated with struct kvm_mmu_page which causes the code more
> complex.
>
> So add new special shadow pages to simplify it.
>

Ping

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V2 0/7] KVM: X86/MMU: Use one-off special shadow page for special roots
  2022-05-13  8:22 ` [PATCH V2 0/7] KVM: X86/MMU: Use one-off special shadow page for special roots Lai Jiangshan
@ 2022-05-13 18:31   ` David Matlack
  0 siblings, 0 replies; 23+ messages in thread
From: David Matlack @ 2022-05-13 18:31 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: LKML, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Paolo Bonzini, Sean Christopherson, Lai Jiangshan

On Fri, May 13, 2022 at 1:22 AM Lai Jiangshan <jiangshanlai@gmail.com> wrote:
>
> On Tue, May 3, 2022 at 11:06 PM Lai Jiangshan <jiangshanlai@gmail.com> wrote:
> >
> > From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> >
> > Current code uses mmu->pae_root, mmu->pml4_root, and mmu->pml5_root to
> > setup special roots.  The initialization code is complex and the roots
> > are not associated with struct kvm_mmu_page which causes the code more
> > complex.
> >
> > So add new special shadow pages to simplify it.

FYI I plan to review this after I get a v5 Eager Page Splitting out
(today hopefully). But from looking at it so far, it looks like a
great cleanup!

> >
>
> Ping

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V2 1/7] KVM: X86/MMU: Add using_special_root_page()
  2022-05-03 15:07 ` [PATCH V2 1/7] KVM: X86/MMU: Add using_special_root_page() Lai Jiangshan
@ 2022-05-13 22:53   ` David Matlack
  2022-05-26  9:20     ` Lai Jiangshan
  0 siblings, 1 reply; 23+ messages in thread
From: David Matlack @ 2022-05-13 22:53 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, kvm, Paolo Bonzini, Sean Christopherson,
	Lai Jiangshan, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin

On Tue, May 03, 2022 at 11:07:29PM +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> 
> In some case, special roots are used in mmu.  It is often using
> to_shadow_page(mmu->root.hpa) to check if special roots are used.
> 
> Add using_special_root_page() to directly check if special roots are
> used or needed to be used even mmu->root.hpa is not set.
> 
> Prepare for making to_shadow_page(mmu->root.hpa) return non-NULL via
> using special shadow pages.
> 
> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 909372762363..7f20796af351 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1711,6 +1711,14 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
>  	mmu_spte_clear_no_track(parent_pte);
>  }
>  
> +static bool using_special_root_page(struct kvm_mmu *mmu)

Could you enumerate all the scenarios that use special roots and which
roots are the special ones? I think that would help a lot with reviewing
this series and would be useful to encode in a comment, probably above
this function here, for future readers.

Also the term "special" is really vague. Maybe once you enumerate all
the scenarios a common theme will arise and we can pick a better name,
unless you have any ideas off the top of your head.

> +{
> +	if (mmu->root_role.direct)
> +		return mmu->root_role.level == PT32E_ROOT_LEVEL;
> +	else
> +		return mmu->cpu_role.base.level <= PT32E_ROOT_LEVEL;
> +}
> +
>  static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct)
>  {
>  	struct kvm_mmu_page *sp;
> @@ -4241,10 +4249,10 @@ static bool fast_pgd_switch(struct kvm *kvm, struct kvm_mmu *mmu,
>  {
>  	/*
>  	 * For now, limit the caching to 64-bit hosts+VMs in order to avoid
> -	 * having to deal with PDPTEs. We may add support for 32-bit hosts/VMs
> -	 * later if necessary.
> +	 * having to deal with PDPTEs.  Special roots can not be put into
> +	 * mmu->prev_roots[].
>  	 */
> -	if (VALID_PAGE(mmu->root.hpa) && !to_shadow_page(mmu->root.hpa))
> +	if (VALID_PAGE(mmu->root.hpa) && using_special_root_page(mmu))
>  		kvm_mmu_free_roots(kvm, mmu, KVM_MMU_ROOT_CURRENT);
>  
>  	if (VALID_PAGE(mmu->root.hpa))
> -- 
> 2.19.1.6.gb485710b
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V2 2/7] KVM: X86/MMU: Add special shadow pages
  2022-05-03 15:07 ` [PATCH V2 2/7] KVM: X86/MMU: Add special shadow pages Lai Jiangshan
@ 2022-05-13 23:43   ` David Matlack
  2022-05-26  9:38     ` Lai Jiangshan
  0 siblings, 1 reply; 23+ messages in thread
From: David Matlack @ 2022-05-13 23:43 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, kvm, Paolo Bonzini, Sean Christopherson,
	Lai Jiangshan, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin

On Tue, May 03, 2022 at 11:07:30PM +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> 
> Special pages are pages to hold PDPTEs for 32bit guest or higher
> level pages linked to special page when shadowing NPT.
> 
> Current code use mmu->pae_root, mmu->pml4_root, and mmu->pml5_root to
> setup special root.  The initialization code is complex and the roots
> are not associated with struct kvm_mmu_page which causes the code more
> complex.
> 
> Add kvm_mmu_alloc_special_page() and mmu_free_special_root_page() to
> allocate and free special shadow pages and prepare for using special
> shadow pages to replace current logic and share the most logic with
> normal shadow pages.
> 
> The code is not activated since using_special_root_page() is false in
> the place where it is inserted.
> 
> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 91 +++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 90 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 7f20796af351..126f0cd07f98 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1719,6 +1719,58 @@ static bool using_special_root_page(struct kvm_mmu *mmu)
>  		return mmu->cpu_role.base.level <= PT32E_ROOT_LEVEL;
>  }
>  
> +/*
> + * Special pages are pages to hold PAE PDPTEs for 32bit guest or higher level
> + * pages linked to special page when shadowing NPT.
> + *
> + * Special pages are specially allocated.  If sp->spt needs to be 32bit, it

I'm not sure what you mean by "If sp->spt needs to be 32bit". Do you mean
"If sp shadows a 32-bit PAE page table"?

> + * will use the preallocated mmu->pae_root.
> + *
> + * Special pages are only visible to local VCPU except through rmap from their
> + * children, so they are not in the kvm->arch.active_mmu_pages nor in the hash.
> + *
> + * And they are either accounted nor write-protected since they don't has gfn
> + * associated.

Instead of "has gfn associated", how about "shadow a guest page table"?

> + *
> + * Because of above, special pages can not be freed nor zapped like normal
> + * shadow pages.  They are freed directly when the special root is freed, see
> + * mmu_free_special_root_page().
> + *
> + * Special root page can not be put on mmu->prev_roots because the comparison
> + * must use PDPTEs instead of CR3 and mmu->pae_root can not be shared for multi
> + * root pages.
> + *
> + * Except above limitations, all the other abilities are the same as other
> + * shadow page, like link, parent rmap, sync, unsync etc.
> + *
> + * Special pages can be obsoleted but might be possibly reused later.  When
> + * the obsoleting process is done, all the obsoleted shadow pages are unlinked
> + * from the special pages by the help of the parent rmap of the children and
> + * the special pages become theoretically valid again.  If there is no other
> + * event to cause a VCPU to free the root and the VCPU is being preempted by
> + * the host during two obsoleting processes, the VCPU can reuse its special
> + * pages when it is back.

Sorry I am having a lot of trouble parsing this paragraph.

> + */

This comment (and more broadly, this series) mixes "special page",
"special root", "special root page", and "special shadow page". Can you
be more consistent with the terminology?

> +static struct kvm_mmu_page *kvm_mmu_alloc_special_page(struct kvm_vcpu *vcpu,
> +		union kvm_mmu_page_role role)
> +{
> +	struct kvm_mmu_page *sp;
> +
> +	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> +	sp->gfn = 0;
> +	sp->role = role;
> +	if (role.level == PT32E_ROOT_LEVEL &&
> +	    vcpu->arch.mmu->root_role.level == PT32E_ROOT_LEVEL)
> +		sp->spt = vcpu->arch.mmu->pae_root;

Why use pae_root here instead of allocating from the cache?

> +	else
> +		sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> +	/* sp->gfns is not used for special shadow page */
> +	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> +	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
> +
> +	return sp;
> +}
> +
>  static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct)
>  {
>  	struct kvm_mmu_page *sp;
> @@ -2076,6 +2128,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  	if (level <= vcpu->arch.mmu->cpu_role.base.level)
>  		role.passthrough = 0;
>  
> +	if (unlikely(level >= PT32E_ROOT_LEVEL && using_special_root_page(vcpu->arch.mmu)))
> +		return kvm_mmu_alloc_special_page(vcpu, role);
> +
>  	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
>  	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
>  		if (sp->gfn != gfn) {
> @@ -3290,6 +3345,37 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
>  	*root_hpa = INVALID_PAGE;
>  }
>  
> +static void mmu_free_special_root_page(struct kvm *kvm, struct kvm_mmu *mmu)
> +{
> +	u64 spte = mmu->root.hpa;
> +	struct kvm_mmu_page *sp = to_shadow_page(spte & PT64_BASE_ADDR_MASK);
> +	int i;
> +
> +	/* Free level 5 or 4 roots for shadow NPT for 32 bit L1 */
> +	while (sp->role.level > PT32E_ROOT_LEVEL)
> +	{
> +		spte = sp->spt[0];
> +		mmu_page_zap_pte(kvm, sp, sp->spt + 0, NULL);

Instead of using mmu_page_zap_pte(..., NULL) what about creating a new
helper that just does drop_parent_pte(), since that's all you really
want?

> +		free_page((unsigned long)sp->spt);
> +		kmem_cache_free(mmu_page_header_cache, sp);
> +		if (!is_shadow_present_pte(spte))
> +			return;
> +		sp = to_shadow_page(spte & PT64_BASE_ADDR_MASK);
> +	}
> +
> +	if (WARN_ON_ONCE(sp->role.level != PT32E_ROOT_LEVEL))
> +		return;
> +
> +	/* Free PAE roots */

nit: This loop does not do any freeing, it just disconnects the PAE root
table from the 4 PAE page directories. So how about:

/* Disconnect PAE root from the 4 PAE page directories */

> +	for (i = 0; i < 4; i++)
> +		mmu_page_zap_pte(kvm, sp, sp->spt + i, NULL);
> +
> +	if (sp->spt != mmu->pae_root)
> +		free_page((unsigned long)sp->spt);
> +
> +	kmem_cache_free(mmu_page_header_cache, sp);
> +}
> +
>  /* roots_to_free must be some combination of the KVM_MMU_ROOT_* flags */
>  void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu,
>  			ulong roots_to_free)
> @@ -3323,7 +3409,10 @@ void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu,
>  
>  	if (free_active_root) {
>  		if (to_shadow_page(mmu->root.hpa)) {
> -			mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
> +			if (using_special_root_page(mmu))
> +				mmu_free_special_root_page(kvm, mmu);
> +			else
> +				mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
>  		} else if (mmu->pae_root) {
>  			for (i = 0; i < 4; ++i) {
>  				if (!IS_VALID_PAE_ROOT(mmu->pae_root[i]))
> -- 
> 2.19.1.6.gb485710b
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V2 3/7] KVM: X86/MMU: Link PAE root pagetable with its children
  2022-05-03 15:07 ` [PATCH V2 3/7] KVM: X86/MMU: Link PAE root pagetable with its children Lai Jiangshan
@ 2022-05-17  0:01   ` David Matlack
  2022-05-17  1:13     ` Lai Jiangshan
  0 siblings, 1 reply; 23+ messages in thread
From: David Matlack @ 2022-05-17  0:01 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, kvm, Paolo Bonzini, Sean Christopherson,
	Lai Jiangshan, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin

On Tue, May 03, 2022 at 11:07:31PM +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> 
> When special shadow pages are activated, link_shadow_page() might link
> a special shadow pages which is the PAE root for PAE paging with its
> children.
> 
> Add make_pae_pdpte() to handle it.
> 
> The code is not activated since special shadow pages are not activated
> yet.
> 
> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> ---
>  arch/x86/kvm/mmu/mmu.c  | 6 +++++-
>  arch/x86/kvm/mmu/spte.c | 7 +++++++
>  arch/x86/kvm/mmu/spte.h | 1 +
>  3 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 126f0cd07f98..3fe70ad3bda2 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2277,7 +2277,11 @@ static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
>  
>  	BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK);
>  
> -	spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
> +	if (unlikely(sp->role.level == PT32_ROOT_LEVEL &&
> +		     vcpu->arch.mmu->root_role.level == PT32E_ROOT_LEVEL))
> +		spte = make_pae_pdpte(sp->spt);
> +	else
> +		spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
>  
>  	mmu_spte_set(sptep, spte);
>  
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 75c9e87d446a..ccd9267a58ca 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -251,6 +251,13 @@ u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index)
>  	return child_spte;
>  }
>  
> +u64 make_pae_pdpte(u64 *child_pt)
> +{
> +	/* The only ignore bits in PDPTE are 11:9. */
> +	BUILD_BUG_ON(!(GENMASK(11,9) & SPTE_MMU_PRESENT_MASK));
> +	return __pa(child_pt) | PT_PRESENT_MASK | SPTE_MMU_PRESENT_MASK |
> +		shadow_me_value;

If I'm reading mmu_alloc_{direct,shadow}_roots() correctly, PAE page
directories just get: root | PT_PRESENT_MASK | shadow_me_value. Is there
a reason to add SPTE_MMU_PRESENT_MASK or am I misreading the code?

> +}
>  
>  u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
>  {
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index fbbab180395e..09a7e4ba017a 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -413,6 +413,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  	       u64 old_spte, bool prefetch, bool can_unsync,
>  	       bool host_writable, u64 *new_spte);
>  u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index);
> +u64 make_pae_pdpte(u64 *child_pt);
>  u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
>  u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
>  u64 mark_spte_for_access_track(u64 spte);
> -- 
> 2.19.1.6.gb485710b
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V2 4/7] KVM: X86/MMU: Activate special shadow pages and remove old logic
  2022-05-03 15:07 ` [PATCH V2 4/7] KVM: X86/MMU: Activate special shadow pages and remove old logic Lai Jiangshan
@ 2022-05-17  0:16   ` David Matlack
  2022-05-26  9:15     ` Lai Jiangshan
  0 siblings, 1 reply; 23+ messages in thread
From: David Matlack @ 2022-05-17  0:16 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, kvm, Paolo Bonzini, Sean Christopherson,
	Lai Jiangshan, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin

On Tue, May 03, 2022 at 11:07:32PM +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> 
> Activate special shadow pages by allocate special shadow pages in
> mmu_alloc_direct_roots() and mmu_alloc_shadow_roots().
> 
> Make shadow walkings walk from the topmost shadow page even it is
> special shadow page so that they can be walked like normal root
> and shadowed PDPTEs can be made and installed on-demand.
> 
> Walking from the topmost causes FNAME(fetch) needs to visit high level
> special shadow pages and allocate special shadow pages when shadowing
> NPT for 32bit L1 in 64bit host, so change FNAME(fetch) and
> FNAME(walk_addr_generic) to handle it for affected code.
> 
> Do sync from the topmost in kvm_mmu_sync_roots() and simplifies
> the code.
> 
> Now all the root pages and pagetable pointed by a present spte in
> struct kvm_mmu are associated by struct kvm_mmu_page, and
> to_shadow_page() is guaranteed to be not NULL.
> 
> Affect cases are those that using_special_root_page() return true.
> 
> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> ---
>  arch/x86/kvm/mmu/mmu.c         | 168 +++------------------------------
>  arch/x86/kvm/mmu/paging_tmpl.h |  14 ++-
>  2 files changed, 24 insertions(+), 158 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3fe70ad3bda2..6f626d7e8ebb 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2214,26 +2214,6 @@ static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterato
>  	iterator->addr = addr;
>  	iterator->shadow_addr = root;
>  	iterator->level = vcpu->arch.mmu->root_role.level;
> -
> -	if (iterator->level >= PT64_ROOT_4LEVEL &&
> -	    vcpu->arch.mmu->cpu_role.base.level < PT64_ROOT_4LEVEL &&
> -	    !vcpu->arch.mmu->root_role.direct)
> -		iterator->level = PT32E_ROOT_LEVEL;
> -
> -	if (iterator->level == PT32E_ROOT_LEVEL) {
> -		/*
> -		 * prev_root is currently only used for 64-bit hosts. So only
> -		 * the active root_hpa is valid here.
> -		 */
> -		BUG_ON(root != vcpu->arch.mmu->root.hpa);
> -
> -		iterator->shadow_addr
> -			= vcpu->arch.mmu->pae_root[(addr >> 30) & 3];
> -		iterator->shadow_addr &= PT64_BASE_ADDR_MASK;
> -		--iterator->level;
> -		if (!iterator->shadow_addr)
> -			iterator->level = 0;
> -	}
>  }
>  
>  static void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
> @@ -3412,21 +3392,10 @@ void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu,
>  					   &invalid_list);
>  
>  	if (free_active_root) {
> -		if (to_shadow_page(mmu->root.hpa)) {
> -			if (using_special_root_page(mmu))
> -				mmu_free_special_root_page(kvm, mmu);
> -			else
> -				mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
> -		} else if (mmu->pae_root) {
> -			for (i = 0; i < 4; ++i) {
> -				if (!IS_VALID_PAE_ROOT(mmu->pae_root[i]))
> -					continue;
> -
> -				mmu_free_root_page(kvm, &mmu->pae_root[i],
> -						   &invalid_list);
> -				mmu->pae_root[i] = INVALID_PAE_ROOT;
> -			}
> -		}
> +		if (using_special_root_page(mmu))
> +			mmu_free_special_root_page(kvm, mmu);
> +		else
> +			mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
>  		mmu->root.hpa = INVALID_PAGE;
>  		mmu->root.pgd = 0;
>  	}
> @@ -3491,7 +3460,6 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
>  	struct kvm_mmu *mmu = vcpu->arch.mmu;
>  	u8 shadow_root_level = mmu->root_role.level;
>  	hpa_t root;
> -	unsigned i;
>  	int r;
>  
>  	write_lock(&vcpu->kvm->mmu_lock);
> @@ -3502,24 +3470,9 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
>  	if (is_tdp_mmu_enabled(vcpu->kvm)) {
>  		root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
>  		mmu->root.hpa = root;
> -	} else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
> +	} else if (shadow_root_level >= PT32E_ROOT_LEVEL) {
>  		root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level, true);
>  		mmu->root.hpa = root;
> -	} else if (shadow_root_level == PT32E_ROOT_LEVEL) {
> -		if (WARN_ON_ONCE(!mmu->pae_root)) {
> -			r = -EIO;
> -			goto out_unlock;
> -		}
> -
> -		for (i = 0; i < 4; ++i) {
> -			WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
> -
> -			root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT),
> -					      i << 30, PT32_ROOT_LEVEL, true);
> -			mmu->pae_root[i] = root | PT_PRESENT_MASK |
> -					   shadow_me_mask;
> -		}
> -		mmu->root.hpa = __pa(mmu->pae_root);
>  	} else {
>  		WARN_ONCE(1, "Bad TDP root level = %d\n", shadow_root_level);
>  		r = -EIO;
> @@ -3597,10 +3550,8 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm)
>  static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>  {
>  	struct kvm_mmu *mmu = vcpu->arch.mmu;
> -	u64 pdptrs[4], pm_mask;
>  	gfn_t root_gfn, root_pgd;
>  	hpa_t root;
> -	unsigned i;
>  	int r;
>  
>  	root_pgd = mmu->get_guest_pgd(vcpu);
> @@ -3609,21 +3560,6 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>  	if (mmu_check_root(vcpu, root_gfn))
>  		return 1;
>  
> -	/*
> -	 * On SVM, reading PDPTRs might access guest memory, which might fault
> -	 * and thus might sleep.  Grab the PDPTRs before acquiring mmu_lock.
> -	 */
> -	if (mmu->cpu_role.base.level == PT32E_ROOT_LEVEL) {
> -		for (i = 0; i < 4; ++i) {
> -			pdptrs[i] = mmu->get_pdptr(vcpu, i);
> -			if (!(pdptrs[i] & PT_PRESENT_MASK))
> -				continue;
> -
> -			if (mmu_check_root(vcpu, pdptrs[i] >> PAGE_SHIFT))
> -				return 1;
> -		}
> -	}
> -
>  	r = mmu_first_shadow_root_alloc(vcpu->kvm);
>  	if (r)
>  		return r;
> @@ -3633,70 +3569,9 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>  	if (r < 0)
>  		goto out_unlock;
>  
> -	/*
> -	 * Do we shadow a long mode page table? If so we need to
> -	 * write-protect the guests page table root.
> -	 */
> -	if (mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) {
> -		root = mmu_alloc_root(vcpu, root_gfn, 0,
> -				      mmu->root_role.level, false);
> -		mmu->root.hpa = root;
> -		goto set_root_pgd;
> -	}
> -
> -	if (WARN_ON_ONCE(!mmu->pae_root)) {
> -		r = -EIO;
> -		goto out_unlock;
> -	}
> -
> -	/*
> -	 * We shadow a 32 bit page table. This may be a legacy 2-level
> -	 * or a PAE 3-level page table. In either case we need to be aware that
> -	 * the shadow page table may be a PAE or a long mode page table.
> -	 */
> -	pm_mask = PT_PRESENT_MASK | shadow_me_value;
> -	if (mmu->root_role.level >= PT64_ROOT_4LEVEL) {
> -		pm_mask |= PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
> -
> -		if (WARN_ON_ONCE(!mmu->pml4_root)) {
> -			r = -EIO;
> -			goto out_unlock;
> -		}
> -		mmu->pml4_root[0] = __pa(mmu->pae_root) | pm_mask;
> -
> -		if (mmu->root_role.level == PT64_ROOT_5LEVEL) {
> -			if (WARN_ON_ONCE(!mmu->pml5_root)) {
> -				r = -EIO;
> -				goto out_unlock;
> -			}
> -			mmu->pml5_root[0] = __pa(mmu->pml4_root) | pm_mask;
> -		}
> -	}
> -
> -	for (i = 0; i < 4; ++i) {
> -		WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
> -
> -		if (mmu->cpu_role.base.level == PT32E_ROOT_LEVEL) {
> -			if (!(pdptrs[i] & PT_PRESENT_MASK)) {
> -				mmu->pae_root[i] = INVALID_PAE_ROOT;
> -				continue;
> -			}
> -			root_gfn = pdptrs[i] >> PAGE_SHIFT;
> -		}
> -
> -		root = mmu_alloc_root(vcpu, root_gfn, i << 30,
> -				      PT32_ROOT_LEVEL, false);
> -		mmu->pae_root[i] = root | pm_mask;
> -	}
> -
> -	if (mmu->root_role.level == PT64_ROOT_5LEVEL)
> -		mmu->root.hpa = __pa(mmu->pml5_root);
> -	else if (mmu->root_role.level == PT64_ROOT_4LEVEL)
> -		mmu->root.hpa = __pa(mmu->pml4_root);
> -	else
> -		mmu->root.hpa = __pa(mmu->pae_root);
> -
> -set_root_pgd:
> +	root = mmu_alloc_root(vcpu, root_gfn, 0,
> +			      mmu->root_role.level, false);
> +	mmu->root.hpa = root;
>  	mmu->root.pgd = root_pgd;
>  out_unlock:
>  	write_unlock(&vcpu->kvm->mmu_lock);
> @@ -3813,8 +3688,7 @@ static bool is_unsync_root(hpa_t root)
>  
>  void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
>  {
> -	int i;
> -	struct kvm_mmu_page *sp;
> +	hpa_t root = vcpu->arch.mmu->root.hpa;
>  
>  	if (vcpu->arch.mmu->root_role.direct)
>  		return;
> @@ -3824,31 +3698,11 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
>  
>  	vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
>  
> -	if (vcpu->arch.mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) {
> -		hpa_t root = vcpu->arch.mmu->root.hpa;
> -		sp = to_shadow_page(root);
> -
> -		if (!is_unsync_root(root))
> -			return;
> -
> -		write_lock(&vcpu->kvm->mmu_lock);
> -		mmu_sync_children(vcpu, sp, true);
> -		write_unlock(&vcpu->kvm->mmu_lock);
> +	if (!is_unsync_root(root))
>  		return;
> -	}
>  
>  	write_lock(&vcpu->kvm->mmu_lock);
> -
> -	for (i = 0; i < 4; ++i) {
> -		hpa_t root = vcpu->arch.mmu->pae_root[i];
> -
> -		if (IS_VALID_PAE_ROOT(root)) {
> -			root &= PT64_BASE_ADDR_MASK;
> -			sp = to_shadow_page(root);
> -			mmu_sync_children(vcpu, sp, true);
> -		}
> -	}
> -
> +	mmu_sync_children(vcpu, to_shadow_page(root), true);
>  	write_unlock(&vcpu->kvm->mmu_lock);
>  }
>  
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index b025decf610d..19ef31a078fa 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -316,6 +316,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>  	u16 errcode = 0;
>  	gpa_t real_gpa;
>  	gfn_t gfn;
> +	int i;
>  
>  	trace_kvm_mmu_pagetable_walk(addr, access);
>  retry_walk:
> @@ -323,6 +324,16 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>  	pte           = mmu->get_guest_pgd(vcpu);
>  	have_ad       = PT_HAVE_ACCESSED_DIRTY(mmu);
>  
> +	/*
> +	 * FNAME(fetch) might pass these values to allocate special shadow
> +	 * page.  Although the gfn is not used at the end, it is better not
> +	 * to pass an uninitialized value to kvm_mmu_get_page().
> +	 */

This comment does not explain why FNAME(fetch) might pass these values
to allocate special shadow pages. How about adjusting it to something like
this:

  /*
   * Initialize the guest walker with default values.  These values will
   * be used in cases where KVM shadows a guest page table structure
   * with more levels than what the guest uses. For example, KVM shadows
   * 2-level non-PAE guests with 3-level PAE paging.
   *
   * Note, the gfn is technically ignored for these special shadow
   * pages, but it's more consistent to always pass 0 to
   * kvm_mmu_get_page().
   */

> +	for (i = 2; i < PT_MAX_FULL_LEVELS; i++) {

s/2/PT32_ROOT_LEVEL/

> +		walker->table_gfn[i] = 0;
> +		walker->pt_access[i] = ACC_ALL;
> +	}
> +
>  #if PTTYPE == 64
>  	walk_nx_mask = 1ULL << PT64_NX_SHIFT;
>  	if (walker->level == PT32E_ROOT_LEVEL) {
> @@ -675,7 +686,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  		 * Verify that the gpte in the page we've just write
>  		 * protected is still there.
>  		 */
> -		if (FNAME(gpte_changed)(vcpu, gw, it.level - 1))
> +		if (it.level - 1 < top_level &&
> +		    FNAME(gpte_changed)(vcpu, gw, it.level - 1))
>  			goto out_gpte_changed;
>  
>  		if (sp)
> -- 
> 2.19.1.6.gb485710b
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V2 3/7] KVM: X86/MMU: Link PAE root pagetable with its children
  2022-05-17  0:01   ` David Matlack
@ 2022-05-17  1:13     ` Lai Jiangshan
  2022-05-17 16:41       ` David Matlack
  0 siblings, 1 reply; 23+ messages in thread
From: Lai Jiangshan @ 2022-05-17  1:13 UTC (permalink / raw)
  To: David Matlack
  Cc: LKML, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Paolo Bonzini, Sean Christopherson, Lai Jiangshan,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	X86 ML, H. Peter Anvin

On Tue, May 17, 2022 at 8:01 AM David Matlack <dmatlack@google.com> wrote:
>
> On Tue, May 03, 2022 at 11:07:31PM +0800, Lai Jiangshan wrote:
> > From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> >
> > When special shadow pages are activated, link_shadow_page() might link
> > a special shadow pages which is the PAE root for PAE paging with its
> > children.
> >
> > Add make_pae_pdpte() to handle it.
> >
> > The code is not activated since special shadow pages are not activated
> > yet.
> >
> > Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c  | 6 +++++-
> >  arch/x86/kvm/mmu/spte.c | 7 +++++++
> >  arch/x86/kvm/mmu/spte.h | 1 +
> >  3 files changed, 13 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 126f0cd07f98..3fe70ad3bda2 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -2277,7 +2277,11 @@ static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
> >
> >       BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK);
> >
> > -     spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
> > +     if (unlikely(sp->role.level == PT32_ROOT_LEVEL &&
> > +                  vcpu->arch.mmu->root_role.level == PT32E_ROOT_LEVEL))
> > +             spte = make_pae_pdpte(sp->spt);
> > +     else
> > +             spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
> >
> >       mmu_spte_set(sptep, spte);
> >
> > diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> > index 75c9e87d446a..ccd9267a58ca 100644
> > --- a/arch/x86/kvm/mmu/spte.c
> > +++ b/arch/x86/kvm/mmu/spte.c
> > @@ -251,6 +251,13 @@ u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index)
> >       return child_spte;
> >  }
> >
> > +u64 make_pae_pdpte(u64 *child_pt)
> > +{
> > +     /* The only ignore bits in PDPTE are 11:9. */
> > +     BUILD_BUG_ON(!(GENMASK(11,9) & SPTE_MMU_PRESENT_MASK));
> > +     return __pa(child_pt) | PT_PRESENT_MASK | SPTE_MMU_PRESENT_MASK |
> > +             shadow_me_value;
>
> If I'm reading mmu_alloc_{direct,shadow}_roots() correctly, PAE page
> directories just get: root | PT_PRESENT_MASK | shadow_me_value. Is there
> a reason to add SPTE_MMU_PRESENT_MASK or am I misreading the code?

Because it has a struct kvm_mmu_page associated with it now.

sp->spt[i] requires SPTE_MMU_PRESENT_MASK if it is present.
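
For reference, this is the check that now runs against these PDPTEs
(assuming the current definition in spte.h is unchanged):

static inline bool is_shadow_present_pte(u64 pte)
{
	return !!(pte & SPTE_MMU_PRESENT_MASK);
}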

>
> > +}
> >
> >  u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
> >  {
> > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> > index fbbab180395e..09a7e4ba017a 100644
> > --- a/arch/x86/kvm/mmu/spte.h
> > +++ b/arch/x86/kvm/mmu/spte.h
> > @@ -413,6 +413,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> >              u64 old_spte, bool prefetch, bool can_unsync,
> >              bool host_writable, u64 *new_spte);
> >  u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index);
> > +u64 make_pae_pdpte(u64 *child_pt);
> >  u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
> >  u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
> >  u64 mark_spte_for_access_track(u64 spte);
> > --
> > 2.19.1.6.gb485710b
> >

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V2 3/7] KVM: X86/MMU: Link PAE root pagetable with its children
  2022-05-17  1:13     ` Lai Jiangshan
@ 2022-05-17 16:41       ` David Matlack
  2022-05-26  9:12         ` Lai Jiangshan
  0 siblings, 1 reply; 23+ messages in thread
From: David Matlack @ 2022-05-17 16:41 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: LKML, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Paolo Bonzini, Sean Christopherson, Lai Jiangshan,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	X86 ML, H. Peter Anvin

On Tue, May 17, 2022 at 09:13:54AM +0800, Lai Jiangshan wrote:
> On Tue, May 17, 2022 at 8:01 AM David Matlack <dmatlack@google.com> wrote:
> >
> > On Tue, May 03, 2022 at 11:07:31PM +0800, Lai Jiangshan wrote:
> > > From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> > >
> > > When special shadow pages are activated, link_shadow_page() might link
> > > a special shadow pages which is the PAE root for PAE paging with its
> > > children.
> > >
> > > Add make_pae_pdpte() to handle it.
> > >
> > > The code is not activated since special shadow pages are not activated
> > > yet.
> > >
> > > Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> > > ---
> > >  arch/x86/kvm/mmu/mmu.c  | 6 +++++-
> > >  arch/x86/kvm/mmu/spte.c | 7 +++++++
> > >  arch/x86/kvm/mmu/spte.h | 1 +
> > >  3 files changed, 13 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 126f0cd07f98..3fe70ad3bda2 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -2277,7 +2277,11 @@ static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
> > >
> > >       BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK);
> > >
> > > -     spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
> > > +     if (unlikely(sp->role.level == PT32_ROOT_LEVEL &&
> > > +                  vcpu->arch.mmu->root_role.level == PT32E_ROOT_LEVEL))
> > > +             spte = make_pae_pdpte(sp->spt);
> > > +     else
> > > +             spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
> > >
> > >       mmu_spte_set(sptep, spte);
> > >
> > > diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> > > index 75c9e87d446a..ccd9267a58ca 100644
> > > --- a/arch/x86/kvm/mmu/spte.c
> > > +++ b/arch/x86/kvm/mmu/spte.c
> > > @@ -251,6 +251,13 @@ u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index)
> > >       return child_spte;
> > >  }
> > >
> > > +u64 make_pae_pdpte(u64 *child_pt)
> > > +{
> > > +     /* The only ignore bits in PDPTE are 11:9. */
> > > +     BUILD_BUG_ON(!(GENMASK(11,9) & SPTE_MMU_PRESENT_MASK));
> > > +     return __pa(child_pt) | PT_PRESENT_MASK | SPTE_MMU_PRESENT_MASK |
> > > +             shadow_me_value;
> >
> > If I'm reading mmu_alloc_{direct,shadow}_roots() correctly, PAE page
> > directories just get: root | PT_PRESENT_MASK | shadow_me_value. Is there
> > a reason to add SPTE_MMU_PRESENT_MASK or am I misreading the code?
> 
> Because it has a struct kvm_mmu_page associated with it now.
> 
> sp->spt[i] requires SPTE_MMU_PRESENT_MASK if it is present.

Ah of course. e.g. FNAME(fetch) will call is_shadow_present_pte() on PAE
PDPTEs.

Could you also update the comment above SPTE_MMU_PRESENT_MASK? Right now it
says: "Use bit 11, as it is ignored by all flavors of SPTEs and checking a low
bit often generates better code than for a high bit, e.g. 56+." I think it
would be helpful to also mention that SPTE_MMU_PRESENT_MASK is also used in
PDPTEs which only ignore bits 11:9.
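
Maybe something along these lines (wording is only a suggestion):

/*
 * Use bit 11, as it is ignored by all flavors of SPTEs and checking a
 * low bit often generates better code than for a high bit, e.g. 56+.
 * Note, bit 11 is also used in the PAE PDPTEs KVM installs for special
 * roots, and PDPTEs only ignore bits 11:9, so the bit must stay in
 * that range.
 */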


> 
> >
> > > +}
> > >
> > >  u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
> > >  {
> > > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> > > index fbbab180395e..09a7e4ba017a 100644
> > > --- a/arch/x86/kvm/mmu/spte.h
> > > +++ b/arch/x86/kvm/mmu/spte.h
> > > @@ -413,6 +413,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> > >              u64 old_spte, bool prefetch, bool can_unsync,
> > >              bool host_writable, u64 *new_spte);
> > >  u64 make_huge_page_split_spte(u64 huge_spte, int huge_level, int index);
> > > +u64 make_pae_pdpte(u64 *child_pt);
> > >  u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
> > >  u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
> > >  u64 mark_spte_for_access_track(u64 spte);
> > > --
> > > 2.19.1.6.gb485710b
> > >

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V2 5/7] KVM: X86/MMU: Remove the check of the return value of to_shadow_page()
  2022-05-03 15:07 ` [PATCH V2 5/7] KVM: X86/MMU: Remove the check of the return value of to_shadow_page() Lai Jiangshan
@ 2022-05-17 16:47   ` David Matlack
  0 siblings, 0 replies; 23+ messages in thread
From: David Matlack @ 2022-05-17 16:47 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, kvm, Paolo Bonzini, Sean Christopherson,
	Lai Jiangshan, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin

On Tue, May 03, 2022 at 11:07:33PM +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> 
> Remove the check of the return value of to_shadow_page() in
> mmu_free_root_page(), kvm_mmu_free_guest_mode_roots(), is_unsync_root()
> and is_tdp_mmu() because it can not return NULL.
> 
> Remove the check of the return value of to_shadow_page() in
> is_page_fault_stale() and is_obsolete_root() because it can not return
> NULL and the obsoleting for special shadow page is already handled by
> a different way.
> 
> When the obsoleting process is done, all the obsoleted shadow pages are
> already unlinked from the special pages by the help of the parent rmap
> of the children and the special pages become theoretically valid again.
> The special shadow page can be freed if is_obsolete_sp() return true,
> or be reused if is_obsolete_sp() return false.
> 
> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>

Reviewed-by: David Matlack <dmatlack@google.com>

> ---
>  arch/x86/kvm/mmu/mmu.c     | 44 +++-----------------------------------
>  arch/x86/kvm/mmu/tdp_mmu.h |  7 +-----
>  2 files changed, 4 insertions(+), 47 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6f626d7e8ebb..bcb3e2730277 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3318,8 +3318,6 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
>  		return;
>  
>  	sp = to_shadow_page(*root_hpa & PT64_BASE_ADDR_MASK);
> -	if (WARN_ON(!sp))
> -		return;
>  
>  	if (is_tdp_mmu_page(sp))
>  		kvm_tdp_mmu_put_root(kvm, sp, false);
> @@ -3422,8 +3420,7 @@ void kvm_mmu_free_guest_mode_roots(struct kvm *kvm, struct kvm_mmu *mmu)
>  		if (!VALID_PAGE(root_hpa))
>  			continue;
>  
> -		if (!to_shadow_page(root_hpa) ||
> -			to_shadow_page(root_hpa)->role.guest_mode)
> +		if (to_shadow_page(root_hpa)->role.guest_mode)
>  			roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
>  	}
>  
> @@ -3673,13 +3670,6 @@ static bool is_unsync_root(hpa_t root)
>  	smp_rmb();
>  	sp = to_shadow_page(root);
>  
> -	/*
> -	 * PAE roots (somewhat arbitrarily) aren't backed by shadow pages, the
> -	 * PDPTEs for a given PAE root need to be synchronized individually.
> -	 */
> -	if (WARN_ON_ONCE(!sp))
> -		return false;
> -
>  	if (sp->unsync || sp->unsync_children)
>  		return true;
>  
> @@ -3975,21 +3965,7 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
>  				struct kvm_page_fault *fault, int mmu_seq)
>  {
> -	struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root.hpa);
> -
> -	/* Special roots, e.g. pae_root, are not backed by shadow pages. */
> -	if (sp && is_obsolete_sp(vcpu->kvm, sp))
> -		return true;
> -
> -	/*
> -	 * Roots without an associated shadow page are considered invalid if
> -	 * there is a pending request to free obsolete roots.  The request is
> -	 * only a hint that the current root _may_ be obsolete and needs to be
> -	 * reloaded, e.g. if the guest frees a PGD that KVM is tracking as a
> -	 * previous root, then __kvm_mmu_prepare_zap_page() signals all vCPUs
> -	 * to reload even if no vCPU is actively using the root.
> -	 */
> -	if (!sp && kvm_test_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
> +	if (is_obsolete_sp(vcpu->kvm, to_shadow_page(vcpu->arch.mmu->root.hpa)))
>  		return true;
>  
>  	return fault->slot &&
> @@ -5094,24 +5070,10 @@ void kvm_mmu_unload(struct kvm_vcpu *vcpu)
>  
>  static bool is_obsolete_root(struct kvm *kvm, hpa_t root_hpa)
>  {
> -	struct kvm_mmu_page *sp;
> -
>  	if (!VALID_PAGE(root_hpa))
>  		return false;
>  
> -	/*
> -	 * When freeing obsolete roots, treat roots as obsolete if they don't
> -	 * have an associated shadow page.  This does mean KVM will get false
> -	 * positives and free roots that don't strictly need to be freed, but
> -	 * such false positives are relatively rare:
> -	 *
> -	 *  (a) only PAE paging and nested NPT has roots without shadow pages
> -	 *  (b) remote reloads due to a memslot update obsoletes _all_ roots
> -	 *  (c) KVM doesn't track previous roots for PAE paging, and the guest
> -	 *      is unlikely to zap an in-use PGD.
> -	 */
> -	sp = to_shadow_page(root_hpa);
> -	return !sp || is_obsolete_sp(kvm, sp);
> +	return is_obsolete_sp(kvm, to_shadow_page(root_hpa));
>  }
>  
>  static void __kvm_mmu_free_obsolete_roots(struct kvm *kvm, struct kvm_mmu *mmu)
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index c163f7cc23ca..5779a2a7161e 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -78,13 +78,8 @@ static inline bool is_tdp_mmu(struct kvm_mmu *mmu)
>  	if (WARN_ON(!VALID_PAGE(hpa)))
>  		return false;
>  
> -	/*
> -	 * A NULL shadow page is legal when shadowing a non-paging guest with
> -	 * PAE paging, as the MMU will be direct with root_hpa pointing at the
> -	 * pae_root page, not a shadow page.
> -	 */
>  	sp = to_shadow_page(hpa);
> -	return sp && is_tdp_mmu_page(sp) && sp->root_count;
> +	return is_tdp_mmu_page(sp) && sp->root_count;
>  }
>  #else
>  static inline int kvm_mmu_init_tdp_mmu(struct kvm *kvm) { return 0; }
> -- 
> 2.19.1.6.gb485710b
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V2 6/7] KVM: X86/MMU: Allocate mmu->pae_root for PAE paging on-demand
  2022-05-03 15:07 ` [PATCH V2 6/7] KVM: X86/MMU: Allocate mmu->pae_root for PAE paging on-demand Lai Jiangshan
@ 2022-05-17 16:57   ` David Matlack
  2022-05-26  8:52     ` Lai Jiangshan
  0 siblings, 1 reply; 23+ messages in thread
From: David Matlack @ 2022-05-17 16:57 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, kvm, Paolo Bonzini, Sean Christopherson,
	Lai Jiangshan, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin

On Tue, May 03, 2022 at 11:07:34PM +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> 
> mmu->pae_root for non-PAE paging is allocated on-demand, but
> mmu->pae_root for PAE paging is allocated early when struct kvm_mmu is
> being created.
> 
> Simplify the code to allocate mmu->pae_root for PAE paging and make
> it on-demand.
> 
> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> ---
>  arch/x86/kvm/mmu/mmu.c          | 99 ++++++++++++++-------------------
>  arch/x86/kvm/mmu/mmu_internal.h | 10 ----
>  2 files changed, 42 insertions(+), 67 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index bcb3e2730277..c97f830c5f8c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -691,6 +691,41 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
>  	}
>  }
>  
> +static int mmu_alloc_pae_root(struct kvm_vcpu *vcpu)
> +{
> +	struct page *page;
> +
> +	if (vcpu->arch.mmu->root_role.level != PT32E_ROOT_LEVEL)
> +		return 0;
> +	if (vcpu->arch.mmu->pae_root)
> +		return 0;
> +
> +	/*
> +	 * Allocate a page to hold the four PDPTEs for PAE paging when emulating
> +	 * 32-bit mode.  CR3 is only 32 bits even on x86_64 in this case.
> +	 * Therefore we need to allocate the PDP table in the first 4GB of
> +	 * memory, which happens to fit the DMA32 zone.
> +	 */
> +	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_DMA32);
> +	if (!page)
> +		return -ENOMEM;
> +	vcpu->arch.mmu->pae_root = page_address(page);
> +
> +	/*
> +	 * CR3 is only 32 bits when PAE paging is used, thus it's impossible to
> +	 * get the CPU to treat the PDPTEs as encrypted.  Decrypt the page so
> +	 * that KVM's writes and the CPU's reads get along.  Note, this is
> +	 * only necessary when using shadow paging, as 64-bit NPT can get at
> +	 * the C-bit even when shadowing 32-bit NPT, and SME isn't supported
> +	 * by 32-bit kernels (when KVM itself uses 32-bit NPT).
> +	 */
> +	if (!tdp_enabled)
> +		set_memory_decrypted((unsigned long)vcpu->arch.mmu->pae_root, 1);
> +	else
> +		WARN_ON_ONCE(shadow_me_value);
> +	return 0;
> +}
> +
>  static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
>  {
>  	int r;
> @@ -5031,6 +5066,9 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
>  	r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->root_role.direct);
>  	if (r)
>  		goto out;
> +	r = mmu_alloc_pae_root(vcpu);
> +	if (r)
> +		return r;
>  	r = mmu_alloc_special_roots(vcpu);
>  	if (r)
>  		goto out;
> @@ -5495,63 +5533,18 @@ static void free_mmu_pages(struct kvm_mmu *mmu)
>  	free_page((unsigned long)mmu->pml5_root);
>  }
>  
> -static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
> +static void __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)

vcpu is now unused.

>  {
> -	struct page *page;
>  	int i;
>  
>  	mmu->root.hpa = INVALID_PAGE;
>  	mmu->root.pgd = 0;
>  	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
>  		mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;

optional: Consider open-coding this directly in kvm_mmu_create() and
dropping __kvm_mmu_create().
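
i.e. something like this directly in kvm_mmu_create() (rough sketch,
not build-tested):

	int i;

	vcpu->arch.root_mmu.root.hpa = INVALID_PAGE;
	vcpu->arch.root_mmu.root.pgd = 0;
	vcpu->arch.guest_mmu.root.hpa = INVALID_PAGE;
	vcpu->arch.guest_mmu.root.pgd = 0;
	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
		vcpu->arch.root_mmu.prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
		vcpu->arch.guest_mmu.prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
	}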

> -
> -	/* vcpu->arch.guest_mmu isn't used when !tdp_enabled. */
> -	if (!tdp_enabled && mmu == &vcpu->arch.guest_mmu)
> -		return 0;
> -
> -	/*
> -	 * When using PAE paging, the four PDPTEs are treated as 'root' pages,
> -	 * while the PDP table is a per-vCPU construct that's allocated at MMU
> -	 * creation.  When emulating 32-bit mode, cr3 is only 32 bits even on
> -	 * x86_64.  Therefore we need to allocate the PDP table in the first
> -	 * 4GB of memory, which happens to fit the DMA32 zone.  TDP paging
> -	 * generally doesn't use PAE paging and can skip allocating the PDP
> -	 * table.  The main exception, handled here, is SVM's 32-bit NPT.  The
> -	 * other exception is for shadowing L1's 32-bit or PAE NPT on 64-bit
> -	 * KVM; that horror is handled on-demand by mmu_alloc_special_roots().
> -	 */
> -	if (tdp_enabled && kvm_mmu_get_tdp_level(vcpu) > PT32E_ROOT_LEVEL)
> -		return 0;
> -
> -	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_DMA32);
> -	if (!page)
> -		return -ENOMEM;
> -
> -	mmu->pae_root = page_address(page);
> -
> -	/*
> -	 * CR3 is only 32 bits when PAE paging is used, thus it's impossible to
> -	 * get the CPU to treat the PDPTEs as encrypted.  Decrypt the page so
> -	 * that KVM's writes and the CPU's reads get along.  Note, this is
> -	 * only necessary when using shadow paging, as 64-bit NPT can get at
> -	 * the C-bit even when shadowing 32-bit NPT, and SME isn't supported
> -	 * by 32-bit kernels (when KVM itself uses 32-bit NPT).
> -	 */
> -	if (!tdp_enabled)
> -		set_memory_decrypted((unsigned long)mmu->pae_root, 1);
> -	else
> -		WARN_ON_ONCE(shadow_me_value);
> -
> -	for (i = 0; i < 4; ++i)
> -		mmu->pae_root[i] = INVALID_PAE_ROOT;
> -
> -	return 0;
>  }
>  
>  int kvm_mmu_create(struct kvm_vcpu *vcpu)

kvm_mmu_create() could return void now too.

>  {
> -	int ret;
> -
>  	vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
>  	vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
>  
> @@ -5563,18 +5556,10 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
>  	vcpu->arch.mmu = &vcpu->arch.root_mmu;
>  	vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
>  
> -	ret = __kvm_mmu_create(vcpu, &vcpu->arch.guest_mmu);
> -	if (ret)
> -		return ret;
> -
> -	ret = __kvm_mmu_create(vcpu, &vcpu->arch.root_mmu);
> -	if (ret)
> -		goto fail_allocate_root;
> +	__kvm_mmu_create(vcpu, &vcpu->arch.guest_mmu);
> +	__kvm_mmu_create(vcpu, &vcpu->arch.root_mmu);
>  
> -	return ret;
> - fail_allocate_root:
> -	free_mmu_pages(&vcpu->arch.guest_mmu);
> -	return ret;
> +	return 0;
>  }
>  
>  #define BATCH_ZAP_PAGES	10
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 1bff453f7cbe..d5673a42680f 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -20,16 +20,6 @@ extern bool dbg;
>  #define MMU_WARN_ON(x) do { } while (0)
>  #endif
>  
> -/*
> - * Unlike regular MMU roots, PAE "roots", a.k.a. PDPTEs/PDPTRs, have a PRESENT
> - * bit, and thus are guaranteed to be non-zero when valid.  And, when a guest
> - * PDPTR is !PRESENT, its corresponding PAE root cannot be set to INVALID_PAGE,
> - * as the CPU would treat that as PRESENT PDPTR with reserved bits set.  Use
> - * '0' instead of INVALID_PAGE to indicate an invalid PAE root.
> - */
> -#define INVALID_PAE_ROOT	0
> -#define IS_VALID_PAE_ROOT(x)	(!!(x))
> -
>  typedef u64 __rcu *tdp_ptep_t;
>  
>  struct kvm_mmu_page {
> -- 
> 2.19.1.6.gb485710b
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V2 6/7] KVM: X86/MMU: Allocate mmu->pae_root for PAE paging on-demand
  2022-05-17 16:57   ` David Matlack
@ 2022-05-26  8:52     ` Lai Jiangshan
  0 siblings, 0 replies; 23+ messages in thread
From: Lai Jiangshan @ 2022-05-26  8:52 UTC (permalink / raw)
  To: David Matlack
  Cc: LKML, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Paolo Bonzini, Sean Christopherson, Lai Jiangshan,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	X86 ML, H. Peter Anvin

Hello

Thank you for the review.

On Wed, May 18, 2022 at 12:57 AM David Matlack <dmatlack@google.com> wrote:

> >
> > -static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
> > +static void __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
>
> vcpu is now unused.

Removed in V3.

>
> >  {
> > -     struct page *page;
> >       int i;
> >
> >       mmu->root.hpa = INVALID_PAGE;
> >       mmu->root.pgd = 0;
> >       for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
> >               mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
>
> optional: Consider open-coding this directly in kvm_mmu_create() and
> drop __kvm_mmu_create().

__kvm_mmu_create() is kept, and I don't want to duplicate this code.


> >  }
> >
> >  int kvm_mmu_create(struct kvm_vcpu *vcpu)
>
> kvm_mmu_create() could return void now too.

Done in v3.

Thanks
Lai

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V2 3/7] KVM: X86/MMU: Link PAE root pagetable with its children
  2022-05-17 16:41       ` David Matlack
@ 2022-05-26  9:12         ` Lai Jiangshan
  0 siblings, 0 replies; 23+ messages in thread
From: Lai Jiangshan @ 2022-05-26  9:12 UTC (permalink / raw)
  To: David Matlack
  Cc: LKML, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Paolo Bonzini, Sean Christopherson, Lai Jiangshan,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	X86 ML, H. Peter Anvin

On Wed, May 18, 2022 at 12:42 AM David Matlack <dmatlack@google.com> wrote:

>
> Ah of course. e.g. FNAME(fetch) will call is_shadow_present_pte() on PAE
> PDPTEs.
>
> Could you also update the comment above SPTE_MMU_PRESENT_MASK? Right now it
> says: "Use bit 11, as it is ignored by all flavors of SPTEs and checking a low
> bit often generates better code than for a high bit, e.g. 56+." I think it
> would be helpful to also mention that SPTE_MMU_PRESENT_MASK is also used in
> PDPTEs which only ignore bits 11:9.
>

Hello

Thank you for the review.

I think using BUILD_BUG_ON() in the place that requires the constraint
can avoid bloating the comments on the definition, since it is a
build-time check and there are not too many constraints.

So I didn't change it in V3.

Or better (still using a build-time check rather than comments):

#define PT_PTE_IGNORE_BITS xxxx
#define PAE_PTE_IGNORE_BITS xxxx
#define EPT_PTE_IGNORE_BITS xxxx

static_assert(PT_PTE_IGNORE_BITS & PAE_PTE_IGNORE_BITS &
              EPT_PTE_IGNORE_BITS & SPTE_MMU_PRESENT_MASK);

Thanks
Lai

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V2 4/7] KVM: X86/MMU: Activate special shadow pages and remove old logic
  2022-05-17  0:16   ` David Matlack
@ 2022-05-26  9:15     ` Lai Jiangshan
  0 siblings, 0 replies; 23+ messages in thread
From: Lai Jiangshan @ 2022-05-26  9:15 UTC (permalink / raw)
  To: David Matlack
  Cc: LKML, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Paolo Bonzini, Sean Christopherson, Lai Jiangshan,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	X86 ML, H. Peter Anvin

On Tue, May 17, 2022 at 8:16 AM David Matlack <dmatlack@google.com> wrote:

>
> This comment does not explain why FNAME(fetch) might pass these values
> to allocate special shadow pages. How about adjusting it to something like
> this:
>
>   /*
>    * Initialize the guest walker with default values.  These values will
>    * be used in cases where KVM shadows a guest page table structure
>    * with more levels than what the guest uses. For example, KVM shadows
>    * 2-level non-PAE guests with 3-level PAE paging.
>    *
>    * Note, the gfn is technically ignored for these special shadow
>    * pages, but it's more consistent to always pass 0 to
>    * kvm_mmu_get_page().
>    */

The comment is copied into V3 with slight changes.

>
> > +     for (i = 2; i < PT_MAX_FULL_LEVELS; i++) {
>
> s/2/PT32_ROOT_LEVEL/
>

Done in v3.

Thanks
Lai

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V2 1/7] KVM: X86/MMU: Add using_special_root_page()
  2022-05-13 22:53   ` David Matlack
@ 2022-05-26  9:20     ` Lai Jiangshan
  0 siblings, 0 replies; 23+ messages in thread
From: Lai Jiangshan @ 2022-05-26  9:20 UTC (permalink / raw)
  To: David Matlack
  Cc: LKML, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Paolo Bonzini, Sean Christopherson, Lai Jiangshan,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	X86 ML, H. Peter Anvin

On Sat, May 14, 2022 at 6:53 AM David Matlack <dmatlack@google.com> wrote:
>
> On Tue, May 03, 2022 at 11:07:29PM +0800, Lai Jiangshan wrote:
> > From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> >
> > In some case, special roots are used in mmu.  It is often using
> > to_shadow_page(mmu->root.hpa) to check if special roots are used.
> >
> > Add using_special_root_page() to directly check if special roots are
> > used or needed to be used even mmu->root.hpa is not set.
> >
> > Prepare for making to_shadow_page(mmu->root.hpa) return non-NULL via
> > using special shadow pages.
> >
> > Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 14 +++++++++++---
> >  1 file changed, 11 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 909372762363..7f20796af351 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1711,6 +1711,14 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
> >       mmu_spte_clear_no_track(parent_pte);
> >  }
> >
> > +static bool using_special_root_page(struct kvm_mmu *mmu)
>
> Could you enumerate all the scenarios that use special roots and which
> roots are the special ones? I think that would help a lot with reviewing
> this series and would be useful to encode in a comment, probably above
> this function here, for future readers.

Thank you for the review.

All the scenarios are listed in v3.

And comments are added to prove that the code matches the scenarios
and the scenarios match the code (the proof must go in both directions).

>
> Also the term "special" is really vague. Maybe once you enumerate all
> the scenarios a common theme will arise and we can pick a better name,
> unless you have any ideas off the top of your head.

"special" is renamed to "local"

thanks
Lai

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V2 2/7] KVM: X86/MMU: Add special shadow pages
  2022-05-13 23:43   ` David Matlack
@ 2022-05-26  9:38     ` Lai Jiangshan
  0 siblings, 0 replies; 23+ messages in thread
From: Lai Jiangshan @ 2022-05-26  9:38 UTC (permalink / raw)
  To: David Matlack
  Cc: LKML, open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips),
	Paolo Bonzini, Sean Christopherson, Lai Jiangshan,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	X86 ML, H. Peter Anvin

Hello

Thank you for the review.

On Sat, May 14, 2022 at 7:43 AM David Matlack <dmatlack@google.com> wrote:

> > +/*
> > + * Special pages are pages to hold PAE PDPTEs for 32bit guest or higher level
> > + * pages linked to special page when shadowing NPT.
> > + *
> > + * Special pages are specially allocated.  If sp->spt needs to be 32bit, it
>
> I'm not sure what you mean by "If sp->spt needs to be 32bit". Do you mean
> "If sp shadows a 32-bit PAE page table"?
>

"If sp->spt needs to be put in a 32bit CR3 (even on x86_64)"

> > + * will use the preallocated mmu->pae_root.
> > + *
> > + * Special pages are only visible to local VCPU except through rmap from their
> > + * children, so they are not in the kvm->arch.active_mmu_pages nor in the hash.
> > + *
> > + * And they are either accounted nor write-protected since they don't has gfn
> > + * associated.
>
> Instead of "has gfn associated", how about "shadow a guest page table"?
>

Done in v3.

> > + *
> > + * Special pages can be obsoleted but might be possibly reused later.  When
> > + * the obsoleting process is done, all the obsoleted shadow pages are unlinked
> > + * from the special pages by the help of the parent rmap of the children and
> > + * the special pages become theoretically valid again.  If there is no other
> > + * event to cause a VCPU to free the root and the VCPU is being preempted by
> > + * the host during two obsoleting processes, the VCPU can reuse its special
> > + * pages when it is back.
>
> Sorry I am having a lot of trouble parsing this paragraph.
>

This paragraph is rewritten in v3.

> > + */
>
> This comment (and more broadly, this series) mixes "special page",
> "special root", "special root page", and "special shadow page". Can you
> be more consistent with the terminology?
>

In v3, there are only "local shadow page" and "local root shadow page".
And "local root shadow page" can be shorted as "local root page".

> > +static struct kvm_mmu_page *kvm_mmu_alloc_special_page(struct kvm_vcpu *vcpu,
> > +             union kvm_mmu_page_role role)
> > +{
> > +     struct kvm_mmu_page *sp;
> > +
> > +     sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> > +     sp->gfn = 0;
> > +     sp->role = role;
> > +     if (role.level == PT32E_ROOT_LEVEL &&
> > +         vcpu->arch.mmu->root_role.level == PT32E_ROOT_LEVEL)
> > +             sp->spt = vcpu->arch.mmu->pae_root;
>
> Why use pae_root here instead of allocating from the cache?

Because of the 32-bit CR3.

The comment is updated in V3.

> > +static void mmu_free_special_root_page(struct kvm *kvm, struct kvm_mmu *mmu)
> > +{
> > +     u64 spte = mmu->root.hpa;
> > +     struct kvm_mmu_page *sp = to_shadow_page(spte & PT64_BASE_ADDR_MASK);
> > +     int i;
> > +
> > +     /* Free level 5 or 4 roots for shadow NPT for 32 bit L1 */
> > +     while (sp->role.level > PT32E_ROOT_LEVEL)
> > +     {
> > +             spte = sp->spt[0];
> > +             mmu_page_zap_pte(kvm, sp, sp->spt + 0, NULL);
>
> Instead of using mmu_page_zap_pte(..., NULL) what about creating a new
> helper that just does drop_parent_pte(), since that's all you really
> want?
>

There are several places using mmu_page_zap_pte(..., NULL) in mmu.c.

mmu_page_zap_pte() is more general, and reviewers don't need to
understand extra code and extra comments.  For example, if a helper
that just does drop_parent_pte() were used, some comments would be
needed to explain that sp->spt[i] is not a large page when
disconnecting the PAE root from the 4 PAE page directories.


> > +             free_page((unsigned long)sp->spt);
> > +             kmem_cache_free(mmu_page_header_cache, sp);
> > +             if (!is_shadow_present_pte(spte))
> > +                     return;
> > +             sp = to_shadow_page(spte & PT64_BASE_ADDR_MASK);
> > +     }
> > +
> > +     if (WARN_ON_ONCE(sp->role.level != PT32E_ROOT_LEVEL))
> > +             return;
> > +
> > +     /* Free PAE roots */
>
> nit: This loop does not do any freeing, it just disconnects the PAE root
> table from the 4 PAE page directories. So how about:
>
> /* Disconnect PAE root from the 4 PAE page directories */
>

Done in v3.

Thanks
Lai

^ permalink raw reply	[flat|nested] 23+ messages in thread


Thread overview: 23+ messages
2022-05-03 15:07 [PATCH V2 0/7] KVM: X86/MMU: Use one-off special shadow page for special roots Lai Jiangshan
2022-05-03 15:07 ` [PATCH V2 1/7] KVM: X86/MMU: Add using_special_root_page() Lai Jiangshan
2022-05-13 22:53   ` David Matlack
2022-05-26  9:20     ` Lai Jiangshan
2022-05-03 15:07 ` [PATCH V2 2/7] KVM: X86/MMU: Add special shadow pages Lai Jiangshan
2022-05-13 23:43   ` David Matlack
2022-05-26  9:38     ` Lai Jiangshan
2022-05-03 15:07 ` [PATCH V2 3/7] KVM: X86/MMU: Link PAE root pagetable with its children Lai Jiangshan
2022-05-17  0:01   ` David Matlack
2022-05-17  1:13     ` Lai Jiangshan
2022-05-17 16:41       ` David Matlack
2022-05-26  9:12         ` Lai Jiangshan
2022-05-03 15:07 ` [PATCH V2 4/7] KVM: X86/MMU: Activate special shadow pages and remove old logic Lai Jiangshan
2022-05-17  0:16   ` David Matlack
2022-05-26  9:15     ` Lai Jiangshan
2022-05-03 15:07 ` [PATCH V2 5/7] KVM: X86/MMU: Remove the check of the return value of to_shadow_page() Lai Jiangshan
2022-05-17 16:47   ` David Matlack
2022-05-03 15:07 ` [PATCH V2 6/7] KVM: X86/MMU: Allocate mmu->pae_root for PAE paging on-demand Lai Jiangshan
2022-05-17 16:57   ` David Matlack
2022-05-26  8:52     ` Lai Jiangshan
2022-05-03 15:07 ` [PATCH V2 7/7] KVM: X86/MMU: Remove mmu_alloc_special_roots() Lai Jiangshan
2022-05-13  8:22 ` [PATCH V2 0/7] KVM: X86/MMU: Use one-off special shadow page for special roots Lai Jiangshan
2022-05-13 18:31   ` David Matlack
