* [RFC PATCH 0/6] KVM: X86: Add and use shadow page with level promoted or acting as pae_root
@ 2021-12-10  9:25 Lai Jiangshan
  2021-12-10  9:25 ` [RFC PATCH 1/6] KVM: X86: Check root_level only in fast_pgd_switch() Lai Jiangshan
                   ` (6 more replies)
  0 siblings, 7 replies; 22+ messages in thread
From: Lai Jiangshan @ 2021-12-10  9:25 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini, Sean Christopherson; +Cc: Lai Jiangshan

From: Lai Jiangshan <laijs@linux.alibaba.com>

(Request For Help for testing on AMD machine with 32 bit L1 hypervisor,
see information below)

KVM handles root pages specially for these cases:

direct mmu (non-paging for 32-bit guest):
	gCR0_PG=0
shadow mmu (shadow paging for 32-bit guest):
	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=0
	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=1
direct mmu (NPT for 32bit host):
	hEFER_LMA=0
shadow nested NPT (for 32bit L1 hypervisor):
	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=0,hEFER_LMA=0
	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=1,hEFER_LMA=0
	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE={0|1},hEFER_LMA=1,hCR4_LA57={0|1}
Shadow nested NPT for 64bit L1 hypervisor:
	gEFER_LMA=1,gCR4_LA57=0,hEFER_LMA=1,hCR4_LA57=1

They either use special roots, or match the condition
((mmu->shadow_root_level > mmu->root_level) && !mmu->direct_map)
(referred to as level promotion), or both.
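
Expressed as a helper, the promotion condition would look like this
(just a sketch for clarity; is_level_promoted() is a hypothetical name,
not an existing KVM function):

	static bool is_level_promoted(struct kvm_mmu *mmu)
	{
		/* the shadow paging structure has more levels than the guest's */
		return !mmu->direct_map &&
		       mmu->shadow_root_level > mmu->root_level;
	}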

All the cases use special roots except the last one.
Many cases do level promotion, including the last one.

When special roots are used, the root page is not backed by a
struct kvm_mmu_page, so it must be treated specially.  Not every code
path accounts for this, and Sean has been adding code to check for
these special roots.
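
One example of the kind of NULL check that special roots force on users
of to_shadow_page() is the pre-patch is_tdp_mmu() code, taken from the
tdp_mmu.h hunk that patch 6 later simplifies:

	/*
	 * A NULL shadow page is legal when shadowing a non-paging guest with
	 * PAE paging, as the MMU will be direct with root_hpa pointing at the
	 * pae_root page, not a shadow page.
	 */
	sp = to_shadow_page(hpa);
	return sp && is_tdp_mmu_page(sp) && sp->root_count;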

When level promotion occurs, KVM always handles it silently.

These treatments incur problems or complications; see the changelog
of each patch.

These patches were made while I reviewed all the usages of
shadow_root_level and root_level.  Some of the resulting patches were
sent and accepted.  Patches 3-6 are more complicated, so they were held
back.  Patch 1 and patch 2 were sent earlier: patch 1 was rejected, but
I still think it is correct; patch 2 was said to be accepted, but it has
not shown up in kvm/queue.  Patches 3-6 conflict with patches 1-2, so
patches 1-2 are included here as well.

Another reason patches 3-6 were held back is that they are not tested
with the shadow-NPT cases listed above, because I have neither a guest
image that can act as a 32-bit L1 hypervisor nor access to an AMD
machine with 5-level paging.  I'm a bit reluctant to ask for those
resources, so I'm sending the patches in the hope that someone can test
and adjust them.  At the very least, they provide some analysis and
reveal problems in the existing code and in the AMD cases.
( *Request For Help* here.)

These patches have been tested with all the cases except the shadow-NPT
ones; the code coverage is believed to be more than 95% (hundreds of
lines of shadow-NPT-related code are removed and replaced with common
role.pae_root and role.level_promoted code; only 8 lines of code are
added specifically for shadow-NPT, and only 2 lines are not covered by
my tests).

Sean also found the problem with the last case listed above and asked
questions in a reply[1] to one of my emails; I hope this patchset can
serve as my answer to his questions about such a complicated case.

If the special roots are removed and the PAE page is write-protected,
some further cleanups become possible.

[1]: https://lore.kernel.org/lkml/YbFY533IT3XSIqAK@google.com/

Lai Jiangshan (6):
  KVM: X86: Check root_level only in fast_pgd_switch()
  KVM: X86: Walk shadow page starting with shadow_root_level
  KVM: X86: Add arguments gfn and role to kvm_mmu_alloc_page()
  KVM: X86: Introduce role.level_promoted
  KVM: X86: Alloc pae_root shadow page
  KVM: X86: Use level_promoted and pae_root shadow page for 32bit guests

 arch/x86/include/asm/kvm_host.h |   9 +-
 arch/x86/kvm/mmu/mmu.c          | 440 ++++++++++----------------------
 arch/x86/kvm/mmu/mmu_audit.c    |  26 +-
 arch/x86/kvm/mmu/paging_tmpl.h  |  15 +-
 arch/x86/kvm/mmu/tdp_mmu.h      |   7 +-
 5 files changed, 164 insertions(+), 333 deletions(-)

-- 
2.19.1.6.gb485710b



* [RFC PATCH 1/6] KVM: X86: Check root_level only in fast_pgd_switch()
  2021-12-10  9:25 [RFC PATCH 0/6] KVM: X86: Add and use shadow page with level promoted or acting as pae_root Lai Jiangshan
@ 2021-12-10  9:25 ` Lai Jiangshan
  2022-01-04 20:24   ` Sean Christopherson
  2021-12-10  9:25 ` [RFC PATCH 2/6] KVM: X86: Walk shadow page starting with shadow_root_level Lai Jiangshan
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Lai Jiangshan @ 2021-12-10  9:25 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini, Sean Christopherson
  Cc: Lai Jiangshan, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

If root_level >= 4, shadow_root_level must be >= 4 too, so checking
only root_level is sufficient and removes one comparison.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/mmu/mmu.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 11b06d536cc9..846a2e426e0b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4136,8 +4136,7 @@ static bool fast_pgd_switch(struct kvm_vcpu *vcpu, gpa_t new_pgd,
 	 * having to deal with PDPTEs. We may add support for 32-bit hosts/VMs
 	 * later if necessary.
 	 */
-	if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
-	    mmu->root_level >= PT64_ROOT_4LEVEL)
+	if (mmu->root_level >= PT64_ROOT_4LEVEL)
 		return cached_root_available(vcpu, new_pgd, new_role);
 
 	return false;
-- 
2.19.1.6.gb485710b



* [RFC PATCH 2/6] KVM: X86: Walk shadow page starting with shadow_root_level
  2021-12-10  9:25 [RFC PATCH 0/6] KVM: X86: Add and use shadow page with level promoted or acting as pae_root Lai Jiangshan
  2021-12-10  9:25 ` [RFC PATCH 1/6] KVM: X86: Check root_level only in fast_pgd_switch() Lai Jiangshan
@ 2021-12-10  9:25 ` Lai Jiangshan
  2022-01-04 20:34   ` Sean Christopherson
  2021-12-10  9:25 ` [RFC PATCH 3/6] KVM: X86: Add arguments gfn and role to kvm_mmu_alloc_page() Lai Jiangshan
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Lai Jiangshan @ 2021-12-10  9:25 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini, Sean Christopherson
  Cc: Lai Jiangshan, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

Walking from the root page of the shadow page table should start with
the level of the shadow page table: shadow_root_level.

Also fix a small defect in audit_mappings(): the current walking
level is more valuable to print than the fixed root_level.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/mmu/mmu_audit.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu_audit.c b/arch/x86/kvm/mmu/mmu_audit.c
index 9e7dcf999f08..6bbbf85b3e46 100644
--- a/arch/x86/kvm/mmu/mmu_audit.c
+++ b/arch/x86/kvm/mmu/mmu_audit.c
@@ -63,7 +63,7 @@ static void mmu_spte_walk(struct kvm_vcpu *vcpu, inspect_spte_fn fn)
 		hpa_t root = vcpu->arch.mmu->root_hpa;
 
 		sp = to_shadow_page(root);
-		__mmu_spte_walk(vcpu, sp, fn, vcpu->arch.mmu->root_level);
+		__mmu_spte_walk(vcpu, sp, fn, vcpu->arch.mmu->shadow_root_level);
 		return;
 	}
 
@@ -119,8 +119,7 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 *sptep, int level)
 	hpa =  pfn << PAGE_SHIFT;
 	if ((*sptep & PT64_BASE_ADDR_MASK) != hpa)
 		audit_printk(vcpu->kvm, "levels %d pfn %llx hpa %llx "
-			     "ent %llxn", vcpu->arch.mmu->root_level, pfn,
-			     hpa, *sptep);
+			     "ent %llxn", level, pfn, hpa, *sptep);
 }
 
 static void inspect_spte_has_rmap(struct kvm *kvm, u64 *sptep)
-- 
2.19.1.6.gb485710b



* [RFC PATCH 3/6] KVM: X86: Add arguments gfn and role to kvm_mmu_alloc_page()
  2021-12-10  9:25 [RFC PATCH 0/6] KVM: X86: Add and use shadow page with level promoted or acting as pae_root Lai Jiangshan
  2021-12-10  9:25 ` [RFC PATCH 1/6] KVM: X86: Check root_level only in fast_pgd_switch() Lai Jiangshan
  2021-12-10  9:25 ` [RFC PATCH 2/6] KVM: X86: Walk shadow page starting with shadow_root_level Lai Jiangshan
@ 2021-12-10  9:25 ` Lai Jiangshan
  2022-01-04 20:53   ` Sean Christopherson
  2021-12-10  9:25 ` [RFC PATCH 4/6] KVM: X86: Introduce role.level_promoted Lai Jiangshan
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Lai Jiangshan @ 2021-12-10  9:25 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini, Sean Christopherson
  Cc: Lai Jiangshan, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

kvm_mmu_alloc_page() will need to access more bits of the role, so
pass it the gfn and the full role instead of just the direct flag.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/mmu/mmu.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 846a2e426e0b..54e7cbc15380 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1734,13 +1734,13 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
 	mmu_spte_clear_no_track(parent_pte);
 }
 
-static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct)
+static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, gfn_t gfn, union kvm_mmu_page_role role)
 {
 	struct kvm_mmu_page *sp;
 
 	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
 	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
-	if (!direct)
+	if (!role.direct)
 		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
@@ -1752,6 +1752,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct
 	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
 	list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
 	kvm_mod_used_mmu_pages(vcpu->kvm, +1);
+	sp->gfn = gfn;
+	sp->role = role;
 	return sp;
 }
 
@@ -2138,10 +2140,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 
 	++vcpu->kvm->stat.mmu_cache_miss;
 
-	sp = kvm_mmu_alloc_page(vcpu, direct);
-
-	sp->gfn = gfn;
-	sp->role = role;
+	sp = kvm_mmu_alloc_page(vcpu, gfn, role);
 	hlist_add_head(&sp->hash_link, sp_list);
 	if (!direct) {
 		account_shadowed(vcpu->kvm, sp);
-- 
2.19.1.6.gb485710b



* [RFC PATCH 4/6] KVM: X86: Introduce role.level_promoted
  2021-12-10  9:25 [RFC PATCH 0/6] KVM: X86: Add and use shadow page with level promoted or acting as pae_root Lai Jiangshan
                   ` (2 preceding siblings ...)
  2021-12-10  9:25 ` [RFC PATCH 3/6] KVM: X86: Add arguments gfn and role to kvm_mmu_alloc_page() Lai Jiangshan
@ 2021-12-10  9:25 ` Lai Jiangshan
  2022-01-04 22:14   ` Sean Christopherson
  2021-12-10  9:25 ` [RFC PATCH 5/6] KVM: X86: Alloc pae_root shadow page Lai Jiangshan
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Lai Jiangshan @ 2021-12-10  9:25 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini, Sean Christopherson
  Cc: Lai Jiangshan, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

Level promotion occurs when mmu->shadow_root_level > mmu->root_level.

There are several cases that can cause level promotion:

shadow mmu (shadow paging for 32 bit guest):
	case1:	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=0

shadow nested NPT (for 32bit L1 hypervisor):
	case2:	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=0,hEFER_LMA=0
	case3:	gCR0_PG=1,gEFER_LMA=0,hEFER_LMA=1

shadow nested NPT (for 64bit L1 hypervisor):
	case4:	gEFER_LMA=1,gCR4_LA57=0,hEFER_LMA=1,hCR4_LA57=1

When level promotion occurs (32-bit guest, cases 1-3), special roots
are often used.  But case 4 does not use special roots; it uses a
shadow page without being fully aware of its special nature.  It only
works somewhat accidentally:
	1) The root page (root_sp->spt) is allocated with level = 5,
	   and root_sp->spt[0] is allocated with the same gfn and the
	   same role except role.level = 4.  Luckily, they end up as
	   different shadow pages.
	2) FNAME(walk_addr_generic) sets walker->table_gfn[4] and
	   walker->pt_access[4], which are normally unused when
	   mmu->shadow_root_level == mmu->root_level == 4, so that
	   FNAME(fetch) can use them to allocate a shadow page for
	   root_sp->spt[0] and link it when shadow_root_level == 5.

But it has problems.
If the guest switches from gCR4_LA57=0 to gCR4_LA57=1 (or vice versa)
and uses the same gfn as the root of the nNPT before and after the
switch, the host (hCR4_LA57=1) would use the same root_sp for the
guest even though the guest switched gCR4_LA57.  The guest would see
unexpected pages mapped, and L2 could hurt L1.  It is only luck that
the problem cannot hurt L0.

The root_sp should sometimes behave like role.direct=1: its contents
are not backed by gptes, and root_sp->gfns is meaningless.  For a
normal high-level sp, sp->gfns is often unused and kept zero, but it
can be relevant and meaningful when it is used, because its entries
are backed by concrete gptes.  For the level-promoted root_sp described
above, root_sp is just a portal contributing root_sp->spt[0]; root_sp
should not have root_sp->gfns, and root_sp->spt[0] should not be
dropped if gpte[0] of the root gfn is changed.

This patch adds role.level_promoted to address the two problems.
role.level_promoted is set when shadow paging is in use and
role.level > gMMU.level.

An alternative way to fix the problem in case 4 would be to use the
special root pml5_root for it as well.  But that would require changes
in many other places, because the current assumption is that special
roots are only used for 32-bit guests.

This patch also paves the way to use level-promoted shadow pages for
cases 1-3, but that requires special handling of PAE paging, so that
more extensive usage is not included here.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/include/asm/kvm_host.h |  3 ++-
 arch/x86/kvm/mmu/mmu.c          | 15 +++++++++++++--
 arch/x86/kvm/mmu/paging_tmpl.h  |  1 +
 3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 88ecf53f0d2b..6465c83794fc 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -334,7 +334,8 @@ union kvm_mmu_page_role {
 		unsigned smap_andnot_wp:1;
 		unsigned ad_disabled:1;
 		unsigned guest_mode:1;
-		unsigned :6;
+		unsigned level_promoted:1;
+		unsigned :5;
 
 		/*
 		 * This is left at the top of the word so that
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 54e7cbc15380..4769253e9024 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -767,6 +767,9 @@ static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
 
 static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
 {
+	if (sp->role.level_promoted)
+		return sp->gfn;
+
 	if (!sp->role.direct)
 		return sp->gfns[index];
 
@@ -776,6 +779,8 @@ static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
 static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
 {
 	if (!sp->role.direct) {
+		if (WARN_ON_ONCE(sp->role.level_promoted && gfn != sp->gfn))
+			return;
 		sp->gfns[index] = gfn;
 		return;
 	}
@@ -1702,7 +1707,7 @@ static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
 	hlist_del(&sp->hash_link);
 	list_del(&sp->link);
 	free_page((unsigned long)sp->spt);
-	if (!sp->role.direct)
+	if (!sp->role.direct && !sp->role.level_promoted)
 		free_page((unsigned long)sp->gfns);
 	kmem_cache_free(mmu_page_header_cache, sp);
 }
@@ -1740,7 +1745,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 
 	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
 	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
-	if (!role.direct)
+	if (!(role.direct || role.level_promoted))
 		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
@@ -2084,6 +2089,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
 		role.quadrant = quadrant;
 	}
+	if (role.level_promoted && (level <= vcpu->arch.mmu->root_level))
+		role.level_promoted = 0;
 
 	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
@@ -4836,6 +4843,8 @@ kvm_calc_shadow_npt_root_page_role(struct kvm_vcpu *vcpu,
 
 	role.base.direct = false;
 	role.base.level = kvm_mmu_get_tdp_level(vcpu);
+	if (role.base.level > role_regs_to_root_level(regs))
+		role.base.level_promoted = 1;
 
 	return role;
 }
@@ -5228,6 +5237,8 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE);
 
 	for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn) {
+		if (sp->role.level_promoted)
+			continue;
 		if (detect_write_misaligned(sp, gpa, bytes) ||
 		      detect_write_flooding(sp)) {
 			kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list);
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 5c78300fc7d9..16ac276d342a 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -1043,6 +1043,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		.level = 0xf,
 		.access = 0x7,
 		.quadrant = 0x3,
+		.level_promoted = 0x1,
 	};
 
 	/*
-- 
2.19.1.6.gb485710b



* [RFC PATCH 5/6] KVM: X86: Alloc pae_root shadow page
  2021-12-10  9:25 [RFC PATCH 0/6] KVM: X86: Add and use shadow page with level promoted or acting as pae_root Lai Jiangshan
                   ` (3 preceding siblings ...)
  2021-12-10  9:25 ` [RFC PATCH 4/6] KVM: X86: Introduce role.level_promoted Lai Jiangshan
@ 2021-12-10  9:25 ` Lai Jiangshan
  2022-01-04 21:54   ` Sean Christopherson
  2021-12-10  9:25 ` [RFC PATCH 6/6] KVM: X86: Use level_promoted and pae_root shadow page for 32bit guests Lai Jiangshan
  2021-12-10 10:27 ` [RFC PATCH 0/6] KVM: X86: Add and use shadow page with level promoted or acting as pae_root Maxim Levitsky
  6 siblings, 1 reply; 22+ messages in thread
From: Lai Jiangshan @ 2021-12-10  9:25 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini, Sean Christopherson
  Cc: Lai Jiangshan, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

Currently pae_root is a special root page.  This patch adds the
facilities needed to allow kvm_mmu_get_page() to allocate a pae_root
shadow page.

When kvm_mmu_get_page() is called for level == PT32E_ROOT_LEVEL and
vcpu->arch.mmu->shadow_root_level == PT32E_ROOT_LEVEL, it will get a
DMA32 root page with default PAE pdptes installed.

The pae_root bit is needed in the page role because:
	the page is required to be a DMA32 page;
	its first 4 sptes are initialized with default_pae_pdpte.

default_pae_pdpte is needed because the CPU expects the PAE pdptes to
be present at VM-enter.  default_pae_pdpte is designed without
SPTE_MMU_PRESENT_MASK set, so it is present from the CPU's point of
view but not present from shadow paging's point of view, and the page
fault handler will replace it with a real present shadow page.

When changing from default_pae_pdpte to a present spte, no TLB flush
is requested, although both are present from the CPU's point of view.
The reason is that default_pae_pdpte points to the zero page, so no pte
is present even if the paging-structure entry is cached.

No functional change intended, since this code is not activated yet:
when vcpu->arch.mmu->shadow_root_level == PT32E_ROOT_LEVEL,
kvm_mmu_get_page() is currently only called for level == 1 or 2.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/include/asm/kvm_host.h |   4 +-
 arch/x86/kvm/mmu/mmu.c          | 113 +++++++++++++++++++++++++++++++-
 arch/x86/kvm/mmu/paging_tmpl.h  |   1 +
 3 files changed, 114 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6465c83794fc..82a8844f80ac 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -335,7 +335,8 @@ union kvm_mmu_page_role {
 		unsigned ad_disabled:1;
 		unsigned guest_mode:1;
 		unsigned level_promoted:1;
-		unsigned :5;
+		unsigned pae_root:1;
+		unsigned :4;
 
 		/*
 		 * This is left at the top of the word so that
@@ -695,6 +696,7 @@ struct kvm_vcpu_arch {
 	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
 	struct kvm_mmu_memory_cache mmu_gfn_array_cache;
 	struct kvm_mmu_memory_cache mmu_page_header_cache;
+	unsigned long mmu_pae_root_cache;
 
 	/*
 	 * QEMU userspace and the guest each have their own FPU state.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4769253e9024..0d2976dad863 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -724,6 +724,67 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
 	}
 }
 
+static u64 default_pae_pdpte;
+
+static void free_default_pae_pdpte(void)
+{
+	free_page((unsigned long)__va(default_pae_pdpte & PAGE_MASK));
+	default_pae_pdpte = 0;
+}
+
+static int alloc_default_pae_pdpte(void)
+{
+	unsigned long p = __get_free_page(GFP_KERNEL | __GFP_ZERO);
+
+	if (!p)
+		return -ENOMEM;
+	default_pae_pdpte = __pa(p) | PT_PRESENT_MASK | shadow_me_mask;
+	if (WARN_ON(is_shadow_present_pte(default_pae_pdpte) ||
+		    is_mmio_spte(default_pae_pdpte))) {
+		free_default_pae_pdpte();
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static int alloc_pae_root(struct kvm_vcpu *vcpu)
+{
+	struct page *page;
+	unsigned long pae_root;
+	u64* pdpte;
+
+	if (vcpu->arch.mmu->shadow_root_level != PT32E_ROOT_LEVEL)
+		return 0;
+	if (vcpu->arch.mmu_pae_root_cache)
+		return 0;
+
+	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_DMA32);
+	if (!page)
+		return -ENOMEM;
+
+	pae_root = (unsigned long)page_address(page);
+
+	/*
+	 * CR3 is only 32 bits when PAE paging is used, thus it's impossible to
+	 * get the CPU to treat the PDPTEs as encrypted.  Decrypt the page so
+	 * that KVM's writes and the CPU's reads get along.  Note, this is
+	 * only necessary when using shadow paging, as 64-bit NPT can get at
+	 * the C-bit even when shadowing 32-bit NPT, and SME isn't supported
+	 * by 32-bit kernels (when KVM itself uses 32-bit NPT).
+	 */
+	if (!tdp_enabled)
+		set_memory_decrypted(pae_root, 1);
+	else
+		WARN_ON_ONCE(shadow_me_mask);
+	vcpu->arch.mmu_pae_root_cache = pae_root;
+	pdpte = (void *)pae_root;
+	pdpte[0] = default_pae_pdpte;
+	pdpte[1] = default_pae_pdpte;
+	pdpte[2] = default_pae_pdpte;
+	pdpte[3] = default_pae_pdpte;
+	return 0;
+}
+
 static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 {
 	int r;
@@ -735,6 +796,9 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 		return r;
 	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
 				       PT64_ROOT_MAX_LEVEL);
+	if (r)
+		return r;
+	r = alloc_pae_root(vcpu);
 	if (r)
 		return r;
 	if (maybe_indirect) {
@@ -753,6 +817,10 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
+	if (!tdp_enabled && vcpu->arch.mmu_pae_root_cache)
+		set_memory_encrypted(vcpu->arch.mmu_pae_root_cache, 1);
+	free_page(vcpu->arch.mmu_pae_root_cache);
+	vcpu->arch.mmu_pae_root_cache = 0;
 }
 
 static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_vcpu *vcpu)
@@ -1706,6 +1774,8 @@ static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
 	MMU_WARN_ON(!is_empty_shadow_page(sp->spt));
 	hlist_del(&sp->hash_link);
 	list_del(&sp->link);
+	if (!tdp_enabled && sp->role.pae_root)
+		set_memory_encrypted((unsigned long)sp->spt, 1);
 	free_page((unsigned long)sp->spt);
 	if (!sp->role.direct && !sp->role.level_promoted)
 		free_page((unsigned long)sp->gfns);
@@ -1735,8 +1805,13 @@ static void mmu_page_remove_parent_pte(struct kvm_mmu_page *sp,
 static void drop_parent_pte(struct kvm_mmu_page *sp,
 			    u64 *parent_pte)
 {
+	struct kvm_mmu_page *parent_sp = sptep_to_sp(parent_pte);
+
 	mmu_page_remove_parent_pte(sp, parent_pte);
-	mmu_spte_clear_no_track(parent_pte);
+	if (!parent_sp->role.pae_root)
+		mmu_spte_clear_no_track(parent_pte);
+	else
+		__update_clear_spte_fast(parent_pte, default_pae_pdpte);
 }
 
 static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, gfn_t gfn, union kvm_mmu_page_role role)
@@ -1744,7 +1819,12 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 	struct kvm_mmu_page *sp;
 
 	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
-	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+	if (!role.pae_root) {
+		sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+	} else {
+		sp->spt = (void *)vcpu->arch.mmu_pae_root_cache;
+		vcpu->arch.mmu_pae_root_cache = 0;
+	}
 	if (!(role.direct || role.level_promoted))
 		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
@@ -2091,6 +2171,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	}
 	if (role.level_promoted && (level <= vcpu->arch.mmu->root_level))
 		role.level_promoted = 0;
+	if (role.pae_root && (level < PT32E_ROOT_LEVEL))
+		role.pae_root = 0;
 
 	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
@@ -2226,14 +2308,27 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
 	__shadow_walk_next(iterator, *iterator->sptep);
 }
 
+static u64 make_pae_pdpte(u64 *child_pt)
+{
+	u64 spte = __pa(child_pt) | PT_PRESENT_MASK;
+
+	/* The only ignore bits in PDPTE are 11:9. */
+	BUILD_BUG_ON(!(GENMASK(11,9) & SPTE_MMU_PRESENT_MASK));
+	return spte | SPTE_MMU_PRESENT_MASK | shadow_me_mask;
+}
+
 static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
 			     struct kvm_mmu_page *sp)
 {
+	struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
 	u64 spte;
 
 	BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK);
 
-	spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
+	if (!parent_sp->role.pae_root)
+		spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
+	else
+		spte = make_pae_pdpte(sp->spt);
 
 	mmu_spte_set(sptep, spte);
 
@@ -4733,6 +4828,8 @@ kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
 	role.base.level = kvm_mmu_get_tdp_level(vcpu);
 	role.base.direct = true;
 	role.base.has_4_byte_gpte = false;
+	if (role.base.level == PT32E_ROOT_LEVEL)
+		role.base.pae_root = 1;
 
 	return role;
 }
@@ -4798,6 +4895,9 @@ kvm_calc_shadow_mmu_root_page_role(struct kvm_vcpu *vcpu,
 	else
 		role.base.level = PT64_ROOT_4LEVEL;
 
+	if (!____is_cr0_pg(regs) || !____is_efer_lma(regs))
+		role.base.pae_root = 1;
+
 	return role;
 }
 
@@ -4845,6 +4945,8 @@ kvm_calc_shadow_npt_root_page_role(struct kvm_vcpu *vcpu,
 	role.base.level = kvm_mmu_get_tdp_level(vcpu);
 	if (role.base.level > role_regs_to_root_level(regs))
 		role.base.level_promoted = 1;
+	if (role.base.level == PT32E_ROOT_LEVEL)
+		role.base.pae_root = 1;
 
 	return role;
 }
@@ -6133,6 +6235,10 @@ int kvm_mmu_module_init(void)
 	if (ret)
 		goto out;
 
+	ret = alloc_default_pae_pdpte();
+	if (ret)
+		goto out;
+
 	return 0;
 
 out:
@@ -6174,6 +6280,7 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
 
 void kvm_mmu_module_exit(void)
 {
+	free_default_pae_pdpte();
 	mmu_destroy_caches();
 	percpu_counter_destroy(&kvm_total_used_mmu_pages);
 	unregister_shrinker(&mmu_shrinker);
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 16ac276d342a..014136e15b26 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -1044,6 +1044,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		.access = 0x7,
 		.quadrant = 0x3,
 		.level_promoted = 0x1,
+		.pae_root = 0x1,
 	};
 
 	/*
-- 
2.19.1.6.gb485710b



* [RFC PATCH 6/6] KVM: X86: Use level_promoted and pae_root shadow page for 32bit guests
  2021-12-10  9:25 [RFC PATCH 0/6] KVM: X86: Add and use shadow page with level promoted or acting as pae_root Lai Jiangshan
                   ` (4 preceding siblings ...)
  2021-12-10  9:25 ` [RFC PATCH 5/6] KVM: X86: Alloc pae_root shadow page Lai Jiangshan
@ 2021-12-10  9:25 ` Lai Jiangshan
  2022-01-04 20:55   ` Sean Christopherson
  2021-12-10 10:27 ` [RFC PATCH 0/6] KVM: X86: Add and use shadow page with level promoted or acting as pae_root Maxim Levitsky
  6 siblings, 1 reply; 22+ messages in thread
From: Lai Jiangshan @ 2021-12-10  9:25 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini, Sean Christopherson
  Cc: Lai Jiangshan, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin

From: Lai Jiangshan <laijs@linux.alibaba.com>

Use role.pae_root = 1 for shadow_root_level == 3, regardless of whether
it is the shadow MMU or whether the level is promoted.

Use role.level_promoted = 1 for a promoted shadow page when it is the
shadow MMU and the level is promoted.

And remove the now-unneeded special roots.  All root pages and page
tables pointed to by a present spte in kvm_mmu are now backed by a
struct kvm_mmu_page, and to_shadow_page() is guaranteed to be non-NULL.

shadow_walk() and the initialization of shadow pages are much simplified
since there are no special roots.

Affected cases:
direct mmu (non-paging for 32-bit guest):
	gCR0_PG=0 (pae_root=1)
shadow mmu (shadow paging for 32-bit guest):
	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=0 (pae_root=1,level_promoted=1)
	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=1 (pae_root=1,level_promoted=0)
direct mmu (NPT for 32-bit host):
	hEFER_LMA=0 (pae_root=1)
shadow nested NPT (for 32-bit L1 hypervisor):
	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=0,hEFER_LMA=0
		(pae_root=1,level_promoted=1)
	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=1,hEFER_LMA=0
		(pae_root=1,level_promoted=0)
	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE={0|1},hEFER_LMA=1,hCR4_LA57={0|1}
		(pae_root=0,level_promoted=1)
		(default_pae_pdpte is not used even if the guest uses PAE paging)

Shadow nested NPT for a 64-bit L1 hypervisor has already been handled:
	gEFER_LMA=1,gCR4_LA57=0,hEFER_LMA=1,hCR4_LA57=1
		(pae_root=0,level_promoted=1)

FNAME(walk_addr_generic) gains initialization code for shadow nested NPT
with a 32-bit L1 hypervisor, where the level increase can be more than
one, for example 2->4, 2->5, or 3->5.

After this patch, the PAE Page-Directory-Pointer-Table is also write
protected (including NPT's).

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/include/asm/kvm_host.h |   4 -
 arch/x86/kvm/mmu/mmu.c          | 302 ++------------------------------
 arch/x86/kvm/mmu/mmu_audit.c    |  23 +--
 arch/x86/kvm/mmu/paging_tmpl.h  |  13 +-
 arch/x86/kvm/mmu/tdp_mmu.h      |   7 +-
 5 files changed, 30 insertions(+), 319 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 82a8844f80ac..d4ab6f53ab00 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -466,10 +466,6 @@ struct kvm_mmu {
 	*/
 	u32 pkru_mask;
 
-	u64 *pae_root;
-	u64 *pml4_root;
-	u64 *pml5_root;
-
 	/*
 	 * check zero bits on shadow page table entries, these
 	 * bits include not only hardware reserved bits but also
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0d2976dad863..fd2bc851b700 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2252,26 +2252,6 @@ static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterato
 	iterator->addr = addr;
 	iterator->shadow_addr = root;
 	iterator->level = vcpu->arch.mmu->shadow_root_level;
-
-	if (iterator->level >= PT64_ROOT_4LEVEL &&
-	    vcpu->arch.mmu->root_level < PT64_ROOT_4LEVEL &&
-	    !vcpu->arch.mmu->direct_map)
-		iterator->level = PT32E_ROOT_LEVEL;
-
-	if (iterator->level == PT32E_ROOT_LEVEL) {
-		/*
-		 * prev_root is currently only used for 64-bit hosts. So only
-		 * the active root_hpa is valid here.
-		 */
-		BUG_ON(root != vcpu->arch.mmu->root_hpa);
-
-		iterator->shadow_addr
-			= vcpu->arch.mmu->pae_root[(addr >> 30) & 3];
-		iterator->shadow_addr &= PT64_BASE_ADDR_MASK;
-		--iterator->level;
-		if (!iterator->shadow_addr)
-			iterator->level = 0;
-	}
 }
 
 static void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
@@ -3375,19 +3355,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 					   &invalid_list);
 
 	if (free_active_root) {
-		if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
-		    (mmu->root_level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
-			mmu_free_root_page(kvm, &mmu->root_hpa, &invalid_list);
-		} else if (mmu->pae_root) {
-			for (i = 0; i < 4; ++i) {
-				if (!IS_VALID_PAE_ROOT(mmu->pae_root[i]))
-					continue;
-
-				mmu_free_root_page(kvm, &mmu->pae_root[i],
-						   &invalid_list);
-				mmu->pae_root[i] = INVALID_PAE_ROOT;
-			}
-		}
+		mmu_free_root_page(kvm, &mmu->root_hpa, &invalid_list);
 		mmu->root_hpa = INVALID_PAGE;
 		mmu->root_pgd = 0;
 	}
@@ -3452,7 +3420,6 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
 	u8 shadow_root_level = mmu->shadow_root_level;
 	hpa_t root;
-	unsigned i;
 	int r;
 
 	write_lock(&vcpu->kvm->mmu_lock);
@@ -3463,24 +3430,9 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 	if (is_tdp_mmu_enabled(vcpu->kvm)) {
 		root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
 		mmu->root_hpa = root;
-	} else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
+	} else if (shadow_root_level >= PT32E_ROOT_LEVEL) {
 		root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level, true);
 		mmu->root_hpa = root;
-	} else if (shadow_root_level == PT32E_ROOT_LEVEL) {
-		if (WARN_ON_ONCE(!mmu->pae_root)) {
-			r = -EIO;
-			goto out_unlock;
-		}
-
-		for (i = 0; i < 4; ++i) {
-			WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
-
-			root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT),
-					      i << 30, PT32_ROOT_LEVEL, true);
-			mmu->pae_root[i] = root | PT_PRESENT_MASK |
-					   shadow_me_mask;
-		}
-		mmu->root_hpa = __pa(mmu->pae_root);
 	} else {
 		WARN_ONCE(1, "Bad TDP root level = %d\n", shadow_root_level);
 		r = -EIO;
@@ -3558,10 +3510,8 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm)
 static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 {
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
-	u64 pdptrs[4], pm_mask;
 	gfn_t root_gfn, root_pgd;
 	hpa_t root;
-	unsigned i;
 	int r;
 
 	root_pgd = mmu->get_guest_pgd(vcpu);
@@ -3570,21 +3520,6 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	if (mmu_check_root(vcpu, root_gfn))
 		return 1;
 
-	/*
-	 * On SVM, reading PDPTRs might access guest memory, which might fault
-	 * and thus might sleep.  Grab the PDPTRs before acquiring mmu_lock.
-	 */
-	if (mmu->root_level == PT32E_ROOT_LEVEL) {
-		for (i = 0; i < 4; ++i) {
-			pdptrs[i] = mmu->get_pdptr(vcpu, i);
-			if (!(pdptrs[i] & PT_PRESENT_MASK))
-				continue;
-
-			if (mmu_check_root(vcpu, pdptrs[i] >> PAGE_SHIFT))
-				return 1;
-		}
-	}
-
 	r = mmu_first_shadow_root_alloc(vcpu->kvm);
 	if (r)
 		return r;
@@ -3594,146 +3529,14 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	if (r < 0)
 		goto out_unlock;
 
-	/*
-	 * Do we shadow a long mode page table? If so we need to
-	 * write-protect the guests page table root.
-	 */
-	if (mmu->root_level >= PT64_ROOT_4LEVEL) {
-		root = mmu_alloc_root(vcpu, root_gfn, 0,
-				      mmu->shadow_root_level, false);
-		mmu->root_hpa = root;
-		goto set_root_pgd;
-	}
-
-	if (WARN_ON_ONCE(!mmu->pae_root)) {
-		r = -EIO;
-		goto out_unlock;
-	}
-
-	/*
-	 * We shadow a 32 bit page table. This may be a legacy 2-level
-	 * or a PAE 3-level page table. In either case we need to be aware that
-	 * the shadow page table may be a PAE or a long mode page table.
-	 */
-	pm_mask = PT_PRESENT_MASK | shadow_me_mask;
-	if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL) {
-		pm_mask |= PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
-
-		if (WARN_ON_ONCE(!mmu->pml4_root)) {
-			r = -EIO;
-			goto out_unlock;
-		}
-		mmu->pml4_root[0] = __pa(mmu->pae_root) | pm_mask;
-
-		if (mmu->shadow_root_level == PT64_ROOT_5LEVEL) {
-			if (WARN_ON_ONCE(!mmu->pml5_root)) {
-				r = -EIO;
-				goto out_unlock;
-			}
-			mmu->pml5_root[0] = __pa(mmu->pml4_root) | pm_mask;
-		}
-	}
-
-	for (i = 0; i < 4; ++i) {
-		WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
-
-		if (mmu->root_level == PT32E_ROOT_LEVEL) {
-			if (!(pdptrs[i] & PT_PRESENT_MASK)) {
-				mmu->pae_root[i] = INVALID_PAE_ROOT;
-				continue;
-			}
-			root_gfn = pdptrs[i] >> PAGE_SHIFT;
-		}
-
-		root = mmu_alloc_root(vcpu, root_gfn, i << 30,
-				      PT32_ROOT_LEVEL, false);
-		mmu->pae_root[i] = root | pm_mask;
-	}
-
-	if (mmu->shadow_root_level == PT64_ROOT_5LEVEL)
-		mmu->root_hpa = __pa(mmu->pml5_root);
-	else if (mmu->shadow_root_level == PT64_ROOT_4LEVEL)
-		mmu->root_hpa = __pa(mmu->pml4_root);
-	else
-		mmu->root_hpa = __pa(mmu->pae_root);
-
-set_root_pgd:
+	root = mmu_alloc_root(vcpu, root_gfn, 0,
+			      mmu->shadow_root_level, false);
+	mmu->root_hpa = root;
 	mmu->root_pgd = root_pgd;
 out_unlock:
 	write_unlock(&vcpu->kvm->mmu_lock);
 
-	return 0;
-}
-
-static int mmu_alloc_special_roots(struct kvm_vcpu *vcpu)
-{
-	struct kvm_mmu *mmu = vcpu->arch.mmu;
-	bool need_pml5 = mmu->shadow_root_level > PT64_ROOT_4LEVEL;
-	u64 *pml5_root = NULL;
-	u64 *pml4_root = NULL;
-	u64 *pae_root;
-
-	/*
-	 * When shadowing 32-bit or PAE NPT with 64-bit NPT, the PML4 and PDP
-	 * tables are allocated and initialized at root creation as there is no
-	 * equivalent level in the guest's NPT to shadow.  Allocate the tables
-	 * on demand, as running a 32-bit L1 VMM on 64-bit KVM is very rare.
-	 */
-	if (mmu->direct_map || mmu->root_level >= PT64_ROOT_4LEVEL ||
-	    mmu->shadow_root_level < PT64_ROOT_4LEVEL)
-		return 0;
-
-	/*
-	 * NPT, the only paging mode that uses this horror, uses a fixed number
-	 * of levels for the shadow page tables, e.g. all MMUs are 4-level or
-	 * all MMus are 5-level.  Thus, this can safely require that pml5_root
-	 * is allocated if the other roots are valid and pml5 is needed, as any
-	 * prior MMU would also have required pml5.
-	 */
-	if (mmu->pae_root && mmu->pml4_root && (!need_pml5 || mmu->pml5_root))
-		return 0;
-
-	/*
-	 * The special roots should always be allocated in concert.  Yell and
-	 * bail if KVM ends up in a state where only one of the roots is valid.
-	 */
-	if (WARN_ON_ONCE(!tdp_enabled || mmu->pae_root || mmu->pml4_root ||
-			 (need_pml5 && mmu->pml5_root)))
-		return -EIO;
-
-	/*
-	 * Unlike 32-bit NPT, the PDP table doesn't need to be in low mem, and
-	 * doesn't need to be decrypted.
-	 */
-	pae_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
-	if (!pae_root)
-		return -ENOMEM;
-
-#ifdef CONFIG_X86_64
-	pml4_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
-	if (!pml4_root)
-		goto err_pml4;
-
-	if (need_pml5) {
-		pml5_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
-		if (!pml5_root)
-			goto err_pml5;
-	}
-#endif
-
-	mmu->pae_root = pae_root;
-	mmu->pml4_root = pml4_root;
-	mmu->pml5_root = pml5_root;
-
-	return 0;
-
-#ifdef CONFIG_X86_64
-err_pml5:
-	free_page((unsigned long)pml4_root);
-err_pml4:
-	free_page((unsigned long)pae_root);
-	return -ENOMEM;
-#endif
+	return r;
 }
 
 static bool is_unsync_root(hpa_t root)
@@ -3765,46 +3568,23 @@ static bool is_unsync_root(hpa_t root)
 
 void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
 {
-	int i;
-	struct kvm_mmu_page *sp;
+	hpa_t root = vcpu->arch.mmu->root_hpa;
 
 	if (vcpu->arch.mmu->direct_map)
 		return;
 
-	if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
+	if (!VALID_PAGE(root))
 		return;
 
 	vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
 
-	if (vcpu->arch.mmu->root_level >= PT64_ROOT_4LEVEL) {
-		hpa_t root = vcpu->arch.mmu->root_hpa;
-		sp = to_shadow_page(root);
-
-		if (!is_unsync_root(root))
-			return;
-
-		write_lock(&vcpu->kvm->mmu_lock);
-		kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);
-
-		mmu_sync_children(vcpu, sp, true);
-
-		kvm_mmu_audit(vcpu, AUDIT_POST_SYNC);
-		write_unlock(&vcpu->kvm->mmu_lock);
+	if (!is_unsync_root(root))
 		return;
-	}
 
 	write_lock(&vcpu->kvm->mmu_lock);
 	kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);
 
-	for (i = 0; i < 4; ++i) {
-		hpa_t root = vcpu->arch.mmu->pae_root[i];
-
-		if (IS_VALID_PAE_ROOT(root)) {
-			root &= PT64_BASE_ADDR_MASK;
-			sp = to_shadow_page(root);
-			mmu_sync_children(vcpu, sp, true);
-		}
-	}
+	mmu_sync_children(vcpu, to_shadow_page(root), true);
 
 	kvm_mmu_audit(vcpu, AUDIT_POST_SYNC);
 	write_unlock(&vcpu->kvm->mmu_lock);
@@ -4895,8 +4675,11 @@ kvm_calc_shadow_mmu_root_page_role(struct kvm_vcpu *vcpu,
 	else
 		role.base.level = PT64_ROOT_4LEVEL;
 
-	if (!____is_cr0_pg(regs) || !____is_efer_lma(regs))
+	if (!____is_cr0_pg(regs) || !____is_efer_lma(regs)) {
 		role.base.pae_root = 1;
+		if (____is_cr0_pg(regs) && !____is_cr4_pse(regs))
+			role.base.level_promoted = 1;
+	}
 
 	return role;
 }
@@ -5161,9 +4944,6 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
 	int r;
 
 	r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->direct_map);
-	if (r)
-		goto out;
-	r = mmu_alloc_special_roots(vcpu);
 	if (r)
 		goto out;
 	if (vcpu->arch.mmu->direct_map)
@@ -5580,65 +5360,14 @@ slot_handle_level_4k(struct kvm *kvm, const struct kvm_memory_slot *memslot,
 				 PG_LEVEL_4K, flush_on_yield);
 }
 
-static void free_mmu_pages(struct kvm_mmu *mmu)
-{
-	if (!tdp_enabled && mmu->pae_root)
-		set_memory_encrypted((unsigned long)mmu->pae_root, 1);
-	free_page((unsigned long)mmu->pae_root);
-	free_page((unsigned long)mmu->pml4_root);
-	free_page((unsigned long)mmu->pml5_root);
-}
-
 static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
 {
-	struct page *page;
 	int i;
 
 	mmu->root_hpa = INVALID_PAGE;
 	mmu->root_pgd = 0;
 	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
 		mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
-
-	/* vcpu->arch.guest_mmu isn't used when !tdp_enabled. */
-	if (!tdp_enabled && mmu == &vcpu->arch.guest_mmu)
-		return 0;
-
-	/*
-	 * When using PAE paging, the four PDPTEs are treated as 'root' pages,
-	 * while the PDP table is a per-vCPU construct that's allocated at MMU
-	 * creation.  When emulating 32-bit mode, cr3 is only 32 bits even on
-	 * x86_64.  Therefore we need to allocate the PDP table in the first
-	 * 4GB of memory, which happens to fit the DMA32 zone.  TDP paging
-	 * generally doesn't use PAE paging and can skip allocating the PDP
-	 * table.  The main exception, handled here, is SVM's 32-bit NPT.  The
-	 * other exception is for shadowing L1's 32-bit or PAE NPT on 64-bit
-	 * KVM; that horror is handled on-demand by mmu_alloc_special_roots().
-	 */
-	if (tdp_enabled && kvm_mmu_get_tdp_level(vcpu) > PT32E_ROOT_LEVEL)
-		return 0;
-
-	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_DMA32);
-	if (!page)
-		return -ENOMEM;
-
-	mmu->pae_root = page_address(page);
-
-	/*
-	 * CR3 is only 32 bits when PAE paging is used, thus it's impossible to
-	 * get the CPU to treat the PDPTEs as encrypted.  Decrypt the page so
-	 * that KVM's writes and the CPU's reads get along.  Note, this is
-	 * only necessary when using shadow paging, as 64-bit NPT can get at
-	 * the C-bit even when shadowing 32-bit NPT, and SME isn't supported
-	 * by 32-bit kernels (when KVM itself uses 32-bit NPT).
-	 */
-	if (!tdp_enabled)
-		set_memory_decrypted((unsigned long)mmu->pae_root, 1);
-	else
-		WARN_ON_ONCE(shadow_me_mask);
-
-	for (i = 0; i < 4; ++i)
-		mmu->pae_root[i] = INVALID_PAE_ROOT;
-
 	return 0;
 }
 
@@ -5667,7 +5396,6 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
 
 	return ret;
  fail_allocate_root:
-	free_mmu_pages(&vcpu->arch.guest_mmu);
 	return ret;
 }
 
@@ -6273,8 +6001,6 @@ unsigned long kvm_mmu_calculate_default_mmu_pages(struct kvm *kvm)
 void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
 {
 	kvm_mmu_unload(vcpu);
-	free_mmu_pages(&vcpu->arch.root_mmu);
-	free_mmu_pages(&vcpu->arch.guest_mmu);
 	mmu_free_memory_caches(vcpu);
 }
 
diff --git a/arch/x86/kvm/mmu/mmu_audit.c b/arch/x86/kvm/mmu/mmu_audit.c
index 6bbbf85b3e46..f5e8dabe13bf 100644
--- a/arch/x86/kvm/mmu/mmu_audit.c
+++ b/arch/x86/kvm/mmu/mmu_audit.c
@@ -53,31 +53,14 @@ static void __mmu_spte_walk(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 
 static void mmu_spte_walk(struct kvm_vcpu *vcpu, inspect_spte_fn fn)
 {
-	int i;
+	hpa_t root = vcpu->arch.mmu->root_hpa;
 	struct kvm_mmu_page *sp;
 
 	if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
 		return;
 
-	if (vcpu->arch.mmu->root_level >= PT64_ROOT_4LEVEL) {
-		hpa_t root = vcpu->arch.mmu->root_hpa;
-
-		sp = to_shadow_page(root);
-		__mmu_spte_walk(vcpu, sp, fn, vcpu->arch.mmu->shadow_root_level);
-		return;
-	}
-
-	for (i = 0; i < 4; ++i) {
-		hpa_t root = vcpu->arch.mmu->pae_root[i];
-
-		if (IS_VALID_PAE_ROOT(root)) {
-			root &= PT64_BASE_ADDR_MASK;
-			sp = to_shadow_page(root);
-			__mmu_spte_walk(vcpu, sp, fn, 2);
-		}
-	}
-
-	return;
+	sp = to_shadow_page(root);
+	__mmu_spte_walk(vcpu, sp, fn, vcpu->arch.mmu->shadow_root_level);
 }
 
 typedef void (*sp_handler) (struct kvm *kvm, struct kvm_mmu_page *sp);
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 014136e15b26..d71b562bf8f0 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -365,6 +365,16 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 	pte           = mmu->get_guest_pgd(vcpu);
 	have_ad       = PT_HAVE_ACCESSED_DIRTY(mmu);
 
+	/* kvm_mmu_get_page() will use these values when allocating a level
+	 * promoted shadow page.
+	 */
+	walker->table_gfn[4] = gpte_to_gfn(pte);
+	walker->pt_access[4] = ACC_ALL;
+	walker->table_gfn[3] = gpte_to_gfn(pte);
+	walker->pt_access[3] = ACC_ALL;
+	walker->table_gfn[2] = gpte_to_gfn(pte);
+	walker->pt_access[2] = ACC_ALL;
+
 #if PTTYPE == 64
 	walk_nx_mask = 1ULL << PT64_NX_SHIFT;
 	if (walker->level == PT32E_ROOT_LEVEL) {
@@ -710,7 +720,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		 * Verify that the gpte in the page we've just write
 		 * protected is still there.
 		 */
-		if (FNAME(gpte_changed)(vcpu, gw, it.level - 1))
+		if (it.level - 1 < top_level &&
+		    FNAME(gpte_changed)(vcpu, gw, it.level - 1))
 			goto out_gpte_changed;
 
 		if (sp)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 476b133544dd..822ff5d76b91 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -100,13 +100,8 @@ static inline bool is_tdp_mmu(struct kvm_mmu *mmu)
 	if (WARN_ON(!VALID_PAGE(hpa)))
 		return false;
 
-	/*
-	 * A NULL shadow page is legal when shadowing a non-paging guest with
-	 * PAE paging, as the MMU will be direct with root_hpa pointing at the
-	 * pae_root page, not a shadow page.
-	 */
 	sp = to_shadow_page(hpa);
-	return sp && is_tdp_mmu_page(sp) && sp->root_count;
+	return is_tdp_mmu_page(sp) && sp->root_count;
 }
 #else
 static inline bool kvm_mmu_init_tdp_mmu(struct kvm *kvm) { return false; }
-- 
2.19.1.6.gb485710b



* Re: [RFC PATCH 0/6] KVM: X86: Add and use shadow page with level promoted or acting as pae_root
  2021-12-10  9:25 [RFC PATCH 0/6] KVM: X86: Add and use shadow page with level promoted or acting as pae_root Lai Jiangshan
                   ` (5 preceding siblings ...)
  2021-12-10  9:25 ` [RFC PATCH 6/6] KVM: X86: Use level_promoted and pae_root shadow page for 32bit guests Lai Jiangshan
@ 2021-12-10 10:27 ` Maxim Levitsky
  6 siblings, 0 replies; 22+ messages in thread
From: Maxim Levitsky @ 2021-12-10 10:27 UTC (permalink / raw)
  To: Lai Jiangshan, linux-kernel, kvm, Paolo Bonzini, Sean Christopherson
  Cc: Lai Jiangshan

On Fri, 2021-12-10 at 17:25 +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> (Request For Help for testing on AMD machine with 32 bit L1 hypervisor,
> see information below)
> 
> [...]


I have a 32-bit VM which can run another 32-bit VM, and both it and the nested VM are using the mainline kernel.
I'll test this patch series soon.

I also have seabios hacked to use PAE instead of no paging (which I usually use for my 32-bit guests),
so I can make it switch to SMM+PAE paging mode to test that as well.

Best regards,
	Maxim Levitsky



* Re: [RFC PATCH 1/6] KVM: X86: Check root_level only in fast_pgd_switch()
  2021-12-10  9:25 ` [RFC PATCH 1/6] KVM: X86: Check root_level only in fast_pgd_switch() Lai Jiangshan
@ 2022-01-04 20:24   ` Sean Christopherson
  0 siblings, 0 replies; 22+ messages in thread
From: Sean Christopherson @ 2022-01-04 20:24 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, kvm, Paolo Bonzini, Lai Jiangshan,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin

On Fri, Dec 10, 2021, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> If root_level >= 4, shadow_root_level must be >= 4 too.

Please explain why so that it's explicitly clear why this is ok, e.g.

  Drop the shadow_root_level check when determining if a "fast" PGD switch
  is allowed, as KVM never shadows 64-bit guest page tables with PAE paging
  (32-bit KVM doesn't support 64-bit guests).

with that:

Reviewed-by: Sean Christopherson <seanjc@google.com>

> Checking only root_level can reduce a check.
>
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 11b06d536cc9..846a2e426e0b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4136,8 +4136,7 @@ static bool fast_pgd_switch(struct kvm_vcpu *vcpu, gpa_t new_pgd,
>  	 * having to deal with PDPTEs. We may add support for 32-bit hosts/VMs
>  	 * later if necessary.
>  	 */
> -	if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
> -	    mmu->root_level >= PT64_ROOT_4LEVEL)
> +	if (mmu->root_level >= PT64_ROOT_4LEVEL)
>  		return cached_root_available(vcpu, new_pgd, new_role);
>  
>  	return false;
> -- 
> 2.19.1.6.gb485710b
> 


* Re: [RFC PATCH 2/6] KVM: X86: Walk shadow page starting with shadow_root_level
  2021-12-10  9:25 ` [RFC PATCH 2/6] KVM: X86: Walk shadow page starting with shadow_root_level Lai Jiangshan
@ 2022-01-04 20:34   ` Sean Christopherson
  2022-01-04 20:37     ` Paolo Bonzini
  0 siblings, 1 reply; 22+ messages in thread
From: Sean Christopherson @ 2022-01-04 20:34 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, kvm, Paolo Bonzini, Lai Jiangshan,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin

On Fri, Dec 10, 2021, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> Walking from the root page of the shadow page table should start with
> the level of the shadow page table: shadow_root_level.
> 
> Also change a small defect in audit_mappings(), it is believed
> that the current walking level is more valuable to print.
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  arch/x86/kvm/mmu/mmu_audit.c | 5 ++---

I vote we remove mmu_audit.c.  It has bitrotted horribly, and none of the
current set of KVM developers even knows how to use it effectively.


* Re: [RFC PATCH 2/6] KVM: X86: Walk shadow page starting with shadow_root_level
  2022-01-04 20:34   ` Sean Christopherson
@ 2022-01-04 20:37     ` Paolo Bonzini
  2022-01-04 20:43       ` Maxim Levitsky
  0 siblings, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2022-01-04 20:37 UTC (permalink / raw)
  To: Sean Christopherson, Lai Jiangshan
  Cc: linux-kernel, kvm, Lai Jiangshan, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin

On 1/4/22 21:34, Sean Christopherson wrote:
> On Fri, Dec 10, 2021, Lai Jiangshan wrote:
>> From: Lai Jiangshan<laijs@linux.alibaba.com>
>>
>> Walking from the root page of the shadow page table should start with
>> the level of the shadow page table: shadow_root_level.
>>
>> Also fix a small defect in audit_mappings(); it is believed
>> that the current walking level is more valuable to print.
>>
>> Signed-off-by: Lai Jiangshan<laijs@linux.alibaba.com>
>> ---
>>   arch/x86/kvm/mmu/mmu_audit.c | 5 ++---
> 
> I vote we remove mmu_audit.c.  It has bitrotted horribly, and none of the
> current set of KVM developers even knows how to use it effectively.

No complaints.

Paolo


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/6] KVM: X86: Walk shadow page starting with shadow_root_level
  2022-01-04 20:37     ` Paolo Bonzini
@ 2022-01-04 20:43       ` Maxim Levitsky
  0 siblings, 0 replies; 22+ messages in thread
From: Maxim Levitsky @ 2022-01-04 20:43 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Lai Jiangshan
  Cc: linux-kernel, kvm, Lai Jiangshan, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin

On Tue, 2022-01-04 at 21:37 +0100, Paolo Bonzini wrote:
> On 1/4/22 21:34, Sean Christopherson wrote:
> > On Fri, Dec 10, 2021, Lai Jiangshan wrote:
> > > From: Lai Jiangshan<laijs@linux.alibaba.com>
> > > 
> > > Walking from the root page of the shadow page table should start with
> > > the level of the shadow page table: shadow_root_level.
> > > 
> > > Also fix a small defect in audit_mappings(); it is believed
> > > that the current walking level is more valuable to print.
> > > 
> > > Signed-off-by: Lai Jiangshan<laijs@linux.alibaba.com>
> > > ---
> > >   arch/x86/kvm/mmu/mmu_audit.c | 5 ++---
> > 
> > I vote we remove mmu_audit.c.  It has bitrotted horribly, and none of the
> > current set of KVM developers even knows how to use it effectively.
> 
> No complaints.

I played with it once, its not that bad IMHO.

Best regards,
	Maxim Levitsky

> 
> Paolo
> 



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 3/6] KVM: X86: Add arguement gfn and role to kvm_mmu_alloc_page()
  2021-12-10  9:25 ` [RFC PATCH 3/6] KVM: X86: Add arguement gfn and role to kvm_mmu_alloc_page() Lai Jiangshan
@ 2022-01-04 20:53   ` Sean Christopherson
  0 siblings, 0 replies; 22+ messages in thread
From: Sean Christopherson @ 2022-01-04 20:53 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, kvm, Paolo Bonzini, Lai Jiangshan,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin

On Fri, Dec 10, 2021, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> kvm_mmu_alloc_page() will access more bits of the role.
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 11 +++++------
>  1 file changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 846a2e426e0b..54e7cbc15380 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1734,13 +1734,13 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
>  	mmu_spte_clear_no_track(parent_pte);
>  }
>  
> -static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct)
> +static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, gfn_t gfn, union kvm_mmu_page_role role)

Please wrap the union param to a new line.  Checkpatch now warns at 100 chars, but
80 is still the guideline (checkpatch was tweaked so that people would stop adding
really awkward newlines just to honor the 80 char limit).
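E.g., something along these lines (purely a whitespace change to the signature above):

  static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, gfn_t gfn,
						 union kvm_mmu_page_role role)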

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 6/6] KVM: X86: Use level_promoted and pae_root shadow page for 32bit guests
  2021-12-10  9:25 ` [RFC PATCH 6/6] KVM: X86: Use level_promoted and pae_root shadow page for 32bit guests Lai Jiangshan
@ 2022-01-04 20:55   ` Sean Christopherson
  0 siblings, 0 replies; 22+ messages in thread
From: Sean Christopherson @ 2022-01-04 20:55 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, kvm, Paolo Bonzini, Lai Jiangshan,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin

On Fri, Dec 10, 2021, Lai Jiangshan wrote:
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 476b133544dd..822ff5d76b91 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -100,13 +100,8 @@ static inline bool is_tdp_mmu(struct kvm_mmu *mmu)
>  	if (WARN_ON(!VALID_PAGE(hpa)))
>  		return false;
>  
> -	/*
> -	 * A NULL shadow page is legal when shadowing a non-paging guest with
> -	 * PAE paging, as the MMU will be direct with root_hpa pointing at the
> -	 * pae_root page, not a shadow page.
> -	 */
>  	sp = to_shadow_page(hpa);
> -	return sp && is_tdp_mmu_page(sp) && sp->root_count;
> +	return is_tdp_mmu_page(sp) && sp->root_count;
>  }

is_page_fault_stale() can get similar treatment

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e8c69c2dfbd9..9ff8e228b55e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3855,19 +3855,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
 {
        struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root_hpa);

-       /* Special roots, e.g. pae_root, are not backed by shadow pages. */
-       if (sp && is_obsolete_sp(vcpu->kvm, sp))
-               return true;
-
-       /*
-        * Roots without an associated shadow page are considered invalid if
-        * there is a pending request to free obsolete roots.  The request is
-        * only a hint that the current root _may_ be obsolete and needs to be
-        * reloaded, e.g. if the guest frees a PGD that KVM is tracking as a
-        * previous root, then __kvm_mmu_prepare_zap_page() signals all vCPUs
-        * to reload even if no vCPU is actively using the root.
-        */
-       if (!sp && kvm_test_request(KVM_REQ_MMU_RELOAD, vcpu))
+       if (is_obsolete_sp(vcpu->kvm, sp))
                return true;

        return fault->slot &&

>  #else
>  static inline bool kvm_mmu_init_tdp_mmu(struct kvm *kvm) { return false; }
> -- 
> 2.19.1.6.gb485710b
> 

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 5/6] KVM: X86: Alloc pae_root shadow page
  2021-12-10  9:25 ` [RFC PATCH 5/6] KVM: X86: Alloc pae_root shadow page Lai Jiangshan
@ 2022-01-04 21:54   ` Sean Christopherson
  2022-01-05  3:11     ` Lai Jiangshan
  0 siblings, 1 reply; 22+ messages in thread
From: Sean Christopherson @ 2022-01-04 21:54 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, kvm, Paolo Bonzini, Lai Jiangshan,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin

On Fri, Dec 10, 2021, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> Currently pae_root is a special root page; this patch adds a facility to
> allow using kvm_mmu_get_page() to allocate a pae_root shadow page.
> 
> When kvm_mmu_get_page() is called for level == PT32E_ROOT_LEVEL and
> vcpu->arch.mmu->shadow_root_level == PT32E_ROOT_LEVEL, it will get a
> DMA32 root page with default PAE pdptes installed.
> 
> The pae_root bit is needed in the page role because:
> 	it is required to be DMA32 page.

Please try to avoid ambiguous pronouns; "it" here does not refer to any of the
nouns that were established in the initial sentence, e.g.

        PAE roots must be allocated below 4gb (CR3 has only 32 bits).

> 	its first 4 sptes are initialized with default_pae_pdpte.
> 
> default_pae_pdpte is needed because the cpu expects PAE pdptes to be
> present at VM-entry.

That's incorrect.  Neither Intel nor AMD require PDPTEs to be present.  Not present
is perfectly ok, present with reserved bits is what's not allowed.

Intel SDM:
  A VM entry that checks the validity of the PDPTEs uses the same checks that are
  used when CR3 is loaded with MOV to CR3 when PAE paging is in use[7].  If MOV to CR3
  would cause a general-protection exception due to the PDPTEs that would be loaded
  (e.g., because a reserved bit is set), the VM entry fails.
  
  7. This implies that (1) bits 11:9 in each PDPTE are ignored; and (2) if bit 0
     (present) is clear in one of the PDPTEs, bits 63:1 of that PDPTE are ignored.

AMD APM:
  On current AMD64 processors, in native mode, these four entries are unconditionally
  loaded into the table walk cache whenever CR3 is written with the PDPT base address,
  and remain locked in. At this point they are also checked for reserved bit violations,
  and if such violations are present a general-protection exception (#GP) occurs.

  Under SVM, however, when the processor is in guest mode with PAE enabled, the
  guest PDPT entries are not cached or validated at this point, but instead are loaded
  and checked on demand in the normal course of address translation, just like page
  directory and page table entries. Any reserved bit violations are detected at the point
  of use, and result in a page-fault (#PF) exception rather than a Page Translation and
  Protection general-protection (#GP) exception.

> default_pae_pdpte is designed to have no
> SPTE_MMU_PRESENT_MASK so that it is present in the view of the CPU but not
> present in the view of shadow paging, and the page fault handler will
> replace it with a real present shadow page.

> 
> When changing from default_pae_pdpte to a present spte, no tlb flushing
> is requested, although both are present in the view of the CPU.  The reason
> is that default_pae_pdpte points to a zero page, so no pte is present even
> if the paging structure is cached.
> 
> No functional change, since this code is not activated yet: when
> vcpu->arch.mmu->shadow_root_level == PT32E_ROOT_LEVEL, kvm_mmu_get_page()
> is currently only called for level == 1 or 2.
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  arch/x86/include/asm/kvm_host.h |   4 +-
>  arch/x86/kvm/mmu/mmu.c          | 113 +++++++++++++++++++++++++++++++-
>  arch/x86/kvm/mmu/paging_tmpl.h  |   1 +
>  3 files changed, 114 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 6465c83794fc..82a8844f80ac 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -335,7 +335,8 @@ union kvm_mmu_page_role {
>  		unsigned ad_disabled:1;
>  		unsigned guest_mode:1;
>  		unsigned level_promoted:1;
> -		unsigned :5;
> +		unsigned pae_root:1;
> +		unsigned :4;
>  
>  		/*
>  		 * This is left at the top of the word so that
> @@ -695,6 +696,7 @@ struct kvm_vcpu_arch {
>  	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
>  	struct kvm_mmu_memory_cache mmu_gfn_array_cache;
>  	struct kvm_mmu_memory_cache mmu_page_header_cache;
> +	unsigned long mmu_pae_root_cache;

Hmm, I think it will be easier to understand exactly what this "cache" holds if
the cache is tracked as a pointer, either the page_address() "void *" or maybe
even the original "struct page".  A "void *" is probably the way to go; less code
than tracking "struct page" and kvm_mmu_memory_cache_alloc() also returns "void *".
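E.g., a rough sketch of what that might look like (field name kept as-is, purely illustrative):

	/* in struct kvm_vcpu_arch */
	void *mmu_pae_root_cache;

	/* then e.g. in kvm_mmu_alloc_page(), no casts needed */
	sp->spt = vcpu->arch.mmu_pae_root_cache;
	vcpu->arch.mmu_pae_root_cache = NULL;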

>  	/*
>  	 * QEMU userspace and the guest each have their own FPU state.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4769253e9024..0d2976dad863 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -724,6 +724,67 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
>  	}
>  }
>  
> +static u64 default_pae_pdpte;
> +
> +static void free_default_pae_pdpte(void)
> +{
> +	free_page((unsigned long)__va(default_pae_pdpte & PAGE_MASK));
> +	default_pae_pdpte = 0;
> +}
> +
> +static int alloc_default_pae_pdpte(void)
> +{
> +	unsigned long p = __get_free_page(GFP_KERNEL | __GFP_ZERO);
> +
> +	if (!p)
> +		return -ENOMEM;
> +	default_pae_pdpte = __pa(p) | PT_PRESENT_MASK | shadow_me_mask;
> +	if (WARN_ON(is_shadow_present_pte(default_pae_pdpte) ||
> +		    is_mmio_spte(default_pae_pdpte))) {
> +		free_default_pae_pdpte();
> +		return -EINVAL;
> +	}
> +	return 0;
> +}
> +
> +static int alloc_pae_root(struct kvm_vcpu *vcpu)

I'd prefer we name this mmu_topup_pae_root_cache().  "alloc_pae_root" implies the
helper returns the allocated root.

> +{
> +	struct page *page;
> +	unsigned long pae_root;
> +	u64* pdpte;
> +
> +	if (vcpu->arch.mmu->shadow_root_level != PT32E_ROOT_LEVEL)
> +		return 0;
> +	if (vcpu->arch.mmu_pae_root_cache)
> +		return 0;
> +
> +	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_DMA32);
> +	if (!page)
> +		return -ENOMEM;
> +
> +	pae_root = (unsigned long)page_address(page);
> +
> +	/*
> +	 * CR3 is only 32 bits when PAE paging is used, thus it's impossible to
> +	 * get the CPU to treat the PDPTEs as encrypted.  Decrypt the page so
> +	 * that KVM's writes and the CPU's reads get along.  Note, this is
> +	 * only necessary when using shadow paging, as 64-bit NPT can get at
> +	 * the C-bit even when shadowing 32-bit NPT, and SME isn't supported
> +	 * by 32-bit kernels (when KVM itself uses 32-bit NPT).
> +	 */
> +	if (!tdp_enabled)
> +		set_memory_decrypted(pae_root, 1);
> +	else
> +		WARN_ON_ONCE(shadow_me_mask);
> +	vcpu->arch.mmu_pae_root_cache = pae_root;
> +	pdpte = (void *)pae_root;
> +	pdpte[0] = default_pae_pdpte;
> +	pdpte[1] = default_pae_pdpte;
> +	pdpte[2] = default_pae_pdpte;
> +	pdpte[3] = default_pae_pdpte;
> +	return 0;

Assuming I'm not missing something regarding default_pae_pdpte, this can be:

	struct page *page;

	if (vcpu->arch.mmu->shadow_root_level != PT32E_ROOT_LEVEL)
		return 0;
	if (vcpu->arch.mmu_pae_root_cache)
		return 0;

	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_DMA32);
	if (!page)
		return -ENOMEM;

	vcpu->arch.mmu_pae_root_cache = page_address(page);

	/*
	 * CR3 is only 32 bits when PAE paging is used, thus it's impossible to
	 * get the CPU to treat the PDPTEs as encrypted.  Decrypt the page so
	 * that KVM's writes and the CPU's reads get along.  Note, this is
	 * only necessary when using shadow paging, as 64-bit NPT can get at
	 * the C-bit even when shadowing 32-bit NPT, and SME isn't supported
	 * by 32-bit kernels (when KVM itself uses 32-bit NPT).
	 */
	if (!tdp_enabled)
		set_memory_decrypted((unsigned long)vcpu->arch.mmu_pae_root_cache, 1);
	else
		WARN_ON_ONCE(shadow_me_mask);
	return 0;


> +}
> +
>  static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
>  {
>  	int r;
> @@ -735,6 +796,9 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
>  		return r;
>  	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
>  				       PT64_ROOT_MAX_LEVEL);
> +	if (r)
> +		return r;
> +	r = alloc_pae_root(vcpu);
>  	if (r)
>  		return r;
>  	if (maybe_indirect) {
> @@ -753,6 +817,10 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
>  	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
>  	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
>  	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> +	if (!tdp_enabled && vcpu->arch.mmu_pae_root_cache)
> +		set_memory_encrypted(vcpu->arch.mmu_pae_root_cache, 1);
> +	free_page(vcpu->arch.mmu_pae_root_cache);
> +	vcpu->arch.mmu_pae_root_cache = 0;

If this ends up being a void *, set to NULL intead of 0.

>  }
>  
>  static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_vcpu *vcpu)
> @@ -1706,6 +1774,8 @@ static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
>  	MMU_WARN_ON(!is_empty_shadow_page(sp->spt));
>  	hlist_del(&sp->hash_link);
>  	list_del(&sp->link);
> +	if (!tdp_enabled && sp->role.pae_root)
> +		set_memory_encrypted((unsigned long)sp->spt, 1);

Hmm, I wonder if it would be better to do:

	if (sp->role.pae_root)
		mmu_free_pae_root(sp->spt);
	else
		free_page((unsigned long)sp->spt);

and then share the helper with mmu_free_memory_caches(), e.g.

	if (vcpu->arch.mmu_pae_root_cache) {
		mmu_free_pae_root(vcpu->arch.mmu_pae_root_cache);
		vcpu->arch.mmu_pae_root_cache = NULL;
	}

This is perfectly ok, but I think it would be easy for someone to misread
mmu_free_memory_caches() and think that the free_page() there could be inside the
if statement.  Having this in a helper:

	if (!tdp_enabled)
		set_memory_encrypted((unsigned long)root, 1);
	free_page((unsigned long)root);

makes it a bit more obvious that re-encrypting memory is a !TDP specific action
that's on top of the "applies to everything" free_page().
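Put together, a sketch of such a helper (name from the snippet above, taking the "void *"
suggested earlier):

	static void mmu_free_pae_root(void *root)
	{
		if (!tdp_enabled)
			set_memory_encrypted((unsigned long)root, 1);
		free_page((unsigned long)root);
	}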

>  	free_page((unsigned long)sp->spt);
>  	if (!sp->role.direct && !sp->role.level_promoted)
>  		free_page((unsigned long)sp->gfns);
> @@ -1735,8 +1805,13 @@ static void mmu_page_remove_parent_pte(struct kvm_mmu_page *sp,
>  static void drop_parent_pte(struct kvm_mmu_page *sp,
>  			    u64 *parent_pte)
>  {
> +	struct kvm_mmu_page *parent_sp = sptep_to_sp(parent_pte);
> +
>  	mmu_page_remove_parent_pte(sp, parent_pte);
> -	mmu_spte_clear_no_track(parent_pte);
> +	if (!parent_sp->role.pae_root)
> +		mmu_spte_clear_no_track(parent_pte);
> +	else
> +		__update_clear_spte_fast(parent_pte, default_pae_pdpte);
>  }
>  
>  static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, gfn_t gfn, union kvm_mmu_page_role role)
> @@ -1744,7 +1819,12 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, gfn_t gfn,
>  	struct kvm_mmu_page *sp;
>  
>  	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> -	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> +	if (!role.pae_root) {
> +		sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> +	} else {
> +		sp->spt = (void *)vcpu->arch.mmu_pae_root_cache;
> +		vcpu->arch.mmu_pae_root_cache = 0;
> +	}
>  	if (!(role.direct || role.level_promoted))
>  		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
>  	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> @@ -2091,6 +2171,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  	}
>  	if (role.level_promoted && (level <= vcpu->arch.mmu->root_level))
>  		role.level_promoted = 0;
> +	if (role.pae_root && (level < PT32E_ROOT_LEVEL))

Hmm, for both these checks, I think it's easier to read if the first test is
dropped, i.e.

	if (level <= vcpu->arch.mmu->root_level)
		role.level_promoted = 0;
	if (level != PT32E_ROOT_LEVEL)
		role.pae_root = 0;
		
> +		role.pae_root = 0;
>  
>  	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
>  	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> @@ -2226,14 +2308,27 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
>  	__shadow_walk_next(iterator, *iterator->sptep);
>  }
>  
> +static u64 make_pae_pdpte(u64 *child_pt)
> +{
> +	u64 spte = __pa(child_pt) | PT_PRESENT_MASK;
> +
> +	/* The only ignore bits in PDPTE are 11:9. */

Yeah, stupid PAE paging. :-/

> +	BUILD_BUG_ON(!(GENMASK(11,9) & SPTE_MMU_PRESENT_MASK));
> +	return spte | SPTE_MMU_PRESENT_MASK | shadow_me_mask;

I'd say drop "spte" and just do:

	return __pa(child_pt) | PT_PRESENT_MASK | SPTE_MMU_PRESENT_MASK |
	       shadow_me_mask;

> +}
> +
>  static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
>  			     struct kvm_mmu_page *sp)
>  {
> +	struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
>  	u64 spte;
>  
>  	BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK);
>  
> -	spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
> +	if (!parent_sp->role.pae_root)
> +		spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
> +	else
> +		spte = make_pae_pdpte(sp->spt);
>  
>  	mmu_spte_set(sptep, spte);
>  
> @@ -4733,6 +4828,8 @@ kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
>  	role.base.level = kvm_mmu_get_tdp_level(vcpu);
>  	role.base.direct = true;
>  	role.base.has_4_byte_gpte = false;
> +	if (role.base.level == PT32E_ROOT_LEVEL)
> +		role.base.pae_root = 1;
>  
>  	return role;
>  }
> @@ -4798,6 +4895,9 @@ kvm_calc_shadow_mmu_root_page_role(struct kvm_vcpu *vcpu,
>  	else
>  		role.base.level = PT64_ROOT_4LEVEL;
>  
> +	if (!____is_cr0_pg(regs) || !____is_efer_lma(regs))

The CR0.PG check is redundant as EFER.LMA cannot be '1' if CR0.PG=0.  It's confusing
because the role.base.level calculation above omits the CR0.PG check.  Speaking of
which, I think this would be clearer:

	if (role.base.level == PT32E_ROOT_LEVEL)
		role.base.pae_root = 1;

Alternatively, it could be folded into the level calculation, e.g.

	if (!____is_efer_lma(regs)) {
		role.base.level = PT32E_ROOT_LEVEL;
		role.base.pae_root = 1;
		if (____is_cr0_pg(regs) && !____is_cr4_pse(regs))
			role.base.level_promoted = 1;
	} else if (____is_cr4_la57(regs)) {
		role.base.level = PT64_ROOT_5LEVEL;
	} else {
		role.base.level = PT64_ROOT_4LEVEL;
	}

though I think I prefer to keep that purely a level calculation.

> +		role.base.pae_root = 1;
> +
>  	return role;
>  }
>  
> @@ -4845,6 +4945,8 @@ kvm_calc_shadow_npt_root_page_role(struct kvm_vcpu *vcpu,
>  	role.base.level = kvm_mmu_get_tdp_level(vcpu);
>  	if (role.base.level > role_regs_to_root_level(regs))
>  		role.base.level_promoted = 1;
> +	if (role.base.level == PT32E_ROOT_LEVEL)
> +		role.base.pae_root = 1;

I almost wonder if we'd be better off having a "root" flag, and then a helper:

  static bool is_pae_root(union kvm_mmu_page_role role)
  {
  	return role.root && role.level == PT32E_ROOT_LEVEL;
  }

I believe the only collateral damage would be that it'd prevent reusing a 4-level
root in a 5-level paging structure, but that's not necessarily a bad thing.

>  
>  	return role;
>  }
> @@ -6133,6 +6235,10 @@ int kvm_mmu_module_init(void)
>  	if (ret)
>  		goto out;
>  
> +	ret = alloc_default_pae_pdpte();
> +	if (ret)
> +		goto out;
> +
>  	return 0;
>  
>  out:
> @@ -6174,6 +6280,7 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
>  
>  void kvm_mmu_module_exit(void)
>  {
> +	free_default_pae_pdpte();
>  	mmu_destroy_caches();
>  	percpu_counter_destroy(&kvm_total_used_mmu_pages);
>  	unregister_shrinker(&mmu_shrinker);
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 16ac276d342a..014136e15b26 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -1044,6 +1044,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>  		.access = 0x7,
>  		.quadrant = 0x3,
>  		.level_promoted = 0x1,
> +		.pae_root = 0x1,
>  	};
>  
>  	/*
> -- 
> 2.19.1.6.gb485710b
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 4/6] KVM: X86: Introduce role.level_promoted
  2021-12-10  9:25 ` [RFC PATCH 4/6] KVM: X86: Introduce role.level_promoted Lai Jiangshan
@ 2022-01-04 22:14   ` Sean Christopherson
  2022-02-11 16:06     ` Paolo Bonzini
  0 siblings, 1 reply; 22+ messages in thread
From: Sean Christopherson @ 2022-01-04 22:14 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, kvm, Paolo Bonzini, Lai Jiangshan,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin

On Fri, Dec 10, 2021, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> Level pormotion occurs when mmu->shadow_root_level > mmu->root_level.

s/pormotion/promotion (in multiple places)

That said, I strongly prefer "passthrough" or maybe "nop" over "level_promoted".
Promoted in the context of page level usually refers to making a hugepage, which
is not the case here.

> 
> There are several cases that can cause level pormotion:
> 
> shadow mmu (shadow paging for 32 bit guest):
> 	case1:	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=0
> 
> shadow nested NPT (for 32bit L1 hypervisor):
> 	case2:	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=0,hEFER_LMA=0
> 	case3:	gCR0_PG=1,gEFER_LMA=0,hEFER_LMA=1
> 
> shadow nested NPT (for 64bit L1 hypervisor):
> 	case4:	gEFER_LMA=1,gCR4_LA57=0,hEFER_LMA=1,hCR4_LA57=1
> 
> When level pormotion occurs (32bit guest, case1-3), special roots are
> often used.  But case4 is not using special roots.  It uses a shadow page
> without being fully aware of the specialty.  It might work accidentally:
> 	1) The root_page (root_sp->spt) is allocated with level = 5,
> 	   and root_sp->spt[0] is allocated with the same gfn and the
> 	   same role except role.level = 4.  Luckily they are
> 	   different shadow pages.
> 	2) FNAME(walk_addr_generic) sets walker->table_gfn[4] and
> 	   walker->pt_access[4], which are normally unused when
> 	   mmu->shadow_root_level == mmu->root_level == 4, so that
> 	   FNAME(fetch) can use them to allocate shadow page for
> 	   root_sp->spt[0] and link them when shadow_root_level == 5.
> 
> But it has problems.
> If the guest switches from gCR4_LA57=0 to gCR4_LA57=1 (or vice versa)
> and uses the same gfn as the root for the nNPT before and after
> switching gCR4_LA57, the host (hCR4_LA57=1) would use the same root_sp
> for the guest even though the guest switched gCR4_LA57.  The guest will
> see unexpected pages mapped and L2 can hurt L1.  It is lucky that the
> problem can't hurt L0.
> 
> The root_sp should be like role.direct=1 sometimes: its contents are
> not backed by gptes, root_sp->gfns is meaningless.  For a normal high
> level sp, sp->gfns is often unused and kept zero, but it could be
> relevant and meaningful when sp->gfns is used because they are backed
> by concrete gptes.  For the level-promoted root_sp described before, root_sp
> is just a portal to contribute root_sp->spt[0], and root_sp should not
> have root_sp->gfns and root_sp->spt[0] should not be dropped if gpte[0]
> of the root gfn is changed.
> 
> This patch adds role.level_promoted to address the two problems.
> role.level_promoted is set when shadow paging is in use and
> role.level > gMMU.level.
> 
> An alternative way to fix the problem of case4 is to also use the
> special root pml5_root for it.  But that would require changing many
> other places because it is assumed that special roots are only used
> for 32bit guests.
> 
> This patch also paves the way to use level promoted shadow page for
> case1-3, but that requires special handling of PAE paging, so the
> extensive usage of it is not included.
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  3 ++-
>  arch/x86/kvm/mmu/mmu.c          | 15 +++++++++++++--
>  arch/x86/kvm/mmu/paging_tmpl.h  |  1 +
>  3 files changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 88ecf53f0d2b..6465c83794fc 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -334,7 +334,8 @@ union kvm_mmu_page_role {
>  		unsigned smap_andnot_wp:1;
>  		unsigned ad_disabled:1;
>  		unsigned guest_mode:1;
> -		unsigned :6;
> +		unsigned level_promoted:1;
> +		unsigned :5;

The massive comment above this union needs to be updated.

>  		/*
>  		 * This is left at the top of the word so that
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 54e7cbc15380..4769253e9024 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -767,6 +767,9 @@ static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
>  
>  static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
>  {
> +	if (sp->role.level_promoted)
> +		return sp->gfn;
> +
>  	if (!sp->role.direct)
>  		return sp->gfns[index];
>  
> @@ -776,6 +779,8 @@ static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
>  static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
>  {
>  	if (!sp->role.direct) {
> +		if (WARN_ON_ONCE(sp->role.level_promoted && gfn != sp->gfn))
> +			return;
>  		sp->gfns[index] = gfn;

This is wrong, sp->gfns is NULL when sp->role.level_promoted is true.  I believe
you want something similar to the "get" flow, e.g.

	if (sp->role.passthrough) {
		WARN_ON_ONCE(gfn != sp->gfn);
		return;
	}

	if (!sp->role.direct) {
		...
	}

Alternatively, should we mark passthrough shadow pages as direct=1?  That would
naturally handle this code, and for things like reexecute_instruction()'s usage
of kvm_mmu_unprotect_page(), I don't think passthrough shadow pages should be
considered indirect, e.g. zapping them won't help and the shadow page can't become
unsync.

>  		return;
>  	}
> @@ -1702,7 +1707,7 @@ static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
>  	hlist_del(&sp->hash_link);
>  	list_del(&sp->link);
>  	free_page((unsigned long)sp->spt);
> -	if (!sp->role.direct)
> +	if (!sp->role.direct && !sp->role.level_promoted)

Hmm, at this point, it may be better to just omit the check entirely.  @sp is
zero allocated and free_page() plays nice with NULL "pointers".  I believe TDX
and maybe SNP support will need to do more work here, i.e. may add back a similar
check, but if so then checking sp->gfns directly for !NULL is preferable.
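I.e., a sketch of what the tail of kvm_mmu_free_page() could look like for now, relying on
free_page() being a nop for a zero/NULL address:

	free_page((unsigned long)sp->spt);
	free_page((unsigned long)sp->gfns);
	kmem_cache_free(mmu_page_header_cache, sp);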

>  		free_page((unsigned long)sp->gfns);
>  	kmem_cache_free(mmu_page_header_cache, sp);
>  }
> @@ -1740,7 +1745,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, gfn_t gfn,
>  
>  	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
>  	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> -	if (!role.direct)
> +	if (!(role.direct || role.level_promoted))

I personally prefer the style of the previous check, mainly because I'm horrible
at reading !(x || y).

	if (!role.direct && !role.passthrough)

>  		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
>  	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
>  
> @@ -2084,6 +2089,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>  		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
>  		role.quadrant = quadrant;
>  	}
> +	if (role.level_promoted && (level <= vcpu->arch.mmu->root_level))
> +		role.level_promoted = 0;
>  
>  	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
>  	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
> @@ -4836,6 +4843,8 @@ kvm_calc_shadow_npt_root_page_role(struct kvm_vcpu *vcpu,
>  
>  	role.base.direct = false;
>  	role.base.level = kvm_mmu_get_tdp_level(vcpu);
> +	if (role.base.level > role_regs_to_root_level(regs))
> +		role.base.level_promoted = 1;
>  
>  	return role;
>  }
> @@ -5228,6 +5237,8 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
>  	kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE);
>  
>  	for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn) {
> +		if (sp->role.level_promoted)
> +			continue;
>  		if (detect_write_misaligned(sp, gpa, bytes) ||
>  		      detect_write_flooding(sp)) {
>  			kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list);
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 5c78300fc7d9..16ac276d342a 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -1043,6 +1043,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>  		.level = 0xf,
>  		.access = 0x7,
>  		.quadrant = 0x3,
> +		.level_promoted = 0x1,
>  	};
>  
>  	/*
> -- 
> 2.19.1.6.gb485710b
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 5/6] KVM: X86: Alloc pae_root shadow page
  2022-01-04 21:54   ` Sean Christopherson
@ 2022-01-05  3:11     ` Lai Jiangshan
  2022-01-05 16:45       ` Sean Christopherson
  0 siblings, 1 reply; 22+ messages in thread
From: Lai Jiangshan @ 2022-01-05  3:11 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, kvm, Paolo Bonzini, Lai Jiangshan, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, X86 ML,
	H. Peter Anvin

On Wed, Jan 5, 2022 at 5:54 AM Sean Christopherson <seanjc@google.com> wrote:

> >
> > default_pae_pdpte is needed because the cpu expects PAE pdptes to be
> > present at VM-entry.
>
> That's incorrect.  Neither Intel nor AMD require PDPTEs to be present.  Not present
> is perfectly ok, present with reserved bits is what's not allowed.
>
> Intel SDM:
>   A VM entry that checks the validity of the PDPTEs uses the same checks that are
>   used when CR3 is loaded with MOV to CR3 when PAE paging is in use[7].  If MOV to CR3
>   would cause a general-protection exception due to the PDPTEs that would be loaded
>   (e.g., because a reserved bit is set), the VM entry fails.
>
>   7. This implies that (1) bits 11:9 in each PDPTE are ignored; and (2) if bit 0
>      (present) is clear in one of the PDPTEs, bits 63:1 of that PDPTE are ignored.

But in practice, the VM entry fails if the present bit is not set in
the PDPTE for
the linear address being accessed (when EPT enabled at least).  The host kvm
complains and dumps the vmcs state.

Setting a default pdpte is the simplest way to solve it.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 5/6] KVM: X86: Alloc pae_root shadow page
  2022-01-05  3:11     ` Lai Jiangshan
@ 2022-01-05 16:45       ` Sean Christopherson
  2022-01-06  2:01         ` Lai Jiangshan
  0 siblings, 1 reply; 22+ messages in thread
From: Sean Christopherson @ 2022-01-05 16:45 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: LKML, kvm, Paolo Bonzini, Lai Jiangshan, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, X86 ML,
	H. Peter Anvin

On Wed, Jan 05, 2022, Lai Jiangshan wrote:
> On Wed, Jan 5, 2022 at 5:54 AM Sean Christopherson <seanjc@google.com> wrote:
> 
> > >
> > > default_pae_pdpte is needed because the cpu expects PAE pdptes to be
> > > present at VM-entry.
> >
> > That's incorrect.  Neither Intel nor AMD require PDPTEs to be present.  Not present
> > is perfectly ok, present with reserved bits is what's not allowed.
> >
> > Intel SDM:
> >   A VM entry that checks the validity of the PDPTEs uses the same checks that are
> >   used when CR3 is loaded with MOV to CR3 when PAE paging is in use[7].  If MOV to CR3
> >   would cause a general-protection exception due to the PDPTEs that would be loaded
> >   (e.g., because a reserved bit is set), the VM entry fails.
> >
> >   7. This implies that (1) bits 11:9 in each PDPTE are ignored; and (2) if bit 0
> >      (present) is clear in one of the PDPTEs, bits 63:1 of that PDPTE are ignored.
> 
> But in practice, the VM entry fails if the present bit is not set in the
> PDPTE for the linear address being accessed (when EPT enabled at least).  The
> host kvm complains and dumps the vmcs state.

That doesn't make any sense.  If EPT is enabled, KVM should never use a pae_root.
The vmcs.GUEST_PDPTRn fields are in play, but those shouldn't derive from KVM's
shadow page tables.

And I doubt there is a VMX ucode bug at play, as KVM currently uses '0' in its
shadow page tables for not-present PDPTEs.

If you can post/provide the patches that lead to VM-Fail, I'd be happy to help
debug.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 5/6] KVM: X86: Alloc pae_root shadow page
  2022-01-05 16:45       ` Sean Christopherson
@ 2022-01-06  2:01         ` Lai Jiangshan
  2022-01-06 19:41           ` Sean Christopherson
  0 siblings, 1 reply; 22+ messages in thread
From: Lai Jiangshan @ 2022-01-06  2:01 UTC (permalink / raw)
  To: Sean Christopherson, Lai Jiangshan
  Cc: LKML, kvm, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, X86 ML, H. Peter Anvin



On 2022/1/6 00:45, Sean Christopherson wrote:
> On Wed, Jan 05, 2022, Lai Jiangshan wrote:
>> On Wed, Jan 5, 2022 at 5:54 AM Sean Christopherson <seanjc@google.com> wrote:
>>
>>>>
>>>> default_pae_pdpte is needed because the cpu expects PAE pdptes to be
>>>> present at VM-entry.
>>>
>>> That's incorrect.  Neither Intel nor AMD require PDPTEs to be present.  Not present
>>> is perfectly ok, present with reserved bits is what's not allowed.
>>>
>>> Intel SDM:
>>>    A VM entry that checks the validity of the PDPTEs uses the same checks that are
>>>    used when CR3 is loaded with MOV to CR3 when PAE paging is in use[7].  If MOV to CR3
>>>    would cause a general-protection exception due to the PDPTEs that would be loaded
>>>    (e.g., because a reserved bit is set), the VM entry fails.
>>>
>>>    7. This implies that (1) bits 11:9 in each PDPTE are ignored; and (2) if bit 0
>>>       (present) is clear in one of the PDPTEs, bits 63:1 of that PDPTE are ignored.
>>
>> But in practice, the VM entry fails if the present bit is not set in the
>> PDPTE for the linear address being accessed (when EPT enabled at least).  The
>> host kvm complains and dumps the vmcs state.
> 
> That doesn't make any sense.  If EPT is enabled, KVM should never use a pae_root.
> The vmcs.GUEST_PDPTRn fields are in play, but those shouldn't derive from KVM's
> shadow page tables.

Oh, I wrote the negative of what I wanted to say again when I tried to emphasize
something after I wrote a sentence and modified it several times.

I wanted to mean "EPT not enabled" when vmx.

The VM entry fails when the guest is in a very early stage of booting, when it
might still be in real mode.

VMEXIT: intr_info=00000000 errorcode=0000000 ilen=00000000
reason=80000021 qualification=0000000000000002

IDTVectoring: info=00000000 errorcode=00000000

> 
> And I doubt there is a VMX ucode bug at play, as KVM currently uses '0' in its
> shadow page tables for not-present PDPTEs.
> 
> If you can post/provide the patches that lead to VM-Fail, I'd be happy to help
> debug.

If you can try this patchset, you can just set the default_pae_pdpte to 0 to test
it.

If you can't try this patchset, mmu->pae_root could be modified instead
to test it.

I guess the vmx fails to translate %rip at VM-entry in this case.




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 5/6] KVM: X86: Alloc pae_root shadow page
  2022-01-06  2:01         ` Lai Jiangshan
@ 2022-01-06 19:41           ` Sean Christopherson
  2022-01-07  4:36             ` Lai Jiangshan
  0 siblings, 1 reply; 22+ messages in thread
From: Sean Christopherson @ 2022-01-06 19:41 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Lai Jiangshan, LKML, kvm, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, X86 ML,
	H. Peter Anvin

On Thu, Jan 06, 2022, Lai Jiangshan wrote:
> 
> 
> On 2022/1/6 00:45, Sean Christopherson wrote:
> > On Wed, Jan 05, 2022, Lai Jiangshan wrote:
> > > On Wed, Jan 5, 2022 at 5:54 AM Sean Christopherson <seanjc@google.com> wrote:
> > > 
> > > > > 
> > > > > default_pae_pdpte is needed because the cpu expects PAE pdptes to be
> > > > > present at VM-entry.
> > > > 
> > > > That's incorrect.  Neither Intel nor AMD require PDPTEs to be present.  Not present
> > > > is perfectly ok, present with reserved bits is what's not allowed.
> > > > 
> > > > Intel SDM:
> > > >    A VM entry that checks the validity of the PDPTEs uses the same checks that are
> > > >    used when CR3 is loaded with MOV to CR3 when PAE paging is in use[7].  If MOV to CR3
> > > >    would cause a general-protection exception due to the PDPTEs that would be loaded
> > > >    (e.g., because a reserved bit is set), the VM entry fails.
> > > > 
> > > >    7. This implies that (1) bits 11:9 in each PDPTE are ignored; and (2) if bit 0
> > > >       (present) is clear in one of the PDPTEs, bits 63:1 of that PDPTE are ignored.
> > > 
> > > But in practice, the VM entry fails if the present bit is not set in the
> > > PDPTE for the linear address being accessed (when EPT enabled at least).  The
> > > host kvm complains and dumps the vmcs state.
> > 
> > That doesn't make any sense.  If EPT is enabled, KVM should never use a pae_root.
> > The vmcs.GUEST_PDPTRn fields are in play, but those shouldn't derive from KVM's
> > shadow page tables.
> 
> Oh, I wrote the negative of what I wanted to say again when I tried to emphasize
> something after I wrote a sentence and modified it several times.
> 
> I wanted to mean "EPT not enabled" when vmx.

Heh, that makes a lot more sense.

> The VM entry fails when the guest is in a very early stage of booting, when it
> might still be in real mode.
> 
> VMEXIT: intr_info=00000000 errorcode=0000000 ilen=00000000
> reason=80000021 qualification=0000000000000002

Yep, that's the signature for an illegal PDPTE at VM-Enter.  But as noted above,
a not-present PDPTE is perfectly legal; VM-Enter should fail if and only if a
PDPTE is present and has reserved bits set.

> IDTVectoring: info=00000000 errorcode=00000000
> 
> > 
> > And I doubt there is a VMX ucode bug at play, as KVM currently uses '0' in its
> > shadow page tables for not-present PDPTEs.
> > 
> > If you can post/provide the patches that lead to VM-Fail, I'd be happy to help
> > debug.
> 
> If you can try this patchset, you can just set the default_pae_pdpte to 0 to test
> it.

I can't reproduce the failure with this on top of your series + kvm/queue (commit
cc0e35f9c2d4 ("KVM: SVM: Nullify vcpu_(un)blocking() hooks if AVIC is disabled")).

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f6f7caf76b70..b7170a840330 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -728,22 +728,11 @@ static u64 default_pae_pdpte;

 static void free_default_pae_pdpte(void)
 {
-       free_page((unsigned long)__va(default_pae_pdpte & PAGE_MASK));
        default_pae_pdpte = 0;
 }

 static int alloc_default_pae_pdpte(void)
 {
-       unsigned long p = __get_free_page(GFP_KERNEL | __GFP_ZERO);
-
-       if (!p)
-               return -ENOMEM;
-       default_pae_pdpte = __pa(p) | PT_PRESENT_MASK | shadow_me_mask;
-       if (WARN_ON(is_shadow_present_pte(default_pae_pdpte) ||
-                   is_mmio_spte(default_pae_pdpte))) {
-               free_default_pae_pdpte();
-               return -EINVAL;
-       }
        return 0;
 }

Are you using a different base and/or running with other changes?

To aid debug, the below patch will dump the PDPTEs from the current MMU root on
failure (I'll also submit this as a formal patch).  On failure, I would expect
that at least one of the PDPTEs will be present with reserved bits set.

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fe06b02994e6..c13f37ef1bbc 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5773,11 +5773,19 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
        pr_err("CR4: actual=0x%016lx, shadow=0x%016lx, gh_mask=%016lx\n",
               cr4, vmcs_readl(CR4_READ_SHADOW), vmcs_readl(CR4_GUEST_HOST_MASK));
        pr_err("CR3 = 0x%016lx\n", vmcs_readl(GUEST_CR3));
-       if (cpu_has_vmx_ept()) {
+       if (enable_ept) {
                pr_err("PDPTR0 = 0x%016llx  PDPTR1 = 0x%016llx\n",
                       vmcs_read64(GUEST_PDPTR0), vmcs_read64(GUEST_PDPTR1));
                pr_err("PDPTR2 = 0x%016llx  PDPTR3 = 0x%016llx\n",
                       vmcs_read64(GUEST_PDPTR2), vmcs_read64(GUEST_PDPTR3));
+       } else if (vcpu->arch.mmu->shadow_root_level == PT32E_ROOT_LEVEL &&
+                  VALID_PAGE(vcpu->arch.mmu->root_hpa)) {
+               u64 *pdpte = __va(vcpu->arch.mmu->root_hpa);
+
+               pr_err("PDPTE0 = 0x%016llx  PDPTE1 = 0x%016llx\n",
+                      pdpte[0], pdpte[1]);
+               pr_err("PDPTE2 = 0x%016llx  PDPTE3 = 0x%016llx\n",
+                      pdpte[2], pdpte[3]);
        }
        pr_err("RSP = 0x%016lx  RIP = 0x%016lx\n",
               vmcs_readl(GUEST_RSP), vmcs_readl(GUEST_RIP));

> If you can't try this patchset, mmu->pae_root could be modified instead
> to test it.
> 
> I guess the vmx fails to translate %rip at VM-entry in this case.

No, the CPU doesn't translate RIP at VM-Enter, vmcs.GUEST_RIP is only checked for
legality, e.g. that it's canonical.  Translating RIP through page tables is firmly
a post-VM-Enter code fetch action.

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 5/6] KVM: X86: Alloc pae_root shadow page
  2022-01-06 19:41           ` Sean Christopherson
@ 2022-01-07  4:36             ` Lai Jiangshan
  0 siblings, 0 replies; 22+ messages in thread
From: Lai Jiangshan @ 2022-01-07  4:36 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Lai Jiangshan, LKML, kvm, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, X86 ML,
	H. Peter Anvin



On 2022/1/7 03:41, Sean Christopherson wrote:
> On Thu, Jan 06, 2022, Lai Jiangshan wrote:
>>
>>
>> On 2022/1/6 00:45, Sean Christopherson wrote:
>>> On Wed, Jan 05, 2022, Lai Jiangshan wrote:
>>>> On Wed, Jan 5, 2022 at 5:54 AM Sean Christopherson <seanjc@google.com> wrote:
>>>>
>>>>>>
>>>>>> default_pae_pdpte is needed because the cpu expects PAE pdptes to be
>>>>>> present at VM-entry.
>>>>>
>>>>> That's incorrect.  Neither Intel nor AMD require PDPTEs to be present.  Not present
>>>>> is perfectly ok, present with reserved bits is what's not allowed.
>>>>>
>>>>> Intel SDM:
>>>>>     A VM entry that checks the validity of the PDPTEs uses the same checks that are
>>>>>     used when CR3 is loaded with MOV to CR3 when PAE paging is in use[7].  If MOV to CR3
>>>>>     would cause a general-protection exception due to the PDPTEs that would be loaded
>>>>>     (e.g., because a reserved bit is set), the VM entry fails.
>>>>>
>>>>>     7. This implies that (1) bits 11:9 in each PDPTE are ignored; and (2) if bit 0
>>>>>        (present) is clear in one of the PDPTEs, bits 63:1 of that PDPTE are ignored.
>>>>
>>>> But in practice, the VM entry fails if the present bit is not set in the
>>>> PDPTE for the linear address being accessed (when EPT enabled at least).  The
>>>> host kvm complains and dumps the vmcs state.
>>>
>>> That doesn't make any sense.  If EPT is enabled, KVM should never use a pae_root.
>>> The vmcs.GUEST_PDPTRn fields are in play, but those shouldn't derive from KVM's
>>> shadow page tables.
>>
>> Oh, I wrote the negative of what I wanted to say again when I tried to emphasize
>> something after I wrote a sentence and modified it several times.
>>
>> I wanted to mean "EPT not enabled" when vmx.
> 
> Heh, that makes a lot more sense.
> 
>> The VM entry fails when the guest is in a very early stage of booting, when it
>> might still be in real mode.
>>
>> VMEXIT: intr_info=00000000 errorcode=0000000 ilen=00000000
>> reason=80000021 qualification=0000000000000002
> 
> Yep, that's the signature for an illegal PDPTE at VM-Enter.  But as noted above,
> a not-present PDPTE is perfectly legal, VM-Enter should failed if and only if a
> PDPTE is present and has reserved bits set.
> 
>> IDTVectoring: info=00000000 errorcode=00000000
>>
>>>
>>> And I doubt there is a VMX ucode bug at play, as KVM currently uses '0' in its
>>> shadow page tables for not-present PDPTEs.
>>>
>>> If you can post/provide the patches that lead to VM-Fail, I'd be happy to help
>>> debug.
>>
>> If you can try this patchset, you can just set the default_pae_pdpte to 0 to test
>> it.
> 
> I can't reproduce the failure with this on top of your series + kvm/queue (commit
> cc0e35f9c2d4 ("KVM: SVM: Nullify vcpu_(un)blocking() hooks if AVIC is disabled")).
> 


I can't reproduce the failure with this code base either.  And I can't reproduce
the failure when I switch to the code base I developed it on.

After reviewing all the logs I saved at the time, I think it was fixed after
make_pae_pdpte().  I should have added make_pae_pdpte() first, before adding
default_pae_pdpte.  (The code was still a mess and the guest couldn't fully
function even when make_pae_pdpte() was added at that time.)

Removing default_pae_pdpte will simplify the code. Thank you.

Thanks
Lai.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 4/6] KVM: X86: Introduce role.level_promoted
  2022-01-04 22:14   ` Sean Christopherson
@ 2022-02-11 16:06     ` Paolo Bonzini
  0 siblings, 0 replies; 22+ messages in thread
From: Paolo Bonzini @ 2022-02-11 16:06 UTC (permalink / raw)
  To: Sean Christopherson, Lai Jiangshan
  Cc: linux-kernel, kvm, Lai Jiangshan, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin

On 1/4/22 23:14, Sean Christopherson wrote:
> Alternatively, should we mark passthrough shadow pages as direct=1?  That would
> naturally handle this code, and for things like reexecute_instruction()'s usage
> of kvm_mmu_unprotect_page(), I don't think passthrough shadow pages should be
> considered indirect, e.g. zapping them won't help and the shadow page can't become
> unsync.

So the main difference between direct and passthrough shadow pages is that
passthrough pages can have indirect children.  A direct page maps the
page at sp->gfn, while a passthrough page maps the page _table_ at
sp->gfn.

Is this correct?

If so, I think there is a difference between a passthrough page that
maps a level-2 page from level-4, and a passthrough page that maps a
level-3 page from level-4.  This means that a single bit in the role
is not enough.

One way to handle this could be to have a single field "direct_levels"
that subsumes both "direct" and "passthrough".  direct && !passthrough
would correspond to "direct_levels == level", while !direct && !passthrough
would correspond to "direct_levels == 0".
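As a rough sketch (the field width and helper names below are made up for
illustration), "direct_levels" would count how many of the top levels are
synthesized by KVM rather than backed by gptes:

	/* in union kvm_mmu_page_role: */
	unsigned direct_levels:4;

	static inline bool sp_is_fully_direct(struct kvm_mmu_page *sp)
	{
		/* the old direct && !passthrough case */
		return sp->role.direct_levels == sp->role.level;
	}

	static inline bool sp_is_fully_indirect(struct kvm_mmu_page *sp)
	{
		/* the old !direct && !passthrough case */
		return sp->role.direct_levels == 0;
	}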

Paolo

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread

Thread overview: 22+ messages
2021-12-10  9:25 [RFC PATCH 0/6] KVM: X86: Add and use shadow page with level promoted or acting as pae_root Lai Jiangshan
2021-12-10  9:25 ` [RFC PATCH 1/6] KVM: X86: Check root_level only in fast_pgd_switch() Lai Jiangshan
2022-01-04 20:24   ` Sean Christopherson
2021-12-10  9:25 ` [RFC PATCH 2/6] KVM: X86: Walk shadow page starting with shadow_root_level Lai Jiangshan
2022-01-04 20:34   ` Sean Christopherson
2022-01-04 20:37     ` Paolo Bonzini
2022-01-04 20:43       ` Maxim Levitsky
2021-12-10  9:25 ` [RFC PATCH 3/6] KVM: X86: Add arguement gfn and role to kvm_mmu_alloc_page() Lai Jiangshan
2022-01-04 20:53   ` Sean Christopherson
2021-12-10  9:25 ` [RFC PATCH 4/6] KVM: X86: Introduce role.level_promoted Lai Jiangshan
2022-01-04 22:14   ` Sean Christopherson
2022-02-11 16:06     ` Paolo Bonzini
2021-12-10  9:25 ` [RFC PATCH 5/6] KVM: X86: Alloc pae_root shadow page Lai Jiangshan
2022-01-04 21:54   ` Sean Christopherson
2022-01-05  3:11     ` Lai Jiangshan
2022-01-05 16:45       ` Sean Christopherson
2022-01-06  2:01         ` Lai Jiangshan
2022-01-06 19:41           ` Sean Christopherson
2022-01-07  4:36             ` Lai Jiangshan
2021-12-10  9:25 ` [RFC PATCH 6/6] KVM: X86: Use level_promoted and pae_root shadow page for 32bit guests Lai Jiangshan
2022-01-04 20:55   ` Sean Christopherson
2021-12-10 10:27 ` [RFC PATCH 0/6] KVM: X86: Add and use shadow page with level promoted or acting as pae_root Maxim Levitsky
