* [RFC PATCH V3 0/4] KVM: X86: Add and use shadow page with level expanded or acting as pae_root
@ 2022-03-30 13:21 Lai Jiangshan
  2022-03-30 13:21 ` [RFC PATCH V3 1/4] KVM: X86: Add argument gfn and role to kvm_mmu_alloc_page() Lai Jiangshan
                   ` (4 more replies)
  0 siblings, 5 replies; 27+ messages in thread
From: Lai Jiangshan @ 2022-03-30 13:21 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini, Sean Christopherson; +Cc: Lai Jiangshan

From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

(Request For Help with testing on an AMD machine with a 32-bit L1
hypervisor; see the information below.)

KVM handles root pages specially for these cases:

direct mmu (nonpaging for 32 bit guest):
	gCR0_PG=0
shadow mmu (shadow paging for 32 bit guest):
	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=0
	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=1
direct mmu (NPT for 32bit host):
	hEFER_LMA=0
shadow nested NPT (for 32bit L1 hypervisor):
	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=0,hEFER_LMA=0
	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=1,hEFER_LMA=0
	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE={0|1},hEFER_LMA=1,hCR4_LA57={0|1}
Shadow nested NPT for 64bit L1 hypervisor:
	gEFER_LMA=1,gCR4_LA57=0,hEFER_LMA=1,hCR4_LA57=1

They either use special roots, or match the condition
((mmu->shadow_root_level > mmu->root_level) && !mmu->direct_map)
(referred to as level expansion), or both.

All the cases use special roots except the last one.
Many cases do level expansion, including the last one.
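
For reference, the guest root level follows from these control bits
roughly as below (an illustrative sketch distilled from KVM's
role_regs_to_root_level(); simplified, not verbatim kernel code):

	static int root_level_from_regs(bool cr0_pg, bool efer_lma,
					bool cr4_pae, bool cr4_la57)
	{
		if (!cr0_pg)
			return 0;		/* no guest paging: direct mmu */
		if (efer_lma)
			return cr4_la57 ? 5 : 4; /* PT64_ROOT_5LEVEL/4LEVEL */
		return cr4_pae ? 3 : 2;		/* PT32E/PT32_ROOT_LEVEL */
	}

Level expansion is then simply root_level < shadow_root_level on an
indirect MMU.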

When special roots are used, the root page is not backed by a
kvm_mmu_page, so it must be treated specially.  Not all places take
this into account, and Sean is adding code to check for these special
roots.

When level expansion occurs, KVM always handles it silently.

These treatments incur problems or complications; see the changelog
of each patch.

These patches were made while I reviewed all the usages of
shadow_root_level and root_level.  Many root-level patches have been
sent and accepted.

These patches have not been tested with the shadow-NPT cases listed
above, because I have neither guest images that can act as a 32-bit L1
hypervisor nor access to an AMD machine with 5-level paging.  I'm a bit
reluctant to ask for the resources, so I'm sending the patches in the
hope that someone can test and modify them.  At least, they provide
some food for thought and reveal problems in the existing code and in
the AMD cases.
( *Request For Help* here.)

These patches have been tested with all cases except the shadow-NPT
ones; the code coverage is believed to be more than 95% (hundreds of
lines of shadow-NPT-related code are removed and replaced with common
role.pae_root and role.glevel code, with only 8 lines of code added
for shadow-NPT; only 2 lines of code are not covered in my tests).

Cleanup patches (such as using role.glevel instead of !role.direct)
will be sent after this patchset is queued.

[V2]: https://lore.kernel.org/lkml/20220329153604.507475-1-jiangshanlai@gmail.com/
[V1]: https://lore.kernel.org/lkml/20211210092508.7185-1-jiangshanlai@gmail.com/

Changed from V2:
	Introduce role.glevel instead of role.passthrough

Changed from V1:
	Apply Sean's comments and suggestions. (Too many to list. Thanks!)
	Add some comments.
	Change the changelog for the role.pae_root patch.

Lai Jiangshan (4):
  KVM: X86: Add argument gfn and role to kvm_mmu_alloc_page()
  KVM: X86: Introduce role.glevel for level expanded pagetable
  KVM: X86: Alloc role.pae_root shadow page
  KVM: X86: Use passthrough and pae_root shadow page for 32bit guests

 Documentation/virt/kvm/mmu.rst  |   9 +
 arch/x86/include/asm/kvm_host.h |  16 +-
 arch/x86/kvm/mmu/mmu.c          | 399 +++++++++-----------------------
 arch/x86/kvm/mmu/paging_tmpl.h  |  15 +-
 4 files changed, 138 insertions(+), 301 deletions(-)

-- 
2.19.1.6.gb485710b



* [RFC PATCH V3 1/4] KVM: X86: Add argument gfn and role to kvm_mmu_alloc_page()
  2022-03-30 13:21 [RFC PATCH V3 0/4] KVM: X86: Add and use shadow page with level expanded or acting as pae_root Lai Jiangshan
@ 2022-03-30 13:21 ` Lai Jiangshan
  2022-03-30 13:21 ` [RFC PATCH V3 2/4] KVM: X86: Introduce role.glevel for level expanded pagetable Lai Jiangshan
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 27+ messages in thread
From: Lai Jiangshan @ 2022-03-30 13:21 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini, Sean Christopherson
  Cc: Lai Jiangshan, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin

From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

kvm_mmu_alloc_page() will access more bits of the role.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
 arch/x86/kvm/mmu/mmu.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1361eb4599b4..02eae110cbe1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1706,13 +1706,14 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
 	mmu_spte_clear_no_track(parent_pte);
 }
 
-static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct)
+static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, gfn_t gfn,
+					       union kvm_mmu_page_role role)
 {
 	struct kvm_mmu_page *sp;
 
 	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
 	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
-	if (!direct)
+	if (!role.direct)
 		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
@@ -1724,6 +1725,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct
 	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
 	list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
 	kvm_mod_used_mmu_pages(vcpu->kvm, +1);
+	sp->gfn = gfn;
+	sp->role = role;
 	return sp;
 }
 
@@ -2107,10 +2110,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 
 	++vcpu->kvm->stat.mmu_cache_miss;
 
-	sp = kvm_mmu_alloc_page(vcpu, direct);
-
-	sp->gfn = gfn;
-	sp->role = role;
+	sp = kvm_mmu_alloc_page(vcpu, gfn, role);
 	hlist_add_head(&sp->hash_link, sp_list);
 	if (!direct) {
 		account_shadowed(vcpu->kvm, sp);
-- 
2.19.1.6.gb485710b



* [RFC PATCH V3 2/4] KVM: X86: Introduce role.glevel for level expanded pagetable
  2022-03-30 13:21 [RFC PATCH V3 0/4] KVM: X86: Add and use shadow page with level expanded or acting as pae_root Lai Jiangshan
  2022-03-30 13:21 ` [RFC PATCH V3 1/4] KVM: X86: Add argument gfn and role to kvm_mmu_alloc_page() Lai Jiangshan
@ 2022-03-30 13:21 ` Lai Jiangshan
  2022-03-30 16:01   ` Lai Jiangshan
  2022-04-12 21:31   ` Sean Christopherson
  2022-03-30 13:21 ` [RFC PATCH V3 3/4] KVM: X86: Alloc role.pae_root shadow page Lai Jiangshan
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 27+ messages in thread
From: Lai Jiangshan @ 2022-03-30 13:21 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini, Sean Christopherson
  Cc: Lai Jiangshan, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, linux-doc

From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

Level expansion occurs when mmu->shadow_root_level > mmu->root_level.

There are several cases that can cause level expansion:

shadow mmu (shadow paging for 32 bit guest):
	case1:	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=0

shadow nested NPT (for 32bit L1 hypervisor):
	case2:	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=0,hEFER_LMA=0
	case3:	gCR0_PG=1,gEFER_LMA=0,hEFER_LMA=1

shadow nested NPT (for 64bit L1 hypervisor):
	case4:	gEFER_LMA=1,gCR4_LA57=0,hEFER_LMA=1,hCR4_LA57=1

When level expansion occurs (32-bit guest, case1-3), special roots are
often used.  But case4 does not use special roots; it uses shadow pages
without being fully aware of their specialty.  It happens to work:
	1) The root_page (root_sp->spt) is allocated with level = 5,
	   and root_sp->spt[0] is allocated with the same gfn and the
	   same role except role.level = 4.  Luckily, they are
	   different shadow pages.
	2) FNAME(walk_addr_generic) sets walker->table_gfn[4] and
	   walker->pt_access[4], which are normally unused when
	   mmu->shadow_root_level == mmu->root_level == 4, so that
	   FNAME(fetch) can use them to allocate a shadow page for
	   root_sp->spt[0] and link them when shadow_root_level == 5.

But it has problems.  If the guest switches from gCR4_LA57=0 to
gCR4_LA57=1 (or vice versa) and uses the same gfn as the root for the
nNPT before and after switching gCR4_LA57, the host (hCR4_LA57=1)
would use the same root_sp for the guest even though the guest
switched gCR4_LA57.  The guest would see unexpected pages mapped, and
L2 could hurt L1.  It is lucky that the problem can't hurt L0.
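
Concretely (an illustration of the stale reuse, not code from the
patch):

	/*
	 * Without role.glevel, both guest configurations produce the
	 * same shadow page lookup key for the same root gfn X:
	 *
	 *   gCR4_LA57=1: gfn = X, role = { .level = 5, .direct = 0, ... }
	 *   gCR4_LA57=0: gfn = X, role = { .level = 5, .direct = 0, ... }
	 *
	 * so kvm_mmu_get_page() returns the stale root_sp after the
	 * switch; role.glevel (5 vs. 4 here) makes the roles distinct.
	 */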

The root_sp should sometimes behave like role.direct=1: its contents
are not backed by gptes, and root_sp->gfns is meaningless.  For a
normal high-level sp, sp->gfns is often unused and kept zero, but it
can be relevant and meaningful when it is used, because its entries
are backed by concrete gptes.  For the expanded root_sp described
above, root_sp is just a portal contributing root_sp->spt[0]; root_sp
should not have root_sp->gfns, and root_sp->spt[0] should not be
dropped if gpte[0] of the root gfn is changed.

This patch adds role.glevel to address the two problems.
With the new role.glevel, a passthrough sp can be created for an
expanded shadow pagetable: 0 < role.glevel < role.level.
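
For clarity, such a passthrough sp can be recognized with a trivial
check (an illustrative helper, not part of this patch):

	static bool sp_is_passthrough(struct kvm_mmu_page *sp)
	{
		/* indirect sp whose level is expanded above the guest's */
		return !sp->role.direct && sp->role.glevel < sp->role.level;
	}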

An alternative way to fix the problem of case4 is to also use the
special root pml5_root for it.  But that would require changing many
other places, because of the assumption that special roots are only
used for 32-bit guests.

This patch also paves the way to use passthrough shadow pages for
case1-3, but that requires special handling for PAE paging, so the
extensive usage comes in later patches.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
 Documentation/virt/kvm/mmu.rst  |  7 +++++++
 arch/x86/include/asm/kvm_host.h |  5 +++--
 arch/x86/kvm/mmu/mmu.c          | 21 +++++++++++++++++----
 arch/x86/kvm/mmu/paging_tmpl.h  |  1 +
 4 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/Documentation/virt/kvm/mmu.rst b/Documentation/virt/kvm/mmu.rst
index 5b1ebad24c77..dee0e96d694a 100644
--- a/Documentation/virt/kvm/mmu.rst
+++ b/Documentation/virt/kvm/mmu.rst
@@ -202,6 +202,13 @@ Shadow pages contain the following information:
     Is 1 if the MMU instance cannot use A/D bits.  EPT did not have A/D
     bits before Haswell; shadow EPT page tables also cannot use A/D bits
     if the L1 hypervisor does not enable them.
+  role.glevel:
+    The level in guest pagetable if the sp is indirect.  Is 0 if the sp
+    is direct without corresponding guest pagetable, like TDP or !CR0.PG.
+    When role.level > guest paging level, indirect sp is created on the
+    top with role.glevel = guest paging level and acks as passthrough sp
+    and its contents are specially installed rather than the translations
+    of the corresponding guest pagetable.
   gfn:
     Either the guest page table containing the translations shadowed by this
     page, or the base page frame for linear translations.  See role.direct.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9694dd5e6ccc..67e1bccaf472 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -314,7 +314,7 @@ struct kvm_kernel_irq_routing_entry;
  *     cr0_wp=0, therefore these three bits only give rise to 5 possibilities.
  *
  * Therefore, the maximum number of possible upper-level shadow pages for a
- * single gfn is a bit less than 2^13.
+ * single gfn is a bit less than 2^15.
  */
 union kvm_mmu_page_role {
 	u32 word;
@@ -331,7 +331,8 @@ union kvm_mmu_page_role {
 		unsigned smap_andnot_wp:1;
 		unsigned ad_disabled:1;
 		unsigned guest_mode:1;
-		unsigned :6;
+		unsigned glevel:4;
+		unsigned :2;
 
 		/*
 		 * This is left at the top of the word so that
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 02eae110cbe1..d53037df8177 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -737,8 +737,12 @@ static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
 
 static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
 {
-	if (!sp->role.direct)
+	if (!sp->role.direct) {
+		if (unlikely(sp->role.glevel < sp->role.level))
+			return sp->gfn;
+
 		return sp->gfns[index];
+	}
 
 	return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
 }
@@ -746,6 +750,11 @@ static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
 static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn)
 {
 	if (!sp->role.direct) {
+		if (unlikely(sp->role.glevel < sp->role.level)) {
+			WARN_ON_ONCE(gfn != sp->gfn);
+			return;
+		}
+
 		sp->gfns[index] = gfn;
 		return;
 	}
@@ -1674,8 +1683,7 @@ static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
 	hlist_del(&sp->hash_link);
 	list_del(&sp->link);
 	free_page((unsigned long)sp->spt);
-	if (!sp->role.direct)
-		free_page((unsigned long)sp->gfns);
+	free_page((unsigned long)sp->gfns);
 	kmem_cache_free(mmu_page_header_cache, sp);
 }
 
@@ -1713,7 +1721,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 
 	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
 	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
-	if (!role.direct)
+	if (role.glevel == role.level)
 		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
@@ -2054,6 +2062,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
 		role.quadrant = quadrant;
 	}
+	if (level < role.glevel)
+		role.glevel = level;
 
 	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
@@ -4817,6 +4827,7 @@ kvm_calc_shadow_root_page_role_common(struct kvm_vcpu *vcpu,
 	role.base.smep_andnot_wp = role.ext.cr4_smep && !____is_cr0_wp(regs);
 	role.base.smap_andnot_wp = role.ext.cr4_smap && !____is_cr0_wp(regs);
 	role.base.has_4_byte_gpte = ____is_cr0_pg(regs) && !____is_cr4_pae(regs);
+	role.base.glevel = role_regs_to_root_level(regs);
 
 	return role;
 }
@@ -5312,6 +5323,8 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	++vcpu->kvm->stat.mmu_pte_write;
 
 	for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn) {
+		if (sp->role.glevel < sp->role.level)
+			continue;
 		if (detect_write_misaligned(sp, gpa, bytes) ||
 		      detect_write_flooding(sp)) {
 			kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list);
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 8621188b46df..67489a060eba 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -1042,6 +1042,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		.level = 0xf,
 		.access = 0x7,
 		.quadrant = 0x3,
+		.glevel = 0xf,
 	};
 
 	/*
-- 
2.19.1.6.gb485710b



* [RFC PATCH V3 3/4] KVM: X86: Alloc role.pae_root shadow page
  2022-03-30 13:21 [RFC PATCH V3 0/4] KVM: X86: Add and use shadow page with level expanded or acting as pae_root Lai Jiangshan
  2022-03-30 13:21 ` [RFC PATCH V3 1/4] KVM: X86: Add argument gfn and role to kvm_mmu_alloc_page() Lai Jiangshan
  2022-03-30 13:21 ` [RFC PATCH V3 2/4] KVM: X86: Introduce role.glevel for level expanded pagetable Lai Jiangshan
@ 2022-03-30 13:21 ` Lai Jiangshan
  2022-04-12 21:14   ` Sean Christopherson
  2022-03-30 13:21 ` [RFC PATCH V3 4/4] KVM: X86: Use passthrough and pae_root shadow page for 32bit guests Lai Jiangshan
  2022-04-12  9:35 ` [RFC PATCH V3 0/4] KVM: X86: Add and use shadow page with level expanded or acting as pae_root Lai Jiangshan
  4 siblings, 1 reply; 27+ messages in thread
From: Lai Jiangshan @ 2022-03-30 13:21 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini, Sean Christopherson
  Cc: Lai Jiangshan, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, linux-doc

From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

Currently, pae_root is a special root page.  This patch adds a
facility to allow using kvm_mmu_get_page() to allocate a pae_root
shadow page.

When kvm_mmu_get_page() is called with role.level == PT32E_ROOT_LEVEL
and vcpu->arch.mmu->shadow_root_level == PT32E_ROOT_LEVEL, it will get
a PAE root pagetable and set role.pae_root=1 for freeing.

The role.pae_root bit is needed in the page role because:
  o PAE roots must be allocated below 4gb (for kvm_mmu_get_page())
  o PAE roots can not be encrypted (for kvm_mmu_get_page())
  o Must be re-encrypted when freeing (for kvm_mmu_free_page())
  o PAE root's PDPTE is special (for link_shadow_page())
  o Must not share the decrypted low-address pagetable with
    non-PAE-root ones, or vice versa. (for kvm_mmu_get_page(); the
    crucial reason)

Both uses of role.pae_root, in link_shadow_page() and in
kvm_mmu_get_page(), could possibly be changed to use shadow_root_level
and role.level instead.

But in kvm_mmu_free_page(), it can't use vcpu->arch.mmu->shadow_root_level.

PAE roots must be allocated below 4gb (CR3 has only 32 bits).  So a
cache is introduced (mmu_pae_root_cache).

No functionality is changed, since this code is not activated yet:
when vcpu->arch.mmu->shadow_root_level == PT32E_ROOT_LEVEL,
kvm_mmu_get_page() is currently only called for level == 1 or 2.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
 Documentation/virt/kvm/mmu.rst  |  2 +
 arch/x86/include/asm/kvm_host.h |  9 +++-
 arch/x86/kvm/mmu/mmu.c          | 78 +++++++++++++++++++++++++++++++--
 arch/x86/kvm/mmu/paging_tmpl.h  |  1 +
 4 files changed, 86 insertions(+), 4 deletions(-)

diff --git a/Documentation/virt/kvm/mmu.rst b/Documentation/virt/kvm/mmu.rst
index dee0e96d694a..800f1eba55b3 100644
--- a/Documentation/virt/kvm/mmu.rst
+++ b/Documentation/virt/kvm/mmu.rst
@@ -209,6 +209,8 @@ Shadow pages contain the following information:
     top with role.glevel = guest paging level and acks as passthrough sp
     and its contents are specially installed rather than the translations
     of the corresponding guest pagetable.
+  role.pae_root:
+    Is 1 if it is a PAE root.
   gfn:
     Either the guest page table containing the translations shadowed by this
     page, or the base page frame for linear translations.  See role.direct.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 67e1bccaf472..658c493e7617 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -313,6 +313,11 @@ struct kvm_kernel_irq_routing_entry;
  *   - on top of this, smep_andnot_wp and smap_andnot_wp are only set if
  *     cr0_wp=0, therefore these three bits only give rise to 5 possibilities.
  *
+ *   - pae_root can only be set when level=3, so combinations for level and
+ *     pae_root can be seen as 2/3/3-pae_root/4/5, i.e. 5 possibilities.
+ *     Combined with cr0_wp, smep_andnot_wp and smap_andnot_wp, it will be
+ *     5X5 = 25 < 2^5.
+ *
  * Therefore, the maximum number of possible upper-level shadow pages for a
  * single gfn is a bit less than 2^15.
  */
@@ -332,7 +337,8 @@ union kvm_mmu_page_role {
 		unsigned ad_disabled:1;
 		unsigned guest_mode:1;
 		unsigned glevel:4;
-		unsigned :2;
+		unsigned pae_root:1;
+		unsigned :1;
 
 		/*
 		 * This is left at the top of the word so that
@@ -699,6 +705,7 @@ struct kvm_vcpu_arch {
 	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
 	struct kvm_mmu_memory_cache mmu_gfn_array_cache;
 	struct kvm_mmu_memory_cache mmu_page_header_cache;
+	void *mmu_pae_root_cache;
 
 	/*
 	 * QEMU userspace and the guest each have their own FPU state.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d53037df8177..81ccaa7c1165 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -694,6 +694,35 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
 	}
 }
 
+static int mmu_topup_pae_root_cache(struct kvm_vcpu *vcpu)
+{
+	struct page *page;
+
+	if (vcpu->arch.mmu->shadow_root_level != PT32E_ROOT_LEVEL)
+		return 0;
+	if (vcpu->arch.mmu_pae_root_cache)
+		return 0;
+
+	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_DMA32);
+	if (!page)
+		return -ENOMEM;
+	vcpu->arch.mmu_pae_root_cache = page_address(page);
+
+	/*
+	 * CR3 is only 32 bits when PAE paging is used, thus it's impossible to
+	 * get the CPU to treat the PDPTEs as encrypted.  Decrypt the page so
+	 * that KVM's writes and the CPU's reads get along.  Note, this is
+	 * only necessary when using shadow paging, as 64-bit NPT can get at
+	 * the C-bit even when shadowing 32-bit NPT, and SME isn't supported
+	 * by 32-bit kernels (when KVM itself uses 32-bit NPT).
+	 */
+	if (!tdp_enabled)
+		set_memory_decrypted((unsigned long)vcpu->arch.mmu_pae_root_cache, 1);
+	else
+		WARN_ON_ONCE(shadow_me_mask);
+	return 0;
+}
+
 static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 {
 	int r;
@@ -705,6 +734,9 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 		return r;
 	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
 				       PT64_ROOT_MAX_LEVEL);
+	if (r)
+		return r;
+	r = mmu_topup_pae_root_cache(vcpu);
 	if (r)
 		return r;
 	if (maybe_indirect) {
@@ -717,12 +749,23 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 					  PT64_ROOT_MAX_LEVEL);
 }
 
+static void mmu_free_pae_root(void *root_pt)
+{
+	if (!tdp_enabled)
+		set_memory_encrypted((unsigned long)root_pt, 1);
+	free_page((unsigned long)root_pt);
+}
+
 static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 {
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
+	if (vcpu->arch.mmu_pae_root_cache) {
+		mmu_free_pae_root(vcpu->arch.mmu_pae_root_cache);
+		vcpu->arch.mmu_pae_root_cache = NULL;
+	}
 }
 
 static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_vcpu *vcpu)
@@ -1682,7 +1725,10 @@ static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
 	MMU_WARN_ON(!is_empty_shadow_page(sp->spt));
 	hlist_del(&sp->hash_link);
 	list_del(&sp->link);
-	free_page((unsigned long)sp->spt);
+	if (sp->role.pae_root)
+		mmu_free_pae_root(sp->spt);
+	else
+		free_page((unsigned long)sp->spt);
 	free_page((unsigned long)sp->gfns);
 	kmem_cache_free(mmu_page_header_cache, sp);
 }
@@ -1720,7 +1766,12 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 	struct kvm_mmu_page *sp;
 
 	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
-	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+	if (!role.pae_root) {
+		sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+	} else {
+		sp->spt = vcpu->arch.mmu_pae_root_cache;
+		vcpu->arch.mmu_pae_root_cache = NULL;
+	}
 	if (role.glevel == role.level)
 		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
@@ -2064,6 +2115,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	}
 	if (level < role.glevel)
 		role.glevel = level;
+	if (level != PT32E_ROOT_LEVEL)
+		role.pae_root = 0;
 
 	sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
@@ -2199,14 +2252,26 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
 	__shadow_walk_next(iterator, *iterator->sptep);
 }
 
+static u64 make_pae_pdpte(u64 *child_pt)
+{
+	/* The only ignore bits in PDPTE are 11:9. */
+	BUILD_BUG_ON(!(GENMASK(11,9) & SPTE_MMU_PRESENT_MASK));
+	return __pa(child_pt) | PT_PRESENT_MASK | SPTE_MMU_PRESENT_MASK |
+		shadow_me_mask;
+}
+
 static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
 			     struct kvm_mmu_page *sp)
 {
+	struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
 	u64 spte;
 
 	BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK);
 
-	spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
+	if (!parent_sp->role.pae_root)
+		spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
+	else
+		spte = make_pae_pdpte(sp->spt);
 
 	mmu_spte_set(sptep, spte);
 
@@ -4782,6 +4847,8 @@ kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
 	role.base.level = kvm_mmu_get_tdp_level(vcpu);
 	role.base.direct = true;
 	role.base.has_4_byte_gpte = false;
+	if (role.base.level == PT32E_ROOT_LEVEL)
+		role.base.pae_root = 1;
 
 	return role;
 }
@@ -4848,6 +4915,9 @@ kvm_calc_shadow_mmu_root_page_role(struct kvm_vcpu *vcpu,
 	else
 		role.base.level = PT64_ROOT_4LEVEL;
 
+	if (role.base.level == PT32E_ROOT_LEVEL)
+		role.base.pae_root = 1;
+
 	return role;
 }
 
@@ -4893,6 +4963,8 @@ kvm_calc_shadow_npt_root_page_role(struct kvm_vcpu *vcpu,
 
 	role.base.direct = false;
 	role.base.level = kvm_mmu_get_tdp_level(vcpu);
+	if (role.base.level == PT32E_ROOT_LEVEL)
+		role.base.pae_root = 1;
 
 	return role;
 }
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 67489a060eba..1015f33e0758 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -1043,6 +1043,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		.access = 0x7,
 		.quadrant = 0x3,
 		.glevel = 0xf,
+		.pae_root = 0x1,
 	};
 
 	/*
-- 
2.19.1.6.gb485710b



* [RFC PATCH V3 4/4] KVM: X86: Use passthrough and pae_root shadow page for 32bit guests
  2022-03-30 13:21 [RFC PATCH V3 0/4] KVM: X86: Add and use shadow page with level expanded or acting as pae_root Lai Jiangshan
                   ` (2 preceding siblings ...)
  2022-03-30 13:21 ` [RFC PATCH V3 3/4] KVM: X86: Alloc role.pae_root shadow page Lai Jiangshan
@ 2022-03-30 13:21 ` Lai Jiangshan
  2022-04-12 21:34   ` Sean Christopherson
  2022-04-12  9:35 ` [RFC PATCH V3 0/4] KVM: X86: Add and use shadow page with level expanded or acting as pae_root Lai Jiangshan
  4 siblings, 1 reply; 27+ messages in thread
From: Lai Jiangshan @ 2022-03-30 13:21 UTC (permalink / raw)
  To: linux-kernel, kvm, Paolo Bonzini, Sean Christopherson
  Cc: Lai Jiangshan, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin

From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

Use role.pae_root = 1 for shadow_root_level == 3, no matter whether it
is shadow MMU or not.

When it is shadow MMU, level expansion might occur, and a passthrough
sp (0 < role.glevel < role.level) is used for the expanded shadow
pagetable.

And remove the no-longer-needed special roots.  Now all the root pages
and pagetables pointed to by a present spte in kvm_mmu are backed by
struct kvm_mmu_page, and to_shadow_page() is guaranteed to be non-NULL.

shadow_walk() and the initialization of shadow pages are much
simplified since there are no special roots.

Affected cases:
direct mmu (nonpaging for 32 bit guest):
	gCR0_PG=0 (pae_root=1)
shadow mmu (shadow paging for 32 bit guest):
	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=0 (pae_root=1,passthrough)
	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=1 (pae_root=1,no passthrough)
direct mmu (NPT for 32bit host):
	hEFER_LMA=0 (pae_root=1)
shadow nested NPT (for 32bit L1 hypervisor):
	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=0,hEFER_LMA=0
		(pae_root=1,passthrough)
	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=1,hEFER_LMA=0
		(pae_root=1,no passthrough)
	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE={0|1},hEFER_LMA=1,hCR4_LA57={0|1}
		(pae_root=0,passthrough)
		(default_pae_pdpte is not used even when the guest is using PAE paging)

Shadow nested NPT for 64bit L1 hypervisor has already been handled:
	gEFER_LMA=1,gCR4_LA57=0,hEFER_LMA=1,hCR4_LA57=1
		(pae_root=0,passthrough)

FNAME(walk_addr_generic) gains initialization code for shadow nested
NPT with a 32-bit L1 hypervisor, where the level increment might be
more than one, for example 2->4, 2->5, 3->5.

After this patch, the PAE Page-Directory-Pointer-Table is also
write-protected (including NPT's).

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
 arch/x86/include/asm/kvm_host.h |   4 -
 arch/x86/kvm/mmu/mmu.c          | 290 +-------------------------------
 arch/x86/kvm/mmu/paging_tmpl.h  |  13 +-
 3 files changed, 20 insertions(+), 287 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 658c493e7617..26aab4418844 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -467,10 +467,6 @@ struct kvm_mmu {
 	*/
 	u32 pkru_mask;
 
-	u64 *pae_root;
-	u64 *pml4_root;
-	u64 *pml5_root;
-
 	/*
 	 * check zero bits on shadow page table entries, these
 	 * bits include not only hardware reserved bits but also
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 81ccaa7c1165..27498caa3990 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2196,26 +2196,6 @@ static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterato
 	iterator->addr = addr;
 	iterator->shadow_addr = root;
 	iterator->level = vcpu->arch.mmu->shadow_root_level;
-
-	if (iterator->level >= PT64_ROOT_4LEVEL &&
-	    vcpu->arch.mmu->root_level < PT64_ROOT_4LEVEL &&
-	    !vcpu->arch.mmu->direct_map)
-		iterator->level = PT32E_ROOT_LEVEL;
-
-	if (iterator->level == PT32E_ROOT_LEVEL) {
-		/*
-		 * prev_root is currently only used for 64-bit hosts. So only
-		 * the active root_hpa is valid here.
-		 */
-		BUG_ON(root != vcpu->arch.mmu->root.hpa);
-
-		iterator->shadow_addr
-			= vcpu->arch.mmu->pae_root[(addr >> 30) & 3];
-		iterator->shadow_addr &= PT64_BASE_ADDR_MASK;
-		--iterator->level;
-		if (!iterator->shadow_addr)
-			iterator->level = 0;
-	}
 }
 
 static void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
@@ -3328,18 +3308,7 @@ void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu,
 					   &invalid_list);
 
 	if (free_active_root) {
-		if (to_shadow_page(mmu->root.hpa)) {
-			mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
-		} else if (mmu->pae_root) {
-			for (i = 0; i < 4; ++i) {
-				if (!IS_VALID_PAE_ROOT(mmu->pae_root[i]))
-					continue;
-
-				mmu_free_root_page(kvm, &mmu->pae_root[i],
-						   &invalid_list);
-				mmu->pae_root[i] = INVALID_PAE_ROOT;
-			}
-		}
+		mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
 		mmu->root.hpa = INVALID_PAGE;
 		mmu->root.pgd = 0;
 	}
@@ -3404,7 +3373,6 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
 	u8 shadow_root_level = mmu->shadow_root_level;
 	hpa_t root;
-	unsigned i;
 	int r;
 
 	write_lock(&vcpu->kvm->mmu_lock);
@@ -3415,24 +3383,9 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 	if (is_tdp_mmu_enabled(vcpu->kvm)) {
 		root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
 		mmu->root.hpa = root;
-	} else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
+	} else if (shadow_root_level >= PT32E_ROOT_LEVEL) {
 		root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level, true);
 		mmu->root.hpa = root;
-	} else if (shadow_root_level == PT32E_ROOT_LEVEL) {
-		if (WARN_ON_ONCE(!mmu->pae_root)) {
-			r = -EIO;
-			goto out_unlock;
-		}
-
-		for (i = 0; i < 4; ++i) {
-			WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
-
-			root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT),
-					      i << 30, PT32_ROOT_LEVEL, true);
-			mmu->pae_root[i] = root | PT_PRESENT_MASK |
-					   shadow_me_mask;
-		}
-		mmu->root.hpa = __pa(mmu->pae_root);
 	} else {
 		WARN_ONCE(1, "Bad TDP root level = %d\n", shadow_root_level);
 		r = -EIO;
@@ -3510,10 +3463,8 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm)
 static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 {
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
-	u64 pdptrs[4], pm_mask;
 	gfn_t root_gfn, root_pgd;
 	hpa_t root;
-	unsigned i;
 	int r;
 
 	root_pgd = mmu->get_guest_pgd(vcpu);
@@ -3522,21 +3473,6 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	if (mmu_check_root(vcpu, root_gfn))
 		return 1;
 
-	/*
-	 * On SVM, reading PDPTRs might access guest memory, which might fault
-	 * and thus might sleep.  Grab the PDPTRs before acquiring mmu_lock.
-	 */
-	if (mmu->root_level == PT32E_ROOT_LEVEL) {
-		for (i = 0; i < 4; ++i) {
-			pdptrs[i] = mmu->get_pdptr(vcpu, i);
-			if (!(pdptrs[i] & PT_PRESENT_MASK))
-				continue;
-
-			if (mmu_check_root(vcpu, pdptrs[i] >> PAGE_SHIFT))
-				return 1;
-		}
-	}
-
 	r = mmu_first_shadow_root_alloc(vcpu->kvm);
 	if (r)
 		return r;
@@ -3546,70 +3482,9 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	if (r < 0)
 		goto out_unlock;
 
-	/*
-	 * Do we shadow a long mode page table? If so we need to
-	 * write-protect the guests page table root.
-	 */
-	if (mmu->root_level >= PT64_ROOT_4LEVEL) {
-		root = mmu_alloc_root(vcpu, root_gfn, 0,
-				      mmu->shadow_root_level, false);
-		mmu->root.hpa = root;
-		goto set_root_pgd;
-	}
-
-	if (WARN_ON_ONCE(!mmu->pae_root)) {
-		r = -EIO;
-		goto out_unlock;
-	}
-
-	/*
-	 * We shadow a 32 bit page table. This may be a legacy 2-level
-	 * or a PAE 3-level page table. In either case we need to be aware that
-	 * the shadow page table may be a PAE or a long mode page table.
-	 */
-	pm_mask = PT_PRESENT_MASK | shadow_me_mask;
-	if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL) {
-		pm_mask |= PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
-
-		if (WARN_ON_ONCE(!mmu->pml4_root)) {
-			r = -EIO;
-			goto out_unlock;
-		}
-		mmu->pml4_root[0] = __pa(mmu->pae_root) | pm_mask;
-
-		if (mmu->shadow_root_level == PT64_ROOT_5LEVEL) {
-			if (WARN_ON_ONCE(!mmu->pml5_root)) {
-				r = -EIO;
-				goto out_unlock;
-			}
-			mmu->pml5_root[0] = __pa(mmu->pml4_root) | pm_mask;
-		}
-	}
-
-	for (i = 0; i < 4; ++i) {
-		WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
-
-		if (mmu->root_level == PT32E_ROOT_LEVEL) {
-			if (!(pdptrs[i] & PT_PRESENT_MASK)) {
-				mmu->pae_root[i] = INVALID_PAE_ROOT;
-				continue;
-			}
-			root_gfn = pdptrs[i] >> PAGE_SHIFT;
-		}
-
-		root = mmu_alloc_root(vcpu, root_gfn, i << 30,
-				      PT32_ROOT_LEVEL, false);
-		mmu->pae_root[i] = root | pm_mask;
-	}
-
-	if (mmu->shadow_root_level == PT64_ROOT_5LEVEL)
-		mmu->root.hpa = __pa(mmu->pml5_root);
-	else if (mmu->shadow_root_level == PT64_ROOT_4LEVEL)
-		mmu->root.hpa = __pa(mmu->pml4_root);
-	else
-		mmu->root.hpa = __pa(mmu->pae_root);
-
-set_root_pgd:
+	root = mmu_alloc_root(vcpu, root_gfn, 0,
+			      mmu->shadow_root_level, false);
+	mmu->root.hpa = root;
 	mmu->root.pgd = root_pgd;
 out_unlock:
 	write_unlock(&vcpu->kvm->mmu_lock);
@@ -3617,77 +3492,6 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	return r;
 }
 
-static int mmu_alloc_special_roots(struct kvm_vcpu *vcpu)
-{
-	struct kvm_mmu *mmu = vcpu->arch.mmu;
-	bool need_pml5 = mmu->shadow_root_level > PT64_ROOT_4LEVEL;
-	u64 *pml5_root = NULL;
-	u64 *pml4_root = NULL;
-	u64 *pae_root;
-
-	/*
-	 * When shadowing 32-bit or PAE NPT with 64-bit NPT, the PML4 and PDP
-	 * tables are allocated and initialized at root creation as there is no
-	 * equivalent level in the guest's NPT to shadow.  Allocate the tables
-	 * on demand, as running a 32-bit L1 VMM on 64-bit KVM is very rare.
-	 */
-	if (mmu->direct_map || mmu->root_level >= PT64_ROOT_4LEVEL ||
-	    mmu->shadow_root_level < PT64_ROOT_4LEVEL)
-		return 0;
-
-	/*
-	 * NPT, the only paging mode that uses this horror, uses a fixed number
-	 * of levels for the shadow page tables, e.g. all MMUs are 4-level or
-	 * all MMus are 5-level.  Thus, this can safely require that pml5_root
-	 * is allocated if the other roots are valid and pml5 is needed, as any
-	 * prior MMU would also have required pml5.
-	 */
-	if (mmu->pae_root && mmu->pml4_root && (!need_pml5 || mmu->pml5_root))
-		return 0;
-
-	/*
-	 * The special roots should always be allocated in concert.  Yell and
-	 * bail if KVM ends up in a state where only one of the roots is valid.
-	 */
-	if (WARN_ON_ONCE(!tdp_enabled || mmu->pae_root || mmu->pml4_root ||
-			 (need_pml5 && mmu->pml5_root)))
-		return -EIO;
-
-	/*
-	 * Unlike 32-bit NPT, the PDP table doesn't need to be in low mem, and
-	 * doesn't need to be decrypted.
-	 */
-	pae_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
-	if (!pae_root)
-		return -ENOMEM;
-
-#ifdef CONFIG_X86_64
-	pml4_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
-	if (!pml4_root)
-		goto err_pml4;
-
-	if (need_pml5) {
-		pml5_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
-		if (!pml5_root)
-			goto err_pml5;
-	}
-#endif
-
-	mmu->pae_root = pae_root;
-	mmu->pml4_root = pml4_root;
-	mmu->pml5_root = pml5_root;
-
-	return 0;
-
-#ifdef CONFIG_X86_64
-err_pml5:
-	free_page((unsigned long)pml4_root);
-err_pml4:
-	free_page((unsigned long)pae_root);
-	return -ENOMEM;
-#endif
-}
-
 static bool is_unsync_root(hpa_t root)
 {
 	struct kvm_mmu_page *sp;
@@ -3725,8 +3529,7 @@ static bool is_unsync_root(hpa_t root)
 
 void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
 {
-	int i;
-	struct kvm_mmu_page *sp;
+	hpa_t root = vcpu->arch.mmu->root.hpa;
 
 	if (vcpu->arch.mmu->direct_map)
 		return;
@@ -3736,31 +3539,11 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
 
 	vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
 
-	if (vcpu->arch.mmu->root_level >= PT64_ROOT_4LEVEL) {
-		hpa_t root = vcpu->arch.mmu->root.hpa;
-		sp = to_shadow_page(root);
-
-		if (!is_unsync_root(root))
-			return;
-
-		write_lock(&vcpu->kvm->mmu_lock);
-		mmu_sync_children(vcpu, sp, true);
-		write_unlock(&vcpu->kvm->mmu_lock);
+	if (!is_unsync_root(root))
 		return;
-	}
 
 	write_lock(&vcpu->kvm->mmu_lock);
-
-	for (i = 0; i < 4; ++i) {
-		hpa_t root = vcpu->arch.mmu->pae_root[i];
-
-		if (IS_VALID_PAE_ROOT(root)) {
-			root &= PT64_BASE_ADDR_MASK;
-			sp = to_shadow_page(root);
-			mmu_sync_children(vcpu, sp, true);
-		}
-	}
-
+	mmu_sync_children(vcpu, to_shadow_page(root), true);
 	write_unlock(&vcpu->kvm->mmu_lock);
 }
 
@@ -5162,9 +4945,6 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
 	int r;
 
 	r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->direct_map);
-	if (r)
-		goto out;
-	r = mmu_alloc_special_roots(vcpu);
 	if (r)
 		goto out;
 	if (vcpu->arch.mmu->direct_map)
@@ -5635,65 +5415,14 @@ slot_handle_level_4k(struct kvm *kvm, const struct kvm_memory_slot *memslot,
 				 PG_LEVEL_4K, flush_on_yield);
 }
 
-static void free_mmu_pages(struct kvm_mmu *mmu)
-{
-	if (!tdp_enabled && mmu->pae_root)
-		set_memory_encrypted((unsigned long)mmu->pae_root, 1);
-	free_page((unsigned long)mmu->pae_root);
-	free_page((unsigned long)mmu->pml4_root);
-	free_page((unsigned long)mmu->pml5_root);
-}
-
 static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
 {
-	struct page *page;
 	int i;
 
 	mmu->root.hpa = INVALID_PAGE;
 	mmu->root.pgd = 0;
 	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
 		mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
-
-	/* vcpu->arch.guest_mmu isn't used when !tdp_enabled. */
-	if (!tdp_enabled && mmu == &vcpu->arch.guest_mmu)
-		return 0;
-
-	/*
-	 * When using PAE paging, the four PDPTEs are treated as 'root' pages,
-	 * while the PDP table is a per-vCPU construct that's allocated at MMU
-	 * creation.  When emulating 32-bit mode, cr3 is only 32 bits even on
-	 * x86_64.  Therefore we need to allocate the PDP table in the first
-	 * 4GB of memory, which happens to fit the DMA32 zone.  TDP paging
-	 * generally doesn't use PAE paging and can skip allocating the PDP
-	 * table.  The main exception, handled here, is SVM's 32-bit NPT.  The
-	 * other exception is for shadowing L1's 32-bit or PAE NPT on 64-bit
-	 * KVM; that horror is handled on-demand by mmu_alloc_special_roots().
-	 */
-	if (tdp_enabled && kvm_mmu_get_tdp_level(vcpu) > PT32E_ROOT_LEVEL)
-		return 0;
-
-	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_DMA32);
-	if (!page)
-		return -ENOMEM;
-
-	mmu->pae_root = page_address(page);
-
-	/*
-	 * CR3 is only 32 bits when PAE paging is used, thus it's impossible to
-	 * get the CPU to treat the PDPTEs as encrypted.  Decrypt the page so
-	 * that KVM's writes and the CPU's reads get along.  Note, this is
-	 * only necessary when using shadow paging, as 64-bit NPT can get at
-	 * the C-bit even when shadowing 32-bit NPT, and SME isn't supported
-	 * by 32-bit kernels (when KVM itself uses 32-bit NPT).
-	 */
-	if (!tdp_enabled)
-		set_memory_decrypted((unsigned long)mmu->pae_root, 1);
-	else
-		WARN_ON_ONCE(shadow_me_mask);
-
-	for (i = 0; i < 4; ++i)
-		mmu->pae_root[i] = INVALID_PAE_ROOT;
-
 	return 0;
 }
 
@@ -5722,7 +5451,6 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
 
 	return ret;
  fail_allocate_root:
-	free_mmu_pages(&vcpu->arch.guest_mmu);
 	return ret;
 }
 
@@ -6363,8 +6091,6 @@ int kvm_mmu_module_init(void)
 void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
 {
 	kvm_mmu_unload(vcpu);
-	free_mmu_pages(&vcpu->arch.root_mmu);
-	free_mmu_pages(&vcpu->arch.guest_mmu);
 	mmu_free_memory_caches(vcpu);
 }
 
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 1015f33e0758..62a762bbcf87 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -365,6 +365,16 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 	pte           = mmu->get_guest_pgd(vcpu);
 	have_ad       = PT_HAVE_ACCESSED_DIRTY(mmu);
 
+	/* kvm_mmu_get_page() might use these values for allocating a
+	 * passthrough shadow page.
+	 */
+	walker->table_gfn[4] = gpte_to_gfn(pte);
+	walker->pt_access[4] = ACC_ALL;
+	walker->table_gfn[3] = gpte_to_gfn(pte);
+	walker->pt_access[3] = ACC_ALL;
+	walker->table_gfn[2] = gpte_to_gfn(pte);
+	walker->pt_access[2] = ACC_ALL;
+
 #if PTTYPE == 64
 	walk_nx_mask = 1ULL << PT64_NX_SHIFT;
 	if (walker->level == PT32E_ROOT_LEVEL) {
@@ -710,7 +720,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		 * Verify that the gpte in the page we've just write
 		 * protected is still there.
 		 */
-		if (FNAME(gpte_changed)(vcpu, gw, it.level - 1))
+		if (it.level - 1 < top_level &&
+		    FNAME(gpte_changed)(vcpu, gw, it.level - 1))
 			goto out_gpte_changed;
 
 		if (sp)
-- 
2.19.1.6.gb485710b



* Re: [RFC PATCH V3 2/4] KVM: X86: Introduce role.glevel for level expanded pagetable
  2022-03-30 13:21 ` [RFC PATCH V3 2/4] KVM: X86: Introduce role.glevel for level expanded pagetable Lai Jiangshan
@ 2022-03-30 16:01   ` Lai Jiangshan
  2022-04-12 21:31   ` Sean Christopherson
  1 sibling, 0 replies; 27+ messages in thread
From: Lai Jiangshan @ 2022-03-30 16:01 UTC (permalink / raw)
  To: LKML, kvm, Paolo Bonzini, Sean Christopherson
  Cc: Lai Jiangshan, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, X86 ML, H. Peter Anvin, linux-doc

On Wed, Mar 30, 2022 at 9:21 PM Lai Jiangshan <jiangshanlai@gmail.com> wrote:

> @@ -1713,7 +1721,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, gfn_t gfn,
>
>         sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
>         sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> -       if (!role.direct)
> +       if (role.glevel == role.level)
>                 sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
>         set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
>
> @@ -2054,6 +2062,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
>                 quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
>                 role.quadrant = quadrant;
>         }
> +       if (level < role.glevel)
> +               role.glevel = level;

missing:

if (direct)
   role.glevel = 0;

>
>         sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
>         for_each_valid_sp(vcpu->kvm, sp, sp_list) {


* Re: [RFC PATCH V3 0/4] KVM: X86: Add and use shadow page with level expanded or acting as pae_root
  2022-03-30 13:21 [RFC PATCH V3 0/4] KVM: X86: Add and use shadow page with level expanded or acting as pae_root Lai Jiangshan
                   ` (3 preceding siblings ...)
  2022-03-30 13:21 ` [RFC PATCH V3 4/4] KVM: X86: Use passthrough and pae_root shadow page for 32bit guests Lai Jiangshan
@ 2022-04-12  9:35 ` Lai Jiangshan
  4 siblings, 0 replies; 27+ messages in thread
From: Lai Jiangshan @ 2022-04-12  9:35 UTC (permalink / raw)
  To: LKML, kvm, Paolo Bonzini, Sean Christopherson; +Cc: Lai Jiangshan

On Wed, Mar 30, 2022 at 9:21 PM Lai Jiangshan <jiangshanlai@gmail.com> wrote:
>
> From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
>


Ping


* Re: [RFC PATCH V3 3/4] KVM: X86: Alloc role.pae_root shadow page
  2022-03-30 13:21 ` [RFC PATCH V3 3/4] KVM: X86: Alloc role.pae_root shadow page Lai Jiangshan
@ 2022-04-12 21:14   ` Sean Christopherson
  2022-04-14  9:07     ` Lai Jiangshan
  0 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2022-04-12 21:14 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, kvm, Paolo Bonzini, Lai Jiangshan, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-doc

On Wed, Mar 30, 2022, Lai Jiangshan wrote:
> From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> 
> Currently, pae_root is a special root page.  This patch adds a
> facility to allow using kvm_mmu_get_page() to allocate a pae_root
> shadow page.

I don't think this will work for shadow paging.  CR3 only has to be 32-byte aligned
for PAE paging.  Unless I'm missing something subtle in the code, KVM will incorrectly
reuse a pae_root if the guest puts multiple PAE CR3s on a single page because KVM's
gfn calculation will drop bits 11:5.
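
To illustrate the aliasing with made-up numbers:

	guest CR3 = 0x1000 => gfn = 0x1000 >> PAGE_SHIFT = 0x1
	guest CR3 = 0x1020 => gfn = 0x1020 >> PAGE_SHIFT = 0x1

i.e. two distinct PAE roots, 32 bytes apart on the same 4KiB page,
hash to the same gfn and can thus match the same shadow page.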

Handling this as a one-off is probably easier.  For TDP, only 32-bit KVM with NPT
benefits from reusing roots, and IMO shaving a few pages in that case is not worth
the complexity.

> @@ -332,7 +337,8 @@ union kvm_mmu_page_role {
>  		unsigned ad_disabled:1;
>  		unsigned guest_mode:1;
>  		unsigned glevel:4;
> -		unsigned :2;
> +		unsigned pae_root:1;

If we do end up adding a role bit, it can simply be "root", which may or may not
be useful for other things.  is_pae_root() is then a combo of root+level.  This
will clean up the code a bit as role.root is (mostly?) hardcoded based on the
function, e.g. root allocators set it, child allocators clear it.
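
E.g. a sketch assuming the hypothetical "root" bit:

	static bool is_pae_root(union kvm_mmu_page_role role)
	{
		return role.root && role.level == PT32E_ROOT_LEVEL;
	}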

> +		unsigned :1;
>  
>  		/*
>  		 * This is left at the top of the word so that
> @@ -699,6 +705,7 @@ struct kvm_vcpu_arch {
>  	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
>  	struct kvm_mmu_memory_cache mmu_gfn_array_cache;
>  	struct kvm_mmu_memory_cache mmu_page_header_cache;
> +	void *mmu_pae_root_cache;
>  
>  	/*
>  	 * QEMU userspace and the guest each have their own FPU state.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index d53037df8177..81ccaa7c1165 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -694,6 +694,35 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
>  	}
>  }
>  
> +static int mmu_topup_pae_root_cache(struct kvm_vcpu *vcpu)
> +{
> +	struct page *page;
> +
> +	if (vcpu->arch.mmu->shadow_root_level != PT32E_ROOT_LEVEL)
> +		return 0;
> +	if (vcpu->arch.mmu_pae_root_cache)
> +		return 0;
> +
> +	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_DMA32);
> +	if (!page)
> +		return -ENOMEM;
> +	vcpu->arch.mmu_pae_root_cache = page_address(page);
> +
> +	/*
> +	 * CR3 is only 32 bits when PAE paging is used, thus it's impossible to
> +	 * get the CPU to treat the PDPTEs as encrypted.  Decrypt the page so
> +	 * that KVM's writes and the CPU's reads get along.  Note, this is
> +	 * only necessary when using shadow paging, as 64-bit NPT can get at
> +	 * the C-bit even when shadowing 32-bit NPT, and SME isn't supported
> +	 * by 32-bit kernels (when KVM itself uses 32-bit NPT).
> +	 */
> +	if (!tdp_enabled)
> +		set_memory_decrypted((unsigned long)vcpu->arch.mmu_pae_root_cache, 1);
> +	else
> +		WARN_ON_ONCE(shadow_me_mask);
> +	return 0;
> +}
> +
>  static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
>  {
>  	int r;
> @@ -705,6 +734,9 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
>  		return r;
>  	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
>  				       PT64_ROOT_MAX_LEVEL);
> +	if (r)
> +		return r;
> +	r = mmu_topup_pae_root_cache(vcpu);

This doesn't need to be called from the common mmu_topup_memory_caches(), e.g. it
will unnecessarily require allocating another DMA32 page when handling a page fault.
I'd rather call this directly kvm_mmu_load(), which also makes it more obvious
that the cache really is only used for roots.

>  	if (r)
>  		return r;
>  	if (maybe_indirect) {
> @@ -717,12 +749,23 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
>  					  PT64_ROOT_MAX_LEVEL);
>  }
>  

...

>  static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
>  			     struct kvm_mmu_page *sp)
>  {
> +	struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
>  	u64 spte;
>  
>  	BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK);
>  
> -	spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
> +	if (!parent_sp->role.pae_root)

Hmm, without role.root, this could be:

	if (sp->role.level == (PT32E_ROOT_LEVEL - 1) &&
	    ((__pa(sptep) & PT64_BASE_ADDR_MASK) == vcpu->arch.mmu->root.hpa))
		spte = make_pae_pdpte(sp->spt);
	else
		spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));

Which is gross, but it works.  We could also do FNAME(link_shadow_page) to send
PAE roots down a dedicated path (also gross).  Point being, I don't think we
strictly need a "root" flag unless the PAE roots are put in mmu_page_hash.

> +		spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
> +	else
> +		spte = make_pae_pdpte(sp->spt);
>  
>  	mmu_spte_set(sptep, spte);
>  
> @@ -4782,6 +4847,8 @@ kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
>  	role.base.level = kvm_mmu_get_tdp_level(vcpu);
>  	role.base.direct = true;
>  	role.base.has_4_byte_gpte = false;
> +	if (role.base.level == PT32E_ROOT_LEVEL)
> +		role.base.pae_root = 1;
>  
>  	return role;
>  }
> @@ -4848,6 +4915,9 @@ kvm_calc_shadow_mmu_root_page_role(struct kvm_vcpu *vcpu,
>  	else
>  		role.base.level = PT64_ROOT_4LEVEL;
>  
> +	if (role.base.level == PT32E_ROOT_LEVEL)
> +		role.base.pae_root = 1;
> +
>  	return role;
>  }
>  
> @@ -4893,6 +4963,8 @@ kvm_calc_shadow_npt_root_page_role(struct kvm_vcpu *vcpu,
>  
>  	role.base.direct = false;
>  	role.base.level = kvm_mmu_get_tdp_level(vcpu);
> +	if (role.base.level == PT32E_ROOT_LEVEL)
> +		role.base.pae_root = 1;
>  
>  	return role;
>  }
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 67489a060eba..1015f33e0758 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -1043,6 +1043,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>  		.access = 0x7,
>  		.quadrant = 0x3,
>  		.glevel = 0xf,
> +		.pae_root = 0x1,
>  	};
>  
>  	/*
> -- 
> 2.19.1.6.gb485710b
> 


* Re: [RFC PATCH V3 2/4] KVM: X86: Introduce role.glevel for level expanded pagetable
  2022-03-30 13:21 ` [RFC PATCH V3 2/4] KVM: X86: Introduce role.glevel for level expanded pagetable Lai Jiangshan
  2022-03-30 16:01   ` Lai Jiangshan
@ 2022-04-12 21:31   ` Sean Christopherson
  2022-04-13  4:13     ` Lai Jiangshan
  2022-04-13  8:38     ` Paolo Bonzini
  1 sibling, 2 replies; 27+ messages in thread
From: Sean Christopherson @ 2022-04-12 21:31 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, kvm, Paolo Bonzini, Lai Jiangshan, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-doc

On Wed, Mar 30, 2022, Lai Jiangshan wrote:
> +  role.glevel:
> +    The level in guest pagetable if the sp is indirect.  Is 0 if the sp
> +    is direct without corresponding guest pagetable, like TDP or !CR0.PG.
> +    When role.level > guest paging level, indirect sp is created on the
> +    top with role.glevel = guest paging level and acks as passthrough sp

s/acks/acts

> +    and its contents are specially installed rather than the translations
> +    of the corresponding guest pagetable.
>    gfn:
>      Either the guest page table containing the translations shadowed by this
>      page, or the base page frame for linear translations.  See role.direct.
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 9694dd5e6ccc..67e1bccaf472 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -314,7 +314,7 @@ struct kvm_kernel_irq_routing_entry;
>   *     cr0_wp=0, therefore these three bits only give rise to 5 possibilities.
>   *
>   * Therefore, the maximum number of possible upper-level shadow pages for a
> - * single gfn is a bit less than 2^13.
> + * single gfn is a bit less than 2^15.
>   */
>  union kvm_mmu_page_role {
>  	u32 word;
> @@ -331,7 +331,8 @@ union kvm_mmu_page_role {
>  		unsigned smap_andnot_wp:1;
>  		unsigned ad_disabled:1;
>  		unsigned guest_mode:1;
> -		unsigned :6;
> +		unsigned glevel:4;

We don't need 4 bits for this.  Crossing our fingers that we never had to shadow
a 2-level guest with a 6-level host, we can do:

		unsigned passthrough_delta:2;

Where the field is ignored if direct=1, '0' for non-passthrough, and 1-3 to handle
shadow_root_level - guest_root_level.  Basically the same idea as Paolo's smushing
of direct+passthrough into mapping_level, just dressed up differently.
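
E.g. a sketch of the suggested encoding (names illustrative):

	/* at role computation time, for an indirect MMU */
	role.passthrough_delta = mmu->shadow_root_level - mmu->root_level;

	/* consumers */
	passthrough = !sp->role.direct && sp->role.passthrough_delta;
	guest_level = sp->role.level - sp->role.passthrough_delta;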

Side topic, we should steal a bit back from "level", or at least document that we
can steal a bit if necessary.

> +		unsigned :2;
>  
>  		/*
>  		 * This is left at the top of the word so that
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 02eae110cbe1..d53037df8177 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -737,8 +737,12 @@ static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
>  
>  static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
>  {
> -	if (!sp->role.direct)
> +	if (!sp->role.direct) {
> +		if (unlikely(sp->role.glevel < sp->role.level))

Regardless of whatever magic we end up using, there should be an is_passthrough_sp()
helper to wrap the magic.

> +			return sp->gfn;
> +
>  		return sp->gfns[index];
> +	}
>  
>  	return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
>  }


* Re: [RFC PATCH V3 4/4] KVM: X86: Use passthrough and pae_root shadow page for 32bit guests
  2022-03-30 13:21 ` [RFC PATCH V3 4/4] KVM: X86: Use passthrough and pae_root shadow page for 32bit guests Lai Jiangshan
@ 2022-04-12 21:34   ` Sean Christopherson
  0 siblings, 0 replies; 27+ messages in thread
From: Sean Christopherson @ 2022-04-12 21:34 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, kvm, Paolo Bonzini, Lai Jiangshan,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin

On Wed, Mar 30, 2022, Lai Jiangshan wrote:
> ---
>  arch/x86/include/asm/kvm_host.h |   4 -
>  arch/x86/kvm/mmu/mmu.c          | 290 +-------------------------------
>  arch/x86/kvm/mmu/paging_tmpl.h  |  13 +-
>  3 files changed, 20 insertions(+), 287 deletions(-)

This diffstat is amazing :-)


* Re: [RFC PATCH V3 2/4] KVM: X86: Introduce role.glevel for level expanded pagetable
  2022-04-12 21:31   ` Sean Christopherson
@ 2022-04-13  4:13     ` Lai Jiangshan
  2022-04-13  8:38     ` Paolo Bonzini
  1 sibling, 0 replies; 27+ messages in thread
From: Lai Jiangshan @ 2022-04-13  4:13 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, kvm, Paolo Bonzini, Lai Jiangshan, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	X86 ML, H. Peter Anvin, linux-doc

On Wed, Apr 13, 2022 at 5:31 AM Sean Christopherson <seanjc@google.com> wrote:

> >  union kvm_mmu_page_role {
> >       u32 word;
> > @@ -331,7 +331,8 @@ union kvm_mmu_page_role {
> >               unsigned smap_andnot_wp:1;
> >               unsigned ad_disabled:1;
> >               unsigned guest_mode:1;
> > -             unsigned :6;
> > +             unsigned glevel:4;
>
> We don't need 4 bits for this.  Crossing our fingers that we never have to shadow
> a 2-level guest with a 6-level host, we can do:
>
>                 unsigned passthrough_delta:2;

We can reclaim the bits later if we ever need them, so I didn't
hesitate to use 4 bits, since glevel keeps the code simple. ^_^

I think the name passthrough_delta is more informative, and since
glevel is only used for comparisons, passthrough_delta can keep the
code just as simple.  I will apply your suggestion.


Thanks
Lai

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH V3 2/4] KVM: X86: Introduce role.glevel for level expanded pagetable
  2022-04-12 21:31   ` Sean Christopherson
  2022-04-13  4:13     ` Lai Jiangshan
@ 2022-04-13  8:38     ` Paolo Bonzini
  2022-04-13 14:42       ` Sean Christopherson
  1 sibling, 1 reply; 27+ messages in thread
From: Paolo Bonzini @ 2022-04-13  8:38 UTC (permalink / raw)
  To: Sean Christopherson, Lai Jiangshan
  Cc: linux-kernel, kvm, Lai Jiangshan, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-doc

On 4/12/22 23:31, Sean Christopherson wrote:
>> +		unsigned glevel:4;
> We don't need 4 bits for this.  Crossing our fingers that we never have to shadow
> a 2-level guest with a 6-level host, we can do:
> 
> 		unsigned passthrough_delta:2;
> 
> Where the field is ignored if direct=1, '0' for non-passthrough, and 1-3 to handle
> shadow_root_level - guest_root_level.  Basically the same idea as Paolo's smushing
> of direct+passthrough into mapping_level, just dressed up differently.

Basically, your passthrough_delta is level - glevel in Jiangshan's 
patches.  You'll need 3 bits anyway when we remove direct later (that 
would be passthrough_delta == level).

Regarding the naming:

* If we keep Jiangshan's logic, I don't like the glevel name very much;
any of mapping_level, target_level or direct_level would be clearer?

* If we go with yours, I would call the field "passthrough_levels".

Paolo


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH V3 2/4] KVM: X86: Introduce role.glevel for level expanded pagetable
  2022-04-13  8:38     ` Paolo Bonzini
@ 2022-04-13 14:42       ` Sean Christopherson
  2022-04-13 14:46         ` Paolo Bonzini
  0 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2022-04-13 14:42 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Lai Jiangshan, linux-kernel, kvm, Lai Jiangshan, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-doc

On Wed, Apr 13, 2022, Paolo Bonzini wrote:
> On 4/12/22 23:31, Sean Christopherson wrote:
> > > +		unsigned glevel:4;
> > We don't need 4 bits for this.  Crossing our fingers that we never have to shadow
> > a 2-level guest with a 6-level host, we can do:
> > 
> > 		unsigned passthrough_delta:2;
> > 
> > Where the field is ignored if direct=1, '0' for non-passthrough, and 1-3 to handle
> > shadow_root_level - guest_root_level.  Basically the same idea as Paolo's smushing
> > of direct+passthrough into mapping_level, just dressed up differently.
> 
> Basically, your passthrough_delta is level - glevel in Jiangshan's patches.
> You'll need 3 bits anyway when we remove direct later (that would be
> passthrough_delta == level).

Are we planning on removing direct?

> Regarding the naming:
> 
> * If we keep Jiangshan's logic, I don't like the glevel name very much; any
> of mapping_level, target_level or direct_level would be clearer?

I don't love any of these names, especially glevel, because the field doesn't
strictly track the guest/mapping/target/direct level.  That could obviously be
remedied by making it valid at all times, but then the role would truly need 3
bits (on top of direct) to track 5-level guest paging.

> * If we go with yours, I would call the field "passthrough_levels".

Hmm, it's not a raw level though.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH V3 2/4] KVM: X86: Introduce role.glevel for level expanded pagetable
  2022-04-13 14:42       ` Sean Christopherson
@ 2022-04-13 14:46         ` Paolo Bonzini
  2022-04-13 15:32           ` Sean Christopherson
  0 siblings, 1 reply; 27+ messages in thread
From: Paolo Bonzini @ 2022-04-13 14:46 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Lai Jiangshan, linux-kernel, kvm, Lai Jiangshan, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-doc

On 4/13/22 16:42, Sean Christopherson wrote:
> On Wed, Apr 13, 2022, Paolo Bonzini wrote:
>> On 4/12/22 23:31, Sean Christopherson wrote:
>>> We don't need 4 bits for this.  Crossing our fingers that we never have to shadow
>>> a 2-level guest with a 6-level host, we can do:
>>>
>>> 		unsigned passthrough_delta:2;
>>>
>> Basically, your passthrough_delta is level - glevel in Jiangshan's patches.
>> You'll need 3 bits anyway when we remove direct later (that would be
>> passthrough_delta == level).
> 
> Are we planning on removing direct?

I think so; it's redundant, and the code almost always checks
direct||passthrough (which would be passthrough_delta > 0 with your scheme).

>> Regarding the naming:
>>
> > > * If we keep Jiangshan's logic, I don't like the glevel name very much; any
>> of mapping_level, target_level or direct_level would be clearer?
> 
> I don't love any of these names, especially glevel, because the field doesn't
> strictly track the guest/mapping/target/direct level.  That could obviously be
> remedied by making it valid at all times, but then the role would truly need 3
> bits (on top of direct) to track 5-level guest paging.

Yes, it would need 3 bits but direct can be removed.

>> * If we go with yours, I would call the field "passthrough_levels".
> 
> Hmm, it's not a raw level though.

Hence the plural. :)

Paolo


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH V3 2/4] KVM: X86: Introduce role.glevel for level expanded pagetable
  2022-04-13 14:46         ` Paolo Bonzini
@ 2022-04-13 15:32           ` Sean Christopherson
  2022-04-13 16:03             ` Paolo Bonzini
  0 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2022-04-13 15:32 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Lai Jiangshan, linux-kernel, kvm, Lai Jiangshan, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-doc

On Wed, Apr 13, 2022, Paolo Bonzini wrote:
> On 4/13/22 16:42, Sean Christopherson wrote:
> > On Wed, Apr 13, 2022, Paolo Bonzini wrote:
> > > On 4/12/22 23:31, Sean Christopherson wrote:
> > > > We don't need 4 bits for this.  Crossing our fingers that we never have to shadow
> > > > a 2-level guest with a 6-level host, we can do:
> > > > 
> > > > 		unsigned passthrough_delta:2;
> > > > 
> > > Basically, your passthrough_delta is level - glevel in Jiangshan's patches.
> > > You'll need 3 bits anyway when we remove direct later (that would be
> > > passthrough_delta == level).
> > 
> > Are we planning on removing direct?
> 
> I think so; it's redundant, and the code almost always checks
> direct||passthrough (which would be passthrough_delta > 0 with your scheme).

It's not redundant, just split out.  E.g. if 3 bits are used for the target_level,
a special value is needed to indicate "direct", otherwise KVM couldn't differentiate
between indirect and direct.  Violent agreement and all that :-)

I'm ok dropping direct and rolling it into target_level, just so long as we add
helpers, e.g. IIUC they would be

static inline bool is_sp_direct(...)
{
	return !sp->role.target_level;
}

static inline bool is_sp_direct_or_passthrough(...)
{
	return sp->role.target_level != sp->role.level;
}

> > > Regarding the naming:
> > > 
> > > * If we keep Jiangshan's logic, I don't like the glevel name very much; any
> > > of mapping_level, target_level or direct_level would be clearer?
> > 
> > I don't love any of these names, especially glevel, because the field doesn't
> > strictly track the guest/mapping/target/direct level.  That could obviously be
> > remedied by making it valid at all times, but then the role would truly need 3
> > bits (on top of direct) to track 5-level guest paging.
> 
> Yes, it would need 3 bits but direct can be removed.
> 
> > > * If we go with yours, I would call the field "passthrough_levels".
> > 
> > Hmm, it's not a raw level though.
> 
> Hence the plural. :)

LOL, I honestly thought that was a typo.  Making it plural sounds like it's passing
through to multiple levels.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH V3 2/4] KVM: X86: Introduce role.glevel for level expanded pagetable
  2022-04-13 15:32           ` Sean Christopherson
@ 2022-04-13 16:03             ` Paolo Bonzini
  2022-04-14 15:51               ` Sean Christopherson
  0 siblings, 1 reply; 27+ messages in thread
From: Paolo Bonzini @ 2022-04-13 16:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Lai Jiangshan, linux-kernel, kvm, Lai Jiangshan, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-doc

On 4/13/22 17:32, Sean Christopherson wrote:
>>> Are we planning on removing direct?
>>
>> I think so; it's redundant, and the code almost always checks
>> direct||passthrough (which would be passthrough_delta > 0 with your scheme).
> 
> I'm ok dropping direct and rolling it into target_level, just so long as we add
> helpers, e.g. IIUC they would be
> 
> static inline bool is_sp_direct(...)
> {
> 	return !sp->role.target_level;
> }
> 
> static inline bool is_sp_direct_or_passthrough(...)
> {
> 	return sp->role.target_level != sp->role.level;
> }

Yes of course.  Or respectively:

	return sp->role.passthrough_levels == sp->role.level;

	return sp->role.passthrough_levels > 0;

I'm not sure about a more concise name for the latter.  Maybe 
sp_has_gpt(...) but I haven't thought it through very much.

>>> Hmm, it's not a raw level though.
>>
>> Hence the plural. :)
> 
> LOL, I honestly thought that was a typo.  Making it plural sounds like it's passing
> through to multiple levels.

I meant it as the number of levels being passed through.  I'll leave that to
Jiangshan; either target_level or passthrough_levels will do for me.

Paolo


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH V3 3/4] KVM: X86: Alloc role.pae_root shadow page
  2022-04-12 21:14   ` Sean Christopherson
@ 2022-04-14  9:07     ` Lai Jiangshan
  2022-04-14  9:08       ` Paolo Bonzini
  2022-04-14 14:52       ` Sean Christopherson
  0 siblings, 2 replies; 27+ messages in thread
From: Lai Jiangshan @ 2022-04-14  9:07 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, kvm, Paolo Bonzini, Lai Jiangshan, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	X86 ML, H. Peter Anvin, linux-doc

On Wed, Apr 13, 2022 at 5:14 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Mar 30, 2022, Lai Jiangshan wrote:
> > From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> >
> > Currently pae_root is special root page, this patch adds facility to
> > allow using kvm_mmu_get_page() to allocate pae_root shadow page.
>
> I don't think this will work for shadow paging.  CR3 only has to be 32-byte aligned
> for PAE paging.  Unless I'm missing something subtle in the code, KVM will incorrectly
> reuse a pae_root if the guest puts multiple PAE CR3s on a single page because KVM's
> gfn calculation will drop bits 11:5.

I forgot about it.

>
> Handling this as a one-off is probably easier.  For TDP, only 32-bit KVM with NPT
> benefits from reusing roots, IMO and shaving a few pages in that case is not worth
> the complexity.
>

I liked the one-off idea yesterday and started trying it.

But things were not going as smoothly as I thought.  There are too
many corner cases to cover.  Maybe I don't get what you envisioned.

One-off shadow pages must not be in the hash, must be freed
immediately in kvm_mmu_free_roots(), must be special-cased in
kvm_mmu_prepare_zap_page(), and so on.

When the guest is 32-bit, the host has to free and allocate an sp
every time the guest changes CR3, which would be a regression
when !TDP.

One-off shadow pages are too different from all the others.

When using one-off shadow pages, role.passthrough can be a single
bit used only for 5-level NPT L0 shadowing 4-level NPT L1,
which is neat.  And role.pae_root can be removed.

I want the newly added shadow pages to fit into the current
shadow page management and root management.

I'm going to add sp->pae_off (u16), which holds bits 11:5 of CR3
when the guest uses PAE paging.  It needs fewer than 10 lines
of code.
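
Something like this (a rough sketch; the exact encoding of pae_off is
still up for discussion):

	/*
	 * Bits 11:5 of a PAE CR3 select one of the 128 possible
	 * 32-byte-aligned PDPTs within the page.
	 */
	sp->pae_off = kvm_read_cr3(vcpu) & GENMASK(11, 5);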

Thanks.
Lai

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH V3 3/4] KVM: X86: Alloc role.pae_root shadow page
  2022-04-14  9:07     ` Lai Jiangshan
@ 2022-04-14  9:08       ` Paolo Bonzini
  2022-04-14  9:32         ` Lai Jiangshan
  2022-04-14 14:52       ` Sean Christopherson
  1 sibling, 1 reply; 27+ messages in thread
From: Paolo Bonzini @ 2022-04-14  9:08 UTC (permalink / raw)
  To: Lai Jiangshan, Sean Christopherson
  Cc: LKML, kvm, Lai Jiangshan, Jonathan Corbet, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, X86 ML,
	H. Peter Anvin, linux-doc

On 4/14/22 11:07, Lai Jiangshan wrote:
>> I don't think this will work for shadow paging.  CR3 only has to be 32-byte aligned
>> for PAE paging.  Unless I'm missing something subtle in the code, KVM will incorrectly
>> reuse a pae_root if the guest puts multiple PAE CR3s on a single page because KVM's
>> gfn calculation will drop bits 11:5.
> 
> I forgot about it.


Isn't the pae_root always rebuilt by

         if (!tdp_enabled && memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs)))
                 kvm_mmu_free_roots(vcpu->kvm, mmu, KVM_MMU_ROOT_CURRENT);

in load_pdptrs?  I think reuse cannot happen.

Paolo


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH V3 3/4] KVM: X86: Alloc role.pae_root shadow page
  2022-04-14  9:08       ` Paolo Bonzini
@ 2022-04-14  9:32         ` Lai Jiangshan
  2022-04-14 10:04           ` Paolo Bonzini
  2022-04-14 13:35           ` Lai Jiangshan
  0 siblings, 2 replies; 27+ messages in thread
From: Lai Jiangshan @ 2022-04-14  9:32 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, LKML, kvm, Lai Jiangshan, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	X86 ML, H. Peter Anvin, linux-doc

On Thu, Apr 14, 2022 at 5:08 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 4/14/22 11:07, Lai Jiangshan wrote:
> >> I don't think this will work for shadow paging.  CR3 only has to be 32-byte aligned
> >> for PAE paging.  Unless I'm missing something subtle in the code, KVM will incorrectly
> >> reuse a pae_root if the guest puts multiple PAE CR3s on a single page because KVM's
> >> gfn calculation will drop bits 11:5.
> >
> > I forgot about it.
>
>
> Isn't the pae_root always rebuilt by
>
>          if (!tdp_enabled && memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs)))
>                  kvm_mmu_free_roots(vcpu->kvm, mmu, KVM_MMU_ROOT_CURRENT);
>
> in load_pdptrs?  I think reuse cannot happen.
>

In this patchset, a root sp can be reused if it is found in the hash,
including the new pae root.

All new kinds of sp added in this patchset are in the hash too.

No more special root pages.

kvm_mmu_free_roots() cannot free those new types of sp if they are still
valid.  And different vCPUs can use the same pae root sp if the guest CR3
of the vCPUs is the same.

And the new pae root can be put in prev_roots too (not implemented yet)
because it is not so special anymore.  As long as sp->gfn, sp->pae_off,
and sp->role match, it can be reused.
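
Roughly, the lookup would be (a sketch against the existing hash walk in
kvm_mmu_get_page(); pae_off is the new field proposed earlier in the
thread):

	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
		if (sp->gfn != gfn || sp->role.word != role.word ||
		    sp->pae_off != pae_off)
			continue;
		/* reuse this sp as the root */
	}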

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH V3 3/4] KVM: X86: Alloc role.pae_root shadow page
  2022-04-14  9:32         ` Lai Jiangshan
@ 2022-04-14 10:04           ` Paolo Bonzini
  2022-04-14 11:06             ` Lai Jiangshan
  2022-04-14 13:35           ` Lai Jiangshan
  1 sibling, 1 reply; 27+ messages in thread
From: Paolo Bonzini @ 2022-04-14 10:04 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Sean Christopherson, LKML, kvm, Lai Jiangshan, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	X86 ML, H. Peter Anvin, linux-doc

On 4/14/22 11:32, Lai Jiangshan wrote:
> kvm_mmu_free_roots() cannot free those new types of sp if they are still
> valid.  And different vCPUs can use the same pae root sp if the guest CR3
> of the vCPUs is the same.

Right, but then load_pdptrs only needs to zap the page before (or 
instead of) calling kvm_mmu_free_roots().

Paolo

> And the new pae root can be put in prev_roots too (not implemented yet)
> because it is not so special anymore.  As long as sp->gfn, sp->pae_off,
> and sp->role match, it can be reused.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH V3 3/4] KVM: X86: Alloc role.pae_root shadow page
  2022-04-14 10:04           ` Paolo Bonzini
@ 2022-04-14 11:06             ` Lai Jiangshan
  2022-04-14 14:12               ` Paolo Bonzini
  0 siblings, 1 reply; 27+ messages in thread
From: Lai Jiangshan @ 2022-04-14 11:06 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, LKML, kvm, Lai Jiangshan, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	X86 ML, H. Peter Anvin, linux-doc

On Thu, Apr 14, 2022 at 6:04 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 4/14/22 11:32, Lai Jiangshan wrote:
> > kvm_mmu_free_roots() cannot free those new types of sp if they are still
> > valid.  And different vCPUs can use the same pae root sp if the guest CR3
> > of the vCPUs is the same.
>
> Right, but then load_pdptrs only needs to zap the page before (or
> instead of) calling kvm_mmu_free_roots().
>


The guest PAE page is write-protected instead now (see patch 4), and
kvm_mmu_pte_write() needs to handle this special write operation
with respect to sp->pae_off (todo).
And load_pdptrs() no longer needs to check whether the PDPTRs have changed.

The semantics will change: today, when the guest updates its PAE root,
the hardware TLB is not updated/flushed until some change to CRx; after
this change, it will be flushed immediately.

Could we first fix only the 5-level NPT L0 for 4-level NPT L1 case?  It
is a real bug.  I separated it out when I tried to implement one-off
shadow pages.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH V3 3/4] KVM: X86: Alloc role.pae_root shadow page
  2022-04-14  9:32         ` Lai Jiangshan
  2022-04-14 10:04           ` Paolo Bonzini
@ 2022-04-14 13:35           ` Lai Jiangshan
  1 sibling, 0 replies; 27+ messages in thread
From: Lai Jiangshan @ 2022-04-14 13:35 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, LKML, kvm, Lai Jiangshan, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	X86 ML, H. Peter Anvin, linux-doc

On Thu, Apr 14, 2022 at 5:32 PM Lai Jiangshan <jiangshanlai@gmail.com> wrote:

>
> All new kinds of sp added in this patchset are in the hash too.
>


I think role.guest_pae_root is needed to distinguish it from
an sp for a level-3 guest page in a 4-level pagetable.

Or just role.guest_root_level (or role.root_level), which can replace
role.passthrough_depth, role.guest_pae_root and role.pae_root.

role.pae_root will then be:

	(role.root_level == 3 || role.root_level == 2) && role.level == 3 &&
	(host is 32-bit || !tdp_enabled)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH V3 3/4] KVM: X86: Alloc role.pae_root shadow page
  2022-04-14 11:06             ` Lai Jiangshan
@ 2022-04-14 14:12               ` Paolo Bonzini
  2022-04-14 14:42                 ` Sean Christopherson
  0 siblings, 1 reply; 27+ messages in thread
From: Paolo Bonzini @ 2022-04-14 14:12 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Sean Christopherson, LKML, kvm, Lai Jiangshan, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	X86 ML, H. Peter Anvin, linux-doc

On 4/14/22 13:06, Lai Jiangshan wrote:
>> Right, but then load_pdptrs only needs to zap the page before (or
>> instead of) calling kvm_mmu_free_roots().
>>
> 
> The guest PAE page is write-protected instead now (see patch 4), and
> kvm_mmu_pte_write() needs to handle this special write operation
> with respect to sp->pae_off (todo).
> And load_pdptrs() no longer needs to check whether the PDPTRs have changed.

Write-protecting the PDPTR page is unnecessary; the PDPTRs cannot change
without another CR3 load.  That should be easy to do in account_shadowed
and unaccount_shadowed.

> I think role.guest_pae_root is needed to distinguish it from
> an sp for a level-3 guest page in a 4-level pagetable.
>
> Or just role.guest_root_level (or role.root_level), which can replace
> role.passthrough_depth, role.guest_pae_root and role.pae_root.

Yes, I agree.  Though this would also change patch 1 substantially,
so I'll wait for you to respin.

Paolo


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH V3 3/4] KVM: X86: Alloc role.pae_root shadow page
  2022-04-14 14:12               ` Paolo Bonzini
@ 2022-04-14 14:42                 ` Sean Christopherson
  0 siblings, 0 replies; 27+ messages in thread
From: Sean Christopherson @ 2022-04-14 14:42 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Lai Jiangshan, LKML, kvm, Lai Jiangshan, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	X86 ML, H. Peter Anvin, linux-doc

On Thu, Apr 14, 2022, Paolo Bonzini wrote:
> On 4/14/22 13:06, Lai Jiangshan wrote:
> > > Right, but then load_pdptrs only needs to zap the page before (or
> > > instead of) calling kvm_mmu_free_roots().
> > > 
> > 
> > The guest PAE page is write-protected instead now (see patch 4), and
> > kvm_mmu_pte_write() needs to handle this special write operation
> > with respect to sp->pae_off (todo).
> > And load_pdptrs() no longer needs to check whether the PDPTRs have changed.
> 
> Write-protecting the PDPTR page is unnecessary; the PDPTRs cannot change
> without another CR3 load.  That should be easy to do in account_shadowed
> and unaccount_shadowed.

Technically that's not true under SVM?

  Under SVM, however, when the processor is in guest mode with PAE enabled, the
  guest PDPT entries are not cached or validated at this point, but instead are
  loaded and checked on demand in the normal course of address translation, just
  like page directory and page table entries

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH V3 3/4] KVM: X86: Alloc role.pae_root shadow page
  2022-04-14  9:07     ` Lai Jiangshan
  2022-04-14  9:08       ` Paolo Bonzini
@ 2022-04-14 14:52       ` Sean Christopherson
  1 sibling, 0 replies; 27+ messages in thread
From: Sean Christopherson @ 2022-04-14 14:52 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: LKML, kvm, Paolo Bonzini, Lai Jiangshan, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	X86 ML, H. Peter Anvin, linux-doc

On Thu, Apr 14, 2022, Lai Jiangshan wrote:
> On Wed, Apr 13, 2022 at 5:14 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, Mar 30, 2022, Lai Jiangshan wrote:
> > > From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> > >
> > > Currently pae_root is special root page, this patch adds facility to
> > > allow using kvm_mmu_get_page() to allocate pae_root shadow page.
> >
> > I don't think this will work for shadow paging.  CR3 only has to be 32-byte aligned
> > for PAE paging.  Unless I'm missing something subtle in the code, KVM will incorrectly
> > reuse a pae_root if the guest puts multiple PAE CR3s on a single page because KVM's
> > gfn calculation will drop bits 11:5.
> 
> I forgot about it.
> 
> >
> > Handling this as a one-off is probably easier.  For TDP, only 32-bit KVM with NPT
> > benefits from reusing roots, IMO and shaving a few pages in that case is not worth
> > the complexity.
> >
> 
> I liked the one-off idea yesterday and started trying it.
> 
> But things were not going as smoothly as I thought.  There are too
> many corner cases to cover.  Maybe I don't get what you envisioned.

Hmm, I believe I was thinking that each vCPU could have a pre-allocated pae_root
shadow page, i.e. keep pae_root, but make it

	struct kvm_mmu_page pae_root;

The alloc/free paths would still need special handling, but at least in theory,
all other code that expects the root to be a shadow page will Just Work.
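
E.g. the pre-allocation could look something like this (illustrative
only; the field name is hypothetical, and a PAE root must sit below
4GiB, hence __GFP_DMA32):

	static int mmu_alloc_pae_root_sp(struct kvm_vcpu *vcpu)
	{
		struct kvm_mmu_page *sp = &vcpu->arch.pae_root_sp;

		sp->spt = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT |
						  __GFP_DMA32);
		if (!sp->spt)
			return -ENOMEM;

		sp->role.level = PT32E_ROOT_LEVEL;
		return 0;
	}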

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH V3 2/4] KVM: X86: Introduce role.glevel for level expanded pagetable
  2022-04-13 16:03             ` Paolo Bonzini
@ 2022-04-14 15:51               ` Sean Christopherson
  2022-04-14 16:32                 ` Lai Jiangshan
  0 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2022-04-14 15:51 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Lai Jiangshan, linux-kernel, kvm, Lai Jiangshan, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-doc

On Wed, Apr 13, 2022, Paolo Bonzini wrote:
> On 4/13/22 17:32, Sean Christopherson wrote:
> > > > Are we planning on removing direct?
> > > 
> > > I think so, it's redundant and the code almost always checks
> > > direct||passthrough (which would be passthrough_delta > 0 with your scheme).
> > 
> > I'm ok dropping direct and rolling it into target_level, just so long as we add
> > helpers, e.g. IIUC they would be
> > 
> > static inline bool is_sp_direct(...)
> > {
> > 	return !sp->role.target_level;
> > }
> > 
> > static inline bool is_sp_direct_or_passthrough(...)
> > {
> > 	return sp->role.target_level != sp->role.level;
> > }
> 
> Yes of course.  Or respectively:
> 
> 	return sp->role.passthrough_levels == sp->role.level;
> 
> 	return sp->role.passthrough_levels > 0;
> 
> I'm not sure about a more concise name for the latter.  Maybe
> sp_has_gpt(...) but I haven't thought it through very much.
> 
> > > > Hmm, it's not a raw level though.
> > > 
> > > Hence the plural. :)
> > 
> > LOL, I honestly thought that was a typo.  Making it plural sounds like it's passing
> > through to multiple levels.
> 
> I meant it as the number of levels being passed through.  I'll leave that to
> Jiangshan; either target_level or passthrough_levels will do for me.

It took me until like 9pm last night to finally understand what you meant by
"passthrough level".   Now that I actually have my head wrapped around this...

Stepping back, "glevel" and any of its derivations are actually just a combination
of CR0.PG, CR4.PAE, EFER.LMA, and CR4.LA57.  And "has_4_byte_gpte" is CR0.PG && !CR4.PAE.
When running with !tdp_enabled, CR0.PG is tracked by "direct".  And with TDP enabled,
CR0.PG is either a don't care (L1 or nested EPT), or is guaranteed to be '1' (nested NPT).

So, rather than add yet more synthetic information to the role, what about using
the info we already have?  I don't think it changes the number of bits that need to
be stored, but I think the result would be easier for people to understand, at
least superficially, e.g. "oh, the mode matters, got it".  We'd need a beefy comment
to explain the whole "passthrough levels" thing, but I think the code would be
more approachable for most people.

If we move efer_lma and cr4_la57 from kvm_mmu_extended_role to kvm_mmu_page_role,
and rename has_4_byte_gpte to cr4_pae, then we don't need passthrough_levels.
If needed for performance, we could still have a "passthrough" bit, but otherwise
detecting a passthrough SP would be

static inline bool is_passthrough_sp(struct kvm_mmu_page *sp)
{
	return !sp->role.direct && sp->role.level > role_to_root_level(sp->role);
}

where role_to_root_level() is extracted from kvm_calc_cpu_role() in Paolo's series.
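
For reference, a sketch of what that helper might look like, assuming
efer_lma/cr4_la57 have been moved into the page role and has_4_byte_gpte
renamed to cr4_pae as described above (the version in Paolo's series may
differ):

	static int role_to_root_level(union kvm_mmu_page_role role)
	{
		/* CR0.PG=0 is covered by role.direct, see above. */
		if (!role.cr4_pae)
			return PT32_ROOT_LEVEL;		/* 2-level 32-bit */
		if (!role.efer_lma)
			return PT32E_ROOT_LEVEL;	/* PAE */
		return role.cr4_la57 ? PT64_ROOT_5LEVEL : PT64_ROOT_4LEVEL;
	}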

Or, if we wanted to optimize "is passthrough", then cr4_la57 could be left in
the extended role, because passthrough is guaranteed to be '0' if CR4.LA57=1.

That would prevent reusing shadow pages between 64-bit paging and PAE paging, but
in practice no sane guest is going to reuse page tables between those modes, so who
cares.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH V3 2/4] KVM: X86: Introduce role.glevel for level expanded pagetable
  2022-04-14 15:51               ` Sean Christopherson
@ 2022-04-14 16:32                 ` Lai Jiangshan
  0 siblings, 0 replies; 27+ messages in thread
From: Lai Jiangshan @ 2022-04-14 16:32 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, LKML, kvm, Lai Jiangshan, Jonathan Corbet,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	X86 ML, H. Peter Anvin, linux-doc

On Thu, Apr 14, 2022 at 11:51 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Apr 13, 2022, Paolo Bonzini wrote:
> > On 4/13/22 17:32, Sean Christopherson wrote:
> > > > > Are we planning on removing direct?
> > > >
> > > > I think so, it's redundant and the code almost always checks
> > > > direct||passthrough (which would be passthrough_delta > 0 with your scheme).
> > >
> > > I'm ok dropping direct and rolling it into target_level, just so long as we add
> > > helpers, e.g. IIUC they would be
> > >
> > > static inline bool is_sp_direct(...)
> > > {
> > >     return !sp->role.target_level;
> > > }
> > >
> > > static inline bool is_sp_direct_or_passthrough(...)
> > > {
> > >     return sp->role.target_level != sp->role.level;
> > > }
> >
> > Yes of course.  Or respectively:
> >
> >       return sp->role.passthrough_levels == sp->role.level;
> >
> >       return sp->role.passthrough_levels > 0;
> >
> > I'm not sure about a more concise name for the latter.  Maybe
> > sp_has_gpt(...) but I haven't thought it through very much.
> >
> > > > > Hmm, it's not a raw level though.
> > > >
> > > > Hence the plural. :)
> > >
> > > LOL, I honestly thought that was a typo.  Making it plural sounds like it's passing
> > > through to multiple levels.
> >
> > I meant it as the number of levels being passed through.  I'll leave that to
> > Jiangshan; either target_level or passthrough_levels will do for me.
>
> It took me until like 9pm last night to finally understand what you meant by
> "passthrough level".   Now that I actually have my head wrapped around this...
>
> Stepping back, "glevel" and any of its derivations are actually just a combination
> of CR0.PG, CR4.PAE, EFER.LMA, and CR4.LA57.  And "has_4_byte_gpte" is CR0.PG && !CR4.PAE.
> When running with !tdp_enabled, CR0.PG is tracked by "direct".  And with TDP enabled,
> CR0.PG is either a don't care (L1 or nested EPT), or is guaranteed to be '1' (nested NPT).

glevel in the patchset is the page level of the corresponding page
table in the guest, not the level of the guest *root* page table.

role.glevel is initialized as the guest root level, and changed to the
level of the guest page in kvm_mmu_get_page().
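
I.e., roughly (a sketch of the idea, not the patch verbatim):

	union kvm_mmu_page_role role = vcpu->arch.mmu->mmu_role.base;

	role.level = level;
	/*
	 * Below the guest root, the sp shadows a real guest page table
	 * of the same level; above it, glevel stays at the guest root
	 * level, so glevel < level marks a passthrough sp.
	 */
	if (level < role.glevel)
		role.glevel = level;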

role.glevel is a bad name and is not sufficient for other purposes
like role.pae_root and role.guest_pae_root.

role.root_level is much better.

role.root_level is a combination of CR0.PG, CR4.PAE, EFER.LMA, and
CR4.LA57 as you said.


>
> So, rather than add yet more synthetic information to the role, what about using
> the info we already have?  I don't think it changes the number of bits that need to
> be stored, but I think the result would be easier for people to understand, at
> least superficially, e.g. "oh, the mode matters, got it".  We'd need a beefy comment
> to explain the whole "passthrough levels" thing, but I think the code would be
> more approachable for most people.
>
> If we move efer_lma and cr4_la57 from kvm_mmu_extended_role to kvm_mmu_page_role,
> and rename has_4_byte_gpte to cr4_pae, then we don't need passthrough_levels.
> If needed for performance, we could still have a "passthrough" bit, but otherwise
> detecting a passthrough SP would be
>
> static inline bool is_passthrough_sp(struct kvm_mmu_page *sp)
> {
>         return !sp->role.direct && sp->role.level > role_to_root_level(sp->role);
> }
>
> where role_to_root_level() is extracted from kvm_calc_cpu_role() in Paolo's series.
>

I'm going to add

static inline bool sp_has_gpte(struct kvm_mmu_page *sp)
{
	if (sp->role.direct)
		return false;

	/* passthrough */
	if (sp->role.level > role_to_root_level(sp->role))
		return true;

	/* guest pae root */
	return sp->role.level == 3 && role_to_root_level(sp->role) == 3;
}

And rename for_each_gfn_indirect_valid_sp() to
for_each_gfn_valid_sp_has_gpte(), which uses sp_has_gpte() instead.
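
E.g., mirroring the existing macro (a sketch):

	#define for_each_gfn_valid_sp_has_gpte(_kvm, _sp, _gfn)		\
		for_each_valid_sp(_kvm, _sp,				\
		  &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)]) \
			if ((_sp)->gfn != (_gfn) || !sp_has_gpte(_sp)) {} else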

I'm not objecting to using efer_lma and cr4_la57.  But I think role.root_level
is more convenient than role_to_root_level(sp->role).

cr4_pae, efer_lma and cr4_la57 are more natural than has_4_byte_gpte
and root_level.

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2022-04-14 16:57 UTC | newest]

Thread overview: 27+ messages
2022-03-30 13:21 [RFC PATCH V3 0/4] KVM: X86: Add and use shadow page with level expanded or acting as pae_root Lai Jiangshan
2022-03-30 13:21 ` [RFC PATCH V3 1/4] KVM: X86: Add arguement gfn and role to kvm_mmu_alloc_page() Lai Jiangshan
2022-03-30 13:21 ` [RFC PATCH V3 2/4] KVM: X86: Introduce role.glevel for level expanded pagetable Lai Jiangshan
2022-03-30 16:01   ` Lai Jiangshan
2022-04-12 21:31   ` Sean Christopherson
2022-04-13  4:13     ` Lai Jiangshan
2022-04-13  8:38     ` Paolo Bonzini
2022-04-13 14:42       ` Sean Christopherson
2022-04-13 14:46         ` Paolo Bonzini
2022-04-13 15:32           ` Sean Christopherson
2022-04-13 16:03             ` Paolo Bonzini
2022-04-14 15:51               ` Sean Christopherson
2022-04-14 16:32                 ` Lai Jiangshan
2022-03-30 13:21 ` [RFC PATCH V3 3/4] KVM: X86: Alloc role.pae_root shadow page Lai Jiangshan
2022-04-12 21:14   ` Sean Christopherson
2022-04-14  9:07     ` Lai Jiangshan
2022-04-14  9:08       ` Paolo Bonzini
2022-04-14  9:32         ` Lai Jiangshan
2022-04-14 10:04           ` Paolo Bonzini
2022-04-14 11:06             ` Lai Jiangshan
2022-04-14 14:12               ` Paolo Bonzini
2022-04-14 14:42                 ` Sean Christopherson
2022-04-14 13:35           ` Lai Jiangshan
2022-04-14 14:52       ` Sean Christopherson
2022-03-30 13:21 ` [RFC PATCH V3 4/4] KVM: X86: Use passthrough and pae_root shadow page for 32bit guests Lai Jiangshan
2022-04-12 21:34   ` Sean Christopherson
2022-04-12  9:35 ` [RFC PATCH V3 0/4] KVM: X86: Add and use shadow page with level expanded or acting as pae_root Lai Jiangshan
