linux-kernel.vger.kernel.org archive mirror
* [PATCH RFC 0/4] 5-level EPT
@ 2016-12-29  9:25 Liang Li
  2016-12-29  9:26 ` [PATCH RFC 1/4] x86: Add the new CPUID and CR4 bits for 5 level page table Liang Li
                   ` (6 more replies)
  0 siblings, 7 replies; 15+ messages in thread
From: Liang Li @ 2016-12-29  9:25 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, tglx, mingo, kirill.shutemov, dave.hansen,
	guangrong.xiao, pbonzini, rkrcmar, Liang Li

x86-64 is currently limited to a 46-bit physical address width, which
can address 64 TiB of memory. Some vendors need to support more for
certain use cases, and Intel plans to extend the physical address width
to 52 bits in some future products.

The current EPT implementation only supports a 4-level page table,
which covers at most a 48-bit physical address width, so the EPT needs
to be extended to 5 levels to support a 52-bit physical address width.
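
For reference, the arithmetic behind these limits: each table level
indexes 9 bits (512 entries) and the page offset contributes 12 bits,
so a 4-level walk covers 4*9+12 = 48 address bits and a 5-level walk
covers 57. A tiny standalone sketch, illustration only and not part of
the patches:

#include <stdio.h>

/* Address bits covered by an N-level page-table walk: 9 index bits
 * per level plus the 12-bit page offset.
 */
static unsigned int walk_addr_bits(unsigned int levels)
{
	return levels * 9 + 12;
}

int main(void)
{
	printf("4-level walk: %u address bits\n", walk_addr_bits(4)); /* 48 */
	printf("5-level walk: %u address bits\n", walk_addr_bits(5)); /* 57 */
	return 0;
}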

This patchset has been tested in the SIMICS environment with a 5-level
paging guest (patched with Kirill's patchset that enables 5-level page
tables), using both EPT and shadow paging. I only covered the boot
process; the guest boots successfully.

Some parts of this patchset can be improved. Any comments on the design
or the patches would be appreciated.

Liang Li (4):
  x86: Add the new CPUID and CR4 bits for 5 level page table
  KVM: MMU: Rename PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL
  KVM: MMU: Add 5 level EPT & Shadow page table support.
  VMX: Expose the LA57 feature to VM

 arch/x86/include/asm/cpufeatures.h          |   1 +
 arch/x86/include/asm/kvm_host.h             |  15 +--
 arch/x86/include/asm/vmx.h                  |   1 +
 arch/x86/include/uapi/asm/processor-flags.h |   2 +
 arch/x86/kvm/cpuid.c                        |  15 ++-
 arch/x86/kvm/cpuid.h                        |   8 ++
 arch/x86/kvm/emulate.c                      |  15 ++-
 arch/x86/kvm/kvm_cache_regs.h               |   7 +-
 arch/x86/kvm/mmu.c                          | 179 +++++++++++++++++++++-------
 arch/x86/kvm/mmu.h                          |   2 +-
 arch/x86/kvm/mmu_audit.c                    |   5 +-
 arch/x86/kvm/paging_tmpl.h                  |  19 ++-
 arch/x86/kvm/svm.c                          |   2 +-
 arch/x86/kvm/vmx.c                          |  23 ++--
 arch/x86/kvm/x86.c                          |   8 +-
 arch/x86/kvm/x86.h                          |  10 ++
 16 files changed, 234 insertions(+), 78 deletions(-)

-- 
1.9.1

* [PATCH RFC 1/4] x86: Add the new CPUID and CR4 bits for 5 level page table
  2016-12-29  9:25 [PATCH RFC 0/4] 5-level EPT Liang Li
@ 2016-12-29  9:26 ` Liang Li
  2016-12-29  9:26 ` [PATCH RFC 2/4] KVM: MMU: Rename PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL Liang Li
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Liang Li @ 2016-12-29  9:26 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, tglx, mingo, kirill.shutemov, dave.hansen,
	guangrong.xiao, pbonzini, rkrcmar, Liang Li

Define the bits related to the 5-level page table, which supports a
57-bit virtual address space. This patch may already be included in
Kirill's patch set that enables 5-level page tables for x86; since
5-level EPT does not depend on the 5-level page table itself, the
definitions are duplicated here so this patchset can stand on its own.
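
For illustration only (not part of the patch), a minimal userspace
sketch that probes the same CPUID bit X86_FEATURE_LA57 names,
CPUID.(EAX=07H,ECX=0):ECX[16]; it assumes a GCC/Clang toolchain that
provides <cpuid.h>:

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* CPUID.(EAX=07H, ECX=0):ECX[16] is the LA57 bit that
	 * X86_FEATURE_LA57 (word 16, bit 16) refers to.
	 */
	__cpuid_count(7, 0, eax, ebx, ecx, edx);

	printf("LA57 (57-bit linear addresses): %s\n",
	       (ecx & (1u << 16)) ? "supported" : "not supported");

	/* When the OS enables 5-level paging it sets CR4.LA57, the
	 * bit 12 defined in the processor-flags.h hunk below.
	 */
	return 0;
}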

Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
---
 arch/x86/include/asm/cpufeatures.h          | 1 +
 arch/x86/include/uapi/asm/processor-flags.h | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index eafee31..2cf4018 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -288,6 +288,7 @@
 #define X86_FEATURE_AVX512VBMI  (16*32+ 1) /* AVX512 Vector Bit Manipulation instructions*/
 #define X86_FEATURE_PKU		(16*32+ 3) /* Protection Keys for Userspace */
 #define X86_FEATURE_OSPKE	(16*32+ 4) /* OS Protection Keys Enable */
+#define X86_FEATURE_LA57	(16*32 + 16) /* 5-level page tables */
 #define X86_FEATURE_RDPID	(16*32+ 22) /* RDPID instruction */
 
 /* AMD-defined CPU features, CPUID level 0x80000007 (ebx), word 17 */
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index 567de50..185f3d1 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -104,6 +104,8 @@
 #define X86_CR4_OSFXSR		_BITUL(X86_CR4_OSFXSR_BIT)
 #define X86_CR4_OSXMMEXCPT_BIT	10 /* enable unmasked SSE exceptions */
 #define X86_CR4_OSXMMEXCPT	_BITUL(X86_CR4_OSXMMEXCPT_BIT)
+#define X86_CR4_LA57_BIT	12 /* enable 5-level page tables */
+#define X86_CR4_LA57		_BITUL(X86_CR4_LA57_BIT)
 #define X86_CR4_VMXE_BIT	13 /* enable VMX virtualization */
 #define X86_CR4_VMXE		_BITUL(X86_CR4_VMXE_BIT)
 #define X86_CR4_SMXE_BIT	14 /* enable safer mode (TXT) */
-- 
1.9.1

* [PATCH RFC 2/4] KVM: MMU: Rename PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL
  2016-12-29  9:25 [PATCH RFC 0/4] 5-level EPT Liang Li
  2016-12-29  9:26 ` [PATCH RFC 1/4] x86: Add the new CPUID and CR4 bits for 5 level page table Liang Li
@ 2016-12-29  9:26 ` Liang Li
  2017-03-09 14:39   ` Paolo Bonzini
  2016-12-29  9:26 ` [PATCH RFC 3/4] KVM: MMU: Add 5 level EPT & Shadow page table support Liang Li
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Liang Li @ 2016-12-29  9:26 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, tglx, mingo, kirill.shutemov, dave.hansen,
	guangrong.xiao, pbonzini, rkrcmar, Liang Li

Now that 64-bit long mode can use either a 4-level or a 5-level page
table, rename PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL so that
PT64_ROOT_5LEVEL can be introduced for the 5-level case. This makes
the code clearer.

Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
---
 arch/x86/kvm/mmu.c       | 36 ++++++++++++++++++------------------
 arch/x86/kvm/mmu.h       |  2 +-
 arch/x86/kvm/mmu_audit.c |  4 ++--
 arch/x86/kvm/svm.c       |  2 +-
 4 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7012de4..4c40273 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1986,8 +1986,8 @@ static bool kvm_sync_pages(struct kvm_vcpu *vcpu, gfn_t gfn,
 }
 
 struct mmu_page_path {
-	struct kvm_mmu_page *parent[PT64_ROOT_LEVEL];
-	unsigned int idx[PT64_ROOT_LEVEL];
+	struct kvm_mmu_page *parent[PT64_ROOT_4LEVEL];
+	unsigned int idx[PT64_ROOT_4LEVEL];
 };
 
 #define for_each_sp(pvec, sp, parents, i)			\
@@ -2193,8 +2193,8 @@ static void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
 	iterator->shadow_addr = vcpu->arch.mmu.root_hpa;
 	iterator->level = vcpu->arch.mmu.shadow_root_level;
 
-	if (iterator->level == PT64_ROOT_LEVEL &&
-	    vcpu->arch.mmu.root_level < PT64_ROOT_LEVEL &&
+	if (iterator->level == PT64_ROOT_4LEVEL &&
+	    vcpu->arch.mmu.root_level < PT64_ROOT_4LEVEL &&
 	    !vcpu->arch.mmu.direct_map)
 		--iterator->level;
 
@@ -3061,8 +3061,8 @@ static void mmu_free_roots(struct kvm_vcpu *vcpu)
 	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
 		return;
 
-	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_LEVEL &&
-	    (vcpu->arch.mmu.root_level == PT64_ROOT_LEVEL ||
+	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL &&
+	    (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL ||
 	     vcpu->arch.mmu.direct_map)) {
 		hpa_t root = vcpu->arch.mmu.root_hpa;
 
@@ -3114,10 +3114,10 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 	struct kvm_mmu_page *sp;
 	unsigned i;
 
-	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_LEVEL) {
+	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL) {
 		spin_lock(&vcpu->kvm->mmu_lock);
 		make_mmu_pages_available(vcpu);
-		sp = kvm_mmu_get_page(vcpu, 0, 0, PT64_ROOT_LEVEL, 1, ACC_ALL);
+		sp = kvm_mmu_get_page(vcpu, 0, 0, PT64_ROOT_4LEVEL, 1, ACC_ALL);
 		++sp->root_count;
 		spin_unlock(&vcpu->kvm->mmu_lock);
 		vcpu->arch.mmu.root_hpa = __pa(sp->spt);
@@ -3158,14 +3158,14 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	 * Do we shadow a long mode page table? If so we need to
 	 * write-protect the guests page table root.
 	 */
-	if (vcpu->arch.mmu.root_level == PT64_ROOT_LEVEL) {
+	if (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL) {
 		hpa_t root = vcpu->arch.mmu.root_hpa;
 
 		MMU_WARN_ON(VALID_PAGE(root));
 
 		spin_lock(&vcpu->kvm->mmu_lock);
 		make_mmu_pages_available(vcpu);
-		sp = kvm_mmu_get_page(vcpu, root_gfn, 0, PT64_ROOT_LEVEL,
+		sp = kvm_mmu_get_page(vcpu, root_gfn, 0, PT64_ROOT_4LEVEL,
 				      0, ACC_ALL);
 		root = __pa(sp->spt);
 		++sp->root_count;
@@ -3180,7 +3180,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	 * the shadow page table may be a PAE or a long mode page table.
 	 */
 	pm_mask = PT_PRESENT_MASK;
-	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_LEVEL)
+	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL)
 		pm_mask |= PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
 
 	for (i = 0; i < 4; ++i) {
@@ -3213,7 +3213,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	 * If we shadow a 32 bit page table with a long mode page
 	 * table we enter this path.
 	 */
-	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_LEVEL) {
+	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL) {
 		if (vcpu->arch.mmu.lm_root == NULL) {
 			/*
 			 * The additional page necessary for this is only
@@ -3258,7 +3258,7 @@ static void mmu_sync_roots(struct kvm_vcpu *vcpu)
 
 	vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
 	kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);
-	if (vcpu->arch.mmu.root_level == PT64_ROOT_LEVEL) {
+	if (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL) {
 		hpa_t root = vcpu->arch.mmu.root_hpa;
 		sp = page_header(root);
 		mmu_sync_children(vcpu, sp);
@@ -3334,7 +3334,7 @@ static bool mmio_info_in_cache(struct kvm_vcpu *vcpu, u64 addr, bool direct)
 walk_shadow_page_get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
 {
 	struct kvm_shadow_walk_iterator iterator;
-	u64 sptes[PT64_ROOT_LEVEL], spte = 0ull;
+	u64 sptes[PT64_ROOT_4LEVEL], spte = 0ull;
 	int root, leaf;
 	bool reserved = false;
 
@@ -3725,7 +3725,7 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu,
 		rsvd_check->rsvd_bits_mask[1][0] =
 			rsvd_check->rsvd_bits_mask[0][0];
 		break;
-	case PT64_ROOT_LEVEL:
+	case PT64_ROOT_4LEVEL:
 		rsvd_check->rsvd_bits_mask[0][3] = exb_bit_rsvd |
 			nonleaf_bit8_rsvd | rsvd_bits(7, 7) |
 			rsvd_bits(maxphyaddr, 51);
@@ -4034,7 +4034,7 @@ static void paging64_init_context_common(struct kvm_vcpu *vcpu,
 static void paging64_init_context(struct kvm_vcpu *vcpu,
 				  struct kvm_mmu *context)
 {
-	paging64_init_context_common(vcpu, context, PT64_ROOT_LEVEL);
+	paging64_init_context_common(vcpu, context, PT64_ROOT_4LEVEL);
 }
 
 static void paging32_init_context(struct kvm_vcpu *vcpu,
@@ -4088,7 +4088,7 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
 		context->root_level = 0;
 	} else if (is_long_mode(vcpu)) {
 		context->nx = is_nx(vcpu);
-		context->root_level = PT64_ROOT_LEVEL;
+		context->root_level = PT64_ROOT_4LEVEL;
 		reset_rsvds_bits_mask(vcpu, context);
 		context->gva_to_gpa = paging64_gva_to_gpa;
 	} else if (is_pae(vcpu)) {
@@ -4196,7 +4196,7 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
 		g_context->gva_to_gpa = nonpaging_gva_to_gpa_nested;
 	} else if (is_long_mode(vcpu)) {
 		g_context->nx = is_nx(vcpu);
-		g_context->root_level = PT64_ROOT_LEVEL;
+		g_context->root_level = PT64_ROOT_4LEVEL;
 		reset_rsvds_bits_mask(vcpu, g_context);
 		g_context->gva_to_gpa = paging64_gva_to_gpa_nested;
 	} else if (is_pae(vcpu)) {
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index ddc56e9..0d87de7 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -37,7 +37,7 @@
 #define PT32_DIR_PSE36_MASK \
 	(((1ULL << PT32_DIR_PSE36_SIZE) - 1) << PT32_DIR_PSE36_SHIFT)
 
-#define PT64_ROOT_LEVEL 4
+#define PT64_ROOT_4LEVEL 4
 #define PT32_ROOT_LEVEL 2
 #define PT32E_ROOT_LEVEL 3
 
diff --git a/arch/x86/kvm/mmu_audit.c b/arch/x86/kvm/mmu_audit.c
index dcce533..2e6996d 100644
--- a/arch/x86/kvm/mmu_audit.c
+++ b/arch/x86/kvm/mmu_audit.c
@@ -62,11 +62,11 @@ static void mmu_spte_walk(struct kvm_vcpu *vcpu, inspect_spte_fn fn)
 	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
 		return;
 
-	if (vcpu->arch.mmu.root_level == PT64_ROOT_LEVEL) {
+	if (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL) {
 		hpa_t root = vcpu->arch.mmu.root_hpa;
 
 		sp = page_header(root);
-		__mmu_spte_walk(vcpu, sp, fn, PT64_ROOT_LEVEL);
+		__mmu_spte_walk(vcpu, sp, fn, PT64_ROOT_4LEVEL);
 		return;
 	}
 
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 08a4d3a..1acc6de 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -565,7 +565,7 @@ static inline void invlpga(unsigned long addr, u32 asid)
 static int get_npt_level(void)
 {
 #ifdef CONFIG_X86_64
-	return PT64_ROOT_LEVEL;
+	return PT64_ROOT_4LEVEL;
 #else
 	return PT32E_ROOT_LEVEL;
 #endif
-- 
1.9.1

* [PATCH RFC 3/4] KVM: MMU: Add 5 level EPT & Shadow page table support.
  2016-12-29  9:25 [PATCH RFC 0/4] 5-level EPT Liang Li
  2016-12-29  9:26 ` [PATCH RFC 1/4] x86: Add the new CPUID and CR4 bits for 5 level page table Liang Li
  2016-12-29  9:26 ` [PATCH RFC 2/4] KVM: MMU: Rename PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL Liang Li
@ 2016-12-29  9:26 ` Liang Li
  2017-03-09 15:12   ` Paolo Bonzini
  2016-12-29  9:26 ` [PATCH RFC 4/4] VMX: Expose the LA57 feature to VM Liang Li
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Liang Li @ 2016-12-29  9:26 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, tglx, mingo, kirill.shutemov, dave.hansen,
	guangrong.xiao, pbonzini, rkrcmar, Liang Li

Future Intel CPUs will extend the maximum physical address to 52 bits.
To support the new physical address width, EPT is extended to a
5-level page table.

This patch adds 5-level EPT and extends the shadow paging code to
support 5-level paging guests. As an RFC, this patch enables 5-level
EPT whenever the hardware supports it; that is not a good choice,
because a 5-level EPT walk requires more memory accesses than a
4-level one. The right thing is to use 5-level EPT only when it is
actually needed, which will be changed in a future version.
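
To show the intended selection logic outside the diff context, here is
a hedged sketch of how the walk depth relates to the new capability bit
and the EPTP page-walk-length field (EPTP bits 5:3 hold the walk length
minus 1); in KVM the capability value comes from the
IA32_VMX_EPT_VPID_CAP MSR, here it is simply a parameter:

#include <stdint.h>

#define VMX_EPT_PAGE_WALK_4_BIT	(1ull << 6)
#define VMX_EPT_PAGE_WALK_5_BIT	(1ull << 7)	/* added by this patch */
#define VMX_EPT_GAW_EPTP_SHIFT	3		/* EPTP bits 5:3 = walk length - 1 */

/* RFC behaviour described above: always use the deepest walk the
 * hardware advertises, i.e. 5 levels when available, else 4.
 */
static int choose_ept_level(uint64_t ept_vpid_cap)
{
	return (ept_vpid_cap & VMX_EPT_PAGE_WALK_5_BIT) ? 5 : 4;
}

/* Encode the chosen depth into the EPTP page-walk-length field. */
static uint64_t eptp_walk_length(int level)
{
	return (uint64_t)(level - 1) << VMX_EPT_GAW_EPTP_SHIFT;
}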

Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |   3 +-
 arch/x86/include/asm/vmx.h      |   1 +
 arch/x86/kvm/cpuid.h            |   8 ++
 arch/x86/kvm/mmu.c              | 167 +++++++++++++++++++++++++++++++---------
 arch/x86/kvm/mmu_audit.c        |   5 +-
 arch/x86/kvm/paging_tmpl.h      |  19 ++++-
 arch/x86/kvm/vmx.c              |  19 +++--
 arch/x86/kvm/x86.h              |  10 +++
 8 files changed, 184 insertions(+), 48 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a7066dc..e505dac 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -124,6 +124,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
 #define KVM_NR_VAR_MTRR 8
 
 #define ASYNC_PF_PER_VCPU 64
+#define PT64_ROOT_5LEVEL 5
 
 enum kvm_reg {
 	VCPU_REGS_RAX = 0,
@@ -310,7 +311,7 @@ struct kvm_pio_request {
 };
 
 struct rsvd_bits_validate {
-	u64 rsvd_bits_mask[2][4];
+	u64 rsvd_bits_mask[2][PT64_ROOT_5LEVEL];
 	u64 bad_mt_xwr;
 };
 
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 2b5b2d4..bf2f178 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -442,6 +442,7 @@ enum vmcs_field {
 
 #define VMX_EPT_EXECUTE_ONLY_BIT		(1ull)
 #define VMX_EPT_PAGE_WALK_4_BIT			(1ull << 6)
+#define VMX_EPT_PAGE_WALK_5_BIT			(1ull << 7)
 #define VMX_EPTP_UC_BIT				(1ull << 8)
 #define VMX_EPTP_WB_BIT				(1ull << 14)
 #define VMX_EPT_2MB_PAGE_BIT			(1ull << 16)
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index 35058c2..4bdf3dc 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -88,6 +88,14 @@ static inline bool guest_cpuid_has_pku(struct kvm_vcpu *vcpu)
 	return best && (best->ecx & bit(X86_FEATURE_PKU));
 }
 
+static inline bool guest_cpuid_has_la57(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpuid_entry2 *best;
+
+	best = kvm_find_cpuid_entry(vcpu, 7, 0);
+	return best && (best->ecx & bit(X86_FEATURE_LA57));
+}
+
 static inline bool guest_cpuid_has_longmode(struct kvm_vcpu *vcpu)
 {
 	struct kvm_cpuid_entry2 *best;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 4c40273..0a56f27 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1986,8 +1986,8 @@ static bool kvm_sync_pages(struct kvm_vcpu *vcpu, gfn_t gfn,
 }
 
 struct mmu_page_path {
-	struct kvm_mmu_page *parent[PT64_ROOT_4LEVEL];
-	unsigned int idx[PT64_ROOT_4LEVEL];
+	struct kvm_mmu_page *parent[PT64_ROOT_5LEVEL];
+	unsigned int idx[PT64_ROOT_5LEVEL];
 };
 
 #define for_each_sp(pvec, sp, parents, i)			\
@@ -2198,6 +2198,11 @@ static void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
 	    !vcpu->arch.mmu.direct_map)
 		--iterator->level;
 
+	if (iterator->level == PT64_ROOT_5LEVEL &&
+	    vcpu->arch.mmu.root_level < PT64_ROOT_5LEVEL &&
+	    !vcpu->arch.mmu.direct_map)
+		iterator->level -= 2;
+
 	if (iterator->level == PT32E_ROOT_LEVEL) {
 		iterator->shadow_addr
 			= vcpu->arch.mmu.pae_root[(addr >> 30) & 3];
@@ -3061,9 +3066,12 @@ static void mmu_free_roots(struct kvm_vcpu *vcpu)
 	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
 		return;
 
-	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL &&
-	    (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL ||
-	     vcpu->arch.mmu.direct_map)) {
+	if ((vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL &&
+	     (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL ||
+	      vcpu->arch.mmu.direct_map)) ||
+	    (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_5LEVEL &&
+	     (vcpu->arch.mmu.root_level == PT64_ROOT_5LEVEL ||
+	      vcpu->arch.mmu.direct_map))) {
 		hpa_t root = vcpu->arch.mmu.root_hpa;
 
 		spin_lock(&vcpu->kvm->mmu_lock);
@@ -3114,10 +3122,12 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 	struct kvm_mmu_page *sp;
 	unsigned i;
 
-	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL) {
+	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL ||
+	    vcpu->arch.mmu.shadow_root_level == PT64_ROOT_5LEVEL) {
 		spin_lock(&vcpu->kvm->mmu_lock);
 		make_mmu_pages_available(vcpu);
-		sp = kvm_mmu_get_page(vcpu, 0, 0, PT64_ROOT_4LEVEL, 1, ACC_ALL);
+		sp = kvm_mmu_get_page(vcpu, 0, 0,
+				vcpu->arch.mmu.shadow_root_level, 1, ACC_ALL);
 		++sp->root_count;
 		spin_unlock(&vcpu->kvm->mmu_lock);
 		vcpu->arch.mmu.root_hpa = __pa(sp->spt);
@@ -3158,15 +3168,16 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	 * Do we shadow a long mode page table? If so we need to
 	 * write-protect the guests page table root.
 	 */
-	if (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL) {
+	if (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL ||
+	    vcpu->arch.mmu.root_level == PT64_ROOT_5LEVEL) {
 		hpa_t root = vcpu->arch.mmu.root_hpa;
 
 		MMU_WARN_ON(VALID_PAGE(root));
 
 		spin_lock(&vcpu->kvm->mmu_lock);
 		make_mmu_pages_available(vcpu);
-		sp = kvm_mmu_get_page(vcpu, root_gfn, 0, PT64_ROOT_4LEVEL,
-				      0, ACC_ALL);
+		sp = kvm_mmu_get_page(vcpu, root_gfn, 0,
+				vcpu->arch.mmu.root_level, 0, ACC_ALL);
 		root = __pa(sp->spt);
 		++sp->root_count;
 		spin_unlock(&vcpu->kvm->mmu_lock);
@@ -3180,7 +3191,8 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	 * the shadow page table may be a PAE or a long mode page table.
 	 */
 	pm_mask = PT_PRESENT_MASK;
-	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL)
+	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL ||
+	    vcpu->arch.mmu.shadow_root_level == PT64_ROOT_5LEVEL)
 		pm_mask |= PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
 
 	for (i = 0; i < 4; ++i) {
@@ -3213,7 +3225,8 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	 * If we shadow a 32 bit page table with a long mode page
 	 * table we enter this path.
 	 */
-	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL) {
+	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL ||
+	    vcpu->arch.mmu.shadow_root_level == PT64_ROOT_5LEVEL) {
 		if (vcpu->arch.mmu.lm_root == NULL) {
 			/*
 			 * The additional page necessary for this is only
@@ -3257,8 +3270,8 @@ static void mmu_sync_roots(struct kvm_vcpu *vcpu)
 		return;
 
 	vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
-	kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);
-	if (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL) {
+	if (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL ||
+	    vcpu->arch.mmu.root_level == PT64_ROOT_5LEVEL) {
 		hpa_t root = vcpu->arch.mmu.root_hpa;
 		sp = page_header(root);
 		mmu_sync_children(vcpu, sp);
@@ -3334,7 +3347,7 @@ static bool mmio_info_in_cache(struct kvm_vcpu *vcpu, u64 addr, bool direct)
 walk_shadow_page_get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
 {
 	struct kvm_shadow_walk_iterator iterator;
-	u64 sptes[PT64_ROOT_4LEVEL], spte = 0ull;
+	u64 sptes[PT64_ROOT_5LEVEL], spte = 0ull;
 	int root, leaf;
 	bool reserved = false;
 
@@ -3655,10 +3668,16 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu,
 }
 
 #define PTTYPE_EPT 18 /* arbitrary */
+#define PTTYPE_LA57 57
+
 #define PTTYPE PTTYPE_EPT
 #include "paging_tmpl.h"
 #undef PTTYPE
 
+#define PTTYPE PTTYPE_LA57
+#include "paging_tmpl.h"
+#undef PTTYPE
+
 #define PTTYPE 64
 #include "paging_tmpl.h"
 #undef PTTYPE
@@ -3747,6 +3766,26 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu,
 		rsvd_check->rsvd_bits_mask[1][0] =
 			rsvd_check->rsvd_bits_mask[0][0];
 		break;
+	case PT64_ROOT_5LEVEL:
+		rsvd_check->rsvd_bits_mask[0][4] = exb_bit_rsvd |
+			nonleaf_bit8_rsvd | rsvd_bits(7, 7);
+		rsvd_check->rsvd_bits_mask[0][3] = exb_bit_rsvd |
+			nonleaf_bit8_rsvd | rsvd_bits(7, 7);
+		rsvd_check->rsvd_bits_mask[0][2] = exb_bit_rsvd |
+			nonleaf_bit8_rsvd | gbpages_bit_rsvd;
+		rsvd_check->rsvd_bits_mask[0][1] = exb_bit_rsvd;
+		rsvd_check->rsvd_bits_mask[0][0] = exb_bit_rsvd;
+		rsvd_check->rsvd_bits_mask[1][4] =
+			rsvd_check->rsvd_bits_mask[0][4];
+		rsvd_check->rsvd_bits_mask[1][3] =
+			rsvd_check->rsvd_bits_mask[0][3];
+		rsvd_check->rsvd_bits_mask[1][2] = exb_bit_rsvd |
+			gbpages_bit_rsvd | rsvd_bits(13, 29);
+		rsvd_check->rsvd_bits_mask[1][1] = exb_bit_rsvd |
+			rsvd_bits(13, 20);		/* large page */
+		rsvd_check->rsvd_bits_mask[1][0] =
+			rsvd_check->rsvd_bits_mask[0][0];
+		break;
 	}
 }
 
@@ -3761,25 +3800,43 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
 
 static void
 __reset_rsvds_bits_mask_ept(struct rsvd_bits_validate *rsvd_check,
-			    int maxphyaddr, bool execonly)
+			    int maxphyaddr, bool execonly, int ept_level)
 {
 	u64 bad_mt_xwr;
 
-	rsvd_check->rsvd_bits_mask[0][3] =
-		rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7);
-	rsvd_check->rsvd_bits_mask[0][2] =
-		rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
-	rsvd_check->rsvd_bits_mask[0][1] =
-		rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
-	rsvd_check->rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51);
-
-	/* large page */
-	rsvd_check->rsvd_bits_mask[1][3] = rsvd_check->rsvd_bits_mask[0][3];
-	rsvd_check->rsvd_bits_mask[1][2] =
-		rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 29);
-	rsvd_check->rsvd_bits_mask[1][1] =
-		rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 20);
-	rsvd_check->rsvd_bits_mask[1][0] = rsvd_check->rsvd_bits_mask[0][0];
+	if (ept_level == 5) {
+		rsvd_check->rsvd_bits_mask[0][4] = rsvd_bits(3, 7);
+		rsvd_check->rsvd_bits_mask[0][3] = rsvd_bits(3, 7);
+		rsvd_check->rsvd_bits_mask[0][2] = rsvd_bits(3, 6);
+		rsvd_check->rsvd_bits_mask[0][1] = rsvd_bits(3, 6);
+		rsvd_check->rsvd_bits_mask[0][0] = 0;
+
+		/* large page */
+		rsvd_check->rsvd_bits_mask[1][4] =
+			 rsvd_check->rsvd_bits_mask[0][4];
+		rsvd_check->rsvd_bits_mask[1][3] =
+			 rsvd_check->rsvd_bits_mask[0][3];
+		rsvd_check->rsvd_bits_mask[1][2] = rsvd_bits(12, 29);
+		rsvd_check->rsvd_bits_mask[1][1] = rsvd_bits(12, 20);
+		rsvd_check->rsvd_bits_mask[1][0] = 0;
+	} else {
+		rsvd_check->rsvd_bits_mask[0][3] =
+			rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7);
+		rsvd_check->rsvd_bits_mask[0][2] =
+			rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
+		rsvd_check->rsvd_bits_mask[0][1] =
+			rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
+		rsvd_check->rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51);
+		/* large page */
+		rsvd_check->rsvd_bits_mask[1][3] =
+			 rsvd_check->rsvd_bits_mask[0][3];
+		rsvd_check->rsvd_bits_mask[1][2] =
+			rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 29);
+		rsvd_check->rsvd_bits_mask[1][1] =
+			rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 20);
+		rsvd_check->rsvd_bits_mask[1][0] =
+			 rsvd_check->rsvd_bits_mask[0][0];
+	}
 
 	bad_mt_xwr = 0xFFull << (2 * 8);	/* bits 3..5 must not be 2 */
 	bad_mt_xwr |= 0xFFull << (3 * 8);	/* bits 3..5 must not be 3 */
@@ -3794,10 +3851,10 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
 }
 
 static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu,
-		struct kvm_mmu *context, bool execonly)
+		struct kvm_mmu *context, bool execonly, int ept_level)
 {
 	__reset_rsvds_bits_mask_ept(&context->guest_rsvd_check,
-				    cpuid_maxphyaddr(vcpu), execonly);
+			cpuid_maxphyaddr(vcpu), execonly, ept_level);
 }
 
 /*
@@ -3844,8 +3901,8 @@ static inline bool boot_cpu_is_amd(void)
 					true, true);
 	else
 		__reset_rsvds_bits_mask_ept(&context->shadow_zero_check,
-					    boot_cpu_data.x86_phys_bits,
-					    false);
+					    boot_cpu_data.x86_phys_bits, false,
+					    context->shadow_root_level);
 
 }
 
@@ -3858,7 +3915,8 @@ static inline bool boot_cpu_is_amd(void)
 				struct kvm_mmu *context, bool execonly)
 {
 	__reset_rsvds_bits_mask_ept(&context->shadow_zero_check,
-				    boot_cpu_data.x86_phys_bits, execonly);
+				    boot_cpu_data.x86_phys_bits, execonly,
+				    context->shadow_root_level);
 }
 
 static void update_permission_bitmask(struct kvm_vcpu *vcpu,
@@ -4037,6 +4095,28 @@ static void paging64_init_context(struct kvm_vcpu *vcpu,
 	paging64_init_context_common(vcpu, context, PT64_ROOT_4LEVEL);
 }
 
+static void paging_la57_init_context(struct kvm_vcpu *vcpu,
+				  struct kvm_mmu *context)
+{
+	context->nx = is_nx(vcpu);
+	context->root_level = PT64_ROOT_5LEVEL;
+
+	reset_rsvds_bits_mask(vcpu, context);
+	update_permission_bitmask(vcpu, context, false);
+	update_pkru_bitmask(vcpu, context, false);
+	update_last_nonleaf_level(vcpu, context);
+
+	MMU_WARN_ON(!is_pae(vcpu));
+	context->page_fault = paging_la57_page_fault;
+	context->gva_to_gpa = paging_la57_gva_to_gpa;
+	context->sync_page = paging_la57_sync_page;
+	context->invlpg = paging_la57_invlpg;
+	context->update_pte = paging_la57_update_pte;
+	context->shadow_root_level = PT64_ROOT_5LEVEL;
+	context->root_hpa = INVALID_PAGE;
+	context->direct_map = false;
+}
+
 static void paging32_init_context(struct kvm_vcpu *vcpu,
 				  struct kvm_mmu *context)
 {
@@ -4086,6 +4166,11 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
 		context->nx = false;
 		context->gva_to_gpa = nonpaging_gva_to_gpa;
 		context->root_level = 0;
+	} else if (is_la57_mode(vcpu)) {
+		context->nx = is_nx(vcpu);
+		context->root_level = PT64_ROOT_5LEVEL;
+		reset_rsvds_bits_mask(vcpu, context);
+		context->gva_to_gpa = paging_la57_gva_to_gpa;
 	} else if (is_long_mode(vcpu)) {
 		context->nx = is_nx(vcpu);
 		context->root_level = PT64_ROOT_4LEVEL;
@@ -4119,6 +4204,8 @@ void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu)
 
 	if (!is_paging(vcpu))
 		nonpaging_init_context(vcpu, context);
+	else if (is_la57_mode(vcpu))
+		paging_la57_init_context(vcpu, context);
 	else if (is_long_mode(vcpu))
 		paging64_init_context(vcpu, context);
 	else if (is_pae(vcpu))
@@ -4158,7 +4245,8 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly)
 
 	update_permission_bitmask(vcpu, context, true);
 	update_pkru_bitmask(vcpu, context, true);
-	reset_rsvds_bits_mask_ept(vcpu, context, execonly);
+	reset_rsvds_bits_mask_ept(vcpu, context, execonly,
+				  context->shadow_root_level);
 	reset_ept_shadow_zero_bits_mask(vcpu, context, execonly);
 }
 EXPORT_SYMBOL_GPL(kvm_init_shadow_ept_mmu);
@@ -4194,6 +4282,11 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
 		g_context->nx = false;
 		g_context->root_level = 0;
 		g_context->gva_to_gpa = nonpaging_gva_to_gpa_nested;
+	} else if (is_la57_mode(vcpu)) {
+		g_context->nx = is_nx(vcpu);
+		g_context->root_level = PT64_ROOT_5LEVEL;
+		reset_rsvds_bits_mask(vcpu, g_context);
+		g_context->gva_to_gpa = paging_la57_gva_to_gpa_nested;
 	} else if (is_long_mode(vcpu)) {
 		g_context->nx = is_nx(vcpu);
 		g_context->root_level = PT64_ROOT_4LEVEL;
diff --git a/arch/x86/kvm/mmu_audit.c b/arch/x86/kvm/mmu_audit.c
index 2e6996d..bb40094 100644
--- a/arch/x86/kvm/mmu_audit.c
+++ b/arch/x86/kvm/mmu_audit.c
@@ -62,11 +62,12 @@ static void mmu_spte_walk(struct kvm_vcpu *vcpu, inspect_spte_fn fn)
 	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
 		return;
 
-	if (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL) {
+	if (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL ||
+	    vcpu->arch.mmu.root_level == PT64_ROOT_5LEVEL) {
 		hpa_t root = vcpu->arch.mmu.root_hpa;
 
 		sp = page_header(root);
-		__mmu_spte_walk(vcpu, sp, fn, PT64_ROOT_4LEVEL);
+		__mmu_spte_walk(vcpu, sp, fn, vcpu->arch.mmu.root_level);
 		return;
 	}
 
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index a011054..c126cd3 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -50,6 +50,21 @@ extern u64 __pure __using_nonexistent_pte_bit(void)
 	#define CMPXCHG cmpxchg64
 	#define PT_MAX_FULL_LEVELS 2
 	#endif
+#elif PTTYPE == PTTYPE_LA57
+	#define pt_element_t u64
+	#define guest_walker guest_walker_la57
+	#define FNAME(name) paging_la57_##name
+	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
+	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
+	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
+	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
+	#define PT_LEVEL_BITS PT64_LEVEL_BITS
+	#define PT_GUEST_ACCESSED_MASK PT_ACCESSED_MASK
+	#define PT_GUEST_DIRTY_MASK PT_DIRTY_MASK
+	#define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
+	#define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
+	#define PT_MAX_FULL_LEVELS 5
+	#define CMPXCHG cmpxchg
 #elif PTTYPE == 32
 	#define pt_element_t u32
 	#define guest_walker guest_walker32
@@ -266,7 +281,7 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
 static inline unsigned FNAME(gpte_pkeys)(struct kvm_vcpu *vcpu, u64 gpte)
 {
 	unsigned pkeys = 0;
-#if PTTYPE == 64
+#if PTTYPE == 64 || PTTYPE == PTTYPE_LA57
 	pte_t pte = {.pte = gpte};
 
 	pkeys = pte_flags_pkey(pte_flags(pte));
@@ -300,7 +315,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 	walker->level = mmu->root_level;
 	pte           = mmu->get_cr3(vcpu);
 
-#if PTTYPE == 64
+#if PTTYPE == 64 || PTTYPE == PTTYPE_LA57
 	if (walker->level == PT32E_ROOT_LEVEL) {
 		pte = mmu->get_pdptr(vcpu, (addr >> 30) & 3);
 		trace_kvm_mmu_paging_element(pte, walker->level);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 24db5fb..bfc9f0a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1220,6 +1220,11 @@ static inline bool cpu_has_vmx_ept_4levels(void)
 	return vmx_capability.ept & VMX_EPT_PAGE_WALK_4_BIT;
 }
 
+static inline bool cpu_has_vmx_ept_5levels(void)
+{
+	return vmx_capability.ept & VMX_EPT_PAGE_WALK_5_BIT;
+}
+
 static inline bool cpu_has_vmx_ept_ad_bits(void)
 {
 	return vmx_capability.ept & VMX_EPT_AD_BIT;
@@ -4249,13 +4254,20 @@ static void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
 	vmx->emulation_required = emulation_required(vcpu);
 }
 
+static int get_ept_level(void)
+{
+	if (cpu_has_vmx_ept_5levels())
+		return VMX_EPT_MAX_GAW + 1;
+	return VMX_EPT_DEFAULT_GAW + 1;
+}
+
 static u64 construct_eptp(unsigned long root_hpa)
 {
 	u64 eptp;
 
 	/* TODO write the value reading from MSR */
 	eptp = VMX_EPT_DEFAULT_MT |
-		VMX_EPT_DEFAULT_GAW << VMX_EPT_GAW_EPTP_SHIFT;
+		(get_ept_level() - 1) << VMX_EPT_GAW_EPTP_SHIFT;
 	if (enable_ept_ad_bits)
 		eptp |= VMX_EPT_AD_ENABLE_BIT;
 	eptp |= (root_hpa & PAGE_MASK);
@@ -9356,11 +9368,6 @@ static void __init vmx_check_processor_compat(void *rtn)
 	}
 }
 
-static int get_ept_level(void)
-{
-	return VMX_EPT_DEFAULT_GAW + 1;
-}
-
 static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 {
 	u8 cache;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index e8ff3e4..26627df 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -60,6 +60,16 @@ static inline bool is_64_bit_mode(struct kvm_vcpu *vcpu)
 	return cs_l;
 }
 
+static inline bool is_la57_mode(struct kvm_vcpu *vcpu)
+{
+#ifdef CONFIG_X86_64
+	return (vcpu->arch.efer & EFER_LMA) &&
+		 kvm_read_cr4_bits(vcpu, X86_CR4_LA57);
+#else
+	return 0;
+#endif
+}
+
 static inline bool mmu_is_nested(struct kvm_vcpu *vcpu)
 {
 	return vcpu->arch.walk_mmu == &vcpu->arch.nested_mmu;
-- 
1.9.1

* [PATCH RFC 4/4] VMX: Expose the LA57 feature to VM
  2016-12-29  9:25 [PATCH RFC 0/4] 5-level EPT Liang Li
                   ` (2 preceding siblings ...)
  2016-12-29  9:26 ` [PATCH RFC 3/4] KVM: MMU: Add 5 level EPT & Shadow page table support Liang Li
@ 2016-12-29  9:26 ` Liang Li
  2017-03-09 15:16   ` Paolo Bonzini
  2016-12-29 20:38 ` [PATCH RFC 0/4] 5-level EPT Valdis.Kletnieks
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Liang Li @ 2016-12-29  9:26 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, tglx, mingo, kirill.shutemov, dave.hansen,
	guangrong.xiao, pbonzini, rkrcmar, Liang Li

This patch exposes the 5-level page table feature (LA57) to the VM.
At the same time, the canonical virtual address checks are extended to
handle both 48-bit and 57-bit address widths, which is a prerequisite
for supporting 5-level paging guests.
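
For illustration (a standalone sketch, not part of the patch), the
canonicality rule being parameterized: sign-extend from the top
implemented virtual-address bit and compare, with the 48/57 width
chosen from CR4.LA57 as virt_addr_bits()/get_virt_addr_bits() below do:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Sign-extend 'la' from the top implemented virtual-address bit,
 * as the patched get_canonical() does.
 */
static uint64_t canonical(uint64_t la, unsigned int vaddr_bits)
{
	return (uint64_t)(((int64_t)la << (64 - vaddr_bits)) >> (64 - vaddr_bits));
}

static bool noncanonical(uint64_t la, unsigned int vaddr_bits)
{
	return canonical(la, vaddr_bits) != la;
}

int main(void)
{
	/* Bit 47 set, bits 48-63 clear: canonical for a 57-bit width,
	 * non-canonical for a 48-bit width.
	 */
	uint64_t la = 1ull << 47;

	printf("48-bit check: %s\n", noncanonical(la, 48) ? "non-canonical" : "canonical");
	printf("57-bit check: %s\n", noncanonical(la, 57) ? "non-canonical" : "canonical");
	return 0;
}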

Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
---
 arch/x86/include/asm/kvm_host.h | 12 ++++++------
 arch/x86/kvm/cpuid.c            | 15 +++++++++------
 arch/x86/kvm/emulate.c          | 15 ++++++++++-----
 arch/x86/kvm/kvm_cache_regs.h   |  7 ++++++-
 arch/x86/kvm/vmx.c              |  4 ++--
 arch/x86/kvm/x86.c              |  8 ++++++--
 6 files changed, 39 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e505dac..57850b3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -86,8 +86,8 @@
 			  | X86_CR4_PSE | X86_CR4_PAE | X86_CR4_MCE     \
 			  | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | X86_CR4_PCIDE \
 			  | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
-			  | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE | X86_CR4_SMAP \
-			  | X86_CR4_PKE))
+			  | X86_CR4_OSXMMEXCPT | X86_CR4_LA57 | X86_CR4_VMXE \
+			  | X86_CR4_SMAP | X86_CR4_PKE))
 
 #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
 
@@ -1269,15 +1269,15 @@ static inline void kvm_inject_gp(struct kvm_vcpu *vcpu, u32 error_code)
 	kvm_queue_exception_e(vcpu, GP_VECTOR, error_code);
 }
 
-static inline u64 get_canonical(u64 la)
+static inline u64 get_canonical(u64 la, u8 vaddr_bits)
 {
-	return ((int64_t)la << 16) >> 16;
+	return ((int64_t)la << (64 - vaddr_bits)) >> (64 - vaddr_bits);
 }
 
-static inline bool is_noncanonical_address(u64 la)
+static inline bool is_noncanonical_address(u64 la, u8 vaddr_bits)
 {
 #ifdef CONFIG_X86_64
-	return get_canonical(la) != la;
+	return get_canonical(la, vaddr_bits) != la;
 #else
 	return false;
 #endif
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index e85f6bd..69e8c1a 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -126,13 +126,16 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu)
 	kvm_x86_ops->fpu_activate(vcpu);
 
 	/*
-	 * The existing code assumes virtual address is 48-bit in the canonical
-	 * address checks; exit if it is ever changed.
+	 * The existing code assumes virtual address is 48-bit or 57-bit in the
+	 * canonical address checks; exit if it is ever changed.
 	 */
 	best = kvm_find_cpuid_entry(vcpu, 0x80000008, 0);
-	if (best && ((best->eax & 0xff00) >> 8) != 48 &&
-		((best->eax & 0xff00) >> 8) != 0)
-		return -EINVAL;
+	if (best) {
+		int vaddr_bits = (best->eax & 0xff00) >> 8;
+
+		if (vaddr_bits != 48 && vaddr_bits != 57 && vaddr_bits != 0)
+			return -EINVAL;
+	}
 
 	/* Update physical-address width */
 	vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
@@ -383,7 +386,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 
 	/* cpuid 7.0.ecx*/
 	const u32 kvm_cpuid_7_0_ecx_x86_features =
-		F(AVX512VBMI) | F(PKU) | 0 /*OSPKE*/;
+		F(AVX512VBMI) | F(LA57) | F(PKU) | 0 /*OSPKE*/;
 
 	/* cpuid 7.0.edx*/
 	const u32 kvm_cpuid_7_0_edx_x86_features =
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 56628a4..da01dd7 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -676,6 +676,11 @@ static unsigned insn_alignment(struct x86_emulate_ctxt *ctxt, unsigned size)
 	}
 }
 
+static __always_inline u8 virt_addr_bits(struct x86_emulate_ctxt *ctxt)
+{
+	return (ctxt->ops->get_cr(ctxt, 4) & X86_CR4_LA57) ? 57 : 48;
+}
+
 static __always_inline int __linearize(struct x86_emulate_ctxt *ctxt,
 				       struct segmented_address addr,
 				       unsigned *max_size, unsigned size,
@@ -693,7 +698,7 @@ static __always_inline int __linearize(struct x86_emulate_ctxt *ctxt,
 	switch (mode) {
 	case X86EMUL_MODE_PROT64:
 		*linear = la;
-		if (is_noncanonical_address(la))
+		if (is_noncanonical_address(la, virt_addr_bits(ctxt)))
 			goto bad;
 
 		*max_size = min_t(u64, ~0u, (1ull << 48) - la);
@@ -1721,7 +1726,7 @@ static int __load_segment_descriptor(struct x86_emulate_ctxt *ctxt,
 		if (ret != X86EMUL_CONTINUE)
 			return ret;
 		if (is_noncanonical_address(get_desc_base(&seg_desc) |
-					     ((u64)base3 << 32)))
+				((u64)base3 << 32), virt_addr_bits(ctxt)))
 			return emulate_gp(ctxt, 0);
 	}
 load:
@@ -2796,8 +2801,8 @@ static int em_sysexit(struct x86_emulate_ctxt *ctxt)
 		ss_sel = cs_sel + 8;
 		cs.d = 0;
 		cs.l = 1;
-		if (is_noncanonical_address(rcx) ||
-		    is_noncanonical_address(rdx))
+		if (is_noncanonical_address(rcx, virt_addr_bits(ctxt)) ||
+		    is_noncanonical_address(rdx, virt_addr_bits(ctxt)))
 			return emulate_gp(ctxt, 0);
 		break;
 	}
@@ -3712,7 +3717,7 @@ static int em_lgdt_lidt(struct x86_emulate_ctxt *ctxt, bool lgdt)
 	if (rc != X86EMUL_CONTINUE)
 		return rc;
 	if (ctxt->mode == X86EMUL_MODE_PROT64 &&
-	    is_noncanonical_address(desc_ptr.address))
+	    is_noncanonical_address(desc_ptr.address, virt_addr_bits(ctxt)))
 		return emulate_gp(ctxt, 0);
 	if (lgdt)
 		ctxt->ops->set_gdt(ctxt, &desc_ptr);
diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 762cdf2..5daf75f 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -4,7 +4,7 @@
 #define KVM_POSSIBLE_CR0_GUEST_BITS X86_CR0_TS
 #define KVM_POSSIBLE_CR4_GUEST_BITS				  \
 	(X86_CR4_PVI | X86_CR4_DE | X86_CR4_PCE | X86_CR4_OSFXSR  \
-	 | X86_CR4_OSXMMEXCPT | X86_CR4_PGE)
+	 | X86_CR4_OSXMMEXCPT | X86_CR4_LA57 | X86_CR4_PGE)
 
 static inline unsigned long kvm_register_read(struct kvm_vcpu *vcpu,
 					      enum kvm_reg reg)
@@ -78,6 +78,11 @@ static inline ulong kvm_read_cr4(struct kvm_vcpu *vcpu)
 	return kvm_read_cr4_bits(vcpu, ~0UL);
 }
 
+static inline u8 get_virt_addr_bits(struct kvm_vcpu *vcpu)
+{
+	return kvm_read_cr4_bits(vcpu, X86_CR4_LA57) ? 57 : 48;
+}
+
 static inline u64 kvm_read_edx_eax(struct kvm_vcpu *vcpu)
 {
 	return (kvm_register_read(vcpu, VCPU_REGS_RAX) & -1u)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index bfc9f0a..183a53e 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -123,7 +123,7 @@
 	(KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST | X86_CR0_PG | X86_CR0_PE)
 #define KVM_CR4_GUEST_OWNED_BITS				      \
 	(X86_CR4_PVI | X86_CR4_DE | X86_CR4_PCE | X86_CR4_OSFXSR      \
-	 | X86_CR4_OSXMMEXCPT | X86_CR4_TSD)
+	 | X86_CR4_OSXMMEXCPT | X86_CR4_LA57 | X86_CR4_TSD)
 
 #define KVM_PMODE_VM_CR4_ALWAYS_ON (X86_CR4_PAE | X86_CR4_VMXE)
 #define KVM_RMODE_VM_CR4_ALWAYS_ON (X86_CR4_VME | X86_CR4_PAE | X86_CR4_VMXE)
@@ -7017,7 +7017,7 @@ static int get_vmx_mem_address(struct kvm_vcpu *vcpu,
 		 * non-canonical form. This is the only check on the memory
 		 * destination for long mode!
 		 */
-		exn = is_noncanonical_address(*ret);
+		exn = is_noncanonical_address(*ret, get_virt_addr_bits(vcpu));
 	} else if (is_protmode(vcpu)) {
 		/* Protected mode: apply checks for segment validity in the
 		 * following order:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 51ccfe0..b935658 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -762,6 +762,9 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 	if (!guest_cpuid_has_pku(vcpu) && (cr4 & X86_CR4_PKE))
 		return 1;
 
+	if (!guest_cpuid_has_la57(vcpu) && (cr4 & X86_CR4_LA57))
+		return 1;
+
 	if (is_long_mode(vcpu)) {
 		if (!(cr4 & X86_CR4_PAE))
 			return 1;
@@ -1074,7 +1077,8 @@ int kvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 	case MSR_KERNEL_GS_BASE:
 	case MSR_CSTAR:
 	case MSR_LSTAR:
-		if (is_noncanonical_address(msr->data))
+		if (is_noncanonical_address(msr->data,
+				get_virt_addr_bits(vcpu)))
 			return 1;
 		break;
 	case MSR_IA32_SYSENTER_EIP:
@@ -1091,7 +1095,7 @@ int kvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 		 * value, and that something deterministic happens if the guest
 		 * invokes 64-bit SYSENTER.
 		 */
-		msr->data = get_canonical(msr->data);
+		msr->data = get_canonical(msr->data, get_virt_addr_bits(vcpu));
 	}
 	return kvm_x86_ops->set_msr(vcpu, msr);
 }
-- 
1.9.1

* Re: [PATCH RFC 0/4] 5-level EPT
  2016-12-29  9:25 [PATCH RFC 0/4] 5-level EPT Liang Li
                   ` (3 preceding siblings ...)
  2016-12-29  9:26 ` [PATCH RFC 4/4] VMX: Expose the LA57 feature to VM Liang Li
@ 2016-12-29 20:38 ` Valdis.Kletnieks
  2016-12-30  1:26   ` Li, Liang Z
  2017-01-02 10:18 ` Paolo Bonzini
  2017-01-05 13:26 ` Kirill A. Shutemov
  6 siblings, 1 reply; 15+ messages in thread
From: Valdis.Kletnieks @ 2016-12-29 20:38 UTC (permalink / raw)
  To: Liang Li
  Cc: kvm, linux-kernel, tglx, mingo, kirill.shutemov, dave.hansen,
	guangrong.xiao, pbonzini, rkrcmar

On Thu, 29 Dec 2016 17:25:59 +0800, Liang Li said:
> x86-64 is currently limited physical address width to 46 bits, which
> can support 64 TiB of memory. Some vendors require to support more for
> some use case. Intel plans to extend the physical address width to
> 52 bits in some of the future products.

Can you explain why this patchset mentions 52 bits in some places,
and 57 in others?  Is it because there are currently in-process
chipsets that will do 52, but you want to future-proof it by extending
it to 57 so future chipsets won't need more work?  Or is there some other
reason?

* RE: [PATCH RFC 0/4] 5-level EPT
  2016-12-29 20:38 ` [PATCH RFC 0/4] 5-level EPT Valdis.Kletnieks
@ 2016-12-30  1:26   ` Li, Liang Z
  0 siblings, 0 replies; 15+ messages in thread
From: Li, Liang Z @ 2016-12-30  1:26 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: kvm, linux-kernel, tglx, mingo, kirill.shutemov, dave.hansen,
	guangrong.xiao, pbonzini, rkrcmar

> Subject: Re: [PATCH RFC 0/4] 5-level EPT
> 
> On Thu, 29 Dec 2016 17:25:59 +0800, Liang Li said:
> > x86-64 is currently limited physical address width to 46 bits, which
> > can support 64 TiB of memory. Some vendors require to support more for
> > some use case. Intel plans to extend the physical address width to
> > 52 bits in some of the future products.
> 
> Can you explain why this patchset mentions 52 bits in some places, and 57 in
> others?  Is it because there are currently in-process chipsets that will do 52,
> but you want to future-proof it by extending it to 57 so future chipsets won't
> need more work?  Or is there some other reason?

The 57 bits I referred to in this patch set are the virtual address
width that will be supported by the future CPUs with a 52-bit physical
address width.
5-level EPT can support at most a 57-bit physical address width, so as
long as future CPUs use no more than 57 bits of physical address, no
further work is needed.

Thanks!
Liang

* Re: [PATCH RFC 0/4] 5-level EPT
  2016-12-29  9:25 [PATCH RFC 0/4] 5-level EPT Liang Li
                   ` (4 preceding siblings ...)
  2016-12-29 20:38 ` [PATCH RFC 0/4] 5-level EPT Valdis.Kletnieks
@ 2017-01-02 10:18 ` Paolo Bonzini
  2017-01-17  2:18   ` Li, Liang Z
  2017-01-05 13:26 ` Kirill A. Shutemov
  6 siblings, 1 reply; 15+ messages in thread
From: Paolo Bonzini @ 2017-01-02 10:18 UTC (permalink / raw)
  To: Liang Li, kvm
  Cc: linux-kernel, tglx, mingo, kirill.shutemov, dave.hansen,
	guangrong.xiao, rkrcmar

On 29/12/2016 10:25, Liang Li wrote:
> x86-64 is currently limited physical address width to 46 bits, which
> can support 64 TiB of memory. Some vendors require to support more for
> some use case. Intel plans to extend the physical address width to
> 52 bits in some of the future products.  
> 
> The current EPT implementation only supports 4 level page table, which
> can support maximum 48 bits physical address width, so it's needed to
> extend the EPT to 5 level to support 52 bits physical address width.
> 
> This patchset has been tested in the SIMICS environment for 5 level
> paging guest, which was patched with Kirill's patchset for enabling
> 5 level page table, with both the EPT and shadow page support. I just
> covered the booting process, the guest can boot successfully. 
> 
> Some parts of this patchset can be improved. Any comments on the design
> or the patches would be appreciated.

I will review the patches.  They seem fairly straightforward.

However, I am worried about the design of the 5-level page table feature
with respect to migration.

Processors that support the new LA57 mode can write
57-canonical/48-noncanonical linear addresses to some registers even
when LA57 mode is inactive.  This is true even of unprivileged
instructions, in particular WRFSBASE/WRGSBASE.

This is fairly bad because, if a guest performs such a write (because of
a bug or because of malice), it will not be possible to migrate the
virtual machine to a machine that lacks LA57 mode.

Ordinarily, hypervisors trap CPUID to hide features that are only
present in some processors of a heterogeneous cluster, and the
hypervisor also traps for example CR4 writes to prevent enabling
features that were masked away.  In this case, however, the only way for
the hypervisor to prevent the write would be to run the guest with
CR4.FSGSBASE=0 and trap all executions of WRFSBASE/WRGSBASE.  This might
have negative effects on performance for workloads that use the
instructions.

Of course, this is a problem even without your patches.  However, I
think it should be addressed first.  I am seriously thinking of
blacklisting FSGSBASE completely on LA57 machines until the above is
fixed in hardware.

Paolo

* Re: [PATCH RFC 0/4] 5-level EPT
  2016-12-29  9:25 [PATCH RFC 0/4] 5-level EPT Liang Li
                   ` (5 preceding siblings ...)
  2017-01-02 10:18 ` Paolo Bonzini
@ 2017-01-05 13:26 ` Kirill A. Shutemov
  6 siblings, 0 replies; 15+ messages in thread
From: Kirill A. Shutemov @ 2017-01-05 13:26 UTC (permalink / raw)
  To: Liang Li
  Cc: kvm, linux-kernel, tglx, mingo, kirill.shutemov, dave.hansen,
	guangrong.xiao, pbonzini, rkrcmar

On Thu, Dec 29, 2016 at 05:25:59PM +0800, Liang Li wrote:
> x86-64 is currently limited physical address width to 46 bits, which
> can support 64 TiB of memory. Some vendors require to support more for
> some use case. Intel plans to extend the physical address width to
> 52 bits in some of the future products.  
> 
> The current EPT implementation only supports 4 level page table, which
> can support maximum 48 bits physical address width, so it's needed to
> extend the EPT to 5 level to support 52 bits physical address width.
> 
> This patchset has been tested in the SIMICS environment for 5 level
> paging guest, which was patched with Kirill's patchset for enabling
> 5 level page table, with both the EPT and shadow page support. I just
> covered the booting process, the guest can boot successfully. 
> 
> Some parts of this patchset can be improved. Any comments on the design
> or the patches would be appreciated.

This looks reasonable, assuming my very limited knowledge of the subject.

The first patch is actually in my patchset, split across two patches.

-- 
 Kirill A. Shutemov

* RE: [PATCH RFC 0/4] 5-level EPT
  2017-01-02 10:18 ` Paolo Bonzini
@ 2017-01-17  2:18   ` Li, Liang Z
  2017-03-09 14:16     ` Paolo Bonzini
  0 siblings, 1 reply; 15+ messages in thread
From: Li, Liang Z @ 2017-01-17  2:18 UTC (permalink / raw)
  To: Paolo Bonzini, kvm
  Cc: linux-kernel, tglx, mingo, kirill.shutemov, dave.hansen,
	guangrong.xiao, rkrcmar, Neiger, Gil, Lai, Paul C

> On 29/12/2016 10:25, Liang Li wrote:
> > x86-64 is currently limited physical address width to 46 bits, which
> > can support 64 TiB of memory. Some vendors require to support more for
> > some use case. Intel plans to extend the physical address width to
> > 52 bits in some of the future products.
> >
> > The current EPT implementation only supports 4 level page table, which
> > can support maximum 48 bits physical address width, so it's needed to
> > extend the EPT to 5 level to support 52 bits physical address width.
> >
> > This patchset has been tested in the SIMICS environment for 5 level
> > paging guest, which was patched with Kirill's patchset for enabling
> > 5 level page table, with both the EPT and shadow page support. I just
> > covered the booting process, the guest can boot successfully.
> >
> > Some parts of this patchset can be improved. Any comments on the
> > design or the patches would be appreciated.
> 
> I will review the patches.  They seem fairly straightforward.
> 
> However, I am worried about the design of the 5-level page table feature
> with respect to migration.
> 
> Processors that support the new LA57 mode can write 57-canonical/48-
> noncanonical linear addresses to some registers even when LA57 mode is
> inactive.  This is true even of unprivileged instructions, in particular
> WRFSBASE/WRGSBASE.
> 
> This is fairly bad because, if a guest performs such a write (because of a bug
> or because of malice), it will not be possible to migrate the virtual machine to
> a machine that lacks LA57 mode.
> 
> Ordinarily, hypervisors trap CPUID to hide features that are only present in
> some processors of a heterogeneous cluster, and the hypervisor also traps
> for example CR4 writes to prevent enabling features that were masked away.
> In this case, however, the only way for the hypervisor to prevent the write
> would be to run the guest with
> CR4.FSGSBASE=0 and trap all executions of WRFSBASE/WRGSBASE.  This
> might have negative effects on performance for workloads that use the
> instructions.
> 
> Of course, this is a problem even without your patches.  However, I think it
> should be addressed first.  I am seriously thinking of blacklisting FSGSBASE
> completely on LA57 machines until the above is fixed in hardware.
> 
> Paolo

The issue has already been forwarded to the hardware team; still waiting for their feedback.

Thanks!
Liang

* Re: [PATCH RFC 0/4] 5-level EPT
  2017-01-17  2:18   ` Li, Liang Z
@ 2017-03-09 14:16     ` Paolo Bonzini
  2017-03-10  8:00       ` Yu Zhang
  0 siblings, 1 reply; 15+ messages in thread
From: Paolo Bonzini @ 2017-03-09 14:16 UTC (permalink / raw)
  To: Li, Liang Z, kvm
  Cc: linux-kernel, tglx, mingo, kirill.shutemov, dave.hansen,
	guangrong.xiao, rkrcmar, Neiger, Gil, Lai, Paul C



On 17/01/2017 03:18, Li, Liang Z wrote:
>> On 29/12/2016 10:25, Liang Li wrote:
>>> x86-64 is currently limited physical address width to 46 bits, which
>>> can support 64 TiB of memory. Some vendors require to support more for
>>> some use case. Intel plans to extend the physical address width to
>>> 52 bits in some of the future products.
>>>
>>> The current EPT implementation only supports 4 level page table, which
>>> can support maximum 48 bits physical address width, so it's needed to
>>> extend the EPT to 5 level to support 52 bits physical address width.
>>>
>>> This patchset has been tested in the SIMICS environment for 5 level
>>> paging guest, which was patched with Kirill's patchset for enabling
>>> 5 level page table, with both the EPT and shadow page support. I just
>>> covered the booting process, the guest can boot successfully.
>>>
>>> Some parts of this patchset can be improved. Any comments on the
>>> design or the patches would be appreciated.
>>
>> I will review the patches.  They seem fairly straightforward.
>>
>> However, I am worried about the design of the 5-level page table feature
>> with respect to migration.
>>
>> Processors that support the new LA57 mode can write 57-canonical/48-
>> noncanonical linear addresses to some registers even when LA57 mode is
>> inactive.  This is true even of unprivileged instructions, in particular
>> WRFSBASE/WRGSBASE.
>>
>> This is fairly bad because, if a guest performs such a write (because of a bug
>> or because of malice), it will not be possible to migrate the virtual machine to
>> a machine that lacks LA57 mode.
>>
>> Ordinarily, hypervisors trap CPUID to hide features that are only present in
>> some processors of a heterogeneous cluster, and the hypervisor also traps
>> for example CR4 writes to prevent enabling features that were masked away.
>> In this case, however, the only way for the hypervisor to prevent the write
>> would be to run the guest with
>> CR4.FSGSBASE=0 and trap all executions of WRFSBASE/WRGSBASE.  This
>> might have negative effects on performance for workloads that use the
>> instructions.
>>
>> Of course, this is a problem even without your patches.  However, I think it
>> should be addressed first.  I am seriously thinking of blacklisting FSGSBASE
>> completely on LA57 machines until the above is fixed in hardware.
>>
>> Paolo
> 
> The issue has already been forwarded to the hardware guys, still waiting for the feedback.

Going to review this now.  Any news?

Paolo

* Re: [PATCH RFC 2/4] KVM: MMU: Rename PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL
  2016-12-29  9:26 ` [PATCH RFC 2/4] KVM: MMU: Rename PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL Liang Li
@ 2017-03-09 14:39   ` Paolo Bonzini
  0 siblings, 0 replies; 15+ messages in thread
From: Paolo Bonzini @ 2017-03-09 14:39 UTC (permalink / raw)
  To: Liang Li, kvm
  Cc: linux-kernel, tglx, mingo, kirill.shutemov, dave.hansen,
	guangrong.xiao, rkrcmar



On 29/12/2016 10:26, Liang Li wrote:
> Now that 64-bit long mode has both 4-level and 5-level page tables,
> rename PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL so that PT64_ROOT_5LEVEL
> can be used for the 5-level page table; this makes the code clearer.
> 
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: "Radim Krčmář" <rkrcmar@redhat.com>

I think you should also #define PT64_ROOT_MAX_LEVEL to 4, and use it
whenever your final series replaces a 4 or PT64_ROOT_LEVEL with
PT64_ROOT_5LEVEL.
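
For illustration, a minimal sketch of the intermediate state this suggests
(names other than PT64_ROOT_MAX_LEVEL come from the quoted diff; its exact
placement in mmu.h is an assumption):

/* arch/x86/kvm/mmu.h, after this renaming patch */
#define PT64_ROOT_4LEVEL	4
#define PT64_ROOT_MAX_LEVEL	PT64_ROOT_4LEVEL

/* arch/x86/kvm/mmu.c: size walk arrays by the maximum, not current, level */
struct mmu_page_path {
	struct kvm_mmu_page *parent[PT64_ROOT_MAX_LEVEL];
	unsigned int idx[PT64_ROOT_MAX_LEVEL];
};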

Then the next patch can do

-#define PT64_ROOT_MAX_LEVEL PT64_ROOT_4LEVEL
+#define PT64_ROOT_MAX_LEVEL PT64_ROOT_5LEVEL

Paolo


> ---
>  arch/x86/kvm/mmu.c       | 36 ++++++++++++++++++------------------
>  arch/x86/kvm/mmu.h       |  2 +-
>  arch/x86/kvm/mmu_audit.c |  4 ++--
>  arch/x86/kvm/svm.c       |  2 +-
>  4 files changed, 22 insertions(+), 22 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 7012de4..4c40273 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1986,8 +1986,8 @@ static bool kvm_sync_pages(struct kvm_vcpu *vcpu, gfn_t gfn,
>  }
>  
>  struct mmu_page_path {
> -	struct kvm_mmu_page *parent[PT64_ROOT_LEVEL];
> -	unsigned int idx[PT64_ROOT_LEVEL];
> +	struct kvm_mmu_page *parent[PT64_ROOT_4LEVEL];
> +	unsigned int idx[PT64_ROOT_4LEVEL];
>  };
>  
>  #define for_each_sp(pvec, sp, parents, i)			\
> @@ -2193,8 +2193,8 @@ static void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
>  	iterator->shadow_addr = vcpu->arch.mmu.root_hpa;
>  	iterator->level = vcpu->arch.mmu.shadow_root_level;
>  
> -	if (iterator->level == PT64_ROOT_LEVEL &&
> -	    vcpu->arch.mmu.root_level < PT64_ROOT_LEVEL &&
> +	if (iterator->level == PT64_ROOT_4LEVEL &&
> +	    vcpu->arch.mmu.root_level < PT64_ROOT_4LEVEL &&
>  	    !vcpu->arch.mmu.direct_map)
>  		--iterator->level;
>  
> @@ -3061,8 +3061,8 @@ static void mmu_free_roots(struct kvm_vcpu *vcpu)
>  	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
>  		return;
>  
> -	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_LEVEL &&
> -	    (vcpu->arch.mmu.root_level == PT64_ROOT_LEVEL ||
> +	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL &&
> +	    (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL ||
>  	     vcpu->arch.mmu.direct_map)) {
>  		hpa_t root = vcpu->arch.mmu.root_hpa;
>  
> @@ -3114,10 +3114,10 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
>  	struct kvm_mmu_page *sp;
>  	unsigned i;
>  
> -	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_LEVEL) {
> +	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL) {
>  		spin_lock(&vcpu->kvm->mmu_lock);
>  		make_mmu_pages_available(vcpu);
> -		sp = kvm_mmu_get_page(vcpu, 0, 0, PT64_ROOT_LEVEL, 1, ACC_ALL);
> +		sp = kvm_mmu_get_page(vcpu, 0, 0, PT64_ROOT_4LEVEL, 1, ACC_ALL);
>  		++sp->root_count;
>  		spin_unlock(&vcpu->kvm->mmu_lock);
>  		vcpu->arch.mmu.root_hpa = __pa(sp->spt);
> @@ -3158,14 +3158,14 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>  	 * Do we shadow a long mode page table? If so we need to
>  	 * write-protect the guests page table root.
>  	 */
> -	if (vcpu->arch.mmu.root_level == PT64_ROOT_LEVEL) {
> +	if (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL) {
>  		hpa_t root = vcpu->arch.mmu.root_hpa;
>  
>  		MMU_WARN_ON(VALID_PAGE(root));
>  
>  		spin_lock(&vcpu->kvm->mmu_lock);
>  		make_mmu_pages_available(vcpu);
> -		sp = kvm_mmu_get_page(vcpu, root_gfn, 0, PT64_ROOT_LEVEL,
> +		sp = kvm_mmu_get_page(vcpu, root_gfn, 0, PT64_ROOT_4LEVEL,
>  				      0, ACC_ALL);
>  		root = __pa(sp->spt);
>  		++sp->root_count;
> @@ -3180,7 +3180,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>  	 * the shadow page table may be a PAE or a long mode page table.
>  	 */
>  	pm_mask = PT_PRESENT_MASK;
> -	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_LEVEL)
> +	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL)
>  		pm_mask |= PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
>  
>  	for (i = 0; i < 4; ++i) {
> @@ -3213,7 +3213,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>  	 * If we shadow a 32 bit page table with a long mode page
>  	 * table we enter this path.
>  	 */
> -	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_LEVEL) {
> +	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL) {
>  		if (vcpu->arch.mmu.lm_root == NULL) {
>  			/*
>  			 * The additional page necessary for this is only
> @@ -3258,7 +3258,7 @@ static void mmu_sync_roots(struct kvm_vcpu *vcpu)
>  
>  	vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
>  	kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);
> -	if (vcpu->arch.mmu.root_level == PT64_ROOT_LEVEL) {
> +	if (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL) {
>  		hpa_t root = vcpu->arch.mmu.root_hpa;
>  		sp = page_header(root);
>  		mmu_sync_children(vcpu, sp);
> @@ -3334,7 +3334,7 @@ static bool mmio_info_in_cache(struct kvm_vcpu *vcpu, u64 addr, bool direct)
>  walk_shadow_page_get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
>  {
>  	struct kvm_shadow_walk_iterator iterator;
> -	u64 sptes[PT64_ROOT_LEVEL], spte = 0ull;
> +	u64 sptes[PT64_ROOT_4LEVEL], spte = 0ull;
>  	int root, leaf;
>  	bool reserved = false;
>  
> @@ -3725,7 +3725,7 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu,
>  		rsvd_check->rsvd_bits_mask[1][0] =
>  			rsvd_check->rsvd_bits_mask[0][0];
>  		break;
> -	case PT64_ROOT_LEVEL:
> +	case PT64_ROOT_4LEVEL:
>  		rsvd_check->rsvd_bits_mask[0][3] = exb_bit_rsvd |
>  			nonleaf_bit8_rsvd | rsvd_bits(7, 7) |
>  			rsvd_bits(maxphyaddr, 51);
> @@ -4034,7 +4034,7 @@ static void paging64_init_context_common(struct kvm_vcpu *vcpu,
>  static void paging64_init_context(struct kvm_vcpu *vcpu,
>  				  struct kvm_mmu *context)
>  {
> -	paging64_init_context_common(vcpu, context, PT64_ROOT_LEVEL);
> +	paging64_init_context_common(vcpu, context, PT64_ROOT_4LEVEL);
>  }
>  
>  static void paging32_init_context(struct kvm_vcpu *vcpu,
> @@ -4088,7 +4088,7 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
>  		context->root_level = 0;
>  	} else if (is_long_mode(vcpu)) {
>  		context->nx = is_nx(vcpu);
> -		context->root_level = PT64_ROOT_LEVEL;
> +		context->root_level = PT64_ROOT_4LEVEL;
>  		reset_rsvds_bits_mask(vcpu, context);
>  		context->gva_to_gpa = paging64_gva_to_gpa;
>  	} else if (is_pae(vcpu)) {
> @@ -4196,7 +4196,7 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
>  		g_context->gva_to_gpa = nonpaging_gva_to_gpa_nested;
>  	} else if (is_long_mode(vcpu)) {
>  		g_context->nx = is_nx(vcpu);
> -		g_context->root_level = PT64_ROOT_LEVEL;
> +		g_context->root_level = PT64_ROOT_4LEVEL;
>  		reset_rsvds_bits_mask(vcpu, g_context);
>  		g_context->gva_to_gpa = paging64_gva_to_gpa_nested;
>  	} else if (is_pae(vcpu)) {
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index ddc56e9..0d87de7 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -37,7 +37,7 @@
>  #define PT32_DIR_PSE36_MASK \
>  	(((1ULL << PT32_DIR_PSE36_SIZE) - 1) << PT32_DIR_PSE36_SHIFT)
>  
> -#define PT64_ROOT_LEVEL 4
> +#define PT64_ROOT_4LEVEL 4
>  #define PT32_ROOT_LEVEL 2
>  #define PT32E_ROOT_LEVEL 3
>  
> diff --git a/arch/x86/kvm/mmu_audit.c b/arch/x86/kvm/mmu_audit.c
> index dcce533..2e6996d 100644
> --- a/arch/x86/kvm/mmu_audit.c
> +++ b/arch/x86/kvm/mmu_audit.c
> @@ -62,11 +62,11 @@ static void mmu_spte_walk(struct kvm_vcpu *vcpu, inspect_spte_fn fn)
>  	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
>  		return;
>  
> -	if (vcpu->arch.mmu.root_level == PT64_ROOT_LEVEL) {
> +	if (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL) {
>  		hpa_t root = vcpu->arch.mmu.root_hpa;
>  
>  		sp = page_header(root);
> -		__mmu_spte_walk(vcpu, sp, fn, PT64_ROOT_LEVEL);
> +		__mmu_spte_walk(vcpu, sp, fn, PT64_ROOT_4LEVEL);
>  		return;
>  	}
>  
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 08a4d3a..1acc6de 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -565,7 +565,7 @@ static inline void invlpga(unsigned long addr, u32 asid)
>  static int get_npt_level(void)
>  {
>  #ifdef CONFIG_X86_64
> -	return PT64_ROOT_LEVEL;
> +	return PT64_ROOT_4LEVEL;
>  #else
>  	return PT32E_ROOT_LEVEL;
>  #endif
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC 3/4] KVM: MMU: Add 5 level EPT & Shadow page table support.
  2016-12-29  9:26 ` [PATCH RFC 3/4] KVM: MMU: Add 5 level EPT & Shadow page table support Liang Li
@ 2017-03-09 15:12   ` Paolo Bonzini
  0 siblings, 0 replies; 15+ messages in thread
From: Paolo Bonzini @ 2017-03-09 15:12 UTC (permalink / raw)
  To: Liang Li, kvm
  Cc: linux-kernel, tglx, mingo, kirill.shutemov, dave.hansen,
	guangrong.xiao, rkrcmar



On 29/12/2016 10:26, Liang Li wrote:
> Future Intel CPUs will extend the maximum physical address to 52 bits.
> To support the new physical address width, EPT is extended to a
> 5-level page table.
> This patch adds 5-level EPT and extends the shadow paging code to
> support 5-level paging guests. In this RFC version, 5-level EPT is
> enabled whenever the hardware supports it, which is not a good choice
> because 5-level EPT requires more memory accesses than 4-level EPT.
> The right thing is to use 5-level EPT only when it is needed; this
> will change in a future version.
> 
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: "Radim Krčmář" <rkrcmar@redhat.com>
> ---
>  arch/x86/include/asm/kvm_host.h |   3 +-
>  arch/x86/include/asm/vmx.h      |   1 +
>  arch/x86/kvm/cpuid.h            |   8 ++
>  arch/x86/kvm/mmu.c              | 167 +++++++++++++++++++++++++++++++---------
>  arch/x86/kvm/mmu_audit.c        |   5 +-
>  arch/x86/kvm/paging_tmpl.h      |  19 ++++-
>  arch/x86/kvm/vmx.c              |  19 +++--
>  arch/x86/kvm/x86.h              |  10 +++
>  8 files changed, 184 insertions(+), 48 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index a7066dc..e505dac 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -124,6 +124,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
>  #define KVM_NR_VAR_MTRR 8
>  
>  #define ASYNC_PF_PER_VCPU 64
> +#define PT64_ROOT_5LEVEL 5
>  
>  enum kvm_reg {
>  	VCPU_REGS_RAX = 0,
> @@ -310,7 +311,7 @@ struct kvm_pio_request {
>  };
>  
>  struct rsvd_bits_validate {
> -	u64 rsvd_bits_mask[2][4];
> +	u64 rsvd_bits_mask[2][PT64_ROOT_5LEVEL];
>  	u64 bad_mt_xwr;
>  };
>  
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 2b5b2d4..bf2f178 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -442,6 +442,7 @@ enum vmcs_field {
>  
>  #define VMX_EPT_EXECUTE_ONLY_BIT		(1ull)
>  #define VMX_EPT_PAGE_WALK_4_BIT			(1ull << 6)
> +#define VMX_EPT_PAGE_WALK_5_BIT			(1ull << 7)
>  #define VMX_EPTP_UC_BIT				(1ull << 8)
>  #define VMX_EPTP_WB_BIT				(1ull << 14)
>  #define VMX_EPT_2MB_PAGE_BIT			(1ull << 16)
> diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
> index 35058c2..4bdf3dc 100644
> --- a/arch/x86/kvm/cpuid.h
> +++ b/arch/x86/kvm/cpuid.h
> @@ -88,6 +88,14 @@ static inline bool guest_cpuid_has_pku(struct kvm_vcpu *vcpu)
>  	return best && (best->ecx & bit(X86_FEATURE_PKU));
>  }
>  
> +static inline bool guest_cpuid_has_la57(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_cpuid_entry2 *best;
> +
> +	best = kvm_find_cpuid_entry(vcpu, 7, 0);
> +	return best && (best->ecx & bit(X86_FEATURE_LA57));
> +}
> +
>  static inline bool guest_cpuid_has_longmode(struct kvm_vcpu *vcpu)
>  {
>  	struct kvm_cpuid_entry2 *best;
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 4c40273..0a56f27 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1986,8 +1986,8 @@ static bool kvm_sync_pages(struct kvm_vcpu *vcpu, gfn_t gfn,
>  }
>  
>  struct mmu_page_path {
> -	struct kvm_mmu_page *parent[PT64_ROOT_4LEVEL];
> -	unsigned int idx[PT64_ROOT_4LEVEL];
> +	struct kvm_mmu_page *parent[PT64_ROOT_5LEVEL];
> +	unsigned int idx[PT64_ROOT_5LEVEL];
>  };
>  
>  #define for_each_sp(pvec, sp, parents, i)			\
> @@ -2198,6 +2198,11 @@ static void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
>  	    !vcpu->arch.mmu.direct_map)
>  		--iterator->level;
>  
> +	if (iterator->level == PT64_ROOT_5LEVEL &&
> +	    vcpu->arch.mmu.root_level < PT64_ROOT_5LEVEL &&
> +	    !vcpu->arch.mmu.direct_map)
> +		iterator->level -= 2;

This (and the "if" before it as well) might actually be dead code.
Please remove it in a separate patch.

>  	if (iterator->level == PT32E_ROOT_LEVEL) {
>  		iterator->shadow_addr
>  			= vcpu->arch.mmu.pae_root[(addr >> 30) & 3];
> @@ -3061,9 +3066,12 @@ static void mmu_free_roots(struct kvm_vcpu *vcpu)
>  	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
>  		return;
>  
> -	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL &&
> -	    (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL ||
> -	     vcpu->arch.mmu.direct_map)) {
> +	if ((vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL &&
> +	     (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL ||
> +	      vcpu->arch.mmu.direct_map)) ||
> +	    (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_5LEVEL &&
> +	     (vcpu->arch.mmu.root_level == PT64_ROOT_5LEVEL ||
> +	      vcpu->arch.mmu.direct_map))) {

Same here:

	if (vcpu->arch.mmu.shadow_root_level >= PT64_ROOT_4LEVEL)

should be enough.  In general, checking >= PT64_ROOT_4LEVEL is better
IMHO than checking for == PT64_ROOT_4LEVEL || == PT64_ROOT_5LEVEL.
These "if"s basically need to single out PAE.  A hypothetical 6-level
page table extension would in all likelihood behave just like 64-bit
LA48 and LA57 paging.
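
As a sketch only (based on the quoted code, not a final patch), the check
in mmu_free_roots would then collapse to:

	if (vcpu->arch.mmu.shadow_root_level >= PT64_ROOT_4LEVEL &&
	    (vcpu->arch.mmu.root_level >= PT64_ROOT_4LEVEL ||
	     vcpu->arch.mmu.direct_map)) {
		/* ... free the single 64-bit root, as in the existing code ... */
	}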

>  		hpa_t root = vcpu->arch.mmu.root_hpa;
>  
>  		spin_lock(&vcpu->kvm->mmu_lock);
> @@ -3114,10 +3122,12 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
>  	struct kvm_mmu_page *sp;
>  	unsigned i;
>  
> -	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL) {
> +	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL ||
> +	    vcpu->arch.mmu.shadow_root_level == PT64_ROOT_5LEVEL) {

Same here and everywhere else.

>  		spin_lock(&vcpu->kvm->mmu_lock);
>  		make_mmu_pages_available(vcpu);
> -		sp = kvm_mmu_get_page(vcpu, 0, 0, PT64_ROOT_4LEVEL, 1, ACC_ALL);
> +		sp = kvm_mmu_get_page(vcpu, 0, 0,
> +				vcpu->arch.mmu.shadow_root_level, 1, ACC_ALL);
>  		++sp->root_count;
>  		spin_unlock(&vcpu->kvm->mmu_lock);
>  		vcpu->arch.mmu.root_hpa = __pa(sp->spt);
> @@ -3158,15 +3168,16 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>  	 * Do we shadow a long mode page table? If so we need to
>  	 * write-protect the guests page table root.
>  	 */
> -	if (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL) {
> +	if (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL ||
> +	    vcpu->arch.mmu.root_level == PT64_ROOT_5LEVEL) {
>  		hpa_t root = vcpu->arch.mmu.root_hpa;
>  
>  		MMU_WARN_ON(VALID_PAGE(root));
>  
>  		spin_lock(&vcpu->kvm->mmu_lock);
>  		make_mmu_pages_available(vcpu);
> -		sp = kvm_mmu_get_page(vcpu, root_gfn, 0, PT64_ROOT_4LEVEL,
> -				      0, ACC_ALL);
> +		sp = kvm_mmu_get_page(vcpu, root_gfn, 0,
> +				vcpu->arch.mmu.root_level, 0, ACC_ALL);
>  		root = __pa(sp->spt);
>  		++sp->root_count;
>  		spin_unlock(&vcpu->kvm->mmu_lock);
> @@ -3180,7 +3191,8 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>  	 * the shadow page table may be a PAE or a long mode page table.
>  	 */
>  	pm_mask = PT_PRESENT_MASK;
> -	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL)
> +	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL ||
> +	    vcpu->arch.mmu.shadow_root_level == PT64_ROOT_5LEVEL)
>  		pm_mask |= PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
>  
>  	for (i = 0; i < 4; ++i) {
> @@ -3213,7 +3225,8 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
>  	 * If we shadow a 32 bit page table with a long mode page
>  	 * table we enter this path.
>  	 */
> -	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL) {
> +	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL ||
> +	    vcpu->arch.mmu.shadow_root_level == PT64_ROOT_5LEVEL) {
>  		if (vcpu->arch.mmu.lm_root == NULL) {
>  			/*
>  			 * The additional page necessary for this is only
> @@ -3257,8 +3270,8 @@ static void mmu_sync_roots(struct kvm_vcpu *vcpu)
>  		return;
>  
>  	vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
> -	kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);
> -	if (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL) {
> +	if (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL ||
> +	    vcpu->arch.mmu.root_level == PT64_ROOT_5LEVEL) {
>  		hpa_t root = vcpu->arch.mmu.root_hpa;
>  		sp = page_header(root);
>  		mmu_sync_children(vcpu, sp);
> @@ -3334,7 +3347,7 @@ static bool mmio_info_in_cache(struct kvm_vcpu *vcpu, u64 addr, bool direct)
>  walk_shadow_page_get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
>  {
>  	struct kvm_shadow_walk_iterator iterator;
> -	u64 sptes[PT64_ROOT_4LEVEL], spte = 0ull;
> +	u64 sptes[PT64_ROOT_5LEVEL], spte = 0ull;
>  	int root, leaf;
>  	bool reserved = false;
>  
> @@ -3655,10 +3668,16 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu,
>  }
>  
>  #define PTTYPE_EPT 18 /* arbitrary */
> +#define PTTYPE_LA57 57
> +
>  #define PTTYPE PTTYPE_EPT
>  #include "paging_tmpl.h"
>  #undef PTTYPE
>  
> +#define PTTYPE PTTYPE_LA57
> +#include "paging_tmpl.h"
> +#undef PTTYPE

This is not needed.  The format for LA57 page tables is the same as for
LA48.

>  #define PTTYPE 64
>  #include "paging_tmpl.h"
>  #undef PTTYPE
> @@ -3747,6 +3766,26 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu,
>  		rsvd_check->rsvd_bits_mask[1][0] =
>  			rsvd_check->rsvd_bits_mask[0][0];
>  		break;
> +	case PT64_ROOT_5LEVEL:
> +		rsvd_check->rsvd_bits_mask[0][4] = exb_bit_rsvd |
> +			nonleaf_bit8_rsvd | rsvd_bits(7, 7);
> +		rsvd_check->rsvd_bits_mask[0][3] = exb_bit_rsvd |
> +			nonleaf_bit8_rsvd | rsvd_bits(7, 7);

I think the code for this and PT64_ROOT_4LEVEL should be the same
(setting rsvd_bits_mask[x][4] for PT64_ROOT_4LEVEL is okay).

You are assuming that MAXPHYADDR=52, but the Intel whitepaper doesn't
say this is going to be always the case.  rsvd_bits in
arch/x86/kvm/mmu.h is not a hot path, feel free to add an

	if (e < s)
		return 0;

there.
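
For reference, a sketch of the guarded helper (the mask construction is
assumed to be the existing one in arch/x86/kvm/mmu.h):

static inline u64 rsvd_bits(int s, int e)
{
	/* Empty range, e.g. when MAXPHYADDR is already 52. */
	if (e < s)
		return 0;

	return ((1ULL << (e - s + 1)) - 1) << s;
}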

> +		rsvd_check->rsvd_bits_mask[0][2] = exb_bit_rsvd |
> +			nonleaf_bit8_rsvd | gbpages_bit_rsvd;
> +		rsvd_check->rsvd_bits_mask[0][1] = exb_bit_rsvd;
> +		rsvd_check->rsvd_bits_mask[0][0] = exb_bit_rsvd;
> +		rsvd_check->rsvd_bits_mask[1][4] =
> +			rsvd_check->rsvd_bits_mask[0][4];
> +		rsvd_check->rsvd_bits_mask[1][3] =
> +			rsvd_check->rsvd_bits_mask[0][3];
> +		rsvd_check->rsvd_bits_mask[1][2] = exb_bit_rsvd |
> +			gbpages_bit_rsvd | rsvd_bits(13, 29);
> +		rsvd_check->rsvd_bits_mask[1][1] = exb_bit_rsvd |
> +			rsvd_bits(13, 20);		/* large page */
> +		rsvd_check->rsvd_bits_mask[1][0] =
> +			rsvd_check->rsvd_bits_mask[0][0];
> +		break;
>  	}
>  }
>  
> @@ -3761,25 +3800,43 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
>  
>  static void
>  __reset_rsvds_bits_mask_ept(struct rsvd_bits_validate *rsvd_check,
> -			    int maxphyaddr, bool execonly)
> +			    int maxphyaddr, bool execonly, int ept_level)
>  {
>  	u64 bad_mt_xwr;
>  
> -	rsvd_check->rsvd_bits_mask[0][3] =
> -		rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7);
> -	rsvd_check->rsvd_bits_mask[0][2] =
> -		rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
> -	rsvd_check->rsvd_bits_mask[0][1] =
> -		rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
> -	rsvd_check->rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51);
> -
> -	/* large page */
> -	rsvd_check->rsvd_bits_mask[1][3] = rsvd_check->rsvd_bits_mask[0][3];
> -	rsvd_check->rsvd_bits_mask[1][2] =
> -		rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 29);
> -	rsvd_check->rsvd_bits_mask[1][1] =
> -		rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 20);
> -	rsvd_check->rsvd_bits_mask[1][0] = rsvd_check->rsvd_bits_mask[0][0];
> +	if (ept_level == 5) {
> +		rsvd_check->rsvd_bits_mask[0][4] = rsvd_bits(3, 7);

Same here, this "if" is not needed at all and the new ept_level argument
shouldn't be required either.
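
With rsvd_bits() guarded as above, the existing 4-level code could simply
fill index 4 as well; roughly (a sketch, not the final patch):

	rsvd_check->rsvd_bits_mask[0][4] =
		rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7);
	rsvd_check->rsvd_bits_mask[0][3] =
		rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7);
	/* ... the remaining entries stay as in the current 4-level code ... */
	rsvd_check->rsvd_bits_mask[1][4] = rsvd_check->rsvd_bits_mask[0][4];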

> +		rsvd_check->rsvd_bits_mask[0][3] = rsvd_bits(3, 7);
> +		rsvd_check->rsvd_bits_mask[0][2] = rsvd_bits(3, 6);
> +		rsvd_check->rsvd_bits_mask[0][1] = rsvd_bits(3, 6);
> +		rsvd_check->rsvd_bits_mask[0][0] = 0;
> +
> +		/* large page */
> +		rsvd_check->rsvd_bits_mask[1][4] =
> +			 rsvd_check->rsvd_bits_mask[0][4];
> +		rsvd_check->rsvd_bits_mask[1][3] =
> +			 rsvd_check->rsvd_bits_mask[0][3];
> +		rsvd_check->rsvd_bits_mask[1][2] = rsvd_bits(12, 29);
> +		rsvd_check->rsvd_bits_mask[1][1] = rsvd_bits(12, 20);
> +		rsvd_check->rsvd_bits_mask[1][0] = 0;
> +	} else {
> +		rsvd_check->rsvd_bits_mask[0][3] =
> +			rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7);
> +		rsvd_check->rsvd_bits_mask[0][2] =
> +			rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
> +		rsvd_check->rsvd_bits_mask[0][1] =
> +			rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 6);
> +		rsvd_check->rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51);
> +		/* large page */
> +		rsvd_check->rsvd_bits_mask[1][3] =
> +			 rsvd_check->rsvd_bits_mask[0][3];
> +		rsvd_check->rsvd_bits_mask[1][2] =
> +			rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 29);
> +		rsvd_check->rsvd_bits_mask[1][1] =
> +			rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 20);
> +		rsvd_check->rsvd_bits_mask[1][0] =
> +			 rsvd_check->rsvd_bits_mask[0][0];
> +	}
>  
>  	bad_mt_xwr = 0xFFull << (2 * 8);	/* bits 3..5 must not be 2 */
>  	bad_mt_xwr |= 0xFFull << (3 * 8);	/* bits 3..5 must not be 3 */
> @@ -3794,10 +3851,10 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
>  }
>  
>  static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu,
> -		struct kvm_mmu *context, bool execonly)
> +		struct kvm_mmu *context, bool execonly, int ept_level)
>  {
>  	__reset_rsvds_bits_mask_ept(&context->guest_rsvd_check,
> -				    cpuid_maxphyaddr(vcpu), execonly);
> +			cpuid_maxphyaddr(vcpu), execonly, ept_level);
>  }
>  
>  /*
> @@ -3844,8 +3901,8 @@ static inline bool boot_cpu_is_amd(void)
>  					true, true);
>  	else
>  		__reset_rsvds_bits_mask_ept(&context->shadow_zero_check,
> -					    boot_cpu_data.x86_phys_bits,
> -					    false);
> +					    boot_cpu_data.x86_phys_bits, false,
> +					    context->shadow_root_level);
>  
>  }
>  
> @@ -3858,7 +3915,8 @@ static inline bool boot_cpu_is_amd(void)
>  				struct kvm_mmu *context, bool execonly)
>  {
>  	__reset_rsvds_bits_mask_ept(&context->shadow_zero_check,
> -				    boot_cpu_data.x86_phys_bits, execonly);
> +				    boot_cpu_data.x86_phys_bits, execonly,
> +				    context->shadow_root_level);
>  }
>  
>  static void update_permission_bitmask(struct kvm_vcpu *vcpu,
> @@ -4037,6 +4095,28 @@ static void paging64_init_context(struct kvm_vcpu *vcpu,
>  	paging64_init_context_common(vcpu, context, PT64_ROOT_4LEVEL);
>  }
>  
> +static void paging_la57_init_context(struct kvm_vcpu *vcpu,
> +				  struct kvm_mmu *context)
> +{
> +	context->nx = is_nx(vcpu);
> +	context->root_level = PT64_ROOT_5LEVEL;
> +
> +	reset_rsvds_bits_mask(vcpu, context);
> +	update_permission_bitmask(vcpu, context, false);
> +	update_pkru_bitmask(vcpu, context, false);
> +	update_last_nonleaf_level(vcpu, context);
> +
> +	MMU_WARN_ON(!is_pae(vcpu));
> +	context->page_fault = paging_la57_page_fault;
> +	context->gva_to_gpa = paging_la57_gva_to_gpa;
> +	context->sync_page = paging_la57_sync_page;
> +	context->invlpg = paging_la57_invlpg;
> +	context->update_pte = paging_la57_update_pte;
> +	context->shadow_root_level = PT64_ROOT_5LEVEL;
> +	context->root_hpa = INVALID_PAGE;
> +	context->direct_map = false;

This should be using paging64_init_context_common.

Even better, paging64_init_context could do

	int root_level =
	    is_la57_mode(vcpu) ? PT64_ROOT_5LEVEL : PT64_ROOT_4LEVEL;
	paging64_init_context_common(vcpu, context, root_level);

and then you can skip the change in kvm_init_shadow_mmu.
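
A minimal sketch of that shape, reusing the common helper from the quoted code:

static void paging64_init_context(struct kvm_vcpu *vcpu,
				  struct kvm_mmu *context)
{
	int root_level = is_la57_mode(vcpu) ? PT64_ROOT_5LEVEL
					    : PT64_ROOT_4LEVEL;

	paging64_init_context_common(vcpu, context, root_level);
}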

> +}
> +
>  static void paging32_init_context(struct kvm_vcpu *vcpu,
>  				  struct kvm_mmu *context)
>  {
> @@ -4086,6 +4166,11 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
>  		context->nx = false;
>  		context->gva_to_gpa = nonpaging_gva_to_gpa;
>  		context->root_level = 0;
> +	} else if (is_la57_mode(vcpu)) {
> +		context->nx = is_nx(vcpu);
> +		context->root_level = PT64_ROOT_5LEVEL;
> +		reset_rsvds_bits_mask(vcpu, context);
> +		context->gva_to_gpa = paging_la57_gva_to_gpa;

Please put the

	if (is_la57_mode(vcpu))

inside the is_long_mode branch below, since the only difference is
context->root_level.
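
Sketched on top of the quoted code, the branch would become:

	} else if (is_long_mode(vcpu)) {
		context->nx = is_nx(vcpu);
		context->root_level = is_la57_mode(vcpu) ?
				      PT64_ROOT_5LEVEL : PT64_ROOT_4LEVEL;
		reset_rsvds_bits_mask(vcpu, context);
		context->gva_to_gpa = paging64_gva_to_gpa;
	}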

>  	} else if (is_long_mode(vcpu)) {
>  		context->nx = is_nx(vcpu);
>  		context->root_level = PT64_ROOT_4LEVEL;
> @@ -4119,6 +4204,8 @@ void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu)
>  
>  	if (!is_paging(vcpu))
>  		nonpaging_init_context(vcpu, context);
> +	else if (is_la57_mode(vcpu))
> +		paging_la57_init_context(vcpu, context);
>  	else if (is_long_mode(vcpu))
>  		paging64_init_context(vcpu, context);
>  	else if (is_pae(vcpu))
> @@ -4158,7 +4245,8 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly)
>  
>  	update_permission_bitmask(vcpu, context, true);
>  	update_pkru_bitmask(vcpu, context, true);
> -	reset_rsvds_bits_mask_ept(vcpu, context, execonly);
> +	reset_rsvds_bits_mask_ept(vcpu, context, execonly,
> +				  context->shadow_root_level);
>  	reset_ept_shadow_zero_bits_mask(vcpu, context, execonly);
>  }
>  EXPORT_SYMBOL_GPL(kvm_init_shadow_ept_mmu);
> @@ -4194,6 +4282,11 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
>  		g_context->nx = false;
>  		g_context->root_level = 0;
>  		g_context->gva_to_gpa = nonpaging_gva_to_gpa_nested;
> +	} else if (is_la57_mode(vcpu)) {
> +		g_context->nx = is_nx(vcpu);
> +		g_context->root_level = PT64_ROOT_5LEVEL;
> +		reset_rsvds_bits_mask(vcpu, g_context);
> +		g_context->gva_to_gpa = paging_la57_gva_to_gpa_nested;

Same here.

>  	} else if (is_long_mode(vcpu)) {
>  		g_context->nx = is_nx(vcpu);
>  		g_context->root_level = PT64_ROOT_4LEVEL;
> diff --git a/arch/x86/kvm/mmu_audit.c b/arch/x86/kvm/mmu_audit.c
> index 2e6996d..bb40094 100644
> --- a/arch/x86/kvm/mmu_audit.c
> +++ b/arch/x86/kvm/mmu_audit.c
> @@ -62,11 +62,12 @@ static void mmu_spte_walk(struct kvm_vcpu *vcpu, inspect_spte_fn fn)
>  	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
>  		return;
>  
> -	if (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL) {
> +	if (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL ||
> +	    vcpu->arch.mmu.root_level == PT64_ROOT_5LEVEL) {

As above, please use >= PT64_ROOT_4LEVEL here.

>  		hpa_t root = vcpu->arch.mmu.root_hpa;
>  
>  		sp = page_header(root);
> -		__mmu_spte_walk(vcpu, sp, fn, PT64_ROOT_4LEVEL);
> +		__mmu_spte_walk(vcpu, sp, fn, vcpu->arch.mmu.root_level);
>  		return;
>  	}
>  
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index a011054..c126cd3 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h

This is not needed.

> @@ -50,6 +50,21 @@ extern u64 __pure __using_nonexistent_pte_bit(void)
>  	#define CMPXCHG cmpxchg64
>  	#define PT_MAX_FULL_LEVELS 2
>  	#endif
> +#elif PTTYPE == PTTYPE_LA57
> +	#define pt_element_t u64
> +	#define guest_walker guest_walker_la57
> +	#define FNAME(name) paging_la57_##name
> +	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
> +	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> +	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
> +	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
> +	#define PT_LEVEL_BITS PT64_LEVEL_BITS
> +	#define PT_GUEST_ACCESSED_MASK PT_ACCESSED_MASK
> +	#define PT_GUEST_DIRTY_MASK PT_DIRTY_MASK
> +	#define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
> +	#define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
> +	#define PT_MAX_FULL_LEVELS 5
> +	#define CMPXCHG cmpxchg
>  #elif PTTYPE == 32
>  	#define pt_element_t u32
>  	#define guest_walker guest_walker32
> @@ -266,7 +281,7 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
>  static inline unsigned FNAME(gpte_pkeys)(struct kvm_vcpu *vcpu, u64 gpte)
>  {
>  	unsigned pkeys = 0;
> -#if PTTYPE == 64
> +#if PTTYPE == 64 || PTTYPE == PTTYPE_LA57
>  	pte_t pte = {.pte = gpte};
>  
>  	pkeys = pte_flags_pkey(pte_flags(pte));
> @@ -300,7 +315,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>  	walker->level = mmu->root_level;
>  	pte           = mmu->get_cr3(vcpu);
>  
> -#if PTTYPE == 64
> +#if PTTYPE == 64 || PTTYPE == PTTYPE_LA57
>  	if (walker->level == PT32E_ROOT_LEVEL) {
>  		pte = mmu->get_pdptr(vcpu, (addr >> 30) & 3);
>  		trace_kvm_mmu_paging_element(pte, walker->level);
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 24db5fb..bfc9f0a 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -1220,6 +1220,11 @@ static inline bool cpu_has_vmx_ept_4levels(void)
>  	return vmx_capability.ept & VMX_EPT_PAGE_WALK_4_BIT;
>  }
>  
> +static inline bool cpu_has_vmx_ept_5levels(void)
> +{
> +	return vmx_capability.ept & VMX_EPT_PAGE_WALK_5_BIT;
> +}
> +
>  static inline bool cpu_has_vmx_ept_ad_bits(void)
>  {
>  	return vmx_capability.ept & VMX_EPT_AD_BIT;
> @@ -4249,13 +4254,20 @@ static void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
>  	vmx->emulation_required = emulation_required(vcpu);
>  }
>  
> +static int get_ept_level(void)
> +{
> +	if (cpu_has_vmx_ept_5levels())
> +		return VMX_EPT_MAX_GAW + 1;
> +	return VMX_EPT_DEFAULT_GAW + 1;
> +}
> +
>  static u64 construct_eptp(unsigned long root_hpa)
>  {
>  	u64 eptp;
>  
>  	/* TODO write the value reading from MSR */
>  	eptp = VMX_EPT_DEFAULT_MT |
> -		VMX_EPT_DEFAULT_GAW << VMX_EPT_GAW_EPTP_SHIFT;
> +		(get_ept_level() - 1) << VMX_EPT_GAW_EPTP_SHIFT;
>  	if (enable_ept_ad_bits)
>  		eptp |= VMX_EPT_AD_ENABLE_BIT;
>  	eptp |= (root_hpa & PAGE_MASK);

For nested virt you need to set the shift to what L1 uses, so I think
you need to add a root_level argument here and in kvm_init_shadow_ept_mmu.
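
A sketch of the idea (the parameter name is illustrative; the caller would
pass the L1-controlled root level, and the GAW field still encodes the walk
length minus one, as in the quoted code):

static u64 construct_eptp(unsigned long root_hpa, int root_level)
{
	u64 eptp;

	eptp = VMX_EPT_DEFAULT_MT |
	       (root_level - 1) << VMX_EPT_GAW_EPTP_SHIFT;
	if (enable_ept_ad_bits)
		eptp |= VMX_EPT_AD_ENABLE_BIT;
	eptp |= (root_hpa & PAGE_MASK);

	return eptp;
}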

Paolo

> @@ -9356,11 +9368,6 @@ static void __init vmx_check_processor_compat(void *rtn)
>  	}
>  }
>  
> -static int get_ept_level(void)
> -{
> -	return VMX_EPT_DEFAULT_GAW + 1;
> -}
> -
>  static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>  {
>  	u8 cache;
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index e8ff3e4..26627df 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -60,6 +60,16 @@ static inline bool is_64_bit_mode(struct kvm_vcpu *vcpu)
>  	return cs_l;
>  }
>  
> +static inline bool is_la57_mode(struct kvm_vcpu *vcpu)
> +{
> +#ifdef CONFIG_X86_64
> +	return (vcpu->arch.efer & EFER_LMA) &&
> +		 kvm_read_cr4_bits(vcpu, X86_CR4_LA57);
> +#else
> +	return 0;
> +#endif
> +}
> +
>  static inline bool mmu_is_nested(struct kvm_vcpu *vcpu)
>  {
>  	return vcpu->arch.walk_mmu == &vcpu->arch.nested_mmu;
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC 4/4] VMX: Expose the LA57 feature to VM
  2016-12-29  9:26 ` [PATCH RFC 4/4] VMX: Expose the LA57 feature to VM Liang Li
@ 2017-03-09 15:16   ` Paolo Bonzini
  0 siblings, 0 replies; 15+ messages in thread
From: Paolo Bonzini @ 2017-03-09 15:16 UTC (permalink / raw)
  To: Liang Li, kvm
  Cc: linux-kernel, tglx, mingo, kirill.shutemov, dave.hansen,
	guangrong.xiao, rkrcmar



On 29/12/2016 10:26, Liang Li wrote:
> -		if (is_noncanonical_address(la))
> +		if (is_noncanonical_address(la, virt_addr_bits(ctxt)))

Using virt_addr_bits and get_virt_addr_bits is quite a mouthful.  What
about using instead a pair of functions like these:

bool is_noncanonical_address(struct kvm_vcpu *vcpu, u64 addr)
{
	return addr != get_canonical(addr, get_virt_addr_bits(vcpu));
}

bool emulate_is_noncanonical_address(struct x86_emulate_ctxt *ctxt,
				     u64 addr)
{
	return addr != get_canonical(addr, virt_addr_bits(ctxt));
}
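
For completeness, a sketch of the two-argument get_canonical() these rely on
(plain sign extension from the given width; the extra width parameter is an
assumption on top of the existing helper):

static inline u64 get_canonical(u64 la, u8 vaddr_bits)
{
	/* Sign-extend from bit (vaddr_bits - 1). */
	return ((int64_t)la << (64 - vaddr_bits)) >> (64 - vaddr_bits);
}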


Paolo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH RFC 0/4] 5-level EPT
  2017-03-09 14:16     ` Paolo Bonzini
@ 2017-03-10  8:00       ` Yu Zhang
  0 siblings, 0 replies; 15+ messages in thread
From: Yu Zhang @ 2017-03-10  8:00 UTC (permalink / raw)
  To: Paolo Bonzini, Li, Liang Z, kvm
  Cc: linux-kernel, tglx, mingo, kirill.shutemov, dave.hansen,
	guangrong.xiao, rkrcmar, Neiger, Gil, Lai, Paul C



On 3/9/2017 10:16 PM, Paolo Bonzini wrote:
>
> On 17/01/2017 03:18, Li, Liang Z wrote:
>>> On 29/12/2016 10:25, Liang Li wrote:
>>>> x86-64 is currently limited physical address width to 46 bits, which
>>>> can support 64 TiB of memory. Some vendors require to support more for
>>>> some use case. Intel plans to extend the physical address width to
>>>> 52 bits in some of the future products.
>>>>
>>>> The current EPT implementation only supports 4 level page table, which
>>>> can support maximum 48 bits physical address width, so it's needed to
>>>> extend the EPT to 5 level to support 52 bits physical address width.
>>>>
>>>> This patchset has been tested in the SIMICS environment for 5 level
>>>> paging guest, which was patched with Kirill's patchset for enabling
>>>> 5 level page table, with both the EPT and shadow page support. I just
>>>> covered the booting process, the guest can boot successfully.
>>>>
>>>> Some parts of this patchset can be improved. Any comments on the
>>>> design or the patches would be appreciated.
>>> I will review the patches.  They seem fairly straightforward.
>>>
>>> However, I am worried about the design of the 5-level page table feature
>>> with respect to migration.
>>>
>>> Processors that support the new LA57 mode can write 57-canonical/48-
>>> noncanonical linear addresses to some registers even when LA57 mode is
>>> inactive.  This is true even of unprivileged instructions, in particular
>>> WRFSBASE/WRGSBASE.
>>>
>>> This is fairly bad because, if a guest performs such a write (because of a bug
>>> or because of malice), it will not be possible to migrate the virtual machine to
>>> a machine that lacks LA57 mode.
>>>
>>> Ordinarily, hypervisors trap CPUID to hide features that are only present in
>>> some processors of a heterogeneous cluster, and the hypervisor also traps
>>> for example CR4 writes to prevent enabling features that were masked away.
>>> In this case, however, the only way for the hypervisor to prevent the write
>>> would be to run the guest with
>>> CR4.FSGSBASE=0 and trap all executions of WRFSBASE/WRGSBASE.  This
>>> might have negative effects on performance for workloads that use the
>>> instructions.
>>>
>>> Of course, this is a problem even without your patches.  However, I think it
>>> should be addressed first.  I am seriously thinking of blacklisting FSGSBASE
>>> completely on LA57 machines until the above is fixed in hardware.
>>>
>>> Paolo
>> The issue has already been forwarded to the hardware guys; we are still waiting for feedback.
> Going to review this now.  Any news?

Thanks for your review, Paolo.
This is Yu Zhang from Intel. I'll pick up this 5-level EPT feature and
try to address your comments next. :-)
Right now I am studying Liang's code and trying to bring a VM up with
Kirill's native 5-level paging code integrated.

Yu
> Paolo
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2017-03-10  8:07 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-29  9:25 [PATCH RFC 0/4] 5-level EPT Liang Li
2016-12-29  9:26 ` [PATCH RFC 1/4] x86: Add the new CPUID and CR4 bits for 5 level page table Liang Li
2016-12-29  9:26 ` [PATCH RFC 2/4] KVM: MMU: Rename PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL Liang Li
2017-03-09 14:39   ` Paolo Bonzini
2016-12-29  9:26 ` [PATCH RFC 3/4] KVM: MMU: Add 5 level EPT & Shadow page table support Liang Li
2017-03-09 15:12   ` Paolo Bonzini
2016-12-29  9:26 ` [PATCH RFC 4/4] VMX: Expose the LA57 feature to VM Liang Li
2017-03-09 15:16   ` Paolo Bonzini
2016-12-29 20:38 ` [PATCH RFC 0/4] 5-level EPT Valdis.Kletnieks
2016-12-30  1:26   ` Li, Liang Z
2017-01-02 10:18 ` Paolo Bonzini
2017-01-17  2:18   ` Li, Liang Z
2017-03-09 14:16     ` Paolo Bonzini
2017-03-10  8:00       ` Yu Zhang
2017-01-05 13:26 ` Kirill A. Shutemov
