* [PATCH v1 0/4] KVM: MMU: 5 level EPT/shadow support
@ 2017-08-12 13:35 Yu Zhang
  2017-08-12 13:35 ` [PATCH v1 1/4] KVM: MMU: check guest CR3 reserved bits based on its physical address width Yu Zhang
                   ` (4 more replies)
  0 siblings, 5 replies; 25+ messages in thread
From: Yu Zhang @ 2017-08-12 13:35 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, pbonzini, rkrcmar, tglx, mingo, hpa, xiaoguangrong, joro

Intel's existing processors limit the maximum linear address width to
48 bits and the maximum physical address width to 46 bits. Upcoming
processors will extend the maximum linear address width to 57 bits, and
the maximum physical address width can go up to 52 bits in practice.

With a linear address width greater than 48 bits, a new IA-32e paging
mode is introduced - 5 level paging (also known as LA57). To support VMs
with this feature, the KVM MMU code needs to be extended.
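
As a rough illustration of how the extra level fits in (not part of the
patches themselves): each level of a 4KB-page walk translates 9 bits of
the linear address, so 48 bits need 4 levels and 57 bits need 5, with
the new top-level table indexed by bits 56:48.

	/* Illustrative helper only (made-up name, not in this series):
	 * table index consumed at each level of a 4KB-page walk;
	 * level 5 is the new LA57 top-level table. */
	static inline unsigned int pt_index(u64 la, int level)
	{
		return (la >> (12 + 9 * (level - 1))) & 0x1ff;
	}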

To achieve this, this patchset:
1> leverages two QEMU parameters, +la57 and phys-bits, to expose a wider
linear address width and physical address width to the VM;
2> extends the shadow paging logic to construct 5 level shadow page
tables for VMs running in LA57 mode;
3> extends the EPT logic to construct 5 level EPT tables for VMs whose
maximum physical address width exceeds 48 bits.

Yu Zhang (4):
  KVM: MMU: check guest CR3 reserved bits based on its physical address
    width.
  KVM: MMU: Rename PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL.
  KVM: MMU: Add 5 level EPT & Shadow page table support.
  KVM: MMU: Expose the LA57 feature to VM.

 arch/x86/include/asm/kvm_host.h | 29 +++++----------------
 arch/x86/include/asm/vmx.h      |  1 +
 arch/x86/kvm/cpuid.c            | 21 ++++++++++-----
 arch/x86/kvm/emulate.c          | 24 +++++++++++------
 arch/x86/kvm/kvm_cache_regs.h   |  2 +-
 arch/x86/kvm/mmu.c              | 58 +++++++++++++++++++++++++----------------
 arch/x86/kvm/mmu.h              |  8 +++++-
 arch/x86/kvm/mmu_audit.c        |  4 +--
 arch/x86/kvm/svm.c              |  6 ++---
 arch/x86/kvm/vmx.c              | 27 ++++++++++++-------
 arch/x86/kvm/x86.c              | 15 ++++++-----
 arch/x86/kvm/x86.h              | 44 +++++++++++++++++++++++++++++++
 12 files changed, 158 insertions(+), 81 deletions(-)

-- 
2.5.0


* [PATCH v1 1/4] KVM: MMU: check guest CR3 reserved bits based on its physical address width.
  2017-08-12 13:35 [PATCH v1 0/4] KVM: MMU: 5 level EPT/shadow support Yu Zhang
@ 2017-08-12 13:35 ` Yu Zhang
  2017-08-14  7:36   ` Paolo Bonzini
  2017-08-14 16:13   ` Jim Mattson
  2017-08-12 13:35 ` [PATCH v1 2/4] KVM: MMU: Rename PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL Yu Zhang
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 25+ messages in thread
From: Yu Zhang @ 2017-08-12 13:35 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, pbonzini, rkrcmar, tglx, mingo, hpa, xiaoguangrong, joro

Currently, KVM uses CR3_L_MODE_RESERVED_BITS to check the
reserved bits in CR3. Yet the reserved bits in a guest's CR3
should depend on the physical address width exposed to the VM.
This patch changes the CR3 check logic to calculate the
reserved bits at runtime.
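
The reserved mask then becomes a simple function of MAXPHYADDR.  A rough
sketch (illustration only, made-up helper name; the actual hunks follow):

	/* Sketch: in long mode, CR3 bits from MAXPHYADDR upwards are
	 * reserved, except bit 63 (CR3_PCID_INVD), which has its own
	 * meaning and is checked separately. */
	static u64 cr3_rsvd_bits(unsigned int maxphyaddr)
	{
		return rsvd_bits(maxphyaddr, 63) & ~CR3_PCID_INVD;
	}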

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
---
 arch/x86/include/asm/kvm_host.h |  1 -
 arch/x86/kvm/emulate.c          | 12 ++++++++++--
 arch/x86/kvm/x86.c              |  8 ++++----
 3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9e4862e..018300e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -79,7 +79,6 @@
 			  | X86_CR0_ET | X86_CR0_NE | X86_CR0_WP | X86_CR0_AM \
 			  | X86_CR0_NW | X86_CR0_CD | X86_CR0_PG))
 
-#define CR3_L_MODE_RESERVED_BITS 0xFFFFFF0000000000ULL
 #define CR3_PCID_INVD		 BIT_64(63)
 #define CR4_RESERVED_BITS                                               \
 	(~(unsigned long)(X86_CR4_VME | X86_CR4_PVI | X86_CR4_TSD | X86_CR4_DE\
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index fb00559..a98b88a 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -4097,8 +4097,16 @@ static int check_cr_write(struct x86_emulate_ctxt *ctxt)
 		u64 rsvd = 0;
 
 		ctxt->ops->get_msr(ctxt, MSR_EFER, &efer);
-		if (efer & EFER_LMA)
-			rsvd = CR3_L_MODE_RESERVED_BITS & ~CR3_PCID_INVD;
+		if (efer & EFER_LMA) {
+			u64 maxphyaddr;
+			u32 eax = 0x80000008;
+
+			ctxt->ops->get_cpuid(ctxt, &eax, NULL, NULL, NULL);
+			maxphyaddr = eax * 0xff;
+
+			rsvd = (~((1UL << maxphyaddr) - 1)) &
+				~CR3_PCID_INVD;
+		}
 
 		if (new_val & rsvd)
 			return emulate_gp(ctxt, 0);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e40a779..d9100c4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -813,10 +813,10 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 		return 0;
 	}
 
-	if (is_long_mode(vcpu)) {
-		if (cr3 & CR3_L_MODE_RESERVED_BITS)
-			return 1;
-	} else if (is_pae(vcpu) && is_paging(vcpu) &&
+	if (is_long_mode(vcpu) &&
+	    (cr3 & rsvd_bits(cpuid_maxphyaddr(vcpu), 63)))
+		return 1;
+	else if (is_pae(vcpu) && is_paging(vcpu) &&
 		   !load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3))
 		return 1;
 
-- 
2.5.0


* [PATCH v1 2/4] KVM: MMU: Rename PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL.
  2017-08-12 13:35 [PATCH v1 0/4] KVM: MMU: 5 level EPT/shadow support Yu Zhang
  2017-08-12 13:35 ` [PATCH v1 1/4] KVM: MMU: check guest CR3 reserved bits based on its physical address width Yu Zhang
@ 2017-08-12 13:35 ` Yu Zhang
  2017-08-12 13:35 ` [PATCH v1 3/4] KVM: MMU: Add 5 level EPT & Shadow page table support Yu Zhang
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 25+ messages in thread
From: Yu Zhang @ 2017-08-12 13:35 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, pbonzini, rkrcmar, tglx, mingo, hpa, xiaoguangrong, joro

64-bit long mode now has both 4 level and 5 level page tables, so
rename PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL; PT64_ROOT_5LEVEL can then
be used for the 5 level case, which makes the code clearer.

Also define PT64_ROOT_MAX_LEVEL as PT64_ROOT_4LEVEL, so that it can
simply be redefined to PT64_ROOT_5LEVEL when 5 level paging support is
added.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
---
 arch/x86/kvm/mmu.c       | 36 ++++++++++++++++++------------------
 arch/x86/kvm/mmu.h       |  4 +++-
 arch/x86/kvm/mmu_audit.c |  4 ++--
 arch/x86/kvm/svm.c       |  2 +-
 4 files changed, 24 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7ee21c0..cd4d2cc 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2167,8 +2167,8 @@ static bool kvm_sync_pages(struct kvm_vcpu *vcpu, gfn_t gfn,
 }
 
 struct mmu_page_path {
-	struct kvm_mmu_page *parent[PT64_ROOT_LEVEL];
-	unsigned int idx[PT64_ROOT_LEVEL];
+	struct kvm_mmu_page *parent[PT64_ROOT_MAX_LEVEL];
+	unsigned int idx[PT64_ROOT_MAX_LEVEL];
 };
 
 #define for_each_sp(pvec, sp, parents, i)			\
@@ -2383,8 +2383,8 @@ static void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
 	iterator->shadow_addr = vcpu->arch.mmu.root_hpa;
 	iterator->level = vcpu->arch.mmu.shadow_root_level;
 
-	if (iterator->level == PT64_ROOT_LEVEL &&
-	    vcpu->arch.mmu.root_level < PT64_ROOT_LEVEL &&
+	if (iterator->level == PT64_ROOT_4LEVEL &&
+	    vcpu->arch.mmu.root_level < PT64_ROOT_4LEVEL &&
 	    !vcpu->arch.mmu.direct_map)
 		--iterator->level;
 
@@ -3323,8 +3323,8 @@ static void mmu_free_roots(struct kvm_vcpu *vcpu)
 	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
 		return;
 
-	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_LEVEL &&
-	    (vcpu->arch.mmu.root_level == PT64_ROOT_LEVEL ||
+	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL &&
+	    (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL ||
 	     vcpu->arch.mmu.direct_map)) {
 		hpa_t root = vcpu->arch.mmu.root_hpa;
 
@@ -3376,10 +3376,10 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 	struct kvm_mmu_page *sp;
 	unsigned i;
 
-	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_LEVEL) {
+	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL) {
 		spin_lock(&vcpu->kvm->mmu_lock);
 		make_mmu_pages_available(vcpu);
-		sp = kvm_mmu_get_page(vcpu, 0, 0, PT64_ROOT_LEVEL, 1, ACC_ALL);
+		sp = kvm_mmu_get_page(vcpu, 0, 0, PT64_ROOT_4LEVEL, 1, ACC_ALL);
 		++sp->root_count;
 		spin_unlock(&vcpu->kvm->mmu_lock);
 		vcpu->arch.mmu.root_hpa = __pa(sp->spt);
@@ -3420,14 +3420,14 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	 * Do we shadow a long mode page table? If so we need to
 	 * write-protect the guests page table root.
 	 */
-	if (vcpu->arch.mmu.root_level == PT64_ROOT_LEVEL) {
+	if (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL) {
 		hpa_t root = vcpu->arch.mmu.root_hpa;
 
 		MMU_WARN_ON(VALID_PAGE(root));
 
 		spin_lock(&vcpu->kvm->mmu_lock);
 		make_mmu_pages_available(vcpu);
-		sp = kvm_mmu_get_page(vcpu, root_gfn, 0, PT64_ROOT_LEVEL,
+		sp = kvm_mmu_get_page(vcpu, root_gfn, 0, PT64_ROOT_4LEVEL,
 				      0, ACC_ALL);
 		root = __pa(sp->spt);
 		++sp->root_count;
@@ -3442,7 +3442,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	 * the shadow page table may be a PAE or a long mode page table.
 	 */
 	pm_mask = PT_PRESENT_MASK;
-	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_LEVEL)
+	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL)
 		pm_mask |= PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
 
 	for (i = 0; i < 4; ++i) {
@@ -3475,7 +3475,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	 * If we shadow a 32 bit page table with a long mode page
 	 * table we enter this path.
 	 */
-	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_LEVEL) {
+	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL) {
 		if (vcpu->arch.mmu.lm_root == NULL) {
 			/*
 			 * The additional page necessary for this is only
@@ -3520,7 +3520,7 @@ static void mmu_sync_roots(struct kvm_vcpu *vcpu)
 
 	vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
 	kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);
-	if (vcpu->arch.mmu.root_level == PT64_ROOT_LEVEL) {
+	if (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL) {
 		hpa_t root = vcpu->arch.mmu.root_hpa;
 		sp = page_header(root);
 		mmu_sync_children(vcpu, sp);
@@ -3596,7 +3596,7 @@ static bool
 walk_shadow_page_get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
 {
 	struct kvm_shadow_walk_iterator iterator;
-	u64 sptes[PT64_ROOT_LEVEL], spte = 0ull;
+	u64 sptes[PT64_ROOT_MAX_LEVEL], spte = 0ull;
 	int root, leaf;
 	bool reserved = false;
 
@@ -4022,7 +4022,7 @@ __reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
 		rsvd_check->rsvd_bits_mask[1][0] =
 			rsvd_check->rsvd_bits_mask[0][0];
 		break;
-	case PT64_ROOT_LEVEL:
+	case PT64_ROOT_4LEVEL:
 		rsvd_check->rsvd_bits_mask[0][3] = exb_bit_rsvd |
 			nonleaf_bit8_rsvd | rsvd_bits(7, 7) |
 			rsvd_bits(maxphyaddr, 51);
@@ -4332,7 +4332,7 @@ static void paging64_init_context_common(struct kvm_vcpu *vcpu,
 static void paging64_init_context(struct kvm_vcpu *vcpu,
 				  struct kvm_mmu *context)
 {
-	paging64_init_context_common(vcpu, context, PT64_ROOT_LEVEL);
+	paging64_init_context_common(vcpu, context, PT64_ROOT_4LEVEL);
 }
 
 static void paging32_init_context(struct kvm_vcpu *vcpu,
@@ -4387,7 +4387,7 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
 		context->root_level = 0;
 	} else if (is_long_mode(vcpu)) {
 		context->nx = is_nx(vcpu);
-		context->root_level = PT64_ROOT_LEVEL;
+		context->root_level = PT64_ROOT_4LEVEL;
 		reset_rsvds_bits_mask(vcpu, context);
 		context->gva_to_gpa = paging64_gva_to_gpa;
 	} else if (is_pae(vcpu)) {
@@ -4498,7 +4498,7 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
 		g_context->gva_to_gpa = nonpaging_gva_to_gpa_nested;
 	} else if (is_long_mode(vcpu)) {
 		g_context->nx = is_nx(vcpu);
-		g_context->root_level = PT64_ROOT_LEVEL;
+		g_context->root_level = PT64_ROOT_4LEVEL;
 		reset_rsvds_bits_mask(vcpu, g_context);
 		g_context->gva_to_gpa = paging64_gva_to_gpa_nested;
 	} else if (is_pae(vcpu)) {
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index d7d248a..60b9001 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -37,10 +37,12 @@
 #define PT32_DIR_PSE36_MASK \
 	(((1ULL << PT32_DIR_PSE36_SIZE) - 1) << PT32_DIR_PSE36_SHIFT)
 
-#define PT64_ROOT_LEVEL 4
+#define PT64_ROOT_4LEVEL 4
 #define PT32_ROOT_LEVEL 2
 #define PT32E_ROOT_LEVEL 3
 
+#define PT64_ROOT_MAX_LEVEL PT64_ROOT_4LEVEL
+
 #define PT_PDPE_LEVEL 3
 #define PT_DIRECTORY_LEVEL 2
 #define PT_PAGE_TABLE_LEVEL 1
diff --git a/arch/x86/kvm/mmu_audit.c b/arch/x86/kvm/mmu_audit.c
index dcce533..2e6996d 100644
--- a/arch/x86/kvm/mmu_audit.c
+++ b/arch/x86/kvm/mmu_audit.c
@@ -62,11 +62,11 @@ static void mmu_spte_walk(struct kvm_vcpu *vcpu, inspect_spte_fn fn)
 	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
 		return;
 
-	if (vcpu->arch.mmu.root_level == PT64_ROOT_LEVEL) {
+	if (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL) {
 		hpa_t root = vcpu->arch.mmu.root_hpa;
 
 		sp = page_header(root);
-		__mmu_spte_walk(vcpu, sp, fn, PT64_ROOT_LEVEL);
+		__mmu_spte_walk(vcpu, sp, fn, PT64_ROOT_4LEVEL);
 		return;
 	}
 
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 1fa9ee5..f7aa33d 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -570,7 +570,7 @@ static inline void invlpga(unsigned long addr, u32 asid)
 static int get_npt_level(void)
 {
 #ifdef CONFIG_X86_64
-	return PT64_ROOT_LEVEL;
+	return PT64_ROOT_4LEVEL;
 #else
 	return PT32E_ROOT_LEVEL;
 #endif
-- 
2.5.0


* [PATCH v1 3/4] KVM: MMU: Add 5 level EPT & Shadow page table support.
  2017-08-12 13:35 [PATCH v1 0/4] KVM: MMU: 5 level EPT/shadow support Yu Zhang
  2017-08-12 13:35 ` [PATCH v1 1/4] KVM: MMU: check guest CR3 reserved bits based on its physical address width Yu Zhang
  2017-08-12 13:35 ` [PATCH v1 2/4] KVM: MMU: Rename PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL Yu Zhang
@ 2017-08-12 13:35 ` Yu Zhang
  2017-08-14  7:31   ` Paolo Bonzini
  2017-08-12 13:35 ` [PATCH v1 4/4] KVM: MMU: Expose the LA57 feature to VM Yu Zhang
  2017-08-14  7:32 ` [PATCH v1 0/4] KVM: MMU: 5 level EPT/shadow support Paolo Bonzini
  4 siblings, 1 reply; 25+ messages in thread
From: Yu Zhang @ 2017-08-12 13:35 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, pbonzini, rkrcmar, tglx, mingo, hpa, xiaoguangrong, joro

Extend the shadow paging code so that a 5 level shadow page
table can be constructed if the VM is running in 5 level paging
mode.

Also extend the EPT code so that a 5 level EPT table can be
constructed if the VM's maximum physical address width exceeds
48 bits. Unlike the shadow logic, KVM should still use a 4 level
EPT table for a VM whose physical address width does not exceed
48 bits, even when the VM is running in 5 level paging mode.
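
In other words, the EPT walk length depends only on the guest physical
address width, not on whether the guest enables LA57.  A rough sketch of
the decision (illustration only, made-up helper name; the series does
this in get_ept_level()):

	/* Sketch: a 4 level EPT walk covers a 48-bit guest physical
	 * address space; use 5 levels only when the guest MAXPHYADDR
	 * exceeds 48 bits and the CPU supports 5-level EPT walks. */
	static int ept_levels_for(struct kvm_vcpu *vcpu)
	{
		if (cpu_has_vmx_ept_5levels() && cpuid_maxphyaddr(vcpu) > 48)
			return 5;
		return 4;
	}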

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
---
 arch/x86/include/asm/kvm_host.h | 10 +++++-----
 arch/x86/include/asm/vmx.h      |  1 +
 arch/x86/kvm/cpuid.c            |  5 +++++
 arch/x86/kvm/mmu.c              | 42 +++++++++++++++++++++++++++--------------
 arch/x86/kvm/mmu.h              |  6 +++++-
 arch/x86/kvm/mmu_audit.c        |  4 ++--
 arch/x86/kvm/svm.c              |  4 ++--
 arch/x86/kvm/vmx.c              | 19 +++++++++++++------
 arch/x86/kvm/x86.h              | 10 ++++++++++
 9 files changed, 71 insertions(+), 30 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 018300e..7e98a75 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -316,14 +316,14 @@ struct kvm_pio_request {
 };
 
 struct rsvd_bits_validate {
-	u64 rsvd_bits_mask[2][4];
+	u64 rsvd_bits_mask[2][5];
 	u64 bad_mt_xwr;
 };
 
 /*
- * x86 supports 3 paging modes (4-level 64-bit, 3-level 64-bit, and 2-level
- * 32-bit).  The kvm_mmu structure abstracts the details of the current mmu
- * mode.
+ * x86 supports 4 paging modes (5-level 64-bit, 4-level 64-bit, 3-level 32-bit,
+ * and 2-level 32-bit).  The kvm_mmu structure abstracts the details of the
+ * current mmu mode.
  */
 struct kvm_mmu {
 	void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long root);
@@ -979,7 +979,7 @@ struct kvm_x86_ops {
 	void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector);
 	int (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
 	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
-	int (*get_tdp_level)(void);
+	int (*get_tdp_level)(struct kvm_vcpu *vcpu);
 	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
 	int (*get_lpage_level)(void);
 	bool (*rdtscp_supported)(void);
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 5f63a2e..a0fb025 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -453,6 +453,7 @@ enum vmcs_field {
 
 #define VMX_EPT_EXECUTE_ONLY_BIT		(1ull)
 #define VMX_EPT_PAGE_WALK_4_BIT			(1ull << 6)
+#define VMX_EPT_PAGE_WALK_5_BIT			(1ull << 7)
 #define VMX_EPTP_UC_BIT				(1ull << 8)
 #define VMX_EPTP_WB_BIT				(1ull << 14)
 #define VMX_EPT_2MB_PAGE_BIT			(1ull << 16)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 59ca2ee..aceacf8 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -137,6 +137,11 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu)
 	/* Update physical-address width */
 	vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
 
+#ifdef CONFIG_X86_64
+	if (vcpu->arch.maxphyaddr > 48)
+		kvm_mmu_reset_context(vcpu);
+#endif
+
 	kvm_pmu_refresh(vcpu);
 	return 0;
 }
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index cd4d2cc..298d840 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3323,9 +3323,8 @@ static void mmu_free_roots(struct kvm_vcpu *vcpu)
 	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
 		return;
 
-	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL &&
-	    (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL ||
-	     vcpu->arch.mmu.direct_map)) {
+	if (vcpu->arch.mmu.root_level >= PT64_ROOT_4LEVEL ||
+	    vcpu->arch.mmu.direct_map) {
 		hpa_t root = vcpu->arch.mmu.root_hpa;
 
 		spin_lock(&vcpu->kvm->mmu_lock);
@@ -3376,10 +3375,11 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 	struct kvm_mmu_page *sp;
 	unsigned i;
 
-	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL) {
+	if (vcpu->arch.mmu.shadow_root_level >= PT64_ROOT_4LEVEL) {
 		spin_lock(&vcpu->kvm->mmu_lock);
 		make_mmu_pages_available(vcpu);
-		sp = kvm_mmu_get_page(vcpu, 0, 0, PT64_ROOT_4LEVEL, 1, ACC_ALL);
+		sp = kvm_mmu_get_page(vcpu, 0, 0,
+				vcpu->arch.mmu.shadow_root_level, 1, ACC_ALL);
 		++sp->root_count;
 		spin_unlock(&vcpu->kvm->mmu_lock);
 		vcpu->arch.mmu.root_hpa = __pa(sp->spt);
@@ -3420,15 +3420,15 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	 * Do we shadow a long mode page table? If so we need to
 	 * write-protect the guests page table root.
 	 */
-	if (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL) {
+	if (vcpu->arch.mmu.root_level >= PT64_ROOT_4LEVEL) {
 		hpa_t root = vcpu->arch.mmu.root_hpa;
 
 		MMU_WARN_ON(VALID_PAGE(root));
 
 		spin_lock(&vcpu->kvm->mmu_lock);
 		make_mmu_pages_available(vcpu);
-		sp = kvm_mmu_get_page(vcpu, root_gfn, 0, PT64_ROOT_4LEVEL,
-				      0, ACC_ALL);
+		sp = kvm_mmu_get_page(vcpu, root_gfn, 0,
+				vcpu->arch.mmu.shadow_root_level, 0, ACC_ALL);
 		root = __pa(sp->spt);
 		++sp->root_count;
 		spin_unlock(&vcpu->kvm->mmu_lock);
@@ -3520,7 +3520,7 @@ static void mmu_sync_roots(struct kvm_vcpu *vcpu)
 
 	vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
 	kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);
-	if (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL) {
+	if (vcpu->arch.mmu.root_level >= PT64_ROOT_4LEVEL) {
 		hpa_t root = vcpu->arch.mmu.root_hpa;
 		sp = page_header(root);
 		mmu_sync_children(vcpu, sp);
@@ -4022,6 +4022,12 @@ __reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
 		rsvd_check->rsvd_bits_mask[1][0] =
 			rsvd_check->rsvd_bits_mask[0][0];
 		break;
+	case PT64_ROOT_5LEVEL:
+		rsvd_check->rsvd_bits_mask[0][4] = exb_bit_rsvd |
+			nonleaf_bit8_rsvd | rsvd_bits(7, 7) |
+			rsvd_bits(maxphyaddr, 51);
+		rsvd_check->rsvd_bits_mask[1][4] =
+			rsvd_check->rsvd_bits_mask[0][4];
 	case PT64_ROOT_4LEVEL:
 		rsvd_check->rsvd_bits_mask[0][3] = exb_bit_rsvd |
 			nonleaf_bit8_rsvd | rsvd_bits(7, 7) |
@@ -4063,6 +4069,8 @@ __reset_rsvds_bits_mask_ept(struct rsvd_bits_validate *rsvd_check,
 {
 	u64 bad_mt_xwr;
 
+	rsvd_check->rsvd_bits_mask[0][4] =
+		rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7);
 	rsvd_check->rsvd_bits_mask[0][3] =
 		rsvd_bits(maxphyaddr, 51) | rsvd_bits(3, 7);
 	rsvd_check->rsvd_bits_mask[0][2] =
@@ -4072,6 +4080,7 @@ __reset_rsvds_bits_mask_ept(struct rsvd_bits_validate *rsvd_check,
 	rsvd_check->rsvd_bits_mask[0][0] = rsvd_bits(maxphyaddr, 51);
 
 	/* large page */
+	rsvd_check->rsvd_bits_mask[1][4] = rsvd_check->rsvd_bits_mask[0][4];
 	rsvd_check->rsvd_bits_mask[1][3] = rsvd_check->rsvd_bits_mask[0][3];
 	rsvd_check->rsvd_bits_mask[1][2] =
 		rsvd_bits(maxphyaddr, 51) | rsvd_bits(12, 29);
@@ -4332,7 +4341,10 @@ static void paging64_init_context_common(struct kvm_vcpu *vcpu,
 static void paging64_init_context(struct kvm_vcpu *vcpu,
 				  struct kvm_mmu *context)
 {
-	paging64_init_context_common(vcpu, context, PT64_ROOT_4LEVEL);
+	int root_level = is_la57_mode(vcpu) ?
+			 PT64_ROOT_5LEVEL : PT64_ROOT_4LEVEL;
+
+	paging64_init_context_common(vcpu, context, root_level);
 }
 
 static void paging32_init_context(struct kvm_vcpu *vcpu,
@@ -4373,7 +4385,7 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
 	context->sync_page = nonpaging_sync_page;
 	context->invlpg = nonpaging_invlpg;
 	context->update_pte = nonpaging_update_pte;
-	context->shadow_root_level = kvm_x86_ops->get_tdp_level();
+	context->shadow_root_level = kvm_x86_ops->get_tdp_level(vcpu);
 	context->root_hpa = INVALID_PAGE;
 	context->direct_map = true;
 	context->set_cr3 = kvm_x86_ops->set_tdp_cr3;
@@ -4387,7 +4399,8 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
 		context->root_level = 0;
 	} else if (is_long_mode(vcpu)) {
 		context->nx = is_nx(vcpu);
-		context->root_level = PT64_ROOT_4LEVEL;
+		context->root_level = is_la57_mode(vcpu) ?
+				PT64_ROOT_5LEVEL : PT64_ROOT_4LEVEL;
 		reset_rsvds_bits_mask(vcpu, context);
 		context->gva_to_gpa = paging64_gva_to_gpa;
 	} else if (is_pae(vcpu)) {
@@ -4444,7 +4457,7 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
 
 	MMU_WARN_ON(VALID_PAGE(context->root_hpa));
 
-	context->shadow_root_level = kvm_x86_ops->get_tdp_level();
+	context->shadow_root_level = kvm_x86_ops->get_tdp_level(vcpu);
 
 	context->nx = true;
 	context->ept_ad = accessed_dirty;
@@ -4498,7 +4511,8 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
 		g_context->gva_to_gpa = nonpaging_gva_to_gpa_nested;
 	} else if (is_long_mode(vcpu)) {
 		g_context->nx = is_nx(vcpu);
-		g_context->root_level = PT64_ROOT_4LEVEL;
+		g_context->root_level = is_la57_mode(vcpu) ?
+					PT64_ROOT_5LEVEL : PT64_ROOT_4LEVEL;
 		reset_rsvds_bits_mask(vcpu, g_context);
 		g_context->gva_to_gpa = paging64_gva_to_gpa_nested;
 	} else if (is_pae(vcpu)) {
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 60b9001..7152b5b 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -37,11 +37,12 @@
 #define PT32_DIR_PSE36_MASK \
 	(((1ULL << PT32_DIR_PSE36_SIZE) - 1) << PT32_DIR_PSE36_SHIFT)
 
+#define PT64_ROOT_5LEVEL 5
 #define PT64_ROOT_4LEVEL 4
 #define PT32_ROOT_LEVEL 2
 #define PT32E_ROOT_LEVEL 3
 
-#define PT64_ROOT_MAX_LEVEL PT64_ROOT_4LEVEL
+#define PT64_ROOT_MAX_LEVEL PT64_ROOT_5LEVEL
 
 #define PT_PDPE_LEVEL 3
 #define PT_DIRECTORY_LEVEL 2
@@ -50,6 +51,9 @@
 
 static inline u64 rsvd_bits(int s, int e)
 {
+	if (e < s)
+		return 0;
+
 	return ((1ULL << (e - s + 1)) - 1) << s;
 }
 
diff --git a/arch/x86/kvm/mmu_audit.c b/arch/x86/kvm/mmu_audit.c
index 2e6996d..d22ddbd 100644
--- a/arch/x86/kvm/mmu_audit.c
+++ b/arch/x86/kvm/mmu_audit.c
@@ -62,11 +62,11 @@ static void mmu_spte_walk(struct kvm_vcpu *vcpu, inspect_spte_fn fn)
 	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
 		return;
 
-	if (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL) {
+	if (vcpu->arch.mmu.root_level >= PT64_ROOT_4LEVEL) {
 		hpa_t root = vcpu->arch.mmu.root_hpa;
 
 		sp = page_header(root);
-		__mmu_spte_walk(vcpu, sp, fn, PT64_ROOT_4LEVEL);
+		__mmu_spte_walk(vcpu, sp, fn, vcpu->arch.mmu.root_level);
 		return;
 	}
 
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index f7aa33d..bdd0142 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -567,7 +567,7 @@ static inline void invlpga(unsigned long addr, u32 asid)
 	asm volatile (__ex(SVM_INVLPGA) : : "a"(addr), "c"(asid));
 }
 
-static int get_npt_level(void)
+static int get_npt_level(struct kvm_vcpu *vcpu)
 {
 #ifdef CONFIG_X86_64
 	return PT64_ROOT_4LEVEL;
@@ -2389,7 +2389,7 @@ static void nested_svm_init_mmu_context(struct kvm_vcpu *vcpu)
 	vcpu->arch.mmu.get_cr3           = nested_svm_get_tdp_cr3;
 	vcpu->arch.mmu.get_pdptr         = nested_svm_get_tdp_pdptr;
 	vcpu->arch.mmu.inject_page_fault = nested_svm_inject_npf_exit;
-	vcpu->arch.mmu.shadow_root_level = get_npt_level();
+	vcpu->arch.mmu.shadow_root_level = get_npt_level(vcpu);
 	reset_shadow_zero_bits_mask(vcpu, &vcpu->arch.mmu);
 	vcpu->arch.walk_mmu              = &vcpu->arch.nested_mmu;
 }
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ed1074e..614ade7 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1200,6 +1200,11 @@ static inline bool cpu_has_vmx_ept_4levels(void)
 	return vmx_capability.ept & VMX_EPT_PAGE_WALK_4_BIT;
 }
 
+static inline bool cpu_has_vmx_ept_5levels(void)
+{
+	return vmx_capability.ept & VMX_EPT_PAGE_WALK_5_BIT;
+}
+
 static inline bool cpu_has_vmx_ept_ad_bits(void)
 {
 	return vmx_capability.ept & VMX_EPT_AD_BIT;
@@ -4296,13 +4301,20 @@ static void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
 	vmx->emulation_required = emulation_required(vcpu);
 }
 
+static int get_ept_level(struct kvm_vcpu *vcpu)
+{
+	if (cpu_has_vmx_ept_5levels() && (cpuid_maxphyaddr(vcpu) > 48))
+		return VMX_EPT_MAX_GAW + 1;
+	return VMX_EPT_DEFAULT_GAW + 1;
+}
+
 static u64 construct_eptp(struct kvm_vcpu *vcpu, unsigned long root_hpa)
 {
 	u64 eptp;
 
 	/* TODO write the value reading from MSR */
 	eptp = VMX_EPT_DEFAULT_MT |
-		VMX_EPT_DEFAULT_GAW << VMX_EPT_GAW_EPTP_SHIFT;
+		(get_ept_level(vcpu) - 1) << VMX_EPT_GAW_EPTP_SHIFT;
 	if (enable_ept_ad_bits &&
 	    (!is_guest_mode(vcpu) || nested_ept_ad_enabled(vcpu)))
 		eptp |= VMX_EPT_AD_ENABLE_BIT;
@@ -9505,11 +9517,6 @@ static void __init vmx_check_processor_compat(void *rtn)
 	}
 }
 
-static int get_ept_level(void)
-{
-	return VMX_EPT_DEFAULT_GAW + 1;
-}
-
 static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 {
 	u8 cache;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 6120670..0107ab7 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -62,6 +62,16 @@ static inline bool is_64_bit_mode(struct kvm_vcpu *vcpu)
 	return cs_l;
 }
 
+static inline bool is_la57_mode(struct kvm_vcpu *vcpu)
+{
+#ifdef CONFIG_X86_64
+	return (vcpu->arch.efer & EFER_LMA) &&
+		 kvm_read_cr4_bits(vcpu, X86_CR4_LA57);
+#else
+	return 0;
+#endif
+}
+
 static inline bool mmu_is_nested(struct kvm_vcpu *vcpu)
 {
 	return vcpu->arch.walk_mmu == &vcpu->arch.nested_mmu;
-- 
2.5.0


* [PATCH v1 4/4] KVM: MMU: Expose the LA57 feature to VM.
  2017-08-12 13:35 [PATCH v1 0/4] KVM: MMU: 5 level EPT/shadow support Yu Zhang
                   ` (2 preceding siblings ...)
  2017-08-12 13:35 ` [PATCH v1 3/4] KVM: MMU: Add 5 level EPT & Shadow page table support Yu Zhang
@ 2017-08-12 13:35 ` Yu Zhang
  2017-08-17 11:57   ` Paolo Bonzini
  2017-08-14  7:32 ` [PATCH v1 0/4] KVM: MMU: 5 level EPT/shadow support Paolo Bonzini
  4 siblings, 1 reply; 25+ messages in thread
From: Yu Zhang @ 2017-08-12 13:35 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, pbonzini, rkrcmar, tglx, mingo, hpa, xiaoguangrong, joro

This patch exposes the 5 level page table feature to the VM.
At the same time, the canonical virtual address checking is
extended to support both 48-bit and 57-bit address widths.
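
For example (illustration only, not part of the patch), an address with
bit 47 set but bits 63:48 clear is non-canonical for a 48-bit address
width, yet canonical once LA57 widens the address to 57 bits:

	/* get_canonical() sign-extends from the topmost implemented
	 * linear-address bit. */
	u64 la = 0x0000800000000000ULL;	/* bit 47 set, bits 63:48 clear */

	get_canonical(la, 48);	/* 0xffff800000000000 != la: non-canonical */
	get_canonical(la, 57);	/* 0x0000800000000000 == la: canonical */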

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
---
 arch/x86/include/asm/kvm_host.h | 18 ++----------------
 arch/x86/kvm/cpuid.c            | 16 ++++++++++------
 arch/x86/kvm/emulate.c          | 12 ++++++------
 arch/x86/kvm/kvm_cache_regs.h   |  2 +-
 arch/x86/kvm/vmx.c              |  8 ++++----
 arch/x86/kvm/x86.c              |  7 +++++--
 arch/x86/kvm/x86.h              | 34 ++++++++++++++++++++++++++++++++++
 7 files changed, 62 insertions(+), 35 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7e98a75..4bc7f11 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -85,8 +85,8 @@
 			  | X86_CR4_PSE | X86_CR4_PAE | X86_CR4_MCE     \
 			  | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | X86_CR4_PCIDE \
 			  | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
-			  | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE | X86_CR4_SMAP \
-			  | X86_CR4_PKE))
+			  | X86_CR4_OSXMMEXCPT | X86_CR4_LA57 | X86_CR4_VMXE \
+			  | X86_CR4_SMAP | X86_CR4_PKE))
 
 #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
 
@@ -1297,20 +1297,6 @@ static inline void kvm_inject_gp(struct kvm_vcpu *vcpu, u32 error_code)
 	kvm_queue_exception_e(vcpu, GP_VECTOR, error_code);
 }
 
-static inline u64 get_canonical(u64 la)
-{
-	return ((int64_t)la << 16) >> 16;
-}
-
-static inline bool is_noncanonical_address(u64 la)
-{
-#ifdef CONFIG_X86_64
-	return get_canonical(la) != la;
-#else
-	return false;
-#endif
-}
-
 #define TSS_IOPB_BASE_OFFSET 0x66
 #define TSS_BASE_SIZE 0x68
 #define TSS_IOPB_SIZE (65536 / 8)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index aceacf8..2161b33 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -126,13 +126,16 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu)
 		best->ebx = xstate_required_size(vcpu->arch.xcr0, true);
 
 	/*
-	 * The existing code assumes virtual address is 48-bit in the canonical
-	 * address checks; exit if it is ever changed.
+	 * The existing code assumes virtual address is 48-bit or 57-bit in the
+	 * canonical address checks; exit if it is ever changed.
 	 */
 	best = kvm_find_cpuid_entry(vcpu, 0x80000008, 0);
-	if (best && ((best->eax & 0xff00) >> 8) != 48 &&
-		((best->eax & 0xff00) >> 8) != 0)
-		return -EINVAL;
+	if (best) {
+		int vaddr_bits = (best->eax & 0xff00) >> 8;
+
+		if (vaddr_bits != 48 && vaddr_bits != 57 && vaddr_bits != 0)
+			return -EINVAL;
+	}
 
 	/* Update physical-address width */
 	vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
@@ -388,7 +391,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 
 	/* cpuid 7.0.ecx*/
 	const u32 kvm_cpuid_7_0_ecx_x86_features =
-		F(AVX512VBMI) | F(PKU) | 0 /*OSPKE*/ | F(AVX512_VPOPCNTDQ);
+		F(AVX512VBMI) | F(LA57) | F(PKU) |
+		0 /*OSPKE*/ | F(AVX512_VPOPCNTDQ);
 
 	/* cpuid 7.0.edx*/
 	const u32 kvm_cpuid_7_0_edx_x86_features =
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index a98b88a..50107ae 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -694,7 +694,7 @@ static __always_inline int __linearize(struct x86_emulate_ctxt *ctxt,
 	switch (mode) {
 	case X86EMUL_MODE_PROT64:
 		*linear = la;
-		if (is_noncanonical_address(la))
+		if (emul_is_noncanonical_address(la, ctxt))
 			goto bad;
 
 		*max_size = min_t(u64, ~0u, (1ull << 48) - la);
@@ -1748,8 +1748,8 @@ static int __load_segment_descriptor(struct x86_emulate_ctxt *ctxt,
 				sizeof(base3), &ctxt->exception);
 		if (ret != X86EMUL_CONTINUE)
 			return ret;
-		if (is_noncanonical_address(get_desc_base(&seg_desc) |
-					     ((u64)base3 << 32)))
+		if (emul_is_noncanonical_address(get_desc_base(&seg_desc) |
+				((u64)base3 << 32), ctxt))
 			return emulate_gp(ctxt, 0);
 	}
 load:
@@ -2840,8 +2840,8 @@ static int em_sysexit(struct x86_emulate_ctxt *ctxt)
 		ss_sel = cs_sel + 8;
 		cs.d = 0;
 		cs.l = 1;
-		if (is_noncanonical_address(rcx) ||
-		    is_noncanonical_address(rdx))
+		if (emul_is_noncanonical_address(rcx, ctxt) ||
+		    emul_is_noncanonical_address(rdx, ctxt))
 			return emulate_gp(ctxt, 0);
 		break;
 	}
@@ -3756,7 +3756,7 @@ static int em_lgdt_lidt(struct x86_emulate_ctxt *ctxt, bool lgdt)
 	if (rc != X86EMUL_CONTINUE)
 		return rc;
 	if (ctxt->mode == X86EMUL_MODE_PROT64 &&
-	    is_noncanonical_address(desc_ptr.address))
+	    emul_is_noncanonical_address(desc_ptr.address, ctxt))
 		return emulate_gp(ctxt, 0);
 	if (lgdt)
 		ctxt->ops->set_gdt(ctxt, &desc_ptr);
diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 762cdf2..0052317 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -4,7 +4,7 @@
 #define KVM_POSSIBLE_CR0_GUEST_BITS X86_CR0_TS
 #define KVM_POSSIBLE_CR4_GUEST_BITS				  \
 	(X86_CR4_PVI | X86_CR4_DE | X86_CR4_PCE | X86_CR4_OSFXSR  \
-	 | X86_CR4_OSXMMEXCPT | X86_CR4_PGE)
+	 | X86_CR4_OSXMMEXCPT | X86_CR4_LA57 | X86_CR4_PGE)
 
 static inline unsigned long kvm_register_read(struct kvm_vcpu *vcpu,
 					      enum kvm_reg reg)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 614ade7..ee0e5b7 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -122,7 +122,7 @@ module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
 	(KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST | X86_CR0_PG | X86_CR0_PE)
 #define KVM_CR4_GUEST_OWNED_BITS				      \
 	(X86_CR4_PVI | X86_CR4_DE | X86_CR4_PCE | X86_CR4_OSFXSR      \
-	 | X86_CR4_OSXMMEXCPT | X86_CR4_TSD)
+	 | X86_CR4_OSXMMEXCPT | X86_CR4_LA57 | X86_CR4_TSD)
 
 #define KVM_PMODE_VM_CR4_ALWAYS_ON (X86_CR4_PAE | X86_CR4_VMXE)
 #define KVM_RMODE_VM_CR4_ALWAYS_ON (X86_CR4_VME | X86_CR4_PAE | X86_CR4_VMXE)
@@ -3368,7 +3368,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		    (!msr_info->host_initiated &&
 		     !guest_cpuid_has(vcpu, X86_FEATURE_MPX)))
 			return 1;
-		if (is_noncanonical_address(data & PAGE_MASK) ||
+		if (is_noncanonical_address(data & PAGE_MASK, vcpu) ||
 		    (data & MSR_IA32_BNDCFGS_RSVD))
 			return 1;
 		vmcs_write64(GUEST_BNDCFGS, data);
@@ -7034,7 +7034,7 @@ static int get_vmx_mem_address(struct kvm_vcpu *vcpu,
 		 * non-canonical form. This is the only check on the memory
 		 * destination for long mode!
 		 */
-		exn = is_noncanonical_address(*ret);
+		exn = is_noncanonical_address(*ret, vcpu);
 	} else if (is_protmode(vcpu)) {
 		/* Protected mode: apply checks for segment validity in the
 		 * following order:
@@ -7839,7 +7839,7 @@ static int handle_invvpid(struct kvm_vcpu *vcpu)
 
 	switch (type) {
 	case VMX_VPID_EXTENT_INDIVIDUAL_ADDR:
-		if (is_noncanonical_address(operand.gla)) {
+		if (is_noncanonical_address(operand.gla, vcpu)) {
 			nested_vmx_failValid(vcpu,
 				VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
 			return kvm_skip_emulated_instruction(vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d9100c4..4935660 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -769,6 +769,9 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 	if (!guest_cpuid_has(vcpu, X86_FEATURE_PKU) && (cr4 & X86_CR4_PKE))
 		return 1;
 
+	if (!guest_cpuid_has(vcpu, X86_FEATURE_LA57) && (cr4 & X86_CR4_LA57))
+		return 1;
+
 	if (is_long_mode(vcpu)) {
 		if (!(cr4 & X86_CR4_PAE))
 			return 1;
@@ -1074,7 +1077,7 @@ int kvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 	case MSR_KERNEL_GS_BASE:
 	case MSR_CSTAR:
 	case MSR_LSTAR:
-		if (is_noncanonical_address(msr->data))
+		if (is_noncanonical_address(msr->data, vcpu))
 			return 1;
 		break;
 	case MSR_IA32_SYSENTER_EIP:
@@ -1091,7 +1094,7 @@ int kvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 		 * value, and that something deterministic happens if the guest
 		 * invokes 64-bit SYSENTER.
 		 */
-		msr->data = get_canonical(msr->data);
+		msr->data = get_canonical(msr->data, vcpu_virt_addr_bits(vcpu));
 	}
 	return kvm_x86_ops->set_msr(vcpu, msr);
 }
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 0107ab7..63ed0d0 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -97,6 +97,40 @@ static inline u32 bit(int bitno)
 	return 1 << (bitno & 31);
 }
 
+static inline u8 vcpu_virt_addr_bits(struct kvm_vcpu *vcpu)
+{
+	return kvm_read_cr4_bits(vcpu, X86_CR4_LA57) ? 57 : 48;
+}
+
+static inline u8 ctxt_virt_addr_bits(struct x86_emulate_ctxt *ctxt)
+{
+	return (ctxt->ops->get_cr(ctxt, 4) & X86_CR4_LA57) ? 57 : 48;
+}
+
+static inline u64 get_canonical(u64 la, u8 vaddr_bits)
+{
+	return ((int64_t)la << (64 - vaddr_bits)) >> (64 - vaddr_bits);
+}
+
+static inline bool is_noncanonical_address(u64 la, struct kvm_vcpu *vcpu)
+{
+#ifdef CONFIG_X86_64
+	return get_canonical(la, vcpu_virt_addr_bits(vcpu)) != la;
+#else
+	return false;
+#endif
+}
+
+static inline bool emul_is_noncanonical_address(u64 la,
+						struct x86_emulate_ctxt *ctxt)
+{
+#ifdef CONFIG_X86_64
+	return get_canonical(la, ctxt_virt_addr_bits(ctxt)) != la;
+#else
+	return false;
+#endif
+}
+
 static inline void vcpu_cache_mmio_info(struct kvm_vcpu *vcpu,
 					gva_t gva, gfn_t gfn, unsigned access)
 {
-- 
2.5.0


* Re: [PATCH v1 3/4] KVM: MMU: Add 5 level EPT & Shadow page table support.
  2017-08-12 13:35 ` [PATCH v1 3/4] KVM: MMU: Add 5 level EPT & Shadow page table support Yu Zhang
@ 2017-08-14  7:31   ` Paolo Bonzini
  2017-08-14 11:37     ` Yu Zhang
  0 siblings, 1 reply; 25+ messages in thread
From: Paolo Bonzini @ 2017-08-14  7:31 UTC (permalink / raw)
  To: Yu Zhang, kvm
  Cc: linux-kernel, rkrcmar, tglx, mingo, hpa, xiaoguangrong, joro

On 12/08/2017 15:35, Yu Zhang wrote:
>  struct rsvd_bits_validate {
> -	u64 rsvd_bits_mask[2][4];
> +	u64 rsvd_bits_mask[2][5];
>  	u64 bad_mt_xwr;
>  };


Can you change this 4 to PT64_ROOT_MAX_LEVEL in patch 2?

> -	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL &&
> -	    (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL ||
> -	     vcpu->arch.mmu.direct_map)) {
> +	if (vcpu->arch.mmu.root_level >= PT64_ROOT_4LEVEL ||
> +	    vcpu->arch.mmu.direct_map) {
>  		hpa_t root = vcpu->arch.mmu.root_hpa;

You should keep the check on shadow_root_level (changing it to >= of
course), otherwise you break the case where EPT is disabled, paging is
disabled (so vcpu->arch.mmu.direct_map is true) and the host kernel is
32-bit.  In that case shadow pages use PAE format, and entering this
branch is incorrect.
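
Concretely, something like this (untested):

	if (vcpu->arch.mmu.shadow_root_level >= PT64_ROOT_4LEVEL &&
	    (vcpu->arch.mmu.root_level >= PT64_ROOT_4LEVEL ||
	     vcpu->arch.mmu.direct_map)) {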

> @@ -4444,7 +4457,7 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
>  
>  	MMU_WARN_ON(VALID_PAGE(context->root_hpa));
>  
> -	context->shadow_root_level = kvm_x86_ops->get_tdp_level();
> +	context->shadow_root_level = kvm_x86_ops->get_tdp_level(vcpu);
>  
>  	context->nx = true;
>  	context->ept_ad = accessed_dirty;

Below, there is:

        context->root_level = context->shadow_root_level;

this should be forced to PT64_ROOT_4LEVEL until there is support for
nested EPT 5-level page tables.

Thanks,

Paolo


* Re: [PATCH v1 0/4] KVM: MMU: 5 level EPT/shadow support
  2017-08-12 13:35 [PATCH v1 0/4] KVM: MMU: 5 level EPT/shadow support Yu Zhang
                   ` (3 preceding siblings ...)
  2017-08-12 13:35 ` [PATCH v1 4/4] KVM: MMU: Expose the LA57 feature to VM Yu Zhang
@ 2017-08-14  7:32 ` Paolo Bonzini
  4 siblings, 0 replies; 25+ messages in thread
From: Paolo Bonzini @ 2017-08-14  7:32 UTC (permalink / raw)
  To: Yu Zhang, kvm
  Cc: linux-kernel, rkrcmar, tglx, mingo, hpa, xiaoguangrong, joro

On 12/08/2017 15:35, Yu Zhang wrote:
> Intel's existing processors limit the maximum linear address width to
> 48 bits, and the maximum physical address width to 46 bits. And the
> upcoming processors will extend maximum linear address width to 57 bits
> and maximum physical address width can go upto 52 bits in practical.
> 
> With linear address width greater than 48, a new paging mode in IA-32e
> is introduced - 5 level paging(also known as LA57). And to support VMs
> with this feature, KVM MMU code need to be extended. 
> 
> And to achieve this, this patchset:
> 1> leverages 2 qemu parameters: +la57 and phys-bits to expose wider linear
> address width and physical address width to the VM;
> 2> extends shadow logic to construct 5 level shadow page for VMs running
> in LA57 mode;
> 3> extends ept logic to construct 5 level ept table for VMs whose maximum
> physical width exceeds 48 bits.

Thanks, this looks good.  I only had a few suggestions in my reply to
patch 3.

Paolo


* Re: [PATCH v1 1/4] KVM: MMU: check guest CR3 reserved bits based on its physical address width.
  2017-08-12 13:35 ` [PATCH v1 1/4] KVM: MMU: check guest CR3 reserved bits based on its physical address width Yu Zhang
@ 2017-08-14  7:36   ` Paolo Bonzini
  2017-08-14 11:39     ` Yu Zhang
  2017-08-14 16:13   ` Jim Mattson
  1 sibling, 1 reply; 25+ messages in thread
From: Paolo Bonzini @ 2017-08-14  7:36 UTC (permalink / raw)
  To: Yu Zhang, kvm
  Cc: linux-kernel, rkrcmar, tglx, mingo, hpa, xiaoguangrong, joro

On 12/08/2017 15:35, Yu Zhang wrote:
> +			ctxt->ops->get_cpuid(ctxt, &eax, NULL, NULL, NULL);
> +			maxphyaddr = eax * 0xff;

This is "&", not "*".  You can also use rsvd_bits here.

> +			rsvd = (~((1UL << maxphyaddr) - 1)) &
> +				~CR3_PCID_INVD;
> +		}
>  
>  		if (new_val & rsvd)
>  			return emulate_gp(ctxt, 0);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e40a779..d9100c4 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -813,10 +813,10 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
>  		return 0;
>  	}
>  
> -	if (is_long_mode(vcpu)) {
> -		if (cr3 & CR3_L_MODE_RESERVED_BITS)
> -			return 1;
> -	} else if (is_pae(vcpu) && is_paging(vcpu) &&
> +	if (is_long_mode(vcpu) &&
> +	    (cr3 & rsvd_bits(cpuid_maxphyaddr(vcpu), 63)))
> +		return 1;

62 is a little better, since 63 is the PCID invalidate bit.

Paolo

> +	else if (is_pae(vcpu) && is_paging(vcpu) &&
>  		   !load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3))
>  		return 1;
>  


* Re: [PATCH v1 3/4] KVM: MMU: Add 5 level EPT & Shadow page table support.
  2017-08-14  7:31   ` Paolo Bonzini
@ 2017-08-14 11:37     ` Yu Zhang
  2017-08-14 14:13       ` Paolo Bonzini
  0 siblings, 1 reply; 25+ messages in thread
From: Yu Zhang @ 2017-08-14 11:37 UTC (permalink / raw)
  To: Paolo Bonzini, kvm
  Cc: linux-kernel, rkrcmar, tglx, mingo, hpa, xiaoguangrong, joro

Thanks a lot for your comments, Paolo. :-)


On 8/14/2017 3:31 PM, Paolo Bonzini wrote:
> On 12/08/2017 15:35, Yu Zhang wrote:
>>   struct rsvd_bits_validate {
>> -	u64 rsvd_bits_mask[2][4];
>> +	u64 rsvd_bits_mask[2][5];
>>   	u64 bad_mt_xwr;
>>   };
>
> Can you change this 4 to PT64_ROOT_MAX_LEVEL in patch 2?

Well, I had tried, but failed to find a neat approach to do so.
The difficulty I have met is that PT64_ROOT_MAX_LEVEL is defined
together with PT64_ROOT_4LEVEL/PT32E_ROOT_LEVEL/PT32_ROOT_LEVEL in
mmu.h, yet the rsvd_bits_validate structure is defined in kvm_host.h,
which is included in quite a lot of .c files that do not include mmu.h,
or include mmu.h after kvm_host.h.

I guess that's the reason why the magic number 4 instead of
PT64_ROOT_4LEVEL is used in the current definition of
rsvd_bits_validate. :-)

>
>> -	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_4LEVEL &&
>> -	    (vcpu->arch.mmu.root_level == PT64_ROOT_4LEVEL ||
>> -	     vcpu->arch.mmu.direct_map)) {
>> +	if (vcpu->arch.mmu.root_level >= PT64_ROOT_4LEVEL ||
>> +	    vcpu->arch.mmu.direct_map) {
>>   		hpa_t root = vcpu->arch.mmu.root_hpa;
> You should keep the check on shadow_root_level (changing it to >= of
> course), otherwise you break the case where EPT is disabled, paging is
> disabled (so vcpu->arch.mmu.direct_map is true) and the host kernel is
> 32-bit.  In that case shadow pages use PAE format, and entering this
> branch is incorrect.

Oh, right. Thanks!

>
>> @@ -4444,7 +4457,7 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
>>   
>>   	MMU_WARN_ON(VALID_PAGE(context->root_hpa));
>>   
>> -	context->shadow_root_level = kvm_x86_ops->get_tdp_level();
>> +	context->shadow_root_level = kvm_x86_ops->get_tdp_level(vcpu);
>>   
>>   	context->nx = true;
>>   	context->ept_ad = accessed_dirty;
> Below, there is:
>
>          context->root_level = context->shadow_root_level;
>
> this should be forced to PT64_ROOT_4LEVEL until there is support for
> nested EPT 5-level page tables.

So the context->shadow_root_level could be 5 or 4, and
context->root_level is always 4?

My understanding is that the shadow EPT level should be determined by
the width of the nested guest physical address (ngpa), and that if the
L1 guest is not exposed to the EPT5 feature, it will only use a 4 level
EPT for the L2 guest, so the shadow EPT does not need a 5 level one. Is
this understanding correct? And how about we set both values to
PT64_ROOT_4LEVEL for now?

Besides, if we want to support nested EPT5, what do you think we need
to do besides exposing the EPT5 feature to the L1 guest?

Thanks
Yu

>
> Thanks,
>
> Paolo
>


* Re: [PATCH v1 1/4] KVM: MMU: check guest CR3 reserved bits based on its physical address width.
  2017-08-14  7:36   ` Paolo Bonzini
@ 2017-08-14 11:39     ` Yu Zhang
  0 siblings, 0 replies; 25+ messages in thread
From: Yu Zhang @ 2017-08-14 11:39 UTC (permalink / raw)
  To: Paolo Bonzini, kvm
  Cc: linux-kernel, rkrcmar, tglx, mingo, hpa, xiaoguangrong, joro



On 8/14/2017 3:36 PM, Paolo Bonzini wrote:
> On 12/08/2017 15:35, Yu Zhang wrote:
>> +			ctxt->ops->get_cpuid(ctxt, &eax, NULL, NULL, NULL);
>> +			maxphyaddr = eax * 0xff;
> This is "&", not "*".  You can also use rsvd_bits here.

Oops. Sorry for the typo. :-)

>> +			rsvd = (~((1UL << maxphyaddr) - 1)) &
>> +				~CR3_PCID_INVD;
>> +		}
>>   
>>   		if (new_val & rsvd)
>>   			return emulate_gp(ctxt, 0);
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index e40a779..d9100c4 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -813,10 +813,10 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
>>   		return 0;
>>   	}
>>   
>> -	if (is_long_mode(vcpu)) {
>> -		if (cr3 & CR3_L_MODE_RESERVED_BITS)
>> -			return 1;
>> -	} else if (is_pae(vcpu) && is_paging(vcpu) &&
>> +	if (is_long_mode(vcpu) &&
>> +	    (cr3 & rsvd_bits(cpuid_maxphyaddr(vcpu), 63)))
>> +		return 1;
> 62 is a little better, since 63 is the PCID invalidate bit.

Got it. Both kvm_set_cr3() and check_cr_write() should use
rsvd_bits(maxphyaddr, 62).

Thanks
Yu

> Paolo
>
>> +	else if (is_pae(vcpu) && is_paging(vcpu) &&
>>   		   !load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3))
>>   		return 1;
>>   
>


* Re: [PATCH v1 3/4] KVM: MMU: Add 5 level EPT & Shadow page table support.
  2017-08-14 11:37     ` Yu Zhang
@ 2017-08-14 14:13       ` Paolo Bonzini
  2017-08-14 14:32         ` Yu Zhang
  0 siblings, 1 reply; 25+ messages in thread
From: Paolo Bonzini @ 2017-08-14 14:13 UTC (permalink / raw)
  To: Yu Zhang, kvm
  Cc: linux-kernel, rkrcmar, tglx, mingo, hpa, xiaoguangrong, joro

On 14/08/2017 13:37, Yu Zhang wrote:
> Thanks a lot for your comments, Paolo. :-)
> 
> 
> On 8/14/2017 3:31 PM, Paolo Bonzini wrote:
>> On 12/08/2017 15:35, Yu Zhang wrote:
>>>   struct rsvd_bits_validate {
>>> -    u64 rsvd_bits_mask[2][4];
>>> +    u64 rsvd_bits_mask[2][5];
>>>       u64 bad_mt_xwr;
>>>   };
>>
>> Can you change this 4 to PT64_ROOT_MAX_LEVEL in patch 2?
> 
> Well, I had tried, but failed to find a neat approach to do so. The
> difficulty I have met is that PT64_ROOT_MAX_LEVEL is defined together
> with PT64_ROOT_4LEVEL/PT32E_ROOT_LEVEL/PT32_ROOT_LEVEL in mmu.h, yet
> the rsvd_bits_validate structure is defined in kvm_host.h, which are 
> included in quite a lot .c files that do not include mmu.h or include
> the mmu.h after kvm_host.h.
> 
> I guess that's the reason why the magic number 4 instead of
> PT64_ROOT_4LEVEL is used in current definition of rsvd_bits_vadlidate. :-)

Yes, you're right.  I think the solution is to define
PT64_ROOT_MAX_LEVEL in kvm_host.h.
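
Something along these lines (untested sketch; whether it starts out as 4
in patch 2 and becomes 5 in patch 3 is up to you):

	/* arch/x86/include/asm/kvm_host.h */
	#define PT64_ROOT_MAX_LEVEL	5

	struct rsvd_bits_validate {
		u64 rsvd_bits_mask[2][PT64_ROOT_MAX_LEVEL];
		u64 bad_mt_xwr;
	};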

>>> @@ -4444,7 +4457,7 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu
>>> *vcpu, bool execonly,
>>>         MMU_WARN_ON(VALID_PAGE(context->root_hpa));
>>>   -    context->shadow_root_level = kvm_x86_ops->get_tdp_level();
>>> +    context->shadow_root_level = kvm_x86_ops->get_tdp_level(vcpu);
>>>         context->nx = true;
>>>       context->ept_ad = accessed_dirty;
>> Below, there is:
>>
>>          context->root_level = context->shadow_root_level;
>>
>> this should be forced to PT64_ROOT_4LEVEL until there is support for
>> nested EPT 5-level page tables.
> 
> So the context->shadow_root_level could be 5 or 4, and
> context->root_level is always 4?

That was my idea, but setting both to 4 should be fine too as you
suggest below.

> My understanding is that shadow ept level should be determined by
> the width of ngpa, and that if L1 guest is not exposed with EPT5
> feature, it shall only use 4 level ept for L2 guest, and the shadow
> ept does not need a 5 level one. Is this understanding correct? And
> how about we set both values to PT64_ROOT_4LEVEL for now?>
> Besides, if we wanna support nested EPT5, what do you think we need to
> do besides exposing the EPT5 feature to L1 guest?

Nothing else, I think.

Paolo


* Re: [PATCH v1 3/4] KVM: MMU: Add 5 level EPT & Shadow page table support.
  2017-08-14 14:13       ` Paolo Bonzini
@ 2017-08-14 14:32         ` Yu Zhang
  2017-08-14 15:02           ` Paolo Bonzini
  0 siblings, 1 reply; 25+ messages in thread
From: Yu Zhang @ 2017-08-14 14:32 UTC (permalink / raw)
  To: Paolo Bonzini, kvm
  Cc: linux-kernel, rkrcmar, tglx, mingo, hpa, xiaoguangrong, joro



On 8/14/2017 10:13 PM, Paolo Bonzini wrote:
> On 14/08/2017 13:37, Yu Zhang wrote:
>> Thanks a lot for your comments, Paolo. :-)
>>
>>
>> On 8/14/2017 3:31 PM, Paolo Bonzini wrote:
>>> On 12/08/2017 15:35, Yu Zhang wrote:
>>>>    struct rsvd_bits_validate {
>>>> -    u64 rsvd_bits_mask[2][4];
>>>> +    u64 rsvd_bits_mask[2][5];
>>>>        u64 bad_mt_xwr;
>>>>    };
>>> Can you change this 4 to PT64_ROOT_MAX_LEVEL in patch 2?
>> Well, I had tried, but failed to find a neat approach to do so. The
>> difficulty I have met is that PT64_ROOT_MAX_LEVEL is defined together
>> with PT64_ROOT_4LEVEL/PT32E_ROOT_LEVEL/PT32_ROOT_LEVEL in mmu.h, yet
>> the rsvd_bits_validate structure is defined in kvm_host.h, which are
>> included in quite a lot .c files that do not include mmu.h or include
>> the mmu.h after kvm_host.h.
>>
>> I guess that's the reason why the magic number 4 instead of
>> PT64_ROOT_4LEVEL is used in current definition of rsvd_bits_vadlidate. :-)
> Yes, you're right.  I think the solution is to define
> PT64_ROOT_MAX_LEVEL in kvm_host.h.

Thanks, Paolo. How about we also move the definition of PT64_ROOT_4LEVEL/
PT32E_ROOT_LEVEL/PT32_ROOT_LEVEL from mmu.h to kvm_host.h? Then we
can define PT64_ROOT_MAX_LEVEL as PT64_ROOT_4LEVEL instead of 4 in 
kvm_host.h.

>>>> @@ -4444,7 +4457,7 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu
>>>> *vcpu, bool execonly,
>>>>          MMU_WARN_ON(VALID_PAGE(context->root_hpa));
>>>>    -    context->shadow_root_level = kvm_x86_ops->get_tdp_level();
>>>> +    context->shadow_root_level = kvm_x86_ops->get_tdp_level(vcpu);
>>>>          context->nx = true;
>>>>        context->ept_ad = accessed_dirty;
>>> Below, there is:
>>>
>>>           context->root_level = context->shadow_root_level;
>>>
>>> this should be forced to PT64_ROOT_4LEVEL until there is support for
>>> nested EPT 5-level page tables.
>> So the context->shadow_root_level could be 5 or 4, and
>> context->root_level is always 4?
> That was my idea, but setting both to 4 should be fine too as you
> suggest below.
>
>> My understanding is that shadow ept level should be determined by
>> the width of ngpa, and that if L1 guest is not exposed with EPT5
>> feature, it shall only use 4 level ept for L2 guest, and the shadow
>> ept does not need a 5 level one. Is this understanding correct? And
>> how about we set both values to PT64_ROOT_4LEVEL for now?>
>> Besides, if we wanna support nested EPT5, what do you think we need to
>> do besides exposing the EPT5 feature to L1 guest?
> Nothing else, I think.

Thanks. I'll try to keep both values fixed to PT64_ROOT_4LEVEL then. :-)
For nested EPT5, we can enable it later (it should be a quite simple
patch, but it needs to be verified in our Simics environment, and I am
not sure whether the nested scenario works there).

B.R.
Yu


* Re: [PATCH v1 3/4] KVM: MMU: Add 5 level EPT & Shadow page table support.
  2017-08-14 15:02           ` Paolo Bonzini
@ 2017-08-14 14:55             ` Yu Zhang
  0 siblings, 0 replies; 25+ messages in thread
From: Yu Zhang @ 2017-08-14 14:55 UTC (permalink / raw)
  To: Paolo Bonzini, kvm
  Cc: linux-kernel, rkrcmar, tglx, mingo, hpa, xiaoguangrong, joro



On 8/14/2017 11:02 PM, Paolo Bonzini wrote:
> On 14/08/2017 16:32, Yu Zhang wrote:
>>
>> On 8/14/2017 10:13 PM, Paolo Bonzini wrote:
>>> On 14/08/2017 13:37, Yu Zhang wrote:
>>>> Thanks a lot for your comments, Paolo. :-)
>>>>
>>>>
>>>> On 8/14/2017 3:31 PM, Paolo Bonzini wrote:
>>>>> On 12/08/2017 15:35, Yu Zhang wrote:
>>>>>>     struct rsvd_bits_validate {
>>>>>> -    u64 rsvd_bits_mask[2][4];
>>>>>> +    u64 rsvd_bits_mask[2][5];
>>>>>>         u64 bad_mt_xwr;
>>>>>>     };
>>>>> Can you change this 4 to PT64_ROOT_MAX_LEVEL in patch 2?
>>>> Well, I had tried, but failed to find a neat approach to do so. The
>>>> difficulty I have met is that PT64_ROOT_MAX_LEVEL is defined together
>>>> with PT64_ROOT_4LEVEL/PT32E_ROOT_LEVEL/PT32_ROOT_LEVEL in mmu.h, yet
>>>> the rsvd_bits_validate structure is defined in kvm_host.h, which are
>>>> included in quite a lot .c files that do not include mmu.h or include
>>>> the mmu.h after kvm_host.h.
>>>>
>>>> I guess that's the reason why the magic number 4 instead of
>>>> PT64_ROOT_4LEVEL is used in current definition of
>>>> rsvd_bits_vadlidate. :-)
>>> Yes, you're right.  I think the solution is to define
>>> PT64_ROOT_MAX_LEVEL in kvm_host.h.
>> Thanks, Paolo. How about we also move the definition of PT64_ROOT_4LEVEL/
>> PT32E_ROOT_LEVEL/PT32_ROOT_LEVEL from mmu.h to kvm_host.h? Then we
>> can define PT64_ROOT_MAX_LEVEL as PT64_ROOT_4LEVEL instead of 4 in
>> kvm_host.h.
> No, I think those are best left in mmu.h.  They are only used in mmu
> files, except for two occurrences in svm.c.
>
> kvm_host.h would have PT64_ROOT_MAX_LEVEL just because it is slightly
> better than "4" or "5".

OK. I can define PT64_ROOT_MAX_LEVEL in kvm_host.h as 4 in patch 2, and
change it to 5 in patch 3. :-)

Thanks
Yu


* Re: [PATCH v1 3/4] KVM: MMU: Add 5 level EPT & Shadow page table support.
  2017-08-14 14:32         ` Yu Zhang
@ 2017-08-14 15:02           ` Paolo Bonzini
  2017-08-14 14:55             ` Yu Zhang
  0 siblings, 1 reply; 25+ messages in thread
From: Paolo Bonzini @ 2017-08-14 15:02 UTC (permalink / raw)
  To: Yu Zhang, kvm
  Cc: linux-kernel, rkrcmar, tglx, mingo, hpa, xiaoguangrong, joro

On 14/08/2017 16:32, Yu Zhang wrote:
> 
> 
> On 8/14/2017 10:13 PM, Paolo Bonzini wrote:
>> On 14/08/2017 13:37, Yu Zhang wrote:
>>> Thanks a lot for your comments, Paolo. :-)
>>>
>>>
>>> On 8/14/2017 3:31 PM, Paolo Bonzini wrote:
>>>> On 12/08/2017 15:35, Yu Zhang wrote:
>>>>>    struct rsvd_bits_validate {
>>>>> -    u64 rsvd_bits_mask[2][4];
>>>>> +    u64 rsvd_bits_mask[2][5];
>>>>>        u64 bad_mt_xwr;
>>>>>    };
>>>> Can you change this 4 to PT64_ROOT_MAX_LEVEL in patch 2?
>>> Well, I had tried, but failed to find a neat approach to do so. The
>>> difficulty I have met is that PT64_ROOT_MAX_LEVEL is defined together
>>> with PT64_ROOT_4LEVEL/PT32E_ROOT_LEVEL/PT32_ROOT_LEVEL in mmu.h, yet
>>> the rsvd_bits_validate structure is defined in kvm_host.h, which are
>>> included in quite a lot .c files that do not include mmu.h or include
>>> the mmu.h after kvm_host.h.
>>>
>>> I guess that's the reason why the magic number 4 instead of
>>> PT64_ROOT_4LEVEL is used in current definition of
>>> rsvd_bits_vadlidate. :-)
>> Yes, you're right.  I think the solution is to define
>> PT64_ROOT_MAX_LEVEL in kvm_host.h.
> 
> Thanks, Paolo. How about we also move the definition of PT64_ROOT_4LEVEL/
> PT32E_ROOT_LEVEL/PT32_ROOT_LEVEL from mmu.h to kvm_host.h? Then we
> can define PT64_ROOT_MAX_LEVEL as PT64_ROOT_4LEVEL instead of 4 in
> kvm_host.h.

No, I think those are best left in mmu.h.  They are only used in mmu
files, except for two occurrences in svm.c.

kvm_host.h would have PT64_ROOT_MAX_LEVEL just because it is slightly
better than "4" or "5".

Paolo

>>>>> @@ -4444,7 +4457,7 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu
>>>>> *vcpu, bool execonly,
>>>>>          MMU_WARN_ON(VALID_PAGE(context->root_hpa));
>>>>>    -    context->shadow_root_level = kvm_x86_ops->get_tdp_level();
>>>>> +    context->shadow_root_level = kvm_x86_ops->get_tdp_level(vcpu);
>>>>>          context->nx = true;
>>>>>        context->ept_ad = accessed_dirty;
>>>> Below, there is:
>>>>
>>>>           context->root_level = context->shadow_root_level;
>>>>
>>>> this should be forced to PT64_ROOT_4LEVEL until there is support for
>>>> nested EPT 5-level page tables.
>>> So the context->shadow_root_level could be 5 or 4, and
>>> context->root_level is always 4?
>> That was my idea, but setting both to 4 should be fine too as you
>> suggest below.
>>
>>> My understanding is that shadow ept level should be determined by
>>> the width of ngpa, and that if L1 guest is not exposed with EPT5
>>> feature, it shall only use 4 level ept for L2 guest, and the shadow
>>> ept does not need a 5 level one. Is this understanding correct? And
>>> how about we set both values to PT64_ROOT_4LEVEL for now?>
>>> Besides, if we wanna support nested EPT5, what do you think we need to
>>> do besides exposing the EPT5 feature to L1 guest?
>> Nothing else, I think.
> 
> Thanks. I'll try to keep both values fixed to PT64_ROOT_4LEVEL then. :-)
> For nested EPT5, we can enable it later(should be a quite simple patch,
> but need to
> be verified in our simics environment, which I am not sure if nested
> scenario works).
> 
> B.R.
> Yu

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 1/4] KVM: MMU: check guest CR3 reserved bits based on its physical address width.
  2017-08-12 13:35 ` [PATCH v1 1/4] KVM: MMU: check guest CR3 reserved bits based on its physical address width Yu Zhang
  2017-08-14  7:36   ` Paolo Bonzini
@ 2017-08-14 16:13   ` Jim Mattson
  2017-08-14 16:40     ` Paolo Bonzini
  1 sibling, 1 reply; 25+ messages in thread
From: Jim Mattson @ 2017-08-14 16:13 UTC (permalink / raw)
  To: Yu Zhang
  Cc: kvm list, LKML, Paolo Bonzini, Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H . Peter Anvin, xiaoguangrong,
	Joerg Roedel

On Sat, Aug 12, 2017 at 6:35 AM, Yu Zhang <yu.c.zhang@linux.intel.com> wrote:
> Currently, KVM uses CR3_L_MODE_RESERVED_BITS to check the
> reserved bits in CR3. Yet the length of reserved bits in
> guest CR3 should be based on the physical address width
> exposed to the VM. This patch changes CR3 check logic to
> calculate the reserved bits at runtime.
>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  1 -
>  arch/x86/kvm/emulate.c          | 12 ++++++++++--
>  arch/x86/kvm/x86.c              |  8 ++++----
>  3 files changed, 14 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 9e4862e..018300e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -79,7 +79,6 @@
>                           | X86_CR0_ET | X86_CR0_NE | X86_CR0_WP | X86_CR0_AM \
>                           | X86_CR0_NW | X86_CR0_CD | X86_CR0_PG))
>
> -#define CR3_L_MODE_RESERVED_BITS 0xFFFFFF0000000000ULL
>  #define CR3_PCID_INVD           BIT_64(63)
>  #define CR4_RESERVED_BITS                                               \
>         (~(unsigned long)(X86_CR4_VME | X86_CR4_PVI | X86_CR4_TSD | X86_CR4_DE\
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index fb00559..a98b88a 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -4097,8 +4097,16 @@ static int check_cr_write(struct x86_emulate_ctxt *ctxt)
>                 u64 rsvd = 0;
>
>                 ctxt->ops->get_msr(ctxt, MSR_EFER, &efer);
> -               if (efer & EFER_LMA)
> -                       rsvd = CR3_L_MODE_RESERVED_BITS & ~CR3_PCID_INVD;
> +               if (efer & EFER_LMA) {
> +                       u64 maxphyaddr;
> +                       u32 eax = 0x80000008;
> +
> +                       ctxt->ops->get_cpuid(ctxt, &eax, NULL, NULL, NULL);
> +                       maxphyaddr = eax * 0xff;

What if leaf 0x80000008 is not defined?

> +
> +                       rsvd = (~((1UL << maxphyaddr) - 1)) &
> +                               ~CR3_PCID_INVD;
> +               }
>
>                 if (new_val & rsvd)
>                         return emulate_gp(ctxt, 0);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e40a779..d9100c4 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -813,10 +813,10 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
>                 return 0;
>         }
>
> -       if (is_long_mode(vcpu)) {
> -               if (cr3 & CR3_L_MODE_RESERVED_BITS)
> -                       return 1;
> -       } else if (is_pae(vcpu) && is_paging(vcpu) &&
> +       if (is_long_mode(vcpu) &&
> +           (cr3 & rsvd_bits(cpuid_maxphyaddr(vcpu), 63)))
> +               return 1;
> +       else if (is_pae(vcpu) && is_paging(vcpu) &&
>                    !load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3))
>                 return 1;
>
> --
> 2.5.0
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 1/4] KVM: MMU: check guest CR3 reserved bits based on its physical address width.
  2017-08-14 16:13   ` Jim Mattson
@ 2017-08-14 16:40     ` Paolo Bonzini
  2017-08-15  7:50       ` Yu Zhang
  0 siblings, 1 reply; 25+ messages in thread
From: Paolo Bonzini @ 2017-08-14 16:40 UTC (permalink / raw)
  To: Jim Mattson, Yu Zhang
  Cc: kvm list, LKML, Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H . Peter Anvin, xiaoguangrong,
	Joerg Roedel

On 14/08/2017 18:13, Jim Mattson wrote:
>>                 ctxt->ops->get_msr(ctxt, MSR_EFER, &efer);
>> -               if (efer & EFER_LMA)
>> -                       rsvd = CR3_L_MODE_RESERVED_BITS & ~CR3_PCID_INVD;
>> +               if (efer & EFER_LMA) {
>> +                       u64 maxphyaddr;
>> +                       u32 eax = 0x80000008;
>> +
>> +                       ctxt->ops->get_cpuid(ctxt, &eax, NULL, NULL, NULL);
>> +                       maxphyaddr = eax * 0xff;
>
> What if leaf 0x80000008 is not defined?

I noticed this too, but I thought it was mitigated by being under
EFER_LMA.  Unfortunately, kvm_set_efer doesn't check
guest_cpuid_has_longmode, so I guess you do have to test leaf 0x80000000
first.  Alternatively:

1) kvm_cpuid could return false if it's falling back to
check_cpuid_limit, and emulator_get_cpuid can then be changed to return bool

2) kvm_cpuid and emulator_get_cpuid could gain a new argument to disable
the check_cpuid_limit fallback.

Yu, would you like to implement the latter?
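
For the latter, the prototype could become something like this (just a
sketch, exact naming aside):

bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx,
	       u32 *ecx, u32 *edx, bool check_limit);

returning false when the requested leaf does not exist and check_limit is
false, so that callers can notice that 0x80000008 is absent instead of
silently getting the check_cpuid_limit fallback; emulator_get_cpuid would
grow the same argument.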

Paolo

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 1/4] KVM: MMU: check guest CR3 reserved bits based on its physical address width.
  2017-08-14 16:40     ` Paolo Bonzini
@ 2017-08-15  7:50       ` Yu Zhang
  0 siblings, 0 replies; 25+ messages in thread
From: Yu Zhang @ 2017-08-15  7:50 UTC (permalink / raw)
  To: Paolo Bonzini, Jim Mattson
  Cc: kvm list, LKML, Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H . Peter Anvin, xiaoguangrong,
	Joerg Roedel



On 8/15/2017 12:40 AM, Paolo Bonzini wrote:
> On 14/08/2017 18:13, Jim Mattson wrote:
>>>                  ctxt->ops->get_msr(ctxt, MSR_EFER, &efer);
>>> -               if (efer & EFER_LMA)
>>> -                       rsvd = CR3_L_MODE_RESERVED_BITS & ~CR3_PCID_INVD;
>>> +               if (efer & EFER_LMA) {
>>> +                       u64 maxphyaddr;
>>> +                       u32 eax = 0x80000008;
>>> +
>>> +                       ctxt->ops->get_cpuid(ctxt, &eax, NULL, NULL, NULL);
>>> +                       maxphyaddr = eax * 0xff;
>> What if leaf 0x80000008 is not defined?
> I noticed this too, but I thought it was mitigated by being under
> EFER_LMA.  Unfortunately, kvm_set_efer doesn't check
> guest_cpuid_has_longmode, so I guess you do have to test leaf 0x80000000
> first.  Alternatively:
>
> 1) kvm_cpuid could return false if it's falling back to
> check_cpuid_limit, and emulator_get_cpuid can then be changed to return bool
>
> 2) kvm_cpuid and emulator_get_cpuid could gain a new argument to disable
> the check_cpuid_limit fallback.
>
> Yu, would you like to implement the latter?

Thanks for pointing this out, Jim & Paolo. The latter choice sounds
better to me. :-)
I'd like to implement this as a separate patch in the next version of the
patch set.
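
Roughly, the emulator side would then look something like below (only a
sketch, assuming ->get_cpuid() gains the no-fallback flag and a boolean
return; names may change):

	if (efer & EFER_LMA) {
		u64 maxphyaddr;
		u32 eax, ebx, ecx, edx;

		eax = 0x80000008;
		ebx = ecx = edx = 0;
		/* ask for the leaf without the check_cpuid_limit fallback */
		if (ctxt->ops->get_cpuid(ctxt, &eax, &ebx, &ecx, &edx, false))
			maxphyaddr = eax & 0xff; /* CPUID.80000008H:EAX[7:0] */
		else
			maxphyaddr = 36; /* leaf absent, use a safe default */

		rsvd = (~((1ULL << maxphyaddr) - 1)) & ~CR3_PCID_INVD;
	}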

Yu
> Paolo
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 4/4] KVM: MMU: Expose the LA57 feature to VM.
  2017-08-17 11:57   ` Paolo Bonzini
@ 2017-08-17 11:53     ` Yu Zhang
  2017-08-17 14:29       ` Paolo Bonzini
  0 siblings, 1 reply; 25+ messages in thread
From: Yu Zhang @ 2017-08-17 11:53 UTC (permalink / raw)
  To: Paolo Bonzini, kvm
  Cc: linux-kernel, rkrcmar, tglx, mingo, hpa, xiaoguangrong, joro



On 8/17/2017 7:57 PM, Paolo Bonzini wrote:
> On 12/08/2017 15:35, Yu Zhang wrote:
>> index a98b88a..50107ae 100644
>> --- a/arch/x86/kvm/emulate.c
>> +++ b/arch/x86/kvm/emulate.c
>> @@ -694,7 +694,7 @@ static __always_inline int __linearize(struct x86_emulate_ctxt *ctxt,
>>   	switch (mode) {
>>   	case X86EMUL_MODE_PROT64:
>>   		*linear = la;
>> -		if (is_noncanonical_address(la))
>> +		if (emul_is_noncanonical_address(la, ctxt))
>>   			goto bad;
>>   
>>   		*max_size = min_t(u64, ~0u, (1ull << 48) - la);
> Oops, you missed one here.  Probably best to use ctxt_virt_addr_bits and
> then "inline" emul_is_noncanonical_address as "get_canonical(la,
> va_bits) != la".

Sorry, I just sent out the v2 patch set without noticing this reply. :-)

The emul_is_noncanonical() is defined in x86.h so that no 
ctxt_virt_addr_bits needed in emulate.c, are you
suggesting to use ctx_virt_addr_bits in this file each time before 
emul_is_noncanonical_address() is called?

Yu

> Paolo
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 4/4] KVM: MMU: Expose the LA57 feature to VM.
  2017-08-12 13:35 ` [PATCH v1 4/4] KVM: MMU: Expose the LA57 feature to VM Yu Zhang
@ 2017-08-17 11:57   ` Paolo Bonzini
  2017-08-17 11:53     ` Yu Zhang
  0 siblings, 1 reply; 25+ messages in thread
From: Paolo Bonzini @ 2017-08-17 11:57 UTC (permalink / raw)
  To: Yu Zhang, kvm
  Cc: linux-kernel, rkrcmar, tglx, mingo, hpa, xiaoguangrong, joro

On 12/08/2017 15:35, Yu Zhang wrote:
> index a98b88a..50107ae 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -694,7 +694,7 @@ static __always_inline int __linearize(struct x86_emulate_ctxt *ctxt,
>  	switch (mode) {
>  	case X86EMUL_MODE_PROT64:
>  		*linear = la;
> -		if (is_noncanonical_address(la))
> +		if (emul_is_noncanonical_address(la, ctxt))
>  			goto bad;
>  
>  		*max_size = min_t(u64, ~0u, (1ull << 48) - la);

Oops, you missed one here.  Probably best to use ctxt_virt_addr_bits and
then "inline" emul_is_noncanonical_address as "get_canonical(la,
va_bits) != la".

Paolo

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 4/4] KVM: MMU: Expose the LA57 feature to VM.
  2017-08-17 11:53     ` Yu Zhang
@ 2017-08-17 14:29       ` Paolo Bonzini
  2017-08-18  8:28         ` Yu Zhang
  0 siblings, 1 reply; 25+ messages in thread
From: Paolo Bonzini @ 2017-08-17 14:29 UTC (permalink / raw)
  To: Yu Zhang, kvm
  Cc: linux-kernel, rkrcmar, tglx, mingo, hpa, xiaoguangrong, joro

On 17/08/2017 13:53, Yu Zhang wrote:
> 
> 
> On 8/17/2017 7:57 PM, Paolo Bonzini wrote:
>> On 12/08/2017 15:35, Yu Zhang wrote:
>>> index a98b88a..50107ae 100644
>>> --- a/arch/x86/kvm/emulate.c
>>> +++ b/arch/x86/kvm/emulate.c
>>> @@ -694,7 +694,7 @@ static __always_inline int __linearize(struct
>>> x86_emulate_ctxt *ctxt,
>>>       switch (mode) {
>>>       case X86EMUL_MODE_PROT64:
>>>           *linear = la;
>>> -        if (is_noncanonical_address(la))
>>> +        if (emul_is_noncanonical_address(la, ctxt))
>>>               goto bad;
>>>             *max_size = min_t(u64, ~0u, (1ull << 48) - la);
>> Oops, you missed one here.  Probably best to use ctxt_virt_addr_bits and
>> then "inline" emul_is_noncanonical_address as "get_canonical(la,
>> va_bits) != la".
> 
> Sorry, I just sent out the v2 patch set without noticing this reply. :-)
> 
> The emul_is_noncanonical() is defined in x86.h so that no
> ctxt_virt_addr_bits needed in emulate.c, are you
> suggesting to use ctx_virt_addr_bits in this file each time before
> emul_is_noncanonical_address() is called?

No, only in this instance which uses "48" after the call to
emul_is_noncanonical_address.

Paolo

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 4/4] KVM: MMU: Expose the LA57 feature to VM.
  2017-08-17 14:29       ` Paolo Bonzini
@ 2017-08-18  8:28         ` Yu Zhang
  2017-08-18 12:50           ` Paolo Bonzini
  0 siblings, 1 reply; 25+ messages in thread
From: Yu Zhang @ 2017-08-18  8:28 UTC (permalink / raw)
  To: Paolo Bonzini, kvm
  Cc: linux-kernel, rkrcmar, tglx, mingo, hpa, xiaoguangrong, joro



On 8/17/2017 10:29 PM, Paolo Bonzini wrote:
> On 17/08/2017 13:53, Yu Zhang wrote:
>>
>> On 8/17/2017 7:57 PM, Paolo Bonzini wrote:
>>> On 12/08/2017 15:35, Yu Zhang wrote:
>>>> index a98b88a..50107ae 100644
>>>> --- a/arch/x86/kvm/emulate.c
>>>> +++ b/arch/x86/kvm/emulate.c
>>>> @@ -694,7 +694,7 @@ static __always_inline int __linearize(struct
>>>> x86_emulate_ctxt *ctxt,
>>>>        switch (mode) {
>>>>        case X86EMUL_MODE_PROT64:
>>>>            *linear = la;
>>>> -        if (is_noncanonical_address(la))
>>>> +        if (emul_is_noncanonical_address(la, ctxt))
>>>>                goto bad;
>>>>              *max_size = min_t(u64, ~0u, (1ull << 48) - la);
>>> Oops, you missed one here.  Probably best to use ctxt_virt_addr_bits and
>>> then "inline" emul_is_noncanonical_address as "get_canonical(la,
>>> va_bits) != la".
>> Sorry, I just sent out the v2 patch set without noticing this reply. :-)
>>
>> The emul_is_noncanonical() is defined in x86.h so that no
>> ctxt_virt_addr_bits needed in emulate.c, are you
>> suggesting to use ctx_virt_addr_bits in this file each time before
>> emul_is_noncanonical_address() is called?
> No, only in this instance which uses "48" after the call to
> emul_is_noncanonical_address.

Sorry, Paolo. I still do not quite get it.
Do you mean the
  *max_size = min_t(u64, ~0u, (1ull << 48) - la);
also need to be changed?

But I do not understand why this statement is used like this. My 
understanding is that
for 64 bit scenario, the *max_size is calculated to guarantee la + 
*max_size still falls in
the canonical address space.

And if above understanding is correct, I think it should be something 
like below:
   *max_size = min_t(u64, ~0u - la, (1ull << 48) - la);

And with LA57, may better be changed to:
   *max_size = min_t(u64, ~0u - la, (1ull << ctxt_virt_addr_bits(ctxt)) 
- la);

And for the above
   if (emul_is_noncanonical_address(la, ctxt))
we may just leave it as it is.

Is this understanding correct? Or did I misunderstand your comments? :-)

Thanks
Yu
> Paolo
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 4/4] KVM: MMU: Expose the LA57 feature to VM.
  2017-08-18  8:28         ` Yu Zhang
@ 2017-08-18 12:50           ` Paolo Bonzini
  2017-08-21  7:27             ` Yu Zhang
  0 siblings, 1 reply; 25+ messages in thread
From: Paolo Bonzini @ 2017-08-18 12:50 UTC (permalink / raw)
  To: Yu Zhang, kvm
  Cc: linux-kernel, rkrcmar, tglx, mingo, hpa, xiaoguangrong, joro

On 18/08/2017 10:28, Yu Zhang wrote:
> 
> 
> On 8/17/2017 10:29 PM, Paolo Bonzini wrote:
>> On 17/08/2017 13:53, Yu Zhang wrote:
>>>
>>> On 8/17/2017 7:57 PM, Paolo Bonzini wrote:
>>>> On 12/08/2017 15:35, Yu Zhang wrote:
>>>>> index a98b88a..50107ae 100644
>>>>> --- a/arch/x86/kvm/emulate.c
>>>>> +++ b/arch/x86/kvm/emulate.c
>>>>> @@ -694,7 +694,7 @@ static __always_inline int __linearize(struct
>>>>> x86_emulate_ctxt *ctxt,
>>>>>        switch (mode) {
>>>>>        case X86EMUL_MODE_PROT64:
>>>>>            *linear = la;
>>>>> -        if (is_noncanonical_address(la))
>>>>> +        if (emul_is_noncanonical_address(la, ctxt))
>>>>>                goto bad;
>>>>>              *max_size = min_t(u64, ~0u, (1ull << 48) - la);
>>>> Oops, you missed one here.  Probably best to use ctxt_virt_addr_bits
>>>> and
>>>> then "inline" emul_is_noncanonical_address as "get_canonical(la,
>>>> va_bits) != la".
>>> Sorry, I just sent out the v2 patch set without noticing this reply. :-)
>>>
>>> The emul_is_noncanonical() is defined in x86.h so that no
>>> ctxt_virt_addr_bits needed in emulate.c, are you
>>> suggesting to use ctx_virt_addr_bits in this file each time before
>>> emul_is_noncanonical_address() is called?
>> No, only in this instance which uses "48" after the call to
>> emul_is_noncanonical_address.
> 
> Sorry, Paolo. I still do not quite get it.
> Do you mean the
>  *max_size = min_t(u64, ~0u, (1ull << 48) - la);
> also need to be changed?
> 
> But I do not understand why this statement is used like this. My
> understanding is that
> for 64 bit scenario, the *max_size is calculated to guarantee la +
> *max_size still falls in
> the canonical address space.
> 
> And if above understanding is correct, I think it should be something
> like below:
>   *max_size = min_t(u64, ~0u - la, (1ull << 48) - la);

The "~0u" part is simply because max_size has 32-bit size (it's an
unsigned int variable), while (1ull << 48) - la has 64-bit size.  It
protects from the overflow.

> And with LA57, may better be changed to:
>   *max_size = min_t(u64, ~0u - la, (1ull << ctxt_virt_addr_bits(ctxt)) -
> la);
> 
> And for the above
>   if (emul_is_noncanonical_address(la, ctxt))
> we may just leave it as it is.

Yes, exactly.  But since emul_is_noncanonical_address is already using
ctxt_virt_addr_bits(ctxt), it may make sense to compute
ctxt_virt_addr_bits(ctxt) once and then reuse it twice, once in
get_canonical(la, va_bits) != la and once in (1ull << va_bits) - la.
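
That is, something like this for the PROT64 case (a sketch, with va_bits
being a u8 declared at the top of __linearize):

	case X86EMUL_MODE_PROT64:
		*linear = la;
		va_bits = ctxt_virt_addr_bits(ctxt);
		if (get_canonical(la, va_bits) != la)
			goto bad;

		*max_size = min_t(u64, ~0u, (1ull << va_bits) - la);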

Paolo

> Is this understanding correct? Or did I misunderstand your comments? :-)
> 
> Thanks
> Yu
>> Paolo
>>
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 4/4] KVM: MMU: Expose the LA57 feature to VM.
  2017-08-18 12:50           ` Paolo Bonzini
@ 2017-08-21  7:27             ` Yu Zhang
  2017-08-21 10:12               ` Paolo Bonzini
  0 siblings, 1 reply; 25+ messages in thread
From: Yu Zhang @ 2017-08-21  7:27 UTC (permalink / raw)
  To: Paolo Bonzini, kvm
  Cc: linux-kernel, rkrcmar, tglx, mingo, hpa, xiaoguangrong, joro



On 8/18/2017 8:50 PM, Paolo Bonzini wrote:
> On 18/08/2017 10:28, Yu Zhang wrote:
>>
>> On 8/17/2017 10:29 PM, Paolo Bonzini wrote:
>>> On 17/08/2017 13:53, Yu Zhang wrote:
>>>> On 8/17/2017 7:57 PM, Paolo Bonzini wrote:
>>>>> On 12/08/2017 15:35, Yu Zhang wrote:
>>>>>> index a98b88a..50107ae 100644
>>>>>> --- a/arch/x86/kvm/emulate.c
>>>>>> +++ b/arch/x86/kvm/emulate.c
>>>>>> @@ -694,7 +694,7 @@ static __always_inline int __linearize(struct
>>>>>> x86_emulate_ctxt *ctxt,
>>>>>>         switch (mode) {
>>>>>>         case X86EMUL_MODE_PROT64:
>>>>>>             *linear = la;
>>>>>> -        if (is_noncanonical_address(la))
>>>>>> +        if (emul_is_noncanonical_address(la, ctxt))
>>>>>>                 goto bad;
>>>>>>               *max_size = min_t(u64, ~0u, (1ull << 48) - la);
>>>>> Oops, you missed one here.  Probably best to use ctxt_virt_addr_bits
>>>>> and
>>>>> then "inline" emul_is_noncanonical_address as "get_canonical(la,
>>>>> va_bits) != la".
>>>> Sorry, I just sent out the v2 patch set without noticing this reply. :-)
>>>>
>>>> The emul_is_noncanonical() is defined in x86.h so that no
>>>> ctxt_virt_addr_bits needed in emulate.c, are you
>>>> suggesting to use ctx_virt_addr_bits in this file each time before
>>>> emul_is_noncanonical_address() is called?
>>> No, only in this instance which uses "48" after the call to
>>> emul_is_noncanonical_address.
>> Sorry, Paolo. I still do not quite get it.
>> Do you mean the
>>   *max_size = min_t(u64, ~0u, (1ull << 48) - la);
>> also need to be changed?
>>
>> But I do not understand why this statement is used like this. My
>> understanding is that
>> for 64 bit scenario, the *max_size is calculated to guarantee la +
>> *max_size still falls in
>> the canonical address space.
>>
>> And if above understanding is correct, I think it should be something
>> like below:
>>    *max_size = min_t(u64, ~0u - la, (1ull << 48) - la);
> The "~0u" part is simply because max_size has 32-bit size (it's an
> unsigned int variable), while (1ull << 48) - la has 64-bit size.  It
> protects from the overflow.

Oh, right. "~0u" is only an unsigned int. Thanks for your clarification. :-)

But what if the value of "la" falls in between 0xFFFFFFFFFFFFFFFF and
0xFFFF000000000000?
(1ull << 48) - la may result in something between 0x1000000000001 and
0x2000000000000, and the *max_size would be 4G - 1 in this scenario.
For instance, when "la" is 0xFFFFFFFFFFFFFFF0 (unlikely in practice
though), the *max_size we are expecting should be 15, instead of 4G - 1.

If the above understanding is correct, maybe we should change this code as
below:
@@ -690,16 +690,21 @@ static __always_inline int __linearize(struct 
x86_emulate_ctxt *ctxt,
         ulong la;
         u32 lim;
         u16 sel;
+       u64 canonical_limit;
+       u8 va_bits;

         la = seg_base(ctxt, addr.seg) + addr.ea;
         *max_size = 0;
         switch (mode) {
         case X86EMUL_MODE_PROT64:
                 *linear = la;
-               if (emul_is_noncanonical_address(la, ctxt))
+               va_bits = ctxt_virt_addr_bits(ctxt);
+               if (get_canonical(la, va_bits) != la)
                         goto bad;

-               *max_size = min_t(u64, ~0u, (1ull << 48) - la);
+               canonical_limit = (la & (1 << va_bits)) ?
+                                 ~0ull : ((1 << va_bits) -1);
+               *max_size = min_t(u64, ~0u, canonical_limit - la + 1);

Does this sound reasonable?
BTW, I did not use min_t(u64, ~0ull - la + 1, (1 << va_bits) - la) here,
because I would still like to keep *max_size as an unsigned int, and my
previous suggestion may cause the return value of min_t to be truncated.

Yu

>> And with LA57, may better be changed to:
>>    *max_size = min_t(u64, ~0u - la, (1ull << ctxt_virt_addr_bits(ctxt)) -
>> la);
>>
>> And for the above
>>    if (emul_is_noncanonical_address(la, ctxt))
>> we may just leave it as it is.
> Yes, exactly.  But since emul_is_noncanonical_address is already using
> ctxt_virt_addr_bits(ctxt), it may make sense to compute
> ctxt_virt_addr_bits(ctxt) once and then reuse it twice, once in
> get_canonical(la, va_bits) != la and once in (1ull << va_bits) - la.
>
> Paolo
>
>> Is this understanding correct? Or did I misunderstand your comments? :-)
>>
>> Thanks
>> Yu
>>> Paolo
>>>
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 4/4] KVM: MMU: Expose the LA57 feature to VM.
  2017-08-21  7:27             ` Yu Zhang
@ 2017-08-21 10:12               ` Paolo Bonzini
  2017-08-21 12:11                 ` Yu Zhang
  0 siblings, 1 reply; 25+ messages in thread
From: Paolo Bonzini @ 2017-08-21 10:12 UTC (permalink / raw)
  To: Yu Zhang, kvm
  Cc: linux-kernel, rkrcmar, tglx, mingo, hpa, xiaoguangrong, joro

On 21/08/2017 09:27, Yu Zhang wrote:
> 
> 
> On 8/18/2017 8:50 PM, Paolo Bonzini wrote:
>> On 18/08/2017 10:28, Yu Zhang wrote:
>>>
>>> On 8/17/2017 10:29 PM, Paolo Bonzini wrote:
>>>> On 17/08/2017 13:53, Yu Zhang wrote:
>>>>> On 8/17/2017 7:57 PM, Paolo Bonzini wrote:
>>>>>> On 12/08/2017 15:35, Yu Zhang wrote:
>>>>>>> index a98b88a..50107ae 100644
>>>>>>> --- a/arch/x86/kvm/emulate.c
>>>>>>> +++ b/arch/x86/kvm/emulate.c
>>>>>>> @@ -694,7 +694,7 @@ static __always_inline int __linearize(struct
>>>>>>> x86_emulate_ctxt *ctxt,
>>>>>>>         switch (mode) {
>>>>>>>         case X86EMUL_MODE_PROT64:
>>>>>>>             *linear = la;
>>>>>>> -        if (is_noncanonical_address(la))
>>>>>>> +        if (emul_is_noncanonical_address(la, ctxt))
>>>>>>>                 goto bad;
>>>>>>>               *max_size = min_t(u64, ~0u, (1ull << 48) - la);
>>>>>> Oops, you missed one here.  Probably best to use ctxt_virt_addr_bits
>>>>>> and
>>>>>> then "inline" emul_is_noncanonical_address as "get_canonical(la,
>>>>>> va_bits) != la".
>>>>> Sorry, I just sent out the v2 patch set without noticing this
>>>>> reply. :-)
>>>>>
>>>>> The emul_is_noncanonical() is defined in x86.h so that no
>>>>> ctxt_virt_addr_bits needed in emulate.c, are you
>>>>> suggesting to use ctx_virt_addr_bits in this file each time before
>>>>> emul_is_noncanonical_address() is called?
>>>> No, only in this instance which uses "48" after the call to
>>>> emul_is_noncanonical_address.
>>> Sorry, Paolo. I still do not quite get it.
>>> Do you mean the
>>>   *max_size = min_t(u64, ~0u, (1ull << 48) - la);
>>> also need to be changed?
>>>
>>> But I do not understand why this statement is used like this. My
>>> understanding is that
>>> for 64 bit scenario, the *max_size is calculated to guarantee la +
>>> *max_size still falls in
>>> the canonical address space.
>>>
>>> And if above understanding is correct, I think it should be something
>>> like below:
>>>    *max_size = min_t(u64, ~0u - la, (1ull << 48) - la);
>> The "~0u" part is simply because max_size has 32-bit size (it's an
>> unsigned int variable), while (1ull << 48) - la has 64-bit size.  It
>> protects from the overflow.
> 
> But what if value of "la" falls in between 0xFFFFFFFFFFFFFFFF and
> 0xFFFF000000000000? (1ull << 48) - la may result in something between
> 0x1000000000001 and 0x2000000000000, and the *max_size would be 4G - 1
> in this scenario.  For instance, when "la" is 0xFFFFFFFFFFFFFFF0 (unlikely
> in practice though), the *max_size we are expecting should be 15, instead
> of 4G - 1.

No, it is possible to wrap a memory access from the top half of the
address space to the bottom half, so there's no limit at 0xFFFFFFFFFFFFFFF0.

Paolo

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v1 4/4] KVM: MMU: Expose the LA57 feature to VM.
  2017-08-21 10:12               ` Paolo Bonzini
@ 2017-08-21 12:11                 ` Yu Zhang
  0 siblings, 0 replies; 25+ messages in thread
From: Yu Zhang @ 2017-08-21 12:11 UTC (permalink / raw)
  To: Paolo Bonzini, kvm
  Cc: linux-kernel, rkrcmar, tglx, mingo, hpa, xiaoguangrong, joro



On 8/21/2017 6:12 PM, Paolo Bonzini wrote:
> On 21/08/2017 09:27, Yu Zhang wrote:
>>
>> On 8/18/2017 8:50 PM, Paolo Bonzini wrote:
>>> On 18/08/2017 10:28, Yu Zhang wrote:
>>>> On 8/17/2017 10:29 PM, Paolo Bonzini wrote:
>>>>> On 17/08/2017 13:53, Yu Zhang wrote:
>>>>>> On 8/17/2017 7:57 PM, Paolo Bonzini wrote:
>>>>>>> On 12/08/2017 15:35, Yu Zhang wrote:
>>>>>>>> index a98b88a..50107ae 100644
>>>>>>>> --- a/arch/x86/kvm/emulate.c
>>>>>>>> +++ b/arch/x86/kvm/emulate.c
>>>>>>>> @@ -694,7 +694,7 @@ static __always_inline int __linearize(struct
>>>>>>>> x86_emulate_ctxt *ctxt,
>>>>>>>>          switch (mode) {
>>>>>>>>          case X86EMUL_MODE_PROT64:
>>>>>>>>              *linear = la;
>>>>>>>> -        if (is_noncanonical_address(la))
>>>>>>>> +        if (emul_is_noncanonical_address(la, ctxt))
>>>>>>>>                  goto bad;
>>>>>>>>                *max_size = min_t(u64, ~0u, (1ull << 48) - la);
>>>>>>> Oops, you missed one here.  Probably best to use ctxt_virt_addr_bits
>>>>>>> and
>>>>>>> then "inline" emul_is_noncanonical_address as "get_canonical(la,
>>>>>>> va_bits) != la".
>>>>>> Sorry, I just sent out the v2 patch set without noticing this
>>>>>> reply. :-)
>>>>>>
>>>>>> The emul_is_noncanonical() is defined in x86.h so that no
>>>>>> ctxt_virt_addr_bits needed in emulate.c, are you
>>>>>> suggesting to use ctx_virt_addr_bits in this file each time before
>>>>>> emul_is_noncanonical_address() is called?
>>>>> No, only in this instance which uses "48" after the call to
>>>>> emul_is_noncanonical_address.
>>>> Sorry, Paolo. I still do not quite get it.
>>>> Do you mean the
>>>>    *max_size = min_t(u64, ~0u, (1ull << 48) - la);
>>>> also need to be changed?
>>>>
>>>> But I do not understand why this statement is used like this. My
>>>> understanding is that
>>>> for 64 bit scenario, the *max_size is calculated to guarantee la +
>>>> *max_size still falls in
>>>> the canonical address space.
>>>>
>>>> And if above understanding is correct, I think it should be something
>>>> like below:
>>>>     *max_size = min_t(u64, ~0u - la, (1ull << 48) - la);
>>> The "~0u" part is simply because max_size has 32-bit size (it's an
>>> unsigned int variable), while (1ull << 48) - la has 64-bit size.  It
>>> protects from the overflow.
>> But what if value of "la" falls in between 0xFFFFFFFFFFFFFFFF and
>> 0xFFFF000000000000? (1ull << 48) - la may result in something between
>> 0x1000000000001 and 0x2000000000000, and the *max_size would be 4G - 1
>> in this scenario.  For instance, when "la" is 0xFFFFFFFFFFFFFFF0 (unlikely
>> in practice though), the *max_size we are expecting should be 15, instead
>> of 4G - 1.
> No, it is possible to wrap a memory access from the top half of the
> address space to the bottom half, so there's no limit at 0xFFFFFFFFFFFFFFF0.

Oh? So you mean it is allowed for one instruction to reside both in the 
top half of
the address space and in the bottom half?

If this is possible, I guess the code should be

*max_size = min_t(u64, ~0u, (1ull << va_bits) - la);
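
(e.g. for va_bits = 48 and la = 0xFFFFFFFFFFFFFFF0, the unsigned
subtraction (1ull << 48) - la wraps around to 0x1000000000010, and min_t
then clamps *max_size to ~0u, i.e. 4G - 1 instead of 15, so the access is
simply allowed to wrap past the top of the address space.)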

But I wonder, why should such scenario be allowed? :-)

Thanks
Yu




>
> Paolo
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

Thread overview: 25+ messages
2017-08-12 13:35 [PATCH v1 0/4] KVM: MMU: 5 level EPT/shadow support Yu Zhang
2017-08-12 13:35 ` [PATCH v1 1/4] KVM: MMU: check guest CR3 reserved bits based on its physical address width Yu Zhang
2017-08-14  7:36   ` Paolo Bonzini
2017-08-14 11:39     ` Yu Zhang
2017-08-14 16:13   ` Jim Mattson
2017-08-14 16:40     ` Paolo Bonzini
2017-08-15  7:50       ` Yu Zhang
2017-08-12 13:35 ` [PATCH v1 2/4] KVM: MMU: Rename PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL Yu Zhang
2017-08-12 13:35 ` [PATCH v1 3/4] KVM: MMU: Add 5 level EPT & Shadow page table support Yu Zhang
2017-08-14  7:31   ` Paolo Bonzini
2017-08-14 11:37     ` Yu Zhang
2017-08-14 14:13       ` Paolo Bonzini
2017-08-14 14:32         ` Yu Zhang
2017-08-14 15:02           ` Paolo Bonzini
2017-08-14 14:55             ` Yu Zhang
2017-08-12 13:35 ` [PATCH v1 4/4] KVM: MMU: Expose the LA57 feature to VM Yu Zhang
2017-08-17 11:57   ` Paolo Bonzini
2017-08-17 11:53     ` Yu Zhang
2017-08-17 14:29       ` Paolo Bonzini
2017-08-18  8:28         ` Yu Zhang
2017-08-18 12:50           ` Paolo Bonzini
2017-08-21  7:27             ` Yu Zhang
2017-08-21 10:12               ` Paolo Bonzini
2017-08-21 12:11                 ` Yu Zhang
2017-08-14  7:32 ` [PATCH v1 0/4] KVM: MMU: 5 level EPT/shadow support Paolo Bonzini
