* [PATCH v2 0/3] KVM: MMU: fix nested guest live migration with PML
@ 2019-09-27 11:15 Paolo Bonzini
  2019-09-27 11:15 ` [PATCH v2 1/3] KVM: x86: assign two bits to track SPTE kinds Paolo Bonzini
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Paolo Bonzini @ 2019-09-27 11:15 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: Vitaly Kuznetsov, Sean Christopherson, Junaid Shahid

Shadow paging is fundamentally incompatible with the page-modification
log, because the GPAs in the log come from the wrong memory map.
In particular, for the EPT page-modification log, the GPAs in the log come
from L2 rather than L1.  (If there was a non-EPT page-modification log,
we couldn't use it for shadow paging because it would log GVAs rather
than GPAs).

Therefore, we need to rely on write protection to record dirty pages.
This has the side effect of bypassing PML, since writes now result in an
EPT violation vmexit.

This turns out to be a surprisingly small patch---the testcase is what's
guilty of the scary diffstat.  But that is because the KVM MMU code is
absurdly clever, so a very close review is appreciated.
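
In code terms, the gist of the change (patch 2) is the following dispatch
when the dirty log is harvested; this is a simplified sketch of what
__rmap_clear_dirty ends up doing, using the helpers that the patch
introduces:

	for_each_rmap_spte(rmap_head, &iter, sptep)
		if (spte_ad_need_write_protect(*sptep))
			/*
			 * A/D bits cannot be used for logging (e.g. shadow EPT
			 * for a nested guest): write protect instead, so that
			 * the next write takes an EPT violation and is logged
			 * with the L1 GPA.
			 */
			flush |= spte_wrprot_for_clear_dirty(sptep);
		else
			/* Normal case: clear the dirty bit, PML keeps working. */
			flush |= spte_clear_dirty(sptep);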

Paolo

v1->v2: don't place the new code in spte_clear_dirty [Junaid]

Paolo Bonzini (3):
  KVM: x86: assign two bits to track SPTE kinds
  KVM: x86: fix nested guest live migration with PML
  selftests: kvm: add test for dirty logging inside nested guests

 arch/x86/include/asm/kvm_host.h                    |   7 -
 arch/x86/kvm/mmu.c                                 |  65 +++++--
 tools/testing/selftests/kvm/Makefile               |   1 +
 .../selftests/kvm/include/x86_64/processor.h       |   3 +
 tools/testing/selftests/kvm/include/x86_64/vmx.h   |  14 ++
 tools/testing/selftests/kvm/lib/kvm_util.c         |   2 +-
 .../testing/selftests/kvm/lib/kvm_util_internal.h  |   3 +
 tools/testing/selftests/kvm/lib/x86_64/vmx.c       | 201 ++++++++++++++++++++-
 .../selftests/kvm/x86_64/vmx_dirty_log_test.c      | 156 ++++++++++++++++
 9 files changed, 426 insertions(+), 26 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/vmx_dirty_log_test.c

-- 
1.8.3.1



* [PATCH v2 1/3] KVM: x86: assign two bits to track SPTE kinds
  2019-09-27 11:15 [PATCH v2 0/3] KVM: MMU: fix nested guest live migration with PML Paolo Bonzini
@ 2019-09-27 11:15 ` Paolo Bonzini
  2019-12-11 18:39   ` Ben Gardon
  2019-09-27 11:15 ` [PATCH v2 2/3] KVM: x86: fix nested guest live migration with PML Paolo Bonzini
  2019-09-27 11:15 ` [PATCH v2 3/3] selftests: kvm: add test for dirty logging inside nested guests Paolo Bonzini
  2 siblings, 1 reply; 10+ messages in thread
From: Paolo Bonzini @ 2019-09-27 11:15 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: Vitaly Kuznetsov, Sean Christopherson, Junaid Shahid

Currently, we are overloading SPTE_SPECIAL_MASK to mean both
"A/D bits unavailable" and MMIO, where the difference between the
two is determined by mmio_mask and mmio_value.

However, the next patch will need two bits to distinguish
availability of A/D bits from write protection.  So, while at
it give MMIO its own bit pattern, and move the two bits from
bit 62 to bits 52..53 since Intel is allocating EPT page table
bits from the top.
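
For reference, the resulting encodings of SPTE bits 53:52 are as follows
(SPTE_AD_WRPROT_ONLY_MASK is only added by the next patch in the series):

	00  SPTE_AD_ENABLED_MASK      A/D bits in use
	01  SPTE_AD_DISABLED_MASK     A/D bits unavailable (access tracking)
	10  SPTE_AD_WRPROT_ONLY_MASK  A/D bits in use, dirty logging via write protection
	11  SPTE_MMIO_MASK            MMIO SPTE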

Reviewed-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |  7 -------
 arch/x86/kvm/mmu.c              | 28 ++++++++++++++++++----------
 2 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 23edf56cf577..50eb430b0ad8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -219,13 +219,6 @@ enum {
 				 PFERR_WRITE_MASK |		\
 				 PFERR_PRESENT_MASK)
 
-/*
- * The mask used to denote special SPTEs, which can be either MMIO SPTEs or
- * Access Tracking SPTEs. We use bit 62 instead of bit 63 to avoid conflicting
- * with the SVE bit in EPT PTEs.
- */
-#define SPTE_SPECIAL_MASK (1ULL << 62)
-
 /* apic attention bits */
 #define KVM_APIC_CHECK_VAPIC	0
 /*
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5269aa057dfa..bac8d228d82b 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -83,7 +83,16 @@ enum {
 #define PTE_PREFETCH_NUM		8
 
 #define PT_FIRST_AVAIL_BITS_SHIFT 10
-#define PT64_SECOND_AVAIL_BITS_SHIFT 52
+#define PT64_SECOND_AVAIL_BITS_SHIFT 54
+
+/*
+ * The mask used to denote special SPTEs, which can be either MMIO SPTEs or
+ * Access Tracking SPTEs.
+ */
+#define SPTE_SPECIAL_MASK (3ULL << 52)
+#define SPTE_AD_ENABLED_MASK (0ULL << 52)
+#define SPTE_AD_DISABLED_MASK (1ULL << 52)
+#define SPTE_MMIO_MASK (3ULL << 52)
 
 #define PT64_LEVEL_BITS 9
 
@@ -219,12 +228,11 @@ struct kvm_shadow_walk_iterator {
 static u64 __read_mostly shadow_me_mask;
 
 /*
- * SPTEs used by MMUs without A/D bits are marked with shadow_acc_track_value.
- * Non-present SPTEs with shadow_acc_track_value set are in place for access
- * tracking.
+ * SPTEs used by MMUs without A/D bits are marked with SPTE_AD_DISABLED_MASK;
+ * shadow_acc_track_mask is the set of bits to be cleared in non-accessed
+ * pages.
  */
 static u64 __read_mostly shadow_acc_track_mask;
-static const u64 shadow_acc_track_value = SPTE_SPECIAL_MASK;
 
 /*
  * The mask/shift to use for saving the original R/X bits when marking the PTE
@@ -304,7 +312,7 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value, u64 access_mask)
 {
 	BUG_ON((u64)(unsigned)access_mask != access_mask);
 	BUG_ON((mmio_mask & mmio_value) != mmio_value);
-	shadow_mmio_value = mmio_value | SPTE_SPECIAL_MASK;
+	shadow_mmio_value = mmio_value | SPTE_MMIO_MASK;
 	shadow_mmio_mask = mmio_mask | SPTE_SPECIAL_MASK;
 	shadow_mmio_access_mask = access_mask;
 }
@@ -323,7 +331,7 @@ static inline bool sp_ad_disabled(struct kvm_mmu_page *sp)
 static inline bool spte_ad_enabled(u64 spte)
 {
 	MMU_WARN_ON(is_mmio_spte(spte));
-	return !(spte & shadow_acc_track_value);
+	return (spte & SPTE_SPECIAL_MASK) == SPTE_AD_ENABLED_MASK;
 }
 
 static inline u64 spte_shadow_accessed_mask(u64 spte)
@@ -461,7 +469,7 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 {
 	BUG_ON(!dirty_mask != !accessed_mask);
 	BUG_ON(!accessed_mask && !acc_track_mask);
-	BUG_ON(acc_track_mask & shadow_acc_track_value);
+	BUG_ON(acc_track_mask & SPTE_SPECIAL_MASK);
 
 	shadow_user_mask = user_mask;
 	shadow_accessed_mask = accessed_mask;
@@ -2622,7 +2630,7 @@ static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
 	       shadow_user_mask | shadow_x_mask | shadow_me_mask;
 
 	if (sp_ad_disabled(sp))
-		spte |= shadow_acc_track_value;
+		spte |= SPTE_AD_DISABLED_MASK;
 	else
 		spte |= shadow_accessed_mask;
 
@@ -2968,7 +2976,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 
 	sp = page_header(__pa(sptep));
 	if (sp_ad_disabled(sp))
-		spte |= shadow_acc_track_value;
+		spte |= SPTE_AD_DISABLED_MASK;
 
 	/*
 	 * For the EPT case, shadow_present_mask is 0 if hardware
-- 
1.8.3.1




* [PATCH v2 2/3] KVM: x86: fix nested guest live migration with PML
  2019-09-27 11:15 [PATCH v2 0/3] KVM: MMU: fix nested guest live migration with PML Paolo Bonzini
  2019-09-27 11:15 ` [PATCH v2 1/3] KVM: x86: assign two bits to track SPTE kinds Paolo Bonzini
@ 2019-09-27 11:15 ` Paolo Bonzini
  2019-09-27 20:24   ` Junaid Shahid
  2019-09-27 11:15 ` [PATCH v2 3/3] selftests: kvm: add test for dirty logging inside nested guests Paolo Bonzini
  2 siblings, 1 reply; 10+ messages in thread
From: Paolo Bonzini @ 2019-09-27 11:15 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: Vitaly Kuznetsov, Sean Christopherson, Junaid Shahid

Shadow paging is fundamentally incompatible with the page-modification
log, because the GPAs in the log come from the wrong memory map.
In particular, for the EPT page-modification log, the GPAs in the log come
from L2 rather than L1.  (If there was a non-EPT page-modification log,
we couldn't use it for shadow paging because it would log GVAs rather
than GPAs).

Therefore, we need to rely on write protection to record dirty pages.
This has the side effect of bypassing PML, since writes now result in an
EPT violation vmexit.

This is relatively easy to add to KVM, because pretty much the only place
that needs changing is spte_clear_dirty.  The first access to the page
already goes through the page fault path and records the correct GPA;
it's only subsequent accesses that are wrong.  Therefore, we can equip
set_spte (where the first access happens) to record that the SPTE will
have to be write protected, and then spte_clear_dirty will use this
information to do the right thing.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu.c | 39 ++++++++++++++++++++++++++++++++-------
 1 file changed, 32 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index bac8d228d82b..24c23c66b226 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -92,6 +92,7 @@ enum {
 #define SPTE_SPECIAL_MASK (3ULL << 52)
 #define SPTE_AD_ENABLED_MASK (0ULL << 52)
 #define SPTE_AD_DISABLED_MASK (1ULL << 52)
+#define SPTE_AD_WRPROT_ONLY_MASK (2ULL << 52)
 #define SPTE_MMIO_MASK (3ULL << 52)
 
 #define PT64_LEVEL_BITS 9
@@ -328,10 +329,27 @@ static inline bool sp_ad_disabled(struct kvm_mmu_page *sp)
 	return sp->role.ad_disabled;
 }
 
+static inline bool kvm_vcpu_ad_need_write_protect(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * When using the EPT page-modification log, the GPAs in the log
+	 * would come from L2 rather than L1.  Therefore, we need to rely
+	 * on write protection to record dirty pages.  This also bypasses
+	 * PML, since writes now result in a vmexit.
+	 */
+	return vcpu->arch.mmu == &vcpu->arch.guest_mmu;
+}
+
 static inline bool spte_ad_enabled(u64 spte)
 {
 	MMU_WARN_ON(is_mmio_spte(spte));
-	return (spte & SPTE_SPECIAL_MASK) == SPTE_AD_ENABLED_MASK;
+	return (spte & SPTE_SPECIAL_MASK) != SPTE_AD_DISABLED_MASK;
+}
+
+static inline bool spte_ad_need_write_protect(u64 spte)
+{
+	MMU_WARN_ON(is_mmio_spte(spte));
+	return (spte & SPTE_SPECIAL_MASK) != SPTE_AD_ENABLED_MASK;
 }
 
 static inline u64 spte_shadow_accessed_mask(u64 spte)
@@ -1597,16 +1615,16 @@ static bool spte_clear_dirty(u64 *sptep)
 
 	rmap_printk("rmap_clear_dirty: spte %p %llx\n", sptep, *sptep);
 
+	MMU_WARN_ON(!spte_ad_enabled(spte));
 	spte &= ~shadow_dirty_mask;
-
 	return mmu_spte_update(sptep, spte);
 }
 
-static bool wrprot_ad_disabled_spte(u64 *sptep)
+static bool spte_wrprot_for_clear_dirty(u64 *sptep)
 {
 	bool was_writable = test_and_clear_bit(PT_WRITABLE_SHIFT,
 					       (unsigned long *)sptep);
-	if (was_writable)
+	if (was_writable && !spte_ad_enabled(*sptep))
 		kvm_set_pfn_dirty(spte_to_pfn(*sptep));
 
 	return was_writable;
@@ -1625,10 +1643,10 @@ static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head)
 	bool flush = false;
 
 	for_each_rmap_spte(rmap_head, &iter, sptep)
-		if (spte_ad_enabled(*sptep))
-			flush |= spte_clear_dirty(sptep);
+		if (spte_ad_need_write_protect(*sptep))
+			flush |= spte_wrprot_for_clear_dirty(sptep);
 		else
-			flush |= wrprot_ad_disabled_spte(sptep);
+			flush |= spte_clear_dirty(sptep);
 
 	return flush;
 }
@@ -1639,6 +1657,11 @@ static bool spte_set_dirty(u64 *sptep)
 
 	rmap_printk("rmap_set_dirty: spte %p %llx\n", sptep, *sptep);
 
+	/*
+	 * Similar to the !kvm_x86_ops->slot_disable_log_dirty case,
+	 * do not bother adding back write access to pages marked
+	 * SPTE_AD_WRPROT_ONLY_MASK.
+	 */
 	spte |= shadow_dirty_mask;
 
 	return mmu_spte_update(sptep, spte);
@@ -2977,6 +3000,8 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	sp = page_header(__pa(sptep));
 	if (sp_ad_disabled(sp))
 		spte |= SPTE_AD_DISABLED_MASK;
+	else if (kvm_vcpu_ad_need_write_protect(vcpu))
+		spte |= SPTE_AD_WRPROT_ONLY_MASK;
 
 	/*
 	 * For the EPT case, shadow_present_mask is 0 if hardware
-- 
1.8.3.1




* [PATCH v2 3/3] selftests: kvm: add test for dirty logging inside nested guests
  2019-09-27 11:15 [PATCH v2 0/3] KVM: MMU: fix nested guest live migration with PML Paolo Bonzini
  2019-09-27 11:15 ` [PATCH v2 1/3] KVM: x86: assign two bits to track SPTE kinds Paolo Bonzini
  2019-09-27 11:15 ` [PATCH v2 2/3] KVM: x86: fix nested guest live migration with PML Paolo Bonzini
@ 2019-09-27 11:15 ` Paolo Bonzini
  2 siblings, 0 replies; 10+ messages in thread
From: Paolo Bonzini @ 2019-09-27 11:15 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: Vitaly Kuznetsov, Sean Christopherson, Junaid Shahid

Check that accesses by nested guests are logged according to the
L1 physical addresses rather than L2.

Most of the patch is really adding EPT support to the testing
framework.
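
For reference, this is roughly what a test does with the new helpers (taken
from the test added below; NESTED_TEST_MEM1/2 and GUEST_TEST_MEM are the
test's own constants, and the trailing 0 selects the memslot used to
allocate the EPT tables themselves):

	prepare_eptp(vmx, vm, 0);
	/* Identity-map everything in memslot 0 into the nested guest... */
	nested_map_memslot(vmx, vm, 0, 0);
	/* ...and alias two extra L2 GPAs onto the dirty-logged L1 page. */
	nested_map(vmx, vm, NESTED_TEST_MEM1, GUEST_TEST_MEM, 4096, 0);
	nested_map(vmx, vm, NESTED_TEST_MEM2, GUEST_TEST_MEM, 4096, 0);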

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 tools/testing/selftests/kvm/Makefile               |   1 +
 .../selftests/kvm/include/x86_64/processor.h       |   3 +
 tools/testing/selftests/kvm/include/x86_64/vmx.h   |  14 ++
 tools/testing/selftests/kvm/lib/kvm_util.c         |   2 +-
 .../testing/selftests/kvm/lib/kvm_util_internal.h  |   3 +
 tools/testing/selftests/kvm/lib/x86_64/vmx.c       | 201 ++++++++++++++++++++-
 .../selftests/kvm/x86_64/vmx_dirty_log_test.c      | 156 ++++++++++++++++
 7 files changed, 377 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/vmx_dirty_log_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 62c591f87dab..fd84b7a78dcf 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -22,6 +22,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/smm_test
 TEST_GEN_PROGS_x86_64 += x86_64/state_test
 TEST_GEN_PROGS_x86_64 += x86_64/sync_regs_test
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_close_while_nested_test
+TEST_GEN_PROGS_x86_64 += x86_64/vmx_dirty_log_test
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_set_nested_state_test
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_tsc_adjust_test
 TEST_GEN_PROGS_x86_64 += clear_dirty_log_test
diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h
index 0c17f2ee685e..ff234018219c 100644
--- a/tools/testing/selftests/kvm/include/x86_64/processor.h
+++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
@@ -1083,6 +1083,9 @@ void vcpu_set_msr(struct kvm_vm *vm, uint32_t vcpuid, uint64_t msr_index,
 #define VMX_BASIC_MEM_TYPE_WB	6LLU
 #define VMX_BASIC_INOUT		0x0040000000000000LLU
 
+/* VMX_EPT_VPID_CAP bits */
+#define VMX_EPT_VPID_CAP_AD_BITS	(1ULL << 21)
+
 /* MSR_IA32_VMX_MISC bits */
 #define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL << 29)
 #define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE   0x1F
diff --git a/tools/testing/selftests/kvm/include/x86_64/vmx.h b/tools/testing/selftests/kvm/include/x86_64/vmx.h
index 69b17055f63d..6ae5a47fe067 100644
--- a/tools/testing/selftests/kvm/include/x86_64/vmx.h
+++ b/tools/testing/selftests/kvm/include/x86_64/vmx.h
@@ -569,6 +569,10 @@ struct vmx_pages {
 	void *enlightened_vmcs_hva;
 	uint64_t enlightened_vmcs_gpa;
 	void *enlightened_vmcs;
+
+	void *eptp_hva;
+	uint64_t eptp_gpa;
+	void *eptp;
 };
 
 struct vmx_pages *vcpu_alloc_vmx(struct kvm_vm *vm, vm_vaddr_t *p_vmx_gva);
@@ -576,4 +580,14 @@ struct vmx_pages {
 void prepare_vmcs(struct vmx_pages *vmx, void *guest_rip, void *guest_rsp);
 bool load_vmcs(struct vmx_pages *vmx);
 
+void nested_pg_map(struct vmx_pages *vmx, struct kvm_vm *vm,
+		   uint64_t nested_paddr, uint64_t paddr, uint32_t eptp_memslot);
+void nested_map(struct vmx_pages *vmx, struct kvm_vm *vm,
+		 uint64_t nested_paddr, uint64_t paddr, uint64_t size,
+		 uint32_t eptp_memslot);
+void nested_map_memslot(struct vmx_pages *vmx, struct kvm_vm *vm,
+			uint32_t memslot, uint32_t eptp_memslot);
+void prepare_eptp(struct vmx_pages *vmx, struct kvm_vm *vm,
+		  uint32_t eptp_memslot);
+
 #endif /* SELFTEST_KVM_VMX_H */
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 80a338b5403c..41cf45416060 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -705,7 +705,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
  *   on error (e.g. currently no memory region using memslot as a KVM
  *   memory slot ID).
  */
-static struct userspace_mem_region *
+struct userspace_mem_region *
 memslot2region(struct kvm_vm *vm, uint32_t memslot)
 {
 	struct userspace_mem_region *region;
diff --git a/tools/testing/selftests/kvm/lib/kvm_util_internal.h b/tools/testing/selftests/kvm/lib/kvm_util_internal.h
index f36262e0f655..ac50c42750cf 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util_internal.h
+++ b/tools/testing/selftests/kvm/lib/kvm_util_internal.h
@@ -68,4 +68,7 @@ struct kvm_vm {
 void regs_dump(FILE *stream, struct kvm_regs *regs, uint8_t indent);
 void sregs_dump(FILE *stream, struct kvm_sregs *sregs, uint8_t indent);
 
+struct userspace_mem_region *
+memslot2region(struct kvm_vm *vm, uint32_t memslot);
+
 #endif /* SELFTEST_KVM_UTIL_INTERNAL_H */
diff --git a/tools/testing/selftests/kvm/lib/x86_64/vmx.c b/tools/testing/selftests/kvm/lib/x86_64/vmx.c
index 9cef0455b819..fab8f6b0bf52 100644
--- a/tools/testing/selftests/kvm/lib/x86_64/vmx.c
+++ b/tools/testing/selftests/kvm/lib/x86_64/vmx.c
@@ -7,11 +7,39 @@
 
 #include "test_util.h"
 #include "kvm_util.h"
+#include "../kvm_util_internal.h"
 #include "processor.h"
 #include "vmx.h"
 
+#define PAGE_SHIFT_4K  12
+
+#define KVM_EPT_PAGE_TABLE_MIN_PADDR 0x1c0000
+
 bool enable_evmcs;
 
+struct eptPageTableEntry {
+	uint64_t readable:1;
+	uint64_t writable:1;
+	uint64_t executable:1;
+	uint64_t memory_type:3;
+	uint64_t ignore_pat:1;
+	uint64_t page_size:1;
+	uint64_t accessed:1;
+	uint64_t dirty:1;
+	uint64_t ignored_11_10:2;
+	uint64_t address:40;
+	uint64_t ignored_62_52:11;
+	uint64_t suppress_ve:1;
+};
+
+struct eptPageTablePointer {
+	uint64_t memory_type:3;
+	uint64_t page_walk_length:3;
+	uint64_t ad_enabled:1;
+	uint64_t reserved_11_07:5;
+	uint64_t address:40;
+	uint64_t reserved_63_52:12;
+};
 int vcpu_enable_evmcs(struct kvm_vm *vm, int vcpu_id)
 {
 	uint16_t evmcs_ver;
@@ -174,15 +202,35 @@ bool load_vmcs(struct vmx_pages *vmx)
  */
 static inline void init_vmcs_control_fields(struct vmx_pages *vmx)
 {
+	uint32_t sec_exec_ctl = 0;
+
 	vmwrite(VIRTUAL_PROCESSOR_ID, 0);
 	vmwrite(POSTED_INTR_NV, 0);
 
 	vmwrite(PIN_BASED_VM_EXEC_CONTROL, rdmsr(MSR_IA32_VMX_TRUE_PINBASED_CTLS));
-	if (!vmwrite(SECONDARY_VM_EXEC_CONTROL, 0))
+
+	if (vmx->eptp_gpa) {
+		uint64_t ept_paddr;
+		struct eptPageTablePointer eptp = {
+			.memory_type = VMX_BASIC_MEM_TYPE_WB,
+			.page_walk_length = 3, /* + 1 */
+			.ad_enabled = !!(rdmsr(MSR_IA32_VMX_EPT_VPID_CAP) & VMX_EPT_VPID_CAP_AD_BITS),
+			.address = vmx->eptp_gpa >> PAGE_SHIFT_4K,
+		};
+
+		memcpy(&ept_paddr, &eptp, sizeof(ept_paddr));
+		vmwrite(EPT_POINTER, ept_paddr);
+		sec_exec_ctl |= SECONDARY_EXEC_ENABLE_EPT;
+	}
+
+	if (!vmwrite(SECONDARY_VM_EXEC_CONTROL, sec_exec_ctl))
 		vmwrite(CPU_BASED_VM_EXEC_CONTROL,
 			rdmsr(MSR_IA32_VMX_TRUE_PROCBASED_CTLS) | CPU_BASED_ACTIVATE_SECONDARY_CONTROLS);
-	else
+	else {
 		vmwrite(CPU_BASED_VM_EXEC_CONTROL, rdmsr(MSR_IA32_VMX_TRUE_PROCBASED_CTLS));
+		GUEST_ASSERT(!sec_exec_ctl);
+	}
+
 	vmwrite(EXCEPTION_BITMAP, 0);
 	vmwrite(PAGE_FAULT_ERROR_CODE_MASK, 0);
 	vmwrite(PAGE_FAULT_ERROR_CODE_MATCH, -1); /* Never match */
@@ -327,3 +375,152 @@ void prepare_vmcs(struct vmx_pages *vmx, void *guest_rip, void *guest_rsp)
 	init_vmcs_host_state();
 	init_vmcs_guest_state(guest_rip, guest_rsp);
 }
+
+void nested_pg_map(struct vmx_pages *vmx, struct kvm_vm *vm,
+		   uint64_t nested_paddr, uint64_t paddr, uint32_t eptp_memslot)
+{
+	uint16_t index[4];
+	struct eptPageTableEntry *pml4e;
+
+	TEST_ASSERT(vm->mode == VM_MODE_PXXV48_4K, "Attempt to use "
+		    "unknown or unsupported guest mode, mode: 0x%x", vm->mode);
+
+	TEST_ASSERT((nested_paddr % vm->page_size) == 0,
+		    "Nested physical address not on page boundary,\n"
+		    "  nested_paddr: 0x%lx vm->page_size: 0x%x",
+		    nested_paddr, vm->page_size);
+	TEST_ASSERT((nested_paddr >> vm->page_shift) <= vm->max_gfn,
+		    "Nested physical address beyond maximum supported,\n"
+		    "  nested_paddr: 0x%lx vm->max_gfn: 0x%lx vm->page_size: 0x%x",
+		    nested_paddr, vm->max_gfn, vm->page_size);
+	TEST_ASSERT((paddr % vm->page_size) == 0,
+		    "Physical address not on page boundary,\n"
+		    "  paddr: 0x%lx vm->page_size: 0x%x",
+		    paddr, vm->page_size);
+	TEST_ASSERT((paddr >> vm->page_shift) <= vm->max_gfn,
+		    "Physical address beyond maximum supported,\n"
+		    "  paddr: 0x%lx vm->max_gfn: 0x%lx vm->page_size: 0x%x",
+		    paddr, vm->max_gfn, vm->page_size);
+
+	index[0] = (nested_paddr >> 12) & 0x1ffu;
+	index[1] = (nested_paddr >> 21) & 0x1ffu;
+	index[2] = (nested_paddr >> 30) & 0x1ffu;
+	index[3] = (nested_paddr >> 39) & 0x1ffu;
+
+	/* Allocate page directory pointer table if not present. */
+	pml4e = vmx->eptp_hva;
+	if (!pml4e[index[3]].readable) {
+		pml4e[index[3]].address = vm_phy_page_alloc(vm,
+			  KVM_EPT_PAGE_TABLE_MIN_PADDR, eptp_memslot)
+			>> vm->page_shift;
+		pml4e[index[3]].writable = true;
+		pml4e[index[3]].readable = true;
+		pml4e[index[3]].executable = true;
+	}
+
+	/* Allocate page directory table if not present. */
+	struct eptPageTableEntry *pdpe;
+	pdpe = addr_gpa2hva(vm, pml4e[index[3]].address * vm->page_size);
+	if (!pdpe[index[2]].readable) {
+		pdpe[index[2]].address = vm_phy_page_alloc(vm,
+			  KVM_EPT_PAGE_TABLE_MIN_PADDR, eptp_memslot)
+			>> vm->page_shift;
+		pdpe[index[2]].writable = true;
+		pdpe[index[2]].readable = true;
+		pdpe[index[2]].executable = true;
+	}
+
+	/* Allocate page table if not present. */
+	struct eptPageTableEntry *pde;
+	pde = addr_gpa2hva(vm, pdpe[index[2]].address * vm->page_size);
+	if (!pde[index[1]].readable) {
+		pde[index[1]].address = vm_phy_page_alloc(vm,
+			  KVM_EPT_PAGE_TABLE_MIN_PADDR, eptp_memslot)
+			>> vm->page_shift;
+		pde[index[1]].writable = true;
+		pde[index[1]].readable = true;
+		pde[index[1]].executable = true;
+	}
+
+	/* Fill in page table entry. */
+	struct eptPageTableEntry *pte;
+	pte = addr_gpa2hva(vm, pde[index[1]].address * vm->page_size);
+	pte[index[0]].address = paddr >> vm->page_shift;
+	pte[index[0]].writable = true;
+	pte[index[0]].readable = true;
+	pte[index[0]].executable = true;
+
+	/*
+	 * For now mark these as accessed and dirty because the only
+	 * testcase we have needs that.  Can be reconsidered later.
+	 */
+	pte[index[0]].accessed = true;
+	pte[index[0]].dirty = true;
+}
+
+/*
+ * Map a range of EPT guest physical addresses to the VM's physical address
+ *
+ * Input Args:
+ *   vm - Virtual Machine
+ *   nested_paddr - Nested guest physical address to map
+ *   paddr - VM Physical Address
+ *   size - The size of the range to map
+ *   eptp_memslot - Memory region slot for new virtual translation tables
+ *
+ * Output Args: None
+ *
+ * Return: None
+ *
+ * Within the VM given by vm, creates a nested guest translation for the
+ * page range starting at nested_paddr to the page range starting at paddr.
+ */
+void nested_map(struct vmx_pages *vmx, struct kvm_vm *vm,
+		uint64_t nested_paddr, uint64_t paddr, uint64_t size,
+		uint32_t eptp_memslot)
+{
+	size_t page_size = vm->page_size;
+	size_t npages = size / page_size;
+
+	TEST_ASSERT(nested_paddr + size > nested_paddr, "Nested paddr overflow");
+	TEST_ASSERT(paddr + size > paddr, "Paddr overflow");
+
+	while (npages--) {
+		nested_pg_map(vmx, vm, nested_paddr, paddr, eptp_memslot);
+		nested_paddr += page_size;
+		paddr += page_size;
+	}
+}
+
+/* Prepare an identity extended page table that maps all the
+ * physical pages in VM.
+ */
+void nested_map_memslot(struct vmx_pages *vmx, struct kvm_vm *vm,
+			uint32_t memslot, uint32_t eptp_memslot)
+{
+	sparsebit_idx_t i, last;
+	struct userspace_mem_region *region =
+		memslot2region(vm, memslot);
+
+	i = (region->region.guest_phys_addr >> vm->page_shift) - 1;
+	last = i + (region->region.memory_size >> vm->page_shift);
+	for (;;) {
+		i = sparsebit_next_clear(region->unused_phy_pages, i);
+		if (i > last)
+			break;
+
+		nested_map(vmx, vm,
+			   (uint64_t)i << vm->page_shift,
+			   (uint64_t)i << vm->page_shift,
+			   1 << vm->page_shift,
+			   eptp_memslot);
+	}
+}
+
+void prepare_eptp(struct vmx_pages *vmx, struct kvm_vm *vm,
+		  uint32_t eptp_memslot)
+{
+	vmx->eptp = (void *)vm_vaddr_alloc(vm, getpagesize(), 0x10000, 0, 0);
+	vmx->eptp_hva = addr_gva2hva(vm, (uintptr_t)vmx->eptp);
+	vmx->eptp_gpa = addr_gva2gpa(vm, (uintptr_t)vmx->eptp);
+}
diff --git a/tools/testing/selftests/kvm/x86_64/vmx_dirty_log_test.c b/tools/testing/selftests/kvm/x86_64/vmx_dirty_log_test.c
new file mode 100644
index 000000000000..0bca1cfe2c1e
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/vmx_dirty_log_test.c
@@ -0,0 +1,156 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KVM dirty page logging test
+ *
+ * Copyright (C) 2018, Red Hat, Inc.
+ */
+
+#define _GNU_SOURCE /* for program_invocation_name */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <linux/bitmap.h>
+#include <linux/bitops.h>
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+#include "vmx.h"
+
+#define VCPU_ID				1
+
+/* The memory slot index to track dirty pages */
+#define TEST_MEM_SLOT_INDEX		1
+#define TEST_MEM_SIZE			3
+
+/* L1 guest test virtual memory offset */
+#define GUEST_TEST_MEM			0xc0000000
+
+/* L2 guest test virtual memory offset */
+#define NESTED_TEST_MEM1		0xc0001000
+#define NESTED_TEST_MEM2		0xc0002000
+
+static void l2_guest_code(void)
+{
+	*(volatile uint64_t *)NESTED_TEST_MEM1;
+	*(volatile uint64_t *)NESTED_TEST_MEM1 = 1;
+	GUEST_SYNC(true);
+	GUEST_SYNC(false);
+
+	*(volatile uint64_t *)NESTED_TEST_MEM2 = 1;
+	GUEST_SYNC(true);
+	*(volatile uint64_t *)NESTED_TEST_MEM2 = 1;
+	GUEST_SYNC(true);
+	GUEST_SYNC(false);
+
+	/* Exit to L1 and never come back.  */
+	vmcall();
+}
+
+void l1_guest_code(struct vmx_pages *vmx)
+{
+#define L2_GUEST_STACK_SIZE 64
+	unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
+
+	GUEST_ASSERT(vmx->vmcs_gpa);
+	GUEST_ASSERT(prepare_for_vmx_operation(vmx));
+	GUEST_ASSERT(load_vmcs(vmx));
+
+	prepare_vmcs(vmx, l2_guest_code,
+		     &l2_guest_stack[L2_GUEST_STACK_SIZE]);
+
+	GUEST_SYNC(false);
+	GUEST_ASSERT(!vmlaunch());
+	GUEST_SYNC(false);
+	GUEST_ASSERT(vmreadz(VM_EXIT_REASON) == EXIT_REASON_VMCALL);
+	GUEST_DONE();
+}
+
+int main(int argc, char *argv[])
+{
+	vm_vaddr_t vmx_pages_gva = 0;
+	struct vmx_pages *vmx;
+	unsigned long *bmap;
+	uint64_t *host_test_mem;
+
+	struct kvm_vm *vm;
+	struct kvm_run *run;
+	struct ucall uc;
+	bool done = false;
+
+	/* Create VM */
+	vm = vm_create_default(VCPU_ID, 0, l1_guest_code);
+	vcpu_set_cpuid(vm, VCPU_ID, kvm_get_supported_cpuid());
+	vmx = vcpu_alloc_vmx(vm, &vmx_pages_gva);
+	vcpu_args_set(vm, VCPU_ID, 1, vmx_pages_gva);
+	run = vcpu_state(vm, VCPU_ID);
+
+	/* Add an extra memory slot for testing dirty logging */
+	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
+				    GUEST_TEST_MEM,
+				    TEST_MEM_SLOT_INDEX,
+				    TEST_MEM_SIZE,
+				    KVM_MEM_LOG_DIRTY_PAGES);
+
+	/*
+	 * Add an identity map for GVA range [0xc0000000, 0xc0003000).  This
+	 * affects both L1 and L2.  However...
+	 */
+	virt_map(vm, GUEST_TEST_MEM, GUEST_TEST_MEM,
+		 TEST_MEM_SIZE * 4096, 0);
+
+	/*
+	 * ... pages in the L2 GPA range [0xc0001000, 0xc0003000) will map to
+	 * 0xc0000000.
+	 *
+	 * Note that prepare_eptp should be called only after L1's GPA map is
+	 * done, meaning after the last call to virt_map.
+	 */
+	prepare_eptp(vmx, vm, 0);
+	nested_map_memslot(vmx, vm, 0, 0);
+	nested_map(vmx, vm, NESTED_TEST_MEM1, GUEST_TEST_MEM, 4096, 0);
+	nested_map(vmx, vm, NESTED_TEST_MEM2, GUEST_TEST_MEM, 4096, 0);
+
+	bmap = bitmap_alloc(TEST_MEM_SIZE);
+	host_test_mem = addr_gpa2hva(vm, GUEST_TEST_MEM);
+
+	while (!done) {
+		memset(host_test_mem, 0xaa, TEST_MEM_SIZE * 4096);
+		_vcpu_run(vm, VCPU_ID);
+		TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
+			    "Unexpected exit reason: %u (%s),\n",
+			    run->exit_reason,
+			    exit_reason_str(run->exit_reason));
+
+		switch (get_ucall(vm, VCPU_ID, &uc)) {
+		case UCALL_ABORT:
+			TEST_ASSERT(false, "%s at %s:%ld", (const char *)uc.args[0],
+				    __FILE__, uc.args[1]);
+			/* NOT REACHED */
+		case UCALL_SYNC:
+			/*
+			 * The nested guest wrote at offset 0x1000 in the memslot, but the
+			 * dirty bitmap must be filled in according to L1 GPA, not L2.
+			 */
+			kvm_vm_get_dirty_log(vm, TEST_MEM_SLOT_INDEX, bmap);
+			if (uc.args[1]) {
+				TEST_ASSERT(test_bit(0, bmap), "Page 0 incorrectly reported clean\n");
+				TEST_ASSERT(host_test_mem[0] == 1, "Page 0 not written by guest\n");
+			} else {
+				TEST_ASSERT(!test_bit(0, bmap), "Page 0 incorrectly reported dirty\n");
+				TEST_ASSERT(host_test_mem[0] == 0xaaaaaaaaaaaaaaaaULL, "Page 0 written by guest\n");
+			}
+
+			TEST_ASSERT(!test_bit(1, bmap), "Page 1 incorrectly reported dirty\n");
+			TEST_ASSERT(host_test_mem[4096 / 8] == 0xaaaaaaaaaaaaaaaaULL, "Page 1 written by guest\n");
+			TEST_ASSERT(!test_bit(2, bmap), "Page 2 incorrectly reported dirty\n");
+			TEST_ASSERT(host_test_mem[8192 / 8] == 0xaaaaaaaaaaaaaaaaULL, "Page 2 written by guest\n");
+			break;
+		case UCALL_DONE:
+			done = true;
+			break;
+		default:
+			TEST_ASSERT(false, "Unknown ucall 0x%x.", uc.cmd);
+		}
+	}
+}
-- 
1.8.3.1



* Re: [PATCH v2 2/3] KVM: x86: fix nested guest live migration with PML
  2019-09-27 11:15 ` [PATCH v2 2/3] KVM: x86: fix nested guest live migration with PML Paolo Bonzini
@ 2019-09-27 20:24   ` Junaid Shahid
  0 siblings, 0 replies; 10+ messages in thread
From: Junaid Shahid @ 2019-09-27 20:24 UTC (permalink / raw)
  To: Paolo Bonzini, linux-kernel, kvm; +Cc: Vitaly Kuznetsov, Sean Christopherson

On 9/27/19 4:15 AM, Paolo Bonzini wrote:
> Shadow paging is fundamentally incompatible with the page-modification
> log, because the GPAs in the log come from the wrong memory map.
> In particular, for the EPT page-modification log, the GPAs in the log come
> from L2 rather than L1.  (If there was a non-EPT page-modification log,
> we couldn't use it for shadow paging because it would log GVAs rather
> than GPAs).
> 
> Therefore, we need to rely on write protection to record dirty pages.
> This has the side effect of bypassing PML, since writes now result in an
> EPT violation vmexit.
> 
> This is relatively easy to add to KVM, because pretty much the only place
> that needs changing is spte_clear_dirty.  The first access to the page
> already goes through the page fault path and records the correct GPA;
> it's only subsequent accesses that are wrong.  Therefore, we can equip
> set_spte (where the first access happens) to record that the SPTE will
> have to be write protected, and then spte_clear_dirty will use this
> information to do the right thing.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---

Reviewed-by: Junaid Shahid <junaids@google.com>


* Re: [PATCH v2 1/3] KVM: x86: assign two bits to track SPTE kinds
  2019-09-27 11:15 ` [PATCH v2 1/3] KVM: x86: assign two bits to track SPTE kinds Paolo Bonzini
@ 2019-12-11 18:39   ` Ben Gardon
  2019-12-11 19:13     ` Sean Christopherson
  0 siblings, 1 reply; 10+ messages in thread
From: Ben Gardon @ 2019-12-11 18:39 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-kernel, kvm, Vitaly Kuznetsov, Sean Christopherson, Junaid Shahid

Has anyone tested this patch on a long-running machine? It looks like
the SPTE_MMIO_MASK overlaps with the bits used to track MMIO
generation number, which makes me think that on a long running VM, a
high enough generation number would overwrite the SPTE_SPECIAL_MASK
region and cause the MMIO SPTE to be misinterpreted. It seems like
setting bits 52 and 53 would also cause an incorrect generation number
to be read from the PTE.
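
For reference, the definitions in play (a condensed excerpt from mmu.c,
not a diff; the generation layout is the pre-existing one, the MMIO mask
is what this patch adds):

	/*
	 * MMIO generation layout: gen bits 0-8 go to SPTE bits 3-11,
	 * gen bits 9-18 go to SPTE bits 52-61.
	 */
	#define MMIO_SPTE_GEN_LOW_START		3
	#define MMIO_SPTE_GEN_LOW_END		11
	#define MMIO_SPTE_GEN_HIGH_START	52
	#define MMIO_SPTE_GEN_HIGH_END		61

	/*
	 * Added by this patch; kvm_mmu_set_mmio_spte_mask() ORs it into
	 * shadow_mmio_value, so every MMIO SPTE has bits 53:52 set, i.e.
	 * generation bits 10:9 are unconditionally 1 in the SPTE.
	 */
	#define SPTE_MMIO_MASK			(3ULL << 52)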


On Fri, Sep 27, 2019 at 4:16 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> Currently, we are overloading SPTE_SPECIAL_MASK to mean both
> "A/D bits unavailable" and MMIO, where the difference between the
> two is determined by mmio_mask and mmio_value.
>
> However, the next patch will need two bits to distinguish
> availability of A/D bits from write protection.  So, while at
> it give MMIO its own bit pattern, and move the two bits from
> bit 62 to bits 52..53 since Intel is allocating EPT page table
> bits from the top.
>
> Reviewed-by: Junaid Shahid <junaids@google.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  7 -------
>  arch/x86/kvm/mmu.c              | 28 ++++++++++++++++++----------
>  2 files changed, 18 insertions(+), 17 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 23edf56cf577..50eb430b0ad8 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -219,13 +219,6 @@ enum {
>                                  PFERR_WRITE_MASK |             \
>                                  PFERR_PRESENT_MASK)
>
> -/*
> - * The mask used to denote special SPTEs, which can be either MMIO SPTEs or
> - * Access Tracking SPTEs. We use bit 62 instead of bit 63 to avoid conflicting
> - * with the SVE bit in EPT PTEs.
> - */
> -#define SPTE_SPECIAL_MASK (1ULL << 62)
> -
>  /* apic attention bits */
>  #define KVM_APIC_CHECK_VAPIC   0
>  /*
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 5269aa057dfa..bac8d228d82b 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -83,7 +83,16 @@ enum {
>  #define PTE_PREFETCH_NUM               8
>
>  #define PT_FIRST_AVAIL_BITS_SHIFT 10
> -#define PT64_SECOND_AVAIL_BITS_SHIFT 52
> +#define PT64_SECOND_AVAIL_BITS_SHIFT 54
> +
> +/*
> + * The mask used to denote special SPTEs, which can be either MMIO SPTEs or
> + * Access Tracking SPTEs.
> + */
> +#define SPTE_SPECIAL_MASK (3ULL << 52)
> +#define SPTE_AD_ENABLED_MASK (0ULL << 52)
> +#define SPTE_AD_DISABLED_MASK (1ULL << 52)
> +#define SPTE_MMIO_MASK (3ULL << 52)
>
>  #define PT64_LEVEL_BITS 9
>
> @@ -219,12 +228,11 @@ struct kvm_shadow_walk_iterator {
>  static u64 __read_mostly shadow_me_mask;
>
>  /*
> - * SPTEs used by MMUs without A/D bits are marked with shadow_acc_track_value.
> - * Non-present SPTEs with shadow_acc_track_value set are in place for access
> - * tracking.
> + * SPTEs used by MMUs without A/D bits are marked with SPTE_AD_DISABLED_MASK;
> + * shadow_acc_track_mask is the set of bits to be cleared in non-accessed
> + * pages.
>   */
>  static u64 __read_mostly shadow_acc_track_mask;
> -static const u64 shadow_acc_track_value = SPTE_SPECIAL_MASK;
>
>  /*
>   * The mask/shift to use for saving the original R/X bits when marking the PTE
> @@ -304,7 +312,7 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value, u64 access_mask)
>  {
>         BUG_ON((u64)(unsigned)access_mask != access_mask);
>         BUG_ON((mmio_mask & mmio_value) != mmio_value);
> -       shadow_mmio_value = mmio_value | SPTE_SPECIAL_MASK;
> +       shadow_mmio_value = mmio_value | SPTE_MMIO_MASK;
>         shadow_mmio_mask = mmio_mask | SPTE_SPECIAL_MASK;
>         shadow_mmio_access_mask = access_mask;
>  }
> @@ -323,7 +331,7 @@ static inline bool sp_ad_disabled(struct kvm_mmu_page *sp)
>  static inline bool spte_ad_enabled(u64 spte)
>  {
>         MMU_WARN_ON(is_mmio_spte(spte));
> -       return !(spte & shadow_acc_track_value);
> +       return (spte & SPTE_SPECIAL_MASK) == SPTE_AD_ENABLED_MASK;
>  }
>
>  static inline u64 spte_shadow_accessed_mask(u64 spte)
> @@ -461,7 +469,7 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
>  {
>         BUG_ON(!dirty_mask != !accessed_mask);
>         BUG_ON(!accessed_mask && !acc_track_mask);
> -       BUG_ON(acc_track_mask & shadow_acc_track_value);
> +       BUG_ON(acc_track_mask & SPTE_SPECIAL_MASK);
>
>         shadow_user_mask = user_mask;
>         shadow_accessed_mask = accessed_mask;
> @@ -2622,7 +2630,7 @@ static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
>                shadow_user_mask | shadow_x_mask | shadow_me_mask;
>
>         if (sp_ad_disabled(sp))
> -               spte |= shadow_acc_track_value;
> +               spte |= SPTE_AD_DISABLED_MASK;
>         else
>                 spte |= shadow_accessed_mask;
>
> @@ -2968,7 +2976,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>
>         sp = page_header(__pa(sptep));
>         if (sp_ad_disabled(sp))
> -               spte |= shadow_acc_track_value;
> +               spte |= SPTE_AD_DISABLED_MASK;
>
>         /*
>          * For the EPT case, shadow_present_mask is 0 if hardware
> --
> 1.8.3.1
>
>


* Re: [PATCH v2 1/3] KVM: x86: assign two bits to track SPTE kinds
  2019-12-11 18:39   ` Ben Gardon
@ 2019-12-11 19:13     ` Sean Christopherson
  2019-12-11 23:28       ` Paolo Bonzini
  0 siblings, 1 reply; 10+ messages in thread
From: Sean Christopherson @ 2019-12-11 19:13 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Paolo Bonzini, linux-kernel, kvm, Vitaly Kuznetsov, Junaid Shahid

On Wed, Dec 11, 2019 at 10:39:22AM -0800, Ben Gardon wrote:
> Has anyone tested this patch on a long-running machine? It looks like
> the SPTE_MMIO_MASK overlaps with the bits used to track MMIO
> generation number, which makes me think that on a long running VM, a
> high enough generation number would overwrite the SPTE_SPECIAL_MASK
> region and cause the MMIO SPTE to be misinterpreted. It seems like
> setting bits 52 and 53 would also cause an incorrect generation number
> to be read from the PTE.

Hmm, the MMIO SPTE won't be misinterpreted as MMIO SPTEs will always have
bits 53:52=11b, i.e. bits 10:9 of the generation number effectively get
ignored.  It does mean that check_mmio_spte() could theoretically consume
stale MMIO entries due to those bits being ignored, but only if a SPTE lived
across 512 memslot updates, which is impossible given that KVM currently
zaps all SPTEs on a memslot update (fast zapping can let a stale SPTE live
while the memslot update is in progress, but not longer than that).

There is/was the pass-through issue that cropped up when I tried to stop
zapping all SPTEs on a memslot update, but I doubt this would explain that
problem.

Assuming we haven't missed something, the easiest fix would be to reduce
the MMIO generation by one bit and use bits 62:54 for the MMIO generation.

> On Fri, Sep 27, 2019 at 4:16 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > Currently, we are overloading SPTE_SPECIAL_MASK to mean both
> > "A/D bits unavailable" and MMIO, where the difference between the
> > two is determined by mmio_mask and mmio_value.
> >
> > However, the next patch will need two bits to distinguish
> > availability of A/D bits from write protection.  So, while at
> > it give MMIO its own bit pattern, and move the two bits from
> > bit 62 to bits 52..53 since Intel is allocating EPT page table
> > bits from the top.
> >
> > Reviewed-by: Junaid Shahid <junaids@google.com>
> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  7 -------
> >  arch/x86/kvm/mmu.c              | 28 ++++++++++++++++++----------
> >  2 files changed, 18 insertions(+), 17 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 23edf56cf577..50eb430b0ad8 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -219,13 +219,6 @@ enum {
> >                                  PFERR_WRITE_MASK |             \
> >                                  PFERR_PRESENT_MASK)
> >
> > -/*
> > - * The mask used to denote special SPTEs, which can be either MMIO SPTEs or
> > - * Access Tracking SPTEs. We use bit 62 instead of bit 63 to avoid conflicting
> > - * with the SVE bit in EPT PTEs.
> > - */
> > -#define SPTE_SPECIAL_MASK (1ULL << 62)
> > -
> >  /* apic attention bits */
> >  #define KVM_APIC_CHECK_VAPIC   0
> >  /*
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index 5269aa057dfa..bac8d228d82b 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -83,7 +83,16 @@ enum {
> >  #define PTE_PREFETCH_NUM               8
> >
> >  #define PT_FIRST_AVAIL_BITS_SHIFT 10
> > -#define PT64_SECOND_AVAIL_BITS_SHIFT 52
> > +#define PT64_SECOND_AVAIL_BITS_SHIFT 54
> > +
> > +/*
> > + * The mask used to denote special SPTEs, which can be either MMIO SPTEs or
> > + * Access Tracking SPTEs.
> > + */
> > +#define SPTE_SPECIAL_MASK (3ULL << 52)
> > +#define SPTE_AD_ENABLED_MASK (0ULL << 52)
> > +#define SPTE_AD_DISABLED_MASK (1ULL << 52)
> > +#define SPTE_MMIO_MASK (3ULL << 52)


* Re: [PATCH v2 1/3] KVM: x86: assign two bits to track SPTE kinds
  2019-12-11 19:13     ` Sean Christopherson
@ 2019-12-11 23:28       ` Paolo Bonzini
  2019-12-12  0:29         ` Sean Christopherson
  0 siblings, 1 reply; 10+ messages in thread
From: Paolo Bonzini @ 2019-12-11 23:28 UTC (permalink / raw)
  To: Sean Christopherson, Ben Gardon
  Cc: linux-kernel, kvm, Vitaly Kuznetsov, Junaid Shahid

On 11/12/19 20:13, Sean Christopherson wrote:
> Assuming we haven't missed something, the easiest fix would be to reduce
> the MMIO generation by one bit and use bits 62:54 for the MMIO generation.

Yes, and I mistakenly thought it would be done just by adjusting 
PT64_SECOND_AVAIL_BITS_SHIFT.

I will test and send formally something like this:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6f92b40d798c..aa2d86f42b9a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -405,11 +405,13 @@ static inline bool is_access_track_spte(u64 spte)
 }
 
 /*
- * Due to limited space in PTEs, the MMIO generation is a 19 bit subset of
+ * Due to limited space in PTEs, the MMIO generation is an 18 bit subset of
  * the memslots generation and is derived as follows:
  *
  * Bits 0-8 of the MMIO generation are propagated to spte bits 3-11
- * Bits 9-18 of the MMIO generation are propagated to spte bits 52-61
+ * Bits 9-17 of the MMIO generation are propagated to spte bits 54-62
  *
+ * We don't use bit 63 to avoid conflicting with the SVE bit in EPT PTEs.
+ *
  * The KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS flag is intentionally not included in
  * the MMIO generation number, as doing so would require stealing a bit from
@@ -418,15 +418,16 @@ static inline bool is_access_track_spte(u64 spte)
  * requires a full MMU zap).  The flag is instead explicitly queried when
  * checking for MMIO spte cache hits.
  */
-#define MMIO_SPTE_GEN_MASK		GENMASK_ULL(18, 0)
+#define MMIO_SPTE_GEN_MASK		GENMASK_ULL(17, 0)
 
 #define MMIO_SPTE_GEN_LOW_START		3
 #define MMIO_SPTE_GEN_LOW_END		11
 #define MMIO_SPTE_GEN_LOW_MASK		GENMASK_ULL(MMIO_SPTE_GEN_LOW_END, \
 						    MMIO_SPTE_GEN_LOW_START)
 
-#define MMIO_SPTE_GEN_HIGH_START	52
-#define MMIO_SPTE_GEN_HIGH_END		61
+/* Leave room for SPTE_SPECIAL_MASK.  */
+#define MMIO_SPTE_GEN_HIGH_START	PT64_SECOND_AVAIL_BITS_SHIFT
+#define MMIO_SPTE_GEN_HIGH_END		62
 #define MMIO_SPTE_GEN_HIGH_MASK		GENMASK_ULL(MMIO_SPTE_GEN_HIGH_END, \
 						    MMIO_SPTE_GEN_HIGH_START)
 static u64 generation_mmio_spte_mask(u64 gen)


Paolo



* Re: [PATCH v2 1/3] KVM: x86: assign two bits to track SPTE kinds
  2019-12-11 23:28       ` Paolo Bonzini
@ 2019-12-12  0:29         ` Sean Christopherson
  2019-12-12  0:33           ` Paolo Bonzini
  0 siblings, 1 reply; 10+ messages in thread
From: Sean Christopherson @ 2019-12-12  0:29 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Ben Gardon, linux-kernel, kvm, Vitaly Kuznetsov, Junaid Shahid

On Thu, Dec 12, 2019 at 12:28:27AM +0100, Paolo Bonzini wrote:
> On 11/12/19 20:13, Sean Christopherson wrote:
> > Assuming we haven't missed something, the easiest fix would be to reduce
> > the MMIO generation by one bit and use bits 62:54 for the MMIO generation.
> 
> Yes, and I mistakenly thought it would be done just by adjusting 
> PT64_SECOND_AVAIL_BITS_SHIFT.
> 
> I will test and send formally something like this:
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6f92b40d798c..aa2d86f42b9a 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -405,11 +405,13 @@ static inline bool is_access_track_spte(u64 spte)
>  }
>  
>  /*
> - * Due to limited space in PTEs, the MMIO generation is a 19 bit subset of
> + * Due to limited space in PTEs, the MMIO generation is an 18 bit subset of
>   * the memslots generation and is derived as follows:
>   *
>   * Bits 0-8 of the MMIO generation are propagated to spte bits 3-11
> - * Bits 9-18 of the MMIO generation are propagated to spte bits 52-61
> + * Bits 9-17 of the MMIO generation are propagated to spte bits 54-62
>   *
> + * We don't use bit 63 to avoid conflicting with the SVE bit in EPT PTEs.
> + *
>   * The KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS flag is intentionally not included in
>   * the MMIO generation number, as doing so would require stealing a bit from
> @@ -418,15 +418,16 @@ static inline bool is_access_track_spte(u64 spte)
>   * requires a full MMU zap).  The flag is instead explicitly queried when
>   * checking for MMIO spte cache hits.
>   */
> -#define MMIO_SPTE_GEN_MASK		GENMASK_ULL(18, 0)
> +#define MMIO_SPTE_GEN_MASK		GENMASK_ULL(17, 0)
>  
>  #define MMIO_SPTE_GEN_LOW_START		3
>  #define MMIO_SPTE_GEN_LOW_END		11
>  #define MMIO_SPTE_GEN_LOW_MASK		GENMASK_ULL(MMIO_SPTE_GEN_LOW_END, \
>  						    MMIO_SPTE_GEN_LOW_START)
>  
> -#define MMIO_SPTE_GEN_HIGH_START	52
> -#define MMIO_SPTE_GEN_HIGH_END		61
> +/* Leave room for SPTE_SPECIAL_MASK.  */
> +#define MMIO_SPTE_GEN_HIGH_START	PT64_SECOND_AVAIL_BITS_SHIFT

I'd rather have GEN_HIGH_START be an explicit bit number and then add
a BUILD_BUG_ON(GEN_HIGH_START < PT64_SECOND_AVAIL_BITS_SHIFT) to ensure
the MMIO gen doesn't overlap other stuff.  That way we get a build error
if someone changes PT64_SECOND_AVAIL_BITS_SHIFT, otherwise the MMIO gen
will end up who knows where and probably overwrite NX or EPT.SUPPRESS_VE.
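
Something along these lines (just a sketch; 54 assumes
PT64_SECOND_AVAIL_BITS_SHIFT stays at 54, and mmio_spte_gen_sanity_check()
is a hypothetical helper that would be called from kvm_mmu_module_init()
or similar):

	#define MMIO_SPTE_GEN_HIGH_START	54
	#define MMIO_SPTE_GEN_HIGH_END		62

	/* Hypothetical helper, for illustration only. */
	static void mmio_spte_gen_sanity_check(void)
	{
		/* Fail the build if the MMIO generation overlaps other SPTE bits. */
		BUILD_BUG_ON(MMIO_SPTE_GEN_HIGH_START < PT64_SECOND_AVAIL_BITS_SHIFT);
	}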

> +#define MMIO_SPTE_GEN_HIGH_END		62
>  #define MMIO_SPTE_GEN_HIGH_MASK		GENMASK_ULL(MMIO_SPTE_GEN_HIGH_END, \
>  						    MMIO_SPTE_GEN_HIGH_START)
>  static u64 generation_mmio_spte_mask(u64 gen)
> 
> 
> Paolo
> 


* Re: [PATCH v2 1/3] KVM: x86: assign two bits to track SPTE kinds
  2019-12-12  0:29         ` Sean Christopherson
@ 2019-12-12  0:33           ` Paolo Bonzini
  0 siblings, 0 replies; 10+ messages in thread
From: Paolo Bonzini @ 2019-12-12  0:33 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ben Gardon, linux-kernel, kvm, Vitaly Kuznetsov, Junaid Shahid

On 12/12/19 01:29, Sean Christopherson wrote:
>>   */
>> -#define MMIO_SPTE_GEN_MASK		GENMASK_ULL(18, 0)
>> +#define MMIO_SPTE_GEN_MASK		GENMASK_ULL(17, 0)
>>  
>>  #define MMIO_SPTE_GEN_LOW_START		3
>>  #define MMIO_SPTE_GEN_LOW_END		11
>>  #define MMIO_SPTE_GEN_LOW_MASK		GENMASK_ULL(MMIO_SPTE_GEN_LOW_END, \
>>  						    MMIO_SPTE_GEN_LOW_START)
>>  
>> -#define MMIO_SPTE_GEN_HIGH_START	52
>> -#define MMIO_SPTE_GEN_HIGH_END		61
>> +/* Leave room for SPTE_SPECIAL_MASK.  */
>> +#define MMIO_SPTE_GEN_HIGH_START	PT64_SECOND_AVAIL_BITS_SHIFT
> I'd rather have GEN_HIGH_START be an explicit bit number and then add
> a BUILD_BUG_ON(GEN_HIGH_START < PT64_SECOND_AVAIL_BITS_SHIFT) to ensure
> the MMIO gen doesn't overlap other stuff.  That way we get a build error
> if someone changes PT64_SECOND_AVAIL_BITS_SHIFT, otherwise the MMIO gen
> will end up who knows where and probably overwrite NX or EPT.SUPPRESS_VE.
> 

Fair enough.  While at it I'll also add MMIO_SPTE_GEN_BITS (
	MMIO_SPTE_GEN_LOW_END - MMIO_SPTE_GEN_LOW_START + 1
	+ MMIO_SPTE_GEN_HIGH_END - MMIO_SPTE_GEN_HIGH_START + 1) and use it in
MMIO_SPTE_GEN_MASK.
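
i.e. something like (just a sketch):

	#define MMIO_SPTE_GEN_BITS	(MMIO_SPTE_GEN_LOW_END - MMIO_SPTE_GEN_LOW_START + 1 + \
					 MMIO_SPTE_GEN_HIGH_END - MMIO_SPTE_GEN_HIGH_START + 1)
	#define MMIO_SPTE_GEN_MASK	GENMASK_ULL(MMIO_SPTE_GEN_BITS - 1, 0)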

Paolo

