kvm.vger.kernel.org archive mirror
* [PATCH 0/6] Introduce KVM_{GET|SET}_SREGS2 and fix PDPTR migration
@ 2021-04-01 14:18 Maxim Levitsky
  2021-04-01 14:18 ` [PATCH 1/6] KVM: nVMX: delay loading of PDPTRs to KVM_REQ_GET_NESTED_STATE_PAGES Maxim Levitsky
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Maxim Levitsky @ 2021-04-01 14:18 UTC (permalink / raw)
  To: kvm
  Cc: maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Vitaly Kuznetsov, Wanpeng Li, Sean Christopherson,
	Joerg Roedel, H. Peter Anvin,
	open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Paolo Bonzini, Thomas Gleixner, Jonathan Corbet, Borislav Petkov,
	Ingo Molnar, open list:DOCUMENTATION, Maxim Levitsky

This patch set aims to fix a few flaws that were discovered
in KVM_{GET|SET}_SREGS on x86:

* There is no support for reading/writing the PDPTRs, although
  these are considered part of the guest state.

* The interrupt bitmap is obsolete and no longer needed.

* There is no support for future extensions (via flags and such).

The final two patches in this series allow the PDPTRs to be
migrated correctly when the new API is used.
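
For illustration, the intended userspace flow is roughly the
following (a sketch only; the fd names are placeholders and error
handling is omitted):

	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	struct kvm_sregs2 sregs2;

	/* Source: capture the special registers, PDPTRs included. */
	ioctl(src_vcpu_fd, KVM_GET_SREGS2, &sregs2);

	/* ... transfer sregs2 as part of the migration stream ... */

	/* Destination: restore it verbatim, instead of letting KVM
	 * re-derive the PDPTRs from guest memory. */
	ioctl(dst_vcpu_fd, KVM_SET_SREGS2, &sregs2);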

This patch series was tested by doing a nested migration test
of a 32 bit PAE L1 + 32 bit PAE L2 on AMD and Intel, and by a
nested migration test of a 64 bit L1 + 32 bit PAE L2 on AMD.
The latter test currently fails on Intel (regardless of my patches).

Finally, patch 2 in this series fixes a rare L0 kernel oops,
which I can trigger by migrating a Hyper-V machine.

Best regards,
	Maxim Levitsky

Maxim Levitsky (6):
  KVM: nVMX: delay loading of PDPTRs to KVM_REQ_GET_NESTED_STATE_PAGES
  KVM: nSVM: call nested_svm_load_cr3 on nested state load
  KVM: x86: introduce kvm_register_clear_available
  KVM: x86: Introduce KVM_GET_SREGS2 / KVM_SET_SREGS2
  KVM: nSVM: avoid loading PDPTRs after migration when possible
  KVM: nVMX: avoid loading PDPTRs after migration when possible

 Documentation/virt/kvm/api.rst  |  43 ++++++++++
 arch/x86/include/asm/kvm_host.h |   7 ++
 arch/x86/include/uapi/asm/kvm.h |  13 +++
 arch/x86/kvm/kvm_cache_regs.h   |  12 +++
 arch/x86/kvm/svm/nested.c       |  55 ++++++++-----
 arch/x86/kvm/svm/svm.c          |   6 +-
 arch/x86/kvm/vmx/nested.c       |  26 ++++--
 arch/x86/kvm/x86.c              | 136 ++++++++++++++++++++++++++------
 include/uapi/linux/kvm.h        |   5 ++
 9 files changed, 249 insertions(+), 54 deletions(-)

-- 
2.26.2



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/6] KVM: nVMX: delay loading of PDPTRs to KVM_REQ_GET_NESTED_STATE_PAGES
  2021-04-01 14:18 [PATCH 0/6] Introduce KVM_{GET|SET}_SREGS2 and fix PDPTR migration Maxim Levitsky
@ 2021-04-01 14:18 ` Maxim Levitsky
  2021-04-01 14:31   ` Paolo Bonzini
  2021-04-02 17:27   ` Sean Christopherson
  2021-04-01 14:18 ` [PATCH 2/6] KVM: nSVM: call nested_svm_load_cr3 on nested state load Maxim Levitsky
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 18+ messages in thread
From: Maxim Levitsky @ 2021-04-01 14:18 UTC (permalink / raw)
  To: kvm
  Cc: maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Vitaly Kuznetsov, Wanpeng Li, Sean Christopherson,
	Joerg Roedel, H. Peter Anvin,
	open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Paolo Bonzini, Thomas Gleixner, Jonathan Corbet, Borislav Petkov,
	Ingo Molnar, open list:DOCUMENTATION, Maxim Levitsky

Similar to the rest of the guest page accesses after migration,
loading of the PDPTRs should be delayed to the
KVM_REQ_GET_NESTED_STATE_PAGES request.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/vmx/nested.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index fd334e4aa6db..b44f1f6b68db 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2564,11 +2564,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 		return -EINVAL;
 	}
 
-	/* Shadow page tables on either EPT or shadow page tables. */
-	if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3, nested_cpu_has_ept(vmcs12),
-				entry_failure_code))
-		return -EINVAL;
-
 	/*
 	 * Immediately write vmcs02.GUEST_CR3.  It will be propagated to vmcs12
 	 * on nested VM-Exit, which can occur without actually running L2 and
@@ -3109,11 +3104,16 @@ static bool nested_get_evmcs_page(struct kvm_vcpu *vcpu)
 static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
 {
 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+	enum vm_entry_failure_code entry_failure_code;
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	struct kvm_host_map *map;
 	struct page *page;
 	u64 hpa;
 
+	if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3, nested_cpu_has_ept(vmcs12),
+				&entry_failure_code))
+		return false;
+
 	if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) {
 		/*
 		 * Translate L1 physical address to host physical
@@ -3357,6 +3357,10 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
 	}
 
 	if (from_vmentry) {
+		if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3,
+		    nested_cpu_has_ept(vmcs12), &entry_failure_code))
+			goto vmentry_fail_vmexit_guest_mode;
+
 		failed_index = nested_vmx_load_msr(vcpu,
 						   vmcs12->vm_entry_msr_load_addr,
 						   vmcs12->vm_entry_msr_load_count);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/6] KVM: nSVM: call nested_svm_load_cr3 on nested state load
  2021-04-01 14:18 [PATCH 0/6] Introduce KVM_{GET|SET}_SREGS2 and fix PDPTR migration Maxim Levitsky
  2021-04-01 14:18 ` [PATCH 1/6] KVM: nVMX: delay loading of PDPTRs to KVM_REQ_GET_NESTED_STATE_PAGES Maxim Levitsky
@ 2021-04-01 14:18 ` Maxim Levitsky
  2021-04-01 14:31   ` Paolo Bonzini
  2021-04-01 14:18 ` [PATCH 3/6] KVM: x86: introduce kvm_register_clear_available Maxim Levitsky
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Maxim Levitsky @ 2021-04-01 14:18 UTC (permalink / raw)
  To: kvm
  Cc: maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Vitaly Kuznetsov, Wanpeng Li, Sean Christopherson,
	Joerg Roedel, H. Peter Anvin,
	open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Paolo Bonzini, Thomas Gleixner, Jonathan Corbet, Borislav Petkov,
	Ingo Molnar, open list:DOCUMENTATION, Maxim Levitsky

While KVM's MMU should be fully reset when the nested CR0/CR3/CR4 are
loaded by KVM_SET_SREGS, we are not yet in nested mode at that point,
and therefore only root_mmu is reset.

On regular nested entries we call nested_svm_load_cr3, which both
updates the guest's CR3 in the MMU when needed, and also initializes
the MMU again, which initializes the walk_mmu as well when nested
paging is enabled in both host and guest.

Since we don't call nested_svm_load_cr3 on nested state load, the
walk_mmu can be left uninitialized. If we happen to get a nested page
fault right after entering the nested guest for the first time after
the migration, and we decide to emulate it, the emulator tries to
access walk_mmu->gva_to_gpa, which is NULL, leading to a NULL pointer
dereference.

Therefore we should call this function on nested state load as well.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/svm/nested.c | 40 +++++++++++++++++++++------------------
 1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 8523f60adb92..ac5e3e17bda4 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -215,24 +215,6 @@ static bool nested_svm_vmrun_msrpm(struct vcpu_svm *svm)
 	return true;
 }
 
-static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
-{
-	struct vcpu_svm *svm = to_svm(vcpu);
-
-	if (WARN_ON(!is_guest_mode(vcpu)))
-		return true;
-
-	if (!nested_svm_vmrun_msrpm(svm)) {
-		vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
-		vcpu->run->internal.suberror =
-			KVM_INTERNAL_ERROR_EMULATION;
-		vcpu->run->internal.ndata = 0;
-		return false;
-	}
-
-	return true;
-}
-
 static bool nested_vmcb_check_controls(struct vmcb_control_area *control)
 {
 	if (CC(!vmcb_is_intercept(control, INTERCEPT_VMRUN)))
@@ -1312,6 +1294,28 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
 	return ret;
 }
 
+static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	if (WARN_ON(!is_guest_mode(vcpu)))
+		return true;
+
+	if (nested_svm_load_cr3(&svm->vcpu, vcpu->arch.cr3,
+				nested_npt_enabled(svm)))
+		return false;
+
+	if (!nested_svm_vmrun_msrpm(svm)) {
+		vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
+		vcpu->run->internal.suberror =
+			KVM_INTERNAL_ERROR_EMULATION;
+		vcpu->run->internal.ndata = 0;
+		return false;
+	}
+
+	return true;
+}
+
 struct kvm_x86_nested_ops svm_nested_ops = {
 	.check_events = svm_check_nested_events,
 	.triple_fault = nested_svm_triple_fault,
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 3/6] KVM: x86: introduce kvm_register_clear_available
  2021-04-01 14:18 [PATCH 0/6] Introduce KVM_{GET|SET}_SREGS2 and fix PDPTR migration Maxim Levitsky
  2021-04-01 14:18 ` [PATCH 1/6] KVM: nVMX: delay loading of PDPTRs to KVM_REQ_GET_NESTED_STATE_PAGES Maxim Levitsky
  2021-04-01 14:18 ` [PATCH 2/6] KVM: nSVM: call nested_svm_load_cr3 on nested state load Maxim Levitsky
@ 2021-04-01 14:18 ` Maxim Levitsky
  2021-04-01 14:31   ` Paolo Bonzini
  2021-04-05 17:03   ` Sean Christopherson
  2021-04-01 14:18 ` [PATCH 4/6] KVM: x86: Introduce KVM_GET_SREGS2 / KVM_SET_SREGS2 Maxim Levitsky
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 18+ messages in thread
From: Maxim Levitsky @ 2021-04-01 14:18 UTC (permalink / raw)
  To: kvm
  Cc: maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Vitaly Kuznetsov, Wanpeng Li, Sean Christopherson,
	Joerg Roedel, H. Peter Anvin,
	open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Paolo Bonzini, Thomas Gleixner, Jonathan Corbet, Borislav Petkov,
	Ingo Molnar, open list:DOCUMENTATION, Maxim Levitsky

Small refactoring that will be used in the next patch.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/kvm_cache_regs.h | 7 +++++++
 arch/x86/kvm/svm/svm.c        | 6 ++----
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 2e11da2f5621..07d607947805 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -55,6 +55,13 @@ static inline void kvm_register_mark_available(struct kvm_vcpu *vcpu,
 	__set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
 }
 
+static inline void kvm_register_clear_available(struct kvm_vcpu *vcpu,
+					       enum kvm_reg reg)
+{
+	__clear_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
+	__clear_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty);
+}
+
 static inline void kvm_register_mark_dirty(struct kvm_vcpu *vcpu,
 					   enum kvm_reg reg)
 {
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 271196400495..2843732299a2 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3880,10 +3880,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
 		vcpu->arch.apf.host_apf_flags =
 			kvm_read_and_reset_apf_flags();
 
-	if (npt_enabled) {
-		vcpu->arch.regs_avail &= ~(1 << VCPU_EXREG_PDPTR);
-		vcpu->arch.regs_dirty &= ~(1 << VCPU_EXREG_PDPTR);
-	}
+	if (npt_enabled)
+		kvm_register_clear_available(vcpu, VCPU_EXREG_PDPTR);
 
 	/*
 	 * We need to handle MC intercepts here before the vcpu has a chance to
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 4/6] KVM: x86: Introduce KVM_GET_SREGS2 / KVM_SET_SREGS2
  2021-04-01 14:18 [PATCH 0/6] Introduce KVM_{GET|SET}_SREGS2 and fix PDPTR migration Maxim Levitsky
                   ` (2 preceding siblings ...)
  2021-04-01 14:18 ` [PATCH 3/6] KVM: x86: introduce kvm_register_clear_available Maxim Levitsky
@ 2021-04-01 14:18 ` Maxim Levitsky
  2021-04-01 14:44   ` Paolo Bonzini
  2021-04-01 14:18 ` [PATCH 5/6] KVM: nSVM: avoid loading PDPTRs after migration when possible Maxim Levitsky
  2021-04-01 14:18 ` [PATCH 6/6] KVM: nVMX: " Maxim Levitsky
  5 siblings, 1 reply; 18+ messages in thread
From: Maxim Levitsky @ 2021-04-01 14:18 UTC (permalink / raw)
  To: kvm
  Cc: maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Vitaly Kuznetsov, Wanpeng Li, Sean Christopherson,
	Joerg Roedel, H. Peter Anvin,
	open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Paolo Bonzini, Thomas Gleixner, Jonathan Corbet, Borislav Petkov,
	Ingo Molnar, open list:DOCUMENTATION, Maxim Levitsky

This is a new version of the KVM_GET_SREGS / KVM_SET_SREGS ioctls,
aiming to replace them.

It has the following changes:
   * Has flags for future extensions.
   * Has the vcpu's PDPTRs, which allows saving/restoring them on
     migration.
   * Lacks the obsolete interrupt bitmap (now handled via
     KVM_SET_VCPU_EVENTS).

A new capability, KVM_CAP_SREGS2, is added to advertise
these ioctls to userspace.

Currently only implemented on x86.
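
For reference, a sketch of how userspace is expected to probe for and
use the new ioctl (kvm_fd/vcpu_fd are placeholder names, and error
checking is omitted):

	struct kvm_sregs sregs;
	struct kvm_sregs2 sregs2;

	/* KVM_CHECK_EXTENSION returns > 0 if the capability exists. */
	if (ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SREGS2) > 0)
		ioctl(vcpu_fd, KVM_GET_SREGS2, &sregs2);
	else
		ioctl(vcpu_fd, KVM_GET_SREGS, &sregs);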

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 Documentation/virt/kvm/api.rst  |  43 ++++++++++
 arch/x86/include/asm/kvm_host.h |   7 ++
 arch/x86/include/uapi/asm/kvm.h |  13 +++
 arch/x86/kvm/kvm_cache_regs.h   |   5 ++
 arch/x86/kvm/x86.c              | 136 ++++++++++++++++++++++++++------
 include/uapi/linux/kvm.h        |   5 ++
 6 files changed, 185 insertions(+), 24 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 38e327d4b479..b006d5b5f554 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4941,6 +4941,49 @@ see KVM_XEN_VCPU_SET_ATTR above.
 The KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST type may not be used
 with the KVM_XEN_VCPU_GET_ATTR ioctl.
 
+
+4.131 KVM_GET_SREGS2
+--------------------
+
+:Capability: KVM_CAP_SREGS2
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_sregs2 (out)
+:Returns: 0 on success, -1 on error
+
+Reads special registers from the vcpu.
+This ioctl is preferred over KVM_GET_SREGS when available.
+
+::
+
+  struct kvm_sregs2 {
+	/* out (KVM_GET_SREGS2) / in (KVM_SET_SREGS2) */
+	struct kvm_segment cs, ds, es, fs, gs, ss;
+	struct kvm_segment tr, ldt;
+	struct kvm_dtable gdt, idt;
+	__u64 cr0, cr2, cr3, cr4, cr8;
+	__u64 efer;
+	__u64 apic_base;
+	__u64 flags; /* must be zero */
+	__u64 pdptrs[4];
+	__u64 padding;
+  };
+
+
+4.132 KVM_SET_SREGS2
+--------------------
+
+:Capability: KVM_CAP_SREGS2
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_sregs2 (in)
+:Returns: 0 on success, -1 on error
+
+Writes special registers into the vcpu.
+See KVM_GET_SREGS2 for the data structures.
+This ioctl is preferred over KVM_SET_SREGS when available.
+
+
 5. The kvm_run structure
 ========================
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a52f973bdff6..87b680d111f9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -838,6 +838,13 @@ struct kvm_vcpu_arch {
 
 	/* Protected Guests */
 	bool guest_state_protected;
+
+	/*
+	 * Do we need to reload the pdptrs when entering nested state?
+	 * Set after nested migration if userspace didn't use the
+	 * newer KVM_SET_SREGS2 ioctl to load pdptrs from the migration state.
+	 */
+	bool reload_pdptrs_on_nested_entry;
 };
 
 struct kvm_lpage_info {
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 5a3022c8af82..201a85884c81 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -159,6 +159,19 @@ struct kvm_sregs {
 	__u64 interrupt_bitmap[(KVM_NR_INTERRUPTS + 63) / 64];
 };
 
+struct kvm_sregs2 {
+	/* out (KVM_GET_SREGS2) / in (KVM_SET_SREGS2) */
+	struct kvm_segment cs, ds, es, fs, gs, ss;
+	struct kvm_segment tr, ldt;
+	struct kvm_dtable gdt, idt;
+	__u64 cr0, cr2, cr3, cr4, cr8;
+	__u64 efer;
+	__u64 apic_base;
+	__u64 flags; /* must be zero */
+	__u64 pdptrs[4];
+	__u64 padding;
+};
+
 /* for KVM_GET_FPU and KVM_SET_FPU */
 struct kvm_fpu {
 	__u8  fpr[8][16];
diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 07d607947805..1a6e2de4248a 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -120,6 +120,11 @@ static inline u64 kvm_pdptr_read(struct kvm_vcpu *vcpu, int index)
 	return vcpu->arch.walk_mmu->pdptrs[index];
 }
 
+static inline void kvm_pdptr_write(struct kvm_vcpu *vcpu, int index, u64 value)
+{
+	vcpu->arch.walk_mmu->pdptrs[index] = value;
+}
+
 static inline ulong kvm_read_cr0_bits(struct kvm_vcpu *vcpu, ulong mask)
 {
 	ulong tmask = mask & KVM_POSSIBLE_CR0_GUEST_BITS;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a9d95f90a048..f10a37f88c30 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -112,6 +112,9 @@ static void __kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
 static void store_regs(struct kvm_vcpu *vcpu);
 static int sync_regs(struct kvm_vcpu *vcpu);
 
+static int __set_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
+static void __get_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
+
 struct kvm_x86_ops kvm_x86_ops __read_mostly;
 EXPORT_SYMBOL_GPL(kvm_x86_ops);
 
@@ -3796,6 +3799,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_X86_USER_SPACE_MSR:
 	case KVM_CAP_X86_MSR_FILTER:
 	case KVM_CAP_ENFORCE_PV_FEATURE_CPUID:
+	case KVM_CAP_SREGS2:
 		r = 1;
 		break;
 #ifdef CONFIG_KVM_XEN
@@ -4713,6 +4717,7 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 	void __user *argp = (void __user *)arg;
 	int r;
 	union {
+		struct kvm_sregs2 *sregs2;
 		struct kvm_lapic_state *lapic;
 		struct kvm_xsave *xsave;
 		struct kvm_xcrs *xcrs;
@@ -5085,6 +5090,28 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 		break;
 	}
 #endif
+	case KVM_GET_SREGS2: {
+		u.sregs2 = kzalloc(sizeof(struct kvm_sregs2), GFP_KERNEL_ACCOUNT);
+		r = -ENOMEM;
+		if (!u.sregs2)
+			goto out;
+		__get_sregs2(vcpu, u.sregs2);
+		r = -EFAULT;
+		if (copy_to_user(argp, u.sregs2, sizeof(struct kvm_sregs2)))
+			goto out;
+		r = 0;
+		break;
+	}
+	case KVM_SET_SREGS2: {
+		u.sregs2 = memdup_user(argp, sizeof(struct kvm_sregs2));
+		if (IS_ERR(u.sregs2)) {
+			r = PTR_ERR(u.sregs2);
+			u.sregs2 = NULL;
+			goto out;
+		}
+		r = __set_sregs2(vcpu, u.sregs2);
+		break;
+	}
 	default:
 		r = -EINVAL;
 	}
@@ -9647,7 +9674,7 @@ void kvm_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
 }
 EXPORT_SYMBOL_GPL(kvm_get_cs_db_l_bits);
 
-static void __get_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
+static void __get_sregs_common(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
 {
 	struct desc_ptr dt;
 
@@ -9680,14 +9707,30 @@ static void __get_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
 	sregs->cr8 = kvm_get_cr8(vcpu);
 	sregs->efer = vcpu->arch.efer;
 	sregs->apic_base = kvm_get_apic_base(vcpu);
+}
 
-	memset(sregs->interrupt_bitmap, 0, sizeof(sregs->interrupt_bitmap));
+static void __get_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
+{
+	__get_sregs_common(vcpu, sregs);
+
+	if (vcpu->arch.guest_state_protected)
+		return;
 
 	if (vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft)
 		set_bit(vcpu->arch.interrupt.nr,
 			(unsigned long *)sregs->interrupt_bitmap);
 }
 
+static void __get_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2)
+{
+	int i;
+
+	__get_sregs_common(vcpu, (struct kvm_sregs *)sregs2);
+	if (is_pae_paging(vcpu))
+		for (i = 0 ; i < 4 ; i++)
+			sregs2->pdptrs[i] = kvm_pdptr_read(vcpu, i);
+}
+
 int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
 				  struct kvm_sregs *sregs)
 {
@@ -9799,11 +9842,9 @@ static bool kvm_is_valid_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
 	return kvm_is_valid_cr4(vcpu, sregs->cr4);
 }
 
-static int __set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
+static int __set_sregs_common(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs, int *mmu_reset_needed)
 {
 	struct msr_data apic_base_msr;
-	int mmu_reset_needed = 0;
-	int pending_vec, max_bits, idx;
 	struct desc_ptr dt;
 	int ret = -EINVAL;
 
@@ -9815,8 +9856,9 @@ static int __set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
 	if (kvm_set_apic_base(vcpu, &apic_base_msr))
 		goto out;
 
+	ret = 0;
 	if (vcpu->arch.guest_state_protected)
-		goto skip_protected_regs;
+		goto out;
 
 	dt.size = sregs->idt.limit;
 	dt.address = sregs->idt.base;
@@ -9826,32 +9868,22 @@ static int __set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
 	static_call(kvm_x86_set_gdt)(vcpu, &dt);
 
 	vcpu->arch.cr2 = sregs->cr2;
-	mmu_reset_needed |= kvm_read_cr3(vcpu) != sregs->cr3;
+	*mmu_reset_needed |= kvm_read_cr3(vcpu) != sregs->cr3;
 	vcpu->arch.cr3 = sregs->cr3;
 	kvm_register_mark_available(vcpu, VCPU_EXREG_CR3);
 
 	kvm_set_cr8(vcpu, sregs->cr8);
 
-	mmu_reset_needed |= vcpu->arch.efer != sregs->efer;
+	*mmu_reset_needed |= vcpu->arch.efer != sregs->efer;
 	static_call(kvm_x86_set_efer)(vcpu, sregs->efer);
 
-	mmu_reset_needed |= kvm_read_cr0(vcpu) != sregs->cr0;
+	*mmu_reset_needed |= kvm_read_cr0(vcpu) != sregs->cr0;
 	static_call(kvm_x86_set_cr0)(vcpu, sregs->cr0);
 	vcpu->arch.cr0 = sregs->cr0;
 
-	mmu_reset_needed |= kvm_read_cr4(vcpu) != sregs->cr4;
+	*mmu_reset_needed |= kvm_read_cr4(vcpu) != sregs->cr4;
 	static_call(kvm_x86_set_cr4)(vcpu, sregs->cr4);
 
-	idx = srcu_read_lock(&vcpu->kvm->srcu);
-	if (is_pae_paging(vcpu)) {
-		load_pdptrs(vcpu, vcpu->arch.walk_mmu, kvm_read_cr3(vcpu));
-		mmu_reset_needed = 1;
-	}
-	srcu_read_unlock(&vcpu->kvm->srcu, idx);
-
-	if (mmu_reset_needed)
-		kvm_mmu_reset_context(vcpu);
-
 	kvm_set_segment(vcpu, &sregs->cs, VCPU_SREG_CS);
 	kvm_set_segment(vcpu, &sregs->ds, VCPU_SREG_DS);
 	kvm_set_segment(vcpu, &sregs->es, VCPU_SREG_ES);
@@ -9869,8 +9901,39 @@ static int __set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
 	    sregs->cs.selector == 0xf000 && sregs->cs.base == 0xffff0000 &&
 	    !is_protmode(vcpu))
 		vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+out:
+	return ret;
+}
+
+static int __set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
+{
+	int idx, pending_vec, max_bits;
+	int mmu_reset_needed = 0;
+	int ret = __set_sregs_common(vcpu, sregs, &mmu_reset_needed);
+
+	if (ret)
+		return ret;
+
+	if (vcpu->arch.guest_state_protected)
+		return 0;
+
+	idx = srcu_read_lock(&vcpu->kvm->srcu);
+	if (is_pae_paging(vcpu)) {
+		load_pdptrs(vcpu, vcpu->arch.walk_mmu, kvm_read_cr3(vcpu));
+		mmu_reset_needed = 1;
+
+	/* If we are going to enter a nested guest, we have just
+	 * loaded the wrong PDPTRs, so we need to reload them on
+	 * guest mode entry.
+	 */
+
+		vcpu->arch.reload_pdptrs_on_nested_entry = true;
+	}
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+
+	if (mmu_reset_needed)
+		kvm_mmu_reset_context(vcpu);
 
-skip_protected_regs:
 	max_bits = KVM_NR_INTERRUPTS;
 	pending_vec = find_first_bit(
 		(const unsigned long *)sregs->interrupt_bitmap, max_bits);
@@ -9880,12 +9943,37 @@ static int __set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
 	}
 
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
+	return 0;
+}
 
-	ret = 0;
-out:
-	return ret;
+static int __set_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2)
+{
+	int mmu_reset_needed = 0;
+	int i, ret, idx;
+
+	if (sregs2->flags || sregs2->padding)
+		return -EINVAL;
+
+	ret = __set_sregs_common(vcpu, (struct kvm_sregs *)sregs2, &mmu_reset_needed);
+	if (ret || vcpu->arch.guest_state_protected)
+		return ret;
+
+	idx = srcu_read_lock(&vcpu->kvm->srcu);
+	if (is_pae_paging(vcpu)) {
+		for (i = 0 ; i < 4 ; i++)
+			kvm_pdptr_write(vcpu, i, sregs2->pdptrs[i]);
+		kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
+		mmu_reset_needed = 1;
+	}
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+
+	if (mmu_reset_needed)
+		kvm_mmu_reset_context(vcpu);
+
+	return 0;
 }
 
+
 int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
 				  struct kvm_sregs *sregs)
 {
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f6afee209620..212a98082c36 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1078,6 +1078,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_DIRTY_LOG_RING 192
 #define KVM_CAP_X86_BUS_LOCK_EXIT 193
 #define KVM_CAP_PPC_DAWR1 194
+#define KVM_CAP_SREGS2 196
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1616,6 +1617,10 @@ struct kvm_xen_hvm_attr {
 #define KVM_XEN_VCPU_GET_ATTR	_IOWR(KVMIO, 0xca, struct kvm_xen_vcpu_attr)
 #define KVM_XEN_VCPU_SET_ATTR	_IOW(KVMIO,  0xcb, struct kvm_xen_vcpu_attr)
 
+
+#define KVM_GET_SREGS2             _IOR(KVMIO,  0xca, struct kvm_sregs2)
+#define KVM_SET_SREGS2             _IOW(KVMIO,  0xcb, struct kvm_sregs2)
+
 struct kvm_xen_vcpu_attr {
 	__u16 type;
 	__u16 pad[3];
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 5/6] KVM: nSVM: avoid loading PDPTRs after migration when possible
  2021-04-01 14:18 [PATCH 0/6] Introduce KVM_{GET|SET}_SREGS2 and fix PDPTR migration Maxim Levitsky
                   ` (3 preceding siblings ...)
  2021-04-01 14:18 ` [PATCH 4/6] KVM: x86: Introduce KVM_GET_SREGS2 / KVM_SET_SREGS2 Maxim Levitsky
@ 2021-04-01 14:18 ` Maxim Levitsky
  2021-04-05 17:01   ` Sean Christopherson
  2021-04-01 14:18 ` [PATCH 6/6] KVM: nVMX: " Maxim Levitsky
  5 siblings, 1 reply; 18+ messages in thread
From: Maxim Levitsky @ 2021-04-01 14:18 UTC (permalink / raw)
  To: kvm
  Cc: maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Vitaly Kuznetsov, Wanpeng Li, Sean Christopherson,
	Joerg Roedel, H. Peter Anvin,
	open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Paolo Bonzini, Thomas Gleixner, Jonathan Corbet, Borislav Petkov,
	Ingo Molnar, open list:DOCUMENTATION, Maxim Levitsky

If the new KVM_*_SREGS2 ioctls are used, the PDPTRs are part of
the migration state and are thus loaded by those ioctls.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/svm/nested.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index ac5e3e17bda4..b94916548cfa 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -373,10 +373,9 @@ static int nested_svm_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
 		return -EINVAL;
 
 	if (!nested_npt && is_pae_paging(vcpu) &&
-	    (cr3 != kvm_read_cr3(vcpu) || pdptrs_changed(vcpu))) {
+	    (cr3 != kvm_read_cr3(vcpu) || !kvm_register_is_available(vcpu, VCPU_EXREG_PDPTR)))
 		if (CC(!load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3)))
 			return -EINVAL;
-	}
 
 	/*
 	 * TODO: optimize unconditional TLB flush/MMU sync here and in
@@ -552,6 +551,8 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
 	nested_vmcb02_prepare_control(svm);
 	nested_vmcb02_prepare_save(svm, vmcb12);
 
+	kvm_register_clear_available(&svm->vcpu, VCPU_EXREG_PDPTR);
+
 	ret = nested_svm_load_cr3(&svm->vcpu, vmcb12->save.cr3,
 				  nested_npt_enabled(svm));
 	if (ret)
@@ -779,6 +780,8 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
 
 	nested_svm_uninit_mmu_context(vcpu);
 
+	kvm_register_clear_available(&svm->vcpu, VCPU_EXREG_PDPTR);
+
 	rc = nested_svm_load_cr3(vcpu, svm->vmcb->save.cr3, false);
 	if (rc)
 		return 1;
@@ -1301,6 +1304,14 @@ static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
 	if (WARN_ON(!is_guest_mode(vcpu)))
 		return true;
 
+	if (vcpu->arch.reload_pdptrs_on_nested_entry) {
+		/* If the legacy KVM_SET_SREGS API was used, it might have
+		 * loaded the wrong PDPTRs from memory, so we have to
+		 * reload them here (which is against the x86 spec).
+		 */
+		kvm_register_clear_available(vcpu, VCPU_EXREG_PDPTR);
+	}
+
 	if (nested_svm_load_cr3(&svm->vcpu, vcpu->arch.cr3,
 				nested_npt_enabled(svm)))
 		return false;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 6/6] KVM: nVMX: avoid loading PDPTRs after migration when possible
  2021-04-01 14:18 [PATCH 0/6] Introduce KVM_{GET|SET}_SREGS2 and fix PDPTR migration Maxim Levitsky
                   ` (4 preceding siblings ...)
  2021-04-01 14:18 ` [PATCH 5/6] KVM: nSVM: avoid loading PDPTRs after migration when possible Maxim Levitsky
@ 2021-04-01 14:18 ` Maxim Levitsky
  5 siblings, 0 replies; 18+ messages in thread
From: Maxim Levitsky @ 2021-04-01 14:18 UTC (permalink / raw)
  To: kvm
  Cc: maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Vitaly Kuznetsov, Wanpeng Li, Sean Christopherson,
	Joerg Roedel, H. Peter Anvin,
	open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Paolo Bonzini, Thomas Gleixner, Jonathan Corbet, Borislav Petkov,
	Ingo Molnar, open list:DOCUMENTATION, Maxim Levitsky

If the new KVM_*_SREGS2 ioctls are used, the PDPTRs are part of
the migration state and are thus loaded by those ioctls.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/vmx/nested.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index b44f1f6b68db..f2291165995e 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1115,7 +1115,7 @@ static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3, bool ne
 	 * must not be dereferenced.
 	 */
 	if (!nested_ept && is_pae_paging(vcpu) &&
-	    (cr3 != kvm_read_cr3(vcpu) || pdptrs_changed(vcpu))) {
+	    (cr3 != kvm_read_cr3(vcpu) || !kvm_register_is_available(vcpu, VCPU_EXREG_PDPTR))) {
 		if (CC(!load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3))) {
 			*entry_failure_code = ENTRY_FAIL_PDPTE;
 			return -EINVAL;
@@ -3110,6 +3110,14 @@ static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
 	struct page *page;
 	u64 hpa;
 
+	if (vcpu->arch.reload_pdptrs_on_nested_entry) {
+		/* If the legacy KVM_SET_SREGS API was used, it might have
+		 * loaded the wrong PDPTRs from memory, so we have to reload
+		 * them here (which is against the x86 spec).
+		 */
+		kvm_register_clear_available(vcpu, VCPU_EXREG_PDPTR);
+	}
+
 	if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3, nested_cpu_has_ept(vmcs12),
 				&entry_failure_code))
 		return false;
@@ -3357,6 +3365,7 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
 	}
 
 	if (from_vmentry) {
+		kvm_register_clear_available(vcpu, VCPU_EXREG_PDPTR);
 		if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3,
 		    nested_cpu_has_ept(vmcs12), &entry_failure_code))
 			goto vmentry_fail_vmexit_guest_mode;
@@ -4195,6 +4204,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 	 * Only PDPTE load can fail as the value of cr3 was checked on entry and
 	 * couldn't have changed.
 	 */
+	kvm_register_clear_available(vcpu, VCPU_EXREG_PDPTR);
 	if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &ignored))
 		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/6] KVM: nVMX: delay loading of PDPTRs to KVM_REQ_GET_NESTED_STATE_PAGES
  2021-04-01 14:18 ` [PATCH 1/6] KVM: nVMX: delay loading of PDPTRs to KVM_REQ_GET_NESTED_STATE_PAGES Maxim Levitsky
@ 2021-04-01 14:31   ` Paolo Bonzini
  2021-04-02 17:27   ` Sean Christopherson
  1 sibling, 0 replies; 18+ messages in thread
From: Paolo Bonzini @ 2021-04-01 14:31 UTC (permalink / raw)
  To: Maxim Levitsky, kvm
  Cc: maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Vitaly Kuznetsov, Wanpeng Li, Sean Christopherson,
	Joerg Roedel, H. Peter Anvin,
	open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Thomas Gleixner, Jonathan Corbet, Borislav Petkov, Ingo Molnar,
	open list:DOCUMENTATION

On 01/04/21 16:18, Maxim Levitsky wrote:
> Similar to the rest of the guest page accesses after migration,
> loading of the PDPTRs should be delayed to the
> KVM_REQ_GET_NESTED_STATE_PAGES request.
> 
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>   arch/x86/kvm/vmx/nested.c | 14 +++++++++-----
>   1 file changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index fd334e4aa6db..b44f1f6b68db 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -2564,11 +2564,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>   		return -EINVAL;
>   	}
>   
> -	/* Shadow page tables on either EPT or shadow page tables. */
> -	if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3, nested_cpu_has_ept(vmcs12),
> -				entry_failure_code))
> -		return -EINVAL;
> -
>   	/*
>   	 * Immediately write vmcs02.GUEST_CR3.  It will be propagated to vmcs12
>   	 * on nested VM-Exit, which can occur without actually running L2 and
> @@ -3109,11 +3104,16 @@ static bool nested_get_evmcs_page(struct kvm_vcpu *vcpu)
>   static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
>   {
>   	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> +	enum vm_entry_failure_code entry_failure_code;
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
>   	struct kvm_host_map *map;
>   	struct page *page;
>   	u64 hpa;
>   
> +	if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3, nested_cpu_has_ept(vmcs12),
> +				&entry_failure_code))
> +		return false;
> +
>   	if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) {
>   		/*
>   		 * Translate L1 physical address to host physical
> @@ -3357,6 +3357,10 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
>   	}
>   
>   	if (from_vmentry) {
> +		if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3,
> +		    nested_cpu_has_ept(vmcs12), &entry_failure_code))
> +			goto vmentry_fail_vmexit_guest_mode;
> +
>   		failed_index = nested_vmx_load_msr(vcpu,
>   						   vmcs12->vm_entry_msr_load_addr,
>   						   vmcs12->vm_entry_msr_load_count);
> 

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/6] KVM: nSVM: call nested_svm_load_cr3 on nested state load
  2021-04-01 14:18 ` [PATCH 2/6] KVM: nSVM: call nested_svm_load_cr3 on nested state load Maxim Levitsky
@ 2021-04-01 14:31   ` Paolo Bonzini
  0 siblings, 0 replies; 18+ messages in thread
From: Paolo Bonzini @ 2021-04-01 14:31 UTC (permalink / raw)
  To: Maxim Levitsky, kvm
  Cc: maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Vitaly Kuznetsov, Wanpeng Li, Sean Christopherson,
	Joerg Roedel, H. Peter Anvin,
	open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Thomas Gleixner, Jonathan Corbet, Borislav Petkov, Ingo Molnar,
	open list:DOCUMENTATION

On 01/04/21 16:18, Maxim Levitsky wrote:
> While KVM's MMU should be fully reset when the nested CR0/CR3/CR4 are
> loaded by KVM_SET_SREGS, we are not yet in nested mode at that point,
> and therefore only root_mmu is reset.
> 
> On regular nested entries we call nested_svm_load_cr3, which both
> updates the guest's CR3 in the MMU when needed, and also initializes
> the MMU again, which initializes the walk_mmu as well when nested
> paging is enabled in both host and guest.
> 
> Since we don't call nested_svm_load_cr3 on nested state load, the
> walk_mmu can be left uninitialized. If we happen to get a nested page
> fault right after entering the nested guest for the first time after
> the migration, and we decide to emulate it, the emulator tries to
> access walk_mmu->gva_to_gpa, which is NULL, leading to a NULL pointer
> dereference.
> 
> Therefore we should call this function on nested state load as well.
> 
> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>   arch/x86/kvm/svm/nested.c | 40 +++++++++++++++++++++------------------
>   1 file changed, 22 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 8523f60adb92..ac5e3e17bda4 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -215,24 +215,6 @@ static bool nested_svm_vmrun_msrpm(struct vcpu_svm *svm)
>   	return true;
>   }
>   
> -static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
> -{
> -	struct vcpu_svm *svm = to_svm(vcpu);
> -
> -	if (WARN_ON(!is_guest_mode(vcpu)))
> -		return true;
> -
> -	if (!nested_svm_vmrun_msrpm(svm)) {
> -		vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
> -		vcpu->run->internal.suberror =
> -			KVM_INTERNAL_ERROR_EMULATION;
> -		vcpu->run->internal.ndata = 0;
> -		return false;
> -	}
> -
> -	return true;
> -}
> -
>   static bool nested_vmcb_check_controls(struct vmcb_control_area *control)
>   {
>   	if (CC(!vmcb_is_intercept(control, INTERCEPT_VMRUN)))
> @@ -1312,6 +1294,28 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
>   	return ret;
>   }
>   
> +static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_svm *svm = to_svm(vcpu);
> +
> +	if (WARN_ON(!is_guest_mode(vcpu)))
> +		return true;
> +
> +	if (nested_svm_load_cr3(&svm->vcpu, vcpu->arch.cr3,
> +				nested_npt_enabled(svm)))
> +		return false;
> +
> +	if (!nested_svm_vmrun_msrpm(svm)) {
> +		vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
> +		vcpu->run->internal.suberror =
> +			KVM_INTERNAL_ERROR_EMULATION;
> +		vcpu->run->internal.ndata = 0;
> +		return false;
> +	}
> +
> +	return true;
> +}
> +
>   struct kvm_x86_nested_ops svm_nested_ops = {
>   	.check_events = svm_check_nested_events,
>   	.triple_fault = nested_svm_triple_fault,
> 

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/6] KVM: x86: introduce kvm_register_clear_available
  2021-04-01 14:18 ` [PATCH 3/6] KVM: x86: introduce kvm_register_clear_available Maxim Levitsky
@ 2021-04-01 14:31   ` Paolo Bonzini
  2021-04-05 17:03   ` Sean Christopherson
  1 sibling, 0 replies; 18+ messages in thread
From: Paolo Bonzini @ 2021-04-01 14:31 UTC (permalink / raw)
  To: Maxim Levitsky, kvm
  Cc: maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Vitaly Kuznetsov, Wanpeng Li, Sean Christopherson,
	Joerg Roedel, H. Peter Anvin,
	open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Thomas Gleixner, Jonathan Corbet, Borislav Petkov, Ingo Molnar,
	open list:DOCUMENTATION

On 01/04/21 16:18, Maxim Levitsky wrote:
> Small refactoring that will be used in the next patch.
> 
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>   arch/x86/kvm/kvm_cache_regs.h | 7 +++++++
>   arch/x86/kvm/svm/svm.c        | 6 ++----
>   2 files changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
> index 2e11da2f5621..07d607947805 100644
> --- a/arch/x86/kvm/kvm_cache_regs.h
> +++ b/arch/x86/kvm/kvm_cache_regs.h
> @@ -55,6 +55,13 @@ static inline void kvm_register_mark_available(struct kvm_vcpu *vcpu,
>   	__set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
>   }
>   
> +static inline void kvm_register_clear_available(struct kvm_vcpu *vcpu,
> +					       enum kvm_reg reg)
> +{
> +	__clear_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
> +	__clear_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty);
> +}
> +
>   static inline void kvm_register_mark_dirty(struct kvm_vcpu *vcpu,
>   					   enum kvm_reg reg)
>   {
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 271196400495..2843732299a2 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3880,10 +3880,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
>   		vcpu->arch.apf.host_apf_flags =
>   			kvm_read_and_reset_apf_flags();
>   
> -	if (npt_enabled) {
> -		vcpu->arch.regs_avail &= ~(1 << VCPU_EXREG_PDPTR);
> -		vcpu->arch.regs_dirty &= ~(1 << VCPU_EXREG_PDPTR);
> -	}
> +	if (npt_enabled)
> +		kvm_register_clear_available(vcpu, VCPU_EXREG_PDPTR);
>   
>   	/*
>   	 * We need to handle MC intercepts here before the vcpu has a chance to
> 

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/6] KVM: x86: Introduce KVM_GET_SREGS2 / KVM_SET_SREGS2
  2021-04-01 14:18 ` [PATCH 4/6] KVM: x86: Introduce KVM_GET_SREGS2 / KVM_SET_SREGS2 Maxim Levitsky
@ 2021-04-01 14:44   ` Paolo Bonzini
  2021-04-01 17:10     ` Maxim Levitsky
  0 siblings, 1 reply; 18+ messages in thread
From: Paolo Bonzini @ 2021-04-01 14:44 UTC (permalink / raw)
  To: Maxim Levitsky, kvm
  Cc: maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Vitaly Kuznetsov, Wanpeng Li, Sean Christopherson,
	Joerg Roedel, H. Peter Anvin,
	open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Thomas Gleixner, Jonathan Corbet, Borislav Petkov, Ingo Molnar,
	open list:DOCUMENTATION

Just a quick review on the API:

On 01/04/21 16:18, Maxim Levitsky wrote:
> +struct kvm_sregs2 {
> +	/* out (KVM_GET_SREGS2) / in (KVM_SET_SREGS2) */
> +	struct kvm_segment cs, ds, es, fs, gs, ss;
> +	struct kvm_segment tr, ldt;
> +	struct kvm_dtable gdt, idt;
> +	__u64 cr0, cr2, cr3, cr4, cr8;
> +	__u64 efer;
> +	__u64 apic_base;
> > +	__u64 flags; /* must be zero */

I think it would make sense to define a flag bit for the PDPTRs, so that 
userspace can use KVM_SET_SREGS2 unconditionally (e.g. even when 
migrating from a source that uses KVM_GET_SREGS and therefore doesn't 
provide the PDPTRs).
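
Something along these lines (the flag name and the surrounding
variables are only illustrative):

	#define KVM_SREGS2_FLAGS_PDPTRS_VALID	(1ULL << 0)

	/* Userspace sets the flag only when it has PDPTR data: */
	if (have_pdptrs_from_source) {
		memcpy(sregs2.pdptrs, saved_pdptrs, sizeof(sregs2.pdptrs));
		sregs2.flags |= KVM_SREGS2_FLAGS_PDPTRS_VALID;
	}
	ioctl(vcpu_fd, KVM_SET_SREGS2, &sregs2);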

> +	__u64 pdptrs[4];
> +	__u64 padding;

No need to add padding; if we add more fields in the future we can use 
the flags to determine the length of the userspace data, similar to 
KVM_GET/SET_NESTED_STATE.


> 
> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
> +	if (is_pae_paging(vcpu)) {
> +		for (i = 0 ; i < 4 ; i++)
> +			kvm_pdptr_write(vcpu, i, sregs2->pdptrs[i]);
> +		kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
> +		mmu_reset_needed = 1;
> +	}
> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +

SRCU should not be needed here?

> +	case KVM_GET_SREGS2: {
> +		u.sregs2 = kzalloc(sizeof(struct kvm_sregs2), GFP_KERNEL_ACCOUNT);
> +		r = -ENOMEM;
> +		if (!u.sregs2)
> +			goto out;

No need to account, I think it's a little slower and this allocation is 
very short lived.

>  #define KVM_CAP_PPC_DAWR1 194
> +#define KVM_CAP_SREGS2 196

195, not 196.

>  #define KVM_XEN_VCPU_GET_ATTR	_IOWR(KVMIO, 0xca, struct kvm_xen_vcpu_attr)
>  #define KVM_XEN_VCPU_SET_ATTR	_IOW(KVMIO,  0xcb, struct kvm_xen_vcpu_attr)
> +
> +#define KVM_GET_SREGS2             _IOR(KVMIO,  0xca, struct kvm_sregs2)
> +#define KVM_SET_SREGS2             _IOW(KVMIO,  0xcb, struct kvm_sregs2)
> +

It's not exactly overlapping, but please bump the ioctls to 0xcc/0xcd.

Paolo


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/6] KVM: x86: Introduce KVM_GET_SREGS2 / KVM_SET_SREGS2
  2021-04-01 14:44   ` Paolo Bonzini
@ 2021-04-01 17:10     ` Maxim Levitsky
  2021-04-01 17:33       ` Paolo Bonzini
  0 siblings, 1 reply; 18+ messages in thread
From: Maxim Levitsky @ 2021-04-01 17:10 UTC (permalink / raw)
  To: Paolo Bonzini, kvm
  Cc: maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Vitaly Kuznetsov, Wanpeng Li, Sean Christopherson,
	Joerg Roedel, H. Peter Anvin,
	open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Thomas Gleixner, Jonathan Corbet, Borislav Petkov, Ingo Molnar,
	open list:DOCUMENTATION

On Thu, 2021-04-01 at 16:44 +0200, Paolo Bonzini wrote:
> Just a quick review on the API:
> 
> On 01/04/21 16:18, Maxim Levitsky wrote:
> > +struct kvm_sregs2 {
> > +	/* out (KVM_GET_SREGS2) / in (KVM_SET_SREGS2) */
> > +	struct kvm_segment cs, ds, es, fs, gs, ss;
> > +	struct kvm_segment tr, ldt;
> > +	struct kvm_dtable gdt, idt;
> > +	__u64 cr0, cr2, cr3, cr4, cr8;
> > +	__u64 efer;
> > +	__u64 apic_base;
> > +	__u64 flags; /* must be zero */
> 
> I think it would make sense to define a flag bit for the PDPTRs, so that 
> userspace can use KVM_SET_SREGS2 unconditionally (e.g. even when 
> migrating from a source that uses KVM_GET_SREGS and therefore doesn't 
> provide the PDPTRs).
Yes, I didn't think about this case! I'll add this to the next version.
Thanks!

> 
> > +	__u64 pdptrs[4];
> > +	__u64 padding;
> 
> No need to add padding; if we add more fields in the future we can use 
> the flags to determine the length of the userspace data, similar to 
> KVM_GET/SET_NESTED_STATE.
Got it, will fix. I added it just in case.

> 
> 
> > +	idx = srcu_read_lock(&vcpu->kvm->srcu);
> > +	if (is_pae_paging(vcpu)) {
> > +		for (i = 0 ; i < 4 ; i++)
> > +			kvm_pdptr_write(vcpu, i, sregs2->pdptrs[i]);
> > +		kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
> > +		mmu_reset_needed = 1;
> > +	}
> > +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
> > +
> 
> SRCU should not be needed here?

I haven't yet studied in depth the locking that is used in KVM,
so I put this here to be on the safe side.

I looked at it a bit, and it looks like the PDPTR reading code takes
this lock because it accesses the memslots, which is not done here,
and therefore the lock is indeed not needed here.

I need to study in depth how locking is done in KVM to be 100% sure
about this.


> 
> > +	case KVM_GET_SREGS2: {
> > +		u.sregs2 = kzalloc(sizeof(struct kvm_sregs2), GFP_KERNEL_ACCOUNT);
> > +		r = -ENOMEM;
> > +		if (!u.sregs2)
> > +			goto out;
> 
> No need to account, I think it's a little slower and this allocation is 
> very short lived.
Right, I will fix this in the next version.

> 
> >  #define KVM_CAP_PPC_DAWR1 194
> > +#define KVM_CAP_SREGS2 196
> 
> 195, not 196.

I am also planning to add KVM_CAP_SET_GUEST_DEBUG2 for which I
used 195.
Prior to sending I rebased all of my patch series on top of kvm/queue,
but I kept the numbers just in case.

> 
> >  #define KVM_XEN_VCPU_GET_ATTR	_IOWR(KVMIO, 0xca, struct kvm_xen_vcpu_attr)
> >  #define KVM_XEN_VCPU_SET_ATTR	_IOW(KVMIO,  0xcb, struct kvm_xen_vcpu_attr)
> > +
> > +#define KVM_GET_SREGS2             _IOR(KVMIO,  0xca, struct kvm_sregs2)
> > +#define KVM_SET_SREGS2             _IOW(KVMIO,  0xcb, struct kvm_sregs2)
> > +
> 
> It's not exactly overlapping, but please bump the ioctls to 0xcc/0xcd.
Will do.


Thanks a lot for the review!

Best regards,
	Maxim Levitsky

> 
> Paolo
> 



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/6] KVM: x86: Introduce KVM_GET_SREGS2 / KVM_SET_SREGS2
  2021-04-01 17:10     ` Maxim Levitsky
@ 2021-04-01 17:33       ` Paolo Bonzini
  0 siblings, 0 replies; 18+ messages in thread
From: Paolo Bonzini @ 2021-04-01 17:33 UTC (permalink / raw)
  To: Maxim Levitsky, kvm
  Cc: maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Vitaly Kuznetsov, Wanpeng Li, Sean Christopherson,
	Joerg Roedel, H. Peter Anvin,
	open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Thomas Gleixner, Jonathan Corbet, Borislav Petkov, Ingo Molnar,
	open list:DOCUMENTATION

On 01/04/21 19:10, Maxim Levitsky wrote:
> I haven't yet studied in depth the locking that is used in the kvm,
> so I put this to be on the safe side.
> 
> I looked at it a bit and it looks like the pdptr reading code takes
> this lock because it accesses the memslots, which is not done here,
> and therefore the lock is indeed not needed here.
> 
> I need to study in depth how locking is done in kvm to be 100% sure
> about this.

Yes, SRCU protects reading the memslots (and therefore accessing guest 
memory).
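
That is, the read side is only needed around code that actually reads
guest memory, e.g. around the load_pdptrs() call:

	idx = srcu_read_lock(&vcpu->kvm->srcu);
	load_pdptrs(vcpu, vcpu->arch.walk_mmu, kvm_read_cr3(vcpu));
	srcu_read_unlock(&vcpu->kvm->srcu, idx);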

>> 195, not 196.
> 
> I am also planning to add KVM_CAP_SET_GUEST_DEBUG2 for which I
> used 195.

Sounds good then.

Paolo


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/6] KVM: nVMX: delay loading of PDPTRs to KVM_REQ_GET_NESTED_STATE_PAGES
  2021-04-01 14:18 ` [PATCH 1/6] KVM: nVMX: delay loading of PDPTRs to KVM_REQ_GET_NESTED_STATE_PAGES Maxim Levitsky
  2021-04-01 14:31   ` Paolo Bonzini
@ 2021-04-02 17:27   ` Sean Christopherson
  2021-04-06 10:22     ` Maxim Levitsky
  1 sibling, 1 reply; 18+ messages in thread
From: Sean Christopherson @ 2021-04-02 17:27 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel,
	H. Peter Anvin, open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Paolo Bonzini, Thomas Gleixner, Jonathan Corbet, Borislav Petkov,
	Ingo Molnar, open list:DOCUMENTATION

On Thu, Apr 01, 2021, Maxim Levitsky wrote:
> Similar to the rest of the guest page accesses after migration,
> loading of the PDPTRs should be delayed to the
> KVM_REQ_GET_NESTED_STATE_PAGES request.

FWIW, I still object to this approach, and this patch has a plethora of issues.

I'm not against deferring various state loading to KVM_RUN, but wholesale moving
all of GUEST_CR3 processing without in-depth consideration of all the side
effects is a really bad idea.

> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index fd334e4aa6db..b44f1f6b68db 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -2564,11 +2564,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>  		return -EINVAL;
>  	}
>  
> -	/* Shadow page tables on either EPT or shadow page tables. */
> -	if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3, nested_cpu_has_ept(vmcs12),
> -				entry_failure_code))
> -		return -EINVAL;
> -
>  	/*
>  	 * Immediately write vmcs02.GUEST_CR3.  It will be propagated to vmcs12
>  	 * on nested VM-Exit, which can occur without actually running L2 and
> @@ -3109,11 +3104,16 @@ static bool nested_get_evmcs_page(struct kvm_vcpu *vcpu)
>  static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
>  {
>  	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> +	enum vm_entry_failure_code entry_failure_code;
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  	struct kvm_host_map *map;
>  	struct page *page;
>  	u64 hpa;
>  
> +	if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3, nested_cpu_has_ept(vmcs12),
> +				&entry_failure_code))

This results in KVM_RUN returning 0 without filling vcpu->run->exit_reason.
Speaking from experience, debugging those types of issues is beyond painful.

It also means CR3 is double loaded in the from_vmentry case.

And it will cause KVM to incorrectly return NVMX_VMENTRY_KVM_INTERNAL_ERROR
if a consistency check fails when nested_get_vmcs12_pages() is called on
from_vmentry.  E.g. run unit tests with this and it will silently disappear.

diff --git a/x86/vmx_tests.c b/x86/vmx_tests.c
index bbb006a..b8ccc69 100644
--- a/x86/vmx_tests.c
+++ b/x86/vmx_tests.c
@@ -8172,6 +8172,16 @@ static void test_guest_segment_base_addr_fields(void)
        vmcs_write(GUEST_AR_ES, ar_saved);
 }

+static void test_guest_cr3(void)
+{
+       u64 cr3_saved = vmcs_read(GUEST_CR3);
+
+       vmcs_write(GUEST_CR3, -1ull);
+       test_guest_state("Bad CR3 fails VM-Enter", true, -1ull, "GUEST_CR3");
+
+       vmcs_write(GUEST_CR3, cr3_saved);
+}
+
 /*
  * Check that the virtual CPU checks the VMX Guest State Area as
  * documented in the Intel SDM.
@@ -8181,6 +8191,8 @@ static void vmx_guest_state_area_test(void)
        vmx_set_test_stage(1);
        test_set_guest(guest_state_test_main);

+       test_guest_cr3();
+
        /*
         * The IA32_SYSENTER_ESP field and the IA32_SYSENTER_EIP field
         * must each contain a canonical address.


> +		return false;
> +
>  	if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) {
>  		/*
>  		 * Translate L1 physical address to host physical
> @@ -3357,6 +3357,10 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
>  	}
>  
>  	if (from_vmentry) {
> +		if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3,
> +		    nested_cpu_has_ept(vmcs12), &entry_failure_code))

This alignment is messed up; it looks like two separate function calls.

> +			goto vmentry_fail_vmexit_guest_mode;
> +
>  		failed_index = nested_vmx_load_msr(vcpu,
>  						   vmcs12->vm_entry_msr_load_addr,
>  						   vmcs12->vm_entry_msr_load_count);
> -- 
> 2.26.2
> 

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 5/6] KVM: nSVM: avoid loading PDPTRs after migration when possible
  2021-04-01 14:18 ` [PATCH 5/6] KVM: nSVM: avoid loading PDPTRs after migration when possible Maxim Levitsky
@ 2021-04-05 17:01   ` Sean Christopherson
  2021-04-06 10:12     ` Maxim Levitsky
  0 siblings, 1 reply; 18+ messages in thread
From: Sean Christopherson @ 2021-04-05 17:01 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel,
	H. Peter Anvin, open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Paolo Bonzini, Thomas Gleixner, Jonathan Corbet, Borislav Petkov,
	Ingo Molnar, open list:DOCUMENTATION

On Thu, Apr 01, 2021, Maxim Levitsky wrote:
> If the new KVM_*_SREGS2 ioctls are used, the PDPTRs are part of
> the migration state and are thus loaded by those ioctls.
> 
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  arch/x86/kvm/svm/nested.c | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index ac5e3e17bda4..b94916548cfa 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -373,10 +373,9 @@ static int nested_svm_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
>  		return -EINVAL;
>  
>  	if (!nested_npt && is_pae_paging(vcpu) &&
> -	    (cr3 != kvm_read_cr3(vcpu) || pdptrs_changed(vcpu))) {
> +	    (cr3 != kvm_read_cr3(vcpu) || !kvm_register_is_available(vcpu, VCPU_EXREG_PDPTR)))
>  		if (CC(!load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3)))

What if we ditch the optimizations[*] altogether and just do:

	if (!nested_npt && is_pae_paging(vcpu) &&
	    CC(!load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3)))
		return -EINVAL;

Won't that obviate the need for KVM_{GET|SET}_SREGS2 since KVM will always load
the PDPTRs from memory?  IMO, nested migration with shadow paging doesn't
warrant this level of optimization complexity.

[*] For some definitions of "optimization", since the extra pdptrs_changed()
    check in the existing code is likely a net negative.

>  			return -EINVAL;
> -	}
>  
>  	/*
>  	 * TODO: optimize unconditional TLB flush/MMU sync here and in

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/6] KVM: x86: introduce kvm_register_clear_available
  2021-04-01 14:18 ` [PATCH 3/6] KVM: x86: introduce kvm_register_clear_available Maxim Levitsky
  2021-04-01 14:31   ` Paolo Bonzini
@ 2021-04-05 17:03   ` Sean Christopherson
  1 sibling, 0 replies; 18+ messages in thread
From: Sean Christopherson @ 2021-04-05 17:03 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel,
	H. Peter Anvin, open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Paolo Bonzini, Thomas Gleixner, Jonathan Corbet, Borislav Petkov,
	Ingo Molnar, open list:DOCUMENTATION

On Thu, Apr 01, 2021, Maxim Levitsky wrote:
> Small refactoring that will be used in the next patch.
> 
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  arch/x86/kvm/kvm_cache_regs.h | 7 +++++++
>  arch/x86/kvm/svm/svm.c        | 6 ++----
>  2 files changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
> index 2e11da2f5621..07d607947805 100644
> --- a/arch/x86/kvm/kvm_cache_regs.h
> +++ b/arch/x86/kvm/kvm_cache_regs.h
> @@ -55,6 +55,13 @@ static inline void kvm_register_mark_available(struct kvm_vcpu *vcpu,
>  	__set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
>  }
>  
> +static inline void kvm_register_clear_available(struct kvm_vcpu *vcpu,

I don't love the name, because it makes me think too hard about available vs.
dirty.  :-)   If we drop the optimizations, what if we also drop this patch to
avoid bikeshedding the name?

> +					       enum kvm_reg reg)
> +{
> +	__clear_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
> +	__clear_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty);
> +}
> +
>  static inline void kvm_register_mark_dirty(struct kvm_vcpu *vcpu,
>  					   enum kvm_reg reg)
>  {
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 271196400495..2843732299a2 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3880,10 +3880,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
>  		vcpu->arch.apf.host_apf_flags =
>  			kvm_read_and_reset_apf_flags();
>  
> -	if (npt_enabled) {
> -		vcpu->arch.regs_avail &= ~(1 << VCPU_EXREG_PDPTR);
> -		vcpu->arch.regs_dirty &= ~(1 << VCPU_EXREG_PDPTR);
> -	}
> +	if (npt_enabled)
> +		kvm_register_clear_available(vcpu, VCPU_EXREG_PDPTR);
>  
>  	/*
>  	 * We need to handle MC intercepts here before the vcpu has a chance to
> -- 
> 2.26.2
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 5/6] KVM: nSVM: avoid loading PDPTRs after migration when possible
  2021-04-05 17:01   ` Sean Christopherson
@ 2021-04-06 10:12     ` Maxim Levitsky
  0 siblings, 0 replies; 18+ messages in thread
From: Maxim Levitsky @ 2021-04-06 10:12 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel,
	H. Peter Anvin, open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Paolo Bonzini, Thomas Gleixner, Jonathan Corbet, Borislav Petkov,
	Ingo Molnar, open list:DOCUMENTATION

On Mon, 2021-04-05 at 17:01 +0000, Sean Christopherson wrote:
> On Thu, Apr 01, 2021, Maxim Levitsky wrote:
> > If the new KVM_*_SREGS2 ioctls are used, the PDPTRs are
> > part of the migration state and thus are loaded
> > by those ioctls.
> > 
> > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > ---
> >  arch/x86/kvm/svm/nested.c | 15 +++++++++++++--
> >  1 file changed, 13 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > index ac5e3e17bda4..b94916548cfa 100644
> > --- a/arch/x86/kvm/svm/nested.c
> > +++ b/arch/x86/kvm/svm/nested.c
> > @@ -373,10 +373,9 @@ static int nested_svm_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
> >  		return -EINVAL;
> >  
> >  	if (!nested_npt && is_pae_paging(vcpu) &&
> > -	    (cr3 != kvm_read_cr3(vcpu) || pdptrs_changed(vcpu))) {
> > +	    (cr3 != kvm_read_cr3(vcpu) || !kvm_register_is_available(vcpu, VCPU_EXREG_PDPTR)))
> >  		if (CC(!load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3)))
> 
> What if we ditch the optimizations[*] altogether and just do:
> 
> 	if (!nested_npt && is_pae_paging(vcpu) &&
> 	    CC(!load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3)))
> 		return -EINVAL;
> 
> Won't that obviate the need for KVM_{GET|SET}_SREGS2 since KVM will always load
> the PDPTRs from memory?  IMO, nested migration with shadowing paging doesn't
> warrant this level of optimization complexity.

It's not an optimization; it was done to be 100% within the x86 spec.
The PDPTRs are internal CPU registers which are loaded only when
CR3/CR0/CR4 are written by the guest, when guest entry loads CR3, or
when guest exit loads CR3 (I checked both the Intel and AMD manuals).

In addition, when NPT is enabled, AMD drops this silliness and
just treats the PDPTRs as normal paging entries, while on the Intel
side, when EPT is enabled, the PDPTRs are stored in the VMCS.

Nested migration is neither of these cases, thus the PDPTRs should be
stored out of band.
The same applies to non-nested migration.

This was requested by Jim Mattson, and I went ahead and
implemented it, even though I do understand that no sane OS
relies on the PDPTRs being out of sync with the actual page
tables containing them.
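
To make the "loaded only on explicit events" point concrete, here is a
minimal sketch of what a PAE PDPTR load from guest memory looks like
(illustrative only; pdpte_reserved_bits() below is a hypothetical helper,
the real load_pdptrs() uses the MMU's reserved-bit machinery):

/* Minimal sketch of a PAE PDPTR load; not the exact kernel code. */
static int load_pdptrs_sketch(struct kvm_vcpu *vcpu, unsigned long cr3)
{
	u64 pdpte[4];
	/* In PAE mode, CR3 bits 31:5 hold the 32-byte aligned PDPT base. */
	gpa_t pdpt_gpa = cr3 & 0xffffffe0ul;
	int i;

	if (kvm_read_guest(vcpu->kvm, pdpt_gpa, pdpte, sizeof(pdpte)) < 0)
		return 0;

	for (i = 0; i < 4; i++) {
		/* A present PDPTE must not have reserved bits set. */
		if ((pdpte[i] & 1) && (pdpte[i] & pdpte_reserved_bits(vcpu)))
			return 0;
	}

	memcpy(vcpu->arch.walk_mmu->pdptrs, pdpte, sizeof(pdpte));
	kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
	return 1;
}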

Best regards,
	Maxim Levitsky


> 
> [*] For some definitions of "optimization", since the extra pdptrs_changed()
>     check in the existing code is likely a net negative.
> 
> >  			return -EINVAL;
> > -	}
> >  
> >  	/*
> >  	 * TODO: optimize unconditional TLB flush/MMU sync here and in



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/6] KVM: nVMX: delay loading of PDPTRs to KVM_REQ_GET_NESTED_STATE_PAGES
  2021-04-02 17:27   ` Sean Christopherson
@ 2021-04-06 10:22     ` Maxim Levitsky
  0 siblings, 0 replies; 18+ messages in thread
From: Maxim Levitsky @ 2021-04-06 10:22 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel,
	H. Peter Anvin, open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Paolo Bonzini, Thomas Gleixner, Jonathan Corbet, Borislav Petkov,
	Ingo Molnar, open list:DOCUMENTATION

On Fri, 2021-04-02 at 17:27 +0000, Sean Christopherson wrote:
> On Thu, Apr 01, 2021, Maxim Levitsky wrote:
> > Similar to the rest of guest page accesses after migration,
> > this should be delayed to KVM_REQ_GET_NESTED_STATE_PAGES
> > request.
> 
> FWIW, I still object to this approach, and this patch has a plethora of issues.
> 
> I'm not against deferring various state loading to KVM_RUN, but wholesale moving
> all of GUEST_CR3 processing without in-depth consideration of all the side
> effects is a really bad idea.
It could be; I won't argue about this.

> 
> > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > ---
> >  arch/x86/kvm/vmx/nested.c | 14 +++++++++-----
> >  1 file changed, 9 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index fd334e4aa6db..b44f1f6b68db 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -2564,11 +2564,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
> >  		return -EINVAL;
> >  	}
> >  
> > -	/* Shadow page tables on either EPT or shadow page tables. */
> > -	if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3, nested_cpu_has_ept(vmcs12),
> > -				entry_failure_code))
> > -		return -EINVAL;
> > -
> >  	/*
> >  	 * Immediately write vmcs02.GUEST_CR3.  It will be propagated to vmcs12
> >  	 * on nested VM-Exit, which can occur without actually running L2 and
> > @@ -3109,11 +3104,16 @@ static bool nested_get_evmcs_page(struct kvm_vcpu *vcpu)
> >  static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
> >  {
> >  	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > +	enum vm_entry_failure_code entry_failure_code;
> >  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> >  	struct kvm_host_map *map;
> >  	struct page *page;
> >  	u64 hpa;
> >  
> > +	if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3, nested_cpu_has_ept(vmcs12),
> > +				&entry_failure_code))
> 
> This results in KVM_RUN returning 0 without filling vcpu->run->exit_reason.
> Speaking from experience, debugging those types of issues is beyond painful.
> 
> It also means CR3 is double loaded in the from_vmentry case.
> 
> And it will cause KVM to incorrectly return NVMX_VMENTRY_KVM_INTERNAL_ERROR
> if a consistency check fails when nested_get_vmcs12_pages() is called on
> from_vmentry.  E.g. run unit tests with this and it will silently disappear.

I do remember now that you said something about this, but I wasn't able
to find it in my email. Sorry about that.
I agree with you.

I think the question I should ask is why we really need to
delay accessing guest memory after a migration.

So far I mostly just assumed that we need to do so, thinking that qemu
updates the memslots, or that guest memory
isn't fully migrated yet and relies on post-copy to finish it.

Also, I am not against leaving the CR3 processing here and doing only the
PDPTR load in KVM_RUN (and only when the *SREGS2 API is not used).
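
For example, something along these lines -- a rough sketch, where
pdptrs_from_userspace is a hypothetical per-vCPU flag that KVM_SET_SREGS2
would set when it supplied the PDPTRs directly:

/*
 * Rough sketch: defer only the PDPTR load to KVM_REQ_GET_NESTED_STATE_PAGES
 * and skip it when userspace already provided the PDPTRs via KVM_SET_SREGS2.
 */
static bool nested_get_vmcs12_pages_sketch(struct kvm_vcpu *vcpu)
{
	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);

	if (!nested_cpu_has_ept(vmcs12) && is_pae_paging(vcpu) &&
	    !vcpu->arch.pdptrs_from_userspace) {
		if (!load_pdptrs(vcpu, vcpu->arch.walk_mmu, vmcs12->guest_cr3)) {
			/* Unlike from_vmentry, here the exit reason must be filled. */
			vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
			vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_EMULATION;
			vcpu->run->internal.ndata = 0;
			return false;
		}
	}

	/* ... the rest of nested_get_vmcs12_pages() as today ... */
	return true;
}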

> 
> diff --git a/x86/vmx_tests.c b/x86/vmx_tests.c
> index bbb006a..b8ccc69 100644
> --- a/x86/vmx_tests.c
> +++ b/x86/vmx_tests.c
> @@ -8172,6 +8172,16 @@ static void test_guest_segment_base_addr_fields(void)
>         vmcs_write(GUEST_AR_ES, ar_saved);
>  }
> 
> +static void test_guest_cr3(void)
> +{
> +       u64 cr3_saved = vmcs_read(GUEST_CR3);
> +
> +       vmcs_write(GUEST_CR3, -1ull);
> +       test_guest_state("Bad CR3 fails VM-Enter", true, -1ull, "GUEST_CR3");
> +
> +       vmcs_write(GUEST_CR3, cr3_saved);
> +}
> +
Could you send this test to kvm-unit-tests?

>  /*
>   * Check that the virtual CPU checks the VMX Guest State Area as
>   * documented in the Intel SDM.
> @@ -8181,6 +8191,8 @@ static void vmx_guest_state_area_test(void)
>         vmx_set_test_stage(1);
>         test_set_guest(guest_state_test_main);
> 
> +       test_guest_cr3();
> +
>         /*
>          * The IA32_SYSENTER_ESP field and the IA32_SYSENTER_EIP field
>          * must each contain a canonical address.
> 
> 
> > +		return false;
> > +
> >  	if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) {
> >  		/*
> >  		 * Translate L1 physical address to host physical
> > @@ -3357,6 +3357,10 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
> >  	}
> >  
> >  	if (from_vmentry) {
> > +		if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3,
> > +		    nested_cpu_has_ept(vmcs12), &entry_failure_code))
> 
> This alignment is messed up; it looks like two separate function calls.
Sorry about this, I see it now.
> 
> > +			goto vmentry_fail_vmexit_guest_mode;
> > +
> >  		failed_index = nested_vmx_load_msr(vcpu,
> >  						   vmcs12->vm_entry_msr_load_addr,
> >  						   vmcs12->vm_entry_msr_load_count);
> > -- 
> > 2.26.2
> > 


Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2021-04-06 10:23 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-01 14:18 [PATCH 0/6] Introduce KVM_{GET|SET}_SREGS2 and fix PDPTR migration Maxim Levitsky
2021-04-01 14:18 ` [PATCH 1/6] KVM: nVMX: delay loading of PDPTRs to KVM_REQ_GET_NESTED_STATE_PAGES Maxim Levitsky
2021-04-01 14:31   ` Paolo Bonzini
2021-04-02 17:27   ` Sean Christopherson
2021-04-06 10:22     ` Maxim Levitsky
2021-04-01 14:18 ` [PATCH 2/6] KVM: nSVM: call nested_svm_load_cr3 on nested state load Maxim Levitsky
2021-04-01 14:31   ` Paolo Bonzini
2021-04-01 14:18 ` [PATCH 3/6] KVM: x86: introduce kvm_register_clear_available Maxim Levitsky
2021-04-01 14:31   ` Paolo Bonzini
2021-04-05 17:03   ` Sean Christopherson
2021-04-01 14:18 ` [PATCH 4/6] KVM: x86: Introduce KVM_GET_SREGS2 / KVM_SET_SREGS2 Maxim Levitsky
2021-04-01 14:44   ` Paolo Bonzini
2021-04-01 17:10     ` Maxim Levitsky
2021-04-01 17:33       ` Paolo Bonzini
2021-04-01 14:18 ` [PATCH 5/6] KVM: nSVM: avoid loading PDPTRs after migration when possible Maxim Levitsky
2021-04-05 17:01   ` Sean Christopherson
2021-04-06 10:12     ` Maxim Levitsky
2021-04-01 14:18 ` [PATCH 6/6] KVM: nVMX: " Maxim Levitsky
