* [PATCH v2 0/7] KVM: nVMX: Fixes for nested state migration when eVMCS is in use
@ 2021-05-17 13:50 Vitaly Kuznetsov
  2021-05-17 13:50 ` [PATCH v2 1/7] KVM: nVMX: Introduce nested_evmcs_is_used() Vitaly Kuznetsov
                   ` (8 more replies)
  0 siblings, 9 replies; 36+ messages in thread
From: Vitaly Kuznetsov @ 2021-05-17 13:50 UTC (permalink / raw)
  To: kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Maxim Levitsky,
	linux-kernel

Changes since v1 (Sean):
- Drop now-unneeded curly braces in nested_sync_vmcs12_to_shadow().
- Pass 'evmcs->hv_clean_fields' instead of 'bool from_vmentry' to
  copy_enlightened_to_vmcs12().

Commit f5c7e8425f18 ("KVM: nVMX: Always make an attempt to map eVMCS after
migration") fixed the most obvious reason why Hyper-V on KVM (e.g. Win10
 + WSL2) was crashing immediately after migration. It was also reported
that we have more issues to fix: while the failure rate was lowered
significantly, it was still possible to observe crashes after several
dozen migrations. Turns out, the issue arises when we manage to issue
KVM_GET_NESTED_STATE right after an L2->L1 VMEXIT but before L1 gets a
chance to run. This state is tracked with the 'need_vmcs12_to_shadow_sync'
flag but the flag itself is not part of the saved nested state. A few
other less significant issues are fixed along the way.
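
For illustration only (not part of the series): the problematic ordering
boils down to roughly the following userspace sequence with no KVM_RUN in
between; the fd names and buffer size here are made up:

#include <sys/ioctl.h>
#include <linux/kvm.h>

static void migrate_nested_state(int src_vcpu_fd, int dst_vcpu_fd)
{
	struct {
		struct kvm_nested_state state;
		char payload[16384];	/* assumed to be large enough */
	} buf = { .state.size = sizeof(buf) };

	/* source side: right after an L2->L1 VMEXIT, before L1 runs again */
	ioctl(src_vcpu_fd, KVM_GET_NESTED_STATE, &buf);

	/*
	 * 'need_vmcs12_to_shadow_sync' is internal to KVM and not part of
	 * the saved state, so the pending VMCS12 -> eVMCS sync is lost.
	 */
	ioctl(dst_vcpu_fd, KVM_SET_NESTED_STATE, &buf);
}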

While there's no proof this series fixes all eVMCS related problems,
Win10+WSL2 was able to survive 3333 (thanks, Max!) migrations without
crashing in testing.

Patches are based on the current kvm/next tree.

Vitaly Kuznetsov (7):
  KVM: nVMX: Introduce nested_evmcs_is_used()
  KVM: nVMX: Release enlightened VMCS on VMCLEAR
  KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in
    vmx_get_nested_state()
  KVM: nVMX: Force enlightened VMCS sync from nested_vmx_failValid()
  KVM: nVMX: Reset eVMCS clean fields data from prepare_vmcs02()
  KVM: nVMX: Request to sync eVMCS from VMCS12 after migration
  KVM: selftests: evmcs_test: Test that KVM_STATE_NESTED_EVMCS is never
    lost

 arch/x86/kvm/vmx/nested.c                     | 110 ++++++++++++------
 .../testing/selftests/kvm/x86_64/evmcs_test.c |  64 +++++-----
 2 files changed, 115 insertions(+), 59 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH v2 1/7] KVM: nVMX: Introduce nested_evmcs_is_used()
  2021-05-17 13:50 [PATCH v2 0/7] KVM: nVMX: Fixes for nested state migration when eVMCS is in use Vitaly Kuznetsov
@ 2021-05-17 13:50 ` Vitaly Kuznetsov
  2021-05-24 12:11   ` Maxim Levitsky
  2021-05-24 13:54   ` Paolo Bonzini
  2021-05-17 13:50 ` [PATCH v2 2/7] KVM: nVMX: Release enlightened VMCS on VMCLEAR Vitaly Kuznetsov
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 36+ messages in thread
From: Vitaly Kuznetsov @ 2021-05-17 13:50 UTC (permalink / raw)
  To: kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Maxim Levitsky,
	linux-kernel

Unlike regular set_current_vmptr(), nested_vmx_handle_enlightened_vmptrld()
can not be called directly from vmx_set_nested_state() as KVM may not have
all the information yet (e.g. HV_X64_MSR_VP_ASSIST_PAGE MSR may not be
restored yet). Enlightened VMCS is mapped later while getting nested state
pages. In the meantime, vmx->nested.hv_evmcs remains NULL and using it
for various checks is incorrect. In particular, if KVM_GET_NESTED_STATE is
called right after KVM_SET_NESTED_STATE, KVM_STATE_NESTED_EVMCS flag in the
resulting state will be unset (and such state will later fail to load).

Introduce nested_evmcs_is_used() and use 'is_guest_mode(vcpu) &&
vmx->nested.current_vmptr == -1ull' check to detect not-yet-mapped eVMCS
after restore.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 arch/x86/kvm/vmx/nested.c | 31 ++++++++++++++++++++++++++-----
 1 file changed, 26 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 6058a65a6ede..3080e00c8f90 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -141,6 +141,27 @@ static void init_vmcs_shadow_fields(void)
 	max_shadow_read_write_fields = j;
 }
 
+static inline bool nested_evmcs_is_used(struct vcpu_vmx *vmx)
+{
+	struct kvm_vcpu *vcpu = &vmx->vcpu;
+
+	if (vmx->nested.hv_evmcs)
+		return true;
+
+	/*
+	 * After KVM_SET_NESTED_STATE, enlightened VMCS is mapped during
+	 * KVM_REQ_GET_NESTED_STATE_PAGES handling and until the request is
+	 * processed vmx->nested.hv_evmcs is NULL. It is, however, possible to
+	 * detect such state by checking 'nested.current_vmptr == -1ull' when
+	 * vCPU is in guest mode, it is only possible with eVMCS.
+	 */
+	if (unlikely(vmx->nested.enlightened_vmcs_enabled && is_guest_mode(vcpu) &&
+		     (vmx->nested.current_vmptr == -1ull)))
+		return true;
+
+	return false;
+}
+
 /*
  * The following 3 functions, nested_vmx_succeed()/failValid()/failInvalid(),
  * set the success or error code of an emulated VMX instruction (as specified
@@ -187,7 +208,7 @@ static int nested_vmx_fail(struct kvm_vcpu *vcpu, u32 vm_instruction_error)
 	 * failValid writes the error number to the current VMCS, which
 	 * can't be done if there isn't a current VMCS.
 	 */
-	if (vmx->nested.current_vmptr == -1ull && !vmx->nested.hv_evmcs)
+	if (vmx->nested.current_vmptr == -1ull && !nested_evmcs_is_used(vmx))
 		return nested_vmx_failInvalid(vcpu);
 
 	return nested_vmx_failValid(vcpu, vm_instruction_error);
@@ -2208,7 +2229,7 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
 	u32 exec_control;
 	u64 guest_efer = nested_vmx_calc_efer(vmx, vmcs12);
 
-	if (vmx->nested.dirty_vmcs12 || vmx->nested.hv_evmcs)
+	if (vmx->nested.dirty_vmcs12 || nested_evmcs_is_used(vmx))
 		prepare_vmcs02_early_rare(vmx, vmcs12);
 
 	/*
@@ -3437,7 +3458,7 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
 
 	load_vmcs12_host_state(vcpu, vmcs12);
 	vmcs12->vm_exit_reason = exit_reason.full;
-	if (enable_shadow_vmcs || vmx->nested.hv_evmcs)
+	if (enable_shadow_vmcs || nested_evmcs_is_used(vmx))
 		vmx->nested.need_vmcs12_to_shadow_sync = true;
 	return NVMX_VMENTRY_VMEXIT;
 }
@@ -4032,7 +4053,7 @@ static void sync_vmcs02_to_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
-	if (vmx->nested.hv_evmcs)
+	if (nested_evmcs_is_used(vmx))
 		sync_vmcs02_to_vmcs12_rare(vcpu, vmcs12);
 
 	vmx->nested.need_sync_vmcs02_to_vmcs12_rare = !vmx->nested.hv_evmcs;
@@ -6056,7 +6077,7 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu,
 		if (vmx_has_valid_vmcs12(vcpu)) {
 			kvm_state.size += sizeof(user_vmx_nested_state->vmcs12);
 
-			if (vmx->nested.hv_evmcs)
+			if (nested_evmcs_is_used(vmx))
 				kvm_state.flags |= KVM_STATE_NESTED_EVMCS;
 
 			if (is_guest_mode(vcpu) &&
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 2/7] KVM: nVMX: Release enlightened VMCS on VMCLEAR
  2021-05-17 13:50 [PATCH v2 0/7] KVM: nVMX: Fixes for nested state migration when eVMCS is in use Vitaly Kuznetsov
  2021-05-17 13:50 ` [PATCH v2 1/7] KVM: nVMX: Introduce nested_evmcs_is_used() Vitaly Kuznetsov
@ 2021-05-17 13:50 ` Vitaly Kuznetsov
  2021-05-24 12:13   ` Maxim Levitsky
  2021-05-17 13:50 ` [PATCH v2 3/7] KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in vmx_get_nested_state() Vitaly Kuznetsov
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 36+ messages in thread
From: Vitaly Kuznetsov @ 2021-05-17 13:50 UTC (permalink / raw)
  To: kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Maxim Levitsky,
	linux-kernel

Unlike VMREAD/VMWRITE/VMPTRLD, VMCLEAR is a valid instruction when
enlightened VMCS is in use. TLFS has the following brief description:
"The L1 hypervisor can execute a VMCLEAR instruction to transition an
enlightened VMCS from the active to the non-active state". Normally,
this change can be ignored as unmapping active eVMCS can be postponed
until the next VMLAUNCH instruction but in case nested state is migrated
with KVM_GET_NESTED_STATE/KVM_SET_NESTED_STATE, keeping eVMCS mapped
may result in its synchronization with VMCS12 and this is incorrect:
L1 hypervisor is free to reuse inactive eVMCS memory for something else.

Inactive eVMCS after VMCLEAR can just be unmapped.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 arch/x86/kvm/vmx/nested.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 3080e00c8f90..ea2869d8b823 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -5008,6 +5008,8 @@ static int handle_vmclear(struct kvm_vcpu *vcpu)
 				     vmptr + offsetof(struct vmcs12,
 						      launch_state),
 				     &zero, sizeof(zero));
+	} else if (vmx->nested.hv_evmcs && vmptr == vmx->nested.hv_evmcs_vmptr) {
+		nested_release_evmcs(vcpu);
 	}
 
 	return nested_vmx_succeed(vcpu);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 3/7] KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in vmx_get_nested_state()
  2021-05-17 13:50 [PATCH v2 0/7] KVM: nVMX: Fixes for nested state migration when eVMCS is in use Vitaly Kuznetsov
  2021-05-17 13:50 ` [PATCH v2 1/7] KVM: nVMX: Introduce nested_evmcs_is_used() Vitaly Kuznetsov
  2021-05-17 13:50 ` [PATCH v2 2/7] KVM: nVMX: Release enlightened VMCS on VMCLEAR Vitaly Kuznetsov
@ 2021-05-17 13:50 ` Vitaly Kuznetsov
  2021-05-24 12:26   ` Maxim Levitsky
  2021-05-24 13:56   ` Paolo Bonzini
  2021-05-17 13:50 ` [PATCH v2 4/7] KVM: nVMX: Force enlightened VMCS sync from nested_vmx_failValid() Vitaly Kuznetsov
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 36+ messages in thread
From: Vitaly Kuznetsov @ 2021-05-17 13:50 UTC (permalink / raw)
  To: kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Maxim Levitsky,
	linux-kernel

'Clean fields' data from enlightened VMCS is only valid upon vmentry: the L1
hypervisor is not obliged to keep it up-to-date while it is mangling L2's
state; a KVM_GET_NESTED_STATE request may come at a wrong moment when actual
eVMCS changes are not yet reflected in 'hv_clean_fields'. As VMCS12 is used
as the source of ultimate truth upon migration, we must make sure we pick up
all the changes to the eVMCS and thus 'clean fields' data must be ignored.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 arch/x86/kvm/vmx/nested.c | 49 +++++++++++++++++++++++----------------
 1 file changed, 29 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index ea2869d8b823..ec476f64df73 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1607,7 +1607,7 @@ static void copy_vmcs12_to_shadow(struct vcpu_vmx *vmx)
 	vmcs_load(vmx->loaded_vmcs->vmcs);
 }
 
-static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx)
+static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx, u32 hv_clean_fields)
 {
 	struct vmcs12 *vmcs12 = vmx->nested.cached_vmcs12;
 	struct hv_enlightened_vmcs *evmcs = vmx->nested.hv_evmcs;
@@ -1616,7 +1616,7 @@ static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx)
 	vmcs12->tpr_threshold = evmcs->tpr_threshold;
 	vmcs12->guest_rip = evmcs->guest_rip;
 
-	if (unlikely(!(evmcs->hv_clean_fields &
+	if (unlikely(!(hv_clean_fields &
 		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_BASIC))) {
 		vmcs12->guest_rsp = evmcs->guest_rsp;
 		vmcs12->guest_rflags = evmcs->guest_rflags;
@@ -1624,23 +1624,23 @@ static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx)
 			evmcs->guest_interruptibility_info;
 	}
 
-	if (unlikely(!(evmcs->hv_clean_fields &
+	if (unlikely(!(hv_clean_fields &
 		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_PROC))) {
 		vmcs12->cpu_based_vm_exec_control =
 			evmcs->cpu_based_vm_exec_control;
 	}
 
-	if (unlikely(!(evmcs->hv_clean_fields &
+	if (unlikely(!(hv_clean_fields &
 		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_EXCPN))) {
 		vmcs12->exception_bitmap = evmcs->exception_bitmap;
 	}
 
-	if (unlikely(!(evmcs->hv_clean_fields &
+	if (unlikely(!(hv_clean_fields &
 		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_ENTRY))) {
 		vmcs12->vm_entry_controls = evmcs->vm_entry_controls;
 	}
 
-	if (unlikely(!(evmcs->hv_clean_fields &
+	if (unlikely(!(hv_clean_fields &
 		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_EVENT))) {
 		vmcs12->vm_entry_intr_info_field =
 			evmcs->vm_entry_intr_info_field;
@@ -1650,7 +1650,7 @@ static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx)
 			evmcs->vm_entry_instruction_len;
 	}
 
-	if (unlikely(!(evmcs->hv_clean_fields &
+	if (unlikely(!(hv_clean_fields &
 		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1))) {
 		vmcs12->host_ia32_pat = evmcs->host_ia32_pat;
 		vmcs12->host_ia32_efer = evmcs->host_ia32_efer;
@@ -1670,7 +1670,7 @@ static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx)
 		vmcs12->host_tr_selector = evmcs->host_tr_selector;
 	}
 
-	if (unlikely(!(evmcs->hv_clean_fields &
+	if (unlikely(!(hv_clean_fields &
 		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP1))) {
 		vmcs12->pin_based_vm_exec_control =
 			evmcs->pin_based_vm_exec_control;
@@ -1679,18 +1679,18 @@ static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx)
 			evmcs->secondary_vm_exec_control;
 	}
 
-	if (unlikely(!(evmcs->hv_clean_fields &
+	if (unlikely(!(hv_clean_fields &
 		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_IO_BITMAP))) {
 		vmcs12->io_bitmap_a = evmcs->io_bitmap_a;
 		vmcs12->io_bitmap_b = evmcs->io_bitmap_b;
 	}
 
-	if (unlikely(!(evmcs->hv_clean_fields &
+	if (unlikely(!(hv_clean_fields &
 		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP))) {
 		vmcs12->msr_bitmap = evmcs->msr_bitmap;
 	}
 
-	if (unlikely(!(evmcs->hv_clean_fields &
+	if (unlikely(!(hv_clean_fields &
 		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2))) {
 		vmcs12->guest_es_base = evmcs->guest_es_base;
 		vmcs12->guest_cs_base = evmcs->guest_cs_base;
@@ -1730,14 +1730,14 @@ static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx)
 		vmcs12->guest_tr_selector = evmcs->guest_tr_selector;
 	}
 
-	if (unlikely(!(evmcs->hv_clean_fields &
+	if (unlikely(!(hv_clean_fields &
 		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP2))) {
 		vmcs12->tsc_offset = evmcs->tsc_offset;
 		vmcs12->virtual_apic_page_addr = evmcs->virtual_apic_page_addr;
 		vmcs12->xss_exit_bitmap = evmcs->xss_exit_bitmap;
 	}
 
-	if (unlikely(!(evmcs->hv_clean_fields &
+	if (unlikely(!(hv_clean_fields &
 		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR))) {
 		vmcs12->cr0_guest_host_mask = evmcs->cr0_guest_host_mask;
 		vmcs12->cr4_guest_host_mask = evmcs->cr4_guest_host_mask;
@@ -1749,7 +1749,7 @@ static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx)
 		vmcs12->guest_dr7 = evmcs->guest_dr7;
 	}
 
-	if (unlikely(!(evmcs->hv_clean_fields &
+	if (unlikely(!(hv_clean_fields &
 		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER))) {
 		vmcs12->host_fs_base = evmcs->host_fs_base;
 		vmcs12->host_gs_base = evmcs->host_gs_base;
@@ -1759,13 +1759,13 @@ static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx)
 		vmcs12->host_rsp = evmcs->host_rsp;
 	}
 
-	if (unlikely(!(evmcs->hv_clean_fields &
+	if (unlikely(!(hv_clean_fields &
 		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_XLAT))) {
 		vmcs12->ept_pointer = evmcs->ept_pointer;
 		vmcs12->virtual_processor_id = evmcs->virtual_processor_id;
 	}
 
-	if (unlikely(!(evmcs->hv_clean_fields &
+	if (unlikely(!(hv_clean_fields &
 		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1))) {
 		vmcs12->vmcs_link_pointer = evmcs->vmcs_link_pointer;
 		vmcs12->guest_ia32_debugctl = evmcs->guest_ia32_debugctl;
@@ -3473,6 +3473,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
 	enum nvmx_vmentry_status status;
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u32 interrupt_shadow = vmx_get_interrupt_shadow(vcpu);
+	struct hv_enlightened_vmcs *evmcs;
 	enum nested_evmptrld_status evmptrld_status;
 
 	++vcpu->stat.nested_run;
@@ -3488,7 +3489,8 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
 		return nested_vmx_failInvalid(vcpu);
 	}
 
-	if (CC(!vmx->nested.hv_evmcs && vmx->nested.current_vmptr == -1ull))
+	evmcs = vmx->nested.hv_evmcs;
+	if (CC(!evmcs && vmx->nested.current_vmptr == -1ull))
 		return nested_vmx_failInvalid(vcpu);
 
 	vmcs12 = get_vmcs12(vcpu);
@@ -3502,8 +3504,8 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
 	if (CC(vmcs12->hdr.shadow_vmcs))
 		return nested_vmx_failInvalid(vcpu);
 
-	if (vmx->nested.hv_evmcs) {
-		copy_enlightened_to_vmcs12(vmx);
+	if (evmcs) {
+		copy_enlightened_to_vmcs12(vmx, evmcs->hv_clean_fields);
 		/* Enlightened VMCS doesn't have launch state */
 		vmcs12->launch_state = !launch;
 	} else if (enable_shadow_vmcs) {
@@ -6136,7 +6138,14 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu,
 		copy_vmcs02_to_vmcs12_rare(vcpu, get_vmcs12(vcpu));
 		if (!vmx->nested.need_vmcs12_to_shadow_sync) {
 			if (vmx->nested.hv_evmcs)
-				copy_enlightened_to_vmcs12(vmx);
+				/*
+				 * L1 hypervisor is not obliged to keep eVMCS
+				 * clean fields data always up-to-date while
+				 * not in guest mode, 'hv_clean_fields' is only
+				 * supposed to be actual upon vmentry so we need
+				 * to ignore it here and do full copy.
+				 */
+				copy_enlightened_to_vmcs12(vmx, 0);
 			else if (enable_shadow_vmcs)
 				copy_shadow_to_vmcs12(vmx);
 		}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 4/7] KVM: nVMX: Force enlightened VMCS sync from nested_vmx_failValid()
  2021-05-17 13:50 [PATCH v2 0/7] KVM: nVMX: Fixes for nested state migration when eVMCS is in use Vitaly Kuznetsov
                   ` (2 preceding siblings ...)
  2021-05-17 13:50 ` [PATCH v2 3/7] KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in vmx_get_nested_state() Vitaly Kuznetsov
@ 2021-05-17 13:50 ` Vitaly Kuznetsov
  2021-05-24 12:27   ` Maxim Levitsky
  2021-05-17 13:50 ` [PATCH v2 5/7] KVM: nVMX: Reset eVMCS clean fields data from prepare_vmcs02() Vitaly Kuznetsov
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 36+ messages in thread
From: Vitaly Kuznetsov @ 2021-05-17 13:50 UTC (permalink / raw)
  To: kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Maxim Levitsky,
	linux-kernel

'need_vmcs12_to_shadow_sync' is used for both shadow and enlightened
VMCS sync when we exit to L1. The comment in nested_vmx_failValid()
validly states why shadow vmcs sync can be omitted but this doesn't
apply to enlightened VMCS as it 'shadows' all VMCS12 fields.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 arch/x86/kvm/vmx/nested.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index ec476f64df73..eb2d25a93356 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -194,9 +194,13 @@ static int nested_vmx_failValid(struct kvm_vcpu *vcpu,
 			| X86_EFLAGS_ZF);
 	get_vmcs12(vcpu)->vm_instruction_error = vm_instruction_error;
 	/*
-	 * We don't need to force a shadow sync because
-	 * VM_INSTRUCTION_ERROR is not shadowed
+	 * We don't need to force sync to shadow VMCS because
+	 * VM_INSTRUCTION_ERROR is not shadowed. Enlightened VMCS 'shadows' all
+	 * fields and thus must be synced.
 	 */
+	if (nested_evmcs_is_used(to_vmx(vcpu)))
+		to_vmx(vcpu)->nested.need_vmcs12_to_shadow_sync = true;
+
 	return kvm_skip_emulated_instruction(vcpu);
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 5/7] KVM: nVMX: Reset eVMCS clean fields data from prepare_vmcs02()
  2021-05-17 13:50 [PATCH v2 0/7] KVM: nVMX: Fixes for nested state migration when eVMCS is in use Vitaly Kuznetsov
                   ` (3 preceding siblings ...)
  2021-05-17 13:50 ` [PATCH v2 4/7] KVM: nVMX: Force enlightened VMCS sync from nested_vmx_failValid() Vitaly Kuznetsov
@ 2021-05-17 13:50 ` Vitaly Kuznetsov
  2021-05-24 12:34   ` Maxim Levitsky
  2021-05-17 13:50 ` [PATCH v2 6/7] KVM: nVMX: Request to sync eVMCS from VMCS12 after migration Vitaly Kuznetsov
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 36+ messages in thread
From: Vitaly Kuznetsov @ 2021-05-17 13:50 UTC (permalink / raw)
  To: kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Maxim Levitsky,
	linux-kernel

When nested state migration happens during L1's execution, it
is incorrect to modify eVMCS as it is L1 who 'owns' it at the moment.
At least genuine Hyper-V seems to not be very happy when 'clean fields'
data changes underneath it.

'Clean fields' data is used in KVM twice: by copy_enlightened_to_vmcs12()
and prepare_vmcs02_rare() so we can reset it from prepare_vmcs02() instead.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 arch/x86/kvm/vmx/nested.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index eb2d25a93356..3bfbf991bf45 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2081,14 +2081,10 @@ void nested_sync_vmcs12_to_shadow(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
-	if (vmx->nested.hv_evmcs) {
+	if (vmx->nested.hv_evmcs)
 		copy_vmcs12_to_enlightened(vmx);
-		/* All fields are clean */
-		vmx->nested.hv_evmcs->hv_clean_fields |=
-			HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL;
-	} else {
+	else
 		copy_vmcs12_to_shadow(vmx);
-	}
 
 	vmx->nested.need_vmcs12_to_shadow_sync = false;
 }
@@ -2629,6 +2625,12 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 
 	kvm_rsp_write(vcpu, vmcs12->guest_rsp);
 	kvm_rip_write(vcpu, vmcs12->guest_rip);
+
+	/* Mark all fields as clean so L1 hypervisor can set what's dirty */
+	if (hv_evmcs)
+		vmx->nested.hv_evmcs->hv_clean_fields |=
+			HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL;
+
 	return 0;
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 6/7] KVM: nVMX: Request to sync eVMCS from VMCS12 after migration
  2021-05-17 13:50 [PATCH v2 0/7] KVM: nVMX: Fixes for nested state migration when eVMCS is in use Vitaly Kuznetsov
                   ` (4 preceding siblings ...)
  2021-05-17 13:50 ` [PATCH v2 5/7] KVM: nVMX: Reset eVMCS clean fields data from prepare_vmcs02() Vitaly Kuznetsov
@ 2021-05-17 13:50 ` Vitaly Kuznetsov
  2021-05-24 12:35   ` Maxim Levitsky
  2021-05-17 13:50 ` [PATCH v2 7/7] KVM: selftests: evmcs_test: Test that KVM_STATE_NESTED_EVMCS is never lost Vitaly Kuznetsov
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 36+ messages in thread
From: Vitaly Kuznetsov @ 2021-05-17 13:50 UTC (permalink / raw)
  To: kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Maxim Levitsky,
	linux-kernel

VMCS12 is used to keep the authoritative state during nested state
migration. In case the 'need_vmcs12_to_shadow_sync' flag is set, we're
in between an L2->L1 vmexit and the next L1 guest run, when the actual
sync to the enlightened (or shadow) VMCS happens. Nested state, however,
has no flag for 'need_vmcs12_to_shadow_sync' so vmx_set_nested_state()->
set_current_vmptr() always sets it. The enlightened vmptrld path, on the
other hand, doesn't have this quirk, so some VMCS12 changes may not get
properly reflected to the eVMCS and L1 will see an incorrect state.

Note, during L2 execution or when need_vmcs12_to_shadow_sync is not
set the change is effectively a nop: in the former case all changes
will get reflected during the first L2->L1 vmexit and in the latter
case VMCS12 and eVMCS are already in sync (thanks to
copy_enlightened_to_vmcs12() in vmx_get_nested_state()).

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 arch/x86/kvm/vmx/nested.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 3bfbf991bf45..a0dedd413a23 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3127,6 +3127,12 @@ static bool nested_get_evmcs_page(struct kvm_vcpu *vcpu)
 		if (evmptrld_status == EVMPTRLD_VMFAIL ||
 		    evmptrld_status == EVMPTRLD_ERROR)
 			return false;
+
+		/*
+		 * Post migration VMCS12 always provides the most actual
+		 * information, copy it to eVMCS upon entry.
+		 */
+		vmx->nested.need_vmcs12_to_shadow_sync = true;
 	}
 
 	return true;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 7/7] KVM: selftests: evmcs_test: Test that KVM_STATE_NESTED_EVMCS is never lost
  2021-05-17 13:50 [PATCH v2 0/7] KVM: nVMX: Fixes for nested state migration when eVMCS is in use Vitaly Kuznetsov
                   ` (5 preceding siblings ...)
  2021-05-17 13:50 ` [PATCH v2 6/7] KVM: nVMX: Request to sync eVMCS from VMCS12 after migration Vitaly Kuznetsov
@ 2021-05-17 13:50 ` Vitaly Kuznetsov
  2021-05-24 12:36   ` Maxim Levitsky
  2021-05-24 12:08 ` [PATCH v2 0/7] KVM: nVMX: Fixes for nested state migration when eVMCS is in use Maxim Levitsky
  2021-05-24 14:01 ` Paolo Bonzini
  8 siblings, 1 reply; 36+ messages in thread
From: Vitaly Kuznetsov @ 2021-05-17 13:50 UTC (permalink / raw)
  To: kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Maxim Levitsky,
	linux-kernel

Do KVM_GET_NESTED_STATE/KVM_SET_NESTED_STATE for a freshly restored VM
(before the first KVM_RUN) to check that KVM_STATE_NESTED_EVMCS is not
lost.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 .../testing/selftests/kvm/x86_64/evmcs_test.c | 64 +++++++++++--------
 1 file changed, 38 insertions(+), 26 deletions(-)

diff --git a/tools/testing/selftests/kvm/x86_64/evmcs_test.c b/tools/testing/selftests/kvm/x86_64/evmcs_test.c
index 63096cea26c6..fcef347a681a 100644
--- a/tools/testing/selftests/kvm/x86_64/evmcs_test.c
+++ b/tools/testing/selftests/kvm/x86_64/evmcs_test.c
@@ -121,14 +121,38 @@ void inject_nmi(struct kvm_vm *vm)
 	vcpu_events_set(vm, VCPU_ID, &events);
 }
 
+static void save_restore_vm(struct kvm_vm *vm)
+{
+	struct kvm_regs regs1, regs2;
+	struct kvm_x86_state *state;
+
+	state = vcpu_save_state(vm, VCPU_ID);
+	memset(&regs1, 0, sizeof(regs1));
+	vcpu_regs_get(vm, VCPU_ID, &regs1);
+
+	kvm_vm_release(vm);
+
+	/* Restore state in a new VM.  */
+	kvm_vm_restart(vm, O_RDWR);
+	vm_vcpu_add(vm, VCPU_ID);
+	vcpu_set_hv_cpuid(vm, VCPU_ID);
+	vcpu_enable_evmcs(vm, VCPU_ID);
+	vcpu_load_state(vm, VCPU_ID, state);
+	free(state);
+
+	memset(&regs2, 0, sizeof(regs2));
+	vcpu_regs_get(vm, VCPU_ID, &regs2);
+	TEST_ASSERT(!memcmp(&regs1, &regs2, sizeof(regs2)),
+		    "Unexpected register values after vcpu_load_state; rdi: %lx rsi: %lx",
+		    (ulong) regs2.rdi, (ulong) regs2.rsi);
+}
+
 int main(int argc, char *argv[])
 {
 	vm_vaddr_t vmx_pages_gva = 0;
 
-	struct kvm_regs regs1, regs2;
 	struct kvm_vm *vm;
 	struct kvm_run *run;
-	struct kvm_x86_state *state;
 	struct ucall uc;
 	int stage;
 
@@ -145,10 +169,6 @@ int main(int argc, char *argv[])
 	vcpu_set_hv_cpuid(vm, VCPU_ID);
 	vcpu_enable_evmcs(vm, VCPU_ID);
 
-	run = vcpu_state(vm, VCPU_ID);
-
-	vcpu_regs_get(vm, VCPU_ID, &regs1);
-
 	vcpu_alloc_vmx(vm, &vmx_pages_gva);
 	vcpu_args_set(vm, VCPU_ID, 1, vmx_pages_gva);
 
@@ -160,6 +180,7 @@ int main(int argc, char *argv[])
 	pr_info("Running L1 which uses EVMCS to run L2\n");
 
 	for (stage = 1;; stage++) {
+		run = vcpu_state(vm, VCPU_ID);
 		_vcpu_run(vm, VCPU_ID);
 		TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
 			    "Stage %d: unexpected exit reason: %u (%s),\n",
@@ -184,32 +205,23 @@ int main(int argc, char *argv[])
 			    uc.args[1] == stage, "Stage %d: Unexpected register values vmexit, got %lx",
 			    stage, (ulong)uc.args[1]);
 
-		state = vcpu_save_state(vm, VCPU_ID);
-		memset(&regs1, 0, sizeof(regs1));
-		vcpu_regs_get(vm, VCPU_ID, &regs1);
-
-		kvm_vm_release(vm);
-
-		/* Restore state in a new VM.  */
-		kvm_vm_restart(vm, O_RDWR);
-		vm_vcpu_add(vm, VCPU_ID);
-		vcpu_set_hv_cpuid(vm, VCPU_ID);
-		vcpu_enable_evmcs(vm, VCPU_ID);
-		vcpu_load_state(vm, VCPU_ID, state);
-		run = vcpu_state(vm, VCPU_ID);
-		free(state);
-
-		memset(&regs2, 0, sizeof(regs2));
-		vcpu_regs_get(vm, VCPU_ID, &regs2);
-		TEST_ASSERT(!memcmp(&regs1, &regs2, sizeof(regs2)),
-			    "Unexpected register values after vcpu_load_state; rdi: %lx rsi: %lx",
-			    (ulong) regs2.rdi, (ulong) regs2.rsi);
+		save_restore_vm(vm);
 
 		/* Force immediate L2->L1 exit before resuming */
 		if (stage == 8) {
 			pr_info("Injecting NMI into L1 before L2 had a chance to run after restore\n");
 			inject_nmi(vm);
 		}
+
+		/*
+		 * Do KVM_GET_NESTED_STATE/KVM_SET_NESTED_STATE for a freshly
+		 * restored VM (before the first KVM_RUN) to check that
+		 * KVM_STATE_NESTED_EVMCS is not lost.
+		 */
+		if (stage == 9) {
+			pr_info("Trying extra KVM_GET_NESTED_STATE/KVM_SET_NESTED_STATE cycle\n");
+			save_restore_vm(vm);
+		}
 	}
 
 done:
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/7] KVM: nVMX: Fixes for nested state migration when eVMCS is in use
  2021-05-17 13:50 [PATCH v2 0/7] KVM: nVMX: Fixes for nested state migration when eVMCS is in use Vitaly Kuznetsov
                   ` (6 preceding siblings ...)
  2021-05-17 13:50 ` [PATCH v2 7/7] KVM: selftests: evmcs_test: Test that KVM_STATE_NESTED_EVMCS is never lost Vitaly Kuznetsov
@ 2021-05-24 12:08 ` Maxim Levitsky
  2021-05-24 12:44   ` Vitaly Kuznetsov
  2021-05-24 14:01 ` Paolo Bonzini
  8 siblings, 1 reply; 36+ messages in thread
From: Maxim Levitsky @ 2021-05-24 12:08 UTC (permalink / raw)
  To: Vitaly Kuznetsov, kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, linux-kernel

On Mon, 2021-05-17 at 15:50 +0200, Vitaly Kuznetsov wrote:
> Changes since v1 (Sean):
> - Drop now-unneeded curly braces in nested_sync_vmcs12_to_shadow().
> - Pass 'evmcs->hv_clean_fields' instead of 'bool from_vmentry' to
>   copy_enlightened_to_vmcs12().
> 
> Commit f5c7e8425f18 ("KVM: nVMX: Always make an attempt to map eVMCS after
> migration") fixed the most obvious reason why Hyper-V on KVM (e.g. Win10
>  + WSL2) was crashing immediately after migration. It was also reported
> that we have more issues to fix: while the failure rate was lowered
> significantly, it was still possible to observe crashes after several
> dozen migrations. Turns out, the issue arises when we manage to issue
> KVM_GET_NESTED_STATE right after an L2->L1 VMEXIT but before L1 gets a
> chance to run. This state is tracked with the 'need_vmcs12_to_shadow_sync'
> flag but the flag itself is not part of the saved nested state. A few
> other less significant issues are fixed along the way.
> 
> While there's no proof this series fixes all eVMCS related problems,
> Win10+WSL2 was able to survive 3333 (thanks, Max!) migrations without
> crashing in testing.
> 
> Patches are based on the current kvm/next tree.
> 
> Vitaly Kuznetsov (7):
>   KVM: nVMX: Introduce nested_evmcs_is_used()
>   KVM: nVMX: Release enlightened VMCS on VMCLEAR
>   KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in
>     vmx_get_nested_state()
>   KVM: nVMX: Force enlightened VMCS sync from nested_vmx_failValid()
>   KVM: nVMX: Reset eVMCS clean fields data from prepare_vmcs02()
>   KVM: nVMX: Request to sync eVMCS from VMCS12 after migration
>   KVM: selftests: evmcs_test: Test that KVM_STATE_NESTED_EVMCS is never
>     lost
> 
>  arch/x86/kvm/vmx/nested.c                     | 110 ++++++++++++------
>  .../testing/selftests/kvm/x86_64/evmcs_test.c |  64 +++++-----
>  2 files changed, 115 insertions(+), 59 deletions(-)
> 

Hi Vitaly!

In addition to the review of this patch series, I would like
to share an idea on how to avoid the hack of mapping the evmcs
in nested_vmx_vmexit, because I think I found a possible generic
solution to this and similar issues:

The solution is to always set nested_run_pending after 
nested migration (which means that we won't really
need to migrate this flag anymore).

I was thinking a lot about it and I think that there is no downside to this,
other than sometimes one extra vmexit after migration.

Otherwise there is always a risk of the following scenario:

  1. We migrate with nested_run_pending=0 (but don't restore all the state
  yet, like the HV_X64_MSR_VP_ASSIST_PAGE MSR, or the guest memory map is
  not up to date yet, the guest is in SMM, or something like that)

  2. Userspace calls some ioctl that causes a nested vmexit

  This can happen today if the userspace calls 
  kvm_arch_vcpu_ioctl_get_mpstate -> kvm_apic_accept_events -> kvm_check_nested_events

  3. Userspace finally sets the correct guest MSRs and the correct guest
  memory map and only then calls KVM_RUN

This means that at (2) we can't map and write the evmcs/vmcs12/vmcb12 even
if KVM_REQ_GET_NESTED_STATE_PAGES is pending,
but we have to do so to complete the nested vmexit.

To some extent, the entry to nested mode after a migration is only complete
when we process KVM_REQ_GET_NESTED_STATE_PAGES, so we shouldn't interrupt it.

This will allow us to avoid dealing with KVM_REQ_GET_NESTED_STATE_PAGES on
the nested vmexit path at all.
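
Rough sketch of the idea (untested; the helper name is made up and the
placement inside vmx_set_nested_state() is just an assumption):

static void nested_vmx_force_run_pending(struct vcpu_vmx *vmx,
					 const struct kvm_nested_state *kvm_state)
{
	/*
	 * When restoring a vCPU that was in guest mode, always treat the
	 * nested run as pending so that nothing can synthesize a nested
	 * vmexit before KVM_REQ_GET_NESTED_STATE_PAGES is processed.
	 */
	if (kvm_state->flags & KVM_STATE_NESTED_GUEST_MODE)
		vmx->nested.nested_run_pending = true;
}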

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 1/7] KVM: nVMX: Introduce nested_evmcs_is_used()
  2021-05-17 13:50 ` [PATCH v2 1/7] KVM: nVMX: Introduce nested_evmcs_is_used() Vitaly Kuznetsov
@ 2021-05-24 12:11   ` Maxim Levitsky
  2021-05-24 12:35     ` Vitaly Kuznetsov
  2021-05-24 13:54   ` Paolo Bonzini
  1 sibling, 1 reply; 36+ messages in thread
From: Maxim Levitsky @ 2021-05-24 12:11 UTC (permalink / raw)
  To: Vitaly Kuznetsov, kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, linux-kernel

On Mon, 2021-05-17 at 15:50 +0200, Vitaly Kuznetsov wrote:
> Unlike regular set_current_vmptr(), nested_vmx_handle_enlightened_vmptrld()
> can not be called directly from vmx_set_nested_state() as KVM may not have
> all the information yet (e.g. HV_X64_MSR_VP_ASSIST_PAGE MSR may not be
> restored yet). Enlightened VMCS is mapped later while getting nested state
> pages. In the meantime, vmx->nested.hv_evmcs remains NULL and using it
> for various checks is incorrect. In particular, if KVM_GET_NESTED_STATE is
> called right after KVM_SET_NESTED_STATE, KVM_STATE_NESTED_EVMCS flag in the
> resulting state will be unset (and such state will later fail to load).
> 
> Introduce nested_evmcs_is_used() and use 'is_guest_mode(vcpu) &&
> vmx->nested.current_vmptr == -1ull' check to detect not-yet-mapped eVMCS
> after restore.
> 
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 31 ++++++++++++++++++++++++++-----
>  1 file changed, 26 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 6058a65a6ede..3080e00c8f90 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -141,6 +141,27 @@ static void init_vmcs_shadow_fields(void)
>  	max_shadow_read_write_fields = j;
>  }
>  
> +static inline bool nested_evmcs_is_used(struct vcpu_vmx *vmx)
> +{
> +	struct kvm_vcpu *vcpu = &vmx->vcpu;
> +
> +	if (vmx->nested.hv_evmcs)
> +		return true;
> +
> +	/*
> +	 * After KVM_SET_NESTED_STATE, enlightened VMCS is mapped during
> +	 * KVM_REQ_GET_NESTED_STATE_PAGES handling and until the request is
> +	 * processed vmx->nested.hv_evmcs is NULL. It is, however, possible to
> +	 * detect such state by checking 'nested.current_vmptr == -1ull' when
> +	 * vCPU is in guest mode, it is only possible with eVMCS.
> +	 */
> +	if (unlikely(vmx->nested.enlightened_vmcs_enabled && is_guest_mode(vcpu) &&
> +		     (vmx->nested.current_vmptr == -1ull)))
> +		return true;
> +
> +	return false;
> +}


I think that this is a valid way to solve the issue,
but it feels like there might be a better way.
I don't mind accepting this patch as is, though.

So here are my 2 cents about this:

First of all after studying how evmcs works I take my words back
about needing to migrate its contents. 

It is indeed enough to migrate its physical address, 
or maybe even just a flag that evmcs is loaded
(and to my surprise we already do this - KVM_STATE_NESTED_EVMCS)

So how about just having a boolean flag that indicates that evmcs is in use, 
but doesn't imply that we know its address or that it is mapped 
to host address space, something like 'vmx->nested.enlightened_vmcs_loaded'

On migration that flag is saved and restored as KVM_STATE_NESTED_EVMCS;
otherwise it is set when we load an evmcs and cleared when it is released.

Then, as far as I can see, we can use this flag in nested_evmcs_is_used()
since none of its callers touch the evmcs itself and thus don't need it to
be mapped.
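
Something like this (just a sketch, the field name is my invention):

	/* in struct nested_vmx: set by the enlightened vmptrld path, cleared
	 * by nested_release_evmcs(), saved/restored via KVM_STATE_NESTED_EVMCS */
	bool enlightened_vmcs_loaded;

	static inline bool nested_evmcs_is_used(struct vcpu_vmx *vmx)
	{
		return vmx->nested.enlightened_vmcs_loaded;
	}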

What do you think?

Best regards,
	Maxim Levitsky





>  /*
>   * The following 3 functions, nested_vmx_succeed()/failValid()/failInvalid(),
>   * set the success or error code of an emulated VMX instruction (as specified
> @@ -187,7 +208,7 @@ static int nested_vmx_fail(struct kvm_vcpu *vcpu, u32 vm_instruction_error)
>  	 * failValid writes the error number to the current VMCS, which
>  	 * can't be done if there isn't a current VMCS.
>  	 */
> -	if (vmx->nested.current_vmptr == -1ull && !vmx->nested.hv_evmcs)
> +	if (vmx->nested.current_vmptr == -1ull && !nested_evmcs_is_used(vmx))
>  		return nested_vmx_failInvalid(vcpu);
>  
>  	return nested_vmx_failValid(vcpu, vm_instruction_error);
> @@ -2208,7 +2229,7 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
>  	u32 exec_control;
>  	u64 guest_efer = nested_vmx_calc_efer(vmx, vmcs12);
>  
> -	if (vmx->nested.dirty_vmcs12 || vmx->nested.hv_evmcs)
> +	if (vmx->nested.dirty_vmcs12 || nested_evmcs_is_used(vmx))
>  		prepare_vmcs02_early_rare(vmx, vmcs12);
>  
>  	/*
> @@ -3437,7 +3458,7 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
>  
>  	load_vmcs12_host_state(vcpu, vmcs12);
>  	vmcs12->vm_exit_reason = exit_reason.full;
> -	if (enable_shadow_vmcs || vmx->nested.hv_evmcs)
> +	if (enable_shadow_vmcs || nested_evmcs_is_used(vmx))
>  		vmx->nested.need_vmcs12_to_shadow_sync = true;
>  	return NVMX_VMENTRY_VMEXIT;
>  }
> @@ -4032,7 +4053,7 @@ static void sync_vmcs02_to_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  
> -	if (vmx->nested.hv_evmcs)
> +	if (nested_evmcs_is_used(vmx))
>  		sync_vmcs02_to_vmcs12_rare(vcpu, vmcs12);
>  
>  	vmx->nested.need_sync_vmcs02_to_vmcs12_rare = !vmx->nested.hv_evmcs;
> @@ -6056,7 +6077,7 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu,
>  		if (vmx_has_valid_vmcs12(vcpu)) {
>  			kvm_state.size += sizeof(user_vmx_nested_state->vmcs12);
>  
> -			if (vmx->nested.hv_evmcs)
> +			if (nested_evmcs_is_used(vmx))
>  				kvm_state.flags |= KVM_STATE_NESTED_EVMCS;
>  
>  			if (is_guest_mode(vcpu) &&






^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/7] KVM: nVMX: Release enlightened VMCS on VMCLEAR
  2021-05-17 13:50 ` [PATCH v2 2/7] KVM: nVMX: Release enlightened VMCS on VMCLEAR Vitaly Kuznetsov
@ 2021-05-24 12:13   ` Maxim Levitsky
  0 siblings, 0 replies; 36+ messages in thread
From: Maxim Levitsky @ 2021-05-24 12:13 UTC (permalink / raw)
  To: Vitaly Kuznetsov, kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, linux-kernel

On Mon, 2021-05-17 at 15:50 +0200, Vitaly Kuznetsov wrote:
> Unlike VMREAD/VMWRITE/VMPTRLD, VMCLEAR is a valid instruction when
> enlightened VMCS is in use. TLFS has the following brief description:
> "The L1 hypervisor can execute a VMCLEAR instruction to transition an
> enlightened VMCS from the active to the non-active state". Normally,
> this change can be ignored as unmapping active eVMCS can be postponed
> until the next VMLAUNCH instruction but in case nested state is migrated
> with KVM_GET_NESTED_STATE/KVM_SET_NESTED_STATE, keeping eVMCS mapped
> may result in its synchronization with VMCS12 and this is incorrect:
> L1 hypervisor is free to reuse inactive eVMCS memory for something else.
> 
> Inactive eVMCS after VMCLEAR can just be unmapped.
> 
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 3080e00c8f90..ea2869d8b823 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -5008,6 +5008,8 @@ static int handle_vmclear(struct kvm_vcpu *vcpu)
>  				     vmptr + offsetof(struct vmcs12,
>  						      launch_state),
>  				     &zero, sizeof(zero));
> +	} else if (vmx->nested.hv_evmcs && vmptr == vmx->nested.hv_evmcs_vmptr) {
> +		nested_release_evmcs(vcpu);
>  	}

Releasing the eVMCS is a good thing not only for migration but also,
for instance, for nested_vmx_fail() which could write to the current
eVMCS even outside of nested mode (there is a bug there but you fixed
it in patch 4).


Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky


>  
>  	return nested_vmx_succeed(vcpu);






^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 3/7] KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in vmx_get_nested_state()
  2021-05-17 13:50 ` [PATCH v2 3/7] KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in vmx_get_nested_state() Vitaly Kuznetsov
@ 2021-05-24 12:26   ` Maxim Levitsky
  2021-05-24 13:01     ` Vitaly Kuznetsov
  2021-05-24 13:56   ` Paolo Bonzini
  1 sibling, 1 reply; 36+ messages in thread
From: Maxim Levitsky @ 2021-05-24 12:26 UTC (permalink / raw)
  To: Vitaly Kuznetsov, kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, linux-kernel

On Mon, 2021-05-17 at 15:50 +0200, Vitaly Kuznetsov wrote:
> 'Clean fields' data from enlightened VMCS is only valid upon vmentry: the L1
> hypervisor is not obliged to keep it up-to-date while it is mangling L2's
> state; a KVM_GET_NESTED_STATE request may come at a wrong moment when actual
> eVMCS changes are not yet reflected in 'hv_clean_fields'. As VMCS12 is used
> as the source of ultimate truth upon migration, we must make sure we pick up
> all the changes to the eVMCS and thus 'clean fields' data must be ignored.
> 
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 49 +++++++++++++++++++++++----------------
>  1 file changed, 29 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index ea2869d8b823..ec476f64df73 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -1607,7 +1607,7 @@ static void copy_vmcs12_to_shadow(struct vcpu_vmx *vmx)
>  	vmcs_load(vmx->loaded_vmcs->vmcs);
>  }
>  
> -static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx)
> +static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx, u32 hv_clean_fields)
>  {
>  	struct vmcs12 *vmcs12 = vmx->nested.cached_vmcs12;
>  	struct hv_enlightened_vmcs *evmcs = vmx->nested.hv_evmcs;
> @@ -1616,7 +1616,7 @@ static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx)
>  	vmcs12->tpr_threshold = evmcs->tpr_threshold;
>  	vmcs12->guest_rip = evmcs->guest_rip;
>  
> -	if (unlikely(!(evmcs->hv_clean_fields &
> +	if (unlikely(!(hv_clean_fields &
>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_BASIC))) {
>  		vmcs12->guest_rsp = evmcs->guest_rsp;
>  		vmcs12->guest_rflags = evmcs->guest_rflags;
> @@ -1624,23 +1624,23 @@ static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx)
>  			evmcs->guest_interruptibility_info;
>  	}
>  
> -	if (unlikely(!(evmcs->hv_clean_fields &
> +	if (unlikely(!(hv_clean_fields &
>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_PROC))) {
>  		vmcs12->cpu_based_vm_exec_control =
>  			evmcs->cpu_based_vm_exec_control;
>  	}
>  
> -	if (unlikely(!(evmcs->hv_clean_fields &
> +	if (unlikely(!(hv_clean_fields &
>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_EXCPN))) {
>  		vmcs12->exception_bitmap = evmcs->exception_bitmap;
>  	}
>  
> -	if (unlikely(!(evmcs->hv_clean_fields &
> +	if (unlikely(!(hv_clean_fields &
>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_ENTRY))) {
>  		vmcs12->vm_entry_controls = evmcs->vm_entry_controls;
>  	}
>  
> -	if (unlikely(!(evmcs->hv_clean_fields &
> +	if (unlikely(!(hv_clean_fields &
>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_EVENT))) {
>  		vmcs12->vm_entry_intr_info_field =
>  			evmcs->vm_entry_intr_info_field;
> @@ -1650,7 +1650,7 @@ static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx)
>  			evmcs->vm_entry_instruction_len;
>  	}
>  
> -	if (unlikely(!(evmcs->hv_clean_fields &
> +	if (unlikely(!(hv_clean_fields &
>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1))) {
>  		vmcs12->host_ia32_pat = evmcs->host_ia32_pat;
>  		vmcs12->host_ia32_efer = evmcs->host_ia32_efer;
> @@ -1670,7 +1670,7 @@ static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx)
>  		vmcs12->host_tr_selector = evmcs->host_tr_selector;
>  	}
>  
> -	if (unlikely(!(evmcs->hv_clean_fields &
> +	if (unlikely(!(hv_clean_fields &
>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP1))) {
>  		vmcs12->pin_based_vm_exec_control =
>  			evmcs->pin_based_vm_exec_control;
> @@ -1679,18 +1679,18 @@ static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx)
>  			evmcs->secondary_vm_exec_control;
>  	}
>  
> -	if (unlikely(!(evmcs->hv_clean_fields &
> +	if (unlikely(!(hv_clean_fields &
>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_IO_BITMAP))) {
>  		vmcs12->io_bitmap_a = evmcs->io_bitmap_a;
>  		vmcs12->io_bitmap_b = evmcs->io_bitmap_b;
>  	}
>  
> -	if (unlikely(!(evmcs->hv_clean_fields &
> +	if (unlikely(!(hv_clean_fields &
>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP))) {
>  		vmcs12->msr_bitmap = evmcs->msr_bitmap;
>  	}
>  
> -	if (unlikely(!(evmcs->hv_clean_fields &
> +	if (unlikely(!(hv_clean_fields &
>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2))) {
>  		vmcs12->guest_es_base = evmcs->guest_es_base;
>  		vmcs12->guest_cs_base = evmcs->guest_cs_base;
> @@ -1730,14 +1730,14 @@ static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx)
>  		vmcs12->guest_tr_selector = evmcs->guest_tr_selector;
>  	}
>  
> -	if (unlikely(!(evmcs->hv_clean_fields &
> +	if (unlikely(!(hv_clean_fields &
>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP2))) {
>  		vmcs12->tsc_offset = evmcs->tsc_offset;
>  		vmcs12->virtual_apic_page_addr = evmcs->virtual_apic_page_addr;
>  		vmcs12->xss_exit_bitmap = evmcs->xss_exit_bitmap;
>  	}
>  
> -	if (unlikely(!(evmcs->hv_clean_fields &
> +	if (unlikely(!(hv_clean_fields &
>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR))) {
>  		vmcs12->cr0_guest_host_mask = evmcs->cr0_guest_host_mask;
>  		vmcs12->cr4_guest_host_mask = evmcs->cr4_guest_host_mask;
> @@ -1749,7 +1749,7 @@ static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx)
>  		vmcs12->guest_dr7 = evmcs->guest_dr7;
>  	}
>  
> -	if (unlikely(!(evmcs->hv_clean_fields &
> +	if (unlikely(!(hv_clean_fields &
>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER))) {
>  		vmcs12->host_fs_base = evmcs->host_fs_base;
>  		vmcs12->host_gs_base = evmcs->host_gs_base;
> @@ -1759,13 +1759,13 @@ static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx)
>  		vmcs12->host_rsp = evmcs->host_rsp;
>  	}
>  
> -	if (unlikely(!(evmcs->hv_clean_fields &
> +	if (unlikely(!(hv_clean_fields &
>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_XLAT))) {
>  		vmcs12->ept_pointer = evmcs->ept_pointer;
>  		vmcs12->virtual_processor_id = evmcs->virtual_processor_id;
>  	}
>  
> -	if (unlikely(!(evmcs->hv_clean_fields &
> +	if (unlikely(!(hv_clean_fields &
>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1))) {
>  		vmcs12->vmcs_link_pointer = evmcs->vmcs_link_pointer;
>  		vmcs12->guest_ia32_debugctl = evmcs->guest_ia32_debugctl;
> @@ -3473,6 +3473,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
>  	enum nvmx_vmentry_status status;
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  	u32 interrupt_shadow = vmx_get_interrupt_shadow(vcpu);
> +	struct hv_enlightened_vmcs *evmcs;
>  	enum nested_evmptrld_status evmptrld_status;
>  
>  	++vcpu->stat.nested_run;
> @@ -3488,7 +3489,8 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
>  		return nested_vmx_failInvalid(vcpu);
>  	}
>  
> -	if (CC(!vmx->nested.hv_evmcs && vmx->nested.current_vmptr == -1ull))
> +	evmcs = vmx->nested.hv_evmcs;
> +	if (CC(!evmcs && vmx->nested.current_vmptr == -1ull))
>  		return nested_vmx_failInvalid(vcpu);
>  
>  	vmcs12 = get_vmcs12(vcpu);
> @@ -3502,8 +3504,8 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
>  	if (CC(vmcs12->hdr.shadow_vmcs))
>  		return nested_vmx_failInvalid(vcpu);
>  
> -	if (vmx->nested.hv_evmcs) {
> -		copy_enlightened_to_vmcs12(vmx);
> +	if (evmcs) {
> +		copy_enlightened_to_vmcs12(vmx, evmcs->hv_clean_fields);
>  		/* Enlightened VMCS doesn't have launch state */
>  		vmcs12->launch_state = !launch;
>  	} else if (enable_shadow_vmcs) {
> @@ -6136,7 +6138,14 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu,
>  		copy_vmcs02_to_vmcs12_rare(vcpu, get_vmcs12(vcpu));
>  		if (!vmx->nested.need_vmcs12_to_shadow_sync) {
>  			if (vmx->nested.hv_evmcs)
> -				copy_enlightened_to_vmcs12(vmx);
> +				/*
> +				 * L1 hypervisor is not obliged to keep eVMCS
> +				 * clean fields data always up-to-date while
> +				 * not in guest mode, 'hv_clean_fields' is only
> +				 * supposed to be actual upon vmentry so we need
> +				 * to ignore it here and do full copy.
> +				 */
> +				copy_enlightened_to_vmcs12(vmx, 0);

Hi Vitaly!

This patch has led me down a deep rabbit hole,
so I would like to share my thoughts about it:

Initially I thought that we should just drop the copy of the evmcs to vmcs12
instead, based on the following argument:
 
When L2 is running we don't copy it since the guest must not modify it
then, just like a normal vmcs, and the L2 state is in our vmcs02.
 
And when L1 is running, then essentially evmcs is just a guest memory.

Even when loaded by previous vm entry,
we barely touch that evmcs
(we only touch it to update the vmx instruction error),
and even for that we need to update the evmcs (vmcs12->evmcs) and not vice versa.

Reading whole evmcs while the L1 is running indeed 
feels wrong since at this point the L1 owns it.

However, there is another bug, which you fixed in patch 6 and which I sadly
rediscovered while reviewing this patch, that makes the above approach
impossible.
After a migration we won't be able to know if evmcs is more up to date,
or if vmcs12 is more up to date.

I was thinking that for later case (vmcs12 more up to date) it would be cleaner
to sync evmcs here and then drop the copy_enlightened_to_vmcs12 call.

However we can't do this since in the KVM_GET_NESTED_STATE handler we are
not allowed to dirty guest memory: at the point this ioctl is called,
userspace assumes that it has already copied all of the guest memory
and that this ioctl won't dirty it again.
(see commit fa58a9fa74979f845fc6c94a58501e67f3abb6de)

It is still possible to go with my suggestion though if we avoid using
need_vmcs12_to_shadow_sync for the evmcs and sync the evmcs right away
in the two places where it is needed:
on nested vmexit and when updating VM_INSTRUCTION_ERROR.
This shouldn't cause any performance regressions.
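
Roughly (untested, and assuming the evmcs is mapped at those points):

	/* in nested_vmx_vmexit(), instead of setting need_vmcs12_to_shadow_sync: */
	if (vmx->nested.hv_evmcs)
		copy_vmcs12_to_enlightened(vmx);

	/* in nested_vmx_failValid(): */
	if (vmx->nested.hv_evmcs)
		vmx->nested.hv_evmcs->vm_instruction_error = vm_instruction_error;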

Then we can just drop the copy_enlightened_to_vmcs12() call, drop this patch,
patch 5 and patch 6, and rely on the assumption that when we migrate L1,
whatever evmcs was previously in use was fully updated on the last nested
vmexit (and possibly modified by L1 since), while vmcs12 continues to hold
the state of the VM that we wrote during that last nested vmexit.

In regard to backward compatibility, evmcs nested migration
was very broken prior to these fixes anyway, thus the change
that I propose shouldn't make things worse.

What do you think?

Best regards,
	Maxim Levitsky



>  			else if (enable_shadow_vmcs)
>  				copy_shadow_to_vmcs12(vmx);
>  		}







^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 4/7] KVM: nVMX: Force enlightened VMCS sync from nested_vmx_failValid()
  2021-05-17 13:50 ` [PATCH v2 4/7] KVM: nVMX: Force enlightened VMCS sync from nested_vmx_failValid() Vitaly Kuznetsov
@ 2021-05-24 12:27   ` Maxim Levitsky
  0 siblings, 0 replies; 36+ messages in thread
From: Maxim Levitsky @ 2021-05-24 12:27 UTC (permalink / raw)
  To: Vitaly Kuznetsov, kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, linux-kernel

On Mon, 2021-05-17 at 15:50 +0200, Vitaly Kuznetsov wrote:
> 'need_vmcs12_to_shadow_sync' is used for both shadow and enlightened
> VMCS sync when we exit to L1. The comment in nested_vmx_failValid()
> validly states why shadow vmcs sync can be omitted but this doesn't
> apply to enlightened VMCS as it 'shadows' all VMCS12 fields.
> 
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index ec476f64df73..eb2d25a93356 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -194,9 +194,13 @@ static int nested_vmx_failValid(struct kvm_vcpu *vcpu,
>  			| X86_EFLAGS_ZF);
>  	get_vmcs12(vcpu)->vm_instruction_error = vm_instruction_error;
>  	/*
> -	 * We don't need to force a shadow sync because
> -	 * VM_INSTRUCTION_ERROR is not shadowed
> +	 * We don't need to force sync to shadow VMCS because
> +	 * VM_INSTRUCTION_ERROR is not shadowed. Enlightened VMCS 'shadows' all
> +	 * fields and thus must be synced.
>  	 */
> +	if (nested_evmcs_is_used(to_vmx(vcpu)))
> +		to_vmx(vcpu)->nested.need_vmcs12_to_shadow_sync = true;

Cool, this is a bug that I noticed too while reviewing
previous patches in this series.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky

> +
>  	return kvm_skip_emulated_instruction(vcpu);
>  }
>  



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 5/7] KVM: nVMX: Reset eVMCS clean fields data from prepare_vmcs02()
  2021-05-17 13:50 ` [PATCH v2 5/7] KVM: nVMX: Reset eVMCS clean fields data from prepare_vmcs02() Vitaly Kuznetsov
@ 2021-05-24 12:34   ` Maxim Levitsky
  2021-05-24 13:07     ` Vitaly Kuznetsov
  0 siblings, 1 reply; 36+ messages in thread
From: Maxim Levitsky @ 2021-05-24 12:34 UTC (permalink / raw)
  To: Vitaly Kuznetsov, kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, linux-kernel

On Mon, 2021-05-17 at 15:50 +0200, Vitaly Kuznetsov wrote:
> When nested state migration happens during L1's execution, it
> is incorrect to modify eVMCS as it is L1 who 'owns' it at the moment.
> At least genuine Hyper-V seems to not be very happy when 'clean fields'
> data changes underneath it.
> 
> 'Clean fields' data is used in KVM twice: by copy_enlightened_to_vmcs12()
> and prepare_vmcs02_rare() so we can reset it from prepare_vmcs02() instead.
> 
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index eb2d25a93356..3bfbf991bf45 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -2081,14 +2081,10 @@ void nested_sync_vmcs12_to_shadow(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  
> -	if (vmx->nested.hv_evmcs) {
> +	if (vmx->nested.hv_evmcs)
>  		copy_vmcs12_to_enlightened(vmx);
> -		/* All fields are clean */
> -		vmx->nested.hv_evmcs->hv_clean_fields |=
> -			HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL;
> -	} else {
> +	else
>  		copy_vmcs12_to_shadow(vmx);
> -	}
>  
>  	vmx->nested.need_vmcs12_to_shadow_sync = false;
>  }
> @@ -2629,6 +2625,12 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>  
>  	kvm_rsp_write(vcpu, vmcs12->guest_rsp);
>  	kvm_rip_write(vcpu, vmcs12->guest_rip);
> +
> +	/* Mark all fields as clean so L1 hypervisor can set what's dirty */
> +	if (hv_evmcs)
> +		vmx->nested.hv_evmcs->hv_clean_fields |=
> +			HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL;
> +
>  	return 0;
>  }
> 

Hi!
 
If we avoid calling copy_enlightened_to_vmcs12 from 
vmx_get_nested_state, then we don't need this patch, right?
 
In addition to that, I think we need to research why we need to touch
these clean bits at all: from the spec, and assuming that the clean bits
should behave similarly to how AMD does it, the clean bits should only be
set by L1 and never touched by us.
 
We currently set clean bits in two places:
 
1. nested_vmx_handle_enlightened_vmptrld() with vmlaunch, where it seems
to be a workaround for a case (as we discussed on IRC) where L1 keeps
more than one active evmcs on the same vcpu and 'vmresume's them. Since
we don't support this and have to do a full context switch when we switch
a vmcs, we reset the clean bits so that the evmcs is loaded fully.
We also reset the clean bits when an evmcs is 'vmlaunch'ed; we still need
to check whether that is actually needed, and if it is,
we probably should document that this is because of a bug in Hyper-V,
as it really should initialize these bits itself in this case.
 
I think that we should just ignore the clean bits in those cases
instead of resetting them in the evmcs.
 
 
2. In nested_sync_vmcs12_to_shadow, which in practice is done only
on nested vmexits, when we have updated vmcs12 and need to update the evmcs.
In this case you told me that Hyper-V has a bug: it expects
the clean bits to be cleared by us and doesn't clear them on its own.
This makes sense, although it is not documented in the Hyper-V spec,
and I would appreciate it if we documented this explicitly in the code.
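
For illustration, the kind of in-code documentation being asked for here
could look roughly like this. This is a sketch only: the function body just
mirrors the pre-series nested_sync_vmcs12_to_shadow(), and the comment
wording is hypothetical.

void nested_sync_vmcs12_to_shadow(struct kvm_vcpu *vcpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);

	if (vmx->nested.hv_evmcs) {
		copy_vmcs12_to_enlightened(vmx);
		/*
		 * Genuine Hyper-V appears to rely on L0 marking all fields
		 * clean here after a nested VM-exit; it does not reset
		 * 'hv_clean_fields' itself, even though the TLFS does not
		 * spell this out.
		 */
		vmx->nested.hv_evmcs->hv_clean_fields |=
			HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL;
	} else {
		copy_vmcs12_to_shadow(vmx);
	}

	vmx->nested.need_vmcs12_to_shadow_sync = false;
}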

 
Best regards,
	Maxim Levitsky
>  



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 1/7] KVM: nVMX: Introduce nested_evmcs_is_used()
  2021-05-24 12:11   ` Maxim Levitsky
@ 2021-05-24 12:35     ` Vitaly Kuznetsov
  2021-05-26 14:34       ` Maxim Levitsky
  0 siblings, 1 reply; 36+ messages in thread
From: Vitaly Kuznetsov @ 2021-05-24 12:35 UTC (permalink / raw)
  To: Maxim Levitsky, kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, linux-kernel

Maxim Levitsky <mlevitsk@redhat.com> writes:

> On Mon, 2021-05-17 at 15:50 +0200, Vitaly Kuznetsov wrote:
>> Unlike regular set_current_vmptr(), nested_vmx_handle_enlightened_vmptrld()
>> can not be called directly from vmx_set_nested_state() as KVM may not have
>> all the information yet (e.g. HV_X64_MSR_VP_ASSIST_PAGE MSR may not be
>> restored yet). Enlightened VMCS is mapped later while getting nested state
>> pages. In the meantime, vmx->nested.hv_evmcs remains NULL and using it
>> for various checks is incorrect. In particular, if KVM_GET_NESTED_STATE is
>> called right after KVM_SET_NESTED_STATE, KVM_STATE_NESTED_EVMCS flag in the
>> resulting state will be unset (and such state will later fail to load).
>> 
>> Introduce nested_evmcs_is_used() and use 'is_guest_mode(vcpu) &&
>> vmx->nested.current_vmptr == -1ull' check to detect not-yet-mapped eVMCS
>> after restore.
>> 
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> ---
>>  arch/x86/kvm/vmx/nested.c | 31 ++++++++++++++++++++++++++-----
>>  1 file changed, 26 insertions(+), 5 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index 6058a65a6ede..3080e00c8f90 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -141,6 +141,27 @@ static void init_vmcs_shadow_fields(void)
>>  	max_shadow_read_write_fields = j;
>>  }
>>  
>> +static inline bool nested_evmcs_is_used(struct vcpu_vmx *vmx)
>> +{
>> +	struct kvm_vcpu *vcpu = &vmx->vcpu;
>> +
>> +	if (vmx->nested.hv_evmcs)
>> +		return true;
>> +
>> +	/*
>> +	 * After KVM_SET_NESTED_STATE, enlightened VMCS is mapped during
>> +	 * KVM_REQ_GET_NESTED_STATE_PAGES handling and until the request is
>> +	 * processed vmx->nested.hv_evmcs is NULL. It is, however, possible to
>> +	 * detect such state by checking 'nested.current_vmptr == -1ull' when
>> +	 * vCPU is in guest mode, it is only possible with eVMCS.
>> +	 */
>> +	if (unlikely(vmx->nested.enlightened_vmcs_enabled && is_guest_mode(vcpu) &&
>> +		     (vmx->nested.current_vmptr == -1ull)))
>> +		return true;
>> +
>> +	return false;
>> +}
>
>
> I think that this is a valid way to solve the issue,
> but it feels like there might be a better way.
> I don't mind though to accept this patch as is.
>
> So here are my 2 cents about this:
>
> First of all after studying how evmcs works I take my words back
> about needing to migrate its contents. 
>
> It is indeed enough to migrate its physical address, 
> or maybe even just a flag that evmcs is loaded
> (and to my surprise we already do this - KVM_STATE_NESTED_EVMCS)
>
> So how about just having a boolean flag that indicates that evmcs is in use, 
> but doesn't imply that we know its address or that it is mapped 
> to host address space, something like 'vmx->nested.enlightened_vmcs_loaded'
>
> On migration that flag saved and restored as the KVM_STATE_NESTED_EVMCS,
> otherwise it set when we load an evmcs and cleared when it is released.
>
> Then as far as I can see we can use this flag in nested_evmcs_is_used
> since all its callers don't touch evmcs, thus don't need it to be
> mapped.
>
> What do you think?
>

First, we need to be compatible with older KVMs which don't have the
flag, and this is problematic: currently, we always expect vmcs12 to
carry valid contents, which is challenging.

Second, a vCPU can be migrated in three different states:
1) While L2 was running ('true' nested state is in VMCS02)
2) While L1 was running ('true' nested state is in eVMCS)
3) Right after an exit from L2 to L1 was forced
('need_vmcs12_to_shadow_sync = true') ('true' nested state is in
VMCS12).

The current solution is to always use VMCS12 as a container to transfer
the state and conceptually, it is at least easier to understand.

We can, indeed, transfer the eVMCS (or VMCS12) in case 2) through guest
memory, and I even tried that, but it was making the code more complex,
so eventually I gave up and decided to preserve the 'always use VMCS12
as a container' status quo.
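
Schematically, and simplifying vmx_get_nested_state() (the helper names are
the existing ones from arch/x86/kvm/vmx/nested.c; patch 3 of this series
adds an hv_clean_fields argument to copy_enlightened_to_vmcs12()), the three
states map to:

	if (is_guest_mode(vcpu)) {
		/* 1) L2 was running: vmcs02 holds the truth, fold it into vmcs12. */
		sync_vmcs02_to_vmcs12(vcpu, get_vmcs12(vcpu));
	} else if (!vmx->nested.need_vmcs12_to_shadow_sync) {
		/* 2) L1 was running: the eVMCS (or shadow VMCS) holds the truth. */
		if (vmx->nested.hv_evmcs)
			copy_enlightened_to_vmcs12(vmx);
		else if (enable_shadow_vmcs)
			copy_shadow_to_vmcs12(vmx);
	}
	/*
	 * 3) need_vmcs12_to_shadow_sync == true: the L2->L1 exit already
	 *    synced vmcs02 into vmcs12, so vmcs12 is the most recent copy
	 *    as-is and nothing needs to be copied before saving it.
	 */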

-- 
Vitaly


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 6/7] KVM: nVMX: Request to sync eVMCS from VMCS12 after migration
  2021-05-17 13:50 ` [PATCH v2 6/7] KVM: nVMX: Request to sync eVMCS from VMCS12 after migration Vitaly Kuznetsov
@ 2021-05-24 12:35   ` Maxim Levitsky
  0 siblings, 0 replies; 36+ messages in thread
From: Maxim Levitsky @ 2021-05-24 12:35 UTC (permalink / raw)
  To: Vitaly Kuznetsov, kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, linux-kernel

On Mon, 2021-05-17 at 15:50 +0200, Vitaly Kuznetsov wrote:
> VMCS12 is used to keep the authoritative state during nested state
> migration. In case 'need_vmcs12_to_shadow_sync' flag is set, we're
> in between L2->L1 vmexit and L1 guest run when actual sync to
> enlightened (or shadow) VMCS happens. Nested state, however, has
> no flag for 'need_vmcs12_to_shadow_sync' so vmx_set_nested_state()->
> set_current_vmptr() always sets it. Enlightened vmptrld path, however,
> doesn't have the quirk so some VMCS12 changes may not get properly
> reflected to eVMCS and L1 will see an incorrect state.
> 
> Note, during L2 execution or when need_vmcs12_to_shadow_sync is not
> set the change is effectively a nop: in the former case all changes
> will get reflected during the first L2->L1 vmexit and in the latter
> case VMCS12 and eVMCS are already in sync (thanks to
> copy_enlightened_to_vmcs12() in vmx_get_nested_state()).
> 
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 3bfbf991bf45..a0dedd413a23 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -3127,6 +3127,12 @@ static bool nested_get_evmcs_page(struct kvm_vcpu *vcpu)
>  		if (evmptrld_status == EVMPTRLD_VMFAIL ||
>  		    evmptrld_status == EVMPTRLD_ERROR)
>  			return false;
> +
> +		/*
> +		 * Post migration VMCS12 always provides the most actual
> +		 * information, copy it to eVMCS upon entry.
> +		 */
> +		vmx->nested.need_vmcs12_to_shadow_sync = true;

I noticed that bug too while reviewing the previous patches, 
wasted some time thinking about it :-(

This does look like a clean and correct way to fix this, unless
you adopt my suggestion to sync evmcs right away on nested vmexit
as I explained in my review of patch 3 of this series.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky

>  	}
>  
>  	return true;





^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 7/7] KVM: selftests: evmcs_test: Test that KVM_STATE_NESTED_EVMCS is never lost
  2021-05-17 13:50 ` [PATCH v2 7/7] KVM: selftests: evmcs_test: Test that KVM_STATE_NESTED_EVMCS is never lost Vitaly Kuznetsov
@ 2021-05-24 12:36   ` Maxim Levitsky
  0 siblings, 0 replies; 36+ messages in thread
From: Maxim Levitsky @ 2021-05-24 12:36 UTC (permalink / raw)
  To: Vitaly Kuznetsov, kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, linux-kernel

On Mon, 2021-05-17 at 15:50 +0200, Vitaly Kuznetsov wrote:
> Do KVM_GET_NESTED_STATE/KVM_SET_NESTED_STATE for a freshly restored VM
> (before the first KVM_RUN) to check that KVM_STATE_NESTED_EVMCS is not
> lost.
> 
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
>  .../testing/selftests/kvm/x86_64/evmcs_test.c | 64 +++++++++++--------
>  1 file changed, 38 insertions(+), 26 deletions(-)
> 
> diff --git a/tools/testing/selftests/kvm/x86_64/evmcs_test.c b/tools/testing/selftests/kvm/x86_64/evmcs_test.c
> index 63096cea26c6..fcef347a681a 100644
> --- a/tools/testing/selftests/kvm/x86_64/evmcs_test.c
> +++ b/tools/testing/selftests/kvm/x86_64/evmcs_test.c
> @@ -121,14 +121,38 @@ void inject_nmi(struct kvm_vm *vm)
>  	vcpu_events_set(vm, VCPU_ID, &events);
>  }
>  
> +static void save_restore_vm(struct kvm_vm *vm)
> +{
> +	struct kvm_regs regs1, regs2;
> +	struct kvm_x86_state *state;
> +
> +	state = vcpu_save_state(vm, VCPU_ID);
> +	memset(&regs1, 0, sizeof(regs1));
> +	vcpu_regs_get(vm, VCPU_ID, &regs1);
> +
> +	kvm_vm_release(vm);
> +
> +	/* Restore state in a new VM.  */
> +	kvm_vm_restart(vm, O_RDWR);
> +	vm_vcpu_add(vm, VCPU_ID);
> +	vcpu_set_hv_cpuid(vm, VCPU_ID);
> +	vcpu_enable_evmcs(vm, VCPU_ID);
> +	vcpu_load_state(vm, VCPU_ID, state);
> +	free(state);
> +
> +	memset(&regs2, 0, sizeof(regs2));
> +	vcpu_regs_get(vm, VCPU_ID, &regs2);
> +	TEST_ASSERT(!memcmp(&regs1, &regs2, sizeof(regs2)),
> +		    "Unexpected register values after vcpu_load_state; rdi: %lx rsi: %lx",
> +		    (ulong) regs2.rdi, (ulong) regs2.rsi);
> +}
> +
>  int main(int argc, char *argv[])
>  {
>  	vm_vaddr_t vmx_pages_gva = 0;
>  
> -	struct kvm_regs regs1, regs2;
>  	struct kvm_vm *vm;
>  	struct kvm_run *run;
> -	struct kvm_x86_state *state;
>  	struct ucall uc;
>  	int stage;
>  
> @@ -145,10 +169,6 @@ int main(int argc, char *argv[])
>  	vcpu_set_hv_cpuid(vm, VCPU_ID);
>  	vcpu_enable_evmcs(vm, VCPU_ID);
>  
> -	run = vcpu_state(vm, VCPU_ID);
> -
> -	vcpu_regs_get(vm, VCPU_ID, &regs1);
> -
>  	vcpu_alloc_vmx(vm, &vmx_pages_gva);
>  	vcpu_args_set(vm, VCPU_ID, 1, vmx_pages_gva);
>  
> @@ -160,6 +180,7 @@ int main(int argc, char *argv[])
>  	pr_info("Running L1 which uses EVMCS to run L2\n");
>  
>  	for (stage = 1;; stage++) {
> +		run = vcpu_state(vm, VCPU_ID);
>  		_vcpu_run(vm, VCPU_ID);
>  		TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
>  			    "Stage %d: unexpected exit reason: %u (%s),\n",
> @@ -184,32 +205,23 @@ int main(int argc, char *argv[])
>  			    uc.args[1] == stage, "Stage %d: Unexpected register values vmexit, got %lx",
>  			    stage, (ulong)uc.args[1]);
>  
> -		state = vcpu_save_state(vm, VCPU_ID);
> -		memset(&regs1, 0, sizeof(regs1));
> -		vcpu_regs_get(vm, VCPU_ID, &regs1);
> -
> -		kvm_vm_release(vm);
> -
> -		/* Restore state in a new VM.  */
> -		kvm_vm_restart(vm, O_RDWR);
> -		vm_vcpu_add(vm, VCPU_ID);
> -		vcpu_set_hv_cpuid(vm, VCPU_ID);
> -		vcpu_enable_evmcs(vm, VCPU_ID);
> -		vcpu_load_state(vm, VCPU_ID, state);
> -		run = vcpu_state(vm, VCPU_ID);
> -		free(state);
> -
> -		memset(&regs2, 0, sizeof(regs2));
> -		vcpu_regs_get(vm, VCPU_ID, &regs2);
> -		TEST_ASSERT(!memcmp(&regs1, &regs2, sizeof(regs2)),
> -			    "Unexpected register values after vcpu_load_state; rdi: %lx rsi: %lx",
> -			    (ulong) regs2.rdi, (ulong) regs2.rsi);
> +		save_restore_vm(vm);
>  
>  		/* Force immediate L2->L1 exit before resuming */
>  		if (stage == 8) {
>  			pr_info("Injecting NMI into L1 before L2 had a chance to run after restore\n");
>  			inject_nmi(vm);
>  		}
> +
> +		/*
> +		 * Do KVM_GET_NESTED_STATE/KVM_SET_NESTED_STATE for a freshly
> +		 * restored VM (before the first KVM_RUN) to check that
> +		 * KVM_STATE_NESTED_EVMCS is not lost.
> +		 */
> +		if (stage == 9) {
> +			pr_info("Trying extra KVM_GET_NESTED_STATE/KVM_SET_NESTED_STATE cycle\n");
> +			save_restore_vm(vm);
> +		}
>  	}
>  
>  done:


This is a very good test. I do think that in the future we should move save_restore_vm
to common code so that I could test SVM nested migration (and plain VMX nested migration) 
in a similar way.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/7] KVM: nVMX: Fixes for nested state migration when eVMCS is in use
  2021-05-24 12:08 ` [PATCH v2 0/7] KVM: nVMX: Fixes for nested state migration when eVMCS is in use Maxim Levitsky
@ 2021-05-24 12:44   ` Vitaly Kuznetsov
  2021-05-26 14:41     ` Maxim Levitsky
  0 siblings, 1 reply; 36+ messages in thread
From: Vitaly Kuznetsov @ 2021-05-24 12:44 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, linux-kernel, kvm,
	Paolo Bonzini

Maxim Levitsky <mlevitsk@redhat.com> writes:

> On Mon, 2021-05-17 at 15:50 +0200, Vitaly Kuznetsov wrote:
>> Changes since v1 (Sean):
>> - Drop now-unneeded curly braces in nested_sync_vmcs12_to_shadow().
>> - Pass 'evmcs->hv_clean_fields' instead of 'bool from_vmentry' to
>>   copy_enlightened_to_vmcs12().
>> 
>> Commit f5c7e8425f18 ("KVM: nVMX: Always make an attempt to map eVMCS after
>> migration") fixed the most obvious reason why Hyper-V on KVM (e.g. Win10
>>  + WSL2) was crashing immediately after migration. It was also reported
>> that we have more issues to fix as, while the failure rate was lowered 
>> significantly, it was still possible to observe crashes after several
>> dozens of migrations. Turns out, the issue arises when we manage to issue
>> KVM_GET_NESTED_STATE right after an L2->L1 VMEXIT but before L1 gets a chance
>> to run. This state is tracked with 'need_vmcs12_to_shadow_sync' flag but
>> the flag itself is not part of saved nested state. A few other less 
>> significant issues are fixed along the way.
>> 
>> While there's no proof this series fixes all eVMCS related problems,
>> Win10+WSL2 was able to survive 3333 (thanks, Max!) migrations without
>> crashing in testing.
>> 
>> Patches are based on the current kvm/next tree.
>> 
>> Vitaly Kuznetsov (7):
>>   KVM: nVMX: Introduce nested_evmcs_is_used()
>>   KVM: nVMX: Release enlightened VMCS on VMCLEAR
>>   KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in
>>     vmx_get_nested_state()
>>   KVM: nVMX: Force enlightened VMCS sync from nested_vmx_failValid()
>>   KVM: nVMX: Reset eVMCS clean fields data from prepare_vmcs02()
>>   KVM: nVMX: Request to sync eVMCS from VMCS12 after migration
>>   KVM: selftests: evmcs_test: Test that KVM_STATE_NESTED_EVMCS is never
>>     lost
>> 
>>  arch/x86/kvm/vmx/nested.c                     | 110 ++++++++++++------
>>  .../testing/selftests/kvm/x86_64/evmcs_test.c |  64 +++++-----
>>  2 files changed, 115 insertions(+), 59 deletions(-)
>> 
>
> Hi Vitaly!
>
> In addition to the review of this patch series,

Thanks by the way!

>  I would like
> to share an idea on how to avoid the hack of mapping the evmcs
> in nested_vmx_vmexit, because I think I found a possible generic
> solution to this and similar issues:
>
> The solution is to always set nested_run_pending after 
> nested migration (which means that we won't really
> need to migrate this flag anymore).
>
> I was thinking a lot about it and I think that there is no downside to this,
> other than sometimes one extra vmexit after migration.
>
> Otherwise there is always a risk of the following scenario:
>
>   1. We migrate with nested_run_pending=0 (but don't restore all the state
>   yet, like that HV_X64_MSR_VP_ASSIST_PAGE msr,
>   or just the guest memory map is not up to date, guest is in smm or something
>   like that)
>
>   2. Userspace calls some ioctl that causes a nested vmexit
>
>   This can happen today if the userspace calls 
>   kvm_arch_vcpu_ioctl_get_mpstate -> kvm_apic_accept_events -> kvm_check_nested_events
>
>   3. Userspace finally sets correct guest's msrs, correct guest memory map and only
>   then calls KVM_RUN
>
> This means that at (2) we can't map and write the evmcs/vmcs12/vmcb12 even
> if KVM_REQ_GET_NESTED_STATE_PAGES is pending,
> but we have to do so to complete the nested vmexit.

Why do we need to write to the eVMCS to complete the vmexit? AFAICT, there's
only one place which calls copy_vmcs12_to_enlightened():
nested_sync_vmcs12_to_shadow(), which, in its turn, has only one caller:
vmx_prepare_switch_to_guest(). So unless userspace decided to execute a
not-fully-restored guest this should not happen. (I'm probably missing
something in your scenario.)

>
> To some extent, the entry to the nested mode after a migration is only complete
> when we process the KVM_REQ_GET_NESTED_STATE_PAGES, so we shouldn't interrupt it.
>
> This will allow us to avoid dealing with KVM_REQ_GET_NESTED_STATE_PAGES on
> nested vmexit path at all. 

Remember, we have three possible states when nested state is
transferred:
1) L2 was running
2) L1 was running
3) We're in between L2 and L1 (need_vmcs12_to_shadow_sync = true).

Is 'nested_run_pending' suitable for all of them? Could you maybe draft
a patch so we can see how this works (in both 'normal' and 'evmcs'
cases)?
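
To make the idea concrete, a rough and untested sketch (the field name is
the existing vmx->nested.nested_run_pending, but the placement inside
vmx_set_nested_state() and the bare is_guest_mode() condition are
assumptions, not posted code):

	/*
	 * After restoring a vCPU that was in guest mode, behave as if a
	 * nested VM-entry is pending so that e.g. kvm_check_nested_events()
	 * called from another ioctl cannot force a nested VM-exit before
	 * KVM_REQ_GET_NESTED_STATE_PAGES has been processed.
	 */
	if (is_guest_mode(vcpu))
		vmx->nested.nested_run_pending = true;

The cost would be at most one extra 'nested run pending' round after
restore, as noted above.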

-- 
Vitaly


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 3/7] KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in vmx_get_nested_state()
  2021-05-24 12:26   ` Maxim Levitsky
@ 2021-05-24 13:01     ` Vitaly Kuznetsov
  2021-05-24 13:58       ` Paolo Bonzini
  0 siblings, 1 reply; 36+ messages in thread
From: Vitaly Kuznetsov @ 2021-05-24 13:01 UTC (permalink / raw)
  To: Maxim Levitsky, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, linux-kernel, kvm

Maxim Levitsky <mlevitsk@redhat.com> writes:

> On Mon, 2021-05-17 at 15:50 +0200, Vitaly Kuznetsov wrote:
>> 'Clean fields' data from enlightened VMCS is only valid upon vmentry: L1
>> hypervisor is not obliged to keep it up-to-date while it is mangling L2's
>> state, KVM_GET_NESTED_STATE request may come at a wrong moment when actual
>> eVMCS changes are unsynchronized with 'hv_clean_fields'. As upon migration
>> VMCS12 is used as a source of ultimate truth, we must make sure we pick all
>> the changes to eVMCS and thus 'clean fields' data must be ignored.
>> 
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> ---
>>  arch/x86/kvm/vmx/nested.c | 49 +++++++++++++++++++++++----------------
>>  1 file changed, 29 insertions(+), 20 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index ea2869d8b823..ec476f64df73 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -1607,7 +1607,7 @@ static void copy_vmcs12_to_shadow(struct vcpu_vmx *vmx)
>>  	vmcs_load(vmx->loaded_vmcs->vmcs);
>>  }
>>  
>> -static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx)
>> +static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx, u32 hv_clean_fields)
>>  {
>>  	struct vmcs12 *vmcs12 = vmx->nested.cached_vmcs12;
>>  	struct hv_enlightened_vmcs *evmcs = vmx->nested.hv_evmcs;
>> @@ -1616,7 +1616,7 @@ static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx)
>>  	vmcs12->tpr_threshold = evmcs->tpr_threshold;
>>  	vmcs12->guest_rip = evmcs->guest_rip;
>>  
>> -	if (unlikely(!(evmcs->hv_clean_fields &
>> +	if (unlikely(!(hv_clean_fields &
>>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_BASIC))) {
>>  		vmcs12->guest_rsp = evmcs->guest_rsp;
>>  		vmcs12->guest_rflags = evmcs->guest_rflags;
>> @@ -1624,23 +1624,23 @@ static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx)
>>  			evmcs->guest_interruptibility_info;
>>  	}
>>  
>> -	if (unlikely(!(evmcs->hv_clean_fields &
>> +	if (unlikely(!(hv_clean_fields &
>>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_PROC))) {
>>  		vmcs12->cpu_based_vm_exec_control =
>>  			evmcs->cpu_based_vm_exec_control;
>>  	}
>>  
>> -	if (unlikely(!(evmcs->hv_clean_fields &
>> +	if (unlikely(!(hv_clean_fields &
>>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_EXCPN))) {
>>  		vmcs12->exception_bitmap = evmcs->exception_bitmap;
>>  	}
>>  
>> -	if (unlikely(!(evmcs->hv_clean_fields &
>> +	if (unlikely(!(hv_clean_fields &
>>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_ENTRY))) {
>>  		vmcs12->vm_entry_controls = evmcs->vm_entry_controls;
>>  	}
>>  
>> -	if (unlikely(!(evmcs->hv_clean_fields &
>> +	if (unlikely(!(hv_clean_fields &
>>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_EVENT))) {
>>  		vmcs12->vm_entry_intr_info_field =
>>  			evmcs->vm_entry_intr_info_field;
>> @@ -1650,7 +1650,7 @@ static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx)
>>  			evmcs->vm_entry_instruction_len;
>>  	}
>>  
>> -	if (unlikely(!(evmcs->hv_clean_fields &
>> +	if (unlikely(!(hv_clean_fields &
>>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1))) {
>>  		vmcs12->host_ia32_pat = evmcs->host_ia32_pat;
>>  		vmcs12->host_ia32_efer = evmcs->host_ia32_efer;
>> @@ -1670,7 +1670,7 @@ static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx)
>>  		vmcs12->host_tr_selector = evmcs->host_tr_selector;
>>  	}
>>  
>> -	if (unlikely(!(evmcs->hv_clean_fields &
>> +	if (unlikely(!(hv_clean_fields &
>>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP1))) {
>>  		vmcs12->pin_based_vm_exec_control =
>>  			evmcs->pin_based_vm_exec_control;
>> @@ -1679,18 +1679,18 @@ static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx)
>>  			evmcs->secondary_vm_exec_control;
>>  	}
>>  
>> -	if (unlikely(!(evmcs->hv_clean_fields &
>> +	if (unlikely(!(hv_clean_fields &
>>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_IO_BITMAP))) {
>>  		vmcs12->io_bitmap_a = evmcs->io_bitmap_a;
>>  		vmcs12->io_bitmap_b = evmcs->io_bitmap_b;
>>  	}
>>  
>> -	if (unlikely(!(evmcs->hv_clean_fields &
>> +	if (unlikely(!(hv_clean_fields &
>>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP))) {
>>  		vmcs12->msr_bitmap = evmcs->msr_bitmap;
>>  	}
>>  
>> -	if (unlikely(!(evmcs->hv_clean_fields &
>> +	if (unlikely(!(hv_clean_fields &
>>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2))) {
>>  		vmcs12->guest_es_base = evmcs->guest_es_base;
>>  		vmcs12->guest_cs_base = evmcs->guest_cs_base;
>> @@ -1730,14 +1730,14 @@ static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx)
>>  		vmcs12->guest_tr_selector = evmcs->guest_tr_selector;
>>  	}
>>  
>> -	if (unlikely(!(evmcs->hv_clean_fields &
>> +	if (unlikely(!(hv_clean_fields &
>>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP2))) {
>>  		vmcs12->tsc_offset = evmcs->tsc_offset;
>>  		vmcs12->virtual_apic_page_addr = evmcs->virtual_apic_page_addr;
>>  		vmcs12->xss_exit_bitmap = evmcs->xss_exit_bitmap;
>>  	}
>>  
>> -	if (unlikely(!(evmcs->hv_clean_fields &
>> +	if (unlikely(!(hv_clean_fields &
>>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR))) {
>>  		vmcs12->cr0_guest_host_mask = evmcs->cr0_guest_host_mask;
>>  		vmcs12->cr4_guest_host_mask = evmcs->cr4_guest_host_mask;
>> @@ -1749,7 +1749,7 @@ static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx)
>>  		vmcs12->guest_dr7 = evmcs->guest_dr7;
>>  	}
>>  
>> -	if (unlikely(!(evmcs->hv_clean_fields &
>> +	if (unlikely(!(hv_clean_fields &
>>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER))) {
>>  		vmcs12->host_fs_base = evmcs->host_fs_base;
>>  		vmcs12->host_gs_base = evmcs->host_gs_base;
>> @@ -1759,13 +1759,13 @@ static int copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx)
>>  		vmcs12->host_rsp = evmcs->host_rsp;
>>  	}
>>  
>> -	if (unlikely(!(evmcs->hv_clean_fields &
>> +	if (unlikely(!(hv_clean_fields &
>>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_XLAT))) {
>>  		vmcs12->ept_pointer = evmcs->ept_pointer;
>>  		vmcs12->virtual_processor_id = evmcs->virtual_processor_id;
>>  	}
>>  
>> -	if (unlikely(!(evmcs->hv_clean_fields &
>> +	if (unlikely(!(hv_clean_fields &
>>  		       HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1))) {
>>  		vmcs12->vmcs_link_pointer = evmcs->vmcs_link_pointer;
>>  		vmcs12->guest_ia32_debugctl = evmcs->guest_ia32_debugctl;
>> @@ -3473,6 +3473,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
>>  	enum nvmx_vmentry_status status;
>>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>>  	u32 interrupt_shadow = vmx_get_interrupt_shadow(vcpu);
>> +	struct hv_enlightened_vmcs *evmcs;
>>  	enum nested_evmptrld_status evmptrld_status;
>>  
>>  	++vcpu->stat.nested_run;
>> @@ -3488,7 +3489,8 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
>>  		return nested_vmx_failInvalid(vcpu);
>>  	}
>>  
>> -	if (CC(!vmx->nested.hv_evmcs && vmx->nested.current_vmptr == -1ull))
>> +	evmcs = vmx->nested.hv_evmcs;
>> +	if (CC(!evmcs && vmx->nested.current_vmptr == -1ull))
>>  		return nested_vmx_failInvalid(vcpu);
>>  
>>  	vmcs12 = get_vmcs12(vcpu);
>> @@ -3502,8 +3504,8 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
>>  	if (CC(vmcs12->hdr.shadow_vmcs))
>>  		return nested_vmx_failInvalid(vcpu);
>>  
>> -	if (vmx->nested.hv_evmcs) {
>> -		copy_enlightened_to_vmcs12(vmx);
>> +	if (evmcs) {
>> +		copy_enlightened_to_vmcs12(vmx, evmcs->hv_clean_fields);
>>  		/* Enlightened VMCS doesn't have launch state */
>>  		vmcs12->launch_state = !launch;
>>  	} else if (enable_shadow_vmcs) {
>> @@ -6136,7 +6138,14 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu,
>>  		copy_vmcs02_to_vmcs12_rare(vcpu, get_vmcs12(vcpu));
>>  		if (!vmx->nested.need_vmcs12_to_shadow_sync) {
>>  			if (vmx->nested.hv_evmcs)
>> -				copy_enlightened_to_vmcs12(vmx);
>> +				/*
>> +				 * L1 hypervisor is not obliged to keep eVMCS
>> +				 * clean fields data always up-to-date while
>> +				 * not in guest mode, 'hv_clean_fields' is only
>> +				 * supposed to be actual upon vmentry so we need
>> +				 * to ignore it here and do full copy.
>> +				 */
>> +				copy_enlightened_to_vmcs12(vmx, 0);
>
> Hi Vitaly!
>
> This patch had lead me to a deep rabbit hole,
> so I would like to share my thoughts about it:
>
> Initially I thought that we should just drop the copy of the evmcs to vmcs12
> instead, based on the following argument:
>  
> When L2 is running we don't copy it since then guest must not
> modify it just like normal vmcs, and L2 state is in our vmcs02.
>  
> And when L1 is running, then essentially the evmcs is just guest memory.
>
> Even when loaded by previous vm entry,
> we barely touch that evmcs
> (we only touch it to update the vmx instruction error),
> and even for that we need to update the evmcs (vmcs12->evmcs) and not vice versa.
>
> Reading whole evmcs while the L1 is running indeed 
> feels wrong since at this point the L1 owns it.
>
> However we have another bug that you fixed in patch 6, which I sadly rediscovered
> while reviewing this patch, which makes the above approach impossible.
> After a migration we won't be able to know if evmcs is more up to date,
> or if vmcs12 is more up to date.
>
> I was thinking that for the latter case (vmcs12 more up to date) it would be cleaner
> to sync evmcs here and then drop the copy_enlightened_to_vmcs12 call.
>
> However we can't do this since in the KVM_GET_NESTED_STATE handler we are not allowed
> to dirty the guest memory: at the point this ioctl is called, the userspace
> assumes that it has already copied all of the guest memory,
> and that this ioctl won't dirty the guest memory again.
> (see commit fa58a9fa74979f845fc6c94a58501e67f3abb6de)
>
> It is still possible to go with my suggestion though if we avoid using
> need_vmcs12_to_shadow_sync for evmcs, and sync evmcs right away
> in the two places where it is needed:
> On nested vmexit and when updating the VM_INSTRUCTION_ERROR.
> This shouldn't cause any performance regressions.
>
> Then we can just drop the copy_enlightened_to_vmcs12 call, drop this patch, patch 5 and patch 6,
> and rely on the assumption that when we migrate L1, whatever evmcs was
> previously in use was fully updated by us on the last nested vmexit
> (and possibly modified by L1 since then), while vmcs12 continues to hold
> the state of the VM that we wrote during that last nested vmexit.
>
> In regard to backward compatibility the evmcs nested migration 
> was very broken prior to these fixes anyway thus the change
> that I propose shouldn't make things worse.
>
> What do you think?

I managed to forget some gory details, but I had similar ideas when
playing with this. For better or worse, I was considering backwards
compatibility a must: even despite the multiple issues we fix with this
series, it was still possible to migrate with eVMCS. The currently
implemented protocol with KVM_GET_NESTED_STATE/KVM_SET_NESTED_STATE is
to use VMCS12 as the source of ultimate truth and, if we want to stay
backwards compatible, we'll have to emulate this somehow.

With 'need_vmcs12_to_shadow_sync', we treat eVMCS as shadow VMCS which
happens to shadow all fields and while it may not be the most optimal
solution, it is at least easy to comprehend. We can try drafting
something up instead, maybe it will also be good but honestly I'm afraid
of incompatible changes in KVM_GET_NESTED_STATE/KVM_SET_NESTED_STATE, we
can ask Paolo's opinion on that.

-- 
Vitaly


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 5/7] KVM: nVMX: Reset eVMCS clean fields data from prepare_vmcs02()
  2021-05-24 12:34   ` Maxim Levitsky
@ 2021-05-24 13:07     ` Vitaly Kuznetsov
  0 siblings, 0 replies; 36+ messages in thread
From: Vitaly Kuznetsov @ 2021-05-24 13:07 UTC (permalink / raw)
  To: Maxim Levitsky, kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, linux-kernel

Maxim Levitsky <mlevitsk@redhat.com> writes:

> On Mon, 2021-05-17 at 15:50 +0200, Vitaly Kuznetsov wrote:
>> When nested state migration happens during L1's execution, it
>> is incorrect to modify eVMCS as it is L1 who 'owns' it at the moment.
>> At least genuine Hyper-V seems to not be very happy when 'clean fields'
>> data changes underneath it.
>> 
>> 'Clean fields' data is used in KVM twice: by copy_enlightened_to_vmcs12()
>> and prepare_vmcs02_rare() so we can reset it from prepare_vmcs02() instead.
>> 
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> ---
>>  arch/x86/kvm/vmx/nested.c | 14 ++++++++------
>>  1 file changed, 8 insertions(+), 6 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index eb2d25a93356..3bfbf991bf45 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -2081,14 +2081,10 @@ void nested_sync_vmcs12_to_shadow(struct kvm_vcpu *vcpu)
>>  {
>>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>>  
>> -	if (vmx->nested.hv_evmcs) {
>> +	if (vmx->nested.hv_evmcs)
>>  		copy_vmcs12_to_enlightened(vmx);
>> -		/* All fields are clean */
>> -		vmx->nested.hv_evmcs->hv_clean_fields |=
>> -			HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL;
>> -	} else {
>> +	else
>>  		copy_vmcs12_to_shadow(vmx);
>> -	}
>>  
>>  	vmx->nested.need_vmcs12_to_shadow_sync = false;
>>  }
>> @@ -2629,6 +2625,12 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>>  
>>  	kvm_rsp_write(vcpu, vmcs12->guest_rsp);
>>  	kvm_rip_write(vcpu, vmcs12->guest_rip);
>> +
>> +	/* Mark all fields as clean so L1 hypervisor can set what's dirty */
>> +	if (hv_evmcs)
>> +		vmx->nested.hv_evmcs->hv_clean_fields |=
>> +			HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL;
>> +
>>  	return 0;
>>  }
>> 
>
> Hi!
>  
> If we avoid calling copy_enlightened_to_vmcs12 from 
> vmx_get_nested_state, then we don't need this patch, right?
>  

Right.

> In addition to that, I think we need to research why we need to touch
> these clean bits at all: going by the spec, and assuming that the clean
> bits should behave similarly to how AMD does it, the clean bits should
> only be set by L1 and never touched by us.
>  
> We currently set clean bits in two places:
>  
> 1. nested_vmx_handle_enlightened_vmptrld with vmlaunch, where it seems
> like it is a workaround for a case (as we discussed on IRC) where
> L1 keeps more than one active evmcs on the same vcpu and 'vmresume's
> them. Since we don't support this and have to do a full context switch
> when we switch a vmcs, we reset the clean bits so that the evmcs is
> loaded fully.
> We also reset the clean bits when an evmcs is 'vmlaunched'; we need to
> check whether that is really needed, and if it is, we should probably
> document that this works around a bug in Hyper-V, as it really should
> initialize these bits itself in this case.
>  
> I think that we should just ignore the clean bits in those cases
> instead of resetting them in the evmcs.
>  
>  
> 2. In nested_sync_vmcs12_to_shadow, which in practice is done only
> on nested vmexits, when we have updated vmcs12 and need to update the evmcs.
> In this case you told me that Hyper-V has a bug: it expects
> the clean bits to be cleared by us and doesn't clear them on its own.
> This makes sense, although it is not documented in the Hyper-V spec,
> and I would appreciate it if we documented this explicitly in the code.

My memory is really foggy, but I put the 'hv_clean_fields' cleaning into KVM
because I discovered that Hyper-V doesn't do it: with every exit we
got more and more 'dirty' fields which hadn't actually been touched. It
would probably make sense to repeat the experiment. I'll put this on my
backlog, thanks for the feedback!

-- 
Vitaly


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 1/7] KVM: nVMX: Introduce nested_evmcs_is_used()
  2021-05-17 13:50 ` [PATCH v2 1/7] KVM: nVMX: Introduce nested_evmcs_is_used() Vitaly Kuznetsov
  2021-05-24 12:11   ` Maxim Levitsky
@ 2021-05-24 13:54   ` Paolo Bonzini
  2021-05-24 14:09     ` Vitaly Kuznetsov
  1 sibling, 1 reply; 36+ messages in thread
From: Paolo Bonzini @ 2021-05-24 13:54 UTC (permalink / raw)
  To: Vitaly Kuznetsov, kvm
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Maxim Levitsky,
	linux-kernel

On 17/05/21 15:50, Vitaly Kuznetsov wrote:
> Unlike regular set_current_vmptr(), nested_vmx_handle_enlightened_vmptrld()
> can not be called directly from vmx_set_nested_state() as KVM may not have
> all the information yet (e.g. HV_X64_MSR_VP_ASSIST_PAGE MSR may not be
> restored yet). Enlightened VMCS is mapped later while getting nested state
> pages. In the meantime, vmx->nested.hv_evmcs remains NULL and using it
> for various checks is incorrect. In particular, if KVM_GET_NESTED_STATE is
> called right after KVM_SET_NESTED_STATE, KVM_STATE_NESTED_EVMCS flag in the
> resulting state will be unset (and such state will later fail to load).
> 
> Introduce nested_evmcs_is_used() and use 'is_guest_mode(vcpu) &&
> vmx->nested.current_vmptr == -1ull' check to detect not-yet-mapped eVMCS
> after restore.
> 
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>

Would it make sense to instead use hv_evmcs_ptr, making the unused
value -1 instead of 0?

I even had this untested patch already lying around as a cleanup I
never bothered to submit:

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index a19cfcb625da..dd3e4ddaaf26 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -187,7 +187,7 @@ static int nested_vmx_fail(struct kvm_vcpu *vcpu, u32 vm_instruction_error)
  	 * failValid writes the error number to the current VMCS, which
  	 * can't be done if there isn't a current VMCS.
  	 */
-	if (vmx->nested.current_vmptr == -1ull && !vmx->nested.hv_evmcs)
+	if (vmx->nested.current_vmptr == -1ull && vmx->nested.hv_evmcs_vmptr == -1)
  		return nested_vmx_failInvalid(vcpu);
  
  	return nested_vmx_failValid(vcpu, vm_instruction_error);
@@ -221,11 +221,11 @@ static inline void nested_release_evmcs(struct kvm_vcpu *vcpu)
  {
  	struct vcpu_vmx *vmx = to_vmx(vcpu);
  
-	if (!vmx->nested.hv_evmcs)
+	if (vmx->nested.hv_evmcs_vmptr == -1)
  		return;
  
  	kvm_vcpu_unmap(vcpu, &vmx->nested.hv_evmcs_map, NULL, true);
-	vmx->nested.hv_evmcs_vmptr = 0;
+	vmx->nested.hv_evmcs_vmptr = -1;
  	vmx->nested.hv_evmcs = NULL;
  }
  
@@ -1978,10 +1978,8 @@ static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld(
  	if (!nested_enlightened_vmentry(vcpu, &evmcs_gpa))
  		return EVMPTRLD_DISABLED;
  
-	if (unlikely(!vmx->nested.hv_evmcs ||
-		     evmcs_gpa != vmx->nested.hv_evmcs_vmptr)) {
-		if (!vmx->nested.hv_evmcs)
-			vmx->nested.current_vmptr = -1ull;
+	if (unlikely(evmcs_gpa != vmx->nested.hv_evmcs_vmptr)) {
+		vmx->nested.current_vmptr = -1ull;
  
  		nested_release_evmcs(vcpu);
  
@@ -2052,7 +2050,7 @@ void nested_sync_vmcs12_to_shadow(struct kvm_vcpu *vcpu)
  {
  	struct vcpu_vmx *vmx = to_vmx(vcpu);
  
-	if (vmx->nested.hv_evmcs) {
+	if (vmx->nested.hv_evmcs_vmptr != -1) {
  		copy_vmcs12_to_enlightened(vmx);
  		/* All fields are clean */
  		vmx->nested.hv_evmcs->hv_clean_fields |=
@@ -2204,7 +2202,7 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
  	u32 exec_control;
  	u64 guest_efer = nested_vmx_calc_efer(vmx, vmcs12);
  
-	if (vmx->nested.dirty_vmcs12 || vmx->nested.hv_evmcs)
+	if (vmx->nested.dirty_vmcs12 || vmx->nested.hv_evmcs_vmptr != -1)
  		prepare_vmcs02_early_rare(vmx, vmcs12);
  
  	/*
@@ -2487,9 +2485,8 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
  			  enum vm_entry_failure_code *entry_failure_code)
  {
  	struct vcpu_vmx *vmx = to_vmx(vcpu);
-	struct hv_enlightened_vmcs *hv_evmcs = vmx->nested.hv_evmcs;
  
-	if (vmx->nested.dirty_vmcs12 || hv_evmcs) {
+	if (vmx->nested.dirty_vmcs12 || vmx->nested.hv_evmcs_vmptr != -1) {
  		prepare_vmcs02_rare(vmx, vmcs12);
  		vmx->nested.dirty_vmcs12 = false;
  	}
@@ -3075,7 +3072,7 @@ static bool nested_get_evmcs_page(struct kvm_vcpu *vcpu)
  	 * L2 was running), map it here to make sure vmcs12 changes are
  	 * properly reflected.
  	 */
-	if (vmx->nested.enlightened_vmcs_enabled && !vmx->nested.hv_evmcs) {
+	if (vmx->nested.enlightened_vmcs_enabled && vmx->nested.hv_evmcs_vmptr == -1) {
  		enum nested_evmptrld_status evmptrld_status =
  			nested_vmx_handle_enlightened_vmptrld(vcpu, false);
  
@@ -3419,7 +3416,7 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
  
  	load_vmcs12_host_state(vcpu, vmcs12);
  	vmcs12->vm_exit_reason = exit_reason.full;
-	if (enable_shadow_vmcs || vmx->nested.hv_evmcs)
+	if (enable_shadow_vmcs || vmx->nested.hv_evmcs_vmptr != -1)
  		vmx->nested.need_vmcs12_to_shadow_sync = true;
  	return NVMX_VMENTRY_VMEXIT;
  }
@@ -3449,7 +3446,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
  		return nested_vmx_failInvalid(vcpu);
  	}
  
-	if (CC(!vmx->nested.hv_evmcs && vmx->nested.current_vmptr == -1ull))
+	if (CC(vmx->nested.hv_evmcs_vmptr == -1 && vmx->nested.current_vmptr == -1ull))
  		return nested_vmx_failInvalid(vcpu);
  
  	vmcs12 = get_vmcs12(vcpu);
@@ -3463,7 +3460,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
  	if (CC(vmcs12->hdr.shadow_vmcs))
  		return nested_vmx_failInvalid(vcpu);
  
-	if (vmx->nested.hv_evmcs) {
+	if (vmx->nested.hv_evmcs_vmptr != -1) {
  		copy_enlightened_to_vmcs12(vmx);
  		/* Enlightened VMCS doesn't have launch state */
  		vmcs12->launch_state = !launch;
@@ -4014,10 +4011,10 @@ static void sync_vmcs02_to_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
  {
  	struct vcpu_vmx *vmx = to_vmx(vcpu);
  
-	if (vmx->nested.hv_evmcs)
+	if (vmx->nested.hv_evmcs_vmptr != -1)
  		sync_vmcs02_to_vmcs12_rare(vcpu, vmcs12);
  
-	vmx->nested.need_sync_vmcs02_to_vmcs12_rare = !vmx->nested.hv_evmcs;
+	vmx->nested.need_sync_vmcs02_to_vmcs12_rare = (vmx->nested.hv_evmcs_vmptr == -1);
  
  	vmcs12->guest_cr0 = vmcs12_guest_cr0(vcpu, vmcs12);
  	vmcs12->guest_cr4 = vmcs12_guest_cr4(vcpu, vmcs12);
@@ -4514,7 +4511,7 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
  	}
  
  	if ((vm_exit_reason != -1) &&
-	    (enable_shadow_vmcs || vmx->nested.hv_evmcs))
+	    (enable_shadow_vmcs || vmx->nested.hv_evmcs_vmptr != -1))
  		vmx->nested.need_vmcs12_to_shadow_sync = true;
  
  	/* in case we halted in L2 */
@@ -5210,7 +5207,7 @@ static int handle_vmptrld(struct kvm_vcpu *vcpu)
  		return nested_vmx_fail(vcpu, VMXERR_VMPTRLD_VMXON_POINTER);
  
  	/* Forbid normal VMPTRLD if Enlightened version was used */
-	if (vmx->nested.hv_evmcs)
+	if (vmx->nested.hv_evmcs_vmptr != -1)
  		return 1;
  
  	if (vmx->nested.current_vmptr != vmptr) {
@@ -5266,7 +5263,7 @@ static int handle_vmptrst(struct kvm_vcpu *vcpu)
  	if (!nested_vmx_check_permission(vcpu))
  		return 1;
  
-	if (unlikely(to_vmx(vcpu)->nested.hv_evmcs))
+	if (unlikely(to_vmx(vcpu)->nested.hv_evmcs_vmptr != -1))
  		return 1;
  
  	if (get_vmx_mem_address(vcpu, exit_qual, instr_info,
@@ -6038,7 +6035,7 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu,
  		if (vmx_has_valid_vmcs12(vcpu)) {
  			kvm_state.size += sizeof(user_vmx_nested_state->vmcs12);
  
-			if (vmx->nested.hv_evmcs)
+			if (vmx->nested.hv_evmcs_vmptr != -1)
  				kvm_state.flags |= KVM_STATE_NESTED_EVMCS;
  
  			if (is_guest_mode(vcpu) &&
@@ -6094,7 +6091,7 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu,
  	} else  {
  		copy_vmcs02_to_vmcs12_rare(vcpu, get_vmcs12(vcpu));
  		if (!vmx->nested.need_vmcs12_to_shadow_sync) {
-			if (vmx->nested.hv_evmcs)
+			if (vmx->nested.hv_evmcs_vmptr != -1)
  				copy_enlightened_to_vmcs12(vmx);
  			else if (enable_shadow_vmcs)
  				copy_shadow_to_vmcs12(vmx);
diff --git a/arch/x86/kvm/vmx/nested.h b/arch/x86/kvm/vmx/nested.h
index 184418baeb3c..55925be48973 100644
--- a/arch/x86/kvm/vmx/nested.h
+++ b/arch/x86/kvm/vmx/nested.h
@@ -63,7 +63,7 @@ static inline int vmx_has_valid_vmcs12(struct kvm_vcpu *vcpu)
  	 * have vmcs12 if it is true.
  	 */
  	return is_guest_mode(vcpu) || vmx->nested.current_vmptr != -1ull ||
-		vmx->nested.hv_evmcs;
+		vmx->nested.hv_evmcs_vmptr != -1;
  }
  
  static inline u16 nested_get_vpid02(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f2fd447eed45..33208aa4ac87 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6974,6 +6974,7 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
  
  	vmx->nested.posted_intr_nv = -1;
  	vmx->nested.current_vmptr = -1ull;
+	vmx->nested.hv_evmcs_vmptr = -1;
  
  	vcpu->arch.microcode_version = 0x100000000ULL;
  	vmx->msr_ia32_feature_control_valid_bits = FEAT_CTL_LOCKED;

Paolo


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 3/7] KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in vmx_get_nested_state()
  2021-05-17 13:50 ` [PATCH v2 3/7] KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in vmx_get_nested_state() Vitaly Kuznetsov
  2021-05-24 12:26   ` Maxim Levitsky
@ 2021-05-24 13:56   ` Paolo Bonzini
  2021-05-24 14:12     ` Vitaly Kuznetsov
  1 sibling, 1 reply; 36+ messages in thread
From: Paolo Bonzini @ 2021-05-24 13:56 UTC (permalink / raw)
  To: Vitaly Kuznetsov, kvm
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Maxim Levitsky,
	linux-kernel

On 17/05/21 15:50, Vitaly Kuznetsov wrote:
> 'Clean fields' data from enlightened VMCS is only valid upon vmentry: L1
> hypervisor is not obliged to keep it up-to-date while it is mangling L2's
> state, KVM_GET_NESTED_STATE request may come at a wrong moment when actual
> eVMCS changes are unsynchronized with 'hv_clean_fields'. As upon migration
> VMCS12 is used as a source of ultimate truth, we must make sure we pick all
> the changes to eVMCS and thus 'clean fields' data must be ignored.

While you're at it, would you mind making copy_vmcs12_to_enlightened and 
copy_enlightened_to_vmcs12 void?

Paolo


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 3/7] KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in vmx_get_nested_state()
  2021-05-24 13:01     ` Vitaly Kuznetsov
@ 2021-05-24 13:58       ` Paolo Bonzini
  2021-05-26 14:44         ` Maxim Levitsky
  0 siblings, 1 reply; 36+ messages in thread
From: Paolo Bonzini @ 2021-05-24 13:58 UTC (permalink / raw)
  To: Vitaly Kuznetsov, Maxim Levitsky
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, linux-kernel, kvm

On 24/05/21 15:01, Vitaly Kuznetsov wrote:
> With 'need_vmcs12_to_shadow_sync', we treat eVMCS as shadow VMCS which
> happens to shadow all fields and while it may not be the most optimal
> solution, it is at least easy to comprehend. We can try drafting
> something up instead, maybe it will also be good but honestly I'm afraid
> of incompatible changes in KVM_GET_NESTED_STATE/KVM_SET_NESTED_STATE, we
> can ask Paolo's opinion on that.

Yes, it's much easier to understand it if the eVMCS is essentially a 
memory-backed shadow VMCS, than if it's really the vmcs12 format.  I 
understand that it's bound to be a little slower, but at least the two 
formats are not all over the place.

Paolo


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/7] KVM: nVMX: Fixes for nested state migration when eVMCS is in use
  2021-05-17 13:50 [PATCH v2 0/7] KVM: nVMX: Fixes for nested state migration when eVMCS is in use Vitaly Kuznetsov
                   ` (7 preceding siblings ...)
  2021-05-24 12:08 ` [PATCH v2 0/7] KVM: nVMX: Fixes for nested state migration when eVMCS is in use Maxim Levitsky
@ 2021-05-24 14:01 ` Paolo Bonzini
  8 siblings, 0 replies; 36+ messages in thread
From: Paolo Bonzini @ 2021-05-24 14:01 UTC (permalink / raw)
  To: Vitaly Kuznetsov, kvm
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Maxim Levitsky,
	linux-kernel

On 17/05/21 15:50, Vitaly Kuznetsov wrote:
> Changes since v1 (Sean):
> - Drop now-unneeded curly braces in nested_sync_vmcs12_to_shadow().
> - Pass 'evmcs->hv_clean_fields' instead of 'bool from_vmentry' to
>    copy_enlightened_to_vmcs12().
> 
> Commit f5c7e8425f18 ("KVM: nVMX: Always make an attempt to map eVMCS after
> migration") fixed the most obvious reason why Hyper-V on KVM (e.g. Win10
>   + WSL2) was crashing immediately after migration. It was also reported
> that we have more issues to fix as, while the failure rate was lowered
> significantly, it was still possible to observe crashes after several
> dozens of migrations. Turns out, the issue arises when we manage to issue
> KVM_GET_NESTED_STATE right after an L2->L1 VMEXIT but before L1 gets a chance
> to run. This state is tracked with 'need_vmcs12_to_shadow_sync' flag but
> the flag itself is not part of saved nested state. A few other less
> significant issues are fixed along the way.
> 
> While there's no proof this series fixes all eVMCS related problems,
> Win10+WSL2 was able to survive 3333 (thanks, Max!) migrations without
> crashing in testing.
> 
> Patches are based on the current kvm/next tree.
> 
> Vitaly Kuznetsov (7):
>    KVM: nVMX: Introduce nested_evmcs_is_used()
>    KVM: nVMX: Release enlightened VMCS on VMCLEAR
>    KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in
>      vmx_get_nested_state()
>    KVM: nVMX: Force enlightened VMCS sync from nested_vmx_failValid()
>    KVM: nVMX: Reset eVMCS clean fields data from prepare_vmcs02()
>    KVM: nVMX: Request to sync eVMCS from VMCS12 after migration
>    KVM: selftests: evmcs_test: Test that KVM_STATE_NESTED_EVMCS is never
>      lost
> 
>   arch/x86/kvm/vmx/nested.c                     | 110 ++++++++++++------
>   .../testing/selftests/kvm/x86_64/evmcs_test.c |  64 +++++-----
>   2 files changed, 115 insertions(+), 59 deletions(-)
> 

Looks good, I'm possibly expecting a v3 depending on what you think 
about my patch 1 suggestion.

Paolo


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 1/7] KVM: nVMX: Introduce nested_evmcs_is_used()
  2021-05-24 13:54   ` Paolo Bonzini
@ 2021-05-24 14:09     ` Vitaly Kuznetsov
  2021-05-24 14:18       ` Paolo Bonzini
  0 siblings, 1 reply; 36+ messages in thread
From: Vitaly Kuznetsov @ 2021-05-24 14:09 UTC (permalink / raw)
  To: Paolo Bonzini, kvm
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Maxim Levitsky,
	linux-kernel

Paolo Bonzini <pbonzini@redhat.com> writes:

> On 17/05/21 15:50, Vitaly Kuznetsov wrote:
>> Unlike regular set_current_vmptr(), nested_vmx_handle_enlightened_vmptrld()
>> can not be called directly from vmx_set_nested_state() as KVM may not have
>> all the information yet (e.g. HV_X64_MSR_VP_ASSIST_PAGE MSR may not be
>> restored yet). Enlightened VMCS is mapped later while getting nested state
>> pages. In the meantime, vmx->nested.hv_evmcs remains NULL and using it
>> for various checks is incorrect. In particular, if KVM_GET_NESTED_STATE is
>> called right after KVM_SET_NESTED_STATE, KVM_STATE_NESTED_EVMCS flag in the
>> resulting state will be unset (and such state will later fail to load).
>> 
>> Introduce nested_evmcs_is_used() and use 'is_guest_mode(vcpu) &&
>> vmx->nested.current_vmptr == -1ull' check to detect not-yet-mapped eVMCS
>> after restore.
>> 
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>
> Would it make sense to instead use hv_evmcs_ptr, making the unused
> value -1 instead of 0?

You mean we'll be using:

"hv_evmcs_ptr == 0" meaning "no evmcs" (like now)
"hv_evmcs_ptr == -1" meaning "evmcs not yet mapped" (and
nested_evmcs_is_used() will check for both '0' and '-1')
"hv_evmcs_ptr == anything else" - eVMCS mapped.

Right?

>
> I even had this untested patch already lying around as a cleanup I
> never bothered to submit:

...

>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index a19cfcb625da..dd3e4ddaaf26 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -187,7 +187,7 @@ static int nested_vmx_fail(struct kvm_vcpu *vcpu, u32 vm_instruction_error)
>   	 * failValid writes the error number to the current VMCS, which
>   	 * can't be done if there isn't a current VMCS.
>   	 */
> -	if (vmx->nested.current_vmptr == -1ull && !vmx->nested.hv_evmcs)
> +	if (vmx->nested.current_vmptr == -1ull && vmx->nested.hv_evmcs_vmptr == -1)

Or, if we decide to integrate this patch, it'll be:

"hv_evmcs_ptr == -1" meaning "no evmcs" (what '0' means now)
"hv_evmcs_ptr == 0" meaning "evmcs not yet mapped"

I'm not against the idea; I just hope we won't forget that we have two
'special' values for 'hv_evmcs_vmptr' and not just one.
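
If we do go that route, a pair of hypothetical named constants (illustrative
only, not something this series introduces) would make the two special
values harder to forget:

	#define EVMPTR_INVALID		(-1ull)	/* no enlightened VMCS in use */
	#define EVMPTR_MAP_PENDING	(0ull)	/* in use, but not mapped yet */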

>   		return nested_vmx_failInvalid(vcpu);
>   
>   	return nested_vmx_failValid(vcpu, vm_instruction_error);
> @@ -221,11 +221,11 @@ static inline void nested_release_evmcs(struct kvm_vcpu *vcpu)
>   {
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
>   
> -	if (!vmx->nested.hv_evmcs)
> +	if (vmx->nested.hv_evmcs_vmptr == -1)
>   		return;
>   
>   	kvm_vcpu_unmap(vcpu, &vmx->nested.hv_evmcs_map, NULL, true);
> -	vmx->nested.hv_evmcs_vmptr = 0;
> +	vmx->nested.hv_evmcs_vmptr = -1;
>   	vmx->nested.hv_evmcs = NULL;
>   }
>   
> @@ -1978,10 +1978,8 @@ static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld(
>   	if (!nested_enlightened_vmentry(vcpu, &evmcs_gpa))
>   		return EVMPTRLD_DISABLED;
>   
> -	if (unlikely(!vmx->nested.hv_evmcs ||
> -		     evmcs_gpa != vmx->nested.hv_evmcs_vmptr)) {
> -		if (!vmx->nested.hv_evmcs)
> -			vmx->nested.current_vmptr = -1ull;
> +	if (unlikely(evmcs_gpa != vmx->nested.hv_evmcs_vmptr)) {
> +		vmx->nested.current_vmptr = -1ull;
>   
>   		nested_release_evmcs(vcpu);
>   
> @@ -2052,7 +2050,7 @@ void nested_sync_vmcs12_to_shadow(struct kvm_vcpu *vcpu)
>   {
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
>   
> -	if (vmx->nested.hv_evmcs) {
> +	if (vmx->nested.hv_evmcs_vmptr != -1) {
>   		copy_vmcs12_to_enlightened(vmx);
>   		/* All fields are clean */
>   		vmx->nested.hv_evmcs->hv_clean_fields |=
> @@ -2204,7 +2202,7 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
>   	u32 exec_control;
>   	u64 guest_efer = nested_vmx_calc_efer(vmx, vmcs12);
>   
> -	if (vmx->nested.dirty_vmcs12 || vmx->nested.hv_evmcs)
> +	if (vmx->nested.dirty_vmcs12 || vmx->nested.hv_evmcs_vmptr != -1)
>   		prepare_vmcs02_early_rare(vmx, vmcs12);
>   
>   	/*
> @@ -2487,9 +2485,8 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>   			  enum vm_entry_failure_code *entry_failure_code)
>   {
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
> -	struct hv_enlightened_vmcs *hv_evmcs = vmx->nested.hv_evmcs;
>   
> -	if (vmx->nested.dirty_vmcs12 || hv_evmcs) {
> +	if (vmx->nested.dirty_vmcs12 || vmx->nested.hv_evmcs_vmptr != -1) {
>   		prepare_vmcs02_rare(vmx, vmcs12);
>   		vmx->nested.dirty_vmcs12 = false;
>   	}
> @@ -3075,7 +3072,7 @@ static bool nested_get_evmcs_page(struct kvm_vcpu *vcpu)
>   	 * L2 was running), map it here to make sure vmcs12 changes are
>   	 * properly reflected.
>   	 */
> -	if (vmx->nested.enlightened_vmcs_enabled && !vmx->nested.hv_evmcs) {
> +	if (vmx->nested.enlightened_vmcs_enabled && vmx->nested.hv_evmcs_vmptr == -1) {
>   		enum nested_evmptrld_status evmptrld_status =
>   			nested_vmx_handle_enlightened_vmptrld(vcpu, false);
>   
> @@ -3419,7 +3416,7 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
>   
>   	load_vmcs12_host_state(vcpu, vmcs12);
>   	vmcs12->vm_exit_reason = exit_reason.full;
> -	if (enable_shadow_vmcs || vmx->nested.hv_evmcs)
> +	if (enable_shadow_vmcs || vmx->nested.hv_evmcs_vmptr != -1)
>   		vmx->nested.need_vmcs12_to_shadow_sync = true;
>   	return NVMX_VMENTRY_VMEXIT;
>   }
> @@ -3449,7 +3446,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
>   		return nested_vmx_failInvalid(vcpu);
>   	}
>   
> -	if (CC(!vmx->nested.hv_evmcs && vmx->nested.current_vmptr == -1ull))
> +	if (CC(vmx->nested.hv_evmcs_vmptr == -1 && vmx->nested.current_vmptr == -1ull))
>   		return nested_vmx_failInvalid(vcpu);
>   
>   	vmcs12 = get_vmcs12(vcpu);
> @@ -3463,7 +3460,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
>   	if (CC(vmcs12->hdr.shadow_vmcs))
>   		return nested_vmx_failInvalid(vcpu);
>   
> -	if (vmx->nested.hv_evmcs) {
> +	if (vmx->nested.hv_evmcs_vmptr != -1) {
>   		copy_enlightened_to_vmcs12(vmx);
>   		/* Enlightened VMCS doesn't have launch state */
>   		vmcs12->launch_state = !launch;
> @@ -4014,10 +4011,10 @@ static void sync_vmcs02_to_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
>   {
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
>   
> -	if (vmx->nested.hv_evmcs)
> +	if (vmx->nested.hv_evmcs_vmptr != -1)
>   		sync_vmcs02_to_vmcs12_rare(vcpu, vmcs12);
>   
> -	vmx->nested.need_sync_vmcs02_to_vmcs12_rare = !vmx->nested.hv_evmcs;
> +	vmx->nested.need_sync_vmcs02_to_vmcs12_rare = (vmx->nested.hv_evmcs_vmptr == -1);
>   
>   	vmcs12->guest_cr0 = vmcs12_guest_cr0(vcpu, vmcs12);
>   	vmcs12->guest_cr4 = vmcs12_guest_cr4(vcpu, vmcs12);
> @@ -4514,7 +4511,7 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
>   	}
>   
>   	if ((vm_exit_reason != -1) &&
> -	    (enable_shadow_vmcs || vmx->nested.hv_evmcs))
> +	    (enable_shadow_vmcs || vmx->nested.hv_evmcs_vmptr != -1))
>   		vmx->nested.need_vmcs12_to_shadow_sync = true;
>   
>   	/* in case we halted in L2 */
> @@ -5210,7 +5207,7 @@ static int handle_vmptrld(struct kvm_vcpu *vcpu)
>   		return nested_vmx_fail(vcpu, VMXERR_VMPTRLD_VMXON_POINTER);
>   
>   	/* Forbid normal VMPTRLD if Enlightened version was used */
> -	if (vmx->nested.hv_evmcs)
> +	if (vmx->nested.hv_evmcs_vmptr != -1)
>   		return 1;
>   
>   	if (vmx->nested.current_vmptr != vmptr) {
> @@ -5266,7 +5263,7 @@ static int handle_vmptrst(struct kvm_vcpu *vcpu)
>   	if (!nested_vmx_check_permission(vcpu))
>   		return 1;
>   
> -	if (unlikely(to_vmx(vcpu)->nested.hv_evmcs))
> +	if (unlikely(to_vmx(vcpu)->nested.hv_evmcs_vmptr != -1))
>   		return 1;
>   
>   	if (get_vmx_mem_address(vcpu, exit_qual, instr_info,
> @@ -6038,7 +6035,7 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu,
>   		if (vmx_has_valid_vmcs12(vcpu)) {
>   			kvm_state.size += sizeof(user_vmx_nested_state->vmcs12);
>   
> -			if (vmx->nested.hv_evmcs)
> +			if (vmx->nested.hv_evmcs_vmptr != -1)
>   				kvm_state.flags |= KVM_STATE_NESTED_EVMCS;
>   
>   			if (is_guest_mode(vcpu) &&
> @@ -6094,7 +6091,7 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu,
>   	} else  {
>   		copy_vmcs02_to_vmcs12_rare(vcpu, get_vmcs12(vcpu));
>   		if (!vmx->nested.need_vmcs12_to_shadow_sync) {
> -			if (vmx->nested.hv_evmcs)
> +			if (vmx->nested.hv_evmcs_vmptr != -1)
>   				copy_enlightened_to_vmcs12(vmx);
>   			else if (enable_shadow_vmcs)
>   				copy_shadow_to_vmcs12(vmx);
> diff --git a/arch/x86/kvm/vmx/nested.h b/arch/x86/kvm/vmx/nested.h
> index 184418baeb3c..55925be48973 100644
> --- a/arch/x86/kvm/vmx/nested.h
> +++ b/arch/x86/kvm/vmx/nested.h
> @@ -63,7 +63,7 @@ static inline int vmx_has_valid_vmcs12(struct kvm_vcpu *vcpu)
>   	 * have vmcs12 if it is true.
>   	 */
>   	return is_guest_mode(vcpu) || vmx->nested.current_vmptr != -1ull ||
> -		vmx->nested.hv_evmcs;
> +		vmx->nested.hv_evmcs_vmptr != -1;
>   }
>   
>   static inline u16 nested_get_vpid02(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index f2fd447eed45..33208aa4ac87 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6974,6 +6974,7 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
>   
>   	vmx->nested.posted_intr_nv = -1;
>   	vmx->nested.current_vmptr = -1ull;
> +	vmx->nested.hv_evmcs_vmptr = -1;
>   
>   	vcpu->arch.microcode_version = 0x100000000ULL;
>   	vmx->msr_ia32_feature_control_valid_bits = FEAT_CTL_LOCKED;
>
> Paolo
>

-- 
Vitaly


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 3/7] KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in vmx_get_nested_state()
  2021-05-24 13:56   ` Paolo Bonzini
@ 2021-05-24 14:12     ` Vitaly Kuznetsov
  0 siblings, 0 replies; 36+ messages in thread
From: Vitaly Kuznetsov @ 2021-05-24 14:12 UTC (permalink / raw)
  To: Paolo Bonzini, kvm
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Maxim Levitsky,
	linux-kernel

Paolo Bonzini <pbonzini@redhat.com> writes:

> On 17/05/21 15:50, Vitaly Kuznetsov wrote:
>> 'Clean fields' data from enlightened VMCS is only valid upon vmentry: L1
>> hypervisor is not obliged to keep it up-to-date while it is mangling L2's
>> state, KVM_GET_NESTED_STATE request may come at a wrong moment when actual
>> eVMCS changes are unsynchronized with 'hv_clean_fields'. As upon migration
>> VMCS12 is used as a source of ultimate truth, we must make sure we pick all
>> the changes to eVMCS and thus 'clean fields' data must be ignored.
>
> While you're at it, would you mind making copy_vmcs12_to_enlightened and 
> copy_enlightened_to_vmcs12 void?
>

Sure, no problem.
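
Just to spell out what I have in mind, a rough sketch of the resulting
prototypes (the hv_clean_fields parameter is the one from the v2
changelog; the final form in v3 may differ):

static void copy_vmcs12_to_enlightened(struct vcpu_vmx *vmx);
static void copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx, u32 hv_clean_fields);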

-- 
Vitaly


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 1/7] KVM: nVMX: Introduce nested_evmcs_is_used()
  2021-05-24 14:09     ` Vitaly Kuznetsov
@ 2021-05-24 14:18       ` Paolo Bonzini
  2021-05-24 14:37         ` Vitaly Kuznetsov
  0 siblings, 1 reply; 36+ messages in thread
From: Paolo Bonzini @ 2021-05-24 14:18 UTC (permalink / raw)
  To: Vitaly Kuznetsov, kvm
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Maxim Levitsky,
	linux-kernel

On 24/05/21 16:09, Vitaly Kuznetsov wrote:
> You mean we'll be using:
> 
> "hv_evmcs_ptr == 0" meaning "no evmcs" (like now)
> "hv_evmcs_ptr == -1" meaing "evmcs not yet mapped" (and
> nested_evmcs_is_used() will check for both '0' and '-1')
> "hv_evmcs_ptr == anything else" - eVMCS mapped.

I was thinking of:

hv_evmcs_ptr == -1 meaning no evmcs

hv_evmcs == NULL meaning "evmcs not yet mapped" (I think)

hv_evmcs != NULL meaning "evmcs mapped".

As usual with my suggestions, I'm not sure if this makes sense :) but if 
it does, the code should be nicer.
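
In code, roughly something like this (just a sketch of the naming, not a
real patch; I'm assuming the hv_evmcs_vmptr field from the diff above):

static inline bool nested_evmcs_in_use(struct vcpu_vmx *vmx)
{
	/* -1 means the guest never loaded an eVMCS (or released it). */
	return vmx->nested.hv_evmcs_vmptr != -1ull;
}

static inline bool nested_evmcs_mapped(struct vcpu_vmx *vmx)
{
	/* In use and already mapped into host address space. */
	return nested_evmcs_in_use(vmx) && vmx->nested.hv_evmcs;
}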

Paolo


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 1/7] KVM: nVMX: Introduce nested_evmcs_is_used()
  2021-05-24 14:18       ` Paolo Bonzini
@ 2021-05-24 14:37         ` Vitaly Kuznetsov
  0 siblings, 0 replies; 36+ messages in thread
From: Vitaly Kuznetsov @ 2021-05-24 14:37 UTC (permalink / raw)
  To: Paolo Bonzini, kvm
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Maxim Levitsky,
	linux-kernel

Paolo Bonzini <pbonzini@redhat.com> writes:

> On 24/05/21 16:09, Vitaly Kuznetsov wrote:
>> You mean we'll be using:
>> 
>> "hv_evmcs_ptr == 0" meaning "no evmcs" (like now)
>> "hv_evmcs_ptr == -1" meaing "evmcs not yet mapped" (and
>> nested_evmcs_is_used() will check for both '0' and '-1')
>> "hv_evmcs_ptr == anything else" - eVMCS mapped.
>
> I was thinking of:
>
> hv_evmcs_ptr == -1 meaning no evmcs
>
> hv_evmcs == NULL meaning "evmcs not yet mapped" (I think)
>
> hv_evmcs != NULL meaning "evmcs mapped".
>
> As usual with my suggestions, I'm not sure if this makes sense :) but if 
> it does, the code should be nicer.

Ok, let me try this!

-- 
Vitaly


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 1/7] KVM: nVMX: Introduce nested_evmcs_is_used()
  2021-05-24 12:35     ` Vitaly Kuznetsov
@ 2021-05-26 14:34       ` Maxim Levitsky
  2021-05-27  7:54         ` Vitaly Kuznetsov
  0 siblings, 1 reply; 36+ messages in thread
From: Maxim Levitsky @ 2021-05-26 14:34 UTC (permalink / raw)
  To: Vitaly Kuznetsov, kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, linux-kernel

On Mon, 2021-05-24 at 14:35 +0200, Vitaly Kuznetsov wrote:
> Maxim Levitsky <mlevitsk@redhat.com> writes:
> 
> > On Mon, 2021-05-17 at 15:50 +0200, Vitaly Kuznetsov wrote:
> > > Unlike regular set_current_vmptr(), nested_vmx_handle_enlightened_vmptrld()
> > > can not be called directly from vmx_set_nested_state() as KVM may not have
> > > all the information yet (e.g. HV_X64_MSR_VP_ASSIST_PAGE MSR may not be
> > > restored yet). Enlightened VMCS is mapped later while getting nested state
> > > pages. In the meantime, vmx->nested.hv_evmcs remains NULL and using it
> > > for various checks is incorrect. In particular, if KVM_GET_NESTED_STATE is
> > > called right after KVM_SET_NESTED_STATE, KVM_STATE_NESTED_EVMCS flag in the
> > > resulting state will be unset (and such state will later fail to load).
> > > 
> > > Introduce nested_evmcs_is_used() and use 'is_guest_mode(vcpu) &&
> > > vmx->nested.current_vmptr == -1ull' check to detect not-yet-mapped eVMCS
> > > after restore.
> > > 
> > > Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> > > ---
> > >  arch/x86/kvm/vmx/nested.c | 31 ++++++++++++++++++++++++++-----
> > >  1 file changed, 26 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > > index 6058a65a6ede..3080e00c8f90 100644
> > > --- a/arch/x86/kvm/vmx/nested.c
> > > +++ b/arch/x86/kvm/vmx/nested.c
> > > @@ -141,6 +141,27 @@ static void init_vmcs_shadow_fields(void)
> > >  	max_shadow_read_write_fields = j;
> > >  }
> > >  
> > > +static inline bool nested_evmcs_is_used(struct vcpu_vmx *vmx)
> > > +{
> > > +	struct kvm_vcpu *vcpu = &vmx->vcpu;
> > > +
> > > +	if (vmx->nested.hv_evmcs)
> > > +		return true;
> > > +
> > > +	/*
> > > +	 * After KVM_SET_NESTED_STATE, enlightened VMCS is mapped during
> > > +	 * KVM_REQ_GET_NESTED_STATE_PAGES handling and until the request is
> > > +	 * processed vmx->nested.hv_evmcs is NULL. It is, however, possible to
> > > +	 * detect such state by checking 'nested.current_vmptr == -1ull' when
> > > +	 * vCPU is in guest mode, it is only possible with eVMCS.
> > > +	 */
> > > +	if (unlikely(vmx->nested.enlightened_vmcs_enabled && is_guest_mode(vcpu) &&
> > > +		     (vmx->nested.current_vmptr == -1ull)))
> > > +		return true;
> > > +
> > > +	return false;
> > > +}
> > 
> > I think that this is a valid way to solve the issue,
> > but it feels like there might be a better way.
> > I don't mind though to accept this patch as is.
> > 
> > So here are my 2 cents about this:
> > 
> > First of all after studying how evmcs works I take my words back
> > about needing to migrate its contents. 
> > 
> > It is indeed enough to migrate its physical address, 
> > or maybe even just a flag that evmcs is loaded
> > (and to my surprise we already do this - KVM_STATE_NESTED_EVMCS)
> > 
> > So how about just having a boolean flag that indicates that evmcs is in use, 
> > but doesn't imply that we know its address or that it is mapped 
> > to host address space, something like 'vmx->nested.enlightened_vmcs_loaded'
> > 
> > On migration that flag saved and restored as the KVM_STATE_NESTED_EVMCS,
> > otherwise it set when we load an evmcs and cleared when it is released.
> > 
> > Then as far as I can see we can use this flag in nested_evmcs_is_used
> > since all its callers don't touch evmcs, thus don't need it to be
> > mapped.
> > 
> > What do you think?



> > 
> 
> First, we need to be compatible with older KVMs which don't have the
> flag and this is problematic: currently, we always expect vmcs12 to
> carry valid contents. This is challenging.

All right, I understand this can be an issue!

If the userspace doesn't set KVM_STATE_NESTED_EVMCS but there is a valid
eVMCS, as later indicated by enabling it in the HV assist page, we can
just reuse the logic from this patch, but use it to set a
vmx->nested.enlightened_vmcs_loaded flag (or whatever we decide to name
it). Later we can even deprecate and disable this fallback with a new
KVM cap.


BTW, I like Paolo's idea of encoding this flag in the evmcs_gpa itself,
like this:

-1            - no evmcs
0             - evmcs enabled but its gpa not known yet
anything else - valid gpa.


Also as I said, I am not against this patch either, 
I am just thinking maybe we can make it a bit better.


> 
> Second, vCPU can be migrated in three different states:
> 1) While L2 was running ('true' nested state is in VMCS02)
> 2) While L1 was running ('true' nested state is in eVMCS)
> 3) Right after an exit from L2 to L1 was forced
> ('need_vmcs12_to_shadow_sync = true') ('true' nested state is in
> VMCS12).

Yes, and this was quite a difficult thing to understand when I was
trying to figure out how this code works.

Also, you can add another interesting state:

4) Right after emulating vmlaunch/vmresume but before the actual entry
to the nested guest (aka nested_run_pending=true)



> 
> The current solution is to always use VMCS12 as a container to transfer
> the state and conceptually, it is at least easier to understand.
> 
> We can, indeed, transfer eVMCS (or VMCS12) in case 2) through guest
> memory and I even tried that but that was making the code more complex
> so eventually I gave up and decided to preserve the 'always use VMCS12
> as a container' status quo.


My only concern is that it feels wrong to update the eVMCS when not
doing a nested vmexit, because outside of a vmexit the eVMCS is owned by
the L1 hypervisor; at the very least we shouldn't touch the fields which
aren't supposed to be updated by us.


This is a rough code draft of what I had in mind (not tested).
To me this seems reasonable, but I do agree that there are some
complexity tradeoffs involved.

About compatibility, it can be said that:


Case 1:
Both old and new kernels will send/receive an up-to-date vmcs12, while
the evmcs is not up to date and its contents aren't even defined (since
L2 runs).

Case 2:
An old kernel will send a vmcs12 with the partial changes that L1 has
already made to the evmcs, plus the latest state of the evmcs with all
changes in guest memory.

But these changes will be discarded on the receiving side, since once
L1 asks us to enter L2 we will reload all the state from the eVMCS (at
least the state that is marked as dirty, i.e. that differs from the
vmcs12 as it was on the last nested vmexit).

A new kernel will always send the vmcs12 as it was on the last vmexit:
a somewhat older version, but a more consistent one.

But this doesn't matter either: just like in the case of the old
kernel, the vmcs12 will be updated from the evmcs as soon as we do
another L2 entry.

So while in this case we send a 'more stale' vmcs12, it doesn't really
matter, as it is stale anyway and will be reloaded from the evmcs.

Case 3:
An old kernel will send an up-to-date vmcs12 (since L1 didn't have a
chance to run after the nested vmexit anyway). The evmcs will not be up
to date in guest memory, but a newer kernel can fix this by updating it
as you did in patch 6.

A new kernel will send an up-to-date vmcs12 (same reason) and an
up-to-date evmcs, so in fact an unmodified target kernel will be able to
migrate from this state.

So in fact my suggestion would actually allow migrating to a kernel
without the fix applied.
This is even better than I thought.


This is a rough draft of the idea:


diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 6058a65a6ede..98eb7526cae6 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -167,15 +167,22 @@ static int nested_vmx_failInvalid(struct kvm_vcpu *vcpu)
 static int nested_vmx_failValid(struct kvm_vcpu *vcpu,
 				u32 vm_instruction_error)
 {
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	vmx_set_rflags(vcpu, (vmx_get_rflags(vcpu)
 			& ~(X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF |
 			    X86_EFLAGS_SF | X86_EFLAGS_OF))
 			| X86_EFLAGS_ZF);
 	get_vmcs12(vcpu)->vm_instruction_error = vm_instruction_error;
+
 	/*
 	 * We don't need to force a shadow sync because
 	 * VM_INSTRUCTION_ERROR is not shadowed
+	 * We do need to update the evmcs
 	 */
+
+	if (vmx->nested.hv_evmcs)
+		vmx->nested.hv_evmcs->vm_instruction_error = vm_instruction_error;
+
 	return kvm_skip_emulated_instruction(vcpu);
 }
 
@@ -1962,6 +1969,10 @@ static int copy_vmcs12_to_enlightened(struct vcpu_vmx *vmx)
 
 	evmcs->guest_bndcfgs = vmcs12->guest_bndcfgs;
 
+	/* All fields are clean */
+	vmx->nested.hv_evmcs->hv_clean_fields |=
+		HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL;
+
 	return 0;
 }
 
@@ -2055,16 +2066,7 @@ static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld(
 void nested_sync_vmcs12_to_shadow(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
-
-	if (vmx->nested.hv_evmcs) {
-		copy_vmcs12_to_enlightened(vmx);
-		/* All fields are clean */
-		vmx->nested.hv_evmcs->hv_clean_fields |=
-			HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL;
-	} else {
-		copy_vmcs12_to_shadow(vmx);
-	}
-
+	copy_vmcs12_to_shadow(vmx);
 	vmx->nested.need_vmcs12_to_shadow_sync = false;
 }
 
@@ -3437,8 +3439,13 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
 
 	load_vmcs12_host_state(vcpu, vmcs12);
 	vmcs12->vm_exit_reason = exit_reason.full;
-	if (enable_shadow_vmcs || vmx->nested.hv_evmcs)
+
+	if (enable_shadow_vmcs)
 		vmx->nested.need_vmcs12_to_shadow_sync = true;
+
+	if (vmx->nested.hv_evmcs)
+		copy_vmcs12_to_enlightened(vmx);
+
 	return NVMX_VMENTRY_VMEXIT;
 }
 
@@ -4531,10 +4538,12 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
 		kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
 	}
 
-	if ((vm_exit_reason != -1) &&
-	    (enable_shadow_vmcs || vmx->nested.hv_evmcs))
+	if ((vm_exit_reason != -1) && enable_shadow_vmcs)
 		vmx->nested.need_vmcs12_to_shadow_sync = true;
 
+	if (vmx->nested.hv_evmcs)
+		copy_vmcs12_to_enlightened(vmx);
+
 	/* in case we halted in L2 */
 	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
 
@@ -6111,12 +6120,8 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu,
 		sync_vmcs02_to_vmcs12_rare(vcpu, vmcs12);
 	} else  {
 		copy_vmcs02_to_vmcs12_rare(vcpu, get_vmcs12(vcpu));
-		if (!vmx->nested.need_vmcs12_to_shadow_sync) {
-			if (vmx->nested.hv_evmcs)
-				copy_enlightened_to_vmcs12(vmx);
-			else if (enable_shadow_vmcs)
-				copy_shadow_to_vmcs12(vmx);
-		}
+		if (enable_shadow_vmcs && !vmx->nested.need_vmcs12_to_shadow_sync)
+			copy_shadow_to_vmcs12(vmx);
 	}
 
 	BUILD_BUG_ON(sizeof(user_vmx_nested_state->vmcs12) < VMCS12_SIZE);


Best regards,
	Maxim Levitsky

> 
> -- 
> Vitaly
> 



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/7] KVM: nVMX: Fixes for nested state migration when eVMCS is in use
  2021-05-24 12:44   ` Vitaly Kuznetsov
@ 2021-05-26 14:41     ` Maxim Levitsky
  2021-05-27  8:01       ` Vitaly Kuznetsov
  0 siblings, 1 reply; 36+ messages in thread
From: Maxim Levitsky @ 2021-05-26 14:41 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, linux-kernel, kvm,
	Paolo Bonzini

On Mon, 2021-05-24 at 14:44 +0200, Vitaly Kuznetsov wrote:
> Maxim Levitsky <mlevitsk@redhat.com> writes:
> 
> > On Mon, 2021-05-17 at 15:50 +0200, Vitaly Kuznetsov wrote:
> > > Changes since v1 (Sean):
> > > - Drop now-unneeded curly braces in nested_sync_vmcs12_to_shadow().
> > > - Pass 'evmcs->hv_clean_fields' instead of 'bool from_vmentry' to
> > >   copy_enlightened_to_vmcs12().
> > > 
> > > Commit f5c7e8425f18 ("KVM: nVMX: Always make an attempt to map eVMCS after
> > > migration") fixed the most obvious reason why Hyper-V on KVM (e.g. Win10
> > >  + WSL2) was crashing immediately after migration. It was also reported
> > > that we have more issues to fix as, while the failure rate was lowered 
> > > signifincatly, it was still possible to observe crashes after several
> > > dozens of migration. Turns out, the issue arises when we manage to issue
> > > KVM_GET_NESTED_STATE right after L2->L2 VMEXIT but before L1 gets a chance
> > > to run. This state is tracked with 'need_vmcs12_to_shadow_sync' flag but
> > > the flag itself is not part of saved nested state. A few other less 
> > > significant issues are fixed along the way.
> > > 
> > > While there's no proof this series fixes all eVMCS related problems,
> > > Win10+WSL2 was able to survive 3333 (thanks, Max!) migrations without
> > > crashing in testing.
> > > 
> > > Patches are based on the current kvm/next tree.
> > > 
> > > Vitaly Kuznetsov (7):
> > >   KVM: nVMX: Introduce nested_evmcs_is_used()
> > >   KVM: nVMX: Release enlightened VMCS on VMCLEAR
> > >   KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in
> > >     vmx_get_nested_state()
> > >   KVM: nVMX: Force enlightened VMCS sync from nested_vmx_failValid()
> > >   KVM: nVMX: Reset eVMCS clean fields data from prepare_vmcs02()
> > >   KVM: nVMX: Request to sync eVMCS from VMCS12 after migration
> > >   KVM: selftests: evmcs_test: Test that KVM_STATE_NESTED_EVMCS is never
> > >     lost
> > > 
> > >  arch/x86/kvm/vmx/nested.c                     | 110 ++++++++++++------
> > >  .../testing/selftests/kvm/x86_64/evmcs_test.c |  64 +++++-----
> > >  2 files changed, 115 insertions(+), 59 deletions(-)
> > > 
> > 
> > Hi Vitaly!
> > 
> > In addition to the review of this patch series,
> 
> Thanks by the way!
No problem!

> 
> >  I would like
> > to share an idea on how to avoid the hack of mapping the evmcs
> > in nested_vmx_vmexit, because I think I found a possible generic
> > solution to this and similar issues:
> > 
> > The solution is to always set nested_run_pending after 
> > nested migration (which means that we won't really
> > need to migrate this flag anymore).
> > 
> > I was thinking a lot about it and I think that there is no downside to this,
> > other than sometimes a one extra vmexit after migration.
> > 
> > Otherwise there is always a risk of the following scenario:
> > 
> >   1. We migrate with nested_run_pending=0 (but don't restore all the state
> >   yet, like that HV_X64_MSR_VP_ASSIST_PAGE msr,
> >   or just the guest memory map is not up to date, guest is in smm or something
> >   like that)
> > 
> >   2. Userspace calls some ioctl that causes a nested vmexit
> > 
> >   This can happen today if the userspace calls 
> >   kvm_arch_vcpu_ioctl_get_mpstate -> kvm_apic_accept_events -> kvm_check_nested_events
> > 
> >   3. Userspace finally sets correct guest's msrs, correct guest memory map and only
> >   then calls KVM_RUN
> > 
> > This means that at (2) we can't map and write the evmcs/vmcs12/vmcb12 even
> > if KVM_REQ_GET_NESTED_STATE_PAGES is pending,
> > but we have to do so to complete the nested vmexit.
> 
> Why do we need to write to eVMCS to complete vmexit? AFAICT, there's
> only one place which calls copy_vmcs12_to_enlightened():
> nested_sync_vmcs12_to_shadow() which, in its turn, has only 1 caller:
> vmx_prepare_switch_to_guest() so unless userspace decided to execute
> not-fully-restored guest this should not happen. I'm probably missing
> something in your scenario)
You are right! 
The evmcs write is delayed to the next vmentry.

However, we are now mapping the evmcs during nested vmexit, and this
can fail, for example when the HV assist page MSR is not up to date.

For example, consider this:

1. Userspace first sets the nested state.
2. Userspace calls KVM_GET_MP_STATE.
3. The nested vmexit that happens in 2 will end up not being able to map
   the evmcs, since the HV assist page MSR is not yet restored.


Also, the vmcb write (that is, on SVM) _is_ done right away on nested
vmexit and conceptually has the same issue (if the memory map is not up
to date, we might not be able to read/write the vmcb12 on nested
vmexit).


> 
> > To some extent, the entry to the nested mode after a migration is only complete
> > when we process the KVM_REQ_GET_NESTED_STATE_PAGES, so we shoudn't interrupt it.
> > 
> > This will allow us to avoid dealing with KVM_REQ_GET_NESTED_STATE_PAGES on
> > nested vmexit path at all. 
> 
> Remember, we have three possible states when nested state is
> transferred:
> 1) L2 was running
> 2) L1 was running
> 3) We're in beetween L2 and L1 (need_vmcs12_to_shadow_sync = true).

I understand. This suggestion wasn't meant to fix case 3, but rather
case 1, where we are in L2, migrate, and then immediately decide to do a
nested vmexit before we have processed the KVM_REQ_GET_NESTED_STATE_PAGES
request, and potentially also before the guest state has been fully
restored (see that KVM_GET_MP_STATE thing).
 
In a nutshell, I vote for not allowing nested vmexits from the moment
we set the nested state until the moment we enter the nested guest once
(maybe with a request for an immediate vmexit), because during this time
period the guest state is not fully consistent.
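
A rough sketch of what I mean (untested, and assuming
vmx_set_nested_state() is the right place; the same would apply to the
SVM counterpart):

	/* What we do today, roughly: restore the migrated flag only. */
	vmx->nested.nested_run_pending =
		!!(kvm_state->flags & KVM_STATE_NESTED_RUN_PENDING);

	/*
	 * Idea: if the vCPU is restored in guest mode, force a pending
	 * nested run so that no nested vmexit can be synthesized (e.g.
	 * from KVM_GET_MP_STATE) until we actually enter L2 once.
	 */
	if (kvm_state->flags & KVM_STATE_NESTED_GUEST_MODE)
		vmx->nested.nested_run_pending = true;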

Best regards,
	Maxim Levitsky

> 
> Is 'nested_run_pending' suitable for all of them? Could you maybe draft
> a patch so we can see how this works (in both 'normal' and 'evmcs'
> cases)?
> 



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 3/7] KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in vmx_get_nested_state()
  2021-05-24 13:58       ` Paolo Bonzini
@ 2021-05-26 14:44         ` Maxim Levitsky
  0 siblings, 0 replies; 36+ messages in thread
From: Maxim Levitsky @ 2021-05-26 14:44 UTC (permalink / raw)
  To: Paolo Bonzini, Vitaly Kuznetsov
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, linux-kernel, kvm

On Mon, 2021-05-24 at 15:58 +0200, Paolo Bonzini wrote:
> On 24/05/21 15:01, Vitaly Kuznetsov wrote:
> > With 'need_vmcs12_to_shadow_sync', we treat eVMCS as shadow VMCS which
> > happens to shadow all fields and while it may not be the most optimal
> > solution, it is at least easy to comprehend. We can try drafting
> > something up instead, maybe it will also be good but honestly I'm afraid
> > of incompatible changes in KVM_GET_NESTED_STATE/KVM_SET_NESTED_STATE, we
> > can ask Paolo's opinion on that.
> 
> Yes, it's much easier to understand it if the eVMCS is essentially a 
> memory-backed shadow VMCS, than if it's really the vmcs12 format.  I 
> understand that it's bound to be a little slower, but at least the two 
> formats are not all over the place.
> 
> Paolo
> 

Hi!

Please see my other reply to this in patch 1.
 
I understand this concern, but what bugs me is that we sort of
shouldn't read the evmcs while L1 is running (e.g. its clean bits might
not be up to date, and so on).
 
Actually, instead of thinking of the evmcs as a shadow VMCS, I think of
it more as a vmcb12 (the SVM one): something we load when we do a nested
entry and then update when we do a nested vmexit, and other than that,
while L1 is running, we don't touch it.
 
Yes, there is that VM instruction error field in the evmcs, which I
suppose we should write when we fail a VMX instruction while just
running L1 (practically only invept, I think); and even that we might
avoid doing, which would allow us to avoid keeping the evmcs mapped at
all while L1 is running.

Just my 0.2 cents.

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 1/7] KVM: nVMX: Introduce nested_evmcs_is_used()
  2021-05-26 14:34       ` Maxim Levitsky
@ 2021-05-27  7:54         ` Vitaly Kuznetsov
  2021-05-27 14:10           ` Maxim Levitsky
  0 siblings, 1 reply; 36+ messages in thread
From: Vitaly Kuznetsov @ 2021-05-27  7:54 UTC (permalink / raw)
  To: Maxim Levitsky, kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, linux-kernel

Maxim Levitsky <mlevitsk@redhat.com> writes:

> On Mon, 2021-05-24 at 14:35 +0200, Vitaly Kuznetsov wrote:
>> Maxim Levitsky <mlevitsk@redhat.com> writes:
>> 
>> > On Mon, 2021-05-17 at 15:50 +0200, Vitaly Kuznetsov wrote:
>> > > Unlike regular set_current_vmptr(), nested_vmx_handle_enlightened_vmptrld()
>> > > can not be called directly from vmx_set_nested_state() as KVM may not have
>> > > all the information yet (e.g. HV_X64_MSR_VP_ASSIST_PAGE MSR may not be
>> > > restored yet). Enlightened VMCS is mapped later while getting nested state
>> > > pages. In the meantime, vmx->nested.hv_evmcs remains NULL and using it
>> > > for various checks is incorrect. In particular, if KVM_GET_NESTED_STATE is
>> > > called right after KVM_SET_NESTED_STATE, KVM_STATE_NESTED_EVMCS flag in the
>> > > resulting state will be unset (and such state will later fail to load).
>> > > 
>> > > Introduce nested_evmcs_is_used() and use 'is_guest_mode(vcpu) &&
>> > > vmx->nested.current_vmptr == -1ull' check to detect not-yet-mapped eVMCS
>> > > after restore.
>> > > 
>> > > Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> > > ---
>> > >  arch/x86/kvm/vmx/nested.c | 31 ++++++++++++++++++++++++++-----
>> > >  1 file changed, 26 insertions(+), 5 deletions(-)
>> > > 
>> > > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> > > index 6058a65a6ede..3080e00c8f90 100644
>> > > --- a/arch/x86/kvm/vmx/nested.c
>> > > +++ b/arch/x86/kvm/vmx/nested.c
>> > > @@ -141,6 +141,27 @@ static void init_vmcs_shadow_fields(void)
>> > >  	max_shadow_read_write_fields = j;
>> > >  }
>> > >  
>> > > +static inline bool nested_evmcs_is_used(struct vcpu_vmx *vmx)
>> > > +{
>> > > +	struct kvm_vcpu *vcpu = &vmx->vcpu;
>> > > +
>> > > +	if (vmx->nested.hv_evmcs)
>> > > +		return true;
>> > > +
>> > > +	/*
>> > > +	 * After KVM_SET_NESTED_STATE, enlightened VMCS is mapped during
>> > > +	 * KVM_REQ_GET_NESTED_STATE_PAGES handling and until the request is
>> > > +	 * processed vmx->nested.hv_evmcs is NULL. It is, however, possible to
>> > > +	 * detect such state by checking 'nested.current_vmptr == -1ull' when
>> > > +	 * vCPU is in guest mode, it is only possible with eVMCS.
>> > > +	 */
>> > > +	if (unlikely(vmx->nested.enlightened_vmcs_enabled && is_guest_mode(vcpu) &&
>> > > +		     (vmx->nested.current_vmptr == -1ull)))
>> > > +		return true;
>> > > +
>> > > +	return false;
>> > > +}
>> > 
>> > I think that this is a valid way to solve the issue,
>> > but it feels like there might be a better way.
>> > I don't mind though to accept this patch as is.
>> > 
>> > So here are my 2 cents about this:
>> > 
>> > First of all after studying how evmcs works I take my words back
>> > about needing to migrate its contents. 
>> > 
>> > It is indeed enough to migrate its physical address, 
>> > or maybe even just a flag that evmcs is loaded
>> > (and to my surprise we already do this - KVM_STATE_NESTED_EVMCS)
>> > 
>> > So how about just having a boolean flag that indicates that evmcs is in use, 
>> > but doesn't imply that we know its address or that it is mapped 
>> > to host address space, something like 'vmx->nested.enlightened_vmcs_loaded'
>> > 
>> > On migration that flag saved and restored as the KVM_STATE_NESTED_EVMCS,
>> > otherwise it set when we load an evmcs and cleared when it is released.
>> > 
>> > Then as far as I can see we can use this flag in nested_evmcs_is_used
>> > since all its callers don't touch evmcs, thus don't need it to be
>> > mapped.
>> > 
>> > What do you think?
>
>
>
>> > 
>> 
>> First, we need to be compatible with older KVMs which don't have the
>> flag and this is problematic: currently, we always expect vmcs12 to
>> carry valid contents. This is challenging.
>
> All right, I understand this can be an issue!
>
> If the userspace doesn't set the KVM_STATE_NESTED_EVMCS
> but has a valid EVMCS as later indicated enabling it in the HV
> assist page, we can just use the logic that this patch uses but use it 
> to set vmx->nested.enlightened_vmcs_loaded flag or whatever
> we decide to name it.
> Later we can even deprecate and disable this with a new KVM cap.
>
>
> BTW, I like Paolo's idea of putting this flag into the evmcs_gpa,
> like that
>
> -1 no evmcs
> 0 - evmcs enabled but its gpa not known
> anything else - valid gpa.
>
>
> Also as I said, I am not against this patch either, 
> I am just thinking maybe we can make it a bit better.
>

v3 implements a similar idea (I kept Paolo's 'Suggested-by' though :-):
we set hv_evmcs_vmptr to EVMPTR_MAP_PENDING after migration. I haven't
tried it yet, but I think we can eventually drop the
KVM_REQ_GET_NESTED_STATE_PAGES usage.

For now, this series is a bugfix (multiple bugfixes, actually), so I'd
like to get it in and not try to make everything around eVMCS perfect
right away :-) I'll certainly take a look at other possible improvements
later.
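
For reference, the v3 encoding is roughly along the following lines
(modulo exact names and values; evmptr_is_valid() is just how I'd expect
the helper to look):

#define EVMPTR_INVALID		(-1ULL)	/* eVMCS not in use */
#define EVMPTR_MAP_PENDING	(-2ULL)	/* in use, mapping deferred until after migration */

static inline bool evmptr_is_valid(u64 evmptr)
{
	return evmptr != EVMPTR_INVALID && evmptr != EVMPTR_MAP_PENDING;
}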

>
>> 
>> Second, vCPU can be migrated in three different states:
>> 1) While L2 was running ('true' nested state is in VMCS02)
>> 2) While L1 was running ('true' nested state is in eVMCS)
>> 3) Right after an exit from L2 to L1 was forced
>> ('need_vmcs12_to_shadow_sync = true') ('true' nested state is in
>> VMCS12).
>
> Yes and this was quite difficult thing to understand
> when I was trying to figure out how this code works.
>
> Also you can add another intersting state:
>
> 4) Right after emulating vmlauch/vmresume but before
> the actual entry to the nested guest (aka nested_run_pending=true)
>

Honestly, I haven't taken a closer look at this state. Do you envision
specific issues? Can we actually end up serving KVM_GET_NESTED_STATE in
between setting 'nested_run_pending=true' and the actual attempt to
enter L2?

>
>
>> 
>> The current solution is to always use VMCS12 as a container to transfer
>> the state and conceptually, it is at least easier to understand.
>> 
>> We can, indeed, transfer eVMCS (or VMCS12) in case 2) through guest
>> memory and I even tried that but that was making the code more complex
>> so eventually I gave up and decided to preserve the 'always use VMCS12
>> as a container' status quo.
>
>
> My only point of concern is that it feels like it is wrong to update eVMCS
> when not doing a nested vmexit, because then the eVMCS is owned by the
> L1 hypervisor.

I see your concern and ideally we wouldn't have to touch it.

> At least not the fields which aren't supposed to be updated by us.
>
>
> This is a rough code draft of what I had in mind (not tested).
> To me this seems reasonable but I do agree that there is
> some complexety tradeoffs involved.
>
> About the compatibitly it can be said that:
>
>
> Case 1:
> Both old and new kernels will send/recive up to date vmcs12,
> while evcms is not up to date, and its contents aren't even defined
> (since L2 runs).
>
> Case 2:
> Old kernel will send vmcb12, with partial changes that L1 already
> made to evmcs, and latest state of evmcs with all changes
> in the guest memory.
>
> But these changes will be discarded on the receiving side, 
> since once L1 asks us to enter L2, we will reload all the state from eVMCS,
> (at least the state that is marked as dirty, which means differ
> from vmcs12 as it was on the last nested vmexit)
>
> New kernel will always send the vmcb12 as it was on the last vmexit,
> a bit older version but even a more consistent one.
>
> But this doesn't matter either as just like in case of the old kernel, 
> the vmcs12 will be updated from evmcs as soon as we do another L2 entry.
>
> So while in this case we send 'more stale' vmcb12, it doesn't
> really matter as it is stale anyway and will be reloaded from
> evmcs.
>
> Case 3:
> Old kernel will send up to date vmcb12 (since L1 didn't had a chance
> to run anyway after nested vmexit). The evmcs will not be up to date
> in the guest memory, but newer kernel can fix this by updating it
> as you did in patch 6.
>
> New kernel will send up to date vmcb12 (same reason) and up to date
> evmcs, so in fact an unchanged target kernel will be able to migrate
> from this state.
>
> So in fact my suggestion would allow to actually migrate to a kernel
> without the fix applied.
> This is even better than I thought.
>
>
> This is a rough draft of the idea:
>
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 6058a65a6ede..98eb7526cae6 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -167,15 +167,22 @@ static int nested_vmx_failInvalid(struct kvm_vcpu *vcpu)
>  static int nested_vmx_failValid(struct kvm_vcpu *vcpu,
>  				u32 vm_instruction_error)
>  {
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  	vmx_set_rflags(vcpu, (vmx_get_rflags(vcpu)
>  			& ~(X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF |
>  			    X86_EFLAGS_SF | X86_EFLAGS_OF))
>  			| X86_EFLAGS_ZF);
>  	get_vmcs12(vcpu)->vm_instruction_error = vm_instruction_error;
> +
>  	/*
>  	 * We don't need to force a shadow sync because
>  	 * VM_INSTRUCTION_ERROR is not shadowed
> +	 * We do need to update the evmcs
>  	 */
> +
> +	if (vmx->nested.hv_evmcs)
> +		vmx->nested.hv_evmcs->vm_instruction_error = vm_instruction_error;
> +
>  	return kvm_skip_emulated_instruction(vcpu);
>  }
>  
> @@ -1962,6 +1969,10 @@ static int copy_vmcs12_to_enlightened(struct vcpu_vmx *vmx)
>  
>  	evmcs->guest_bndcfgs = vmcs12->guest_bndcfgs;
>  
> +	/* All fields are clean */
> +	vmx->nested.hv_evmcs->hv_clean_fields |=
> +		HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL;
> +
>  	return 0;
>  }
>  
> @@ -2055,16 +2066,7 @@ static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld(
>  void nested_sync_vmcs12_to_shadow(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> -
> -	if (vmx->nested.hv_evmcs) {
> -		copy_vmcs12_to_enlightened(vmx);
> -		/* All fields are clean */
> -		vmx->nested.hv_evmcs->hv_clean_fields |=
> -			HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL;
> -	} else {
> -		copy_vmcs12_to_shadow(vmx);
> -	}
> -
> +	copy_vmcs12_to_shadow(vmx);
>  	vmx->nested.need_vmcs12_to_shadow_sync = false;
>  }
>  
> @@ -3437,8 +3439,13 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
>  
>  	load_vmcs12_host_state(vcpu, vmcs12);
>  	vmcs12->vm_exit_reason = exit_reason.full;
> -	if (enable_shadow_vmcs || vmx->nested.hv_evmcs)
> +
> +	if (enable_shadow_vmcs)
>  		vmx->nested.need_vmcs12_to_shadow_sync = true;
> +
> +	if (vmx->nested.hv_evmcs)
> +		copy_vmcs12_to_enlightened(vmx);
> +
>  	return NVMX_VMENTRY_VMEXIT;
>  }
>  
> @@ -4531,10 +4538,12 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
>  		kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
>  	}
>  
> -	if ((vm_exit_reason != -1) &&
> -	    (enable_shadow_vmcs || vmx->nested.hv_evmcs))
> +	if ((vm_exit_reason != -1) && enable_shadow_vmcs)
>  		vmx->nested.need_vmcs12_to_shadow_sync = true;
>  
> +	if (vmx->nested.hv_evmcs)
> +		copy_vmcs12_to_enlightened(vmx);
> +
>  	/* in case we halted in L2 */
>  	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
>  
> @@ -6111,12 +6120,8 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu,
>  		sync_vmcs02_to_vmcs12_rare(vcpu, vmcs12);
>  	} else  {
>  		copy_vmcs02_to_vmcs12_rare(vcpu, get_vmcs12(vcpu));
> -		if (!vmx->nested.need_vmcs12_to_shadow_sync) {
> -			if (vmx->nested.hv_evmcs)
> -				copy_enlightened_to_vmcs12(vmx);
> -			else if (enable_shadow_vmcs)
> -				copy_shadow_to_vmcs12(vmx);
> -		}
> +		if (enable_shadow_vmcs && !vmx->nested.need_vmcs12_to_shadow_sync)
> +			copy_shadow_to_vmcs12(vmx);
>  	}
>  
>  	BUILD_BUG_ON(sizeof(user_vmx_nested_state->vmcs12) < VMCS12_SIZE);
>
>

I don't see why this can't work; the only concern here is that,
conceptually, we'd be making the eVMCS something different from a shadow
VMCS. In any case, let's give this a try once this series, which fixes
real bugs, lands.

-- 
Vitaly


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/7] KVM: nVMX: Fixes for nested state migration when eVMCS is in use
  2021-05-26 14:41     ` Maxim Levitsky
@ 2021-05-27  8:01       ` Vitaly Kuznetsov
  2021-05-27 14:11         ` Maxim Levitsky
  0 siblings, 1 reply; 36+ messages in thread
From: Vitaly Kuznetsov @ 2021-05-27  8:01 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, linux-kernel, kvm,
	Paolo Bonzini

Maxim Levitsky <mlevitsk@redhat.com> writes:

> On Mon, 2021-05-24 at 14:44 +0200, Vitaly Kuznetsov wrote:
>> Maxim Levitsky <mlevitsk@redhat.com> writes:
>> 
>> > On Mon, 2021-05-17 at 15:50 +0200, Vitaly Kuznetsov wrote:
>> > > Changes since v1 (Sean):
>> > > - Drop now-unneeded curly braces in nested_sync_vmcs12_to_shadow().
>> > > - Pass 'evmcs->hv_clean_fields' instead of 'bool from_vmentry' to
>> > >   copy_enlightened_to_vmcs12().
>> > > 
>> > > Commit f5c7e8425f18 ("KVM: nVMX: Always make an attempt to map eVMCS after
>> > > migration") fixed the most obvious reason why Hyper-V on KVM (e.g. Win10
>> > >  + WSL2) was crashing immediately after migration. It was also reported
>> > > that we have more issues to fix as, while the failure rate was lowered 
>> > > signifincatly, it was still possible to observe crashes after several
>> > > dozens of migration. Turns out, the issue arises when we manage to issue
>> > > KVM_GET_NESTED_STATE right after L2->L2 VMEXIT but before L1 gets a chance
>> > > to run. This state is tracked with 'need_vmcs12_to_shadow_sync' flag but
>> > > the flag itself is not part of saved nested state. A few other less 
>> > > significant issues are fixed along the way.
>> > > 
>> > > While there's no proof this series fixes all eVMCS related problems,
>> > > Win10+WSL2 was able to survive 3333 (thanks, Max!) migrations without
>> > > crashing in testing.
>> > > 
>> > > Patches are based on the current kvm/next tree.
>> > > 
>> > > Vitaly Kuznetsov (7):
>> > >   KVM: nVMX: Introduce nested_evmcs_is_used()
>> > >   KVM: nVMX: Release enlightened VMCS on VMCLEAR
>> > >   KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in
>> > >     vmx_get_nested_state()
>> > >   KVM: nVMX: Force enlightened VMCS sync from nested_vmx_failValid()
>> > >   KVM: nVMX: Reset eVMCS clean fields data from prepare_vmcs02()
>> > >   KVM: nVMX: Request to sync eVMCS from VMCS12 after migration
>> > >   KVM: selftests: evmcs_test: Test that KVM_STATE_NESTED_EVMCS is never
>> > >     lost
>> > > 
>> > >  arch/x86/kvm/vmx/nested.c                     | 110 ++++++++++++------
>> > >  .../testing/selftests/kvm/x86_64/evmcs_test.c |  64 +++++-----
>> > >  2 files changed, 115 insertions(+), 59 deletions(-)
>> > > 
>> > 
>> > Hi Vitaly!
>> > 
>> > In addition to the review of this patch series,
>> 
>> Thanks by the way!
> No problem!
>
>> 
>> >  I would like
>> > to share an idea on how to avoid the hack of mapping the evmcs
>> > in nested_vmx_vmexit, because I think I found a possible generic
>> > solution to this and similar issues:
>> > 
>> > The solution is to always set nested_run_pending after 
>> > nested migration (which means that we won't really
>> > need to migrate this flag anymore).
>> > 
>> > I was thinking a lot about it and I think that there is no downside to this,
>> > other than sometimes a one extra vmexit after migration.
>> > 
>> > Otherwise there is always a risk of the following scenario:
>> > 
>> >   1. We migrate with nested_run_pending=0 (but don't restore all the state
>> >   yet, like that HV_X64_MSR_VP_ASSIST_PAGE msr,
>> >   or just the guest memory map is not up to date, guest is in smm or something
>> >   like that)
>> > 
>> >   2. Userspace calls some ioctl that causes a nested vmexit
>> > 
>> >   This can happen today if the userspace calls 
>> >   kvm_arch_vcpu_ioctl_get_mpstate -> kvm_apic_accept_events -> kvm_check_nested_events
>> > 
>> >   3. Userspace finally sets correct guest's msrs, correct guest memory map and only
>> >   then calls KVM_RUN
>> > 
>> > This means that at (2) we can't map and write the evmcs/vmcs12/vmcb12 even
>> > if KVM_REQ_GET_NESTED_STATE_PAGES is pending,
>> > but we have to do so to complete the nested vmexit.
>> 
>> Why do we need to write to eVMCS to complete vmexit? AFAICT, there's
>> only one place which calls copy_vmcs12_to_enlightened():
>> nested_sync_vmcs12_to_shadow() which, in its turn, has only 1 caller:
>> vmx_prepare_switch_to_guest() so unless userspace decided to execute
>> not-fully-restored guest this should not happen. I'm probably missing
>> something in your scenario)
> You are right! 
> The evmcs write is delayed to the next vmentry.
>
> However since we are now mapping the evmcs during nested vmexit,
> and this can fail for example that HV assist msr is not up to date.
>
> For example consider this: 
>
> 1. Userspace first sets nested state
> 2. Userspace calls KVM_GET_MP_STATE.
> 3. Nested vmexit that happened in 2 will end up not be able to map the evmcs,
> since HV_ASSIST msr is not yet loaded.
>
>
> Also the vmcb write (that is for SVM) _is_ done right away on nested vmexit 
> and conceptually has the same issue.
> (if memory map is not up to date, we might not be able to read/write the 
> vmcb12 on nested vmexit)
>

It seems we have one correct way to restore a guest and a number of
incorrect ones :-) It may well be that this is not even a nested-only
thing (think about trying to restore caps, regs, MSRs and CPUIDs in a
random sequence). I'd vote for documenting the right order somewhere,
even if we'll just be extracting it from QEMU.

>
>> 
>> > To some extent, the entry to the nested mode after a migration is only complete
>> > when we process the KVM_REQ_GET_NESTED_STATE_PAGES, so we shoudn't interrupt it.
>> > 
>> > This will allow us to avoid dealing with KVM_REQ_GET_NESTED_STATE_PAGES on
>> > nested vmexit path at all. 
>> 
>> Remember, we have three possible states when nested state is
>> transferred:
>> 1) L2 was running
>> 2) L1 was running
>> 3) We're in beetween L2 and L1 (need_vmcs12_to_shadow_sync = true).
>
> I understand. This suggestion wasn't meant to fix the case 3, but more to fix
> case 1, where we are in L2, migrate, and then immediately decide to 
> do a nested vmexit before we processed the KVM_REQ_GET_NESTED_STATE_PAGES
> request, and also before potentially before the guest state was fully uploaded
> (see that KVM_GET_MP_STATE thing).
>  
> In a nutshell, I vote for not allowing nested vmexits from the moment
> when we set the nested state and until the moment we enter the nested
> guest once (maybe with request for immediate vmexit),
> because during this time period, the guest state is not fully consistent.
>

Using 'nested_run_pending=1', perhaps? Or we can get back to the
'vm_bugged' idea and kill the guest immediately if something forces such
an exit.

-- 
Vitaly


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 1/7] KVM: nVMX: Introduce nested_evmcs_is_used()
  2021-05-27  7:54         ` Vitaly Kuznetsov
@ 2021-05-27 14:10           ` Maxim Levitsky
  0 siblings, 0 replies; 36+ messages in thread
From: Maxim Levitsky @ 2021-05-27 14:10 UTC (permalink / raw)
  To: Vitaly Kuznetsov, kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, linux-kernel

On Thu, 2021-05-27 at 09:54 +0200, Vitaly Kuznetsov wrote:
> Maxim Levitsky <mlevitsk@redhat.com> writes:
> 
> > On Mon, 2021-05-24 at 14:35 +0200, Vitaly Kuznetsov wrote:
> > > Maxim Levitsky <mlevitsk@redhat.com> writes:
> > > 
> > > > On Mon, 2021-05-17 at 15:50 +0200, Vitaly Kuznetsov wrote:
> > > > > Unlike regular set_current_vmptr(), nested_vmx_handle_enlightened_vmptrld()
> > > > > can not be called directly from vmx_set_nested_state() as KVM may not have
> > > > > all the information yet (e.g. HV_X64_MSR_VP_ASSIST_PAGE MSR may not be
> > > > > restored yet). Enlightened VMCS is mapped later while getting nested state
> > > > > pages. In the meantime, vmx->nested.hv_evmcs remains NULL and using it
> > > > > for various checks is incorrect. In particular, if KVM_GET_NESTED_STATE is
> > > > > called right after KVM_SET_NESTED_STATE, KVM_STATE_NESTED_EVMCS flag in the
> > > > > resulting state will be unset (and such state will later fail to load).
> > > > > 
> > > > > Introduce nested_evmcs_is_used() and use 'is_guest_mode(vcpu) &&
> > > > > vmx->nested.current_vmptr == -1ull' check to detect not-yet-mapped eVMCS
> > > > > after restore.
> > > > > 
> > > > > Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> > > > > ---
> > > > >  arch/x86/kvm/vmx/nested.c | 31 ++++++++++++++++++++++++++-----
> > > > >  1 file changed, 26 insertions(+), 5 deletions(-)
> > > > > 
> > > > > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > > > > index 6058a65a6ede..3080e00c8f90 100644
> > > > > --- a/arch/x86/kvm/vmx/nested.c
> > > > > +++ b/arch/x86/kvm/vmx/nested.c
> > > > > @@ -141,6 +141,27 @@ static void init_vmcs_shadow_fields(void)
> > > > >  	max_shadow_read_write_fields = j;
> > > > >  }
> > > > >  
> > > > > +static inline bool nested_evmcs_is_used(struct vcpu_vmx *vmx)
> > > > > +{
> > > > > +	struct kvm_vcpu *vcpu = &vmx->vcpu;
> > > > > +
> > > > > +	if (vmx->nested.hv_evmcs)
> > > > > +		return true;
> > > > > +
> > > > > +	/*
> > > > > +	 * After KVM_SET_NESTED_STATE, enlightened VMCS is mapped during
> > > > > +	 * KVM_REQ_GET_NESTED_STATE_PAGES handling and until the request is
> > > > > +	 * processed vmx->nested.hv_evmcs is NULL. It is, however, possible to
> > > > > +	 * detect such state by checking 'nested.current_vmptr == -1ull' when
> > > > > +	 * vCPU is in guest mode, it is only possible with eVMCS.
> > > > > +	 */
> > > > > +	if (unlikely(vmx->nested.enlightened_vmcs_enabled && is_guest_mode(vcpu) &&
> > > > > +		     (vmx->nested.current_vmptr == -1ull)))
> > > > > +		return true;
> > > > > +
> > > > > +	return false;
> > > > > +}
> > > > 
> > > > I think that this is a valid way to solve the issue,
> > > > but it feels like there might be a better way.
> > > > I don't mind though to accept this patch as is.
> > > > 
> > > > So here are my 2 cents about this:
> > > > 
> > > > First of all after studying how evmcs works I take my words back
> > > > about needing to migrate its contents. 
> > > > 
> > > > It is indeed enough to migrate its physical address, 
> > > > or maybe even just a flag that evmcs is loaded
> > > > (and to my surprise we already do this - KVM_STATE_NESTED_EVMCS)
> > > > 
> > > > So how about just having a boolean flag that indicates that evmcs is in use, 
> > > > but doesn't imply that we know its address or that it is mapped 
> > > > to host address space, something like 'vmx->nested.enlightened_vmcs_loaded'
> > > > 
> > > > On migration that flag saved and restored as the KVM_STATE_NESTED_EVMCS,
> > > > otherwise it set when we load an evmcs and cleared when it is released.
> > > > 
> > > > Then as far as I can see we can use this flag in nested_evmcs_is_used
> > > > since all its callers don't touch evmcs, thus don't need it to be
> > > > mapped.
> > > > 
> > > > What do you think?
> > 
> > 
> > > 
> > > First, we need to be compatible with older KVMs which don't have the
> > > flag and this is problematic: currently, we always expect vmcs12 to
> > > carry valid contents. This is challenging.
> > 
> > All right, I understand this can be an issue!
> > 
> > If the userspace doesn't set the KVM_STATE_NESTED_EVMCS
> > but has a valid EVMCS as later indicated enabling it in the HV
> > assist page, we can just use the logic that this patch uses but use it 
> > to set vmx->nested.enlightened_vmcs_loaded flag or whatever
> > we decide to name it.
> > Later we can even deprecate and disable this with a new KVM cap.
> > 
> > 
> > BTW, I like Paolo's idea of putting this flag into the evmcs_gpa,
> > like that
> > 
> > -1 no evmcs
> > 0 - evmcs enabled but its gpa not known
> > anything else - valid gpa.
> > 
> > 
> > Also as I said, I am not against this patch either, 
> > I am just thinking maybe we can make it a bit better.
> > 
> 
> v3 implements a similar idea (I kept Paolo's 'Suggested-by' though :-)
> we set hv_evmcs_vmptr to EVMPTR_MAP_PENDING after migration. I haven't
> tried it yet but I thin we can eventually drop
> KVM_REQ_GET_NESTED_STATE_PAGES usage.
> 
> For now, this series is a bugfix (multiple bugfixes, actually) so I'd
> like to get in and not try to make everything perfect regarding eVMCS
> :-) I'll certainly take a look at other possible improvements later.
100% agree with you.

> 
> > > Second, vCPU can be migrated in three different states:
> > > 1) While L2 was running ('true' nested state is in VMCS02)
> > > 2) While L1 was running ('true' nested state is in eVMCS)
> > > 3) Right after an exit from L2 to L1 was forced
> > > ('need_vmcs12_to_shadow_sync = true') ('true' nested state is in
> > > VMCS12).
> > 
> > Yes and this was quite difficult thing to understand
> > when I was trying to figure out how this code works.
> > 
> > Also you can add another intersting state:
> > 
> > 4) Right after emulating vmlauch/vmresume but before
> > the actual entry to the nested guest (aka nested_run_pending=true)
> > 
> 
> Honestly, I haven't took a closer look at this state. Do you envision
> specific issues? Can we actually try to serve KVM_GET_NESTED_STATE in
> between setting 'nested_run_pending=true' and an actual attempt to enter
> L2?

Yes, I fixed a bug that was specific to this state on SVM.
 
The problem was this: L1 would do VMRUN with an injected event, then we
would migrate with nested_run_pending=1, and after the migration we
would forget to set nested_run_pending=1 again.
 
This would allow us to inject another event into the VM, one that
meanwhile came from the local APIC, overwriting the event L1 had
injected; something that can't otherwise happen, as nested_run_pending=1
doesn't allow injecting events into the VM.
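
For reference, the reason injection is blocked is a check along these
lines in the nested event handling path (quoting from memory, so not the
exact code):

	bool block_nested_events = vmx->nested.nested_run_pending ||
				   kvm_event_needs_reinjection(vcpu);

	/* ... and then for each event source ... */
	if (block_nested_events)
		return -EBUSY;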
 
As Paolo explained to me, the meaning of nested_run_pending is that the
nested VM entry is basically done in two steps: first we emulate some
stuff, switch the vmcs/vmcb and such, and then we do the actual VM entry
to the guest. Only then is it safe to do a nested VM exit.
 
However, IMHO it won't hurt to set nested_run_pending=1 even when it is
not strictly needed per the above; it will eliminate the issues that we
have with vmexits that happen before we have processed the
KVM_REQ_GET_NESTED_STATE_PAGES request.
 
Paolo, what do you think?

> 
> > 
> > > The current solution is to always use VMCS12 as a container to transfer
> > > the state and conceptually, it is at least easier to understand.
> > > 
> > > We can, indeed, transfer eVMCS (or VMCS12) in case 2) through guest
> > > memory and I even tried that but that was making the code more complex
> > > so eventually I gave up and decided to preserve the 'always use VMCS12
> > > as a container' status quo.
> > 
> > My only point of concern is that it feels like it is wrong to update eVMCS
> > when not doing a nested vmexit, because then the eVMCS is owned by the
> > L1 hypervisor.
> 
> I see your concern and ideally we wouldn't have to touch it.

Great!

> 
> > At least not the fields which aren't supposed to be updated by us.
> > 
> > 
> > This is a rough code draft of what I had in mind (not tested).
> > To me this seems reasonable but I do agree that there is
> > some complexety tradeoffs involved.
> > 
> > About the compatibitly it can be said that:
> > 
> > 
> > Case 1:
> > Both old and new kernels will send/recive up to date vmcs12,
> > while evcms is not up to date, and its contents aren't even defined
> > (since L2 runs).
> > 
> > Case 2:
> > Old kernel will send vmcb12, with partial changes that L1 already
> > made to evmcs, and latest state of evmcs with all changes
> > in the guest memory.
> > 
> > But these changes will be discarded on the receiving side, 
> > since once L1 asks us to enter L2, we will reload all the state from eVMCS,
> > (at least the state that is marked as dirty, which means differ
> > from vmcs12 as it was on the last nested vmexit)
> > 
> > New kernel will always send the vmcb12 as it was on the last vmexit,
> > a bit older version but even a more consistent one.
> > 
> > But this doesn't matter either as just like in case of the old kernel, 
> > the vmcs12 will be updated from evmcs as soon as we do another L2 entry.
> > 
> > So while in this case we send 'more stale' vmcb12, it doesn't
> > really matter as it is stale anyway and will be reloaded from
> > evmcs.
> > 
> > Case 3:
> > Old kernel will send up to date vmcb12 (since L1 didn't had a chance
> > to run anyway after nested vmexit). The evmcs will not be up to date
> > in the guest memory, but newer kernel can fix this by updating it
> > as you did in patch 6.
> > 
> > New kernel will send up to date vmcb12 (same reason) and up to date
> > evmcs, so in fact an unchanged target kernel will be able to migrate
> > from this state.
> > 
> > So in fact my suggestion would allow to actually migrate to a kernel
> > without the fix applied.
> > This is even better than I thought.
> > 
> > 
> > This is a rough draft of the idea:
> > 
> > 
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index 6058a65a6ede..98eb7526cae6 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -167,15 +167,22 @@ static int nested_vmx_failInvalid(struct kvm_vcpu *vcpu)
> >  static int nested_vmx_failValid(struct kvm_vcpu *vcpu,
> >  				u32 vm_instruction_error)
> >  {
> > +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> >  	vmx_set_rflags(vcpu, (vmx_get_rflags(vcpu)
> >  			& ~(X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF |
> >  			    X86_EFLAGS_SF | X86_EFLAGS_OF))
> >  			| X86_EFLAGS_ZF);
> >  	get_vmcs12(vcpu)->vm_instruction_error = vm_instruction_error;
> > +
> >  	/*
> >  	 * We don't need to force a shadow sync because
> >  	 * VM_INSTRUCTION_ERROR is not shadowed
> > +	 * We do need to update the evmcs
> >  	 */
> > +
> > +	if (vmx->nested.hv_evmcs)
> > +		vmx->nested.hv_evmcs->vm_instruction_error = vm_instruction_error;
> > +
> >  	return kvm_skip_emulated_instruction(vcpu);
> >  }
> >  
> > @@ -1962,6 +1969,10 @@ static int copy_vmcs12_to_enlightened(struct vcpu_vmx *vmx)
> >  
> >  	evmcs->guest_bndcfgs = vmcs12->guest_bndcfgs;
> >  
> > +	/* All fields are clean */
> > +	vmx->nested.hv_evmcs->hv_clean_fields |=
> > +		HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL;
> > +
> >  	return 0;
> >  }
> >  
> > @@ -2055,16 +2066,7 @@ static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld(
> >  void nested_sync_vmcs12_to_shadow(struct kvm_vcpu *vcpu)
> >  {
> >  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > -
> > -	if (vmx->nested.hv_evmcs) {
> > -		copy_vmcs12_to_enlightened(vmx);
> > -		/* All fields are clean */
> > -		vmx->nested.hv_evmcs->hv_clean_fields |=
> > -			HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL;
> > -	} else {
> > -		copy_vmcs12_to_shadow(vmx);
> > -	}
> > -
> > +	copy_vmcs12_to_shadow(vmx);
> >  	vmx->nested.need_vmcs12_to_shadow_sync = false;
> >  }
> >  
> > @@ -3437,8 +3439,13 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
> >  
> >  	load_vmcs12_host_state(vcpu, vmcs12);
> >  	vmcs12->vm_exit_reason = exit_reason.full;
> > -	if (enable_shadow_vmcs || vmx->nested.hv_evmcs)
> > +
> > +	if (enable_shadow_vmcs)
> >  		vmx->nested.need_vmcs12_to_shadow_sync = true;
> > +
> > +	if (vmx->nested.hv_evmcs)
> > +		copy_vmcs12_to_enlightened(vmx);
> > +
> >  	return NVMX_VMENTRY_VMEXIT;
> >  }
> >  
> > @@ -4531,10 +4538,12 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
> >  		kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
> >  	}
> >  
> > -	if ((vm_exit_reason != -1) &&
> > -	    (enable_shadow_vmcs || vmx->nested.hv_evmcs))
> > +	if ((vm_exit_reason != -1) && enable_shadow_vmcs)
> >  		vmx->nested.need_vmcs12_to_shadow_sync = true;
> >  
> > +	if (vmx->nested.hv_evmcs)
> > +		copy_vmcs12_to_enlightened(vmx);
> > +
> >  	/* in case we halted in L2 */
> >  	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> >  
> > @@ -6111,12 +6120,8 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu,
> >  		sync_vmcs02_to_vmcs12_rare(vcpu, vmcs12);
> >  	} else  {
> >  		copy_vmcs02_to_vmcs12_rare(vcpu, get_vmcs12(vcpu));
> > -		if (!vmx->nested.need_vmcs12_to_shadow_sync) {
> > -			if (vmx->nested.hv_evmcs)
> > -				copy_enlightened_to_vmcs12(vmx);
> > -			else if (enable_shadow_vmcs)
> > -				copy_shadow_to_vmcs12(vmx);
> > -		}
> > +		if (enable_shadow_vmcs && !vmx->nested.need_vmcs12_to_shadow_sync)
> > +			copy_shadow_to_vmcs12(vmx);
> >  	}
> >  
> >  	BUILD_BUG_ON(sizeof(user_vmx_nested_state->vmcs12) < VMCS12_SIZE);
> > 
> > 
> 
> I don't see why this can't work, the only concern here is that
> conceptually, we'll be making eVMCS something different from shadow
> vmcs. In any case, let's give this a try when this series fixing real
> bugs lands.

100% agree with you on this as well, including your concern about
making the evmcs code less consistent with treating it as a form of
'shadow VMCS'.

It is a tradeoff, and I am not 100% sure which way is better either;
it's just that I put a lot of thought into it, and I wanted to share
this approach in case we decide to go this way.


Best regards,
	Maxim Levitsky

> 



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/7] KVM: nVMX: Fixes for nested state migration when eVMCS is in use
  2021-05-27  8:01       ` Vitaly Kuznetsov
@ 2021-05-27 14:11         ` Maxim Levitsky
  2021-05-27 14:17           ` Paolo Bonzini
  0 siblings, 1 reply; 36+ messages in thread
From: Maxim Levitsky @ 2021-05-27 14:11 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, linux-kernel, kvm,
	Paolo Bonzini

On Thu, 2021-05-27 at 10:01 +0200, Vitaly Kuznetsov wrote:
> Maxim Levitsky <mlevitsk@redhat.com> writes:
> 
> > On Mon, 2021-05-24 at 14:44 +0200, Vitaly Kuznetsov wrote:
> > > Maxim Levitsky <mlevitsk@redhat.com> writes:
> > > 
> > > > On Mon, 2021-05-17 at 15:50 +0200, Vitaly Kuznetsov wrote:
> > > > > Changes since v1 (Sean):
> > > > > - Drop now-unneeded curly braces in nested_sync_vmcs12_to_shadow().
> > > > > - Pass 'evmcs->hv_clean_fields' instead of 'bool from_vmentry' to
> > > > >   copy_enlightened_to_vmcs12().
> > > > > 
> > > > > Commit f5c7e8425f18 ("KVM: nVMX: Always make an attempt to map eVMCS after
> > > > > migration") fixed the most obvious reason why Hyper-V on KVM (e.g. Win10
> > > > >  + WSL2) was crashing immediately after migration. It was also reported
> > > > > that we have more issues to fix as, while the failure rate was lowered
> > > > > significantly, it was still possible to observe crashes after several
> > > > > dozens of migrations. Turns out, the issue arises when we manage to issue
> > > > > KVM_GET_NESTED_STATE right after an L2->L1 VMEXIT but before L1 gets a chance
> > > > > to run. This state is tracked with the 'need_vmcs12_to_shadow_sync' flag but
> > > > > the flag itself is not part of saved nested state. A few other less 
> > > > > significant issues are fixed along the way.
> > > > > 
> > > > > While there's no proof this series fixes all eVMCS related problems,
> > > > > Win10+WSL2 was able to survive 3333 (thanks, Max!) migrations without
> > > > > crashing in testing.
> > > > > 
> > > > > Patches are based on the current kvm/next tree.
> > > > > 
> > > > > Vitaly Kuznetsov (7):
> > > > >   KVM: nVMX: Introduce nested_evmcs_is_used()
> > > > >   KVM: nVMX: Release enlightened VMCS on VMCLEAR
> > > > >   KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in
> > > > >     vmx_get_nested_state()
> > > > >   KVM: nVMX: Force enlightened VMCS sync from nested_vmx_failValid()
> > > > >   KVM: nVMX: Reset eVMCS clean fields data from prepare_vmcs02()
> > > > >   KVM: nVMX: Request to sync eVMCS from VMCS12 after migration
> > > > >   KVM: selftests: evmcs_test: Test that KVM_STATE_NESTED_EVMCS is never
> > > > >     lost
> > > > > 
> > > > >  arch/x86/kvm/vmx/nested.c                     | 110 ++++++++++++------
> > > > >  .../testing/selftests/kvm/x86_64/evmcs_test.c |  64 +++++-----
> > > > >  2 files changed, 115 insertions(+), 59 deletions(-)
> > > > > 
> > > > 
> > > > Hi Vitaly!
> > > > 
> > > > In addition to the review of this patch series,
> > > 
> > > Thanks by the way!
> > No problem!
> > 
> > > >  I would like
> > > > to share an idea on how to avoid the hack of mapping the evmcs
> > > > in nested_vmx_vmexit, because I think I found a possible generic
> > > > solution to this and similar issues:
> > > > 
> > > > The solution is to always set nested_run_pending after 
> > > > nested migration (which means that we won't really
> > > > need to migrate this flag anymore).
> > > > 
> > > > I was thinking a lot about it and I think that there is no downside to this,
> > > > other than sometimes one extra vmexit after migration.
> > > > 
> > > > Otherwise there is always a risk of the following scenario:
> > > > 
> > > >   1. We migrate with nested_run_pending=0 (but haven't restored all the state
> > > >   yet, e.g. the HV_X64_MSR_VP_ASSIST_PAGE msr, or the guest memory map is not
> > > >   up to date, or the guest is in smm, or something like that)
> > > > 
> > > >   2. Userspace calls some ioctl that causes a nested vmexit
> > > > 
> > > >   This can happen today if the userspace calls 
> > > >   kvm_arch_vcpu_ioctl_get_mpstate -> kvm_apic_accept_events -> kvm_check_nested_events
> > > > 
> > > >   3. Userspace finally sets the correct guest msrs and the correct guest memory
> > > >   map, and only then calls KVM_RUN
> > > > 
> > > > This means that at (2) we can't map and write the evmcs/vmcs12/vmcb12 even
> > > > if KVM_REQ_GET_NESTED_STATE_PAGES is pending,
> > > > but we have to do so to complete the nested vmexit.
> > > 
> > > Why do we need to write to the eVMCS to complete the vmexit? AFAICT, there's
> > > only one place which calls copy_vmcs12_to_enlightened():
> > > nested_sync_vmcs12_to_shadow(), which, in its turn, has only one caller:
> > > vmx_prepare_switch_to_guest(). So unless userspace decided to execute a
> > > not-fully-restored guest, this should not happen. (I'm probably missing
> > > something in your scenario.)
> > You are right!
> > The evmcs write is delayed until the next vmentry.
> > 
> > However, since we are now mapping the evmcs during nested vmexit,
> > this can fail, for example when the HV assist msr is not up to date yet.
> > 
> > For example, consider this:
> > 
> > 1. Userspace first sets the nested state.
> > 2. Userspace calls KVM_GET_MP_STATE.
> > 3. The nested vmexit triggered in step 2 ends up unable to map the evmcs,
> > since the HV assist msr is not yet loaded.
> > 
> > 
> > Also, the vmcb write (that is, for SVM) _is_ done right away on nested vmexit
> > and conceptually has the same issue:
> > if the memory map is not up to date, we might not be able to read/write
> > vmcb12 on nested vmexit.
> > 
> 
> It seems we have one correct way to restore a guest and a number of
> incorrect ones :-) It may happen that this is not even a nested-only
> thing (think about trying to restore caps, regs, msrs, cpuids in a
> random sequence). I'd vote for documenting the right one somewhere, even
> if we just extract it from QEMU.
> 
> > > > To some extent, the entry into nested mode after a migration is only complete
> > > > when we process KVM_REQ_GET_NESTED_STATE_PAGES, so we shouldn't interrupt it.
> > > > 
> > > > This will allow us to avoid dealing with KVM_REQ_GET_NESTED_STATE_PAGES on the
> > > > nested vmexit path at all.
> > > 
> > > Remember, we have three possible states when nested state is
> > > transferred:
> > > 1) L2 was running
> > > 2) L1 was running
> > > 3) We're in between L2 and L1 (need_vmcs12_to_shadow_sync = true).
> > 
> > I understand. This suggestion wasn't meant to fix case 3, but rather to fix
> > case 1, where we are in L2, migrate, and then immediately decide to
> > do a nested vmexit before we have processed the KVM_REQ_GET_NESTED_STATE_PAGES
> > request, and potentially also before the guest state was fully uploaded
> > (see that KVM_GET_MP_STATE thing).
> >  
> > In a nutshell, I vote for not allowing nested vmexits from the moment
> > we set the nested state until the moment we enter the nested guest
> > once (maybe with a request for an immediate vmexit),
> > because during this period the guest state is not fully consistent.
> > 
> 
> Using 'nested_run_pending=1' perhaps? Or we can get back to the 'vm_bugged'
> idea and kill the guest immediately if something forces such an exit.

Exactly, this is my idea: always set nested_run_pending=1 after the migration.
It shouldn't cause any issues, and it would avoid cases like that.
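
To make this concrete, here is a very rough and untested sketch of what I mean
(the exact place, field names and checks are only for illustration, this is not
a real patch):

	/*
	 * At the end of vmx_set_nested_state(), once the nested state has
	 * been accepted: pretend a nested run is still pending, so that
	 * nothing (e.g. KVM_GET_MP_STATE -> kvm_apic_accept_events() ->
	 * kvm_check_nested_events()) can synthesize a nested vmexit before
	 * the first KVM_RUN has processed KVM_REQ_GET_NESTED_STATE_PAGES.
	 */
	if (is_guest_mode(vcpu))
		vmx->nested.nested_run_pending = true;

The SVM counterpart would presumably do the same with
svm->nested.nested_run_pending in svm_set_nested_state().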

That variable could then also be renamed to something like 'nested_vmexit_not_allowed'.

Paolo, what do you think?

Best regards,
	Maxim Levitsky

> 



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/7] KVM: nVMX: Fixes for nested state migration when eVMCS is in use
  2021-05-27 14:11         ` Maxim Levitsky
@ 2021-05-27 14:17           ` Paolo Bonzini
  0 siblings, 0 replies; 36+ messages in thread
From: Paolo Bonzini @ 2021-05-27 14:17 UTC (permalink / raw)
  To: Maxim Levitsky, Vitaly Kuznetsov
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, linux-kernel, kvm

On 27/05/21 16:11, Maxim Levitsky wrote:
>> Using 'nested_run_pending=1' perhaps? Or we can get back to the 'vm_bugged'
>> idea and kill the guest immediately if something forces such an exit.
> Exactly, this is my idea: always set nested_run_pending=1 after the migration.
> It shouldn't cause any issues, and it would avoid cases like that.
> 
> That variable could then also be renamed to something like 'nested_vmexit_not_allowed'.
> 
> Paolo, what do you think?

(If it works :)) that's clever.  It can even be set unconditionally on 
the save side and would even work for new->old migration.
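
Roughly (untested, just to illustrate the save-side idea; the exact condition
may need tweaking):

	/* in vmx_get_nested_state(), when building the flags */
	if (is_guest_mode(vcpu))
		kvm_state.flags |= KVM_STATE_NESTED_RUN_PENDING;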

Paolo


^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2021-05-27 14:17 UTC | newest]

Thread overview: 36+ messages
2021-05-17 13:50 [PATCH v2 0/7] KVM: nVMX: Fixes for nested state migration when eVMCS is in use Vitaly Kuznetsov
2021-05-17 13:50 ` [PATCH v2 1/7] KVM: nVMX: Introduce nested_evmcs_is_used() Vitaly Kuznetsov
2021-05-24 12:11   ` Maxim Levitsky
2021-05-24 12:35     ` Vitaly Kuznetsov
2021-05-26 14:34       ` Maxim Levitsky
2021-05-27  7:54         ` Vitaly Kuznetsov
2021-05-27 14:10           ` Maxim Levitsky
2021-05-24 13:54   ` Paolo Bonzini
2021-05-24 14:09     ` Vitaly Kuznetsov
2021-05-24 14:18       ` Paolo Bonzini
2021-05-24 14:37         ` Vitaly Kuznetsov
2021-05-17 13:50 ` [PATCH v2 2/7] KVM: nVMX: Release enlightened VMCS on VMCLEAR Vitaly Kuznetsov
2021-05-24 12:13   ` Maxim Levitsky
2021-05-17 13:50 ` [PATCH v2 3/7] KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in vmx_get_nested_state() Vitaly Kuznetsov
2021-05-24 12:26   ` Maxim Levitsky
2021-05-24 13:01     ` Vitaly Kuznetsov
2021-05-24 13:58       ` Paolo Bonzini
2021-05-26 14:44         ` Maxim Levitsky
2021-05-24 13:56   ` Paolo Bonzini
2021-05-24 14:12     ` Vitaly Kuznetsov
2021-05-17 13:50 ` [PATCH v2 4/7] KVM: nVMX: Force enlightened VMCS sync from nested_vmx_failValid() Vitaly Kuznetsov
2021-05-24 12:27   ` Maxim Levitsky
2021-05-17 13:50 ` [PATCH v2 5/7] KVM: nVMX: Reset eVMCS clean fields data from prepare_vmcs02() Vitaly Kuznetsov
2021-05-24 12:34   ` Maxim Levitsky
2021-05-24 13:07     ` Vitaly Kuznetsov
2021-05-17 13:50 ` [PATCH v2 6/7] KVM: nVMX: Request to sync eVMCS from VMCS12 after migration Vitaly Kuznetsov
2021-05-24 12:35   ` Maxim Levitsky
2021-05-17 13:50 ` [PATCH v2 7/7] KVM: selftests: evmcs_test: Test that KVM_STATE_NESTED_EVMCS is never lost Vitaly Kuznetsov
2021-05-24 12:36   ` Maxim Levitsky
2021-05-24 12:08 ` [PATCH v2 0/7] KVM: nVMX: Fixes for nested state migration when eVMCS is in use Maxim Levitsky
2021-05-24 12:44   ` Vitaly Kuznetsov
2021-05-26 14:41     ` Maxim Levitsky
2021-05-27  8:01       ` Vitaly Kuznetsov
2021-05-27 14:11         ` Maxim Levitsky
2021-05-27 14:17           ` Paolo Bonzini
2021-05-24 14:01 ` Paolo Bonzini
