kvm.vger.kernel.org archive mirror
* [PATCH 0/4] KVM: nVMX: Fix migration of nested guests when eVMCS is in use
@ 2021-05-03 15:08 Vitaly Kuznetsov
  2021-05-03 15:08 ` [PATCH 1/4] KVM: nVMX: Always make an attempt to map eVMCS after migration Vitaly Kuznetsov
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Vitaly Kuznetsov @ 2021-05-03 15:08 UTC (permalink / raw)
  To: kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Maxim Levitsky,
	linux-kernel

Win10 guests with WSL2 enabled sometimes crash on migration when
enlightened VMCS is in use. The crash seems to be triggered when an
L2->L1 exit is forced immediately after migration, before L2 gets a
chance to run (e.g. when there's an interrupt pending).
The issue was introduced by commit f2c7ef3ba955 ("KVM: nSVM: cancel
KVM_REQ_GET_NESTED_STATE_PAGES on nested vmexit") and the first patch
of the series addresses the immediate issue. The eVMCS mapping restoration
path, however, seems to be fragile and the rest of the series tries to
make it more future proof by including eVMCS GPA in the migration data.

Vitaly Kuznetsov (4):
  KVM: nVMX: Always make an attempt to map eVMCS after migration
  KVM: nVMX: Properly pad 'struct kvm_vmx_nested_state_hdr'
  KVM: nVMX: Introduce __nested_vmx_handle_enlightened_vmptrld()
  KVM: nVMX: Map enlightened VMCS upon restore when possible

 arch/x86/include/uapi/asm/kvm.h |  4 ++
 arch/x86/kvm/vmx/nested.c       | 82 +++++++++++++++++++++++----------
 2 files changed, 61 insertions(+), 25 deletions(-)

-- 
2.30.2


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/4] KVM: nVMX: Always make an attempt to map eVMCS after migration
  2021-05-03 15:08 [PATCH 0/4] KVM: nVMX: Fix migration of nested guests when eVMCS is in use Vitaly Kuznetsov
@ 2021-05-03 15:08 ` Vitaly Kuznetsov
  2021-05-05  8:22   ` Maxim Levitsky
  2021-05-03 15:08 ` [PATCH 2/4] KVM: nVMX: Properly pad 'struct kvm_vmx_nested_state_hdr' Vitaly Kuznetsov
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Vitaly Kuznetsov @ 2021-05-03 15:08 UTC (permalink / raw)
  To: kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Maxim Levitsky,
	linux-kernel

When enlightened VMCS is in use and nested state is migrated with
vmx_get_nested_state()/vmx_set_nested_state(), KVM can't map the eVMCS
page right away: the eVMCS GPA is not in 'struct kvm_vmx_nested_state_hdr'
and we can't read it from the VP assist page because userspace may decide
to restore HV_X64_MSR_VP_ASSIST_PAGE after restoring nested state
(and QEMU, for example, does exactly that). To make sure eVMCS is
mapped, vmx_set_nested_state() raises the KVM_REQ_GET_NESTED_STATE_PAGES
request.

Commit f2c7ef3ba955 ("KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES
on nested vmexit") added KVM_REQ_GET_NESTED_STATE_PAGES clearing to
nested_vmx_vmexit() to make sure MSR permission bitmap is not switched
when an immediate exit from L2 to L1 happens right after migration (caused
by a pending event, for example). Unfortunately, in the exact same
situation we still need to have eVMCS mapped so
nested_sync_vmcs12_to_shadow() reflects changes in VMCS12 to eVMCS.

As a band-aid, restore the nested_get_evmcs_page() call when clearing
KVM_REQ_GET_NESTED_STATE_PAGES in nested_vmx_vmexit(). The 'fix' is far
from being ideal as we can't easily propagate possible failures and even if
we could, this is most likely already too late to do so. The whole
'KVM_REQ_GET_NESTED_STATE_PAGES' idea for mapping eVMCS after migration
seems to be fragile as we diverge too much from the 'native' path, where
vmptr loading happens in vmx_set_nested_state().
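
The difference between dropping the request and consuming it can be
modelled outside the kernel. The following is only a toy sketch: the
struct, the request bit and the helper names are made up for
illustration, and setting 'evmcs_mapped' stands in for a successful
nested_get_evmcs_page() call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-vCPU state involved here. */
struct vcpu_sketch {
	uint64_t requests;	/* pending request bits */
	bool evmcs_mapped;	/* models "eVMCS page is mapped" */
};

/* Illustrative bit; not the real KVM request number. */
#define REQ_GET_NESTED_STATE_PAGES (1ull << 0)

/* Test-and-clear: reports whether the request was pending, then drops it. */
static bool check_request(struct vcpu_sketch *v, uint64_t req)
{
	bool pending = (v->requests & req) != 0;

	v->requests &= ~req;
	return pending;
}

/* Pre-patch behaviour: the request is cancelled and nothing is mapped. */
static void vmexit_before_patch(struct vcpu_sketch *v)
{
	v->requests &= ~REQ_GET_NESTED_STATE_PAGES;
}

/* Post-patch behaviour: a pending request still results in a mapping attempt. */
static void vmexit_after_patch(struct vcpu_sketch *v)
{
	if (check_request(v, REQ_GET_NESTED_STATE_PAGES))
		v->evmcs_mapped = true;
}
```

In both variants the request is gone after the exit; only the second
one leaves the model with the eVMCS mapped when the request was pending.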

Fixes: f2c7ef3ba955 ("KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES on nested vmexit")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 arch/x86/kvm/vmx/nested.c | 29 +++++++++++++++++++----------
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 1e069aac7410..2febb1dd68e8 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3098,15 +3098,8 @@ static bool nested_get_evmcs_page(struct kvm_vcpu *vcpu)
 			nested_vmx_handle_enlightened_vmptrld(vcpu, false);
 
 		if (evmptrld_status == EVMPTRLD_VMFAIL ||
-		    evmptrld_status == EVMPTRLD_ERROR) {
-			pr_debug_ratelimited("%s: enlightened vmptrld failed\n",
-					     __func__);
-			vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
-			vcpu->run->internal.suberror =
-				KVM_INTERNAL_ERROR_EMULATION;
-			vcpu->run->internal.ndata = 0;
+		    evmptrld_status == EVMPTRLD_ERROR)
 			return false;
-		}
 	}
 
 	return true;
@@ -3194,8 +3187,16 @@ static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
 
 static bool vmx_get_nested_state_pages(struct kvm_vcpu *vcpu)
 {
-	if (!nested_get_evmcs_page(vcpu))
+	if (!nested_get_evmcs_page(vcpu)) {
+		pr_debug_ratelimited("%s: enlightened vmptrld failed\n",
+				     __func__);
+		vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
+		vcpu->run->internal.suberror =
+			KVM_INTERNAL_ERROR_EMULATION;
+		vcpu->run->internal.ndata = 0;
+
 		return false;
+	}
 
 	if (is_guest_mode(vcpu) && !nested_get_vmcs12_pages(vcpu))
 		return false;
@@ -4422,7 +4423,15 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
 	/* trying to cancel vmlaunch/vmresume is a bug */
 	WARN_ON_ONCE(vmx->nested.nested_run_pending);
 
-	kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
+	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
+		/*
+		 * KVM_REQ_GET_NESTED_STATE_PAGES is also used to map
+		 * Enlightened VMCS after migration and we still need to
+		 * do that when something is forcing L2->L1 exit prior to
+		 * the first L2 run.
+		 */
+		(void)nested_get_evmcs_page(vcpu);
+	}
 
 	/* Service the TLB flush request for L2 before switching to L1. */
 	if (kvm_check_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu))
-- 
2.30.2



* [PATCH 2/4] KVM: nVMX: Properly pad 'struct kvm_vmx_nested_state_hdr'
  2021-05-03 15:08 [PATCH 0/4] KVM: nVMX: Fix migration of nested guests when eVMCS is in use Vitaly Kuznetsov
  2021-05-03 15:08 ` [PATCH 1/4] KVM: nVMX: Always make an attempt to map eVMCS after migration Vitaly Kuznetsov
@ 2021-05-03 15:08 ` Vitaly Kuznetsov
  2021-05-05  8:24   ` Maxim Levitsky
  2021-05-03 15:08 ` [PATCH 3/4] KVM: nVMX: Introduce __nested_vmx_handle_enlightened_vmptrld() Vitaly Kuznetsov
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Vitaly Kuznetsov @ 2021-05-03 15:08 UTC (permalink / raw)
  To: kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Maxim Levitsky,
	linux-kernel

Eliminate the probably unwanted hole in 'struct kvm_vmx_nested_state_hdr':

Pre-patch:
struct kvm_vmx_nested_state_hdr {
        __u64                      vmxon_pa;             /*     0     8 */
        __u64                      vmcs12_pa;            /*     8     8 */
        struct {
                __u16              flags;                /*    16     2 */
        } smm;                                           /*    16     2 */

        /* XXX 2 bytes hole, try to pack */

        __u32                      flags;                /*    20     4 */
        __u64                      preemption_timer_deadline; /*    24     8 */
};

Post-patch:
struct kvm_vmx_nested_state_hdr {
        __u64                      vmxon_pa;             /*     0     8 */
        __u64                      vmcs12_pa;            /*     8     8 */
        struct {
                __u16              flags;                /*    16     2 */
        } smm;                                           /*    16     2 */
        __u16                      pad;                  /*    18     2 */
        __u32                      flags;                /*    20     4 */
        __u64                      preemption_timer_deadline; /*    24     8 */
};
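
The post-patch offsets can also be checked at compile time from
userspace. This is a minimal sketch, not the uapi header: the struct
name is made up and uint*_t types stand in for the kernel's __u* types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the post-patch layout shown above (hypothetical name). */
struct nested_state_hdr_sketch {
	uint64_t vmxon_pa;
	uint64_t vmcs12_pa;
	struct {
		uint16_t flags;
	} smm;
	uint16_t pad;	/* explicit member where the 2-byte hole used to be */
	uint32_t flags;
	uint64_t preemption_timer_deadline;
};

/* Offsets and size match the pahole output: 16, 18, 20, 24 and 32 bytes total. */
_Static_assert(offsetof(struct nested_state_hdr_sketch, smm) == 16, "smm offset");
_Static_assert(offsetof(struct nested_state_hdr_sketch, pad) == 18, "pad offset");
_Static_assert(offsetof(struct nested_state_hdr_sketch, flags) == 20, "flags offset");
_Static_assert(offsetof(struct nested_state_hdr_sketch, preemption_timer_deadline) == 24,
	       "deadline offset");
_Static_assert(sizeof(struct nested_state_hdr_sketch) == 32, "struct size");
```

The explicit member does not move any existing field; it only gives the
two bytes at offset 18 a name.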

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 arch/x86/include/uapi/asm/kvm.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 5a3022c8af82..0662f644aad9 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -437,6 +437,8 @@ struct kvm_vmx_nested_state_hdr {
 		__u16 flags;
 	} smm;
 
+	__u16 pad;
+
 	__u32 flags;
 	__u64 preemption_timer_deadline;
 };
-- 
2.30.2



* [PATCH 3/4] KVM: nVMX: Introduce __nested_vmx_handle_enlightened_vmptrld()
  2021-05-03 15:08 [PATCH 0/4] KVM: nVMX: Fix migration of nested guests when eVMCS is in use Vitaly Kuznetsov
  2021-05-03 15:08 ` [PATCH 1/4] KVM: nVMX: Always make an attempt to map eVMCS after migration Vitaly Kuznetsov
  2021-05-03 15:08 ` [PATCH 2/4] KVM: nVMX: Properly pad 'struct kvm_vmx_nested_state_hdr' Vitaly Kuznetsov
@ 2021-05-03 15:08 ` Vitaly Kuznetsov
  2021-05-05  8:24   ` Maxim Levitsky
  2021-05-03 15:08 ` [PATCH 4/4] KVM: nVMX: Map enlightened VMCS upon restore when possible Vitaly Kuznetsov
  2021-05-03 15:43 ` [PATCH 0/4] KVM: nVMX: Fix migration of nested guests when eVMCS is in use Paolo Bonzini
  4 siblings, 1 reply; 19+ messages in thread
From: Vitaly Kuznetsov @ 2021-05-03 15:08 UTC (permalink / raw)
  To: kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Maxim Levitsky,
	linux-kernel

As a preparation for mapping eVMCS from vmx_set_nested_state(), split
the actual eVMCS mapping from acquiring the eVMCS GPA.

No functional change intended.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 arch/x86/kvm/vmx/nested.c | 26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 2febb1dd68e8..37fdc34f7afc 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1972,18 +1972,11 @@ static int copy_vmcs12_to_enlightened(struct vcpu_vmx *vmx)
  * This is an equivalent of the nested hypervisor executing the vmptrld
  * instruction.
  */
-static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld(
-	struct kvm_vcpu *vcpu, bool from_launch)
+static enum nested_evmptrld_status __nested_vmx_handle_enlightened_vmptrld(
+	struct kvm_vcpu *vcpu, u64 evmcs_gpa, bool from_launch)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	bool evmcs_gpa_changed = false;
-	u64 evmcs_gpa;
-
-	if (likely(!vmx->nested.enlightened_vmcs_enabled))
-		return EVMPTRLD_DISABLED;
-
-	if (!nested_enlightened_vmentry(vcpu, &evmcs_gpa))
-		return EVMPTRLD_DISABLED;
 
 	if (unlikely(!vmx->nested.hv_evmcs ||
 		     evmcs_gpa != vmx->nested.hv_evmcs_vmptr)) {
@@ -2055,6 +2048,21 @@ static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld(
 	return EVMPTRLD_SUCCEEDED;
 }
 
+static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld(
+	struct kvm_vcpu *vcpu, bool from_launch)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	u64 evmcs_gpa;
+
+	if (likely(!vmx->nested.enlightened_vmcs_enabled))
+		return EVMPTRLD_DISABLED;
+
+	if (!nested_enlightened_vmentry(vcpu, &evmcs_gpa))
+		return EVMPTRLD_DISABLED;
+
+	return __nested_vmx_handle_enlightened_vmptrld(vcpu, evmcs_gpa, from_launch);
+}
+
 void nested_sync_vmcs12_to_shadow(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
-- 
2.30.2



* [PATCH 4/4] KVM: nVMX: Map enlightened VMCS upon restore when possible
  2021-05-03 15:08 [PATCH 0/4] KVM: nVMX: Fix migration of nested guests when eVMCS is in use Vitaly Kuznetsov
                   ` (2 preceding siblings ...)
  2021-05-03 15:08 ` [PATCH 3/4] KVM: nVMX: Introduce __nested_vmx_handle_enlightened_vmptrld() Vitaly Kuznetsov
@ 2021-05-03 15:08 ` Vitaly Kuznetsov
  2021-05-03 15:53   ` Paolo Bonzini
  2021-05-05  8:33   ` Maxim Levitsky
  2021-05-03 15:43 ` [PATCH 0/4] KVM: nVMX: Fix migration of nested guests when eVMCS is in use Paolo Bonzini
  4 siblings, 2 replies; 19+ messages in thread
From: Vitaly Kuznetsov @ 2021-05-03 15:08 UTC (permalink / raw)
  To: kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Maxim Levitsky,
	linux-kernel

It now looks like a bad idea to not restore eVMCS mapping directly from
vmx_set_nested_state(). The restoration path now depends on whether KVM
will continue executing L2 (vmx_get_nested_state_pages()) or will have to
exit to L1 (nested_vmx_vmexit()). This complicates error propagation and
diverges too much from the 'native' path, where 'nested.current_vmptr' is
set directly from vmx_set_nested_state().

The existing solution postponing eVMCS mapping also seems to be fragile.
In multiple places the code checks whether 'vmx->nested.hv_evmcs' is not
NULL to distinguish between eVMCS and non-eVMCS cases. All these checks
are 'incomplete' as we have a weird 'eVMCS is in use but not yet mapped'
state.

Also, in case vmx_get_nested_state() is called right after
vmx_set_nested_state() without executing the guest first, the resulting
state is going to be incorrect as 'KVM_STATE_NESTED_EVMCS' flag will be
missing.

Fix all these issues by making the eVMCS restoration path closer to its
'native' sibling by putting the eVMCS GPA into 'struct kvm_vmx_nested_state_hdr'.
To avoid ABI incompatibility, do not introduce a new flag and keep the
original eVMCS mapping path through KVM_REQ_GET_NESTED_STATE_PAGES in
place. To distinguish between 'new' and 'old' formats consider eVMCS
GPA == 0 as an unset GPA (thus forcing KVM_REQ_GET_NESTED_STATE_PAGES
path). While a real eVMCS GPA of 0 is technically possible, it seems to
be an extremely unlikely case.
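
The three cases the restore path distinguishes can be summarised as a
small classifier. This is a sketch only: the enum and function names
are hypothetical and merely mirror the conditions used in the
vmx_set_nested_state() hunk of this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names, for illustration only; not KVM identifiers. */
enum evmcs_restore_action {
	EVMCS_RESTORE_MAP_NOW,	/* usable GPA in the header: map it directly */
	EVMCS_RESTORE_DEFER,	/* GPA == 0: 'old' format, use the request path */
	EVMCS_RESTORE_NONE,	/* GPA == -1ull: neither branch is taken */
};

static enum evmcs_restore_action classify_evmcs_pa(uint64_t evmcs_pa)
{
	if (evmcs_pa == 0)
		return EVMCS_RESTORE_DEFER;
	if (evmcs_pa == UINT64_MAX)	/* same value as -1ull */
		return EVMCS_RESTORE_NONE;
	return EVMCS_RESTORE_MAP_NOW;
}
```

Only the EVMCS_RESTORE_MAP_NOW case can fail synchronously; in the
patch a failed mapping there makes vmx_set_nested_state() return
-EINVAL.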

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 arch/x86/include/uapi/asm/kvm.h |  2 ++
 arch/x86/kvm/vmx/nested.c       | 27 +++++++++++++++++++++------
 2 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 0662f644aad9..3845977b739e 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -441,6 +441,8 @@ struct kvm_vmx_nested_state_hdr {
 
 	__u32 flags;
 	__u64 preemption_timer_deadline;
+
+	__u64 evmcs_pa;
 };
 
 struct kvm_svm_nested_state_data {
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 37fdc34f7afc..4261cf4755c8 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -6019,6 +6019,7 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu,
 		.hdr.vmx.vmxon_pa = -1ull,
 		.hdr.vmx.vmcs12_pa = -1ull,
 		.hdr.vmx.preemption_timer_deadline = 0,
+		.hdr.vmx.evmcs_pa = -1ull,
 	};
 	struct kvm_vmx_nested_state_data __user *user_vmx_nested_state =
 		&user_kvm_nested_state->data.vmx[0];
@@ -6037,8 +6038,10 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu,
 		if (vmx_has_valid_vmcs12(vcpu)) {
 			kvm_state.size += sizeof(user_vmx_nested_state->vmcs12);
 
-			if (vmx->nested.hv_evmcs)
+			if (vmx->nested.hv_evmcs) {
 				kvm_state.flags |= KVM_STATE_NESTED_EVMCS;
+				kvm_state.hdr.vmx.evmcs_pa = vmx->nested.hv_evmcs_vmptr;
+			}
 
 			if (is_guest_mode(vcpu) &&
 			    nested_cpu_has_shadow_vmcs(vmcs12) &&
@@ -6230,13 +6233,25 @@ static int vmx_set_nested_state(struct kvm_vcpu *vcpu,
 
 		set_current_vmptr(vmx, kvm_state->hdr.vmx.vmcs12_pa);
 	} else if (kvm_state->flags & KVM_STATE_NESTED_EVMCS) {
+		u64 evmcs_gpa = kvm_state->hdr.vmx.evmcs_pa;
+
 		/*
-		 * nested_vmx_handle_enlightened_vmptrld() cannot be called
-		 * directly from here as HV_X64_MSR_VP_ASSIST_PAGE may not be
-		 * restored yet. EVMCS will be mapped from
-		 * nested_get_vmcs12_pages().
+		 * EVMCS GPA == 0 most likely indicates that the migration data is
+		 * coming from an older KVM which doesn't support 'evmcs_pa' in
+		 * 'struct kvm_vmx_nested_state_hdr'.
 		 */
-		kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
+		if (evmcs_gpa && (evmcs_gpa != -1ull) &&
+		    (__nested_vmx_handle_enlightened_vmptrld(vcpu, evmcs_gpa, false) !=
+		     EVMPTRLD_SUCCEEDED)) {
+			return -EINVAL;
+		} else if (!evmcs_gpa) {
+			/*
+			 * EVMCS GPA can't be acquired from VP assist page here because
+			 * HV_X64_MSR_VP_ASSIST_PAGE may not be restored yet.
+			 * EVMCS will be mapped from nested_get_evmcs_page().
+			 */
+			kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
+		}
 	} else {
 		return -EINVAL;
 	}
-- 
2.30.2



* Re: [PATCH 0/4] KVM: nVMX: Fix migration of nested guests when eVMCS is in use
  2021-05-03 15:08 [PATCH 0/4] KVM: nVMX: Fix migration of nested guests when eVMCS is in use Vitaly Kuznetsov
                   ` (3 preceding siblings ...)
  2021-05-03 15:08 ` [PATCH 4/4] KVM: nVMX: Map enlightened VMCS upon restore when possible Vitaly Kuznetsov
@ 2021-05-03 15:43 ` Paolo Bonzini
  2021-05-03 15:52   ` Vitaly Kuznetsov
  4 siblings, 1 reply; 19+ messages in thread
From: Paolo Bonzini @ 2021-05-03 15:43 UTC (permalink / raw)
  To: Vitaly Kuznetsov, kvm
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Maxim Levitsky,
	linux-kernel

On 03/05/21 17:08, Vitaly Kuznetsov wrote:
> Win10 guests with WSL2 enabled sometimes crash on migration when
> enlightened VMCS was used. The condition seems to be induced by the
> situation when L2->L1 exit is caused immediately after migration and
> before L2 gets a chance to run (e.g. when there's an interrupt pending).

Interesting, I think it gets to nested_vmx_vmexit before

                 if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
                         if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
                                 r = 0;
                                 goto out;
                         }
                 }

due to the infamous calls to check_nested_events that are scattered
through KVM?

Paolo



* Re: [PATCH 0/4] KVM: nVMX: Fix migration of nested guests when eVMCS is in use
  2021-05-03 15:43 ` [PATCH 0/4] KVM: nVMX: Fix migration of nested guests when eVMCS is in use Paolo Bonzini
@ 2021-05-03 15:52   ` Vitaly Kuznetsov
  0 siblings, 0 replies; 19+ messages in thread
From: Vitaly Kuznetsov @ 2021-05-03 15:52 UTC (permalink / raw)
  To: Paolo Bonzini, kvm
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Maxim Levitsky,
	linux-kernel

Paolo Bonzini <pbonzini@redhat.com> writes:

> On 03/05/21 17:08, Vitaly Kuznetsov wrote:
>> Win10 guests with WSL2 enabled sometimes crash on migration when
>> enlightened VMCS was used. The condition seems to be induced by the
>> situation when L2->L1 exit is caused immediately after migration and
>> before L2 gets a chance to run (e.g. when there's an interrupt pending).
>
> Interesting, I think it gets to nested_vmx_vmexit before
>
>                  if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
>                          if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
>                                  r = 0;
>                                  goto out;
>                          }
>                  }
>
> due to the infamous calls to check_nested_events that are scattered
> through KVM?

Yea,

vcpu_run() -> kvm_vcpu_running() -> vmx_check_nested_events() if I
remember it correctly.

-- 
Vitaly



* Re: [PATCH 4/4] KVM: nVMX: Map enlightened VMCS upon restore when possible
  2021-05-03 15:08 ` [PATCH 4/4] KVM: nVMX: Map enlightened VMCS upon restore when possible Vitaly Kuznetsov
@ 2021-05-03 15:53   ` Paolo Bonzini
  2021-05-04  8:02     ` Vitaly Kuznetsov
  2021-05-05  8:33   ` Maxim Levitsky
  1 sibling, 1 reply; 19+ messages in thread
From: Paolo Bonzini @ 2021-05-03 15:53 UTC (permalink / raw)
  To: Vitaly Kuznetsov, kvm
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Maxim Levitsky,
	linux-kernel

On 03/05/21 17:08, Vitaly Kuznetsov wrote:
> It now looks like a bad idea to not restore eVMCS mapping directly from
> vmx_set_nested_state(). The restoration path now depends on whether KVM
> will continue executing L2 (vmx_get_nested_state_pages()) or will have to
> exit to L1 (nested_vmx_vmexit()), this complicates error propagation and
> diverges too much from the 'native' path when 'nested.current_vmptr' is
> set directly from vmx_get_nested_state_pages().
> 
> The existing solution postponing eVMCS mapping also seems to be fragile.
> In multiple places the code checks whether 'vmx->nested.hv_evmcs' is not
> NULL to distinguish between eVMCS and non-eVMCS cases. All these checks
> are 'incomplete' as we have a weird 'eVMCS is in use but not yet mapped'
> state.
> 
> Also, in case vmx_get_nested_state() is called right after
> vmx_set_nested_state() without executing the guest first, the resulting
> state is going to be incorrect as 'KVM_STATE_NESTED_EVMCS' flag will be
> missing.
> 
> Fix all these issues by making eVMCS restoration path closer to its
> 'native' sibling by putting eVMCS GPA to 'struct kvm_vmx_nested_state_hdr'.
> To avoid ABI incompatibility, do not introduce a new flag and keep the

I'm not sure what is the disadvantage of not having a new flag.

Having two different paths with subtly different side effects however 
seems really worse for maintenance.  We are already discussing in 
another thread how to get rid of the check_nested_events side effects; 
that might possibly even remove the need for patch 1, so it's at least 
worth pursuing more than adding this second path.

I have queued patch 1, but I'd rather have a kvm selftest for it.  It 
doesn't seem impossible to have one...

Paolo

> original eVMCS mapping path through KVM_REQ_GET_NESTED_STATE_PAGES in
> place. To distinguish between 'new' and 'old' formats consider eVMCS
> GPA == 0 as an unset GPA (thus forcing KVM_REQ_GET_NESTED_STATE_PAGES
> path). While technically possible, it seems to be an extremely unlikely
> case.


> Signed-off-by: Vitaly Kuznetsov<vkuznets@redhat.com>



* Re: [PATCH 4/4] KVM: nVMX: Map enlightened VMCS upon restore when possible
  2021-05-03 15:53   ` Paolo Bonzini
@ 2021-05-04  8:02     ` Vitaly Kuznetsov
  2021-05-04  8:06       ` Paolo Bonzini
  0 siblings, 1 reply; 19+ messages in thread
From: Vitaly Kuznetsov @ 2021-05-04  8:02 UTC (permalink / raw)
  To: Paolo Bonzini, kvm
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Maxim Levitsky,
	linux-kernel

Paolo Bonzini <pbonzini@redhat.com> writes:

> On 03/05/21 17:08, Vitaly Kuznetsov wrote:
>> It now looks like a bad idea to not restore eVMCS mapping directly from
>> vmx_set_nested_state(). The restoration path now depends on whether KVM
>> will continue executing L2 (vmx_get_nested_state_pages()) or will have to
>> exit to L1 (nested_vmx_vmexit()), this complicates error propagation and
>> diverges too much from the 'native' path when 'nested.current_vmptr' is
>> set directly from vmx_get_nested_state_pages().
>> 
>> The existing solution postponing eVMCS mapping also seems to be fragile.
>> In multiple places the code checks whether 'vmx->nested.hv_evmcs' is not
>> NULL to distinguish between eVMCS and non-eVMCS cases. All these checks
>> are 'incomplete' as we have a weird 'eVMCS is in use but not yet mapped'
>> state.
>> 
>> Also, in case vmx_get_nested_state() is called right after
>> vmx_set_nested_state() without executing the guest first, the resulting
>> state is going to be incorrect as 'KVM_STATE_NESTED_EVMCS' flag will be
>> missing.
>> 
>> Fix all these issues by making eVMCS restoration path closer to its
>> 'native' sibling by putting eVMCS GPA to 'struct kvm_vmx_nested_state_hdr'.
>> To avoid ABI incompatibility, do not introduce a new flag and keep the
>
> I'm not sure what is the disadvantage of not having a new flag.
>

Adding a new flag would make us backwards-incompatible both ways:

1) Migrating 'new' state to an older KVM will fail the

	if (kvm_state->hdr.vmx.flags & ~KVM_STATE_VMX_PREEMPTION_TIMER_DEADLINE)
	        return -EINVAL;

check.

2) Migrating 'old' state to 'new' KVM would make us support the old path
('KVM_REQ_GET_NESTED_STATE_PAGES') so the flag will still be 'optional'.
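
To illustrate point 1), here is a standalone sketch of the check an
older KVM applies. The macro names are made up; the value of the
deadline bit is assumed to match the uapi definition, and the second
bit is a purely hypothetical new flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed to mirror KVM_STATE_VMX_PREEMPTION_TIMER_DEADLINE (bit 0). */
#define SKETCH_PREEMPTION_TIMER_DEADLINE	(1u << 0)
/* Purely hypothetical flag that a 'new' KVM might have introduced. */
#define SKETCH_HYPOTHETICAL_EVMCS_PA_FLAG	(1u << 1)

/* Models the flags check quoted above, as an older kernel would run it. */
static int old_kvm_check_vmx_flags(uint32_t flags)
{
	if (flags & ~SKETCH_PREEMPTION_TIMER_DEADLINE)
		return -EINVAL;

	return 0;
}
```

Any bit the older kernel does not know about is rejected, so state
carrying a new flag could not be restored there.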

> Having two different paths with subtly different side effects however 
> seems really worse for maintenance.  We are already discussing in 
> another thread how to get rid of the check_nested_events side effects; 
> that might possibly even remove the need for patch 1, so it's at least 
> worth pursuing more than adding this second path.

I have to admit I don't fully like this solution either :-( In case we
make sure KVM_REQ_GET_NESTED_STATE_PAGES always gets handled, the fix can
indeed be omitted. However, I still dislike the divergence and the fact
that 'if (vmx->nested.hv_evmcs)' checks scattered across the code are
not fully valid. E.g. how do we fix the problem of an immediate
KVM_GET_NESTED_STATE after KVM_SET_NESTED_STATE, without executing the
vCPU in between?

>
> I have queued patch 1, but I'd rather have a kvm selftest for it.  It 
> doesn't seem impossible to have one...

Thank you, the band-aid solves a real problem. Let me try to come up
with a selftest for it.

>
> Paolo
>
>> original eVMCS mapping path through KVM_REQ_GET_NESTED_STATE_PAGES in
>> place. To distinguish between 'new' and 'old' formats consider eVMCS
>> GPA == 0 as an unset GPA (thus forcing KVM_REQ_GET_NESTED_STATE_PAGES
>> path). While technically possible, it seems to be an extremely unlikely
>> case.
>
>
>> Signed-off-by: Vitaly Kuznetsov<vkuznets@redhat.com>
>

-- 
Vitaly



* Re: [PATCH 4/4] KVM: nVMX: Map enlightened VMCS upon restore when possible
  2021-05-04  8:02     ` Vitaly Kuznetsov
@ 2021-05-04  8:06       ` Paolo Bonzini
  0 siblings, 0 replies; 19+ messages in thread
From: Paolo Bonzini @ 2021-05-04  8:06 UTC (permalink / raw)
  To: Vitaly Kuznetsov, kvm
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, Maxim Levitsky,
	linux-kernel

On 04/05/21 10:02, Vitaly Kuznetsov wrote:
> I still dislike the divergence and the fact
> that 'if (vmx->nested.hv_evmcs)' checks scattered across the code are
> not fully valid. E.g. how do we fix immediate KVM_GET_NESTED_STATE after
> KVM_SET_NESTED_STATE without executing the vCPU problem?

You obviously have thought about this more than I did, but if you can 
write a testcase for that as well, I can take a look.

Thanks,

Paolo



* Re: [PATCH 1/4] KVM: nVMX: Always make an attempt to map eVMCS after migration
  2021-05-03 15:08 ` [PATCH 1/4] KVM: nVMX: Always make an attempt to map eVMCS after migration Vitaly Kuznetsov
@ 2021-05-05  8:22   ` Maxim Levitsky
  2021-05-05  8:39     ` Vitaly Kuznetsov
  0 siblings, 1 reply; 19+ messages in thread
From: Maxim Levitsky @ 2021-05-05  8:22 UTC (permalink / raw)
  To: Vitaly Kuznetsov, kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, linux-kernel

On Mon, 2021-05-03 at 17:08 +0200, Vitaly Kuznetsov wrote:
> When enlightened VMCS is in use and nested state is migrated with
> vmx_get_nested_state()/vmx_set_nested_state() KVM can't map evmcs
> page right away: evmcs gpa is not in 'struct kvm_vmx_nested_state_hdr'
> and we can't read it from VP assist page because userspace may decide
> to restore HV_X64_MSR_VP_ASSIST_PAGE after restoring nested state
> (and QEMU, for example, does exactly that). To make sure eVMCS is
> mapped, vmx_set_nested_state() raises KVM_REQ_GET_NESTED_STATE_PAGES
> request.
> 
> Commit f2c7ef3ba955 ("KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES
> on nested vmexit") added KVM_REQ_GET_NESTED_STATE_PAGES clearing to
> nested_vmx_vmexit() to make sure MSR permission bitmap is not switched
> when an immediate exit from L2 to L1 happens right after migration (caused
> by a pending event, for example). Unfortunately, in the exact same
> situation we still need to have eVMCS mapped so
> nested_sync_vmcs12_to_shadow() reflects changes in VMCS12 to eVMCS.
> 
> As a band-aid, restore nested_get_evmcs_page() when clearing
> KVM_REQ_GET_NESTED_STATE_PAGES in nested_vmx_vmexit(). The 'fix' is far
> from being ideal as we can't easily propagate possible failures and even if
> we could, this is most likely already too late to do so. The whole
> 'KVM_REQ_GET_NESTED_STATE_PAGES' idea for mapping eVMCS after migration
> seems to be fragile as we diverge too much from the 'native' path when
> vmptr loading happens on vmx_set_nested_state().
> 
> Fixes: f2c7ef3ba955 ("KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES on nested vmexit")
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 29 +++++++++++++++++++----------
>  1 file changed, 19 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 1e069aac7410..2febb1dd68e8 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -3098,15 +3098,8 @@ static bool nested_get_evmcs_page(struct kvm_vcpu *vcpu)
>  			nested_vmx_handle_enlightened_vmptrld(vcpu, false);
>  
>  		if (evmptrld_status == EVMPTRLD_VMFAIL ||
> -		    evmptrld_status == EVMPTRLD_ERROR) {
> -			pr_debug_ratelimited("%s: enlightened vmptrld failed\n",
> -					     __func__);
> -			vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
> -			vcpu->run->internal.suberror =
> -				KVM_INTERNAL_ERROR_EMULATION;
> -			vcpu->run->internal.ndata = 0;
> +		    evmptrld_status == EVMPTRLD_ERROR)
>  			return false;
> -		}
>  	}
>  
>  	return true;
> @@ -3194,8 +3187,16 @@ static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
>  
>  static bool vmx_get_nested_state_pages(struct kvm_vcpu *vcpu)
>  {
> -	if (!nested_get_evmcs_page(vcpu))
> +	if (!nested_get_evmcs_page(vcpu)) {
> +		pr_debug_ratelimited("%s: enlightened vmptrld failed\n",
> +				     __func__);
> +		vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
> +		vcpu->run->internal.suberror =
> +			KVM_INTERNAL_ERROR_EMULATION;
> +		vcpu->run->internal.ndata = 0;
> +
>  		return false;
> +	}

Hi!

Any reason to move the debug prints out of nested_get_evmcs_page?


>  
>  	if (is_guest_mode(vcpu) && !nested_get_vmcs12_pages(vcpu))
>  		return false;
> @@ -4422,7 +4423,15 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
>  	/* trying to cancel vmlaunch/vmresume is a bug */
>  	WARN_ON_ONCE(vmx->nested.nested_run_pending);
>  
> -	kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> +	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> +		/*
> +		 * KVM_REQ_GET_NESTED_STATE_PAGES is also used to map
> +		 * Enlightened VMCS after migration and we still need to
> +		 * do that when something is forcing L2->L1 exit prior to
> +		 * the first L2 run.
> +		 */
> +		(void)nested_get_evmcs_page(vcpu);
> +	}
Yes this is a band-aid, but it has to be done I agree.

>  
>  	/* Service the TLB flush request for L2 before switching to L1. */
>  	if (kvm_check_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu))




I also tested this and it survives a bit better: it used to crash instantly
after a single migration cycle, but the guest still crashes after around
20 iterations of my regular nested migration test.

The blue screen shows the stop code HYPERVISOR ERROR and nothing else.

I tested both this patch alone and all 4 patches.

Without evmcs, the same VM with same host kernel and qemu survived an overnight
test and passed about 1800 migration iterations.
(my synthetic migration test doesn't yet work on Intel, I need to investigate why)

For reference this is the VM that you gave me to test, kvm/queue kernel,
with merged mainline in it,
and mostly latest qemu (updated about a week ago or so)

qemu: 3791642c8d60029adf9b00bcb4e34d7d8a1aea4d
kernel: 9f242010c3b46e63bc62f08fff42cef992d3801b and
        then merge v5.12 from mainline.

Best regards,
	Maxim Levitsky






* Re: [PATCH 2/4] KVM: nVMX: Properly pad 'struct kvm_vmx_nested_state_hdr'
  2021-05-03 15:08 ` [PATCH 2/4] KVM: nVMX: Properly pad 'struct kvm_vmx_nested_state_hdr' Vitaly Kuznetsov
@ 2021-05-05  8:24   ` Maxim Levitsky
  2021-05-05 17:34     ` Sean Christopherson
  0 siblings, 1 reply; 19+ messages in thread
From: Maxim Levitsky @ 2021-05-05  8:24 UTC (permalink / raw)
  To: Vitaly Kuznetsov, kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, linux-kernel

On Mon, 2021-05-03 at 17:08 +0200, Vitaly Kuznetsov wrote:
> Eliminate the probably unwanted hole in 'struct kvm_vmx_nested_state_hdr':
> 
> Pre-patch:
> struct kvm_vmx_nested_state_hdr {
>         __u64                      vmxon_pa;             /*     0     8 */
>         __u64                      vmcs12_pa;            /*     8     8 */
>         struct {
>                 __u16              flags;                /*    16     2 */
>         } smm;                                           /*    16     2 */
> 
>         /* XXX 2 bytes hole, try to pack */
> 
>         __u32                      flags;                /*    20     4 */
>         __u64                      preemption_timer_deadline; /*    24     8 */
> };
> 
> Post-patch:
> struct kvm_vmx_nested_state_hdr {
>         __u64                      vmxon_pa;             /*     0     8 */
>         __u64                      vmcs12_pa;            /*     8     8 */
>         struct {
>                 __u16              flags;                /*    16     2 */
>         } smm;                                           /*    16     2 */
>         __u16                      pad;                  /*    18     2 */
>         __u32                      flags;                /*    20     4 */
>         __u64                      preemption_timer_deadline; /*    24     8 */
> };
> 
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
>  arch/x86/include/uapi/asm/kvm.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 5a3022c8af82..0662f644aad9 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -437,6 +437,8 @@ struct kvm_vmx_nested_state_hdr {
>  		__u16 flags;
>  	} smm;
>  
> +	__u16 pad;
> +
>  	__u32 flags;
>  	__u64 preemption_timer_deadline;
>  };


Looks good to me.

I wonder if we can enable the -Wpadded GCC warning to warn about such cases.
Probably can't be enabled for the whole kernel but maybe we can enable it
for KVM codebase at least, like we did with -Werror.


From GCC manual:

"-Wpadded
Warn if padding is included in a structure, either to align an element of the structure or to align the whole structure. 
Sometimes when this happens it is possible to rearrange the fields of the structure 
to reduce the padding and so make the structure smaller."
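
As an illustration only (not something proposed in the series), a narrower
alternative to -Wpadded is to pin the expected layout with compile-time
assertions. The sketch below is a userspace model of the post-patch header:
the struct name and the helper hdr_model_has_no_holes() are hypothetical,
and uintNN_t stands in for the kernel's __uNN types. In kernel code,
static_assert() would presumably play the role of _Static_assert.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model of the post-patch layout quoted above. */
struct vmx_nested_state_hdr_model {
	uint64_t vmxon_pa;
	uint64_t vmcs12_pa;
	struct {
		uint16_t flags;
	} smm;
	uint16_t pad;		/* explicit padding, replaces the 2-byte hole */
	uint32_t flags;
	uint64_t preemption_timer_deadline;
};

/* Offsets taken from the pahole-style output in the commit message. */
_Static_assert(offsetof(struct vmx_nested_state_hdr_model, smm) == 16,
	       "smm must start at offset 16");
_Static_assert(offsetof(struct vmx_nested_state_hdr_model, pad) == 18,
	       "pad must fill the former hole at offset 18");
_Static_assert(offsetof(struct vmx_nested_state_hdr_model, flags) == 20,
	       "flags must stay at offset 20");
_Static_assert(offsetof(struct vmx_nested_state_hdr_model,
			preemption_timer_deadline) == 24,
	       "preemption_timer_deadline must stay at offset 24");

/* Returns 1 when the member sizes add up to sizeof(), i.e. no hidden holes. */
int hdr_model_has_no_holes(void)
{
	struct vmx_nested_state_hdr_model m;
	size_t sum = sizeof(m.vmxon_pa) + sizeof(m.vmcs12_pa) +
		     sizeof(m.smm) + sizeof(m.pad) + sizeof(m.flags) +
		     sizeof(m.preemption_timer_deadline);

	return sum == sizeof(m);
}
```

Unlike -Wpadded, this only guards the structures that are explicitly
annotated, so it would not flood the build with warnings for unrelated code.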


Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky




* Re: [PATCH 3/4] KVM: nVMX: Introduce __nested_vmx_handle_enlightened_vmptrld()
  2021-05-03 15:08 ` [PATCH 3/4] KVM: nVMX: Introduce __nested_vmx_handle_enlightened_vmptrld() Vitaly Kuznetsov
@ 2021-05-05  8:24   ` Maxim Levitsky
  0 siblings, 0 replies; 19+ messages in thread
From: Maxim Levitsky @ 2021-05-05  8:24 UTC (permalink / raw)
  To: Vitaly Kuznetsov, kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, linux-kernel

On Mon, 2021-05-03 at 17:08 +0200, Vitaly Kuznetsov wrote:
> As a preparation to mapping eVMCS from vmx_set_nested_state() split
> the actual eVMCS mappign from aquiring eVMCS GPA.
> 
> No functional change intended.
> 
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 26 +++++++++++++++++---------
>  1 file changed, 17 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 2febb1dd68e8..37fdc34f7afc 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -1972,18 +1972,11 @@ static int copy_vmcs12_to_enlightened(struct vcpu_vmx *vmx)
>   * This is an equivalent of the nested hypervisor executing the vmptrld
>   * instruction.
>   */
> -static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld(
> -	struct kvm_vcpu *vcpu, bool from_launch)
> +static enum nested_evmptrld_status __nested_vmx_handle_enlightened_vmptrld(
> +	struct kvm_vcpu *vcpu, u64 evmcs_gpa, bool from_launch)
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  	bool evmcs_gpa_changed = false;
> -	u64 evmcs_gpa;
> -
> -	if (likely(!vmx->nested.enlightened_vmcs_enabled))
> -		return EVMPTRLD_DISABLED;
> -
> -	if (!nested_enlightened_vmentry(vcpu, &evmcs_gpa))
> -		return EVMPTRLD_DISABLED;
>  
>  	if (unlikely(!vmx->nested.hv_evmcs ||
>  		     evmcs_gpa != vmx->nested.hv_evmcs_vmptr)) {
> @@ -2055,6 +2048,21 @@ static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld(
>  	return EVMPTRLD_SUCCEEDED;
>  }
>  
> +static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld(
> +	struct kvm_vcpu *vcpu, bool from_launch)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	u64 evmcs_gpa;
> +
> +	if (likely(!vmx->nested.enlightened_vmcs_enabled))
> +		return EVMPTRLD_DISABLED;
> +
> +	if (!nested_enlightened_vmentry(vcpu, &evmcs_gpa))
> +		return EVMPTRLD_DISABLED;
> +
> +	return __nested_vmx_handle_enlightened_vmptrld(vcpu, evmcs_gpa, from_launch);
> +}
> +
>  void nested_sync_vmcs12_to_shadow(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky



* Re: [PATCH 4/4] KVM: nVMX: Map enlightened VMCS upon restore when possible
  2021-05-03 15:08 ` [PATCH 4/4] KVM: nVMX: Map enlightened VMCS upon restore when possible Vitaly Kuznetsov
  2021-05-03 15:53   ` Paolo Bonzini
@ 2021-05-05  8:33   ` Maxim Levitsky
  2021-05-05  9:17     ` Vitaly Kuznetsov
  1 sibling, 1 reply; 19+ messages in thread
From: Maxim Levitsky @ 2021-05-05  8:33 UTC (permalink / raw)
  To: Vitaly Kuznetsov, kvm, Paolo Bonzini
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, linux-kernel

On Mon, 2021-05-03 at 17:08 +0200, Vitaly Kuznetsov wrote:
> It now looks like a bad idea to not restore eVMCS mapping directly from
> vmx_set_nested_state(). The restoration path now depends on whether KVM
> will continue executing L2 (vmx_get_nested_state_pages()) or will have to
> exit to L1 (nested_vmx_vmexit()), this complicates error propagation and
> diverges too much from the 'native' path when 'nested.current_vmptr' is
> set directly from vmx_get_nested_state_pages().
> 
> The existing solution postponing eVMCS mapping also seems to be fragile.
> In multiple places the code checks whether 'vmx->nested.hv_evmcs' is not
> NULL to distinguish between eVMCS and non-eVMCS cases. All these checks
> are 'incomplete' as we have a weird 'eVMCS is in use but not yet mapped'
> state.
> 
> Also, in case vmx_get_nested_state() is called right after
> vmx_set_nested_state() without executing the guest first, the resulting
> state is going to be incorrect as 'KVM_STATE_NESTED_EVMCS' flag will be
> missing.
> 
> Fix all these issues by making eVMCS restoration path closer to its
> 'native' sibling by putting eVMCS GPA to 'struct kvm_vmx_nested_state_hdr'.
> To avoid ABI incompatibility, do not introduce a new flag and keep the
> original eVMCS mapping path through KVM_REQ_GET_NESTED_STATE_PAGES in
> place. To distinguish between 'new' and 'old' formats consider eVMCS
> GPA == 0 as an unset GPA (thus forcing KVM_REQ_GET_NESTED_STATE_PAGES
> path). While technically possible, it seems to be an extremely unlikely
> case.
> 
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
>  arch/x86/include/uapi/asm/kvm.h |  2 ++
>  arch/x86/kvm/vmx/nested.c       | 27 +++++++++++++++++++++------
>  2 files changed, 23 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 0662f644aad9..3845977b739e 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -441,6 +441,8 @@ struct kvm_vmx_nested_state_hdr {
>  
>  	__u32 flags;
>  	__u64 preemption_timer_deadline;
> +
> +	__u64 evmcs_pa;
>  };
>  
>  struct kvm_svm_nested_state_data {
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 37fdc34f7afc..4261cf4755c8 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -6019,6 +6019,7 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu,
>  		.hdr.vmx.vmxon_pa = -1ull,
>  		.hdr.vmx.vmcs12_pa = -1ull,
>  		.hdr.vmx.preemption_timer_deadline = 0,
> +		.hdr.vmx.evmcs_pa = -1ull,
>  	};
>  	struct kvm_vmx_nested_state_data __user *user_vmx_nested_state =
>  		&user_kvm_nested_state->data.vmx[0];
> @@ -6037,8 +6038,10 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu,
>  		if (vmx_has_valid_vmcs12(vcpu)) {
>  			kvm_state.size += sizeof(user_vmx_nested_state->vmcs12);
>  
> -			if (vmx->nested.hv_evmcs)
> +			if (vmx->nested.hv_evmcs) {
>  				kvm_state.flags |= KVM_STATE_NESTED_EVMCS;
> +				kvm_state.hdr.vmx.evmcs_pa = vmx->nested.hv_evmcs_vmptr;
> +			}
>  
>  			if (is_guest_mode(vcpu) &&
>  			    nested_cpu_has_shadow_vmcs(vmcs12) &&
> @@ -6230,13 +6233,25 @@ static int vmx_set_nested_state(struct kvm_vcpu *vcpu,
>  
>  		set_current_vmptr(vmx, kvm_state->hdr.vmx.vmcs12_pa);
>  	} else if (kvm_state->flags & KVM_STATE_NESTED_EVMCS) {
> +		u64 evmcs_gpa = kvm_state->hdr.vmx.evmcs_pa;
> +
>  		/*
> -		 * nested_vmx_handle_enlightened_vmptrld() cannot be called
> -		 * directly from here as HV_X64_MSR_VP_ASSIST_PAGE may not be
> -		 * restored yet. EVMCS will be mapped from
> -		 * nested_get_vmcs12_pages().
> +		 * EVMCS GPA == 0 most likely indicates that the migration data is
> +		 * coming from an older KVM which doesn't support 'evmcs_pa' in
> +		 * 'struct kvm_vmx_nested_state_hdr'.
>  		 */
> -		kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> +		if (evmcs_gpa && (evmcs_gpa != -1ull) &&
> +		    (__nested_vmx_handle_enlightened_vmptrld(vcpu, evmcs_gpa, false) !=
> +		     EVMPTRLD_SUCCEEDED)) {
> +			return -EINVAL;
> +		} else if (!evmcs_gpa) {
> +			/*
> +			 * EVMCS GPA can't be acquired from VP assist page here because
> +			 * HV_X64_MSR_VP_ASSIST_PAGE may not be restored yet.
> +			 * EVMCS will be mapped from nested_get_evmcs_page().
> +			 */
> +			kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> +		}
>  	} else {
>  		return -EINVAL;
>  	}

Hi everyone!

Let me explain my concern about this patch and also ask if I understand this correctly.

In a nutshell, if I understand this correctly, we are not allowed to access any guest
memory while setting the nested state.

If I also understand correctly, the reason for the above is that
userspace is allowed to set the nested state first, then fiddle with
the KVM memslots, maybe even update the guest memory, and only later do the KVM_RUN ioctl.

So this is the major reason why the KVM_REQ_GET_NESTED_STATE_PAGES
request exists in the first place.

If that is correct I assume that we either have to keep loading the EVMCS page on
KVM_REQ_GET_NESTED_STATE_PAGES request, or we want to include the EVMCS itself
in the migration state in addition to its physical address, similar to how we treat
the VMCS12 and the VMCB12.
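
For reference, here is a minimal userspace sketch of how the quoted
vmx_set_nested_state() hunk appears to treat 'evmcs_pa' when
KVM_STATE_NESTED_EVMCS is set. The enum and classify_evmcs_pa() are
hypothetical names used only for illustration; the three cases are read
off the diff above, not taken from any merged code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical labels for the three outcomes visible in the quoted hunk. */
enum evmcs_restore_action {
	EVMCS_RESTORE_MAP_NOW,	/* call __nested_vmx_handle_enlightened_vmptrld() */
	EVMCS_RESTORE_DEFER,	/* raise KVM_REQ_GET_NESTED_STATE_PAGES */
	EVMCS_RESTORE_NONE,	/* -1ull: nothing to map, no request raised */
};

/* Model of the decision made on kvm_state->hdr.vmx.evmcs_pa. */
enum evmcs_restore_action classify_evmcs_pa(uint64_t evmcs_pa)
{
	/* 0 is treated as "field not provided", i.e. data from an older KVM. */
	if (evmcs_pa == 0)
		return EVMCS_RESTORE_DEFER;

	/* -1ull is the initializer used by vmx_get_nested_state(). */
	if (evmcs_pa == UINT64_MAX)
		return EVMCS_RESTORE_NONE;

	/* Any other value is mapped right away; a failure returns -EINVAL. */
	return EVMCS_RESTORE_MAP_NOW;
}
```

Only the EVMCS_RESTORE_MAP_NOW case touches guest memory from within the
ioctl, which is the case the concern above is about.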

I personally tinkered with qemu to try to reproduce this situation,
and in my tests I wasn't able to make it update the memory
map after the load of the nested state but prior to KVM_RUN.
However, I wasn't able to prove that this can't happen either.

In addition to that I don't know how qemu behaves when it does 
guest ram post-copy because so far I haven't tried to tinker with it.

Finally, other userspace hypervisors exist, and they might rely on this
assumption as well.

Looking forward to any comments,
Best regards,
	Maxim Levitsky





* Re: [PATCH 1/4] KVM: nVMX: Always make an attempt to map eVMCS after migration
  2021-05-05  8:22   ` Maxim Levitsky
@ 2021-05-05  8:39     ` Vitaly Kuznetsov
  2021-05-05  9:17       ` Maxim Levitsky
  0 siblings, 1 reply; 19+ messages in thread
From: Vitaly Kuznetsov @ 2021-05-05  8:39 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, linux-kernel, kvm,
	Paolo Bonzini

Maxim Levitsky <mlevitsk@redhat.com> writes:

> On Mon, 2021-05-03 at 17:08 +0200, Vitaly Kuznetsov wrote:
>> When enlightened VMCS is in use and nested state is migrated with
>> vmx_get_nested_state()/vmx_set_nested_state() KVM can't map evmcs
>> page right away: evmcs gpa is not 'struct kvm_vmx_nested_state_hdr'
>> and we can't read it from VP assist page because userspace may decide
>> to restore HV_X64_MSR_VP_ASSIST_PAGE after restoring nested state
>> (and QEMU, for example, does exactly that). To make sure eVMCS is
>> mapped /vmx_set_nested_state() raises KVM_REQ_GET_NESTED_STATE_PAGES
>> request.
>> 
>> Commit f2c7ef3ba955 ("KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES
>> on nested vmexit") added KVM_REQ_GET_NESTED_STATE_PAGES clearing to
>> nested_vmx_vmexit() to make sure MSR permission bitmap is not switched
>> when an immediate exit from L2 to L1 happens right after migration (caused
>> by a pending event, for example). Unfortunately, in the exact same
>> situation we still need to have eVMCS mapped so
>> nested_sync_vmcs12_to_shadow() reflects changes in VMCS12 to eVMCS.
>> 
>> As a band-aid, restore nested_get_evmcs_page() when clearing
>> KVM_REQ_GET_NESTED_STATE_PAGES in nested_vmx_vmexit(). The 'fix' is far
>> from being ideal as we can't easily propagate possible failures and even if
>> we could, this is most likely already too late to do so. The whole
>> 'KVM_REQ_GET_NESTED_STATE_PAGES' idea for mapping eVMCS after migration
>> seems to be fragile as we diverge too much from the 'native' path when
>> vmptr loading happens on vmx_set_nested_state().
>> 
>> Fixes: f2c7ef3ba955 ("KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES on nested vmexit")
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> ---
>>  arch/x86/kvm/vmx/nested.c | 29 +++++++++++++++++++----------
>>  1 file changed, 19 insertions(+), 10 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index 1e069aac7410..2febb1dd68e8 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -3098,15 +3098,8 @@ static bool nested_get_evmcs_page(struct kvm_vcpu *vcpu)
>>  			nested_vmx_handle_enlightened_vmptrld(vcpu, false);
>>  
>>  		if (evmptrld_status == EVMPTRLD_VMFAIL ||
>> -		    evmptrld_status == EVMPTRLD_ERROR) {
>> -			pr_debug_ratelimited("%s: enlightened vmptrld failed\n",
>> -					     __func__);
>> -			vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
>> -			vcpu->run->internal.suberror =
>> -				KVM_INTERNAL_ERROR_EMULATION;
>> -			vcpu->run->internal.ndata = 0;
>> +		    evmptrld_status == EVMPTRLD_ERROR)
>>  			return false;
>> -		}
>>  	}
>>  
>>  	return true;
>> @@ -3194,8 +3187,16 @@ static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
>>  
>>  static bool vmx_get_nested_state_pages(struct kvm_vcpu *vcpu)
>>  {
>> -	if (!nested_get_evmcs_page(vcpu))
>> +	if (!nested_get_evmcs_page(vcpu)) {
>> +		pr_debug_ratelimited("%s: enlightened vmptrld failed\n",
>> +				     __func__);
>> +		vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
>> +		vcpu->run->internal.suberror =
>> +			KVM_INTERNAL_ERROR_EMULATION;
>> +		vcpu->run->internal.ndata = 0;
>> +
>>  		return false;
>> +	}
>
> Hi!
>
> Any reason to move the debug prints out of nested_get_evmcs_page?
>

Debug print could've probably stayed or could've been dropped
completely -- I don't really believe it's going to help
anyone. Debugging such issues without instrumentation/tracing seems to
be hard-to-impossible...

>
>>  
>>  	if (is_guest_mode(vcpu) && !nested_get_vmcs12_pages(vcpu))
>>  		return false;
>> @@ -4422,7 +4423,15 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
>>  	/* trying to cancel vmlaunch/vmresume is a bug */
>>  	WARN_ON_ONCE(vmx->nested.nested_run_pending);
>>  
>> -	kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
>> +	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
>> +		/*
>> +		 * KVM_REQ_GET_NESTED_STATE_PAGES is also used to map
>> +		 * Enlightened VMCS after migration and we still need to
>> +		 * do that when something is forcing L2->L1 exit prior to
>> +		 * the first L2 run.
>> +		 */
>> +		(void)nested_get_evmcs_page(vcpu);
>> +	}
> Yes this is a band-aid, but it has to be done I agree.
>

To restore the status quo, yes.
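
To spell out what the quoted nested_vmx_vmexit() hunk changes, here is a
small userspace model of the request handling. Everything in it
(struct vcpu_model, the model_* helpers, the request bit value) is
hypothetical; the only behaviour assumed from KVM is that
kvm_check_request() tests and clears a pending request, while
kvm_clear_request() just drops it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical request bit; the real request number is not modelled. */
#define MODEL_REQ_GET_NESTED_STATE_PAGES (1u << 0)

struct vcpu_model {
	uint32_t requests;	/* pending request bits */
	int evmcs_map_attempts;	/* counts modelled nested_get_evmcs_page() calls */
};

void model_make_request(struct vcpu_model *v, uint32_t req)
{
	v->requests |= req;
}

/* Test-and-clear, mirroring the semantics assumed for kvm_check_request(). */
bool model_check_request(struct vcpu_model *v, uint32_t req)
{
	if (!(v->requests & req))
		return false;

	v->requests &= ~req;
	return true;
}

/* Post-patch exit path: consume the request and still try to map eVMCS. */
void model_nested_vmexit(struct vcpu_model *v)
{
	if (model_check_request(v, MODEL_REQ_GET_NESTED_STATE_PAGES))
		v->evmcs_map_attempts++;
}
```

With the pre-patch kvm_clear_request() the request would be dropped without
the mapping attempt, which matches the crash on an immediate L2->L1 exit
described in the commit message.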

>>  
>>  	/* Service the TLB flush request for L2 before switching to L1. */
>>  	if (kvm_check_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu))
>
>
>
>
> I also tested this and it survives a bit better (used to crash instantly
> after a single migration cycle, but the guest still crashes after around ~20 iterations of my 
> regular nested migration test).
>
> Blues screen shows that stop code is HYPERVISOR ERROR and nothing else.
>
> I tested both this patch alone and all 4 patches.
>
> Without evmcs, the same VM with same host kernel and qemu survived an overnight
> test and passed about 1800 migration iterations.
> (my synthetic migration test doesn't yet work on Intel, I need to investigate why)
>

It would be great to compare on Intel to be 100% sure the issue is eVMCS
related; Hyper-V may behave quite differently on AMD.

> For reference this is the VM that you gave me to test, kvm/queue kernel,
> with merged mainline in it,
> and mostly latest qemu (updated about a week ago or so)
>
> qemu: 3791642c8d60029adf9b00bcb4e34d7d8a1aea4d
> kernel: 9f242010c3b46e63bc62f08fff42cef992d3801b and
>         then merge v5.12 from mainline.

Thanks for testing! I'll try to come up with a selftest for this issue,
maybe it'll help us discover others :)

-- 
Vitaly



* Re: [PATCH 1/4] KVM: nVMX: Always make an attempt to map eVMCS after migration
  2021-05-05  8:39     ` Vitaly Kuznetsov
@ 2021-05-05  9:17       ` Maxim Levitsky
  2021-05-05  9:23         ` Vitaly Kuznetsov
  0 siblings, 1 reply; 19+ messages in thread
From: Maxim Levitsky @ 2021-05-05  9:17 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, linux-kernel, kvm,
	Paolo Bonzini

On Wed, 2021-05-05 at 10:39 +0200, Vitaly Kuznetsov wrote:
> Maxim Levitsky <mlevitsk@redhat.com> writes:
> 
> > On Mon, 2021-05-03 at 17:08 +0200, Vitaly Kuznetsov wrote:
> > > When enlightened VMCS is in use and nested state is migrated with
> > > vmx_get_nested_state()/vmx_set_nested_state() KVM can't map evmcs
> > > page right away: evmcs gpa is not 'struct kvm_vmx_nested_state_hdr'
> > > and we can't read it from VP assist page because userspace may decide
> > > to restore HV_X64_MSR_VP_ASSIST_PAGE after restoring nested state
> > > (and QEMU, for example, does exactly that). To make sure eVMCS is
> > > mapped /vmx_set_nested_state() raises KVM_REQ_GET_NESTED_STATE_PAGES
> > > request.
> > > 
> > > Commit f2c7ef3ba955 ("KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES
> > > on nested vmexit") added KVM_REQ_GET_NESTED_STATE_PAGES clearing to
> > > nested_vmx_vmexit() to make sure MSR permission bitmap is not switched
> > > when an immediate exit from L2 to L1 happens right after migration (caused
> > > by a pending event, for example). Unfortunately, in the exact same
> > > situation we still need to have eVMCS mapped so
> > > nested_sync_vmcs12_to_shadow() reflects changes in VMCS12 to eVMCS.
> > > 
> > > As a band-aid, restore nested_get_evmcs_page() when clearing
> > > KVM_REQ_GET_NESTED_STATE_PAGES in nested_vmx_vmexit(). The 'fix' is far
> > > from being ideal as we can't easily propagate possible failures and even if
> > > we could, this is most likely already too late to do so. The whole
> > > 'KVM_REQ_GET_NESTED_STATE_PAGES' idea for mapping eVMCS after migration
> > > seems to be fragile as we diverge too much from the 'native' path when
> > > vmptr loading happens on vmx_set_nested_state().
> > > 
> > > Fixes: f2c7ef3ba955 ("KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES on nested vmexit")
> > > Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> > > ---
> > >  arch/x86/kvm/vmx/nested.c | 29 +++++++++++++++++++----------
> > >  1 file changed, 19 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > > index 1e069aac7410..2febb1dd68e8 100644
> > > --- a/arch/x86/kvm/vmx/nested.c
> > > +++ b/arch/x86/kvm/vmx/nested.c
> > > @@ -3098,15 +3098,8 @@ static bool nested_get_evmcs_page(struct kvm_vcpu *vcpu)
> > >  			nested_vmx_handle_enlightened_vmptrld(vcpu, false);
> > >  
> > >  		if (evmptrld_status == EVMPTRLD_VMFAIL ||
> > > -		    evmptrld_status == EVMPTRLD_ERROR) {
> > > -			pr_debug_ratelimited("%s: enlightened vmptrld failed\n",
> > > -					     __func__);
> > > -			vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
> > > -			vcpu->run->internal.suberror =
> > > -				KVM_INTERNAL_ERROR_EMULATION;
> > > -			vcpu->run->internal.ndata = 0;
> > > +		    evmptrld_status == EVMPTRLD_ERROR)
> > >  			return false;
> > > -		}
> > >  	}
> > >  
> > >  	return true;
> > > @@ -3194,8 +3187,16 @@ static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
> > >  
> > >  static bool vmx_get_nested_state_pages(struct kvm_vcpu *vcpu)
> > >  {
> > > -	if (!nested_get_evmcs_page(vcpu))
> > > +	if (!nested_get_evmcs_page(vcpu)) {
> > > +		pr_debug_ratelimited("%s: enlightened vmptrld failed\n",
> > > +				     __func__);
> > > +		vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
> > > +		vcpu->run->internal.suberror =
> > > +			KVM_INTERNAL_ERROR_EMULATION;
> > > +		vcpu->run->internal.ndata = 0;
> > > +
> > >  		return false;
> > > +	}
> > 
> > Hi!
> > 
> > Any reason to move the debug prints out of nested_get_evmcs_page?
> > 
> 
> Debug print could've probably stayed or could've been dropped
> completely -- I don't really believe it's going to help
> anyone. Debugging such issues without instrumentation/tracing seems to
> be hard-to-impossible...
> 
> > >  
> > >  	if (is_guest_mode(vcpu) && !nested_get_vmcs12_pages(vcpu))
> > >  		return false;
> > > @@ -4422,7 +4423,15 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
> > >  	/* trying to cancel vmlaunch/vmresume is a bug */
> > >  	WARN_ON_ONCE(vmx->nested.nested_run_pending);
> > >  
> > > -	kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> > > +	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> > > +		/*
> > > +		 * KVM_REQ_GET_NESTED_STATE_PAGES is also used to map
> > > +		 * Enlightened VMCS after migration and we still need to
> > > +		 * do that when something is forcing L2->L1 exit prior to
> > > +		 * the first L2 run.
> > > +		 */
> > > +		(void)nested_get_evmcs_page(vcpu);
> > > +	}
> > Yes this is a band-aid, but it has to be done I agree.
> > 
> 
> To restore the status quo, yes.
> 
> > >  
> > >  	/* Service the TLB flush request for L2 before switching to L1. */
> > >  	if (kvm_check_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu))
> > 
> > 
> > 
> > I also tested this and it survives a bit better (used to crash instantly
> > after a single migration cycle, but the guest still crashes after around ~20 iterations of my 
> > regular nested migration test).
> > 
> > Blues screen shows that stop code is HYPERVISOR ERROR and nothing else.
> > 
> > I tested both this patch alone and all 4 patches.
> > 
> > Without evmcs, the same VM with same host kernel and qemu survived an overnight
> > test and passed about 1800 migration iterations.
> > (my synthetic migration test doesn't yet work on Intel, I need to investigate why)
> > 
> 
> It would be great to compare on Intel to be 100% sure the issue is eVMCS
> related, Hyper-V may be behaving quite differently on AMD.
Hi!

I tested this on my Intel machine with and without eVMCS, without changing
any other parameters, running the same VM from a snapshot.

As I said, without eVMCS the test survived an overnight stress run of ~1800 migrations.
With eVMCS, it fails pretty much on the first try.
With those patches, it fails after about 20 iterations.

Best regards,
	Maxim Levitsky

> 
> > For reference this is the VM that you gave me to test, kvm/queue kernel,
> > with merged mainline in it,
> > and mostly latest qemu (updated about a week ago or so)
> > 
> > qemu: 3791642c8d60029adf9b00bcb4e34d7d8a1aea4d
> > kernel: 9f242010c3b46e63bc62f08fff42cef992d3801b and
> >         then merge v5.12 from mainline.
> 
> Thanks for testing! I'll try to come up with a selftest for this issue,
> maybe it'll help us discovering others)
> 




* Re: [PATCH 4/4] KVM: nVMX: Map enlightened VMCS upon restore when possible
  2021-05-05  8:33   ` Maxim Levitsky
@ 2021-05-05  9:17     ` Vitaly Kuznetsov
  0 siblings, 0 replies; 19+ messages in thread
From: Vitaly Kuznetsov @ 2021-05-05  9:17 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, linux-kernel, kvm,
	Paolo Bonzini

Maxim Levitsky <mlevitsk@redhat.com> writes:

> On Mon, 2021-05-03 at 17:08 +0200, Vitaly Kuznetsov wrote:
>> It now looks like a bad idea to not restore eVMCS mapping directly from
>> vmx_set_nested_state(). The restoration path now depends on whether KVM
>> will continue executing L2 (vmx_get_nested_state_pages()) or will have to
>> exit to L1 (nested_vmx_vmexit()), this complicates error propagation and
>> diverges too much from the 'native' path when 'nested.current_vmptr' is
>> set directly from vmx_get_nested_state_pages().
>> 
>> The existing solution postponing eVMCS mapping also seems to be fragile.
>> In multiple places the code checks whether 'vmx->nested.hv_evmcs' is not
>> NULL to distinguish between eVMCS and non-eVMCS cases. All these checks
>> are 'incomplete' as we have a weird 'eVMCS is in use but not yet mapped'
>> state.
>> 
>> Also, in case vmx_get_nested_state() is called right after
>> vmx_set_nested_state() without executing the guest first, the resulting
>> state is going to be incorrect as 'KVM_STATE_NESTED_EVMCS' flag will be
>> missing.
>> 
>> Fix all these issues by making eVMCS restoration path closer to its
>> 'native' sibling by putting eVMCS GPA to 'struct kvm_vmx_nested_state_hdr'.
>> To avoid ABI incompatibility, do not introduce a new flag and keep the
>> original eVMCS mapping path through KVM_REQ_GET_NESTED_STATE_PAGES in
>> place. To distinguish between 'new' and 'old' formats consider eVMCS
>> GPA == 0 as an unset GPA (thus forcing KVM_REQ_GET_NESTED_STATE_PAGES
>> path). While technically possible, it seems to be an extremely unlikely
>> case.
>> 
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> ---
>>  arch/x86/include/uapi/asm/kvm.h |  2 ++
>>  arch/x86/kvm/vmx/nested.c       | 27 +++++++++++++++++++++------
>>  2 files changed, 23 insertions(+), 6 deletions(-)
>> 
>> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
>> index 0662f644aad9..3845977b739e 100644
>> --- a/arch/x86/include/uapi/asm/kvm.h
>> +++ b/arch/x86/include/uapi/asm/kvm.h
>> @@ -441,6 +441,8 @@ struct kvm_vmx_nested_state_hdr {
>>  
>>  	__u32 flags;
>>  	__u64 preemption_timer_deadline;
>> +
>> +	__u64 evmcs_pa;
>>  };
>>  
>>  struct kvm_svm_nested_state_data {
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index 37fdc34f7afc..4261cf4755c8 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -6019,6 +6019,7 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu,
>>  		.hdr.vmx.vmxon_pa = -1ull,
>>  		.hdr.vmx.vmcs12_pa = -1ull,
>>  		.hdr.vmx.preemption_timer_deadline = 0,
>> +		.hdr.vmx.evmcs_pa = -1ull,
>>  	};
>>  	struct kvm_vmx_nested_state_data __user *user_vmx_nested_state =
>>  		&user_kvm_nested_state->data.vmx[0];
>> @@ -6037,8 +6038,10 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu,
>>  		if (vmx_has_valid_vmcs12(vcpu)) {
>>  			kvm_state.size += sizeof(user_vmx_nested_state->vmcs12);
>>  
>> -			if (vmx->nested.hv_evmcs)
>> +			if (vmx->nested.hv_evmcs) {
>>  				kvm_state.flags |= KVM_STATE_NESTED_EVMCS;
>> +				kvm_state.hdr.vmx.evmcs_pa = vmx->nested.hv_evmcs_vmptr;
>> +			}
>>  
>>  			if (is_guest_mode(vcpu) &&
>>  			    nested_cpu_has_shadow_vmcs(vmcs12) &&
>> @@ -6230,13 +6233,25 @@ static int vmx_set_nested_state(struct kvm_vcpu *vcpu,
>>  
>>  		set_current_vmptr(vmx, kvm_state->hdr.vmx.vmcs12_pa);
>>  	} else if (kvm_state->flags & KVM_STATE_NESTED_EVMCS) {
>> +		u64 evmcs_gpa = kvm_state->hdr.vmx.evmcs_pa;
>> +
>>  		/*
>> -		 * nested_vmx_handle_enlightened_vmptrld() cannot be called
>> -		 * directly from here as HV_X64_MSR_VP_ASSIST_PAGE may not be
>> -		 * restored yet. EVMCS will be mapped from
>> -		 * nested_get_vmcs12_pages().
>> +		 * EVMCS GPA == 0 most likely indicates that the migration data is
>> +		 * coming from an older KVM which doesn't support 'evmcs_pa' in
>> +		 * 'struct kvm_vmx_nested_state_hdr'.
>>  		 */
>> -		kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
>> +		if (evmcs_gpa && (evmcs_gpa != -1ull) &&
>> +		    (__nested_vmx_handle_enlightened_vmptrld(vcpu, evmcs_gpa, false) !=
>> +		     EVMPTRLD_SUCCEEDED)) {
>> +			return -EINVAL;
>> +		} else if (!evmcs_gpa) {
>> +			/*
>> +			 * EVMCS GPA can't be acquired from VP assist page here because
>> +			 * HV_X64_MSR_VP_ASSIST_PAGE may not be restored yet.
>> +			 * EVMCS will be mapped from nested_get_evmcs_page().
>> +			 */
>> +			kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
>> +		}
>>  	} else {
>>  		return -EINVAL;
>>  	}
>
> Hi everyone!
>
> Let me expalin my concern about this patch and also ask if I understand this correctly.
>
> In a nutshell if I understand this correctly, we are not allowed to access any guest
> memory while setting the nested state. 
>
> Now, if I understand correctly as well, the reason for the above,
> is that the userspace is allowed to set the nested state first, then fiddle with
> the KVM memslots, maybe even update the guest memory and only later do the KVM_RUN ioctl,

Currently, userspace is free to restore the guest in any order
indeed. I've probably missed post-copy but even the fact that guest MSRs
can be restored after restoring nested state doesn't make our life easier.

>
> And so this is the major reason why the KVM_REQ_GET_NESTED_STATE_PAGES
> request exists in the first place.
>
> If that is correct I assume that we either have to keep loading the EVMCS page on
> KVM_REQ_GET_NESTED_STATE_PAGES request, or we want to include the EVMCS itself
> in the migration state in addition to its physical address, similar to how we treat
> the VMCS12 and the VMCB12.

Keeping eVMCS load from KVM_REQ_GET_NESTED_STATE_PAGES is OK I believe
(or at least I still don't see a reason for us to carry a copy in the
migration data). What I still don't like is the transient state after
vmx_set_nested_state(): 
- vmx->nested.current_vmptr is -1ull because no 'real' vmptrld was done
(we skip set_current_vmptr() when KVM_STATE_NESTED_EVMCS)
- vmx->nested.hv_evmcs/vmx->nested.hv_evmcs_vmptr are also NULL because
we haven't performed nested_vmx_handle_enlightened_vmptrld() yet.

I know of at least one real problem with this state: in case
vmx_get_nested_state() happens before KVM_RUN the resulting state won't
have KVM_STATE_NESTED_EVMCS flag and this is incorrect. Take a look at
the check in nested_vmx_fail() for example:

        if (vmx->nested.current_vmptr == -1ull && !vmx->nested.hv_evmcs)
                return nested_vmx_failInvalid(vcpu);

this also seems off (I'm not sure it matters in any context but still).
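
A simplified userspace model of that transient state, for illustration
only. struct nested_model and the model_* helpers are hypothetical, the
flag value is an assumption that should be checked against the uapi header,
and the vmx_has_valid_vmcs12() check is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to mirror KVM_STATE_NESTED_EVMCS; check the uapi header. */
#define MODEL_STATE_NESTED_EVMCS 0x4u

struct nested_model {
	uint64_t current_vmptr;	/* -1ull when no 'real' vmptrld was done */
	void *hv_evmcs;		/* NULL until the eVMCS page is mapped */
	uint64_t hv_evmcs_vmptr;
	bool get_pages_pending;	/* models KVM_REQ_GET_NESTED_STATE_PAGES */
};

/* Models the pre-patch restore: eVMCS mapping is deferred to KVM_RUN. */
void model_set_nested_state(struct nested_model *m, uint32_t flags)
{
	m->current_vmptr = UINT64_MAX;
	m->hv_evmcs = NULL;
	m->hv_evmcs_vmptr = 0;
	m->get_pages_pending = (flags & MODEL_STATE_NESTED_EVMCS) != 0;
}

/* Models the deferred mapping done when the request is serviced. */
void model_map_evmcs(struct nested_model *m, void *page, uint64_t gpa)
{
	m->hv_evmcs = page;
	m->hv_evmcs_vmptr = gpa;
	m->get_pages_pending = false;
}

/* Models the flag computation: it keys off hv_evmcs being mapped. */
uint32_t model_get_nested_state_flags(const struct nested_model *m)
{
	return m->hv_evmcs ? MODEL_STATE_NESTED_EVMCS : 0;
}
```

Calling model_get_nested_state_flags() between model_set_nested_state() and
model_map_evmcs() returns 0, i.e. the eVMCS flag is lost on a save that
happens before the first KVM_RUN.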

>
> I personally tinkered with qemu to try and reproduce this situation
> and in my tests I wasn't able to make it update the memory
> map after the load of the nested state but prior to KVM_RUN
> but neither was I able to prove that this can't happen.

Userspace has multiple ways to mess with the state of course, in KVM we
only need to make sure we don't crash :-) On migration, well-behaved
userspace is supposed to restore exactly what it got though. The
restoration sequence may vary.

>
> In addition to that I don't know how qemu behaves when it does 
> guest ram post-copy because so far I haven't tried to tinker with it.
>
> Finally other userspace hypervisors exist, and they might rely on this assumption
> as well.
>
> Looking forward to any comments,
> Best regards,
> 	Maxim Levitsky
>
>
>

-- 
Vitaly


* Re: [PATCH 1/4] KVM: nVMX: Always make an attempt to map eVMCS after migration
  2021-05-05  9:17       ` Maxim Levitsky
@ 2021-05-05  9:23         ` Vitaly Kuznetsov
  0 siblings, 0 replies; 19+ messages in thread
From: Vitaly Kuznetsov @ 2021-05-05  9:23 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Sean Christopherson, Wanpeng Li, Jim Mattson, linux-kernel, kvm,
	Paolo Bonzini

Maxim Levitsky <mlevitsk@redhat.com> writes:

> On Wed, 2021-05-05 at 10:39 +0200, Vitaly Kuznetsov wrote:
>> Maxim Levitsky <mlevitsk@redhat.com> writes:
>> 
>> > On Mon, 2021-05-03 at 17:08 +0200, Vitaly Kuznetsov wrote:
>> > > When enlightened VMCS is in use and nested state is migrated with
>> > > vmx_get_nested_state()/vmx_set_nested_state() KVM can't map evmcs
>> > > page right away: evmcs gpa is not in 'struct kvm_vmx_nested_state_hdr'
>> > > and we can't read it from VP assist page because userspace may decide
>> > > to restore HV_X64_MSR_VP_ASSIST_PAGE after restoring nested state
>> > > (and QEMU, for example, does exactly that). To make sure eVMCS is
>> > > mapped, vmx_set_nested_state() raises KVM_REQ_GET_NESTED_STATE_PAGES
>> > > request.
>> > > 
>> > > Commit f2c7ef3ba955 ("KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES
>> > > on nested vmexit") added KVM_REQ_GET_NESTED_STATE_PAGES clearing to
>> > > nested_vmx_vmexit() to make sure MSR permission bitmap is not switched
>> > > when an immediate exit from L2 to L1 happens right after migration (caused
>> > > by a pending event, for example). Unfortunately, in the exact same
>> > > situation we still need to have eVMCS mapped so
>> > > nested_sync_vmcs12_to_shadow() reflects changes in VMCS12 to eVMCS.
>> > > 
>> > > As a band-aid, restore nested_get_evmcs_page() when clearing
>> > > KVM_REQ_GET_NESTED_STATE_PAGES in nested_vmx_vmexit(). The 'fix' is far
>> > > from being ideal as we can't easily propagate possible failures and even if
>> > > we could, this is most likely already too late to do so. The whole
>> > > 'KVM_REQ_GET_NESTED_STATE_PAGES' idea for mapping eVMCS after migration
>> > > seems to be fragile as we diverge too much from the 'native' path when
>> > > vmptr loading happens on vmx_set_nested_state().
>> > > 
>> > > Fixes: f2c7ef3ba955 ("KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES on nested vmexit")
>> > > Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> > > ---
>> > >  arch/x86/kvm/vmx/nested.c | 29 +++++++++++++++++++----------
>> > >  1 file changed, 19 insertions(+), 10 deletions(-)
>> > > 
>> > > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> > > index 1e069aac7410..2febb1dd68e8 100644
>> > > --- a/arch/x86/kvm/vmx/nested.c
>> > > +++ b/arch/x86/kvm/vmx/nested.c
>> > > @@ -3098,15 +3098,8 @@ static bool nested_get_evmcs_page(struct kvm_vcpu *vcpu)
>> > >  			nested_vmx_handle_enlightened_vmptrld(vcpu, false);
>> > >  
>> > >  		if (evmptrld_status == EVMPTRLD_VMFAIL ||
>> > > -		    evmptrld_status == EVMPTRLD_ERROR) {
>> > > -			pr_debug_ratelimited("%s: enlightened vmptrld failed\n",
>> > > -					     __func__);
>> > > -			vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
>> > > -			vcpu->run->internal.suberror =
>> > > -				KVM_INTERNAL_ERROR_EMULATION;
>> > > -			vcpu->run->internal.ndata = 0;
>> > > +		    evmptrld_status == EVMPTRLD_ERROR)
>> > >  			return false;
>> > > -		}
>> > >  	}
>> > >  
>> > >  	return true;
>> > > @@ -3194,8 +3187,16 @@ static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
>> > >  
>> > >  static bool vmx_get_nested_state_pages(struct kvm_vcpu *vcpu)
>> > >  {
>> > > -	if (!nested_get_evmcs_page(vcpu))
>> > > +	if (!nested_get_evmcs_page(vcpu)) {
>> > > +		pr_debug_ratelimited("%s: enlightened vmptrld failed\n",
>> > > +				     __func__);
>> > > +		vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
>> > > +		vcpu->run->internal.suberror =
>> > > +			KVM_INTERNAL_ERROR_EMULATION;
>> > > +		vcpu->run->internal.ndata = 0;
>> > > +
>> > >  		return false;
>> > > +	}
>> > 
>> > Hi!
>> > 
>> > Any reason to move the debug prints out of nested_get_evmcs_page?
>> > 
>> 
>> Debug print could've probably stayed or could've been dropped
>> completely -- I don't really believe it's going to help
>> anyone. Debugging such issues without instrumentation/tracing seems to
>> be hard-to-impossible...
>> 
>> > >  
>> > >  	if (is_guest_mode(vcpu) && !nested_get_vmcs12_pages(vcpu))
>> > >  		return false;
>> > > @@ -4422,7 +4423,15 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
>> > >  	/* trying to cancel vmlaunch/vmresume is a bug */
>> > >  	WARN_ON_ONCE(vmx->nested.nested_run_pending);
>> > >  
>> > > -	kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
>> > > +	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
>> > > +		/*
>> > > +		 * KVM_REQ_GET_NESTED_STATE_PAGES is also used to map
>> > > +		 * Enlightened VMCS after migration and we still need to
>> > > +		 * do that when something is forcing L2->L1 exit prior to
>> > > +		 * the first L2 run.
>> > > +		 */
>> > > +		(void)nested_get_evmcs_page(vcpu);
>> > > +	}
>> > Yes this is a band-aid, but it has to be done I agree.
>> > 
>> 
>> To restore the status quo, yes.
>> 
>> > >  
>> > >  	/* Service the TLB flush request for L2 before switching to L1. */
>> > >  	if (kvm_check_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu))
>> > 
>> > 
>> > 
>> > I also tested this and it survives a bit better (used to crash instantly
>> > after a single migration cycle, but the guest still crashes after about 20 iterations of my
>> > regular nested migration test).
>> > 
>> > The blue screen shows that the stop code is HYPERVISOR ERROR and nothing else.
>> > 
>> > I tested both this patch alone and all 4 patches.
>> > 
>> > Without evmcs, the same VM with same host kernel and qemu survived an overnight
>> > test and passed about 1800 migration iterations.
>> > (my synthetic migration test doesn't yet work on Intel, I need to investigate why)
>> > 
>> 
>> It would be great to compare on Intel to be 100% sure the issue is eVMCS
>> related, Hyper-V may be behaving quite differently on AMD.
> Hi!
>
> I tested this on my Intel machine with and without eVMCS, without changing
> any other parameters, running the same VM from a snapshot.
>
> As I said without eVMCS the test survived overnight stress of ~1800 migrations.
> With eVMCS, it fails pretty much on the first try.
> With those patches, it fails after about 20 iterations.
>

Ah, sorry, misunderstood your 'synthetic migration test doesn't yet work
on Intel' :-) 
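
For reference, the band-aid in the quoted nested_vmx_vmexit() hunk can be
modeled as below. This is a simplified sketch, not the kernel code: the
`toy_*` names are hypothetical, the `band_aid` switch exists only to compare
the two behaviours, and `assert()` stands in for WARN_ON_ONCE().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for KVM_REQ_GET_NESTED_STATE_PAGES. */
#define TOY_REQ_GET_NESTED_STATE_PAGES (1u << 0)

struct toy_vcpu {
	uint32_t requests;
	bool run_pending;
	bool evmcs_mapped;
};

/* Test-and-clear, in the spirit of kvm_check_request(). */
static bool toy_check_request(struct toy_vcpu *v, uint32_t req)
{
	if (!(v->requests & req))
		return false;
	v->requests &= ~req;
	return true;
}

/* Mapping always succeeds in this model. */
static bool toy_get_evmcs_page(struct toy_vcpu *v)
{
	v->evmcs_mapped = true;
	return true;
}

/* L2->L1 exit; with band_aid == false the request is only cancelled. */
static void toy_nested_vmexit(struct toy_vcpu *v, bool band_aid)
{
	assert(!v->run_pending); /* cancelling vmlaunch/vmresume is a bug */

	if (toy_check_request(v, TOY_REQ_GET_NESTED_STATE_PAGES)) {
		/* The result is ignored: it is too late to report a failure. */
		if (band_aid)
			(void)toy_get_evmcs_page(v);
	}
}

/* Syncing VMCS12 back to the eVMCS needs the page to be mapped. */
static bool toy_can_sync_vmcs12_to_evmcs(const struct toy_vcpu *v)
{
	return v->evmcs_mapped;
}
```

With `band_aid` false, an exit that happens before the first L2 run leaves the
request cleared and the eVMCS unmapped, so the later sync has nothing to write
to. With `band_aid` true, the mapping is attempted on the exit path.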

-- 
Vitaly


* Re: [PATCH 2/4] KVM: nVMX: Properly pad 'struct kvm_vmx_nested_state_hdr'
  2021-05-05  8:24   ` Maxim Levitsky
@ 2021-05-05 17:34     ` Sean Christopherson
  0 siblings, 0 replies; 19+ messages in thread
From: Sean Christopherson @ 2021-05-05 17:34 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Vitaly Kuznetsov, kvm, Paolo Bonzini, Wanpeng Li, Jim Mattson,
	linux-kernel

On Wed, May 05, 2021, Maxim Levitsky wrote:
> On Mon, 2021-05-03 at 17:08 +0200, Vitaly Kuznetsov wrote:
> > Eliminate the probably unwanted hole in 'struct kvm_vmx_nested_state_hdr':
> > 
> > Pre-patch:
> > struct kvm_vmx_nested_state_hdr {
> >         __u64                      vmxon_pa;             /*     0     8 */
> >         __u64                      vmcs12_pa;            /*     8     8 */
> >         struct {
> >                 __u16              flags;                /*    16     2 */
> >         } smm;                                           /*    16     2 */
> > 
> >         /* XXX 2 bytes hole, try to pack */
> > 
> >         __u32                      flags;                /*    20     4 */
> >         __u64                      preemption_timer_deadline; /*    24     8 */
> > };
> > 
> > Post-patch:
> > struct kvm_vmx_nested_state_hdr {
> >         __u64                      vmxon_pa;             /*     0     8 */
> >         __u64                      vmcs12_pa;            /*     8     8 */
> >         struct {
> >                 __u16              flags;                /*    16     2 */
> >         } smm;                                           /*    16     2 */
> >         __u16                      pad;                  /*    18     2 */
> >         __u32                      flags;                /*    20     4 */
> >         __u64                      preemption_timer_deadline; /*    24     8 */
> > };
> > 
> > Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> > ---
> >  arch/x86/include/uapi/asm/kvm.h | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> > index 5a3022c8af82..0662f644aad9 100644
> > --- a/arch/x86/include/uapi/asm/kvm.h
> > +++ b/arch/x86/include/uapi/asm/kvm.h
> > @@ -437,6 +437,8 @@ struct kvm_vmx_nested_state_hdr {
> >  		__u16 flags;
> >  	} smm;
> >  
> > +	__u16 pad;
> > +
> >  	__u32 flags;
> >  	__u64 preemption_timer_deadline;
> >  };
> 
> 
> Looks good to me.
> 
> I wonder if we can enable the -Wpadded GCC warning to warn about such cases.
> Probably can't be enabled for the whole kernel but maybe we can enable it
> for KVM codebase at least, like we did with -Werror.

It'll never work, there are far, far too many structs throughout the kernel and
KVM that have implicit padding.  And for kernel-internal structs, that's perfectly
ok and even desirable since the kernel generally shouldn't make assumptions about
the layouts of its structs, i.e. it's a good thing the compiler pads structs so
that accesses are optimally aligned.

The padding behavior is only problematic for structs that are exposed to
userspace, because if userspace pads differently then we've got problems.  But
even then, building the kernel with -Wpadded wouldn't prevent userspace from
using a broken/goofy compiler that inserts unusual padding and misinterprets the
intended layout.

AFAIK, the C standard only explicitly disallows padding at the beginning of a
struct, i.e. the kernel's ABI is heavily reliant on existing compiler convention.
The only way to ensure exact layouts without relying on compiler convention would
be to tag structs as packed, but "packed" also causes the compiler to generate
sub-optimal code since "packed" has strict requirements, and so the kernel relies
on sane compiler padding to provide a stable ABI.
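
As an illustration of the point above, the following sketch mirrors the two
header layouts from the patch using hypothetical `toy_*` struct names and
fixed-width types from <stdint.h>. It assumes a common ABI in which a
`uint32_t` is 4-byte aligned. The static assertions pin the layout of the
explicitly padded struct at compile time, which is one way to catch an
accidental layout change without relying on "packed".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pre-patch shape: the compiler inserts a 2-byte hole after 'smm'. */
struct toy_hdr_implicit {
	uint64_t vmxon_pa;
	uint64_t vmcs12_pa;
	struct {
		uint16_t flags;
	} smm;
	uint32_t flags;
	uint64_t preemption_timer_deadline;
};

/* Post-patch shape: the hole is named, so the layout is spelled out. */
struct toy_hdr_explicit {
	uint64_t vmxon_pa;
	uint64_t vmcs12_pa;
	struct {
		uint16_t flags;
	} smm;
	uint16_t pad;
	uint32_t flags;
	uint64_t preemption_timer_deadline;
};

/* Compile-time checks (C11) on the explicit layout. */
static_assert(offsetof(struct toy_hdr_explicit, pad) == 18,
	      "pad must fill the former hole");
static_assert(offsetof(struct toy_hdr_explicit, flags) == 20,
	      "flags must stay at offset 20");
static_assert(sizeof(struct toy_hdr_explicit) == 32,
	      "header size must not change");

/* Runtime accessors so the two layouts can be compared. */
static size_t toy_implicit_flags_offset(void)
{
	return offsetof(struct toy_hdr_implicit, flags);
}

static size_t toy_explicit_flags_offset(void)
{
	return offsetof(struct toy_hdr_explicit, flags);
}
```

On such an ABI both structs have identical offsets and size, so naming the pad
does not change the binary layout. Building with GCC's -Wpadded should warn
about the first struct but not the second.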

end of thread, other threads:[~2021-05-05 18:00 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
2021-05-03 15:08 [PATCH 0/4] KVM: nVMX: Fix migration of nested guests when eVMCS is in use Vitaly Kuznetsov
2021-05-03 15:08 ` [PATCH 1/4] KVM: nVMX: Always make an attempt to map eVMCS after migration Vitaly Kuznetsov
2021-05-05  8:22   ` Maxim Levitsky
2021-05-05  8:39     ` Vitaly Kuznetsov
2021-05-05  9:17       ` Maxim Levitsky
2021-05-05  9:23         ` Vitaly Kuznetsov
2021-05-03 15:08 ` [PATCH 2/4] KVM: nVMX: Properly pad 'struct kvm_vmx_nested_state_hdr' Vitaly Kuznetsov
2021-05-05  8:24   ` Maxim Levitsky
2021-05-05 17:34     ` Sean Christopherson
2021-05-03 15:08 ` [PATCH 3/4] KVM: nVMX: Introduce __nested_vmx_handle_enlightened_vmptrld() Vitaly Kuznetsov
2021-05-05  8:24   ` Maxim Levitsky
2021-05-03 15:08 ` [PATCH 4/4] KVM: nVMX: Map enlightened VMCS upon restore when possible Vitaly Kuznetsov
2021-05-03 15:53   ` Paolo Bonzini
2021-05-04  8:02     ` Vitaly Kuznetsov
2021-05-04  8:06       ` Paolo Bonzini
2021-05-05  8:33   ` Maxim Levitsky
2021-05-05  9:17     ` Vitaly Kuznetsov
2021-05-03 15:43 ` [PATCH 0/4] KVM: nVMX: Fix migration of nested guests when eVMCS is in use Paolo Bonzini
2021-05-03 15:52   ` Vitaly Kuznetsov
