[QEMU PATCH v3 0/9]: KVM: i386: Add support for save and restore of nested state

kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [QEMU PATCH v3 0/9]: KVM: i386: Add support for save and restore of nested state
@ 2019-06-17 17:56 Liran Alon
  2019-06-17 17:56 ` [QEMU PATCH v3 1/9] KVM: Introduce kvm_arch_destroy_vcpu() Liran Alon
                   ` (8 more replies)
  0 siblings, 9 replies; 27+ messages in thread
From: Liran Alon @ 2019-06-17 17:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, mtosatti, rth, ehabkost, kvm, jmattson, maran.wilson, dgilbert

Hi,

This series aims to add support for QEMU to be able to migrate VMs that
are running nested hypervisors. In order to do so, it utilizes the new
IOCTLs introduced in KVM commit 8fcc4b5923af ("kvm: nVMX: Introduce
KVM_CAP_NESTED_STATE") which was created for this purpose.

1st patch introduce kvm_arch_destroy_vcpu() to perform per-vCPU
destruction logic that is arch-dependent.

2st patch is just refactoring to use symbolic constants instead of hard-coded
numbers.

3st patch fixes QEMU to update DR6 when QEMU re-inject #DB to guest after
it was intercepted by KVM when guest is debugged.

4th patch adds migration blocker for vCPU exposed with either Intel VMX
or AMD SVM. Until now it was blocked only for Intel VMX.

5rd patch updates linux-headers to have updated struct kvm_nested_state.
The updated struct now have explicit fields for the data portion.

6rd patch add vmstate support for saving/restoring kernel integer types (e.g. __u16).

7th patch adds support for saving and restoring nested state in order to migrate
guests which run a nested hypervisor.

8th patch add support for KVM_CAP_EXCEPTION_PAYLOAD. This new KVM capability
allows userspace to properly distingiush between pending and injecting exceptions.

9th patch reverts a past commit which have added a migration blocker when guest
is exposed with VMX. Remaining with only a migration blocker for vCPU
exposed with AMD SVM.

Regards,
-Liran

v1->v2 changes:
* Add patch to fix bug when re-inject #DB to guest.
* Add support for KVM_CAP_EXCEPTION_PAYLOAD.
* Use explicit fields for struct kvm_nested_state data portion.
* Use vmstate subsections to save/restore nested state in order to properly
* support forward & backwards migration compatability.
* Remove VMX migration blocker.

v2->v3 changes:
* Add kvm_arch_destroy_vcpu().
* Use DR6_BS where appropriate.
* Add cpu_pre_save() logic to convert pending exception to injected
  exception if guest is running L2.
* Converted max_nested_state_len to int instead of uint32_t.
* Use kvm_arch_destroy_vcpu() to free nested_state.
* Add migration blocker for vCPU exposed with AMD SVM.
* Don't rely on CR4 or MSR_EFER to know if it is required to
migrate new VMState subsections.
* Signal if vCPU is in guest-mode in hflags as original intention by Paolo.

Reference for discussion on v2:
https://patchwork.kernel.org/patch/10601689/

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [QEMU PATCH v3 1/9] KVM: Introduce kvm_arch_destroy_vcpu()
  2019-06-17 17:56 [QEMU PATCH v3 0/9]: KVM: i386: Add support for save and restore of nested state Liran Alon
@ 2019-06-17 17:56 ` Liran Alon
  2019-06-18 22:15   ` [Qemu-devel] " Maran Wilson
  2019-06-17 17:56 ` [QEMU PATCH v3 2/9] KVM: i386: Use symbolic constant for #DB/#BP exception constants Liran Alon
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Liran Alon @ 2019-06-17 17:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, mtosatti, rth, ehabkost, kvm, jmattson, maran.wilson,
	dgilbert, Liran Alon

Simiar to how kvm_init_vcpu() calls kvm_arch_init_vcpu() to perform
arch-dependent initialisation, introduce kvm_arch_destroy_vcpu()
to be called from kvm_destroy_vcpu() to perform arch-dependent
destruction.

This was added because some architectures (Such as i386)
currently do not free memory that it have allocated in
kvm_arch_init_vcpu().

Suggested-by: Maran Wilson <maran.wilson@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
---
 accel/kvm/kvm-all.c  |  5 +++++
 include/sysemu/kvm.h |  1 +
 target/arm/kvm32.c   |  5 +++++
 target/arm/kvm64.c   |  5 +++++
 target/i386/kvm.c    | 12 ++++++++++++
 target/mips/kvm.c    |  5 +++++
 target/ppc/kvm.c     |  5 +++++
 target/s390x/kvm.c   | 10 ++++++++++
 8 files changed, 48 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 524c4ddfbd0f..59a3aa3a40da 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -292,6 +292,11 @@ int kvm_destroy_vcpu(CPUState *cpu)
 
     DPRINTF("kvm_destroy_vcpu\n");
 
+    ret = kvm_arch_destroy_vcpu(cpu);
+    if (ret < 0) {
+        goto err;
+    }
+
     mmap_size = kvm_ioctl(s, KVM_GET_VCPU_MMAP_SIZE, 0);
     if (mmap_size < 0) {
         ret = mmap_size;
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index a6d1cd190fed..64f55e519df7 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -371,6 +371,7 @@ int kvm_arch_put_registers(CPUState *cpu, int level);
 int kvm_arch_init(MachineState *ms, KVMState *s);
 
 int kvm_arch_init_vcpu(CPUState *cpu);
+int kvm_arch_destroy_vcpu(CPUState *cpu);
 
 bool kvm_vcpu_id_is_valid(int vcpu_id);
 
diff --git a/target/arm/kvm32.c b/target/arm/kvm32.c
index 4e54e372a668..51f78f722b18 100644
--- a/target/arm/kvm32.c
+++ b/target/arm/kvm32.c
@@ -240,6 +240,11 @@ int kvm_arch_init_vcpu(CPUState *cs)
     return kvm_arm_init_cpreg_list(cpu);
 }
 
+int kvm_arch_destroy_vcpu(CPUState *cs)
+{
+	return 0;
+}
+
 typedef struct Reg {
     uint64_t id;
     int offset;
diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c
index 998d21f399f4..22d19c9aec6f 100644
--- a/target/arm/kvm64.c
+++ b/target/arm/kvm64.c
@@ -654,6 +654,11 @@ int kvm_arch_init_vcpu(CPUState *cs)
     return kvm_arm_init_cpreg_list(cpu);
 }
 
+int kvm_arch_destroy_vcpu(CPUState *cs)
+{
+    return 0;
+}
+
 bool kvm_arm_reg_syncs_via_cpreg_list(uint64_t regidx)
 {
     /* Return true if the regidx is a register we should synchronize
diff --git a/target/i386/kvm.c b/target/i386/kvm.c
index 3b29ce5c0d08..29889aa6b001 100644
--- a/target/i386/kvm.c
+++ b/target/i386/kvm.c
@@ -1349,6 +1349,18 @@ int kvm_arch_init_vcpu(CPUState *cs)
     return r;
 }
 
+int kvm_arch_destroy_vcpu(CPUState *cs)
+{
+    X86CPU *cpu = X86_CPU(cs);
+
+    if (cpu->kvm_msr_buf) {
+        g_free(cpu->kvm_msr_buf);
+        cpu->kvm_msr_buf = NULL;
+    }
+
+    return 0;
+}
+
 void kvm_arch_reset_vcpu(X86CPU *cpu)
 {
     CPUX86State *env = &cpu->env;
diff --git a/target/mips/kvm.c b/target/mips/kvm.c
index 8e72850962e1..938f8f144b74 100644
--- a/target/mips/kvm.c
+++ b/target/mips/kvm.c
@@ -91,6 +91,11 @@ int kvm_arch_init_vcpu(CPUState *cs)
     return ret;
 }
 
+int kvm_arch_destroy_vcpu(CPUState *cs)
+{
+    return 0;
+}
+
 void kvm_mips_reset_vcpu(MIPSCPU *cpu)
 {
     CPUMIPSState *env = &cpu->env;
diff --git a/target/ppc/kvm.c b/target/ppc/kvm.c
index 3bf0a46c3352..1967ccc51791 100644
--- a/target/ppc/kvm.c
+++ b/target/ppc/kvm.c
@@ -521,6 +521,11 @@ int kvm_arch_init_vcpu(CPUState *cs)
     return ret;
 }
 
+int kvm_arch_destroy_vcpu(CPUState *cs)
+{
+    return 0;
+}
+
 static void kvm_sw_tlb_put(PowerPCCPU *cpu)
 {
     CPUPPCState *env = &cpu->env;
diff --git a/target/s390x/kvm.c b/target/s390x/kvm.c
index e5e2b691f253..c2747c31649b 100644
--- a/target/s390x/kvm.c
+++ b/target/s390x/kvm.c
@@ -368,6 +368,16 @@ int kvm_arch_init_vcpu(CPUState *cs)
     return 0;
 }
 
+int kvm_arch_destroy_vcpu(CPUState *cs)
+{
+    S390CPU *cpu = S390_CPU(cs);
+
+    g_free(cpu->irqstate);
+    cpu->irqstate = NULL;
+
+    return 0;
+}
+
 void kvm_s390_reset_vcpu(S390CPU *cpu)
 {
     CPUState *cs = CPU(cpu);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [QEMU PATCH v3 2/9] KVM: i386: Use symbolic constant for #DB/#BP exception constants
  2019-06-17 17:56 [QEMU PATCH v3 0/9]: KVM: i386: Add support for save and restore of nested state Liran Alon
  2019-06-17 17:56 ` [QEMU PATCH v3 1/9] KVM: Introduce kvm_arch_destroy_vcpu() Liran Alon
@ 2019-06-17 17:56 ` Liran Alon
  2019-06-17 17:56 ` [QEMU PATCH v3 3/9] KVM: i386: Re-inject #DB to guest with updated DR6 Liran Alon
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 27+ messages in thread
From: Liran Alon @ 2019-06-17 17:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, mtosatti, rth, ehabkost, kvm, jmattson, maran.wilson,
	dgilbert, Liran Alon, Nikita Leshenko, Krish Sadhukhan

Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
---
 target/i386/kvm.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/target/i386/kvm.c b/target/i386/kvm.c
index 29889aa6b001..738dd91ff3cc 100644
--- a/target/i386/kvm.c
+++ b/target/i386/kvm.c
@@ -3006,9 +3006,9 @@ static int kvm_guest_debug_workarounds(X86CPU *cpu)
     unsigned long reinject_trap = 0;
 
     if (!kvm_has_vcpu_events()) {
-        if (env->exception_injected == 1) {
+        if (env->exception_injected == EXCP01_DB) {
             reinject_trap = KVM_GUESTDBG_INJECT_DB;
-        } else if (env->exception_injected == 3) {
+        } else if (env->exception_injected == EXCP03_INT3) {
             reinject_trap = KVM_GUESTDBG_INJECT_BP;
         }
         env->exception_injected = -1;
@@ -3520,8 +3520,8 @@ static int kvm_handle_debug(X86CPU *cpu,
     int ret = 0;
     int n;
 
-    if (arch_info->exception == 1) {
-        if (arch_info->dr6 & (1 << 14)) {
+    if (arch_info->exception == EXCP01_DB) {
+        if (arch_info->dr6 & DR6_BS) {
             if (cs->singlestep_enabled) {
                 ret = EXCP_DEBUG;
             }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [QEMU PATCH v3 3/9] KVM: i386: Re-inject #DB to guest with updated DR6
  2019-06-17 17:56 [QEMU PATCH v3 0/9]: KVM: i386: Add support for save and restore of nested state Liran Alon
  2019-06-17 17:56 ` [QEMU PATCH v3 1/9] KVM: Introduce kvm_arch_destroy_vcpu() Liran Alon
  2019-06-17 17:56 ` [QEMU PATCH v3 2/9] KVM: i386: Use symbolic constant for #DB/#BP exception constants Liran Alon
@ 2019-06-17 17:56 ` Liran Alon
  2019-06-17 17:56 ` [QEMU PATCH v3 4/9] KVM: i386: Block migration for vCPUs exposed with nested virtualization Liran Alon
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 27+ messages in thread
From: Liran Alon @ 2019-06-17 17:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, mtosatti, rth, ehabkost, kvm, jmattson, maran.wilson,
	dgilbert, Liran Alon, Nikita Leshenko, Krish Sadhukhan

If userspace (QEMU) debug guest, when #DB is raised in guest and
intercepted by KVM, KVM forwards information on #DB to userspace
instead of injecting #DB to guest.
While doing so, KVM don't update vCPU DR6 but instead report the #DB DR6
value to userspace for further handling.
See KVM's handle_exception() DB_VECTOR handler.

QEMU handler for this case is kvm_handle_debug(). This handler basically
checks if #DB is related to one of user set hardware breakpoints and if
not, it re-inject #DB into guest.
The re-injection is done by setting env->exception_injected to #DB which
will later be passed as events.exception.nr to KVM_SET_VCPU_EVENTS ioctl
by kvm_put_vcpu_events().

However, in case userspace re-injects #DB, KVM expects userspace to set
vCPU DR6 as reported to userspace when #DB was intercepted! Otherwise,
KVM_REQ_EVENT handler will inject #DB with wrong DR6 to guest.

Fix this issue by updating vCPU DR6 appropriately when re-inject #DB to
guest.

Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
---
 target/i386/kvm.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/target/i386/kvm.c b/target/i386/kvm.c
index 738dd91ff3cc..c8fd53055d37 100644
--- a/target/i386/kvm.c
+++ b/target/i386/kvm.c
@@ -3558,6 +3558,9 @@ static int kvm_handle_debug(X86CPU *cpu,
         /* pass to guest */
         env->exception_injected = arch_info->exception;
         env->has_error_code = 0;
+        if (arch_info->exception == EXCP01_DB) {
+            env->dr[6] = arch_info->dr6;
+        }
     }

     return ret;
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [QEMU PATCH v3 4/9] KVM: i386: Block migration for vCPUs exposed with nested virtualization
  2019-06-17 17:56 [QEMU PATCH v3 0/9]: KVM: i386: Add support for save and restore of nested state Liran Alon
                   ` (2 preceding siblings ...)
  2019-06-17 17:56 ` [QEMU PATCH v3 3/9] KVM: i386: Re-inject #DB to guest with updated DR6 Liran Alon
@ 2019-06-17 17:56 ` Liran Alon
  2019-06-18  8:44   ` Dr. David Alan Gilbert
  2019-06-18 22:16   ` [Qemu-devel] " Maran Wilson
  2019-06-17 17:56 ` [QEMU PATCH v3 5/9] linux-headers: i386: Modify struct kvm_nested_state to have explicit fields for data Liran Alon
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 27+ messages in thread
From: Liran Alon @ 2019-06-17 17:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, mtosatti, rth, ehabkost, kvm, jmattson, maran.wilson,
	dgilbert, Liran Alon

Commit d98f26073beb ("target/i386: kvm: add VMX migration blocker")
added a migration blocker for vCPU exposed with Intel VMX.
However, migration should also be blocked for vCPU exposed with
AMD SVM.

Both cases should be blocked because QEMU should extract additional
vCPU state from KVM that should be migrated as part of vCPU VMState.
E.g. Whether vCPU is running in guest-mode or host-mode.

Fixes: d98f26073beb ("target/i386: kvm: add VMX migration blocker")
Signed-off-by: Liran Alon <liran.alon@oracle.com>
---
 target/i386/cpu.c |  6 ------
 target/i386/cpu.h | 24 ++++++++++++++++++++++++
 target/i386/kvm.c | 12 ++++++------
 3 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 536d7d152044..197201087e65 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -5170,12 +5170,6 @@ static int x86_cpu_filter_features(X86CPU *cpu)
     return rv;
 }
 
-#define IS_INTEL_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_INTEL_1 && \
-                           (env)->cpuid_vendor2 == CPUID_VENDOR_INTEL_2 && \
-                           (env)->cpuid_vendor3 == CPUID_VENDOR_INTEL_3)
-#define IS_AMD_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_AMD_1 && \
-                         (env)->cpuid_vendor2 == CPUID_VENDOR_AMD_2 && \
-                         (env)->cpuid_vendor3 == CPUID_VENDOR_AMD_3)
 static void x86_cpu_realizefn(DeviceState *dev, Error **errp)
 {
     CPUState *cs = CPU(dev);
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index fce6660bac00..79d9495ceb0c 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -728,6 +728,13 @@ typedef uint32_t FeatureWordArray[FEATURE_WORDS];
 
 #define CPUID_VENDOR_HYGON    "HygonGenuine"
 
+#define IS_INTEL_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_INTEL_1 && \
+                           (env)->cpuid_vendor2 == CPUID_VENDOR_INTEL_2 && \
+                           (env)->cpuid_vendor3 == CPUID_VENDOR_INTEL_3)
+#define IS_AMD_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_AMD_1 && \
+                         (env)->cpuid_vendor2 == CPUID_VENDOR_AMD_2 && \
+                         (env)->cpuid_vendor3 == CPUID_VENDOR_AMD_3)
+
 #define CPUID_MWAIT_IBE     (1U << 1) /* Interrupts can exit capability */
 #define CPUID_MWAIT_EMX     (1U << 0) /* enumeration supported */
 
@@ -1866,6 +1873,23 @@ static inline int32_t x86_get_a20_mask(CPUX86State *env)
     }
 }
 
+static inline bool cpu_has_vmx(CPUX86State *env)
+{
+    return (IS_INTEL_CPU(env) &&
+            (env->features[FEAT_1_ECX] & CPUID_EXT_VMX));
+}
+
+static inline bool cpu_has_svm(CPUX86State *env)
+{
+    return (IS_AMD_CPU(env) &&
+            (env->features[FEAT_8000_0001_ECX] & CPUID_EXT3_SVM));
+}
+
+static inline bool cpu_has_nested_virt(CPUX86State *env)
+{
+    return (cpu_has_vmx(env) || cpu_has_svm(env));
+}
+
 /* fpu_helper.c */
 void update_fp_status(CPUX86State *env);
 void update_mxcsr_status(CPUX86State *env);
diff --git a/target/i386/kvm.c b/target/i386/kvm.c
index c8fd53055d37..f43e2d69859e 100644
--- a/target/i386/kvm.c
+++ b/target/i386/kvm.c
@@ -906,7 +906,7 @@ static int hyperv_init_vcpu(X86CPU *cpu)
 }
 
 static Error *invtsc_mig_blocker;
-static Error *vmx_mig_blocker;
+static Error *nested_virt_mig_blocker;
 
 #define KVM_MAX_CPUID_ENTRIES  100
 
@@ -1270,13 +1270,13 @@ int kvm_arch_init_vcpu(CPUState *cs)
                                   !!(c->ecx & CPUID_EXT_SMX);
     }
 
-    if ((env->features[FEAT_1_ECX] & CPUID_EXT_VMX) && !vmx_mig_blocker) {
-        error_setg(&vmx_mig_blocker,
-                   "Nested VMX virtualization does not support live migration yet");
-        r = migrate_add_blocker(vmx_mig_blocker, &local_err);
+    if (cpu_has_nested_virt(env) && !nested_virt_mig_blocker) {
+        error_setg(&nested_virt_mig_blocker,
+                   "Nested virtualization does not support live migration yet");
+        r = migrate_add_blocker(nested_virt_mig_blocker, &local_err);
         if (local_err) {
             error_report_err(local_err);
-            error_free(vmx_mig_blocker);
+            error_free(nested_virt_mig_blocker);
             return r;
         }
     }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [QEMU PATCH v3 5/9] linux-headers: i386: Modify struct kvm_nested_state to have explicit fields for data
  2019-06-17 17:56 [QEMU PATCH v3 0/9]: KVM: i386: Add support for save and restore of nested state Liran Alon
                   ` (3 preceding siblings ...)
  2019-06-17 17:56 ` [QEMU PATCH v3 4/9] KVM: i386: Block migration for vCPUs exposed with nested virtualization Liran Alon
@ 2019-06-17 17:56 ` Liran Alon
  2019-06-18 22:16   ` [Qemu-devel] " Maran Wilson
  2019-06-17 17:56 ` [QEMU PATCH v3 6/9] vmstate: Add support for kernel integer types Liran Alon
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Liran Alon @ 2019-06-17 17:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, mtosatti, rth, ehabkost, kvm, jmattson, maran.wilson,
	dgilbert, Liran Alon

Improve the KVM_{GET,SET}_NESTED_STATE structs by detailing the format
of VMX nested state data in a struct.

In order to avoid changing the ioctl values of
KVM_{GET,SET}_NESTED_STATE, there is a need to preserve
sizeof(struct kvm_nested_state). This is done by defining the data
struct as "data.vmx[0]". It was the most elegant way I found to
preserve struct size while still keeping struct readable and easy to
maintain. It does have a misfortunate side-effect that now it has to be
accessed as "data.vmx[0]" rather than just "data.vmx".

Because we are already modifying these structs, I also modified the
following:
* Define the "format" field values as macros.
* Rename vmcs_pa to vmcs12_pa for better readability.
* Add stub structs for AMD SVM.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
---
 linux-headers/asm-x86/kvm.h | 41 +++++++++++++++++++++++++++----------
 1 file changed, 30 insertions(+), 11 deletions(-)

diff --git a/linux-headers/asm-x86/kvm.h b/linux-headers/asm-x86/kvm.h
index 7a0e64ccd6ff..e655d108af19 100644
--- a/linux-headers/asm-x86/kvm.h
+++ b/linux-headers/asm-x86/kvm.h
@@ -383,6 +383,9 @@ struct kvm_sync_regs {
 #define KVM_X86_QUIRK_LAPIC_MMIO_HOLE	(1 << 2)
 #define KVM_X86_QUIRK_OUT_7E_INC_RIP	(1 << 3)
 
+#define KVM_STATE_NESTED_FORMAT_VMX	0
+#define KVM_STATE_NESTED_FORMAT_SVM	1
+
 #define KVM_STATE_NESTED_GUEST_MODE	0x00000001
 #define KVM_STATE_NESTED_RUN_PENDING	0x00000002
 #define KVM_STATE_NESTED_EVMCS		0x00000004
@@ -390,35 +393,51 @@ struct kvm_sync_regs {
 #define KVM_STATE_NESTED_SMM_GUEST_MODE	0x00000001
 #define KVM_STATE_NESTED_SMM_VMXON	0x00000002
 
-struct kvm_vmx_nested_state {
+struct kvm_vmx_nested_state_data {
+	__u8 vmcs12[0x1000];
+	__u8 shadow_vmcs12[0x1000];
+};
+
+struct kvm_vmx_nested_state_hdr {
 	__u64 vmxon_pa;
-	__u64 vmcs_pa;
+	__u64 vmcs12_pa;
 
 	struct {
 		__u16 flags;
 	} smm;
 };
 
+struct kvm_svm_nested_state_data {
+	/* TODO: Implement */
+};
+
+struct kvm_svm_nested_state_hdr {
+	/* TODO: Implement */
+};
+
 /* for KVM_CAP_NESTED_STATE */
 struct kvm_nested_state {
-	/* KVM_STATE_* flags */
 	__u16 flags;
-
-	/* 0 for VMX, 1 for SVM.  */
 	__u16 format;
-
-	/* 128 for SVM, 128 + VMCS size for VMX.  */
 	__u32 size;
 
 	union {
-		/* VMXON, VMCS */
-		struct kvm_vmx_nested_state vmx;
+		struct kvm_vmx_nested_state_hdr vmx;
+		struct kvm_svm_nested_state_hdr svm;
 
 		/* Pad the header to 128 bytes.  */
 		__u8 pad[120];
-	};
+	} hdr;
 
-	__u8 data[0];
+	/*
+	 * Define data region as 0 bytes to preserve backwards-compatability
+	 * to old definition of kvm_nested_state in order to avoid changing
+	 * KVM_{GET,PUT}_NESTED_STATE ioctl values.
+	 */
+	union {
+		struct kvm_vmx_nested_state_data vmx[0];
+		struct kvm_svm_nested_state_data svm[0];
+	} data;
 };
 
 #endif /* _ASM_X86_KVM_H */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [QEMU PATCH v3 6/9] vmstate: Add support for kernel integer types
  2019-06-17 17:56 [QEMU PATCH v3 0/9]: KVM: i386: Add support for save and restore of nested state Liran Alon
                   ` (4 preceding siblings ...)
  2019-06-17 17:56 ` [QEMU PATCH v3 5/9] linux-headers: i386: Modify struct kvm_nested_state to have explicit fields for data Liran Alon
@ 2019-06-17 17:56 ` Liran Alon
  2019-06-18  8:55   ` Dr. David Alan Gilbert
  2019-06-17 17:56 ` [QEMU PATCH v3 7/9] KVM: i386: Add support for save and restore nested state Liran Alon
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Liran Alon @ 2019-06-17 17:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, mtosatti, rth, ehabkost, kvm, jmattson, maran.wilson,
	dgilbert, Liran Alon, Nikita Leshenko

Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Maran Wilson <maran.wilson@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
---
 include/migration/vmstate.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 9224370ed59a..a85424fb0483 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -797,6 +797,15 @@ extern const VMStateInfo vmstate_info_qtailq;
 #define VMSTATE_UINT64_V(_f, _s, _v)                                  \
     VMSTATE_SINGLE(_f, _s, _v, vmstate_info_uint64, uint64_t)
 
+#define VMSTATE_U8_V(_f, _s, _v)                                   \
+    VMSTATE_SINGLE(_f, _s, _v, vmstate_info_uint8, __u8)
+#define VMSTATE_U16_V(_f, _s, _v)                                  \
+    VMSTATE_SINGLE(_f, _s, _v, vmstate_info_uint16, __u16)
+#define VMSTATE_U32_V(_f, _s, _v)                                  \
+    VMSTATE_SINGLE(_f, _s, _v, vmstate_info_uint32, __u32)
+#define VMSTATE_U64_V(_f, _s, _v)                                  \
+    VMSTATE_SINGLE(_f, _s, _v, vmstate_info_uint64, __u64)
+
 #define VMSTATE_BOOL(_f, _s)                                          \
     VMSTATE_BOOL_V(_f, _s, 0)
 
@@ -818,6 +827,15 @@ extern const VMStateInfo vmstate_info_qtailq;
 #define VMSTATE_UINT64(_f, _s)                                        \
     VMSTATE_UINT64_V(_f, _s, 0)
 
+#define VMSTATE_U8(_f, _s)                                         \
+    VMSTATE_U8_V(_f, _s, 0)
+#define VMSTATE_U16(_f, _s)                                        \
+    VMSTATE_U16_V(_f, _s, 0)
+#define VMSTATE_U32(_f, _s)                                        \
+    VMSTATE_U32_V(_f, _s, 0)
+#define VMSTATE_U64(_f, _s)                                        \
+    VMSTATE_U64_V(_f, _s, 0)
+
 #define VMSTATE_UINT8_EQUAL(_f, _s, _err_hint)                        \
     VMSTATE_SINGLE_FULL(_f, _s, 0, 0,                                 \
                         vmstate_info_uint8_equal, uint8_t, _err_hint)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [QEMU PATCH v3 7/9] KVM: i386: Add support for save and restore nested state
  2019-06-17 17:56 [QEMU PATCH v3 0/9]: KVM: i386: Add support for save and restore of nested state Liran Alon
                   ` (5 preceding siblings ...)
  2019-06-17 17:56 ` [QEMU PATCH v3 6/9] vmstate: Add support for kernel integer types Liran Alon
@ 2019-06-17 17:56 ` Liran Alon
  2019-06-18  9:03   ` Dr. David Alan Gilbert
  2019-06-18 22:16   ` [Qemu-devel] " Maran Wilson
  2019-06-17 17:56 ` [QEMU PATCH v3 8/9] KVM: i386: Add support for KVM_CAP_EXCEPTION_PAYLOAD Liran Alon
  2019-06-17 17:56 ` [QEMU PATCH v3 9/9] KVM: i386: Remove VMX migration blocker Liran Alon
  8 siblings, 2 replies; 27+ messages in thread
From: Liran Alon @ 2019-06-17 17:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, mtosatti, rth, ehabkost, kvm, jmattson, maran.wilson,
	dgilbert, Liran Alon, Nikita Leshenko

Kernel commit 8fcc4b5923af ("kvm: nVMX: Introduce KVM_CAP_NESTED_STATE")
introduced new IOCTLs to extract and restore vCPU state related to
Intel VMX & AMD SVM.

Utilize these IOCTLs to add support for migration of VMs which are
running nested hypervisors.

Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
---
 accel/kvm/kvm-all.c   |   8 ++
 include/sysemu/kvm.h  |   1 +
 target/i386/cpu.h     |   3 +
 target/i386/kvm.c     |  80 +++++++++++++++++
 target/i386/machine.c | 196 ++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 288 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 59a3aa3a40da..4fdf5b04b131 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -88,6 +88,7 @@ struct KVMState
 #ifdef KVM_CAP_SET_GUEST_DEBUG
     QTAILQ_HEAD(, kvm_sw_breakpoint) kvm_sw_breakpoints;
 #endif
+    int max_nested_state_len;
     int many_ioeventfds;
     int intx_set_mask;
     bool sync_mmu;
@@ -1678,6 +1679,8 @@ static int kvm_init(MachineState *ms)
     s->debugregs = kvm_check_extension(s, KVM_CAP_DEBUGREGS);
 #endif
 
+    s->max_nested_state_len = kvm_check_extension(s, KVM_CAP_NESTED_STATE);
+
 #ifdef KVM_CAP_IRQ_ROUTING
     kvm_direct_msi_allowed = (kvm_check_extension(s, KVM_CAP_SIGNAL_MSI) > 0);
 #endif
@@ -2245,6 +2248,11 @@ int kvm_has_debugregs(void)
     return kvm_state->debugregs;
 }
 
+int kvm_max_nested_state_length(void)
+{
+    return kvm_state->max_nested_state_len;
+}
+
 int kvm_has_many_ioeventfds(void)
 {
     if (!kvm_enabled()) {
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 64f55e519df7..acd90aebb6c4 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -210,6 +210,7 @@ bool kvm_has_sync_mmu(void);
 int kvm_has_vcpu_events(void);
 int kvm_has_robust_singlestep(void);
 int kvm_has_debugregs(void);
+int kvm_max_nested_state_length(void);
 int kvm_has_pit_state2(void);
 int kvm_has_many_ioeventfds(void);
 int kvm_has_gsi_routing(void);
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 79d9495ceb0c..a6bb71849869 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -1350,6 +1350,9 @@ typedef struct CPUX86State {
 #if defined(CONFIG_KVM) || defined(CONFIG_HVF)
     void *xsave_buf;
 #endif
+#if defined(CONFIG_KVM)
+    struct kvm_nested_state *nested_state;
+#endif
 #if defined(CONFIG_HVF)
     HVFX86EmulatorState *hvf_emul;
 #endif
diff --git a/target/i386/kvm.c b/target/i386/kvm.c
index f43e2d69859e..5950c3ed0d1c 100644
--- a/target/i386/kvm.c
+++ b/target/i386/kvm.c
@@ -931,6 +931,7 @@ int kvm_arch_init_vcpu(CPUState *cs)
     struct kvm_cpuid_entry2 *c;
     uint32_t signature[3];
     int kvm_base = KVM_CPUID_SIGNATURE;
+    int max_nested_state_len;
     int r;
     Error *local_err = NULL;
 
@@ -1331,6 +1332,24 @@ int kvm_arch_init_vcpu(CPUState *cs)
     if (has_xsave) {
         env->xsave_buf = qemu_memalign(4096, sizeof(struct kvm_xsave));
     }
+
+    max_nested_state_len = kvm_max_nested_state_length();
+    if (max_nested_state_len > 0) {
+        assert(max_nested_state_len >= offsetof(struct kvm_nested_state, data));
+        env->nested_state = g_malloc0(max_nested_state_len);
+
+        env->nested_state->size = max_nested_state_len;
+
+        if (IS_INTEL_CPU(env)) {
+            struct kvm_vmx_nested_state_hdr *vmx_hdr =
+                &env->nested_state->hdr.vmx;
+
+            vmx_hdr->vmxon_pa = -1ull;
+            vmx_hdr->vmcs12_pa = -1ull;
+        }
+
+    }
+
     cpu->kvm_msr_buf = g_malloc0(MSR_BUF_SIZE);
 
     if (!(env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_RDTSCP)) {
@@ -1352,12 +1371,18 @@ int kvm_arch_init_vcpu(CPUState *cs)
 int kvm_arch_destroy_vcpu(CPUState *cs)
 {
     X86CPU *cpu = X86_CPU(cs);
+    CPUX86State *env = &cpu->env;
 
     if (cpu->kvm_msr_buf) {
         g_free(cpu->kvm_msr_buf);
         cpu->kvm_msr_buf = NULL;
     }
 
+    if (env->nested_state) {
+        g_free(env->nested_state);
+        env->nested_state = NULL;
+    }
+
     return 0;
 }
 
@@ -3072,6 +3097,52 @@ static int kvm_get_debugregs(X86CPU *cpu)
     return 0;
 }
 
+static int kvm_put_nested_state(X86CPU *cpu)
+{
+    CPUX86State *env = &cpu->env;
+    int max_nested_state_len = kvm_max_nested_state_length();
+
+    if (max_nested_state_len <= 0) {
+        return 0;
+    }
+
+    assert(env->nested_state->size <= max_nested_state_len);
+    return kvm_vcpu_ioctl(CPU(cpu), KVM_SET_NESTED_STATE, env->nested_state);
+}
+
+static int kvm_get_nested_state(X86CPU *cpu)
+{
+    CPUX86State *env = &cpu->env;
+    int max_nested_state_len = kvm_max_nested_state_length();
+    int ret;
+
+    if (max_nested_state_len <= 0) {
+        return 0;
+    }
+
+    /*
+     * It is possible that migration restored a smaller size into
+     * nested_state->hdr.size than what our kernel support.
+     * We preserve migration origin nested_state->hdr.size for
+     * call to KVM_SET_NESTED_STATE but wish that our next call
+     * to KVM_GET_NESTED_STATE will use max size our kernel support.
+     */
+    env->nested_state->size = max_nested_state_len;
+
+    ret = kvm_vcpu_ioctl(CPU(cpu), KVM_GET_NESTED_STATE, env->nested_state);
+    if (ret < 0) {
+        return ret;
+    }
+
+    if (env->nested_state->flags & KVM_STATE_NESTED_GUEST_MODE) {
+        env->hflags |= HF_GUEST_MASK;
+    } else {
+        env->hflags &= ~HF_GUEST_MASK;
+    }
+
+    return ret;
+}
+
 int kvm_arch_put_registers(CPUState *cpu, int level)
 {
     X86CPU *x86_cpu = X86_CPU(cpu);
@@ -3079,6 +3150,11 @@ int kvm_arch_put_registers(CPUState *cpu, int level)
 
     assert(cpu_is_stopped(cpu) || qemu_cpu_is_self(cpu));
 
+    ret = kvm_put_nested_state(x86_cpu);
+    if (ret < 0) {
+        return ret;
+    }
+
     if (level >= KVM_PUT_RESET_STATE) {
         ret = kvm_put_msr_feature_control(x86_cpu);
         if (ret < 0) {
@@ -3194,6 +3270,10 @@ int kvm_arch_get_registers(CPUState *cs)
     if (ret < 0) {
         goto out;
     }
+    ret = kvm_get_nested_state(cpu);
+    if (ret < 0) {
+        goto out;
+    }
     ret = 0;
  out:
     cpu_sync_bndcs_hflags(&cpu->env);
diff --git a/target/i386/machine.c b/target/i386/machine.c
index 225b5d433bc4..95299ebff44a 100644
--- a/target/i386/machine.c
+++ b/target/i386/machine.c
@@ -231,6 +231,15 @@ static int cpu_pre_save(void *opaque)
         env->segs[R_SS].flags &= ~(env->segs[R_SS].flags & DESC_DPL_MASK);
     }
 
+#ifdef CONFIG_KVM
+    /* Verify we have nested virtualization state from kernel if required */
+    if (cpu_has_nested_virt(env) && !env->nested_state) {
+        error_report("Guest enabled nested virtualization but kernel "
+                "does not support saving of nested state");
+        return -EINVAL;
+    }
+#endif
+
     return 0;
 }
 
@@ -278,6 +287,16 @@ static int cpu_post_load(void *opaque, int version_id)
     env->hflags &= ~HF_CPL_MASK;
     env->hflags |= (env->segs[R_SS].flags >> DESC_DPL_SHIFT) & HF_CPL_MASK;
 
+#ifdef CONFIG_KVM
+    if ((env->hflags & HF_GUEST_MASK) &&
+        (!env->nested_state ||
+        !(env->nested_state->flags & KVM_STATE_NESTED_GUEST_MODE))) {
+        error_report("vCPU set in guest-mode inconsistent with "
+                     "migrated kernel nested state");
+        return -EINVAL;
+    }
+#endif
+
     env->fpstt = (env->fpus_vmstate >> 11) & 7;
     env->fpus = env->fpus_vmstate & ~0x3800;
     env->fptag_vmstate ^= 0xff;
@@ -851,6 +870,180 @@ static const VMStateDescription vmstate_tsc_khz = {
     }
 };
 
+#ifdef CONFIG_KVM
+
+static bool vmx_vmcs12_needed(void *opaque)
+{
+    struct kvm_nested_state *nested_state = opaque;
+    return (nested_state->size >
+            offsetof(struct kvm_nested_state, data.vmx[0].vmcs12));
+}
+
+static const VMStateDescription vmstate_vmx_vmcs12 = {
+	.name = "cpu/kvm_nested_state/vmx/vmcs12",
+	.version_id = 1,
+	.minimum_version_id = 1,
+	.needed = vmx_vmcs12_needed,
+	.fields = (VMStateField[]) {
+	    VMSTATE_UINT8_ARRAY(data.vmx[0].vmcs12,
+	                        struct kvm_nested_state, 0x1000),
+	    VMSTATE_END_OF_LIST()
+    }
+};
+
+static bool vmx_shadow_vmcs12_needed(void *opaque)
+{
+    struct kvm_nested_state *nested_state = opaque;
+    return (nested_state->size >
+            offsetof(struct kvm_nested_state, data.vmx[0].shadow_vmcs12));
+}
+
+static const VMStateDescription vmstate_vmx_shadow_vmcs12 = {
+	.name = "cpu/kvm_nested_state/vmx/shadow_vmcs12",
+	.version_id = 1,
+	.minimum_version_id = 1,
+	.needed = vmx_shadow_vmcs12_needed,
+	.fields = (VMStateField[]) {
+	    VMSTATE_UINT8_ARRAY(data.vmx[0].shadow_vmcs12,
+	                        struct kvm_nested_state, 0x1000),
+	    VMSTATE_END_OF_LIST()
+    }
+};
+
+static bool vmx_nested_state_needed(void *opaque)
+{
+    struct kvm_nested_state *nested_state = opaque;
+
+    return ((nested_state->format == KVM_STATE_NESTED_FORMAT_VMX) &&
+            ((nested_state->hdr.vmx.vmxon_pa != -1ull) ||
+             (nested_state->hdr.vmx.smm.flags & KVM_STATE_NESTED_SMM_VMXON)));
+}
+
+static const VMStateDescription vmstate_vmx_nested_state = {
+	.name = "cpu/kvm_nested_state/vmx",
+	.version_id = 1,
+	.minimum_version_id = 1,
+	.needed = vmx_nested_state_needed,
+	.fields = (VMStateField[]) {
+	    VMSTATE_U64(hdr.vmx.vmxon_pa, struct kvm_nested_state),
+	    VMSTATE_U64(hdr.vmx.vmcs12_pa, struct kvm_nested_state),
+	    VMSTATE_U16(hdr.vmx.smm.flags, struct kvm_nested_state),
+	    VMSTATE_END_OF_LIST()
+    },
+    .subsections = (const VMStateDescription*[]) {
+        &vmstate_vmx_vmcs12,
+        &vmstate_vmx_shadow_vmcs12,
+        NULL,
+    }
+};
+
+static bool svm_nested_state_needed(void *opaque)
+{
+    struct kvm_nested_state *nested_state = opaque;
+
+    return (nested_state->format == KVM_STATE_NESTED_FORMAT_SVM);
+}
+
+static const VMStateDescription vmstate_svm_nested_state = {
+	.name = "cpu/kvm_nested_state/svm",
+	.version_id = 1,
+	.minimum_version_id = 1,
+	.needed = svm_nested_state_needed,
+	.fields = (VMStateField[]) {
+	    VMSTATE_END_OF_LIST()
+    }
+};
+
+static bool nested_state_needed(void *opaque)
+{
+    X86CPU *cpu = opaque;
+    CPUX86State *env = &cpu->env;
+
+    return (env->nested_state &&
+            (vmx_nested_state_needed(env->nested_state) ||
+             svm_nested_state_needed(env->nested_state)));
+}
+
+static int nested_state_post_load(void *opaque, int version_id)
+{
+    X86CPU *cpu = opaque;
+    CPUX86State *env = &cpu->env;
+    struct kvm_nested_state *nested_state = env->nested_state;
+    int min_nested_state_len = offsetof(struct kvm_nested_state, data);
+    int max_nested_state_len = kvm_max_nested_state_length();
+
+    /*
+     * If our kernel don't support setting nested state
+     * and we have received nested state from migration stream,
+     * we need to fail migration
+     */
+    if (max_nested_state_len <= 0) {
+        error_report("Received nested state when kernel cannot restore it");
+        return -EINVAL;
+    }
+
+    /*
+     * Verify that the size of received nested_state struct
+     * at least cover required header and is not larger
+     * than the max size that our kernel support
+     */
+    if (nested_state->size < min_nested_state_len) {
+        error_report("Received nested state size less than min: "
+                     "len=%d, min=%d",
+                     nested_state->size, min_nested_state_len);
+        return -EINVAL;
+    }
+    if (nested_state->size > max_nested_state_len) {
+        error_report("Recieved unsupported nested state size: "
+                     "nested_state->size=%d, max=%d",
+                     nested_state->size, max_nested_state_len);
+        return -EINVAL;
+    }
+
+    /* Verify format is valid */
+    if ((nested_state->format != KVM_STATE_NESTED_FORMAT_VMX) &&
+        (nested_state->format != KVM_STATE_NESTED_FORMAT_SVM)) {
+        error_report("Received invalid nested state format: %d",
+                     nested_state->format);
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
+static const VMStateDescription vmstate_kvm_nested_state = {
+    .name = "cpu/kvm_nested_state",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_U16(flags, struct kvm_nested_state),
+        VMSTATE_U16(format, struct kvm_nested_state),
+        VMSTATE_U32(size, struct kvm_nested_state),
+        VMSTATE_END_OF_LIST()
+    },
+    .subsections = (const VMStateDescription*[]) {
+        &vmstate_vmx_nested_state,
+        &vmstate_svm_nested_state,
+        NULL
+    }
+};
+
+static const VMStateDescription vmstate_nested_state = {
+    .name = "cpu/nested_state",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .needed = nested_state_needed,
+    .post_load = nested_state_post_load,
+    .fields = (VMStateField[]) {
+        VMSTATE_STRUCT_POINTER(env.nested_state, X86CPU,
+                               vmstate_kvm_nested_state,
+                               struct kvm_nested_state),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+#endif
+
 static bool mcg_ext_ctl_needed(void *opaque)
 {
     X86CPU *cpu = opaque;
@@ -1089,6 +1282,9 @@ VMStateDescription vmstate_x86_cpu = {
         &vmstate_msr_intel_pt,
         &vmstate_msr_virt_ssbd,
         &vmstate_svm_npt,
+#ifdef CONFIG_KVM
+        &vmstate_nested_state,
+#endif
         NULL
     }
 };
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [QEMU PATCH v3 8/9] KVM: i386: Add support for KVM_CAP_EXCEPTION_PAYLOAD
  2019-06-17 17:56 [QEMU PATCH v3 0/9]: KVM: i386: Add support for save and restore of nested state Liran Alon
                   ` (6 preceding siblings ...)
  2019-06-17 17:56 ` [QEMU PATCH v3 7/9] KVM: i386: Add support for save and restore nested state Liran Alon
@ 2019-06-17 17:56 ` Liran Alon
  2019-06-18  9:07   ` Dr. David Alan Gilbert
  2019-06-17 17:56 ` [QEMU PATCH v3 9/9] KVM: i386: Remove VMX migration blocker Liran Alon
  8 siblings, 1 reply; 27+ messages in thread
From: Liran Alon @ 2019-06-17 17:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, mtosatti, rth, ehabkost, kvm, jmattson, maran.wilson,
	dgilbert, Liran Alon, Nikita Leshenko

Kernel commit c4f55198c7c2 ("kvm: x86: Introduce KVM_CAP_EXCEPTION_PAYLOAD")
introduced a new KVM capability which allows userspace to correctly
distinguish between pending and injected exceptions.

This distinguish is important in case of nested virtualization scenarios
because a L2 pending exception can still be intercepted by the L1 hypervisor
while a L2 injected exception cannot.

Furthermore, when an exception is attempted to be injected by QEMU,
QEMU should specify the exception payload (CR2 in case of #PF or
DR6 in case of #DB) instead of having the payload already delivered in
the respective vCPU register. Because in case exception is injected to
L2 guest and is intercepted by L1 hypervisor, then payload needs to be
reported to L1 intercept (VMExit handler) while still preserving
respective vCPU register unchanged.

This commit adds support for QEMU to properly utilise this new KVM
capability (KVM_CAP_EXCEPTION_PAYLOAD).

Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
---
 target/i386/cpu.c        |   6 ++-
 target/i386/cpu.h        |   6 ++-
 target/i386/hvf/hvf.c    |  10 ++--
 target/i386/hvf/x86hvf.c |   4 +-
 target/i386/kvm.c        | 101 ++++++++++++++++++++++++++++++++-------
 target/i386/machine.c    |  84 +++++++++++++++++++++++++++++++-
 6 files changed, 187 insertions(+), 24 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 197201087e65..a026e49f5c0d 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -4774,7 +4774,11 @@ static void x86_cpu_reset(CPUState *s)
     memset(env->mtrr_fixed, 0, sizeof(env->mtrr_fixed));
 
     env->interrupt_injected = -1;
-    env->exception_injected = -1;
+    env->exception_nr = -1;
+    env->exception_pending = 0;
+    env->exception_injected = 0;
+    env->exception_has_payload = false;
+    env->exception_payload = 0;
     env->nmi_injected = false;
 #if !defined(CONFIG_USER_ONLY)
     /* We hard-wire the BSP to the first CPU. */
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index a6bb71849869..e2ac4132972d 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -1338,10 +1338,14 @@ typedef struct CPUX86State {
 
     /* For KVM */
     uint32_t mp_state;
-    int32_t exception_injected;
+    int32_t exception_nr;
     int32_t interrupt_injected;
     uint8_t soft_interrupt;
+    uint8_t exception_pending;
+    uint8_t exception_injected;
     uint8_t has_error_code;
+    uint8_t exception_has_payload;
+    uint64_t exception_payload;
     uint32_t ins_len;
     uint32_t sipi_vector;
     bool tsc_valid;
diff --git a/target/i386/hvf/hvf.c b/target/i386/hvf/hvf.c
index 2751c8125ca2..dc4bb63536c8 100644
--- a/target/i386/hvf/hvf.c
+++ b/target/i386/hvf/hvf.c
@@ -605,7 +605,9 @@ static void hvf_store_events(CPUState *cpu, uint32_t ins_len, uint64_t idtvec_in
     X86CPU *x86_cpu = X86_CPU(cpu);
     CPUX86State *env = &x86_cpu->env;
 
-    env->exception_injected = -1;
+    env->exception_nr = -1;
+    env->exception_pending = 0;
+    env->exception_injected = 0;
     env->interrupt_injected = -1;
     env->nmi_injected = false;
     if (idtvec_info & VMCS_IDT_VEC_VALID) {
@@ -619,7 +621,8 @@ static void hvf_store_events(CPUState *cpu, uint32_t ins_len, uint64_t idtvec_in
             break;
         case VMCS_IDT_VEC_HWEXCEPTION:
         case VMCS_IDT_VEC_SWEXCEPTION:
-            env->exception_injected = idtvec_info & VMCS_IDT_VEC_VECNUM;
+            env->exception_nr = idtvec_info & VMCS_IDT_VEC_VECNUM;
+            env->exception_injected = 1;
             break;
         case VMCS_IDT_VEC_PRIV_SWEXCEPTION:
         default:
@@ -912,7 +915,8 @@ int hvf_vcpu_exec(CPUState *cpu)
             macvm_set_rip(cpu, rip + ins_len);
             break;
         case VMX_REASON_VMCALL:
-            env->exception_injected = EXCP0D_GPF;
+            env->exception_nr = EXCP0D_GPF;
+            env->exception_injected = 1;
             env->has_error_code = true;
             env->error_code = 0;
             break;
diff --git a/target/i386/hvf/x86hvf.c b/target/i386/hvf/x86hvf.c
index df8e946fbcde..e0ea02d631e6 100644
--- a/target/i386/hvf/x86hvf.c
+++ b/target/i386/hvf/x86hvf.c
@@ -362,8 +362,8 @@ bool hvf_inject_interrupts(CPUState *cpu_state)
     if (env->interrupt_injected != -1) {
         vector = env->interrupt_injected;
         intr_type = VMCS_INTR_T_SWINTR;
-    } else if (env->exception_injected != -1) {
-        vector = env->exception_injected;
+    } else if (env->exception_nr != -1) {
+        vector = env->exception_nr;
         if (vector == EXCP03_INT3 || vector == EXCP04_INTO) {
             intr_type = VMCS_INTR_T_SWEXCEPTION;
         } else {
diff --git a/target/i386/kvm.c b/target/i386/kvm.c
index 5950c3ed0d1c..797f8ac46435 100644
--- a/target/i386/kvm.c
+++ b/target/i386/kvm.c
@@ -104,6 +104,7 @@ static uint32_t num_architectural_pmu_fixed_counters;
 static int has_xsave;
 static int has_xcrs;
 static int has_pit_state2;
+static int has_exception_payload;
 
 static bool has_msr_mcg_ext_ctl;
 
@@ -584,15 +585,56 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
     /* Hope we are lucky for AO MCE */
 }
 
+static void kvm_reset_exception(CPUX86State *env)
+{
+	env->exception_nr = -1;
+	env->exception_pending = 0;
+	env->exception_injected = 0;
+	env->exception_has_payload = false;
+	env->exception_payload = 0;
+}
+
+static void kvm_queue_exception(CPUX86State *env,
+                                int32_t exception_nr,
+                                uint8_t exception_has_payload,
+                                uint64_t exception_payload)
+{
+    assert(env->exception_nr == -1);
+    assert(!env->exception_pending);
+    assert(!env->exception_injected);
+    assert(!env->exception_has_payload);
+
+    env->exception_nr = exception_nr;
+
+    if (has_exception_payload) {
+        env->exception_pending = 1;
+
+        env->exception_has_payload = exception_has_payload;
+        env->exception_payload = exception_payload;
+    } else {
+        env->exception_injected = 1;
+
+        if (exception_nr == EXCP01_DB) {
+            assert(exception_has_payload);
+            env->dr[6] = exception_payload;
+        } else if (exception_nr == EXCP0E_PAGE) {
+            assert(exception_has_payload);
+            env->cr[2] = exception_payload;
+        } else {
+            assert(!exception_has_payload);
+        }
+    }
+}
+
 static int kvm_inject_mce_oldstyle(X86CPU *cpu)
 {
     CPUX86State *env = &cpu->env;
 
-    if (!kvm_has_vcpu_events() && env->exception_injected == EXCP12_MCHK) {
+    if (!kvm_has_vcpu_events() && env->exception_nr == EXCP12_MCHK) {
         unsigned int bank, bank_num = env->mcg_cap & 0xff;
         struct kvm_x86_mce mce;
 
-        env->exception_injected = -1;
+        kvm_reset_exception(env);
 
         /*
          * There must be at least one bank in use if an MCE is pending.
@@ -1610,6 +1652,16 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
 
     hv_vpindex_settable = kvm_check_extension(s, KVM_CAP_HYPERV_VP_INDEX);
 
+    has_exception_payload = kvm_check_extension(s, KVM_CAP_EXCEPTION_PAYLOAD);
+    if (has_exception_payload) {
+        ret = kvm_vm_enable_cap(s, KVM_CAP_EXCEPTION_PAYLOAD, 0, true);
+        if (ret < 0) {
+            error_report("kvm: Failed to enable exception payload cap: %s",
+                         strerror(-ret));
+            return ret;
+        }
+    }
+
     ret = kvm_get_supported_msrs(s);
     if (ret < 0) {
         return ret;
@@ -2914,8 +2966,16 @@ static int kvm_put_vcpu_events(X86CPU *cpu, int level)
         return 0;
     }
 
-    events.exception.injected = (env->exception_injected >= 0);
-    events.exception.nr = env->exception_injected;
+    events.flags = 0;
+
+    if (has_exception_payload) {
+        events.flags |= KVM_VCPUEVENT_VALID_PAYLOAD;
+        events.exception.pending = env->exception_pending;
+        events.exception_has_payload = env->exception_has_payload;
+        events.exception_payload = env->exception_payload;
+    }
+    events.exception.nr = env->exception_nr;
+    events.exception.injected = env->exception_injected;
     events.exception.has_error_code = env->has_error_code;
     events.exception.error_code = env->error_code;
 
@@ -2928,7 +2988,6 @@ static int kvm_put_vcpu_events(X86CPU *cpu, int level)
     events.nmi.masked = !!(env->hflags2 & HF2_NMI_MASK);
 
     events.sipi_vector = env->sipi_vector;
-    events.flags = 0;
 
     if (has_msr_smbase) {
         events.smi.smm = !!(env->hflags & HF_SMM_MASK);
@@ -2978,8 +3037,19 @@ static int kvm_get_vcpu_events(X86CPU *cpu)
     if (ret < 0) {
        return ret;
     }
-    env->exception_injected =
-       events.exception.injected ? events.exception.nr : -1;
+
+    if (events.flags & KVM_VCPUEVENT_VALID_PAYLOAD) {
+        env->exception_pending = events.exception.pending;
+        env->exception_has_payload = events.exception_has_payload;
+        env->exception_payload = events.exception_payload;
+    } else {
+        env->exception_pending = 0;
+        env->exception_has_payload = false;
+    }
+    env->exception_injected = events.exception.injected;
+    env->exception_nr =
+        (env->exception_pending || env->exception_injected) ?
+        events.exception.nr : -1;
     env->has_error_code = events.exception.has_error_code;
     env->error_code = events.exception.error_code;
 
@@ -3031,12 +3101,12 @@ static int kvm_guest_debug_workarounds(X86CPU *cpu)
     unsigned long reinject_trap = 0;
 
     if (!kvm_has_vcpu_events()) {
-        if (env->exception_injected == EXCP01_DB) {
+        if (env->exception_nr == EXCP01_DB) {
             reinject_trap = KVM_GUESTDBG_INJECT_DB;
         } else if (env->exception_injected == EXCP03_INT3) {
             reinject_trap = KVM_GUESTDBG_INJECT_BP;
         }
-        env->exception_injected = -1;
+        kvm_reset_exception(env);
     }
 
     /*
@@ -3412,13 +3482,13 @@ int kvm_arch_process_async_events(CPUState *cs)
 
         kvm_cpu_synchronize_state(cs);
 
-        if (env->exception_injected == EXCP08_DBLE) {
+        if (env->exception_nr == EXCP08_DBLE) {
             /* this means triple fault */
             qemu_system_reset_request(SHUTDOWN_CAUSE_GUEST_RESET);
             cs->exit_request = 1;
             return 0;
         }
-        env->exception_injected = EXCP12_MCHK;
+        kvm_queue_exception(env, EXCP12_MCHK, 0, 0);
         env->has_error_code = 0;
 
         cs->halted = 0;
@@ -3633,14 +3703,13 @@ static int kvm_handle_debug(X86CPU *cpu,
     }
     if (ret == 0) {
         cpu_synchronize_state(cs);
-        assert(env->exception_injected == -1);
+        assert(env->exception_nr == -1);
 
         /* pass to guest */
-        env->exception_injected = arch_info->exception;
+        kvm_queue_exception(env, arch_info->exception,
+                            arch_info->exception == EXCP01_DB,
+                            arch_info->dr6);
         env->has_error_code = 0;
-        if (arch_info->exception == EXCP01_DB) {
-            env->dr[6] = arch_info->dr6;
-        }
     }
 
     return ret;
diff --git a/target/i386/machine.c b/target/i386/machine.c
index 95299ebff44a..6aac0fe9cb56 100644
--- a/target/i386/machine.c
+++ b/target/i386/machine.c
@@ -240,6 +240,41 @@ static int cpu_pre_save(void *opaque)
     }
 #endif
 
+    /*
+     * When vCPU is running L2 and exception is still pending,
+     * it can potentially be intercepted by L1 hypervisor.
+     * In contrast to an injected exception which cannot be
+     * intercepted anymore.
+     *
+     * Furthermore, when a L2 exception is intercepted by L1
+     * hypervisor, it's exception payload (CR2/DR6 on #PF/#DB)
+     * should not be set yet in the respective vCPU register.
+     * Thus, in case an exception is pending, it is
+     * important to save the exception payload seperately.
+     *
+     * Therefore, if an exception is not in a pending state
+     * or vCPU is not in guest-mode, it is not important to
+     * distinguish between a pending and injected exception
+     * and we don't need to store seperately the exception payload.
+     *
+     * In order to preserve better backwards-compatabile migration,
+     * convert a pending exception to an injected exception in
+     * case it is not important to distingiush between them
+     * as described above.
+     */
+    if (env->exception_pending && !(env->hflags & HF_GUEST_MASK)) {
+        env->exception_pending = 0;
+        env->exception_injected = 1;
+
+        if (env->exception_has_payload) {
+            if (env->exception_nr == EXCP01_DB) {
+                env->dr[6] = env->exception_payload;
+            } else if (env->exception_nr == EXCP0E_PAGE) {
+                env->cr[2] = env->exception_payload;
+            }
+        }
+    }
+
     return 0;
 }
 
@@ -297,6 +332,23 @@ static int cpu_post_load(void *opaque, int version_id)
     }
 #endif
 
+    /*
+     * There are cases that we can get valid exception_nr with both
+     * exception_pending and exception_injected being cleared.
+     * This can happen in one of the following scenarios:
+     * 1) Source is older QEMU without KVM_CAP_EXCEPTION_PAYLOAD support.
+     * 2) Source is running on kernel without KVM_CAP_EXCEPTION_PAYLOAD support.
+     * 3) "cpu/exception_info" subsection not sent because there is no exception
+     *	  pending or guest wasn't running L2 (See comment in cpu_pre_save()).
+     *
+     * In those cases, we can just deduce that a valid exception_nr means
+     * we can treat the exception as already injected.
+     */
+    if ((env->exception_nr != -1) &&
+        !env->exception_pending && !env->exception_injected) {
+        env->exception_injected = 1;
+    }
+
     env->fpstt = (env->fpus_vmstate >> 11) & 7;
     env->fpus = env->fpus_vmstate & ~0x3800;
     env->fptag_vmstate ^= 0xff;
@@ -342,6 +394,35 @@ static bool steal_time_msr_needed(void *opaque)
     return cpu->env.steal_time_msr != 0;
 }
 
+static bool exception_info_needed(void *opaque)
+{
+    X86CPU *cpu = opaque;
+    CPUX86State *env = &cpu->env;
+
+    /*
+     * It is important to save exception-info only in case
+     * we need to distingiush between a pending and injected
+     * exception. Which is only required in case there is a
+     * pending exception and vCPU is running L2.
+     * For more info, refer to comment in cpu_pre_save().
+     */
+    return (env->exception_pending && (env->hflags & HF_GUEST_MASK));
+}
+
+static const VMStateDescription vmstate_exception_info = {
+    .name = "cpu/exception_info",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .needed = exception_info_needed,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT8(env.exception_pending, X86CPU),
+        VMSTATE_UINT8(env.exception_injected, X86CPU),
+        VMSTATE_UINT8(env.exception_has_payload, X86CPU),
+        VMSTATE_UINT64(env.exception_payload, X86CPU),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
 static const VMStateDescription vmstate_steal_time_msr = {
     .name = "cpu/steal_time_msr",
     .version_id = 1,
@@ -1228,7 +1309,7 @@ VMStateDescription vmstate_x86_cpu = {
         VMSTATE_INT32(env.interrupt_injected, X86CPU),
         VMSTATE_UINT32(env.mp_state, X86CPU),
         VMSTATE_UINT64(env.tsc, X86CPU),
-        VMSTATE_INT32(env.exception_injected, X86CPU),
+        VMSTATE_INT32(env.exception_nr, X86CPU),
         VMSTATE_UINT8(env.soft_interrupt, X86CPU),
         VMSTATE_UINT8(env.nmi_injected, X86CPU),
         VMSTATE_UINT8(env.nmi_pending, X86CPU),
@@ -1252,6 +1333,7 @@ VMStateDescription vmstate_x86_cpu = {
         /* The above list is not sorted /wrt version numbers, watch out! */
     },
     .subsections = (const VMStateDescription*[]) {
+        &vmstate_exception_info,
         &vmstate_async_pf_msr,
         &vmstate_pv_eoi_msr,
         &vmstate_steal_time_msr,
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [QEMU PATCH v3 9/9] KVM: i386: Remove VMX migration blocker
  2019-06-17 17:56 [QEMU PATCH v3 0/9]: KVM: i386: Add support for save and restore of nested state Liran Alon
                   ` (7 preceding siblings ...)
  2019-06-17 17:56 ` [QEMU PATCH v3 8/9] KVM: i386: Add support for KVM_CAP_EXCEPTION_PAYLOAD Liran Alon
@ 2019-06-17 17:56 ` Liran Alon
  2019-06-18 22:17   ` [Qemu-devel] " Maran Wilson
  8 siblings, 1 reply; 27+ messages in thread
From: Liran Alon @ 2019-06-17 17:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, mtosatti, rth, ehabkost, kvm, jmattson, maran.wilson,
	dgilbert, Liran Alon

This effectively reverts d98f26073beb ("target/i386: kvm: add VMX migration blocker").
This can now be done because previous commits added support for Intel VMX migration.

AMD SVM migration is still blocked. This is because kernel
KVM_CAP_{GET,SET}_NESTED_STATE in case of AMD SVM is not
implemented yet. Therefore, required vCPU nested state is still
missing in order to perform valid migration for vCPU exposed with SVM.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
---
 target/i386/kvm.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/target/i386/kvm.c b/target/i386/kvm.c
index 797f8ac46435..772c8619efc4 100644
--- a/target/i386/kvm.c
+++ b/target/i386/kvm.c
@@ -948,7 +948,7 @@ static int hyperv_init_vcpu(X86CPU *cpu)
 }
 
 static Error *invtsc_mig_blocker;
-static Error *nested_virt_mig_blocker;
+static Error *svm_mig_blocker;
 
 #define KVM_MAX_CPUID_ENTRIES  100
 
@@ -1313,13 +1313,13 @@ int kvm_arch_init_vcpu(CPUState *cs)
                                   !!(c->ecx & CPUID_EXT_SMX);
     }
 
-    if (cpu_has_nested_virt(env) && !nested_virt_mig_blocker) {
-        error_setg(&nested_virt_mig_blocker,
-                   "Nested virtualization does not support live migration yet");
-        r = migrate_add_blocker(nested_virt_mig_blocker, &local_err);
+    if (cpu_has_svm(env) && !svm_mig_blocker) {
+        error_setg(&svm_mig_blocker,
+                   "AMD SVM does not support live migration yet");
+        r = migrate_add_blocker(svm_mig_blocker, &local_err);
         if (local_err) {
             error_report_err(local_err);
-            error_free(nested_virt_mig_blocker);
+            error_free(svm_mig_blocker);
             return r;
         }
     }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [QEMU PATCH v3 4/9] KVM: i386: Block migration for vCPUs exposed with nested virtualization
  2019-06-17 17:56 ` [QEMU PATCH v3 4/9] KVM: i386: Block migration for vCPUs exposed with nested virtualization Liran Alon
@ 2019-06-18  8:44   ` Dr. David Alan Gilbert
  2019-06-18 22:16   ` [Qemu-devel] " Maran Wilson
  1 sibling, 0 replies; 27+ messages in thread
From: Dr. David Alan Gilbert @ 2019-06-18  8:44 UTC (permalink / raw)
  To: Liran Alon
  Cc: qemu-devel, pbonzini, mtosatti, rth, ehabkost, kvm, jmattson,
	maran.wilson

* Liran Alon (liran.alon@oracle.com) wrote:
> Commit d98f26073beb ("target/i386: kvm: add VMX migration blocker")
> added a migration blocker for vCPU exposed with Intel VMX.
> However, migration should also be blocked for vCPU exposed with
> AMD SVM.
> 
> Both cases should be blocked because QEMU should extract additional
> vCPU state from KVM that should be migrated as part of vCPU VMState.
> E.g. Whether vCPU is running in guest-mode or host-mode.
> 
> Fixes: d98f26073beb ("target/i386: kvm: add VMX migration blocker")
> Signed-off-by: Liran Alon <liran.alon@oracle.com>
> ---
>  target/i386/cpu.c |  6 ------
>  target/i386/cpu.h | 24 ++++++++++++++++++++++++
>  target/i386/kvm.c | 12 ++++++------
>  3 files changed, 30 insertions(+), 12 deletions(-)
> 
> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> index 536d7d152044..197201087e65 100644
> --- a/target/i386/cpu.c
> +++ b/target/i386/cpu.c
> @@ -5170,12 +5170,6 @@ static int x86_cpu_filter_features(X86CPU *cpu)
>      return rv;
>  }
>  
> -#define IS_INTEL_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_INTEL_1 && \
> -                           (env)->cpuid_vendor2 == CPUID_VENDOR_INTEL_2 && \
> -                           (env)->cpuid_vendor3 == CPUID_VENDOR_INTEL_3)
> -#define IS_AMD_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_AMD_1 && \
> -                         (env)->cpuid_vendor2 == CPUID_VENDOR_AMD_2 && \
> -                         (env)->cpuid_vendor3 == CPUID_VENDOR_AMD_3)
>  static void x86_cpu_realizefn(DeviceState *dev, Error **errp)
>  {
>      CPUState *cs = CPU(dev);
> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> index fce6660bac00..79d9495ceb0c 100644
> --- a/target/i386/cpu.h
> +++ b/target/i386/cpu.h
> @@ -728,6 +728,13 @@ typedef uint32_t FeatureWordArray[FEATURE_WORDS];
>  
>  #define CPUID_VENDOR_HYGON    "HygonGenuine"
>  
> +#define IS_INTEL_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_INTEL_1 && \
> +                           (env)->cpuid_vendor2 == CPUID_VENDOR_INTEL_2 && \
> +                           (env)->cpuid_vendor3 == CPUID_VENDOR_INTEL_3)
> +#define IS_AMD_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_AMD_1 && \
> +                         (env)->cpuid_vendor2 == CPUID_VENDOR_AMD_2 && \
> +                         (env)->cpuid_vendor3 == CPUID_VENDOR_AMD_3)
> +
>  #define CPUID_MWAIT_IBE     (1U << 1) /* Interrupts can exit capability */
>  #define CPUID_MWAIT_EMX     (1U << 0) /* enumeration supported */
>  
> @@ -1866,6 +1873,23 @@ static inline int32_t x86_get_a20_mask(CPUX86State *env)
>      }
>  }
>  
> +static inline bool cpu_has_vmx(CPUX86State *env)
> +{
> +    return (IS_INTEL_CPU(env) &&
> +            (env->features[FEAT_1_ECX] & CPUID_EXT_VMX));
> +}
> +
> +static inline bool cpu_has_svm(CPUX86State *env)
> +{
> +    return (IS_AMD_CPU(env) &&
> +            (env->features[FEAT_8000_0001_ECX] & CPUID_EXT3_SVM));

Note that the Hygon Dhyana seems to have SVM set as well, see
target/i386/cpu.c and search for "Dhyana"

(Note the AMD users are going to be a bit more surprised by this
restriction than Intel users, since Nested has been enabled on AMD for a
long time).

Dave

> +}
> +
> +static inline bool cpu_has_nested_virt(CPUX86State *env)
> +{
> +    return (cpu_has_vmx(env) || cpu_has_svm(env));
> +}
> +
>  /* fpu_helper.c */
>  void update_fp_status(CPUX86State *env);
>  void update_mxcsr_status(CPUX86State *env);
> diff --git a/target/i386/kvm.c b/target/i386/kvm.c
> index c8fd53055d37..f43e2d69859e 100644
> --- a/target/i386/kvm.c
> +++ b/target/i386/kvm.c
> @@ -906,7 +906,7 @@ static int hyperv_init_vcpu(X86CPU *cpu)
>  }
>  
>  static Error *invtsc_mig_blocker;
> -static Error *vmx_mig_blocker;
> +static Error *nested_virt_mig_blocker;
>  
>  #define KVM_MAX_CPUID_ENTRIES  100
>  
> @@ -1270,13 +1270,13 @@ int kvm_arch_init_vcpu(CPUState *cs)
>                                    !!(c->ecx & CPUID_EXT_SMX);
>      }
>  
> -    if ((env->features[FEAT_1_ECX] & CPUID_EXT_VMX) && !vmx_mig_blocker) {
> -        error_setg(&vmx_mig_blocker,
> -                   "Nested VMX virtualization does not support live migration yet");
> -        r = migrate_add_blocker(vmx_mig_blocker, &local_err);
> +    if (cpu_has_nested_virt(env) && !nested_virt_mig_blocker) {
> +        error_setg(&nested_virt_mig_blocker,
> +                   "Nested virtualization does not support live migration yet");
> +        r = migrate_add_blocker(nested_virt_mig_blocker, &local_err);
>          if (local_err) {
>              error_report_err(local_err);
> -            error_free(vmx_mig_blocker);
> +            error_free(nested_virt_mig_blocker);
>              return r;
>          }
>      }
> -- 
> 2.20.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [QEMU PATCH v3 6/9] vmstate: Add support for kernel integer types
  2019-06-17 17:56 ` [QEMU PATCH v3 6/9] vmstate: Add support for kernel integer types Liran Alon
@ 2019-06-18  8:55   ` Dr. David Alan Gilbert
  2019-06-18 15:36     ` Liran Alon
  0 siblings, 1 reply; 27+ messages in thread
From: Dr. David Alan Gilbert @ 2019-06-18  8:55 UTC (permalink / raw)
  To: Liran Alon
  Cc: qemu-devel, pbonzini, mtosatti, rth, ehabkost, kvm, jmattson,
	maran.wilson, Nikita Leshenko

* Liran Alon (liran.alon@oracle.com) wrote:
> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
> Reviewed-by: Maran Wilson <maran.wilson@oracle.com>
> Signed-off-by: Liran Alon <liran.alon@oracle.com>
> ---
>  include/migration/vmstate.h | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> index 9224370ed59a..a85424fb0483 100644
> --- a/include/migration/vmstate.h
> +++ b/include/migration/vmstate.h
> @@ -797,6 +797,15 @@ extern const VMStateInfo vmstate_info_qtailq;
>  #define VMSTATE_UINT64_V(_f, _s, _v)                                  \
>      VMSTATE_SINGLE(_f, _s, _v, vmstate_info_uint64, uint64_t)

A comment here stating they're for Linux kernel types would be nice.

> +#define VMSTATE_U8_V(_f, _s, _v)                                   \
> +    VMSTATE_SINGLE(_f, _s, _v, vmstate_info_uint8, __u8)
> +#define VMSTATE_U16_V(_f, _s, _v)                                  \
> +    VMSTATE_SINGLE(_f, _s, _v, vmstate_info_uint16, __u16)
> +#define VMSTATE_U32_V(_f, _s, _v)                                  \
> +    VMSTATE_SINGLE(_f, _s, _v, vmstate_info_uint32, __u32)
> +#define VMSTATE_U64_V(_f, _s, _v)                                  \
> +    VMSTATE_SINGLE(_f, _s, _v, vmstate_info_uint64, __u64)
> +

Have you checked that builds OK on a non-Linux system?

Dave

>  #define VMSTATE_BOOL(_f, _s)                                          \
>      VMSTATE_BOOL_V(_f, _s, 0)
>  
> @@ -818,6 +827,15 @@ extern const VMStateInfo vmstate_info_qtailq;
>  #define VMSTATE_UINT64(_f, _s)                                        \
>      VMSTATE_UINT64_V(_f, _s, 0)
>  
> +#define VMSTATE_U8(_f, _s)                                         \
> +    VMSTATE_U8_V(_f, _s, 0)
> +#define VMSTATE_U16(_f, _s)                                        \
> +    VMSTATE_U16_V(_f, _s, 0)
> +#define VMSTATE_U32(_f, _s)                                        \
> +    VMSTATE_U32_V(_f, _s, 0)
> +#define VMSTATE_U64(_f, _s)                                        \
> +    VMSTATE_U64_V(_f, _s, 0)
> +
>  #define VMSTATE_UINT8_EQUAL(_f, _s, _err_hint)                        \
>      VMSTATE_SINGLE_FULL(_f, _s, 0, 0,                                 \
>                          vmstate_info_uint8_equal, uint8_t, _err_hint)
> -- 
> 2.20.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [QEMU PATCH v3 7/9] KVM: i386: Add support for save and restore nested state
  2019-06-17 17:56 ` [QEMU PATCH v3 7/9] KVM: i386: Add support for save and restore nested state Liran Alon
@ 2019-06-18  9:03   ` Dr. David Alan Gilbert
  2019-06-18 15:40     ` Liran Alon
  2019-06-18 22:16   ` [Qemu-devel] " Maran Wilson
  1 sibling, 1 reply; 27+ messages in thread
From: Dr. David Alan Gilbert @ 2019-06-18  9:03 UTC (permalink / raw)
  To: Liran Alon
  Cc: qemu-devel, pbonzini, mtosatti, rth, ehabkost, kvm, jmattson,
	maran.wilson, Nikita Leshenko

* Liran Alon (liran.alon@oracle.com) wrote:
> Kernel commit 8fcc4b5923af ("kvm: nVMX: Introduce KVM_CAP_NESTED_STATE")
> introduced new IOCTLs to extract and restore vCPU state related to
> Intel VMX & AMD SVM.
> 
> Utilize these IOCTLs to add support for migration of VMs which are
> running nested hypervisors.
> 
> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
> Signed-off-by: Liran Alon <liran.alon@oracle.com>
> ---
>  accel/kvm/kvm-all.c   |   8 ++
>  include/sysemu/kvm.h  |   1 +
>  target/i386/cpu.h     |   3 +
>  target/i386/kvm.c     |  80 +++++++++++++++++
>  target/i386/machine.c | 196 ++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 288 insertions(+)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 59a3aa3a40da..4fdf5b04b131 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -88,6 +88,7 @@ struct KVMState
>  #ifdef KVM_CAP_SET_GUEST_DEBUG
>      QTAILQ_HEAD(, kvm_sw_breakpoint) kvm_sw_breakpoints;
>  #endif
> +    int max_nested_state_len;
>      int many_ioeventfds;
>      int intx_set_mask;
>      bool sync_mmu;
> @@ -1678,6 +1679,8 @@ static int kvm_init(MachineState *ms)
>      s->debugregs = kvm_check_extension(s, KVM_CAP_DEBUGREGS);
>  #endif
>  
> +    s->max_nested_state_len = kvm_check_extension(s, KVM_CAP_NESTED_STATE);
> +
>  #ifdef KVM_CAP_IRQ_ROUTING
>      kvm_direct_msi_allowed = (kvm_check_extension(s, KVM_CAP_SIGNAL_MSI) > 0);
>  #endif
> @@ -2245,6 +2248,11 @@ int kvm_has_debugregs(void)
>      return kvm_state->debugregs;
>  }
>  
> +int kvm_max_nested_state_length(void)
> +{
> +    return kvm_state->max_nested_state_len;
> +}
> +
>  int kvm_has_many_ioeventfds(void)
>  {
>      if (!kvm_enabled()) {
> diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
> index 64f55e519df7..acd90aebb6c4 100644
> --- a/include/sysemu/kvm.h
> +++ b/include/sysemu/kvm.h
> @@ -210,6 +210,7 @@ bool kvm_has_sync_mmu(void);
>  int kvm_has_vcpu_events(void);
>  int kvm_has_robust_singlestep(void);
>  int kvm_has_debugregs(void);
> +int kvm_max_nested_state_length(void);
>  int kvm_has_pit_state2(void);
>  int kvm_has_many_ioeventfds(void);
>  int kvm_has_gsi_routing(void);
> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> index 79d9495ceb0c..a6bb71849869 100644
> --- a/target/i386/cpu.h
> +++ b/target/i386/cpu.h
> @@ -1350,6 +1350,9 @@ typedef struct CPUX86State {
>  #if defined(CONFIG_KVM) || defined(CONFIG_HVF)
>      void *xsave_buf;
>  #endif
> +#if defined(CONFIG_KVM)
> +    struct kvm_nested_state *nested_state;
> +#endif
>  #if defined(CONFIG_HVF)
>      HVFX86EmulatorState *hvf_emul;
>  #endif
> diff --git a/target/i386/kvm.c b/target/i386/kvm.c
> index f43e2d69859e..5950c3ed0d1c 100644
> --- a/target/i386/kvm.c
> +++ b/target/i386/kvm.c
> @@ -931,6 +931,7 @@ int kvm_arch_init_vcpu(CPUState *cs)
>      struct kvm_cpuid_entry2 *c;
>      uint32_t signature[3];
>      int kvm_base = KVM_CPUID_SIGNATURE;
> +    int max_nested_state_len;
>      int r;
>      Error *local_err = NULL;
>  
> @@ -1331,6 +1332,24 @@ int kvm_arch_init_vcpu(CPUState *cs)
>      if (has_xsave) {
>          env->xsave_buf = qemu_memalign(4096, sizeof(struct kvm_xsave));
>      }
> +
> +    max_nested_state_len = kvm_max_nested_state_length();
> +    if (max_nested_state_len > 0) {
> +        assert(max_nested_state_len >= offsetof(struct kvm_nested_state, data));
> +        env->nested_state = g_malloc0(max_nested_state_len);
> +
> +        env->nested_state->size = max_nested_state_len;
> +
> +        if (IS_INTEL_CPU(env)) {
> +            struct kvm_vmx_nested_state_hdr *vmx_hdr =
> +                &env->nested_state->hdr.vmx;
> +
> +            vmx_hdr->vmxon_pa = -1ull;
> +            vmx_hdr->vmcs12_pa = -1ull;
> +        }
> +
> +    }
> +
>      cpu->kvm_msr_buf = g_malloc0(MSR_BUF_SIZE);
>  
>      if (!(env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_RDTSCP)) {
> @@ -1352,12 +1371,18 @@ int kvm_arch_init_vcpu(CPUState *cs)
>  int kvm_arch_destroy_vcpu(CPUState *cs)
>  {
>      X86CPU *cpu = X86_CPU(cs);
> +    CPUX86State *env = &cpu->env;
>  
>      if (cpu->kvm_msr_buf) {
>          g_free(cpu->kvm_msr_buf);
>          cpu->kvm_msr_buf = NULL;
>      }
>  
> +    if (env->nested_state) {
> +        g_free(env->nested_state);
> +        env->nested_state = NULL;
> +    }
> +
>      return 0;
>  }
>  
> @@ -3072,6 +3097,52 @@ static int kvm_get_debugregs(X86CPU *cpu)
>      return 0;
>  }
>  
> +static int kvm_put_nested_state(X86CPU *cpu)
> +{
> +    CPUX86State *env = &cpu->env;
> +    int max_nested_state_len = kvm_max_nested_state_length();
> +
> +    if (max_nested_state_len <= 0) {
> +        return 0;
> +    }
> +
> +    assert(env->nested_state->size <= max_nested_state_len);
> +    return kvm_vcpu_ioctl(CPU(cpu), KVM_SET_NESTED_STATE, env->nested_state);
> +}
> +
> +static int kvm_get_nested_state(X86CPU *cpu)
> +{
> +    CPUX86State *env = &cpu->env;
> +    int max_nested_state_len = kvm_max_nested_state_length();
> +    int ret;
> +
> +    if (max_nested_state_len <= 0) {
> +        return 0;
> +    }
> +
> +    /*
> +     * It is possible that migration restored a smaller size into
> +     * nested_state->hdr.size than what our kernel support.
> +     * We preserve migration origin nested_state->hdr.size for
> +     * call to KVM_SET_NESTED_STATE but wish that our next call
> +     * to KVM_GET_NESTED_STATE will use max size our kernel support.
> +     */
> +    env->nested_state->size = max_nested_state_len;
> +
> +    ret = kvm_vcpu_ioctl(CPU(cpu), KVM_GET_NESTED_STATE, env->nested_state);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    if (env->nested_state->flags & KVM_STATE_NESTED_GUEST_MODE) {
> +        env->hflags |= HF_GUEST_MASK;
> +    } else {
> +        env->hflags &= ~HF_GUEST_MASK;
> +    }
> +
> +    return ret;
> +}
> +
>  int kvm_arch_put_registers(CPUState *cpu, int level)
>  {
>      X86CPU *x86_cpu = X86_CPU(cpu);
> @@ -3079,6 +3150,11 @@ int kvm_arch_put_registers(CPUState *cpu, int level)
>  
>      assert(cpu_is_stopped(cpu) || qemu_cpu_is_self(cpu));
>  
> +    ret = kvm_put_nested_state(x86_cpu);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
>      if (level >= KVM_PUT_RESET_STATE) {
>          ret = kvm_put_msr_feature_control(x86_cpu);
>          if (ret < 0) {
> @@ -3194,6 +3270,10 @@ int kvm_arch_get_registers(CPUState *cs)
>      if (ret < 0) {
>          goto out;
>      }
> +    ret = kvm_get_nested_state(cpu);
> +    if (ret < 0) {
> +        goto out;
> +    }
>      ret = 0;
>   out:
>      cpu_sync_bndcs_hflags(&cpu->env);
> diff --git a/target/i386/machine.c b/target/i386/machine.c
> index 225b5d433bc4..95299ebff44a 100644
> --- a/target/i386/machine.c
> +++ b/target/i386/machine.c
> @@ -231,6 +231,15 @@ static int cpu_pre_save(void *opaque)
>          env->segs[R_SS].flags &= ~(env->segs[R_SS].flags & DESC_DPL_MASK);
>      }
>  
> +#ifdef CONFIG_KVM
> +    /* Verify we have nested virtualization state from kernel if required */
> +    if (cpu_has_nested_virt(env) && !env->nested_state) {
> +        error_report("Guest enabled nested virtualization but kernel "
> +                "does not support saving of nested state");
> +        return -EINVAL;
> +    }
> +#endif
> +
>      return 0;
>  }
>  
> @@ -278,6 +287,16 @@ static int cpu_post_load(void *opaque, int version_id)
>      env->hflags &= ~HF_CPL_MASK;
>      env->hflags |= (env->segs[R_SS].flags >> DESC_DPL_SHIFT) & HF_CPL_MASK;
>  
> +#ifdef CONFIG_KVM
> +    if ((env->hflags & HF_GUEST_MASK) &&
> +        (!env->nested_state ||
> +        !(env->nested_state->flags & KVM_STATE_NESTED_GUEST_MODE))) {
> +        error_report("vCPU set in guest-mode inconsistent with "
> +                     "migrated kernel nested state");
> +        return -EINVAL;
> +    }
> +#endif
> +
>      env->fpstt = (env->fpus_vmstate >> 11) & 7;
>      env->fpus = env->fpus_vmstate & ~0x3800;
>      env->fptag_vmstate ^= 0xff;
> @@ -851,6 +870,180 @@ static const VMStateDescription vmstate_tsc_khz = {
>      }
>  };
>  
> +#ifdef CONFIG_KVM
> +
> +static bool vmx_vmcs12_needed(void *opaque)
> +{
> +    struct kvm_nested_state *nested_state = opaque;
> +    return (nested_state->size >
> +            offsetof(struct kvm_nested_state, data.vmx[0].vmcs12));
> +}
> +
> +static const VMStateDescription vmstate_vmx_vmcs12 = {
> +	.name = "cpu/kvm_nested_state/vmx/vmcs12",
> +	.version_id = 1,
> +	.minimum_version_id = 1,
> +	.needed = vmx_vmcs12_needed,
> +	.fields = (VMStateField[]) {
> +	    VMSTATE_UINT8_ARRAY(data.vmx[0].vmcs12,
> +	                        struct kvm_nested_state, 0x1000),

Where did that magic 0x1000 come from?

> +	    VMSTATE_END_OF_LIST()
> +    }
> +};
> +
> +static bool vmx_shadow_vmcs12_needed(void *opaque)
> +{
> +    struct kvm_nested_state *nested_state = opaque;
> +    return (nested_state->size >
> +            offsetof(struct kvm_nested_state, data.vmx[0].shadow_vmcs12));
> +}
> +
> +static const VMStateDescription vmstate_vmx_shadow_vmcs12 = {
> +	.name = "cpu/kvm_nested_state/vmx/shadow_vmcs12",
> +	.version_id = 1,
> +	.minimum_version_id = 1,
> +	.needed = vmx_shadow_vmcs12_needed,
> +	.fields = (VMStateField[]) {
> +	    VMSTATE_UINT8_ARRAY(data.vmx[0].shadow_vmcs12,
> +	                        struct kvm_nested_state, 0x1000),
> +	    VMSTATE_END_OF_LIST()
> +    }
> +};
> +
> +static bool vmx_nested_state_needed(void *opaque)
> +{
> +    struct kvm_nested_state *nested_state = opaque;
> +
> +    return ((nested_state->format == KVM_STATE_NESTED_FORMAT_VMX) &&
> +            ((nested_state->hdr.vmx.vmxon_pa != -1ull) ||
> +             (nested_state->hdr.vmx.smm.flags & KVM_STATE_NESTED_SMM_VMXON)));
> +}
> +
> +static const VMStateDescription vmstate_vmx_nested_state = {
> +	.name = "cpu/kvm_nested_state/vmx",
> +	.version_id = 1,
> +	.minimum_version_id = 1,
> +	.needed = vmx_nested_state_needed,
> +	.fields = (VMStateField[]) {
> +	    VMSTATE_U64(hdr.vmx.vmxon_pa, struct kvm_nested_state),
> +	    VMSTATE_U64(hdr.vmx.vmcs12_pa, struct kvm_nested_state),
> +	    VMSTATE_U16(hdr.vmx.smm.flags, struct kvm_nested_state),
> +	    VMSTATE_END_OF_LIST()
> +    },
> +    .subsections = (const VMStateDescription*[]) {
> +        &vmstate_vmx_vmcs12,
> +        &vmstate_vmx_shadow_vmcs12,
> +        NULL,
> +    }
> +};
> +
> +static bool svm_nested_state_needed(void *opaque)
> +{
> +    struct kvm_nested_state *nested_state = opaque;
> +
> +    return (nested_state->format == KVM_STATE_NESTED_FORMAT_SVM);
> +}
> +
> +static const VMStateDescription vmstate_svm_nested_state = {
> +	.name = "cpu/kvm_nested_state/svm",
> +	.version_id = 1,
> +	.minimum_version_id = 1,
> +	.needed = svm_nested_state_needed,
> +	.fields = (VMStateField[]) {
> +	    VMSTATE_END_OF_LIST()
> +    }
> +};
> +
> +static bool nested_state_needed(void *opaque)
> +{
> +    X86CPU *cpu = opaque;
> +    CPUX86State *env = &cpu->env;
> +
> +    return (env->nested_state &&
> +            (vmx_nested_state_needed(env->nested_state) ||
> +             svm_nested_state_needed(env->nested_state)));
> +}
> +
> +static int nested_state_post_load(void *opaque, int version_id)
> +{
> +    X86CPU *cpu = opaque;
> +    CPUX86State *env = &cpu->env;
> +    struct kvm_nested_state *nested_state = env->nested_state;
> +    int min_nested_state_len = offsetof(struct kvm_nested_state, data);
> +    int max_nested_state_len = kvm_max_nested_state_length();
> +
> +    /*
> +     * If our kernel don't support setting nested state
> +     * and we have received nested state from migration stream,
> +     * we need to fail migration
> +     */
> +    if (max_nested_state_len <= 0) {
> +        error_report("Received nested state when kernel cannot restore it");
> +        return -EINVAL;
> +    }
> +
> +    /*
> +     * Verify that the size of received nested_state struct
> +     * at least cover required header and is not larger
> +     * than the max size that our kernel support
> +     */
> +    if (nested_state->size < min_nested_state_len) {
> +        error_report("Received nested state size less than min: "
> +                     "len=%d, min=%d",
> +                     nested_state->size, min_nested_state_len);
> +        return -EINVAL;
> +    }
> +    if (nested_state->size > max_nested_state_len) {
> +        error_report("Recieved unsupported nested state size: "
> +                     "nested_state->size=%d, max=%d",
> +                     nested_state->size, max_nested_state_len);
> +        return -EINVAL;
> +    }
> +
> +    /* Verify format is valid */
> +    if ((nested_state->format != KVM_STATE_NESTED_FORMAT_VMX) &&
> +        (nested_state->format != KVM_STATE_NESTED_FORMAT_SVM)) {
> +        error_report("Received invalid nested state format: %d",
> +                     nested_state->format);
> +        return -EINVAL;
> +    }
> +
> +    return 0;
> +}
> +
> +static const VMStateDescription vmstate_kvm_nested_state = {
> +    .name = "cpu/kvm_nested_state",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_U16(flags, struct kvm_nested_state),
> +        VMSTATE_U16(format, struct kvm_nested_state),
> +        VMSTATE_U32(size, struct kvm_nested_state),
> +        VMSTATE_END_OF_LIST()
> +    },
> +    .subsections = (const VMStateDescription*[]) {
> +        &vmstate_vmx_nested_state,
> +        &vmstate_svm_nested_state,
> +        NULL
> +    }
> +};
> +
> +static const VMStateDescription vmstate_nested_state = {
> +    .name = "cpu/nested_state",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .needed = nested_state_needed,
> +    .post_load = nested_state_post_load,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_STRUCT_POINTER(env.nested_state, X86CPU,
> +                               vmstate_kvm_nested_state,
> +                               struct kvm_nested_state),
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
> +#endif
> +
>  static bool mcg_ext_ctl_needed(void *opaque)
>  {
>      X86CPU *cpu = opaque;
> @@ -1089,6 +1282,9 @@ VMStateDescription vmstate_x86_cpu = {
>          &vmstate_msr_intel_pt,
>          &vmstate_msr_virt_ssbd,
>          &vmstate_svm_npt,
> +#ifdef CONFIG_KVM
> +        &vmstate_nested_state,
> +#endif
>          NULL
>      }
>  };
> -- 
> 2.20.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [QEMU PATCH v3 8/9] KVM: i386: Add support for KVM_CAP_EXCEPTION_PAYLOAD
  2019-06-17 17:56 ` [QEMU PATCH v3 8/9] KVM: i386: Add support for KVM_CAP_EXCEPTION_PAYLOAD Liran Alon
@ 2019-06-18  9:07   ` Dr. David Alan Gilbert
  2019-06-18 15:45     ` Liran Alon
  0 siblings, 1 reply; 27+ messages in thread
From: Dr. David Alan Gilbert @ 2019-06-18  9:07 UTC (permalink / raw)
  To: Liran Alon
  Cc: qemu-devel, pbonzini, mtosatti, rth, ehabkost, kvm, jmattson,
	maran.wilson, Nikita Leshenko

* Liran Alon (liran.alon@oracle.com) wrote:
> Kernel commit c4f55198c7c2 ("kvm: x86: Introduce KVM_CAP_EXCEPTION_PAYLOAD")
> introduced a new KVM capability which allows userspace to correctly
> distinguish between pending and injected exceptions.
> 
> This distinguish is important in case of nested virtualization scenarios
> because a L2 pending exception can still be intercepted by the L1 hypervisor
> while a L2 injected exception cannot.
> 
> Furthermore, when an exception is attempted to be injected by QEMU,
> QEMU should specify the exception payload (CR2 in case of #PF or
> DR6 in case of #DB) instead of having the payload already delivered in
> the respective vCPU register. Because in case exception is injected to
> L2 guest and is intercepted by L1 hypervisor, then payload needs to be
> reported to L1 intercept (VMExit handler) while still preserving
> respective vCPU register unchanged.
> 
> This commit adds support for QEMU to properly utilise this new KVM
> capability (KVM_CAP_EXCEPTION_PAYLOAD).

Does this kvm capability become a requirement for the nested migration
then? If so, is it wired into the blockers?

Dave

> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
> Signed-off-by: Liran Alon <liran.alon@oracle.com>
> ---
>  target/i386/cpu.c        |   6 ++-
>  target/i386/cpu.h        |   6 ++-
>  target/i386/hvf/hvf.c    |  10 ++--
>  target/i386/hvf/x86hvf.c |   4 +-
>  target/i386/kvm.c        | 101 ++++++++++++++++++++++++++++++++-------
>  target/i386/machine.c    |  84 +++++++++++++++++++++++++++++++-
>  6 files changed, 187 insertions(+), 24 deletions(-)
> 
> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> index 197201087e65..a026e49f5c0d 100644
> --- a/target/i386/cpu.c
> +++ b/target/i386/cpu.c
> @@ -4774,7 +4774,11 @@ static void x86_cpu_reset(CPUState *s)
>      memset(env->mtrr_fixed, 0, sizeof(env->mtrr_fixed));
>  
>      env->interrupt_injected = -1;
> -    env->exception_injected = -1;
> +    env->exception_nr = -1;
> +    env->exception_pending = 0;
> +    env->exception_injected = 0;
> +    env->exception_has_payload = false;
> +    env->exception_payload = 0;
>      env->nmi_injected = false;
>  #if !defined(CONFIG_USER_ONLY)
>      /* We hard-wire the BSP to the first CPU. */
> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> index a6bb71849869..e2ac4132972d 100644
> --- a/target/i386/cpu.h
> +++ b/target/i386/cpu.h
> @@ -1338,10 +1338,14 @@ typedef struct CPUX86State {
>  
>      /* For KVM */
>      uint32_t mp_state;
> -    int32_t exception_injected;
> +    int32_t exception_nr;
>      int32_t interrupt_injected;
>      uint8_t soft_interrupt;
> +    uint8_t exception_pending;
> +    uint8_t exception_injected;
>      uint8_t has_error_code;
> +    uint8_t exception_has_payload;
> +    uint64_t exception_payload;
>      uint32_t ins_len;
>      uint32_t sipi_vector;
>      bool tsc_valid;
> diff --git a/target/i386/hvf/hvf.c b/target/i386/hvf/hvf.c
> index 2751c8125ca2..dc4bb63536c8 100644
> --- a/target/i386/hvf/hvf.c
> +++ b/target/i386/hvf/hvf.c
> @@ -605,7 +605,9 @@ static void hvf_store_events(CPUState *cpu, uint32_t ins_len, uint64_t idtvec_in
>      X86CPU *x86_cpu = X86_CPU(cpu);
>      CPUX86State *env = &x86_cpu->env;
>  
> -    env->exception_injected = -1;
> +    env->exception_nr = -1;
> +    env->exception_pending = 0;
> +    env->exception_injected = 0;
>      env->interrupt_injected = -1;
>      env->nmi_injected = false;
>      if (idtvec_info & VMCS_IDT_VEC_VALID) {
> @@ -619,7 +621,8 @@ static void hvf_store_events(CPUState *cpu, uint32_t ins_len, uint64_t idtvec_in
>              break;
>          case VMCS_IDT_VEC_HWEXCEPTION:
>          case VMCS_IDT_VEC_SWEXCEPTION:
> -            env->exception_injected = idtvec_info & VMCS_IDT_VEC_VECNUM;
> +            env->exception_nr = idtvec_info & VMCS_IDT_VEC_VECNUM;
> +            env->exception_injected = 1;
>              break;
>          case VMCS_IDT_VEC_PRIV_SWEXCEPTION:
>          default:
> @@ -912,7 +915,8 @@ int hvf_vcpu_exec(CPUState *cpu)
>              macvm_set_rip(cpu, rip + ins_len);
>              break;
>          case VMX_REASON_VMCALL:
> -            env->exception_injected = EXCP0D_GPF;
> +            env->exception_nr = EXCP0D_GPF;
> +            env->exception_injected = 1;
>              env->has_error_code = true;
>              env->error_code = 0;
>              break;
> diff --git a/target/i386/hvf/x86hvf.c b/target/i386/hvf/x86hvf.c
> index df8e946fbcde..e0ea02d631e6 100644
> --- a/target/i386/hvf/x86hvf.c
> +++ b/target/i386/hvf/x86hvf.c
> @@ -362,8 +362,8 @@ bool hvf_inject_interrupts(CPUState *cpu_state)
>      if (env->interrupt_injected != -1) {
>          vector = env->interrupt_injected;
>          intr_type = VMCS_INTR_T_SWINTR;
> -    } else if (env->exception_injected != -1) {
> -        vector = env->exception_injected;
> +    } else if (env->exception_nr != -1) {
> +        vector = env->exception_nr;
>          if (vector == EXCP03_INT3 || vector == EXCP04_INTO) {
>              intr_type = VMCS_INTR_T_SWEXCEPTION;
>          } else {
> diff --git a/target/i386/kvm.c b/target/i386/kvm.c
> index 5950c3ed0d1c..797f8ac46435 100644
> --- a/target/i386/kvm.c
> +++ b/target/i386/kvm.c
> @@ -104,6 +104,7 @@ static uint32_t num_architectural_pmu_fixed_counters;
>  static int has_xsave;
>  static int has_xcrs;
>  static int has_pit_state2;
> +static int has_exception_payload;
>  
>  static bool has_msr_mcg_ext_ctl;
>  
> @@ -584,15 +585,56 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
>      /* Hope we are lucky for AO MCE */
>  }
>  
> +static void kvm_reset_exception(CPUX86State *env)
> +{
> +	env->exception_nr = -1;
> +	env->exception_pending = 0;
> +	env->exception_injected = 0;
> +	env->exception_has_payload = false;
> +	env->exception_payload = 0;
> +}
> +
> +static void kvm_queue_exception(CPUX86State *env,
> +                                int32_t exception_nr,
> +                                uint8_t exception_has_payload,
> +                                uint64_t exception_payload)
> +{
> +    assert(env->exception_nr == -1);
> +    assert(!env->exception_pending);
> +    assert(!env->exception_injected);
> +    assert(!env->exception_has_payload);
> +
> +    env->exception_nr = exception_nr;
> +
> +    if (has_exception_payload) {
> +        env->exception_pending = 1;
> +
> +        env->exception_has_payload = exception_has_payload;
> +        env->exception_payload = exception_payload;
> +    } else {
> +        env->exception_injected = 1;
> +
> +        if (exception_nr == EXCP01_DB) {
> +            assert(exception_has_payload);
> +            env->dr[6] = exception_payload;
> +        } else if (exception_nr == EXCP0E_PAGE) {
> +            assert(exception_has_payload);
> +            env->cr[2] = exception_payload;
> +        } else {
> +            assert(!exception_has_payload);
> +        }
> +    }
> +}
> +
>  static int kvm_inject_mce_oldstyle(X86CPU *cpu)
>  {
>      CPUX86State *env = &cpu->env;
>  
> -    if (!kvm_has_vcpu_events() && env->exception_injected == EXCP12_MCHK) {
> +    if (!kvm_has_vcpu_events() && env->exception_nr == EXCP12_MCHK) {
>          unsigned int bank, bank_num = env->mcg_cap & 0xff;
>          struct kvm_x86_mce mce;
>  
> -        env->exception_injected = -1;
> +        kvm_reset_exception(env);
>  
>          /*
>           * There must be at least one bank in use if an MCE is pending.
> @@ -1610,6 +1652,16 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
>  
>      hv_vpindex_settable = kvm_check_extension(s, KVM_CAP_HYPERV_VP_INDEX);
>  
> +    has_exception_payload = kvm_check_extension(s, KVM_CAP_EXCEPTION_PAYLOAD);
> +    if (has_exception_payload) {
> +        ret = kvm_vm_enable_cap(s, KVM_CAP_EXCEPTION_PAYLOAD, 0, true);
> +        if (ret < 0) {
> +            error_report("kvm: Failed to enable exception payload cap: %s",
> +                         strerror(-ret));
> +            return ret;
> +        }
> +    }
> +
>      ret = kvm_get_supported_msrs(s);
>      if (ret < 0) {
>          return ret;
> @@ -2914,8 +2966,16 @@ static int kvm_put_vcpu_events(X86CPU *cpu, int level)
>          return 0;
>      }
>  
> -    events.exception.injected = (env->exception_injected >= 0);
> -    events.exception.nr = env->exception_injected;
> +    events.flags = 0;
> +
> +    if (has_exception_payload) {
> +        events.flags |= KVM_VCPUEVENT_VALID_PAYLOAD;
> +        events.exception.pending = env->exception_pending;
> +        events.exception_has_payload = env->exception_has_payload;
> +        events.exception_payload = env->exception_payload;
> +    }
> +    events.exception.nr = env->exception_nr;
> +    events.exception.injected = env->exception_injected;
>      events.exception.has_error_code = env->has_error_code;
>      events.exception.error_code = env->error_code;
>  
> @@ -2928,7 +2988,6 @@ static int kvm_put_vcpu_events(X86CPU *cpu, int level)
>      events.nmi.masked = !!(env->hflags2 & HF2_NMI_MASK);
>  
>      events.sipi_vector = env->sipi_vector;
> -    events.flags = 0;
>  
>      if (has_msr_smbase) {
>          events.smi.smm = !!(env->hflags & HF_SMM_MASK);
> @@ -2978,8 +3037,19 @@ static int kvm_get_vcpu_events(X86CPU *cpu)
>      if (ret < 0) {
>         return ret;
>      }
> -    env->exception_injected =
> -       events.exception.injected ? events.exception.nr : -1;
> +
> +    if (events.flags & KVM_VCPUEVENT_VALID_PAYLOAD) {
> +        env->exception_pending = events.exception.pending;
> +        env->exception_has_payload = events.exception_has_payload;
> +        env->exception_payload = events.exception_payload;
> +    } else {
> +        env->exception_pending = 0;
> +        env->exception_has_payload = false;
> +    }
> +    env->exception_injected = events.exception.injected;
> +    env->exception_nr =
> +        (env->exception_pending || env->exception_injected) ?
> +        events.exception.nr : -1;
>      env->has_error_code = events.exception.has_error_code;
>      env->error_code = events.exception.error_code;
>  
> @@ -3031,12 +3101,12 @@ static int kvm_guest_debug_workarounds(X86CPU *cpu)
>      unsigned long reinject_trap = 0;
>  
>      if (!kvm_has_vcpu_events()) {
> -        if (env->exception_injected == EXCP01_DB) {
> +        if (env->exception_nr == EXCP01_DB) {
>              reinject_trap = KVM_GUESTDBG_INJECT_DB;
>          } else if (env->exception_injected == EXCP03_INT3) {
>              reinject_trap = KVM_GUESTDBG_INJECT_BP;
>          }
> -        env->exception_injected = -1;
> +        kvm_reset_exception(env);
>      }
>  
>      /*
> @@ -3412,13 +3482,13 @@ int kvm_arch_process_async_events(CPUState *cs)
>  
>          kvm_cpu_synchronize_state(cs);
>  
> -        if (env->exception_injected == EXCP08_DBLE) {
> +        if (env->exception_nr == EXCP08_DBLE) {
>              /* this means triple fault */
>              qemu_system_reset_request(SHUTDOWN_CAUSE_GUEST_RESET);
>              cs->exit_request = 1;
>              return 0;
>          }
> -        env->exception_injected = EXCP12_MCHK;
> +        kvm_queue_exception(env, EXCP12_MCHK, 0, 0);
>          env->has_error_code = 0;
>  
>          cs->halted = 0;
> @@ -3633,14 +3703,13 @@ static int kvm_handle_debug(X86CPU *cpu,
>      }
>      if (ret == 0) {
>          cpu_synchronize_state(cs);
> -        assert(env->exception_injected == -1);
> +        assert(env->exception_nr == -1);
>  
>          /* pass to guest */
> -        env->exception_injected = arch_info->exception;
> +        kvm_queue_exception(env, arch_info->exception,
> +                            arch_info->exception == EXCP01_DB,
> +                            arch_info->dr6);
>          env->has_error_code = 0;
> -        if (arch_info->exception == EXCP01_DB) {
> -            env->dr[6] = arch_info->dr6;
> -        }
>      }
>  
>      return ret;
> diff --git a/target/i386/machine.c b/target/i386/machine.c
> index 95299ebff44a..6aac0fe9cb56 100644
> --- a/target/i386/machine.c
> +++ b/target/i386/machine.c
> @@ -240,6 +240,41 @@ static int cpu_pre_save(void *opaque)
>      }
>  #endif
>  
> +    /*
> +     * When vCPU is running L2 and exception is still pending,
> +     * it can potentially be intercepted by L1 hypervisor.
> +     * In contrast to an injected exception which cannot be
> +     * intercepted anymore.
> +     *
> +     * Furthermore, when a L2 exception is intercepted by L1
> +     * hypervisor, it's exception payload (CR2/DR6 on #PF/#DB)
> +     * should not be set yet in the respective vCPU register.
> +     * Thus, in case an exception is pending, it is
> +     * important to save the exception payload seperately.
> +     *
> +     * Therefore, if an exception is not in a pending state
> +     * or vCPU is not in guest-mode, it is not important to
> +     * distinguish between a pending and injected exception
> +     * and we don't need to store seperately the exception payload.
> +     *
> +     * In order to preserve better backwards-compatabile migration,
> +     * convert a pending exception to an injected exception in
> +     * case it is not important to distingiush between them
> +     * as described above.
> +     */
> +    if (env->exception_pending && !(env->hflags & HF_GUEST_MASK)) {
> +        env->exception_pending = 0;
> +        env->exception_injected = 1;
> +
> +        if (env->exception_has_payload) {
> +            if (env->exception_nr == EXCP01_DB) {
> +                env->dr[6] = env->exception_payload;
> +            } else if (env->exception_nr == EXCP0E_PAGE) {
> +                env->cr[2] = env->exception_payload;
> +            }
> +        }
> +    }
> +
>      return 0;
>  }
>  
> @@ -297,6 +332,23 @@ static int cpu_post_load(void *opaque, int version_id)
>      }
>  #endif
>  
> +    /*
> +     * There are cases that we can get valid exception_nr with both
> +     * exception_pending and exception_injected being cleared.
> +     * This can happen in one of the following scenarios:
> +     * 1) Source is older QEMU without KVM_CAP_EXCEPTION_PAYLOAD support.
> +     * 2) Source is running on kernel without KVM_CAP_EXCEPTION_PAYLOAD support.
> +     * 3) "cpu/exception_info" subsection not sent because there is no exception
> +     *	  pending or guest wasn't running L2 (See comment in cpu_pre_save()).
> +     *
> +     * In those cases, we can just deduce that a valid exception_nr means
> +     * we can treat the exception as already injected.
> +     */
> +    if ((env->exception_nr != -1) &&
> +        !env->exception_pending && !env->exception_injected) {
> +        env->exception_injected = 1;
> +    }
> +
>      env->fpstt = (env->fpus_vmstate >> 11) & 7;
>      env->fpus = env->fpus_vmstate & ~0x3800;
>      env->fptag_vmstate ^= 0xff;
> @@ -342,6 +394,35 @@ static bool steal_time_msr_needed(void *opaque)
>      return cpu->env.steal_time_msr != 0;
>  }
>  
> +static bool exception_info_needed(void *opaque)
> +{
> +    X86CPU *cpu = opaque;
> +    CPUX86State *env = &cpu->env;
> +
> +    /*
> +     * It is important to save exception-info only in case
> +     * we need to distingiush between a pending and injected
> +     * exception. Which is only required in case there is a
> +     * pending exception and vCPU is running L2.
> +     * For more info, refer to comment in cpu_pre_save().
> +     */
> +    return (env->exception_pending && (env->hflags & HF_GUEST_MASK));
> +}
> +
> +static const VMStateDescription vmstate_exception_info = {
> +    .name = "cpu/exception_info",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .needed = exception_info_needed,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_UINT8(env.exception_pending, X86CPU),
> +        VMSTATE_UINT8(env.exception_injected, X86CPU),
> +        VMSTATE_UINT8(env.exception_has_payload, X86CPU),
> +        VMSTATE_UINT64(env.exception_payload, X86CPU),
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
>  static const VMStateDescription vmstate_steal_time_msr = {
>      .name = "cpu/steal_time_msr",
>      .version_id = 1,
> @@ -1228,7 +1309,7 @@ VMStateDescription vmstate_x86_cpu = {
>          VMSTATE_INT32(env.interrupt_injected, X86CPU),
>          VMSTATE_UINT32(env.mp_state, X86CPU),
>          VMSTATE_UINT64(env.tsc, X86CPU),
> -        VMSTATE_INT32(env.exception_injected, X86CPU),
> +        VMSTATE_INT32(env.exception_nr, X86CPU),
>          VMSTATE_UINT8(env.soft_interrupt, X86CPU),
>          VMSTATE_UINT8(env.nmi_injected, X86CPU),
>          VMSTATE_UINT8(env.nmi_pending, X86CPU),
> @@ -1252,6 +1333,7 @@ VMStateDescription vmstate_x86_cpu = {
>          /* The above list is not sorted /wrt version numbers, watch out! */
>      },
>      .subsections = (const VMStateDescription*[]) {
> +        &vmstate_exception_info,
>          &vmstate_async_pf_msr,
>          &vmstate_pv_eoi_msr,
>          &vmstate_steal_time_msr,
> -- 
> 2.20.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [QEMU PATCH v3 6/9] vmstate: Add support for kernel integer types
  2019-06-18  8:55   ` Dr. David Alan Gilbert
@ 2019-06-18 15:36     ` Liran Alon
  2019-06-18 15:42       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 27+ messages in thread
From: Liran Alon @ 2019-06-18 15:36 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, pbonzini, mtosatti, rth, ehabkost, kvm, jmattson,
	maran.wilson, Nikita Leshenko



> On 18 Jun 2019, at 11:55, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> 
> * Liran Alon (liran.alon@oracle.com) wrote:
>> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
>> Reviewed-by: Maran Wilson <maran.wilson@oracle.com>
>> Signed-off-by: Liran Alon <liran.alon@oracle.com>
>> ---
>> include/migration/vmstate.h | 18 ++++++++++++++++++
>> 1 file changed, 18 insertions(+)
>> 
>> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
>> index 9224370ed59a..a85424fb0483 100644
>> --- a/include/migration/vmstate.h
>> +++ b/include/migration/vmstate.h
>> @@ -797,6 +797,15 @@ extern const VMStateInfo vmstate_info_qtailq;
>> #define VMSTATE_UINT64_V(_f, _s, _v)                                  \
>>     VMSTATE_SINGLE(_f, _s, _v, vmstate_info_uint64, uint64_t)
> 
> A comment here stating they're for Linux kernel types would be nice.

I didn’t want to state this because in theory these types can be used not in kernel context…
I thought commit message is sufficient. I think comments in code should be made to clarify
things. But to justify existence I think commit message should be used.
But if you insist, I have no strong objection of adding such comment.

> 
>> +#define VMSTATE_U8_V(_f, _s, _v)                                   \
>> +    VMSTATE_SINGLE(_f, _s, _v, vmstate_info_uint8, __u8)
>> +#define VMSTATE_U16_V(_f, _s, _v)                                  \
>> +    VMSTATE_SINGLE(_f, _s, _v, vmstate_info_uint16, __u16)
>> +#define VMSTATE_U32_V(_f, _s, _v)                                  \
>> +    VMSTATE_SINGLE(_f, _s, _v, vmstate_info_uint32, __u32)
>> +#define VMSTATE_U64_V(_f, _s, _v)                                  \
>> +    VMSTATE_SINGLE(_f, _s, _v, vmstate_info_uint64, __u64)
>> +
> 
> Have you checked that builds OK on a non-Linux system?

Hmm that’s a good point. No. :P

-Liran

> 
> Dave
> 
>> #define VMSTATE_BOOL(_f, _s)                                          \
>>     VMSTATE_BOOL_V(_f, _s, 0)
>> 
>> @@ -818,6 +827,15 @@ extern const VMStateInfo vmstate_info_qtailq;
>> #define VMSTATE_UINT64(_f, _s)                                        \
>>     VMSTATE_UINT64_V(_f, _s, 0)
>> 
>> +#define VMSTATE_U8(_f, _s)                                         \
>> +    VMSTATE_U8_V(_f, _s, 0)
>> +#define VMSTATE_U16(_f, _s)                                        \
>> +    VMSTATE_U16_V(_f, _s, 0)
>> +#define VMSTATE_U32(_f, _s)                                        \
>> +    VMSTATE_U32_V(_f, _s, 0)
>> +#define VMSTATE_U64(_f, _s)                                        \
>> +    VMSTATE_U64_V(_f, _s, 0)
>> +
>> #define VMSTATE_UINT8_EQUAL(_f, _s, _err_hint)                        \
>>     VMSTATE_SINGLE_FULL(_f, _s, 0, 0,                                 \
>>                         vmstate_info_uint8_equal, uint8_t, _err_hint)
>> -- 
>> 2.20.1
>> 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [QEMU PATCH v3 7/9] KVM: i386: Add support for save and restore nested state
  2019-06-18  9:03   ` Dr. David Alan Gilbert
@ 2019-06-18 15:40     ` Liran Alon
  2019-06-18 15:48       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 27+ messages in thread
From: Liran Alon @ 2019-06-18 15:40 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, pbonzini, mtosatti, rth, ehabkost, kvm, jmattson,
	maran.wilson, Nikita Leshenko


> On 18 Jun 2019, at 12:03, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> 
> * Liran Alon (liran.alon@oracle.com) wrote:
>> 
>> +static const VMStateDescription vmstate_vmx_vmcs12 = {
>> +	.name = "cpu/kvm_nested_state/vmx/vmcs12",
>> +	.version_id = 1,
>> +	.minimum_version_id = 1,
>> +	.needed = vmx_vmcs12_needed,
>> +	.fields = (VMStateField[]) {
>> +	    VMSTATE_UINT8_ARRAY(data.vmx[0].vmcs12,
>> +	                        struct kvm_nested_state, 0x1000),
> 
> Where did that magic 0x1000 come from?

Currently, KVM folks (including myself), haven’t decided yet to expose vmcs12 struct layout to userspace but instead to still leave it opaque.
The formal size of this size is VMCS12_SIZE (defined in kernel as 0x1000). I was wondering if we wish to expose VMCS12_SIZE constant to userspace or not.
So currently I defined these __u8 arrays as 0x1000. But in case Paolo agrees to expose VMCS12_SIZE, we can use that instead.

-Liran

> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [QEMU PATCH v3 6/9] vmstate: Add support for kernel integer types
  2019-06-18 15:36     ` Liran Alon
@ 2019-06-18 15:42       ` Dr. David Alan Gilbert
  2019-06-18 16:44         ` Paolo Bonzini
  0 siblings, 1 reply; 27+ messages in thread
From: Dr. David Alan Gilbert @ 2019-06-18 15:42 UTC (permalink / raw)
  To: Liran Alon
  Cc: qemu-devel, pbonzini, mtosatti, rth, ehabkost, kvm, jmattson,
	maran.wilson, Nikita Leshenko

* Liran Alon (liran.alon@oracle.com) wrote:
> 
> 
> > On 18 Jun 2019, at 11:55, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> > 
> > * Liran Alon (liran.alon@oracle.com) wrote:
> >> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
> >> Reviewed-by: Maran Wilson <maran.wilson@oracle.com>
> >> Signed-off-by: Liran Alon <liran.alon@oracle.com>
> >> ---
> >> include/migration/vmstate.h | 18 ++++++++++++++++++
> >> 1 file changed, 18 insertions(+)
> >> 
> >> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> >> index 9224370ed59a..a85424fb0483 100644
> >> --- a/include/migration/vmstate.h
> >> +++ b/include/migration/vmstate.h
> >> @@ -797,6 +797,15 @@ extern const VMStateInfo vmstate_info_qtailq;
> >> #define VMSTATE_UINT64_V(_f, _s, _v)                                  \
> >>     VMSTATE_SINGLE(_f, _s, _v, vmstate_info_uint64, uint64_t)
> > 
> > A comment here stating they're for Linux kernel types would be nice.
> 
> I didn’t want to state this because in theory these types can be used not in kernel context…
> I thought commit message is sufficient. I think comments in code should be made to clarify
> things. But to justify existence I think commit message should be used.
> But if you insist, I have no strong objection of adding such comment.

It's only a 'would be nice' - it's just I don't want people trying to
use them for other places (I'm not sure what happens if you pass a
uint8_t to VMSTATE_U8 ???

> > 
> >> +#define VMSTATE_U8_V(_f, _s, _v)                                   \
> >> +    VMSTATE_SINGLE(_f, _s, _v, vmstate_info_uint8, __u8)
> >> +#define VMSTATE_U16_V(_f, _s, _v)                                  \
> >> +    VMSTATE_SINGLE(_f, _s, _v, vmstate_info_uint16, __u16)
> >> +#define VMSTATE_U32_V(_f, _s, _v)                                  \
> >> +    VMSTATE_SINGLE(_f, _s, _v, vmstate_info_uint32, __u32)
> >> +#define VMSTATE_U64_V(_f, _s, _v)                                  \
> >> +    VMSTATE_SINGLE(_f, _s, _v, vmstate_info_uint64, __u64)
> >> +
> > 
> > Have you checked that builds OK on a non-Linux system?
> 
> Hmm that’s a good point. No. :P

Worth a check if you can find one lying around :-)

Dave

> -Liran
> 
> > 
> > Dave
> > 
> >> #define VMSTATE_BOOL(_f, _s)                                          \
> >>     VMSTATE_BOOL_V(_f, _s, 0)
> >> 
> >> @@ -818,6 +827,15 @@ extern const VMStateInfo vmstate_info_qtailq;
> >> #define VMSTATE_UINT64(_f, _s)                                        \
> >>     VMSTATE_UINT64_V(_f, _s, 0)
> >> 
> >> +#define VMSTATE_U8(_f, _s)                                         \
> >> +    VMSTATE_U8_V(_f, _s, 0)
> >> +#define VMSTATE_U16(_f, _s)                                        \
> >> +    VMSTATE_U16_V(_f, _s, 0)
> >> +#define VMSTATE_U32(_f, _s)                                        \
> >> +    VMSTATE_U32_V(_f, _s, 0)
> >> +#define VMSTATE_U64(_f, _s)                                        \
> >> +    VMSTATE_U64_V(_f, _s, 0)
> >> +
> >> #define VMSTATE_UINT8_EQUAL(_f, _s, _err_hint)                        \
> >>     VMSTATE_SINGLE_FULL(_f, _s, 0, 0,                                 \
> >>                         vmstate_info_uint8_equal, uint8_t, _err_hint)
> >> -- 
> >> 2.20.1
> >> 
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [QEMU PATCH v3 8/9] KVM: i386: Add support for KVM_CAP_EXCEPTION_PAYLOAD
  2019-06-18  9:07   ` Dr. David Alan Gilbert
@ 2019-06-18 15:45     ` Liran Alon
  0 siblings, 0 replies; 27+ messages in thread
From: Liran Alon @ 2019-06-18 15:45 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, pbonzini, mtosatti, rth, ehabkost, kvm, jmattson,
	maran.wilson, Nikita Leshenko



> On 18 Jun 2019, at 12:07, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> 
> * Liran Alon (liran.alon@oracle.com) wrote:
>> Kernel commit c4f55198c7c2 ("kvm: x86: Introduce KVM_CAP_EXCEPTION_PAYLOAD")
>> introduced a new KVM capability which allows userspace to correctly
>> distinguish between pending and injected exceptions.
>> 
>> This distinguish is important in case of nested virtualization scenarios
>> because a L2 pending exception can still be intercepted by the L1 hypervisor
>> while a L2 injected exception cannot.
>> 
>> Furthermore, when an exception is attempted to be injected by QEMU,
>> QEMU should specify the exception payload (CR2 in case of #PF or
>> DR6 in case of #DB) instead of having the payload already delivered in
>> the respective vCPU register. Because in case exception is injected to
>> L2 guest and is intercepted by L1 hypervisor, then payload needs to be
>> reported to L1 intercept (VMExit handler) while still preserving
>> respective vCPU register unchanged.
>> 
>> This commit adds support for QEMU to properly utilise this new KVM
>> capability (KVM_CAP_EXCEPTION_PAYLOAD).
> 
> Does this kvm capability become a requirement for the nested migration
> then? If so, is it wired into the blockers?
> 
> Dave
> 

That’s a very good point.
Yes this capability is required in order to correctly migrate VMs running nested hypervisors.
I agree that I should add a migration blocker for nested in case it isn’t present.
Nice catch.

-Liran



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [QEMU PATCH v3 7/9] KVM: i386: Add support for save and restore nested state
  2019-06-18 15:40     ` Liran Alon
@ 2019-06-18 15:48       ` Dr. David Alan Gilbert
  2019-06-18 15:50         ` Liran Alon
  2019-06-18 16:16         ` Paolo Bonzini
  0 siblings, 2 replies; 27+ messages in thread
From: Dr. David Alan Gilbert @ 2019-06-18 15:48 UTC (permalink / raw)
  To: Liran Alon
  Cc: qemu-devel, pbonzini, mtosatti, rth, ehabkost, kvm, jmattson,
	maran.wilson, Nikita Leshenko

* Liran Alon (liran.alon@oracle.com) wrote:
> 
> > On 18 Jun 2019, at 12:03, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> > 
> > * Liran Alon (liran.alon@oracle.com) wrote:
> >> 
> >> +static const VMStateDescription vmstate_vmx_vmcs12 = {
> >> +	.name = "cpu/kvm_nested_state/vmx/vmcs12",
> >> +	.version_id = 1,
> >> +	.minimum_version_id = 1,
> >> +	.needed = vmx_vmcs12_needed,
> >> +	.fields = (VMStateField[]) {
> >> +	    VMSTATE_UINT8_ARRAY(data.vmx[0].vmcs12,
> >> +	                        struct kvm_nested_state, 0x1000),
> > 
> > Where did that magic 0x1000 come from?
> 
> Currently, KVM folks (including myself), haven’t decided yet to expose vmcs12 struct layout to userspace but instead to still leave it opaque.
> The formal size of this size is VMCS12_SIZE (defined in kernel as 0x1000). I was wondering if we wish to expose VMCS12_SIZE constant to userspace or not.
> So currently I defined these __u8 arrays as 0x1000. But in case Paolo agrees to expose VMCS12_SIZE, we can use that instead.

Well if it's not defined it's bound to change at some state!
Also, do we need to clear it before we get it from the kernel - e.g.
is the kernel guaranteed to give us 0x1000 ?

Dave

> -Liran
> 
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [QEMU PATCH v3 7/9] KVM: i386: Add support for save and restore nested state
  2019-06-18 15:48       ` Dr. David Alan Gilbert
@ 2019-06-18 15:50         ` Liran Alon
  2019-06-18 16:16         ` Paolo Bonzini
  1 sibling, 0 replies; 27+ messages in thread
From: Liran Alon @ 2019-06-18 15:50 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, pbonzini, mtosatti, rth, ehabkost, kvm, jmattson,
	maran.wilson, Nikita Leshenko



> On 18 Jun 2019, at 18:48, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> 
> * Liran Alon (liran.alon@oracle.com) wrote:
>> 
>>> On 18 Jun 2019, at 12:03, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>>> 
>>> * Liran Alon (liran.alon@oracle.com) wrote:
>>>> 
>>>> +static const VMStateDescription vmstate_vmx_vmcs12 = {
>>>> +	.name = "cpu/kvm_nested_state/vmx/vmcs12",
>>>> +	.version_id = 1,
>>>> +	.minimum_version_id = 1,
>>>> +	.needed = vmx_vmcs12_needed,
>>>> +	.fields = (VMStateField[]) {
>>>> +	    VMSTATE_UINT8_ARRAY(data.vmx[0].vmcs12,
>>>> +	                        struct kvm_nested_state, 0x1000),
>>> 
>>> Where did that magic 0x1000 come from?
>> 
>> Currently, KVM folks (including myself), haven’t decided yet to expose vmcs12 struct layout to userspace but instead to still leave it opaque.
>> The formal size of this size is VMCS12_SIZE (defined in kernel as 0x1000). I was wondering if we wish to expose VMCS12_SIZE constant to userspace or not.
>> So currently I defined these __u8 arrays as 0x1000. But in case Paolo agrees to expose VMCS12_SIZE, we can use that instead.
> 
> Well if it's not defined it's bound to change at some state!

I agree it’s better to expose VMCS12_SIZE to userspace but I didn’t want to be the one that decides this.
Let’s let Paolo decide and modify this patch accordingly if he decides to expose it.

> Also, do we need to clear it before we get it from the kernel - e.g.
> is the kernel guaranteed to give us 0x1000 ?

Userspace don’t need to clear it before getting it from kernel.
It does guarantee to give you 0x1000. See vmx_get_nested_state() implementation in kernel.

-Liran

> 
> Dave
> 
>> -Liran
>> 
>>> --
>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>> 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [QEMU PATCH v3 7/9] KVM: i386: Add support for save and restore nested state
  2019-06-18 15:48       ` Dr. David Alan Gilbert
  2019-06-18 15:50         ` Liran Alon
@ 2019-06-18 16:16         ` Paolo Bonzini
  1 sibling, 0 replies; 27+ messages in thread
From: Paolo Bonzini @ 2019-06-18 16:16 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Liran Alon
  Cc: qemu-devel, mtosatti, rth, ehabkost, kvm, jmattson, maran.wilson,
	Nikita Leshenko

On 18/06/19 17:48, Dr. David Alan Gilbert wrote:
>> Currently, KVM folks (including myself), haven’t decided yet to expose vmcs12 struct layout to userspace but instead to still leave it opaque.
>> The formal size of this size is VMCS12_SIZE (defined in kernel as 0x1000). I was wondering if we wish to expose VMCS12_SIZE constant to userspace or not.
>> So currently I defined these __u8 arrays as 0x1000. But in case Paolo agrees to expose VMCS12_SIZE, we can use that instead.
> Well if it's not defined it's bound to change at some state!
> Also, do we need to clear it before we get it from the kernel - e.g.
> is the kernel guaranteed to give us 0x1000 ?

It is defined, it is the size of data.vmx[0].vmx12.  We can define it.

Paolo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [QEMU PATCH v3 6/9] vmstate: Add support for kernel integer types
  2019-06-18 15:42       ` Dr. David Alan Gilbert
@ 2019-06-18 16:44         ` Paolo Bonzini
  0 siblings, 0 replies; 27+ messages in thread
From: Paolo Bonzini @ 2019-06-18 16:44 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Liran Alon
  Cc: qemu-devel, mtosatti, rth, ehabkost, kvm, jmattson, maran.wilson,
	Nikita Leshenko

On 18/06/19 17:42, Dr. David Alan Gilbert wrote:
>>> Have you checked that builds OK on a non-Linux system?
>> Hmm that’s a good point. No. :P
> Worth a check if you can find one lying around :-)

It does not, but it's a macro so it's enough to enclose the uses in
#ifdef CONFIG_LINUX or CONFIG_KVM.

Paolo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [QEMU PATCH v3 1/9] KVM: Introduce kvm_arch_destroy_vcpu()
  2019-06-17 17:56 ` [QEMU PATCH v3 1/9] KVM: Introduce kvm_arch_destroy_vcpu() Liran Alon
@ 2019-06-18 22:15   ` Maran Wilson
  0 siblings, 0 replies; 27+ messages in thread
From: Maran Wilson @ 2019-06-18 22:15 UTC (permalink / raw)
  To: Liran Alon, qemu-devel
  Cc: ehabkost, kvm, mtosatti, dgilbert, pbonzini, rth, jmattson

On 6/17/2019 10:56 AM, Liran Alon wrote:
> Simiar to how kvm_init_vcpu() calls kvm_arch_init_vcpu() to perform
> arch-dependent initialisation, introduce kvm_arch_destroy_vcpu()
> to be called from kvm_destroy_vcpu() to perform arch-dependent
> destruction.
>
> This was added because some architectures (Such as i386)
> currently do not free memory that it have allocated in
> kvm_arch_init_vcpu().
>
> Suggested-by: Maran Wilson <maran.wilson@oracle.com>
> Signed-off-by: Liran Alon <liran.alon@oracle.com>
> ---
>   accel/kvm/kvm-all.c  |  5 +++++
>   include/sysemu/kvm.h |  1 +
>   target/arm/kvm32.c   |  5 +++++
>   target/arm/kvm64.c   |  5 +++++
>   target/i386/kvm.c    | 12 ++++++++++++
>   target/mips/kvm.c    |  5 +++++
>   target/ppc/kvm.c     |  5 +++++
>   target/s390x/kvm.c   | 10 ++++++++++
>   8 files changed, 48 insertions(+)
>
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 524c4ddfbd0f..59a3aa3a40da 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -292,6 +292,11 @@ int kvm_destroy_vcpu(CPUState *cpu)
>   
>       DPRINTF("kvm_destroy_vcpu\n");
>   
> +    ret = kvm_arch_destroy_vcpu(cpu);
> +    if (ret < 0) {
> +        goto err;
> +    }
> +
>       mmap_size = kvm_ioctl(s, KVM_GET_VCPU_MMAP_SIZE, 0);
>       if (mmap_size < 0) {
>           ret = mmap_size;
> diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
> index a6d1cd190fed..64f55e519df7 100644
> --- a/include/sysemu/kvm.h
> +++ b/include/sysemu/kvm.h
> @@ -371,6 +371,7 @@ int kvm_arch_put_registers(CPUState *cpu, int level);
>   int kvm_arch_init(MachineState *ms, KVMState *s);
>   
>   int kvm_arch_init_vcpu(CPUState *cpu);
> +int kvm_arch_destroy_vcpu(CPUState *cpu);
>   
>   bool kvm_vcpu_id_is_valid(int vcpu_id);
>   
> diff --git a/target/arm/kvm32.c b/target/arm/kvm32.c
> index 4e54e372a668..51f78f722b18 100644
> --- a/target/arm/kvm32.c
> +++ b/target/arm/kvm32.c
> @@ -240,6 +240,11 @@ int kvm_arch_init_vcpu(CPUState *cs)
>       return kvm_arm_init_cpreg_list(cpu);
>   }
>   
> +int kvm_arch_destroy_vcpu(CPUState *cs)
> +{
> +	return 0;
> +}
> +
>   typedef struct Reg {
>       uint64_t id;
>       int offset;
> diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c
> index 998d21f399f4..22d19c9aec6f 100644
> --- a/target/arm/kvm64.c
> +++ b/target/arm/kvm64.c
> @@ -654,6 +654,11 @@ int kvm_arch_init_vcpu(CPUState *cs)
>       return kvm_arm_init_cpreg_list(cpu);
>   }
>   
> +int kvm_arch_destroy_vcpu(CPUState *cs)
> +{
> +    return 0;
> +}
> +
>   bool kvm_arm_reg_syncs_via_cpreg_list(uint64_t regidx)
>   {
>       /* Return true if the regidx is a register we should synchronize
> diff --git a/target/i386/kvm.c b/target/i386/kvm.c
> index 3b29ce5c0d08..29889aa6b001 100644
> --- a/target/i386/kvm.c
> +++ b/target/i386/kvm.c
> @@ -1349,6 +1349,18 @@ int kvm_arch_init_vcpu(CPUState *cs)
>       return r;
>   }
>   
> +int kvm_arch_destroy_vcpu(CPUState *cs)
> +{
> +    X86CPU *cpu = X86_CPU(cs);
> +
> +    if (cpu->kvm_msr_buf) {
> +        g_free(cpu->kvm_msr_buf);
> +        cpu->kvm_msr_buf = NULL;
> +    }
> +
> +    return 0;
> +}
> +
>   void kvm_arch_reset_vcpu(X86CPU *cpu)
>   {
>       CPUX86State *env = &cpu->env;
> diff --git a/target/mips/kvm.c b/target/mips/kvm.c
> index 8e72850962e1..938f8f144b74 100644
> --- a/target/mips/kvm.c
> +++ b/target/mips/kvm.c
> @@ -91,6 +91,11 @@ int kvm_arch_init_vcpu(CPUState *cs)
>       return ret;
>   }
>   
> +int kvm_arch_destroy_vcpu(CPUState *cs)
> +{
> +    return 0;
> +}
> +
>   void kvm_mips_reset_vcpu(MIPSCPU *cpu)
>   {
>       CPUMIPSState *env = &cpu->env;
> diff --git a/target/ppc/kvm.c b/target/ppc/kvm.c
> index 3bf0a46c3352..1967ccc51791 100644
> --- a/target/ppc/kvm.c
> +++ b/target/ppc/kvm.c
> @@ -521,6 +521,11 @@ int kvm_arch_init_vcpu(CPUState *cs)
>       return ret;
>   }
>   
> +int kvm_arch_destroy_vcpu(CPUState *cs)
> +{
> +    return 0;
> +}
> +
>   static void kvm_sw_tlb_put(PowerPCCPU *cpu)
>   {
>       CPUPPCState *env = &cpu->env;
> diff --git a/target/s390x/kvm.c b/target/s390x/kvm.c
> index e5e2b691f253..c2747c31649b 100644
> --- a/target/s390x/kvm.c
> +++ b/target/s390x/kvm.c
> @@ -368,6 +368,16 @@ int kvm_arch_init_vcpu(CPUState *cs)
>       return 0;
>   }
>   
> +int kvm_arch_destroy_vcpu(CPUState *cs)
> +{
> +    S390CPU *cpu = S390_CPU(cs);
> +
> +    g_free(cpu->irqstate);
> +    cpu->irqstate = NULL;
> +
> +    return 0;
> +}
> +
>   void kvm_s390_reset_vcpu(S390CPU *cpu)
>   {
>       CPUState *cs = CPU(cpu);

Reviewed-by: Maran Wilson <maran.wilson@oracle.com>

Thanks,
-Maran

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [QEMU PATCH v3 4/9] KVM: i386: Block migration for vCPUs exposed with nested virtualization
  2019-06-17 17:56 ` [QEMU PATCH v3 4/9] KVM: i386: Block migration for vCPUs exposed with nested virtualization Liran Alon
  2019-06-18  8:44   ` Dr. David Alan Gilbert
@ 2019-06-18 22:16   ` Maran Wilson
  1 sibling, 0 replies; 27+ messages in thread
From: Maran Wilson @ 2019-06-18 22:16 UTC (permalink / raw)
  To: Liran Alon, qemu-devel
  Cc: ehabkost, kvm, mtosatti, dgilbert, pbonzini, rth, jmattson

On 6/17/2019 10:56 AM, Liran Alon wrote:
> Commit d98f26073beb ("target/i386: kvm: add VMX migration blocker")
> added a migration blocker for vCPU exposed with Intel VMX.
> However, migration should also be blocked for vCPU exposed with
> AMD SVM.
>
> Both cases should be blocked because QEMU should extract additional
> vCPU state from KVM that should be migrated as part of vCPU VMState.
> E.g. Whether vCPU is running in guest-mode or host-mode.
>
> Fixes: d98f26073beb ("target/i386: kvm: add VMX migration blocker")
> Signed-off-by: Liran Alon <liran.alon@oracle.com>
> ---
>   target/i386/cpu.c |  6 ------
>   target/i386/cpu.h | 24 ++++++++++++++++++++++++
>   target/i386/kvm.c | 12 ++++++------
>   3 files changed, 30 insertions(+), 12 deletions(-)
>
> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> index 536d7d152044..197201087e65 100644
> --- a/target/i386/cpu.c
> +++ b/target/i386/cpu.c
> @@ -5170,12 +5170,6 @@ static int x86_cpu_filter_features(X86CPU *cpu)
>       return rv;
>   }
>   
> -#define IS_INTEL_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_INTEL_1 && \
> -                           (env)->cpuid_vendor2 == CPUID_VENDOR_INTEL_2 && \
> -                           (env)->cpuid_vendor3 == CPUID_VENDOR_INTEL_3)
> -#define IS_AMD_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_AMD_1 && \
> -                         (env)->cpuid_vendor2 == CPUID_VENDOR_AMD_2 && \
> -                         (env)->cpuid_vendor3 == CPUID_VENDOR_AMD_3)
>   static void x86_cpu_realizefn(DeviceState *dev, Error **errp)
>   {
>       CPUState *cs = CPU(dev);
> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> index fce6660bac00..79d9495ceb0c 100644
> --- a/target/i386/cpu.h
> +++ b/target/i386/cpu.h
> @@ -728,6 +728,13 @@ typedef uint32_t FeatureWordArray[FEATURE_WORDS];
>   
>   #define CPUID_VENDOR_HYGON    "HygonGenuine"
>   
> +#define IS_INTEL_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_INTEL_1 && \
> +                           (env)->cpuid_vendor2 == CPUID_VENDOR_INTEL_2 && \
> +                           (env)->cpuid_vendor3 == CPUID_VENDOR_INTEL_3)
> +#define IS_AMD_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_AMD_1 && \
> +                         (env)->cpuid_vendor2 == CPUID_VENDOR_AMD_2 && \
> +                         (env)->cpuid_vendor3 == CPUID_VENDOR_AMD_3)
> +
>   #define CPUID_MWAIT_IBE     (1U << 1) /* Interrupts can exit capability */
>   #define CPUID_MWAIT_EMX     (1U << 0) /* enumeration supported */
>   
> @@ -1866,6 +1873,23 @@ static inline int32_t x86_get_a20_mask(CPUX86State *env)
>       }
>   }
>   
> +static inline bool cpu_has_vmx(CPUX86State *env)
> +{
> +    return (IS_INTEL_CPU(env) &&
> +            (env->features[FEAT_1_ECX] & CPUID_EXT_VMX));
> +}
> +
> +static inline bool cpu_has_svm(CPUX86State *env)
> +{
> +    return (IS_AMD_CPU(env) &&
> +            (env->features[FEAT_8000_0001_ECX] & CPUID_EXT3_SVM));
> +}
> +
> +static inline bool cpu_has_nested_virt(CPUX86State *env)
> +{
> +    return (cpu_has_vmx(env) || cpu_has_svm(env));
> +}
> +
>   /* fpu_helper.c */
>   void update_fp_status(CPUX86State *env);
>   void update_mxcsr_status(CPUX86State *env);
> diff --git a/target/i386/kvm.c b/target/i386/kvm.c
> index c8fd53055d37..f43e2d69859e 100644
> --- a/target/i386/kvm.c
> +++ b/target/i386/kvm.c
> @@ -906,7 +906,7 @@ static int hyperv_init_vcpu(X86CPU *cpu)
>   }
>   
>   static Error *invtsc_mig_blocker;
> -static Error *vmx_mig_blocker;
> +static Error *nested_virt_mig_blocker;
>   
>   #define KVM_MAX_CPUID_ENTRIES  100
>   
> @@ -1270,13 +1270,13 @@ int kvm_arch_init_vcpu(CPUState *cs)
>                                     !!(c->ecx & CPUID_EXT_SMX);
>       }
>   
> -    if ((env->features[FEAT_1_ECX] & CPUID_EXT_VMX) && !vmx_mig_blocker) {
> -        error_setg(&vmx_mig_blocker,
> -                   "Nested VMX virtualization does not support live migration yet");
> -        r = migrate_add_blocker(vmx_mig_blocker, &local_err);
> +    if (cpu_has_nested_virt(env) && !nested_virt_mig_blocker) {
> +        error_setg(&nested_virt_mig_blocker,
> +                   "Nested virtualization does not support live migration yet");
> +        r = migrate_add_blocker(nested_virt_mig_blocker, &local_err);
>           if (local_err) {
>               error_report_err(local_err);
> -            error_free(vmx_mig_blocker);
> +            error_free(nested_virt_mig_blocker);
>               return r;
>           }
>       }

Reviewed-by: Maran Wilson <maran.wilson@oracle.com>

Thanks,
-Maran

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [QEMU PATCH v3 5/9] linux-headers: i386: Modify struct kvm_nested_state to have explicit fields for data
  2019-06-17 17:56 ` [QEMU PATCH v3 5/9] linux-headers: i386: Modify struct kvm_nested_state to have explicit fields for data Liran Alon
@ 2019-06-18 22:16   ` Maran Wilson
  0 siblings, 0 replies; 27+ messages in thread
From: Maran Wilson @ 2019-06-18 22:16 UTC (permalink / raw)
  To: Liran Alon, qemu-devel
  Cc: ehabkost, kvm, mtosatti, dgilbert, pbonzini, rth, jmattson

On 6/17/2019 10:56 AM, Liran Alon wrote:
> Improve the KVM_{GET,SET}_NESTED_STATE structs by detailing the format
> of VMX nested state data in a struct.
>
> In order to avoid changing the ioctl values of
> KVM_{GET,SET}_NESTED_STATE, there is a need to preserve
> sizeof(struct kvm_nested_state). This is done by defining the data
> struct as "data.vmx[0]". It was the most elegant way I found to
> preserve struct size while still keeping struct readable and easy to
> maintain. It does have a misfortunate side-effect that now it has to be
> accessed as "data.vmx[0]" rather than just "data.vmx".
>
> Because we are already modifying these structs, I also modified the
> following:
> * Define the "format" field values as macros.
> * Rename vmcs_pa to vmcs12_pa for better readability.
> * Add stub structs for AMD SVM.
>
> Signed-off-by: Liran Alon <liran.alon@oracle.com>
> ---
>   linux-headers/asm-x86/kvm.h | 41 +++++++++++++++++++++++++++----------
>   1 file changed, 30 insertions(+), 11 deletions(-)
>
> diff --git a/linux-headers/asm-x86/kvm.h b/linux-headers/asm-x86/kvm.h
> index 7a0e64ccd6ff..e655d108af19 100644
> --- a/linux-headers/asm-x86/kvm.h
> +++ b/linux-headers/asm-x86/kvm.h
> @@ -383,6 +383,9 @@ struct kvm_sync_regs {
>   #define KVM_X86_QUIRK_LAPIC_MMIO_HOLE	(1 << 2)
>   #define KVM_X86_QUIRK_OUT_7E_INC_RIP	(1 << 3)
>   
> +#define KVM_STATE_NESTED_FORMAT_VMX	0
> +#define KVM_STATE_NESTED_FORMAT_SVM	1
> +
>   #define KVM_STATE_NESTED_GUEST_MODE	0x00000001
>   #define KVM_STATE_NESTED_RUN_PENDING	0x00000002
>   #define KVM_STATE_NESTED_EVMCS		0x00000004
> @@ -390,35 +393,51 @@ struct kvm_sync_regs {
>   #define KVM_STATE_NESTED_SMM_GUEST_MODE	0x00000001
>   #define KVM_STATE_NESTED_SMM_VMXON	0x00000002
>   
> -struct kvm_vmx_nested_state {
> +struct kvm_vmx_nested_state_data {
> +	__u8 vmcs12[0x1000];
> +	__u8 shadow_vmcs12[0x1000];

I assume you will replace this magic 0x1000 value too, as discussed with 
respect to patch 7?

Thanks,
-Maran

> +};
> +
> +struct kvm_vmx_nested_state_hdr {
>   	__u64 vmxon_pa;
> -	__u64 vmcs_pa;
> +	__u64 vmcs12_pa;
>   
>   	struct {
>   		__u16 flags;
>   	} smm;
>   };
>   
> +struct kvm_svm_nested_state_data {
> +	/* TODO: Implement */
> +};
> +
> +struct kvm_svm_nested_state_hdr {
> +	/* TODO: Implement */
> +};
> +
>   /* for KVM_CAP_NESTED_STATE */
>   struct kvm_nested_state {
> -	/* KVM_STATE_* flags */
>   	__u16 flags;
> -
> -	/* 0 for VMX, 1 for SVM.  */
>   	__u16 format;
> -
> -	/* 128 for SVM, 128 + VMCS size for VMX.  */
>   	__u32 size;
>   
>   	union {
> -		/* VMXON, VMCS */
> -		struct kvm_vmx_nested_state vmx;
> +		struct kvm_vmx_nested_state_hdr vmx;
> +		struct kvm_svm_nested_state_hdr svm;
>   
>   		/* Pad the header to 128 bytes.  */
>   		__u8 pad[120];
> -	};
> +	} hdr;
>   
> -	__u8 data[0];
> +	/*
> +	 * Define data region as 0 bytes to preserve backwards-compatability
> +	 * to old definition of kvm_nested_state in order to avoid changing
> +	 * KVM_{GET,PUT}_NESTED_STATE ioctl values.
> +	 */
> +	union {
> +		struct kvm_vmx_nested_state_data vmx[0];
> +		struct kvm_svm_nested_state_data svm[0];
> +	} data;
>   };
>   
>   #endif /* _ASM_X86_KVM_H */


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [QEMU PATCH v3 7/9] KVM: i386: Add support for save and restore nested state
  2019-06-17 17:56 ` [QEMU PATCH v3 7/9] KVM: i386: Add support for save and restore nested state Liran Alon
  2019-06-18  9:03   ` Dr. David Alan Gilbert
@ 2019-06-18 22:16   ` Maran Wilson
  1 sibling, 0 replies; 27+ messages in thread
From: Maran Wilson @ 2019-06-18 22:16 UTC (permalink / raw)
  To: Liran Alon, qemu-devel
  Cc: ehabkost, kvm, mtosatti, dgilbert, Nikita Leshenko, pbonzini,
	rth, jmattson

On 6/17/2019 10:56 AM, Liran Alon wrote:
> Kernel commit 8fcc4b5923af ("kvm: nVMX: Introduce KVM_CAP_NESTED_STATE")
> introduced new IOCTLs to extract and restore vCPU state related to
> Intel VMX & AMD SVM.
>
> Utilize these IOCTLs to add support for migration of VMs which are
> running nested hypervisors.
>
> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
> Signed-off-by: Liran Alon <liran.alon@oracle.com>
> ---
>   accel/kvm/kvm-all.c   |   8 ++
>   include/sysemu/kvm.h  |   1 +
>   target/i386/cpu.h     |   3 +
>   target/i386/kvm.c     |  80 +++++++++++++++++
>   target/i386/machine.c | 196 ++++++++++++++++++++++++++++++++++++++++++
>   5 files changed, 288 insertions(+)
>
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 59a3aa3a40da..4fdf5b04b131 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -88,6 +88,7 @@ struct KVMState
>   #ifdef KVM_CAP_SET_GUEST_DEBUG
>       QTAILQ_HEAD(, kvm_sw_breakpoint) kvm_sw_breakpoints;
>   #endif
> +    int max_nested_state_len;
>       int many_ioeventfds;
>       int intx_set_mask;
>       bool sync_mmu;
> @@ -1678,6 +1679,8 @@ static int kvm_init(MachineState *ms)
>       s->debugregs = kvm_check_extension(s, KVM_CAP_DEBUGREGS);
>   #endif
>   
> +    s->max_nested_state_len = kvm_check_extension(s, KVM_CAP_NESTED_STATE);
> +
>   #ifdef KVM_CAP_IRQ_ROUTING
>       kvm_direct_msi_allowed = (kvm_check_extension(s, KVM_CAP_SIGNAL_MSI) > 0);
>   #endif
> @@ -2245,6 +2248,11 @@ int kvm_has_debugregs(void)
>       return kvm_state->debugregs;
>   }
>   
> +int kvm_max_nested_state_length(void)
> +{
> +    return kvm_state->max_nested_state_len;
> +}
> +
>   int kvm_has_many_ioeventfds(void)
>   {
>       if (!kvm_enabled()) {
> diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
> index 64f55e519df7..acd90aebb6c4 100644
> --- a/include/sysemu/kvm.h
> +++ b/include/sysemu/kvm.h
> @@ -210,6 +210,7 @@ bool kvm_has_sync_mmu(void);
>   int kvm_has_vcpu_events(void);
>   int kvm_has_robust_singlestep(void);
>   int kvm_has_debugregs(void);
> +int kvm_max_nested_state_length(void);
>   int kvm_has_pit_state2(void);
>   int kvm_has_many_ioeventfds(void);
>   int kvm_has_gsi_routing(void);
> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> index 79d9495ceb0c..a6bb71849869 100644
> --- a/target/i386/cpu.h
> +++ b/target/i386/cpu.h
> @@ -1350,6 +1350,9 @@ typedef struct CPUX86State {
>   #if defined(CONFIG_KVM) || defined(CONFIG_HVF)
>       void *xsave_buf;
>   #endif
> +#if defined(CONFIG_KVM)
> +    struct kvm_nested_state *nested_state;
> +#endif
>   #if defined(CONFIG_HVF)
>       HVFX86EmulatorState *hvf_emul;
>   #endif
> diff --git a/target/i386/kvm.c b/target/i386/kvm.c
> index f43e2d69859e..5950c3ed0d1c 100644
> --- a/target/i386/kvm.c
> +++ b/target/i386/kvm.c
> @@ -931,6 +931,7 @@ int kvm_arch_init_vcpu(CPUState *cs)
>       struct kvm_cpuid_entry2 *c;
>       uint32_t signature[3];
>       int kvm_base = KVM_CPUID_SIGNATURE;
> +    int max_nested_state_len;
>       int r;
>       Error *local_err = NULL;
>   
> @@ -1331,6 +1332,24 @@ int kvm_arch_init_vcpu(CPUState *cs)
>       if (has_xsave) {
>           env->xsave_buf = qemu_memalign(4096, sizeof(struct kvm_xsave));
>       }
> +
> +    max_nested_state_len = kvm_max_nested_state_length();
> +    if (max_nested_state_len > 0) {
> +        assert(max_nested_state_len >= offsetof(struct kvm_nested_state, data));
> +        env->nested_state = g_malloc0(max_nested_state_len);
> +
> +        env->nested_state->size = max_nested_state_len;
> +
> +        if (IS_INTEL_CPU(env)) {
> +            struct kvm_vmx_nested_state_hdr *vmx_hdr =
> +                &env->nested_state->hdr.vmx;
> +
> +            vmx_hdr->vmxon_pa = -1ull;
> +            vmx_hdr->vmcs12_pa = -1ull;
> +        }
> +
> +    }
> +
>       cpu->kvm_msr_buf = g_malloc0(MSR_BUF_SIZE);
>   
>       if (!(env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_RDTSCP)) {
> @@ -1352,12 +1371,18 @@ int kvm_arch_init_vcpu(CPUState *cs)
>   int kvm_arch_destroy_vcpu(CPUState *cs)
>   {
>       X86CPU *cpu = X86_CPU(cs);
> +    CPUX86State *env = &cpu->env;
>   
>       if (cpu->kvm_msr_buf) {
>           g_free(cpu->kvm_msr_buf);
>           cpu->kvm_msr_buf = NULL;
>       }
>   
> +    if (env->nested_state) {
> +        g_free(env->nested_state);
> +        env->nested_state = NULL;
> +    }
> +
>       return 0;
>   }
>   
> @@ -3072,6 +3097,52 @@ static int kvm_get_debugregs(X86CPU *cpu)
>       return 0;
>   }
>   
> +static int kvm_put_nested_state(X86CPU *cpu)
> +{
> +    CPUX86State *env = &cpu->env;
> +    int max_nested_state_len = kvm_max_nested_state_length();
> +
> +    if (max_nested_state_len <= 0) {
> +        return 0;
> +    }
> +
> +    assert(env->nested_state->size <= max_nested_state_len);
> +    return kvm_vcpu_ioctl(CPU(cpu), KVM_SET_NESTED_STATE, env->nested_state);
> +}
> +
> +static int kvm_get_nested_state(X86CPU *cpu)
> +{
> +    CPUX86State *env = &cpu->env;
> +    int max_nested_state_len = kvm_max_nested_state_length();
> +    int ret;
> +
> +    if (max_nested_state_len <= 0) {
> +        return 0;
> +    }
> +
> +    /*
> +     * It is possible that migration restored a smaller size into
> +     * nested_state->hdr.size than what our kernel support.
> +     * We preserve migration origin nested_state->hdr.size for
> +     * call to KVM_SET_NESTED_STATE but wish that our next call
> +     * to KVM_GET_NESTED_STATE will use max size our kernel support.
> +     */
> +    env->nested_state->size = max_nested_state_len;
> +
> +    ret = kvm_vcpu_ioctl(CPU(cpu), KVM_GET_NESTED_STATE, env->nested_state);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    if (env->nested_state->flags & KVM_STATE_NESTED_GUEST_MODE) {
> +        env->hflags |= HF_GUEST_MASK;
> +    } else {
> +        env->hflags &= ~HF_GUEST_MASK;
> +    }
> +
> +    return ret;
> +}
> +
>   int kvm_arch_put_registers(CPUState *cpu, int level)
>   {
>       X86CPU *x86_cpu = X86_CPU(cpu);
> @@ -3079,6 +3150,11 @@ int kvm_arch_put_registers(CPUState *cpu, int level)
>   
>       assert(cpu_is_stopped(cpu) || qemu_cpu_is_self(cpu));
>   
> +    ret = kvm_put_nested_state(x86_cpu);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
>       if (level >= KVM_PUT_RESET_STATE) {
>           ret = kvm_put_msr_feature_control(x86_cpu);
>           if (ret < 0) {
> @@ -3194,6 +3270,10 @@ int kvm_arch_get_registers(CPUState *cs)
>       if (ret < 0) {
>           goto out;
>       }
> +    ret = kvm_get_nested_state(cpu);
> +    if (ret < 0) {
> +        goto out;
> +    }
>       ret = 0;
>    out:
>       cpu_sync_bndcs_hflags(&cpu->env);
> diff --git a/target/i386/machine.c b/target/i386/machine.c
> index 225b5d433bc4..95299ebff44a 100644
> --- a/target/i386/machine.c
> +++ b/target/i386/machine.c
> @@ -231,6 +231,15 @@ static int cpu_pre_save(void *opaque)
>           env->segs[R_SS].flags &= ~(env->segs[R_SS].flags & DESC_DPL_MASK);
>       }
>   
> +#ifdef CONFIG_KVM
> +    /* Verify we have nested virtualization state from kernel if required */
> +    if (cpu_has_nested_virt(env) && !env->nested_state) {
> +        error_report("Guest enabled nested virtualization but kernel "
> +                "does not support saving of nested state");
> +        return -EINVAL;
> +    }
> +#endif
> +
>       return 0;
>   }
>   
> @@ -278,6 +287,16 @@ static int cpu_post_load(void *opaque, int version_id)
>       env->hflags &= ~HF_CPL_MASK;
>       env->hflags |= (env->segs[R_SS].flags >> DESC_DPL_SHIFT) & HF_CPL_MASK;
>   
> +#ifdef CONFIG_KVM
> +    if ((env->hflags & HF_GUEST_MASK) &&
> +        (!env->nested_state ||
> +        !(env->nested_state->flags & KVM_STATE_NESTED_GUEST_MODE))) {
> +        error_report("vCPU set in guest-mode inconsistent with "
> +                     "migrated kernel nested state");
> +        return -EINVAL;
> +    }
> +#endif
> +
>       env->fpstt = (env->fpus_vmstate >> 11) & 7;
>       env->fpus = env->fpus_vmstate & ~0x3800;
>       env->fptag_vmstate ^= 0xff;
> @@ -851,6 +870,180 @@ static const VMStateDescription vmstate_tsc_khz = {
>       }
>   };
>   
> +#ifdef CONFIG_KVM
> +
> +static bool vmx_vmcs12_needed(void *opaque)
> +{
> +    struct kvm_nested_state *nested_state = opaque;
> +    return (nested_state->size >
> +            offsetof(struct kvm_nested_state, data.vmx[0].vmcs12));
> +}
> +
> +static const VMStateDescription vmstate_vmx_vmcs12 = {
> +	.name = "cpu/kvm_nested_state/vmx/vmcs12",
> +	.version_id = 1,
> +	.minimum_version_id = 1,
> +	.needed = vmx_vmcs12_needed,
> +	.fields = (VMStateField[]) {
> +	    VMSTATE_UINT8_ARRAY(data.vmx[0].vmcs12,
> +	                        struct kvm_nested_state, 0x1000),
> +	    VMSTATE_END_OF_LIST()
> +    }
> +};
> +
> +static bool vmx_shadow_vmcs12_needed(void *opaque)
> +{
> +    struct kvm_nested_state *nested_state = opaque;
> +    return (nested_state->size >
> +            offsetof(struct kvm_nested_state, data.vmx[0].shadow_vmcs12));
> +}
> +
> +static const VMStateDescription vmstate_vmx_shadow_vmcs12 = {
> +	.name = "cpu/kvm_nested_state/vmx/shadow_vmcs12",
> +	.version_id = 1,
> +	.minimum_version_id = 1,
> +	.needed = vmx_shadow_vmcs12_needed,
> +	.fields = (VMStateField[]) {
> +	    VMSTATE_UINT8_ARRAY(data.vmx[0].shadow_vmcs12,
> +	                        struct kvm_nested_state, 0x1000),
> +	    VMSTATE_END_OF_LIST()
> +    }
> +};
> +
> +static bool vmx_nested_state_needed(void *opaque)
> +{
> +    struct kvm_nested_state *nested_state = opaque;
> +
> +    return ((nested_state->format == KVM_STATE_NESTED_FORMAT_VMX) &&
> +            ((nested_state->hdr.vmx.vmxon_pa != -1ull) ||
> +             (nested_state->hdr.vmx.smm.flags & KVM_STATE_NESTED_SMM_VMXON)));
> +}
> +
> +static const VMStateDescription vmstate_vmx_nested_state = {
> +	.name = "cpu/kvm_nested_state/vmx",
> +	.version_id = 1,
> +	.minimum_version_id = 1,
> +	.needed = vmx_nested_state_needed,
> +	.fields = (VMStateField[]) {
> +	    VMSTATE_U64(hdr.vmx.vmxon_pa, struct kvm_nested_state),
> +	    VMSTATE_U64(hdr.vmx.vmcs12_pa, struct kvm_nested_state),
> +	    VMSTATE_U16(hdr.vmx.smm.flags, struct kvm_nested_state),
> +	    VMSTATE_END_OF_LIST()
> +    },
> +    .subsections = (const VMStateDescription*[]) {
> +        &vmstate_vmx_vmcs12,
> +        &vmstate_vmx_shadow_vmcs12,
> +        NULL,
> +    }
> +};
> +
> +static bool svm_nested_state_needed(void *opaque)
> +{
> +    struct kvm_nested_state *nested_state = opaque;
> +
> +    return (nested_state->format == KVM_STATE_NESTED_FORMAT_SVM);
> +}
> +
> +static const VMStateDescription vmstate_svm_nested_state = {
> +	.name = "cpu/kvm_nested_state/svm",
> +	.version_id = 1,
> +	.minimum_version_id = 1,
> +	.needed = svm_nested_state_needed,
> +	.fields = (VMStateField[]) {
> +	    VMSTATE_END_OF_LIST()
> +    }
> +};
> +
> +static bool nested_state_needed(void *opaque)
> +{
> +    X86CPU *cpu = opaque;
> +    CPUX86State *env = &cpu->env;
> +
> +    return (env->nested_state &&
> +            (vmx_nested_state_needed(env->nested_state) ||
> +             svm_nested_state_needed(env->nested_state)));
> +}
> +
> +static int nested_state_post_load(void *opaque, int version_id)
> +{
> +    X86CPU *cpu = opaque;
> +    CPUX86State *env = &cpu->env;
> +    struct kvm_nested_state *nested_state = env->nested_state;
> +    int min_nested_state_len = offsetof(struct kvm_nested_state, data);
> +    int max_nested_state_len = kvm_max_nested_state_length();
> +
> +    /*
> +     * If our kernel don't support setting nested state
> +     * and we have received nested state from migration stream,
> +     * we need to fail migration
> +     */
> +    if (max_nested_state_len <= 0) {
> +        error_report("Received nested state when kernel cannot restore it");
> +        return -EINVAL;
> +    }
> +
> +    /*
> +     * Verify that the size of received nested_state struct
> +     * at least cover required header and is not larger
> +     * than the max size that our kernel support
> +     */
> +    if (nested_state->size < min_nested_state_len) {
> +        error_report("Received nested state size less than min: "
> +                     "len=%d, min=%d",
> +                     nested_state->size, min_nested_state_len);
> +        return -EINVAL;
> +    }
> +    if (nested_state->size > max_nested_state_len) {
> +        error_report("Recieved unsupported nested state size: "
> +                     "nested_state->size=%d, max=%d",
> +                     nested_state->size, max_nested_state_len);
> +        return -EINVAL;
> +    }
> +
> +    /* Verify format is valid */
> +    if ((nested_state->format != KVM_STATE_NESTED_FORMAT_VMX) &&
> +        (nested_state->format != KVM_STATE_NESTED_FORMAT_SVM)) {
> +        error_report("Received invalid nested state format: %d",
> +                     nested_state->format);
> +        return -EINVAL;
> +    }
> +
> +    return 0;
> +}
> +
> +static const VMStateDescription vmstate_kvm_nested_state = {
> +    .name = "cpu/kvm_nested_state",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_U16(flags, struct kvm_nested_state),
> +        VMSTATE_U16(format, struct kvm_nested_state),
> +        VMSTATE_U32(size, struct kvm_nested_state),
> +        VMSTATE_END_OF_LIST()
> +    },
> +    .subsections = (const VMStateDescription*[]) {
> +        &vmstate_vmx_nested_state,
> +        &vmstate_svm_nested_state,
> +        NULL
> +    }
> +};
> +
> +static const VMStateDescription vmstate_nested_state = {
> +    .name = "cpu/nested_state",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .needed = nested_state_needed,
> +    .post_load = nested_state_post_load,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_STRUCT_POINTER(env.nested_state, X86CPU,
> +                               vmstate_kvm_nested_state,
> +                               struct kvm_nested_state),
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
> +#endif
> +
>   static bool mcg_ext_ctl_needed(void *opaque)
>   {
>       X86CPU *cpu = opaque;
> @@ -1089,6 +1282,9 @@ VMStateDescription vmstate_x86_cpu = {
>           &vmstate_msr_intel_pt,
>           &vmstate_msr_virt_ssbd,
>           &vmstate_svm_npt,
> +#ifdef CONFIG_KVM
> +        &vmstate_nested_state,
> +#endif
>           NULL
>       }
>   };

I'm not sure it's comprehensive enough to qualify for a "Tested-by" 
line, but I did run this latest patch series in my environment and 
confirm that I was able to save/restore L1 VMs that had created active 
L2 VMs of their own. Previously those tests would always result in a L1 
kernel failure upon restoration. I also verified that cpu_pre_save() was 
properly catching and preventing the save when the host kernel did not 
support KVM_CAP_NESTED_STATE.

And aside from the 0x1000 magic value, this latest version of the patch 
looks good.

Reviewed-by: Maran Wilson <maran.wilson@oracle.com>

Thanks,
-Maran

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Qemu-devel] [QEMU PATCH v3 9/9] KVM: i386: Remove VMX migration blocker
  2019-06-17 17:56 ` [QEMU PATCH v3 9/9] KVM: i386: Remove VMX migration blocker Liran Alon
@ 2019-06-18 22:17   ` Maran Wilson
  0 siblings, 0 replies; 27+ messages in thread
From: Maran Wilson @ 2019-06-18 22:17 UTC (permalink / raw)
  To: Liran Alon, qemu-devel
  Cc: ehabkost, kvm, mtosatti, dgilbert, pbonzini, rth, jmattson

On 6/17/2019 10:56 AM, Liran Alon wrote:
> This effectively reverts d98f26073beb ("target/i386: kvm: add VMX migration blocker").
> This can now be done because previous commits added support for Intel VMX migration.
>
> AMD SVM migration is still blocked. This is because kernel
> KVM_CAP_{GET,SET}_NESTED_STATE in case of AMD SVM is not
> implemented yet. Therefore, required vCPU nested state is still
> missing in order to perform valid migration for vCPU exposed with SVM.
>
> Signed-off-by: Liran Alon <liran.alon@oracle.com>
> ---
>   target/i386/kvm.c | 12 ++++++------
>   1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/target/i386/kvm.c b/target/i386/kvm.c
> index 797f8ac46435..772c8619efc4 100644
> --- a/target/i386/kvm.c
> +++ b/target/i386/kvm.c
> @@ -948,7 +948,7 @@ static int hyperv_init_vcpu(X86CPU *cpu)
>   }
>   
>   static Error *invtsc_mig_blocker;
> -static Error *nested_virt_mig_blocker;
> +static Error *svm_mig_blocker;
>   
>   #define KVM_MAX_CPUID_ENTRIES  100
>   
> @@ -1313,13 +1313,13 @@ int kvm_arch_init_vcpu(CPUState *cs)
>                                     !!(c->ecx & CPUID_EXT_SMX);
>       }
>   
> -    if (cpu_has_nested_virt(env) && !nested_virt_mig_blocker) {
> -        error_setg(&nested_virt_mig_blocker,
> -                   "Nested virtualization does not support live migration yet");
> -        r = migrate_add_blocker(nested_virt_mig_blocker, &local_err);
> +    if (cpu_has_svm(env) && !svm_mig_blocker) {
> +        error_setg(&svm_mig_blocker,
> +                   "AMD SVM does not support live migration yet");
> +        r = migrate_add_blocker(svm_mig_blocker, &local_err);
>           if (local_err) {
>               error_report_err(local_err);
> -            error_free(nested_virt_mig_blocker);
> +            error_free(svm_mig_blocker);
>               return r;
>           }
>       }

Reviewed-by: Maran Wilson <maran.wilson@oracle.com>

Thanks,
-Maran

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2019-06-18 22:19 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-17 17:56 [QEMU PATCH v3 0/9]: KVM: i386: Add support for save and restore of nested state Liran Alon
2019-06-17 17:56 ` [QEMU PATCH v3 1/9] KVM: Introduce kvm_arch_destroy_vcpu() Liran Alon
2019-06-18 22:15   ` [Qemu-devel] " Maran Wilson
2019-06-17 17:56 ` [QEMU PATCH v3 2/9] KVM: i386: Use symbolic constant for #DB/#BP exception constants Liran Alon
2019-06-17 17:56 ` [QEMU PATCH v3 3/9] KVM: i386: Re-inject #DB to guest with updated DR6 Liran Alon
2019-06-17 17:56 ` [QEMU PATCH v3 4/9] KVM: i386: Block migration for vCPUs exposed with nested virtualization Liran Alon
2019-06-18  8:44   ` Dr. David Alan Gilbert
2019-06-18 22:16   ` [Qemu-devel] " Maran Wilson
2019-06-17 17:56 ` [QEMU PATCH v3 5/9] linux-headers: i386: Modify struct kvm_nested_state to have explicit fields for data Liran Alon
2019-06-18 22:16   ` [Qemu-devel] " Maran Wilson
2019-06-17 17:56 ` [QEMU PATCH v3 6/9] vmstate: Add support for kernel integer types Liran Alon
2019-06-18  8:55   ` Dr. David Alan Gilbert
2019-06-18 15:36     ` Liran Alon
2019-06-18 15:42       ` Dr. David Alan Gilbert
2019-06-18 16:44         ` Paolo Bonzini
2019-06-17 17:56 ` [QEMU PATCH v3 7/9] KVM: i386: Add support for save and restore nested state Liran Alon
2019-06-18  9:03   ` Dr. David Alan Gilbert
2019-06-18 15:40     ` Liran Alon
2019-06-18 15:48       ` Dr. David Alan Gilbert
2019-06-18 15:50         ` Liran Alon
2019-06-18 16:16         ` Paolo Bonzini
2019-06-18 22:16   ` [Qemu-devel] " Maran Wilson
2019-06-17 17:56 ` [QEMU PATCH v3 8/9] KVM: i386: Add support for KVM_CAP_EXCEPTION_PAYLOAD Liran Alon
2019-06-18  9:07   ` Dr. David Alan Gilbert
2019-06-18 15:45     ` Liran Alon
2019-06-17 17:56 ` [QEMU PATCH v3 9/9] KVM: i386: Remove VMX migration blocker Liran Alon
2019-06-18 22:17   ` [Qemu-devel] " Maran Wilson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).