* [PATCH 1/2] X86/KVM: Properly restore 'tsc_offset' when running an L2 guest
@ 2018-04-12 15:12 KarimAllah Ahmed
  2018-04-12 15:12 ` [PATCH 2/2] kvm: nVMX: Introduce KVM_CAP_STATE KarimAllah Ahmed
  2018-04-12 16:35 ` [PATCH 1/2] X86/KVM: Properly restore 'tsc_offset' when running an L2 guest Paolo Bonzini
  0 siblings, 2 replies; 16+ messages in thread
From: KarimAllah Ahmed @ 2018-04-12 15:12 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: KarimAllah Ahmed, Jim Mattson, Paolo Bonzini,
	Radim Krčmář

When the TSC MSR is captured while an L2 guest is running and then restored,
the 'tsc_offset' ends up capturing the L02 TSC_OFFSET instead of the L01
TSC_OFFSET. So ensure that this is compensated for when storing the value.

Cc: Jim Mattson <jmattson@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
---
 arch/x86/kvm/vmx.c | 12 +++++++++---
 arch/x86/kvm/x86.c |  1 -
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index cff2f50..2f57571 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2900,6 +2900,8 @@ static u64 guest_read_tsc(struct kvm_vcpu *vcpu)
  */
 static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 {
+	u64 l1_tsc_offset = 0;
+
 	if (is_guest_mode(vcpu)) {
 		/*
 		 * We're here if L1 chose not to trap WRMSR to TSC. According
@@ -2908,16 +2910,20 @@ static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 		 * to the newly set TSC to get L2's TSC.
 		 */
 		struct vmcs12 *vmcs12;
+
 		/* recalculate vmcs02.TSC_OFFSET: */
 		vmcs12 = get_vmcs12(vcpu);
-		vmcs_write64(TSC_OFFSET, offset +
-			(nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ?
-			 vmcs12->tsc_offset : 0));
+
+		l1_tsc_offset = nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ?
+					vmcs12->tsc_offset : 0;
+		vmcs_write64(TSC_OFFSET, offset + l1_tsc_offset);
 	} else {
 		trace_kvm_write_tsc_offset(vcpu->vcpu_id,
 					   vmcs_read64(TSC_OFFSET), offset);
 		vmcs_write64(TSC_OFFSET, offset);
 	}
+
+	vcpu->arch.tsc_offset = offset - l1_tsc_offset;
 }
 
 /*
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ac42c85..1a2ed92 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1539,7 +1539,6 @@ EXPORT_SYMBOL_GPL(kvm_read_l1_tsc);
 static void kvm_vcpu_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 {
 	kvm_x86_ops->write_tsc_offset(vcpu, offset);
-	vcpu->arch.tsc_offset = offset;
 }
 
 static inline bool kvm_check_tsc_unstable(void)
-- 
2.7.4


* [PATCH 2/2] kvm: nVMX: Introduce KVM_CAP_STATE
  2018-04-12 15:12 [PATCH 1/2] X86/KVM: Properly restore 'tsc_offset' when running an L2 guest KarimAllah Ahmed
@ 2018-04-12 15:12 ` KarimAllah Ahmed
  2018-04-12 16:39   ` Paolo Bonzini
                     ` (2 more replies)
  2018-04-12 16:35 ` [PATCH 1/2] X86/KVM: Properly restore 'tsc_offset' when running an L2 guest Paolo Bonzini
  1 sibling, 3 replies; 16+ messages in thread
From: KarimAllah Ahmed @ 2018-04-12 15:12 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Jim Mattson, Paolo Bonzini, Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86,
	KarimAllah Ahmed

From: Jim Mattson <jmattson@google.com>

For nested virtualization, L0 KVM manages a bit of state for L2 guests that
cannot be captured through the currently available IOCTLs. In fact, the state
captured through all of these IOCTLs is usually a mix of L1 and L2 state. It
also depends on whether the L2 guest was running at the moment the process
was interrupted to save its state.

With this capability, there are two new vcpu ioctls: KVM_GET_VMX_STATE and
KVM_SET_VMX_STATE. These can be used for saving and restoring a VM that is
in VMX operation.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Jim Mattson <jmattson@google.com>
[karahmed@ - rename structs and functions and make them ready for AMD and
             address previous comments.
           - rebase & a bit of refactoring.
           - Merge 7/8 and 8/8 into one patch.
           - Force a VMExit from L2 after reading the kvm_state to avoid
             mixed state between L1 and L2 on resurrecting the instance. ]
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
---
v2 -> v3:
- Remove the forced VMExit from L2 after reading the kvm_state. The actual
  problem is solved.
- Rebase again!
- Set nested_run_pending during restore (not sure if it makes sense yet or
  not).
- Reduce KVM_REQUEST_ARCH_BASE to 7 instead of 8 (the other alternative is
  to switch everything to u64)

v1 -> v2:
- Rename structs and functions and make them ready for AMD and address
  previous comments.
- Rebase & a bit of refactoring.
- Merge 7/8 and 8/8 into one patch.
- Force a VMExit from L2 after reading the kvm_state to avoid mixed state
  between L1 and L2 on resurrecting the instance.
---
 Documentation/virtual/kvm/api.txt |  47 ++++++++++
 arch/x86/include/asm/kvm_host.h   |   7 ++
 arch/x86/include/uapi/asm/kvm.h   |  38 ++++++++
 arch/x86/kvm/vmx.c                | 177 +++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c                |  21 +++++
 include/linux/kvm_host.h          |   2 +-
 include/uapi/linux/kvm.h          |   5 ++
 7 files changed, 292 insertions(+), 5 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 1c7958b..c51d5d3 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -3548,6 +3548,53 @@ Returns: 0 on success,
 	-ENOENT on deassign if the conn_id isn't registered
 	-EEXIST on assign if the conn_id is already registered
 
+4.114 KVM_GET_STATE
+
+Capability: KVM_CAP_STATE
+Architectures: x86
+Type: vcpu ioctl
+Parameters: struct kvm_state (in/out)
+Returns: 0 on success, -1 on error
+Errors:
+  E2BIG:     the data size exceeds the value of 'size' specified by
+             the user (the size required will be written into size).
+
+struct kvm_state {
+	__u16 flags;
+	__u16 format;
+	__u32 size;
+	union {
+		struct kvm_vmx_state vmx;
+		struct kvm_svm_state svm;
+		__u8 pad[120];
+	};
+	__u8 data[0];
+};
+
+This ioctl copies the vcpu's kvm_state struct from the kernel to userspace.
+
+4.115 KVM_SET_STATE
+
+Capability: KVM_CAP_STATE
+Architectures: x86
+Type: vcpu ioctl
+Parameters: struct kvm_state (in)
+Returns: 0 on success, -1 on error
+
+struct kvm_state {
+	__u16 flags;
+	__u16 format;
+	__u32 size;
+	union {
+		struct kvm_vmx_state vmx;
+		struct kvm_svm_state svm;
+		__u8 pad[120];
+	};
+	__u8 data[0];
+};
+
+This copies the vcpu's kvm_state struct from userspace to the kernel.
 
 5. The kvm_run structure
 ------------------------
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9fa4f57..ad2116a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -75,6 +75,7 @@
 #define KVM_REQ_HV_EXIT			KVM_ARCH_REQ(21)
 #define KVM_REQ_HV_STIMER		KVM_ARCH_REQ(22)
 #define KVM_REQ_LOAD_EOI_EXITMAP	KVM_ARCH_REQ(23)
+#define KVM_REQ_GET_VMCS12_PAGES	KVM_ARCH_REQ(24)
 
 #define CR0_RESERVED_BITS                                               \
 	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
@@ -1084,6 +1085,12 @@ struct kvm_x86_ops {
 
 	void (*setup_mce)(struct kvm_vcpu *vcpu);
 
+	int (*get_state)(struct kvm_vcpu *vcpu,
+			 struct kvm_state __user *user_kvm_state);
+	int (*set_state)(struct kvm_vcpu *vcpu,
+			 struct kvm_state __user *user_kvm_state);
+	void (*get_vmcs12_pages)(struct kvm_vcpu *vcpu);
+
 	int (*smi_allowed)(struct kvm_vcpu *vcpu);
 	int (*pre_enter_smm)(struct kvm_vcpu *vcpu, char *smstate);
 	int (*pre_leave_smm)(struct kvm_vcpu *vcpu, u64 smbase);
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index c535c2f..d2c860a 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -378,4 +378,42 @@ struct kvm_sync_regs {
 #define KVM_X86_QUIRK_LINT0_REENABLED	(1 << 0)
 #define KVM_X86_QUIRK_CD_NW_CLEARED	(1 << 1)
 
+#define KVM_STATE_GUEST_MODE	0x00000001
+#define KVM_STATE_RUN_PENDING	0x00000002
+#define KVM_STATE_GIF		0x00000004
+
+struct kvm_vmx_state {
+	__u64 vmxon_pa;
+	__u64 vmcs_pa;
+};
+
+struct kvm_svm_state {
+	__u64 hsave_pa;
+	__u64 vmcb_pa;
+};
+
+/* for KVM_CAP_STATE */
+struct kvm_state {
+	/* KVM_STATE_* flags */
+	__u16 flags;
+
+	/* 0 for VMX, 1 for SVM.  */
+	__u16 format;
+
+	/* 128 for SVM, 128 + VMCS size for VMX.  */
+	__u32 size;
+
+	union {
+		/* VMXON, VMCS */
+		struct kvm_vmx_state vmx;
+		/* HSAVE_PA, VMCB */
+		struct kvm_svm_state svm;
+
+		/* Pad the union to 120 bytes.  */
+		__u8 pad[120];
+	};
+
+	__u8 data[0];
+};
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2f57571..c2b436b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -10359,10 +10359,10 @@ static void vmx_inject_page_fault_nested(struct kvm_vcpu *vcpu,
 static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
 						 struct vmcs12 *vmcs12);
 
-static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu,
-					struct vmcs12 *vmcs12)
+static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
 
 	if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) {
 		if (vmcs12->apic_access_addr != vmx->nested.apic_access_mapping.gfn << PAGE_SHIFT) {
@@ -11430,8 +11430,6 @@ static int enter_vmx_non_root_mode(struct kvm_vcpu *vcpu, bool from_vmentry)
 		return 1;
 	}
 
-	nested_get_vmcs12_pages(vcpu, vmcs12);
-
 	msr_entry_idx = nested_vmx_load_msr(vcpu,
 					    vmcs12->vm_entry_msr_load_addr,
 					    vmcs12->vm_entry_msr_load_count);
@@ -11529,6 +11527,8 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
 	if (ret)
 		return ret;
 
+	nested_get_vmcs12_pages(vcpu);
+
 	/*
 	 * If we're entering a halted L2 vcpu and the L2 vcpu won't be woken
 	 * by event injection, halt vcpu.
@@ -12595,6 +12595,171 @@ static int enable_smi_window(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+static int get_vmcs_cache(struct kvm_vcpu *vcpu,
+			  struct kvm_state __user *user_kvm_state)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+
+	/*
+	 * When running L2, the authoritative vmcs12 state is in the
+	 * vmcs02. When running L1, the authoritative vmcs12 state is
+	 * in the shadow vmcs linked to vmcs01, unless
+	 * sync_shadow_vmcs is set, in which case, the authoritative
+	 * vmcs12 state is in the vmcs12 already.
+	 */
+	if (is_guest_mode(vcpu))
+		sync_vmcs12(vcpu, vmcs12);
+	else if (enable_shadow_vmcs && !vmx->nested.sync_shadow_vmcs)
+		copy_shadow_to_vmcs12(vmx);
+
+	if (copy_to_user(user_kvm_state->data, vmcs12, sizeof(*vmcs12)))
+		return -EFAULT;
+
+	return 0;
+}
+
+static int get_vmx_state(struct kvm_vcpu *vcpu,
+			 struct kvm_state __user *user_kvm_state)
+{
+	u32 user_data_size;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct kvm_state kvm_state = {
+		.flags = 0,
+		.format = 0,
+		.size = sizeof(kvm_state),
+		.vmx.vmxon_pa = -1ull,
+		.vmx.vmcs_pa = -1ull,
+	};
+
+	if (copy_from_user(&user_data_size, &user_kvm_state->size,
+			   sizeof(user_data_size)))
+		return -EFAULT;
+
+	if (nested_vmx_allowed(vcpu) && vmx->nested.vmxon) {
+		kvm_state.vmx.vmxon_pa = vmx->nested.vmxon_ptr;
+		kvm_state.vmx.vmcs_pa = vmx->nested.current_vmptr;
+
+		if (vmx->nested.current_vmptr != -1ull)
+			kvm_state.size += VMCS12_SIZE;
+
+		if (is_guest_mode(vcpu)) {
+			kvm_state.flags |= KVM_STATE_GUEST_MODE;
+
+			if (vmx->nested.nested_run_pending)
+				kvm_state.flags |= KVM_STATE_RUN_PENDING;
+		}
+	}
+
+	if (user_data_size < kvm_state.size) {
+		if (copy_to_user(&user_kvm_state->size, &kvm_state.size,
+				 sizeof(kvm_state.size)))
+			return -EFAULT;
+		return -E2BIG;
+	}
+
+	if (copy_to_user(user_kvm_state, &kvm_state, sizeof(kvm_state)))
+		return -EFAULT;
+
+	if (vmx->nested.current_vmptr == -1ull)
+		return 0;
+
+	return get_vmcs_cache(vcpu, user_kvm_state);
+}
+
+static int set_vmcs_cache(struct kvm_vcpu *vcpu,
+			  struct kvm_state __user *user_kvm_state,
+			  struct kvm_state *kvm_state)
+
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+	u32 exit_qual;
+	int ret;
+
+	if ((kvm_state->size < (sizeof(*vmcs12) + sizeof(*kvm_state))) ||
+	    kvm_state->vmx.vmcs_pa == kvm_state->vmx.vmxon_pa ||
+	    !page_address_valid(vcpu, kvm_state->vmx.vmcs_pa))
+		return -EINVAL;
+
+	if (copy_from_user(vmcs12, user_kvm_state->data, sizeof(*vmcs12)))
+		return -EFAULT;
+
+	if (vmcs12->revision_id != VMCS12_REVISION)
+		return -EINVAL;
+
+	set_current_vmptr(vmx, kvm_state->vmx.vmcs_pa);
+
+	if (!(kvm_state->flags & KVM_STATE_GUEST_MODE))
+		return 0;
+
+	if (kvm_state->flags & KVM_STATE_RUN_PENDING)
+		vmx->nested.nested_run_pending = 1;
+
+	if (check_vmentry_prereqs(vcpu, vmcs12) ||
+	    check_vmentry_postreqs(vcpu, vmcs12, &exit_qual))
+		return -EINVAL;
+
+	ret = enter_vmx_non_root_mode(vcpu, true);
+	if (ret)
+		return ret;
+
+	/*
+	 * This request will result in a call to
+	 * nested_get_vmcs12_pages before the next VM-entry.
+	 */
+	kvm_make_request(KVM_REQ_GET_VMCS12_PAGES, vcpu);
+
+	vmx->nested.nested_run_pending = 1;
+
+	return 0;
+}
+
+static int set_vmx_state(struct kvm_vcpu *vcpu,
+			 struct kvm_state __user *user_kvm_state)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct kvm_state kvm_state;
+	int ret;
+
+	if (copy_from_user(&kvm_state, user_kvm_state, sizeof(kvm_state)))
+		return -EFAULT;
+
+	if (kvm_state.size < sizeof(kvm_state))
+		return -EINVAL;
+
+	if (kvm_state.format != 0)
+		return -EINVAL;
+
+	if (kvm_state.flags &
+	    ~(KVM_STATE_RUN_PENDING | KVM_STATE_GUEST_MODE))
+		return -EINVAL;
+
+	if (!nested_vmx_allowed(vcpu))
+		return kvm_state.vmx.vmxon_pa == -1ull ? 0 : -EINVAL;
+
+	vmx_leave_nested(vcpu);
+
+	vmx->nested.nested_run_pending =
+		!!(kvm_state.flags & KVM_STATE_RUN_PENDING);
+
+	if (kvm_state.vmx.vmxon_pa == -1ull)
+		return 0;
+
+	if (!page_address_valid(vcpu, kvm_state.vmx.vmxon_pa))
+		return -EINVAL;
+
+	vmx->nested.vmxon_ptr = kvm_state.vmx.vmxon_pa;
+	ret = enter_vmx_operation(vcpu);
+	if (ret)
+		return ret;
+
+	if (kvm_state.vmx.vmcs_pa == -1ull)
+		return 0;
+
+	return set_vmcs_cache(vcpu, user_kvm_state, &kvm_state);
+}
+
 static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,
@@ -12728,6 +12893,10 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 
 	.setup_mce = vmx_setup_mce,
 
+	.get_state = get_vmx_state,
+	.set_state = set_vmx_state,
+	.get_vmcs12_pages = nested_get_vmcs12_pages,
+
 	.smi_allowed = vmx_smi_allowed,
 	.pre_enter_smm = vmx_pre_enter_smm,
 	.pre_leave_smm = vmx_pre_leave_smm,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1a2ed92..565e1de 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2923,6 +2923,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_X2APIC_API:
 		r = KVM_X2APIC_API_VALID_FLAGS;
 		break;
+	case KVM_CAP_STATE:
+		r = !!kvm_x86_ops->get_state;
+		break;
 	default:
 		break;
 	}
@@ -3941,6 +3944,22 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 		r = kvm_vcpu_ioctl_enable_cap(vcpu, &cap);
 		break;
 	}
+	case KVM_GET_STATE: {
+		struct kvm_state __user *user_kvm_state = argp;
+
+		r = -EINVAL;
+		if (kvm_x86_ops->get_state)
+			r = kvm_x86_ops->get_state(vcpu, user_kvm_state);
+		break;
+	}
+	case KVM_SET_STATE: {
+		struct kvm_state __user *user_kvm_state = argp;
+
+		r = -EINVAL;
+		if (kvm_x86_ops->set_state)
+			r = kvm_x86_ops->set_state(vcpu, user_kvm_state);
+		break;
+	}
 	default:
 		r = -EINVAL;
 	}
@@ -7214,6 +7233,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	bool req_immediate_exit = false;
 
 	if (kvm_request_pending(vcpu)) {
+		if (kvm_check_request(KVM_REQ_GET_VMCS12_PAGES, vcpu))
+			kvm_x86_ops->get_vmcs12_pages(vcpu);
 		if (kvm_check_request(KVM_REQ_MMU_RELOAD, vcpu))
 			kvm_mmu_unload(vcpu);
 		if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7a2889a..001b122 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -126,7 +126,7 @@ static inline bool is_error_page(struct page *page)
 #define KVM_REQ_MMU_RELOAD        (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_PENDING_TIMER     2
 #define KVM_REQ_UNHALT            3
-#define KVM_REQUEST_ARCH_BASE     8
+#define KVM_REQUEST_ARCH_BASE     7
 
 #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
 	BUILD_BUG_ON((unsigned)(nr) >= 32 - KVM_REQUEST_ARCH_BASE); \
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 077d16f..69e6791 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -961,6 +961,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_S390_BPB 152
 #define KVM_CAP_GET_MSR_FEATURES 153
 #define KVM_CAP_HYPERV_EVENTFD 154
+#define KVM_CAP_STATE 155
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1392,6 +1393,10 @@ struct kvm_s390_ucas_mapping {
 /* Memory Encryption Commands */
 #define KVM_MEMORY_ENCRYPT_OP      _IOWR(KVMIO, 0xba, unsigned long)
 
+/* Available with KVM_CAP_STATE */
+#define KVM_GET_STATE         _IOWR(KVMIO, 0xbb, struct kvm_vmx_state)
+#define KVM_SET_STATE         _IOW(KVMIO,  0xbc, struct kvm_vmx_state)
+
 struct kvm_enc_region {
 	__u64 addr;
 	__u64 size;
-- 
2.7.4
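
For illustration, a rough sketch of how userspace could drive the two new
ioctls, including the -E2BIG size negotiation described in the documentation
hunk above; the helper names and error handling are assumptions made for the
sketch, not part of the patch:

#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Save the nested state of one vcpu fd.  KVM_GET_STATE reports the
 * required size via -E2BIG (updating .size), so probe with the bare
 * header first and retry with a buffer of the reported size.
 */
static struct kvm_state *save_nested_state(int vcpu_fd)
{
	struct kvm_state probe = { .size = sizeof(probe) };
	struct kvm_state *state;

	if (ioctl(vcpu_fd, KVM_GET_STATE, &probe) == 0) {
		/* No vmcs12 payload; the fixed-size header is everything. */
		state = malloc(sizeof(probe));
		if (state)
			memcpy(state, &probe, sizeof(probe));
		return state;
	}
	if (errno != E2BIG)
		return NULL;

	state = calloc(1, probe.size);	/* header + vmcs12 blob */
	if (!state)
		return NULL;
	state->size = probe.size;
	if (ioctl(vcpu_fd, KVM_GET_STATE, state) < 0) {
		free(state);
		return NULL;
	}
	return state;
}

/* Hand the same buffer back on the destination vcpu. */
static int restore_nested_state(int vcpu_fd, struct kvm_state *state)
{
	return ioctl(vcpu_fd, KVM_SET_STATE, state);
}

On restore, set_vmx_state() re-validates the size, flags and the VMXON/VMCS
addresses, and for a vcpu that was saved in guest mode it re-enters VMX
non-root mode before the next KVM_RUN.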


* Re: [PATCH 1/2] X86/KVM: Properly restore 'tsc_offset' when running an L2 guest
  2018-04-12 15:12 [PATCH 1/2] X86/KVM: Properly restore 'tsc_offset' when running an L2 guest KarimAllah Ahmed
  2018-04-12 15:12 ` [PATCH 2/2] kvm: nVMX: Introduce KVM_CAP_STATE KarimAllah Ahmed
@ 2018-04-12 16:35 ` Paolo Bonzini
  2018-04-12 17:04   ` Raslan, KarimAllah
  1 sibling, 1 reply; 16+ messages in thread
From: Paolo Bonzini @ 2018-04-12 16:35 UTC (permalink / raw)
  To: KarimAllah Ahmed, linux-kernel, kvm
  Cc: Jim Mattson, Radim Krčmář

On 12/04/2018 17:12, KarimAllah Ahmed wrote:
> When the TSC MSR is captured while an L2 guest is running then restored,
> the 'tsc_offset' ends up capturing the L02 TSC_OFFSET instead of the L01
> TSC_OFFSET. So ensure that this is compensated for when storing the value.
> 
> Cc: Jim Mattson <jmattson@google.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Cc: kvm@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
> ---
>  arch/x86/kvm/vmx.c | 12 +++++++++---
>  arch/x86/kvm/x86.c |  1 -
>  2 files changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index cff2f50..2f57571 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2900,6 +2900,8 @@ static u64 guest_read_tsc(struct kvm_vcpu *vcpu)
>   */
>  static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
>  {
> +	u64 l1_tsc_offset = 0;
> +
>  	if (is_guest_mode(vcpu)) {
>  		/*
>  		 * We're here if L1 chose not to trap WRMSR to TSC. According
> @@ -2908,16 +2910,20 @@ static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
>  		 * to the newly set TSC to get L2's TSC.
>  		 */
>  		struct vmcs12 *vmcs12;
> +
>  		/* recalculate vmcs02.TSC_OFFSET: */
>  		vmcs12 = get_vmcs12(vcpu);
> -		vmcs_write64(TSC_OFFSET, offset +
> -			(nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ?
> -			 vmcs12->tsc_offset : 0));
> +
> +		l1_tsc_offset = nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ?
> +					vmcs12->tsc_offset : 0;
> +		vmcs_write64(TSC_OFFSET, offset + l1_tsc_offset);
>  	} else {
>  		trace_kvm_write_tsc_offset(vcpu->vcpu_id,
>  					   vmcs_read64(TSC_OFFSET), offset);
>  		vmcs_write64(TSC_OFFSET, offset);
>  	}
> +
> +	vcpu->arch.tsc_offset = offset - l1_tsc_offset;

Using both "offset + l1_tsc_offset" and "offset - l1_tsc_offset" in this 
function seems wrong to me: if vcpu->arch.tsc_offset must be "offset - 
l1_tsc_offset", then "offset" must be written to TSC_OFFSET.

I think the bug was introduced by commit 3e3f50262.  Before,
vmx_read_tsc_offset returned the L02 offset; now it always contains the
L01 offset.  So the right fix is to adjust vcpu->arch.tsc_offset on
nested vmentry/vmexit.  If is_guest_mode(vcpu), kvm_read_l1_tsc must use
a new kvm_x86_ops callback to subtract the L12 offset from the value it
returns.
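
A minimal sketch of that direction (the callback name and its wiring are
only assumed here, nothing this thread has settled on):

static u64 vmx_read_l1_tsc_offset(struct kvm_vcpu *vcpu)
{
	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);

	/*
	 * With vcpu->arch.tsc_offset adjusted on nested vmentry/vmexit it
	 * holds the L02 offset while in guest mode, so strip the L12 part
	 * to report the L01 offset.
	 */
	if (is_guest_mode(vcpu) &&
	    nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING))
		return vcpu->arch.tsc_offset - vmcs12->tsc_offset;

	return vcpu->arch.tsc_offset;
}

u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
{
	return kvm_x86_ops->read_l1_tsc_offset(vcpu) +
	       kvm_scale_tsc(vcpu, host_tsc);
}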

Thanks,

Paolo

>  }
>  
>  /*
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ac42c85..1a2ed92 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1539,7 +1539,6 @@ EXPORT_SYMBOL_GPL(kvm_read_l1_tsc);
>  static void kvm_vcpu_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
>  {
>  	kvm_x86_ops->write_tsc_offset(vcpu, offset);
> -	vcpu->arch.tsc_offset = offset;
>  }
>  
>  static inline bool kvm_check_tsc_unstable(void)
> 


* Re: [PATCH 2/2] kvm: nVMX: Introduce KVM_CAP_STATE
  2018-04-12 15:12 ` [PATCH 2/2] kvm: nVMX: Introduce KVM_CAP_STATE KarimAllah Ahmed
@ 2018-04-12 16:39   ` Paolo Bonzini
  2018-04-14 15:56   ` Raslan, KarimAllah
  2018-04-16 16:22   ` Jim Mattson
  2 siblings, 0 replies; 16+ messages in thread
From: Paolo Bonzini @ 2018-04-12 16:39 UTC (permalink / raw)
  To: KarimAllah Ahmed, linux-kernel, kvm
  Cc: Jim Mattson, Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H . Peter Anvin, x86

On 12/04/2018 17:12, KarimAllah Ahmed wrote:
> From: Jim Mattson <jmattson@google.com>
> 
> For nested virtualization L0 KVM is managing a bit of state for L2 guests,
> this state can not be captured through the currently available IOCTLs. In
> fact the state captured through all of these IOCTLs is usually a mix of L1
> and L2 state. It is also dependent on whether the L2 guest was running at
> the moment when the process was interrupted to save its state.
> 
> With this capability, there are two new vcpu ioctls: KVM_GET_VMX_STATE and
> KVM_SET_VMX_STATE. These can be used for saving and restoring a VM that is
> in VMX operation.
> 
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: x86@kernel.org
> Cc: kvm@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Jim Mattson <jmattson@google.com>
> [karahmed@ - rename structs and functions and make them ready for AMD and
>              address previous comments.
>            - rebase & a bit of refactoring.
>            - Merge 7/8 and 8/8 into one patch.
>            - Force a VMExit from L2 after reading the kvm_state to avoid
>              mixed state between L1 and L2 on resurrecting the instance. ]
> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
> ---
> v2 -> v3:
> - Remove the forced VMExit from L2 after reading the kvm_state. The actual
>   problem is solved.
> - Rebase again!
> - Set nested_run_pending during restore (not sure if it makes sense yet or
>   not).
> - Reduce KVM_REQUEST_ARCH_BASE to 7 instead of 8 (the other alternative is
>   to switch everything to u64)

You still have to rename everything to KVM_{CAP,GET,SET}_NESTED_STATE
(and vmx_{get,set}_nested_state) though. :)

Paolo


* Re: [PATCH 1/2] X86/KVM: Properly restore 'tsc_offset' when running an L2 guest
  2018-04-12 16:35 ` [PATCH 1/2] X86/KVM: Properly restore 'tsc_offset' when running an L2 guest Paolo Bonzini
@ 2018-04-12 17:04   ` Raslan, KarimAllah
  2018-04-12 17:21     ` Raslan, KarimAllah
  0 siblings, 1 reply; 16+ messages in thread
From: Raslan, KarimAllah @ 2018-04-12 17:04 UTC (permalink / raw)
  To: linux-kernel, kvm, pbonzini; +Cc: jmattson, rkrcmar

On Thu, 2018-04-12 at 18:35 +0200, Paolo Bonzini wrote:
> On 12/04/2018 17:12, KarimAllah Ahmed wrote:
> > 
> > When the TSC MSR is captured while an L2 guest is running then restored,
> > the 'tsc_offset' ends up capturing the L02 TSC_OFFSET instead of the L01
> > TSC_OFFSET. So ensure that this is compensated for when storing the value.
> > 
> > Cc: Jim Mattson <jmattson@google.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Radim Krčmář <rkrcmar@redhat.com>
> > Cc: kvm@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
> > ---
> >  arch/x86/kvm/vmx.c | 12 +++++++++---
> >  arch/x86/kvm/x86.c |  1 -
> >  2 files changed, 9 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index cff2f50..2f57571 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -2900,6 +2900,8 @@ static u64 guest_read_tsc(struct kvm_vcpu *vcpu)
> >   */
> >  static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
> >  {
> > +	u64 l1_tsc_offset = 0;
> > +
> >  	if (is_guest_mode(vcpu)) {
> >  		/*
> >  		 * We're here if L1 chose not to trap WRMSR to TSC. According
> > @@ -2908,16 +2910,20 @@ static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
> >  		 * to the newly set TSC to get L2's TSC.
> >  		 */
> >  		struct vmcs12 *vmcs12;
> > +
> >  		/* recalculate vmcs02.TSC_OFFSET: */
> >  		vmcs12 = get_vmcs12(vcpu);
> > -		vmcs_write64(TSC_OFFSET, offset +
> > -			(nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ?
> > -			 vmcs12->tsc_offset : 0));
> > +
> > +		l1_tsc_offset = nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ?
> > +					vmcs12->tsc_offset : 0;
> > +		vmcs_write64(TSC_OFFSET, offset + l1_tsc_offset);
> >  	} else {
> >  		trace_kvm_write_tsc_offset(vcpu->vcpu_id,
> >  					   vmcs_read64(TSC_OFFSET), offset);
> >  		vmcs_write64(TSC_OFFSET, offset);
> >  	}
> > +
> > +	vcpu->arch.tsc_offset = offset - l1_tsc_offset;
> 
> Using both "offset + l1_tsc_offset" and "offset - l1_tsc_offset" in this 
> function seems wrong to me: if vcpu->arch.tsc_offset must be "offset - 
> l1_tsc_offset", then "offset" must be written to TSC_OFFSET.

Ooops! I forgot to remove the + l1_tsc_offset :D

> 
> I think the bug was introduced by commit 3e3f50262.  Before,
> vmx_read_tsc_offset returned the L02 offset; now it always contains the
> L01 offset.  So the right fix is to adjust vcpu->arch.tsc_offset on
> nested vmentry/vmexit.  If is_guest_mode(vcpu), kvm_read_l1_tsc must use
> a new kvm_x86_ops callback to subtract the L12 offset from the value it
> returns.

ack!

> 
> Thanks,
> 
> Paolo
> 
> > 
> >  }
> >  
> >  /*
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index ac42c85..1a2ed92 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -1539,7 +1539,6 @@ EXPORT_SYMBOL_GPL(kvm_read_l1_tsc);
> >  static void kvm_vcpu_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
> >  {
> >  	kvm_x86_ops->write_tsc_offset(vcpu, offset);
> > -	vcpu->arch.tsc_offset = offset;
> >  }
> >  
> >  static inline bool kvm_check_tsc_unstable(void)
> > 
> 
> 


* Re: [PATCH 1/2] X86/KVM: Properly restore 'tsc_offset' when running an L2 guest
  2018-04-12 17:04   ` Raslan, KarimAllah
@ 2018-04-12 17:21     ` Raslan, KarimAllah
  2018-04-12 20:21       ` Paolo Bonzini
  0 siblings, 1 reply; 16+ messages in thread
From: Raslan, KarimAllah @ 2018-04-12 17:21 UTC (permalink / raw)
  To: linux-kernel, kvm, pbonzini; +Cc: jmattson, rkrcmar

On Thu, 2018-04-12 at 17:04 +0000, Raslan, KarimAllah wrote:
> On Thu, 2018-04-12 at 18:35 +0200, Paolo Bonzini wrote:
> > 
> > On 12/04/2018 17:12, KarimAllah Ahmed wrote:
> > > 
> > > 
> > > When the TSC MSR is captured while an L2 guest is running then restored,
> > > the 'tsc_offset' ends up capturing the L02 TSC_OFFSET instead of the L01
> > > TSC_OFFSET. So ensure that this is compensated for when storing the value.
> > > 
> > > Cc: Jim Mattson <jmattson@google.com>
> > > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > > Cc: Radim Krčmář <rkrcmar@redhat.com>
> > > Cc: kvm@vger.kernel.org
> > > Cc: linux-kernel@vger.kernel.org
> > > Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
> > > ---
> > >  arch/x86/kvm/vmx.c | 12 +++++++++---
> > >  arch/x86/kvm/x86.c |  1 -
> > >  2 files changed, 9 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > > index cff2f50..2f57571 100644
> > > --- a/arch/x86/kvm/vmx.c
> > > +++ b/arch/x86/kvm/vmx.c
> > > @@ -2900,6 +2900,8 @@ static u64 guest_read_tsc(struct kvm_vcpu *vcpu)
> > >   */
> > >  static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
> > >  {
> > > +	u64 l1_tsc_offset = 0;
> > > +
> > >  	if (is_guest_mode(vcpu)) {
> > >  		/*
> > >  		 * We're here if L1 chose not to trap WRMSR to TSC. According
> > > @@ -2908,16 +2910,20 @@ static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
> > >  		 * to the newly set TSC to get L2's TSC.
> > >  		 */
> > >  		struct vmcs12 *vmcs12;
> > > +
> > >  		/* recalculate vmcs02.TSC_OFFSET: */
> > >  		vmcs12 = get_vmcs12(vcpu);
> > > -		vmcs_write64(TSC_OFFSET, offset +
> > > -			(nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ?
> > > -			 vmcs12->tsc_offset : 0));
> > > +
> > > +		l1_tsc_offset = nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ?
> > > +					vmcs12->tsc_offset : 0;
> > > +		vmcs_write64(TSC_OFFSET, offset + l1_tsc_offset);
> > >  	} else {
> > >  		trace_kvm_write_tsc_offset(vcpu->vcpu_id,
> > >  					   vmcs_read64(TSC_OFFSET), offset);
> > >  		vmcs_write64(TSC_OFFSET, offset);
> > >  	}
> > > +
> > > +	vcpu->arch.tsc_offset = offset - l1_tsc_offset;
> > 
> > Using both "offset + l1_tsc_offset" and "offset - l1_tsc_offset" in this 
> > function seems wrong to me: if vcpu->arch.tsc_offset must be "offset - 
> > l1_tsc_offset", then "offset" must be written to TSC_OFFSET.
> 
> Ooops! I forgot to remove the + l1_tsc_offset :D
> 
> > 
> > 
> > I think the bug was introduced by commit 3e3f50262.  Before,
> > vmx_read_tsc_offset returned the L02 offset; now it always contains the
> > L01 offset.  So the right fix is to adjust vcpu->arch.tsc_offset on
> > nested vmentry/vmexit.  If is_guest_mode(vcpu), kvm_read_l1_tsc must use
> > a new kvm_x86_ops callback to subtract the L12 offset from the value it
> > returns.
> 
> ack!

Now looking further at the code, it seems that everywhere in the code
tsc_offset is treated as the L01 TSC_OFFSET.

Like here:

        if (vmcs12->cpu_based_vm_exec_control &
CPU_BASED_USE_TSC_OFFSETING)
                vmcs_write64(TSC_OFFSET,
                        vcpu->arch.tsc_offset + vmcs12->tsc_offset);

and here:

        vmcs_write64(TSC_OFFSET, vcpu->arch.tsc_offset);

and here:

u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
{
        return vcpu->arch.tsc_offset + kvm_scale_tsc(vcpu, host_tsc);
}
EXPORT_SYMBOL_GPL(kvm_read_l1_tsc);

... wouldn't it be simpler and more in line with the current code to
just do what I did above + remove the "+ l1_tsc_offset" + probably
document tsc_offset?

> 
> > 
> > 
> > Thanks,
> > 
> > Paolo
> > 
> > > 
> > > 
> > >  }
> > >  
> > >  /*
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index ac42c85..1a2ed92 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -1539,7 +1539,6 @@ EXPORT_SYMBOL_GPL(kvm_read_l1_tsc);
> > >  static void kvm_vcpu_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
> > >  {
> > >  	kvm_x86_ops->write_tsc_offset(vcpu, offset);
> > > -	vcpu->arch.tsc_offset = offset;
> > >  }
> > >  
> > >  static inline bool kvm_check_tsc_unstable(void)
> > > 
> > 
> > 


* Re: [PATCH 1/2] X86/KVM: Properly restore 'tsc_offset' when running an L2 guest
  2018-04-12 17:21     ` Raslan, KarimAllah
@ 2018-04-12 20:21       ` Paolo Bonzini
  2018-04-12 20:24         ` Raslan, KarimAllah
  0 siblings, 1 reply; 16+ messages in thread
From: Paolo Bonzini @ 2018-04-12 20:21 UTC (permalink / raw)
  To: Raslan, KarimAllah, linux-kernel, kvm; +Cc: jmattson, rkrcmar

On 12/04/2018 19:21, Raslan, KarimAllah wrote:
> Now looking further at the code, it seems that everywhere in the code
> tsc_offset is treated as the L01 TSC_OFFSET.
> 
> Like here:
> 
>         if (vmcs12->cpu_based_vm_exec_control &
> CPU_BASED_USE_TSC_OFFSETING)
>                 vmcs_write64(TSC_OFFSET,
>                         vcpu->arch.tsc_offset + vmcs12->tsc_offset);
> 
> and here:
> 
>         vmcs_write64(TSC_OFFSET, vcpu->arch.tsc_offset);
> 
> and here:
> 
> u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
> {
>         return vcpu->arch.tsc_offset + kvm_scale_tsc(vcpu, host_tsc);
> }
> EXPORT_SYMBOL_GPL(kvm_read_l1_tsc);
> 
> ... would not it be simpler and more inline with the current code to
> just do what I did above + remove the "+ l1_tsc_offset" + probably
> document tsc_offset ?

Problem is, I don't think it's correct. :)  A good start would be to try
disabling MSR_IA32_TSC interception in KVM, prepare a kvm-unit-tests
test that reads the MSR, and see if you get the host or guest TSC...
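
Something along these lines, perhaps; a rough sketch that assumes the usual
kvm-unit-tests helpers (rdtsc/rdmsr/printf), not an existing test:

#include "libcflat.h"
#include "processor.h"

#define MSR_IA32_TSC 0x00000010

int main(void)
{
	/*
	 * Meant to be run with MSR_IA32_TSC interception disabled in KVM.
	 * Print both readings so they can be compared against the host's
	 * TSC to tell whether the guest saw its own TSC or the host's.
	 */
	u64 via_rdtsc = rdtsc();
	u64 via_msr = rdmsr(MSR_IA32_TSC);

	printf("rdtsc: %ld, rdmsr(IA32_TSC): %ld, delta: %ld\n",
	       (long)via_rdtsc, (long)via_msr, (long)(via_msr - via_rdtsc));
	return 0;
}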

Paolo


* Re: [PATCH 1/2] X86/KVM: Properly restore 'tsc_offset' when running an L2 guest
  2018-04-12 20:21       ` Paolo Bonzini
@ 2018-04-12 20:24         ` Raslan, KarimAllah
  0 siblings, 0 replies; 16+ messages in thread
From: Raslan, KarimAllah @ 2018-04-12 20:24 UTC (permalink / raw)
  To: linux-kernel, kvm, pbonzini; +Cc: jmattson, rkrcmar

On Thu, 2018-04-12 at 22:21 +0200, Paolo Bonzini wrote:
> On 12/04/2018 19:21, Raslan, KarimAllah wrote:
> > 
> > Now looking further at the code, it seems that everywhere in the code
> > tsc_offset is treated as the L01 TSC_OFFSET.
> > 
> > Like here:
> > 
> >         if (vmcs12->cpu_based_vm_exec_control &
> > CPU_BASED_USE_TSC_OFFSETING)
> >                 vmcs_write64(TSC_OFFSET,
> >                         vcpu->arch.tsc_offset + vmcs12->tsc_offset);
> > 
> > and here:
> > 
> >         vmcs_write64(TSC_OFFSET, vcpu->arch.tsc_offset);
> > 
> > and here:
> > 
> > u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
> > {
> >         return vcpu->arch.tsc_offset + kvm_scale_tsc(vcpu, host_tsc);
> > }
> > EXPORT_SYMBOL_GPL(kvm_read_l1_tsc);
> > 
> > ... would not it be simpler and more inline with the current code to
> > just do what I did above + remove the "+ l1_tsc_offset" + probably
> > document tsc_offset ?
> 
> Problem is, I don't think it's correct. :)  A good start would be to try
> disabling MSR_IA32_TSC interception in KVM, prepare a kvm-unit-tests
> test that reads the MSR, and see if you get the host or guest TSC...

I actually just submitted a patch with your original suggestion
(I hope)  because I realized that adjust tsc was still using the wrong
tsc_offset anyway :)

> 
> Paolo
> 


* Re: [PATCH 2/2] kvm: nVMX: Introduce KVM_CAP_STATE
  2018-04-12 15:12 ` [PATCH 2/2] kvm: nVMX: Introduce KVM_CAP_STATE KarimAllah Ahmed
  2018-04-12 16:39   ` Paolo Bonzini
@ 2018-04-14 15:56   ` Raslan, KarimAllah
  2018-04-14 22:31     ` Raslan, KarimAllah
  2018-04-16 16:22   ` Jim Mattson
  2 siblings, 1 reply; 16+ messages in thread
From: Raslan, KarimAllah @ 2018-04-14 15:56 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: jmattson, tglx, rkrcmar, hpa, pbonzini, mingo, x86

On Thu, 2018-04-12 at 17:12 +0200, KarimAllah Ahmed wrote:
> From: Jim Mattson <jmattson@google.com>
> 
> For nested virtualization L0 KVM is managing a bit of state for L2 guests,
> this state can not be captured through the currently available IOCTLs. In
> fact the state captured through all of these IOCTLs is usually a mix of L1
> and L2 state. It is also dependent on whether the L2 guest was running at
> the moment when the process was interrupted to save its state.
> 
> With this capability, there are two new vcpu ioctls: KVM_GET_VMX_STATE and
> KVM_SET_VMX_STATE. These can be used for saving and restoring a VM that is
> in VMX operation.
> 
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: x86@kernel.org
> Cc: kvm@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Jim Mattson <jmattson@google.com>
> [karahmed@ - rename structs and functions and make them ready for AMD and
>              address previous comments.
>            - rebase & a bit of refactoring.
>            - Merge 7/8 and 8/8 into one patch.
>            - Force a VMExit from L2 after reading the kvm_state to avoid
>              mixed state between L1 and L2 on resurrecting the instance. ]
> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
> ---
> v2 -> v3:
> - Remove the forced VMExit from L2 after reading the kvm_state. The actual
>   problem is solved.
> - Rebase again!
> - Set nested_run_pending during restore (not sure if it makes sense yet or
>   not).
> - Reduce KVM_REQUEST_ARCH_BASE to 7 instead of 8 (the other alternative is
>   to switch everything to u64)
> 
> v1 -> v2:
> - Rename structs and functions and make them ready for AMD and address
>   previous comments.
> - Rebase & a bit of refactoring.
> - Merge 7/8 and 8/8 into one patch.
> - Force a VMExit from L2 after reading the kvm_state to avoid mixed state
>   between L1 and L2 on resurrecting the instance.
> ---
>  Documentation/virtual/kvm/api.txt |  47 ++++++++++
>  arch/x86/include/asm/kvm_host.h   |   7 ++
>  arch/x86/include/uapi/asm/kvm.h   |  38 ++++++++
>  arch/x86/kvm/vmx.c                | 177 +++++++++++++++++++++++++++++++++++++-
>  arch/x86/kvm/x86.c                |  21 +++++
>  include/linux/kvm_host.h          |   2 +-
>  include/uapi/linux/kvm.h          |   5 ++
>  7 files changed, 292 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 1c7958b..c51d5d3 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -3548,6 +3548,53 @@ Returns: 0 on success,
>  	-ENOENT on deassign if the conn_id isn't registered
>  	-EEXIST on assign if the conn_id is already registered
>  
> +4.114 KVM_GET_STATE
> +
> +Capability: KVM_CAP_STATE
> +Architectures: x86
> +Type: vcpu ioctl
> +Parameters: struct kvm_state (in/out)
> +Returns: 0 on success, -1 on error
> +Errors:
> +  E2BIG:     the data size exceeds the value of 'size' specified by
> +             the user (the size required will be written into size).
> +
> +struct kvm_state {
> +	__u16 flags;
> +	__u16 format;
> +	__u32 size;
> +	union {
> +		struct kvm_vmx_state vmx;
> +		struct kvm_svm_state svm;
> +		__u8 pad[120];
> +	};
> +	__u8 data[0];
> +};
> +
> +This ioctl copies the vcpu's kvm_state struct from the kernel to userspace.
> +
> +4.115 KVM_SET_STATE
> +
> +Capability: KVM_CAP_STATE
> +Architectures: x86
> +Type: vcpu ioctl
> +Parameters: struct kvm_state (in)
> +Returns: 0 on success, -1 on error
> +
> +struct kvm_state {
> +	__u16 flags;
> +	__u16 format;
> +	__u32 size;
> +	union {
> +		struct kvm_vmx_state vmx;
> +		struct kvm_svm_state svm;
> +		__u8 pad[120];
> +	};
> +	__u8 data[0];
> +};
> +
> +This copies the vcpu's kvm_state struct from userspace to the kernel.
>  
>  5. The kvm_run structure
>  ------------------------
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 9fa4f57..ad2116a 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -75,6 +75,7 @@
>  #define KVM_REQ_HV_EXIT			KVM_ARCH_REQ(21)
>  #define KVM_REQ_HV_STIMER		KVM_ARCH_REQ(22)
>  #define KVM_REQ_LOAD_EOI_EXITMAP	KVM_ARCH_REQ(23)
> +#define KVM_REQ_GET_VMCS12_PAGES	KVM_ARCH_REQ(24)
>  
>  #define CR0_RESERVED_BITS                                               \
>  	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
> @@ -1084,6 +1085,12 @@ struct kvm_x86_ops {
>  
>  	void (*setup_mce)(struct kvm_vcpu *vcpu);
>  
> +	int (*get_state)(struct kvm_vcpu *vcpu,
> +			 struct kvm_state __user *user_kvm_state);
> +	int (*set_state)(struct kvm_vcpu *vcpu,
> +			 struct kvm_state __user *user_kvm_state);
> +	void (*get_vmcs12_pages)(struct kvm_vcpu *vcpu);
> +
>  	int (*smi_allowed)(struct kvm_vcpu *vcpu);
>  	int (*pre_enter_smm)(struct kvm_vcpu *vcpu, char *smstate);
>  	int (*pre_leave_smm)(struct kvm_vcpu *vcpu, u64 smbase);
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index c535c2f..d2c860a 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -378,4 +378,42 @@ struct kvm_sync_regs {
>  #define KVM_X86_QUIRK_LINT0_REENABLED	(1 << 0)
>  #define KVM_X86_QUIRK_CD_NW_CLEARED	(1 << 1)
>  
> +#define KVM_STATE_GUEST_MODE	0x00000001
> +#define KVM_STATE_RUN_PENDING	0x00000002
> +#define KVM_STATE_GIF		0x00000004
> +
> +struct kvm_vmx_state {
> +	__u64 vmxon_pa;
> +	__u64 vmcs_pa;
> +};
> +
> +struct kvm_svm_state {
> +	__u64 hsave_pa;
> +	__u64 vmcb_pa;
> +};
> +
> +/* for KVM_CAP_STATE */
> +struct kvm_state {
> +	/* KVM_STATE_* flags */
> +	__u16 flags;
> +
> +	/* 0 for VMX, 1 for SVM.  */
> +	__u16 format;
> +
> +	/* 128 for SVM, 128 + VMCS size for VMX.  */
> +	__u32 size;
> +
> +	union {
> +		/* VMXON, VMCS */
> +		struct kvm_vmx_state vmx;
> +		/* HSAVE_PA, VMCB */
> +		struct kvm_svm_state svm;
> +
> +		/* Pad the union to 120 bytes.  */
> +		__u8 pad[120];
> +	};
> +
> +	__u8 data[0];
> +};
> +
>  #endif /* _ASM_X86_KVM_H */
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 2f57571..c2b436b 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -10359,10 +10359,10 @@ static void vmx_inject_page_fault_nested(struct kvm_vcpu *vcpu,
>  static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
>  						 struct vmcs12 *vmcs12);
>  
> -static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu,
> -					struct vmcs12 *vmcs12)
> +static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
>  
>  	if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) {
>  		if (vmcs12->apic_access_addr != vmx->nested.apic_access_mapping.gfn << PAGE_SHIFT) {
> @@ -11430,8 +11430,6 @@ static int enter_vmx_non_root_mode(struct kvm_vcpu *vcpu, bool from_vmentry)
>  		return 1;
>  	}
>  
> -	nested_get_vmcs12_pages(vcpu, vmcs12);
> -
>  	msr_entry_idx = nested_vmx_load_msr(vcpu,
>  					    vmcs12->vm_entry_msr_load_addr,
>  					    vmcs12->vm_entry_msr_load_count);
> @@ -11529,6 +11527,8 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
>  	if (ret)
>  		return ret;
>  
> +	nested_get_vmcs12_pages(vcpu);
> +
>  	/*
>  	 * If we're entering a halted L2 vcpu and the L2 vcpu won't be woken
>  	 * by event injection, halt vcpu.
> @@ -12595,6 +12595,171 @@ static int enable_smi_window(struct kvm_vcpu *vcpu)
>  	return 0;
>  }
>  
> +static int get_vmcs_cache(struct kvm_vcpu *vcpu,
> +			  struct kvm_state __user *user_kvm_state)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> +
> +	/*
> +	 * When running L2, the authoritative vmcs12 state is in the
> +	 * vmcs02. When running L1, the authoritative vmcs12 state is
> +	 * in the shadow vmcs linked to vmcs01, unless
> +	 * sync_shadow_vmcs is set, in which case, the authoritative
> +	 * vmcs12 state is in the vmcs12 already.
> +	 */
> +	if (is_guest_mode(vcpu))
> +		sync_vmcs12(vcpu, vmcs12);
> +	else if (enable_shadow_vmcs && !vmx->nested.sync_shadow_vmcs)
> +		copy_shadow_to_vmcs12(vmx);
> +
> +	if (copy_to_user(user_kvm_state->data, vmcs12, sizeof(*vmcs12)))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +
> +static int get_vmx_state(struct kvm_vcpu *vcpu,
> +			 struct kvm_state __user *user_kvm_state)
> +{
> +	u32 user_data_size;
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	struct kvm_state kvm_state = {
> +		.flags = 0,
> +		.format = 0,
> +		.size = sizeof(kvm_state),
> +		.vmx.vmxon_pa = -1ull,
> +		.vmx.vmcs_pa = -1ull,
> +	};
> +
> +	if (copy_from_user(&user_data_size, &user_kvm_state->size,
> +			   sizeof(user_data_size)))
> +		return -EFAULT;
> +
> +	if (nested_vmx_allowed(vcpu) && vmx->nested.vmxon) {
> +		kvm_state.vmx.vmxon_pa = vmx->nested.vmxon_ptr;
> +		kvm_state.vmx.vmcs_pa = vmx->nested.current_vmptr;
> +
> +		if (vmx->nested.current_vmptr != -1ull)
> +			kvm_state.size += VMCS12_SIZE;
> +
> +		if (is_guest_mode(vcpu)) {
> +			kvm_state.flags |= KVM_STATE_GUEST_MODE;
> +
> +			if (vmx->nested.nested_run_pending)
> +				kvm_state.flags |= KVM_STATE_RUN_PENDING;
> +		}
> +	}
> +
> +	if (user_data_size < kvm_state.size) {
> +		if (copy_to_user(&user_kvm_state->size, &kvm_state.size,
> +				 sizeof(kvm_state.size)))
> +			return -EFAULT;
> +		return -E2BIG;
> +	}
> +
> +	if (copy_to_user(user_kvm_state, &kvm_state, sizeof(kvm_state)))
> +		return -EFAULT;
> +
> +	if (vmx->nested.current_vmptr == -1ull)
> +		return 0;
> +
> +	return get_vmcs_cache(vcpu, user_kvm_state);
> +}
> +
> +static int set_vmcs_cache(struct kvm_vcpu *vcpu,
> +			  struct kvm_state __user *user_kvm_state,
> +			  struct kvm_state *kvm_state)
> +
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> +	u32 exit_qual;
> +	int ret;
> +
> +	if ((kvm_state->size < (sizeof(*vmcs12) + sizeof(*kvm_state))) ||
> +	    kvm_state->vmx.vmcs_pa == kvm_state->vmx.vmxon_pa ||
> +	    !page_address_valid(vcpu, kvm_state->vmx.vmcs_pa))
> +		return -EINVAL;
> +
> +	if (copy_from_user(vmcs12, user_kvm_state->data, sizeof(*vmcs12)))
> +		return -EFAULT;
> +
> +	if (vmcs12->revision_id != VMCS12_REVISION)
> +		return -EINVAL;
> +
> +	set_current_vmptr(vmx, kvm_state->vmx.vmcs_pa);
> +
> +	if (!(kvm_state->flags & KVM_STATE_GUEST_MODE))
> +		return 0;
> +
> +	if (kvm_state->flags & KVM_STATE_RUN_PENDING)
> +		vmx->nested.nested_run_pending = 1;
> +
> +	if (check_vmentry_prereqs(vcpu, vmcs12) ||
> +	    check_vmentry_postreqs(vcpu, vmcs12, &exit_qual))
> +		return -EINVAL;
> +
> +	ret = enter_vmx_non_root_mode(vcpu, true);
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * This request will result in a call to
> +	 * nested_get_vmcs12_pages before the next VM-entry.
> +	 */
> +	kvm_make_request(KVM_REQ_GET_VMCS12_PAGES, vcpu);
> +
> +	vmx->nested.nested_run_pending = 1;
> +
> +	return 0;
> +}
> +
> +static int set_vmx_state(struct kvm_vcpu *vcpu,
> +			 struct kvm_state __user *user_kvm_state)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	struct kvm_state kvm_state;
> +	int ret;
> +
> +	if (copy_from_user(&kvm_state, user_kvm_state, sizeof(kvm_state)))
> +		return -EFAULT;
> +
> +	if (kvm_state.size < sizeof(kvm_state))
> +		return -EINVAL;
> +
> +	if (kvm_state.format != 0)
> +		return -EINVAL;
> +
> +	if (kvm_state.flags &
> +	    ~(KVM_STATE_RUN_PENDING | KVM_STATE_GUEST_MODE))
> +		return -EINVAL;
> +
> +	if (!nested_vmx_allowed(vcpu))
> +		return kvm_state.vmx.vmxon_pa == -1ull ? 0 : -EINVAL;
> +
> +	vmx_leave_nested(vcpu);
> +
> +	vmx->nested.nested_run_pending =
> +		!!(kvm_state.flags & KVM_STATE_RUN_PENDING);
> +
> +	if (kvm_state.vmx.vmxon_pa == -1ull)
> +		return 0;
> +
> +	if (!page_address_valid(vcpu, kvm_state.vmx.vmxon_pa))
> +		return -EINVAL;
> +
> +	vmx->nested.vmxon_ptr = kvm_state.vmx.vmxon_pa;
> +	ret = enter_vmx_operation(vcpu);
> +	if (ret)
> +		return ret;
> +
> +	if (kvm_state.vmx.vmcs_pa == -1ull)
> +		return 0;
> +
> +	return set_vmcs_cache(vcpu, user_kvm_state, &kvm_state);
> +}
> +
>  static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
>  	.cpu_has_kvm_support = cpu_has_kvm_support,
>  	.disabled_by_bios = vmx_disabled_by_bios,
> @@ -12728,6 +12893,10 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
>  
>  	.setup_mce = vmx_setup_mce,
>  
> +	.get_state = get_vmx_state,
> +	.set_state = set_vmx_state,
> +	.get_vmcs12_pages = nested_get_vmcs12_pages,
> +
>  	.smi_allowed = vmx_smi_allowed,
>  	.pre_enter_smm = vmx_pre_enter_smm,
>  	.pre_leave_smm = vmx_pre_leave_smm,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1a2ed92..565e1de 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2923,6 +2923,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_X2APIC_API:
>  		r = KVM_X2APIC_API_VALID_FLAGS;
>  		break;
> +	case KVM_CAP_STATE:
> +		r = !!kvm_x86_ops->get_state;
> +		break;
>  	default:
>  		break;
>  	}
> @@ -3941,6 +3944,22 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>  		r = kvm_vcpu_ioctl_enable_cap(vcpu, &cap);
>  		break;
>  	}
> +	case KVM_GET_STATE: {
> +		struct kvm_state __user *user_kvm_state = argp;
> +
> +		r = -EINVAL;
> +		if (kvm_x86_ops->get_state)
> +			r = kvm_x86_ops->get_state(vcpu, user_kvm_state);
> +		break;
> +	}
> +	case KVM_SET_STATE: {
> +		struct kvm_state __user *user_kvm_state = argp;
> +
> +		r = -EINVAL;
> +		if (kvm_x86_ops->set_state)
> +			r = kvm_x86_ops->set_state(vcpu, user_kvm_state);
> +		break;
> +	}
>  	default:
>  		r = -EINVAL;
>  	}
> @@ -7214,6 +7233,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  	bool req_immediate_exit = false;
>  
>  	if (kvm_request_pending(vcpu)) {
> +		if (kvm_check_request(KVM_REQ_GET_VMCS12_PAGES, vcpu))
> +			kvm_x86_ops->get_vmcs12_pages(vcpu);
>  		if (kvm_check_request(KVM_REQ_MMU_RELOAD, vcpu))
>  			kvm_mmu_unload(vcpu);
>  		if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 7a2889a..001b122 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -126,7 +126,7 @@ static inline bool is_error_page(struct page *page)
>  #define KVM_REQ_MMU_RELOAD        (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>  #define KVM_REQ_PENDING_TIMER     2
>  #define KVM_REQ_UNHALT            3
> -#define KVM_REQUEST_ARCH_BASE     8
> +#define KVM_REQUEST_ARCH_BASE     7

Paolo/Jim,

Is this bit acceptable for you or should I just go ahead and update
"requests" to be 64-bit instead?

We ran out of architecture-specific bits :) With KVM_REQUEST_ARCH_BASE
left at 8, the BUILD_BUG_ON in KVM_ARCH_REQ_FLAGS() (quoted below)
rejects the new KVM_ARCH_REQ(24), since it requires nr < 32 -
KVM_REQUEST_ARCH_BASE.

>  
>  #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
>  	BUILD_BUG_ON((unsigned)(nr) >= 32 - KVM_REQUEST_ARCH_BASE); \
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 077d16f..69e6791 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -961,6 +961,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_S390_BPB 152
>  #define KVM_CAP_GET_MSR_FEATURES 153
>  #define KVM_CAP_HYPERV_EVENTFD 154
> +#define KVM_CAP_STATE 155
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -1392,6 +1393,10 @@ struct kvm_s390_ucas_mapping {
>  /* Memory Encryption Commands */
>  #define KVM_MEMORY_ENCRYPT_OP      _IOWR(KVMIO, 0xba, unsigned long)
>  
> +/* Available with KVM_CAP_STATE */
> +#define KVM_GET_STATE         _IOWR(KVMIO, 0xbb, struct kvm_vmx_state)
> +#define KVM_SET_STATE         _IOW(KVMIO,  0xbc, struct kvm_vmx_state)
> +
>  struct kvm_enc_region {
>  	__u64 addr;
>  	__u64 size;


* Re: [PATCH 2/2] kvm: nVMX: Introduce KVM_CAP_STATE
  2018-04-14 15:56   ` Raslan, KarimAllah
@ 2018-04-14 22:31     ` Raslan, KarimAllah
  0 siblings, 0 replies; 16+ messages in thread
From: Raslan, KarimAllah @ 2018-04-14 22:31 UTC (permalink / raw)
  To: linux-kernel, kvm; +Cc: jmattson, tglx, rkrcmar, hpa, pbonzini, mingo, x86

On Sat, 2018-04-14 at 15:56 +0000, Raslan, KarimAllah wrote:
> On Thu, 2018-04-12 at 17:12 +0200, KarimAllah Ahmed wrote:
> > 
> > From: Jim Mattson <jmattson@google.com>
> > 
> > For nested virtualization L0 KVM is managing a bit of state for L2 guests,
> > this state can not be captured through the currently available IOCTLs. In
> > fact the state captured through all of these IOCTLs is usually a mix of L1
> > and L2 state. It is also dependent on whether the L2 guest was running at
> > the moment when the process was interrupted to save its state.
> > 
> > With this capability, there are two new vcpu ioctls: KVM_GET_VMX_STATE and
> > KVM_SET_VMX_STATE. These can be used for saving and restoring a VM that is
> > in VMX operation.
> > 
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Radim Krčmář <rkrcmar@redhat.com>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: H. Peter Anvin <hpa@zytor.com>
> > Cc: x86@kernel.org
> > Cc: kvm@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: Jim Mattson <jmattson@google.com>
> > [karahmed@ - rename structs and functions and make them ready for AMD and
> >              address previous comments.
> >            - rebase & a bit of refactoring.
> >            - Merge 7/8 and 8/8 into one patch.
> >            - Force a VMExit from L2 after reading the kvm_state to avoid
> >              mixed state between L1 and L2 on resurrecting the instance. ]
> > Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
> > ---
> > v2 -> v3:
> > - Remove the forced VMExit from L2 after reading the kvm_state. The actual
> >   problem is solved.
> > - Rebase again!
> > - Set nested_run_pending during restore (not sure if it makes sense yet or
> >   not).
> > - Reduce KVM_REQUEST_ARCH_BASE to 7 instead of 8 (the other alternative is
> >   to switch everything to u64)
> > 
> > v1 -> v2:
> > - Rename structs and functions and make them ready for AMD and address
> >   previous comments.
> > - Rebase & a bit of refactoring.
> > - Merge 7/8 and 8/8 into one patch.
> > - Force a VMExit from L2 after reading the kvm_state to avoid mixed state
> >   between L1 and L2 on resurrecting the instance.
> > ---
> >  Documentation/virtual/kvm/api.txt |  47 ++++++++++
> >  arch/x86/include/asm/kvm_host.h   |   7 ++
> >  arch/x86/include/uapi/asm/kvm.h   |  38 ++++++++
> >  arch/x86/kvm/vmx.c                | 177 +++++++++++++++++++++++++++++++++++++-
> >  arch/x86/kvm/x86.c                |  21 +++++
> >  include/linux/kvm_host.h          |   2 +-
> >  include/uapi/linux/kvm.h          |   5 ++
> >  7 files changed, 292 insertions(+), 5 deletions(-)
> > 
> > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> > index 1c7958b..c51d5d3 100644
> > --- a/Documentation/virtual/kvm/api.txt
> > +++ b/Documentation/virtual/kvm/api.txt
> > @@ -3548,6 +3548,53 @@ Returns: 0 on success,
> >  	-ENOENT on deassign if the conn_id isn't registered
> >  	-EEXIST on assign if the conn_id is already registered
> >  
> > +4.114 KVM_GET_STATE
> > +
> > +Capability: KVM_CAP_STATE
> > +Architectures: x86
> > +Type: vcpu ioctl
> > +Parameters: struct kvm_state (in/out)
> > +Returns: 0 on success, -1 on error
> > +Errors:
> > +  E2BIG:     the data size exceeds the value of 'size' specified by
> > +             the user (the size required will be written into size).
> > +
> > +struct kvm_state {
> > +	__u16 flags;
> > +	__u16 format;
> > +	__u32 size;
> > +	union {
> > +		struct kvm_vmx_state vmx;
> > +		struct kvm_svm_state svm;
> > +		__u8 pad[120];
> > +	};
> > +	__u8 data[0];
> > +};
> > +
> > +This ioctl copies the vcpu's kvm_state struct from the kernel to userspace.
> > +
> > +4.115 KVM_SET_STATE
> > +
> > +Capability: KVM_CAP_STATE
> > +Architectures: x86
> > +Type: vcpu ioctl
> > +Parameters: struct kvm_state (in)
> > +Returns: 0 on success, -1 on error
> > +
> > +struct kvm_state {
> > +	__u16 flags;
> > +	__u16 format;
> > +	__u32 size;
> > +	union {
> > +		struct kvm_vmx_state vmx;
> > +		struct kvm_svm_state svm;
> > +		__u8 pad[120];
> > +	};
> > +	__u8 data[0];
> > +};
> > +
> > +This copies the vcpu's kvm_state struct from userspace to the kernel.
> > +
> >  
> >  5. The kvm_run structure
> >  ------------------------
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 9fa4f57..ad2116a 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -75,6 +75,7 @@
> >  #define KVM_REQ_HV_EXIT			KVM_ARCH_REQ(21)
> >  #define KVM_REQ_HV_STIMER		KVM_ARCH_REQ(22)
> >  #define KVM_REQ_LOAD_EOI_EXITMAP	KVM_ARCH_REQ(23)
> > +#define KVM_REQ_GET_VMCS12_PAGES	KVM_ARCH_REQ(24)
> >  
> >  #define CR0_RESERVED_BITS                                               \
> >  	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
> > @@ -1084,6 +1085,12 @@ struct kvm_x86_ops {
> >  
> >  	void (*setup_mce)(struct kvm_vcpu *vcpu);
> >  
> > +	int (*get_state)(struct kvm_vcpu *vcpu,
> > +			 struct kvm_state __user *user_kvm_state);
> > +	int (*set_state)(struct kvm_vcpu *vcpu,
> > +			 struct kvm_state __user *user_kvm_state);
> > +	void (*get_vmcs12_pages)(struct kvm_vcpu *vcpu);
> > +
> >  	int (*smi_allowed)(struct kvm_vcpu *vcpu);
> >  	int (*pre_enter_smm)(struct kvm_vcpu *vcpu, char *smstate);
> >  	int (*pre_leave_smm)(struct kvm_vcpu *vcpu, u64 smbase);
> > diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> > index c535c2f..d2c860a 100644
> > --- a/arch/x86/include/uapi/asm/kvm.h
> > +++ b/arch/x86/include/uapi/asm/kvm.h
> > @@ -378,4 +378,42 @@ struct kvm_sync_regs {
> >  #define KVM_X86_QUIRK_LINT0_REENABLED	(1 << 0)
> >  #define KVM_X86_QUIRK_CD_NW_CLEARED	(1 << 1)
> >  
> > +#define KVM_STATE_GUEST_MODE	0x00000001
> > +#define KVM_STATE_RUN_PENDING	0x00000002
> > +#define KVM_STATE_GIF		0x00000004
> > +
> > +struct kvm_vmx_state {
> > +	__u64 vmxon_pa;
> > +	__u64 vmcs_pa;
> > +};
> > +
> > +struct kvm_svm_state {
> > +	__u64 hsave_pa;
> > +	__u64 vmcb_pa;
> > +};
> > +
> > +/* for KVM_CAP_STATE */
> > +struct kvm_state {
> > +	/* KVM_STATE_* flags */
> > +	__u16 flags;
> > +
> > +	/* 0 for VMX, 1 for SVM.  */
> > +	__u16 format;
> > +
> > +	/* 128 for SVM, 128 + VMCS size for VMX.  */
> > +	__u32 size;
> > +
> > +	union {
> > +		/* VMXON, VMCS */
> > +		struct kvm_vmx_state vmx;
> > +		/* HSAVE_PA, VMCB */
> > +		struct kvm_svm_state svm;
> > +
> > +		/* Pad the union to 120 bytes.  */
> > +		__u8 pad[120];
> > +	};
> > +
> > +	__u8 data[0];
> > +};
> > +
> >  #endif /* _ASM_X86_KVM_H */
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 2f57571..c2b436b 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -10359,10 +10359,10 @@ static void vmx_inject_page_fault_nested(struct kvm_vcpu *vcpu,
> >  static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> >  						 struct vmcs12 *vmcs12);
> >  
> > -static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu,
> > -					struct vmcs12 *vmcs12)
> > +static void nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
> >  {
> >  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> >  
> >  	if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) {
> >  		if (vmcs12->apic_access_addr != vmx->nested.apic_access_mapping.gfn << PAGE_SHIFT) {
> > @@ -11430,8 +11430,6 @@ static int enter_vmx_non_root_mode(struct kvm_vcpu *vcpu, bool from_vmentry)
> >  		return 1;
> >  	}
> >  
> > -	nested_get_vmcs12_pages(vcpu, vmcs12);
> > -
> >  	msr_entry_idx = nested_vmx_load_msr(vcpu,
> >  					    vmcs12->vm_entry_msr_load_addr,
> >  					    vmcs12->vm_entry_msr_load_count);
> > @@ -11529,6 +11527,8 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
> >  	if (ret)
> >  		return ret;
> >  
> > +	nested_get_vmcs12_pages(vcpu);
> > +
> >  	/*
> >  	 * If we're entering a halted L2 vcpu and the L2 vcpu won't be woken
> >  	 * by event injection, halt vcpu.
> > @@ -12595,6 +12595,171 @@ static int enable_smi_window(struct kvm_vcpu *vcpu)
> >  	return 0;
> >  }
> >  
> > +static int get_vmcs_cache(struct kvm_vcpu *vcpu,
> > +			  struct kvm_state __user *user_kvm_state)
> > +{
> > +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > +
> > +	/*
> > +	 * When running L2, the authoritative vmcs12 state is in the
> > +	 * vmcs02. When running L1, the authoritative vmcs12 state is
> > +	 * in the shadow vmcs linked to vmcs01, unless
> > +	 * sync_shadow_vmcs is set, in which case, the authoritative
> > +	 * vmcs12 state is in the vmcs12 already.
> > +	 */
> > +	if (is_guest_mode(vcpu))
> > +		sync_vmcs12(vcpu, vmcs12);
> > +	else if (enable_shadow_vmcs && !vmx->nested.sync_shadow_vmcs)
> > +		copy_shadow_to_vmcs12(vmx);
> > +
> > +	if (copy_to_user(user_kvm_state->data, vmcs12, sizeof(*vmcs12)))
> > +		return -EFAULT;
> > +
> > +	return 0;
> > +}
> > +
> > +static int get_vmx_state(struct kvm_vcpu *vcpu,
> > +			 struct kvm_state __user *user_kvm_state)
> > +{
> > +	u32 user_data_size;
> > +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +	struct kvm_state kvm_state = {
> > +		.flags = 0,
> > +		.format = 0,
> > +		.size = sizeof(kvm_state),
> > +		.vmx.vmxon_pa = -1ull,
> > +		.vmx.vmcs_pa = -1ull,
> > +	};
> > +
> > +	if (copy_from_user(&user_data_size, &user_kvm_state->size,
> > +			   sizeof(user_data_size)))
> > +		return -EFAULT;
> > +
> > +	if (nested_vmx_allowed(vcpu) && vmx->nested.vmxon) {
> > +		kvm_state.vmx.vmxon_pa = vmx->nested.vmxon_ptr;
> > +		kvm_state.vmx.vmcs_pa = vmx->nested.current_vmptr;
> > +
> > +		if (vmx->nested.current_vmptr != -1ull)
> > +			kvm_state.size += VMCS12_SIZE;
> > +
> > +		if (is_guest_mode(vcpu)) {
> > +			kvm_state.flags |= KVM_STATE_GUEST_MODE;
> > +
> > +			if (vmx->nested.nested_run_pending)
> > +				kvm_state.flags |= KVM_STATE_RUN_PENDING;
> > +		}
> > +	}
> > +
> > +	if (user_data_size < kvm_state.size) {
> > +		if (copy_to_user(&user_kvm_state->size, &kvm_state.size,
> > +				 sizeof(kvm_state.size)))
> > +			return -EFAULT;
> > +		return -E2BIG;
> > +	}
> > +
> > +	if (copy_to_user(user_kvm_state, &kvm_state, sizeof(kvm_state)))
> > +		return -EFAULT;
> > +
> > +	if (vmx->nested.current_vmptr == -1ull)
> > +		return 0;
> > +
> > +	return get_vmcs_cache(vcpu, user_kvm_state);
> > +}
> > +
> > +static int set_vmcs_cache(struct kvm_vcpu *vcpu,
> > +			  struct kvm_state __user *user_kvm_state,
> > +			  struct kvm_state *kvm_state)
> > +
> > +{
> > +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > +	u32 exit_qual;
> > +	int ret;
> > +
> > +	if ((kvm_state->size < (sizeof(*vmcs12) + sizeof(*kvm_state))) ||
> > +	    kvm_state->vmx.vmcs_pa == kvm_state->vmx.vmxon_pa ||
> > +	    !page_address_valid(vcpu, kvm_state->vmx.vmcs_pa))
> > +		return -EINVAL;
> > +
> > +	if (copy_from_user(vmcs12, user_kvm_state->data, sizeof(*vmcs12)))
> > +		return -EFAULT;
> > +
> > +	if (vmcs12->revision_id != VMCS12_REVISION)
> > +		return -EINVAL;
> > +
> > +	set_current_vmptr(vmx, kvm_state->vmx.vmcs_pa);
> > +
> > +	if (!(kvm_state->flags & KVM_STATE_GUEST_MODE))
> > +		return 0;
> > +
> > +	if (kvm_state->flags & KVM_STATE_RUN_PENDING)
> > +		vmx->nested.nested_run_pending = 1;
> > +
> > +	if (check_vmentry_prereqs(vcpu, vmcs12) ||
> > +	    check_vmentry_postreqs(vcpu, vmcs12, &exit_qual))
> > +		return -EINVAL;
> > +
> > +	ret = enter_vmx_non_root_mode(vcpu, true);
> > +	if (ret)
> > +		return ret;
> > +
> > +	/*
> > +	 * This request will result in a call to
> > +	 * nested_get_vmcs12_pages before the next VM-entry.
> > +	 */
> > +	kvm_make_request(KVM_REQ_GET_VMCS12_PAGES, vcpu);
> > +
> > +	vmx->nested.nested_run_pending = 1;
> > +
> > +	return 0;
> > +}
> > +
> > +static int set_vmx_state(struct kvm_vcpu *vcpu,
> > +			 struct kvm_state __user *user_kvm_state)
> > +{
> > +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +	struct kvm_state kvm_state;
> > +	int ret;
> > +
> > +	if (copy_from_user(&kvm_state, user_kvm_state, sizeof(kvm_state)))
> > +		return -EFAULT;
> > +
> > +	if (kvm_state.size < sizeof(kvm_state))
> > +		return -EINVAL;
> > +
> > +	if (kvm_state.format != 0)
> > +		return -EINVAL;
> > +
> > +	if (kvm_state.flags &
> > +	    ~(KVM_STATE_RUN_PENDING | KVM_STATE_GUEST_MODE))
> > +		return -EINVAL;
> > +
> > +	if (!nested_vmx_allowed(vcpu))
> > +		return kvm_state.vmx.vmxon_pa == -1ull ? 0 : -EINVAL;
> > +
> > +	vmx_leave_nested(vcpu);
> > +
> > +	vmx->nested.nested_run_pending =
> > +		!!(kvm_state.flags & KVM_STATE_RUN_PENDING);
> > +
> > +	if (kvm_state.vmx.vmxon_pa == -1ull)
> > +		return 0;
> > +
> > +	if (!page_address_valid(vcpu, kvm_state.vmx.vmxon_pa))
> > +		return -EINVAL;
> > +
> > +	vmx->nested.vmxon_ptr = kvm_state.vmx.vmxon_pa;
> > +	ret = enter_vmx_operation(vcpu);
> > +	if (ret)
> > +		return ret;
> > +
> > +	if (kvm_state.vmx.vmcs_pa == -1ull)
> > +		return 0;
> > +
> > +	return set_vmcs_cache(vcpu, user_kvm_state, &kvm_state);
> > +}
> > +
> >  static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
> >  	.cpu_has_kvm_support = cpu_has_kvm_support,
> >  	.disabled_by_bios = vmx_disabled_by_bios,
> > @@ -12728,6 +12893,10 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
> >  
> >  	.setup_mce = vmx_setup_mce,
> >  
> > +	.get_state = get_vmx_state,
> > +	.set_state = set_vmx_state,
> > +	.get_vmcs12_pages = nested_get_vmcs12_pages,
> > +
> >  	.smi_allowed = vmx_smi_allowed,
> >  	.pre_enter_smm = vmx_pre_enter_smm,
> >  	.pre_leave_smm = vmx_pre_leave_smm,
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 1a2ed92..565e1de 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -2923,6 +2923,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >  	case KVM_CAP_X2APIC_API:
> >  		r = KVM_X2APIC_API_VALID_FLAGS;
> >  		break;
> > +	case KVM_CAP_STATE:
> > +		r = !!kvm_x86_ops->get_state;
> > +		break;
> >  	default:
> >  		break;
> >  	}
> > @@ -3941,6 +3944,22 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
> >  		r = kvm_vcpu_ioctl_enable_cap(vcpu, &cap);
> >  		break;
> >  	}
> > +	case KVM_GET_STATE: {
> > +		struct kvm_state __user *user_kvm_state = argp;
> > +
> > +		r = -EINVAL;
> > +		if (kvm_x86_ops->get_state)
> > +			r = kvm_x86_ops->get_state(vcpu, user_kvm_state);
> > +		break;
> > +	}
> > +	case KVM_SET_STATE: {
> > +		struct kvm_state __user *user_kvm_state = argp;
> > +
> > +		r = -EINVAL;
> > +		if (kvm_x86_ops->set_state)
> > +			r = kvm_x86_ops->set_state(vcpu, user_kvm_state);
> > +		break;
> > +	}
> >  	default:
> >  		r = -EINVAL;
> >  	}
> > @@ -7214,6 +7233,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >  	bool req_immediate_exit = false;
> >  
> >  	if (kvm_request_pending(vcpu)) {
> > +		if (kvm_check_request(KVM_REQ_GET_VMCS12_PAGES, vcpu))
> > +			kvm_x86_ops->get_vmcs12_pages(vcpu);
> >  		if (kvm_check_request(KVM_REQ_MMU_RELOAD, vcpu))
> >  			kvm_mmu_unload(vcpu);
> >  		if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 7a2889a..001b122 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -126,7 +126,7 @@ static inline bool is_error_page(struct page *page)
> >  #define KVM_REQ_MMU_RELOAD        (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> >  #define KVM_REQ_PENDING_TIMER     2
> >  #define KVM_REQ_UNHALT            3
> > -#define KVM_REQUEST_ARCH_BASE     8
> > +#define KVM_REQUEST_ARCH_BASE     7
> 
> Paolo/Jim,
> 
> Is this bit acceptable for you or should I just go ahead and update
> "requests" to be 64 bit instead?
> 
> We ran out of architecture specific bits :)

I just decided to opt for using all the 64-bits in requests:
https://lkml.org/lkml/2018/4/14/93

So I will drop this change to KVM_REQUEST_ARCH_BASE in the next
version.

> 
> > 
> >  
> >  #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
> >  	BUILD_BUG_ON((unsigned)(nr) >= 32 - KVM_REQUEST_ARCH_BASE); \
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 077d16f..69e6791 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -961,6 +961,7 @@ struct kvm_ppc_resize_hpt {
> >  #define KVM_CAP_S390_BPB 152
> >  #define KVM_CAP_GET_MSR_FEATURES 153
> >  #define KVM_CAP_HYPERV_EVENTFD 154
> > +#define KVM_CAP_STATE 155
> >  
> >  #ifdef KVM_CAP_IRQ_ROUTING
> >  
> > @@ -1392,6 +1393,10 @@ struct kvm_s390_ucas_mapping {
> >  /* Memory Encryption Commands */
> >  #define KVM_MEMORY_ENCRYPT_OP      _IOWR(KVMIO, 0xba, unsigned long)
> >  
> > +/* Available with KVM_CAP_STATE */
> > +#define KVM_GET_STATE         _IOWR(KVMIO, 0xbb, struct kvm_state)
> > +#define KVM_SET_STATE         _IOW(KVMIO,  0xbc, struct kvm_state)
> > +
> >  struct kvm_enc_region {
> >  	__u64 addr;
> >  	__u64 size;

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] kvm: nVMX: Introduce KVM_CAP_STATE
  2018-04-12 15:12 ` [PATCH 2/2] kvm: nVMX: Introduce KVM_CAP_STATE KarimAllah Ahmed
  2018-04-12 16:39   ` Paolo Bonzini
  2018-04-14 15:56   ` Raslan, KarimAllah
@ 2018-04-16 16:22   ` Jim Mattson
  2018-04-16 17:15     ` Raslan, KarimAllah
  2 siblings, 1 reply; 16+ messages in thread
From: Jim Mattson @ 2018-04-16 16:22 UTC (permalink / raw)
  To: KarimAllah Ahmed
  Cc: LKML, kvm list, Paolo Bonzini, Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H . Peter Anvin,
	the arch/x86 maintainers

On Thu, Apr 12, 2018 at 8:12 AM, KarimAllah Ahmed <karahmed@amazon.de> wrote:

> v2 -> v3:
> - Remove the forced VMExit from L2 after reading the kvm_state. The actual
>   problem is solved.
> - Rebase again!
> - Set nested_run_pending during restore (not sure if it makes sense yet or
>   not).

This doesn't actually make sense. Nested_run_pending should only be
set between L1 doing a VMLAUNCH/VMRESUME and the first instruction
executing in L2. That is extremely unlikely at a restore point.

To deal with nested_run_pending and nested save/restore,
nested_run_pending should be set to 1 before calling
enter_vmx_non_root_mode, as it was prior to commit 7af40ad37b3f. That
means that it has to be cleared when emulating VM-entry to the halted
state (prior to calling kvm_vcpu_halt). And all of the from_vmentry
arguments that Paolo added when rebasing commit cf8b84f48a59 should be
removed, so that nested_run_pending is propagated correctly during a
restore.
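
Roughly what I have in mind, as an untested sketch of nested_vmx_run()
with the from_vmentry plumbing already removed:

	/* The emulated VM-entry is now in flight. */
	vmx->nested.nested_run_pending = 1;

	ret = enter_vmx_non_root_mode(vcpu);
	if (ret)
		return ret;

	/*
	 * If we're entering a halted L2 vcpu and the L2 vcpu won't be woken
	 * by event injection, halt the vcpu and clear nested_run_pending:
	 * no L2 instruction will execute before the next VM-entry, so there
	 * is nothing pending that a save/restore at this point would need
	 * to preserve.
	 */
	if (vmcs12->guest_activity_state == GUEST_ACTIVITY_HLT &&
	    !(vmcs12->vm_entry_intr_info_field & INTR_INFO_VALID_MASK)) {
		vmx->nested.nested_run_pending = 0;
		return kvm_vcpu_halt(vcpu);
	}

With that in place, set_vmx_state() can restore the flag directly from
KVM_STATE_RUN_PENDING instead of guessing.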

It should be possible to eliminate this strange little wart, but I
haven't looked deeply into it.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] kvm: nVMX: Introduce KVM_CAP_STATE
  2018-04-16 16:22   ` Jim Mattson
@ 2018-04-16 17:15     ` Raslan, KarimAllah
  2018-04-26 22:28       ` Jim Mattson
  0 siblings, 1 reply; 16+ messages in thread
From: Raslan, KarimAllah @ 2018-04-16 17:15 UTC (permalink / raw)
  To: jmattson; +Cc: kvm, linux-kernel, tglx, x86, hpa, mingo, pbonzini, rkrcmar

On Mon, 2018-04-16 at 09:22 -0700, Jim Mattson wrote:
> On Thu, Apr 12, 2018 at 8:12 AM, KarimAllah Ahmed <karahmed@amazon.de> wrote:
> 
> > 
> > v2 -> v3:
> > - Remove the forced VMExit from L2 after reading the kvm_state. The actual
> >   problem is solved.
> > - Rebase again!
> > - Set nested_run_pending during restore (not sure if it makes sense yet or
> >   not).
> 
> This doesn't actually make sense. Nested_run_pending should only be
> set between L1 doing a VMLAUNCH/VMRESUME and the first instruction
> executing in L2. That is extremely unlikely at a restore point.

Yeah, I am afraid I put very little thought into it as I was focused
on the TSC issue :)

Will handle it properly in next version.

> 
> To deal with nested_run_pending and nested save/restore,
> nested_run_pending should be set to 1 before calling
> enter_vmx_non_root_mode, as it was prior to commit 7af40ad37b3f. That
> means that it has to be cleared when emulating VM-entry to the halted
> state (prior to calling kvm_vcpu_halt). And all of the from_vmentry
> arguments that Paolo added when rebasing commit cf8b84f48a59 should be
> removed, so that nested_run_pending is propagated correctly during a
> restore.
> 
> It should be possible to eliminate this strange little wart, but I
> haven't looked deeply into it.
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] kvm: nVMX: Introduce KVM_CAP_STATE
  2018-04-16 17:15     ` Raslan, KarimAllah
@ 2018-04-26 22:28       ` Jim Mattson
  2018-04-27 10:03         ` Paolo Bonzini
  0 siblings, 1 reply; 16+ messages in thread
From: Jim Mattson @ 2018-04-26 22:28 UTC (permalink / raw)
  To: Raslan, KarimAllah
  Cc: kvm, linux-kernel, tglx, x86, hpa, mingo, pbonzini, rkrcmar

I'll send out a patch to deal with nested_run_pending.

The other thing that comes to mind is that there are some new fields
in the VMCS12 since I first implemented this. One potentially
troublesome field is the VMX preemption timer. If the current timer
value is not saved on VM-exit, then it won't be stashed in the shadow
VMCS12 by sync_vmcs12. Post-migration, the timer will be reset to its
original value.

Do we care? Is this any different from what happens on real hardware
when there's an SMI? According to the SDM, this appears to be exactly
what happens when the dual-monitor treatment of SMIs and SMM is
active, but it's not clear what happens with the default treatment of
SMIs and SMM.
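
For reference, the check I have in mind looks roughly like this (paraphrased,
not the exact code); the live countdown only makes it back into vmcs12 when
L1 asked for it to be saved on VM-exit:

	if (nested_cpu_has_preemption_timer(vmcs12) &&
	    (vmcs12->vm_exit_controls & VM_EXIT_SAVE_VMX_PREEMPTION_TIMER))
		vmcs12->vmx_preemption_timer_value =
			vmx_get_preemption_timer_value(vcpu);

So with that exit control clear, a KVM_GET_STATE taken at an arbitrary point
in L2 has no record of the remaining time to carry across the migration.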

On Mon, Apr 16, 2018 at 10:15 AM, Raslan, KarimAllah <karahmed@amazon.de> wrote:
> On Mon, 2018-04-16 at 09:22 -0700, Jim Mattson wrote:
>> On Thu, Apr 12, 2018 at 8:12 AM, KarimAllah Ahmed <karahmed@amazon.de> wrote:
>>
>> >
>> > v2 -> v3:
>> > - Remove the forced VMExit from L2 after reading the kvm_state. The actual
>> >   problem is solved.
>> > - Rebase again!
>> > - Set nested_run_pending during restore (not sure if it makes sense yet or
>> >   not).
>>
>> This doesn't actually make sense. Nested_run_pending should only be
>> set between L1 doing a VMLAUNCH/VMRESUME and the first instruction
>> executing in L2. That is extremely unlikely at a restore point.
>
> Yeah, I am afraid I put very little thought into it as I was focused
> on the TSC issue :)
>
> Will handle it properly in next version.
>
>>
>> To deal with nested_run_pending and nested save/restore,
>> nested_run_pending should be set to 1 before calling
>> enter_vmx_non_root_mode, as it was prior to commit 7af40ad37b3f. That
>> means that it has to be cleared when emulating VM-entry to the halted
>> state (prior to calling kvm_vcpu_halt). And all of the from_vmentry
>> arguments that Paolo added when rebasing commit cf8b84f48a59 should be
>> removed, so that nested_run_pending is propagated correctly during a
>> restore.
>>
>> It should be possible to eliminate this strange little wart, but I
>> haven't looked deeply into it.
>>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] kvm: nVMX: Introduce KVM_CAP_STATE
  2018-04-26 22:28       ` Jim Mattson
@ 2018-04-27 10:03         ` Paolo Bonzini
  2018-04-27 15:19           ` Jim Mattson
  0 siblings, 1 reply; 16+ messages in thread
From: Paolo Bonzini @ 2018-04-27 10:03 UTC (permalink / raw)
  To: Jim Mattson, Raslan, KarimAllah
  Cc: kvm, linux-kernel, tglx, x86, hpa, mingo, rkrcmar

On 27/04/2018 00:28, Jim Mattson wrote:
> The other thing that comes to mind is that there are some new fields
> in the VMCS12 since I first implemented this. One potentially
> troublesome field is the VMX preemption timer. If the current timer
> value is not saved on VM-exit, then it won't be stashed in the shadow
> VMCS12 by sync_vmcs12. Post-migration, the timer will be reset to its
> original value.
> 
> Do we care? Is this any different from what happens on real hardware
> when there's an SMI? According to the SDM, this appears to be exactly
> what happens when the dual-monitor treatment of SMIs and SMM is
> active, but it's not clear what happens with the default treatment of
> SMIs and SMM.

I think it should be the same, because the preemption timer countdown is
not part of the VMX-critical state.

Paolo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] kvm: nVMX: Introduce KVM_CAP_STATE
  2018-04-27 10:03         ` Paolo Bonzini
@ 2018-04-27 15:19           ` Jim Mattson
  2018-04-28  0:42             ` Paolo Bonzini
  0 siblings, 1 reply; 16+ messages in thread
From: Jim Mattson @ 2018-04-27 15:19 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Raslan, KarimAllah, kvm, linux-kernel, tglx, x86, hpa, mingo, rkrcmar

On Fri, Apr 27, 2018 at 3:03 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 27/04/2018 00:28, Jim Mattson wrote:
>> The other thing that comes to mind is that there are some new fields
>> in the VMCS12 since I first implemented this. One potentially
>> troublesome field is the VMX preemption timer. If the current timer
>> value is not saved on VM-exit, then it won't be stashed in the shadow
>> VMCS12 by sync_vmcs12. Post-migration, the timer will be reset to its
>> original value.
>>
>> Do we care? Is this any different from what happens on real hardware
>> when there's an SMI? According to the SDM, this appears to be exactly
>> what happens when the dual-monitor treatment of SMIs and SMM is
>> active, but it's not clear what happens with the default treatment of
>> SMIs and SMM.
>
> I think it should be the same, because the preemption timer countdown is
> not part of the VMX-critical state.
>
> Paolo

Section 25.5.1 of the SDM says:

If the default treatment of SMIs and SMM (see Section 34.14) is
active, the VMX-preemption timer counts across an SMI to VMX non-root
operation, subsequent execution in SMM, and the return from SMM via
the RSM instruction. However, the timer can cause a VM exit only from
VMX non-root operation. If the timer expires during SMI, in SMM, or
during RSM, a timer-induced VM exit occurs immediately after RSM with
its normal priority unless it is blocked based on activity state
(Section 25.2).

So, there's no loophole here that allows us to reset the VMX
preemption timer when restoring nested state.

As a follow-on change, we should probably fix this.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] kvm: nVMX: Introduce KVM_CAP_STATE
  2018-04-27 15:19           ` Jim Mattson
@ 2018-04-28  0:42             ` Paolo Bonzini
  0 siblings, 0 replies; 16+ messages in thread
From: Paolo Bonzini @ 2018-04-28  0:42 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Raslan, KarimAllah, kvm, linux-kernel, tglx, x86, hpa, mingo, rkrcmar

On 27/04/2018 17:19, Jim Mattson wrote:
> 
> If the default treatment of SMIs and SMM (see Section 34.14) is
> active, the VMX-preemption timer counts across an SMI to VMX non-root
> operation, subsequent execution in SMM, and the return from SMM via
> the RSM instruction. However, the timer can cause a VM exit only from
> VMX non-root operation. If the timer expires during SMI, in SMM, or
> during RSM, a timer-induced VM exit occurs immediately after RSM with
> its normal priority unless it is blocked based on activity state
> (Section 25.2).
> 
> So, there's no loophole here that allows us to reset the VMX
> preemption timer when restoring nested state.

Or when an SMI occurs.  So the expiration TSC of the preemption timer
should be stored into an "artificial" field of the vmcs12 at vmentry
time and later reused.

vmx->nested.smm.guest_mode should also be saved...
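
To make the idea concrete (the deadline field name is invented): at vmentry
we could record an absolute deadline in guest-TSC units,

	/* KVM-internal, not an architectural vmcs12 field. */
	vmcs12->preemption_timer_deadline =
		kvm_read_l1_tsc(vcpu, rdtsc()) +
		((u64)vmcs12->vmx_preemption_timer_value <<
			VMX_MISC_EMULATED_PREEMPTION_TIMER_RATE);

and on KVM_SET_STATE (or RSM) re-arm the emulated timer from whatever is
left of that deadline rather than from vmx_preemption_timer_value again.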

Paolo

> As a follow-on change, we should probably fix this.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2018-04-28  0:42 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-12 15:12 [PATCH 1/2] X86/KVM: Properly restore 'tsc_offset' when running an L2 guest KarimAllah Ahmed
2018-04-12 15:12 ` [PATCH 2/2] kvm: nVMX: Introduce KVM_CAP_STATE KarimAllah Ahmed
2018-04-12 16:39   ` Paolo Bonzini
2018-04-14 15:56   ` Raslan, KarimAllah
2018-04-14 22:31     ` Raslan, KarimAllah
2018-04-16 16:22   ` Jim Mattson
2018-04-16 17:15     ` Raslan, KarimAllah
2018-04-26 22:28       ` Jim Mattson
2018-04-27 10:03         ` Paolo Bonzini
2018-04-27 15:19           ` Jim Mattson
2018-04-28  0:42             ` Paolo Bonzini
2018-04-12 16:35 ` [PATCH 1/2] X86/KVM: Properly restore 'tsc_offset' when running an L2 guest Paolo Bonzini
2018-04-12 17:04   ` Raslan, KarimAllah
2018-04-12 17:21     ` Raslan, KarimAllah
2018-04-12 20:21       ` Paolo Bonzini
2018-04-12 20:24         ` Raslan, KarimAllah
