* [PATCH 0/30] nVMX: Nested VMX, v9
@ 2011-05-08  8:15 Nadav Har'El
  2011-05-08  8:15 ` [PATCH 01/30] nVMX: Add "nested" module option to kvm_intel Nadav Har'El
                   ` (30 more replies)
  0 siblings, 31 replies; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:15 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Hi,

This is the ninth iteration of the nested VMX patch set. This iteration
addresses all of the comments and requests that were raised by reviewers in
the previous rounds, with only a few exceptions, listed below.

Some of the issues which were solved in this version include:

 * Overhauled the hardware VMCS (vmcs02) allocation. Previously we had up to
   256 vmcs02s, one for each L2. Now we only have one, which is reused.
   We also have a compile-time option VMCS02_POOL_SIZE to keep a bigger pool
   of vmcs02s. This option will be useful in the future if vmcs02 won't be
   filled from scratch on each entry from L1 to L2 (currently, it is).

 * The vmcs01 structure, containing a copy of all fields from L1's VMCS, was
   unnecessary, as all the necessary values are either known to KVM or appear
   in vmcs12. This structure is now gone for good.

 * There is no longer a "vmcs_fields" sub-structure that everyone disliked.
   All the VMCS fields appear directly in the vmcs12 structure, which makes
   the code simpler and more readable.

 * Make sure that the vmcs12 fields have fixed sizes and locations, and add
   some extra padding, to support live migration and improve future-proofing.

 * For some fields, nested exit used to fail to return the host-state as set
   by L1. Fixed that.

 * nested_vmx_exit_handled (deciding whether to let L1 handle an exit, or handle it
   in L0 and return to L2) is now more correct, and handles more exit reasons.

 * Complete overhaul of the cr0, exception bitmap, cr3 and cr4 handling code.
   The code is now shorter (uses existing functions like kvm_set_cr3, etc.),
   more readable, and more uniform (no pieces of code for enable_ept and not,
   less special code for cr0.TS, and none of that ugly cr0.PG monkey-business).

 * Use kvm_register_write(), kvm_rip_read(), etc. Got rid of the new and now
   unneeded function sync_cached_regs_to_vmcs().

 * Fixed the values returned by the VMX MSRs to be more correct, and more
   constant (i.e., not to vary needlessly across different hosts).

 * Added some more missing verifications to vmcs12's fields (cleanly failing
   the nested entry if these verifications fail).

 * Expose the MSR-bitmap feature to L1. Every MSR access still exits to L0,
   but slow exits to L1 are avoided when L1's MSR bitmap doesn't want it.

 * Removed or rate limited printouts which could be exploited by guests.

 * Fix VM_ENTRY_LOAD_IA32_PAT feature handling.

 * Fixed a potential bug and verified that nested VMX now works with both
   CONFIG_PREEMPT and CONFIG_SMP enabled.

 * Dozens of other code cleanups and bug fixes.

Only a few issues from previous reviews remain unaddressed. These are:

 * The interrupt injection and IDT_VECTORING_INFO_FIELD handling code was
   still not rewritten. It works, though ;-)

 * No KVM autotests for nested VMX yet.

 * Merging of L0's and L1's MSR bitmaps (and IO bitmaps) is still not
   supported. As explained above, the current code uses L1's MSR bitmap
   to avoid costly exits to L1, but still suffers exits to L0 on each
   MSR access in L2.

 * Still no option for disabling some capabilities advertised to L1.

 * No support for TPR_SHADOW feature for L1.

This new set of patches applies to the current KVM trunk (I checked with
082f9eced53d50c136e42d072598da4be4b9ba23).
If you wish, you can also check out an already-patched version of KVM from
branch "nvmx9" of the repository:
	 git://github.com/nyh/kvm-nested-vmx.git
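
For example, to clone just that branch:

	 git clone -b nvmx9 git://github.com/nyh/kvm-nested-vmx.git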


About nested VMX:
-----------------

The following 30 patches implement nested VMX support. This feature enables
a guest to use the VMX APIs in order to run its own nested guests.
In other words, it allows running hypervisors (that use VMX) under KVM.
Multiple guest hypervisors can be run concurrently, and each of those can
in turn host multiple guests.

The theory behind this work, our implementation, and its performance
characteristics were presented in OSDI 2010 (the USENIX Symposium on
Operating Systems Design and Implementation). Our paper was titled
"The Turtles Project: Design and Implementation of Nested Virtualization",
and was awarded "Jay Lepreau Best Paper". The paper is available online, at:

	http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf

This patch set does not include all the features described in the paper.
In particular, this patch set is missing nested EPT (L1 can't use EPT and
must use shadow page tables). It is also missing some features required to
run VMware hypervisors as guests. These missing features will be sent as
follow-on patches.

Running nested VMX:
-------------------

The nested VMX feature is currently disabled by default. It must be
explicitly enabled with the "nested=1" option to the kvm-intel module.

No modifications are required to user space (qemu). However, qemu's default
emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
explicitly enabled by giving qemu one of the following options:

     -cpu host              (emulated CPU has all features of the real CPU)

     -cpu qemu64,+vmx       (add just the vmx feature to a named CPU type)
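
For example, one possible way to start an L1 guest hypervisor on the L0 host
(the disk image name is just a placeholder, and the exact qemu binary name may
differ between distributions):

     modprobe kvm-intel nested=1
     qemu-system-x86_64 -enable-kvm -cpu host -m 2048 l1-guest-image.qcow2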


This version was only tested with KVM (64-bit) as a guest hypervisor, and
Linux as a nested guest.


Patch statistics:
-----------------

 Documentation/kvm/nested-vmx.txt |  243 ++
 arch/x86/include/asm/kvm_host.h  |    2 
 arch/x86/include/asm/msr-index.h |   12 
 arch/x86/include/asm/vmx.h       |   31 
 arch/x86/kvm/svm.c               |    6 
 arch/x86/kvm/vmx.c               | 2558 +++++++++++++++++++++++++++--
 arch/x86/kvm/x86.c               |   11 
 arch/x86/kvm/x86.h               |    8 
 8 files changed, 2773 insertions(+), 98 deletions(-)

--
Nadav Har'El
IBM Haifa Research Lab


* [PATCH 01/30] nVMX: Add "nested" module option to kvm_intel
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
@ 2011-05-08  8:15 ` Nadav Har'El
  2011-05-08  8:16 ` [PATCH 02/30] nVMX: Implement VMXON and VMXOFF Nadav Har'El
                   ` (29 subsequent siblings)
  30 siblings, 0 replies; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:15 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch adds to kvm_intel a module option "nested". This option controls
whether the guest can use VMX instructions, i.e., whether we allow nested
virtualization. A similar, but separate, option already exists for the
SVM module.

This option currently defaults to 0, meaning that nested VMX must be
explicitly enabled by giving nested=1. When nested VMX matures, the default
should probably be changed to enable nested VMX by default - just like
nested SVM is currently enabled by default.
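
For example (an illustration only, not part of the patch), the option can be
enabled when loading the module, and inspected later through sysfs:

	modprobe kvm-intel nested=1
	cat /sys/module/kvm_intel/parameters/nested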

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:17.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:17.000000000 +0300
@@ -72,6 +72,14 @@ module_param(vmm_exclusive, bool, S_IRUG
 static int __read_mostly yield_on_hlt = 1;
 module_param(yield_on_hlt, bool, S_IRUGO);
 
+/*
+ * If nested=1, nested virtualization is supported, i.e., guests may use
+ * VMX and be a hypervisor for its own guests. If nested=0, guests may not
+ * use VMX instructions.
+ */
+static int __read_mostly nested = 0;
+module_param(nested, bool, S_IRUGO);
+
 #define KVM_GUEST_CR0_MASK_UNRESTRICTED_GUEST				\
 	(X86_CR0_WP | X86_CR0_NE | X86_CR0_NW | X86_CR0_CD)
 #define KVM_GUEST_CR0_MASK						\
@@ -1261,6 +1269,23 @@ static u64 vmx_compute_tsc_offset(struct
 	return target_tsc - native_read_tsc();
 }
 
+static bool guest_cpuid_has_vmx(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpuid_entry2 *best = kvm_find_cpuid_entry(vcpu, 1, 0);
+	return best && (best->ecx & (1 << (X86_FEATURE_VMX & 31)));
+}
+
+/*
+ * nested_vmx_allowed() checks whether a guest should be allowed to use VMX
+ * instructions and MSRs (i.e., nested VMX). Nested VMX is disabled for
+ * all guests if the "nested" module option is off, and can also be disabled
+ * for a single guest by disabling its VMX cpuid bit.
+ */
+static inline bool nested_vmx_allowed(struct kvm_vcpu *vcpu)
+{
+	return nested && guest_cpuid_has_vmx(vcpu);
+}
+
 /*
  * Reads an msr value (of 'msr_index') into 'pdata'.
  * Returns 0 on success, non-0 otherwise.


* [PATCH 02/30] nVMX: Implement VMXON and VMXOFF
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
  2011-05-08  8:15 ` [PATCH 01/30] nVMX: Add "nested" module option to kvm_intel Nadav Har'El
@ 2011-05-08  8:16 ` Nadav Har'El
  2011-05-08  8:16 ` [PATCH 03/30] nVMX: Allow setting the VMXE bit in CR4 Nadav Har'El
                   ` (28 subsequent siblings)
  30 siblings, 0 replies; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:16 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch allows a guest to use the VMXON and VMXOFF instructions, and
emulates them accordingly. Basically this amounts to checking some
prerequisites, and then remembering whether the guest has enabled or disabled
VMX operation.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  110 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 108 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:17.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:17.000000000 +0300
@@ -130,6 +130,15 @@ struct shared_msr_entry {
 	u64 mask;
 };
 
+/*
+ * The nested_vmx structure is part of vcpu_vmx, and holds information we need
+ * for correct emulation of VMX (i.e., nested VMX) on this vcpu.
+ */
+struct nested_vmx {
+	/* Has the level1 guest done vmxon? */
+	bool vmxon;
+};
+
 struct vcpu_vmx {
 	struct kvm_vcpu       vcpu;
 	struct list_head      local_vcpus_link;
@@ -184,6 +193,9 @@ struct vcpu_vmx {
 	u32 exit_reason;
 
 	bool rdtscp_enabled;
+
+	/* Support for a guest hypervisor (nested VMX) */
+	struct nested_vmx nested;
 };
 
 enum segment_cache_field {
@@ -3890,6 +3902,99 @@ static int handle_invalid_op(struct kvm_
 }
 
 /*
+ * Emulate the VMXON instruction.
+ * Currently, we just remember that VMX is active, and do not save or even
+ * inspect the argument to VMXON (the so-called "VMXON pointer") because we
+ * do not currently need to store anything in that guest-allocated memory
+ * region. Consequently, VMCLEAR and VMPTRLD also do not verify that their
+ * argument is different from the VMXON pointer (which the spec says they do).
+ */
+static int handle_vmon(struct kvm_vcpu *vcpu)
+{
+	struct kvm_segment cs;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	/* The Intel VMX Instruction Reference lists a bunch of bits that
+	 * are prerequisite to running VMXON, most notably cr4.VMXE must be
+	 * set to 1 (see vmx_set_cr4() for when we allow the guest to set this).
+	 * Otherwise, we should fail with #UD. We test these now:
+	 */
+	if (!kvm_read_cr4_bits(vcpu, X86_CR4_VMXE) ||
+	    !kvm_read_cr0_bits(vcpu, X86_CR0_PE) ||
+	    (vmx_get_rflags(vcpu) & X86_EFLAGS_VM)) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
+	if (is_long_mode(vcpu) && !cs.l) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	if (vmx_get_cpl(vcpu)) {
+		kvm_inject_gp(vcpu, 0);
+		return 1;
+	}
+
+	vmx->nested.vmxon = true;
+
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
+/*
+ * Intel's VMX Instruction Reference specifies a common set of prerequisites
+ * for running VMX instructions (except VMXON, whose prerequisites are
+ * slightly different). It also specifies what exception to inject otherwise.
+ */
+static int nested_vmx_check_permission(struct kvm_vcpu *vcpu)
+{
+	struct kvm_segment cs;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (!vmx->nested.vmxon) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 0;
+	}
+
+	vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
+	if ((vmx_get_rflags(vcpu) & X86_EFLAGS_VM) ||
+	    (is_long_mode(vcpu) && !cs.l)) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 0;
+	}
+
+	if (vmx_get_cpl(vcpu)) {
+		kvm_inject_gp(vcpu, 0);
+		return 0;
+	}
+
+	return 1;
+}
+
+/*
+ * Free whatever needs to be freed from vmx->nested when L1 goes down, or
+ * just stops using VMX.
+ */
+static void free_nested(struct vcpu_vmx *vmx)
+{
+	if (!vmx->nested.vmxon)
+		return;
+	vmx->nested.vmxon = false;
+}
+
+/* Emulate the VMXOFF instruction */
+static int handle_vmoff(struct kvm_vcpu *vcpu)
+{
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+	free_nested(to_vmx(vcpu));
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
+/*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
  * to be done to userspace and return 0.
@@ -3917,8 +4022,8 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
 	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
 	[EXIT_REASON_VMWRITE]                 = handle_vmx_insn,
-	[EXIT_REASON_VMOFF]                   = handle_vmx_insn,
-	[EXIT_REASON_VMON]                    = handle_vmx_insn,
+	[EXIT_REASON_VMOFF]                   = handle_vmoff,
+	[EXIT_REASON_VMON]                    = handle_vmon,
 	[EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,
 	[EXIT_REASON_APIC_ACCESS]             = handle_apic_access,
 	[EXIT_REASON_WBINVD]                  = handle_wbinvd,
@@ -4329,6 +4434,7 @@ static void vmx_free_vcpu(struct kvm_vcp
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
 	free_vpid(vmx);
+	free_nested(vmx);
 	vmx_free_vmcs(vcpu);
 	kfree(vmx->guest_msrs);
 	kvm_vcpu_uninit(vcpu);


* [PATCH 03/30] nVMX: Allow setting the VMXE bit in CR4
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
  2011-05-08  8:15 ` [PATCH 01/30] nVMX: Add "nested" module option to kvm_intel Nadav Har'El
  2011-05-08  8:16 ` [PATCH 02/30] nVMX: Implement VMXON and VMXOFF Nadav Har'El
@ 2011-05-08  8:16 ` Nadav Har'El
  2011-05-08  8:17 ` [PATCH 04/30] nVMX: Introduce vmcs12: a VMCS structure for L1 Nadav Har'El
                   ` (27 subsequent siblings)
  30 siblings, 0 replies; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:16 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch allows the guest to enable the VMXE bit in CR4, which is a
prerequisite to running VMXON.

Whether to allow setting the VMXE bit now depends on the architecture (svm
or vmx), so the check has moved into kvm_x86_ops->set_cr4(). This function
now returns an int: If kvm_x86_ops->set_cr4() returns 1, __kvm_set_cr4()
will also return 1, and this will cause kvm_set_cr4() to throw a #GP.

Turning on the VMXE bit is allowed only when the nested VMX feature is
enabled, and turning it off is forbidden after a vmxon.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/include/asm/kvm_host.h |    2 +-
 arch/x86/kvm/svm.c              |    6 +++++-
 arch/x86/kvm/vmx.c              |   17 +++++++++++++++--
 arch/x86/kvm/x86.c              |    4 +---
 4 files changed, 22 insertions(+), 7 deletions(-)

--- .before/arch/x86/include/asm/kvm_host.h	2011-05-08 10:43:17.000000000 +0300
+++ .after/arch/x86/include/asm/kvm_host.h	2011-05-08 10:43:17.000000000 +0300
@@ -559,7 +559,7 @@ struct kvm_x86_ops {
 	void (*decache_cr4_guest_bits)(struct kvm_vcpu *vcpu);
 	void (*set_cr0)(struct kvm_vcpu *vcpu, unsigned long cr0);
 	void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3);
-	void (*set_cr4)(struct kvm_vcpu *vcpu, unsigned long cr4);
+	int (*set_cr4)(struct kvm_vcpu *vcpu, unsigned long cr4);
 	void (*set_efer)(struct kvm_vcpu *vcpu, u64 efer);
 	void (*get_idt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
 	void (*set_idt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
--- .before/arch/x86/kvm/svm.c	2011-05-08 10:43:17.000000000 +0300
+++ .after/arch/x86/kvm/svm.c	2011-05-08 10:43:17.000000000 +0300
@@ -1496,11 +1496,14 @@ static void svm_set_cr0(struct kvm_vcpu 
 	update_cr0_intercept(svm);
 }
 
-static void svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+static int svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
 	unsigned long host_cr4_mce = read_cr4() & X86_CR4_MCE;
 	unsigned long old_cr4 = to_svm(vcpu)->vmcb->save.cr4;
 
+	if (cr4 & X86_CR4_VMXE)
+		return 1;
+
 	if (npt_enabled && ((old_cr4 ^ cr4) & X86_CR4_PGE))
 		svm_flush_tlb(vcpu);
 
@@ -1510,6 +1513,7 @@ static void svm_set_cr4(struct kvm_vcpu 
 	cr4 |= host_cr4_mce;
 	to_svm(vcpu)->vmcb->save.cr4 = cr4;
 	mark_dirty(to_svm(vcpu)->vmcb, VMCB_CR);
+	return 0;
 }
 
 static void svm_set_segment(struct kvm_vcpu *vcpu,
--- .before/arch/x86/kvm/x86.c	2011-05-08 10:43:18.000000000 +0300
+++ .after/arch/x86/kvm/x86.c	2011-05-08 10:43:18.000000000 +0300
@@ -615,11 +615,9 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, u
 				   kvm_read_cr3(vcpu)))
 		return 1;
 
-	if (cr4 & X86_CR4_VMXE)
+	if (kvm_x86_ops->set_cr4(vcpu, cr4))
 		return 1;
 
-	kvm_x86_ops->set_cr4(vcpu, cr4);
-
 	if ((cr4 ^ old_cr4) & pdptr_bits)
 		kvm_mmu_reset_context(vcpu);
 
--- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:18.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:18.000000000 +0300
@@ -2078,7 +2078,7 @@ static void ept_save_pdptrs(struct kvm_v
 		  (unsigned long *)&vcpu->arch.regs_dirty);
 }
 
-static void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
+static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
 
 static void ept_update_paging_mode_cr0(unsigned long *hw_cr0,
 					unsigned long cr0,
@@ -2175,11 +2175,23 @@ static void vmx_set_cr3(struct kvm_vcpu 
 	vmcs_writel(GUEST_CR3, guest_cr3);
 }
 
-static void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
 	unsigned long hw_cr4 = cr4 | (to_vmx(vcpu)->rmode.vm86_active ?
 		    KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON);
 
+	if (cr4 & X86_CR4_VMXE) {
+		/*
+		 * To use VMXON (and later other VMX instructions), a guest
+		 * must first be able to turn on cr4.VMXE (see handle_vmon()).
+		 * So basically the check on whether to allow nested VMX
+		 * is here.
+		 */
+		if (!nested_vmx_allowed(vcpu))
+			return 1;
+	} else if (to_vmx(vcpu)->nested.vmxon)
+		return 1;
+
 	vcpu->arch.cr4 = cr4;
 	if (enable_ept) {
 		if (!is_paging(vcpu)) {
@@ -2192,6 +2204,7 @@ static void vmx_set_cr4(struct kvm_vcpu 
 
 	vmcs_writel(CR4_READ_SHADOW, cr4);
 	vmcs_writel(GUEST_CR4, hw_cr4);
+	return 0;
 }
 
 static void vmx_get_segment(struct kvm_vcpu *vcpu,


* [PATCH 04/30] nVMX: Introduce vmcs12: a VMCS structure for L1
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
                   ` (2 preceding siblings ...)
  2011-05-08  8:16 ` [PATCH 03/30] nVMX: Allow setting the VMXE bit in CR4 Nadav Har'El
@ 2011-05-08  8:17 ` Nadav Har'El
  2011-05-08  8:17 ` [PATCH 05/30] nVMX: Implement reading and writing of VMX MSRs Nadav Har'El
                   ` (26 subsequent siblings)
  30 siblings, 0 replies; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:17 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

An implementation of VMX needs to define a VMCS structure. This structure
is kept in guest memory, but is opaque to the guest (who can only read or
write it with VMX instructions).

This patch starts to define the VMCS structure which our nested VMX
implementation will present to L1. We call it "vmcs12", as it is the VMCS
that L1 keeps for its L2 guest. We will add more content to this structure
in later patches.

This patch also adds the notion (as required by the VMX spec) of L1's "current
VMCS", and finally includes utility functions for mapping the guest-allocated
VMCSs in host memory.
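
The following rough sketch (illustrative only, not part of this patch) shows
how an L1 guest hypervisor is expected to prepare such a region before handing
its physical address to VMPTRLD, given the conventions defined here:

	/* L1 (guest hypervisor) side - illustration only */
	u64 basic;
	u32 *region;

	rdmsrl(MSR_IA32_VMX_BASIC, basic);
	region = (u32 *)get_zeroed_page(GFP_KERNEL); /* 4K, page-aligned */
	region[0] = (u32)basic;  /* revision id; will read back VMCS12_REVISION */
	/* ... then hand __pa(region) to VMPTRLD */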

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   75 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:18.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:18.000000000 +0300
@@ -131,12 +131,53 @@ struct shared_msr_entry {
 };
 
 /*
+ * struct vmcs12 describes the state that our guest hypervisor (L1) keeps for a
+ * single nested guest (L2), hence the name vmcs12. Any VMX implementation has
+ * a VMCS structure, and vmcs12 is our emulated VMX's VMCS. This structure is
+ * stored in guest memory specified by VMPTRLD, but is opaque to the guest,
+ * which must access it using VMREAD/VMWRITE/VMCLEAR instructions.
+ * More than one of these structures may exist, if L1 runs multiple L2 guests.
+ * nested_vmx_run() will use the data here to build a vmcs02: a VMCS for the
+ * underlying hardware which will be used to run L2.
+ * This structure is packed to ensure that its layout is identical across
+ * machines (necessary for live migration).
+ * If there are changes in this struct, VMCS12_REVISION must be changed.
+ */
+struct __packed vmcs12 {
+	/* According to the Intel spec, a VMCS region must start with the
+	 * following two fields. Then follow implementation-specific data.
+	 */
+	u32 revision_id;
+	u32 abort;
+};
+
+/*
+ * VMCS12_REVISION is an arbitrary id that should be changed if the content or
+ * layout of struct vmcs12 is changed. MSR_IA32_VMX_BASIC returns this id, and
+ * VMPTRLD verifies that the VMCS region that L1 is loading contains this id.
+ */
+#define VMCS12_REVISION 0x11e57ed0
+
+/*
+ * VMCS12_SIZE is the number of bytes L1 should allocate for the VMXON region
+ * and any VMCS region. Although only sizeof(struct vmcs12) are used by the
+ * current implementation, 4K are reserved to avoid future complications.
+ */
+#define VMCS12_SIZE 0x1000
+
+/*
  * The nested_vmx structure is part of vcpu_vmx, and holds information we need
  * for correct emulation of VMX (i.e., nested VMX) on this vcpu.
  */
 struct nested_vmx {
 	/* Has the level1 guest done vmxon? */
 	bool vmxon;
+
+	/* The guest-physical address of the current VMCS L1 keeps for L2 */
+	gpa_t current_vmptr;
+	/* The host-usable pointer to the above */
+	struct page *current_vmcs12_page;
+	struct vmcs12 *current_vmcs12;
 };
 
 struct vcpu_vmx {
@@ -212,6 +253,31 @@ static inline struct vcpu_vmx *to_vmx(st
 	return container_of(vcpu, struct vcpu_vmx, vcpu);
 }
 
+static inline struct vmcs12 *get_vmcs12(struct kvm_vcpu *vcpu)
+{
+	return to_vmx(vcpu)->nested.current_vmcs12;
+}
+
+static struct page *nested_get_page(struct kvm_vcpu *vcpu, gpa_t addr)
+{
+	struct page *page = gfn_to_page(vcpu->kvm, addr >> PAGE_SHIFT);
+	if (is_error_page(page)) {
+		kvm_release_page_clean(page);
+		return NULL;
+	}
+	return page;
+}
+
+static void nested_release_page(struct page *page)
+{
+	kvm_release_page_dirty(page);
+}
+
+static void nested_release_page_clean(struct page *page)
+{
+	kvm_release_page_clean(page);
+}
+
 static u64 construct_eptp(unsigned long root_hpa);
 static void kvm_cpu_vmxon(u64 addr);
 static void kvm_cpu_vmxoff(void);
@@ -3995,6 +4061,12 @@ static void free_nested(struct vcpu_vmx 
 	if (!vmx->nested.vmxon)
 		return;
 	vmx->nested.vmxon = false;
+	if (vmx->nested.current_vmptr != -1ull) {
+		kunmap(vmx->nested.current_vmcs12_page);
+		nested_release_page(vmx->nested.current_vmcs12_page);
+		vmx->nested.current_vmptr = -1ull;
+		vmx->nested.current_vmcs12 = NULL;
+	}
 }
 
 /* Emulate the VMXOFF instruction */
@@ -4518,6 +4590,9 @@ static struct kvm_vcpu *vmx_create_vcpu(
 			goto free_vmcs;
 	}
 
+	vmx->nested.current_vmptr = -1ull;
+	vmx->nested.current_vmcs12 = NULL;
+
 	return &vmx->vcpu;
 
 free_vmcs:


* [PATCH 05/30] nVMX: Implement reading and writing of VMX MSRs
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
                   ` (3 preceding siblings ...)
  2011-05-08  8:17 ` [PATCH 04/30] nVMX: Introduce vmcs12: a VMCS structure for L1 Nadav Har'El
@ 2011-05-08  8:17 ` Nadav Har'El
  2011-05-08  8:18 ` [PATCH 06/30] nVMX: Decoding memory operands of VMX instructions Nadav Har'El
                   ` (25 subsequent siblings)
  30 siblings, 0 replies; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:17 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

When the guest can use VMX instructions (when the "nested" module option is
on), it should also be able to read and write VMX MSRs, e.g., to query about
VMX capabilities. This patch adds this support.
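
As a small worked illustration of the "allowed-0"/"allowed-1" (low/high)
convention these capability MSRs use, and which the vmx_control_verify()
helper added below checks (the values here are only an example): with
low = 0x16 (bits 1, 2 and 4 must be 1) and high = 0x16 | PIN_BASED_EXT_INTR_MASK
(only these bits may be 1), we get:

	vmx_control_verify(0x16, low, high)                           -> true
	vmx_control_verify(0x16 | PIN_BASED_EXT_INTR_MASK, low, high) -> true
	vmx_control_verify(0x06, low, high) -> false (bit 4 must be 1 but is 0)
	vmx_control_verify(0x36, low, high) -> false (bit 5 set but not allowed)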

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/include/asm/msr-index.h |   12 ++
 arch/x86/kvm/vmx.c               |  174 +++++++++++++++++++++++++++++
 2 files changed, 186 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:18.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:18.000000000 +0300
@@ -1365,6 +1365,176 @@ static inline bool nested_vmx_allowed(st
 }
 
 /*
+ * nested_vmx_pinbased_ctls() returns the value which is to be returned
+ * for MSR_IA32_VMX_PINBASED_CTLS, and also determines the legal setting of
+ * vmcs12->pin_based_vm_exec_control. See the spec and vmx_control_verify() for
+ * the meaning of the low and high halves of this MSR.
+ * TODO: allow the return value to be modified (downgraded) by module options
+ * or other means.
+ */
+static inline void nested_vmx_pinbased_ctls(u32 *low, u32 *high)
+{
+	/*
+	 * According to the Intel spec, if bit 55 of VMX_BASIC is off (as it is
+	 * in our case), bits 1, 2 and 4 (i.e., 0x16) must be 1 in this MSR.
+	 */
+	*low = 0x16 ;
+	/* Allow only these bits to be 1 */
+	*high = 0x16 | PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING
+		     | PIN_BASED_VIRTUAL_NMIS;
+}
+
+static inline bool vmx_control_verify(u32 control, u32 low, u32 high)
+{
+	/*
+	 * Bits 0 in high must be 0, and bits 1 in low must be 1.
+	 */
+	return ((control & high) | low) == control;
+}
+
+/*
+ * If we allow our guest to use VMX instructions (i.e., nested VMX), we should
+ * also let it use VMX-specific MSRs.
+ * vmx_get_vmx_msr() and vmx_set_vmx_msr() return 1 when we handled a
+ * VMX-specific MSR, or 0 when we haven't (and the caller should handle it
+ * like all other MSRs).
+ */
+static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
+{
+	u32 vmx_msr_high, vmx_msr_low;
+
+	if (!nested_vmx_allowed(vcpu) && msr_index >= MSR_IA32_VMX_BASIC &&
+		     msr_index <= MSR_IA32_VMX_TRUE_ENTRY_CTLS) {
+		/*
+		 * According to the spec, processors which do not support VMX
+		 * should throw a #GP(0) when VMX capability MSRs are read.
+		 */
+		kvm_queue_exception_e(vcpu, GP_VECTOR, 0);
+		return 1;
+	}
+
+	switch (msr_index) {
+	case MSR_IA32_FEATURE_CONTROL:
+		*pdata = 0;
+		break;
+	case MSR_IA32_VMX_BASIC:
+		/*
+		 * This MSR reports some information about VMX support. We
+		 * should return information about the VMX we emulate for the
+		 * guest, and the VMCS structure we give it - not about the
+		 * VMX support of the underlying hardware.
+		 */
+		*pdata = VMCS12_REVISION |
+			   ((u64)VMCS12_SIZE << VMX_BASIC_VMCS_SIZE_SHIFT) |
+			   (VMX_BASIC_MEM_TYPE_WB << VMX_BASIC_MEM_TYPE_SHIFT);
+		break;
+	case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
+	case MSR_IA32_VMX_PINBASED_CTLS:
+		nested_vmx_pinbased_ctls(&vmx_msr_low, &vmx_msr_high);
+		*pdata = vmx_msr_low | ((u64)vmx_msr_high << 32);
+		break;
+	case MSR_IA32_VMX_TRUE_PROCBASED_CTLS:
+	case MSR_IA32_VMX_PROCBASED_CTLS:
+		/* This MSR determines which vm-execution controls the L1
+		 * hypervisor may ask, or may not ask, to enable. Normally we
+		 * can only allow enabling features which the hardware can
+		 * support, but we limit ourselves to allowing only known
+		 * features that were tested nested. We can allow disabling any
+		 * feature (even if the hardware can't disable it) - we just
+		 * need to enable this feature and hide the extra exits from L1
+		 */
+		rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, vmx_msr_low, vmx_msr_high);
+		vmx_msr_low = 0; /* allow disabling any feature */
+		vmx_msr_high &= /* do not expose new untested features */
+			CPU_BASED_HLT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
+			CPU_BASED_CR3_STORE_EXITING | CPU_BASED_USE_IO_BITMAPS |
+			CPU_BASED_MOV_DR_EXITING | CPU_BASED_USE_TSC_OFFSETING |
+			CPU_BASED_MWAIT_EXITING | CPU_BASED_MONITOR_EXITING |
+			CPU_BASED_INVLPG_EXITING |
+#ifdef CONFIG_X86_64
+			CPU_BASED_CR8_LOAD_EXITING |
+			CPU_BASED_CR8_STORE_EXITING |
+#endif
+			CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
+		/*
+		 * We can allow some features even when not supported by the
+		 * hardware. For example, L1 can specify an MSR bitmap - and we
+		 * can use it to avoid exits to L1 - even when L0 runs L2
+		 * without MSR bitmaps.
+		 */
+		vmx_msr_high |= CPU_BASED_USE_MSR_BITMAPS;
+		*pdata = vmx_msr_low | ((u64)vmx_msr_high << 32);
+		break;
+	case MSR_IA32_VMX_TRUE_EXIT_CTLS:
+	case MSR_IA32_VMX_EXIT_CTLS:
+		*pdata = 0;
+#ifdef CONFIG_X86_64
+		*pdata |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
+#endif
+		break;
+	case MSR_IA32_VMX_TRUE_ENTRY_CTLS:
+	case MSR_IA32_VMX_ENTRY_CTLS:
+		*pdata = 0;
+		break;
+	case MSR_IA32_VMX_MISC:
+		*pdata = 0;
+		break;
+	/*
+	 * These MSRs specify bits which the guest must keep fixed (on or off)
+	 * while L1 is in VMXON mode (in L1's root mode, or running an L2).
+	 * We picked the standard core2 setting.
+	 */
+#define VMXON_CR0_ALWAYSON	(X86_CR0_PE | X86_CR0_PG | X86_CR0_NE)
+#define VMXON_CR4_ALWAYSON	X86_CR4_VMXE
+	case MSR_IA32_VMX_CR0_FIXED0:
+		*pdata = VMXON_CR0_ALWAYSON;
+		break;
+	case MSR_IA32_VMX_CR0_FIXED1:
+		*pdata = -1ULL;
+		break;
+	case MSR_IA32_VMX_CR4_FIXED0:
+		*pdata = VMXON_CR4_ALWAYSON;
+		break;
+	case MSR_IA32_VMX_CR4_FIXED1:
+		*pdata = -1ULL;
+		break;
+	case MSR_IA32_VMX_VMCS_ENUM:
+		*pdata = 0x1f;
+		break;
+	case MSR_IA32_VMX_PROCBASED_CTLS2:
+		rdmsr(MSR_IA32_VMX_PROCBASED_CTLS2, vmx_msr_low, vmx_msr_high);
+		vmx_msr_low = 0; /* allow disabling any feature */
+		vmx_msr_high &= /* do not expose new untested features */
+			SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+		*pdata = vmx_msr_low | ((u64)vmx_msr_high << 32);
+		break;
+	case MSR_IA32_VMX_EPT_VPID_CAP:
+		/* Currently, no nested ept or nested vpid */
+		*pdata = 0;
+		break;
+	default:
+		return 0;
+	}
+
+	return 1;
+}
+
+static int vmx_set_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
+{
+	if (!nested_vmx_allowed(vcpu))
+		return 0;
+
+	if (msr_index == MSR_IA32_FEATURE_CONTROL)
+		/* TODO: the right thing. */
+		return 1;
+	/*
+	 * No need to treat VMX capability MSRs specially: If we don't handle
+	 * them, handle_wrmsr will #GP(0), which is correct (they are readonly)
+	 */
+	return 0;
+}
+
+/*
  * Reads an msr value (of 'msr_index') into 'pdata'.
  * Returns 0 on success, non-0 otherwise.
  * Assumes vcpu_load() was already called.
@@ -1412,6 +1582,8 @@ static int vmx_get_msr(struct kvm_vcpu *
 		/* Otherwise falls through */
 	default:
 		vmx_load_host_state(to_vmx(vcpu));
+		if (vmx_get_vmx_msr(vcpu, msr_index, pdata))
+			return 0;
 		msr = find_msr_entry(to_vmx(vcpu), msr_index);
 		if (msr) {
 			vmx_load_host_state(to_vmx(vcpu));
@@ -1483,6 +1655,8 @@ static int vmx_set_msr(struct kvm_vcpu *
 			return 1;
 		/* Otherwise falls through */
 	default:
+		if (vmx_set_vmx_msr(vcpu, msr_index, data))
+			break;
 		msr = find_msr_entry(vmx, msr_index);
 		if (msr) {
 			vmx_load_host_state(vmx);
--- .before/arch/x86/include/asm/msr-index.h	2011-05-08 10:43:18.000000000 +0300
+++ .after/arch/x86/include/asm/msr-index.h	2011-05-08 10:43:18.000000000 +0300
@@ -434,6 +434,18 @@
 #define MSR_IA32_VMX_VMCS_ENUM          0x0000048a
 #define MSR_IA32_VMX_PROCBASED_CTLS2    0x0000048b
 #define MSR_IA32_VMX_EPT_VPID_CAP       0x0000048c
+#define MSR_IA32_VMX_TRUE_PINBASED_CTLS  0x0000048d
+#define MSR_IA32_VMX_TRUE_PROCBASED_CTLS 0x0000048e
+#define MSR_IA32_VMX_TRUE_EXIT_CTLS      0x0000048f
+#define MSR_IA32_VMX_TRUE_ENTRY_CTLS     0x00000490
+
+/* VMX_BASIC bits and bitmasks */
+#define VMX_BASIC_VMCS_SIZE_SHIFT	32
+#define VMX_BASIC_64		0x0001000000000000LLU
+#define VMX_BASIC_MEM_TYPE_SHIFT	50
+#define VMX_BASIC_MEM_TYPE_MASK	0x003c000000000000LLU
+#define VMX_BASIC_MEM_TYPE_WB	6LLU
+#define VMX_BASIC_INOUT		0x0040000000000000LLU
 
 /* AMD-V MSRs */
 


* [PATCH 06/30] nVMX: Decoding memory operands of VMX instructions
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
                   ` (4 preceding siblings ...)
  2011-05-08  8:17 ` [PATCH 05/30] nVMX: Implement reading and writing of VMX MSRs Nadav Har'El
@ 2011-05-08  8:18 ` Nadav Har'El
  2011-05-09  9:47   ` Avi Kivity
  2011-05-08  8:18 ` [PATCH 07/30] nVMX: Introduce vmcs02: VMCS used to run L2 Nadav Har'El
                   ` (24 subsequent siblings)
  30 siblings, 1 reply; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:18 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch includes a utility function for decoding pointer operands of VMX
instructions issued by L1 (a guest hypervisor).
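
As an illustration of the bit layout decoded below (the value is hypothetical,
chosen only for this example), vmx_instruction_info = 0x418100 would decode as:

	scaling   = 0            (bits 1:0)
	addr_size = 2 (64-bit)   (bits 9:7)
	is_reg    = 0            (bit 10: memory, not register, operand)
	seg_reg   = 3 (DS)       (bits 17:15)
	index     = invalid      (bit 22 set)
	base_reg  = 0 (RAX)      (bits 26:23, valid since bit 27 is clear)

so the operand address is DS.base + RAX + displacement (exit_qualification).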

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   53 +++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c |    3 +-
 arch/x86/kvm/x86.h |    4 +++
 3 files changed, 59 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/x86.c	2011-05-08 10:43:18.000000000 +0300
+++ .after/arch/x86/kvm/x86.c	2011-05-08 10:43:18.000000000 +0300
@@ -3815,7 +3815,7 @@ static int kvm_fetch_guest_virt(struct x
 					  exception);
 }
 
-static int kvm_read_guest_virt(struct x86_emulate_ctxt *ctxt,
+int kvm_read_guest_virt(struct x86_emulate_ctxt *ctxt,
 			       gva_t addr, void *val, unsigned int bytes,
 			       struct x86_exception *exception)
 {
@@ -3825,6 +3825,7 @@ static int kvm_read_guest_virt(struct x8
 	return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, access,
 					  exception);
 }
+EXPORT_SYMBOL_GPL(kvm_read_guest_virt);
 
 static int kvm_read_guest_virt_system(struct x86_emulate_ctxt *ctxt,
 				      gva_t addr, void *val, unsigned int bytes,
--- .before/arch/x86/kvm/x86.h	2011-05-08 10:43:18.000000000 +0300
+++ .after/arch/x86/kvm/x86.h	2011-05-08 10:43:18.000000000 +0300
@@ -81,4 +81,8 @@ int kvm_inject_realmode_interrupt(struct
 
 void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data);
 
+int kvm_read_guest_virt(struct x86_emulate_ctxt *ctxt,
+	gva_t addr, void *val, unsigned int bytes,
+	struct x86_exception *exception);
+
 #endif
--- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:18.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:18.000000000 +0300
@@ -4254,6 +4254,59 @@ static int handle_vmoff(struct kvm_vcpu 
 }
 
 /*
+ * Decode the memory-address operand of a vmx instruction, as recorded on an
+ * exit caused by such an instruction (run by a guest hypervisor).
+ * On success, returns 0. When the operand is invalid, returns 1 and throws
+ * #UD or #GP.
+ */
+static int get_vmx_mem_address(struct kvm_vcpu *vcpu,
+				 unsigned long exit_qualification,
+				 u32 vmx_instruction_info, gva_t *ret)
+{
+	/*
+	 * According to Vol. 3B, "Information for VM Exits Due to Instruction
+	 * Execution", on an exit, vmx_instruction_info holds most of the
+	 * addressing components of the operand. Only the displacement part
+	 * is put in exit_qualification (see 3B, "Basic VM-Exit Information").
+	 * For how an actual address is calculated from all these components,
+	 * refer to Vol. 1, "Operand Addressing".
+	 */
+	int  scaling = vmx_instruction_info & 3;
+	int  addr_size = (vmx_instruction_info >> 7) & 7;
+	bool is_reg = vmx_instruction_info & (1u << 10);
+	int  seg_reg = (vmx_instruction_info >> 15) & 7;
+	int  index_reg = (vmx_instruction_info >> 18) & 0xf;
+	bool index_is_valid = !(vmx_instruction_info & (1u << 22));
+	int  base_reg       = (vmx_instruction_info >> 23) & 0xf;
+	bool base_is_valid  = !(vmx_instruction_info & (1u << 27));
+
+	if (is_reg) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	/* Addr = segment_base + offset */
+	/* offset = base + [index * scale] + displacement */
+	*ret = vmx_get_segment_base(vcpu, seg_reg);
+	if (base_is_valid)
+		*ret += kvm_register_read(vcpu, base_reg);
+	if (index_is_valid)
+		*ret += kvm_register_read(vcpu, index_reg)<<scaling;
+	*ret += exit_qualification; /* holds the displacement */
+
+	if (addr_size == 1) /* 32 bit */
+		*ret &= 0xffffffff;
+
+	/*
+	 * TODO: throw #GP (and return 1) in various cases that the VM*
+	 * instructions require it - e.g., offset beyond segment limit,
+	 * unusable or unreadable/unwritable segment, non-canonical 64-bit
+	 * address, and so on. Currently these are not checked.
+	 */
+	return 0;
+}
+
+/*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
  * to be done to userspace and return 0.


* [PATCH 07/30] nVMX: Introduce vmcs02: VMCS used to run L2
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
                   ` (5 preceding siblings ...)
  2011-05-08  8:18 ` [PATCH 06/30] nVMX: Decoding memory operands of VMX instructions Nadav Har'El
@ 2011-05-08  8:18 ` Nadav Har'El
  2011-05-16 15:30   ` Marcelo Tosatti
  2011-05-08  8:19 ` [PATCH 08/30] nVMX: Fix local_vcpus_link handling Nadav Har'El
                   ` (23 subsequent siblings)
  30 siblings, 1 reply; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:18 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

We saw in a previous patch that L1 controls its L2 guest with a vmcs12.
L0 needs to create a real VMCS for running L2. We call that "vmcs02".
A later patch will contain the code, prepare_vmcs02(), for filling the vmcs02
fields. This patch only contains code for allocating vmcs02.

In this version, prepare_vmcs02() sets *all* of vmcs02's fields each time we
enter from L1 to L2, so keeping just one vmcs02 for the vcpu is enough: It can
be reused even when L1 runs multiple L2 guests. However, in future versions
we'll probably want to add an optimization where vmcs02 fields that rarely
change will not be set each time. For that, we may want to keep around several
vmcs02s of L2 guests that have recently run, so that potentially we could run
these L2s again more quickly because fewer vmwrites to vmcs02 will be needed.

This patch adds to each vcpu a vmcs02 pool, vmx->nested.vmcs02_pool,
which remembers the vmcs02s last used to run up to VMCS02_POOL_SIZE L2s.
As explained above, in the current version we choose VMCS02_POOL_SIZE=1,
i.e., one vmcs02 is allocated (and loaded onto the processor), and it is
reused to enter any L2 guest. In the future, when prepare_vmcs02() is
optimized not to set all fields every time, VMCS02_POOL_SIZE should be
increased.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  134 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 134 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:18.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:18.000000000 +0300
@@ -117,6 +117,7 @@ static int ple_window = KVM_VMX_DEFAULT_
 module_param(ple_window, int, S_IRUGO);
 
 #define NR_AUTOLOAD_MSRS 1
+#define VMCS02_POOL_SIZE 1
 
 struct vmcs {
 	u32 revision_id;
@@ -166,6 +167,30 @@ struct __packed vmcs12 {
 #define VMCS12_SIZE 0x1000
 
 /*
+ * When we temporarily switch a vcpu's VMCS (e.g., stop using an L1's VMCS
+ * while we use L2's VMCS), and we wish to save the previous VMCS, we must also
+ * remember on which CPU it was last loaded (vcpu->cpu), so when we return to
+ * using this VMCS we'll know if we're now running on a different CPU and need
+ * to clear the VMCS on the old CPU, and load it on the new one. Additionally,
+ * we need to remember whether this VMCS was launched (vmx->launched), so when
+ * we return to it we know if to VMLAUNCH or to VMRESUME it (we cannot deduce
+ * this from other state, because it's possible that this VMCS had once been
+ * launched, but has since been cleared after a CPU switch).
+ */
+struct saved_vmcs {
+	struct vmcs *vmcs;
+	int cpu;
+	int launched;
+};
+
+/* Used to remember the last vmcs02 used for some recently used vmcs12s */
+struct vmcs02_list {
+	struct list_head list;
+	gpa_t vmcs12_addr;
+	struct saved_vmcs vmcs02;
+};
+
+/*
  * The nested_vmx structure is part of vcpu_vmx, and holds information we need
  * for correct emulation of VMX (i.e., nested VMX) on this vcpu.
  */
@@ -178,6 +203,10 @@ struct nested_vmx {
 	/* The host-usable pointer to the above */
 	struct page *current_vmcs12_page;
 	struct vmcs12 *current_vmcs12;
+
+	/* vmcs02_list cache of VMCSs recently used to run L2 guests */
+	struct list_head vmcs02_pool;
+	int vmcs02_num;
 };
 
 struct vcpu_vmx {
@@ -4155,6 +4184,106 @@ static int handle_invalid_op(struct kvm_
 }
 
 /*
+ * To run an L2 guest, we need a vmcs02 based on the L1-specified vmcs12.
+ * We could reuse a single VMCS for all the L2 guests, but we also want the
+ * option to allocate a separate vmcs02 for each separate loaded vmcs12 - this
+ * allows keeping them loaded on the processor, and in the future will allow
+ * optimizations where prepare_vmcs02 doesn't need to set all the fields on
+ * every entry if they never change.
+ * So we keep, in vmx->nested.vmcs02_pool, a cache of size VMCS02_POOL_SIZE
+ * (>=0) with a vmcs02 for each recently loaded vmcs12s, most recent first.
+ *
+ * The following functions allocate and free a vmcs02 in this pool.
+ */
+
+static void __nested_free_saved_vmcs(void *arg)
+{
+	struct saved_vmcs *saved_vmcs = arg;
+
+	vmcs_clear(saved_vmcs->vmcs);
+	if (per_cpu(current_vmcs, saved_vmcs->cpu) == saved_vmcs->vmcs)
+		per_cpu(current_vmcs, saved_vmcs->cpu) = NULL;
+}
+
+/*
+ * Free a VMCS, but before that VMCLEAR it on the CPU where it was last loaded
+ * (the necessary information is in the saved_vmcs structure).
+ * See also vcpu_clear() (with different parameters and side-effects)
+ */
+static void nested_free_saved_vmcs(struct vcpu_vmx *vmx,
+		struct saved_vmcs *saved_vmcs)
+{
+	if (saved_vmcs->cpu != -1)
+		smp_call_function_single(saved_vmcs->cpu,
+				__nested_free_saved_vmcs, saved_vmcs, 1);
+
+	free_vmcs(saved_vmcs->vmcs);
+}
+
+/* Free and remove from pool a vmcs02 saved for a vmcs12 (if there is one) */
+static void nested_free_vmcs02(struct vcpu_vmx *vmx, gpa_t vmptr)
+{
+	struct vmcs02_list *item;
+	list_for_each_entry(item, &vmx->nested.vmcs02_pool, list)
+		if (item->vmcs12_addr == vmptr) {
+			nested_free_saved_vmcs(vmx, &item->vmcs02);
+			list_del(&item->list);
+			kfree(item);
+			vmx->nested.vmcs02_num--;
+			return;
+		}
+}
+
+/* Free all vmcs02 saved for this vcpu */
+static void nested_free_all_vmcs02(struct vcpu_vmx *vmx)
+{
+	struct vmcs02_list *item, *n;
+	list_for_each_entry_safe(item, n, &vmx->nested.vmcs02_pool, list) {
+		nested_free_saved_vmcs(vmx, &item->vmcs02);
+		list_del(&item->list);
+		kfree(item);
+	}
+	vmx->nested.vmcs02_num = 0;
+}
+
+/* Get a vmcs02 for the current vmcs12. */
+static struct saved_vmcs *nested_get_current_vmcs02(struct vcpu_vmx *vmx)
+{
+	struct vmcs02_list *item;
+	list_for_each_entry(item, &vmx->nested.vmcs02_pool, list)
+		if (item->vmcs12_addr == vmx->nested.current_vmptr) {
+			list_move(&item->list, &vmx->nested.vmcs02_pool);
+			return &item->vmcs02;
+		}
+
+	if (vmx->nested.vmcs02_num >= max(VMCS02_POOL_SIZE, 1)) {
+		/* Recycle the least recently used VMCS. */
+		item = list_entry(vmx->nested.vmcs02_pool.prev,
+			struct vmcs02_list, list);
+		item->vmcs12_addr = vmx->nested.current_vmptr;
+		list_move(&item->list, &vmx->nested.vmcs02_pool);
+		return &item->vmcs02;
+	}
+
+	/* Create a new vmcs02 */
+	item = (struct vmcs02_list *)
+		kmalloc(sizeof(struct vmcs02_list), GFP_KERNEL);
+	if (!item)
+		return NULL;
+	item->vmcs02.vmcs = alloc_vmcs();
+	if (!item->vmcs02.vmcs) {
+		kfree(item);
+		return NULL;
+	}
+	item->vmcs12_addr = vmx->nested.current_vmptr;
+	item->vmcs02.cpu = -1;
+	item->vmcs02.launched = 0;
+	list_add(&(item->list), &(vmx->nested.vmcs02_pool));
+	vmx->nested.vmcs02_num++;
+	return &item->vmcs02;
+}
+
+/*
  * Emulate the VMXON instruction.
  * Currently, we just remember that VMX is active, and do not save or even
  * inspect the argument to VMXON (the so-called "VMXON pointer") because we
@@ -4190,6 +4319,9 @@ static int handle_vmon(struct kvm_vcpu *
 		return 1;
 	}
 
+	INIT_LIST_HEAD(&(vmx->nested.vmcs02_pool));
+	vmx->nested.vmcs02_num = 0;
+
 	vmx->nested.vmxon = true;
 
 	skip_emulated_instruction(vcpu);
@@ -4241,6 +4373,8 @@ static void free_nested(struct vcpu_vmx 
 		vmx->nested.current_vmptr = -1ull;
 		vmx->nested.current_vmcs12 = NULL;
 	}
+
+	nested_free_all_vmcs02(vmx);
 }
 
 /* Emulate the VMXOFF instruction */


* [PATCH 08/30] nVMX: Fix local_vcpus_link handling
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
                   ` (6 preceding siblings ...)
  2011-05-08  8:18 ` [PATCH 07/30] nVMX: Introduce vmcs02: VMCS used to run L2 Nadav Har'El
@ 2011-05-08  8:19 ` Nadav Har'El
  2011-05-08  8:19 ` [PATCH 09/30] nVMX: Add VMCS fields to the vmcs12 Nadav Har'El
                   ` (22 subsequent siblings)
  30 siblings, 0 replies; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:19 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

In VMX, before we bring down a CPU we must VMCLEAR all VMCSs loaded on it
because (at least in theory) the processor might not have written all of its
content back to memory. Since a patch from June 26, 2008, this is done using
a per-cpu "vcpus_on_cpu" linked list of vcpus loaded on each CPU.

The problem is that with nested VMX, we no longer have the concept of a
vcpu being loaded on a cpu: A vcpu has multiple VMCSs (one for L1, others for
each L2), and each of those may have been last loaded on a different CPU.

This trivial patch changes the code to keep only L1 VMCSs on vcpus_on_cpu.
This fixes crashes on L1 shutdown caused by incorrectly maintaining the linked
lists.

It is not a complete solution, though. It doesn't flush the inactive L1 or L2
VMCSs loaded on a CPU which is being shut down. Doing this correctly will
probably require replacing the vcpu linked list with a linked list of
"saved_vmcs" objects (VMCS, cpu and launched), and it is left as a TODO.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:18.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:18.000000000 +0300
@@ -638,7 +638,9 @@ static void __vcpu_clear(void *arg)
 		vmcs_clear(vmx->vmcs);
 	if (per_cpu(current_vmcs, cpu) == vmx->vmcs)
 		per_cpu(current_vmcs, cpu) = NULL;
-	list_del(&vmx->local_vcpus_link);
+	/* TODO: currently, local_vcpus_link is just for L1 VMCSs */
+	if (!is_guest_mode(&vmx->vcpu))
+		list_del(&vmx->local_vcpus_link);
 	vmx->vcpu.cpu = -1;
 	vmx->launched = 0;
 }
@@ -1100,8 +1102,10 @@ static void vmx_vcpu_load(struct kvm_vcp
 
 		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
 		local_irq_disable();
-		list_add(&vmx->local_vcpus_link,
-			 &per_cpu(vcpus_on_cpu, cpu));
+		/* TODO: currently, local_vcpus_link is just for L1 VMCSs */
+		if (!is_guest_mode(&vmx->vcpu))
+			list_add(&vmx->local_vcpus_link,
+				 &per_cpu(vcpus_on_cpu, cpu));
 		local_irq_enable();
 
 		/*
@@ -1806,7 +1810,9 @@ static void vmclear_local_vcpus(void)
 
 	list_for_each_entry_safe(vmx, n, &per_cpu(vcpus_on_cpu, cpu),
 				 local_vcpus_link)
-		__vcpu_clear(vmx);
+		/* TODO: currently, local_vcpus_link is just for L1 VMCSs */
+		if (!is_guest_mode(&vmx->vcpu))
+			__vcpu_clear(vmx);
 }
 
 


* [PATCH 09/30] nVMX: Add VMCS fields to the vmcs12
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
                   ` (7 preceding siblings ...)
  2011-05-08  8:19 ` [PATCH 08/30] nVMX: Fix local_vcpus_link handling Nadav Har'El
@ 2011-05-08  8:19 ` Nadav Har'El
  2011-05-08  8:20 ` [PATCH 10/30] nVMX: Success/failure of VMX instructions Nadav Har'El
                   ` (21 subsequent siblings)
  30 siblings, 0 replies; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:19 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

In this patch we add to vmcs12 (the VMCS that L1 keeps for L2) all the
standard VMCS fields.

Later patches will enable L1 to read and write these fields using VMREAD/
VMWRITE, and they will be used during a VMLAUNCH/VMRESUME in preparing vmcs02,
a hardware VMCS for running L2.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  275 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 275 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:18.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:18.000000000 +0300
@@ -144,12 +144,148 @@ struct shared_msr_entry {
  * machines (necessary for live migration).
  * If there are changes in this struct, VMCS12_REVISION must be changed.
  */
+typedef u64 natural_width;
 struct __packed vmcs12 {
 	/* According to the Intel spec, a VMCS region must start with the
 	 * following two fields. Then follow implementation-specific data.
 	 */
 	u32 revision_id;
 	u32 abort;
+
+	u64 io_bitmap_a;
+	u64 io_bitmap_b;
+	u64 msr_bitmap;
+	u64 vm_exit_msr_store_addr;
+	u64 vm_exit_msr_load_addr;
+	u64 vm_entry_msr_load_addr;
+	u64 tsc_offset;
+	u64 virtual_apic_page_addr;
+	u64 apic_access_addr;
+	u64 ept_pointer;
+	u64 guest_physical_address;
+	u64 vmcs_link_pointer;
+	u64 guest_ia32_debugctl;
+	u64 guest_ia32_pat;
+	u64 guest_ia32_efer;
+	u64 guest_pdptr0;
+	u64 guest_pdptr1;
+	u64 guest_pdptr2;
+	u64 guest_pdptr3;
+	u64 host_ia32_pat;
+	u64 host_ia32_efer;
+	u64 padding64[8]; /* room for future expansion */
+	/*
+	 * To allow migration of L1 (complete with its L2 guests) between
+	 * machines of different natural widths (32 or 64 bit), we cannot have
+	 * unsigned long fields with no explict size. We use u64 (aliased
+	 * natural_width) instead. Luckily, x86 is little-endian.
+	 */
+	natural_width cr0_guest_host_mask;
+	natural_width cr4_guest_host_mask;
+	natural_width cr0_read_shadow;
+	natural_width cr4_read_shadow;
+	natural_width cr3_target_value0;
+	natural_width cr3_target_value1;
+	natural_width cr3_target_value2;
+	natural_width cr3_target_value3;
+	natural_width exit_qualification;
+	natural_width guest_linear_address;
+	natural_width guest_cr0;
+	natural_width guest_cr3;
+	natural_width guest_cr4;
+	natural_width guest_es_base;
+	natural_width guest_cs_base;
+	natural_width guest_ss_base;
+	natural_width guest_ds_base;
+	natural_width guest_fs_base;
+	natural_width guest_gs_base;
+	natural_width guest_ldtr_base;
+	natural_width guest_tr_base;
+	natural_width guest_gdtr_base;
+	natural_width guest_idtr_base;
+	natural_width guest_dr7;
+	natural_width guest_rsp;
+	natural_width guest_rip;
+	natural_width guest_rflags;
+	natural_width guest_pending_dbg_exceptions;
+	natural_width guest_sysenter_esp;
+	natural_width guest_sysenter_eip;
+	natural_width host_cr0;
+	natural_width host_cr3;
+	natural_width host_cr4;
+	natural_width host_fs_base;
+	natural_width host_gs_base;
+	natural_width host_tr_base;
+	natural_width host_gdtr_base;
+	natural_width host_idtr_base;
+	natural_width host_ia32_sysenter_esp;
+	natural_width host_ia32_sysenter_eip;
+	natural_width host_rsp;
+	natural_width host_rip;
+	natural_width paddingl[8]; /* room for future expansion */
+	u32 pin_based_vm_exec_control;
+	u32 cpu_based_vm_exec_control;
+	u32 exception_bitmap;
+	u32 page_fault_error_code_mask;
+	u32 page_fault_error_code_match;
+	u32 cr3_target_count;
+	u32 vm_exit_controls;
+	u32 vm_exit_msr_store_count;
+	u32 vm_exit_msr_load_count;
+	u32 vm_entry_controls;
+	u32 vm_entry_msr_load_count;
+	u32 vm_entry_intr_info_field;
+	u32 vm_entry_exception_error_code;
+	u32 vm_entry_instruction_len;
+	u32 tpr_threshold;
+	u32 secondary_vm_exec_control;
+	u32 vm_instruction_error;
+	u32 vm_exit_reason;
+	u32 vm_exit_intr_info;
+	u32 vm_exit_intr_error_code;
+	u32 idt_vectoring_info_field;
+	u32 idt_vectoring_error_code;
+	u32 vm_exit_instruction_len;
+	u32 vmx_instruction_info;
+	u32 guest_es_limit;
+	u32 guest_cs_limit;
+	u32 guest_ss_limit;
+	u32 guest_ds_limit;
+	u32 guest_fs_limit;
+	u32 guest_gs_limit;
+	u32 guest_ldtr_limit;
+	u32 guest_tr_limit;
+	u32 guest_gdtr_limit;
+	u32 guest_idtr_limit;
+	u32 guest_es_ar_bytes;
+	u32 guest_cs_ar_bytes;
+	u32 guest_ss_ar_bytes;
+	u32 guest_ds_ar_bytes;
+	u32 guest_fs_ar_bytes;
+	u32 guest_gs_ar_bytes;
+	u32 guest_ldtr_ar_bytes;
+	u32 guest_tr_ar_bytes;
+	u32 guest_interruptibility_info;
+	u32 guest_activity_state;
+	u32 guest_sysenter_cs;
+	u32 host_ia32_sysenter_cs;
+	u32 padding32[8]; /* room for future expansion */
+	u16 virtual_processor_id;
+	u16 guest_es_selector;
+	u16 guest_cs_selector;
+	u16 guest_ss_selector;
+	u16 guest_ds_selector;
+	u16 guest_fs_selector;
+	u16 guest_gs_selector;
+	u16 guest_ldtr_selector;
+	u16 guest_tr_selector;
+	u16 host_es_selector;
+	u16 host_cs_selector;
+	u16 host_ss_selector;
+	u16 host_ds_selector;
+	u16 host_fs_selector;
+	u16 host_gs_selector;
+	u16 host_tr_selector;
 };
 
 /*
@@ -282,6 +418,145 @@ static inline struct vcpu_vmx *to_vmx(st
 	return container_of(vcpu, struct vcpu_vmx, vcpu);
 }
 
+#define VMCS12_OFFSET(x) offsetof(struct vmcs12, x)
+#define FIELD(number, name)	[number] = VMCS12_OFFSET(name)
+#define FIELD64(number, name)	[number] = VMCS12_OFFSET(name), \
+				[number##_HIGH] = VMCS12_OFFSET(name)+4
+
+static unsigned short vmcs_field_to_offset_table[] = {
+	FIELD(VIRTUAL_PROCESSOR_ID, virtual_processor_id),
+	FIELD(GUEST_ES_SELECTOR, guest_es_selector),
+	FIELD(GUEST_CS_SELECTOR, guest_cs_selector),
+	FIELD(GUEST_SS_SELECTOR, guest_ss_selector),
+	FIELD(GUEST_DS_SELECTOR, guest_ds_selector),
+	FIELD(GUEST_FS_SELECTOR, guest_fs_selector),
+	FIELD(GUEST_GS_SELECTOR, guest_gs_selector),
+	FIELD(GUEST_LDTR_SELECTOR, guest_ldtr_selector),
+	FIELD(GUEST_TR_SELECTOR, guest_tr_selector),
+	FIELD(HOST_ES_SELECTOR, host_es_selector),
+	FIELD(HOST_CS_SELECTOR, host_cs_selector),
+	FIELD(HOST_SS_SELECTOR, host_ss_selector),
+	FIELD(HOST_DS_SELECTOR, host_ds_selector),
+	FIELD(HOST_FS_SELECTOR, host_fs_selector),
+	FIELD(HOST_GS_SELECTOR, host_gs_selector),
+	FIELD(HOST_TR_SELECTOR, host_tr_selector),
+	FIELD64(IO_BITMAP_A, io_bitmap_a),
+	FIELD64(IO_BITMAP_B, io_bitmap_b),
+	FIELD64(MSR_BITMAP, msr_bitmap),
+	FIELD64(VM_EXIT_MSR_STORE_ADDR, vm_exit_msr_store_addr),
+	FIELD64(VM_EXIT_MSR_LOAD_ADDR, vm_exit_msr_load_addr),
+	FIELD64(VM_ENTRY_MSR_LOAD_ADDR, vm_entry_msr_load_addr),
+	FIELD64(TSC_OFFSET, tsc_offset),
+	FIELD64(VIRTUAL_APIC_PAGE_ADDR, virtual_apic_page_addr),
+	FIELD64(APIC_ACCESS_ADDR, apic_access_addr),
+	FIELD64(EPT_POINTER, ept_pointer),
+	FIELD64(GUEST_PHYSICAL_ADDRESS, guest_physical_address),
+	FIELD64(VMCS_LINK_POINTER, vmcs_link_pointer),
+	FIELD64(GUEST_IA32_DEBUGCTL, guest_ia32_debugctl),
+	FIELD64(GUEST_IA32_PAT, guest_ia32_pat),
+	FIELD64(GUEST_PDPTR0, guest_pdptr0),
+	FIELD64(GUEST_PDPTR1, guest_pdptr1),
+	FIELD64(GUEST_PDPTR2, guest_pdptr2),
+	FIELD64(GUEST_PDPTR3, guest_pdptr3),
+	FIELD64(HOST_IA32_PAT, host_ia32_pat),
+	FIELD(PIN_BASED_VM_EXEC_CONTROL, pin_based_vm_exec_control),
+	FIELD(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control),
+	FIELD(EXCEPTION_BITMAP, exception_bitmap),
+	FIELD(PAGE_FAULT_ERROR_CODE_MASK, page_fault_error_code_mask),
+	FIELD(PAGE_FAULT_ERROR_CODE_MATCH, page_fault_error_code_match),
+	FIELD(CR3_TARGET_COUNT, cr3_target_count),
+	FIELD(VM_EXIT_CONTROLS, vm_exit_controls),
+	FIELD(VM_EXIT_MSR_STORE_COUNT, vm_exit_msr_store_count),
+	FIELD(VM_EXIT_MSR_LOAD_COUNT, vm_exit_msr_load_count),
+	FIELD(VM_ENTRY_CONTROLS, vm_entry_controls),
+	FIELD(VM_ENTRY_MSR_LOAD_COUNT, vm_entry_msr_load_count),
+	FIELD(VM_ENTRY_INTR_INFO_FIELD, vm_entry_intr_info_field),
+	FIELD(VM_ENTRY_EXCEPTION_ERROR_CODE, vm_entry_exception_error_code),
+	FIELD(VM_ENTRY_INSTRUCTION_LEN, vm_entry_instruction_len),
+	FIELD(TPR_THRESHOLD, tpr_threshold),
+	FIELD(SECONDARY_VM_EXEC_CONTROL, secondary_vm_exec_control),
+	FIELD(VM_INSTRUCTION_ERROR, vm_instruction_error),
+	FIELD(VM_EXIT_REASON, vm_exit_reason),
+	FIELD(VM_EXIT_INTR_INFO, vm_exit_intr_info),
+	FIELD(VM_EXIT_INTR_ERROR_CODE, vm_exit_intr_error_code),
+	FIELD(IDT_VECTORING_INFO_FIELD, idt_vectoring_info_field),
+	FIELD(IDT_VECTORING_ERROR_CODE, idt_vectoring_error_code),
+	FIELD(VM_EXIT_INSTRUCTION_LEN, vm_exit_instruction_len),
+	FIELD(VMX_INSTRUCTION_INFO, vmx_instruction_info),
+	FIELD(GUEST_ES_LIMIT, guest_es_limit),
+	FIELD(GUEST_CS_LIMIT, guest_cs_limit),
+	FIELD(GUEST_SS_LIMIT, guest_ss_limit),
+	FIELD(GUEST_DS_LIMIT, guest_ds_limit),
+	FIELD(GUEST_FS_LIMIT, guest_fs_limit),
+	FIELD(GUEST_GS_LIMIT, guest_gs_limit),
+	FIELD(GUEST_LDTR_LIMIT, guest_ldtr_limit),
+	FIELD(GUEST_TR_LIMIT, guest_tr_limit),
+	FIELD(GUEST_GDTR_LIMIT, guest_gdtr_limit),
+	FIELD(GUEST_IDTR_LIMIT, guest_idtr_limit),
+	FIELD(GUEST_ES_AR_BYTES, guest_es_ar_bytes),
+	FIELD(GUEST_CS_AR_BYTES, guest_cs_ar_bytes),
+	FIELD(GUEST_SS_AR_BYTES, guest_ss_ar_bytes),
+	FIELD(GUEST_DS_AR_BYTES, guest_ds_ar_bytes),
+	FIELD(GUEST_FS_AR_BYTES, guest_fs_ar_bytes),
+	FIELD(GUEST_GS_AR_BYTES, guest_gs_ar_bytes),
+	FIELD(GUEST_LDTR_AR_BYTES, guest_ldtr_ar_bytes),
+	FIELD(GUEST_TR_AR_BYTES, guest_tr_ar_bytes),
+	FIELD(GUEST_INTERRUPTIBILITY_INFO, guest_interruptibility_info),
+	FIELD(GUEST_ACTIVITY_STATE, guest_activity_state),
+	FIELD(GUEST_SYSENTER_CS, guest_sysenter_cs),
+	FIELD(HOST_IA32_SYSENTER_CS, host_ia32_sysenter_cs),
+	FIELD(CR0_GUEST_HOST_MASK, cr0_guest_host_mask),
+	FIELD(CR4_GUEST_HOST_MASK, cr4_guest_host_mask),
+	FIELD(CR0_READ_SHADOW, cr0_read_shadow),
+	FIELD(CR4_READ_SHADOW, cr4_read_shadow),
+	FIELD(CR3_TARGET_VALUE0, cr3_target_value0),
+	FIELD(CR3_TARGET_VALUE1, cr3_target_value1),
+	FIELD(CR3_TARGET_VALUE2, cr3_target_value2),
+	FIELD(CR3_TARGET_VALUE3, cr3_target_value3),
+	FIELD(EXIT_QUALIFICATION, exit_qualification),
+	FIELD(GUEST_LINEAR_ADDRESS, guest_linear_address),
+	FIELD(GUEST_CR0, guest_cr0),
+	FIELD(GUEST_CR3, guest_cr3),
+	FIELD(GUEST_CR4, guest_cr4),
+	FIELD(GUEST_ES_BASE, guest_es_base),
+	FIELD(GUEST_CS_BASE, guest_cs_base),
+	FIELD(GUEST_SS_BASE, guest_ss_base),
+	FIELD(GUEST_DS_BASE, guest_ds_base),
+	FIELD(GUEST_FS_BASE, guest_fs_base),
+	FIELD(GUEST_GS_BASE, guest_gs_base),
+	FIELD(GUEST_LDTR_BASE, guest_ldtr_base),
+	FIELD(GUEST_TR_BASE, guest_tr_base),
+	FIELD(GUEST_GDTR_BASE, guest_gdtr_base),
+	FIELD(GUEST_IDTR_BASE, guest_idtr_base),
+	FIELD(GUEST_DR7, guest_dr7),
+	FIELD(GUEST_RSP, guest_rsp),
+	FIELD(GUEST_RIP, guest_rip),
+	FIELD(GUEST_RFLAGS, guest_rflags),
+	FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
+	FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
+	FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
+	FIELD(HOST_CR0, host_cr0),
+	FIELD(HOST_CR3, host_cr3),
+	FIELD(HOST_CR4, host_cr4),
+	FIELD(HOST_FS_BASE, host_fs_base),
+	FIELD(HOST_GS_BASE, host_gs_base),
+	FIELD(HOST_TR_BASE, host_tr_base),
+	FIELD(HOST_GDTR_BASE, host_gdtr_base),
+	FIELD(HOST_IDTR_BASE, host_idtr_base),
+	FIELD(HOST_IA32_SYSENTER_ESP, host_ia32_sysenter_esp),
+	FIELD(HOST_IA32_SYSENTER_EIP, host_ia32_sysenter_eip),
+	FIELD(HOST_RSP, host_rsp),
+	FIELD(HOST_RIP, host_rip),
+};
+static const int max_vmcs_field = ARRAY_SIZE(vmcs_field_to_offset_table);
+
+static inline short vmcs_field_to_offset(unsigned long field)
+{
+	if (field >= max_vmcs_field || vmcs_field_to_offset_table[field] == 0)
+		return -1;
+	return vmcs_field_to_offset_table[field];
+}
+
 static inline struct vmcs12 *get_vmcs12(struct kvm_vcpu *vcpu)
 {
 	return to_vmx(vcpu)->nested.current_vmcs12;


* [PATCH 10/30] nVMX: Success/failure of VMX instructions.
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
                   ` (8 preceding siblings ...)
  2011-05-08  8:19 ` [PATCH 09/30] nVMX: Add VMCS fields to the vmcs12 Nadav Har'El
@ 2011-05-08  8:20 ` Nadav Har'El
  2011-05-08  8:20 ` [PATCH 11/30] nVMX: Implement VMCLEAR Nadav Har'El
                   ` (20 subsequent siblings)
  30 siblings, 0 replies; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:20 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

VMX instructions report success or failure by setting certain RFLAGS bits.
This patch adds common helper functions that set these flags; they will be
used by the following patches, which emulate the various VMX instructions.
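
For reference, the convention (from the SDM) is: VMsucceed clears CF, PF, AF,
ZF, SF and OF; VMfailInvalid sets CF (there is no current VMCS to hold an
error number); VMfailValid sets ZF and stores an error number in the
VM-instruction error field. A rough sketch of how a later handler is expected
to use these helpers (the handler name and the condition are made up for
illustration):

	/* Illustrative sketch only, not part of this patch */
	static int handle_some_vmx_insn(struct kvm_vcpu *vcpu)
	{
		if (!operand_is_valid)		/* placeholder check */
			nested_vmx_failValid(vcpu,
				VMXERR_VMCLEAR_INVALID_ADDRESS);
		else
			nested_vmx_succeed(vcpu);
		skip_emulated_instruction(vcpu);
		return 1;
	}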

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/include/asm/vmx.h |   31 +++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx.c         |   30 ++++++++++++++++++++++++++++++
 2 files changed, 61 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:19.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:19.000000000 +0300
@@ -4722,6 +4722,36 @@ static int get_vmx_mem_address(struct kv
 }
 
 /*
+ * The following 3 functions, nested_vmx_succeed()/failValid()/failInvalid(),
+ * set the success or error code of an emulated VMX instruction, as specified
+ * by Vol 2B, VMX Instruction Reference, "Conventions".
+ */
+static void nested_vmx_succeed(struct kvm_vcpu *vcpu)
+{
+	vmx_set_rflags(vcpu, vmx_get_rflags(vcpu)
+			& ~(X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF |
+			    X86_EFLAGS_ZF | X86_EFLAGS_SF | X86_EFLAGS_OF));
+}
+
+static void nested_vmx_failInvalid(struct kvm_vcpu *vcpu)
+{
+	vmx_set_rflags(vcpu, (vmx_get_rflags(vcpu)
+			& ~(X86_EFLAGS_PF | X86_EFLAGS_AF | X86_EFLAGS_ZF |
+			    X86_EFLAGS_SF | X86_EFLAGS_OF))
+			| X86_EFLAGS_CF);
+}
+
+static void nested_vmx_failValid(struct kvm_vcpu *vcpu,
+					u32 vm_instruction_error)
+{
+	vmx_set_rflags(vcpu, (vmx_get_rflags(vcpu)
+			& ~(X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF |
+			    X86_EFLAGS_SF | X86_EFLAGS_OF))
+			| X86_EFLAGS_ZF);
+	get_vmcs12(vcpu)->vm_instruction_error = vm_instruction_error;
+}
+
+/*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
  * to be done to userspace and return 0.
--- .before/arch/x86/include/asm/vmx.h	2011-05-08 10:43:19.000000000 +0300
+++ .after/arch/x86/include/asm/vmx.h	2011-05-08 10:43:19.000000000 +0300
@@ -426,4 +426,35 @@ struct vmx_msr_entry {
 	u64 value;
 } __aligned(16);
 
+/*
+ * VM-instruction error numbers
+ */
+enum vm_instruction_error_number {
+	VMXERR_VMCALL_IN_VMX_ROOT_OPERATION = 1,
+	VMXERR_VMCLEAR_INVALID_ADDRESS = 2,
+	VMXERR_VMCLEAR_VMXON_POINTER = 3,
+	VMXERR_VMLAUNCH_NONCLEAR_VMCS = 4,
+	VMXERR_VMRESUME_NONLAUNCHED_VMCS = 5,
+	VMXERR_VMRESUME_AFTER_VMXOFF = 6,
+	VMXERR_ENTRY_INVALID_CONTROL_FIELD = 7,
+	VMXERR_ENTRY_INVALID_HOST_STATE_FIELD = 8,
+	VMXERR_VMPTRLD_INVALID_ADDRESS = 9,
+	VMXERR_VMPTRLD_VMXON_POINTER = 10,
+	VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID = 11,
+	VMXERR_UNSUPPORTED_VMCS_COMPONENT = 12,
+	VMXERR_VMWRITE_READ_ONLY_VMCS_COMPONENT = 13,
+	VMXERR_VMXON_IN_VMX_ROOT_OPERATION = 15,
+	VMXERR_ENTRY_INVALID_EXECUTIVE_VMCS_POINTER = 16,
+	VMXERR_ENTRY_NONLAUNCHED_EXECUTIVE_VMCS = 17,
+	VMXERR_ENTRY_EXECUTIVE_VMCS_POINTER_NOT_VMXON_POINTER = 18,
+	VMXERR_VMCALL_NONCLEAR_VMCS = 19,
+	VMXERR_VMCALL_INVALID_VM_EXIT_CONTROL_FIELDS = 20,
+	VMXERR_VMCALL_INCORRECT_MSEG_REVISION_ID = 22,
+	VMXERR_VMXOFF_UNDER_DUAL_MONITOR_TREATMENT_OF_SMIS_AND_SMM = 23,
+	VMXERR_VMCALL_INVALID_SMM_MONITOR_FEATURES = 24,
+	VMXERR_ENTRY_INVALID_VM_EXECUTION_CONTROL_FIELDS_IN_EXECUTIVE_VMCS = 25,
+	VMXERR_ENTRY_EVENTS_BLOCKED_BY_MOV_SS = 26,
+	VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID = 28,
+};
+
 #endif


* [PATCH 11/30] nVMX: Implement VMCLEAR
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
                   ` (9 preceding siblings ...)
  2011-05-08  8:20 ` [PATCH 10/30] nVMX: Success/failure of VMX instructions Nadav Har'El
@ 2011-05-08  8:20 ` Nadav Har'El
  2011-05-08  8:21 ` [PATCH 12/30] nVMX: Implement VMPTRLD Nadav Har'El
                   ` (19 subsequent siblings)
  30 siblings, 0 replies; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:20 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch implements the VMCLEAR instruction.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   65 ++++++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c |    1 
 2 files changed, 65 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/x86.c	2011-05-08 10:43:19.000000000 +0300
+++ .after/arch/x86/kvm/x86.c	2011-05-08 10:43:19.000000000 +0300
@@ -347,6 +347,7 @@ void kvm_inject_page_fault(struct kvm_vc
 	vcpu->arch.cr2 = fault->address;
 	kvm_queue_exception_e(vcpu, PF_VECTOR, fault->error_code);
 }
+EXPORT_SYMBOL_GPL(kvm_inject_page_fault);
 
 void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
 {
--- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:19.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:19.000000000 +0300
@@ -152,6 +152,9 @@ struct __packed vmcs12 {
 	u32 revision_id;
 	u32 abort;
 
+	u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
+	u32 padding[7]; /* room for future expansion */
+
 	u64 io_bitmap_a;
 	u64 io_bitmap_b;
 	u64 msr_bitmap;
@@ -4751,6 +4754,66 @@ static void nested_vmx_failValid(struct 
 	get_vmcs12(vcpu)->vm_instruction_error = vm_instruction_error;
 }
 
+/* Emulate the VMCLEAR instruction */
+static int handle_vmclear(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	gva_t gva;
+	gpa_t vmcs12_addr;
+	struct vmcs12 *vmcs12;
+	struct page *page;
+	struct x86_exception e;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+			vmcs_read32(VMX_INSTRUCTION_INFO), &gva))
+		return 1;
+
+	if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &vmcs12_addr,
+				sizeof(vmcs12_addr), &e)) {
+		kvm_inject_page_fault(vcpu, &e);
+		return 1;
+	}
+
+	if (!IS_ALIGNED(vmcs12_addr, PAGE_SIZE)) {
+		nested_vmx_failValid(vcpu, VMXERR_VMCLEAR_INVALID_ADDRESS);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	if (vmcs12_addr == vmx->nested.current_vmptr) {
+		kunmap(vmx->nested.current_vmcs12_page);
+		nested_release_page(vmx->nested.current_vmcs12_page);
+		vmx->nested.current_vmptr = -1ull;
+		vmx->nested.current_vmcs12 = NULL;
+	}
+
+	page = nested_get_page(vcpu, vmcs12_addr);
+	if (page == NULL) {
+		/*
+		 * For accurate processor emulation, VMCLEAR beyond available
+		 * physical memory should do nothing at all. However, it is
+		 * possible that a nested vmx bug, not a guest hypervisor bug,
+		 * resulted in this case, so let's shut down before doing any
+		 * more damage:
+		 */
+		kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+		return 1;
+	}
+	vmcs12 = kmap(page);
+	vmcs12->launch_state = 0;
+	kunmap(page);
+	nested_release_page(page);
+
+	nested_free_vmcs02(vmx, vmcs12_addr);
+
+	skip_emulated_instruction(vcpu);
+	nested_vmx_succeed(vcpu);
+	return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -4772,7 +4835,7 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_INVD]		      = handle_invd,
 	[EXIT_REASON_INVLPG]		      = handle_invlpg,
 	[EXIT_REASON_VMCALL]                  = handle_vmcall,
-	[EXIT_REASON_VMCLEAR]	              = handle_vmx_insn,
+	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
 	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
 	[EXIT_REASON_VMPTRLD]                 = handle_vmx_insn,
 	[EXIT_REASON_VMPTRST]                 = handle_vmx_insn,


* [PATCH 12/30] nVMX: Implement VMPTRLD
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
                   ` (10 preceding siblings ...)
  2011-05-08  8:20 ` [PATCH 11/30] nVMX: Implement VMCLEAR Nadav Har'El
@ 2011-05-08  8:21 ` Nadav Har'El
  2011-05-16 14:34   ` Marcelo Tosatti
  2011-05-08  8:21 ` [PATCH 13/30] nVMX: Implement VMPTRST Nadav Har'El
                   ` (18 subsequent siblings)
  30 siblings, 1 reply; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:21 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch implements the VMPTRLD instruction.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   62 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 61 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:19.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:19.000000000 +0300
@@ -4814,6 +4814,66 @@ static int handle_vmclear(struct kvm_vcp
 	return 1;
 }
 
+/* Emulate the VMPTRLD instruction */
+static int handle_vmptrld(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	gva_t gva;
+	gpa_t vmcs12_addr;
+	struct x86_exception e;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+			vmcs_read32(VMX_INSTRUCTION_INFO), &gva))
+		return 1;
+
+	if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &vmcs12_addr,
+				sizeof(vmcs12_addr), &e)) {
+		kvm_inject_page_fault(vcpu, &e);
+		return 1;
+	}
+
+	if (!IS_ALIGNED(vmcs12_addr, PAGE_SIZE)) {
+		nested_vmx_failValid(vcpu, VMXERR_VMPTRLD_INVALID_ADDRESS);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	if (vmx->nested.current_vmptr != vmcs12_addr) {
+		struct vmcs12 *new_vmcs12;
+		struct page *page;
+		page = nested_get_page(vcpu, vmcs12_addr);
+		if (page == NULL) {
+			nested_vmx_failInvalid(vcpu);
+			skip_emulated_instruction(vcpu);
+			return 1;
+		}
+		new_vmcs12 = kmap(page);
+		if (new_vmcs12->revision_id != VMCS12_REVISION) {
+			kunmap(page);
+			nested_release_page_clean(page);
+			nested_vmx_failValid(vcpu,
+				VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID);
+			skip_emulated_instruction(vcpu);
+			return 1;
+		}
+		if (vmx->nested.current_vmptr != -1ull) {
+			kunmap(vmx->nested.current_vmcs12_page);
+			nested_release_page(vmx->nested.current_vmcs12_page);
+		}
+
+		vmx->nested.current_vmptr = vmcs12_addr;
+		vmx->nested.current_vmcs12 = new_vmcs12;
+		vmx->nested.current_vmcs12_page = page;
+	}
+
+	nested_vmx_succeed(vcpu);
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -4837,7 +4897,7 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_VMCALL]                  = handle_vmcall,
 	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
 	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
-	[EXIT_REASON_VMPTRLD]                 = handle_vmx_insn,
+	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
 	[EXIT_REASON_VMPTRST]                 = handle_vmx_insn,
 	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
 	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,


* [PATCH 13/30] nVMX: Implement VMPTRST
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
                   ` (11 preceding siblings ...)
  2011-05-08  8:21 ` [PATCH 12/30] nVMX: Implement VMPTRLD Nadav Har'El
@ 2011-05-08  8:21 ` Nadav Har'El
  2011-05-08  8:22 ` [PATCH 14/30] nVMX: Implement VMREAD and VMWRITE Nadav Har'El
                   ` (17 subsequent siblings)
  30 siblings, 0 replies; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:21 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch implements the VMPTRST instruction. 

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   28 +++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c |    3 ++-
 arch/x86/kvm/x86.h |    4 ++++
 3 files changed, 33 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/x86.c	2011-05-08 10:43:19.000000000 +0300
+++ .after/arch/x86/kvm/x86.c	2011-05-08 10:43:19.000000000 +0300
@@ -3836,7 +3836,7 @@ static int kvm_read_guest_virt_system(st
 	return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, 0, exception);
 }
 
-static int kvm_write_guest_virt_system(struct x86_emulate_ctxt *ctxt,
+int kvm_write_guest_virt_system(struct x86_emulate_ctxt *ctxt,
 				       gva_t addr, void *val,
 				       unsigned int bytes,
 				       struct x86_exception *exception)
@@ -3868,6 +3868,7 @@ static int kvm_write_guest_virt_system(s
 out:
 	return r;
 }
+EXPORT_SYMBOL_GPL(kvm_write_guest_virt_system);
 
 static int emulator_read_emulated(struct x86_emulate_ctxt *ctxt,
 				  unsigned long addr,
--- .before/arch/x86/kvm/x86.h	2011-05-08 10:43:19.000000000 +0300
+++ .after/arch/x86/kvm/x86.h	2011-05-08 10:43:19.000000000 +0300
@@ -85,4 +85,8 @@ int kvm_read_guest_virt(struct x86_emula
 	gva_t addr, void *val, unsigned int bytes,
 	struct x86_exception *exception);
 
+int kvm_write_guest_virt_system(struct x86_emulate_ctxt *ctxt,
+	gva_t addr, void *val, unsigned int bytes,
+	struct x86_exception *exception);
+
 #endif
--- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:19.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:19.000000000 +0300
@@ -4874,6 +4874,32 @@ static int handle_vmptrld(struct kvm_vcp
 	return 1;
 }
 
+/* Emulate the VMPTRST instruction */
+static int handle_vmptrst(struct kvm_vcpu *vcpu)
+{
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	gva_t vmcs_gva;
+	struct x86_exception e;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (get_vmx_mem_address(vcpu, exit_qualification,
+			vmx_instruction_info, &vmcs_gva))
+		return 1;
+	/* ok to use *_system, as nested_vmx_check_permission verified cpl=0 */
+	if (kvm_write_guest_virt_system(&vcpu->arch.emulate_ctxt, vmcs_gva,
+				 (void *)&to_vmx(vcpu)->nested.current_vmptr,
+				 sizeof(u64), &e)) {
+		kvm_inject_page_fault(vcpu, &e);
+		return 1;
+	}
+	nested_vmx_succeed(vcpu);
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -4898,7 +4924,7 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
 	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
 	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
-	[EXIT_REASON_VMPTRST]                 = handle_vmx_insn,
+	[EXIT_REASON_VMPTRST]                 = handle_vmptrst,
 	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
 	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
 	[EXIT_REASON_VMWRITE]                 = handle_vmx_insn,


* [PATCH 14/30] nVMX: Implement VMREAD and VMWRITE
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
                   ` (12 preceding siblings ...)
  2011-05-08  8:21 ` [PATCH 13/30] nVMX: Implement VMPTRST Nadav Har'El
@ 2011-05-08  8:22 ` Nadav Har'El
  2011-05-08  8:22 ` [PATCH 15/30] nVMX: Move host-state field setup to a function Nadav Har'El
                   ` (16 subsequent siblings)
  30 siblings, 0 replies; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:22 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Implement the VMREAD and VMWRITE instructions. With these instructions, L1
can read from and write to the VMCS it is holding; the values are read from
or written to the fields of the vmcs12 structure introduced in a previous patch.
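
The field encoding itself carries the field's width and whether it is
read-only, which the new vmcs_field_type() and vmcs_field_readonly() helpers
below decode. A standalone illustration (plain userspace C, not part of the
patch; the field number is from the SDM):

	#include <stdio.h>

	int main(void)
	{
		unsigned long field = 0x4402;	/* VM_EXIT_REASON */

		/*
		 * bits 14:13 = width (0=u16, 1=u64, 2=u32, 3=natural width),
		 * bits 11:10 = type (1 means a read-only data field),
		 * bit 0      = the *_HIGH half of a 64-bit field.
		 */
		printf("width=%lu type=%lu high=%lu\n",
		       (field >> 13) & 0x3, (field >> 10) & 0x3, field & 0x1);
		return 0;	/* prints: width=2 type=1 high=0 */
	}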

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  176 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 174 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:19.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:19.000000000 +0300
@@ -4814,6 +4814,178 @@ static int handle_vmclear(struct kvm_vcp
 	return 1;
 }
 
+enum vmcs_field_type {
+	VMCS_FIELD_TYPE_U16 = 0,
+	VMCS_FIELD_TYPE_U64 = 1,
+	VMCS_FIELD_TYPE_U32 = 2,
+	VMCS_FIELD_TYPE_NATURAL_WIDTH = 3
+};
+
+static inline int vmcs_field_type(unsigned long field)
+{
+	if (0x1 & field)	/* the *_HIGH fields are all 32 bit */
+		return VMCS_FIELD_TYPE_U32;
+	return (field >> 13) & 0x3 ;
+}
+
+static inline int vmcs_field_readonly(unsigned long field)
+{
+	return (((field >> 10) & 0x3) == 1);
+}
+
+/*
+ * Read a vmcs12 field. Since these can have varying lengths and we return
+ * one type, we choose the biggest type (u64) and zero-extend the return value
+ * to that size. Note that the caller, handle_vmread, might need to use only
+ * some of the bits we return here (e.g., on 32-bit guests, only 32 bits of
+ * 64-bit fields are to be returned).
+ */
+static inline bool vmcs12_read_any(struct kvm_vcpu *vcpu,
+					unsigned long field, u64 *ret)
+{
+	short offset = vmcs_field_to_offset(field);
+	char *p;
+
+	if (offset < 0)
+		return 0;
+
+	p = ((char *)(get_vmcs12(vcpu))) + offset;
+
+	switch (vmcs_field_type(field)) {
+	case VMCS_FIELD_TYPE_NATURAL_WIDTH:
+		*ret = *((natural_width *)p);
+		return 1;
+	case VMCS_FIELD_TYPE_U16:
+		*ret = *((u16 *)p);
+		return 1;
+	case VMCS_FIELD_TYPE_U32:
+		*ret = *((u32 *)p);
+		return 1;
+	case VMCS_FIELD_TYPE_U64:
+		*ret = *((u64 *)p);
+		return 1;
+	default:
+		return 0; /* can never happen. */
+	}
+}
+
+static int handle_vmread(struct kvm_vcpu *vcpu)
+{
+	unsigned long field;
+	u64 field_value;
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	gva_t gva = 0;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	/* Decode instruction info and find the field to read */
+	field = kvm_register_read(vcpu, (((vmx_instruction_info) >> 28) & 0xf));
+	/* Read the field, zero-extended to a u64 field_value */
+	if (!vmcs12_read_any(vcpu, field, &field_value)) {
+		nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+	/*
+	 * Now copy part of this value to register or memory, as requested.
+	 * Note that the number of bits actually copied is 32 or 64 depending
+	 * on the guest's mode (32 or 64 bit), not on the given field's length.
+	 */
+	if (vmx_instruction_info & (1u << 10)) {
+		kvm_register_write(vcpu, (((vmx_instruction_info) >> 3) & 0xf),
+			field_value);
+	} else {
+		if (get_vmx_mem_address(vcpu, exit_qualification,
+				vmx_instruction_info, &gva))
+			return 1;
+		/* _system ok, as nested_vmx_check_permission verified cpl=0 */
+		kvm_write_guest_virt_system(&vcpu->arch.emulate_ctxt, gva,
+			     &field_value, (is_long_mode(vcpu) ? 8 : 4), NULL);
+	}
+
+	nested_vmx_succeed(vcpu);
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
+
+static int handle_vmwrite(struct kvm_vcpu *vcpu)
+{
+	unsigned long field;
+	gva_t gva;
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	char *p;
+	short offset;
+	/* The value to write might be 32 or 64 bits, depending on L1's long
+	 * mode, and eventually we need to write that into a field of several
+	 * possible lengths. The code below first zero-extends the value to 64
+	 * bit (field_value), and then copies only the appropriate number of
+	 * bits into the vmcs12 field.
+	 */
+	u64 field_value = 0;
+	struct x86_exception e;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (vmx_instruction_info & (1u << 10))
+		field_value = kvm_register_read(vcpu,
+			(((vmx_instruction_info) >> 3) & 0xf));
+	else {
+		if (get_vmx_mem_address(vcpu, exit_qualification,
+				vmx_instruction_info, &gva))
+			return 1;
+		if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva,
+			   &field_value, (is_long_mode(vcpu) ? 8 : 4), &e)) {
+			kvm_inject_page_fault(vcpu, &e);
+			return 1;
+		}
+	}
+
+
+	field = kvm_register_read(vcpu, (((vmx_instruction_info) >> 28) & 0xf));
+	if (vmcs_field_readonly(field)) {
+		nested_vmx_failValid(vcpu,
+			VMXERR_VMWRITE_READ_ONLY_VMCS_COMPONENT);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	offset = vmcs_field_to_offset(field);
+	if (offset < 0) {
+		nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+	p = ((char *) get_vmcs12(vcpu)) + offset;
+
+	switch (vmcs_field_type(field)) {
+	case VMCS_FIELD_TYPE_U16:
+		*(u16 *)p = field_value;
+		break;
+	case VMCS_FIELD_TYPE_U32:
+		*(u32 *)p = field_value;
+		break;
+	case VMCS_FIELD_TYPE_U64:
+		*(u64 *)p = field_value;
+		break;
+	case VMCS_FIELD_TYPE_NATURAL_WIDTH:
+		*(natural_width *)p = field_value;
+		break;
+	default:
+		nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	nested_vmx_succeed(vcpu);
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
 /* Emulate the VMPTRLD instruction */
 static int handle_vmptrld(struct kvm_vcpu *vcpu)
 {
@@ -4925,9 +5097,9 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
 	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
 	[EXIT_REASON_VMPTRST]                 = handle_vmptrst,
-	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
+	[EXIT_REASON_VMREAD]                  = handle_vmread,
 	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
-	[EXIT_REASON_VMWRITE]                 = handle_vmx_insn,
+	[EXIT_REASON_VMWRITE]                 = handle_vmwrite,
 	[EXIT_REASON_VMOFF]                   = handle_vmoff,
 	[EXIT_REASON_VMON]                    = handle_vmon,
 	[EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,


* [PATCH 15/30] nVMX: Move host-state field setup to a function
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
                   ` (13 preceding siblings ...)
  2011-05-08  8:22 ` [PATCH 14/30] nVMX: Implement VMREAD and VMWRITE Nadav Har'El
@ 2011-05-08  8:22 ` Nadav Har'El
  2011-05-09  9:56   ` Avi Kivity
  2011-05-08  8:23 ` [PATCH 16/30] nVMX: Move control field setup to functions Nadav Har'El
                   ` (15 subsequent siblings)
  30 siblings, 1 reply; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:22 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Move the setting of constant host-state fields (fields that do not change
throughout the life of the guest) from vmx_vcpu_setup to a new common function
vmx_set_constant_host_state(). This function will also be used to set the
host state when running L2 guests.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   72 ++++++++++++++++++++++++-------------------
 1 file changed, 41 insertions(+), 31 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:19.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:19.000000000 +0300
@@ -3323,17 +3323,53 @@ static void vmx_disable_intercept_for_ms
 }
 
 /*
+ * Set up the vmcs's constant host-state fields, i.e., host-state fields that
+ * will not change in the lifetime of the guest.
+ * Note that host-state that does change is set elsewhere. E.g., host-state
+ * that is set differently for each CPU is set in vmx_vcpu_load(), not here.
+ */
+static void vmx_set_constant_host_state(void)
+{
+	u32 low32, high32;
+	unsigned long tmpl;
+	struct desc_ptr dt;
+
+	vmcs_writel(HOST_CR0, read_cr0() | X86_CR0_TS);  /* 22.2.3 */
+	vmcs_writel(HOST_CR4, read_cr4());  /* 22.2.3, 22.2.5 */
+	vmcs_writel(HOST_CR3, read_cr3());  /* 22.2.3  FIXME: shadow tables */
+
+	vmcs_write16(HOST_CS_SELECTOR, __KERNEL_CS);  /* 22.2.4 */
+	vmcs_write16(HOST_DS_SELECTOR, __KERNEL_DS);  /* 22.2.4 */
+	vmcs_write16(HOST_ES_SELECTOR, __KERNEL_DS);  /* 22.2.4 */
+	vmcs_write16(HOST_SS_SELECTOR, __KERNEL_DS);  /* 22.2.4 */
+	vmcs_write16(HOST_TR_SELECTOR, GDT_ENTRY_TSS*8);  /* 22.2.4 */
+
+	native_store_idt(&dt);
+	vmcs_writel(HOST_IDTR_BASE, dt.address);   /* 22.2.4 */
+
+	asm("mov $.Lkvm_vmx_return, %0" : "=r"(tmpl));
+	vmcs_writel(HOST_RIP, tmpl); /* 22.2.5 */
+
+	rdmsr(MSR_IA32_SYSENTER_CS, low32, high32);
+	vmcs_write32(HOST_IA32_SYSENTER_CS, low32);
+	rdmsrl(MSR_IA32_SYSENTER_EIP, tmpl);
+	vmcs_writel(HOST_IA32_SYSENTER_EIP, tmpl);   /* 22.2.3 */
+
+	if (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PAT) {
+		rdmsr(MSR_IA32_CR_PAT, low32, high32);
+		vmcs_write64(HOST_IA32_PAT, low32 | ((u64) high32 << 32));
+	}
+}
+
+/*
  * Sets up the vmcs for emulated real mode.
  */
 static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 {
-	u32 host_sysenter_cs, msr_low, msr_high;
-	u32 junk;
+	u32 msr_low, msr_high;
 	u64 host_pat;
 	unsigned long a;
-	struct desc_ptr dt;
 	int i;
-	unsigned long kvm_vmx_return;
 	u32 exec_control;
 
 	/* I/O */
@@ -3390,16 +3426,9 @@ static int vmx_vcpu_setup(struct vcpu_vm
 	vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, !!bypass_guest_pf);
 	vmcs_write32(CR3_TARGET_COUNT, 0);           /* 22.2.1 */
 
-	vmcs_writel(HOST_CR0, read_cr0() | X86_CR0_TS);  /* 22.2.3 */
-	vmcs_writel(HOST_CR4, read_cr4());  /* 22.2.3, 22.2.5 */
-	vmcs_writel(HOST_CR3, read_cr3());  /* 22.2.3  FIXME: shadow tables */
-
-	vmcs_write16(HOST_CS_SELECTOR, __KERNEL_CS);  /* 22.2.4 */
-	vmcs_write16(HOST_DS_SELECTOR, __KERNEL_DS);  /* 22.2.4 */
-	vmcs_write16(HOST_ES_SELECTOR, __KERNEL_DS);  /* 22.2.4 */
 	vmcs_write16(HOST_FS_SELECTOR, 0);            /* 22.2.4 */
 	vmcs_write16(HOST_GS_SELECTOR, 0);            /* 22.2.4 */
-	vmcs_write16(HOST_SS_SELECTOR, __KERNEL_DS);  /* 22.2.4 */
+	vmx_set_constant_host_state();
 #ifdef CONFIG_X86_64
 	rdmsrl(MSR_FS_BASE, a);
 	vmcs_writel(HOST_FS_BASE, a); /* 22.2.4 */
@@ -3410,31 +3439,12 @@ static int vmx_vcpu_setup(struct vcpu_vm
 	vmcs_writel(HOST_GS_BASE, 0); /* 22.2.4 */
 #endif
 
-	vmcs_write16(HOST_TR_SELECTOR, GDT_ENTRY_TSS*8);  /* 22.2.4 */
-
-	native_store_idt(&dt);
-	vmcs_writel(HOST_IDTR_BASE, dt.address);   /* 22.2.4 */
-
-	asm("mov $.Lkvm_vmx_return, %0" : "=r"(kvm_vmx_return));
-	vmcs_writel(HOST_RIP, kvm_vmx_return); /* 22.2.5 */
 	vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
 	vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0);
 	vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host));
 	vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, 0);
 	vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest));
 
-	rdmsr(MSR_IA32_SYSENTER_CS, host_sysenter_cs, junk);
-	vmcs_write32(HOST_IA32_SYSENTER_CS, host_sysenter_cs);
-	rdmsrl(MSR_IA32_SYSENTER_ESP, a);
-	vmcs_writel(HOST_IA32_SYSENTER_ESP, a);   /* 22.2.3 */
-	rdmsrl(MSR_IA32_SYSENTER_EIP, a);
-	vmcs_writel(HOST_IA32_SYSENTER_EIP, a);   /* 22.2.3 */
-
-	if (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PAT) {
-		rdmsr(MSR_IA32_CR_PAT, msr_low, msr_high);
-		host_pat = msr_low | ((u64) msr_high << 32);
-		vmcs_write64(HOST_IA32_PAT, host_pat);
-	}
 	if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
 		rdmsr(MSR_IA32_CR_PAT, msr_low, msr_high);
 		host_pat = msr_low | ((u64) msr_high << 32);


* [PATCH 16/30] nVMX: Move control field setup to functions
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
                   ` (14 preceding siblings ...)
  2011-05-08  8:22 ` [PATCH 15/30] nVMX: Move host-state field setup to a function Nadav Har'El
@ 2011-05-08  8:23 ` Nadav Har'El
  2011-05-08  8:23 ` [PATCH 17/30] nVMX: Prepare vmcs02 from vmcs01 and vmcs12 Nadav Har'El
                   ` (14 subsequent siblings)
  30 siblings, 0 replies; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:23 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Move some of the control field setup to common functions. These functions will
also be needed for running L2 guests - L0's desires (expressed in these
functions) will be appropriately merged with L1's desires.
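
For the CPU-based execution controls, the merge will roughly look like the
sketch below (illustrative only; the real code is prepare_vmcs02(), added
later in this series, which also masks off controls that cannot be honored
yet):

	u32 exec_control = vmx_exec_control(vmx);	/* L0's requirements */

	/* add everything L1 asked to intercept for its L2 guest */
	exec_control |= vmcs12->cpu_based_vm_exec_control;

	/* e.g. MSR/IO bitmap merging is not supported yet, so exit always */
	exec_control &= ~(CPU_BASED_USE_MSR_BITMAPS | CPU_BASED_USE_IO_BITMAPS);

	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control);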

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   80 +++++++++++++++++++++++++------------------
 1 file changed, 47 insertions(+), 33 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:19.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:19.000000000 +0300
@@ -3361,6 +3361,49 @@ static void vmx_set_constant_host_state(
 	}
 }
 
+static void set_cr4_guest_host_mask(struct vcpu_vmx *vmx)
+{
+	vmx->vcpu.arch.cr4_guest_owned_bits = KVM_CR4_GUEST_OWNED_BITS;
+	if (enable_ept)
+		vmx->vcpu.arch.cr4_guest_owned_bits |= X86_CR4_PGE;
+	vmcs_writel(CR4_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr4_guest_owned_bits);
+}
+
+static u32 vmx_exec_control(struct vcpu_vmx *vmx)
+{
+	u32 exec_control = vmcs_config.cpu_based_exec_ctrl;
+	if (!vm_need_tpr_shadow(vmx->vcpu.kvm)) {
+		exec_control &= ~CPU_BASED_TPR_SHADOW;
+#ifdef CONFIG_X86_64
+		exec_control |= CPU_BASED_CR8_STORE_EXITING |
+				CPU_BASED_CR8_LOAD_EXITING;
+#endif
+	}
+	if (!enable_ept)
+		exec_control |= CPU_BASED_CR3_STORE_EXITING |
+				CPU_BASED_CR3_LOAD_EXITING  |
+				CPU_BASED_INVLPG_EXITING;
+	return exec_control;
+}
+
+static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
+{
+	u32 exec_control = vmcs_config.cpu_based_2nd_exec_ctrl;
+	if (!vm_need_virtualize_apic_accesses(vmx->vcpu.kvm))
+		exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+	if (vmx->vpid == 0)
+		exec_control &= ~SECONDARY_EXEC_ENABLE_VPID;
+	if (!enable_ept) {
+		exec_control &= ~SECONDARY_EXEC_ENABLE_EPT;
+		enable_unrestricted_guest = 0;
+	}
+	if (!enable_unrestricted_guest)
+		exec_control &= ~SECONDARY_EXEC_UNRESTRICTED_GUEST;
+	if (!ple_gap)
+		exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
+	return exec_control;
+}
+
 /*
  * Sets up the vmcs for emulated real mode.
  */
@@ -3370,7 +3413,6 @@ static int vmx_vcpu_setup(struct vcpu_vm
 	u64 host_pat;
 	unsigned long a;
 	int i;
-	u32 exec_control;
 
 	/* I/O */
 	vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a));
@@ -3385,36 +3427,11 @@ static int vmx_vcpu_setup(struct vcpu_vm
 	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
 		vmcs_config.pin_based_exec_ctrl);
 
-	exec_control = vmcs_config.cpu_based_exec_ctrl;
-	if (!vm_need_tpr_shadow(vmx->vcpu.kvm)) {
-		exec_control &= ~CPU_BASED_TPR_SHADOW;
-#ifdef CONFIG_X86_64
-		exec_control |= CPU_BASED_CR8_STORE_EXITING |
-				CPU_BASED_CR8_LOAD_EXITING;
-#endif
-	}
-	if (!enable_ept)
-		exec_control |= CPU_BASED_CR3_STORE_EXITING |
-				CPU_BASED_CR3_LOAD_EXITING  |
-				CPU_BASED_INVLPG_EXITING;
-	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control);
+	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, vmx_exec_control(vmx));
 
 	if (cpu_has_secondary_exec_ctrls()) {
-		exec_control = vmcs_config.cpu_based_2nd_exec_ctrl;
-		if (!vm_need_virtualize_apic_accesses(vmx->vcpu.kvm))
-			exec_control &=
-				~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
-		if (vmx->vpid == 0)
-			exec_control &= ~SECONDARY_EXEC_ENABLE_VPID;
-		if (!enable_ept) {
-			exec_control &= ~SECONDARY_EXEC_ENABLE_EPT;
-			enable_unrestricted_guest = 0;
-		}
-		if (!enable_unrestricted_guest)
-			exec_control &= ~SECONDARY_EXEC_UNRESTRICTED_GUEST;
-		if (!ple_gap)
-			exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
-		vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
+		vmcs_write32(SECONDARY_VM_EXEC_CONTROL,
+				vmx_secondary_exec_control(vmx));
 	}
 
 	if (ple_gap) {
@@ -3475,10 +3492,7 @@ static int vmx_vcpu_setup(struct vcpu_vm
 	vmcs_write32(VM_ENTRY_CONTROLS, vmcs_config.vmentry_ctrl);
 
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~0UL);
-	vmx->vcpu.arch.cr4_guest_owned_bits = KVM_CR4_GUEST_OWNED_BITS;
-	if (enable_ept)
-		vmx->vcpu.arch.cr4_guest_owned_bits |= X86_CR4_PGE;
-	vmcs_writel(CR4_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr4_guest_owned_bits);
+	set_cr4_guest_host_mask(vmx);
 
 	kvm_write_tsc(&vmx->vcpu, 0);
 


* [PATCH 17/30] nVMX: Prepare vmcs02 from vmcs01 and vmcs12
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
                   ` (15 preceding siblings ...)
  2011-05-08  8:23 ` [PATCH 16/30] nVMX: Move control field setup to functions Nadav Har'El
@ 2011-05-08  8:23 ` Nadav Har'El
  2011-05-09 10:12   ` Avi Kivity
  2011-05-08  8:24 ` [PATCH 18/30] nVMX: Implement VMLAUNCH and VMRESUME Nadav Har'El
                   ` (13 subsequent siblings)
  30 siblings, 1 reply; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:23 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch adds the code that prepares vmcs02, the VMCS used to actually run
the L2 guest. prepare_vmcs02() appropriately merges the information in vmcs12
(the VMCS that L1 built for L2) with vmcs01 (our own requirements for our
guests).
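
One piece worth a worked example (the numbers are made up for illustration):
the cr0 value L2 observes is assembled from guest_cr0 and cr0_read_shadow
according to cr0_guest_host_mask, as done by guest_readable_cr0() in this
patch:

	/*
	 * Bits set in the mask are owned by L1, so L2 reads them from the
	 * read shadow; bits clear in the mask come straight from guest_cr0:
	 *
	 *	read = (guest_cr0 & ~mask) | (cr0_read_shadow & mask)
	 *
	 * E.g. with only CR0.TS (bit 3) in the mask:
	 *	guest_cr0       = 0x80000031	(PG|NE|ET|PE, TS clear)
	 *	cr0_read_shadow = 0x00000008	(TS set)
	 *	mask            = 0x00000008
	 *	=> L2 reads (0x80000031 & ~0x8) | (0x8 & 0x8) = 0x80000039
	 */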

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  272 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 272 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:20.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:20.000000000 +0300
@@ -346,6 +346,12 @@ struct nested_vmx {
 	/* vmcs02_list cache of VMCSs recently used to run L2 guests */
 	struct list_head vmcs02_pool;
 	int vmcs02_num;
+	u64 vmcs01_tsc_offset;
+	/*
+	 * Guest pages referred to in vmcs02 with host-physical pointers, so
+	 * we must keep them pinned while L2 runs.
+	 */
+	struct page *apic_access_page;
 };
 
 struct vcpu_vmx {
@@ -835,6 +841,18 @@ static inline bool report_flexpriority(v
 	return flexpriority_enabled;
 }
 
+static inline bool nested_cpu_has(struct vmcs12 *vmcs12, u32 bit)
+{
+	return vmcs12->cpu_based_vm_exec_control & bit;
+}
+
+static inline bool nested_cpu_has2(struct vmcs12 *vmcs12, u32 bit)
+{
+	return (vmcs12->cpu_based_vm_exec_control &
+			CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) &&
+		(vmcs12->secondary_vm_exec_control & bit);
+}
+
 static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
 {
 	int i;
@@ -1425,6 +1443,22 @@ static void vmx_fpu_activate(struct kvm_
 
 static void vmx_decache_cr0_guest_bits(struct kvm_vcpu *vcpu);
 
+/*
+ * Return the cr0 value that a nested guest would read. This is a combination
+ * of the real cr0 used to run the guest (guest_cr0), and the bits shadowed by
+ * its hypervisor (cr0_read_shadow).
+ */
+static inline unsigned long guest_readable_cr0(struct vmcs12 *fields)
+{
+	return (fields->guest_cr0 & ~fields->cr0_guest_host_mask) |
+		(fields->cr0_read_shadow & fields->cr0_guest_host_mask);
+}
+static inline unsigned long guest_readable_cr4(struct vmcs12 *fields)
+{
+	return (fields->guest_cr4 & ~fields->cr4_guest_host_mask) |
+		(fields->cr4_read_shadow & fields->cr4_guest_host_mask);
+}
+
 static void vmx_fpu_deactivate(struct kvm_vcpu *vcpu)
 {
 	vmx_decache_cr0_guest_bits(vcpu);
@@ -3366,6 +3400,9 @@ static void set_cr4_guest_host_mask(stru
 	vmx->vcpu.arch.cr4_guest_owned_bits = KVM_CR4_GUEST_OWNED_BITS;
 	if (enable_ept)
 		vmx->vcpu.arch.cr4_guest_owned_bits |= X86_CR4_PGE;
+	if (is_guest_mode(&vmx->vcpu))
+		vmx->vcpu.arch.cr4_guest_owned_bits &=
+			~get_vmcs12(&vmx->vcpu)->cr4_guest_host_mask;
 	vmcs_writel(CR4_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr4_guest_owned_bits);
 }
 
@@ -4681,6 +4718,11 @@ static void free_nested(struct vcpu_vmx 
 		vmx->nested.current_vmptr = -1ull;
 		vmx->nested.current_vmcs12 = NULL;
 	}
+	/* Unpin physical memory we referred to in current vmcs02 */
+	if (vmx->nested.apic_access_page) {
+		nested_release_page(vmx->nested.apic_access_page);
+		vmx->nested.apic_access_page = 0;
+	}
 
 	nested_free_all_vmcs02(vmx);
 }
@@ -5749,6 +5791,236 @@ static void vmx_set_supported_cpuid(u32 
 {
 }
 
+/*
+ * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
+ * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
+ * with L0's requirements for its guest (a.k.a. vmcs01), so we can run the L2
+ * guest in a way that is appropriate both to L1's requests and to our own
+ * needs. In addition to modifying the active vmcs (which is vmcs02), this
+ * function also has additional necessary side-effects, like setting various
+ * vcpu->arch fields.
+ */
+static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	u32 exec_control;
+
+	vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector);
+	vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector);
+	vmcs_write16(GUEST_SS_SELECTOR, vmcs12->guest_ss_selector);
+	vmcs_write16(GUEST_DS_SELECTOR, vmcs12->guest_ds_selector);
+	vmcs_write16(GUEST_FS_SELECTOR, vmcs12->guest_fs_selector);
+	vmcs_write16(GUEST_GS_SELECTOR, vmcs12->guest_gs_selector);
+	vmcs_write16(GUEST_LDTR_SELECTOR, vmcs12->guest_ldtr_selector);
+	vmcs_write16(GUEST_TR_SELECTOR, vmcs12->guest_tr_selector);
+
+	vmcs_write64(GUEST_IA32_DEBUGCTL, vmcs12->guest_ia32_debugctl);
+
+	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+		vmcs12->vm_entry_intr_info_field);
+	vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
+		vmcs12->vm_entry_exception_error_code);
+	vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
+		vmcs12->vm_entry_instruction_len);
+
+	vmcs_write32(GUEST_ES_LIMIT, vmcs12->guest_es_limit);
+	vmcs_write32(GUEST_CS_LIMIT, vmcs12->guest_cs_limit);
+	vmcs_write32(GUEST_SS_LIMIT, vmcs12->guest_ss_limit);
+	vmcs_write32(GUEST_DS_LIMIT, vmcs12->guest_ds_limit);
+	vmcs_write32(GUEST_FS_LIMIT, vmcs12->guest_fs_limit);
+	vmcs_write32(GUEST_GS_LIMIT, vmcs12->guest_gs_limit);
+	vmcs_write32(GUEST_LDTR_LIMIT, vmcs12->guest_ldtr_limit);
+	vmcs_write32(GUEST_TR_LIMIT, vmcs12->guest_tr_limit);
+	vmcs_write32(GUEST_GDTR_LIMIT, vmcs12->guest_gdtr_limit);
+	vmcs_write32(GUEST_IDTR_LIMIT, vmcs12->guest_idtr_limit);
+	vmcs_write32(GUEST_ES_AR_BYTES, vmcs12->guest_es_ar_bytes);
+	vmcs_write32(GUEST_CS_AR_BYTES, vmcs12->guest_cs_ar_bytes);
+	vmcs_write32(GUEST_SS_AR_BYTES, vmcs12->guest_ss_ar_bytes);
+	vmcs_write32(GUEST_DS_AR_BYTES, vmcs12->guest_ds_ar_bytes);
+	vmcs_write32(GUEST_FS_AR_BYTES, vmcs12->guest_fs_ar_bytes);
+	vmcs_write32(GUEST_GS_AR_BYTES, vmcs12->guest_gs_ar_bytes);
+	vmcs_write32(GUEST_LDTR_AR_BYTES, vmcs12->guest_ldtr_ar_bytes);
+	vmcs_write32(GUEST_TR_AR_BYTES, vmcs12->guest_tr_ar_bytes);
+	vmcs_write32(GUEST_INTERRUPTIBILITY_INFO,
+		vmcs12->guest_interruptibility_info);
+	vmcs_write32(GUEST_ACTIVITY_STATE, vmcs12->guest_activity_state);
+	vmcs_write32(GUEST_SYSENTER_CS, vmcs12->guest_sysenter_cs);
+
+	vmcs_writel(GUEST_ES_BASE, vmcs12->guest_es_base);
+	vmcs_writel(GUEST_CS_BASE, vmcs12->guest_cs_base);
+	vmcs_writel(GUEST_SS_BASE, vmcs12->guest_ss_base);
+	vmcs_writel(GUEST_DS_BASE, vmcs12->guest_ds_base);
+	vmcs_writel(GUEST_FS_BASE, vmcs12->guest_fs_base);
+	vmcs_writel(GUEST_GS_BASE, vmcs12->guest_gs_base);
+	vmcs_writel(GUEST_LDTR_BASE, vmcs12->guest_ldtr_base);
+	vmcs_writel(GUEST_TR_BASE, vmcs12->guest_tr_base);
+	vmcs_writel(GUEST_GDTR_BASE, vmcs12->guest_gdtr_base);
+	vmcs_writel(GUEST_IDTR_BASE, vmcs12->guest_idtr_base);
+	vmcs_writel(GUEST_DR7, vmcs12->guest_dr7);
+	vmcs_writel(GUEST_RFLAGS, vmcs12->guest_rflags);
+	vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS,
+		vmcs12->guest_pending_dbg_exceptions);
+	vmcs_writel(GUEST_SYSENTER_ESP, vmcs12->guest_sysenter_esp);
+	vmcs_writel(GUEST_SYSENTER_EIP, vmcs12->guest_sysenter_eip);
+
+	vmcs_write64(VMCS_LINK_POINTER, vmcs12->vmcs_link_pointer);
+
+	if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) {
+		struct page *page =
+			nested_get_page(vcpu, vmcs12->apic_access_addr);
+		if (!page)
+			return 1;
+		vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page));
+		/*
+		 * Keep the page pinned, so the physical address we just wrote
+		 * remains valid. We keep a reference to it so we can release
+		 * it later.
+		 */
+		if (vmx->nested.apic_access_page) /* shouldn't happen... */
+			nested_release_page(vmx->nested.apic_access_page);
+		vmx->nested.apic_access_page = page;
+	}
+
+	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
+		(vmcs_config.pin_based_exec_ctrl |
+		 vmcs12->pin_based_vm_exec_control));
+
+	/*
+	 * Whether page-faults are trapped is determined by a combination of
+	 * 3 settings: PFEC_MASK, PFEC_MATCH and EXCEPTION_BITMAP.PF.
+	 * If enable_ept, L0 doesn't care about page faults and we should
+	 * set all of these to L1's desires. However, if !enable_ept, L0 does
+	 * care about (at least some) page faults, and because it is not easy
+	 * (if at all possible?) to merge L0 and L1's desires, we simply ask
+	 * to exit on each and every L2 page fault. This is done by setting
+	 * MASK=MATCH=0 and (see below) EB.PF=1.
+	 * Note that below we don't need special code to set EB.PF beyond the
+	 * "or"ing of the EB of vmcs01 and vmcs12, because when enable_ept,
+	 * vmcs01's EB.PF is 0 so the "or" will take vmcs12's value, and when
+	 * !enable_ept, EB.PF is 1, so the "or" will always be 1.
+	 *
+	 * A problem with this approach (when !enable_ept) is that L1 may be
+	 * injected with more page faults than it asked for. This could have
+	 * caused problems, but in practice existing hypervisors don't care.
+	 * To fix this, we will need to emulate the PFEC checking (on the L1
+	 * page tables), using walk_addr(), when injecting PFs to L1.
+	 */
+	vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK,
+		enable_ept ? vmcs12->page_fault_error_code_mask : 0);
+	vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH,
+		enable_ept ? vmcs12->page_fault_error_code_match : 0);
+
+	if (cpu_has_secondary_exec_ctrls()) {
+		u32 exec_control = vmx_secondary_exec_control(vmx);
+		if (!vmx->rdtscp_enabled)
+			exec_control &= ~SECONDARY_EXEC_RDTSCP;
+		/* Take the following fields only from vmcs12 */
+		exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+		if (nested_cpu_has(vmcs12,
+				CPU_BASED_ACTIVATE_SECONDARY_CONTROLS))
+			exec_control |= vmcs12->secondary_vm_exec_control;
+		vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
+	}
+
+	/*
+	 * Set host-state according to L0's settings (vmcs12 is irrelevant here)
+	 * Some constant fields are set here by vmx_set_constant_host_state().
+	 * Other fields are different per CPU, and will be set later when
+	 * vmx_vcpu_load() is called, and when vmx_save_host_state() is called.
+	 */
+	vmx_set_constant_host_state();
+
+	/*
+	 * HOST_RSP is normally set correctly in vmx_vcpu_run() just before
+	 * entry, but only if the current (host) sp changed from the value
+	 * we wrote last (vmx->host_rsp). This cache is no longer relevant
+	 * if we switch vmcs, and rather than hold a separate cache per vmcs,
+	 * here we just force the write to happen on entry.
+	 */
+	vmx->host_rsp = 0;
+
+	exec_control = vmx_exec_control(vmx); /* L0's desires */
+	exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING;
+	exec_control &= ~CPU_BASED_VIRTUAL_NMI_PENDING;
+	exec_control &= ~CPU_BASED_TPR_SHADOW;
+	exec_control |= vmcs12->cpu_based_vm_exec_control;
+	/*
+	 * Merging of IO and MSR bitmaps not currently supported.
+	 * Rather, exit every time.
+	 */
+	exec_control &= ~CPU_BASED_USE_MSR_BITMAPS;
+	exec_control &= ~CPU_BASED_USE_IO_BITMAPS;
+	exec_control |= CPU_BASED_UNCOND_IO_EXITING;
+
+	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control);
+
+	/* EXCEPTION_BITMAP and CR0_GUEST_HOST_MASK should basically be the
+	 * bitwise-or of what L1 wants to trap for L2, and what we want to
+	 * trap. Note that CR0.TS also needs updating - we do this later.
+	 */
+	update_exception_bitmap(vcpu);
+	vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
+	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
+
+	/* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer below */
+	vmcs_write32(VM_EXIT_CONTROLS,
+		vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
+	vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
+		(vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));
+
+	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
+		vmcs_write64(GUEST_IA32_PAT, vmcs12->guest_ia32_pat);
+	else if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
+		vmcs_write64(GUEST_IA32_PAT, vmx->vcpu.arch.pat);
+
+
+	set_cr4_guest_host_mask(vmx);
+
+	vmcs_write64(TSC_OFFSET,
+		vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
+
+	if (enable_vpid) {
+		/*
+		 * Trivially support vpid by letting L2s share their parent
+		 * L1's vpid. TODO: move to a more elaborate solution, giving
+		 * each L2 its own vpid and exposing the vpid feature to L1.
+		 */
+		vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
+		vmx_flush_tlb(vcpu);
+	}
+
+	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER)
+		vcpu->arch.efer = vmcs12->guest_ia32_efer;
+	if (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE)
+		vcpu->arch.efer |= (EFER_LMA | EFER_LME);
+	else
+		vcpu->arch.efer &= ~(EFER_LMA | EFER_LME);
+	/* Note: modifies VM_ENTRY/EXIT_CONTROLS and GUEST/HOST_IA32_EFER */
+	vmx_set_efer(vcpu, vcpu->arch.efer);
+
+	/*
+	 * This sets GUEST_CR0 to vmcs12->guest_cr0, with possibly a modified
+	 * TS bit (for lazy fpu) and bits which we consider mandatory enabled.
+	 * The CR0_READ_SHADOW is what L2 should have expected to read given
+	 * the specifications by L1; it's not enough to take
+	 * vmcs12->cr0_read_shadow because our cr0_guest_host_mask may have
+	 * more bits than L1 expected.
+	 */
+	vmx_set_cr0(vcpu, vmcs12->guest_cr0);
+	vmcs_writel(CR0_READ_SHADOW, guest_readable_cr0(vmcs12));
+
+	vmx_set_cr4(vcpu, vmcs12->guest_cr4);
+	vmcs_writel(CR4_READ_SHADOW, guest_readable_cr4(vmcs12));
+
+	/* shadow page tables on either EPT or shadow page tables */
+	kvm_set_cr3(vcpu, vmcs12->guest_cr3);
+	kvm_mmu_reset_context(vcpu);
+
+	kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->guest_rsp);
+	kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->guest_rip);
+	return 0;
+}
+
 static int vmx_check_intercept(struct kvm_vcpu *vcpu,
 			       struct x86_instruction_info *info,
 			       enum x86_intercept_stage stage)


* [PATCH 18/30] nVMX: Implement VMLAUNCH and VMRESUME
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
                   ` (16 preceding siblings ...)
  2011-05-08  8:23 ` [PATCH 17/30] nVMX: Prepare vmcs02 from vmcs01 and vmcs12 Nadav Har'El
@ 2011-05-08  8:24 ` Nadav Har'El
  2011-05-08  8:24 ` [PATCH 19/30] nVMX: No need for handle_vmx_insn function any more Nadav Har'El
                   ` (12 subsequent siblings)
  30 siblings, 0 replies; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:24 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Implement the VMLAUNCH and VMRESUME instructions, allowing a guest
hypervisor to run its own guests.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  139 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 137 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:20.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:20.000000000 +0300
@@ -346,6 +346,9 @@ struct nested_vmx {
 	/* vmcs02_list cache of VMCSs recently used to run L2 guests */
 	struct list_head vmcs02_pool;
 	int vmcs02_num;
+
+	/* Saving the VMCS that we used for running L1 */
+	struct saved_vmcs saved_vmcs01;
 	u64 vmcs01_tsc_offset;
 	/*
 	 * Guest pages referred to in vmcs02 with host-physical pointers, so
@@ -4880,6 +4883,21 @@ static int handle_vmclear(struct kvm_vcp
 	return 1;
 }
 
+static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch);
+
+/* Emulate the VMLAUNCH instruction */
+static int handle_vmlaunch(struct kvm_vcpu *vcpu)
+{
+	return nested_vmx_run(vcpu, true);
+}
+
+/* Emulate the VMRESUME instruction */
+static int handle_vmresume(struct kvm_vcpu *vcpu)
+{
+
+	return nested_vmx_run(vcpu, false);
+}
+
 enum vmcs_field_type {
 	VMCS_FIELD_TYPE_U16 = 0,
 	VMCS_FIELD_TYPE_U64 = 1,
@@ -5160,11 +5178,11 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_INVLPG]		      = handle_invlpg,
 	[EXIT_REASON_VMCALL]                  = handle_vmcall,
 	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
-	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
+	[EXIT_REASON_VMLAUNCH]                = handle_vmlaunch,
 	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
 	[EXIT_REASON_VMPTRST]                 = handle_vmptrst,
 	[EXIT_REASON_VMREAD]                  = handle_vmread,
-	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
+	[EXIT_REASON_VMRESUME]                = handle_vmresume,
 	[EXIT_REASON_VMWRITE]                 = handle_vmwrite,
 	[EXIT_REASON_VMOFF]                   = handle_vmoff,
 	[EXIT_REASON_VMON]                    = handle_vmon,
@@ -6021,6 +6039,123 @@ static int prepare_vmcs02(struct kvm_vcp
 	return 0;
 }
 
+/*
+ * nested_vmx_run() handles a nested entry, i.e., a VMLAUNCH or VMRESUME on L1
+ * for running an L2 nested guest.
+ */
+static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
+{
+	struct vmcs12 *vmcs12;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	int cpu;
+	struct saved_vmcs *saved_vmcs02;
+	u32 low, high;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+	skip_emulated_instruction(vcpu);
+
+	/*
+	 * The nested entry process starts with enforcing various prerequisites
+	 * on vmcs12 as required by the Intel SDM, and acting appropriately when
+	 * they fail: As the SDM explains, some conditions should cause the
+	 * instruction to fail, while others will cause the instruction to seem
+	 * to succeed, but return an EXIT_REASON_INVALID_STATE.
+	 * To speed up the normal (success) code path, we should avoid checking
+	 * for misconfigurations which will anyway be caught by the processor
+	 * when using the merged vmcs02.
+	 */
+
+	vmcs12 = get_vmcs12(vcpu);
+	if (vmcs12->launch_state == launch) {
+		nested_vmx_failValid(vcpu,
+			launch ? VMXERR_VMLAUNCH_NONCLEAR_VMCS
+			       : VMXERR_VMRESUME_NONLAUNCHED_VMCS);
+		return 1;
+	}
+
+	if (vmcs12->guest_interruptibility_info & GUEST_INTR_STATE_MOV_SS) {
+		nested_vmx_failValid(vcpu,
+			VMXERR_ENTRY_EVENTS_BLOCKED_BY_MOV_SS);
+		return 1;
+	}
+
+	if ((vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_MSR_BITMAPS) &&
+			!IS_ALIGNED(vmcs12->msr_bitmap, PAGE_SIZE)) {
+		/*TODO: Also verify bits beyond physical address width are 0*/
+		nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
+		return 1;
+	}
+
+	if (vmcs12->vm_entry_msr_load_count > 0 ||
+	    vmcs12->vm_exit_msr_load_count > 0 ||
+	    vmcs12->vm_exit_msr_store_count > 0) {
+		if (printk_ratelimit())
+			printk(KERN_WARNING
+			  "%s: VMCS MSR_{LOAD,STORE} unsupported\n", __func__);
+		nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
+		return 1;
+	}
+
+	nested_vmx_pinbased_ctls(&low, &high);
+	if (!vmx_control_verify(vmcs12->pin_based_vm_exec_control, low, high)) {
+		nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
+		return 1;
+	}
+
+	if (((vmcs12->host_cr0 & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON) ||
+	    ((vmcs12->host_cr4 & VMXON_CR4_ALWAYSON) != VMXON_CR4_ALWAYSON)) {
+		nested_vmx_failValid(vcpu,
+			VMXERR_ENTRY_INVALID_HOST_STATE_FIELD);
+		return 1;
+	}
+
+	/*
+	 * We're finally done with prerequisite checking, and can start with
+	 * the nested entry.
+	 */
+
+	enter_guest_mode(vcpu);
+
+	vmx->nested.vmcs01_tsc_offset = vmcs_read64(TSC_OFFSET);
+
+	/*
+	 * Switch from L1's VMCS (vmcs01) to L2's VMCS (vmcs02). Remember
+	 * vmcs01, on which CPU it was last loaded, and whether it was launched
+	 * (we need all these values the next time we use L1). Then recall
+	 * these values from the last time vmcs02 was used.
+	 */
+	saved_vmcs02 = nested_get_current_vmcs02(vmx);
+	if (!saved_vmcs02)
+		return -ENOMEM;
+
+	cpu = get_cpu();
+	vmx->nested.saved_vmcs01.vmcs = vmx->vmcs;
+	vmx->nested.saved_vmcs01.cpu = vcpu->cpu;
+	vmx->nested.saved_vmcs01.launched = vmx->launched;
+
+	vmx->vmcs = saved_vmcs02->vmcs;
+	vcpu->cpu = saved_vmcs02->cpu;
+	vmx->launched = saved_vmcs02->launched;
+
+	vmx_vcpu_put(vcpu);
+	vmx_vcpu_load(vcpu, cpu);
+	vcpu->cpu = cpu;
+	put_cpu();
+
+	vmcs12->launch_state = 1;
+
+	prepare_vmcs02(vcpu, vmcs12);
+
+	/*
+	 * Note no nested_vmx_succeed or nested_vmx_fail here. At this point
+	 * we are no longer running L1, and VMLAUNCH/VMRESUME has not yet
+	 * returned as far as L1 is concerned. It will only return (and set
+	 * the success flag) when L2 exits (see nested_vmx_vmexit()).
+	 */
+	return 1;
+}
+
 static int vmx_check_intercept(struct kvm_vcpu *vcpu,
 			       struct x86_instruction_info *info,
 			       enum x86_intercept_stage stage)


* [PATCH 19/30] nVMX: No need for handle_vmx_insn function any more
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
                   ` (17 preceding siblings ...)
  2011-05-08  8:24 ` [PATCH 18/30] nVMX: Implement VMLAUNCH and VMRESUME Nadav Har'El
@ 2011-05-08  8:24 ` Nadav Har'El
  2011-05-08  8:25 ` [PATCH 20/30] nVMX: Exiting from L2 to L1 Nadav Har'El
                   ` (11 subsequent siblings)
  30 siblings, 0 replies; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:24 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Before nested VMX support, the exit handler for a guest executing a VMX
instruction (vmclear, vmlaunch, vmptrld, vmptrst, vmread, vmresume, vmwrite,
vmon, vmoff) was handle_vmx_insn(). This handler simply threw a #UD
exception. Now that all these exit reasons are properly handled (each emulating
the respective VMX instruction), nothing calls this dummy handler and it can
be removed.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |    6 ------
 1 file changed, 6 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:20.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:20.000000000 +0300
@@ -4240,12 +4240,6 @@ static int handle_vmcall(struct kvm_vcpu
 	return 1;
 }
 
-static int handle_vmx_insn(struct kvm_vcpu *vcpu)
-{
-	kvm_queue_exception(vcpu, UD_VECTOR);
-	return 1;
-}
-
 static int handle_invd(struct kvm_vcpu *vcpu)
 {
 	return emulate_instruction(vcpu, 0) == EMULATE_DONE;


* [PATCH 20/30] nVMX: Exiting from L2 to L1
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
                   ` (18 preceding siblings ...)
  2011-05-08  8:24 ` [PATCH 19/30] nVMX: No need for handle_vmx_insn function any more Nadav Har'El
@ 2011-05-08  8:25 ` Nadav Har'El
  2011-05-09 10:45   ` Avi Kivity
  2011-05-08  8:25 ` [PATCH 21/30] nVMX: Deciding if L0 or L1 should handle an L2 exit Nadav Har'El
                   ` (10 subsequent siblings)
  30 siblings, 1 reply; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:25 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch implements nested_vmx_vmexit(), called when the nested L2 guest
exits and we want to run its L1 parent and let it handle this exit.

Note that this will not necessarily be called on every L2 exit: L0 may decide
to handle a particular exit on its own, without L1's involvement. In that
case, L0 handles the exit and resumes running L2, without running L1 and
without calling nested_vmx_vmexit(). The logic for deciding whether to handle
a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
appears in the next patch.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  288 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 288 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:20.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:20.000000000 +0300
@@ -6150,6 +6150,294 @@ static int nested_vmx_run(struct kvm_vcp
 	return 1;
 }
 
+/*
+ * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date
+ * because L2 may have changed some cr0 bits directly (see CR0_GUEST_HOST_MASK)
+ * without L0 trapping the change and updating vmcs12.
+ * This function returns the value we should put in vmcs12.guest_cr0. It's not
+ * enough to just return the current (vmcs02) GUEST_CR0 - that may not be the
+ * guest cr0 that L1 thought it was giving its L2 guest; It is possible that
+ * L1 wished to allow its guest to set some cr0 bit directly, but we (L0) asked
+ * to trap this change and instead set just the read shadow bit. If this is the
+ * case, we need to copy these read-shadow bits back to vmcs12.guest_cr0, where
+ * L1 believes they already are.
+ */
+static inline unsigned long
+vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
+{
+	/*
+	 * As explained above, we take a bit from GUEST_CR0 if we allowed the
+	 * guest to modify it untrapped (vcpu->arch.cr0_guest_owned_bits), or
+	 * if we did trap it - if we did so because L1 asked to trap this bit
+	 * (vmcs12->cr0_guest_host_mask). Otherwise (bits we trapped but L1
+	 * didn't expect us to trap) we read from CR0_READ_SHADOW.
+	 */
+	unsigned long guest_cr0_bits =
+		vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask;
+	return (vmcs_readl(GUEST_CR0) & guest_cr0_bits) |
+	       (vmcs_readl(CR0_READ_SHADOW) & ~guest_cr0_bits);
+}
+
+static inline unsigned long
+vmcs12_guest_cr4(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
+{
+	unsigned long guest_cr4_bits =
+		vcpu->arch.cr4_guest_owned_bits | vmcs12->cr4_guest_host_mask;
+	return (vmcs_readl(GUEST_CR4) & guest_cr4_bits) |
+	       (vmcs_readl(CR4_READ_SHADOW) & ~guest_cr4_bits);
+}
+
+/*
+ * prepare_vmcs12 is part of what we need to do when the nested L2 guest exits
+ * and we want to prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12),
+ * and this function updates it to reflect the changes to the guest state while
+ * L2 was running (and perhaps made some exits which were handled directly by L0
+ * without going back to L1), and to reflect the exit reason.
+ * Note that we do not have to copy here all VMCS fields, just those that
+ * could have changed by the L2 guest or the exit - i.e., the guest-state and
+ * exit-information fields only. Other fields are modified by L1 with VMWRITE,
+ * which already writes to vmcs12 directly.
+ */
+void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
+{
+	/* update guest state fields: */
+	vmcs12->guest_cr0 = vmcs12_guest_cr0(vcpu, vmcs12);
+	vmcs12->guest_cr4 = vmcs12_guest_cr4(vcpu, vmcs12);
+
+	kvm_get_dr(vcpu, 7, (unsigned long *)&vmcs12->guest_dr7);
+	vmcs12->guest_rsp = kvm_register_read(vcpu, VCPU_REGS_RSP);
+	vmcs12->guest_rip = kvm_register_read(vcpu, VCPU_REGS_RIP);
+	vmcs12->guest_rflags = vmcs_readl(GUEST_RFLAGS);
+
+	vmcs12->guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR);
+	vmcs12->guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR);
+	vmcs12->guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR);
+	vmcs12->guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR);
+	vmcs12->guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR);
+	vmcs12->guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR);
+	vmcs12->guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR);
+	vmcs12->guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR);
+	vmcs12->guest_es_limit = vmcs_read32(GUEST_ES_LIMIT);
+	vmcs12->guest_cs_limit = vmcs_read32(GUEST_CS_LIMIT);
+	vmcs12->guest_ss_limit = vmcs_read32(GUEST_SS_LIMIT);
+	vmcs12->guest_ds_limit = vmcs_read32(GUEST_DS_LIMIT);
+	vmcs12->guest_fs_limit = vmcs_read32(GUEST_FS_LIMIT);
+	vmcs12->guest_gs_limit = vmcs_read32(GUEST_GS_LIMIT);
+	vmcs12->guest_ldtr_limit = vmcs_read32(GUEST_LDTR_LIMIT);
+	vmcs12->guest_tr_limit = vmcs_read32(GUEST_TR_LIMIT);
+	vmcs12->guest_gdtr_limit = vmcs_read32(GUEST_GDTR_LIMIT);
+	vmcs12->guest_idtr_limit = vmcs_read32(GUEST_IDTR_LIMIT);
+	vmcs12->guest_es_ar_bytes = vmcs_read32(GUEST_ES_AR_BYTES);
+	vmcs12->guest_cs_ar_bytes = vmcs_read32(GUEST_CS_AR_BYTES);
+	vmcs12->guest_ss_ar_bytes = vmcs_read32(GUEST_SS_AR_BYTES);
+	vmcs12->guest_ds_ar_bytes = vmcs_read32(GUEST_DS_AR_BYTES);
+	vmcs12->guest_fs_ar_bytes = vmcs_read32(GUEST_FS_AR_BYTES);
+	vmcs12->guest_gs_ar_bytes = vmcs_read32(GUEST_GS_AR_BYTES);
+	vmcs12->guest_ldtr_ar_bytes = vmcs_read32(GUEST_LDTR_AR_BYTES);
+	vmcs12->guest_tr_ar_bytes = vmcs_read32(GUEST_TR_AR_BYTES);
+	vmcs12->guest_es_base = vmcs_readl(GUEST_ES_BASE);
+	vmcs12->guest_cs_base = vmcs_readl(GUEST_CS_BASE);
+	vmcs12->guest_ss_base = vmcs_readl(GUEST_SS_BASE);
+	vmcs12->guest_ds_base = vmcs_readl(GUEST_DS_BASE);
+	vmcs12->guest_fs_base = vmcs_readl(GUEST_FS_BASE);
+	vmcs12->guest_gs_base = vmcs_readl(GUEST_GS_BASE);
+	vmcs12->guest_ldtr_base = vmcs_readl(GUEST_LDTR_BASE);
+	vmcs12->guest_tr_base = vmcs_readl(GUEST_TR_BASE);
+	vmcs12->guest_gdtr_base = vmcs_readl(GUEST_GDTR_BASE);
+	vmcs12->guest_idtr_base = vmcs_readl(GUEST_IDTR_BASE);
+
+	vmcs12->guest_activity_state = vmcs_read32(GUEST_ACTIVITY_STATE);
+	vmcs12->guest_interruptibility_info =
+		vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
+	vmcs12->guest_pending_dbg_exceptions =
+		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
+	vmcs12->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);
+
+	/* TODO: These cannot have changed unless we have MSR bitmaps and
+	 * the relevant bit asks not to trap the change */
+	vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
+	if (vmcs12->vm_exit_controls & VM_EXIT_SAVE_IA32_PAT)
+		vmcs12->guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT);
+	vmcs12->guest_sysenter_cs = vmcs_read32(GUEST_SYSENTER_CS);
+	vmcs12->guest_sysenter_esp = vmcs_readl(GUEST_SYSENTER_ESP);
+	vmcs12->guest_sysenter_eip = vmcs_readl(GUEST_SYSENTER_EIP);
+
+	/* update exit information fields: */
+
+	vmcs12->vm_exit_reason  = vmcs_read32(VM_EXIT_REASON);
+	vmcs12->exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+
+	vmcs12->vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+	vmcs12->vm_exit_intr_error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
+	vmcs12->idt_vectoring_info_field =
+		vmcs_read32(IDT_VECTORING_INFO_FIELD);
+	vmcs12->idt_vectoring_error_code =
+		vmcs_read32(IDT_VECTORING_ERROR_CODE);
+	vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+	vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+
+	/* clear vm-entry fields which are to be cleared on exit */
+	if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
+		vmcs12->vm_entry_intr_info_field &= ~INTR_INFO_VALID_MASK;
+}
+
+/*
+ * A part of what we need to do when the nested L2 guest exits and we want to
+ * run its L1 parent, is to reset L1's guest state to the host state specified
+ * in vmcs12.
+ * This function is to be called not only on normal nested exit, but also on
+ * a nested entry failure, as explained in Intel's spec, 3B.23.7 ("VM-Entry
+ * Failures During or After Loading Guest State").
+ * This function should be called when the active VMCS is L1's (vmcs01).
+ */
+void load_vmcs12_host_state(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
+{
+	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER)
+		vcpu->arch.efer = vmcs12->host_ia32_efer;
+	if (vmcs12->vm_exit_controls & VM_EXIT_HOST_ADDR_SPACE_SIZE)
+		vcpu->arch.efer |= (EFER_LMA | EFER_LME);
+	else
+		vcpu->arch.efer &= ~(EFER_LMA | EFER_LME);
+	vmx_set_efer(vcpu, vcpu->arch.efer);
+
+	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PAT)
+		vmcs_write64(GUEST_IA32_PAT, vmcs12->host_ia32_pat);
+
+	kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->host_rsp);
+	kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->host_rip);
+	/*
+	 * Note that calling vmx_set_cr0 is important, even if cr0 hasn't
+	 * actually changed, because it depends on the current state of
+	 * fpu_active (which may have changed).
+	 * Note that vmx_set_cr0 refers to efer set above.
+	 */
+	kvm_set_cr0(vcpu, vmcs12->host_cr0);
+	/*
+	 * If we did fpu_activate()/fpu_deactivate() during L2's run, we need
+	 * to apply the same changes to L1's vmcs. We just set cr0 correctly,
+	 * but we also need to update cr0_guest_host_mask and exception_bitmap.
+	 */
+	update_exception_bitmap(vcpu);
+	vcpu->arch.cr0_guest_owned_bits = (vcpu->fpu_active ? X86_CR0_TS : 0);
+	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
+
+	/*
+	 * Note that CR4_GUEST_HOST_MASK is already set in the original vmcs01
+	 * (KVM doesn't change it)- no reason to call set_cr4_guest_host_mask();
+	 */
+	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
+	kvm_set_cr4(vcpu, vmcs12->host_cr4);
+
+	/* shadow page tables on either EPT or shadow page tables */
+	kvm_set_cr3(vcpu, vmcs12->host_cr3);
+	kvm_mmu_reset_context(vcpu);
+
+	if (enable_vpid) {
+		/*
+		 * Trivially support vpid by letting L2s share their parent
+		 * L1's vpid. TODO: move to a more elaborate solution, giving
+		 * each L2 its own vpid and exposing the vpid feature to L1.
+		 */
+		vmx_flush_tlb(vcpu);
+	}
+}
+
+/*
+ * Emulate an exit from nested guest (L2) to L1, i.e., prepare to run L1
+ * and modify vmcs12 to make it see what it would expect to see there if
+ * L2 was its real guest. Must only be called when in L2 (is_guest_mode())
+ */
+static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, bool is_interrupt)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	int cpu;
+	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+
+	leave_guest_mode(vcpu);
+
+	prepare_vmcs12(vcpu, vmcs12);
+
+	/*
+	 * Usually, nested_vmx_vmexit() is called after an exit from L2 that
+	 * we wish to pass to L1, so we can pass this exit reason. However,
+	 * when the ad-hoc is_interrupt flag is on, it means there was no
+	 * real exit reason: The caller wanted to exit to L1 just to inject
+	 * an interrupt to it, and we set here a fictitious exit reason.
+	 * In the future, this call option will be eliminated: Instead of
+	 * exiting to L1 and later injecting to it, the better solution would
+	 * be to exit to L1 with the injected interrupt as the exit reason.
+	 */
+	if (is_interrupt)
+		vmcs12->vm_exit_reason = EXIT_REASON_EXTERNAL_INTERRUPT;
+
+	/*
+	 * Switch from L2's VMCS, to L1's VMCS. Remember on which CPU the L2
+	 * VMCS was last loaded, and whether it was launched (we need to know
+	 * this next time we use L2), and recall these values as they were for
+	 * L1's VMCS.
+	 */
+	cpu = get_cpu();
+	if (VMCS02_POOL_SIZE > 0) {
+		struct saved_vmcs *saved_vmcs02 =
+			nested_get_current_vmcs02(vmx);
+		saved_vmcs02->cpu = vcpu->cpu;
+		saved_vmcs02->launched = vmx->launched;
+	} else {
+		/* no vmcs02 cache requested, so free the one we used */
+		nested_free_vmcs02(vmx, vmx->nested.current_vmptr);
+	}
+	vmx->vmcs = vmx->nested.saved_vmcs01.vmcs;
+	vcpu->cpu = vmx->nested.saved_vmcs01.cpu;
+	vmx->launched = vmx->nested.saved_vmcs01.launched;
+
+	vmx_vcpu_put(vcpu);
+	vmx_vcpu_load(vcpu, cpu);
+	vcpu->cpu = cpu;
+	put_cpu();
+
+	load_vmcs12_host_state(vcpu, vmcs12);
+
+	/* Update TSC_OFFSET if vmx_adjust_tsc_offset() was used while L2 ran */
+	vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
+
+	/* This is needed for the same reason as it was needed in prepare_vmcs02 */
+	vmx->host_rsp = 0;
+
+	/* Unpin physical memory we referred to in vmcs02 */
+	if (vmx->nested.apic_access_page) {
+		nested_release_page(vmx->nested.apic_access_page);
+		vmx->nested.apic_access_page = 0;
+	}
+
+	/*
+	 * Exiting from L2 to L1, we're now back to L1 which thinks it just
+	 * finished a VMLAUNCH or VMRESUME instruction, so we need to set the
+	 * success or failure flag accordingly.
+	 */
+	if (unlikely(vmx->fail)) {
+		vmx->fail = 0;
+		nested_vmx_failValid(vcpu, vmcs_read32(VM_INSTRUCTION_ERROR));
+	} else
+		nested_vmx_succeed(vcpu);
+}
+
+/*
+ * L1's failure to enter L2 is a subset of a normal exit, as explained in
+ * 23.7 "VM-entry failures during or after loading guest state". This function
+ * should only be called before L2 has actually started to run, and while
+ * vmcs01 is current (it doesn't leave_guest_mode() or switch VMCSs).
+ */
+static void nested_vmx_entry_failure(struct kvm_vcpu *vcpu,
+					struct vmcs12 *vmcs12)
+{
+	load_vmcs12_host_state(vcpu, vmcs12);
+	/* TODO: there are more possible types of failures - see 23.7 */
+	vmcs12->vm_exit_reason = EXIT_REASON_INVALID_STATE |
+		VMX_EXIT_REASONS_FAILED_VMENTRY;
+	vmcs12->exit_qualification = 0;
+	nested_vmx_succeed(vcpu);
+}
+
 static int vmx_check_intercept(struct kvm_vcpu *vcpu,
 			       struct x86_instruction_info *info,
 			       enum x86_intercept_stage stage)


* [PATCH 21/30] nVMX: Deciding if L0 or L1 should handle an L2 exit
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
                   ` (19 preceding siblings ...)
  2011-05-08  8:25 ` [PATCH 20/30] nVMX: Exiting from L2 to L1 Nadav Har'El
@ 2011-05-08  8:25 ` Nadav Har'El
  2011-05-08  8:26 ` [PATCH 22/30] nVMX: Correct handling of interrupt injection Nadav Har'El
                   ` (9 subsequent siblings)
  30 siblings, 0 replies; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:25 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch contains the logic of whether an L2 exit should be handled by L0
and then L2 should be resumed, or whether L1 should be run to handle this
exit (using the nested_vmx_vmexit() function of the previous patch).

The basic idea is to let L1 handle the exit only if it actually asked to
trap this sort of event. For example, when L2 exits on a change to CR0,
we check L1's CR0_GUEST_HOST_MASK to see whether L1 expressed interest in any
bit which changed; if it did, we exit to L1. If it didn't, it means that it
was we (L0) who wished to trap this event, so we handle it ourselves.

The next two patches add additional logic of what to do when an interrupt or
exception is injected: Does L0 need to do it, should we exit to L1 to do it,
or should we resume L2 and keep the exception to be injected later.

We keep a new flag, "nested_run_pending", which can override the decision of
which should run next, L1 or L2. nested_run_pending=1 means that we *must* run
L2 next, not L1. This is necessary in particular when L1 did a VMLAUNCH of L2
and therefore expects L2 to be run (and perhaps be injected with an event it
specified, etc.). nested_run_pending is especially intended to avoid switching
to L1 at the injection decision point described above.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  265 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 264 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:20.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:20.000000000 +0300
@@ -350,6 +350,8 @@ struct nested_vmx {
 	/* Saving the VMCS that we used for running L1 */
 	struct saved_vmcs saved_vmcs01;
 	u64 vmcs01_tsc_offset;
+	/* L2 must run next, and mustn't decide to exit to L1. */
+	bool nested_run_pending;
 	/*
 	 * Guest pages referred to in vmcs02 with host-physical pointers, so
 	 * we must keep them pinned while L2 runs.
@@ -856,6 +858,23 @@ static inline bool nested_cpu_has2(struc
 		(vmcs12->secondary_vm_exec_control & bit);
 }
 
+static inline bool nested_cpu_has_virtual_nmis(struct kvm_vcpu *vcpu)
+{
+	return is_guest_mode(vcpu) &&
+		(get_vmcs12(vcpu)->pin_based_vm_exec_control &
+			PIN_BASED_VIRTUAL_NMIS);
+}
+
+static inline bool is_exception(u32 intr_info)
+{
+	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
+		== (INTR_TYPE_HARD_EXCEPTION | INTR_INFO_VALID_MASK);
+}
+
+static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, bool is_interrupt);
+static void nested_vmx_entry_failure(struct kvm_vcpu *vcpu,
+					struct vmcs12 *vmcs12);
+
 static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
 {
 	int i;
@@ -5196,6 +5215,232 @@ static int (*kvm_vmx_exit_handlers[])(st
 static const int kvm_vmx_max_exit_handlers =
 	ARRAY_SIZE(kvm_vmx_exit_handlers);
 
+/*
+ * Return 1 if we should exit from L2 to L1 to handle an MSR access,
+ * rather than handle it ourselves in L0. I.e., check whether L1 expressed
+ * interest in the current event (a read or write of a specific MSR) via its
+ * MSR bitmap. L1 may use an MSR bitmap even when L0 doesn't use one itself.
+ */
+static bool nested_vmx_exit_handled_msr(struct kvm_vcpu *vcpu,
+	struct vmcs12 *vmcs12, u32 exit_reason)
+{
+	u32 msr_index = vcpu->arch.regs[VCPU_REGS_RCX];
+	gpa_t bitmap;
+
+	if (!nested_cpu_has(get_vmcs12(vcpu), CPU_BASED_USE_MSR_BITMAPS))
+		return 1;
+
+	/*
+	 * The MSR_BITMAP page is divided into four 1024-byte bitmaps,
+	 * for the four combinations of read/write and low/high MSR numbers.
+	 * First we need to figure out which of the four to use:
+	 */
+	bitmap = vmcs12->msr_bitmap;
+	if (exit_reason == EXIT_REASON_MSR_WRITE)
+		bitmap += 2048;
+	if (msr_index >= 0xc0000000) {
+		msr_index -= 0xc0000000;
+		bitmap += 1024;
+	}
+
+	/* Then read the msr_index'th bit from this bitmap: */
+	if (msr_index < 1024*8) {
+		unsigned char b;
+		kvm_read_guest(vcpu->kvm, bitmap + msr_index/8, &b, 1);
+		return 1 & (b >> (msr_index & 7));
+	} else
+		return 1; /* let L1 handle the wrong parameter */
+}
+
+/*
+ * Return 1 if we should exit from L2 to L1 to handle a CR access exit,
+ * rather than handle it ourselves in L0. I.e., check if L1 wanted to
+ * intercept (via guest_host_mask etc.) the current event.
+ */
+static bool nested_vmx_exit_handled_cr(struct kvm_vcpu *vcpu,
+	struct vmcs12 *vmcs12)
+{
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	int cr = exit_qualification & 15;
+	int reg = (exit_qualification >> 8) & 15;
+	unsigned long val = kvm_register_read(vcpu, reg);
+
+	switch ((exit_qualification >> 4) & 3) {
+	case 0: /* mov to cr */
+		switch (cr) {
+		case 0:
+			if (vmcs12->cr0_guest_host_mask &
+			    (val ^ vmcs12->cr0_read_shadow))
+				return 1;
+			break;
+		case 3:
+			if ((vmcs12->cr3_target_count >= 1 &&
+					vmcs12->cr3_target_value0 == val) ||
+				(vmcs12->cr3_target_count >= 2 &&
+					vmcs12->cr3_target_value1 == val) ||
+				(vmcs12->cr3_target_count >= 3 &&
+					vmcs12->cr3_target_value2 == val) ||
+				(vmcs12->cr3_target_count >= 4 &&
+					vmcs12->cr3_target_value3 == val))
+				return 0;
+			if (nested_cpu_has(vmcs12, CPU_BASED_CR3_LOAD_EXITING))
+				return 1;
+			break;
+		case 4:
+			if (vmcs12->cr4_guest_host_mask &
+			    (vmcs12->cr4_read_shadow ^ val))
+				return 1;
+			break;
+		case 8:
+			if (nested_cpu_has(vmcs12, CPU_BASED_CR8_LOAD_EXITING))
+				return 1;
+			break;
+		}
+		break;
+	case 2: /* clts */
+		if ((vmcs12->cr0_guest_host_mask & X86_CR0_TS) &&
+		    (vmcs12->cr0_read_shadow & X86_CR0_TS))
+			return 1;
+		break;
+	case 1: /* mov from cr */
+		switch (cr) {
+		case 3:
+			if (vmcs12->cpu_based_vm_exec_control &
+			    CPU_BASED_CR3_STORE_EXITING)
+				return 1;
+			break;
+		case 8:
+			if (vmcs12->cpu_based_vm_exec_control &
+			    CPU_BASED_CR8_STORE_EXITING)
+				return 1;
+			break;
+		}
+		break;
+	case 3: /* lmsw */
+		/*
+		 * lmsw can change bits 1..3 of cr0, and only set bit 0 of
+		 * cr0. Other attempted changes are ignored, with no exit.
+		 */
+		if (vmcs12->cr0_guest_host_mask & 0xe &
+		    (val ^ vmcs12->cr0_read_shadow))
+			return 1;
+		if ((vmcs12->cr0_guest_host_mask & 0x1) &&
+		    !(vmcs12->cr0_read_shadow & 0x1) &&
+		    (val & 0x1))
+			return 1;
+		break;
+	}
+	return 0;
+}
+
+/*
+ * Return 1 if we should exit from L2 to L1 to handle an exit, or 0 if we
+ * should handle it ourselves in L0 (and then continue L2). Only call this
+ * when in is_guest_mode (L2).
+ */
+static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
+{
+	u32 exit_reason = vmcs_read32(VM_EXIT_REASON);
+	u32 intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+
+	if (vmx->nested.nested_run_pending)
+		return 0;
+
+	if (unlikely(vmx->fail)) {
+		printk(KERN_INFO "%s failed vm entry %x\n",
+		       __func__, vmcs_read32(VM_INSTRUCTION_ERROR));
+		return 1;
+	}
+
+	switch (exit_reason) {
+	case EXIT_REASON_EXCEPTION_NMI:
+		if (!is_exception(intr_info))
+			return 0;
+		else if (is_page_fault(intr_info))
+			return enable_ept;
+		return vmcs12->exception_bitmap &
+				(1u << (intr_info & INTR_INFO_VECTOR_MASK));
+	case EXIT_REASON_EXTERNAL_INTERRUPT:
+		return 0;
+	case EXIT_REASON_TRIPLE_FAULT:
+		return 1;
+	case EXIT_REASON_PENDING_INTERRUPT:
+	case EXIT_REASON_NMI_WINDOW:
+		/*
+		 * prepare_vmcs02() set the CPU_BASED_VIRTUAL_INTR_PENDING bit
+		 * (aka Interrupt Window Exiting) only when L1 turned it on,
+		 * so if we got a PENDING_INTERRUPT exit, this must be for L1.
+		 * Same for NMI Window Exiting.
+		 */
+		return 1;
+	case EXIT_REASON_TASK_SWITCH:
+		return 1;
+	case EXIT_REASON_CPUID:
+		return 1;
+	case EXIT_REASON_HLT:
+		return nested_cpu_has(vmcs12, CPU_BASED_HLT_EXITING);
+	case EXIT_REASON_INVD:
+		return 1;
+	case EXIT_REASON_INVLPG:
+		return vmcs12->cpu_based_vm_exec_control &
+				CPU_BASED_INVLPG_EXITING;
+	case EXIT_REASON_RDPMC:
+		return vmcs12->cpu_based_vm_exec_control &
+				CPU_BASED_RDPMC_EXITING;
+	case EXIT_REASON_RDTSC:
+		return vmcs12->cpu_based_vm_exec_control &
+				CPU_BASED_RDTSC_EXITING;
+	case EXIT_REASON_VMCALL: case EXIT_REASON_VMCLEAR:
+	case EXIT_REASON_VMLAUNCH: case EXIT_REASON_VMPTRLD:
+	case EXIT_REASON_VMPTRST: case EXIT_REASON_VMREAD:
+	case EXIT_REASON_VMRESUME: case EXIT_REASON_VMWRITE:
+	case EXIT_REASON_VMOFF: case EXIT_REASON_VMON:
+		/*
+		 * VMX instructions trap unconditionally. This allows L1 to
+		 * emulate them for its L2 guest, i.e., allows 3-level nesting!
+		 */
+		return 1;
+	case EXIT_REASON_CR_ACCESS:
+		return nested_vmx_exit_handled_cr(vcpu, vmcs12);
+	case EXIT_REASON_DR_ACCESS:
+		return nested_cpu_has(vmcs12, CPU_BASED_MOV_DR_EXITING);
+	case EXIT_REASON_IO_INSTRUCTION:
+		/* TODO: support IO bitmaps */
+		return 1;
+	case EXIT_REASON_MSR_READ:
+	case EXIT_REASON_MSR_WRITE:
+		return nested_vmx_exit_handled_msr(vcpu, vmcs12, exit_reason);
+	case EXIT_REASON_INVALID_STATE:
+		return 1;
+	case EXIT_REASON_MWAIT_INSTRUCTION:
+		return nested_cpu_has(vmcs12, CPU_BASED_MWAIT_EXITING);
+	case EXIT_REASON_MONITOR_INSTRUCTION:
+		return nested_cpu_has(vmcs12, CPU_BASED_MONITOR_EXITING);
+	case EXIT_REASON_PAUSE_INSTRUCTION:
+		return nested_cpu_has(vmcs12, CPU_BASED_PAUSE_EXITING) ||
+			nested_cpu_has2(vmcs12,
+				SECONDARY_EXEC_PAUSE_LOOP_EXITING);
+	case EXIT_REASON_MCE_DURING_VMENTRY:
+		return 0;
+	case EXIT_REASON_TPR_BELOW_THRESHOLD:
+		return 1;
+	case EXIT_REASON_APIC_ACCESS:
+		return nested_cpu_has2(vmcs12,
+			SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
+	case EXIT_REASON_EPT_VIOLATION:
+	case EXIT_REASON_EPT_MISCONFIG:
+		return 0;
+	case EXIT_REASON_WBINVD:
+		return nested_cpu_has2(vmcs12, SECONDARY_EXEC_WBINVD_EXITING);
+	case EXIT_REASON_XSETBV:
+		return 1;
+	default:
+		return 1;
+	}
+}
+
 static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2)
 {
 	*info1 = vmcs_readl(EXIT_QUALIFICATION);
@@ -5218,6 +5463,17 @@ static int vmx_handle_exit(struct kvm_vc
 	if (vmx->emulation_required && emulate_invalid_guest_state)
 		return handle_invalid_guest_state(vcpu);
 
+	if (exit_reason == EXIT_REASON_VMLAUNCH ||
+	    exit_reason == EXIT_REASON_VMRESUME)
+		vmx->nested.nested_run_pending = 1;
+	else
+		vmx->nested.nested_run_pending = 0;
+
+	if (is_guest_mode(vcpu) && nested_vmx_exit_handled(vcpu)) {
+		nested_vmx_vmexit(vcpu, false);
+		return 1;
+	}
+
 	if (exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) {
 		vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY;
 		vcpu->run->fail_entry.hardware_entry_failure_reason
@@ -5240,7 +5496,8 @@ static int vmx_handle_exit(struct kvm_vc
 		       "(0x%x) and exit reason is 0x%x\n",
 		       __func__, vectoring_info, exit_reason);
 
-	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked)) {
+	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked &&
+			!nested_cpu_has_virtual_nmis(vcpu))) {
 		if (vmx_interrupt_allowed(vcpu)) {
 			vmx->soft_vnmi_blocked = 0;
 		} else if (vmx->vnmi_blocked_time > 1000000000LL &&
@@ -6104,6 +6361,12 @@ static int nested_vmx_run(struct kvm_vcp
 		return 1;
 	}
 
+	if (((vmcs12->guest_cr0 & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON) ||
+	    ((vmcs12->guest_cr4 & VMXON_CR4_ALWAYSON) != VMXON_CR4_ALWAYSON)) {
+		nested_vmx_entry_failure(vcpu, vmcs12);
+		return 1;
+	}
+
 	/*
 	 * We're finally done with prerequisite checking, and can start with
 	 * the nested entry.


* [PATCH 22/30] nVMX: Correct handling of interrupt injection
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
                   ` (20 preceding siblings ...)
  2011-05-08  8:25 ` [PATCH 21/30] nVMX: Deciding if L0 or L1 should handle an L2 exit Nadav Har'El
@ 2011-05-08  8:26 ` Nadav Har'El
  2011-05-09 10:57   ` Avi Kivity
  2011-05-08  8:27 ` [PATCH 23/30] nVMX: Correct handling of exception injection Nadav Har'El
                   ` (8 subsequent siblings)
  30 siblings, 1 reply; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:26 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

When KVM wants to inject an interrupt, the guest should think a real interrupt
has happened. Normally (in the non-nested case) this means checking that the
guest doesn't block interrupts (and if it does, inject when it doesn't - using
the "interrupt window" VMX mechanism), and setting up the appropriate VMCS
fields for the guest to receive the interrupt.

However, when we are running a nested guest (L2) and its hypervisor (L1)
requested exits on interrupts (as most hypervisors do), the most efficient
thing to do is to exit L2, telling L1 that the exit was caused by an
interrupt, namely the one we were injecting. Only when L1 asked not to be
notified of interrupts should we inject directly into the running L2 guest
(i.e., take the normal code path).

However, properly doing what is described above requires invasive changes to
the flow of the existing code, which we elected not to make at this stage.
Instead we do something simpler and less efficient: we modify
vmx_interrupt_allowed(), which kvm calls to see if it can inject the interrupt
now, to exit from L2 to L1 before continuing the normal code. The normal kvm
code then notices that L1 is blocking interrupts, and sets the interrupt
window to inject the interrupt later to L1. Shortly after, L1 gets the
interrupt while it is itself running, not as an exit from L2. The cost is an
extra L1 exit (the interrupt window).
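
As a rough illustration of that decision point, here is a stand-alone
user-space model with hypothetical names (not the kvm code; the real logic is
in vmx_interrupt_allowed() and enable_irq_window() in the patch below):

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical, simplified view of the state consulted at the decision
     * point. */
    struct demo_state {
        bool in_l2;               /* is_guest_mode(): currently running L2 */
        bool l1_wants_intr_exits; /* L1 set PIN_BASED_EXT_INTR_MASK in vmcs12 */
        bool nested_run_pending;  /* L2 must run next; may not switch to L1 */
    };

    /* Returns true if the pending external interrupt may be injected now.
     * When L1 wants interrupt exits, we first switch from L2 to L1 (modelled
     * here by clearing in_l2) and then let the normal injection path run
     * with L1 as the current guest. */
    static bool demo_interrupt_allowed(struct demo_state *s)
    {
        if (s->in_l2 && s->l1_wants_intr_exits) {
            if (s->nested_run_pending)
                return false;   /* retry after the forced entry into L2 */
            s->in_l2 = false;   /* models nested_vmx_vmexit(vcpu, true) */
        }
        /* The usual RFLAGS.IF / interruptibility checks would follow here. */
        return true;
    }

    int main(void)
    {
        struct demo_state s = { true, true, false };
        printf("inject now: %d, still in L2: %d\n",
               demo_interrupt_allowed(&s), s.in_l2);
        return 0;
    }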

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:20.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:20.000000000 +0300
@@ -3675,9 +3675,25 @@ out:
 	return ret;
 }
 
+/*
+ * In nested virtualization, check if L1 asked to exit on external interrupts.
+ * For most existing hypervisors, this will always return true.
+ */
+static bool nested_exit_on_intr(struct kvm_vcpu *vcpu)
+{
+	return get_vmcs12(vcpu)->pin_based_vm_exec_control &
+		PIN_BASED_EXT_INTR_MASK;
+}
+
 static void enable_irq_window(struct kvm_vcpu *vcpu)
 {
 	u32 cpu_based_vm_exec_control;
+	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
+		/* We can get here when nested_run_pending caused
+		 * vmx_interrupt_allowed() to return false. In this case, do
+		 * nothing - the interrupt will be injected later.
+		 */
+		return;
 
 	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
 	cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
@@ -3800,6 +3816,13 @@ static void vmx_set_nmi_mask(struct kvm_
 
 static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
 {
+	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
+		if (to_vmx(vcpu)->nested.nested_run_pending)
+			return 0;
+		nested_vmx_vmexit(vcpu, true);
+		/* fall through to normal code, but now in L1, not L2 */
+	}
+
 	return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
 		!(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
 			(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
@@ -5463,6 +5486,14 @@ static int vmx_handle_exit(struct kvm_vc
 	if (vmx->emulation_required && emulate_invalid_guest_state)
 		return handle_invalid_guest_state(vcpu);
 
+	/*
+	 * the KVM_REQ_EVENT optimization bit is only on for one entry, and if
+	 * we did not inject a still-pending event to L1 now because of
+	 * nested_run_pending, we need to re-enable this bit.
+	 */
+	if (vmx->nested.nested_run_pending)
+		kvm_make_request(KVM_REQ_EVENT, vcpu);
+
 	if (exit_reason == EXIT_REASON_VMLAUNCH ||
 	    exit_reason == EXIT_REASON_VMRESUME)
 		vmx->nested.nested_run_pending = 1;
@@ -5660,6 +5691,8 @@ static void __vmx_complete_interrupts(st
 
 static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 {
+	if (is_guest_mode(&vmx->vcpu))
+		return;
 	__vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
 				  VM_EXIT_INSTRUCTION_LEN,
 				  IDT_VECTORING_ERROR_CODE);
@@ -5667,6 +5700,8 @@ static void vmx_complete_interrupts(stru
 
 static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
 {
+	if (is_guest_mode(vcpu))
+		return;
 	__vmx_complete_interrupts(to_vmx(vcpu),
 				  vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
 				  VM_ENTRY_INSTRUCTION_LEN,


* [PATCH 23/30] nVMX: Correct handling of exception injection
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
                   ` (21 preceding siblings ...)
  2011-05-08  8:26 ` [PATCH 22/30] nVMX: Correct handling of interrupt injection Nadav Har'El
@ 2011-05-08  8:27 ` Nadav Har'El
  2011-05-08  8:27 ` [PATCH 24/30] nVMX: Correct handling of idt vectoring info Nadav Har'El
                   ` (7 subsequent siblings)
  30 siblings, 0 replies; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:27 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Similar to the previous patch, but concerning injection of exceptions rather
than external interrupts.
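
For reference, the check that decides whether the #PF must go to L1 is a
single-bit lookup in L1's exception bitmap. A minimal stand-alone sketch (the
struct is illustrative; the vector number is the architectural one):

    #include <stdbool.h>
    #include <stdio.h>

    #define PF_VECTOR 14  /* architectural #PF vector number */

    struct demo_vmcs12 {
        unsigned int exception_bitmap; /* bit n set => L1 wants exits on n */
    };

    /* True if L1 asked to intercept page faults, so the #PF goes to L1. */
    static bool l1_intercepts_pf(const struct demo_vmcs12 *vmcs12)
    {
        return vmcs12->exception_bitmap & (1u << PF_VECTOR);
    }

    int main(void)
    {
        struct demo_vmcs12 v = { .exception_bitmap = 1u << PF_VECTOR };
        printf("deliver #PF to L1: %d\n", l1_intercepts_pf(&v));
        return 0;
    }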

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:21.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:21.000000000 +0300
@@ -1572,6 +1572,25 @@ static void vmx_clear_hlt(struct kvm_vcp
 		vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
 }
 
+/*
+ * KVM wants to inject page-faults which it got to the guest. This function
+ * checks whether in a nested guest, we need to inject them to L1 or L2.
+ * This function assumes it is called with the exit reason in vmcs02 being
+ * a #PF exception (this is the only case in which KVM injects a #PF when L2
+ * is running).
+ */
+static int nested_pf_handled(struct kvm_vcpu *vcpu)
+{
+	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+
+	/* TODO: also check PFEC_MATCH/MASK, not just EB.PF. */
+	if (!(vmcs12->exception_bitmap & (1u << PF_VECTOR)))
+		return 0;
+
+	nested_vmx_vmexit(vcpu, false);
+	return 1;
+}
+
 static void vmx_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
 				bool has_error_code, u32 error_code,
 				bool reinject)
@@ -1579,6 +1598,10 @@ static void vmx_queue_exception(struct k
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u32 intr_info = nr | INTR_INFO_VALID_MASK;
 
+	if (nr == PF_VECTOR && is_guest_mode(vcpu) &&
+		nested_pf_handled(vcpu))
+		return;
+
 	if (has_error_code) {
 		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, error_code);
 		intr_info |= INTR_INFO_DELIVER_CODE_MASK;
@@ -3750,6 +3773,9 @@ static void vmx_inject_nmi(struct kvm_vc
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
+	if (is_guest_mode(vcpu))
+		return;
+
 	if (!cpu_has_virtual_nmis()) {
 		/*
 		 * Tracking the NMI-blocked state in software is built upon


* [PATCH 24/30] nVMX: Correct handling of idt vectoring info
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
                   ` (22 preceding siblings ...)
  2011-05-08  8:27 ` [PATCH 23/30] nVMX: Correct handling of exception injection Nadav Har'El
@ 2011-05-08  8:27 ` Nadav Har'El
  2011-05-09 11:04   ` Avi Kivity
  2011-05-08  8:28 ` [PATCH 25/30] nVMX: Handling of CR0 and CR4 modifying instructions Nadav Har'El
                   ` (6 subsequent siblings)
  30 siblings, 1 reply; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:27 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch adds correct handling of IDT_VECTORING_INFO_FIELD for the nested
case.

When a guest exits while handling an interrupt or exception, we get this
information in IDT_VECTORING_INFO_FIELD in the VMCS. When L2 exits to L1,
there's nothing we need to do, because L1 will see this field in vmcs12, and
handle it itself. However, when L2 exits and L0 handles the exit itself and
plans to return to L2, L0 must inject this event to L2.

In the normal non-nested case, the idt_vectoring_info is discovered after the
exit, and the decision to inject (though not the injection itself) is made at
that point. However, in the nested case the decision whether to return to L2
or to L1 also happens during the injection phase (see the previous patches),
so in the nested case we can only decide what to do about the
idt_vectoring_info right after the injection, i.e., at the beginning of
vmx_vcpu_run, which is the first time we know for sure that we're staying in
L2 (i.e., is_guest_mode() is true).
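
The IDT_VECTORING_INFO_FIELD word that has to be re-injected uses the usual
VMX event-information layout; below is a small stand-alone sketch of pulling
apart the fields this patch cares about (the mask names are ours, the bit
positions follow the architectural layout):

    #include <stdio.h>

    /* Architectural layout of the 32-bit event-information word. */
    #define DEMO_VECTOR_MASK        0x000000ffu  /* bits 7:0  - vector      */
    #define DEMO_TYPE_MASK          0x00000700u  /* bits 10:8 - event type  */
    #define DEMO_DELIVER_CODE_MASK  0x00000800u  /* bit 11    - error code? */
    #define DEMO_VALID_MASK         0x80000000u  /* bit 31    - info valid  */

    int main(void)
    {
        /* Example: a valid hardware exception (type 3), vector 14 (#PF),
         * with an error code to deliver. */
        unsigned int info = DEMO_VALID_MASK | DEMO_DELIVER_CODE_MASK |
                            (3u << 8) | 14u;

        if (info & DEMO_VALID_MASK) {
            unsigned int vector = info & DEMO_VECTOR_MASK;
            unsigned int type = info & DEMO_TYPE_MASK;
            printf("re-inject vector %u, type %u, error code valid: %d\n",
                   vector, type >> 8,
                   (info & DEMO_DELIVER_CODE_MASK) != 0);
        }
        return 0;
    }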

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:21.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:21.000000000 +0300
@@ -352,6 +352,10 @@ struct nested_vmx {
 	u64 vmcs01_tsc_offset;
 	/* L2 must run next, and mustn't decide to exit to L1. */
 	bool nested_run_pending;
+	/* true if last exit was of L2, and had a valid idt_vectoring_info */
+	bool valid_idt_vectoring_info;
+	/* These are saved if valid_idt_vectoring_info */
+	u32 vm_exit_instruction_len, idt_vectoring_error_code;
 	/*
 	 * Guest pages referred to in vmcs02 with host-physical pointers, so
 	 * we must keep them pinned while L2 runs.
@@ -5736,6 +5740,22 @@ static void vmx_cancel_injection(struct 
 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
 }
 
+static void nested_handle_valid_idt_vectoring_info(struct vcpu_vmx *vmx)
+{
+	int irq  = vmx->idt_vectoring_info & VECTORING_INFO_VECTOR_MASK;
+	int type = vmx->idt_vectoring_info & VECTORING_INFO_TYPE_MASK;
+	int errCodeValid = vmx->idt_vectoring_info &
+		VECTORING_INFO_DELIVER_CODE_MASK;
+	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+		irq | type | INTR_INFO_VALID_MASK | errCodeValid);
+
+	vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
+		vmx->nested.vm_exit_instruction_len);
+	if (errCodeValid)
+		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
+			vmx->nested.idt_vectoring_error_code);
+}
+
 #ifdef CONFIG_X86_64
 #define R "r"
 #define Q "q"
@@ -5748,6 +5768,9 @@ static void __noclone vmx_vcpu_run(struc
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
+	if (is_guest_mode(vcpu) && vmx->nested.valid_idt_vectoring_info)
+		nested_handle_valid_idt_vectoring_info(vmx);
+
 	/* Record the guest's net vcpu time for enforced NMI injections. */
 	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked))
 		vmx->entry_time = ktime_get();
@@ -5879,6 +5902,15 @@ static void __noclone vmx_vcpu_run(struc
 
 	vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
 
+	vmx->nested.valid_idt_vectoring_info = is_guest_mode(vcpu) &&
+		(vmx->idt_vectoring_info & VECTORING_INFO_VALID_MASK);
+	if (vmx->nested.valid_idt_vectoring_info) {
+		vmx->nested.vm_exit_instruction_len =
+			vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+		vmx->nested.idt_vectoring_error_code =
+			vmcs_read32(IDT_VECTORING_ERROR_CODE);
+	}
+
 	asm("mov %0, %%ds; mov %0, %%es" : : "r"(__USER_DS));
 	vmx->launched = 1;
 


* [PATCH 25/30] nVMX: Handling of CR0 and CR4 modifying instructions
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
                   ` (23 preceding siblings ...)
  2011-05-08  8:27 ` [PATCH 24/30] nVMX: Correct handling of idt vectoring info Nadav Har'El
@ 2011-05-08  8:28 ` Nadav Har'El
  2011-05-08  8:28 ` [PATCH 26/30] nVMX: Further fixes for lazy FPU loading Nadav Har'El
                   ` (5 subsequent siblings)
  30 siblings, 0 replies; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:28 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

When L2 tries to modify CR0 or CR4 (with mov or clts), and modifies a bit
which L1 asked to shadow (via CR[04]_GUEST_HOST_MASK), we already do the right
thing: we let L1 handle the trap (see nested_vmx_exit_handled_cr() in a
previous patch).
When L2 modifies bits that L1 doesn't care about, we let it think (via
CR[04]_READ_SHADOW) that it did these modifications, while only changing
(in GUEST_CR[04]) the bits that L0 doesn't shadow.

This is needed for correct handling of CR0.TS for lazy FPU loading: L0 may
want to leave TS on, while pretending to allow the guest to change it.
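
In other words, on such an exit the new GUEST_CR0 keeps the L0-shadowed bits
as L0 set them and takes the rest from L2's value, while CR0_READ_SHADOW
receives L2's value verbatim. A tiny stand-alone sketch of the merge
(illustrative names and values, not the kvm code):

    #include <stdio.h>

    #define DEMO_CR0_TS (1ul << 3)   /* CR0.TS */
    #define DEMO_CR0_MP (1ul << 1)   /* CR0.MP */

    /* 'owned' are the bits the guest may change directly (not shadowed by
     * L0); the remaining bits keep the value L0 chose, whatever the guest
     * wrote. */
    static unsigned long merge_cr0(unsigned long guest_val,
                                   unsigned long current_cr0,
                                   unsigned long owned)
    {
        return (guest_val & owned) | (current_cr0 & ~owned);
    }

    int main(void)
    {
        unsigned long current_cr0 = DEMO_CR0_TS; /* L0 keeps TS set (lazy FPU) */
        unsigned long owned = ~DEMO_CR0_TS;      /* guest owns everything else */
        unsigned long guest_val = DEMO_CR0_MP;   /* guest clears TS, sets MP   */

        printf("effective cr0 = %#lx (TS stays set, MP taken from the guest)\n",
               merge_cr0(guest_val, current_cr0, owned));
        return 0;
    }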

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   58 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 55 insertions(+), 3 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:21.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:21.000000000 +0300
@@ -4094,6 +4094,58 @@ vmx_patch_hypercall(struct kvm_vcpu *vcp
 	hypercall[2] = 0xc1;
 }
 
+/* called to set cr0 as appropriate for a mov-to-cr0 exit. */
+static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val)
+{
+	if (to_vmx(vcpu)->nested.vmxon &&
+	    ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON))
+		return 1;
+
+	if (is_guest_mode(vcpu)) {
+		/*
+		 * We get here when L2 changed cr0 in a way that did not change
+		 * any of L1's shadowed bits (see nested_vmx_exit_handled_cr),
+		 * but did change L0 shadowed bits. This can currently happen
+		 * with the TS bit: L0 may want to leave TS on (for lazy fpu
+		 * loading) while pretending to allow the guest to change it.
+		 */
+		if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) |
+			 (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits)))
+			return 1;
+		vmcs_writel(CR0_READ_SHADOW, val);
+		return 0;
+	} else
+		return kvm_set_cr0(vcpu, val);
+}
+
+static int handle_set_cr4(struct kvm_vcpu *vcpu, unsigned long val)
+{
+	if (is_guest_mode(vcpu)) {
+		if (kvm_set_cr4(vcpu, (val & vcpu->arch.cr4_guest_owned_bits) |
+			 (vcpu->arch.cr4 & ~vcpu->arch.cr4_guest_owned_bits)))
+			return 1;
+		vmcs_writel(CR4_READ_SHADOW, val);
+		return 0;
+	} else
+		return kvm_set_cr4(vcpu, val);
+}
+
+/* called to set cr0 as appropriate for a clts instruction exit. */
+static void handle_clts(struct kvm_vcpu *vcpu)
+{
+	if (is_guest_mode(vcpu)) {
+		/*
+		 * We get here when L2 did CLTS, and L1 didn't shadow CR0.TS
+		 * but we did (!fpu_active). We need to keep GUEST_CR0.TS on,
+		 * just pretend it's off (also in arch.cr0 for fpu_activate).
+		 */
+		vmcs_writel(CR0_READ_SHADOW,
+			vmcs_readl(CR0_READ_SHADOW) & ~X86_CR0_TS);
+		vcpu->arch.cr0 &= ~X86_CR0_TS;
+	} else
+		vmx_set_cr0(vcpu, kvm_read_cr0_bits(vcpu, ~X86_CR0_TS));
+}
+
 static int handle_cr(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification, val;
@@ -4110,7 +4162,7 @@ static int handle_cr(struct kvm_vcpu *vc
 		trace_kvm_cr_write(cr, val);
 		switch (cr) {
 		case 0:
-			err = kvm_set_cr0(vcpu, val);
+			err = handle_set_cr0(vcpu, val);
 			kvm_complete_insn_gp(vcpu, err);
 			return 1;
 		case 3:
@@ -4118,7 +4170,7 @@ static int handle_cr(struct kvm_vcpu *vc
 			kvm_complete_insn_gp(vcpu, err);
 			return 1;
 		case 4:
-			err = kvm_set_cr4(vcpu, val);
+			err = handle_set_cr4(vcpu, val);
 			kvm_complete_insn_gp(vcpu, err);
 			return 1;
 		case 8: {
@@ -4136,7 +4188,7 @@ static int handle_cr(struct kvm_vcpu *vc
 		};
 		break;
 	case 2: /* clts */
-		vmx_set_cr0(vcpu, kvm_read_cr0_bits(vcpu, ~X86_CR0_TS));
+		handle_clts(vcpu);
 		trace_kvm_cr_write(0, kvm_read_cr0(vcpu));
 		skip_emulated_instruction(vcpu);
 		vmx_fpu_activate(vcpu);


* [PATCH 26/30] nVMX: Further fixes for lazy FPU loading
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
                   ` (24 preceding siblings ...)
  2011-05-08  8:28 ` [PATCH 25/30] nVMX: Handling of CR0 and CR4 modifying instructions Nadav Har'El
@ 2011-05-08  8:28 ` Nadav Har'El
  2011-05-08  8:29 ` [PATCH 27/30] nVMX: Additional TSC-offset handling Nadav Har'El
                   ` (4 subsequent siblings)
  30 siblings, 0 replies; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:28 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

KVM's "Lazy FPU loading" means that sometimes L0 needs to set CR0.TS, even
if a guest didn't set it. Moreover, L0 must also trap CR0.TS changes and
NM exceptions, even if we have a guest hypervisor (L1) who didn't want these
traps. And of course, conversely: If L1 wanted to trap these events, we
must let it, even if L0 is not interested in them.

This patch fixes some existing KVM code (in update_exception_bitmap(),
vmx_fpu_activate(), vmx_fpu_deactivate()) to do the correct merging of L0's
and L1's needs. Note that handle_cr() was already fixed in the above patch,
and that new code introduced in previous patches already handles CR0
correctly (see prepare_vmcs02(), prepare_vmcs12(), and nested_vmx_vmexit()).
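
The merging rule can be summarized as: the hardware exception bitmap is the
union of what L0 and L1 want trapped, and a CR0 bit is guest-owned only if
neither L0 nor L1 shadows it. A minimal stand-alone sketch with illustrative
values (not the kvm code):

    #include <stdio.h>

    #define NM_VECTOR 7   /* #NM: device-not-available, used for lazy FPU */
    #define DB_VECTOR 1   /* #DB: debug exception, as an example L1 trap  */
    #define CR0_TS    (1ul << 3)

    int main(void)
    {
        unsigned int l0_eb = 1u << NM_VECTOR; /* L0 needs #NM for lazy FPU    */
        unsigned int l1_eb = 1u << DB_VECTOR; /* L1 asked (in vmcs12) for #DB */
        unsigned int hw_eb = l0_eb | l1_eb;   /* what vmcs02 must trap        */

        unsigned long l0_owned = CR0_TS;  /* bits L0 would let a guest own    */
        unsigned long l1_mask  = CR0_TS;  /* L1 shadows TS too (vmcs12 mask)  */
        /* guest-owned only if neither L0 nor L1 wants to shadow the bit */
        unsigned long owned = l0_owned & ~l1_mask;

        printf("hardware exception bitmap: %#x, guest-owned cr0 bits: %#lx\n",
               hw_eb, owned);
        return 0;
    }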

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   31 ++++++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:21.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:21.000000000 +0300
@@ -1170,6 +1170,15 @@ static void update_exception_bitmap(stru
 		eb &= ~(1u << PF_VECTOR); /* bypass_guest_pf = 0 */
 	if (vcpu->fpu_active)
 		eb &= ~(1u << NM_VECTOR);
+
+	/* When we are running a nested L2 guest and L1 specified for it a
+	 * certain exception bitmap, we must trap the same exceptions and pass
+	 * them to L1. When running L2, we will only handle the exceptions
+	 * specified above if L1 did not want them.
+	 */
+	if (is_guest_mode(vcpu))
+		eb |= get_vmcs12(vcpu)->exception_bitmap;
+
 	vmcs_write32(EXCEPTION_BITMAP, eb);
 }
 
@@ -1464,6 +1473,9 @@ static void vmx_fpu_activate(struct kvm_
 	vmcs_writel(GUEST_CR0, cr0);
 	update_exception_bitmap(vcpu);
 	vcpu->arch.cr0_guest_owned_bits = X86_CR0_TS;
+	if (is_guest_mode(vcpu))
+		vcpu->arch.cr0_guest_owned_bits &=
+			~get_vmcs12(vcpu)->cr0_guest_host_mask;
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
 }
 
@@ -1487,12 +1499,29 @@ static inline unsigned long guest_readab
 
 static void vmx_fpu_deactivate(struct kvm_vcpu *vcpu)
 {
+	/* Note that there is no vcpu->fpu_active = 0 here. The caller must
+	 * set this *before* calling this function.
+	 */
 	vmx_decache_cr0_guest_bits(vcpu);
 	vmcs_set_bits(GUEST_CR0, X86_CR0_TS | X86_CR0_MP);
 	update_exception_bitmap(vcpu);
 	vcpu->arch.cr0_guest_owned_bits = 0;
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
-	vmcs_writel(CR0_READ_SHADOW, vcpu->arch.cr0);
+	if (is_guest_mode(vcpu)) {
+		/*
+		 * L1's specified read shadow might not contain the TS bit,
+		 * so now that we turned on shadowing of this bit, we need to
+		 * set this bit of the shadow. Like in nested_vmx_run we need
+		 * guest_readable_cr0(vmcs12), but vmcs12->guest_cr0 is not
+		 * yet up-to-date here because we just decached cr0.TS (and
+		 * we'll only update vmcs12->guest_cr0 on nested exit).
+		 */
+		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+		vmcs12->guest_cr0 = (vmcs12->guest_cr0 & ~X86_CR0_TS) |
+			(vcpu->arch.cr0 & X86_CR0_TS);
+		vmcs_writel(CR0_READ_SHADOW, guest_readable_cr0(vmcs12));
+	} else
+		vmcs_writel(CR0_READ_SHADOW, vcpu->arch.cr0);
 }
 
 static unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu)


* [PATCH 27/30] nVMX: Additional TSC-offset handling
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
                   ` (25 preceding siblings ...)
  2011-05-08  8:28 ` [PATCH 26/30] nVMX: Further fixes for lazy FPU loading Nadav Har'El
@ 2011-05-08  8:29 ` Nadav Har'El
  2011-05-09 17:27   ` Zachary Amsden
  2011-05-08  8:29 ` [PATCH 28/30] nVMX: Add VMX to list of supported cpuid features Nadav Har'El
                   ` (3 subsequent siblings)
  30 siblings, 1 reply; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:29 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

In the unlikely case that L1 does not capture MSR_IA32_TSC, L0 needs to
emulate this MSR write by L2 by modifying vmcs02.tsc_offset. We also need to
set vmcs12.tsc_offset, for this change to survive the next nested entry (see
prepare_vmcs02()).
Additionally, we need to modify vmx_adjust_tsc_offset: the semantics of this
function are that the TSCs of all guests on this vcpu, L1 and possibly
several L2s, need to be adjusted. To do this, we need to adjust vmcs01's
tsc_offset (this offset will also apply to each L2s we enter). We can't set
vmcs01 now, so we have to remember this adjustment and apply it when we
later exit to L1.
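
The offsets compose additively: while L2 runs, the hardware offset is vmcs01's
offset plus vmcs12's offset. A small stand-alone sketch of the bookkeeping
described above (illustrative numbers only, not the kvm code):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        int64_t vmcs01_tsc_offset = 1000;   /* offset L0 gives L1       */
        int64_t vmcs12_tsc_offset = 500;    /* extra offset L1 gives L2 */

        /* While L2 runs, the hardware (vmcs02) offset is the sum of the two. */
        int64_t hw_offset = vmcs01_tsc_offset + vmcs12_tsc_offset;
        printf("vmcs02 offset while L2 runs: %lld\n", (long long)hw_offset);

        /* L2 writes MSR_IA32_TSC and L1 chose not to trap it: L0 picks a new
         * hardware offset and records the delta in vmcs12 so the change
         * survives the next L1->L2 entry. */
        hw_offset = 2000;
        vmcs12_tsc_offset = hw_offset - vmcs01_tsc_offset;

        /* An adjustment that must apply to L1 and all its L2s: apply it to
         * the current hardware offset now, and remember it in the saved
         * vmcs01 offset so it also takes effect after the next exit to L1. */
        int64_t adjustment = 100;
        hw_offset += adjustment;
        vmcs01_tsc_offset += adjustment;

        printf("after: hw=%lld vmcs01=%lld vmcs12=%lld\n",
               (long long)hw_offset, (long long)vmcs01_tsc_offset,
               (long long)vmcs12_tsc_offset);
        return 0;
    }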

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:21.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:21.000000000 +0300
@@ -1757,12 +1757,24 @@ static void vmx_set_tsc_khz(struct kvm_v
 static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 {
 	vmcs_write64(TSC_OFFSET, offset);
+	if (is_guest_mode(vcpu))
+		/*
+		 * We're here if L1 chose not to trap the TSC MSR. Since
+		 * prepare_vmcs12() does not copy tsc_offset, we need to also
+		 * set the vmcs12 field here.
+		 */
+		get_vmcs12(vcpu)->tsc_offset = offset -
+			to_vmx(vcpu)->nested.vmcs01_tsc_offset;
 }
 
 static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
 {
 	u64 offset = vmcs_read64(TSC_OFFSET);
 	vmcs_write64(TSC_OFFSET, offset + adjustment);
+	if (is_guest_mode(vcpu)) {
+		/* Even when running L2, the adjustment needs to apply to L1 */
+		to_vmx(vcpu)->nested.vmcs01_tsc_offset += adjustment;
+	}
 }
 
 static u64 vmx_compute_tsc_offset(struct kvm_vcpu *vcpu, u64 target_tsc)


* [PATCH 28/30] nVMX: Add VMX to list of supported cpuid features
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
                   ` (26 preceding siblings ...)
  2011-05-08  8:29 ` [PATCH 27/30] nVMX: Additional TSC-offset handling Nadav Har'El
@ 2011-05-08  8:29 ` Nadav Har'El
  2011-05-08  8:30 ` [PATCH 29/30] nVMX: Miscellaneous small corrections Nadav Har'El
                   ` (2 subsequent siblings)
  30 siblings, 0 replies; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:29 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

If the "nested" module option is enabled, add the "VMX" CPU feature to the
list of CPU features KVM advertises with the KVM_GET_SUPPORTED_CPUID ioctl.

Qemu uses this ioctl, and intersects KVM's list with its own list of desired
cpu features (depending on the -cpu option given to qemu) to determine the
final list of features presented to the guest.
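
For completeness, the feature in question is CPUID leaf 1, ECX bit 5 (VMX). A
stand-alone user-space sketch of checking it from inside a guest, using the
compiler-provided <cpuid.h> (the macro name below is our own):

    #include <stdio.h>
    #include <cpuid.h>

    #define DEMO_CPUID1_ECX_VMX (1u << 5)   /* CPUID.01H:ECX.VMX[bit 5] */

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
            fprintf(stderr, "CPUID leaf 1 not supported\n");
            return 1;
        }
        printf("VMX advertised to this (guest) CPU: %s\n",
               (ecx & DEMO_CPUID1_ECX_VMX) ? "yes" : "no");
        return 0;
    }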

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |    2 ++
 1 file changed, 2 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:21.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:21.000000000 +0300
@@ -6244,6 +6244,8 @@ static void vmx_cpuid_update(struct kvm_
 
 static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
 {
+	if (func == 1 && nested)
+		entry->ecx |= bit(X86_FEATURE_VMX);
 }
 
 /*


* [PATCH 29/30] nVMX: Miscellaneous small corrections
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
                   ` (27 preceding siblings ...)
  2011-05-08  8:29 ` [PATCH 28/30] nVMX: Add VMX to list of supported cpuid features Nadav Har'El
@ 2011-05-08  8:30 ` Nadav Har'El
  2011-05-08  8:30 ` [PATCH 30/30] nVMX: Documentation Nadav Har'El
  2011-05-09 11:18 ` [PATCH 0/30] nVMX: Nested VMX, v9 Avi Kivity
  30 siblings, 0 replies; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:30 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Small corrections of KVM (spelling, etc.) not directly related to nested VMX.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:21.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:21.000000000 +0300
@@ -947,7 +947,7 @@ static void vmcs_load(struct vmcs *vmcs)
 			: "=qm"(error) : "a"(&phys_addr), "m"(phys_addr)
 			: "cc", "memory");
 	if (error)
-		printk(KERN_ERR "kvm: vmptrld %p/%llx fail\n",
+		printk(KERN_ERR "kvm: vmptrld %p/%llx failed\n",
 		       vmcs, phys_addr);
 }
 


* [PATCH 30/30] nVMX: Documentation
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
                   ` (28 preceding siblings ...)
  2011-05-08  8:30 ` [PATCH 29/30] nVMX: Miscellaneous small corrections Nadav Har'El
@ 2011-05-08  8:30 ` Nadav Har'El
  2011-05-09 11:18 ` [PATCH 0/30] nVMX: Nested VMX, v9 Avi Kivity
  30 siblings, 0 replies; 83+ messages in thread
From: Nadav Har'El @ 2011-05-08  8:30 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch includes a brief introduction to the nested vmx feature in the
Documentation/kvm directory. The document also includes a copy of the
vmcs12 structure, as requested by Avi Kivity.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 Documentation/kvm/nested-vmx.txt |  243 +++++++++++++++++++++++++++++
 1 file changed, 243 insertions(+)

--- .before/Documentation/kvm/nested-vmx.txt	2011-05-08 10:43:22.000000000 +0300
+++ .after/Documentation/kvm/nested-vmx.txt	2011-05-08 10:43:22.000000000 +0300
@@ -0,0 +1,243 @@
+Nested VMX
+==========
+
+Overview
+---------
+
+On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
+to easily and efficiently run guest operating systems. Normally, these guests
+*cannot* themselves be hypervisors running their own guests, because in VMX,
+guests cannot use VMX instructions.
+
+The "Nested VMX" feature adds this missing capability - of running guest
+hypervisors (which use VMX) with their own nested guests. It does so by
+allowing a guest to use VMX instructions, and correctly and efficiently
+emulating them using the single level of VMX available in the hardware.
+
+We describe in much greater detail the theory behind the nested VMX feature,
+its implementation and its performance characteristics, in the OSDI 2010 paper
+"The Turtles Project: Design and Implementation of Nested Virtualization",
+available at:
+
+	http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf
+
+
+Terminology
+-----------
+
+Single-level virtualization has two levels - the host (KVM) and the guests.
+In nested virtualization, we have three levels: The host (KVM), which we call
+L0, the guest hypervisor, which we call L1, and its nested guest, which we
+call L2.
+
+
+Known limitations
+-----------------
+
+The current code supports running Linux guests under KVM guests.
+Only 64-bit guest hypervisors are supported.
+
+Additional patches for running Windows under guest KVM, and Linux under
+guest VMware Server, as well as support for nested EPT, are currently being
+tested in the lab, and will be sent as follow-on patchsets.
+
+
+Running nested VMX
+------------------
+
+The nested VMX feature is disabled by default. It can be enabled by giving
+the "nested=1" option to the kvm-intel module.
+
+No modifications are required to user space (qemu). However, qemu's default
+emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
+explicitly enabled by giving qemu one of the following options:
+
+     -cpu host              (emulated CPU has all features of the real CPU)
+
+     -cpu qemu64,+vmx       (add just the vmx feature to a named CPU type)
+
+
+ABIs
+----
+
+Nested VMX aims to present a standard and (eventually) fully-functional VMX
+implementation for a guest hypervisor to use. As such, the official
+specification of the ABI that it provides is Intel's VMX specification,
+namely volume 3B of their "Intel 64 and IA-32 Architectures Software
+Developer's Manual". Not all of VMX's features are currently fully supported,
+but the goal is to eventually support them all, starting with the VMX features
+which are used in practice by popular hypervisors (KVM and others).
+
+As a VMX implementation, nested VMX presents a VMCS structure to L1.
+As mandated by the spec, other than the two fields revision_id and abort,
+this structure is *opaque* to its user, who is not supposed to know or care
+about its internal structure. Rather, the structure is accessed through the
+VMREAD and VMWRITE instructions.
+Still, for debugging purposes, KVM developers might be interested in the
+internals of this structure; this is struct vmcs12 from arch/x86/kvm/vmx.c.
+For convenience, we repeat its content here. If the internals of this
+structure change, live migration across KVM versions can break.
+VMCS12_REVISION (from vmx.c) should be changed if the layout of struct vmcs12
+is ever changed.
+
+	typedef u64 natural_width;
+	struct __packed vmcs12 {
+		/* According to the Intel spec, a VMCS region must start with
+		 * these two user-visible fields */
+		u32 revision_id;
+		u32 abort;
+
+		u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
+		u32 padding[7]; /* room for future expansion */
+
+		u64 io_bitmap_a;
+		u64 io_bitmap_b;
+		u64 msr_bitmap;
+		u64 vm_exit_msr_store_addr;
+		u64 vm_exit_msr_load_addr;
+		u64 vm_entry_msr_load_addr;
+		u64 tsc_offset;
+		u64 virtual_apic_page_addr;
+		u64 apic_access_addr;
+		u64 ept_pointer;
+		u64 guest_physical_address;
+		u64 vmcs_link_pointer;
+		u64 guest_ia32_debugctl;
+		u64 guest_ia32_pat;
+		u64 guest_ia32_efer;
+		u64 guest_pdptr0;
+		u64 guest_pdptr1;
+		u64 guest_pdptr2;
+		u64 guest_pdptr3;
+		u64 host_ia32_pat;
+		u64 host_ia32_efer;
+		u64 padding64[8]; /* room for future expansion */
+		natural_width cr0_guest_host_mask;
+		natural_width cr4_guest_host_mask;
+		natural_width cr0_read_shadow;
+		natural_width cr4_read_shadow;
+		natural_width cr3_target_value0;
+		natural_width cr3_target_value1;
+		natural_width cr3_target_value2;
+		natural_width cr3_target_value3;
+		natural_width exit_qualification;
+		natural_width guest_linear_address;
+		natural_width guest_cr0;
+		natural_width guest_cr3;
+		natural_width guest_cr4;
+		natural_width guest_es_base;
+		natural_width guest_cs_base;
+		natural_width guest_ss_base;
+		natural_width guest_ds_base;
+		natural_width guest_fs_base;
+		natural_width guest_gs_base;
+		natural_width guest_ldtr_base;
+		natural_width guest_tr_base;
+		natural_width guest_gdtr_base;
+		natural_width guest_idtr_base;
+		natural_width guest_dr7;
+		natural_width guest_rsp;
+		natural_width guest_rip;
+		natural_width guest_rflags;
+		natural_width guest_pending_dbg_exceptions;
+		natural_width guest_sysenter_esp;
+		natural_width guest_sysenter_eip;
+		natural_width host_cr0;
+		natural_width host_cr3;
+		natural_width host_cr4;
+		natural_width host_fs_base;
+		natural_width host_gs_base;
+		natural_width host_tr_base;
+		natural_width host_gdtr_base;
+		natural_width host_idtr_base;
+		natural_width host_ia32_sysenter_esp;
+		natural_width host_ia32_sysenter_eip;
+		natural_width host_rsp;
+		natural_width host_rip;
+		natural_width paddingl[8]; /* room for future expansion */
+		u32 pin_based_vm_exec_control;
+		u32 cpu_based_vm_exec_control;
+		u32 exception_bitmap;
+		u32 page_fault_error_code_mask;
+		u32 page_fault_error_code_match;
+		u32 cr3_target_count;
+		u32 vm_exit_controls;
+		u32 vm_exit_msr_store_count;
+		u32 vm_exit_msr_load_count;
+		u32 vm_entry_controls;
+		u32 vm_entry_msr_load_count;
+		u32 vm_entry_intr_info_field;
+		u32 vm_entry_exception_error_code;
+		u32 vm_entry_instruction_len;
+		u32 tpr_threshold;
+		u32 secondary_vm_exec_control;
+		u32 vm_instruction_error;
+		u32 vm_exit_reason;
+		u32 vm_exit_intr_info;
+		u32 vm_exit_intr_error_code;
+		u32 idt_vectoring_info_field;
+		u32 idt_vectoring_error_code;
+		u32 vm_exit_instruction_len;
+		u32 vmx_instruction_info;
+		u32 guest_es_limit;
+		u32 guest_cs_limit;
+		u32 guest_ss_limit;
+		u32 guest_ds_limit;
+		u32 guest_fs_limit;
+		u32 guest_gs_limit;
+		u32 guest_ldtr_limit;
+		u32 guest_tr_limit;
+		u32 guest_gdtr_limit;
+		u32 guest_idtr_limit;
+		u32 guest_es_ar_bytes;
+		u32 guest_cs_ar_bytes;
+		u32 guest_ss_ar_bytes;
+		u32 guest_ds_ar_bytes;
+		u32 guest_fs_ar_bytes;
+		u32 guest_gs_ar_bytes;
+		u32 guest_ldtr_ar_bytes;
+		u32 guest_tr_ar_bytes;
+		u32 guest_interruptibility_info;
+		u32 guest_activity_state;
+		u32 guest_sysenter_cs;
+		u32 host_ia32_sysenter_cs;
+		u32 padding32[8]; /* room for future expansion */
+		u16 virtual_processor_id;
+		u16 guest_es_selector;
+		u16 guest_cs_selector;
+		u16 guest_ss_selector;
+		u16 guest_ds_selector;
+		u16 guest_fs_selector;
+		u16 guest_gs_selector;
+		u16 guest_ldtr_selector;
+		u16 guest_tr_selector;
+		u16 host_es_selector;
+		u16 host_cs_selector;
+		u16 host_ss_selector;
+		u16 host_ds_selector;
+		u16 host_fs_selector;
+		u16 host_gs_selector;
+		u16 host_tr_selector;
+	};
+
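+As an aside (an illustrative sketch, not code from this patch set), the
+layout/revision coupling described above could also be enforced at compile
+time, for example by adding checks along these lines to the module init code:
+
+	/* hypothetical layout guards; only the first few offsets are shown */
+	BUILD_BUG_ON(offsetof(struct vmcs12, revision_id)  != 0);
+	BUILD_BUG_ON(offsetof(struct vmcs12, abort)        != 4);
+	BUILD_BUG_ON(offsetof(struct vmcs12, launch_state) != 8);
+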
+
+Authors
+-------
+
+These patches were written by:
+     Abel Gordon, abelg <at> il.ibm.com
+     Nadav Har'El, nyh <at> il.ibm.com
+     Orit Wasserman, oritw <at> il.ibm.com
+     Ben-Ami Yassour, benami <at> il.ibm.com
+     Muli Ben-Yehuda, muli <at> il.ibm.com
+
+With contributions by:
+     Anthony Liguori, aliguori <at> us.ibm.com
+     Mike Day, mdday <at> us.ibm.com
+     Michael Factor, factor <at> il.ibm.com
+     Zvi Dubitzky, dubi <at> il.ibm.com
+
+And valuable reviews by:
+     Avi Kivity, avi <at> redhat.com
+     Gleb Natapov, gleb <at> redhat.com
+     and others.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 06/30] nVMX: Decoding memory operands of VMX instructions
  2011-05-08  8:18 ` [PATCH 06/30] nVMX: Decoding memory operands of VMX instructions Nadav Har'El
@ 2011-05-09  9:47   ` Avi Kivity
  0 siblings, 0 replies; 83+ messages in thread
From: Avi Kivity @ 2011-05-09  9:47 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

On 05/08/2011 11:18 AM, Nadav Har'El wrote:
> This patch includes a utility function for decoding pointer operands of VMX
> instructions issued by L1 (a guest hypervisor)
>
> +	/*
> +	 * TODO: throw #GP (and return 1) in various cases that the VM*
> +	 * instructions require it - e.g., offset beyond segment limit,
> +	 * unusable or unreadable/unwritable segment, non-canonical 64-bit
> +	 * address, and so on. Currently these are not checked.
> +	 */
> +	return 0;
> +}
> +

Note: emulate.c now contains a function (linearize()) which does these 
calculations.  We need to generalize it and expose it so nvmx can make 
use of it.
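
For reference, the kind of calculation involved looks roughly like this (an
illustrative sketch only; the bit layout follows the SDM's VM-exit
instruction-information format, and the function name and segment handling
are simplified):

	/* sketch: compute the operand address of a VMX instruction from the
	 * VM-exit instruction-information field and the displacement that is
	 * reported in the exit qualification */
	static gva_t vmx_instr_mem_operand(struct kvm_vcpu *vcpu, u32 instr_info,
					   unsigned long exit_qual)
	{
		int  scaling   = instr_info & 3;		/* bits 1:0   */
		int  index_reg = (instr_info >> 18) & 0xf;	/* bits 21:18 */
		bool index_ok  = !(instr_info & (1u << 22));
		int  base_reg  = (instr_info >> 23) & 0xf;	/* bits 26:23 */
		bool base_ok   = !(instr_info & (1u << 27));
		gva_t addr     = exit_qual;			/* displacement */

		if (base_ok)
			addr += kvm_register_read(vcpu, base_reg);
		if (index_ok)
			addr += kvm_register_read(vcpu, index_reg) << scaling;
		/* the segment register is in bits 17:15; its base must be added
		 * here, and the limit/usability/canonicality checks from the
		 * TODO above applied before using the address */
		return addr;
	}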

There is no real security concern since these instructions are only 
allowed from cpl 0 anyway.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 15/30] nVMX: Move host-state field setup to a function
  2011-05-08  8:22 ` [PATCH 15/30] nVMX: Move host-state field setup to a function Nadav Har'El
@ 2011-05-09  9:56   ` Avi Kivity
  2011-05-09 10:40     ` Nadav Har'El
  0 siblings, 1 reply; 83+ messages in thread
From: Avi Kivity @ 2011-05-09  9:56 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

On 05/08/2011 11:22 AM, Nadav Har'El wrote:
> Move the setting of constant host-state fields (fields that do not change
> throughout the life of the guest) from vmx_vcpu_setup to a new common function
> vmx_set_constant_host_state(). This function will also be used to set the
> host state when running L2 guests.

>    */
>   static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
>   {
> -	u32 host_sysenter_cs, msr_low, msr_high;
> -	u32 junk;
> +	u32 msr_low, msr_high;


Unused?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 17/30] nVMX: Prepare vmcs02 from vmcs01 and vmcs12
  2011-05-08  8:23 ` [PATCH 17/30] nVMX: Prepare vmcs02 from vmcs01 and vmcs12 Nadav Har'El
@ 2011-05-09 10:12   ` Avi Kivity
  2011-05-09 10:27     ` Nadav Har'El
  0 siblings, 1 reply; 83+ messages in thread
From: Avi Kivity @ 2011-05-09 10:12 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

On 05/08/2011 11:23 AM, Nadav Har'El wrote:
> This patch contains code to prepare the VMCS which can be used to actually
> run the L2 guest, vmcs02. prepare_vmcs02 appropriately merges the information
> in vmcs12 (the vmcs that L1 built for L2) and in vmcs01 (our desires for our
> own guests).
> +/*
> + * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
> + * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
> + * with L0's requirements for its guest (a.k.a. vmcs01), so we can run the L2
> + * guest in a way that will both be appropriate to L1's requests, and our
> + * needs. In addition to modifying the active vmcs (which is vmcs02), this
> + * function also has additional necessary side-effects, like setting various
> + * vcpu->arch fields.
> + */
> +static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
> +{

<snip>

> +	vmcs_write64(VMCS_LINK_POINTER, vmcs12->vmcs_link_pointer);

I think this is wrong - anything having to do with vmcs linking will
need to be emulated; we can't let the cpu see the real value (and even
if we don't emulate, we have to translate addresses like you do for the
apic access page).

> +	vmcs_write64(TSC_OFFSET,
> +		vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);

This is probably wrong (everything with time is probably wrong), but we 
can deal with it (much) later.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 17/30] nVMX: Prepare vmcs02 from vmcs01 and vmcs12
  2011-05-09 10:12   ` Avi Kivity
@ 2011-05-09 10:27     ` Nadav Har'El
  2011-05-09 10:45       ` Avi Kivity
  0 siblings, 1 reply; 83+ messages in thread
From: Nadav Har'El @ 2011-05-09 10:27 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, gleb

Hi, and thanks again for the reviews.

On Mon, May 09, 2011, Avi Kivity wrote about "Re: [PATCH 17/30] nVMX: Prepare vmcs02 from vmcs01 and vmcs12":
> >+	vmcs_write64(TSC_OFFSET,
> >+		vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
> 
> This is probably wrong (everything with time is probably wrong), but we 
> can deal with it (much) later.

I thought this was right :-) Why do you believe it to be wrong?

L1 wants to add vmcs12->tsc_offset to its own TSC to generate L2's TSC.
But L1's TSC is itself with vmx->nested.vmcs01_tsc_offset from L0's TSC.
So their sum, vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset, is the
offset of L2's TSC from L0's TSC. Am I missing something?
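
Written out, with the two offsets named above (using TSC_Lx to denote the TSC
value as observed by Lx):

    TSC_L1 = TSC_L0 + vmcs01_tsc_offset
    TSC_L2 = TSC_L1 + vmcs12->tsc_offset
           = TSC_L0 + (vmcs01_tsc_offset + vmcs12->tsc_offset)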

Thanks,
Nadav.

-- 
Nadav Har'El                        |        Monday, May  9 2011, 5 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |I couldn't think of an interesting
http://nadav.harel.org.il           |signature to put here... Maybe next time.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 15/30] nVMX: Move host-state field setup to a function
  2011-05-09  9:56   ` Avi Kivity
@ 2011-05-09 10:40     ` Nadav Har'El
  0 siblings, 0 replies; 83+ messages in thread
From: Nadav Har'El @ 2011-05-09 10:40 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, gleb

On Mon, May 09, 2011, Avi Kivity wrote about "Re: [PATCH 15/30] nVMX: Move host-state field setup to a function":
> >  static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
> >  {
> >-	u32 host_sysenter_cs, msr_low, msr_high;
> >-	u32 junk;
> >+	u32 msr_low, msr_high;
> 
> 
> Unused?

Well, it actually is used, because I left the GUEST_IA32_PAT setting in
vmx_vcpu_setup. I guess I could have moved these two variables inside
the if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) block, but I
didn't. Similarly, the host_pat variable can also move inside the if().

I'll make these changes.

-- 
Nadav Har'El                        |        Monday, May  9 2011, 5 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Shortening Year-2000 to Y2K was just the
http://nadav.harel.org.il           |kind of thinking that caused that problem!

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 20/30] nVMX: Exiting from L2 to L1
  2011-05-08  8:25 ` [PATCH 20/30] nVMX: Exiting from L2 to L1 Nadav Har'El
@ 2011-05-09 10:45   ` Avi Kivity
  0 siblings, 0 replies; 83+ messages in thread
From: Avi Kivity @ 2011-05-09 10:45 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

On 05/08/2011 11:25 AM, Nadav Har'El wrote:
> This patch implements nested_vmx_vmexit(), called when the nested L2 guest
> exits and we want to run its L1 parent and let it handle this exit.
>
> Note that this will not necessarily be called on every L2 exit. L0 may decide
> to handle a particular exit on its own, without L1's involvement; In that
> case, L0 will handle the exit, and resume running L2, without running L1 and
> without calling nested_vmx_vmexit(). The logic for deciding whether to handle
> a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
> will appear in the next patch.
>
>
> /*
> + * prepare_vmcs12 is part of what we need to do when the nested L2 guest exits
> + * and we want to prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12),
> + * and this function updates it to reflect the changes to the guest state while
> + * L2 was running (and perhaps made some exits which were handled directly by L0
> + * without going back to L1), and to reflect the exit reason.
> + * Note that we do not have to copy here all VMCS fields, just those that
> + * could have changed by the L2 guest or the exit - i.e., the guest-state and
> + * exit-information fields only. Other fields are modified by L1 with VMWRITE,
> + * which already writes to vmcs12 directly.
> + */
> +void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
> +{
<snip>

> +	vmcs12->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);

Again, this should be emulated, not assigned to the guest.
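
One illustrative way to do that (assuming shadow-VMCS support is not exposed
to L1, so L1's value only ever lives in vmcs12 and is never handed to the
CPU):

	/* sketch: the all-ones value is architecturally defined as
	 * "no shadow VMCS", so the hardware never sees L1's link pointer */
	vmcs_write64(VMCS_LINK_POINTER, -1ull);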

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 17/30] nVMX: Prepare vmcs02 from vmcs01 and vmcs12
  2011-05-09 10:27     ` Nadav Har'El
@ 2011-05-09 10:45       ` Avi Kivity
  0 siblings, 0 replies; 83+ messages in thread
From: Avi Kivity @ 2011-05-09 10:45 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

On 05/09/2011 01:27 PM, Nadav Har'El wrote:
> Hi, and thanks again for the reviews.
>
> On Mon, May 09, 2011, Avi Kivity wrote about "Re: [PATCH 17/30] nVMX: Prepare vmcs02 from vmcs01 and vmcs12":
> >  >+	vmcs_write64(TSC_OFFSET,
> >  >+		vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
> >
> >  This is probably wrong (everything with time is probably wrong), but we
> >  can deal with it (much) later.
>
> I thought this was right :-) Why do you believe it to be wrong?

Just out of principle, everything to do with time is wrong.

> L1 wants to add vmcs12->tsc_offset to its own TSC to generate L2's TSC.
> But L1's TSC is itself with vmx->nested.vmcs01_tsc_offset from L0's TSC.
> So their sum, vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset, is the
> offset of L2's TSC from L0's TSC. Am I missing something?

Only Zach knows.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 22/30] nVMX: Correct handling of interrupt injection
  2011-05-08  8:26 ` [PATCH 22/30] nVMX: Correct handling of interrupt injection Nadav Har'El
@ 2011-05-09 10:57   ` Avi Kivity
  0 siblings, 0 replies; 83+ messages in thread
From: Avi Kivity @ 2011-05-09 10:57 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

On 05/08/2011 11:26 AM, Nadav Har'El wrote:
> When KVM wants to inject an interrupt, the guest should think a real interrupt
> has happened. Normally (in the non-nested case) this means checking that the
> guest doesn't block interrupts (and if it does, inject when it doesn't - using
> the "interrupt window" VMX mechanism), and setting up the appropriate VMCS
> fields for the guest to receive the interrupt.
>
> However, when we are running a nested guest (L2) and its hypervisor (L1)
> requested exits on interrupts (as most hypervisors do), the most efficient
> thing to do is to exit L2, telling L1 that the exit was caused by an
> interrupt, the one we were injecting; only when L1 asked not to be notified
> of interrupts should we inject directly to the running L2 guest (i.e.,
> the normal code path).
>
> However, properly doing what is described above requires invasive changes to
> the flow of the existing code, which we elected not to do in this stage.
> Instead we do something more simplistic and less efficient: we modify
> vmx_interrupt_allowed(), which kvm calls to see if it can inject the interrupt
> now, to exit from L2 to L1 before continuing the normal code. The normal kvm
> code then notices that L1 is blocking interrupts, and sets the interrupt
> window to inject the interrupt later to L1. Shortly after, L1 gets the
> interrupt while it is itself running, not as an exit from L2. The cost is an
> extra L1 exit (the interrupt window).
>
> Signed-off-by: Nadav Har'El<nyh@il.ibm.com>
> ---
>   arch/x86/kvm/vmx.c |   35 +++++++++++++++++++++++++++++++++++
>   1 file changed, 35 insertions(+)
>
> --- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:20.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:20.000000000 +0300
> @@ -3675,9 +3675,25 @@ out:
>   	return ret;
>   }
>
> +/*
> + * In nested virtualization, check if L1 asked to exit on external interrupts.
> + * For most existing hypervisors, this will always return true.
> + */
> +static bool nested_exit_on_intr(struct kvm_vcpu *vcpu)
> +{
> +	return get_vmcs12(vcpu)->pin_based_vm_exec_control &
> +		PIN_BASED_EXT_INTR_MASK;
> +}
> +
>   static void enable_irq_window(struct kvm_vcpu *vcpu)
>   {
>   	u32 cpu_based_vm_exec_control;
> +	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
> +		/* We can get here when nested_run_pending caused
> +		 * vmx_interrupt_allowed() to return false. In this case, do
> +		 * nothing - the interrupt will be injected later.
> +		 */
> +		return;

Why not do (or schedule) the nested vmexit here?  It's more natural than 
in vmx_interrupt_allowed() which from its name you'd expect to only read 
stuff.

I guess it can live for now if there's some unexpected complexity there.

>
>   	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
>   	cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
> @@ -3800,6 +3816,13 @@ static void vmx_set_nmi_mask(struct kvm_
>
>   static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
>   {
> +	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
> +		if (to_vmx(vcpu)->nested.nested_run_pending)
> +			return 0;
> +		nested_vmx_vmexit(vcpu, true);
> +		/* fall through to normal code, but now in L1, not L2 */
> +	}
> +
>   	return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
>   		!(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
>   			(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
> @@ -5463,6 +5486,14 @@ static int vmx_handle_exit(struct kvm_vc
>   	if (vmx->emulation_required && emulate_invalid_guest_state)
>   		return handle_invalid_guest_state(vcpu);
>
> +	/*
> +	 * the KVM_REQ_EVENT optimization bit is only on for one entry, and if
> +	 * we did not inject a still-pending event to L1 now because of
> +	 * nested_run_pending, we need to re-enable this bit.
> +	 */
> +	if (vmx->nested.nested_run_pending)
> +		kvm_make_request(KVM_REQ_EVENT, vcpu);
> +
>   	if (exit_reason == EXIT_REASON_VMLAUNCH ||
>   	    exit_reason == EXIT_REASON_VMRESUME)
>   		vmx->nested.nested_run_pending = 1;
> @@ -5660,6 +5691,8 @@ static void __vmx_complete_interrupts(st
>
>   static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
>   {
> +	if (is_guest_mode(&vmx->vcpu))
> +		return;
>   	__vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
>   				  VM_EXIT_INSTRUCTION_LEN,
>   				  IDT_VECTORING_ERROR_CODE);
> @@ -5667,6 +5700,8 @@ static void vmx_complete_interrupts(stru
>
>   static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
>   {
> +	if (is_guest_mode(vcpu))
> +		return;

Hmm.  What if L0 injected something into L2?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 24/30] nVMX: Correct handling of idt vectoring info
  2011-05-08  8:27 ` [PATCH 24/30] nVMX: Correct handling of idt vectoring info Nadav Har'El
@ 2011-05-09 11:04   ` Avi Kivity
  0 siblings, 0 replies; 83+ messages in thread
From: Avi Kivity @ 2011-05-09 11:04 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

On 05/08/2011 11:27 AM, Nadav Har'El wrote:
> This patch adds correct handling of IDT_VECTORING_INFO_FIELD for the nested
> case.
>
> When a guest exits while handling an interrupt or exception, we get this
> information in IDT_VECTORING_INFO_FIELD in the VMCS. When L2 exits to L1,
> there's nothing we need to do, because L1 will see this field in vmcs12, and
> handle it itself. However, when L2 exits and L0 handles the exit itself and
> plans to return to L2, L0 must inject this event to L2.
>
> In the normal non-nested case, the idt_vectoring_info case is discovered after
> the exit, and the decision to inject (though not the injection itself) is made
> at that point. However, in the nested case a decision of whether to return
> to L2 or L1 also happens during the injection phase (see the previous
> patches), so in the nested case we can only decide what to do about the
> idt_vectoring_info right after the injection, i.e., in the beginning of
> vmx_vcpu_run, which is the first time we know for sure if we're staying in
> L2 (i.e., nested_mode is true).
>
> +static void nested_handle_valid_idt_vectoring_info(struct vcpu_vmx *vmx)
> +{
> +	int irq  = vmx->idt_vectoring_info & VECTORING_INFO_VECTOR_MASK;
> +	int type = vmx->idt_vectoring_info & VECTORING_INFO_TYPE_MASK;
> +	int errCodeValid = vmx->idt_vectoring_info &
> +		VECTORING_INFO_DELIVER_CODE_MASK;

Innovative coding style.

> +	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
> +		irq | type | INTR_INFO_VALID_MASK | errCodeValid);
> +

Why not do a 1:1 copy?

> +	vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
> +		vmx->nested.vm_exit_instruction_len);
> +	if (errCodeValid)
> +		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
> +			vmx->nested.idt_vectoring_error_code);
> +}
> +
>   #ifdef CONFIG_X86_64
>   #define R "r"
>   #define Q "q"
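
For reference, the 1:1 copy suggested above could look like this (an
illustrative sketch; it relies on the IDT-vectoring information field having
the same layout as the VM-entry interruption-information field):

	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, vmx->idt_vectoring_info);
	vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
		     vmx->nested.vm_exit_instruction_len);
	if (vmx->idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK)
		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
			     vmx->nested.idt_vectoring_error_code);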

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
                   ` (29 preceding siblings ...)
  2011-05-08  8:30 ` [PATCH 30/30] nVMX: Documentation Nadav Har'El
@ 2011-05-09 11:18 ` Avi Kivity
  2011-05-09 11:37   ` Nadav Har'El
  2011-05-11  8:20   ` Gleb Natapov
  30 siblings, 2 replies; 83+ messages in thread
From: Avi Kivity @ 2011-05-09 11:18 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

On 05/08/2011 11:15 AM, Nadav Har'El wrote:
> Hi,
>
> This is the ninth iteration of the nested VMX patch set. This iteration
> addresses all of the comments and requests that were raised by reviewers in
> the previous rounds, with only a few exception listed below.
>
> Some of the issues which were solved in this version include:
>
>   * Overhauled the hardware VMCS (vmcs02) allocation. Previously we had up to
>     256 vmcs02s, one for each L2. Now we only have one, which is reused.
>     We also have a compile-time option VMCS02_POOL_SIZE to keep a bigger pool
>     of vmcs02s. This option will be useful in the future if vmcs02 won't be
>     filled from scratch on each entry from L1 to L2 (currently, it is).
>
>   * The vmcs01 structure, containing a copy of all fields from L1's VMCS, was
>     unnecessary, as all the necessary values are either known to KVM or appear
>     in vmcs12. This structure is now gone for good.
>
>   * There is no longer a "vmcs_fields" sub-structure that everyone disliked.
>     All the VMCS fields appear directly in the vmcs12 structure, which makes
>     the code simpler and more readable.
>
>   * Make sure that the vmcs12 fields have fixed sizes and location, and add
>     some extra padding, to support live migration and improve future-proofing.
>
>   * For some fields, nested exit used to fail to return the host-state as set
>     by L1. Fixed that.
>
>   * nested_vmx_exit_handled (deciding if to let L1 handle an exit, or handle it
>     in L0 and return to L2) is now more correct, and handles more exit reasons.
>
>   * Complete overhaul of the cr0, exception bitmap, cr3 and cr4 handling code.
>     The code is now shorter (uses existing functions like kvm_set_cr3, etc.),
>     more readable, and more uniform (no pieces of code for enable_ept and not,
>     less special code for cr0.TS, and none of that ugly cr0.PG monkey-business).
>
>   * Use kvm_register_write(), kvm_rip_read(), etc. Got rid of new and now
>     unneeded function sync_cached_regs_to_vcms().
>
>   * Fix return value of the VMX msrs to be more correct, and more constant
>     (not to needlessly vary on different hosts).
>
>   * Added some more missing verifications to vmcs12's fields (cleanly failing
>     the nested entry if these verifications fail).
>
>   * Expose the MSR-bitmap feature to L1. Every MSR access still exits to L0,
>     but slow exits to L1 are avoided when L1's MSR bitmap doesn't want it.
>
>   * Removed or rate limited printouts which could be exploited by guests.
>
>   * Fix VM_ENTRY_LOAD_IA32_PAT feature handling.
>
>   * Fixed potential bug and verified that nested vmx now works with both
>     CONFIG_PREEMPT and CONFIG_SMP enabled.
>
>   * Dozens of other code cleanups and bug fixes.
>
> Only a few issues from previous reviews remain unaddressed. These are:
>
>   * The interrupt injection and IDT_VECTORING_INFO_FIELD handling code was
>     still not rewritten. It works, though ;-)
>
>   * No KVM autotests for nested VMX yet.
>
>   * Merging of L0's and L1's MSR bitmaps (and IO bitmaps) is still not
>     supported. As explained above, the current code uses L1's MSR bitmap
>     to avoid costly exits to L1, but still suffers exits to L0 on each
>     MSR access in L2.
>
>   * Still no option for disabling some capabilities advertised to L1.
>
>   * No support for TPR_SHADOW feature for L1.
>
> This new set of patches applies to the current KVM trunk (I checked with
> 082f9eced53d50c136e42d072598da4be4b9ba23).
> If you wish, you can also check out an already-patched version of KVM from
> branch "nvmx9" of the repository:
> 	 git://github.com/nyh/kvm-nested-vmx.git
>
>
> About nested VMX:
> -----------------
>
> The following 30 patches implement nested VMX support. This feature enables
> a guest to use the VMX APIs in order to run its own nested guests.
> In other words, it allows running hypervisors (that use VMX) under KVM.
> Multiple guest hypervisors can be run concurrently, and each of those can
> in turn host multiple guests.
>
> The theory behind this work, our implementation, and its performance
> characteristics were presented in OSDI 2010 (the USENIX Symposium on
> Operating Systems Design and Implementation). Our paper was titled
> "The Turtles Project: Design and Implementation of Nested Virtualization",
> and was awarded "Jay Lepreau Best Paper". The paper is available online, at:
>
> 	http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf
>
> This patch set does not include all the features described in the paper.
> In particular, this patch set is missing nested EPT (L1 can't use EPT and
> must use shadow page tables). It is also missing some features required to
> run VMware hypervisors as a guest. These missing features will be sent as
> follow-on patches.
>
> Running nested VMX:
> ------------------
>
> The nested VMX feature is currently disabled by default. It must be
> explicitly enabled with the "nested=1" option to the kvm-intel module.
>
> No modifications are required to user space (qemu). However, qemu's default
> emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
> explicitly enabled, by giving qemu one of the following options:
>
>       -cpu host              (emulated CPU has all features of the real CPU)
>
>       -cpu qemu64,+vmx       (add just the vmx feature to a named CPU type)
>
>
> This version was only tested with KVM (64-bit) as a guest hypervisor, and
> Linux as a nested guest.

Okay, truly excellent.  The code is now a lot more readable, and I'm 
almost beginning to understand it.  The code comments are also very 
good, I wish we had the same quality comments in the rest of kvm.  We 
can probably merge the next iteration if there aren't significant 
comments from others.

The only worrying thing is the issue you raise in patch 8.  Is there a 
simple fix you can push that addresses correctness?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-09 11:18 ` [PATCH 0/30] nVMX: Nested VMX, v9 Avi Kivity
@ 2011-05-09 11:37   ` Nadav Har'El
  2011-05-11  8:20   ` Gleb Natapov
  1 sibling, 0 replies; 83+ messages in thread
From: Nadav Har'El @ 2011-05-09 11:37 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, gleb

On Mon, May 09, 2011, Avi Kivity wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9":
> Okay, truly excellent.  The code is now a lot more readable, and I'm 
> almost beginning to understand it.  The code comments are also very 
> good, I wish we had the same quality comments in the rest of kvm.  We 
> can probably merge the next iteration if there aren't significant 
> comments from others.

Thanks!

> The only worrying thing is the issue you raise in patch 8.  Is there a 
> simple fix you can push that addresses correctness?

I'll fix this for the next iteration.
I wanted to avoid changing the existing vcpus_on_cpu machinery, but you're
probably right - it's better to just do this correctly once and for all than
to try to explain the problem away, or to pray that future processors also
continue to work properly if you "forget" to vmclear a vmcs...

-- 
Nadav Har'El                        |        Monday, May  9 2011, 5 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |A diplomat thinks twice before saying
http://nadav.harel.org.il           |nothing.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 27/30] nVMX: Additional TSC-offset handling
  2011-05-08  8:29 ` [PATCH 27/30] nVMX: Additional TSC-offset handling Nadav Har'El
@ 2011-05-09 17:27   ` Zachary Amsden
  0 siblings, 0 replies; 83+ messages in thread
From: Zachary Amsden @ 2011-05-09 17:27 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb, avi

On 05/08/2011 01:29 AM, Nadav Har'El wrote:
> In the unlikely case that L1 does not capture MSR_IA32_TSC, L0 needs to
> emulate this MSR write by L2 by modifying vmcs02.tsc_offset. We also need to
> set vmcs12.tsc_offset, for this change to survive the next nested entry (see
> prepare_vmcs02()).
>    

Both changes look correct to me.

Zach

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-09 11:18 ` [PATCH 0/30] nVMX: Nested VMX, v9 Avi Kivity
  2011-05-09 11:37   ` Nadav Har'El
@ 2011-05-11  8:20   ` Gleb Natapov
  2011-05-12 15:42     ` Nadav Har'El
  1 sibling, 1 reply; 83+ messages in thread
From: Gleb Natapov @ 2011-05-11  8:20 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Nadav Har'El, kvm

On Mon, May 09, 2011 at 02:18:25PM +0300, Avi Kivity wrote:
> Okay, truly excellent.  The code is now a lot more readable, and I'm
> almost beginning to understand it.  The code comments are also very
> good, I wish we had the same quality comments in the rest of kvm.
> We can probably merge the next iteration if there aren't significant
> comments from others.
> 
I still feel that the interrupt injection path should be reworked to be
SVM-like before merging the code.

> The only worrying thing is the issue you raise in patch 8.  Is there
> a simple fix you can push that addresses correctness?
> 
> -- 
> I have a truly marvellous patch that fixes the bug which this
> signature is too narrow to contain.

--
			Gleb.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-11  8:20   ` Gleb Natapov
@ 2011-05-12 15:42     ` Nadav Har'El
  2011-05-12 15:57       ` Gleb Natapov
  2011-05-12 16:18       ` Avi Kivity
  0 siblings, 2 replies; 83+ messages in thread
From: Nadav Har'El @ 2011-05-12 15:42 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Avi Kivity, kvm, abelg

On Wed, May 11, 2011, Gleb Natapov wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9":
> On Mon, May 09, 2011 at 02:18:25PM +0300, Avi Kivity wrote:
>..
> > We can probably merge the next iteration if there aren't significant
> > comments from others.
>..
> > The only worrying thing is the issue you raise in patch 8.  Is there
> > a simple fix you can push that addresses correctness?

Hi Avi, I have fixed the few issues you raised in this round, including
a rewritten patch 8 that doesn't leave anything as a TODO. I am ready
to send another round of patches once we decide what to do about the issue
raised by Gleb yesterday:

> I still feel that interrupt injection path should be reworked to be
> SVM like before merging the code.

Both I and my colleague Abel Gordon reviewed the relevant code-path in nested
VMX and nested SVM, and carefully (re)read the relevant parts of the VMX spec,
and we have come to several conclusions. Avi and Gleb, I'd love to hear your
comments on these issues:

Our first conclusion is that my description of our interrupt injection patch
was unduly negative. It wrongly suggested that our solution of using the
interrupt window was inefficient, and was just an interim plug until a
cleaner solution could be written. This was wrong. In fact, the somewhat
strange exit and interrupt window that happen when injecting an interrupt
into L2 are *necessary* for accurate emulation. I explain why this is so in
the new version of this patch, which is included below. Please take a look.

Our second conclusion (and I hope that I'm not offending anyone here)
is that the changes for L2 interrupt injection in both SVM and VMX are both
ugly - they are just ugly in different ways. Both modified the non-nested
code in strange places in strange and unexpected ways, and tried to circumvent
the usual code path in x86.c without touching x86.c. They just did this in
two slightly different ways, neither (I think) is inherently uglier than the
other:

For accurate emulation (as I explain in the patch below), both codes need to
cause x86.c to change its normal behavior: It checks for interrupt_allowed()
and then (discovering that it isn't) enable_irq_window(). We want it to
instead exit to L1, and then enable the irq window on that. In the SVM code,
interrupt_allowed() is modified to always return false if nested, and
enable_irq_window() is modified to flag for an exit to L1 (which is performed
later) and turn on the interrupt window. In VMX, we modify the same places
but differently: In interrupt_allowed() we exit to L1 immediately (it's a
short operation, so we didn't mind doing it in atomic context), and
enable_irq_window() doesn't need to be changed (it already runs in L1).

Continuing to survey the differences between nested VMX and SVM, there
were other different choices made besides the ones mentioned above. Nested SVM
uses an additional trick, of skipping one round of running the guest, when
it discovers the need for an exit in the "wrong" place, so that it can get to
the "right" place again. Nested VMX solved the same problems with other
mechanisms, like a separate piece of code for handling IDT_VECTORING_INFO,
and nested_run_pending. Some differences can also be explained by the different
design of (non-nested) vmx.c vs svm.c - e.g., svm_complete_interrupts() is
called during the handle_exit(), while vmx_complete_interrupts() is called
after handle_exit() has completed (in atomic context) - this is one of the
reasons the nested IDT_VECTORING_INFO path is different.

I think that both solutions are far from being beautiful or easy to understand.
Nested SVM is perhaps slightly less ugly but also has a small performance cost
(with the extra vcpu_run iteration doing nothing) - and I think neither is
inherently better than the other.

So I guess my question is, and Avi and Gleb I'd love your comments about this
question: Is it really beneficial that I rewrite the "ugly" nested-VMX
injection code to be somewhat-ugly in exactly the same way as the nested-SVM
injection code? Won't it be more beneficial to rewrite *both* codes to
be cleaner? This would probably mean changes to the common x86.c, that both
will use. For example, x86.c's injection code could check the nested case
itself, perhaps calling a special x86_op to handle the nested injection (exit,
set interrupt window, etc.) instead of calling the regular
interrupt_allowed/enable_irq_window and forcing those to be modified in
mysterious ways.

Now that there's a is_guest_mode(vcpu) function, more nested-related code
can be moved to x86.c, to make both nested VMX and nested SVM code cleaner.

Waiting to hear your opinions and suggestions,

Thanks,
Nadav.


=============================================================
Subject: [PATCH 23/31] nVMX: Correct handling of interrupt injection

The code in this patch correctly emulates external-interrupt injection
while a nested guest L2 is running. Because of this code's relative
oddity and un-obviousness, I include here a longer-than-usual justification
for what it does - longer than the code itself ;-)

To understand how to correctly emulate interrupt injection while L2 is
running, let's look first at what we need to emulate: How would things look
if the extra L0 hypervisor layer were removed, and instead of L0 injecting
an interrupt we had hardware delivering an interrupt?

Now L1 runs on bare metal, with a guest L2 and the hardware generates an
interrupt. Assuming that L1 set PIN_BASED_EXT_INTR_MASK to 1, and 
VM_EXIT_ACK_INTR_ON_EXIT to 0 (we'll revisit these assumptions below), what
happens now is this: The processor exits from L2 to L1, with an
external-interrupt exit reason but without an interrupt vector. L1 runs,
with interrupts disabled, and it doesn't yet know what the interrupt was.
Soon after, it enables interrupts and only at that moment does it get the
interrupt from the processor. When L1 is KVM, Linux handles this interrupt.

Now we need exactly the same thing to happen when that L1->L2 system runs
on top of L0, instead of real hardware. This is how we do this:

When L0 wants to inject an interrupt, it needs to exit from L2 to L1, with
external-interrupt exit reason (without an interrupt vector), and run L1.
Just like in the bare metal case, L0 likely can't deliver the interrupt to
L1 now because L1 is running with interrupts disabled, in which case L0 turns
on the interrupt window when running L1 after the exit. L1 will soon enable
interrupts, and at that point L0 will gain control again and inject the
interrupt to L1.

Finally, there is an extra complication in the code: when nested_run_pending
is set, we cannot return to L1 now and must launch L2. We need to remember the
interrupt we wanted to inject (and not clear it now), and do it on the
next exit.

The above explanation shows that the relative strangeness of the nested
interrupt injection code in this patch, and the extra interrupt-window
exit incurred, are in fact necessary for accurate emulation, and are not
just an unoptimized implementation.

Let's revisit now the two assumptions made above:

If L1 turns off PIN_BASED_EXT_INTR_MASK (no hypervisor that I know
does, by the way), things are simple: L0 may inject the interrupt directly
to the L2 guest - using the normal code path that injects to any guest.
We support this case in the code below.

If L1 turns on VM_EXIT_ACK_INTR_ON_EXIT (again, no hypervisor that I know
does), things look very different from the description above: L1 expects
to see an exit from L2 with the interrupt vector already filled in the exit
information, and does not expect to be interrupted again with this interrupt.
The current code does not (yet) support this case, so we do not allow the
VM_EXIT_ACK_INTR_ON_EXIT exit-control to be turned on by L1.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-12 17:39:33.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-12 17:39:33.000000000 +0300
@@ -3703,9 +3703,25 @@ out:
 	return ret;
 }
 
+/*
+ * In nested virtualization, check if L1 asked to exit on external interrupts.
+ * For most existing hypervisors, this will always return true.
+ */
+static bool nested_exit_on_intr(struct kvm_vcpu *vcpu)
+{
+	return get_vmcs12(vcpu)->pin_based_vm_exec_control &
+		PIN_BASED_EXT_INTR_MASK;
+}
+
 static void enable_irq_window(struct kvm_vcpu *vcpu)
 {
 	u32 cpu_based_vm_exec_control;
+	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
+		/* We can get here when nested_run_pending caused
+		 * vmx_interrupt_allowed() to return false. In this case, do
+		 * nothing - the interrupt will be injected later.
+		 */
+		return;
 
 	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
 	cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
@@ -3828,6 +3844,15 @@ static void vmx_set_nmi_mask(struct kvm_
 
 static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
 {
+	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
+		if (to_vmx(vcpu)->nested.nested_run_pending)
+			return 0;
+		nested_vmx_vmexit(vcpu);
+		get_vmcs12(vcpu)->vm_exit_reason =
+			EXIT_REASON_EXTERNAL_INTERRUPT;
+		/* fall through to normal code, but now in L1, not L2 */
+	}
+
 	return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
 		!(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
 			(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
@@ -5515,6 +5540,14 @@ static int vmx_handle_exit(struct kvm_vc
 	if (vmx->emulation_required && emulate_invalid_guest_state)
 		return handle_invalid_guest_state(vcpu);
 
+	/*
+	 * the KVM_REQ_EVENT optimization bit is only on for one entry, and if
+	 * we did not inject a still-pending event to L1 now because of
+	 * nested_run_pending, we need to re-enable this bit.
+	 */
+	if (vmx->nested.nested_run_pending)
+		kvm_make_request(KVM_REQ_EVENT, vcpu);
+
 	if (exit_reason == EXIT_REASON_VMLAUNCH ||
 	    exit_reason == EXIT_REASON_VMRESUME)
 		vmx->nested.nested_run_pending = 1;
@@ -5712,6 +5745,8 @@ static void __vmx_complete_interrupts(st
 
 static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 {
+	if (is_guest_mode(&vmx->vcpu))
+		return;
 	__vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
 				  VM_EXIT_INSTRUCTION_LEN,
 				  IDT_VECTORING_ERROR_CODE);
@@ -5719,6 +5754,8 @@ static void vmx_complete_interrupts(stru
 
 static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
 {
+	if (is_guest_mode(vcpu))
+		return;
 	__vmx_complete_interrupts(to_vmx(vcpu),
 				  vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
 				  VM_ENTRY_INSTRUCTION_LEN,

-- 
Nadav Har'El                        |      Thursday, May 12 2011, 8 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Linux: Because rebooting is for adding
http://nadav.harel.org.il           |new hardware.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-12 15:42     ` Nadav Har'El
@ 2011-05-12 15:57       ` Gleb Natapov
  2011-05-12 16:08         ` Avi Kivity
  2011-05-12 16:31         ` Nadav Har'El
  2011-05-12 16:18       ` Avi Kivity
  1 sibling, 2 replies; 83+ messages in thread
From: Gleb Natapov @ 2011-05-12 15:57 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Avi Kivity, kvm, abelg

On Thu, May 12, 2011 at 06:42:28PM +0300, Nadav Har'El wrote:
> So I guess my question is, and Avi and Gleb I'd love your comments about this
> question: Is it really beneficial that I rewrite the "ugly" nested-VMX
> injection code to be somewhat-ugly in exactly the same way that nested-SVM
> injection code? Won't it be more beneficial to rewrite *both* codes to
> be cleaner? This would probably mean changes to the common x86.c, that both
> will use. For example, x86.c's injection code could check the nested case
> itself, perhaps calling a special x86_op to handle the nested injection (exit,
> set interrupt window, etc.) instead of calling the regular
> interrupt_allowed/enable_irq_window and forcing those to be modified in
> mysterious ways.
> 
That is exactly what should be done and what I have in mind when I am
asking to change the VMX code to be SVM-like. To achieve what you outlined
above gradually, we need to move common VMX and SVM logic into x86.c
and then change the logic to be more nested friendly.  If VMX has
different interrupt handling logic, we will have to add an additional step:
making the SVM and VMX code similar (so it will be possible to move it
into x86.c).  All I am asking is to make this step now, before the merge,
while the code is still actively developed.
 
> Now that there's a is_guest_mode(vcpu) function, more nested-related code
> can be moved to x86.c, to make both nested VMX and nested SVM code cleaner.
> 
> Waiting to hear your opinions and suggestions,
> 

--
			Gleb.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-12 15:57       ` Gleb Natapov
@ 2011-05-12 16:08         ` Avi Kivity
  2011-05-12 16:14           ` Gleb Natapov
  2011-05-12 16:31         ` Nadav Har'El
  1 sibling, 1 reply; 83+ messages in thread
From: Avi Kivity @ 2011-05-12 16:08 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Nadav Har'El, kvm, abelg

On 05/12/2011 06:57 PM, Gleb Natapov wrote:
> On Thu, May 12, 2011 at 06:42:28PM +0300, Nadav Har'El wrote:
> >  So I guess my question is, and Avi and Gleb I'd love your comments about this
> >  question: Is it really beneficial that I rewrite the "ugly" nested-VMX
> >  injection code to be somewhat-ugly in exactly the same way that nested-SVM
> >  injection code? Won't it be more beneficial to rewrite *both* codes to
> >  be cleaner? This would probably mean changes to the common x86.c, that both
> >  will use. For example, x86.c's injection code could check the nested case
> >  itself, perhaps calling a special x86_op to handle the nested injection (exit,
> >  set interrupt window, etc.) instead of calling the regular
> >  interrupt_allowed/enable_irq_window and forcing those to be modified in
> >  mysterious ways.
> >
> That is exactly what should be done and what I have in mind when I am
> asking to change VMX code to be SVM like. To achieve what you outlined
> above gradually we need to move common VMX and SVM logic into x86.c
> and then change the logic to be more nested friendly.  If VMX will have
> different interrupt handling logic we will have to have additional step:
> making SVM and VMX code similar (so it will be possible to move it
> into x86.c).  All I am asking is to make this step now, before merge,
> while the code is still actively developed.
>

I don't think it's fair to ask Nadav to do a unification right now.  Or 
productive - there's a limit to the size of a patchset that can be 
carried outside.  Also it needs to be done with future changes to interrupt
injection in mind, like using the svm interrupt queue to avoid an interrupt
window exit.

Are there vmx-only changes that you think can help?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-12 16:08         ` Avi Kivity
@ 2011-05-12 16:14           ` Gleb Natapov
  0 siblings, 0 replies; 83+ messages in thread
From: Gleb Natapov @ 2011-05-12 16:14 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Nadav Har'El, kvm, abelg

On Thu, May 12, 2011 at 07:08:59PM +0300, Avi Kivity wrote:
> On 05/12/2011 06:57 PM, Gleb Natapov wrote:
> >On Thu, May 12, 2011 at 06:42:28PM +0300, Nadav Har'El wrote:
> >>  So I guess my question is, and Avi and Gleb I'd love your comments about this
> >>  question: Is it really beneficial that I rewrite the "ugly" nested-VMX
> >>  injection code to be somewhat-ugly in exactly the same way that nested-SVM
> >>  injection code? Won't it be more beneficial to rewrite *both* codes to
> >>  be cleaner? This would probably mean changes to the common x86.c, that both
> >>  will use. For example, x86.c's injection code could check the nested case
> >>  itself, perhaps calling a special x86_op to handle the nested injection (exit,
> >>  set interrupt window, etc.) instead of calling the regular
> >>  interrupt_allowed/enable_irq_window and forcing those to be modified in
> >>  mysterious ways.
> >>
> >That is exactly what should be done and what I have in mind when I am
> >asking to change VMX code to be SVM like. To achieve what you outlined
> >above gradually we need to move common VMX and SVM logic into x86.c
> >and then change the logic to be more nested friendly.  If VMX will have
> >different interrupt handling logic we will have to have additional step:
> >making SVM and VMX code similar (so it will be possible to move it
> >into x86.c).  All I am asking is to make this step now, before merge,
> >while the code is still actively developed.
> >
> 
> I don't think it's fair to ask Nadav to do a unification right now.
Definitely. And I am not asking for it!

> Or productive - there's a limit to the size of a patchset that can
> be carried outside.  Also it needs to be done in consideration with
> future changes to interrupt injection, like using the svm interrupt
> queue to avoid an interrupt window exit.
> 
> Are there vmx-only changes that you think can help?
> 
I am asking for a vmx-only change, actually: to make the interrupt handling
logic the same as SVM's. This will allow me or you or someone else to handle
the unification part later without rewriting VMX.
 
--
			Gleb.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-12 15:42     ` Nadav Har'El
  2011-05-12 15:57       ` Gleb Natapov
@ 2011-05-12 16:18       ` Avi Kivity
  1 sibling, 0 replies; 83+ messages in thread
From: Avi Kivity @ 2011-05-12 16:18 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Gleb Natapov, kvm, abelg

On 05/12/2011 06:42 PM, Nadav Har'El wrote:
> Our second conclusion (and I hope that I'm not offending anyone here)
> is that the changes for L2 interrupt injection in both SVM and VMX are both
> ugly - they are just ugly in different ways. Both modified the non-nested
> code in strange places in strange and unexpected ways, and tried to circumvent
> the usual code path in x86.c without touching x86.c. They just did this in
> two slightly different ways, neither (I think) is inherently uglier than the
> other:
>
> For accurate emulation (as I explain in the patch below), both codes need to
> cause x86.c to change its normal behavior: It checks for interrupt_allowed()
> and then (discovering that it isn't) enable_irq_window(). We want it to
> instead exit to L1, and then enable the irq window on that. In the SVM code,
> interrupt_allowed() is modified to always return false if nested, and
> enable_irq_window() is modified to flag for an exit to L1 (which is performed
> later) and turn on the interrupt window. In VMX, we modify the same places
> but differently: In interrupt_allowed() we exit to L1 immediately (it's a
> short operation, we didn't mind to do it in atomic context), and
> enable_irq_window() doesn't need to be changed (it already runs in L1).

I think that interrupt_allowed() should return true in L2 (if L1 has 
configured external interrupts to be trapped), and interrupt injection 
modified to cause an exit instead of queueing an interrupt.  Note that 
on vmx, intercepted interrupt injection can take two different paths 
depending on whether the L1 wants interrupts acked or not.
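
A rough sketch of that alternative (hypothetical, not part of these patches;
it reuses nested_exit_on_intr() from the patch quoted earlier in the thread)
would be to report interrupts as always "allowed" while L2 runs with
external-interrupt exiting, and to turn the actual injection into a nested VM
exit instead of a write to VM_ENTRY_INTR_INFO_FIELD:

	static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
	{
		/* while L2 runs and L1 traps external interrupts, an
		 * interrupt can always be "taken" - as an exit to L1 */
		if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
			return 1;

		return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
			!(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
			  (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
	}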

> Continuing to survey the difference between nested VMX and and SVM, there
> were other different choices made besides the ones mentioned above. nested SVM
> uses an additional trick, of skipping one round of running the guest, when
> it discovered the need for an exit in the "wrong" place, so it can get to
> the "right" place again. Nested VMX solved the same problems with other
> mechanisms, like a separate piece of code for handling IDT_VECTORING_INFO,
> and nested_run_pending. Some differences can also be explained by the different
> design of (non-nested) vmx.c vs svm.c - e.g., svm_complete_interrupts() is
> called during the handle_exit(), while vmx_complete_interrupts() is called
> after handle_exit() has completed (in atomic context) - this is one of the
> reasons the nested IDT_VECTORING_INFO path is different.
>
> I think that both solutions are far from being beautiful or easy to understand.
> Nested SVM is perhaps slightly less ugly but also has a small performance cost
> (with the extra vcpu_run iteration doing nothing) - and I think neither is
> inherently better than the other.
>
> So I guess my question is, and Avi and Gleb I'd love your comments about this
> question: Is it really beneficial that I rewrite the "ugly" nested-VMX
> injection code to be somewhat-ugly in exactly the same way that nested-SVM
> injection code? Won't it be more beneficial to rewrite *both* codes to
> be cleaner? This would probably mean changes to the common x86.c, that both
> will use. For example, x86.c's injection code could check the nested case
> itself, perhaps calling a special x86_op to handle the nested injection (exit,
> set interrupt window, etc.) instead of calling the regular
> interrupt_allowed/enable_irq_window and forcing those to be modified in
> mysterious ways.
>
> Now that there's a is_guest_mode(vcpu) function, more nested-related code
> can be moved to x86.c, to make both nested VMX and nested SVM code cleaner.

I am fine with committing as is.  Later we can modify both vmx and svm 
to do the right thing (whatever that is), and later merge them into x86.c.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-12 15:57       ` Gleb Natapov
  2011-05-12 16:08         ` Avi Kivity
@ 2011-05-12 16:31         ` Nadav Har'El
  2011-05-12 16:51           ` Gleb Natapov
  1 sibling, 1 reply; 83+ messages in thread
From: Nadav Har'El @ 2011-05-12 16:31 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Avi Kivity, kvm, abelg

Hi,

On Thu, May 12, 2011, Gleb Natapov wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9":
> That is exactly what should be done and what I have in mind when I am
> asking to change VMX code to be SVM like. To achieve what you outlined
> above gradually we need to move common VMX and SVM logic into x86.c
> and then change the logic to be more nested friendly.  If VMX will have
> different interrupt handling logic we will have to have additional step:
> making SVM and VMX code similar (so it will be possible to move it
> into x86.c).

But if my interpretation of the code is correct, SVM isn't much closer
than VMX to the goal of moving this logic to x86.c. When some logic is
moved there, both SVM and VMX code will need to change - perhaps even
considerably. So how will it be helpful to make VMX behave exactly like
SVM does now, when the latter will also need to change considerably?

It sounds to me that working to move some nested-interrupt-injection related
logic to x86.c is a worthy effort (and I'd be happy to start some discussion
on how to best design it), but working to duplicate the exact idiosyncrasies
of the current SVM implementation in the VMX code is not as productive.
But as usual, I'm open to arguments (or dictums ;-)) that I'm wrong here.

By the way, I hope that I'm being fair to the nested SVM implementation when,
after only a short review, I call some of the code there idiosyncratic.
Basically I am working under the assumption that some of the modifications
there (I gave examples in my previous post) were done in the way they were
just to fit the mold of x86.c, and that it would have been possible to alter
x86.c in a way that could make the nested SVM code simpler - and quite
different (in the area of interrupt injection).

> All I am asking is to make this step now, before merge,
> while the code is still actively developed.

The code will continue to be actively developed even after the merge :-)

Nadav.

-- 
Nadav Har'El                        |      Thursday, May 12 2011, 9 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |A city is a large community where people
http://nadav.harel.org.il           |are lonesome together.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-12 16:31         ` Nadav Har'El
@ 2011-05-12 16:51           ` Gleb Natapov
  2011-05-12 17:00             ` Avi Kivity
  2011-05-22 19:32             ` Nadav Har'El
  0 siblings, 2 replies; 83+ messages in thread
From: Gleb Natapov @ 2011-05-12 16:51 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Avi Kivity, kvm, abelg

On Thu, May 12, 2011 at 07:31:15PM +0300, Nadav Har'El wrote:
> Hi,
> 
> On Thu, May 12, 2011, Gleb Natapov wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9":
> > That is exactly what should be done and what I have in mind when I am
> > asking to change VMX code to be SVM like. To achieve what you outlined
> > above gradually we need to move common VMX and SVM logic into x86.c
> > and then change the logic to be more nested friendly.  If VMX will have
> > different interrupt handling logic we will have to have additional step:
> > making SVM and VMX code similar (so it will be possible to move it
> > into x86.c).
> 
> But if my interpretation of the code is correct, SVM isn't much closer
> than VMX to the goal of moving this logic to x86.c. When some logic is
> moved there, both SVM and VMX code will need to change - perhaps even
> considerably. So how will it be helpful to make VMX behave exactly like
> SVM does now, when the latter will also need to change considerably?
> 
The SVM design is much closer to the goal of moving the logic into x86.c
because IIRC it does not bypass parsing of the IDT vectoring info into an
arch-independent structure, while the VMX code uses vmx->idt_vectoring_info
directly. SVM is much closer to working migration with nested guests for the
same reason.

> It sounds to me that working to move some nested-interrupt-injection related
> logic to x86.c is a worthy effort (and I'd be happy to start some discussion
> on how to best design it), but working to duplicate the exact idiosyncrasies
> of the current SVM implementation in the VMX code is not as productive.
> But as usual, I'm open to arguments (or dictums ;-)) that I'm wrong here.
> 
> By the way, I hope that I'm being fair to the nested SVM implementation when
> I call some of the code there, after only a short review, idiosyncrasies.
> Basically I am working under the assumption that some of the modifications
> there (I gave examples in my previous post) were done in the way they were
> just to fit the mold of x86.c, and that it would have been possible to alter
> x86.c in a way that could make the nested SVM code simpler - and quite
> different (in the area of interrupt injection).
I think (I haven't looked at the code for a long time) it can benefit
from additional x86 ops callbacks, but it would be silly to add one set
of callbacks to support the SVM way of doing things and another set for
VMX - not because the archs are so different that unification is
impossible (they are not; they are in fact very close in this area), but
because the two implementations are different.

> 
> > All I am asking is to make this step now, before merge,
> > while the code is still actively developed.
> 
> The code will continue to be actively developed even after the merge :-)
> 
Amen! :)

--
			Gleb.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-12 16:51           ` Gleb Natapov
@ 2011-05-12 17:00             ` Avi Kivity
  2011-05-15 23:11               ` Nadav Har'El
  2011-05-22 19:32             ` Nadav Har'El
  1 sibling, 1 reply; 83+ messages in thread
From: Avi Kivity @ 2011-05-12 17:00 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Nadav Har'El, kvm, abelg

On 05/12/2011 07:51 PM, Gleb Natapov wrote:
> >
> >  But if my interpretation of the code is correct, SVM isn't much closer
> >  than VMX to the goal of moving this logic to x86.c. When some logic is
> >  moved there, both SVM and VMX code will need to change - perhaps even
> >  considerably. So how will it be helpful to make VMX behave exactly like
> >  SVM does now, when the latter will also need to change considerably?
> >
> SVM design is much close to the goal of moving the logic into x86.c
> because IIRC it does not bypass parsing of IDT vectoring info into arch
> independent structure. VMX code uses vmx->idt_vectoring_info directly.
> SVM is much close to working migration with nested guests for the same
> reason.

Ah, yes.  For live migration to work, all vmcb state must be accessible 
via vendor-independent accessors once an exit is completely handled.  
For example, GPRs are accessible via kvm_register_read(), and without 
nesting, interrupt state is stowed in the interrupt queue, but if you 
keep IDT_VECTORING_INFO live between exit and entry, you can lose it if 
you migrate at this point.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-12 17:00             ` Avi Kivity
@ 2011-05-15 23:11               ` Nadav Har'El
  2011-05-16  6:38                 ` Gleb Natapov
  2011-05-16  9:50                 ` Avi Kivity
  0 siblings, 2 replies; 83+ messages in thread
From: Nadav Har'El @ 2011-05-15 23:11 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Gleb Natapov, kvm, abelg

On Thu, May 12, 2011, Avi Kivity wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9":
> Ah, yes.  For live migration to work, all vmcb state must be accessible 
> via vendor-independent accessors once an exit is completely handled.  
> For example, GPRs are accessible via kvm_register_read(), and without 
> nesting, interrupt state is stowed in the interrupt queue, but if you 
> keep IDT_VECTORING_INFO live between exit and entry, you can lose it if 
> you migrate at this point.

Hi, I can quite easily save this state in a different place that is saved -
the easiest would be to use vmcs12, which has room for exactly the fields
we want to save (and they are rewritten anyway when we exit to L1).

Avi, would you like me to use this sort of solution to avoid the extra
state? Of course, considering that live migration with nested VMX
probably still doesn't work for a dozen other reasons anyway :(

Or do you consider this not enough, and rather that it is necessary that
nested VMX should use exactly the same logic as nested SVM does - namely,
use tricks like SVM's "exit_required" instead of our different tricks?

Nadav.

-- 
Nadav Har'El                        |       Monday, May 16 2011, 12 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Corduroy pillows - they're making
http://nadav.harel.org.il           |headlines!

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-15 23:11               ` Nadav Har'El
@ 2011-05-16  6:38                 ` Gleb Natapov
  2011-05-16  7:44                   ` Nadav Har'El
  2011-05-16  9:50                 ` Avi Kivity
  1 sibling, 1 reply; 83+ messages in thread
From: Gleb Natapov @ 2011-05-16  6:38 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Avi Kivity, kvm, abelg

On Mon, May 16, 2011 at 02:11:40AM +0300, Nadav Har'El wrote:
> On Thu, May 12, 2011, Avi Kivity wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9":
> > Ah, yes.  For live migration to work, all vmcb state must be accessible 
> > via vendor-independent accessors once an exit is completely handled.  
> > For example, GPRs are accessible via kvm_register_read(), and without 
> > nesting, interrupt state is stowed in the interrupt queue, but if you 
> > keep IDT_VECTORING_INFO live between exit and entry, you can lose it if 
> > you migrate at this point.
> 
> Hi, I can quite easily save this state in a different place which is saved -
> The easiest will just be to use vmcs12, which has place for exactly the fields
> we want to save (and they are rewritten anyway when we exit to L1).
> 
This will not address the problem that the state will not be visible to
generic logic in x86.c.

> Avi, would you you like me use this sort of solution to avoid the extra
> state? Of course, considering that anyway, live migration with nested VMX
> probably still doesn't work for a dozen other reasons :(
> 
> Or do you consider this not enough, and rather that it is necessary that
> nested VMX should use exactly the same logic as nested SVM does - namely,
> use tricks like SVM's "exit_required" instead of our different tricks?
> 
Given the two solutions, I prefer the SVM one. Yes, I know that you asked Avi :)

--
			Gleb.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-16  6:38                 ` Gleb Natapov
@ 2011-05-16  7:44                   ` Nadav Har'El
  2011-05-16  7:57                     ` Gleb Natapov
  0 siblings, 1 reply; 83+ messages in thread
From: Nadav Har'El @ 2011-05-16  7:44 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Avi Kivity, kvm, abelg

On Mon, May 16, 2011, Gleb Natapov wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9":
> > Hi, I can quite easily save this state in a different place which is saved -
> > The easiest will just be to use vmcs12, which has place for exactly the fields
> > we want to save (and they are rewritten anyway when we exit to L1).
> > 
> This will not address the problem that the state will not be visible to
> generic logic in x86.c.

Maybe I misunderstood your intention, but given that vmcs12 is in guest
memory, which is migrated as well, isn't that enough (for the live migration
issue)?


-- 
Nadav Har'El                        |       Monday, May 16 2011, 12 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |The trouble with political jokes is they
http://nadav.harel.org.il           |get elected.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-16  7:44                   ` Nadav Har'El
@ 2011-05-16  7:57                     ` Gleb Natapov
  0 siblings, 0 replies; 83+ messages in thread
From: Gleb Natapov @ 2011-05-16  7:57 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Avi Kivity, kvm, abelg

On Mon, May 16, 2011 at 10:44:28AM +0300, Nadav Har'El wrote:
> On Mon, May 16, 2011, Gleb Natapov wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9":
> > > Hi, I can quite easily save this state in a different place which is saved -
> > > The easiest will just be to use vmcs12, which has place for exactly the fields
> > > we want to save (and they are rewritten anyway when we exit to L1).
> > > 
> > This will not address the problem that the state will not be visible to
> > generic logic in x86.c.
> 
> Maybe I misunderstood your intention, but given that vmcs12 is in guest
> memory, which is migrated as well, isn't that enough (for the live migration
> issue)?
> 
I pointed out two issues. Migration was the second and minor one, since
there is a long way to go before migration will work with nested guests
anyway. The first one was much more important, so let me repeat it. To
move nested event handling into generic code, the IDT vectoring info has
to be parsed into the data structure that the event injection code in
x86.c actually works with. That code does not manipulate
vmx->idt_vectoring_info or its SVM analog directly; it works with the
event queue instead. SVM does this right, and there is nothing I can see
that prevents moving the SVM logic into x86.c. I don't see how your VMX
logic can be moved into x86.c as-is, since it works on internal VMX
fields directly.
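
For illustration, a simplified sketch of the SVM-style pattern being
described (modeled loosely on svm_complete_interrupts(); error-code and
soft-interrupt handling omitted):

	/* After an exit, decode the hardware's exit interrupt info into the
	 * arch-independent queues that x86.c's injection code works with. */
	static void complete_interrupts_sketch(struct vcpu_svm *svm)
	{
		u32 exitintinfo = svm->vmcb->control.exit_int_info;
		u8 vector = exitintinfo & SVM_EXITINTINFO_VEC_MASK;

		if (!(exitintinfo & SVM_EXITINTINFO_VALID))
			return;

		switch (exitintinfo & SVM_EXITINTINFO_TYPE_MASK) {
		case SVM_EXITINTINFO_TYPE_NMI:
			svm->vcpu.arch.nmi_injected = true;
			break;
		case SVM_EXITINTINFO_TYPE_EXEPT:
			kvm_requeue_exception(&svm->vcpu, vector);
			break;
		case SVM_EXITINTINFO_TYPE_INTR:
			kvm_queue_interrupt(&svm->vcpu, vector, false);
			break;
		}
	}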

--
			Gleb.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-15 23:11               ` Nadav Har'El
  2011-05-16  6:38                 ` Gleb Natapov
@ 2011-05-16  9:50                 ` Avi Kivity
  2011-05-16 10:20                   ` Avi Kivity
  1 sibling, 1 reply; 83+ messages in thread
From: Avi Kivity @ 2011-05-16  9:50 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Gleb Natapov, kvm, abelg

On 05/16/2011 02:11 AM, Nadav Har'El wrote:
> On Thu, May 12, 2011, Avi Kivity wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9":
> >  Ah, yes.  For live migration to work, all vmcb state must be accessible
> >  via vendor-independent accessors once an exit is completely handled.
> >  For example, GPRs are accessible via kvm_register_read(), and without
> >  nesting, interrupt state is stowed in the interrupt queue, but if you
> >  keep IDT_VECTORING_INFO live between exit and entry, you can lose it if
> >  you migrate at this point.
>
> Hi, I can quite easily save this state in a different place which is saved -
> The easiest will just be to use vmcs12, which has place for exactly the fields
> we want to save (and they are rewritten anyway when we exit to L1).

You would still need ->valid_idt_vectoring_info to know you need special 
handling, no?

> Avi, would you you like me use this sort of solution to avoid the extra
> state? Of course, considering that anyway, live migration with nested VMX
> probably still doesn't work for a dozen other reasons :(
>
> Or do you consider this not enough, and rather that it is necessary that
> nested VMX should use exactly the same logic as nested SVM does - namely,
> use tricks like SVM's "exit_required" instead of our different tricks?

I think svm is rather simple here, using svm_complete_interrupts() to 
decode exit_int_info into the arch-independent structures.  I don't 
think ->exit_required is a hack - it could probably be improved, but I 
think it essentially does the right thing.  For example, 
svm_nmi_allowed() will return true if in guest mode and NMI interception 
is enabled.
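
Roughly, the behaviour described above would look like this (a sketch,
not the actual svm.c code; nested_exit_on_nmi() stands in for the check
that L1 intercepts NMIs):

	static int svm_nmi_allowed_sketch(struct kvm_vcpu *vcpu)
	{
		struct vcpu_svm *svm = to_svm(vcpu);

		/* While L2 runs and L1 intercepts NMIs, report "allowed":
		 * the pending NMI becomes a nested vmexit instead of being
		 * injected into L2. */
		if (is_guest_mode(vcpu) && nested_exit_on_nmi(svm))
			return 1;

		return !(svm->vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) &&
		       !(vcpu->arch.hflags & HF_NMI_MASK);
	}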

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-16  9:50                 ` Avi Kivity
@ 2011-05-16 10:20                   ` Avi Kivity
  0 siblings, 0 replies; 83+ messages in thread
From: Avi Kivity @ 2011-05-16 10:20 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Gleb Natapov, kvm, abelg

On 05/16/2011 12:50 PM, Avi Kivity wrote:
>> Or do you consider this not enough, and rather that it is necessary that
>> nested VMX should use exactly the same logic as nested SVM does - 
>> namely,
>> use tricks like SVM's "exit_required" instead of our different tricks?
>
>
> I think svm is rather simple here using svm_complete_interrupts() to 
> decode exit_int_info into the arch independent structures.  I don't 
> think ->exit_required is a hack - it could probably be improved but I 
> think it does the right thing essentially.  For example 
> svm_nmi_allowed() will return true if in guest mode and NMI 
> interception is enabled.
>

It would be better if it just returned true and let svm_inject_nmi() do 
the vmexit.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 12/30] nVMX: Implement VMPTRLD
  2011-05-08  8:21 ` [PATCH 12/30] nVMX: Implement VMPTRLD Nadav Har'El
@ 2011-05-16 14:34   ` Marcelo Tosatti
  2011-05-16 18:58     ` Nadav Har'El
  0 siblings, 1 reply; 83+ messages in thread
From: Marcelo Tosatti @ 2011-05-16 14:34 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb, avi

On Sun, May 08, 2011 at 11:21:22AM +0300, Nadav Har'El wrote:
> This patch implements the VMPTRLD instruction.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/kvm/vmx.c |   62 ++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 61 insertions(+), 1 deletion(-)
> 
> --- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:19.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:19.000000000 +0300
> @@ -4814,6 +4814,66 @@ static int handle_vmclear(struct kvm_vcp
>  	return 1;
>  }
>  
> +/* Emulate the VMPTRLD instruction */
> +static int handle_vmptrld(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	gva_t gva;
> +	gpa_t vmcs12_addr;
> +	struct x86_exception e;
> +
> +	if (!nested_vmx_check_permission(vcpu))
> +		return 1;
> +
> +	if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
> +			vmcs_read32(VMX_INSTRUCTION_INFO), &gva))
> +		return 1;
> +
> +	if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &vmcs12_addr,
> +				sizeof(vmcs12_addr), &e)) {
> +		kvm_inject_page_fault(vcpu, &e);
> +		return 1;
> +	}
> +
> +	if (!IS_ALIGNED(vmcs12_addr, PAGE_SIZE)) {
> +		nested_vmx_failValid(vcpu, VMXERR_VMPTRLD_INVALID_ADDRESS);
> +		skip_emulated_instruction(vcpu);
> +		return 1;
> +	}
> +
> +	if (vmx->nested.current_vmptr != vmcs12_addr) {
> +		struct vmcs12 *new_vmcs12;
> +		struct page *page;
> +		page = nested_get_page(vcpu, vmcs12_addr);
> +		if (page == NULL) {
> +			nested_vmx_failInvalid(vcpu);

This can access a NULL current_vmcs12 pointer, no? Apparently other
code paths are vulnerable to the same issue (as in, allowed to execute
before VMPTRLD maps the guest VMCS). Perhaps a BUG_ON in get_vmcs12
could be helpful.
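
Something along these lines (a sketch of the suggestion; current_vmcs12
is the field introduced by this series):

	static inline struct vmcs12 *get_vmcs12(struct kvm_vcpu *vcpu)
	{
		/* Catch code paths that run before VMPTRLD has mapped a
		 * guest VMCS. */
		BUG_ON(!to_vmx(vcpu)->nested.current_vmcs12);
		return to_vmx(vcpu)->nested.current_vmcs12;
	}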

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/30] nVMX: Introduce vmcs02: VMCS used to run L2
  2011-05-08  8:18 ` [PATCH 07/30] nVMX: Introduce vmcs02: VMCS used to run L2 Nadav Har'El
@ 2011-05-16 15:30   ` Marcelo Tosatti
  2011-05-16 18:32     ` Nadav Har'El
  0 siblings, 1 reply; 83+ messages in thread
From: Marcelo Tosatti @ 2011-05-16 15:30 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb, avi

On Sun, May 08, 2011 at 11:18:47AM +0300, Nadav Har'El wrote:
> We saw in a previous patch that L1 controls its L2 guest with a vcms12.
> L0 needs to create a real VMCS for running L2. We call that "vmcs02".
> A later patch will contain the code, prepare_vmcs02(), for filling the vmcs02
> fields. This patch only contains code for allocating vmcs02.
> 
> In this version, prepare_vmcs02() sets *all* of vmcs02's fields each time we
> enter from L1 to L2, so keeping just one vmcs02 for the vcpu is enough: It can
> be reused even when L1 runs multiple L2 guests. However, in future versions
> we'll probably want to add an optimization where vmcs02 fields that rarely
> change will not be set each time. For that, we may want to keep around several
> vmcs02s of L2 guests that have recently run, so that potentially we could run
> these L2s again more quickly because less vmwrites to vmcs02 will be needed.
> 
> This patch adds to each vcpu a vmcs02 pool, vmx->nested.vmcs02_pool,
> which remembers the vmcs02s last used to run up to VMCS02_POOL_SIZE L2s.
> As explained above, in the current version we choose VMCS02_POOL_SIZE=1,
> I.e., one vmcs02 is allocated (and loaded onto the processor), and it is
> reused to enter any L2 guest. In the future, when prepare_vmcs02() is
> optimized not to set all fields every time, VMCS02_POOL_SIZE should be
> increased.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/kvm/vmx.c |  134 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 134 insertions(+)
> 
> --- .before/arch/x86/kvm/vmx.c	2011-05-08 10:43:18.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2011-05-08 10:43:18.000000000 +0300
> @@ -117,6 +117,7 @@ static int ple_window = KVM_VMX_DEFAULT_
>  module_param(ple_window, int, S_IRUGO);
>  
>  #define NR_AUTOLOAD_MSRS 1
> +#define VMCS02_POOL_SIZE 1
>  
>  struct vmcs {
>  	u32 revision_id;
> @@ -166,6 +167,30 @@ struct __packed vmcs12 {
>  #define VMCS12_SIZE 0x1000
>  
>  /*
> + * When we temporarily switch a vcpu's VMCS (e.g., stop using an L1's VMCS
> + * while we use L2's VMCS), and we wish to save the previous VMCS, we must also
> + * remember on which CPU it was last loaded (vcpu->cpu), so when we return to
> + * using this VMCS we'll know if we're now running on a different CPU and need
> + * to clear the VMCS on the old CPU, and load it on the new one. Additionally,
> + * we need to remember whether this VMCS was launched (vmx->launched), so when
> + * we return to it we know if to VMLAUNCH or to VMRESUME it (we cannot deduce
> + * this from other state, because it's possible that this VMCS had once been
> + * launched, but has since been cleared after a CPU switch).
> + */
> +struct saved_vmcs {
> +	struct vmcs *vmcs;
> +	int cpu;
> +	int launched;
> +};
> +
> +/* Used to remember the last vmcs02 used for some recently used vmcs12s */
> +struct vmcs02_list {
> +	struct list_head list;
> +	gpa_t vmcs12_addr;
> +	struct saved_vmcs vmcs02;
> +};
> +
> +/*
>   * The nested_vmx structure is part of vcpu_vmx, and holds information we need
>   * for correct emulation of VMX (i.e., nested VMX) on this vcpu.
>   */
> @@ -178,6 +203,10 @@ struct nested_vmx {
>  	/* The host-usable pointer to the above */
>  	struct page *current_vmcs12_page;
>  	struct vmcs12 *current_vmcs12;
> +
> +	/* vmcs02_list cache of VMCSs recently used to run L2 guests */
> +	struct list_head vmcs02_pool;
> +	int vmcs02_num;
>  };
>  
>  struct vcpu_vmx {
> @@ -4155,6 +4184,106 @@ static int handle_invalid_op(struct kvm_
>  }
>  
>  /*
> + * To run an L2 guest, we need a vmcs02 based the L1-specified vmcs12.
> + * We could reuse a single VMCS for all the L2 guests, but we also want the
> + * option to allocate a separate vmcs02 for each separate loaded vmcs12 - this
> + * allows keeping them loaded on the processor, and in the future will allow
> + * optimizations where prepare_vmcs02 doesn't need to set all the fields on
> + * every entry if they never change.
> + * So we keep, in vmx->nested.vmcs02_pool, a cache of size VMCS02_POOL_SIZE
> + * (>=0) with a vmcs02 for each recently loaded vmcs12s, most recent first.
> + *
> + * The following functions allocate and free a vmcs02 in this pool.
> + */
> +
> +static void __nested_free_saved_vmcs(void *arg)
> +{
> +	struct saved_vmcs *saved_vmcs = arg;
> +
> +	vmcs_clear(saved_vmcs->vmcs);
> +	if (per_cpu(current_vmcs, saved_vmcs->cpu) == saved_vmcs->vmcs)
> +		per_cpu(current_vmcs, saved_vmcs->cpu) = NULL;
> +}

Should use raw_smp_processor_id instead of saved_vmcs->cpu.
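
I.e., something like this (a sketch of the suggested change, assuming the
function is only ever invoked on the owning CPU):

	static void __nested_free_saved_vmcs(void *arg)
	{
		struct saved_vmcs *saved_vmcs = arg;

		vmcs_clear(saved_vmcs->vmcs);
		/* We are running on saved_vmcs->cpu, so raw_smp_processor_id()
		 * is equivalent but makes that assumption explicit. */
		if (per_cpu(current_vmcs, raw_smp_processor_id()) == saved_vmcs->vmcs)
			per_cpu(current_vmcs, raw_smp_processor_id()) = NULL;
	}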


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/30] nVMX: Introduce vmcs02: VMCS used to run L2
  2011-05-16 15:30   ` Marcelo Tosatti
@ 2011-05-16 18:32     ` Nadav Har'El
  2011-05-17 13:20       ` Marcelo Tosatti
  0 siblings, 1 reply; 83+ messages in thread
From: Nadav Har'El @ 2011-05-16 18:32 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, gleb, avi

On Mon, May 16, 2011, Marcelo Tosatti wrote about "Re: [PATCH 07/30] nVMX: Introduce vmcs02: VMCS used to run L2":
> > +static void __nested_free_saved_vmcs(void *arg)
> > +{
> > +	struct saved_vmcs *saved_vmcs = arg;
> > +
> > +	vmcs_clear(saved_vmcs->vmcs);
> > +	if (per_cpu(current_vmcs, saved_vmcs->cpu) == saved_vmcs->vmcs)
> > +		per_cpu(current_vmcs, saved_vmcs->cpu) = NULL;
> > +}
> 
> Should use raw_smp_processor_id instead of saved_vmcs->cpu.

Hi,

__nested_free_saved_vmcs is designed to be called only when
saved_vmcs->cpu is equal to the current CPU. E.g., it is called as:

        if (saved_vmcs->cpu != -1)
                smp_call_function_single(saved_vmcs->cpu,
                                __nested_free_saved_vmcs, saved_vmcs, 1);

So the current code should be just as correct.

The similar __vcpu_clear has an obscure case where (vmx->vcpu.cpu != cpu),
but in this new function, there isn't.

-- 
Nadav Har'El                        |       Monday, May 16 2011, 13 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Creativity consists of coming up with
http://nadav.harel.org.il           |many ideas, not just that one great idea.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 12/30] nVMX: Implement VMPTRLD
  2011-05-16 14:34   ` Marcelo Tosatti
@ 2011-05-16 18:58     ` Nadav Har'El
  2011-05-16 19:09       ` Nadav Har'El
  0 siblings, 1 reply; 83+ messages in thread
From: Nadav Har'El @ 2011-05-16 18:58 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, gleb, avi

Hi,

On Mon, May 16, 2011, Marcelo Tosatti wrote about "Re: [PATCH 12/30] nVMX: Implement VMPTRLD":
> > +	if (vmx->nested.current_vmptr != vmcs12_addr) {
> > +		struct vmcs12 *new_vmcs12;
> > +		struct page *page;
> > +		page = nested_get_page(vcpu, vmcs12_addr);
> > +		if (page == NULL) {
> > +			nested_vmx_failInvalid(vcpu);
> 
> This can access a NULL current_vmcs12 pointer, no?

I'm afraid I didn't understand where in this specific code you can access a
NULL current_vmcs12 pointer...

Looking at the rest of this function, the code that frees the previous
current_vmcs12_page, for example, first makes sure that
vmx->nested.current_vmptr != -1ull. current_vmptr, current_vmcs12 and
current_vmcs12_page are all set together (when everything was successful)
and we never release the old page before we test the new one, so we can be
sure that whenever current_vmptr != -1, we have a valid current_vmcs12 as well.

But maybe I'm missing something?

> Apparently other
> code paths are vulnerable to the same issue (as in allowed to execute
> before vmtprld maps guest VMCS). Perhaps a BUG_ON on get_vmcs12 could be
> helpful.

The call to get_vmcs12() should typically be inside an
if (is_guest_mode(vcpu)) block (or in places where we know this to be true),
and to enter L2, we should have already verified that we have a working
vmcs12. This is why I thought it unnecessary to add any assertions to the
trivial inline function get_vmcs12().

But now that I think about it, there does appear to be a problem in
nested_vmx_run(): this is where we should have verified that there is a
current VMCS - i.e., that VMPTRLD was previously used! And it seems I forgot
to test this... :( I'll need to add such a test - not as a BUG_ON but as
a real test that causes the VMLAUNCH instruction to fail (I have to look at
the spec to see exactly how) if VMPTRLD hadn't previously been done.
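
The missing check could look roughly like this (a sketch; the exact
failure mode still needs to be taken from the spec, as noted above):

	/* In nested_vmx_run(), before using the current vmcs12: */
	if (vmx->nested.current_vmptr == -1ull) {
		nested_vmx_failInvalid(vcpu);
		skip_emulated_instruction(vcpu);
		return 1;
	}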


-- 
Nadav Har'El                        |       Monday, May 16 2011, 13 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |I have a great signature, but it won't
http://nadav.harel.org.il           |fit at the end of this message -- Fermat

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 12/30] nVMX: Implement VMPTRLD
  2011-05-16 18:58     ` Nadav Har'El
@ 2011-05-16 19:09       ` Nadav Har'El
  0 siblings, 0 replies; 83+ messages in thread
From: Nadav Har'El @ 2011-05-16 19:09 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm, gleb, avi

On Mon, May 16, 2011, Nadav Har'El wrote about "Re: [PATCH 12/30] nVMX: Implement VMPTRLD":
> But now that I think about it, there does appear to be a problem in
> nested_vmx_run(): This is where we should have verified that there is a
> current VMCS - i.e., that VMPTRLD was previously used! And it seems I forgot
> testing this... :( I'll need to add such a test - not as a BUG_ON but as
> a real test that causes the VMLAUNCH instruction to fail (I have to look at
> the spec to see exactly how) if VMPTRLD hadn't been previously done.

Oh, and there appears to be a similar problem with VMWRITE/VMREAD - they can
also be called before VMPTRLD was ever used, causing us to dereference
bogus pointers.

Thanks for spotting this.

Nadav.

-- 
Nadav Har'El                        |       Monday, May 16 2011, 13 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |If I were two-faced, would I be wearing
http://nadav.harel.org.il           |this one?.... Abraham Lincoln

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/30] nVMX: Introduce vmcs02: VMCS used to run L2
  2011-05-16 18:32     ` Nadav Har'El
@ 2011-05-17 13:20       ` Marcelo Tosatti
  0 siblings, 0 replies; 83+ messages in thread
From: Marcelo Tosatti @ 2011-05-17 13:20 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb, avi

On Mon, May 16, 2011 at 09:32:53PM +0300, Nadav Har'El wrote:
> On Mon, May 16, 2011, Marcelo Tosatti wrote about "Re: [PATCH 07/30] nVMX: Introduce vmcs02: VMCS used to run L2":
> > > +static void __nested_free_saved_vmcs(void *arg)
> > > +{
> > > +	struct saved_vmcs *saved_vmcs = arg;
> > > +
> > > +	vmcs_clear(saved_vmcs->vmcs);
> > > +	if (per_cpu(current_vmcs, saved_vmcs->cpu) == saved_vmcs->vmcs)
> > > +		per_cpu(current_vmcs, saved_vmcs->cpu) = NULL;
> > > +}
> > 
> > Should use raw_smp_processor_id instead of saved_vmcs->cpu.
> 
> Hi,
> 
> __nested_free_saved_vmcs is designed to be called only on the when
> saved_vmcs->cpu is equal to the current CPU. E.g., it is called as:
> 
>         if (saved_vmcs->cpu != -1)
>                 smp_call_function_single(saved_vmcs->cpu,
>                                 __nested_free_saved_vmcs, saved_vmcs, 1);

Yes, using raw_smp_processor_id makes that fact explicit to the reader.

> So the current code should be just as correct.

It is correct.

> 
> The similar __vcpu_clear has an obscure case where (vmx->vcpu.cpu != cpu),
> but in this new function, there isn't.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-12 16:51           ` Gleb Natapov
  2011-05-12 17:00             ` Avi Kivity
@ 2011-05-22 19:32             ` Nadav Har'El
  2011-05-23  9:37               ` Joerg Roedel
  2011-05-23  9:52               ` Avi Kivity
  1 sibling, 2 replies; 83+ messages in thread
From: Nadav Har'El @ 2011-05-22 19:32 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Avi Kivity, kvm, abelg

On Thu, May 12, 2011, Gleb Natapov wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9":
> > But if my interpretation of the code is correct, SVM isn't much closer
> > than VMX to the goal of moving this logic to x86.c. When some logic is
> > moved there, both SVM and VMX code will need to change - perhaps even
> > considerably. So how will it be helpful to make VMX behave exactly like
> > SVM does now, when the latter will also need to change considerably?
> > 
> SVM design is much close to the goal of moving the logic into x86.c
> because IIRC it does not bypass parsing of IDT vectoring info into arch
> independent structure. VMX code uses vmx->idt_vectoring_info directly.

At the risk of sounding blasphemous, I'd like to make the case that perhaps
the current nested-VMX design - regarding the IDT-vectoring-info-field
handling - is actually closer than nested-SVM to the goal of moving clean
nested-supporting logic into x86.c, instead of having ad-hoc, unnatural,
workarounds.

Let me explain, and see if you agree with my logic:

We discover at exit time whether the virtualization hardware (VMX or SVM)
exited while *delivering* an interrupt or exception to the current guest.
This is known as "idt-vectoring-information" in VMX.

What do we need to do with this idt-vectoring-information? In regular (non-
nested) guests, the answer is simple: On the next entry, we need to inject
this event again into the guest, so it can resume the delivery of the
same event it was trying to deliver. This is why the nested-unaware code
has a vmx_complete_interrupts which basically adds this idt-vectoring-info
into KVM's event queue, which on the next entry will be injected similarly
to the way virtual interrupts from userspace are injected, and so on.

But with nested virtualization, this is *not* what is supposed to happen -
we do not *always* need to inject the event to the guest. We will only need
to inject the event if the next entry will be again to the same guest, i.e.,
L1 after L1, or L2 after L2. If the idt-vectoring-info came from L2, but
our next entry will be into L1 (i.e., a nested exit), we *shouldn't* inject
the event as usual, but should rather pass this idt-vectoring-info field
as the exit information that L1 gets (in nested vmx terminology, in vmcs12).

However, at the time of exit, we cannot know for sure whether L2 will actually
run next, because it is still possible that an injection from user space,
before the next entry, will cause us to decide to exit to L1.

Therefore, I believe that the clean solution isn't to leave the original
non-nested logic that always queues the idt-vectoring-info assuming it will
be injected, and then, when it shouldn't be (because we want to exit to L1
during entry), to skip the entry once as a "trick" to avoid this wrong
injection.

Rather, a clean solution is, I think, to recognize that in nested
virtualization, idt-vectoring-info is a different kind of beast from regular
injected events, and it needs to be saved at exit time in a different field
(which will of course be common to SVM and VMX). Only at entry time, after
the regular injection code has run (which may cause a nested exit), would we
call an x86_op to handle this special injection.
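
In x86.c this could look something like the sketch below, where
complete_nested_idt_vectoring is a hypothetical new x86_op (no such
callback exists today):

	/* Called from the entry path, after the regular injection code
	 * (which may have decided to do a nested exit instead of
	 * entering L2). */
	static void complete_nested_injection(struct kvm_vcpu *vcpu)
	{
		if (!is_guest_mode(vcpu))
			return;
		/* Still entering the same guest (L2 after L2): let the vendor
		 * code turn the saved idt-vectoring info back into an
		 * injection. If a nested exit was decided above, the vendor
		 * code has instead passed it to L1 as exit information. */
		kvm_x86_ops->complete_nested_idt_vectoring(vcpu);
	}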

The benefit of this approach, which is closer to the current vmx code,
is, I think, that x86.c will contain clear, self-explanatory nested logic,
instead of relying on vmx.c or svm.c circumventing various x86.c functions
and mechanisms to do something different from what they were meant to do.

What do you think?


-- 
Nadav Har'El                        |       Sunday, May 22 2011, 19 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |If I were two-faced, would I be wearing
http://nadav.harel.org.il           |this one?.... Abraham Lincoln

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-22 19:32             ` Nadav Har'El
@ 2011-05-23  9:37               ` Joerg Roedel
  2011-05-23  9:52               ` Avi Kivity
  1 sibling, 0 replies; 83+ messages in thread
From: Joerg Roedel @ 2011-05-23  9:37 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Gleb Natapov, Avi Kivity, kvm, abelg

On Sun, May 22, 2011 at 10:32:39PM +0300, Nadav Har'El wrote:

> At the risk of sounding blasphemous, I'd like to make the case that perhaps
> the current nested-VMX design - regarding the IDT-vectoring-info-field
> handling - is actually closer than nested-SVM to the goal of moving clean
> nested-supporting logic into x86.c, instead of having ad-hoc, unnatural,
> workarounds.

Well, the nested SVM implementation is certainly not perfect in this
regard :)

> Therefore, I believe that the clean solution isn't to leave the original
> non-nested logic that always queues the idt-vectoring-info assuming it will
> be injected, and then if it shouldn't (because we want to exit during entry)
> we need to skip the entry once as a "trick" to avoid this wrong injection.
> 
> Rather, a clean solution is, I think, to recognize that in nested
> virtualization, idt-vectoring-info is a different kind of beast than regular
> injected events, and it needs to be saved at exit time in a different field
> (which will of course be common to SVM and VMX). Only at entry time, after
> the regular injection code (which may cause a nested exit), we can call a
> x86_op to handle this special injection.

Things are complicated either way. If you keep the vectoring-info
separate from the kvm exception queue you need special logic to combine
the vectoring-info and the queue. For example, imagine something is
pending in idt-vectoring info and the intercept causes another
exception for the guest. KVM then needs to turn this into a #DF. When
we just queue the vectoring-info into the exception queue we get this
implicitly, without extra code. This is a cleaner way imho.
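
The implicit #DF handling referred to here comes from the exception
queue logic in x86.c; very roughly (simplified from
kvm_multiple_exception(), ignoring contributory-exception and
triple-fault details):

	static void queue_exception_sketch(struct kvm_vcpu *vcpu, unsigned nr)
	{
		if (!vcpu->arch.exception.pending) {
			vcpu->arch.exception.pending = true;
			vcpu->arch.exception.has_error_code = false;
			vcpu->arch.exception.nr = nr;
			return;
		}
		/* A second exception while one is already queued: promote
		 * the pair to a double fault. */
		vcpu->arch.exception.nr = DF_VECTOR;
		vcpu->arch.exception.has_error_code = true;
		vcpu->arch.exception.error_code = 0;
	}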

On the other hand, when using the exception queue we need to keep extra
information for nesting in the queue, because an event which is just
re-injected into L2 must not cause a nested vmexit, even if the
exception vector is intercepted by L1. But this is the same for SVM and
VMX, so we can do it in generic x86 code. This is not the case when
keeping track of idt-vectoring info separately in architecture code.

Regards,

	Joerg


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-22 19:32             ` Nadav Har'El
  2011-05-23  9:37               ` Joerg Roedel
@ 2011-05-23  9:52               ` Avi Kivity
  2011-05-23 13:02                 ` Joerg Roedel
  1 sibling, 1 reply; 83+ messages in thread
From: Avi Kivity @ 2011-05-23  9:52 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Gleb Natapov, kvm, abelg

On 05/22/2011 10:32 PM, Nadav Har'El wrote:
> On Thu, May 12, 2011, Gleb Natapov wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9":
> >  >  But if my interpretation of the code is correct, SVM isn't much closer
> >  >  than VMX to the goal of moving this logic to x86.c. When some logic is
> >  >  moved there, both SVM and VMX code will need to change - perhaps even
> >  >  considerably. So how will it be helpful to make VMX behave exactly like
> >  >  SVM does now, when the latter will also need to change considerably?
> >  >
> >  SVM design is much close to the goal of moving the logic into x86.c
> >  because IIRC it does not bypass parsing of IDT vectoring info into arch
> >  independent structure. VMX code uses vmx->idt_vectoring_info directly.
>
> At the risk of sounding blasphemous, I'd like to make the case that perhaps
> the current nested-VMX design - regarding the IDT-vectoring-info-field
> handling - is actually closer than nested-SVM to the goal of moving clean
> nested-supporting logic into x86.c, instead of having ad-hoc, unnatural,
> workarounds.
>
> Let me explain, and see if you agree with my logic:
>
> We discover at exit time whether the virtualization hardware (VMX or SVM)
> exited while *delivering* an interrupt or exception to the current guest.
> This is known as "idt-vectoring-information" in VMX.
>
> What do we need to do with this idt-vectoring-information? In regular (non-
> nested) guests, the answer is simple: On the next entry, we need to inject
> this event again into the guest, so it can resume the delivery of the
> same event it was trying to deliver. This is why the nested-unaware code
> has a vmx_complete_interrupts which basically adds this idt-vectoring-info
> into KVM's event queue, which on the next entry will be injected similarly
> to the way virtual interrupts from userspace are injected, and so on.

The other thing we may need to do is to expose it to userspace in case 
we're live migrating at exactly this point in time.

> But with nested virtualization, this is *not* what is supposed to happen -
> we do not *always* need to inject the event to the guest. We will only need
> to inject the event if the next entry will be again to the same guest, i.e.,
> L1 after L1, or L2 after L2. If the idt-vectoring-info came from L2, but
> our next entry will be into L1 (i.e., a nested exit), we *shouldn't* inject
> the event as usual, but should rather pass this idt-vectoring-info field
> as the exit information that L1 gets (in nested vmx terminology, in vmcs12).
>
> However, at the time of exit, we cannot know for sure whether L2 will actually
> run next, because it is still possible that an injection from user space,
> before the next entry, will cause us to decide to exit to L1.
>
> Therefore, I believe that the clean solution isn't to leave the original
> non-nested logic that always queues the idt-vectoring-info assuming it will
> be injected, and then if it shouldn't (because we want to exit during entry)
> we need to skip the entry once as a "trick" to avoid this wrong injection.
>
> Rather, a clean solution is, I think, to recognize that in nested
> virtualization, idt-vectoring-info is a different kind of beast than regular
> injected events, and it needs to be saved at exit time in a different field
> (which will of course be common to SVM and VMX). Only at entry time, after
> the regular injection code (which may cause a nested exit), we can call a
> x86_op to handle this special injection.
>
> The benefit of this approach, which is closer to the current vmx code,
> is, I think, that x86.c will contain clear, self-explanatory nested logic,
> instead of relying on vmx.c or svm.c circumventing various x86.c functions
> and mechanisms to do something different from what they were meant to do.
>

IMO this will cause confusion, especially with the user interfaces used 
to read/write pending events.

I think what we need to do is the following (see the sketch after the list):

1. change ->interrupt_allowed() to return true if the interrupt flag is 
unmasked OR if in a nested guest, and we're intercepting interrupts
2. change ->set_irq() to cause a nested vmexit if in a nested guest and 
we're intercepting interrupts
3. change ->nmi_allowed() and ->set_nmi() in a similar way
4. add a .injected flag to the interrupt queue which overrides the 
nested vmexit for VM_ENTRY_INTR_INFO_FIELD and the svm equivalent; if 
present normal injection takes place (or an error vmexit if the 
interrupt flag is clear and we cannot inject)
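
A rough sketch of how points 1, 2 and 4 could look on the VMX side, using
hypothetical helpers (nested_exit_on_intr(), nested_vmx_vmexit_on_interrupt())
and the proposed .injected flag, none of which exist in the current code:

	static int vmx_interrupt_allowed_sketch(struct kvm_vcpu *vcpu)
	{
		/* 1: interrupts are "allowed" while in a nested guest that
		 * intercepts them - injection then becomes a nested vmexit. */
		if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
			return 1;
		return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
			!(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
			  (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
	}

	static void vmx_inject_irq_sketch(struct kvm_vcpu *vcpu)
	{
		/* 2 + 4: a freshly injected interrupt causes a nested vmexit;
		 * a re-injected one (.injected set) goes into
		 * VM_ENTRY_INTR_INFO_FIELD as usual. */
		if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu) &&
		    !vcpu->arch.interrupt.injected) {
			nested_vmx_vmexit_on_interrupt(vcpu);
			return;
		}
		/* ... normal VM_ENTRY_INTR_INFO_FIELD injection ... */
	}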


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-23  9:52               ` Avi Kivity
@ 2011-05-23 13:02                 ` Joerg Roedel
  2011-05-23 13:08                   ` Avi Kivity
  2011-05-23 13:18                   ` Nadav Har'El
  0 siblings, 2 replies; 83+ messages in thread
From: Joerg Roedel @ 2011-05-23 13:02 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Nadav Har'El, Gleb Natapov, kvm, abelg

On Mon, May 23, 2011 at 12:52:50PM +0300, Avi Kivity wrote:
> On 05/22/2011 10:32 PM, Nadav Har'El wrote:
>> What do we need to do with this idt-vectoring-information? In regular (non-
>> nested) guests, the answer is simple: On the next entry, we need to inject
>> this event again into the guest, so it can resume the delivery of the
>> same event it was trying to deliver. This is why the nested-unaware code
>> has a vmx_complete_interrupts which basically adds this idt-vectoring-info
>> into KVM's event queue, which on the next entry will be injected similarly
>> to the way virtual interrupts from userspace are injected, and so on.
>
> The other thing we may need to do, is to expose it to userspace in case  
> we're live migrating at exactly this point in time.

About live migration with nesting: we had discussed the idea of just
doing a VMEXIT(INTR) if the vcpu runs nested and we want to migrate.
The problem was that the hypervisor may not expect an INTR intercept.

How about doing an implicit VMEXIT in this case and an implicit VMRUN
after the vcpu is migrated? The nested hypervisor will not see the
vmexit and the vcpu will be in a state where it is safe to migrate. This
should work for nested-vmx too if the guest-state is written back to
guest memory on VMEXIT. Is this the case?

	Joerg

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-23 13:02                 ` Joerg Roedel
@ 2011-05-23 13:08                   ` Avi Kivity
  2011-05-23 13:40                     ` Joerg Roedel
  2011-05-23 13:18                   ` Nadav Har'El
  1 sibling, 1 reply; 83+ messages in thread
From: Avi Kivity @ 2011-05-23 13:08 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: Nadav Har'El, Gleb Natapov, kvm, abelg

On 05/23/2011 04:02 PM, Joerg Roedel wrote:
> On Mon, May 23, 2011 at 12:52:50PM +0300, Avi Kivity wrote:
> >  On 05/22/2011 10:32 PM, Nadav Har'El wrote:
> >>  What do we need to do with this idt-vectoring-information? In regular (non-
> >>  nested) guests, the answer is simple: On the next entry, we need to inject
> >>  this event again into the guest, so it can resume the delivery of the
> >>  same event it was trying to deliver. This is why the nested-unaware code
> >>  has a vmx_complete_interrupts which basically adds this idt-vectoring-info
> >>  into KVM's event queue, which on the next entry will be injected similarly
> >>  to the way virtual interrupts from userspace are injected, and so on.
> >
> >  The other thing we may need to do, is to expose it to userspace in case
> >  we're live migrating at exactly this point in time.
>
> About live-migration with nesting, we had discussed the idea of just
> doing an VMEXIT(INTR) if the vcpu runs nested and we want to migrate.
> The problem was that the hypervisor may not expect an INTR intercept.
>
> How about doing an implicit VMEXIT in this case and an implicit VMRUN
> after the vcpu is migrated?

What if there's something in EXIT_INT_INFO?

>   The nested hypervisor will not see the
> vmexit and the vcpu will be in a state where it is safe to migrate. This
> should work for nested-vmx too if the guest-state is written back to
> guest memory on VMEXIT. Is this the case?

It is the case with the current implementation, and we can/should make 
it so in future implementations, just before exit to userspace.  Or at 
least provide an ABI to sync memory.

But I don't see why we shouldn't just migrate all the hidden state (the 
in-guest-mode flag, svm host paging mode, svm host interrupt state, vmcb 
address/vmptr, etc.).  It's more state, but no thinking is involved, so 
it's clearly superior.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-23 13:02                 ` Joerg Roedel
  2011-05-23 13:08                   ` Avi Kivity
@ 2011-05-23 13:18                   ` Nadav Har'El
  1 sibling, 0 replies; 83+ messages in thread
From: Nadav Har'El @ 2011-05-23 13:18 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: Avi Kivity, Gleb Natapov, kvm, abelg

On Mon, May 23, 2011, Joerg Roedel wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9":
> About live-migration with nesting, we had discussed the idea of just
> doing an VMEXIT(INTR) if the vcpu runs nested and we want to migrate.
> The problem was that the hypervisor may not expect an INTR intercept.
> 
> How about doing an implicit VMEXIT in this case and an implicit VMRUN
> after the vcpu is migrated? The nested hypervisor will not see the
> vmexit and the vcpu will be in a state where it is safe to migrate. This
> should work for nested-vmx too if the guest-state is written back to
> guest memory on VMEXIT. Is this the case?

Indeed, on nested exit (L2 to L1), the L2 guest state is written back to
vmcs12 (in guest memory). In theory, at that point, the vmcs02 (the vmcs
used by L0 to actually run L2) can be discarded, without risking losing
anything.

The receiving hypervisor will need to remember to do that implicit VMRUN
when it starts the guest; it also needs to know what the current L2 guest
is - in VMX this would be vmx->nested.current_vmptr, which needs to be
migrated as well (on the other hand, other variables, like
vmx->nested.current_vmcs12, will need to be recalculated by the receiver and
not migrated as-is). I haven't started considering how to wrap up all these
pieces into a complete working solution - it is one of the things on my TODO
list after the basic nested VMX is merged.

-- 
Nadav Har'El                        |       Monday, May 23 2011, 19 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Live as if you were to die tomorrow,
http://nadav.harel.org.il           |learn as if you were to live forever.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-23 13:08                   ` Avi Kivity
@ 2011-05-23 13:40                     ` Joerg Roedel
  2011-05-23 13:52                       ` Avi Kivity
  0 siblings, 1 reply; 83+ messages in thread
From: Joerg Roedel @ 2011-05-23 13:40 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Nadav Har'El, Gleb Natapov, kvm, abelg

On Mon, May 23, 2011 at 04:08:00PM +0300, Avi Kivity wrote:
> On 05/23/2011 04:02 PM, Joerg Roedel wrote:

>> About live-migration with nesting, we had discussed the idea of just
>> doing an VMEXIT(INTR) if the vcpu runs nested and we want to migrate.
>> The problem was that the hypervisor may not expect an INTR intercept.
>>
>> How about doing an implicit VMEXIT in this case and an implicit VMRUN
>> after the vcpu is migrated?
>
> What if there's something in EXIT_INT_INFO?

On real SVM hardware EXIT_INT_INFO should only contain something for
exception and npt intercepts. These are all handled in the kernel and do
not cause an exit to user-space, so no valid EXIT_INT_INFO should be
around when we actually go back to user-space (which is when migration
can happen).

The exception might be the #PF/NPT intercept when the guest is doing
very obscure things like putting an exception/interrupt handler on mmio
memory, but that isn't really supported by KVM anyway so I doubt we
should care.

Unless I miss something here we should be safe by just not looking at
EXIT_INT_INFO while migrating.

>>   The nested hypervisor will not see the
>> vmexit and the vcpu will be in a state where it is safe to migrate. This
>> should work for nested-vmx too if the guest-state is written back to
>> guest memory on VMEXIT. Is this the case?
>
> It is the case with the current implementation, and we can/should make  
> it so in future implementations, just before exit to userspace.  Or at  
> least provide an ABI to sync memory.
>
> But I don't see why we shouldn't just migrate all the hidden state (in  
> guest mode flag, svm host paging mode, svm host interrupt state, vmcb  
> address/vmptr, etc.).  It's more state, but no thinking is involved, so  
> it's clearly superior.

An issue is that there is different state to migrate for Intel and AMD
hosts. If we keep all that information in guest memory, the kvm kernel
module can handle those details, and all KVM needs to migrate is the
in-guest-mode flag and the gpa of the currently executing vmcb/vmcs.
This state should be enough for Intel and AMD nesting.

The next benefit is that it works seamlessly even if the state that
needs to be transferred is extended (e.g. by emulating a new
virtualization hardware feature). This support can be implemented in the
kernel module, and no changes to qemu are required.


	Joerg


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-23 13:40                     ` Joerg Roedel
@ 2011-05-23 13:52                       ` Avi Kivity
  2011-05-23 14:10                         ` Nadav Har'El
  2011-05-23 14:28                         ` Joerg Roedel
  0 siblings, 2 replies; 83+ messages in thread
From: Avi Kivity @ 2011-05-23 13:52 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: Nadav Har'El, Gleb Natapov, kvm, abelg

On 05/23/2011 04:40 PM, Joerg Roedel wrote:
> On Mon, May 23, 2011 at 04:08:00PM +0300, Avi Kivity wrote:
> >  On 05/23/2011 04:02 PM, Joerg Roedel wrote:
>
> >>  About live-migration with nesting, we had discussed the idea of just
> >>  doing an VMEXIT(INTR) if the vcpu runs nested and we want to migrate.
> >>  The problem was that the hypervisor may not expect an INTR intercept.
> >>
> >>  How about doing an implicit VMEXIT in this case and an implicit VMRUN
> >>  after the vcpu is migrated?
> >
> >  What if there's something in EXIT_INT_INFO?
>
> On real SVM hardware EXIT_INT_INFO should only contain something for
> exception and npt intercepts. These are all handled in the kernel and do
> not cause an exit to user-space so that no valid EXIT_INT_INFO should be
> around when we actually go back to user-space (so that migration can
> happen).
>
> The exception might be the #PF/NPT intercept when the guest is doing
> very obscure things like putting an exception/interrupt handler on mmio
> memory, but that isn't really supported by KVM anyway so I doubt we
> should care.
>
> Unless I miss something here we should be safe by just not looking at
> EXIT_INT_INFO while migrating.

Agree.

> >>    The nested hypervisor will not see the
> >>  vmexit and the vcpu will be in a state where it is safe to migrate. This
> >>  should work for nested-vmx too if the guest-state is written back to
> >>  guest memory on VMEXIT. Is this the case?
> >
> >  It is the case with the current implementation, and we can/should make
> >  it so in future implementations, just before exit to userspace.  Or at
> >  least provide an ABI to sync memory.
> >
> >  But I don't see why we shouldn't just migrate all the hidden state (in
> >  guest mode flag, svm host paging mode, svm host interrupt state, vmcb
> >  address/vmptr, etc.).  It's more state, but no thinking is involved, so
> >  it's clearly superior.
>
> An issue is that there is different state to migrate for Intel and AMD
> hosts. If we keep all that information in guest memory the kvm kernel
> module can handle those details and all KVM needs to migrate is the
> in-guest-mode flag and the gpa of the vmcb/vmcs which is currently
> executed. This state should be enough for Intel and AMD nesting.

I think for Intel there is no hidden state apart from in-guest-mode 
(there is the VMPTR, but it is an actual register accessible via 
instructions).  For svm we can keep the hidden state in the host 
state-save area (including the vmcb pointer).  The only risk is that svm 
will gain hardware support for nesting, and will choose a different 
format than ours.

An alternative is a fake MSR for storing this data, or just another 
get/set ioctl pair.  We'll have a flags field that says which fields are 
filled in.
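
Such an ioctl payload might look something like this (purely
illustrative; no such ABI exists):

	struct kvm_nested_state_sketch {
		__u32 flags;           /* which of the fields below are valid */
		__u32 format;          /* 0 = VMX, 1 = SVM */
		__u64 vmxon_ptr;       /* VMX: the VMXON region */
		__u64 current_vmptr;   /* VMX: current vmcs12 gpa / SVM: vmcb gpa */
		__u8  in_guest_mode;   /* was the vcpu running L2? */
		__u8  pad[7];
	};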

> The next benefit is that it works seemlessly even if the state that
> needs to be transfered is extended (e.g. by emulating a new
> virtualization hardware feature). This support can be implemented in the
> kernel module and no changes to qemu are required.

I agree it's a benefit.  But I don't like making the fake vmexit part of 
live migration; if it turns out to be the wrong choice, it's hard to undo.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-23 13:52                       ` Avi Kivity
@ 2011-05-23 14:10                         ` Nadav Har'El
  2011-05-23 14:32                           ` Avi Kivity
  2011-05-23 14:28                         ` Joerg Roedel
  1 sibling, 1 reply; 83+ messages in thread
From: Nadav Har'El @ 2011-05-23 14:10 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Joerg Roedel, Gleb Natapov, kvm, abelg

On Mon, May 23, 2011, Avi Kivity wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9":
> I think for Intel there is no hidden state apart from in-guest-mode 
> (there is the VMPTR, but it is an actual register accessible via 
> instructions).

is_guest_mode(vcpu), vmx->nested.vmxon, vmx->nested.current_vmptr are the
only three things I can think of. Vmxon is actually more than a boolean
(there's also a vmxon pointer).

What do you mean by the current_vmptr being available through an instruction?
It is (VMPTRST), but this would be an instruction run on L1 (emulated by L0).
How would L0's user space use that instruction?

> I agree it's a benefit.  But I don't like making the fake vmexit part of 
> live migration, if it turns out the wrong choice it's hard to undo it.

If you don't do this "fake vmexit", you'll need to migrate both vmcs01 and
the current vmcs02 - the fact that vmcs12 is in guest memory will not be
enough, because vmcs02 isn't copied back to vmcs12 until the nested exit.


-- 
Nadav Har'El                        |       Monday, May 23 2011, 19 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |The world is coming to an end ... SAVE
http://nadav.harel.org.il           |YOUR BUFFERS!!!

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-23 13:52                       ` Avi Kivity
  2011-05-23 14:10                         ` Nadav Har'El
@ 2011-05-23 14:28                         ` Joerg Roedel
  2011-05-23 14:34                           ` Avi Kivity
  1 sibling, 1 reply; 83+ messages in thread
From: Joerg Roedel @ 2011-05-23 14:28 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Nadav Har'El, Gleb Natapov, kvm, abelg

On Mon, May 23, 2011 at 04:52:47PM +0300, Avi Kivity wrote:
> On 05/23/2011 04:40 PM, Joerg Roedel wrote:

>> The next benefit is that it works seemlessly even if the state that
>> needs to be transfered is extended (e.g. by emulating a new
>> virtualization hardware feature). This support can be implemented in the
>> kernel module and no changes to qemu are required.
>
> I agree it's a benefit.  But I don't like making the fake vmexit part of  
> live migration, if it turns out the wrong choice it's hard to undo it.

Well, saving the state to the host-save-area and doing a fake-vmexit is
logically the same, only the memory where the information is stored
differs.

To user-space we can provide a VCPU_FREEZE/VCPU_UNFREEZE ioctl which
does all the necessary things.
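
On the kernel side such an ioctl might boil down to something like this
(a sketch only; the nested_frozen flag and the leave_nested/reenter_nested
callbacks are hypothetical):

	static int kvm_vcpu_freeze_sketch(struct kvm_vcpu *vcpu)
	{
		vcpu->arch.nested_frozen = is_guest_mode(vcpu);
		if (vcpu->arch.nested_frozen)
			/* Implicit vmexit: write L2 state back to guest memory
			 * (vmcb / vmcs12), so only the "was in guest mode" flag
			 * and the vmcb/vmcs pointer need to be migrated. */
			kvm_x86_ops->leave_nested(vcpu);
		return 0;
	}

	static int kvm_vcpu_unfreeze_sketch(struct kvm_vcpu *vcpu)
	{
		if (vcpu->arch.nested_frozen)
			/* Implicit VMRUN / vmentry on the destination. */
			kvm_x86_ops->reenter_nested(vcpu);
		return 0;
	}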

	Joerg


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-23 14:10                         ` Nadav Har'El
@ 2011-05-23 14:32                           ` Avi Kivity
  2011-05-23 14:44                             ` Nadav Har'El
  0 siblings, 1 reply; 83+ messages in thread
From: Avi Kivity @ 2011-05-23 14:32 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Joerg Roedel, Gleb Natapov, kvm, abelg

On 05/23/2011 05:10 PM, Nadav Har'El wrote:
> On Mon, May 23, 2011, Avi Kivity wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9":
> >  I think for Intel there is no hidden state apart from in-guest-mode
> >  (there is the VMPTR, but it is an actual register accessible via
> >  instructions).
>
> is_guest_mode(vcpu), vmx->nested.vmxon, vmx->nested.current_vmptr are the
> only three things I can think of. Vmxon is actually more than a boolean
> (there's also a vmxon pointer).
>
> What do you mean by the current_vmptr being available through an instruction?
> It is (VMPTRST), but this would be an instruction run on L1 (emulated by L0).
> How would L0's user space use that instruction?

I mean that it is an architectural register rather than "hidden state".  
It doesn't mean that L0 user space can use it.


> >  I agree it's a benefit.  But I don't like making the fake vmexit part of
> >  live migration, if it turns out the wrong choice it's hard to undo it.
>
> If you don't do this "fake vmexit", you'll need to migrate both vmcs01 and
> the current vmcs02 - the fact that vmcs12 is in guest memory will not be
> enough, because vmcs02 isn't copied back to vmcs12 until the nested exit.
>

vmcs01 and vmcs02 will both be generated from vmcs12.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-23 14:28                         ` Joerg Roedel
@ 2011-05-23 14:34                           ` Avi Kivity
  2011-05-23 14:58                             ` Joerg Roedel
  0 siblings, 1 reply; 83+ messages in thread
From: Avi Kivity @ 2011-05-23 14:34 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: Nadav Har'El, Gleb Natapov, kvm, abelg

On 05/23/2011 05:28 PM, Joerg Roedel wrote:
> On Mon, May 23, 2011 at 04:52:47PM +0300, Avi Kivity wrote:
> >  On 05/23/2011 04:40 PM, Joerg Roedel wrote:
>
> >>  The next benefit is that it works seamlessly even if the state that
> >>  needs to be transferred is extended (e.g. by emulating a new
> >>  virtualization hardware feature). This support can be implemented in the
> >>  kernel module and no changes to qemu are required.
> >
> >  I agree it's a benefit.  But I don't like making the fake vmexit part of
> >  live migration, if it turns out the wrong choice it's hard to undo it.
>
> Well, saving the state to the host-save-area and doing a fake-vmexit is
> logically the same, only the memory where the information is stored
> differs.

Right.  I guess the main difference is "info registers" after a stop.

> To user-space we can provide a VCPU_FREEZE/VCPU_UNFREEZE ioctl which
> does all the necessary things.
>

Or we can automatically flush things on any exit to userspace.  They 
should be very rare in guest mode.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-23 14:32                           ` Avi Kivity
@ 2011-05-23 14:44                             ` Nadav Har'El
  2011-05-23 15:23                               ` Avi Kivity
  0 siblings, 1 reply; 83+ messages in thread
From: Nadav Har'El @ 2011-05-23 14:44 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Joerg Roedel, Gleb Natapov, kvm, abelg

On Mon, May 23, 2011, Avi Kivity wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9":
> vmcs01 and vmcs02 will both be generated from vmcs12.

If you don't do a clean nested exit (from L2 to L1), vmcs02 can't be generated
from vmcs12... while L2 runs, it is possible that it modifies vmcs02 (e.g.,
non-trapped bits of guest_cr0), and these modifications are not copied back
to vmcs12 until the nested exit (when prepare_vmcs12() is called to perform
this task).

If you do a nested exit (a "fake" one), vmcs12 is made up to date, and then
indeed vmcs02 can be thrown away and regenerated.
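
In other words, the migration-time flush is just a stripped-down nested
exit. A sketch (the function name nested_vmx_flush_for_migration() is
made up, and prepare_vmcs12() is assumed here to take just the vcpu -
the real signature is whatever the patch series defines):

  /* Sketch only, not actual patch code.  Before vcpu state is read out
   * for migration, bring vmcs12 - which lives in L1 guest memory and
   * therefore migrates for free - up to date with the live vmcs02. */
  static void nested_vmx_flush_for_migration(struct kvm_vcpu *vcpu)
  {
          if (!is_guest_mode(vcpu))
                  return;         /* nothing volatile to flush */

          /* Copy the fields L2 may have modified (non-trapped guest_cr0
           * bits, rip, ...) from vmcs02 back into vmcs12, exactly as a
           * real nested exit would. */
          prepare_vmcs12(vcpu);

          /* vmcs02 now contains nothing that cannot be rebuilt from
           * vmcs12 on the next entry into L2, so it does not need to be
           * migrated at all. */
  }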

Nadav.

-- 
Nadav Har'El                        |       Monday, May 23 2011, 19 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Jury: Twelve people who determine which
http://nadav.harel.org.il           |client has the better lawyer.


* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-23 14:34                           ` Avi Kivity
@ 2011-05-23 14:58                             ` Joerg Roedel
  2011-05-23 15:19                               ` Avi Kivity
  0 siblings, 1 reply; 83+ messages in thread
From: Joerg Roedel @ 2011-05-23 14:58 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Nadav Har'El, Gleb Natapov, kvm, abelg

On Mon, May 23, 2011 at 05:34:20PM +0300, Avi Kivity wrote:
> On 05/23/2011 05:28 PM, Joerg Roedel wrote:

>> To user-space we can provide a VCPU_FREEZE/VCPU_UNFREEZE ioctl which
>> does all the necessary things.
>
> Or we can automatically flush things on any exit to userspace.  They  
> should be very rare in guest mode.

This would make nesting mostly transparent to migration, so it sounds
good in this regard.

I do not completely agree that user-space exits in guest-mode are rare;
this depends on the hypervisor in L1. In Hyper-V, for example, the
root-domain uses hardware virtualization too and has direct access to
devices (at least to some degree). IOIO is not intercepted in the
root-domain, for example. Not sure about the MMIO regions.


	Joerg



* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-23 14:58                             ` Joerg Roedel
@ 2011-05-23 15:19                               ` Avi Kivity
  0 siblings, 0 replies; 83+ messages in thread
From: Avi Kivity @ 2011-05-23 15:19 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: Nadav Har'El, Gleb Natapov, kvm, abelg

On 05/23/2011 05:58 PM, Joerg Roedel wrote:
> On Mon, May 23, 2011 at 05:34:20PM +0300, Avi Kivity wrote:
> >  On 05/23/2011 05:28 PM, Joerg Roedel wrote:
>
> >>  To user-space we can provide a VCPU_FREEZE/VCPU_UNFREEZE ioctl which
> >>  does all the necessary things.
> >
> >  Or we can automatically flush things on any exit to userspace.  They
> >  should be very rare in guest mode.
>
> This would make nesting mostly transparent to migration, so it sounds
> good in this regard.
>
> I do not completely agree that user-space exits in guest-mode are rare;
> this depends on the hypervisor in L1. In Hyper-V, for example, the
> root-domain uses hardware virtualization too and has direct access to
> devices (at least to some degree). IOIO is not intercepted in the
> root-domain, for example. Not sure about the MMIO regions.

Good point.  We were also talking about passing through virtio (or even 
host) devices to the guest.

So an ioctl to flush volatile state to memory would be a good idea.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-23 14:44                             ` Nadav Har'El
@ 2011-05-23 15:23                               ` Avi Kivity
  2011-05-23 18:06                                 ` Alexander Graf
  0 siblings, 1 reply; 83+ messages in thread
From: Avi Kivity @ 2011-05-23 15:23 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Joerg Roedel, Gleb Natapov, kvm, abelg

On 05/23/2011 05:44 PM, Nadav Har'El wrote:
> On Mon, May 23, 2011, Avi Kivity wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9":
> >  vmcs01 and vmcs02 will both be generated from vmcs12.
>
> If you don't do a clean nested exit (from L2 to L1), vmcs02 can't be generated
> from vmcs12... while L2 runs, it is possible that it modifies vmcs02 (e.g.,
> non-trapped bits of guest_cr0), and these modifications are not copied back
> to vmcs12 until the nested exit (when prepare_vmcs12() is called to perform
> this task).
>
> If you do a nested exit (a "fake" one), vmcs12 is made up to date, and then
> indeed vmcs02 can be thrown away and regenerated.

You would flush this state back to the vmcs.  But that just confirms 
Joerg's statement that a fake vmexit/vmrun is more or less equivalent.

The question is whether %rip points to the VMRUN/VMLAUNCH instruction, 
HOST_RIP (or the next instruction for svm), or to guest code.  But the 
actual things we need to do are all very similar subsets of a vmexit.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-23 15:23                               ` Avi Kivity
@ 2011-05-23 18:06                                 ` Alexander Graf
  2011-05-24 11:09                                   ` Avi Kivity
  0 siblings, 1 reply; 83+ messages in thread
From: Alexander Graf @ 2011-05-23 18:06 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Nadav Har'El, Joerg Roedel, Gleb Natapov, kvm, abelg


On 23.05.2011, at 17:23, Avi Kivity wrote:

> On 05/23/2011 05:44 PM, Nadav Har'El wrote:
>> On Mon, May 23, 2011, Avi Kivity wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9":
>> >  vmcs01 and vmcs02 will both be generated from vmcs12.
>> 
>> If you don't do a clean nested exit (from L2 to L1), vmcs02 can't be generated
>> from vmcs12... while L2 runs, it is possible that it modifies vmcs02 (e.g.,
>> non-trapped bits of guest_cr0), and these modifications are not copied back
>> to vmcs12 until the nested exit (when prepare_vmcs12() is called to perform
>> this task).
>> 
>> If you do a nested exit (a "fake" one), vmcs12 is made up to date, and then
>> indeed vmcs02 can be thrown away and regenerated.
> 
> You would flush this state back to the vmcs.  But that just confirms Joerg's statement that a fake vmexit/vmrun is more or less equivalent.
> 
> The question is whether %rip points to the VMRUN/VMLAUNCH instruction, HOST_RIP (or the next instruction for svm), or to guest code.  But the actual things we need to do are all very similar subsets of a vmexit.

%rip should certainly point to VMRUN. That way there is no need to save any information whatsoever, as the VMCB is already in sane state and nothing needs to be special cased, as the next VCPU_RUN would simply go back into guest mode - which is exactly what we want.

The only tricky part is how we distinguish between "I need to live migrate" and "info registers". In the former case, %rip should be on VMRUN. In the latter, on the guest rip.


Alex



* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-23 18:06                                 ` Alexander Graf
@ 2011-05-24 11:09                                   ` Avi Kivity
  2011-05-24 13:07                                     ` Joerg Roedel
  0 siblings, 1 reply; 83+ messages in thread
From: Avi Kivity @ 2011-05-24 11:09 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Nadav Har'El, Joerg Roedel, Gleb Natapov, kvm, abelg

On 05/23/2011 09:06 PM, Alexander Graf wrote:
> On 23.05.2011, at 17:23, Avi Kivity wrote:
>
> >  On 05/23/2011 05:44 PM, Nadav Har'El wrote:
> >>  On Mon, May 23, 2011, Avi Kivity wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9":
> >>  >   vmcs01 and vmcs02 will both be generated from vmcs12.
> >>
> >>  If you don't do a clean nested exit (from L2 to L1), vmcs02 can't be generated
> >>  from vmcs12... while L2 runs, it is possible that it modifies vmcs02 (e.g.,
> >>  non-trapped bits of guest_cr0), and these modifications are not copied back
> >>  to vmcs12 until the nested exit (when prepare_vmcs12() is called to perform
> >>  this task).
> >>
> >>  If you do a nested exit (a "fake" one), vmcs12 is made up to date, and then
> >>  indeed vmcs02 can be thrown away and regenerated.
> >
> >  You would flush this state back to the vmcs.  But that just confirms Joerg's statement that a fake vmexit/vmrun is more or less equivalent.
> >
> >  The question is whether %rip points to the VMRUN/VMLAUNCH instruction, HOST_RIP (or the next instruction for svm), or to guest code.  But the actual things we need to do are all very similar subsets of a vmexit.
>
> %rip should certainly point to VMRUN. That way there is no need to save any information whatsoever, as the VMCB is already in sane state and nothing needs to be special cased, as the next VCPU_RUN would simply go back into guest mode - which is exactly what we want.
>
> The only tricky part is how we distinguish between "I need to live migrate" and "info registers". In the former case, %rip should be on VMRUN. In the latter, on the guest rip.

We can split vmrun emulation into "save host state, load guest state" 
and "prepare nested vmcb".  Then, when we load registers, if we see that 
we're in guest mode, we do just the "prepare nested vmcb" bit.

This way register state is always nested guest state.
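
Something along these lines - a sketch only, with placeholder names
rather than the actual svm.c functions:

  /* Split vmrun emulation so the second half can be re-run on its own. */
  static void nested_svm_vmrun_save_host(struct vcpu_svm *svm)
  {
          /* Save L1 (host) state to the host save area and switch the
           * architectural registers over to the nested guest.  Done only
           * when a real VMRUN is emulated. */
  }

  static void nested_svm_vmrun_prepare_vmcb(struct vcpu_svm *svm)
  {
          /* Build the shadow VMCB (the one the hardware actually runs)
           * from the guest's VMCB in L1 memory.  Can be repeated at
           * will, e.g. after migration. */
  }

  static void nested_svm_load_registers(struct vcpu_svm *svm)
  {
          /* When userspace loads register state and we are already in
           * guest mode, the registers are nested-guest registers, so
           * only the shadow VMCB needs to be rebuilt. */
          if (is_guest_mode(&svm->vcpu))
                  nested_svm_vmrun_prepare_vmcb(svm);
  }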

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 0/30] nVMX: Nested VMX, v9
  2011-05-24 11:09                                   ` Avi Kivity
@ 2011-05-24 13:07                                     ` Joerg Roedel
  0 siblings, 0 replies; 83+ messages in thread
From: Joerg Roedel @ 2011-05-24 13:07 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Alexander Graf, Nadav Har'El, Gleb Natapov, kvm, abelg

On Tue, May 24, 2011 at 02:09:00PM +0300, Avi Kivity wrote:
> On 05/23/2011 09:06 PM, Alexander Graf wrote:
>> On 23.05.2011, at 17:23, Avi Kivity wrote:
>>
>> >  On 05/23/2011 05:44 PM, Nadav Har'El wrote:
>> >>  On Mon, May 23, 2011, Avi Kivity wrote about "Re: [PATCH 0/30] nVMX: Nested VMX, v9":
>> >>  >   vmcs01 and vmcs02 will both be generated from vmcs12.
>> >>
>> >>  If you don't do a clean nested exit (from L2 to L1), vmcs02 can't be generated
>> >>  from vmcs12... while L2 runs, it is possible that it modifies vmcs02 (e.g.,
>> >>  non-trapped bits of guest_cr0), and these modifications are not copied back
>> >>  to vmcs12 until the nested exit (when prepare_vmcs12() is called to perform
>> >>  this task).
>> >>
>> >>  If you do a nested exit (a "fake" one), vmcs12 is made up to date, and then
>> >>  indeed vmcs02 can be thrown away and regenerated.
>> >
>> >  You would flush this state back to the vmcs.  But that just confirms Joerg's statement that a fake vmexit/vmrun is more or less equivalent.
>> >
>> >  The question is whether %rip points to the VMRUN/VMLAUNCH instruction, HOST_RIP (or the next instruction for svm), or to guest code.  But the actual things we need to do are all very similar subsets of a vmexit.
>>
>> %rip should certainly point to VMRUN. That way there is no need to save any information whatsoever, as the VMCB is already in sane state and nothing needs to be special cased, as the next VCPU_RUN would simply go back into guest mode - which is exactly what we want.
>>
>> The only tricky part is how we distinguish between "I need to live migrate" and "info registers". In the former case, %rip should be on VMRUN. In the latter, on the guest rip.
>
> We can split vmrun emulation into "save host state, load guest state"  
> and "prepare nested vmcb".  Then, when we load registers, if we see that  
> we're in guest mode, we do just the "prepare nested vmcb" bit.

Or we just emulate a VMEXIT in the VCPU_FREEZE ioctl and set %rip back
to the VMRUN that entered the L2 guest. For 'info registers' the
VCPU_FREEZE ioctl will not be issued, and the guest registers will be
displayed. That way we don't need to migrate any additional state for
SVM.
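
For SVM that could be as small as the sketch below (nested_svm_vmexit()
is the existing emulated-#VMEXIT path; svm->nested.vmrun_rip is an
assumed field remembering the address of the VMRUN that entered L2):

  /* Sketch of a VCPU_FREEZE implementation for SVM. */
  static int svm_vcpu_freeze(struct kvm_vcpu *vcpu)
  {
          struct vcpu_svm *svm = to_svm(vcpu);

          if (!is_guest_mode(vcpu))
                  return 0;

          /* Flush all L2 state back into the guest's VMCB in L1 memory,
           * as if L1 had intercepted something right now. */
          nested_svm_vmexit(svm);

          /* Point %rip back at the VMRUN instruction, so the next
           * KVM_RUN on the target simply re-enters L2, with no extra
           * state having been transferred. */
          kvm_rip_write(vcpu, svm->nested.vmrun_rip);
          return 0;
  }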

	Joerg



Thread overview: 83+ messages
2011-05-08  8:15 [PATCH 0/30] nVMX: Nested VMX, v9 Nadav Har'El
2011-05-08  8:15 ` [PATCH 01/30] nVMX: Add "nested" module option to kvm_intel Nadav Har'El
2011-05-08  8:16 ` [PATCH 02/30] nVMX: Implement VMXON and VMXOFF Nadav Har'El
2011-05-08  8:16 ` [PATCH 03/30] nVMX: Allow setting the VMXE bit in CR4 Nadav Har'El
2011-05-08  8:17 ` [PATCH 04/30] nVMX: Introduce vmcs12: a VMCS structure for L1 Nadav Har'El
2011-05-08  8:17 ` [PATCH 05/30] nVMX: Implement reading and writing of VMX MSRs Nadav Har'El
2011-05-08  8:18 ` [PATCH 06/30] nVMX: Decoding memory operands of VMX instructions Nadav Har'El
2011-05-09  9:47   ` Avi Kivity
2011-05-08  8:18 ` [PATCH 07/30] nVMX: Introduce vmcs02: VMCS used to run L2 Nadav Har'El
2011-05-16 15:30   ` Marcelo Tosatti
2011-05-16 18:32     ` Nadav Har'El
2011-05-17 13:20       ` Marcelo Tosatti
2011-05-08  8:19 ` [PATCH 08/30] nVMX: Fix local_vcpus_link handling Nadav Har'El
2011-05-08  8:19 ` [PATCH 09/30] nVMX: Add VMCS fields to the vmcs12 Nadav Har'El
2011-05-08  8:20 ` [PATCH 10/30] nVMX: Success/failure of VMX instructions Nadav Har'El
2011-05-08  8:20 ` [PATCH 11/30] nVMX: Implement VMCLEAR Nadav Har'El
2011-05-08  8:21 ` [PATCH 12/30] nVMX: Implement VMPTRLD Nadav Har'El
2011-05-16 14:34   ` Marcelo Tosatti
2011-05-16 18:58     ` Nadav Har'El
2011-05-16 19:09       ` Nadav Har'El
2011-05-08  8:21 ` [PATCH 13/30] nVMX: Implement VMPTRST Nadav Har'El
2011-05-08  8:22 ` [PATCH 14/30] nVMX: Implement VMREAD and VMWRITE Nadav Har'El
2011-05-08  8:22 ` [PATCH 15/30] nVMX: Move host-state field setup to a function Nadav Har'El
2011-05-09  9:56   ` Avi Kivity
2011-05-09 10:40     ` Nadav Har'El
2011-05-08  8:23 ` [PATCH 16/30] nVMX: Move control field setup to functions Nadav Har'El
2011-05-08  8:23 ` [PATCH 17/30] nVMX: Prepare vmcs02 from vmcs01 and vmcs12 Nadav Har'El
2011-05-09 10:12   ` Avi Kivity
2011-05-09 10:27     ` Nadav Har'El
2011-05-09 10:45       ` Avi Kivity
2011-05-08  8:24 ` [PATCH 18/30] nVMX: Implement VMLAUNCH and VMRESUME Nadav Har'El
2011-05-08  8:24 ` [PATCH 19/30] nVMX: No need for handle_vmx_insn function any more Nadav Har'El
2011-05-08  8:25 ` [PATCH 20/30] nVMX: Exiting from L2 to L1 Nadav Har'El
2011-05-09 10:45   ` Avi Kivity
2011-05-08  8:25 ` [PATCH 21/30] nVMX: Deciding if L0 or L1 should handle an L2 exit Nadav Har'El
2011-05-08  8:26 ` [PATCH 22/30] nVMX: Correct handling of interrupt injection Nadav Har'El
2011-05-09 10:57   ` Avi Kivity
2011-05-08  8:27 ` [PATCH 23/30] nVMX: Correct handling of exception injection Nadav Har'El
2011-05-08  8:27 ` [PATCH 24/30] nVMX: Correct handling of idt vectoring info Nadav Har'El
2011-05-09 11:04   ` Avi Kivity
2011-05-08  8:28 ` [PATCH 25/30] nVMX: Handling of CR0 and CR4 modifying instructions Nadav Har'El
2011-05-08  8:28 ` [PATCH 26/30] nVMX: Further fixes for lazy FPU loading Nadav Har'El
2011-05-08  8:29 ` [PATCH 27/30] nVMX: Additional TSC-offset handling Nadav Har'El
2011-05-09 17:27   ` Zachary Amsden
2011-05-08  8:29 ` [PATCH 28/30] nVMX: Add VMX to list of supported cpuid features Nadav Har'El
2011-05-08  8:30 ` [PATCH 29/30] nVMX: Miscellenous small corrections Nadav Har'El
2011-05-08  8:30 ` [PATCH 30/30] nVMX: Documentation Nadav Har'El
2011-05-09 11:18 ` [PATCH 0/30] nVMX: Nested VMX, v9 Avi Kivity
2011-05-09 11:37   ` Nadav Har'El
2011-05-11  8:20   ` Gleb Natapov
2011-05-12 15:42     ` Nadav Har'El
2011-05-12 15:57       ` Gleb Natapov
2011-05-12 16:08         ` Avi Kivity
2011-05-12 16:14           ` Gleb Natapov
2011-05-12 16:31         ` Nadav Har'El
2011-05-12 16:51           ` Gleb Natapov
2011-05-12 17:00             ` Avi Kivity
2011-05-15 23:11               ` Nadav Har'El
2011-05-16  6:38                 ` Gleb Natapov
2011-05-16  7:44                   ` Nadav Har'El
2011-05-16  7:57                     ` Gleb Natapov
2011-05-16  9:50                 ` Avi Kivity
2011-05-16 10:20                   ` Avi Kivity
2011-05-22 19:32             ` Nadav Har'El
2011-05-23  9:37               ` Joerg Roedel
2011-05-23  9:52               ` Avi Kivity
2011-05-23 13:02                 ` Joerg Roedel
2011-05-23 13:08                   ` Avi Kivity
2011-05-23 13:40                     ` Joerg Roedel
2011-05-23 13:52                       ` Avi Kivity
2011-05-23 14:10                         ` Nadav Har'El
2011-05-23 14:32                           ` Avi Kivity
2011-05-23 14:44                             ` Nadav Har'El
2011-05-23 15:23                               ` Avi Kivity
2011-05-23 18:06                                 ` Alexander Graf
2011-05-24 11:09                                   ` Avi Kivity
2011-05-24 13:07                                     ` Joerg Roedel
2011-05-23 14:28                         ` Joerg Roedel
2011-05-23 14:34                           ` Avi Kivity
2011-05-23 14:58                             ` Joerg Roedel
2011-05-23 15:19                               ` Avi Kivity
2011-05-23 13:18                   ` Nadav Har'El
2011-05-12 16:18       ` Avi Kivity
