* [PATCH 0/29] nVMX: Nested VMX, v8
@ 2011-01-27  8:29 Nadav Har'El
  2011-01-27  8:30 ` [PATCH 01/29] nVMX: Add "nested" module option to vmx.c Nadav Har'El
                   ` (29 more replies)
  0 siblings, 30 replies; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:29 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Hi,

This is the eighth iteration of the nested VMX patch set. This iteration
fixes a number of bugs and issues that bothered the reviewers. Some more
issues raised in the previous review remain open, but don't worry - I *am*
working to resolve all of them.

The biggest improvement in this version is that SMP finally works: you can
now run nested VMX on an SMP host - the "nosmp" kernel option is no longer
required. You can also have SMP L1s and L2s, although in this version SMP
L2 support is still somewhat buggy and should become more stable in the next
version. The "vpid=0" option that used to be required is also no longer
needed.

Other improvements include:

 * Throw #GP on writes to read-only VMX MSRs, don't save/restore them, and
   don't print annoying and incorrect messages on startup.
 * Cleaned up free_l1_state() and renamed it free_nested().
 * Removed guest-exploitable printk()s.
 * Finally got rid of the l1_state structure and all its redundant fields.
 * Moved the cpu and launched fields out of the (guest memory) vmcs12 and
   into a new structure (in host memory), saved_vmcs. Avi, you asked if and
   why these two fields are really needed - they are, and I explained why
   in a comment.
 * Moved kunmap() out of nested_release_page() and into callers.
 * Made vmcs_field_to_offset_table initialization more readable.
 * Moved constants from vmx.c to include files, as requested.
 * Fixed wrong MOV_SS check in handle_launch_or_resume().
 * Fixed page leak in nested_vmx_exit_handled_msr().
 * Removed redundant if(nested) check.
 * Allow turning off nested VMX for one guest (by removing VMX from cpuid).
 * Fixed the EFER handling code.

This new set of patches applies to the current KVM trunk (I checked with
844e6679184180cffa7aca014d672545941ed78e). If you wish, you can also check
out an already-patched version of KVM from the repository
git://github.com/nyh/kvm-nested-vmx.git - take the branch "nvmx8".
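
For example (the local directory name is simply what "git clone" creates):

     git clone git://github.com/nyh/kvm-nested-vmx.git
     cd kvm-nested-vmx
     git checkout nvmx8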


About nested VMX:
-----------------

The following 29 patches implement nested VMX support. This feature enables
a guest to use the VMX APIs in order to run its own nested guests.
In other words, it allows running hypervisors (that use VMX) under KVM.
Multiple guest hypervisors can be run concurrently, and each of those can
in turn host multiple guests.

The theory behind this work, our implementation, and its performance
characteristics were presented in OSDI 2010 (the USENIX Symposium on
Operating Systems Design and Implementation). Our paper was titled
"The Turtles Project: Design and Implementation of Nested Virtualization",
and was awarded "Jay Lepreau Best Paper". The paper is available online, at:

	http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf

This patch set does not include all the features described in the paper.
In particular, this patch set is missing nested EPT (L1 can't use EPT and
must use shadow page tables). It is also missing some features required to
run VMWare hypervisors as guests. These missing features will be sent as
follow-on patches.

Running nested VMX:
-------------------

The nested VMX feature is currently disabled by default. It must be
explicitly enabled with the "nested=1" option to the kvm-intel module.

No modifications are required to user space (qemu). However, qemu's default
emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
explicitly enabled by giving qemu one of the following options:

     -cpu host              (emulated CPU has all features of the real CPU)

     -cpu qemu64,+vmx       (add just the vmx feature to a named CPU type)
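
For example, a minimal test invocation could look like this (the qemu binary
name and disk image name are only illustrations):

     modprobe kvm-intel nested=1
     qemu-system-x86_64 -enable-kvm -cpu qemu64,+vmx -m 1024 l1-guest.img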


This version was only tested with KVM (64-bit) as a guest hypervisor, and
Linux as a nested guest.


Patch statistics:
-----------------

 Documentation/kvm/nested-vmx.txt |  241 ++
 arch/x86/include/asm/kvm_host.h  |    2 
 arch/x86/include/asm/msr-index.h |    9 
 arch/x86/include/asm/vmx.h       |   31 
 arch/x86/kvm/svm.c               |    6 
 arch/x86/kvm/vmx.c               | 2496 ++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c               |   10 
 arch/x86/kvm/x86.h               |    6 
 8 files changed, 2760 insertions(+), 41 deletions(-)

--
Nadav Har'El
IBM Haifa Research Lab

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 01/29] nVMX: Add "nested" module option to vmx.c
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
@ 2011-01-27  8:30 ` Nadav Har'El
  2011-01-27  8:30 ` [PATCH 02/29] nVMX: Implement VMXON and VMXOFF Nadav Har'El
                   ` (28 subsequent siblings)
  29 siblings, 0 replies; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:30 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch adds a module option "nested" to vmx.c, which controls whether
the guest can use VMX instructions, i.e., whether we allow nested
virtualization. A similar, but separate, option already exists for the
SVM module.

This option currently defaults to 0, meaning that nested VMX must be
explicitly enabled by giving nested=1. When nested VMX matures, it should
probably be enabled by default - just like nested SVM currently is.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-01-26 18:06:02.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-01-26 18:06:02.000000000 +0200
@@ -72,6 +72,14 @@ module_param(vmm_exclusive, bool, S_IRUG
 static int __read_mostly yield_on_hlt = 1;
 module_param(yield_on_hlt, bool, S_IRUGO);
 
+/*
+ * If nested=1, nested virtualization is supported, i.e., a guest may use
+ * VMX and be a hypervisor for its own guests. If nested=0, a guest may not
+ * use VMX instructions.
+ */
+static int __read_mostly nested = 0;
+module_param(nested, bool, S_IRUGO);
+
 #define KVM_GUEST_CR0_MASK_UNRESTRICTED_GUEST				\
 	(X86_CR0_WP | X86_CR0_NE | X86_CR0_NW | X86_CR0_CD)
 #define KVM_GUEST_CR0_MASK						\
@@ -1164,6 +1172,23 @@ static void vmx_adjust_tsc_offset(struct
 	vmcs_write64(TSC_OFFSET, offset + adjustment);
 }
 
+static bool guest_cpuid_has_vmx(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpuid_entry2 *best = kvm_find_cpuid_entry(vcpu, 1, 0);
+	return best && (best->ecx & (1 << (X86_FEATURE_VMX & 31)));
+}
+
+/*
+ * nested_vmx_allowed() checks whether a guest should be allowed to use VMX
+ * instructions and MSRs (i.e., nested VMX). Nested VMX is disabled for
+ * all guests if the "nested" module option is off, and can also be disabled
+ * for a single guest by disabling its VMX cpuid bit.
+ */
+static inline bool nested_vmx_allowed(struct kvm_vcpu *vcpu)
+{
+	return nested && guest_cpuid_has_vmx(vcpu);
+}
+
 /*
  * Reads an msr value (of 'msr_index') into 'pdata'.
  * Returns 0 on success, non-0 otherwise.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 02/29] nVMX: Implement VMXON and VMXOFF
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
  2011-01-27  8:30 ` [PATCH 01/29] nVMX: Add "nested" module option to vmx.c Nadav Har'El
@ 2011-01-27  8:30 ` Nadav Har'El
  2011-01-27  8:31 ` [PATCH 03/29] nVMX: Allow setting the VMXE bit in CR4 Nadav Har'El
                   ` (27 subsequent siblings)
  29 siblings, 0 replies; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:30 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch allows a guest to use the VMXON and VMXOFF instructions, and
emulates them accordingly. Basically this amounts to checking some
prerequisites, and then remembering whether the guest has enabled or disabled
VMX operation.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  111 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 109 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-01-26 18:06:02.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-01-26 18:06:02.000000000 +0200
@@ -130,6 +130,17 @@ struct shared_msr_entry {
 	u64 mask;
 };
 
+/*
+ * The nested_vmx structure is part of vcpu_vmx, and holds information we need
+ * for correct emulation of VMX (i.e., nested VMX) on this vcpu. For example,
+ * the current VMCS set by L1, a list of the VMCSs used to run the active
+ * L2 guests on the hardware, and more.
+ */
+struct nested_vmx {
+	/* Has the level1 guest done vmxon? */
+	bool vmxon;
+};
+
 struct vcpu_vmx {
 	struct kvm_vcpu       vcpu;
 	struct list_head      local_vcpus_link;
@@ -177,6 +188,9 @@ struct vcpu_vmx {
 	u32 exit_reason;
 
 	bool rdtscp_enabled;
+
+	/* Support for a guest hypervisor (nested VMX) */
+	struct nested_vmx nested;
 };
 
 static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
@@ -3758,6 +3772,99 @@ static int handle_invalid_op(struct kvm_
 }
 
 /*
+ * Emulate the VMXON instruction.
+ * Currently, we just remember that VMX is active, and do not save or even
+ * inspect the argument to VMXON (the so-called "VMXON pointer") because we
+ * do not currently need to store anything in that guest-allocated memory
+ * region. Consequently, VMCLEAR and VMPTRLD also do not verify that the their
+ * argument is different from the VMXON pointer (which the spec says they do).
+ */
+static int handle_vmon(struct kvm_vcpu *vcpu)
+{
+	struct kvm_segment cs;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	/* The Intel VMX Instruction Reference lists a bunch of bits that
+	 * are prerequisite to running VMXON, most notably cr4.VMXE must be
+	 * set to 1 (see vmx_set_cr4() for when we allow the guest to set this).
+	 * Otherwise, we should fail with #UD. We test these now:
+	 */
+	if (!kvm_read_cr4_bits(vcpu, X86_CR4_VMXE) ||
+	    !kvm_read_cr0_bits(vcpu, X86_CR0_PE) ||
+	    (vmx_get_rflags(vcpu) & X86_EFLAGS_VM)) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
+	if (is_long_mode(vcpu) && !cs.l) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	if (vmx_get_cpl(vcpu)) {
+		kvm_inject_gp(vcpu, 0);
+		return 1;
+	}
+
+	vmx->nested.vmxon = true;
+
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
+/*
+ * Intel's VMX Instruction Reference specifies a common set of prerequisites
+ * for running VMX instructions (except VMXON, whose prerequisites are
+ * slightly different). It also specifies what exception to inject otherwise.
+ */
+static int nested_vmx_check_permission(struct kvm_vcpu *vcpu)
+{
+	struct kvm_segment cs;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (!vmx->nested.vmxon) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 0;
+	}
+
+	vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
+	if ((vmx_get_rflags(vcpu) & X86_EFLAGS_VM) ||
+	    (is_long_mode(vcpu) && !cs.l)) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 0;
+	}
+
+	if (vmx_get_cpl(vcpu)) {
+		kvm_inject_gp(vcpu, 0);
+		return 0;
+	}
+
+	return 1;
+}
+
+/*
+ * Free whatever needs to be freed from vmx->nested when L1 goes down, or
+ * just stops using VMX.
+ */
+static void free_nested(struct vcpu_vmx *vmx)
+{
+	if (!vmx->nested.vmxon)
+		return;
+	vmx->nested.vmxon = false;
+}
+
+/* Emulate the VMXOFF instruction */
+static int handle_vmoff(struct kvm_vcpu *vcpu)
+{
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+	free_nested(to_vmx(vcpu));
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
+/*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
  * to be done to userspace and return 0.
@@ -3785,8 +3892,8 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
 	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
 	[EXIT_REASON_VMWRITE]                 = handle_vmx_insn,
-	[EXIT_REASON_VMOFF]                   = handle_vmx_insn,
-	[EXIT_REASON_VMON]                    = handle_vmx_insn,
+	[EXIT_REASON_VMOFF]                   = handle_vmoff,
+	[EXIT_REASON_VMON]                    = handle_vmon,
 	[EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,
 	[EXIT_REASON_APIC_ACCESS]             = handle_apic_access,
 	[EXIT_REASON_WBINVD]                  = handle_wbinvd,

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 03/29] nVMX: Allow setting the VMXE bit in CR4
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
  2011-01-27  8:30 ` [PATCH 01/29] nVMX: Add "nested" module option to vmx.c Nadav Har'El
  2011-01-27  8:30 ` [PATCH 02/29] nVMX: Implement VMXON and VMXOFF Nadav Har'El
@ 2011-01-27  8:31 ` Nadav Har'El
  2011-01-27  8:31 ` [PATCH 04/29] nVMX: Introduce vmcs12: a VMCS structure for L1 Nadav Har'El
                   ` (26 subsequent siblings)
  29 siblings, 0 replies; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:31 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch allows the guest to enable the VMXE bit in CR4, which is a
prerequisite to running VMXON.

Whether to allow setting the VMXE bit now depends on the architecture (svm
or vmx), so the check has moved into kvm_x86_ops->set_cr4(). This function
now returns an int: if kvm_x86_ops->set_cr4() returns 1, __kvm_set_cr4()
will also return 1, and this will cause kvm_set_cr4() to throw a #GP.

Turning on the VMXE bit is allowed only when the "nested" module option is on,
and turning it off is forbidden after a vmxon.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/include/asm/kvm_host.h |    2 +-
 arch/x86/kvm/svm.c              |    6 +++++-
 arch/x86/kvm/vmx.c              |   17 +++++++++++++++--
 arch/x86/kvm/x86.c              |    4 +---
 4 files changed, 22 insertions(+), 7 deletions(-)

--- .before/arch/x86/include/asm/kvm_host.h	2011-01-26 18:06:02.000000000 +0200
+++ .after/arch/x86/include/asm/kvm_host.h	2011-01-26 18:06:02.000000000 +0200
@@ -542,7 +542,7 @@ struct kvm_x86_ops {
 	void (*decache_cr4_guest_bits)(struct kvm_vcpu *vcpu);
 	void (*set_cr0)(struct kvm_vcpu *vcpu, unsigned long cr0);
 	void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3);
-	void (*set_cr4)(struct kvm_vcpu *vcpu, unsigned long cr4);
+	int (*set_cr4)(struct kvm_vcpu *vcpu, unsigned long cr4);
 	void (*set_efer)(struct kvm_vcpu *vcpu, u64 efer);
 	void (*get_idt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
 	void (*set_idt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
--- .before/arch/x86/kvm/svm.c	2011-01-26 18:06:02.000000000 +0200
+++ .after/arch/x86/kvm/svm.c	2011-01-26 18:06:02.000000000 +0200
@@ -1417,11 +1417,14 @@ static void svm_set_cr0(struct kvm_vcpu 
 	update_cr0_intercept(svm);
 }
 
-static void svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+static int svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
 	unsigned long host_cr4_mce = read_cr4() & X86_CR4_MCE;
 	unsigned long old_cr4 = to_svm(vcpu)->vmcb->save.cr4;
 
+	if (cr4 & X86_CR4_VMXE)
+		return 1;
+
 	if (npt_enabled && ((old_cr4 ^ cr4) & X86_CR4_PGE))
 		svm_flush_tlb(vcpu);
 
@@ -1431,6 +1434,7 @@ static void svm_set_cr4(struct kvm_vcpu 
 	cr4 |= host_cr4_mce;
 	to_svm(vcpu)->vmcb->save.cr4 = cr4;
 	mark_dirty(to_svm(vcpu)->vmcb, VMCB_CR);
+	return 0;
 }
 
 static void svm_set_segment(struct kvm_vcpu *vcpu,
--- .before/arch/x86/kvm/x86.c	2011-01-26 18:06:02.000000000 +0200
+++ .after/arch/x86/kvm/x86.c	2011-01-26 18:06:02.000000000 +0200
@@ -615,11 +615,9 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, u
 				   kvm_read_cr3(vcpu)))
 		return 1;
 
-	if (cr4 & X86_CR4_VMXE)
+	if (kvm_x86_ops->set_cr4(vcpu, cr4))
 		return 1;
 
-	kvm_x86_ops->set_cr4(vcpu, cr4);
-
 	if ((cr4 ^ old_cr4) & pdptr_bits)
 		kvm_mmu_reset_context(vcpu);
 
--- .before/arch/x86/kvm/vmx.c	2011-01-26 18:06:03.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-01-26 18:06:03.000000000 +0200
@@ -1956,7 +1956,7 @@ static void ept_save_pdptrs(struct kvm_v
 		  (unsigned long *)&vcpu->arch.regs_dirty);
 }
 
-static void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
+static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
 
 static void ept_update_paging_mode_cr0(unsigned long *hw_cr0,
 					unsigned long cr0,
@@ -2052,11 +2052,23 @@ static void vmx_set_cr3(struct kvm_vcpu 
 	vmcs_writel(GUEST_CR3, guest_cr3);
 }
 
-static void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
 	unsigned long hw_cr4 = cr4 | (to_vmx(vcpu)->rmode.vm86_active ?
 		    KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON);
 
+	if (cr4 & X86_CR4_VMXE) {
+		/*
+		 * To use VMXON (and later other VMX instructions), a guest
+		 * must first be able to turn on cr4.VMXE (see handle_vmxon()).
+		 * So basically the check on whether to allow nested VMX
+		 * is here.
+		 */
+		if (!nested_vmx_allowed(vcpu))
+			return 1;
+	} else if (to_vmx(vcpu)->nested.vmxon)
+		return 1;
+
 	vcpu->arch.cr4 = cr4;
 	if (enable_ept) {
 		if (!is_paging(vcpu)) {
@@ -2069,6 +2081,7 @@ static void vmx_set_cr4(struct kvm_vcpu 
 
 	vmcs_writel(CR4_READ_SHADOW, cr4);
 	vmcs_writel(GUEST_CR4, hw_cr4);
+	return 0;
 }
 
 static void vmx_get_segment(struct kvm_vcpu *vcpu,

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 04/29] nVMX: Introduce vmcs12: a VMCS structure for L1
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
                   ` (2 preceding siblings ...)
  2011-01-27  8:31 ` [PATCH 03/29] nVMX: Allow setting the VMXE bit in CR4 Nadav Har'El
@ 2011-01-27  8:31 ` Nadav Har'El
  2011-01-27  8:32 ` [PATCH 05/29] nVMX: Implement reading and writing of VMX MSRs Nadav Har'El
                   ` (25 subsequent siblings)
  29 siblings, 0 replies; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:31 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

An implementation of VMX needs to define a VMCS structure. This structure
is kept in guest memory, but is opaque to the guest (which can only read or
write it using VMX instructions).

This patch starts to define the VMCS structure which our nested VMX
implementation will present to L1. We call it "vmcs12", as it is the VMCS
that L1 keeps for its L2 guests. We will add more content to this structure
in later patches.

This patch also adds the notion (as required by the VMX spec) of L1's "current
VMCS", and finally includes utility functions for mapping the guest-allocated
VMCSs in host memory.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   63 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-01-26 18:06:03.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-01-26 18:06:03.000000000 +0200
@@ -131,6 +131,34 @@ struct shared_msr_entry {
 };
 
 /*
+ * struct vmcs12 describes the state that our guest hypervisor (L1) keeps for a
+ * single nested guest (L2), hence the name vmcs12. Any VMX implementation has
+ * a VMCS structure, and vmcs12 is our emulated VMX's VMCS. This structure is
+ * stored in guest memory specified by VMPTRLD, but is opaque to the guest,
+ * which must access it using VMREAD/VMWRITE/VMCLEAR instructions. More
+ * than one of these structures may exist, if L1 runs multiple L2 guests.
+ * nested_vmx_run() will use the data here to build a vmcs02: a VMCS for the
+ * underlying hardware which will be used to run L2.
+ * This structure is packed in order to preserve the binary content after live
+ * migration. If there are changes in the content or layout, VMCS12_REVISION
+ * must be changed.
+ */
+struct __packed vmcs12 {
+	/* According to the Intel spec, a VMCS region must start with the
+	 * following two fields. Then follow implementation-specific data.
+	 */
+	u32 revision_id;
+	u32 abort;
+};
+
+/*
+ * VMCS12_REVISION is an arbitrary id that should be changed if the content or
+ * layout of struct vmcs12 is changed. MSR_IA32_VMX_BASIC returns this id, and
+ * VMPTRLD verifies that the VMCS region that L1 is loading contains this id.
+ */
+#define VMCS12_REVISION 0x11e57ed0
+
+/*
  * The nested_vmx structure is part of vcpu_vmx, and holds information we need
  * for correct emulation of VMX (i.e., nested VMX) on this vcpu. For example,
  * the current VMCS set by L1, a list of the VMCSs used to run the active
@@ -139,6 +167,12 @@ struct shared_msr_entry {
 struct nested_vmx {
 	/* Has the level1 guest done vmxon? */
 	bool vmxon;
+
+	/* The guest-physical address of the current VMCS L1 keeps for L2 */
+	gpa_t current_vmptr;
+	/* The host-usable pointer to the above */
+	struct page *current_vmcs12_page;
+	struct vmcs12 *current_vmcs12;
 };
 
 struct vcpu_vmx {
@@ -198,6 +232,26 @@ static inline struct vcpu_vmx *to_vmx(st
 	return container_of(vcpu, struct vcpu_vmx, vcpu);
 }
 
+static struct page *nested_get_page(struct kvm_vcpu *vcpu, gpa_t addr)
+{
+	struct page *page = gfn_to_page(vcpu->kvm, addr >> PAGE_SHIFT);
+	if (is_error_page(page)) {
+		kvm_release_page_clean(page);
+		return NULL;
+	}
+	return page;
+}
+
+static void nested_release_page(struct page *page)
+{
+	kvm_release_page_dirty(page);
+}
+
+static void nested_release_page_clean(struct page *page)
+{
+	kvm_release_page_clean(page);
+}
+
 static int init_rmode(struct kvm *kvm);
 static u64 construct_eptp(unsigned long root_hpa);
 static void kvm_cpu_vmxon(u64 addr);
@@ -3865,6 +3919,11 @@ static void free_nested(struct vcpu_vmx 
 	if (!vmx->nested.vmxon)
 		return;
 	vmx->nested.vmxon = false;
+	if (vmx->nested.current_vmptr != -1ull) {
+		kunmap(vmx->nested.current_vmcs12_page);
+		nested_release_page(vmx->nested.current_vmcs12_page);
+		vmx->nested.current_vmptr = -1ull;
+	}
 }
 
 /* Emulate the VMXOFF instruction */
@@ -4300,6 +4359,7 @@ static void vmx_free_vcpu(struct kvm_vcp
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
 	free_vpid(vmx);
+	free_nested(vmx);
 	vmx_free_vmcs(vcpu);
 	kfree(vmx->guest_msrs);
 	kvm_vcpu_uninit(vcpu);
@@ -4366,6 +4426,9 @@ static struct kvm_vcpu *vmx_create_vcpu(
 			goto free_vmcs;
 	}
 
+	vmx->nested.current_vmptr = -1ull;
+	vmx->nested.current_vmcs12 = NULL;
+
 	return &vmx->vcpu;
 
 free_vmcs:

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 05/29] nVMX: Implement reading and writing of VMX MSRs
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
                   ` (3 preceding siblings ...)
  2011-01-27  8:31 ` [PATCH 04/29] nVMX: Introduce vmcs12: a VMCS structure for L1 Nadav Har'El
@ 2011-01-27  8:32 ` Nadav Har'El
  2011-01-30  9:52   ` Avi Kivity
  2011-01-27  8:32 ` [PATCH 06/29] nVMX: Decoding memory operands of VMX instructions Nadav Har'El
                   ` (24 subsequent siblings)
  29 siblings, 1 reply; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:32 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

When the guest can use VMX instructions (when the "nested" module option is
on), it should also be able to read and write VMX MSRs, e.g., to query about
VMX capabilities. This patch adds this support.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/include/asm/msr-index.h |    9 ++
 arch/x86/kvm/vmx.c               |  126 +++++++++++++++++++++++++++++
 2 files changed, 135 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-01-26 18:06:03.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-01-26 18:06:03.000000000 +0200
@@ -1258,6 +1258,128 @@ static inline bool nested_vmx_allowed(st
 }
 
 /*
+ * If we allow our guest to use VMX instructions (i.e., nested VMX), we should
+ * also let it use VMX-specific MSRs.
+ * vmx_get_vmx_msr() and vmx_set_vmx_msr() return 1 when we handled a
+ * VMX-specific MSR, or 0 when we haven't (and the caller should handle it
+ * like all other MSRs).
+ */
+static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
+{
+	u64 vmx_msr = 0;
+	u32 vmx_msr_high, vmx_msr_low;
+
+	if (!nested_vmx_allowed(vcpu) && msr_index >= MSR_IA32_VMX_BASIC &&
+	    msr_index <= MSR_IA32_VMX_TRUE_ENTRY_CTLS) {
+		/*
+		 * According to the spec, processors which do not support VMX
+		 * should throw a #GP(0) when VMX capability MSRs are read.
+		 */
+		kvm_queue_exception_e(vcpu, GP_VECTOR, 0);
+		return 1;
+	}
+
+	switch (msr_index) {
+	case MSR_IA32_FEATURE_CONTROL:
+		*pdata = 0;
+		break;
+	case MSR_IA32_VMX_BASIC:
+		/*
+		 * This MSR reports some information about VMX support of the
+		 * processor. We should return information about the VMX we
+		 * emulate for the guest, and the VMCS structure we give it -
+		 * not about the VMX support of the underlying hardware.
+		 * However, some capabilities of the underlying hardware are
+		 * used directly by our emulation (e.g., the physical address
+		 * width), so these are copied from what the hardware reports.
+		 */
+		*pdata = VMCS12_REVISION | (((u64)sizeof(struct vmcs12)) << 32);
+		rdmsrl(MSR_IA32_VMX_BASIC, vmx_msr);
+		*pdata |= vmx_msr &
+			(VMX_BASIC_64 | VMX_BASIC_MEM_TYPE | VMX_BASIC_INOUT);
+		break;
+#define CORE2_PINBASED_CTLS_MUST_BE_ONE	0x00000016
+	case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
+	case MSR_IA32_VMX_PINBASED_CTLS:
+		vmx_msr_low  = CORE2_PINBASED_CTLS_MUST_BE_ONE;
+		vmx_msr_high = CORE2_PINBASED_CTLS_MUST_BE_ONE |
+				PIN_BASED_EXT_INTR_MASK |
+				PIN_BASED_NMI_EXITING |
+				PIN_BASED_VIRTUAL_NMIS;
+		*pdata = vmx_msr_low | ((u64)vmx_msr_high << 32);
+		break;
+	case MSR_IA32_VMX_TRUE_PROCBASED_CTLS:
+	case MSR_IA32_VMX_PROCBASED_CTLS:
+		/* This MSR determines which vm-execution controls the L1
+		 * hypervisor may ask, or may not ask, to enable. Normally we
+		 * can only allow enabling features which the hardware can
+		 * support, but we limit ourselves to allowing only known
+		 * features that were tested nested. We allow disabling any
+		 * feature (even if the hardware can't disable it).
+		 */
+		rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, vmx_msr_low, vmx_msr_high);
+
+		vmx_msr_low = 0; /* allow disabling any feature */
+		vmx_msr_high &= /* do not expose new untested features */
+			CPU_BASED_HLT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
+			CPU_BASED_CR3_STORE_EXITING | CPU_BASED_USE_IO_BITMAPS |
+			CPU_BASED_MOV_DR_EXITING | CPU_BASED_USE_TSC_OFFSETING |
+			CPU_BASED_MWAIT_EXITING | CPU_BASED_MONITOR_EXITING |
+			CPU_BASED_INVLPG_EXITING | CPU_BASED_TPR_SHADOW |
+#ifdef CONFIG_X86_64
+			CPU_BASED_CR8_LOAD_EXITING |
+			CPU_BASED_CR8_STORE_EXITING |
+#endif
+			CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
+		*pdata = vmx_msr_low | ((u64)vmx_msr_high << 32);
+		break;
+	case MSR_IA32_VMX_TRUE_EXIT_CTLS:
+	case MSR_IA32_VMX_EXIT_CTLS:
+		*pdata = 0;
+#ifdef CONFIG_X86_64
+		*pdata |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
+#endif
+		break;
+	case MSR_IA32_VMX_TRUE_ENTRY_CTLS:
+	case MSR_IA32_VMX_ENTRY_CTLS:
+		*pdata = 0;
+		break;
+	case MSR_IA32_VMX_PROCBASED_CTLS2:
+		*pdata = 0;
+		if (vm_need_virtualize_apic_accesses(vcpu->kvm))
+			*pdata |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+		break;
+	case MSR_IA32_VMX_EPT_VPID_CAP:
+		/* Currently, no nested ept or nested vpid */
+		*pdata = 0;
+		break;
+	default:
+		return 0;
+	}
+
+	return 1;
+}
+
+static int vmx_set_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
+{
+	if (!nested_vmx_allowed(vcpu))
+		return 0;
+
+	/*
+	 * according to the spec, "VMX capability MSRs are read-only; an
+	 * attempt to write them (with WRMSR) produces a #GP(0).
+	 */
+	if (msr_index >= MSR_IA32_VMX_BASIC &&
+	    msr_index <= MSR_IA32_VMX_TRUE_ENTRY_CTLS) {
+		kvm_queue_exception_e(vcpu, GP_VECTOR, 0);
+		return 1;
+	} else if (msr_index == MSR_IA32_FEATURE_CONTROL)
+		/* TODO: the right thing. */
+		return 1;
+	else
+		return 0;
+}
+/*
  * Reads an msr value (of 'msr_index') into 'pdata'.
  * Returns 0 on success, non-0 otherwise.
  * Assumes vcpu_load() was already called.
@@ -1305,6 +1427,8 @@ static int vmx_get_msr(struct kvm_vcpu *
 		/* Otherwise falls through */
 	default:
 		vmx_load_host_state(to_vmx(vcpu));
+		if (vmx_get_vmx_msr(vcpu, msr_index, pdata))
+			return 0;
 		msr = find_msr_entry(to_vmx(vcpu), msr_index);
 		if (msr) {
 			vmx_load_host_state(to_vmx(vcpu));
@@ -1374,6 +1498,8 @@ static int vmx_set_msr(struct kvm_vcpu *
 			return 1;
 		/* Otherwise falls through */
 	default:
+		if (vmx_set_vmx_msr(vcpu, msr_index, data))
+			break;
 		msr = find_msr_entry(vmx, msr_index);
 		if (msr) {
 			vmx_load_host_state(vmx);
--- .before/arch/x86/include/asm/msr-index.h	2011-01-26 18:06:03.000000000 +0200
+++ .after/arch/x86/include/asm/msr-index.h	2011-01-26 18:06:03.000000000 +0200
@@ -424,6 +424,15 @@
 #define MSR_IA32_VMX_VMCS_ENUM          0x0000048a
 #define MSR_IA32_VMX_PROCBASED_CTLS2    0x0000048b
 #define MSR_IA32_VMX_EPT_VPID_CAP       0x0000048c
+#define MSR_IA32_VMX_TRUE_PINBASED_CTLS  0x0000048d
+#define MSR_IA32_VMX_TRUE_PROCBASED_CTLS 0x0000048e
+#define MSR_IA32_VMX_TRUE_EXIT_CTLS      0x0000048f
+#define MSR_IA32_VMX_TRUE_ENTRY_CTLS     0x00000490
+
+/* VMX_BASIC bits and bitmasks */
+#define VMX_BASIC_64		0x0001000000000000LLU
+#define VMX_BASIC_MEM_TYPE	0x003c000000000000LLU
+#define VMX_BASIC_INOUT		0x0040000000000000LLU
 
 /* AMD-V MSRs */
 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 06/29] nVMX: Decoding memory operands of VMX instructions
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
                   ` (4 preceding siblings ...)
  2011-01-27  8:32 ` [PATCH 05/29] nVMX: Implement reading and writing of VMX MSRs Nadav Har'El
@ 2011-01-27  8:32 ` Nadav Har'El
  2011-01-27  8:33 ` [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12 Nadav Har'El
                   ` (23 subsequent siblings)
  29 siblings, 0 replies; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:32 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch includes a utility function for decoding pointer operands of VMX
instructions issued by L1 (a guest hypervisor).

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   59 +++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c |    3 +-
 arch/x86/kvm/x86.h |    3 ++
 3 files changed, 64 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/x86.c	2011-01-26 18:06:03.000000000 +0200
+++ .after/arch/x86/kvm/x86.c	2011-01-26 18:06:03.000000000 +0200
@@ -3686,7 +3686,7 @@ static int kvm_fetch_guest_virt(gva_t ad
 					  exception);
 }
 
-static int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
+int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
 			       struct kvm_vcpu *vcpu,
 			       struct x86_exception *exception)
 {
@@ -3694,6 +3694,7 @@ static int kvm_read_guest_virt(gva_t add
 	return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, access,
 					  exception);
 }
+EXPORT_SYMBOL_GPL(kvm_read_guest_virt);
 
 static int kvm_read_guest_virt_system(gva_t addr, void *val, unsigned int bytes,
 				      struct kvm_vcpu *vcpu,
--- .before/arch/x86/kvm/x86.h	2011-01-26 18:06:03.000000000 +0200
+++ .after/arch/x86/kvm/x86.h	2011-01-26 18:06:03.000000000 +0200
@@ -79,6 +79,9 @@ void kvm_before_handle_nmi(struct kvm_vc
 void kvm_after_handle_nmi(struct kvm_vcpu *vcpu);
 int kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq);
 
+int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
+		struct kvm_vcpu *vcpu, struct x86_exception *exception);
+
 void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data);
 
 #endif
--- .before/arch/x86/kvm/vmx.c	2011-01-26 18:06:03.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-01-26 18:06:03.000000000 +0200
@@ -4063,6 +4063,65 @@ static int handle_vmoff(struct kvm_vcpu 
 }
 
 /*
+ * Decode the memory-address operand of a vmx instruction, as recorded on an
+ * exit caused by such an instruction (run by a guest hypervisor).
+ * On success, returns 0. When the operand is invalid, returns 1 and throws
+ * #UD or #GP.
+ */
+static int get_vmx_mem_address(struct kvm_vcpu *vcpu,
+				 unsigned long exit_qualification,
+				 u32 vmx_instruction_info, gva_t *ret)
+{
+	/*
+	 * According to Vol. 3B, "Information for VM Exits Due to Instruction
+	 * Execution", on an exit, vmx_instruction_info holds most of the
+	 * addressing components of the operand. Only the displacement part
+	 * is put in exit_qualification (see 3B, "Basic VM-Exit Information").
+	 * For how an actual address is calculated from all these components,
+	 * refer to Vol. 1, "Operand Addressing".
+	 */
+	int  scaling = vmx_instruction_info & 3;
+	int  addr_size = (vmx_instruction_info >> 7) & 7;
+	bool is_reg = vmx_instruction_info & (1u << 10);
+	int  seg_reg = (vmx_instruction_info >> 15) & 7;
+	int  index_reg = (vmx_instruction_info >> 18) & 0xf;
+	bool index_is_valid = !(vmx_instruction_info & (1u << 22));
+	int  base_reg       = (vmx_instruction_info >> 23) & 0xf;
+	bool base_is_valid  = !(vmx_instruction_info & (1u << 27));
+
+	if (is_reg) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	switch (addr_size) {
+	case 1: /* 32 bit. high bits are undefined according to the spec: */
+		exit_qualification &= 0xffffffff;
+		break;
+	case 2: /* 64 bit */
+		break;
+	default: /* 16 bit */
+		return 1;
+	}
+
+	/* Addr = segment_base + offset */
+	/* offset = base + [index * scale] + displacement */
+	*ret = vmx_get_segment_base(vcpu, seg_reg);
+	if (base_is_valid)
+		*ret += kvm_register_read(vcpu, base_reg);
+	if (index_is_valid)
+		*ret += kvm_register_read(vcpu, index_reg)<<scaling;
+	*ret += exit_qualification; /* holds the displacement */
+	/*
+	 * TODO: throw #GP (and return 1) in various cases that the VM*
+	 * instructions require it - e.g., offset beyond segment limit,
+	 * unusable or unreadable/unwritable segment, non-canonical 64-bit
+	 * address, and so on. Currently these are not checked.
+	 */
+	return 0;
+}
+
+/*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
  * to be done to userspace and return 0.
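
For reference, later patches in this series (e.g., the VMCLEAR and VMPTRLD
emulation) use this helper roughly as follows. This is only a sketch, not
part of this patch, and read_vmptr_operand() is a made-up name:

static int read_vmptr_operand(struct kvm_vcpu *vcpu, gpa_t *vmptr)
{
	gva_t gva;
	struct x86_exception e;

	/*
	 * Decode the instruction's memory operand into a guest virtual
	 * address; on a malformed operand the helper has already queued
	 * #UD or #GP, so just resume the guest.
	 */
	if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
			vmcs_read32(VMX_INSTRUCTION_INFO), &gva))
		return 1;

	/* Read the 64-bit VMCS pointer that the operand points to. */
	if (kvm_read_guest_virt(gva, vmptr, sizeof(*vmptr), vcpu, &e)) {
		kvm_inject_page_fault(vcpu, &e);
		return 1;
	}
	return 0;
}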

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
                   ` (5 preceding siblings ...)
  2011-01-27  8:32 ` [PATCH 06/29] nVMX: Decoding memory operands of VMX instructions Nadav Har'El
@ 2011-01-27  8:33 ` Nadav Har'El
  2011-01-30 10:02   ` Avi Kivity
  2011-01-27  8:33 ` [PATCH 08/29] nVMX: Fix local_vcpus_link handling Nadav Har'El
                   ` (22 subsequent siblings)
  29 siblings, 1 reply; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:33 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

In this patch we add a list of L0 (hardware) VMCSs, which we'll use to hold a 
hardware VMCS for each active vmcs12 (i.e., for each L2 guest).

We call each of these L0 VMCSs a "vmcs02", as it is the VMCS that L0 uses
to run its nested guest L2.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  142 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 142 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-01-26 18:06:03.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-01-26 18:06:03.000000000 +0200
@@ -117,6 +117,7 @@ static int ple_window = KVM_VMX_DEFAULT_
 module_param(ple_window, int, S_IRUGO);
 
 #define NR_AUTOLOAD_MSRS 1
+#define NESTED_MAX_VMCS 256
 
 struct vmcs {
 	u32 revision_id;
@@ -159,6 +160,34 @@ struct __packed vmcs12 {
 #define VMCS12_REVISION 0x11e57ed0
 
 /*
+ * When we temporarily switch a vcpu's VMCS (e.g., stop using an L1's VMCS
+ * while we use L2's VMCS), and wish to save the previous VMCS, we must also
+ * remember on which CPU it was last loaded (vcpu->cpu), so when we return to
+ * using this VMCS we'll know if we're now running on a different CPU and need
+ * to clear the VMCS on the old CPU, and load it on the new one. Additionally,
+ * we need to remember whether this VMCS was launched (vmx->launched), so when
+ * we return to it we know if to VMLAUNCH or to VMRESUME it (we cannot deduce
+ * this from other state, because it's possible that this VMCS had once been
+ * launched, but has since been cleared after a CPU switch, and now
+ * vmx->launch is 0.
+ */
+struct saved_vmcs {
+	struct vmcs *vmcs;
+	int cpu;
+	int launched;
+};
+
+/*
+ * A cache keeping a VMCS (vmcs02) for each loaded vmcs12. In addition to the
+ * VMCS, we need information on its state - see struct saved_vmcs above.
+ */
+struct vmcs_list {
+	struct list_head list;
+	gpa_t vmcs12_addr;
+	struct saved_vmcs vmcs02;
+};
+
+/*
  * The nested_vmx structure is part of vcpu_vmx, and holds information we need
  * for correct emulation of VMX (i.e., nested VMX) on this vcpu. For example,
  * the current VMCS set by L1, a list of the VMCSs used to run the active
@@ -173,6 +202,10 @@ struct nested_vmx {
 	/* The host-usable pointer to the above */
 	struct page *current_vmcs12_page;
 	struct vmcs12 *current_vmcs12;
+
+	/* list of real (hardware) VMCS, one for each L2 guest of L1 */
+	struct list_head vmcs02_list; /* a vmcs_list */
+	int vmcs02_num;
 };
 
 struct vcpu_vmx {
@@ -3964,6 +3997,110 @@ static int handle_invalid_op(struct kvm_
 	return 1;
 }
 
+/* Find a vmcs02 saved for the current L2's vmcs12 */
+static struct saved_vmcs *nested_get_current_vmcs(struct vcpu_vmx *vmx)
+{
+	struct vmcs_list *item;
+	list_for_each_entry(item, &vmx->nested.vmcs02_list, list)
+		if (item->vmcs12_addr == vmx->nested.current_vmptr)
+			return &item->vmcs02;
+	return NULL;
+}
+
+/*
+ * Allocate an L0 VMCS (vmcs02) for the current L1 VMCS (vmcs12), if one
+ * does not already exist. The allocation is done in L0 memory, so to avoid
+ * denial-of-service attack by guests, we limit the number of concurrently-
+ * allocated vmcss. A well-behaving L1 will VMCLEAR unused vmcs12s and not
+ * trigger this limit.
+ */
+static int nested_create_current_vmcs(struct kvm_vcpu *vcpu)
+{
+	struct vmcs_list *new_l2_guest;
+	struct vmcs *vmcs02;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (nested_get_current_vmcs(vmx))
+		return 0; /* nothing to do - we already have a VMCS */
+
+	if (vmx->nested.vmcs02_num >= NESTED_MAX_VMCS)
+		return -ENOMEM;
+
+	new_l2_guest = (struct vmcs_list *)
+		kmalloc(sizeof(struct vmcs_list), GFP_KERNEL);
+	if (!new_l2_guest)
+		return -ENOMEM;
+
+	vmcs02 = alloc_vmcs();
+	if (!vmcs02) {
+		kfree(new_l2_guest);
+		return -ENOMEM;
+	}
+
+	new_l2_guest->vmcs12_addr = vmx->nested.current_vmptr;
+	new_l2_guest->vmcs02.vmcs = vmcs02;
+	new_l2_guest->vmcs02.cpu = -1;
+	new_l2_guest->vmcs02.launched = 0;
+	list_add(&(new_l2_guest->list), &(vmx->nested.vmcs02_list));
+	vmx->nested.vmcs02_num++;
+	return 0;
+}
+
+static void __nested_free_saved_vmcs(void *arg)
+{
+	struct saved_vmcs *saved_vmcs = arg;
+	int cpu = raw_smp_processor_id();
+
+	if (saved_vmcs->cpu == cpu) /* TODO: how can this not be the case? */
+		vmcs_clear(saved_vmcs->vmcs);
+	if (per_cpu(current_vmcs, cpu) == saved_vmcs->vmcs)
+		per_cpu(current_vmcs, cpu) = NULL;
+}
+
+/*
+ * Free a VMCS, but before that VMCLEAR it on the CPU where it was last loaded
+ * (the necessary information is in the saved_vmcs structure).
+ * See also vcpu_clear() (with different parameters and side-effects)
+ */
+static void nested_free_saved_vmcs(struct vcpu_vmx *vmx,
+		struct saved_vmcs *saved_vmcs)
+{
+	if (saved_vmcs->cpu != -1)
+		smp_call_function_single(saved_vmcs->cpu,
+				__nested_free_saved_vmcs, saved_vmcs, 1);
+
+	free_vmcs(saved_vmcs->vmcs);
+}
+
+/*
+ * Free a vmcs12's associated vmcs02 (if there is one), and remove it from
+ * vmcs02_list.
+ */
+static void nested_free_vmcs(struct vcpu_vmx *vmx, gpa_t vmptr)
+{
+	struct vmcs_list *item;
+	list_for_each_entry(item, &vmx->nested.vmcs02_list, list)
+		if (item->vmcs12_addr == vmptr) {
+			nested_free_saved_vmcs(vmx, &item->vmcs02);
+			list_del(&item->list);
+			kfree(item);
+			vmx->nested.vmcs02_num--;
+			return;
+		}
+}
+
+/* Free all vmcs02 saved for this L1 vcpu */
+static void nested_free_all_vmcs(struct vcpu_vmx *vmx)
+{
+	struct vmcs_list *item, *n;
+	list_for_each_entry_safe(item, n, &vmx->nested.vmcs02_list, list) {
+		nested_free_saved_vmcs(vmx, &item->vmcs02);
+		list_del(&item->list);
+		kfree(item);
+	}
+	vmx->nested.vmcs02_num = 0;
+}
+
 /*
  * Emulate the VMXON instruction.
  * Currently, we just remember that VMX is active, and do not save or even
@@ -4000,6 +4137,9 @@ static int handle_vmon(struct kvm_vcpu *
 		return 1;
 	}
 
+	INIT_LIST_HEAD(&(vmx->nested.vmcs02_list));
+	vmx->nested.vmcs02_num = 0;
+
 	vmx->nested.vmxon = true;
 
 	skip_emulated_instruction(vcpu);
@@ -4050,6 +4190,8 @@ static void free_nested(struct vcpu_vmx 
 		nested_release_page(vmx->nested.current_vmcs12_page);
 		vmx->nested.current_vmptr = -1ull;
 	}
+
+	nested_free_all_vmcs(vmx);
 }
 
 /* Emulate the VMXOFF instruction */

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 08/29] nVMX: Fix local_vcpus_link handling
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
                   ` (6 preceding siblings ...)
  2011-01-27  8:33 ` [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12 Nadav Har'El
@ 2011-01-27  8:33 ` Nadav Har'El
  2011-01-30 10:08   ` Avi Kivity
  2011-01-27  8:34 ` [PATCH 09/29] nVMX: Add VMCS fields to the vmcs12 Nadav Har'El
                   ` (21 subsequent siblings)
  29 siblings, 1 reply; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:33 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

In VMX, before we bring down a CPU we must VMCLEAR all VMCSs loaded on it
because (at least in theory) the processor might not have written all of its
content back to memory. Since a patch from June 26, 2008, this is done using
a per-cpu "vcpus_on_cpu" linked list of vcpus loaded on each CPU.

The problem is that with nested VMX, we no longer have the concept of a
vcpu being loaded on a cpu: A vcpu has multiple VMCSs (one for L1, others for
each L2), and each of those may have been last loaded on a different cpu.

This trivial patch changes the code to keep only L1 VMCSs on vcpus_on_cpu.
This fixes crashes on L1 shutdown caused by incorrectly maintaining the linked
lists.

It is not a complete solution, though. It doesn't flush the inactive L1 or L2
VMCSs loaded on a CPU which is being shut down. Doing this correctly will
probably require replacing the vcpu linked list by a linked list of "saved_vmcs"
objects (VMCS, cpu and launched), and it is left as a TODO.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-01-26 18:06:03.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-01-26 18:06:03.000000000 +0200
@@ -616,7 +616,9 @@ static void __vcpu_clear(void *arg)
 		vmcs_clear(vmx->vmcs);
 	if (per_cpu(current_vmcs, cpu) == vmx->vmcs)
 		per_cpu(current_vmcs, cpu) = NULL;
-	list_del(&vmx->local_vcpus_link);
+	/* TODO: currently, local_vcpus_link is just for L1 VMCSs */
+	if (!is_guest_mode(&vmx->vcpu))
+		list_del(&vmx->local_vcpus_link);
 	vmx->vcpu.cpu = -1;
 	vmx->launched = 0;
 }
@@ -1022,8 +1024,10 @@ static void vmx_vcpu_load(struct kvm_vcp
 
 		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
 		local_irq_disable();
-		list_add(&vmx->local_vcpus_link,
-			 &per_cpu(vcpus_on_cpu, cpu));
+		/* TODO: currently, local_vcpus_link is just for L1 VMCSs */
+		if (!is_guest_mode(&vmx->vcpu))
+			list_add(&vmx->local_vcpus_link,
+				 &per_cpu(vcpus_on_cpu, cpu));
 		local_irq_enable();
 
 		/*
@@ -1647,7 +1651,9 @@ static void vmclear_local_vcpus(void)
 
 	list_for_each_entry_safe(vmx, n, &per_cpu(vcpus_on_cpu, cpu),
 				 local_vcpus_link)
-		__vcpu_clear(vmx);
+		/* TODO: currently, local_vcpus_link is just for L1 VMCSs */
+		if (!is_guest_mode(&vmx->vcpu))
+			__vcpu_clear(vmx);
 }
 
 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 09/29] nVMX: Add VMCS fields to the vmcs12
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
                   ` (7 preceding siblings ...)
  2011-01-27  8:33 ` [PATCH 08/29] nVMX: Fix local_vcpus_link handling Nadav Har'El
@ 2011-01-27  8:34 ` Nadav Har'El
  2011-01-30 10:10   ` Avi Kivity
  2011-01-27  8:34 ` [PATCH 10/29] nVMX: Success/failure of VMX instructions Nadav Har'El
                   ` (20 subsequent siblings)
  29 siblings, 1 reply; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:34 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

In this patch we add to vmcs12 (the VMCS that L1 keeps for L2) all the
standard VMCS fields. These fields are encapsulated in a struct vmcs_fields.

Later patches will enable L1 to read and write these fields using VMREAD/
VMWRITE, and they will be used during a VMLAUNCH/VMRESUME in preparing vmcs02,
a hardware VMCS for running L2.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  283 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 283 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-01-26 18:06:03.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-01-26 18:06:03.000000000 +0200
@@ -132,6 +132,137 @@ struct shared_msr_entry {
 };
 
 /*
+ * vmcs_fields is a structure used in nested VMX for holding a copy of all
+ * standard VMCS fields. It is used for emulating a VMCS for L1 (see struct
+ * vmcs12), and also for easier access to VMCS data (see vmcs01_fields).
+ */
+struct __packed vmcs_fields {
+	u16 virtual_processor_id;
+	u16 guest_es_selector;
+	u16 guest_cs_selector;
+	u16 guest_ss_selector;
+	u16 guest_ds_selector;
+	u16 guest_fs_selector;
+	u16 guest_gs_selector;
+	u16 guest_ldtr_selector;
+	u16 guest_tr_selector;
+	u16 host_es_selector;
+	u16 host_cs_selector;
+	u16 host_ss_selector;
+	u16 host_ds_selector;
+	u16 host_fs_selector;
+	u16 host_gs_selector;
+	u16 host_tr_selector;
+	u64 io_bitmap_a;
+	u64 io_bitmap_b;
+	u64 msr_bitmap;
+	u64 vm_exit_msr_store_addr;
+	u64 vm_exit_msr_load_addr;
+	u64 vm_entry_msr_load_addr;
+	u64 tsc_offset;
+	u64 virtual_apic_page_addr;
+	u64 apic_access_addr;
+	u64 ept_pointer;
+	u64 guest_physical_address;
+	u64 vmcs_link_pointer;
+	u64 guest_ia32_debugctl;
+	u64 guest_ia32_pat;
+	u64 guest_pdptr0;
+	u64 guest_pdptr1;
+	u64 guest_pdptr2;
+	u64 guest_pdptr3;
+	u64 host_ia32_pat;
+	u32 pin_based_vm_exec_control;
+	u32 cpu_based_vm_exec_control;
+	u32 exception_bitmap;
+	u32 page_fault_error_code_mask;
+	u32 page_fault_error_code_match;
+	u32 cr3_target_count;
+	u32 vm_exit_controls;
+	u32 vm_exit_msr_store_count;
+	u32 vm_exit_msr_load_count;
+	u32 vm_entry_controls;
+	u32 vm_entry_msr_load_count;
+	u32 vm_entry_intr_info_field;
+	u32 vm_entry_exception_error_code;
+	u32 vm_entry_instruction_len;
+	u32 tpr_threshold;
+	u32 secondary_vm_exec_control;
+	u32 vm_instruction_error;
+	u32 vm_exit_reason;
+	u32 vm_exit_intr_info;
+	u32 vm_exit_intr_error_code;
+	u32 idt_vectoring_info_field;
+	u32 idt_vectoring_error_code;
+	u32 vm_exit_instruction_len;
+	u32 vmx_instruction_info;
+	u32 guest_es_limit;
+	u32 guest_cs_limit;
+	u32 guest_ss_limit;
+	u32 guest_ds_limit;
+	u32 guest_fs_limit;
+	u32 guest_gs_limit;
+	u32 guest_ldtr_limit;
+	u32 guest_tr_limit;
+	u32 guest_gdtr_limit;
+	u32 guest_idtr_limit;
+	u32 guest_es_ar_bytes;
+	u32 guest_cs_ar_bytes;
+	u32 guest_ss_ar_bytes;
+	u32 guest_ds_ar_bytes;
+	u32 guest_fs_ar_bytes;
+	u32 guest_gs_ar_bytes;
+	u32 guest_ldtr_ar_bytes;
+	u32 guest_tr_ar_bytes;
+	u32 guest_interruptibility_info;
+	u32 guest_activity_state;
+	u32 guest_sysenter_cs;
+	u32 host_ia32_sysenter_cs;
+	unsigned long cr0_guest_host_mask;
+	unsigned long cr4_guest_host_mask;
+	unsigned long cr0_read_shadow;
+	unsigned long cr4_read_shadow;
+	unsigned long cr3_target_value0;
+	unsigned long cr3_target_value1;
+	unsigned long cr3_target_value2;
+	unsigned long cr3_target_value3;
+	unsigned long exit_qualification;
+	unsigned long guest_linear_address;
+	unsigned long guest_cr0;
+	unsigned long guest_cr3;
+	unsigned long guest_cr4;
+	unsigned long guest_es_base;
+	unsigned long guest_cs_base;
+	unsigned long guest_ss_base;
+	unsigned long guest_ds_base;
+	unsigned long guest_fs_base;
+	unsigned long guest_gs_base;
+	unsigned long guest_ldtr_base;
+	unsigned long guest_tr_base;
+	unsigned long guest_gdtr_base;
+	unsigned long guest_idtr_base;
+	unsigned long guest_dr7;
+	unsigned long guest_rsp;
+	unsigned long guest_rip;
+	unsigned long guest_rflags;
+	unsigned long guest_pending_dbg_exceptions;
+	unsigned long guest_sysenter_esp;
+	unsigned long guest_sysenter_eip;
+	unsigned long host_cr0;
+	unsigned long host_cr3;
+	unsigned long host_cr4;
+	unsigned long host_fs_base;
+	unsigned long host_gs_base;
+	unsigned long host_tr_base;
+	unsigned long host_gdtr_base;
+	unsigned long host_idtr_base;
+	unsigned long host_ia32_sysenter_esp;
+	unsigned long host_ia32_sysenter_eip;
+	unsigned long host_rsp;
+	unsigned long host_rip;
+};
+
+/*
  * struct vmcs12 describes the state that our guest hypervisor (L1) keeps for a
  * single nested guest (L2), hence the name vmcs12. Any VMX implementation has
  * a VMCS structure, and vmcs12 is our emulated VMX's VMCS. This structure is
@@ -150,6 +281,8 @@ struct __packed vmcs12 {
 	 */
 	u32 revision_id;
 	u32 abort;
+
+	struct vmcs_fields fields;
 };
 
 /*
@@ -265,6 +398,156 @@ static inline struct vcpu_vmx *to_vmx(st
 	return container_of(vcpu, struct vcpu_vmx, vcpu);
 }
 
+#define VMCS_FIELDS_OFFSET(x) offsetof(struct vmcs_fields, x)
+#define FIELD(number, name)	[number] = VMCS_FIELDS_OFFSET(name)
+#define FIELD64(number, name)	[number] = VMCS_FIELDS_OFFSET(name), \
+				[number##_HIGH] = VMCS_FIELDS_OFFSET(name)+4
+
+static unsigned short vmcs_field_to_offset_table[] = {
+	FIELD(VIRTUAL_PROCESSOR_ID, virtual_processor_id),
+	FIELD(GUEST_ES_SELECTOR, guest_es_selector),
+	FIELD(GUEST_CS_SELECTOR, guest_cs_selector),
+	FIELD(GUEST_SS_SELECTOR, guest_ss_selector),
+	FIELD(GUEST_DS_SELECTOR, guest_ds_selector),
+	FIELD(GUEST_FS_SELECTOR, guest_fs_selector),
+	FIELD(GUEST_GS_SELECTOR, guest_gs_selector),
+	FIELD(GUEST_LDTR_SELECTOR, guest_ldtr_selector),
+	FIELD(GUEST_TR_SELECTOR, guest_tr_selector),
+	FIELD(HOST_ES_SELECTOR, host_es_selector),
+	FIELD(HOST_CS_SELECTOR, host_cs_selector),
+	FIELD(HOST_SS_SELECTOR, host_ss_selector),
+	FIELD(HOST_DS_SELECTOR, host_ds_selector),
+	FIELD(HOST_FS_SELECTOR, host_fs_selector),
+	FIELD(HOST_GS_SELECTOR, host_gs_selector),
+	FIELD(HOST_TR_SELECTOR, host_tr_selector),
+	FIELD64(IO_BITMAP_A, io_bitmap_a),
+	FIELD64(IO_BITMAP_B, io_bitmap_b),
+	FIELD64(MSR_BITMAP, msr_bitmap),
+	FIELD64(VM_EXIT_MSR_STORE_ADDR, vm_exit_msr_store_addr),
+	FIELD64(VM_EXIT_MSR_LOAD_ADDR, vm_exit_msr_load_addr),
+	FIELD64(VM_ENTRY_MSR_LOAD_ADDR, vm_entry_msr_load_addr),
+	FIELD64(TSC_OFFSET, tsc_offset),
+	FIELD64(VIRTUAL_APIC_PAGE_ADDR, virtual_apic_page_addr),
+	FIELD64(APIC_ACCESS_ADDR, apic_access_addr),
+	FIELD64(EPT_POINTER, ept_pointer),
+	FIELD64(GUEST_PHYSICAL_ADDRESS, guest_physical_address),
+	FIELD64(VMCS_LINK_POINTER, vmcs_link_pointer),
+	FIELD64(GUEST_IA32_DEBUGCTL, guest_ia32_debugctl),
+	FIELD64(GUEST_IA32_PAT, guest_ia32_pat),
+	FIELD64(GUEST_PDPTR0, guest_pdptr0),
+	FIELD64(GUEST_PDPTR1, guest_pdptr1),
+	FIELD64(GUEST_PDPTR2, guest_pdptr2),
+	FIELD64(GUEST_PDPTR3, guest_pdptr3),
+	FIELD64(HOST_IA32_PAT, host_ia32_pat),
+	FIELD(PIN_BASED_VM_EXEC_CONTROL, pin_based_vm_exec_control),
+	FIELD(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control),
+	FIELD(EXCEPTION_BITMAP, exception_bitmap),
+	FIELD(PAGE_FAULT_ERROR_CODE_MASK, page_fault_error_code_mask),
+	FIELD(PAGE_FAULT_ERROR_CODE_MATCH, page_fault_error_code_match),
+	FIELD(CR3_TARGET_COUNT, cr3_target_count),
+	FIELD(VM_EXIT_CONTROLS, vm_exit_controls),
+	FIELD(VM_EXIT_MSR_STORE_COUNT, vm_exit_msr_store_count),
+	FIELD(VM_EXIT_MSR_LOAD_COUNT, vm_exit_msr_load_count),
+	FIELD(VM_ENTRY_CONTROLS, vm_entry_controls),
+	FIELD(VM_ENTRY_MSR_LOAD_COUNT, vm_entry_msr_load_count),
+	FIELD(VM_ENTRY_INTR_INFO_FIELD, vm_entry_intr_info_field),
+	FIELD(VM_ENTRY_EXCEPTION_ERROR_CODE, vm_entry_exception_error_code),
+	FIELD(VM_ENTRY_INSTRUCTION_LEN, vm_entry_instruction_len),
+	FIELD(TPR_THRESHOLD, tpr_threshold),
+	FIELD(SECONDARY_VM_EXEC_CONTROL, secondary_vm_exec_control),
+	FIELD(VM_INSTRUCTION_ERROR, vm_instruction_error),
+	FIELD(VM_EXIT_REASON, vm_exit_reason),
+	FIELD(VM_EXIT_INTR_INFO, vm_exit_intr_info),
+	FIELD(VM_EXIT_INTR_ERROR_CODE, vm_exit_intr_error_code),
+	FIELD(IDT_VECTORING_INFO_FIELD, idt_vectoring_info_field),
+	FIELD(IDT_VECTORING_ERROR_CODE, idt_vectoring_error_code),
+	FIELD(VM_EXIT_INSTRUCTION_LEN, vm_exit_instruction_len),
+	FIELD(VMX_INSTRUCTION_INFO, vmx_instruction_info),
+	FIELD(GUEST_ES_LIMIT, guest_es_limit),
+	FIELD(GUEST_CS_LIMIT, guest_cs_limit),
+	FIELD(GUEST_SS_LIMIT, guest_ss_limit),
+	FIELD(GUEST_DS_LIMIT, guest_ds_limit),
+	FIELD(GUEST_FS_LIMIT, guest_fs_limit),
+	FIELD(GUEST_GS_LIMIT, guest_gs_limit),
+	FIELD(GUEST_LDTR_LIMIT, guest_ldtr_limit),
+	FIELD(GUEST_TR_LIMIT, guest_tr_limit),
+	FIELD(GUEST_GDTR_LIMIT, guest_gdtr_limit),
+	FIELD(GUEST_IDTR_LIMIT, guest_idtr_limit),
+	FIELD(GUEST_ES_AR_BYTES, guest_es_ar_bytes),
+	FIELD(GUEST_CS_AR_BYTES, guest_cs_ar_bytes),
+	FIELD(GUEST_SS_AR_BYTES, guest_ss_ar_bytes),
+	FIELD(GUEST_DS_AR_BYTES, guest_ds_ar_bytes),
+	FIELD(GUEST_FS_AR_BYTES, guest_fs_ar_bytes),
+	FIELD(GUEST_GS_AR_BYTES, guest_gs_ar_bytes),
+	FIELD(GUEST_LDTR_AR_BYTES, guest_ldtr_ar_bytes),
+	FIELD(GUEST_TR_AR_BYTES, guest_tr_ar_bytes),
+	FIELD(GUEST_INTERRUPTIBILITY_INFO, guest_interruptibility_info),
+	FIELD(GUEST_ACTIVITY_STATE, guest_activity_state),
+	FIELD(GUEST_SYSENTER_CS, guest_sysenter_cs),
+	FIELD(HOST_IA32_SYSENTER_CS, host_ia32_sysenter_cs),
+	FIELD(CR0_GUEST_HOST_MASK, cr0_guest_host_mask),
+	FIELD(CR4_GUEST_HOST_MASK, cr4_guest_host_mask),
+	FIELD(CR0_READ_SHADOW, cr0_read_shadow),
+	FIELD(CR4_READ_SHADOW, cr4_read_shadow),
+	FIELD(CR3_TARGET_VALUE0, cr3_target_value0),
+	FIELD(CR3_TARGET_VALUE1, cr3_target_value1),
+	FIELD(CR3_TARGET_VALUE2, cr3_target_value2),
+	FIELD(CR3_TARGET_VALUE3, cr3_target_value3),
+	FIELD(EXIT_QUALIFICATION, exit_qualification),
+	FIELD(GUEST_LINEAR_ADDRESS, guest_linear_address),
+	FIELD(GUEST_CR0, guest_cr0),
+	FIELD(GUEST_CR3, guest_cr3),
+	FIELD(GUEST_CR4, guest_cr4),
+	FIELD(GUEST_ES_BASE, guest_es_base),
+	FIELD(GUEST_CS_BASE, guest_cs_base),
+	FIELD(GUEST_SS_BASE, guest_ss_base),
+	FIELD(GUEST_DS_BASE, guest_ds_base),
+	FIELD(GUEST_FS_BASE, guest_fs_base),
+	FIELD(GUEST_GS_BASE, guest_gs_base),
+	FIELD(GUEST_LDTR_BASE, guest_ldtr_base),
+	FIELD(GUEST_TR_BASE, guest_tr_base),
+	FIELD(GUEST_GDTR_BASE, guest_gdtr_base),
+	FIELD(GUEST_IDTR_BASE, guest_idtr_base),
+	FIELD(GUEST_DR7, guest_dr7),
+	FIELD(GUEST_RSP, guest_rsp),
+	FIELD(GUEST_RIP, guest_rip),
+	FIELD(GUEST_RFLAGS, guest_rflags),
+	FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
+	FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
+	FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
+	FIELD(HOST_CR0, host_cr0),
+	FIELD(HOST_CR3, host_cr3),
+	FIELD(HOST_CR4, host_cr4),
+	FIELD(HOST_FS_BASE, host_fs_base),
+	FIELD(HOST_GS_BASE, host_gs_base),
+	FIELD(HOST_TR_BASE, host_tr_base),
+	FIELD(HOST_GDTR_BASE, host_gdtr_base),
+	FIELD(HOST_IDTR_BASE, host_idtr_base),
+	FIELD(HOST_IA32_SYSENTER_ESP, host_ia32_sysenter_esp),
+	FIELD(HOST_IA32_SYSENTER_EIP, host_ia32_sysenter_eip),
+	FIELD(HOST_RSP, host_rsp),
+	FIELD(HOST_RIP, host_rip),
+};
+static const int max_vmcs_field = ARRAY_SIZE(vmcs_field_to_offset_table);
+
+static inline short vmcs_field_to_offset(unsigned long field)
+{
+
+	if (field >= max_vmcs_field || vmcs_field_to_offset_table[field] == 0)
+		return -1;
+	return vmcs_field_to_offset_table[field];
+}
+
+static inline struct vmcs_fields *get_vmcs12_fields(struct kvm_vcpu *vcpu)
+{
+	return &(to_vmx(vcpu)->nested.current_vmcs12->fields);
+}
+
+static inline struct vmcs12 *get_vmcs12(struct kvm_vcpu *vcpu)
+{
+	return to_vmx(vcpu)->nested.current_vmcs12;
+}
+
 static struct page *nested_get_page(struct kvm_vcpu *vcpu, gpa_t addr)
 {
 	struct page *page = gfn_to_page(vcpu->kvm, addr >> PAGE_SHIFT);
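
For reference, the VMREAD/VMWRITE emulation added later in this series uses
the offset table roughly as follows. This is only a sketch, not part of this
patch; error handling (offset < 0) and the *_HIGH half-field accesses are
omitted for brevity:

static inline u64 vmcs12_read_any(struct kvm_vcpu *vcpu, unsigned long field)
{
	char *p = (char *)get_vmcs12_fields(vcpu) + vmcs_field_to_offset(field);

	/* Bits 14:13 of a VMCS field encoding give the field's width. */
	switch ((field >> 13) & 0x3) {
	case 0:				/* 16-bit field */
		return *(u16 *)p;
	case 1:				/* 64-bit field */
		return *(u64 *)p;
	case 2:				/* 32-bit field */
		return *(u32 *)p;
	default:			/* natural-width field */
		return *(unsigned long *)p;
	}
}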

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 10/29] nVMX: Success/failure of VMX instructions.
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
                   ` (8 preceding siblings ...)
  2011-01-27  8:34 ` [PATCH 09/29] nVMX: Add VMCS fields to the vmcs12 Nadav Har'El
@ 2011-01-27  8:34 ` Nadav Har'El
  2011-01-27  8:35 ` [PATCH 11/29] nVMX: Implement VMCLEAR Nadav Har'El
                   ` (19 subsequent siblings)
  29 siblings, 0 replies; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:34 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

VMX instructions specify success or failure by setting certain RFLAGS bits.
This patch contains common functions to do this, and they will be used in
the following patches which emulate the various VMX instructions.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/include/asm/vmx.h |   31 +++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx.c         |   30 ++++++++++++++++++++++++++++++
 2 files changed, 61 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-01-26 18:06:04.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-01-26 18:06:04.000000000 +0200
@@ -4553,6 +4553,36 @@ static int get_vmx_mem_address(struct kv
 }
 
 /*
+ * The following 3 functions, nested_vmx_succeed()/failValid()/failInvalid(),
+ * set the success or error code of an emulated VMX instruction, as specified
+ * by Vol 2B, VMX Instruction Reference, "Conventions".
+ */
+static void nested_vmx_succeed(struct kvm_vcpu *vcpu)
+{
+	vmx_set_rflags(vcpu, vmx_get_rflags(vcpu)
+			& ~(X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF |
+			    X86_EFLAGS_ZF | X86_EFLAGS_SF | X86_EFLAGS_OF));
+}
+
+static void nested_vmx_failInvalid(struct kvm_vcpu *vcpu)
+{
+	vmx_set_rflags(vcpu, (vmx_get_rflags(vcpu)
+			& ~(X86_EFLAGS_PF | X86_EFLAGS_AF | X86_EFLAGS_ZF |
+			    X86_EFLAGS_SF | X86_EFLAGS_OF))
+			| X86_EFLAGS_CF);
+}
+
+static void nested_vmx_failValid(struct kvm_vcpu *vcpu,
+					u32 vm_instruction_error)
+{
+	vmx_set_rflags(vcpu, (vmx_get_rflags(vcpu)
+			& ~(X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF |
+			    X86_EFLAGS_SF | X86_EFLAGS_OF))
+			| X86_EFLAGS_ZF);
+	get_vmcs12_fields(vcpu)->vm_instruction_error = vm_instruction_error;
+}
+
+/*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
  * to be done to userspace and return 0.
--- .before/arch/x86/include/asm/vmx.h	2011-01-26 18:06:04.000000000 +0200
+++ .after/arch/x86/include/asm/vmx.h	2011-01-26 18:06:04.000000000 +0200
@@ -426,4 +426,35 @@ struct vmx_msr_entry {
 	u64 value;
 } __aligned(16);
 
+/*
+ * VM-instruction error numbers
+ */
+enum vm_instruction_error_number {
+	VMXERR_VMCALL_IN_VMX_ROOT_OPERATION = 1,
+	VMXERR_VMCLEAR_INVALID_ADDRESS = 2,
+	VMXERR_VMCLEAR_VMXON_POINTER = 3,
+	VMXERR_VMLAUNCH_NONCLEAR_VMCS = 4,
+	VMXERR_VMRESUME_NONLAUNCHED_VMCS = 5,
+	VMXERR_VMRESUME_AFTER_VMXOFF = 6,
+	VMXERR_ENTRY_INVALID_CONTROL_FIELD = 7,
+	VMXERR_ENTRY_INVALID_HOST_STATE_FIELD = 8,
+	VMXERR_VMPTRLD_INVALID_ADDRESS = 9,
+	VMXERR_VMPTRLD_VMXON_POINTER = 10,
+	VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID = 11,
+	VMXERR_UNSUPPORTED_VMCS_COMPONENT = 12,
+	VMXERR_VMWRITE_READ_ONLY_VMCS_COMPONENT = 13,
+	VMXERR_VMXON_IN_VMX_ROOT_OPERATION = 15,
+	VMXERR_ENTRY_INVALID_EXECUTIVE_VMCS_POINTER = 16,
+	VMXERR_ENTRY_NONLAUNCHED_EXECUTIVE_VMCS = 17,
+	VMXERR_ENTRY_EXECUTIVE_VMCS_POINTER_NOT_VMXON_POINTER = 18,
+	VMXERR_VMCALL_NONCLEAR_VMCS = 19,
+	VMXERR_VMCALL_INVALID_VM_EXIT_CONTROL_FIELDS = 20,
+	VMXERR_VMCALL_INCORRECT_MSEG_REVISION_ID = 22,
+	VMXERR_VMXOFF_UNDER_DUAL_MONITOR_TREATMENT_OF_SMIS_AND_SMM = 23,
+	VMXERR_VMCALL_INVALID_SMM_MONITOR_FEATURES = 24,
+	VMXERR_ENTRY_INVALID_VM_EXECUTION_CONTROL_FIELDS_IN_EXECUTIVE_VMCS = 25,
+	VMXERR_ENTRY_EVENTS_BLOCKED_BY_MOV_SS = 26,
+	VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID = 28,
+};
+
 #endif


* [PATCH 11/29] nVMX: Implement VMCLEAR
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
                   ` (9 preceding siblings ...)
  2011-01-27  8:34 ` [PATCH 10/29] nVMX: Success/failure of VMX instructions Nadav Har'El
@ 2011-01-27  8:35 ` Nadav Har'El
  2011-01-30 12:07   ` Avi Kivity
  2011-01-27  8:35 ` [PATCH 12/29] nVMX: Implement VMPTRLD Nadav Har'El
                   ` (18 subsequent siblings)
  29 siblings, 1 reply; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:35 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch implements the VMCLEAR instruction.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   63 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 62 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c	2011-01-26 18:06:04.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-01-26 18:06:04.000000000 +0200
@@ -283,6 +283,8 @@ struct __packed vmcs12 {
 	u32 abort;
 
 	struct vmcs_fields fields;
+
+	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
 };
 
 /*
@@ -4582,6 +4584,65 @@ static void nested_vmx_failValid(struct 
 	get_vmcs12_fields(vcpu)->vm_instruction_error = vm_instruction_error;
 }
 
+/* Emulate the VMCLEAR instruction */
+static int handle_vmclear(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	gva_t gva;
+	gpa_t vmcs12_addr;
+	struct vmcs12 *vmcs12;
+	struct page *page;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+			vmcs_read32(VMX_INSTRUCTION_INFO), &gva))
+		return 1;
+
+	if (kvm_read_guest_virt(gva, &vmcs12_addr, sizeof(vmcs12_addr),
+				vcpu, NULL)) {
+		kvm_queue_exception(vcpu, PF_VECTOR);
+		return 1;
+	}
+
+	if (!IS_ALIGNED(vmcs12_addr, PAGE_SIZE)) {
+		nested_vmx_failValid(vcpu, VMXERR_VMCLEAR_INVALID_ADDRESS);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	if (vmcs12_addr == vmx->nested.current_vmptr) {
+		kunmap(vmx->nested.current_vmcs12_page);
+		nested_release_page(vmx->nested.current_vmcs12_page);
+		vmx->nested.current_vmptr = -1ull;
+	}
+
+	page = nested_get_page(vcpu, vmcs12_addr);
+	if (page == NULL) {
+		/*
+		 * For accurate processor emulation, VMCLEAR beyond available
+		 * physical memory should do nothing at all. However, it is
+		 * possible that a nested vmx bug, not a guest hypervisor bug,
+		 * resulted in this case, so let's shut down before doing any
+		 * more damage:
+		 */
+		kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+		nested_release_page(page);
+		return 1;
+	}
+	vmcs12 = kmap(page);
+	vmcs12->launch_state = 0;
+	kunmap(page);
+	nested_release_page(page);
+
+	nested_free_vmcs(vmx, vmcs12_addr);
+
+	skip_emulated_instruction(vcpu);
+	nested_vmx_succeed(vcpu);
+	return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -4603,7 +4664,7 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_INVD]		      = handle_invd,
 	[EXIT_REASON_INVLPG]		      = handle_invlpg,
 	[EXIT_REASON_VMCALL]                  = handle_vmcall,
-	[EXIT_REASON_VMCLEAR]	              = handle_vmx_insn,
+	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
 	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
 	[EXIT_REASON_VMPTRLD]                 = handle_vmx_insn,
 	[EXIT_REASON_VMPTRST]                 = handle_vmx_insn,


* [PATCH 12/29] nVMX: Implement VMPTRLD
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
                   ` (10 preceding siblings ...)
  2011-01-27  8:35 ` [PATCH 11/29] nVMX: Implement VMCLEAR Nadav Har'El
@ 2011-01-27  8:35 ` Nadav Har'El
  2011-01-27  8:36 ` [PATCH 13/29] nVMX: Implement VMPTRST Nadav Har'El
                   ` (17 subsequent siblings)
  29 siblings, 0 replies; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:35 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch implements the VMPTRLD instruction.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   64 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 63 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c	2011-01-26 18:06:04.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-01-26 18:06:04.000000000 +0200
@@ -4643,6 +4643,68 @@ static int handle_vmclear(struct kvm_vcp
 	return 1;
 }
 
+/* Emulate the VMPTRLD instruction */
+static int handle_vmptrld(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	gva_t gva;
+	gpa_t vmcs12_addr;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+			vmcs_read32(VMX_INSTRUCTION_INFO), &gva))
+		return 1;
+
+	if (kvm_read_guest_virt(gva, &vmcs12_addr, sizeof(vmcs12_addr),
+				vcpu, NULL)) {
+		kvm_queue_exception(vcpu, PF_VECTOR);
+		return 1;
+	}
+
+	if (!IS_ALIGNED(vmcs12_addr, PAGE_SIZE)) {
+		nested_vmx_failValid(vcpu, VMXERR_VMPTRLD_INVALID_ADDRESS);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	if (vmx->nested.current_vmptr != vmcs12_addr) {
+		struct vmcs12 *new_vmcs12;
+		struct page *page;
+		page = nested_get_page(vcpu, vmcs12_addr);
+		if (page == NULL) {
+			nested_vmx_failInvalid(vcpu);
+			skip_emulated_instruction(vcpu);
+			return 1;
+		}
+		new_vmcs12 = kmap(page);
+		if (new_vmcs12->revision_id != VMCS12_REVISION) {
+			kunmap(page);
+			nested_release_page_clean(page);
+			nested_vmx_failValid(vcpu,
+				VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID);
+			skip_emulated_instruction(vcpu);
+			return 1;
+		}
+		if (vmx->nested.current_vmptr != -1ull) {
+			kunmap(vmx->nested.current_vmcs12_page);
+			nested_release_page(vmx->nested.current_vmcs12_page);
+		}
+
+		vmx->nested.current_vmptr = vmcs12_addr;
+		vmx->nested.current_vmcs12 = new_vmcs12;
+		vmx->nested.current_vmcs12_page = page;
+
+		if (nested_create_current_vmcs(vcpu))
+			return -ENOMEM;
+	}
+
+	nested_vmx_succeed(vcpu);
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -4666,7 +4728,7 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_VMCALL]                  = handle_vmcall,
 	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
 	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
-	[EXIT_REASON_VMPTRLD]                 = handle_vmx_insn,
+	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
 	[EXIT_REASON_VMPTRST]                 = handle_vmx_insn,
 	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
 	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,


* [PATCH 13/29] nVMX: Implement VMPTRST
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
                   ` (11 preceding siblings ...)
  2011-01-27  8:35 ` [PATCH 12/29] nVMX: Implement VMPTRLD Nadav Har'El
@ 2011-01-27  8:36 ` Nadav Har'El
  2011-01-27  8:37 ` [PATCH 14/29] nVMX: Implement VMREAD and VMWRITE Nadav Har'El
                   ` (16 subsequent siblings)
  29 siblings, 0 replies; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:36 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch implements the VMPTRST instruction. 

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   27 ++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c |    3 ++-
 arch/x86/kvm/x86.h |    3 +++
 3 files changed, 31 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/x86.c	2011-01-26 18:06:04.000000000 +0200
+++ .after/arch/x86/kvm/x86.c	2011-01-26 18:06:04.000000000 +0200
@@ -3703,7 +3703,7 @@ static int kvm_read_guest_virt_system(gv
 	return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, 0, exception);
 }
 
-static int kvm_write_guest_virt_system(gva_t addr, void *val,
+int kvm_write_guest_virt_system(gva_t addr, void *val,
 				       unsigned int bytes,
 				       struct kvm_vcpu *vcpu,
 				       struct x86_exception *exception)
@@ -3734,6 +3734,7 @@ static int kvm_write_guest_virt_system(g
 out:
 	return r;
 }
+EXPORT_SYMBOL_GPL(kvm_write_guest_virt_system);
 
 static int emulator_read_emulated(unsigned long addr,
 				  void *val,
--- .before/arch/x86/kvm/x86.h	2011-01-26 18:06:04.000000000 +0200
+++ .after/arch/x86/kvm/x86.h	2011-01-26 18:06:04.000000000 +0200
@@ -82,6 +82,9 @@ int kvm_inject_realmode_interrupt(struct
 int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
 		struct kvm_vcpu *vcpu, struct x86_exception *exception);
 
+int kvm_write_guest_virt_system(gva_t addr, void *val, unsigned int bytes,
+		struct kvm_vcpu *vcpu, struct x86_exception *exception);
+
 void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data);
 
 #endif
--- .before/arch/x86/kvm/vmx.c	2011-01-26 18:06:04.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-01-26 18:06:04.000000000 +0200
@@ -4705,6 +4705,31 @@ static int handle_vmptrld(struct kvm_vcp
 	return 1;
 }
 
+/* Emulate the VMPTRST instruction */
+static int handle_vmptrst(struct kvm_vcpu *vcpu)
+{
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	gva_t vmcs_gva;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (get_vmx_mem_address(vcpu, exit_qualification,
+			vmx_instruction_info, &vmcs_gva))
+		return 1;
+	/* ok to use *_system: nested_vmx_check_permission verified cpl=0 */
+	if (kvm_write_guest_virt_system(vmcs_gva,
+				 (void *)&to_vmx(vcpu)->nested.current_vmptr,
+				 sizeof(u64), vcpu, NULL)) {
+		kvm_queue_exception(vcpu, PF_VECTOR);
+		return 1;
+	}
+	nested_vmx_succeed(vcpu);
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -4729,7 +4754,7 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
 	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
 	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
-	[EXIT_REASON_VMPTRST]                 = handle_vmx_insn,
+	[EXIT_REASON_VMPTRST]                 = handle_vmptrst,
 	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
 	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
 	[EXIT_REASON_VMWRITE]                 = handle_vmx_insn,


* [PATCH 14/29] nVMX: Implement VMREAD and VMWRITE
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
                   ` (12 preceding siblings ...)
  2011-01-27  8:36 ` [PATCH 13/29] nVMX: Implement VMPTRST Nadav Har'El
@ 2011-01-27  8:37 ` Nadav Har'El
  2011-01-27  8:37 ` [PATCH 15/29] nVMX: Prepare vmcs02 from vmcs01 and vmcs12 Nadav Har'El
                   ` (15 subsequent siblings)
  29 siblings, 0 replies; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:37 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Implement the VMREAD and VMWRITE instructions. With these instructions, L1
can read and write to the VMCS it is holding. The values are read or written
to the fields of the vmcs_fields structure introduced in a previous patch.
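
The handlers decode the architectural VMCS field encoding to find a field's
width and read-only attribute: bit 0 selects the high half of a 64-bit field,
bits 11:10 give the field kind (1 means read-only), and bits 14:13 give the
width. A small illustrative sketch using the helpers added below (not part of
the patch; the wrapper function is hypothetical):

	/* Illustrative only -- not part of the patch. */
	static void __maybe_unused vmcs_field_encoding_example(void)
	{
		/* GUEST_RIP (0x681e) is a natural-width guest-state field */
		WARN_ON(vmcs_field_type(GUEST_RIP) != VMCS_FIELD_TYPE_ULONG);
		WARN_ON(vmcs_field_readonly(GUEST_RIP));

		/*
		 * VM_INSTRUCTION_ERROR (0x4400) is a 32-bit read-only field,
		 * so handle_vmwrite() rejects it with
		 * VMXERR_VMWRITE_READ_ONLY_VMCS_COMPONENT.
		 */
		WARN_ON(vmcs_field_type(VM_INSTRUCTION_ERROR) !=
			VMCS_FIELD_TYPE_U32);
		WARN_ON(!vmcs_field_readonly(VM_INSTRUCTION_ERROR));
	}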

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  171 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 169 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-01-26 18:06:04.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-01-26 18:06:04.000000000 +0200
@@ -4643,6 +4643,173 @@ static int handle_vmclear(struct kvm_vcp
 	return 1;
 }
 
+enum vmcs_field_type {
+	VMCS_FIELD_TYPE_U16 = 0,
+	VMCS_FIELD_TYPE_U64 = 1,
+	VMCS_FIELD_TYPE_U32 = 2,
+	VMCS_FIELD_TYPE_ULONG = 3
+};
+
+static inline int vmcs_field_type(unsigned long field)
+{
+	if (0x1 & field)	/* one of the *_HIGH fields, all are 32 bit */
+		return VMCS_FIELD_TYPE_U32;
+	return (field >> 13) & 0x3 ;
+}
+
+static inline int vmcs_field_readonly(unsigned long field)
+{
+	return (((field >> 10) & 0x3) == 1);
+}
+
+static inline bool vmcs12_read_any(struct kvm_vcpu *vcpu,
+					unsigned long field, u64 *ret)
+{
+	short offset = vmcs_field_to_offset(field);
+	char *p;
+
+	if (offset < 0)
+		return 0;
+
+	p = ((char *)(get_vmcs12_fields(vcpu))) + offset;
+
+	switch (vmcs_field_type(field)) {
+	case VMCS_FIELD_TYPE_ULONG:
+		*ret = *((unsigned long *)p);
+		return 1;
+	case VMCS_FIELD_TYPE_U16:
+		*ret = (u16) *((unsigned long *)p);
+		return 1;
+	case VMCS_FIELD_TYPE_U32:
+		*ret = (u32) *((unsigned long *)p);
+		return 1;
+	case VMCS_FIELD_TYPE_U64:
+		*ret = *((u64 *)p);
+		return 1;
+	default:
+		return 0; /* can never happen. */
+	}
+}
+
+static int handle_vmread(struct kvm_vcpu *vcpu)
+{
+	unsigned long field;
+	u64 field_value;
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	gva_t gva = 0;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	/* decode instruction info and find the field to read */
+	field = kvm_register_read(vcpu, (((vmx_instruction_info) >> 28) & 0xf));
+	if (!vmcs12_read_any(vcpu, field, &field_value)) {
+		nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	/*
+	 * and now check whether the request is to put the value in a register
+	 * or in memory. Note that the number of bits actually written is 32
+	 * or 64, depending on the mode, not on the given field's length.
+	 */
+	if (vmx_instruction_info & (1u << 10)) {
+		kvm_register_write(vcpu, (((vmx_instruction_info) >> 3) & 0xf),
+			field_value);
+	} else {
+		if (get_vmx_mem_address(vcpu, exit_qualification,
+				vmx_instruction_info, &gva))
+			return 1;
+		/* ok to use *_system, because handle_vmread verified cpl=0 */
+		kvm_write_guest_virt_system(gva, &field_value,
+			     (is_long_mode(vcpu) ? 8 : 4), vcpu, NULL);
+	}
+
+	nested_vmx_succeed(vcpu);
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
+
+static int handle_vmwrite(struct kvm_vcpu *vcpu)
+{
+	unsigned long field;
+	u64 field_value = 0;
+	gva_t gva;
+	int field_type;
+	unsigned long exit_qualification   = vmcs_readl(EXIT_QUALIFICATION);
+	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	char *p;
+	short offset;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (vmx_instruction_info & (1u << 10))
+		field_value = kvm_register_read(vcpu,
+			(((vmx_instruction_info) >> 3) & 0xf));
+	else {
+		if (get_vmx_mem_address(vcpu, exit_qualification,
+				vmx_instruction_info, &gva))
+			return 1;
+		if (kvm_read_guest_virt(gva, &field_value,
+				(is_long_mode(vcpu) ? 8 : 4), vcpu, NULL)) {
+			kvm_queue_exception(vcpu, PF_VECTOR);
+			return 1;
+		}
+	}
+
+
+	field = kvm_register_read(vcpu, (((vmx_instruction_info) >> 28) & 0xf));
+
+	if (vmcs_field_readonly(field)) {
+		nested_vmx_failValid(vcpu,
+			VMXERR_VMWRITE_READ_ONLY_VMCS_COMPONENT);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	field_type = vmcs_field_type(field);
+
+	offset = vmcs_field_to_offset(field);
+	if (offset < 0) {
+		nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+	p = ((char *) get_vmcs12_fields(vcpu)) + offset;
+
+	switch (field_type) {
+	case VMCS_FIELD_TYPE_U16:
+		*(u16 *)p = field_value;
+		break;
+	case VMCS_FIELD_TYPE_U32:
+		*(u32 *)p = field_value;
+		break;
+	case VMCS_FIELD_TYPE_U64:
+#ifdef CONFIG_X86_64
+		*(unsigned long *)p = field_value;
+#else
+		*(unsigned long *)p = field_value;
+		*(((unsigned long *)p)+1) = field_value >> 32;
+#endif
+		break;
+	case VMCS_FIELD_TYPE_ULONG:
+		*(unsigned long *)p = field_value;
+		break;
+	default:
+		nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	nested_vmx_succeed(vcpu);
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
 /* Emulate the VMPTRLD instruction */
 static int handle_vmptrld(struct kvm_vcpu *vcpu)
 {
@@ -4755,9 +4922,9 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
 	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
 	[EXIT_REASON_VMPTRST]                 = handle_vmptrst,
-	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
+	[EXIT_REASON_VMREAD]                  = handle_vmread,
 	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
-	[EXIT_REASON_VMWRITE]                 = handle_vmx_insn,
+	[EXIT_REASON_VMWRITE]                 = handle_vmwrite,
 	[EXIT_REASON_VMOFF]                   = handle_vmoff,
 	[EXIT_REASON_VMON]                    = handle_vmon,
 	[EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,


* [PATCH 15/29] nVMX: Prepare vmcs02 from vmcs01 and vmcs12
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
                   ` (13 preceding siblings ...)
  2011-01-27  8:37 ` [PATCH 14/29] nVMX: Implement VMREAD and VMWRITE Nadav Har'El
@ 2011-01-27  8:37 ` Nadav Har'El
  2011-01-27  8:38 ` [PATCH 16/29] nVMX: Move register-syncing to a function Nadav Har'El
                   ` (14 subsequent siblings)
  29 siblings, 0 replies; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:37 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch contains the code that prepares the VMCS actually used to run the
L2 guest, vmcs02. prepare_vmcs02() merges the information in vmcs12 (the VMCS
that L1 built for L2) with the information in vmcs01 (the VMCS that we built
for L1).

VMREAD/VMWRITE can only access one VMCS at a time (the "current" VMCS), which
makes it difficult to read from vmcs01 while writing to vmcs02. This is why we
first save a copy of vmcs01's fields in ordinary memory (vmcs01_fields) and
then read that memory copy while writing to vmcs02.
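
Put differently, the nested-entry sequence is roughly the following (a
simplified sketch of the flow that the later VMLAUNCH/VMRESUME patch ties
together; error handling and bookkeeping are omitted, and "vmcs02" here
stands for the hardware VMCS chosen for this L2):

	/* snapshot vmcs01 into memory while it is still the current VMCS */
	save_vmcs(vmx->nested.vmcs01_fields);

	/* make vmcs02 the current VMCS (done via vmx_vcpu_put()/_load()) */
	vmx->vmcs = vmcs02;

	/* VMWRITEs now go to vmcs02; vmcs01 values come from the snapshot */
	prepare_vmcs02(vcpu, get_vmcs12_fields(vcpu),
		       vmx->nested.vmcs01_fields);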

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  410 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 410 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-01-26 18:06:04.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-01-26 18:06:04.000000000 +0200
@@ -820,6 +820,28 @@ static inline bool report_flexpriority(v
 	return flexpriority_enabled;
 }
 
+static inline bool nested_cpu_has_vmx_tpr_shadow(struct kvm_vcpu *vcpu)
+{
+	return cpu_has_vmx_tpr_shadow() &&
+		(get_vmcs12_fields(vcpu)->cpu_based_vm_exec_control &
+		CPU_BASED_TPR_SHADOW);
+}
+
+static inline bool nested_cpu_has_secondary_exec_ctrls(struct kvm_vcpu *vcpu)
+{
+	return cpu_has_secondary_exec_ctrls() &&
+		(get_vmcs12_fields(vcpu)->cpu_based_vm_exec_control &
+		CPU_BASED_ACTIVATE_SECONDARY_CONTROLS);
+}
+
+static inline bool nested_vm_need_virtualize_apic_accesses(struct kvm_vcpu
+							   *vcpu)
+{
+	return nested_cpu_has_secondary_exec_ctrls(vcpu) &&
+		(get_vmcs12_fields(vcpu)->secondary_vm_exec_control &
+		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
+}
+
 static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
 {
 	int i;
@@ -5529,6 +5551,394 @@ static void vmx_set_supported_cpuid(u32 
 {
 }
 
+/*
+ * Make a copy of the current VMCS to ordinary memory. This is needed because
+ * in VMX you cannot read and write to two VMCS at the same time, so when we
+ * want to do this (in prepare_vmcs02, which needs to read from vmcs01 while
+ * preparing vmcs02), we need to first save a copy of one VMCS's fields in
+ * memory, and then use that copy.
+ */
+void save_vmcs(struct vmcs_fields *dst)
+{
+	dst->guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR);
+	dst->guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR);
+	dst->guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR);
+	dst->guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR);
+	dst->guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR);
+	dst->guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR);
+	dst->guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR);
+	dst->guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR);
+	dst->host_es_selector = vmcs_read16(HOST_ES_SELECTOR);
+	dst->host_cs_selector = vmcs_read16(HOST_CS_SELECTOR);
+	dst->host_ss_selector = vmcs_read16(HOST_SS_SELECTOR);
+	dst->host_ds_selector = vmcs_read16(HOST_DS_SELECTOR);
+	dst->host_fs_selector = vmcs_read16(HOST_FS_SELECTOR);
+	dst->host_gs_selector = vmcs_read16(HOST_GS_SELECTOR);
+	dst->host_tr_selector = vmcs_read16(HOST_TR_SELECTOR);
+	dst->tsc_offset = vmcs_read64(TSC_OFFSET);
+	dst->virtual_apic_page_addr = vmcs_read64(VIRTUAL_APIC_PAGE_ADDR);
+	dst->apic_access_addr = vmcs_read64(APIC_ACCESS_ADDR);
+	dst->guest_physical_address = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
+	dst->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);
+	dst->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
+	if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
+		dst->guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT);
+	if (enable_ept) {
+		/* shadow pages tables on EPT */
+		dst->ept_pointer = vmcs_read64(EPT_POINTER);
+		dst->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
+		dst->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
+		dst->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
+		dst->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
+	}
+	dst->pin_based_vm_exec_control = vmcs_read32(PIN_BASED_VM_EXEC_CONTROL);
+	dst->cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
+	dst->exception_bitmap = vmcs_read32(EXCEPTION_BITMAP);
+	dst->page_fault_error_code_mask =
+		vmcs_read32(PAGE_FAULT_ERROR_CODE_MASK);
+	dst->page_fault_error_code_match =
+		vmcs_read32(PAGE_FAULT_ERROR_CODE_MATCH);
+	dst->cr3_target_count = vmcs_read32(CR3_TARGET_COUNT);
+	dst->vm_exit_controls = vmcs_read32(VM_EXIT_CONTROLS);
+	dst->vm_entry_controls = vmcs_read32(VM_ENTRY_CONTROLS);
+	dst->vm_entry_intr_info_field = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
+	dst->vm_entry_exception_error_code =
+		vmcs_read32(VM_ENTRY_EXCEPTION_ERROR_CODE);
+	dst->vm_entry_instruction_len = vmcs_read32(VM_ENTRY_INSTRUCTION_LEN);
+	dst->tpr_threshold = vmcs_read32(TPR_THRESHOLD);
+	dst->secondary_vm_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
+	if (enable_vpid && dst->secondary_vm_exec_control &
+	    SECONDARY_EXEC_ENABLE_VPID)
+		dst->virtual_processor_id = vmcs_read16(VIRTUAL_PROCESSOR_ID);
+	dst->vm_instruction_error = vmcs_read32(VM_INSTRUCTION_ERROR);
+	dst->vm_exit_reason  = vmcs_read32(VM_EXIT_REASON);
+	dst->vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+	dst->vm_exit_intr_error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
+	dst->idt_vectoring_info_field = vmcs_read32(IDT_VECTORING_INFO_FIELD);
+	dst->idt_vectoring_error_code = vmcs_read32(IDT_VECTORING_ERROR_CODE);
+	dst->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+	dst->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	dst->guest_es_limit = vmcs_read32(GUEST_ES_LIMIT);
+	dst->guest_cs_limit = vmcs_read32(GUEST_CS_LIMIT);
+	dst->guest_ss_limit = vmcs_read32(GUEST_SS_LIMIT);
+	dst->guest_ds_limit = vmcs_read32(GUEST_DS_LIMIT);
+	dst->guest_fs_limit = vmcs_read32(GUEST_FS_LIMIT);
+	dst->guest_gs_limit = vmcs_read32(GUEST_GS_LIMIT);
+	dst->guest_ldtr_limit = vmcs_read32(GUEST_LDTR_LIMIT);
+	dst->guest_tr_limit = vmcs_read32(GUEST_TR_LIMIT);
+	dst->guest_gdtr_limit = vmcs_read32(GUEST_GDTR_LIMIT);
+	dst->guest_idtr_limit = vmcs_read32(GUEST_IDTR_LIMIT);
+	dst->guest_es_ar_bytes = vmcs_read32(GUEST_ES_AR_BYTES);
+	dst->guest_cs_ar_bytes = vmcs_read32(GUEST_CS_AR_BYTES);
+	dst->guest_ss_ar_bytes = vmcs_read32(GUEST_SS_AR_BYTES);
+	dst->guest_ds_ar_bytes = vmcs_read32(GUEST_DS_AR_BYTES);
+	dst->guest_fs_ar_bytes = vmcs_read32(GUEST_FS_AR_BYTES);
+	dst->guest_gs_ar_bytes = vmcs_read32(GUEST_GS_AR_BYTES);
+	dst->guest_ldtr_ar_bytes = vmcs_read32(GUEST_LDTR_AR_BYTES);
+	dst->guest_tr_ar_bytes = vmcs_read32(GUEST_TR_AR_BYTES);
+	dst->guest_interruptibility_info =
+		vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
+	dst->guest_activity_state = vmcs_read32(GUEST_ACTIVITY_STATE);
+	dst->guest_sysenter_cs = vmcs_read32(GUEST_SYSENTER_CS);
+	dst->host_ia32_sysenter_cs = vmcs_read32(HOST_IA32_SYSENTER_CS);
+	dst->cr0_guest_host_mask = vmcs_readl(CR0_GUEST_HOST_MASK);
+	dst->cr4_guest_host_mask = vmcs_readl(CR4_GUEST_HOST_MASK);
+	dst->cr0_read_shadow = vmcs_readl(CR0_READ_SHADOW);
+	dst->cr4_read_shadow = vmcs_readl(CR4_READ_SHADOW);
+	dst->cr3_target_value0 = vmcs_readl(CR3_TARGET_VALUE0);
+	dst->cr3_target_value1 = vmcs_readl(CR3_TARGET_VALUE1);
+	dst->cr3_target_value2 = vmcs_readl(CR3_TARGET_VALUE2);
+	dst->cr3_target_value3 = vmcs_readl(CR3_TARGET_VALUE3);
+	dst->exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	dst->guest_linear_address = vmcs_readl(GUEST_LINEAR_ADDRESS);
+	dst->guest_cr0 = vmcs_readl(GUEST_CR0);
+	dst->guest_cr3 = vmcs_readl(GUEST_CR3);
+	dst->guest_cr4 = vmcs_readl(GUEST_CR4);
+	dst->guest_es_base = vmcs_readl(GUEST_ES_BASE);
+	dst->guest_cs_base = vmcs_readl(GUEST_CS_BASE);
+	dst->guest_ss_base = vmcs_readl(GUEST_SS_BASE);
+	dst->guest_ds_base = vmcs_readl(GUEST_DS_BASE);
+	dst->guest_fs_base = vmcs_readl(GUEST_FS_BASE);
+	dst->guest_gs_base = vmcs_readl(GUEST_GS_BASE);
+	dst->guest_ldtr_base = vmcs_readl(GUEST_LDTR_BASE);
+	dst->guest_tr_base = vmcs_readl(GUEST_TR_BASE);
+	dst->guest_gdtr_base = vmcs_readl(GUEST_GDTR_BASE);
+	dst->guest_idtr_base = vmcs_readl(GUEST_IDTR_BASE);
+	dst->guest_dr7 = vmcs_readl(GUEST_DR7);
+	dst->guest_rsp = vmcs_readl(GUEST_RSP);
+	dst->guest_rip = vmcs_readl(GUEST_RIP);
+	dst->guest_rflags = vmcs_readl(GUEST_RFLAGS);
+	dst->guest_pending_dbg_exceptions =
+		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
+	dst->guest_sysenter_esp = vmcs_readl(GUEST_SYSENTER_ESP);
+	dst->guest_sysenter_eip = vmcs_readl(GUEST_SYSENTER_EIP);
+	dst->host_cr0 = vmcs_readl(HOST_CR0);
+	dst->host_cr3 = vmcs_readl(HOST_CR3);
+	dst->host_cr4 = vmcs_readl(HOST_CR4);
+	dst->host_fs_base = vmcs_readl(HOST_FS_BASE);
+	dst->host_gs_base = vmcs_readl(HOST_GS_BASE);
+	dst->host_tr_base = vmcs_readl(HOST_TR_BASE);
+	dst->host_gdtr_base = vmcs_readl(HOST_GDTR_BASE);
+	dst->host_idtr_base = vmcs_readl(HOST_IDTR_BASE);
+	dst->host_ia32_sysenter_esp = vmcs_readl(HOST_IA32_SYSENTER_ESP);
+	dst->host_ia32_sysenter_eip = vmcs_readl(HOST_IA32_SYSENTER_EIP);
+	dst->host_rsp = vmcs_readl(HOST_RSP);
+	dst->host_rip = vmcs_readl(HOST_RIP);
+	if (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PAT)
+		dst->host_ia32_pat = vmcs_read64(HOST_IA32_PAT);
+}
+
+/*
+ * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
+ * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
+ * with L0's wishes for its guest (vmcs01), so we can run the L2 guest in a
+ * way that is appropriate both to L1's requests and to our own needs.
+ */
+int prepare_vmcs02(struct kvm_vcpu *vcpu,
+	struct vmcs_fields *vmcs12, struct vmcs_fields *vmcs01)
+{
+	u32 exec_control;
+
+	vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector);
+	vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector);
+	vmcs_write16(GUEST_SS_SELECTOR, vmcs12->guest_ss_selector);
+	vmcs_write16(GUEST_DS_SELECTOR, vmcs12->guest_ds_selector);
+	vmcs_write16(GUEST_FS_SELECTOR, vmcs12->guest_fs_selector);
+	vmcs_write16(GUEST_GS_SELECTOR, vmcs12->guest_gs_selector);
+	vmcs_write16(GUEST_LDTR_SELECTOR, vmcs12->guest_ldtr_selector);
+	vmcs_write16(GUEST_TR_SELECTOR, vmcs12->guest_tr_selector);
+
+	vmcs_write64(GUEST_IA32_DEBUGCTL, vmcs12->guest_ia32_debugctl);
+
+	if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
+		vmcs_write64(GUEST_IA32_PAT, vmcs12->guest_ia32_pat);
+
+	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+		     vmcs12->vm_entry_intr_info_field);
+	vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
+		     vmcs12->vm_entry_exception_error_code);
+	vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
+		     vmcs12->vm_entry_instruction_len);
+
+	vmcs_write32(GUEST_ES_LIMIT, vmcs12->guest_es_limit);
+	vmcs_write32(GUEST_CS_LIMIT, vmcs12->guest_cs_limit);
+	vmcs_write32(GUEST_SS_LIMIT, vmcs12->guest_ss_limit);
+	vmcs_write32(GUEST_DS_LIMIT, vmcs12->guest_ds_limit);
+	vmcs_write32(GUEST_FS_LIMIT, vmcs12->guest_fs_limit);
+	vmcs_write32(GUEST_GS_LIMIT, vmcs12->guest_gs_limit);
+	vmcs_write32(GUEST_LDTR_LIMIT, vmcs12->guest_ldtr_limit);
+	vmcs_write32(GUEST_TR_LIMIT, vmcs12->guest_tr_limit);
+	vmcs_write32(GUEST_GDTR_LIMIT, vmcs12->guest_gdtr_limit);
+	vmcs_write32(GUEST_IDTR_LIMIT, vmcs12->guest_idtr_limit);
+	vmcs_write32(GUEST_ES_AR_BYTES, vmcs12->guest_es_ar_bytes);
+	vmcs_write32(GUEST_CS_AR_BYTES, vmcs12->guest_cs_ar_bytes);
+	vmcs_write32(GUEST_SS_AR_BYTES, vmcs12->guest_ss_ar_bytes);
+	vmcs_write32(GUEST_DS_AR_BYTES, vmcs12->guest_ds_ar_bytes);
+	vmcs_write32(GUEST_FS_AR_BYTES, vmcs12->guest_fs_ar_bytes);
+	vmcs_write32(GUEST_GS_AR_BYTES, vmcs12->guest_gs_ar_bytes);
+	vmcs_write32(GUEST_LDTR_AR_BYTES, vmcs12->guest_ldtr_ar_bytes);
+	vmcs_write32(GUEST_TR_AR_BYTES, vmcs12->guest_tr_ar_bytes);
+	vmcs_write32(GUEST_INTERRUPTIBILITY_INFO,
+		     vmcs12->guest_interruptibility_info);
+	vmcs_write32(GUEST_ACTIVITY_STATE, vmcs12->guest_activity_state);
+	vmcs_write32(GUEST_SYSENTER_CS, vmcs12->guest_sysenter_cs);
+
+	vmcs_writel(GUEST_ES_BASE, vmcs12->guest_es_base);
+	vmcs_writel(GUEST_CS_BASE, vmcs12->guest_cs_base);
+	vmcs_writel(GUEST_SS_BASE, vmcs12->guest_ss_base);
+	vmcs_writel(GUEST_DS_BASE, vmcs12->guest_ds_base);
+	vmcs_writel(GUEST_FS_BASE, vmcs12->guest_fs_base);
+	vmcs_writel(GUEST_GS_BASE, vmcs12->guest_gs_base);
+	vmcs_writel(GUEST_LDTR_BASE, vmcs12->guest_ldtr_base);
+	vmcs_writel(GUEST_TR_BASE, vmcs12->guest_tr_base);
+	vmcs_writel(GUEST_GDTR_BASE, vmcs12->guest_gdtr_base);
+	vmcs_writel(GUEST_IDTR_BASE, vmcs12->guest_idtr_base);
+	vmcs_writel(GUEST_DR7, vmcs12->guest_dr7);
+	vmcs_writel(GUEST_RSP, vmcs12->guest_rsp);
+	vmcs_writel(GUEST_RIP, vmcs12->guest_rip);
+	vmcs_writel(GUEST_RFLAGS, vmcs12->guest_rflags);
+	vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS,
+		    vmcs12->guest_pending_dbg_exceptions);
+	vmcs_writel(GUEST_SYSENTER_ESP, vmcs12->guest_sysenter_esp);
+	vmcs_writel(GUEST_SYSENTER_EIP, vmcs12->guest_sysenter_eip);
+
+	vmcs_write64(VMCS_LINK_POINTER, vmcs12->vmcs_link_pointer);
+
+	if (vmcs12->vm_entry_msr_load_count > 0 ||
+			vmcs12->vm_exit_msr_load_count > 0 ||
+			vmcs12->vm_exit_msr_store_count > 0) {
+		printk(KERN_WARNING
+			"%s: VMCS MSR_{LOAD,STORE} unsupported\n", __func__);
+	}
+
+	if (nested_cpu_has_vmx_tpr_shadow(vcpu)) {
+		struct page *page =
+			nested_get_page(vcpu, vmcs12->virtual_apic_page_addr);
+		if (!page)
+			return 1;
+		vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, page_to_phys(page));
+		kvm_release_page_clean(page);
+	}
+
+	if (nested_vm_need_virtualize_apic_accesses(vcpu)) {
+		struct page *page =
+			nested_get_page(vcpu, vmcs12->apic_access_addr);
+		if (!page)
+			return 1;
+		vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page));
+		kvm_release_page_clean(page);
+	}
+
+	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
+		     (vmcs01->pin_based_vm_exec_control |
+		      vmcs12->pin_based_vm_exec_control));
+
+	/*
+	 * Whether page-faults are trapped is determined by a combination of
+	 * 3 settings: PFEC_MASK, PFEC_MATCH and EXCEPTION_BITMAP.PF.
+	 * If enable_ept, L0 doesn't care about page faults and we should
+	 * set all of these to L1's desires. However, if !enable_ept, L0 does
+	 * care about (at least some) page faults, and because it is not easy
+	 * (if at all possible?) to merge L0 and L1's desires, we simply ask
+	 * to exit on each and every L2 page fault. This is done by setting
+	 * MASK=MATCH=0 and (see below) EB.PF=1.
+	 * Note that below we don't need special code to set EB.PF beyond the
+	 * "or"ing of the EB of vmcs01 and vmcs12, because when enable_ept,
+	 * vmcs01's EB.PF is 0 so the "or" will take vmcs12's value, and when
+	 * !enable_ept, EB.PF is 1, so the "or" will always be 1.
+	 *
+	 * A problem with this approach (when !enable_ept) is that L1 may be
+	 * injected with more page faults than it asked for. This could have
+	 * caused problems, but in practice existing hypervisors don't care.
+	 * To fix this, we will need to emulate the PFEC checking (on the L1
+	 * page tables), using walk_addr(), when injecting PFs to L1.
+	 */
+	vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK,
+		enable_ept ? vmcs12->page_fault_error_code_mask : 0);
+	vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH,
+		enable_ept ? vmcs12->page_fault_error_code_match : 0);
+
+	if (cpu_has_secondary_exec_ctrls()) {
+		u32 exec_control = vmcs01->secondary_vm_exec_control;
+		if (nested_cpu_has_secondary_exec_ctrls(vcpu)) {
+			exec_control |= vmcs12->secondary_vm_exec_control;
+			if (!vm_need_virtualize_apic_accesses(vcpu->kvm) ||
+			    !nested_vm_need_virtualize_apic_accesses(vcpu))
+				exec_control &=
+				~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+		}
+		vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
+	}
+
+	/*
+	 * Copy host-state from vmcs01. Note that some fields are different
+	 * per CPU, and are not copied here but rather in vmx_vcpu_load().
+	 * Eventually, we shouldn't copy host fields from vmcs01 at all, but
+	 * rather just call the KVM code which sets them up.
+	 */
+	vmcs_write16(HOST_ES_SELECTOR, vmcs01->host_es_selector);
+	vmcs_write16(HOST_CS_SELECTOR, vmcs01->host_cs_selector);
+	vmcs_write16(HOST_SS_SELECTOR, vmcs01->host_ss_selector);
+	vmcs_write16(HOST_DS_SELECTOR, vmcs01->host_ds_selector);
+	vmcs_write16(HOST_FS_SELECTOR, vmcs01->host_fs_selector);
+	vmcs_write16(HOST_GS_SELECTOR, vmcs01->host_gs_selector);
+	vmcs_write16(HOST_TR_SELECTOR, vmcs01->host_tr_selector);
+	if (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PAT)
+		vmcs_write64(HOST_IA32_PAT, vmcs01->host_ia32_pat);
+	vmcs_write32(HOST_IA32_SYSENTER_CS, vmcs01->host_ia32_sysenter_cs);
+	vmcs_writel(HOST_CR0, vmcs01->host_cr0);
+	vmcs_writel(HOST_CR3, vmcs01->host_cr3);
+	vmcs_writel(HOST_CR4, vmcs01->host_cr4);
+	vmcs_writel(HOST_FS_BASE, vmcs01->host_fs_base);
+	vmcs_writel(HOST_GS_BASE, vmcs01->host_gs_base);
+	vmcs_writel(HOST_IDTR_BASE, vmcs01->host_idtr_base);
+	vmcs_writel(HOST_RSP, vmcs01->host_rsp);
+	vmcs_writel(HOST_RIP, vmcs01->host_rip);
+	vmcs_writel(HOST_IA32_SYSENTER_EIP, vmcs01->host_ia32_sysenter_eip);
+
+	if (vm_need_tpr_shadow(vcpu->kvm) &&
+	    nested_cpu_has_vmx_tpr_shadow(vcpu))
+		vmcs_write32(TPR_THRESHOLD, vmcs12->tpr_threshold);
+
+	exec_control = vmcs01->cpu_based_vm_exec_control;
+	exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING;
+	exec_control &= ~CPU_BASED_VIRTUAL_NMI_PENDING;
+	exec_control &= ~CPU_BASED_TPR_SHADOW;
+	exec_control |= vmcs12->cpu_based_vm_exec_control;
+	if (!vm_need_tpr_shadow(vcpu->kvm) ||
+	    vmcs12->virtual_apic_page_addr == 0) {
+		exec_control &= ~CPU_BASED_TPR_SHADOW;
+#ifdef CONFIG_X86_64
+		exec_control |= CPU_BASED_CR8_STORE_EXITING |
+			CPU_BASED_CR8_LOAD_EXITING;
+#endif
+	} else if (exec_control & CPU_BASED_TPR_SHADOW) {
+#ifdef CONFIG_X86_64
+		exec_control &= ~CPU_BASED_CR8_STORE_EXITING;
+		exec_control &= ~CPU_BASED_CR8_LOAD_EXITING;
+#endif
+	}
+	/*
+	 * Merging of IO and MSR bitmaps not currently supported.
+	 * Rather, exit every time.
+	 */
+	exec_control &= ~CPU_BASED_USE_MSR_BITMAPS;
+	exec_control &= ~CPU_BASED_USE_IO_BITMAPS;
+	exec_control |= CPU_BASED_UNCOND_IO_EXITING;
+
+	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control);
+
+	/* EXCEPTION_BITMAP and CR0_GUEST_HOST_MASK should basically be the
+	 * bitwise-or of what L1 wants to trap for L2, and what we want to
+	 * trap. However, vmx_fpu_activate/deactivate may have happened after
+	 * we saved vmcs01, so we shouldn't trust its TS and NM_VECTOR bits
+	 * and need to base them again on fpu_active. Note that CR0.TS also
+	 * needs updating - we do this after this function returns (in
+	 * nested_vmx_run).
+	 */
+	vmcs_write32(EXCEPTION_BITMAP,
+		     ((vmcs01->exception_bitmap&~(1u<<NM_VECTOR)) |
+		      (vcpu->fpu_active ? 0 : (1u<<NM_VECTOR)) |
+		      vmcs12->exception_bitmap));
+	vmcs_writel(CR0_GUEST_HOST_MASK, vmcs12->cr0_guest_host_mask |
+			(vcpu->fpu_active ? 0 : X86_CR0_TS));
+	vcpu->arch.cr0_guest_owned_bits = ~(vmcs12->cr0_guest_host_mask |
+			(vcpu->fpu_active ? 0 : X86_CR0_TS));
+
+	vmcs_write32(VM_EXIT_CONTROLS,
+		     (vmcs01->vm_exit_controls &
+			(~(VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT)))
+		       | vmcs12->vm_exit_controls);
+
+	vmcs_write32(VM_ENTRY_CONTROLS,
+		     (vmcs01->vm_entry_controls &
+			(~(VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE)))
+		      | vmcs12->vm_entry_controls);
+
+	vmcs_writel(CR4_GUEST_HOST_MASK,
+		    (vmcs01->cr4_guest_host_mask |
+		     vmcs12->cr4_guest_host_mask));
+	vcpu->arch.cr4_guest_owned_bits = ~(vmcs01->cr4_guest_host_mask |
+		vmcs12->cr4_guest_host_mask);
+
+	vmcs_write64(TSC_OFFSET, vmcs01->tsc_offset + vmcs12->tsc_offset);
+
+	if (enable_ept) {
+		/* shadow page tables on EPT */
+		vmcs_write64(EPT_POINTER, vmcs01->ept_pointer);
+	}
+	if (enable_vpid) {
+		/*
+		 * Trivially support vpid by letting L2s share their parent
+		 * L1's vpid. TODO: move to a more elaborate solution, giving
+		 * each L2 its own vpid and exposing the vpid feature to L1.
+		 */
+		vmcs_write16(VIRTUAL_PROCESSOR_ID, to_vmx(vcpu)->vpid);
+		vmx_flush_tlb(vcpu);
+	}
+	return 0;
+}
+
 static struct kvm_x86_ops vmx_x86_ops = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,


* [PATCH 16/29] nVMX: Move register-syncing to a function
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
                   ` (14 preceding siblings ...)
  2011-01-27  8:37 ` [PATCH 15/29] nVMX: Prepare vmcs02 from vmcs01 and vmcs12 Nadav Har'El
@ 2011-01-27  8:38 ` Nadav Har'El
  2011-01-27  8:38 ` [PATCH 17/29] nVMX: Implement VMLAUNCH and VMRESUME Nadav Har'El
                   ` (13 subsequent siblings)
  29 siblings, 0 replies; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:38 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Move code that syncs dirty RSP and RIP registers back to the VMCS, into a
function. We will need to call this function from additional places in the
next patch.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-01-26 18:06:04.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-01-26 18:06:04.000000000 +0200
@@ -5172,6 +5172,15 @@ static void vmx_cancel_injection(struct 
 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
 }
 
+static inline void sync_cached_regs_to_vmcs(struct kvm_vcpu *vcpu)
+{
+	if (test_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_dirty))
+		vmcs_writel(GUEST_RSP, vcpu->arch.regs[VCPU_REGS_RSP]);
+	if (test_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_dirty))
+		vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]);
+	vcpu->arch.regs_dirty = 0;
+}
+
 #ifdef CONFIG_X86_64
 #define R "r"
 #define Q "q"
@@ -5193,10 +5202,7 @@ static void vmx_vcpu_run(struct kvm_vcpu
 	if (vmx->emulation_required && emulate_invalid_guest_state)
 		return;
 
-	if (test_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_dirty))
-		vmcs_writel(GUEST_RSP, vcpu->arch.regs[VCPU_REGS_RSP]);
-	if (test_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_dirty))
-		vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]);
+	sync_cached_regs_to_vmcs(vcpu);
 
 	/* When single-stepping over STI and MOV SS, we must clear the
 	 * corresponding interruptibility bits in the guest state. Otherwise
@@ -5308,7 +5314,6 @@ static void vmx_vcpu_run(struct kvm_vcpu
 	vcpu->arch.regs_avail = ~((1 << VCPU_REGS_RIP) | (1 << VCPU_REGS_RSP)
 				  | (1 << VCPU_EXREG_PDPTR)
 				  | (1 << VCPU_EXREG_CR3));
-	vcpu->arch.regs_dirty = 0;
 
 	vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
 


* [PATCH 17/29] nVMX: Implement VMLAUNCH and VMRESUME
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
                   ` (15 preceding siblings ...)
  2011-01-27  8:38 ` [PATCH 16/29] nVMX: Move register-syncing to a function Nadav Har'El
@ 2011-01-27  8:38 ` Nadav Har'El
  2011-01-27  8:39 ` [PATCH 18/29] nVMX: No need for handle_vmx_insn function any more Nadav Har'El
                   ` (12 subsequent siblings)
  29 siblings, 0 replies; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:38 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Implement the VMLAUNCH and VMRESUME instructions, allowing a guest
hypervisor to run its own guests.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  205 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 202 insertions(+), 3 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-01-26 18:06:05.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-01-26 18:06:05.000000000 +0200
@@ -341,6 +341,10 @@ struct nested_vmx {
 	/* list of real (hardware) VMCS, one for each L2 guest of L1 */
 	struct list_head vmcs02_list; /* a vmcs_list */
 	int vmcs02_num;
+
+	/* Saving the VMCS that we used for running L1 */
+	struct saved_vmcs saved_vmcs01;
+	struct vmcs_fields *vmcs01_fields;
 };
 
 struct vcpu_vmx {
@@ -4453,6 +4457,10 @@ static int handle_vmon(struct kvm_vcpu *
 	INIT_LIST_HEAD(&(vmx->nested.vmcs02_list));
 	vmx->nested.vmcs02_num = 0;
 
+	vmx->nested.vmcs01_fields = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!vmx->nested.vmcs01_fields)
+		return -ENOMEM;
+
 	vmx->nested.vmxon = true;
 
 	skip_emulated_instruction(vcpu);
@@ -4505,6 +4513,9 @@ static void free_nested(struct vcpu_vmx 
 	}
 
 	nested_free_all_vmcs(vmx);
+
+	kfree(vmx->nested.vmcs01_fields);
+	vmx->nested.vmcs01_fields = NULL;
 }
 
 /* Emulate the VMXOFF instruction */
@@ -4665,6 +4676,60 @@ static int handle_vmclear(struct kvm_vcp
 	return 1;
 }
 
+static int nested_vmx_run(struct kvm_vcpu *vcpu);
+
+static int handle_launch_or_resume(struct kvm_vcpu *vcpu, bool launch)
+{
+	struct vmcs12 *vmcs12;
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	skip_emulated_instruction(vcpu);
+
+	vmcs12 = get_vmcs12(vcpu);
+	/* yet another strange pre-requisite listed in the VMX spec */
+	if (vmcs12->fields.guest_interruptibility_info &
+			GUEST_INTR_STATE_MOV_SS) {
+		nested_vmx_failValid(vcpu,
+			VMXERR_ENTRY_EVENTS_BLOCKED_BY_MOV_SS);
+		return 1;
+	}
+	/*
+	 * enforce that after VMCLEAR, L1 must use VMLAUNCH, but later must use
+	 * VMRESUME, as this is part of the spec (even though it was easier for
+	 * us to just allow both to work any time).
+	 */
+	if (vmcs12->launch_state == launch) {
+		nested_vmx_failValid(vcpu,
+			launch ? VMXERR_VMLAUNCH_NONCLEAR_VMCS :
+				 VMXERR_VMRESUME_NONLAUNCHED_VMCS);
+		return 1;
+	}
+
+	nested_vmx_run(vcpu);
+
+	/*
+	 * Note no nested_vmx_succeed or nested_vmx_fail here. At this point
+	 * we are no longer running L1, and VMLAUNCH/VMRESUME has not yet
+	 * returned as far as L1 is concerned. It will only return (and set
+	 * the success flag) when L2 exits (see nested_vmx_vmexit()).
+	 */
+	return 1;
+}
+
+/* Emulate the VMLAUNCH instruction */
+static int handle_vmlaunch(struct kvm_vcpu *vcpu)
+{
+	return handle_launch_or_resume(vcpu, true);
+}
+
+/* Emulate the VMRESUME instruction */
+static int handle_vmresume(struct kvm_vcpu *vcpu)
+{
+
+	return handle_launch_or_resume(vcpu, false);
+}
+
 enum vmcs_field_type {
 	VMCS_FIELD_TYPE_U16 = 0,
 	VMCS_FIELD_TYPE_U64 = 1,
@@ -4941,11 +5006,11 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_INVLPG]		      = handle_invlpg,
 	[EXIT_REASON_VMCALL]                  = handle_vmcall,
 	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
-	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
+	[EXIT_REASON_VMLAUNCH]                = handle_vmlaunch,
 	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
 	[EXIT_REASON_VMPTRST]                 = handle_vmptrst,
 	[EXIT_REASON_VMREAD]                  = handle_vmread,
-	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
+	[EXIT_REASON_VMRESUME]                = handle_vmresume,
 	[EXIT_REASON_VMWRITE]                 = handle_vmwrite,
 	[EXIT_REASON_VMOFF]                   = handle_vmoff,
 	[EXIT_REASON_VMON]                    = handle_vmon,
@@ -5009,7 +5074,8 @@ static int vmx_handle_exit(struct kvm_vc
 		       "(0x%x) and exit reason is 0x%x\n",
 		       __func__, vectoring_info, exit_reason);
 
-	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked)) {
+	if (!is_guest_mode(vcpu) &&
+	    unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked)) {
 		if (vmx_interrupt_allowed(vcpu)) {
 			vmx->soft_vnmi_blocked = 0;
 		} else if (vmx->vnmi_blocked_time > 1000000000LL &&
@@ -5944,6 +6010,139 @@ int prepare_vmcs02(struct kvm_vcpu *vcpu
 	return 0;
 }
 
+/*
+ * Return the cr0 value that a guest would read. This is a combination of
+ * the real cr0 used to run the guest (guest_cr0), and the bits shadowed by
+ * the hypervisor (cr0_read_shadow).
+ */
+static inline unsigned long guest_readable_cr0(struct vmcs_fields *fields)
+{
+	return (fields->guest_cr0 & ~fields->cr0_guest_host_mask) |
+		(fields->cr0_read_shadow & fields->cr0_guest_host_mask);
+}
+static inline unsigned long guest_readable_cr4(struct vmcs_fields *fields)
+{
+	return (fields->guest_cr4 & ~fields->cr4_guest_host_mask) |
+		(fields->cr4_read_shadow & fields->cr4_guest_host_mask);
+}
+static inline void set_cr3_and_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3)
+{
+	vcpu->arch.cr3 = cr3;
+	vmcs_writel(GUEST_CR3, cr3);
+	__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
+	load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3);
+	vmcs_write64(GUEST_PDPTR0, vcpu->arch.mmu.pdptrs[0]);
+	vmcs_write64(GUEST_PDPTR1, vcpu->arch.mmu.pdptrs[1]);
+	vmcs_write64(GUEST_PDPTR2, vcpu->arch.mmu.pdptrs[2]);
+	vmcs_write64(GUEST_PDPTR3, vcpu->arch.mmu.pdptrs[3]);
+}
+
+static int nested_vmx_run(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	int cpu;
+	struct saved_vmcs *saved_vmcs02;
+
+	enter_guest_mode(vcpu);
+	sync_cached_regs_to_vmcs(vcpu);
+	save_vmcs(vmx->nested.vmcs01_fields);
+
+	/*
+	 * Switch from L1's VMCS, to L2's VMCS. Remember the L1 VMCS, on which
+	 * CPU it was last loaded, and whether it was launched (we need all
+	 * these values next time we will use L1). Then recall these values as
+	 * they were for L2's VMCS (unless L2 has never been launched).
+	 */
+	vmx->nested.saved_vmcs01.vmcs = vmx->vmcs;
+	vmx->nested.saved_vmcs01.cpu = vcpu->cpu;
+	vmx->nested.saved_vmcs01.launched = vmx->launched;
+
+	saved_vmcs02 = nested_get_current_vmcs(vmx);
+	if (!saved_vmcs02) {
+		/* In the current code, this cannot happen */
+		kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+		return 1;
+	}
+	vmx->vmcs = saved_vmcs02->vmcs;
+	vcpu->cpu = saved_vmcs02->cpu;
+	vmx->launched = saved_vmcs02->launched;
+
+	vmx_vcpu_put(vcpu);
+	cpu = get_cpu();
+	vmx_vcpu_load(vcpu, cpu);
+	vcpu->cpu = cpu;
+	put_cpu();
+
+	vmx->nested.current_vmcs12->launch_state = 1;
+
+	prepare_vmcs02(vcpu,
+		get_vmcs12_fields(vcpu), vmx->nested.vmcs01_fields);
+
+	if (get_vmcs12_fields(vcpu)->vm_entry_controls &
+	    VM_ENTRY_IA32E_MODE)
+		vcpu->arch.efer |= (EFER_LMA | EFER_LME);
+	else
+		vcpu->arch.efer &= ~(EFER_LMA | EFER_LME);
+
+	vmx->rmode.vm86_active =
+		!(get_vmcs12_fields(vcpu)->cr0_read_shadow & X86_CR0_PE);
+
+	/* vmx_set_cr0() sets the cr0 that L2 will read, to be the one that L1
+	 * dictated, and takes appropriate actions for special cr0 bits (like
+	 * real mode, etc.).
+	 */
+	vmx_set_cr0(vcpu, guest_readable_cr0(get_vmcs12_fields(vcpu)));
+
+	/* However, vmx_set_cr0 incorrectly enforces KVM's relationship between
+	 * GUEST_CR0 and CR0_READ_SHADOW, e.g., that the former is the same as
+	 * the latter with TS added if !fpu_active. We need to take the
+	 * actual GUEST_CR0 that L1 wanted, just with added TS if !fpu_active
+	 * like KVM wants (for the "lazy fpu" feature, to avoid the costly
+	 * restoration of fpu registers until the FPU is really used).
+	 */
+	vmcs_writel(GUEST_CR0, get_vmcs12_fields(vcpu)->guest_cr0 |
+		(vcpu->fpu_active ? 0 : X86_CR0_TS));
+
+	/* we have to set the X86_CR0_PG bit of the cached cr0, because
+	 * kvm_mmu_reset_context enables paging only if X86_CR0_PG is set in
+	 * CR0 (we need the paging so that KVM treats this guest as a paging
+	 * guest, so we can easily forward page faults to L1).
+	 */
+	vcpu->arch.cr0 |= X86_CR0_PG;
+
+	if (enable_ept) {
+		/* shadow page tables on EPT */
+		vcpu->arch.cr4 = guest_readable_cr4(get_vmcs12_fields(vcpu));
+		vmcs_writel(CR4_READ_SHADOW, vcpu->arch.cr4);
+		vmcs_writel(GUEST_CR4, get_vmcs12_fields(vcpu)->guest_cr4);
+		set_cr3_and_pdptrs(vcpu, get_vmcs12_fields(vcpu)->guest_cr3);
+	} else {
+		/* shadow page tables on shadow page tables */
+		vmx_set_cr4(vcpu, get_vmcs12_fields(vcpu)->guest_cr4);
+		vmcs_writel(CR4_READ_SHADOW,
+			    get_vmcs12_fields(vcpu)->cr4_read_shadow);
+		kvm_set_cr3(vcpu, get_vmcs12_fields(vcpu)->guest_cr3);
+		kvm_mmu_reset_context(vcpu);
+
+		if (unlikely(kvm_mmu_load(vcpu))) {
+			/*
+			 * TODO: there is no reasonable error number to use.
+			 * perhaps a more reasonable thing to do is to
+			 * emulate a guest shutdown, not a launch error?
+			 */
+			nested_vmx_failValid(vcpu, 87);
+			return 1;
+		}
+	}
+
+	kvm_register_write(vcpu, VCPU_REGS_RSP,
+			   get_vmcs12_fields(vcpu)->guest_rsp);
+	kvm_register_write(vcpu, VCPU_REGS_RIP,
+			   get_vmcs12_fields(vcpu)->guest_rip);
+
+	return 1;
+}
+
 static struct kvm_x86_ops vmx_x86_ops = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,


* [PATCH 18/29] nVMX: No need for handle_vmx_insn function any more
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
                   ` (16 preceding siblings ...)
  2011-01-27  8:38 ` [PATCH 17/29] nVMX: Implement VMLAUNCH and VMRESUME Nadav Har'El
@ 2011-01-27  8:39 ` Nadav Har'El
  2011-01-27  8:39 ` [PATCH 19/29] nVMX: Exiting from L2 to L1 Nadav Har'El
                   ` (11 subsequent siblings)
  29 siblings, 0 replies; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:39 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Before nested VMX support, the exit handler for a guest executing a VMX
instruction (vmclear, vmlaunch, vmptrld, vmptrst, vmread, vmresume, vmwrite,
vmxon, vmxoff) was handle_vmx_insn(). This handler simply threw a #UD
exception. Now that all these exit reasons are properly handled (each
emulating the relevant VMX instruction), nothing calls this dummy handler and
it can be removed.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |    6 ------
 1 file changed, 6 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-01-26 18:06:05.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-01-26 18:06:05.000000000 +0200
@@ -4021,12 +4021,6 @@ static int handle_vmcall(struct kvm_vcpu
 	return 1;
 }
 
-static int handle_vmx_insn(struct kvm_vcpu *vcpu)
-{
-	kvm_queue_exception(vcpu, UD_VECTOR);
-	return 1;
-}
-
 static int handle_invd(struct kvm_vcpu *vcpu)
 {
 	return emulate_instruction(vcpu, 0) == EMULATE_DONE;


* [PATCH 19/29] nVMX: Exiting from L2 to L1
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
                   ` (17 preceding siblings ...)
  2011-01-27  8:39 ` [PATCH 18/29] nVMX: No need for handle_vmx_insn function any more Nadav Har'El
@ 2011-01-27  8:39 ` Nadav Har'El
  2011-01-27  8:40 ` [PATCH 20/29] nVMX: Deciding if L0 or L1 should handle an L2 exit Nadav Har'El
                   ` (10 subsequent siblings)
  29 siblings, 0 replies; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:39 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch implements nested_vmx_vmexit(), called when the nested L2 guest
exits and we want to run its L1 parent and let it handle this exit.

Note that this will not necessarily be called on every L2 exit. L0 may decide
to handle a particular exit on its own, without L1's involvement; in that
case, L0 handles the exit and resumes running L2, without running L1 and
without calling nested_vmx_vmexit(). The logic for deciding whether to handle
a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
will appear in the next patch.
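
Conceptually, the L2 exit path then looks roughly like this. This is an
illustrative sketch only: l1_wants_this_exit() is a hypothetical placeholder
for the decision logic introduced in the next patch, and the call to
nested_vmx_vmexit() is simplified:

	/* Illustrative sketch only -- not this patch's code. */
	static bool l1_wants_this_exit(struct kvm_vcpu *vcpu, u32 exit_reason);

	static int vmx_handle_exit_sketch(struct kvm_vcpu *vcpu)
	{
		u32 exit_reason = vmcs_read32(VM_EXIT_REASON);

		if (is_guest_mode(vcpu) &&
		    l1_wants_this_exit(vcpu, exit_reason)) {
			/*
			 * Reflect the exit to L1: switch back to vmcs01,
			 * update vmcs12's guest-state and exit-information
			 * fields, and resume running L1.
			 */
			nested_vmx_vmexit(vcpu);
			return 1;
		}

		/* Otherwise L0 handles the exit itself and resumes L2. */
		return kvm_vmx_exit_handlers[exit_reason](vcpu);
	}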

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  248 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 248 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-01-26 18:06:05.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-01-26 18:06:05.000000000 +0200
@@ -5217,6 +5217,8 @@ static void __vmx_complete_interrupts(st
 
 static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 {
+	if (is_guest_mode(&vmx->vcpu))
+		return;
 	__vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
 				  VM_EXIT_INSTRUCTION_LEN,
 				  IDT_VECTORING_ERROR_CODE);
@@ -6137,6 +6139,252 @@ static int nested_vmx_run(struct kvm_vcp
 	return 1;
 }
 
+/*
+ * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date
+ * because L2 may have changed some cr0 bits directly (see CR0_GUEST_HOST_MASK)
+ * without L0 trapping the change and updating vmcs12.
+ * This function returns the value we should put in vmcs12.guest_cr0. It's not
+ * enough to just return the current (vmcs02) GUEST_CR0. This may not be the
+ * guest cr0 that L1 thought it was giving its L2 guest - it is possible that
+ * L1 wished to allow its guest to set a cr0 bit directly, but we (L0) asked
+ * to trap this change and instead set just the read shadow. If this is the
+ * case, we need to copy these read-shadow bits back to vmcs12.guest_cr0, where
+ * L1 believes they already are.
+ */
+static inline unsigned long
+vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs_fields *vmcs12)
+{
+	unsigned long guest_cr0_bits =
+		vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask;
+	return (vmcs_readl(GUEST_CR0) & guest_cr0_bits) |
+		(vmcs_readl(CR0_READ_SHADOW) & ~guest_cr0_bits);
+}
+
+static inline unsigned long
+vmcs12_guest_cr4(struct kvm_vcpu *vcpu, struct vmcs_fields *vmcs12)
+{
+	unsigned long guest_cr4_bits =
+		vcpu->arch.cr4_guest_owned_bits | vmcs12->cr4_guest_host_mask;
+	return (vmcs_readl(GUEST_CR4) & guest_cr4_bits) |
+		(vmcs_readl(CR4_READ_SHADOW) & ~guest_cr4_bits);
+}
+
+/*
+ * prepare_vmcs12 is called when the nested L2 guest exits and we want to
+ * prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12), and this
+ * function updates it to reflect the changes to the guest state while L2 was
+ * running (and perhaps made some exits which were handled directly by L0
+ * without going back to L1), and to reflect the exit reason.
+ * Note that we do not have to copy here all VMCS fields, just those that
+ * could have changed by the L2 guest or the exit - i.e., the guest-state and
+ * exit-information fields only. Other fields are modified by L1 with VMWRITE,
+ * which already writes to vmcs12 directly.
+ */
+void prepare_vmcs12(struct kvm_vcpu *vcpu)
+{
+	struct vmcs_fields *vmcs12 = get_vmcs12_fields(vcpu);
+
+	/* update guest state fields: */
+	vmcs12->guest_cr0 = vmcs12_guest_cr0(vcpu, vmcs12);
+	vmcs12->guest_cr4 = vmcs12_guest_cr4(vcpu, vmcs12);
+
+	vmcs12->guest_dr7 = vmcs_readl(GUEST_DR7);
+	vmcs12->guest_rsp = vmcs_readl(GUEST_RSP);
+	vmcs12->guest_rip = vmcs_readl(GUEST_RIP);
+	vmcs12->guest_rflags = vmcs_readl(GUEST_RFLAGS);
+
+	vmcs12->guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR);
+	vmcs12->guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR);
+	vmcs12->guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR);
+	vmcs12->guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR);
+	vmcs12->guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR);
+	vmcs12->guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR);
+	vmcs12->guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR);
+	vmcs12->guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR);
+	vmcs12->guest_es_limit = vmcs_read32(GUEST_ES_LIMIT);
+	vmcs12->guest_cs_limit = vmcs_read32(GUEST_CS_LIMIT);
+	vmcs12->guest_ss_limit = vmcs_read32(GUEST_SS_LIMIT);
+	vmcs12->guest_ds_limit = vmcs_read32(GUEST_DS_LIMIT);
+	vmcs12->guest_fs_limit = vmcs_read32(GUEST_FS_LIMIT);
+	vmcs12->guest_gs_limit = vmcs_read32(GUEST_GS_LIMIT);
+	vmcs12->guest_ldtr_limit = vmcs_read32(GUEST_LDTR_LIMIT);
+	vmcs12->guest_tr_limit = vmcs_read32(GUEST_TR_LIMIT);
+	vmcs12->guest_gdtr_limit = vmcs_read32(GUEST_GDTR_LIMIT);
+	vmcs12->guest_idtr_limit = vmcs_read32(GUEST_IDTR_LIMIT);
+	vmcs12->guest_es_ar_bytes = vmcs_read32(GUEST_ES_AR_BYTES);
+	vmcs12->guest_cs_ar_bytes = vmcs_read32(GUEST_CS_AR_BYTES);
+	vmcs12->guest_ss_ar_bytes = vmcs_read32(GUEST_SS_AR_BYTES);
+	vmcs12->guest_ds_ar_bytes = vmcs_read32(GUEST_DS_AR_BYTES);
+	vmcs12->guest_fs_ar_bytes = vmcs_read32(GUEST_FS_AR_BYTES);
+	vmcs12->guest_gs_ar_bytes = vmcs_read32(GUEST_GS_AR_BYTES);
+	vmcs12->guest_ldtr_ar_bytes = vmcs_read32(GUEST_LDTR_AR_BYTES);
+	vmcs12->guest_tr_ar_bytes = vmcs_read32(GUEST_TR_AR_BYTES);
+	vmcs12->guest_es_base = vmcs_readl(GUEST_ES_BASE);
+	vmcs12->guest_cs_base = vmcs_readl(GUEST_CS_BASE);
+	vmcs12->guest_ss_base = vmcs_readl(GUEST_SS_BASE);
+	vmcs12->guest_ds_base = vmcs_readl(GUEST_DS_BASE);
+	vmcs12->guest_fs_base = vmcs_readl(GUEST_FS_BASE);
+	vmcs12->guest_gs_base = vmcs_readl(GUEST_GS_BASE);
+	vmcs12->guest_ldtr_base = vmcs_readl(GUEST_LDTR_BASE);
+	vmcs12->guest_tr_base = vmcs_readl(GUEST_TR_BASE);
+	vmcs12->guest_gdtr_base = vmcs_readl(GUEST_GDTR_BASE);
+	vmcs12->guest_idtr_base = vmcs_readl(GUEST_IDTR_BASE);
+
+	vmcs12->guest_activity_state = vmcs_read32(GUEST_ACTIVITY_STATE);
+	vmcs12->guest_interruptibility_info =
+		vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
+	vmcs12->guest_pending_dbg_exceptions =
+		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
+	vmcs12->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);
+
+	/* TODO: These cannot have changed unless we have MSR bitmaps and
+	 * the relevant bit asks not to trap the change */
+	vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
+	if (vmcs_config.vmexit_ctrl & VM_EXIT_SAVE_IA32_PAT)
+		vmcs12->guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT);
+	vmcs12->guest_sysenter_cs = vmcs_read32(GUEST_SYSENTER_CS);
+	vmcs12->guest_sysenter_esp = vmcs_readl(GUEST_SYSENTER_ESP);
+	vmcs12->guest_sysenter_eip = vmcs_readl(GUEST_SYSENTER_EIP);
+
+	/* update exit information fields: */
+
+	vmcs12->vm_exit_reason  = vmcs_read32(VM_EXIT_REASON);
+	vmcs12->exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+
+	vmcs12->vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+	vmcs12->vm_exit_intr_error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
+	vmcs12->idt_vectoring_info_field =
+		vmcs_read32(IDT_VECTORING_INFO_FIELD);
+	vmcs12->idt_vectoring_error_code =
+		vmcs_read32(IDT_VECTORING_ERROR_CODE);
+	vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+	vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+
+	/* clear vm-entry fields which are to be cleared on exit */
+	if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
+		vmcs12->vm_entry_intr_info_field &= ~INTR_INFO_VALID_MASK;
+}
+
+/*
+ * Emulate an exit from nested guest (L2) to L1, i.e., prepare to run L1
+ * and modify vmcs12 to make it see what it would expect to see there if
+ * L2 was its real guest. Must only be called when in L2 (is_guest_mode())
+ */
+static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, bool is_interrupt)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vmcs_fields *vmcs01 = vmx->nested.vmcs01_fields;
+	int cpu;
+	struct saved_vmcs *saved_vmcs02;
+
+	leave_guest_mode(vcpu);
+
+	sync_cached_regs_to_vmcs(vcpu);
+
+	prepare_vmcs12(vcpu);
+
+	if (is_interrupt)
+		get_vmcs12_fields(vcpu)->vm_exit_reason =
+			EXIT_REASON_EXTERNAL_INTERRUPT;
+
+	/*
+	 * Switch from L2's VMCS, to L1's VMCS. Remember on which CPU the L2
+	 * VMCS was last loaded, and whether it was launched (we need to know
+	 * this next time we use L2), and recall these values as they were for
+	 * L1's VMCS.
+	 */
+	saved_vmcs02 = nested_get_current_vmcs(vmx);
+	if (saved_vmcs02) {
+		saved_vmcs02->cpu = vcpu->cpu;
+		saved_vmcs02->launched = vmx->launched;
+	}
+	vmx->vmcs = vmx->nested.saved_vmcs01.vmcs;
+	vcpu->cpu = vmx->nested.saved_vmcs01.cpu;
+	vmx->launched = vmx->nested.saved_vmcs01.launched;
+
+	vmx_vcpu_put(vcpu);
+	cpu = get_cpu();
+	vmx_vcpu_load(vcpu, cpu);
+	vcpu->cpu = cpu;
+	put_cpu();
+
+	if (get_vmcs12_fields(vcpu)->vm_exit_controls &
+	    VM_EXIT_HOST_ADDR_SPACE_SIZE)
+		vcpu->arch.efer |= (EFER_LMA | EFER_LME);
+	else
+		vcpu->arch.efer &= ~(EFER_LMA | EFER_LME);
+	vmx_set_efer(vcpu, vcpu->arch.efer);
+
+	/*
+	 * L2 perhaps switched to real mode and set vmx->rmode, but we're back
+	 * in L1 and as it is running VMX, it can't be in real mode.
+	 */
+	vmx->rmode.vm86_active = 0;
+
+	/*
+	 * If L1 set the HOST_* fields in the VMCS, when exiting from L2 to L1
+	 * we need to return those, not L1's old values.
+	 */
+	vmcs_writel(GUEST_RIP, get_vmcs12_fields(vcpu)->host_rip);
+	vmcs_writel(GUEST_RSP, get_vmcs12_fields(vcpu)->host_rsp);
+	vmcs01->cr0_read_shadow = get_vmcs12_fields(vcpu)->host_cr0;
+
+	/*
+	 * We're running a regular L1 guest again, so we do the regular KVM
+	 * thing: run vmx_set_cr0 with the cr0 bits the guest thinks it has.
+	 * vmx_set_cr0 might use slightly different bits on the new guest_cr0
+	 * it sets, e.g., add TS when !fpu_active.
+	 * Note that vmx_set_cr0 refers to rmode and efer set above.
+	 */
+	vmx_set_cr0(vcpu, guest_readable_cr0(vmcs01));
+	/*
+	 * If we did fpu_activate()/fpu_deactivate() during L2's run, we need to
+	 * apply the same changes to L1's vmcs. We just set cr0 correctly, but
+	 * now we need to also update cr0_guest_host_mask and exception_bitmap.
+	 */
+	vmcs_write32(EXCEPTION_BITMAP,
+		(vmcs01->exception_bitmap & ~(1u<<NM_VECTOR)) |
+			(vcpu->fpu_active ? 0 : (1u<<NM_VECTOR)));
+	vcpu->arch.cr0_guest_owned_bits = (vcpu->fpu_active ? X86_CR0_TS : 0);
+	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
+
+	vmx_set_cr4(vcpu, guest_readable_cr4(vmcs01));
+	vcpu->arch.cr4_guest_owned_bits = ~vmcs01->cr4_guest_host_mask;
+
+	if (enable_ept) {
+		/* shadow page tables on EPT: */
+		set_cr3_and_pdptrs(vcpu, get_vmcs12_fields(vcpu)->host_cr3);
+	} else {
+		/* shadow page tables on shadow page tables: */
+		kvm_set_cr3(vcpu, get_vmcs12_fields(vcpu)->host_cr3);
+		kvm_mmu_reset_context(vcpu);
+		kvm_mmu_load(vcpu);
+	}
+	if (enable_vpid) {
+		/*
+		 * Trivially support vpid by letting L2s share their parent
+		 * L1's vpid. TODO: move to a more elaborate solution, giving
+		 * each L2 its own vpid and exposing the vpid feature to L1.
+		 */
+		vmx_flush_tlb(vcpu);
+	}
+
+	kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs01->guest_rsp);
+	kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs01->guest_rip);
+
+	if (unlikely(vmx->fail)) {
+		/*
+		 * When L1 launches L2 and then we (L0) fail to launch L2,
+		 * we nested_vmx_vmexit back to L1, but now should let it know
+		 * that the VMLAUNCH failed - with the same error that we
+		 * got when launching L2.
+		 */
+		vmx->fail = 0;
+		nested_vmx_failValid(vcpu, vmcs_read32(VM_INSTRUCTION_ERROR));
+	} else
+		nested_vmx_succeed(vcpu);
+}
+
 static struct kvm_x86_ops vmx_x86_ops = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 20/29] nVMX: Deciding if L0 or L1 should handle an L2 exit
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
                   ` (18 preceding siblings ...)
  2011-01-27  8:39 ` [PATCH 19/29] nVMX: Exiting from L2 to L1 Nadav Har'El
@ 2011-01-27  8:40 ` Nadav Har'El
  2011-01-27  8:40 ` [PATCH 21/29] nVMX: Correct handling of interrupt injection Nadav Har'El
                   ` (9 subsequent siblings)
  29 siblings, 0 replies; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:40 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch contains the logic of whether an L2 exit should be handled by L0
and then L2 should be resumed, or whether L1 should be run to handle this
exit (using the nested_vmx_vmexit() function of the previous patch).

The basic idea is to let L1 handle the exit only if it actually asked to
trap this sort of event. For example, when L2 exits on a change to CR0,
we check L1's CR0_GUEST_HOST_MASK to see if L1 expressed interest in any
bit which changed; If it did, we exit to L1. But if it didn't, it means that
we (L0) asked to trap this event, so we handle it ourselves.
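
The heart of the new CR0 check, excerpted from nested_vmx_exit_handled_cr()
in the hunk below, with the reasoning above spelled out as comments:

	case 0: /* mov to cr0 */
		/*
		 * cr0_guest_host_mask holds the cr0 bits L1 wants to shadow;
		 * if the value being written differs from L1's read shadow
		 * in any of those bits, L1 asked to intercept this write.
		 */
		if (vmcs12->cr0_guest_host_mask &
		    (val ^ vmcs12->cr0_read_shadow))
			return 1;	/* reflect the exit to L1 */
		break;			/* otherwise L0 handles it, resumes L2 */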

The next two patches add additional logic of what to do when an interrupt or
exception is injected: Does L0 need to do it, should we exit to L1 to do it,
or should we resume L2 and keep the exception to be injected later.

We keep a new flag, "nested_run_pending", which can override the decision of
which should run next, L1 or L2. nested_run_pending=1 means that we *must* run
L2 next, not L1. This is necessary in particular when L1 did a VMLAUNCH of L2
and therefore expects L2 to be run (and perhaps be injected with an event it
specified, etc.). Nested_run_pending is especially intended to avoid switching
to L1 in the injection decision-point described above.
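
Concretely, the flag is set at the top of vmx_handle_exit() (excerpt from the
hunk below, with a comment added):

	/* Remember whether the exit being handled is L1's VMLAUNCH/VMRESUME;
	 * if so, the next entry must go to L2 and we must not switch to L1.
	 */
	if (exit_reason == EXIT_REASON_VMLAUNCH ||
	    exit_reason == EXIT_REASON_VMRESUME)
		vmx->nested.nested_run_pending = 1;
	else
		vmx->nested.nested_run_pending = 0;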

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  220 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 220 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-01-26 18:06:05.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-01-26 18:06:05.000000000 +0200
@@ -345,6 +345,8 @@ struct nested_vmx {
 	/* Saving the VMCS that we used for running L1 */
 	struct saved_vmcs saved_vmcs01;
 	struct vmcs_fields *vmcs01_fields;
+	/* L2 must run next, and mustn't decide to exit to L1. */
+	bool nested_run_pending;
 };
 
 struct vcpu_vmx {
@@ -846,6 +848,20 @@ static inline bool nested_vm_need_virtua
 		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
 }
 
+static inline bool nested_cpu_has_vmx_msr_bitmap(struct kvm_vcpu *vcpu)
+{
+	return get_vmcs12_fields(vcpu)->cpu_based_vm_exec_control &
+		CPU_BASED_USE_MSR_BITMAPS;
+}
+
+static inline bool is_exception(u32 intr_info)
+{
+	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
+		== (INTR_TYPE_HARD_EXCEPTION | INTR_INFO_VALID_MASK);
+}
+
+static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, bool is_interrupt);
+
 static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
 {
 	int i;
@@ -5024,6 +5040,197 @@ static int (*kvm_vmx_exit_handlers[])(st
 static const int kvm_vmx_max_exit_handlers =
 	ARRAY_SIZE(kvm_vmx_exit_handlers);
 
+/*
+ * Return 1 if we should exit from L2 to L1 to handle an MSR access, rather
+ * than handle it ourselves in L0. I.e., check L1's MSR bitmap to see whether
+ * it expressed interest in the current event (read or write a specific MSR).
+ */
+static bool nested_vmx_exit_handled_msr(struct kvm_vcpu *vcpu,
+	struct vmcs_fields *vmcs12, u32 exit_reason)
+{
+	u32 msr_index = vcpu->arch.regs[VCPU_REGS_RCX];
+	struct page *msr_bitmap_page;
+	void *va;
+	bool ret;
+
+	if (!cpu_has_vmx_msr_bitmap() || !nested_cpu_has_vmx_msr_bitmap(vcpu))
+		return 1;
+
+	msr_bitmap_page = nested_get_page(vcpu, vmcs12->msr_bitmap);
+	if (!msr_bitmap_page) {
+		printk(KERN_INFO "%s error in nested_get_page\n", __func__);
+		return 0;
+	}
+
+	va = kmap_atomic(msr_bitmap_page, KM_USER1);
+	if (exit_reason == EXIT_REASON_MSR_WRITE)
+		va += 0x800;
+	if (msr_index >= 0xc0000000) {
+		msr_index -= 0xc0000000;
+		va += 0x400;
+	}
+	if (msr_index <= 0x1fff)
+		ret = test_bit(msr_index, va);
+	else
+		ret = 1; /* let L1 handle the wrong parameter */
+	kunmap_atomic(va, KM_USER1);
+	nested_release_page_clean(msr_bitmap_page);
+	return ret;
+}
+
+/*
+ * Return 1 if we should exit from L2 to L1 to handle a CR access exit,
+ * rather than handle it ourselves in L0. I.e., check if L1 wanted to
+ * intercept (via guest_host_mask etc.) the current event.
+ */
+static bool nested_vmx_exit_handled_cr(struct kvm_vcpu *vcpu,
+	struct vmcs_fields *vmcs12)
+{
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	int cr = exit_qualification & 15;
+	int reg = (exit_qualification >> 8) & 15;
+	unsigned long val = kvm_register_read(vcpu, reg);
+
+	switch ((exit_qualification >> 4) & 3) {
+	case 0: /* mov to cr */
+		switch (cr) {
+		case 0:
+			if (vmcs12->cr0_guest_host_mask &
+			    (val ^ vmcs12->cr0_read_shadow))
+				return 1;
+			break;
+		case 3:
+			if ((vmcs12->cr3_target_count >= 1 &&
+					vmcs12->cr3_target_value0 == val) ||
+				(vmcs12->cr3_target_count >= 2 &&
+					vmcs12->cr3_target_value1 == val) ||
+				(vmcs12->cr3_target_count >= 3 &&
+					vmcs12->cr3_target_value2 == val) ||
+				(vmcs12->cr3_target_count >= 4 &&
+					vmcs12->cr3_target_value3 == val))
+				return 0;
+			if (nested_cpu_has_secondary_exec_ctrls(vcpu) &&
+				(vmcs12->cpu_based_vm_exec_control &
+				CPU_BASED_CR3_LOAD_EXITING)) {
+				return 1;
+			}
+			break;
+		case 4:
+			if (vmcs12->cr4_guest_host_mask &
+			    (vmcs12->cr4_read_shadow ^ val))
+				return 1;
+			break;
+		case 8:
+			if (nested_cpu_has_secondary_exec_ctrls(vcpu) &&
+				(vmcs12->cpu_based_vm_exec_control &
+				CPU_BASED_CR8_LOAD_EXITING))
+				return 1;
+			/*
+			 * TODO: missing else if control & CPU_BASED_TPR_SHADOW
+			 * then set tpr shadow and if below tpr_threshold, exit.
+			 */
+			break;
+		}
+		break;
+	case 2: /* clts */
+		if (vmcs12->cr0_guest_host_mask & X86_CR0_TS)
+			return 1;
+		break;
+	case 1: /* mov from cr */
+		switch (cr) {
+		case 0:
+			return 1;
+		case 3:
+			if (vmcs12->cpu_based_vm_exec_control &
+			    CPU_BASED_CR3_STORE_EXITING)
+				return 1;
+			break;
+		case 4:
+			return 1;
+			break;
+		case 8:
+			if (vmcs12->cpu_based_vm_exec_control &
+			    CPU_BASED_CR8_STORE_EXITING)
+				return 1;
+			break;
+		}
+		break;
+	case 3: /* lmsw */
+		/*
+		 * lmsw can change bits 1..3 of cr0, and only set bit 0 of
+		 * cr0. Other attempted changes are ignored, with no exit.
+		 */
+		if (vmcs12->cr0_guest_host_mask & 0xe &
+		    (val ^ vmcs12->cr0_read_shadow))
+			return 1;
+		if ((vmcs12->cr0_guest_host_mask & 0x1) &&
+		    !(vmcs12->cr0_read_shadow & 0x1) &&
+		    (val & 0x1))
+			return 1;
+		break;
+	}
+	return 0;
+}
+
+/*
+ * Return 1 if we should exit from L2 to L1 to handle an exit, or 0 if we
+ * should handle it ourselves in L0 (and then continue L2). Only call this
+ * when in is_guest_mode (L2).
+ */
+static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
+{
+	u32 exit_reason = vmcs_read32(VM_EXIT_REASON);
+	u32 intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vmcs_fields *vmcs12 = get_vmcs12_fields(vcpu);
+
+	if (vmx->nested.nested_run_pending)
+		return 0;
+
+	if (unlikely(vmx->fail)) {
+		printk(KERN_INFO "%s failed vm entry %x\n",
+		       __func__, vmcs_read32(VM_INSTRUCTION_ERROR));
+		return 1;
+	}
+
+	switch (exit_reason) {
+	case EXIT_REASON_EXTERNAL_INTERRUPT:
+		return 0;
+	case EXIT_REASON_EXCEPTION_NMI:
+		if (!is_exception(intr_info))
+			return 0;
+		else if (is_page_fault(intr_info))
+			return enable_ept;
+		return vmcs12->exception_bitmap &
+				(1u << (intr_info & INTR_INFO_VECTOR_MASK));
+	case EXIT_REASON_EPT_VIOLATION:
+		return 0;
+	case EXIT_REASON_INVLPG:
+		return vmcs12->cpu_based_vm_exec_control &
+				CPU_BASED_INVLPG_EXITING;
+	case EXIT_REASON_MSR_READ:
+	case EXIT_REASON_MSR_WRITE:
+		return nested_vmx_exit_handled_msr(vcpu, vmcs12, exit_reason);
+	case EXIT_REASON_CR_ACCESS:
+		return nested_vmx_exit_handled_cr(vcpu, vmcs12);
+	case EXIT_REASON_DR_ACCESS:
+		return vmcs12->cpu_based_vm_exec_control &
+				CPU_BASED_MOV_DR_EXITING;
+	default:
+		/*
+		 * One particularly interesting case that is covered here is an
+		 * exit caused by L2 running a VMX instruction. L2 is guest
+		 * mode in L1's world, and according to the VMX spec running a
+		 * VMX instruction in guest mode should cause an exit to root
+		 * mode, i.e., to L1. This is why we need to return r=1 for
+		 * those exit reasons too. This enables further nesting: Like
+		 * L0 emulates VMX for L1, we now allow L1 to emulate VMX for
+		 * L2, who will then be able to run L3.
+		 */
+		return 1;
+	}
+}
+
 static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2)
 {
 	*info1 = vmcs_readl(EXIT_QUALIFICATION);
@@ -5046,6 +5253,17 @@ static int vmx_handle_exit(struct kvm_vc
 	if (vmx->emulation_required && emulate_invalid_guest_state)
 		return handle_invalid_guest_state(vcpu);
 
+	if (exit_reason == EXIT_REASON_VMLAUNCH ||
+	    exit_reason == EXIT_REASON_VMRESUME)
+		vmx->nested.nested_run_pending = 1;
+	else
+		vmx->nested.nested_run_pending = 0;
+
+	if (is_guest_mode(vcpu) && nested_vmx_exit_handled(vcpu)) {
+		nested_vmx_vmexit(vcpu, false);
+		return 1;
+	}
+
 	if (exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) {
 		vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY;
 		vcpu->run->fail_entry.hardware_entry_failure_reason
@@ -6121,6 +6339,8 @@ static int nested_vmx_run(struct kvm_vcp
 		kvm_mmu_reset_context(vcpu);
 
 		if (unlikely(kvm_mmu_load(vcpu))) {
+			/* switch back to L1 */
+			nested_vmx_vmexit(vcpu, false);
 			/*
 			 * TODO: there is no reasonable error number to use.
 			 * perhaps a more reasonable thing to do is to

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 21/29] nVMX: Correct handling of interrupt injection
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
                   ` (19 preceding siblings ...)
  2011-01-27  8:40 ` [PATCH 20/29] nVMX: Deciding if L0 or L1 should handle an L2 exit Nadav Har'El
@ 2011-01-27  8:40 ` Nadav Har'El
  2011-01-27  8:41 ` [PATCH 22/29] nVMX: Correct handling of exception injection Nadav Har'El
                   ` (8 subsequent siblings)
  29 siblings, 0 replies; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:40 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

When KVM wants to inject an interrupt, the guest should think a real interrupt
has happened. Normally (in the non-nested case) this means checking that the
guest doesn't block interrupts (and if it does, inject when it doesn't - using
the "interrupt window" VMX mechanism), and setting up the appropriate VMCS
fields for the guest to receive the interrupt.

However, when we are running a nested guest (L2) and its hypervisor (L1)
requested exits on interrupts (as most hypervisors do), the most efficient
thing to do is to exit L2, telling L1 that the exit was caused by an
interrupt, the one we were injecting; Only when L1 asked not to be notified
of interrupts should we inject the interrupt directly into the running L2
guest (i.e., take the normal code path).

However, properly doing what is described above requires invasive changes to
the flow of the existing code, which we elected not to do in this stage.
Instead we do something more simplistic and less efficient: we modify
vmx_interrupt_allowed(), which kvm calls to see if it can inject the interrupt
now, to exit from L2 to L1 before continuing the normal code. The normal kvm
code then notices that L1 is blocking interrupts, and sets the interrupt
window to inject the interrupt later to L1. Shortly after, L1 gets the
interrupt while it is itself running, not as an exit from L2. The cost is an
extra L1 exit (the interrupt window).
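
At its core, the modified vmx_interrupt_allowed() becomes (excerpt from the
hunk below, with comments added):

	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
		if (to_vmx(vcpu)->nested.nested_run_pending)
			return 0;	/* must enter L2 first; inject later */
		nested_vmx_vmexit(vcpu, true);	/* L1 sees an external-interrupt exit */
		/* fall through to normal code, but now in L1, not L2 */
	}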

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-01-26 18:06:05.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-01-26 18:06:05.000000000 +0200
@@ -3463,9 +3463,25 @@ out:
 	return ret;
 }
 
+/*
+ * In nested virtualization, check if L1 asked to exit on external interrupts.
+ * For most existing hypervisors, this will always return true.
+ */
+static bool nested_exit_on_intr(struct kvm_vcpu *vcpu)
+{
+	return get_vmcs12_fields(vcpu)->pin_based_vm_exec_control &
+		PIN_BASED_EXT_INTR_MASK;
+}
+
 static void enable_irq_window(struct kvm_vcpu *vcpu)
 {
 	u32 cpu_based_vm_exec_control;
+	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
+		/* We can get here when nested_run_pending caused
+		 * vmx_interrupt_allowed() to return false. In this case, do
+		 * nothing - the interrupt will be injected later.
+		 */
+		return;
 
 	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
 	cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
@@ -3581,6 +3597,13 @@ static void vmx_set_nmi_mask(struct kvm_
 
 static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
 {
+	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
+		if (to_vmx(vcpu)->nested.nested_run_pending)
+			return 0;
+		nested_vmx_vmexit(vcpu, true);
+		/* fall through to normal code, but now in L1, not L2 */
+	}
+
 	return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
 		!(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
 			(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
@@ -5253,6 +5276,14 @@ static int vmx_handle_exit(struct kvm_vc
 	if (vmx->emulation_required && emulate_invalid_guest_state)
 		return handle_invalid_guest_state(vcpu);
 
+	/*
+	 * the KVM_REQ_EVENT optimization bit is only on for one entry, and if
+	 * we did not inject a still-pending event to L1 now because of
+	 * nested_run_pending, we need to re-enable this bit.
+	 */
+	if (vmx->nested.nested_run_pending)
+		kvm_make_request(KVM_REQ_EVENT, vcpu);
+
 	if (exit_reason == EXIT_REASON_VMLAUNCH ||
 	    exit_reason == EXIT_REASON_VMRESUME)
 		vmx->nested.nested_run_pending = 1;
@@ -5444,6 +5475,8 @@ static void vmx_complete_interrupts(stru
 
 static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
 {
+	if (is_guest_mode(vcpu))
+		return;
 	__vmx_complete_interrupts(to_vmx(vcpu),
 				  vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
 				  VM_ENTRY_INSTRUCTION_LEN,

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 22/29] nVMX: Correct handling of exception injection
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
                   ` (20 preceding siblings ...)
  2011-01-27  8:40 ` [PATCH 21/29] nVMX: Correct handling of interrupt injection Nadav Har'El
@ 2011-01-27  8:41 ` Nadav Har'El
  2011-01-27  8:41 ` [PATCH 23/29] nVMX: Correct handling of idt vectoring info Nadav Har'El
                   ` (7 subsequent siblings)
  29 siblings, 0 replies; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:41 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Similar to the previous patch, but concerning injection of exceptions rather
than external interrupts.
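
Concretely, when KVM is about to inject a page fault while L2 is running, we
first check whether L1 asked to intercept #PF and, if so, reflect the fault
to L1 instead of injecting it into L2 (excerpt from the hunk below, with a
comment added):

	/* in vmx_queue_exception(), before setting up the injection: */
	if (nr == PF_VECTOR && is_guest_mode(vcpu) &&
		nested_pf_handled(vcpu))
		return;	/* the #PF was reflected to L1 by nested_vmx_vmexit() */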

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-01-26 18:06:05.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-01-26 18:06:05.000000000 +0200
@@ -1480,6 +1480,25 @@ static void vmx_clear_hlt(struct kvm_vcp
 		vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
 }
 
+/*
+ * KVM wants to inject page-faults which it got to the guest. This function
+ * checks whether in a nested guest, we need to inject them to L1 or L2.
+ * This function assumes it is called with the exit reason in vmcs02 being
+ * a #PF exception (this is the only case in which KVM injects a #PF when L2
+ * is running).
+ */
+static int nested_pf_handled(struct kvm_vcpu *vcpu)
+{
+	struct vmcs_fields *vmcs12 = get_vmcs12_fields(vcpu);
+
+	/* TODO: also check PFEC_MATCH/MASK, not just EB.PF. */
+	if (!(vmcs12->exception_bitmap & (1u << PF_VECTOR)))
+		return 0;
+
+	nested_vmx_vmexit(vcpu, false);
+	return 1;
+}
+
 static void vmx_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
 				bool has_error_code, u32 error_code,
 				bool reinject)
@@ -1487,6 +1506,10 @@ static void vmx_queue_exception(struct k
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u32 intr_info = nr | INTR_INFO_VALID_MASK;
 
+	if (nr == PF_VECTOR && is_guest_mode(vcpu) &&
+		nested_pf_handled(vcpu))
+		return;
+
 	if (has_error_code) {
 		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, error_code);
 		intr_info |= INTR_INFO_DELIVER_CODE_MASK;
@@ -3535,6 +3558,9 @@ static void vmx_inject_nmi(struct kvm_vc
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
+	if (is_guest_mode(vcpu))
+		return;
+
 	if (!cpu_has_virtual_nmis()) {
 		/*
 		 * Tracking the NMI-blocked state in software is built upon

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 23/29] nVMX: Correct handling of idt vectoring info
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
                   ` (21 preceding siblings ...)
  2011-01-27  8:41 ` [PATCH 22/29] nVMX: Correct handling of exception injection Nadav Har'El
@ 2011-01-27  8:41 ` Nadav Har'El
  2011-01-27  8:42 ` [PATCH 24/29] nVMX: Handling of CR0 and CR4 modifying instructions Nadav Har'El
                   ` (6 subsequent siblings)
  29 siblings, 0 replies; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:41 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch adds correct handling of IDT_VECTORING_INFO_FIELD for the nested
case.

When a guest exits while handling an interrupt or exception, we get this
information in IDT_VECTORING_INFO_FIELD in the VMCS. When L2 exits to L1,
there's nothing we need to do, because L1 will see this field in vmcs12, and
handle it itself. However, when L2 exits and L0 handles the exit itself and
plans to return to L2, L0 must inject this event to L2.

In the normal non-nested case, the idt_vectoring_info is examined after the
exit, and the decision to inject (though not the injection itself) is made
at that point. However, in the nested case a decision of whether to return
to L2 or L1 also happens during the injection phase (see the previous
patches), so in the nested case we can only decide what to do about the
idt_vectoring_info right after the injection, i.e., at the beginning of
vmx_vcpu_run, which is the first time we know for sure whether we're staying
in L2 (i.e., is_guest_mode(vcpu) is true).
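
In other words, at the top of vmx_vcpu_run() we now do roughly the following
(excerpt from the hunks below, with a comment added):

	if (is_guest_mode(vcpu) && vmx->nested.valid_idt_vectoring_info)
		/* L0 is resuming L2: re-inject the event L2 was delivering
		 * when it exited, by copying the saved idt_vectoring_info
		 * into the VM-entry interruption-information field.
		 */
		nested_handle_valid_idt_vectoring_info(vmx);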

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-01-26 18:06:05.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-01-26 18:06:05.000000000 +0200
@@ -347,6 +347,10 @@ struct nested_vmx {
 	struct vmcs_fields *vmcs01_fields;
 	/* L2 must run next, and mustn't decide to exit to L1. */
 	bool nested_run_pending;
+	/* true if last exit was of L2, and had a valid idt_vectoring_info */
+	bool valid_idt_vectoring_info;
+	/* These are saved if valid_idt_vectoring_info */
+	u32 vm_exit_instruction_len, idt_vectoring_error_code;
 };
 
 struct vcpu_vmx {
@@ -5511,6 +5515,22 @@ static void vmx_cancel_injection(struct 
 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
 }
 
+static void nested_handle_valid_idt_vectoring_info(struct vcpu_vmx *vmx)
+{
+	int irq  = vmx->idt_vectoring_info & VECTORING_INFO_VECTOR_MASK;
+	int type = vmx->idt_vectoring_info & VECTORING_INFO_TYPE_MASK;
+	int errCodeValid = vmx->idt_vectoring_info &
+		VECTORING_INFO_DELIVER_CODE_MASK;
+	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+		irq | type | INTR_INFO_VALID_MASK | errCodeValid);
+
+	vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
+		vmx->nested.vm_exit_instruction_len);
+	if (errCodeValid)
+		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
+			vmx->nested.idt_vectoring_error_code);
+}
+
 static inline void sync_cached_regs_to_vmcs(struct kvm_vcpu *vcpu)
 {
 	if (test_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_dirty))
@@ -5532,6 +5552,9 @@ static void vmx_vcpu_run(struct kvm_vcpu
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
+	if (is_guest_mode(vcpu) && vmx->nested.valid_idt_vectoring_info)
+		nested_handle_valid_idt_vectoring_info(vmx);
+
 	/* Record the guest's net vcpu time for enforced NMI injections. */
 	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked))
 		vmx->entry_time = ktime_get();
@@ -5656,6 +5679,15 @@ static void vmx_vcpu_run(struct kvm_vcpu
 
 	vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
 
+	vmx->nested.valid_idt_vectoring_info = is_guest_mode(vcpu) &&
+		(vmx->idt_vectoring_info & VECTORING_INFO_VALID_MASK);
+	if (vmx->nested.valid_idt_vectoring_info) {
+		vmx->nested.vm_exit_instruction_len =
+			vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+		vmx->nested.idt_vectoring_error_code =
+			vmcs_read32(IDT_VECTORING_ERROR_CODE);
+	}
+
 	asm("mov %0, %%ds; mov %0, %%es" : : "r"(__USER_DS));
 	vmx->launched = 1;
 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 24/29] nVMX: Handling of CR0 and CR4 modifying instructions
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
                   ` (22 preceding siblings ...)
  2011-01-27  8:41 ` [PATCH 23/29] nVMX: Correct handling of idt vectoring info Nadav Har'El
@ 2011-01-27  8:42 ` Nadav Har'El
  2011-01-27  8:42 ` [PATCH 25/29] nVMX: Further fixes for lazy FPU loading Nadav Har'El
                   ` (5 subsequent siblings)
  29 siblings, 0 replies; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:42 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

When L2 tries to modify CR0 or CR4 (with mov or clts), and modifies a bit
which L1 asked to shadow (via CR[04]_GUEST_HOST_MASK), we already do the right
thing: we let L1 handle the trap (see nested_vmx_exit_handled_cr() in a
previous patch).
When L2 modifies bits that L1 doesn't care about, we let it think (via
CR[04]_READ_SHADOW) that it did these modifications, while only changing
(in GUEST_CR[04]) the bits that L0 doesn't shadow.

This is needed for correct handling of CR0.TS for lazy FPU loading: L0 may
want to leave TS on, while pretending to allow the guest to change it.
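
For a bit that only L0 shadows, the mov-to-cr0 emulation therefore does the
following (excerpt from handle_set_cr0() in the hunk below, comments added):

	/* take the new value only for bits the guest owns; keep L0's
	 * current values for the bits that L0 shadows (e.g., TS)
	 */
	vmcs_writel(GUEST_CR0,
	   (val & vcpu->arch.cr0_guest_owned_bits) |
	   (vmcs_readl(GUEST_CR0) & ~vcpu->arch.cr0_guest_owned_bits));
	/* ... but let L2 read back exactly the value it tried to write */
	vmcs_writel(CR0_READ_SHADOW, val);
	vcpu->arch.cr0 = val;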

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   53 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 50 insertions(+), 3 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-01-26 18:06:06.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-01-26 18:06:06.000000000 +0200
@@ -3872,6 +3872,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcp
 	hypercall[2] = 0xc1;
 }
 
+/* called to set cr0 as appropriate for a mov-to-cr0 exit. */
+static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val)
+{
+	if (is_guest_mode(vcpu)) {
+		/*
+		 * We get here when L2 changed cr0 in a way that did not change
+		 * any of L1's shadowed bits (see nested_vmx_exit_handled_cr),
+		 * but did change L0 shadowed bits. This can currently happen
+		 * with the TS bit: L0 may want to leave TS on (for lazy fpu
+		 * loading) while pretending to allow the guest to change it.
+		 */
+		vmcs_writel(GUEST_CR0,
+		   (val & vcpu->arch.cr0_guest_owned_bits) |
+		   (vmcs_readl(GUEST_CR0) & ~vcpu->arch.cr0_guest_owned_bits));
+		vmcs_writel(CR0_READ_SHADOW, val);
+		vcpu->arch.cr0 = val;
+		return 0;
+	} else
+		return kvm_set_cr0(vcpu, val);
+}
+
+static int handle_set_cr4(struct kvm_vcpu *vcpu, unsigned long val)
+{
+	if (is_guest_mode(vcpu)) {
+		vmcs_writel(GUEST_CR4,
+		   (val & vcpu->arch.cr4_guest_owned_bits) |
+		   (vmcs_readl(GUEST_CR4) & ~vcpu->arch.cr4_guest_owned_bits));
+		vmcs_writel(CR4_READ_SHADOW, val);
+		vcpu->arch.cr4 = val;
+		return 0;
+	} else
+		return kvm_set_cr4(vcpu, val);
+}
+
+/* called to set cr0 as appropriate for clts instruction exit. */
+static void handle_clts(struct kvm_vcpu *vcpu)
+{
+	if (is_guest_mode(vcpu)) {
+		/* As in handle_set_cr0(), we can't call vmx_set_cr0 here */
+		vmcs_writel(GUEST_CR0, vmcs_readl(GUEST_CR0) & ~X86_CR0_TS);
+		vmcs_writel(CR0_READ_SHADOW,
+			vmcs_readl(CR0_READ_SHADOW) & ~X86_CR0_TS);
+		vcpu->arch.cr0 &= ~X86_CR0_TS;
+	} else
+		vmx_set_cr0(vcpu, kvm_read_cr0_bits(vcpu, ~X86_CR0_TS));
+}
+
 static int handle_cr(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification, val;
@@ -3888,7 +3935,7 @@ static int handle_cr(struct kvm_vcpu *vc
 		trace_kvm_cr_write(cr, val);
 		switch (cr) {
 		case 0:
-			err = kvm_set_cr0(vcpu, val);
+			err = handle_set_cr0(vcpu, val);
 			kvm_complete_insn_gp(vcpu, err);
 			return 1;
 		case 3:
@@ -3896,7 +3943,7 @@ static int handle_cr(struct kvm_vcpu *vc
 			kvm_complete_insn_gp(vcpu, err);
 			return 1;
 		case 4:
-			err = kvm_set_cr4(vcpu, val);
+			err = handle_set_cr4(vcpu, val);
 			kvm_complete_insn_gp(vcpu, err);
 			return 1;
 		case 8: {
@@ -3914,7 +3961,7 @@ static int handle_cr(struct kvm_vcpu *vc
 		};
 		break;
 	case 2: /* clts */
-		vmx_set_cr0(vcpu, kvm_read_cr0_bits(vcpu, ~X86_CR0_TS));
+		handle_clts(vcpu);
 		trace_kvm_cr_write(0, kvm_read_cr0(vcpu));
 		skip_emulated_instruction(vcpu);
 		vmx_fpu_activate(vcpu);

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 25/29] nVMX: Further fixes for lazy FPU loading
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
                   ` (23 preceding siblings ...)
  2011-01-27  8:42 ` [PATCH 24/29] nVMX: Handling of CR0 and CR4 modifying instructions Nadav Har'El
@ 2011-01-27  8:42 ` Nadav Har'El
  2011-01-27  8:43 ` [PATCH 26/29] nVMX: Additional TSC-offset handling Nadav Har'El
                   ` (4 subsequent siblings)
  29 siblings, 0 replies; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:42 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

KVM's "Lazy FPU loading" means that sometimes L0 needs to set CR0.TS, even
if a guest didn't set it. Moreover, L0 must also trap CR0.TS changes and
NM exceptions, even if we have a guest hypervisor (L1) who didn't want these
traps. And of course, conversely: If L1 wanted to trap these events, we
must let it, even if L0 is not interested in them.

This patch fixes some existing KVM code (in update_exception_bitmap(),
vmx_fpu_activate(), vmx_fpu_deactivate()) to do the correct merging of L0's
and L1's needs. Note that handle_cr() was already fixed in the previous
patch, and that new code introduced in earlier patches already handles CR0
correctly (see prepare_vmcs02(), prepare_vmcs12(), and nested_vmx_vmexit()).
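
The resulting rule for the exception bitmap while L2 runs is simply the union
of what L0 and what L1 want to trap (excerpt from update_exception_bitmap()
in the hunk below, comment condensed):

	if (is_guest_mode(vcpu))
		/* trap everything L1 asked for, in addition to L0's needs */
		eb |= get_vmcs12_fields(vcpu)->exception_bitmap;

	vmcs_write32(EXCEPTION_BITMAP, eb);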

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   38 +++++++++++++++++++++++++++++++++++---
 1 file changed, 35 insertions(+), 3 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-01-26 18:06:06.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-01-26 18:06:06.000000000 +0200
@@ -1101,6 +1101,15 @@ static void update_exception_bitmap(stru
 		eb &= ~(1u << PF_VECTOR); /* bypass_guest_pf = 0 */
 	if (vcpu->fpu_active)
 		eb &= ~(1u << NM_VECTOR);
+
+	/* When we are running a nested L2 guest and L1 specified for it a
+	 * certain exception bitmap, we must trap the same exceptions and pass
+	 * them to L1. When running L2, we will only handle the exceptions
+	 * specified above if L1 did not want them.
+	 */
+	if (is_guest_mode(vcpu))
+		eb |= get_vmcs12_fields(vcpu)->exception_bitmap;
+
 	vmcs_write32(EXCEPTION_BITMAP, eb);
 }
 
@@ -1393,8 +1402,19 @@ static void vmx_fpu_activate(struct kvm_
 	cr0 &= ~(X86_CR0_TS | X86_CR0_MP);
 	cr0 |= kvm_read_cr0_bits(vcpu, X86_CR0_TS | X86_CR0_MP);
 	vmcs_writel(GUEST_CR0, cr0);
-	update_exception_bitmap(vcpu);
 	vcpu->arch.cr0_guest_owned_bits = X86_CR0_TS;
+	if (is_guest_mode(vcpu)) {
+		/* While we (L0) no longer care about NM exceptions or cr0.TS
+		 * changes, our guest hypervisor (L1) might care in which case
+		 * we must trap them for it.
+		 */
+		u32 eb = vmcs_read32(EXCEPTION_BITMAP) & ~(1u << NM_VECTOR);
+		struct vmcs_fields *vmcs12 = get_vmcs12_fields(vcpu);
+		eb |= vmcs12->exception_bitmap;
+		vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
+		vmcs_write32(EXCEPTION_BITMAP, eb);
+	} else
+		update_exception_bitmap(vcpu);
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
 }
 
@@ -1402,12 +1422,24 @@ static void vmx_decache_cr0_guest_bits(s
 
 static void vmx_fpu_deactivate(struct kvm_vcpu *vcpu)
 {
+	/* Note that there is no vcpu->fpu_active = 0 here. The caller must
+	 * set this *before* calling this function.
+	 */
 	vmx_decache_cr0_guest_bits(vcpu);
 	vmcs_set_bits(GUEST_CR0, X86_CR0_TS | X86_CR0_MP);
-	update_exception_bitmap(vcpu);
+	vmcs_write32(EXCEPTION_BITMAP,
+		vmcs_read32(EXCEPTION_BITMAP) | (1u << NM_VECTOR));
 	vcpu->arch.cr0_guest_owned_bits = 0;
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
-	vmcs_writel(CR0_READ_SHADOW, vcpu->arch.cr0);
+	if (is_guest_mode(vcpu))
+		/* Unfortunately in nested mode we play with arch.cr0's PG
+	 * bit, so we mustn't copy it all, just the relevant TS bit
+		 */
+		vmcs_writel(CR0_READ_SHADOW,
+			(vmcs_readl(CR0_READ_SHADOW) & ~X86_CR0_TS) |
+			(vcpu->arch.cr0 & X86_CR0_TS));
+	else
+		vmcs_writel(CR0_READ_SHADOW, vcpu->arch.cr0);
 }
 
 static unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 26/29] nVMX: Additional TSC-offset handling
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
                   ` (24 preceding siblings ...)
  2011-01-27  8:42 ` [PATCH 25/29] nVMX: Further fixes for lazy FPU loading Nadav Har'El
@ 2011-01-27  8:43 ` Nadav Har'El
  2011-01-27  8:43 ` [PATCH 27/29] nVMX: Add VMX to list of supported cpuid features Nadav Har'El
                   ` (3 subsequent siblings)
  29 siblings, 0 replies; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:43 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

In the unlikely case that L1 does not capture MSR_IA32_TSC, L0 needs to
emulate this MSR write by L2 by modifying vmcs02.tsc_offset.
We also need to set vmcs12.tsc_offset, for this change to survive the next
nested entry (see prepare_vmcs02()).
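
Since prepare_vmcs02() (in an earlier patch) derives the offset used while L2
runs from both L0's own offset and the one L1 requested, the L1-relative
value to keep in vmcs12 is recovered by subtracting L0's offset (excerpt from
the hunk below, comment added):

	/* keep only L1's part of the combined offset in vmcs12 */
	get_vmcs12_fields(vcpu)->tsc_offset = offset -
		to_vmx(vcpu)->nested.vmcs01_fields->tsc_offset;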

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   11 +++++++++++
 1 file changed, 11 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-01-26 18:06:06.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-01-26 18:06:06.000000000 +0200
@@ -1655,12 +1655,23 @@ static u64 guest_read_tsc(void)
 static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 {
 	vmcs_write64(TSC_OFFSET, offset);
+	if (is_guest_mode(vcpu))
+		/*
+		 * We are only changing TSC_OFFSET when L2 is running if for
+		 * some reason L1 chose not to trap the TSC MSR. Since
+		 * prepare_vmcs12() does not copy tsc_offset, we need to also
+		 * set the vmcs12 field here.
+		 */
+		get_vmcs12_fields(vcpu)->tsc_offset = offset -
+			to_vmx(vcpu)->nested.vmcs01_fields->tsc_offset;
 }
 
 static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
 {
 	u64 offset = vmcs_read64(TSC_OFFSET);
 	vmcs_write64(TSC_OFFSET, offset + adjustment);
+	if (is_guest_mode(vcpu))
+		get_vmcs12_fields(vcpu)->tsc_offset += adjustment;
 }
 
 static bool guest_cpuid_has_vmx(struct kvm_vcpu *vcpu)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 27/29] nVMX: Add VMX to list of supported cpuid features
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
                   ` (25 preceding siblings ...)
  2011-01-27  8:43 ` [PATCH 26/29] nVMX: Additional TSC-offset handling Nadav Har'El
@ 2011-01-27  8:43 ` Nadav Har'El
  2011-01-27  8:44 ` [PATCH 28/29] nVMX: Miscellenous small corrections Nadav Har'El
                   ` (2 subsequent siblings)
  29 siblings, 0 replies; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:43 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

If the "nested" module option is enabled, add the "VMX" CPU feature to the
list of CPU features KVM advertises with the KVM_GET_SUPPORTED_CPUID ioctl.

Qemu uses this ioctl, and intersects KVM's list with its own list of desired
cpu features (depending on the -cpu option given to qemu) to determine the
final list of features presented to the guest.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |    2 ++
 1 file changed, 2 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-01-26 18:06:06.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-01-26 18:06:06.000000000 +0200
@@ -6015,6 +6015,8 @@ static void vmx_cpuid_update(struct kvm_
 
 static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
 {
+	if (func == 1 && nested)
+		entry->ecx |= bit(X86_FEATURE_VMX);
 }
 
 /*

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 28/29] nVMX: Miscellenous small corrections
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
                   ` (26 preceding siblings ...)
  2011-01-27  8:43 ` [PATCH 27/29] nVMX: Add VMX to list of supported cpuid features Nadav Har'El
@ 2011-01-27  8:44 ` Nadav Har'El
  2011-01-27  8:44 ` [PATCH 29/29] nVMX: Documentation Nadav Har'El
  2011-01-28  8:41 ` [PATCH 0/29] nVMX: Nested VMX, v8 Juerg Haefliger
  29 siblings, 0 replies; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:44 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Small corrections of KVM (spelling, etc.) not directly related to nested VMX.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c	2011-01-26 18:06:06.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-01-26 18:06:06.000000000 +0200
@@ -934,7 +934,7 @@ static void vmcs_load(struct vmcs *vmcs)
 			: "=qm"(error) : "a"(&phys_addr), "m"(phys_addr)
 			: "cc", "memory");
 	if (error)
-		printk(KERN_ERR "kvm: vmptrld %p/%llx fail\n",
+		printk(KERN_ERR "kvm: vmptrld %p/%llx failed\n",
 		       vmcs, phys_addr);
 }
 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 29/29] nVMX: Documentation
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
                   ` (27 preceding siblings ...)
  2011-01-27  8:44 ` [PATCH 28/29] nVMX: Miscellenous small corrections Nadav Har'El
@ 2011-01-27  8:44 ` Nadav Har'El
  2011-01-28  8:41 ` [PATCH 0/29] nVMX: Nested VMX, v8 Juerg Haefliger
  29 siblings, 0 replies; 47+ messages in thread
From: Nadav Har'El @ 2011-01-27  8:44 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch includes a brief introduction to the nested vmx feature in the
Documentation/kvm directory. The document also includes a copy of the
vmcs12 structure, as requested by Avi Kivity.
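
For example, to try the feature out on the host (matching the options
described in the document below):

	modprobe kvm-intel nested=1

and start qemu with "-cpu host" or "-cpu qemu64,+vmx" so that the guest
sees the VMX CPU feature.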

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 Documentation/kvm/nested-vmx.txt |  241 +++++++++++++++++++++++++++++
 1 file changed, 241 insertions(+)

--- .before/Documentation/kvm/nested-vmx.txt	2011-01-26 18:06:06.000000000 +0200
+++ .after/Documentation/kvm/nested-vmx.txt	2011-01-26 18:06:06.000000000 +0200
@@ -0,0 +1,241 @@
+Nested VMX
+==========
+
+Overview
+---------
+
+On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
+to easily and efficiently run guest operating systems. Normally, these guests
+*cannot* themselves be hypervisors running their own guests, because in VMX,
+guests cannot use VMX instructions.
+
+The "Nested VMX" feature adds this missing capability - of running guest
+hypervisors (which use VMX) with their own nested guests. It does so by
+allowing a guest to use VMX instructions, and correctly and efficiently
+emulating them using the single level of VMX available in the hardware.
+
+We describe in much greater detail the theory behind the nested VMX feature,
+its implementation and its performance characteristics, in the OSDI 2010 paper
+"The Turtles Project: Design and Implementation of Nested Virtualization",
+available at:
+
+	http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf
+
+
+Terminology
+-----------
+
+Single-level virtualization has two levels - the host (KVM) and the guests.
+In nested virtualization, we have three levels: The host (KVM), which we call
+L0, the guest hypervisor, which we call L1, and its nested guest, which we
+call L2.
+
+
+Known limitations
+-----------------
+
+The current code supports running Linux guests under KVM guests.
+Only 64-bit guest hypervisors are supported.
+
+Additional patches for running Windows under guest KVM, and Linux under
+guest VMware server, and support for nested EPT, are currently running in
+the lab, and will be sent as follow-on patchsets.
+
+
+Running nested VMX
+------------------
+
+The nested VMX feature is disabled by default. It can be enabled by giving
+the "nested=1" option to the kvm-intel module.
+
+No modifications are required to user space (qemu). However, qemu's default
+emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
+explicitly enabled, by giving qemu one of the following options:
+
+     -cpu host              (emulated CPU has all features of the real CPU)
+
+     -cpu qemu64,+vmx       (add just the vmx feature to a named CPU type)
+
+
+ABIs
+----
+
+Nested VMX aims to present a standard and (eventually) fully-functional VMX
+implementation for a guest hypervisor to use. As such, the official
+specification of the ABI that it provides is Intel's VMX specification,
+namely volume 3B of their "Intel 64 and IA-32 Architectures Software
+Developer's Manual". Not all of VMX's features are currently fully supported,
+but the goal is to eventually support them all, starting with the VMX features
+which are used in practice by popular hypervisors (KVM and others).
+
+As a VMX implementation, nested VMX presents a VMCS structure to L1.
+As mandated by the spec, other than the two fields revision_id and abort,
+this structure is *opaque* to its user, who is not supposed to know or care
+about its internal structure. Rather, the structure is accessed through the
+VMREAD and VMWRITE instructions.
+Still, for debugging purposes, KVM developers might be interested to know the
+internals of this structure; This is struct vmcs12 from arch/x86/kvm/vmx.c.
+For convenience, we repeat its content here. If the internals of this structure
+change, this can break live migration across KVM versions. VMCS12_REVISION
+(from vmx.c) should be changed if struct vmcs12 or its inner struct shadow_vmcs
+is ever changed.
+
+struct __packed vmcs12 {
+	/* According to the Intel spec, a VMCS region must start with the
+	 * following two fields. Then follow implementation-specific data.
+	 */
+	u32 revision_id;
+	u32 abort;
+
+	struct vmcs_fields fields;
+
+	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
+};
+
+struct __packed vmcs_fields {
+	u16 virtual_processor_id;
+	u16 guest_es_selector;
+	u16 guest_cs_selector;
+	u16 guest_ss_selector;
+	u16 guest_ds_selector;
+	u16 guest_fs_selector;
+	u16 guest_gs_selector;
+	u16 guest_ldtr_selector;
+	u16 guest_tr_selector;
+	u16 host_es_selector;
+	u16 host_cs_selector;
+	u16 host_ss_selector;
+	u16 host_ds_selector;
+	u16 host_fs_selector;
+	u16 host_gs_selector;
+	u16 host_tr_selector;
+	u64 io_bitmap_a;
+	u64 io_bitmap_b;
+	u64 msr_bitmap;
+	u64 vm_exit_msr_store_addr;
+	u64 vm_exit_msr_load_addr;
+	u64 vm_entry_msr_load_addr;
+	u64 tsc_offset;
+	u64 virtual_apic_page_addr;
+	u64 apic_access_addr;
+	u64 ept_pointer;
+	u64 guest_physical_address;
+	u64 vmcs_link_pointer;
+	u64 guest_ia32_debugctl;
+	u64 guest_ia32_pat;
+	u64 guest_pdptr0;
+	u64 guest_pdptr1;
+	u64 guest_pdptr2;
+	u64 guest_pdptr3;
+	u64 host_ia32_pat;
+	u32 pin_based_vm_exec_control;
+	u32 cpu_based_vm_exec_control;
+	u32 exception_bitmap;
+	u32 page_fault_error_code_mask;
+	u32 page_fault_error_code_match;
+	u32 cr3_target_count;
+	u32 vm_exit_controls;
+	u32 vm_exit_msr_store_count;
+	u32 vm_exit_msr_load_count;
+	u32 vm_entry_controls;
+	u32 vm_entry_msr_load_count;
+	u32 vm_entry_intr_info_field;
+	u32 vm_entry_exception_error_code;
+	u32 vm_entry_instruction_len;
+	u32 tpr_threshold;
+	u32 secondary_vm_exec_control;
+	u32 vm_instruction_error;
+	u32 vm_exit_reason;
+	u32 vm_exit_intr_info;
+	u32 vm_exit_intr_error_code;
+	u32 idt_vectoring_info_field;
+	u32 idt_vectoring_error_code;
+	u32 vm_exit_instruction_len;
+	u32 vmx_instruction_info;
+	u32 guest_es_limit;
+	u32 guest_cs_limit;
+	u32 guest_ss_limit;
+	u32 guest_ds_limit;
+	u32 guest_fs_limit;
+	u32 guest_gs_limit;
+	u32 guest_ldtr_limit;
+	u32 guest_tr_limit;
+	u32 guest_gdtr_limit;
+	u32 guest_idtr_limit;
+	u32 guest_es_ar_bytes;
+	u32 guest_cs_ar_bytes;
+	u32 guest_ss_ar_bytes;
+	u32 guest_ds_ar_bytes;
+	u32 guest_fs_ar_bytes;
+	u32 guest_gs_ar_bytes;
+	u32 guest_ldtr_ar_bytes;
+	u32 guest_tr_ar_bytes;
+	u32 guest_interruptibility_info;
+	u32 guest_activity_state;
+	u32 guest_sysenter_cs;
+	u32 host_ia32_sysenter_cs;
+	unsigned long cr0_guest_host_mask;
+	unsigned long cr4_guest_host_mask;
+	unsigned long cr0_read_shadow;
+	unsigned long cr4_read_shadow;
+	unsigned long cr3_target_value0;
+	unsigned long cr3_target_value1;
+	unsigned long cr3_target_value2;
+	unsigned long cr3_target_value3;
+	unsigned long exit_qualification;
+	unsigned long guest_linear_address;
+	unsigned long guest_cr0;
+	unsigned long guest_cr3;
+	unsigned long guest_cr4;
+	unsigned long guest_es_base;
+	unsigned long guest_cs_base;
+	unsigned long guest_ss_base;
+	unsigned long guest_ds_base;
+	unsigned long guest_fs_base;
+	unsigned long guest_gs_base;
+	unsigned long guest_ldtr_base;
+	unsigned long guest_tr_base;
+	unsigned long guest_gdtr_base;
+	unsigned long guest_idtr_base;
+	unsigned long guest_dr7;
+	unsigned long guest_rsp;
+	unsigned long guest_rip;
+	unsigned long guest_rflags;
+	unsigned long guest_pending_dbg_exceptions;
+	unsigned long guest_sysenter_esp;
+	unsigned long guest_sysenter_eip;
+	unsigned long host_cr0;
+	unsigned long host_cr3;
+	unsigned long host_cr4;
+	unsigned long host_fs_base;
+	unsigned long host_gs_base;
+	unsigned long host_tr_base;
+	unsigned long host_gdtr_base;
+	unsigned long host_idtr_base;
+	unsigned long host_ia32_sysenter_esp;
+	unsigned long host_ia32_sysenter_eip;
+	unsigned long host_rsp;
+	unsigned long host_rip;
+};
+
+
+Authors
+-------
+
+These patches were written by:
+     Abel Gordon, abelg <at> il.ibm.com
+     Nadav Har'El, nyh <at> il.ibm.com
+     Orit Wasserman, oritw <at> il.ibm.com
+     Ben-Ami Yassor, benami <at> il.ibm.com
+     Muli Ben-Yehuda, muli <at> il.ibm.com
+
+With contributions by:
+     Anthony Liguori, aliguori <at> us.ibm.com
+     Mike Day, mdday <at> us.ibm.com
+     Michael Factor, factor <at> il.ibm.com
+     Zvi Dubitzky, dubi <at> il.ibm.com
+
+And valuable reviews by:
+     Avi Kivity, avi <at> redhat.com
+     Gleb Natapov, gleb <at> redhat.com
+     and others.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 0/29] nVMX: Nested VMX, v8
  2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
                   ` (28 preceding siblings ...)
  2011-01-27  8:44 ` [PATCH 29/29] nVMX: Documentation Nadav Har'El
@ 2011-01-28  8:41 ` Juerg Haefliger
  2011-01-28 17:16   ` Nadav Har'El
  2011-01-31 10:07   ` Nadav Har'El
  29 siblings, 2 replies; 47+ messages in thread
From: Juerg Haefliger @ 2011-01-28  8:41 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb, avi

Hi Nadav,


> [...]
> This new set of patches applys to the current KVM trunk (I checked with
> 844e6679184180cffa7aca014d672545941ed78e). If you wish, you can also check
> out an already-patched version of KVM from the repository
> git://github.com/nyh/kvm-nested-vmx.git - take the branch "nvmx8".

This branch doesn't even compile:
  CC [M]  drivers/staging/samsung-laptop/samsung-laptop.o
  LD      drivers/staging/se401/built-in.o
  CC [M]  drivers/staging/se401/se401.o
  LD      drivers/staging/serqt_usb2/built-in.o
  CC [M]  drivers/staging/serqt_usb2/serqt_usb2.o
  LD      drivers/staging/slicoss/built-in.o
  CC [M]  drivers/staging/slicoss/slicoss.o
  LD      drivers/staging/sm7xx/built-in.o
  CC [M]  drivers/staging/sm7xx/smtcfb.o
  LD [M]  drivers/staging/sm7xx/sm7xx.o
  LD      drivers/staging/smbfs/built-in.o
  CC [M]  drivers/staging/smbfs/proc.o
  CC [M]  drivers/staging/smbfs/dir.o
drivers/staging/smbfs/dir.c:286: error: static declaration of
‘smbfs_dentry_operations’ follows non-static declaration
drivers/staging/smbfs/proto.h:42: note: previous declaration of
‘smbfs_dentry_operations’ was here
drivers/staging/smbfs/dir.c:294: error: static declaration of
‘smbfs_dentry_operations_case’ follows non-static declaration
drivers/staging/smbfs/proto.h:41: note: previous declaration of
‘smbfs_dentry_operations_case’ was here
make[3]: *** [drivers/staging/smbfs/dir.o] Error 1
make[2]: *** [drivers/staging/smbfs] Error 2
make[1]: *** [drivers/staging] Error 2
make: *** [drivers] Error 2
juergh@gollum:~/hpq/git/kvm-nested-vmx$ git branch
  master
* nvmx8
juergh@gollum:~/hpq/git/kvm-nested-vmx$ git log -1
commit 445a94a1497ed18a3c4fb94ae231ce7c7b3b637b
Author: Nadav Har'El <nyh@il.ibm.com>
Date:   Wed Jan 26 18:13:26 2011 +0200

    Nested VMX patch, v8


It's probably not related to your patch but doesn't make me very
comfortable about the stability of the resulting kernel. Can I safely
apply your patch to the latest -rc kernel?

...Juerg


>
> About nested VMX:
> -----------------
>
> The following 29 patches implement nested VMX support. This feature enables
> a guest to use the VMX APIs in order to run its own nested guests.
> In other words, it allows running hypervisors (that use VMX) under KVM.
> Multiple guest hypervisors can be run concurrently, and each of those can
> in turn host multiple guests.
>
> The theory behind this work, our implementation, and its performance
> characteristics were presented in OSDI 2010 (the USENIX Symposium on
> Operating Systems Design and Implementation). Our paper was titled
> "The Turtles Project: Design and Implementation of Nested Virtualization",
> and was awarded "Jay Lepreau Best Paper". The paper is available online, at:
>
>        http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf
>
> This patch set does not include all the features described in the paper.
> In particular, this patch set is missing nested EPT (L1 can't use EPT and
> must use shadow page tables). It is also missing some features required to
> run VMWare hypervisors as a guest. These missing features will be sent as
> follow-on patches.
>
> Running nested VMX:
> ------------------
>
> The nested VMX feature is currently disabled by default. It must be
> explicitly enabled with the "nested=1" option to the kvm-intel module.
>
> No modifications are required to user space (qemu). However, qemu's default
> emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
> explicitly enabled, by giving qemu one of the following options:
>
>     -cpu host              (emulated CPU has all features of the real CPU)
>
>     -cpu qemu64,+vmx       (add just the vmx feature to a named CPU type)
>
>
> This version was only tested with KVM (64-bit) as a guest hypervisor, and
> Linux as a nested guest.
>
>
> Patch statistics:
> -----------------
>
>  Documentation/kvm/nested-vmx.txt |  241 ++
>  arch/x86/include/asm/kvm_host.h  |    2
>  arch/x86/include/asm/msr-index.h |    9
>  arch/x86/include/asm/vmx.h       |   31
>  arch/x86/kvm/svm.c               |    6
>  arch/x86/kvm/vmx.c               | 2496 ++++++++++++++++++++++++++++-
>  arch/x86/kvm/x86.c               |   10
>  arch/x86/kvm/x86.h               |    6
>  8 files changed, 2760 insertions(+), 41 deletions(-)
>
> --
> Nadav Har'El
> IBM Haifa Research Lab
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


* Re: [PATCH 0/29] nVMX: Nested VMX, v8
  2011-01-28  8:41 ` [PATCH 0/29] nVMX: Nested VMX, v8 Juerg Haefliger
@ 2011-01-28 17:16   ` Nadav Har'El
  2011-01-31 10:07   ` Nadav Har'El
  1 sibling, 0 replies; 47+ messages in thread
From: Nadav Har'El @ 2011-01-28 17:16 UTC (permalink / raw)
  To: Juerg Haefliger; +Cc: kvm, gleb, avi


On Fri, Jan 28, 2011, Juerg Haefliger wrote about "Re: [PATCH 0/29] nVMX: Nested VMX, v8":
> This branch doesn't even compile:
>...
> drivers/staging/smbfs/dir.c:286: error: static declaration of
> ‘smbfs_dentry_operations’ follows non-static declaration
>...


Thanks, I'll look into why this happened as soon as possible.
This obviously has nothing to do with the nested patches themselves
(which didn't touch anything outside KVM, and certainly nothing related to
smbfs), so there must be something wrong with the base version I used - which
I thought (but I'll recheck) was just the current kvm.git master head at the
time I sent the patches.  Needless to say, I did try compiling this version
(and running it), but I used a very minimal .config, and didn't try to compile
smbfs.

> It's probably not related to your patch but doesn't make me very
> comfortable about the stability of the resulting kernel. Can I safely
> apply your patch to the latest -rc kernel?

I have rebased the patches to the latest kvm.git head at the time I sent the
patches - not to any official KVM or Linux version. If the "latest rc kernel"
contains a recent enough version of KVM, the patches might work on it, but
I simply don't know (and haven't tested it).

Good luck, and I'll look into the non-compiling smbfs issue right after the
weekend. Thanks,

Nadav.


-- 
Nadav Har'El                        |      Friday, Jan 28 2011, 24 Shevat 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Amateurs built the ark - professionals
http://nadav.harel.org.il           |built the Titanic.


* Re: [PATCH 05/29] nVMX: Implement reading and writing of VMX MSRs
  2011-01-27  8:32 ` [PATCH 05/29] nVMX: Implement reading and writing of VMX MSRs Nadav Har'El
@ 2011-01-30  9:52   ` Avi Kivity
  2011-01-31  8:57     ` Nadav Har'El
  0 siblings, 1 reply; 47+ messages in thread
From: Avi Kivity @ 2011-01-30  9:52 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

On 01/27/2011 10:32 AM, Nadav Har'El wrote:
> When the guest can use VMX instructions (when the "nested" module option is
> on), it should also be able to read and write VMX MSRs, e.g., to query about
> VMX capabilities. This patch adds this support.
>
>
> +#define CORE2_PINBASED_CTLS_MUST_BE_ONE	0x00000016
> +	case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
> +	case MSR_IA32_VMX_PINBASED_CTLS:
> +		vmx_msr_low  = CORE2_PINBASED_CTLS_MUST_BE_ONE;
> +		vmx_msr_high = CORE2_PINBASED_CTLS_MUST_BE_ONE |
> +				PIN_BASED_EXT_INTR_MASK |
> +				PIN_BASED_NMI_EXITING |
> +				PIN_BASED_VIRTUAL_NMIS;

Can we actually support PIN_BASED_VIRTUAL_NMIs on hosts which don't 
support them?

Maybe better to drop for the initial version.
> +		rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, vmx_msr_low, vmx_msr_high);
> +
> +		vmx_msr_low = 0; /* allow disabling any feature */
> +		vmx_msr_high&= /* do not expose new untested features */
> +			CPU_BASED_HLT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
> +			CPU_BASED_CR3_STORE_EXITING | CPU_BASED_USE_IO_BITMAPS |
> +			CPU_BASED_MOV_DR_EXITING | CPU_BASED_USE_TSC_OFFSETING |
> +			CPU_BASED_MWAIT_EXITING | CPU_BASED_MONITOR_EXITING |
> +			CPU_BASED_INVLPG_EXITING | CPU_BASED_TPR_SHADOW |
> +#ifdef CONFIG_X86_64
> +			CPU_BASED_CR8_LOAD_EXITING |
> +			CPU_BASED_CR8_STORE_EXITING |
> +#endif
> +			CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;

Similarly, I think CPU_BASED_TPR_SHADOW and 
CPU_BASED_ACTIVATE_SECONDARY_CONTROLS are not available on all models.

> +	case MSR_IA32_VMX_PROCBASED_CTLS2:
> +		*pdata = 0;
> +		if (vm_need_virtualize_apic_accesses(vcpu->kvm))
> +			*pdata |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> +		break;

Here, at least you qualify support by checking the host.
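
For the pin-based case above, host qualification could look roughly like the
following sketch - illustrative only, not code from the patch, reusing the
constant and flag names quoted above:

	case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
	case MSR_IA32_VMX_PINBASED_CTLS:
		/* Start from what the host CPU actually supports... */
		rdmsr(MSR_IA32_VMX_PINBASED_CTLS, vmx_msr_low, vmx_msr_high);
		/*
		 * ...and expose only the subset we are prepared to emulate,
		 * so a control like PIN_BASED_VIRTUAL_NMIS is never
		 * advertised to L1 on a host that lacks it.
		 */
		vmx_msr_high &= CORE2_PINBASED_CTLS_MUST_BE_ONE |
				PIN_BASED_EXT_INTR_MASK |
				PIN_BASED_NMI_EXITING |
				PIN_BASED_VIRTUAL_NMIS;
		break;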

> +	case MSR_IA32_VMX_EPT_VPID_CAP:
> +		/* Currently, no nested ept or nested vpid */
> +		*pdata = 0;
> +		break;
> +	default:
> +		return 0;
> +	}
> +
> +	return 1;
> +}
> +
> +static int vmx_set_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
> +{
> +	if (!nested_vmx_allowed(vcpu))
> +		return 0;
> +
> +	/*
> +	 * according to the spec, "VMX capability MSRs are read-only; an
> +	 * attempt to write them (with WRMSR) produces a #GP(0).
> +	 */
> +	if (msr_index>= MSR_IA32_VMX_BASIC&&
> +	    msr_index<= MSR_IA32_VMX_TRUE_ENTRY_CTLS) {
> +		kvm_queue_exception_e(vcpu, GP_VECTOR, 0);
> +		return 1;

Can just drop this part, #GP is the default response.

> +	} else if (msr_index == MSR_IA32_FEATURE_CONTROL)
> +		/* TODO: the right thing. */
> +		return 1;
> +	else
> +		return 0;
> +}
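
For illustration, the simplification suggested above might end up looking
roughly like this (a sketch, not the actual follow-up patch): returning 0
lets the common WRMSR path raise the #GP(0) itself, so no explicit
kvm_queue_exception_e() call is needed.

static int vmx_set_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
{
	if (!nested_vmx_allowed(vcpu))
		return 0;

	/*
	 * VMX capability MSRs are read-only; writing them must #GP(0),
	 * which is exactly what the caller does when we return 0.
	 */
	if (msr_index >= MSR_IA32_VMX_BASIC &&
	    msr_index <= MSR_IA32_VMX_TRUE_ENTRY_CTLS)
		return 0;

	if (msr_index == MSR_IA32_FEATURE_CONTROL)
		return 1; /* TODO: emulate this MSR properly */

	return 0;
}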

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12
  2011-01-27  8:33 ` [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12 Nadav Har'El
@ 2011-01-30 10:02   ` Avi Kivity
  2011-01-31  9:26     ` Nadav Har'El
  2011-02-03 12:57     ` Nadav Har'El
  0 siblings, 2 replies; 47+ messages in thread
From: Avi Kivity @ 2011-01-30 10:02 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

On 01/27/2011 10:33 AM, Nadav Har'El wrote:
> In this patch we add a list of L0 (hardware) VMCSs, which we'll use to hold a
> hardware VMCS for each active vmcs12 (i.e., for each L2 guest).
>
> We call each of these L0 VMCSs a "vmcs02", as it is the VMCS that L0 uses
> to run its nested guest L2.
>
>
> +/*
> + * Allocate an L0 VMCS (vmcs02) for the current L1 VMCS (vmcs12), if one
> + * does not already exist. The allocation is done in L0 memory, so to avoid
> + * denial-of-service attack by guests, we limit the number of concurrently-
> + * allocated vmcss. A well-behaving L1 will VMCLEAR unused vmcs12s and not
> + * trigger this limit.

No, it won't.  If you run N guests on a single-cpu kvm host, you'll have 
N active VMCSs.

> + */
> +static int nested_create_current_vmcs(struct kvm_vcpu *vcpu)
> +{
> +	struct vmcs_list *new_l2_guest;
> +	struct vmcs *vmcs02;
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	if (nested_get_current_vmcs(vmx))
> +		return 0; /* nothing to do - we already have a VMCS */
> +
> +	if (vmx->nested.vmcs02_num>= NESTED_MAX_VMCS)
> +		return -ENOMEM;

I asked to replace this by dropping the entire vmcs02_list (or perhaps 
just its tail).

> +static void __nested_free_saved_vmcs(void *arg)
> +{
> +	struct saved_vmcs *saved_vmcs = arg;
> +	int cpu = raw_smp_processor_id();
> +
> +	if (saved_vmcs->cpu == cpu) /* TODO: how can this not be the case? */
> +		vmcs_clear(saved_vmcs->vmcs);

This check will always be true.

> +	if (per_cpu(current_vmcs, cpu) == saved_vmcs->vmcs)
> +		per_cpu(current_vmcs, cpu) = NULL;

And this will always be false, no?  Unless you free a vmcs02 while you 
use it?  Don't you always switch back to vmcs01 prior to freeing?

> +}
> +

>   	skip_emulated_instruction(vcpu);
> @@ -4050,6 +4190,8 @@ static void free_nested(struct vcpu_vmx
>   		nested_release_page(vmx->nested.current_vmcs12_page);
>   		vmx->nested.current_vmptr = -1ull;
>   	}
> +
> +	nested_free_all_vmcs(vmx);
>   }

Maybe this is the counterexample - we kill a vcpu while it is in nested 
mode.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 08/29] nVMX: Fix local_vcpus_link handling
  2011-01-27  8:33 ` [PATCH 08/29] nVMX: Fix local_vcpus_link handling Nadav Har'El
@ 2011-01-30 10:08   ` Avi Kivity
  0 siblings, 0 replies; 47+ messages in thread
From: Avi Kivity @ 2011-01-30 10:08 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

On 01/27/2011 10:33 AM, Nadav Har'El wrote:
> In VMX, before we bring down a CPU we must VMCLEAR all VMCSs loaded on it
> because (at least in theory) the processor might not have written all of its
> content back to memory. Since a patch from June 26, 2008, this is done using
> a per-cpu "vcpus_on_cpu" linked list of vcpus loaded on each CPU.
>
> The problem is that with nested VMX, we no longer have the concept of a
> vcpu being loaded on a cpu: A vcpu has multiple VMCSs (one for L1, others for
> each L2), and each of those may have been last loaded on a different cpu.
>
> This trivial patch changes the code to keep only L1 VMCSs on vcpus_on_cpu.
> This fixes crashes on L1 shutdown caused by incorrectly maintaining the linked
> lists.
>
> It is not a complete solution, though. It doesn't flush the inactive L1 or L2
> VMCSs loaded on a CPU which is being shutdown. Doing this correctly will
> probably require replacing the vcpu linked list by a linked list of "saved_vmcs"
> objects (VMCS, cpu and launched), and it is left as a TODO.
>

It looks like the right thing is a structure that represents the common 
things between 01 and 02 vmcses:

- pointer to memory
- cpu
- linked list entries for vcpus_on_cpu (to be renamed vmcses_on_cpu)

You could then use vcpu_clear() in the previous patch.
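
Something along these lines, perhaps:

/*
 * Purely illustrative sketch (placeholder names, not from any existing
 * patch): the part shared by vmcs01 and every vmcs02 - the VMCS itself,
 * the CPU it was last loaded on, whether it has been launched, and its
 * link in the per-cpu list that is flushed when a CPU goes down.
 */
struct loaded_vmcs {
	struct vmcs *vmcs;
	int cpu;
	int launched;
	struct list_head loaded_vmcss_on_cpu_link;
};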

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 09/29] nVMX: Add VMCS fields to the vmcs12
  2011-01-27  8:34 ` [PATCH 09/29] nVMX: Add VMCS fields to the vmcs12 Nadav Har'El
@ 2011-01-30 10:10   ` Avi Kivity
  0 siblings, 0 replies; 47+ messages in thread
From: Avi Kivity @ 2011-01-30 10:10 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

On 01/27/2011 10:34 AM, Nadav Har'El wrote:
> In this patch we add to vmcs12 (the VMCS that L1 keeps for L2) all the
> standard VMCS fields. These fields are encapsulated in a struct vmcs_fields.
>
> Later patches will enable L1 to read and write these fields using VMREAD/
> VMWRITE, and they will be used during a VMLAUNCH/VMRESUME in preparing vmcs02,
> a hardware VMCS for running L2.
>
>
>   /*
> + * vmcs_fields is a structure used in nested VMX for holding a copy of all
> + * standard VMCS fields. It is used for emulating a VMCS for L1 (see struct
> + * vmcs12), and also for easier access to VMCS data (see vmcs01_fields).
> + */
> +struct __packed vmcs_fields {
> +	u16 virtual_processor_id;
> +	u16 guest_es_selector;
> +	u16 guest_cs_selector;
> +	u16 guest_ss_selector;
> +	u16 guest_ds_selector;
> +	u16 guest_fs_selector;
> +	u16 guest_gs_selector;
> +	u16 guest_ldtr_selector;
> +	u16 guest_tr_selector;
> +	u16 host_es_selector;
> +	u16 host_cs_selector;
> +	u16 host_ss_selector;
> +	u16 host_ds_selector;
> +	u16 host_fs_selector;
> +	u16 host_gs_selector;
> +	u16 host_tr_selector;
> +	u64 io_bitmap_a;
> +	u64 io_bitmap_b;
> +	u64 msr_bitmap;
> +	u64 vm_exit_msr_store_addr;
> +	u64 vm_exit_msr_load_addr;
> +	u64 vm_entry_msr_load_addr;
> +	u64 tsc_offset;
> +	u64 virtual_apic_page_addr;
> +	u64 apic_access_addr;
> +	u64 ept_pointer;
> +	u64 guest_physical_address;
> +	u64 vmcs_link_pointer;
> +	u64 guest_ia32_debugctl;
> +	u64 guest_ia32_pat;
> +	u64 guest_pdptr0;
> +	u64 guest_pdptr1;
> +	u64 guest_pdptr2;
> +	u64 guest_pdptr3;
> +	u64 host_ia32_pat;
> +	u32 pin_based_vm_exec_control;
> +	u32 cpu_based_vm_exec_control;
> +	u32 exception_bitmap;
> +	u32 page_fault_error_code_mask;
> +	u32 page_fault_error_code_match;
> +	u32 cr3_target_count;
> +	u32 vm_exit_controls;
> +	u32 vm_exit_msr_store_count;
> +	u32 vm_exit_msr_load_count;
> +	u32 vm_entry_controls;
> +	u32 vm_entry_msr_load_count;
> +	u32 vm_entry_intr_info_field;
> +	u32 vm_entry_exception_error_code;
> +	u32 vm_entry_instruction_len;
> +	u32 tpr_threshold;
> +	u32 secondary_vm_exec_control;
> +	u32 vm_instruction_error;
> +	u32 vm_exit_reason;
> +	u32 vm_exit_intr_info;
> +	u32 vm_exit_intr_error_code;
> +	u32 idt_vectoring_info_field;
> +	u32 idt_vectoring_error_code;
> +	u32 vm_exit_instruction_len;
> +	u32 vmx_instruction_info;
> +	u32 guest_es_limit;
> +	u32 guest_cs_limit;
> +	u32 guest_ss_limit;
> +	u32 guest_ds_limit;
> +	u32 guest_fs_limit;
> +	u32 guest_gs_limit;
> +	u32 guest_ldtr_limit;
> +	u32 guest_tr_limit;
> +	u32 guest_gdtr_limit;
> +	u32 guest_idtr_limit;
> +	u32 guest_es_ar_bytes;
> +	u32 guest_cs_ar_bytes;
> +	u32 guest_ss_ar_bytes;
> +	u32 guest_ds_ar_bytes;
> +	u32 guest_fs_ar_bytes;
> +	u32 guest_gs_ar_bytes;
> +	u32 guest_ldtr_ar_bytes;
> +	u32 guest_tr_ar_bytes;
> +	u32 guest_interruptibility_info;
> +	u32 guest_activity_state;
> +	u32 guest_sysenter_cs;
> +	u32 host_ia32_sysenter_cs;
> +	unsigned long cr0_guest_host_mask;

ulongs are not migration friendly since they can change size.  Please 
use u64.

> +	unsigned long cr4_guest_host_mask;
> +	unsigned long cr0_read_shadow;
> +	unsigned long cr4_read_shadow;
> +	unsigned long cr3_target_value0;
> +	unsigned long cr3_target_value1;
> +	unsigned long cr3_target_value2;
> +	unsigned long cr3_target_value3;
> +	unsigned long exit_qualification;
> +	unsigned long guest_linear_address;
> +	unsigned long guest_cr0;
> +	unsigned long guest_cr3;
> +	unsigned long guest_cr4;
> +	unsigned long guest_es_base;
> +	unsigned long guest_cs_base;
> +	unsigned long guest_ss_base;
> +	unsigned long guest_ds_base;
> +	unsigned long guest_fs_base;
> +	unsigned long guest_gs_base;
> +	unsigned long guest_ldtr_base;
> +	unsigned long guest_tr_base;
> +	unsigned long guest_gdtr_base;
> +	unsigned long guest_idtr_base;
> +	unsigned long guest_dr7;
> +	unsigned long guest_rsp;
> +	unsigned long guest_rip;
> +	unsigned long guest_rflags;
> +	unsigned long guest_pending_dbg_exceptions;
> +	unsigned long guest_sysenter_esp;
> +	unsigned long guest_sysenter_eip;
> +	unsigned long host_cr0;
> +	unsigned long host_cr3;
> +	unsigned long host_cr4;
> +	unsigned long host_fs_base;
> +	unsigned long host_gs_base;
> +	unsigned long host_tr_base;
> +	unsigned long host_gdtr_base;
> +	unsigned long host_idtr_base;
> +	unsigned long host_ia32_sysenter_esp;
> +	unsigned long host_ia32_sysenter_eip;
> +	unsigned long host_rsp;
> +	unsigned long host_rip;
> +};
> +
> +/*
>    * struct vmcs12 describes the state that our guest hypervisor (L1) keeps for a
>    * single nested guest (L2), hence the name vmcs12. Any VMX implementation has
>    * a VMCS structure, and vmcs12 is our emulated VMX's VMCS. This structure is
> @@ -150,6 +281,8 @@ struct __packed vmcs12 {
>   	 */
>   	u32 revision_id;
>   	u32 abort;

Please add padding here for future-proofing.

> +
> +	struct vmcs_fields fields;
>   };
>
>   /*
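
Taken together, the two requests above might amount to something like this
sketch (the field and struct names come from the quoted patch; the padding
size is an arbitrary placeholder):

struct __packed vmcs12 {
	u32 revision_id;
	u32 abort;

	/* Illustrative padding so the layout can grow without breaking
	 * saved vmcs12 contents; the size here is only a guess. */
	u32 padding[30];

	/* ...with the natural-width fields inside switched to u64. */
	struct vmcs_fields fields;
};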

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 11/29] nVMX: Implement VMCLEAR
  2011-01-27  8:35 ` [PATCH 11/29] nVMX: Implement VMCLEAR Nadav Har'El
@ 2011-01-30 12:07   ` Avi Kivity
  0 siblings, 0 replies; 47+ messages in thread
From: Avi Kivity @ 2011-01-30 12:07 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

On 01/27/2011 10:35 AM, Nadav Har'El wrote:
> This patch implements the VMCLEAR instruction.
>
>
>
> +/* Emulate the VMCLEAR instruction */
> +static int handle_vmclear(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	gva_t gva;
> +	gpa_t vmcs12_addr;
> +	struct vmcs12 *vmcs12;
> +	struct page *page;
> +
> +	if (!nested_vmx_check_permission(vcpu))
> +		return 1;
> +
> +	if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
> +			vmcs_read32(VMX_INSTRUCTION_INFO),&gva))
> +		return 1;
> +
> +	if (kvm_read_guest_virt(gva,&vmcs12_addr, sizeof(vmcs12_addr),
> +				vcpu, NULL)) {
> +		kvm_queue_exception(vcpu, PF_VECTOR);

This generates an exception without an error code.  Use the 'struct 
x86_exception' parameter to kvm_read_guest_virt() to obtain the correct 
exception/error code pair.

> +		return 1;
> +	}
> +
> +	if (!IS_ALIGNED(vmcs12_addr, PAGE_SIZE)) {
> +		nested_vmx_failValid(vcpu, VMXERR_VMCLEAR_INVALID_ADDRESS);
> +		skip_emulated_instruction(vcpu);
> +		return 1;
> +	}
> +
> +	if (vmcs12_addr == vmx->nested.current_vmptr) {
> +		kunmap(vmx->nested.current_vmcs12_page);
> +		nested_release_page(vmx->nested.current_vmcs12_page);
> +		vmx->nested.current_vmptr = -1ull;
> +	}
> +
> +	page = nested_get_page(vcpu, vmcs12_addr);
> +	if (page == NULL) {
> +		/*
> +		 * For accurate processor emulation, VMCLEAR beyond available
> +		 * physical memory should do nothing at all. However, it is
> +		 * possible that a nested vmx bug, not a guest hypervisor bug,
> +		 * resulted in this case, so let's shut down before doing any
> +		 * more damage:
> +		 */
> +		kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
> +		nested_release_page(page);

nested_release_page(NULL) unneeded.

> +		return 1;
> +	}
> +	vmcs12 = kmap(page);
> +	vmcs12->launch_state = 0;
> +	kunmap(page);
> +	nested_release_page(page);
> +
> +	nested_free_vmcs(vmx, vmcs12_addr);
> +
> +	skip_emulated_instruction(vcpu);
> +	nested_vmx_succeed(vcpu);
> +	return 1;
> +}
> +
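
Regarding the kvm_read_guest_virt() comment above, the intended pattern is
presumably something like this sketch, assuming the x86_exception-taking
variants of kvm_read_guest_virt() and kvm_inject_page_fault():

	struct x86_exception e;

	if (kvm_read_guest_virt(gva, &vmcs12_addr, sizeof(vmcs12_addr),
				vcpu, &e)) {
		/* Propagate the exact fault, including its error code. */
		kvm_inject_page_fault(vcpu, &e);
		return 1;
	}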

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 05/29] nVMX: Implement reading and writing of VMX MSRs
  2011-01-30  9:52   ` Avi Kivity
@ 2011-01-31  8:57     ` Nadav Har'El
  2011-01-31  9:01       ` Avi Kivity
  0 siblings, 1 reply; 47+ messages in thread
From: Nadav Har'El @ 2011-01-31  8:57 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, gleb

On Sun, Jan 30, 2011, Avi Kivity wrote about "Re: [PATCH 05/29] nVMX: Implement reading and writing of VMX MSRs":
> >+	case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
> >+	case MSR_IA32_VMX_PINBASED_CTLS:
> >+		vmx_msr_low  = CORE2_PINBASED_CTLS_MUST_BE_ONE;
> >+		vmx_msr_high = CORE2_PINBASED_CTLS_MUST_BE_ONE |
> >+				PIN_BASED_EXT_INTR_MASK |
> >+				PIN_BASED_NMI_EXITING |
> >+				PIN_BASED_VIRTUAL_NMIS;
> 
> Can we actually support PIN_BASED_VIRTUAL_NMIs on hosts which don't 
> support them?
> 
> Maybe better to drop for the initial version.

Thanks, I'll look into this. You already found this problem in June, and
it's already in my bugzilla. Just wanted to let you know that I'm taking
all your previous comments seriously, and not forgetting any of them.
Since you mention this one again, I'm increasing its priority, so I'll fix
it before the next version of the patches.

> >+static int vmx_set_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
> >+{
> >+	if (!nested_vmx_allowed(vcpu))
> >+		return 0;
> >+
> >+	/*
> >+	 * according to the spec, "VMX capability MSRs are read-only; an
> >+	 * attempt to write them (with WRMSR) produces a #GP(0).
> >+	 */
> >+	if (msr_index>= MSR_IA32_VMX_BASIC&&
> >+	    msr_index<= MSR_IA32_VMX_TRUE_ENTRY_CTLS) {
> >+		kvm_queue_exception_e(vcpu, GP_VECTOR, 0);
> >+		return 1;
> 
> Can just drop this part, #GP is the default response.

Right, thanks, I see that now. I'll remove the extra code, but leave a
comment.

-- 
Nadav Har'El                        |      Monday, Jan 31 2011, 26 Shevat 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |If the universe is expanding, why can't I
http://nadav.harel.org.il           |find a parking space?


* Re: [PATCH 05/29] nVMX: Implement reading and writing of VMX MSRs
  2011-01-31  8:57     ` Nadav Har'El
@ 2011-01-31  9:01       ` Avi Kivity
  0 siblings, 0 replies; 47+ messages in thread
From: Avi Kivity @ 2011-01-31  9:01 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

On 01/31/2011 10:57 AM, Nadav Har'El wrote:
> On Sun, Jan 30, 2011, Avi Kivity wrote about "Re: [PATCH 05/29] nVMX: Implement reading and writing of VMX MSRs":
> >  >+	case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
> >  >+	case MSR_IA32_VMX_PINBASED_CTLS:
> >  >+		vmx_msr_low  = CORE2_PINBASED_CTLS_MUST_BE_ONE;
> >  >+		vmx_msr_high = CORE2_PINBASED_CTLS_MUST_BE_ONE |
> >  >+				PIN_BASED_EXT_INTR_MASK |
> >  >+				PIN_BASED_NMI_EXITING |
> >  >+				PIN_BASED_VIRTUAL_NMIS;
> >
> >  Can we actually support PIN_BASED_VIRTUAL_NMIs on hosts which don't
> >  support them?
> >
> >  Maybe better to drop for the initial version.
>
> Thanks, I'll look into this. You already found this problem in June, and
> it's already in my bugzilla. Just wanted to let you know that I'm taking
> all your previous comments seriously, and not forgetting any of them.
> Since you mention this one again, I'm increasing its priority, so I'll fix
> it before the next version of the patches.

Thanks.  Since I see many other comments are still unaddressed, I'll 
stop reviewing until you let me know that you have a version that is 
ready for review.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12
  2011-01-30 10:02   ` Avi Kivity
@ 2011-01-31  9:26     ` Nadav Har'El
  2011-01-31  9:41       ` Avi Kivity
  2011-02-03 12:57     ` Nadav Har'El
  1 sibling, 1 reply; 47+ messages in thread
From: Nadav Har'El @ 2011-01-31  9:26 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, gleb

Hi,

On Sun, Jan 30, 2011, Avi Kivity wrote about "Re: [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12":
> >+/*
> >+ * Allocate an L0 VMCS (vmcs02) for the current L1 VMCS (vmcs12), if one
> >+ * does not already exist. The allocation is done in L0 memory, so to 
> >avoid
> >+ * denial-of-service attack by guests, we limit the number of 
> >concurrently-
> >+ * allocated vmcss. A well-behaving L1 will VMCLEAR unused vmcs12s and not
> >+ * trigger this limit.
> 
> No, it won't.  If you run N guests on a single-cpu kvm host, you'll have 
> N active VMCSs.

Of course. What I said was that *unused* vmcs12s (in the sense that they
don't describe any active guest) will normally be unloaded (VMCLEARed) by L1
and so will not take up space. Only VMCSs actually being used to run guests
will take up space. If you have N running guests, then right, you'll have
N VMCSs. I put the limit at 256, which on one hand allows L1 to run 256 L2s
(which I think is well above what people will normally run on one CPU), and
on the other hand limits the amount of damage that a malicious L1 can do:
At worst it can cause the host to allocate (and pin) 256 extra VMCSs, which
sum up to 1 MB.

I thought that this compromise was good enough, and you didn't, and although
I still don't understand why, I promise I will change it. I'll make this
change my top priority now.

> >+static void __nested_free_saved_vmcs(void *arg)
> >+{
> >+	struct saved_vmcs *saved_vmcs = arg;
> >+	int cpu = raw_smp_processor_id();
> >+
> >+	if (saved_vmcs->cpu == cpu) /* TODO: how can this not be the case? */
> >+		vmcs_clear(saved_vmcs->vmcs);
> 
> This check will always be true.

This is what I thought too... I call this function on the saved_vmcs->cpu
cpu, so there's no reason why it would find itself being called on a different
cpu.

The only reason I added this sanity check was that __vcpu_clear has the same
one, and there too it seemed redundant, and I thought maybe I was missing
something. Do you know why __vcpu_clear needs this test?

> 
> >+	if (per_cpu(current_vmcs, cpu) == saved_vmcs->vmcs)
> >+		per_cpu(current_vmcs, cpu) = NULL;
> 
> And this will always be false, no?  Unless you free a vmcs02 while you 
> use it?  Don't you always switch back to vmcs01 prior to freeing?
>..
> Maybe this is the counterexample - we kill a vcpu while it is in nested 
> mode.

Right, this is what I had in mind.

-- 
Nadav Har'El                        |      Monday, Jan 31 2011, 26 Shevat 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |I planted some bird seed. A bird came up.
http://nadav.harel.org.il           |Now I don't know what to feed it...


* Re: [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12
  2011-01-31  9:26     ` Nadav Har'El
@ 2011-01-31  9:41       ` Avi Kivity
  0 siblings, 0 replies; 47+ messages in thread
From: Avi Kivity @ 2011-01-31  9:41 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

On 01/31/2011 11:26 AM, Nadav Har'El wrote:
> Hi,
>
> On Sun, Jan 30, 2011, Avi Kivity wrote about "Re: [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12":
> >  >+/*
> >  >+ * Allocate an L0 VMCS (vmcs02) for the current L1 VMCS (vmcs12), if one
> >  >+ * does not already exist. The allocation is done in L0 memory, so to
> >  >avoid
> >  >+ * denial-of-service attack by guests, we limit the number of
> >  >concurrently-
> >  >+ * allocated vmcss. A well-behaving L1 will VMCLEAR unused vmcs12s and not
> >  >+ * trigger this limit.
> >
> >  No, it won't.  If you run N guests on a single-cpu kvm host, you'll have
> >  N active VMCSs.
>
> Of course. What I said was that *unused* vmcs12s (in the sense that they
> don't describe any active guest) will normally be unloaded (VMCLEARed) by L1
> and so will not take up space. Only VMCSs actually being used to run guests
> will take up space. If you have N running guests, then right, you'll have
> N VMCSs. I put the limit at 256, which on one hand allows L1 to run 256 L2s
> (which I think is well above what people will normally run on one CPU), and
> on the other hand limits the amount of damage that a malicious L1 can do:
> At worst it can cause the host to allocate (and pin) 256 extra VMCSs, which
> sum up to 1 MB.
>
> I thought that this compromise was good enough, and you didn't, and although
> I still don't understand why, I promise I will change it. I'll make this
> change my top priority now.

VM crashes on legal guest behaviour are bad; and since it's easy to 
avoid, why do it?

> >  >+static void __nested_free_saved_vmcs(void *arg)
> >  >+{
> >  >+	struct saved_vmcs *saved_vmcs = arg;
> >  >+	int cpu = raw_smp_processor_id();
> >  >+
> >  >+	if (saved_vmcs->cpu == cpu) /* TODO: how can this not be the case? */
> >  >+		vmcs_clear(saved_vmcs->vmcs);
> >
> >  This check will always be true.
>
> This is what I thought too... I call this function on the saved_vmcs->cpu
> cpu, so there's no reason why it would find itself being called on a different
> cpu.
>
> The only reason I added this sanity check was that __vcpu_clear has the same
> one, and there too it seemed redundant, and I thought maybe I was missing
> something. Do you know why __vcpu_clear needs this test?

__vcpu_clear() can race with itself (called from the cpu migration path 
vs. cpu offline path)


-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 0/29] nVMX: Nested VMX, v8
  2011-01-28  8:41 ` [PATCH 0/29] nVMX: Nested VMX, v8 Juerg Haefliger
  2011-01-28 17:16   ` Nadav Har'El
@ 2011-01-31 10:07   ` Nadav Har'El
  1 sibling, 0 replies; 47+ messages in thread
From: Nadav Har'El @ 2011-01-31 10:07 UTC (permalink / raw)
  To: Juerg Haefliger; +Cc: kvm, gleb, avi

On Fri, Jan 28, 2011, Juerg Haefliger wrote about "Re: [PATCH 0/29] nVMX: Nested VMX, v8":
> This branch doesn't even compile:
>...
>   CC [M]  drivers/staging/smbfs/dir.o
> drivers/staging/smbfs/dir.c:286: error: static declaration of

I tried to compile this branch with the default .config (answering the
defaults to every question in "make config"), and it compiled without
any error.

However, the default configuration is *not* to compile the code you had
problems in, drivers/staging/smbfs, because by default CONFIG_STAGING is off.

When I went to enable CONFIG_STAGING and then SMB_FS, I was given a big
warning that smbfs is "OBSOLETE, please use CIFS". Is there a specific reason
why you want to use smbfs, not cifs?

Anyway, when I did enable SMB_FS and compiled, sure enough I got the
same error that you did. This definitely has nothing to do with the
nested VMX patches. If smbfs isn't essential to you, please try again
without it (turn off SMB_FS in your .config). If it is, maybe you should
report this problem to the smbfs maintainers?

Thanks,
Nadav.

-- 
Nadav Har'El                        |      Monday, Jan 31 2011, 26 Shevat 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |This box was intentionally left blank.
http://nadav.harel.org.il           |


* Re: [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12
  2011-01-30 10:02   ` Avi Kivity
  2011-01-31  9:26     ` Nadav Har'El
@ 2011-02-03 12:57     ` Nadav Har'El
  2011-02-06  9:16       ` Avi Kivity
  1 sibling, 1 reply; 47+ messages in thread
From: Nadav Har'El @ 2011-02-03 12:57 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, gleb

On Sun, Jan 30, 2011, Avi Kivity wrote about "Re: [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12":
>..
> >+static int nested_create_current_vmcs(struct kvm_vcpu *vcpu)
> >+{
>...
> >+	if (vmx->nested.vmcs02_num>= NESTED_MAX_VMCS)
> >+		return -ENOMEM;
> 
> I asked to replace this by dropping the entire vmcs02_list (or perhaps 
> just its tail).

Hi, here is a completely rewritten patch.

Now we make no guarantee to keep one vmcs02 for each vmcs12. Rather, we
have a limited pool of vmcs02s. If we can find the same vmcs02 that we
previously used for the current vmcs12, we reuse it. Otherwise, we can take
any of them (we take the least recently used) and just use that. Of course,
if the pool is not yet full, we can also allocate a new vmcs02.

The current default size for the pool is 1, meaning that we just keep one
vmcs02 (per vcpu) for use in any L2, potentially many of them. Because in
this version prepare_vmcs02 each time sets all vmcs02 fields, and doesn't
try to avoid setting rarely modified fields, there's nothing to gain by
trying to start from the previous vmcs02 used to run a certain L2.
In the future, when we do have an optimized prepare_vmcs02 which doesn't
set every field each time, it will make sense to increase the pool size.



Subject: [PATCH 07/29] nVMX: Introduce vmcs02: VMCS used to run L2

We saw in a previous patch that L1 controls its L2 guest with a vmcs12.
L0 needs to create a real VMCS for running L2. We call that "vmcs02".
A later patch will contain the code, prepare_vmcs02(), for filling the vmcs02
fields. This patch only contains code for allocating vmcs02.

In this version, prepare_vmcs02() sets *all* of vmcs02's fields each time we
enter from L1 to L2, so keeping just one vmcs02 for the vcpu would have
sufficed: It could be reused even when L1 runs multiple L2 guests.
However, in future versions we'll probably want to add an optimization where
vmcs02 fields that rarely change will not be set each time. For that reason
it is beneficial to keep around several vmcs02s of L2 guests that have
recently run, so that potentially we could run these L2s again more quickly
because less vmwrites to vmcs02 will be needed.

This patch adds to each vcpu a vmcs02 pool, vmx->nested.vmcs02_pool,
which remembers the vmcs02s last used to run up to VMCS02_POOL_SIZE L2s.
Because in the current version prepare_vmcs02() sets all vmcs02 fields no
matter what we start with, we choose VMCS02_POOL_SIZE=1. I.e., one vmcs02
is allocated (and loaded onto the processor), and it is reused to enter any
L2 guest. In the future, when prepare_vmcs02() is optimized not to set all
fields every time, VMCS02_POOL_SIZE should be increased.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  135 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 135 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-02-03 14:46:53.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-02-03 14:46:53.000000000 +0200
@@ -117,6 +117,7 @@ static int ple_window = KVM_VMX_DEFAULT_
 module_param(ple_window, int, S_IRUGO);
 
 #define NR_AUTOLOAD_MSRS 1
+#define VMCS02_POOL_SIZE 1
 
 struct vmcs {
 	u32 revision_id;
@@ -159,6 +160,31 @@ struct __packed vmcs12 {
 #define VMCS12_REVISION 0x11e57ed0
 
 /*
+ * When we temporarily switch a vcpu's VMCS (e.g., stop using an L1's VMCS
+ * while we use L2's VMCS), and wish to save the previous VMCS, we must also
+ * remember on which CPU it was last loaded (vcpu->cpu), so when we return to
+ * using this VMCS we'll know if we're now running on a different CPU and need
+ * to clear the VMCS on the old CPU, and load it on the new one. Additionally,
+ * we need to remember whether this VMCS was launched (vmx->launched), so when
+ * we return to it we know whether to VMLAUNCH or to VMRESUME it (we cannot deduce
+ * this from other state, because it's possible that this VMCS had once been
+ * launched, but has since been cleared after a CPU switch, and now
+ * vmx->launched is 0).
+ */
+struct saved_vmcs {
+	struct vmcs *vmcs;
+	int cpu;
+	int launched;
+};
+
+/* Used to remember the last vmcs02 used for some recently used vmcs12s */
+struct vmcs02_list {
+	struct list_head list;
+	gpa_t vmcs12_addr;
+	struct saved_vmcs vmcs02;
+};
+
+/*
  * The nested_vmx structure is part of vcpu_vmx, and holds information we need
  * for correct emulation of VMX (i.e., nested VMX) on this vcpu. For example,
  * the current VMCS set by L1, a list of the VMCSs used to run the active
@@ -173,6 +199,10 @@ struct nested_vmx {
 	/* The host-usable pointer to the above */
 	struct page *current_vmcs12_page;
 	struct vmcs12 *current_vmcs12;
+
+	/* vmcs02_list cache of VMCSs recently used to run L2 guests */
+	struct list_head vmcs02_pool;
+	int vmcs02_num;
 };
 
 struct vcpu_vmx {
@@ -3965,6 +3995,106 @@ static int handle_invalid_op(struct kvm_
 }
 
 /*
+ * To run an L2 guest, we need a vmcs02 based on the L1-specified vmcs12.
+ * We could reuse a single VMCS for all the L2 guests, but we also want the
+ * option to allocate a separate vmcs02 for each separate loaded vmcs12 - this
+ * allows keeping them loaded on the processor, and in the future will allow
+ * optimizations where prepare_vmcs02 doesn't need to set all the fields on
+ * every entry if they never change.
+ * So we keep, in vmx->nested.vmcs02_pool, a cache of size VMCS02_POOL_SIZE
+ * (>=0) with a vmcs02 for each recently loaded vmcs12, most recent first.
+ *
+ * The following functions allocate and free a vmcs02 in this pool.
+ */
+
+static void __nested_free_saved_vmcs(void *arg)
+{
+	struct saved_vmcs *saved_vmcs = arg;
+
+	vmcs_clear(saved_vmcs->vmcs);
+	if (per_cpu(current_vmcs, saved_vmcs->cpu) == saved_vmcs->vmcs)
+		per_cpu(current_vmcs, saved_vmcs->cpu) = NULL;
+}
+
+/*
+ * Free a VMCS, but before that VMCLEAR it on the CPU where it was last loaded
+ * (the necessary information is in the saved_vmcs structure).
+ * See also vcpu_clear() (with different parameters and side-effects)
+ */
+static void nested_free_saved_vmcs(struct vcpu_vmx *vmx,
+		struct saved_vmcs *saved_vmcs)
+{
+	if (saved_vmcs->cpu != -1)
+		smp_call_function_single(saved_vmcs->cpu,
+				__nested_free_saved_vmcs, saved_vmcs, 1);
+
+	free_vmcs(saved_vmcs->vmcs);
+}
+
+/* Free and remove from pool a vmcs02 saved for a vmcs12 (if there is one) */
+static void nested_free_vmcs02(struct vcpu_vmx *vmx, gpa_t vmptr)
+{
+	struct vmcs02_list *item;
+	list_for_each_entry(item, &vmx->nested.vmcs02_pool, list)
+		if (item->vmcs12_addr == vmptr) {
+			nested_free_saved_vmcs(vmx, &item->vmcs02);
+			list_del(&item->list);
+			kfree(item);
+			vmx->nested.vmcs02_num--;
+			return;
+		}
+}
+
+/* Free all vmcs02 saved for this vcpu */
+static void nested_free_all_vmcs02(struct vcpu_vmx *vmx)
+{
+	struct vmcs02_list *item, *n;
+	list_for_each_entry_safe(item, n, &vmx->nested.vmcs02_pool, list) {
+		nested_free_saved_vmcs(vmx, &item->vmcs02);
+		list_del(&item->list);
+		kfree(item);
+	}
+	vmx->nested.vmcs02_num = 0;
+}
+
+/* Get a vmcs02 for the current vmcs12. */
+static struct saved_vmcs *nested_get_current_vmcs02(struct vcpu_vmx *vmx)
+{
+	struct vmcs02_list *item;
+	list_for_each_entry(item, &vmx->nested.vmcs02_pool, list)
+		if (item->vmcs12_addr == vmx->nested.current_vmptr) {
+			list_move(&item->list, &vmx->nested.vmcs02_pool);
+			return &item->vmcs02;
+		}
+
+	if (vmx->nested.vmcs02_num >= max(VMCS02_POOL_SIZE, 1)) {
+		/* Recycle the least recently used VMCS. */
+		item = list_entry(vmx->nested.vmcs02_pool.prev,
+			struct vmcs02_list, list);
+		item->vmcs12_addr = vmx->nested.current_vmptr;
+		list_move(&item->list, &vmx->nested.vmcs02_pool);
+		return &item->vmcs02;
+	}
+
+	/* Create a new vmcs02 */
+	item = (struct vmcs02_list *)
+		kmalloc(sizeof(struct vmcs02_list), GFP_KERNEL);
+	if (!item)
+		return NULL;
+	item->vmcs02.vmcs = alloc_vmcs();
+	if (!item->vmcs02.vmcs) {
+		kfree(item);
+		return NULL;
+	}
+	item->vmcs12_addr = vmx->nested.current_vmptr;
+	item->vmcs02.cpu = -1;
+	item->vmcs02.launched = 0;
+	list_add(&(item->list), &(vmx->nested.vmcs02_pool));
+	vmx->nested.vmcs02_num++;
+	return &item->vmcs02;
+}
+
+/*
  * Emulate the VMXON instruction.
  * Currently, we just remember that VMX is active, and do not save or even
  * inspect the argument to VMXON (the so-called "VMXON pointer") because we
@@ -4000,6 +4130,9 @@ static int handle_vmon(struct kvm_vcpu *
 		return 1;
 	}
 
+	INIT_LIST_HEAD(&(vmx->nested.vmcs02_pool));
+	vmx->nested.vmcs02_num = 0;
+
 	vmx->nested.vmxon = true;
 
 	skip_emulated_instruction(vcpu);
@@ -4050,6 +4183,8 @@ static void free_nested(struct vcpu_vmx 
 		nested_release_page(vmx->nested.current_vmcs12_page);
 		vmx->nested.current_vmptr = -1ull;
 	}
+
+	nested_free_all_vmcs02(vmx);
 }
 
 /* Emulate the VMXOFF instruction */

-- 
Nadav Har'El                        |    Thursday, Feb  3 2011, 29 Shevat 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Boat: A hole in the water surrounded by
http://nadav.harel.org.il           |wood into which one pours money.


* Re: [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12
  2011-02-03 12:57     ` Nadav Har'El
@ 2011-02-06  9:16       ` Avi Kivity
  2011-02-13 13:04         ` Nadav Har'El
  0 siblings, 1 reply; 47+ messages in thread
From: Avi Kivity @ 2011-02-06  9:16 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

On 02/03/2011 02:57 PM, Nadav Har'El wrote:
> On Sun, Jan 30, 2011, Avi Kivity wrote about "Re: [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12":
> >..
> >  >+static int nested_create_current_vmcs(struct kvm_vcpu *vcpu)
> >  >+{
> >...
> >  >+	if (vmx->nested.vmcs02_num>= NESTED_MAX_VMCS)
> >  >+		return -ENOMEM;
> >
> >  I asked to replace this by dropping the entire vmcs02_list (or perhaps
> >  just its tail).
>
> Hi, here is a completely rewritten patch.
>
> Now we make no guarantee to keep one vmcs02 for each vmcs12. Rather, we
> have a limited pool of vmcs02s. If we can find the same vmcs02 that we
> previously used for the current vmcs12, we reuse it. Otherwise, we can take
> any of them (we take the least recently used) and just use that. Of course,
> if the pool is not yet full, we can also allocate a new vmcs02.
>
> The current default size for the pool is 1, meaning that we just keep one
> vmcs02 (per vcpu) for use in any L2, potentially many of them. Because in
> this version prepare_vmcs02 each time sets all vmcs02 fields, and doesn't
> try to avoid setting rarely modified fields, there's nothing to gain by
> trying to start from the previous vmcs02 used to run a certain L2.
> In the future, when we do have an optimized prepare_vmcs02 which doesn't
> set every field each time, it will make sense to increase the pool size.
>
>
>
> Subject: [PATCH 07/29] nVMX: Introduce vmcs02: VMCS used to run L2
>
> We saw in a previous patch that L1 controls its L2 guest with a vmcs12.
> L0 needs to create a real VMCS for running L2. We call that "vmcs02".
> A later patch will contain the code, prepare_vmcs02(), for filling the vmcs02
> fields. This patch only contains code for allocating vmcs02.
>
> In this version, prepare_vmcs02() sets *all* of vmcs02's fields each time we
> enter from L1 to L2, so keeping just one vmcs02 for the vcpu would have
> sufficed: It could be reused even when L1 runs multiple L2 guests.
> However, in future versions we'll probably want to add an optimization where
> vmcs02 fields that rarely change will not be set each time. For that reason
> it is beneficial to keep around several vmcs02s of L2 guests that have
> recently run, so that potentially we could run these L2s again more quickly
> because less vmwrites to vmcs02 will be needed.
>
> This patch adds to each vcpu a vmcs02 pool, vmx->nested.vmcs02_pool,
> which remembers the vmcs02s last used to run up to VMCS02_POOL_SIZE L2s.
> Because in the current version prepare_vmcs02() sets all vmcs02 fields no
> matter what we start with, we choose VMCS02_POOL_SIZE=1. I.e., one vmcs02
> is allocated (and loaded onto the processor), and it is reused to enter any
> L2 guest. In the future, when prepare_vmcs02() is optimized not to set all
> fields every time, VMCS02_POOL_SIZE should be increased.
>

Thanks, that looks much nicer.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12
  2011-02-06  9:16       ` Avi Kivity
@ 2011-02-13 13:04         ` Nadav Har'El
  2011-02-13 14:58           ` Avi Kivity
  0 siblings, 1 reply; 47+ messages in thread
From: Nadav Har'El @ 2011-02-13 13:04 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, gleb

Hi,

On Sun, Feb 06, 2011, Avi Kivity wrote about "Re: [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12":
> Thanks, that looks much nicer.

Apropos of looking nicer, there's something I'd like to ask your opinion about.

So far, the nested-VMX patches put all the new code in vmx.c.
On one hand, this was the right place (because it's VMX-specific code),
but on the other hand, it adds 2,500 lines to vmx.c, increasing its length
by around 50% and possibly making it harder for future developers to
understand the core non-nested code, and generally makes the KVM look more
complicated.

There's another possibility: I could put all the new nested-specific functions
in a new source file, nested-vmx.c, put some declarations in new files vmx.h
and nested-vmx.h, and leave only a small number of unavoidable modifications in
vmx.c itself.
When I went ahead and did this, unfortunately there were several complications,
e.g., I had to put the "struct nested_vmx" in vmx.h (because it's a field
in struct vmx_vcpu), and I had to un-static some functions in vmx.c and
move other inline functions completely to vmx.h, when I needed them both for
vmx.c and nested-vmx.c. But it didn't take more than a few hours to get over
these hurdles.

After my first version of this code restructuring (which works, but can
definitely use more cleanups), the good news is that the bulk of the new
nested code is in nested-vmx.c and nested-vmx.h (2,000 lines). The not-as-good
news is that around 200 lines had to move from vmx.c to vmx.h (mostly
structure definitions and inline functions), and about 200 lines in vmx.c
had to be modified anyway (some nested-specific, but many are just about
removing "static"). If you're interested, I can send you a patch so you could
see for yourself.

My question is whether you have an opinion on which approach is better.
On the one hand, splitting nested-specific code to a second file can leave
the non-nested code easier to follow.
On the other hand, it also means that some of the things people were used
to see in vmx.c (struct vmcs, struct vmx_vcpu, functions like vmcs_read*,
etc.) will move to vmx.h. Also, the patch I send will be larger because it
has to deal with technicalities like removing "static" or moving pieces of
code from one source file to another.
So which option do you consider less ugly?

P.S. The quote in my signature below is completely random, I promise. It has
nothing to do with my current question about the two code structure options ;-)

-- 
Nadav Har'El                        |       Sunday, Feb 13 2011, 9 Adar I 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |When choosing between two evils, I always
http://nadav.harel.org.il           |pick the one I never tried before.


* Re: [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12
  2011-02-13 13:04         ` Nadav Har'El
@ 2011-02-13 14:58           ` Avi Kivity
  2011-02-13 20:07             ` Nadav Har'El
  0 siblings, 1 reply; 47+ messages in thread
From: Avi Kivity @ 2011-02-13 14:58 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

On 02/13/2011 03:04 PM, Nadav Har'El wrote:
> Hi,
>
> On Sun, Feb 06, 2011, Avi Kivity wrote about "Re: [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12":
> >  Thanks, that looks much nicer.
>
> Apropos of looking nicer, there's something I'd like to ask your opinion about.
>
> So far, the nested-VMX patches put all the new code in vmx.c.
> On one hand, this was the right place (because it's VMX-specific code),
> but on the other hand, it adds 2,500 lines to vmx.c, increasing its length
> by around 50% and possibly making it harder for future developers to
> understand the core non-nested code, and generally makes the KVM look more
> complicated.
>
> There's another possibility: I could put all the new nested-specific functions
> in a new source file, nested-vmx.c, put some declarations in new files vmx.h
> and nested-vmx.h, and leave only a small number of unavoidable modifications in
> vmx.c itself.
> When I went ahead and did this, unfortunately there were several complications,
> e.g., I had to put the "struct nested_vmx" in vmx.h (because it's a field
> in struct vmx_vcpu), and I had to un-static some functions in vmx.c and
> move other inline functions completely to vmx.h, when I needed them both for
> vmx.c and nested-vmx.c. But it didn't take more than a few hours to get over
> these hurdles.
>
> After my first version of this code restructuring (which works, but can
> definitely use more cleanups), the good news is that the bulk of the new
> nested code is in nested-vmx.c and nested-vmx.h (2,000 lines). The not-as-good
> news is that around 200 lines had to move from vmx.c to vmx.h (mostly
> structure definitions and inline functions), and about 200 lines in vmx.c
> had to be modified anyway (some nested-specific, but many are just about
> removing "static"). If you're interested, I can send you a patch so you could
> see for yourself.
>
> My question is whether you have an opinion on which approach is better.
> On the one hand, splitting nested-specific code to a second file can leave
> the non-nested code easier to follow.
> On the other hand, it also means that some of the things people were used
> to see in vmx.c (struct vmcs, struct vmx_vcpu, functions like vmcs_read*,
> etc.) will move to vmx.h. Also, the patch I send will be larger because it
> has to deal with technicalities like removing "static" or moving pieces of
> code from one source file to another.
> So which option do you consider less ugly?

I like having everything in one file.  One reason is that it's always 
been this way, so it's a long standing tradition.  Another is that if we 
put nvmx in its own file, no one will look at it when they update things 
in core vmx.

If it gets too unwieldy we can always break it up later.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12
  2011-02-13 14:58           ` Avi Kivity
@ 2011-02-13 20:07             ` Nadav Har'El
  0 siblings, 0 replies; 47+ messages in thread
From: Nadav Har'El @ 2011-02-13 20:07 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, gleb

On Sun, Feb 13, 2011, Avi Kivity wrote about "Re: [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12":
> I like having everything in one file.  One reason is that it's always 
> been this way, so it's a long standing tradition.  Another is that if we 
> put nvmx in its own file, no one will look at it when they update things 
> in core vmx.

Sure. One file it will remain, then.
Thanks.

Nadav.

-- 
Nadav Har'El                        |      Sunday, Feb 13 2011, 10 Adar I 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Diplomat: one who tells you to go to hell
http://nadav.harel.org.il           |in a way you look forward to the trip.


end of thread

Thread overview: 47+ messages
2011-01-27  8:29 [PATCH 0/29] nVMX: Nested VMX, v8 Nadav Har'El
2011-01-27  8:30 ` [PATCH 01/29] nVMX: Add "nested" module option to vmx.c Nadav Har'El
2011-01-27  8:30 ` [PATCH 02/29] nVMX: Implement VMXON and VMXOFF Nadav Har'El
2011-01-27  8:31 ` [PATCH 03/29] nVMX: Allow setting the VMXE bit in CR4 Nadav Har'El
2011-01-27  8:31 ` [PATCH 04/29] nVMX: Introduce vmcs12: a VMCS structure for L1 Nadav Har'El
2011-01-27  8:32 ` [PATCH 05/29] nVMX: Implement reading and writing of VMX MSRs Nadav Har'El
2011-01-30  9:52   ` Avi Kivity
2011-01-31  8:57     ` Nadav Har'El
2011-01-31  9:01       ` Avi Kivity
2011-01-27  8:32 ` [PATCH 06/29] nVMX: Decoding memory operands of VMX instructions Nadav Har'El
2011-01-27  8:33 ` [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12 Nadav Har'El
2011-01-30 10:02   ` Avi Kivity
2011-01-31  9:26     ` Nadav Har'El
2011-01-31  9:41       ` Avi Kivity
2011-02-03 12:57     ` Nadav Har'El
2011-02-06  9:16       ` Avi Kivity
2011-02-13 13:04         ` Nadav Har'El
2011-02-13 14:58           ` Avi Kivity
2011-02-13 20:07             ` Nadav Har'El
2011-01-27  8:33 ` [PATCH 08/29] nVMX: Fix local_vcpus_link handling Nadav Har'El
2011-01-30 10:08   ` Avi Kivity
2011-01-27  8:34 ` [PATCH 09/29] nVMX: Add VMCS fields to the vmcs12 Nadav Har'El
2011-01-30 10:10   ` Avi Kivity
2011-01-27  8:34 ` [PATCH 10/29] nVMX: Success/failure of VMX instructions Nadav Har'El
2011-01-27  8:35 ` [PATCH 11/29] nVMX: Implement VMCLEAR Nadav Har'El
2011-01-30 12:07   ` Avi Kivity
2011-01-27  8:35 ` [PATCH 12/29] nVMX: Implement VMPTRLD Nadav Har'El
2011-01-27  8:36 ` [PATCH 13/29] nVMX: Implement VMPTRST Nadav Har'El
2011-01-27  8:37 ` [PATCH 14/29] nVMX: Implement VMREAD and VMWRITE Nadav Har'El
2011-01-27  8:37 ` [PATCH 15/29] nVMX: Prepare vmcs02 from vmcs01 and vmcs12 Nadav Har'El
2011-01-27  8:38 ` [PATCH 16/29] nVMX: Move register-syncing to a function Nadav Har'El
2011-01-27  8:38 ` [PATCH 17/29] nVMX: Implement VMLAUNCH and VMRESUME Nadav Har'El
2011-01-27  8:39 ` [PATCH 18/29] nVMX: No need for handle_vmx_insn function any more Nadav Har'El
2011-01-27  8:39 ` [PATCH 19/29] nVMX: Exiting from L2 to L1 Nadav Har'El
2011-01-27  8:40 ` [PATCH 20/29] nVMX: Deciding if L0 or L1 should handle an L2 exit Nadav Har'El
2011-01-27  8:40 ` [PATCH 21/29] nVMX: Correct handling of interrupt injection Nadav Har'El
2011-01-27  8:41 ` [PATCH 22/29] nVMX: Correct handling of exception injection Nadav Har'El
2011-01-27  8:41 ` [PATCH 23/29] nVMX: Correct handling of idt vectoring info Nadav Har'El
2011-01-27  8:42 ` [PATCH 24/29] nVMX: Handling of CR0 and CR4 modifying instructions Nadav Har'El
2011-01-27  8:42 ` [PATCH 25/29] nVMX: Further fixes for lazy FPU loading Nadav Har'El
2011-01-27  8:43 ` [PATCH 26/29] nVMX: Additional TSC-offset handling Nadav Har'El
2011-01-27  8:43 ` [PATCH 27/29] nVMX: Add VMX to list of supported cpuid features Nadav Har'El
2011-01-27  8:44 ` [PATCH 28/29] nVMX: Miscellenous small corrections Nadav Har'El
2011-01-27  8:44 ` [PATCH 29/29] nVMX: Documentation Nadav Har'El
2011-01-28  8:41 ` [PATCH 0/29] nVMX: Nested VMX, v8 Juerg Haefliger
2011-01-28 17:16   ` Nadav Har'El
2011-01-31 10:07   ` Nadav Har'El
