* [PATCH 0/27] nVMX: Nested VMX, v6
@ 2010-10-17 10:03 Nadav Har'El
  2010-10-17 10:04 ` [PATCH 01/27] nVMX: Add "nested" module option to vmx.c Nadav Har'El
                   ` (26 more replies)
  0 siblings, 27 replies; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 10:03 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Hi,

About three months have passed since my previous (v5) nested VMX patch set,
and it can no longer be applied cleanly to the current KVM trunk.

This version of the patches can be applied to the current trunk, and addresses
dozens of concerns that have been raised by Avi Kivity, Marcelo Tosatti, Gleb
Natapov, and Eddie Dong while reviewing v5.
There are still several outstanding issues (e.g., the IDT handling code that
we've been discussing) that are not addressed in this version, but rest assured
that I have not forgotten them - I simply want a newer version, one that
works with the current KVM, to be available to potential reviewers and testers.

About nested VMX:
-----------------

The following 27 patches implement nested VMX support. This feature enables a
guest to use the VMX APIs in order to run its own nested guests. In other
words, it allows running hypervisors (that use VMX) under KVM.
Multiple guest hypervisors can be run concurrently, and each of those can
in turn host multiple guests.

The theory behind this work, our implementation, and its performance
characteristics were presented this month at OSDI (the USENIX Symposium on
Operating Systems Design and Implementation). Our paper, titled
"The Turtles Project: Design and Implementation of Nested Virtualization",
was awarded the "Jay Lepreau Best Paper" award. The paper is available online at:

	http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf

This patch set does not include all the features described in the paper.
In particular, this patch set is missing nested EPT (shadow page tables are
used instead) and some features required to run VMware Server as a guest.
These missing features will be sent as follow-on patches.

Running nested VMX:
------------------

The current patches have a number of requirements, which will be relaxed in
follow-on patches:

1. This version was only tested with KVM (64-bit) as a guest hypervisor, and
   Linux as a nested guest.

2. SMP is supported in the code, but is unfortunately buggy in this version
   and often leads to hangs. Use the "nosmp" option in the L0 (topmost)
   kernel to avoid this bug (and to reduce your performance ;-)).

3. No modifications are required to user space (qemu). However, qemu does not
   currently list "VMX" as a CPU feature in its emulated CPUs (even when they
   are named after CPUs that do normally have VMX). Therefore, the "-cpu host"
   option should be given to qemu, to tell it to support CPU features which
   exist in the host - and in particular VMX.
   This requirement can be made unnecessary by a trivial patch to qemu (which
   I will submit in the future).

4. The nested VMX feature is currently disabled by default. It must be
   explicitly enabled with the "nested=1" option to the kvm-intel module.

5. Nested EPT and VPID are not properly supported in this version. You must
   give the "ept=0 vpid=0" module options to kvm-intel to turn both features
   off (see the example invocation below).
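
Putting requirements 3-5 together, an L0 setup for testing might look roughly
like the following. The module and qemu invocations here are only illustrative
(the disk image name and qemu binary are whatever you use), not prescriptive:

	# L0: load kvm-intel with nested VMX enabled, EPT and VPID disabled
	modprobe kvm-intel nested=1 ept=0 vpid=0

	# run the L1 guest (the guest hypervisor), passing through the host
	# CPU features - including VMX
	qemu-system-x86_64 -enable-kvm -cpu host -m 2048 -hda l1-kvm-guest.img

Inside L1, KVM is then used as usual to start an L2 guest.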


Patch statistics:
-----------------

 Documentation/kvm/nested-vmx.txt |  237 ++
 arch/x86/include/asm/kvm_host.h  |    2 
 arch/x86/include/asm/vmx.h       |   31 
 arch/x86/kvm/svm.c               |    6 
 arch/x86/kvm/vmx.c               | 2396 ++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c               |   16 
 arch/x86/kvm/x86.h               |    6 
 7 files changed, 2657 insertions(+), 37 deletions(-)

--
Nadav Har'El
IBM Haifa Research Lab


* [PATCH 01/27] nVMX: Add "nested" module option to vmx.c
  2010-10-17 10:03 [PATCH 0/27] nVMX: Nested VMX, v6 Nadav Har'El
@ 2010-10-17 10:04 ` Nadav Har'El
  2010-10-17 10:04 ` [PATCH 02/27] nVMX: Add VMX and SVM to list of supported cpuid features Nadav Har'El
                   ` (25 subsequent siblings)
  26 siblings, 0 replies; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 10:04 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch adds a module option "nested" to vmx.c, which controls whether
the guest can use VMX instructions, i.e., whether we allow nested
virtualization. A similar, but separate, option already exists for the
SVM module.

This option currently defaults to 0, meaning that nested VMX must be
explicitly enabled by giving nested=1. Once nested VMX matures, the default
should probably be flipped to enabled - just as nested SVM is enabled by
default today.
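
Because the parameter is declared with S_IRUGO, whether nesting was enabled
can later be checked from L0 with something like (illustrative):

	cat /sys/module/kvm_intel/parameters/nested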

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |    8 ++++++++
 1 file changed, 8 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-10-17 11:51:59.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:51:59.000000000 +0200
@@ -69,6 +69,14 @@ module_param(emulate_invalid_guest_state
 static int __read_mostly vmm_exclusive = 1;
 module_param(vmm_exclusive, bool, S_IRUGO);
 
+/*
+ * If nested=1, nested virtualization is supported, i.e., the guest may use
+ * VMX and be a hypervisor for its own guests. If nested=0, the guest may not
+ * use VMX instructions.
+ */
+static int nested = 0;
+module_param(nested, int, S_IRUGO);
+
 #define KVM_GUEST_CR0_MASK_UNRESTRICTED_GUEST				\
 	(X86_CR0_WP | X86_CR0_NE | X86_CR0_NW | X86_CR0_CD)
 #define KVM_GUEST_CR0_MASK						\


* [PATCH 02/27] nVMX: Add VMX and SVM to list of supported cpuid features
  2010-10-17 10:03 [PATCH 0/27] nVMX: Nested VMX, v6 Nadav Har'El
  2010-10-17 10:04 ` [PATCH 01/27] nVMX: Add "nested" module option to vmx.c Nadav Har'El
@ 2010-10-17 10:04 ` Nadav Har'El
  2010-10-17 10:05 ` [PATCH 03/27] nVMX: Implement VMXON and VMXOFF Nadav Har'El
                   ` (24 subsequent siblings)
  26 siblings, 0 replies; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 10:04 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

If the "nested" module option is enabled, add the "VMX" CPU feature to the
list of CPU features KVM advertises with the KVM_GET_SUPPORTED_CPUID ioctl.

Qemu uses this ioctl and intersects KVM's list with its own list of desired
CPU features (which depends on the -cpu option given to qemu) to determine
the final set of features presented to the guest.
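
For illustration only (this sketch is not part of the patch, and error
handling is omitted), the userspace side of this interface looks roughly as
follows, here just checking whether KVM advertises the VMX bit
(CPUID.1:ECX bit 5):

	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	int main(void)
	{
		int kvm = open("/dev/kvm", O_RDWR);
		int n = 100;	/* comfortably more entries than KVM reports */
		struct kvm_cpuid2 *cpuid;
		int i;

		cpuid = calloc(1, sizeof(*cpuid) +
				  n * sizeof(struct kvm_cpuid_entry2));
		cpuid->nent = n;
		ioctl(kvm, KVM_GET_SUPPORTED_CPUID, cpuid);

		/* on return, nent holds the real number of entries */
		for (i = 0; i < cpuid->nent; i++)
			if (cpuid->entries[i].function == 1)
				printf("VMX %ssupported by KVM\n",
				       (cpuid->entries[i].ecx & (1 << 5)) ?
				       "" : "not ");
		return 0;
	}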

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |    2 ++
 1 file changed, 2 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:00.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:00.000000000 +0200
@@ -4271,6 +4271,8 @@ static void vmx_cpuid_update(struct kvm_
 
 static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
 {
+	if (func == 1 && nested)
+		entry->ecx |= bit(X86_FEATURE_VMX);
 }
 
 static struct kvm_x86_ops vmx_x86_ops = {


* [PATCH 03/27] nVMX: Implement VMXON and VMXOFF
  2010-10-17 10:03 [PATCH 0/27] nVMX: Nested VMX, v6 Nadav Har'El
  2010-10-17 10:04 ` [PATCH 01/27] nVMX: Add "nested" module option to vmx.c Nadav Har'El
  2010-10-17 10:04 ` [PATCH 02/27] nVMX: Add VMX and SVM to list of supported cpuid features Nadav Har'El
@ 2010-10-17 10:05 ` Nadav Har'El
  2010-10-17 12:24   ` Avi Kivity
  2010-10-17 13:07   ` Avi Kivity
  2010-10-17 10:05 ` [PATCH 04/27] nVMX: Allow setting the VMXE bit in CR4 Nadav Har'El
                   ` (23 subsequent siblings)
  26 siblings, 2 replies; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 10:05 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch allows a guest to use the VMXON and VMXOFF instructions, and
emulates them accordingly. Basically this amounts to checking some
prerequisites, and then remembering whether the guest has enabled or disabled
VMX operation.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  102 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 100 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:00.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:00.000000000 +0200
@@ -127,6 +127,17 @@ struct shared_msr_entry {
 	u64 mask;
 };
 
+/*
+ * The nested_vmx structure is part of vcpu_vmx, and holds information we need
+ * for correct emulation of VMX (i.e., nested VMX) on this vcpu. For example,
+ * the current VMCS set by L1, a list of the VMCSs used to run the active
+ * L2 guests on the hardware, and more.
+ */
+struct nested_vmx {
+	/* Has the level1 guest done vmxon? */
+	bool vmxon;
+};
+
 struct vcpu_vmx {
 	struct kvm_vcpu       vcpu;
 	struct list_head      local_vcpus_link;
@@ -174,6 +185,9 @@ struct vcpu_vmx {
 	u32 exit_reason;
 
 	bool rdtscp_enabled;
+
+	/* Support for a guest hypervisor (nested VMX) */
+	struct nested_vmx nested;
 };
 
 static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
@@ -3364,6 +3378,90 @@ static int handle_vmx_insn(struct kvm_vc
 	return 1;
 }
 
+/*
+ * Emulate the VMXON instruction.
+ * Currently, we just remember that VMX is active, and do not save or even
+ * inspect the argument to VMXON (the so-called "VMXON pointer") because we
+ * do not currently need to store anything in that guest-allocated memory
+ * region. Consequently, VMCLEAR and VMPTRLD also do not verify that their
+ * argument is different from the VMXON pointer (which the spec says they do).
+ */
+static int handle_vmon(struct kvm_vcpu *vcpu)
+{
+	struct kvm_segment cs;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	/* The Intel VMX Instruction Reference lists a bunch of bits that
+	 * are prerequisite to running VMXON, most notably CR4.VMXE must be
+	 * set to 1. Otherwise, we should fail with #UD. We test these now:
+	 */
+	if (!nested ||
+	    !kvm_read_cr4_bits(vcpu, X86_CR4_VMXE) ||
+	    !kvm_read_cr0_bits(vcpu, X86_CR0_PE) ||
+	    (vmx_get_rflags(vcpu) & X86_EFLAGS_VM)) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
+	if (is_long_mode(vcpu) && !cs.l) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	if (vmx_get_cpl(vcpu)) {
+		kvm_inject_gp(vcpu, 0);
+		return 1;
+	}
+
+	vmx->nested.vmxon = true;
+
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
+/*
+ * Intel's VMX Instruction Reference specifies a common set of prerequisites
+ * for running VMX instructions (except VMXON, whose prerequisites are
+ * slightly different). It also specifies what exception to inject otherwise.
+ */
+static int nested_vmx_check_permission(struct kvm_vcpu *vcpu)
+{
+	struct kvm_segment cs;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (!vmx->nested.vmxon) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 0;
+	}
+
+	vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
+	if ((vmx_get_rflags(vcpu) & X86_EFLAGS_VM) ||
+	    (is_long_mode(vcpu) && !cs.l)) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 0;
+	}
+
+	if (vmx_get_cpl(vcpu)) {
+		kvm_inject_gp(vcpu, 0);
+		return 0;
+	}
+
+	return 1;
+}
+
+/* Emulate the VMXOFF instruction */
+static int handle_vmoff(struct kvm_vcpu *vcpu)
+{
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	to_vmx(vcpu)->nested.vmxon = false;
+
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
 static int handle_invlpg(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
@@ -3673,8 +3771,8 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
 	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
 	[EXIT_REASON_VMWRITE]                 = handle_vmx_insn,
-	[EXIT_REASON_VMOFF]                   = handle_vmx_insn,
-	[EXIT_REASON_VMON]                    = handle_vmx_insn,
+	[EXIT_REASON_VMOFF]                   = handle_vmoff,
+	[EXIT_REASON_VMON]                    = handle_vmon,
 	[EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,
 	[EXIT_REASON_APIC_ACCESS]             = handle_apic_access,
 	[EXIT_REASON_WBINVD]                  = handle_wbinvd,


* [PATCH 04/27] nVMX: Allow setting the VMXE bit in CR4
  2010-10-17 10:03 [PATCH 0/27] nVMX: Nested VMX, v6 Nadav Har'El
                   ` (2 preceding siblings ...)
  2010-10-17 10:05 ` [PATCH 03/27] nVMX: Implement VMXON and VMXOFF Nadav Har'El
@ 2010-10-17 10:05 ` Nadav Har'El
  2010-10-17 12:31   ` Avi Kivity
  2010-10-17 10:06 ` [PATCH 05/27] nVMX: Introduce vmcs12: a VMCS structure for L1 Nadav Har'El
                   ` (22 subsequent siblings)
  26 siblings, 1 reply; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 10:05 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch allows the guest to enable the VMXE bit in CR4, which is a
prerequisite to running VMXON.

Whether to allow setting the VMXE bit now depends on the architecture (svm
or vmx), so the check has moved into kvm_x86_ops->set_cr4(). This function
now returns an int: if kvm_x86_ops->set_cr4() returns 1, __kvm_set_cr4()
will also return 1, and this will cause kvm_set_cr4() to throw a #GP.

Turning on the VMXE bit is allowed only when the "nested" module option is on,
and turning it off is forbidden after a vmxon.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/include/asm/kvm_host.h |    2 +-
 arch/x86/kvm/svm.c              |    6 +++++-
 arch/x86/kvm/vmx.c              |   13 +++++++++++--
 arch/x86/kvm/x86.c              |    4 +---
 4 files changed, 18 insertions(+), 7 deletions(-)

--- .before/arch/x86/include/asm/kvm_host.h	2010-10-17 11:52:00.000000000 +0200
+++ .after/arch/x86/include/asm/kvm_host.h	2010-10-17 11:52:00.000000000 +0200
@@ -532,7 +532,7 @@ struct kvm_x86_ops {
 	void (*decache_cr4_guest_bits)(struct kvm_vcpu *vcpu);
 	void (*set_cr0)(struct kvm_vcpu *vcpu, unsigned long cr0);
 	void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3);
-	void (*set_cr4)(struct kvm_vcpu *vcpu, unsigned long cr4);
+	int (*set_cr4)(struct kvm_vcpu *vcpu, unsigned long cr4);
 	void (*set_efer)(struct kvm_vcpu *vcpu, u64 efer);
 	void (*get_idt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
 	void (*set_idt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
--- .before/arch/x86/kvm/svm.c	2010-10-17 11:52:00.000000000 +0200
+++ .after/arch/x86/kvm/svm.c	2010-10-17 11:52:00.000000000 +0200
@@ -1271,11 +1271,14 @@ static void svm_set_cr0(struct kvm_vcpu 
 	update_cr0_intercept(svm);
 }
 
-static void svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+static int svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
 	unsigned long host_cr4_mce = read_cr4() & X86_CR4_MCE;
 	unsigned long old_cr4 = to_svm(vcpu)->vmcb->save.cr4;
 
+	if (cr4 & X86_CR4_VMXE)
+		return 1;
+
 	if (npt_enabled && ((old_cr4 ^ cr4) & X86_CR4_PGE))
 		force_new_asid(vcpu);
 
@@ -1284,6 +1287,7 @@ static void svm_set_cr4(struct kvm_vcpu 
 		cr4 |= X86_CR4_PAE;
 	cr4 |= host_cr4_mce;
 	to_svm(vcpu)->vmcb->save.cr4 = cr4;
+	return 0;
 }
 
 static void svm_set_segment(struct kvm_vcpu *vcpu,
--- .before/arch/x86/kvm/x86.c	2010-10-17 11:52:00.000000000 +0200
+++ .after/arch/x86/kvm/x86.c	2010-10-17 11:52:00.000000000 +0200
@@ -603,11 +603,9 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, u
 		   && !load_pdptrs(vcpu, vcpu->arch.walk_mmu, vcpu->arch.cr3))
 		return 1;
 
-	if (cr4 & X86_CR4_VMXE)
+	if (kvm_x86_ops->set_cr4(vcpu, cr4))
 		return 1;
 
-	kvm_x86_ops->set_cr4(vcpu, cr4);
-
 	if ((cr4 ^ old_cr4) & pdptr_bits)
 		kvm_mmu_reset_context(vcpu);
 
--- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:00.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:00.000000000 +0200
@@ -1880,7 +1880,7 @@ static void ept_save_pdptrs(struct kvm_v
 		  (unsigned long *)&vcpu->arch.regs_dirty);
 }
 
-static void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
+static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
 
 static void ept_update_paging_mode_cr0(unsigned long *hw_cr0,
 					unsigned long cr0,
@@ -1975,11 +1975,19 @@ static void vmx_set_cr3(struct kvm_vcpu 
 	vmcs_writel(GUEST_CR3, guest_cr3);
 }
 
-static void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
 	unsigned long hw_cr4 = cr4 | (to_vmx(vcpu)->rmode.vm86_active ?
 		    KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON);
 
+	if (cr4 & X86_CR4_VMXE) {
+		if (!nested)
+			return 1;
+	} else {
+		if (nested && to_vmx(vcpu)->nested.vmxon)
+			return 1;
+	}
+
 	vcpu->arch.cr4 = cr4;
 	if (enable_ept) {
 		if (!is_paging(vcpu)) {
@@ -1992,6 +2000,7 @@ static void vmx_set_cr4(struct kvm_vcpu 
 
 	vmcs_writel(CR4_READ_SHADOW, cr4);
 	vmcs_writel(GUEST_CR4, hw_cr4);
+	return 0;
 }
 
 static u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)


* [PATCH 05/27] nVMX: Introduce vmcs12: a VMCS structure for L1
  2010-10-17 10:03 [PATCH 0/27] nVMX: Nested VMX, v6 Nadav Har'El
                   ` (3 preceding siblings ...)
  2010-10-17 10:05 ` [PATCH 04/27] nVMX: Allow setting the VMXE bit in CR4 Nadav Har'El
@ 2010-10-17 10:06 ` Nadav Har'El
  2010-10-17 12:34   ` Avi Kivity
  2010-10-17 10:06 ` [PATCH 06/27] nVMX: Implement reading and writing of VMX MSRs Nadav Har'El
                   ` (21 subsequent siblings)
  26 siblings, 1 reply; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 10:06 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

An implementation of VMX needs to define a VMCS structure. This structure
is kept in guest memory, but is opaque to the guest (which can only read or
write it with VMX instructions).

This patch starts to define the VMCS structure which our nested VMX
implementation will present to L1. We call it "vmcs12", as it is the VMCS
that L1 keeps for its L2 guests. We will add more content to this structure
in later patches.

This patch also adds the notion (as required by the VMX spec) of L1's "current
VMCS", and finally includes utility functions for mapping the guest-allocated
VMCSs in host memory.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   66 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:00.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:00.000000000 +0200
@@ -128,6 +128,34 @@ struct shared_msr_entry {
 };
 
 /*
+ * struct vmcs12 describes the state that our guest hypervisor (L1) keeps for a
+ * single nested guest (L2), hence the name vmcs12. Any VMX implementation has
+ * a VMCS structure, and vmcs12 is our emulated VMX's VMCS. This structure is
+ * stored in guest memory specified by VMPTRLD, but is opaque to the guest,
+ * which must access it using VMREAD/VMWRITE/VMCLEAR instructions. More
+ * than one of these structures may exist, if L1 runs multiple L2 guests.
+ * nested_vmx_run() will use the data here to build a vmcs02: a VMCS for the
+ * underlying hardware which will be used to run L2.
+ * This structure is packed in order to preserve the binary content after live
+ * migration. If there are changes in the content or layout, VMCS12_REVISION
+ * must be changed.
+ */
+struct __packed vmcs12 {
+	/* According to the Intel spec, a VMCS region must start with the
+	 * following two fields. Then follow implementation-specific data.
+	 */
+	u32 revision_id;
+	u32 abort;
+};
+
+/*
+ * VMCS12_REVISION is an arbitrary id that should be changed if the content or
+ * layout of struct vmcs12 is changed. MSR_IA32_VMX_BASIC returns this id, and
+ * VMPTRLD verifies that the VMCS region that L1 is loading contains this id.
+ */
+#define VMCS12_REVISION 0x11e57ed0
+
+/*
  * The nested_vmx structure is part of vcpu_vmx, and holds information we need
  * for correct emulation of VMX (i.e., nested VMX) on this vcpu. For example,
  * the current VMCS set by L1, a list of the VMCSs used to run the active
@@ -136,6 +164,12 @@ struct shared_msr_entry {
 struct nested_vmx {
 	/* Has the level1 guest done vmxon? */
 	bool vmxon;
+
+	/* The guest-physical address of the current VMCS L1 keeps for L2 */
+	gpa_t current_vmptr;
+	/* The host-usable pointer to the above */
+	struct page *current_vmcs12_page;
+	struct vmcs12 *current_vmcs12;
 };
 
 struct vcpu_vmx {
@@ -195,6 +229,26 @@ static inline struct vcpu_vmx *to_vmx(st
 	return container_of(vcpu, struct vcpu_vmx, vcpu);
 }
 
+static struct page *nested_get_page(struct kvm_vcpu *vcpu, gpa_t addr)
+{
+	struct page *page = gfn_to_page(vcpu->kvm, addr >> PAGE_SHIFT);
+	if (is_error_page(page)) {
+		kvm_release_page_clean(page);
+		return NULL;
+	}
+	return page;
+}
+
+static void nested_release_page(struct page *page)
+{
+	kvm_release_page_dirty(page);
+}
+
+static void nested_release_page_clean(struct page *page)
+{
+	kvm_release_page_clean(page);
+}
+
 static int init_rmode(struct kvm *kvm);
 static u64 construct_eptp(unsigned long root_hpa);
 static void kvm_cpu_vmxon(u64 addr);
@@ -3467,6 +3521,11 @@ static int handle_vmoff(struct kvm_vcpu 
 
 	to_vmx(vcpu)->nested.vmxon = false;
 
+	if (to_vmx(vcpu)->nested.current_vmptr != -1ull) {
+		kunmap(to_vmx(vcpu)->nested.current_vmcs12_page);
+		nested_release_page(to_vmx(vcpu)->nested.current_vmcs12_page);
+	}
+
 	skip_emulated_instruction(vcpu);
 	return 1;
 }
@@ -4170,6 +4229,10 @@ static void vmx_free_vcpu(struct kvm_vcp
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
 	free_vpid(vmx);
+	if (vmx->nested.vmxon && to_vmx(vcpu)->nested.current_vmptr != -1ull) {
+		kunmap(to_vmx(vcpu)->nested.current_vmcs12_page);
+		nested_release_page(to_vmx(vcpu)->nested.current_vmcs12_page);
+	}
 	vmx_free_vmcs(vcpu);
 	kfree(vmx->guest_msrs);
 	kvm_vcpu_uninit(vcpu);
@@ -4236,6 +4299,9 @@ static struct kvm_vcpu *vmx_create_vcpu(
 			goto free_vmcs;
 	}
 
+	vmx->nested.current_vmptr = -1ull;
+	vmx->nested.current_vmcs12 = NULL;
+
 	return &vmx->vcpu;
 
 free_vmcs:


* [PATCH 06/27] nVMX: Implement reading and writing of VMX MSRs
  2010-10-17 10:03 [PATCH 0/27] nVMX: Nested VMX, v6 Nadav Har'El
                   ` (4 preceding siblings ...)
  2010-10-17 10:06 ` [PATCH 05/27] nVMX: Introduce vmcs12: a VMCS structure for L1 Nadav Har'El
@ 2010-10-17 10:06 ` Nadav Har'El
  2010-10-17 12:52   ` Avi Kivity
  2010-10-17 10:07 ` [PATCH 07/27] nVMX: Decoding memory operands of VMX instructions Nadav Har'El
                   ` (20 subsequent siblings)
  26 siblings, 1 reply; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 10:06 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

When the guest can use VMX instructions (when the "nested" module option is
on), it should also be able to read and write VMX MSRs, e.g., to query about
VMX capabilities. This patch adds this support.
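
For context, this is roughly how a guest hypervisor (L1) consumes these MSRs;
the snippet below is ordinary guest-kernel code, not part of this patch:

	u64 basic, pinbased;

	/* bits 30:0: VMCS revision id; bits 44:32: VMCS region size */
	rdmsrl(MSR_IA32_VMX_BASIC, basic);

	/* low 32 bits: controls that must be 1 ("allowed 0-settings");
	 * high 32 bits: controls that may be 1 ("allowed 1-settings") */
	rdmsrl(MSR_IA32_VMX_PINBASED_CTLS, pinbased);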

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  117 +++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c |    6 +-
 2 files changed, 122 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/x86.c	2010-10-17 11:52:00.000000000 +0200
+++ .after/arch/x86/kvm/x86.c	2010-10-17 11:52:00.000000000 +0200
@@ -789,7 +789,11 @@ static u32 msrs_to_save[] = {
 #ifdef CONFIG_X86_64
 	MSR_CSTAR, MSR_KERNEL_GS_BASE, MSR_SYSCALL_MASK, MSR_LSTAR,
 #endif
-	MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA
+	MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
+	MSR_IA32_FEATURE_CONTROL,  MSR_IA32_VMX_BASIC,
+	MSR_IA32_VMX_PINBASED_CTLS, MSR_IA32_VMX_PROCBASED_CTLS,
+	MSR_IA32_VMX_EXIT_CTLS, MSR_IA32_VMX_ENTRY_CTLS,
+	MSR_IA32_VMX_PROCBASED_CTLS2, MSR_IA32_VMX_EPT_VPID_CAP,
 };
 
 static unsigned num_msrs_to_save;
--- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:00.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:00.000000000 +0200
@@ -1216,6 +1216,119 @@ static void vmx_adjust_tsc_offset(struct
 }
 
 /*
+ * If we allow our guest to use VMX instructions (i.e., nested VMX), we should
+ * also let it use VMX-specific MSRs.
+ * vmx_get_vmx_msr() and vmx_set_vmx_msr() return 0 when we handled a
+ * VMX-specific MSR, or 1 when we haven't (and the caller should handled it
+ * like all other MSRs).
+ */
+static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
+{
+	u64 vmx_msr = 0;
+	u32 vmx_msr_high, vmx_msr_low;
+
+	switch (msr_index) {
+	case MSR_IA32_FEATURE_CONTROL:
+		*pdata = 0;
+		break;
+	case MSR_IA32_VMX_BASIC:
+		/*
+		 * This MSR reports some information about VMX support of the
+		 * processor. We should return information about the VMX we
+		 * emulate for the guest, and the VMCS structure we give it -
+		 * not about the VMX support of the underlying hardware.
+		 * However, some capabilities of the underlying hardware are
+		 * used directly by our emulation (e.g., the physical address
+		 * width), so these are copied from what the hardware reports.
+		 */
+		*pdata = VMCS12_REVISION | (((u64)sizeof(struct vmcs12)) << 32);
+		rdmsrl(MSR_IA32_VMX_BASIC, vmx_msr);
+#define VMX_BASIC_64		0x0001000000000000LLU
+#define VMX_BASIC_MEM_TYPE	0x003c000000000000LLU
+#define VMX_BASIC_INOUT		0x0040000000000000LLU
+		*pdata |= vmx_msr &
+			(VMX_BASIC_64 | VMX_BASIC_MEM_TYPE | VMX_BASIC_INOUT);
+		break;
+#define CORE2_PINBASED_CTLS_MUST_BE_ONE	0x00000016
+#define MSR_IA32_VMX_TRUE_PINBASED_CTLS	0x48d
+	case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
+	case MSR_IA32_VMX_PINBASED_CTLS:
+		vmx_msr_low  = CORE2_PINBASED_CTLS_MUST_BE_ONE;
+		vmx_msr_high = CORE2_PINBASED_CTLS_MUST_BE_ONE |
+				PIN_BASED_EXT_INTR_MASK |
+				PIN_BASED_NMI_EXITING |
+				PIN_BASED_VIRTUAL_NMIS;
+		*pdata = vmx_msr_low | ((u64)vmx_msr_high << 32);
+		break;
+	case MSR_IA32_VMX_PROCBASED_CTLS:
+		/* This MSR determines which vm-execution controls the L1
+		 * hypervisor may ask, or may not ask, to enable. Normally we
+		 * can only allow enabling features which the hardware can
+		 * support, but we limit ourselves to allowing only known
+		 * features that have been tested under nesting. We allow disabling any
+		 * feature (even if the hardware can't disable it).
+		 */
+		rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, vmx_msr_low, vmx_msr_high);
+
+		vmx_msr_low = 0; /* allow disabling any feature */
+		vmx_msr_high &= /* do not expose new untested features */
+			CPU_BASED_HLT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
+			CPU_BASED_CR3_STORE_EXITING | CPU_BASED_USE_IO_BITMAPS |
+			CPU_BASED_MOV_DR_EXITING | CPU_BASED_USE_TSC_OFFSETING |
+			CPU_BASED_MWAIT_EXITING | CPU_BASED_MONITOR_EXITING |
+			CPU_BASED_INVLPG_EXITING | CPU_BASED_TPR_SHADOW |
+			CPU_BASED_USE_MSR_BITMAPS |
+#ifdef CONFIG_X86_64
+			CPU_BASED_CR8_LOAD_EXITING |
+			CPU_BASED_CR8_STORE_EXITING |
+#endif
+			CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
+		*pdata = vmx_msr_low | ((u64)vmx_msr_high << 32);
+		break;
+	case MSR_IA32_VMX_EXIT_CTLS:
+		*pdata = 0;
+#ifdef CONFIG_X86_64
+		*pdata |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
+#endif
+		break;
+	case MSR_IA32_VMX_ENTRY_CTLS:
+		*pdata = 0;
+		break;
+	case MSR_IA32_VMX_PROCBASED_CTLS2:
+		*pdata = 0;
+		if (vm_need_virtualize_apic_accesses(vcpu->kvm))
+			*pdata |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+		break;
+	case MSR_IA32_VMX_EPT_VPID_CAP:
+		*pdata = 0;
+		break;
+	default:
+		return 1;
+	}
+
+	return 0;
+}
+
+static int vmx_set_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
+{
+	switch (msr_index) {
+	case MSR_IA32_FEATURE_CONTROL:
+	case MSR_IA32_VMX_BASIC:
+	case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
+	case MSR_IA32_VMX_PINBASED_CTLS:
+	case MSR_IA32_VMX_PROCBASED_CTLS:
+	case MSR_IA32_VMX_EXIT_CTLS:
+	case MSR_IA32_VMX_ENTRY_CTLS:
+	case MSR_IA32_VMX_PROCBASED_CTLS2:
+	case MSR_IA32_VMX_EPT_VPID_CAP:
+		pr_unimpl(vcpu, "unimplemented VMX MSR write: 0x%x data %llx\n",
+			  msr_index, data);
+		return 0;
+	default:
+		return 1;
+	}
+}
+/*
  * Reads an msr value (of 'msr_index') into 'pdata'.
  * Returns 0 on success, non-0 otherwise.
  * Assumes vcpu_load() was already called.
@@ -1263,6 +1376,8 @@ static int vmx_get_msr(struct kvm_vcpu *
 		/* Otherwise falls through */
 	default:
 		vmx_load_host_state(to_vmx(vcpu));
+		if (nested && !vmx_get_vmx_msr(vcpu, msr_index, &data))
+			break;
 		msr = find_msr_entry(to_vmx(vcpu), msr_index);
 		if (msr) {
 			vmx_load_host_state(to_vmx(vcpu));
@@ -1332,6 +1447,8 @@ static int vmx_set_msr(struct kvm_vcpu *
 			return 1;
 		/* Otherwise falls through */
 	default:
+		if (nested && !vmx_set_vmx_msr(vcpu, msr_index, data))
+			break;
 		msr = find_msr_entry(vmx, msr_index);
 		if (msr) {
 			vmx_load_host_state(vmx);


* [PATCH 07/27] nVMX: Decoding memory operands of VMX instructions
  2010-10-17 10:03 [PATCH 0/27] nVMX: Nested VMX, v6 Nadav Har'El
                   ` (5 preceding siblings ...)
  2010-10-17 10:06 ` [PATCH 06/27] nVMX: Implement reading and writing of VMX MSRs Nadav Har'El
@ 2010-10-17 10:07 ` Nadav Har'El
  2010-10-17 10:07 ` [PATCH 08/27] nVMX: Hold a vmcs02 for each vmcs12 Nadav Har'El
                   ` (19 subsequent siblings)
  26 siblings, 0 replies; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 10:07 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch includes a utility function for decoding pointer operands of VMX
instructions issued by L1 (a guest hypervisor).
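
As a worked example of the decoding below (the encoding is made up for
illustration, not taken from a real trace): for a VMX instruction whose memory
operand is ds:[rax + rbx*4 + 0x10] in 64-bit mode, the exit would report
roughly:

	vmx_instruction_info = 0x000d8102
		bits  1:0  = 2  ->  scale by 4
		bits  9:7  = 2  ->  64-bit address size
		bit  10    = 0  ->  memory operand (not a register)
		bits 17:15 = 3  ->  segment register DS
		bits 21:18 = 3  ->  index register RBX (bit 22 clear: index valid)
		bits 26:23 = 0  ->  base register RAX  (bit 27 clear: base valid)
	exit_qualification = 0x10 (the displacement)

and get_vmx_mem_address() computes

	gva = DS.base + RAX + (RBX << 2) + 0x10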

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   59 +++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c |    3 +-
 arch/x86/kvm/x86.h |    3 ++
 3 files changed, 64 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/x86.c	2010-10-17 11:52:00.000000000 +0200
+++ .after/arch/x86/kvm/x86.c	2010-10-17 11:52:00.000000000 +0200
@@ -3636,13 +3636,14 @@ static int kvm_fetch_guest_virt(gva_t ad
 					  access | PFERR_FETCH_MASK, error);
 }
 
-static int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
+int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
 			       struct kvm_vcpu *vcpu, u32 *error)
 {
 	u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
 	return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, access,
 					  error);
 }
+EXPORT_SYMBOL_GPL(kvm_read_guest_virt);
 
 static int kvm_read_guest_virt_system(gva_t addr, void *val, unsigned int bytes,
 			       struct kvm_vcpu *vcpu, u32 *error)
--- .before/arch/x86/kvm/x86.h	2010-10-17 11:52:00.000000000 +0200
+++ .after/arch/x86/kvm/x86.h	2010-10-17 11:52:00.000000000 +0200
@@ -74,6 +74,9 @@ void kvm_before_handle_nmi(struct kvm_vc
 void kvm_after_handle_nmi(struct kvm_vcpu *vcpu);
 int kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq);
 
+int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
+		struct kvm_vcpu *vcpu, u32 *error);
+
 void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data);
 
 #endif
--- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:00.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:00.000000000 +0200
@@ -3647,6 +3647,65 @@ static int handle_vmoff(struct kvm_vcpu 
 	return 1;
 }
 
+/*
+ * Decode the memory-address operand of a vmx instruction, as recorded on an
+ * exit caused by such an instruction (run by a guest hypervisor).
+ * On success, returns 0. When the operand is invalid, returns 1 and throws
+ * #UD or #GP.
+ */
+static int get_vmx_mem_address(struct kvm_vcpu *vcpu,
+				 unsigned long exit_qualification,
+				 u32 vmx_instruction_info, gva_t *ret)
+{
+	/*
+	 * According to Vol. 3B, "Information for VM Exits Due to Instruction
+	 * Execution", on an exit, vmx_instruction_info holds most of the
+	 * addressing components of the operand. Only the displacement part
+	 * is put in exit_qualification (see 3B, "Basic VM-Exit Information").
+	 * For how an actual address is calculated from all these components,
+	 * refer to Vol. 1, "Operand Addressing".
+	 */
+	int  scaling = vmx_instruction_info & 3;
+	int  addr_size = (vmx_instruction_info >> 7) & 7;
+	bool is_reg = vmx_instruction_info & (1u << 10);
+	int  seg_reg = (vmx_instruction_info >> 15) & 7;
+	int  index_reg = (vmx_instruction_info >> 18) & 0xf;
+	bool index_is_valid = !(vmx_instruction_info & (1u << 22));
+	int  base_reg       = (vmx_instruction_info >> 23) & 0xf;
+	bool base_is_valid  = !(vmx_instruction_info & (1u << 27));
+
+	if (is_reg) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	switch (addr_size) {
+	case 1: /* 32 bit. high bits are undefined according to the spec: */
+		exit_qualification &= 0xffffffff;
+		break;
+	case 2: /* 64 bit */
+		break;
+	default: /* 16 bit */
+		return 1;
+	}
+
+	/* Addr = segment_base + offset */
+	/* offset = base + [index * scale] + displacement */
+	*ret = vmx_get_segment_base(vcpu, seg_reg);
+	if (base_is_valid)
+		*ret += kvm_register_read(vcpu, base_reg);
+	if (index_is_valid)
+		*ret += kvm_register_read(vcpu, index_reg) << scaling;
+	*ret += exit_qualification; /* holds the displacement */
+	/*
+	 * TODO: throw #GP (and return 1) in various cases that the VM*
+	 * instructions require it - e.g., offset beyond segment limit,
+	 * unusable or unreadable/unwritable segment, non-canonical 64-bit
+	 * address, and so on. Currently these are not checked.
+	 */
+	return 0;
+}
+
 static int handle_invlpg(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);


* [PATCH 08/27] nVMX: Hold a vmcs02 for each vmcs12
  2010-10-17 10:03 [PATCH 0/27] nVMX: Nested VMX, v6 Nadav Har'El
                   ` (6 preceding siblings ...)
  2010-10-17 10:07 ` [PATCH 07/27] nVMX: Decoding memory operands of VMX instructions Nadav Har'El
@ 2010-10-17 10:07 ` Nadav Har'El
  2010-10-17 13:00   ` Avi Kivity
  2010-10-17 10:08 ` [PATCH 09/27] nVMX: Success/failure of VMX instructions Nadav Har'El
                   ` (18 subsequent siblings)
  26 siblings, 1 reply; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 10:07 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

In this patch we add a list of L0 (hardware) VMCSs, which we'll use to hold a 
hardware VMCS for each active vmcs12 (i.e., for each L2 guest).

We call each of these L0 VMCSs a "vmcs02", as it is the VMCS that L0 uses
to run its nested guest L2.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   96 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 96 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:00.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:00.000000000 +0200
@@ -155,6 +155,12 @@ struct __packed vmcs12 {
  */
 #define VMCS12_REVISION 0x11e57ed0
 
+struct vmcs_list {
+	struct list_head list;
+	gpa_t vmcs12_addr;
+	struct vmcs *vmcs02;
+};
+
 /*
  * The nested_vmx structure is part of vcpu_vmx, and holds information we need
  * for correct emulation of VMX (i.e., nested VMX) on this vcpu. For example,
@@ -170,6 +176,10 @@ struct nested_vmx {
 	/* The host-usable pointer to the above */
 	struct page *current_vmcs12_page;
 	struct vmcs12 *current_vmcs12;
+
+	/* list of real (hardware) VMCS, one for each L2 guest of L1 */
+	struct list_head vmcs02_list; /* a vmcs_list */
+	int vmcs02_num;
 };
 
 struct vcpu_vmx {
@@ -1738,6 +1748,85 @@ static void free_vmcs(struct vmcs *vmcs)
 	free_pages((unsigned long)vmcs, vmcs_config.order);
 }
 
+static struct vmcs *nested_get_current_vmcs(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vmcs_list *list_item, *n;
+
+	list_for_each_entry_safe(list_item, n, &vmx->nested.vmcs02_list, list)
+		if (list_item->vmcs12_addr == vmx->nested.current_vmptr)
+			return list_item->vmcs02;
+
+	return NULL;
+}
+
+/*
+ * Allocate an L0 VMCS (vmcs02) for the current L1 VMCS (vmcs12), if one
+ * does not already exist. The allocation is done in L0 memory, so to avoid
+ * denial-of-service attacks by guests, we limit the number of concurrently
+ * allocated VMCSs. A well-behaved L1 will VMCLEAR unused vmcs12s and not
+ * trigger this limit.
+ */
+static const int NESTED_MAX_VMCS = 256;
+static int nested_create_current_vmcs(struct kvm_vcpu *vcpu)
+{
+	struct vmcs_list *new_l2_guest;
+	struct vmcs *vmcs02;
+
+	if (nested_get_current_vmcs(vcpu))
+		return 0; /* nothing to do - we already have a VMCS */
+
+	if (to_vmx(vcpu)->nested.vmcs02_num >= NESTED_MAX_VMCS)
+		return -ENOMEM;
+
+	new_l2_guest = (struct vmcs_list *)
+		kmalloc(sizeof(struct vmcs_list), GFP_KERNEL);
+	if (!new_l2_guest)
+		return -ENOMEM;
+
+	vmcs02 = alloc_vmcs();
+	if (!vmcs02) {
+		kfree(new_l2_guest);
+		return -ENOMEM;
+	}
+
+	new_l2_guest->vmcs12_addr = to_vmx(vcpu)->nested.current_vmptr;
+	new_l2_guest->vmcs02 = vmcs02;
+	list_add(&(new_l2_guest->list), &(to_vmx(vcpu)->nested.vmcs02_list));
+	to_vmx(vcpu)->nested.vmcs02_num++;
+	return 0;
+}
+
+/* Free a vmcs12's associated vmcs02, and remove it from vmcs02_list */
+static void nested_free_vmcs(struct kvm_vcpu *vcpu, gpa_t vmptr)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vmcs_list *list_item, *n;
+
+	list_for_each_entry_safe(list_item, n, &vmx->nested.vmcs02_list, list)
+		if (list_item->vmcs12_addr == vmptr) {
+			free_vmcs(list_item->vmcs02);
+			list_del(&(list_item->list));
+			kfree(list_item);
+			vmx->nested.vmcs02_num--;
+			return;
+		}
+}
+
+static void free_l1_state(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vmcs_list *list_item, *n;
+
+	list_for_each_entry_safe(list_item, n,
+			&vmx->nested.vmcs02_list, list) {
+		free_vmcs(list_item->vmcs02);
+		list_del(&(list_item->list));
+		kfree(list_item);
+	}
+	vmx->nested.vmcs02_num = 0;
+}
+
 static void free_kvm_area(void)
 {
 	int cpu;
@@ -3594,6 +3683,9 @@ static int handle_vmon(struct kvm_vcpu *
 		return 1;
 	}
 
+	INIT_LIST_HEAD(&(vmx->nested.vmcs02_list));
+	vmx->nested.vmcs02_num = 0;
+
 	vmx->nested.vmxon = true;
 
 	skip_emulated_instruction(vcpu);
@@ -3643,6 +3735,8 @@ static int handle_vmoff(struct kvm_vcpu 
 		nested_release_page(to_vmx(vcpu)->nested.current_vmcs12_page);
 	}
 
+	free_l1_state(vcpu);
+
 	skip_emulated_instruction(vcpu);
 	return 1;
 }
@@ -4409,6 +4503,8 @@ static void vmx_free_vcpu(struct kvm_vcp
 		kunmap(to_vmx(vcpu)->nested.current_vmcs12_page);
 		nested_release_page(to_vmx(vcpu)->nested.current_vmcs12_page);
 	}
+	if (vmx->nested.vmxon)
+		free_l1_state(vcpu);
 	vmx_free_vmcs(vcpu);
 	kfree(vmx->guest_msrs);
 	kvm_vcpu_uninit(vcpu);


* [PATCH 09/27] nVMX: Success/failure of VMX instructions.
  2010-10-17 10:03 [PATCH 0/27] nVMX: Nested VMX, v6 Nadav Har'El
                   ` (7 preceding siblings ...)
  2010-10-17 10:07 ` [PATCH 08/27] nVMX: Hold a vmcs02 for each vmcs12 Nadav Har'El
@ 2010-10-17 10:08 ` Nadav Har'El
  2010-10-17 10:08 ` [PATCH 10/27] nVMX: Implement VMCLEAR Nadav Har'El
                   ` (17 subsequent siblings)
  26 siblings, 0 replies; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 10:08 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

VMX instructions signal success or failure by setting certain RFLAGS bits:
all of the relevant flags are cleared on success, CF is set for VMfailInvalid,
and ZF is set for VMfailValid (with an error number recorded in the
VM-instruction error field). This patch contains common functions to do this,
and they will be used in the following patches which emulate the various VMX
instructions.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/include/asm/vmx.h |   31 +++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx.c         |   30 ++++++++++++++++++++++++++++++
 2 files changed, 61 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:01.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:01.000000000 +0200
@@ -3800,6 +3800,36 @@ static int get_vmx_mem_address(struct kv
 	return 0;
 }
 
+/*
+ * The following 3 functions, nested_vmx_succeed()/failValid()/failInvalid(),
+ * set the success or error code of an emulated VMX instruction, as specified
+ * by Vol 2B, VMX Instruction Reference, "Conventions".
+ */
+static void nested_vmx_succeed(struct kvm_vcpu *vcpu)
+{
+	vmx_set_rflags(vcpu, vmx_get_rflags(vcpu)
+			& ~(X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF |
+		    	    X86_EFLAGS_ZF | X86_EFLAGS_SF | X86_EFLAGS_OF));
+}
+
+static void nested_vmx_failInvalid(struct kvm_vcpu *vcpu)
+{
+	vmx_set_rflags(vcpu, (vmx_get_rflags(vcpu)
+			& ~(X86_EFLAGS_PF | X86_EFLAGS_AF | X86_EFLAGS_ZF |
+			    X86_EFLAGS_SF | X86_EFLAGS_OF))
+			| X86_EFLAGS_CF);
+}
+
+static void nested_vmx_failValid(struct kvm_vcpu *vcpu,
+					u32 vm_instruction_error)
+{
+	vmx_set_rflags(vcpu, (vmx_get_rflags(vcpu)
+			& ~(X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF |
+			    X86_EFLAGS_SF | X86_EFLAGS_OF))
+			| X86_EFLAGS_ZF);
+	get_vmcs12_fields(vcpu)->vm_instruction_error = vm_instruction_error;
+}
+
 static int handle_invlpg(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
--- .before/arch/x86/include/asm/vmx.h	2010-10-17 11:52:01.000000000 +0200
+++ .after/arch/x86/include/asm/vmx.h	2010-10-17 11:52:01.000000000 +0200
@@ -411,4 +411,35 @@ struct vmx_msr_entry {
 	u64 value;
 } __aligned(16);
 
+/*
+ * VM-instruction error numbers
+ */
+enum vm_instruction_error_number {
+	VMXERR_VMCALL_IN_VMX_ROOT_OPERATION = 1,
+	VMXERR_VMCLEAR_INVALID_ADDRESS = 2,
+	VMXERR_VMCLEAR_VMXON_POINTER = 3,
+	VMXERR_VMLAUNCH_NONCLEAR_VMCS = 4,
+	VMXERR_VMRESUME_NONLAUNCHED_VMCS = 5,
+	VMXERR_VMRESUME_CORRUPTED_VMCS = 6,
+	VMXERR_ENTRY_INVALID_CONTROL_FIELD = 7,
+	VMXERR_ENTRY_INVALID_HOST_STATE_FIELD = 8,
+	VMXERR_VMPTRLD_INVALID_ADDRESS = 9,
+	VMXERR_VMPTRLD_VMXON_POINTER = 10,
+	VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID = 11,
+	VMXERR_UNSUPPORTED_VMCS_COMPONENT = 12,
+	VMXERR_VMWRITE_READ_ONLY_VMCS_COMPONENT = 13,
+	VMXERR_VMXON_IN_VMX_ROOT_OPERATION = 15,
+	VMXERR_ENTRY_INVALID_EXECUTIVE_VMCS_POINTER = 16,
+	VMXERR_ENTRY_NONLAUNCHED_EXECUTIVE_VMCS = 17,
+	VMXERR_ENTRY_EXECUTIVE_VMCS_POINTER_NOT_VMXON_POINTER = 18,
+	VMXERR_VMCALL_NONCLEAR_VMCS = 19,
+	VMXERR_VMCALL_INVALID_VM_EXIT_CONTROL_FIELDS = 20,
+	VMXERR_VMCALL_INCORRECT_MSEG_REVISION_ID = 22,
+	VMXERR_VMXOFF_UNDER_DUAL_MONITOR_TREATMENT_OF_SMIS_AND_SMM = 23,
+	VMXERR_VMCALL_INVALID_SMM_MONITOR_FEATURES = 24,
+	VMXERR_ENTRY_INVALID_VM_EXECUTION_CONTROL_FIELDS_IN_EXECUTIVE_VMCS = 25,
+	VMXERR_ENTRY_EVENTS_BLOCKED_BY_MOV_SS = 26,
+	VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID = 28,
+};
+
 #endif


* [PATCH 10/27] nVMX: Implement VMCLEAR
  2010-10-17 10:03 [PATCH 0/27] nVMX: Nested VMX, v6 Nadav Har'El
                   ` (8 preceding siblings ...)
  2010-10-17 10:08 ` [PATCH 09/27] nVMX: Success/failure of VMX instructions Nadav Har'El
@ 2010-10-17 10:08 ` Nadav Har'El
  2010-10-17 13:05   ` Avi Kivity
  2010-10-17 10:09 ` [PATCH 11/27] nVMX: Implement VMPTRLD Nadav Har'El
                   ` (16 subsequent siblings)
  26 siblings, 1 reply; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 10:08 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch implements the VMCLEAR instruction.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   62 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 61 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:01.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:01.000000000 +0200
@@ -146,6 +146,8 @@ struct __packed vmcs12 {
 	 */
 	u32 revision_id;
 	u32 abort;
+
+	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
 };
 
 /*
@@ -3830,6 +3832,64 @@ static void nested_vmx_failValid(struct 
 	get_vmcs12_fields(vcpu)->vm_instruction_error = vm_instruction_error;
 }
 
+/* Emulate the VMCLEAR instruction */
+static int handle_vmclear(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	gva_t gva;
+	gpa_t vmcs12_addr;
+	struct vmcs12 *vmcs12;
+	struct page *page;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+			vmcs_read32(VMX_INSTRUCTION_INFO), &gva))
+		return 1;
+
+	if (kvm_read_guest_virt(gva, &vmcs12_addr, sizeof(vmcs12_addr),
+				vcpu, NULL)) {
+		kvm_queue_exception(vcpu, PF_VECTOR);
+		return 1;
+	}
+
+	if (!IS_ALIGNED(vmcs12_addr, PAGE_SIZE)) {
+		nested_vmx_failValid(vcpu, VMXERR_VMCLEAR_INVALID_ADDRESS);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	if (vmcs12_addr == vmx->nested.current_vmptr) {
+		kunmap(vmx->nested.current_vmcs12_page);
+		nested_release_page(vmx->nested.current_vmcs12_page);
+		vmx->nested.current_vmptr = -1ull;
+	}
+
+	page = nested_get_page(vcpu, vmcs12_addr);
+	if (page == NULL) {
+		/*
+		 * For accurate processor emulation, VMCLEAR beyond available
+		 * physical memory should do nothing at all. However, it is
+		 * possible that a nested vmx bug, not a guest hypervisor bug,
+		 * resulted in this case, so let's shut down before doing any
+		 * more damage:
+		 */
+		set_bit(KVM_REQ_TRIPLE_FAULT, &vcpu->requests);
+		return 1;
+	}
+	vmcs12 = kmap(page);
+	vmcs12->launch_state = 0;
+	kunmap(page);
+	nested_release_page(page);
+
+	nested_free_vmcs(vcpu, vmcs12_addr);
+
+	skip_emulated_instruction(vcpu);
+	nested_vmx_succeed(vcpu);
+	return 1;
+}
+
 static int handle_invlpg(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
@@ -4132,7 +4192,7 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_HLT]                     = handle_halt,
 	[EXIT_REASON_INVLPG]		      = handle_invlpg,
 	[EXIT_REASON_VMCALL]                  = handle_vmcall,
-	[EXIT_REASON_VMCLEAR]	              = handle_vmx_insn,
+	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
 	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
 	[EXIT_REASON_VMPTRLD]                 = handle_vmx_insn,
 	[EXIT_REASON_VMPTRST]                 = handle_vmx_insn,


* [PATCH 11/27] nVMX: Implement VMPTRLD
  2010-10-17 10:03 [PATCH 0/27] nVMX: Nested VMX, v6 Nadav Har'El
                   ` (9 preceding siblings ...)
  2010-10-17 10:08 ` [PATCH 10/27] nVMX: Implement VMCLEAR Nadav Har'El
@ 2010-10-17 10:09 ` Nadav Har'El
  2010-10-17 10:09 ` [PATCH 12/27] nVMX: Implement VMPTRST Nadav Har'El
                   ` (15 subsequent siblings)
  26 siblings, 0 replies; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 10:09 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch implements the VMPTRLD instruction.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   64 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 63 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:01.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:01.000000000 +0200
@@ -3890,6 +3890,68 @@ static int handle_vmclear(struct kvm_vcp
 	return 1;
 }
 
+/* Emulate the VMPTRLD instruction */
+static int handle_vmptrld(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	gva_t gva;
+	gpa_t vmcs12_addr;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+			vmcs_read32(VMX_INSTRUCTION_INFO), &gva))
+		return 1;
+
+	if (kvm_read_guest_virt(gva, &vmcs12_addr, sizeof(vmcs12_addr),
+				vcpu, NULL)) {
+		kvm_queue_exception(vcpu, PF_VECTOR);
+		return 1;
+	}
+
+	if (!IS_ALIGNED(vmcs12_addr, PAGE_SIZE)) {
+		nested_vmx_failValid(vcpu, VMXERR_VMPTRLD_INVALID_ADDRESS);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	if (vmx->nested.current_vmptr != vmcs12_addr) {
+		struct vmcs12 *new_vmcs12;
+		struct page *page;
+		page = nested_get_page(vcpu, vmcs12_addr);
+		if (page == NULL) {
+			nested_vmx_failInvalid(vcpu);
+			skip_emulated_instruction(vcpu);
+			return 1;
+		}
+		new_vmcs12 = kmap(page);
+		if (new_vmcs12->revision_id != VMCS12_REVISION) {
+			kunmap(page);
+			nested_release_page_clean(page);
+			nested_vmx_failValid(vcpu,
+				VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID);
+			skip_emulated_instruction(vcpu);
+			return 1;
+		}
+		if (vmx->nested.current_vmptr != -1ull) {
+			kunmap(vmx->nested.current_vmcs12_page);
+			nested_release_page(vmx->nested.current_vmcs12_page);
+		}
+
+		vmx->nested.current_vmptr = vmcs12_addr;
+		vmx->nested.current_vmcs12 = new_vmcs12;
+		vmx->nested.current_vmcs12_page = page;
+
+		if (nested_create_current_vmcs(vcpu))
+			return -ENOMEM;
+	}
+
+	nested_vmx_succeed(vcpu);
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
 static int handle_invlpg(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
@@ -4194,7 +4256,7 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_VMCALL]                  = handle_vmcall,
 	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
 	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
-	[EXIT_REASON_VMPTRLD]                 = handle_vmx_insn,
+	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
 	[EXIT_REASON_VMPTRST]                 = handle_vmx_insn,
 	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
 	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,


* [PATCH 12/27] nVMX: Implement VMPTRST
  2010-10-17 10:03 [PATCH 0/27] nVMX: Nested VMX, v6 Nadav Har'El
                   ` (10 preceding siblings ...)
  2010-10-17 10:09 ` [PATCH 11/27] nVMX: Implement VMPTRLD Nadav Har'El
@ 2010-10-17 10:09 ` Nadav Har'El
  2010-10-17 10:10 ` [PATCH 13/27] nVMX: Add VMCS fields to the vmcs12 Nadav Har'El
                   ` (14 subsequent siblings)
  26 siblings, 0 replies; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 10:09 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch implements the VMPTRST instruction. 

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   27 ++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c |    3 ++-
 arch/x86/kvm/x86.h |    3 +++
 3 files changed, 31 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/x86.c	2010-10-17 11:52:01.000000000 +0200
+++ .after/arch/x86/kvm/x86.c	2010-10-17 11:52:01.000000000 +0200
@@ -3651,7 +3651,7 @@ static int kvm_read_guest_virt_system(gv
 	return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, 0, error);
 }
 
-static int kvm_write_guest_virt_system(gva_t addr, void *val,
+int kvm_write_guest_virt_system(gva_t addr, void *val,
 				       unsigned int bytes,
 				       struct kvm_vcpu *vcpu,
 				       u32 *error)
@@ -3684,6 +3684,7 @@ static int kvm_write_guest_virt_system(g
 out:
 	return r;
 }
+EXPORT_SYMBOL_GPL(kvm_write_guest_virt_system);
 
 static int emulator_read_emulated(unsigned long addr,
 				  void *val,
--- .before/arch/x86/kvm/x86.h	2010-10-17 11:52:01.000000000 +0200
+++ .after/arch/x86/kvm/x86.h	2010-10-17 11:52:01.000000000 +0200
@@ -77,6 +77,9 @@ int kvm_inject_realmode_interrupt(struct
 int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
 		struct kvm_vcpu *vcpu, u32 *error);
 
+int kvm_write_guest_virt_system(gva_t addr, void *val, unsigned int bytes,
+		struct kvm_vcpu *vcpu, u32 *error);
+
 void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data);
 
 #endif
--- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:01.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:01.000000000 +0200
@@ -3952,6 +3952,31 @@ static int handle_vmptrld(struct kvm_vcp
 	return 1;
 }
 
+/* Emulate the VMPTRST instruction */
+static int handle_vmptrst(struct kvm_vcpu *vcpu)
+{
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	gva_t vmcs_gva;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (get_vmx_mem_address(vcpu, exit_qualification,
+			vmx_instruction_info, &vmcs_gva))
+		return 1;
+	/* ok to use *_system: nested_vmx_check_permission verified cpl=0 */
+	if (kvm_write_guest_virt_system(vmcs_gva,
+				 (void *)&to_vmx(vcpu)->nested.current_vmptr,
+				 sizeof(u64), vcpu, NULL)) {
+		kvm_queue_exception(vcpu, PF_VECTOR);
+		return 1;
+	}
+	nested_vmx_succeed(vcpu);
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
 static int handle_invlpg(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
@@ -4257,7 +4282,7 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
 	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
 	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
-	[EXIT_REASON_VMPTRST]                 = handle_vmx_insn,
+	[EXIT_REASON_VMPTRST]                 = handle_vmptrst,
 	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
 	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
 	[EXIT_REASON_VMWRITE]                 = handle_vmx_insn,


* [PATCH 13/27] nVMX: Add VMCS fields to the vmcs12
  2010-10-17 10:03 [PATCH 0/27] nVMX: Nested VMX, v6 Nadav Har'El
                   ` (11 preceding siblings ...)
  2010-10-17 10:09 ` [PATCH 12/27] nVMX: Implement VMPTRST Nadav Har'El
@ 2010-10-17 10:10 ` Nadav Har'El
  2010-10-17 13:15   ` Avi Kivity
  2010-10-17 10:10 ` [PATCH 14/27] nVMX: Implement VMREAD and VMWRITE Nadav Har'El
                   ` (13 subsequent siblings)
  26 siblings, 1 reply; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 10:10 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

In this patch we add to vmcs12 (the VMCS that L1 keeps for L2) all the
standard VMCS fields. These fields are encapsulated in a struct vmcs_fields.

Later patches will enable L1 to read and write these fields using VMREAD/
VMWRITE, and they will be used during a VMLAUNCH/VMRESUME in preparing vmcs02,
a hardware VMCS for running L2.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  295 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 295 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:01.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:01.000000000 +0200
@@ -128,6 +128,137 @@ struct shared_msr_entry {
 };
 
 /*
+ * vmcs_fields is a structure used in nested VMX for holding a copy of all
+ * standard VMCS fields. It is used for emulating a VMCS for L1 (see struct
+ * vmcs12), and also for easier access to VMCS data (see vmcs01_fields).
+ */
+struct __packed vmcs_fields {
+	u16 virtual_processor_id;
+	u16 guest_es_selector;
+	u16 guest_cs_selector;
+	u16 guest_ss_selector;
+	u16 guest_ds_selector;
+	u16 guest_fs_selector;
+	u16 guest_gs_selector;
+	u16 guest_ldtr_selector;
+	u16 guest_tr_selector;
+	u16 host_es_selector;
+	u16 host_cs_selector;
+	u16 host_ss_selector;
+	u16 host_ds_selector;
+	u16 host_fs_selector;
+	u16 host_gs_selector;
+	u16 host_tr_selector;
+	u64 io_bitmap_a;
+	u64 io_bitmap_b;
+	u64 msr_bitmap;
+	u64 vm_exit_msr_store_addr;
+	u64 vm_exit_msr_load_addr;
+	u64 vm_entry_msr_load_addr;
+	u64 tsc_offset;
+	u64 virtual_apic_page_addr;
+	u64 apic_access_addr;
+	u64 ept_pointer;
+	u64 guest_physical_address;
+	u64 vmcs_link_pointer;
+	u64 guest_ia32_debugctl;
+	u64 guest_ia32_pat;
+	u64 guest_pdptr0;
+	u64 guest_pdptr1;
+	u64 guest_pdptr2;
+	u64 guest_pdptr3;
+	u64 host_ia32_pat;
+	u32 pin_based_vm_exec_control;
+	u32 cpu_based_vm_exec_control;
+	u32 exception_bitmap;
+	u32 page_fault_error_code_mask;
+	u32 page_fault_error_code_match;
+	u32 cr3_target_count;
+	u32 vm_exit_controls;
+	u32 vm_exit_msr_store_count;
+	u32 vm_exit_msr_load_count;
+	u32 vm_entry_controls;
+	u32 vm_entry_msr_load_count;
+	u32 vm_entry_intr_info_field;
+	u32 vm_entry_exception_error_code;
+	u32 vm_entry_instruction_len;
+	u32 tpr_threshold;
+	u32 secondary_vm_exec_control;
+	u32 vm_instruction_error;
+	u32 vm_exit_reason;
+	u32 vm_exit_intr_info;
+	u32 vm_exit_intr_error_code;
+	u32 idt_vectoring_info_field;
+	u32 idt_vectoring_error_code;
+	u32 vm_exit_instruction_len;
+	u32 vmx_instruction_info;
+	u32 guest_es_limit;
+	u32 guest_cs_limit;
+	u32 guest_ss_limit;
+	u32 guest_ds_limit;
+	u32 guest_fs_limit;
+	u32 guest_gs_limit;
+	u32 guest_ldtr_limit;
+	u32 guest_tr_limit;
+	u32 guest_gdtr_limit;
+	u32 guest_idtr_limit;
+	u32 guest_es_ar_bytes;
+	u32 guest_cs_ar_bytes;
+	u32 guest_ss_ar_bytes;
+	u32 guest_ds_ar_bytes;
+	u32 guest_fs_ar_bytes;
+	u32 guest_gs_ar_bytes;
+	u32 guest_ldtr_ar_bytes;
+	u32 guest_tr_ar_bytes;
+	u32 guest_interruptibility_info;
+	u32 guest_activity_state;
+	u32 guest_sysenter_cs;
+	u32 host_ia32_sysenter_cs;
+	unsigned long cr0_guest_host_mask;
+	unsigned long cr4_guest_host_mask;
+	unsigned long cr0_read_shadow;
+	unsigned long cr4_read_shadow;
+	unsigned long cr3_target_value0;
+	unsigned long cr3_target_value1;
+	unsigned long cr3_target_value2;
+	unsigned long cr3_target_value3;
+	unsigned long exit_qualification;
+	unsigned long guest_linear_address;
+	unsigned long guest_cr0;
+	unsigned long guest_cr3;
+	unsigned long guest_cr4;
+	unsigned long guest_es_base;
+	unsigned long guest_cs_base;
+	unsigned long guest_ss_base;
+	unsigned long guest_ds_base;
+	unsigned long guest_fs_base;
+	unsigned long guest_gs_base;
+	unsigned long guest_ldtr_base;
+	unsigned long guest_tr_base;
+	unsigned long guest_gdtr_base;
+	unsigned long guest_idtr_base;
+	unsigned long guest_dr7;
+	unsigned long guest_rsp;
+	unsigned long guest_rip;
+	unsigned long guest_rflags;
+	unsigned long guest_pending_dbg_exceptions;
+	unsigned long guest_sysenter_esp;
+	unsigned long guest_sysenter_eip;
+	unsigned long host_cr0;
+	unsigned long host_cr3;
+	unsigned long host_cr4;
+	unsigned long host_fs_base;
+	unsigned long host_gs_base;
+	unsigned long host_tr_base;
+	unsigned long host_gdtr_base;
+	unsigned long host_idtr_base;
+	unsigned long host_ia32_sysenter_esp;
+	unsigned long host_ia32_sysenter_eip;
+	unsigned long host_rsp;
+	unsigned long host_rip;
+};
+
+/*
  * struct vmcs12 describes the state that our guest hypervisor (L1) keeps for a
  * single nested guest (L2), hence the name vmcs12. Any VMX implementation has
  * a VMCS structure, and vmcs12 is our emulated VMX's VMCS. This structure is
@@ -147,6 +278,8 @@ struct __packed vmcs12 {
 	u32 revision_id;
 	u32 abort;
 
+	struct vmcs_fields fields;
+
 	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
 };
 
@@ -241,6 +374,168 @@ static inline struct vcpu_vmx *to_vmx(st
 	return container_of(vcpu, struct vcpu_vmx, vcpu);
 }
 
+#define OFFSET(x) offsetof(struct vmcs_fields, x)
+
+static unsigned short vmcs_field_to_offset_table[HOST_RIP+1] = {
+	[VIRTUAL_PROCESSOR_ID] = OFFSET(virtual_processor_id),
+	[GUEST_ES_SELECTOR] = OFFSET(guest_es_selector),
+	[GUEST_CS_SELECTOR] = OFFSET(guest_cs_selector),
+	[GUEST_SS_SELECTOR] = OFFSET(guest_ss_selector),
+	[GUEST_DS_SELECTOR] = OFFSET(guest_ds_selector),
+	[GUEST_FS_SELECTOR] = OFFSET(guest_fs_selector),
+	[GUEST_GS_SELECTOR] = OFFSET(guest_gs_selector),
+	[GUEST_LDTR_SELECTOR] = OFFSET(guest_ldtr_selector),
+	[GUEST_TR_SELECTOR] = OFFSET(guest_tr_selector),
+	[HOST_ES_SELECTOR] = OFFSET(host_es_selector),
+	[HOST_CS_SELECTOR] = OFFSET(host_cs_selector),
+	[HOST_SS_SELECTOR] = OFFSET(host_ss_selector),
+	[HOST_DS_SELECTOR] = OFFSET(host_ds_selector),
+	[HOST_FS_SELECTOR] = OFFSET(host_fs_selector),
+	[HOST_GS_SELECTOR] = OFFSET(host_gs_selector),
+	[HOST_TR_SELECTOR] = OFFSET(host_tr_selector),
+	[IO_BITMAP_A] = OFFSET(io_bitmap_a),
+	[IO_BITMAP_A_HIGH] = OFFSET(io_bitmap_a)+4,
+	[IO_BITMAP_B] = OFFSET(io_bitmap_b),
+	[IO_BITMAP_B_HIGH] = OFFSET(io_bitmap_b)+4,
+	[MSR_BITMAP] = OFFSET(msr_bitmap),
+	[MSR_BITMAP_HIGH] = OFFSET(msr_bitmap)+4,
+	[VM_EXIT_MSR_STORE_ADDR] = OFFSET(vm_exit_msr_store_addr),
+	[VM_EXIT_MSR_STORE_ADDR_HIGH] = OFFSET(vm_exit_msr_store_addr)+4,
+	[VM_EXIT_MSR_LOAD_ADDR] = OFFSET(vm_exit_msr_load_addr),
+	[VM_EXIT_MSR_LOAD_ADDR_HIGH] = OFFSET(vm_exit_msr_load_addr)+4,
+	[VM_ENTRY_MSR_LOAD_ADDR] = OFFSET(vm_entry_msr_load_addr),
+	[VM_ENTRY_MSR_LOAD_ADDR_HIGH] = OFFSET(vm_entry_msr_load_addr)+4,
+	[TSC_OFFSET] = OFFSET(tsc_offset),
+	[TSC_OFFSET_HIGH] = OFFSET(tsc_offset)+4,
+	[VIRTUAL_APIC_PAGE_ADDR] = OFFSET(virtual_apic_page_addr),
+	[VIRTUAL_APIC_PAGE_ADDR_HIGH] = OFFSET(virtual_apic_page_addr)+4,
+	[APIC_ACCESS_ADDR] = OFFSET(apic_access_addr),
+	[APIC_ACCESS_ADDR_HIGH] = OFFSET(apic_access_addr)+4,
+	[EPT_POINTER] = OFFSET(ept_pointer),
+	[EPT_POINTER_HIGH] = OFFSET(ept_pointer)+4,
+	[GUEST_PHYSICAL_ADDRESS] = OFFSET(guest_physical_address),
+	[GUEST_PHYSICAL_ADDRESS_HIGH] = OFFSET(guest_physical_address)+4,
+	[VMCS_LINK_POINTER] = OFFSET(vmcs_link_pointer),
+	[VMCS_LINK_POINTER_HIGH] = OFFSET(vmcs_link_pointer)+4,
+	[GUEST_IA32_DEBUGCTL] = OFFSET(guest_ia32_debugctl),
+	[GUEST_IA32_DEBUGCTL_HIGH] = OFFSET(guest_ia32_debugctl)+4,
+	[GUEST_IA32_PAT] = OFFSET(guest_ia32_pat),
+	[GUEST_IA32_PAT_HIGH] = OFFSET(guest_ia32_pat)+4,
+	[GUEST_PDPTR0] = OFFSET(guest_pdptr0),
+	[GUEST_PDPTR0_HIGH] = OFFSET(guest_pdptr0)+4,
+	[GUEST_PDPTR1] = OFFSET(guest_pdptr1),
+	[GUEST_PDPTR1_HIGH] = OFFSET(guest_pdptr1)+4,
+	[GUEST_PDPTR2] = OFFSET(guest_pdptr2),
+	[GUEST_PDPTR2_HIGH] = OFFSET(guest_pdptr2)+4,
+	[GUEST_PDPTR3] = OFFSET(guest_pdptr3),
+	[GUEST_PDPTR3_HIGH] = OFFSET(guest_pdptr3)+4,
+	[HOST_IA32_PAT] = OFFSET(host_ia32_pat),
+	[HOST_IA32_PAT_HIGH] = OFFSET(host_ia32_pat)+4,
+	[PIN_BASED_VM_EXEC_CONTROL] = OFFSET(pin_based_vm_exec_control),
+	[CPU_BASED_VM_EXEC_CONTROL] = OFFSET(cpu_based_vm_exec_control),
+	[EXCEPTION_BITMAP] = OFFSET(exception_bitmap),
+	[PAGE_FAULT_ERROR_CODE_MASK] = OFFSET(page_fault_error_code_mask),
+	[PAGE_FAULT_ERROR_CODE_MATCH] = OFFSET(page_fault_error_code_match),
+	[CR3_TARGET_COUNT] = OFFSET(cr3_target_count),
+	[VM_EXIT_CONTROLS] = OFFSET(vm_exit_controls),
+	[VM_EXIT_MSR_STORE_COUNT] = OFFSET(vm_exit_msr_store_count),
+	[VM_EXIT_MSR_LOAD_COUNT] = OFFSET(vm_exit_msr_load_count),
+	[VM_ENTRY_CONTROLS] = OFFSET(vm_entry_controls),
+	[VM_ENTRY_MSR_LOAD_COUNT] = OFFSET(vm_entry_msr_load_count),
+	[VM_ENTRY_INTR_INFO_FIELD] = OFFSET(vm_entry_intr_info_field),
+	[VM_ENTRY_EXCEPTION_ERROR_CODE] = OFFSET(vm_entry_exception_error_code),
+	[VM_ENTRY_INSTRUCTION_LEN] = OFFSET(vm_entry_instruction_len),
+	[TPR_THRESHOLD] = OFFSET(tpr_threshold),
+	[SECONDARY_VM_EXEC_CONTROL] = OFFSET(secondary_vm_exec_control),
+	[VM_INSTRUCTION_ERROR] = OFFSET(vm_instruction_error),
+	[VM_EXIT_REASON] = OFFSET(vm_exit_reason),
+	[VM_EXIT_INTR_INFO] = OFFSET(vm_exit_intr_info),
+	[VM_EXIT_INTR_ERROR_CODE] = OFFSET(vm_exit_intr_error_code),
+	[IDT_VECTORING_INFO_FIELD] = OFFSET(idt_vectoring_info_field),
+	[IDT_VECTORING_ERROR_CODE] = OFFSET(idt_vectoring_error_code),
+	[VM_EXIT_INSTRUCTION_LEN] = OFFSET(vm_exit_instruction_len),
+	[VMX_INSTRUCTION_INFO] = OFFSET(vmx_instruction_info),
+	[GUEST_ES_LIMIT] = OFFSET(guest_es_limit),
+	[GUEST_CS_LIMIT] = OFFSET(guest_cs_limit),
+	[GUEST_SS_LIMIT] = OFFSET(guest_ss_limit),
+	[GUEST_DS_LIMIT] = OFFSET(guest_ds_limit),
+	[GUEST_FS_LIMIT] = OFFSET(guest_fs_limit),
+	[GUEST_GS_LIMIT] = OFFSET(guest_gs_limit),
+	[GUEST_LDTR_LIMIT] = OFFSET(guest_ldtr_limit),
+	[GUEST_TR_LIMIT] = OFFSET(guest_tr_limit),
+	[GUEST_GDTR_LIMIT] = OFFSET(guest_gdtr_limit),
+	[GUEST_IDTR_LIMIT] = OFFSET(guest_idtr_limit),
+	[GUEST_ES_AR_BYTES] = OFFSET(guest_es_ar_bytes),
+	[GUEST_CS_AR_BYTES] = OFFSET(guest_cs_ar_bytes),
+	[GUEST_SS_AR_BYTES] = OFFSET(guest_ss_ar_bytes),
+	[GUEST_DS_AR_BYTES] = OFFSET(guest_ds_ar_bytes),
+	[GUEST_FS_AR_BYTES] = OFFSET(guest_fs_ar_bytes),
+	[GUEST_GS_AR_BYTES] = OFFSET(guest_gs_ar_bytes),
+	[GUEST_LDTR_AR_BYTES] = OFFSET(guest_ldtr_ar_bytes),
+	[GUEST_TR_AR_BYTES] = OFFSET(guest_tr_ar_bytes),
+	[GUEST_INTERRUPTIBILITY_INFO] = OFFSET(guest_interruptibility_info),
+	[GUEST_ACTIVITY_STATE] = OFFSET(guest_activity_state),
+	[GUEST_SYSENTER_CS] = OFFSET(guest_sysenter_cs),
+	[HOST_IA32_SYSENTER_CS] = OFFSET(host_ia32_sysenter_cs),
+	[CR0_GUEST_HOST_MASK] = OFFSET(cr0_guest_host_mask),
+	[CR4_GUEST_HOST_MASK] = OFFSET(cr4_guest_host_mask),
+	[CR0_READ_SHADOW] = OFFSET(cr0_read_shadow),
+	[CR4_READ_SHADOW] = OFFSET(cr4_read_shadow),
+	[CR3_TARGET_VALUE0] = OFFSET(cr3_target_value0),
+	[CR3_TARGET_VALUE1] = OFFSET(cr3_target_value1),
+	[CR3_TARGET_VALUE2] = OFFSET(cr3_target_value2),
+	[CR3_TARGET_VALUE3] = OFFSET(cr3_target_value3),
+	[EXIT_QUALIFICATION] = OFFSET(exit_qualification),
+	[GUEST_LINEAR_ADDRESS] = OFFSET(guest_linear_address),
+	[GUEST_CR0] = OFFSET(guest_cr0),
+	[GUEST_CR3] = OFFSET(guest_cr3),
+	[GUEST_CR4] = OFFSET(guest_cr4),
+	[GUEST_ES_BASE] = OFFSET(guest_es_base),
+	[GUEST_CS_BASE] = OFFSET(guest_cs_base),
+	[GUEST_SS_BASE] = OFFSET(guest_ss_base),
+	[GUEST_DS_BASE] = OFFSET(guest_ds_base),
+	[GUEST_FS_BASE] = OFFSET(guest_fs_base),
+	[GUEST_GS_BASE] = OFFSET(guest_gs_base),
+	[GUEST_LDTR_BASE] = OFFSET(guest_ldtr_base),
+	[GUEST_TR_BASE] = OFFSET(guest_tr_base),
+	[GUEST_GDTR_BASE] = OFFSET(guest_gdtr_base),
+	[GUEST_IDTR_BASE] = OFFSET(guest_idtr_base),
+	[GUEST_DR7] = OFFSET(guest_dr7),
+	[GUEST_RSP] = OFFSET(guest_rsp),
+	[GUEST_RIP] = OFFSET(guest_rip),
+	[GUEST_RFLAGS] = OFFSET(guest_rflags),
+	[GUEST_PENDING_DBG_EXCEPTIONS] = OFFSET(guest_pending_dbg_exceptions),
+	[GUEST_SYSENTER_ESP] = OFFSET(guest_sysenter_esp),
+	[GUEST_SYSENTER_EIP] = OFFSET(guest_sysenter_eip),
+	[HOST_CR0] = OFFSET(host_cr0),
+	[HOST_CR3] = OFFSET(host_cr3),
+	[HOST_CR4] = OFFSET(host_cr4),
+	[HOST_FS_BASE] = OFFSET(host_fs_base),
+	[HOST_GS_BASE] = OFFSET(host_gs_base),
+	[HOST_TR_BASE] = OFFSET(host_tr_base),
+	[HOST_GDTR_BASE] = OFFSET(host_gdtr_base),
+	[HOST_IDTR_BASE] = OFFSET(host_idtr_base),
+	[HOST_IA32_SYSENTER_ESP] = OFFSET(host_ia32_sysenter_esp),
+	[HOST_IA32_SYSENTER_EIP] = OFFSET(host_ia32_sysenter_eip),
+	[HOST_RSP] = OFFSET(host_rsp),
+	[HOST_RIP] = OFFSET(host_rip),
+};
+
+static inline short vmcs_field_to_offset(unsigned long field)
+{
+
+	if (field > HOST_RIP || vmcs_field_to_offset_table[field] == 0) {
+		printk(KERN_ERR "invalid vmcs field 0x%lx\n", field);
+		return -1;
+	}
+	return vmcs_field_to_offset_table[field];
+}
+
+static inline struct vmcs_fields *get_vmcs12_fields(struct kvm_vcpu *vcpu)
+{
+	return &(to_vmx(vcpu)->nested.current_vmcs12->fields);
+}
+
 static struct page *nested_get_page(struct kvm_vcpu *vcpu, gpa_t addr)
 {
 	struct page *page = gfn_to_page(vcpu->kvm, addr >> PAGE_SHIFT);


* [PATCH 14/27] nVMX: Implement VMREAD and VMWRITE
  2010-10-17 10:03 [PATCH 0/27] nVMX: Nested VMX, v6 Nadav Har'El
                   ` (12 preceding siblings ...)
  2010-10-17 10:10 ` [PATCH 13/27] nVMX: Add VMCS fields to the vmcs12 Nadav Har'El
@ 2010-10-17 10:10 ` Nadav Har'El
  2010-10-17 13:25   ` Avi Kivity
  2010-10-17 10:11 ` [PATCH 15/27] nVMX: Prepare vmcs02 from vmcs01 and vmcs12 Nadav Har'El
                   ` (12 subsequent siblings)
  26 siblings, 1 reply; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 10:10 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Implement the VMREAD and VMWRITE instructions. With these instructions, L1
can read from and write to the VMCS it is holding. The values are read from,
or written to, the fields of the vmcs_fields structure introduced in the
previous patch.
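
For readers following the bit fiddling in handle_vmread()/handle_vmwrite()
below, here is a stand-alone sketch (not kernel code) of the decoding: bit 10
of the "VMX instruction information" exit field selects a register or memory
operand, bits 6:3 and 31:28 hold GPR numbers, and bits 14:13 of the field
encoding itself give the field's width. The sample instruction-information
value is arbitrary; only GUEST_RIP's encoding (0x681e) is real:

#include <stdint.h>
#include <stdio.h>

enum field_type { TYPE_U16 = 0, TYPE_U64 = 1, TYPE_U32 = 2, TYPE_NATURAL = 3 };

static int field_type(unsigned long field)
{
	if (field & 0x1)		/* the *_HIGH encodings are all 32 bit */
		return TYPE_U32;
	return (field >> 13) & 0x3;	/* width lives in bits 14:13 */
}

int main(void)
{
	uint32_t insn_info = 0x00000478;	/* arbitrary sample value */

	int reg_operand = (insn_info >> 10) & 1;  /* bit 10: register, not memory */
	int operand_gpr = (insn_info >> 3) & 0xf; /* bits 6:3: value register */
	int field_gpr   = (insn_info >> 28) & 0xf;/* bits 31:28: register holding the encoding */

	printf("register operand? %d, value gpr %d, field gpr %d\n",
	       reg_operand, operand_gpr, field_gpr);
	printf("GUEST_RIP (0x681e) is type %d (natural width)\n",
	       field_type(0x681e));
	return 0;
}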

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  171 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 169 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:01.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:01.000000000 +0200
@@ -4185,6 +4185,173 @@ static int handle_vmclear(struct kvm_vcp
 	return 1;
 }
 
+enum vmcs_field_type {
+	VMCS_FIELD_TYPE_U16 = 0,
+	VMCS_FIELD_TYPE_U64 = 1,
+	VMCS_FIELD_TYPE_U32 = 2,
+	VMCS_FIELD_TYPE_ULONG = 3
+};
+
+static inline int vmcs_field_type(unsigned long field)
+{
+	if (0x1 & field)	/* one of the *_HIGH fields, all are 32 bit */
+		return VMCS_FIELD_TYPE_U32;
+	return (field >> 13) & 0x3 ;
+}
+
+static inline int vmcs_field_readonly(unsigned long field)
+{
+	return (((field >> 10) & 0x3) == 1);
+}
+
+static inline bool vmcs12_read_any(struct kvm_vcpu *vcpu,
+					unsigned long field, u64 *ret)
+{
+	short offset = vmcs_field_to_offset(field);
+	char *p;
+
+	if (offset < 0)
+		return 0;
+
+	p = ((char *)(get_vmcs12_fields(vcpu))) + offset;
+
+	switch (vmcs_field_type(field)) {
+	case VMCS_FIELD_TYPE_ULONG:
+		*ret = *((unsigned long *)p);
+		return 1;
+	case VMCS_FIELD_TYPE_U16:
+		*ret = (u16) *((unsigned long *)p);
+		return 1;
+	case VMCS_FIELD_TYPE_U32:
+		*ret = (u32) *((unsigned long *)p);
+		return 1;
+	case VMCS_FIELD_TYPE_U64:
+		*ret = *((u64 *)p);
+		return 1;
+	default:
+		return 0; /* can never happen. */
+	}
+}
+
+static int handle_vmread(struct kvm_vcpu *vcpu)
+{
+	unsigned long field;
+	u64 field_value;
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	gva_t gva = 0;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	/* decode instruction info and find the field to read */
+	field = kvm_register_read(vcpu, (((vmx_instruction_info) >> 28) & 0xf));
+	if (!vmcs12_read_any(vcpu, field, &field_value)) {
+		nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	/*
+	 * Now check whether the request is to put the value in a register
+	 * or in memory. Note that the number of bits actually written is 32
+	 * or 64 depending on the mode, not on the given field's length.
+	 */
+	if (vmx_instruction_info & (1u << 10)) {
+		kvm_register_write(vcpu, (((vmx_instruction_info) >> 3) & 0xf),
+			field_value);
+	} else {
+		if (get_vmx_mem_address(vcpu, exit_qualification,
+				vmx_instruction_info, &gva))
+			return 1;
+		/* *_system ok: nested_vmx_check_permission checked cpl=0 */
+		kvm_write_guest_virt_system(gva, &field_value,
+			     (is_long_mode(vcpu) ? 8 : 4), vcpu, NULL);
+	}
+
+	nested_vmx_succeed(vcpu);
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
+
+static int handle_vmwrite(struct kvm_vcpu *vcpu)
+{
+	unsigned long field;
+	u64 field_value = 0;
+	gva_t gva;
+	int field_type;
+	unsigned long exit_qualification   = vmcs_readl(EXIT_QUALIFICATION);
+	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	char *p;
+	short offset;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (vmx_instruction_info & (1u << 10))
+		field_value = kvm_register_read(vcpu,
+			(((vmx_instruction_info) >> 3) & 0xf));
+	else {
+		if (get_vmx_mem_address(vcpu, exit_qualification,
+				vmx_instruction_info, &gva))
+			return 1;
+		if (kvm_read_guest_virt(gva, &field_value,
+				(is_long_mode(vcpu) ? 8 : 4), vcpu, NULL)) {
+			kvm_queue_exception(vcpu, PF_VECTOR);
+			return 1;
+		}
+	}
+
+
+	field = kvm_register_read(vcpu, (((vmx_instruction_info) >> 28) & 0xf));
+
+	if (vmcs_field_readonly(field)) {
+		nested_vmx_failValid(vcpu,
+			VMXERR_VMWRITE_READ_ONLY_VMCS_COMPONENT);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	field_type = vmcs_field_type(field);
+
+	offset = vmcs_field_to_offset(field);
+	if (offset < 0) {
+		nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+	p = ((char *) get_vmcs12_fields(vcpu)) + offset;
+
+	switch (field_type) {
+	case VMCS_FIELD_TYPE_U16:
+		*(u16 *)p = field_value;
+		break;
+	case VMCS_FIELD_TYPE_U32:
+		*(u32 *)p = field_value;
+		break;
+	case VMCS_FIELD_TYPE_U64:
+#ifdef CONFIG_X86_64
+		*(unsigned long *)p = field_value;
+#else
+		*(unsigned long *)p = field_value;
+		*(((unsigned long *)p)+1) = field_value >> 32;
+#endif
+		break;
+	case VMCS_FIELD_TYPE_ULONG:
+		*(unsigned long *)p = field_value;
+		break;
+	default:
+		nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	nested_vmx_succeed(vcpu);
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
 /* Emulate the VMPTRLD instruction */
 static int handle_vmptrld(struct kvm_vcpu *vcpu)
 {
@@ -4578,9 +4745,9 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
 	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
 	[EXIT_REASON_VMPTRST]                 = handle_vmptrst,
-	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
+	[EXIT_REASON_VMREAD]                  = handle_vmread,
 	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
-	[EXIT_REASON_VMWRITE]                 = handle_vmx_insn,
+	[EXIT_REASON_VMWRITE]                 = handle_vmwrite,
 	[EXIT_REASON_VMOFF]                   = handle_vmoff,
 	[EXIT_REASON_VMON]                    = handle_vmon,
 	[EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,


* [PATCH 15/27] nVMX: Prepare vmcs02 from vmcs01 and vmcs12
  2010-10-17 10:03 [PATCH 0/27] nVMX: Nested VMX, v6 Nadav Har'El
                   ` (13 preceding siblings ...)
  2010-10-17 10:10 ` [PATCH 14/27] nVMX: Implement VMREAD and VMWRITE Nadav Har'El
@ 2010-10-17 10:11 ` Nadav Har'El
  2010-10-17 14:08   ` Avi Kivity
  2010-10-17 10:11 ` [PATCH 16/27] nVMX: Move register-syncing to a function Nadav Har'El
                   ` (11 subsequent siblings)
  26 siblings, 1 reply; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 10:11 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch contains code to prepare the VMCS which can be used to actually
run the L2 guest, vmcs02. prepare_vmcs02 appropriately merges the information
in vmcs12 (the vmcs that L1 built for L2) and in vmcs01 (the vmcs that we
built for L1).

VMREAD/VMWRITE can only access one VMCS at a time (the "current" VMCS), which
makes it difficult for us to read from vmcs01 while writing to vmcs02. This
is why we first make a copy of vmcs01 in memory (vmcs01_fields) and then
read that memory copy while writing to vmcs02.
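
As a rough illustration of the merge (not the actual KVM code, and with only
three made-up fields), the following stand-alone sketch shows the kind of
combination prepare_vmcs02() performs: execution controls are OR'ed so that
both L0's and L1's intercepts stay in force, and the TSC offsets simply add up:

#include <stdint.h>
#include <stdio.h>

struct mini_vmcs {
	uint32_t pin_based_ctls;
	uint32_t exception_bitmap;
	int64_t  tsc_offset;
};

/* Merge L0's settings for L1 (vmcs01) with L1's settings for L2 (vmcs12). */
static void prepare_mini_vmcs02(struct mini_vmcs *vmcs02,
				const struct mini_vmcs *vmcs01,
				const struct mini_vmcs *vmcs12)
{
	/* intercepts either side asked for must stay in force: bitwise-or */
	vmcs02->pin_based_ctls   = vmcs01->pin_based_ctls | vmcs12->pin_based_ctls;
	vmcs02->exception_bitmap = vmcs01->exception_bitmap | vmcs12->exception_bitmap;
	/* the guest-visible TSC is shifted by both offsets */
	vmcs02->tsc_offset       = vmcs01->tsc_offset + vmcs12->tsc_offset;
}

int main(void)
{
	struct mini_vmcs vmcs01 = { 0x16, 1u << 14, -1000 };	/* L0: trap #PF */
	struct mini_vmcs vmcs12 = { 0x01, 1u << 6,   -200 };	/* L1: trap #UD */
	struct mini_vmcs vmcs02;

	prepare_mini_vmcs02(&vmcs02, &vmcs01, &vmcs12);
	printf("pin=0x%x eb=0x%x tsc_off=%lld\n", vmcs02.pin_based_ctls,
	       vmcs02.exception_bitmap, (long long)vmcs02.tsc_offset);
	return 0;
}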

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  408 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 408 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:01.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:01.000000000 +0200
@@ -803,6 +803,28 @@ static inline bool report_flexpriority(v
 	return flexpriority_enabled;
 }
 
+static inline bool nested_cpu_has_vmx_tpr_shadow(struct kvm_vcpu *vcpu)
+{
+	return cpu_has_vmx_tpr_shadow() &&
+		get_vmcs12_fields(vcpu)->cpu_based_vm_exec_control &
+		CPU_BASED_TPR_SHADOW;
+}
+
+static inline bool nested_cpu_has_secondary_exec_ctrls(struct kvm_vcpu *vcpu)
+{
+	return cpu_has_secondary_exec_ctrls() &&
+		get_vmcs12_fields(vcpu)->cpu_based_vm_exec_control &
+		CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
+}
+
+static inline bool nested_vm_need_virtualize_apic_accesses(struct kvm_vcpu
+							   *vcpu)
+{
+	return nested_cpu_has_secondary_exec_ctrls(vcpu) &&
+		(get_vmcs12_fields(vcpu)->secondary_vm_exec_control &
+		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
+}
+
 static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
 {
 	int i;
@@ -1258,6 +1280,37 @@ static void vmx_load_host_state(struct v
 	preempt_enable();
 }
 
+int load_vmcs_host_state(struct vmcs_fields *src)
+{
+	vmcs_write16(HOST_ES_SELECTOR, src->host_es_selector);
+	vmcs_write16(HOST_CS_SELECTOR, src->host_cs_selector);
+	vmcs_write16(HOST_SS_SELECTOR, src->host_ss_selector);
+	vmcs_write16(HOST_DS_SELECTOR, src->host_ds_selector);
+	vmcs_write16(HOST_FS_SELECTOR, src->host_fs_selector);
+	vmcs_write16(HOST_GS_SELECTOR, src->host_gs_selector);
+	vmcs_write16(HOST_TR_SELECTOR, src->host_tr_selector);
+
+	if (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PAT)
+		vmcs_write64(HOST_IA32_PAT, src->host_ia32_pat);
+
+	vmcs_write32(HOST_IA32_SYSENTER_CS, src->host_ia32_sysenter_cs);
+
+	vmcs_writel(HOST_CR0, src->host_cr0);
+	vmcs_writel(HOST_CR3, src->host_cr3);
+	vmcs_writel(HOST_CR4, src->host_cr4);
+	vmcs_writel(HOST_FS_BASE, src->host_fs_base);
+	vmcs_writel(HOST_GS_BASE, src->host_gs_base);
+	vmcs_writel(HOST_TR_BASE, src->host_tr_base);
+	vmcs_writel(HOST_GDTR_BASE, src->host_gdtr_base);
+	vmcs_writel(HOST_IDTR_BASE, src->host_idtr_base);
+	vmcs_writel(HOST_RSP, src->host_rsp);
+	vmcs_writel(HOST_RIP, src->host_rip);
+	vmcs_writel(HOST_IA32_SYSENTER_ESP, src->host_ia32_sysenter_esp);
+	vmcs_writel(HOST_IA32_SYSENTER_EIP, src->host_ia32_sysenter_eip);
+
+	return 0;
+}
+
 /*
  * Switches to specified vcpu, until a matching vcpu_put(), but assumes
  * vcpu mutex is already taken.
@@ -5359,6 +5412,361 @@ static void vmx_set_supported_cpuid(u32 
 		entry->ecx |= bit(X86_FEATURE_VMX);
 }
 
+/*
+ * Make a copy of the current VMCS to ordinary memory. This is needed because
+ * in VMX you cannot access two VMCSs at the same time, so when we
+ * want to do this (in prepare_vmcs02, which needs to read from vmcs01 while
+ * preparing vmcs02), we need to first save a copy of one VMCS's fields in
+ * memory, and then use that copy.
+ */
+void save_vmcs(struct vmcs_fields *dst)
+{
+	dst->guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR);
+	dst->guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR);
+	dst->guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR);
+	dst->guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR);
+	dst->guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR);
+	dst->guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR);
+	dst->guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR);
+	dst->guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR);
+	dst->host_es_selector = vmcs_read16(HOST_ES_SELECTOR);
+	dst->host_cs_selector = vmcs_read16(HOST_CS_SELECTOR);
+	dst->host_ss_selector = vmcs_read16(HOST_SS_SELECTOR);
+	dst->host_ds_selector = vmcs_read16(HOST_DS_SELECTOR);
+	dst->host_fs_selector = vmcs_read16(HOST_FS_SELECTOR);
+	dst->host_gs_selector = vmcs_read16(HOST_GS_SELECTOR);
+	dst->host_tr_selector = vmcs_read16(HOST_TR_SELECTOR);
+	dst->io_bitmap_a = vmcs_read64(IO_BITMAP_A);
+	dst->io_bitmap_b = vmcs_read64(IO_BITMAP_B);
+	if (cpu_has_vmx_msr_bitmap())
+		dst->msr_bitmap = vmcs_read64(MSR_BITMAP);
+	dst->tsc_offset = vmcs_read64(TSC_OFFSET);
+	dst->virtual_apic_page_addr = vmcs_read64(VIRTUAL_APIC_PAGE_ADDR);
+	dst->apic_access_addr = vmcs_read64(APIC_ACCESS_ADDR);
+	if (enable_ept)
+		dst->ept_pointer = vmcs_read64(EPT_POINTER);
+	dst->guest_physical_address = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
+	dst->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);
+	dst->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
+	if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
+		dst->guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT);
+	if (enable_ept) {
+		dst->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
+		dst->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
+		dst->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
+		dst->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
+	}
+	dst->pin_based_vm_exec_control = vmcs_read32(PIN_BASED_VM_EXEC_CONTROL);
+	dst->cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
+	dst->exception_bitmap = vmcs_read32(EXCEPTION_BITMAP);
+	dst->page_fault_error_code_mask =
+		vmcs_read32(PAGE_FAULT_ERROR_CODE_MASK);
+	dst->page_fault_error_code_match =
+		vmcs_read32(PAGE_FAULT_ERROR_CODE_MATCH);
+	dst->cr3_target_count = vmcs_read32(CR3_TARGET_COUNT);
+	dst->vm_exit_controls = vmcs_read32(VM_EXIT_CONTROLS);
+	dst->vm_entry_controls = vmcs_read32(VM_ENTRY_CONTROLS);
+	dst->vm_entry_intr_info_field = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
+	dst->vm_entry_exception_error_code =
+		vmcs_read32(VM_ENTRY_EXCEPTION_ERROR_CODE);
+	dst->vm_entry_instruction_len = vmcs_read32(VM_ENTRY_INSTRUCTION_LEN);
+	dst->tpr_threshold = vmcs_read32(TPR_THRESHOLD);
+	dst->secondary_vm_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
+	if (enable_vpid && dst->secondary_vm_exec_control &
+	    SECONDARY_EXEC_ENABLE_VPID)
+		dst->virtual_processor_id = vmcs_read16(VIRTUAL_PROCESSOR_ID);
+	dst->vm_instruction_error = vmcs_read32(VM_INSTRUCTION_ERROR);
+	dst->vm_exit_reason  = vmcs_read32(VM_EXIT_REASON);
+	dst->vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+	dst->vm_exit_intr_error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
+	dst->idt_vectoring_info_field = vmcs_read32(IDT_VECTORING_INFO_FIELD);
+	dst->idt_vectoring_error_code = vmcs_read32(IDT_VECTORING_ERROR_CODE);
+	dst->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+	dst->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	dst->guest_es_limit = vmcs_read32(GUEST_ES_LIMIT);
+	dst->guest_cs_limit = vmcs_read32(GUEST_CS_LIMIT);
+	dst->guest_ss_limit = vmcs_read32(GUEST_SS_LIMIT);
+	dst->guest_ds_limit = vmcs_read32(GUEST_DS_LIMIT);
+	dst->guest_fs_limit = vmcs_read32(GUEST_FS_LIMIT);
+	dst->guest_gs_limit = vmcs_read32(GUEST_GS_LIMIT);
+	dst->guest_ldtr_limit = vmcs_read32(GUEST_LDTR_LIMIT);
+	dst->guest_tr_limit = vmcs_read32(GUEST_TR_LIMIT);
+	dst->guest_gdtr_limit = vmcs_read32(GUEST_GDTR_LIMIT);
+	dst->guest_idtr_limit = vmcs_read32(GUEST_IDTR_LIMIT);
+	dst->guest_es_ar_bytes = vmcs_read32(GUEST_ES_AR_BYTES);
+	dst->guest_cs_ar_bytes = vmcs_read32(GUEST_CS_AR_BYTES);
+	dst->guest_ss_ar_bytes = vmcs_read32(GUEST_SS_AR_BYTES);
+	dst->guest_ds_ar_bytes = vmcs_read32(GUEST_DS_AR_BYTES);
+	dst->guest_fs_ar_bytes = vmcs_read32(GUEST_FS_AR_BYTES);
+	dst->guest_gs_ar_bytes = vmcs_read32(GUEST_GS_AR_BYTES);
+	dst->guest_ldtr_ar_bytes = vmcs_read32(GUEST_LDTR_AR_BYTES);
+	dst->guest_tr_ar_bytes = vmcs_read32(GUEST_TR_AR_BYTES);
+	dst->guest_interruptibility_info =
+		vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
+	dst->guest_activity_state = vmcs_read32(GUEST_ACTIVITY_STATE);
+	dst->guest_sysenter_cs = vmcs_read32(GUEST_SYSENTER_CS);
+	dst->host_ia32_sysenter_cs = vmcs_read32(HOST_IA32_SYSENTER_CS);
+	dst->cr0_guest_host_mask = vmcs_readl(CR0_GUEST_HOST_MASK);
+	dst->cr4_guest_host_mask = vmcs_readl(CR4_GUEST_HOST_MASK);
+	dst->cr0_read_shadow = vmcs_readl(CR0_READ_SHADOW);
+	dst->cr4_read_shadow = vmcs_readl(CR4_READ_SHADOW);
+	dst->cr3_target_value0 = vmcs_readl(CR3_TARGET_VALUE0);
+	dst->cr3_target_value1 = vmcs_readl(CR3_TARGET_VALUE1);
+	dst->cr3_target_value2 = vmcs_readl(CR3_TARGET_VALUE2);
+	dst->cr3_target_value3 = vmcs_readl(CR3_TARGET_VALUE3);
+	dst->exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	dst->guest_linear_address = vmcs_readl(GUEST_LINEAR_ADDRESS);
+	dst->guest_cr0 = vmcs_readl(GUEST_CR0);
+	dst->guest_cr3 = vmcs_readl(GUEST_CR3);
+	dst->guest_cr4 = vmcs_readl(GUEST_CR4);
+	dst->guest_es_base = vmcs_readl(GUEST_ES_BASE);
+	dst->guest_cs_base = vmcs_readl(GUEST_CS_BASE);
+	dst->guest_ss_base = vmcs_readl(GUEST_SS_BASE);
+	dst->guest_ds_base = vmcs_readl(GUEST_DS_BASE);
+	dst->guest_fs_base = vmcs_readl(GUEST_FS_BASE);
+	dst->guest_gs_base = vmcs_readl(GUEST_GS_BASE);
+	dst->guest_ldtr_base = vmcs_readl(GUEST_LDTR_BASE);
+	dst->guest_tr_base = vmcs_readl(GUEST_TR_BASE);
+	dst->guest_gdtr_base = vmcs_readl(GUEST_GDTR_BASE);
+	dst->guest_idtr_base = vmcs_readl(GUEST_IDTR_BASE);
+	dst->guest_dr7 = vmcs_readl(GUEST_DR7);
+	dst->guest_rsp = vmcs_readl(GUEST_RSP);
+	dst->guest_rip = vmcs_readl(GUEST_RIP);
+	dst->guest_rflags = vmcs_readl(GUEST_RFLAGS);
+	dst->guest_pending_dbg_exceptions =
+		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
+	dst->guest_sysenter_esp = vmcs_readl(GUEST_SYSENTER_ESP);
+	dst->guest_sysenter_eip = vmcs_readl(GUEST_SYSENTER_EIP);
+	dst->host_cr0 = vmcs_readl(HOST_CR0);
+	dst->host_cr3 = vmcs_readl(HOST_CR3);
+	dst->host_cr4 = vmcs_readl(HOST_CR4);
+	dst->host_fs_base = vmcs_readl(HOST_FS_BASE);
+	dst->host_gs_base = vmcs_readl(HOST_GS_BASE);
+	dst->host_tr_base = vmcs_readl(HOST_TR_BASE);
+	dst->host_gdtr_base = vmcs_readl(HOST_GDTR_BASE);
+	dst->host_idtr_base = vmcs_readl(HOST_IDTR_BASE);
+	dst->host_ia32_sysenter_esp = vmcs_readl(HOST_IA32_SYSENTER_ESP);
+	dst->host_ia32_sysenter_eip = vmcs_readl(HOST_IA32_SYSENTER_EIP);
+	dst->host_rsp = vmcs_readl(HOST_RSP);
+	dst->host_rip = vmcs_readl(HOST_RIP);
+	if (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PAT)
+		dst->host_ia32_pat = vmcs_read64(HOST_IA32_PAT);
+}
+
+int load_vmcs_common(struct vmcs_fields *src)
+{
+	vmcs_write16(GUEST_ES_SELECTOR, src->guest_es_selector);
+	vmcs_write16(GUEST_CS_SELECTOR, src->guest_cs_selector);
+	vmcs_write16(GUEST_SS_SELECTOR, src->guest_ss_selector);
+	vmcs_write16(GUEST_DS_SELECTOR, src->guest_ds_selector);
+	vmcs_write16(GUEST_FS_SELECTOR, src->guest_fs_selector);
+	vmcs_write16(GUEST_GS_SELECTOR, src->guest_gs_selector);
+	vmcs_write16(GUEST_LDTR_SELECTOR, src->guest_ldtr_selector);
+	vmcs_write16(GUEST_TR_SELECTOR, src->guest_tr_selector);
+
+	vmcs_write64(GUEST_IA32_DEBUGCTL, src->guest_ia32_debugctl);
+
+	if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
+		vmcs_write64(GUEST_IA32_PAT, src->guest_ia32_pat);
+
+	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, src->vm_entry_intr_info_field);
+	vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
+		     src->vm_entry_exception_error_code);
+	vmcs_write32(VM_ENTRY_INSTRUCTION_LEN, src->vm_entry_instruction_len);
+
+	vmcs_write32(GUEST_ES_LIMIT, src->guest_es_limit);
+	vmcs_write32(GUEST_CS_LIMIT, src->guest_cs_limit);
+	vmcs_write32(GUEST_SS_LIMIT, src->guest_ss_limit);
+	vmcs_write32(GUEST_DS_LIMIT, src->guest_ds_limit);
+	vmcs_write32(GUEST_FS_LIMIT, src->guest_fs_limit);
+	vmcs_write32(GUEST_GS_LIMIT, src->guest_gs_limit);
+	vmcs_write32(GUEST_LDTR_LIMIT, src->guest_ldtr_limit);
+	vmcs_write32(GUEST_TR_LIMIT, src->guest_tr_limit);
+	vmcs_write32(GUEST_GDTR_LIMIT, src->guest_gdtr_limit);
+	vmcs_write32(GUEST_IDTR_LIMIT, src->guest_idtr_limit);
+	vmcs_write32(GUEST_ES_AR_BYTES, src->guest_es_ar_bytes);
+	vmcs_write32(GUEST_CS_AR_BYTES, src->guest_cs_ar_bytes);
+	vmcs_write32(GUEST_SS_AR_BYTES, src->guest_ss_ar_bytes);
+	vmcs_write32(GUEST_DS_AR_BYTES, src->guest_ds_ar_bytes);
+	vmcs_write32(GUEST_FS_AR_BYTES, src->guest_fs_ar_bytes);
+	vmcs_write32(GUEST_GS_AR_BYTES, src->guest_gs_ar_bytes);
+	vmcs_write32(GUEST_LDTR_AR_BYTES, src->guest_ldtr_ar_bytes);
+	vmcs_write32(GUEST_TR_AR_BYTES, src->guest_tr_ar_bytes);
+	vmcs_write32(GUEST_INTERRUPTIBILITY_INFO,
+		     src->guest_interruptibility_info);
+	vmcs_write32(GUEST_ACTIVITY_STATE, src->guest_activity_state);
+	vmcs_write32(GUEST_SYSENTER_CS, src->guest_sysenter_cs);
+
+	vmcs_writel(GUEST_ES_BASE, src->guest_es_base);
+	vmcs_writel(GUEST_CS_BASE, src->guest_cs_base);
+	vmcs_writel(GUEST_SS_BASE, src->guest_ss_base);
+	vmcs_writel(GUEST_DS_BASE, src->guest_ds_base);
+	vmcs_writel(GUEST_FS_BASE, src->guest_fs_base);
+	vmcs_writel(GUEST_GS_BASE, src->guest_gs_base);
+	vmcs_writel(GUEST_LDTR_BASE, src->guest_ldtr_base);
+	vmcs_writel(GUEST_TR_BASE, src->guest_tr_base);
+	vmcs_writel(GUEST_GDTR_BASE, src->guest_gdtr_base);
+	vmcs_writel(GUEST_IDTR_BASE, src->guest_idtr_base);
+	vmcs_writel(GUEST_DR7, src->guest_dr7);
+	vmcs_writel(GUEST_RSP, src->guest_rsp);
+	vmcs_writel(GUEST_RIP, src->guest_rip);
+	vmcs_writel(GUEST_RFLAGS, src->guest_rflags);
+	vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS,
+		    src->guest_pending_dbg_exceptions);
+	vmcs_writel(GUEST_SYSENTER_ESP, src->guest_sysenter_esp);
+	vmcs_writel(GUEST_SYSENTER_EIP, src->guest_sysenter_eip);
+
+	return 0;
+}
+
+/*
+ * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
+ * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
+ * with L0's requirements for its own guest (vmcs01), so that L2 runs in a
+ * way that satisfies both L1's requests and our own needs.
+ */
+int prepare_vmcs02(struct kvm_vcpu *vcpu,
+	struct vmcs_fields *vmcs12, struct vmcs_fields *vmcs01)
+{
+	u32 exec_control;
+
+	load_vmcs_common(vmcs12);
+
+	vmcs_write64(VMCS_LINK_POINTER, vmcs12->vmcs_link_pointer);
+	vmcs_write64(IO_BITMAP_A, vmcs01->io_bitmap_a);
+	vmcs_write64(IO_BITMAP_B, vmcs01->io_bitmap_b);
+	if (cpu_has_vmx_msr_bitmap())
+		vmcs_write64(MSR_BITMAP, vmcs01->msr_bitmap);
+
+	if (vmcs12->vm_entry_msr_load_count > 0 ||
+			vmcs12->vm_exit_msr_load_count > 0 ||
+			vmcs12->vm_exit_msr_store_count > 0) {
+		printk(KERN_WARNING
+			"%s: VMCS MSR_{LOAD,STORE} unsupported\n", __func__);
+	}
+
+	if (nested_cpu_has_vmx_tpr_shadow(vcpu)) {
+		struct page *page =
+			nested_get_page(vcpu, vmcs12->virtual_apic_page_addr);
+		if (!page)
+			return 1;
+		vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, page_to_phys(page));
+		kvm_release_page_clean(page);
+	}
+
+	if (nested_vm_need_virtualize_apic_accesses(vcpu)) {
+		struct page *page =
+			nested_get_page(vcpu, vmcs12->apic_access_addr);
+		if (!page)
+			return 1;
+		vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page));
+		kvm_release_page_clean(page);
+	}
+
+	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
+		     (vmcs01->pin_based_vm_exec_control |
+		      vmcs12->pin_based_vm_exec_control));
+
+
+	/*
+	 * Whether page-faults are trapped is determined by a combination of
+	 * 3 settings: PFEC_MASK, PFEC_MATCH and EXCEPTION_BITMAP.PF.
+	 * If enable_ept, L0 doesn't care about page faults and we should
+	 * set all of these to L1's desires. However, if !enable_ept, L0 does
+	 * care about (at least some) page faults, and because it is not easy
+	 * (if at all possible?) to merge L0 and L1's desires, we simply ask
+	 * to exit on each and every L2 page fault. This is done by setting
+	 * MASK=MATCH=0 and (see below) EB.PF=1.
+	 * Note that below we don't need special code to set EB.PF beyond the
+	 * "or"ing of the EB of vmcs01 and vmcs12, because when enable_ept,
+	 * vmcs01's EB.PF is 0 so the "or" will take vmcs12's value, and when
+	 * !enable_ept, EB.PF is 1, so the "or" will always be 1.
+	 *
+	 * TODO: A problem with this approach is that L1 may be injected
+	 * with more page faults than it asked for. This could have caused
+	 * problems, but in practice existing hypervisors don't care. To fix
+	 * this, we will need to emulate the PFEC checking (on the L1 page
+	 * tables), using walk_addr(), when injecting PFs to L1.
+	 */
+	vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK,
+		enable_ept ? vmcs12->page_fault_error_code_mask : 0);
+	vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH,
+		enable_ept ? vmcs12->page_fault_error_code_match : 0);
+
+	if (cpu_has_secondary_exec_ctrls()) {
+		u32 exec_control = vmcs01->secondary_vm_exec_control;
+		if (nested_cpu_has_secondary_exec_ctrls(vcpu)) {
+			exec_control |= vmcs12->secondary_vm_exec_control;
+			if (!vm_need_virtualize_apic_accesses(vcpu->kvm) ||
+			    !nested_vm_need_virtualize_apic_accesses(vcpu))
+				exec_control &=
+				~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+		}
+		vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
+	}
+
+	load_vmcs_host_state(vmcs01);
+
+	if (vm_need_tpr_shadow(vcpu->kvm) &&
+	    nested_cpu_has_vmx_tpr_shadow(vcpu))
+		vmcs_write32(TPR_THRESHOLD, vmcs12->tpr_threshold);
+
+	exec_control = vmcs01->cpu_based_vm_exec_control;
+	exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING;
+	exec_control &= ~CPU_BASED_VIRTUAL_NMI_PENDING;
+	exec_control &= ~CPU_BASED_TPR_SHADOW;
+	exec_control |= vmcs12->cpu_based_vm_exec_control;
+	if (!vm_need_tpr_shadow(vcpu->kvm) ||
+	    vmcs12->virtual_apic_page_addr == 0) {
+		exec_control &= ~CPU_BASED_TPR_SHADOW;
+#ifdef CONFIG_X86_64
+		exec_control |= CPU_BASED_CR8_STORE_EXITING |
+			CPU_BASED_CR8_LOAD_EXITING;
+#endif
+	} else if (exec_control & CPU_BASED_TPR_SHADOW) {
+#ifdef CONFIG_X86_64
+		exec_control &= ~CPU_BASED_CR8_STORE_EXITING;
+		exec_control &= ~CPU_BASED_CR8_LOAD_EXITING;
+#endif
+	}
+	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control);
+
+	/* EXCEPTION_BITMAP and CR0_GUEST_HOST_MASK should basically be the
+	 * bitwise-or of what L1 wants to trap for L2, and what we want to
+	 * trap. However, vmx_fpu_activate/deactivate may have happened after
+	 * we saved vmcs01, so we shouldn't trust its TS and NM_VECTOR bits
+	 * and need to base them again on fpu_active. Note that CR0.TS also
+	 * needs updating - we do this after this function returns (in
+	 * nested_vmx_run).
+	 */
+	vmcs_write32(EXCEPTION_BITMAP,
+		     ((vmcs01->exception_bitmap&~(1u<<NM_VECTOR)) |
+		      (vcpu->fpu_active ? 0 : (1u<<NM_VECTOR)) |
+		      vmcs12->exception_bitmap));
+	vmcs_writel(CR0_GUEST_HOST_MASK, vmcs12->cr0_guest_host_mask |
+			(vcpu->fpu_active ? 0 : X86_CR0_TS));
+	vcpu->arch.cr0_guest_owned_bits = ~(vmcs12->cr0_guest_host_mask |
+			(vcpu->fpu_active ? 0 : X86_CR0_TS));
+
+	vmcs_write32(VM_EXIT_CONTROLS,
+		     (vmcs01->vm_exit_controls &
+			(~(VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT)))
+		       | vmcs12->vm_exit_controls);
+
+	vmcs_write32(VM_ENTRY_CONTROLS,
+		     (vmcs01->vm_entry_controls &
+			(~(VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE)))
+		      | vmcs12->vm_entry_controls);
+
+	vmcs_writel(CR4_GUEST_HOST_MASK,
+		    (vmcs01->cr4_guest_host_mask  &
+		     vmcs12->cr4_guest_host_mask));
+
+	vmcs_write64(TSC_OFFSET, vmcs01->tsc_offset + vmcs12->tsc_offset);
+
+	return 0;
+}
+
 static struct kvm_x86_ops vmx_x86_ops = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,


* [PATCH 16/27] nVMX: Move register-syncing to a function
  2010-10-17 10:03 [PATCH 0/27] nVMX: Nested VMX, v6 Nadav Har'El
                   ` (14 preceding siblings ...)
  2010-10-17 10:11 ` [PATCH 15/27] nVMX: Prepare vmcs02 from vmcs01 and vmcs12 Nadav Har'El
@ 2010-10-17 10:11 ` Nadav Har'El
  2010-10-17 10:12 ` [PATCH 17/27] nVMX: Implement VMLAUNCH and VMRESUME Nadav Har'El
                   ` (10 subsequent siblings)
  26 siblings, 0 replies; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 10:11 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Move the code that syncs the dirty RSP and RIP registers back to the VMCS
into a function. We will need to call this function from more places in the
next patch.
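
The idea being factored out is the usual dirty-tracking pattern: cache
register writes in software and flush only the modified ones to the VMCS
before entry. A stand-alone sketch of that pattern (not the KVM code; the
names and the array standing in for the VMCS are made up):

#include <stdio.h>

enum { REG_RSP, REG_RIP, NR_REGS };

static unsigned long cached[NR_REGS];	 /* software copy of the registers */
static unsigned long fake_vmcs[NR_REGS]; /* stand-in for the hardware VMCS */
static unsigned long regs_dirty;	 /* one bit per modified cached register */

static void cache_write(int reg, unsigned long val)
{
	cached[reg] = val;
	regs_dirty |= 1ul << reg;
}

static void sync_cached_regs(void)
{
	int reg;

	/* flush only what changed since the last sync, then clear the mask */
	for (reg = 0; reg < NR_REGS; reg++)
		if (regs_dirty & (1ul << reg))
			fake_vmcs[reg] = cached[reg];
	regs_dirty = 0;
}

int main(void)
{
	cache_write(REG_RIP, 0x1000);
	sync_cached_regs();		/* only RIP is written back */
	printf("rsp=%#lx rip=%#lx\n", fake_vmcs[REG_RSP], fake_vmcs[REG_RIP]);
	return 0;
}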

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:02.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:02.000000000 +0200
@@ -5025,6 +5025,15 @@ static void vmx_cancel_injection(struct 
 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
 }
 
+static inline void sync_cached_regs_to_vmcs(struct kvm_vcpu *vcpu)
+{
+	if (test_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_dirty))
+		vmcs_writel(GUEST_RSP, vcpu->arch.regs[VCPU_REGS_RSP]);
+	if (test_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_dirty))
+		vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]);
+	vcpu->arch.regs_dirty = 0;
+}
+
 #ifdef CONFIG_X86_64
 #define R "r"
 #define Q "q"
@@ -5046,10 +5055,7 @@ static void vmx_vcpu_run(struct kvm_vcpu
 	if (vmx->emulation_required && emulate_invalid_guest_state)
 		return;
 
-	if (test_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_dirty))
-		vmcs_writel(GUEST_RSP, vcpu->arch.regs[VCPU_REGS_RSP]);
-	if (test_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_dirty))
-		vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]);
+	sync_cached_regs_to_vmcs(vcpu);
 
 	/* When single-stepping over STI and MOV SS, we must clear the
 	 * corresponding interruptibility bits in the guest state. Otherwise
@@ -5157,7 +5163,6 @@ static void vmx_vcpu_run(struct kvm_vcpu
 
 	vcpu->arch.regs_avail = ~((1 << VCPU_REGS_RIP) | (1 << VCPU_REGS_RSP)
 				  | (1 << VCPU_EXREG_PDPTR));
-	vcpu->arch.regs_dirty = 0;
 
 	vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
 


* [PATCH 17/27] nVMX: Implement VMLAUNCH and VMRESUME
  2010-10-17 10:03 [PATCH 0/27] nVMX: Nested VMX, v6 Nadav Har'El
                   ` (15 preceding siblings ...)
  2010-10-17 10:11 ` [PATCH 16/27] nVMX: Move register-syncing to a function Nadav Har'El
@ 2010-10-17 10:12 ` Nadav Har'El
  2010-10-17 15:06   ` Avi Kivity
  2010-10-17 10:12 ` [PATCH 18/27] nVMX: No need for handle_vmx_insn function any more Nadav Har'El
                   ` (9 subsequent siblings)
  26 siblings, 1 reply; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 10:12 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Implement the VMLAUNCH and VMRESUME instructions, allowing a guest
hypervisor to run its own guests.
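
At its core the emulation is a state swap: remember the VMCS (and related
state) we were using to run L1, switch to a VMCS for L2, and switch back on a
nested exit. The following stand-alone sketch (not the KVM code; everything
in it is made up for the example, and the real nested_vmx_run() below also
merges vmcs12 into vmcs02) shows just that swap:

#include <stdio.h>

struct vmcs { const char *owner; };

static struct vmcs vmcs01 = { "L1" };	/* VMCS we use to run L1 */
static struct vmcs vmcs02 = { "L2" };	/* VMCS we use to run L2 on L1's behalf */

struct vcpu {
	struct vmcs *current;		/* VMCS loaded on the CPU */
	struct vmcs *saved_l1;		/* where to return on a nested exit */
	int nested_mode;
};

static void nested_run(struct vcpu *v)	/* VMLAUNCH/VMRESUME path */
{
	v->saved_l1 = v->current;	/* stash vmcs01 */
	v->current = &vmcs02;		/* run the nested guest */
	v->nested_mode = 1;
}

static void nested_exit(struct vcpu *v)	/* L2 -> L1 switch (a later patch) */
{
	v->current = v->saved_l1;
	v->nested_mode = 0;
}

int main(void)
{
	struct vcpu v = { .current = &vmcs01 };

	nested_run(&v);
	printf("running %s guest\n", v.current->owner);
	nested_exit(&v);
	printf("back to %s\n", v.current->owner);
	return 0;
}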

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  221 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 218 insertions(+), 3 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:02.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:02.000000000 +0200
@@ -281,6 +281,9 @@ struct __packed vmcs12 {
 	struct vmcs_fields fields;
 
 	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
+
+	int cpu;
+	int launched;
 };
 
 /*
@@ -315,6 +318,23 @@ struct nested_vmx {
 	/* list of real (hardware) VMCS, one for each L2 guest of L1 */
 	struct list_head vmcs02_list; /* a vmcs_list */
 	int vmcs02_num;
+
+	/* Are we running a nested guest now */
+	bool nested_mode;
+	/* Level 1 state for switching to level 2 and back */
+	struct  {
+		u64 efer;
+		unsigned long cr3;
+		unsigned long cr4;
+		u64 io_bitmap_a;
+		u64 io_bitmap_b;
+		u64 msr_bitmap;
+		int cpu;
+		int launched;
+	} l1_state;
+	/* Saving the VMCS that we used for running L1 */
+	struct vmcs *vmcs01;
+	struct vmcs_fields *vmcs01_fields;
 };
 
 struct vcpu_vmx {
@@ -1349,6 +1369,16 @@ static void vmx_vcpu_load(struct kvm_vcp
 
 		rdmsrl(MSR_IA32_SYSENTER_ESP, sysenter_esp);
 		vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */
+
+		if (vmx->nested.vmcs01_fields != NULL) {
+			struct vmcs_fields *vmcs01 = vmx->nested.vmcs01_fields;
+			vmcs01->host_tr_base = vmcs_readl(HOST_TR_BASE);
+			vmcs01->host_gdtr_base = vmcs_readl(HOST_GDTR_BASE);
+			vmcs01->host_ia32_sysenter_esp =
+				vmcs_readl(HOST_IA32_SYSENTER_ESP);
+			if (vmx->nested.nested_mode)
+				load_vmcs_host_state(vmcs01);
+		}
 	}
 }
 
@@ -2175,6 +2205,9 @@ static void free_l1_state(struct kvm_vcp
 		kfree(list_item);
 	}
 	vmx->nested.vmcs02_num = 0;
+
+	kfree(vmx->nested.vmcs01_fields);
+	vmx->nested.vmcs01_fields = NULL;
 }
 
 static void free_kvm_area(void)
@@ -4036,6 +4069,10 @@ static int handle_vmon(struct kvm_vcpu *
 	INIT_LIST_HEAD(&(vmx->nested.vmcs02_list));
 	vmx->nested.vmcs02_num = 0;
 
+	vmx->nested.vmcs01_fields = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!vmx->nested.vmcs01_fields)
+		return -ENOMEM;
+
 	vmx->nested.vmxon = true;
 
 	skip_emulated_instruction(vcpu);
@@ -4238,6 +4275,49 @@ static int handle_vmclear(struct kvm_vcp
 	return 1;
 }
 
+static int nested_vmx_run(struct kvm_vcpu *vcpu);
+
+static int handle_launch_or_resume(struct kvm_vcpu *vcpu, bool launch)
+{
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	/* yet another strange pre-requisite listed in the VMX spec */
+	if (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & GUEST_INTR_STATE_MOV_SS) {
+		nested_vmx_failValid(vcpu,
+			VMXERR_ENTRY_EVENTS_BLOCKED_BY_MOV_SS);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	if (to_vmx(vcpu)->nested.current_vmcs12->launch_state == launch) {
+		/* Must use VMLAUNCH for the first time, VMRESUME later */
+		nested_vmx_failValid(vcpu,
+			launch ? VMXERR_VMLAUNCH_NONCLEAR_VMCS :
+				 VMXERR_VMRESUME_NONLAUNCHED_VMCS);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	skip_emulated_instruction(vcpu);
+
+	nested_vmx_run(vcpu);
+	return 1;
+}
+
+/* Emulate the VMLAUNCH instruction */
+static int handle_vmlaunch(struct kvm_vcpu *vcpu)
+{
+	return handle_launch_or_resume(vcpu, true);
+}
+
+/* Emulate the VMRESUME instruction */
+static int handle_vmresume(struct kvm_vcpu *vcpu)
+{
+
+	return handle_launch_or_resume(vcpu, false);
+}
+
 enum vmcs_field_type {
 	VMCS_FIELD_TYPE_U16 = 0,
 	VMCS_FIELD_TYPE_U64 = 1,
@@ -4795,11 +4875,11 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_INVLPG]		      = handle_invlpg,
 	[EXIT_REASON_VMCALL]                  = handle_vmcall,
 	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
-	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
+	[EXIT_REASON_VMLAUNCH]                = handle_vmlaunch,
 	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
 	[EXIT_REASON_VMPTRST]                 = handle_vmptrst,
 	[EXIT_REASON_VMREAD]                  = handle_vmread,
-	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
+	[EXIT_REASON_VMRESUME]                = handle_vmresume,
 	[EXIT_REASON_VMWRITE]                 = handle_vmwrite,
 	[EXIT_REASON_VMOFF]                   = handle_vmoff,
 	[EXIT_REASON_VMON]                    = handle_vmon,
@@ -4862,7 +4942,8 @@ static int vmx_handle_exit(struct kvm_vc
 		       "(0x%x) and exit reason is 0x%x\n",
 		       __func__, vectoring_info, exit_reason);
 
-	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked)) {
+	if (!vmx->nested.nested_mode &&
+	    unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked)) {
 		if (vmx_interrupt_allowed(vcpu)) {
 			vmx->soft_vnmi_blocked = 0;
 		} else if (vmx->vnmi_blocked_time > 1000000000LL &&
@@ -5772,6 +5853,140 @@ int prepare_vmcs02(struct kvm_vcpu *vcpu
 	return 0;
 }
 
+
+
+/*
+ * Return the cr0 value that a guest would read. This is a combination of
+ * the real cr0 used to run the guest (guest_cr0), and the bits shadowed by
+ * the hypervisor (cr0_read_shadow).
+ */
+static inline unsigned long guest_readable_cr0(struct vmcs_fields *fields)
+{
+	return (fields->guest_cr0 & ~fields->cr0_guest_host_mask) |
+		(fields->cr0_read_shadow & fields->cr0_guest_host_mask);
+}
+
+static int nested_vmx_run(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	vmx->nested.nested_mode = true;
+	sync_cached_regs_to_vmcs(vcpu);
+	save_vmcs(vmx->nested.vmcs01_fields);
+
+	vmx->nested.l1_state.efer = vcpu->arch.efer;
+	if (!enable_ept)
+		vmx->nested.l1_state.cr3 = vcpu->arch.cr3;
+	vmx->nested.l1_state.cr4 = vcpu->arch.cr4;
+
+	if (cpu_has_vmx_msr_bitmap())
+		vmx->nested.l1_state.msr_bitmap = vmcs_read64(MSR_BITMAP);
+	else
+		vmx->nested.l1_state.msr_bitmap = 0;
+
+	vmx->nested.l1_state.io_bitmap_a = vmcs_read64(IO_BITMAP_A);
+	vmx->nested.l1_state.io_bitmap_b = vmcs_read64(IO_BITMAP_B);
+	vmx->nested.vmcs01 = vmx->vmcs;
+	vmx->nested.l1_state.cpu = vcpu->cpu;
+	vmx->nested.l1_state.launched = vmx->launched;
+
+	vmx->vmcs = nested_get_current_vmcs(vcpu);
+	if (!vmx->vmcs) {
+		printk(KERN_ERR "Missing VMCS\n");
+		nested_vmx_failValid(vcpu, VMXERR_VMRESUME_CORRUPTED_VMCS);
+		return 1;
+	}
+
+	vcpu->cpu = vmx->nested.current_vmcs12->cpu;
+	vmx->launched = vmx->nested.current_vmcs12->launched;
+
+	if (!vmx->nested.current_vmcs12->launch_state || !vmx->launched) {
+		vmcs_clear(vmx->vmcs);
+		vmx->launched = 0;
+		vmx->nested.current_vmcs12->launch_state = 1;
+	}
+
+	vmx_vcpu_load(vcpu, get_cpu());
+	put_cpu();
+
+	prepare_vmcs02(vcpu,
+		get_vmcs12_fields(vcpu), vmx->nested.vmcs01_fields);
+
+	if (get_vmcs12_fields(vcpu)->vm_entry_controls &
+	    VM_ENTRY_IA32E_MODE) {
+		if (!((vcpu->arch.efer & EFER_LMA) &&
+		      (vcpu->arch.efer & EFER_LME)))
+			vcpu->arch.efer |= (EFER_LMA | EFER_LME);
+	} else {
+		if ((vcpu->arch.efer & EFER_LMA) ||
+		    (vcpu->arch.efer & EFER_LME))
+			vcpu->arch.efer = 0;
+	}
+
+	vmx->rmode.vm86_active =
+		!(get_vmcs12_fields(vcpu)->cr0_read_shadow & X86_CR0_PE);
+
+	/* vmx_set_cr0() sets the cr0 that L2 will read, to be the one that L1
+	 * dictated, and takes appropriate actions for special cr0 bits (like
+	 * real mode, etc.).
+	 */
+	vmx_set_cr0(vcpu, guest_readable_cr0(get_vmcs12_fields(vcpu)));
+
+	/* However, vmx_set_cr0 incorrectly enforces KVM's relationship between
+	 * GUEST_CR0 and CR0_READ_SHADOW, e.g., that the former is the same as
+	 * the latter with TS added if !fpu_active. We need to take the
+	 * actual GUEST_CR0 that L1 wanted, just with added TS if !fpu_active
+	 * like KVM wants (for the "lazy fpu" feature, to avoid the costly
+	 * restoration of fpu registers until the FPU is really used).
+	 */
+	vmcs_writel(GUEST_CR0, get_vmcs12_fields(vcpu)->guest_cr0 |
+		(vcpu->fpu_active ? 0 : X86_CR0_TS));
+
+	vmx_set_cr4(vcpu, get_vmcs12_fields(vcpu)->guest_cr4);
+	vmcs_writel(CR4_READ_SHADOW,
+		    get_vmcs12_fields(vcpu)->cr4_read_shadow);
+
+	/* we have to set the X86_CR0_PG bit of the cached cr0, because
+	 * kvm_mmu_reset_context enables paging only if X86_CR0_PG is set in
+	 * CR0 (we need paging so that KVM treats this guest as a paging
+	 * guest and we can easily forward page faults to L1.)
+	 */
+	vcpu->arch.cr0 |= X86_CR0_PG;
+
+	if (enable_ept) {
+		vmcs_write32(GUEST_CR3, get_vmcs12_fields(vcpu)->guest_cr3);
+		vmx->vcpu.arch.cr3 = get_vmcs12_fields(vcpu)->guest_cr3;
+	} else {
+		int r;
+		kvm_set_cr3(vcpu, get_vmcs12_fields(vcpu)->guest_cr3);
+		kvm_mmu_reset_context(vcpu);
+
+		r = kvm_mmu_load(vcpu);
+		if (unlikely(r)) {
+			printk(KERN_ERR "Error in kvm_mmu_load r %d\n", r);
+			nested_vmx_failValid(vcpu,
+				VMXERR_VMRESUME_CORRUPTED_VMCS /* ? */);
+			/* switch back to L1 */
+			vmx->nested.nested_mode = false;
+			vmx->vmcs = vmx->nested.vmcs01;
+			vcpu->cpu = vmx->nested.l1_state.cpu;
+			vmx->launched = vmx->nested.l1_state.launched;
+
+			vmx_vcpu_load(vcpu, get_cpu());
+			put_cpu();
+
+			return 1;
+		}
+	}
+
+	kvm_register_write(vcpu, VCPU_REGS_RSP,
+			   get_vmcs12_fields(vcpu)->guest_rsp);
+	kvm_register_write(vcpu, VCPU_REGS_RIP,
+			   get_vmcs12_fields(vcpu)->guest_rip);
+
+	return 1;
+}
+
 static struct kvm_x86_ops vmx_x86_ops = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,


* [PATCH 18/27] nVMX: No need for handle_vmx_insn function any more
  2010-10-17 10:03 [PATCH 0/27] nVMX: Nested VMX, v6 Nadav Har'El
                   ` (16 preceding siblings ...)
  2010-10-17 10:12 ` [PATCH 17/27] nVMX: Implement VMLAUNCH and VMRESUME Nadav Har'El
@ 2010-10-17 10:12 ` Nadav Har'El
  2010-10-17 10:13 ` [PATCH 19/27] nVMX: Exiting from L2 to L1 Nadav Har'El
                   ` (8 subsequent siblings)
  26 siblings, 0 replies; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 10:12 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Before nested VMX support, the exit handler for a guest executing a VMX
instruction (vmclear, vmlaunch, vmptrld, vmptrst, vmread, vmresume, vmwrite,
vmon, vmoff) was handle_vmx_insn(). This handler simply threw a #UD exception.
Now that all these exit reasons are properly handled (each emulating the
relevant VMX instruction), nothing calls this dummy handler and it can
be removed.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |    6 ------
 1 file changed, 6 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:02.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:02.000000000 +0200
@@ -4024,12 +4024,6 @@ static int handle_vmcall(struct kvm_vcpu
 	return 1;
 }
 
-static int handle_vmx_insn(struct kvm_vcpu *vcpu)
-{
-	kvm_queue_exception(vcpu, UD_VECTOR);
-	return 1;
-}
-
 /*
  * Emulate the VMXON instruction.
  * Currently, we just remember that VMX is active, and do not save or even


* [PATCH 19/27] nVMX: Exiting from L2 to L1
  2010-10-17 10:03 [PATCH 0/27] nVMX: Nested VMX, v6 Nadav Har'El
                   ` (17 preceding siblings ...)
  2010-10-17 10:12 ` [PATCH 18/27] nVMX: No need for handle_vmx_insn function any more Nadav Har'El
@ 2010-10-17 10:13 ` Nadav Har'El
  2010-10-17 15:58   ` Avi Kivity
  2010-10-17 10:13 ` [PATCH 20/27] nVMX: Deciding if L0 or L1 should handle an L2 exit Nadav Har'El
                   ` (7 subsequent siblings)
  26 siblings, 1 reply; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 10:13 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch implements nested_vmx_vmexit(), called when the nested L2 guest
exits and we want to run its L1 parent and let it handle this exit.

Note that this will not necessarily be called on every L2 exit. L0 may decide
to handle a particular exit on its own, without L1's involvement; in that
case, L0 will handle the exit and resume running L2, without running L1 and
without calling nested_vmx_vmexit(). The logic for deciding whether to handle
a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
will appear in the next patch.
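
The shape of that decision can be sketched as follows (stand-alone code, not
KVM; the exit reasons and the placeholder policy are invented for the example,
and the real policy arrives in the next patch):

#include <stdio.h>
#include <stdbool.h>

enum exit_reason { EXIT_EXTERNAL_INTERRUPT, EXIT_CPUID, EXIT_EPT_VIOLATION };

/* Placeholder policy: would L1 want to see this exit? */
static bool l1_wants_exit(enum exit_reason reason)
{
	return reason == EXIT_CPUID;	/* e.g. L1 always intercepts CPUID */
}

static void handle_in_l0(enum exit_reason r)  { printf("L0 handles %d, resume L2\n", r); }
static void reflect_to_l1(enum exit_reason r) { printf("switch to L1 for exit %d\n", r); }

static void handle_l2_exit(enum exit_reason reason)
{
	if (l1_wants_exit(reason))
		reflect_to_l1(reason);	/* ~ nested_vmx_vmexit() */
	else
		handle_in_l0(reason);	/* L1 never learns this exit happened */
}

int main(void)
{
	handle_l2_exit(EXIT_EPT_VIOLATION);
	handle_l2_exit(EXIT_CPUID);
	return 0;
}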

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  235 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 235 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:02.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:02.000000000 +0200
@@ -5085,6 +5085,8 @@ static void __vmx_complete_interrupts(st
 
 static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 {
+	if (vmx->nested.nested_mode)
+		return;
 	__vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
 				  VM_EXIT_INSTRUCTION_LEN,
 				  IDT_VECTORING_ERROR_CODE);
@@ -5981,6 +5983,239 @@ static int nested_vmx_run(struct kvm_vcp
 	return 1;
 }
 
+/*
+ * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date
+ * because L2 may have changed some cr0 bits directly (see CR0_GUEST_HOST_MASK)
+ * without L0 trapping the change and updating vmcs12.
+ * This function returns the value we should put in vmcs12.guest_cr0. It's not
+ * enough to just return the current (vmcs02) GUEST_CR0. This may not be the
+ * guest cr0 that L1 thought it was giving its L2 guest - it is possible that
+ * L1 wished to allow its guest to set a cr0 bit directly, but we (L0) asked
+ * to trap this change and instead set just the read shadow. If this is the
+ * case, we need to copy these read-shadow bits back to vmcs12.guest_cr0, where
+ * L1 believes they already are.
+ */
+static inline unsigned long
+vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs_fields *vmcs12)
+{
+	unsigned long guest_cr0_bits =
+		vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask;
+	return (vmcs_readl(GUEST_CR0) & guest_cr0_bits) |
+		(vmcs_readl(CR0_READ_SHADOW) & ~guest_cr0_bits);
+}
+
+static inline unsigned long
+vmcs12_guest_cr4(struct kvm_vcpu *vcpu, struct vmcs_fields *vmcs12)
+{
+	unsigned long guest_cr4_bits =
+		vcpu->arch.cr4_guest_owned_bits | vmcs12->cr4_guest_host_mask;
+	return (vmcs_readl(GUEST_CR4) & guest_cr4_bits) |
+		(vmcs_readl(CR4_READ_SHADOW) & ~guest_cr4_bits);
+}
+
+/*
+ * prepare_vmcs12 is called when the nested L2 guest exits and we want to
+ * prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12), and this
+ * function updates it to reflect the changes to the guest state while L2 was
+ * running (and perhaps made some exits which were handled directly by L0
+ * without going back to L1), and to reflect the exit reason.
+ * Note that we do not have to copy here all VMCS fields, just those that
+ * could have changed by the L2 guest or the exit - i.e., the guest-state and
+ * exit-information fields only. Other fields are modified by L1 with VMWRITE,
+ * which already writes to vmcs12 directly.
+ */
+void prepare_vmcs12(struct kvm_vcpu *vcpu)
+{
+	struct vmcs_fields *vmcs12 = get_vmcs12_fields(vcpu);
+
+	/* update guest state fields: */
+	vmcs12->guest_cr0 = vmcs12_guest_cr0(vcpu, vmcs12);
+	vmcs12->guest_cr4 = vmcs12_guest_cr4(vcpu, vmcs12);
+
+	vmcs12->guest_dr7 = vmcs_readl(GUEST_DR7);
+	vmcs12->guest_rsp = vmcs_readl(GUEST_RSP);
+	vmcs12->guest_rip = vmcs_readl(GUEST_RIP);
+	vmcs12->guest_rflags = vmcs_readl(GUEST_RFLAGS);
+
+	vmcs12->guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR);
+	vmcs12->guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR);
+	vmcs12->guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR);
+	vmcs12->guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR);
+	vmcs12->guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR);
+	vmcs12->guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR);
+	vmcs12->guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR);
+	vmcs12->guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR);
+	vmcs12->guest_es_limit = vmcs_read32(GUEST_ES_LIMIT);
+	vmcs12->guest_cs_limit = vmcs_read32(GUEST_CS_LIMIT);
+	vmcs12->guest_ss_limit = vmcs_read32(GUEST_SS_LIMIT);
+	vmcs12->guest_ds_limit = vmcs_read32(GUEST_DS_LIMIT);
+	vmcs12->guest_fs_limit = vmcs_read32(GUEST_FS_LIMIT);
+	vmcs12->guest_gs_limit = vmcs_read32(GUEST_GS_LIMIT);
+	vmcs12->guest_ldtr_limit = vmcs_read32(GUEST_LDTR_LIMIT);
+	vmcs12->guest_tr_limit = vmcs_read32(GUEST_TR_LIMIT);
+	vmcs12->guest_gdtr_limit = vmcs_read32(GUEST_GDTR_LIMIT);
+	vmcs12->guest_idtr_limit = vmcs_read32(GUEST_IDTR_LIMIT);
+	vmcs12->guest_es_ar_bytes = vmcs_read32(GUEST_ES_AR_BYTES);
+	vmcs12->guest_cs_ar_bytes = vmcs_read32(GUEST_CS_AR_BYTES);
+	vmcs12->guest_ss_ar_bytes = vmcs_read32(GUEST_SS_AR_BYTES);
+	vmcs12->guest_ds_ar_bytes = vmcs_read32(GUEST_DS_AR_BYTES);
+	vmcs12->guest_fs_ar_bytes = vmcs_read32(GUEST_FS_AR_BYTES);
+	vmcs12->guest_gs_ar_bytes = vmcs_read32(GUEST_GS_AR_BYTES);
+	vmcs12->guest_ldtr_ar_bytes = vmcs_read32(GUEST_LDTR_AR_BYTES);
+	vmcs12->guest_tr_ar_bytes = vmcs_read32(GUEST_TR_AR_BYTES);
+	vmcs12->guest_es_base = vmcs_readl(GUEST_ES_BASE);
+	vmcs12->guest_cs_base = vmcs_readl(GUEST_CS_BASE);
+	vmcs12->guest_ss_base = vmcs_readl(GUEST_SS_BASE);
+	vmcs12->guest_ds_base = vmcs_readl(GUEST_DS_BASE);
+	vmcs12->guest_fs_base = vmcs_readl(GUEST_FS_BASE);
+	vmcs12->guest_gs_base = vmcs_readl(GUEST_GS_BASE);
+	vmcs12->guest_ldtr_base = vmcs_readl(GUEST_LDTR_BASE);
+	vmcs12->guest_tr_base = vmcs_readl(GUEST_TR_BASE);
+	vmcs12->guest_gdtr_base = vmcs_readl(GUEST_GDTR_BASE);
+	vmcs12->guest_idtr_base = vmcs_readl(GUEST_IDTR_BASE);
+
+	/* TODO: These cannot have changed unless we have MSR bitmaps and
+	 * the relevant bit asks not to trap the change */
+	vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
+	if (vmcs_config.vmexit_ctrl & VM_EXIT_SAVE_IA32_PAT)
+		vmcs12->guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT);
+	vmcs12->guest_sysenter_cs = vmcs_read32(GUEST_SYSENTER_CS);
+	vmcs12->guest_sysenter_esp = vmcs_readl(GUEST_SYSENTER_ESP);
+	vmcs12->guest_sysenter_eip = vmcs_readl(GUEST_SYSENTER_EIP);
+
+	vmcs12->guest_activity_state = vmcs_read32(GUEST_ACTIVITY_STATE);
+	vmcs12->guest_interruptibility_info =
+		vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
+	vmcs12->guest_pending_dbg_exceptions =
+		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
+	vmcs12->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);
+
+	/* update exit information fields: */
+
+	vmcs12->vm_exit_reason  = vmcs_read32(VM_EXIT_REASON);
+	vmcs12->exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+
+	if (enable_ept) {
+		vmcs12->guest_physical_address =
+			vmcs_read64(GUEST_PHYSICAL_ADDRESS);
+		vmcs12->guest_linear_address = vmcs_readl(GUEST_LINEAR_ADDRESS);
+	}
+
+	vmcs12->vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+	vmcs12->vm_exit_intr_error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
+	vmcs12->idt_vectoring_info_field =
+		vmcs_read32(IDT_VECTORING_INFO_FIELD);
+	vmcs12->idt_vectoring_error_code =
+		vmcs_read32(IDT_VECTORING_ERROR_CODE);
+	vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+	vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+
+	/* clear vm-entry fields which are to be cleared on exit */
+	if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
+		vmcs12->vm_entry_intr_info_field &= ~INTR_INFO_VALID_MASK;
+}
+
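+/*
+ * nested_vmx_vmexit() performs the switch from running L2 back to running
+ * L1: it saves L2's state into vmcs12 (via prepare_vmcs12() above), makes
+ * vmcs01 the active VMCS again, and restores the L1 state (efer, cr0/cr4,
+ * cr3 or EPT state, rsp/rip, etc.) that was saved when L2 was entered.
+ */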
+static int nested_vmx_vmexit(struct kvm_vcpu *vcpu, bool is_interrupt)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	int efer_offset;
+	struct vmcs_fields *vmcs01 = vmx->nested.vmcs01_fields;
+
+	if (!vmx->nested.nested_mode) {
+		printk(KERN_INFO "WARNING: %s called but not in nested mode\n",
+		       __func__);
+		return 0;
+	}
+
+	sync_cached_regs_to_vmcs(vcpu);
+
+	prepare_vmcs12(vcpu);
+	if (is_interrupt)
+		get_vmcs12_fields(vcpu)->vm_exit_reason =
+			EXIT_REASON_EXTERNAL_INTERRUPT;
+
+	vmx->nested.current_vmcs12->launched = vmx->launched;
+	vmx->nested.current_vmcs12->cpu = vcpu->cpu;
+
+	vmx->vmcs = vmx->nested.vmcs01;
+	vcpu->cpu = vmx->nested.l1_state.cpu;
+	vmx->launched = vmx->nested.l1_state.launched;
+
+	vmx->nested.nested_mode = false;
+
+	vmx_vcpu_load(vcpu, get_cpu());
+	put_cpu();
+
+	vcpu->arch.efer = vmx->nested.l1_state.efer;
+	if ((vcpu->arch.efer & EFER_LMA) &&
+	    !(vcpu->arch.efer & EFER_SCE))
+		vcpu->arch.efer |= EFER_SCE;
+
+	efer_offset = __find_msr_index(vmx, MSR_EFER);
+	if (update_transition_efer(vmx, efer_offset))
+		wrmsrl(MSR_EFER, vmx->guest_msrs[efer_offset].data);
+
+	/*
+	 * L2 perhaps switched to real mode and set vmx->rmode, but we're back
+	 * in L1 and as it is running VMX, it can't be in real mode.
+	 */
+	vmx->rmode.vm86_active = 0;
+
+	/*
+	 * We're running a regular L1 guest again, so we do the regular KVM
+	 * thing: run vmx_set_cr0 with the cr0 bits the guest thinks it has.
+	 * vmx_set_cr0 might use slightly different bits on the new guest_cr0
+	 * it sets, e.g., add TS when !fpu_active.
+	 * Note that vmx_set_cr0 refers to rmode and efer set above.
+	 */
+	vmx_set_cr0(vcpu, guest_readable_cr0(vmcs01));
+	/*
+	 * If we did fpu_activate()/fpu_deactivate() during L2's run, we need
+	 * to apply the same changes to L1's vmcs. We just set cr0 correctly,
+	 * but now we need to also update cr0_guest_host_mask and
+	 * exception_bitmap.
+	 */
+	vmcs_write32(EXCEPTION_BITMAP,
+		(vmcs01->exception_bitmap & ~(1u<<NM_VECTOR)) |
+			(vcpu->fpu_active ? 0 : (1u<<NM_VECTOR)));
+	vcpu->arch.cr0_guest_owned_bits = (vcpu->fpu_active ? X86_CR0_TS : 0);
+	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
+
+	vmx_set_cr4(vcpu, vmx->nested.l1_state.cr4);
+
+	if (enable_ept) {
+		vcpu->arch.cr3 = vmcs01->guest_cr3;
+		vmcs_writel(GUEST_CR3, vmcs01->guest_cr3);
+		vmcs_write64(EPT_POINTER, vmcs01->ept_pointer);
+		vmcs_write64(GUEST_PDPTR0, vmcs01->guest_pdptr0);
+		vmcs_write64(GUEST_PDPTR1, vmcs01->guest_pdptr1);
+		vmcs_write64(GUEST_PDPTR2, vmcs01->guest_pdptr2);
+		vmcs_write64(GUEST_PDPTR3, vmcs01->guest_pdptr3);
+	} else {
+		kvm_set_cr3(vcpu, vmx->nested.l1_state.cr3);
+	}
+
+	kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs01->guest_rsp);
+	kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs01->guest_rip);
+
+	kvm_mmu_reset_context(vcpu);
+	kvm_mmu_load(vcpu);
+
+	if (unlikely(vmx->fail)) {
+		/*
+		 * When L1 launches L2 and then we (L0) fail to launch L2,
+		 * we nested_vmx_vmexit back to L1, but now should let it know
+		 * that the VMLAUNCH failed - with the same error that we
+		 * got when launching L2.
+		 */
+		vmx->fail = 0;
+		nested_vmx_failValid(vcpu, vmcs_read32(VM_INSTRUCTION_ERROR));
+	} else
+		nested_vmx_succeed(vcpu);
+
+	return 0;
+}
+
 static struct kvm_x86_ops vmx_x86_ops = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 20/27] nVMX: Deciding if L0 or L1 should handle an L2 exit
  2010-10-17 10:03 [PATCH 0/27] nVMX: Nested VMX, v6 Nadav Har'El
                   ` (18 preceding siblings ...)
  2010-10-17 10:13 ` [PATCH 19/27] nVMX: Exiting from L2 to L1 Nadav Har'El
@ 2010-10-17 10:13 ` Nadav Har'El
  2010-10-20 12:13   ` Avi Kivity
  2010-10-17 10:14 ` [PATCH 21/27] nVMX: Correct handling of interrupt injection Nadav Har'El
                   ` (6 subsequent siblings)
  26 siblings, 1 reply; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 10:13 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch contains the logic for deciding whether an L2 exit should be
handled by L0 and L2 then resumed, or whether L1 should be run to handle this
exit (using the nested_vmx_vmexit() function of the previous patch).

The basic idea is to let L1 handle the exit only if it actually asked to
trap this sort of event. For example, when L2 exits on a change to CR0,
we check L1's CR0_GUEST_HOST_MASK to see if L1 expressed interest in any
bit which changed; if it did, we exit to L1. If it didn't, it means that it
is we (L0) who wished to trap this event, so we handle it ourselves.
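
For the CR0 example, the test boils down to the following check (a minimal
sketch; the helper name is only for illustration, and the complete logic -
covering CR3/CR4/CR8, clts and lmsw as well - is in
nested_vmx_exit_handled_cr() in the diff below):

	static bool l1_wants_cr0_write_exit(struct vmcs_fields *vmcs12,
					    unsigned long val)
	{
		/* exit to L1 only if a bit that L1 intercepts actually changes */
		return vmcs12->cr0_guest_host_mask &
			(val ^ vmcs12->cr0_read_shadow);
	}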

The next two patches add additional logic of what to do when an interrupt or
exception is injected: Does L0 need to do it, should we exit to L1 to do it,
or should we resume L2 and keep the exception to be injected later.

We keep a new flag, "nested_run_pending", which can override the decision of
which should run next, L1 or L2. nested_run_pending=1 means that we *must* run
L2 next, not L1. This is necessary in particular when L1 did a VMLAUNCH of L2
and therefore expects L2 to be run (and perhaps be injected with an event it
specified, etc.). Nested_run_pending is especially intended to avoid switching
to L1 in the injection decision-point described above.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  205 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 205 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:02.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:02.000000000 +0200
@@ -335,6 +335,8 @@ struct nested_vmx {
 	/* Saving the VMCS that we used for running L1 */
 	struct vmcs *vmcs01;
 	struct vmcs_fields *vmcs01_fields;
+	/* L2 must run next, and mustn't decide to exit to L1. */
+	bool nested_run_pending;
 };
 
 struct vcpu_vmx {
@@ -845,6 +847,20 @@ static inline bool nested_vm_need_virtua
 		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
 }
 
+static inline bool nested_cpu_has_vmx_msr_bitmap(struct kvm_vcpu *vcpu)
+{
+	return get_vmcs12_fields(vcpu)->cpu_based_vm_exec_control &
+		CPU_BASED_USE_MSR_BITMAPS;
+}
+
+static inline bool is_exception(u32 intr_info)
+{
+	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
+		== (INTR_TYPE_HARD_EXCEPTION | INTR_INFO_VALID_MASK);
+}
+
+static int nested_vmx_vmexit(struct kvm_vcpu *vcpu, bool is_interrupt);
+
 static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
 {
 	int i;
@@ -4894,6 +4910,183 @@ static const int kvm_vmx_max_exit_handle
 	ARRAY_SIZE(kvm_vmx_exit_handlers);
 
 /*
+ * Return 1 if we should exit from L2 to L1 to handle an MSR access,
+ * rather than handle it ourselves in L0. I.e., check L1's MSR bitmap to see
+ * whether it expressed interest in the current event (a read or write of a
+ * specific MSR).
+ */
+static bool nested_vmx_exit_handled_msr(struct kvm_vcpu *vcpu,
+	struct vmcs_fields *vmcs12, u32 exit_reason)
+{
+	u32 msr_index = vcpu->arch.regs[VCPU_REGS_RCX];
+	struct page *msr_bitmap_page;
+	void *va;
+	bool ret;
+
+	if (!cpu_has_vmx_msr_bitmap() || !nested_cpu_has_vmx_msr_bitmap(vcpu))
+		return 1;
+
+	msr_bitmap_page = nested_get_page(vcpu, vmcs12->msr_bitmap);
+	if (!msr_bitmap_page) {
+		printk(KERN_INFO "%s error in nested_get_page\n", __func__);
+		return 0;
+	}
+
+	va = kmap_atomic(msr_bitmap_page, KM_USER1);
+	if (exit_reason == EXIT_REASON_MSR_WRITE)
+		va += 0x800;
+	if (msr_index >= 0xc0000000) {
+		msr_index -= 0xc0000000;
+		va += 0x400;
+	}
+	if (msr_index > 0x1fff) {
+		kunmap_atomic(va, KM_USER1);
+		return 0;
+	}
+	ret = test_bit(msr_index, va);
+	kunmap_atomic(va, KM_USER1);
+	return ret;
+}
+
+/*
+ * Return 1 if we should exit from L2 to L1 to handle a CR access exit,
+ * rather than handle it ourselves in L0. I.e., check if L1 wanted to
+ * intercept (via guest_host_mask etc.) the current event.
+ */
+static bool nested_vmx_exit_handled_cr(struct kvm_vcpu *vcpu,
+	struct vmcs_fields *vmcs12)
+{
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	int cr = exit_qualification & 15;
+	int reg = (exit_qualification >> 8) & 15;
+	unsigned long val = kvm_register_read(vcpu, reg);
+
+	switch ((exit_qualification >> 4) & 3) {
+	case 0: /* mov to cr */
+		switch (cr) {
+		case 0:
+			if (vmcs12->cr0_guest_host_mask &
+			    (val ^ vmcs12->cr0_read_shadow))
+				return 1;
+			break;
+		case 3:
+			if (vmcs12->cpu_based_vm_exec_control &
+			    CPU_BASED_CR3_LOAD_EXITING)
+				return 1;
+			break;
+		case 4:
+			if (vmcs12->cr4_guest_host_mask &
+			    (vmcs12->cr4_read_shadow ^ val))
+				return 1;
+			break;
+		case 8:
+			if (vmcs12->cpu_based_vm_exec_control &
+			    CPU_BASED_CR8_LOAD_EXITING)
+				return 1;
+			/*
+			 * TODO: missing else if control & CPU_BASED_TPR_SHADOW
+			 * then set tpr shadow and if below tpr_threshold, exit.
+			 */
+			break;
+		}
+		break;
+	case 2: /* clts */
+		if (vmcs12->cr0_guest_host_mask & X86_CR0_TS)
+			return 1;
+		break;
+	case 1: /* mov from cr */
+		switch (cr) {
+		case 0:
+			return 1;
+		case 3:
+			if (vmcs12->cpu_based_vm_exec_control &
+			    CPU_BASED_CR3_STORE_EXITING)
+				return 1;
+			break;
+		case 4:
+			return 1;
+			break;
+		case 8:
+			if (vmcs12->cpu_based_vm_exec_control &
+			    CPU_BASED_CR8_STORE_EXITING)
+				return 1;
+			break;
+		}
+		break;
+	case 3: /* lmsw */
+		/*
+		 * lmsw can change bits 1..3 of cr0, and only set bit 0 of
+		 * cr0. Other attempted changes are ignored, with no exit.
+		 */
+		if (vmcs12->cr0_guest_host_mask & 0xe &
+		    (val ^ vmcs12->cr0_read_shadow))
+			return 1;
+		if ((vmcs12->cr0_guest_host_mask & 0x1) &&
+		    !(vmcs12->cr0_read_shadow & 0x1) &&
+		    (val & 0x1))
+			return 1;
+		break;
+	}
+	return 0;
+}
+
+/*
+ * Return 1 if we should exit from L2 to L1 to handle an exit, or 0 if we
+ * should handle it ourselves in L0 (and then continue L2). Only call this
+ * when in nested_mode (L2).
+ */
+static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
+{
+	u32 exit_reason = vmcs_read32(VM_EXIT_REASON);
+	u32 intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vmcs_fields *vmcs12 = get_vmcs12_fields(vcpu);
+
+	if (vmx->nested.nested_run_pending)
+		return 0;
+
+	if (unlikely(vmx->fail)) {
+		printk(KERN_INFO "%s failed vm entry %x\n",
+		       __func__, vmcs_read32(VM_INSTRUCTION_ERROR));
+		return 1;
+	}
+
+	switch (exit_reason) {
+	case EXIT_REASON_EXTERNAL_INTERRUPT:
+		return 0;
+	case EXIT_REASON_EXCEPTION_NMI:
+		if (!is_exception(intr_info))
+			return 0;
+		else if (is_page_fault(intr_info) && (!enable_ept))
+			return 0;
+		return (vmcs12->exception_bitmap &
+				(1u << (intr_info & INTR_INFO_VECTOR_MASK)));
+	case EXIT_REASON_EPT_VIOLATION:
+		return 0;
+	case EXIT_REASON_INVLPG:
+		return (vmcs12->cpu_based_vm_exec_control &
+				CPU_BASED_INVLPG_EXITING);
+	case EXIT_REASON_MSR_READ:
+	case EXIT_REASON_MSR_WRITE:
+		return nested_vmx_exit_handled_msr(vcpu, vmcs12, exit_reason);
+	case EXIT_REASON_CR_ACCESS:
+		return nested_vmx_exit_handled_cr(vcpu, vmcs12);
+	case EXIT_REASON_DR_ACCESS:
+		return (vmcs12->cpu_based_vm_exec_control &
+				CPU_BASED_MOV_DR_EXITING);
+	default:
+		/*
+		 * One particularly interesting case that is covered here is an
+		 * exit caused by L2 running a VMX instruction. L2 is in guest
+		 * mode in L1's world, and according to the VMX spec running a
+		 * VMX instruction in guest mode should cause an exit to root
+		 * mode, i.e., to L1. This is why we need to return r=1 for
+		 * those exit reasons too. This enables further nesting: Like
+		 * L0 emulates VMX for L1, we now allow L1 to emulate VMX for
+		 * L2, who will then be able to run L3.
+		 */
+		return 1;
+	}
+}
+
+/*
  * The guest has exited.  See if we can fix it or if we need userspace
  * assistance.
  */
@@ -4909,6 +5102,17 @@ static int vmx_handle_exit(struct kvm_vc
 	if (vmx->emulation_required && emulate_invalid_guest_state)
 		return handle_invalid_guest_state(vcpu);
 
+	if (exit_reason == EXIT_REASON_VMLAUNCH ||
+	    exit_reason == EXIT_REASON_VMRESUME)
+		vmx->nested.nested_run_pending = 1;
+	else
+		vmx->nested.nested_run_pending = 0;
+
+	if (vmx->nested.nested_mode && nested_vmx_exit_handled(vcpu)) {
+		nested_vmx_vmexit(vcpu, false);
+		return 1;
+	}
+
 	/* Access CR3 don't cause VMExit in paging mode, so we need
 	 * to sync with guest real CR3. */
 	if (enable_ept && is_paging(vcpu))
@@ -5960,6 +6164,7 @@ static int nested_vmx_run(struct kvm_vcp
 		r = kvm_mmu_load(vcpu);
 		if (unlikely(r)) {
 			printk(KERN_ERR "Error in kvm_mmu_load r %d\n", r);
+			nested_vmx_vmexit(vcpu, false);
 			nested_vmx_failValid(vcpu,
 				VMXERR_VMRESUME_CORRUPTED_VMCS /* ? */);
 			/* switch back to L1 */

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 21/27] nVMX: Correct handling of interrupt injection
  2010-10-17 10:03 [PATCH 0/27] nVMX: Nested VMX, v6 Nadav Har'El
                   ` (19 preceding siblings ...)
  2010-10-17 10:13 ` [PATCH 20/27] nVMX: Deciding if L0 or L1 should handle an L2 exit Nadav Har'El
@ 2010-10-17 10:14 ` Nadav Har'El
  2010-10-17 10:14 ` [PATCH 22/27] nVMX: Correct handling of exception injection Nadav Har'El
                   ` (5 subsequent siblings)
  26 siblings, 0 replies; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 10:14 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

When KVM wants to inject an interrupt, the guest should think a real interrupt
has happened. Normally (in the non-nested case) this means checking that the
guest doesn't block interrupts (and if it does, inject when it doesn't - using
the "interrupt window" VMX mechanism), and setting up the appropriate VMCS
fields for the guest to receive the interrupt.

However, when we are running a nested guest (L2) and its hypervisor (L1)
requested exits on interrupts (as most hypervisors do), the most efficient
thing to do is to exit L2, telling L1 that the exit was caused by an
interrupt - the one we were injecting. Only when L1 asked not to be notified
of interrupts should we inject it directly into the running L2 guest (i.e.,
take the normal code path).

However, properly doing what is described above requires invasive changes to
the flow of the existing code, which we elected not to do in this stage.
Instead we do something more simplistic and less efficient: we modify
vmx_interrupt_allowed(), which kvm calls to see if it can inject the interrupt
now, to exit from L2 to L1 before continuing the normal code. The normal kvm
code then notices that L1 is blocking interrupts, and sets the interrupt
window to inject the interrupt later to L1. Shortly after, L1 gets the
interrupt while it is itself running, not as an exit from L2. The cost is an
extra L1 exit (the interrupt window).

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:02.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:02.000000000 +0200
@@ -3466,9 +3466,25 @@ out:
 	return ret;
 }
 
+/*
+ * In nested virtualization, check if L1 asked to exit on external interrupts.
+ * For most existing hypervisors, this will always return true.
+ */
+static bool nested_exit_on_intr(struct kvm_vcpu *vcpu)
+{
+	return get_vmcs12_fields(vcpu)->pin_based_vm_exec_control &
+		PIN_BASED_EXT_INTR_MASK;
+}
+
 static void enable_irq_window(struct kvm_vcpu *vcpu)
 {
 	u32 cpu_based_vm_exec_control;
+	if (to_vmx(vcpu)->nested.nested_mode && nested_exit_on_intr(vcpu))
+		/* We can get here when nested_run_pending caused
+		 * vmx_interrupt_allowed() to return false. In this case, do
+		 * nothing - the interrupt will be injected later.
+		 */
+		return;
 
 	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
 	cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
@@ -3577,6 +3593,13 @@ static void vmx_set_nmi_mask(struct kvm_
 
 static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
 {
+	if (to_vmx(vcpu)->nested.nested_mode && nested_exit_on_intr(vcpu)) {
+		if (to_vmx(vcpu)->nested.nested_run_pending)
+			return 0;
+		nested_vmx_vmexit(vcpu, true);
+		/* fall through to normal code, but now in L1, not L2 */
+	}
+
 	return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
 		!(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
 			(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
@@ -5102,6 +5125,14 @@ static int vmx_handle_exit(struct kvm_vc
 	if (vmx->emulation_required && emulate_invalid_guest_state)
 		return handle_invalid_guest_state(vcpu);
 
+	/*
+	 * the KVM_REQ_EVENT optimization bit is only on for one entry, and if
+	 * we did not inject a still-pending event to L1 now because of
+	 * nested_run_pending, we need to re-enable this bit.
+	 */
+	if (vmx->nested.nested_run_pending)
+		kvm_make_request(KVM_REQ_EVENT, vcpu);
+
 	if (exit_reason == EXIT_REASON_VMLAUNCH ||
 	    exit_reason == EXIT_REASON_VMRESUME)
 		vmx->nested.nested_run_pending = 1;
@@ -5298,6 +5329,8 @@ static void vmx_complete_interrupts(stru
 
 static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
 {
+	if (to_vmx(vcpu)->nested.nested_mode)
+		return;
 	__vmx_complete_interrupts(to_vmx(vcpu),
 				  vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
 				  VM_ENTRY_INSTRUCTION_LEN,

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 22/27] nVMX: Correct handling of exception injection
  2010-10-17 10:03 [PATCH 0/27] nVMX: Nested VMX, v6 Nadav Har'El
                   ` (20 preceding siblings ...)
  2010-10-17 10:14 ` [PATCH 21/27] nVMX: Correct handling of interrupt injection Nadav Har'El
@ 2010-10-17 10:14 ` Nadav Har'El
  2010-10-17 10:15 ` [PATCH 23/27] nVMX: Correct handling of idt vectoring info Nadav Har'El
                   ` (4 subsequent siblings)
  26 siblings, 0 replies; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 10:14 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Similar to the previous patch, but concerning injection of exceptions rather
than external interrupts.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:02.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:03.000000000 +0200
@@ -1498,6 +1498,25 @@ static void skip_emulated_instruction(st
 	vmx_set_interrupt_shadow(vcpu, 0);
 }
 
+/*
+ * KVM wants to inject page-faults which it got to the guest. This function
+ * checks whether, in a nested guest, they need to be injected into L1 or L2.
+ * This function assumes it is called with the exit reason in vmcs02 being
+ * a #PF exception (this is the only case in which KVM injects a #PF when L2
+ * is running).
+ */
+static int nested_pf_handled(struct kvm_vcpu *vcpu)
+{
+	struct vmcs_fields *vmcs12 = get_vmcs12_fields(vcpu);
+
+	/* TODO: also check PFEC_MATCH/MASK, not just EB.PF. */
+	if (!(vmcs12->exception_bitmap & (1u << PF_VECTOR)))
+		return 0;
+
+	nested_vmx_vmexit(vcpu, false);
+	return 1;
+}
+
 static void vmx_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
 				bool has_error_code, u32 error_code,
 				bool reinject)
@@ -1505,6 +1524,10 @@ static void vmx_queue_exception(struct k
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u32 intr_info = nr | INTR_INFO_VALID_MASK;
 
+	if (nr == PF_VECTOR && vmx->nested.nested_mode &&
+		nested_pf_handled(vcpu))
+		return;
+
 	if (has_error_code) {
 		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, error_code);
 		intr_info |= INTR_INFO_DELIVER_CODE_MASK;
@@ -3533,6 +3556,9 @@ static void vmx_inject_nmi(struct kvm_vc
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
+	if (vmx->nested.nested_mode)
+		return;
+
 	if (!cpu_has_virtual_nmis()) {
 		/*
 		 * Tracking the NMI-blocked state in software is built upon

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 23/27] nVMX: Correct handling of idt vectoring info
  2010-10-17 10:03 [PATCH 0/27] nVMX: Nested VMX, v6 Nadav Har'El
                   ` (21 preceding siblings ...)
  2010-10-17 10:14 ` [PATCH 22/27] nVMX: Correct handling of exception injection Nadav Har'El
@ 2010-10-17 10:15 ` Nadav Har'El
  2010-10-17 10:15 ` [PATCH 24/27] nVMX: Handling of CR0.TS and #NM for Lazy FPU loading Nadav Har'El
                   ` (3 subsequent siblings)
  26 siblings, 0 replies; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 10:15 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch adds correct handling of IDT_VECTORING_INFO_FIELD for the nested
case.

When a guest exits while handling an interrupt or exception, we get this
information in IDT_VECTORING_INFO_FIELD in the VMCS. When L2 exits to L1,
there's nothing we need to do, because L1 will see this field in vmcs12, and
handle it itself. However, when L2 exits and L0 handles the exit itself and
plans to return to L2, L0 must inject this event to L2.

In the normal non-nested case, the idt_vectoring_info is examined right after
the exit, and the decision to inject (though not the injection itself) is made
at that point. However, in the nested case a decision of whether to return
to L2 or L1 also happens during the injection phase (see the previous
patches), so in the nested case we can only decide what to do about the
idt_vectoring_info right after the injection, i.e., in the beginning of
vmx_vcpu_run, which is the first time we know for sure if we're staying in
L2 (i.e., nested_mode is true).

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:03.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:03.000000000 +0200
@@ -337,6 +337,10 @@ struct nested_vmx {
 	struct vmcs_fields *vmcs01_fields;
 	/* L2 must run next, and mustn't decide to exit to L1. */
 	bool nested_run_pending;
+	/* true if the last exit was from L2, and had a valid idt_vectoring_info */
+	bool valid_idt_vectoring_info;
+	/* These are saved if valid_idt_vectoring_info */
+	u32 vm_exit_instruction_len, idt_vectoring_error_code;
 };
 
 struct vcpu_vmx {
@@ -5365,6 +5369,22 @@ static void vmx_cancel_injection(struct 
 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
 }
 
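+/*
+ * The last exit from L2 was handled by L0 while an event was being delivered
+ * to L2 (idt_vectoring_info was valid; its details were saved at exit time).
+ * Since we are going back into L2, re-inject that event, together with its
+ * instruction length and error code, on the coming entry.
+ */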
+static void nested_handle_valid_idt_vectoring_info(struct vcpu_vmx *vmx)
+{
+	int irq = vmx->idt_vectoring_info & VECTORING_INFO_VECTOR_MASK;
+	int type = vmx->idt_vectoring_info & VECTORING_INFO_TYPE_MASK;
+	int err_code_valid = vmx->idt_vectoring_info &
+		VECTORING_INFO_DELIVER_CODE_MASK;
+	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+		irq | type | INTR_INFO_VALID_MASK | err_code_valid);
+
+	vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
+		vmx->nested.vm_exit_instruction_len);
+	if (err_code_valid)
+		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
+			vmx->nested.idt_vectoring_error_code);
+}
+
 static inline void sync_cached_regs_to_vmcs(struct kvm_vcpu *vcpu)
 {
 	if (test_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_dirty))
@@ -5386,6 +5406,9 @@ static void vmx_vcpu_run(struct kvm_vcpu
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
+	if (vmx->nested.nested_mode && vmx->nested.valid_idt_vectoring_info)
+		nested_handle_valid_idt_vectoring_info(vmx);
+
 	/* Record the guest's net vcpu time for enforced NMI injections. */
 	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked))
 		vmx->entry_time = ktime_get();
@@ -5506,6 +5529,15 @@ static void vmx_vcpu_run(struct kvm_vcpu
 
 	vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
 
+	vmx->nested.valid_idt_vectoring_info = vmx->nested.nested_mode &&
+		(vmx->idt_vectoring_info & VECTORING_INFO_VALID_MASK);
+	if (vmx->nested.valid_idt_vectoring_info) {
+		vmx->nested.vm_exit_instruction_len =
+			vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+		vmx->nested.idt_vectoring_error_code =
+			vmcs_read32(IDT_VECTORING_ERROR_CODE);
+	}
+
 	asm("mov %0, %%ds; mov %0, %%es" : : "r"(__USER_DS));
 	vmx->launched = 1;
 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 24/27] nVMX: Handling of CR0.TS and #NM for Lazy FPU loading
  2010-10-17 10:03 [PATCH 0/27] nVMX: Nested VMX, v6 Nadav Har'El
                   ` (22 preceding siblings ...)
  2010-10-17 10:15 ` [PATCH 23/27] nVMX: Correct handling of idt vectoring info Nadav Har'El
@ 2010-10-17 10:15 ` Nadav Har'El
  2010-10-17 10:16 ` [PATCH 25/27] nVMX: Additional TSC-offset handling Nadav Har'El
                   ` (2 subsequent siblings)
  26 siblings, 0 replies; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 10:15 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

KVM's "Lazy FPU loading" means that sometimes L0 needs to set CR0.TS, even
if a guest didn't set it. Moreover, L0 must also trap CR0.TS changes and
NM exceptions, even if we have a guest hypervisor (L1) who didn't want these
traps. And of course, conversely: If L1 wanted to trap these events, we
must let it, even if L0 is not interested in them.
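
In other words, while L2 runs, the exception bitmap that the hardware actually
uses has to be the union of what L0 needs and what L1 asked for. A minimal
sketch of that merge (the helper name is only for illustration; the real code
does this inside update_exception_bitmap() and vmx_fpu_activate() below):

	static u32 nested_merged_exception_bitmap(u32 l0_bitmap,
						  struct vmcs_fields *vmcs12)
	{
		/* trap every exception that either L0 or L1 cares about */
		return l0_bitmap | vmcs12->exception_bitmap;
	}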

This patch fixes some existing KVM code (in update_exception_bitmap(),
vmx_fpu_activate(), vmx_fpu_deactivate(), handle_cr()) to do the correct
merging of L0's and L1's needs. Note that new code introduced in previous
patches already handles CR0 correctly (see prepare_vmcs02(),
prepare_vmcs12(), and nested_vmx_vmexit()).

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   90 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 85 insertions(+), 5 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:03.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:03.000000000 +0200
@@ -1098,6 +1098,17 @@ static void update_exception_bitmap(stru
 		eb &= ~(1u << PF_VECTOR); /* bypass_guest_pf = 0 */
 	if (vcpu->fpu_active)
 		eb &= ~(1u << NM_VECTOR);
+
+	/* When we are running a nested L2 guest and L1 specified for it a
+	 * certain exception bitmap, we must trap the same exceptions and pass
+	 * them to L1. When running L2, we will only handle the exceptions
+	 * specified above if L1 did not want them.
+	 */
+	if (to_vmx(vcpu)->nested.nested_mode) {
+		u32 nested_eb = get_vmcs12_fields(vcpu)->exception_bitmap;
+		eb |= nested_eb;
+	}
+
 	vmcs_write32(EXCEPTION_BITMAP, eb);
 }
 
@@ -1422,8 +1433,19 @@ static void vmx_fpu_activate(struct kvm_
 	cr0 &= ~(X86_CR0_TS | X86_CR0_MP);
 	cr0 |= kvm_read_cr0_bits(vcpu, X86_CR0_TS | X86_CR0_MP);
 	vmcs_writel(GUEST_CR0, cr0);
-	update_exception_bitmap(vcpu);
 	vcpu->arch.cr0_guest_owned_bits = X86_CR0_TS;
+	if (to_vmx(vcpu)->nested.nested_mode) {
+		/* While we (L0) no longer care about NM exceptions or cr0.TS
+		 * changes, our guest hypervisor (L1) might care in which case
+		 * we must trap them for it.
+		 */
+		u32 eb = vmcs_read32(EXCEPTION_BITMAP) & ~(1u << NM_VECTOR);
+		struct vmcs_fields *vmcs12 = get_vmcs12_fields(vcpu);
+		eb |= vmcs12->exception_bitmap;
+		vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
+		vmcs_write32(EXCEPTION_BITMAP, eb);
+	} else
+		update_exception_bitmap(vcpu);
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
 }
 
@@ -1431,12 +1453,24 @@ static void vmx_decache_cr0_guest_bits(s
 
 static void vmx_fpu_deactivate(struct kvm_vcpu *vcpu)
 {
+	/* Note that there is no vcpu->fpu_active = 0 here. The caller must
+	 * set this *before* calling this function.
+	 */
 	vmx_decache_cr0_guest_bits(vcpu);
 	vmcs_set_bits(GUEST_CR0, X86_CR0_TS | X86_CR0_MP);
-	update_exception_bitmap(vcpu);
+	vmcs_write32(EXCEPTION_BITMAP,
+		vmcs_read32(EXCEPTION_BITMAP) | (1u << NM_VECTOR));
 	vcpu->arch.cr0_guest_owned_bits = 0;
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
-	vmcs_writel(CR0_READ_SHADOW, vcpu->arch.cr0);
+	if (to_vmx(vcpu)->nested.nested_mode)
+		/* Unfortunately in nested mode we play with arch.cr0's PG
+		 * bit, so we mustn't copy it all, just the relevant TS bit
+		 */
+		vmcs_writel(CR0_READ_SHADOW,
+			(vmcs_readl(CR0_READ_SHADOW) & ~X86_CR0_TS) |
+			(vcpu->arch.cr0 & X86_CR0_TS));
+	else
+		vmcs_writel(CR0_READ_SHADOW, vcpu->arch.cr0);
 }
 
 static unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu)
@@ -3876,6 +3910,52 @@ static void complete_insn_gp(struct kvm_
 		skip_emulated_instruction(vcpu);
 }
 
+/* called to set cr0 as appropriate for a mov-to-cr0 exit. */
+static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val)
+{
+	if (to_vmx(vcpu)->nested.nested_mode) {
+		/* When running L2, we usually do what L1 wants: it decides
+		 * which cr0 bits to intercept, we forward it cr0-change events
+		 * (see nested_vmx_exit_handled()). We only get here when a cr0
+		 * bit was changed that L1 did not ask to intercept, but L0
+		 * nevertheless did. Currently this can only happen with the TS
+		 * bit (see CR0_GUEST_HOST_MASK in prepare_vmcs02()).
+		 * We must change only this bit in GUEST_CR0 and CR0_READ_SHADOW
+		 * and not call kvm_set_cr0 because it enforces a relationship
+		 * between the two that is specific to KVM (i.e., only the TS
+		 * bit might differ) and with which L1 might not agree.
+		 */
+		unsigned long new_cr0 = vmcs_readl(GUEST_CR0);
+		unsigned long new_cr0_rs = vmcs_readl(CR0_READ_SHADOW);
+		if (val & X86_CR0_TS) {
+			new_cr0 |= X86_CR0_TS;
+			new_cr0_rs |= X86_CR0_TS;
+			vcpu->arch.cr0 |= X86_CR0_TS;
+		} else {
+			new_cr0 &= ~X86_CR0_TS;
+			new_cr0_rs &= ~X86_CR0_TS;
+			vcpu->arch.cr0 &= ~X86_CR0_TS;
+		}
+		vmcs_writel(GUEST_CR0, new_cr0);
+		vmcs_writel(CR0_READ_SHADOW, new_cr0_rs);
+		return 0;
+	} else
+		return kvm_set_cr0(vcpu, val);
+}
+
+/* called to set cr0 as appropriate for a clts instruction exit. */
+static void handle_clts(struct kvm_vcpu *vcpu)
+{
+	if (to_vmx(vcpu)->nested.nested_mode) {
+		/* As in handle_set_cr0(), we can't call vmx_set_cr0 here */
+		vmcs_writel(GUEST_CR0, vmcs_readl(GUEST_CR0) & ~X86_CR0_TS);
+		vmcs_writel(CR0_READ_SHADOW,
+			vmcs_readl(CR0_READ_SHADOW) & ~X86_CR0_TS);
+		vcpu->arch.cr0 &= ~X86_CR0_TS;
+	} else
+		vmx_set_cr0(vcpu, kvm_read_cr0_bits(vcpu, ~X86_CR0_TS));
+}
+
 static int handle_cr(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification, val;
@@ -3892,7 +3972,7 @@ static int handle_cr(struct kvm_vcpu *vc
 		trace_kvm_cr_write(cr, val);
 		switch (cr) {
 		case 0:
-			err = kvm_set_cr0(vcpu, val);
+			err = handle_set_cr0(vcpu, val);
 			complete_insn_gp(vcpu, err);
 			return 1;
 		case 3:
@@ -3918,7 +3998,7 @@ static int handle_cr(struct kvm_vcpu *vc
 		};
 		break;
 	case 2: /* clts */
-		vmx_set_cr0(vcpu, kvm_read_cr0_bits(vcpu, ~X86_CR0_TS));
+		handle_clts(vcpu);
 		trace_kvm_cr_write(0, kvm_read_cr0(vcpu));
 		skip_emulated_instruction(vcpu);
 		vmx_fpu_activate(vcpu);

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 25/27] nVMX: Additional TSC-offset handling
  2010-10-17 10:03 [PATCH 0/27] nVMX: Nested VMX, v6 Nadav Har'El
                   ` (23 preceding siblings ...)
  2010-10-17 10:15 ` [PATCH 24/27] nVMX: Handling of CR0.TS and #NM for Lazy FPU loading Nadav Har'El
@ 2010-10-17 10:16 ` Nadav Har'El
  2010-10-19 19:13   ` Zachary Amsden
  2010-10-17 10:16 ` [PATCH 26/27] nVMX: Miscellaneous small corrections Nadav Har'El
  2010-10-17 10:17 ` [PATCH 27/27] nVMX: Documentation Nadav Har'El
  26 siblings, 1 reply; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 10:16 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

In the unlikely case that L1 does not trap writes to MSR_IA32_TSC, L0 needs to
emulate such an MSR write by L2 by modifying vmcs02.tsc_offset.
We also need to set vmcs12.tsc_offset, so that this change survives the next
nested entry (see prepare_vmcs02()).
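
The arithmetic is a matter of keeping intact the relation that prepare_vmcs02()
sets up, namely that while L2 runs vmcs02.tsc_offset is the sum of
vmcs01.tsc_offset and vmcs12.tsc_offset. A minimal sketch (the helper name is
only for illustration):

	static u64 nested_new_vmcs12_tsc_offset(u64 new_vmcs02_offset,
						u64 vmcs01_tsc_offset)
	{
		/* keep vmcs02.tsc_offset == vmcs01.tsc_offset + vmcs12.tsc_offset */
		return new_vmcs02_offset - vmcs01_tsc_offset;
	}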

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   11 +++++++++++
 1 file changed, 11 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:03.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:03.000000000 +0200
@@ -1674,12 +1674,23 @@ static u64 guest_read_tsc(void)
 static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 {
 	vmcs_write64(TSC_OFFSET, offset);
+	if (to_vmx(vcpu)->nested.nested_mode)
+		/*
+		 * We only get here with L2 running if, for some reason, L1
+		 * chose not to trap writes to the TSC MSR. Since
+		 * prepare_vmcs12() does not copy tsc_offset, we need to also
+		 * set the vmcs12 field here.
+		 */
+		get_vmcs12_fields(vcpu)->tsc_offset = offset -
+			to_vmx(vcpu)->nested.vmcs01_fields->tsc_offset;
 }
 
 static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
 {
 	u64 offset = vmcs_read64(TSC_OFFSET);
 	vmcs_write64(TSC_OFFSET, offset + adjustment);
+	if (to_vmx(vcpu)->nested.nested_mode)
+		get_vmcs12_fields(vcpu)->tsc_offset += adjustment;
 }
 
 /*

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 26/27] nVMX: Miscellaneous small corrections
  2010-10-17 10:03 [PATCH 0/27] nVMX: Nested VMX, v6 Nadav Har'El
                   ` (24 preceding siblings ...)
  2010-10-17 10:16 ` [PATCH 25/27] nVMX: Additional TSC-offset handling Nadav Har'El
@ 2010-10-17 10:16 ` Nadav Har'El
  2010-10-17 10:17 ` [PATCH 27/27] nVMX: Documentation Nadav Har'El
  26 siblings, 0 replies; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 10:16 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Small corrections to KVM (spelling, etc.) that are not directly related to nested VMX.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:03.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:03.000000000 +0200
@@ -933,7 +933,7 @@ static void vmcs_load(struct vmcs *vmcs)
 			: "=g"(error) : "a"(&phys_addr), "m"(phys_addr)
 			: "cc", "memory");
 	if (error)
-		printk(KERN_ERR "kvm: vmptrld %p/%llx fail\n",
+		printk(KERN_ERR "kvm: vmptrld %p/%llx failed\n",
 		       vmcs, phys_addr);
 }
 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 27/27] nVMX: Documentation
  2010-10-17 10:03 [PATCH 0/27] nVMX: Nested VMX, v6 Nadav Har'El
                   ` (25 preceding siblings ...)
  2010-10-17 10:16 ` [PATCH 26/27] nVMX: Miscellaneous small corrections Nadav Har'El
@ 2010-10-17 10:17 ` Nadav Har'El
  26 siblings, 0 replies; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 10:17 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch includes a brief introduction to the nested vmx feature in the
Documentation/kvm directory. The document also includes a copy of the
vmcs12 structure, as requested by Avi Kivity.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 Documentation/kvm/nested-vmx.txt |  237 +++++++++++++++++++++++++++++
 1 file changed, 237 insertions(+)

--- .before/Documentation/kvm/nested-vmx.txt	2010-10-17 11:52:03.000000000 +0200
+++ .after/Documentation/kvm/nested-vmx.txt	2010-10-17 11:52:03.000000000 +0200
@@ -0,0 +1,237 @@
+Nested VMX
+==========
+
+Overview
+--------
+
+On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
+to easily and efficiently run guest operating systems. Normally, these guests
+*cannot* themselves be hypervisors running their own guests, because in VMX,
+guests cannot use VMX instructions.
+
+The "Nested VMX" feature adds this missing capability - of running guest
+hypervisors (which use VMX) with their own nested guests. It does so by
+allowing a guest to use VMX instructions, and correctly and efficiently
+emulating them using the single level of VMX available in the hardware.
+
+We describe in much greater detail the theory behind the nested VMX feature,
+its implementation and its performance characteristics, in the OSDI 2010 paper
+"The Turtles Project: Design and Implementation of Nested Virtualization",
+available at:
+
+	http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf
+
+
+Terminology
+-----------
+
+Single-level virtualization has two levels - the host (KVM) and the guests.
+In nested virtualization, we have three levels: The host (KVM), which we call
+L0, the guest hypervisor, which we call L1, and the nested guest, which we
+call L2.
+
+
+Known limitations
+-----------------
+
+The current code supports running Linux under a nested KVM using shadow
+page tables. It supports multiple nested hypervisors, which can run multiple
+guests. Only 64-bit nested hypervisors are supported. SMP is supported, but
+is known to be buggy in this release.
+Additional patches for running Windows under nested KVM and Linux under
+nested VMware Server, as well as support for nested EPT, are currently
+running in the lab and will be sent as follow-on patchsets.
+
+
+Running nested VMX
+------------------
+
+The nested VMX feature is disabled by default. It can be enabled by giving
+the "nested=1" option to the kvm-intel module.
+
+
+ABIs
+----
+
+Nested VMX aims to present a standard and (eventually) fully-functional VMX
+implementation for a guest hypervisor to use. As such, the official
+specification of the ABI that it provides is Intel's VMX specification,
+namely volume 3B of their "Intel 64 and IA-32 Architectures Software
+Developer's Manual". Not all of VMX's features are currently fully supported,
+but the goal is to eventually support them all, starting with the VMX features
+which are used in practice by popular hypervisors (KVM and others).
+
+As a VMX implementation, nested VMX presents a VMCS structure to L1.
+As mandated by the spec, other than the two fields revision_id and abort,
+this structure is *opaque* to its user, who is not supposed to know or care
+about its internal structure. Rather, the structure is accessed through the
+VMREAD and VMWRITE instructions.
+Still, for debugging purposes, KVM developers might be interested in knowing
+the internals of this structure; this is struct vmcs12 from arch/x86/kvm/vmx.c.
+For convenience, we repeat its content here. If the internals of this structure
+change, that can break live migration across KVM versions. VMCS12_REVISION
+(from vmx.c) should be changed if struct vmcs12 or its inner struct shadow_vmcs
+is ever changed.
+
+struct __packed vmcs12 {
+	/* According to the Intel spec, a VMCS region must start with the
+	 * following two fields. Then follow implementation-specific data.
+	 */
+	u32 revision_id;
+	u32 abort;
+
+	struct shadow_vmcs shadow_vmcs;
+
+	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
+
+	int cpu;
+	int launched;
+};
+
+struct __packed shadow_vmcs {
+	u16 virtual_processor_id;
+	u16 guest_es_selector;
+	u16 guest_cs_selector;
+	u16 guest_ss_selector;
+	u16 guest_ds_selector;
+	u16 guest_fs_selector;
+	u16 guest_gs_selector;
+	u16 guest_ldtr_selector;
+	u16 guest_tr_selector;
+	u16 host_es_selector;
+	u16 host_cs_selector;
+	u16 host_ss_selector;
+	u16 host_ds_selector;
+	u16 host_fs_selector;
+	u16 host_gs_selector;
+	u16 host_tr_selector;
+	u64 io_bitmap_a;
+	u64 io_bitmap_b;
+	u64 msr_bitmap;
+	u64 vm_exit_msr_store_addr;
+	u64 vm_exit_msr_load_addr;
+	u64 vm_entry_msr_load_addr;
+	u64 tsc_offset;
+	u64 virtual_apic_page_addr;
+	u64 apic_access_addr;
+	u64 ept_pointer;
+	u64 guest_physical_address;
+	u64 vmcs_link_pointer;
+	u64 guest_ia32_debugctl;
+	u64 guest_ia32_pat;
+	u64 guest_pdptr0;
+	u64 guest_pdptr1;
+	u64 guest_pdptr2;
+	u64 guest_pdptr3;
+	u64 host_ia32_pat;
+	u32 pin_based_vm_exec_control;
+	u32 cpu_based_vm_exec_control;
+	u32 exception_bitmap;
+	u32 page_fault_error_code_mask;
+	u32 page_fault_error_code_match;
+	u32 cr3_target_count;
+	u32 vm_exit_controls;
+	u32 vm_exit_msr_store_count;
+	u32 vm_exit_msr_load_count;
+	u32 vm_entry_controls;
+	u32 vm_entry_msr_load_count;
+	u32 vm_entry_intr_info_field;
+	u32 vm_entry_exception_error_code;
+	u32 vm_entry_instruction_len;
+	u32 tpr_threshold;
+	u32 secondary_vm_exec_control;
+	u32 vm_instruction_error;
+	u32 vm_exit_reason;
+	u32 vm_exit_intr_info;
+	u32 vm_exit_intr_error_code;
+	u32 idt_vectoring_info_field;
+	u32 idt_vectoring_error_code;
+	u32 vm_exit_instruction_len;
+	u32 vmx_instruction_info;
+	u32 guest_es_limit;
+	u32 guest_cs_limit;
+	u32 guest_ss_limit;
+	u32 guest_ds_limit;
+	u32 guest_fs_limit;
+	u32 guest_gs_limit;
+	u32 guest_ldtr_limit;
+	u32 guest_tr_limit;
+	u32 guest_gdtr_limit;
+	u32 guest_idtr_limit;
+	u32 guest_es_ar_bytes;
+	u32 guest_cs_ar_bytes;
+	u32 guest_ss_ar_bytes;
+	u32 guest_ds_ar_bytes;
+	u32 guest_fs_ar_bytes;
+	u32 guest_gs_ar_bytes;
+	u32 guest_ldtr_ar_bytes;
+	u32 guest_tr_ar_bytes;
+	u32 guest_interruptibility_info;
+	u32 guest_activity_state;
+	u32 guest_sysenter_cs;
+	u32 host_ia32_sysenter_cs;
+	unsigned long cr0_guest_host_mask;
+	unsigned long cr4_guest_host_mask;
+	unsigned long cr0_read_shadow;
+	unsigned long cr4_read_shadow;
+	unsigned long cr3_target_value0;
+	unsigned long cr3_target_value1;
+	unsigned long cr3_target_value2;
+	unsigned long cr3_target_value3;
+	unsigned long exit_qualification;
+	unsigned long guest_linear_address;
+	unsigned long guest_cr0;
+	unsigned long guest_cr3;
+	unsigned long guest_cr4;
+	unsigned long guest_es_base;
+	unsigned long guest_cs_base;
+	unsigned long guest_ss_base;
+	unsigned long guest_ds_base;
+	unsigned long guest_fs_base;
+	unsigned long guest_gs_base;
+	unsigned long guest_ldtr_base;
+	unsigned long guest_tr_base;
+	unsigned long guest_gdtr_base;
+	unsigned long guest_idtr_base;
+	unsigned long guest_dr7;
+	unsigned long guest_rsp;
+	unsigned long guest_rip;
+	unsigned long guest_rflags;
+	unsigned long guest_pending_dbg_exceptions;
+	unsigned long guest_sysenter_esp;
+	unsigned long guest_sysenter_eip;
+	unsigned long host_cr0;
+	unsigned long host_cr3;
+	unsigned long host_cr4;
+	unsigned long host_fs_base;
+	unsigned long host_gs_base;
+	unsigned long host_tr_base;
+	unsigned long host_gdtr_base;
+	unsigned long host_idtr_base;
+	unsigned long host_ia32_sysenter_esp;
+	unsigned long host_ia32_sysenter_eip;
+	unsigned long host_rsp;
+	unsigned long host_rip;
+};
+
+
+Authors
+-------
+
+These patches were written by:
+     Abel Gordon, abelg <at> il.ibm.com
+     Nadav Har'El, nyh <at> il.ibm.com
+     Orit Wasserman, oritw <at> il.ibm.com
+     Ben-Ami Yassor, benami <at> il.ibm.com
+     Muli Ben-Yehuda, muli <at> il.ibm.com
+
+With contributions by:
+     Anthony Liguori, aliguori <at> us.ibm.com
+     Mike Day, mdday <at> us.ibm.com
+     Michael Factor, factor <at> il.ibm.com
+     Zvi Dubitzky, dubi <at> il.ibm.com
+
+And valuable reviews by:
+     Avi Kivity, avi <at> redhat.com
+     Gleb Natapov, gleb <at> redhat.com
+     and others.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 03/27] nVMX: Implement VMXON and VMXOFF
  2010-10-17 10:05 ` [PATCH 03/27] nVMX: Implement VMXON and VMXOFF Nadav Har'El
@ 2010-10-17 12:24   ` Avi Kivity
  2010-10-17 12:47     ` Nadav Har'El
  2010-10-17 13:07   ` Avi Kivity
  1 sibling, 1 reply; 56+ messages in thread
From: Avi Kivity @ 2010-10-17 12:24 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

  On 10/17/2010 12:05 PM, Nadav Har'El wrote:
> This patch allows a guest to use the VMXON and VMXOFF instructions, and
> emulates them accordingly. Basically this amounts to checking some
> prerequisites, and then remembering whether the guest has enabled or disabled
> VMX operation.
>
>
> +/*
> + * Emulate the VMXON instruction.
> + * Currently, we just remember that VMX is active, and do not save or even
> + * inspect the argument to VMXON (the so-called "VMXON pointer") because we
> + * do not currently need to store anything in that guest-allocated memory
> + * region. Consequently, VMCLEAR and VMPTRLD also do not verify that the their
> + * argument is different from the VMXON pointer (which the spec says they do).
> + */
> +static int handle_vmon(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_segment cs;
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	/* The Intel VMX Instruction Reference lists a bunch of bits that
> +	 * are prerequisite to running VMXON, most notably CR4.VMXE must be
> +	 * set to 1. Otherwise, we should fail with #UD. We test these now:
> +	 */
> +	if (!nested ||

Is the !nested case needed?  Presumably cr4.vmxe will be clear if !nested.

> +	    !kvm_read_cr4_bits(vcpu, X86_CR4_VMXE) ||
> +	    !kvm_read_cr0_bits(vcpu, X86_CR0_PE) ||
> +	    (vmx_get_rflags(vcpu) & X86_EFLAGS_VM)) {
> +		kvm_queue_exception(vcpu, UD_VECTOR);
> +		return 1;
> +	}
> +
> +	vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
> +	if (is_long_mode(vcpu) && !cs.l) {
> +		kvm_queue_exception(vcpu, UD_VECTOR);
> +		return 1;
> +	}
> +
> +	if (vmx_get_cpl(vcpu)) {
> +		kvm_inject_gp(vcpu, 0);
> +		return 1;
> +	}
> +
> +	vmx->nested.vmxon = true;
> +
> +	skip_emulated_instruction(vcpu);
> +	return 1;
> +}
>

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 04/27] nVMX: Allow setting the VMXE bit in CR4
  2010-10-17 10:05 ` [PATCH 04/27] nVMX: Allow setting the VMXE bit in CR4 Nadav Har'El
@ 2010-10-17 12:31   ` Avi Kivity
  0 siblings, 0 replies; 56+ messages in thread
From: Avi Kivity @ 2010-10-17 12:31 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

  On 10/17/2010 12:05 PM, Nadav Har'El wrote:
> This patch allows the guest to enable the VMXE bit in CR4, which is a
> prerequisite to running VMXON.
>
> Whether to allow setting the VMXE bit now depends on the architecture (svm
> or vmx), so its checking has moved to kvm_x86_ops->set_cr4(). This function
> now returns an int: If kvm_x86_ops->set_cr4() returns 1, __kvm_set_cr4()
> will also return 1, and this will cause kvm_set_cr4() will throw a #GP.
>
> Turning on the VMXE bit is allowed only when the "nested" module option is on,
> and turning it off is forbidden after a vmxon.
>
>
>
> -static void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
> +static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>   {
>   	unsigned long hw_cr4 = cr4 | (to_vmx(vcpu)->rmode.vm86_active ?
>   		    KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON);
>
> +	if (cr4 & X86_CR4_VMXE){
> +		if (!nested)
> +			return 1;

Need to check cpuid.vmx as well, otherwise we can't turn off vmx for 
specific guests (only for the entire machine).

> +	} else {
> +		if (nested && to_vmx(vcpu)->nested.vmxon)
> +			return 1;
> +	}
> +

Unrelated, if nested.vmxon, shouldn't we forbid clearing certain bits of 
cr0?

It occurs to me that these bits are not fixed, instead specified by an 
MSR.  So we can simply have the MSR allow them (if we're sure that it 
would work; don't see why not).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 05/27] nVMX: Introduce vmcs12: a VMCS structure for L1
  2010-10-17 10:06 ` [PATCH 05/27] nVMX: Introduce vmcs12: a VMCS structure for L1 Nadav Har'El
@ 2010-10-17 12:34   ` Avi Kivity
  2010-10-17 13:18     ` Nadav Har'El
  0 siblings, 1 reply; 56+ messages in thread
From: Avi Kivity @ 2010-10-17 12:34 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

  On 10/17/2010 12:06 PM, Nadav Har'El wrote:
> An implementation of VMX needs to define a VMCS structure. This structure
> is kept in guest memory, but is opaque to the guest (who can only read or
> write it with VMX instructions).
>
> This patch starts to define the VMCS structure which our nested VMX
> implementation will present to L1. We call it "vmcs12", as it is the VMCS
> that L1 keeps for its L2 guests. We will add more content to this structure
> in later patches.
>
> This patch also adds the notion (as required by the VMX spec) of L1's "current
> VMCS", and finally includes utility functions for mapping the guest-allocated
> VMCSs in host memory.
>
>
> @@ -3467,6 +3521,11 @@ static int handle_vmoff(struct kvm_vcpu
>
>   	to_vmx(vcpu)->nested.vmxon = false;
>
> +	if(to_vmx(vcpu)->nested.current_vmptr != -1ull){

Missing whitespace after if and before {.

> +		kunmap(to_vmx(vcpu)->nested.current_vmcs12_page);
> +		nested_release_page(to_vmx(vcpu)->nested.current_vmcs12_page);
> +	}
> +
>   	skip_emulated_instruction(vcpu);
>   	return 1;
>   }
> @@ -4170,6 +4229,10 @@ static void vmx_free_vcpu(struct kvm_vcp
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
>
>   	free_vpid(vmx);
> +	if (vmx->nested.vmxon && to_vmx(vcpu)->nested.current_vmptr != -1ull){
> +		kunmap(to_vmx(vcpu)->nested.current_vmcs12_page);
> +		nested_release_page(to_vmx(vcpu)->nested.current_vmcs12_page);
> +	}

Duplication - helper?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 03/27] nVMX: Implement VMXON and VMXOFF
  2010-10-17 12:24   ` Avi Kivity
@ 2010-10-17 12:47     ` Nadav Har'El
  0 siblings, 0 replies; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 12:47 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, gleb

On Sun, Oct 17, 2010, Avi Kivity wrote about "Re: [PATCH 03/27] nVMX: Implement VMXON and VMXOFF":
> >+static int handle_vmon(struct kvm_vcpu *vcpu)
>..
> >+	if (!nested ||
> 
> Is the !nested case needed?  Presumably cr4.vmxe will be clear is !nested.

Right - I just added this as a redundant security measure - even if you
somehow manage to set cr4.VMXE, you still won't be able to turn on vmx when
the 'nested' module option is off. If you don't like it, I'll remove this
extra test.

-- 
Nadav Har'El                        |      Sunday, Oct 17 2010, 9 Heshvan 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |:(){ :|:&};: # DANGER: DO NOT run this,
http://nadav.harel.org.il           |unless you REALLY know what you're doing!

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 06/27] nVMX: Implement reading and writing of VMX MSRs
  2010-10-17 10:06 ` [PATCH 06/27] nVMX: Implement reading and writing of VMX MSRs Nadav Har'El
@ 2010-10-17 12:52   ` Avi Kivity
  0 siblings, 0 replies; 56+ messages in thread
From: Avi Kivity @ 2010-10-17 12:52 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

  On 10/17/2010 12:06 PM, Nadav Har'El wrote:
> When the guest can use VMX instructions (when the "nested" module option is
> on), it should also be able to read and write VMX MSRs, e.g., to query about
> VMX capabilities. This patch adds this support.
>
> Signed-off-by: Nadav Har'El<nyh@il.ibm.com>
> ---
>   arch/x86/kvm/vmx.c |  117 +++++++++++++++++++++++++++++++++++++++++++
>   arch/x86/kvm/x86.c |    6 +-
>   2 files changed, 122 insertions(+), 1 deletion(-)
>
> --- .before/arch/x86/kvm/x86.c	2010-10-17 11:52:00.000000000 +0200
> +++ .after/arch/x86/kvm/x86.c	2010-10-17 11:52:00.000000000 +0200
> @@ -789,7 +789,11 @@ static u32 msrs_to_save[] = {
>   #ifdef CONFIG_X86_64
>   	MSR_CSTAR, MSR_KERNEL_GS_BASE, MSR_SYSCALL_MASK, MSR_LSTAR,
>   #endif
> -	MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA
> +	MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
> +	MSR_IA32_FEATURE_CONTROL,  MSR_IA32_VMX_BASIC,
> +	MSR_IA32_VMX_PINBASED_CTLS, MSR_IA32_VMX_PROCBASED_CTLS,
> +	MSR_IA32_VMX_EXIT_CTLS, MSR_IA32_VMX_ENTRY_CTLS,
> +	MSR_IA32_VMX_PROCBASED_CTLS2, MSR_IA32_VMX_EPT_VPID_CAP,
>   };

These MSRs are read-only by the guest (except FEATURE_CONTROL).  No need 
to save/restore them.

>
>   static unsigned num_msrs_to_save;
> --- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:00.000000000 +0200
> +++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:00.000000000 +0200
> @@ -1216,6 +1216,119 @@ static void vmx_adjust_tsc_offset(struct
>   }
>
>   /*
> + * If we allow our guest to use VMX instructions (i.e., nested VMX), we should
> + * also let it use VMX-specific MSRs.
> + * vmx_get_vmx_msr() and vmx_set_vmx_msr() return 0 when we handled a
> + * VMX-specific MSR, or 1 when we haven't (and the caller should handled it
> + * like all other MSRs).
> + */
> +static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
> +{
> +	u64 vmx_msr = 0;
> +	u32 vmx_msr_high, vmx_msr_low;
> +
> +	switch (msr_index) {
> +	case MSR_IA32_FEATURE_CONTROL:
> +		*pdata = 0;
> +		break;
> +	case MSR_IA32_VMX_BASIC:
> +		/*
> +		 * This MSR reports some information about VMX support of the
> +		 * processor. We should return information about the VMX we
> +		 * emulate for the guest, and the VMCS structure we give it -
> +		 * not about the VMX support of the underlying hardware.
> +		 * However, some capabilities of the underlying hardware are
> +		 * used directly by our emulation (e.g., the physical address
> +		 * width), so these are copied from what the hardware reports.
> +		 */
> +		*pdata = VMCS12_REVISION | (((u64)sizeof(struct vmcs12)) << 32);

Let's reserve 4K unconditionally to avoid future complications.

> +		rdmsrl(MSR_IA32_VMX_BASIC, vmx_msr);
> +#define VMX_BASIC_64		0x0001000000000000LLU
> +#define VMX_BASIC_MEM_TYPE	0x003c000000000000LLU
> +#define VMX_BASIC_INOUT		0x0040000000000000LLU

Please move the defines to vmx.h (or msr-index.h).

> +		*pdata |= vmx_msr &
> +			(VMX_BASIC_64 | VMX_BASIC_MEM_TYPE | VMX_BASIC_INOUT);

I don't see why we need the real data here.  Nothing prevents us from 
supporting 64-bit physical addresses on 32-bit hosts (so long as we use 
gpa_t for addresses); ditto for MEM_TYPE and INOUT.

It's helpful to have fixed values here to remove obstacles to live 
migration.

> +		break;
> +#define CORE2_PINBASED_CTLS_MUST_BE_ONE	0x00000016

Please use the bit names instead.

> +#define MSR_IA32_VMX_TRUE_PINBASED_CTLS	0x48d

msr-index.h

> +	case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
> +	case MSR_IA32_VMX_PINBASED_CTLS:
> +		vmx_msr_low  = CORE2_PINBASED_CTLS_MUST_BE_ONE;
> +		vmx_msr_high = CORE2_PINBASED_CTLS_MUST_BE_ONE |
> +				PIN_BASED_EXT_INTR_MASK |
> +				PIN_BASED_NMI_EXITING |
> +				PIN_BASED_VIRTUAL_NMIS;
> +		*pdata = vmx_msr_low | ((u64)vmx_msr_high << 32);
> +		break;
> +	case MSR_IA32_VMX_PROCBASED_CTLS:
> +		/* This MSR determines which vm-execution controls the L1
> +		 * hypervisor may ask, or may not ask, to enable. Normally we
> +		 * can only allow enabling features which the hardware can
> +		 * support, but we limit ourselves to allowing only known
> +		 * features that were tested nested. We allow disabling any
> +		 * feature (even if the hardware can't disable it).
> +		 */
> +		rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, vmx_msr_low, vmx_msr_high);
> +
> +		vmx_msr_low = 0; /* allow disabling any feature */

What if the host doesn't allow disabling a feature?  I think we can't 
modify vmx_msr_low.

> +		vmx_msr_high &= /* do not expose new untested features */
> +			CPU_BASED_HLT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
> +			CPU_BASED_CR3_STORE_EXITING | CPU_BASED_USE_IO_BITMAPS |
> +			CPU_BASED_MOV_DR_EXITING | CPU_BASED_USE_TSC_OFFSETING |
> +			CPU_BASED_MWAIT_EXITING | CPU_BASED_MONITOR_EXITING |
> +			CPU_BASED_INVLPG_EXITING | CPU_BASED_TPR_SHADOW |
> +			CPU_BASED_USE_MSR_BITMAPS |
> +#ifdef CONFIG_X86_64
> +			CPU_BASED_CR8_LOAD_EXITING |
> +			CPU_BASED_CR8_STORE_EXITING |
> +#endif
> +			CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
> +		*pdata = vmx_msr_low | ((u64)vmx_msr_high << 32);
> +		break;
> +	case MSR_IA32_VMX_EXIT_CTLS:
> +		*pdata = 0;
> +#ifdef CONFIG_X86_64
> +		*pdata |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
> +#endif
> +		break;
> +	case MSR_IA32_VMX_ENTRY_CTLS:
> +		*pdata = 0;
> +		break;
> +	case MSR_IA32_VMX_PROCBASED_CTLS2:
> +		*pdata = 0;
> +		if (vm_need_virtualize_apic_accesses(vcpu->kvm))
> +			*pdata |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> +		break;
> +	case MSR_IA32_VMX_EPT_VPID_CAP:
> +		*pdata = 0;
> +		break;
> +	default:
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +
> +static int vmx_set_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
> +{
> +	switch (msr_index) {
> +	case MSR_IA32_FEATURE_CONTROL:
> +	case MSR_IA32_VMX_BASIC:
> +	case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
> +	case MSR_IA32_VMX_PINBASED_CTLS:
> +	case MSR_IA32_VMX_PROCBASED_CTLS:
> +	case MSR_IA32_VMX_EXIT_CTLS:
> +	case MSR_IA32_VMX_ENTRY_CTLS:
> +	case MSR_IA32_VMX_PROCBASED_CTLS2:
> +	case MSR_IA32_VMX_EPT_VPID_CAP:
> +		pr_unimpl(vcpu, "unimplemented VMX MSR write: 0x%x data %llx\n",
> +			  msr_index, data);
> +		return 0;


These are illegal to write anyway and should #GP (except 
FEATURE_CONTROL).  We will however need a way for userspace to write 
these MSRs to allow fine tuning the exposed features (as we do with cpuid).
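
For illustration, roughly something like this (not the final interface -
the userspace tuning path still needs to be designed):

static int vmx_set_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
{
	switch (msr_index) {
	case MSR_IA32_FEATURE_CONTROL:
		/* TODO: validate and remember the value written by L1 */
		return 0;
	default:
		/*
		 * The VMX capability MSRs are read-only; returning 1 lets
		 * the common code inject #GP into the guest.
		 */
		return 1;
	}
}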

> +	default:
> +		return 1;
> +	}
> +}

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 08/27] nVMX: Hold a vmcs02 for each vmcs12
  2010-10-17 10:07 ` [PATCH 08/27] nVMX: Hold a vmcs02 for each vmcs12 Nadav Har'El
@ 2010-10-17 13:00   ` Avi Kivity
  0 siblings, 0 replies; 56+ messages in thread
From: Avi Kivity @ 2010-10-17 13:00 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

  On 10/17/2010 12:07 PM, Nadav Har'El wrote:
> In this patch we add a list of L0 (hardware) VMCSs, which we'll use to hold a
> hardware VMCS for each active vmcs12 (i.e., for each L2 guest).
>
> We call each of these L0 VMCSs a "vmcs02", as it is the VMCS that L0 uses
> to run its nested guest L2.
>
> +
> +/*
> + * Allocate an L0 VMCS (vmcs02) for the current L1 VMCS (vmcs12), if one
> + * does not already exist. The allocation is done in L0 memory, so to avoid
> + * denial-of-service attack by guests, we limit the number of concurrently-
> + * allocated vmcss. A well-behaving L1 will VMCLEAR unused vmcs12s and not
> + * trigger this limit.
> + */
> +static const int NESTED_MAX_VMCS = 256;

#define, top of file

> +static int nested_create_current_vmcs(struct kvm_vcpu *vcpu)
> +{
> +	struct vmcs_list *new_l2_guest;
> +	struct vmcs *vmcs02;
> +
> +	if (nested_get_current_vmcs(vcpu))
> +		return 0; /* nothing to do - we already have a VMCS */
> +
> +	if (to_vmx(vcpu)->nested.vmcs02_num>= NESTED_MAX_VMCS)
> +		return -ENOMEM;

Why not just free_l1_state()?

You can have just nested_get_current_vmcs() which creates the vmcs if
necessary and returns the existing one if it is already cached.
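
An untested sketch of such a combined helper, reusing the names from the
patch (error handling kept minimal):

static struct vmcs *nested_get_current_vmcs(struct kvm_vcpu *vcpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	struct vmcs_list *item;

	list_for_each_entry(item, &vmx->nested.vmcs02_list, list)
		if (item->vmcs12_addr == vmx->nested.current_vmptr)
			return item->vmcs02;	/* already cached */

	if (vmx->nested.vmcs02_num >= NESTED_MAX_VMCS)
		return NULL;	/* or recycle/free an old entry here */

	item = kmalloc(sizeof(*item), GFP_KERNEL);
	if (!item)
		return NULL;
	item->vmcs02 = alloc_vmcs();
	if (!item->vmcs02) {
		kfree(item);
		return NULL;
	}
	item->vmcs12_addr = vmx->nested.current_vmptr;
	list_add(&item->list, &vmx->nested.vmcs02_list);
	vmx->nested.vmcs02_num++;
	return item->vmcs02;
}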

> +
> +	new_l2_guest = (struct vmcs_list *)
> +		kmalloc(sizeof(struct vmcs_list), GFP_KERNEL);
> +	if (!new_l2_guest)
> +		return -ENOMEM;
> +
> +	vmcs02 = alloc_vmcs();
> +	if (!vmcs02) {
> +		kfree(new_l2_guest);
> +		return -ENOMEM;
> +	}
> +
> +	new_l2_guest->vmcs12_addr = to_vmx(vcpu)->nested.current_vmptr;
> +	new_l2_guest->vmcs02 = vmcs02;
> +	list_add(&(new_l2_guest->list),&(to_vmx(vcpu)->nested.vmcs02_list));
> +	to_vmx(vcpu)->nested.vmcs02_num++;
> +	return 0;
> +}
> +
>
> @@ -4409,6 +4503,8 @@ static void vmx_free_vcpu(struct kvm_vcp
>   		kunmap(to_vmx(vcpu)->nested.current_vmcs12_page);
>   		nested_release_page(to_vmx(vcpu)->nested.current_vmcs12_page);
>   	}
> +	if (vmx->nested.vmxon)
> +		free_l1_state(vcpu);

Can be called unconditionally.

>   	vmx_free_vmcs(vcpu);
>   	kfree(vmx->guest_msrs);
>   	kvm_vcpu_uninit(vcpu);


-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 10/27] nVMX: Implement VMCLEAR
  2010-10-17 10:08 ` [PATCH 10/27] nVMX: Implement VMCLEAR Nadav Har'El
@ 2010-10-17 13:05   ` Avi Kivity
  2010-10-17 13:25     ` Nadav Har'El
  0 siblings, 1 reply; 56+ messages in thread
From: Avi Kivity @ 2010-10-17 13:05 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

  On 10/17/2010 12:08 PM, Nadav Har'El wrote:
> This patch implements the VMCLEAR instruction.
>
> Signed-off-by: Nadav Har'El<nyh@il.ibm.com>
> ---
>   arch/x86/kvm/vmx.c |   62 ++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 61 insertions(+), 1 deletion(-)
>
> --- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:01.000000000 +0200
> +++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:01.000000000 +0200
> @@ -146,6 +146,8 @@ struct __packed vmcs12 {
>   	 */
>   	u32 revision_id;
>   	u32 abort;
> +
> +	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */

u32 please, this is an ABI.

>   };
>
>   /*
> @@ -3830,6 +3832,64 @@ static void nested_vmx_failValid(struct
>   	get_vmcs12_fields(vcpu)->vm_instruction_error = vm_instruction_error;
>   }
>
> +/* Emulate the VMCLEAR instruction */
> +static int handle_vmclear(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	gva_t gva;
> +	gpa_t vmcs12_addr;
> +	struct vmcs12 *vmcs12;
> +	struct page *page;
> +
> +	if (!nested_vmx_check_permission(vcpu))
> +		return 1;
> +
> +	if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
> +			vmcs_read32(VMX_INSTRUCTION_INFO),&gva))
> +		return 1;
> +
> +	if (kvm_read_guest_virt(gva,&vmcs12_addr, sizeof(vmcs12_addr),
> +				vcpu, NULL)) {
> +		kvm_queue_exception(vcpu, PF_VECTOR);
> +		return 1;
> +	}
> +
> +	if (!IS_ALIGNED(vmcs12_addr, PAGE_SIZE)) {
> +		nested_vmx_failValid(vcpu, VMXERR_VMCLEAR_INVALID_ADDRESS);
> +		skip_emulated_instruction(vcpu);
> +		return 1;
> +	}
> +
> +	if (vmcs12_addr == vmx->nested.current_vmptr){
> +		kunmap(vmx->nested.current_vmcs12_page);
> +		nested_release_page(vmx->nested.current_vmcs12_page);
> +		vmx->nested.current_vmptr = -1ull;
> +	}
> +
> +	page = nested_get_page(vcpu, vmcs12_addr);
> +	if(page == NULL){

Missing whitespace.

> +		/*
> +		 * For accurate processor emulation, VMCLEAR beyond available
> +		 * physical memory should do nothing at all. However, it is
> +		 * possible that a nested vmx bug, not a guest hypervisor bug,
> +		 * resulted in this case, so let's shut down before doing any
> +		 * more damage:
> +		 */
> +		set_bit(KVM_REQ_TRIPLE_FAULT,&vcpu->requests);
> +		return 1;
> +	}
> +	vmcs12 = kmap(page);

kmap_atomic() please (better, kvm_write_guest(), but can defer that for 
later)
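
i.e. something like this for the hunk above (untested; uses the
two-argument kmap_atomic() of this era):

	vmcs12 = kmap_atomic(page, KM_USER0);
	vmcs12->launch_state = 0;
	kunmap_atomic(vmcs12, KM_USER0);
	nested_release_page(page);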

> +	vmcs12->launch_state = 0;
> +	kunmap(page);
> +	nested_release_page(page);
> +
> +	nested_free_vmcs(vcpu, vmcs12_addr);
> +
> +	skip_emulated_instruction(vcpu);
> +	nested_vmx_succeed(vcpu);
> +	return 1;
> +}
> +
>   static int handle_invlpg(struct kvm_vcpu *vcpu)
>   {
>   	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 03/27] nVMX: Implement VMXON and VMXOFF
  2010-10-17 10:05 ` [PATCH 03/27] nVMX: Implement VMXON and VMXOFF Nadav Har'El
  2010-10-17 12:24   ` Avi Kivity
@ 2010-10-17 13:07   ` Avi Kivity
  1 sibling, 0 replies; 56+ messages in thread
From: Avi Kivity @ 2010-10-17 13:07 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

  On 10/17/2010 12:05 PM, Nadav Har'El wrote:
> This patch allows a guest to use the VMXON and VMXOFF instructions, and
> emulates them accordingly. Basically this amounts to checking some
> prerequisites, and then remembering whether the guest has enabled or disabled
> VMX operation.

Please add a TODO for the lapic code to remind us that INITs need to be 
blocked while in vmx operation.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 13/27] nVMX: Add VMCS fields to the vmcs12
  2010-10-17 10:10 ` [PATCH 13/27] nVMX: Add VMCS fields to the vmcs12 Nadav Har'El
@ 2010-10-17 13:15   ` Avi Kivity
  0 siblings, 0 replies; 56+ messages in thread
From: Avi Kivity @ 2010-10-17 13:15 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

  On 10/17/2010 12:10 PM, Nadav Har'El wrote:
> In this patch we add to vmcs12 (the VMCS that L1 keeps for L2) all the
> standard VMCS fields. These fields are encapsulated in a struct vmcs_fields.
>
> Later patches will enable L1 to read and write these fields using VMREAD/
> VMWRITE, and they will be used during a VMLAUNCH/VMRESUME in preparing vmcs02,
> a hardware VMCS for running L2.
>
> Signed-off-by: Nadav Har'El<nyh@il.ibm.com>
> ---
>   arch/x86/kvm/vmx.c |  295 +++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 295 insertions(+)
>
> --- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:01.000000000 +0200
> +++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:01.000000000 +0200
> @@ -128,6 +128,137 @@ struct shared_msr_entry {
>   };
>
>   /*
> + * vmcs_fields is a structure used in nested VMX for holding a copy of all
> + * standard VMCS fields. It is used for emulating a VMCS for L1 (see struct
> + * vmcs12), and also for easier access to VMCS data (see vmcs01_fields).
> + */
> +struct __packed vmcs_fields {

...

> +	unsigned long cr0_guest_host_mask;
> +	unsigned long cr4_guest_host_mask;

Those ulongs won't survive live migrations.  ABIs always want explicitly 
sized types.

Better move them above the u32 so we don't have to check whether there's 
an even number of them.
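
For example (sketch - exact placement within the struct is up to you):

	/* natural-width fields, stored as explicitly sized u64 and grouped
	 * together with the other 64-bit fields */
	u64 cr0_guest_host_mask;
	u64 cr4_guest_host_mask;
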
> +
> +/*
>    * struct vmcs12 describes the state that our guest hypervisor (L1) keeps for a
>    * single nested guest (L2), hence the name vmcs12. Any VMX implementation has
>    * a VMCS structure, and vmcs12 is our emulated VMX's VMCS. This structure is
> @@ -147,6 +278,8 @@ struct __packed vmcs12 {
>   	u32 revision_id;
>   	u32 abort;
>

Reserve some space here.

> +	struct vmcs_fields fields;
> +
>   	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */

And move this above fields, so we can expand it later.

>   };
>
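
Putting the two suggestions together, the header could end up looking
something like this (the amount of reserved space is only an example):

struct __packed vmcs12 {
	u32 revision_id;
	u32 abort;

	u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
	u8 padding[20]; /* reserved for future growth of the header */

	struct vmcs_fields fields;
};
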
> @@ -241,6 +374,168 @@ static inline struct vcpu_vmx *to_vmx(st
>   	return container_of(vcpu, struct vcpu_vmx, vcpu);
>   }
>
> +#define OFFSET(x) offsetof(struct vmcs_fields, x)
> +
> +static unsigned short vmcs_field_to_offset_table[HOST_RIP+1] = {
>
> +	[IO_BITMAP_A] = OFFSET(io_bitmap_a),
> +	[IO_BITMAP_A_HIGH] = OFFSET(io_bitmap_a)+4,

Might have a FIELD(name, field) macro to define ordinary fields and a
FIELD64(name, field) macro to define both sub-fields of a 64-bit field
at once.  Can defer until later.
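
Something like this, perhaps (untested sketch; the macro names are just a
suggestion):

#define FIELD(number, name)	[number] = OFFSET(name)
#define FIELD64(number, name)	[number] = OFFSET(name), \
				[number##_HIGH] = OFFSET(name) + 4

static unsigned short vmcs_field_to_offset_table[HOST_RIP+1] = {
	FIELD64(IO_BITMAP_A, io_bitmap_a),
	FIELD64(IO_BITMAP_B, io_bitmap_b),
	FIELD(GUEST_ES_SELECTOR, guest_es_selector),
	/* ... */
};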

> +};
> +
> +static inline short vmcs_field_to_offset(unsigned long field)
> +{
> +
> +	if (field>  HOST_RIP || vmcs_field_to_offset_table[field] == 0) {
> +		printk(KERN_ERR "invalid vmcs field 0x%lx\n", field);

Guest exploitable printk() - remove.

> +		return -1;
> +	}
> +	return vmcs_field_to_offset_table[field];
> +}
> +

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 05/27] nVMX: Introduce vmcs12: a VMCS structure for L1
  2010-10-17 12:34   ` Avi Kivity
@ 2010-10-17 13:18     ` Nadav Har'El
  2010-10-17 13:29       ` Avi Kivity
  0 siblings, 1 reply; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 13:18 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, gleb

On Sun, Oct 17, 2010, Avi Kivity wrote about "Re: [PATCH 05/27] nVMX: Introduce vmcs12: a VMCS structure for L1":
> >+	if(to_vmx(vcpu)->nested.current_vmptr != -1ull){
> 
> Missing whitespace after if and before {.

Sorry about that - I forgot to run checkpatch.pl on this iteration.
I now fixed this, and a bunch of other small style issues.

> >  }
> >@@ -4170,6 +4229,10 @@ static void vmx_free_vcpu(struct kvm_vcp
> >  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> >
> >  	free_vpid(vmx);
> >+	if (vmx->nested.vmxon&&  to_vmx(vcpu)->nested.current_vmptr != 
> >-1ull){
> >+		kunmap(to_vmx(vcpu)->nested.current_vmcs12_page);
> >+	 nested_release_page(to_vmx(vcpu)->nested.current_vmcs12_page);
> >+	}
> 
> Duplication - helper?

Ok, I just moved the kunmap() into nested_release_page() - it was always
called before nested_release_page. I hope that's what you meant by
"duplication".

-- 
Nadav Har'El                        |      Sunday, Oct 17 2010, 9 Heshvan 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Unix is user friendly - it's just picky
http://nadav.harel.org.il           |about its friends.


* Re: [PATCH 10/27] nVMX: Implement VMCLEAR
  2010-10-17 13:05   ` Avi Kivity
@ 2010-10-17 13:25     ` Nadav Har'El
  2010-10-17 13:27       ` Avi Kivity
  0 siblings, 1 reply; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 13:25 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, gleb

On Sun, Oct 17, 2010, Avi Kivity wrote about "Re: [PATCH 10/27] nVMX: Implement VMCLEAR":
> >+	vmcs12 = kmap(page);
> 
> kmap_atomic() please (better, kvm_write_guest(), but can defer that for 
> later)

Sorry about my ignorance, but why is kmap_atomic() better here than kmap()?
While handling an exit (caused by the guest running a VMCLEAR instruction),
we aren't in atomic context, are we?

If I use kmap_atomic() here I'll need to kunmap_atomic() below which will
break the newly combined kunmap & nested_release_page function ;-)

> >+	vmcs12->launch_state = 0;
> >+	kunmap(page);
> >+	nested_release_page(page);

-- 
Nadav Har'El                        |      Sunday, Oct 17 2010, 9 Heshvan 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Diplomat: A man who always remembers a
http://nadav.harel.org.il           |woman's birthday but never her age.


* Re: [PATCH 14/27] nVMX: Implement VMREAD and VMWRITE
  2010-10-17 10:10 ` [PATCH 14/27] nVMX: Implement VMREAD and VMWRITE Nadav Har'El
@ 2010-10-17 13:25   ` Avi Kivity
  0 siblings, 0 replies; 56+ messages in thread
From: Avi Kivity @ 2010-10-17 13:25 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

  On 10/17/2010 12:10 PM, Nadav Har'El wrote:
> Implement the VMREAD and VMWRITE instructions. With these instructions, L1
> can read and write to the VMCS it is holding. The values are read or written
> to the fields of the vmcs_fields structure introduced in the previous patch.
>
>
> +
> +static inline bool vmcs12_read_any(struct kvm_vcpu *vcpu,
> +					unsigned long field, u64 *ret)
> +{
> +	short offset = vmcs_field_to_offset(field);
> +	char *p;
> +
> +	if (offset<  0)
> +		return 0;
> +
> +	p = ((char *)(get_vmcs12_fields(vcpu))) + offset;
> +
> +	switch (vmcs_field_type(field)) {
> +	case VMCS_FIELD_TYPE_ULONG:
> +		*ret = *((unsigned long *)p);

The cast here should depend on guest mode.  A !is_long_mode() guest 
needs to cast this to u32.
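
i.e. something along these lines (assuming the natural-width members of
vmcs_fields become fixed-size u64, as suggested earlier):

	case VMCS_FIELD_TYPE_ULONG:
		/* a !is_long_mode() guest only sees the low 32 bits of
		 * natural-width fields */
		*ret = is_long_mode(vcpu) ? *(u64 *)p : *(u32 *)p;
		return 1;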

> +		return 1;
> +	case VMCS_FIELD_TYPE_U16:
> +		*ret = (u16) *((unsigned long *)p);
> +		return 1;
> +	case VMCS_FIELD_TYPE_U32:
> +		*ret = (u32) *((unsigned long *)p);
> +		return 1;
> +	case VMCS_FIELD_TYPE_U64:
> +		*ret = *((u64 *)p);

Ditto.

> +		return 1;
> +	default:
> +		return 0; /* can never happen. */
> +	}
> +}
> +
> +
> +static int handle_vmwrite(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long field;
> +	u64 field_value = 0;
> +	gva_t gva;
> +	int field_type;
> +	unsigned long exit_qualification   = vmcs_readl(EXIT_QUALIFICATION);
> +	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
> +	char *p;
> +	short offset;
> +
> +	if (!nested_vmx_check_permission(vcpu))
> +		return 1;
> +
> +	if (vmx_instruction_info&  (1u<<  10))
> +		field_value = kvm_register_read(vcpu,
> +			(((vmx_instruction_info)>>  3)&  0xf));
> +	else {
> +		if (get_vmx_mem_address(vcpu, exit_qualification,
> +				vmx_instruction_info,&gva))
> +			return 1;
> +		if(kvm_read_guest_virt(gva,&field_value,
> +				(is_long_mode(vcpu) ? 8 : 4), vcpu, NULL)){

Whitespace.

> +			kvm_queue_exception(vcpu, PF_VECTOR);
> +			return 1;
> +		}
> +	}
> +
> +
> +	field = kvm_register_read(vcpu, (((vmx_instruction_info)>>  28)&  0xf));
> +
> +	if (vmcs_field_readonly(field)) {
> +		nested_vmx_failValid(vcpu,
> +			VMXERR_VMWRITE_READ_ONLY_VMCS_COMPONENT);
> +		skip_emulated_instruction(vcpu);
> +		return 1;
> +	}
> +
> +	field_type = vmcs_field_type(field);
> +
> +	offset = vmcs_field_to_offset(field);
> +	if (offset<  0) {
> +		nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
> +		skip_emulated_instruction(vcpu);
> +		return 1;
> +	}
> +	p = ((char *) get_vmcs12_fields(vcpu)) + offset;
> +
> +	switch (field_type) {
> +	case VMCS_FIELD_TYPE_U16:
> +		*(u16 *)p = field_value;
> +		break;
> +	case VMCS_FIELD_TYPE_U32:
> +		*(u32 *)p = field_value;
> +		break;
> +	case VMCS_FIELD_TYPE_U64:
> +#ifdef CONFIG_X86_64
> +		*(unsigned long *)p = field_value;
> +#else
> +		*(unsigned long *)p = field_value;
> +		*(((unsigned long *)p)+1) = field_value>>  32;

Depend on guest bitness here, not host bitness.  32-bit guests only 
write the first word.
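
A possible shape for that (sketch, again assuming the field is stored as
an explicit u64):

	case VMCS_FIELD_TYPE_U64:
		/* a 32-bit guest's VMWRITE only supplies the low word */
		if (is_long_mode(vcpu))
			*(u64 *)p = field_value;
		else
			*(u32 *)p = (u32)field_value;
		break;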

> +#endif
> +		break;
> +	case VMCS_FIELD_TYPE_ULONG:
> +		*(unsigned long *)p = field_value;
> +		break;
> +	default:
> +		nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
> +		skip_emulated_instruction(vcpu);
> +		return 1;
> +	}
> +
> +	nested_vmx_succeed(vcpu);
> +	skip_emulated_instruction(vcpu);
> +	return 1;
> +}
> +

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 10/27] nVMX: Implement VMCLEAR
  2010-10-17 13:25     ` Nadav Har'El
@ 2010-10-17 13:27       ` Avi Kivity
  2010-10-17 13:37         ` Nadav Har'El
  0 siblings, 1 reply; 56+ messages in thread
From: Avi Kivity @ 2010-10-17 13:27 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

  On 10/17/2010 03:25 PM, Nadav Har'El wrote:
> On Sun, Oct 17, 2010, Avi Kivity wrote about "Re: [PATCH 10/27] nVMX: Implement VMCLEAR":
> >  >+	vmcs12 = kmap(page);
> >
> >  kmap_atomic() please (better, kvm_write_guest(), but can defer that for
> >  later)
>
> Sorry about my ignorance, but why is kmap_atomic() better here than kmap()?
> While handling an exit (caused by a guest running VMCLEAR instruction), we
> aren't in atomic context, aren't we?

kmap() is unloved since it is deadlock-prone in some circumstances, and 
also much slower than kmap_atomic(), since it needs global tlb 
synchronization.

> If I use kmap_atomic() here I'll need to kunmap_atomic() below which will
> break the newly combined kunmap&  nested_release_page function ;-)
>
> >  >+	vmcs12->launch_state = 0;
> >  >+	kunmap(page);
> >  >+	nested_release_page(page);
>

Is something preventing you from changing all kmap()s to kmap_atomic()s 
(like guest memory access in the mapped section)?

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 05/27] nVMX: Introduce vmcs12: a VMCS structure for L1
  2010-10-17 13:18     ` Nadav Har'El
@ 2010-10-17 13:29       ` Avi Kivity
  0 siblings, 0 replies; 56+ messages in thread
From: Avi Kivity @ 2010-10-17 13:29 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

  On 10/17/2010 03:18 PM, Nadav Har'El wrote:
> On Sun, Oct 17, 2010, Avi Kivity wrote about "Re: [PATCH 05/27] nVMX: Introduce vmcs12: a VMCS structure for L1":
> >  >+	if(to_vmx(vcpu)->nested.current_vmptr != -1ull){
> >
> >  Missing whitespace after if and before {.
>
> Sorry about that - I forgot to run checkpatch.pl on this iteration.
> I now fixed this, and a bunch of other small style issues.
>
> >  >   }
> >  >@@ -4170,6 +4229,10 @@ static void vmx_free_vcpu(struct kvm_vcp
> >  >   	struct vcpu_vmx *vmx = to_vmx(vcpu);
> >  >
> >  >   	free_vpid(vmx);
> >  >+	if (vmx->nested.vmxon&&   to_vmx(vcpu)->nested.current_vmptr !=
> >  >-1ull){
> >  >+		kunmap(to_vmx(vcpu)->nested.current_vmcs12_page);
> >  >+	 nested_release_page(to_vmx(vcpu)->nested.current_vmcs12_page);
> >  >+	}
> >
> >  Duplication - helper?
>
> Ok, I just moved the kunmap() into nested_release_page() - it was always
> called before nested_release_page. I hope that's what you meant by
> "duplication".
>

I meant the 4-line sequence duplicates the preceding hunk.  So you could 
have a function that does it and call it twice.
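
E.g. a small helper along these lines (the name is made up), called from
both places:

static void nested_release_current_vmcs12(struct vcpu_vmx *vmx)
{
	if (vmx->nested.current_vmptr == -1ull)
		return;
	kunmap(vmx->nested.current_vmcs12_page);
	nested_release_page(vmx->nested.current_vmcs12_page);
	vmx->nested.current_vmptr = -1ull;
}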


-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 10/27] nVMX: Implement VMCLEAR
  2010-10-17 13:27       ` Avi Kivity
@ 2010-10-17 13:37         ` Nadav Har'El
  2010-10-17 14:12           ` Avi Kivity
  0 siblings, 1 reply; 56+ messages in thread
From: Nadav Har'El @ 2010-10-17 13:37 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, gleb

On Sun, Oct 17, 2010, Avi Kivity wrote about "Re: [PATCH 10/27] nVMX: Implement VMCLEAR":
> Is something preventing you from changing all kmap()s to kmap_atomic()s 
> (like guest memory access in the mapped section)?

Yes. We've discussed this before, and I know you suggested other alternatives,
but the way we currently work with vmcs12 (the page of memory that L1 maintains
as its VMCS for its L2 guest) is this: When L1 uses VMPTRLD to set the current
VMCS, we pin this page and kmap it, and keep a pointer to it immediately
accessible throughout the code. The page is only unmapped and released when
L1 is done with this VMCS (i.e., calls VMPTRLD again, or VMCLEAR, or of course
terminates).

The nice thing about this approach, over the alternatives, is that it is
more efficient than special guest_read/write calls (accesses to vmcs12 are
ordinary memory accesses) and the code is simpler than it was previously
with map/unmap pairs around every access.

Obviously, I can't use kmap_atomic() when the mapping is to live a long time,
including outside atomic context. This could lead to bugs if two parts of the
kernel use the same kmap_atomic() "slot" :(


-- 
Nadav Har'El                        |      Sunday, Oct 17 2010, 9 Heshvan 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |The message above is just this
http://nadav.harel.org.il           |signature's way of propagating itself.


* Re: [PATCH 15/27] nVMX: Prepare vmcs02 from vmcs01 and vmcs12
  2010-10-17 10:11 ` [PATCH 15/27] nVMX: Prepare vmcs02 from vmcs01 and vmcs12 Nadav Har'El
@ 2010-10-17 14:08   ` Avi Kivity
  2011-02-08 12:13     ` Nadav Har'El
  0 siblings, 1 reply; 56+ messages in thread
From: Avi Kivity @ 2010-10-17 14:08 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

  On 10/17/2010 12:11 PM, Nadav Har'El wrote:
> This patch contains code to prepare the VMCS which can be used to actually
> run the L2 guest, vmcs02. prepare_vmcs02 appropriately merges the information
> in vmcs12 (the vmcs that L1 built for L2) and in vmcs01 (the vmcs that we
> built for L1).
>
> VMREAD/WRITE can only access one VMCS at a time (the "current" VMCS), which
> makes it difficult for us to read from vmcs01 while writing to vmcs02. This
> is why we first make a copy of vmcs01 in memory (vmcs01_fields) and then
> read that memory copy while writing to vmcs02.
>

I believe I commented on this before - you can call the same functions
kvm uses to initialize the normal vmcs to get the common parts filled in.

> +int load_vmcs_host_state(struct vmcs_fields *src)
> +{
> +	vmcs_write16(HOST_ES_SELECTOR, src->host_es_selector);
> +	vmcs_write16(HOST_CS_SELECTOR, src->host_cs_selector);
> +	vmcs_write16(HOST_SS_SELECTOR, src->host_ss_selector);
> +	vmcs_write16(HOST_DS_SELECTOR, src->host_ds_selector);
> +	vmcs_write16(HOST_FS_SELECTOR, src->host_fs_selector);
> +	vmcs_write16(HOST_GS_SELECTOR, src->host_gs_selector);
> +	vmcs_write16(HOST_TR_SELECTOR, src->host_tr_selector);

vmx_vcpu_setup() - you can extract the common parts and call them from here.

> +
> +	if (vmcs_config.vmexit_ctrl&  VM_EXIT_LOAD_IA32_PAT)
> +		vmcs_write64(HOST_IA32_PAT, src->host_ia32_pat);
> +
> +	vmcs_write32(HOST_IA32_SYSENTER_CS, src->host_ia32_sysenter_cs);
> +
> +	vmcs_writel(HOST_CR0, src->host_cr0);
> +	vmcs_writel(HOST_CR3, src->host_cr3);
> +	vmcs_writel(HOST_CR4, src->host_cr4);

Ditto.

> +	vmcs_writel(HOST_FS_BASE, src->host_fs_base);
> +	vmcs_writel(HOST_GS_BASE, src->host_gs_base);

These change on vcpu migration.  Perhaps the cause of smp failures?  
Check that vmx_vcpu_load() updates the correct vmcs.

> +	vmcs_writel(HOST_TR_BASE, src->host_tr_base);
> +	vmcs_writel(HOST_GDTR_BASE, src->host_gdtr_base);

Both per-cpu, again updated on vcpu mirgation.

> +	vmcs_writel(HOST_IDTR_BASE, src->host_idtr_base);

Not per-cpu, unfortunately.

> +	vmcs_writel(HOST_RSP, src->host_rsp);
> +	vmcs_writel(HOST_RIP, src->host_rip);
> +	vmcs_writel(HOST_IA32_SYSENTER_ESP, src->host_ia32_sysenter_esp);
> +	vmcs_writel(HOST_IA32_SYSENTER_EIP, src->host_ia32_sysenter_eip);

Constant, can use vmx_vcpu_setup().

> +
> +	return 0;
> +}
> +
>   /*
>    * Switches to specified vcpu, until a matching vcpu_put(), but assumes
>    * vcpu mutex is already taken.
> @@ -5359,6 +5412,361 @@ static void vmx_set_supported_cpuid(u32
>   		entry->ecx |= bit(X86_FEATURE_VMX);
>   }
>
> +/*
> + * Make a copy of the current VMCS to ordinary memory. This is needed because
> + * in VMX you cannot read and write to two VMCS at the same time, so when we
> + * want to do this (in prepare_vmcs02, which needs to read from vmcs01 while
> + * preparing vmcs02), we need to first save a copy of one VMCS's fields in
> + * memory, and then use that copy.
> + */
> +void save_vmcs(struct vmcs_fields *dst)
> +{
> +	dst->guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR);
> +	dst->guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR);
> +	dst->guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR);
> +	dst->guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR);
> +	dst->guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR);
> +	dst->guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR);
> +	dst->guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR);
> +	dst->guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR);
> +	dst->host_es_selector = vmcs_read16(HOST_ES_SELECTOR);
> +	dst->host_cs_selector = vmcs_read16(HOST_CS_SELECTOR);
> +	dst->host_ss_selector = vmcs_read16(HOST_SS_SELECTOR);
> +	dst->host_ds_selector = vmcs_read16(HOST_DS_SELECTOR);
> +	dst->host_fs_selector = vmcs_read16(HOST_FS_SELECTOR);
> +	dst->host_gs_selector = vmcs_read16(HOST_GS_SELECTOR);
> +	dst->host_tr_selector = vmcs_read16(HOST_TR_SELECTOR);
> +	dst->io_bitmap_a = vmcs_read64(IO_BITMAP_A);
> +	dst->io_bitmap_b = vmcs_read64(IO_BITMAP_B);
> +	if (cpu_has_vmx_msr_bitmap())
> +		dst->msr_bitmap = vmcs_read64(MSR_BITMAP);

Do we support io bitmaps and msr bitmaps in this version?  If not, 
please drop (also ept).

> +	dst->tsc_offset = vmcs_read64(TSC_OFFSET);
> +	dst->virtual_apic_page_addr = vmcs_read64(VIRTUAL_APIC_PAGE_ADDR);
> +	dst->apic_access_addr = vmcs_read64(APIC_ACCESS_ADDR);
> +	if (enable_ept)
> +		dst->ept_pointer = vmcs_read64(EPT_POINTER);
> +	dst->guest_physical_address = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
> +	dst->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);
> +	dst->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
> +	if (vmcs_config.vmentry_ctrl&  VM_ENTRY_LOAD_IA32_PAT)
> +		dst->guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT);
> +	if (enable_ept) {
> +		dst->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
> +		dst->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
> +		dst->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
> +		dst->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
> +	}
> +	dst->pin_based_vm_exec_control = vmcs_read32(PIN_BASED_VM_EXEC_CONTROL);
> +	dst->cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
> +	dst->exception_bitmap = vmcs_read32(EXCEPTION_BITMAP);
> +	dst->page_fault_error_code_mask =
> +		vmcs_read32(PAGE_FAULT_ERROR_CODE_MASK);
> +	dst->page_fault_error_code_match =
> +		vmcs_read32(PAGE_FAULT_ERROR_CODE_MATCH);
> +	dst->cr3_target_count = vmcs_read32(CR3_TARGET_COUNT);
> +	dst->vm_exit_controls = vmcs_read32(VM_EXIT_CONTROLS);
> +	dst->vm_entry_controls = vmcs_read32(VM_ENTRY_CONTROLS);
> +	dst->vm_entry_intr_info_field = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
> +	dst->vm_entry_exception_error_code =
> +		vmcs_read32(VM_ENTRY_EXCEPTION_ERROR_CODE);
> +	dst->vm_entry_instruction_len = vmcs_read32(VM_ENTRY_INSTRUCTION_LEN);
> +	dst->tpr_threshold = vmcs_read32(TPR_THRESHOLD);
> +	dst->secondary_vm_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
> +	if (enable_vpid&&  dst->secondary_vm_exec_control&
> +	    SECONDARY_EXEC_ENABLE_VPID)
> +		dst->virtual_processor_id = vmcs_read16(VIRTUAL_PROCESSOR_ID);
> +	dst->vm_instruction_error = vmcs_read32(VM_INSTRUCTION_ERROR);
> +	dst->vm_exit_reason  = vmcs_read32(VM_EXIT_REASON);
> +	dst->vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
> +	dst->vm_exit_intr_error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
> +	dst->idt_vectoring_info_field = vmcs_read32(IDT_VECTORING_INFO_FIELD);
> +	dst->idt_vectoring_error_code = vmcs_read32(IDT_VECTORING_ERROR_CODE);
> +	dst->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
> +	dst->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
> +	dst->guest_es_limit = vmcs_read32(GUEST_ES_LIMIT);
> +	dst->guest_cs_limit = vmcs_read32(GUEST_CS_LIMIT);
> +	dst->guest_ss_limit = vmcs_read32(GUEST_SS_LIMIT);
> +	dst->guest_ds_limit = vmcs_read32(GUEST_DS_LIMIT);
> +	dst->guest_fs_limit = vmcs_read32(GUEST_FS_LIMIT);
> +	dst->guest_gs_limit = vmcs_read32(GUEST_GS_LIMIT);
> +	dst->guest_ldtr_limit = vmcs_read32(GUEST_LDTR_LIMIT);
> +	dst->guest_tr_limit = vmcs_read32(GUEST_TR_LIMIT);
> +	dst->guest_gdtr_limit = vmcs_read32(GUEST_GDTR_LIMIT);
> +	dst->guest_idtr_limit = vmcs_read32(GUEST_IDTR_LIMIT);
> +	dst->guest_es_ar_bytes = vmcs_read32(GUEST_ES_AR_BYTES);
> +	dst->guest_cs_ar_bytes = vmcs_read32(GUEST_CS_AR_BYTES);
> +	dst->guest_ss_ar_bytes = vmcs_read32(GUEST_SS_AR_BYTES);
> +	dst->guest_ds_ar_bytes = vmcs_read32(GUEST_DS_AR_BYTES);
> +	dst->guest_fs_ar_bytes = vmcs_read32(GUEST_FS_AR_BYTES);
> +	dst->guest_gs_ar_bytes = vmcs_read32(GUEST_GS_AR_BYTES);
> +	dst->guest_ldtr_ar_bytes = vmcs_read32(GUEST_LDTR_AR_BYTES);
> +	dst->guest_tr_ar_bytes = vmcs_read32(GUEST_TR_AR_BYTES);
> +	dst->guest_interruptibility_info =
> +		vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
> +	dst->guest_activity_state = vmcs_read32(GUEST_ACTIVITY_STATE);
> +	dst->guest_sysenter_cs = vmcs_read32(GUEST_SYSENTER_CS);
> +	dst->host_ia32_sysenter_cs = vmcs_read32(HOST_IA32_SYSENTER_CS);
> +	dst->cr0_guest_host_mask = vmcs_readl(CR0_GUEST_HOST_MASK);
> +	dst->cr4_guest_host_mask = vmcs_readl(CR4_GUEST_HOST_MASK);
> +	dst->cr0_read_shadow = vmcs_readl(CR0_READ_SHADOW);
> +	dst->cr4_read_shadow = vmcs_readl(CR4_READ_SHADOW);
> +	dst->cr3_target_value0 = vmcs_readl(CR3_TARGET_VALUE0);
> +	dst->cr3_target_value1 = vmcs_readl(CR3_TARGET_VALUE1);
> +	dst->cr3_target_value2 = vmcs_readl(CR3_TARGET_VALUE2);
> +	dst->cr3_target_value3 = vmcs_readl(CR3_TARGET_VALUE3);
> +	dst->exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> +	dst->guest_linear_address = vmcs_readl(GUEST_LINEAR_ADDRESS);
> +	dst->guest_cr0 = vmcs_readl(GUEST_CR0);
> +	dst->guest_cr3 = vmcs_readl(GUEST_CR3);
> +	dst->guest_cr4 = vmcs_readl(GUEST_CR4);
> +	dst->guest_es_base = vmcs_readl(GUEST_ES_BASE);
> +	dst->guest_cs_base = vmcs_readl(GUEST_CS_BASE);
> +	dst->guest_ss_base = vmcs_readl(GUEST_SS_BASE);
> +	dst->guest_ds_base = vmcs_readl(GUEST_DS_BASE);
> +	dst->guest_fs_base = vmcs_readl(GUEST_FS_BASE);
> +	dst->guest_gs_base = vmcs_readl(GUEST_GS_BASE);
> +	dst->guest_ldtr_base = vmcs_readl(GUEST_LDTR_BASE);
> +	dst->guest_tr_base = vmcs_readl(GUEST_TR_BASE);
> +	dst->guest_gdtr_base = vmcs_readl(GUEST_GDTR_BASE);
> +	dst->guest_idtr_base = vmcs_readl(GUEST_IDTR_BASE);
> +	dst->guest_dr7 = vmcs_readl(GUEST_DR7);
> +	dst->guest_rsp = vmcs_readl(GUEST_RSP);
> +	dst->guest_rip = vmcs_readl(GUEST_RIP);
> +	dst->guest_rflags = vmcs_readl(GUEST_RFLAGS);
> +	dst->guest_pending_dbg_exceptions =
> +		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
> +	dst->guest_sysenter_esp = vmcs_readl(GUEST_SYSENTER_ESP);
> +	dst->guest_sysenter_eip = vmcs_readl(GUEST_SYSENTER_EIP);
> +	dst->host_cr0 = vmcs_readl(HOST_CR0);
> +	dst->host_cr3 = vmcs_readl(HOST_CR3);
> +	dst->host_cr4 = vmcs_readl(HOST_CR4);
> +	dst->host_fs_base = vmcs_readl(HOST_FS_BASE);
> +	dst->host_gs_base = vmcs_readl(HOST_GS_BASE);
> +	dst->host_tr_base = vmcs_readl(HOST_TR_BASE);
> +	dst->host_gdtr_base = vmcs_readl(HOST_GDTR_BASE);
> +	dst->host_idtr_base = vmcs_readl(HOST_IDTR_BASE);
> +	dst->host_ia32_sysenter_esp = vmcs_readl(HOST_IA32_SYSENTER_ESP);
> +	dst->host_ia32_sysenter_eip = vmcs_readl(HOST_IA32_SYSENTER_EIP);
> +	dst->host_rsp = vmcs_readl(HOST_RSP);
> +	dst->host_rip = vmcs_readl(HOST_RIP);
> +	if (vmcs_config.vmexit_ctrl&  VM_EXIT_LOAD_IA32_PAT)
> +		dst->host_ia32_pat = vmcs_read64(HOST_IA32_PAT);

I think this should be broken up:

- guest-state fields, obviously needed
- host-state fields, not needed (can reuse current kvm code to setup, 
never need to read them)
- control fields not modified by hardware - no need to save
- read-only control fields - probably no need to load

> +
> +	vmcs_write64(VMCS_LINK_POINTER, vmcs12->vmcs_link_pointer);
> +	vmcs_write64(IO_BITMAP_A, vmcs01->io_bitmap_a);
> +	vmcs_write64(IO_BITMAP_B, vmcs01->io_bitmap_b);

Reuse vmx_vcpu_setup()

> +	if (cpu_has_vmx_msr_bitmap())
> +		vmcs_write64(MSR_BITMAP, vmcs01->msr_bitmap);

Have setup_msrs() cache the value of msr_bitmap somewhere, so we don't 
need to vmcs_read64() it.


> +
> +
> +	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
> +		     (vmcs01->pin_based_vm_exec_control |
> +		      vmcs12->pin_based_vm_exec_control));

Reuse vmx_vcpu_setup()

> +
> +	if (vm_need_tpr_shadow(vcpu->kvm)&&
> +	    nested_cpu_has_vmx_tpr_shadow(vcpu))
> +		vmcs_write32(TPR_THRESHOLD, vmcs12->tpr_threshold);
> +
> +	exec_control = vmcs01->cpu_based_vm_exec_control;
> +	exec_control&= ~CPU_BASED_VIRTUAL_INTR_PENDING;
> +	exec_control&= ~CPU_BASED_VIRTUAL_NMI_PENDING;
> +	exec_control&= ~CPU_BASED_TPR_SHADOW;
> +	exec_control |= vmcs12->cpu_based_vm_exec_control;
> +	if (!vm_need_tpr_shadow(vcpu->kvm) ||
> +	    vmcs12->virtual_apic_page_addr == 0) {
> +		exec_control&= ~CPU_BASED_TPR_SHADOW;
> +#ifdef CONFIG_X86_64
> +		exec_control |= CPU_BASED_CR8_STORE_EXITING |
> +			CPU_BASED_CR8_LOAD_EXITING;
> +#endif
> +	} else if (exec_control&  CPU_BASED_TPR_SHADOW) {
> +#ifdef CONFIG_X86_64
> +		exec_control&= ~CPU_BASED_CR8_STORE_EXITING;
> +		exec_control&= ~CPU_BASED_CR8_LOAD_EXITING;
> +#endif
> +	}
> +	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control);

Reuse vmx_vcpu_setup() code instead of vmread().

Note you have to set KVM_REQ_EVENT so INTR_PENDING is recalculated on 
vmexit.
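
Something like this on the emulated-vmexit path (or the equivalent
set_bit() on vcpu->requests that the rest of the series uses):

	kvm_make_request(KVM_REQ_EVENT, vcpu);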

> +
> +	/* EXCEPTION_BITMAP and CR0_GUEST_HOST_MASK should basically be the
> +	 * bitwise-or of what L1 wants to trap for L2, and what we want to
> +	 * trap. However, vmx_fpu_activate/deactivate may have happened after
> +	 * we saved vmcs01, so we shouldn't trust its TS and NM_VECTOR bits
> +	 * and need to base them again on fpu_active. Note that CR0.TS also
> +	 * needs updating - we do this after this function returns (in
> +	 * nested_vmx_run).
> +	 */
> +	vmcs_write32(EXCEPTION_BITMAP,
> +		     ((vmcs01->exception_bitmap&~(1u<<NM_VECTOR)) |
> +		      (vcpu->fpu_active ? 0 : (1u<<NM_VECTOR)) |
> +		      vmcs12->exception_bitmap));

I think you can reuse update_exception_bitmap() here.  Also, may need to 
enable #PF interception when enable_ept? not sure.

This reuses the fpu_active and guest debugging logic in 
update_exception_bitmap().  Again, needs to happen after we see the L2 
state.

> +	vmcs_writel(CR0_GUEST_HOST_MASK, vmcs12->cr0_guest_host_mask |
> +			(vcpu->fpu_active ? 0 : X86_CR0_TS));

Please use ~cr0_guest_owned_bits instead, equivalent information.  
Should be something like

   vmcs_writel(CR0_GUEST_HOST_MASK, vmcs12->cr0_guest_host_mask | 
~cr0_guest_owned_bits);

> +	vcpu->arch.cr0_guest_owned_bits = ~(vmcs12->cr0_guest_host_mask |
> +			(vcpu->fpu_active ? 0 : X86_CR0_TS));

Here, too ( |= is natural for updating).

> +
> +	vmcs_write32(VM_EXIT_CONTROLS,
> +		     (vmcs01->vm_exit_controls&
> +			(~(VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT)))
> +		       | vmcs12->vm_exit_controls);

vmx_vcpu_setup

> +
> +	vmcs_write32(VM_ENTRY_CONTROLS,
> +		     (vmcs01->vm_entry_controls&
> +			(~(VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE)))
> +		      | vmcs12->vm_entry_controls);

vmx_vcpu_setup; IA32E mode will be updated by enter_lmode() or 
exit_lmode() which you'll need to call.

> +
> +	vmcs_writel(CR4_GUEST_HOST_MASK,
> +		    (vmcs01->cr4_guest_host_mask&
> +		     vmcs12->cr4_guest_host_mask));
> +	

~cr4_guest_owned_bits

> +	vmcs_write64(TSC_OFFSET, vmcs01->tsc_offset + vmcs12->tsc_offset);

Zachary Amsden

> +
> +	return 0;
> +}
> +
>   static struct kvm_x86_ops vmx_x86_ops = {
>   	.cpu_has_kvm_support = cpu_has_kvm_support,
>   	.disabled_by_bios = vmx_disabled_by_bios,


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 10/27] nVMX: Implement VMCLEAR
  2010-10-17 13:37         ` Nadav Har'El
@ 2010-10-17 14:12           ` Avi Kivity
  2010-10-17 14:14             ` Gleb Natapov
  0 siblings, 1 reply; 56+ messages in thread
From: Avi Kivity @ 2010-10-17 14:12 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

  On 10/17/2010 03:37 PM, Nadav Har'El wrote:
> On Sun, Oct 17, 2010, Avi Kivity wrote about "Re: [PATCH 10/27] nVMX: Implement VMCLEAR":
> >  Is something preventing you from changing all kmap()s to kmap_atomic()s
> >  (like guest memory access in the mapped section)?
>
> Yes. We've discussed this before, and I know you suggested other alternatives,
> but the way we currently work with vmcs12 (the page of memory that L1 maintains
> as its VMCS for its L2 guest) is this: When L1 uses VMPTRLD to set the current
> VMCS, we pin this page and kmap it, and keep a pointer to it immediately
> accessible throughout the code. The page is only unmapped and released when
> L1 is done with this VMCS (i.e., calls VMPTRLD again, or VMCLEAR, or of course
> terminates).
>
> The nice thing about this approach, over the alternatives, is that it is
> more efficient than special guest_read/write calls (accesses to vmcs12 are
> ordinary memory accesses) and the code is simpler than it was previously
> with map/unmap pairs around every access.
>
> Obviously, I can't use kmap_atomic() when the mapping is to live a long time,
> also outside atomic constant. This could lead to bugs if two parts of the
> Kernel use the same kmap_atomic() "slot" :(
>

Ok.  Let's keep it for now.  But look at
http://thread.gmane.org/gmane.comp.emulators.kvm.devel/60920 for a much
nicer way to do this (y, can you add kvm_read_guest_cached()?)

Sorry about repeating old arguments, I kunmap_atomic() everything 
immediately after I review it.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 10/27] nVMX: Implement VMCLEAR
  2010-10-17 14:12           ` Avi Kivity
@ 2010-10-17 14:14             ` Gleb Natapov
  0 siblings, 0 replies; 56+ messages in thread
From: Gleb Natapov @ 2010-10-17 14:14 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Nadav Har'El, kvm

On Sun, Oct 17, 2010 at 04:12:35PM +0200, Avi Kivity wrote:
>  On 10/17/2010 03:37 PM, Nadav Har'El wrote:
> >On Sun, Oct 17, 2010, Avi Kivity wrote about "Re: [PATCH 10/27] nVMX: Implement VMCLEAR":
> >>  Is something preventing you from changing all kmap()s to kmap_atomic()s
> >>  (like guest memory access in the mapped section)?
> >
> >Yes. We've discussed this before, and I know you suggested other alternatives,
> >but the way we currently work with vmcs12 (the page of memory that L1 maintains
> >as its VMCS for its L2 guest) is this: When L1 uses VMPTRLD to set the current
> >VMCS, we pin this page and kmap it, and keep a pointer to it immediately
> >accessible throughout the code. The page is only unmapped and released when
> >L1 is done with this VMCS (i.e., calls VMPTRLD again, or VMCLEAR, or of course
> >terminates).
> >
> >The nice thing about this approach, over the alternatives, is that it is
> >more efficient than special guest_read/write calls (accesses to vmcs12 are
> >ordinary memory accesses) and the code is simpler than it was previously
> >with map/unmap pairs around every access.
> >
> >Obviously, I can't use kmap_atomic() when the mapping is to live a long time,
> >also outside atomic constant. This could lead to bugs if two parts of the
> >Kernel use the same kmap_atomic() "slot" :(
> >
> 
> Ok.  Let's keep it for now.  But look at
> http://thread.gmane.org/gmane.comp.emulators.kvm.devel/60920 for a
> much nicer way to to this (y, can you add kvm_read_guest_cached()?)
> 
Yes, haven't done that because my patch set does not have use for it.

> Sorry about repeating old arguments, I kunmap_atomic() everything
> immediately after I review it.
> 
> -- 
> error compiling committee.c: too many arguments to function

--
			Gleb.


* Re: [PATCH 17/27] nVMX: Implement VMLAUNCH and VMRESUME
  2010-10-17 10:12 ` [PATCH 17/27] nVMX: Implement VMLAUNCH and VMRESUME Nadav Har'El
@ 2010-10-17 15:06   ` Avi Kivity
  0 siblings, 0 replies; 56+ messages in thread
From: Avi Kivity @ 2010-10-17 15:06 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

  On 10/17/2010 12:12 PM, Nadav Har'El wrote:
> Implement the VMLAUNCH and VMRESUME instructions, allowing a guest
> hypervisor to run its own guests.
>
> Signed-off-by: Nadav Har'El<nyh@il.ibm.com>
> ---
>   arch/x86/kvm/vmx.c |  221 ++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 218 insertions(+), 3 deletions(-)
>
> --- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:02.000000000 +0200
> +++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:02.000000000 +0200
> @@ -281,6 +281,9 @@ struct __packed vmcs12 {
>   	struct vmcs_fields fields;
>
>   	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
> +
> +	int cpu;

Why is this needed?

> +	int launched;

Doesn't it duplicate launch_state?

If I asked these before, it may indicate a comment is needed.

>   };
>
>   /*
> @@ -315,6 +318,23 @@ struct nested_vmx {
>   	/* list of real (hardware) VMCS, one for each L2 guest of L1 */
>   	struct list_head vmcs02_list; /* a vmcs_list */
>   	int vmcs02_num;
> +
> +	/* Are we running a nested guest now */
> +	bool nested_mode;

TODO: live migration for this state.

> +	/* Level 1 state for switching to level 2 and back */
> +	struct  {
> +		u64 efer;

Redundant? LMA/LME are set by IA32E_MODE_GUEST, and the other bits are
unchanged by the transition.

> +		unsigned long cr3;
> +		unsigned long cr4;

Redundant with L1's HOST_CRx?

> +		u64 io_bitmap_a;
> +		u64 io_bitmap_b;

Unneeded? Should not ever change.

> +		u64 msr_bitmap;

Update using setup_msrs().

> +		int cpu;
> +		int launched;

Hmm.

> +	} l1_state;
> +	/* Saving the VMCS that we used for running L1 */
> +	struct vmcs *vmcs01;
> +	struct vmcs_fields *vmcs01_fields;

vmcs01_fields unneeded if we can restructure according to my comments to 
the previous patch.

>   };
>
>   struct vcpu_vmx {
> @@ -1349,6 +1369,16 @@ static void vmx_vcpu_load(struct kvm_vcp
>
>   		rdmsrl(MSR_IA32_SYSENTER_ESP, sysenter_esp);
>   		vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */
> +
> +		if (vmx->nested.vmcs01_fields != NULL) {
> +			struct vmcs_fields *vmcs01 = vmx->nested.vmcs01_fields;
> +			vmcs01->host_tr_base = vmcs_readl(HOST_TR_BASE);
> +			vmcs01->host_gdtr_base = vmcs_readl(HOST_GDTR_BASE);
> +			vmcs01->host_ia32_sysenter_esp =
> +				vmcs_readl(HOST_IA32_SYSENTER_ESP);
> +			if (vmx->nested.nested_mode)
> +				load_vmcs_host_state(vmcs01);
> +		}
>   	}
>   }

Instead, you can call a subset of vmx_vcpu_load() which updates these 
fields on nested vmexit.  In fact I think calling vmx_vcpu_load() as is 
may work.

Same for nested vmentry.  Once you switch the vmcs, call vmx_vcpu_load() 
and it will update the per-cpu parts of vmcs02.

It will also update per_cpu(current_vmcs) and per_cpu(vcpus_on_cpu)
which are needed for smp and for suspend/resume.  You'll also need to 
call vmx_vcpu_put() (but without the __vmx_load_host_state() part).

> +
> +static int handle_launch_or_resume(struct kvm_vcpu *vcpu, bool launch)
> +{
> +	if (!nested_vmx_check_permission(vcpu))
> +		return 1;
> +
> +	/* yet another strange pre-requisite listed in the VMX spec */
> +	if (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO)&  GUEST_INTR_STATE_MOV_SS){
> +		nested_vmx_failValid(vcpu,
> +			VMXERR_ENTRY_EVENTS_BLOCKED_BY_MOV_SS);
> +		skip_emulated_instruction(vcpu);
> +		return 1;
> +	}

Can't you just let the guest launch and handle the failure if it happens?

Can be done later; just add a TODO.

> +
> +	if (to_vmx(vcpu)->nested.current_vmcs12->launch_state == launch) {
> +		/* Must use VMLAUNCH for the first time, VMRESUME later */
> +		nested_vmx_failValid(vcpu,
> +			launch ? VMXERR_VMLAUNCH_NONCLEAR_VMCS :
> +				 VMXERR_VMRESUME_NONLAUNCHED_VMCS);
> +		skip_emulated_instruction(vcpu);
> +		return 1;
> +	}

Ditto.  Less critical since it doesn't involve a VMREAD.

> +
> +	skip_emulated_instruction(vcpu);
> +
> +	nested_vmx_run(vcpu);
> +	return 1;
> +}
> +
>
> +
> +static int nested_vmx_run(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	vmx->nested.nested_mode = true;
> +	sync_cached_regs_to_vmcs(vcpu);
> +	save_vmcs(vmx->nested.vmcs01_fields);
> +
> +	vmx->nested.l1_state.efer = vcpu->arch.efer;
> +	if (!enable_ept)
> +		vmx->nested.l1_state.cr3 = vcpu->arch.cr3;
> +	vmx->nested.l1_state.cr4 = vcpu->arch.cr4;
> +
> +	if (cpu_has_vmx_msr_bitmap())
> +		vmx->nested.l1_state.msr_bitmap = vmcs_read64(MSR_BITMAP);
> +	else
> +		vmx->nested.l1_state.msr_bitmap = 0;
> +
> +	vmx->nested.l1_state.io_bitmap_a = vmcs_read64(IO_BITMAP_A);
> +	vmx->nested.l1_state.io_bitmap_b = vmcs_read64(IO_BITMAP_B);
> +	vmx->nested.vmcs01 = vmx->vmcs;
> +	vmx->nested.l1_state.cpu = vcpu->cpu;
> +	vmx->nested.l1_state.launched = vmx->launched;
> +
> +	vmx->vmcs = nested_get_current_vmcs(vcpu);
> +	if (!vmx->vmcs) {
> +		printk(KERN_ERR "Missing VMCS\n");

Guest exploitable printk(), remove.  There are debug printk macros 
around, you can use them if they help debugging.

> +		nested_vmx_failValid(vcpu, VMXERR_VMRESUME_CORRUPTED_VMCS);
> +		return 1;
> +	}
> +
> +	vcpu->cpu = vmx->nested.current_vmcs12->cpu;
> +	vmx->launched = vmx->nested.current_vmcs12->launched;

These bits are volatile (changed by process migration) so can only be 
used in preempt disabled contexts.


> +
> +	if (!vmx->nested.current_vmcs12->launch_state || !vmx->launched) {
> +		vmcs_clear(vmx->vmcs);
> +		vmx->launched = 0;
> +		vmx->nested.current_vmcs12->launch_state = 1;

launch_state == 1 -> not launched?  strange.

vmcs_clear() needs to happen on the right cpu.

> +	}
> +
> +	vmx_vcpu_load(vcpu, get_cpu());

Does this not do everything correctly?

I think you need to move the get_cpu() earlier so preemption is disabled
sooner.  Not sure how far up it needs to go.
> +	put_cpu();
> +
> +	prepare_vmcs02(vcpu,
> +		get_vmcs12_fields(vcpu), vmx->nested.vmcs01_fields);
> +
> +	if (get_vmcs12_fields(vcpu)->vm_entry_controls&
> +	    VM_ENTRY_IA32E_MODE) {
> +		if (!((vcpu->arch.efer&  EFER_LMA)&&
> +		      (vcpu->arch.efer&  EFER_LME)))

> +			vcpu->arch.efer |= (EFER_LMA | EFER_LME);
> +	} else {
> +		if ((vcpu->arch.efer&  EFER_LMA) ||
> +		    (vcpu->arch.efer&  EFER_LME))
> +			vcpu->arch.efer = 0;

Clearing all of EFER is incorrect.  Just assign IA32E_MODE 
unconditionally to both EFER.LMA and EFER.LME.
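
i.e. roughly (untested):

	if (get_vmcs12_fields(vcpu)->vm_entry_controls & VM_ENTRY_IA32E_MODE)
		vcpu->arch.efer |= (EFER_LMA | EFER_LME);
	else
		vcpu->arch.efer &= ~(EFER_LMA | EFER_LME);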

> +	}
> +
> +	vmx->rmode.vm86_active =
> +		!(get_vmcs12_fields(vcpu)->cr0_read_shadow&  X86_CR0_PE);

Needs to be unconditionally false since we don't support real mode 
nested guests.  No need to clear since we can't vmenter from a real mode 
guest.

> +
> +	/* vmx_set_cr0() sets the cr0 that L2 will read, to be the one that L1
> +	 * dictated, and takes appropriate actions for special cr0 bits (like
> +	 * real mode, etc.).
> +	 */
> +	vmx_set_cr0(vcpu, guest_readable_cr0(get_vmcs12_fields(vcpu)));

Don't we want vmcs12->guest_cr0 here?  guest_readable_cr0() is only 
useful for lmsw and emulated_read_cr().

Paging mode etc. are set by guest_cr0; consider cr0_read_shadow.pg=0 and 
guest_cr0.pg=1.

> +
> +	/* However, vmx_set_cr0 incorrectly enforces KVM's relationship between
> +	 * GUEST_CR0 and CR0_READ_SHADOW, e.g., that the former is the same as
> +	 * the latter with with TS added if !fpu_active. We need to take the
> +	 * actual GUEST_CR0 that L1 wanted, just with added TS if !fpu_active
> +	 * like KVM wants (for the "lazy fpu" feature, to avoid the costly
> +	 * restoration of fpu registers until the FPU is really used).
> +	 */
> +	vmcs_writel(GUEST_CR0, get_vmcs12_fields(vcpu)->guest_cr0 |
> +		(vcpu->fpu_active ? 0 : X86_CR0_TS));

See?

> +
> +	vmx_set_cr4(vcpu, get_vmcs12_fields(vcpu)->guest_cr4);
> +	vmcs_writel(CR4_READ_SHADOW,
> +		    get_vmcs12_fields(vcpu)->cr4_read_shadow);
> +
> +	/* we have to set the X86_CR0_PG bit of the cached cr0, because
> +	 * kvm_mmu_reset_context enables paging only if X86_CR0_PG is set in
> +	 * CR0 (we need the paging so that KVM treat this guest as a paging
> +	 * guest so we can easly forward page faults to L1.)
> +	 */
> +	vcpu->arch.cr0 |= X86_CR0_PG;

Shouldn't be needed (but should check that guest_cr0 has the always-on 
bits set).

> +
> +	if (enable_ept) {
> +		vmcs_write32(GUEST_CR3, get_vmcs12_fields(vcpu)->guest_cr3);
> +		vmx->vcpu.arch.cr3 = get_vmcs12_fields(vcpu)->guest_cr3;
> +	} else {
> +		int r;
> +		kvm_set_cr3(vcpu, get_vmcs12_fields(vcpu)->guest_cr3);
> +		kvm_mmu_reset_context(vcpu);
> +
> +		r = kvm_mmu_load(vcpu);
> +		if (unlikely(r)) {
> +			printk(KERN_ERR "Error in kvm_mmu_load r %d\n", r);
> +			nested_vmx_failValid(vcpu,
> +				VMXERR_VMRESUME_CORRUPTED_VMCS /* ? */);
> +			/* switch back to L1 */
> +			vmx->nested.nested_mode = false;
> +			vmx->vmcs = vmx->nested.vmcs01;
> +			vcpu->cpu = vmx->nested.l1_state.cpu;
> +			vmx->launched = vmx->nested.l1_state.launched;
> +
> +			vmx_vcpu_load(vcpu, get_cpu());
> +			put_cpu();
> +
> +			return 1;
> +		}
> +	}

I think you can call kvm_set_cr3() unconditionally.  It will return 1 on 
bad cr3 which you can use for failing the entry.

btw, kvm_mmu_reset_context() is needed on ept as well; kvm_mmu_load() is 
not needed AFAICT (the common entry code will take care of it).
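
A sketch of that simplification (assuming, as noted above, that
kvm_set_cr3() reports a bad cr3 with a non-zero return):

	if (kvm_set_cr3(vcpu, get_vmcs12_fields(vcpu)->guest_cr3)) {
		nested_vmx_failValid(vcpu,
			VMXERR_VMRESUME_CORRUPTED_VMCS /* ? */);
		/* ... switch back to L1 as in the existing error path ... */
		return 1;
	}
	kvm_mmu_reset_context(vcpu);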


-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 19/27] nVMX: Exiting from L2 to L1
  2010-10-17 10:13 ` [PATCH 19/27] nVMX: Exiting from L2 to L1 Nadav Har'El
@ 2010-10-17 15:58   ` Avi Kivity
  0 siblings, 0 replies; 56+ messages in thread
From: Avi Kivity @ 2010-10-17 15:58 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

  On 10/17/2010 12:13 PM, Nadav Har'El wrote:
> This patch implements nested_vmx_vmexit(), called when the nested L2 guest
> exits and we want to run its L1 parent and let it handle this exit.
>
> Note that this will not necessarily be called on every L2 exit. L0 may decide
> to handle a particular exit on its own, without L1's involvement; In that
> case, L0 will handle the exit, and resume running L2, without running L1 and
> without calling nested_vmx_vmexit(). The logic for deciding whether to handle
> a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
> will appear in the next patch.
>
> Signed-off-by: Nadav Har'El<nyh@il.ibm.com>
> ---
>   arch/x86/kvm/vmx.c |  235 +++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 235 insertions(+)
>
> --- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:02.000000000 +0200
> +++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:02.000000000 +0200
> @@ -5085,6 +5085,8 @@ static void __vmx_complete_interrupts(st
>
>   static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
>   {
> +	if (vmx->nested.nested_mode)
> +		return;
>   	__vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
>   				  VM_EXIT_INSTRUCTION_LEN,
>   				  IDT_VECTORING_ERROR_CODE);
> @@ -5981,6 +5983,239 @@ static int nested_vmx_run(struct kvm_vcp
>   	return 1;
>   }
>
> +/*
> + * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date
> + * because L2 may have changed some cr0 bits directly (see CRO_GUEST_HOST_MASK)
> + * without L0 trapping the change and updating vmcs12.
> + * This function returns the value we should put in vmcs12.guest_cr0. It's not
> + * enough to just return the current (vmcs02) GUEST_CR0. This may not be the
> + * guest cr0 that L1 thought it was giving its L2 guest - it is possible that
> + * L1 wished to allow its guest to set a cr0 bit directly, but we (L0) asked
> + * to trap this change and instead set just the read shadow. If this is the
> + * case, we need to copy these read-shadow bits back to vmcs12.guest_cr0, where
> + * L1 believes they already are.
> + */
> +static inline unsigned long
> +vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs_fields *vmcs12)
> +{
> +	unsigned long guest_cr0_bits =
> +		vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask;
> +	return (vmcs_readl(GUEST_CR0)&  guest_cr0_bits) |
> +		(vmcs_readl(CR0_READ_SHADOW)&  ~guest_cr0_bits);
> +}

I think it's easier to keep cr0_guest_owned_bits up to date, and use 
kvm_read_cr0().  In fact, I think you have to do it so kvm_read_cr0() 
works correctly while in nested guest context.
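
i.e. something like the following, assuming cr0_guest_owned_bits is kept
in sync with L1's cr0_guest_host_mask while in nested mode:

	vmcs12->guest_cr0 = kvm_read_cr0(vcpu);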

> +
> +static inline unsigned long
> +vmcs12_guest_cr4(struct kvm_vcpu *vcpu, struct vmcs_fields *vmcs12)
> +{
> +	unsigned long guest_cr4_bits =
> +		vcpu->arch.cr4_guest_owned_bits | vmcs12->cr4_guest_host_mask;
> +	return (vmcs_readl(GUEST_CR4)&  guest_cr4_bits) |
> +		(vmcs_readl(CR4_READ_SHADOW)&  ~guest_cr4_bits);
> +}

Ditto.

> +
> +/*
> + * prepare_vmcs12 is called when the nested L2 guest exits and we want to
> + * prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12), and this
> + * function updates it to reflect the changes to the guest state while L2 was
> + * running (and perhaps made some exits which were handled directly by L0
> + * without going back to L1), and to reflect the exit reason.
> + * Note that we do not have to copy here all VMCS fields, just those that
> + * could have changed by the L2 guest or the exit - i.e., the guest-state and
> + * exit-information fields only. Other fields are modified by L1 with VMWRITE,
> + * which already writes to vmcs12 directly.
> + */
> +void prepare_vmcs12(struct kvm_vcpu *vcpu)
> +{
> +	struct vmcs_fields *vmcs12 = get_vmcs12_fields(vcpu);
> +
> +	/* update guest state fields: */
> +	vmcs12->guest_cr0 = vmcs12_guest_cr0(vcpu, vmcs12);
> +	vmcs12->guest_cr4 = vmcs12_guest_cr4(vcpu, vmcs12);
> +
> +	vmcs12->guest_dr7 = vmcs_readl(GUEST_DR7);
> +	vmcs12->guest_rsp = vmcs_readl(GUEST_RSP);
> +	vmcs12->guest_rip = vmcs_readl(GUEST_RIP);

kvm_register_read(), kvm_rip_read(), kvm_get_dr().
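
i.e. roughly, using the cached accessors (untested sketch):

	unsigned long dr7;

	vmcs12->guest_rsp = kvm_register_read(vcpu, VCPU_REGS_RSP);
	vmcs12->guest_rip = kvm_rip_read(vcpu);
	kvm_get_dr(vcpu, 7, &dr7);
	vmcs12->guest_dr7 = dr7;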

> +	vmcs12->guest_rflags = vmcs_readl(GUEST_RFLAGS);
> +

> +
> +	if(enable_ept){
> +		vmcs12->guest_physical_address =
> +			vmcs_read64(GUEST_PHYSICAL_ADDRESS);
> +		vmcs12->guest_linear_address = vmcs_readl(GUEST_LINEAR_ADDRESS);
> +	}

Drop please.

> +
> +static int nested_vmx_vmexit(struct kvm_vcpu *vcpu, bool is_interrupt)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	int efer_offset;
> +	struct vmcs_fields *vmcs01 = vmx->nested.vmcs01_fields;
> +
> +	if (!vmx->nested.nested_mode) {
> +		printk(KERN_INFO "WARNING: %s called but not in nested mode\n",
> +		       __func__);
> +		return 0;
> +	}
> +
> +	sync_cached_regs_to_vmcs(vcpu);

If you use kvm_rip_read() etc, this isn't needed.

> +
> +	prepare_vmcs12(vcpu);
> +	if (is_interrupt)
> +		get_vmcs12_fields(vcpu)->vm_exit_reason =
> +			EXIT_REASON_EXTERNAL_INTERRUPT;

Somewhat strange that a field is updated conditionally.

> +
> +	vmx->nested.current_vmcs12->launched = vmx->launched;
> +	vmx->nested.current_vmcs12->cpu = vcpu->cpu;
> +
> +	vmx->vmcs = vmx->nested.vmcs01;
> +	vcpu->cpu = vmx->nested.l1_state.cpu;
> +	vmx->launched = vmx->nested.l1_state.launched;
> +
> +	vmx->nested.nested_mode = false;
> +
> +	vmx_vcpu_load(vcpu, get_cpu());
> +	put_cpu();

Again need to extend the preempt disable region, probably to before you 
assign vmx->vmcs.

> +
> +	vcpu->arch.efer = vmx->nested.l1_state.efer;
> +	if ((vcpu->arch.efer&  EFER_LMA)&&
> +	    !(vcpu->arch.efer&  EFER_SCE))
> +		vcpu->arch.efer |= EFER_SCE;
> +
> +	efer_offset = __find_msr_index(vmx, MSR_EFER);
> +	if (update_transition_efer(vmx, efer_offset))
> +		wrmsrl(MSR_EFER, vmx->guest_msrs[efer_offset].data);

Use kvm_set_efer().  Just take the existing efer and use "host address 
space vm-exit control bit" for LMA and LME.

You may need to enter_lmode() or exit_lmode() manually.
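
An untested sketch, using names already in this file (the lmode
interaction still needs checking, as noted):

	u64 efer = vcpu->arch.efer & ~(EFER_LMA | EFER_LME);

	if (get_vmcs12_fields(vcpu)->vm_exit_controls &
	    VM_EXIT_HOST_ADDR_SPACE_SIZE)
		efer |= EFER_LMA | EFER_LME;
	kvm_set_efer(vcpu, efer);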

> +	
> +	/*
> +	 * L2 perhaps switched to real mode and set vmx->rmode, but we're back
> +	 * in L1 and as it is running VMX, it can't be in real mode.
> +	 */
> +	vmx->rmode.vm86_active = 0;
> +

L2 cannot be in real mode (vmx non-root mode does not support it).  L1
cannot be in real mode (vmx root operation does not support it).  So no 
need for the assignment.

> +	/*
> +	 * We're running a regular L1 guest again, so we do the regular KVM
> +	 * thing: run vmx_set_cr0 with the cr0 bits the guest thinks it has.
> +	 * vmx_set_cr0 might use slightly different bits on the new guest_cr0
> +	 * it sets, e.g., add TS when !fpu_active.
> +	 * Note that vmx_set_cr0 refers to rmode and efer set above.
> +	 */
> +	vmx_set_cr0(vcpu, guest_readable_cr0(vmcs01));

Should be vmcs12->host_cr0.

> +	/*
> +	 * If we did fpu_activate()/fpu_deactive() during l2's run, we need to
> +	 * apply the same changes to l1's vmcs. We just set cr0 correctly, but
> +	 * now we need to also update cr0_guest_host_mask and exception_bitmap.
> +	 */
> +	vmcs_write32(EXCEPTION_BITMAP,
> +		(vmcs01->exception_bitmap & ~(1u<<NM_VECTOR)) |
> +			(vcpu->fpu_active ? 0 : (1u<<NM_VECTOR)));

Use update_exception_bitmap() here instead of open-coding this.

> +	vcpu->arch.cr0_guest_owned_bits = (vcpu->fpu_active ? X86_CR0_TS : 0);

&= ~X86_CR0_TS removes the inside information about fpu_active.

> +	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);


> +
> +
> +	vmx_set_cr4(vcpu, vmx->nested.l1_state.cr4);
> +
> +	if (enable_ept) {
> +		vcpu->arch.cr3 = vmcs01->guest_cr3;
> +		vmcs_write32(GUEST_CR3, vmcs01->guest_cr3);
> +		vmcs_write64(EPT_POINTER, vmcs01->ept_pointer);
> +		vmcs_write64(GUEST_PDPTR0, vmcs01->guest_pdptr0);
> +		vmcs_write64(GUEST_PDPTR1, vmcs01->guest_pdptr1);
> +		vmcs_write64(GUEST_PDPTR2, vmcs01->guest_pdptr2);
> +		vmcs_write64(GUEST_PDPTR3, vmcs01->guest_pdptr3);

vmexits do not reload PDPTRs from a cache.  Instead, use kvm_set_cr3() 
unconditionally.
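
E.g. (sketch only; whichever of l1_state.cr3 / vmcs01->guest_cr3 ends up
holding the saved L1 cr3):

	kvm_set_cr3(vcpu, vmx->nested.l1_state.cr3);
	if (enable_ept)
		vmcs_write64(EPT_POINTER, vmcs01->ept_pointer);

and let the mmu code rewrite GUEST_CR3 and the PDPTRs on the next entry.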

> +	} else {
> +		kvm_set_cr3(vcpu, vmx->nested.l1_state.cr3);
> +	}
> +
> +	kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs01->guest_rsp);
> +	kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs01->guest_rip);
> +
> +	kvm_mmu_reset_context(vcpu);
> +	kvm_mmu_load(vcpu);

kvm_mmu_load() is unnecessary, next guest entry will do it automatically 
IIRC.

> +
> +	if (unlikely(vmx->fail)) {
> +		/*
> +		 * When L1 launches L2 and then we (L0) fail to launch L2,
> +		 * we nested_vmx_vmexit back to L1, but now should let it know
> +		 * that the VMLAUNCH failed - with the same error that we
> +		 * got when launching L2.
> +		 */
> +		vmx->fail = 0;
> +		nested_vmx_failValid(vcpu, vmcs_read32(VM_INSTRUCTION_ERROR));
> +	} else
> +		nested_vmx_succeed(vcpu);
> +
> +	return 0;
> +}
> +
>   static struct kvm_x86_ops vmx_x86_ops = {
>   	.cpu_has_kvm_support = cpu_has_kvm_support,
>   	.disabled_by_bios = vmx_disabled_by_bios,


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 25/27] nVMX: Additional TSC-offset handling
  2010-10-17 10:16 ` [PATCH 25/27] nVMX: Additional TSC-offset handling Nadav Har'El
@ 2010-10-19 19:13   ` Zachary Amsden
  0 siblings, 0 replies; 56+ messages in thread
From: Zachary Amsden @ 2010-10-19 19:13 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb, avi

On 10/17/2010 12:16 AM, Nadav Har'El wrote:
> In the unlikely case that L1 does not capture MSR_IA32_TSC, L0 needs to
> emulate this MSR write by L2 by modifying vmcs02.tsc_offset.
> We also need to set vmcs12.tsc_offset, for this change to survive the next
> nested entry (see prepare_vmcs02()).
>
> Signed-off-by: Nadav Har'El<nyh@il.ibm.com>
> ---
>   arch/x86/kvm/vmx.c |   11 +++++++++++
>   1 file changed, 11 insertions(+)
>
> --- .before/arch/x86/kvm/vmx.c	2010-10-17 11:52:03.000000000 +0200
> +++ .after/arch/x86/kvm/vmx.c	2010-10-17 11:52:03.000000000 +0200
> @@ -1674,12 +1674,23 @@ static u64 guest_read_tsc(void)
>   static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
>   {
>   	vmcs_write64(TSC_OFFSET, offset);
> +	if (to_vmx(vcpu)->nested.nested_mode)
> +		/*
> +		 * We are only changing TSC_OFFSET when L2 is running if for
> +		 * some reason L1 chose not to trap the TSC MSR. Since
> +		 * prepare_vmcs12() does not copy tsc_offset, we need to also
> +		 * set the vmcs12 field here.
> +		 */
> +		get_vmcs12_fields(vcpu)->tsc_offset = offset -
> +			to_vmx(vcpu)->nested.vmcs01_fields->tsc_offset;
>   }
>    

This path also arrives when L0 is initializing a new vmcs.  In that 
case, nested_mode will not be set, but it is worth noting because the 
same principle applies to the next function.

>
>   static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
>   {
>   	u64 offset = vmcs_read64(TSC_OFFSET);
>   	vmcs_write64(TSC_OFFSET, offset + adjustment);
> +	if (to_vmx(vcpu)->nested.nested_mode)
> +		get_vmcs12_fields(vcpu)->tsc_offset += adjustment;
>   }
>    

This path arrives when L0 is compensating for local changes to TSC.  In 
that case, you need to ensure that L1 is properly and persistently 
compensated.  Depending on which vmcs is active, this may not 
necessarily be the case.
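
One way this might be handled - very much untested, and assuming vmcs02's
offset is built as vmcs01_fields->tsc_offset + vmcs12->tsc_offset - is to
record the L0 adjustment in the saved vmcs01 fields, so it survives the
switch back to L1 while vmcs12 stays whatever L1 wrote:

	static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
	{
		struct vcpu_vmx *vmx = to_vmx(vcpu);

		/* adjust whichever vmcs is currently loaded... */
		vmcs_write64(TSC_OFFSET, vmcs_read64(TSC_OFFSET) + adjustment);
		/* ...and make the compensation persistent for L1 */
		if (vmx->nested.nested_mode)
			vmx->nested.vmcs01_fields->tsc_offset += adjustment;
	}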

I've not yet closely looked at how the VMCS is managed for nested VMX, 
but will look at it a bit deeper.

Thanks,

Zach

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 20/27] nVMX: Deciding if L0 or L1 should handle an L2 exit
  2010-10-17 10:13 ` [PATCH 20/27] nVMX: Deciding if L0 or L1 should handle an L2 exit Nadav Har'El
@ 2010-10-20 12:13   ` Avi Kivity
  2010-10-20 14:57     ` Avi Kivity
  0 siblings, 1 reply; 56+ messages in thread
From: Avi Kivity @ 2010-10-20 12:13 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

  On 10/17/2010 12:13 PM, Nadav Har'El wrote:
> This patch contains the logic of whether an L2 exit should be handled by L0
> and then L2 should be resumed, or whether L1 should be run to handle this
> exit (using the nested_vmx_vmexit() function of the previous patch).
>
> The basic idea is to let L1 handle the exit only if it actually asked to
> trap this sort of event. For example, when L2 exits on a change to CR0,
> we check L1's CR0_GUEST_HOST_MASK to see if L1 expressed interest in any
> bit which changed; if it did, we exit to L1. But if it didn't, it means that
> we (L0) wished to trap this event, so we handle it ourselves.
>
> The next two patches add additional logic of what to do when an interrupt or
> exception is injected: Does L0 need to do it, should we exit to L1 to do it,
> or should we resume L2 and keep the exception to be injected later.
>
> We keep a new flag, "nested_run_pending", which can override the decision of
> which should run next, L1 or L2. nested_run_pending=1 means that we *must* run
> L2 next, not L1. This is necessary in particular when L1 did a VMLAUNCH of L2
> and therefore expects L2 to be run (and perhaps be injected with an event it
> specified, etc.). Nested_run_pending is especially intended to avoid switching
> to L1 in the injection decision-point described above.
>
>
>   /*
> + * Return 1 if we should exit from L2 to L1 to handle an MSR access,
> + * rather than handle it ourselves in L0. I.e., check L1's MSR bitmap whether
> + * it expressed interest in the current event (read or write a specific MSR).
> + */
> +static bool nested_vmx_exit_handled_msr(struct kvm_vcpu *vcpu,
> +	struct vmcs_fields *vmcs12, u32 exit_reason)
> +{
> +	u32 msr_index = vcpu->arch.regs[VCPU_REGS_RCX];
> +	struct page *msr_bitmap_page;
> +	void *va;
> +	bool ret;
> +
> +	if (!cpu_has_vmx_msr_bitmap() || !nested_cpu_has_vmx_msr_bitmap(vcpu))
> +		return 1;
> +
> +	msr_bitmap_page = nested_get_page(vcpu, vmcs12->msr_bitmap);
> +	if (!msr_bitmap_page) {
> +		printk(KERN_INFO "%s error in nested_get_page\n", __func__);
> +		return 0;

return leaks the page.

> +	}
> +
> +	va = kmap_atomic(msr_bitmap_page, KM_USER1);
> +	if (exit_reason == EXIT_REASON_MSR_WRITE)
> +		va += 0x800;
> +	if (msr_index >= 0xc0000000) {
> +		msr_index -= 0xc0000000;
> +		va += 0x400;
> +	}
> +	if (msr_index > 0x1fff)
> +		return 0;

return leaks the kmap.

> +	ret = test_bit(msr_index, va);
> +	kunmap_atomic(va, KM_USER1);
> +	return ret;
> +}

How about using kvm_read_guest() instead?  Much simpler and safer.
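
Roughly (untested):

	gpa_t bitmap = vmcs12->msr_bitmap;
	u8 b;

	if (exit_reason == EXIT_REASON_MSR_WRITE)
		bitmap += 0x800;
	if (msr_index >= 0xc0000000) {
		msr_index -= 0xc0000000;
		bitmap += 0x400;
	}
	if (msr_index > 0x1fff)
		return 0;
	if (kvm_read_guest(vcpu->kvm, bitmap + msr_index / 8, &b, 1))
		return 1;	/* if we cannot read the bitmap, exit to L1 */
	return b & (1 << (msr_index & 7));

No page reference or kmap to clean up, and it works even if L1's bitmap
page is not resident.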

> +
> +/*
> + * Return 1 if we should exit from L2 to L1 to handle a CR access exit,
> + * rather than handle it ourselves in L0. I.e., check if L1 wanted to
> + * intercept (via guest_host_mask etc.) the current event.
> + */
> +static bool nested_vmx_exit_handled_cr(struct kvm_vcpu *vcpu,
> +	struct vmcs_fields *vmcs12)
> +{
> +	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> +	int cr = exit_qualification & 15;
> +	int reg = (exit_qualification >> 8) & 15;
> +	unsigned long val = kvm_register_read(vcpu, reg);
> +
> +	switch ((exit_qualification >> 4) & 3) {
> +	case 0: /* mov to cr */
> +		switch (cr) {
> +		case 0:
> +			if (vmcs12->cr0_guest_host_mask &
> +			    (val ^ vmcs12->cr0_read_shadow))
> +				return 1;
> +			break;
> +		case 3:
> +			if (vmcs12->cpu_based_vm_exec_control &
> +			    CPU_BASED_CR3_LOAD_EXITING)
> +				return 1;
> +			break;
> +		case 4:
> +			if (vmcs12->cr4_guest_host_mask &
> +			    (vmcs12->cr4_read_shadow ^ val))
> +				return 1;
> +			break;
> +		case 8:
> +			if (vmcs12->cpu_based_vm_exec_control &
> +			    CPU_BASED_CR8_LOAD_EXITING)
> +				return 1;
> +			/*
> +			 * TODO: missing else if control & CPU_BASED_TPR_SHADOW
> +			 * then set tpr shadow and if below tpr_threshold, exit.
> +			 */
> +			break;
> +		}
> +		break;
> +	case 2: /* clts */
> +		if (vmcs12->cr0_guest_host_mask & X86_CR0_TS)
> +			return 1;

If TS is already clear in the guest's cr0_read_shadow, an L2->L1 exit is 
not needed.
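
I.e. something like:

	case 2: /* clts */
		return (vmcs12->cr0_guest_host_mask & X86_CR0_TS) &&
		       (vmcs12->cr0_read_shadow & X86_CR0_TS);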

> +		break;
> +	case 1: /* mov from cr */
> +		switch (cr) {
> +		case 0:
> +			return 1;

Cannot happen.

> +		case 3:
> +			if (vmcs12->cpu_based_vm_exec_control &
> +			    CPU_BASED_CR3_STORE_EXITING)
> +				return 1;
> +			break;
> +		case 4:
> +			return 1;
> +			break;

Cannot happen.

> +		case 8:
> +			if (vmcs12->cpu_based_vm_exec_control &
> +			    CPU_BASED_CR8_STORE_EXITING)
> +				return 1;

What about TPR threshold?  Or is it not supported yet?

> +			break;
> +		}
> +		break;
> +	case 3: /* lmsw */
> +		/*
> +		 * lmsw can change bits 1..3 of cr0, and only set bit 0 of
> +		 * cr0. Other attempted changes are ignored, with no exit.
> +		 */
> +		if (vmcs12->cr0_guest_host_mask & 0xe &
> +		    (val ^ vmcs12->cr0_read_shadow))
> +			return 1;
> +		if ((vmcs12->cr0_guest_host_mask & 0x1) &&
> +		    !(vmcs12->cr0_read_shadow & 0x1) &&
> +		    (val & 0x1))
> +			return 1;
> +		break;
> +	}
> +	return 0;
> +}

I'd prefer to move the intercept checks to kvm_set_cr(), but that can be 
done later (much later).

> +
> +/*
> + * Return 1 if we should exit from L2 to L1 to handle an exit, or 0 if we
> + * should handle it ourselves in L0 (and then continue L2). Only call this
> + * when in nested_mode (L2).
> + */
> +static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
> +{
> +	u32 exit_reason = vmcs_read32(VM_EXIT_REASON);
> +	u32 intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	struct vmcs_fields *vmcs12 = get_vmcs12_fields(vcpu);
> +
> +	if (vmx->nested.nested_run_pending)
> +		return 0;
> +
> +	if (unlikely(vmx->fail)) {
> +		printk(KERN_INFO "%s failed vm entry %x\n",
> +		       __func__, vmcs_read32(VM_INSTRUCTION_ERROR));
> +		return 1;
> +	}
> +
> +	switch (exit_reason) {
> +	case EXIT_REASON_EXTERNAL_INTERRUPT:
> +		return 0;
> +	case EXIT_REASON_EXCEPTION_NMI:
> +		if (!is_exception(intr_info))
> +			return 0;
> +		else if (is_page_fault(intr_info) && (!enable_ept))
> +			return 0;

We may still find out later that the page fault needs to be intercepted 
by the guest, yes?

> +		return (vmcs12->exception_bitmap &
> +				(1u << (intr_info & INTR_INFO_VECTOR_MASK)));
> +	case EXIT_REASON_EPT_VIOLATION:
> +		return 0;
> +	case EXIT_REASON_INVLPG:
> +		return (vmcs12->cpu_based_vm_exec_control &
> +				CPU_BASED_INVLPG_EXITING);
> +	case EXIT_REASON_MSR_READ:
> +	case EXIT_REASON_MSR_WRITE:
> +		return nested_vmx_exit_handled_msr(vcpu, vmcs12, exit_reason);
> +	case EXIT_REASON_CR_ACCESS:
> +		return nested_vmx_exit_handled_cr(vcpu, vmcs12);
> +	case EXIT_REASON_DR_ACCESS:
> +		return (vmcs12->cpu_based_vm_exec_control &
> +				CPU_BASED_MOV_DR_EXITING);
> +	default:
> +		/*
> +		 * One particularly interesting case that is covered here is an
> +		 * exit caused by L2 running a VMX instruction. L2 is in guest
> +		 * mode in L1's world, and according to the VMX spec running a
> +		 * VMX instruction in guest mode should cause an exit to root
> +		 * mode, i.e., to L1. This is why we need to return r=1 for
> +		 * those exit reasons too. This enables further nesting: Like
> +		 * L0 emulates VMX for L1, we now allow L1 to emulate VMX for
> +		 * L2, who will then be able to run L3.
> +		 */
> +		return 1;

What about intr/nmi window?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 20/27] nVMX: Deciding if L0 or L1 should handle an L2 exit
  2010-10-20 12:13   ` Avi Kivity
@ 2010-10-20 14:57     ` Avi Kivity
  0 siblings, 0 replies; 56+ messages in thread
From: Avi Kivity @ 2010-10-20 14:57 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

  On 10/20/2010 02:13 PM, Avi Kivity wrote:
>> +    switch (exit_reason) {
>> +    case EXIT_REASON_EXTERNAL_INTERRUPT:
>> +        return 0;
>> +    case EXIT_REASON_EXCEPTION_NMI:
>> +        if (!is_exception(intr_info))
>> +            return 0;
>> +        else if (is_page_fault(intr_info) && (!enable_ept))
>> +            return 0;
> +
>
> We may still find out later that the page fault needs to be 
> intercepted by the guest, yes?
>
>> +        return (vmcs12->exception_bitmap &
>> +                (1u << (intr_info & INTR_INFO_VECTOR_MASK)));
>> +    case EXIT_REASON_EPT_VIOLATION:
>> +        return 0;
>> +    case EXIT_REASON_INVLPG:
>> +        return (vmcs12->cpu_based_vm_exec_control &
>> +                CPU_BASED_INVLPG_EXITING);
>> +    case EXIT_REASON_MSR_READ:
>> +    case EXIT_REASON_MSR_WRITE:
>> +        return nested_vmx_exit_handled_msr(vcpu, vmcs12, exit_reason);
>> +    case EXIT_REASON_CR_ACCESS:
>> +        return nested_vmx_exit_handled_cr(vcpu, vmcs12);
>> +    case EXIT_REASON_DR_ACCESS:
>> +        return (vmcs12->cpu_based_vm_exec_control &
>> +                CPU_BASED_MOV_DR_EXITING);
>> +    default:
>> +        /*
>> +         * One particularly interesting case that is covered here is an
>> +         * exit caused by L2 running a VMX instruction. L2 is in guest
>> +         * mode in L1's world, and according to the VMX spec running a
>> +         * VMX instruction in guest mode should cause an exit to root
>> +         * mode, i.e., to L1. This is why we need to return r=1 for
>> +         * those exit reasons too. This enables further nesting: Like
>> +         * L0 emulates VMX for L1, we now allow L1 to emulate VMX for
>> +         * L2, who will then be able to run L3.
>> +         */
>> +        return 1;
>
> What about intr/nmi window?
>

Also WBINVD, pause loop exit, rdtsc[p], monitor/mwait, hlt.

It's best to list every exit reason here, so it's easier to review and 
maintain.
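
For the ones above, just to illustrate the shape (the exact vmcs12 field
names, e.g. secondary_vm_exec_control, are assumptions here):

	case EXIT_REASON_HLT:
		return vmcs12->cpu_based_vm_exec_control &
			CPU_BASED_HLT_EXITING;
	case EXIT_REASON_PENDING_INTERRUPT:	/* interrupt window */
		return vmcs12->cpu_based_vm_exec_control &
			CPU_BASED_VIRTUAL_INTR_PENDING;
	case EXIT_REASON_NMI_WINDOW:
		return vmcs12->cpu_based_vm_exec_control &
			CPU_BASED_VIRTUAL_NMI_PENDING;
	case EXIT_REASON_PAUSE_INSTRUCTION:
		return vmcs12->cpu_based_vm_exec_control &
			CPU_BASED_PAUSE_EXITING;
	case EXIT_REASON_WBINVD:
		return vmcs12->secondary_vm_exec_control &
			SECONDARY_EXEC_WBINVD_EXITING;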


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 15/27] nVMX: Prepare vmcs02 from vmcs01 and vmcs12
  2010-10-17 14:08   ` Avi Kivity
@ 2011-02-08 12:13     ` Nadav Har'El
  2011-02-08 12:27       ` Avi Kivity
  2011-02-08 12:27       ` Avi Kivity
  0 siblings, 2 replies; 56+ messages in thread
From: Nadav Har'El @ 2011-02-08 12:13 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, gleb

On Sun, Oct 17, 2010, Avi Kivity wrote about "Re: [PATCH 15/27] nVMX: Prepare vmcs02 from vmcs01 and vmcs12":
>  On 10/17/2010 12:11 PM, Nadav Har'El wrote:
> >This patch contains code to prepare the VMCS which can be used to actually
> >run the L2 guest, vmcs02. prepare_vmcs02 appropriately merges the 
> >information
> >in vmcs12 (the vmcs that L1 built for L2) and in vmcs01 (the vmcs that we
> >built for L1).
> >
> >VMREAD/WRITE can only access one VMCS at a time (the "current" VMCS), which
> >makes it difficult for us to read from vmcs01 while writing to vmcs12. This
> >is why we first make a copy of vmcs01 in memory (vmcs01_fields) and then
> >read that memory copy while writing to vmcs12.
> >
> 
> I believe I commented on this before - you can call the same functions 
> kvm uses to initialize the normal vmcs to get the common parts filled 
> in.

Hi,

I'm finally attending to this issue, and you're right: for almost all fields
there is hardly a reason to read them from the VMCS into an in-memory vmcs01
structure, when it was KVM that set these fields in the first place and can
set them again when needed in vmcs02. I'm gradually removing the fields from
the vmcs01 in-memory structure (only about a dozen left at this point!), and
the "vmcs12" structure now contains all its fields directly, without another
substructure (as you asked once).

But while doing this, I came across a question that I wonder if you can
clarify for me:

Among the other things it sets up, vmx_vcpu_setup() sets

	rdmsrl(MSR_IA32_SYSENTER_ESP, a);
	vmcs_writel(HOST_IA32_SYSENTER_ESP, a);   /* 22.2.3 */

Why is this needed here? In vmx_vcpu_load(), when a cpu is known (or changed),
we again have:

	rdmsrl(MSR_IA32_SYSENTER_ESP, sysenter_esp);
	vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */

So isn't the first setting, in vmx_vcpu_setup(), redundant?

Thanks,
Nadav.

-- 
Nadav Har'El                        |      Tuesday, Feb  8 2011, 4 Adar I 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Earth First! We can strip-mine the other
http://nadav.harel.org.il           |planets later...

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 15/27] nVMX: Prepare vmcs02 from vmcs01 and vmcs12
  2011-02-08 12:13     ` Nadav Har'El
@ 2011-02-08 12:27       ` Avi Kivity
  2011-02-08 12:36         ` Nadav Har'El
  2011-02-08 12:27       ` Avi Kivity
  1 sibling, 1 reply; 56+ messages in thread
From: Avi Kivity @ 2011-02-08 12:27 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

On 02/08/2011 02:13 PM, Nadav Har'El wrote:
> But while doing this, I came across a question that I wonder if you can
> clarify for me:
>
> Among the other things it sets up, vmx_vcpu_setup() sets
>
> 	rdmsrl(MSR_IA32_SYSENTER_ESP, a);
> 	vmcs_writel(HOST_IA32_SYSENTER_ESP, a);   /* 22.2.3 */
>
> Why is this needed here?

It's not needed here.

> In vmx_vcpu_load(), when a cpu is known (or changed),
> we again have:
>
> 	rdmsrl(MSR_IA32_SYSENTER_ESP, sysenter_esp);
> 	vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */
>
> So isn't the first setting, in vmx_vcpu_setup(), redundant?

It is.  It's just historical baggage - these lines were introduced about 
20 commits into kvm development (af9d6e204919016ca in qemu-kvm.git) and 
never removed, even after it was fixed to be per-cpu.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 15/27] nVMX: Prepare vmcs02 from vmcs01 and vmcs12
  2011-02-08 12:27       ` Avi Kivity
@ 2011-02-08 12:36         ` Nadav Har'El
  2011-02-08 12:39           ` Avi Kivity
  0 siblings, 1 reply; 56+ messages in thread
From: Nadav Har'El @ 2011-02-08 12:36 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, gleb

On Tue, Feb 08, 2011, Avi Kivity wrote about "Re: [PATCH 15/27] nVMX: Prepare vmcs02 from vmcs01 and vmcs12":
> >So isn't the first setting, in vmx_vcpu_setup(), redundant?
> 
> It is.  It's just historical baggage - these lines were introduced about 
> 20 commits into kvm development (af9d6e204919016ca in qemu-kvm.git) and 
> never removed, even after it was fixed to be per-cpu.

Thanks.

As part of my patch, I'm splitting off a function vmx_set_constant_host_state()
from vmx_vcpu_setup() (the same function will also be used to set up the host
state on vmcs02). So I'll remove that redundant setting from this function.
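
Roughly the shape I have in mind (untested, and the exact field list is
still to be sorted out while doing it):

	static void vmx_set_constant_host_state(void)
	{
		u32 low32, high32;

		vmcs_write16(HOST_CS_SELECTOR, __KERNEL_CS);  /* 22.2.4 */
		vmcs_write16(HOST_DS_SELECTOR, __KERNEL_DS);
		vmcs_write16(HOST_ES_SELECTOR, __KERNEL_DS);
		vmcs_write16(HOST_SS_SELECTOR, __KERNEL_DS);
		vmcs_write16(HOST_TR_SELECTOR, GDT_ENTRY_TSS*8);

		rdmsr(MSR_IA32_SYSENTER_CS, low32, high32);
		vmcs_write32(HOST_IA32_SYSENTER_CS, low32);

		/*
		 * Per-cpu values (HOST_IA32_SYSENTER_ESP, HOST_TR_BASE,
		 * HOST_GDTR_BASE, ...) keep being written in vmx_vcpu_load().
		 */
	}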

-- 
Nadav Har'El                        |      Tuesday, Feb  8 2011, 4 Adar I 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Quotation, n.: The act of repeating
http://nadav.harel.org.il           |erroneously the words of another.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 15/27] nVMX: Prepare vmcs02 from vmcs01 and vmcs12
  2011-02-08 12:36         ` Nadav Har'El
@ 2011-02-08 12:39           ` Avi Kivity
  0 siblings, 0 replies; 56+ messages in thread
From: Avi Kivity @ 2011-02-08 12:39 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb

On 02/08/2011 02:36 PM, Nadav Har'El wrote:
> On Tue, Feb 08, 2011, Avi Kivity wrote about "Re: [PATCH 15/27] nVMX: Prepare vmcs02 from vmcs01 and vmcs12":
> >  >So isn't the first setting, in vmx_vcpu_setup(), redundant?
> >
> >  It is.  It's just historical baggage - these lines were introduced about
> >  20 commits into kvm development (af9d6e204919016ca in qemu-kvm.git) and
> >  never removed, even after it was fixed to be per-cpu.
>
> Thanks.
>
> As part of my patch, I'm splitting off a function vmx_set_constant_host_state()
> from vmx_vcpu_setup() (the same function will also be used to set up the host
> state on vmcs02). So I'll remove that redundant setting from this function.

Feel free to post cleanups like that as independent patches which can be 
merged immediately.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2011-02-08 12:39 UTC | newest]

Thread overview: 56+ messages
2010-10-17 10:03 [PATCH 0/27] nVMX: Nested VMX, v6 Nadav Har'El
2010-10-17 10:04 ` [PATCH 01/27] nVMX: Add "nested" module option to vmx.c Nadav Har'El
2010-10-17 10:04 ` [PATCH 02/27] nVMX: Add VMX and SVM to list of supported cpuid features Nadav Har'El
2010-10-17 10:05 ` [PATCH 03/27] nVMX: Implement VMXON and VMXOFF Nadav Har'El
2010-10-17 12:24   ` Avi Kivity
2010-10-17 12:47     ` Nadav Har'El
2010-10-17 13:07   ` Avi Kivity
2010-10-17 10:05 ` [PATCH 04/27] nVMX: Allow setting the VMXE bit in CR4 Nadav Har'El
2010-10-17 12:31   ` Avi Kivity
2010-10-17 10:06 ` [PATCH 05/27] nVMX: Introduce vmcs12: a VMCS structure for L1 Nadav Har'El
2010-10-17 12:34   ` Avi Kivity
2010-10-17 13:18     ` Nadav Har'El
2010-10-17 13:29       ` Avi Kivity
2010-10-17 10:06 ` [PATCH 06/27] nVMX: Implement reading and writing of VMX MSRs Nadav Har'El
2010-10-17 12:52   ` Avi Kivity
2010-10-17 10:07 ` [PATCH 07/27] nVMX: Decoding memory operands of VMX instructions Nadav Har'El
2010-10-17 10:07 ` [PATCH 08/27] nVMX: Hold a vmcs02 for each vmcs12 Nadav Har'El
2010-10-17 13:00   ` Avi Kivity
2010-10-17 10:08 ` [PATCH 09/27] nVMX: Success/failure of VMX instructions Nadav Har'El
2010-10-17 10:08 ` [PATCH 10/27] nVMX: Implement VMCLEAR Nadav Har'El
2010-10-17 13:05   ` Avi Kivity
2010-10-17 13:25     ` Nadav Har'El
2010-10-17 13:27       ` Avi Kivity
2010-10-17 13:37         ` Nadav Har'El
2010-10-17 14:12           ` Avi Kivity
2010-10-17 14:14             ` Gleb Natapov
2010-10-17 10:09 ` [PATCH 11/27] nVMX: Implement VMPTRLD Nadav Har'El
2010-10-17 10:09 ` [PATCH 12/27] nVMX: Implement VMPTRST Nadav Har'El
2010-10-17 10:10 ` [PATCH 13/27] nVMX: Add VMCS fields to the vmcs12 Nadav Har'El
2010-10-17 13:15   ` Avi Kivity
2010-10-17 10:10 ` [PATCH 14/27] nVMX: Implement VMREAD and VMWRITE Nadav Har'El
2010-10-17 13:25   ` Avi Kivity
2010-10-17 10:11 ` [PATCH 15/27] nVMX: Prepare vmcs02 from vmcs01 and vmcs12 Nadav Har'El
2010-10-17 14:08   ` Avi Kivity
2011-02-08 12:13     ` Nadav Har'El
2011-02-08 12:27       ` Avi Kivity
2011-02-08 12:36         ` Nadav Har'El
2011-02-08 12:39           ` Avi Kivity
2011-02-08 12:27       ` Avi Kivity
2010-10-17 10:11 ` [PATCH 16/27] nVMX: Move register-syncing to a function Nadav Har'El
2010-10-17 10:12 ` [PATCH 17/27] nVMX: Implement VMLAUNCH and VMRESUME Nadav Har'El
2010-10-17 15:06   ` Avi Kivity
2010-10-17 10:12 ` [PATCH 18/27] nVMX: No need for handle_vmx_insn function any more Nadav Har'El
2010-10-17 10:13 ` [PATCH 19/27] nVMX: Exiting from L2 to L1 Nadav Har'El
2010-10-17 15:58   ` Avi Kivity
2010-10-17 10:13 ` [PATCH 20/27] nVMX: Deciding if L0 or L1 should handle an L2 exit Nadav Har'El
2010-10-20 12:13   ` Avi Kivity
2010-10-20 14:57     ` Avi Kivity
2010-10-17 10:14 ` [PATCH 21/27] nVMX: Correct handling of interrupt injection Nadav Har'El
2010-10-17 10:14 ` [PATCH 22/27] nVMX: Correct handling of exception injection Nadav Har'El
2010-10-17 10:15 ` [PATCH 23/27] nVMX: Correct handling of idt vectoring info Nadav Har'El
2010-10-17 10:15 ` [PATCH 24/27] nVMX: Handling of CR0.TS and #NM for Lazy FPU loading Nadav Har'El
2010-10-17 10:16 ` [PATCH 25/27] nVMX: Additional TSC-offset handling Nadav Har'El
2010-10-19 19:13   ` Zachary Amsden
2010-10-17 10:16 ` [PATCH 26/27] nVMX: Miscellenous small corrections Nadav Har'El
2010-10-17 10:17 ` [PATCH 27/27] nVMX: Documentation Nadav Har'El
