* [PATCH 0/31] nVMX: Nested VMX, v10
From: Nadav Har'El @ 2011-05-16 19:43 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Hi,

This is the tenth iteration of the nested VMX patch set. Improvements in this
version over the previous one include:

 * Fix the code which did not fully maintain a list of all VMCSs loaded on
   each CPU. (Avi, this was the big thing that bothered you in the previous
   version).

 * Add nested-entry-time (L1->L2) verification of control fields of vmcs12 -
   procbased, pinbased, entry, exit and secondary controls - compared to the
   capability MSRs which we advertise to L1.

   The values we advertise (and verify during entry) are stored in variables,
   and can theoretically be modified to reduce the capabilities given to L1
   (although there's no API for that yet). A short sketch of this check
   appears right after this list.

 * Explain the external-interrupt injection (patch 23) more accurately.
   Also got rid of the mysterious "is_interrupt" flag to nested_vmx_vmexit().

 * Fix incorrect VMCS_LINK_POINTER merging: we now always set it to -1 (as
   the spec suggests), and fail nested entry if vmcs12's isn't -1 (with exit
   qualification 4 - see section 23.7).

 * Store idt_vectoring_info and related fields in vmcs12, instead of new
   vmx->nested fields, between exit and entry.

   I still *haven't* done the complete rewrite of the idt_vectoring_info
   handling that Gleb requested.
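
As a reference for the entry-time control-field verification mentioned above,
the check of each vmcs12 control field against the corresponding advertised
capability MSR reduces to the following helper (patch 05 below adds exactly
this as vmx_control_verify()):

	/*
	 * A vmcs12 control value is acceptable iff every bit that the low
	 * half of the MSR requires to be 1 is indeed 1, and no bit outside
	 * the allowed-1 set (the high half of the MSR) is set.
	 */
	static inline bool vmx_control_verify(u32 control, u32 low, u32 high)
	{
		return ((control & high) | low) == control;
	}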

And fixed two bugs reported by real users (hooray!) from this mailing list:

 * Fix bug where sometimes NMIs headed for L0 were also injected into L1.
   Thanks to Abel Gordon for investigating this bug.

 * Removed incorrect test of guest mov-SS block during entry, which prevented
   L2 from running for one tester.

   I removed this test (rather than correcting it), as the processor will do
   exactly the same test anyway when L0 runs L2, and entry failure at that
   time will be returned to L1 as its entry failure.

This version doesn't yet include a fix for the missing VMPTRLD check that
Marcelo sent to the list just a few minutes ago.

This new set of patches applies to the current KVM trunk (I checked with
6f1bd0daae731ff07f4755b4f56730a6e4a3c1cb).
If you wish, you can also check out an already-patched version of KVM from
branch "nvmx10" of the repository:
	 git://github.com/nyh/kvm-nested-vmx.git


About nested VMX:
-----------------

The following 31 patches implement nested VMX support. This feature enables
a guest to use the VMX APIs in order to run its own nested guests.
In other words, it allows running hypervisors (that use VMX) under KVM.
Multiple guest hypervisors can be run concurrently, and each of those can
in turn host multiple guests.

The theory behind this work, our implementation, and its performance
characteristics were presented in OSDI 2010 (the USENIX Symposium on
Operating Systems Design and Implementation). Our paper was titled
"The Turtles Project: Design and Implementation of Nested Virtualization",
and was awarded "Jay Lepreau Best Paper". The paper is available online, at:

	http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf

This patch set does not include all the features described in the paper.
In particular, this patch set is missing nested EPT (L1 can't use EPT and
must use shadow page tables). It is also missing some features required to
run VMware hypervisors as a guest. These missing features will be sent as
follow-on patches.

Running nested VMX:
-------------------

The nested VMX feature is currently disabled by default. It must be
explicitly enabled with the "nested=1" option to the kvm-intel module.

No modifications are required to user space (qemu). However, qemu's default
emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
explicitly enabled by giving qemu one of the following options (a full
example appears after them):

     -cpu host              (emulated CPU has all features of the real CPU)

     -cpu qemu64,+vmx       (add just the vmx feature to a named CPU type)
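
For example, an end-to-end run might look like this (the qemu binary name,
memory sizes and disk-image names below are only illustrative):

	# On the L0 host: enable nested VMX and start the L1 guest
	modprobe kvm-intel nested=1
	qemu-system-x86_64 -enable-kvm -cpu host -m 2048 l1.img

	# Inside the L1 guest (which itself runs KVM): start the L2 guest
	modprobe kvm-intel
	qemu-system-x86_64 -enable-kvm -m 512 l2.img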


This version was only tested with KVM (64-bit) as a guest hypervisor, and
Linux as a nested guest.


Patch statistics:
-----------------

 Documentation/kvm/nested-vmx.txt |  243 ++
 arch/x86/include/asm/kvm_host.h  |    2 
 arch/x86/include/asm/msr-index.h |   12 
 arch/x86/include/asm/vmx.h       |   39 
 arch/x86/kvm/svm.c               |    6 
 arch/x86/kvm/vmx.c               | 2658 ++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c               |   11 
 arch/x86/kvm/x86.h               |    8 
 8 files changed, 2884 insertions(+), 95 deletions(-)

--
Nadav Har'El
IBM Haifa Research Lab


* [PATCH 01/31] nVMX: Add "nested" module option to kvm_intel
From: Nadav Har'El @ 2011-05-16 19:44 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch adds to kvm_intel a module option "nested". This option controls
whether the guest can use VMX instructions, i.e., whether we allow nested
virtualization. A similar, but separate, option already exists for the
SVM module.

This option currently defaults to 0, meaning that nested VMX must be
explicitly enabled by giving nested=1. When nested VMX matures, the default
should probably be changed to enable nested VMX by default - just like
nested SVM is currently enabled by default.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:46.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:46.000000000 +0300
@@ -72,6 +72,14 @@ module_param(vmm_exclusive, bool, S_IRUG
 static int __read_mostly yield_on_hlt = 1;
 module_param(yield_on_hlt, bool, S_IRUGO);
 
+/*
+ * If nested=1, nested virtualization is supported, i.e., guests may use
+ * VMX and be hypervisors for their own guests. If nested=0, guests may not
+ * use VMX instructions.
+ */
+static int __read_mostly nested = 0;
+module_param(nested, bool, S_IRUGO);
+
 #define KVM_GUEST_CR0_MASK_UNRESTRICTED_GUEST				\
 	(X86_CR0_WP | X86_CR0_NE | X86_CR0_NW | X86_CR0_CD)
 #define KVM_GUEST_CR0_MASK						\
@@ -1261,6 +1269,23 @@ static u64 vmx_compute_tsc_offset(struct
 	return target_tsc - native_read_tsc();
 }
 
+static bool guest_cpuid_has_vmx(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpuid_entry2 *best = kvm_find_cpuid_entry(vcpu, 1, 0);
+	return best && (best->ecx & (1 << (X86_FEATURE_VMX & 31)));
+}
+
+/*
+ * nested_vmx_allowed() checks whether a guest should be allowed to use VMX
+ * instructions and MSRs (i.e., nested VMX). Nested VMX is disabled for
+ * all guests if the "nested" module option is off, and can also be disabled
+ * for a single guest by disabling its VMX cpuid bit.
+ */
+static inline bool nested_vmx_allowed(struct kvm_vcpu *vcpu)
+{
+	return nested && guest_cpuid_has_vmx(vcpu);
+}
+
 /*
  * Reads an msr value (of 'msr_index') into 'pdata'.
  * Returns 0 on success, non-0 otherwise.


* [PATCH 02/31] nVMX: Implement VMXON and VMXOFF
From: Nadav Har'El @ 2011-05-16 19:44 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch allows a guest to use the VMXON and VMXOFF instructions, and
emulates them accordingly. Basically this amounts to checking some
prerequisites, and then remembering whether the guest has enabled or disabled
VMX operation.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  110 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 108 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:46.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:46.000000000 +0300
@@ -130,6 +130,15 @@ struct shared_msr_entry {
 	u64 mask;
 };
 
+/*
+ * The nested_vmx structure is part of vcpu_vmx, and holds information we need
+ * for correct emulation of VMX (i.e., nested VMX) on this vcpu.
+ */
+struct nested_vmx {
+	/* Has the level1 guest done vmxon? */
+	bool vmxon;
+};
+
 struct vcpu_vmx {
 	struct kvm_vcpu       vcpu;
 	struct list_head      local_vcpus_link;
@@ -184,6 +193,9 @@ struct vcpu_vmx {
 	u32 exit_reason;
 
 	bool rdtscp_enabled;
+
+	/* Support for a guest hypervisor (nested VMX) */
+	struct nested_vmx nested;
 };
 
 enum segment_cache_field {
@@ -3890,6 +3902,99 @@ static int handle_invalid_op(struct kvm_
 }
 
 /*
+ * Emulate the VMXON instruction.
+ * Currently, we just remember that VMX is active, and do not save or even
+ * inspect the argument to VMXON (the so-called "VMXON pointer") because we
+ * do not currently need to store anything in that guest-allocated memory
+ * region. Consequently, VMCLEAR and VMPTRLD also do not verify that their
+ * argument is different from the VMXON pointer (which the spec says they do).
+ */
+static int handle_vmon(struct kvm_vcpu *vcpu)
+{
+	struct kvm_segment cs;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	/* The Intel VMX Instruction Reference lists a bunch of bits that
+	 * are prerequisite to running VMXON, most notably cr4.VMXE must be
+	 * set to 1 (see vmx_set_cr4() for when we allow the guest to set this).
+	 * Otherwise, we should fail with #UD. We test these now:
+	 */
+	if (!kvm_read_cr4_bits(vcpu, X86_CR4_VMXE) ||
+	    !kvm_read_cr0_bits(vcpu, X86_CR0_PE) ||
+	    (vmx_get_rflags(vcpu) & X86_EFLAGS_VM)) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
+	if (is_long_mode(vcpu) && !cs.l) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	if (vmx_get_cpl(vcpu)) {
+		kvm_inject_gp(vcpu, 0);
+		return 1;
+	}
+
+	vmx->nested.vmxon = true;
+
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
+/*
+ * Intel's VMX Instruction Reference specifies a common set of prerequisites
+ * for running VMX instructions (except VMXON, whose prerequisites are
+ * slightly different). It also specifies what exception to inject otherwise.
+ */
+static int nested_vmx_check_permission(struct kvm_vcpu *vcpu)
+{
+	struct kvm_segment cs;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (!vmx->nested.vmxon) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 0;
+	}
+
+	vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
+	if ((vmx_get_rflags(vcpu) & X86_EFLAGS_VM) ||
+	    (is_long_mode(vcpu) && !cs.l)) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 0;
+	}
+
+	if (vmx_get_cpl(vcpu)) {
+		kvm_inject_gp(vcpu, 0);
+		return 0;
+	}
+
+	return 1;
+}
+
+/*
+ * Free whatever needs to be freed from vmx->nested when L1 goes down, or
+ * just stops using VMX.
+ */
+static void free_nested(struct vcpu_vmx *vmx)
+{
+	if (!vmx->nested.vmxon)
+		return;
+	vmx->nested.vmxon = false;
+}
+
+/* Emulate the VMXOFF instruction */
+static int handle_vmoff(struct kvm_vcpu *vcpu)
+{
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+	free_nested(to_vmx(vcpu));
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
+/*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
  * to be done to userspace and return 0.
@@ -3917,8 +4022,8 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
 	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
 	[EXIT_REASON_VMWRITE]                 = handle_vmx_insn,
-	[EXIT_REASON_VMOFF]                   = handle_vmx_insn,
-	[EXIT_REASON_VMON]                    = handle_vmx_insn,
+	[EXIT_REASON_VMOFF]                   = handle_vmoff,
+	[EXIT_REASON_VMON]                    = handle_vmon,
 	[EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,
 	[EXIT_REASON_APIC_ACCESS]             = handle_apic_access,
 	[EXIT_REASON_WBINVD]                  = handle_wbinvd,
@@ -4329,6 +4434,7 @@ static void vmx_free_vcpu(struct kvm_vcp
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
 	free_vpid(vmx);
+	free_nested(vmx);
 	vmx_free_vmcs(vcpu);
 	kfree(vmx->guest_msrs);
 	kvm_vcpu_uninit(vcpu);


* [PATCH 03/31] nVMX: Allow setting the VMXE bit in CR4
From: Nadav Har'El @ 2011-05-16 19:45 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch allows the guest to enable the VMXE bit in CR4, which is a
prerequisite to running VMXON.

Whether to allow setting the VMXE bit now depends on the architecture (svm
or vmx), so the check has moved into kvm_x86_ops->set_cr4(). This function
now returns an int: if kvm_x86_ops->set_cr4() returns 1, __kvm_set_cr4()
will also return 1, and this will cause kvm_set_cr4() to throw a #GP.

Turning on the VMXE bit is allowed only when the nested VMX feature is
enabled, and turning it off is forbidden after a vmxon.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/include/asm/kvm_host.h |    2 +-
 arch/x86/kvm/svm.c              |    6 +++++-
 arch/x86/kvm/vmx.c              |   17 +++++++++++++++--
 arch/x86/kvm/x86.c              |    4 +---
 4 files changed, 22 insertions(+), 7 deletions(-)

--- .before/arch/x86/include/asm/kvm_host.h	2011-05-16 22:36:46.000000000 +0300
+++ .after/arch/x86/include/asm/kvm_host.h	2011-05-16 22:36:46.000000000 +0300
@@ -559,7 +559,7 @@ struct kvm_x86_ops {
 	void (*decache_cr4_guest_bits)(struct kvm_vcpu *vcpu);
 	void (*set_cr0)(struct kvm_vcpu *vcpu, unsigned long cr0);
 	void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3);
-	void (*set_cr4)(struct kvm_vcpu *vcpu, unsigned long cr4);
+	int (*set_cr4)(struct kvm_vcpu *vcpu, unsigned long cr4);
 	void (*set_efer)(struct kvm_vcpu *vcpu, u64 efer);
 	void (*get_idt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
 	void (*set_idt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
--- .before/arch/x86/kvm/svm.c	2011-05-16 22:36:46.000000000 +0300
+++ .after/arch/x86/kvm/svm.c	2011-05-16 22:36:46.000000000 +0300
@@ -1496,11 +1496,14 @@ static void svm_set_cr0(struct kvm_vcpu 
 	update_cr0_intercept(svm);
 }
 
-static void svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+static int svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
 	unsigned long host_cr4_mce = read_cr4() & X86_CR4_MCE;
 	unsigned long old_cr4 = to_svm(vcpu)->vmcb->save.cr4;
 
+	if (cr4 & X86_CR4_VMXE)
+		return 1;
+
 	if (npt_enabled && ((old_cr4 ^ cr4) & X86_CR4_PGE))
 		svm_flush_tlb(vcpu);
 
@@ -1510,6 +1513,7 @@ static void svm_set_cr4(struct kvm_vcpu 
 	cr4 |= host_cr4_mce;
 	to_svm(vcpu)->vmcb->save.cr4 = cr4;
 	mark_dirty(to_svm(vcpu)->vmcb, VMCB_CR);
+	return 0;
 }
 
 static void svm_set_segment(struct kvm_vcpu *vcpu,
--- .before/arch/x86/kvm/x86.c	2011-05-16 22:36:46.000000000 +0300
+++ .after/arch/x86/kvm/x86.c	2011-05-16 22:36:46.000000000 +0300
@@ -615,11 +615,9 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, u
 				   kvm_read_cr3(vcpu)))
 		return 1;
 
-	if (cr4 & X86_CR4_VMXE)
+	if (kvm_x86_ops->set_cr4(vcpu, cr4))
 		return 1;
 
-	kvm_x86_ops->set_cr4(vcpu, cr4);
-
 	if ((cr4 ^ old_cr4) & pdptr_bits)
 		kvm_mmu_reset_context(vcpu);
 
--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:46.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:46.000000000 +0300
@@ -2078,7 +2078,7 @@ static void ept_save_pdptrs(struct kvm_v
 		  (unsigned long *)&vcpu->arch.regs_dirty);
 }
 
-static void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
+static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
 
 static void ept_update_paging_mode_cr0(unsigned long *hw_cr0,
 					unsigned long cr0,
@@ -2175,11 +2175,23 @@ static void vmx_set_cr3(struct kvm_vcpu 
 	vmcs_writel(GUEST_CR3, guest_cr3);
 }
 
-static void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
 	unsigned long hw_cr4 = cr4 | (to_vmx(vcpu)->rmode.vm86_active ?
 		    KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON);
 
+	if (cr4 & X86_CR4_VMXE) {
+		/*
+		 * To use VMXON (and later other VMX instructions), a guest
+		 * must first be able to turn on cr4.VMXE (see handle_vmon()).
+		 * So basically the check on whether to allow nested VMX
+		 * is here.
+		 */
+		if (!nested_vmx_allowed(vcpu))
+			return 1;
+	} else if (to_vmx(vcpu)->nested.vmxon)
+		return 1;
+
 	vcpu->arch.cr4 = cr4;
 	if (enable_ept) {
 		if (!is_paging(vcpu)) {
@@ -2192,6 +2204,7 @@ static void vmx_set_cr4(struct kvm_vcpu 
 
 	vmcs_writel(CR4_READ_SHADOW, cr4);
 	vmcs_writel(GUEST_CR4, hw_cr4);
+	return 0;
 }
 
 static void vmx_get_segment(struct kvm_vcpu *vcpu,


* [PATCH 04/31] nVMX: Introduce vmcs12: a VMCS structure for L1
From: Nadav Har'El @ 2011-05-16 19:45 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

An implementation of VMX needs to define a VMCS structure. This structure
is kept in guest memory, but is opaque to the guest (which can only read or
write it with VMX instructions).

This patch starts to define the VMCS structure which our nested VMX
implementation will present to L1. We call it "vmcs12", as it is the VMCS
that L1 keeps for its L2 guest. We will add more content to this structure
in later patches.

This patch also adds the notion (as required by the VMX spec) of L1's "current
VMCS", and finally includes utility functions for mapping the guest-allocated
VMCSs in host memory.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   75 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:47.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:47.000000000 +0300
@@ -131,12 +131,53 @@ struct shared_msr_entry {
 };
 
 /*
+ * struct vmcs12 describes the state that our guest hypervisor (L1) keeps for a
+ * single nested guest (L2), hence the name vmcs12. Any VMX implementation has
+ * a VMCS structure, and vmcs12 is our emulated VMX's VMCS. This structure is
+ * stored in guest memory specified by VMPTRLD, but is opaque to the guest,
+ * which must access it using VMREAD/VMWRITE/VMCLEAR instructions.
+ * More than one of these structures may exist, if L1 runs multiple L2 guests.
+ * nested_vmx_run() will use the data here to build a vmcs02: a VMCS for the
+ * underlying hardware which will be used to run L2.
+ * This structure is packed to ensure that its layout is identical across
+ * machines (necessary for live migration).
+ * If there are changes in this struct, VMCS12_REVISION must be changed.
+ */
+struct __packed vmcs12 {
+	/* According to the Intel spec, a VMCS region must start with the
+	 * following two fields. Then follow implementation-specific data.
+	 */
+	u32 revision_id;
+	u32 abort;
+};
+
+/*
+ * VMCS12_REVISION is an arbitrary id that should be changed if the content or
+ * layout of struct vmcs12 is changed. MSR_IA32_VMX_BASIC returns this id, and
+ * VMPTRLD verifies that the VMCS region that L1 is loading contains this id.
+ */
+#define VMCS12_REVISION 0x11e57ed0
+
+/*
+ * VMCS12_SIZE is the number of bytes L1 should allocate for the VMXON region
+ * and any VMCS region. Although only sizeof(struct vmcs12) is used by the
+ * current implementation, 4K are reserved to avoid future complications.
+ */
+#define VMCS12_SIZE 0x1000
+
+/*
  * The nested_vmx structure is part of vcpu_vmx, and holds information we need
  * for correct emulation of VMX (i.e., nested VMX) on this vcpu.
  */
 struct nested_vmx {
 	/* Has the level1 guest done vmxon? */
 	bool vmxon;
+
+	/* The guest-physical address of the current VMCS L1 keeps for L2 */
+	gpa_t current_vmptr;
+	/* The host-usable pointer to the above */
+	struct page *current_vmcs12_page;
+	struct vmcs12 *current_vmcs12;
 };
 
 struct vcpu_vmx {
@@ -212,6 +253,31 @@ static inline struct vcpu_vmx *to_vmx(st
 	return container_of(vcpu, struct vcpu_vmx, vcpu);
 }
 
+static inline struct vmcs12 *get_vmcs12(struct kvm_vcpu *vcpu)
+{
+	return to_vmx(vcpu)->nested.current_vmcs12;
+}
+
+static struct page *nested_get_page(struct kvm_vcpu *vcpu, gpa_t addr)
+{
+	struct page *page = gfn_to_page(vcpu->kvm, addr >> PAGE_SHIFT);
+	if (is_error_page(page)) {
+		kvm_release_page_clean(page);
+		return NULL;
+	}
+	return page;
+}
+
+static void nested_release_page(struct page *page)
+{
+	kvm_release_page_dirty(page);
+}
+
+static void nested_release_page_clean(struct page *page)
+{
+	kvm_release_page_clean(page);
+}
+
 static u64 construct_eptp(unsigned long root_hpa);
 static void kvm_cpu_vmxon(u64 addr);
 static void kvm_cpu_vmxoff(void);
@@ -3995,6 +4061,12 @@ static void free_nested(struct vcpu_vmx 
 	if (!vmx->nested.vmxon)
 		return;
 	vmx->nested.vmxon = false;
+	if (vmx->nested.current_vmptr != -1ull) {
+		kunmap(vmx->nested.current_vmcs12_page);
+		nested_release_page(vmx->nested.current_vmcs12_page);
+		vmx->nested.current_vmptr = -1ull;
+		vmx->nested.current_vmcs12 = NULL;
+	}
 }
 
 /* Emulate the VMXOFF instruction */
@@ -4518,6 +4590,9 @@ static struct kvm_vcpu *vmx_create_vcpu(
 			goto free_vmcs;
 	}
 
+	vmx->nested.current_vmptr = -1ull;
+	vmx->nested.current_vmcs12 = NULL;
+
 	return &vmx->vcpu;
 
 free_vmcs:

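As a purely hypothetical illustration of how the helpers above are meant to be
used, a later patch's VMPTRLD emulation could map the guest's current VMCS
roughly as follows (the function name and error codes here are illustrative
only; the real handler arrives later in the series):

	static int nested_map_current_vmcs12(struct kvm_vcpu *vcpu, gpa_t vmptr)
	{
		struct vcpu_vmx *vmx = to_vmx(vcpu);
		struct page *page = nested_get_page(vcpu, vmptr);
		struct vmcs12 *new_vmcs12;

		if (page == NULL)
			return -EINVAL;
		new_vmcs12 = kmap(page);
		if (new_vmcs12->revision_id != VMCS12_REVISION) {
			/* Reject a region with an unexpected revision id */
			kunmap(page);
			nested_release_page_clean(page);
			return -EINVAL;
		}
		vmx->nested.current_vmptr = vmptr;
		vmx->nested.current_vmcs12 = new_vmcs12;
		vmx->nested.current_vmcs12_page = page;
		return 0;
	}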

* [PATCH 05/31] nVMX: Implement reading and writing of VMX MSRs
From: Nadav Har'El @ 2011-05-16 19:46 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

When the guest can use VMX instructions (when the "nested" module option is
on), it should also be able to read and write VMX MSRs, e.g., to query about
VMX capabilities. This patch adds this support.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/include/asm/msr-index.h |   12 +
 arch/x86/kvm/vmx.c               |  219 +++++++++++++++++++++++++++++
 2 files changed, 231 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:47.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:47.000000000 +0300
@@ -1365,6 +1365,218 @@ static inline bool nested_vmx_allowed(st
 }
 
 /*
+ * nested_vmx_setup_ctls_msrs() sets up variables containing the values to be
+ * returned for the various VMX controls MSRs when nested VMX is enabled.
+ * The same values should also be used to verify that vmcs12 control fields are
+ * valid during nested entry from L1 to L2.
+ * Each of these control msrs has a low and high 32-bit half: A low bit is on
+ * if the corresponding bit in the (32-bit) control field *must* be on, and a
+ * bit in the high half is on if the corresponding bit in the control field
+ * may be on. See also vmx_control_verify().
+ * TODO: allow these variables to be modified (downgraded) by module options
+ * or other means.
+ */
+static u32 nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high;
+static u32 nested_vmx_secondary_ctls_low, nested_vmx_secondary_ctls_high;
+static u32 nested_vmx_pinbased_ctls_low, nested_vmx_pinbased_ctls_high;
+static u32 nested_vmx_exit_ctls_low, nested_vmx_exit_ctls_high;
+static u32 nested_vmx_entry_ctls_low, nested_vmx_entry_ctls_high;
+static __init void nested_vmx_setup_ctls_msrs(void)
+{
+	/*
+	 * Note that as a general rule, the high half of the MSRs (bits in
+	 * the control fields which may be 1) should be initialized by the
+	 * intersection of the underlying hardware's MSR (i.e., features which
+	 * can be supported) and the list of features we want to expose -
+	 * because they are known to be properly supported in our code.
+	 * Also, usually, the low half of the MSRs (bits which must be 1) can
+	 * be set to 0, meaning that L1 may turn off any of these bits. The
+	 * reason is that if one of these bits is necessary, it will appear
+	 * in vmcs01 and prepare_vmcs02, when it bitwise-or's the control
+	 * fields of vmcs01 and vmcs02, will turn these bits off - and
+	 * nested_vmx_exit_handled() will not pass related exits to L1.
+	 * These rules have exceptions below.
+	 */
+
+	/* pin-based controls */
+	/*
+	 * According to the Intel spec, if bit 55 of VMX_BASIC is off (as it is
+	 * in our case), bits 1, 2 and 4 (i.e., 0x16) must be 1 in this MSR.
+	 */
+	nested_vmx_pinbased_ctls_low = 0x16 ;
+	nested_vmx_pinbased_ctls_high = 0x16 |
+		PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING |
+		PIN_BASED_VIRTUAL_NMIS;
+
+	/* exit controls */
+	nested_vmx_exit_ctls_low = 0;
+#ifdef CONFIG_X86_64
+	nested_vmx_exit_ctls_high = VM_EXIT_HOST_ADDR_SPACE_SIZE;
+#else
+	nested_vmx_exit_ctls_high = 0;
+#endif
+
+	/* entry controls */
+	rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
+		nested_vmx_entry_ctls_low, nested_vmx_entry_ctls_high);
+	nested_vmx_entry_ctls_low = 0;
+	nested_vmx_entry_ctls_high &=
+		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE;
+
+	/* cpu-based controls */
+	rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
+		nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
+	nested_vmx_procbased_ctls_low = 0;
+	nested_vmx_procbased_ctls_high &=
+		CPU_BASED_VIRTUAL_INTR_PENDING | CPU_BASED_USE_TSC_OFFSETING |
+		CPU_BASED_HLT_EXITING | CPU_BASED_INVLPG_EXITING |
+		CPU_BASED_MWAIT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
+		CPU_BASED_CR3_STORE_EXITING |
+#ifdef CONFIG_X86_64
+		CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING |
+#endif
+		CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
+		CPU_BASED_USE_IO_BITMAPS | CPU_BASED_MONITOR_EXITING |
+		CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
+	/*
+	 * We can allow some features even when not supported by the
+	 * hardware. For example, L1 can specify an MSR bitmap - and we
+	 * can use it to avoid exits to L1 - even when L0 runs L2
+	 * without MSR bitmaps.
+	 */
+	nested_vmx_procbased_ctls_high |= CPU_BASED_USE_MSR_BITMAPS;
+
+	/* secondary cpu-based controls */
+	rdmsr(MSR_IA32_VMX_PROCBASED_CTLS2,
+		nested_vmx_secondary_ctls_low, nested_vmx_secondary_ctls_high);
+	nested_vmx_secondary_ctls_low = 0;
+	nested_vmx_secondary_ctls_high &=
+		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+}
+
+static inline bool vmx_control_verify(u32 control, u32 low, u32 high)
+{
+	/*
+	 * Bits 0 in high must be 0, and bits 1 in low must be 1.
+	 */
+	return ((control & high) | low) == control;
+}
+
+static inline u64 vmx_control_msr(u32 low, u32 high)
+{
+	return low | ((u64)high << 32);
+}
+
+/*
+ * If we allow our guest to use VMX instructions (i.e., nested VMX), we should
+ * also let it use VMX-specific MSRs.
+ * vmx_get_vmx_msr() and vmx_set_vmx_msr() return 1 when we handled a
+ * VMX-specific MSR, or 0 when we haven't (and the caller should handle it
+ * like all other MSRs).
+ */
+static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
+{
+	if (!nested_vmx_allowed(vcpu) && msr_index >= MSR_IA32_VMX_BASIC &&
+		     msr_index <= MSR_IA32_VMX_TRUE_ENTRY_CTLS) {
+		/*
+		 * According to the spec, processors which do not support VMX
+		 * should throw a #GP(0) when VMX capability MSRs are read.
+		 */
+		kvm_queue_exception_e(vcpu, GP_VECTOR, 0);
+		return 1;
+	}
+
+	switch (msr_index) {
+	case MSR_IA32_FEATURE_CONTROL:
+		*pdata = 0;
+		break;
+	case MSR_IA32_VMX_BASIC:
+		/*
+		 * This MSR reports some information about VMX support. We
+		 * should return information about the VMX we emulate for the
+		 * guest, and the VMCS structure we give it - not about the
+		 * VMX support of the underlying hardware.
+		 */
+		*pdata = VMCS12_REVISION |
+			   ((u64)VMCS12_SIZE << VMX_BASIC_VMCS_SIZE_SHIFT) |
+			   (VMX_BASIC_MEM_TYPE_WB << VMX_BASIC_MEM_TYPE_SHIFT);
+		break;
+	case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
+	case MSR_IA32_VMX_PINBASED_CTLS:
+		*pdata = vmx_control_msr(nested_vmx_pinbased_ctls_low,
+					nested_vmx_pinbased_ctls_high);
+		break;
+	case MSR_IA32_VMX_TRUE_PROCBASED_CTLS:
+	case MSR_IA32_VMX_PROCBASED_CTLS:
+		*pdata = vmx_control_msr(nested_vmx_procbased_ctls_low,
+					nested_vmx_procbased_ctls_high);
+		break;
+	case MSR_IA32_VMX_TRUE_EXIT_CTLS:
+	case MSR_IA32_VMX_EXIT_CTLS:
+		*pdata = vmx_control_msr(nested_vmx_exit_ctls_low,
+					nested_vmx_exit_ctls_high);
+		break;
+	case MSR_IA32_VMX_TRUE_ENTRY_CTLS:
+	case MSR_IA32_VMX_ENTRY_CTLS:
+		*pdata = vmx_control_msr(nested_vmx_entry_ctls_low,
+					nested_vmx_entry_ctls_high);
+		break;
+	case MSR_IA32_VMX_MISC:
+		*pdata = 0;
+		break;
+	/*
+	 * These MSRs specify bits which the guest must keep fixed (on or off)
+	 * while L1 is in VMXON mode (in L1's root mode, or running an L2).
+	 * We picked the standard core2 setting.
+	 */
+#define VMXON_CR0_ALWAYSON	(X86_CR0_PE | X86_CR0_PG | X86_CR0_NE)
+#define VMXON_CR4_ALWAYSON	X86_CR4_VMXE
+	case MSR_IA32_VMX_CR0_FIXED0:
+		*pdata = VMXON_CR0_ALWAYSON;
+		break;
+	case MSR_IA32_VMX_CR0_FIXED1:
+		*pdata = -1ULL;
+		break;
+	case MSR_IA32_VMX_CR4_FIXED0:
+		*pdata = VMXON_CR4_ALWAYSON;
+		break;
+	case MSR_IA32_VMX_CR4_FIXED1:
+		*pdata = -1ULL;
+		break;
+	case MSR_IA32_VMX_VMCS_ENUM:
+		*pdata = 0x1f;
+		break;
+	case MSR_IA32_VMX_PROCBASED_CTLS2:
+		*pdata = vmx_control_msr(nested_vmx_secondary_ctls_low,
+					nested_vmx_secondary_ctls_high);
+		break;
+	case MSR_IA32_VMX_EPT_VPID_CAP:
+		/* Currently, no nested ept or nested vpid */
+		*pdata = 0;
+		break;
+	default:
+		return 0;
+	}
+
+	return 1;
+}
+
+static int vmx_set_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
+{
+	if (!nested_vmx_allowed(vcpu))
+		return 0;
+
+	if (msr_index == MSR_IA32_FEATURE_CONTROL)
+		/* TODO: the right thing. */
+		return 1;
+	/*
+	 * No need to treat VMX capability MSRs specially: If we don't handle
+	 * them, handle_wrmsr will #GP(0), which is correct (they are readonly)
+	 */
+	return 0;
+}
+
+/*
  * Reads an msr value (of 'msr_index') into 'pdata'.
  * Returns 0 on success, non-0 otherwise.
  * Assumes vcpu_load() was already called.
@@ -1412,6 +1624,8 @@ static int vmx_get_msr(struct kvm_vcpu *
 		/* Otherwise falls through */
 	default:
 		vmx_load_host_state(to_vmx(vcpu));
+		if (vmx_get_vmx_msr(vcpu, msr_index, pdata))
+			return 0;
 		msr = find_msr_entry(to_vmx(vcpu), msr_index);
 		if (msr) {
 			vmx_load_host_state(to_vmx(vcpu));
@@ -1483,6 +1697,8 @@ static int vmx_set_msr(struct kvm_vcpu *
 			return 1;
 		/* Otherwise falls through */
 	default:
+		if (vmx_set_vmx_msr(vcpu, msr_index, data))
+			break;
 		msr = find_msr_entry(vmx, msr_index);
 		if (msr) {
 			vmx_load_host_state(vmx);
@@ -1859,6 +2075,9 @@ static __init int hardware_setup(void)
 	if (!cpu_has_vmx_ple())
 		ple_gap = 0;
 
+	if (nested)
+		nested_vmx_setup_ctls_msrs();
+
 	return alloc_kvm_area();
 }
 
--- .before/arch/x86/include/asm/msr-index.h	2011-05-16 22:36:47.000000000 +0300
+++ .after/arch/x86/include/asm/msr-index.h	2011-05-16 22:36:47.000000000 +0300
@@ -438,6 +438,18 @@
 #define MSR_IA32_VMX_VMCS_ENUM          0x0000048a
 #define MSR_IA32_VMX_PROCBASED_CTLS2    0x0000048b
 #define MSR_IA32_VMX_EPT_VPID_CAP       0x0000048c
+#define MSR_IA32_VMX_TRUE_PINBASED_CTLS  0x0000048d
+#define MSR_IA32_VMX_TRUE_PROCBASED_CTLS 0x0000048e
+#define MSR_IA32_VMX_TRUE_EXIT_CTLS      0x0000048f
+#define MSR_IA32_VMX_TRUE_ENTRY_CTLS     0x00000490
+
+/* VMX_BASIC bits and bitmasks */
+#define VMX_BASIC_VMCS_SIZE_SHIFT	32
+#define VMX_BASIC_64		0x0001000000000000LLU
+#define VMX_BASIC_MEM_TYPE_SHIFT	50
+#define VMX_BASIC_MEM_TYPE_MASK	0x003c000000000000LLU
+#define VMX_BASIC_MEM_TYPE_WB	6LLU
+#define VMX_BASIC_INOUT		0x0040000000000000LLU
 
 /* AMD-V MSRs */
 


* [PATCH 06/31] nVMX: Decoding memory operands of VMX instructions
From: Nadav Har'El @ 2011-05-16 19:46 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch includes a utility function for decoding pointer operands of VMX
instructions issued by L1 (a guest hypervisor).

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   53 +++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c |    3 +-
 arch/x86/kvm/x86.h |    4 +++
 3 files changed, 59 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/x86.c	2011-05-16 22:36:47.000000000 +0300
+++ .after/arch/x86/kvm/x86.c	2011-05-16 22:36:47.000000000 +0300
@@ -3815,7 +3815,7 @@ static int kvm_fetch_guest_virt(struct x
 					  exception);
 }
 
-static int kvm_read_guest_virt(struct x86_emulate_ctxt *ctxt,
+int kvm_read_guest_virt(struct x86_emulate_ctxt *ctxt,
 			       gva_t addr, void *val, unsigned int bytes,
 			       struct x86_exception *exception)
 {
@@ -3825,6 +3825,7 @@ static int kvm_read_guest_virt(struct x8
 	return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, access,
 					  exception);
 }
+EXPORT_SYMBOL_GPL(kvm_read_guest_virt);
 
 static int kvm_read_guest_virt_system(struct x86_emulate_ctxt *ctxt,
 				      gva_t addr, void *val, unsigned int bytes,
--- .before/arch/x86/kvm/x86.h	2011-05-16 22:36:47.000000000 +0300
+++ .after/arch/x86/kvm/x86.h	2011-05-16 22:36:47.000000000 +0300
@@ -81,4 +81,8 @@ int kvm_inject_realmode_interrupt(struct
 
 void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data);
 
+int kvm_read_guest_virt(struct x86_emulate_ctxt *ctxt,
+	gva_t addr, void *val, unsigned int bytes,
+	struct x86_exception *exception);
+
 #endif
--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:47.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:47.000000000 +0300
@@ -4299,6 +4299,59 @@ static int handle_vmoff(struct kvm_vcpu 
 }
 
 /*
+ * Decode the memory-address operand of a vmx instruction, as recorded on an
+ * exit caused by such an instruction (run by a guest hypervisor).
+ * On success, returns 0. When the operand is invalid, returns 1 and throws
+ * #UD or #GP.
+ */
+static int get_vmx_mem_address(struct kvm_vcpu *vcpu,
+				 unsigned long exit_qualification,
+				 u32 vmx_instruction_info, gva_t *ret)
+{
+	/*
+	 * According to Vol. 3B, "Information for VM Exits Due to Instruction
+	 * Execution", on an exit, vmx_instruction_info holds most of the
+	 * addressing components of the operand. Only the displacement part
+	 * is put in exit_qualification (see 3B, "Basic VM-Exit Information").
+	 * For how an actual address is calculated from all these components,
+	 * refer to Vol. 1, "Operand Addressing".
+	 */
+	int  scaling = vmx_instruction_info & 3;
+	int  addr_size = (vmx_instruction_info >> 7) & 7;
+	bool is_reg = vmx_instruction_info & (1u << 10);
+	int  seg_reg = (vmx_instruction_info >> 15) & 7;
+	int  index_reg = (vmx_instruction_info >> 18) & 0xf;
+	bool index_is_valid = !(vmx_instruction_info & (1u << 22));
+	int  base_reg       = (vmx_instruction_info >> 23) & 0xf;
+	bool base_is_valid  = !(vmx_instruction_info & (1u << 27));
+
+	if (is_reg) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	/* Addr = segment_base + offset */
+	/* offset = base + [index * scale] + displacement */
+	*ret = vmx_get_segment_base(vcpu, seg_reg);
+	if (base_is_valid)
+		*ret += kvm_register_read(vcpu, base_reg);
+	if (index_is_valid)
+		*ret += kvm_register_read(vcpu, index_reg)<<scaling;
+	*ret += exit_qualification; /* holds the displacement */
+
+	if (addr_size == 1) /* 32 bit */
+		*ret &= 0xffffffff;
+
+	/*
+	 * TODO: throw #GP (and return 1) in various cases that the VM*
+	 * instructions require it - e.g., offset beyond segment limit,
+	 * unusable or unreadable/unwritable segment, non-canonical 64-bit
+	 * address, and so on. Currently these are not checked.
+	 */
+	return 0;
+}
+
+/*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
  * to be done to userspace and return 0.

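As a hypothetical sketch (the real VMCLEAR/VMPTRLD/VMREAD handlers only appear
in later patches of this series and are not shown here), an exit handler would
typically combine this helper with kvm_read_guest_virt() along these lines:

	gva_t gva;
	gpa_t vmptr;
	struct x86_exception e;

	/* Decode the memory operand of the trapped VMX instruction */
	if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
			vmcs_read32(VMX_INSTRUCTION_INFO), &gva))
		return 1;

	/* Read the 64-bit operand (e.g., a VMCS pointer) from guest memory */
	if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &vmptr,
			sizeof(vmptr), &e)) {
		kvm_inject_page_fault(vcpu, &e);
		return 1;
	}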

* [PATCH 07/31] nVMX: Introduce vmcs02: VMCS used to run L2
From: Nadav Har'El @ 2011-05-16 19:47 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

We saw in a previous patch that L1 controls its L2 guest with a vmcs12.
L0 needs to create a real VMCS for running L2. We call that "vmcs02".
A later patch will contain the code, prepare_vmcs02(), for filling the vmcs02
fields. This patch only contains code for allocating vmcs02.

In this version, prepare_vmcs02() sets *all* of vmcs02's fields each time we
enter from L1 to L2, so keeping just one vmcs02 for the vcpu is enough: It can
be reused even when L1 runs multiple L2 guests. However, in future versions
we'll probably want to add an optimization where vmcs02 fields that rarely
change will not be set each time. For that, we may want to keep around several
vmcs02s of L2 guests that have recently run, so that potentially we could run
these L2s again more quickly because fewer vmwrites to vmcs02 will be needed.

This patch adds to each vcpu a vmcs02 pool, vmx->nested.vmcs02_pool,
which remembers the vmcs02s last used to run up to VMCS02_POOL_SIZE L2s.
As explained above, in the current version we choose VMCS02_POOL_SIZE=1,
i.e., one vmcs02 is allocated (and loaded onto the processor), and it is
reused to enter any L2 guest. In the future, when prepare_vmcs02() is
optimized not to set all fields every time, VMCS02_POOL_SIZE should be
increased.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  139 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 139 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:47.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:47.000000000 +0300
@@ -117,6 +117,7 @@ static int ple_window = KVM_VMX_DEFAULT_
 module_param(ple_window, int, S_IRUGO);
 
 #define NR_AUTOLOAD_MSRS 1
+#define VMCS02_POOL_SIZE 1
 
 struct vmcs {
 	u32 revision_id;
@@ -166,6 +167,30 @@ struct __packed vmcs12 {
 #define VMCS12_SIZE 0x1000
 
 /*
+ * When we temporarily switch a vcpu's VMCS (e.g., stop using an L1's VMCS
+ * while we use L2's VMCS), and we wish to save the previous VMCS, we must also
+ * remember on which CPU it was last loaded (vcpu->cpu), so when we return to
+ * using this VMCS we'll know if we're now running on a different CPU and need
+ * to clear the VMCS on the old CPU, and load it on the new one. Additionally,
+ * we need to remember whether this VMCS was launched (vmx->launched), so when
+ * we return to it we know if to VMLAUNCH or to VMRESUME it (we cannot deduce
+ * this from other state, because it's possible that this VMCS had once been
+ * launched, but has since been cleared after a CPU switch).
+ */
+struct saved_vmcs {
+	struct vmcs *vmcs;
+	int cpu;
+	int launched;
+};
+
+/* Used to remember the last vmcs02 used for some recently used vmcs12s */
+struct vmcs02_list {
+	struct list_head list;
+	gpa_t vmcs12_addr;
+	struct saved_vmcs vmcs02;
+};
+
+/*
  * The nested_vmx structure is part of vcpu_vmx, and holds information we need
  * for correct emulation of VMX (i.e., nested VMX) on this vcpu.
  */
@@ -178,6 +203,10 @@ struct nested_vmx {
 	/* The host-usable pointer to the above */
 	struct page *current_vmcs12_page;
 	struct vmcs12 *current_vmcs12;
+
+	/* vmcs02_list cache of VMCSs recently used to run L2 guests */
+	struct list_head vmcs02_pool;
+	int vmcs02_num;
 };
 
 struct vcpu_vmx {
@@ -4200,6 +4229,111 @@ static int handle_invalid_op(struct kvm_
 }
 
 /*
+ * To run an L2 guest, we need a vmcs02 based on the L1-specified vmcs12.
+ * We could reuse a single VMCS for all the L2 guests, but we also want the
+ * option to allocate a separate vmcs02 for each separate loaded vmcs12 - this
+ * allows keeping them loaded on the processor, and in the future will allow
+ * optimizations where prepare_vmcs02 doesn't need to set all the fields on
+ * every entry if they never change.
+ * So we keep, in vmx->nested.vmcs02_pool, a cache of size VMCS02_POOL_SIZE
+ * (>=0) with a vmcs02 for each recently loaded vmcs12, most recent first.
+ *
+ * The following functions allocate and free a vmcs02 in this pool.
+ */
+
+static void __nested_free_saved_vmcs(void *arg)
+{
+	struct saved_vmcs *saved_vmcs = arg;
+
+	vmcs_clear(saved_vmcs->vmcs);
+	if (per_cpu(current_vmcs, saved_vmcs->cpu) == saved_vmcs->vmcs)
+		per_cpu(current_vmcs, saved_vmcs->cpu) = NULL;
+}
+
+/*
+ * Free a VMCS, but before that VMCLEAR it on the CPU where it was last loaded
+ * (the necessary information is in the saved_vmcs structure).
+ * See also vcpu_clear() (with different parameters and side-effects)
+ */
+static void nested_free_saved_vmcs(struct vcpu_vmx *vmx,
+		struct saved_vmcs *saved_vmcs)
+{
+	if (saved_vmcs->cpu != -1)
+		smp_call_function_single(saved_vmcs->cpu,
+				__nested_free_saved_vmcs, saved_vmcs, 1);
+
+	free_vmcs(saved_vmcs->vmcs);
+}
+
+/* Free and remove from pool a vmcs02 saved for a vmcs12 (if there is one) */
+static void nested_free_vmcs02(struct vcpu_vmx *vmx, gpa_t vmptr)
+{
+	struct vmcs02_list *item;
+	list_for_each_entry(item, &vmx->nested.vmcs02_pool, list)
+		if (item->vmcs12_addr == vmptr) {
+			nested_free_saved_vmcs(vmx, &item->vmcs02);
+			list_del(&item->list);
+			kfree(item);
+			vmx->nested.vmcs02_num--;
+			return;
+		}
+}
+
+/*
+ * Free all VMCSs saved for this vcpu, except the actual vmx->vmcs.
+ * These include the VMCSs in vmcs02_pool (except the one currently used,
+ * if running L2), and saved_vmcs01 when running L2.
+ */
+static void nested_free_all_saved_vmcss(struct vcpu_vmx *vmx)
+{
+	struct vmcs02_list *item, *n;
+	list_for_each_entry_safe(item, n, &vmx->nested.vmcs02_pool, list) {
+		if (vmx->vmcs != item->vmcs02.vmcs)
+			nested_free_saved_vmcs(vmx, &item->vmcs02);
+		list_del(&item->list);
+		kfree(item);
+	}
+	vmx->nested.vmcs02_num = 0;
+}
+
+/* Get a vmcs02 for the current vmcs12. */
+static struct saved_vmcs *nested_get_current_vmcs02(struct vcpu_vmx *vmx)
+{
+	struct vmcs02_list *item;
+	list_for_each_entry(item, &vmx->nested.vmcs02_pool, list)
+		if (item->vmcs12_addr == vmx->nested.current_vmptr) {
+			list_move(&item->list, &vmx->nested.vmcs02_pool);
+			return &item->vmcs02;
+		}
+
+	if (vmx->nested.vmcs02_num >= max(VMCS02_POOL_SIZE, 1)) {
+		/* Recycle the least recently used VMCS. */
+		item = list_entry(vmx->nested.vmcs02_pool.prev,
+			struct vmcs02_list, list);
+		item->vmcs12_addr = vmx->nested.current_vmptr;
+		list_move(&item->list, &vmx->nested.vmcs02_pool);
+		return &item->vmcs02;
+	}
+
+	/* Create a new vmcs02 */
+	item = (struct vmcs02_list *)
+		kmalloc(sizeof(struct vmcs02_list), GFP_KERNEL);
+	if (!item)
+		return NULL;
+	item->vmcs02.vmcs = alloc_vmcs();
+	if (!item->vmcs02.vmcs) {
+		kfree(item);
+		return NULL;
+	}
+	item->vmcs12_addr = vmx->nested.current_vmptr;
+	item->vmcs02.cpu = -1;
+	item->vmcs02.launched = 0;
+	list_add(&(item->list), &(vmx->nested.vmcs02_pool));
+	vmx->nested.vmcs02_num++;
+	return &item->vmcs02;
+}
+
+/*
  * Emulate the VMXON instruction.
  * Currently, we just remember that VMX is active, and do not save or even
  * inspect the argument to VMXON (the so-called "VMXON pointer") because we
@@ -4235,6 +4369,9 @@ static int handle_vmon(struct kvm_vcpu *
 		return 1;
 	}
 
+	INIT_LIST_HEAD(&(vmx->nested.vmcs02_pool));
+	vmx->nested.vmcs02_num = 0;
+
 	vmx->nested.vmxon = true;
 
 	skip_emulated_instruction(vcpu);
@@ -4286,6 +4423,8 @@ static void free_nested(struct vcpu_vmx 
 		vmx->nested.current_vmptr = -1ull;
 		vmx->nested.current_vmcs12 = NULL;
 	}
+
+	nested_free_all_saved_vmcss(vmx);
 }
 
 /* Emulate the VMXOFF instruction */


* [PATCH 08/31] nVMX: Fix local_vcpus_link handling
From: Nadav Har'El @ 2011-05-16 19:48 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

In VMX, before we bring down a CPU we must VMCLEAR all VMCSs loaded on it
because (at least in theory) the processor might not have written all of its
content back to memory. Since a patch from June 26, 2008, this is done using
a per-cpu "vcpus_on_cpu" linked list of vcpus loaded on each CPU.

The problem is that with nested VMX, we no longer have the concept of a
vcpu being loaded on a cpu: A vcpu has multiple VMCSs (one for L1, a pool for
L2s), and each of those may have been last loaded on a different cpu.

Our solution is to hold, in addition to vcpus_on_cpu, a second linked list
saved_vmcss_on_cpu, which holds the current list of "saved" VMCSs, i.e.,
VMCSs which are loaded on this CPU but are not the vmx->vmcs of any of
the vcpus. These saved VMCSs include L1's VMCS while L2 is running
(saved_vmcs01), and L2 VMCSs not currently used - because L1 is running or
because the vmcs02_pool contains more than one entry.

When we switch between L1's and L2's VMCSs, they need to be moved between
the vcpus_on_cpu and saved_vmcss_on_cpu lists and vice versa. A new
function, nested_maintain_per_cpu_lists(), takes care of that.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   67 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:47.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:47.000000000 +0300
@@ -181,6 +181,7 @@ struct saved_vmcs {
 	struct vmcs *vmcs;
 	int cpu;
 	int launched;
+	struct list_head local_saved_vmcss_link; /* see saved_vmcss_on_cpu */
 };
 
 /* Used to remember the last vmcs02 used for some recently used vmcs12s */
@@ -315,7 +316,20 @@ static int vmx_set_tss_addr(struct kvm *
 
 static DEFINE_PER_CPU(struct vmcs *, vmxarea);
 static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
+/*
+ * We maintain a per-CPU linked-list vcpus_on_cpu, holding for each CPU a list
+ * of vcpus whose VMCS are loaded on that CPU. This is needed when a CPU is
+ * brought down, and we need to VMCLEAR all VMCSs loaded on it.
+ *
+ * With nested VMX, we have additional VMCSs which are not the current
+ * vmx->vmcs of any vcpu, but may also be loaded on some CPU: While L2 is
+ * running, L1's VMCS is loaded but not the VMCS of any vcpu; While L1 is
+ * running, a previously used L2 VMCS might still be around and loaded on some
+ * CPU; sometimes even more than one such L2 VMCS is kept (see VMCS02_POOL_SIZE).
+ * The list of these additional VMCSs is kept on cpu saved_vmcss_on_cpu.
+ */
 static DEFINE_PER_CPU(struct list_head, vcpus_on_cpu);
+static DEFINE_PER_CPU(struct list_head, saved_vmcss_on_cpu);
 static DEFINE_PER_CPU(struct desc_ptr, host_gdt);
 
 static unsigned long *vmx_io_bitmap_a;
@@ -1818,6 +1832,7 @@ static int hardware_enable(void *garbage
 		return -EBUSY;
 
 	INIT_LIST_HEAD(&per_cpu(vcpus_on_cpu, cpu));
+	INIT_LIST_HEAD(&per_cpu(saved_vmcss_on_cpu, cpu));
 	rdmsrl(MSR_IA32_FEATURE_CONTROL, old);
 
 	test_bits = FEATURE_CONTROL_LOCKED;
@@ -1860,10 +1875,13 @@ static void kvm_cpu_vmxoff(void)
 	asm volatile (__ex(ASM_VMX_VMXOFF) : : : "cc");
 }
 
+static void vmclear_local_saved_vmcss(void);
+
 static void hardware_disable(void *garbage)
 {
 	if (vmm_exclusive) {
 		vmclear_local_vcpus();
+		vmclear_local_saved_vmcss();
 		kvm_cpu_vmxoff();
 	}
 	write_cr4(read_cr4() & ~X86_CR4_VMXE);
@@ -4248,6 +4266,8 @@ static void __nested_free_saved_vmcs(voi
 	vmcs_clear(saved_vmcs->vmcs);
 	if (per_cpu(current_vmcs, saved_vmcs->cpu) == saved_vmcs->vmcs)
 		per_cpu(current_vmcs, saved_vmcs->cpu) = NULL;
+	list_del(&saved_vmcs->local_saved_vmcss_link);
+	saved_vmcs->cpu = -1;
 }
 
 /*
@@ -4265,6 +4285,21 @@ static void nested_free_saved_vmcs(struc
 	free_vmcs(saved_vmcs->vmcs);
 }
 
+/*
+ * VMCLEAR all the currently unused (not vmx->vmcs on any vcpu) saved_vmcss
+ * which were loaded on the current CPU. See also vmclear_local_vcpus(), which
+ * does the same for VMCS currently used in vcpus.
+ */
+static void vmclear_local_saved_vmcss(void)
+{
+	int cpu = raw_smp_processor_id();
+	struct saved_vmcs *v, *n;
+
+	list_for_each_entry_safe(v, n, &per_cpu(saved_vmcss_on_cpu, cpu),
+				 local_saved_vmcss_link)
+		__nested_free_saved_vmcs(v);
+}
+
 /* Free and remove from pool a vmcs02 saved for a vmcs12 (if there is one) */
 static void nested_free_vmcs02(struct vcpu_vmx *vmx, gpa_t vmptr)
 {
@@ -5143,6 +5178,38 @@ static void vmx_set_supported_cpuid(u32 
 {
 }
 
+/*
+ * Maintain the vcpus_on_cpu and saved_vmcss_on_cpu lists of vcpus and
+ * inactive saved_vmcss on nested entry (L1->L2) or nested exit (L2->L1).
+ *
+ * nested_maintain_per_cpu_lists should be called after the VMCS was switched
+ * to the new one, with parameters giving both the new one (after the entry
+ * or exit) and the old one, in that order.
+ */
+static void nested_maintain_per_cpu_lists(struct vcpu_vmx *vmx,
+		struct saved_vmcs *new_vmcs,
+		struct saved_vmcs *old_vmcs)
+{
+	/*
+	 * When a vcpus's old vmcs is saved, we need to drop it from
+	 * vcpus_on_cpu and put it on saved_vmcss_on_cpu.
+	 */
+	if (old_vmcs->cpu != -1) {
+		list_del(&vmx->local_vcpus_link);
+		list_add(&old_vmcs->local_saved_vmcss_link,
+			 &per_cpu(saved_vmcss_on_cpu, old_vmcs->cpu));
+	}
+	/*
+	 * When a saved vmcs becomes a vcpu's new vmcs, we need to drop it
+	 * from saved_vmcss_on_cpu and put it on vcpus_on_cpu.
+	 */
+	if (new_vmcs->cpu != -1) {
+		list_del(&new_vmcs->local_saved_vmcss_link);
+		list_add(&vmx->local_vcpus_link,
+			 &per_cpu(vcpus_on_cpu, new_vmcs->cpu));
+	}
+}
+
 static int vmx_check_intercept(struct kvm_vcpu *vcpu,
 			       struct x86_instruction_info *info,
 			       enum x86_intercept_stage stage)


* [PATCH 09/31] nVMX: Add VMCS fields to the vmcs12
From: Nadav Har'El @ 2011-05-16 19:48 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

In this patch we add to vmcs12 (the VMCS that L1 keeps for L2) all the
standard VMCS fields.

Later patches will enable L1 to read and write these fields using VMREAD/
VMWRITE, and they will be used during a VMLAUNCH/VMRESUME in preparing vmcs02,
a hardware VMCS for running L2.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  275 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 275 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:47.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:47.000000000 +0300
@@ -144,12 +144,148 @@ struct shared_msr_entry {
  * machines (necessary for live migration).
  * If there are changes in this struct, VMCS12_REVISION must be changed.
  */
+typedef u64 natural_width;
 struct __packed vmcs12 {
 	/* According to the Intel spec, a VMCS region must start with the
 	 * following two fields. Then follow implementation-specific data.
 	 */
 	u32 revision_id;
 	u32 abort;
+
+	u64 io_bitmap_a;
+	u64 io_bitmap_b;
+	u64 msr_bitmap;
+	u64 vm_exit_msr_store_addr;
+	u64 vm_exit_msr_load_addr;
+	u64 vm_entry_msr_load_addr;
+	u64 tsc_offset;
+	u64 virtual_apic_page_addr;
+	u64 apic_access_addr;
+	u64 ept_pointer;
+	u64 guest_physical_address;
+	u64 vmcs_link_pointer;
+	u64 guest_ia32_debugctl;
+	u64 guest_ia32_pat;
+	u64 guest_ia32_efer;
+	u64 guest_pdptr0;
+	u64 guest_pdptr1;
+	u64 guest_pdptr2;
+	u64 guest_pdptr3;
+	u64 host_ia32_pat;
+	u64 host_ia32_efer;
+	u64 padding64[8]; /* room for future expansion */
+	/*
+	 * To allow migration of L1 (complete with its L2 guests) between
+	 * machines of different natural widths (32 or 64 bit), we cannot have
+	 * unsigned long fields with no explict size. We use u64 (aliased
+	 * natural_width) instead. Luckily, x86 is little-endian.
+	 */
+	natural_width cr0_guest_host_mask;
+	natural_width cr4_guest_host_mask;
+	natural_width cr0_read_shadow;
+	natural_width cr4_read_shadow;
+	natural_width cr3_target_value0;
+	natural_width cr3_target_value1;
+	natural_width cr3_target_value2;
+	natural_width cr3_target_value3;
+	natural_width exit_qualification;
+	natural_width guest_linear_address;
+	natural_width guest_cr0;
+	natural_width guest_cr3;
+	natural_width guest_cr4;
+	natural_width guest_es_base;
+	natural_width guest_cs_base;
+	natural_width guest_ss_base;
+	natural_width guest_ds_base;
+	natural_width guest_fs_base;
+	natural_width guest_gs_base;
+	natural_width guest_ldtr_base;
+	natural_width guest_tr_base;
+	natural_width guest_gdtr_base;
+	natural_width guest_idtr_base;
+	natural_width guest_dr7;
+	natural_width guest_rsp;
+	natural_width guest_rip;
+	natural_width guest_rflags;
+	natural_width guest_pending_dbg_exceptions;
+	natural_width guest_sysenter_esp;
+	natural_width guest_sysenter_eip;
+	natural_width host_cr0;
+	natural_width host_cr3;
+	natural_width host_cr4;
+	natural_width host_fs_base;
+	natural_width host_gs_base;
+	natural_width host_tr_base;
+	natural_width host_gdtr_base;
+	natural_width host_idtr_base;
+	natural_width host_ia32_sysenter_esp;
+	natural_width host_ia32_sysenter_eip;
+	natural_width host_rsp;
+	natural_width host_rip;
+	natural_width paddingl[8]; /* room for future expansion */
+	u32 pin_based_vm_exec_control;
+	u32 cpu_based_vm_exec_control;
+	u32 exception_bitmap;
+	u32 page_fault_error_code_mask;
+	u32 page_fault_error_code_match;
+	u32 cr3_target_count;
+	u32 vm_exit_controls;
+	u32 vm_exit_msr_store_count;
+	u32 vm_exit_msr_load_count;
+	u32 vm_entry_controls;
+	u32 vm_entry_msr_load_count;
+	u32 vm_entry_intr_info_field;
+	u32 vm_entry_exception_error_code;
+	u32 vm_entry_instruction_len;
+	u32 tpr_threshold;
+	u32 secondary_vm_exec_control;
+	u32 vm_instruction_error;
+	u32 vm_exit_reason;
+	u32 vm_exit_intr_info;
+	u32 vm_exit_intr_error_code;
+	u32 idt_vectoring_info_field;
+	u32 idt_vectoring_error_code;
+	u32 vm_exit_instruction_len;
+	u32 vmx_instruction_info;
+	u32 guest_es_limit;
+	u32 guest_cs_limit;
+	u32 guest_ss_limit;
+	u32 guest_ds_limit;
+	u32 guest_fs_limit;
+	u32 guest_gs_limit;
+	u32 guest_ldtr_limit;
+	u32 guest_tr_limit;
+	u32 guest_gdtr_limit;
+	u32 guest_idtr_limit;
+	u32 guest_es_ar_bytes;
+	u32 guest_cs_ar_bytes;
+	u32 guest_ss_ar_bytes;
+	u32 guest_ds_ar_bytes;
+	u32 guest_fs_ar_bytes;
+	u32 guest_gs_ar_bytes;
+	u32 guest_ldtr_ar_bytes;
+	u32 guest_tr_ar_bytes;
+	u32 guest_interruptibility_info;
+	u32 guest_activity_state;
+	u32 guest_sysenter_cs;
+	u32 host_ia32_sysenter_cs;
+	u32 padding32[8]; /* room for future expansion */
+	u16 virtual_processor_id;
+	u16 guest_es_selector;
+	u16 guest_cs_selector;
+	u16 guest_ss_selector;
+	u16 guest_ds_selector;
+	u16 guest_fs_selector;
+	u16 guest_gs_selector;
+	u16 guest_ldtr_selector;
+	u16 guest_tr_selector;
+	u16 host_es_selector;
+	u16 host_cs_selector;
+	u16 host_ss_selector;
+	u16 host_ds_selector;
+	u16 host_fs_selector;
+	u16 host_gs_selector;
+	u16 host_tr_selector;
 };
 
 /*
@@ -283,6 +419,145 @@ static inline struct vcpu_vmx *to_vmx(st
 	return container_of(vcpu, struct vcpu_vmx, vcpu);
 }
 
+#define VMCS12_OFFSET(x) offsetof(struct vmcs12, x)
+#define FIELD(number, name)	[number] = VMCS12_OFFSET(name)
+#define FIELD64(number, name)	[number] = VMCS12_OFFSET(name), \
+				[number##_HIGH] = VMCS12_OFFSET(name)+4
+
+static unsigned short vmcs_field_to_offset_table[] = {
+	FIELD(VIRTUAL_PROCESSOR_ID, virtual_processor_id),
+	FIELD(GUEST_ES_SELECTOR, guest_es_selector),
+	FIELD(GUEST_CS_SELECTOR, guest_cs_selector),
+	FIELD(GUEST_SS_SELECTOR, guest_ss_selector),
+	FIELD(GUEST_DS_SELECTOR, guest_ds_selector),
+	FIELD(GUEST_FS_SELECTOR, guest_fs_selector),
+	FIELD(GUEST_GS_SELECTOR, guest_gs_selector),
+	FIELD(GUEST_LDTR_SELECTOR, guest_ldtr_selector),
+	FIELD(GUEST_TR_SELECTOR, guest_tr_selector),
+	FIELD(HOST_ES_SELECTOR, host_es_selector),
+	FIELD(HOST_CS_SELECTOR, host_cs_selector),
+	FIELD(HOST_SS_SELECTOR, host_ss_selector),
+	FIELD(HOST_DS_SELECTOR, host_ds_selector),
+	FIELD(HOST_FS_SELECTOR, host_fs_selector),
+	FIELD(HOST_GS_SELECTOR, host_gs_selector),
+	FIELD(HOST_TR_SELECTOR, host_tr_selector),
+	FIELD64(IO_BITMAP_A, io_bitmap_a),
+	FIELD64(IO_BITMAP_B, io_bitmap_b),
+	FIELD64(MSR_BITMAP, msr_bitmap),
+	FIELD64(VM_EXIT_MSR_STORE_ADDR, vm_exit_msr_store_addr),
+	FIELD64(VM_EXIT_MSR_LOAD_ADDR, vm_exit_msr_load_addr),
+	FIELD64(VM_ENTRY_MSR_LOAD_ADDR, vm_entry_msr_load_addr),
+	FIELD64(TSC_OFFSET, tsc_offset),
+	FIELD64(VIRTUAL_APIC_PAGE_ADDR, virtual_apic_page_addr),
+	FIELD64(APIC_ACCESS_ADDR, apic_access_addr),
+	FIELD64(EPT_POINTER, ept_pointer),
+	FIELD64(GUEST_PHYSICAL_ADDRESS, guest_physical_address),
+	FIELD64(VMCS_LINK_POINTER, vmcs_link_pointer),
+	FIELD64(GUEST_IA32_DEBUGCTL, guest_ia32_debugctl),
+	FIELD64(GUEST_IA32_PAT, guest_ia32_pat),
+	FIELD64(GUEST_PDPTR0, guest_pdptr0),
+	FIELD64(GUEST_PDPTR1, guest_pdptr1),
+	FIELD64(GUEST_PDPTR2, guest_pdptr2),
+	FIELD64(GUEST_PDPTR3, guest_pdptr3),
+	FIELD64(HOST_IA32_PAT, host_ia32_pat),
+	FIELD(PIN_BASED_VM_EXEC_CONTROL, pin_based_vm_exec_control),
+	FIELD(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control),
+	FIELD(EXCEPTION_BITMAP, exception_bitmap),
+	FIELD(PAGE_FAULT_ERROR_CODE_MASK, page_fault_error_code_mask),
+	FIELD(PAGE_FAULT_ERROR_CODE_MATCH, page_fault_error_code_match),
+	FIELD(CR3_TARGET_COUNT, cr3_target_count),
+	FIELD(VM_EXIT_CONTROLS, vm_exit_controls),
+	FIELD(VM_EXIT_MSR_STORE_COUNT, vm_exit_msr_store_count),
+	FIELD(VM_EXIT_MSR_LOAD_COUNT, vm_exit_msr_load_count),
+	FIELD(VM_ENTRY_CONTROLS, vm_entry_controls),
+	FIELD(VM_ENTRY_MSR_LOAD_COUNT, vm_entry_msr_load_count),
+	FIELD(VM_ENTRY_INTR_INFO_FIELD, vm_entry_intr_info_field),
+	FIELD(VM_ENTRY_EXCEPTION_ERROR_CODE, vm_entry_exception_error_code),
+	FIELD(VM_ENTRY_INSTRUCTION_LEN, vm_entry_instruction_len),
+	FIELD(TPR_THRESHOLD, tpr_threshold),
+	FIELD(SECONDARY_VM_EXEC_CONTROL, secondary_vm_exec_control),
+	FIELD(VM_INSTRUCTION_ERROR, vm_instruction_error),
+	FIELD(VM_EXIT_REASON, vm_exit_reason),
+	FIELD(VM_EXIT_INTR_INFO, vm_exit_intr_info),
+	FIELD(VM_EXIT_INTR_ERROR_CODE, vm_exit_intr_error_code),
+	FIELD(IDT_VECTORING_INFO_FIELD, idt_vectoring_info_field),
+	FIELD(IDT_VECTORING_ERROR_CODE, idt_vectoring_error_code),
+	FIELD(VM_EXIT_INSTRUCTION_LEN, vm_exit_instruction_len),
+	FIELD(VMX_INSTRUCTION_INFO, vmx_instruction_info),
+	FIELD(GUEST_ES_LIMIT, guest_es_limit),
+	FIELD(GUEST_CS_LIMIT, guest_cs_limit),
+	FIELD(GUEST_SS_LIMIT, guest_ss_limit),
+	FIELD(GUEST_DS_LIMIT, guest_ds_limit),
+	FIELD(GUEST_FS_LIMIT, guest_fs_limit),
+	FIELD(GUEST_GS_LIMIT, guest_gs_limit),
+	FIELD(GUEST_LDTR_LIMIT, guest_ldtr_limit),
+	FIELD(GUEST_TR_LIMIT, guest_tr_limit),
+	FIELD(GUEST_GDTR_LIMIT, guest_gdtr_limit),
+	FIELD(GUEST_IDTR_LIMIT, guest_idtr_limit),
+	FIELD(GUEST_ES_AR_BYTES, guest_es_ar_bytes),
+	FIELD(GUEST_CS_AR_BYTES, guest_cs_ar_bytes),
+	FIELD(GUEST_SS_AR_BYTES, guest_ss_ar_bytes),
+	FIELD(GUEST_DS_AR_BYTES, guest_ds_ar_bytes),
+	FIELD(GUEST_FS_AR_BYTES, guest_fs_ar_bytes),
+	FIELD(GUEST_GS_AR_BYTES, guest_gs_ar_bytes),
+	FIELD(GUEST_LDTR_AR_BYTES, guest_ldtr_ar_bytes),
+	FIELD(GUEST_TR_AR_BYTES, guest_tr_ar_bytes),
+	FIELD(GUEST_INTERRUPTIBILITY_INFO, guest_interruptibility_info),
+	FIELD(GUEST_ACTIVITY_STATE, guest_activity_state),
+	FIELD(GUEST_SYSENTER_CS, guest_sysenter_cs),
+	FIELD(HOST_IA32_SYSENTER_CS, host_ia32_sysenter_cs),
+	FIELD(CR0_GUEST_HOST_MASK, cr0_guest_host_mask),
+	FIELD(CR4_GUEST_HOST_MASK, cr4_guest_host_mask),
+	FIELD(CR0_READ_SHADOW, cr0_read_shadow),
+	FIELD(CR4_READ_SHADOW, cr4_read_shadow),
+	FIELD(CR3_TARGET_VALUE0, cr3_target_value0),
+	FIELD(CR3_TARGET_VALUE1, cr3_target_value1),
+	FIELD(CR3_TARGET_VALUE2, cr3_target_value2),
+	FIELD(CR3_TARGET_VALUE3, cr3_target_value3),
+	FIELD(EXIT_QUALIFICATION, exit_qualification),
+	FIELD(GUEST_LINEAR_ADDRESS, guest_linear_address),
+	FIELD(GUEST_CR0, guest_cr0),
+	FIELD(GUEST_CR3, guest_cr3),
+	FIELD(GUEST_CR4, guest_cr4),
+	FIELD(GUEST_ES_BASE, guest_es_base),
+	FIELD(GUEST_CS_BASE, guest_cs_base),
+	FIELD(GUEST_SS_BASE, guest_ss_base),
+	FIELD(GUEST_DS_BASE, guest_ds_base),
+	FIELD(GUEST_FS_BASE, guest_fs_base),
+	FIELD(GUEST_GS_BASE, guest_gs_base),
+	FIELD(GUEST_LDTR_BASE, guest_ldtr_base),
+	FIELD(GUEST_TR_BASE, guest_tr_base),
+	FIELD(GUEST_GDTR_BASE, guest_gdtr_base),
+	FIELD(GUEST_IDTR_BASE, guest_idtr_base),
+	FIELD(GUEST_DR7, guest_dr7),
+	FIELD(GUEST_RSP, guest_rsp),
+	FIELD(GUEST_RIP, guest_rip),
+	FIELD(GUEST_RFLAGS, guest_rflags),
+	FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
+	FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
+	FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
+	FIELD(HOST_CR0, host_cr0),
+	FIELD(HOST_CR3, host_cr3),
+	FIELD(HOST_CR4, host_cr4),
+	FIELD(HOST_FS_BASE, host_fs_base),
+	FIELD(HOST_GS_BASE, host_gs_base),
+	FIELD(HOST_TR_BASE, host_tr_base),
+	FIELD(HOST_GDTR_BASE, host_gdtr_base),
+	FIELD(HOST_IDTR_BASE, host_idtr_base),
+	FIELD(HOST_IA32_SYSENTER_ESP, host_ia32_sysenter_esp),
+	FIELD(HOST_IA32_SYSENTER_EIP, host_ia32_sysenter_eip),
+	FIELD(HOST_RSP, host_rsp),
+	FIELD(HOST_RIP, host_rip),
+};
+static const int max_vmcs_field = ARRAY_SIZE(vmcs_field_to_offset_table);
+
+static inline short vmcs_field_to_offset(unsigned long field)
+{
+	if (field >= max_vmcs_field || vmcs_field_to_offset_table[field] == 0)
+		return -1;
+	return vmcs_field_to_offset_table[field];
+}
+
 static inline struct vmcs12 *get_vmcs12(struct kvm_vcpu *vcpu)
 {
 	return to_vmx(vcpu)->nested.current_vmcs12;

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [PATCH 10/31] nVMX: Success/failure of VMX instructions.
  2011-05-16 19:43 [PATCH 0/31] nVMX: Nested VMX, v10 Nadav Har'El
                   ` (8 preceding siblings ...)
  2011-05-16 19:48 ` [PATCH 09/31] nVMX: Add VMCS fields to the vmcs12 Nadav Har'El
@ 2011-05-16 19:49 ` Nadav Har'El
  2011-05-16 19:49 ` [PATCH 11/31] nVMX: Implement VMCLEAR Nadav Har'El
                   ` (20 subsequent siblings)
  30 siblings, 0 replies; 118+ messages in thread
From: Nadav Har'El @ 2011-05-16 19:49 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

VMX instructions specify success or failure by setting certain RFLAGS bits.
This patch contains common functions to do this, and they will be used in
the following patches which emulate the various VMX instructions.
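
For reference, here is a minimal standalone sketch (not part of this patch) of
the flag convention these helpers implement: success clears all six arithmetic
flags, VMfailInvalid sets only CF, and VMfailValid sets only ZF while recording
an error number in the current VMCS's VM-instruction error field. The
X86_EFLAGS_* values below are the standard x86 bit positions; the in-kernel
helpers do the same arithmetic on the guest's RFLAGS.

#include <stdio.h>

#define X86_EFLAGS_CF 0x0001
#define X86_EFLAGS_PF 0x0004
#define X86_EFLAGS_AF 0x0010
#define X86_EFLAGS_ZF 0x0040
#define X86_EFLAGS_SF 0x0080
#define X86_EFLAGS_OF 0x0800

#define VMX_ARITH_FLAGS (X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF | \
			 X86_EFLAGS_ZF | X86_EFLAGS_SF | X86_EFLAGS_OF)

static unsigned long vmx_succeed(unsigned long rflags)
{
	return rflags & ~VMX_ARITH_FLAGS;			/* all six cleared */
}

static unsigned long vmx_fail_invalid(unsigned long rflags)
{
	return (rflags & ~VMX_ARITH_FLAGS) | X86_EFLAGS_CF;	/* only CF set */
}

static unsigned long vmx_fail_valid(unsigned long rflags)
{
	return (rflags & ~VMX_ARITH_FLAGS) | X86_EFLAGS_ZF;	/* only ZF set */
}

int main(void)
{
	unsigned long rflags = 0x246;	/* arbitrary starting value */

	printf("succeed:      %#lx\n", vmx_succeed(rflags));
	printf("fail invalid: %#lx\n", vmx_fail_invalid(rflags));
	printf("fail valid:   %#lx\n", vmx_fail_valid(rflags));
	return 0;
}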

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/include/asm/vmx.h |   31 +++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx.c         |   30 ++++++++++++++++++++++++++++++
 2 files changed, 61 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:47.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:47.000000000 +0300
@@ -4801,6 +4801,36 @@ static int get_vmx_mem_address(struct kv
 }
 
 /*
+ * The following 3 functions, nested_vmx_succeed()/failValid()/failInvalid(),
+ * set the success or error code of an emulated VMX instruction, as specified
+ * by Vol 2B, VMX Instruction Reference, "Conventions".
+ */
+static void nested_vmx_succeed(struct kvm_vcpu *vcpu)
+{
+	vmx_set_rflags(vcpu, vmx_get_rflags(vcpu)
+			& ~(X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF |
+			    X86_EFLAGS_ZF | X86_EFLAGS_SF | X86_EFLAGS_OF));
+}
+
+static void nested_vmx_failInvalid(struct kvm_vcpu *vcpu)
+{
+	vmx_set_rflags(vcpu, (vmx_get_rflags(vcpu)
+			& ~(X86_EFLAGS_PF | X86_EFLAGS_AF | X86_EFLAGS_ZF |
+			    X86_EFLAGS_SF | X86_EFLAGS_OF))
+			| X86_EFLAGS_CF);
+}
+
+static void nested_vmx_failValid(struct kvm_vcpu *vcpu,
+					u32 vm_instruction_error)
+{
+	vmx_set_rflags(vcpu, (vmx_get_rflags(vcpu)
+			& ~(X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF |
+			    X86_EFLAGS_SF | X86_EFLAGS_OF))
+			| X86_EFLAGS_ZF);
+	get_vmcs12(vcpu)->vm_instruction_error = vm_instruction_error;
+}
+
+/*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
  * to be done to userspace and return 0.
--- .before/arch/x86/include/asm/vmx.h	2011-05-16 22:36:47.000000000 +0300
+++ .after/arch/x86/include/asm/vmx.h	2011-05-16 22:36:47.000000000 +0300
@@ -426,4 +426,35 @@ struct vmx_msr_entry {
 	u64 value;
 } __aligned(16);
 
+/*
+ * VM-instruction error numbers
+ */
+enum vm_instruction_error_number {
+	VMXERR_VMCALL_IN_VMX_ROOT_OPERATION = 1,
+	VMXERR_VMCLEAR_INVALID_ADDRESS = 2,
+	VMXERR_VMCLEAR_VMXON_POINTER = 3,
+	VMXERR_VMLAUNCH_NONCLEAR_VMCS = 4,
+	VMXERR_VMRESUME_NONLAUNCHED_VMCS = 5,
+	VMXERR_VMRESUME_AFTER_VMXOFF = 6,
+	VMXERR_ENTRY_INVALID_CONTROL_FIELD = 7,
+	VMXERR_ENTRY_INVALID_HOST_STATE_FIELD = 8,
+	VMXERR_VMPTRLD_INVALID_ADDRESS = 9,
+	VMXERR_VMPTRLD_VMXON_POINTER = 10,
+	VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID = 11,
+	VMXERR_UNSUPPORTED_VMCS_COMPONENT = 12,
+	VMXERR_VMWRITE_READ_ONLY_VMCS_COMPONENT = 13,
+	VMXERR_VMXON_IN_VMX_ROOT_OPERATION = 15,
+	VMXERR_ENTRY_INVALID_EXECUTIVE_VMCS_POINTER = 16,
+	VMXERR_ENTRY_NONLAUNCHED_EXECUTIVE_VMCS = 17,
+	VMXERR_ENTRY_EXECUTIVE_VMCS_POINTER_NOT_VMXON_POINTER = 18,
+	VMXERR_VMCALL_NONCLEAR_VMCS = 19,
+	VMXERR_VMCALL_INVALID_VM_EXIT_CONTROL_FIELDS = 20,
+	VMXERR_VMCALL_INCORRECT_MSEG_REVISION_ID = 22,
+	VMXERR_VMXOFF_UNDER_DUAL_MONITOR_TREATMENT_OF_SMIS_AND_SMM = 23,
+	VMXERR_VMCALL_INVALID_SMM_MONITOR_FEATURES = 24,
+	VMXERR_ENTRY_INVALID_VM_EXECUTION_CONTROL_FIELDS_IN_EXECUTIVE_VMCS = 25,
+	VMXERR_ENTRY_EVENTS_BLOCKED_BY_MOV_SS = 26,
+	VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID = 28,
+};
+
 #endif

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [PATCH 11/31] nVMX: Implement VMCLEAR
  2011-05-16 19:43 [PATCH 0/31] nVMX: Nested VMX, v10 Nadav Har'El
                   ` (9 preceding siblings ...)
  2011-05-16 19:49 ` [PATCH 10/31] nVMX: Success/failure of VMX instructions Nadav Har'El
@ 2011-05-16 19:49 ` Nadav Har'El
  2011-05-16 19:50 ` [PATCH 12/31] nVMX: Implement VMPTRLD Nadav Har'El
                   ` (19 subsequent siblings)
  30 siblings, 0 replies; 118+ messages in thread
From: Nadav Har'El @ 2011-05-16 19:49 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch implements the VMCLEAR instruction.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   65 ++++++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c |    1 
 2 files changed, 65 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/x86.c	2011-05-16 22:36:48.000000000 +0300
+++ .after/arch/x86/kvm/x86.c	2011-05-16 22:36:48.000000000 +0300
@@ -347,6 +347,7 @@ void kvm_inject_page_fault(struct kvm_vc
 	vcpu->arch.cr2 = fault->address;
 	kvm_queue_exception_e(vcpu, PF_VECTOR, fault->error_code);
 }
+EXPORT_SYMBOL_GPL(kvm_inject_page_fault);
 
 void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
 {
--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:48.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:48.000000000 +0300
@@ -152,6 +152,9 @@ struct __packed vmcs12 {
 	u32 revision_id;
 	u32 abort;
 
+	u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
+	u32 padding[7]; /* room for future expansion */
+
 	u64 io_bitmap_a;
 	u64 io_bitmap_b;
 	u64 msr_bitmap;
@@ -4830,6 +4833,66 @@ static void nested_vmx_failValid(struct 
 	get_vmcs12(vcpu)->vm_instruction_error = vm_instruction_error;
 }
 
+/* Emulate the VMCLEAR instruction */
+static int handle_vmclear(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	gva_t gva;
+	gpa_t vmcs12_addr;
+	struct vmcs12 *vmcs12;
+	struct page *page;
+	struct x86_exception e;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+			vmcs_read32(VMX_INSTRUCTION_INFO), &gva))
+		return 1;
+
+	if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &vmcs12_addr,
+				sizeof(vmcs12_addr), &e)) {
+		kvm_inject_page_fault(vcpu, &e);
+		return 1;
+	}
+
+	if (!IS_ALIGNED(vmcs12_addr, PAGE_SIZE)) {
+		nested_vmx_failValid(vcpu, VMXERR_VMCLEAR_INVALID_ADDRESS);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	if (vmcs12_addr == vmx->nested.current_vmptr) {
+		kunmap(vmx->nested.current_vmcs12_page);
+		nested_release_page(vmx->nested.current_vmcs12_page);
+		vmx->nested.current_vmptr = -1ull;
+		vmx->nested.current_vmcs12 = NULL;
+	}
+
+	page = nested_get_page(vcpu, vmcs12_addr);
+	if (page == NULL) {
+		/*
+		 * For accurate processor emulation, VMCLEAR beyond available
+		 * physical memory should do nothing at all. However, it is
+		 * possible that a nested vmx bug, not a guest hypervisor bug,
+		 * resulted in this case, so let's shut down before doing any
+		 * more damage:
+		 */
+		kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+		return 1;
+	}
+	vmcs12 = kmap(page);
+	vmcs12->launch_state = 0;
+	kunmap(page);
+	nested_release_page(page);
+
+	nested_free_vmcs02(vmx, vmcs12_addr);
+
+	skip_emulated_instruction(vcpu);
+	nested_vmx_succeed(vcpu);
+	return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -4851,7 +4914,7 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_INVD]		      = handle_invd,
 	[EXIT_REASON_INVLPG]		      = handle_invlpg,
 	[EXIT_REASON_VMCALL]                  = handle_vmcall,
-	[EXIT_REASON_VMCLEAR]	              = handle_vmx_insn,
+	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
 	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
 	[EXIT_REASON_VMPTRLD]                 = handle_vmx_insn,
 	[EXIT_REASON_VMPTRST]                 = handle_vmx_insn,

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [PATCH 12/31] nVMX: Implement VMPTRLD
  2011-05-16 19:43 [PATCH 0/31] nVMX: Nested VMX, v10 Nadav Har'El
                   ` (10 preceding siblings ...)
  2011-05-16 19:49 ` [PATCH 11/31] nVMX: Implement VMCLEAR Nadav Har'El
@ 2011-05-16 19:50 ` Nadav Har'El
  2011-05-16 19:50 ` [PATCH 13/31] nVMX: Implement VMPTRST Nadav Har'El
                   ` (18 subsequent siblings)
  30 siblings, 0 replies; 118+ messages in thread
From: Nadav Har'El @ 2011-05-16 19:50 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch implements the VMPTRLD instruction.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   62 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 61 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:48.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:48.000000000 +0300
@@ -4893,6 +4893,66 @@ static int handle_vmclear(struct kvm_vcp
 	return 1;
 }
 
+/* Emulate the VMPTRLD instruction */
+static int handle_vmptrld(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	gva_t gva;
+	gpa_t vmcs12_addr;
+	struct x86_exception e;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+			vmcs_read32(VMX_INSTRUCTION_INFO), &gva))
+		return 1;
+
+	if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &vmcs12_addr,
+				sizeof(vmcs12_addr), &e)) {
+		kvm_inject_page_fault(vcpu, &e);
+		return 1;
+	}
+
+	if (!IS_ALIGNED(vmcs12_addr, PAGE_SIZE)) {
+		nested_vmx_failValid(vcpu, VMXERR_VMPTRLD_INVALID_ADDRESS);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	if (vmx->nested.current_vmptr != vmcs12_addr) {
+		struct vmcs12 *new_vmcs12;
+		struct page *page;
+		page = nested_get_page(vcpu, vmcs12_addr);
+		if (page == NULL) {
+			nested_vmx_failInvalid(vcpu);
+			skip_emulated_instruction(vcpu);
+			return 1;
+		}
+		new_vmcs12 = kmap(page);
+		if (new_vmcs12->revision_id != VMCS12_REVISION) {
+			kunmap(page);
+			nested_release_page_clean(page);
+			nested_vmx_failValid(vcpu,
+				VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID);
+			skip_emulated_instruction(vcpu);
+			return 1;
+		}
+		if (vmx->nested.current_vmptr != -1ull) {
+			kunmap(vmx->nested.current_vmcs12_page);
+			nested_release_page(vmx->nested.current_vmcs12_page);
+		}
+
+		vmx->nested.current_vmptr = vmcs12_addr;
+		vmx->nested.current_vmcs12 = new_vmcs12;
+		vmx->nested.current_vmcs12_page = page;
+	}
+
+	nested_vmx_succeed(vcpu);
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -4916,7 +4976,7 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_VMCALL]                  = handle_vmcall,
 	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
 	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
-	[EXIT_REASON_VMPTRLD]                 = handle_vmx_insn,
+	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
 	[EXIT_REASON_VMPTRST]                 = handle_vmx_insn,
 	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
 	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [PATCH 13/31] nVMX: Implement VMPTRST
  2011-05-16 19:43 [PATCH 0/31] nVMX: Nested VMX, v10 Nadav Har'El
                   ` (11 preceding siblings ...)
  2011-05-16 19:50 ` [PATCH 12/31] nVMX: Implement VMPTRLD Nadav Har'El
@ 2011-05-16 19:50 ` Nadav Har'El
  2011-05-16 19:51 ` [PATCH 14/31] nVMX: Implement VMREAD and VMWRITE Nadav Har'El
                   ` (17 subsequent siblings)
  30 siblings, 0 replies; 118+ messages in thread
From: Nadav Har'El @ 2011-05-16 19:50 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch implements the VMPTRST instruction. 

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   28 +++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c |    3 ++-
 arch/x86/kvm/x86.h |    4 ++++
 3 files changed, 33 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/x86.c	2011-05-16 22:36:48.000000000 +0300
+++ .after/arch/x86/kvm/x86.c	2011-05-16 22:36:48.000000000 +0300
@@ -3836,7 +3836,7 @@ static int kvm_read_guest_virt_system(st
 	return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, 0, exception);
 }
 
-static int kvm_write_guest_virt_system(struct x86_emulate_ctxt *ctxt,
+int kvm_write_guest_virt_system(struct x86_emulate_ctxt *ctxt,
 				       gva_t addr, void *val,
 				       unsigned int bytes,
 				       struct x86_exception *exception)
@@ -3868,6 +3868,7 @@ static int kvm_write_guest_virt_system(s
 out:
 	return r;
 }
+EXPORT_SYMBOL_GPL(kvm_write_guest_virt_system);
 
 static int emulator_read_emulated(struct x86_emulate_ctxt *ctxt,
 				  unsigned long addr,
--- .before/arch/x86/kvm/x86.h	2011-05-16 22:36:48.000000000 +0300
+++ .after/arch/x86/kvm/x86.h	2011-05-16 22:36:48.000000000 +0300
@@ -85,4 +85,8 @@ int kvm_read_guest_virt(struct x86_emula
 	gva_t addr, void *val, unsigned int bytes,
 	struct x86_exception *exception);
 
+int kvm_write_guest_virt_system(struct x86_emulate_ctxt *ctxt,
+	gva_t addr, void *val, unsigned int bytes,
+	struct x86_exception *exception);
+
 #endif
--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:48.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:48.000000000 +0300
@@ -4953,6 +4953,32 @@ static int handle_vmptrld(struct kvm_vcp
 	return 1;
 }
 
+/* Emulate the VMPTRST instruction */
+static int handle_vmptrst(struct kvm_vcpu *vcpu)
+{
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	gva_t vmcs_gva;
+	struct x86_exception e;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (get_vmx_mem_address(vcpu, exit_qualification,
+			vmx_instruction_info, &vmcs_gva))
+		return 1;
+	/* ok to use *_system, as nested_vmx_check_permission verified cpl=0 */
+	if (kvm_write_guest_virt_system(&vcpu->arch.emulate_ctxt, vmcs_gva,
+				 (void *)&to_vmx(vcpu)->nested.current_vmptr,
+				 sizeof(u64), &e)) {
+		kvm_inject_page_fault(vcpu, &e);
+		return 1;
+	}
+	nested_vmx_succeed(vcpu);
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -4977,7 +5003,7 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
 	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
 	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
-	[EXIT_REASON_VMPTRST]                 = handle_vmx_insn,
+	[EXIT_REASON_VMPTRST]                 = handle_vmptrst,
 	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
 	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
 	[EXIT_REASON_VMWRITE]                 = handle_vmx_insn,

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [PATCH 14/31] nVMX: Implement VMREAD and VMWRITE
  2011-05-16 19:43 [PATCH 0/31] nVMX: Nested VMX, v10 Nadav Har'El
                   ` (12 preceding siblings ...)
  2011-05-16 19:50 ` [PATCH 13/31] nVMX: Implement VMPTRST Nadav Har'El
@ 2011-05-16 19:51 ` Nadav Har'El
  2011-05-16 19:51 ` [PATCH 15/31] nVMX: Move host-state field setup to a function Nadav Har'El
                   ` (16 subsequent siblings)
  30 siblings, 0 replies; 118+ messages in thread
From: Nadav Har'El @ 2011-05-16 19:51 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Implement the VMREAD and VMWRITE instructions. With these instructions, L1
can read and write to the VMCS it is holding. The values are read or written
to the fields of the vmcs12 structure introduced in a previous patch.
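
Before the diff, a quick standalone sketch (illustration only, not part of the
patch) of how a field's encoding determines its size and writability. It
mirrors vmcs_field_type() and vmcs_field_readonly() below: bit 0 is set on the
*_HIGH alias of a 64-bit field, bits 13-14 give the width, and field type 1
(bits 10-11) marks the read-only VM-exit information fields. The two example
encodings are the architectural values for GUEST_RIP and VM_EXIT_REASON.

#include <stdio.h>

static int field_width(unsigned long field)
{
	if (field & 0x1)		/* *_HIGH half of a 64-bit field */
		return 2;		/* treated as a 32-bit access */
	return (field >> 13) & 0x3;	/* 0=u16, 1=u64, 2=u32, 3=natural */
}

static int field_readonly(unsigned long field)
{
	return ((field >> 10) & 0x3) == 1;	/* VM-exit information fields */
}

int main(void)
{
	unsigned long guest_rip = 0x681e;	/* natural width, writable */
	unsigned long exit_reason = 0x4402;	/* 32 bits, read-only */

	printf("GUEST_RIP:      width=%d readonly=%d\n",
	       field_width(guest_rip), field_readonly(guest_rip));
	printf("VM_EXIT_REASON: width=%d readonly=%d\n",
	       field_width(exit_reason), field_readonly(exit_reason));
	return 0;
}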

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  176 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 174 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:48.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:48.000000000 +0300
@@ -4893,6 +4893,178 @@ static int handle_vmclear(struct kvm_vcp
 	return 1;
 }
 
+enum vmcs_field_type {
+	VMCS_FIELD_TYPE_U16 = 0,
+	VMCS_FIELD_TYPE_U64 = 1,
+	VMCS_FIELD_TYPE_U32 = 2,
+	VMCS_FIELD_TYPE_NATURAL_WIDTH = 3
+};
+
+static inline int vmcs_field_type(unsigned long field)
+{
+	if (0x1 & field)	/* the *_HIGH fields are all 32 bit */
+		return VMCS_FIELD_TYPE_U32;
+	return (field >> 13) & 0x3 ;
+}
+
+static inline int vmcs_field_readonly(unsigned long field)
+{
+	return (((field >> 10) & 0x3) == 1);
+}
+
+/*
+ * Read a vmcs12 field. Since these can have varying lengths and we return
+ * one type, we chose the biggest type (u64) and zero-extend the return value
+ * to that size. Note that the caller, handle_vmread, might need to use only
+ * some of the bits we return here (e.g., on 32-bit guests, only 32 bits of
+ * 64-bit fields are to be returned).
+ */
+static inline bool vmcs12_read_any(struct kvm_vcpu *vcpu,
+					unsigned long field, u64 *ret)
+{
+	short offset = vmcs_field_to_offset(field);
+	char *p;
+
+	if (offset < 0)
+		return 0;
+
+	p = ((char *)(get_vmcs12(vcpu))) + offset;
+
+	switch (vmcs_field_type(field)) {
+	case VMCS_FIELD_TYPE_NATURAL_WIDTH:
+		*ret = *((natural_width *)p);
+		return 1;
+	case VMCS_FIELD_TYPE_U16:
+		*ret = *((u16 *)p);
+		return 1;
+	case VMCS_FIELD_TYPE_U32:
+		*ret = *((u32 *)p);
+		return 1;
+	case VMCS_FIELD_TYPE_U64:
+		*ret = *((u64 *)p);
+		return 1;
+	default:
+		return 0; /* can never happen. */
+	}
+}
+
+static int handle_vmread(struct kvm_vcpu *vcpu)
+{
+	unsigned long field;
+	u64 field_value;
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	gva_t gva = 0;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	/* Decode instruction info and find the field to read */
+	field = kvm_register_read(vcpu, (((vmx_instruction_info) >> 28) & 0xf));
+	/* Read the field, zero-extended to a u64 field_value */
+	if (!vmcs12_read_any(vcpu, field, &field_value)) {
+		nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+	/*
+	 * Now copy part of this value to register or memory, as requested.
+	 * Note that the number of bits actually copied is 32 or 64 depending
+	 * on the guest's mode (32 or 64 bit), not on the given field's length.
+	 */
+	if (vmx_instruction_info & (1u << 10)) {
+		kvm_register_write(vcpu, (((vmx_instruction_info) >> 3) & 0xf),
+			field_value);
+	} else {
+		if (get_vmx_mem_address(vcpu, exit_qualification,
+				vmx_instruction_info, &gva))
+			return 1;
+		/* _system ok, as nested_vmx_check_permission verified cpl=0 */
+		kvm_write_guest_virt_system(&vcpu->arch.emulate_ctxt, gva,
+			     &field_value, (is_long_mode(vcpu) ? 8 : 4), NULL);
+	}
+
+	nested_vmx_succeed(vcpu);
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
+
+static int handle_vmwrite(struct kvm_vcpu *vcpu)
+{
+	unsigned long field;
+	gva_t gva;
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	char *p;
+	short offset;
+	/* The value to write might be 32 or 64 bits, depending on L1's long
+	 * mode, and eventually we need to write that into a field of several
+	 * possible lengths. The code below first zero-extends the value to 64
+	 * bit (field_value), and then copies only the appropriate number of
+	 * bits into the vmcs12 field.
+	 */
+	u64 field_value = 0;
+	struct x86_exception e;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (vmx_instruction_info & (1u << 10))
+		field_value = kvm_register_read(vcpu,
+			(((vmx_instruction_info) >> 3) & 0xf));
+	else {
+		if (get_vmx_mem_address(vcpu, exit_qualification,
+				vmx_instruction_info, &gva))
+			return 1;
+		if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva,
+			   &field_value, (is_long_mode(vcpu) ? 8 : 4), &e)) {
+			kvm_inject_page_fault(vcpu, &e);
+			return 1;
+		}
+	}
+
+
+	field = kvm_register_read(vcpu, (((vmx_instruction_info) >> 28) & 0xf));
+	if (vmcs_field_readonly(field)) {
+		nested_vmx_failValid(vcpu,
+			VMXERR_VMWRITE_READ_ONLY_VMCS_COMPONENT);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	offset = vmcs_field_to_offset(field);
+	if (offset < 0) {
+		nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+	p = ((char *) get_vmcs12(vcpu)) + offset;
+
+	switch (vmcs_field_type(field)) {
+	case VMCS_FIELD_TYPE_U16:
+		*(u16 *)p = field_value;
+		break;
+	case VMCS_FIELD_TYPE_U32:
+		*(u32 *)p = field_value;
+		break;
+	case VMCS_FIELD_TYPE_U64:
+		*(u64 *)p = field_value;
+		break;
+	case VMCS_FIELD_TYPE_NATURAL_WIDTH:
+		*(natural_width *)p = field_value;
+		break;
+	default:
+		nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	nested_vmx_succeed(vcpu);
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
 /* Emulate the VMPTRLD instruction */
 static int handle_vmptrld(struct kvm_vcpu *vcpu)
 {
@@ -5004,9 +5176,9 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
 	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
 	[EXIT_REASON_VMPTRST]                 = handle_vmptrst,
-	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
+	[EXIT_REASON_VMREAD]                  = handle_vmread,
 	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
-	[EXIT_REASON_VMWRITE]                 = handle_vmx_insn,
+	[EXIT_REASON_VMWRITE]                 = handle_vmwrite,
 	[EXIT_REASON_VMOFF]                   = handle_vmoff,
 	[EXIT_REASON_VMON]                    = handle_vmon,
 	[EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [PATCH 15/31] nVMX: Move host-state field setup to a function
  2011-05-16 19:43 [PATCH 0/31] nVMX: Nested VMX, v10 Nadav Har'El
                   ` (13 preceding siblings ...)
  2011-05-16 19:51 ` [PATCH 14/31] nVMX: Implement VMREAD and VMWRITE Nadav Har'El
@ 2011-05-16 19:51 ` Nadav Har'El
  2011-05-16 19:52 ` [PATCH 16/31] nVMX: Move control field setup to functions Nadav Har'El
                   ` (15 subsequent siblings)
  30 siblings, 0 replies; 118+ messages in thread
From: Nadav Har'El @ 2011-05-16 19:51 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Move the setting of constant host-state fields (fields that do not change
throughout the life of the guest) from vmx_vcpu_setup to a new common function
vmx_set_constant_host_state(). This function will also be used to set the
host state when running L2 guests.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   74 ++++++++++++++++++++++++-------------------
 1 file changed, 42 insertions(+), 32 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:48.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:48.000000000 +0300
@@ -3380,17 +3380,51 @@ static void vmx_disable_intercept_for_ms
 }
 
 /*
+ * Set up the vmcs's constant host-state fields, i.e., host-state fields that
+ * will not change in the lifetime of the guest.
+ * Note that host-state that does change is set elsewhere. E.g., host-state
+ * that is set differently for each CPU is set in vmx_vcpu_load(), not here.
+ */
+static void vmx_set_constant_host_state(void)
+{
+	u32 low32, high32;
+	unsigned long tmpl;
+	struct desc_ptr dt;
+
+	vmcs_writel(HOST_CR0, read_cr0() | X86_CR0_TS);  /* 22.2.3 */
+	vmcs_writel(HOST_CR4, read_cr4());  /* 22.2.3, 22.2.5 */
+	vmcs_writel(HOST_CR3, read_cr3());  /* 22.2.3  FIXME: shadow tables */
+
+	vmcs_write16(HOST_CS_SELECTOR, __KERNEL_CS);  /* 22.2.4 */
+	vmcs_write16(HOST_DS_SELECTOR, __KERNEL_DS);  /* 22.2.4 */
+	vmcs_write16(HOST_ES_SELECTOR, __KERNEL_DS);  /* 22.2.4 */
+	vmcs_write16(HOST_SS_SELECTOR, __KERNEL_DS);  /* 22.2.4 */
+	vmcs_write16(HOST_TR_SELECTOR, GDT_ENTRY_TSS*8);  /* 22.2.4 */
+
+	native_store_idt(&dt);
+	vmcs_writel(HOST_IDTR_BASE, dt.address);   /* 22.2.4 */
+
+	asm("mov $.Lkvm_vmx_return, %0" : "=r"(tmpl));
+	vmcs_writel(HOST_RIP, tmpl); /* 22.2.5 */
+
+	rdmsr(MSR_IA32_SYSENTER_CS, low32, high32);
+	vmcs_write32(HOST_IA32_SYSENTER_CS, low32);
+	rdmsrl(MSR_IA32_SYSENTER_EIP, tmpl);
+	vmcs_writel(HOST_IA32_SYSENTER_EIP, tmpl);   /* 22.2.3 */
+
+	if (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PAT) {
+		rdmsr(MSR_IA32_CR_PAT, low32, high32);
+		vmcs_write64(HOST_IA32_PAT, low32 | ((u64) high32 << 32));
+	}
+}
+
+/*
  * Sets up the vmcs for emulated real mode.
  */
 static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 {
-	u32 host_sysenter_cs, msr_low, msr_high;
-	u32 junk;
-	u64 host_pat;
 	unsigned long a;
-	struct desc_ptr dt;
 	int i;
-	unsigned long kvm_vmx_return;
 	u32 exec_control;
 
 	/* I/O */
@@ -3447,16 +3481,9 @@ static int vmx_vcpu_setup(struct vcpu_vm
 	vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, !!bypass_guest_pf);
 	vmcs_write32(CR3_TARGET_COUNT, 0);           /* 22.2.1 */
 
-	vmcs_writel(HOST_CR0, read_cr0() | X86_CR0_TS);  /* 22.2.3 */
-	vmcs_writel(HOST_CR4, read_cr4());  /* 22.2.3, 22.2.5 */
-	vmcs_writel(HOST_CR3, read_cr3());  /* 22.2.3  FIXME: shadow tables */
-
-	vmcs_write16(HOST_CS_SELECTOR, __KERNEL_CS);  /* 22.2.4 */
-	vmcs_write16(HOST_DS_SELECTOR, __KERNEL_DS);  /* 22.2.4 */
-	vmcs_write16(HOST_ES_SELECTOR, __KERNEL_DS);  /* 22.2.4 */
 	vmcs_write16(HOST_FS_SELECTOR, 0);            /* 22.2.4 */
 	vmcs_write16(HOST_GS_SELECTOR, 0);            /* 22.2.4 */
-	vmcs_write16(HOST_SS_SELECTOR, __KERNEL_DS);  /* 22.2.4 */
+	vmx_set_constant_host_state();
 #ifdef CONFIG_X86_64
 	rdmsrl(MSR_FS_BASE, a);
 	vmcs_writel(HOST_FS_BASE, a); /* 22.2.4 */
@@ -3467,32 +3494,15 @@ static int vmx_vcpu_setup(struct vcpu_vm
 	vmcs_writel(HOST_GS_BASE, 0); /* 22.2.4 */
 #endif
 
-	vmcs_write16(HOST_TR_SELECTOR, GDT_ENTRY_TSS*8);  /* 22.2.4 */
-
-	native_store_idt(&dt);
-	vmcs_writel(HOST_IDTR_BASE, dt.address);   /* 22.2.4 */
-
-	asm("mov $.Lkvm_vmx_return, %0" : "=r"(kvm_vmx_return));
-	vmcs_writel(HOST_RIP, kvm_vmx_return); /* 22.2.5 */
 	vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
 	vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0);
 	vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host));
 	vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, 0);
 	vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest));
 
-	rdmsr(MSR_IA32_SYSENTER_CS, host_sysenter_cs, junk);
-	vmcs_write32(HOST_IA32_SYSENTER_CS, host_sysenter_cs);
-	rdmsrl(MSR_IA32_SYSENTER_ESP, a);
-	vmcs_writel(HOST_IA32_SYSENTER_ESP, a);   /* 22.2.3 */
-	rdmsrl(MSR_IA32_SYSENTER_EIP, a);
-	vmcs_writel(HOST_IA32_SYSENTER_EIP, a);   /* 22.2.3 */
-
-	if (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PAT) {
-		rdmsr(MSR_IA32_CR_PAT, msr_low, msr_high);
-		host_pat = msr_low | ((u64) msr_high << 32);
-		vmcs_write64(HOST_IA32_PAT, host_pat);
-	}
 	if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
+		u32 msr_low, msr_high;
+		u64 host_pat;
 		rdmsr(MSR_IA32_CR_PAT, msr_low, msr_high);
 		host_pat = msr_low | ((u64) msr_high << 32);
 		/* Write the default value follow host pat */

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [PATCH 16/31] nVMX: Move control field setup to functions
  2011-05-16 19:43 [PATCH 0/31] nVMX: Nested VMX, v10 Nadav Har'El
                   ` (14 preceding siblings ...)
  2011-05-16 19:51 ` [PATCH 15/31] nVMX: Move host-state field setup to a function Nadav Har'El
@ 2011-05-16 19:52 ` Nadav Har'El
  2011-05-16 19:52 ` [PATCH 17/31] nVMX: Prepare vmcs02 from vmcs01 and vmcs12 Nadav Har'El
                   ` (14 subsequent siblings)
  30 siblings, 0 replies; 118+ messages in thread
From: Nadav Har'El @ 2011-05-16 19:52 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Move some of the control field setup to common functions. These functions will
also be needed for running L2 guests - L0's desires (expressed in these
functions) will be appropriately merged with L1's desires.
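
As a rough sketch of how these helpers are expected to be reused (the actual
merge, with more bits handled, appears in the prepare_vmcs02 patch later in
this series), the idea is: start from L0's desired controls, OR in what L1
asked for in vmcs12, then force any bits L0 cannot delegate. Fragment only,
using names from that later patch:

	u32 exec_control = vmx_exec_control(vmx);		/* L0's desires */
	exec_control |= vmcs12->cpu_based_vm_exec_control;	/* L1's desires */
	/* I/O and MSR bitmap merging is not supported yet: always exit */
	exec_control &= ~(CPU_BASED_USE_IO_BITMAPS | CPU_BASED_USE_MSR_BITMAPS);
	exec_control |= CPU_BASED_UNCOND_IO_EXITING;
	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control);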

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   80 +++++++++++++++++++++++++------------------
 1 file changed, 47 insertions(+), 33 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:48.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:48.000000000 +0300
@@ -3418,6 +3418,49 @@ static void vmx_set_constant_host_state(
 	}
 }
 
+static void set_cr4_guest_host_mask(struct vcpu_vmx *vmx)
+{
+	vmx->vcpu.arch.cr4_guest_owned_bits = KVM_CR4_GUEST_OWNED_BITS;
+	if (enable_ept)
+		vmx->vcpu.arch.cr4_guest_owned_bits |= X86_CR4_PGE;
+	vmcs_writel(CR4_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr4_guest_owned_bits);
+}
+
+static u32 vmx_exec_control(struct vcpu_vmx *vmx)
+{
+	u32 exec_control = vmcs_config.cpu_based_exec_ctrl;
+	if (!vm_need_tpr_shadow(vmx->vcpu.kvm)) {
+		exec_control &= ~CPU_BASED_TPR_SHADOW;
+#ifdef CONFIG_X86_64
+		exec_control |= CPU_BASED_CR8_STORE_EXITING |
+				CPU_BASED_CR8_LOAD_EXITING;
+#endif
+	}
+	if (!enable_ept)
+		exec_control |= CPU_BASED_CR3_STORE_EXITING |
+				CPU_BASED_CR3_LOAD_EXITING  |
+				CPU_BASED_INVLPG_EXITING;
+	return exec_control;
+}
+
+static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
+{
+	u32 exec_control = vmcs_config.cpu_based_2nd_exec_ctrl;
+	if (!vm_need_virtualize_apic_accesses(vmx->vcpu.kvm))
+		exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+	if (vmx->vpid == 0)
+		exec_control &= ~SECONDARY_EXEC_ENABLE_VPID;
+	if (!enable_ept) {
+		exec_control &= ~SECONDARY_EXEC_ENABLE_EPT;
+		enable_unrestricted_guest = 0;
+	}
+	if (!enable_unrestricted_guest)
+		exec_control &= ~SECONDARY_EXEC_UNRESTRICTED_GUEST;
+	if (!ple_gap)
+		exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
+	return exec_control;
+}
+
 /*
  * Sets up the vmcs for emulated real mode.
  */
@@ -3425,7 +3468,6 @@ static int vmx_vcpu_setup(struct vcpu_vm
 {
 	unsigned long a;
 	int i;
-	u32 exec_control;
 
 	/* I/O */
 	vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a));
@@ -3440,36 +3482,11 @@ static int vmx_vcpu_setup(struct vcpu_vm
 	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
 		vmcs_config.pin_based_exec_ctrl);
 
-	exec_control = vmcs_config.cpu_based_exec_ctrl;
-	if (!vm_need_tpr_shadow(vmx->vcpu.kvm)) {
-		exec_control &= ~CPU_BASED_TPR_SHADOW;
-#ifdef CONFIG_X86_64
-		exec_control |= CPU_BASED_CR8_STORE_EXITING |
-				CPU_BASED_CR8_LOAD_EXITING;
-#endif
-	}
-	if (!enable_ept)
-		exec_control |= CPU_BASED_CR3_STORE_EXITING |
-				CPU_BASED_CR3_LOAD_EXITING  |
-				CPU_BASED_INVLPG_EXITING;
-	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control);
+	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, vmx_exec_control(vmx));
 
 	if (cpu_has_secondary_exec_ctrls()) {
-		exec_control = vmcs_config.cpu_based_2nd_exec_ctrl;
-		if (!vm_need_virtualize_apic_accesses(vmx->vcpu.kvm))
-			exec_control &=
-				~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
-		if (vmx->vpid == 0)
-			exec_control &= ~SECONDARY_EXEC_ENABLE_VPID;
-		if (!enable_ept) {
-			exec_control &= ~SECONDARY_EXEC_ENABLE_EPT;
-			enable_unrestricted_guest = 0;
-		}
-		if (!enable_unrestricted_guest)
-			exec_control &= ~SECONDARY_EXEC_UNRESTRICTED_GUEST;
-		if (!ple_gap)
-			exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
-		vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
+		vmcs_write32(SECONDARY_VM_EXEC_CONTROL,
+				vmx_secondary_exec_control(vmx));
 	}
 
 	if (ple_gap) {
@@ -3532,10 +3549,7 @@ static int vmx_vcpu_setup(struct vcpu_vm
 	vmcs_write32(VM_ENTRY_CONTROLS, vmcs_config.vmentry_ctrl);
 
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~0UL);
-	vmx->vcpu.arch.cr4_guest_owned_bits = KVM_CR4_GUEST_OWNED_BITS;
-	if (enable_ept)
-		vmx->vcpu.arch.cr4_guest_owned_bits |= X86_CR4_PGE;
-	vmcs_writel(CR4_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr4_guest_owned_bits);
+	set_cr4_guest_host_mask(vmx);
 
 	kvm_write_tsc(&vmx->vcpu, 0);
 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [PATCH 17/31] nVMX: Prepare vmcs02 from vmcs01 and vmcs12
  2011-05-16 19:43 [PATCH 0/31] nVMX: Nested VMX, v10 Nadav Har'El
                   ` (15 preceding siblings ...)
  2011-05-16 19:52 ` [PATCH 16/31] nVMX: Move control field setup to functions Nadav Har'El
@ 2011-05-16 19:52 ` Nadav Har'El
  2011-05-24  8:02   ` Tian, Kevin
  2011-05-16 19:53 ` [PATCH 18/31] nVMX: Implement VMLAUNCH and VMRESUME Nadav Har'El
                   ` (13 subsequent siblings)
  30 siblings, 1 reply; 118+ messages in thread
From: Nadav Har'El @ 2011-05-16 19:52 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch contains code to prepare the VMCS which can be used to actually
run the L2 guest, vmcs02. prepare_vmcs02 appropriately merges the information
in vmcs12 (the vmcs that L1 built for L2) and in vmcs01 (our desires for our
own guests).
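
One detail worth spelling out is the CR0/CR4 guest/host mask arithmetic that
guest_readable_cr0()/guest_readable_cr4() in the diff below perform: for bits
set in the mask (owned by L1), L2 reads the value L1 put in the read shadow;
for clear bits it reads the real register. A tiny standalone example with
made-up values:

#include <stdio.h>

static unsigned long readable_cr0(unsigned long guest_cr0,
				  unsigned long mask,
				  unsigned long shadow)
{
	/* mask bits come from the shadow, the rest from the real CR0 */
	return (guest_cr0 & ~mask) | (shadow & mask);
}

int main(void)
{
	unsigned long guest_cr0 = 0x8005003b;	/* real CR0, TS (bit 3) set */
	unsigned long mask      = 0x00000008;	/* L1 owns CR0.TS */
	unsigned long shadow    = 0x00000000;	/* L1 told L2 that TS is clear */

	/* L2 reads CR0 with TS clear even though the hardware TS is set */
	printf("L2 reads CR0 = %#lx\n", readable_cr0(guest_cr0, mask, shadow));
	return 0;
}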

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  269 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 269 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:48.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:48.000000000 +0300
@@ -347,6 +347,12 @@ struct nested_vmx {
 	/* vmcs02_list cache of VMCSs recently used to run L2 guests */
 	struct list_head vmcs02_pool;
 	int vmcs02_num;
+	u64 vmcs01_tsc_offset;
+	/*
+	 * Guest pages referred to in vmcs02 with host-physical pointers, so
+	 * we must keep them pinned while L2 runs.
+	 */
+	struct page *apic_access_page;
 };
 
 struct vcpu_vmx {
@@ -849,6 +855,18 @@ static inline bool report_flexpriority(v
 	return flexpriority_enabled;
 }
 
+static inline bool nested_cpu_has(struct vmcs12 *vmcs12, u32 bit)
+{
+	return vmcs12->cpu_based_vm_exec_control & bit;
+}
+
+static inline bool nested_cpu_has2(struct vmcs12 *vmcs12, u32 bit)
+{
+	return (vmcs12->cpu_based_vm_exec_control &
+			CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) &&
+		(vmcs12->secondary_vm_exec_control & bit);
+}
+
 static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
 {
 	int i;
@@ -1435,6 +1453,22 @@ static void vmx_fpu_activate(struct kvm_
 
 static void vmx_decache_cr0_guest_bits(struct kvm_vcpu *vcpu);
 
+/*
+ * Return the cr0 value that a nested guest would read. This is a combination
+ * of the real cr0 used to run the guest (guest_cr0), and the bits shadowed by
+ * its hypervisor (cr0_read_shadow).
+ */
+static inline unsigned long guest_readable_cr0(struct vmcs12 *fields)
+{
+	return (fields->guest_cr0 & ~fields->cr0_guest_host_mask) |
+		(fields->cr0_read_shadow & fields->cr0_guest_host_mask);
+}
+static inline unsigned long guest_readable_cr4(struct vmcs12 *fields)
+{
+	return (fields->guest_cr4 & ~fields->cr4_guest_host_mask) |
+		(fields->cr4_read_shadow & fields->cr4_guest_host_mask);
+}
+
 static void vmx_fpu_deactivate(struct kvm_vcpu *vcpu)
 {
 	vmx_decache_cr0_guest_bits(vcpu);
@@ -3423,6 +3457,9 @@ static void set_cr4_guest_host_mask(stru
 	vmx->vcpu.arch.cr4_guest_owned_bits = KVM_CR4_GUEST_OWNED_BITS;
 	if (enable_ept)
 		vmx->vcpu.arch.cr4_guest_owned_bits |= X86_CR4_PGE;
+	if (is_guest_mode(&vmx->vcpu))
+		vmx->vcpu.arch.cr4_guest_owned_bits &=
+			~get_vmcs12(&vmx->vcpu)->cr4_guest_host_mask;
 	vmcs_writel(CR4_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr4_guest_owned_bits);
 }
 
@@ -4760,6 +4797,11 @@ static void free_nested(struct vcpu_vmx 
 		vmx->nested.current_vmptr = -1ull;
 		vmx->nested.current_vmcs12 = NULL;
 	}
+	/* Unpin physical memory we referred to in current vmcs02 */
+	if (vmx->nested.apic_access_page) {
+		nested_release_page(vmx->nested.apic_access_page);
+		vmx->nested.apic_access_page = 0;
+	}
 
 	nested_free_all_saved_vmcss(vmx);
 }
@@ -5829,6 +5871,233 @@ static void vmx_set_supported_cpuid(u32 
 }
 
 /*
+ * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
+ * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
+ * with L0's requirements for its guest (a.k.a. vmcs01), so we can run the L2
+ * guest in a way that will both be appropriate to L1's requests, and our
+ * needs. In addition to modifying the active vmcs (which is vmcs02), this
+ * function also has additional necessary side-effects, like setting various
+ * vcpu->arch fields.
+ */
+static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	u32 exec_control;
+
+	vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector);
+	vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector);
+	vmcs_write16(GUEST_SS_SELECTOR, vmcs12->guest_ss_selector);
+	vmcs_write16(GUEST_DS_SELECTOR, vmcs12->guest_ds_selector);
+	vmcs_write16(GUEST_FS_SELECTOR, vmcs12->guest_fs_selector);
+	vmcs_write16(GUEST_GS_SELECTOR, vmcs12->guest_gs_selector);
+	vmcs_write16(GUEST_LDTR_SELECTOR, vmcs12->guest_ldtr_selector);
+	vmcs_write16(GUEST_TR_SELECTOR, vmcs12->guest_tr_selector);
+	vmcs_write32(GUEST_ES_LIMIT, vmcs12->guest_es_limit);
+	vmcs_write32(GUEST_CS_LIMIT, vmcs12->guest_cs_limit);
+	vmcs_write32(GUEST_SS_LIMIT, vmcs12->guest_ss_limit);
+	vmcs_write32(GUEST_DS_LIMIT, vmcs12->guest_ds_limit);
+	vmcs_write32(GUEST_FS_LIMIT, vmcs12->guest_fs_limit);
+	vmcs_write32(GUEST_GS_LIMIT, vmcs12->guest_gs_limit);
+	vmcs_write32(GUEST_LDTR_LIMIT, vmcs12->guest_ldtr_limit);
+	vmcs_write32(GUEST_TR_LIMIT, vmcs12->guest_tr_limit);
+	vmcs_write32(GUEST_GDTR_LIMIT, vmcs12->guest_gdtr_limit);
+	vmcs_write32(GUEST_IDTR_LIMIT, vmcs12->guest_idtr_limit);
+	vmcs_write32(GUEST_ES_AR_BYTES, vmcs12->guest_es_ar_bytes);
+	vmcs_write32(GUEST_CS_AR_BYTES, vmcs12->guest_cs_ar_bytes);
+	vmcs_write32(GUEST_SS_AR_BYTES, vmcs12->guest_ss_ar_bytes);
+	vmcs_write32(GUEST_DS_AR_BYTES, vmcs12->guest_ds_ar_bytes);
+	vmcs_write32(GUEST_FS_AR_BYTES, vmcs12->guest_fs_ar_bytes);
+	vmcs_write32(GUEST_GS_AR_BYTES, vmcs12->guest_gs_ar_bytes);
+	vmcs_write32(GUEST_LDTR_AR_BYTES, vmcs12->guest_ldtr_ar_bytes);
+	vmcs_write32(GUEST_TR_AR_BYTES, vmcs12->guest_tr_ar_bytes);
+	vmcs_writel(GUEST_ES_BASE, vmcs12->guest_es_base);
+	vmcs_writel(GUEST_CS_BASE, vmcs12->guest_cs_base);
+	vmcs_writel(GUEST_SS_BASE, vmcs12->guest_ss_base);
+	vmcs_writel(GUEST_DS_BASE, vmcs12->guest_ds_base);
+	vmcs_writel(GUEST_FS_BASE, vmcs12->guest_fs_base);
+	vmcs_writel(GUEST_GS_BASE, vmcs12->guest_gs_base);
+	vmcs_writel(GUEST_LDTR_BASE, vmcs12->guest_ldtr_base);
+	vmcs_writel(GUEST_TR_BASE, vmcs12->guest_tr_base);
+	vmcs_writel(GUEST_GDTR_BASE, vmcs12->guest_gdtr_base);
+	vmcs_writel(GUEST_IDTR_BASE, vmcs12->guest_idtr_base);
+
+	vmcs_write64(GUEST_IA32_DEBUGCTL, vmcs12->guest_ia32_debugctl);
+	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+		vmcs12->vm_entry_intr_info_field);
+	vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
+		vmcs12->vm_entry_exception_error_code);
+	vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
+		vmcs12->vm_entry_instruction_len);
+	vmcs_write32(GUEST_INTERRUPTIBILITY_INFO,
+		vmcs12->guest_interruptibility_info);
+	vmcs_write32(GUEST_ACTIVITY_STATE, vmcs12->guest_activity_state);
+	vmcs_write32(GUEST_SYSENTER_CS, vmcs12->guest_sysenter_cs);
+	vmcs_writel(GUEST_DR7, vmcs12->guest_dr7);
+	vmcs_writel(GUEST_RFLAGS, vmcs12->guest_rflags);
+	vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS,
+		vmcs12->guest_pending_dbg_exceptions);
+	vmcs_writel(GUEST_SYSENTER_ESP, vmcs12->guest_sysenter_esp);
+	vmcs_writel(GUEST_SYSENTER_EIP, vmcs12->guest_sysenter_eip);
+
+	vmcs_write64(VMCS_LINK_POINTER, -1ull);
+
+	if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) {
+		struct page *page =
+			nested_get_page(vcpu, vmcs12->apic_access_addr);
+		if (!page)
+			return 1;
+		vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page));
+		/*
+		 * Keep the page pinned, so that the physical address we just wrote
+		 * remains valid. We keep a reference to it so we can release
+		 * it later.
+		 */
+		if (vmx->nested.apic_access_page) /* shouldn't happen... */
+			nested_release_page(vmx->nested.apic_access_page);
+		vmx->nested.apic_access_page = page;
+	}
+
+	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
+		(vmcs_config.pin_based_exec_ctrl |
+		 vmcs12->pin_based_vm_exec_control));
+
+	/*
+	 * Whether page-faults are trapped is determined by a combination of
+	 * 3 settings: PFEC_MASK, PFEC_MATCH and EXCEPTION_BITMAP.PF.
+	 * If enable_ept, L0 doesn't care about page faults and we should
+	 * set all of these to L1's desires. However, if !enable_ept, L0 does
+	 * care about (at least some) page faults, and because it is not easy
+	 * (if at all possible?) to merge L0 and L1's desires, we simply ask
+	 * to exit on each and every L2 page fault. This is done by setting
+	 * MASK=MATCH=0 and (see below) EB.PF=1.
+	 * Note that below we don't need special code to set EB.PF beyond the
+	 * "or"ing of the EB of vmcs01 and vmcs12, because when enable_ept,
+	 * vmcs01's EB.PF is 0 so the "or" will take vmcs12's value, and when
+	 * !enable_ept, EB.PF is 1, so the "or" will always be 1.
+	 *
+	 * A problem with this approach (when !enable_ept) is that L1 may be
+	 * injected with more page faults than it asked for. This could have
+	 * caused problems, but in practice existing hypervisors don't care.
+	 * To fix this, we will need to emulate the PFEC checking (on the L1
+	 * page tables), using walk_addr(), when injecting PFs to L1.
+	 */
+	vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK,
+		enable_ept ? vmcs12->page_fault_error_code_mask : 0);
+	vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH,
+		enable_ept ? vmcs12->page_fault_error_code_match : 0);
+
+	if (cpu_has_secondary_exec_ctrls()) {
+		u32 exec_control = vmx_secondary_exec_control(vmx);
+		if (!vmx->rdtscp_enabled)
+			exec_control &= ~SECONDARY_EXEC_RDTSCP;
+		/* Take the following fields only from vmcs12 */
+		exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+		if (nested_cpu_has(vmcs12,
+				CPU_BASED_ACTIVATE_SECONDARY_CONTROLS))
+			exec_control |= vmcs12->secondary_vm_exec_control;
+		vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
+	}
+
+	/*
+	 * Set host-state according to L0's settings (vmcs12 is irrelevant here)
+	 * Some constant fields are set here by vmx_set_constant_host_state().
+	 * Other fields are different per CPU, and will be set later when
+	 * vmx_vcpu_load() is called, and when vmx_save_host_state() is called.
+	 */
+	vmx_set_constant_host_state();
+
+	/*
+	 * HOST_RSP is normally set correctly in vmx_vcpu_run() just before
+	 * entry, but only if the current (host) sp changed from the value
+	 * we wrote last (vmx->host_rsp). This cache is no longer relevant
+	 * if we switch vmcs, and rather than hold a separate cache per vmcs,
+	 * here we just force the write to happen on entry.
+	 */
+	vmx->host_rsp = 0;
+
+	exec_control = vmx_exec_control(vmx); /* L0's desires */
+	exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING;
+	exec_control &= ~CPU_BASED_VIRTUAL_NMI_PENDING;
+	exec_control &= ~CPU_BASED_TPR_SHADOW;
+	exec_control |= vmcs12->cpu_based_vm_exec_control;
+	/*
+	 * Merging of IO and MSR bitmaps not currently supported.
+	 * Rather, exit every time.
+	 */
+	exec_control &= ~CPU_BASED_USE_MSR_BITMAPS;
+	exec_control &= ~CPU_BASED_USE_IO_BITMAPS;
+	exec_control |= CPU_BASED_UNCOND_IO_EXITING;
+
+	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control);
+
+	/* EXCEPTION_BITMAP and CR0_GUEST_HOST_MASK should basically be the
+	 * bitwise-or of what L1 wants to trap for L2, and what we want to
+	 * trap. Note that CR0.TS also needs updating - we do this later.
+	 */
+	update_exception_bitmap(vcpu);
+	vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
+	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
+
+	/* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer below */
+	vmcs_write32(VM_EXIT_CONTROLS,
+		vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
+	vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
+		(vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));
+
+	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
+		vmcs_write64(GUEST_IA32_PAT, vmcs12->guest_ia32_pat);
+	else if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
+		vmcs_write64(GUEST_IA32_PAT, vmx->vcpu.arch.pat);
+
+
+	set_cr4_guest_host_mask(vmx);
+
+	vmcs_write64(TSC_OFFSET,
+		vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
+
+	if (enable_vpid) {
+		/*
+		 * Trivially support vpid by letting L2s share their parent
+		 * L1's vpid. TODO: move to a more elaborate solution, giving
+		 * each L2 its own vpid and exposing the vpid feature to L1.
+		 */
+		vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
+		vmx_flush_tlb(vcpu);
+	}
+
+	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER)
+		vcpu->arch.efer = vmcs12->guest_ia32_efer;
+	if (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE)
+		vcpu->arch.efer |= (EFER_LMA | EFER_LME);
+	else
+		vcpu->arch.efer &= ~(EFER_LMA | EFER_LME);
+	/* Note: modifies VM_ENTRY/EXIT_CONTROLS and GUEST/HOST_IA32_EFER */
+	vmx_set_efer(vcpu, vcpu->arch.efer);
+
+	/*
+	 * This sets GUEST_CR0 to vmcs12->guest_cr0, with possibly a modified
+	 * TS bit (for lazy fpu) and bits which we consider mandatory enabled.
+	 * The CR0_READ_SHADOW is what L2 should have expected to read given
+	 * the specifications by L1; It's not enough to take
+	 * vmcs12->cr0_read_shadow because our cr0_guest_host_mask may
+	 * have more bits than L1 expected.
+	 */
+	vmx_set_cr0(vcpu, vmcs12->guest_cr0);
+	vmcs_writel(CR0_READ_SHADOW, guest_readable_cr0(vmcs12));
+
+	vmx_set_cr4(vcpu, vmcs12->guest_cr4);
+	vmcs_writel(CR4_READ_SHADOW, guest_readable_cr4(vmcs12));
+
+	/* shadow page tables on either EPT or shadow page tables */
+	kvm_set_cr3(vcpu, vmcs12->guest_cr3);
+	kvm_mmu_reset_context(vcpu);
+
+	kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->guest_rsp);
+	kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->guest_rip);
+	return 0;
+}
+
+/*
  * Maintain the vcpus_on_cpu and saved_vmcss_on_cpu lists of vcpus and
  * inactive saved_vmcss on nested entry (L1->L2) or nested exit (L2->L1).
  *

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [PATCH 18/31] nVMX: Implement VMLAUNCH and VMRESUME
  2011-05-16 19:43 [PATCH 0/31] nVMX: Nested VMX, v10 Nadav Har'El
                   ` (16 preceding siblings ...)
  2011-05-16 19:52 ` [PATCH 17/31] nVMX: Prepare vmcs02 from vmcs01 and vmcs12 Nadav Har'El
@ 2011-05-16 19:53 ` Nadav Har'El
  2011-05-24  8:45   ` Tian, Kevin
  2011-05-25  8:00   ` Tian, Kevin
  2011-05-16 19:53 ` [PATCH 19/31] nVMX: No need for handle_vmx_insn function any more Nadav Har'El
                   ` (12 subsequent siblings)
  30 siblings, 2 replies; 118+ messages in thread
From: Nadav Har'El @ 2011-05-16 19:53 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Implement the VMLAUNCH and VMRESUME instructions, allowing a guest
hypervisor to run its own guests.

This patch does not include some of the necessary validity checks on
vmcs12 fields before the entry. These will appear in a separate patch
below.
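
For completeness, a sketch of the launch-state rule those later checks will
presumably enforce (VMCLEAR sets vmcs12->launch_state to 0, a successful
VMLAUNCH sets it to 1): VMLAUNCH requires a clear VMCS, VMRESUME a launched
one. The helper name below is hypothetical; the error numbers are the VMXERR_*
values added earlier in this series.

/* Sketch only; not part of this patch. */
static int nested_vmx_check_launch_state(struct kvm_vcpu *vcpu, bool launch)
{
	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);

	if (launch && vmcs12->launch_state) {
		/* VMLAUNCH on a VMCS that is not clear */
		nested_vmx_failValid(vcpu, VMXERR_VMLAUNCH_NONCLEAR_VMCS);
		return 0;
	}
	if (!launch && !vmcs12->launch_state) {
		/* VMRESUME on a VMCS that was never launched */
		nested_vmx_failValid(vcpu, VMXERR_VMRESUME_NONLAUNCHED_VMCS);
		return 0;
	}
	return 1;
}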

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   84 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 82 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
@@ -347,6 +347,9 @@ struct nested_vmx {
 	/* vmcs02_list cache of VMCSs recently used to run L2 guests */
 	struct list_head vmcs02_pool;
 	int vmcs02_num;
+
+	/* Saving the VMCS that we used for running L1 */
+	struct saved_vmcs saved_vmcs01;
 	u64 vmcs01_tsc_offset;
 	/*
 	 * Guest pages referred to in vmcs02 with host-physical pointers, so
@@ -4668,6 +4671,8 @@ static void nested_free_all_saved_vmcss(
 		kfree(item);
 	}
 	vmx->nested.vmcs02_num = 0;
+	if (is_guest_mode(&vmx->vcpu))
+		nested_free_saved_vmcs(vmx, &vmx->nested.saved_vmcs01);
 }
 
 /* Get a vmcs02 for the current vmcs12. */
@@ -4959,6 +4964,21 @@ static int handle_vmclear(struct kvm_vcp
 	return 1;
 }
 
+static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch);
+
+/* Emulate the VMLAUNCH instruction */
+static int handle_vmlaunch(struct kvm_vcpu *vcpu)
+{
+	return nested_vmx_run(vcpu, true);
+}
+
+/* Emulate the VMRESUME instruction */
+static int handle_vmresume(struct kvm_vcpu *vcpu)
+{
+
+	return nested_vmx_run(vcpu, false);
+}
+
 enum vmcs_field_type {
 	VMCS_FIELD_TYPE_U16 = 0,
 	VMCS_FIELD_TYPE_U64 = 1,
@@ -5239,11 +5259,11 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_INVLPG]		      = handle_invlpg,
 	[EXIT_REASON_VMCALL]                  = handle_vmcall,
 	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
-	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
+	[EXIT_REASON_VMLAUNCH]                = handle_vmlaunch,
 	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
 	[EXIT_REASON_VMPTRST]                 = handle_vmptrst,
 	[EXIT_REASON_VMREAD]                  = handle_vmread,
-	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
+	[EXIT_REASON_VMRESUME]                = handle_vmresume,
 	[EXIT_REASON_VMWRITE]                 = handle_vmwrite,
 	[EXIT_REASON_VMOFF]                   = handle_vmoff,
 	[EXIT_REASON_VMON]                    = handle_vmon,
@@ -6129,6 +6149,66 @@ static void nested_maintain_per_cpu_list
 	}
 }
 
+/*
+ * nested_vmx_run() handles a nested entry, i.e., a VMLAUNCH or VMRESUME on L1
+ * for running an L2 nested guest.
+ */
+static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
+{
+	struct vmcs12 *vmcs12;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	int cpu;
+	struct saved_vmcs *saved_vmcs02;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+	skip_emulated_instruction(vcpu);
+
+	vmcs12 = get_vmcs12(vcpu);
+
+	enter_guest_mode(vcpu);
+
+	vmx->nested.vmcs01_tsc_offset = vmcs_read64(TSC_OFFSET);
+
+	/*
+	 * Switch from L1's VMCS (vmcs01), to L2's VMCS (vmcs02). Remember
+	 * vmcs01, on which CPU it was last loaded, and whether it was launched
+	 * (we need all these values next time we will use L1). Then recall
+	 * these values from the last time vmcs02 was used.
+	 */
+	saved_vmcs02 = nested_get_current_vmcs02(vmx);
+	if (!saved_vmcs02)
+		return -ENOMEM;
+
+	cpu = get_cpu();
+	vmx->nested.saved_vmcs01.vmcs = vmx->vmcs;
+	vmx->nested.saved_vmcs01.cpu = vcpu->cpu;
+	vmx->nested.saved_vmcs01.launched = vmx->launched;
+	vmx->vmcs = saved_vmcs02->vmcs;
+	vcpu->cpu = saved_vmcs02->cpu;
+	vmx->launched = saved_vmcs02->launched;
+
+	nested_maintain_per_cpu_lists(vmx,
+		saved_vmcs02, &vmx->nested.saved_vmcs01);
+
+	vmx_vcpu_put(vcpu);
+	vmx_vcpu_load(vcpu, cpu);
+	vcpu->cpu = cpu;
+	put_cpu();
+
+	vmcs12->launch_state = 1;
+
+	prepare_vmcs02(vcpu, vmcs12);
+
+	/*
+	 * Note no nested_vmx_succeed or nested_vmx_fail here. At this point
+	 * we are no longer running L1, and VMLAUNCH/VMRESUME has not yet
+	 * returned as far as L1 is concerned. It will only return (and set
+	 * the success flag) when L2 exits (see nested_vmx_vmexit()).
+	 */
+	return 1;
+}
+
 static int vmx_check_intercept(struct kvm_vcpu *vcpu,
 			       struct x86_instruction_info *info,
 			       enum x86_intercept_stage stage)

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [PATCH 19/31] nVMX: No need for handle_vmx_insn function any more
  2011-05-16 19:43 [PATCH 0/31] nVMX: Nested VMX, v10 Nadav Har'El
                   ` (17 preceding siblings ...)
  2011-05-16 19:53 ` [PATCH 18/31] nVMX: Implement VMLAUNCH and VMRESUME Nadav Har'El
@ 2011-05-16 19:53 ` Nadav Har'El
  2011-05-16 19:54 ` [PATCH 20/31] nVMX: Exiting from L2 to L1 Nadav Har'El
                   ` (11 subsequent siblings)
  30 siblings, 0 replies; 118+ messages in thread
From: Nadav Har'El @ 2011-05-16 19:53 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Before nested VMX support, the exit handler for a guest executing a VMX
instruction (vmclear, vmlaunch, vmptrld, vmptrst, vmread, vmresume,
vmwrite, vmon, vmoff) was handle_vmx_insn(). This handler simply queued a #UD
exception. Now that all these exit reasons are properly handled (and the
respective VMX instructions are emulated), nothing calls this dummy handler
and it can be removed.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |    6 ------
 1 file changed, 6 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
@@ -4297,12 +4297,6 @@ static int handle_vmcall(struct kvm_vcpu
 	return 1;
 }
 
-static int handle_vmx_insn(struct kvm_vcpu *vcpu)
-{
-	kvm_queue_exception(vcpu, UD_VECTOR);
-	return 1;
-}
-
 static int handle_invd(struct kvm_vcpu *vcpu)
 {
 	return emulate_instruction(vcpu, 0) == EMULATE_DONE;

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [PATCH 20/31] nVMX: Exiting from L2 to L1
  2011-05-16 19:43 [PATCH 0/31] nVMX: Nested VMX, v10 Nadav Har'El
                   ` (18 preceding siblings ...)
  2011-05-16 19:53 ` [PATCH 19/31] nVMX: No need for handle_vmx_insn function any more Nadav Har'El
@ 2011-05-16 19:54 ` Nadav Har'El
  2011-05-24 12:58   ` Tian, Kevin
  2011-05-25  2:43   ` Tian, Kevin
  2011-05-16 19:54 ` [PATCH 21/31] nVMX: vmcs12 checks on nested entry Nadav Har'El
                   ` (10 subsequent siblings)
  30 siblings, 2 replies; 118+ messages in thread
From: Nadav Har'El @ 2011-05-16 19:54 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch implements nested_vmx_vmexit(), called when the nested L2 guest
exits and we want to run its L1 parent and let it handle this exit.

Note that this will not necessarily be called on every L2 exit. L0 may decide
to handle a particular exit on its own, without L1's involvement; In that
case, L0 will handle the exit, and resume running L2, without running L1 and
without calling nested_vmx_vmexit(). The logic for deciding whether to handle
a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
will appear in a separate patch below.
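
A rough user-space sketch of this split (illustrative only; "handled_by_l1"
is a made-up stand-in for the decision logic added in a later patch):

#include <stdio.h>
#include <stdbool.h>

int main(void)
{
	bool handled_by_l1 = false;	/* e.g. an exit L0 can resolve on its own */

	if (handled_by_l1)
		printf("call nested_vmx_vmexit(): switch to L1 and let it handle the exit\n");
	else
		printf("L0 handles the exit and resumes L2; L1 never sees it\n");
	return 0;
}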

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  257 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 257 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
@@ -6203,6 +6203,263 @@ static int nested_vmx_run(struct kvm_vcp
 	return 1;
 }
 
+/*
+ * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date
+ * because L2 may have changed some cr0 bits directly (see CR0_GUEST_HOST_MASK)
+ * without L0 trapping the change and updating vmcs12.
+ * This function returns the value we should put in vmcs12.guest_cr0. It's not
+ * enough to just return the current (vmcs02) GUEST_CR0 - that may not be the
+ * guest cr0 that L1 thought it was giving its L2 guest; It is possible that
+ * L1 wished to allow its guest to set some cr0 bit directly, but we (L0) asked
+ * to trap this change and instead set just the read shadow bit. If this is the
+ * case, we need to copy these read-shadow bits back to vmcs12.guest_cr0, where
+ * L1 believes they already are.
+ */
+static inline unsigned long
+vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
+{
+	/*
+	 * As explained above, we take a bit from GUEST_CR0 if we allowed the
+	 * guest to modify it untrapped (vcpu->arch.cr0_guest_owned_bits), or
+	 * if we did trap it - if we did so because L1 asked to trap this bit
+	 * (vmcs12->cr0_guest_host_mask). Otherwise (bits we trapped but L1
+	 * didn't expect us to trap) we read from CR0_READ_SHADOW.
+	 */
+	unsigned long guest_cr0_bits =
+		vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask;
+	return (vmcs_readl(GUEST_CR0) & guest_cr0_bits) |
+	       (vmcs_readl(CR0_READ_SHADOW) & ~guest_cr0_bits);
+}
+
+static inline unsigned long
+vmcs12_guest_cr4(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
+{
+	unsigned long guest_cr4_bits =
+		vcpu->arch.cr4_guest_owned_bits | vmcs12->cr4_guest_host_mask;
+	return (vmcs_readl(GUEST_CR4) & guest_cr4_bits) |
+	       (vmcs_readl(CR4_READ_SHADOW) & ~guest_cr4_bits);
+}
+
+/*
+ * prepare_vmcs12 is part of what we need to do when the nested L2 guest exits
+ * and we want to prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12),
+ * and this function updates it to reflect the changes to the guest state while
+ * L2 was running (and perhaps made some exits which were handled directly by L0
+ * without going back to L1), and to reflect the exit reason.
+ * Note that we do not have to copy here all VMCS fields, just those that
+ * could have changed by the L2 guest or the exit - i.e., the guest-state and
+ * exit-information fields only. Other fields are modified by L1 with VMWRITE,
+ * which already writes to vmcs12 directly.
+ */
+void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
+{
+	/* update guest state fields: */
+	vmcs12->guest_cr0 = vmcs12_guest_cr0(vcpu, vmcs12);
+	vmcs12->guest_cr4 = vmcs12_guest_cr4(vcpu, vmcs12);
+
+	kvm_get_dr(vcpu, 7, (unsigned long *)&vmcs12->guest_dr7);
+	vmcs12->guest_rsp = kvm_register_read(vcpu, VCPU_REGS_RSP);
+	vmcs12->guest_rip = kvm_register_read(vcpu, VCPU_REGS_RIP);
+	vmcs12->guest_rflags = vmcs_readl(GUEST_RFLAGS);
+
+	vmcs12->guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR);
+	vmcs12->guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR);
+	vmcs12->guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR);
+	vmcs12->guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR);
+	vmcs12->guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR);
+	vmcs12->guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR);
+	vmcs12->guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR);
+	vmcs12->guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR);
+	vmcs12->guest_es_limit = vmcs_read32(GUEST_ES_LIMIT);
+	vmcs12->guest_cs_limit = vmcs_read32(GUEST_CS_LIMIT);
+	vmcs12->guest_ss_limit = vmcs_read32(GUEST_SS_LIMIT);
+	vmcs12->guest_ds_limit = vmcs_read32(GUEST_DS_LIMIT);
+	vmcs12->guest_fs_limit = vmcs_read32(GUEST_FS_LIMIT);
+	vmcs12->guest_gs_limit = vmcs_read32(GUEST_GS_LIMIT);
+	vmcs12->guest_ldtr_limit = vmcs_read32(GUEST_LDTR_LIMIT);
+	vmcs12->guest_tr_limit = vmcs_read32(GUEST_TR_LIMIT);
+	vmcs12->guest_gdtr_limit = vmcs_read32(GUEST_GDTR_LIMIT);
+	vmcs12->guest_idtr_limit = vmcs_read32(GUEST_IDTR_LIMIT);
+	vmcs12->guest_es_ar_bytes = vmcs_read32(GUEST_ES_AR_BYTES);
+	vmcs12->guest_cs_ar_bytes = vmcs_read32(GUEST_CS_AR_BYTES);
+	vmcs12->guest_ss_ar_bytes = vmcs_read32(GUEST_SS_AR_BYTES);
+	vmcs12->guest_ds_ar_bytes = vmcs_read32(GUEST_DS_AR_BYTES);
+	vmcs12->guest_fs_ar_bytes = vmcs_read32(GUEST_FS_AR_BYTES);
+	vmcs12->guest_gs_ar_bytes = vmcs_read32(GUEST_GS_AR_BYTES);
+	vmcs12->guest_ldtr_ar_bytes = vmcs_read32(GUEST_LDTR_AR_BYTES);
+	vmcs12->guest_tr_ar_bytes = vmcs_read32(GUEST_TR_AR_BYTES);
+	vmcs12->guest_es_base = vmcs_readl(GUEST_ES_BASE);
+	vmcs12->guest_cs_base = vmcs_readl(GUEST_CS_BASE);
+	vmcs12->guest_ss_base = vmcs_readl(GUEST_SS_BASE);
+	vmcs12->guest_ds_base = vmcs_readl(GUEST_DS_BASE);
+	vmcs12->guest_fs_base = vmcs_readl(GUEST_FS_BASE);
+	vmcs12->guest_gs_base = vmcs_readl(GUEST_GS_BASE);
+	vmcs12->guest_ldtr_base = vmcs_readl(GUEST_LDTR_BASE);
+	vmcs12->guest_tr_base = vmcs_readl(GUEST_TR_BASE);
+	vmcs12->guest_gdtr_base = vmcs_readl(GUEST_GDTR_BASE);
+	vmcs12->guest_idtr_base = vmcs_readl(GUEST_IDTR_BASE);
+
+	vmcs12->guest_activity_state = vmcs_read32(GUEST_ACTIVITY_STATE);
+	vmcs12->guest_interruptibility_info =
+		vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
+	vmcs12->guest_pending_dbg_exceptions =
+		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
+
+	/* TODO: These cannot have changed unless we have MSR bitmaps and
+	 * the relevant bit asks not to trap the change */
+	vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
+	if (vmcs12->vm_exit_controls & VM_EXIT_SAVE_IA32_PAT)
+		vmcs12->guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT);
+	vmcs12->guest_sysenter_cs = vmcs_read32(GUEST_SYSENTER_CS);
+	vmcs12->guest_sysenter_esp = vmcs_readl(GUEST_SYSENTER_ESP);
+	vmcs12->guest_sysenter_eip = vmcs_readl(GUEST_SYSENTER_EIP);
+
+	/* update exit information fields: */
+
+	vmcs12->vm_exit_reason  = vmcs_read32(VM_EXIT_REASON);
+	vmcs12->exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+
+	vmcs12->vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+	vmcs12->vm_exit_intr_error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
+	vmcs12->idt_vectoring_info_field =
+		vmcs_read32(IDT_VECTORING_INFO_FIELD);
+	vmcs12->idt_vectoring_error_code =
+		vmcs_read32(IDT_VECTORING_ERROR_CODE);
+	vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+	vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+
+	/* clear vm-entry fields which are to be cleared on exit */
+	if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
+		vmcs12->vm_entry_intr_info_field &= ~INTR_INFO_VALID_MASK;
+}
+
+/*
+ * A part of what we need to do when the nested L2 guest exits and we want to
+ * run its L1 parent, is to reset L1's guest state to the host state specified
+ * in vmcs12.
+ * This function is to be called not only on normal nested exit, but also on
+ * a nested entry failure, as explained in Intel's spec, 3B.23.7 ("VM-Entry
+ * Failures During or After Loading Guest State").
+ * This function should be called when the active VMCS is L1's (vmcs01).
+ */
+void load_vmcs12_host_state(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
+{
+	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER)
+		vcpu->arch.efer = vmcs12->host_ia32_efer;
+	if (vmcs12->vm_exit_controls & VM_EXIT_HOST_ADDR_SPACE_SIZE)
+		vcpu->arch.efer |= (EFER_LMA | EFER_LME);
+	else
+		vcpu->arch.efer &= ~(EFER_LMA | EFER_LME);
+	vmx_set_efer(vcpu, vcpu->arch.efer);
+
+	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PAT)
+		vmcs_write64(GUEST_IA32_PAT, vmcs12->host_ia32_pat);
+
+	kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->host_rsp);
+	kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->host_rip);
+	/*
+	 * Note that calling vmx_set_cr0 is important, even if cr0 hasn't
+	 * actually changed, because it depends on the current state of
+	 * fpu_active (which may have changed).
+	 * Note that vmx_set_cr0 refers to efer set above.
+	 */
+	kvm_set_cr0(vcpu, vmcs12->host_cr0);
+	/*
+	 * If we did fpu_activate()/fpu_deactivate() during L2's run, we need
+	 * to apply the same changes to L1's vmcs. We just set cr0 correctly,
+	 * but we also need to update cr0_guest_host_mask and exception_bitmap.
+	 */
+	update_exception_bitmap(vcpu);
+	vcpu->arch.cr0_guest_owned_bits = (vcpu->fpu_active ? X86_CR0_TS : 0);
+	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
+
+	/*
+	 * Note that CR4_GUEST_HOST_MASK is already set in the original vmcs01
+	 * (KVM doesn't change it)- no reason to call set_cr4_guest_host_mask();
+	 */
+	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
+	kvm_set_cr4(vcpu, vmcs12->host_cr4);
+
+	/* shadow page tables on either EPT or shadow page tables */
+	kvm_set_cr3(vcpu, vmcs12->host_cr3);
+	kvm_mmu_reset_context(vcpu);
+
+	if (enable_vpid) {
+		/*
+		 * Trivially support vpid by letting L2s share their parent
+		 * L1's vpid. TODO: move to a more elaborate solution, giving
+		 * each L2 its own vpid and exposing the vpid feature to L1.
+		 */
+		vmx_flush_tlb(vcpu);
+	}
+}
+
+/*
+ * Emulate an exit from nested guest (L2) to L1, i.e., prepare to run L1
+ * and modify vmcs12 to make it see what it would expect to see there if
+ * L2 was its real guest. Must only be called when in L2 (is_guest_mode())
+ */
+static void nested_vmx_vmexit(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	int cpu;
+	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+	struct saved_vmcs *saved_vmcs02;
+
+	leave_guest_mode(vcpu);
+	prepare_vmcs12(vcpu, vmcs12);
+
+	/*
+	 * Switch from L2's VMCS, to L1's VMCS. Remember on which CPU the L2
+	 * VMCS was last loaded, and whether it was launched (we need to know
+	 * this next time we use L2), and recall these values as they were for
+	 * L1's VMCS.
+	 */
+	cpu = get_cpu();
+	saved_vmcs02 = nested_get_current_vmcs02(vmx);
+	saved_vmcs02->cpu = vcpu->cpu;
+	saved_vmcs02->launched = vmx->launched;
+	vmx->vmcs = vmx->nested.saved_vmcs01.vmcs;
+	vcpu->cpu = vmx->nested.saved_vmcs01.cpu;
+	vmx->launched = vmx->nested.saved_vmcs01.launched;
+
+	nested_maintain_per_cpu_lists(vmx,
+		&vmx->nested.saved_vmcs01, saved_vmcs02);
+	/* if no vmcs02 cache requested, remove the one we used */
+	if (VMCS02_POOL_SIZE == 0)
+		nested_free_vmcs02(vmx, vmx->nested.current_vmptr);
+
+	vmx_vcpu_put(vcpu);
+	vmx_vcpu_load(vcpu, cpu);
+	vcpu->cpu = cpu;
+	put_cpu();
+
+	load_vmcs12_host_state(vcpu, vmcs12);
+
+	/* Update TSC_OFFSET if vmx_adjust_tsc_offset() was used while L2 ran */
+	vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
+
+	/* This is needed for same reason as it was needed in prepare_vmcs02 */
+	vmx->host_rsp = 0;
+
+	/* Unpin physical memory we referred to in vmcs02 */
+	if (vmx->nested.apic_access_page) {
+		nested_release_page(vmx->nested.apic_access_page);
+		vmx->nested.apic_access_page = 0;
+	}
+
+	/*
+	 * Exiting from L2 to L1, we're now back to L1 which thinks it just
+	 * finished a VMLAUNCH or VMRESUME instruction, so we need to set the
+	 * success or failure flag accordingly.
+	 */
+	if (unlikely(vmx->fail)) {
+		vmx->fail = 0;
+		nested_vmx_failValid(vcpu, vmcs_read32(VM_INSTRUCTION_ERROR));
+	} else
+		nested_vmx_succeed(vcpu);
+}
+
 static int vmx_check_intercept(struct kvm_vcpu *vcpu,
 			       struct x86_instruction_info *info,
 			       enum x86_intercept_stage stage)

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [PATCH 21/31] nVMX: vmcs12 checks on nested entry
  2011-05-16 19:43 [PATCH 0/31] nVMX: Nested VMX, v10 Nadav Har'El
                   ` (19 preceding siblings ...)
  2011-05-16 19:54 ` [PATCH 20/31] nVMX: Exiting from L2 to L1 Nadav Har'El
@ 2011-05-16 19:54 ` Nadav Har'El
  2011-05-25  3:01   ` Tian, Kevin
  2011-05-16 19:55 ` [PATCH 22/31] nVMX: Deciding if L0 or L1 should handle an L2 exit Nadav Har'El
                   ` (9 subsequent siblings)
  30 siblings, 1 reply; 118+ messages in thread
From: Nadav Har'El @ 2011-05-16 19:54 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch adds a bunch of tests of the validity of the vmcs12 fields,
according to what the VMX spec and our implementation allows. If fields
we cannot (or don't want to) honor are discovered, an entry failure is
emulated.

According to the spec, there are two types of entry failures: If the problem
was in vmcs12's host state or control fields, the VMLAUNCH instruction simply
fails. But if a problem is found in the guest state, the behavior is more
similar to that of an exit.
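
A rough, compilable user-space sketch of these two failure paths (illustrative
only; check_controls()/check_guest_state() are made-up stand-ins for the real
vmcs12 checks in the diff below, and the constants mirror the SDM values used
by KVM):

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

#define FAILED_VMENTRY			0x80000000u /* VMX_EXIT_REASONS_FAILED_VMENTRY */
#define EXIT_REASON_INVALID_STATE	33
#define ENTRY_FAIL_DEFAULT		0

static bool check_controls(void)    { return true; }	/* pretend controls are fine */
static bool check_guest_state(void) { return false; }	/* pretend guest state is bad */

int main(void)
{
	if (!check_controls()) {
		/* type 1: VMLAUNCH/VMRESUME itself fails (nested_vmx_failValid) */
		printf("VMfailValid: invalid control or host-state field\n");
		return 0;
	}
	if (!check_guest_state()) {
		/* type 2: the instruction "succeeds", but L1 immediately sees an
		 * exit whose reason has the failed-entry bit set */
		uint32_t reason = EXIT_REASON_INVALID_STATE | FAILED_VMENTRY;
		printf("emulated exit: reason=0x%x, qualification=%d\n",
		       reason, ENTRY_FAIL_DEFAULT);
		return 0;
	}
	printf("nested entry proceeds\n");
	return 0;
}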

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/include/asm/vmx.h |    8 ++
 arch/x86/kvm/vmx.c         |   94 +++++++++++++++++++++++++++++++++++
 2 files changed, 102 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
@@ -870,6 +870,10 @@ static inline bool nested_cpu_has2(struc
 		(vmcs12->secondary_vm_exec_control & bit);
 }
 
+static void nested_vmx_entry_failure(struct kvm_vcpu *vcpu,
+			struct vmcs12 *vmcs12,
+			u32 reason, unsigned long qualification);
+
 static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
 {
 	int i;
@@ -6160,6 +6164,79 @@ static int nested_vmx_run(struct kvm_vcp
 
 	vmcs12 = get_vmcs12(vcpu);
 
+	/*
+	 * The nested entry process starts with enforcing various prerequisites
+	 * on vmcs12 as required by the Intel SDM, and acts appropriately when
+	 * they fail: As the SDM explains, some conditions should cause the
+	 * instruction to fail, while others will cause the instruction to seem
+	 * to succeed, but return an EXIT_REASON_INVALID_STATE.
+	 * To speed up the normal (success) code path, we should avoid checking
+	 * for misconfigurations which will anyway be caught by the processor
+	 * when using the merged vmcs02.
+	 */
+	if (vmcs12->launch_state == launch) {
+		nested_vmx_failValid(vcpu,
+			launch ? VMXERR_VMLAUNCH_NONCLEAR_VMCS
+			       : VMXERR_VMRESUME_NONLAUNCHED_VMCS);
+		return 1;
+	}
+
+	if ((vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_MSR_BITMAPS) &&
+			!IS_ALIGNED(vmcs12->msr_bitmap, PAGE_SIZE)) {
+		/*TODO: Also verify bits beyond physical address width are 0*/
+		nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
+		return 1;
+	}
+
+	if (vmcs12->vm_entry_msr_load_count > 0 ||
+	    vmcs12->vm_exit_msr_load_count > 0 ||
+	    vmcs12->vm_exit_msr_store_count > 0) {
+		if (printk_ratelimit())
+			printk(KERN_WARNING
+			  "%s: VMCS MSR_{LOAD,STORE} unsupported\n", __func__);
+		nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
+		return 1;
+	}
+
+	if (!vmx_control_verify(vmcs12->cpu_based_vm_exec_control,
+	      nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high) ||
+	    !vmx_control_verify(vmcs12->secondary_vm_exec_control,
+	      nested_vmx_secondary_ctls_low, nested_vmx_secondary_ctls_high) ||
+	    !vmx_control_verify(vmcs12->pin_based_vm_exec_control,
+	      nested_vmx_pinbased_ctls_low, nested_vmx_pinbased_ctls_high) ||
+	    !vmx_control_verify(vmcs12->vm_exit_controls,
+	      nested_vmx_exit_ctls_low, nested_vmx_exit_ctls_high) ||
+	    !vmx_control_verify(vmcs12->vm_entry_controls,
+	      nested_vmx_entry_ctls_low, nested_vmx_entry_ctls_high))
+	{
+		nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
+		return 1;
+	}
+
+	if (((vmcs12->host_cr0 & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON) ||
+	    ((vmcs12->host_cr4 & VMXON_CR4_ALWAYSON) != VMXON_CR4_ALWAYSON)) {
+		nested_vmx_failValid(vcpu,
+			VMXERR_ENTRY_INVALID_HOST_STATE_FIELD);
+		return 1;
+	}
+
+	if (((vmcs12->guest_cr0 & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON) ||
+	    ((vmcs12->guest_cr4 & VMXON_CR4_ALWAYSON) != VMXON_CR4_ALWAYSON)) {
+		nested_vmx_entry_failure(vcpu, vmcs12,
+			EXIT_REASON_INVALID_STATE, ENTRY_FAIL_DEFAULT);
+		return 1;
+	}
+	if (vmcs12->vmcs_link_pointer != -1ull) {
+		nested_vmx_entry_failure(vcpu, vmcs12,
+			EXIT_REASON_INVALID_STATE, ENTRY_FAIL_VMCS_LINK_PTR);
+		return 1;
+	}
+
+	/*
+	 * We're finally done with prerequisite checking, and can start with
+	 * the nested entry.
+	 */
+
 	enter_guest_mode(vcpu);
 
 	vmx->nested.vmcs01_tsc_offset = vmcs_read64(TSC_OFFSET);
@@ -6460,6 +6537,23 @@ static void nested_vmx_vmexit(struct kvm
 		nested_vmx_succeed(vcpu);
 }
 
+/*
+ * L1's failure to enter L2 is a subset of a normal exit, as explained in
+ * 23.7 "VM-entry failures during or after loading guest state" (this also
+ * lists the acceptable exit-reason and exit-qualification parameters).
+ * It should only be called before L2 actually succeeded to run, and when
+ * vmcs01 is current (it doesn't leave_guest_mode() or switch VMCSs).
+ */
+static void nested_vmx_entry_failure(struct kvm_vcpu *vcpu,
+			struct vmcs12 *vmcs12,
+			u32 reason, unsigned long qualification)
+{
+	load_vmcs12_host_state(vcpu, vmcs12);
+	vmcs12->vm_exit_reason = reason | VMX_EXIT_REASONS_FAILED_VMENTRY;
+	vmcs12->exit_qualification = qualification;
+	nested_vmx_succeed(vcpu);
+}
+
 static int vmx_check_intercept(struct kvm_vcpu *vcpu,
 			       struct x86_instruction_info *info,
 			       enum x86_intercept_stage stage)
--- .before/arch/x86/include/asm/vmx.h	2011-05-16 22:36:49.000000000 +0300
+++ .after/arch/x86/include/asm/vmx.h	2011-05-16 22:36:49.000000000 +0300
@@ -427,6 +427,14 @@ struct vmx_msr_entry {
 } __aligned(16);
 
 /*
+ * Exit Qualifications for entry failure during or after loading guest state
+ */
+#define ENTRY_FAIL_DEFAULT		0
+#define ENTRY_FAIL_PDPTE		2
+#define ENTRY_FAIL_NMI			3
+#define ENTRY_FAIL_VMCS_LINK_PTR	4
+
+/*
  * VM-instruction error numbers
  */
 enum vm_instruction_error_number {

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [PATCH 22/31] nVMX: Deciding if L0 or L1 should handle an L2 exit
  2011-05-16 19:43 [PATCH 0/31] nVMX: Nested VMX, v10 Nadav Har'El
                   ` (20 preceding siblings ...)
  2011-05-16 19:54 ` [PATCH 21/31] nVMX: vmcs12 checks on nested entry Nadav Har'El
@ 2011-05-16 19:55 ` Nadav Har'El
  2011-05-25  7:56   ` Tian, Kevin
  2011-05-16 19:55 ` [PATCH 23/31] nVMX: Correct handling of interrupt injection Nadav Har'El
                   ` (8 subsequent siblings)
  30 siblings, 1 reply; 118+ messages in thread
From: Nadav Har'El @ 2011-05-16 19:55 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch contains the logic of whether an L2 exit should be handled by L0
and then L2 should be resumed, or whether L1 should be run to handle this
exit (using the nested_vmx_vmexit() function of the previous patch).

The basic idea is to let L1 handle the exit only if it actually asked to
trap this sort of event. For example, when L2 exits on a change to CR0,
we check L1's CR0_GUEST_HOST_MASK to see if L1 expressed interest in any
bit which changed; if it did, we exit to L1. But if it didn't, it means that
we (L0) wished to trap this event, so we handle it ourselves.
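
For the CR0 example, the test boils down to a single bit operation; here is a
tiny user-space sketch (illustrative values only, not the kernel code):

#include <stdio.h>

int main(void)
{
	unsigned long cr0_guest_host_mask = 0x20;	/* say L1 shadows CR0.NE     */
	unsigned long cr0_read_shadow     = 0x20;	/* the value L1 showed to L2 */
	unsigned long val                 = 0x01;	/* L2 writes NE=0, PE=1      */

	if (cr0_guest_host_mask & (val ^ cr0_read_shadow))
		printf("exit to L1: a bit L1 shadows has changed\n");
	else
		printf("L0 handles the exit itself and resumes L2\n");
	return 0;
}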

The next two patches add additional logic of what to do when an interrupt or
exception is injected: Does L0 need to do it, should we exit to L1 to do it,
or should we resume L2 and keep the exception to be injected later.

We add a new flag, "nested_run_pending", which can override the decision of
which should run next, L1 or L2. nested_run_pending=1 means that we *must* run
L2 next, not L1. This is necessary in particular when L1 did a VMLAUNCH of L2
and therefore expects L2 to be run (and perhaps be injected with an event it
specified, etc.). Nested_run_pending is especially intended to avoid switching
to L1 in the injection decision-point described above.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  256 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 255 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
@@ -351,6 +351,8 @@ struct nested_vmx {
 	/* Saving the VMCS that we used for running L1 */
 	struct saved_vmcs saved_vmcs01;
 	u64 vmcs01_tsc_offset;
+	/* L2 must run next, and mustn't decide to exit to L1. */
+	bool nested_run_pending;
 	/*
 	 * Guest pages referred to in vmcs02 with host-physical pointers, so
 	 * we must keep them pinned while L2 runs.
@@ -870,6 +872,20 @@ static inline bool nested_cpu_has2(struc
 		(vmcs12->secondary_vm_exec_control & bit);
 }
 
+static inline bool nested_cpu_has_virtual_nmis(struct kvm_vcpu *vcpu)
+{
+	return is_guest_mode(vcpu) &&
+		(get_vmcs12(vcpu)->pin_based_vm_exec_control &
+			PIN_BASED_VIRTUAL_NMIS);
+}
+
+static inline bool is_exception(u32 intr_info)
+{
+	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
+		== (INTR_TYPE_HARD_EXCEPTION | INTR_INFO_VALID_MASK);
+}
+
+static void nested_vmx_vmexit(struct kvm_vcpu *vcpu);
 static void nested_vmx_entry_failure(struct kvm_vcpu *vcpu,
 			struct vmcs12 *vmcs12,
 			u32 reason, unsigned long qualification);
@@ -5281,6 +5297,232 @@ static int (*kvm_vmx_exit_handlers[])(st
 static const int kvm_vmx_max_exit_handlers =
 	ARRAY_SIZE(kvm_vmx_exit_handlers);
 
+/*
+ * Return 1 if we should exit from L2 to L1 to handle an MSR access,
+ * rather than handle it ourselves in L0. I.e., check whether L1 expressed
+ * disinterest in the current event (read or write a specific MSR) by using an
+ * MSR bitmap. This may be the case even when L0 doesn't use MSR bitmaps.
+ */
+static bool nested_vmx_exit_handled_msr(struct kvm_vcpu *vcpu,
+	struct vmcs12 *vmcs12, u32 exit_reason)
+{
+	u32 msr_index = vcpu->arch.regs[VCPU_REGS_RCX];
+	gpa_t bitmap;
+
+	if (!nested_cpu_has(get_vmcs12(vcpu), CPU_BASED_USE_MSR_BITMAPS))
+		return 1;
+
+	/*
+	 * The MSR_BITMAP page is divided into four 1024-byte bitmaps,
+	 * for the four combinations of read/write and low/high MSR numbers.
+	 * First we need to figure out which of the four to use:
+	 */
+	bitmap = vmcs12->msr_bitmap;
+	if (exit_reason == EXIT_REASON_MSR_WRITE)
+		bitmap += 2048;
+	if (msr_index >= 0xc0000000) {
+		msr_index -= 0xc0000000;
+		bitmap += 1024;
+	}
+
+	/* Then read the msr_index'th bit from this bitmap: */
+	if (msr_index < 1024*8) {
+		unsigned char b;
+		kvm_read_guest(vcpu->kvm, bitmap + msr_index/8, &b, 1);
+		return 1 & (b >> (msr_index & 7));
+	} else
+		return 1; /* let L1 handle the wrong parameter */
+}
+
+/*
+ * Return 1 if we should exit from L2 to L1 to handle a CR access exit,
+ * rather than handle it ourselves in L0. I.e., check if L1 wanted to
+ * intercept (via guest_host_mask etc.) the current event.
+ */
+static bool nested_vmx_exit_handled_cr(struct kvm_vcpu *vcpu,
+	struct vmcs12 *vmcs12)
+{
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	int cr = exit_qualification & 15;
+	int reg = (exit_qualification >> 8) & 15;
+	unsigned long val = kvm_register_read(vcpu, reg);
+
+	switch ((exit_qualification >> 4) & 3) {
+	case 0: /* mov to cr */
+		switch (cr) {
+		case 0:
+			if (vmcs12->cr0_guest_host_mask &
+			    (val ^ vmcs12->cr0_read_shadow))
+				return 1;
+			break;
+		case 3:
+			if ((vmcs12->cr3_target_count >= 1 &&
+					vmcs12->cr3_target_value0 == val) ||
+				(vmcs12->cr3_target_count >= 2 &&
+					vmcs12->cr3_target_value1 == val) ||
+				(vmcs12->cr3_target_count >= 3 &&
+					vmcs12->cr3_target_value2 == val) ||
+				(vmcs12->cr3_target_count >= 4 &&
+					vmcs12->cr3_target_value3 == val))
+				return 0;
+			if (nested_cpu_has(vmcs12, CPU_BASED_CR3_LOAD_EXITING))
+				return 1;
+			break;
+		case 4:
+			if (vmcs12->cr4_guest_host_mask &
+			    (vmcs12->cr4_read_shadow ^ val))
+				return 1;
+			break;
+		case 8:
+			if (nested_cpu_has(vmcs12, CPU_BASED_CR8_LOAD_EXITING))
+				return 1;
+			break;
+		}
+		break;
+	case 2: /* clts */
+		if ((vmcs12->cr0_guest_host_mask & X86_CR0_TS) &&
+		    (vmcs12->cr0_read_shadow & X86_CR0_TS))
+			return 1;
+		break;
+	case 1: /* mov from cr */
+		switch (cr) {
+		case 3:
+			if (vmcs12->cpu_based_vm_exec_control &
+			    CPU_BASED_CR3_STORE_EXITING)
+				return 1;
+			break;
+		case 8:
+			if (vmcs12->cpu_based_vm_exec_control &
+			    CPU_BASED_CR8_STORE_EXITING)
+				return 1;
+			break;
+		}
+		break;
+	case 3: /* lmsw */
+		/*
+		 * lmsw can change bits 1..3 of cr0, and only set bit 0 of
+		 * cr0. Other attempted changes are ignored, with no exit.
+		 */
+		if (vmcs12->cr0_guest_host_mask & 0xe &
+		    (val ^ vmcs12->cr0_read_shadow))
+			return 1;
+		if ((vmcs12->cr0_guest_host_mask & 0x1) &&
+		    !(vmcs12->cr0_read_shadow & 0x1) &&
+		    (val & 0x1))
+			return 1;
+		break;
+	}
+	return 0;
+}
+
+/*
+ * Return 1 if we should exit from L2 to L1 to handle an exit, or 0 if we
+ * should handle it ourselves in L0 (and then continue L2). Only call this
+ * when in is_guest_mode (L2).
+ */
+static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
+{
+	u32 exit_reason = vmcs_read32(VM_EXIT_REASON);
+	u32 intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+
+	if (vmx->nested.nested_run_pending)
+		return 0;
+
+	if (unlikely(vmx->fail)) {
+		printk(KERN_INFO "%s failed vm entry %x\n",
+		       __func__, vmcs_read32(VM_INSTRUCTION_ERROR));
+		return 1;
+	}
+
+	switch (exit_reason) {
+	case EXIT_REASON_EXCEPTION_NMI:
+		if (!is_exception(intr_info))
+			return 0;
+		else if (is_page_fault(intr_info))
+			return enable_ept;
+		return vmcs12->exception_bitmap &
+				(1u << (intr_info & INTR_INFO_VECTOR_MASK));
+	case EXIT_REASON_EXTERNAL_INTERRUPT:
+		return 0;
+	case EXIT_REASON_TRIPLE_FAULT:
+		return 1;
+	case EXIT_REASON_PENDING_INTERRUPT:
+	case EXIT_REASON_NMI_WINDOW:
+		/*
+		 * prepare_vmcs02() set the CPU_BASED_VIRTUAL_INTR_PENDING bit
+		 * (aka Interrupt Window Exiting) only when L1 turned it on,
+		 * so if we got a PENDING_INTERRUPT exit, this must be for L1.
+		 * Same for NMI Window Exiting.
+		 */
+		return 1;
+	case EXIT_REASON_TASK_SWITCH:
+		return 1;
+	case EXIT_REASON_CPUID:
+		return 1;
+	case EXIT_REASON_HLT:
+		return nested_cpu_has(vmcs12, CPU_BASED_HLT_EXITING);
+	case EXIT_REASON_INVD:
+		return 1;
+	case EXIT_REASON_INVLPG:
+		return vmcs12->cpu_based_vm_exec_control &
+				CPU_BASED_INVLPG_EXITING;
+	case EXIT_REASON_RDPMC:
+		return vmcs12->cpu_based_vm_exec_control &
+				CPU_BASED_RDPMC_EXITING;
+	case EXIT_REASON_RDTSC:
+		return vmcs12->cpu_based_vm_exec_control &
+				CPU_BASED_RDTSC_EXITING;
+	case EXIT_REASON_VMCALL: case EXIT_REASON_VMCLEAR:
+	case EXIT_REASON_VMLAUNCH: case EXIT_REASON_VMPTRLD:
+	case EXIT_REASON_VMPTRST: case EXIT_REASON_VMREAD:
+	case EXIT_REASON_VMRESUME: case EXIT_REASON_VMWRITE:
+	case EXIT_REASON_VMOFF: case EXIT_REASON_VMON:
+		/*
+		 * VMX instructions trap unconditionally. This allows L1 to
+		 * emulate them for its L2 guest, i.e., allows 3-level nesting!
+		 */
+		return 1;
+	case EXIT_REASON_CR_ACCESS:
+		return nested_vmx_exit_handled_cr(vcpu, vmcs12);
+	case EXIT_REASON_DR_ACCESS:
+		return nested_cpu_has(vmcs12, CPU_BASED_MOV_DR_EXITING);
+	case EXIT_REASON_IO_INSTRUCTION:
+		/* TODO: support IO bitmaps */
+		return 1;
+	case EXIT_REASON_MSR_READ:
+	case EXIT_REASON_MSR_WRITE:
+		return nested_vmx_exit_handled_msr(vcpu, vmcs12, exit_reason);
+	case EXIT_REASON_INVALID_STATE:
+		return 1;
+	case EXIT_REASON_MWAIT_INSTRUCTION:
+		return nested_cpu_has(vmcs12, CPU_BASED_MWAIT_EXITING);
+	case EXIT_REASON_MONITOR_INSTRUCTION:
+		return nested_cpu_has(vmcs12, CPU_BASED_MONITOR_EXITING);
+	case EXIT_REASON_PAUSE_INSTRUCTION:
+		return nested_cpu_has(vmcs12, CPU_BASED_PAUSE_EXITING) ||
+			nested_cpu_has2(vmcs12,
+				SECONDARY_EXEC_PAUSE_LOOP_EXITING);
+	case EXIT_REASON_MCE_DURING_VMENTRY:
+		return 0;
+	case EXIT_REASON_TPR_BELOW_THRESHOLD:
+		return 1;
+	case EXIT_REASON_APIC_ACCESS:
+		return nested_cpu_has2(vmcs12,
+			SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
+	case EXIT_REASON_EPT_VIOLATION:
+	case EXIT_REASON_EPT_MISCONFIG:
+		return 0;
+	case EXIT_REASON_WBINVD:
+		return nested_cpu_has2(vmcs12, SECONDARY_EXEC_WBINVD_EXITING);
+	case EXIT_REASON_XSETBV:
+		return 1;
+	default:
+		return 1;
+	}
+}
+
 static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2)
 {
 	*info1 = vmcs_readl(EXIT_QUALIFICATION);
@@ -5303,6 +5545,17 @@ static int vmx_handle_exit(struct kvm_vc
 	if (vmx->emulation_required && emulate_invalid_guest_state)
 		return handle_invalid_guest_state(vcpu);
 
+	if (exit_reason == EXIT_REASON_VMLAUNCH ||
+	    exit_reason == EXIT_REASON_VMRESUME)
+		vmx->nested.nested_run_pending = 1;
+	else
+		vmx->nested.nested_run_pending = 0;
+
+	if (is_guest_mode(vcpu) && nested_vmx_exit_handled(vcpu)) {
+		nested_vmx_vmexit(vcpu);
+		return 1;
+	}
+
 	if (exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) {
 		vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY;
 		vcpu->run->fail_entry.hardware_entry_failure_reason
@@ -5325,7 +5578,8 @@ static int vmx_handle_exit(struct kvm_vc
 		       "(0x%x) and exit reason is 0x%x\n",
 		       __func__, vectoring_info, exit_reason);
 
-	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked)) {
+	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked &&
+			!nested_cpu_has_virtual_nmis(vcpu))) {
 		if (vmx_interrupt_allowed(vcpu)) {
 			vmx->soft_vnmi_blocked = 0;
 		} else if (vmx->vnmi_blocked_time > 1000000000LL &&

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [PATCH 23/31] nVMX: Correct handling of interrupt injection
  2011-05-16 19:43 [PATCH 0/31] nVMX: Nested VMX, v10 Nadav Har'El
                   ` (21 preceding siblings ...)
  2011-05-16 19:55 ` [PATCH 22/31] nVMX: Deciding if L0 or L1 should handle an L2 exit Nadav Har'El
@ 2011-05-16 19:55 ` Nadav Har'El
  2011-05-25  8:39   ` Tian, Kevin
  2011-05-25  9:18   ` Tian, Kevin
  2011-05-16 19:56 ` [PATCH 24/31] nVMX: Correct handling of exception injection Nadav Har'El
                   ` (7 subsequent siblings)
  30 siblings, 2 replies; 118+ messages in thread
From: Nadav Har'El @ 2011-05-16 19:55 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

The code in this patch correctly emulates external-interrupt injection
while a nested guest L2 is running.

Because of this code's relative un-obviousness, I include here a longer-than-
usual justification for what it does - much longer than the code itself ;-)

To understand how to correctly emulate interrupt injection while L2 is
running, let's look first at what we need to emulate: How would things look
like if the extra L0 hypervisor layer is removed, and instead of L0 injecting
an interrupt, we had hardware delivering an interrupt?

Now we have L1 running on bare metal with a guest L2, and the hardware
generates an interrupt. Assuming that L1 set PIN_BASED_EXT_INTR_MASK to 1, and 
VM_EXIT_ACK_INTR_ON_EXIT to 0 (we'll revisit these assumptions below), what
happens now is this: The processor exits from L2 to L1, with an external-
interrupt exit reason but without an interrupt vector. L1 runs, with
interrupts disabled, and it doesn't yet know what the interrupt was. Soon
after, it enables interrupts and only at that moment, it gets the interrupt
from the processor. When L1 is KVM, Linux handles this interrupt.

Now we need exactly the same thing to happen when that L1->L2 system runs
on top of L0, instead of real hardware. This is how we do this:

When L0 wants to inject an interrupt, it needs to exit from L2 to L1, with
external-interrupt exit reason (with an invalid interrupt vector), and run L1.
Just like in the bare metal case, it likely can't deliver the interrupt to
L1 now because L1 is running with interrupts disabled, in which case it turns
on the interrupt window when running L1 after the exit. L1 will soon enable
interrupts, and at that point L0 will gain control again and inject the
interrupt to L1.
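
A rough user-space sketch of this injection-time decision (illustrative only;
the three booleans are made-up stand-ins for the checks added in the diff
below):

#include <stdio.h>
#include <stdbool.h>

int main(void)
{
	bool in_guest_mode      = true;		/* currently running L2                  */
	bool l1_exits_on_intr   = true;		/* L1 set PIN_BASED_EXT_INTR_MASK        */
	bool nested_run_pending = false;	/* L1 just VMLAUNCHed, L2 must run first */

	if (in_guest_mode && l1_exits_on_intr) {
		if (nested_run_pending) {
			printf("keep the interrupt pending and run L2 first\n");
			return 0;
		}
		printf("emulate an exit to L1: EXTERNAL_INTERRUPT, no vector;\n"
		       "then open an interrupt window and inject into L1 later\n");
		return 0;
	}
	printf("inject directly into the currently running guest\n");
	return 0;
}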

Finally, there is an extra complication in the code: when nested_run_pending,
we cannot return to L1 now, and must launch L2. We need to remember the
interrupt we wanted to inject (and not clear it now), and do it on the
next exit.

The above explanation shows that the relative strangeness of the nested
interrupt injection code in this patch, and the extra interrupt-window
exit incurred, are in fact necessary for accurate emulation, and are not
just an unoptimized implementation.

Let's revisit now the two assumptions made above:

If L1 turns off PIN_BASED_EXT_INTR_MASK (no hypervisor that I know
does, by the way), things are simple: L0 may inject the interrupt directly
to the L2 guest - using the normal code path that injects to any guest.
We support this case in the code below.

If L1 turns on VM_EXIT_ACK_INTR_ON_EXIT (again, no hypervisor that I know
does), things look very different from the description above: L1 expects
to see an exit from L2 with the interrupt vector already filled in the exit
information, and does not expect to be interrupted again with this interrupt.
The current code does not (yet) support this case, so we do not allow the
VM_EXIT_ACK_INTR_ON_EXIT exit-control to be turned on by L1.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
@@ -1788,6 +1788,7 @@ static __init void nested_vmx_setup_ctls
 
 	/* exit controls */
 	nested_vmx_exit_ctls_low = 0;
+	/* Note that guest use of VM_EXIT_ACK_INTR_ON_EXIT is not supported. */
 #ifdef CONFIG_X86_64
 	nested_vmx_exit_ctls_high = VM_EXIT_HOST_ADDR_SPACE_SIZE;
 #else
@@ -3733,9 +3734,25 @@ out:
 	return ret;
 }
 
+/*
+ * In nested virtualization, check if L1 asked to exit on external interrupts.
+ * For most existing hypervisors, this will always return true.
+ */
+static bool nested_exit_on_intr(struct kvm_vcpu *vcpu)
+{
+	return get_vmcs12(vcpu)->pin_based_vm_exec_control &
+		PIN_BASED_EXT_INTR_MASK;
+}
+
 static void enable_irq_window(struct kvm_vcpu *vcpu)
 {
 	u32 cpu_based_vm_exec_control;
+	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
+		/* We can get here when nested_run_pending caused
+		 * vmx_interrupt_allowed() to return false. In this case, do
+		 * nothing - the interrupt will be injected later.
+		 */
+		return;
 
 	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
 	cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
@@ -3858,6 +3875,17 @@ static void vmx_set_nmi_mask(struct kvm_
 
 static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
 {
+	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
+		struct vmcs12 *vmcs12;
+		if (to_vmx(vcpu)->nested.nested_run_pending)
+			return 0;
+		nested_vmx_vmexit(vcpu);
+		vmcs12 = get_vmcs12(vcpu);
+		vmcs12->vm_exit_reason = EXIT_REASON_EXTERNAL_INTERRUPT;
+		vmcs12->vm_exit_intr_info = 0;
+		/* fall through to normal code, but now in L1, not L2 */
+	}
+
 	return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
 		!(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
 			(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
@@ -5545,6 +5573,14 @@ static int vmx_handle_exit(struct kvm_vc
 	if (vmx->emulation_required && emulate_invalid_guest_state)
 		return handle_invalid_guest_state(vcpu);
 
+	/*
+	 * the KVM_REQ_EVENT optimization bit is only on for one entry, and if
+	 * we did not inject a still-pending event to L1 now because of
+	 * nested_run_pending, we need to re-enable this bit.
+	 */
+	if (vmx->nested.nested_run_pending)
+		kvm_make_request(KVM_REQ_EVENT, vcpu);
+
 	if (exit_reason == EXIT_REASON_VMLAUNCH ||
 	    exit_reason == EXIT_REASON_VMRESUME)
 		vmx->nested.nested_run_pending = 1;

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [PATCH 24/31] nVMX: Correct handling of exception injection
  2011-05-16 19:43 [PATCH 0/31] nVMX: Nested VMX, v10 Nadav Har'El
                   ` (22 preceding siblings ...)
  2011-05-16 19:55 ` [PATCH 23/31] nVMX: Correct handling of interrupt injection Nadav Har'El
@ 2011-05-16 19:56 ` Nadav Har'El
  2011-05-16 19:56 ` [PATCH 25/31] nVMX: Correct handling of idt vectoring info Nadav Har'El
                   ` (6 subsequent siblings)
  30 siblings, 0 replies; 118+ messages in thread
From: Nadav Har'El @ 2011-05-16 19:56 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Similar to the previous patch, but concerning injection of exceptions rather
than external interrupts.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:50.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:50.000000000 +0300
@@ -1583,6 +1583,25 @@ static void vmx_clear_hlt(struct kvm_vcp
 		vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
 }
 
+/*
+ * KVM wants to inject page-faults which it got to the guest. This function
+ * checks whether in a nested guest, we need to inject them to L1 or L2.
+ * This function assumes it is called with the exit reason in vmcs02 being
+ * a #PF exception (this is the only case in which KVM injects a #PF when L2
+ * is running).
+ */
+static int nested_pf_handled(struct kvm_vcpu *vcpu)
+{
+	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+
+	/* TODO: also check PFEC_MATCH/MASK, not just EB.PF. */
+	if (!(vmcs12->exception_bitmap & (1u << PF_VECTOR)))
+		return 0;
+
+	nested_vmx_vmexit(vcpu);
+	return 1;
+}
+
 static void vmx_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
 				bool has_error_code, u32 error_code,
 				bool reinject)
@@ -1590,6 +1609,10 @@ static void vmx_queue_exception(struct k
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u32 intr_info = nr | INTR_INFO_VALID_MASK;
 
+	if (nr == PF_VECTOR && is_guest_mode(vcpu) &&
+		nested_pf_handled(vcpu))
+		return;
+
 	if (has_error_code) {
 		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, error_code);
 		intr_info |= INTR_INFO_DELIVER_CODE_MASK;
@@ -3809,6 +3832,9 @@ static void vmx_inject_nmi(struct kvm_vc
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
+	if (is_guest_mode(vcpu))
+		return;
+
 	if (!cpu_has_virtual_nmis()) {
 		/*
 		 * Tracking the NMI-blocked state in software is built upon

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [PATCH 25/31] nVMX: Correct handling of idt vectoring info
  2011-05-16 19:43 [PATCH 0/31] nVMX: Nested VMX, v10 Nadav Har'El
                   ` (23 preceding siblings ...)
  2011-05-16 19:56 ` [PATCH 24/31] nVMX: Correct handling of exception injection Nadav Har'El
@ 2011-05-16 19:56 ` Nadav Har'El
  2011-05-25 10:02   ` Tian, Kevin
  2011-05-16 19:57 ` [PATCH 26/31] nVMX: Handling of CR0 and CR4 modifying instructions Nadav Har'El
                   ` (5 subsequent siblings)
  30 siblings, 1 reply; 118+ messages in thread
From: Nadav Har'El @ 2011-05-16 19:56 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch adds correct handling of IDT_VECTORING_INFO_FIELD for the nested
case.

When a guest exits while delivering an interrupt or exception, we get this
information in IDT_VECTORING_INFO_FIELD in the VMCS. When L2 exits to L1,
there's nothing we need to do, because L1 will see this field in vmcs12, and
handle it itself. However, when L2 exits and L0 handles the exit itself and
plans to return to L2, L0 must inject this event to L2.

In the normal non-nested case, the idt_vectoring_info is discovered after
the exit, and the decision to inject (though not the injection itself) is made
at that point. However, in the nested case the decision of whether to return
to L2 or L1 also happens during the injection phase (see the previous
patches), so we can only decide what to do about the idt_vectoring_info
right after the injection, i.e., at the beginning of vmx_vcpu_run, which is
the first time we know for sure whether we're staying in L2.

Therefore, when we exit L2 (is_guest_mode(vcpu)), we disable the regular
vmx_complete_interrupts() code which queues the idt_vectoring_info for
injection on next entry - because such injection would not be appropriate
if we decide to exit to L1. Rather, we just save the idt_vectoring_info
and related fields in vmcs12 (which is a convenient place to save these
fields). On the next entry in vmx_vcpu_run (*after* the injection phase,
potentially exiting to L1 to inject an event requested by user space), if
we find ourselves in L1 we don't need to do anything with those values
we saved (as explained above). But if we find that we're in L2, or rather
*still* at L2 (it's not nested_run_pending, meaning that this is the first
round of L2 running after L1 having just launched it), we need to inject
the event saved in those fields - by writing the appropriate VMCS fields.
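
A rough user-space sketch of this round trip (illustrative only; fake_vmcs12
and staying_in_l2 are made-up stand-ins, and 0x80000b0e encodes a valid #PF
with an error code in the IDT-vectoring format):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define VECTORING_INFO_VALID_MASK 0x80000000u

struct fake_vmcs12 {
	uint32_t idt_vectoring_info_field;
	uint32_t idt_vectoring_error_code;
	uint32_t vm_exit_instruction_len;
};

int main(void)
{
	struct fake_vmcs12 vmcs12 = {0};
	uint32_t hw_idt_vectoring_info = 0x80000b0e;	/* #PF (vector 14) being delivered */
	bool staying_in_l2 = true;

	/* right after the exit: only save the event, do not queue it yet */
	vmcs12.idt_vectoring_info_field = hw_idt_vectoring_info;

	/* at the next vmx_vcpu_run, once the L1-vs-L2 decision is final: */
	if (staying_in_l2 &&
	    (vmcs12.idt_vectoring_info_field & VECTORING_INFO_VALID_MASK))
		printf("re-inject: write 0x%x into VM_ENTRY_INTR_INFO_FIELD\n",
		       vmcs12.idt_vectoring_info_field);
	else
		printf("nothing to do; L1 will read the event from vmcs12\n");
	return 0;
}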

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:50.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:50.000000000 +0300
@@ -5804,6 +5804,8 @@ static void __vmx_complete_interrupts(st
 
 static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 {
+	if (is_guest_mode(&vmx->vcpu))
+		return;
 	__vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
 				  VM_EXIT_INSTRUCTION_LEN,
 				  IDT_VECTORING_ERROR_CODE);
@@ -5811,6 +5813,8 @@ static void vmx_complete_interrupts(stru
 
 static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
 {
+	if (is_guest_mode(vcpu))
+		return;
 	__vmx_complete_interrupts(to_vmx(vcpu),
 				  vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
 				  VM_ENTRY_INSTRUCTION_LEN,
@@ -5831,6 +5835,21 @@ static void __noclone vmx_vcpu_run(struc
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
+	if (is_guest_mode(vcpu) && !vmx->nested.nested_run_pending) {
+		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+		if (vmcs12->idt_vectoring_info_field &
+				VECTORING_INFO_VALID_MASK) {
+			vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+				vmcs12->idt_vectoring_info_field);
+			vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
+				vmcs12->vm_exit_instruction_len);
+			if (vmcs12->idt_vectoring_info_field &
+					VECTORING_INFO_DELIVER_CODE_MASK)
+				vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
+					vmcs12->idt_vectoring_error_code);
+		}
+	}
+
 	/* Record the guest's net vcpu time for enforced NMI injections. */
 	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked))
 		vmx->entry_time = ktime_get();
@@ -5962,6 +5981,17 @@ static void __noclone vmx_vcpu_run(struc
 
 	vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
 
+	if (is_guest_mode(vcpu)) {
+		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+		vmcs12->idt_vectoring_info_field = vmx->idt_vectoring_info;
+		if (vmx->idt_vectoring_info & VECTORING_INFO_VALID_MASK) {
+			vmcs12->idt_vectoring_error_code =
+				vmcs_read32(IDT_VECTORING_ERROR_CODE);
+			vmcs12->vm_exit_instruction_len =
+				vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+		}
+	}
+
 	asm("mov %0, %%ds; mov %0, %%es" : : "r"(__USER_DS));
 	vmx->launched = 1;
 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [PATCH 26/31] nVMX: Handling of CR0 and CR4 modifying instructions
  2011-05-16 19:43 [PATCH 0/31] nVMX: Nested VMX, v10 Nadav Har'El
                   ` (24 preceding siblings ...)
  2011-05-16 19:56 ` [PATCH 25/31] nVMX: Correct handling of idt vectoring info Nadav Har'El
@ 2011-05-16 19:57 ` Nadav Har'El
  2011-05-16 19:57 ` [PATCH 27/31] nVMX: Further fixes for lazy FPU loading Nadav Har'El
                   ` (4 subsequent siblings)
  30 siblings, 0 replies; 118+ messages in thread
From: Nadav Har'El @ 2011-05-16 19:57 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

When L2 tries to modify CR0 or CR4 (with mov or clts), and modifies a bit
which L1 asked to shadow (via CR[04]_GUEST_HOST_MASK), we already do the right
thing: we let L1 handle the trap (see nested_vmx_exit_handled_cr() in a
previous patch).
When L2 modifies bits that L1 doesn't care about, we let it think (via
CR[04]_READ_SHADOW) that it did these modifications, while only changing
(in GUEST_CR[04]) the bits that L0 doesn't shadow.
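
A tiny user-space sketch of that split for the lazy-FPU case, where L2 clears
CR0.TS, L1 does not shadow TS, but L0 does because the FPU is not loaded
(illustrative values only, not the kernel code):

#include <stdio.h>

#define X86_CR0_TS 0x8UL

int main(void)
{
	unsigned long guest_cr0        = 0x80000039;		/* current GUEST_CR0, TS set */
	unsigned long guest_owned_bits = 0;			/* L0 shadows TS right now   */
	unsigned long val              = 0x80000039 & ~X86_CR0_TS;	/* L2 writes TS=0    */

	/* bits L0 shadows keep L0's value in the real GUEST_CR0 ... */
	unsigned long new_guest_cr0 = (val & guest_owned_bits) |
				      (guest_cr0 & ~guest_owned_bits);
	/* ... while the read shadow shows L2 exactly what it wrote */
	unsigned long cr0_read_shadow = val;

	printf("GUEST_CR0=0x%lx CR0_READ_SHADOW=0x%lx\n",
	       new_guest_cr0, cr0_read_shadow);
	return 0;
}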

This is needed for correct handling of CR0.TS for lazy FPU loading: L0 may
want to leave TS on, while pretending to allow the guest to change it.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   58 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 55 insertions(+), 3 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:50.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:50.000000000 +0300
@@ -4153,6 +4153,58 @@ vmx_patch_hypercall(struct kvm_vcpu *vcp
 	hypercall[2] = 0xc1;
 }
 
+/* called to set cr0 as appropriate for a mov-to-cr0 exit. */
+static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val)
+{
+	if (to_vmx(vcpu)->nested.vmxon &&
+	    ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON))
+		return 1;
+
+	if (is_guest_mode(vcpu)) {
+		/*
+		 * We get here when L2 changed cr0 in a way that did not change
+		 * any of L1's shadowed bits (see nested_vmx_exit_handled_cr),
+		 * but did change L0 shadowed bits. This can currently happen
+		 * with the TS bit: L0 may want to leave TS on (for lazy fpu
+		 * loading) while pretending to allow the guest to change it.
+		 */
+		if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) |
+			 (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits)))
+			return 1;
+		vmcs_writel(CR0_READ_SHADOW, val);
+		return 0;
+	} else
+		return kvm_set_cr0(vcpu, val);
+}
+
+static int handle_set_cr4(struct kvm_vcpu *vcpu, unsigned long val)
+{
+	if (is_guest_mode(vcpu)) {
+		if (kvm_set_cr4(vcpu, (val & vcpu->arch.cr4_guest_owned_bits) |
+			 (vcpu->arch.cr4 & ~vcpu->arch.cr4_guest_owned_bits)))
+			return 1;
+		vmcs_writel(CR4_READ_SHADOW, val);
+		return 0;
+	} else
+		return kvm_set_cr4(vcpu, val);
+}
+
+/* called to set cr0 as appropriate for clts instruction exit. */
+static void handle_clts(struct kvm_vcpu *vcpu)
+{
+	if (is_guest_mode(vcpu)) {
+		/*
+		 * We get here when L2 did CLTS, and L1 didn't shadow CR0.TS
+		 * but we did (!fpu_active). We need to keep GUEST_CR0.TS on,
+		 * just pretend it's off (also in arch.cr0 for fpu_activate).
+		 */
+		vmcs_writel(CR0_READ_SHADOW,
+			vmcs_readl(CR0_READ_SHADOW) & ~X86_CR0_TS);
+		vcpu->arch.cr0 &= ~X86_CR0_TS;
+	} else
+		vmx_set_cr0(vcpu, kvm_read_cr0_bits(vcpu, ~X86_CR0_TS));
+}
+
 static int handle_cr(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification, val;
@@ -4169,7 +4221,7 @@ static int handle_cr(struct kvm_vcpu *vc
 		trace_kvm_cr_write(cr, val);
 		switch (cr) {
 		case 0:
-			err = kvm_set_cr0(vcpu, val);
+			err = handle_set_cr0(vcpu, val);
 			kvm_complete_insn_gp(vcpu, err);
 			return 1;
 		case 3:
@@ -4177,7 +4229,7 @@ static int handle_cr(struct kvm_vcpu *vc
 			kvm_complete_insn_gp(vcpu, err);
 			return 1;
 		case 4:
-			err = kvm_set_cr4(vcpu, val);
+			err = handle_set_cr4(vcpu, val);
 			kvm_complete_insn_gp(vcpu, err);
 			return 1;
 		case 8: {
@@ -4195,7 +4247,7 @@ static int handle_cr(struct kvm_vcpu *vc
 		};
 		break;
 	case 2: /* clts */
-		vmx_set_cr0(vcpu, kvm_read_cr0_bits(vcpu, ~X86_CR0_TS));
+		handle_clts(vcpu);
 		trace_kvm_cr_write(0, kvm_read_cr0(vcpu));
 		skip_emulated_instruction(vcpu);
 		vmx_fpu_activate(vcpu);

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [PATCH 27/31] nVMX: Further fixes for lazy FPU loading
  2011-05-16 19:43 [PATCH 0/31] nVMX: Nested VMX, v10 Nadav Har'El
                   ` (25 preceding siblings ...)
  2011-05-16 19:57 ` [PATCH 26/31] nVMX: Handling of CR0 and CR4 modifying instructions Nadav Har'El
@ 2011-05-16 19:57 ` Nadav Har'El
  2011-05-16 19:58 ` [PATCH 28/31] nVMX: Additional TSC-offset handling Nadav Har'El
                   ` (3 subsequent siblings)
  30 siblings, 0 replies; 118+ messages in thread
From: Nadav Har'El @ 2011-05-16 19:57 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

KVM's "Lazy FPU loading" means that sometimes L0 needs to set CR0.TS, even
if a guest didn't set it. Moreover, L0 must also trap CR0.TS changes and
NM exceptions, even if we have a guest hypervisor (L1) who didn't want these
traps. And of course, conversely: If L1 wanted to trap these events, we
must let it, even if L0 is not interested in them.
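
The merge itself is just a couple of bit operations; a small user-space sketch
(illustrative values only, not the kernel code):

#include <stdio.h>

#define NM_VECTOR  7
#define PF_VECTOR  14
#define X86_CR0_TS 0x8UL

int main(void)
{
	unsigned int  l0_eb           = 1u << NM_VECTOR;	/* L0 wants #NM for lazy FPU */
	unsigned int  vmcs12_eb       = 1u << PF_VECTOR;	/* say L1 asked for #PF      */
	unsigned long vmcs12_cr0_mask = 0;			/* L1 lets its guest own TS  */

	unsigned int  merged_eb       = l0_eb | vmcs12_eb;
	unsigned long cr0_guest_owned = X86_CR0_TS & ~vmcs12_cr0_mask;

	printf("EXCEPTION_BITMAP=0x%x cr0_guest_owned_bits=0x%lx\n",
	       merged_eb, cr0_guest_owned);
	return 0;
}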

This patch fixes some existing KVM code (in update_exception_bitmap(),
vmx_fpu_activate(), vmx_fpu_deactivate()) to do the correct merging of L0's
and L1's needs. Note that handle_cr() was already fixed in the above patch,
and that new code introduced in previous patches already handles CR0
correctly (see prepare_vmcs02(), prepare_vmcs12(), and nested_vmx_vmexit()).

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   31 ++++++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:50.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:50.000000000 +0300
@@ -1179,6 +1179,15 @@ static void update_exception_bitmap(stru
 		eb &= ~(1u << PF_VECTOR); /* bypass_guest_pf = 0 */
 	if (vcpu->fpu_active)
 		eb &= ~(1u << NM_VECTOR);
+
+	/* When we are running a nested L2 guest and L1 specified for it a
+	 * certain exception bitmap, we must trap the same exceptions and pass
+	 * them to L1. When running L2, we will only handle the exceptions
+	 * specified above if L1 did not want them.
+	 */
+	if (is_guest_mode(vcpu))
+		eb |= get_vmcs12(vcpu)->exception_bitmap;
+
 	vmcs_write32(EXCEPTION_BITMAP, eb);
 }
 
@@ -1471,6 +1480,9 @@ static void vmx_fpu_activate(struct kvm_
 	vmcs_writel(GUEST_CR0, cr0);
 	update_exception_bitmap(vcpu);
 	vcpu->arch.cr0_guest_owned_bits = X86_CR0_TS;
+	if (is_guest_mode(vcpu))
+		vcpu->arch.cr0_guest_owned_bits &=
+			~get_vmcs12(vcpu)->cr0_guest_host_mask;
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
 }
 
@@ -1494,12 +1506,29 @@ static inline unsigned long guest_readab
 
 static void vmx_fpu_deactivate(struct kvm_vcpu *vcpu)
 {
+	/* Note that there is no vcpu->fpu_active = 0 here. The caller must
+	 * set this *before* calling this function.
+	 */
 	vmx_decache_cr0_guest_bits(vcpu);
 	vmcs_set_bits(GUEST_CR0, X86_CR0_TS | X86_CR0_MP);
 	update_exception_bitmap(vcpu);
 	vcpu->arch.cr0_guest_owned_bits = 0;
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
-	vmcs_writel(CR0_READ_SHADOW, vcpu->arch.cr0);
+	if (is_guest_mode(vcpu)) {
+		/*
+		 * L1's specified read shadow might not contain the TS bit,
+		 * so now that we turned on shadowing of this bit, we need to
+		 * set this bit of the shadow. Like in nested_vmx_run we need
+		 * guest_readable_cr0(vmcs12), but vmcs12->guest_cr0 is not
+		 * yet up-to-date here because we just decached cr0.TS (and
+		 * we'll only update vmcs12->guest_cr0 on nested exit).
+		 */
+		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+		vmcs12->guest_cr0 = (vmcs12->guest_cr0 & ~X86_CR0_TS) |
+			(vcpu->arch.cr0 & X86_CR0_TS);
+		vmcs_writel(CR0_READ_SHADOW, guest_readable_cr0(vmcs12));
+	} else
+		vmcs_writel(CR0_READ_SHADOW, vcpu->arch.cr0);
 }
 
 static unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu)

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [PATCH 28/31] nVMX: Additional TSC-offset handling
  2011-05-16 19:43 [PATCH 0/31] nVMX: Nested VMX, v10 Nadav Har'El
                   ` (26 preceding siblings ...)
  2011-05-16 19:57 ` [PATCH 27/31] nVMX: Further fixes for lazy FPU loading Nadav Har'El
@ 2011-05-16 19:58 ` Nadav Har'El
  2011-05-16 19:58 ` [PATCH 29/31] nVMX: Add VMX to list of supported cpuid features Nadav Har'El
                   ` (2 subsequent siblings)
  30 siblings, 0 replies; 118+ messages in thread
From: Nadav Har'El @ 2011-05-16 19:58 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

In the unlikely case that L1 does not trap MSR_IA32_TSC, L0 needs to
emulate this MSR write by L2 by modifying vmcs02.tsc_offset. We also need to
set vmcs12.tsc_offset, for this change to survive the next nested entry (see
prepare_vmcs02()).
Additionally, we need to modify vmx_adjust_tsc_offset: the semantics of this
function are that the TSC of all guests on this vcpu, L1 and possibly several
L2s, needs to be adjusted. To do this, we need to adjust vmcs01's
tsc_offset (this offset will also apply to each L2s we enter). We can't set
vmcs01 now, so we have to remember this adjustment and apply it when we
later exit to L1.
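
The bookkeeping amounts to simple offset arithmetic; a small user-space sketch
(illustrative numbers only, not the kernel code):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	int64_t vmcs01_tsc_offset = 1000;	/* offset L0 gives L1           */
	int64_t vmcs02_tsc_offset = 1500;	/* offset currently in hardware */
	int64_t vmcs12_tsc_offset;

	/* L2 writes MSR_IA32_TSC and L1 chose not to trap it: */
	vmcs02_tsc_offset = 2500;
	vmcs12_tsc_offset = vmcs02_tsc_offset - vmcs01_tsc_offset;

	/* L0 adjusts the TSC for all guests on this vcpu while L2 runs: */
	int64_t adjustment = 100;
	vmcs02_tsc_offset += adjustment;
	vmcs01_tsc_offset += adjustment;	/* remembered for the next exit to L1 */

	printf("vmcs01=%lld vmcs02=%lld vmcs12=%lld\n",
	       (long long)vmcs01_tsc_offset,
	       (long long)vmcs02_tsc_offset,
	       (long long)vmcs12_tsc_offset);
	return 0;
}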

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:50.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:50.000000000 +0300
@@ -1764,12 +1764,24 @@ static void vmx_set_tsc_khz(struct kvm_v
 static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 {
 	vmcs_write64(TSC_OFFSET, offset);
+	if (is_guest_mode(vcpu))
+		/*
+		 * We're here if L1 chose not to trap the TSC MSR. Since
+		 * prepare_vmcs12() does not copy tsc_offset, we need to also
+		 * set the vmcs12 field here.
+		 */
+		get_vmcs12(vcpu)->tsc_offset = offset -
+			to_vmx(vcpu)->nested.vmcs01_tsc_offset;
 }
 
 static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
 {
 	u64 offset = vmcs_read64(TSC_OFFSET);
 	vmcs_write64(TSC_OFFSET, offset + adjustment);
+	if (is_guest_mode(vcpu)) {
+		/* Even when running L2, the adjustment needs to apply to L1 */
+		to_vmx(vcpu)->nested.vmcs01_tsc_offset += adjustment;
+	}
 }
 
 static u64 vmx_compute_tsc_offset(struct kvm_vcpu *vcpu, u64 target_tsc)


* [PATCH 29/31] nVMX: Add VMX to list of supported cpuid features
  2011-05-16 19:43 [PATCH 0/31] nVMX: Nested VMX, v10 Nadav Har'El
                   ` (27 preceding siblings ...)
  2011-05-16 19:58 ` [PATCH 28/31] nVMX: Additional TSC-offset handling Nadav Har'El
@ 2011-05-16 19:58 ` Nadav Har'El
  2011-05-16 19:59 ` [PATCH 30/31] nVMX: Miscellaneous small corrections Nadav Har'El
  2011-05-16 19:59 ` [PATCH 31/31] nVMX: Documentation Nadav Har'El
  30 siblings, 0 replies; 118+ messages in thread
From: Nadav Har'El @ 2011-05-16 19:58 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

If the "nested" module option is enabled, add the "VMX" CPU feature to the
list of CPU features KVM advertises with the KVM_GET_SUPPORTED_CPUID ioctl.

Qemu uses this ioctl, and intersects KVM's list with its own list of desired
cpu features (depending on the -cpu option given to qemu) to determine the
final list of features presented to the guest.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |    2 ++
 1 file changed, 2 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:50.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:50.000000000 +0300
@@ -6325,6 +6325,8 @@ static void vmx_cpuid_update(struct kvm_
 
 static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
 {
+	if (func == 1 && nested)
+		entry->ecx |= bit(X86_FEATURE_VMX);
 }
 
 /*


* [PATCH 30/31] nVMX: Miscellaneous small corrections
  2011-05-16 19:43 [PATCH 0/31] nVMX: Nested VMX, v10 Nadav Har'El
                   ` (28 preceding siblings ...)
  2011-05-16 19:58 ` [PATCH 29/31] nVMX: Add VMX to list of supported cpuid features Nadav Har'El
@ 2011-05-16 19:59 ` Nadav Har'El
  2011-05-16 19:59 ` [PATCH 31/31] nVMX: Documentation Nadav Har'El
  30 siblings, 0 replies; 118+ messages in thread
From: Nadav Har'El @ 2011-05-16 19:59 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

Small corrections of KVM (spelling, etc.) not directly related to nested VMX.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:50.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:50.000000000 +0300
@@ -958,7 +958,7 @@ static void vmcs_load(struct vmcs *vmcs)
 			: "=qm"(error) : "a"(&phys_addr), "m"(phys_addr)
 			: "cc", "memory");
 	if (error)
-		printk(KERN_ERR "kvm: vmptrld %p/%llx fail\n",
+		printk(KERN_ERR "kvm: vmptrld %p/%llx failed\n",
 		       vmcs, phys_addr);
 }
 


* [PATCH 31/31] nVMX: Documentation
  2011-05-16 19:43 [PATCH 0/31] nVMX: Nested VMX, v10 Nadav Har'El
                   ` (29 preceding siblings ...)
  2011-05-16 19:59 ` [PATCH 30/31] nVMX: Miscellaneous small corrections Nadav Har'El
@ 2011-05-16 19:59 ` Nadav Har'El
  2011-05-25 10:33   ` Tian, Kevin
  30 siblings, 1 reply; 118+ messages in thread
From: Nadav Har'El @ 2011-05-16 19:59 UTC (permalink / raw)
  To: kvm; +Cc: gleb, avi

This patch includes a brief introduction to the nested vmx feature in the
Documentation/kvm directory. The document also includes a copy of the
vmcs12 structure, as requested by Avi Kivity.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 Documentation/kvm/nested-vmx.txt |  243 +++++++++++++++++++++++++++++
 1 file changed, 243 insertions(+)

--- .before/Documentation/kvm/nested-vmx.txt	2011-05-16 22:36:51.000000000 +0300
+++ .after/Documentation/kvm/nested-vmx.txt	2011-05-16 22:36:51.000000000 +0300
@@ -0,0 +1,243 @@
+Nested VMX
+==========
+
+Overview
+---------
+
+On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
+to easily and efficiently run guest operating systems. Normally, these guests
+*cannot* themselves be hypervisors running their own guests, because in VMX,
+guests cannot use VMX instructions.
+
+The "Nested VMX" feature adds this missing capability - of running guest
+hypervisors (which use VMX) with their own nested guests. It does so by
+allowing a guest to use VMX instructions, and correctly and efficiently
+emulating them using the single level of VMX available in the hardware.
+
+We describe in much greater detail the theory behind the nested VMX feature,
+its implementation and its performance characteristics, in the OSDI 2010 paper
+"The Turtles Project: Design and Implementation of Nested Virtualization",
+available at:
+
+	http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf
+
+
+Terminology
+-----------
+
+Single-level virtualization has two levels - the host (KVM) and the guests.
+In nested virtualization, we have three levels: The host (KVM), which we call
+L0, the guest hypervisor, which we call L1, and its nested guest, which we
+call L2.
+
+
+Known limitations
+-----------------
+
+The current code supports running Linux guests under KVM guest hypervisors.
+Only 64-bit guest hypervisors are supported.
+
+Additional patches for running Windows under guest KVM, and Linux under
+guest VMware server, and support for nested EPT, are currently running in
+the lab, and will be sent as follow-on patchsets.
+
+
+Running nested VMX
+------------------
+
+The nested VMX feature is disabled by default. It can be enabled by giving
+the "nested=1" option to the kvm-intel module.
+
+No modifications are required to user space (qemu). However, qemu's default
+emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
+explicitly enabled, by giving qemu one of the following options:
+
+     -cpu host              (emulated CPU has all features of the real CPU)
+
+     -cpu qemu64,+vmx       (add just the vmx feature to a named CPU type)
+
+
+ABIs
+----
+
+Nested VMX aims to present a standard and (eventually) fully-functional VMX
+implementation for a guest hypervisor to use. As such, the official
+specification of the ABI that it provides is Intel's VMX specification,
+namely volume 3B of their "Intel 64 and IA-32 Architectures Software
+Developer's Manual". Not all of VMX's features are currently fully supported,
+but the goal is to eventually support them all, starting with the VMX features
+which are used in practice by popular hypervisors (KVM and others).
+
+As a VMX implementation, nested VMX presents a VMCS structure to L1.
+As mandated by the spec, other than the two fields revision_id and abort,
+this structure is *opaque* to its user, who is not supposed to know or care
+about its internal structure. Rather, the structure is accessed through the
+VMREAD and VMWRITE instructions.
+Still, for debugging purposes, KVM developers might be interested in knowing
+the internals of this structure; it is struct vmcs12 from arch/x86/kvm/vmx.c.
+For convenience, we repeat its content here. If the internals of this structure
+change, live migration across KVM versions can break. VMCS12_REVISION
+(from vmx.c) should be changed if struct vmcs12 or its inner struct shadow_vmcs
+is ever changed.
+
+	typedef u64 natural_width;
+	struct __packed vmcs12 {
+		/* According to the Intel spec, a VMCS region must start with
+		 * these two user-visible fields */
+		u32 revision_id;
+		u32 abort;
+
+		u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
+		u32 padding[7]; /* room for future expansion */
+
+		u64 io_bitmap_a;
+		u64 io_bitmap_b;
+		u64 msr_bitmap;
+		u64 vm_exit_msr_store_addr;
+		u64 vm_exit_msr_load_addr;
+		u64 vm_entry_msr_load_addr;
+		u64 tsc_offset;
+		u64 virtual_apic_page_addr;
+		u64 apic_access_addr;
+		u64 ept_pointer;
+		u64 guest_physical_address;
+		u64 vmcs_link_pointer;
+		u64 guest_ia32_debugctl;
+		u64 guest_ia32_pat;
+		u64 guest_ia32_efer;
+		u64 guest_pdptr0;
+		u64 guest_pdptr1;
+		u64 guest_pdptr2;
+		u64 guest_pdptr3;
+		u64 host_ia32_pat;
+		u64 host_ia32_efer;
+		u64 padding64[8]; /* room for future expansion */
+		natural_width cr0_guest_host_mask;
+		natural_width cr4_guest_host_mask;
+		natural_width cr0_read_shadow;
+		natural_width cr4_read_shadow;
+		natural_width cr3_target_value0;
+		natural_width cr3_target_value1;
+		natural_width cr3_target_value2;
+		natural_width cr3_target_value3;
+		natural_width exit_qualification;
+		natural_width guest_linear_address;
+		natural_width guest_cr0;
+		natural_width guest_cr3;
+		natural_width guest_cr4;
+		natural_width guest_es_base;
+		natural_width guest_cs_base;
+		natural_width guest_ss_base;
+		natural_width guest_ds_base;
+		natural_width guest_fs_base;
+		natural_width guest_gs_base;
+		natural_width guest_ldtr_base;
+		natural_width guest_tr_base;
+		natural_width guest_gdtr_base;
+		natural_width guest_idtr_base;
+		natural_width guest_dr7;
+		natural_width guest_rsp;
+		natural_width guest_rip;
+		natural_width guest_rflags;
+		natural_width guest_pending_dbg_exceptions;
+		natural_width guest_sysenter_esp;
+		natural_width guest_sysenter_eip;
+		natural_width host_cr0;
+		natural_width host_cr3;
+		natural_width host_cr4;
+		natural_width host_fs_base;
+		natural_width host_gs_base;
+		natural_width host_tr_base;
+		natural_width host_gdtr_base;
+		natural_width host_idtr_base;
+		natural_width host_ia32_sysenter_esp;
+		natural_width host_ia32_sysenter_eip;
+		natural_width host_rsp;
+		natural_width host_rip;
+		natural_width paddingl[8]; /* room for future expansion */
+		u32 pin_based_vm_exec_control;
+		u32 cpu_based_vm_exec_control;
+		u32 exception_bitmap;
+		u32 page_fault_error_code_mask;
+		u32 page_fault_error_code_match;
+		u32 cr3_target_count;
+		u32 vm_exit_controls;
+		u32 vm_exit_msr_store_count;
+		u32 vm_exit_msr_load_count;
+		u32 vm_entry_controls;
+		u32 vm_entry_msr_load_count;
+		u32 vm_entry_intr_info_field;
+		u32 vm_entry_exception_error_code;
+		u32 vm_entry_instruction_len;
+		u32 tpr_threshold;
+		u32 secondary_vm_exec_control;
+		u32 vm_instruction_error;
+		u32 vm_exit_reason;
+		u32 vm_exit_intr_info;
+		u32 vm_exit_intr_error_code;
+		u32 idt_vectoring_info_field;
+		u32 idt_vectoring_error_code;
+		u32 vm_exit_instruction_len;
+		u32 vmx_instruction_info;
+		u32 guest_es_limit;
+		u32 guest_cs_limit;
+		u32 guest_ss_limit;
+		u32 guest_ds_limit;
+		u32 guest_fs_limit;
+		u32 guest_gs_limit;
+		u32 guest_ldtr_limit;
+		u32 guest_tr_limit;
+		u32 guest_gdtr_limit;
+		u32 guest_idtr_limit;
+		u32 guest_es_ar_bytes;
+		u32 guest_cs_ar_bytes;
+		u32 guest_ss_ar_bytes;
+		u32 guest_ds_ar_bytes;
+		u32 guest_fs_ar_bytes;
+		u32 guest_gs_ar_bytes;
+		u32 guest_ldtr_ar_bytes;
+		u32 guest_tr_ar_bytes;
+		u32 guest_interruptibility_info;
+		u32 guest_activity_state;
+		u32 guest_sysenter_cs;
+		u32 host_ia32_sysenter_cs;
+		u32 padding32[8]; /* room for future expansion */
+		u16 virtual_processor_id;
+		u16 guest_es_selector;
+		u16 guest_cs_selector;
+		u16 guest_ss_selector;
+		u16 guest_ds_selector;
+		u16 guest_fs_selector;
+		u16 guest_gs_selector;
+		u16 guest_ldtr_selector;
+		u16 guest_tr_selector;
+		u16 host_es_selector;
+		u16 host_cs_selector;
+		u16 host_ss_selector;
+		u16 host_ds_selector;
+		u16 host_fs_selector;
+		u16 host_gs_selector;
+		u16 host_tr_selector;
+	};
+
+
+Authors
+-------
+
+These patches were written by:
+     Abel Gordon, abelg <at> il.ibm.com
+     Nadav Har'El, nyh <at> il.ibm.com
+     Orit Wasserman, oritw <at> il.ibm.com
+     Ben-Ami Yassor, benami <at> il.ibm.com
+     Muli Ben-Yehuda, muli <at> il.ibm.com
+
+With contributions by:
+     Anthony Liguori, aliguori <at> us.ibm.com
+     Mike Day, mdday <at> us.ibm.com
+     Michael Factor, factor <at> il.ibm.com
+     Zvi Dubitzky, dubi <at> il.ibm.com
+
+And valuable reviews by:
+     Avi Kivity, avi <at> redhat.com
+     Gleb Natapov, gleb <at> redhat.com
+     and others.


* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-16 19:48 ` [PATCH 08/31] nVMX: Fix local_vcpus_link handling Nadav Har'El
@ 2011-05-17 13:19   ` Marcelo Tosatti
  2011-05-17 13:35     ` Avi Kivity
  0 siblings, 1 reply; 118+ messages in thread
From: Marcelo Tosatti @ 2011-05-17 13:19 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb, avi

On Mon, May 16, 2011 at 10:48:01PM +0300, Nadav Har'El wrote:
> In VMX, before we bring down a CPU we must VMCLEAR all VMCSs loaded on it
> because (at least in theory) the processor might not have written all of its
> content back to memory. Since a patch from June 26, 2008, this is done using
> a per-cpu "vcpus_on_cpu" linked list of vcpus loaded on each CPU.
> 
> The problem is that with nested VMX, we no longer have the concept of a
> vcpu being loaded on a cpu: A vcpu has multiple VMCSs (one for L1, a pool for
> L2s), and each of those may have been last loaded on a different cpu.
> 
> Our solution is to hold, in addition to vcpus_on_cpu, a second linked list
> saved_vmcss_on_cpu, which holds the current list of "saved" VMCSs, i.e.,
> VMCSs which are loaded on this CPU but are not the vmx->vmcs of any of
> the vcpus. These saved VMCSs include L1's VMCS while L2 is running
> (saved_vmcs01), and L2 VMCSs not currently used - because L1 is running or
> because the vmcs02_pool contains more than one entry.
> 
> When we switch between L1's and L2's VMCSs, they need to be moved
> between the vcpus_on_cpu and saved_vmcss_on_cpu lists and vice versa. A new
> function, nested_maintain_per_cpu_lists(), takes care of that.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/kvm/vmx.c |   67 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 67 insertions(+)
> 
> --- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:47.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:47.000000000 +0300
> @@ -181,6 +181,7 @@ struct saved_vmcs {
>  	struct vmcs *vmcs;
>  	int cpu;
>  	int launched;
> +	struct list_head local_saved_vmcss_link; /* see saved_vmcss_on_cpu */
>  };
>  
>  /* Used to remember the last vmcs02 used for some recently used vmcs12s */
> @@ -315,7 +316,20 @@ static int vmx_set_tss_addr(struct kvm *
>  
>  static DEFINE_PER_CPU(struct vmcs *, vmxarea);
>  static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
> +/*
> + * We maintain a per-CPU linked-list vcpus_on_cpu, holding for each CPU a list
> + * of vcpus whose VMCS are loaded on that CPU. This is needed when a CPU is
> + * brought down, and we need to VMCLEAR all VMCSs loaded on it.
> + *
> + * With nested VMX, we have additional VMCSs which are not the current
> + * vmx->vmcs of any vcpu, but may also be loaded on some CPU: While L2 is
> + * running, L1's VMCS is loaded but not the VMCS of any vcpu; While L1 is
> + * running, a previously used L2 VMCS might still be around and loaded on some
> + * CPU; sometimes even more than one such L2 VMCS is kept (see VMCS02_POOL_SIZE).
> + * The list of these additional VMCSs is kept on cpu saved_vmcss_on_cpu.
> + */
>  static DEFINE_PER_CPU(struct list_head, vcpus_on_cpu);
> +static DEFINE_PER_CPU(struct list_head, saved_vmcss_on_cpu);
>  static DEFINE_PER_CPU(struct desc_ptr, host_gdt);
>  
>  static unsigned long *vmx_io_bitmap_a;
> @@ -1818,6 +1832,7 @@ static int hardware_enable(void *garbage
>  		return -EBUSY;
>  
>  	INIT_LIST_HEAD(&per_cpu(vcpus_on_cpu, cpu));
> +	INIT_LIST_HEAD(&per_cpu(saved_vmcss_on_cpu, cpu));
>  	rdmsrl(MSR_IA32_FEATURE_CONTROL, old);
>  
>  	test_bits = FEATURE_CONTROL_LOCKED;
> @@ -1860,10 +1875,13 @@ static void kvm_cpu_vmxoff(void)
>  	asm volatile (__ex(ASM_VMX_VMXOFF) : : : "cc");
>  }
>  
> +static void vmclear_local_saved_vmcss(void);
> +
>  static void hardware_disable(void *garbage)
>  {
>  	if (vmm_exclusive) {
>  		vmclear_local_vcpus();
> +		vmclear_local_saved_vmcss();
>  		kvm_cpu_vmxoff();
>  	}
>  	write_cr4(read_cr4() & ~X86_CR4_VMXE);
> @@ -4248,6 +4266,8 @@ static void __nested_free_saved_vmcs(voi
>  	vmcs_clear(saved_vmcs->vmcs);
>  	if (per_cpu(current_vmcs, saved_vmcs->cpu) == saved_vmcs->vmcs)
>  		per_cpu(current_vmcs, saved_vmcs->cpu) = NULL;
> +	list_del(&saved_vmcs->local_saved_vmcss_link);
> +	saved_vmcs->cpu = -1;
>  }
>  
>  /*
> @@ -4265,6 +4285,21 @@ static void nested_free_saved_vmcs(struc
>  	free_vmcs(saved_vmcs->vmcs);
>  }
>  
> +/*
> + * VMCLEAR all the currently unused (not vmx->vmcs on any vcpu) saved_vmcss
> + * which were loaded on the current CPU. See also vmclear_local_vcpus(), which
> + * does the same for VMCS currently used in vcpus.
> + */
> +static void vmclear_local_saved_vmcss(void)
> +{
> +	int cpu = raw_smp_processor_id();
> +	struct saved_vmcs *v, *n;
> +
> +	list_for_each_entry_safe(v, n, &per_cpu(saved_vmcss_on_cpu, cpu),
> +				 local_saved_vmcss_link)
> +		__nested_free_saved_vmcs(v);
> +}
> +
>  /* Free and remove from pool a vmcs02 saved for a vmcs12 (if there is one) */
>  static void nested_free_vmcs02(struct vcpu_vmx *vmx, gpa_t vmptr)
>  {
> @@ -5143,6 +5178,38 @@ static void vmx_set_supported_cpuid(u32 
>  {
>  }
>  
> +/*
> + * Maintain the vcpus_on_cpu and saved_vmcss_on_cpu lists of vcpus and
> + * inactive saved_vmcss on nested entry (L1->L2) or nested exit (L2->L1).
> + *
> + * nested_maintain_per_cpu_lists should be called after the VMCS was switched
> + * to the new one, with parameters giving both the new one (after the entry
> + * or exit) and the old one, in that order.
> + */
> +static void nested_maintain_per_cpu_lists(struct vcpu_vmx *vmx,
> +		struct saved_vmcs *new_vmcs,
> +		struct saved_vmcs *old_vmcs)
> +{
> +	/*
> +	 * When a vcpu's old vmcs is saved, we need to drop it from
> +	 * vcpus_on_cpu and put it on saved_vmcss_on_cpu.
> +	 */
> +	if (old_vmcs->cpu != -1) {
> +		list_del(&vmx->local_vcpus_link);
> +		list_add(&old_vmcs->local_saved_vmcss_link,
> +			 &per_cpu(saved_vmcss_on_cpu, old_vmcs->cpu));
> +	}

This new handling of vmcs could be simplified (local_vcpus_link must be
manipulated with interrupts disabled, BTW).

What about having a per-CPU VMCS list instead of per-CPU vcpu list?
"local_vmcs_link" list node could be in "struct saved_vmcs" (and 
a current_saved_vmcs pointer in "struct vcpu_vmx").

vmx_vcpu_load would then add to this list at

        if (per_cpu(current_vmcs, cpu) != vmx->vmcs) {
                per_cpu(current_vmcs, cpu) = vmx->vmcs;
                vmcs_load(vmx->vmcs);
        }


* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-17 13:19   ` Marcelo Tosatti
@ 2011-05-17 13:35     ` Avi Kivity
  2011-05-17 14:35       ` Nadav Har'El
  0 siblings, 1 reply; 118+ messages in thread
From: Avi Kivity @ 2011-05-17 13:35 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Nadav Har'El, kvm, gleb

On 05/17/2011 04:19 PM, Marcelo Tosatti wrote:
> >  +/*
> >  + * Maintain the vcpus_on_cpu and saved_vmcss_on_cpu lists of vcpus and
> >  + * inactive saved_vmcss on nested entry (L1->L2) or nested exit (L2->L1).
> >  + *
> >  + * nested_maintain_per_cpu_lists should be called after the VMCS was switched
> >  + * to the new one, with parameters giving both the new on (after the entry
> >  + * or exit) and the old one, in that order.
> >  + */
> >  +static void nested_maintain_per_cpu_lists(struct vcpu_vmx *vmx,
> >  +		struct saved_vmcs *new_vmcs,
> >  +		struct saved_vmcs *old_vmcs)
> >  +{
> >  +	/*
> >  +	 * When a vcpus's old vmcs is saved, we need to drop it from
> >  +	 * vcpus_on_cpu and put it on saved_vmcss_on_cpu.
> >  +	 */
> >  +	if (old_vmcs->cpu != -1) {
> >  +		list_del(&vmx->local_vcpus_link);
> >  +		list_add(&old_vmcs->local_saved_vmcss_link,
> >  +			&per_cpu(saved_vmcss_on_cpu, old_vmcs->cpu));
> >  +	}
>
> This new handling of vmcs could be simplified (local_vcpus_link must be
> manipulated with interrupts disabled, BTW).
>
> What about having a per-CPU VMCS list instead of per-CPU vcpu list?
> "local_vmcs_link" list node could be in "struct saved_vmcs" (and
> a current_saved_vmcs pointer in "struct vcpu_vmx").
>
> vmx_vcpu_load would then add to this list at
>
>          if (per_cpu(current_vmcs, cpu) != vmx->vmcs) {
>                  per_cpu(current_vmcs, cpu) = vmx->vmcs;
>                  vmcs_load(vmx->vmcs);
>          }

Right, that's the easiest thing to do.

Perhaps even easier (avoids duplication):

struct raw_vmcs {
     u32 revision_id;
     u32 abort;
     char data[0];
};

struct vmcs {
     struct raw_vmcs *raw_vmcs;
     struct list_head local_vmcs_link;
};

struct vcpu_vmx {
     ...
     struct vmcs *vmcs;  /* often points at l1_vmcs */
     struct vmcs l1_vmcs;
     ...
};

static DEFINE_PER_CPU(struct list_head, vmcss_on_cpu);
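
A rough sketch of the accompanying list maintenance, to make the idea concrete
(here vmcs_load_raw()/vmcs_clear_raw() stand for today's vmcs_load()/vmcs_clear(),
retargeted at the raw region; all names are illustrative only):

/* load a vmcs on this cpu and track it on the per-cpu list; assumes it is
 * not currently on any list (it is new, or was just VMCLEARed) */
static void vmcs_load_tracked(struct vmcs *vmcs, int cpu)
{
	vmcs_load_raw(vmcs->raw_vmcs);
	list_add(&vmcs->local_vmcs_link, &per_cpu(vmcss_on_cpu, cpu));
}

/* hardware_disable() path: VMCLEAR everything still loaded on this cpu */
static void vmclear_local_vmcss(void)
{
	int cpu = raw_smp_processor_id();
	struct vmcs *v, *n;

	list_for_each_entry_safe(v, n, &per_cpu(vmcss_on_cpu, cpu),
				 local_vmcs_link) {
		vmcs_clear_raw(v->raw_vmcs);
		list_del(&v->local_vmcs_link);
	}
}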

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-17 13:35     ` Avi Kivity
@ 2011-05-17 14:35       ` Nadav Har'El
  2011-05-17 14:42         ` Marcelo Tosatti
  2011-05-17 15:11         ` Avi Kivity
  0 siblings, 2 replies; 118+ messages in thread
From: Nadav Har'El @ 2011-05-17 14:35 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, kvm, gleb

On Tue, May 17, 2011, Avi Kivity wrote about "Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling":
> > (local_vcpus_link must be manipulated with interrupts disabled, BTW).

Thanks, I'll look into that.

> >What about having a per-CPU VMCS list instead of per-CPU vcpu list?
> Perhaps even easier (avoids duplication):
> 
> struct raw_vmcs {
>     u32 revision_id;
>     u32 abort;
>     char data[0];
> };
> 
> struct vmcs {
>     struct raw_vmcs *raw_vmcs;
>     struct list_head local_vmcs_link;
> };
> 
> struct vcpu_vmx {
>     ...
>     struct vmcs *vmcs;  /* often points at l1_vmcs */
>     struct vmcs l1_vmcs;
>     ...
> };
> 
> static DEFINE_PER_CPU(struct list_head, vmcss_on_cpu);

This is an interesting suggestion. My initial plan was to do something similar
to this, and I agree it could have been nicer code, but I had to change it
after bumping into too many obstacles.

For example, currently, vmclear_local_vcpus() not only VMCLEARs the VMCSs,
it also sets vmx->vcpu.cpu = -1 and vmx->launched = 0 for the vcpus holding these
VMCSs.  If we had only a list of VMCSs, how can we mark the vcpus as being not
currently loaded (cpu=-1)?


-- 
Nadav Har'El                        |      Tuesday, May 17 2011, 13 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |I'm a peripheral visionary: I see into
http://nadav.harel.org.il           |the future, but mostly off to the sides.


* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-17 14:35       ` Nadav Har'El
@ 2011-05-17 14:42         ` Marcelo Tosatti
  2011-05-17 17:57           ` Nadav Har'El
  2011-05-17 15:11         ` Avi Kivity
  1 sibling, 1 reply; 118+ messages in thread
From: Marcelo Tosatti @ 2011-05-17 14:42 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Avi Kivity, kvm, gleb

On Tue, May 17, 2011 at 05:35:32PM +0300, Nadav Har'El wrote:
> On Tue, May 17, 2011, Avi Kivity wrote about "Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling":
> > static DEFINE_PER_CPU(struct list_head, vmcss_on_cpu);
> 
> This is an interesting suggestion. My initial plan was to do something similar
> to this, and I agree it could have been nicer code, but I had to change it
> after bumping into too many obstacles.
> 
> For example, currently, vmclear_local_vcpus() not only VMCLEARs the vmcss,
> it also sets vmx->vcpu.cpu = -1, vmx->launched=0 for the vcpus holding these
> VMCSs.  If we had only a list of VMCSs, how can we mark the vcpus as being not
> currently loaded (cpu=-1)?

Do it in vcpu_clear, its just an optimization not necessary in
vmclear_local_vcpus path.




* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-17 14:35       ` Nadav Har'El
  2011-05-17 14:42         ` Marcelo Tosatti
@ 2011-05-17 15:11         ` Avi Kivity
  2011-05-17 18:11           ` Nadav Har'El
  1 sibling, 1 reply; 118+ messages in thread
From: Avi Kivity @ 2011-05-17 15:11 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Marcelo Tosatti, kvm, gleb

On 05/17/2011 05:35 PM, Nadav Har'El wrote:
>
>>> What about having a per-CPU VMCS list instead of per-CPU vcpu list?
>> Perhaps even easier (avoids duplication):
>>
>> struct raw_vmcs {
>>      u32 revision_id;
>>      u32 abort;
>>      char data[0];
>> };
>>
>> struct vmcs {
>>      struct raw_vmcs *raw_vmcs;
>>      struct list_head local_vmcs_link;
>> };
>>
>> struct vcpu_vmx {
>>      ...
>>      struct vmcs *vmcs;  /* often points at l1_vmcs */
>>      struct vmcs l1_vmcs;
>>      ...
>> };
>>
>> static DEFINE_PER_CPU(struct list_head, vmcss_on_cpu);
> This is an interesting suggestion. My initial plan was to do something similar
> to this, and I agree it could have been nicer code, but I had to change it
> after bumping into too many obstacles.
>
> For example, currently, vmclear_local_vcpus() not only VMCLEARs the vmcss,
> it also sets vmx->vcpu.cpu = -1, vmx->launched=0 for the vcpus holding these
> VMCSs.  If we had only a list of VMCSs, how can we mark the vcpus as being not
> currently loaded (cpu=-1)?
>

->launched and ->cpu simply move into struct vmcs.



* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-17 14:42         ` Marcelo Tosatti
@ 2011-05-17 17:57           ` Nadav Har'El
  0 siblings, 0 replies; 118+ messages in thread
From: Nadav Har'El @ 2011-05-17 17:57 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, kvm, gleb

On Tue, May 17, 2011, Marcelo Tosatti wrote about "Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling":
> > For example, currently, vmclear_local_vcpus() not only VMCLEARs the vmcss,
> > it also sets vmx->vcpu.cpu = -1, vmx->launched=0 for the vcpus holding these
> > VMCSs.  If we had only a list of VMCSs, how can we mark the vcpus as being not
> > currently loaded (cpu=-1)?
> 
> Do it in vcpu_clear, its just an optimization not necessary in
> vmclear_local_vcpus path.

Well, what if (say) we're running L2, and L1's vmcs is saved in saved_vmcs01
and is not the current vmcs of the vcpu, and then we shut down the CPU on
which this saved_vmcs01 was loaded. We not only need to VMCLEAR this vmcs;
we also need to remember that it is no longer loaded, so that when we take a
nested vmexit back to L1, we know we need to load the vmcs again.

There's a solution to this (which Avi also mentioned in his email) - it is to
use everywhere my "saved_vmcs" type (which I'd rename "loaded_vmcs"), which
includes the vmcs *and* the cpu (and possibly "launched").
If the "cpu" field were part of vmx, this would be easy - but "cpu" is a field
of vcpu, not vmx, so I have problems encapsulating both "vmcs" and "cpu" in
one structure everywhere.
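
(Concretely, a rough sketch of the kind of structure I have in mind - the list
link from patch 08 would presumably move into it as well:

	struct loaded_vmcs {
		struct vmcs *vmcs;
		int cpu;
		int launched;
		struct list_head loaded_vmcss_on_cpu_link;
	};

with vcpu_vmx keeping a pointer to the loaded_vmcs currently in use, instead
of a bare struct vmcs *.)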

These are the kinds of problems I wrestled with, until I gave up and
came up with the current solution...

-- 
Nadav Har'El                        |      Tuesday, May 17 2011, 14 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Bigamy: Having one wife too many.
http://nadav.harel.org.il           |Monogamy: The same thing!


* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-17 15:11         ` Avi Kivity
@ 2011-05-17 18:11           ` Nadav Har'El
  2011-05-17 18:43             ` Marcelo Tosatti
  0 siblings, 1 reply; 118+ messages in thread
From: Nadav Har'El @ 2011-05-17 18:11 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, kvm, gleb

On Tue, May 17, 2011, Avi Kivity wrote about "Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling":
> >VMCSs.  If we had only a list of VMCSs, how can we mark the vcpus as being 
> >not
> >currently loaded (cpu=-1)?
> >
> 
> ->launched and ->cpu simply move into struct vmcs.

As I explained in the sister thread (this discussion is becoming a tree ;-))
this is what I planned to do, until it dawned on me that I can't, because "cpu"
isn't part of vmx (where the vmcs and launched sit in the standard KVM), but
rather part of vcpu... When I gave up trying to "solve" these interdependencies
while avoiding modifying half of KVM, I came up with the current solution.

Maybe I'm missing something - I'd be happy if we do find a solution that
simplifies this code.


-- 
Nadav Har'El                        |      Tuesday, May 17 2011, 14 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Why do we drive on a parkway and park on
http://nadav.harel.org.il           |a driveway?


* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-17 18:11           ` Nadav Har'El
@ 2011-05-17 18:43             ` Marcelo Tosatti
  2011-05-17 19:30               ` Nadav Har'El
  0 siblings, 1 reply; 118+ messages in thread
From: Marcelo Tosatti @ 2011-05-17 18:43 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Avi Kivity, kvm, gleb

On Tue, May 17, 2011 at 09:11:32PM +0300, Nadav Har'El wrote:
> On Tue, May 17, 2011, Avi Kivity wrote about "Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling":
> > >VMCSs.  If we had only a list of VMCSs, how can we mark the vcpus as being 
> > >not
> > >currently loaded (cpu=-1)?
> > >
> > 
> > ->launched and ->cpu simply move into struct vmcs.
> 
> As I explained in the sister thread (this discussion is becoming a tree ;-))
> this is what I planned to do, until it dawned on me that I can't, because "cpu"
> isn't part of vmx (where the vmcs and launched sit in the standard KVM), but
> rather part of vcpu... When I gave up trying to "solve" these interdependencies
> and avoiding modifying half of KVM, I came up with the current solution.
> 
> Maybe I'm missing something - I'd be happy if we do find a solution that
> simplifies this code.

vcpu->cpu remains there. There is a new ->cpu field on struct vmcs, just
as saved_vmcs has in the current patches, to note the cpu on which the VMCS
was last loaded.

As mentioned, there is no need to set "vcpu->cpu = -1" in __vcpu_clear
(the IPI handler); that can be done in vcpu_clear.



* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-17 18:43             ` Marcelo Tosatti
@ 2011-05-17 19:30               ` Nadav Har'El
  2011-05-17 19:52                 ` Marcelo Tosatti
  0 siblings, 1 reply; 118+ messages in thread
From: Nadav Har'El @ 2011-05-17 19:30 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, kvm, gleb

On Tue, May 17, 2011, Marcelo Tosatti wrote about "Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling":
> > this is what I planned to do, until it dawned on me that I can't, because "cpu"
> > isn't part of vmx (where the vmcs and launched sit in the standard KVM), but
>...
> vcpu->cpu remains there. There is a new ->cpu field on struct vmcs, just
> as saved_vmcs has in the current patches, to note the cpu which the VMCS 
> was last loaded.

So we'll have two fields, vmx.vcpu.cpu and vmx.vmcs.cpu, which are supposed
to always contain the same value. Are you fine with that?

> As mentioned there is no need to set "vcpu->cpu = -1" in __vcpu_clear,
> the IPI handler, that can be done in vcpu_clear.

Right, this is true.

-- 
Nadav Har'El                        |      Tuesday, May 17 2011, 14 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |"A mathematician is a device for turning
http://nadav.harel.org.il           |coffee into theorems" -- P. Erdos


* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-17 19:30               ` Nadav Har'El
@ 2011-05-17 19:52                 ` Marcelo Tosatti
  2011-05-18  5:52                   ` Nadav Har'El
  2011-05-18  8:29                   ` Avi Kivity
  0 siblings, 2 replies; 118+ messages in thread
From: Marcelo Tosatti @ 2011-05-17 19:52 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Avi Kivity, kvm, gleb

On Tue, May 17, 2011 at 10:30:30PM +0300, Nadav Har'El wrote:
> On Tue, May 17, 2011, Marcelo Tosatti wrote about "Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling":
> > > this is what I planned to do, until it dawned on me that I can't, because "cpu"
> > > isn't part of vmx (where the vmcs and launched sit in the standard KVM), but
> >...
> > vcpu->cpu remains there. There is a new ->cpu field on struct vmcs, just
> > as saved_vmcs has in the current patches, to note the cpu which the VMCS 
> > was last loaded.
> 
> So we'll have two fields, vmx.vcpu.cpu and vmx.vmcs.cpu, which are supposed
> to always contain the same value. Are you fine with that?

Yes. Avi?



* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-17 19:52                 ` Marcelo Tosatti
@ 2011-05-18  5:52                   ` Nadav Har'El
  2011-05-18  8:31                     ` Avi Kivity
  2011-05-18 12:08                     ` Marcelo Tosatti
  2011-05-18  8:29                   ` Avi Kivity
  1 sibling, 2 replies; 118+ messages in thread
From: Nadav Har'El @ 2011-05-18  5:52 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, kvm, gleb

On Tue, May 17, 2011, Marcelo Tosatti wrote about "Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling":
> On Tue, May 17, 2011 at 10:30:30PM +0300, Nadav Har'El wrote:
> > So we'll have two fields, vmx.vcpu.cpu and vmx.vmcs.cpu, which are supposed
> > to always contain the same value. Are you fine with that?
> 
> Yes. Avi?

Oops, it's even worse than I said, because if the new vmclear_local_vmcss
clears the vmcs currently used on some vcpu, it will update vmcs.cpu on that
vcpu to -1, but will *not* update vmx.vcpu.cpu, which will remain at its old
value, potentially causing problems when it is used (e.g., in x86.c) instead
of vmx.vmcs.cpu.

-- 
Nadav Har'El                        |    Wednesday, May 18 2011, 14 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |An egotist is a person of low taste, more
http://nadav.harel.org.il           |interested in himself than in me.


* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-17 19:52                 ` Marcelo Tosatti
  2011-05-18  5:52                   ` Nadav Har'El
@ 2011-05-18  8:29                   ` Avi Kivity
  1 sibling, 0 replies; 118+ messages in thread
From: Avi Kivity @ 2011-05-18  8:29 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Nadav Har'El, kvm, gleb

On 05/17/2011 10:52 PM, Marcelo Tosatti wrote:
> On Tue, May 17, 2011 at 10:30:30PM +0300, Nadav Har'El wrote:
> >  On Tue, May 17, 2011, Marcelo Tosatti wrote about "Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling":
> >  >  >  this is what I planned to do, until it dawned on me that I can't, because "cpu"
> >  >  >  isn't part of vmx (where the vmcs and launched sit in the standard KVM), but
> >  >...
> >  >  vcpu->cpu remains there. There is a new ->cpu field on struct vmcs, just
> >  >  as saved_vmcs has in the current patches, to note the cpu which the VMCS
> >  >  was last loaded.
> >
> >  So we'll have two fields, vmx.vcpu.cpu and vmx.vmcs.cpu, which are supposed
> >  to always contain the same value. Are you fine with that?
>
> Yes. Avi?

Yes.

They have different meanings.  vcpu->cpu means where the task that runs 
the vcpu is running (or last ran).  vmcs->cpu means which cpu has the 
vmcs cached.  They need not be the same when we have multiple vmcs's for 
a vcpu; but vmx->vmcs->cpu will chase vcpu->cpu as it changes.

Please post this patch separately instead of reposting the entire
series; we can apply it independently.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-18  5:52                   ` Nadav Har'El
@ 2011-05-18  8:31                     ` Avi Kivity
  2011-05-18  9:02                       ` Nadav Har'El
  2011-05-18 12:08                     ` Marcelo Tosatti
  1 sibling, 1 reply; 118+ messages in thread
From: Avi Kivity @ 2011-05-18  8:31 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Marcelo Tosatti, kvm, gleb

On 05/18/2011 08:52 AM, Nadav Har'El wrote:
> On Tue, May 17, 2011, Marcelo Tosatti wrote about "Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling":
> >  On Tue, May 17, 2011 at 10:30:30PM +0300, Nadav Har'El wrote:
> >  >  So we'll have two fields, vmx.vcpu.cpu and vmx.vmcs.cpu, which are supposed
> >  >  to always contain the same value. Are you fine with that?
> >
> >  Yes. Avi?
>
> Oops, it's even worse than I said, because if the new vmclear_local_vmcss
> clears the vmcs currently used on some vcpu, it will update vmcs.cpu on that
> vcpu to -1, but will *not* update vmx.vcpu.cpu, which remain its old value,
> and potentially cause problems when it is used (e.g., in x86.c) instead
> of vmx.vmcs.cpu.
>

I did a quick audit and it seems fine.  If it isn't, we'll fix it when 
we see the problem.


-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-18  8:31                     ` Avi Kivity
@ 2011-05-18  9:02                       ` Nadav Har'El
  2011-05-18  9:16                         ` Avi Kivity
  0 siblings, 1 reply; 118+ messages in thread
From: Nadav Har'El @ 2011-05-18  9:02 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, kvm, gleb

On Wed, May 18, 2011, Avi Kivity wrote about "Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling":
> I did a quick audit and it seems fine.  If it isn't, we'll fix it when 
> we see the problem.

Ok, then, I'm working on the code with the new approach.

My fear was that, say, CPU 7 is taken down, but vcpu.cpu remains 7 (not set to
-1). If CPU 7 never comes up again, it's not a problem because when we run
the same vcpu again on a different cpu, it's not 7 so we do what needs to be
done on CPU switch. But, what if CPU 7 does come up again later, and we find
ourselves running again on CPU 7, but it's not the same CPU 7 and we don't
know it? Is this case at all possible?

-- 
Nadav Har'El                        |    Wednesday, May 18 2011, 14 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Enjoy the new millennium; it might be
http://nadav.harel.org.il           |your last.


* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-18  9:02                       ` Nadav Har'El
@ 2011-05-18  9:16                         ` Avi Kivity
  0 siblings, 0 replies; 118+ messages in thread
From: Avi Kivity @ 2011-05-18  9:16 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Marcelo Tosatti, kvm, gleb

On 05/18/2011 12:02 PM, Nadav Har'El wrote:
> On Wed, May 18, 2011, Avi Kivity wrote about "Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling":
> >  I did a quick audit and it seems fine.  If it isn't, we'll fix it when
> >  we see the problem.
>
> Ok, then, I'm working on the code with the new approach.
>
> My fear was that some CPU 7 is taken down, but vcpu.cpu remains 7 (not set to
> -1). If cpu 7 nevers comes up again, it's not a problem because when we run
> the same vcpu again on a different cpu, it's not 7 so we do what needs to be
> done on CPU switch. But, what if CPU 7 does come up again later, and we find
> ourselves running again on CPU 7, but it's not the same CPU 7 and we don't
> know it? Is this case at all possible?

It's certainly possible, but it's independent of this patch.

It's even handled, see kvm_arch_hardware_enable().

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-18  5:52                   ` Nadav Har'El
  2011-05-18  8:31                     ` Avi Kivity
@ 2011-05-18 12:08                     ` Marcelo Tosatti
  2011-05-18 12:19                       ` Nadav Har'El
  2011-05-22  8:57                       ` Nadav Har'El
  1 sibling, 2 replies; 118+ messages in thread
From: Marcelo Tosatti @ 2011-05-18 12:08 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Avi Kivity, kvm, gleb

On Wed, May 18, 2011 at 08:52:36AM +0300, Nadav Har'El wrote:
> On Tue, May 17, 2011, Marcelo Tosatti wrote about "Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling":
> > On Tue, May 17, 2011 at 10:30:30PM +0300, Nadav Har'El wrote:
> > > So we'll have two fields, vmx.vcpu.cpu and vmx.vmcs.cpu, which are supposed
> > > to always contain the same value. Are you fine with that?
> > 
> > Yes. Avi?
> 
> Oops, it's even worse than I said, because if the new vmclear_local_vmcss
> clears the vmcs currently used on some vcpu, it will update vmcs.cpu on that
> vcpu to -1, but will *not* update vmx.vcpu.cpu, which remain its old value,
> and potentially cause problems when it is used (e.g., in x86.c) instead
> of vmx.vmcs.cpu.

Humpf, right. OK, you can handle the x86.c usage with

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 64edf57..b5fd9b4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2118,7 +2118,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	if (need_emulate_wbinvd(vcpu)) {
 		if (kvm_x86_ops->has_wbinvd_exit())
 			cpumask_set_cpu(cpu, vcpu->arch.wbinvd_dirty_mask);
-		else if (vcpu->cpu != -1 && vcpu->cpu != cpu)
+		else if (vcpu->cpu != -1 && vcpu->cpu != cpu && cpu_online(vcpu->cpu))
 			smp_call_function_single(vcpu->cpu,
 					wbinvd_ipi, NULL, 1);
 	}

Note this is not just about the code being nicer: simplicity is
crucial, and the code is tricky enough with one linked list.



* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-18 12:08                     ` Marcelo Tosatti
@ 2011-05-18 12:19                       ` Nadav Har'El
  2011-05-22  8:57                       ` Nadav Har'El
  1 sibling, 0 replies; 118+ messages in thread
From: Nadav Har'El @ 2011-05-18 12:19 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, kvm, gleb

On Wed, May 18, 2011, Marcelo Tosatti wrote about "Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling":
> Note this is not just about the code being nicer, but simplicity is
> crucial, the code is tricky enough with one linked list.

Unfortunately, it's not obvious that the method you suggested (and which, like
I said, was the first method I considered as well, and rejected) will be
simpler or less tricky, with its two "cpu" variables, vmcs pointing to l1_vmcs
even in the non-nested case, and a bunch of other issues. The main benefit of
the code as I already posted it was that it didn't add *any* complexity or
change anything in the non-nested case. The code I'm writing now based on
your suggestions is more risky in the sense that it *may* break some things
completely unrelated to nested.

In any case, like I said, I'm working on a version using your and Avi's
suggestions, and will send it for your review shortly.

Thanks for all the ideas,
Nadav.

-- 
Nadav Har'El                        |    Wednesday, May 18 2011, 14 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Sign seen in restaurant: We Reserve The
http://nadav.harel.org.il           |Right To Serve Refuse To Anyone!


* RE: [PATCH 02/31] nVMX: Implement VMXON and VMXOFF
  2011-05-16 19:44 ` [PATCH 02/31] nVMX: Implement VMXON and VMXOFF Nadav Har'El
@ 2011-05-20  7:58   ` Tian, Kevin
  0 siblings, 0 replies; 118+ messages in thread
From: Tian, Kevin @ 2011-05-20  7:58 UTC (permalink / raw)
  To: Nadav Har'El, kvm; +Cc: gleb, avi

> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:45 AM
> 
> This patch allows a guest to use the VMXON and VMXOFF instructions, and
> emulates them accordingly. Basically this amounts to checking some
> prerequisites, and then remembering whether the guest has enabled or
> disabled VMX operation.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/kvm/vmx.c |  110
> ++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 108 insertions(+), 2 deletions(-)
> 
> --- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:46.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:46.000000000 +0300
> @@ -130,6 +130,15 @@ struct shared_msr_entry {
>  	u64 mask;
>  };
> 
> +/*
> + * The nested_vmx structure is part of vcpu_vmx, and holds information we need
> + * for correct emulation of VMX (i.e., nested VMX) on this vcpu.
> + */
> +struct nested_vmx {
> +	/* Has the level1 guest done vmxon? */
> +	bool vmxon;
> +};
> +
>  struct vcpu_vmx {
>  	struct kvm_vcpu       vcpu;
>  	struct list_head      local_vcpus_link;
> @@ -184,6 +193,9 @@ struct vcpu_vmx {
>  	u32 exit_reason;
> 
>  	bool rdtscp_enabled;
> +
> +	/* Support for a guest hypervisor (nested VMX) */
> +	struct nested_vmx nested;
>  };
> 
>  enum segment_cache_field {
> @@ -3890,6 +3902,99 @@ static int handle_invalid_op(struct kvm_
>  }
> 
>  /*
> + * Emulate the VMXON instruction.
> + * Currently, we just remember that VMX is active, and do not save or even
> + * inspect the argument to VMXON (the so-called "VMXON pointer") because we
> + * do not currently need to store anything in that guest-allocated memory

Though we don't need to store anything, VMXON needs to check the revision ID
of the VMXON region to make sure it matches the processor's expectation.
Consider a user who uses nVMX to practice VMM development and forgets to fill
the revision ID into the region: we should fail the instruction in the first
place.
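
Something along these lines, perhaps (a sketch only, assuming the vmptr
operand has already been read from the instruction, and borrowing
VMCS12_REVISION and a helper like nested_vmx_failInvalid() from later in
this series):

	u32 revision;

	if (kvm_read_guest(vcpu->kvm, vmptr, &revision, sizeof(revision)) ||
	    revision != VMCS12_REVISION) {
		nested_vmx_failInvalid(vcpu);
		skip_emulated_instruction(vcpu);
		return 1;
	}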

> + * region. Consequently, VMCLEAR and VMPTRLD also do not verify that their
> + * argument is different from the VMXON pointer (which the spec says they do).
> + */
> +static int handle_vmon(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_segment cs;
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	/* The Intel VMX Instruction Reference lists a bunch of bits that
> +	 * are prerequisite to running VMXON, most notably cr4.VMXE must be
> +	 * set to 1 (see vmx_set_cr4() for when we allow the guest to set this).
> +	 * Otherwise, we should fail with #UD. We test these now:
> +	 */
> +	if (!kvm_read_cr4_bits(vcpu, X86_CR4_VMXE) ||
> +	    !kvm_read_cr0_bits(vcpu, X86_CR0_PE) ||
> +	    (vmx_get_rflags(vcpu) & X86_EFLAGS_VM)) {
> +		kvm_queue_exception(vcpu, UD_VECTOR);
> +		return 1;
> +	}
> +
> +	vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
> +	if (is_long_mode(vcpu) && !cs.l) {
> +		kvm_queue_exception(vcpu, UD_VECTOR);
> +		return 1;
> +	}
> +
> +	if (vmx_get_cpl(vcpu)) {
> +		kvm_inject_gp(vcpu, 0);
> +		return 1;
> +	}

You also need to check the IA32_FEATURE_CONTROL MSR for bits 0/1/2, as
described in the SDM.

The same goes for the checks on 4KB alignment and physical-address width of
the VMXON region.

> +
> +	vmx->nested.vmxon = true;
> +
> +	skip_emulated_instruction(vcpu);
> +	return 1;
> +}
> +
> +/*
> + * Intel's VMX Instruction Reference specifies a common set of
> +prerequisites
> + * for running VMX instructions (except VMXON, whose prerequisites are
> + * slightly different). It also specifies what exception to inject otherwise.
> + */
> +static int nested_vmx_check_permission(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_segment cs;
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	if (!vmx->nested.vmxon) {
> +		kvm_queue_exception(vcpu, UD_VECTOR);
> +		return 0;
> +	}
> +
> +	vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
> +	if ((vmx_get_rflags(vcpu) & X86_EFLAGS_VM) ||
> +	    (is_long_mode(vcpu) && !cs.l)) {
> +		kvm_queue_exception(vcpu, UD_VECTOR);
> +		return 0;
> +	}
> +
> +	if (vmx_get_cpl(vcpu)) {
> +		kvm_inject_gp(vcpu, 0);
> +		return 0;
> +	}
> +
> +	return 1;
> +}
> +
> +/*
> + * Free whatever needs to be freed from vmx->nested when L1 goes down, or
> + * just stops using VMX.
> + */
> +static void free_nested(struct vcpu_vmx *vmx)
> +{
> +	if (!vmx->nested.vmxon)
> +		return;
> +	vmx->nested.vmxon = false;
> +}
> +
> +/* Emulate the VMXOFF instruction */
> +static int handle_vmoff(struct kvm_vcpu *vcpu)
> +{
> +	if (!nested_vmx_check_permission(vcpu))
> +		return 1;

This misses a check on CR0.PE.

> +	free_nested(to_vmx(vcpu));
> +	skip_emulated_instruction(vcpu);
> +	return 1;
> +}
> +
> +/*
>   * The exit handlers return 1 if the exit was handled fully and guest execution
>   * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
>   * to be done to userspace and return 0.
> @@ -3917,8 +4022,8 @@ static int (*kvm_vmx_exit_handlers[])(st
>  	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
>  	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
>  	[EXIT_REASON_VMWRITE]                 = handle_vmx_insn,
> -	[EXIT_REASON_VMOFF]                   = handle_vmx_insn,
> -	[EXIT_REASON_VMON]                    = handle_vmx_insn,
> +	[EXIT_REASON_VMOFF]                   = handle_vmoff,
> +	[EXIT_REASON_VMON]                    = handle_vmon,
>  	[EXIT_REASON_TPR_BELOW_THRESHOLD]     =
> handle_tpr_below_threshold,
>  	[EXIT_REASON_APIC_ACCESS]             = handle_apic_access,
>  	[EXIT_REASON_WBINVD]                  = handle_wbinvd,
> @@ -4329,6 +4434,7 @@ static void vmx_free_vcpu(struct kvm_vcp
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> 
>  	free_vpid(vmx);
> +	free_nested(vmx);
>  	vmx_free_vmcs(vcpu);
>  	kfree(vmx->guest_msrs);
>  	kvm_vcpu_uninit(vcpu);
> --


* RE: [PATCH 07/31] nVMX: Introduce vmcs02: VMCS used to run L2
  2011-05-16 19:47 ` [PATCH 07/31] nVMX: Introduce vmcs02: VMCS used to run L2 Nadav Har'El
@ 2011-05-20  8:04   ` Tian, Kevin
  2011-05-20  8:48     ` Tian, Kevin
  2011-05-22  8:29     ` Nadav Har'El
  0 siblings, 2 replies; 118+ messages in thread
From: Tian, Kevin @ 2011-05-20  8:04 UTC (permalink / raw)
  To: Nadav Har'El, kvm; +Cc: gleb, avi

> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:48 AM
> 
> We saw in a previous patch that L1 controls its L2 guest with a vmcs12.
> L0 needs to create a real VMCS for running L2. We call that "vmcs02".
> A later patch will contain the code, prepare_vmcs02(), for filling the vmcs02
> fields. This patch only contains code for allocating vmcs02.
> 
> In this version, prepare_vmcs02() sets *all* of vmcs02's fields each time we
> enter from L1 to L2, so keeping just one vmcs02 for the vcpu is enough: It can
> be reused even when L1 runs multiple L2 guests. However, in future versions
> we'll probably want to add an optimization where vmcs02 fields that rarely
> change will not be set each time. For that, we may want to keep around several
> vmcs02s of L2 guests that have recently run, so that potentially we could run
> these L2s again more quickly because less vmwrites to vmcs02 will be needed.

That would be a neat enhancement and should give an obvious improvement.
Possibly we can maintain the vmcs02 pool along with L1 VMCLEAR operations,
which is similar to the hardware behavior regarding cleared and launched state.
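
For example (sketch only), handle_vmclear() could drop the cached vmcs02 for
the vmptr being cleared, reusing the helper this patch adds:

	/* in handle_vmclear(), once the operand vmptr has been validated: */
	nested_free_vmcs02(vmx, vmptr);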

> 
> This patch adds to each vcpu a vmcs02 pool, vmx->nested.vmcs02_pool,
> which remembers the vmcs02s last used to run up to VMCS02_POOL_SIZE L2s.
> As explained above, in the current version we choose VMCS02_POOL_SIZE=1,
> I.e., one vmcs02 is allocated (and loaded onto the processor), and it is
> reused to enter any L2 guest. In the future, when prepare_vmcs02() is
> optimized not to set all fields every time, VMCS02_POOL_SIZE should be
> increased.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/kvm/vmx.c |  139
> +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 139 insertions(+)
> 
> --- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:47.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:47.000000000 +0300
> @@ -117,6 +117,7 @@ static int ple_window = KVM_VMX_DEFAULT_
>  module_param(ple_window, int, S_IRUGO);
> 
>  #define NR_AUTOLOAD_MSRS 1
> +#define VMCS02_POOL_SIZE 1
> 
>  struct vmcs {
>  	u32 revision_id;
> @@ -166,6 +167,30 @@ struct __packed vmcs12 {
>  #define VMCS12_SIZE 0x1000
> 
>  /*
> + * When we temporarily switch a vcpu's VMCS (e.g., stop using an L1's VMCS
> + * while we use L2's VMCS), and we wish to save the previous VMCS, we must also
> + * remember on which CPU it was last loaded (vcpu->cpu), so when we return to
> + * using this VMCS we'll know if we're now running on a different CPU and need
> + * to clear the VMCS on the old CPU, and load it on the new one. Additionally,
> + * we need to remember whether this VMCS was launched (vmx->launched), so when
> + * we return to it we know if to VMLAUNCH or to VMRESUME it (we cannot deduce
> + * this from other state, because it's possible that this VMCS had once been
> + * launched, but has since been cleared after a CPU switch).
> + */
> +struct saved_vmcs {
> +	struct vmcs *vmcs;
> +	int cpu;
> +	int launched;
> +};

"saved" looks a bit misleading here. It's simply a list of all active vmcs02 tracked
by kvm, isn't it?

> +
> +/* Used to remember the last vmcs02 used for some recently used vmcs12s */
> +struct vmcs02_list {
> +	struct list_head list;
> +	gpa_t vmcs12_addr;

Use the same name 'vmptr' as in the nested_vmx structure:
 /* The guest-physical address of the current VMCS L1 keeps for L2 */
	gpa_t current_vmptr;
	/* The host-usable pointer to the above */
	struct page *current_vmcs12_page;
	struct vmcs12 *current_vmcs12;

You should keep a consistent meaning for vmcs12, i.e., the arch-neutral
state interpreted by KVM only.

> +	struct saved_vmcs vmcs02;
> +};
> +
> +/*
>   * The nested_vmx structure is part of vcpu_vmx, and holds information we
> need
>   * for correct emulation of VMX (i.e., nested VMX) on this vcpu.
>   */
> @@ -178,6 +203,10 @@ struct nested_vmx {
>  	/* The host-usable pointer to the above */
>  	struct page *current_vmcs12_page;
>  	struct vmcs12 *current_vmcs12;
> +
> +	/* vmcs02_list cache of VMCSs recently used to run L2 guests */
> +	struct list_head vmcs02_pool;
> +	int vmcs02_num;
>  };
> 
>  struct vcpu_vmx {
> @@ -4200,6 +4229,111 @@ static int handle_invalid_op(struct kvm_
>  }
> 
>  /*
> + * To run an L2 guest, we need a vmcs02 based on the L1-specified vmcs12.
> + * We could reuse a single VMCS for all the L2 guests, but we also want the
> + * option to allocate a separate vmcs02 for each separate loaded vmcs12 - this
> + * allows keeping them loaded on the processor, and in the future will allow
> + * optimizations where prepare_vmcs02 doesn't need to set all the fields on
> + * every entry if they never change.
> + * So we keep, in vmx->nested.vmcs02_pool, a cache of size VMCS02_POOL_SIZE
> + * (>=0) with a vmcs02 for each recently loaded vmcs12, most recent first.
> + *
> + * The following functions allocate and free a vmcs02 in this pool.
> + */
> +
> +static void __nested_free_saved_vmcs(void *arg)
> +{
> +	struct saved_vmcs *saved_vmcs = arg;
> +
> +	vmcs_clear(saved_vmcs->vmcs);
> +	if (per_cpu(current_vmcs, saved_vmcs->cpu) == saved_vmcs->vmcs)
> +		per_cpu(current_vmcs, saved_vmcs->cpu) = NULL;
> +}
> +
> +/*
> + * Free a VMCS, but before that VMCLEAR it on the CPU where it was last loaded
> + * (the necessary information is in the saved_vmcs structure).
> + * See also vcpu_clear() (with different parameters and side-effects)
> + */
> +static void nested_free_saved_vmcs(struct vcpu_vmx *vmx,
> +		struct saved_vmcs *saved_vmcs)
> +{
> +	if (saved_vmcs->cpu != -1)
> +		smp_call_function_single(saved_vmcs->cpu,
> +				__nested_free_saved_vmcs, saved_vmcs, 1);
> +
> +	free_vmcs(saved_vmcs->vmcs);
> +}
> +
> +/* Free and remove from pool a vmcs02 saved for a vmcs12 (if there is one) */
> +static void nested_free_vmcs02(struct vcpu_vmx *vmx, gpa_t vmptr)
> +{
> +	struct vmcs02_list *item;
> +	list_for_each_entry(item, &vmx->nested.vmcs02_pool, list)
> +		if (item->vmcs12_addr == vmptr) {
> +			nested_free_saved_vmcs(vmx, &item->vmcs02);
> +			list_del(&item->list);
> +			kfree(item);
> +			vmx->nested.vmcs02_num--;
> +			return;
> +		}
> +}
> +
> +/*
> + * Free all VMCSs saved for this vcpu, except the actual vmx->vmcs.
> + * These include the VMCSs in vmcs02_pool (except the one currently used,
> + * if running L2), and saved_vmcs01 when running L2.
> + */
> +static void nested_free_all_saved_vmcss(struct vcpu_vmx *vmx)
> +{
> +	struct vmcs02_list *item, *n;
> +	list_for_each_entry_safe(item, n, &vmx->nested.vmcs02_pool, list) {
> +		if (vmx->vmcs != item->vmcs02.vmcs)
> +			nested_free_saved_vmcs(vmx, &item->vmcs02);
> +		list_del(&item->list);
> +		kfree(item);
> +	}
> +	vmx->nested.vmcs02_num = 0;
> +}
> +
> +/* Get a vmcs02 for the current vmcs12. */
> +static struct saved_vmcs *nested_get_current_vmcs02(struct vcpu_vmx *vmx)
> +{
> +	struct vmcs02_list *item;
> +	list_for_each_entry(item, &vmx->nested.vmcs02_pool, list)
> +		if (item->vmcs12_addr == vmx->nested.current_vmptr) {
> +			list_move(&item->list, &vmx->nested.vmcs02_pool);
> +			return &item->vmcs02;
> +		}
> +
> +	if (vmx->nested.vmcs02_num >= max(VMCS02_POOL_SIZE, 1)) {
> +		/* Recycle the least recently used VMCS. */
> +		item = list_entry(vmx->nested.vmcs02_pool.prev,
> +			struct vmcs02_list, list);
> +		item->vmcs12_addr = vmx->nested.current_vmptr;
> +		list_move(&item->list, &vmx->nested.vmcs02_pool);
> +		return &item->vmcs02;
> +	}
> +
> +	/* Create a new vmcs02 */
> +	item = (struct vmcs02_list *)
> +		kmalloc(sizeof(struct vmcs02_list), GFP_KERNEL);
> +	if (!item)
> +		return NULL;
> +	item->vmcs02.vmcs = alloc_vmcs();
> +	if (!item->vmcs02.vmcs) {
> +		kfree(item);
> +		return NULL;
> +	}
> +	item->vmcs12_addr = vmx->nested.current_vmptr;
> +	item->vmcs02.cpu = -1;
> +	item->vmcs02.launched = 0;
> +	list_add(&(item->list), &(vmx->nested.vmcs02_pool));
> +	vmx->nested.vmcs02_num++;
> +	return &item->vmcs02;
> +}
> +
> +/*
>   * Emulate the VMXON instruction.
>   * Currently, we just remember that VMX is active, and do not save or even
>   * inspect the argument to VMXON (the so-called "VMXON pointer") because
> we
> @@ -4235,6 +4369,9 @@ static int handle_vmon(struct kvm_vcpu *
>  		return 1;
>  	}
> 
> +	INIT_LIST_HEAD(&(vmx->nested.vmcs02_pool));
> +	vmx->nested.vmcs02_num = 0;
> +
>  	vmx->nested.vmxon = true;
> 
>  	skip_emulated_instruction(vcpu);
> @@ -4286,6 +4423,8 @@ static void free_nested(struct vcpu_vmx
>  		vmx->nested.current_vmptr = -1ull;
>  		vmx->nested.current_vmcs12 = NULL;
>  	}
> +
> +	nested_free_all_saved_vmcss(vmx);
>  }
> 
>  /* Emulate the VMXOFF instruction */
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: [PATCH 09/31] nVMX: Add VMCS fields to the vmcs12
  2011-05-16 19:48 ` [PATCH 09/31] nVMX: Add VMCS fields to the vmcs12 Nadav Har'El
@ 2011-05-20  8:22   ` Tian, Kevin
  0 siblings, 0 replies; 118+ messages in thread
From: Tian, Kevin @ 2011-05-20  8:22 UTC (permalink / raw)
  To: Nadav Har'El, kvm; +Cc: gleb, avi

> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:49 AM
> 
> In this patch we add to vmcs12 (the VMCS that L1 keeps for L2) all the
> standard VMCS fields.
> 
> Later patches will enable L1 to read and write these fields using VMREAD/
> VMWRITE, and they will be used during a VMLAUNCH/VMRESUME in preparing
> vmcs02,
> a hardware VMCS for running L2.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/kvm/vmx.c |  275 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 275 insertions(+)
> 
> --- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:47.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:47.000000000 +0300
> @@ -144,12 +144,148 @@ struct shared_msr_entry {
>   * machines (necessary for live migration).
>   * If there are changes in this struct, VMCS12_REVISION must be changed.
>   */
> +typedef u64 natural_width;
>  struct __packed vmcs12 {
>  	/* According to the Intel spec, a VMCS region must start with the
>  	 * following two fields. Then follow implementation-specific data.
>  	 */
>  	u32 revision_id;
>  	u32 abort;
> +
> +	u64 io_bitmap_a;
> +	u64 io_bitmap_b;
> +	u64 msr_bitmap;
> +	u64 vm_exit_msr_store_addr;
> +	u64 vm_exit_msr_load_addr;
> +	u64 vm_entry_msr_load_addr;
> +	u64 tsc_offset;
> +	u64 virtual_apic_page_addr;
> +	u64 apic_access_addr;
> +	u64 ept_pointer;
> +	u64 guest_physical_address;
> +	u64 vmcs_link_pointer;
> +	u64 guest_ia32_debugctl;
> +	u64 guest_ia32_pat;
> +	u64 guest_ia32_efer;
> +	u64 guest_pdptr0;
> +	u64 guest_pdptr1;
> +	u64 guest_pdptr2;
> +	u64 guest_pdptr3;
> +	u64 host_ia32_pat;
> +	u64 host_ia32_efer;
> +	u64 padding64[8]; /* room for future expansion */
> +	/*
> +	 * To allow migration of L1 (complete with its L2 guests) between
> +	 * machines of different natural widths (32 or 64 bit), we cannot have
> +	 * unsigned long fields with no explict size. We use u64 (aliased
> +	 * natural_width) instead. Luckily, x86 is little-endian.
> +	 */
> +	natural_width cr0_guest_host_mask;
> +	natural_width cr4_guest_host_mask;
> +	natural_width cr0_read_shadow;
> +	natural_width cr4_read_shadow;
> +	natural_width cr3_target_value0;
> +	natural_width cr3_target_value1;
> +	natural_width cr3_target_value2;
> +	natural_width cr3_target_value3;
> +	natural_width exit_qualification;
> +	natural_width guest_linear_address;
> +	natural_width guest_cr0;
> +	natural_width guest_cr3;
> +	natural_width guest_cr4;
> +	natural_width guest_es_base;
> +	natural_width guest_cs_base;
> +	natural_width guest_ss_base;
> +	natural_width guest_ds_base;
> +	natural_width guest_fs_base;
> +	natural_width guest_gs_base;
> +	natural_width guest_ldtr_base;
> +	natural_width guest_tr_base;
> +	natural_width guest_gdtr_base;
> +	natural_width guest_idtr_base;
> +	natural_width guest_dr7;
> +	natural_width guest_rsp;
> +	natural_width guest_rip;
> +	natural_width guest_rflags;
> +	natural_width guest_pending_dbg_exceptions;
> +	natural_width guest_sysenter_esp;
> +	natural_width guest_sysenter_eip;
> +	natural_width host_cr0;
> +	natural_width host_cr3;
> +	natural_width host_cr4;
> +	natural_width host_fs_base;
> +	natural_width host_gs_base;
> +	natural_width host_tr_base;
> +	natural_width host_gdtr_base;
> +	natural_width host_idtr_base;
> +	natural_width host_ia32_sysenter_esp;
> +	natural_width host_ia32_sysenter_eip;
> +	natural_width host_rsp;
> +	natural_width host_rip;
> +	natural_width paddingl[8]; /* room for future expansion */
> +	u32 pin_based_vm_exec_control;
> +	u32 cpu_based_vm_exec_control;
> +	u32 exception_bitmap;
> +	u32 page_fault_error_code_mask;
> +	u32 page_fault_error_code_match;
> +	u32 cr3_target_count;
> +	u32 vm_exit_controls;
> +	u32 vm_exit_msr_store_count;
> +	u32 vm_exit_msr_load_count;
> +	u32 vm_entry_controls;
> +	u32 vm_entry_msr_load_count;
> +	u32 vm_entry_intr_info_field;
> +	u32 vm_entry_exception_error_code;
> +	u32 vm_entry_instruction_len;
> +	u32 tpr_threshold;
> +	u32 secondary_vm_exec_control;
> +	u32 vm_instruction_error;
> +	u32 vm_exit_reason;
> +	u32 vm_exit_intr_info;
> +	u32 vm_exit_intr_error_code;
> +	u32 idt_vectoring_info_field;
> +	u32 idt_vectoring_error_code;
> +	u32 vm_exit_instruction_len;
> +	u32 vmx_instruction_info;
> +	u32 guest_es_limit;
> +	u32 guest_cs_limit;
> +	u32 guest_ss_limit;
> +	u32 guest_ds_limit;
> +	u32 guest_fs_limit;
> +	u32 guest_gs_limit;
> +	u32 guest_ldtr_limit;
> +	u32 guest_tr_limit;
> +	u32 guest_gdtr_limit;
> +	u32 guest_idtr_limit;
> +	u32 guest_es_ar_bytes;
> +	u32 guest_cs_ar_bytes;
> +	u32 guest_ss_ar_bytes;
> +	u32 guest_ds_ar_bytes;
> +	u32 guest_fs_ar_bytes;
> +	u32 guest_gs_ar_bytes;
> +	u32 guest_ldtr_ar_bytes;
> +	u32 guest_tr_ar_bytes;
> +	u32 guest_interruptibility_info;
> +	u32 guest_activity_state;
> +	u32 guest_sysenter_cs;
> +	u32 host_ia32_sysenter_cs;
> +	u32 padding32[8]; /* room for future expansion */
> +	u16 virtual_processor_id;
> +	u16 guest_es_selector;
> +	u16 guest_cs_selector;
> +	u16 guest_ss_selector;
> +	u16 guest_ds_selector;
> +	u16 guest_fs_selector;
> +	u16 guest_gs_selector;
> +	u16 guest_ldtr_selector;
> +	u16 guest_tr_selector;
> +	u16 host_es_selector;
> +	u16 host_cs_selector;
> +	u16 host_ss_selector;
> +	u16 host_ds_selector;
> +	u16 host_fs_selector;
> +	u16 host_gs_selector;
> +	u16 host_tr_selector;
>  };
> 

Should we pad vmcs12 to 4096 bytes, as reported to the L1 guest?
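
If so, a compile-time check along these lines (just a sketch; VMCS12_SIZE is
the 0x1000 size already defined by this patch series) would also document the
assumption:

	/* sketch: vmcs12, plus any padding, must fit in the region we
	 * advertise to L1 */
	BUILD_BUG_ON(sizeof(struct vmcs12) > VMCS12_SIZE);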

Thanks
Kevin

>  /*
> @@ -283,6 +419,145 @@ static inline struct vcpu_vmx *to_vmx(st
>  	return container_of(vcpu, struct vcpu_vmx, vcpu);
>  }
> 
> +#define VMCS12_OFFSET(x) offsetof(struct vmcs12, x)
> +#define FIELD(number, name)	[number] = VMCS12_OFFSET(name)
> +#define FIELD64(number, name)	[number] = VMCS12_OFFSET(name), \
> +				[number##_HIGH] = VMCS12_OFFSET(name)+4
> +
> +static unsigned short vmcs_field_to_offset_table[] = {
> +	FIELD(VIRTUAL_PROCESSOR_ID, virtual_processor_id),
> +	FIELD(GUEST_ES_SELECTOR, guest_es_selector),
> +	FIELD(GUEST_CS_SELECTOR, guest_cs_selector),
> +	FIELD(GUEST_SS_SELECTOR, guest_ss_selector),
> +	FIELD(GUEST_DS_SELECTOR, guest_ds_selector),
> +	FIELD(GUEST_FS_SELECTOR, guest_fs_selector),
> +	FIELD(GUEST_GS_SELECTOR, guest_gs_selector),
> +	FIELD(GUEST_LDTR_SELECTOR, guest_ldtr_selector),
> +	FIELD(GUEST_TR_SELECTOR, guest_tr_selector),
> +	FIELD(HOST_ES_SELECTOR, host_es_selector),
> +	FIELD(HOST_CS_SELECTOR, host_cs_selector),
> +	FIELD(HOST_SS_SELECTOR, host_ss_selector),
> +	FIELD(HOST_DS_SELECTOR, host_ds_selector),
> +	FIELD(HOST_FS_SELECTOR, host_fs_selector),
> +	FIELD(HOST_GS_SELECTOR, host_gs_selector),
> +	FIELD(HOST_TR_SELECTOR, host_tr_selector),
> +	FIELD64(IO_BITMAP_A, io_bitmap_a),
> +	FIELD64(IO_BITMAP_B, io_bitmap_b),
> +	FIELD64(MSR_BITMAP, msr_bitmap),
> +	FIELD64(VM_EXIT_MSR_STORE_ADDR, vm_exit_msr_store_addr),
> +	FIELD64(VM_EXIT_MSR_LOAD_ADDR, vm_exit_msr_load_addr),
> +	FIELD64(VM_ENTRY_MSR_LOAD_ADDR, vm_entry_msr_load_addr),
> +	FIELD64(TSC_OFFSET, tsc_offset),
> +	FIELD64(VIRTUAL_APIC_PAGE_ADDR, virtual_apic_page_addr),
> +	FIELD64(APIC_ACCESS_ADDR, apic_access_addr),
> +	FIELD64(EPT_POINTER, ept_pointer),
> +	FIELD64(GUEST_PHYSICAL_ADDRESS, guest_physical_address),
> +	FIELD64(VMCS_LINK_POINTER, vmcs_link_pointer),
> +	FIELD64(GUEST_IA32_DEBUGCTL, guest_ia32_debugctl),
> +	FIELD64(GUEST_IA32_PAT, guest_ia32_pat),
> +	FIELD64(GUEST_PDPTR0, guest_pdptr0),
> +	FIELD64(GUEST_PDPTR1, guest_pdptr1),
> +	FIELD64(GUEST_PDPTR2, guest_pdptr2),
> +	FIELD64(GUEST_PDPTR3, guest_pdptr3),
> +	FIELD64(HOST_IA32_PAT, host_ia32_pat),
> +	FIELD(PIN_BASED_VM_EXEC_CONTROL, pin_based_vm_exec_control),
> +	FIELD(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control),
> +	FIELD(EXCEPTION_BITMAP, exception_bitmap),
> +	FIELD(PAGE_FAULT_ERROR_CODE_MASK, page_fault_error_code_mask),
> +	FIELD(PAGE_FAULT_ERROR_CODE_MATCH, page_fault_error_code_match),
> +	FIELD(CR3_TARGET_COUNT, cr3_target_count),
> +	FIELD(VM_EXIT_CONTROLS, vm_exit_controls),
> +	FIELD(VM_EXIT_MSR_STORE_COUNT, vm_exit_msr_store_count),
> +	FIELD(VM_EXIT_MSR_LOAD_COUNT, vm_exit_msr_load_count),
> +	FIELD(VM_ENTRY_CONTROLS, vm_entry_controls),
> +	FIELD(VM_ENTRY_MSR_LOAD_COUNT, vm_entry_msr_load_count),
> +	FIELD(VM_ENTRY_INTR_INFO_FIELD, vm_entry_intr_info_field),
> +	FIELD(VM_ENTRY_EXCEPTION_ERROR_CODE, vm_entry_exception_error_code),
> +	FIELD(VM_ENTRY_INSTRUCTION_LEN, vm_entry_instruction_len),
> +	FIELD(TPR_THRESHOLD, tpr_threshold),
> +	FIELD(SECONDARY_VM_EXEC_CONTROL, secondary_vm_exec_control),
> +	FIELD(VM_INSTRUCTION_ERROR, vm_instruction_error),
> +	FIELD(VM_EXIT_REASON, vm_exit_reason),
> +	FIELD(VM_EXIT_INTR_INFO, vm_exit_intr_info),
> +	FIELD(VM_EXIT_INTR_ERROR_CODE, vm_exit_intr_error_code),
> +	FIELD(IDT_VECTORING_INFO_FIELD, idt_vectoring_info_field),
> +	FIELD(IDT_VECTORING_ERROR_CODE, idt_vectoring_error_code),
> +	FIELD(VM_EXIT_INSTRUCTION_LEN, vm_exit_instruction_len),
> +	FIELD(VMX_INSTRUCTION_INFO, vmx_instruction_info),
> +	FIELD(GUEST_ES_LIMIT, guest_es_limit),
> +	FIELD(GUEST_CS_LIMIT, guest_cs_limit),
> +	FIELD(GUEST_SS_LIMIT, guest_ss_limit),
> +	FIELD(GUEST_DS_LIMIT, guest_ds_limit),
> +	FIELD(GUEST_FS_LIMIT, guest_fs_limit),
> +	FIELD(GUEST_GS_LIMIT, guest_gs_limit),
> +	FIELD(GUEST_LDTR_LIMIT, guest_ldtr_limit),
> +	FIELD(GUEST_TR_LIMIT, guest_tr_limit),
> +	FIELD(GUEST_GDTR_LIMIT, guest_gdtr_limit),
> +	FIELD(GUEST_IDTR_LIMIT, guest_idtr_limit),
> +	FIELD(GUEST_ES_AR_BYTES, guest_es_ar_bytes),
> +	FIELD(GUEST_CS_AR_BYTES, guest_cs_ar_bytes),
> +	FIELD(GUEST_SS_AR_BYTES, guest_ss_ar_bytes),
> +	FIELD(GUEST_DS_AR_BYTES, guest_ds_ar_bytes),
> +	FIELD(GUEST_FS_AR_BYTES, guest_fs_ar_bytes),
> +	FIELD(GUEST_GS_AR_BYTES, guest_gs_ar_bytes),
> +	FIELD(GUEST_LDTR_AR_BYTES, guest_ldtr_ar_bytes),
> +	FIELD(GUEST_TR_AR_BYTES, guest_tr_ar_bytes),
> +	FIELD(GUEST_INTERRUPTIBILITY_INFO, guest_interruptibility_info),
> +	FIELD(GUEST_ACTIVITY_STATE, guest_activity_state),
> +	FIELD(GUEST_SYSENTER_CS, guest_sysenter_cs),
> +	FIELD(HOST_IA32_SYSENTER_CS, host_ia32_sysenter_cs),
> +	FIELD(CR0_GUEST_HOST_MASK, cr0_guest_host_mask),
> +	FIELD(CR4_GUEST_HOST_MASK, cr4_guest_host_mask),
> +	FIELD(CR0_READ_SHADOW, cr0_read_shadow),
> +	FIELD(CR4_READ_SHADOW, cr4_read_shadow),
> +	FIELD(CR3_TARGET_VALUE0, cr3_target_value0),
> +	FIELD(CR3_TARGET_VALUE1, cr3_target_value1),
> +	FIELD(CR3_TARGET_VALUE2, cr3_target_value2),
> +	FIELD(CR3_TARGET_VALUE3, cr3_target_value3),
> +	FIELD(EXIT_QUALIFICATION, exit_qualification),
> +	FIELD(GUEST_LINEAR_ADDRESS, guest_linear_address),
> +	FIELD(GUEST_CR0, guest_cr0),
> +	FIELD(GUEST_CR3, guest_cr3),
> +	FIELD(GUEST_CR4, guest_cr4),
> +	FIELD(GUEST_ES_BASE, guest_es_base),
> +	FIELD(GUEST_CS_BASE, guest_cs_base),
> +	FIELD(GUEST_SS_BASE, guest_ss_base),
> +	FIELD(GUEST_DS_BASE, guest_ds_base),
> +	FIELD(GUEST_FS_BASE, guest_fs_base),
> +	FIELD(GUEST_GS_BASE, guest_gs_base),
> +	FIELD(GUEST_LDTR_BASE, guest_ldtr_base),
> +	FIELD(GUEST_TR_BASE, guest_tr_base),
> +	FIELD(GUEST_GDTR_BASE, guest_gdtr_base),
> +	FIELD(GUEST_IDTR_BASE, guest_idtr_base),
> +	FIELD(GUEST_DR7, guest_dr7),
> +	FIELD(GUEST_RSP, guest_rsp),
> +	FIELD(GUEST_RIP, guest_rip),
> +	FIELD(GUEST_RFLAGS, guest_rflags),
> +	FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
> +	FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
> +	FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
> +	FIELD(HOST_CR0, host_cr0),
> +	FIELD(HOST_CR3, host_cr3),
> +	FIELD(HOST_CR4, host_cr4),
> +	FIELD(HOST_FS_BASE, host_fs_base),
> +	FIELD(HOST_GS_BASE, host_gs_base),
> +	FIELD(HOST_TR_BASE, host_tr_base),
> +	FIELD(HOST_GDTR_BASE, host_gdtr_base),
> +	FIELD(HOST_IDTR_BASE, host_idtr_base),
> +	FIELD(HOST_IA32_SYSENTER_ESP, host_ia32_sysenter_esp),
> +	FIELD(HOST_IA32_SYSENTER_EIP, host_ia32_sysenter_eip),
> +	FIELD(HOST_RSP, host_rsp),
> +	FIELD(HOST_RIP, host_rip),
> +};
> +static const int max_vmcs_field = ARRAY_SIZE(vmcs_field_to_offset_table);
> +
> +static inline short vmcs_field_to_offset(unsigned long field)
> +{
> +	if (field >= max_vmcs_field || vmcs_field_to_offset_table[field] == 0)
> +		return -1;
> +	return vmcs_field_to_offset_table[field];
> +}
> +
>  static inline struct vmcs12 *get_vmcs12(struct kvm_vcpu *vcpu)
>  {
>  	return to_vmx(vcpu)->nested.current_vmcs12;
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: [PATCH 07/31] nVMX: Introduce vmcs02: VMCS used to run L2
  2011-05-20  8:04   ` Tian, Kevin
@ 2011-05-20  8:48     ` Tian, Kevin
  2011-05-20 20:32       ` Nadav Har'El
  2011-05-22  8:29     ` Nadav Har'El
  1 sibling, 1 reply; 118+ messages in thread
From: Tian, Kevin @ 2011-05-20  8:48 UTC (permalink / raw)
  To: Nadav Har'El, kvm; +Cc: gleb, avi

> From: Tian, Kevin
> Sent: Friday, May 20, 2011 4:05 PM
> 
> > From: Nadav Har'El
> > Sent: Tuesday, May 17, 2011 3:48 AM
> >
> > We saw in a previous patch that L1 controls its L2 guest with a vcms12.
> > L0 needs to create a real VMCS for running L2. We call that "vmcs02".
> > A later patch will contain the code, prepare_vmcs02(), for filling the vmcs02
> > fields. This patch only contains code for allocating vmcs02.
> >
> > In this version, prepare_vmcs02() sets *all* of vmcs02's fields each time we
> > enter from L1 to L2, so keeping just one vmcs02 for the vcpu is enough: It can
> > be reused even when L1 runs multiple L2 guests. However, in future versions
> > we'll probably want to add an optimization where vmcs02 fields that rarely
> > change will not be set each time. For that, we may want to keep around
> several
> > vmcs02s of L2 guests that have recently run, so that potentially we could run
> > these L2s again more quickly because less vmwrites to vmcs02 will be
> needed.
> 
> That would be a neat enhancement and should have an obvious improvement.
> Possibly we can maintain the vmcs02 pool along with L1 VMCLEAR ops, which
> is similar to the hardware behavior regarding to cleared and launched state.
> 
> >
> > This patch adds to each vcpu a vmcs02 pool, vmx->nested.vmcs02_pool,
> > which remembers the vmcs02s last used to run up to VMCS02_POOL_SIZE
> L2s.
> > As explained above, in the current version we choose VMCS02_POOL_SIZE=1,
> > I.e., one vmcs02 is allocated (and loaded onto the processor), and it is
> > reused to enter any L2 guest. In the future, when prepare_vmcs02() is
> > optimized not to set all fields every time, VMCS02_POOL_SIZE should be
> > increased.
> >
> > Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> > ---
> >  arch/x86/kvm/vmx.c |  139
> > +++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 139 insertions(+)
> >
> > --- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:47.000000000 +0300
> > +++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:47.000000000 +0300
> > @@ -117,6 +117,7 @@ static int ple_window = KVM_VMX_DEFAULT_
> >  module_param(ple_window, int, S_IRUGO);
> >
> >  #define NR_AUTOLOAD_MSRS 1
> > +#define VMCS02_POOL_SIZE 1
> >
> >  struct vmcs {
> >  	u32 revision_id;
> > @@ -166,6 +167,30 @@ struct __packed vmcs12 {
> >  #define VMCS12_SIZE 0x1000
> >
> >  /*
> > + * When we temporarily switch a vcpu's VMCS (e.g., stop using an L1's
> VMCS
> > + * while we use L2's VMCS), and we wish to save the previous VMCS, we
> must
> > also
> > + * remember on which CPU it was last loaded (vcpu->cpu), so when we
> return
> > to
> > + * using this VMCS we'll know if we're now running on a different CPU and
> > need
> > + * to clear the VMCS on the old CPU, and load it on the new one.
> Additionally,
> > + * we need to remember whether this VMCS was launched (vmx->launched),
> > so when
> > + * we return to it we know if to VMLAUNCH or to VMRESUME it (we cannot
> > deduce
> > + * this from other state, because it's possible that this VMCS had once been
> > + * launched, but has since been cleared after a CPU switch).
> > + */
> > +struct saved_vmcs {
> > +	struct vmcs *vmcs;
> > +	int cpu;
> > +	int launched;
> > +};
> 
> "saved" looks a bit misleading here. It's simply a list of all active vmcs02
> tracked
> by kvm, isn't it?
> 
> > +
> > +/* Used to remember the last vmcs02 used for some recently used vmcs12s
> > */
> > +struct vmcs02_list {
> > +	struct list_head list;
> > +	gpa_t vmcs12_addr;
> 
> uniform the name 'vmptr' as nested_vmx strucure:
>  /* The guest-physical address of the current VMCS L1 keeps for L2 */
> 	gpa_t current_vmptr;
> 	/* The host-usable pointer to the above */
> 	struct page *current_vmcs12_page;
> 	struct vmcs12 *current_vmcs12;
> 
> you should keep consistent meaning for vmcs12, which means the arch-neutral
> state interpreted by KVM only.
> 
> > +	struct saved_vmcs vmcs02;
> > +};
> > +
> > +/*
> >   * The nested_vmx structure is part of vcpu_vmx, and holds information we
> > need
> >   * for correct emulation of VMX (i.e., nested VMX) on this vcpu.
> >   */
> > @@ -178,6 +203,10 @@ struct nested_vmx {
> >  	/* The host-usable pointer to the above */
> >  	struct page *current_vmcs12_page;
> >  	struct vmcs12 *current_vmcs12;
> > +
> > +	/* vmcs02_list cache of VMCSs recently used to run L2 guests */
> > +	struct list_head vmcs02_pool;
> > +	int vmcs02_num;
> >  };
> >
> >  struct vcpu_vmx {
> > @@ -4200,6 +4229,111 @@ static int handle_invalid_op(struct kvm_
> >  }
> >
> >  /*
> > + * To run an L2 guest, we need a vmcs02 based the L1-specified vmcs12.
> > + * We could reuse a single VMCS for all the L2 guests, but we also want the
> > + * option to allocate a separate vmcs02 for each separate loaded vmcs12 -
> > this
> > + * allows keeping them loaded on the processor, and in the future will allow
> > + * optimizations where prepare_vmcs02 doesn't need to set all the fields on
> > + * every entry if they never change.
> > + * So we keep, in vmx->nested.vmcs02_pool, a cache of size
> > VMCS02_POOL_SIZE
> > + * (>=0) with a vmcs02 for each recently loaded vmcs12s, most recent first.
> > + *
> > + * The following functions allocate and free a vmcs02 in this pool.
> > + */
> > +
> > +static void __nested_free_saved_vmcs(void *arg)
> > +{
> > +	struct saved_vmcs *saved_vmcs = arg;
> > +
> > +	vmcs_clear(saved_vmcs->vmcs);
> > +	if (per_cpu(current_vmcs, saved_vmcs->cpu) == saved_vmcs->vmcs)
> > +		per_cpu(current_vmcs, saved_vmcs->cpu) = NULL;
> > +}
> > +
> > +/*
> > + * Free a VMCS, but before that VMCLEAR it on the CPU where it was last
> > loaded
> > + * (the necessary information is in the saved_vmcs structure).
> > + * See also vcpu_clear() (with different parameters and side-effects)
> > + */
> > +static void nested_free_saved_vmcs(struct vcpu_vmx *vmx,
> > +		struct saved_vmcs *saved_vmcs)
> > +{
> > +	if (saved_vmcs->cpu != -1)
> > +		smp_call_function_single(saved_vmcs->cpu,
> > +				__nested_free_saved_vmcs, saved_vmcs, 1);
> > +
> > +	free_vmcs(saved_vmcs->vmcs);
> > +}
> > +
> > +/* Free and remove from pool a vmcs02 saved for a vmcs12 (if there is one)
> */
> > +static void nested_free_vmcs02(struct vcpu_vmx *vmx, gpa_t vmptr)
> > +{
> > +	struct vmcs02_list *item;
> > +	list_for_each_entry(item, &vmx->nested.vmcs02_pool, list)
> > +		if (item->vmcs12_addr == vmptr) {
> > +			nested_free_saved_vmcs(vmx, &item->vmcs02);
> > +			list_del(&item->list);
> > +			kfree(item);
> > +			vmx->nested.vmcs02_num--;
> > +			return;
> > +		}
> > +}
> > +
> > +/*
> > + * Free all VMCSs saved for this vcpu, except the actual vmx->vmcs.
> > + * These include the VMCSs in vmcs02_pool (except the one currently used,
> > + * if running L2), and saved_vmcs01 when running L2.
> > + */
> > +static void nested_free_all_saved_vmcss(struct vcpu_vmx *vmx)
> > +{
> > +	struct vmcs02_list *item, *n;
> > +	list_for_each_entry_safe(item, n, &vmx->nested.vmcs02_pool, list) {
> > +		if (vmx->vmcs != item->vmcs02.vmcs)
> > +			nested_free_saved_vmcs(vmx, &item->vmcs02);
> > +		list_del(&item->list);
> > +		kfree(item);
> > +	}
> > +	vmx->nested.vmcs02_num = 0;
> > +}
> > +
> > +/* Get a vmcs02 for the current vmcs12. */
> > +static struct saved_vmcs *nested_get_current_vmcs02(struct vcpu_vmx
> > *vmx)
> > +{
> > +	struct vmcs02_list *item;
> > +	list_for_each_entry(item, &vmx->nested.vmcs02_pool, list)
> > +		if (item->vmcs12_addr == vmx->nested.current_vmptr) {
> > +			list_move(&item->list, &vmx->nested.vmcs02_pool);
> > +			return &item->vmcs02;
> > +		}
> > +
> > +	if (vmx->nested.vmcs02_num >= max(VMCS02_POOL_SIZE, 1)) {
> > +		/* Recycle the least recently used VMCS. */
> > +		item = list_entry(vmx->nested.vmcs02_pool.prev,
> > +			struct vmcs02_list, list);
> > +		item->vmcs12_addr = vmx->nested.current_vmptr;
> > +		list_move(&item->list, &vmx->nested.vmcs02_pool);
> > +		return &item->vmcs02;

btw, shouldn't you clear recycled VMCS and reset 'cpu' and 'launched' fields?

Have you tried SMP L2 guest?

Thanks
Kevin

> > +	}
> > +
> > +	/* Create a new vmcs02 */
> > +	item = (struct vmcs02_list *)
> > +		kmalloc(sizeof(struct vmcs02_list), GFP_KERNEL);
> > +	if (!item)
> > +		return NULL;
> > +	item->vmcs02.vmcs = alloc_vmcs();
> > +	if (!item->vmcs02.vmcs) {
> > +		kfree(item);
> > +		return NULL;
> > +	}
> > +	item->vmcs12_addr = vmx->nested.current_vmptr;
> > +	item->vmcs02.cpu = -1;
> > +	item->vmcs02.launched = 0;
> > +	list_add(&(item->list), &(vmx->nested.vmcs02_pool));
> > +	vmx->nested.vmcs02_num++;
> > +	return &item->vmcs02;
> > +}
> > +
> > +/*
> >   * Emulate the VMXON instruction.
> >   * Currently, we just remember that VMX is active, and do not save or even
> >   * inspect the argument to VMXON (the so-called "VMXON pointer")
> because
> > we
> > @@ -4235,6 +4369,9 @@ static int handle_vmon(struct kvm_vcpu *
> >  		return 1;
> >  	}
> >
> > +	INIT_LIST_HEAD(&(vmx->nested.vmcs02_pool));
> > +	vmx->nested.vmcs02_num = 0;
> > +
> >  	vmx->nested.vmxon = true;
> >
> >  	skip_emulated_instruction(vcpu);
> > @@ -4286,6 +4423,8 @@ static void free_nested(struct vcpu_vmx
> >  		vmx->nested.current_vmptr = -1ull;
> >  		vmx->nested.current_vmcs12 = NULL;
> >  	}
> > +
> > +	nested_free_all_saved_vmcss(vmx);
> >  }
> >
> >  /* Emulate the VMXOFF instruction */
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 07/31] nVMX: Introduce vmcs02: VMCS used to run L2
  2011-05-20  8:48     ` Tian, Kevin
@ 2011-05-20 20:32       ` Nadav Har'El
  2011-05-22  2:00         ` Tian, Kevin
  0 siblings, 1 reply; 118+ messages in thread
From: Nadav Har'El @ 2011-05-20 20:32 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: kvm, gleb, avi

On Fri, May 20, 2011, Tian, Kevin wrote about "RE: [PATCH 07/31] nVMX: Introduce vmcs02: VMCS used to run L2":
> btw, shouldn't you clear recycled VMCS and reset 'cpu' and 'launched' fields?

Well, I believe the answer is "no": As far as I understood, a host is allowed
to take a VMCS that has been used once to launch a certain guest, and then
modify all the VMCS's fields to define a completely different guest, and then
VMRESUME it, without doing the regular VMCLEAR/VMLAUNCH, even though it's
"a different guest". Is there something wrong in my assumption? Does VMX keep
anything constant between successive VMRESUMEs?

> Have you tried SMP L2 guest?

It "sort of" works, but it *does* appear to still have a bug which I didn't
yet have the time to hunt... In one case, for example, an 8-vcpu L2 on an
8-vcpu L1 seemed to work well (e.g., doing parallel make) for about a minute,
and then hung with some sort of page fault in the kernel.

Nadav.

-- 
Nadav Har'El                        |       Friday, May 20 2011, 17 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Anyone is entitled to their own opinions.
http://nadav.harel.org.il           |No one is entitled to their own facts.

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: [PATCH 07/31] nVMX: Introduce vmcs02: VMCS used to run L2
  2011-05-20 20:32       ` Nadav Har'El
@ 2011-05-22  2:00         ` Tian, Kevin
  2011-05-22  7:22           ` Nadav Har'El
  0 siblings, 1 reply; 118+ messages in thread
From: Tian, Kevin @ 2011-05-22  2:00 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb, avi

> From: Nadav Har'El [mailto:nyh@math.technion.ac.il]
> Sent: Saturday, May 21, 2011 4:32 AM
> 
> On Fri, May 20, 2011, Tian, Kevin wrote about "RE: [PATCH 07/31] nVMX:
> Introduce vmcs02: VMCS used to run L2":
> > btw, shouldn't you clear recycled VMCS and reset 'cpu' and 'launched' fields?
> 
> Well, I believe the answer is "no": As far as I understood, a host is allowed
> to take a VMCS that has been used once to launch a certain guest, and then
> modify all the VMCS's fields to define a completely different guest, and then
> VMRESUME it, without doing the regular VMCLEAR/VMLAUNCH, even though
> it's
> "a different guest". Is there something wrong in my assumption? Does VMX
> keep
> anything constant between successive VMRESUMEs?

Yes, you can reuse a VMCS with a completely different state if the VMCS is used
on the same processor, but you must ensure that the VMCS does not have dirty
state on any other processor. The SDM 3B (21.10.1) explicitly requires:

----

No VMCS should ever be active on more than one logical processor. If a VMCS is to be
"migrated" from one logical processor to another, the first logical processor should
execute VMCLEAR for the VMCS (to make it inactive on that logical processor and to
ensure that all VMCS data are in memory) before the other logical processor
executes VMPTRLD for the VMCS (to make it active on the second logical processor).
A VMCS that is made active on more than one logical processor may become
corrupted

----

Here the vmcs02 being overridden may have been run on another processor before
but has not been VMCLEARed yet. When you resume this vmcs02 with new content on
a different processor, there is a risk of corruption.
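
For example, something like this (a rough sketch only; it reuses the
__nested_free_saved_vmcs helper from patch 07, which only VMCLEARs and does
not actually free anything, and the function name here is made up) could be
called on the recycled entry before it is given a new vmptr:

	/* sketch: VMCLEAR a recycled vmcs02 on the CPU where it was last
	 * loaded, and forget that state, before reusing it */
	static void nested_recycle_saved_vmcs(struct saved_vmcs *vmcs02)
	{
		if (vmcs02->cpu != -1)
			smp_call_function_single(vmcs02->cpu,
					__nested_free_saved_vmcs, vmcs02, 1);
		vmcs02->cpu = -1;
		vmcs02->launched = 0;
	}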

> 
> > Have you tried SMP L2 guest?
> 
> It "sort of" works, but it *does* appear to still have a bug which I didn't
> yet have the time to hunt... In one case, for example, an 8-vcpu L2 on an
> 8-vcpu L1 seemed to work well (e.g., doing parallel make) for about a minute,
> and then hung with some sort of page fault in the kernel.
> 

See whether cleaning up above can help here.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 07/31] nVMX: Introduce vmcs02: VMCS used to run L2
  2011-05-22  2:00         ` Tian, Kevin
@ 2011-05-22  7:22           ` Nadav Har'El
  2011-05-24  0:54             ` Tian, Kevin
  0 siblings, 1 reply; 118+ messages in thread
From: Nadav Har'El @ 2011-05-22  7:22 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: kvm, gleb, avi

Hi,

On Sun, May 22, 2011, Tian, Kevin wrote about "RE: [PATCH 07/31] nVMX: Introduce vmcs02: VMCS used to run L2":
> Here the vmcs02 being overridden may have been run on another processor before
> but is not vmclear-ed yet. When you resume this vmcs02 with new content on a 
> separate processor, the risk of corruption exists.

I still believe that my current code is correct (in this area). I'll try to
explain it here and would be grateful if you could point to me the error (if
there is one) in my logic:

nested_vmx_run() is our function which switches from running L1 to L2
(patch 18).

This function starts by calling nested_get_current_vmcs02(), which gets us
*some* vmcs to use for vmcs02. This may be a brand-new VMCS, or a "recycled"
VMCS: one we've previously used to run some (potentially different) L2 guest
on some (potentially different) CPU.
nested_get_current_vmcs02() returns a "saved_vmcs" structure, which
not only contains a VMCS, but also remembers on which (if any) cpu it is
currently loaded (and whether it was VMLAUNCHed once on that cpu).

The next thing that nested_vmx_run() now does is to set up in the vcpu object
the vmcs, cpu and launched fields according to what was returned above.

Now it calls vmx_vcpu_load(). This standard KVM function checks if we're now
running on a different CPU from vcpu->cpu, and if it is a different one, it
uses vcpu_clear() to VMCLEAR the vmcs on the CPU where it was last loaded
(using an IPI). Only after it VMCLEARs the VMCS on the old CPU can it finally
load the VMCS on the new CPU.

Only now can nested_vmx_run() call prepare_vmcs02(), which starts VMWRITEing
to this VMCS, and finally return.
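
In other words, the beginning of nested_vmx_run() does roughly the following
(a condensed sketch of the sequence described above, not the verbatim patch
code):

	struct saved_vmcs *vmcs02 = nested_get_current_vmcs02(vmx);
	int cpu;

	if (!vmcs02)
		return -ENOMEM;	/* sketch: really, fail the nested entry */

	/* adopt the vmcs02, and remember where (if anywhere) it last ran */
	vmx->vmcs = vmcs02->vmcs;
	vcpu->cpu = vmcs02->cpu;
	vmx->launched = vmcs02->launched;

	cpu = get_cpu();
	/* vmx_vcpu_load() IPI-VMCLEARs the vmcs02 on its old CPU if that is
	 * not the current one, and then VMPTRLDs it here */
	vmx_vcpu_load(vcpu, cpu);
	vcpu->cpu = cpu;
	put_cpu();

	/* only now does prepare_vmcs02() start VMWRITEing to it */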

P.S. Seeing that you're from Intel, maybe you can help me with a pointer:
I found what appears to be a small error in the SDM - who can I report it to?

Thanks,
Nadav.

-- 
Nadav Har'El                        |       Sunday, May 22 2011, 18 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |I work for money. If you want loyalty,
http://nadav.harel.org.il           |buy yourself a dog.

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 07/31] nVMX: Introduce vmcs02: VMCS used to run L2
  2011-05-20  8:04   ` Tian, Kevin
  2011-05-20  8:48     ` Tian, Kevin
@ 2011-05-22  8:29     ` Nadav Har'El
  2011-05-24  1:03       ` Tian, Kevin
  1 sibling, 1 reply; 118+ messages in thread
From: Nadav Har'El @ 2011-05-22  8:29 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: kvm, gleb, avi

Hi,

On Fri, May 20, 2011, Tian, Kevin wrote about "RE: [PATCH 07/31] nVMX: Introduce vmcs02: VMCS used to run L2":
> Possibly we can maintain the vmcs02 pool along with L1 VMCLEAR ops, which
> is similar to the hardware behavior regarding to cleared and launched state.

If you set VMCS02_POOL_SIZE to a large size, and L1, like typical hypervisors,
only keeps around a few VMCSs (and VMCLEARs the ones it will not use again),
then we'll only have a few vmcs02: handle_vmclear() removes from the pool the
vmcs02 that L1 explicitly told us it won't need again.
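
In code terms, the VMCLEAR emulation simply drops the cached vmcs02 for the
operand L1 gave us, roughly (a sketch; the real handle_vmclear() of course
does more than just this):

	/* sketch: inside the VMCLEAR emulation, after reading the vmptr
	 * operand from the guest */
	nested_free_vmcs02(vmx, vmptr);

so the pool only grows for vmcs12s that L1 actually keeps around.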

> > +struct saved_vmcs {
> > +	struct vmcs *vmcs;
> > +	int cpu;
> > +	int launched;
> > +};
> 
> "saved" looks a bit misleading here. It's simply a list of all active vmcs02 tracked
> by kvm, isn't it?

I have rewritten this part of the code, based on Avi's and Marcelo's requests,
and the new name for this structure is "loaded_vmcs", i.e., a structure
describing where a VMCS was loaded.


> > +/* Used to remember the last vmcs02 used for some recently used vmcs12s
> > */
> > +struct vmcs02_list {
> > +	struct list_head list;
> > +	gpa_t vmcs12_addr;
> 
> uniform the name 'vmptr' as nested_vmx strucure:

Ok. Changing all the mentions of "vmcs12_addr" to vmptr.

-- 
Nadav Har'El                        |       Sunday, May 22 2011, 18 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |"A witty saying proves nothing." --
http://nadav.harel.org.il           |Voltaire

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-18 12:08                     ` Marcelo Tosatti
  2011-05-18 12:19                       ` Nadav Har'El
@ 2011-05-22  8:57                       ` Nadav Har'El
  2011-05-23 15:49                         ` Avi Kivity
  1 sibling, 1 reply; 118+ messages in thread
From: Nadav Har'El @ 2011-05-22  8:57 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, kvm, gleb

On Wed, May 18, 2011, Marcelo Tosatti wrote about "Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling":
> Humpf, right. OK, you can handle the x86.c usage with
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>...

Hi Avi and Marcelo, here is a the new first patch to the nvmx patch set,
which overhauls the handling of vmcss on cpus, as you asked.

As you guessed, the nested entry and exit code becomes much simpler and
cleaner, with the whole VMCS switching code on entry, for example, reduced
to:
	cpu = get_cpu();
	vmx->loaded_vmcs = vmcs02;
	vmx_vcpu_put(vcpu);
	vmx_vcpu_load(vcpu, cpu);
	vcpu->cpu = cpu;
	put_cpu();

You can apply this patch separately from the rest of the patch set, if you
wish. I'm sending just this one, like you asked - and can send the rest of
the patches when you ask me to.


Subject: [PATCH 01/31] nVMX: Keep list of loaded VMCSs, instead of vcpus.

In VMX, before we bring down a CPU we must VMCLEAR all VMCSs loaded on it
because (at least in theory) the processor might not have written all of its
content back to memory. Since a patch from June 26, 2008, this is done using
a per-cpu "vcpus_on_cpu" linked list of vcpus loaded on each CPU.

The problem is that with nested VMX, we no longer have the concept of a
vcpu being loaded on a cpu: A vcpu has multiple VMCSs (one for L1, a pool for
L2s), and each of those may have been last loaded on a different cpu.

So instead of linking the vcpus, we link the VMCSs, using a new structure
loaded_vmcs. This structure contains the VMCS, and the information pertaining
to its loading on a specific cpu (namely, the cpu number, and whether it
was already launched on this cpu once). In nested we will also use the same
structure to hold L2 VMCSs, and vmx->loaded_vmcs is a pointer to the
currently active VMCS.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  129 ++++++++++++++++++++++++++-----------------
 arch/x86/kvm/x86.c |    3 -
 2 files changed, 80 insertions(+), 52 deletions(-)

--- .before/arch/x86/kvm/x86.c	2011-05-22 11:41:57.000000000 +0300
+++ .after/arch/x86/kvm/x86.c	2011-05-22 11:41:57.000000000 +0300
@@ -2119,7 +2119,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu 
 	if (need_emulate_wbinvd(vcpu)) {
 		if (kvm_x86_ops->has_wbinvd_exit())
 			cpumask_set_cpu(cpu, vcpu->arch.wbinvd_dirty_mask);
-		else if (vcpu->cpu != -1 && vcpu->cpu != cpu)
+		else if (vcpu->cpu != -1 && vcpu->cpu != cpu
+				&& cpu_online(vcpu->cpu))
 			smp_call_function_single(vcpu->cpu,
 					wbinvd_ipi, NULL, 1);
 	}
--- .before/arch/x86/kvm/vmx.c	2011-05-22 11:41:57.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-22 11:41:58.000000000 +0300
@@ -116,6 +116,18 @@ struct vmcs {
 	char data[0];
 };
 
+/*
+ * Track a VMCS that may be loaded on a certain CPU. If it is (cpu!=-1), also
+ * remember whether it was VMLAUNCHed, and maintain a linked list of all VMCSs
+ * loaded on this CPU (so we can clear them if the CPU goes down).
+ */
+struct loaded_vmcs {
+	struct vmcs *vmcs;
+	int cpu;
+	int launched;
+	struct list_head loaded_vmcss_on_cpu_link;
+};
+
 struct shared_msr_entry {
 	unsigned index;
 	u64 data;
@@ -124,9 +136,7 @@ struct shared_msr_entry {
 
 struct vcpu_vmx {
 	struct kvm_vcpu       vcpu;
-	struct list_head      local_vcpus_link;
 	unsigned long         host_rsp;
-	int                   launched;
 	u8                    fail;
 	u8                    cpl;
 	bool                  nmi_known_unmasked;
@@ -140,7 +150,14 @@ struct vcpu_vmx {
 	u64 		      msr_host_kernel_gs_base;
 	u64 		      msr_guest_kernel_gs_base;
 #endif
-	struct vmcs          *vmcs;
+	/*
+	 * loaded_vmcs points to the VMCS currently used in this vcpu. For a
+	 * non-nested (L1) guest, it always points to vmcs01. For a nested
+	 * guest (L2), it points to a different VMCS.
+	 */
+	struct loaded_vmcs    vmcs01;
+	struct loaded_vmcs   *loaded_vmcs;
+	bool                  __launched; /* temporary, used in vmx_vcpu_run */
 	struct msr_autoload {
 		unsigned nr;
 		struct vmx_msr_entry guest[NR_AUTOLOAD_MSRS];
@@ -200,7 +217,11 @@ static int vmx_set_tss_addr(struct kvm *
 
 static DEFINE_PER_CPU(struct vmcs *, vmxarea);
 static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
-static DEFINE_PER_CPU(struct list_head, vcpus_on_cpu);
+/*
+ * We maintain a per-CPU linked-list of VMCS loaded on that CPU. This is needed
+ * when a CPU is brought down, and we need to VMCLEAR all VMCSs loaded on it.
+ */
+static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
 static DEFINE_PER_CPU(struct desc_ptr, host_gdt);
 
 static unsigned long *vmx_io_bitmap_a;
@@ -514,25 +535,25 @@ static void vmcs_load(struct vmcs *vmcs)
 		       vmcs, phys_addr);
 }
 
-static void __vcpu_clear(void *arg)
+static void __loaded_vmcs_clear(void *arg)
 {
-	struct vcpu_vmx *vmx = arg;
+	struct loaded_vmcs *loaded_vmcs = arg;
 	int cpu = raw_smp_processor_id();
 
-	if (vmx->vcpu.cpu == cpu)
-		vmcs_clear(vmx->vmcs);
-	if (per_cpu(current_vmcs, cpu) == vmx->vmcs)
+	if (loaded_vmcs->cpu == cpu)
+		vmcs_clear(loaded_vmcs->vmcs);
+	if (per_cpu(current_vmcs, cpu) == loaded_vmcs->vmcs)
 		per_cpu(current_vmcs, cpu) = NULL;
-	list_del(&vmx->local_vcpus_link);
-	vmx->vcpu.cpu = -1;
-	vmx->launched = 0;
+	list_del(&loaded_vmcs->loaded_vmcss_on_cpu_link);
+	loaded_vmcs->cpu = -1;
+	loaded_vmcs->launched = 0;
 }
 
-static void vcpu_clear(struct vcpu_vmx *vmx)
+static void loaded_vmcs_clear(struct loaded_vmcs *loaded_vmcs)
 {
-	if (vmx->vcpu.cpu == -1)
-		return;
-	smp_call_function_single(vmx->vcpu.cpu, __vcpu_clear, vmx, 1);
+	if (loaded_vmcs->cpu != -1)
+		smp_call_function_single(
+			loaded_vmcs->cpu, __loaded_vmcs_clear, loaded_vmcs, 1);
 }
 
 static inline void vpid_sync_vcpu_single(struct vcpu_vmx *vmx)
@@ -971,22 +992,22 @@ static void vmx_vcpu_load(struct kvm_vcp
 
 	if (!vmm_exclusive)
 		kvm_cpu_vmxon(phys_addr);
-	else if (vcpu->cpu != cpu)
-		vcpu_clear(vmx);
+	else if (vmx->loaded_vmcs->cpu != cpu)
+		loaded_vmcs_clear(vmx->loaded_vmcs);
 
-	if (per_cpu(current_vmcs, cpu) != vmx->vmcs) {
-		per_cpu(current_vmcs, cpu) = vmx->vmcs;
-		vmcs_load(vmx->vmcs);
+	if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
+		per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
+		vmcs_load(vmx->loaded_vmcs->vmcs);
 	}
 
-	if (vcpu->cpu != cpu) {
+	if (vmx->loaded_vmcs->cpu != cpu) {
 		struct desc_ptr *gdt = &__get_cpu_var(host_gdt);
 		unsigned long sysenter_esp;
 
 		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
 		local_irq_disable();
-		list_add(&vmx->local_vcpus_link,
-			 &per_cpu(vcpus_on_cpu, cpu));
+		list_add(&vmx->loaded_vmcs->loaded_vmcss_on_cpu_link,
+			 &per_cpu(loaded_vmcss_on_cpu, cpu));
 		local_irq_enable();
 
 		/*
@@ -999,13 +1020,15 @@ static void vmx_vcpu_load(struct kvm_vcp
 		rdmsrl(MSR_IA32_SYSENTER_ESP, sysenter_esp);
 		vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */
 	}
+	vmx->loaded_vmcs->cpu = cpu;
 }
 
 static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
 {
 	__vmx_load_host_state(to_vmx(vcpu));
 	if (!vmm_exclusive) {
-		__vcpu_clear(to_vmx(vcpu));
+		__loaded_vmcs_clear(to_vmx(vcpu)->loaded_vmcs);
+		vcpu->cpu = -1;
 		kvm_cpu_vmxoff();
 	}
 }
@@ -1469,7 +1492,7 @@ static int hardware_enable(void *garbage
 	if (read_cr4() & X86_CR4_VMXE)
 		return -EBUSY;
 
-	INIT_LIST_HEAD(&per_cpu(vcpus_on_cpu, cpu));
+	INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
 	rdmsrl(MSR_IA32_FEATURE_CONTROL, old);
 
 	test_bits = FEATURE_CONTROL_LOCKED;
@@ -1493,14 +1516,14 @@ static int hardware_enable(void *garbage
 	return 0;
 }
 
-static void vmclear_local_vcpus(void)
+static void vmclear_local_loaded_vmcss(void)
 {
 	int cpu = raw_smp_processor_id();
-	struct vcpu_vmx *vmx, *n;
+	struct loaded_vmcs *v, *n;
 
-	list_for_each_entry_safe(vmx, n, &per_cpu(vcpus_on_cpu, cpu),
-				 local_vcpus_link)
-		__vcpu_clear(vmx);
+	list_for_each_entry_safe(v, n, &per_cpu(loaded_vmcss_on_cpu, cpu),
+				 loaded_vmcss_on_cpu_link)
+		__loaded_vmcs_clear(v);
 }
 
 
@@ -1515,7 +1538,7 @@ static void kvm_cpu_vmxoff(void)
 static void hardware_disable(void *garbage)
 {
 	if (vmm_exclusive) {
-		vmclear_local_vcpus();
+		vmclear_local_loaded_vmcss();
 		kvm_cpu_vmxoff();
 	}
 	write_cr4(read_cr4() & ~X86_CR4_VMXE);
@@ -1696,6 +1719,18 @@ static void free_vmcs(struct vmcs *vmcs)
 	free_pages((unsigned long)vmcs, vmcs_config.order);
 }
 
+/*
+ * Free a VMCS, but before that VMCLEAR it on the CPU where it was last loaded
+ */
+static void free_loaded_vmcs(struct loaded_vmcs *loaded_vmcs)
+{
+	if (!loaded_vmcs->vmcs)
+		return;
+	loaded_vmcs_clear(loaded_vmcs);
+	free_vmcs(loaded_vmcs->vmcs);
+	loaded_vmcs->vmcs = NULL;
+}
+
 static void free_kvm_area(void)
 {
 	int cpu;
@@ -4166,6 +4201,7 @@ static void __noclone vmx_vcpu_run(struc
 	if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
 		vmx_set_interrupt_shadow(vcpu, 0);
 
+	vmx->__launched = vmx->loaded_vmcs->launched;
 	asm(
 		/* Store host registers */
 		"push %%"R"dx; push %%"R"bp;"
@@ -4236,7 +4272,7 @@ static void __noclone vmx_vcpu_run(struc
 		"pop  %%"R"bp; pop  %%"R"dx \n\t"
 		"setbe %c[fail](%0) \n\t"
 	      : : "c"(vmx), "d"((unsigned long)HOST_RSP),
-		[launched]"i"(offsetof(struct vcpu_vmx, launched)),
+		[launched]"i"(offsetof(struct vcpu_vmx, __launched)),
 		[fail]"i"(offsetof(struct vcpu_vmx, fail)),
 		[host_rsp]"i"(offsetof(struct vcpu_vmx, host_rsp)),
 		[rax]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_RAX])),
@@ -4276,7 +4312,7 @@ static void __noclone vmx_vcpu_run(struc
 	vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
 
 	asm("mov %0, %%ds; mov %0, %%es" : : "r"(__USER_DS));
-	vmx->launched = 1;
+	vmx->loaded_vmcs->launched = 1;
 
 	vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);
 
@@ -4288,23 +4324,12 @@ static void __noclone vmx_vcpu_run(struc
 #undef R
 #undef Q
 
-static void vmx_free_vmcs(struct kvm_vcpu *vcpu)
-{
-	struct vcpu_vmx *vmx = to_vmx(vcpu);
-
-	if (vmx->vmcs) {
-		vcpu_clear(vmx);
-		free_vmcs(vmx->vmcs);
-		vmx->vmcs = NULL;
-	}
-}
-
 static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
 	free_vpid(vmx);
-	vmx_free_vmcs(vcpu);
+	free_loaded_vmcs(vmx->loaded_vmcs);
 	kfree(vmx->guest_msrs);
 	kvm_vcpu_uninit(vcpu);
 	kmem_cache_free(kvm_vcpu_cache, vmx);
@@ -4344,11 +4369,13 @@ static struct kvm_vcpu *vmx_create_vcpu(
 		goto uninit_vcpu;
 	}
 
-	vmx->vmcs = alloc_vmcs();
-	if (!vmx->vmcs)
+	vmx->loaded_vmcs = &vmx->vmcs01;
+	vmx->loaded_vmcs->vmcs = alloc_vmcs();
+	if (!vmx->loaded_vmcs->vmcs)
 		goto free_msrs;
-
-	vmcs_init(vmx->vmcs);
+	vmcs_init(vmx->loaded_vmcs->vmcs);
+	vmx->loaded_vmcs->cpu = -1;
+	vmx->loaded_vmcs->launched = 0;
 
 	cpu = get_cpu();
 	vmx_vcpu_load(&vmx->vcpu, cpu);
@@ -4377,7 +4404,7 @@ static struct kvm_vcpu *vmx_create_vcpu(
 	return &vmx->vcpu;
 
 free_vmcs:
-	free_vmcs(vmx->vmcs);
+	free_vmcs(vmx->loaded_vmcs->vmcs);
 free_msrs:
 	kfree(vmx->guest_msrs);
 uninit_vcpu:

-- 
Nadav Har'El                        |       Sunday, May 22 2011, 18 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |I think therefore I am. My computer
http://nadav.harel.org.il           |thinks for me, therefore I am not.

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-22  8:57                       ` Nadav Har'El
@ 2011-05-23 15:49                         ` Avi Kivity
  2011-05-23 16:17                           ` Gleb Natapov
                                             ` (3 more replies)
  0 siblings, 4 replies; 118+ messages in thread
From: Avi Kivity @ 2011-05-23 15:49 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Marcelo Tosatti, kvm, gleb, Roedel, Joerg

On 05/22/2011 11:57 AM, Nadav Har'El wrote:
> Hi Avi and Marcelo, here is a the new first patch to the nvmx patch set,
> which overhauls the handling of vmcss on cpus, as you asked.
>
> As you guessed, the nested entry and exit code becomes much simpler and
> cleaner, with the whole VMCS switching code on entry, for example, reduced
> to:
> 	cpu = get_cpu();
> 	vmx->loaded_vmcs = vmcs02;
> 	vmx_vcpu_put(vcpu);
> 	vmx_vcpu_load(vcpu, cpu);
> 	vcpu->cpu = cpu;
> 	put_cpu();

That's wonderful, it indicates the code is much better integrated.  
Perhaps later we can refine it  to have separate _load and _put for 
host-related and guest-related parts (I think they already exist in the 
code, except they are always called together), but that is an 
optimization, and not the most important one by far.

> You can apply this patch separately from the rest of the patch set, if you
> wish. I'm sending just this one, like you asked - and can send the rest of
> the patches when you ask me to.
>
>
> Subject: [PATCH 01/31] nVMX: Keep list of loaded VMCSs, instead of vcpus.
>
> In VMX, before we bring down a CPU we must VMCLEAR all VMCSs loaded on it
> because (at least in theory) the processor might not have written all of its
> content back to memory. Since a patch from June 26, 2008, this is done using
> a per-cpu "vcpus_on_cpu" linked list of vcpus loaded on each CPU.
>
> The problem is that with nested VMX, we no longer have the concept of a
> vcpu being loaded on a cpu: A vcpu has multiple VMCSs (one for L1, a pool for
> L2s), and each of those may have been last loaded on a different cpu.
>
> So instead of linking the vcpus, we link the VMCSs, using a new structure
> loaded_vmcs. This structure contains the VMCS, and the information pertaining
> to its loading on a specific cpu (namely, the cpu number, and whether it
> was already launched on this cpu once). In nested we will also use the same
> structure to hold L2 VMCSs, and vmx->loaded_vmcs is a pointer to the
> currently active VMCS.
>
> --- .before/arch/x86/kvm/x86.c	2011-05-22 11:41:57.000000000 +0300
> +++ .after/arch/x86/kvm/x86.c	2011-05-22 11:41:57.000000000 +0300
> @@ -2119,7 +2119,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu
>   	if (need_emulate_wbinvd(vcpu)) {
>   		if (kvm_x86_ops->has_wbinvd_exit())
>   			cpumask_set_cpu(cpu, vcpu->arch.wbinvd_dirty_mask);
> -		else if (vcpu->cpu != -1&&  vcpu->cpu != cpu)
> +		else if (vcpu->cpu != -1&&  vcpu->cpu != cpu
> +				&&  cpu_online(vcpu->cpu))
>   			smp_call_function_single(vcpu->cpu,
>   					wbinvd_ipi, NULL, 1);
>   	}

Is this a necessary part of this patch?  Or an semi-related bugfix?

I think that it can't actually trigger before this patch due to luck.  
svm doesn't clear vcpu->cpu on cpu offline, but on the other hand it 
->has_wbinvd_exit().

Joerg, is

     if (unlikely(cpu != vcpu->cpu)) {
         svm->asid_generation = 0;
         mark_all_dirty(svm->vmcb);
     }

susceptible to cpu offline/online?

> @@ -971,22 +992,22 @@ static void vmx_vcpu_load(struct kvm_vcp
>
>   	if (!vmm_exclusive)
>   		kvm_cpu_vmxon(phys_addr);
> -	else if (vcpu->cpu != cpu)
> -		vcpu_clear(vmx);
> +	else if (vmx->loaded_vmcs->cpu != cpu)
> +		loaded_vmcs_clear(vmx->loaded_vmcs);
>
> -	if (per_cpu(current_vmcs, cpu) != vmx->vmcs) {
> -		per_cpu(current_vmcs, cpu) = vmx->vmcs;
> -		vmcs_load(vmx->vmcs);
> +	if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
> +		per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
> +		vmcs_load(vmx->loaded_vmcs->vmcs);
>   	}
>
> -	if (vcpu->cpu != cpu) {
> +	if (vmx->loaded_vmcs->cpu != cpu) {
>   		struct desc_ptr *gdt =&__get_cpu_var(host_gdt);
>   		unsigned long sysenter_esp;
>
>   		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
>   		local_irq_disable();
> -		list_add(&vmx->local_vcpus_link,
> -			&per_cpu(vcpus_on_cpu, cpu));
> +		list_add(&vmx->loaded_vmcs->loaded_vmcss_on_cpu_link,
> +			&per_cpu(loaded_vmcss_on_cpu, cpu));
>   		local_irq_enable();
>
>   		/*
> @@ -999,13 +1020,15 @@ static void vmx_vcpu_load(struct kvm_vcp
>   		rdmsrl(MSR_IA32_SYSENTER_ESP, sysenter_esp);
>   		vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */
>   	}
> +	vmx->loaded_vmcs->cpu = cpu;

This should be within the if () block.

> @@ -4344,11 +4369,13 @@ static struct kvm_vcpu *vmx_create_vcpu(
>   		goto uninit_vcpu;
>   	}
>
> -	vmx->vmcs = alloc_vmcs();
> -	if (!vmx->vmcs)
> +	vmx->loaded_vmcs =&vmx->vmcs01;
> +	vmx->loaded_vmcs->vmcs = alloc_vmcs();
> +	if (!vmx->loaded_vmcs->vmcs)
>   		goto free_msrs;
> -
> -	vmcs_init(vmx->vmcs);
> +	vmcs_init(vmx->loaded_vmcs->vmcs);
> +	vmx->loaded_vmcs->cpu = -1;
> +	vmx->loaded_vmcs->launched = 0;

Perhaps a loaded_vmcs_init() to encapsulate initialization of these 
three fields, you'll probably reuse it later.
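
Something like this (untested sketch; pick whichever of vmcs_init() or
vmcs_clear() is appropriate):

     static void loaded_vmcs_init(struct loaded_vmcs *loaded_vmcs)
     {
         vmcs_clear(loaded_vmcs->vmcs);
         loaded_vmcs->cpu = -1;
         loaded_vmcs->launched = 0;
     }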

Please repost separately after the fix, I'd like to apply it before the 
rest of the series.

(regarding interrupts, I think we can do that work post-merge.  But I'd 
like to see Kevin's comments addressed)

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-23 15:49                         ` Avi Kivity
@ 2011-05-23 16:17                           ` Gleb Natapov
  2011-05-23 18:59                             ` Nadav Har'El
  2011-05-23 16:43                           ` Roedel, Joerg
                                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 118+ messages in thread
From: Gleb Natapov @ 2011-05-23 16:17 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Nadav Har'El, Marcelo Tosatti, kvm, Roedel, Joerg

On Mon, May 23, 2011 at 06:49:17PM +0300, Avi Kivity wrote:
> (regarding interrupts, I think we can do that work post-merge.  But
> I'd like to see Kevin's comments addressed)
> 
To be fair this wasn't addressed for almost two years now.

--
			Gleb.

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-23 15:49                         ` Avi Kivity
  2011-05-23 16:17                           ` Gleb Natapov
@ 2011-05-23 16:43                           ` Roedel, Joerg
  2011-05-23 16:51                             ` Avi Kivity
  2011-05-23 18:51                           ` Nadav Har'El
  2011-05-24  0:57                           ` Tian, Kevin
  3 siblings, 1 reply; 118+ messages in thread
From: Roedel, Joerg @ 2011-05-23 16:43 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Nadav Har'El, Marcelo Tosatti, kvm, gleb

On Mon, May 23, 2011 at 11:49:17AM -0400, Avi Kivity wrote:

> Joerg, is
> 
>      if (unlikely(cpu != vcpu->cpu)) {
>          svm->asid_generation = 0;
>          mark_all_dirty(svm->vmcb);
>      }
> 
> susceptible to cpu offline/online?

I don't think so. This should be safe for cpu offline/online as long as
the cpu-number value is not reused for another physical cpu. But that
should be the case afaik.

	Joerg



^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-23 16:43                           ` Roedel, Joerg
@ 2011-05-23 16:51                             ` Avi Kivity
  2011-05-24  9:22                               ` Roedel, Joerg
  0 siblings, 1 reply; 118+ messages in thread
From: Avi Kivity @ 2011-05-23 16:51 UTC (permalink / raw)
  To: Roedel, Joerg; +Cc: Nadav Har'El, Marcelo Tosatti, kvm, gleb

On 05/23/2011 07:43 PM, Roedel, Joerg wrote:
> On Mon, May 23, 2011 at 11:49:17AM -0400, Avi Kivity wrote:
>
> >  Joerg, is
> >
> >       if (unlikely(cpu != vcpu->cpu)) {
> >           svm->asid_generation = 0;
> >           mark_all_dirty(svm->vmcb);
> >       }
> >
> >  susceptible to cpu offline/online?
>
> I don't think so. This should be safe for cpu offline/online as long as
> the cpu-number value is not reused for another physical cpu. But that
> should be the case afaik.
>

Why not? offline/online does reuse cpu numbers AFAIK (and it must, if 
you have a fully populated machine and offline/online just one cpu).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-23 15:49                         ` Avi Kivity
  2011-05-23 16:17                           ` Gleb Natapov
  2011-05-23 16:43                           ` Roedel, Joerg
@ 2011-05-23 18:51                           ` Nadav Har'El
  2011-05-24  2:22                             ` Tian, Kevin
  2011-05-24  0:57                           ` Tian, Kevin
  3 siblings, 1 reply; 118+ messages in thread
From: Nadav Har'El @ 2011-05-23 18:51 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, kvm, gleb, Roedel, Joerg

Hi, and thanks again for the reviews,

On Mon, May 23, 2011, Avi Kivity wrote about "Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling":
> >  	if (need_emulate_wbinvd(vcpu)) {
> >  		if (kvm_x86_ops->has_wbinvd_exit())
> >  			cpumask_set_cpu(cpu, vcpu->arch.wbinvd_dirty_mask);
> >-		else if (vcpu->cpu != -1&&  vcpu->cpu != cpu)
> >+		else if (vcpu->cpu != -1&&  vcpu->cpu != cpu
> >+				&&  cpu_online(vcpu->cpu))
> >  			smp_call_function_single(vcpu->cpu,
> >  					wbinvd_ipi, NULL, 1);
> >  	}
> 
> Is this a necessary part of this patch?  Or an semi-related bugfix?
> 
> I think that it can't actually trigger before this patch due to luck.  
> svm doesn't clear vcpu->cpu on cpu offline, but on the other hand it 
> ->has_wbinvd_exit().

Well, this was Marcelo's patch: when I suggested that we might have problems
because vcpu->cpu is no longer cleared to -1 when a cpu is offlined, he looked
at the code, said he thinks this is the only place that will have problems,
and offered this patch, which I simply included in mine. I have to admit I
don't understand that part of the code, so I can't judge whether it is
important or not. I'll drop it from my patch for now (and you can apply
Marcelo's patch separately).

> >+	if (vmx->loaded_vmcs->cpu != cpu) {
> >  		struct desc_ptr *gdt =&__get_cpu_var(host_gdt);
> >  		unsigned long sysenter_esp;
> >
> >  		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
> >  		local_irq_disable();
> >-		list_add(&vmx->local_vcpus_link,
> >-			&per_cpu(vcpus_on_cpu, cpu));
> >+		list_add(&vmx->loaded_vmcs->loaded_vmcss_on_cpu_link,
> >+			&per_cpu(loaded_vmcss_on_cpu, cpu));
> >  		local_irq_enable();
> >
> >  		/*
> >@@ -999,13 +1020,15 @@ static void vmx_vcpu_load(struct kvm_vcp
> >  		rdmsrl(MSR_IA32_SYSENTER_ESP, sysenter_esp);
> >  		vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 
> >  		*/
> >  	}
> >+	vmx->loaded_vmcs->cpu = cpu;
> This should be within the if () block.

Makes sense :-) Done.

> >+	vmcs_init(vmx->loaded_vmcs->vmcs);
> >+	vmx->loaded_vmcs->cpu = -1;
> >+	vmx->loaded_vmcs->launched = 0;
> 
> Perhaps a loaded_vmcs_init() to encapsulate initialization of these 
> three fields, you'll probably reuse it later.

It's good you pointed this out, because it made me suddenly realise that I
forgot to VMCLEAR the new vmcs02s I allocate. In practice it never made a
difference, but better safe than sorry.

I had to restructure some of the code a bit to be able to properly use this
new function (in 3 places - __loaded_vmcs_clear, nested_get_current_vmcs02,
vmx_create_vcpu).
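
For completeness, here is a minimal sketch of the kind of reuse I mean. The
helper below is hypothetical and only illustrates where loaded_vmcs_init()
gets called when a fresh vmcs02 is allocated; the real
nested_get_current_vmcs02() in the series is structured differently:

	static struct loaded_vmcs *alloc_vmcs02_sketch(void)
	{
		/* hypothetical helper, for illustration only */
		struct loaded_vmcs *item = kzalloc(sizeof(*item), GFP_KERNEL);

		if (!item)
			return NULL;
		item->vmcs = alloc_vmcs();
		if (!item->vmcs) {
			kfree(item);
			return NULL;
		}
		/* VMCLEAR the fresh VMCS and mark it as not loaded anywhere */
		loaded_vmcs_init(item);
		return item;
	}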

> Please repost separately after the fix, I'd like to apply it before the 
> rest of the series.

I am adding a new version of this patch at the end of this mail.

> (regarding interrupts, I think we can do that work post-merge.  But I'd 
> like to see Kevin's comments addressed)

I replied to his comments. Done some of the things he asked, and asked for
more info on why/where he believes the current code is incorrect where I
didn't understand what problems he pointed to, and am now waiting for him
to reply.


------- 8< ------ 8< ---------- 8< ---------- 8< ----------- 8< -----------

Subject: [PATCH 01/31] nVMX: Keep list of loaded VMCSs, instead of vcpus.

In VMX, before we bring down a CPU we must VMCLEAR all VMCSs loaded on it
because (at least in theory) the processor might not have written all of its
content back to memory. Since a patch from June 26, 2008, this is done using
a per-cpu "vcpus_on_cpu" linked list of vcpus loaded on each CPU.

The problem is that with nested VMX, we no longer have the concept of a
vcpu being loaded on a cpu: A vcpu has multiple VMCSs (one for L1, a pool for
L2s), and each of those may have been last loaded on a different cpu.

So instead of linking the vcpus, we link the VMCSs, using a new structure
loaded_vmcs. This structure contains the VMCS, and the information pertaining
to its loading on a specific cpu (namely, the cpu number, and whether it
was already launched on this cpu once). In nested mode we will also use the same
structure to hold L2 VMCSs, and vmx->loaded_vmcs is a pointer to the
currently active VMCS.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  150 ++++++++++++++++++++++++-------------------
 1 file changed, 86 insertions(+), 64 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-05-23 21:46:14.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-05-23 21:46:14.000000000 +0300
@@ -116,6 +116,18 @@ struct vmcs {
 	char data[0];
 };
 
+/*
+ * Track a VMCS that may be loaded on a certain CPU. If it is (cpu!=-1), also
+ * remember whether it was VMLAUNCHed, and maintain a linked list of all VMCSs
+ * loaded on this CPU (so we can clear them if the CPU goes down).
+ */
+struct loaded_vmcs {
+	struct vmcs *vmcs;
+	int cpu;
+	int launched;
+	struct list_head loaded_vmcss_on_cpu_link;
+};
+
 struct shared_msr_entry {
 	unsigned index;
 	u64 data;
@@ -124,9 +136,7 @@ struct shared_msr_entry {
 
 struct vcpu_vmx {
 	struct kvm_vcpu       vcpu;
-	struct list_head      local_vcpus_link;
 	unsigned long         host_rsp;
-	int                   launched;
 	u8                    fail;
 	u8                    cpl;
 	bool                  nmi_known_unmasked;
@@ -140,7 +150,14 @@ struct vcpu_vmx {
 	u64 		      msr_host_kernel_gs_base;
 	u64 		      msr_guest_kernel_gs_base;
 #endif
-	struct vmcs          *vmcs;
+	/*
+	 * loaded_vmcs points to the VMCS currently used in this vcpu. For a
+	 * non-nested (L1) guest, it always points to vmcs01. For a nested
+	 * guest (L2), it points to a different VMCS.
+	 */
+	struct loaded_vmcs    vmcs01;
+	struct loaded_vmcs   *loaded_vmcs;
+	bool                  __launched; /* temporary, used in vmx_vcpu_run */
 	struct msr_autoload {
 		unsigned nr;
 		struct vmx_msr_entry guest[NR_AUTOLOAD_MSRS];
@@ -200,7 +217,11 @@ static int vmx_set_tss_addr(struct kvm *
 
 static DEFINE_PER_CPU(struct vmcs *, vmxarea);
 static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
-static DEFINE_PER_CPU(struct list_head, vcpus_on_cpu);
+/*
+ * We maintain a per-CPU linked-list of VMCS loaded on that CPU. This is needed
+ * when a CPU is brought down, and we need to VMCLEAR all VMCSs loaded on it.
+ */
+static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
 static DEFINE_PER_CPU(struct desc_ptr, host_gdt);
 
 static unsigned long *vmx_io_bitmap_a;
@@ -501,6 +522,13 @@ static void vmcs_clear(struct vmcs *vmcs
 		       vmcs, phys_addr);
 }
 
+static inline void loaded_vmcs_init(struct loaded_vmcs *loaded_vmcs)
+{
+	vmcs_clear(loaded_vmcs->vmcs);
+	loaded_vmcs->cpu = -1;
+	loaded_vmcs->launched = 0;
+}
+
 static void vmcs_load(struct vmcs *vmcs)
 {
 	u64 phys_addr = __pa(vmcs);
@@ -514,25 +542,24 @@ static void vmcs_load(struct vmcs *vmcs)
 		       vmcs, phys_addr);
 }
 
-static void __vcpu_clear(void *arg)
+static void __loaded_vmcs_clear(void *arg)
 {
-	struct vcpu_vmx *vmx = arg;
+	struct loaded_vmcs *loaded_vmcs = arg;
 	int cpu = raw_smp_processor_id();
 
-	if (vmx->vcpu.cpu == cpu)
-		vmcs_clear(vmx->vmcs);
-	if (per_cpu(current_vmcs, cpu) == vmx->vmcs)
+	if (loaded_vmcs->cpu != cpu)
+		return; /* cpu migration can race with cpu offline */
+	if (per_cpu(current_vmcs, cpu) == loaded_vmcs->vmcs)
 		per_cpu(current_vmcs, cpu) = NULL;
-	list_del(&vmx->local_vcpus_link);
-	vmx->vcpu.cpu = -1;
-	vmx->launched = 0;
+	list_del(&loaded_vmcs->loaded_vmcss_on_cpu_link);
+	loaded_vmcs_init(loaded_vmcs);
 }
 
-static void vcpu_clear(struct vcpu_vmx *vmx)
+static void loaded_vmcs_clear(struct loaded_vmcs *loaded_vmcs)
 {
-	if (vmx->vcpu.cpu == -1)
-		return;
-	smp_call_function_single(vmx->vcpu.cpu, __vcpu_clear, vmx, 1);
+	if (loaded_vmcs->cpu != -1)
+		smp_call_function_single(
+			loaded_vmcs->cpu, __loaded_vmcs_clear, loaded_vmcs, 1);
 }
 
 static inline void vpid_sync_vcpu_single(struct vcpu_vmx *vmx)
@@ -971,22 +998,22 @@ static void vmx_vcpu_load(struct kvm_vcp
 
 	if (!vmm_exclusive)
 		kvm_cpu_vmxon(phys_addr);
-	else if (vcpu->cpu != cpu)
-		vcpu_clear(vmx);
+	else if (vmx->loaded_vmcs->cpu != cpu)
+		loaded_vmcs_clear(vmx->loaded_vmcs);
 
-	if (per_cpu(current_vmcs, cpu) != vmx->vmcs) {
-		per_cpu(current_vmcs, cpu) = vmx->vmcs;
-		vmcs_load(vmx->vmcs);
+	if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
+		per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
+		vmcs_load(vmx->loaded_vmcs->vmcs);
 	}
 
-	if (vcpu->cpu != cpu) {
+	if (vmx->loaded_vmcs->cpu != cpu) {
 		struct desc_ptr *gdt = &__get_cpu_var(host_gdt);
 		unsigned long sysenter_esp;
 
 		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
 		local_irq_disable();
-		list_add(&vmx->local_vcpus_link,
-			 &per_cpu(vcpus_on_cpu, cpu));
+		list_add(&vmx->loaded_vmcs->loaded_vmcss_on_cpu_link,
+			 &per_cpu(loaded_vmcss_on_cpu, cpu));
 		local_irq_enable();
 
 		/*
@@ -998,6 +1025,7 @@ static void vmx_vcpu_load(struct kvm_vcp
 
 		rdmsrl(MSR_IA32_SYSENTER_ESP, sysenter_esp);
 		vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */
+		vmx->loaded_vmcs->cpu = cpu;
 	}
 }
 
@@ -1005,7 +1033,8 @@ static void vmx_vcpu_put(struct kvm_vcpu
 {
 	__vmx_load_host_state(to_vmx(vcpu));
 	if (!vmm_exclusive) {
-		__vcpu_clear(to_vmx(vcpu));
+		__loaded_vmcs_clear(to_vmx(vcpu)->loaded_vmcs);
+		vcpu->cpu = -1;
 		kvm_cpu_vmxoff();
 	}
 }
@@ -1469,7 +1498,7 @@ static int hardware_enable(void *garbage
 	if (read_cr4() & X86_CR4_VMXE)
 		return -EBUSY;
 
-	INIT_LIST_HEAD(&per_cpu(vcpus_on_cpu, cpu));
+	INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
 	rdmsrl(MSR_IA32_FEATURE_CONTROL, old);
 
 	test_bits = FEATURE_CONTROL_LOCKED;
@@ -1493,14 +1522,14 @@ static int hardware_enable(void *garbage
 	return 0;
 }
 
-static void vmclear_local_vcpus(void)
+static void vmclear_local_loaded_vmcss(void)
 {
 	int cpu = raw_smp_processor_id();
-	struct vcpu_vmx *vmx, *n;
+	struct loaded_vmcs *v, *n;
 
-	list_for_each_entry_safe(vmx, n, &per_cpu(vcpus_on_cpu, cpu),
-				 local_vcpus_link)
-		__vcpu_clear(vmx);
+	list_for_each_entry_safe(v, n, &per_cpu(loaded_vmcss_on_cpu, cpu),
+				 loaded_vmcss_on_cpu_link)
+		__loaded_vmcs_clear(v);
 }
 
 
@@ -1515,7 +1544,7 @@ static void kvm_cpu_vmxoff(void)
 static void hardware_disable(void *garbage)
 {
 	if (vmm_exclusive) {
-		vmclear_local_vcpus();
+		vmclear_local_loaded_vmcss();
 		kvm_cpu_vmxoff();
 	}
 	write_cr4(read_cr4() & ~X86_CR4_VMXE);
@@ -1696,6 +1725,18 @@ static void free_vmcs(struct vmcs *vmcs)
 	free_pages((unsigned long)vmcs, vmcs_config.order);
 }
 
+/*
+ * Free a VMCS, but before that VMCLEAR it on the CPU where it was last loaded
+ */
+static void free_loaded_vmcs(struct loaded_vmcs *loaded_vmcs)
+{
+	if (!loaded_vmcs->vmcs)
+		return;
+	loaded_vmcs_clear(loaded_vmcs);
+	free_vmcs(loaded_vmcs->vmcs);
+	loaded_vmcs->vmcs = NULL;
+}
+
 static void free_kvm_area(void)
 {
 	int cpu;
@@ -4166,6 +4207,7 @@ static void __noclone vmx_vcpu_run(struc
 	if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
 		vmx_set_interrupt_shadow(vcpu, 0);
 
+	vmx->__launched = vmx->loaded_vmcs->launched;
 	asm(
 		/* Store host registers */
 		"push %%"R"dx; push %%"R"bp;"
@@ -4236,7 +4278,7 @@ static void __noclone vmx_vcpu_run(struc
 		"pop  %%"R"bp; pop  %%"R"dx \n\t"
 		"setbe %c[fail](%0) \n\t"
 	      : : "c"(vmx), "d"((unsigned long)HOST_RSP),
-		[launched]"i"(offsetof(struct vcpu_vmx, launched)),
+		[launched]"i"(offsetof(struct vcpu_vmx, __launched)),
 		[fail]"i"(offsetof(struct vcpu_vmx, fail)),
 		[host_rsp]"i"(offsetof(struct vcpu_vmx, host_rsp)),
 		[rax]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_RAX])),
@@ -4276,7 +4318,7 @@ static void __noclone vmx_vcpu_run(struc
 	vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
 
 	asm("mov %0, %%ds; mov %0, %%es" : : "r"(__USER_DS));
-	vmx->launched = 1;
+	vmx->loaded_vmcs->launched = 1;
 
 	vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);
 
@@ -4288,41 +4330,17 @@ static void __noclone vmx_vcpu_run(struc
 #undef R
 #undef Q
 
-static void vmx_free_vmcs(struct kvm_vcpu *vcpu)
-{
-	struct vcpu_vmx *vmx = to_vmx(vcpu);
-
-	if (vmx->vmcs) {
-		vcpu_clear(vmx);
-		free_vmcs(vmx->vmcs);
-		vmx->vmcs = NULL;
-	}
-}
-
 static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
 	free_vpid(vmx);
-	vmx_free_vmcs(vcpu);
+	free_loaded_vmcs(vmx->loaded_vmcs);
 	kfree(vmx->guest_msrs);
 	kvm_vcpu_uninit(vcpu);
 	kmem_cache_free(kvm_vcpu_cache, vmx);
 }
 
-static inline void vmcs_init(struct vmcs *vmcs)
-{
-	u64 phys_addr = __pa(per_cpu(vmxarea, raw_smp_processor_id()));
-
-	if (!vmm_exclusive)
-		kvm_cpu_vmxon(phys_addr);
-
-	vmcs_clear(vmcs);
-
-	if (!vmm_exclusive)
-		kvm_cpu_vmxoff();
-}
-
 static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
 {
 	int err;
@@ -4344,11 +4362,15 @@ static struct kvm_vcpu *vmx_create_vcpu(
 		goto uninit_vcpu;
 	}
 
-	vmx->vmcs = alloc_vmcs();
-	if (!vmx->vmcs)
+	vmx->loaded_vmcs = &vmx->vmcs01;
+	vmx->loaded_vmcs->vmcs = alloc_vmcs();
+	if (!vmx->loaded_vmcs->vmcs)
 		goto free_msrs;
-
-	vmcs_init(vmx->vmcs);
+	if (!vmm_exclusive)
+		kvm_cpu_vmxon(__pa(per_cpu(vmxarea, raw_smp_processor_id())));
+	loaded_vmcs_init(vmx->loaded_vmcs);
+	if (!vmm_exclusive)
+		kvm_cpu_vmxoff();
 
 	cpu = get_cpu();
 	vmx_vcpu_load(&vmx->vcpu, cpu);
@@ -4377,7 +4399,7 @@ static struct kvm_vcpu *vmx_create_vcpu(
 	return &vmx->vcpu;
 
 free_vmcs:
-	free_vmcs(vmx->vmcs);
+	free_vmcs(vmx->loaded_vmcs->vmcs);
 free_msrs:
 	kfree(vmx->guest_msrs);
 uninit_vcpu:


-- 
Nadav Har'El                        |       Monday, May 23 2011, 20 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Make it idiot proof and someone will make
http://nadav.harel.org.il           |a better idiot.

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-23 16:17                           ` Gleb Natapov
@ 2011-05-23 18:59                             ` Nadav Har'El
  2011-05-23 19:03                               ` Gleb Natapov
  0 siblings, 1 reply; 118+ messages in thread
From: Nadav Har'El @ 2011-05-23 18:59 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Avi Kivity, Marcelo Tosatti, kvm, Roedel, Joerg

On Mon, May 23, 2011, Gleb Natapov wrote about "Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling":
> On Mon, May 23, 2011 at 06:49:17PM +0300, Avi Kivity wrote:
> > (regarding interrupts, I think we can do that work post-merge.  But
> > I'd like to see Kevin's comments addressed)
> > 
> To be fair this wasn't addressed for almost two years now.

Gleb, I assume by "this" you meant the idt-vectoring information issue, not
Kevin's comments (which I only saw a couple of days ago)?

-- 
Nadav Har'El                        |       Monday, May 23 2011, 20 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Someone offered you a cute little quote
http://nadav.harel.org.il           |for your signature? JUST SAY NO!

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-23 18:59                             ` Nadav Har'El
@ 2011-05-23 19:03                               ` Gleb Natapov
  0 siblings, 0 replies; 118+ messages in thread
From: Gleb Natapov @ 2011-05-23 19:03 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Avi Kivity, Marcelo Tosatti, kvm, Roedel, Joerg

On Mon, May 23, 2011 at 09:59:01PM +0300, Nadav Har'El wrote:
> On Mon, May 23, 2011, Gleb Natapov wrote about "Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling":
> > On Mon, May 23, 2011 at 06:49:17PM +0300, Avi Kivity wrote:
> > > (regarding interrupts, I think we can do that work post-merge.  But
> > > I'd like to see Kevin's comments addressed)
> > > 
> > To be fair this wasn't addressed for almost two years now.
> 
> Gleb, I assume by "this" you meant the idt-vectoring information issue, not
> Kevin's comments (which I only saw a couple of days ago)?
> 
Yes, of course.

--
			Gleb.

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: [PATCH 07/31] nVMX: Introduce vmcs02: VMCS used to run L2
  2011-05-22  7:22           ` Nadav Har'El
@ 2011-05-24  0:54             ` Tian, Kevin
  0 siblings, 0 replies; 118+ messages in thread
From: Tian, Kevin @ 2011-05-24  0:54 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb, avi

> From: Nadav Har'El [mailto:nyh@math.technion.ac.il]
> Sent: Sunday, May 22, 2011 3:23 PM
> 
> Hi,
> 
> On Sun, May 22, 2011, Tian, Kevin wrote about "RE: [PATCH 07/31] nVMX:
> Introduce vmcs02: VMCS used to run L2":
> > Here the vmcs02 being overridden may have been run on another processor
> > before but is not vmclear-ed yet. When you resume this vmcs02 with new
> > content on a separate processor, the risk of corruption exists.
> 
> I still believe that my current code is correct (in this area). I'll try to
> explain it here and would be grateful if you could point to me the error (if
> there is one) in my logic:
> 
> Nested_vmx_run() is our function which switches from running L1 to L2
> (patch 18).
> 
> This function starts by calling nested_get_current_vmcs02(), which gets us
> *some* vmcs to use for vmcs02. This may be a fresh new VMCS, or a "recycled"
> VMCS - one we previously used to run some (potentially different) L2 guest
> on some (potentially different) CPU.
> nested_get_current_vmcs02() returns a "saved_vmcs" structure, which
> not only contains a VMCS, but also remembers on which (if any) cpu it is
> currently loaded (and whether it was VMLAUNCHed once on that cpu).
> 
> The next thing that Nested_vmx_run() now does is to set up in the vcpu object
> the vmcs, cpu and launched fields according to what was returned above.
> 
> Now it calls vmx_vcpu_load(). This standard KVM function checks if we're now
> running on a different CPU from vcpu->cpu, and if it is a different one, it
> uses vcpu_clear() to VMCLEAR the vmcs on the CPU where it was last loaded
> (using an IPI). Only after it VMCLEARs the VMCS on the old CPU can it finally
> load the VMCS on the new CPU.
> 
> Only now Nested_vmx_run() can call prepare_vmcs02, which starts VMWRITEing
> to this VMCS, and finally returns.
> 

Yes, you're correct. Previously I had just looked at patch 07/31 in isolation
and raised the above concern. With the nested_vmx_run flow you explained above,
this part is clear to me now. :-)
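
Just to spell out my understanding of that flow (names follow your series, but
the exact signatures here are my guess - an illustrative sketch, not your
actual patch 18 code):

	static int nested_vmx_run_sketch(struct kvm_vcpu *vcpu)
	{
		struct vcpu_vmx *vmx = to_vmx(vcpu);
		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
		struct loaded_vmcs *vmcs02;
		int cpu;

		/* get *some* vmcs to use as vmcs02: fresh or recycled */
		vmcs02 = nested_get_current_vmcs02(vmx);
		if (!vmcs02)
			return 1;	/* treat like any other entry failure */
		vmx->loaded_vmcs = vmcs02;

		/*
		 * Loading the new VMCS on this cpu first VMCLEARs it on
		 * whatever cpu it was last loaded on (via an IPI).
		 */
		cpu = get_cpu();
		vmx_vcpu_load(vcpu, cpu);
		put_cpu();

		/* only now is it safe to start VMWRITEing the merged state */
		return prepare_vmcs02(vcpu, vmcs12);
	}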

> P.S. Seeing that you're from Intel, maybe you can help me with a pointer:
> I found what appears to be a small error in the SDM - who can I report it to?
> 

Let me ask for you.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-23 15:49                         ` Avi Kivity
                                             ` (2 preceding siblings ...)
  2011-05-23 18:51                           ` Nadav Har'El
@ 2011-05-24  0:57                           ` Tian, Kevin
  3 siblings, 0 replies; 118+ messages in thread
From: Tian, Kevin @ 2011-05-24  0:57 UTC (permalink / raw)
  To: Avi Kivity, Nadav Har'El; +Cc: Marcelo Tosatti, kvm, gleb, Roedel, Joerg

> From: Avi Kivity
> Sent: Monday, May 23, 2011 11:49 PM
> (regarding interrupts, I think we can do that work post-merge.  But I'd
> like to see Kevin's comments addressed)

My earlier comment has been addressed by Nadav with his explanation.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: [PATCH 07/31] nVMX: Introduce vmcs02: VMCS used to run L2
  2011-05-22  8:29     ` Nadav Har'El
@ 2011-05-24  1:03       ` Tian, Kevin
  0 siblings, 0 replies; 118+ messages in thread
From: Tian, Kevin @ 2011-05-24  1:03 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb, avi

> From: Nadav Har'El [mailto:nyh@math.technion.ac.il]
> Sent: Sunday, May 22, 2011 4:30 PM
> 
> Hi,
> 
> On Fri, May 20, 2011, Tian, Kevin wrote about "RE: [PATCH 07/31] nVMX:
> Introduce vmcs02: VMCS used to run L2":
> > Possibly we can maintain the vmcs02 pool along with L1 VMCLEAR ops, which
> > is similar to the hardware behavior regarding to cleared and launched state.
> 
> If you set VMCS02_POOL_SIZE to a large size, and L1, like typical hypervisors,
> only keeps around a few VMCSs (and VMCLEARs the ones it will not use again),
> then we'll only have a few vmcs02: handle_vmclear() removes from the pool the
> vmcs02 that L1 explicitly told us it won't need again.

yes

> 
> > > +struct saved_vmcs {
> > > +	struct vmcs *vmcs;
> > > +	int cpu;
> > > +	int launched;
> > > +};
> >
> > "saved" looks a bit misleading here. It's simply a list of all active vmcs02
> tracked
> > by kvm, isn't it?
> 
> I have rewritten this part of the code, based on Avi's and Marcelo's requests,
> and the new name for this structure is "loaded_vmcs", i.e., a structure
> describing where a VMCS was loaded.

great, I'll take a look at your new code.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-23 18:51                           ` Nadav Har'El
@ 2011-05-24  2:22                             ` Tian, Kevin
  2011-05-24  7:56                               ` Nadav Har'El
  0 siblings, 1 reply; 118+ messages in thread
From: Tian, Kevin @ 2011-05-24  2:22 UTC (permalink / raw)
  To: Nadav Har'El, Avi Kivity; +Cc: Marcelo Tosatti, kvm, gleb, Roedel, Joerg

> From: Nadav Har'El
> Sent: Tuesday, May 24, 2011 2:51 AM
> 
> > >+	vmcs_init(vmx->loaded_vmcs->vmcs);
> > >+	vmx->loaded_vmcs->cpu = -1;
> > >+	vmx->loaded_vmcs->launched = 0;
> >
> > Perhaps a loaded_vmcs_init() to encapsulate initialization of these
> > three fields, you'll probably reuse it later.
> 
> It's good you pointed this out, because it made me suddenly realise that I
> forgot to VMCLEAR the new vmcs02's I allocate. In practice it never made a
> difference, but better safe than sorry.

Yes, that's what the spec requires: you need to VMCLEAR any new VMCS, since
VMCLEAR performs implementation-specific initialization of that VMCS region.

> 
> I had to restructure some of the code a bit to be able to properly use this
> new function (in 3 places - __loaded_vmcs_clear, nested_get_current_vmcs02,
> vmx_create_cpu).
> 
> > Please repost separately after the fix, I'd like to apply it before the
> > rest of the series.
> 
> I am adding a new version of this patch at the end of this mail.
> 
> > (regarding interrupts, I think we can do that work post-merge.  But I'd
> > like to see Kevin's comments addressed)
> 
> I replied to his comments. Done some of the things he asked, and asked for
> more info on why/where he believes the current code is incorrect where I
> didn't understand what problems he pointed to, and am now waiting for him
> to reply.

As I replied in another thread, I believe this has been explained clearly by Nadav.

> 
> 
> ------- 8< ------ 8< ---------- 8< ---------- 8< ----------- 8< -----------
> 
> Subject: [PATCH 01/31] nVMX: Keep list of loaded VMCSs, instead of vcpus.
> 
> In VMX, before we bring down a CPU we must VMCLEAR all VMCSs loaded on it
> because (at least in theory) the processor might not have written all of its
> content back to memory. Since a patch from June 26, 2008, this is done using
> a per-cpu "vcpus_on_cpu" linked list of vcpus loaded on each CPU.
> 
> The problem is that with nested VMX, we no longer have the concept of a
> vcpu being loaded on a cpu: A vcpu has multiple VMCSs (one for L1, a pool for
> L2s), and each of those may have been last loaded on a different cpu.
> 
> So instead of linking the vcpus, we link the VMCSs, using a new structure
> loaded_vmcs. This structure contains the VMCS, and the information
> pertaining
> to its loading on a specific cpu (namely, the cpu number, and whether it
> was already launched on this cpu once). In nested mode we will also use the same
> structure to hold L2 VMCSs, and vmx->loaded_vmcs is a pointer to the
> currently active VMCS.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/kvm/vmx.c |  150 ++++++++++++++++++++++++-------------------
>  1 file changed, 86 insertions(+), 64 deletions(-)
> 
> --- .before/arch/x86/kvm/vmx.c	2011-05-23 21:46:14.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2011-05-23 21:46:14.000000000 +0300
> @@ -116,6 +116,18 @@ struct vmcs {
>  	char data[0];
>  };
> 
> +/*
> + * Track a VMCS that may be loaded on a certain CPU. If it is (cpu!=-1), also
> + * remember whether it was VMLAUNCHed, and maintain a linked list of all
> VMCSs
> + * loaded on this CPU (so we can clear them if the CPU goes down).
> + */
> +struct loaded_vmcs {
> +	struct vmcs *vmcs;
> +	int cpu;
> +	int launched;
> +	struct list_head loaded_vmcss_on_cpu_link;
> +};
> +
>  struct shared_msr_entry {
>  	unsigned index;
>  	u64 data;
> @@ -124,9 +136,7 @@ struct shared_msr_entry {
> 
>  struct vcpu_vmx {
>  	struct kvm_vcpu       vcpu;
> -	struct list_head      local_vcpus_link;
>  	unsigned long         host_rsp;
> -	int                   launched;
>  	u8                    fail;
>  	u8                    cpl;
>  	bool                  nmi_known_unmasked;
> @@ -140,7 +150,14 @@ struct vcpu_vmx {
>  	u64 		      msr_host_kernel_gs_base;
>  	u64 		      msr_guest_kernel_gs_base;
>  #endif
> -	struct vmcs          *vmcs;
> +	/*
> +	 * loaded_vmcs points to the VMCS currently used in this vcpu. For a
> +	 * non-nested (L1) guest, it always points to vmcs01. For a nested
> +	 * guest (L2), it points to a different VMCS.
> +	 */
> +	struct loaded_vmcs    vmcs01;
> +	struct loaded_vmcs   *loaded_vmcs;
> +	bool                  __launched; /* temporary, used in
> vmx_vcpu_run */
>  	struct msr_autoload {
>  		unsigned nr;
>  		struct vmx_msr_entry guest[NR_AUTOLOAD_MSRS];
> @@ -200,7 +217,11 @@ static int vmx_set_tss_addr(struct kvm *
> 
>  static DEFINE_PER_CPU(struct vmcs *, vmxarea);
>  static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
> -static DEFINE_PER_CPU(struct list_head, vcpus_on_cpu);
> +/*
> + * We maintain a per-CPU linked-list of VMCS loaded on that CPU. This is
> needed
> + * when a CPU is brought down, and we need to VMCLEAR all VMCSs loaded
> on it.
> + */
> +static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
>  static DEFINE_PER_CPU(struct desc_ptr, host_gdt);
> 
>  static unsigned long *vmx_io_bitmap_a;
> @@ -501,6 +522,13 @@ static void vmcs_clear(struct vmcs *vmcs
>  		       vmcs, phys_addr);
>  }
> 
> +static inline void loaded_vmcs_init(struct loaded_vmcs *loaded_vmcs)
> +{
> +	vmcs_clear(loaded_vmcs->vmcs);
> +	loaded_vmcs->cpu = -1;
> +	loaded_vmcs->launched = 0;
> +}
> +

How about calling it vmcs_init instead, since you now remove the original
vmcs_init() invocation? That would better reflect the need to VMCLEAR a new
vmcs.

>  static void vmcs_load(struct vmcs *vmcs)
>  {
>  	u64 phys_addr = __pa(vmcs);
> @@ -514,25 +542,24 @@ static void vmcs_load(struct vmcs *vmcs)
>  		       vmcs, phys_addr);
>  }
> 
> -static void __vcpu_clear(void *arg)
> +static void __loaded_vmcs_clear(void *arg)
>  {
> -	struct vcpu_vmx *vmx = arg;
> +	struct loaded_vmcs *loaded_vmcs = arg;
>  	int cpu = raw_smp_processor_id();
> 
> -	if (vmx->vcpu.cpu == cpu)
> -		vmcs_clear(vmx->vmcs);
> -	if (per_cpu(current_vmcs, cpu) == vmx->vmcs)
> +	if (loaded_vmcs->cpu != cpu)
> +		return; /* cpu migration can race with cpu offline */

What do you mean by "cpu migration" here? And why does 'cpu offline' matter
here with regard to the cpu check?

> +	if (per_cpu(current_vmcs, cpu) == loaded_vmcs->vmcs)
>  		per_cpu(current_vmcs, cpu) = NULL;
> -	list_del(&vmx->local_vcpus_link);
> -	vmx->vcpu.cpu = -1;
> -	vmx->launched = 0;
> +	list_del(&loaded_vmcs->loaded_vmcss_on_cpu_link);
> +	loaded_vmcs_init(loaded_vmcs);
>  }
> 
> -static void vcpu_clear(struct vcpu_vmx *vmx)
> +static void loaded_vmcs_clear(struct loaded_vmcs *loaded_vmcs)
>  {
> -	if (vmx->vcpu.cpu == -1)
> -		return;
> -	smp_call_function_single(vmx->vcpu.cpu, __vcpu_clear, vmx, 1);
> +	if (loaded_vmcs->cpu != -1)
> +		smp_call_function_single(
> +			loaded_vmcs->cpu, __loaded_vmcs_clear, loaded_vmcs, 1);
>  }
> 
>  static inline void vpid_sync_vcpu_single(struct vcpu_vmx *vmx)
> @@ -971,22 +998,22 @@ static void vmx_vcpu_load(struct kvm_vcp
> 
>  	if (!vmm_exclusive)
>  		kvm_cpu_vmxon(phys_addr);
> -	else if (vcpu->cpu != cpu)
> -		vcpu_clear(vmx);
> +	else if (vmx->loaded_vmcs->cpu != cpu)
> +		loaded_vmcs_clear(vmx->loaded_vmcs);
> 
> -	if (per_cpu(current_vmcs, cpu) != vmx->vmcs) {
> -		per_cpu(current_vmcs, cpu) = vmx->vmcs;
> -		vmcs_load(vmx->vmcs);
> +	if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
> +		per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
> +		vmcs_load(vmx->loaded_vmcs->vmcs);
>  	}
> 
> -	if (vcpu->cpu != cpu) {
> +	if (vmx->loaded_vmcs->cpu != cpu) {
>  		struct desc_ptr *gdt = &__get_cpu_var(host_gdt);
>  		unsigned long sysenter_esp;
> 
>  		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
>  		local_irq_disable();
> -		list_add(&vmx->local_vcpus_link,
> -			 &per_cpu(vcpus_on_cpu, cpu));
> +		list_add(&vmx->loaded_vmcs->loaded_vmcss_on_cpu_link,
> +			 &per_cpu(loaded_vmcss_on_cpu, cpu));
>  		local_irq_enable();
> 
>  		/*
> @@ -998,6 +1025,7 @@ static void vmx_vcpu_load(struct kvm_vcp
> 
>  		rdmsrl(MSR_IA32_SYSENTER_ESP, sysenter_esp);
>  		vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */
> +		vmx->loaded_vmcs->cpu = cpu;
>  	}
>  }
> 
> @@ -1005,7 +1033,8 @@ static void vmx_vcpu_put(struct kvm_vcpu
>  {
>  	__vmx_load_host_state(to_vmx(vcpu));
>  	if (!vmm_exclusive) {
> -		__vcpu_clear(to_vmx(vcpu));
> +		__loaded_vmcs_clear(to_vmx(vcpu)->loaded_vmcs);
> +		vcpu->cpu = -1;
>  		kvm_cpu_vmxoff();
>  	}
>  }
> @@ -1469,7 +1498,7 @@ static int hardware_enable(void *garbage
>  	if (read_cr4() & X86_CR4_VMXE)
>  		return -EBUSY;
> 
> -	INIT_LIST_HEAD(&per_cpu(vcpus_on_cpu, cpu));
> +	INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
>  	rdmsrl(MSR_IA32_FEATURE_CONTROL, old);
> 
>  	test_bits = FEATURE_CONTROL_LOCKED;
> @@ -1493,14 +1522,14 @@ static int hardware_enable(void *garbage
>  	return 0;
>  }
> 
> -static void vmclear_local_vcpus(void)
> +static void vmclear_local_loaded_vmcss(void)
>  {
>  	int cpu = raw_smp_processor_id();
> -	struct vcpu_vmx *vmx, *n;
> +	struct loaded_vmcs *v, *n;
> 
> -	list_for_each_entry_safe(vmx, n, &per_cpu(vcpus_on_cpu, cpu),
> -				 local_vcpus_link)
> -		__vcpu_clear(vmx);
> +	list_for_each_entry_safe(v, n, &per_cpu(loaded_vmcss_on_cpu, cpu),
> +				 loaded_vmcss_on_cpu_link)
> +		__loaded_vmcs_clear(v);
>  }
> 
> 
> @@ -1515,7 +1544,7 @@ static void kvm_cpu_vmxoff(void)
>  static void hardware_disable(void *garbage)
>  {
>  	if (vmm_exclusive) {
> -		vmclear_local_vcpus();
> +		vmclear_local_loaded_vmcss();
>  		kvm_cpu_vmxoff();
>  	}
>  	write_cr4(read_cr4() & ~X86_CR4_VMXE);
> @@ -1696,6 +1725,18 @@ static void free_vmcs(struct vmcs *vmcs)
>  	free_pages((unsigned long)vmcs, vmcs_config.order);
>  }
> 
> +/*
> + * Free a VMCS, but before that VMCLEAR it on the CPU where it was last
> loaded
> + */
> +static void free_loaded_vmcs(struct loaded_vmcs *loaded_vmcs)
> +{
> +	if (!loaded_vmcs->vmcs)
> +		return;
> +	loaded_vmcs_clear(loaded_vmcs);
> +	free_vmcs(loaded_vmcs->vmcs);
> +	loaded_vmcs->vmcs = NULL;
> +}

I'd prefer not to do this cleanup work through loaded_vmcs, since it's just a
pointer to a loaded vmcs structure. Although you can carefully arrange for the
nested vmcs cleanup to happen before it, that's not very clean and is a bit
error prone when judged from this function alone. It's clearer to clean up
vmcs01 directly, and if you want, an assertion could be added to make sure
loaded_vmcs doesn't point to any stale vmcs02 structure after the nested
cleanup step.

Thanks,
Kevin 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-24  2:22                             ` Tian, Kevin
@ 2011-05-24  7:56                               ` Nadav Har'El
  2011-05-24  8:20                                 ` Tian, Kevin
  0 siblings, 1 reply; 118+ messages in thread
From: Nadav Har'El @ 2011-05-24  7:56 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: Avi Kivity, Marcelo Tosatti, kvm, gleb, Roedel, Joerg

On Tue, May 24, 2011, Tian, Kevin wrote about "RE: [PATCH 08/31] nVMX: Fix local_vcpus_link handling":
> > +static inline void loaded_vmcs_init(struct loaded_vmcs *loaded_vmcs)
> > +{
> > +	vmcs_clear(loaded_vmcs->vmcs);
> > +	loaded_vmcs->cpu = -1;
> > +	loaded_vmcs->launched = 0;
> > +}
> > +
> 
> call it vmcs_init instead since you now remove original vmcs_init invocation,
> which more reflect the necessity of adding VMCLEAR for a new vmcs?

The best name for this function, I think, would have been loaded_vmcs_clear,
because this function isn't necessarily used to "init" - it's also called to
VMCLEAR an old vmcs (and flush its content back to memory) - in that sense
it is definitely not a "vmcs_init".

Unfortunately, I already have a whole chain of functions with this name :(
the existing loaded_vmcs_clear() does an IPI to the CPU which has this VMCS
loaded, and causes it to run __loaded_vmcs_clear(), which in turn calls
the above loaded_vmcs_init(). I wish I could call all three functions
loaded_vmcs_clear(), but of course I can't. If anyone reading this has a good
suggestion on how to name these three functions, please let me know.

> > +static void __loaded_vmcs_clear(void *arg)
> >  {
> > -	struct vcpu_vmx *vmx = arg;
> > +	struct loaded_vmcs *loaded_vmcs = arg;
> >  	int cpu = raw_smp_processor_id();
> > 
> > -	if (vmx->vcpu.cpu == cpu)
> > -		vmcs_clear(vmx->vmcs);
> > -	if (per_cpu(current_vmcs, cpu) == vmx->vmcs)
> > +	if (loaded_vmcs->cpu != cpu)
> > +		return; /* cpu migration can race with cpu offline */
> 
> what do you mean by "cpu migration" here? why does 'cpu offline'
> matter here regarding to the cpu change?

__loaded_vmcs_clear() is typically called in one of two cases: "cpu migration"
means that a guest that used to run on one CPU, and had its VMCS loaded there,
suddenly needs to run on a different CPU, so we need to clear the VMCS on
the old CPU. "cpu offline" means that we want to take a certain CPU offline,
and before we do that we should VMCLEAR all VMCSs which were loaded on it.

The (loaded_vmcs->cpu != cpu) case in __loaded_vmcs_clear should ideally never
happen: In the cpu offline path, we only call it for the loaded_vmcss which
we know for sure are loaded on the current cpu. In the cpu migration path,
loaded_vmcs_clear runs __loaded_vmcs_clear on the right CPU, which ensures
this equality.

But, there can be a race condition (this was actually explained to me a while
back by Avi - I've never seen it happen in practice): Imagine that cpu
migration calls loaded_vmcs_clear, which tells the old cpu (via IPI) to
VMCLEAR this vmcs. But before that old CPU gets a chance to act on that IPI,
a decision is made to take it offline, and all loaded_vmcss loaded on it
(including the one in question) are cleared. When that CPU acts on this IPI,
it notices that loaded_vmcs->cpu == -1, i.e., != cpu, so it doesn't need to do
anything (in the new version of the code, I made this more explicit, by
returning immediately in this case).

At least this is the theory. As I said, I didn't see this problem in practice
(unsurprising, since I never offlined any CPU). Maybe Avi or someone else can
comment more about this (vmx->vcpu.cpu == cpu) check, which existed before
my patch - in __vcpu_clear().
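
In code terms, this is exactly what the early return in the new
__loaded_vmcs_clear() is for - repeating it here from the patch above, with
the race spelled out in a comment:

	static void __loaded_vmcs_clear(void *arg)
	{
		struct loaded_vmcs *loaded_vmcs = arg;
		int cpu = raw_smp_processor_id();

		/*
		 * If the cpu-offline path already cleared this VMCS before our
		 * IPI arrived, loaded_vmcs->cpu was reset to -1 and the entry
		 * was already unlinked, so there is nothing left to do here.
		 */
		if (loaded_vmcs->cpu != cpu)
			return; /* cpu migration can race with cpu offline */
		if (per_cpu(current_vmcs, cpu) == loaded_vmcs->vmcs)
			per_cpu(current_vmcs, cpu) = NULL;
		list_del(&loaded_vmcs->loaded_vmcss_on_cpu_link);
		loaded_vmcs_init(loaded_vmcs);
	}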

> > +static void free_loaded_vmcs(struct loaded_vmcs *loaded_vmcs)
> > +{
> > +	if (!loaded_vmcs->vmcs)
> > +		return;
> > +	loaded_vmcs_clear(loaded_vmcs);
> > +	free_vmcs(loaded_vmcs->vmcs);
> > +	loaded_vmcs->vmcs = NULL;
> > +}
> 
> prefer to not do cleanup work through loaded_vmcs since it's just a pointer
> to a loaded vmcs structure. Though you can carefully arrange the nested
> vmcs cleanup happening before it, it's not very clean and a bit error prone
> simply from this function itself. It's clearer to directly cleanup vmcs01, and
> if you want an assertion could be added to make sure loaded_vmcs doesn't
> point to any stale vmcs02 structure after nested cleanup step.	

I'm afraid I didn't understand what you meant here... Basically, this
free_loaded_vmcs() is just a shortcut for loaded_vmcs_clear() and free_vmcs(),
as doing both is needed in 3 places: nested_free_vmcs02,
nested_free_all_saved_vmcss, vmx_free_vcpu. The same function is needed
for both vmcs01 and vmcs02 VMCSs - in both cases when we don't need them any
more we need to VMCLEAR them and then free the VMCS memory. Note that this
function does *not* free the loaded_vmcs structure itself.

What's wrong with this?
Would you prefer that I remove this function and explictly call
loaded_vmcs_clear() and then free_vmcs() in all three places?

Thanks,
Nadav.

-- 
Nadav Har'El                        |      Tuesday, May 24 2011, 20 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Linux: Because a PC is a terrible thing
http://nadav.harel.org.il           |to waste.

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: [PATCH 17/31] nVMX: Prepare vmcs02 from vmcs01 and vmcs12
  2011-05-16 19:52 ` [PATCH 17/31] nVMX: Prepare vmcs02 from vmcs01 and vmcs12 Nadav Har'El
@ 2011-05-24  8:02   ` Tian, Kevin
  2011-05-24  9:19     ` Nadav Har'El
  0 siblings, 1 reply; 118+ messages in thread
From: Tian, Kevin @ 2011-05-24  8:02 UTC (permalink / raw)
  To: Nadav Har'El, kvm; +Cc: gleb, avi

> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:53 AM
>
> This patch contains code to prepare the VMCS which can be used to actually
> run the L2 guest, vmcs02. prepare_vmcs02 appropriately merges the
> information
> in vmcs12 (the vmcs that L1 built for L2) and in vmcs01 (our desires for our
> own guests).
>
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/kvm/vmx.c |  269
> +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 269 insertions(+)
>
> --- .before/arch/x86/kvm/vmx.c        2011-05-16 22:36:48.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c 2011-05-16 22:36:48.000000000 +0300
> @@ -347,6 +347,12 @@ struct nested_vmx {
>       /* vmcs02_list cache of VMCSs recently used to run L2 guests */
>       struct list_head vmcs02_pool;
>       int vmcs02_num;
> +     u64 vmcs01_tsc_offset;
> +     /*
> +      * Guest pages referred to in vmcs02 with host-physical pointers, so
> +      * we must keep them pinned while L2 runs.
> +      */
> +     struct page *apic_access_page;
>  };
>
>  struct vcpu_vmx {
> @@ -849,6 +855,18 @@ static inline bool report_flexpriority(v
>       return flexpriority_enabled;
>  }
>
> +static inline bool nested_cpu_has(struct vmcs12 *vmcs12, u32 bit)
> +{
> +     return vmcs12->cpu_based_vm_exec_control & bit;
> +}
> +
> +static inline bool nested_cpu_has2(struct vmcs12 *vmcs12, u32 bit)
> +{
> +     return (vmcs12->cpu_based_vm_exec_control &
> +                     CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) &&
> +             (vmcs12->secondary_vm_exec_control & bit);
> +}
> +
>  static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
>  {
>       int i;
> @@ -1435,6 +1453,22 @@ static void vmx_fpu_activate(struct kvm_
>
>  static void vmx_decache_cr0_guest_bits(struct kvm_vcpu *vcpu);
>
> +/*
> + * Return the cr0 value that a nested guest would read. This is a combination
> + * of the real cr0 used to run the guest (guest_cr0), and the bits shadowed by
> + * its hypervisor (cr0_read_shadow).
> + */
> +static inline unsigned long guest_readable_cr0(struct vmcs12 *fields)
> +{
> +     return (fields->guest_cr0 & ~fields->cr0_guest_host_mask) |
> +             (fields->cr0_read_shadow & fields->cr0_guest_host_mask);
> +}
> +static inline unsigned long guest_readable_cr4(struct vmcs12 *fields)
> +{
> +     return (fields->guest_cr4 & ~fields->cr4_guest_host_mask) |
> +             (fields->cr4_read_shadow & fields->cr4_guest_host_mask);
> +}
> +

Won't the guest_ prefix look confusing here? 'guest' has a broad meaning, which
makes the two functions above look like they could also be used in the
non-nested case. Should we stick to a nested_ prefix for nested-specific
facilities?

>  static void vmx_fpu_deactivate(struct kvm_vcpu *vcpu)
>  {
>       vmx_decache_cr0_guest_bits(vcpu);
> @@ -3423,6 +3457,9 @@ static void set_cr4_guest_host_mask(stru
>       vmx->vcpu.arch.cr4_guest_owned_bits =
> KVM_CR4_GUEST_OWNED_BITS;
>       if (enable_ept)
>               vmx->vcpu.arch.cr4_guest_owned_bits |= X86_CR4_PGE;
> +     if (is_guest_mode(&vmx->vcpu))
> +             vmx->vcpu.arch.cr4_guest_owned_bits &=
> +                     ~get_vmcs12(&vmx->vcpu)->cr4_guest_host_mask;

why not is_nested_mode()? :-P

>       vmcs_writel(CR4_GUEST_HOST_MASK,
> ~vmx->vcpu.arch.cr4_guest_owned_bits);
>  }
>
> @@ -4760,6 +4797,11 @@ static void free_nested(struct vcpu_vmx
>               vmx->nested.current_vmptr = -1ull;
>               vmx->nested.current_vmcs12 = NULL;
>       }
> +     /* Unpin physical memory we referred to in current vmcs02 */
> +     if (vmx->nested.apic_access_page) {
> +             nested_release_page(vmx->nested.apic_access_page);
> +             vmx->nested.apic_access_page = 0;
> +     }
>
>       nested_free_all_saved_vmcss(vmx);
>  }
> @@ -5829,6 +5871,233 @@ static void vmx_set_supported_cpuid(u32
>  }
>
>  /*
> + * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
> + * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
> + * with L0's requirements for its guest (a.k.a. vmsc01), so we can run the L2
> + * guest in a way that will both be appropriate to L1's requests, and our
> + * needs. In addition to modifying the active vmcs (which is vmcs02), this
> + * function also has additional necessary side-effects, like setting various
> + * vcpu->arch fields.
> + */
> +static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
> +{
> +     struct vcpu_vmx *vmx = to_vmx(vcpu);
> +     u32 exec_control;
> +
> +     vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector);
> +     vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector);
> +     vmcs_write16(GUEST_SS_SELECTOR, vmcs12->guest_ss_selector);
> +     vmcs_write16(GUEST_DS_SELECTOR, vmcs12->guest_ds_selector);
> +     vmcs_write16(GUEST_FS_SELECTOR, vmcs12->guest_fs_selector);
> +     vmcs_write16(GUEST_GS_SELECTOR, vmcs12->guest_gs_selector);
> +     vmcs_write16(GUEST_LDTR_SELECTOR, vmcs12->guest_ldtr_selector);
> +     vmcs_write16(GUEST_TR_SELECTOR, vmcs12->guest_tr_selector);
> +     vmcs_write32(GUEST_ES_LIMIT, vmcs12->guest_es_limit);
> +     vmcs_write32(GUEST_CS_LIMIT, vmcs12->guest_cs_limit);
> +     vmcs_write32(GUEST_SS_LIMIT, vmcs12->guest_ss_limit);
> +     vmcs_write32(GUEST_DS_LIMIT, vmcs12->guest_ds_limit);
> +     vmcs_write32(GUEST_FS_LIMIT, vmcs12->guest_fs_limit);
> +     vmcs_write32(GUEST_GS_LIMIT, vmcs12->guest_gs_limit);
> +     vmcs_write32(GUEST_LDTR_LIMIT, vmcs12->guest_ldtr_limit);
> +     vmcs_write32(GUEST_TR_LIMIT, vmcs12->guest_tr_limit);
> +     vmcs_write32(GUEST_GDTR_LIMIT, vmcs12->guest_gdtr_limit);
> +     vmcs_write32(GUEST_IDTR_LIMIT, vmcs12->guest_idtr_limit);
> +     vmcs_write32(GUEST_ES_AR_BYTES, vmcs12->guest_es_ar_bytes);
> +     vmcs_write32(GUEST_CS_AR_BYTES, vmcs12->guest_cs_ar_bytes);
> +     vmcs_write32(GUEST_SS_AR_BYTES, vmcs12->guest_ss_ar_bytes);
> +     vmcs_write32(GUEST_DS_AR_BYTES, vmcs12->guest_ds_ar_bytes);
> +     vmcs_write32(GUEST_FS_AR_BYTES, vmcs12->guest_fs_ar_bytes);
> +     vmcs_write32(GUEST_GS_AR_BYTES, vmcs12->guest_gs_ar_bytes);
> +     vmcs_write32(GUEST_LDTR_AR_BYTES, vmcs12->guest_ldtr_ar_bytes);
> +     vmcs_write32(GUEST_TR_AR_BYTES, vmcs12->guest_tr_ar_bytes);
> +     vmcs_writel(GUEST_ES_BASE, vmcs12->guest_es_base);
> +     vmcs_writel(GUEST_CS_BASE, vmcs12->guest_cs_base);
> +     vmcs_writel(GUEST_SS_BASE, vmcs12->guest_ss_base);
> +     vmcs_writel(GUEST_DS_BASE, vmcs12->guest_ds_base);
> +     vmcs_writel(GUEST_FS_BASE, vmcs12->guest_fs_base);
> +     vmcs_writel(GUEST_GS_BASE, vmcs12->guest_gs_base);
> +     vmcs_writel(GUEST_LDTR_BASE, vmcs12->guest_ldtr_base);
> +     vmcs_writel(GUEST_TR_BASE, vmcs12->guest_tr_base);
> +     vmcs_writel(GUEST_GDTR_BASE, vmcs12->guest_gdtr_base);
> +     vmcs_writel(GUEST_IDTR_BASE, vmcs12->guest_idtr_base);
> +
> +     vmcs_write64(GUEST_IA32_DEBUGCTL, vmcs12->guest_ia32_debugctl);
> +     vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
> +             vmcs12->vm_entry_intr_info_field);
> +     vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
> +             vmcs12->vm_entry_exception_error_code);
> +     vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
> +             vmcs12->vm_entry_instruction_len);
> +     vmcs_write32(GUEST_INTERRUPTIBILITY_INFO,
> +             vmcs12->guest_interruptibility_info);
> +     vmcs_write32(GUEST_ACTIVITY_STATE, vmcs12->guest_activity_state);
> +     vmcs_write32(GUEST_SYSENTER_CS, vmcs12->guest_sysenter_cs);
> +     vmcs_writel(GUEST_DR7, vmcs12->guest_dr7);
> +     vmcs_writel(GUEST_RFLAGS, vmcs12->guest_rflags);
> +     vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS,
> +             vmcs12->guest_pending_dbg_exceptions);
> +     vmcs_writel(GUEST_SYSENTER_ESP, vmcs12->guest_sysenter_esp);
> +     vmcs_writel(GUEST_SYSENTER_EIP, vmcs12->guest_sysenter_eip);
> +
> +     vmcs_write64(VMCS_LINK_POINTER, -1ull);
> +
> +     if (nested_cpu_has2(vmcs12,
> SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) {
> +             struct page *page =
> +                     nested_get_page(vcpu, vmcs12->apic_access_addr);
> +             if (!page)
> +                     return 1;
> +             vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page));
> +             /*
> +              * Keep the page pinned, so its physical address we just wrote
> +              * remains valid. We keep a reference to it so we can release
> +              * it later.
> +              */
> +             if (vmx->nested.apic_access_page) /* shouldn't happen... */
> +                     nested_release_page(vmx->nested.apic_access_page);
> +             vmx->nested.apic_access_page = page;
> +     }
> +
> +     vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
> +             (vmcs_config.pin_based_exec_ctrl |
> +              vmcs12->pin_based_vm_exec_control));
> +
> +     /*
> +      * Whether page-faults are trapped is determined by a combination of
> +      * 3 settings: PFEC_MASK, PFEC_MATCH and EXCEPTION_BITMAP.PF.
> +      * If enable_ept, L0 doesn't care about page faults and we should
> +      * set all of these to L1's desires. However, if !enable_ept, L0 does
> +      * care about (at least some) page faults, and because it is not easy
> +      * (if at all possible?) to merge L0 and L1's desires, we simply ask
> +      * to exit on each and every L2 page fault. This is done by setting
> +      * MASK=MATCH=0 and (see below) EB.PF=1.
> +      * Note that below we don't need special code to set EB.PF beyond the
> +      * "or"ing of the EB of vmcs01 and vmcs12, because when enable_ept,
> +      * vmcs01's EB.PF is 0 so the "or" will take vmcs12's value, and when
> +      * !enable_ept, EB.PF is 1, so the "or" will always be 1.
> +      *
> +      * A problem with this approach (when !enable_ept) is that L1 may be
> +      * injected with more page faults than it asked for. This could have
> +      * caused problems, but in practice existing hypervisors don't care.
> +      * To fix this, we will need to emulate the PFEC checking (on the L1
> +      * page tables), using walk_addr(), when injecting PFs to L1.
> +      */
> +     vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK,
> +             enable_ept ? vmcs12->page_fault_error_code_mask : 0);
> +     vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH,
> +             enable_ept ? vmcs12->page_fault_error_code_match : 0);
> +
> +     if (cpu_has_secondary_exec_ctrls()) {
> +             u32 exec_control = vmx_secondary_exec_control(vmx);
> +             if (!vmx->rdtscp_enabled)
> +                     exec_control &= ~SECONDARY_EXEC_RDTSCP;
> +             /* Take the following fields only from vmcs12 */
> +             exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> +             if (nested_cpu_has(vmcs12,
> +                             CPU_BASED_ACTIVATE_SECONDARY_CONTROLS))
> +                     exec_control |= vmcs12->secondary_vm_exec_control;

Should this 2nd exec_control be merged in a clear, case-by-case fashion?

What about the case where L0 sets the "virtualize x2APIC" bit while L1 doesn't?

Or where L0 disables EPT while L1 enables it?

I think it's better to scrutinize every 2nd exec_control feature with a
clear policy:
- whether we want the stricter policy, where a bit is set only when both L0
and L1 set it, or
- whether we want to use L1's setting unconditionally, regardless of L0's,
like what you did for virtualize APIC accesses.
A sketch of what I mean follows below.
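
For example (the bit names are the existing VMX ones; the grouping below is
only a sketch of the idea, not a concrete proposal):

	u32 exec_control = vmx_secondary_exec_control(vmx);	/* L0's baseline */

	/* stricter-of-the-two: RDTSCP for L2 only if both L0 and L1 enable it */
	if (!vmx->rdtscp_enabled ||
	    !nested_cpu_has2(vmcs12, SECONDARY_EXEC_RDTSCP))
		exec_control &= ~SECONDARY_EXEC_RDTSCP;

	/* L1-owned: take L1's "virtualize APIC accesses" setting as-is */
	exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
	if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
		exec_control |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;

	vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);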

> +             vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
> +     }
> +
> +     /*
> +      * Set host-state according to L0's settings (vmcs12 is irrelevant here)
> +      * Some constant fields are set here by vmx_set_constant_host_state().
> +      * Other fields are different per CPU, and will be set later when
> +      * vmx_vcpu_load() is called, and when vmx_save_host_state() is called.
> +      */
> +     vmx_set_constant_host_state();
> +
> +     /*
> +      * HOST_RSP is normally set correctly in vmx_vcpu_run() just before
> +      * entry, but only if the current (host) sp changed from the value
> +      * we wrote last (vmx->host_rsp). This cache is no longer relevant
> +      * if we switch vmcs, and rather than hold a separate cache per vmcs,
> +      * here we just force the write to happen on entry.
> +      */
> +     vmx->host_rsp = 0;
> +
> +     exec_control = vmx_exec_control(vmx); /* L0's desires */
> +     exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING;
> +     exec_control &= ~CPU_BASED_VIRTUAL_NMI_PENDING;
> +     exec_control &= ~CPU_BASED_TPR_SHADOW;
> +     exec_control |= vmcs12->cpu_based_vm_exec_control;

Similarly, I think a default OR for the other features may be risky. Most of
the time we want either the most conservative combination of the L0/L1
settings or simply L1's setting; a blanket OR doesn't make that policy clear.
It would be better to spell out the desired merge policy for every control
bit, e.g. something like the sketch below.
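
Something similar for the primary controls, for instance (again just a sketch
of the shape I have in mind, the grouping is not a concrete proposal):

	u32 exec_control = vmx_exec_control(vmx);	/* L0's baseline */

	/* L0-internal state that must never leak into vmcs02: */
	exec_control &= ~(CPU_BASED_VIRTUAL_INTR_PENDING |
			  CPU_BASED_VIRTUAL_NMI_PENDING |
			  CPU_BASED_TPR_SHADOW);
	/* exits L1 asked for and L0 can simply honour: */
	exec_control |= vmcs12->cpu_based_vm_exec_control &
			(CPU_BASED_HLT_EXITING | CPU_BASED_MWAIT_EXITING);
	/* not yet merged: always exit on I/O and ignore the bitmaps: */
	exec_control &= ~(CPU_BASED_USE_MSR_BITMAPS |
			  CPU_BASED_USE_IO_BITMAPS);
	exec_control |= CPU_BASED_UNCOND_IO_EXITING;

	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control);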


Thanks
Kevin

> +     /*
> +      * Merging of IO and MSR bitmaps not currently supported.
> +      * Rather, exit every time.
> +      */
> +     exec_control &= ~CPU_BASED_USE_MSR_BITMAPS;
> +     exec_control &= ~CPU_BASED_USE_IO_BITMAPS;
> +     exec_control |= CPU_BASED_UNCOND_IO_EXITING;
> +
> +     vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control);
> +
> +     /* EXCEPTION_BITMAP and CR0_GUEST_HOST_MASK should basically be
> the
> +      * bitwise-or of what L1 wants to trap for L2, and what we want to
> +      * trap. Note that CR0.TS also needs updating - we do this later.
> +      */
> +     update_exception_bitmap(vcpu);
> +     vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
> +     vmcs_writel(CR0_GUEST_HOST_MASK,
> ~vcpu->arch.cr0_guest_owned_bits);
> +
> +     /* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer
> below */
> +     vmcs_write32(VM_EXIT_CONTROLS,
> +             vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
> +     vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
> +             (vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));
> +
> +     if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
> +             vmcs_write64(GUEST_IA32_PAT, vmcs12->guest_ia32_pat);
> +     else if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
> +             vmcs_write64(GUEST_IA32_PAT, vmx->vcpu.arch.pat);
> +
> +
> +     set_cr4_guest_host_mask(vmx);
> +
> +     vmcs_write64(TSC_OFFSET,
> +             vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
> +
> +     if (enable_vpid) {
> +             /*
> +              * Trivially support vpid by letting L2s share their parent
> +              * L1's vpid. TODO: move to a more elaborate solution, giving
> +              * each L2 its own vpid and exposing the vpid feature to L1.
> +              */
> +             vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
> +             vmx_flush_tlb(vcpu);
> +     }
> +
> +     if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER)
> +             vcpu->arch.efer = vmcs12->guest_ia32_efer;
> +     if (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE)
> +             vcpu->arch.efer |= (EFER_LMA | EFER_LME);
> +     else
> +             vcpu->arch.efer &= ~(EFER_LMA | EFER_LME);
> +     /* Note: modifies VM_ENTRY/EXIT_CONTROLS and
> GUEST/HOST_IA32_EFER */
> +     vmx_set_efer(vcpu, vcpu->arch.efer);
> +
> +     /*
> +      * This sets GUEST_CR0 to vmcs12->guest_cr0, with possibly a modified
> +      * TS bit (for lazy fpu) and bits which we consider mandatory enabled.
> +      * The CR0_READ_SHADOW is what L2 should have expected to read
> given
> +      * the specifications by L1; It's not enough to take
> +      * vmcs12->cr0_read_shadow because on our cr0_guest_host_mask we
> we
> +      * have more bits than L1 expected.
> +      */
> +     vmx_set_cr0(vcpu, vmcs12->guest_cr0);
> +     vmcs_writel(CR0_READ_SHADOW, guest_readable_cr0(vmcs12));
> +
> +     vmx_set_cr4(vcpu, vmcs12->guest_cr4);
> +     vmcs_writel(CR4_READ_SHADOW, guest_readable_cr4(vmcs12));
> +
> +     /* shadow page tables on either EPT or shadow page tables */
> +     kvm_set_cr3(vcpu, vmcs12->guest_cr3);
> +     kvm_mmu_reset_context(vcpu);
> +
> +     kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->guest_rsp);
> +     kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->guest_rip);
> +     return 0;
> +}
> +
> +/*
>   * Maintain the vcpus_on_cpu and saved_vmcss_on_cpu lists of vcpus and
>   * inactive saved_vmcss on nested entry (L1->L2) or nested exit (L2->L1).
>   *
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-24  7:56                               ` Nadav Har'El
@ 2011-05-24  8:20                                 ` Tian, Kevin
  2011-05-24 11:05                                   ` Avi Kivity
  0 siblings, 1 reply; 118+ messages in thread
From: Tian, Kevin @ 2011-05-24  8:20 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Avi Kivity, Marcelo Tosatti, kvm, gleb, Roedel, Joerg

> From: Nadav Har'El [mailto:nyh@math.technion.ac.il]
> Sent: Tuesday, May 24, 2011 3:57 PM
> 
> On Tue, May 24, 2011, Tian, Kevin wrote about "RE: [PATCH 08/31] nVMX: Fix
> local_vcpus_link handling":
> > > +static inline void loaded_vmcs_init(struct loaded_vmcs *loaded_vmcs)
> > > +{
> > > +	vmcs_clear(loaded_vmcs->vmcs);
> > > +	loaded_vmcs->cpu = -1;
> > > +	loaded_vmcs->launched = 0;
> > > +}
> > > +
> >
> > call it vmcs_init instead since you now remove original vmcs_init invocation,
> > which more reflect the necessity of adding VMCLEAR for a new vmcs?
> 
> The best name for this function, I think, would have been loaded_vmcs_clear,
> because this function isn't necessarily used to "init" - it's also called to
> VMCLEAR an old vmcs (and flush its content back to memory) - in that sense
> it is definitely not a "vmcs_init".
> 
> Unfortunately, I already have a whole chain of functions with this name :(
> the existing loaded_vmcs_clear() does an IPI to the CPU which has this VMCS
> loaded, and causes it to run __loaded_vmcs_clear(), which in turn calls
> the above loaded_vmcs_init(). I wish I could call all three functions
> loaded_vmcs_clear(), but of course I can't. If anyone reading this has a good
> suggestion on how to name these three functions, please let me know.

how about loaded_vmcs_reset?

> 
> > > +static void __loaded_vmcs_clear(void *arg)
> > >  {
> > > -	struct vcpu_vmx *vmx = arg;
> > > +	struct loaded_vmcs *loaded_vmcs = arg;
> > >  	int cpu = raw_smp_processor_id();
> > >
> > > -	if (vmx->vcpu.cpu == cpu)
> > > -		vmcs_clear(vmx->vmcs);
> > > -	if (per_cpu(current_vmcs, cpu) == vmx->vmcs)
> > > +	if (loaded_vmcs->cpu != cpu)
> > > +		return; /* cpu migration can race with cpu offline */
> >
> > what do you mean by "cpu migration" here? why does 'cpu offline'
> > matter here regarding to the cpu change?
> 
> __loaded_vmcs_clear() is typically called in one of two cases: "cpu migration"
> means that a guest that used to run on one CPU, and had its VMCS loaded
> there,
> suddenly needs to run on a different CPU, so we need to clear the VMCS on
> the old CPU. "cpu offline" means that we want to take a certain CPU offline,
> and before we do that we should VMCLEAR all VMCSs which were loaded on it.

So here you need to explicitly differentiate between a vcpu and a real cpu.
In the 1st case it's just 'vcpu migration', and in the latter it's a real
'cpu offline'. 'cpu migration' is generally a RAS feature in the
mission-critical world. :-)

> 
> The (vmx->cpu.cpu != cpu) case in __loaded_vmcs_clear should ideally never
> happen: In the cpu offline path, we only call it for the loaded_vmcss which
> we know for sure are loaded on the current cpu. In the cpu migration path,
> loaded_vmcs_clear runs __loaded_vmcs_clear on the right CPU, which ensures
> that
> equality.
> 
> But, there can be a race condition (this was actually explained to me a while
> back by Avi - I never seen this happening in practice): Imagine that cpu
> migration calls loaded_vmcs_clear, which tells the old cpu (via IPI) to
> VMCLEAR this vmcs. But before that old CPU gets a chance to act on that IPI,
> a decision is made to take it offline, and all loaded_vmcs loaded on it
> (including the one in question) are cleared. When that CPU acts on this IPI,
> it notices that vmx->cpu.cpu==-1, i.e., != cpu, so it doesn't need to do
> anything (in the new version of the code, I made this more explicit, by
> returning immediately in this case).

the reverse also holds true. Right between the point where cpu_offline hits
a loaded_vmcs and the point where it calls __loaded_vmcs_clear, it's possible
that the vcpu is migrated to another cpu, and it's likely that the migration path
(vmx_vcpu_load) has invoked loaded_vmcs_clear but hasn't deleted this vmcs
from the old cpu's linked list. This way, later, when __loaded_vmcs_clear is
invoked on the offlined cpu, there's still a chance to observe cpu as -1.

> 
> At least this is the theory. As I said, I didn't see this problem in practice
> (unsurprising, since I never offlined any CPU). Maybe Avi or someone else can
> comment more about this (vmx->cpu.cpu == cpu) check, which existed before
> my patch - in __vcpu_clear().

I agree this check is necessary, but I just want you to make the comment clear
with the right term.

> 
> > > +static void free_loaded_vmcs(struct loaded_vmcs *loaded_vmcs)
> > > +{
> > > +	if (!loaded_vmcs->vmcs)
> > > +		return;
> > > +	loaded_vmcs_clear(loaded_vmcs);
> > > +	free_vmcs(loaded_vmcs->vmcs);
> > > +	loaded_vmcs->vmcs = NULL;
> > > +}
> >
> > prefer to not do cleanup work through loaded_vmcs since it's just a pointer
> > to a loaded vmcs structure. Though you can carefully arrange the nested
> > vmcs cleanup happening before it, it's not very clean and a bit error prone
> > simply from this function itself. It's clearer to directly cleanup vmcs01, and
> > if you want an assertion could be added to make sure loaded_vmcs doesn't
> > point to any stale vmcs02 structure after nested cleanup step.
> 
> I'm afraid I didn't understand what you meant here... Basically, this
> free_loaded_vmcs() is just a shortcut for loaded_vmcs_clear() and free_vmcs(),
> as doing both is needed in 3 places: nested_free_vmcs02,
> nested_free_all_saved_vmcss, vmx_free_vcpu. The same function is needed
> for both vmcs01 and vmcs02 VMCSs - in both cases when we don't need them
> any
> more we need to VMCLEAR them and then free the VMCS memory. Note that
> this
> function does *not* free the loaded_vmcs structure itself.
> 
> What's wrong with this?
> Would you prefer that I remove this function and explictly call
> loaded_vmcs_clear() and then free_vmcs() in all three places?
> 

Forget about it. I originally thought this was only used to free vmcs01, and thus
wanted to make that purpose obvious.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: [PATCH 18/31] nVMX: Implement VMLAUNCH and VMRESUME
  2011-05-16 19:53 ` [PATCH 18/31] nVMX: Implement VMLAUNCH and VMRESUME Nadav Har'El
@ 2011-05-24  8:45   ` Tian, Kevin
  2011-05-24  9:45     ` Nadav Har'El
  2011-05-25  8:00   ` Tian, Kevin
  1 sibling, 1 reply; 118+ messages in thread
From: Tian, Kevin @ 2011-05-24  8:45 UTC (permalink / raw)
  To: Nadav Har'El, kvm; +Cc: gleb, avi

> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:53 AM
> 
> Implement the VMLAUNCH and VMRESUME instructions, allowing a guest
> hypervisor to run its own guests.
> 
> This patch does not include some of the necessary validity checks on
> vmcs12 fields before the entry. These will appear in a separate patch
> below.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/kvm/vmx.c |   84
> +++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 82 insertions(+), 2 deletions(-)
> 
> --- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
> @@ -347,6 +347,9 @@ struct nested_vmx {
>  	/* vmcs02_list cache of VMCSs recently used to run L2 guests */
>  	struct list_head vmcs02_pool;
>  	int vmcs02_num;
> +
> +	/* Saving the VMCS that we used for running L1 */
> +	struct saved_vmcs saved_vmcs01;
>  	u64 vmcs01_tsc_offset;
>  	/*
>  	 * Guest pages referred to in vmcs02 with host-physical pointers, so
> @@ -4668,6 +4671,8 @@ static void nested_free_all_saved_vmcss(
>  		kfree(item);
>  	}
>  	vmx->nested.vmcs02_num = 0;
> +	if (is_guest_mode(&vmx->vcpu))
> +		nested_free_saved_vmcs(vmx, &vmx->nested.saved_vmcs01);
>  }
> 
>  /* Get a vmcs02 for the current vmcs12. */
> @@ -4959,6 +4964,21 @@ static int handle_vmclear(struct kvm_vcp
>  	return 1;
>  }
> 
> +static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch);
> +
> +/* Emulate the VMLAUNCH instruction */
> +static int handle_vmlaunch(struct kvm_vcpu *vcpu)
> +{
> +	return nested_vmx_run(vcpu, true);
> +}
> +
> +/* Emulate the VMRESUME instruction */
> +static int handle_vmresume(struct kvm_vcpu *vcpu)
> +{
> +
> +	return nested_vmx_run(vcpu, false);
> +}
> +
>  enum vmcs_field_type {
>  	VMCS_FIELD_TYPE_U16 = 0,
>  	VMCS_FIELD_TYPE_U64 = 1,
> @@ -5239,11 +5259,11 @@ static int (*kvm_vmx_exit_handlers[])(st
>  	[EXIT_REASON_INVLPG]		      = handle_invlpg,
>  	[EXIT_REASON_VMCALL]                  = handle_vmcall,
>  	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
> -	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
> +	[EXIT_REASON_VMLAUNCH]                = handle_vmlaunch,
>  	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
>  	[EXIT_REASON_VMPTRST]                 = handle_vmptrst,
>  	[EXIT_REASON_VMREAD]                  = handle_vmread,
> -	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
> +	[EXIT_REASON_VMRESUME]                = handle_vmresume,
>  	[EXIT_REASON_VMWRITE]                 = handle_vmwrite,
>  	[EXIT_REASON_VMOFF]                   = handle_vmoff,
>  	[EXIT_REASON_VMON]                    = handle_vmon,
> @@ -6129,6 +6149,66 @@ static void nested_maintain_per_cpu_list
>  	}
>  }
> 
> +/*
> + * nested_vmx_run() handles a nested entry, i.e., a VMLAUNCH or
> VMRESUME on L1
> + * for running an L2 nested guest.
> + */
> +static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
> +{
> +	struct vmcs12 *vmcs12;
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	int cpu;
> +	struct saved_vmcs *saved_vmcs02;
> +
> +	if (!nested_vmx_check_permission(vcpu))
> +		return 1;
> +	skip_emulated_instruction(vcpu);
> +
> +	vmcs12 = get_vmcs12(vcpu);
> +
> +	enter_guest_mode(vcpu);
> +
> +	vmx->nested.vmcs01_tsc_offset = vmcs_read64(TSC_OFFSET);
> +
> +	/*
> +	 * Switch from L1's VMCS (vmcs01), to L2's VMCS (vmcs02). Remember
> +	 * vmcs01, on which CPU it was last loaded, and whether it was launched
> +	 * (we need all these values next time we will use L1). Then recall
> +	 * these values from the last time vmcs02 was used.
> +	 */
> +	saved_vmcs02 = nested_get_current_vmcs02(vmx);
> +	if (!saved_vmcs02)
> +		return -ENOMEM;
> +
> +	cpu = get_cpu();
> +	vmx->nested.saved_vmcs01.vmcs = vmx->vmcs;
> +	vmx->nested.saved_vmcs01.cpu = vcpu->cpu;
> +	vmx->nested.saved_vmcs01.launched = vmx->launched;
> +	vmx->vmcs = saved_vmcs02->vmcs;
> +	vcpu->cpu = saved_vmcs02->cpu;

this may be another valid reason for your check on cpu_online in your
latest [08/31] local_vcpus_link fix, since the cpu may be offlined after
this assignment. :-)

> +	vmx->launched = saved_vmcs02->launched;
> +
> +	nested_maintain_per_cpu_lists(vmx,
> +		saved_vmcs02, &vmx->nested.saved_vmcs01);
> +
> +	vmx_vcpu_put(vcpu);
> +	vmx_vcpu_load(vcpu, cpu);
> +	vcpu->cpu = cpu;
> +	put_cpu();
> +
> +	vmcs12->launch_state = 1;
> +
> +	prepare_vmcs02(vcpu, vmcs12);

Since prepare_vmcs02 may fail, add a check here and move the launch_state
assignment to after it succeeds?
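
i.e., something along these lines (just to illustrate the ordering; the actual error
path is up to you):

	if (prepare_vmcs02(vcpu, vmcs12)) {
		/* switch back to vmcs01 and fail the nested entry here */
		return 1;
	}
	vmcs12->launch_state = 1;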

Thanks
Kevin

> +
> +	/*
> +	 * Note no nested_vmx_succeed or nested_vmx_fail here. At this point
> +	 * we are no longer running L1, and VMLAUNCH/VMRESUME has not yet
> +	 * returned as far as L1 is concerned. It will only return (and set
> +	 * the success flag) when L2 exits (see nested_vmx_vmexit()).
> +	 */
> +	return 1;
> +}
> +
>  static int vmx_check_intercept(struct kvm_vcpu *vcpu,
>  			       struct x86_instruction_info *info,
>  			       enum x86_intercept_stage stage)
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 17/31] nVMX: Prepare vmcs02 from vmcs01 and vmcs12
  2011-05-24  8:02   ` Tian, Kevin
@ 2011-05-24  9:19     ` Nadav Har'El
  2011-05-24 10:52       ` Tian, Kevin
  0 siblings, 1 reply; 118+ messages in thread
From: Nadav Har'El @ 2011-05-24  9:19 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: kvm, gleb, avi

On Tue, May 24, 2011, Tian, Kevin wrote about "RE: [PATCH 17/31] nVMX: Prepare vmcs02 from vmcs01 and vmcs12":
> > +static inline unsigned long guest_readable_cr4(struct vmcs12 *fields)
> > +{
> > +     return (fields->guest_cr4 & ~fields->cr4_guest_host_mask) |
> > +             (fields->cr4_read_shadow & fields->cr4_guest_host_mask);
> > +}
> > +
> 
> will guest_ prefix look confusing here? The 'guest' has a broad range which makes
> above two functions look like they can be used in non-nested case. Should we stick
> to nested_prefix for nested specific facilities?

I don't know, I thought it made calls like

	vmcs_writel(CR0_READ_SHADOW, guest_readable_cr0(vmcs12));

readable, and the comments (and the parameters) make it obvious it's for
nested only.

I now renamed these functions nested_read_cr0(), nested_read_cr4() - I hope
you like these names better.
	
> > +     if (is_guest_mode(&vmx->vcpu))
> > +             vmx->vcpu.arch.cr4_guest_owned_bits &=
> > +                     ~get_vmcs12(&vmx->vcpu)->cr4_guest_host_mask;
> 
> why not is_nested_mode()? :-P

I assume you're wondering why the function is called is_guest_mode(), and
not is_nested_mode()?

This name was chosen by Avi Kivity in November last year, for the function
previously introduced by Joerg Roedel. My original code (before Joerg added
this function to x86.c) indeed used the term "nested_mode", not "guest_mode".

In January, I pointed to the possibility of confusion between the new
is_guest_mode() and other things called "guest mode", and Avi Kivity said
he will rename it to is_nested_guest() - see
http://lkml.indiana.edu/hypermail/linux/kernel/1101.1/01418.html
But as you can see, he never did this renaming.

That being said, after half a year, I got used to the name is_guest_mode(),
and am no longer convinced it should be changed. It checks whether the vcpu
(not the underlying CPU) is in Intel-SDM-terminology "guest mode". Just like
is_long_mode() checks if the vcpu is in long mode. So I'm fine with leaving
its current name.

> > +static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
> > +{
>...
> > +             if (!vmx->rdtscp_enabled)
> > +                     exec_control &= ~SECONDARY_EXEC_RDTSCP;
> > +             /* Take the following fields only from vmcs12 */
> > +             exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> > +             if (nested_cpu_has(vmcs12,
> > +                             CPU_BASED_ACTIVATE_SECONDARY_CONTROLS))
> > +                     exec_control |= vmcs12->secondary_vm_exec_control;
> 
> should this 2nd exec_control be merged in clear case-by-case flavor?
> 
> what about L0 sets "virtualize x2APIC" bit while L1 doesn't?
> 
> Or what about L0 disables EPT while L1 sets it?
> 
> I think it's better to scrutinize every 2nd exec_control feature with a
> clear policy:
> - whether we want to use the stricter policy which is only set when both L0 and
> L1 set it
> - whether we want to use L1 setting absolutely regardless of L0 setting like
> what you did for virtualize APIC access

Please note that most of the examples you give cannot happen in practice,
because we tell L1 (via MSR) which features it is allowed to use, and we
fail entry if it tries to use disallowed features (before ever reaching
the merge code you're commenting on). So we don't allow L1, for example,
to use the EPT feature (and when nested-EPT support is added, we won't
allow L1 to use EPT if L0 didn't). The general thinking was that for most
fields that we do explicitly allow, "OR" is the right choice.
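
Conceptually, the entry-time check is something like this (a simplified sketch; the
variable and helper names here are illustrative, not necessarily the ones in the
patch):

	/*
	 * Fail the nested entry if L1 asks for a control bit that we did
	 * not advertise to it in the corresponding capability MSR.
	 */
	if (vmcs12->secondary_vm_exec_control &
	    ~nested_vmx_secondary_ctls_high) {	/* advertised allowed-1 bits */
		nested_vmx_failValid(vcpu,
			VMXERR_ENTRY_INVALID_CONTROL_FIELD);
		return 1;
	}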

I'll add this to my bugzilla, and think about it again later.

Thanks,
Nadav

-- 
Nadav Har'El                        |      Tuesday, May 24 2011, 20 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |If you choke a Smurf, what color does it
http://nadav.harel.org.il           |turn?

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-23 16:51                             ` Avi Kivity
@ 2011-05-24  9:22                               ` Roedel, Joerg
  2011-05-24  9:28                                 ` Nadav Har'El
  0 siblings, 1 reply; 118+ messages in thread
From: Roedel, Joerg @ 2011-05-24  9:22 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Nadav Har'El, Marcelo Tosatti, kvm, gleb

On Mon, May 23, 2011 at 12:51:55PM -0400, Avi Kivity wrote:
> On 05/23/2011 07:43 PM, Roedel, Joerg wrote:
> > On Mon, May 23, 2011 at 11:49:17AM -0400, Avi Kivity wrote:
> >
> > >  Joerg, is
> > >
> > >       if (unlikely(cpu != vcpu->cpu)) {
> > >           svm->asid_generation = 0;
> > >           mark_all_dirty(svm->vmcb);
> > >       }
> > >
> > >  susceptible to cpu offline/online?
> >
> > I don't think so. This should be safe for cpu offline/online as long as
> > the cpu-number value is not reused for another physical cpu. But that
> > should be the case afaik.
> >
> 
> Why not? offline/online does reuse cpu numbers AFAIK (and it must, if 
> you have a fully populated machine and offline/online just one cpu).

Yes, you are right. There is a slight possibility that the asid is not
updated when a vcpu has asid_generation == 1 and hasn't been running on
another cpu while this given cpu was offlined/onlined. Very unlikely,
but we cannot rule it out.

Probably we should make the local_vcpu_list from vmx generic, use it
from svm, and fix it this way.

	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-24  9:22                               ` Roedel, Joerg
@ 2011-05-24  9:28                                 ` Nadav Har'El
  2011-05-24  9:57                                   ` Roedel, Joerg
  0 siblings, 1 reply; 118+ messages in thread
From: Nadav Har'El @ 2011-05-24  9:28 UTC (permalink / raw)
  To: Roedel, Joerg; +Cc: Avi Kivity, Marcelo Tosatti, kvm, gleb

On Tue, May 24, 2011, Roedel, Joerg wrote about "Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling":
> Probably we should make the local_vcpu_list from vmx generic, use it
> from svm  and fix it this way.

The point is, local_vcpu_list is now gone, replaced by a loaded_vmcss_on_cpu,
and vcpu->cpu is not set to -1 for any vcpu when a CPU is offlined - also in
VMX...

-- 
Nadav Har'El                        |      Tuesday, May 24 2011, 20 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |In case of emergency, this box may be
http://nadav.harel.org.il           |used as a quotation device.

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 18/31] nVMX: Implement VMLAUNCH and VMRESUME
  2011-05-24  8:45   ` Tian, Kevin
@ 2011-05-24  9:45     ` Nadav Har'El
  2011-05-24 10:54       ` Tian, Kevin
  0 siblings, 1 reply; 118+ messages in thread
From: Nadav Har'El @ 2011-05-24  9:45 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: kvm, gleb, avi

On Tue, May 24, 2011, Tian, Kevin wrote about "RE: [PATCH 18/31] nVMX: Implement VMLAUNCH and VMRESUME":
> > +	/*
> > +	 * Switch from L1's VMCS (vmcs01), to L2's VMCS (vmcs02). Remember
> > +	 * vmcs01, on which CPU it was last loaded, and whether it was launched
> > +	 * (we need all these values next time we will use L1). Then recall
> > +	 * these values from the last time vmcs02 was used.
> > +	 */
> > +	saved_vmcs02 = nested_get_current_vmcs02(vmx);
> > +	if (!saved_vmcs02)
> > +		return -ENOMEM;
> > +
> > +	cpu = get_cpu();
> > +	vmx->nested.saved_vmcs01.vmcs = vmx->vmcs;
> > +	vmx->nested.saved_vmcs01.cpu = vcpu->cpu;
> > +	vmx->nested.saved_vmcs01.launched = vmx->launched;
> > +	vmx->vmcs = saved_vmcs02->vmcs;
> > +	vcpu->cpu = saved_vmcs02->cpu;
> 
> this may be another valid reason for your check on cpu_online in your
> latest [08/31] local_vcpus_link fix, since cpu may be offlined after
> this assignment. :-)

I believe that wrapping this part of the code with get_cpu()/put_cpu()
protected me from these kinds of race conditions.

By the way, please note that this part of the code was changed after my
latest loaded_vmcs overhaul. It now looks like this:

	vmcs02 = nested_get_current_vmcs02(vmx);
	if (!vmcs02)
		return -ENOMEM;

	cpu = get_cpu();
	vmx->loaded_vmcs = vmcs02;
	vmx_vcpu_put(vcpu);
	vmx_vcpu_load(vcpu, cpu);
	vcpu->cpu = cpu;
	put_cpu();

(if Avi gives me the green light, I'll send the entire, up-to-date, patch set
again).

> > +	vmcs12->launch_state = 1;
> > +
> > +	prepare_vmcs02(vcpu, vmcs12);
> 
> Since prepare_vmcs may fail, add a check here and move launch_state
> assignment after its success?

prepare_vmcs02() cannot fail. All the checks that need to be done on vmcs12
are done before calling it, in nested_vmx_run().

Currently, there's a single case where prepare_vmcs02() "fails": when it fails
to access apic_access_addr memory. This is wrong - that check should have been
done earlier. I'll fix that, and make prepare_vmcs02() void.


-- 
Nadav Har'El                        |      Tuesday, May 24 2011, 20 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |It's no use crying over spilt milk -- it
http://nadav.harel.org.il           |only makes it salty for the cat.

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-24  9:28                                 ` Nadav Har'El
@ 2011-05-24  9:57                                   ` Roedel, Joerg
  2011-05-24 10:08                                     ` Avi Kivity
  2011-05-24 10:12                                     ` Nadav Har'El
  0 siblings, 2 replies; 118+ messages in thread
From: Roedel, Joerg @ 2011-05-24  9:57 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Avi Kivity, Marcelo Tosatti, kvm, gleb

On Tue, May 24, 2011 at 05:28:38AM -0400, Nadav Har'El wrote:
> On Tue, May 24, 2011, Roedel, Joerg wrote about "Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling":
> > Probably we should make the local_vcpu_list from vmx generic, use it
> > from svm  and fix it this way.
> 
> The point is, local_vcpu_list is now gone, replaced by a loaded_vmcss_on_cpu,
> and vcpu->cpu is not set to -1 for any vcpu when a CPU is offlined - also in
> VMX...

loaded_vmcss_on_cpu sounds similar; probably this can be generalized. Is
this code already upstream, or is it changed with your nVMX patch set?

	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-24  9:57                                   ` Roedel, Joerg
@ 2011-05-24 10:08                                     ` Avi Kivity
  2011-05-24 10:12                                     ` Nadav Har'El
  1 sibling, 0 replies; 118+ messages in thread
From: Avi Kivity @ 2011-05-24 10:08 UTC (permalink / raw)
  To: Roedel, Joerg; +Cc: Nadav Har'El, Marcelo Tosatti, kvm, gleb

On 05/24/2011 12:57 PM, Roedel, Joerg wrote:
> On Tue, May 24, 2011 at 05:28:38AM -0400, Nadav Har'El wrote:
> >  On Tue, May 24, 2011, Roedel, Joerg wrote about "Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling":
> >  >  Probably we should make the local_vcpu_list from vmx generic, use it
> >  >  from svm  and fix it this way.
> >
> >  The point is, local_vcpu_list is now gone, replaced by a loaded_vmcss_on_cpu,
> >  and vcpu->cpu is not set to -1 for any vcpu when a CPU is offlined - also in
> >  VMX...
>
> loaded_vmcss_on_cpu sound similar, probably this can be generalized.

It's not the same: there is a many:1 relationship between vmcss and 
vcpus (like vmcbs and vcpus).

However, it may be that the general case for svm also needs to treat 
individual vmcbs differently.


> Is
> this code already upstream or is this changed with your nVMX patch-set?
>

Not upstream yet (however generalization, if needed, will be done after 
it's upstream).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-24  9:57                                   ` Roedel, Joerg
  2011-05-24 10:08                                     ` Avi Kivity
@ 2011-05-24 10:12                                     ` Nadav Har'El
  1 sibling, 0 replies; 118+ messages in thread
From: Nadav Har'El @ 2011-05-24 10:12 UTC (permalink / raw)
  To: Roedel, Joerg; +Cc: Avi Kivity, Marcelo Tosatti, kvm, gleb

On Tue, May 24, 2011, Roedel, Joerg wrote about "Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling":
> loaded_vmcss_on_cpu sound similar, probably this can be generalized.

I don't think so - now that a VCPU may have several VMCSs (L1, L2), each
of those may be loaded on a different cpu so we have a list of VMCSs
(the new loaded_vmcs structure), not vcpus. When we offline a CPU, we recall
all VMCSs loaded on it from this list, and clear them; We mark cpu=-1 for
each of those vmcs, but vcpu->cpu remains untouched (and not set to -1)
for all the vcpus.
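
Roughly, the structure in question now looks like this (sketching from memory, so
field names may be slightly off):

	struct loaded_vmcs {
		struct vmcs *vmcs;
		int cpu;	/* cpu it is loaded on, or -1 */
		int launched;
		struct list_head loaded_vmcss_on_cpu_link;
	};

	/* and each cpu keeps a list of the loaded_vmcs structures loaded on it: */
	static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);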

> Is this code already upstream or is this changed with your nVMX patch-set?

Avi asked me to send the patch that does this *before* nvmx, but he has not
merged it yet.


-- 
Nadav Har'El                        |      Tuesday, May 24 2011, 20 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |If you notice this notice, you'll notice
http://nadav.harel.org.il           |it's not worth noticing but is noticable.

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: [PATCH 17/31] nVMX: Prepare vmcs02 from vmcs01 and vmcs12
  2011-05-24  9:19     ` Nadav Har'El
@ 2011-05-24 10:52       ` Tian, Kevin
  0 siblings, 0 replies; 118+ messages in thread
From: Tian, Kevin @ 2011-05-24 10:52 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb, avi

> From: Nadav Har'El
> Sent: Tuesday, May 24, 2011 5:19 PM
> 
> On Tue, May 24, 2011, Tian, Kevin wrote about "RE: [PATCH 17/31] nVMX:
> Prepare vmcs02 from vmcs01 and vmcs12":
> > > +static inline unsigned long guest_readable_cr4(struct vmcs12 *fields)
> > > +{
> > > +     return (fields->guest_cr4 & ~fields->cr4_guest_host_mask) |
> > > +             (fields->cr4_read_shadow &
> fields->cr4_guest_host_mask);
> > > +}
> > > +
> >
> > will guest_ prefix look confusing here? The 'guest' has a broad range which
> makes
> > above two functions look like they can be used in non-nested case. Should we
> stick
> > to nested_prefix for nested specific facilities?
> 
> I don't know, I thought it made calls like
> 
> 	vmcs_writel(CR0_READ_SHADOW, guest_readable_cr0(vmcs12));
> 
> readable, and the comments (and the parameters) make it obvious it's for
> nested only.
> 
> I now renamed these functions nested_read_cr0(), nested_read_cr4() - I hope
> you like these names better.

yes.

> 
> > > +     if (is_guest_mode(&vmx->vcpu))
> > > +             vmx->vcpu.arch.cr4_guest_owned_bits &=
> > > +
> ~get_vmcs12(&vmx->vcpu)->cr4_guest_host_mask;
> >
> > why not is_nested_mode()? :-P
> 
> I assume you're wondering why the function is called is_guest_mode(), and
> not is_nested_mode()?

yes

> 
> This name was chosen by Avi Kivity in November last year, for the function
> previously introduced by Joerg Roedel. My original code (before Joerg added
> this function to x86.c) indeed used the term "nested_mode", not
> "guest_mode".
> 
> In January, I pointed to the possibility of confusion between the new
> is_guest_mode() and other things called "guest mode", and Avi Kivity said
> he will rename it to is_nested_guest() - see
> http://lkml.indiana.edu/hypermail/linux/kernel/1101.1/01418.html
> But as you can see, he never did this renaming.
> 
> That being said, after half a year, I got used to the name is_guest_mode(),
> and am no longer convinced it should be changed. It checks whether the vcpu
> (not the underlying CPU) is in Intel-SDM-terminology "guest mode". Just like
> is_long_mode() checks if the vcpu is in long mode. So I'm fine with leaving
> its current name.

well, it's a small issue, and I'm fine with leaving it as is, though I don't like 'guest' here. :-)

> 
> > > +static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12
> *vmcs12)
> > > +{
> >...
> > > +             if (!vmx->rdtscp_enabled)
> > > +                     exec_control &= ~SECONDARY_EXEC_RDTSCP;
> > > +             /* Take the following fields only from vmcs12 */
> > > +             exec_control &=
> ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> > > +             if (nested_cpu_has(vmcs12,
> > > +
> CPU_BASED_ACTIVATE_SECONDARY_CONTROLS))
> > > +                     exec_control |=
> vmcs12->secondary_vm_exec_control;
> >
> > should this 2nd exec_control be merged in clear case-by-case flavor?
> >
> > what about L0 sets "virtualize x2APIC" bit while L1 doesn't?
> >
> > Or what about L0 disables EPT while L1 sets it?
> >
> > I think it's better to scrutinize every 2nd exec_control feature with a
> > clear policy:
> > - whether we want to use the stricter policy which is only set when both L0
> and
> > L1 set it
> > - whether we want to use L1 setting absolutely regardless of L0 setting like
> > what you did for virtualize APIC access
> 
> Please note that most of the examples you give cannot happen in practice,
> because we tell L1 (via MSR) which features it is allowed to use, and we
> fail entry if it tries to use disallowed features (before ever reaching
> the merge code you're commenting on). So we don't allow L1, for example,
> to use the EPT feature (and when nested-EPT support is added, we won't
> allow L1 to use EPT if L0 didn't). The general thinking was that for most
> fields that we do explicitly allow, "OR" is the right choice.

This really depends on the semantics of each control bit. To achieve the strictest
setting between L0 and L1, sometimes you want to use AND and sometimes you
want to use OR.

From a design p.o.v., it's better not to rely on such implicit assumptions made in
other places; just make it clean and correct here. Also, your example doesn't cover
the case where L0 sets some bits which are not exposed to L1 via the MSR. For
example, as I said earlier, what if L0 sets "virtualize x2APIC mode" while
it's not enabled by, or not exposed to, L1? With OR, you then unconditionally enable
this mode for L2 as well, while L1 has no logic to handle it.

I'd like to see a clean policy for each of the known control bits here, even if it's a
strict policy that incurs more VM exits; that can be optimized in the future.
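
Something in this spirit is what I have in mind (purely illustrative, on top of the
exec_control you already computed from L0's setting):

	/* strictest policy: enabled only if both L0 and L1 enabled it */
	if (!nested_cpu_has(vmcs12, CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) ||
	    !(vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_ENABLE_EPT))
		exec_control &= ~SECONDARY_EXEC_ENABLE_EPT;

	/* L1's choice regardless of L0, as you already do for APIC accesses */
	exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
	if (nested_cpu_has(vmcs12, CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) &&
	    (vmcs12->secondary_vm_exec_control &
	     SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
		exec_control |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;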

> 
> I'll add this to my bugzilla, and think about it again later.
>

Thanks
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: [PATCH 18/31] nVMX: Implement VMLAUNCH and VMRESUME
  2011-05-24  9:45     ` Nadav Har'El
@ 2011-05-24 10:54       ` Tian, Kevin
  0 siblings, 0 replies; 118+ messages in thread
From: Tian, Kevin @ 2011-05-24 10:54 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb, avi

> From: Nadav Har'El [mailto:nyh@math.technion.ac.il]
> Sent: Tuesday, May 24, 2011 5:45 PM
> 
> On Tue, May 24, 2011, Tian, Kevin wrote about "RE: [PATCH 18/31] nVMX:
> Implement VMLAUNCH and VMRESUME":
> > > +	/*
> > > +	 * Switch from L1's VMCS (vmcs01), to L2's VMCS (vmcs02).
> Remember
> > > +	 * vmcs01, on which CPU it was last loaded, and whether it was
> launched
> > > +	 * (we need all these values next time we will use L1). Then recall
> > > +	 * these values from the last time vmcs02 was used.
> > > +	 */
> > > +	saved_vmcs02 = nested_get_current_vmcs02(vmx);
> > > +	if (!saved_vmcs02)
> > > +		return -ENOMEM;
> > > +
> > > +	cpu = get_cpu();
> > > +	vmx->nested.saved_vmcs01.vmcs = vmx->vmcs;
> > > +	vmx->nested.saved_vmcs01.cpu = vcpu->cpu;
> > > +	vmx->nested.saved_vmcs01.launched = vmx->launched;
> > > +	vmx->vmcs = saved_vmcs02->vmcs;
> > > +	vcpu->cpu = saved_vmcs02->cpu;
> >
> > this may be another valid reason for your check on cpu_online in your
> > latest [08/31] local_vcpus_link fix, since cpu may be offlined after
> > this assignment. :-)
> 
> I believe that wrapping this part of the code with get_cpu()/put_cpu()
> protected me from these kinds of race conditions.

you're right.

> 
> By the way, please note that this part of the code was changed after my
> latest loaded_vmcs overhaul. It now looks like this:
> 
> 	vmcs02 = nested_get_current_vmcs02(vmx);
> 	if (!vmcs02)
> 		return -ENOMEM;
> 
> 	cpu = get_cpu();
> 	vmx->loaded_vmcs = vmcs02;
> 	vmx_vcpu_put(vcpu);
> 	vmx_vcpu_load(vcpu, cpu);
> 	vcpu->cpu = cpu;
> 	put_cpu();
> 
> (if Avi gives me the green light, I'll send the entire, up-to-date, patch set
> again).

Generally your new patch looks good.

> 
> > > +	vmcs12->launch_state = 1;
> > > +
> > > +	prepare_vmcs02(vcpu, vmcs12);
> >
> > Since prepare_vmcs may fail, add a check here and move launch_state
> > assignment after its success?
> 
> prepare_vmcs02() cannot fail. All the checks that need to be done on vmcs12
> are done before calling it, in nested_vmx_run().
> 
> Currently, there's a single case where prepare_vmcs02 "fails" when it fails
> to access apic_access_addr memory. This is wrong - the check should have
> been
> done earlier. I'll fix that, and make prepare_vmcs02() void.
> 

then no problem, as long as you keep this choice clear.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-24  8:20                                 ` Tian, Kevin
@ 2011-05-24 11:05                                   ` Avi Kivity
  2011-05-24 11:20                                     ` Tian, Kevin
  0 siblings, 1 reply; 118+ messages in thread
From: Avi Kivity @ 2011-05-24 11:05 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: Nadav Har'El, Marcelo Tosatti, kvm, gleb, Roedel, Joerg

On 05/24/2011 11:20 AM, Tian, Kevin wrote:
> >
> >  The (vmx->cpu.cpu != cpu) case in __loaded_vmcs_clear should ideally never
> >  happen: In the cpu offline path, we only call it for the loaded_vmcss which
> >  we know for sure are loaded on the current cpu. In the cpu migration path,
> >  loaded_vmcs_clear runs __loaded_vmcs_clear on the right CPU, which ensures
> >  that
> >  equality.
> >
> >  But, there can be a race condition (this was actually explained to me a while
> >  back by Avi - I never seen this happening in practice): Imagine that cpu
> >  migration calls loaded_vmcs_clear, which tells the old cpu (via IPI) to
> >  VMCLEAR this vmcs. But before that old CPU gets a chance to act on that IPI,
> >  a decision is made to take it offline, and all loaded_vmcs loaded on it
> >  (including the one in question) are cleared. When that CPU acts on this IPI,
> >  it notices that vmx->cpu.cpu==-1, i.e., != cpu, so it doesn't need to do
> >  anything (in the new version of the code, I made this more explicit, by
> >  returning immediately in this case).
>
> the reverse also holds true. Right between the point where cpu_offline hits
> a loaded_vmcs and the point where it calls __loaded_vmcs_clear, it's possible
> that the vcpu is migrated to another cpu, and it's likely that migration path
> (vmx_vcpu_load) has invoked loaded_vmcs_clear but hasn't delete this vmcs
> from old cpu's linked list. This way later when __loaded_vmcs_clear is
> invoked on the offlined cpu, there's still chance to observe cpu as -1.

I don't think it's possible.  Both calls are done with interrupts disabled.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-24 11:05                                   ` Avi Kivity
@ 2011-05-24 11:20                                     ` Tian, Kevin
  2011-05-24 11:27                                       ` Avi Kivity
  0 siblings, 1 reply; 118+ messages in thread
From: Tian, Kevin @ 2011-05-24 11:20 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Nadav Har'El, Marcelo Tosatti, kvm, gleb, Roedel, Joerg

> From: Avi Kivity [mailto:avi@redhat.com]
> Sent: Tuesday, May 24, 2011 7:06 PM
> 
> On 05/24/2011 11:20 AM, Tian, Kevin wrote:
> > >
> > >  The (vmx->cpu.cpu != cpu) case in __loaded_vmcs_clear should ideally
> never
> > >  happen: In the cpu offline path, we only call it for the loaded_vmcss which
> > >  we know for sure are loaded on the current cpu. In the cpu migration
> path,
> > >  loaded_vmcs_clear runs __loaded_vmcs_clear on the right CPU, which
> ensures
> > >  that
> > >  equality.
> > >
> > >  But, there can be a race condition (this was actually explained to me a
> while
> > >  back by Avi - I never seen this happening in practice): Imagine that cpu
> > >  migration calls loaded_vmcs_clear, which tells the old cpu (via IPI) to
> > >  VMCLEAR this vmcs. But before that old CPU gets a chance to act on that
> IPI,
> > >  a decision is made to take it offline, and all loaded_vmcs loaded on it
> > >  (including the one in question) are cleared. When that CPU acts on this
> IPI,
> > >  it notices that vmx->cpu.cpu==-1, i.e., != cpu, so it doesn't need to do
> > >  anything (in the new version of the code, I made this more explicit, by
> > >  returning immediately in this case).
> >
> > the reverse also holds true. Right between the point where cpu_offline hits
> > a loaded_vmcs and the point where it calls __loaded_vmcs_clear, it's possible
> > that the vcpu is migrated to another cpu, and it's likely that migration path
> > (vmx_vcpu_load) has invoked loaded_vmcs_clear but hasn't delete this vmcs
> > from old cpu's linked list. This way later when __loaded_vmcs_clear is
> > invoked on the offlined cpu, there's still chance to observe cpu as -1.
> 
> I don't think it's possible.  Both calls are done with interrupts disabled.

If that's the case, then there's another potential issue: deadlock may happen
when calling smp_call_function_single() with interrupts disabled.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-24 11:20                                     ` Tian, Kevin
@ 2011-05-24 11:27                                       ` Avi Kivity
  2011-05-24 11:30                                         ` Tian, Kevin
  0 siblings, 1 reply; 118+ messages in thread
From: Avi Kivity @ 2011-05-24 11:27 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: Nadav Har'El, Marcelo Tosatti, kvm, gleb, Roedel, Joerg

On 05/24/2011 02:20 PM, Tian, Kevin wrote:
> >  I don't think it's possible.  Both calls are done with interrupts disabled.
>
> If that's the case then there's another potential issue. Deadlock may happen
> when calling smp_call_function_single with interrupt disabled.

We don't do that.  vcpu migration calls vcpu_clear() with interrupts 
enabled, which then calls smp_call_function_single(), which calls 
__vcpu_clear() with interrupts disabled.  vmclear_local_vcpus() is 
called from interrupts disabled (and calls __vcpu_clear() directly).
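
For reference, the existing code is roughly (paraphrasing from memory, not quoting):

	/* vcpu migration path: called with interrupts enabled */
	static void vcpu_clear(struct vcpu_vmx *vmx)
	{
		if (vmx->vcpu.cpu == -1)
			return;
		smp_call_function_single(vmx->vcpu.cpu, __vcpu_clear, vmx, 1);
	}

	/* cpu offline path: already runs with interrupts disabled */
	static void vmclear_local_vcpus(void)
	{
		int cpu = raw_smp_processor_id();
		struct vcpu_vmx *vmx, *n;

		list_for_each_entry_safe(vmx, n, &per_cpu(vcpus_on_cpu, cpu),
					 local_vcpus_link)
			__vcpu_clear(vmx);
	}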

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-24 11:27                                       ` Avi Kivity
@ 2011-05-24 11:30                                         ` Tian, Kevin
  2011-05-24 11:36                                           ` Avi Kivity
  0 siblings, 1 reply; 118+ messages in thread
From: Tian, Kevin @ 2011-05-24 11:30 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Nadav Har'El, Marcelo Tosatti, kvm, gleb, Roedel, Joerg

> From: Avi Kivity [mailto:avi@redhat.com]
> Sent: Tuesday, May 24, 2011 7:27 PM
> 
> On 05/24/2011 02:20 PM, Tian, Kevin wrote:
> > >  I don't think it's possible.  Both calls are done with interrupts disabled.
> >
> > If that's the case then there's another potential issue. Deadlock may happen
> > when calling smp_call_function_single with interrupt disabled.
> 
> We don't do that.  vcpu migration calls vcpu_clear() with interrupts
> enabled, which then calls smp_call_function_single(), which calls
> __vcpu_clear() with interrupts disabled.  vmclear_local_vcpus() is
> called from interrupts disabled (and calls __vcpu_clear() directly).
> 

OK, that's clear to me now. 

Thanks
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-24 11:30                                         ` Tian, Kevin
@ 2011-05-24 11:36                                           ` Avi Kivity
  2011-05-24 11:40                                             ` Tian, Kevin
  0 siblings, 1 reply; 118+ messages in thread
From: Avi Kivity @ 2011-05-24 11:36 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: Nadav Har'El, Marcelo Tosatti, kvm, gleb, Roedel, Joerg

On 05/24/2011 02:30 PM, Tian, Kevin wrote:
> >
> >  We don't do that.  vcpu migration calls vcpu_clear() with interrupts
> >  enabled, which then calls smp_call_function_single(), which calls
> >  __vcpu_clear() with interrupts disabled.  vmclear_local_vcpus() is
> >  called from interrupts disabled (and calls __vcpu_clear() directly).
> >
>
> OK, that's clear to me now.

Are there still open issues about the patch?

(Nadav, please post patches in the future in new threads so they're 
easier to find)

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-24 11:36                                           ` Avi Kivity
@ 2011-05-24 11:40                                             ` Tian, Kevin
  2011-05-24 11:59                                               ` Nadav Har'El
  0 siblings, 1 reply; 118+ messages in thread
From: Tian, Kevin @ 2011-05-24 11:40 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Nadav Har'El, Marcelo Tosatti, kvm, gleb, Roedel, Joerg

> From: Avi Kivity [mailto:avi@redhat.com]
> Sent: Tuesday, May 24, 2011 7:37 PM
> 
> On 05/24/2011 02:30 PM, Tian, Kevin wrote:
> > >
> > >  We don't do that.  vcpu migration calls vcpu_clear() with interrupts
> > >  enabled, which then calls smp_call_function_single(), which calls
> > >  __vcpu_clear() with interrupts disabled.  vmclear_local_vcpus() is
> > >  called from interrupts disabled (and calls __vcpu_clear() directly).
> > >
> >
> > OK, that's clear to me now.
> 
> Are there still open issues about the patch?
> 
> (Nadav, please post patches in the future in new threads so they're
> easier to find)
> 

I'm fine with this patch, except that Nadav needs to clarify the comment
in __loaded_vmcs_clear (regarding the 'cpu migration' and 'cpu offline' part,
which I replied to in another mail).

Thanks
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling
  2011-05-24 11:40                                             ` Tian, Kevin
@ 2011-05-24 11:59                                               ` Nadav Har'El
  0 siblings, 0 replies; 118+ messages in thread
From: Nadav Har'El @ 2011-05-24 11:59 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: Avi Kivity, Marcelo Tosatti, kvm, gleb, Roedel, Joerg

On Tue, May 24, 2011, Tian, Kevin wrote about "RE: [PATCH 08/31] nVMX: Fix local_vcpus_link handling":
> I'm fine with this patch except that Nadav needs to clarify the comment
> in __loaded_vmcs_clear (regarding to 'cpu migration' and 'cpu offline' part
> which I replied in another mail)

I added a single letter, "v", to my comment ;-)

Please note that the same code existed previously, and didn't have any comment.
If you find this short comment more confusing (or wrong) than helpful, then I
can just remove it.

Avi, I'll send a new version of patch 1 in a few minutes, in a new thread
this time ;-) Please let me know when (or if) you are prepared to apply the
rest of the patches, so I can send a new version, rebased to the current
trunk and with all the fixes people asked for in the last few days.


-- 
Nadav Har'El                        |      Tuesday, May 24 2011, 20 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Ms Piggy's last words: "I'm pink,
http://nadav.harel.org.il           |therefore I'm ham."

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: [PATCH 20/31] nVMX: Exiting from L2 to L1
  2011-05-16 19:54 ` [PATCH 20/31] nVMX: Exiting from L2 to L1 Nadav Har'El
@ 2011-05-24 12:58   ` Tian, Kevin
  2011-05-24 13:43     ` Nadav Har'El
  2011-05-25  2:43   ` Tian, Kevin
  1 sibling, 1 reply; 118+ messages in thread
From: Tian, Kevin @ 2011-05-24 12:58 UTC (permalink / raw)
  To: Nadav Har'El, kvm; +Cc: gleb, avi

> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:54 AM
> 
> This patch implements nested_vmx_vmexit(), called when the nested L2 guest
> exits and we want to run its L1 parent and let it handle this exit.
> 
> Note that this will not necessarily be called on every L2 exit. L0 may decide
> to handle a particular exit on its own, without L1's involvement; In that
> case, L0 will handle the exit, and resume running L2, without running L1 and
> without calling nested_vmx_vmexit(). The logic for deciding whether to handle
> a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
> will appear in a separate patch below.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/kvm/vmx.c |  257
> +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 257 insertions(+)
> 
> --- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
> @@ -6203,6 +6203,263 @@ static int nested_vmx_run(struct kvm_vcp
>  	return 1;
>  }
> 
> +/*
> + * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date
> + * because L2 may have changed some cr0 bits directly (see
> CRO_GUEST_HOST_MASK)
> + * without L0 trapping the change and updating vmcs12.
> + * This function returns the value we should put in vmcs12.guest_cr0. It's not
> + * enough to just return the current (vmcs02) GUEST_CR0 - that may not be
> the
> + * guest cr0 that L1 thought it was giving its L2 guest; It is possible that
> + * L1 wished to allow its guest to set some cr0 bit directly, but we (L0) asked
> + * to trap this change and instead set just the read shadow bit. If this is the
> + * case, we need to copy these read-shadow bits back to vmcs12.guest_cr0,
> where
> + * L1 believes they already are.
> + */
> +static inline unsigned long
> +vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
> +{
> +	/*
> +	 * As explained above, we take a bit from GUEST_CR0 if we allowed the
> +	 * guest to modify it untrapped (vcpu->arch.cr0_guest_owned_bits), or
> +	 * if we did trap it - if we did so because L1 asked to trap this bit
> +	 * (vmcs12->cr0_guest_host_mask). Otherwise (bits we trapped but L1
> +	 * didn't expect us to trap) we read from CR0_READ_SHADOW.
> +	 */
> +	unsigned long guest_cr0_bits =
> +		vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask;
> +	return (vmcs_readl(GUEST_CR0) & guest_cr0_bits) |
> +	       (vmcs_readl(CR0_READ_SHADOW) & ~guest_cr0_bits);
> +}

Hi, Nadav,

Not sure whether I'm reading the above operation wrong, but at a glance it doesn't
look exactly correct to me. Say a bit is set in both L0's and L1's cr0_guest_host_mask.
In that case the bit from vmcs12_GUEST_CR0 resides in vmcs02_CR0_READ_SHADOW, yet the
operation above returns the vmcs02_GUEST_CR0 bit instead.

Instead of constructing vmcs12_GUEST_CR0 completely from vmcs02_GUEST_CR0,
why not just update the bits which can be altered, while keeping the remaining bits
from vmcs12_GUEST_CR0? Say something like:

vmcs12->guest_cr0 &= vmcs12->cr0_guest_host_mask; /* keep unchanged bits */
vmcs12->guest_cr0 |= (vmcs_readl(GUEST_CR0) & vcpu->arch.cr0_guest_owned_bits) |
	(vmcs_readl(CR0_READ_SHADOW) & ~( vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask))

Thanks
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 20/31] nVMX: Exiting from L2 to L1
  2011-05-24 12:58   ` Tian, Kevin
@ 2011-05-24 13:43     ` Nadav Har'El
  2011-05-25  0:55       ` Tian, Kevin
  0 siblings, 1 reply; 118+ messages in thread
From: Nadav Har'El @ 2011-05-24 13:43 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: kvm, gleb, avi

On Tue, May 24, 2011, Tian, Kevin wrote about "RE: [PATCH 20/31] nVMX: Exiting from L2 to L1":
> > +vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
> > +{
> > +	/*
> > +	 * As explained above, we take a bit from GUEST_CR0 if we allowed the
> > +	 * guest to modify it untrapped (vcpu->arch.cr0_guest_owned_bits), or
> > +	 * if we did trap it - if we did so because L1 asked to trap this bit
> > +	 * (vmcs12->cr0_guest_host_mask). Otherwise (bits we trapped but L1
> > +	 * didn't expect us to trap) we read from CR0_READ_SHADOW.
> > +	 */
> > +	unsigned long guest_cr0_bits =
> > +		vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask;
> > +	return (vmcs_readl(GUEST_CR0) & guest_cr0_bits) |
> > +	       (vmcs_readl(CR0_READ_SHADOW) & ~guest_cr0_bits);
> > +}
> 
> Hi, Nadav,
> 
> Not sure whether I get above operation wrong.

This is one of the trickiest functions in nested VMX, which is why I added
15 lines of comments (!) on just two statements of code.

> But it looks not exactly correct to me
> in a glimpse. Say a bit set both in L0/L1's cr0_guest_host_mask. In such case that
> bit from vmcs12_GUEST_CR0 resides in vmcs02_CR0_READ_SHADOW, however above
> operation will make vmcs02_GUEST_CR0 bit returned instead.

This behavior is correct: If a bit is set in L1's cr0_guest_host_mask (and
in particular, if it is set in both L0's and L1's), we always exit to L1 when
L2 changes this bit, and this bit cannot change while L2 is running, so
naturally after the run vmcs02.guest_cr0 and vmcs12.guest_cr0 are still
identical in that bit.
Copying that bit from vmcs02_CR0_READ_SHADOW, like you suggested, would be
completely wrong in this case: When L1 set a bit in cr0_guest_host_mask,
the vmcs02->cr0_read_shadow is vmcs12->cr0_read_shadow (see nested_read_cr0),
and is just a pretense that L1 set up for L2 - it is NOT the real bit of
guest_cr0, so copying it into guest_cr0 would be wrong.
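
(Recall that in vmcs02 a cr0 bit is trapped if either L0 or L1 wants it trapped; the
patch does, roughly, the same thing for cr0 that you saw for cr4 in the other
sub-thread:

	vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);

so GUEST_CR0 bits covered by either mask cannot be changed by L2 directly.)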

Note that this function is completely different from nested_read_cr0 (the
new name), which behaves similar to what you suggested but serves a completely
different (and in some respect, opposite) function.

I think my comments in the code are clearer than what I just wrote here, so
please take a look at them again, and let me know if you find any errors.

> Instead of constructing vmcs12_GUEST_CR0 completely from vmcs02_GUEST_CR0,
> why not just updating bits which can be altered while keeping the rest bits from
> vmcs12_GUEST_CR0? Say something like:
> 
> vmcs12->guest_cr0 &= vmcs12->cr0_guest_host_mask; /* keep unchanged bits */
> vmcs12->guest_cr0 |= (vmcs_readl(GUEST_CR0) & vcpu->arch.cr0_guest_owned_bits) |
> 	(vmcs_readl(CR0_READ_SHADOW) & ~( vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask))

I guess I could do something like this, but do you think it's clearer?
I don't. Behind all the details, my formula emphasises that MOST cr0 bits
can be just copied from vmcs02 to vmcs12 as is - and we only have to do
something strange for special bits - where L0 wanted to trap but L1 didn't.
In your formula, it looks like there are 3 different cases instead of 2.

In any case, your formula is definitely not more correct, because the formulas
are in fact equivalent - let me prove:

If, instead of taking the "unchanged bits" (as you call them) from
vmcs12->guest_cr0, you take them from vmcs02->guest_cr0 (you can,
because they couldn't have changed), you end up with *exactly* the same
formula I used. Here is the proof:

 yourformula =
	(vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask) |
	(vmcs_readl(GUEST_CR0) & vcpu->arch.cr0_guest_owned_bits) |
 	(vmcs_readl(CR0_READ_SHADOW) & ~( vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask))

Now because of the "unchanged bits",
	(vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask) ==
	(vmcs02->guest_cr0 & vmcs12->cr0_guest_host_mask) ==

          (and note that vmcs02->guest_cr0 is vmcs_readl(GUEST_CR0))

so substituting this into yourformula, it becomes

 yourformula =
	(vmcs_readl(GUEST_CR0) & vmcs12->cr0_guest_host_mask) |
	(vmcs_readl(GUEST_CR0) & vcpu->arch.cr0_guest_owned_bits) |
 	(vmcs_readl(CR0_READ_SHADOW) & ~( vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask))

or, simplifying

 yourformula =
	(vmcs_readl(GUEST_CR0) & (vmcs12->cr0_guest_host_mask | vcpu->arch.cr0_guest_owned_bits) |
 	(vmcs_readl(CR0_READ_SHADOW) & ~( vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask))

now, using the name I used:
	unsigned long guest_cr0_bits =
		vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask;

you end up with

 yourforumula =
	(vmcs_readl(GUEST_CR0) & guest_cr0_bits) |
 	(vmcs_readl(CR0_READ_SHADOW) & ~guest_cr0_bits )

Which is, believe it or not, exactly my formula :-)


-- 
Nadav Har'El                        |      Tuesday, May 24 2011, 20 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |From the Linux getopt(3) manpage: "BUGS:
http://nadav.harel.org.il           |This manpage is confusing."

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: [PATCH 20/31] nVMX: Exiting from L2 to L1
  2011-05-24 13:43     ` Nadav Har'El
@ 2011-05-25  0:55       ` Tian, Kevin
  2011-05-25  8:06         ` Nadav Har'El
  0 siblings, 1 reply; 118+ messages in thread
From: Tian, Kevin @ 2011-05-25  0:55 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb, avi

> From: Nadav Har'El [mailto:nyh@math.technion.ac.il]
> Sent: Tuesday, May 24, 2011 9:43 PM
> 
> On Tue, May 24, 2011, Tian, Kevin wrote about "RE: [PATCH 20/31] nVMX:
> Exiting from L2 to L1":
> > > +vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
> > > +{
> > > +	/*
> > > +	 * As explained above, we take a bit from GUEST_CR0 if we allowed
> the
> > > +	 * guest to modify it untrapped (vcpu->arch.cr0_guest_owned_bits),
> or
> > > +	 * if we did trap it - if we did so because L1 asked to trap this bit
> > > +	 * (vmcs12->cr0_guest_host_mask). Otherwise (bits we trapped but
> L1
> > > +	 * didn't expect us to trap) we read from CR0_READ_SHADOW.
> > > +	 */
> > > +	unsigned long guest_cr0_bits =
> > > +		vcpu->arch.cr0_guest_owned_bits |
> vmcs12->cr0_guest_host_mask;
> > > +	return (vmcs_readl(GUEST_CR0) & guest_cr0_bits) |
> > > +	       (vmcs_readl(CR0_READ_SHADOW) & ~guest_cr0_bits);
> > > +}
> >
> > Hi, Nadav,
> >
> > Not sure whether I get above operation wrong.
> 
> This is one of the trickiest functions in nested VMX, which is why I added
> 15 lines of comments (!) on just two statements of code.

I read the comment carefully, and the scenario I described is not covered there.

> 
> > But it looks not exactly correct to me
> > in a glimpse. Say a bit set both in L0/L1's cr0_guest_host_mask. In such case
> that
> > bit from vmcs12_GUEST_CR0 resides in vmcs02_CR0_READ_SHADOW,
> however above
> > operation will make vmcs02_GUEST_CR0 bit returned instead.
> 
> This behavior is correct: If a bit is set in L1's cr0_guest_host_mask (and
> in particular, if it is set in both L0's and L1's), we always exit to L1 when
> L2 changes this bit, and this bit cannot change while L2 is running, so
> naturally after the run vmcs02.guest_cr0 and vmcs12.guest_cr0 are still
> identical in that be.

Are you sure this is the case? Loading vmcs12.guest_cr0 into vmcs02 is handled
just like L1 itself updating GUEST_CR0, which is why you use
vmx_set_cr0(vcpu, vmcs12->guest_cr0) in prepare_vmcs02. If L0 has a bit set in
L0's cr0_guest_host_mask, the corresponding bit of vmcs12.guest_cr0 will be
cached in vmcs02.cr0_read_shadow anyway. This is not related to whether L2
changes that bit.

IOW, I disagree that if L0 and L1 set the same bit in cr0_guest_host_mask, then
that bit is identical in vmcs02.guest_cr0 and vmcs12.guest_cr0, because L1
has no permission to effectively set its bit in this case.

> Copying that bit from vmcs02_CR0_READ_SHADOW, like you suggested, would
> be
> completely wrong in this case: When L1 set a bit in cr0_guest_host_mask,
> the vmcs02->cr0_read_shadow is vmcs12->cr0_read_shadow (see
> nested_read_cr0),
> and is just a pretense that L1 set up for L2 - it is NOT the real bit of
> guest_cr0, so copying it into guest_cr0 would be wrong.

So I'm talking about preserving that bit from vmcs12.guest_cr0 when it's set
in vmcs12.cr0_guest_host_mask, which is the natural outcome.

> 
> Note that this function is completely different from nested_read_cr0 (the
> new name), which behaves similar to what you suggested but serves a
> completely
> different (and in some respect, opposite) function.
> 
> I think my comments in the code are clearer than what I just wrote here, so
> please take a look at them again, and let me know if you find any errors.
> 
> > Instead of constructing vmcs12_GUEST_CR0 completely from
> vmcs02_GUEST_CR0,
> > why not just updating bits which can be altered while keeping the rest bits
> from
> > vmcs12_GUEST_CR0? Say something like:
> >
> > vmcs12->guest_cr0 &= vmcs12->cr0_guest_host_mask; /* keep unchanged
> bits */
> > vmcs12->guest_cr0 |= (vmcs_readl(GUEST_CR0) &
> vcpu->arch.cr0_guest_owned_bits) |
> > 	(vmcs_readl(CR0_READ_SHADOW) &
> ~( vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask))
> 
> I guess I could do something like this, but do you think it's clearer?
> I don't. Behind all the details, my formula emphasises that MOST cr0 bits
> can be just copied from vmcs02 to vmcs12 as is - and we only have to do
> something strange for special bits - where L0 wanted to trap but L1 didn't.
> In your formula, it looks like there are 3 different cases instead of 2.

But my formula is clearer, given that it sticks to the meaning of
cr0_guest_host_mask: you only need to update the cr0 bits which can be modified
by L2 without a trap, while just keeping the rest.

> 
> In any case, your formula is definitely not more correct, because the formulas
> are in fact equivalent - let me prove:
> 
> If, instead of taking the "unchanged bits" (as you call them) from
> vmcs12->guest_cr0, you take them from vmcs02->guest_cr0 (you can,
> because they couldn't have changed), you end up with *exactly* the same
> formula I used. Here is the proof:
> 
>  yourformula =
> 	(vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask) |
> 	(vmcs_readl(GUEST_CR0) & vcpu->arch.cr0_guest_owned_bits) |
>  	(vmcs_readl(CR0_READ_SHADOW) &
> ~( vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask))
> 
> Now because of the "unchanged bits",
> 	(vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask) ==
> 	(vmcs02->guest_cr0 & vmcs12->cr0_guest_host_mask) ==
> 
>           (and note that vmcs02->guest_cr0 is vmcs_readl(GUEST_CR0))

this is the problem:

	(vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask) !=
	(vmcs02->guest_cr0 & vmcs12->cr0_guest_host_mask)

only the equation below holds true:

	(vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask & ~L0->cr0_guest_host_mask) ==
	(vmcs02->guest_cr0 & vmcs12->cr0_guest_host_mask & ~L0->cr0_guest_host_mask)

When a bit of vmcs12->cr0_guest_host_mask is set, it simply implies that L1
wants to control the bit instead of L2. However, whether L1 can really control that
bit still depends on whether L0 allows it to!

> 
> so this in yourformula, it becomes
> 
>  yourformula =
> 	(vmcs_readl(GUEST_CR0) & vmcs12->cr0_guest_host_mask) |
> 	(vmcs_readl(GUEST_CR0) & vcpu->arch.cr0_guest_owned_bits) |
>  	(vmcs_readl(CR0_READ_SHADOW) &
> ~( vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask))
> 
> or, simplifying
> 
>  yourformula =
> 	(vmcs_readl(GUEST_CR0) & (vmcs12->cr0_guest_host_mask |
> vcpu->arch.cr0_guest_owned_bits) |
>  	(vmcs_readl(CR0_READ_SHADOW) &
> ~( vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask))
> 
> now, using the name I used:
> 	unsigned long guest_cr0_bits =
> 		vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask;
> 
> you end up with
> 
>  yourforumula =
> 	(vmcs_readl(GUEST_CR0) & guest_cr0_bits) |
>  	(vmcs_readl(CR0_READ_SHADOW) & ~guest_cr0_bits )
> 
> Which is, believe it or not, exactly my formula :-)
> 

So with my interpretation, the two formulas are different because you
misuse vmcs12.cr0_guest_host_mask. :-)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: [PATCH 20/31] nVMX: Exiting from L2 to L1
  2011-05-16 19:54 ` [PATCH 20/31] nVMX: Exiting from L2 to L1 Nadav Har'El
  2011-05-24 12:58   ` Tian, Kevin
@ 2011-05-25  2:43   ` Tian, Kevin
  2011-05-25 13:21     ` Nadav Har'El
  1 sibling, 1 reply; 118+ messages in thread
From: Tian, Kevin @ 2011-05-25  2:43 UTC (permalink / raw)
  To: Nadav Har'El, kvm; +Cc: gleb, avi

> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:54 AM
> 
> This patch implements nested_vmx_vmexit(), called when the nested L2 guest
> exits and we want to run its L1 parent and let it handle this exit.
> 
> Note that this will not necessarily be called on every L2 exit. L0 may decide
> to handle a particular exit on its own, without L1's involvement; In that
> case, L0 will handle the exit, and resume running L2, without running L1 and
> without calling nested_vmx_vmexit(). The logic for deciding whether to handle
> a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
> will appear in a separate patch below.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>

> +/*
> + * A part of what we need to when the nested L2 guest exits and we want to
> + * run its L1 parent, is to reset L1's guest state to the host state specified
> + * in vmcs12.
> + * This function is to be called not only on normal nested exit, but also on
> + * a nested entry failure, as explained in Intel's spec, 3B.23.7 ("VM-Entry
> + * Failures During or After Loading Guest State").
> + * This function should be called when the active VMCS is L1's (vmcs01).
> + */
> +void load_vmcs12_host_state(struct kvm_vcpu *vcpu, struct vmcs12
> *vmcs12)
> +{
> +	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER)
> +		vcpu->arch.efer = vmcs12->host_ia32_efer;
> +	if (vmcs12->vm_exit_controls & VM_EXIT_HOST_ADDR_SPACE_SIZE)
> +		vcpu->arch.efer |= (EFER_LMA | EFER_LME);
> +	else
> +		vcpu->arch.efer &= ~(EFER_LMA | EFER_LME);
> +	vmx_set_efer(vcpu, vcpu->arch.efer);
> +
> +	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PAT)
> +		vmcs_write64(GUEST_IA32_PAT, vmcs12->host_ia32_pat);
> +
> +	kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->host_rsp);
> +	kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->host_rip);
> +	/*
> +	 * Note that calling vmx_set_cr0 is important, even if cr0 hasn't
> +	 * actually changed, because it depends on the current state of
> +	 * fpu_active (which may have changed).
> +	 * Note that vmx_set_cr0 refers to efer set above.
> +	 */
> +	kvm_set_cr0(vcpu, vmcs12->host_cr0);
> +	/*
> +	 * If we did fpu_activate()/fpu_deactivate() during L2's run, we need
> +	 * to apply the same changes to L1's vmcs. We just set cr0 correctly,
> +	 * but we also need to update cr0_guest_host_mask and
> exception_bitmap.
> +	 */
> +	update_exception_bitmap(vcpu);
> +	vcpu->arch.cr0_guest_owned_bits = (vcpu->fpu_active ? X86_CR0_TS : 0);
> +	vmcs_writel(CR0_GUEST_HOST_MASK,
> ~vcpu->arch.cr0_guest_owned_bits);
> +
> +	/*
> +	 * Note that CR4_GUEST_HOST_MASK is already set in the original
> vmcs01
> +	 * (KVM doesn't change it)- no reason to call set_cr4_guest_host_mask();
> +	 */
> +	vcpu->arch.cr4_guest_owned_bits =
> ~vmcs_readl(CR4_GUEST_HOST_MASK);
> +	kvm_set_cr4(vcpu, vmcs12->host_cr4);
> +
> +	/* shadow page tables on either EPT or shadow page tables */
> +	kvm_set_cr3(vcpu, vmcs12->host_cr3);
> +	kvm_mmu_reset_context(vcpu);
> +
> +	if (enable_vpid) {
> +		/*
> +		 * Trivially support vpid by letting L2s share their parent
> +		 * L1's vpid. TODO: move to a more elaborate solution, giving
> +		 * each L2 its own vpid and exposing the vpid feature to L1.
> +		 */
> +		vmx_flush_tlb(vcpu);
> +	}

What about the SYSENTER and PERF_GLOBAL_CTRL MSRs? At least a TODO comment
here would make the whole load process complete. :-)

Also, isn't it saner to update vmcs01's guest segment info based on vmcs12's
host segment info? Though you can assume the environment in L1 doesn't change
between VMLAUNCH/VMRESUME and the VMEXIT handler, it's architecturally cleaner
to load those segment fields according to L1's wishes.
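
For illustration only, a rough, untested sketch of what such loading could
look like, assuming the vmcs12 host_*_selector/base fields defined earlier in
this series (a complete version would also have to set the segment limits and
access-rights bytes consistently):

	/* Sketch: reset L1's segment registers from vmcs12's HOST_* fields,
	 * rather than trusting whatever vmcs01 still holds. */
	vmcs_write16(GUEST_ES_SELECTOR, vmcs12->host_es_selector);
	vmcs_write16(GUEST_CS_SELECTOR, vmcs12->host_cs_selector);
	vmcs_write16(GUEST_SS_SELECTOR, vmcs12->host_ss_selector);
	vmcs_write16(GUEST_DS_SELECTOR, vmcs12->host_ds_selector);
	vmcs_write16(GUEST_FS_SELECTOR, vmcs12->host_fs_selector);
	vmcs_write16(GUEST_GS_SELECTOR, vmcs12->host_gs_selector);
	vmcs_write16(GUEST_TR_SELECTOR, vmcs12->host_tr_selector);
	vmcs_writel(GUEST_FS_BASE, vmcs12->host_fs_base);
	vmcs_writel(GUEST_GS_BASE, vmcs12->host_gs_base);
	vmcs_writel(GUEST_TR_BASE, vmcs12->host_tr_base);
	vmcs_writel(GUEST_GDTR_BASE, vmcs12->host_gdtr_base);
	vmcs_writel(GUEST_IDTR_BASE, vmcs12->host_idtr_base);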

Thanks
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: [PATCH 21/31] nVMX: vmcs12 checks on nested entry
  2011-05-16 19:54 ` [PATCH 21/31] nVMX: vmcs12 checks on nested entry Nadav Har'El
@ 2011-05-25  3:01   ` Tian, Kevin
  2011-05-25  5:38     ` Nadav Har'El
  0 siblings, 1 reply; 118+ messages in thread
From: Tian, Kevin @ 2011-05-25  3:01 UTC (permalink / raw)
  To: Nadav Har'El, kvm; +Cc: gleb, avi

> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:55 AM
> 
> This patch adds a bunch of tests of the validity of the vmcs12 fields,
> according to what the VMX spec and our implementation allows. If fields
> we cannot (or don't want to) honor are discovered, an entry failure is
> emulated.
> 
> According to the spec, there are two types of entry failures: If the problem
> was in vmcs12's host state or control fields, the VMLAUNCH instruction simply
> fails. But a problem is found in the guest state, the behavior is more
> similar to that of an exit.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/include/asm/vmx.h |    8 ++
>  arch/x86/kvm/vmx.c         |   94
> +++++++++++++++++++++++++++++++++++
>  2 files changed, 102 insertions(+)
> 
> --- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
> @@ -870,6 +870,10 @@ static inline bool nested_cpu_has2(struc
>  		(vmcs12->secondary_vm_exec_control & bit);
>  }
> 
> +static void nested_vmx_entry_failure(struct kvm_vcpu *vcpu,
> +			struct vmcs12 *vmcs12,
> +			u32 reason, unsigned long qualification);
> +
>  static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
>  {
>  	int i;
> @@ -6160,6 +6164,79 @@ static int nested_vmx_run(struct kvm_vcp
> 
>  	vmcs12 = get_vmcs12(vcpu);
> 
> +	/*
> +	 * The nested entry process starts with enforcing various prerequisites
> +	 * on vmcs12 as required by the Intel SDM, and act appropriately when
> +	 * they fail: As the SDM explains, some conditions should cause the
> +	 * instruction to fail, while others will cause the instruction to seem
> +	 * to succeed, but return an EXIT_REASON_INVALID_STATE.
> +	 * To speed up the normal (success) code path, we should avoid checking
> +	 * for misconfigurations which will anyway be caught by the processor
> +	 * when using the merged vmcs02.
> +	 */
> +	if (vmcs12->launch_state == launch) {
> +		nested_vmx_failValid(vcpu,
> +			launch ? VMXERR_VMLAUNCH_NONCLEAR_VMCS
> +			       : VMXERR_VMRESUME_NONLAUNCHED_VMCS);
> +		return 1;
> +	}

from SDM:
	ELSIF (VMLAUNCH and launch state of current VMCS is not "clear")
		THEN VMfailValid(VMLAUNCH with non-clear VMCS);
	ELSIF (VMRESUME and launch state of current VMCS is not "launched")
		THEN VMfailValid(VMRESUME with non-launched VMCS);

So it's legal to use VMLAUNCH on a launched VMCS. However here you
changes this behavior. On the other hand, do you want to add a 'clear' state
along with L1 VMCLEAR to catch the failure here?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 21/31] nVMX: vmcs12 checks on nested entry
  2011-05-25  3:01   ` Tian, Kevin
@ 2011-05-25  5:38     ` Nadav Har'El
  2011-05-25  7:33       ` Tian, Kevin
  0 siblings, 1 reply; 118+ messages in thread
From: Nadav Har'El @ 2011-05-25  5:38 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: kvm, gleb, avi

On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 21/31] nVMX: vmcs12 checks on nested entry":
> > +	if (vmcs12->launch_state == launch) {
> > +		nested_vmx_failValid(vcpu,
> > +			launch ? VMXERR_VMLAUNCH_NONCLEAR_VMCS
> > +			       : VMXERR_VMRESUME_NONLAUNCHED_VMCS);
> > +		return 1;
> > +	}
> 
> from SDM:
> 	ELSIF (VMLAUNCH and launch state of current VMCS is not "clear")
> 		THEN VMfailValid(VMLAUNCH with non-clear VMCS);
> 	ELSIF (VMRESUME and launch state of current VMCS is not "launched")
> 		THEN VMfailValid(VMRESUME with non-launched VMCS);
> 
> So it's legal to use VMLAUNCH on a launched VMCS. However here you
> changes this behavior. On the other hand, do you want to add a 'clear' state
> along with L1 VMCLEAR to catch the failure here?

I don't understand: I always understood the spec to mean that "clear" and
"launched" the two opposite states of the "launch state" bit? If it isn't,
what does "clear" mean?

Is it really "legal to use a VMLAUNCH on a launched VMCS"?
If it is, why does KVM, for example, go to great lengths to VMLAUNCH the
first time, and VMRESUME all subsequent times?

-- 
Nadav Har'El                        |    Wednesday, May 25 2011, 21 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |If a million Shakespeares tried to write
http://nadav.harel.org.il           |together, they would write like a monkey.

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: [PATCH 21/31] nVMX: vmcs12 checks on nested entry
  2011-05-25  5:38     ` Nadav Har'El
@ 2011-05-25  7:33       ` Tian, Kevin
  0 siblings, 0 replies; 118+ messages in thread
From: Tian, Kevin @ 2011-05-25  7:33 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb, avi

> From: Nadav Har'El [mailto:nyh@math.technion.ac.il]
> Sent: Wednesday, May 25, 2011 1:38 PM
> 
> On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 21/31] nVMX:
> vmcs12 checks on nested entry":
> > > +	if (vmcs12->launch_state == launch) {
> > > +		nested_vmx_failValid(vcpu,
> > > +			launch ? VMXERR_VMLAUNCH_NONCLEAR_VMCS
> > > +			       : VMXERR_VMRESUME_NONLAUNCHED_VMCS);
> > > +		return 1;
> > > +	}
> >
> > from SDM:
> > 	ELSIF (VMLAUNCH and launch state of current VMCS is not "clear")
> > 		THEN VMfailValid(VMLAUNCH with non-clear VMCS);
> > 	ELSIF (VMRESUME and launch state of current VMCS is not "launched")
> > 		THEN VMfailValid(VMRESUME with non-launched VMCS);
> >
> > So it's legal to use VMLAUNCH on a launched VMCS. However here you
> > changes this behavior. On the other hand, do you want to add a 'clear' state
> > along with L1 VMCLEAR to catch the failure here?
> 
> I don't understand: I always understood the spec to mean that "clear" and
> "launched" the two opposite states of the "launch state" bit? If it isn't,
> what does "clear" mean?
> 
> Is it really "legal to use a VMLAUNCH on a launched VMCS"?
> If it is, why does KVM, for example, go to great lengths to VMLAUNCH the
> first time, and VMRESUME all subsequent times?
> 

You're correct. I've got my head messed on this point. :-)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: [PATCH 22/31] nVMX: Deciding if L0 or L1 should handle an L2 exit
  2011-05-16 19:55 ` [PATCH 22/31] nVMX: Deciding if L0 or L1 should handle an L2 exit Nadav Har'El
@ 2011-05-25  7:56   ` Tian, Kevin
  2011-05-25 13:45     ` Nadav Har'El
  0 siblings, 1 reply; 118+ messages in thread
From: Tian, Kevin @ 2011-05-25  7:56 UTC (permalink / raw)
  To: Nadav Har'El, kvm; +Cc: gleb, avi

> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:55 AM
> 
> This patch contains the logic of whether an L2 exit should be handled by L0
> and then L2 should be resumed, or whether L1 should be run to handle this
> exit (using the nested_vmx_vmexit() function of the previous patch).
> 
> The basic idea is to let L1 handle the exit only if it actually asked to
> trap this sort of event. For example, when L2 exits on a change to CR0,
> we check L1's CR0_GUEST_HOST_MASK to see if L1 expressed interest in any
> bit which changed; If it did, we exit to L1. But if it didn't it means that
> it is we (L0) that wished to trap this event, so we handle it ourselves.
> 
> The next two patches add additional logic of what to do when an interrupt or
> exception is injected: Does L0 need to do it, should we exit to L1 to do it,
> or should we resume L2 and keep the exception to be injected later.
> 
> We keep a new flag, "nested_run_pending", which can override the decision of
> which should run next, L1 or L2. nested_run_pending=1 means that we *must*
> run
> L2 next, not L1. This is necessary in particular when L1 did a VMLAUNCH of L2
> and therefore expects L2 to be run (and perhaps be injected with an event it
> specified, etc.). Nested_run_pending is especially intended to avoid switching
> to L1 in the injection decision-point described above.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/kvm/vmx.c |  256
> ++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 255 insertions(+), 1 deletion(-)
> 
> --- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
> @@ -351,6 +351,8 @@ struct nested_vmx {
>  	/* Saving the VMCS that we used for running L1 */
>  	struct saved_vmcs saved_vmcs01;
>  	u64 vmcs01_tsc_offset;
> +	/* L2 must run next, and mustn't decide to exit to L1. */
> +	bool nested_run_pending;
>  	/*
>  	 * Guest pages referred to in vmcs02 with host-physical pointers, so
>  	 * we must keep them pinned while L2 runs.
> @@ -870,6 +872,20 @@ static inline bool nested_cpu_has2(struc
>  		(vmcs12->secondary_vm_exec_control & bit);
>  }
> 
> +static inline bool nested_cpu_has_virtual_nmis(struct kvm_vcpu *vcpu)
> +{
> +	return is_guest_mode(vcpu) &&
> +		(get_vmcs12(vcpu)->pin_based_vm_exec_control &
> +			PIN_BASED_VIRTUAL_NMIS);
> +}

Any reason to add the guest mode check here? I didn't see such a check in your
earlier nested_cpu_has_xxx helpers. It would be clearer to use the existing
nested_cpu_has_xxx along with an explicit is_guest_mode check, which keeps such
usage consistent.

> +
> +static inline bool is_exception(u32 intr_info)
> +{
> +	return (intr_info & (INTR_INFO_INTR_TYPE_MASK |
> INTR_INFO_VALID_MASK))
> +		== (INTR_TYPE_HARD_EXCEPTION | INTR_INFO_VALID_MASK);
> +}
> +
> +static void nested_vmx_vmexit(struct kvm_vcpu *vcpu);
>  static void nested_vmx_entry_failure(struct kvm_vcpu *vcpu,
>  			struct vmcs12 *vmcs12,
>  			u32 reason, unsigned long qualification);
> @@ -5281,6 +5297,232 @@ static int (*kvm_vmx_exit_handlers[])(st
>  static const int kvm_vmx_max_exit_handlers =
>  	ARRAY_SIZE(kvm_vmx_exit_handlers);
> 
> +/*
> + * Return 1 if we should exit from L2 to L1 to handle an MSR access access,
> + * rather than handle it ourselves in L0. I.e., check whether L1 expressed
> + * disinterest in the current event (read or write a specific MSR) by using an
> + * MSR bitmap. This may be the case even when L0 doesn't use MSR bitmaps.
> + */
> +static bool nested_vmx_exit_handled_msr(struct kvm_vcpu *vcpu,
> +	struct vmcs12 *vmcs12, u32 exit_reason)
> +{
> +	u32 msr_index = vcpu->arch.regs[VCPU_REGS_RCX];
> +	gpa_t bitmap;
> +
> +	if (!nested_cpu_has(get_vmcs12(vcpu),
> CPU_BASED_USE_MSR_BITMAPS))
> +		return 1;
> +
> +	/*
> +	 * The MSR_BITMAP page is divided into four 1024-byte bitmaps,
> +	 * for the four combinations of read/write and low/high MSR numbers.
> +	 * First we need to figure out which of the four to use:
> +	 */
> +	bitmap = vmcs12->msr_bitmap;
> +	if (exit_reason == EXIT_REASON_MSR_WRITE)
> +		bitmap += 2048;
> +	if (msr_index >= 0xc0000000) {
> +		msr_index -= 0xc0000000;
> +		bitmap += 1024;
> +	}
> +
> +	/* Then read the msr_index'th bit from this bitmap: */
> +	if (msr_index < 1024*8) {
> +		unsigned char b;
> +		kvm_read_guest(vcpu->kvm, bitmap + msr_index/8, &b, 1);
> +		return 1 & (b >> (msr_index & 7));
> +	} else
> +		return 1; /* let L1 handle the wrong parameter */
> +}
> +
> +/*
> + * Return 1 if we should exit from L2 to L1 to handle a CR access exit,
> + * rather than handle it ourselves in L0. I.e., check if L1 wanted to
> + * intercept (via guest_host_mask etc.) the current event.
> + */
> +static bool nested_vmx_exit_handled_cr(struct kvm_vcpu *vcpu,
> +	struct vmcs12 *vmcs12)
> +{
> +	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> +	int cr = exit_qualification & 15;
> +	int reg = (exit_qualification >> 8) & 15;
> +	unsigned long val = kvm_register_read(vcpu, reg);
> +
> +	switch ((exit_qualification >> 4) & 3) {
> +	case 0: /* mov to cr */
> +		switch (cr) {
> +		case 0:
> +			if (vmcs12->cr0_guest_host_mask &
> +			    (val ^ vmcs12->cr0_read_shadow))
> +				return 1;
> +			break;
> +		case 3:
> +			if ((vmcs12->cr3_target_count >= 1 &&
> +					vmcs12->cr3_target_value0 == val) ||
> +				(vmcs12->cr3_target_count >= 2 &&
> +					vmcs12->cr3_target_value1 == val) ||
> +				(vmcs12->cr3_target_count >= 3 &&
> +					vmcs12->cr3_target_value2 == val) ||
> +				(vmcs12->cr3_target_count >= 4 &&
> +					vmcs12->cr3_target_value3 == val))
> +				return 0;
> +			if (nested_cpu_has(vmcs12, CPU_BASED_CR3_LOAD_EXITING))
> +				return 1;
> +			break;
> +		case 4:
> +			if (vmcs12->cr4_guest_host_mask &
> +			    (vmcs12->cr4_read_shadow ^ val))
> +				return 1;
> +			break;
> +		case 8:
> +			if (nested_cpu_has(vmcs12, CPU_BASED_CR8_LOAD_EXITING))
> +				return 1;
> +			break;
> +		}
> +		break;
> +	case 2: /* clts */
> +		if ((vmcs12->cr0_guest_host_mask & X86_CR0_TS) &&
> +		    (vmcs12->cr0_read_shadow & X86_CR0_TS))
> +			return 1;
> +		break;
> +	case 1: /* mov from cr */
> +		switch (cr) {
> +		case 3:
> +			if (vmcs12->cpu_based_vm_exec_control &
> +			    CPU_BASED_CR3_STORE_EXITING)
> +				return 1;
> +			break;
> +		case 8:
> +			if (vmcs12->cpu_based_vm_exec_control &
> +			    CPU_BASED_CR8_STORE_EXITING)
> +				return 1;
> +			break;
> +		}
> +		break;
> +	case 3: /* lmsw */
> +		/*
> +		 * lmsw can change bits 1..3 of cr0, and only set bit 0 of
> +		 * cr0. Other attempted changes are ignored, with no exit.
> +		 */
> +		if (vmcs12->cr0_guest_host_mask & 0xe &
> +		    (val ^ vmcs12->cr0_read_shadow))
> +			return 1;
> +		if ((vmcs12->cr0_guest_host_mask & 0x1) &&
> +		    !(vmcs12->cr0_read_shadow & 0x1) &&
> +		    (val & 0x1))
> +			return 1;
> +		break;
> +	}
> +	return 0;
> +}
> +
> +/*
> + * Return 1 if we should exit from L2 to L1 to handle an exit, or 0 if we
> + * should handle it ourselves in L0 (and then continue L2). Only call this
> + * when in is_guest_mode (L2).
> + */
> +static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
> +{
> +	u32 exit_reason = vmcs_read32(VM_EXIT_REASON);
> +	u32 intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> +
> +	if (vmx->nested.nested_run_pending)
> +		return 0;
> +
> +	if (unlikely(vmx->fail)) {
> +		printk(KERN_INFO "%s failed vm entry %x\n",
> +		       __func__, vmcs_read32(VM_INSTRUCTION_ERROR));
> +		return 1;
> +	}
> +
> +	switch (exit_reason) {
> +	case EXIT_REASON_EXCEPTION_NMI:
> +		if (!is_exception(intr_info))
> +			return 0;
> +		else if (is_page_fault(intr_info))
> +			return enable_ept;
> +		return vmcs12->exception_bitmap &
> +				(1u << (intr_info & INTR_INFO_VECTOR_MASK));
> +	case EXIT_REASON_EXTERNAL_INTERRUPT:
> +		return 0;
> +	case EXIT_REASON_TRIPLE_FAULT:
> +		return 1;
> +	case EXIT_REASON_PENDING_INTERRUPT:
> +	case EXIT_REASON_NMI_WINDOW:
> +		/*
> +		 * prepare_vmcs02() set the CPU_BASED_VIRTUAL_INTR_PENDING
> bit
> +		 * (aka Interrupt Window Exiting) only when L1 turned it on,
> +		 * so if we got a PENDING_INTERRUPT exit, this must be for L1.
> +		 * Same for NMI Window Exiting.
> +		 */
> +		return 1;
> +	case EXIT_REASON_TASK_SWITCH:
> +		return 1;
> +	case EXIT_REASON_CPUID:
> +		return 1;
> +	case EXIT_REASON_HLT:
> +		return nested_cpu_has(vmcs12, CPU_BASED_HLT_EXITING);
> +	case EXIT_REASON_INVD:
> +		return 1;
> +	case EXIT_REASON_INVLPG:
> +		return vmcs12->cpu_based_vm_exec_control &
> +				CPU_BASED_INVLPG_EXITING;

Use nested_cpu_has() here.

> +	case EXIT_REASON_RDPMC:
> +		return vmcs12->cpu_based_vm_exec_control &
> +				CPU_BASED_RDPMC_EXITING;
> +	case EXIT_REASON_RDTSC:
> +		return vmcs12->cpu_based_vm_exec_control &
> +				CPU_BASED_RDTSC_EXITING;

ditto

> +	case EXIT_REASON_VMCALL: case EXIT_REASON_VMCLEAR:
> +	case EXIT_REASON_VMLAUNCH: case EXIT_REASON_VMPTRLD:
> +	case EXIT_REASON_VMPTRST: case EXIT_REASON_VMREAD:
> +	case EXIT_REASON_VMRESUME: case EXIT_REASON_VMWRITE:
> +	case EXIT_REASON_VMOFF: case EXIT_REASON_VMON:
> +		/*
> +		 * VMX instructions trap unconditionally. This allows L1 to
> +		 * emulate them for its L2 guest, i.e., allows 3-level nesting!
> +		 */
> +		return 1;
> +	case EXIT_REASON_CR_ACCESS:
> +		return nested_vmx_exit_handled_cr(vcpu, vmcs12);
> +	case EXIT_REASON_DR_ACCESS:
> +		return nested_cpu_has(vmcs12, CPU_BASED_MOV_DR_EXITING);
> +	case EXIT_REASON_IO_INSTRUCTION:
> +		/* TODO: support IO bitmaps */
> +		return 1;
> +	case EXIT_REASON_MSR_READ:
> +	case EXIT_REASON_MSR_WRITE:
> +		return nested_vmx_exit_handled_msr(vcpu, vmcs12, exit_reason);
> +	case EXIT_REASON_INVALID_STATE:
> +		return 1;
> +	case EXIT_REASON_MWAIT_INSTRUCTION:
> +		return nested_cpu_has(vmcs12, CPU_BASED_MWAIT_EXITING);
> +	case EXIT_REASON_MONITOR_INSTRUCTION:
> +		return nested_cpu_has(vmcs12, CPU_BASED_MONITOR_EXITING);
> +	case EXIT_REASON_PAUSE_INSTRUCTION:
> +		return nested_cpu_has(vmcs12, CPU_BASED_PAUSE_EXITING) ||
> +			nested_cpu_has2(vmcs12,
> +				SECONDARY_EXEC_PAUSE_LOOP_EXITING);
> +	case EXIT_REASON_MCE_DURING_VMENTRY:
> +		return 0;
> +	case EXIT_REASON_TPR_BELOW_THRESHOLD:
> +		return 1;
> +	case EXIT_REASON_APIC_ACCESS:
> +		return nested_cpu_has2(vmcs12,
> +			SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
> +	case EXIT_REASON_EPT_VIOLATION:
> +	case EXIT_REASON_EPT_MISCONFIG:
> +		return 0;
> +	case EXIT_REASON_WBINVD:
> +		return nested_cpu_has2(vmcs12,
> SECONDARY_EXEC_WBINVD_EXITING);
> +	case EXIT_REASON_XSETBV:
> +		return 1;
> +	default:
> +		return 1;
> +	}
> +}
> +
>  static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2)
>  {
>  	*info1 = vmcs_readl(EXIT_QUALIFICATION);
> @@ -5303,6 +5545,17 @@ static int vmx_handle_exit(struct kvm_vc
>  	if (vmx->emulation_required && emulate_invalid_guest_state)
>  		return handle_invalid_guest_state(vcpu);
> 
> +	if (exit_reason == EXIT_REASON_VMLAUNCH ||
> +	    exit_reason == EXIT_REASON_VMRESUME)
> +		vmx->nested.nested_run_pending = 1;
> +	else
> +		vmx->nested.nested_run_pending = 0;

What about a VMLAUNCH invoked from L2? In such a case I think you expect
L1 to run instead of L2.

On the other hand, isn't the guest mode check alone enough to differentiate
a pending nested run? When L1 invokes VMLAUNCH/VMRESUME, guest mode
hasn't been set yet, so the check below will fail. All other operations will
then be checked by nested_vmx_exit_handled...

Am I missing anything here?

> +
> +	if (is_guest_mode(vcpu) && nested_vmx_exit_handled(vcpu)) {
> +		nested_vmx_vmexit(vcpu);
> +		return 1;
> +	}
> +
>  	if (exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) {
>  		vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY;
>  		vcpu->run->fail_entry.hardware_entry_failure_reason
> @@ -5325,7 +5578,8 @@ static int vmx_handle_exit(struct kvm_vc
>  		       "(0x%x) and exit reason is 0x%x\n",
>  		       __func__, vectoring_info, exit_reason);
> 
> -	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked)) {
> +	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked &&
> +			!nested_cpu_has_virtual_nmis(vcpu))) {

Would L0 want to control vNMI for the L2 guest? Otherwise, could we just use
is_guest_mode here for the condition check?

>  		if (vmx_interrupt_allowed(vcpu)) {
>  			vmx->soft_vnmi_blocked = 0;
>  		} else if (vmx->vnmi_blocked_time > 1000000000LL &&


Thanks,
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: [PATCH 18/31] nVMX: Implement VMLAUNCH and VMRESUME
  2011-05-16 19:53 ` [PATCH 18/31] nVMX: Implement VMLAUNCH and VMRESUME Nadav Har'El
  2011-05-24  8:45   ` Tian, Kevin
@ 2011-05-25  8:00   ` Tian, Kevin
  2011-05-25 13:26     ` Nadav Har'El
  1 sibling, 1 reply; 118+ messages in thread
From: Tian, Kevin @ 2011-05-25  8:00 UTC (permalink / raw)
  To: Nadav Har'El, kvm; +Cc: gleb, avi

> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:53 AM
> 
> Implement the VMLAUNCH and VMRESUME instructions, allowing a guest
> hypervisor to run its own guests.
> 
> This patch does not include some of the necessary validity checks on
> vmcs12 fields before the entry. These will appear in a separate patch
> below.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---

[...]

> +static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
> +{
> +	struct vmcs12 *vmcs12;
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	int cpu;
> +	struct saved_vmcs *saved_vmcs02;
> +
> +	if (!nested_vmx_check_permission(vcpu))
> +		return 1;
> +	skip_emulated_instruction(vcpu);
> +
> +	vmcs12 = get_vmcs12(vcpu);
> +
> +	enter_guest_mode(vcpu);
> +
> +	vmx->nested.vmcs01_tsc_offset = vmcs_read64(TSC_OFFSET);
> +
> +	/*
> +	 * Switch from L1's VMCS (vmcs01), to L2's VMCS (vmcs02). Remember
> +	 * vmcs01, on which CPU it was last loaded, and whether it was launched
> +	 * (we need all these values next time we will use L1). Then recall
> +	 * these values from the last time vmcs02 was used.
> +	 */
> +	saved_vmcs02 = nested_get_current_vmcs02(vmx);
> +	if (!saved_vmcs02)
> +		return -ENOMEM;
> +

We shouldn't return an error after the guest mode has been updated. Otherwise,
move enter_guest_mode() to a later point, e.g. as in the sketch below...
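
I.e., something along these lines (just a sketch of the reordering, untested):

	/* Sketch: get (or allocate) vmcs02 first, so an allocation failure
	 * can be returned before any vcpu state has been modified. */
	saved_vmcs02 = nested_get_current_vmcs02(vmx);
	if (!saved_vmcs02)
		return -ENOMEM;

	enter_guest_mode(vcpu);
	vmx->nested.vmcs01_tsc_offset = vmcs_read64(TSC_OFFSET);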

Thanks
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 20/31] nVMX: Exiting from L2 to L1
  2011-05-25  0:55       ` Tian, Kevin
@ 2011-05-25  8:06         ` Nadav Har'El
  2011-05-25  8:23           ` Tian, Kevin
  0 siblings, 1 reply; 118+ messages in thread
From: Nadav Har'El @ 2011-05-25  8:06 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: kvm, gleb, avi

On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 20/31] nVMX: Exiting from L2 to L1":
> IOW, I disagree that if L0/L1 set same bit in cr0_guest_host_mask, then
> the bit is identical in vmcs02.guest_cr0 and vmcs12.guest_cr0 because L1
> has no permission to set its bit effectively in this case.
>...
> this is the problem:
> 
> 	(vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask) !=
> 	(vmcs02->guest_cr0 & vmcs12->cr0_guest_host_mask)

Sorry for arguing previously, this is a very good, and correct, point, which
I missed. When both L0 and L1 are KVM, this didn't cause problems because the
only problematic bit has been the TS bit, and when KVM wants to override this
bit it always does it to 1.

So I've rewritten this function, based on my new understanding following your
insights. I believe it now implements your formula *exactly*. Please take a
look at the comments and the code, and see if you now agree with them:

/*
 * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date
 * because L2 may have changed some cr0 bits directly (CR0_GUEST_HOST_MASK).
 * This function returns the new value we should put in vmcs12.guest_cr0.
 * It's not enough to just return the vmcs02 GUEST_CR0. Rather,
 *  1. Bits that neither L0 nor L1 trapped, were set directly by L2 and are now
 *     available in vmcs02 GUEST_CR0. (Note: It's enough to check that L0
 *     didn't trap the bit, because if L1 did, so would L0).
 *  2. Bits that L1 asked to trap (and therefore L0 also did) could not have
 *     been modified by L2, and L1 knows it. So just leave the old value of
 *     the bit from vmcs12.guest_cr0. Note that the bit from vmcs02 GUEST_CR0
 *     isn't relevant, because if L0 traps this bit it can set it to anything.
 *  3. Bits that L1 didn't trap, but L0 did. L1 believes the guest could have
 *     changed these bits, and therefore they need to be updated, but L0
 *     didn't necessarily allow them to be changed in GUEST_CR0 - and rather
 *     put them in vmcs02 CR0_READ_SHADOW. So take these bits from there.
 */
static inline unsigned long
vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
{
	return
	/*1*/	(vmcs_readl(GUEST_CR0) & vcpu->arch.cr0_guest_owned_bits) |
	/*2*/	(vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask) |
	/*3*/	(vmcs_readl(CR0_READ_SHADOW) & ~(vmcs12->cr0_guest_host_mask |
			vcpu->arch.cr0_guest_owned_bits));
}
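
Purely as an illustration (hypothetical bit values, not from any real guest),
a tiny standalone program running the same three-way merge shows case 3 doing
the work that a naive copy of vmcs02 GUEST_CR0 would get wrong:

#include <stdio.h>

#define X86_CR0_PE (1UL << 0)
#define X86_CR0_MP (1UL << 1)
#define X86_CR0_TS (1UL << 3)

int main(void)
{
	/* Hypothetical snapshot at L2->L1 exit time: L1 traps only MP, while
	 * L0 traps everything (cr0_guest_owned_bits == 0), including TS,
	 * which L2 cleared but L0 kept set in GUEST_CR0. */
	unsigned long guest_cr0   = X86_CR0_PE | X86_CR0_TS; /* vmcs02 GUEST_CR0 */
	unsigned long read_shadow = X86_CR0_PE;        /* vmcs02 CR0_READ_SHADOW */
	unsigned long vmcs12_cr0  = X86_CR0_PE | X86_CR0_MP; /* old vmcs12.guest_cr0 */
	unsigned long l1_mask     = X86_CR0_MP;        /* cr0_guest_host_mask */
	unsigned long l0_owned    = 0;                 /* cr0_guest_owned_bits */

	unsigned long new_cr0 =
		(guest_cr0 & l0_owned) |               /* case 1 */
		(vmcs12_cr0 & l1_mask) |               /* case 2 */
		(read_shadow & ~(l1_mask | l0_owned)); /* case 3 */

	/* Prints 0x3 (PE|MP): TS comes out 0, as L2 believes, even though
	 * vmcs02 GUEST_CR0 has TS=1; MP keeps the value L1 gave it. */
	printf("new vmcs12.guest_cr0 = %#lx\n", new_cr0);
	return 0;
}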


-- 
Nadav Har'El                        |    Wednesday, May 25 2011, 21 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |If at first you don't succeed, redefine
http://nadav.harel.org.il           |success.

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: [PATCH 20/31] nVMX: Exiting from L2 to L1
  2011-05-25  8:06         ` Nadav Har'El
@ 2011-05-25  8:23           ` Tian, Kevin
  0 siblings, 0 replies; 118+ messages in thread
From: Tian, Kevin @ 2011-05-25  8:23 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb, avi

> From: Nadav Har'El [mailto:nyh@math.technion.ac.il]
> Sent: Wednesday, May 25, 2011 4:06 PM
> 
> On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 20/31] nVMX:
> Exiting from L2 to L1":
> > IOW, I disagree that if L0/L1 set same bit in cr0_guest_host_mask, then
> > the bit is identical in vmcs02.guest_cr0 and vmcs12.guest_cr0 because L1
> > has no permission to set its bit effectively in this case.
> >...
> > this is the problem:
> >
> > 	(vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask) !=
> > 	(vmcs02->guest_cr0 & vmcs12->cr0_guest_host_mask)
> 
> Sorry for arguing previously, this is a very good, and correct, point, which
> I missed. When both L0 and L1 are KVM, this didn't cause problems because
> the
> only problematic bit has been the TS bit, and when KVM wants to override this
> bit it always does it to 1.
> 
> So I've rewritten this function, based on my new understanding following your
> insights. I believe it now implements your formula *exactly*. Please take a
> look at the comments and the code, and see if you now agree with them:
> 
> /*
>  * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date
>  * because L2 may have changed some cr0 bits directly
> (CR0_GUEST_HOST_MASK).
>  * This function returns the new value we should put in vmcs12.guest_cr0.
>  * It's not enough to just return the vmcs02 GUEST_CR0. Rather,
>  *  1. Bits that neither L0 nor L1 trapped, were set directly by L2 and are now
>  *     available in vmcs02 GUEST_CR0. (Note: It's enough to check that L0
>  *     didn't trap the bit, because if L1 did, so would L0).
>  *  2. Bits that L1 asked to trap (and therefore L0 also did) could not have
>  *     been modified by L2, and L1 knows it. So just leave the old value of
>  *     the bit from vmcs12.guest_cr0. Note that the bit from vmcs02
> GUEST_CR0
>  *     isn't relevant, because if L0 traps this bit it can set it to anything.
>  *  3. Bits that L1 didn't trap, but L0 did. L1 believes the guest could have
>  *     changed these bits, and therefore they need to be updated, but L0
>  *     didn't necessarily allow them to be changed in GUEST_CR0 - and
> rather
>  *     put them in vmcs02 CR0_READ_SHADOW. So take these bits from
> there.
>  */
> static inline unsigned long
> vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
> {
> 	return
> 	/*1*/	(vmcs_readl(GUEST_CR0) & vcpu->arch.cr0_guest_owned_bits)
> |
> 	/*2*/	(vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask) |
> 	/*3*/	(vmcs_readl(CR0_READ_SHADOW) &
> ~(vmcs12->cr0_guest_host_mask |
> 			vcpu->arch.cr0_guest_owned_bits));
> }
> 
> 

This looks good to me. :-)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: [PATCH 23/31] nVMX: Correct handling of interrupt injection
  2011-05-16 19:55 ` [PATCH 23/31] nVMX: Correct handling of interrupt injection Nadav Har'El
@ 2011-05-25  8:39   ` Tian, Kevin
  2011-05-25  8:45     ` Tian, Kevin
  2011-05-25 10:56     ` Nadav Har'El
  2011-05-25  9:18   ` Tian, Kevin
  1 sibling, 2 replies; 118+ messages in thread
From: Tian, Kevin @ 2011-05-25  8:39 UTC (permalink / raw)
  To: Nadav Har'El, kvm; +Cc: gleb, avi

> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:56 AM
> 
> The code in this patch correctly emulates external-interrupt injection
> while a nested guest L2 is running.
> 
> Because of this code's relative un-obviousness, I include here a longer-than-
> usual justification for what it does - much longer than the code itself ;-)
> 
> To understand how to correctly emulate interrupt injection while L2 is
> running, let's look first at what we need to emulate: How would things look
> like if the extra L0 hypervisor layer is removed, and instead of L0 injecting
> an interrupt, we had hardware delivering an interrupt?
> 
> Now we have L1 running on bare metal with a guest L2, and the hardware
> generates an interrupt. Assuming that L1 set PIN_BASED_EXT_INTR_MASK to
> 1, and
> VM_EXIT_ACK_INTR_ON_EXIT to 0 (we'll revisit these assumptions below),
> what
> happens now is this: The processor exits from L2 to L1, with an external-
> interrupt exit reason but without an interrupt vector. L1 runs, with
> interrupts disabled, and it doesn't yet know what the interrupt was. Soon
> after, it enables interrupts and only at that moment, it gets the interrupt
> from the processor. when L1 is KVM, Linux handles this interrupt.
> 
> Now we need exactly the same thing to happen when that L1->L2 system runs
> on top of L0, instead of real hardware. This is how we do this:
> 
> When L0 wants to inject an interrupt, it needs to exit from L2 to L1, with
> external-interrupt exit reason (with an invalid interrupt vector), and run L1.
> Just like in the bare metal case, it likely can't deliver the interrupt to
> L1 now because L1 is running with interrupts disabled, in which case it turns
> on the interrupt window when running L1 after the exit. L1 will soon enable
> interrupts, and at that point L0 will gain control again and inject the
> interrupt to L1.
> 
> Finally, there is an extra complication in the code: when nested_run_pending,
> we cannot return to L1 now, and must launch L2. We need to remember the
> interrupt we wanted to inject (and not clear it now), and do it on the
> next exit.
> 
> The above explanation shows that the relative strangeness of the nested
> interrupt injection code in this patch, and the extra interrupt-window
> exit incurred, are in fact necessary for accurate emulation, and are not
> just an unoptimized implementation.
> 
> Let's revisit now the two assumptions made above:
> 
> If L1 turns off PIN_BASED_EXT_INTR_MASK (no hypervisor that I know
> does, by the way), things are simple: L0 may inject the interrupt directly
> to the L2 guest - using the normal code path that injects to any guest.
> We support this case in the code below.
> 
> If L1 turns on VM_EXIT_ACK_INTR_ON_EXIT (again, no hypervisor that I know
> does), things look very different from the description above: L1 expects

Type-1 bare metal hypervisor may enable this bit, such as Xen. This bit is
really prepared for L2 hypervisor since normally L2 hypervisor is tricky to
touch generic interrupt logic, and thus better to not ack it until interrupt
is enabled and then hardware will gear to the kernel interrupt handler
automatically.

> to see an exit from L2 with the interrupt vector already filled in the exit
> information, and does not expect to be interrupted again with this interrupt.
> The current code does not (yet) support this case, so we do not allow the
> VM_EXIT_ACK_INTR_ON_EXIT exit-control to be turned on by L1.

Then just fill the interrupt vector field with the highest unmasked bit
from pending vIRR.

> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/kvm/vmx.c |   36 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 36 insertions(+)
> 
> --- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
> @@ -1788,6 +1788,7 @@ static __init void nested_vmx_setup_ctls
> 
>  	/* exit controls */
>  	nested_vmx_exit_ctls_low = 0;
> +	/* Note that guest use of VM_EXIT_ACK_INTR_ON_EXIT is not supported.
> */
>  #ifdef CONFIG_X86_64
>  	nested_vmx_exit_ctls_high = VM_EXIT_HOST_ADDR_SPACE_SIZE;
>  #else
> @@ -3733,9 +3734,25 @@ out:
>  	return ret;
>  }
> 
> +/*
> + * In nested virtualization, check if L1 asked to exit on external interrupts.
> + * For most existing hypervisors, this will always return true.
> + */
> +static bool nested_exit_on_intr(struct kvm_vcpu *vcpu)
> +{
> +	return get_vmcs12(vcpu)->pin_based_vm_exec_control &
> +		PIN_BASED_EXT_INTR_MASK;
> +}
> +

could be a similar common wrapper like nested_cpu_has...

Thanks,
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: [PATCH 23/31] nVMX: Correct handling of interrupt injection
  2011-05-25  8:39   ` Tian, Kevin
@ 2011-05-25  8:45     ` Tian, Kevin
  2011-05-25 10:56     ` Nadav Har'El
  1 sibling, 0 replies; 118+ messages in thread
From: Tian, Kevin @ 2011-05-25  8:45 UTC (permalink / raw)
  To: Tian, Kevin, Nadav Har'El, kvm; +Cc: gleb, avi

> From: Tian, Kevin
> Sent: Wednesday, May 25, 2011 4:40 PM
> > If L1 turns on VM_EXIT_ACK_INTR_ON_EXIT (again, no hypervisor that I know
> > does), things look very different from the description above: L1 expects
> 
> Type-1 bare metal hypervisor may enable this bit, such as Xen. This bit is
> really prepared for L2 hypervisor since normally L2 hypervisor is tricky to
> touch generic interrupt logic, and thus better to not ack it until interrupt
> is enabled and then hardware will gear to the kernel interrupt handler
> automatically.
> 
> > to see an exit from L2 with the interrupt vector already filled in the exit
> > information, and does not expect to be interrupted again with this interrupt.
> > The current code does not (yet) support this case, so we do not allow the
> > VM_EXIT_ACK_INTR_ON_EXIT exit-control to be turned on by L1.
> 
> Then just fill the interrupt vector field with the highest unmasked bit
> from pending vIRR.
> 

And also ack virtual irqchip accordingly...
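
For what it's worth, here is a rough, untested sketch of what that could look
like on the L2->L1 exit path (it assumes kvm_cpu_get_interrupt(), which also
acks the virtual irqchip, keeps its current behaviour):

	/* Sketch: when L1 asked for ACK_INTR_ON_EXIT, acknowledge the pending
	 * external interrupt on its behalf and report the vector in the exit
	 * information, instead of injecting it into L1 again later. */
	if (vmcs12->vm_exit_controls & VM_EXIT_ACK_INTR_ON_EXIT) {
		int irq = kvm_cpu_get_interrupt(vcpu);
		if (irq >= 0)
			vmcs12->vm_exit_intr_info = irq |
				INTR_TYPE_EXT_INTR | INTR_INFO_VALID_MASK;
	}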

Thanks
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: [PATCH 23/31] nVMX: Correct handling of interrupt injection
  2011-05-16 19:55 ` [PATCH 23/31] nVMX: Correct handling of interrupt injection Nadav Har'El
  2011-05-25  8:39   ` Tian, Kevin
@ 2011-05-25  9:18   ` Tian, Kevin
  2011-05-25 12:33     ` Nadav Har'El
  1 sibling, 1 reply; 118+ messages in thread
From: Tian, Kevin @ 2011-05-25  9:18 UTC (permalink / raw)
  To: Nadav Har'El, kvm; +Cc: gleb, avi

> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:56 AM
> 
> The code in this patch correctly emulates external-interrupt injection
> while a nested guest L2 is running.
> 
> Because of this code's relative un-obviousness, I include here a longer-than-
> usual justification for what it does - much longer than the code itself ;-)
> 
> To understand how to correctly emulate interrupt injection while L2 is
> running, let's look first at what we need to emulate: How would things look
> like if the extra L0 hypervisor layer is removed, and instead of L0 injecting
> an interrupt, we had hardware delivering an interrupt?
> 
> Now we have L1 running on bare metal with a guest L2, and the hardware
> generates an interrupt. Assuming that L1 set PIN_BASED_EXT_INTR_MASK to
> 1, and
> VM_EXIT_ACK_INTR_ON_EXIT to 0 (we'll revisit these assumptions below),
> what
> happens now is this: The processor exits from L2 to L1, with an external-
> interrupt exit reason but without an interrupt vector. L1 runs, with
> interrupts disabled, and it doesn't yet know what the interrupt was. Soon
> after, it enables interrupts and only at that moment, it gets the interrupt
> from the processor. when L1 is KVM, Linux handles this interrupt.
> 
> Now we need exactly the same thing to happen when that L1->L2 system runs
> on top of L0, instead of real hardware. This is how we do this:
> 
> When L0 wants to inject an interrupt, it needs to exit from L2 to L1, with
> external-interrupt exit reason (with an invalid interrupt vector), and run L1.
> Just like in the bare metal case, it likely can't deliver the interrupt to
> L1 now because L1 is running with interrupts disabled, in which case it turns
> on the interrupt window when running L1 after the exit. L1 will soon enable
> interrupts, and at that point L0 will gain control again and inject the
> interrupt to L1.
> 
> Finally, there is an extra complication in the code: when nested_run_pending,
> we cannot return to L1 now, and must launch L2. We need to remember the
> interrupt we wanted to inject (and not clear it now), and do it on the
> next exit.
> 
> The above explanation shows that the relative strangeness of the nested
> interrupt injection code in this patch, and the extra interrupt-window
> exit incurred, are in fact necessary for accurate emulation, and are not
> just an unoptimized implementation.
> 
> Let's revisit now the two assumptions made above:
> 
> If L1 turns off PIN_BASED_EXT_INTR_MASK (no hypervisor that I know
> does, by the way), things are simple: L0 may inject the interrupt directly
> to the L2 guest - using the normal code path that injects to any guest.
> We support this case in the code below.
> 
> If L1 turns on VM_EXIT_ACK_INTR_ON_EXIT (again, no hypervisor that I know
> does), things look very different from the description above: L1 expects
> to see an exit from L2 with the interrupt vector already filled in the exit
> information, and does not expect to be interrupted again with this interrupt.
> The current code does not (yet) support this case, so we do not allow the
> VM_EXIT_ACK_INTR_ON_EXIT exit-control to be turned on by L1.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/kvm/vmx.c |   36 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 36 insertions(+)
> 
> --- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:49.000000000 +0300
> @@ -1788,6 +1788,7 @@ static __init void nested_vmx_setup_ctls
> 
>  	/* exit controls */
>  	nested_vmx_exit_ctls_low = 0;
> +	/* Note that guest use of VM_EXIT_ACK_INTR_ON_EXIT is not supported.
> */
>  #ifdef CONFIG_X86_64
>  	nested_vmx_exit_ctls_high = VM_EXIT_HOST_ADDR_SPACE_SIZE;
>  #else
> @@ -3733,9 +3734,25 @@ out:
>  	return ret;
>  }
> 
> +/*
> + * In nested virtualization, check if L1 asked to exit on external interrupts.
> + * For most existing hypervisors, this will always return true.
> + */
> +static bool nested_exit_on_intr(struct kvm_vcpu *vcpu)
> +{
> +	return get_vmcs12(vcpu)->pin_based_vm_exec_control &
> +		PIN_BASED_EXT_INTR_MASK;
> +}
> +
>  static void enable_irq_window(struct kvm_vcpu *vcpu)
>  {
>  	u32 cpu_based_vm_exec_control;
> +	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
> +		/* We can get here when nested_run_pending caused
> +		 * vmx_interrupt_allowed() to return false. In this case, do
> +		 * nothing - the interrupt will be injected later.
> +		 */

I think this is not a rare path: when the vcpu is in guest mode with L2 as the
current vmx context, this function can be invoked multiple times, since the
kvm thread can be scheduled out and back in at any time.

> +		return;
> 
>  	cpu_based_vm_exec_control =
> vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
>  	cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
> @@ -3858,6 +3875,17 @@ static void vmx_set_nmi_mask(struct kvm_
> 
>  static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
>  {
> +	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
> +		struct vmcs12 *vmcs12;
> +		if (to_vmx(vcpu)->nested.nested_run_pending)
> +			return 0;

Well, now I can see why you require this special 'nested_run_pending' flag:
there are places where L0 injects virtual interrupts right after
VMLAUNCH/VMRESUME emulation and before entering L2. :-)

> +		nested_vmx_vmexit(vcpu);
> +		vmcs12 = get_vmcs12(vcpu);
> +		vmcs12->vm_exit_reason = EXIT_REASON_EXTERNAL_INTERRUPT;
> +		vmcs12->vm_exit_intr_info = 0;
> +		/* fall through to normal code, but now in L1, not L2 */
> +	}
> +

This is a bad place to add this logic. vmx_interrupt_allowed is simply a
query function, but you make it an absolute trigger point for switching from
L2 to L1. This is fine for now, as the only place calling vmx_interrupt_allowed
here is when there's a vNMI pending, but it's dangerous to build such an
assumption about pending events into vmx_interrupt_allowed.

On the other hand, I think there's one area which is not handled in a timely
manner. I think you need to kick an L2->L1 transition when L0 wants to inject
a virtual interrupt. Consider your current logic:

a) L2 is running on cpu1
b) L0 on cpu0 decides to post a virtual interrupt to L1. An IPI is issued to
cpu1 after updating the virqchip
c) L2 on cpu1 exits to L0, which checks whether L0 or L1 should handle the
event. As it's an external interrupt, L0 will handle it. As it's a notification
IPI, nothing is required.
d) L0 on cpu1 then decides to resume, and finds KVM_REQ_EVENT

At this point you only add logic to enable_irq_window, but there's no
action to trigger the L2->L1 transition. So what will happen? Will the event
be injected into L2 instead, or stay pending until the next switch happens
for some other reason?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: [PATCH 25/31] nVMX: Correct handling of idt vectoring info
  2011-05-16 19:56 ` [PATCH 25/31] nVMX: Correct handling of idt vectoring info Nadav Har'El
@ 2011-05-25 10:02   ` Tian, Kevin
  2011-05-25 10:13     ` Nadav Har'El
  0 siblings, 1 reply; 118+ messages in thread
From: Tian, Kevin @ 2011-05-25 10:02 UTC (permalink / raw)
  To: Nadav Har'El, kvm; +Cc: gleb, avi

> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:57 AM
> 
> This patch adds correct handling of IDT_VECTORING_INFO_FIELD for the
> nested
> case.
> 
> When a guest exits while delivering an interrupt or exception, we get this
> information in IDT_VECTORING_INFO_FIELD in the VMCS. When L2 exits to L1,
> there's nothing we need to do, because L1 will see this field in vmcs12, and
> handle it itself. However, when L2 exits and L0 handles the exit itself and
> plans to return to L2, L0 must inject this event to L2.
> 
> In the normal non-nested case, the idt_vectoring_info case is discovered after
> the exit, and the decision to inject (though not the injection itself) is made
> at that point. However, in the nested case a decision of whether to return
> to L2 or L1 also happens during the injection phase (see the previous
> patches), so in the nested case we can only decide what to do about the
> idt_vectoring_info right after the injection, i.e., in the beginning of
> vmx_vcpu_run, which is the first time we know for sure if we're staying in
> L2.
> 
> Therefore, when we exit L2 (is_guest_mode(vcpu)), we disable the regular
> vmx_complete_interrupts() code which queues the idt_vectoring_info for
> injection on next entry - because such injection would not be appropriate
> if we will decide to exit to L1. Rather, we just save the idt_vectoring_info
> and related fields in vmcs12 (which is a convenient place to save these
> fields). On the next entry in vmx_vcpu_run (*after* the injection phase,
> potentially exiting to L1 to inject an event requested by user space), if
> we find ourselves in L1 we don't need to do anything with those values
> we saved (as explained above). But if we find that we're in L2, or rather
> *still* at L2 (it's not nested_run_pending, meaning that this is the first
> round of L2 running after L1 having just launched it), we need to inject
> the event saved in those fields - by writing the appropriate VMCS fields.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/kvm/vmx.c |   30 ++++++++++++++++++++++++++++++
>  1 file changed, 30 insertions(+)
> 
> --- .before/arch/x86/kvm/vmx.c	2011-05-16 22:36:50.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2011-05-16 22:36:50.000000000 +0300
> @@ -5804,6 +5804,8 @@ static void __vmx_complete_interrupts(st
> 
>  static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
>  {
> +	if (is_guest_mode(&vmx->vcpu))
> +		return;
>  	__vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
>  				  VM_EXIT_INSTRUCTION_LEN,
>  				  IDT_VECTORING_ERROR_CODE);
> @@ -5811,6 +5813,8 @@ static void vmx_complete_interrupts(stru
> 
>  static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
>  {
> +	if (is_guest_mode(vcpu))
> +		return;
>  	__vmx_complete_interrupts(to_vmx(vcpu),
>  				  vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
>  				  VM_ENTRY_INSTRUCTION_LEN,
> @@ -5831,6 +5835,21 @@ static void __noclone vmx_vcpu_run(struc
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> 
> +	if (is_guest_mode(vcpu) && !vmx->nested.nested_run_pending) {
> +		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> +		if (vmcs12->idt_vectoring_info_field &
> +				VECTORING_INFO_VALID_MASK) {
> +			vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
> +				vmcs12->idt_vectoring_info_field);
> +			vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
> +				vmcs12->vm_exit_instruction_len);
> +			if (vmcs12->idt_vectoring_info_field &
> +					VECTORING_INFO_DELIVER_CODE_MASK)
> +				vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
> +					vmcs12->idt_vectoring_error_code);
> +		}
> +	}

One question here: what if L2 has interrupt exiting disabled? In that case L0
is expected to inject the virtual interrupt directly into L2, so simply
overwriting the interrupt info field here looks incorrect. As you said, a
typical hypervisor doesn't turn interrupt exiting off, but doing so is
architecturally correct. I think you should compare the current
INTR_INFO_FIELD with the saved idt vectoring info and choose the one with
higher priority, when L2 has interrupt exiting disabled.
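
Purely as an illustration (untested, and a real fix needs a proper priority or
merge decision rather than silently dropping one of the two events), the
overwrite could at least be guarded on nothing having already been queued for
direct injection:

	/* Sketch: only copy the saved idt_vectoring event into the entry
	 * fields if L0 has not already queued its own event for direct
	 * injection into L2 (possible when L1 left interrupt exiting off). */
	if (!(vmcs_read32(VM_ENTRY_INTR_INFO_FIELD) & INTR_INFO_VALID_MASK) &&
	    (vmcs12->idt_vectoring_info_field & VECTORING_INFO_VALID_MASK)) {
		vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
			     vmcs12->idt_vectoring_info_field);
		vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
			     vmcs12->vm_exit_instruction_len);
		if (vmcs12->idt_vectoring_info_field &
		    VECTORING_INFO_DELIVER_CODE_MASK)
			vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
				     vmcs12->idt_vectoring_error_code);
	}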

Thanks
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 25/31] nVMX: Correct handling of idt vectoring info
  2011-05-25 10:02   ` Tian, Kevin
@ 2011-05-25 10:13     ` Nadav Har'El
  2011-05-25 10:17       ` Tian, Kevin
  0 siblings, 1 reply; 118+ messages in thread
From: Nadav Har'El @ 2011-05-25 10:13 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: kvm, gleb, avi

On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 25/31] nVMX: Correct handling of idt vectoring info":
> One question here: what if L2 has interrupt exiting disabled? In that case L0

You are right, this is a bug in my code. It's a known bug - listed on my
private bugzilla - but I didn't get around to fixing it.

As I said, most hypervisors "in the wild" do not have interrupt exiting
disabled, so this bug will not appear in practice, but definitely it should
be fixed, and I will.

When the nvmx patches are merged, I can copy my private bug tracker's
open issues to an open bug tracker (I'm hoping KVM has one?).

Thanks,
Nadav.

-- 
Nadav Har'El                        |    Wednesday, May 25 2011, 21 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |"I don't use drugs, my dreams are
http://nadav.harel.org.il           |frightening enough." -- M. C. Escher

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: [PATCH 25/31] nVMX: Correct handling of idt vectoring info
  2011-05-25 10:13     ` Nadav Har'El
@ 2011-05-25 10:17       ` Tian, Kevin
  0 siblings, 0 replies; 118+ messages in thread
From: Tian, Kevin @ 2011-05-25 10:17 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb, avi

> From: Nadav Har'El
> Sent: Wednesday, May 25, 2011 6:14 PM
> 
> On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 25/31] nVMX:
> Correct handling of idt vectoring info":
> > Here got one question. How about L2 has interrupt exiting disabled? That
> way
> 
> You are right, this is a bug in my code. It's a known bug - listed on my
> private bugzilla - but I didn't get around to fixing it.
> 
> As I said, most hypervisors "in the wild" do not have interrupt exiting
> disabled, so this bug will not appear in practice, but definitely it should
> be fixed, and I will.
> 
> When the nvmx patches are merged, I can copy my private bug tracker's
> open issues to an open bug tracker (I'm hoping KVM has one?).
> 

That's fine, or at least convert your private bug list into TODO comments in
the right places in the code.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 118+ messages in thread

* RE: [PATCH 31/31] nVMX: Documentation
  2011-05-16 19:59 ` [PATCH 31/31] nVMX: Documentation Nadav Har'El
@ 2011-05-25 10:33   ` Tian, Kevin
  2011-05-25 11:54     ` Nadav Har'El
  2011-05-25 12:13     ` Muli Ben-Yehuda
  0 siblings, 2 replies; 118+ messages in thread
From: Tian, Kevin @ 2011-05-25 10:33 UTC (permalink / raw)
  To: Nadav Har'El, kvm; +Cc: gleb, avi

> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 4:00 AM
> 
> This patch includes a brief introduction to the nested vmx feature in the
> Documentation/kvm directory. The document also includes a copy of the
> vmcs12 structure, as requested by Avi Kivity.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  Documentation/kvm/nested-vmx.txt |  243
> +++++++++++++++++++++++++++++
>  1 file changed, 243 insertions(+)
> 
> --- .before/Documentation/kvm/nested-vmx.txt	2011-05-16
> 22:36:51.000000000 +0300
> +++ .after/Documentation/kvm/nested-vmx.txt	2011-05-16
> 22:36:51.000000000 +0300
> @@ -0,0 +1,243 @@
> +Nested VMX
> +==========
> +
> +Overview
> +---------
> +
> +On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
> +to easily and efficiently run guest operating systems. Normally, these guests
> +*cannot* themselves be hypervisors running their own guests, because in
> VMX,
> +guests cannot use VMX instructions.

"because in VMX, guests cannot use VMX instructions" looks not correct or else
you can't add nVMX support. :-) It's just because currently KVM doesn't emulate
those VMX instructions.

> +
> +The "Nested VMX" feature adds this missing capability - of running guest
> +hypervisors (which use VMX) with their own nested guests. It does so by
> +allowing a guest to use VMX instructions, and correctly and efficiently
> +emulating them using the single level of VMX available in the hardware.
> +
> +We describe in much greater detail the theory behind the nested VMX
> feature,
> +its implementation and its performance characteristics, in the OSDI 2010
> paper
> +"The Turtles Project: Design and Implementation of Nested Virtualization",
> +available at:
> +
> +	http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf
> +
> +
> +Terminology
> +-----------
> +
> +Single-level virtualization has two levels - the host (KVM) and the guests.
> +In nested virtualization, we have three levels: The host (KVM), which we call
> +L0, the guest hypervisor, which we call L1, and its nested guest, which we
> +call L2.

Adding a brief introduction to vmcs01/vmcs02/vmcs12 would also be helpful here,
given that this doc is a centralized place to get a quick picture of nested VMX.

> +
> +
> +Known limitations
> +-----------------
> +
> +The current code supports running Linux guests under KVM guests.
> +Only 64-bit guest hypervisors are supported.
> +
> +Additional patches for running Windows under guest KVM, and Linux under
> +guest VMware server, and support for nested EPT, are currently running in
> +the lab, and will be sent as follow-on patchsets.

Any plan for nested VT-d?

> +
> +
> +Running nested VMX
> +------------------
> +
> +The nested VMX feature is disabled by default. It can be enabled by giving
> +the "nested=1" option to the kvm-intel module.
> +
> +No modifications are required to user space (qemu). However, qemu's default
> +emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must
> be
> +explicitly enabled, by giving qemu one of the following options:
> +
> +     -cpu host              (emulated CPU has all features of the real
> CPU)
> +
> +     -cpu qemu64,+vmx       (add just the vmx feature to a named CPU
> type)
> +
> +
> +ABIs
> +----
> +
> +Nested VMX aims to present a standard and (eventually) fully-functional VMX
> +implementation for the a guest hypervisor to use. As such, the official
> +specification of the ABI that it provides is Intel's VMX specification,
> +namely volume 3B of their "Intel 64 and IA-32 Architectures Software
> +Developer's Manual". Not all of VMX's features are currently fully supported,
> +but the goal is to eventually support them all, starting with the VMX features
> +which are used in practice by popular hypervisors (KVM and others).

It'd be good to provide a list of known supported features. With the current
code, people have to read the code to understand the current status. If you
can keep a supported and verified feature list here, that'd be great.

Thanks
Kevin


* Re: [PATCH 23/31] nVMX: Correct handling of interrupt injection
  2011-05-25  8:39   ` Tian, Kevin
  2011-05-25  8:45     ` Tian, Kevin
@ 2011-05-25 10:56     ` Nadav Har'El
  1 sibling, 0 replies; 118+ messages in thread
From: Nadav Har'El @ 2011-05-25 10:56 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: kvm, gleb, avi

On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 23/31] nVMX: Correct handling of interrupt injection":
> > If L1 turns on VM_EXIT_ACK_INTR_ON_EXIT (again, no hypervisor that I know
> > does), things look very different from the description above: L1 expects
> 
> A Type-1 bare-metal hypervisor, such as Xen, may enable this bit. This bit is
> really meant for the L2 hypervisor, since it is normally tricky for the L2
> hypervisor to touch the generic interrupt logic, and thus it's better not to
> ack the interrupt until interrupts are enabled, at which point the hardware
> will vector to the kernel interrupt handler automatically.

I have to be honest (and I was, in the patch set's introduction), this version
of nested VMX was only tested with a KVM L1. We've had VMware Server running
as a guest using a two-year-old branch of this code (as reported in our paper),
but this code has changed considerably since then, and it probably will not
work today. We've never tested Xen as L1.

I'll remove the emphasis on "no hypervisor that I know does", but the important
point remains: VM_EXIT_ACK_INTR_ON_EXIT is an optional feature, which we
report is *not* supported, so L1 should not attempt to use it.
As you said, it's not that difficult to support it, so we should, eventually,
but it's not a priority right now. It's on my Bugzilla.

> > +/*
> > + * In nested virtualization, check if L1 asked to exit on external interrupts.
> > + * For most existing hypervisors, this will always return true.
> > + */
> > +static bool nested_exit_on_intr(struct kvm_vcpu *vcpu)
> > +{
> > +	return get_vmcs12(vcpu)->pin_based_vm_exec_control &
> > +		PIN_BASED_EXT_INTR_MASK;
> > +}
> > +
> 
> This could be a common wrapper, similar to nested_cpu_has...

I thought this made the callers easier to read, and didn't result in too much
code duplication, so I prefer it this way.
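
For reference, the kind of common wrapper you have in mind could look roughly
like this - just a sketch, where the helper name and signature are made up for
illustration and are not taken from the patch set:

	static inline bool nested_cpu_has_pin_based(struct vmcs12 *vmcs12,
						    u32 bit)
	{
		return vmcs12->pin_based_vm_exec_control & bit;
	}

nested_exit_on_intr() would then reduce to

	return nested_cpu_has_pin_based(get_vmcs12(vcpu),
					PIN_BASED_EXT_INTR_MASK);

but, as said above, I think the dedicated helper reads better at its call sites.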

-- 
Nadav Har'El                        |    Wednesday, May 25 2011, 21 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Windows detected you moved your mouse.
http://nadav.harel.org.il           |Reboot for this change to take effect.


* Re: [PATCH 31/31] nVMX: Documentation
  2011-05-25 10:33   ` Tian, Kevin
@ 2011-05-25 11:54     ` Nadav Har'El
  2011-05-25 12:11       ` Tian, Kevin
  2011-05-25 12:13     ` Muli Ben-Yehuda
  1 sibling, 1 reply; 118+ messages in thread
From: Nadav Har'El @ 2011-05-25 11:54 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: kvm, gleb, avi

On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 31/31] nVMX: Documentation":
> > +On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
> > +to easily and efficiently run guest operating systems. Normally, these guests
> > +*cannot* themselves be hypervisors running their own guests, because in VMX,
> > +guests cannot use VMX instructions.
> 
> "because in VMX, guests cannot use VMX instructions" looks not correct or else
> you can't add nVMX support. :-) It's just because currently KVM doesn't emulate
> those VMX instructions.

It depends on whether you look on the half-empty or half-full part of the
glass ;-)

The VMX instructions, when used in L1, do trap - as mandated by Popek and
Goldberg's theorem (that sensitive instructions must trap) - but they
don't "just work" like, for example, arithmetic instructions just work -
they need to be emulated by the VMM.

> > +Terminology
> > +-----------
> > +
> > +Single-level virtualization has two levels - the host (KVM) and the guests.
> > +In nested virtualization, we have three levels: The host (KVM), which we call
> > +L0, the guest hypervisor, which we call L1, and its nested guest, which we
> > +call L2.
> 
> Adding a brief introduction to vmcs01/vmcs02/vmcs12 would also be helpful here,
> given that this doc is a centralized place to get a quick picture of nested VMX.

I'm adding now a short mention. However, I think this file should be viewed
as a user's guide, not a developer's guide. Developers should probably read
our full paper, where this terminology is explained, as well as how vmcs02
is related to the two others.

> > +Additional patches for running Windows under guest KVM, and Linux under
> > +guest VMware server, and support for nested EPT, are currently running in
> > +the lab, and will be sent as follow-on patchsets.
> 
> Any plans for nested VT-d?

Yes, for some definition of Yes ;-)

We do have an experimental nested IOMMU implementation: In our nested VMX
paper we showed how giving L1 an IOMMU allows for efficient nested device
assignment (L0 assigns a PCI device to L1, and L1 does the same to L2).
In that work we used a very simplistic "paravirtual" IOMMU instead of fully
emulating an IOMMU for L1.
Later, we did develop a full emulation of an IOMMU for L1, although we didn't
test it in the context of nested VMX (we used it to allow L1 to use an IOMMU
for better DMA protection inside the guest).

The IOMMU emulation work was done by Nadav Amit, Muli Ben-Yehuda, et al.,
and will be described in the upcoming Usenix ATC conference
(http://www.usenix.org/event/atc11/tech/techAbstracts.html#Amit).
After the conference in June, the paper will be available at this URL:
http://www.usenix.org/event/atc11/tech/final_files/Amit.pdf

If there is interest, they can perhaps contribute their work to
KVM (and QEMU) - if you're interested, please get in touch with them directly.

> It'd be good to provide a list of known supported features. In your current code,
> people have to look at the code to understand the current status. If you can keep
> a list of supported and verified features here, it'd be great.

It will be even better to support all features ;-)

But seriously, the VMX spec is hundreds of pages long, with hundreds of
features, sub-features, and sub-sub-features and myriads of subcase-of-
subfeature and combinations thereof, so I don't think such a list would be
practical - or ever be accurate.

In the "Known Limitations" section of this document, I'd like to list major
features which are missing, and perhaps more importantly - L1 and L2
guests which are known NOT to work.

By the way, it appears that you've been going over the patches in increasing
numerical order, and this is the last patch ;-) Have you finished your
review iteration?

Thanks for the reviews!
Nadav.

-- 
Nadav Har'El                        |    Wednesday, May 25 2011, 21 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Cats aren't clean, they're just covered
http://nadav.harel.org.il           |with cat spit.


* RE: [PATCH 31/31] nVMX: Documentation
  2011-05-25 11:54     ` Nadav Har'El
@ 2011-05-25 12:11       ` Tian, Kevin
  0 siblings, 0 replies; 118+ messages in thread
From: Tian, Kevin @ 2011-05-25 12:11 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb, avi

> From: Nadav Har'El
> Sent: Wednesday, May 25, 2011 7:55 PM
> 
> On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 31/31] nVMX:
> Documentation":
> > > +On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
> > > +to easily and efficiently run guest operating systems. Normally, these guests
> > > +*cannot* themselves be hypervisors running their own guests, because in VMX,
> > > +guests cannot use VMX instructions.
> >
> > "because in VMX, guests cannot use VMX instructions" looks not correct or
> else
> > you can't add nVMX support. :-) It's just because currently KVM doesn't
> emulate
> > those VMX instructions.
> 
> It depends on whether you look on the half-empty or half-full part of the
> glass ;-)
> 
> The VMX instructions, when used in L1, do trap - as mandated by Popek and
> Goldberg's theorem (that sensitive instructions must trap) - but they
> don't "just work" like, for example, arithmetic instructions just work -
> they need to be emulated by the VMM.
> 
> > > +Terminology
> > > +-----------
> > > +
> > > +Single-level virtualization has two levels - the host (KVM) and the guests.
> > > +In nested virtualization, we have three levels: The host (KVM), which we call
> > > +L0, the guest hypervisor, which we call L1, and its nested guest, which we
> > > +call L2.
> >
> > Adding a brief introduction to vmcs01/vmcs02/vmcs12 would also be helpful
> > here, given that this doc is a centralized place to get a quick picture of
> > nested VMX.
> 
> I'm adding now a short mention. However, I think this file should be viewed
> as a user's guide, not a developer's guide. Developers should probably read
> our full paper, where this terminology is explained, as well as how vmcs02
> is related to the two others.

I agree with the purpose of this doc. 

> 
> > > +Additional patches for running Windows under guest KVM, and Linux under
> > > +guest VMware server, and support for nested EPT, are currently running in
> > > +the lab, and will be sent as follow-on patchsets.
> >
> > Any plans for nested VT-d?
> 
> Yes, for some definition of Yes ;-)
> 
> We do have an experimental nested IOMMU implementation: In our nested VMX
> paper we showed how giving L1 an IOMMU allows for efficient nested device
> assignment (L0 assigns a PCI device to L1, and L1 does the same to L2).
> In that work we used a very simplistic "paravirtual" IOMMU instead of fully
> emulating an IOMMU for L1.
> Later, we did develop a full emulation of an IOMMU for L1, although we didn't
> test it in the context of nested VMX (we used it to allow L1 to use an IOMMU
> for better DMA protection inside the guest).
> 
> The IOMMU emulation work was done by Nadav Amit, Muli Ben-Yehuda, et al.,
> and will be described in the upcoming Usenix ATC conference
> (http://www.usenix.org/event/atc11/tech/techAbstracts.html#Amit).
> After the conference in June, the paper will be available at this URL:
> http://www.usenix.org/event/atc11/tech/final_files/Amit.pdf
> 
> If there is interest, they can perhaps contribute their work to
> KVM (and QEMU) - if you're interested, please get in touch with them directly.

Thanks, good to know that information.

> 
> > It'd be good to provide a list of known supported features. In your current
> > code, people have to look at the code to understand the current status. If
> > you can keep a list of supported and verified features here, it'd be great.
> 
> It will be even better to support all features ;-)
> 
> But seriously, the VMX spec is hundreds of pages long, with hundreds of
> features, sub-features, and sub-sub-features and myriads of subcase-of-
> subfeature and combinations thereof, so I don't think such a list would be
> practical - or ever be accurate.

No need for all subfeatures; a list of the dozen or so features that people have
enabled one-by-one would be very welcome, especially for things which may
accelerate L2 performance, such as virtual NMI, TPR shadow, virtual x2APIC, ...

> 
> In the "Known Limitations" section of this document, I'd like to list major
> features which are missing, and perhaps more importantly - L1 and L2
> guests which are known NOT to work.

Yes, that info is also important, so that people can easily reproduce your
success.

> 
> By the way, it appears that you've been going over the patches in increasing
> numerical order, and this is the last patch ;-) Have you finished your
> review iteration?
> 

Yes, I've finished my review of all of your v10 patches. :-)

Thanks
Kevin


* Re: [PATCH 31/31] nVMX: Documentation
  2011-05-25 10:33   ` Tian, Kevin
  2011-05-25 11:54     ` Nadav Har'El
@ 2011-05-25 12:13     ` Muli Ben-Yehuda
  1 sibling, 0 replies; 118+ messages in thread
From: Muli Ben-Yehuda @ 2011-05-25 12:13 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: Nadav Har'El, kvm, gleb, avi

On Wed, May 25, 2011 at 06:33:30PM +0800, Tian, Kevin wrote:

> > +Known limitations
> > +-----------------
> > +
> > +The current code supports running Linux guests under KVM guests.
> > +Only 64-bit guest hypervisors are supported.
> > +
> > +Additional patches for running Windows under guest KVM, and Linux under
> > +guest VMware server, and support for nested EPT, are currently running in
> > +the lab, and will be sent as follow-on patchsets.
> 
> Any plans for nested VT-d?

Nadav Amit sent patches for VT-d emulation about a year ago
(http://marc.info/?l=qemu-devel&m=127124206827481&w=2). They don't
apply to the current tree, but rebasing them probably doesn't make
sense until some version of the QEMU IOMMU/DMA API that has been
discussed makes it in.

Cheers,
Muli


* Re: [PATCH 23/31] nVMX: Correct handling of interrupt injection
  2011-05-25  9:18   ` Tian, Kevin
@ 2011-05-25 12:33     ` Nadav Har'El
  2011-05-25 12:55       ` Tian, Kevin
  0 siblings, 1 reply; 118+ messages in thread
From: Nadav Har'El @ 2011-05-25 12:33 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: kvm, gleb, avi

On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 23/31] nVMX: Correct handling of interrupt injection":
> >  static void enable_irq_window(struct kvm_vcpu *vcpu)
> >  {
> >  	u32 cpu_based_vm_exec_control;
> > +	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
> > +		/* We can get here when nested_run_pending caused
> > +		 * vmx_interrupt_allowed() to return false. In this case, do
> > +		 * nothing - the interrupt will be injected later.
> > +		 */
> 
> I think this is not a rare path? When the vcpu is in guest mode with L2 as the
> current vmx context, this function could be invoked multiple times, since the
> kvm thread can be scheduled out/in randomly.

As I wrote in this comment, this can only happen on nested_run_pending
(i.e., VMLAUNCH/VMRESUME emulation), because if !nested_run_pending,
and nested_exit_on_intr(), vmx_interrupt_allowed() would have already
exited L2, and we wouldn't be in this case.

I don't know whether to classify this as a "rare" path - it's definitely not
very common. But what does it matter if it's rare or common?


> > +		if (to_vmx(vcpu)->nested.nested_run_pending)
> > +			return 0;
> 
> Well, now I can see why you require this special 'nested_run_pending' flag
> because there're places where L0 injects virtual interrupts right after
> VMLAUNCH/VMRESUME emulation and before entering L2. :-)

Indeed. I tried to explain that in the patch description, where I wrote

 We keep a new flag, "nested_run_pending", which can override the decision of
 which should run next, L1 or L2. nested_run_pending=1 means that we *must* run
 L2 next, not L1. This is necessary in particular when L1 did a VMLAUNCH of L2
 and therefore expects L2 to be run (and perhaps be injected with an event it
 specified, etc.). Nested_run_pending is especially intended to avoid switching
 to L1 in the injection decision-point described above.

> > +		nested_vmx_vmexit(vcpu);
> > +		vmcs12 = get_vmcs12(vcpu);
> > +		vmcs12->vm_exit_reason = EXIT_REASON_EXTERNAL_INTERRUPT;
> > +		vmcs12->vm_exit_intr_info = 0;
> > +		/* fall through to normal code, but now in L1, not L2 */
> > +	}
> > +
> 
> This is a bad place to add this logic. vmx_interrupt_allowed is simply a
> query function, but you make it an absolute trigger point for switching from
> L2 to L1. This is fine for now, as the only point calling vmx_interrupt_allowed
> is when there's a vNMI pending. But it's dangerous to have such an assumption
> about pending events inside vmx_interrupt_allowed.

Now you're beating a dead horse ;-)

Gleb, and to some degree Avi, already argued that this is the wrong place
to do this exit, and if anything the exit should be done (or just decided on)
in enable_irq_window().

My counter-argument was that the right way is *neither* of these approaches -
any attempt to "commandeer" one of the existing x86 ops, be they
vmx_interrupt_allowed() or enable_irq_window() to do in the L2 case things
they were never designed to do is both ugly, and dangerous if the call sites
change at some time in the future.

So rather than changing one ugly abuse of one function, to the (arguably
also ugly) abuse of another function, what I'd like to see is a better overall
design, where the call sites in x86.c know about the possibility of a nested
guest (they already do - like we previously discussed, an is_guest_mode()
function was recently added), and when they need, *they* will call an
exit-to-L1 function, rather than calling a function called "enable_irq_window"
or "vmx_interrupt_allowed" which mysteriously will do the exit.


> On the other hand, I think there's one area which is not handled in a timely
> manner. I think you need to kick an L2->L1 transition when L0 wants to inject
> a virtual interrupt. Consider your current logic:
> 
> a) L2 is running on cpu1
> b) L0 on cpu0 decides to post a virtual interrupt to L1. An IPI is issued to
> cpu1 after updating the virtual irqchip
> c) L2 on cpu1 vmexits to L0, which checks whether L0 or L1 should handle
> the event. As it's an external interrupt, L0 will handle it. As it's a notification
> IPI, nothing is required.
> d) L0 on cpu1 then decides to resume, and finds KVM_REQ_EVENT
> 
> At this point you only add logic to enable_irq_window, but there's no
> action to trigger the L2->L1 transition. So what will happen? Will the event
> be injected into L2 instead, or stay pending until the next switch happens
> for some other reason?

I'm afraid I'm missing something in your explanation... In step d, L0 finds
an interrupt in the injection queue, so isn't the first thing it does a call
to vmx_interrupt_allowed(), to check if injection is allowed now?
In our code, "vmx_interrupt_allowed()" was bastardized to exit to L1 in
this case. Isn't that the missing exit you were looking for?
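
For clarity, here is the whole decision point with the pieces quoted above put
together. This is a simplified reconstruction, not a verbatim copy of the patch
(the final return statement is just the pre-existing, non-nested check):

	static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
	{
		if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
			struct vmcs12 *vmcs12;

			/* L1 just did VMLAUNCH/VMRESUME: L2 must run first,
			 * so don't switch to L1 for this injection. */
			if (to_vmx(vcpu)->nested.nested_run_pending)
				return 0;

			/* Otherwise the interrupt is L1's to handle: exit to
			 * L1 and report an external-interrupt exit. */
			nested_vmx_vmexit(vcpu);
			vmcs12 = get_vmcs12(vcpu);
			vmcs12->vm_exit_reason = EXIT_REASON_EXTERNAL_INTERRUPT;
			vmcs12->vm_exit_intr_info = 0;
			/* fall through to the regular L1 checks below */
		}

		return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
			!(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
			  (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
	}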


-- 
Nadav Har'El                        |    Wednesday, May 25 2011, 21 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Experience is what lets you recognize a
http://nadav.harel.org.il           |mistake when you make it again.


* RE: [PATCH 23/31] nVMX: Correct handling of interrupt injection
  2011-05-25 12:33     ` Nadav Har'El
@ 2011-05-25 12:55       ` Tian, Kevin
  0 siblings, 0 replies; 118+ messages in thread
From: Tian, Kevin @ 2011-05-25 12:55 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb, avi

> From: Nadav Har'El [mailto:nyh@math.technion.ac.il]
> Sent: Wednesday, May 25, 2011 8:34 PM
> 
> On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 23/31] nVMX:
> Correct handling of interrupt injection":
> > >  static void enable_irq_window(struct kvm_vcpu *vcpu)
> > >  {
> > >  	u32 cpu_based_vm_exec_control;
> > > +	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
> > > +		/* We can get here when nested_run_pending caused
> > > +		 * vmx_interrupt_allowed() to return false. In this case, do
> > > +		 * nothing - the interrupt will be injected later.
> > > +		 */
> >
> > I think this is not a rare path? When the vcpu is in guest mode with L2 as
> > the current vmx context, this function could be invoked multiple times,
> > since the kvm thread can be scheduled out/in randomly.
> 
> As I wrote in this comment, this can only happen on nested_run_pending
> (i.e., VMLAUNCH/VMRESUME emulation), because if !nested_run_pending,
> and nested_exit_on_intr(), vmx_interrupt_allowed() would have already
> exited L2, and we wouldn't be in this case.
> 
> I don't know whether to classify this as a "rare" path - it's definitely not
> very common. But what does it matter if it's rare or common?

It doesn't matter much. I just tried to understand your comment.

> 
> 
> > > +		if (to_vmx(vcpu)->nested.nested_run_pending)
> > > +			return 0;
> >
> > Well, now I can see why you require this special 'nested_run_pending' flag
> > because there're places where L0 injects virtual interrupts right after
> > VMLAUNCH/VMRESUME emulation and before entering L2. :-)
> 
> Indeed. I tried to explain that in the patch description, where I wrote
> 
>  We keep a new flag, "nested_run_pending", which can override the decision of
>  which should run next, L1 or L2. nested_run_pending=1 means that we *must* run
>  L2 next, not L1. This is necessary in particular when L1 did a VMLAUNCH of L2
>  and therefore expects L2 to be run (and perhaps be injected with an event it
>  specified, etc.). Nested_run_pending is especially intended to avoid switching
>  to L1 in the injection decision-point described above.

At the point where nested_run_pending is first introduced, its usage is simple,
which made me think this field may not be required. But later several key
patches do depend on this flag for correctness. :-)

> 
> > > +		nested_vmx_vmexit(vcpu);
> > > +		vmcs12 = get_vmcs12(vcpu);
> > > +		vmcs12->vm_exit_reason = EXIT_REASON_EXTERNAL_INTERRUPT;
> > > +		vmcs12->vm_exit_intr_info = 0;
> > > +		/* fall through to normal code, but now in L1, not L2 */
> > > +	}
> > > +
> >
> > This is a bad place to add this logic. vmx_interrupt_allowed is simply a
> > query function, but you make it an absolute trigger point for switching from
> > L2 to L1. This is fine for now, as the only point calling
> > vmx_interrupt_allowed is when there's a vNMI pending. But it's dangerous to
> > have such an assumption about pending events inside vmx_interrupt_allowed.
> 
> Now you're beating a dead horse ;-)
> 
> Gleb, and to some degree Avi, already argued that this is the wrong place
> to do this exit, and if anything the exit should be done (or just decided on)
> in enable_irq_window().
> 
> My counter-argument was that the right way is *neither* of these approaches -
> any attempt to "commandeer" one of the existing x86 ops, be they
> vmx_interrupt_allowed() or enable_irq_window() to do in the L2 case things
> they were never designed to do is both ugly, and dangerous if the call sites
> change at some time in the future.
> 
> So rather than changing one ugly abuse of one function, to the (arguably
> also ugly) abuse of another function, what I'd like to see is a better overall
> design, where the call sites in x86.c know about the possibility of a nested
> guest (they already do - like we previously discussed, an is_guest_mode()
> function was recently added), and when they need, *they* will call an
> exit-to-L1 function, rather than calling a function called "enable_irq_window"
> or "vmx_interrupt_allowed" which mysteriously will do the exit.
> 

I agree with your point here.

> 
> > On the other hand, I think there's one area which is not handled in a timely
> > manner. I think you need to kick an L2->L1 transition when L0 wants to inject
> > a virtual interrupt. Consider your current logic:
> >
> > a) L2 is running on cpu1
> > b) L0 on cpu0 decides to post a virtual interrupt to L1. An IPI is issued to
> > cpu1 after updating the virtual irqchip
> > c) L2 on cpu1 vmexits to L0, which checks whether L0 or L1 should handle
> > the event. As it's an external interrupt, L0 will handle it. As it's a notification
> > IPI, nothing is required.
> > d) L0 on cpu1 then decides to resume, and finds KVM_REQ_EVENT
> >
> > At this point you only add logic to enable_irq_window, but there's no
> > action to trigger the L2->L1 transition. So what will happen? Will the event
> > be injected into L2 instead, or stay pending until the next switch happens
> > for some other reason?
> 
> I'm afraid I'm missing something in your explanation... In step d, L0 finds
> an interrupt in the injection queue, so isn't the first thing it does a call
> to vmx_interrupt_allowed(), to check if injection is allowed now?
> In our code, "vmx_interrupt_allowed()" was bastardized to exit to L1 in
> this case. Isn't that the missing exit you were looking for?
> 

This is a false alarm. In my earlier search I thought that vmx_interrupt_allowed
was only invoked in vmx.c for the pending vNMI check, which led me to suspect a
bigger problem. But actually .interrupt_allowed is checked in the common path as
expected. So the problem was on my side. :-)

Thanks
Kevin


* Re: [PATCH 20/31] nVMX: Exiting from L2 to L1
  2011-05-25  2:43   ` Tian, Kevin
@ 2011-05-25 13:21     ` Nadav Har'El
  2011-05-26  0:41       ` Tian, Kevin
  0 siblings, 1 reply; 118+ messages in thread
From: Nadav Har'El @ 2011-05-25 13:21 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: kvm, gleb, avi

On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 20/31] nVMX: Exiting from L2 to L1":
> How about the SYSENTER and PERF_GLOBAL_CTRL MSRs? At least a TODO comment
> here would make the whole load process complete. :-)
> 
> Also, isn't it saner to update vmcs01's guest segment info based on vmcs12's
> host segment info? Though you can assume the environment in L1 doesn't change
> from VMLAUNCH/VMRESUME to the VMEXIT handler, it's architecturally cleaner
> to load those segment fields according to L1's wishes.

Right... One of these days, I (or some other volunteer ;-)) would need to
print out the relevant sections of the SDM, sit down with a marker, and read
it line by line, marking lines, fields, capabilities, and so on, which we
forgot to implement... 

How about these additions:

	vmcs_write32(GUEST_SYSENTER_CS, vmcs12->host_ia32_sysenter_cs);
	vmcs_writel(GUEST_SYSENTER_ESP, vmcs12->host_ia32_sysenter_esp);
	vmcs_writel(GUEST_SYSENTER_EIP, vmcs12->host_ia32_sysenter_eip);
	vmcs_writel(GUEST_IDTR_BASE, vmcs12->host_idtr_base);
	vmcs_writel(GUEST_GDTR_BASE, vmcs12->host_gdtr_base);
	vmcs_writel(GUEST_TR_BASE, vmcs12->host_tr_base);
	vmcs_writel(GUEST_GS_BASE, vmcs12->host_gs_base);
	vmcs_writel(GUEST_FS_BASE, vmcs12->host_fs_base);
	vmcs_write16(GUEST_ES_SELECTOR, vmcs12->host_es_selector);
	vmcs_write16(GUEST_CS_SELECTOR, vmcs12->host_cs_selector);
	vmcs_write16(GUEST_SS_SELECTOR, vmcs12->host_ss_selector);
	vmcs_write16(GUEST_DS_SELECTOR, vmcs12->host_ds_selector);
	vmcs_write16(GUEST_FS_SELECTOR, vmcs12->host_fs_selector);
	vmcs_write16(GUEST_GS_SELECTOR, vmcs12->host_gs_selector);
	vmcs_write16(GUEST_TR_SELECTOR, vmcs12->host_tr_selector);

	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PAT)
		vmcs_write64(GUEST_IA32_PAT, vmcs12->host_ia32_pat);
	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL)
		vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL,
			vmcs12->host_ia32_perf_global_ctrl);


-- 
Nadav Har'El                        |    Wednesday, May 25 2011, 21 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |AAAAA: the American Association for the
http://nadav.harel.org.il           |Abolition of Abbreviations and Acronyms


* Re: [PATCH 18/31] nVMX: Implement VMLAUNCH and VMRESUME
  2011-05-25  8:00   ` Tian, Kevin
@ 2011-05-25 13:26     ` Nadav Har'El
  2011-05-26  0:42       ` Tian, Kevin
  0 siblings, 1 reply; 118+ messages in thread
From: Nadav Har'El @ 2011-05-25 13:26 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: kvm, gleb, avi

On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 18/31] nVMX: Implement VMLAUNCH and VMRESUME":
> > +	if (!saved_vmcs02)
> > +		return -ENOMEM;
> > +
> 
> We shouldn't return an error after the guest mode is updated. Or else move
> enter_guest_mode to a later place...

I moved things around, but I don't think it matters anyway: If we return
ENOMEM, the KVM ioctl fails, and the whole L1 guest dies - it doesn't matter
at this point if we were in the middle of updating its state.
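
In any case the reordering is small - roughly the following, with the vmcs02
lookup/allocation details elided (a sketch, not the actual diff):

	/* obtain the vmcs02 first, while we can still fail cleanly... */
	if (!saved_vmcs02)
		return -ENOMEM;

	/* ...and only then commit to running L2 */
	enter_guest_mode(vcpu);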

-- 
Nadav Har'El                        |    Wednesday, May 25 2011, 21 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Quotation, n.: The act of repeating
http://nadav.harel.org.il           |erroneously the words of another.


* Re: [PATCH 22/31] nVMX: Deciding if L0 or L1 should handle an L2 exit
  2011-05-25  7:56   ` Tian, Kevin
@ 2011-05-25 13:45     ` Nadav Har'El
  0 siblings, 0 replies; 118+ messages in thread
From: Nadav Har'El @ 2011-05-25 13:45 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: kvm, gleb, avi

On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 22/31] nVMX: Deciding if L0 or L1 should handle an L2 exit":
> > +static inline bool nested_cpu_has_virtual_nmis(struct kvm_vcpu *vcpu)
> > +{
> > +	return is_guest_mode(vcpu) &&
> > +		(get_vmcs12(vcpu)->pin_based_vm_exec_control &
> > +			PIN_BASED_VIRTUAL_NMIS);
> > +}
> 
> Any reason to add a guest mode check here? I didn't see such a check in your
> earlier nested_cpu_has_xxx helpers. It would be clearer to use the existing
> nested_cpu_has_xxx along with an explicit is_guest_mode, which makes such
> usage consistent.

The nested_cpu_has function is for procbased controls, not pinbased controls...
But you're right, it's strange that only this function has is_guest_mode()
in it. Moving it outside. The call site is now really ugly ;-)
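
To make that concrete, what I mean is roughly the following - a sketch of the
reworked version, not the final code:

	static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12)
	{
		return vmcs12->pin_based_vm_exec_control &
			PIN_BASED_VIRTUAL_NMIS;
	}

and the caller then has to spell out the guest-mode test itself, e.g. in the
soft-vnmi check further down:

	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked &&
		     !(is_guest_mode(vcpu) &&
		       nested_cpu_has_virtual_nmis(get_vmcs12(vcpu)))))
		/* ... existing soft-vnmi handling ... */;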

> > +	case EXIT_REASON_INVLPG:
> > +		return vmcs12->cpu_based_vm_exec_control &
> > +				CPU_BASED_INVLPG_EXITING;
> 
> use nested_cpu_has.

Right ;-)

> > +	if (exit_reason == EXIT_REASON_VMLAUNCH ||
> > +	    exit_reason == EXIT_REASON_VMRESUME)
> > +		vmx->nested.nested_run_pending = 1;
> > +	else
> > +		vmx->nested.nested_run_pending = 0;
> 
> What about VMLAUNCH invoked from L2? In such a case, I think you expect
> L1 to run instead of L2.

Wow, a good catch!

According to the theory (see our paper :-)), our implementations should work
with any number of nesting levels, not just two. But in practice, we've always
had a bug in running L3, and we never had the time to figure out why.
This is a good lead - I'll have to investigate. But not now.


> On the other hand, isn't the guest mode check alone enough to differentiate
> a pending nested run? When L1 invokes VMLAUNCH/VMRESUME, guest mode
> hasn't been set yet, and the check below will fail. All other operations will
> then be checked by nested_vmx_exit_handled...
> 
> Am I missing anything here?

As we discussed in another thread, nested_run_pending is important later,
at the injection path.

> > -	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked)) {
> > +	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked &&
> > +			!nested_cpu_has_virtual_nmis(vcpu))) {
> 
> Would L0 want to control vNMI for the L2 guest? Otherwise we could just use
> is_guest_mode here for the condition check.

I don't remember the details here, but this if() used to be different, until
Avi Kivity asked me to change it to this state.

Nadav.

-- 
Nadav Har'El                        |    Wednesday, May 25 2011, 21 Iyyar 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |The person who knows how to laugh at
http://nadav.harel.org.il           |himself will never cease to be amused.


* RE: [PATCH 20/31] nVMX: Exiting from L2 to L1
  2011-05-25 13:21     ` Nadav Har'El
@ 2011-05-26  0:41       ` Tian, Kevin
  0 siblings, 0 replies; 118+ messages in thread
From: Tian, Kevin @ 2011-05-26  0:41 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb, avi

> From: Nadav Har'El
> Sent: Wednesday, May 25, 2011 9:21 PM
> 
> On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 20/31] nVMX:
> Exiting from L2 to L1":
> > How about the SYSENTER and PERF_GLOBAL_CTRL MSRs? At least a TODO comment
> > here would make the whole load process complete. :-)
> >
> > Also, isn't it saner to update vmcs01's guest segment info based on vmcs12's
> > host segment info? Though you can assume the environment in L1 doesn't change
> > from VMLAUNCH/VMRESUME to the VMEXIT handler, it's architecturally cleaner
> > to load those segment fields according to L1's wishes.
> 
> Right... One of these days, I (or some other volunteer ;-)) would need to
> print out the relevant sections of the SDM, sit down with a marker, and read
> it line by line, marking lines, fields, capabilities, and so on, which we
> forgot to implement...

You've done a great job.

> 
> How about these additions:
> 
> 	vmcs_write32(GUEST_SYSENTER_CS, vmcs12->host_ia32_sysenter_cs);
> 	vmcs_writel(GUEST_SYSENTER_ESP, vmcs12->host_ia32_sysenter_esp);
> 	vmcs_writel(GUEST_SYSENTER_EIP, vmcs12->host_ia32_sysenter_eip);
> 	vmcs_writel(GUEST_IDTR_BASE, vmcs12->host_idtr_base);
> 	vmcs_writel(GUEST_GDTR_BASE, vmcs12->host_gdtr_base);
> 	vmcs_writel(GUEST_TR_BASE, vmcs12->host_tr_base);
> 	vmcs_writel(GUEST_GS_BASE, vmcs12->host_gs_base);
> 	vmcs_writel(GUEST_FS_BASE, vmcs12->host_fs_base);
> 	vmcs_write16(GUEST_ES_SELECTOR, vmcs12->host_es_selector);
> 	vmcs_write16(GUEST_CS_SELECTOR, vmcs12->host_cs_selector);
> 	vmcs_write16(GUEST_SS_SELECTOR, vmcs12->host_ss_selector);
> 	vmcs_write16(GUEST_DS_SELECTOR, vmcs12->host_ds_selector);
> 	vmcs_write16(GUEST_FS_SELECTOR, vmcs12->host_fs_selector);
> 	vmcs_write16(GUEST_GS_SELECTOR, vmcs12->host_gs_selector);
> 	vmcs_write16(GUEST_TR_SELECTOR, vmcs12->host_tr_selector);
> 
> 	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PAT)
> 		vmcs_write64(GUEST_IA32_PAT, vmcs12->host_ia32_pat);
> 	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL)
> 		vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL,
> 			vmcs12->host_ia32_perf_global_ctrl);
> 

looks good.

Thanks
Kevin


* RE: [PATCH 18/31] nVMX: Implement VMLAUNCH and VMRESUME
  2011-05-25 13:26     ` Nadav Har'El
@ 2011-05-26  0:42       ` Tian, Kevin
  0 siblings, 0 replies; 118+ messages in thread
From: Tian, Kevin @ 2011-05-26  0:42 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, gleb, avi

> From: Nadav Har'El [mailto:nyh@math.technion.ac.il]
> Sent: Wednesday, May 25, 2011 9:26 PM
> 
> On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 18/31] nVMX:
> Implement VMLAUNCH and VMRESUME":
> > > +	if (!saved_vmcs02)
> > > +		return -ENOMEM;
> > > +
> >
> > We shouldn't return an error after the guest mode is updated. Or else move
> > enter_guest_mode to a later place...
> 
> I moved things around, but I don't think it matters anyway: If we return
> ENOMEM, the KVM ioctl fails, and the whole L1 guest dies - it doesn't matter
> at this point if we were in the middle of updating its state.
> 

Yes, but the code is cleaner this way: guest mode is set only when we are
certain to enter L2 next.

Thanks
Kevin



Thread overview: 118+ messages
2011-05-16 19:43 [PATCH 0/31] nVMX: Nested VMX, v10 Nadav Har'El
2011-05-16 19:44 ` [PATCH 01/31] nVMX: Add "nested" module option to kvm_intel Nadav Har'El
2011-05-16 19:44 ` [PATCH 02/31] nVMX: Implement VMXON and VMXOFF Nadav Har'El
2011-05-20  7:58   ` Tian, Kevin
2011-05-16 19:45 ` [PATCH 03/31] nVMX: Allow setting the VMXE bit in CR4 Nadav Har'El
2011-05-16 19:45 ` [PATCH 04/31] nVMX: Introduce vmcs12: a VMCS structure for L1 Nadav Har'El
2011-05-16 19:46 ` [PATCH 05/31] nVMX: Implement reading and writing of VMX MSRs Nadav Har'El
2011-05-16 19:46 ` [PATCH 06/31] nVMX: Decoding memory operands of VMX instructions Nadav Har'El
2011-05-16 19:47 ` [PATCH 07/31] nVMX: Introduce vmcs02: VMCS used to run L2 Nadav Har'El
2011-05-20  8:04   ` Tian, Kevin
2011-05-20  8:48     ` Tian, Kevin
2011-05-20 20:32       ` Nadav Har'El
2011-05-22  2:00         ` Tian, Kevin
2011-05-22  7:22           ` Nadav Har'El
2011-05-24  0:54             ` Tian, Kevin
2011-05-22  8:29     ` Nadav Har'El
2011-05-24  1:03       ` Tian, Kevin
2011-05-16 19:48 ` [PATCH 08/31] nVMX: Fix local_vcpus_link handling Nadav Har'El
2011-05-17 13:19   ` Marcelo Tosatti
2011-05-17 13:35     ` Avi Kivity
2011-05-17 14:35       ` Nadav Har'El
2011-05-17 14:42         ` Marcelo Tosatti
2011-05-17 17:57           ` Nadav Har'El
2011-05-17 15:11         ` Avi Kivity
2011-05-17 18:11           ` Nadav Har'El
2011-05-17 18:43             ` Marcelo Tosatti
2011-05-17 19:30               ` Nadav Har'El
2011-05-17 19:52                 ` Marcelo Tosatti
2011-05-18  5:52                   ` Nadav Har'El
2011-05-18  8:31                     ` Avi Kivity
2011-05-18  9:02                       ` Nadav Har'El
2011-05-18  9:16                         ` Avi Kivity
2011-05-18 12:08                     ` Marcelo Tosatti
2011-05-18 12:19                       ` Nadav Har'El
2011-05-22  8:57                       ` Nadav Har'El
2011-05-23 15:49                         ` Avi Kivity
2011-05-23 16:17                           ` Gleb Natapov
2011-05-23 18:59                             ` Nadav Har'El
2011-05-23 19:03                               ` Gleb Natapov
2011-05-23 16:43                           ` Roedel, Joerg
2011-05-23 16:51                             ` Avi Kivity
2011-05-24  9:22                               ` Roedel, Joerg
2011-05-24  9:28                                 ` Nadav Har'El
2011-05-24  9:57                                   ` Roedel, Joerg
2011-05-24 10:08                                     ` Avi Kivity
2011-05-24 10:12                                     ` Nadav Har'El
2011-05-23 18:51                           ` Nadav Har'El
2011-05-24  2:22                             ` Tian, Kevin
2011-05-24  7:56                               ` Nadav Har'El
2011-05-24  8:20                                 ` Tian, Kevin
2011-05-24 11:05                                   ` Avi Kivity
2011-05-24 11:20                                     ` Tian, Kevin
2011-05-24 11:27                                       ` Avi Kivity
2011-05-24 11:30                                         ` Tian, Kevin
2011-05-24 11:36                                           ` Avi Kivity
2011-05-24 11:40                                             ` Tian, Kevin
2011-05-24 11:59                                               ` Nadav Har'El
2011-05-24  0:57                           ` Tian, Kevin
2011-05-18  8:29                   ` Avi Kivity
2011-05-16 19:48 ` [PATCH 09/31] nVMX: Add VMCS fields to the vmcs12 Nadav Har'El
2011-05-20  8:22   ` Tian, Kevin
2011-05-16 19:49 ` [PATCH 10/31] nVMX: Success/failure of VMX instructions Nadav Har'El
2011-05-16 19:49 ` [PATCH 11/31] nVMX: Implement VMCLEAR Nadav Har'El
2011-05-16 19:50 ` [PATCH 12/31] nVMX: Implement VMPTRLD Nadav Har'El
2011-05-16 19:50 ` [PATCH 13/31] nVMX: Implement VMPTRST Nadav Har'El
2011-05-16 19:51 ` [PATCH 14/31] nVMX: Implement VMREAD and VMWRITE Nadav Har'El
2011-05-16 19:51 ` [PATCH 15/31] nVMX: Move host-state field setup to a function Nadav Har'El
2011-05-16 19:52 ` [PATCH 16/31] nVMX: Move control field setup to functions Nadav Har'El
2011-05-16 19:52 ` [PATCH 17/31] nVMX: Prepare vmcs02 from vmcs01 and vmcs12 Nadav Har'El
2011-05-24  8:02   ` Tian, Kevin
2011-05-24  9:19     ` Nadav Har'El
2011-05-24 10:52       ` Tian, Kevin
2011-05-16 19:53 ` [PATCH 18/31] nVMX: Implement VMLAUNCH and VMRESUME Nadav Har'El
2011-05-24  8:45   ` Tian, Kevin
2011-05-24  9:45     ` Nadav Har'El
2011-05-24 10:54       ` Tian, Kevin
2011-05-25  8:00   ` Tian, Kevin
2011-05-25 13:26     ` Nadav Har'El
2011-05-26  0:42       ` Tian, Kevin
2011-05-16 19:53 ` [PATCH 19/31] nVMX: No need for handle_vmx_insn function any more Nadav Har'El
2011-05-16 19:54 ` [PATCH 20/31] nVMX: Exiting from L2 to L1 Nadav Har'El
2011-05-24 12:58   ` Tian, Kevin
2011-05-24 13:43     ` Nadav Har'El
2011-05-25  0:55       ` Tian, Kevin
2011-05-25  8:06         ` Nadav Har'El
2011-05-25  8:23           ` Tian, Kevin
2011-05-25  2:43   ` Tian, Kevin
2011-05-25 13:21     ` Nadav Har'El
2011-05-26  0:41       ` Tian, Kevin
2011-05-16 19:54 ` [PATCH 21/31] nVMX: vmcs12 checks on nested entry Nadav Har'El
2011-05-25  3:01   ` Tian, Kevin
2011-05-25  5:38     ` Nadav Har'El
2011-05-25  7:33       ` Tian, Kevin
2011-05-16 19:55 ` [PATCH 22/31] nVMX: Deciding if L0 or L1 should handle an L2 exit Nadav Har'El
2011-05-25  7:56   ` Tian, Kevin
2011-05-25 13:45     ` Nadav Har'El
2011-05-16 19:55 ` [PATCH 23/31] nVMX: Correct handling of interrupt injection Nadav Har'El
2011-05-25  8:39   ` Tian, Kevin
2011-05-25  8:45     ` Tian, Kevin
2011-05-25 10:56     ` Nadav Har'El
2011-05-25  9:18   ` Tian, Kevin
2011-05-25 12:33     ` Nadav Har'El
2011-05-25 12:55       ` Tian, Kevin
2011-05-16 19:56 ` [PATCH 24/31] nVMX: Correct handling of exception injection Nadav Har'El
2011-05-16 19:56 ` [PATCH 25/31] nVMX: Correct handling of idt vectoring info Nadav Har'El
2011-05-25 10:02   ` Tian, Kevin
2011-05-25 10:13     ` Nadav Har'El
2011-05-25 10:17       ` Tian, Kevin
2011-05-16 19:57 ` [PATCH 26/31] nVMX: Handling of CR0 and CR4 modifying instructions Nadav Har'El
2011-05-16 19:57 ` [PATCH 27/31] nVMX: Further fixes for lazy FPU loading Nadav Har'El
2011-05-16 19:58 ` [PATCH 28/31] nVMX: Additional TSC-offset handling Nadav Har'El
2011-05-16 19:58 ` [PATCH 29/31] nVMX: Add VMX to list of supported cpuid features Nadav Har'El
2011-05-16 19:59 ` [PATCH 30/31] nVMX: Miscellenous small corrections Nadav Har'El
2011-05-16 19:59 ` [PATCH 31/31] nVMX: Documentation Nadav Har'El
2011-05-25 10:33   ` Tian, Kevin
2011-05-25 11:54     ` Nadav Har'El
2011-05-25 12:11       ` Tian, Kevin
2011-05-25 12:13     ` Muli Ben-Yehuda
