* [PATCH 0/24] Nested VMX, v5
@ 2010-06-13 12:22 Nadav Har'El
  2010-06-13 12:23 ` [PATCH 1/24] Move nested option from svm.c to x86.c Nadav Har'El
                   ` (26 more replies)
  0 siblings, 27 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-06-13 12:22 UTC (permalink / raw)
  To: avi; +Cc: kvm

Hi Avi,

This is a follow-up to the nested VMX patches that Orit Wasserman posted in
December. We've addressed most of the comments and concerns that you and
others on the mailing list had about the previous patch set. We hope you'll
find these patches easier to understand, and suitable for applying to KVM.


The following 24 patches implement nested VMX support. They enable a guest
to use the VMX instruction set in order to run its own nested guests, i.e.,
they allow running hypervisors (that use VMX) under KVM. The theory behind
this work, our implementation, and its performance characteristics are
described in IBM Research report H-0282, "The Turtles Project: Design and
Implementation of Nested Virtualization", available at:

	http://bit.ly/a0o9te

The current patches support running Linux under a nested KVM using shadow
page tables (with bypass_guest_pf disabled). They support multiple nested
hypervisors, each of which can run multiple guests. Only 64-bit nested
hypervisors are supported. SMP is supported. Additional patches, which add
support for running Windows under nested KVM, Linux under a nested VMware
Server, and nested EPT, are currently being tested in the lab and will be
sent as follow-on patch sets.

These patches were written by:
     Abel Gordon, abelg <at> il.ibm.com
     Nadav Har'El, nyh <at> il.ibm.com
     Orit Wasserman, oritw <at> il.ibm.com
     Ben-Ami Yassor, benami <at> il.ibm.com
     Muli Ben-Yehuda, muli <at> il.ibm.com

With contributions by:
     Anthony Liguori, aliguori <at> us.ibm.com
     Mike Day, mdday <at> us.ibm.com

This work was inspired by the nested SVM support by Alexander Graf and Joerg
Roedel.


Changes since v4:
* Rebased to the current KVM tree.
* Support for lazy FPU loading.
* Implemented about 90 requests and suggestions made on the mailing list
  regarding the previous version of this patch set.
* Split the changes into many more, and better documented, patches.

--
Nadav Har'El
IBM Haifa Research Lab


* [PATCH 1/24] Move nested option from svm.c to x86.c
  2010-06-13 12:22 [PATCH 0/24] Nested VMX, v5 Nadav Har'El
@ 2010-06-13 12:23 ` Nadav Har'El
  2010-06-14  8:11   ` Avi Kivity
  2010-06-13 12:23 ` [PATCH 2/24] Add VMX and SVM to list of supported cpuid features Nadav Har'El
                   ` (25 subsequent siblings)
  26 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-06-13 12:23 UTC (permalink / raw)
  To: avi; +Cc: kvm

The SVM module had a "nested" option, on by default, which controlled whether
to allow nested virtualization. Now that VMX also supports nested
virtualization, we can move this option to x86.c, where it covers both SVM
and VMX.

The "nested" option takes three possible values. 0 disables nested
virtualization on both SVM and VMX, and 1 enables it on both.
The value 2, which is the default when the module option is not explicitly
set, lets each of SVM and VMX choose its own default: currently, VMX
disables nested virtualization in this case, while SVM leaves it enabled.

When nested VMX becomes more mature, this default should probably be changed
to enable nested virtualization on both architectures.
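
For completeness, this is roughly how svm.c consumes the now-shared variable
(the SVM-side checks are not touched by this patch; the snippet below is a
sketch based on the existing svm.c code, not part of the diff). Both an
explicit nested=1 and the untouched default of 2 are non-zero, so SVM keeps
nested virtualization enabled:

	/* in svm_hardware_setup(): "nested" now comes from x86.c via x86.h */
	if (nested) {
		printk(KERN_INFO "Nested Virtualization enabled\n");
		kvm_enable_efer_bits(EFER_SVME);
	}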

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/arch/x86/kvm/svm.c	2010-06-13 15:01:28.000000000 +0300
+++ .after/arch/x86/kvm/svm.c	2010-06-13 15:01:28.000000000 +0300
@@ -158,9 +158,6 @@ static int npt = 1;
 
 module_param(npt, int, S_IRUGO);
 
-static int nested = 1;
-module_param(nested, int, S_IRUGO);
-
 static void svm_flush_tlb(struct kvm_vcpu *vcpu);
 static void svm_complete_interrupts(struct vcpu_svm *svm);
 
--- .before/arch/x86/kvm/x86.c	2010-06-13 15:01:28.000000000 +0300
+++ .after/arch/x86/kvm/x86.c	2010-06-13 15:01:28.000000000 +0300
@@ -95,6 +95,17 @@ EXPORT_SYMBOL_GPL(kvm_x86_ops);
 int ignore_msrs = 0;
 module_param_named(ignore_msrs, ignore_msrs, bool, S_IRUGO | S_IWUSR);
 
+/* If nested=1, nested virtualization is supported. I.e., the guest may use
+ * VMX or SVM (as appropriate) and be a hypervisor for its own guests.
+ * If nested=0, nested virtualization is not supported.
+ * When nested starts as 2 (which is the default), it is later modified by the
+ * specific module used (VMX or SVM). Currently, nested will be left enabled
+ * on SVM, but reset to 0 on VMX.
+ */
+int nested = 2;
+EXPORT_SYMBOL_GPL(nested);
+module_param(nested, int, S_IRUGO);
+
 #define KVM_NR_SHARED_MSRS 16
 
 struct kvm_shared_msrs_global {
--- .before/arch/x86/kvm/x86.h	2010-06-13 15:01:28.000000000 +0300
+++ .after/arch/x86/kvm/x86.h	2010-06-13 15:01:28.000000000 +0300
@@ -75,4 +75,6 @@ static inline struct kvm_mem_aliases *kv
 void kvm_before_handle_nmi(struct kvm_vcpu *vcpu);
 void kvm_after_handle_nmi(struct kvm_vcpu *vcpu);
 
+extern int nested;
+
 #endif
--- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:28.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:28.000000000 +0300
@@ -4310,6 +4310,12 @@ static int __init vmx_init(void)
 {
 	int r, i;
 
+	/* By default (when nested==2), turn off nested support. This check
+	 * should be removed when nested VMX is considered mature enough.
+	 */
+	if (nested != 1)
+		nested = 0;
+
 	rdmsrl_safe(MSR_EFER, &host_efer);
 
 	for (i = 0; i < NR_VMX_MSR; ++i)


* [PATCH 2/24] Add VMX and SVM to list of supported cpuid features
  2010-06-13 12:22 [PATCH 0/24] Nested VMX, v5 Nadav Har'El
  2010-06-13 12:23 ` [PATCH 1/24] Move nested option from svm.c to x86.c Nadav Har'El
@ 2010-06-13 12:23 ` Nadav Har'El
  2010-06-14  8:13   ` Avi Kivity
  2010-06-13 12:24 ` [PATCH 3/24] Implement VMXON and VMXOFF Nadav Har'El
                   ` (24 subsequent siblings)
  26 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-06-13 12:23 UTC (permalink / raw)
  To: avi; +Cc: kvm

Add the "VMX" CPU feature to the list of CPU featuress KVM advertises with
the KVM_GET_SUPPORTED_CPUID ioctl (unless the "nested" module option is off).

Qemu uses this ioctl, and intersects KVM's list with its own list of desired
cpu features (depending on the -cpu option given to qemu) to determine the
final list of features presented to the guest.
This patch also does the same for SVM: KVM now advertises it supports SVM,
unless the "nested" module option is off.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/arch/x86/kvm/x86.c	2010-06-13 15:01:28.000000000 +0300
+++ .after/arch/x86/kvm/x86.c	2010-06-13 15:01:28.000000000 +0300
@@ -1923,7 +1923,7 @@ static void do_cpuid_ent(struct kvm_cpui
 	/* cpuid 1.ecx */
 	const u32 kvm_supported_word4_x86_features =
 		F(XMM3) | 0 /* Reserved, DTES64, MONITOR */ |
-		0 /* DS-CPL, VMX, SMX, EST */ |
+		0 /* DS-CPL */ | (nested ? F(VMX) : 0) | 0 /* SMX, EST */ |
 		0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
 		0 /* Reserved */ | F(CX16) | 0 /* xTPR Update, PDCM */ |
 		0 /* Reserved, DCA */ | F(XMM4_1) |
@@ -1931,7 +1931,8 @@ static void do_cpuid_ent(struct kvm_cpui
 		0 /* Reserved, XSAVE, OSXSAVE */;
 	/* cpuid 0x80000001.ecx */
 	const u32 kvm_supported_word6_x86_features =
-		F(LAHF_LM) | F(CMP_LEGACY) | F(SVM) | 0 /* ExtApicSpace */ |
+		F(LAHF_LM) | F(CMP_LEGACY) | (nested ? F(SVM) : 0) |
+		0 /* ExtApicSpace */ |
 		F(CR8_LEGACY) | F(ABM) | F(SSE4A) | F(MISALIGNSSE) |
 		F(3DNOWPREFETCH) | 0 /* OSVW */ | 0 /* IBS */ | F(SSE5) |
 		0 /* SKINIT */ | 0 /* WDT */;


* [PATCH 3/24] Implement VMXON and VMXOFF
  2010-06-13 12:22 [PATCH 0/24] Nested VMX, v5 Nadav Har'El
  2010-06-13 12:23 ` [PATCH 1/24] Move nested option from svm.c to x86.c Nadav Har'El
  2010-06-13 12:23 ` [PATCH 2/24] Add VMX and SVM to list of supported cpuid features Nadav Har'El
@ 2010-06-13 12:24 ` Nadav Har'El
  2010-06-14  8:21   ` Avi Kivity
  2010-06-15 20:18   ` Marcelo Tosatti
  2010-06-13 12:24 ` [PATCH 4/24] Allow setting the VMXE bit in CR4 Nadav Har'El
                   ` (23 subsequent siblings)
  26 siblings, 2 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-06-13 12:24 UTC (permalink / raw)
  To: avi; +Cc: kvm

This patch allows a guest to use the VMXON and VMXOFF instructions, and
emulates them accordingly. Basically this amounts to checking some
prerequisites, and then remembering whether the guest has enabled or disabled
VMX operation.
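
For context, here is a rough sketch (not taken from any guest kernel) of what
a guest hypervisor (L1) executes before the VMXON exit handled below is ever
taken: it sets CR4.VMXE, prepares a 4KB-aligned VMXON region whose first word
holds the revision id from the IA32_VMX_BASIC MSR, and then issues VMXON with
the region's physical address:

	static int l1_enable_vmx(u32 *vmxon_region, u64 vmxon_region_pa)
	{
		unsigned long cr4;
		u32 rev_lo, rev_hi;
		u8 failed;

		asm volatile("mov %%cr4, %0" : "=r"(cr4));
		asm volatile("mov %0, %%cr4" : : "r"(cr4 | (1UL << 13))); /* CR4.VMXE */

		/* revision id = low 31 bits of MSR 0x480 (IA32_VMX_BASIC) */
		asm volatile("rdmsr" : "=a"(rev_lo), "=d"(rev_hi) : "c"(0x480));
		*vmxon_region = rev_lo & 0x7fffffff;

		asm volatile("vmxon %1; setna %0"
			     : "=rm"(failed) : "m"(vmxon_region_pa) : "cc", "memory");
		return failed ? -1 : 0;
	}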

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:28.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:28.000000000 +0300
@@ -117,6 +117,16 @@ struct shared_msr_entry {
 	u64 mask;
 };
 
+/* The nested_vmx structure is part of vcpu_vmx, and holds information we need
+ * for correct emulation of VMX (i.e., nested VMX) on this vcpu. For example,
+ * the current VMCS set by L1, a list of the VMCSs used to run the active
+ * L2 guests on the hardware, and more.
+ */
+struct nested_vmx {
+	/* Has the level1 guest done vmxon? */
+	bool vmxon;
+};
+
 struct vcpu_vmx {
 	struct kvm_vcpu       vcpu;
 	struct list_head      local_vcpus_link;
@@ -168,6 +178,9 @@ struct vcpu_vmx {
 	u32 exit_reason;
 
 	bool rdtscp_enabled;
+
+	/* Support for guest hypervisors (nested VMX) */
+	struct nested_vmx nested;
 };
 
 static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
@@ -3353,6 +3366,93 @@ static int handle_vmx_insn(struct kvm_vc
 	return 1;
 }
 
+/* Emulate the VMXON instruction.
+ * Currently, we just remember that VMX is active, and do not save or even
+ * inspect the argument to VMXON (the so-called "VMXON pointer") because we
+ * do not currently need to store anything in that guest-allocated memory
+ * region. Consequently, VMCLEAR and VMPTRLD also do not verify that their
+ * argument differs from the VMXON pointer (which the spec says they should).
+ */
+static int handle_vmon(struct kvm_vcpu *vcpu)
+{
+	struct kvm_segment cs;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	/* The Intel VMX Instruction Reference lists a bunch of bits that
+	 * are prerequisite to running VMXON, most notably CR4.VMXE must be
+	 * set to 1. Otherwise, we should fail with #UD. We test these now:
+	 */
+	if (!nested) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	if (!(vcpu->arch.cr4 & X86_CR4_VMXE) ||
+	    !(vcpu->arch.cr0 & X86_CR0_PE) ||
+	    (vmx_get_rflags(vcpu) & X86_EFLAGS_VM)) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
+	if (is_long_mode(vcpu) && !cs.l) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	if (vmx_get_cpl(vcpu)) {
+		kvm_inject_gp(vcpu, 0);
+		return 1;
+	}
+
+	vmx->nested.vmxon = 1;
+
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
+/*
+ * Intel's VMX Instruction Reference specifies a common set of prerequisites
+ * for running VMX instructions (except VMXON, whose prerequisites are
+ * slightly different). It also specifies what exception to inject otherwise.
+ */
+static int nested_vmx_check_permission(struct kvm_vcpu *vcpu)
+{
+	struct kvm_segment cs;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (!vmx->nested.vmxon) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 0;
+	}
+
+	vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
+	if ((vmx_get_rflags(vcpu) & X86_EFLAGS_VM) ||
+	    (is_long_mode(vcpu) && !cs.l)) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 0;
+	}
+
+	if (vmx_get_cpl(vcpu)) {
+		kvm_inject_gp(vcpu, 0);
+		return 0;
+	}
+
+	return 1;
+}
+
+/* Emulate the VMXOFF instruction */
+static int handle_vmoff(struct kvm_vcpu *vcpu)
+{
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	to_vmx(vcpu)->nested.vmxon = 0;
+
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
 static int handle_invlpg(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
@@ -3642,8 +3742,8 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
 	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
 	[EXIT_REASON_VMWRITE]                 = handle_vmx_insn,
-	[EXIT_REASON_VMOFF]                   = handle_vmx_insn,
-	[EXIT_REASON_VMON]                    = handle_vmx_insn,
+	[EXIT_REASON_VMOFF]                   = handle_vmoff,
+	[EXIT_REASON_VMON]                    = handle_vmon,
 	[EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,
 	[EXIT_REASON_APIC_ACCESS]             = handle_apic_access,
 	[EXIT_REASON_WBINVD]                  = handle_wbinvd,


* [PATCH 4/24] Allow setting the VMXE bit in CR4
  2010-06-13 12:22 [PATCH 0/24] Nested VMX, v5 Nadav Har'El
                   ` (2 preceding siblings ...)
  2010-06-13 12:24 ` [PATCH 3/24] Implement VMXON and VMXOFF Nadav Har'El
@ 2010-06-13 12:24 ` Nadav Har'El
  2010-06-15 11:09   ` Gleb Natapov
  2010-06-13 12:25 ` [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1 Nadav Har'El
                   ` (22 subsequent siblings)
  26 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-06-13 12:24 UTC (permalink / raw)
  To: avi; +Cc: kvm

This patch allows the guest to enable the VMXE bit in CR4, which is a
prerequisite to running VMXON.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/arch/x86/kvm/x86.c	2010-06-13 15:01:28.000000000 +0300
+++ .after/arch/x86/kvm/x86.c	2010-06-13 15:01:28.000000000 +0300
@@ -501,7 +501,7 @@ int __kvm_set_cr4(struct kvm_vcpu *vcpu,
 		   && !load_pdptrs(vcpu, vcpu->arch.cr3))
 		return 1;
 
-	if (cr4 & X86_CR4_VMXE)
+	if (cr4 & X86_CR4_VMXE && !nested)
 		return 1;
 
 	kvm_x86_ops->set_cr4(vcpu, cr4);

^ permalink raw reply	[flat|nested] 147+ messages in thread

* [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1
  2010-06-13 12:22 [PATCH 0/24] Nested VMX, v5 Nadav Har'El
                   ` (3 preceding siblings ...)
  2010-06-13 12:24 ` [PATCH 4/24] Allow setting the VMXE bit in CR4 Nadav Har'El
@ 2010-06-13 12:25 ` Nadav Har'El
  2010-06-14  8:33   ` Avi Kivity
  2010-06-13 12:25 ` [PATCH 6/24] Implement reading and writing of VMX MSRs Nadav Har'El
                   ` (21 subsequent siblings)
  26 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-06-13 12:25 UTC (permalink / raw)
  To: avi; +Cc: kvm

An implementation of VMX needs to define a VMCS structure. This structure
is kept in guest memory, but is opaque to the guest (which can only read or
write it with VMX instructions).

This patch starts to define the VMCS structure which our nested VMX
implementation will present to L1. We call it "vmcs12", as it is the VMCS
that L1 keeps for its L2 guests.

This patch also adds the notion (as required by the VMX spec) of the "current
VMCS", and finally includes utility functions for mapping the guest-allocated
VMCSs in host memory.
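
The intended access pattern for these helpers is map, touch the vmcs12, unmap,
all without sleeping (kmap_atomic() is used). A sketch of how later patches in
this series use them (the function name below is illustrative, not part of
this patch):

	static int touch_current_vmcs12(struct kvm_vcpu *vcpu)
	{
		if (!nested_map_current(vcpu))
			return 0;
		/* the current vmcs12 is now addressable via current_l2_page */
		to_vmx(vcpu)->nested.current_l2_page->abort = 0;
		nested_unmap_current(vcpu);
		return 1;
	}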

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:28.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:28.000000000 +0300
@@ -117,6 +117,29 @@ struct shared_msr_entry {
 	u64 mask;
 };
 
+#define VMCS12_REVISION 0x11e57ed0
+
+/*
+ * struct vmcs12 describes the state that our guest hypervisor (L1) keeps for a
+ * single nested guest (L2), hence the name vmcs12. Any VMX implementation has
+ * a VMCS structure (which is opaque to the guest), and vmcs12 is our emulated
+ * VMX's VMCS. This structure is stored in guest memory specified by VMPTRLD,
+ * and accessed by the guest using VMREAD/VMWRITE/VMCLEAR instructions. More
+ * than one of these structures may exist, if L1 runs multiple L2 guests.
+ * nested_vmx_run() will use the data here to build a VMCS for the underlying
+ * hardware which will be used to run L2.
+ * This structure is packed in order to preserve the binary content after live
+ * migration. If there are changes in the content or layout, VMCS12_REVISION
+ * must be changed.
+ */
+struct __attribute__ ((__packed__)) vmcs12 {
+	/* According to the Intel spec, a VMCS region must start with the
+	 * following two fields. Then follow implementation-specific data.
+	 */
+	u32 revision_id;
+	u32 abort;
+};
+
 /* The nested_vmx structure is part of vcpu_vmx, and holds information we need
  * for correct emulation of VMX (i.e., nested VMX) on this vcpu. For example,
  * the current VMCS set by L1, a list of the VMCSs used to run the active
@@ -125,6 +148,11 @@ struct shared_msr_entry {
 struct nested_vmx {
 	/* Has the level1 guest done vmxon? */
 	bool vmxon;
+
+	/* The guest-physical address of the current VMCS L1 keeps for L2 */
+	gpa_t current_vmptr;
+	/* The host-usable pointer to the above. Set by nested_map_current() */
+	struct vmcs12 *current_l2_page;
 };
 
 struct vcpu_vmx {
@@ -188,6 +216,61 @@ static inline struct vcpu_vmx *to_vmx(st
 	return container_of(vcpu, struct vcpu_vmx, vcpu);
 }
 
+static struct page *nested_get_page(struct kvm_vcpu *vcpu, u64 vmcs_addr)
+{
+	struct page *vmcs_page =
+		gfn_to_page(vcpu->kvm, vmcs_addr >> PAGE_SHIFT);
+
+	if (is_error_page(vmcs_page)) {
+		printk(KERN_ERR "%s error allocating page 0x%llx\n",
+		       __func__, vmcs_addr);
+		kvm_release_page_clean(vmcs_page);
+		return NULL;
+	}
+	return vmcs_page;
+}
+
+static int nested_map_current(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct page *vmcs_page =
+		nested_get_page(vcpu, vmx->nested.current_vmptr);
+
+	if (vmcs_page == NULL) {
+		printk(KERN_INFO "%s: failure in nested_get_page\n", __func__);
+		return 0;
+	}
+
+	if (vmx->nested.current_l2_page) {
+		printk(KERN_INFO "Shadow vmcs already mapped\n");
+		BUG_ON(1);
+		return 0;
+	}
+
+	vmx->nested.current_l2_page = kmap_atomic(vmcs_page, KM_USER0);
+	return 1;
+}
+
+static void nested_unmap_current(struct kvm_vcpu *vcpu)
+{
+	struct page *page;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (!vmx->nested.current_l2_page) {
+		printk(KERN_INFO "Shadow vmcs already unmapped\n");
+		BUG_ON(1);
+		return;
+	}
+
+	page = kmap_atomic_to_page(vmx->nested.current_l2_page);
+
+	kunmap_atomic(vmx->nested.current_l2_page, KM_USER0);
+
+	kvm_release_page_dirty(page);
+
+	vmx->nested.current_l2_page = NULL;
+}
+
 static int init_rmode(struct kvm *kvm);
 static u64 construct_eptp(unsigned long root_hpa);
 static void kvm_cpu_vmxon(u64 addr);
@@ -4186,6 +4269,9 @@ static struct kvm_vcpu *vmx_create_vcpu(
 			goto free_vmcs;
 	}
 
+	vmx->nested.current_vmptr = -1ull;
+	vmx->nested.current_l2_page = NULL;
+
 	return &vmx->vcpu;
 
 free_vmcs:


* [PATCH 6/24] Implement reading and writing of VMX MSRs
  2010-06-13 12:22 [PATCH 0/24] Nested VMX, v5 Nadav Har'El
                   ` (4 preceding siblings ...)
  2010-06-13 12:25 ` [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1 Nadav Har'El
@ 2010-06-13 12:25 ` Nadav Har'El
  2010-06-14  8:42   ` Avi Kivity
  2010-06-13 12:26 ` [PATCH 7/24] Understanding guest pointers to vmcs12 structures Nadav Har'El
                   ` (20 subsequent siblings)
  26 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-06-13 12:25 UTC (permalink / raw)
  To: avi; +Cc: kvm

When the guest can use VMX instructions (i.e., when the "nested" module option
is on), it should also be able to read and write VMX MSRs, e.g., to query the
VMX capabilities. This patch adds this support.
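
To show why the *_CTLS values are composed as low|high below, here is a sketch
(not from this patch) of how L1 is expected to consume such an MSR: the low 32
bits are the controls that must be 1, and the high 32 bits are the controls
that may be 1:

	static u32 l1_choose_pinbased_ctls(u32 wanted)
	{
		u32 must_be_one, may_be_one;

		/* 0x481 is MSR_IA32_VMX_PINBASED_CTLS */
		asm volatile("rdmsr"
			     : "=a"(must_be_one), "=d"(may_be_one) : "c"(0x481));

		wanted &= may_be_one;	/* drop controls the CPU cannot enable */
		return wanted | must_be_one; /* force controls it cannot disable */
	}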

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/arch/x86/kvm/x86.c	2010-06-13 15:01:28.000000000 +0300
+++ .after/arch/x86/kvm/x86.c	2010-06-13 15:01:28.000000000 +0300
@@ -702,7 +702,11 @@ static u32 msrs_to_save[] = {
 #ifdef CONFIG_X86_64
 	MSR_CSTAR, MSR_KERNEL_GS_BASE, MSR_SYSCALL_MASK, MSR_LSTAR,
 #endif
-	MSR_IA32_TSC, MSR_IA32_PERF_STATUS, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA
+	MSR_IA32_TSC, MSR_IA32_PERF_STATUS, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
+	MSR_IA32_FEATURE_CONTROL,  MSR_IA32_VMX_BASIC,
+	MSR_IA32_VMX_PINBASED_CTLS, MSR_IA32_VMX_PROCBASED_CTLS,
+	MSR_IA32_VMX_EXIT_CTLS, MSR_IA32_VMX_ENTRY_CTLS,
+	MSR_IA32_VMX_PROCBASED_CTLS2, MSR_IA32_VMX_EPT_VPID_CAP,
 };
 
 static unsigned num_msrs_to_save;
--- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:28.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:28.000000000 +0300
@@ -1231,6 +1231,98 @@ static void guest_write_tsc(u64 guest_ts
 }
 
 /*
+ * If we allow our guest to use VMX instructions, we should also let it use
+ * VMX-specific MSRs.
+ */
+static int nested_vmx_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
+{
+	u64 vmx_msr = 0;
+	u32 vmx_msr_high, vmx_msr_low;
+
+	switch (msr_index) {
+	case MSR_IA32_FEATURE_CONTROL:
+		*pdata = 0;
+		break;
+	case MSR_IA32_VMX_BASIC:
+		/*
+		 * This MSR reports some information about VMX support of the
+		 * processor. We should return information about the VMX we
+		 * emulate for the guest, and the VMCS structure we give it -
+		 * not about the VMX support of the underlying hardware. Some
+		 * not about the VMX support of the underlying hardware.
+		 * used directly by our emulation (e.g., the physical address
+		 * width), so these are copied from what the hardware reports.
+		 */
+		*pdata = VMCS12_REVISION |
+			(((u64)sizeof(struct vmcs12)) << 32);
+		rdmsrl(MSR_IA32_VMX_BASIC, vmx_msr);
+#define VMX_BASIC_64		0x0001000000000000LLU
+#define VMX_BASIC_MEM_TYPE	0x003c000000000000LLU
+#define VMX_BASIC_INOUT		0x0040000000000000LLU
+		*pdata |= vmx_msr &
+			(VMX_BASIC_64 | VMX_BASIC_MEM_TYPE | VMX_BASIC_INOUT);
+		break;
+#define CORE2_PINBASED_CTLS_MUST_BE_ONE  0x00000016
+#define MSR_IA32_VMX_TRUE_PINBASED_CTLS  0x48d
+	case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
+	case MSR_IA32_VMX_PINBASED_CTLS:
+		vmx_msr_low  = CORE2_PINBASED_CTLS_MUST_BE_ONE;
+		vmx_msr_high = CORE2_PINBASED_CTLS_MUST_BE_ONE |
+				PIN_BASED_EXT_INTR_MASK |
+				PIN_BASED_NMI_EXITING |
+				PIN_BASED_VIRTUAL_NMIS;
+		*pdata = vmx_msr_low | ((u64)vmx_msr_high << 32);
+		break;
+	case MSR_IA32_VMX_PROCBASED_CTLS:
+		/* This MSR determines which vm-execution controls the L1
+		 * hypervisor may ask, or may not ask, to enable. Normally we
+		 * can only allow enabling features which the hardware can
+		 * support, but we limit ourselves to allowing only known
+		 * features that were tested nested. We allow disabling any
+		 * feature (even if the hardware can't disable it).
+		 */
+		rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, vmx_msr_low, vmx_msr_high);
+
+		vmx_msr_low = 0; /* allow disabling any feature */
+		vmx_msr_high &= /* do not expose new untested features */
+			CPU_BASED_HLT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
+			CPU_BASED_CR3_STORE_EXITING | CPU_BASED_USE_IO_BITMAPS |
+			CPU_BASED_MOV_DR_EXITING | CPU_BASED_USE_TSC_OFFSETING |
+			CPU_BASED_MWAIT_EXITING | CPU_BASED_MONITOR_EXITING |
+			CPU_BASED_INVLPG_EXITING | CPU_BASED_TPR_SHADOW |
+			CPU_BASED_USE_MSR_BITMAPS |
+#ifdef CONFIG_X86_64
+			CPU_BASED_CR8_LOAD_EXITING |
+			CPU_BASED_CR8_STORE_EXITING |
+#endif
+			CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
+		*pdata = vmx_msr_low | ((u64)vmx_msr_high << 32);
+		break;
+	case MSR_IA32_VMX_EXIT_CTLS:
+		*pdata = 0;
+#ifdef CONFIG_X86_64
+		*pdata |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
+#endif
+		break;
+	case MSR_IA32_VMX_ENTRY_CTLS:
+		*pdata = 0;
+		break;
+	case MSR_IA32_VMX_PROCBASED_CTLS2:
+		*pdata = 0;
+		if (vm_need_virtualize_apic_accesses(vcpu->kvm))
+			*pdata |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+		break;
+	case MSR_IA32_VMX_EPT_VPID_CAP:
+		*pdata = 0;
+		break;
+	default:
+		return 1;
+	}
+
+	return 0;
+}
+
+/*
  * Reads an msr value (of 'msr_index') into 'pdata'.
  * Returns 0 on success, non-0 otherwise.
  * Assumes vcpu_load() was already called.
@@ -1278,6 +1370,8 @@ static int vmx_get_msr(struct kvm_vcpu *
 		/* Otherwise falls through */
 	default:
 		vmx_load_host_state(to_vmx(vcpu));
+		if (nested && !nested_vmx_get_msr(vcpu, msr_index, &data))
+			break;
 		msr = find_msr_entry(to_vmx(vcpu), msr_index);
 		if (msr) {
 			vmx_load_host_state(to_vmx(vcpu));
@@ -1292,6 +1386,27 @@ static int vmx_get_msr(struct kvm_vcpu *
 }
 
 /*
+ * Writes msr value for nested virtualization
+ * Returns 0 on success, non-0 otherwise.
+ */
+static int nested_vmx_set_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
+{
+	switch (msr_index) {
+	case MSR_IA32_FEATURE_CONTROL:
+		if ((data & (FEATURE_CONTROL_LOCKED |
+			     FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX))
+		    != (FEATURE_CONTROL_LOCKED |
+			FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX))
+			return 1;
+		break;
+	default:
+		return 1;
+	}
+
+	return 0;
+}
+
+/*
  * Writes msr value into into the appropriate "register".
  * Returns 0 on success, non-0 otherwise.
  * Assumes vcpu_load() was already called.
@@ -1349,6 +1464,9 @@ static int vmx_set_msr(struct kvm_vcpu *
 			return 1;
 		/* Otherwise falls through */
 	default:
+		if (nested &&
+		    !nested_vmx_set_msr(vcpu, msr_index, data))
+			break;
 		msr = find_msr_entry(vmx, msr_index);
 		if (msr) {
 			vmx_load_host_state(vmx);


* [PATCH 7/24] Understanding guest pointers to vmcs12 structures
  2010-06-13 12:22 [PATCH 0/24] Nested VMX, v5 Nadav Har'El
                   ` (5 preceding siblings ...)
  2010-06-13 12:25 ` [PATCH 6/24] Implement reading and writing of VMX MSRs Nadav Har'El
@ 2010-06-13 12:26 ` Nadav Har'El
  2010-06-14  8:48   ` Avi Kivity
  2010-06-15 12:14   ` Gleb Natapov
  2010-06-13 12:26 ` [PATCH 8/24] Hold a vmcs02 for each vmcs12 Nadav Har'El
                   ` (19 subsequent siblings)
  26 siblings, 2 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-06-13 12:26 UTC (permalink / raw)
  To: avi; +Cc: kvm

This patch includes a couple of utility functions for extracting pointer
operands of VMX instructions issued by L1 (a guest hypervisor), and
translating guest-given vmcs12 virtual addresses to guest-physical addresses.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/arch/x86/kvm/x86.c	2010-06-13 15:01:29.000000000 +0300
+++ .after/arch/x86/kvm/x86.c	2010-06-13 15:01:29.000000000 +0300
@@ -3286,13 +3286,14 @@ static int kvm_fetch_guest_virt(gva_t ad
 					  access | PFERR_FETCH_MASK, error);
 }
 
-static int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
+int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
 			       struct kvm_vcpu *vcpu, u32 *error)
 {
 	u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
 	return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, access,
 					  error);
 }
+EXPORT_SYMBOL_GPL(kvm_read_guest_virt);
 
 static int kvm_read_guest_virt_system(gva_t addr, void *val, unsigned int bytes,
 			       struct kvm_vcpu *vcpu, u32 *error)
--- .before/arch/x86/kvm/x86.h	2010-06-13 15:01:29.000000000 +0300
+++ .after/arch/x86/kvm/x86.h	2010-06-13 15:01:29.000000000 +0300
@@ -75,6 +75,9 @@ static inline struct kvm_mem_aliases *kv
 void kvm_before_handle_nmi(struct kvm_vcpu *vcpu);
 void kvm_after_handle_nmi(struct kvm_vcpu *vcpu);
 
+int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
+			struct kvm_vcpu *vcpu, u32 *error);
+
 extern int nested;
 
 #endif
--- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
@@ -3654,6 +3654,86 @@ static int handle_vmoff(struct kvm_vcpu 
 	return 1;
 }
 
+/*
+ * Decode the memory-address operand of a vmx instruction, according to the
+ * Intel spec.
+ */
+#define VMX_OPERAND_SCALING(vii)	((vii) & 3)
+#define VMX_OPERAND_ADDR_SIZE(vii)	(((vii) >> 7) & 7)
+#define VMX_OPERAND_IS_REG(vii)		((vii) & (1u << 10))
+#define VMX_OPERAND_SEG_REG(vii)	(((vii) >> 15) & 7)
+#define VMX_OPERAND_INDEX_REG(vii)	(((vii) >> 18) & 0xf)
+#define VMX_OPERAND_INDEX_INVALID(vii)	((vii) & (1u << 22))
+#define VMX_OPERAND_BASE_REG(vii)	(((vii) >> 23) & 0xf)
+#define VMX_OPERAND_BASE_INVALID(vii)	((vii) & (1u << 27))
+#define VMX_OPERAND_REG(vii)		(((vii) >> 3) & 0xf)
+#define VMX_OPERAND_REG2(vii)		(((vii) >> 28) & 0xf)
+static gva_t get_vmx_mem_address(struct kvm_vcpu *vcpu,
+				 unsigned long exit_qualification,
+				 u32 vmx_instruction_info)
+{
+	int  scaling = VMX_OPERAND_SCALING(vmx_instruction_info);
+	int  addr_size = VMX_OPERAND_ADDR_SIZE(vmx_instruction_info);
+	bool is_reg = VMX_OPERAND_IS_REG(vmx_instruction_info);
+	int  seg_reg = VMX_OPERAND_SEG_REG(vmx_instruction_info);
+	int  index_reg = VMX_OPERAND_INDEX_REG(vmx_instruction_info);
+	bool index_is_valid = !VMX_OPERAND_INDEX_INVALID(vmx_instruction_info);
+	int  base_reg       = VMX_OPERAND_BASE_REG(vmx_instruction_info);
+	bool base_is_valid  = !VMX_OPERAND_BASE_INVALID(vmx_instruction_info);
+	gva_t addr;
+
+	if (is_reg) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 0;
+	}
+
+	switch (addr_size) {
+	case 1: /* 32 bit. high bits are undefined according to the spec: */
+		exit_qualification &= 0xffffffff;
+		break;
+	case 2: /* 64 bit */
+		break;
+	default: /* addr_size=0 means 16 bit */
+		return 0;
+	}
+
+	/* Addr = segment_base + offset */
+	/* offset = Base + [Index * Scale] + Displacement */
+	addr = vmx_get_segment_base(vcpu, seg_reg);
+	if (base_is_valid)
+		addr += kvm_register_read(vcpu, base_reg);
+	if (index_is_valid)
+		addr += kvm_register_read(vcpu, index_reg)<<scaling;
+	addr += exit_qualification; /* holds the displacement */
+
+	return addr;
+}
+
+static int read_guest_vmcs_gpa(struct kvm_vcpu *vcpu, gpa_t *gpap)
+{
+	int r;
+	gva_t gva = get_vmx_mem_address(vcpu,
+		vmcs_readl(EXIT_QUALIFICATION),
+		vmcs_read32(VMX_INSTRUCTION_INFO));
+	if (gva == 0)
+		return 1;
+	*gpap = 0;
+	r = kvm_read_guest_virt(gva, gpap, sizeof(*gpap), vcpu, NULL);
+	if (r) {
+		printk(KERN_ERR "%s cannot read guest vmcs addr %lx : %d\n",
+		       __func__, gva, r);
+		return r;
+	}
+	/* According to the spec, VMCS addresses must be 4K aligned */
+	if (!IS_ALIGNED(*gpap, PAGE_SIZE)) {
+		printk(KERN_DEBUG "%s addr %llx not aligned\n",
+		       __func__, *gpap);
+		return 1;
+	}
+
+	return 0;
+}
+
 static int handle_invlpg(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);


* [PATCH 8/24] Hold a vmcs02 for each vmcs12
  2010-06-13 12:22 [PATCH 0/24] Nested VMX, v5 Nadav Har'El
                   ` (6 preceding siblings ...)
  2010-06-13 12:26 ` [PATCH 7/24] Understanding guest pointers to vmcs12 structures Nadav Har'El
@ 2010-06-13 12:26 ` Nadav Har'El
  2010-06-14  8:57   ` Avi Kivity
  2010-07-06  9:50   ` Dong, Eddie
  2010-06-13 12:27 ` [PATCH 9/24] Implement VMCLEAR Nadav Har'El
                   ` (18 subsequent siblings)
  26 siblings, 2 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-06-13 12:26 UTC (permalink / raw)
  To: avi; +Cc: kvm

In this patch we add a list of L0 (hardware) VMCSs, which we'll use to hold a 
hardware VMCS for each active L1 VMCS (i.e., for each L2 guest).

We call each of these L0 VMCSs a "vmcs02", as it is the VMCS that L0 uses
to run its nested guest L2.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
@@ -140,6 +140,12 @@ struct __attribute__ ((__packed__)) vmcs
 	u32 abort;
 };
 
+struct vmcs_list {
+	struct list_head list;
+	gpa_t vmcs_addr;
+	struct vmcs *l2_vmcs;
+};
+
 /* The nested_vmx structure is part of vcpu_vmx, and holds information we need
  * for correct emulation of VMX (i.e., nested VMX) on this vcpu. For example,
  * the current VMCS set by L1, a list of the VMCSs used to run the active
@@ -153,6 +159,10 @@ struct nested_vmx {
 	gpa_t current_vmptr;
 	/* The host-usable pointer to the above. Set by nested_map_current() */
 	struct vmcs12 *current_l2_page;
+
+	/* list of real (hardware) VMCS, one for each L2 guest of L1 */
+	struct list_head l2_vmcs_list; /* a vmcs_list */
+	int l2_vmcs_num;
 };
 
 struct vcpu_vmx {
@@ -1754,6 +1764,84 @@ static void free_vmcs(struct vmcs *vmcs)
 	free_pages((unsigned long)vmcs, vmcs_config.order);
 }
 
+static struct vmcs *nested_get_current_vmcs(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vmcs_list *list_item, *n;
+
+	list_for_each_entry_safe(list_item, n, &vmx->nested.l2_vmcs_list, list)
+		if (list_item->vmcs_addr == vmx->nested.current_vmptr)
+			return list_item->l2_vmcs;
+
+	return NULL;
+}
+
+/* Allocate an L0 VMCS (vmcs02) for the current L1 VMCS (vmcs12), if one
+ * does not already exist. The allocation is done in L0 memory, so to avoid
+ * denial-of-service attacks by guests, we limit the number of concurrently-
+ * allocated VMCSs. A well-behaved L1 will VMCLEAR unused vmcs12s and not
+ * trigger this limit.
+ */
+static const int NESTED_MAX_VMCS = 256;
+static int nested_create_current_vmcs(struct kvm_vcpu *vcpu)
+{
+	struct vmcs_list *new_l2_guest;
+	struct vmcs *l2_vmcs;
+
+	if (nested_get_current_vmcs(vcpu))
+		return 0; /* nothing to do - we already have a VMCS */
+
+	if (to_vmx(vcpu)->nested.l2_vmcs_num >= NESTED_MAX_VMCS)
+		return -ENOMEM;
+
+	new_l2_guest = (struct vmcs_list *)
+		kmalloc(sizeof(struct vmcs_list), GFP_KERNEL);
+	if (!new_l2_guest)
+		return -ENOMEM;
+
+	l2_vmcs = alloc_vmcs();
+	if (!l2_vmcs) {
+		kfree(new_l2_guest);
+		return -ENOMEM;
+	}
+
+	new_l2_guest->vmcs_addr = to_vmx(vcpu)->nested.current_vmptr;
+	new_l2_guest->l2_vmcs = l2_vmcs;
+	list_add(&(new_l2_guest->list), &(to_vmx(vcpu)->nested.l2_vmcs_list));
+	to_vmx(vcpu)->nested.l2_vmcs_num++;
+	return 0;
+}
+
+/* Free the current L2 VMCS, and remove it from l2_vmcs_list */
+static void nested_free_current_vmcs(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vmcs_list *list_item, *n;
+
+	list_for_each_entry_safe(list_item, n, &vmx->nested.l2_vmcs_list, list)
+		if (list_item->vmcs_addr == vmx->nested.current_vmptr) {
+			free_vmcs(list_item->l2_vmcs);
+			list_del(&(list_item->list));
+			kfree(list_item);
+			vmx->nested.l2_vmcs_num--;
+			return;
+		}
+}
+
+static void free_l1_state(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vmcs_list *list_item, *n;
+
+	list_for_each_entry_safe(list_item, n,
+			&vmx->nested.l2_vmcs_list, list) {
+		free_vmcs(list_item->l2_vmcs);
+		list_del(&(list_item->list));
+		kfree(list_item);
+	}
+	vmx->nested.l2_vmcs_num = 0;
+}
+
 static void free_kvm_area(void)
 {
 	int cpu;
@@ -3606,6 +3694,9 @@ static int handle_vmon(struct kvm_vcpu *
 		return 1;
 	}
 
+	INIT_LIST_HEAD(&(vmx->nested.l2_vmcs_list));
+	vmx->nested.l2_vmcs_num = 0;
+
 	vmx->nested.vmxon = 1;
 
 	skip_emulated_instruction(vcpu);
@@ -3650,6 +3741,8 @@ static int handle_vmoff(struct kvm_vcpu 
 
 	to_vmx(vcpu)->nested.vmxon = 0;
 
+	free_l1_state(vcpu);
+
 	skip_emulated_instruction(vcpu);
 	return 1;
 }
@@ -4402,6 +4495,8 @@ static void vmx_free_vcpu(struct kvm_vcp
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
 	free_vpid(vmx);
+	if (vmx->nested.vmxon)
+		free_l1_state(vcpu);
 	vmx_free_vmcs(vcpu);
 	kfree(vmx->guest_msrs);
 	kvm_vcpu_uninit(vcpu);


* [PATCH 9/24] Implement VMCLEAR
  2010-06-13 12:22 [PATCH 0/24] Nested VMX, v5 Nadav Har'El
                   ` (7 preceding siblings ...)
  2010-06-13 12:26 ` [PATCH 8/24] Hold a vmcs02 for each vmcs12 Nadav Har'El
@ 2010-06-13 12:27 ` Nadav Har'El
  2010-06-14  9:03   ` Avi Kivity
                     ` (2 more replies)
  2010-06-13 12:27 ` [PATCH 10/24] Implement VMPTRLD Nadav Har'El
                   ` (17 subsequent siblings)
  26 siblings, 3 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-06-13 12:27 UTC (permalink / raw)
  To: avi; +Cc: kvm

This patch implements the VMCLEAR instruction.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
@@ -138,6 +138,8 @@ struct __attribute__ ((__packed__)) vmcs
 	 */
 	u32 revision_id;
 	u32 abort;
+
+	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
 };
 
 struct vmcs_list {
@@ -3827,6 +3829,46 @@ static int read_guest_vmcs_gpa(struct kv
 	return 0;
 }
 
+static void clear_rflags_cf_zf(struct kvm_vcpu *vcpu)
+{
+	unsigned long rflags;
+	rflags = vmx_get_rflags(vcpu);
+	rflags &= ~(X86_EFLAGS_CF | X86_EFLAGS_ZF);
+	vmx_set_rflags(vcpu, rflags);
+}
+
+/* Emulate the VMCLEAR instruction */
+static int handle_vmclear(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	gpa_t guest_vmcs_addr, save_current_vmptr;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (read_guest_vmcs_gpa(vcpu, &guest_vmcs_addr))
+		return 1;
+
+	save_current_vmptr = vmx->nested.current_vmptr;
+
+	vmx->nested.current_vmptr = guest_vmcs_addr;
+	if (!nested_map_current(vcpu))
+		return 1;
+	vmx->nested.current_l2_page->launch_state = 0;
+	nested_unmap_current(vcpu);
+
+	nested_free_current_vmcs(vcpu);
+
+	if (save_current_vmptr == guest_vmcs_addr)
+		vmx->nested.current_vmptr = -1ull;
+	else
+		vmx->nested.current_vmptr = save_current_vmptr;
+
+	skip_emulated_instruction(vcpu);
+	clear_rflags_cf_zf(vcpu);
+	return 1;
+}
+
 static int handle_invlpg(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
@@ -4109,7 +4151,7 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_HLT]                     = handle_halt,
 	[EXIT_REASON_INVLPG]		      = handle_invlpg,
 	[EXIT_REASON_VMCALL]                  = handle_vmcall,
-	[EXIT_REASON_VMCLEAR]	              = handle_vmx_insn,
+	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
 	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
 	[EXIT_REASON_VMPTRLD]                 = handle_vmx_insn,
 	[EXIT_REASON_VMPTRST]                 = handle_vmx_insn,


* [PATCH 10/24] Implement VMPTRLD
  2010-06-13 12:22 [PATCH 0/24] Nested VMX, v5 Nadav Har'El
                   ` (8 preceding siblings ...)
  2010-06-13 12:27 ` [PATCH 9/24] Implement VMCLEAR Nadav Har'El
@ 2010-06-13 12:27 ` Nadav Har'El
  2010-06-14  9:07   ` Avi Kivity
                     ` (2 more replies)
  2010-06-13 12:28 ` [PATCH 11/24] Implement VMPTRST Nadav Har'El
                   ` (16 subsequent siblings)
  26 siblings, 3 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-06-13 12:27 UTC (permalink / raw)
  To: avi; +Cc: kvm

This patch implements the VMPTRLD instruction.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
@@ -3829,6 +3829,26 @@ static int read_guest_vmcs_gpa(struct kv
 	return 0;
 }
 
+static void set_rflags_to_vmx_fail_invalid(struct kvm_vcpu *vcpu)
+{
+	unsigned long rflags;
+	rflags = vmx_get_rflags(vcpu);
+	rflags |= X86_EFLAGS_CF;
+	rflags &= ~X86_EFLAGS_PF & ~X86_EFLAGS_AF & ~X86_EFLAGS_ZF &
+		~X86_EFLAGS_SF & ~X86_EFLAGS_OF;
+	vmx_set_rflags(vcpu, rflags);
+}
+
+static void set_rflags_to_vmx_fail_valid(struct kvm_vcpu *vcpu)
+{
+	unsigned long rflags;
+	rflags = vmx_get_rflags(vcpu);
+	rflags |= X86_EFLAGS_ZF;
+	rflags &= ~X86_EFLAGS_PF & ~X86_EFLAGS_AF & ~X86_EFLAGS_CF &
+		~X86_EFLAGS_SF & ~X86_EFLAGS_OF;
+	vmx_set_rflags(vcpu, rflags);
+}
+
 static void clear_rflags_cf_zf(struct kvm_vcpu *vcpu)
 {
 	unsigned long rflags;
@@ -3869,6 +3889,57 @@ static int handle_vmclear(struct kvm_vcp
 	return 1;
 }
 
+static bool verify_vmcs12_revision(struct kvm_vcpu *vcpu, gpa_t guest_vmcs_addr)
+{
+	bool ret;
+	struct vmcs12 *vmcs12;
+	struct page *vmcs_page = nested_get_page(vcpu, guest_vmcs_addr);
+	if (vmcs_page == NULL)
+		return 0;
+	vmcs12 = (struct vmcs12 *)kmap_atomic(vmcs_page, KM_USER0);
+	if (vmcs12->revision_id == VMCS12_REVISION)
+		ret = 1;
+	else {
+		set_rflags_to_vmx_fail_valid(vcpu);
+		ret = 0;
+	}
+	kunmap_atomic(vmcs12, KM_USER0);
+	kvm_release_page_dirty(vmcs_page);
+	return ret;
+}
+
+/* Emulate the VMPTRLD instruction */
+static int handle_vmptrld(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	gpa_t guest_vmcs_addr;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (read_guest_vmcs_gpa(vcpu, &guest_vmcs_addr)) {
+		set_rflags_to_vmx_fail_invalid(vcpu);
+		return 1;
+	}
+
+	if (!verify_vmcs12_revision(vcpu, guest_vmcs_addr))
+		return 1;
+
+	if (vmx->nested.current_vmptr != guest_vmcs_addr) {
+		vmx->nested.current_vmptr = guest_vmcs_addr;
+
+		if (nested_create_current_vmcs(vcpu)) {
+			printk(KERN_ERR "%s error could not allocate memory",
+				__func__);
+			return -ENOMEM;
+		}
+	}
+
+	clear_rflags_cf_zf(vcpu);
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
 static int handle_invlpg(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
@@ -4153,7 +4224,7 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_VMCALL]                  = handle_vmcall,
 	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
 	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
-	[EXIT_REASON_VMPTRLD]                 = handle_vmx_insn,
+	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
 	[EXIT_REASON_VMPTRST]                 = handle_vmx_insn,
 	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
 	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,


* [PATCH 11/24] Implement VMPTRST
  2010-06-13 12:22 [PATCH 0/24] Nested VMX, v5 Nadav Har'El
                   ` (9 preceding siblings ...)
  2010-06-13 12:27 ` [PATCH 10/24] Implement VMPTRLD Nadav Har'El
@ 2010-06-13 12:28 ` Nadav Har'El
  2010-06-14  9:15   ` Avi Kivity
  2010-06-13 12:28 ` [PATCH 12/24] Add VMCS fields to the vmcs12 Nadav Har'El
                   ` (15 subsequent siblings)
  26 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-06-13 12:28 UTC (permalink / raw)
  To: avi; +Cc: kvm

This patch implements the VMPTRST instruction. 

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/arch/x86/kvm/x86.c	2010-06-13 15:01:29.000000000 +0300
+++ .after/arch/x86/kvm/x86.c	2010-06-13 15:01:29.000000000 +0300
@@ -3301,7 +3301,7 @@ static int kvm_read_guest_virt_system(gv
 	return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, 0, error);
 }
 
-static int kvm_write_guest_virt_system(gva_t addr, void *val,
+int kvm_write_guest_virt_system(gva_t addr, void *val,
 				       unsigned int bytes,
 				       struct kvm_vcpu *vcpu,
 				       u32 *error)
@@ -3333,6 +3333,7 @@ static int kvm_write_guest_virt_system(g
 out:
 	return r;
 }
+EXPORT_SYMBOL_GPL(kvm_write_guest_virt_system);
 
 static int emulator_read_emulated(unsigned long addr,
 				  void *val,
--- .before/arch/x86/kvm/x86.h	2010-06-13 15:01:29.000000000 +0300
+++ .after/arch/x86/kvm/x86.h	2010-06-13 15:01:29.000000000 +0300
@@ -78,6 +78,9 @@ void kvm_after_handle_nmi(struct kvm_vcp
 int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
 			struct kvm_vcpu *vcpu, u32 *error);
 
+int kvm_write_guest_virt_system(gva_t addr, void *val, unsigned int bytes,
+			 struct kvm_vcpu *vcpu, u32 *error);
+
 extern int nested;
 
 #endif
--- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
@@ -3940,6 +3940,33 @@ static int handle_vmptrld(struct kvm_vcp
 	return 1;
 }
 
+/* Emulate the VMPTRST instruction */
+static int handle_vmptrst(struct kvm_vcpu *vcpu)
+{
+	int r = 0;
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	gva_t vmcs_gva;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	vmcs_gva = get_vmx_mem_address(vcpu, exit_qualification,
+				       vmx_instruction_info);
+	if (vmcs_gva == 0)
+		return 1;
+	r = kvm_write_guest_virt_system(vmcs_gva,
+				 (void *)&to_vmx(vcpu)->nested.current_vmptr,
+				 sizeof(u64), vcpu, NULL);
+	if (r) {
+		printk(KERN_INFO "%s failed to write vmptr\n", __func__);
+		return 1;
+	}
+	clear_rflags_cf_zf(vcpu);
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
 static int handle_invlpg(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
@@ -4225,7 +4252,7 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
 	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
 	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
-	[EXIT_REASON_VMPTRST]                 = handle_vmx_insn,
+	[EXIT_REASON_VMPTRST]                 = handle_vmptrst,
 	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
 	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
 	[EXIT_REASON_VMWRITE]                 = handle_vmx_insn,


* [PATCH 12/24] Add VMCS fields to the vmcs12
  2010-06-13 12:22 [PATCH 0/24] Nested VMX, v5 Nadav Har'El
                   ` (10 preceding siblings ...)
  2010-06-13 12:28 ` [PATCH 11/24] Implement VMPTRST Nadav Har'El
@ 2010-06-13 12:28 ` Nadav Har'El
  2010-06-14  9:24   ` Avi Kivity
  2010-06-16 14:18   ` Gleb Natapov
  2010-06-13 12:29 ` [PATCH 13/24] Implement VMREAD and VMWRITE Nadav Har'El
                   ` (14 subsequent siblings)
  26 siblings, 2 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-06-13 12:28 UTC (permalink / raw)
  To: avi; +Cc: kvm

In this patch we add to vmcs12 (the VMCS that L1 keeps for L2) all the
standard VMCS fields. These fields are encapsulated in a struct shadow_vmcs.

Later patches will enable L1 to read and write these fields using VMREAD/
VMWRITE, and they will be used during a VMLAUNCH/VMRESUME in preparing a real
VMCS for running L2.
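
To preview how the offset table added below is used by the VMREAD/VMWRITE
patch that follows, a field's storage inside the current vmcs12 is located
like this (a sketch; the helper name is illustrative and not part of this
patch):

	static char *shadow_vmcs_field_ptr(struct kvm_vcpu *vcpu,
					   unsigned long field)
	{
		short offset = vmcs_field_to_offset(field);

		if (offset < 0)
			return NULL;
		/* the caller must still honor the field's width (16/32/64/natural) */
		return (char *)get_shadow_vmcs(vcpu) + offset;
	}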

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
@@ -117,6 +117,136 @@ struct shared_msr_entry {
 	u64 mask;
 };
 
+/* shadow_vmcs is a structure used in nested VMX for holding a copy of all
+ * standard VMCS fields. It is used for emulating a VMCS for L1 (see vmcs12),
+ * and also for easier access to VMCS data (see l1_shadow_vmcs).
+ */
+struct __attribute__ ((__packed__)) shadow_vmcs {
+	u16 virtual_processor_id;
+	u16 guest_es_selector;
+	u16 guest_cs_selector;
+	u16 guest_ss_selector;
+	u16 guest_ds_selector;
+	u16 guest_fs_selector;
+	u16 guest_gs_selector;
+	u16 guest_ldtr_selector;
+	u16 guest_tr_selector;
+	u16 host_es_selector;
+	u16 host_cs_selector;
+	u16 host_ss_selector;
+	u16 host_ds_selector;
+	u16 host_fs_selector;
+	u16 host_gs_selector;
+	u16 host_tr_selector;
+	u64 io_bitmap_a;
+	u64 io_bitmap_b;
+	u64 msr_bitmap;
+	u64 vm_exit_msr_store_addr;
+	u64 vm_exit_msr_load_addr;
+	u64 vm_entry_msr_load_addr;
+	u64 tsc_offset;
+	u64 virtual_apic_page_addr;
+	u64 apic_access_addr;
+	u64 ept_pointer;
+	u64 guest_physical_address;
+	u64 vmcs_link_pointer;
+	u64 guest_ia32_debugctl;
+	u64 guest_ia32_pat;
+	u64 guest_pdptr0;
+	u64 guest_pdptr1;
+	u64 guest_pdptr2;
+	u64 guest_pdptr3;
+	u64 host_ia32_pat;
+	u32 pin_based_vm_exec_control;
+	u32 cpu_based_vm_exec_control;
+	u32 exception_bitmap;
+	u32 page_fault_error_code_mask;
+	u32 page_fault_error_code_match;
+	u32 cr3_target_count;
+	u32 vm_exit_controls;
+	u32 vm_exit_msr_store_count;
+	u32 vm_exit_msr_load_count;
+	u32 vm_entry_controls;
+	u32 vm_entry_msr_load_count;
+	u32 vm_entry_intr_info_field;
+	u32 vm_entry_exception_error_code;
+	u32 vm_entry_instruction_len;
+	u32 tpr_threshold;
+	u32 secondary_vm_exec_control;
+	u32 vm_instruction_error;
+	u32 vm_exit_reason;
+	u32 vm_exit_intr_info;
+	u32 vm_exit_intr_error_code;
+	u32 idt_vectoring_info_field;
+	u32 idt_vectoring_error_code;
+	u32 vm_exit_instruction_len;
+	u32 vmx_instruction_info;
+	u32 guest_es_limit;
+	u32 guest_cs_limit;
+	u32 guest_ss_limit;
+	u32 guest_ds_limit;
+	u32 guest_fs_limit;
+	u32 guest_gs_limit;
+	u32 guest_ldtr_limit;
+	u32 guest_tr_limit;
+	u32 guest_gdtr_limit;
+	u32 guest_idtr_limit;
+	u32 guest_es_ar_bytes;
+	u32 guest_cs_ar_bytes;
+	u32 guest_ss_ar_bytes;
+	u32 guest_ds_ar_bytes;
+	u32 guest_fs_ar_bytes;
+	u32 guest_gs_ar_bytes;
+	u32 guest_ldtr_ar_bytes;
+	u32 guest_tr_ar_bytes;
+	u32 guest_interruptibility_info;
+	u32 guest_activity_state;
+	u32 guest_sysenter_cs;
+	u32 host_ia32_sysenter_cs;
+	unsigned long cr0_guest_host_mask;
+	unsigned long cr4_guest_host_mask;
+	unsigned long cr0_read_shadow;
+	unsigned long cr4_read_shadow;
+	unsigned long cr3_target_value0;
+	unsigned long cr3_target_value1;
+	unsigned long cr3_target_value2;
+	unsigned long cr3_target_value3;
+	unsigned long exit_qualification;
+	unsigned long guest_linear_address;
+	unsigned long guest_cr0;
+	unsigned long guest_cr3;
+	unsigned long guest_cr4;
+	unsigned long guest_es_base;
+	unsigned long guest_cs_base;
+	unsigned long guest_ss_base;
+	unsigned long guest_ds_base;
+	unsigned long guest_fs_base;
+	unsigned long guest_gs_base;
+	unsigned long guest_ldtr_base;
+	unsigned long guest_tr_base;
+	unsigned long guest_gdtr_base;
+	unsigned long guest_idtr_base;
+	unsigned long guest_dr7;
+	unsigned long guest_rsp;
+	unsigned long guest_rip;
+	unsigned long guest_rflags;
+	unsigned long guest_pending_dbg_exceptions;
+	unsigned long guest_sysenter_esp;
+	unsigned long guest_sysenter_eip;
+	unsigned long host_cr0;
+	unsigned long host_cr3;
+	unsigned long host_cr4;
+	unsigned long host_fs_base;
+	unsigned long host_gs_base;
+	unsigned long host_tr_base;
+	unsigned long host_gdtr_base;
+	unsigned long host_idtr_base;
+	unsigned long host_ia32_sysenter_esp;
+	unsigned long host_ia32_sysenter_eip;
+	unsigned long host_rsp;
+	unsigned long host_rip;
+};
+
 #define VMCS12_REVISION 0x11e57ed0
 
 /*
@@ -139,6 +269,8 @@ struct __attribute__ ((__packed__)) vmcs
 	u32 revision_id;
 	u32 abort;
 
+	struct shadow_vmcs shadow_vmcs;
+
 	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
 };
 
@@ -228,6 +360,169 @@ static inline struct vcpu_vmx *to_vmx(st
 	return container_of(vcpu, struct vcpu_vmx, vcpu);
 }
 
+#define OFFSET(x) offsetof(struct shadow_vmcs, x)
+
+static unsigned short vmcs_field_to_offset_table[HOST_RIP+1] = {
+	[VIRTUAL_PROCESSOR_ID] = OFFSET(virtual_processor_id),
+	[GUEST_ES_SELECTOR] = OFFSET(guest_es_selector),
+	[GUEST_CS_SELECTOR] = OFFSET(guest_cs_selector),
+	[GUEST_SS_SELECTOR] = OFFSET(guest_ss_selector),
+	[GUEST_DS_SELECTOR] = OFFSET(guest_ds_selector),
+	[GUEST_FS_SELECTOR] = OFFSET(guest_fs_selector),
+	[GUEST_GS_SELECTOR] = OFFSET(guest_gs_selector),
+	[GUEST_LDTR_SELECTOR] = OFFSET(guest_ldtr_selector),
+	[GUEST_TR_SELECTOR] = OFFSET(guest_tr_selector),
+	[HOST_ES_SELECTOR] = OFFSET(host_es_selector),
+	[HOST_CS_SELECTOR] = OFFSET(host_cs_selector),
+	[HOST_SS_SELECTOR] = OFFSET(host_ss_selector),
+	[HOST_DS_SELECTOR] = OFFSET(host_ds_selector),
+	[HOST_FS_SELECTOR] = OFFSET(host_fs_selector),
+	[HOST_GS_SELECTOR] = OFFSET(host_gs_selector),
+	[HOST_TR_SELECTOR] = OFFSET(host_tr_selector),
+	[IO_BITMAP_A] = OFFSET(io_bitmap_a),
+	[IO_BITMAP_A_HIGH] = OFFSET(io_bitmap_a)+4,
+	[IO_BITMAP_B] = OFFSET(io_bitmap_b),
+	[IO_BITMAP_B_HIGH] = OFFSET(io_bitmap_b)+4,
+	[MSR_BITMAP] = OFFSET(msr_bitmap),
+	[MSR_BITMAP_HIGH] = OFFSET(msr_bitmap)+4,
+	[VM_EXIT_MSR_STORE_ADDR] = OFFSET(vm_exit_msr_store_addr),
+	[VM_EXIT_MSR_STORE_ADDR_HIGH] = OFFSET(vm_exit_msr_store_addr)+4,
+	[VM_EXIT_MSR_LOAD_ADDR] = OFFSET(vm_exit_msr_load_addr),
+	[VM_EXIT_MSR_LOAD_ADDR_HIGH] = OFFSET(vm_exit_msr_load_addr)+4,
+	[VM_ENTRY_MSR_LOAD_ADDR] = OFFSET(vm_entry_msr_load_addr),
+	[VM_ENTRY_MSR_LOAD_ADDR_HIGH] = OFFSET(vm_entry_msr_load_addr)+4,
+	[TSC_OFFSET] = OFFSET(tsc_offset),
+	[TSC_OFFSET_HIGH] = OFFSET(tsc_offset)+4,
+	[VIRTUAL_APIC_PAGE_ADDR] = OFFSET(virtual_apic_page_addr),
+	[VIRTUAL_APIC_PAGE_ADDR_HIGH] = OFFSET(virtual_apic_page_addr)+4,
+	[APIC_ACCESS_ADDR] = OFFSET(apic_access_addr),
+	[APIC_ACCESS_ADDR_HIGH] = OFFSET(apic_access_addr)+4,
+	[EPT_POINTER] = OFFSET(ept_pointer),
+	[EPT_POINTER_HIGH] = OFFSET(ept_pointer)+4,
+	[GUEST_PHYSICAL_ADDRESS] = OFFSET(guest_physical_address),
+	[GUEST_PHYSICAL_ADDRESS_HIGH] = OFFSET(guest_physical_address)+4,
+	[VMCS_LINK_POINTER] = OFFSET(vmcs_link_pointer),
+	[VMCS_LINK_POINTER_HIGH] = OFFSET(vmcs_link_pointer)+4,
+	[GUEST_IA32_DEBUGCTL] = OFFSET(guest_ia32_debugctl),
+	[GUEST_IA32_DEBUGCTL_HIGH] = OFFSET(guest_ia32_debugctl)+4,
+	[GUEST_IA32_PAT] = OFFSET(guest_ia32_pat),
+	[GUEST_IA32_PAT_HIGH] = OFFSET(guest_ia32_pat)+4,
+	[GUEST_PDPTR0] = OFFSET(guest_pdptr0),
+	[GUEST_PDPTR0_HIGH] = OFFSET(guest_pdptr0)+4,
+	[GUEST_PDPTR1] = OFFSET(guest_pdptr1),
+	[GUEST_PDPTR1_HIGH] = OFFSET(guest_pdptr1)+4,
+	[GUEST_PDPTR2] = OFFSET(guest_pdptr2),
+	[GUEST_PDPTR2_HIGH] = OFFSET(guest_pdptr2)+4,
+	[GUEST_PDPTR3] = OFFSET(guest_pdptr3),
+	[GUEST_PDPTR3_HIGH] = OFFSET(guest_pdptr3)+4,
+	[HOST_IA32_PAT] = OFFSET(host_ia32_pat),
+	[HOST_IA32_PAT_HIGH] = OFFSET(host_ia32_pat)+4,
+	[PIN_BASED_VM_EXEC_CONTROL] = OFFSET(pin_based_vm_exec_control),
+	[CPU_BASED_VM_EXEC_CONTROL] = OFFSET(cpu_based_vm_exec_control),
+	[EXCEPTION_BITMAP] = OFFSET(exception_bitmap),
+	[PAGE_FAULT_ERROR_CODE_MASK] = OFFSET(page_fault_error_code_mask),
+	[PAGE_FAULT_ERROR_CODE_MATCH] = OFFSET(page_fault_error_code_match),
+	[CR3_TARGET_COUNT] = OFFSET(cr3_target_count),
+	[VM_EXIT_CONTROLS] = OFFSET(vm_exit_controls),
+	[VM_EXIT_MSR_STORE_COUNT] = OFFSET(vm_exit_msr_store_count),
+	[VM_EXIT_MSR_LOAD_COUNT] = OFFSET(vm_exit_msr_load_count),
+	[VM_ENTRY_CONTROLS] = OFFSET(vm_entry_controls),
+	[VM_ENTRY_MSR_LOAD_COUNT] = OFFSET(vm_entry_msr_load_count),
+	[VM_ENTRY_INTR_INFO_FIELD] = OFFSET(vm_entry_intr_info_field),
+	[VM_ENTRY_EXCEPTION_ERROR_CODE] = OFFSET(vm_entry_exception_error_code),
+	[VM_ENTRY_INSTRUCTION_LEN] = OFFSET(vm_entry_instruction_len),
+	[TPR_THRESHOLD] = OFFSET(tpr_threshold),
+	[SECONDARY_VM_EXEC_CONTROL] = OFFSET(secondary_vm_exec_control),
+	[VM_INSTRUCTION_ERROR] = OFFSET(vm_instruction_error),
+	[VM_EXIT_REASON] = OFFSET(vm_exit_reason),
+	[VM_EXIT_INTR_INFO] = OFFSET(vm_exit_intr_info),
+	[VM_EXIT_INTR_ERROR_CODE] = OFFSET(vm_exit_intr_error_code),
+	[IDT_VECTORING_INFO_FIELD] = OFFSET(idt_vectoring_info_field),
+	[IDT_VECTORING_ERROR_CODE] = OFFSET(idt_vectoring_error_code),
+	[VM_EXIT_INSTRUCTION_LEN] = OFFSET(vm_exit_instruction_len),
+	[VMX_INSTRUCTION_INFO] = OFFSET(vmx_instruction_info),
+	[GUEST_ES_LIMIT] = OFFSET(guest_es_limit),
+	[GUEST_CS_LIMIT] = OFFSET(guest_cs_limit),
+	[GUEST_SS_LIMIT] = OFFSET(guest_ss_limit),
+	[GUEST_DS_LIMIT] = OFFSET(guest_ds_limit),
+	[GUEST_FS_LIMIT] = OFFSET(guest_fs_limit),
+	[GUEST_GS_LIMIT] = OFFSET(guest_gs_limit),
+	[GUEST_LDTR_LIMIT] = OFFSET(guest_ldtr_limit),
+	[GUEST_TR_LIMIT] = OFFSET(guest_tr_limit),
+	[GUEST_GDTR_LIMIT] = OFFSET(guest_gdtr_limit),
+	[GUEST_IDTR_LIMIT] = OFFSET(guest_idtr_limit),
+	[GUEST_ES_AR_BYTES] = OFFSET(guest_es_ar_bytes),
+	[GUEST_CS_AR_BYTES] = OFFSET(guest_cs_ar_bytes),
+	[GUEST_SS_AR_BYTES] = OFFSET(guest_ss_ar_bytes),
+	[GUEST_DS_AR_BYTES] = OFFSET(guest_ds_ar_bytes),
+	[GUEST_FS_AR_BYTES] = OFFSET(guest_fs_ar_bytes),
+	[GUEST_GS_AR_BYTES] = OFFSET(guest_gs_ar_bytes),
+	[GUEST_LDTR_AR_BYTES] = OFFSET(guest_ldtr_ar_bytes),
+	[GUEST_TR_AR_BYTES] = OFFSET(guest_tr_ar_bytes),
+	[GUEST_INTERRUPTIBILITY_INFO] = OFFSET(guest_interruptibility_info),
+	[GUEST_ACTIVITY_STATE] = OFFSET(guest_activity_state),
+	[GUEST_SYSENTER_CS] = OFFSET(guest_sysenter_cs),
+	[HOST_IA32_SYSENTER_CS] = OFFSET(host_ia32_sysenter_cs),
+	[CR0_GUEST_HOST_MASK] = OFFSET(cr0_guest_host_mask),
+	[CR4_GUEST_HOST_MASK] = OFFSET(cr4_guest_host_mask),
+	[CR0_READ_SHADOW] = OFFSET(cr0_read_shadow),
+	[CR4_READ_SHADOW] = OFFSET(cr4_read_shadow),
+	[CR3_TARGET_VALUE0] = OFFSET(cr3_target_value0),
+	[CR3_TARGET_VALUE1] = OFFSET(cr3_target_value1),
+	[CR3_TARGET_VALUE2] = OFFSET(cr3_target_value2),
+	[CR3_TARGET_VALUE3] = OFFSET(cr3_target_value3),
+	[EXIT_QUALIFICATION] = OFFSET(exit_qualification),
+	[GUEST_LINEAR_ADDRESS] = OFFSET(guest_linear_address),
+	[GUEST_CR0] = OFFSET(guest_cr0),
+	[GUEST_CR3] = OFFSET(guest_cr3),
+	[GUEST_CR4] = OFFSET(guest_cr4),
+	[GUEST_ES_BASE] = OFFSET(guest_es_base),
+	[GUEST_CS_BASE] = OFFSET(guest_cs_base),
+	[GUEST_SS_BASE] = OFFSET(guest_ss_base),
+	[GUEST_DS_BASE] = OFFSET(guest_ds_base),
+	[GUEST_FS_BASE] = OFFSET(guest_fs_base),
+	[GUEST_GS_BASE] = OFFSET(guest_gs_base),
+	[GUEST_LDTR_BASE] = OFFSET(guest_ldtr_base),
+	[GUEST_TR_BASE] = OFFSET(guest_tr_base),
+	[GUEST_GDTR_BASE] = OFFSET(guest_gdtr_base),
+	[GUEST_IDTR_BASE] = OFFSET(guest_idtr_base),
+	[GUEST_DR7] = OFFSET(guest_dr7),
+	[GUEST_RSP] = OFFSET(guest_rsp),
+	[GUEST_RIP] = OFFSET(guest_rip),
+	[GUEST_RFLAGS] = OFFSET(guest_rflags),
+	[GUEST_PENDING_DBG_EXCEPTIONS] = OFFSET(guest_pending_dbg_exceptions),
+	[GUEST_SYSENTER_ESP] = OFFSET(guest_sysenter_esp),
+	[GUEST_SYSENTER_EIP] = OFFSET(guest_sysenter_eip),
+	[HOST_CR0] = OFFSET(host_cr0),
+	[HOST_CR3] = OFFSET(host_cr3),
+	[HOST_CR4] = OFFSET(host_cr4),
+	[HOST_FS_BASE] = OFFSET(host_fs_base),
+	[HOST_GS_BASE] = OFFSET(host_gs_base),
+	[HOST_TR_BASE] = OFFSET(host_tr_base),
+	[HOST_GDTR_BASE] = OFFSET(host_gdtr_base),
+	[HOST_IDTR_BASE] = OFFSET(host_idtr_base),
+	[HOST_IA32_SYSENTER_ESP] = OFFSET(host_ia32_sysenter_esp),
+	[HOST_IA32_SYSENTER_EIP] = OFFSET(host_ia32_sysenter_eip),
+	[HOST_RSP] = OFFSET(host_rsp),
+	[HOST_RIP] = OFFSET(host_rip),
+};
+
+static inline short vmcs_field_to_offset(unsigned long field)
+{
+
+	if (field > HOST_RIP || vmcs_field_to_offset_table[field] == 0) {
+		printk(KERN_ERR "invalid vmcs field 0x%lx\n", field);
+		return -1;
+	}
+	return vmcs_field_to_offset_table[field];
+}
+
+static inline struct shadow_vmcs *get_shadow_vmcs(struct kvm_vcpu *vcpu)
+{
+	WARN_ON(!to_vmx(vcpu)->nested.current_l2_page);
+	return &(to_vmx(vcpu)->nested.current_l2_page->shadow_vmcs);
+}
+
 static struct page *nested_get_page(struct kvm_vcpu *vcpu, u64 vmcs_addr)
 {
 	struct page *vmcs_page =

^ permalink raw reply	[flat|nested] 147+ messages in thread

* [PATCH 13/24] Implement VMREAD and VMWRITE
  2010-06-13 12:22 [PATCH 0/24] Nested VMX, v5 Nadav Har'El
                   ` (11 preceding siblings ...)
  2010-06-13 12:28 ` [PATCH 12/24] Add VMCS fields to the vmcs12 Nadav Har'El
@ 2010-06-13 12:29 ` Nadav Har'El
  2010-06-14  9:36   ` Avi Kivity
  2010-06-16 15:03   ` Gleb Natapov
  2010-06-13 12:29 ` [PATCH 14/24] Prepare vmcs02 from vmcs01 and vmcs12 Nadav Har'El
                   ` (13 subsequent siblings)
  26 siblings, 2 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-06-13 12:29 UTC (permalink / raw)
  To: avi; +Cc: kvm

Implement the VMREAD and VMWRITE instructions. With these instructions, L1
can read and write to the VMCS it is holding. The values are read or written
to the fields of the shadow_vmcs structure introduced in the previous patch.
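
As a quick illustration of the field-encoding rules the code below relies on
(bit 0 of a VMCS field encoding marks a *_HIGH access, and bits 14:13 give the
field width), here is a minimal stand-alone C sketch. The GUEST_RFLAGS encoding
used in main() comes from the VMX specification; everything else is simplified
for illustration and is not the code added by this patch.

#include <stdio.h>

/* Bits 14:13 of a VMCS field encoding give its natural width;
 * bit 0 set means a *_HIGH access, which is always 32 bits wide. */
enum field_type { TYPE_U16 = 0, TYPE_U64 = 1, TYPE_U32 = 2, TYPE_NATURAL = 3 };

static int field_type(unsigned long field)
{
	if (field & 1)			/* a *_HIGH companion field */
		return TYPE_U32;
	return (field >> 13) & 3;
}

static int field_size(int type, int long_mode)
{
	switch (type) {
	case TYPE_U16:     return 2;
	case TYPE_U32:     return 4;
	case TYPE_U64:     return 8;
	case TYPE_NATURAL: return long_mode ? 8 : 4;
	}
	return 0;
}

int main(void)
{
	/* 0x6820 is the encoding of GUEST_RFLAGS, a natural-width field */
	printf("size = %d\n", field_size(field_type(0x6820), 1)); /* prints 8 */
	return 0;
}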

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
@@ -299,6 +299,42 @@ struct nested_vmx {
 	int l2_vmcs_num;
 };
 
+enum vmcs_field_type {
+	VMCS_FIELD_TYPE_U16 = 0,
+	VMCS_FIELD_TYPE_U64 = 1,
+	VMCS_FIELD_TYPE_U32 = 2,
+	VMCS_FIELD_TYPE_ULONG = 3
+};
+
+#define VMCS_FIELD_LENGTH_OFFSET 13
+#define VMCS_FIELD_LENGTH_MASK 0x6000
+
+static inline int vmcs_field_type(unsigned long field)
+{
+	if (0x1 & field)	/* one of the *_HIGH fields, all are 32 bit */
+		return VMCS_FIELD_TYPE_U32;
+	return (VMCS_FIELD_LENGTH_MASK & field) >> VMCS_FIELD_LENGTH_OFFSET;
+}
+
+static inline int vmcs_field_size(int field_type, struct kvm_vcpu *vcpu)
+{
+	switch (field_type) {
+	case VMCS_FIELD_TYPE_U16:
+		return 2;
+	case VMCS_FIELD_TYPE_U32:
+		return 4;
+	case VMCS_FIELD_TYPE_U64:
+		return 8;
+	case VMCS_FIELD_TYPE_ULONG:
+#ifdef CONFIG_X86_64
+		if (is_long_mode(vcpu))
+			return 8;
+#endif
+		return 4;
+	}
+	return 0; /* should never happen */
+}
+
 struct vcpu_vmx {
 	struct kvm_vcpu       vcpu;
 	struct list_head      local_vcpus_link;
@@ -4184,6 +4220,189 @@ static int handle_vmclear(struct kvm_vcp
 	return 1;
 }
 
+static inline bool nested_vmcs_read_any(struct kvm_vcpu *vcpu,
+					unsigned long field, u64 *ret)
+{
+	short offset = vmcs_field_to_offset(field);
+	char *p;
+
+	if (offset < 0)
+		return 0;
+	if (!to_vmx(vcpu)->nested.current_l2_page)
+		return 0;
+
+	p = ((char *)(get_shadow_vmcs(vcpu))) + offset;
+
+	switch (vmcs_field_type(field)) {
+	case VMCS_FIELD_TYPE_ULONG:
+		*ret = *((unsigned long *)p);
+		return 1;
+	case VMCS_FIELD_TYPE_U16:
+		*ret = (u16) *((unsigned long *)p);
+		return 1;
+	case VMCS_FIELD_TYPE_U32:
+		*ret = (u32) *((unsigned long *)p);
+		return 1;
+	case VMCS_FIELD_TYPE_U64:
+		*ret = *((u64 *)p);
+		return 1;
+	default:
+		return 0; /* can never happen. */
+	}
+}
+
+static int handle_vmread_reg(struct kvm_vcpu *vcpu, int reg,
+			     unsigned long field)
+{
+	u64 field_value;
+	if (!nested_vmcs_read_any(vcpu, field, &field_value))
+		return 0;
+
+#ifdef CONFIG_X86_64
+	switch (vmcs_field_type(field)) {
+	case VMCS_FIELD_TYPE_U64: case VMCS_FIELD_TYPE_ULONG:
+		if (!is_long_mode(vcpu)) {
+			kvm_register_write(vcpu, reg+1, field_value >> 32);
+			field_value = (u32)field_value;
+		}
+	}
+#endif
+	kvm_register_write(vcpu, reg, field_value);
+	return 1;
+}
+
+static int handle_vmread_mem(struct kvm_vcpu *vcpu, gva_t gva,
+			     unsigned long field)
+{
+	u64 field_value;
+	if (!nested_vmcs_read_any(vcpu, field, &field_value))
+		return 0;
+
+	/* It's ok to use *_system, because handle_vmread verifies cpl=0 */
+	kvm_write_guest_virt_system(gva, &field_value,
+			     vmcs_field_size(vmcs_field_type(field), vcpu),
+			     vcpu, NULL);
+	return 1;
+}
+
+static int handle_vmread(struct kvm_vcpu *vcpu)
+{
+	unsigned long field;
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	gva_t gva = 0;
+	int read_succeed;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (!nested_map_current(vcpu)) {
+		printk(KERN_INFO "%s invalid shadow vmcs\n", __func__);
+		set_rflags_to_vmx_fail_invalid(vcpu);
+		return 1;
+	}
+
+	/* decode instruction info to get the field to read and where to store
+	 * its value */
+	field = kvm_register_read(vcpu, VMX_OPERAND_REG2(vmx_instruction_info));
+	if (VMX_OPERAND_IS_REG(vmx_instruction_info)) {
+		read_succeed = handle_vmread_reg(vcpu,
+			VMX_OPERAND_REG(vmx_instruction_info), field);
+	} else {
+		gva = get_vmx_mem_address(vcpu, exit_qualification,
+					  vmx_instruction_info);
+		if (gva == 0)
+			return 1;
+		read_succeed = handle_vmread_mem(vcpu, gva, field);
+	}
+
+	if (read_succeed) {
+		clear_rflags_cf_zf(vcpu);
+		skip_emulated_instruction(vcpu);
+	} else {
+		set_rflags_to_vmx_fail_valid(vcpu);
+		vmcs_write32(VM_INSTRUCTION_ERROR, 12);
+	}
+
+	nested_unmap_current(vcpu);
+	return 1;
+}
+
+
+static int handle_vmwrite(struct kvm_vcpu *vcpu)
+{
+	unsigned long field;
+	u64 field_value = 0;
+	gva_t gva;
+	int field_type;
+	unsigned long exit_qualification   = vmcs_readl(EXIT_QUALIFICATION);
+	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	char *p;
+	short offset;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (!nested_map_current(vcpu)) {
+		printk(KERN_INFO "%s invalid shadow vmcs\n", __func__);
+		set_rflags_to_vmx_fail_invalid(vcpu);
+		return 1;
+	}
+
+	field = kvm_register_read(vcpu, VMX_OPERAND_REG2(vmx_instruction_info));
+	field_type = vmcs_field_type(field);
+
+	offset = vmcs_field_to_offset(field);
+	if (offset < 0) {
+		set_rflags_to_vmx_fail_invalid(vcpu);
+		nested_unmap_current(vcpu);
+		return 1;
+	}
+	p = ((char *) get_shadow_vmcs(vcpu)) + offset;
+
+	if (VMX_OPERAND_IS_REG(vmx_instruction_info))
+		field_value = kvm_register_read(vcpu,
+			VMX_OPERAND_REG(vmx_instruction_info));
+	else {
+		gva  = get_vmx_mem_address(vcpu, exit_qualification,
+			vmx_instruction_info);
+		if (gva == 0)
+			return 1;
+		kvm_read_guest_virt(gva, &field_value,
+			vmcs_field_size(field_type, vcpu), vcpu, NULL);
+	}
+
+	switch (field_type) {
+	case VMCS_FIELD_TYPE_U16:
+		*(u16 *)p = field_value;
+		break;
+	case VMCS_FIELD_TYPE_U32:
+		*(u32 *)p = field_value;
+		break;
+	case VMCS_FIELD_TYPE_U64:
+#ifdef CONFIG_X86_64
+		*(unsigned long *)p = field_value;
+#else
+		*(unsigned long *)p = field_value;
+		*(((unsigned long *)p)+1) = field_value >> 32;
+#endif
+		break;
+	case VMCS_FIELD_TYPE_ULONG:
+		*(unsigned long *)p = field_value;
+		break;
+	default:
+		printk(KERN_INFO "%s invalid field\n", __func__);
+		set_rflags_to_vmx_fail_valid(vcpu);
+		vmcs_write32(VM_INSTRUCTION_ERROR, 12);
+		nested_unmap_current(vcpu);
+		return 1;
+	}
+
+	clear_rflags_cf_zf(vcpu);
+	skip_emulated_instruction(vcpu);
+	nested_unmap_current(vcpu);
+	return 1;
+}
+
 static bool verify_vmcs12_revision(struct kvm_vcpu *vcpu, gpa_t guest_vmcs_addr)
 {
 	bool ret;
@@ -4548,9 +4767,9 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
 	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
 	[EXIT_REASON_VMPTRST]                 = handle_vmptrst,
-	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
+	[EXIT_REASON_VMREAD]                  = handle_vmread,
 	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
-	[EXIT_REASON_VMWRITE]                 = handle_vmx_insn,
+	[EXIT_REASON_VMWRITE]                 = handle_vmwrite,
 	[EXIT_REASON_VMOFF]                   = handle_vmoff,
 	[EXIT_REASON_VMON]                    = handle_vmon,
 	[EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,

^ permalink raw reply	[flat|nested] 147+ messages in thread

* [PATCH 14/24] Prepare vmcs02 from vmcs01 and vmcs12
  2010-06-13 12:22 [PATCH 0/24] Nested VMX, v5 Nadav Har'El
                   ` (12 preceding siblings ...)
  2010-06-13 12:29 ` [PATCH 13/24] Implement VMREAD and VMWRITE Nadav Har'El
@ 2010-06-13 12:29 ` Nadav Har'El
  2010-06-14 11:11   ` Avi Kivity
                     ` (2 more replies)
  2010-06-13 12:30 ` [PATCH 15/24] Move register-syncing to a function Nadav Har'El
                   ` (12 subsequent siblings)
  26 siblings, 3 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-06-13 12:29 UTC (permalink / raw)
  To: avi; +Cc: kvm

This patch contains code to prepare the VMCS which can be used to actually
run the L2 guest, vmcs02. prepare_vmcs02 appropriately merges the information
in shadow_vmcs that L1 built for L2 (vmcs12), and that in the VMCS that we
built for L1 (vmcs01).

VMREAD/VMWRITE can only access one VMCS at a time (the "current" VMCS), which
makes it difficult for us to read from vmcs01 while writing to vmcs02. This
is why we first make a copy of vmcs01 in ordinary memory (l1_shadow_vmcs) and
then read that memory copy while writing to vmcs02.
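
To make the "merge" idea concrete, here is a small stand-alone toy model (not
the vmx.c code below): three structures standing in for vmcs01, vmcs12 and
vmcs02, with the exception bitmap combined by the same bitwise-or that
prepare_vmcs_02 uses. The struct and function names are invented for the
example; only the vector numbers (#NM = 7, #PF = 14) are real.

#include <stdio.h>

/* Toy model of the merge: one control field per "VMCS". */
struct toy_vmcs { unsigned int exception_bitmap; };

static void merge_into_vmcs02(const struct toy_vmcs *vmcs01_copy,
			      const struct toy_vmcs *vmcs12,
			      struct toy_vmcs *vmcs02)
{
	/* trap whatever either L0 (vmcs01) or L1 (vmcs12) wants to trap */
	vmcs02->exception_bitmap = vmcs01_copy->exception_bitmap |
				   vmcs12->exception_bitmap;
}

int main(void)
{
	struct toy_vmcs vmcs01 = { .exception_bitmap = 1u << 7 };  /* L0: #NM */
	struct toy_vmcs vmcs12 = { .exception_bitmap = 1u << 14 }; /* L1: #PF */
	struct toy_vmcs vmcs02;

	merge_into_vmcs02(&vmcs01, &vmcs12, &vmcs02);
	printf("vmcs02 exception bitmap = 0x%x\n", vmcs02.exception_bitmap);
	return 0;
}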

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
@@ -849,6 +849,36 @@ static inline bool report_flexpriority(v
 	return flexpriority_enabled;
 }
 
+static inline bool nested_cpu_has_vmx_tpr_shadow(struct kvm_vcpu *vcpu)
+{
+	return cpu_has_vmx_tpr_shadow() &&
+		get_shadow_vmcs(vcpu)->cpu_based_vm_exec_control &
+		CPU_BASED_TPR_SHADOW;
+}
+
+static inline bool nested_cpu_has_secondary_exec_ctrls(struct kvm_vcpu *vcpu)
+{
+	return cpu_has_secondary_exec_ctrls() &&
+		get_shadow_vmcs(vcpu)->cpu_based_vm_exec_control &
+		CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
+}
+
+static inline bool nested_vm_need_virtualize_apic_accesses(struct kvm_vcpu
+							   *vcpu)
+{
+	return nested_cpu_has_secondary_exec_ctrls(vcpu) &&
+		(get_shadow_vmcs(vcpu)->secondary_vm_exec_control &
+		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
+}
+
+static inline bool nested_cpu_has_vmx_ept(struct kvm_vcpu *vcpu)
+{
+	return nested_cpu_has_secondary_exec_ctrls(vcpu) &&
+		(get_shadow_vmcs(vcpu)->secondary_vm_exec_control &
+		SECONDARY_EXEC_ENABLE_EPT);
+}
+
+
 static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
 {
 	int i;
@@ -1292,6 +1322,39 @@ static void vmx_load_host_state(struct v
 	preempt_enable();
 }
 
+int load_vmcs_host_state(struct shadow_vmcs *src)
+{
+	vmcs_write16(HOST_ES_SELECTOR, src->host_es_selector);
+	vmcs_write16(HOST_CS_SELECTOR, src->host_cs_selector);
+	vmcs_write16(HOST_SS_SELECTOR, src->host_ss_selector);
+	vmcs_write16(HOST_DS_SELECTOR, src->host_ds_selector);
+	vmcs_write16(HOST_FS_SELECTOR, src->host_fs_selector);
+	vmcs_write16(HOST_GS_SELECTOR, src->host_gs_selector);
+	vmcs_write16(HOST_TR_SELECTOR, src->host_tr_selector);
+
+	vmcs_write64(TSC_OFFSET, src->tsc_offset);
+
+	if (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PAT)
+		vmcs_write64(HOST_IA32_PAT, src->host_ia32_pat);
+
+	vmcs_write32(HOST_IA32_SYSENTER_CS, src->host_ia32_sysenter_cs);
+
+	vmcs_writel(HOST_CR0, src->host_cr0);
+	vmcs_writel(HOST_CR3, src->host_cr3);
+	vmcs_writel(HOST_CR4, src->host_cr4);
+	vmcs_writel(HOST_FS_BASE, src->host_fs_base);
+	vmcs_writel(HOST_GS_BASE, src->host_gs_base);
+	vmcs_writel(HOST_TR_BASE, src->host_tr_base);
+	vmcs_writel(HOST_GDTR_BASE, src->host_gdtr_base);
+	vmcs_writel(HOST_IDTR_BASE, src->host_idtr_base);
+	vmcs_writel(HOST_RSP, src->host_rsp);
+	vmcs_writel(HOST_RIP, src->host_rip);
+	vmcs_writel(HOST_IA32_SYSENTER_ESP, src->host_ia32_sysenter_esp);
+	vmcs_writel(HOST_IA32_SYSENTER_EIP, src->host_ia32_sysenter_eip);
+
+	return 0;
+}
+
 /*
  * Switches to specified vcpu, until a matching vcpu_put(), but assumes
  * vcpu mutex is already taken.
@@ -1922,6 +1985,71 @@ static void vmclear_local_vcpus(void)
 		__vcpu_clear(vmx);
 }
 
+int load_vmcs_common(struct shadow_vmcs *src)
+{
+	vmcs_write16(GUEST_ES_SELECTOR, src->guest_es_selector);
+	vmcs_write16(GUEST_CS_SELECTOR, src->guest_cs_selector);
+	vmcs_write16(GUEST_SS_SELECTOR, src->guest_ss_selector);
+	vmcs_write16(GUEST_DS_SELECTOR, src->guest_ds_selector);
+	vmcs_write16(GUEST_FS_SELECTOR, src->guest_fs_selector);
+	vmcs_write16(GUEST_GS_SELECTOR, src->guest_gs_selector);
+	vmcs_write16(GUEST_LDTR_SELECTOR, src->guest_ldtr_selector);
+	vmcs_write16(GUEST_TR_SELECTOR, src->guest_tr_selector);
+
+	vmcs_write64(GUEST_IA32_DEBUGCTL, src->guest_ia32_debugctl);
+
+	if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
+		vmcs_write64(GUEST_IA32_PAT, src->guest_ia32_pat);
+
+	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, src->vm_entry_intr_info_field);
+	vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
+		     src->vm_entry_exception_error_code);
+	vmcs_write32(VM_ENTRY_INSTRUCTION_LEN, src->vm_entry_instruction_len);
+
+	vmcs_write32(GUEST_ES_LIMIT, src->guest_es_limit);
+	vmcs_write32(GUEST_CS_LIMIT, src->guest_cs_limit);
+	vmcs_write32(GUEST_SS_LIMIT, src->guest_ss_limit);
+	vmcs_write32(GUEST_DS_LIMIT, src->guest_ds_limit);
+	vmcs_write32(GUEST_FS_LIMIT, src->guest_fs_limit);
+	vmcs_write32(GUEST_GS_LIMIT, src->guest_gs_limit);
+	vmcs_write32(GUEST_LDTR_LIMIT, src->guest_ldtr_limit);
+	vmcs_write32(GUEST_TR_LIMIT, src->guest_tr_limit);
+	vmcs_write32(GUEST_GDTR_LIMIT, src->guest_gdtr_limit);
+	vmcs_write32(GUEST_IDTR_LIMIT, src->guest_idtr_limit);
+	vmcs_write32(GUEST_ES_AR_BYTES, src->guest_es_ar_bytes);
+	vmcs_write32(GUEST_CS_AR_BYTES, src->guest_cs_ar_bytes);
+	vmcs_write32(GUEST_SS_AR_BYTES, src->guest_ss_ar_bytes);
+	vmcs_write32(GUEST_DS_AR_BYTES, src->guest_ds_ar_bytes);
+	vmcs_write32(GUEST_FS_AR_BYTES, src->guest_fs_ar_bytes);
+	vmcs_write32(GUEST_GS_AR_BYTES, src->guest_gs_ar_bytes);
+	vmcs_write32(GUEST_LDTR_AR_BYTES, src->guest_ldtr_ar_bytes);
+	vmcs_write32(GUEST_TR_AR_BYTES, src->guest_tr_ar_bytes);
+	vmcs_write32(GUEST_INTERRUPTIBILITY_INFO,
+		     src->guest_interruptibility_info);
+	vmcs_write32(GUEST_ACTIVITY_STATE, src->guest_activity_state);
+	vmcs_write32(GUEST_SYSENTER_CS, src->guest_sysenter_cs);
+
+	vmcs_writel(GUEST_ES_BASE, src->guest_es_base);
+	vmcs_writel(GUEST_CS_BASE, src->guest_cs_base);
+	vmcs_writel(GUEST_SS_BASE, src->guest_ss_base);
+	vmcs_writel(GUEST_DS_BASE, src->guest_ds_base);
+	vmcs_writel(GUEST_FS_BASE, src->guest_fs_base);
+	vmcs_writel(GUEST_GS_BASE, src->guest_gs_base);
+	vmcs_writel(GUEST_LDTR_BASE, src->guest_ldtr_base);
+	vmcs_writel(GUEST_TR_BASE, src->guest_tr_base);
+	vmcs_writel(GUEST_GDTR_BASE, src->guest_gdtr_base);
+	vmcs_writel(GUEST_IDTR_BASE, src->guest_idtr_base);
+	vmcs_writel(GUEST_DR7, src->guest_dr7);
+	vmcs_writel(GUEST_RSP, src->guest_rsp);
+	vmcs_writel(GUEST_RIP, src->guest_rip);
+	vmcs_writel(GUEST_RFLAGS, src->guest_rflags);
+	vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS,
+		    src->guest_pending_dbg_exceptions);
+	vmcs_writel(GUEST_SYSENTER_ESP, src->guest_sysenter_esp);
+	vmcs_writel(GUEST_SYSENTER_EIP, src->guest_sysenter_eip);
+
+	return 0;
+}
 
 /* Just like cpu_vmxoff(), but with the __kvm_handle_fault_on_reboot()
  * tricks.
@@ -5363,6 +5491,281 @@ static void vmx_set_supported_cpuid(u32 
 {
 }
 
+/* Make a copy of the current VMCS to ordinary memory. This is needed because
+ * in VMX you cannot read and write to two VMCS at the same time, so when we
+ * want to do this (in prepare_vmcs_02, which needs to read from vmcs01 while
+ * preparing vmcs02), we need to first save a copy of one VMCS's fields in
+ * memory, and then use that copy.
+ */
+void save_vmcs(struct shadow_vmcs *dst)
+{
+	dst->guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR);
+	dst->guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR);
+	dst->guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR);
+	dst->guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR);
+	dst->guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR);
+	dst->guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR);
+	dst->guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR);
+	dst->guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR);
+	dst->host_es_selector = vmcs_read16(HOST_ES_SELECTOR);
+	dst->host_cs_selector = vmcs_read16(HOST_CS_SELECTOR);
+	dst->host_ss_selector = vmcs_read16(HOST_SS_SELECTOR);
+	dst->host_ds_selector = vmcs_read16(HOST_DS_SELECTOR);
+	dst->host_fs_selector = vmcs_read16(HOST_FS_SELECTOR);
+	dst->host_gs_selector = vmcs_read16(HOST_GS_SELECTOR);
+	dst->host_tr_selector = vmcs_read16(HOST_TR_SELECTOR);
+	dst->io_bitmap_a = vmcs_read64(IO_BITMAP_A);
+	dst->io_bitmap_b = vmcs_read64(IO_BITMAP_B);
+	if (cpu_has_vmx_msr_bitmap())
+		dst->msr_bitmap = vmcs_read64(MSR_BITMAP);
+	dst->tsc_offset = vmcs_read64(TSC_OFFSET);
+	dst->virtual_apic_page_addr = vmcs_read64(VIRTUAL_APIC_PAGE_ADDR);
+	dst->apic_access_addr = vmcs_read64(APIC_ACCESS_ADDR);
+	if (enable_ept)
+		dst->ept_pointer = vmcs_read64(EPT_POINTER);
+	dst->guest_physical_address = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
+	dst->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);
+	dst->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
+	if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
+		dst->guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT);
+	if (enable_ept) {
+		dst->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
+		dst->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
+		dst->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
+		dst->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
+	}
+	dst->pin_based_vm_exec_control = vmcs_read32(PIN_BASED_VM_EXEC_CONTROL);
+	dst->cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
+	dst->exception_bitmap = vmcs_read32(EXCEPTION_BITMAP);
+	dst->page_fault_error_code_mask =
+		vmcs_read32(PAGE_FAULT_ERROR_CODE_MASK);
+	dst->page_fault_error_code_match =
+		vmcs_read32(PAGE_FAULT_ERROR_CODE_MATCH);
+	dst->cr3_target_count = vmcs_read32(CR3_TARGET_COUNT);
+	dst->vm_exit_controls = vmcs_read32(VM_EXIT_CONTROLS);
+	dst->vm_entry_controls = vmcs_read32(VM_ENTRY_CONTROLS);
+	dst->vm_entry_intr_info_field = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
+	dst->vm_entry_exception_error_code =
+		vmcs_read32(VM_ENTRY_EXCEPTION_ERROR_CODE);
+	dst->vm_entry_instruction_len = vmcs_read32(VM_ENTRY_INSTRUCTION_LEN);
+	dst->tpr_threshold = vmcs_read32(TPR_THRESHOLD);
+	dst->secondary_vm_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
+	if (enable_vpid && dst->secondary_vm_exec_control &
+	    SECONDARY_EXEC_ENABLE_VPID)
+		dst->virtual_processor_id = vmcs_read16(VIRTUAL_PROCESSOR_ID);
+	dst->vm_instruction_error = vmcs_read32(VM_INSTRUCTION_ERROR);
+	dst->vm_exit_reason  = vmcs_read32(VM_EXIT_REASON);
+	dst->vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+	dst->vm_exit_intr_error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
+	dst->idt_vectoring_info_field = vmcs_read32(IDT_VECTORING_INFO_FIELD);
+	dst->idt_vectoring_error_code = vmcs_read32(IDT_VECTORING_ERROR_CODE);
+	dst->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+	dst->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	dst->guest_es_limit = vmcs_read32(GUEST_ES_LIMIT);
+	dst->guest_cs_limit = vmcs_read32(GUEST_CS_LIMIT);
+	dst->guest_ss_limit = vmcs_read32(GUEST_SS_LIMIT);
+	dst->guest_ds_limit = vmcs_read32(GUEST_DS_LIMIT);
+	dst->guest_fs_limit = vmcs_read32(GUEST_FS_LIMIT);
+	dst->guest_gs_limit = vmcs_read32(GUEST_GS_LIMIT);
+	dst->guest_ldtr_limit = vmcs_read32(GUEST_LDTR_LIMIT);
+	dst->guest_tr_limit = vmcs_read32(GUEST_TR_LIMIT);
+	dst->guest_gdtr_limit = vmcs_read32(GUEST_GDTR_LIMIT);
+	dst->guest_idtr_limit = vmcs_read32(GUEST_IDTR_LIMIT);
+	dst->guest_es_ar_bytes = vmcs_read32(GUEST_ES_AR_BYTES);
+	dst->guest_cs_ar_bytes = vmcs_read32(GUEST_CS_AR_BYTES);
+	dst->guest_ss_ar_bytes = vmcs_read32(GUEST_SS_AR_BYTES);
+	dst->guest_ds_ar_bytes = vmcs_read32(GUEST_DS_AR_BYTES);
+	dst->guest_fs_ar_bytes = vmcs_read32(GUEST_FS_AR_BYTES);
+	dst->guest_gs_ar_bytes = vmcs_read32(GUEST_GS_AR_BYTES);
+	dst->guest_ldtr_ar_bytes = vmcs_read32(GUEST_LDTR_AR_BYTES);
+	dst->guest_tr_ar_bytes = vmcs_read32(GUEST_TR_AR_BYTES);
+	dst->guest_interruptibility_info =
+		vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
+	dst->guest_activity_state = vmcs_read32(GUEST_ACTIVITY_STATE);
+	dst->guest_sysenter_cs = vmcs_read32(GUEST_SYSENTER_CS);
+	dst->host_ia32_sysenter_cs = vmcs_read32(HOST_IA32_SYSENTER_CS);
+	dst->cr0_guest_host_mask = vmcs_readl(CR0_GUEST_HOST_MASK);
+	dst->cr4_guest_host_mask = vmcs_readl(CR4_GUEST_HOST_MASK);
+	dst->cr0_read_shadow = vmcs_readl(CR0_READ_SHADOW);
+	dst->cr4_read_shadow = vmcs_readl(CR4_READ_SHADOW);
+	dst->cr3_target_value0 = vmcs_readl(CR3_TARGET_VALUE0);
+	dst->cr3_target_value1 = vmcs_readl(CR3_TARGET_VALUE1);
+	dst->cr3_target_value2 = vmcs_readl(CR3_TARGET_VALUE2);
+	dst->cr3_target_value3 = vmcs_readl(CR3_TARGET_VALUE3);
+	dst->exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	dst->guest_linear_address = vmcs_readl(GUEST_LINEAR_ADDRESS);
+	dst->guest_cr0 = vmcs_readl(GUEST_CR0);
+	dst->guest_cr3 = vmcs_readl(GUEST_CR3);
+	dst->guest_cr4 = vmcs_readl(GUEST_CR4);
+	dst->guest_es_base = vmcs_readl(GUEST_ES_BASE);
+	dst->guest_cs_base = vmcs_readl(GUEST_CS_BASE);
+	dst->guest_ss_base = vmcs_readl(GUEST_SS_BASE);
+	dst->guest_ds_base = vmcs_readl(GUEST_DS_BASE);
+	dst->guest_fs_base = vmcs_readl(GUEST_FS_BASE);
+	dst->guest_gs_base = vmcs_readl(GUEST_GS_BASE);
+	dst->guest_ldtr_base = vmcs_readl(GUEST_LDTR_BASE);
+	dst->guest_tr_base = vmcs_readl(GUEST_TR_BASE);
+	dst->guest_gdtr_base = vmcs_readl(GUEST_GDTR_BASE);
+	dst->guest_idtr_base = vmcs_readl(GUEST_IDTR_BASE);
+	dst->guest_dr7 = vmcs_readl(GUEST_DR7);
+	dst->guest_rsp = vmcs_readl(GUEST_RSP);
+	dst->guest_rip = vmcs_readl(GUEST_RIP);
+	dst->guest_rflags = vmcs_readl(GUEST_RFLAGS);
+	dst->guest_pending_dbg_exceptions =
+		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
+	dst->guest_sysenter_esp = vmcs_readl(GUEST_SYSENTER_ESP);
+	dst->guest_sysenter_eip = vmcs_readl(GUEST_SYSENTER_EIP);
+	dst->host_cr0 = vmcs_readl(HOST_CR0);
+	dst->host_cr3 = vmcs_readl(HOST_CR3);
+	dst->host_cr4 = vmcs_readl(HOST_CR4);
+	dst->host_fs_base = vmcs_readl(HOST_FS_BASE);
+	dst->host_gs_base = vmcs_readl(HOST_GS_BASE);
+	dst->host_tr_base = vmcs_readl(HOST_TR_BASE);
+	dst->host_gdtr_base = vmcs_readl(HOST_GDTR_BASE);
+	dst->host_idtr_base = vmcs_readl(HOST_IDTR_BASE);
+	dst->host_ia32_sysenter_esp = vmcs_readl(HOST_IA32_SYSENTER_ESP);
+	dst->host_ia32_sysenter_eip = vmcs_readl(HOST_IA32_SYSENTER_EIP);
+	dst->host_rsp = vmcs_readl(HOST_RSP);
+	dst->host_rip = vmcs_readl(HOST_RIP);
+	if (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PAT)
+		dst->host_ia32_pat = vmcs_read64(HOST_IA32_PAT);
+}
+
+/* prepare_vmcs_02 is called when the L1 guest hypervisor runs its nested
+ * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
+ * with L0's wishes for its guest (vmcs01), so we can run the L2 guest in a
+ * way that will both be appropriate to L1's requests, and our needs.
+ */
+int prepare_vmcs_02(struct kvm_vcpu *vcpu,
+	struct shadow_vmcs *vmcs12, struct shadow_vmcs *vmcs01)
+{
+	u32 exec_control;
+
+	load_vmcs_common(vmcs12);
+
+	vmcs_write64(VMCS_LINK_POINTER, vmcs12->vmcs_link_pointer);
+	vmcs_write64(IO_BITMAP_A, vmcs01->io_bitmap_a);
+	vmcs_write64(IO_BITMAP_B, vmcs01->io_bitmap_b);
+	if (cpu_has_vmx_msr_bitmap())
+		vmcs_write64(MSR_BITMAP, vmcs01->msr_bitmap);
+
+	if (vmcs12->vm_entry_msr_load_count > 0 ||
+			vmcs12->vm_exit_msr_load_count > 0 ||
+			vmcs12->vm_exit_msr_store_count > 0) {
+		printk(KERN_WARNING
+			"%s: VMCS MSR_{LOAD,STORE} unsupported\n", __func__);
+	}
+
+	if (nested_cpu_has_vmx_tpr_shadow(vcpu)) {
+		struct page *page =
+			nested_get_page(vcpu, vmcs12->virtual_apic_page_addr);
+		if (!page)
+			return 1;
+		vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, page_to_phys(page));
+		kvm_release_page_clean(page);
+	}
+
+	if (nested_vm_need_virtualize_apic_accesses(vcpu)) {
+		struct page *page =
+			nested_get_page(vcpu, vmcs12->apic_access_addr);
+		if (!page)
+			return 1;
+		vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page));
+		kvm_release_page_clean(page);
+	}
+
+	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
+		     (vmcs01->pin_based_vm_exec_control |
+		      vmcs12->pin_based_vm_exec_control));
+	vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK,
+		     (vmcs01->page_fault_error_code_mask &
+		      vmcs12->page_fault_error_code_mask));
+	vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH,
+		     (vmcs01->page_fault_error_code_match &
+		      vmcs12->page_fault_error_code_match));
+
+	if (cpu_has_secondary_exec_ctrls()) {
+		u32 exec_control = vmcs01->secondary_vm_exec_control;
+		if (nested_cpu_has_secondary_exec_ctrls(vcpu)) {
+			exec_control |= vmcs12->secondary_vm_exec_control;
+			if (!vm_need_virtualize_apic_accesses(vcpu->kvm) ||
+			    !nested_vm_need_virtualize_apic_accesses(vcpu))
+				exec_control &=
+				~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+		}
+		vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
+	}
+
+	load_vmcs_host_state(vmcs01);
+
+	if (vm_need_tpr_shadow(vcpu->kvm) &&
+	    nested_cpu_has_vmx_tpr_shadow(vcpu))
+		vmcs_write32(TPR_THRESHOLD, vmcs12->tpr_threshold);
+
+	if (enable_ept) {
+		if (!nested_cpu_has_vmx_ept(vcpu)) {
+			vmcs_write64(EPT_POINTER, vmcs01->ept_pointer);
+			vmcs_write64(GUEST_PDPTR0, vmcs01->guest_pdptr0);
+			vmcs_write64(GUEST_PDPTR1, vmcs01->guest_pdptr1);
+			vmcs_write64(GUEST_PDPTR2, vmcs01->guest_pdptr2);
+			vmcs_write64(GUEST_PDPTR3, vmcs01->guest_pdptr3);
+		}
+	}
+
+	exec_control = vmcs01->cpu_based_vm_exec_control;
+	exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING;
+	exec_control &= ~CPU_BASED_VIRTUAL_NMI_PENDING;
+	exec_control &= ~CPU_BASED_TPR_SHADOW;
+	exec_control |= vmcs12->cpu_based_vm_exec_control;
+	if (!vm_need_tpr_shadow(vcpu->kvm) ||
+	    vmcs12->virtual_apic_page_addr == 0) {
+		exec_control &= ~CPU_BASED_TPR_SHADOW;
+#ifdef CONFIG_X86_64
+		exec_control |= CPU_BASED_CR8_STORE_EXITING |
+			CPU_BASED_CR8_LOAD_EXITING;
+#endif
+	} else if (exec_control & CPU_BASED_TPR_SHADOW) {
+#ifdef CONFIG_X86_64
+		exec_control &= ~CPU_BASED_CR8_STORE_EXITING;
+		exec_control &= ~CPU_BASED_CR8_LOAD_EXITING;
+#endif
+	}
+	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control);
+
+	/* EXCEPTION_BITMAP and CR0_GUEST_HOST_MASK should basically be the
+	 * bitwise-or of what L1 wants to trap for L2, and what we want to
+	 * trap. However, vmx_fpu_activate/deactivate may have happened after
+	 * we saved vmcs01, so we shouldn't trust its TS and NM_VECTOR bits
+	 * and need to base them again on fpu_active. Note that CR0.TS also
+	 * needs updating - we do this after this function returns (in
+	 * nested_vmx_run).
+	 */
+	vmcs_write32(EXCEPTION_BITMAP,
+		     ((vmcs01->exception_bitmap&~(1u<<NM_VECTOR)) |
+		      (vcpu->fpu_active ? 0 : (1u<<NM_VECTOR)) |
+		      vmcs12->exception_bitmap));
+	vmcs_writel(CR0_GUEST_HOST_MASK, vmcs12->cr0_guest_host_mask |
+			(vcpu->fpu_active ? 0 : X86_CR0_TS));
+	vcpu->arch.cr0_guest_owned_bits = ~(vmcs12->cr0_guest_host_mask |
+			(vcpu->fpu_active ? 0 : X86_CR0_TS));
+
+	vmcs_write32(VM_EXIT_CONTROLS,
+		     (vmcs01->vm_exit_controls &
+			(~(VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT)))
+		       | vmcs12->vm_exit_controls);
+
+	vmcs_write32(VM_ENTRY_CONTROLS,
+		     (vmcs01->vm_entry_controls &
+			(~(VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE)))
+		      | vmcs12->vm_entry_controls);
+
+	vmcs_writel(CR4_GUEST_HOST_MASK,
+		    (vmcs01->cr4_guest_host_mask  &
+		     vmcs12->cr4_guest_host_mask));
+
+	return 0;
+}
+
 static struct kvm_x86_ops vmx_x86_ops = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,

^ permalink raw reply	[flat|nested] 147+ messages in thread

* [PATCH 15/24] Move register-syncing to a function
  2010-06-13 12:22 [PATCH 0/24] Nested VMX, v5 Nadav Har'El
                   ` (13 preceding siblings ...)
  2010-06-13 12:29 ` [PATCH 14/24] Prepare vmcs02 from vmcs01 and vmcs12 Nadav Har'El
@ 2010-06-13 12:30 ` Nadav Har'El
  2010-06-13 12:30 ` [PATCH 16/24] Implement VMLAUNCH and VMRESUME Nadav Har'El
                   ` (11 subsequent siblings)
  26 siblings, 0 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-06-13 12:30 UTC (permalink / raw)
  To: avi; +Cc: kvm

Move code that syncs dirty RSP and RIP registers back to the VMCS, into a
function. We will need to call this function from additional places in the
next patch.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
@@ -5114,6 +5114,15 @@ static void fixup_rmode_irq(struct vcpu_
 		| vmx->rmode.irq.vector;
 }
 
+static inline void sync_cached_regs_to_vmcs(struct kvm_vcpu *vcpu)
+{
+	if (test_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_dirty))
+		vmcs_writel(GUEST_RSP, vcpu->arch.regs[VCPU_REGS_RSP]);
+	if (test_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_dirty))
+		vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]);
+	vcpu->arch.regs_dirty = 0;
+}
+
 #ifdef CONFIG_X86_64
 #define R "r"
 #define Q "q"
@@ -5135,10 +5144,7 @@ static void vmx_vcpu_run(struct kvm_vcpu
 	if (vmx->emulation_required && emulate_invalid_guest_state)
 		return;
 
-	if (test_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_dirty))
-		vmcs_writel(GUEST_RSP, vcpu->arch.regs[VCPU_REGS_RSP]);
-	if (test_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_dirty))
-		vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]);
+	sync_cached_regs_to_vmcs(vcpu);
 
 	/* When single-stepping over STI and MOV SS, we must clear the
 	 * corresponding interruptibility bits in the guest state. Otherwise
@@ -5246,7 +5252,6 @@ static void vmx_vcpu_run(struct kvm_vcpu
 
 	vcpu->arch.regs_avail = ~((1 << VCPU_REGS_RIP) | (1 << VCPU_REGS_RSP)
 				  | (1 << VCPU_EXREG_PDPTR));
-	vcpu->arch.regs_dirty = 0;
 
 	vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
 	if (vmx->rmode.irq.pending)

^ permalink raw reply	[flat|nested] 147+ messages in thread

* [PATCH 16/24] Implement VMLAUNCH and VMRESUME
  2010-06-13 12:22 [PATCH 0/24] Nested VMX, v5 Nadav Har'El
                   ` (14 preceding siblings ...)
  2010-06-13 12:30 ` [PATCH 15/24] Move register-syncing to a function Nadav Har'El
@ 2010-06-13 12:30 ` Nadav Har'El
  2010-06-14 11:41   ` Avi Kivity
  2010-06-17 10:59   ` Gleb Natapov
  2010-06-13 12:31 ` [PATCH 17/24] No need for handle_vmx_insn function any more Nadav Har'El
                   ` (10 subsequent siblings)
  26 siblings, 2 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-06-13 12:30 UTC (permalink / raw)
  To: avi; +Cc: kvm

Implement the VMLAUNCH and VMRESUME instructions, allowing a guest
hypervisor to run its own guests.
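
The difference between the two instructions boils down to the launch_state
flag kept per vmcs12, as in the stand-alone sketch below (illustrative only;
the real code reports failures through the VMX rflags conventions rather than
a boolean return value).

#include <stdio.h>
#include <stdbool.h>

/* VMLAUNCH is only valid for a VMCS that has not been launched yet,
 * VMRESUME only for one that has already been launched. */
static bool launch_or_resume(bool *launch_state, bool is_vmlaunch)
{
	if (*launch_state == is_vmlaunch)	/* wrong instruction for state */
		return false;
	if (is_vmlaunch)
		*launch_state = true;		/* first entry marks it launched */
	return true;
}

int main(void)
{
	bool launch_state = false;

	printf("VMRESUME before launch: %d\n",
	       launch_or_resume(&launch_state, false));	/* 0: fails */
	printf("VMLAUNCH:               %d\n",
	       launch_or_resume(&launch_state, true));	/* 1: ok */
	printf("VMRESUME after launch:  %d\n",
	       launch_or_resume(&launch_state, false));	/* 1: ok */
	return 0;
}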

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
@@ -272,6 +272,9 @@ struct __attribute__ ((__packed__)) vmcs
 	struct shadow_vmcs shadow_vmcs;
 
 	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
+
+	int cpu;
+	int launched;
 };
 
 struct vmcs_list {
@@ -297,6 +300,24 @@ struct nested_vmx {
 	/* list of real (hardware) VMCS, one for each L2 guest of L1 */
 	struct list_head l2_vmcs_list; /* a vmcs_list */
 	int l2_vmcs_num;
+
+	/* Are we running a nested guest now */
+	bool nested_mode;
+	/* Level 1 state for switching to level 2 and back */
+	struct  {
+		u64 efer;
+		unsigned long cr3;
+		unsigned long cr4;
+		u64 io_bitmap_a;
+		u64 io_bitmap_b;
+		u64 msr_bitmap;
+		int cpu;
+		int launched;
+	} l1_state;
+	/* Level 1 shadow vmcs for switching to level 2 and back */
+	struct shadow_vmcs *l1_shadow_vmcs;
+	/* Level 1 vmcs loaded into the processor */
+	struct vmcs *l1_vmcs;
 };
 
 enum vmcs_field_type {
@@ -1407,6 +1428,19 @@ static void vmx_vcpu_load(struct kvm_vcp
 			new_offset = vmcs_read64(TSC_OFFSET) + delta;
 			vmcs_write64(TSC_OFFSET, new_offset);
 		}
+
+		if (vmx->nested.l1_shadow_vmcs != NULL) {
+			struct shadow_vmcs *l1svmcs =
+				vmx->nested.l1_shadow_vmcs;
+			l1svmcs->host_tr_base = vmcs_readl(HOST_TR_BASE);
+			l1svmcs->host_gdtr_base = vmcs_readl(HOST_GDTR_BASE);
+			l1svmcs->host_ia32_sysenter_esp =
+				vmcs_readl(HOST_IA32_SYSENTER_ESP);
+			if (tsc_this < vcpu->arch.host_tsc)
+				l1svmcs->tsc_offset = vmcs_read64(TSC_OFFSET);
+			if (vmx->nested.nested_mode)
+				load_vmcs_host_state(l1svmcs);
+		}
 	}
 }
 
@@ -2301,6 +2335,9 @@ static void free_l1_state(struct kvm_vcp
 		kfree(list_item);
 	}
 	vmx->nested.l2_vmcs_num = 0;
+
+	kfree(vmx->nested.l1_shadow_vmcs);
+	vmx->nested.l1_shadow_vmcs = NULL;
 }
 
 static void free_kvm_area(void)
@@ -4158,6 +4195,13 @@ static int handle_vmon(struct kvm_vcpu *
 	INIT_LIST_HEAD(&(vmx->nested.l2_vmcs_list));
 	vmx->nested.l2_vmcs_num = 0;
 
+	vmx->nested.l1_shadow_vmcs = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!vmx->nested.l1_shadow_vmcs) {
+		printk(KERN_INFO
+			"couldn't allocate memory for l1_shadow_vmcs\n");
+		return -ENOMEM;
+	}
+
 	vmx->nested.vmxon = 1;
 
 	skip_emulated_instruction(vcpu);
@@ -4348,6 +4392,42 @@ static int handle_vmclear(struct kvm_vcp
 	return 1;
 }
 
+static int nested_vmx_run(struct kvm_vcpu *vcpu);
+
+static int handle_launch_or_resume(struct kvm_vcpu *vcpu, bool launch)
+{
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (!nested_map_current(vcpu))
+		return 1;
+	if (to_vmx(vcpu)->nested.current_l2_page->launch_state == launch) {
+		/* Must use VMLAUNCH for the first time, VMRESUME later */
+		set_rflags_to_vmx_fail_valid(vcpu);
+		nested_unmap_current(vcpu);
+		return 1;
+	}
+	nested_unmap_current(vcpu);
+
+	skip_emulated_instruction(vcpu);
+
+	nested_vmx_run(vcpu);
+	return 1;
+}
+
+/* Emulate the VMLAUNCH instruction */
+static int handle_vmlaunch(struct kvm_vcpu *vcpu)
+{
+	return handle_launch_or_resume(vcpu, true);
+}
+
+/* Emulate the VMRESUME instruction */
+static int handle_vmresume(struct kvm_vcpu *vcpu)
+{
+
+	return handle_launch_or_resume(vcpu, false);
+}
+
 static inline bool nested_vmcs_read_any(struct kvm_vcpu *vcpu,
 					unsigned long field, u64 *ret)
 {
@@ -4892,11 +4972,11 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_INVLPG]		      = handle_invlpg,
 	[EXIT_REASON_VMCALL]                  = handle_vmcall,
 	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
-	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
+	[EXIT_REASON_VMLAUNCH]                = handle_vmlaunch,
 	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
 	[EXIT_REASON_VMPTRST]                 = handle_vmptrst,
 	[EXIT_REASON_VMREAD]                  = handle_vmread,
-	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
+	[EXIT_REASON_VMRESUME]                = handle_vmresume,
 	[EXIT_REASON_VMWRITE]                 = handle_vmwrite,
 	[EXIT_REASON_VMOFF]                   = handle_vmoff,
 	[EXIT_REASON_VMON]                    = handle_vmon,
@@ -4958,7 +5038,8 @@ static int vmx_handle_exit(struct kvm_vc
 		       "(0x%x) and exit reason is 0x%x\n",
 		       __func__, vectoring_info, exit_reason);
 
-	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked)) {
+	if (!vmx->nested.nested_mode &&
+		unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked)) {
 		if (vmx_interrupt_allowed(vcpu)) {
 			vmx->soft_vnmi_blocked = 0;
 		} else if (vmx->vnmi_blocked_time > 1000000000LL &&
@@ -5771,6 +5852,138 @@ int prepare_vmcs_02(struct kvm_vcpu *vcp
 	return 0;
 }
 
+static int nested_vmx_run(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	vmx->nested.nested_mode = 1;
+	sync_cached_regs_to_vmcs(vcpu);
+	save_vmcs(vmx->nested.l1_shadow_vmcs);
+
+	vmx->nested.l1_state.efer = vcpu->arch.efer;
+	if (!enable_ept)
+		vmx->nested.l1_state.cr3 = vcpu->arch.cr3;
+	vmx->nested.l1_state.cr4 = vcpu->arch.cr4;
+
+	if (!nested_map_current(vcpu)) {
+		set_rflags_to_vmx_fail_valid(vcpu);
+		return 1;
+	}
+
+	if (cpu_has_vmx_msr_bitmap())
+		vmx->nested.l1_state.msr_bitmap = vmcs_read64(MSR_BITMAP);
+	else
+		vmx->nested.l1_state.msr_bitmap = 0;
+
+	vmx->nested.l1_state.io_bitmap_a = vmcs_read64(IO_BITMAP_A);
+	vmx->nested.l1_state.io_bitmap_b = vmcs_read64(IO_BITMAP_B);
+	vmx->nested.l1_vmcs = vmx->vmcs;
+	vmx->nested.l1_state.cpu = vcpu->cpu;
+	vmx->nested.l1_state.launched = vmx->launched;
+
+	vmx->vmcs = nested_get_current_vmcs(vcpu);
+	if (!vmx->vmcs) {
+		printk(KERN_ERR "Missing VMCS\n");
+		set_rflags_to_vmx_fail_valid(vcpu);
+		return 1;
+	}
+
+	vcpu->cpu = vmx->nested.current_l2_page->cpu;
+	vmx->launched = vmx->nested.current_l2_page->launched;
+
+	if (!vmx->nested.current_l2_page->launch_state || !vmx->launched) {
+		vmcs_clear(vmx->vmcs);
+		vmx->launched = 0;
+		vmx->nested.current_l2_page->launch_state = 1;
+	}
+
+	vmx_vcpu_load(vcpu, get_cpu());
+	put_cpu();
+
+	prepare_vmcs_02(vcpu,
+		get_shadow_vmcs(vcpu), vmx->nested.l1_shadow_vmcs);
+
+	if (get_shadow_vmcs(vcpu)->vm_entry_controls &
+	    VM_ENTRY_IA32E_MODE) {
+		if (!((vcpu->arch.efer & EFER_LMA) &&
+		      (vcpu->arch.efer & EFER_LME)))
+			vcpu->arch.efer |= (EFER_LMA | EFER_LME);
+	} else {
+		if ((vcpu->arch.efer & EFER_LMA) ||
+		    (vcpu->arch.efer & EFER_LME))
+			vcpu->arch.efer = 0;
+	}
+
+	/* vmx_set_cr0() sets the cr0 that L2 will read, to be the one that L1
+	 * dictated, and takes appropriate actions for special cr0 bits (like
+	 * real mode, etc.).
+	 */
+	vmx_set_cr0(vcpu,
+		(get_shadow_vmcs(vcpu)->guest_cr0 &
+			~get_shadow_vmcs(vcpu)->cr0_guest_host_mask) |
+		(get_shadow_vmcs(vcpu)->cr0_read_shadow &
+			get_shadow_vmcs(vcpu)->cr0_guest_host_mask));
+
+	/* However, vmx_set_cr0 incorrectly enforces KVM's relationship between
+	 * GUEST_CR0 and CR0_READ_SHADOW, e.g., that the former is the same as
+	 * the latter with TS added if !fpu_active. We need to take the
+	 * actual GUEST_CR0 that L1 wanted, just with added TS if !fpu_active
+	 * like KVM wants (for the "lazy fpu" feature, to avoid the costly
+	 * restoration of fpu registers until the FPU is really used).
+	 */
+	vmcs_writel(GUEST_CR0, get_shadow_vmcs(vcpu)->guest_cr0 |
+		(vcpu->fpu_active ? 0 : X86_CR0_TS));
+
+	vmx_set_cr4(vcpu, get_shadow_vmcs(vcpu)->guest_cr4);
+	vmcs_writel(CR4_READ_SHADOW,
+		    get_shadow_vmcs(vcpu)->cr4_read_shadow);
+
+	/* we have to set the X86_CR0_PG bit of the cached cr0, because
+	 * kvm_mmu_reset_context enables paging only if X86_CR0_PG is set in
+	 * CR0 (we need the paging so that KVM treats this guest as a paging
+	 * guest and we can easily forward page faults to L1.)
+	 */
+	vcpu->arch.cr0 |= X86_CR0_PG;
+
+	if (enable_ept && !nested_cpu_has_vmx_ept(vcpu)) {
+		vmcs_write32(GUEST_CR3, get_shadow_vmcs(vcpu)->guest_cr3);
+		vmx->vcpu.arch.cr3 = get_shadow_vmcs(vcpu)->guest_cr3;
+	} else {
+		int r;
+		kvm_set_cr3(vcpu, get_shadow_vmcs(vcpu)->guest_cr3);
+		kvm_mmu_reset_context(vcpu);
+
+		nested_unmap_current(vcpu);
+
+		r = kvm_mmu_load(vcpu);
+		if (unlikely(r)) {
+			printk(KERN_ERR "Error in kvm_mmu_load r %d\n", r);
+			set_rflags_to_vmx_fail_valid(vcpu);
+			/* switch back to L1 */
+			vmx->nested.nested_mode = 0;
+			vmx->vmcs = vmx->nested.l1_vmcs;
+			vcpu->cpu = vmx->nested.l1_state.cpu;
+			vmx->launched = vmx->nested.l1_state.launched;
+
+			vmx_vcpu_load(vcpu, get_cpu());
+			put_cpu();
+
+			return 1;
+		}
+
+		nested_map_current(vcpu);
+	}
+
+	kvm_register_write(vcpu, VCPU_REGS_RSP,
+			   get_shadow_vmcs(vcpu)->guest_rsp);
+	kvm_register_write(vcpu, VCPU_REGS_RIP,
+			   get_shadow_vmcs(vcpu)->guest_rip);
+
+	nested_unmap_current(vcpu);
+
+	return 1;
+}
+
 static struct kvm_x86_ops vmx_x86_ops = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,

^ permalink raw reply	[flat|nested] 147+ messages in thread

* [PATCH 17/24] No need for handle_vmx_insn function any more
  2010-06-13 12:22 [PATCH 0/24] Nested VMX, v5 Nadav Har'El
                   ` (15 preceding siblings ...)
  2010-06-13 12:30 ` [PATCH 16/24] Implement VMLAUNCH and VMRESUME Nadav Har'El
@ 2010-06-13 12:31 ` Nadav Har'El
  2010-06-13 12:31 ` [PATCH 18/24] Exiting from L2 to L1 Nadav Har'El
                   ` (9 subsequent siblings)
  26 siblings, 0 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-06-13 12:31 UTC (permalink / raw)
  To: avi; +Cc: kvm

Before nested VMX support, the exit handler for a guest executing a VMX
instruction (vmclear, vmlaunch, vmptrld, vmptrst, vmread, vmread, vmresume,
vmwrite, vmon, vmoff), was handle_vmx_insn(). This handler simply threw a #UD
exception. Now that all these exit reasons are properly handled (and emulate
the relevant VMX instruction), nothing calls this dummy handler and it can
be removed.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:30.000000000 +0300
@@ -4147,12 +4147,6 @@ static int handle_vmcall(struct kvm_vcpu
 	return 1;
 }
 
-static int handle_vmx_insn(struct kvm_vcpu *vcpu)
-{
-	kvm_queue_exception(vcpu, UD_VECTOR);
-	return 1;
-}
-
 /* Emulate the VMXON instruction.
  * Currently, we just remember that VMX is active, and do not save or even
  * inspect the argument to VMXON (the so-called "VMXON pointer") because we

^ permalink raw reply	[flat|nested] 147+ messages in thread

* [PATCH 18/24] Exiting from L2 to L1
  2010-06-13 12:22 [PATCH 0/24] Nested VMX, v5 Nadav Har'El
                   ` (16 preceding siblings ...)
  2010-06-13 12:31 ` [PATCH 17/24] No need for handle_vmx_insn function any more Nadav Har'El
@ 2010-06-13 12:31 ` Nadav Har'El
  2010-06-14 12:04   ` Avi Kivity
  2010-06-13 12:32 ` [PATCH 19/24] Deciding if L0 or L1 should handle an L2 exit Nadav Har'El
                   ` (8 subsequent siblings)
  26 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-06-13 12:31 UTC (permalink / raw)
  To: avi; +Cc: kvm

This patch implements nested_vmx_vmexit(), called when the nested L2 guest
exits and we want to run its L1 parent and let it handle this exit.

Note that this will not necessarily be called on every L2 exit. L0 may decide
to handle a particular exit on its own, without L1's involvement; In that
case, L0 will handle the exit, and resume running L2, without running L1 and
without calling nested_vmx_vmexit(). The logic for deciding whether to handle
a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
will appear in the next patch.
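
The overall shape of the dispatch is roughly the stand-alone sketch below
(illustrative only; l1_wants_exit() is a made-up stand-in for the
per-exit-reason checks that the next patch adds).

#include <stdio.h>
#include <stdbool.h>

/* On every L2 exit, L0 first decides whether L1 asked to intercept this
 * event; only then is the switch back to L1 performed. Otherwise L0
 * handles the exit itself and resumes L2 directly. */
static bool l1_wants_exit(int exit_reason)
{
	return exit_reason == 1;	/* stand-in for the real checks */
}

static void handle_l2_exit(int exit_reason)
{
	if (l1_wants_exit(exit_reason))
		printf("exit %d: switch to L1 (nested_vmx_vmexit)\n",
		       exit_reason);
	else
		printf("exit %d: handled by L0, resume L2\n", exit_reason);
}

int main(void)
{
	handle_l2_exit(0);
	handle_l2_exit(1);
	return 0;
}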

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:30.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:30.000000000 +0300
@@ -5080,9 +5080,13 @@ static void vmx_complete_interrupts(stru
 	int type;
 	bool idtv_info_valid;
 
+	vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);
+
+	if (vmx->nested.nested_mode)
+		return;
+
 	exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
 
-	vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);
 
 	/* Handle machine checks before interrupts are enabled */
 	if ((vmx->exit_reason == EXIT_REASON_MCE_DURING_VMENTRY)
@@ -5978,6 +5982,278 @@ static int nested_vmx_run(struct kvm_vcp
 	return 1;
 }
 
+/* prepare_vmcs_12 is called when the nested L2 guest exits and we want to
+ * prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12), and this
+ * function updates it to reflect the state of the registers during the exit,
+ * and to reflect some changes that happened while L2 was running (and perhaps
+ * made some exits which were handled directly by L0 without going back to L1).
+ */
+void prepare_vmcs_12(struct kvm_vcpu *vcpu)
+{
+	struct shadow_vmcs *vmcs12 = get_shadow_vmcs(vcpu);
+
+	vmcs12->guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR);
+	vmcs12->guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR);
+	vmcs12->guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR);
+	vmcs12->guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR);
+	vmcs12->guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR);
+	vmcs12->guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR);
+	vmcs12->guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR);
+	vmcs12->guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR);
+
+	vmcs12->tsc_offset = vmcs_read64(TSC_OFFSET);
+	vmcs12->guest_physical_address = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
+	vmcs12->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);
+	vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
+	if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
+		vmcs12->guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT);
+	vmcs12->cr3_target_count = vmcs_read32(CR3_TARGET_COUNT);
+	vmcs12->vm_entry_intr_info_field =
+		vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
+	vmcs12->vm_entry_exception_error_code =
+		vmcs_read32(VM_ENTRY_EXCEPTION_ERROR_CODE);
+	vmcs12->vm_entry_instruction_len =
+		vmcs_read32(VM_ENTRY_INSTRUCTION_LEN);
+	vmcs12->vm_instruction_error = vmcs_read32(VM_INSTRUCTION_ERROR);
+	vmcs12->vm_exit_reason  = vmcs_read32(VM_EXIT_REASON);
+	vmcs12->vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+	vmcs12->vm_exit_intr_error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
+	vmcs12->idt_vectoring_info_field =
+		vmcs_read32(IDT_VECTORING_INFO_FIELD);
+	vmcs12->idt_vectoring_error_code =
+		vmcs_read32(IDT_VECTORING_ERROR_CODE);
+	vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+	vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	vmcs12->guest_es_limit = vmcs_read32(GUEST_ES_LIMIT);
+	vmcs12->guest_cs_limit = vmcs_read32(GUEST_CS_LIMIT);
+	vmcs12->guest_ss_limit = vmcs_read32(GUEST_SS_LIMIT);
+	vmcs12->guest_ds_limit = vmcs_read32(GUEST_DS_LIMIT);
+	vmcs12->guest_fs_limit = vmcs_read32(GUEST_FS_LIMIT);
+	vmcs12->guest_gs_limit = vmcs_read32(GUEST_GS_LIMIT);
+	vmcs12->guest_ldtr_limit = vmcs_read32(GUEST_LDTR_LIMIT);
+	vmcs12->guest_tr_limit = vmcs_read32(GUEST_TR_LIMIT);
+	vmcs12->guest_gdtr_limit = vmcs_read32(GUEST_GDTR_LIMIT);
+	vmcs12->guest_idtr_limit = vmcs_read32(GUEST_IDTR_LIMIT);
+	vmcs12->guest_es_ar_bytes = vmcs_read32(GUEST_ES_AR_BYTES);
+	vmcs12->guest_cs_ar_bytes = vmcs_read32(GUEST_CS_AR_BYTES);
+	vmcs12->guest_ss_ar_bytes = vmcs_read32(GUEST_SS_AR_BYTES);
+	vmcs12->guest_ds_ar_bytes = vmcs_read32(GUEST_DS_AR_BYTES);
+	vmcs12->guest_fs_ar_bytes = vmcs_read32(GUEST_FS_AR_BYTES);
+	vmcs12->guest_gs_ar_bytes = vmcs_read32(GUEST_GS_AR_BYTES);
+	vmcs12->guest_ldtr_ar_bytes = vmcs_read32(GUEST_LDTR_AR_BYTES);
+	vmcs12->guest_tr_ar_bytes = vmcs_read32(GUEST_TR_AR_BYTES);
+	vmcs12->guest_interruptibility_info =
+		vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
+	vmcs12->guest_activity_state = vmcs_read32(GUEST_ACTIVITY_STATE);
+	vmcs12->guest_sysenter_cs = vmcs_read32(GUEST_SYSENTER_CS);
+
+	vmcs12->cr4_read_shadow = vmcs_readl(CR4_READ_SHADOW);
+	vmcs12->exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	vmcs12->guest_linear_address = vmcs_readl(GUEST_LINEAR_ADDRESS);
+
+	/* If any of the CR0_GUEST_HOST_MASK bits are off, the L2 guest may
+	 * have changed some cr0 bits without us ever saving them in the shadow
+	 * vmcs. So we need to save these changes now.
+	 * In the current code, the only GHM bit which can be off is TS (it
+	 * will be off when fpu_active and L1 also set it to off).
+	 */
+	vmcs12->guest_cr0 = vmcs_readl(GUEST_CR0);
+
+	/* But this may not be the guest_cr0 that the L1 guest hypervisor
+	 * actually thought it was giving its L2 guest. It is possible that
+	 * L1 wished to allow its guest to set a cr0 bit directly, but we (L0)
+	 * captured this attempt and instead set just the read shadow. If this
+	 * is the case, we need to copy these read-shadow bits back to guest_cr0,
+	 * where L1 believes they already are. Note that we must read the
+	 * actual CR0_READ_SHADOW (which is what L0 may have changed), not
+	 * vmcs12->cr0_read_shadow (which L1 defined, and we don't
+	 * change without being told by L1). Currently, the only bit where
+	 * this can happen is TS.
+	 */
+	if (!(vcpu->arch.cr0_guest_owned_bits & X86_CR0_TS)
+			&& !(vmcs12->cr0_guest_host_mask & X86_CR0_TS))
+		vmcs12->guest_cr0 =
+			(vmcs12->guest_cr0 & ~X86_CR0_TS) |
+			(vmcs_readl(CR0_READ_SHADOW) & X86_CR0_TS);
+
+	vmcs12->guest_cr4 = vmcs_readl(GUEST_CR4);
+	vmcs12->guest_es_base = vmcs_readl(GUEST_ES_BASE);
+	vmcs12->guest_cs_base = vmcs_readl(GUEST_CS_BASE);
+	vmcs12->guest_ss_base = vmcs_readl(GUEST_SS_BASE);
+	vmcs12->guest_ds_base = vmcs_readl(GUEST_DS_BASE);
+	vmcs12->guest_fs_base = vmcs_readl(GUEST_FS_BASE);
+	vmcs12->guest_gs_base = vmcs_readl(GUEST_GS_BASE);
+	vmcs12->guest_ldtr_base = vmcs_readl(GUEST_LDTR_BASE);
+	vmcs12->guest_tr_base = vmcs_readl(GUEST_TR_BASE);
+	vmcs12->guest_gdtr_base = vmcs_readl(GUEST_GDTR_BASE);
+	vmcs12->guest_idtr_base = vmcs_readl(GUEST_IDTR_BASE);
+	vmcs12->guest_dr7 = vmcs_readl(GUEST_DR7);
+	vmcs12->guest_rsp = vmcs_readl(GUEST_RSP);
+	vmcs12->guest_rip = vmcs_readl(GUEST_RIP);
+	vmcs12->guest_rflags = vmcs_readl(GUEST_RFLAGS);
+	vmcs12->guest_pending_dbg_exceptions =
+		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
+	vmcs12->guest_sysenter_esp = vmcs_readl(GUEST_SYSENTER_ESP);
+	vmcs12->guest_sysenter_eip = vmcs_readl(GUEST_SYSENTER_EIP);
+}
+
+int switch_back_vmcs(struct kvm_vcpu *vcpu)
+{
+	struct shadow_vmcs *src = to_vmx(vcpu)->nested.l1_shadow_vmcs;
+
+	if (enable_vpid && src->virtual_processor_id != 0)
+		vmcs_write16(VIRTUAL_PROCESSOR_ID, src->virtual_processor_id);
+
+	vmcs_write64(IO_BITMAP_A, src->io_bitmap_a);
+	vmcs_write64(IO_BITMAP_B, src->io_bitmap_b);
+
+	if (cpu_has_vmx_msr_bitmap())
+		vmcs_write64(MSR_BITMAP, src->msr_bitmap);
+
+	vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, src->virtual_apic_page_addr);
+
+	if (vm_need_virtualize_apic_accesses(vcpu->kvm))
+		vmcs_write64(APIC_ACCESS_ADDR,
+			     src->apic_access_addr);
+
+	if (enable_ept) {
+		vmcs_write64(EPT_POINTER, src->ept_pointer);
+		vmcs_write64(GUEST_PDPTR0, src->guest_pdptr0);
+		vmcs_write64(GUEST_PDPTR1, src->guest_pdptr1);
+		vmcs_write64(GUEST_PDPTR2, src->guest_pdptr2);
+		vmcs_write64(GUEST_PDPTR3, src->guest_pdptr3);
+	}
+
+	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, src->pin_based_vm_exec_control);
+	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, src->cpu_based_vm_exec_control);
+	vmcs_write32(EXCEPTION_BITMAP, src->exception_bitmap);
+	vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK,
+		     src->page_fault_error_code_mask);
+	vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH,
+		     src->page_fault_error_code_match);
+	vmcs_write32(VM_EXIT_CONTROLS, src->vm_exit_controls);
+	vmcs_write32(VM_ENTRY_CONTROLS, src->vm_entry_controls);
+
+	if (cpu_has_secondary_exec_ctrls())
+		vmcs_write32(SECONDARY_VM_EXEC_CONTROL,
+			     src->secondary_vm_exec_control);
+
+	load_vmcs_common(src);
+
+	load_vmcs_host_state(to_vmx(vcpu)->nested.l1_shadow_vmcs);
+
+	return 0;
+}
+
+static int nested_vmx_vmexit(struct kvm_vcpu *vcpu,
+			     bool is_interrupt)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	int efer_offset;
+
+	if (!vmx->nested.nested_mode) {
+		printk(KERN_INFO "WARNING: %s called but not in nested mode\n",
+		       __func__);
+		return 0;
+	}
+
+	sync_cached_regs_to_vmcs(vcpu);
+
+	if (!nested_map_current(vcpu)) {
+		printk(KERN_INFO "Error mapping shadow vmcs\n");
+		set_rflags_to_vmx_fail_valid(vcpu);
+		return 1;
+	}
+
+	prepare_vmcs_12(vcpu);
+	if (is_interrupt)
+		get_shadow_vmcs(vcpu)->vm_exit_reason =
+			EXIT_REASON_EXTERNAL_INTERRUPT;
+
+	vmx->nested.current_l2_page->launched = vmx->launched;
+	vmx->nested.current_l2_page->cpu = vcpu->cpu;
+
+	nested_unmap_current(vcpu);
+
+	vmx->vmcs = vmx->nested.l1_vmcs;
+	vcpu->cpu = vmx->nested.l1_state.cpu;
+	vmx->launched = vmx->nested.l1_state.launched;
+
+	vmx_vcpu_load(vcpu, get_cpu());
+	put_cpu();
+
+	vcpu->arch.efer = vmx->nested.l1_state.efer;
+	if ((vcpu->arch.efer & EFER_LMA) &&
+	    !(vcpu->arch.efer & EFER_SCE))
+		vcpu->arch.efer |= EFER_SCE;
+
+	efer_offset = __find_msr_index(vmx, MSR_EFER);
+	if (update_transition_efer(vmx, efer_offset))
+		wrmsrl(MSR_EFER, vmx->guest_msrs[efer_offset].data);
+
+	/* We're running a regular L1 guest again, so we do the regular KVM
+	 * thing: run vmx_set_cr0 with the cr0 bits the guest thinks it has
+	 * (this can be figured out by combining its old guest_cr0 and
+	 * cr0_read_shadow, using the cr0_guest_host_mask). vmx_set_cr0 might
+	 * use slightly different bits on the new guest_cr0 it sets, e.g.,
+	 * add TS when !fpu_active.
+	 */
+	vmx_set_cr0(vcpu,
+		(vmx->nested.l1_shadow_vmcs->cr0_guest_host_mask &
+		vmx->nested.l1_shadow_vmcs->cr0_read_shadow) |
+		(~vmx->nested.l1_shadow_vmcs->cr0_guest_host_mask &
+		vmx->nested.l1_shadow_vmcs->guest_cr0));
+
+	vmx_set_cr4(vcpu, vmx->nested.l1_state.cr4);
+
+	if (enable_ept) {
+		vcpu->arch.cr3 = vmx->nested.l1_shadow_vmcs->guest_cr3;
+		vmcs_write32(GUEST_CR3, vmx->nested.l1_shadow_vmcs->guest_cr3);
+	} else {
+		kvm_set_cr3(vcpu, vmx->nested.l1_state.cr3);
+	}
+
+	if (!nested_map_current(vcpu)) {
+		printk(KERN_INFO "Error mapping shadow vmcs\n");
+		set_rflags_to_vmx_fail_valid(vcpu);
+		return 1;
+	}
+
+	switch_back_vmcs(vcpu);
+
+	nested_unmap_current(vcpu);
+
+	kvm_register_write(vcpu, VCPU_REGS_RSP,
+			   vmx->nested.l1_shadow_vmcs->guest_rsp);
+	kvm_register_write(vcpu, VCPU_REGS_RIP,
+			   vmx->nested.l1_shadow_vmcs->guest_rip);
+
+	vmx->nested.nested_mode = 0;
+
+	/* If we did fpu_activate()/fpu_deactivate() during l2's run, we need
+	 * to apply the same changes also when running l1. We don't need to
+	 * change cr0 here - we already did this above - just the
+	 * cr0_guest_host_mask, and exception bitmap.
+	 */
+	vmcs_write32(EXCEPTION_BITMAP,
+		(vmx->nested.l1_shadow_vmcs->exception_bitmap &
+			~(1u<<NM_VECTOR)) |
+			(vcpu->fpu_active ? 0 : (1u<<NM_VECTOR)));
+	vcpu->arch.cr0_guest_owned_bits = (vcpu->fpu_active ? X86_CR0_TS : 0);
+	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
+
+	kvm_mmu_reset_context(vcpu);
+	kvm_mmu_load(vcpu);
+
+	if (unlikely(vmx->fail)) {
+		vmx->fail = 0;
+		set_rflags_to_vmx_fail_valid(vcpu);
+	} else
+		clear_rflags_cf_zf(vcpu);
+
+	return 0;
+}
+
 static struct kvm_x86_ops vmx_x86_ops = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,

^ permalink raw reply	[flat|nested] 147+ messages in thread

* [PATCH 19/24] Deciding if L0 or L1 should handle an L2 exit
  2010-06-13 12:22 [PATCH 0/24] Nested VMX, v5 Nadav Har'El
                   ` (17 preceding siblings ...)
  2010-06-13 12:31 ` [PATCH 18/24] Exiting from L2 to L1 Nadav Har'El
@ 2010-06-13 12:32 ` Nadav Har'El
  2010-06-14 12:24   ` Avi Kivity
  2010-06-13 12:32 ` [PATCH 20/24] Correct handling of interrupt injection Nadav Har'El
                   ` (7 subsequent siblings)
  26 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-06-13 12:32 UTC (permalink / raw)
  To: avi; +Cc: kvm

This patch contains the logic of whether an L2 exit should be handled by L0
and then L2 should be resumed, or whether L1 should be run to handle this
exit (using the nested_vmx_vmexit() function of the previous patch).

The basic idea is to let L1 handle the exit only if it actually asked to
trap this sort of event. For example, when L2 exits on a change to CR0,
we check L1's CR0_GUEST_HOST_MASK to see if L1 expressed interest in any
bit which changed; if it did, we exit to L1. But if it didn't, it means that
we (L0) were the ones who wished to trap this event, so we handle it ourselves.
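
As a toy illustration of that decision (stand-alone, not the vmx.c code; the
real handler decodes the MOV-to-CR operands from the exit qualification
instead of taking old and new CR0 values directly):

#include <stdio.h>
#include <stdbool.h>

#define X86_CR0_TS (1ul << 3)
#define X86_CR0_PG (1ul << 31)

/* L1 gets the exit only if a bit the L2 guest tried to change is covered
 * by L1's CR0_GUEST_HOST_MASK. */
static bool l1_handles_cr0_write(unsigned long l1_mask,
				 unsigned long old_cr0, unsigned long new_cr0)
{
	return ((old_cr0 ^ new_cr0) & l1_mask) != 0;
}

int main(void)
{
	unsigned long l1_mask = X86_CR0_PG;	/* L1 only cares about PG */

	/* L2 toggles TS: not in L1's mask, L0 handles it and resumes L2 */
	printf("%d\n", l1_handles_cr0_write(l1_mask, 0x80000001, 0x80000009));
	/* L2 clears PG: in L1's mask, so we exit to L1 */
	printf("%d\n", l1_handles_cr0_write(l1_mask, 0x80000001, 0x00000001));
	return 0;
}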

The next two patches add additional logic of what to do when an interrupt or
exception is injected: Does L0 need to do it, should we exit to L1 to do it,
or should we resume L2 and keep the exception to be injected later.

We keep a new flag, "nested_run_pending", which can override the decision of
which should run next, L1 or L2. nested_run_pending=1 means that we *must* run
L2 next, not L1. This is necessary in several situations where, had L1 run on
bare metal, it would not have expected to be resumed at this stage. One
example is when L1 did a VMLAUNCH of L2 and therefore expects L2 to be run.
Another example is when L2 exits on an #NM exception that L0 asked for
(because of lazy FPU loading), and L0 must deal with the exception and resume
L2 which was in a middle of an instruction, and not resume L1 which does not
expect to see an exit from L2 at this point. nested_run_pending is especially
intended to avoid switching to L1 in the injection decision-point described
above.
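
Schematically (a simplified sketch; the full code is in the diff below):

	/* set on every exit, in vmx_handle_exit(): */
	vmx->nested.nested_run_pending =
		(exit_reason == EXIT_REASON_VMLAUNCH ||
		 exit_reason == EXIT_REASON_VMRESUME);

	/* and checked first thing in nested_vmx_exit_handled(): */
	if (vmx->nested.nested_run_pending)
		return 0;	/* never switch to L1; L2 must run next */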

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:30.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:30.000000000 +0300
@@ -318,6 +318,8 @@ struct nested_vmx {
 	struct shadow_vmcs *l1_shadow_vmcs;
 	/* Level 1 vmcs loaded into the processor */
 	struct vmcs *l1_vmcs;
+	/* L2 must run next, and mustn't decide to exit to L1. */
+	bool nested_run_pending;
 };
 
 enum vmcs_field_type {
@@ -900,6 +902,24 @@ static inline bool nested_cpu_has_vmx_ep
 }
 
 
+static inline bool nested_cpu_has_vmx_msr_bitmap(struct kvm_vcpu *vcpu)
+{
+	return get_shadow_vmcs(vcpu)->cpu_based_vm_exec_control &
+		CPU_BASED_USE_MSR_BITMAPS;
+}
+
+static inline bool is_exception(u32 intr_info)
+{
+	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
+		== (INTR_TYPE_HARD_EXCEPTION | INTR_INFO_VALID_MASK);
+}
+
+static inline bool is_nmi(u32 intr_info)
+{
+	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
+		== (INTR_TYPE_NMI_INTR | INTR_INFO_VALID_MASK);
+}
+
 static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
 {
 	int i;
@@ -3694,6 +3714,8 @@ static void vmx_set_nmi_mask(struct kvm_
 	}
 }
 
+static int nested_vmx_vmexit(struct kvm_vcpu *vcpu, bool is_interrupt);
+
 static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
 {
 	return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
@@ -3819,6 +3841,8 @@ static int handle_exception(struct kvm_v
 
 	if (is_no_device(intr_info)) {
 		vmx_fpu_activate(vcpu);
+		if (vmx->nested.nested_mode)
+			vmx->nested.nested_run_pending = 1;
 		return 1;
 	}
 
@@ -4989,6 +5013,202 @@ static int (*kvm_vmx_exit_handlers[])(st
 static const int kvm_vmx_max_exit_handlers =
 	ARRAY_SIZE(kvm_vmx_exit_handlers);
 
+/* Return 1 if we should exit from L2 to L1 to handle an MSR access exit,
+ * rather than handle it ourselves in L0. I.e., check whether L1's MSR bitmap
+ * expressed interest in the current event (a read or write of a specific MSR).
+ */
+static bool nested_vmx_exit_handled_msr(struct kvm_vcpu *vcpu,
+	struct shadow_vmcs *l2svmcs, u32 exit_code)
+{
+	u32 msr_index = vcpu->arch.regs[VCPU_REGS_RCX];
+	struct page *msr_bitmap_page;
+	void *va;
+	bool ret;
+
+	if (!cpu_has_vmx_msr_bitmap() || !nested_cpu_has_vmx_msr_bitmap(vcpu))
+		return 1;
+
+	msr_bitmap_page = nested_get_page(vcpu, l2svmcs->msr_bitmap);
+	if (!msr_bitmap_page) {
+		printk(KERN_INFO "%s error in nested_get_page\n", __func__);
+		return 0;
+	}
+
+	va = kmap_atomic(msr_bitmap_page, KM_USER1);
+	if (exit_code == EXIT_REASON_MSR_WRITE)
+		va += 0x800;
+	if (msr_index >= 0xc0000000) {
+		msr_index -= 0xc0000000;
+		va += 0x400;
+	}
+	if (msr_index > 0x1fff) {
+		kunmap_atomic(va, KM_USER1);
+		return 0;
+	}
+	ret = test_bit(msr_index, va);
+	kunmap_atomic(va, KM_USER1);
+	return ret;
+}
+
+/* Return 1 if we should exit from L2 to L1 to handle a CR access exit,
+ * rather than handle it ourselves in L0. I.e., check if L1 wanted to
+ * intercept (via guest_host_mask etc.) the current event.
+ */
+static bool nested_vmx_exit_handled_cr(struct kvm_vcpu *vcpu,
+	struct shadow_vmcs *l2svmcs)
+{
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	int cr = exit_qualification & 15;
+	int reg = (exit_qualification >> 8) & 15;
+	unsigned long val = kvm_register_read(vcpu, reg);
+
+	switch ((exit_qualification >> 4) & 3) {
+	case 0: /* mov to cr */
+		switch (cr) {
+		case 0:
+			if (l2svmcs->cr0_guest_host_mask &
+			    (val ^ l2svmcs->cr0_read_shadow))
+				return 1;
+			break;
+		case 3:
+			if (l2svmcs->cpu_based_vm_exec_control &
+			    CPU_BASED_CR3_LOAD_EXITING)
+				return 1;
+			break;
+		case 4:
+			if (l2svmcs->cr4_guest_host_mask &
+			    (l2svmcs->cr4_read_shadow ^ val))
+				return 1;
+			break;
+		case 8:
+			if (l2svmcs->cpu_based_vm_exec_control &
+			    CPU_BASED_CR8_LOAD_EXITING)
+				return 1;
+			break;
+		}
+		break;
+	case 2: /* clts */
+		if (l2svmcs->cr0_guest_host_mask & X86_CR0_TS)
+			return 1;
+		break;
+	case 1: /* mov from cr */
+		switch (cr) {
+		case 0:
+			return 1;
+		case 3:
+			if (l2svmcs->cpu_based_vm_exec_control &
+			    CPU_BASED_CR3_STORE_EXITING)
+				return 1;
+			break;
+		case 4:
+			return 1;
+			break;
+		case 8:
+			if (l2svmcs->cpu_based_vm_exec_control &
+			    CPU_BASED_CR8_STORE_EXITING)
+				return 1;
+			break;
+		}
+		break;
+	case 3: /* lmsw */
+		if (l2svmcs->cr0_guest_host_mask &
+		    (val ^ l2svmcs->cr0_read_shadow))
+			return 1;
+		break;
+	}
+	return 0;
+}
+
+/* Return 1 if we should exit from L2 to L1 to handle an exit, or 0 if we
+ * should handle it ourselves in L0. Only call this when in nested_mode (L2).
+ */
+static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu, bool afterexit)
+{
+	u32 exit_code = vmcs_read32(VM_EXIT_REASON);
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	u32 intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+	struct shadow_vmcs *l2svmcs;
+	int r = 0;
+
+	if (vmx->nested.nested_run_pending)
+		return 0;
+
+	if (unlikely(vmx->fail)) {
+		printk(KERN_INFO "%s failed vm entry %x\n",
+		       __func__, vmcs_read32(VM_INSTRUCTION_ERROR));
+		return 1;
+	}
+
+	if (afterexit) {
+		/* There are some cases where we should let L1 handle certain
+		 * events when these are injected (afterexit==0) but we should
+		 * handle them in L0 on an exit (afterexit==1).
+		 */
+		switch (exit_code) {
+		case EXIT_REASON_EXTERNAL_INTERRUPT:
+			return 0;
+		case EXIT_REASON_EXCEPTION_NMI:
+			if (!is_exception(intr_info))
+				return 0;
+			if (is_page_fault(intr_info) && (!enable_ept))
+				return 0;
+			break;
+		case EXIT_REASON_EPT_VIOLATION:
+			if (enable_ept)
+				return 0;
+			break;
+		}
+	}
+
+	if (!nested_map_current(vcpu))
+		return 0;
+	l2svmcs = get_shadow_vmcs(vcpu);
+
+	switch (exit_code) {
+	case EXIT_REASON_INVLPG:
+		if (l2svmcs->cpu_based_vm_exec_control &
+		    CPU_BASED_INVLPG_EXITING)
+			r = 1;
+		break;
+	case EXIT_REASON_MSR_READ:
+	case EXIT_REASON_MSR_WRITE:
+		r = nested_vmx_exit_handled_msr(vcpu, l2svmcs, exit_code);
+		break;
+	case EXIT_REASON_CR_ACCESS:
+		r = nested_vmx_exit_handled_cr(vcpu, l2svmcs);
+		break;
+	case EXIT_REASON_DR_ACCESS:
+		if (l2svmcs->cpu_based_vm_exec_control &
+		    CPU_BASED_MOV_DR_EXITING)
+			r = 1;
+		break;
+	case EXIT_REASON_EXCEPTION_NMI:
+		if (is_external_interrupt(intr_info) &&
+		    (l2svmcs->pin_based_vm_exec_control &
+		     PIN_BASED_EXT_INTR_MASK))
+			r = 1;
+		else if (is_nmi(intr_info) &&
+		    (l2svmcs->pin_based_vm_exec_control &
+		     PIN_BASED_NMI_EXITING))
+			r = 1;
+		else if (is_exception(intr_info) &&
+		    (l2svmcs->exception_bitmap &
+		     (1u << (intr_info & INTR_INFO_VECTOR_MASK))))
+			r = 1;
+		else if (is_page_fault(intr_info))
+			r = 1;
+		break;
+	case EXIT_REASON_EXTERNAL_INTERRUPT:
+		if (l2svmcs->pin_based_vm_exec_control &
+		    PIN_BASED_EXT_INTR_MASK)
+			r = 1;
+		break;
+	default:
+		r = 1;
+	}
+	nested_unmap_current(vcpu);
+
+	return r;
+}
+
 /*
  * The guest has exited.  See if we can fix it or if we need userspace
  * assistance.
@@ -5005,6 +5225,17 @@ static int vmx_handle_exit(struct kvm_vc
 	if (vmx->emulation_required && emulate_invalid_guest_state)
 		return handle_invalid_guest_state(vcpu);
 
+	if (exit_reason == EXIT_REASON_VMLAUNCH ||
+	    exit_reason == EXIT_REASON_VMRESUME)
+		vmx->nested.nested_run_pending = 1;
+	else
+		vmx->nested.nested_run_pending = 0;
+
+	if (vmx->nested.nested_mode && nested_vmx_exit_handled(vcpu, true)) {
+		nested_vmx_vmexit(vcpu, false);
+		return 1;
+	}
+
 	/* Access CR3 don't cause VMExit in paging mode, so we need
 	 * to sync with guest real CR3. */
 	if (enable_ept && is_paging(vcpu))
@@ -5956,6 +6187,7 @@ static int nested_vmx_run(struct kvm_vcp
 		r = kvm_mmu_load(vcpu);
 		if (unlikely(r)) {
 			printk(KERN_ERR "Error in kvm_mmu_load r %d\n", r);
+			nested_vmx_vmexit(vcpu, false);
 			set_rflags_to_vmx_fail_valid(vcpu);
 			/* switch back to L1 */
 			vmx->nested.nested_mode = 0;

^ permalink raw reply	[flat|nested] 147+ messages in thread

* [PATCH 20/24] Correct handling of interrupt injection
  2010-06-13 12:22 [PATCH 0/24] Nested VMX, v5 Nadav Har'El
                   ` (18 preceding siblings ...)
  2010-06-13 12:32 ` [PATCH 19/24] Deciding if L0 or L1 should handle an L2 exit Nadav Har'El
@ 2010-06-13 12:32 ` Nadav Har'El
  2010-06-14 12:29   ` Avi Kivity
  2010-06-13 12:33 ` [PATCH 21/24] Correct handling of exception injection Nadav Har'El
                   ` (6 subsequent siblings)
  26 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-06-13 12:32 UTC (permalink / raw)
  To: avi; +Cc: kvm

When KVM wants to inject an interrupt, the guest should think a real interrupt
has happened. Normally (in the non-nested case) this means checking that the
guest doesn't block interrupts (and if it does, inject when it doesn't - using
the "interrupt window" VMX mechanism), and setting up the appropriate VMCS
fields for the guest to receive the interrupt.

However, when we are running a nested guest (L2) and its hypervisor (L1)
requested exits on interrupts (as most hypervisors do), the most efficient
thing to do is to exit L2, telling L1 that the exit was caused by an
interrupt - the one we were injecting. Only if L1 asked not to be notified of
interrupts should we inject it directly into the running guest L2 (i.e., follow
the normal code path).

However, properly doing what is described above requires invasive changes to
the flow of the existing code, which we elected not to do at this stage.
Instead we do something more simplistic and less efficient: we modify
vmx_interrupt_allowed(), which kvm calls to see if it can inject the interrupt
now, to exit from L2 to L1 before continuing the normal code. The normal kvm
code then notices that L1 is blocking interrupts, and sets the interrupt
window to inject the interrupt later to L1. Shortly after, L1 gets the
interrupt while it is itself running, not as an exit from L2. The cost is an
extra L1 exit (the interrupt window).
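
The resulting flow, roughly (a sketch of the intent, not code from the patch):

	/*
	 * KVM wants to inject an interrupt while L2 runs and L1 traps
	 * external interrupts:
	 *
	 *   vmx_interrupt_allowed()
	 *     -> nested_vmx_vmexit(vcpu, true)  switch to L1, report the exit
	 *     -> fall through: can L1 itself take the interrupt right now?
	 *   enable_irq_window()                 if not, request an interrupt
	 *                                       window exit while L1 runs
	 *   ... L1 runs and enables interrupts ...
	 *   the interrupt is injected directly into L1 (the extra cost is
	 *   that one interrupt-window exit)
	 */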

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:30.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:30.000000000 +0300
@@ -3591,9 +3591,29 @@ out:
 	return ret;
 }
 
+/* In nested virtualization, check if L1 asked to exit on external interrupts.
+ * For most existing hypervisors, this will always return true.
+ */
+static bool nested_exit_on_intr(struct kvm_vcpu *vcpu)
+{
+	int ret;
+	if (!nested_map_current(vcpu))
+		return 0;
+	ret = get_shadow_vmcs(vcpu)->pin_based_vm_exec_control &
+		PIN_BASED_EXT_INTR_MASK;
+	nested_unmap_current(vcpu);
+	return ret;
+}
+
 static void enable_irq_window(struct kvm_vcpu *vcpu)
 {
 	u32 cpu_based_vm_exec_control;
+	if (to_vmx(vcpu)->nested.nested_mode && nested_exit_on_intr(vcpu))
+		/* We can get here when nested_run_pending caused
+		 * vmx_interrupt_allowed() to return false. In this case, do
+		 * nothing - the interrupt will be injected later.
+		 */
+		return;
 
 	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
 	cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
@@ -3718,6 +3738,13 @@ static int nested_vmx_vmexit(struct kvm_
 
 static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
 {
+	if (to_vmx(vcpu)->nested.nested_mode && nested_exit_on_intr(vcpu)) {
+		if (to_vmx(vcpu)->nested.nested_run_pending)
+			return 0;
+		nested_vmx_vmexit(vcpu, true);
+		/* fall through to normal code, but now in L1, not L2 */
+	}
+
 	return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
 		!(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
 			(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));

^ permalink raw reply	[flat|nested] 147+ messages in thread

* [PATCH 21/24] Correct handling of exception injection
  2010-06-13 12:22 [PATCH 0/24] Nested VMX, v5 Nadav Har'El
                   ` (19 preceding siblings ...)
  2010-06-13 12:32 ` [PATCH 20/24] Correct handling of interrupt injection Nadav Har'El
@ 2010-06-13 12:33 ` Nadav Har'El
  2010-06-13 12:33 ` [PATCH 22/24] Correct handling of idt vectoring info Nadav Har'El
                   ` (5 subsequent siblings)
  26 siblings, 0 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-06-13 12:33 UTC (permalink / raw)
  To: avi; +Cc: kvm

Similar to the previous patch, but concerning injection of exceptions rather
than external interrupts.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:30.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:30.000000000 +0300
@@ -1564,6 +1564,9 @@ static void skip_emulated_instruction(st
 	vmx_set_interrupt_shadow(vcpu, 0);
 }
 
+static int nested_vmx_check_exception(struct kvm_vcpu *vcpu, unsigned nr,
+				bool has_error_code, u32 error_code);
+
 static void vmx_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
 				bool has_error_code, u32 error_code,
 				bool reinject)
@@ -1571,6 +1574,9 @@ static void vmx_queue_exception(struct k
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u32 intr_info = nr | INTR_INFO_VALID_MASK;
 
+	if (nested_vmx_check_exception(vcpu, nr, has_error_code, error_code))
+		return;
+
 	if (has_error_code) {
 		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, error_code);
 		intr_info |= INTR_INFO_DELIVER_CODE_MASK;
@@ -3670,6 +3676,9 @@ static void vmx_inject_nmi(struct kvm_vc
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
+	if (vmx->nested.nested_mode)
+		return;
+
 	if (!cpu_has_virtual_nmis()) {
 		/*
 		 * Tracking the NMI-blocked state in software is built upon
@@ -6513,6 +6522,26 @@ static int nested_vmx_vmexit(struct kvm_
 	return 0;
 }
 
+static int nested_vmx_check_exception(struct kvm_vcpu *vcpu, unsigned nr,
+				      bool has_error_code, u32 error_code)
+{
+	if (!to_vmx(vcpu)->nested.nested_mode)
+		return 0;
+	if (!nested_vmx_exit_handled(vcpu, false))
+		return 0;
+	nested_vmx_vmexit(vcpu, false);
+	if (!nested_map_current(vcpu))
+		return 1;
+	get_shadow_vmcs(vcpu)->vm_exit_reason = EXIT_REASON_EXCEPTION_NMI;
+	get_shadow_vmcs(vcpu)->vm_exit_intr_info = (nr
+		| INTR_TYPE_HARD_EXCEPTION | INTR_INFO_VALID_MASK
+		| (has_error_code ?  INTR_INFO_DELIVER_CODE_MASK : 0));
+	if (has_error_code)
+		get_shadow_vmcs(vcpu)->vm_exit_intr_error_code = error_code;
+	nested_unmap_current(vcpu);
+	return 1;
+}
+
 static struct kvm_x86_ops vmx_x86_ops = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,

^ permalink raw reply	[flat|nested] 147+ messages in thread

* [PATCH 22/24] Correct handling of idt vectoring info
  2010-06-13 12:22 [PATCH 0/24] Nested VMX, v5 Nadav Har'El
                   ` (20 preceding siblings ...)
  2010-06-13 12:33 ` [PATCH 21/24] Correct handling of exception injection Nadav Har'El
@ 2010-06-13 12:33 ` Nadav Har'El
  2010-06-17 11:58   ` Gleb Natapov
  2010-06-13 12:34 ` [PATCH 23/24] Handling of CR0.TS and #NM for Lazy FPU loading Nadav Har'El
                   ` (4 subsequent siblings)
  26 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-06-13 12:33 UTC (permalink / raw)
  To: avi; +Cc: kvm

This patch adds correct handling of IDT_VECTORING_INFO_FIELD for the nested
case.

When a guest exits while handling an interrupt or exception, we get this
information in IDT_VECTORING_INFO_FIELD in the VMCS. When L2 exits to L1,
there's nothing we need to do, because L1 will see this field in vmcs12, and
handle it itself. However, when L2 exits and L0 handles the exit itself and
plans to return to L2, L0 must inject this event to L2.

In the normal non-nested case, the idt_vectoring_info is handled after
the exit. However, in the nested case a decision of whether to return to L2
or L1 also happens during the injection phase (see the previous patches), so
in the nested case we have to treat the idt_vectoring_info right after the
injection, i.e., in the beginning of vmx_vcpu_run, which is the first time
we know for sure if we're staying in L2 (i.e., nested_mode is true).
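
Schematically (a sketch of the ordering, not code from the patch):

	/*
	 * non-nested: exit -> handle exit -> vmx_complete_interrupts() saves
	 *             idt_vectoring_info -> the event is re-injected on the
	 *             next entry
	 *
	 * nested:     exit from L2 -> the injection phase may decide to
	 *             switch to L1 -> only at the top of vmx_vcpu_run() do we
	 *             know we are still in L2, so that is where the saved
	 *             idt_vectoring_info is copied into VM_ENTRY_INTR_INFO_FIELD
	 */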

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:30.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:30.000000000 +0300
@@ -320,6 +320,10 @@ struct nested_vmx {
 	struct vmcs *l1_vmcs;
 	/* L2 must run next, and mustn't decide to exit to L1. */
 	bool nested_run_pending;
+	/* true if last exit was of L2, and had a valid idt_vectoring_info */
+	bool valid_idt_vectoring_info;
+	/* These are saved if valid_idt_vectoring_info */
+	u32 vm_exit_instruction_len, idt_vectoring_error_code;
 };
 
 enum vmcs_field_type {
@@ -5460,6 +5464,22 @@ static void fixup_rmode_irq(struct vcpu_
 		| vmx->rmode.irq.vector;
 }
 
+static void nested_handle_valid_idt_vectoring_info(struct vcpu_vmx *vmx)
+{
+	int irq  = vmx->idt_vectoring_info & VECTORING_INFO_VECTOR_MASK;
+	int type = vmx->idt_vectoring_info & VECTORING_INFO_TYPE_MASK;
+	int errCodeValid = vmx->idt_vectoring_info &
+		VECTORING_INFO_DELIVER_CODE_MASK;
+	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+		irq | type | INTR_INFO_VALID_MASK | errCodeValid);
+
+	vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
+		vmx->nested.vm_exit_instruction_len);
+	if (errCodeValid)
+		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
+			vmx->nested.idt_vectoring_error_code);
+}
+
 static inline void sync_cached_regs_to_vmcs(struct kvm_vcpu *vcpu)
 {
 	if (test_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_dirty))
@@ -5481,6 +5501,9 @@ static void vmx_vcpu_run(struct kvm_vcpu
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
+	if (vmx->nested.nested_mode && vmx->nested.valid_idt_vectoring_info)
+		nested_handle_valid_idt_vectoring_info(vmx);
+
 	/* Record the guest's net vcpu time for enforced NMI injections. */
 	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked))
 		vmx->entry_time = ktime_get();
@@ -5600,6 +5623,16 @@ static void vmx_vcpu_run(struct kvm_vcpu
 				  | (1 << VCPU_EXREG_PDPTR));
 
 	vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
+
+	vmx->nested.valid_idt_vectoring_info = vmx->nested.nested_mode &&
+		(vmx->idt_vectoring_info & VECTORING_INFO_VALID_MASK);
+	if (vmx->nested.valid_idt_vectoring_info) {
+		vmx->nested.vm_exit_instruction_len =
+			vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+		vmx->nested.idt_vectoring_error_code =
+			vmcs_read32(IDT_VECTORING_ERROR_CODE);
+	}
+
 	if (vmx->rmode.irq.pending)
 		fixup_rmode_irq(vmx);
 

^ permalink raw reply	[flat|nested] 147+ messages in thread

* [PATCH 23/24] Handling of CR0.TS and #NM for Lazy FPU loading
  2010-06-13 12:22 [PATCH 0/24] Nested VMX, v5 Nadav Har'El
                   ` (21 preceding siblings ...)
  2010-06-13 12:33 ` [PATCH 22/24] Correct handling of idt vectoring info Nadav Har'El
@ 2010-06-13 12:34 ` Nadav Har'El
  2010-06-13 12:34 ` [PATCH 24/24] Miscellaneous small corrections Nadav Har'El
                   ` (3 subsequent siblings)
  26 siblings, 0 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-06-13 12:34 UTC (permalink / raw)
  To: avi; +Cc: kvm

KVM's "Lazy FPU loading" means that sometimes L0 needs to set CR0.TS, even
if a guest didn't set it. Moreover, L0 must also trap CR0.TS changes and
#NM exceptions, even if we have a guest hypervisor (L1) that didn't want these
traps. And of course, conversely: If L1 wanted to trap these events, we
must let it, even if L0 is not interested in them.

This patch fixes some existing KVM code (in update_exception_bitmap(),
vmx_fpu_activate(), vmx_fpu_deactivate(), handle_cr()) to do the correct
merging of L0's and L1's needs. Note that new code introduced in previous
patches already handles CR0 correctly (see prepare_vmcs_02(),
prepare_vmcs_12(), and nested_vmx_vmexit()).
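
The general idea, illustrated (the l0_* name below is ours, not the patch's):

	/* While L2 runs, trap an exception if either L0 or L1 wants it: */
	u32 eb = l0_wanted_exception_bitmap | vmcs12->exception_bitmap;
	vmcs_write32(EXCEPTION_BITMAP, eb);

	/* A CR0 bit is guest-owned only if neither L0 nor L1 wants to
	 * intercept it, i.e., the hardware mask is the union of both masks:
	 */
	vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);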

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:30.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:30.000000000 +0300
@@ -1144,6 +1144,27 @@ static void update_exception_bitmap(stru
 		eb &= ~(1u << PF_VECTOR); /* bypass_guest_pf = 0 */
 	if (vcpu->fpu_active)
 		eb &= ~(1u << NM_VECTOR);
+
+	/* When we are running a nested L2 guest and L1 specified for it a
+	 * certain exception bitmap, we must trap the same exceptions and pass
+	 * them to L1. When running L2, we will only handle the exceptions
+	 * specified above if L1 did not want them.
+	 */
+	if (to_vmx(vcpu)->nested.nested_mode) {
+		u32 nested_eb;
+		if (to_vmx(vcpu)->nested.current_l2_page)
+			nested_eb = get_shadow_vmcs(vcpu)->exception_bitmap;
+		else {
+			if (!nested_map_current(vcpu)) {
+				to_vmx(vcpu)->fail = 1;
+				return;
+			}
+			nested_eb = get_shadow_vmcs(vcpu)->exception_bitmap;
+			nested_unmap_current(vcpu);
+		}
+		eb |= nested_eb;
+	}
+
 	vmcs_write32(EXCEPTION_BITMAP, eb);
 }
 
@@ -1488,8 +1509,25 @@ static void vmx_fpu_activate(struct kvm_
 	cr0 &= ~(X86_CR0_TS | X86_CR0_MP);
 	cr0 |= kvm_read_cr0_bits(vcpu, X86_CR0_TS | X86_CR0_MP);
 	vmcs_writel(GUEST_CR0, cr0);
-	update_exception_bitmap(vcpu);
 	vcpu->arch.cr0_guest_owned_bits = X86_CR0_TS;
+	if (to_vmx(vcpu)->nested.nested_mode) {
+		/* While we (L0) no longer care about NM exceptions or cr0.TS
+		 * changes, our guest hypervisor (L1) might care in which case
+		 * we must trap them for it.
+		 */
+		u32 eb = vmcs_read32(EXCEPTION_BITMAP) & ~(1u << NM_VECTOR);
+		struct shadow_vmcs *vmcs12;
+		if (!nested_map_current(vcpu)) {
+			to_vmx(vcpu)->fail = 1;
+			return;
+		}
+		vmcs12 = get_shadow_vmcs(vcpu);
+		eb |= vmcs12->exception_bitmap;
+		vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
+		nested_unmap_current(vcpu);
+		vmcs_write32(EXCEPTION_BITMAP, eb);
+	} else
+		update_exception_bitmap(vcpu);
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
 }
 
@@ -1497,12 +1535,24 @@ static void vmx_decache_cr0_guest_bits(s
 
 static void vmx_fpu_deactivate(struct kvm_vcpu *vcpu)
 {
+	/* Note that there is no vcpu->fpu_active = 0 here. The caller must
+	 * set this *before* calling this function.
+	 */
 	vmx_decache_cr0_guest_bits(vcpu);
 	vmcs_set_bits(GUEST_CR0, X86_CR0_TS | X86_CR0_MP);
-	update_exception_bitmap(vcpu);
+	vmcs_write32(EXCEPTION_BITMAP,
+		vmcs_read32(EXCEPTION_BITMAP) | (1u << NM_VECTOR));
 	vcpu->arch.cr0_guest_owned_bits = 0;
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
-	vmcs_writel(CR0_READ_SHADOW, vcpu->arch.cr0);
+	if (to_vmx(vcpu)->nested.nested_mode)
+		/* Unfortunately in nested mode we play with arch.cr0's PG
+	 * bit, so we mustn't copy it all, just the relevant TS bit
+		 */
+		vmcs_writel(CR0_READ_SHADOW,
+			(vmcs_readl(CR0_READ_SHADOW) & ~X86_CR0_TS) |
+			(vcpu->arch.cr0 & X86_CR0_TS));
+	else
+		vmcs_writel(CR0_READ_SHADOW, vcpu->arch.cr0);
 }
 
 static unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu)
@@ -3998,6 +4048,53 @@ vmx_patch_hypercall(struct kvm_vcpu *vcp
 	hypercall[2] = 0xc1;
 }
 
+/* called to set cr0 as appropriate for a mov-to-cr0 exit. */
+static void handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val)
+{
+	if (to_vmx(vcpu)->nested.nested_mode) {
+		/* When running L2, we usually do what L1 wants: it decides
+		 * which cr0 bits to intercept, we forward it cr0-change events
+		 * (see nested_vmx_exit_handled()). We only get here when a cr0
+		 * bit was changed that L1 did not ask to intercept, but L0
+		 * nevertheless did. Currently this can only happen with the TS
+		 * bit (see CR0_GUEST_HOST_MASK in prepare_vmcs_02()).
+		 * We must change only this bit in GUEST_CR0 and CR0_READ_SHADOW
+		 * and not call kvm_set_cr0 because it enforces a relationship
+		 * between the two that is specific to KVM (i.e., only the TS
+		 * bit might differ) and with which L1 might not agree.
+		 */
+		long new_cr0 = vmcs_readl(GUEST_CR0);
+		long new_cr0_rs = vmcs_readl(CR0_READ_SHADOW);
+		if (val & X86_CR0_TS) {
+			new_cr0 |= X86_CR0_TS;
+			new_cr0_rs |= X86_CR0_TS;
+			vcpu->arch.cr0 |= X86_CR0_TS;
+		} else {
+			new_cr0 &= ~X86_CR0_TS;
+			new_cr0_rs &= ~X86_CR0_TS;
+			vcpu->arch.cr0 &= ~X86_CR0_TS;
+		}
+		vmcs_writel(GUEST_CR0, new_cr0);
+		vmcs_writel(CR0_READ_SHADOW, new_cr0_rs);
+		to_vmx(vcpu)->nested.nested_run_pending = 1;
+	} else
+		kvm_set_cr0(vcpu, val);
+}
+
+/* called to set cr0 as appropriate for clts instruction exit. */
+static void handle_clts(struct kvm_vcpu *vcpu)
+{
+	if (to_vmx(vcpu)->nested.nested_mode) {
+		/* As in handle_set_cr0(), we can't call vmx_set_cr0 here */
+		vmcs_writel(GUEST_CR0, vmcs_readl(GUEST_CR0) & ~X86_CR0_TS);
+		vmcs_writel(CR0_READ_SHADOW,
+				vmcs_readl(CR0_READ_SHADOW) & ~X86_CR0_TS);
+		vcpu->arch.cr0 &= ~X86_CR0_TS;
+		to_vmx(vcpu)->nested.nested_run_pending = 1;
+	} else
+		vmx_set_cr0(vcpu, kvm_read_cr0_bits(vcpu, ~X86_CR0_TS));
+}
+
 static int handle_cr(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification, val;
@@ -4013,7 +4110,7 @@ static int handle_cr(struct kvm_vcpu *vc
 		trace_kvm_cr_write(cr, val);
 		switch (cr) {
 		case 0:
-			kvm_set_cr0(vcpu, val);
+			handle_set_cr0(vcpu, val);
 			skip_emulated_instruction(vcpu);
 			return 1;
 		case 3:
@@ -4039,7 +4136,7 @@ static int handle_cr(struct kvm_vcpu *vc
 		};
 		break;
 	case 2: /* clts */
-		vmx_set_cr0(vcpu, kvm_read_cr0_bits(vcpu, ~X86_CR0_TS));
+		handle_clts(vcpu);
 		trace_kvm_cr_write(0, kvm_read_cr0(vcpu));
 		skip_emulated_instruction(vcpu);
 		vmx_fpu_activate(vcpu);

^ permalink raw reply	[flat|nested] 147+ messages in thread

* [PATCH 24/24] Miscellaneous small corrections
  2010-06-13 12:22 [PATCH 0/24] Nested VMX, v5 Nadav Har'El
                   ` (22 preceding siblings ...)
  2010-06-13 12:34 ` [PATCH 23/24] Handling of CR0.TS and #NM for Lazy FPU loading Nadav Har'El
@ 2010-06-13 12:34 ` Nadav Har'El
  2010-06-14 12:34 ` [PATCH 0/24] Nested VMX, v5 Avi Kivity
                   ` (2 subsequent siblings)
  26 siblings, 0 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-06-13 12:34 UTC (permalink / raw)
  To: avi; +Cc: kvm

Small corrections of KVM (spelling, etc.) not directly related to nested VMX.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:30.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:30.000000000 +0300
@@ -992,7 +992,7 @@ static void vmcs_load(struct vmcs *vmcs)
 			: "=g"(error) : "a"(&phys_addr), "m"(phys_addr)
 			: "cc", "memory");
 	if (error)
-		printk(KERN_ERR "kvm: vmptrld %p/%llx fail\n",
+		printk(KERN_ERR "kvm: vmptrld %p/%llx failed\n",
 		       vmcs, phys_addr);
 }
 

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 1/24] Move nested option from svm.c to x86.c
  2010-06-13 12:23 ` [PATCH 1/24] Move nested option from svm.c to x86.c Nadav Har'El
@ 2010-06-14  8:11   ` Avi Kivity
  2010-06-15 14:27     ` Nadav Har'El
  0 siblings, 1 reply; 147+ messages in thread
From: Avi Kivity @ 2010-06-14  8:11 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

On 06/13/2010 03:23 PM, Nadav Har'El wrote:
> The SVM module had a "nested" option, on by default, which controls whether
> to allow nested virtualization. Now that VMX also supports nested
> virtualization, we can move this option to x86.c, for both SVM and VMX.
>
> The "nested" option takes three possible values. 0 disables nested
> virtualization on both SVM and VMX, and 1 enables it on both.
> The value 2, which is the default when this module option is not explicitly
> set, asks each of SVM or VMX to choose its own default; Currently, VMX
> disables nested virtualization in this case, while SVM leaves it enabled.
>
> When nested VMX becomes more mature, this default should probably be changed
> to enable nested virtualization on both architectures.
>
> --- .before/arch/x86/kvm/x86.c	2010-06-13 15:01:28.000000000 +0300
> +++ .after/arch/x86/kvm/x86.c	2010-06-13 15:01:28.000000000 +0300
> @@ -95,6 +95,17 @@ EXPORT_SYMBOL_GPL(kvm_x86_ops);
>   int ignore_msrs = 0;
>   module_param_named(ignore_msrs, ignore_msrs, bool, S_IRUGO | S_IWUSR);
>
> +/* If nested=1, nested virtualization is supported. I.e., the guest may use
> + * VMX or SVM (as appropriate) and be a hypervisor for its own guests.
> + * If nested=0, nested virtualization is not supported.
> + * When nested starts as 2 (which is the default), it is later modified by the
> + * specific module used (VMX or SVM). Currently, nested will be left enabled
> + * on SVM, but reset to 0 on VMX.
> + */
> +int nested = 2;
> +EXPORT_SYMBOL_GPL(nested);
> +module_param(nested, int, S_IRUGO);
> +
>    


A global variable names 'nested' is not a good idea.  I recommend having 
a kvm-intel scope module parameter instead, that also avoids the 0/1/2 
values.
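
I.e., something along the lines of what svm.c already does (untested sketch):

	/* in arch/x86/kvm/vmx.c, next to the other module parameters: */
	static int nested = 0;	/* off by default while nested VMX matures */
	module_param(nested, int, S_IRUGO);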

After the patches are merged we can try to consolidate here.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 2/24] Add VMX and SVM to list of supported cpuid features
  2010-06-13 12:23 ` [PATCH 2/24] Add VMX and SVM to list of supported cpuid features Nadav Har'El
@ 2010-06-14  8:13   ` Avi Kivity
  2010-06-15 14:31     ` Nadav Har'El
  0 siblings, 1 reply; 147+ messages in thread
From: Avi Kivity @ 2010-06-14  8:13 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

On 06/13/2010 03:23 PM, Nadav Har'El wrote:
> Add the "VMX" CPU feature to the list of CPU featuress KVM advertises with
> the KVM_GET_SUPPORTED_CPUID ioctl (unless the "nested" module option is off).
>
> Qemu uses this ioctl, and intersects KVM's list with its own list of desired
> cpu features (depending on the -cpu option given to qemu) to determine the
> final list of features presented to the guest.
> This patch also does the same for SVM: KVM now advertises it supports SVM,
> unless the "nested" module option is off.
>
> Signed-off-by: Nadav Har'El<nyh@il.ibm.com>
> ---
> --- .before/arch/x86/kvm/x86.c	2010-06-13 15:01:28.000000000 +0300
> +++ .after/arch/x86/kvm/x86.c	2010-06-13 15:01:28.000000000 +0300
> @@ -1923,7 +1923,7 @@ static void do_cpuid_ent(struct kvm_cpui
>   	/* cpuid 1.ecx */
>   	const u32 kvm_supported_word4_x86_features =
>   		F(XMM3) | 0 /* Reserved, DTES64, MONITOR */ |
> -		0 /* DS-CPL, VMX, SMX, EST */ |
> +		0 /* DS-CPL */ | (nested ? F(VMX) : 0) | 0 /* SMX, EST */ |
>   		0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
>   		0 /* Reserved */ | F(CX16) | 0 /* xTPR Update, PDCM */ |
>   		0 /* Reserved, DCA */ | F(XMM4_1) |
>    

You can use kvm_x86_ops->set_supported_cpuid() to alter features.
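
E.g., something like this untested sketch; it assumes the hook is invoked
after the generic masking in do_cpuid_ent() and that "nested" is visible in
vmx.c:

	static void vmx_set_supported_cpuid(u32 func,
					    struct kvm_cpuid_entry2 *entry)
	{
		if (func == 1 && nested)
			entry->ecx |= bit(X86_FEATURE_VMX);
	}

	/* and in vmx_x86_ops: */
	.set_supported_cpuid = vmx_set_supported_cpuid,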

> @@ -1931,7 +1931,8 @@ static void do_cpuid_ent(struct kvm_cpui
>   		0 /* Reserved, XSAVE, OSXSAVE */;
>   	/* cpuid 0x80000001.ecx */
>   	const u32 kvm_supported_word6_x86_features =
> -		F(LAHF_LM) | F(CMP_LEGACY) | F(SVM) | 0 /* ExtApicSpace */ |
> +		F(LAHF_LM) | F(CMP_LEGACY) | (nested ? F(SVM) : 0) |
> +		0 /* ExtApicSpace */ |
>   		F(CR8_LEGACY) | F(ABM) | F(SSE4A) | F(MISALIGNSSE) |
>   		F(3DNOWPREFETCH) | 0 /* OSVW */ | 0 /* IBS */ | F(SSE5) |
>   		0 /* SKINIT */ | 0 /* WDT */;
>    

Good idea, but let's leave it out of the nvmx patches.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 3/24] Implement VMXON and VMXOFF
  2010-06-13 12:24 ` [PATCH 3/24] Implement VMXON and VMXOFF Nadav Har'El
@ 2010-06-14  8:21   ` Avi Kivity
  2010-06-16 11:14     ` Nadav Har'El
  2010-06-15 20:18   ` Marcelo Tosatti
  1 sibling, 1 reply; 147+ messages in thread
From: Avi Kivity @ 2010-06-14  8:21 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

On 06/13/2010 03:24 PM, Nadav Har'El wrote:
> This patch allows a guest to use the VMXON and VMXOFF instructions, and
> emulates them accordingly. Basically this amounts to checking some
> prerequisites, and then remembering whether the guest has enabled or disabled
> VMX operation.
>
>    

Should probably reorder with next patch.

> +/* The nested_vmx structure is part of vcpu_vmx, and holds information we need
> + * for correct emulation of VMX (i.e., nested VMX) on this vcpu. For example,
> + * the current VMCS set by L1, a list of the VMCSs used to run the active
> + * L2 guests on the hardware, and more.
> + */
>    

Please (here and elsewhere) use the standard kernel style for multiline 
comments - start with /* on a line by itself.

>
> +/* Emulate the VMXON instruction.
> + * Currently, we just remember that VMX is active, and do not save or even
> + * inspect the argument to VMXON (the so-called "VMXON pointer") because we
> + * do not currently need to store anything in that guest-allocated memory
> + * region. Consequently, VMCLEAR and VMPTRLD also do not verify that their
> + * argument is different from the VMXON pointer (which the spec says they do).
> + */
> +static int handle_vmon(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_segment cs;
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	/* The Intel VMX Instruction Reference lists a bunch of bits that
> +	 * are prerequisite to running VMXON, most notably CR4.VMXE must be
> +	 * set to 1. Otherwise, we should fail with #UD. We test these now:
> +	 */
> +	if (!nested) {
> +		kvm_queue_exception(vcpu, UD_VECTOR);
> +		return 1;
> +	}
> +
> +	if (!(vcpu->arch.cr4&  X86_CR4_VMXE) ||
> +	    !(vcpu->arch.cr0&  X86_CR0_PE) ||
> +	    (vmx_get_rflags(vcpu)&  X86_EFLAGS_VM)) {
> +		kvm_queue_exception(vcpu, UD_VECTOR);
> +		return 1;
> +	}
> +
> +	vmx_get_segment(vcpu,&cs, VCPU_SREG_CS);
> +	if (is_long_mode(vcpu)&&  !cs.l) {
> +		kvm_queue_exception(vcpu, UD_VECTOR);
> +		return 1;
> +	}
> +
> +	if (vmx_get_cpl(vcpu)) {
> +		kvm_inject_gp(vcpu, 0);
> +		return 1;
> +	}
> +
> +	vmx->nested.vmxon = 1;
>    

= true

> +
> +	skip_emulated_instruction(vcpu);
> +	return 1;
> +}
>    

Need to block INIT signals in the local apic as well (fine for a 
separate patch).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1
  2010-06-13 12:25 ` [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1 Nadav Har'El
@ 2010-06-14  8:33   ` Avi Kivity
  2010-06-14  8:49     ` Nadav Har'El
                       ` (2 more replies)
  0 siblings, 3 replies; 147+ messages in thread
From: Avi Kivity @ 2010-06-14  8:33 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

On 06/13/2010 03:25 PM, Nadav Har'El wrote:
> An implementation of VMX needs to define a VMCS structure. This structure
> is kept in guest memory, but is opaque to the guest (who can only read or
> write it with VMX instructions).
>
> This patch starts to define the VMCS structure which our nested VMX
> implementation will present to L1. We call it "vmcs12", as it is the VMCS
> that L1 keeps for its L2 guests.
>
> This patch also adds the notion (as required by the VMX spec) of the "current
> VMCS", and finally includes utility functions for mapping the guest-allocated
> VMCSs in host memory.
>
> +#define VMCS12_REVISION 0x11e57ed0
>    

Where did this number come from?  It's not from real hardware, yes?

> +
> +/*
> + * struct vmcs12 describes the state that our guest hypervisor (L1) keeps for a
> + * single nested guest (L2), hence the name vmcs12. Any VMX implementation has
> + * a VMCS structure (which is opaque to the guest), and vmcs12 is our emulated
> + * VMX's VMCS. This structure is stored in guest memory specified by VMPTRLD,
> + * and accessed by the guest using VMREAD/VMWRITE/VMCLEAR instructions. More
> + * than one of these structures may exist, if L1 runs multiple L2 guests.
> + * nested_vmx_run() will use the data here to build a VMCS for the underlying
> + * hardware which will be used to run L2.
> + * This structure is packed in order to preserve the binary content after live
> + * migration. If there are changes in the content or layout, VMCS12_REVISION
> + * must be changed.
> + */
> +struct __attribute__ ((__packed__)) vmcs12 {
>    

__packed is a convenient define for this.

> +	/* According to the Intel spec, a VMCS region must start with the
> +	 * following two fields. Then follow implementation-specific data.
> +	 */
> +	u32 revision_id;
> +	u32 abort;
> +};
>    

Note that this structure becomes an ABI, it cannot change except in a 
backward compatible way due to the need for live migration.  So I'd like 
a documentation patch that adds a description of the content to 
Documentation/kvm/.  It can be as simple as listing the structure 
definition.

>
> +static struct page *nested_get_page(struct kvm_vcpu *vcpu, u64 vmcs_addr)
> +{
> +	struct page *vmcs_page =
> +		gfn_to_page(vcpu->kvm, vmcs_addr>>  PAGE_SHIFT);
> +
> +	if (is_error_page(vmcs_page)) {
> +		printk(KERN_ERR "%s error allocating page 0x%llx\n",
> +		       __func__, vmcs_addr);
>    

Those printks can be used by a malicious guest to spam the host logs.
Please wrap them with something that is conditional on a debug flag.
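
E.g., something like this (a sketch; the parameter and macro names are made up):

	static bool nested_debug;	/* hypothetical module parameter */
	module_param(nested_debug, bool, S_IRUGO | S_IWUSR);

	#define nvmx_dbg(fmt, ...)					\
	do {								\
		if (nested_debug)					\
			printk(KERN_DEBUG "nvmx: " fmt, ##__VA_ARGS__);	\
	} while (0)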

I'm not sure what we need to do with vmcs that is not in RAM.  It may 
simplify things to return the error_page to the caller and set 
KVM_REQ_TRIPLE_FAULT, so we don't have to deal with error handling later on.

> +		kvm_release_page_clean(vmcs_page);
> +		return NULL;
> +	}
> +	return vmcs_page;
> +}
> +
> +static void nested_unmap_current(struct kvm_vcpu *vcpu)
> +{
> +	struct page *page;
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	if (!vmx->nested.current_l2_page) {
> +		printk(KERN_INFO "Shadow vmcs already unmapped\n");
> +		BUG_ON(1);
> +		return;
> +	}
> +
> +	page = kmap_atomic_to_page(vmx->nested.current_l2_page);
> +
> +	kunmap_atomic(vmx->nested.current_l2_page, KM_USER0);
> +
> +	kvm_release_page_dirty(page);
>    

Do we always dirty the page?

I guess it is no big deal even if we don't.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 6/24] Implement reading and writing of VMX MSRs
  2010-06-13 12:25 ` [PATCH 6/24] Implement reading and writing of VMX MSRs Nadav Har'El
@ 2010-06-14  8:42   ` Avi Kivity
  2010-06-23  8:13     ` Nadav Har'El
  0 siblings, 1 reply; 147+ messages in thread
From: Avi Kivity @ 2010-06-14  8:42 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

On 06/13/2010 03:25 PM, Nadav Har'El wrote:
> When the guest can use VMX instructions (when the "nested" module option is
> on), it should also be able to read and write VMX MSRs, e.g., to query about
> VMX capabilities. This patch adds this support.
>
> Signed-off-by: Nadav Har'El<nyh@il.ibm.com>
> ---
> --- .before/arch/x86/kvm/x86.c	2010-06-13 15:01:28.000000000 +0300
> +++ .after/arch/x86/kvm/x86.c	2010-06-13 15:01:28.000000000 +0300
> @@ -702,7 +702,11 @@ static u32 msrs_to_save[] = {
>   #ifdef CONFIG_X86_64
>   	MSR_CSTAR, MSR_KERNEL_GS_BASE, MSR_SYSCALL_MASK, MSR_LSTAR,
>   #endif
> -	MSR_IA32_TSC, MSR_IA32_PERF_STATUS, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA
> +	MSR_IA32_TSC, MSR_IA32_PERF_STATUS, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
> +	MSR_IA32_FEATURE_CONTROL,  MSR_IA32_VMX_BASIC,
> +	MSR_IA32_VMX_PINBASED_CTLS, MSR_IA32_VMX_PROCBASED_CTLS,
> +	MSR_IA32_VMX_EXIT_CTLS, MSR_IA32_VMX_ENTRY_CTLS,
> +	MSR_IA32_VMX_PROCBASED_CTLS2, MSR_IA32_VMX_EPT_VPID_CAP,
>   };
>    

These are read only from the guest point of view, but we need write 
support from the host to allow for tuning the features exposed to the guest.

>   /*
> + * If we allow our guest to use VMX instructions, we should also let it use
> + * VMX-specific MSRs.
> + */
> +static int nested_vmx_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
> +{
> +	u64 vmx_msr = 0;
> +	u32 vmx_msr_high, vmx_msr_low;
> +
> +	switch (msr_index) {
> +	case MSR_IA32_FEATURE_CONTROL:
> +		*pdata = 0;
> +		break;
> +	case MSR_IA32_VMX_BASIC:
> +		/*
> +		 * This MSR reports some information about VMX support of the
> +		 * processor. We should return information about the VMX we
> +		 * emulate for the guest, and the VMCS structure we give it -
> +		 * not about the VMX support of the underlying hardware. Some
> +		 * However, some capabilities of the underlying hardware are
> +		 * used directly by our emulation (e.g., the physical address
> +		 * width), so these are copied from what the hardware reports.
> +		 */
> +		*pdata = VMCS12_REVISION |
> +			(((u64)sizeof(struct vmcs12))<<  32);
> +		rdmsrl(MSR_IA32_VMX_BASIC, vmx_msr);
> +#define VMX_BASIC_64		0x0001000000000000LLU
> +#define VMX_BASIC_MEM_TYPE	0x003c000000000000LLU
> +#define VMX_BASIC_INOUT		0x0040000000000000LLU
>    

Please move those defines (with longer names) to msr-index.h.

> +		*pdata |= vmx_msr&
> +			(VMX_BASIC_64 | VMX_BASIC_MEM_TYPE | VMX_BASIC_INOUT);
> +		break;
> +#define CORE2_PINBASED_CTLS_MUST_BE_ONE  0x00000016
> +#define MSR_IA32_VMX_TRUE_PINBASED_CTLS  0x48d
> +	case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
> +	case MSR_IA32_VMX_PINBASED_CTLS:
> +		vmx_msr_low  = CORE2_PINBASED_CTLS_MUST_BE_ONE;
> +		vmx_msr_high = CORE2_PINBASED_CTLS_MUST_BE_ONE |
> +				PIN_BASED_EXT_INTR_MASK |
> +				PIN_BASED_NMI_EXITING |
> +				PIN_BASED_VIRTUAL_NMIS;
>    

IIRC not all processors support PIN_BASED_VIRTUAL_NMIs.  Can we support 
this feature on hosts that don't have it?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 7/24] Understanding guest pointers to vmcs12 structures
  2010-06-13 12:26 ` [PATCH 7/24] Understanding guest pointers to vmcs12 structures Nadav Har'El
@ 2010-06-14  8:48   ` Avi Kivity
  2010-08-02 12:25     ` Nadav Har'El
  2010-06-15 12:14   ` Gleb Natapov
  1 sibling, 1 reply; 147+ messages in thread
From: Avi Kivity @ 2010-06-14  8:48 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

On 06/13/2010 03:26 PM, Nadav Har'El wrote:
> This patch includes a couple of utility functions for extracting pointer
> operands of VMX instructions issued by L1 (a guest hypervisor), and
> translating guest-given vmcs12 virtual addresses to guest-physical addresses.
>
> +/*
> + * Decode the memory-address operand of a vmx instruction, according to the
> + * Intel spec.
> + */
> +#define VMX_OPERAND_SCALING(vii)	((vii)&  3)
> +#define VMX_OPERAND_ADDR_SIZE(vii)	(((vii)>>  7)&  7)
> +#define VMX_OPERAND_IS_REG(vii)		((vii)&  (1u<<  10))
> +#define VMX_OPERAND_SEG_REG(vii)	(((vii)>>  15)&  7)
> +#define VMX_OPERAND_INDEX_REG(vii)	(((vii)>>  18)&  0xf)
> +#define VMX_OPERAND_INDEX_INVALID(vii)	((vii)&  (1u<<  22))
> +#define VMX_OPERAND_BASE_REG(vii)	(((vii)>>  23)&  0xf)
> +#define VMX_OPERAND_BASE_INVALID(vii)	((vii)&  (1u<<  27))
> +#define VMX_OPERAND_REG(vii)		(((vii)>>  3)&  0xf)
> +#define VMX_OPERAND_REG2(vii)		(((vii)>>  28)&  0xf)
> +static gva_t get_vmx_mem_address(struct kvm_vcpu *vcpu,
> +				 unsigned long exit_qualification,
> +				 u32 vmx_instruction_info)
> +{
> +	int  scaling = VMX_OPERAND_SCALING(vmx_instruction_info);
> +	int  addr_size = VMX_OPERAND_ADDR_SIZE(vmx_instruction_info);
> +	bool is_reg = VMX_OPERAND_IS_REG(vmx_instruction_info);
> +	int  seg_reg = VMX_OPERAND_SEG_REG(vmx_instruction_info);
> +	int  index_reg = VMX_OPERAND_INDEX_REG(vmx_instruction_info);
> +	bool index_is_valid = !VMX_OPERAND_INDEX_INVALID(vmx_instruction_info);
> +	int  base_reg       = VMX_OPERAND_BASE_REG(vmx_instruction_info);
> +	bool base_is_valid  = !VMX_OPERAND_BASE_INVALID(vmx_instruction_info);
>    

Since those defines are used just ones, you can fold them into their 
uses.  It doesn't add much to repeat the variable name.

> +	gva_t addr;
> +
> +	if (is_reg) {
> +		kvm_queue_exception(vcpu, UD_VECTOR);
> +		return 0;
> +	}
> +
> +	switch (addr_size) {
> +	case 1: /* 32 bit. high bits are undefined according to the spec: */
> +		exit_qualification&= 0xffffffff;
> +		break;
> +	case 2: /* 64 bit */
> +		break;
> +	default: /* addr_size=0 means 16 bit */
> +		return 0;
> +	}
> +
> +	/* Addr = segment_base + offset */
> +	/* offfset = Base + [Index * Scale] + Displacement */
> +	addr = vmx_get_segment_base(vcpu, seg_reg);
> +	if (base_is_valid)
> +		addr += kvm_register_read(vcpu, base_reg);
> +	if (index_is_valid)
> +		addr += kvm_register_read(vcpu, index_reg)<<scaling;
> +	addr += exit_qualification; /* holds the displacement */
>    

Do we need a segment limit and access rights check?

> +
> +	return addr;
> +}
> +
>    

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1
  2010-06-14  8:33   ` Avi Kivity
@ 2010-06-14  8:49     ` Nadav Har'El
  2010-06-14 12:35       ` Avi Kivity
  2010-06-16 12:24     ` Nadav Har'El
  2010-06-22 14:54     ` Nadav Har'El
  2 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-06-14  8:49 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

On Mon, Jun 14, 2010, Avi Kivity wrote about "Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1":
> On 06/13/2010 03:25 PM, Nadav Har'El wrote:
> >+#define VMCS12_REVISION 0x11e57ed0
> >   
> 
> Where did this number come from?  It's not from real hardware, yes?

Basically, we are presenting emulated VMX for the L1 guest, complete with
its own VMCS structure. This structure needs to have some VMCS revision id,
which should be an arbitrary number that we invent - it is not related to any
revision id that any real hardware uses. If you look closely, you can see that
the number I used is leetspeak for "Nested0" ;-)

As you can see in the following patches, MSR_IA32_VMX_BASIC will return this
arbitrary VMCS revision id, and and VMPTRLD will verify that the VMCS region
that L1 is trying to load contains this revision id.

-- 
Nadav Har'El                        |       Monday, Jun 14 2010, 2 Tammuz 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |If you're looking for a helping hand,
http://nadav.harel.org.il           |look first at the end of your arm.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 8/24] Hold a vmcs02 for each vmcs12
  2010-06-13 12:26 ` [PATCH 8/24] Hold a vmcs02 for each vmcs12 Nadav Har'El
@ 2010-06-14  8:57   ` Avi Kivity
  2010-07-06  9:50   ` Dong, Eddie
  1 sibling, 0 replies; 147+ messages in thread
From: Avi Kivity @ 2010-06-14  8:57 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

On 06/13/2010 03:26 PM, Nadav Har'El wrote:
> In this patch we add a list of L0 (hardware) VMCSs, which we'll use to hold a
> hardware VMCS for each active L1 VMCS (i.e., for each L2 guest).
>
> We call each of these L0 VMCSs a "vmcs02", as it is the VMCS that L0 uses
> to run its nested guest L2.
>
>
> +
> +/* Allocate an L0 VMCS (vmcs02) for the current L1 VMCS (vmcs12), if one
> + * does not already exist. The allocation is done in L0 memory, so to avoid
> + * denial-of-service attack by guests, we limit the number of concurrently-
> + * allocated VMCSs. A well-behaving L1 will VMCLEAR unused vmcs12s and not
> + * trigger this limit.
> + */
> +static const int NESTED_MAX_VMCS = 256;
>    

This is much too high, it allows the guest to pin a large amount of host 
memory.  Also, the limit is not real; if the guest exceeds the limit we 
should drop some LRU vmcs and instantiate a new one.

I suggest starting with a much lower limit (say, 4) which will exercise 
the drop/reload code.  Later, we can increase the limit and add a 
shrinker callback so the host can reduce the number of cached vmcses if 
memory gets tight.
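
A rough sketch of the recycling idea (untested; it reuses the vmcs_list fields
from the patch and keeps the list in MRU-to-LRU order):

	#define NESTED_MAX_VMCS 4

	static struct vmcs_list *nested_get_vmcs02(struct vcpu_vmx *vmx,
						   gpa_t vmptr)
	{
		struct vmcs_list *item;

		list_for_each_entry(item, &vmx->nested.l2_vmcs_list, list)
			if (item->vmcs_addr == vmptr) {
				/* hit: move to the front (most recently used) */
				list_move(&item->list, &vmx->nested.l2_vmcs_list);
				return item;
			}

		if (vmx->nested.l2_vmcs_num >= NESTED_MAX_VMCS) {
			/* at the cap: recycle the least recently used entry */
			item = list_entry(vmx->nested.l2_vmcs_list.prev,
					  struct vmcs_list, list);
			vmcs_clear(item->l2_vmcs);	/* flush before reuse */
			item->vmcs_addr = vmptr;
			list_move(&item->list, &vmx->nested.l2_vmcs_list);
			return item;
		}

		return NULL;	/* caller allocates a fresh vmcs02 as today */
	}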

> +static int nested_create_current_vmcs(struct kvm_vcpu *vcpu)
> +{
> +	struct vmcs_list *new_l2_guest;
> +	struct vmcs *l2_vmcs;
> +
> +	if (nested_get_current_vmcs(vcpu))
> +		return 0; /* nothing to do - we already have a VMCS */
> +
> +	if (to_vmx(vcpu)->nested.l2_vmcs_num>= NESTED_MAX_VMCS)
> +		return -ENOMEM;
>    

As mentioned above, recycle an old vmcs here.

> +
> +/* Free the current L2 VMCS, and remove it from l2_vmcs_list */
> +static void nested_free_current_vmcs(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	struct vmcs_list *list_item, *n;
> +
> +	list_for_each_entry_safe(list_item, n,&vmx->nested.l2_vmcs_list, list)
> +		if (list_item->vmcs_addr == vmx->nested.current_vmptr) {
> +			free_vmcs(list_item->l2_vmcs);
> +			list_del(&(list_item->list));
> +			kfree(list_item);
> +			vmx->nested.l2_vmcs_num--;
> +			return;
> +		}
> +}
>    

Since you return, no need to be _safe.  But we do need to vmclear that 
vmcs to avoid the processor writing back to those pages after we've 
freed them.

> +
> +static void free_l1_state(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	struct vmcs_list *list_item, *n;
> +
> +	list_for_each_entry_safe(list_item, n,
> +			&vmx->nested.l2_vmcs_list, list) {
>    

vmclear needed.

> +		free_vmcs(list_item->l2_vmcs);
> +		list_del(&(list_item->list));
> +		kfree(list_item);
> +	}
>    

> +	vmx->nested.l2_vmcs_num = 0;
> +}
>    

Can share code for dealing with one vmcs with the function above.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 9/24] Implement VMCLEAR
  2010-06-13 12:27 ` [PATCH 9/24] Implement VMCLEAR Nadav Har'El
@ 2010-06-14  9:03   ` Avi Kivity
  2010-06-15 13:47   ` Gleb Natapov
  2010-07-06  2:56   ` Dong, Eddie
  2 siblings, 0 replies; 147+ messages in thread
From: Avi Kivity @ 2010-06-14  9:03 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

On 06/13/2010 03:27 PM, Nadav Har'El wrote:
> This patch implements the VMCLEAR instruction.
>
> +
> +/* Emulate the VMCLEAR instruction */
> +static int handle_vmclear(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	gpa_t guest_vmcs_addr, save_current_vmptr;
> +
> +	if (!nested_vmx_check_permission(vcpu))
> +		return 1;
> +
> +	if (read_guest_vmcs_gpa(vcpu,&guest_vmcs_addr))
> +		return 1;
> +
> +	save_current_vmptr = vmx->nested.current_vmptr;
> +
> +	vmx->nested.current_vmptr = guest_vmcs_addr;
> +	if (!nested_map_current(vcpu))
> +		return 1;
>    

Haven't you leaked current_vmptr here?

If I read the code correctly, you are implementing a sort of stack here 
and pushing the current vmptr into save_current_vmptr.  Perhaps it's 
simpler to have an nvmxptr structure which holds a vmptr and a kmap'ed
pointer to it, and pass that around to functions.
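
Something like this, perhaps (a sketch; the field names are invented here):

	struct nvmxptr {
		gpa_t addr;		/* guest-physical address of the vmcs12 */
		struct page *page;	/* pinned page backing it */
		struct vmcs12 *vmcs12;	/* kmap_atomic()ed pointer, valid while mapped */
	};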

> +	vmx->nested.current_l2_page->launch_state = 0;
> +	nested_unmap_current(vcpu);
> +
> +	nested_free_current_vmcs(vcpu);
> +
> +	if (save_current_vmptr == guest_vmcs_addr)
> +		vmx->nested.current_vmptr = -1ull;
> +	else
> +		vmx->nested.current_vmptr = save_current_vmptr;
> +
> +	skip_emulated_instruction(vcpu);
> +	clear_rflags_cf_zf(vcpu);
> +	return 1;
> +}
> +
>    

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 10/24] Implement VMPTRLD
  2010-06-13 12:27 ` [PATCH 10/24] Implement VMPTRLD Nadav Har'El
@ 2010-06-14  9:07   ` Avi Kivity
  2010-08-05 11:13     ` Nadav Har'El
  2010-06-16 13:36   ` Gleb Natapov
  2010-07-06  3:09   ` Dong, Eddie
  2 siblings, 1 reply; 147+ messages in thread
From: Avi Kivity @ 2010-06-14  9:07 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

On 06/13/2010 03:27 PM, Nadav Har'El wrote:
> This patch implements the VMPTRLD instruction.
>
>
>   static void clear_rflags_cf_zf(struct kvm_vcpu *vcpu)
>   {
>   	unsigned long rflags;
> @@ -3869,6 +3889,57 @@ static int handle_vmclear(struct kvm_vcp
>   	return 1;
>   }
>
> +static bool verify_vmcs12_revision(struct kvm_vcpu *vcpu, gpa_t guest_vmcs_addr)
> +{
> +	bool ret;
> +	struct vmcs12 *vmcs12;
> +	struct page *vmcs_page = nested_get_page(vcpu, guest_vmcs_addr);
>    

Blank line so I can catch my breath.

> +	if (vmcs_page == NULL)
> +		return 0;
>    

Doesn't seem right.

> +	vmcs12 = (struct vmcs12 *)kmap_atomic(vmcs_page, KM_USER0);
> +	if (vmcs12->revision_id == VMCS12_REVISION)
> +		ret = 1;
> +	else {
> +		set_rflags_to_vmx_fail_valid(vcpu);
> +		ret = 0;
> +	}
> +	kunmap_atomic(vmcs12, KM_USER0);
> +	kvm_release_page_dirty(vmcs_page);
>    

Can release a clean page here.  But what happened to those mapping helpers?

> +	return ret;
> +}
> +
> +/* Emulate the VMPTRLD instruction */
> +static int handle_vmptrld(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	gpa_t guest_vmcs_addr;
> +
> +	if (!nested_vmx_check_permission(vcpu))
> +		return 1;
> +
> +	if (read_guest_vmcs_gpa(vcpu,&guest_vmcs_addr)) {
> +		set_rflags_to_vmx_fail_invalid(vcpu);
>    

Need to skip_emulated_instruction() in this case.

> +		return 1;
> +	}
> +
> +	if (!verify_vmcs12_revision(vcpu, guest_vmcs_addr))
> +		return 1;
>    

Here too.

> +
> +	if (vmx->nested.current_vmptr != guest_vmcs_addr) {
> +		vmx->nested.current_vmptr = guest_vmcs_addr;
> +
> +		if (nested_create_current_vmcs(vcpu)) {
> +			printk(KERN_ERR "%s error could not allocate memory",
> +				__func__);
>    

In general ftrace and the ENOMEM itself are sufficient documentation 
that something went wrong.

> +			return -ENOMEM;
> +		}
> +	}
> +
> +	clear_rflags_cf_zf(vcpu);
> +	skip_emulated_instruction(vcpu);
> +	return 1;
> +}
> +
>    

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 11/24] Implement VMPTRST
  2010-06-13 12:28 ` [PATCH 11/24] Implement VMPTRST Nadav Har'El
@ 2010-06-14  9:15   ` Avi Kivity
  2010-06-16 13:53     ` Gleb Natapov
  0 siblings, 1 reply; 147+ messages in thread
From: Avi Kivity @ 2010-06-14  9:15 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

On 06/13/2010 03:28 PM, Nadav Har'El wrote:
> This patch implements the VMPTRST instruction.
>
> Signed-off-by: Nadav Har'El<nyh@il.ibm.com>
> ---
> --- .before/arch/x86/kvm/x86.c	2010-06-13 15:01:29.000000000 +0300
> +++ .after/arch/x86/kvm/x86.c	2010-06-13 15:01:29.000000000 +0300
> @@ -3301,7 +3301,7 @@ static int kvm_read_guest_virt_system(gv
>   	return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, 0, error);
>   }
>
> -static int kvm_write_guest_virt_system(gva_t addr, void *val,
> +int kvm_write_guest_virt_system(gva_t addr, void *val,
>   				       unsigned int bytes,
>   				       struct kvm_vcpu *vcpu,
>   				       u32 *error)
>    

write_guest_virt_system() is used by writes which need to ignore the 
cpl, for example when a cpl 3 instruction loads a segment, the processor 
needs to update the accessed flag even though it is only accessible to 
cpl 0.  That's not your case, you need the ordinary write_guest_virt().

Um, I see there is no kvm_write_guest_virt(), you'll have to introduce it.

>
> +/* Emulate the VMPTRST instruction */
> +static int handle_vmptrst(struct kvm_vcpu *vcpu)
> +{
> +	int r = 0;
> +	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> +	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
> +	gva_t vmcs_gva;
> +
> +	if (!nested_vmx_check_permission(vcpu))
> +		return 1;
> +
> +	vmcs_gva = get_vmx_mem_address(vcpu, exit_qualification,
> +				       vmx_instruction_info);
> +	if (vmcs_gva == 0)
> +		return 1;
>    

What's wrong with gva 0?  It's favoured by exploiters everywhere.

> +	r = kvm_write_guest_virt_system(vmcs_gva,
> +				 (void *)&to_vmx(vcpu)->nested.current_vmptr,
> +				 sizeof(u64), vcpu, NULL);
> +	if (r) {
>    

Check against the X86EMUL return codes.  You'll need to inject a page 
fault on failure.

> +		printk(KERN_INFO "%s failed to write vmptr\n", __func__);
> +		return 1;
> +	}
> +	clear_rflags_cf_zf(vcpu);
> +	skip_emulated_instruction(vcpu);
> +	return 1;
> +}
> +
>    

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 12/24] Add VMCS fields to the vmcs12
  2010-06-13 12:28 ` [PATCH 12/24] Add VMCS fields to the vmcs12 Nadav Har'El
@ 2010-06-14  9:24   ` Avi Kivity
  2010-06-16 14:18   ` Gleb Natapov
  1 sibling, 0 replies; 147+ messages in thread
From: Avi Kivity @ 2010-06-14  9:24 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

On 06/13/2010 03:28 PM, Nadav Har'El wrote:
> In this patch we add to vmcs12 (the VMCS that L1 keeps for L2) all the
> standard VMCS fields. These fields are encapsulated in a struct shadow_vmcs.
>
> Later patches will enable L1 to read and write these fields using VMREAD/
> VMWRITE, and they will be used during a VMLAUNCH/VMRESUME in preparing a real
> VMCS for running L2.
>
>
> +/* shadow_vmcs is a structure used in nested VMX for holding a copy of all
> + * standard VMCS fields. It is used for emulating a VMCS for L1 (see vmcs12),
> + * and also for easier access to VMCS data (see l1_shadow_vmcs).
> + */
> +struct __attribute__ ((__packed__)) shadow_vmcs {
>    


> +	u32 host_ia32_sysenter_cs;
> +	unsigned long cr0_guest_host_mask;
>    

I think I counted an odd number of u32 fields, which means the ulong
fields will be unaligned.  Please add padding to preserve natural alignment.
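
E.g. (illustration only):

	u32 host_ia32_sysenter_cs;
	u32 padding32;		/* keep the unsigned longs below naturally aligned */
	unsigned long cr0_guest_host_mask;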

Have you considered placing often used fields together to reduce cache 
misses?  I'm not sure whether it's worth the effort.

>
>   /*
> @@ -139,6 +269,8 @@ struct __attribute__ ((__packed__)) vmcs
>   	u32 revision_id;
>   	u32 abort;
>
> +	struct shadow_vmcs shadow_vmcs;
> +
>   	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
>    

That will make it difficult to expand shadow_vmcs in the future.  I 
suggest putting it at the end, and reserving some space in the middle.

>
> +#define OFFSET(x) offsetof(struct shadow_vmcs, x)
> +
> +static unsigned short vmcs_field_to_offset_table[HOST_RIP+1] = {
>    

Can leave the size unspecified (and use ARRAY_SIZE() later on).

The encoding of the indexes is very sparse, so the table will be very 
large.  No need to deal with that now though.
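
E.g. (sketch of the idea; the accessor name is just for illustration):

	static unsigned short vmcs_field_to_offset_table[] = {
		/* same OFFSET() initializers as in the patch */
	};

	static inline int vmcs_field_to_offset(unsigned long field)
	{
		/* the bounds check stays in sync with the initializers */
		if (field >= ARRAY_SIZE(vmcs_field_to_offset_table))
			return -1;
		return vmcs_field_to_offset_table[field];
	}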

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 13/24] Implement VMREAD and VMWRITE
  2010-06-13 12:29 ` [PATCH 13/24] Implement VMREAD and VMWRITE Nadav Har'El
@ 2010-06-14  9:36   ` Avi Kivity
  2010-06-16 14:48     ` Gleb Natapov
  2010-08-04 16:09     ` Nadav Har'El
  2010-06-16 15:03   ` Gleb Natapov
  1 sibling, 2 replies; 147+ messages in thread
From: Avi Kivity @ 2010-06-14  9:36 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

On 06/13/2010 03:29 PM, Nadav Har'El wrote:
> Implement the VMREAD and VMWRITE instructions. With these instructions, L1
> can read and write to the VMCS it is holding. The values are read or written
> to the fields of the shadow_vmcs structure introduced in the previous patch.
>
>
> +
> +static inline int vmcs_field_size(int field_type, struct kvm_vcpu *vcpu)
> +{
> +	switch (field_type) {
> +	case VMCS_FIELD_TYPE_U16:
> +		return 2;
> +	case VMCS_FIELD_TYPE_U32:
> +		return 4;
> +	case VMCS_FIELD_TYPE_U64:
> +		return 8;
> +	case VMCS_FIELD_TYPE_ULONG:
> +#ifdef CONFIG_X86_64
> +		if (is_long_mode(vcpu))
> +			return 8;
> +#endif
> +		return 4;
>    

No need for the ifdef, is_long_mode() works everywhere.

> +	}
> +	return 0; /* should never happen */
>    

Then BUG()?

> +}
> +
>   struct vcpu_vmx {
>   	struct kvm_vcpu       vcpu;
>   	struct list_head      local_vcpus_link;
> @@ -4184,6 +4220,189 @@ static int handle_vmclear(struct kvm_vcp
>   	return 1;
>   }
>
>
> +static int handle_vmread_reg(struct kvm_vcpu *vcpu, int reg,
> +			     unsigned long field)
> +{
> +	u64 field_value;
> +	if (!nested_vmcs_read_any(vcpu, field, &field_value))
> +		return 0;
> +
> +#ifdef CONFIG_X86_64
> +	switch (vmcs_field_type(field)) {
> +	case VMCS_FIELD_TYPE_U64: case VMCS_FIELD_TYPE_ULONG:
> +		if (!is_long_mode(vcpu)) {
> +			kvm_register_write(vcpu, reg+1, field_value >> 32);
>    

What's this reg+1 thing?  I thought vmread simply ignores the upper half.

> +			field_value = (u32)field_value;
> +		}
> +	}
> +#endif
> +	kvm_register_write(vcpu, reg, field_value);
> +	return 1;
> +}
> +
> +static int handle_vmread_mem(struct kvm_vcpu *vcpu, gva_t gva,
> +			     unsigned long field)
> +{
> +	u64 field_value;
> +	if (!nested_vmcs_read_any(vcpu, field, &field_value))
> +		return 0;
> +
> +	/* It's ok to use *_system, because handle_vmread verifies cpl=0 */
>    

> +	kvm_write_guest_virt_system(gva, &field_value,
> +			     vmcs_field_size(vmcs_field_type(field), vcpu),
> +			     vcpu, NULL);
>    

vmread doesn't support 64-bit writes to memory outside long mode, so 
you'll have to truncate the write.

I think you'll be better off returning a 32-bit size in 
vmcs_field_size() in these cases.
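
One way (sketch; the helper name is made up, and it assumes the size is 
only used to size memory operands here):

	static inline int vmcs_field_size_for_mem(int field_type,
						  struct kvm_vcpu *vcpu)
	{
		int size = vmcs_field_size(field_type, vcpu);

		/* outside long mode, vmread/vmwrite move at most 32 bits */
		return is_long_mode(vcpu) ? size : min(size, 4);
	}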

> +	return 1;
> +}
> +
> +static int handle_vmread(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long field;
> +	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> +	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
> +	gva_t gva = 0;
> +	int read_succeed;
> +
> +	if (!nested_vmx_check_permission(vcpu))
> +		return 1;
> +
> +	if (!nested_map_current(vcpu)) {
> +		printk(KERN_INFO "%s invalid shadow vmcs\n", __func__);
> +		set_rflags_to_vmx_fail_invalid(vcpu);
> +		return 1;
> +	}
>    

Can do the read_any() here.

> +
> +	/* decode instruction info to get the field to read and where to store
> +	 * its value */
> +	field = kvm_register_read(vcpu, VMX_OPERAND_REG2(vmx_instruction_info));
> +	if (VMX_OPERAND_IS_REG(vmx_instruction_info)) {
> +		read_succeed = handle_vmread_reg(vcpu,
> +			VMX_OPERAND_REG(vmx_instruction_info), field);
> +	} else {
> +		gva = get_vmx_mem_address(vcpu, exit_qualification,
> +					  vmx_instruction_info);
> +		if (gva == 0)
> +			return 1;
> +		read_succeed = handle_vmread_mem(vcpu, gva, field);
> +	}
> +
> +	if (read_succeed) {
> +		clear_rflags_cf_zf(vcpu);
> +		skip_emulated_instruction(vcpu);
> +	} else {
> +		set_rflags_to_vmx_fail_valid(vcpu);
> +		vmcs_write32(VM_INSTRUCTION_ERROR, 12);
>    

skip_emulated_instruction() in any case but an exception.

> +	}
> +
> +	nested_unmap_current(vcpu);
> +	return 1;
> +}
> +
> +
>
> +	if (VMX_OPERAND_IS_REG(vmx_instruction_info))
> +		field_value = kvm_register_read(vcpu,
> +			VMX_OPERAND_REG(vmx_instruction_info));
> +	else {
> +		gva  = get_vmx_mem_address(vcpu, exit_qualification,
> +			vmx_instruction_info);
> +		if (gva == 0)
> +			return 1;
> +		kvm_read_guest_virt(gva, &field_value,
> +			vmcs_field_size(field_type, vcpu), vcpu, NULL);
>    

Check for exception.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 14/24] Prepare vmcs02 from vmcs01 and vmcs12
  2010-06-13 12:29 ` [PATCH 14/24] Prepare vmcs02 from vmcs01 and vmcs12 Nadav Har'El
@ 2010-06-14 11:11   ` Avi Kivity
  2010-06-17  8:50   ` Gleb Natapov
  2010-07-06  6:25   ` Dong, Eddie
  2 siblings, 0 replies; 147+ messages in thread
From: Avi Kivity @ 2010-06-14 11:11 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

On 06/13/2010 03:29 PM, Nadav Har'El wrote:
> This patch contains code to prepare the VMCS which can be used to actually
> run the L2 guest, vmcs02. prepare_vmcs02 appropriately merges the information
> in shadow_vmcs that L1 built for L2 (vmcs12), and that in the VMCS that we
> built for L1 (vmcs01).
>
> VMREAD/WRITE can only access one VMCS at a time (the "current" VMCS), which
> makes it difficult for us to read from vmcs01 while writing to vmcs12. This
> is why we first make a copy of vmcs01 in memory (l1_shadow_vmcs) and then
> read that memory copy while writing to vmcs12.
>
> Signed-off-by: Nadav Har'El<nyh@il.ibm.com>
> ---
> --- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
> @@ -849,6 +849,36 @@ static inline bool report_flexpriority(v
>   	return flexpriority_enabled;
>   }
>
> +static inline bool nested_cpu_has_vmx_tpr_shadow(struct kvm_vcpu *vcpu)
> +{
> +	return cpu_has_vmx_tpr_shadow() &&
> +		get_shadow_vmcs(vcpu)->cpu_based_vm_exec_control &
> +		CPU_BASED_TPR_SHADOW;
>    

Operator precedence is with you, but the line width limit is not.  Use 
parentheses to improve readability.
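
E.g.:

	return cpu_has_vmx_tpr_shadow() &&
		(get_shadow_vmcs(vcpu)->cpu_based_vm_exec_control &
		 CPU_BASED_TPR_SHADOW);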

> @@ -1292,6 +1322,39 @@ static void vmx_load_host_state(struct v
>   	preempt_enable();
>   }
>
> +int load_vmcs_host_state(struct shadow_vmcs *src)
> +{
> +	vmcs_write16(HOST_ES_SELECTOR, src->host_es_selector);
> +	vmcs_write16(HOST_CS_SELECTOR, src->host_cs_selector);
> +	vmcs_write16(HOST_SS_SELECTOR, src->host_ss_selector);
> +	vmcs_write16(HOST_DS_SELECTOR, src->host_ds_selector);
> +	vmcs_write16(HOST_FS_SELECTOR, src->host_fs_selector);
> +	vmcs_write16(HOST_GS_SELECTOR, src->host_gs_selector);
> +	vmcs_write16(HOST_TR_SELECTOR, src->host_tr_selector);
>    

Why do we need to go through a shadow_vmcs for host fields?  Instead of 
cloning a vmcs, you can call a common init routine to initialize the 
host fields.

> +
> +	vmcs_write64(TSC_OFFSET, src->tsc_offset);
>    

Don't you need to adjust for our TSC_OFFSET?

> +/* prepare_vmcs_02 is called when the L1 guest hypervisor runs its nested
> + * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
> + * with L0's wishes for its guest (vmcs01), so we can run the L2 guest in a
> + * way that will both be appropriate to L1's requests, and our needs.
> + */
> +int prepare_vmcs_02(struct kvm_vcpu *vcpu,
> +	struct shadow_vmcs *vmcs12, struct shadow_vmcs *vmcs01)
> +{
> +	u32 exec_control;
> +
> +	load_vmcs_common(vmcs12);
> +
> +	vmcs_write64(VMCS_LINK_POINTER, vmcs12->vmcs_link_pointer);
>    

Not sure about this.  We don't use the vmcs link pointer, it's better to 
keep it at its default of -1ull.

> +	vmcs_write64(IO_BITMAP_A, vmcs01->io_bitmap_a);
> +	vmcs_write64(IO_BITMAP_B, vmcs01->io_bitmap_b);
>    

I guess merging the io bitmaps doesn't make much sense, at least at this 
stage.

> +	if (cpu_has_vmx_msr_bitmap())
> +		vmcs_write64(MSR_BITMAP, vmcs01->msr_bitmap);
>    

However, merging the msr bitmaps is critical.  Perhaps you do it in a 
later patch.

> +
> +	if (vmcs12->vm_entry_msr_load_count > 0 ||
> +			vmcs12->vm_exit_msr_load_count > 0 ||
> +			vmcs12->vm_exit_msr_store_count > 0) {
> +		printk(KERN_WARNING
> +			"%s: VMCS MSR_{LOAD,STORE} unsupported\n", __func__);
>    

Unfortunate, since kvm has started to use this feature.

For all unsupported mandatory features, we need reporting that is always 
enabled (i.e. no dprintk()), but to avoid flooding dmesg, use 
printk_ratelimit() or report just once.  Also, it's better to kill the 
guest (KVM_REQ_TRIPLE_FAULT, or a VM instruction error) than to let it 
continue incorrectly.
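
For example (sketch; whether a VM instruction error or a triple fault is 
more appropriate is a judgement call, and the request API may differ in 
your tree):

	if (vmcs12->vm_entry_msr_load_count > 0 ||
	    vmcs12->vm_exit_msr_load_count > 0 ||
	    vmcs12->vm_exit_msr_store_count > 0) {
		if (printk_ratelimit())
			printk(KERN_WARNING
			       "%s: VMCS MSR_{LOAD,STORE} unsupported\n",
			       __func__);
		/* don't silently run the guest with wrong MSR state */
		set_bit(KVM_REQ_TRIPLE_FAULT, &vcpu->requests);
		return 1;
	}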

> +	}
> +
> +	if (nested_cpu_has_vmx_tpr_shadow(vcpu)) {
> +		struct page *page =
> +			nested_get_page(vcpu, vmcs12->virtual_apic_page_addr);
> +		if (!page)
> +			return 1;
>    

?

> +		vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, page_to_phys(page));
> +		kvm_release_page_clean(page);
>    

Hm.  The host can move this page around.  If it happens not to be mapped 
into the guest, we won't get any notification.  So we need to keep a 
reference to this page, or else force an exit from the mmu notifier 
callbacks if it is removed by the host.

> +	}
> +
> +	if (nested_vm_need_virtualize_apic_accesses(vcpu)) {
> +		struct page *page =
> +			nested_get_page(vcpu, vmcs12->apic_access_addr);
> +		if (!page)
> +			return 1;
> +		vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page));
> +		kvm_release_page_clean(page);
> +	}
>    

Ditto.

> +
> +	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
> +		     (vmcs01->pin_based_vm_exec_control |
> +		      vmcs12->pin_based_vm_exec_control));
>    

We don't really want the guest's pin-based controls to influence our own 
(it doesn't really matter since ours are always set).  Rather, they 
should influence the interface between the local APIC and the guest.

Where do you check if the values are valid?  Otherwise the guest can 
easily crash where it expects a VM entry failure.


> +	vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK,
> +		     (vmcs01->page_fault_error_code_mask &
> +		      vmcs12->page_fault_error_code_mask));
> +	vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH,
> +		     (vmcs01->page_fault_error_code_match &
> +		      vmcs12->page_fault_error_code_match));
>    

Bit 14 in the exception bitmap also plays a part, I think it 
significantly changes the picture if set in the guest (the host will 
have it clear always IIRC).  Can perhaps avoid by clearing the whole 
thing if the guest is inverting the match.
> +	if (enable_ept) {
> +		if (!nested_cpu_has_vmx_ept(vcpu)) {
> +			vmcs_write64(EPT_POINTER, vmcs01->ept_pointer);
> +			vmcs_write64(GUEST_PDPTR0, vmcs01->guest_pdptr0);
> +			vmcs_write64(GUEST_PDPTR1, vmcs01->guest_pdptr1);
> +			vmcs_write64(GUEST_PDPTR2, vmcs01->guest_pdptr2);
> +			vmcs_write64(GUEST_PDPTR3, vmcs01->guest_pdptr3);
> +		}
>    

Currently we don't support guest ept, so the second condition can be 
avoided.

> +	}
> +
> +	exec_control = vmcs01->cpu_based_vm_exec_control;
> +	exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING;
> +	exec_control &= ~CPU_BASED_VIRTUAL_NMI_PENDING;
> +	exec_control &= ~CPU_BASED_TPR_SHADOW;
> +	exec_control |= vmcs12->cpu_based_vm_exec_control;
> +	if (!vm_need_tpr_shadow(vcpu->kvm) ||
> +	    vmcs12->virtual_apic_page_addr == 0) {
> +		exec_control &= ~CPU_BASED_TPR_SHADOW;
>    

Why?

> +#ifdef CONFIG_X86_64
> +		exec_control |= CPU_BASED_CR8_STORE_EXITING |
> +			CPU_BASED_CR8_LOAD_EXITING;
> +#endif
> +	} else if (exec_control & CPU_BASED_TPR_SHADOW) {
> +#ifdef CONFIG_X86_64
> +		exec_control &= ~CPU_BASED_CR8_STORE_EXITING;
> +		exec_control &= ~CPU_BASED_CR8_LOAD_EXITING;
> +#endif
>    

This should be part of the general checks on valid control values.

> +	}
> +	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control);
> +
> +	/* EXCEPTION_BITMAP and CR0_GUEST_HOST_MASK should basically be the
> +	 * bitwise-or of what L1 wants to trap for L2, and what we want to
> +	 * trap. However, vmx_fpu_activate/deactivate may have happened after
> +	 * we saved vmcs01, so we shouldn't trust its TS and NM_VECTOR bits
> +	 * and need to base them again on fpu_active. Note that CR0.TS also
> +	 * needs updating - we do this after this function returns (in
> +	 * nested_vmx_run).
> +	 */
> +	vmcs_write32(EXCEPTION_BITMAP,
> +		     ((vmcs01->exception_bitmap & ~(1u << NM_VECTOR)) |
> +		      (vcpu->fpu_active ? 0 : (1u << NM_VECTOR)) |
> +		      vmcs12->exception_bitmap));
>    

Perhaps moving this to update_exception_bitmap will make this clearer.

> +	vmcs_writel(CR0_GUEST_HOST_MASK, vmcs12->cr0_guest_host_mask |
> +			(vcpu->fpu_active ? 0 : X86_CR0_TS));
> +	vcpu->arch.cr0_guest_owned_bits = ~(vmcs12->cr0_guest_host_mask |
> +			(vcpu->fpu_active ? 0 : X86_CR0_TS));
>    

I'm worried that managing this in two separate places will cause problems.

> +
> +	vmcs_write32(VM_EXIT_CONTROLS,
> +		     (vmcs01->vm_exit_controls &
> +			(~(VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT)))
> +		       | vmcs12->vm_exit_controls);
>    

Why do you drop PAT load/save?

> +
> +	vmcs_write32(VM_ENTRY_CONTROLS,
> +		     (vmcs01->vm_entry_controls &
> +			(~(VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE)))
> +		      | vmcs12->vm_entry_controls);
> +
> +	vmcs_writel(CR4_GUEST_HOST_MASK,
> +		    (vmcs01->cr4_guest_host_mask &
> +		     vmcs12->cr4_guest_host_mask));
> +
> +	return 0;
> +}
> +
>   static struct kvm_x86_ops vmx_x86_ops = {
>   	.cpu_has_kvm_support = cpu_has_kvm_support,
>   	.disabled_by_bios = vmx_disabled_by_bios,
>    


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 16/24] Implement VMLAUNCH and VMRESUME
  2010-06-13 12:30 ` [PATCH 16/24] Implement VMLAUNCH and VMRESUME Nadav Har'El
@ 2010-06-14 11:41   ` Avi Kivity
  2010-09-26 11:14     ` Nadav Har'El
  2010-06-17 10:59   ` Gleb Natapov
  1 sibling, 1 reply; 147+ messages in thread
From: Avi Kivity @ 2010-06-14 11:41 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

On 06/13/2010 03:30 PM, Nadav Har'El wrote:
> Implement the VMLAUNCH and VMRESUME instructions, allowing a guest
> hypervisor to run its own guests.
>
> Signed-off-by: Nadav Har'El<nyh@il.ibm.com>
> ---
> --- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
> @@ -272,6 +272,9 @@ struct __attribute__ ((__packed__)) vmcs
>   	struct shadow_vmcs shadow_vmcs;
>
>   	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
> +
> +	int cpu;
>    

Not sure cpu should be here.  It certainly won't survive live 
migration.  Perhaps in struct vmcs_list (which should be renamed, 
perhaps struct cached_vmcs).

> +	int launched;
>   };
>    

What's the difference between this and launch_state?

>
>   struct vmcs_list {
> @@ -297,6 +300,24 @@ struct nested_vmx {
>   	/* list of real (hardware) VMCS, one for each L2 guest of L1 */
>   	struct list_head l2_vmcs_list; /* a vmcs_list */
>   	int l2_vmcs_num;
> +
> +	/* Are we running a nested guest now */
> +	bool nested_mode;
> +	/* Level 1 state for switching to level 2 and back */
> +	struct  {
> +		u64 efer;
> +		unsigned long cr3;
> +		unsigned long cr4;
> +		u64 io_bitmap_a;
> +		u64 io_bitmap_b;
> +		u64 msr_bitmap;
> +		int cpu;
> +		int launched;
> +	} l1_state;
>    

This state needs save/restore support (as well as the current vmptr and 
vmxon state).

> +	/* Level 1 shadow vmcs for switching to level 2 and back */
> +	struct shadow_vmcs *l1_shadow_vmcs;
>    

Again, not really happy about shadowing the non-nested vmcs.

> +	/* Level 1 vmcs loaded into the processor */
> +	struct vmcs *l1_vmcs;
>   };
>
>   enum vmcs_field_type {
> @@ -1407,6 +1428,19 @@ static void vmx_vcpu_load(struct kvm_vcp
>   			new_offset = vmcs_read64(TSC_OFFSET) + delta;
>   			vmcs_write64(TSC_OFFSET, new_offset);
>   		}
> +
> +		if (vmx->nested.l1_shadow_vmcs != NULL) {
> +			struct shadow_vmcs *l1svmcs =
> +				vmx->nested.l1_shadow_vmcs;
> +			l1svmcs->host_tr_base = vmcs_readl(HOST_TR_BASE);
> +			l1svmcs->host_gdtr_base = vmcs_readl(HOST_GDTR_BASE);
> +			l1svmcs->host_ia32_sysenter_esp =
> +				vmcs_readl(HOST_IA32_SYSENTER_ESP);
>    

These are all static (at least on a single cpu).  No need to read them 
from a vmcs.

> +			if (tsc_this < vcpu->arch.host_tsc)
> +				l1svmcs->tsc_offset = vmcs_read64(TSC_OFFSET);
> +			if (vmx->nested.nested_mode)
> +				load_vmcs_host_state(l1svmcs);
> +		}
>   	}
>   }
>
>
> @@ -4348,6 +4392,42 @@ static int handle_vmclear(struct kvm_vcp
>   	return 1;
>   }
>
> +static int nested_vmx_run(struct kvm_vcpu *vcpu);
> +
> +static int handle_launch_or_resume(struct kvm_vcpu *vcpu, bool launch)
> +{
> +	if (!nested_vmx_check_permission(vcpu))
> +		return 1;
> +
> +	if (!nested_map_current(vcpu))
> +		return 1;
>    

Better error handling needed, perhaps triple fault.

> +	if (to_vmx(vcpu)->nested.current_l2_page->launch_state == launch) {
> +		/* Must use VMLAUNCH for the first time, VMRESUME later */
> +		set_rflags_to_vmx_fail_valid(vcpu);
> +		nested_unmap_current(vcpu);
>    

skip_emulated_instruction();

> +		return 1;
> +	}
> +	nested_unmap_current(vcpu);
> +
> +	skip_emulated_instruction(vcpu);
> +
> +	nested_vmx_run(vcpu);
> +	return 1;
> +}
>
> @@ -4958,7 +5038,8 @@ static int vmx_handle_exit(struct kvm_vc
>   		       "(0x%x) and exit reason is 0x%x\n",
>   		       __func__, vectoring_info, exit_reason);
>
> -	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked)) {
> +	if (!vmx->nested.nested_mode &&
> +		unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked)) {
>    

Too much indent.  The unlikely() looks like the first statement of the 
block.

I think it isn't enough to check for nested mode.  If the guest hasn't 
enabled virtual NMIs, then the nested guest should behave exactly like 
the guest.

>
> +static int nested_vmx_run(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	vmx->nested.nested_mode = 1;
>    

true

> +	sync_cached_regs_to_vmcs(vcpu);
> +	save_vmcs(vmx->nested.l1_shadow_vmcs);
> +
> +	vmx->nested.l1_state.efer = vcpu->arch.efer;
>    

Not sure why you need to save efer.  Ordinarily, vmx reconstructs it 
from the guest efer and the host size exit control; you can do the same.

> +	if (!enable_ept)
> +		vmx->nested.l1_state.cr3 = vcpu->arch.cr3;
>    

Ditto, isn't that HOST_CR3?

> +	vmx->nested.l1_state.cr4 = vcpu->arch.cr4;
>    

Ditto.

> +
> +	if (!nested_map_current(vcpu)) {
> +		set_rflags_to_vmx_fail_valid(vcpu);
> +		return 1;
> +	}
> +
> +	if (cpu_has_vmx_msr_bitmap())
> +		vmx->nested.l1_state.msr_bitmap = vmcs_read64(MSR_BITMAP);
> +	else
> +		vmx->nested.l1_state.msr_bitmap = 0;
> +
> +	vmx->nested.l1_state.io_bitmap_a = vmcs_read64(IO_BITMAP_A);
> +	vmx->nested.l1_state.io_bitmap_b = vmcs_read64(IO_BITMAP_B);
> +	vmx->nested.l1_vmcs = vmx->vmcs;
> +	vmx->nested.l1_state.cpu = vcpu->cpu;
> +	vmx->nested.l1_state.launched = vmx->launched;
> +
> +	vmx->vmcs = nested_get_current_vmcs(vcpu);
> +	if (!vmx->vmcs) {
> +		printk(KERN_ERR "Missing VMCS\n");
> +		set_rflags_to_vmx_fail_valid(vcpu);
> +		return 1;
> +	}
> +
> +	vcpu->cpu = vmx->nested.current_l2_page->cpu;
>    

How can this change?  It must remain constant between 
kvm_arch_vcpu_load() and kvm_arch_vcpu_put().

> +	vmx->launched = vmx->nested.current_l2_page->launched;
> +
> +	if (!vmx->nested.current_l2_page->launch_state || !vmx->launched) {
> +		vmcs_clear(vmx->vmcs);
> +		vmx->launched = 0;
> +		vmx->nested.current_l2_page->launch_state = 1;
> +	}
> +
> +	vmx_vcpu_load(vcpu, get_cpu());
> +	put_cpu();
> +
> +	prepare_vmcs_02(vcpu,
> +		get_shadow_vmcs(vcpu), vmx->nested.l1_shadow_vmcs);
> +
> +	if (get_shadow_vmcs(vcpu)->vm_entry_controls &
> +	    VM_ENTRY_IA32E_MODE) {
> +		if (!((vcpu->arch.efer & EFER_LMA) &&
> +		      (vcpu->arch.efer & EFER_LME)))
> +			vcpu->arch.efer |= (EFER_LMA | EFER_LME);
> +	} else {
> +		if ((vcpu->arch.efer & EFER_LMA) ||
> +		    (vcpu->arch.efer & EFER_LME))
> +			vcpu->arch.efer = 0;
> +	}
> +
> +	/* vmx_set_cr0() sets the cr0 that L2 will read, to be the one that L1
> +	 * dictated, and takes appropriate actions for special cr0 bits (like
> +	 * real mode, etc.).
> +	 */
> +	vmx_set_cr0(vcpu,
> +		(get_shadow_vmcs(vcpu)->guest_cr0 &
> +			~get_shadow_vmcs(vcpu)->cr0_guest_host_mask) |
> +		(get_shadow_vmcs(vcpu)->cr0_read_shadow &
> +			get_shadow_vmcs(vcpu)->cr0_guest_host_mask));
> +
> +	/* However, vmx_set_cr0 incorrectly enforces KVM's relationship between
> +	 * GUEST_CR0 and CR0_READ_SHADOW, e.g., that the former is the same as
> +	 * the latter with TS added if !fpu_active. We need to take the
> +	 * actual GUEST_CR0 that L1 wanted, just with added TS if !fpu_active
> +	 * like KVM wants (for the "lazy fpu" feature, to avoid the costly
> +	 * restoration of fpu registers until the FPU is really used).
> +	 */
> +	vmcs_writel(GUEST_CR0, get_shadow_vmcs(vcpu)->guest_cr0 |
> +		(vcpu->fpu_active ? 0 : X86_CR0_TS));
>    

Please update vmx_set_cr0() instead.

> +
> +	vmx_set_cr4(vcpu, get_shadow_vmcs(vcpu)->guest_cr4);
>    

Note: kvm_set_cr4() does some stuff that vmx_set_cr4() doesn't.  Esp. 
the kvm_mmu_reset_context().

> +	vmcs_writel(CR4_READ_SHADOW,
> +		    get_shadow_vmcs(vcpu)->cr4_read_shadow);
> +
> +	/* we have to set the X86_CR0_PG bit of the cached cr0, because
> +	 * kvm_mmu_reset_context enables paging only if X86_CR0_PG is set in
> +	 * CR0 (we need the paging so that KVM treat this guest as a paging
> +	 * guest so we can easily forward page faults to L1.)
> +	 */
> +	vcpu->arch.cr0 |= X86_CR0_PG;
>    

Since this version doesn't support unrestricted nested guests, cr0.pg 
will be already set or we will have failed vmentry.

> +
> +	if (enable_ept && !nested_cpu_has_vmx_ept(vcpu)) {
>    

We don't support nested ept yet, yes?

> +		vmcs_write32(GUEST_CR3, get_shadow_vmcs(vcpu)->guest_cr3);
> +		vmx->vcpu.arch.cr3 = get_shadow_vmcs(vcpu)->guest_cr3;
>    

Should be via kvm_set_cr3().

> +	} else {
> +		int r;
> +		kvm_set_cr3(vcpu, get_shadow_vmcs(vcpu)->guest_cr3);
> +		kvm_mmu_reset_context(vcpu);
> +
> +		nested_unmap_current(vcpu);
> +
> +		r = kvm_mmu_load(vcpu);
>    

Ordinary guest entry will load the mmu.  Failures here can only be 
memory allocation and should not be visible to the guest anyway (we 
return -ENOMEM to userspace and that's it).

> +		if (unlikely(r)) {
> +			printk(KERN_ERR "Error in kvm_mmu_load r %d\n", r);
> +			set_rflags_to_vmx_fail_valid(vcpu);
> +			/* switch back to L1 */
> +			vmx->nested.nested_mode = 0;
> +			vmx->vmcs = vmx->nested.l1_vmcs;
> +			vcpu->cpu = vmx->nested.l1_state.cpu;
> +			vmx->launched = vmx->nested.l1_state.launched;
> +
> +			vmx_vcpu_load(vcpu, get_cpu());
> +			put_cpu();
> +
> +			return 1;
> +		}
> +
> +		nested_map_current(vcpu);
> +	}
> +
> +	kvm_register_write(vcpu, VCPU_REGS_RSP,
> +			   get_shadow_vmcs(vcpu)->guest_rsp);
> +	kvm_register_write(vcpu, VCPU_REGS_RIP,
> +			   get_shadow_vmcs(vcpu)->guest_rip);
> +
> +	nested_unmap_current(vcpu);
> +
> +	return 1;
> +}
> +
>    

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 18/24] Exiting from L2 to L1
  2010-06-13 12:31 ` [PATCH 18/24] Exiting from L2 to L1 Nadav Har'El
@ 2010-06-14 12:04   ` Avi Kivity
  2010-09-12 14:05     ` Nadav Har'El
  2010-09-14 13:07     ` Nadav Har'El
  0 siblings, 2 replies; 147+ messages in thread
From: Avi Kivity @ 2010-06-14 12:04 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

On 06/13/2010 03:31 PM, Nadav Har'El wrote:
> This patch implements nested_vmx_vmexit(), called when the nested L2 guest
> exits and we want to run its L1 parent and let it handle this exit.
>
> Note that this will not necessarily be called on every L2 exit. L0 may decide
> to handle a particular exit on its own, without L1's involvement; In that
> case, L0 will handle the exit, and resume running L2, without running L1 and
> without calling nested_vmx_vmexit(). The logic for deciding whether to handle
> a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
> will appear in the next patch.
>
>
>
> +/* prepare_vmcs_12 is called when the nested L2 guest exits and we want to
> + * prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12), and this
> + * function updates it to reflect the state of the registers during the exit,
> + * and to reflect some changes that happened while L2 was running (and perhaps
> + * made some exits which were handled directly by L0 without going back to L1).
> + */
> +void prepare_vmcs_12(struct kvm_vcpu *vcpu)
> +{
> +	struct shadow_vmcs *vmcs12 = get_shadow_vmcs(vcpu);
> +
> +	vmcs12->guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR);
> +	vmcs12->guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR);
> +	vmcs12->guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR);
> +	vmcs12->guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR);
> +	vmcs12->guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR);
> +	vmcs12->guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR);
> +	vmcs12->guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR);
> +	vmcs12->guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR);
> +
> +	vmcs12->tsc_offset = vmcs_read64(TSC_OFFSET);
>    

TSC_OFFSET cannot have changed.

> +	vmcs12->guest_physical_address = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
>    

Not available without EPT.

> +	vmcs12->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);
>    

Can this change?

> +	vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
>    

Without msr bitmaps, cannot change.

> +	if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
> +		vmcs12->guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT);
>    

Should check for VM_EXIT_SAVE_IA32_PAT, no?  Also unneeded without msr 
bitmaps and passthrough for this msr.

> +	vmcs12->cr3_target_count = vmcs_read32(CR3_TARGET_COUNT);
>    

R/O

> +	vmcs12->vm_entry_intr_info_field =
> +		vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
>    

Autocleared, no need to read.

> +	vmcs12->vm_entry_exception_error_code =
> +		vmcs_read32(VM_ENTRY_EXCEPTION_ERROR_CODE);
> +	vmcs12->vm_entry_instruction_len =
> +		vmcs_read32(VM_ENTRY_INSTRUCTION_LEN);
>    

R/O

> +	vmcs12->vm_instruction_error = vmcs_read32(VM_INSTRUCTION_ERROR);
>    

We don't want to pass this to the guest?

> +	vmcs12->vm_exit_reason  = vmcs_read32(VM_EXIT_REASON);
> +	vmcs12->vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
> +	vmcs12->vm_exit_intr_error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
> +	vmcs12->idt_vectoring_info_field =
> +		vmcs_read32(IDT_VECTORING_INFO_FIELD);
> +	vmcs12->idt_vectoring_error_code =
> +		vmcs_read32(IDT_VECTORING_ERROR_CODE);
> +	vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
> +	vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
>    

For the above, if the host handles the exit, we must not clobber guest 
fields.  A subsequent guest vmread will see the changed values even 
though from its point of view a vmexit has not occurred.

But no, that can't happen, since a vmread needs a vmexit to happen 
first.  Still, best to delay this.
> +	/* If any of the CR0_GUEST_HOST_MASK bits are off, the L2 guest may
> +	 * have changed some cr0 bits without us ever saving them in the shadow
> +	 * vmcs. So we need to save these changes now.
> +	 * In the current code, the only GHM bit which can be off is TS (it
> +	 * will be off when fpu_active and L1 also set it to off).
> +	 */
> +	vmcs12->guest_cr0 = vmcs_readl(GUEST_CR0);
> +
> +	/* But this may not be the guest_cr0 that the L1 guest hypervisor
> +	 * actually thought it was giving its L2 guest. It is possible that
> +	 * L1 wished to allow its guest to set a cr0 bit directly, but we (L0)
> +	 * captured this attempt and instead set just the read shadow. If this
> +	 * is the case, we need copy these read-shadow bits back to guest_cr0,
> +	 * where L1 believes they already are. Note that we must read the
> +	 * actual CR0_READ_SHADOW (which is what L0 may have changed), not
> +	 * vmcs12->cr0_read_shadow (which L1 defined, and we don't
> +	 * change without being told by L1). Currently, the only bit where
> +	 * this can happen is TS.
> +	 */
> +	if (!(vcpu->arch.cr0_guest_owned_bits & X86_CR0_TS)
> +			&& !(vmcs12->cr0_guest_host_mask & X86_CR0_TS))
> +		vmcs12->guest_cr0 =
> +			(vmcs12->guest_cr0 & ~X86_CR0_TS) |
> +			(vmcs_readl(CR0_READ_SHADOW) & X86_CR0_TS);
> +
> +	vmcs12->guest_cr4 = vmcs_readl(GUEST_CR4);
>    

Can't we have the same issue with cr4?

Better to have some helpers to do the common magic, and not encode the 
special knowledge about TS into it (make it generic).
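
Something generic along these lines would cover both cr0 and cr4 
(untested sketch; l0_owned would be vcpu->arch.cr0_guest_owned_bits and 
l1_mask the vmcs12 guest_host_mask in the cr0 case):

	/* Bits that L0 intercepts but L1 does not were possibly diverted
	 * into the hardware read shadow; fold them back into the value
	 * that L1 will see in its guest_crN field.
	 */
	static unsigned long nested_fixup_guest_cr(unsigned long guest_val,
						   unsigned long read_shadow,
						   unsigned long l0_owned,
						   unsigned long l1_mask)
	{
		unsigned long fold = ~l0_owned & ~l1_mask;

		return (guest_val & ~fold) | (read_shadow & fold);
	}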

> +
> +int switch_back_vmcs(struct kvm_vcpu *vcpu)
> +{
> +	struct shadow_vmcs *src = to_vmx(vcpu)->nested.l1_shadow_vmcs;
> +
> +	if (enable_vpid && src->virtual_processor_id != 0)
> +		vmcs_write16(VIRTUAL_PROCESSOR_ID, src->virtual_processor_id);
>    

IIUC vpids are not exposed to the guest yet?  So the VPID should not 
change between guest and nested guest.

> +
> +	vmcs_write64(IO_BITMAP_A, src->io_bitmap_a);
> +	vmcs_write64(IO_BITMAP_B, src->io_bitmap_b);
>    

Why change the I/O bitmap?

> +
> +	if (cpu_has_vmx_msr_bitmap())
> +		vmcs_write64(MSR_BITMAP, src->msr_bitmap);
>    

Or the msr bitmap?  After all, we're switching the entire vmcs?

> +
> +	vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, src->virtual_apic_page_addr);
> +
> +	if (vm_need_virtualize_apic_accesses(vcpu->kvm))
> +		vmcs_write64(APIC_ACCESS_ADDR,
> +			     src->apic_access_addr);
> +
> +	if (enable_ept) {
> +		vmcs_write64(EPT_POINTER, src->ept_pointer);
> +		vmcs_write64(GUEST_PDPTR0, src->guest_pdptr0);
> +		vmcs_write64(GUEST_PDPTR1, src->guest_pdptr1);
> +		vmcs_write64(GUEST_PDPTR2, src->guest_pdptr2);
> +		vmcs_write64(GUEST_PDPTR3, src->guest_pdptr3);
> +	}
>    

A kvm_set_cr3(src->host_cr3) should do all that and more, no?

> +
> +	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, src->pin_based_vm_exec_control);
> +	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, src->cpu_based_vm_exec_control);
> +	vmcs_write32(EXCEPTION_BITMAP, src->exception_bitmap);
> +	vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK,
> +		     src->page_fault_error_code_mask);
> +	vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH,
> +		     src->page_fault_error_code_match);
> +	vmcs_write32(VM_EXIT_CONTROLS, src->vm_exit_controls);
> +	vmcs_write32(VM_ENTRY_CONTROLS, src->vm_entry_controls);
>    

Why write all these?  What could have changed them?

> +
> +	if (cpu_has_secondary_exec_ctrls())
> +		vmcs_write32(SECONDARY_VM_EXEC_CONTROL,
> +			     src->secondary_vm_exec_control);
> +
> +	load_vmcs_common(src);
> +
> +	load_vmcs_host_state(to_vmx(vcpu)->nested.l1_shadow_vmcs);
> +
> +	return 0;
> +}
> +
> +static int nested_vmx_vmexit(struct kvm_vcpu *vcpu,
> +			     bool is_interrupt)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	int efer_offset;
> +
> +	if (!vmx->nested.nested_mode) {
> +		printk(KERN_INFO "WARNING: %s called but not in nested mode\n",
> +		       __func__);
> +		return 0;
> +	}
> +
> +	sync_cached_regs_to_vmcs(vcpu);
> +
> +	if (!nested_map_current(vcpu)) {
> +		printk(KERN_INFO "Error mapping shadow vmcs\n");
> +		set_rflags_to_vmx_fail_valid(vcpu);
> +		return 1;
> +	}
> +
> +	prepare_vmcs_12(vcpu);
> +	if (is_interrupt)
> +		get_shadow_vmcs(vcpu)->vm_exit_reason =
> +			EXIT_REASON_EXTERNAL_INTERRUPT;
> +
> +	vmx->nested.current_l2_page->launched = vmx->launched;
> +	vmx->nested.current_l2_page->cpu = vcpu->cpu;
> +
> +	nested_unmap_current(vcpu);
> +
> +	vmx->vmcs = vmx->nested.l1_vmcs;
> +	vcpu->cpu = vmx->nested.l1_state.cpu;
> +	vmx->launched = vmx->nested.l1_state.launched;
> +
> +	vmx_vcpu_load(vcpu, get_cpu());
> +	put_cpu();
> +
> +	vcpu->arch.efer = vmx->nested.l1_state.efer;
> +	if ((vcpu->arch.efer & EFER_LMA) &&
> +	    !(vcpu->arch.efer & EFER_SCE))
> +		vcpu->arch.efer |= EFER_SCE;
> +
> +	efer_offset = __find_msr_index(vmx, MSR_EFER);
> +	if (update_transition_efer(vmx, efer_offset))
> +		wrmsrl(MSR_EFER, vmx->guest_msrs[efer_offset].data);
> +
> +	/* We're running a regular L1 guest again, so we do the regular KVM
> +	 * thing: run vmx_set_cr0 with the cr0 bits the guest thinks it has
> +	 * (this can be figured out by combining its old guest_cr0 and
> +	 * cr0_read_shadow, using the cr0_guest_host_mask). vmx_set_cr0 might
> +	 * use slightly different bits on the new guest_cr0 it sets, e.g.,
> +	 * add TS when !fpu_active.
> +	 */
> +	vmx_set_cr0(vcpu,
> +		(vmx->nested.l1_shadow_vmcs->cr0_guest_host_mask &
> +		vmx->nested.l1_shadow_vmcs->cr0_read_shadow) |
> +		(~vmx->nested.l1_shadow_vmcs->cr0_guest_host_mask &
> +		vmx->nested.l1_shadow_vmcs->guest_cr0));
>    

Helper wanted.
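
E.g. something like (sketch; the name is just a suggestion):

	/* the cr0 value L1 believes its guest is running with */
	static inline unsigned long guest_readable_cr0(struct shadow_vmcs *fields)
	{
		return (fields->guest_cr0 & ~fields->cr0_guest_host_mask) |
		       (fields->cr0_read_shadow & fields->cr0_guest_host_mask);
	}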

> +
> +	vmx_set_cr4(vcpu, vmx->nested.l1_state.cr4);
> +
>    

Again, the kvm_set_crx() versions have more meat.

> +	if (enable_ept) {
> +		vcpu->arch.cr3 = vmx->nested.l1_shadow_vmcs->guest_cr3;
> +		vmcs_write32(GUEST_CR3, vmx->nested.l1_shadow_vmcs->guest_cr3);
> +	} else {
> +		kvm_set_cr3(vcpu, vmx->nested.l1_state.cr3);
> +	}
>    

kvm_set_cr3() will load the PDPTRs in the EPT case (correctly in case 
the nested guest was able to corrupt the guest's PDPT).

> +
> +	if (!nested_map_current(vcpu)) {
> +		printk(KERN_INFO "Error mapping shadow vmcs\n");
> +		set_rflags_to_vmx_fail_valid(vcpu);
> +		return 1;
> +	}
> +
> +	switch_back_vmcs(vcpu);
> +
> +	nested_unmap_current(vcpu);
> +
> +	kvm_register_write(vcpu, VCPU_REGS_RSP,
> +			   vmx->nested.l1_shadow_vmcs->guest_rsp);
> +	kvm_register_write(vcpu, VCPU_REGS_RIP,
> +			   vmx->nested.l1_shadow_vmcs->guest_rip);
> +
> +	vmx->nested.nested_mode = 0;
> +
> +	/* If we did fpu_activate()/fpu_deactive() during l2's run, we need
> +	 * to apply the same changes also when running l1. We don't need to
> +	 * change cr0 here - we already did this above - just the
> +	 * cr0_guest_host_mask, and exception bitmap.
> +	 */
> +	vmcs_write32(EXCEPTION_BITMAP,
> +		(vmx->nested.l1_shadow_vmcs->exception_bitmap &
> +			~(1u << NM_VECTOR)) |
> +			(vcpu->fpu_active ? 0 : (1u << NM_VECTOR)));
> +	vcpu->arch.cr0_guest_owned_bits = (vcpu->fpu_active ? X86_CR0_TS : 0);
> +	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
> +
> +	kvm_mmu_reset_context(vcpu);
> +	kvm_mmu_load(vcpu);
>    

kvm_mmu_load() unneeded, usually.

> +
> +	if (unlikely(vmx->fail)) {
> +		vmx->fail = 0;
> +		set_rflags_to_vmx_fail_valid(vcpu);
> +	} else
> +		clear_rflags_cf_zf(vcpu);
> +
> +	return 0;
> +}
> +
>   static struct kvm_x86_ops vmx_x86_ops = {
>   	.cpu_has_kvm_support = cpu_has_kvm_support,
>   	.disabled_by_bios = vmx_disabled_by_bios,
>    

I'm probably missing something about the read/write of various vmcs fields.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 19/24] Deciding if L0 or L1 should handle an L2 exit
  2010-06-13 12:32 ` [PATCH 19/24] Deciding if L0 or L1 should handle an L2 exit Nadav Har'El
@ 2010-06-14 12:24   ` Avi Kivity
  2010-09-16 14:42     ` Nadav Har'El
  0 siblings, 1 reply; 147+ messages in thread
From: Avi Kivity @ 2010-06-14 12:24 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

On 06/13/2010 03:32 PM, Nadav Har'El wrote:
> This patch contains the logic of whether an L2 exit should be handled by L0
> and then L2 should be resumed, or whether L1 should be run to handle this
> exit (using the nested_vmx_vmexit() function of the previous patch).
>
> The basic idea is to let L1 handle the exit only if it actually asked to
> trap this sort of event. For example, when L2 exits on a change to CR0,
> we check L1's CR0_GUEST_HOST_MASK to see if L1 expressed interest in any
> bit which changed; If it did, we exit to L1. But if it didn't it means that
> it is we (L0) that wished to trap this event, so we handle it ourselves.
>
> The next two patches add additional logic of what to do when an interrupt or
> exception is injected: Does L0 need to do it, should we exit to L1 to do it,
> or should we resume L2 and keep the exception to be injected later.
>
> We keep a new flag, "nested_run_pending", which can override the decision of
> which should run next, L1 or L2. nested_run_pending=1 means that we *must* run
> L2 next, not L1. This is necessary in several situations where had L1 run on
> bare metal it would not have expected to be resumed at this stage. One
> example is when L1 did a VMLAUNCH of L2 and therefore expects L2 to be run.
> Another example is when L2 exits on an #NM exception that L0 asked for
> (because of lazy FPU loading), and L0 must deal with the exception and resume
> L2 which was in the middle of an instruction, and not resume L1 which does not
> expect to see an exit from L2 at this point. nested_run_pending is especially
> intended to avoid switching to L1 in the injection decision-point described
> above.
>
> @@ -3819,6 +3841,8 @@ static int handle_exception(struct kvm_v
>
>   	if (is_no_device(intr_info)) {
>   		vmx_fpu_activate(vcpu);
> +		if (vmx->nested.nested_mode)
> +			vmx->nested.nested_run_pending = 1;
>   		return 1;
>   	}
>    

Isn't this true for many other exceptions?  #UD which we emulate (but 
the guest doesn't trap), page faults which we handle completely...

>
> +
> +/* Return 1 if we should exit from L2 to L1 to handle a CR access exit,
> + * rather than handle it ourselves in L0. I.e., check if L1 wanted to
> + * intercept (via guest_host_mask etc.) the current event.
> + */
> +static bool nested_vmx_exit_handled_cr(struct kvm_vcpu *vcpu,
> +	struct shadow_vmcs *l2svmcs)
> +{
> +	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> +	int cr = exit_qualification & 15;
> +	int reg = (exit_qualification >> 8) & 15;
> +	unsigned long val = kvm_register_read(vcpu, reg);
> +
> +	switch ((exit_qualification >> 4) & 3) {
> +	case 0: /* mov to cr */
> +		switch (cr) {
> +		case 0:
> +			if (l2svmcs->cr0_guest_host_mask &
> +			    (val ^ l2svmcs->cr0_read_shadow))
> +				return 1;
> +			break;
> +		case 3:
> +			if (l2svmcs->cpu_based_vm_exec_control &
> +			    CPU_BASED_CR3_LOAD_EXITING)
> +				return 1;
> +			break;
> +		case 4:
> +			if (l2svmcs->cr4_guest_host_mask &
> +			    (l2svmcs->cr4_read_shadow ^ val))
> +				return 1;
> +			break;
> +		case 8:
> +			if (l2svmcs->cpu_based_vm_exec_control &
> +			    CPU_BASED_CR8_LOAD_EXITING)
> +				return 1;
>    

Should check TPR threshold here too if enabled.


> +	case 3: /* lmsw */
> +		if (l2svmcs->cr0_guest_host_mask &
> +		    (val ^ l2svmcs->cr0_read_shadow))
> +			return 1;
>    

Need to mask off bit 0 (cr0.pe) of val, since lmsw can't clear it.
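
I.e. something like (untested sketch; lmsw writes only cr0[3:0], and can 
set but never clear PE):

	case 3: /* lmsw */
		/* a trapped change in MP/EM/TS (cr0[3:1])? */
		if (l2svmcs->cr0_guest_host_mask & 0xe &
		    (val ^ l2svmcs->cr0_read_shadow))
			return 1;
		/* PE can only be set, never cleared, by lmsw */
		if ((l2svmcs->cr0_guest_host_mask & X86_CR0_PE) &&
		    !(l2svmcs->cr0_read_shadow & X86_CR0_PE) &&
		    (val & X86_CR0_PE))
			return 1;
		break;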

> +		break;
> +	}
> +	return 0;
> +}
> +
> +/* Return 1 if we should exit from L2 to L1 to handle an exit, or 0 if we
> + * should handle it ourselves in L0. Only call this when in nested_mode (L2).
> + */
> +static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu, bool afterexit)
> +{
> +	u32 exit_code = vmcs_read32(VM_EXIT_REASON);
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	u32 intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
> +	struct shadow_vmcs *l2svmcs;
> +	int r = 0;
> +
> +	if (vmx->nested.nested_run_pending)
> +		return 0;
> +
> +	if (unlikely(vmx->fail)) {
> +		printk(KERN_INFO "%s failed vm entry %x\n",
> +		       __func__, vmcs_read32(VM_INSTRUCTION_ERROR));
> +		return 1;
> +	}
> +
> +	if (afterexit) {
> +		/* There are some cases where we should let L1 handle certain
> +		 * events when these are injected (afterexit==0) but we should
> +		 * handle them in L0 on an exit (afterexit==1).
> +		 */
> +		switch (exit_code) {
> +		case EXIT_REASON_EXTERNAL_INTERRUPT:
> +			return 0;
> +		case EXIT_REASON_EXCEPTION_NMI:
> +			if (!is_exception(intr_info))
> +				return 0;
> +			if (is_page_fault(intr_info) && (!enable_ept))
> +				return 0;
>    

Some page faults do need an l2->l1 transition.  Maybe I'll see this later.

> +			break;
> +		case EXIT_REASON_EPT_VIOLATION:
> +			if (enable_ept)
> +				return 0;
> +			break;
> +		}
> +	}
> +
> +	if (!nested_map_current(vcpu))
> +		return 0;
> +	l2svmcs = get_shadow_vmcs(vcpu);
> +
> +	switch (exit_code) {
> +	case EXIT_REASON_INVLPG:
> +		if (l2svmcs->cpu_based_vm_exec_control &
> +		    CPU_BASED_INVLPG_EXITING)
> +			r = 1;
> +		break;
> +	case EXIT_REASON_MSR_READ:
> +	case EXIT_REASON_MSR_WRITE:
> +		r = nested_vmx_exit_handled_msr(vcpu, l2svmcs, exit_code);
> +		break;
> +	case EXIT_REASON_CR_ACCESS:
> +		r = nested_vmx_exit_handled_cr(vcpu, l2svmcs);
> +		break;
> +	case EXIT_REASON_DR_ACCESS:
> +		if (l2svmcs->cpu_based_vm_exec_control &
> +		    CPU_BASED_MOV_DR_EXITING)
> +			r = 1;
> +		break;
> +	case EXIT_REASON_EXCEPTION_NMI:
> +		if (is_external_interrupt(intr_info) &&
> +		    (l2svmcs->pin_based_vm_exec_control &
> +		     PIN_BASED_EXT_INTR_MASK))
> +			r = 1;
>    

A real external interrupt should never be handled by the guest, only a 
virtual external interrupt.

> +		else if (is_nmi(intr_info) &&
> +		    (l2svmcs->pin_based_vm_exec_control &
> +		     PIN_BASED_NMI_EXITING))
> +			r = 1;
>    

Ditto for nmi.

> +		else if (is_exception(intr_info) &&
> +		    (l2svmcs->exception_bitmap &
> +		     (1u << (intr_info & INTR_INFO_VECTOR_MASK))))
> +			r = 1;
>    

Bit 14 of the exception bitmap (page faults) is special and needs special treatment.
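
For the page-fault case the decision also involves the error-code 
mask/match pair, roughly (untested sketch):

	if (is_page_fault(intr_info)) {
		u32 ec = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
		bool match = (ec & l2svmcs->page_fault_error_code_mask) ==
			      l2svmcs->page_fault_error_code_match;

		/* exit to L1 iff bit 14 agrees with the match result */
		r = !!(l2svmcs->exception_bitmap & (1u << PF_VECTOR)) == match;
		break;
	}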

> +		else if (is_page_fault(intr_info))
> +			r = 1;
>    

Still looking for magic page fault handling...

> +		break;
> +	case EXIT_REASON_EXTERNAL_INTERRUPT:
> +		if (l2svmcs->pin_based_vm_exec_control &
> +		    PIN_BASED_EXT_INTR_MASK)
> +			r = 1;
> +		break;
> +	default:
> +		r = 1;
> +	}
> +	nested_unmap_current(vcpu);
> +
> +	return r;
> +}
> +
>   /*
>    * The guest has exited.  See if we can fix it or if we need userspace
>    * assistance.
>    

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 20/24] Correct handling of interrupt injection
  2010-06-13 12:32 ` [PATCH 20/24] Correct handling of interrupt injection Nadav Har'El
@ 2010-06-14 12:29   ` Avi Kivity
  2010-06-14 12:48     ` Avi Kivity
  2010-09-16 15:25     ` Nadav Har'El
  0 siblings, 2 replies; 147+ messages in thread
From: Avi Kivity @ 2010-06-14 12:29 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

On 06/13/2010 03:32 PM, Nadav Har'El wrote:
> When KVM wants to inject an interrupt, the guest should think a real interrupt
> has happened. Normally (in the non-nested case) this means checking that the
> guest doesn't block interrupts (and if it does, inject when it doesn't - using
> the "interrupt window" VMX mechanism), and setting up the appropriate VMCS
> fields for the guest to receive the interrupt.
>
> However, when we are running a nested guest (L2) and its hypervisor (L1)
> requested exits on interrupts (as most hypervisors do), the most efficient
> thing to do is to exit L2, telling L1 that the exit was caused by an
> interrupt, the one we were injecting; Only when L1 asked not to be notified
> of interrupts, we should inject it directly to the running guest L2 (i.e.,
> the normal code path).
>
> However, properly doing what is described above requires invasive changes to
> the flow of the existing code, which we elected not to do in this stage.
> Instead we do something more simplistic and less efficient: we modify
> vmx_interrupt_allowed(), which kvm calls to see if it can inject the interrupt
> now, to exit from L2 to L1 before continuing the normal code. The normal kvm
> code then notices that L1 is blocking interrupts, and sets the interrupt
> window to inject the interrupt later to L1. Shortly after, L1 gets the
> interrupt while it is itself running, not as an exit from L2. The cost is an
> extra L1 exit (the interrupt window).
>
>    

That's a little sad.

>
>   	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
>   	cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
> @@ -3718,6 +3738,13 @@ static int nested_vmx_vmexit(struct kvm_
>
>   static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
>   {
> +	if (to_vmx(vcpu)->nested.nested_mode && nested_exit_on_intr(vcpu)) {
> +		if (to_vmx(vcpu)->nested.nested_run_pending)
> +			return 0;
> +		nested_vmx_vmexit(vcpu, true);
> +		/* fall through to normal code, but now in L1, not L2 */
> +	}
> +
>    

What exit is reported here?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 0/24] Nested VMX, v5
  2010-06-13 12:22 [PATCH 0/24] Nested VMX, v5 Nadav Har'El
                   ` (23 preceding siblings ...)
  2010-06-13 12:34 ` [PATCH 24/24] Miscellenous small corrections Nadav Har'El
@ 2010-06-14 12:34 ` Avi Kivity
  2010-06-14 13:03   ` Nadav Har'El
  2010-07-09  8:59 ` Dong, Eddie
  2010-07-15  3:27 ` Sheng Yang
  26 siblings, 1 reply; 147+ messages in thread
From: Avi Kivity @ 2010-06-14 12:34 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

On 06/13/2010 03:22 PM, Nadav Har'El wrote:
> Hi Avi,
>
> This is a followup of our nested VMX patches that Orit Wasserman posted in
> December. We've addressed most of the comments and concerns that you and
> others on the mailing list had with the previous patch set. We hope you'll
> find these patches easier to understand, and suitable for applying to KVM.
>
>
> The following 24 patches implement nested VMX support. The patches enable a
> guest to use the VMX APIs in order to run its own nested guests. I.e., it
> allows running hypervisors (that use VMX) under KVM. We describe the theory
> behind this work, our implementation, and its performance characteristics,
> in IBM Research report H-0282, "The Turtles Project: Design and Implementation
> of Nested Virtualization", available at:
>
> 	http://bit.ly/a0o9te
>
> The current patches support running Linux under a nested KVM using shadow
> page table (with bypass_guest_pf disabled). They support multiple nested
> hypervisors, which can run multiple guests. Only 64-bit nested hypervisors
> are supported. SMP is supported. Additional patches for running Windows under
> nested KVM, and Linux under nested VMware server, and support for nested EPT,
> are currently running in the lab, and will be sent as follow-on patchsets.
>
> These patches were written by:
>       Abel Gordon, abelg<at>  il.ibm.com
>       Nadav Har'El, nyh<at>  il.ibm.com
>       Orit Wasserman, oritw<at>  il.ibm.com
>       Ben-Ami Yassor, benami<at>  il.ibm.com
>       Muli Ben-Yehuda, muli<at>  il.ibm.com
>
> With contributions by:
>       Anthony Liguori, aliguori<at>  us.ibm.com
>       Mike Day, mdday<at>  us.ibm.com
>
> This work was inspired by the nested SVM support by Alexander Graf and Joerg
> Roedel.
>
>
> Changes since v4:
> * Rebased to the current KVM tree.
> * Support for lazy FPU loading.
> * Implemented about 90 requests and suggestions made on the mailing list
>    regarding the previous version of this patch set.
> * Split the changes into many more, and better documented, patches.
>
>    

Overall, very nice.  The finer split and better documentation really 
help reviewing, thanks.

Let's try to get this merged quickly.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1
  2010-06-14  8:49     ` Nadav Har'El
@ 2010-06-14 12:35       ` Avi Kivity
  0 siblings, 0 replies; 147+ messages in thread
From: Avi Kivity @ 2010-06-14 12:35 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

On 06/14/2010 11:49 AM, Nadav Har'El wrote:
> On Mon, Jun 14, 2010, Avi Kivity wrote about "Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1":
>    
>> On 06/13/2010 03:25 PM, Nadav Har'El wrote:
>>      
>>> +#define VMCS12_REVISION 0x11e57ed0
>>>
>>>        
>> Where did this number come from?  It's not from real hardware, yes?
>>      
> Basically, we are presenting emulated VMX for the L1 guest, complete with
> its own VMCS structure. This structure needs to have some VMCS revision id,
> which should be an arbitrary number that we invent - it is not related to any
> revision id that any real hardware uses. If you look closely, you can see that
> the number I used is leetspeak for "Nested0" ;-)
>    

Will have to brush up on my leetspeak, I see.

> As you can see in the following patches, MSR_IA32_VMX_BASIC will return this
> arbitrary VMCS revision id, and and VMPTRLD will verify that the VMCS region
> that L1 is trying to load contains this revision id.
>    

Ok good, I was worried this was a real hardware ID.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 20/24] Correct handling of interrupt injection
  2010-06-14 12:29   ` Avi Kivity
@ 2010-06-14 12:48     ` Avi Kivity
  2010-09-16 15:25     ` Nadav Har'El
  1 sibling, 0 replies; 147+ messages in thread
From: Avi Kivity @ 2010-06-14 12:48 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

On 06/14/2010 03:29 PM, Avi Kivity wrote:
> On 06/13/2010 03:32 PM, Nadav Har'El wrote:
>> When KVM wants to inject an interrupt, the guest should think a real 
>> interrupt
>> has happened. Normally (in the non-nested case) this means checking 
>> that the
>> guest doesn't block interrupts (and if it does, inject when it 
>> doesn't - using
>> the "interrupt window" VMX mechanism), and setting up the appropriate 
>> VMCS
>> fields for the guest to receive the interrupt.
>>
>> However, when we are running a nested guest (L2) and its hypervisor (L1)
>> requested exits on interrupts (as most hypervisors do), the most 
>> efficient
>> thing to do is to exit L2, telling L1 that the exit was caused by an
>> interrupt, the one we were injecting; Only when L1 asked not to be 
>> notified
>> of interrupts, we should inject it directly to the running guest 
>> L2 (i.e.,
>> the normal code path).
>>
>> However, properly doing what is described above requires invasive 
>> changes to
>> the flow of the existing code, which we elected not to do in this stage.
>> Instead we do something more simplistic and less efficient: we modify
>> vmx_interrupt_allowed(), which kvm calls to see if it can inject the 
>> interrupt
>> now, to exit from L2 to L1 before continuing the normal code. The 
>> normal kvm
>> code then notices that L1 is blocking interrupts, and sets the interrupt
>> window to inject the interrupt later to L1. Shortly after, L1 gets the
>> interrupt while it is itself running, not as an exit from L2. The 
>> cost is an
>> extra L1 exit (the interrupt window).
>>
>
> That's a little sad.

It can also be broken if the guest chooses to keep interrupts disabled 
during exits, and instead asks vmx to ack interrupts.  The guest can then 
vmread the vector number and dispatch the interrupt itself.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 0/24] Nested VMX, v5
  2010-06-14 12:34 ` [PATCH 0/24] Nested VMX, v5 Avi Kivity
@ 2010-06-14 13:03   ` Nadav Har'El
  2010-06-15 10:00     ` Avi Kivity
  0 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-06-14 13:03 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

On Mon, Jun 14, 2010, Avi Kivity wrote about "Re: [PATCH 0/24] Nested VMX, v5":
> Overall, very nice.  The finer split and better documentation really 
> help reviewing, thanks.

Thank you for the review and all the accurate comments!

> Let's try to get this merged quickly.

I'll start fixing the individual patches and resending them individually, and
when I've fixed everything I'll resubmit the whole lot. I hope that this time
I can do it in a matter of days, not months.

Thanks,
Nadav.

-- 
Nadav Har'El                        |       Monday, Jun 14 2010, 2 Tammuz 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |An egotist is a person of low taste, more
http://nadav.harel.org.il           |interested in himself than in me.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 0/24] Nested VMX, v5
  2010-06-14 13:03   ` Nadav Har'El
@ 2010-06-15 10:00     ` Avi Kivity
  2010-10-17 12:03       ` Nadav Har'El
  0 siblings, 1 reply; 147+ messages in thread
From: Avi Kivity @ 2010-06-15 10:00 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

On 06/14/2010 04:03 PM, Nadav Har'El wrote:
>
>> Let's try to get this merged quickly.
>>      
> I'll start fixing the individual patches and resending them individually, and
> when I've fixed everything I'll resubmit the whole lot. I hope that this time
> I can do it in a matter of days, not months.
>    

I've tried to test the patches, but I see a vm-entry failure code 7 on 
the very first vmentry.  Guest is Fedora 12 x86-64 (2.6.32.9-70.fc12).

If you can post a git tree with the next round, that will make it easier 
for people experimenting with the patches.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 4/24] Allow setting the VMXE bit in CR4
  2010-06-13 12:24 ` [PATCH 4/24] Allow setting the VMXE bit in CR4 Nadav Har'El
@ 2010-06-15 11:09   ` Gleb Natapov
  2010-06-15 14:44     ` Nadav Har'El
  0 siblings, 1 reply; 147+ messages in thread
From: Gleb Natapov @ 2010-06-15 11:09 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: avi, kvm

On Sun, Jun 13, 2010 at 03:24:37PM +0300, Nadav Har'El wrote:
> This patch allows the guest to enable the VMXE bit in CR4, which is a
> prerequisite to running VMXON.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
> --- .before/arch/x86/kvm/x86.c	2010-06-13 15:01:28.000000000 +0300
> +++ .after/arch/x86/kvm/x86.c	2010-06-13 15:01:28.000000000 +0300
> @@ -501,7 +501,7 @@ int __kvm_set_cr4(struct kvm_vcpu *vcpu,
>  		   && !load_pdptrs(vcpu, vcpu->arch.cr3))
>  		return 1;
>  
> -	if (cr4 & X86_CR4_VMXE)
> +	if (cr4 & X86_CR4_VMXE && !nested)
>  		return 1;
>  
We shouldn't be able to clear X86_CR4_VMXE after VMXON.
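
I.e. something along these lines in __kvm_set_cr4() (sketch; 
vmx_vmxon_active() is a made-up accessor for however the vmxon state ends 
up being exposed to x86.c):

	unsigned long old_cr4 = kvm_read_cr4(vcpu);

	/* refuse to clear VMXE while the vcpu is in VMX operation */
	if ((old_cr4 & X86_CR4_VMXE) && !(cr4 & X86_CR4_VMXE) &&
	    vmx_vmxon_active(vcpu))
		return 1;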

>  	kvm_x86_ops->set_cr4(vcpu, cr4);
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
			Gleb.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 7/24] Understanding guest pointers to vmcs12 structures
  2010-06-13 12:26 ` [PATCH 7/24] Understanding guest pointers to vmcs12 structures Nadav Har'El
  2010-06-14  8:48   ` Avi Kivity
@ 2010-06-15 12:14   ` Gleb Natapov
  2010-08-01 15:16     ` Nadav Har'El
  1 sibling, 1 reply; 147+ messages in thread
From: Gleb Natapov @ 2010-06-15 12:14 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: avi, kvm

On Sun, Jun 13, 2010 at 03:26:09PM +0300, Nadav Har'El wrote:
> This patch includes a couple of utility functions for extracting pointer
> operands of VMX instructions issued by L1 (a guest hypervisor), and
> translating guest-given vmcs12 virtual addresses to guest-physical addresses.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
> --- .before/arch/x86/kvm/x86.c	2010-06-13 15:01:29.000000000 +0300
> +++ .after/arch/x86/kvm/x86.c	2010-06-13 15:01:29.000000000 +0300
> @@ -3286,13 +3286,14 @@ static int kvm_fetch_guest_virt(gva_t ad
>  					  access | PFERR_FETCH_MASK, error);
>  }
>  
> -static int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
> +int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
>  			       struct kvm_vcpu *vcpu, u32 *error)
>  {
>  	u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
>  	return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, access,
>  					  error);
>  }
> +EXPORT_SYMBOL_GPL(kvm_read_guest_virt);
>  
>  static int kvm_read_guest_virt_system(gva_t addr, void *val, unsigned int bytes,
>  			       struct kvm_vcpu *vcpu, u32 *error)
> --- .before/arch/x86/kvm/x86.h	2010-06-13 15:01:29.000000000 +0300
> +++ .after/arch/x86/kvm/x86.h	2010-06-13 15:01:29.000000000 +0300
> @@ -75,6 +75,9 @@ static inline struct kvm_mem_aliases *kv
>  void kvm_before_handle_nmi(struct kvm_vcpu *vcpu);
>  void kvm_after_handle_nmi(struct kvm_vcpu *vcpu);
>  
> +int kvm_read_guest_virt(gva_t addr, void *val, unsigned int bytes,
> +			struct kvm_vcpu *vcpu, u32 *error);
> +
>  extern int nested;
>  
>  #endif
> --- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
> @@ -3654,6 +3654,86 @@ static int handle_vmoff(struct kvm_vcpu 
>  	return 1;
>  }
>  
> +/*
> + * Decode the memory-address operand of a vmx instruction, according to the
> + * Intel spec.
> + */
> +#define VMX_OPERAND_SCALING(vii)	((vii) & 3)
> +#define VMX_OPERAND_ADDR_SIZE(vii)	(((vii) >> 7) & 7)
> +#define VMX_OPERAND_IS_REG(vii)		((vii) & (1u << 10))
> +#define VMX_OPERAND_SEG_REG(vii)	(((vii) >> 15) & 7)
> +#define VMX_OPERAND_INDEX_REG(vii)	(((vii) >> 18) & 0xf)
> +#define VMX_OPERAND_INDEX_INVALID(vii)	((vii) & (1u << 22))
> +#define VMX_OPERAND_BASE_REG(vii)	(((vii) >> 23) & 0xf)
> +#define VMX_OPERAND_BASE_INVALID(vii)	((vii) & (1u << 27))
> +#define VMX_OPERAND_REG(vii)		(((vii) >> 3) & 0xf)
> +#define VMX_OPERAND_REG2(vii)		(((vii) >> 28) & 0xf)
> +static gva_t get_vmx_mem_address(struct kvm_vcpu *vcpu,
> +				 unsigned long exit_qualification,
> +				 u32 vmx_instruction_info)
> +{
> +	int  scaling = VMX_OPERAND_SCALING(vmx_instruction_info);
> +	int  addr_size = VMX_OPERAND_ADDR_SIZE(vmx_instruction_info);
> +	bool is_reg = VMX_OPERAND_IS_REG(vmx_instruction_info);
> +	int  seg_reg = VMX_OPERAND_SEG_REG(vmx_instruction_info);
> +	int  index_reg = VMX_OPERAND_INDEX_REG(vmx_instruction_info);
> +	bool index_is_valid = !VMX_OPERAND_INDEX_INVALID(vmx_instruction_info);
> +	int  base_reg       = VMX_OPERAND_BASE_REG(vmx_instruction_info);
> +	bool base_is_valid  = !VMX_OPERAND_BASE_INVALID(vmx_instruction_info);
> +	gva_t addr;
> +
> +	if (is_reg) {
> +		kvm_queue_exception(vcpu, UD_VECTOR);
> +		return 0;
Isn't zero a legitimate address for vmx operation?

> +	}
> +
> +	switch (addr_size) {
> +	case 1: /* 32 bit. high bits are undefined according to the spec: */
> +		exit_qualification &= 0xffffffff;
> +		break;
> +	case 2: /* 64 bit */
> +		break;
> +	default: /* addr_size=0 means 16 bit */
> +		return 0;
> +	}
> +
> +	/* Addr = segment_base + offset */
> +	/* offset = Base + [Index * Scale] + Displacement */
> +	addr = vmx_get_segment_base(vcpu, seg_reg);
> +	if (base_is_valid)
> +		addr += kvm_register_read(vcpu, base_reg);
> +	if (index_is_valid)
> +		addr += kvm_register_read(vcpu, index_reg)<<scaling;
> +	addr += exit_qualification; /* holds the displacement */
> +
> +	return addr;
> +}
> +
> +static int read_guest_vmcs_gpa(struct kvm_vcpu *vcpu, gpa_t *gpap)
> +{
> +	int r;
> +	gva_t gva = get_vmx_mem_address(vcpu,
> +		vmcs_readl(EXIT_QUALIFICATION),
> +		vmcs_read32(VMX_INSTRUCTION_INFO));
> +	if (gva == 0)
> +		return 1;
> +	*gpap = 0;
> +	r = kvm_read_guest_virt(gva, gpap, sizeof(*gpap), vcpu, NULL);
> +	if (r) {
> +		printk(KERN_ERR "%s cannot read guest vmcs addr %lx : %d\n",
> +		       __func__, gva, r);
> +		return r;
> +	}
> +	/* According to the spec, VMCS addresses must be 4K aligned */
> +	if (!IS_ALIGNED(*gpap, PAGE_SIZE)) {
> +		printk(KERN_DEBUG "%s addr %llx not aligned\n",
> +		       __func__, *gpap);
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +
>  static int handle_invlpg(struct kvm_vcpu *vcpu)
>  {
>  	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
			Gleb.

^ permalink raw reply	[flat|nested] 147+ messages in thread
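
A side note on the gva 0 question above: one way to avoid overloading the
returned address is to report success or failure out of band and hand the
decoded address back through a pointer. The sketch below is only an
illustration, rebuilt from the decoding logic quoted above; it is not part of
the posted patches, and the 0/-EINVAL return convention is an assumption.

static int get_vmx_mem_address(struct kvm_vcpu *vcpu,
			       unsigned long exit_qualification,
			       u32 vmx_instruction_info, gva_t *ret)
{
	int  scaling   = VMX_OPERAND_SCALING(vmx_instruction_info);
	int  addr_size = VMX_OPERAND_ADDR_SIZE(vmx_instruction_info);
	int  seg_reg   = VMX_OPERAND_SEG_REG(vmx_instruction_info);
	int  index_reg = VMX_OPERAND_INDEX_REG(vmx_instruction_info);
	int  base_reg  = VMX_OPERAND_BASE_REG(vmx_instruction_info);
	gva_t addr;

	if (VMX_OPERAND_IS_REG(vmx_instruction_info)) {
		kvm_queue_exception(vcpu, UD_VECTOR);
		return -EINVAL;
	}

	switch (addr_size) {
	case 1: /* 32 bit. high bits are undefined according to the spec: */
		exit_qualification &= 0xffffffff;
		break;
	case 2: /* 64 bit */
		break;
	default: /* addr_size=0 means 16 bit */
		return -EINVAL;
	}

	/* addr = segment base + base + index*scale + displacement */
	addr = vmx_get_segment_base(vcpu, seg_reg);
	if (!VMX_OPERAND_BASE_INVALID(vmx_instruction_info))
		addr += kvm_register_read(vcpu, base_reg);
	if (!VMX_OPERAND_INDEX_INVALID(vmx_instruction_info))
		addr += kvm_register_read(vcpu, index_reg) << scaling;
	addr += exit_qualification; /* holds the displacement */

	*ret = addr;
	return 0;
}

Callers such as read_guest_vmcs_gpa() would then test the return value
instead of comparing the address against zero, so a legitimate gva of 0 is
no longer ambiguous.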

* Re: [PATCH 9/24] Implement VMCLEAR
  2010-06-13 12:27 ` [PATCH 9/24] Implement VMCLEAR Nadav Har'El
  2010-06-14  9:03   ` Avi Kivity
@ 2010-06-15 13:47   ` Gleb Natapov
  2010-06-15 13:50     ` Avi Kivity
  2010-07-06  2:56   ` Dong, Eddie
  2 siblings, 1 reply; 147+ messages in thread
From: Gleb Natapov @ 2010-06-15 13:47 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: avi, kvm

On Sun, Jun 13, 2010 at 03:27:10PM +0300, Nadav Har'El wrote:
> This patch implements the VMCLEAR instruction.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
> --- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
> @@ -138,6 +138,8 @@ struct __attribute__ ((__packed__)) vmcs
>  	 */
>  	u32 revision_id;
>  	u32 abort;
> +
> +	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
>  };
>  
>  struct vmcs_list {
> @@ -3827,6 +3829,46 @@ static int read_guest_vmcs_gpa(struct kv
>  	return 0;
>  }
>  
> +static void clear_rflags_cf_zf(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long rflags;
> +	rflags = vmx_get_rflags(vcpu);
> +	rflags &= ~(X86_EFLAGS_CF | X86_EFLAGS_ZF);
> +	vmx_set_rflags(vcpu, rflags);
> +}
> +
> +/* Emulate the VMCLEAR instruction */
> +static int handle_vmclear(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	gpa_t guest_vmcs_addr, save_current_vmptr;
> +
> +	if (!nested_vmx_check_permission(vcpu))
> +		return 1;
> +
> +	if (read_guest_vmcs_gpa(vcpu, &guest_vmcs_addr))
> +		return 1;
> +
> +	save_current_vmptr = vmx->nested.current_vmptr;
> +
> +	vmx->nested.current_vmptr = guest_vmcs_addr;
> +	if (!nested_map_current(vcpu))
> +		return 1;
> +	vmx->nested.current_l2_page->launch_state = 0;
> +	nested_unmap_current(vcpu);
> +
> +	nested_free_current_vmcs(vcpu);
> +
> +	if (save_current_vmptr == guest_vmcs_addr)
> +		vmx->nested.current_vmptr = -1ull;
> +	else
> +		vmx->nested.current_vmptr = save_current_vmptr;
> +
> +	skip_emulated_instruction(vcpu);
> +	clear_rflags_cf_zf(vcpu);
> +	return 1;
> +}
> +
Shouldn't error cases update flags too?

>  static int handle_invlpg(struct kvm_vcpu *vcpu)
>  {
>  	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> @@ -4109,7 +4151,7 @@ static int (*kvm_vmx_exit_handlers[])(st
>  	[EXIT_REASON_HLT]                     = handle_halt,
>  	[EXIT_REASON_INVLPG]		      = handle_invlpg,
>  	[EXIT_REASON_VMCALL]                  = handle_vmcall,
> -	[EXIT_REASON_VMCLEAR]	              = handle_vmx_insn,
> +	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
>  	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
>  	[EXIT_REASON_VMPTRLD]                 = handle_vmx_insn,
>  	[EXIT_REASON_VMPTRST]                 = handle_vmx_insn,
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
			Gleb.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 9/24] Implement VMCLEAR
  2010-06-15 13:47   ` Gleb Natapov
@ 2010-06-15 13:50     ` Avi Kivity
  2010-06-15 13:54       ` Gleb Natapov
  0 siblings, 1 reply; 147+ messages in thread
From: Avi Kivity @ 2010-06-15 13:50 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Nadav Har'El, kvm

On 06/15/2010 04:47 PM, Gleb Natapov wrote:
> On Sun, Jun 13, 2010 at 03:27:10PM +0300, Nadav Har'El wrote:
>    
>> This patch implements the VMCLEAR instruction.
>>
>> Signed-off-by: Nadav Har'El<nyh@il.ibm.com>
>> ---
>> --- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
>> +++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
>> @@ -138,6 +138,8 @@ struct __attribute__ ((__packed__)) vmcs
>>   	 */
>>   	u32 revision_id;
>>   	u32 abort;
>> +
>> +	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
>>   };
>>
>>   struct vmcs_list {
>> @@ -3827,6 +3829,46 @@ static int read_guest_vmcs_gpa(struct kv
>>   	return 0;
>>   }
>>
>> +static void clear_rflags_cf_zf(struct kvm_vcpu *vcpu)
>> +{
>> +	unsigned long rflags;
>> +	rflags = vmx_get_rflags(vcpu);
>> +	rflags&= ~(X86_EFLAGS_CF | X86_EFLAGS_ZF);
>> +	vmx_set_rflags(vcpu, rflags);
>> +}
>> +
>> +/* Emulate the VMCLEAR instruction */
>> +static int handle_vmclear(struct kvm_vcpu *vcpu)
>> +{
>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>> +	gpa_t guest_vmcs_addr, save_current_vmptr;
>> +
>> +	if (!nested_vmx_check_permission(vcpu))
>> +		return 1;
>> +
>> +	if (read_guest_vmcs_gpa(vcpu,&guest_vmcs_addr))
>> +		return 1;
>> +
>> +	save_current_vmptr = vmx->nested.current_vmptr;
>> +
>> +	vmx->nested.current_vmptr = guest_vmcs_addr;
>> +	if (!nested_map_current(vcpu))
>> +		return 1;
>> +	vmx->nested.current_l2_page->launch_state = 0;
>> +	nested_unmap_current(vcpu);
>> +
>> +	nested_free_current_vmcs(vcpu);
>> +
>> +	if (save_current_vmptr == guest_vmcs_addr)
>> +		vmx->nested.current_vmptr = -1ull;
>> +	else
>> +		vmx->nested.current_vmptr = save_current_vmptr;
>> +
>> +	skip_emulated_instruction(vcpu);
>> +	clear_rflags_cf_zf(vcpu);
>> +	return 1;
>> +}
>> +
>>      
> Shouldn't error cases update flags too?
>    

Architectural errors (bad alignment) should update flags.  Internal 
errors (ENOMEM, vmptr pointing outside of RAM) should not.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 9/24] Implement VMCLEAR
  2010-06-15 13:50     ` Avi Kivity
@ 2010-06-15 13:54       ` Gleb Natapov
  2010-08-05 11:50         ` Nadav Har'El
  0 siblings, 1 reply; 147+ messages in thread
From: Gleb Natapov @ 2010-06-15 13:54 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Nadav Har'El, kvm

On Tue, Jun 15, 2010 at 04:50:35PM +0300, Avi Kivity wrote:
> On 06/15/2010 04:47 PM, Gleb Natapov wrote:
> >On Sun, Jun 13, 2010 at 03:27:10PM +0300, Nadav Har'El wrote:
> >>This patch implements the VMCLEAR instruction.
> >>
> >>Signed-off-by: Nadav Har'El<nyh@il.ibm.com>
> >>---
> >>--- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
> >>+++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
> >>@@ -138,6 +138,8 @@ struct __attribute__ ((__packed__)) vmcs
> >>  	 */
> >>  	u32 revision_id;
> >>  	u32 abort;
> >>+
> >>+	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
> >>  };
> >>
> >>  struct vmcs_list {
> >>@@ -3827,6 +3829,46 @@ static int read_guest_vmcs_gpa(struct kv
> >>  	return 0;
> >>  }
> >>
> >>+static void clear_rflags_cf_zf(struct kvm_vcpu *vcpu)
> >>+{
> >>+	unsigned long rflags;
> >>+	rflags = vmx_get_rflags(vcpu);
> >>+	rflags&= ~(X86_EFLAGS_CF | X86_EFLAGS_ZF);
> >>+	vmx_set_rflags(vcpu, rflags);
> >>+}
> >>+
> >>+/* Emulate the VMCLEAR instruction */
> >>+static int handle_vmclear(struct kvm_vcpu *vcpu)
> >>+{
> >>+	struct vcpu_vmx *vmx = to_vmx(vcpu);
> >>+	gpa_t guest_vmcs_addr, save_current_vmptr;
> >>+
> >>+	if (!nested_vmx_check_permission(vcpu))
> >>+		return 1;
> >>+
> >>+	if (read_guest_vmcs_gpa(vcpu,&guest_vmcs_addr))
> >>+		return 1;
> >>+
> >>+	save_current_vmptr = vmx->nested.current_vmptr;
> >>+
> >>+	vmx->nested.current_vmptr = guest_vmcs_addr;
> >>+	if (!nested_map_current(vcpu))
> >>+		return 1;
> >>+	vmx->nested.current_l2_page->launch_state = 0;
> >>+	nested_unmap_current(vcpu);
> >>+
> >>+	nested_free_current_vmcs(vcpu);
> >>+
> >>+	if (save_current_vmptr == guest_vmcs_addr)
> >>+		vmx->nested.current_vmptr = -1ull;
> >>+	else
> >>+		vmx->nested.current_vmptr = save_current_vmptr;
> >>+
> >>+	skip_emulated_instruction(vcpu);
> >>+	clear_rflags_cf_zf(vcpu);
> >>+	return 1;
> >>+}
> >>+
> >Shouldn't error cases update flags too?
> 
> Architectural errors (bad alignment) should update flags.  Internal
> errors (ENOMEM, vmptr pointing outside of RAM) should not.
> 
A vmptr pointing outside of RAM is an architectural error (or is it?). The SDM
says "The operand of this instruction is always 64 bits and is always in
memory", but maybe they just mean "not in a register". In any case, internal
errors should generate an error exit to userspace, which this patch is also
missing.

--
			Gleb.

^ permalink raw reply	[flat|nested] 147+ messages in thread
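
Putting the two observations above together, the error paths of handle_vmclear
might be split roughly as follows. This is only a sketch that reuses helpers
from the patches in this thread (set_rflags_to_vmx_fail_invalid() is
introduced in the VMPTRLD patch), it glosses over which VMfail flavour the SDM
requires for each case, and the point at which an internal error is turned
into a KVM_EXIT_INTERNAL_ERROR exit is an assumption, not something the posted
series does.

static int handle_vmclear(struct kvm_vcpu *vcpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	gpa_t guest_vmcs_addr, save_current_vmptr;

	if (!nested_vmx_check_permission(vcpu))
		return 1;

	/*
	 * Architectural failure (e.g. a misaligned vmptr): report it to L1
	 * via RFLAGS, skip the instruction and keep running the guest.
	 */
	if (read_guest_vmcs_gpa(vcpu, &guest_vmcs_addr)) {
		set_rflags_to_vmx_fail_invalid(vcpu);
		skip_emulated_instruction(vcpu);
		return 1;
	}

	save_current_vmptr = vmx->nested.current_vmptr;
	vmx->nested.current_vmptr = guest_vmcs_addr;

	/*
	 * Internal failure (e.g. the vmptr maps to no guest RAM): not the
	 * guest's fault, so surface it to host userspace rather than
	 * swallowing it silently.
	 */
	if (!nested_map_current(vcpu)) {
		vmx->nested.current_vmptr = save_current_vmptr;
		vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
		return 0;
	}

	vmx->nested.current_l2_page->launch_state = 0;
	nested_unmap_current(vcpu);

	nested_free_current_vmcs(vcpu);

	if (save_current_vmptr == guest_vmcs_addr)
		vmx->nested.current_vmptr = -1ull;
	else
		vmx->nested.current_vmptr = save_current_vmptr;

	skip_emulated_instruction(vcpu);
	clear_rflags_cf_zf(vcpu);
	return 1;
}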

* Re: [PATCH 1/24] Move nested option from svm.c to x86.c
  2010-06-14  8:11   ` Avi Kivity
@ 2010-06-15 14:27     ` Nadav Har'El
  0 siblings, 0 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-06-15 14:27 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

On Mon, Jun 14, 2010, Avi Kivity wrote about "Re: [PATCH 1/24] Move nested option from svm.c to x86.c":
> A global variable named 'nested' is not a good idea.  I recommend having
> a kvm-intel-scoped module parameter instead, which also avoids the 0/1/2
> values.

The rationale behind having a "nested" flag in x86.c (instead of individually
in svm.c and vmx.c) was that it allows nesting-related logic that is common to
both SVM and VMX to reside in x86.c.

But you are right that this is not very important right now. So in the fixed
patch below I've changed it to be a separate module parameter "nested" for
each module. As you requested, VMX's nested option defaults to off.

========
Subject: [PATCH 1/24] Add "nested" module option to vmx.c

This patch adds a module option "nested" to vmx.c, which controls whether
the guest can use VMX instructions, i.e., whether we allow nested
virtualization. A similar, but separate, option already exists for the
SVM module.

This option currently defaults to 0, meaning that nested VMX must be
explicitly enabled by giving nested=1. When nested VMX matures, the default
should probably be changed to enable nested VMX by default - just like
nested SVM is currently enabled by default.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/arch/x86/kvm/vmx.c	2010-06-15 17:20:01.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-06-15 17:20:01.000000000 +0300
@@ -67,6 +67,14 @@ module_param(emulate_invalid_guest_state
 static int __read_mostly vmm_exclusive = 1;
 module_param(vmm_exclusive, bool, S_IRUGO);
 
+/*
+ * If nested=1, nested virtualization is supported, i.e., the guest may use
+ * VMX and be a hypervisor for its own guests. If nested=0, the guest may not
+ * use VMX instructions.
+ */
+static int nested = 0;
+module_param(nested, int, S_IRUGO);
+
 #define KVM_GUEST_CR0_MASK_UNRESTRICTED_GUEST				\
 	(X86_CR0_WP | X86_CR0_NE | X86_CR0_NW | X86_CR0_CD)
 #define KVM_GUEST_CR0_MASK						\

-- 
Nadav Har'El                        |      Tuesday, Jun 15 2010, 3 Tammuz 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |When everything's coming your way, you're
http://nadav.harel.org.il           |in the wrong lane.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 2/24] Add VMX and SVM to list of supported cpuid features
  2010-06-14  8:13   ` Avi Kivity
@ 2010-06-15 14:31     ` Nadav Har'El
  0 siblings, 0 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-06-15 14:31 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

On Mon, Jun 14, 2010, Avi Kivity wrote about "Re: [PATCH 2/24] Add VMX and SVM to list of supported cpuid features":
> >  	const u32 kvm_supported_word4_x86_features =
> >  		F(XMM3) | 0 /* Reserved, DTES64, MONITOR */ |
> >-		0 /* DS-CPL, VMX, SMX, EST */ |
> >+		0 /* DS-CPL */ | (nested ? F(VMX) : 0) | 0 /* SMX, EST */ |
> >  		0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> >  		0 /* Reserved */ | F(CX16) | 0 /* xTPR Update, PDCM */ |
> >  		0 /* Reserved, DCA */ | F(XMM4_1) |
> >   
> 
> You can use kvm_x86_ops->set_supported_cpuid() to alter features.

You're right, that's indeed much cleaner! Thanks.
It also means I no longer need the "nested" parameter in x86.c (which I
don't have anyway, now that I've moved it to vmx.c).

The fixed patch:

-------
Subject: [PATCH 2/24] Add VMX and SVM to list of supported cpuid features

Add the "VMX" CPU feature to the list of CPU featuress KVM advertises with
the KVM_GET_SUPPORTED_CPUID ioctl (unless the "nested" module option is off).

Qemu uses this ioctl, and intersects KVM's list with its own list of desired
cpu features (depending on the -cpu option given to qemu) to determine the
final list of features presented to the guest.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/arch/x86/kvm/vmx.c	2010-06-15 17:20:01.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-06-15 17:20:01.000000000 +0300
@@ -4236,6 +4236,8 @@ static void vmx_cpuid_update(struct kvm_
 
 static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
 {
+	if (func == 1 && nested)
+		entry->ecx |= bit(X86_FEATURE_VMX);
 }
 
 static struct kvm_x86_ops vmx_x86_ops = {

-- 
Nadav Har'El                        |      Tuesday, Jun 15 2010, 3 Tammuz 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Time is the best teacher. Unfortunately
http://nadav.harel.org.il           |it kills all its students.

^ permalink raw reply	[flat|nested] 147+ messages in thread
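
For anyone checking the result from inside an L1 guest: CPUID leaf 1 reports
VMX in ECX bit 5, which is the bit vmx_set_supported_cpuid() above turns on.
The little userspace program below is just a test convenience, not part of
the patch set.

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return 1;

	/* CPUID.01H:ECX bit 5 is the VMX feature flag */
	printf("VMX %s advertised in this guest\n",
	       (ecx & (1u << 5)) ? "is" : "is NOT");
	return 0;
}

Remember that, as described above, qemu still has to be started with a -cpu
model that includes vmx for the bit to actually reach the guest.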

* Re: [PATCH 4/24] Allow setting the VMXE bit in CR4
  2010-06-15 11:09   ` Gleb Natapov
@ 2010-06-15 14:44     ` Nadav Har'El
  0 siblings, 0 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-06-15 14:44 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: avi, kvm

On Tue, Jun 15, 2010, Gleb Natapov wrote about "Re: [PATCH 4/24] Allow setting the VMXE bit in CR4":
> On Sun, Jun 13, 2010 at 03:24:37PM +0300, Nadav Har'El wrote:
> > This patch allows the guest to enable the VMXE bit in CR4, which is a
> > prerequisite to running VMXON.
>..
> > --- .before/arch/x86/kvm/x86.c	2010-06-13 15:01:28.000000000 +0300
>..
> > +	if (cr4 & X86_CR4_VMXE && !nested)
> >  		return 1;

After I moved the "nested" option back from x86.c to vmx.c, this patch had a
problem: kvm_set_cr4() (in x86.c) can no longer test the "nested" option as
above to decide whether the VMXE bit should be allowed.

But this setback was actually an opportunity to do this testing more
correctly. I've changed kvm_x86_ops->set_cr4() to return 1 when a #GP should
be thrown (like in __kvm_set_cr4()). SVM's set_cr4() now always refuses to set
the VMXE bit, while VMX's set_cr4() refuses to set or unset it as appropriate
(it cannot be set if "nested" is not on, and cannot be unset after VMXON).

> We shouldn't be able to clear X86_CR4_VMXE after VMXON.

You're absolutely right. I fixed that too in the fixed patch below.


----
Subject: [PATCH 4/24] Allow setting the VMXE bit in CR4

This patch allows the guest to enable the VMXE bit in CR4, which is a
prerequisite to running VMXON.

Whether to allow setting the VMXE bit now depends on the architecture (svm
or vmx), so the check has moved into kvm_x86_ops->set_cr4(). This function
now returns an int: if kvm_x86_ops->set_cr4() returns 1, __kvm_set_cr4()
will also return 1, which causes kvm_set_cr4() to throw a #GP.

Turning on the VMXE bit is allowed only when the "nested" module option is on,
and turning it off is forbidden after a vmxon.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/arch/x86/include/asm/kvm_host.h	2010-06-15 17:20:01.000000000 +0300
+++ .after/arch/x86/include/asm/kvm_host.h	2010-06-15 17:20:01.000000000 +0300
@@ -490,7 +490,7 @@ struct kvm_x86_ops {
 	void (*decache_cr4_guest_bits)(struct kvm_vcpu *vcpu);
 	void (*set_cr0)(struct kvm_vcpu *vcpu, unsigned long cr0);
 	void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3);
-	void (*set_cr4)(struct kvm_vcpu *vcpu, unsigned long cr4);
+	int (*set_cr4)(struct kvm_vcpu *vcpu, unsigned long cr4);
 	void (*set_efer)(struct kvm_vcpu *vcpu, u64 efer);
 	void (*get_idt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
 	void (*set_idt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
--- .before/arch/x86/kvm/svm.c	2010-06-15 17:20:01.000000000 +0300
+++ .after/arch/x86/kvm/svm.c	2010-06-15 17:20:01.000000000 +0300
@@ -1242,11 +1242,14 @@ static void svm_set_cr0(struct kvm_vcpu 
 	update_cr0_intercept(svm);
 }
 
-static void svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+static int svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
 	unsigned long host_cr4_mce = read_cr4() & X86_CR4_MCE;
 	unsigned long old_cr4 = to_svm(vcpu)->vmcb->save.cr4;
 
+	if (cr4 & X86_CR4_VMXE)
+		return 1;
+
 	if (npt_enabled && ((old_cr4 ^ cr4) & X86_CR4_PGE))
 		force_new_asid(vcpu);
 
@@ -1255,6 +1258,7 @@ static void svm_set_cr4(struct kvm_vcpu 
 		cr4 |= X86_CR4_PAE;
 	cr4 |= host_cr4_mce;
 	to_svm(vcpu)->vmcb->save.cr4 = cr4;
+	return 0;
 }
 
 static void svm_set_segment(struct kvm_vcpu *vcpu,
--- .before/arch/x86/kvm/x86.c	2010-06-15 17:20:01.000000000 +0300
+++ .after/arch/x86/kvm/x86.c	2010-06-15 17:20:01.000000000 +0300
@@ -490,11 +490,9 @@ int __kvm_set_cr4(struct kvm_vcpu *vcpu,
 		   && !load_pdptrs(vcpu, vcpu->arch.cr3))
 		return 1;
 
-	if (cr4 & X86_CR4_VMXE)
+	if (kvm_x86_ops->set_cr4(vcpu, cr4))
 		return 1;
 
-	kvm_x86_ops->set_cr4(vcpu, cr4);
-
 	if ((cr4 ^ old_cr4) & pdptr_bits)
 		kvm_mmu_reset_context(vcpu);
 
--- .before/arch/x86/kvm/vmx.c	2010-06-15 17:20:01.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-06-15 17:20:01.000000000 +0300
@@ -1874,7 +1874,7 @@ static void ept_save_pdptrs(struct kvm_v
 		  (unsigned long *)&vcpu->arch.regs_dirty);
 }
 
-static void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
+static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
 
 static void ept_update_paging_mode_cr0(unsigned long *hw_cr0,
 					unsigned long cr0,
@@ -1969,11 +1969,19 @@ static void vmx_set_cr3(struct kvm_vcpu 
 	vmcs_writel(GUEST_CR3, guest_cr3);
 }
 
-static void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
 	unsigned long hw_cr4 = cr4 | (to_vmx(vcpu)->rmode.vm86_active ?
 		    KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON);
 
+	if (cr4 & X86_CR4_VMXE) {
+		if (!nested)
+			return 1;
+	} else {
+		if (nested && to_vmx(vcpu)->nested.vmxon)
+			return 1;
+	}
+
 	vcpu->arch.cr4 = cr4;
 	if (enable_ept) {
 		if (!is_paging(vcpu)) {
@@ -1986,6 +1994,7 @@ static void vmx_set_cr4(struct kvm_vcpu 
 
 	vmcs_writel(CR4_READ_SHADOW, cr4);
 	vmcs_writel(GUEST_CR4, hw_cr4);
+	return 0;
 }
 
 static u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)

-- 
Nadav Har'El                        |      Tuesday, Jun 15 2010, 3 Tammuz 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Tea or coffee? Coffee, without cream. It
http://nadav.harel.org.il           |will be without milk, we have no cream.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 3/24] Implement VMXON and VMXOFF
  2010-06-13 12:24 ` [PATCH 3/24] Implement VMXON and VMXOFF Nadav Har'El
  2010-06-14  8:21   ` Avi Kivity
@ 2010-06-15 20:18   ` Marcelo Tosatti
  2010-06-16  7:50     ` Nadav Har'El
  1 sibling, 1 reply; 147+ messages in thread
From: Marcelo Tosatti @ 2010-06-15 20:18 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: avi, kvm

On Sun, Jun 13, 2010 at 03:24:06PM +0300, Nadav Har'El wrote:
> This patch allows a guest to use the VMXON and VMXOFF instructions, and
> emulates them accordingly. Basically this amounts to checking some
> prerequisites, and then remembering whether the guest has enabled or disabled
> VMX operation.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
> --- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:28.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:28.000000000 +0300
> @@ -117,6 +117,16 @@ struct shared_msr_entry {
>  	u64 mask;
>  };
>  
> +/* The nested_vmx structure is part of vcpu_vmx, and holds information we need
> + * for correct emulation of VMX (i.e., nested VMX) on this vcpu. For example,
> + * the current VMCS set by L1, a list of the VMCSs used to run the active
> + * L2 guests on the hardware, and more.
> + */
> +struct nested_vmx {
> +	/* Has the level1 guest done vmxon? */
> +	bool vmxon;
> +};
> +
>  struct vcpu_vmx {
>  	struct kvm_vcpu       vcpu;
>  	struct list_head      local_vcpus_link;
> @@ -168,6 +178,9 @@ struct vcpu_vmx {
>  	u32 exit_reason;
>  
>  	bool rdtscp_enabled;
> +
> +	/* Support for guest hypervisors (nested VMX) */
> +	struct nested_vmx nested;
>  };
>  
>  static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
> @@ -3353,6 +3366,93 @@ static int handle_vmx_insn(struct kvm_vc
>  	return 1;
>  }
>  
> +/* Emulate the VMXON instruction.
> + * Currently, we just remember that VMX is active, and do not save or even
> + * inspect the argument to VMXON (the so-called "VMXON pointer") because we
> + * do not currently need to store anything in that guest-allocated memory
> + * region. Consequently, VMCLEAR and VMPTRLD also do not verify that their
> + * argument is different from the VMXON pointer (which the spec says they do).
> + */
> +static int handle_vmon(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_segment cs;
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	/* The Intel VMX Instruction Reference lists a bunch of bits that
> +	 * are prerequisite to running VMXON, most notably CR4.VMXE must be
> +	 * set to 1. Otherwise, we should fail with #UD. We test these now:
> +	 */
> +	if (!nested) {
> +		kvm_queue_exception(vcpu, UD_VECTOR);
> +		return 1;
> +	}
> +
> +	if (!(vcpu->arch.cr4 & X86_CR4_VMXE) ||
> +	    !(vcpu->arch.cr0 & X86_CR0_PE) ||

kvm_read_cr0_bits, kvm_read_cr4_bits.


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 3/24] Implement VMXON and VMXOFF
  2010-06-15 20:18   ` Marcelo Tosatti
@ 2010-06-16  7:50     ` Nadav Har'El
  0 siblings, 0 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-06-16  7:50 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: avi, kvm

On Tue, Jun 15, 2010, Marcelo Tosatti wrote about "Re: [PATCH 3/24] Implement VMXON and VMXOFF":
> > +	if (!(vcpu->arch.cr4 & X86_CR4_VMXE) ||
> > +	    !(vcpu->arch.cr0 & X86_CR0_PE) ||
> 
> kvm_read_cr0_bits, kvm_read_cr4_bits.

Thanks. I'll change that.

-- 
Nadav Har'El                        |    Wednesday, Jun 16 2010, 4 Tammuz 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |In Fortran, God is real unless declared
http://nadav.harel.org.il           |an integer.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 3/24] Implement VMXON and VMXOFF
  2010-06-14  8:21   ` Avi Kivity
@ 2010-06-16 11:14     ` Nadav Har'El
  2010-06-16 11:26       ` Avi Kivity
  0 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-06-16 11:14 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

Hi,

On Mon, Jun 14, 2010, Avi Kivity wrote about "Re: [PATCH 3/24] Implement VMXON and VMXOFF":
> On 06/13/2010 03:24 PM, Nadav Har'El wrote:
> >This patch allows a guest to use the VMXON and VMXOFF instructions, and
> >emulates them accordingly. Basically this amounts to checking some
> >prerequisites, and then remembering whether the guest has enabled or 
> >disabled
> >VMX operation.
> 
> Should probably reorder with next patch.

I can't do this if I want the code to compile after each patch, because the
next patch (controlling when cr4.VMXE can be set) needs to check
whether VMXON was done.

> Please (here and elsewhere) use the standard kernel style for multiline 
> comments - start with /* on a line by itself.

Sure, sorry about that. I guess I need to (re)read the Linux coding style
document.

> >+	vmx->nested.vmxon = 1;
> >   
> = true

I'll change that. I learned C more than a decade before the advent of
stdbool.h, so in my mind, "1" has always been, and still is, the right and
only way to write "true"... But of course it doesn't mean I need to inflict
my old style on everybody else ;-)

> Need to block INIT signals in the local apic as well (fine for a 
> separate patch).

I've been looking into how I might best go about achieving this.

The APIC_DM_INIT handler is in lapic.c, which is not aware of VMX or
(obviously) nested VMX. So I need to add some sort of generic "block INIT"
flag which that code will check. Is this the sort of fix you had in mind?

A different change could be to write a handler for exit reason 3, which we
get if there's a real INIT signal in the host: if we get exit reason 3 from
L2, we need to exit to L1 to handle it, while if we get exit reason 3 from
an L1 that has done VMXON, we simply need to do nothing (according to the spec).

So I'm not sure which of these two things we really need here. What kind
of scenario did you have in mind where this INIT business is relevant?



Here is the patch with your above comments fixed *except* the INIT thing:

-------
Subject: [PATCH 3/24] Implement VMXON and VMXOFF

This patch allows a guest to use the VMXON and VMXOFF instructions, and
emulates them accordingly. Basically this amounts to checking some
prerequisites, and then remembering whether the guest has enabled or disabled
VMX operation.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/arch/x86/kvm/vmx.c	2010-06-16 13:20:19.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-06-16 13:20:19.000000000 +0300
@@ -125,6 +125,17 @@ struct shared_msr_entry {
 	u64 mask;
 };
 
+/*
+ * The nested_vmx structure is part of vcpu_vmx, and holds information we need
+ * for correct emulation of VMX (i.e., nested VMX) on this vcpu. For example,
+ * the current VMCS set by L1, a list of the VMCSs used to run the active
+ * L2 guests on the hardware, and more.
+ */
+struct nested_vmx {
+	/* Has the level1 guest done vmxon? */
+	bool vmxon;
+};
+
 struct vcpu_vmx {
 	struct kvm_vcpu       vcpu;
 	struct list_head      local_vcpus_link;
@@ -176,6 +187,9 @@ struct vcpu_vmx {
 	u32 exit_reason;
 
 	bool rdtscp_enabled;
+
+	/* Support for guest hypervisors (nested VMX) */
+	struct nested_vmx nested;
 };
 
 static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
@@ -3361,6 +3375,90 @@ static int handle_vmx_insn(struct kvm_vc
 	return 1;
 }
 
+/*
+ * Emulate the VMXON instruction.
+ * Currently, we just remember that VMX is active, and do not save or even
+ * inspect the argument to VMXON (the so-called "VMXON pointer") because we
+ * do not currently need to store anything in that guest-allocated memory
+ * region. Consequently, VMCLEAR and VMPTRLD also do not verify that their
+ * argument is different from the VMXON pointer (which the spec says they do).
+ */
+static int handle_vmon(struct kvm_vcpu *vcpu)
+{
+	struct kvm_segment cs;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	/* The Intel VMX Instruction Reference lists a bunch of bits that
+	 * are prerequisite to running VMXON, most notably CR4.VMXE must be
+	 * set to 1. Otherwise, we should fail with #UD. We test these now:
+	 */
+	if (!nested ||
+	    !kvm_read_cr4_bits(vcpu, X86_CR4_VMXE) ||
+	    !kvm_read_cr0_bits(vcpu, X86_CR0_PE) ||
+	    (vmx_get_rflags(vcpu) & X86_EFLAGS_VM)) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
+	if (is_long_mode(vcpu) && !cs.l) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	if (vmx_get_cpl(vcpu)) {
+		kvm_inject_gp(vcpu, 0);
+		return 1;
+	}
+
+	vmx->nested.vmxon = true;
+
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
+/*
+ * Intel's VMX Instruction Reference specifies a common set of prerequisites
+ * for running VMX instructions (except VMXON, whose prerequisites are
+ * slightly different). It also specifies what exception to inject otherwise.
+ */
+static int nested_vmx_check_permission(struct kvm_vcpu *vcpu)
+{
+	struct kvm_segment cs;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (!vmx->nested.vmxon) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 0;
+	}
+
+	vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
+	if ((vmx_get_rflags(vcpu) & X86_EFLAGS_VM) ||
+	    (is_long_mode(vcpu) && !cs.l)) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 0;
+	}
+
+	if (vmx_get_cpl(vcpu)) {
+		kvm_inject_gp(vcpu, 0);
+		return 0;
+	}
+
+	return 1;
+}
+
+/* Emulate the VMXOFF instruction */
+static int handle_vmoff(struct kvm_vcpu *vcpu)
+{
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	to_vmx(vcpu)->nested.vmxon = false;
+
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
 static int handle_invlpg(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
@@ -3650,8 +3748,8 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
 	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
 	[EXIT_REASON_VMWRITE]                 = handle_vmx_insn,
-	[EXIT_REASON_VMOFF]                   = handle_vmx_insn,
-	[EXIT_REASON_VMON]                    = handle_vmx_insn,
+	[EXIT_REASON_VMOFF]                   = handle_vmoff,
+	[EXIT_REASON_VMON]                    = handle_vmon,
 	[EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,
 	[EXIT_REASON_APIC_ACCESS]             = handle_apic_access,
 	[EXIT_REASON_WBINVD]                  = handle_wbinvd,

-- 
Nadav Har'El                        |    Wednesday, Jun 16 2010, 4 Tammuz 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Deja Moo: The feeling that you've heard
http://nadav.harel.org.il           |this bull before.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 3/24] Implement VMXON and VMXOFF
  2010-06-16 11:14     ` Nadav Har'El
@ 2010-06-16 11:26       ` Avi Kivity
  0 siblings, 0 replies; 147+ messages in thread
From: Avi Kivity @ 2010-06-16 11:26 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

On 06/16/2010 02:14 PM, Nadav Har'El wrote:
> Hi,
>
> On Mon, Jun 14, 2010, Avi Kivity wrote about "Re: [PATCH 3/24] Implement VMXON and VMXOFF":
>    
>> On 06/13/2010 03:24 PM, Nadav Har'El wrote:
>>      
>>> This patch allows a guest to use the VMXON and VMXOFF instructions, and
>>> emulates them accordingly. Basically this amounts to checking some
>>> prerequisites, and then remembering whether the guest has enabled or
>>> disabled
>>> VMX operation.
>>>        
>> Should probably reorder with next patch.
>>      
> I can't do this if I want the code to compile after each patch, because the
> next patch (controlling when setting cr4.VMXE can be set) needs to check
> whether VMXON was done.
>    

You can have this patch add the vmxon check.  But it doesn't matter too 
much, you can keep the current order.

>> Need to block INIT signals in the local apic as well (fine for a
>> separate patch).
>>      
> I've been looking into how I might best go about achieving this.
>
> The APIC_DM_INIT handler is in lapic.c, which is not aware of VMX or
> (obviously) nested VMX. So I need to add some sort of generic "block INIT"
> flag which that code will check. Is this the sort of fix you had in mind?
>    

It's not enough to block INIT; there is also exit reason 3, INIT
signal.  So you need to call x86.c code from the lapic, which needs to
call a kvm_x86_ops hook that lets vmx.c decide whether the INIT needs to
be intercepted or not, and what to do with it (ignore in root mode, exit
in non-root mode).

Note the check needs to be done in vcpu context, not during delivery as 
it is done now.  So we probably need a KVM_REQ_INIT bit in 
vcpu->requests, which we can check during guest entry where we know if 
we're in root or non-root mode.

Pretty complicated and esoteric.  We can defer this now while we work 
out more immediate issues, but it needs to be addressed.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread
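
To make the proposed flow a bit more concrete, the fragments below sketch
what is described above. None of this is existing code: KVM_REQ_INIT and the
nested_handle_init() hook are placeholders for the vcpu->requests bit and the
kvm_x86_ops callback suggested in the reply, and the surrounding details are
assumptions.

/* lapic.c: on APIC_DM_INIT, defer the work to vcpu context instead of
 * acting on it during delivery. KVM_REQ_INIT is a hypothetical new bit. */
	set_bit(KVM_REQ_INIT, &vcpu->requests);
	kvm_vcpu_kick(vcpu);

/* x86.c, vcpu_enter_guest(): in vcpu context we know whether the vcpu is
 * currently in VMX root or non-root operation, so let vmx.c decide. */
	if (test_and_clear_bit(KVM_REQ_INIT, &vcpu->requests))
		kvm_x86_ops->nested_handle_init(vcpu);	/* hypothetical hook */

/* vmx.c would then ignore the INIT while L1 is in root mode, and reflect
 * it to L1 as an exit with reason 3 (INIT signal) while running L2. */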

* Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1
  2010-06-14  8:33   ` Avi Kivity
  2010-06-14  8:49     ` Nadav Har'El
@ 2010-06-16 12:24     ` Nadav Har'El
  2010-06-16 13:10       ` Avi Kivity
  2010-06-22 14:54     ` Nadav Har'El
  2 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-06-16 12:24 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

On Mon, Jun 14, 2010, Avi Kivity wrote about "Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1":
> >+struct __attribute__ ((__packed__)) vmcs12 {
> >+	/* According to the Intel spec, a VMCS region must start with the
> >+	 * following two fields. Then follow implementation-specific data.
> >+	 */
> >+	u32 revision_id;
> >+	u32 abort;
> >+};
> >   
> 
> Note that this structure becomes an ABI, it cannot change except in a 
> backward compatible way due to the need for live migration.  So I'd like 
> a documentation patch that adds a description of the content to 
> Documentation/kvm/.  It can be as simple as listing the structure 
> definition.

I agree that if struct vmcs12 is changed, this will cause problems for live
migration, but why does this mean that the struct's fields or layout are an ABI
worth documenting?
After all, isn't the idea of the VMCS that its internal content and layout
are opaque to the L1 guest? The guest can only read and write it with
VMREAD/VMWRITE, and those two instructions are the ABI (which is of course
documented in the Intel spec), not the content of the vmcs12 structure. Even
if the guest knew the exact layout of this structure, it's not supposed to
use it.

By the way, we have not actually checked that live migration is working
as expected with nested virtualization running. I expect there to be more
pitfalls and bugs even before we consider migration between two different
versions. We would indeed like to allow live migration of different kinds
(of an L1 with all its L2 guests; of all L2 guests of an L1; of a single L2
guest), but we're trying to finish the more basic functionality first.

Thanks,
Nadav.


-- 
Nadav Har'El                        |    Wednesday, Jun 16 2010, 4 Tammuz 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |It is error alone which needs the support
http://nadav.harel.org.il           |of government. Truth can stand by itself.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1
  2010-06-16 12:24     ` Nadav Har'El
@ 2010-06-16 13:10       ` Avi Kivity
  0 siblings, 0 replies; 147+ messages in thread
From: Avi Kivity @ 2010-06-16 13:10 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

On 06/16/2010 03:24 PM, Nadav Har'El wrote:
> On Mon, Jun 14, 2010, Avi Kivity wrote about "Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1":
>    
>>> +struct __attribute__ ((__packed__)) vmcs12 {
>>> +	/* According to the Intel spec, a VMCS region must start with the
>>> +	 * following two fields. Then follow implementation-specific data.
>>> +	 */
>>> +	u32 revision_id;
>>> +	u32 abort;
>>> +};
>>>
>>>        
>> Note that this structure becomes an ABI, it cannot change except in a
>> backward compatible way due to the need for live migration.  So I'd like
>> a documentation patch that adds a description of the content to
>> Documentation/kvm/.  It can be as simple as listing the structure
>> definition.
>>      
> I agree that if struct vmcs12 is changed, this will cause problems for live
> migration, but why does this mean that the struct's fields or layout are an ABI
> worth documenting?
>    

It's a way of adding a barrier to changing it, and of determining which 
versions are compatible with which other versions.

> After all, isn't the idea of the VMCS that its internal content and layout
> are opaque to the L1 guest? The guest can only read and write it with
> VMREAD/VMWRITE, and those two instructions are the ABI (which is of course
> documented in the Intel spec), not the content of the vmcs12 structure. Even
> if the guest knew the exact layout of this structure, it's not supposed to
> use it.
>    

Right, it's only for migration, not for guest use.  Or perhaps for someone
debugging a hypervisor.

> By the way, we have not actually checked that live migration is working
> as expected with nested virtualization running. I expect there to be more
> pitfalls and bugs even before we consider migration between two different
> versions. We would indeed like to allow live migration of different kinds
> (of an L1 with all its L2 guests; of all L2 guests of an L1; of a single L2
> guest), but we're trying to finish the more basic functionality first.
>    

Live migration will not work without ioctls to save/load the vmptr and 
vmxon state.

nsvm has a hack where they force an exit before exiting to host
userspace, so host userspace never sees guest mode.  I don't like it 
much, and in any case it can't work for nvmx since you need to migrate 
the vmxon state.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 10/24] Implement VMPTRLD
  2010-06-13 12:27 ` [PATCH 10/24] Implement VMPTRLD Nadav Har'El
  2010-06-14  9:07   ` Avi Kivity
@ 2010-06-16 13:36   ` Gleb Natapov
  2010-07-06  3:09   ` Dong, Eddie
  2 siblings, 0 replies; 147+ messages in thread
From: Gleb Natapov @ 2010-06-16 13:36 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: avi, kvm

On Sun, Jun 13, 2010 at 03:27:41PM +0300, Nadav Har'El wrote:
> This patch implements the VMPTRLD instruction.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
> --- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
> @@ -3829,6 +3829,26 @@ static int read_guest_vmcs_gpa(struct kv
>  	return 0;
>  }
>  
> +static void set_rflags_to_vmx_fail_invalid(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long rflags;
> +	rflags = vmx_get_rflags(vcpu);
> +	rflags |= X86_EFLAGS_CF;
> +	rflags &= ~X86_EFLAGS_PF & ~X86_EFLAGS_AF & ~X86_EFLAGS_ZF &
> +		~X86_EFLAGS_SF & ~X86_EFLAGS_OF;
> +	vmx_set_rflags(vcpu, rflags);
> +}
> +
> +static void set_rflags_to_vmx_fail_valid(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long rflags;
> +	rflags = vmx_get_rflags(vcpu);
> +	rflags |= X86_EFLAGS_ZF;
> +	rflags &= ~X86_EFLAGS_PF & ~X86_EFLAGS_AF & ~X86_EFLAGS_CF &
> +		~X86_EFLAGS_SF & ~X86_EFLAGS_OF;
> +	vmx_set_rflags(vcpu, rflags);
> +}
> +
>  static void clear_rflags_cf_zf(struct kvm_vcpu *vcpu)
>  {
>  	unsigned long rflags;
> @@ -3869,6 +3889,57 @@ static int handle_vmclear(struct kvm_vcp
>  	return 1;
>  }
>  
> +static bool verify_vmcs12_revision(struct kvm_vcpu *vcpu, gpa_t guest_vmcs_addr)
> +{
> +	bool ret;
> +	struct vmcs12 *vmcs12;
> +	struct page *vmcs_page = nested_get_page(vcpu, guest_vmcs_addr);
> +	if (vmcs_page == NULL)
> +		return 0;
> +	vmcs12 = (struct vmcs12 *)kmap_atomic(vmcs_page, KM_USER0);
> +	if (vmcs12->revision_id == VMCS12_REVISION)
> +		ret = 1;
> +	else {
> +		set_rflags_to_vmx_fail_valid(vcpu);
Should set VM-Instruction Error Field accordingly.

> +		ret = 0;
> +	}
> +	kunmap_atomic(vmcs12, KM_USER0);
> +	kvm_release_page_dirty(vmcs_page);
> +	return ret;
> +}
> +
> +/* Emulate the VMPTRLD instruction */
> +static int handle_vmptrld(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	gpa_t guest_vmcs_addr;
> +
> +	if (!nested_vmx_check_permission(vcpu))
> +		return 1;
> +
> +	if (read_guest_vmcs_gpa(vcpu, &guest_vmcs_addr)) {
> +		set_rflags_to_vmx_fail_invalid(vcpu);
> +		return 1;
> +	}
> +
> +	if (!verify_vmcs12_revision(vcpu, guest_vmcs_addr))
> +		return 1;

Should check that guest_vmcs_addr != VMXON address. I think this check
is missing from VMCLEAR too.

> +
> +	if (vmx->nested.current_vmptr != guest_vmcs_addr) {
> +		vmx->nested.current_vmptr = guest_vmcs_addr;
> +
> +		if (nested_create_current_vmcs(vcpu)) {
> +			printk(KERN_ERR "%s error could not allocate memory",
> +				__func__);
> +			return -ENOMEM;
> +		}
> +	}
> +
> +	clear_rflags_cf_zf(vcpu);
> +	skip_emulated_instruction(vcpu);
> +	return 1;
> +}
> +
>  static int handle_invlpg(struct kvm_vcpu *vcpu)
>  {
>  	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> @@ -4153,7 +4224,7 @@ static int (*kvm_vmx_exit_handlers[])(st
>  	[EXIT_REASON_VMCALL]                  = handle_vmcall,
>  	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
>  	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
> -	[EXIT_REASON_VMPTRLD]                 = handle_vmx_insn,
> +	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
>  	[EXIT_REASON_VMPTRST]                 = handle_vmx_insn,
>  	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
>  	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
			Gleb.

^ permalink raw reply	[flat|nested] 147+ messages in thread
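
Following up on the VM-Instruction Error Field comment above: the fail-valid
path has to record an error number in the current vmcs12 in addition to
setting ZF. One possible shape, built on set_rflags_to_vmx_fail_valid() and
the shadow_vmcs.vm_instruction_error field from the later patches in this
series; the helper name and the way the error number is passed in are
illustrative, not existing code.

static void nested_vmx_fail_valid(struct kvm_vcpu *vcpu, u32 error_number)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);

	set_rflags_to_vmx_fail_valid(vcpu);

	/* Record the VM-instruction error where L1 can VMREAD it back. */
	if (nested_map_current(vcpu)) {
		vmx->nested.current_l2_page->shadow_vmcs.vm_instruction_error =
			error_number;
		nested_unmap_current(vcpu);
	}
}

verify_vmcs12_revision() could then call this with the SDM's "VMPTRLD with
incorrect VMCS revision identifier" error number, and handle_vmptrld() and
handle_vmclear() could fail the same way when the operand equals the VMXON
pointer.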

* Re: [PATCH 11/24] Implement VMPTRST
  2010-06-14  9:15   ` Avi Kivity
@ 2010-06-16 13:53     ` Gleb Natapov
  2010-06-16 15:33       ` Nadav Har'El
  0 siblings, 1 reply; 147+ messages in thread
From: Gleb Natapov @ 2010-06-16 13:53 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Nadav Har'El, kvm

On Mon, Jun 14, 2010 at 12:15:10PM +0300, Avi Kivity wrote:
> On 06/13/2010 03:28 PM, Nadav Har'El wrote:
> >This patch implements the VMPTRST instruction.
> >
> >Signed-off-by: Nadav Har'El<nyh@il.ibm.com>
> >---
> >--- .before/arch/x86/kvm/x86.c	2010-06-13 15:01:29.000000000 +0300
> >+++ .after/arch/x86/kvm/x86.c	2010-06-13 15:01:29.000000000 +0300
> >@@ -3301,7 +3301,7 @@ static int kvm_read_guest_virt_system(gv
> >  	return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, 0, error);
> >  }
> >
> >-static int kvm_write_guest_virt_system(gva_t addr, void *val,
> >+int kvm_write_guest_virt_system(gva_t addr, void *val,
> >  				       unsigned int bytes,
> >  				       struct kvm_vcpu *vcpu,
> >  				       u32 *error)
> 
> write_guest_virt_system() is used by writes which need to ignore the
> cpl; for example, when a cpl 3 instruction loads a segment, the
> processor needs to update the accessed flag even though it is only
> accessible to cpl 0.  That's not your case; you need the ordinary
> write_guest_virt().
> 
> Um, I see there is no kvm_write_guest_virt(), you'll have to introduce it.
> 
The code uses this function only after checking that the cpl is zero, so
maybe it is OK; not too pretty, though. I was actually hoping to get rid of
all kvm_(read|write)_guest_virt* and replace the existing uses with
emulator_(read|write)_emulated, but this patch series adds more users
that will be hard to replace :(

> >
> >+/* Emulate the VMPTRST instruction */
> >+static int handle_vmptrst(struct kvm_vcpu *vcpu)
> >+{
> >+	int r = 0;
> >+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> >+	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
> >+	gva_t vmcs_gva;
> >+
> >+	if (!nested_vmx_check_permission(vcpu))
> >+		return 1;
> >+
> >+	vmcs_gva = get_vmx_mem_address(vcpu, exit_qualification,
> >+				       vmx_instruction_info);
> >+	if (vmcs_gva == 0)
> >+		return 1;
> 
> What's wrong with gva 0?  It's favoured by exploiters everywhere.
> 
> >+	r = kvm_write_guest_virt_system(vmcs_gva,
> >+				 (void *)&to_vmx(vcpu)->nested.current_vmptr,
> >+				 sizeof(u64), vcpu, NULL);
> >+	if (r) {
> 
> Check against the X86EMUL return codes.  You'll need to inject a
> page fault on failure.
> 
> >+		printk(KERN_INFO "%s failed to write vmptr\n", __func__);
> >+		return 1;
> >+	}
> >+	clear_rflags_cf_zf(vcpu);
> >+	skip_emulated_instruction(vcpu);
> >+	return 1;
> >+}
> >+
> 
> -- 
> error compiling committee.c: too many arguments to function
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
			Gleb.

^ permalink raw reply	[flat|nested] 147+ messages in thread
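
For the error-handling comments above, the write path of handle_vmptrst might
end up looking something like the fragment below. This is only a sketch: it
assumes the write helper returns X86EMUL_* codes and fills the error argument
with a #PF error code, it collapses the distinction between a guest fault and
an internal error, and kvm_inject_page_fault()'s exact signature should be
double-checked against the tree being patched.

	u32 error;
	int r;

	r = kvm_write_guest_virt_system(vmcs_gva,
			(void *)&to_vmx(vcpu)->nested.current_vmptr,
			sizeof(u64), vcpu, &error);
	if (r != X86EMUL_CONTINUE) {
		/* Reflect the failed guest-virtual write as a #PF to L1. */
		kvm_inject_page_fault(vcpu, vmcs_gva, error);
		return 1;
	}
	clear_rflags_cf_zf(vcpu);
	skip_emulated_instruction(vcpu);
	return 1;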

* Re: [PATCH 12/24] Add VMCS fields to the vmcs12
  2010-06-13 12:28 ` [PATCH 12/24] Add VMCS fields to the vmcs12 Nadav Har'El
  2010-06-14  9:24   ` Avi Kivity
@ 2010-06-16 14:18   ` Gleb Natapov
  1 sibling, 0 replies; 147+ messages in thread
From: Gleb Natapov @ 2010-06-16 14:18 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: avi, kvm

On Sun, Jun 13, 2010 at 03:28:43PM +0300, Nadav Har'El wrote:
> In this patch we add to vmcs12 (the VMCS that L1 keeps for L2) all the
> standard VMCS fields. These fields are encapsulated in a struct shadow_vmcs.
> 
> Later patches will enable L1 to read and write these fields using VMREAD/
> VMWRITE, and they will be used during a VMLAUNCH/VMRESUME in preparing a real
> VMCS for running L2.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
> --- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
> @@ -117,6 +117,136 @@ struct shared_msr_entry {
>  	u64 mask;
>  };
>  
> +/* shadow_vmcs is a structure used in nested VMX for holding a copy of all
> + * standard VMCS fields. It is used for emulating a VMCS for L1 (see vmcs12),
> + * and also for easier access to VMCS data (see l1_shadow_vmcs).
> + */
> +struct __attribute__ ((__packed__)) shadow_vmcs {
> +	u16 virtual_processor_id;
> +	u16 guest_es_selector;
> +	u16 guest_cs_selector;
> +	u16 guest_ss_selector;
> +	u16 guest_ds_selector;
> +	u16 guest_fs_selector;
> +	u16 guest_gs_selector;
> +	u16 guest_ldtr_selector;
> +	u16 guest_tr_selector;
> +	u16 host_es_selector;
> +	u16 host_cs_selector;
> +	u16 host_ss_selector;
> +	u16 host_ds_selector;
> +	u16 host_fs_selector;
> +	u16 host_gs_selector;
> +	u16 host_tr_selector;
> +	u64 io_bitmap_a;
> +	u64 io_bitmap_b;
> +	u64 msr_bitmap;
> +	u64 vm_exit_msr_store_addr;
> +	u64 vm_exit_msr_load_addr;
> +	u64 vm_entry_msr_load_addr;
> +	u64 tsc_offset;
> +	u64 virtual_apic_page_addr;
> +	u64 apic_access_addr;
> +	u64 ept_pointer;
> +	u64 guest_physical_address;
> +	u64 vmcs_link_pointer;
> +	u64 guest_ia32_debugctl;
> +	u64 guest_ia32_pat;
> +	u64 guest_pdptr0;
> +	u64 guest_pdptr1;
> +	u64 guest_pdptr2;
> +	u64 guest_pdptr3;
> +	u64 host_ia32_pat;
> +	u32 pin_based_vm_exec_control;
> +	u32 cpu_based_vm_exec_control;
> +	u32 exception_bitmap;
> +	u32 page_fault_error_code_mask;
> +	u32 page_fault_error_code_match;
> +	u32 cr3_target_count;
> +	u32 vm_exit_controls;
> +	u32 vm_exit_msr_store_count;
> +	u32 vm_exit_msr_load_count;
> +	u32 vm_entry_controls;
> +	u32 vm_entry_msr_load_count;
> +	u32 vm_entry_intr_info_field;
> +	u32 vm_entry_exception_error_code;
> +	u32 vm_entry_instruction_len;
> +	u32 tpr_threshold;
> +	u32 secondary_vm_exec_control;
> +	u32 vm_instruction_error;
> +	u32 vm_exit_reason;
> +	u32 vm_exit_intr_info;
> +	u32 vm_exit_intr_error_code;
> +	u32 idt_vectoring_info_field;
> +	u32 idt_vectoring_error_code;
> +	u32 vm_exit_instruction_len;
> +	u32 vmx_instruction_info;
> +	u32 guest_es_limit;
> +	u32 guest_cs_limit;
> +	u32 guest_ss_limit;
> +	u32 guest_ds_limit;
> +	u32 guest_fs_limit;
> +	u32 guest_gs_limit;
> +	u32 guest_ldtr_limit;
> +	u32 guest_tr_limit;
> +	u32 guest_gdtr_limit;
> +	u32 guest_idtr_limit;
> +	u32 guest_es_ar_bytes;
> +	u32 guest_cs_ar_bytes;
> +	u32 guest_ss_ar_bytes;
> +	u32 guest_ds_ar_bytes;
> +	u32 guest_fs_ar_bytes;
> +	u32 guest_gs_ar_bytes;
> +	u32 guest_ldtr_ar_bytes;
> +	u32 guest_tr_ar_bytes;
> +	u32 guest_interruptibility_info;
> +	u32 guest_activity_state;
> +	u32 guest_sysenter_cs;
> +	u32 host_ia32_sysenter_cs;
> +	unsigned long cr0_guest_host_mask;
> +	unsigned long cr4_guest_host_mask;
> +	unsigned long cr0_read_shadow;
> +	unsigned long cr4_read_shadow;
> +	unsigned long cr3_target_value0;
> +	unsigned long cr3_target_value1;
> +	unsigned long cr3_target_value2;
> +	unsigned long cr3_target_value3;
> +	unsigned long exit_qualification;
> +	unsigned long guest_linear_address;
> +	unsigned long guest_cr0;
> +	unsigned long guest_cr3;
> +	unsigned long guest_cr4;
> +	unsigned long guest_es_base;
> +	unsigned long guest_cs_base;
> +	unsigned long guest_ss_base;
> +	unsigned long guest_ds_base;
> +	unsigned long guest_fs_base;
> +	unsigned long guest_gs_base;
> +	unsigned long guest_ldtr_base;
> +	unsigned long guest_tr_base;
> +	unsigned long guest_gdtr_base;
> +	unsigned long guest_idtr_base;
> +	unsigned long guest_dr7;
> +	unsigned long guest_rsp;
> +	unsigned long guest_rip;
> +	unsigned long guest_rflags;
> +	unsigned long guest_pending_dbg_exceptions;
> +	unsigned long guest_sysenter_esp;
> +	unsigned long guest_sysenter_eip;
> +	unsigned long host_cr0;
> +	unsigned long host_cr3;
> +	unsigned long host_cr4;
> +	unsigned long host_fs_base;
> +	unsigned long host_gs_base;
> +	unsigned long host_tr_base;
> +	unsigned long host_gdtr_base;
> +	unsigned long host_idtr_base;
> +	unsigned long host_ia32_sysenter_esp;
> +	unsigned long host_ia32_sysenter_eip;
> +	unsigned long host_rsp;
> +	unsigned long host_rip;
> +};
> +
>  #define VMCS12_REVISION 0x11e57ed0
>  
>  /*
> @@ -139,6 +269,8 @@ struct __attribute__ ((__packed__)) vmcs
>  	u32 revision_id;
>  	u32 abort;
>  
> +	struct shadow_vmcs shadow_vmcs;
> +
>  	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
>  };
>  
> @@ -228,6 +360,169 @@ static inline struct vcpu_vmx *to_vmx(st
>  	return container_of(vcpu, struct vcpu_vmx, vcpu);
>  }
>  
> +#define OFFSET(x) offsetof(struct shadow_vmcs, x)
> +
> +static unsigned short vmcs_field_to_offset_table[HOST_RIP+1] = {
> +	[VIRTUAL_PROCESSOR_ID] = OFFSET(virtual_processor_id),
> +	[GUEST_ES_SELECTOR] = OFFSET(guest_es_selector),
> +	[GUEST_CS_SELECTOR] = OFFSET(guest_cs_selector),
> +	[GUEST_SS_SELECTOR] = OFFSET(guest_ss_selector),
> +	[GUEST_DS_SELECTOR] = OFFSET(guest_ds_selector),
> +	[GUEST_FS_SELECTOR] = OFFSET(guest_fs_selector),
> +	[GUEST_GS_SELECTOR] = OFFSET(guest_gs_selector),
> +	[GUEST_LDTR_SELECTOR] = OFFSET(guest_ldtr_selector),
> +	[GUEST_TR_SELECTOR] = OFFSET(guest_tr_selector),
> +	[HOST_ES_SELECTOR] = OFFSET(host_es_selector),
> +	[HOST_CS_SELECTOR] = OFFSET(host_cs_selector),
> +	[HOST_SS_SELECTOR] = OFFSET(host_ss_selector),
> +	[HOST_DS_SELECTOR] = OFFSET(host_ds_selector),
> +	[HOST_FS_SELECTOR] = OFFSET(host_fs_selector),
> +	[HOST_GS_SELECTOR] = OFFSET(host_gs_selector),
> +	[HOST_TR_SELECTOR] = OFFSET(host_tr_selector),
> +	[IO_BITMAP_A] = OFFSET(io_bitmap_a),
> +	[IO_BITMAP_A_HIGH] = OFFSET(io_bitmap_a)+4,
> +	[IO_BITMAP_B] = OFFSET(io_bitmap_b),
> +	[IO_BITMAP_B_HIGH] = OFFSET(io_bitmap_b)+4,
> +	[MSR_BITMAP] = OFFSET(msr_bitmap),
> +	[MSR_BITMAP_HIGH] = OFFSET(msr_bitmap)+4,
> +	[VM_EXIT_MSR_STORE_ADDR] = OFFSET(vm_exit_msr_store_addr),
> +	[VM_EXIT_MSR_STORE_ADDR_HIGH] = OFFSET(vm_exit_msr_store_addr)+4,
> +	[VM_EXIT_MSR_LOAD_ADDR] = OFFSET(vm_exit_msr_load_addr),
> +	[VM_EXIT_MSR_LOAD_ADDR_HIGH] = OFFSET(vm_exit_msr_load_addr)+4,
> +	[VM_ENTRY_MSR_LOAD_ADDR] = OFFSET(vm_entry_msr_load_addr),
> +	[VM_ENTRY_MSR_LOAD_ADDR_HIGH] = OFFSET(vm_entry_msr_load_addr)+4,
> +	[TSC_OFFSET] = OFFSET(tsc_offset),
> +	[TSC_OFFSET_HIGH] = OFFSET(tsc_offset)+4,
> +	[VIRTUAL_APIC_PAGE_ADDR] = OFFSET(virtual_apic_page_addr),
> +	[VIRTUAL_APIC_PAGE_ADDR_HIGH] = OFFSET(virtual_apic_page_addr)+4,
> +	[APIC_ACCESS_ADDR] = OFFSET(apic_access_addr),
> +	[APIC_ACCESS_ADDR_HIGH] = OFFSET(apic_access_addr)+4,
> +	[EPT_POINTER] = OFFSET(ept_pointer),
> +	[EPT_POINTER_HIGH] = OFFSET(ept_pointer)+4,
> +	[GUEST_PHYSICAL_ADDRESS] = OFFSET(guest_physical_address),
> +	[GUEST_PHYSICAL_ADDRESS_HIGH] = OFFSET(guest_physical_address)+4,
> +	[VMCS_LINK_POINTER] = OFFSET(vmcs_link_pointer),
> +	[VMCS_LINK_POINTER_HIGH] = OFFSET(vmcs_link_pointer)+4,
> +	[GUEST_IA32_DEBUGCTL] = OFFSET(guest_ia32_debugctl),
> +	[GUEST_IA32_DEBUGCTL_HIGH] = OFFSET(guest_ia32_debugctl)+4,
> +	[GUEST_IA32_PAT] = OFFSET(guest_ia32_pat),
> +	[GUEST_IA32_PAT_HIGH] = OFFSET(guest_ia32_pat)+4,
> +	[GUEST_PDPTR0] = OFFSET(guest_pdptr0),
> +	[GUEST_PDPTR0_HIGH] = OFFSET(guest_pdptr0)+4,
> +	[GUEST_PDPTR1] = OFFSET(guest_pdptr1),
> +	[GUEST_PDPTR1_HIGH] = OFFSET(guest_pdptr1)+4,
> +	[GUEST_PDPTR2] = OFFSET(guest_pdptr2),
> +	[GUEST_PDPTR2_HIGH] = OFFSET(guest_pdptr2)+4,
> +	[GUEST_PDPTR3] = OFFSET(guest_pdptr3),
> +	[GUEST_PDPTR3_HIGH] = OFFSET(guest_pdptr3)+4,
> +	[HOST_IA32_PAT] = OFFSET(host_ia32_pat),
> +	[HOST_IA32_PAT_HIGH] = OFFSET(host_ia32_pat)+4,
> +	[PIN_BASED_VM_EXEC_CONTROL] = OFFSET(pin_based_vm_exec_control),
> +	[CPU_BASED_VM_EXEC_CONTROL] = OFFSET(cpu_based_vm_exec_control),
> +	[EXCEPTION_BITMAP] = OFFSET(exception_bitmap),
> +	[PAGE_FAULT_ERROR_CODE_MASK] = OFFSET(page_fault_error_code_mask),
> +	[PAGE_FAULT_ERROR_CODE_MATCH] = OFFSET(page_fault_error_code_match),
> +	[CR3_TARGET_COUNT] = OFFSET(cr3_target_count),
> +	[VM_EXIT_CONTROLS] = OFFSET(vm_exit_controls),
> +	[VM_EXIT_MSR_STORE_COUNT] = OFFSET(vm_exit_msr_store_count),
> +	[VM_EXIT_MSR_LOAD_COUNT] = OFFSET(vm_exit_msr_load_count),
> +	[VM_ENTRY_CONTROLS] = OFFSET(vm_entry_controls),
> +	[VM_ENTRY_MSR_LOAD_COUNT] = OFFSET(vm_entry_msr_load_count),
> +	[VM_ENTRY_INTR_INFO_FIELD] = OFFSET(vm_entry_intr_info_field),
> +	[VM_ENTRY_EXCEPTION_ERROR_CODE] = OFFSET(vm_entry_exception_error_code),
> +	[VM_ENTRY_INSTRUCTION_LEN] = OFFSET(vm_entry_instruction_len),
> +	[TPR_THRESHOLD] = OFFSET(tpr_threshold),
> +	[SECONDARY_VM_EXEC_CONTROL] = OFFSET(secondary_vm_exec_control),
> +	[VM_INSTRUCTION_ERROR] = OFFSET(vm_instruction_error),
> +	[VM_EXIT_REASON] = OFFSET(vm_exit_reason),
> +	[VM_EXIT_INTR_INFO] = OFFSET(vm_exit_intr_info),
> +	[VM_EXIT_INTR_ERROR_CODE] = OFFSET(vm_exit_intr_error_code),
> +	[IDT_VECTORING_INFO_FIELD] = OFFSET(idt_vectoring_info_field),
> +	[IDT_VECTORING_ERROR_CODE] = OFFSET(idt_vectoring_error_code),
> +	[VM_EXIT_INSTRUCTION_LEN] = OFFSET(vm_exit_instruction_len),
> +	[VMX_INSTRUCTION_INFO] = OFFSET(vmx_instruction_info),
> +	[GUEST_ES_LIMIT] = OFFSET(guest_es_limit),
> +	[GUEST_CS_LIMIT] = OFFSET(guest_cs_limit),
> +	[GUEST_SS_LIMIT] = OFFSET(guest_ss_limit),
> +	[GUEST_DS_LIMIT] = OFFSET(guest_ds_limit),
> +	[GUEST_FS_LIMIT] = OFFSET(guest_fs_limit),
> +	[GUEST_GS_LIMIT] = OFFSET(guest_gs_limit),
> +	[GUEST_LDTR_LIMIT] = OFFSET(guest_ldtr_limit),
> +	[GUEST_TR_LIMIT] = OFFSET(guest_tr_limit),
> +	[GUEST_GDTR_LIMIT] = OFFSET(guest_gdtr_limit),
> +	[GUEST_IDTR_LIMIT] = OFFSET(guest_idtr_limit),
> +	[GUEST_ES_AR_BYTES] = OFFSET(guest_es_ar_bytes),
> +	[GUEST_CS_AR_BYTES] = OFFSET(guest_cs_ar_bytes),
> +	[GUEST_SS_AR_BYTES] = OFFSET(guest_ss_ar_bytes),
> +	[GUEST_DS_AR_BYTES] = OFFSET(guest_ds_ar_bytes),
> +	[GUEST_FS_AR_BYTES] = OFFSET(guest_fs_ar_bytes),
> +	[GUEST_GS_AR_BYTES] = OFFSET(guest_gs_ar_bytes),
> +	[GUEST_LDTR_AR_BYTES] = OFFSET(guest_ldtr_ar_bytes),
> +	[GUEST_TR_AR_BYTES] = OFFSET(guest_tr_ar_bytes),
> +	[GUEST_INTERRUPTIBILITY_INFO] = OFFSET(guest_interruptibility_info),
> +	[GUEST_ACTIVITY_STATE] = OFFSET(guest_activity_state),
> +	[GUEST_SYSENTER_CS] = OFFSET(guest_sysenter_cs),
> +	[HOST_IA32_SYSENTER_CS] = OFFSET(host_ia32_sysenter_cs),
> +	[CR0_GUEST_HOST_MASK] = OFFSET(cr0_guest_host_mask),
> +	[CR4_GUEST_HOST_MASK] = OFFSET(cr4_guest_host_mask),
> +	[CR0_READ_SHADOW] = OFFSET(cr0_read_shadow),
> +	[CR4_READ_SHADOW] = OFFSET(cr4_read_shadow),
> +	[CR3_TARGET_VALUE0] = OFFSET(cr3_target_value0),
> +	[CR3_TARGET_VALUE1] = OFFSET(cr3_target_value1),
> +	[CR3_TARGET_VALUE2] = OFFSET(cr3_target_value2),
> +	[CR3_TARGET_VALUE3] = OFFSET(cr3_target_value3),
> +	[EXIT_QUALIFICATION] = OFFSET(exit_qualification),
> +	[GUEST_LINEAR_ADDRESS] = OFFSET(guest_linear_address),
> +	[GUEST_CR0] = OFFSET(guest_cr0),
> +	[GUEST_CR3] = OFFSET(guest_cr3),
> +	[GUEST_CR4] = OFFSET(guest_cr4),
> +	[GUEST_ES_BASE] = OFFSET(guest_es_base),
> +	[GUEST_CS_BASE] = OFFSET(guest_cs_base),
> +	[GUEST_SS_BASE] = OFFSET(guest_ss_base),
> +	[GUEST_DS_BASE] = OFFSET(guest_ds_base),
> +	[GUEST_FS_BASE] = OFFSET(guest_fs_base),
> +	[GUEST_GS_BASE] = OFFSET(guest_gs_base),
> +	[GUEST_LDTR_BASE] = OFFSET(guest_ldtr_base),
> +	[GUEST_TR_BASE] = OFFSET(guest_tr_base),
> +	[GUEST_GDTR_BASE] = OFFSET(guest_gdtr_base),
> +	[GUEST_IDTR_BASE] = OFFSET(guest_idtr_base),
> +	[GUEST_DR7] = OFFSET(guest_dr7),
> +	[GUEST_RSP] = OFFSET(guest_rsp),
> +	[GUEST_RIP] = OFFSET(guest_rip),
> +	[GUEST_RFLAGS] = OFFSET(guest_rflags),
> +	[GUEST_PENDING_DBG_EXCEPTIONS] = OFFSET(guest_pending_dbg_exceptions),
> +	[GUEST_SYSENTER_ESP] = OFFSET(guest_sysenter_esp),
> +	[GUEST_SYSENTER_EIP] = OFFSET(guest_sysenter_eip),
> +	[HOST_CR0] = OFFSET(host_cr0),
> +	[HOST_CR3] = OFFSET(host_cr3),
> +	[HOST_CR4] = OFFSET(host_cr4),
> +	[HOST_FS_BASE] = OFFSET(host_fs_base),
> +	[HOST_GS_BASE] = OFFSET(host_gs_base),
> +	[HOST_TR_BASE] = OFFSET(host_tr_base),
> +	[HOST_GDTR_BASE] = OFFSET(host_gdtr_base),
> +	[HOST_IDTR_BASE] = OFFSET(host_idtr_base),
> +	[HOST_IA32_SYSENTER_ESP] = OFFSET(host_ia32_sysenter_esp),
> +	[HOST_IA32_SYSENTER_EIP] = OFFSET(host_ia32_sysenter_eip),
> +	[HOST_RSP] = OFFSET(host_rsp),
> +	[HOST_RIP] = OFFSET(host_rip),
> +};
> +
> +static inline short vmcs_field_to_offset(unsigned long field)
> +{
> +
> +	if (field > HOST_RIP || vmcs_field_to_offset_table[field] == 0) {
> +		printk(KERN_ERR "invalid vmcs field 0x%lx\n", field);
> +		return -1;
> +	}
> +	return vmcs_field_to_offset_table[field];
> +}
> +
> +static inline struct shadow_vmcs *get_shadow_vmcs(struct kvm_vcpu *vcpu)
> +{
> +	WARN_ON(!to_vmx(vcpu)->nested.current_l2_page);
What is the point of WARN_ON() if we will crash a moment later while
using the address returned by get_shadow_vmcs()?
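If the WARN_ON() is meant to catch the bug without the NULL dereference
that follows, something along these lines would make more sense (just a
sketch; the callers would then have to check for NULL):

	static inline struct shadow_vmcs *get_shadow_vmcs(struct kvm_vcpu *vcpu)
	{
		/* warn and let the caller bail out instead of oopsing */
		if (WARN_ON(!to_vmx(vcpu)->nested.current_l2_page))
			return NULL;
		return &to_vmx(vcpu)->nested.current_l2_page->shadow_vmcs;
	}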

> +	return &(to_vmx(vcpu)->nested.current_l2_page->shadow_vmcs);
> +}
> +
>  static struct page *nested_get_page(struct kvm_vcpu *vcpu, u64 vmcs_addr)
>  {
>  	struct page *vmcs_page =

--
			Gleb.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 13/24] Implement VMREAD and VMWRITE
  2010-06-14  9:36   ` Avi Kivity
@ 2010-06-16 14:48     ` Gleb Natapov
  2010-08-04 13:42       ` Nadav Har'El
  2010-08-04 16:09     ` Nadav Har'El
  1 sibling, 1 reply; 147+ messages in thread
From: Gleb Natapov @ 2010-06-16 14:48 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Nadav Har'El, kvm

On Mon, Jun 14, 2010 at 12:36:02PM +0300, Avi Kivity wrote:
> On 06/13/2010 03:29 PM, Nadav Har'El wrote:
> >Implement the VMREAD and VMWRITE instructions. With these instructions, L1
> >can read and write to the VMCS it is holding. The values are read or written
> >to the fields of the shadow_vmcs structure introduced in the previous patch.
> >
> >
> >+
> >+static inline int vmcs_field_size(int field_type, struct kvm_vcpu *vcpu)
> >+{
> >+	switch (field_type) {
> >+	case VMCS_FIELD_TYPE_U16:
> >+		return 2;
> >+	case VMCS_FIELD_TYPE_U32:
> >+		return 4;
> >+	case VMCS_FIELD_TYPE_U64:
> >+		return 8;
> >+	case VMCS_FIELD_TYPE_ULONG:
> >+#ifdef CONFIG_X86_64
> >+		if (is_long_mode(vcpu))
> >+			return 8;
> >+#endif
> >+		return 4;
> 
> No need for the ifdef, is_long_mode() works everywhere.
> 
> >+	}
> >+	return 0; /* should never happen */
> 
> Then BUG()?
> 
> >+}
> >+
> >  struct vcpu_vmx {
> >  	struct kvm_vcpu       vcpu;
> >  	struct list_head      local_vcpus_link;
> >@@ -4184,6 +4220,189 @@ static int handle_vmclear(struct kvm_vcp
> >  	return 1;
> >  }
> >
> >
> >+static int handle_vmread_reg(struct kvm_vcpu *vcpu, int reg,
> >+			     unsigned long field)
> >+{
> >+	u64 field_value;
> >+	if (!nested_vmcs_read_any(vcpu, field,&field_value))
> >+		return 0;
> >+
> >+#ifdef CONFIG_X86_64
> >+	switch (vmcs_field_type(field)) {
> >+	case VMCS_FIELD_TYPE_U64: case VMCS_FIELD_TYPE_ULONG:
> >+		if (!is_long_mode(vcpu)) {
> >+			kvm_register_write(vcpu, reg+1, field_value>>  32);
> 
> What's this reg+1 thing?  I thought vmread simply ignores the upper half.
> 
> >+			field_value = (u32)field_value;
> >+		}
> >+	}
> >+#endif
> >+	kvm_register_write(vcpu, reg, field_value);
> >+	return 1;
> >+}
> >+
> >+static int handle_vmread_mem(struct kvm_vcpu *vcpu, gva_t gva,
> >+			     unsigned long field)
> >+{
> >+	u64 field_value;
> >+	if (!nested_vmcs_read_any(vcpu, field,&field_value))
> >+		return 0;
> >+
> >+	/* It's ok to use *_system, because handle_vmread verifies cpl=0 */
> 
> >+	kvm_write_guest_virt_system(gva,&field_value,
> >+			     vmcs_field_size(vmcs_field_type(field), vcpu),
> >+			     vcpu, NULL);
> 
> vmread doesn't support 64-bit writes to memory outside long mode, so
> you'll have to truncate the write.
> 
> I think you'll be better off returning a 32-bit size in
> vmcs_field_size() in these cases.
> 
Actually, the write should always be 32 bits long outside IA-32e mode and
64 bits long in 64-bit mode. Unused bits should be set to zero.
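Something like this would match that (rough sketch with a hypothetical
helper name, not the patch's naming; is_long_mode() should be good enough
here since, if I remember correctly, VMREAD is #UD in compatibility mode
anyway):

	/* Size of a VMREAD/VMWRITE memory operand: 32 bits outside IA-32e
	 * mode, 64 bits in 64-bit mode, regardless of the field's width. */
	static inline int vmcs_mem_operand_size(struct kvm_vcpu *vcpu)
	{
		return is_long_mode(vcpu) ? 8 : 4;
	}

handle_vmread_mem() would then write the zero-extended field_value with
this size instead of vmcs_field_size().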


> >+	return 1;
> >+}
> >+
> >+static int handle_vmread(struct kvm_vcpu *vcpu)
> >+{
> >+	unsigned long field;
> >+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> >+	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
> >+	gva_t gva = 0;
> >+	int read_succeed;
> >+
> >+	if (!nested_vmx_check_permission(vcpu))
> >+		return 1;
> >+
> >+	if (!nested_map_current(vcpu)) {
> >+		printk(KERN_INFO "%s invalid shadow vmcs\n", __func__);
> >+		set_rflags_to_vmx_fail_invalid(vcpu);
> >+		return 1;
> >+	}
> 
> Can do the read_any() here.
> 
> >+
> >+	/* decode instruction info to get the field to read and where to store
> >+	 * its value */
> >+	field = kvm_register_read(vcpu, VMX_OPERAND_REG2(vmx_instruction_info));
> >+	if (VMX_OPERAND_IS_REG(vmx_instruction_info)) {
> >+		read_succeed = handle_vmread_reg(vcpu,
> >+			VMX_OPERAND_REG(vmx_instruction_info), field);
> >+	} else {
> >+		gva = get_vmx_mem_address(vcpu, exit_qualification,
> >+					  vmx_instruction_info);
> >+		if (gva == 0)
> >+			return 1;
> >+		read_succeed = handle_vmread_mem(vcpu, gva, field);
> >+	}
> >+
> >+	if (read_succeed) {
> >+		clear_rflags_cf_zf(vcpu);
> >+		skip_emulated_instruction(vcpu);
> >+	} else {
> >+		set_rflags_to_vmx_fail_valid(vcpu);
> >+		vmcs_write32(VM_INSTRUCTION_ERROR, 12);
> 
> s_e_i() in any case but an exception.
> 
> >+	}
> >+
> >+	nested_unmap_current(vcpu);
> >+	return 1;
> >+}
> >+
> >+
> >
> >+	if (VMX_OPERAND_IS_REG(vmx_instruction_info))
> >+		field_value = kvm_register_read(vcpu,
> >+			VMX_OPERAND_REG(vmx_instruction_info));
> >+	else {
> >+		gva  = get_vmx_mem_address(vcpu, exit_qualification,
> >+			vmx_instruction_info);
> >+		if (gva == 0)
> >+			return 1;
> >+		kvm_read_guest_virt(gva,&field_value,
> >+			vmcs_field_size(field_type, vcpu), vcpu, NULL);
> 
> Check for exception.
> 
> 
> -- 
> error compiling committee.c: too many arguments to function
> 

--
			Gleb.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 13/24] Implement VMREAD and VMWRITE
  2010-06-13 12:29 ` [PATCH 13/24] Implement VMREAD and VMWRITE Nadav Har'El
  2010-06-14  9:36   ` Avi Kivity
@ 2010-06-16 15:03   ` Gleb Natapov
  2010-08-04 11:46     ` Nadav Har'El
  1 sibling, 1 reply; 147+ messages in thread
From: Gleb Natapov @ 2010-06-16 15:03 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: avi, kvm

On Sun, Jun 13, 2010 at 03:29:13PM +0300, Nadav Har'El wrote:
> Implement the VMREAD and VMWRITE instructions. With these instructions, L1
> can read and write to the VMCS it is holding. The values are read or written
> to the fields of the shadow_vmcs structure introduced in the previous patch.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
> --- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
> @@ -299,6 +299,42 @@ struct nested_vmx {
>  	int l2_vmcs_num;
>  };
>  
> +enum vmcs_field_type {
> +	VMCS_FIELD_TYPE_U16 = 0,
> +	VMCS_FIELD_TYPE_U64 = 1,
> +	VMCS_FIELD_TYPE_U32 = 2,
> +	VMCS_FIELD_TYPE_ULONG = 3
> +};
> +
> +#define VMCS_FIELD_LENGTH_OFFSET 13
> +#define VMCS_FIELD_LENGTH_MASK 0x6000
> +
> +static inline int vmcs_field_type(unsigned long field)
> +{
> +	if (0x1 & field)	/* one of the *_HIGH fields, all are 32 bit */
> +		return VMCS_FIELD_TYPE_U32;
> +	return (VMCS_FIELD_LENGTH_MASK & field) >> 13;
> +}
> +
> +static inline int vmcs_field_size(int field_type, struct kvm_vcpu *vcpu)
> +{
> +	switch (field_type) {
> +	case VMCS_FIELD_TYPE_U16:
> +		return 2;
> +	case VMCS_FIELD_TYPE_U32:
> +		return 4;
> +	case VMCS_FIELD_TYPE_U64:
> +		return 8;
> +	case VMCS_FIELD_TYPE_ULONG:
> +#ifdef CONFIG_X86_64
> +		if (is_long_mode(vcpu))
> +			return 8;
> +#endif
> +		return 4;
> +	}
> +	return 0; /* should never happen */
> +}
> +
>  struct vcpu_vmx {
>  	struct kvm_vcpu       vcpu;
>  	struct list_head      local_vcpus_link;
> @@ -4184,6 +4220,189 @@ static int handle_vmclear(struct kvm_vcp
>  	return 1;
>  }
>  
> +static inline bool nested_vmcs_read_any(struct kvm_vcpu *vcpu,
> +					unsigned long field, u64 *ret)
> +{
> +	short offset = vmcs_field_to_offset(field);
> +	char *p;
> +
> +	if (offset < 0)
> +		return 0;
> +	if (!to_vmx(vcpu)->nested.current_l2_page)
> +		return 0;
> +
> +	p = ((char *)(get_shadow_vmcs(vcpu))) + offset;
> +
> +	switch (vmcs_field_type(field)) {
> +	case VMCS_FIELD_TYPE_ULONG:
> +		*ret = *((unsigned long *)p);
> +		return 1;
> +	case VMCS_FIELD_TYPE_U16:
> +		*ret = (u16) *((unsigned long *)p);
> +		return 1;
> +	case VMCS_FIELD_TYPE_U32:
> +		*ret = (u32) *((unsigned long *)p);
> +		return 1;
> +	case VMCS_FIELD_TYPE_U64:
> +		*ret = *((u64 *)p);
> +		return 1;
> +	default:
> +		return 0; /* can never happen. */
> +	}
> +}
> +
> +static int handle_vmread_reg(struct kvm_vcpu *vcpu, int reg,
> +			     unsigned long field)
> +{
> +	u64 field_value;
> +	if (!nested_vmcs_read_any(vcpu, field, &field_value))
> +		return 0;
> +
> +#ifdef CONFIG_X86_64
> +	switch (vmcs_field_type(field)) {
> +	case VMCS_FIELD_TYPE_U64: case VMCS_FIELD_TYPE_ULONG:
> +		if (!is_long_mode(vcpu)) {
> +			kvm_register_write(vcpu, reg+1, field_value >> 32);
> +			field_value = (u32)field_value;
> +		}
> +	}
> +#endif
> +	kvm_register_write(vcpu, reg, field_value);
> +	return 1;
> +}
> +
> +static int handle_vmread_mem(struct kvm_vcpu *vcpu, gva_t gva,
> +			     unsigned long field)
> +{
> +	u64 field_value;
> +	if (!nested_vmcs_read_any(vcpu, field, &field_value))
> +		return 0;
> +
> +	/* It's ok to use *_system, because handle_vmread verifies cpl=0 */
> +	kvm_write_guest_virt_system(gva, &field_value,
> +			     vmcs_field_size(vmcs_field_type(field), vcpu),
> +			     vcpu, NULL);
> +	return 1;
> +}
> +
> +static int handle_vmread(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long field;
> +	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> +	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
> +	gva_t gva = 0;
> +	int read_succeed;
> +
> +	if (!nested_vmx_check_permission(vcpu))
> +		return 1;
> +
> +	if (!nested_map_current(vcpu)) {
> +		printk(KERN_INFO "%s invalid shadow vmcs\n", __func__);
> +		set_rflags_to_vmx_fail_invalid(vcpu);
> +		return 1;
> +	}
> +
> +	/* decode instruction info to get the field to read and where to store
> +	 * its value */
> +	field = kvm_register_read(vcpu, VMX_OPERAND_REG2(vmx_instruction_info));
> +	if (VMX_OPERAND_IS_REG(vmx_instruction_info)) {
> +		read_succeed = handle_vmread_reg(vcpu,
> +			VMX_OPERAND_REG(vmx_instruction_info), field);
> +	} else {
> +		gva = get_vmx_mem_address(vcpu, exit_qualification,
> +					  vmx_instruction_info);
> +		if (gva == 0)
> +			return 1;
> +		read_succeed = handle_vmread_mem(vcpu, gva, field);
> +	}
> +
> +	if (read_succeed) {
> +		clear_rflags_cf_zf(vcpu);
> +		skip_emulated_instruction(vcpu);
> +	} else {
> +		set_rflags_to_vmx_fail_valid(vcpu);
> +		vmcs_write32(VM_INSTRUCTION_ERROR, 12);
VM_INSTRUCTION_ERROR is read-only, and when do you transfer it to vmcs12
anyway? I think set_rflags_to_vmx_fail_valid() should take the
vm_instruction_error as a parameter and put it into vmcs12; that way you
can never forget to provide an error code in the fail_valid case, the
compiler will remind you.
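I.e. something like this (sketch only, with a hypothetical wrapper name,
and it assumes the vmcs12 is mapped at this point):

	static void nested_vmx_fail_valid(struct kvm_vcpu *vcpu,
					  u32 vm_instruction_error)
	{
		/* VMfailValid: set ZF, clear the other flags, and record
		 * the error number that L1 will later read with VMREAD */
		set_rflags_to_vmx_fail_valid(vcpu);
		get_shadow_vmcs(vcpu)->vm_instruction_error =
			vm_instruction_error;
	}

with every caller passing the error number explicitly.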


> +	}
> +
> +	nested_unmap_current(vcpu);
> +	return 1;
> +}
> +
> +
> +static int handle_vmwrite(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long field;
> +	u64 field_value = 0;
> +	gva_t gva;
> +	int field_type;
> +	unsigned long exit_qualification   = vmcs_readl(EXIT_QUALIFICATION);
> +	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
> +	char *p;
> +	short offset;
> +
> +	if (!nested_vmx_check_permission(vcpu))
> +		return 1;
> +
> +	if (!nested_map_current(vcpu)) {
> +		printk(KERN_INFO "%s invalid shadow vmcs\n", __func__);
> +		set_rflags_to_vmx_fail_invalid(vcpu);
> +		return 1;
> +	}
> +
> +	field = kvm_register_read(vcpu, VMX_OPERAND_REG2(vmx_instruction_info));
> +	field_type = vmcs_field_type(field);
> +
> +	offset = vmcs_field_to_offset(field);
> +	if (offset < 0) {
> +		set_rflags_to_vmx_fail_invalid(vcpu);
> +		return 1;
> +	}
> +	p = ((char *) get_shadow_vmcs(vcpu)) + offset;
> +
> +	if (VMX_OPERAND_IS_REG(vmx_instruction_info))
> +		field_value = kvm_register_read(vcpu,
> +			VMX_OPERAND_REG(vmx_instruction_info));
> +	else {
> +		gva  = get_vmx_mem_address(vcpu, exit_qualification,
> +			vmx_instruction_info);
> +		if (gva == 0)
> +			return 1;
> +		kvm_read_guest_virt(gva, &field_value,
> +			vmcs_field_size(field_type, vcpu), vcpu, NULL);
> +	}
> +
What about checking that the vmcs field is read-only?
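The encoding already tells you: bits 11:10 of a VMCS field encoding give
its type, and type 1 is the read-only VM-exit information fields. A sketch
(error number 13 should be "VMWRITE to read-only VMCS component"):

	static inline bool vmcs_field_readonly(unsigned long field)
	{
		/* bits 11:10 of the encoding: 1 == VM-exit information */
		return ((field >> 10) & 0x3) == 1;
	}

handle_vmwrite() would then fail with VMfailValid and that error number
before touching the shadow vmcs.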

> +	switch (field_type) {
> +	case VMCS_FIELD_TYPE_U16:
> +		*(u16 *)p = field_value;
> +		break;
> +	case VMCS_FIELD_TYPE_U32:
> +		*(u32 *)p = field_value;
> +		break;
> +	case VMCS_FIELD_TYPE_U64:
> +#ifdef CONFIG_X86_64
> +		*(unsigned long *)p = field_value;
> +#else
> +		*(unsigned long *)p = field_value;
> +		*(((unsigned long *)p)+1) = field_value >> 32;
> +#endif
> +		break;
> +	case VMCS_FIELD_TYPE_ULONG:
> +		*(unsigned long *)p = field_value;
> +		break;
> +	default:
> +		printk(KERN_INFO "%s invalid field\n", __func__);
> +		set_rflags_to_vmx_fail_valid(vcpu);
> +		vmcs_write32(VM_INSTRUCTION_ERROR, 12);
> +		nested_unmap_current(vcpu);
> +		return 1;
> +	}
> +
> +	clear_rflags_cf_zf(vcpu);
> +	skip_emulated_instruction(vcpu);
> +	nested_unmap_current(vcpu);
> +	return 1;
> +}
> +
>  static bool verify_vmcs12_revision(struct kvm_vcpu *vcpu, gpa_t guest_vmcs_addr)
>  {
>  	bool ret;
> @@ -4548,9 +4767,9 @@ static int (*kvm_vmx_exit_handlers[])(st
>  	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
>  	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
>  	[EXIT_REASON_VMPTRST]                 = handle_vmptrst,
> -	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
> +	[EXIT_REASON_VMREAD]                  = handle_vmread,
>  	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
> -	[EXIT_REASON_VMWRITE]                 = handle_vmx_insn,
> +	[EXIT_REASON_VMWRITE]                 = handle_vmwrite,
>  	[EXIT_REASON_VMOFF]                   = handle_vmoff,
>  	[EXIT_REASON_VMON]                    = handle_vmon,
>  	[EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,

--
			Gleb.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 11/24] Implement VMPTRST
  2010-06-16 13:53     ` Gleb Natapov
@ 2010-06-16 15:33       ` Nadav Har'El
  0 siblings, 0 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-06-16 15:33 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Avi Kivity, kvm

On Wed, Jun 16, 2010, Gleb Natapov wrote about "Re: [PATCH 11/24] Implement VMPTRST":
> On Mon, Jun 14, 2010 at 12:15:10PM +0300, Avi Kivity wrote:
> > write_guest_virt_system() is used by writes which need to ignore the
> > cpl, for example when a cpl 3 instruction loads a segment, the
> > processor needs to update the accessed flag even though it is only
> > accessible to cpl 0.  That's not your case, you need the ordinary
> > write_guest_virt().
> > 
> > Um, I see there is no kvm_write_guest_virt(), you'll have to introduce it.
> > 
the code uses this function after checking that cpl is zero, so maybe it
is ok, though not too pretty. I was actually hoping to get rid of all
kvm_(read|write)_guest_virt* and replace existing uses with
emulator_(read|write)_emulated, but this patch series adds more users
that will be hard to replace :(

If I remember the history correctly, this is exactly what happened in this
code. We used to use kvm_write_guest_virt(), until a few months ago it
disappeared. I thought it was fine to call write_guest_virt_system()
because, like you said, we already check above that cpl=0, so it is fine
to assume we have cpl 0 privileges.
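(For reference, the check I mean is the cpl test in
nested_vmx_check_permission(); roughly, and this is just a reconstruction
from memory rather than the exact patch code:

	if (vmx_get_cpl(vcpu)) {
		kvm_queue_exception_e(vcpu, GP_VECTOR, 0);
		return 0;
	}

so by the time kvm_write_guest_virt_system() is reached we know we are
running at cpl 0.)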

So while it might look a bit strange at first, I think it should be fine and
there's no need to create more functions.

-- 
Nadav Har'El                        |    Wednesday, Jun 16 2010, 4 Tammuz 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Amateurs built the ark - professionals
http://nadav.harel.org.il           |built the Titanic.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 14/24] Prepare vmcs02 from vmcs01 and vmcs12
  2010-06-13 12:29 ` [PATCH 14/24] Prepare vmcs02 from vmcs01 and vmcs12 Nadav Har'El
  2010-06-14 11:11   ` Avi Kivity
@ 2010-06-17  8:50   ` Gleb Natapov
  2010-07-06  6:25   ` Dong, Eddie
  2 siblings, 0 replies; 147+ messages in thread
From: Gleb Natapov @ 2010-06-17  8:50 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: avi, kvm

On Sun, Jun 13, 2010 at 03:29:44PM +0300, Nadav Har'El wrote:
> This patch contains code to prepare the VMCS which can be used to actually
> run the L2 guest, vmcs02. prepare_vmcs02 appropriately merges the information
> in shadow_vmcs that L1 built for L2 (vmcs12), and that in the VMCS that we
> built for L1 (vmcs01).
> 
> VMREAD/WRITE can only access one VMCS at a time (the "current" VMCS), which
> makes it difficult for us to read from vmcs01 while writing to vmcs12. This
> is why we first make a copy of vmcs01 in memory (l1_shadow_vmcs) and then
> read that memory copy while writing to vmcs12.
> 
Did you mean vmcs02 instead of vmcs12 in the above paragraph?

> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
> --- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
> @@ -849,6 +849,36 @@ static inline bool report_flexpriority(v
>  	return flexpriority_enabled;
>  }
>  
> +static inline bool nested_cpu_has_vmx_tpr_shadow(struct kvm_vcpu *vcpu)
> +{
> +	return cpu_has_vmx_tpr_shadow() &&
> +		get_shadow_vmcs(vcpu)->cpu_based_vm_exec_control &
> +		CPU_BASED_TPR_SHADOW;
> +}
> +
> +static inline bool nested_cpu_has_secondary_exec_ctrls(struct kvm_vcpu *vcpu)
> +{
> +	return cpu_has_secondary_exec_ctrls() &&
> +		get_shadow_vmcs(vcpu)->cpu_based_vm_exec_control &
> +		CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
> +}
> +
> +static inline bool nested_vm_need_virtualize_apic_accesses(struct kvm_vcpu
> +							   *vcpu)
> +{
> +	return nested_cpu_has_secondary_exec_ctrls(vcpu) &&
> +		(get_shadow_vmcs(vcpu)->secondary_vm_exec_control &
> +		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
> +}
> +
> +static inline bool nested_cpu_has_vmx_ept(struct kvm_vcpu *vcpu)
> +{
> +	return nested_cpu_has_secondary_exec_ctrls(vcpu) &&
> +		(get_shadow_vmcs(vcpu)->secondary_vm_exec_control &
> +		SECONDARY_EXEC_ENABLE_EPT);
> +}
> +
> +
>  static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
>  {
>  	int i;
> @@ -1292,6 +1322,39 @@ static void vmx_load_host_state(struct v
>  	preempt_enable();
>  }
>  
> +int load_vmcs_host_state(struct shadow_vmcs *src)
> +{
> +	vmcs_write16(HOST_ES_SELECTOR, src->host_es_selector);
> +	vmcs_write16(HOST_CS_SELECTOR, src->host_cs_selector);
> +	vmcs_write16(HOST_SS_SELECTOR, src->host_ss_selector);
> +	vmcs_write16(HOST_DS_SELECTOR, src->host_ds_selector);
> +	vmcs_write16(HOST_FS_SELECTOR, src->host_fs_selector);
> +	vmcs_write16(HOST_GS_SELECTOR, src->host_gs_selector);
> +	vmcs_write16(HOST_TR_SELECTOR, src->host_tr_selector);
> +
> +	vmcs_write64(TSC_OFFSET, src->tsc_offset);
> +
> +	if (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PAT)
> +		vmcs_write64(HOST_IA32_PAT, src->host_ia32_pat);
> +
> +	vmcs_write32(HOST_IA32_SYSENTER_CS, src->host_ia32_sysenter_cs);
> +
> +	vmcs_writel(HOST_CR0, src->host_cr0);
> +	vmcs_writel(HOST_CR3, src->host_cr3);
> +	vmcs_writel(HOST_CR4, src->host_cr4);
> +	vmcs_writel(HOST_FS_BASE, src->host_fs_base);
> +	vmcs_writel(HOST_GS_BASE, src->host_gs_base);
> +	vmcs_writel(HOST_TR_BASE, src->host_tr_base);
> +	vmcs_writel(HOST_GDTR_BASE, src->host_gdtr_base);
> +	vmcs_writel(HOST_IDTR_BASE, src->host_idtr_base);
> +	vmcs_writel(HOST_RSP, src->host_rsp);
> +	vmcs_writel(HOST_RIP, src->host_rip);
> +	vmcs_writel(HOST_IA32_SYSENTER_ESP, src->host_ia32_sysenter_esp);
> +	vmcs_writel(HOST_IA32_SYSENTER_EIP, src->host_ia32_sysenter_eip);
> +
> +	return 0;
> +}
> +
>  /*
>   * Switches to specified vcpu, until a matching vcpu_put(), but assumes
>   * vcpu mutex is already taken.
> @@ -1922,6 +1985,71 @@ static void vmclear_local_vcpus(void)
>  		__vcpu_clear(vmx);
>  }
>  
> +int load_vmcs_common(struct shadow_vmcs *src)
> +{
> +	vmcs_write16(GUEST_ES_SELECTOR, src->guest_es_selector);
> +	vmcs_write16(GUEST_CS_SELECTOR, src->guest_cs_selector);
> +	vmcs_write16(GUEST_SS_SELECTOR, src->guest_ss_selector);
> +	vmcs_write16(GUEST_DS_SELECTOR, src->guest_ds_selector);
> +	vmcs_write16(GUEST_FS_SELECTOR, src->guest_fs_selector);
> +	vmcs_write16(GUEST_GS_SELECTOR, src->guest_gs_selector);
> +	vmcs_write16(GUEST_LDTR_SELECTOR, src->guest_ldtr_selector);
> +	vmcs_write16(GUEST_TR_SELECTOR, src->guest_tr_selector);
> +
> +	vmcs_write64(GUEST_IA32_DEBUGCTL, src->guest_ia32_debugctl);
> +
> +	if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
> +		vmcs_write64(GUEST_IA32_PAT, src->guest_ia32_pat);
> +
> +	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, src->vm_entry_intr_info_field);
> +	vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
> +		     src->vm_entry_exception_error_code);
> +	vmcs_write32(VM_ENTRY_INSTRUCTION_LEN, src->vm_entry_instruction_len);
> +
> +	vmcs_write32(GUEST_ES_LIMIT, src->guest_es_limit);
> +	vmcs_write32(GUEST_CS_LIMIT, src->guest_cs_limit);
> +	vmcs_write32(GUEST_SS_LIMIT, src->guest_ss_limit);
> +	vmcs_write32(GUEST_DS_LIMIT, src->guest_ds_limit);
> +	vmcs_write32(GUEST_FS_LIMIT, src->guest_fs_limit);
> +	vmcs_write32(GUEST_GS_LIMIT, src->guest_gs_limit);
> +	vmcs_write32(GUEST_LDTR_LIMIT, src->guest_ldtr_limit);
> +	vmcs_write32(GUEST_TR_LIMIT, src->guest_tr_limit);
> +	vmcs_write32(GUEST_GDTR_LIMIT, src->guest_gdtr_limit);
> +	vmcs_write32(GUEST_IDTR_LIMIT, src->guest_idtr_limit);
> +	vmcs_write32(GUEST_ES_AR_BYTES, src->guest_es_ar_bytes);
> +	vmcs_write32(GUEST_CS_AR_BYTES, src->guest_cs_ar_bytes);
> +	vmcs_write32(GUEST_SS_AR_BYTES, src->guest_ss_ar_bytes);
> +	vmcs_write32(GUEST_DS_AR_BYTES, src->guest_ds_ar_bytes);
> +	vmcs_write32(GUEST_FS_AR_BYTES, src->guest_fs_ar_bytes);
> +	vmcs_write32(GUEST_GS_AR_BYTES, src->guest_gs_ar_bytes);
> +	vmcs_write32(GUEST_LDTR_AR_BYTES, src->guest_ldtr_ar_bytes);
> +	vmcs_write32(GUEST_TR_AR_BYTES, src->guest_tr_ar_bytes);
> +	vmcs_write32(GUEST_INTERRUPTIBILITY_INFO,
> +		     src->guest_interruptibility_info);
> +	vmcs_write32(GUEST_ACTIVITY_STATE, src->guest_activity_state);
> +	vmcs_write32(GUEST_SYSENTER_CS, src->guest_sysenter_cs);
> +
> +	vmcs_writel(GUEST_ES_BASE, src->guest_es_base);
> +	vmcs_writel(GUEST_CS_BASE, src->guest_cs_base);
> +	vmcs_writel(GUEST_SS_BASE, src->guest_ss_base);
> +	vmcs_writel(GUEST_DS_BASE, src->guest_ds_base);
> +	vmcs_writel(GUEST_FS_BASE, src->guest_fs_base);
> +	vmcs_writel(GUEST_GS_BASE, src->guest_gs_base);
> +	vmcs_writel(GUEST_LDTR_BASE, src->guest_ldtr_base);
> +	vmcs_writel(GUEST_TR_BASE, src->guest_tr_base);
> +	vmcs_writel(GUEST_GDTR_BASE, src->guest_gdtr_base);
> +	vmcs_writel(GUEST_IDTR_BASE, src->guest_idtr_base);
> +	vmcs_writel(GUEST_DR7, src->guest_dr7);
> +	vmcs_writel(GUEST_RSP, src->guest_rsp);
> +	vmcs_writel(GUEST_RIP, src->guest_rip);
> +	vmcs_writel(GUEST_RFLAGS, src->guest_rflags);
> +	vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS,
> +		    src->guest_pending_dbg_exceptions);
> +	vmcs_writel(GUEST_SYSENTER_ESP, src->guest_sysenter_esp);
> +	vmcs_writel(GUEST_SYSENTER_EIP, src->guest_sysenter_eip);
> +
> +	return 0;
> +}
>  
>  /* Just like cpu_vmxoff(), but with the __kvm_handle_fault_on_reboot()
>   * tricks.
> @@ -5363,6 +5491,281 @@ static void vmx_set_supported_cpuid(u32 
>  {
>  }
>  
> +/* Make a copy of the current VMCS to ordinary memory. This is needed because
> + * in VMX you cannot read and write to two VMCS at the same time, so when we
> + * want to do this (in prepare_vmcs_02, which needs to read from vmcs01 while
> + * preparing vmcs02), we need to first save a copy of one VMCS's fields in
> + * memory, and then use that copy.
> + */
> +void save_vmcs(struct shadow_vmcs *dst)
> +{
> +	dst->guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR);
> +	dst->guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR);
> +	dst->guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR);
> +	dst->guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR);
> +	dst->guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR);
> +	dst->guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR);
> +	dst->guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR);
> +	dst->guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR);
> +	dst->host_es_selector = vmcs_read16(HOST_ES_SELECTOR);
> +	dst->host_cs_selector = vmcs_read16(HOST_CS_SELECTOR);
> +	dst->host_ss_selector = vmcs_read16(HOST_SS_SELECTOR);
> +	dst->host_ds_selector = vmcs_read16(HOST_DS_SELECTOR);
> +	dst->host_fs_selector = vmcs_read16(HOST_FS_SELECTOR);
> +	dst->host_gs_selector = vmcs_read16(HOST_GS_SELECTOR);
> +	dst->host_tr_selector = vmcs_read16(HOST_TR_SELECTOR);
> +	dst->io_bitmap_a = vmcs_read64(IO_BITMAP_A);
> +	dst->io_bitmap_b = vmcs_read64(IO_BITMAP_B);
> +	if (cpu_has_vmx_msr_bitmap())
> +		dst->msr_bitmap = vmcs_read64(MSR_BITMAP);
> +	dst->tsc_offset = vmcs_read64(TSC_OFFSET);
> +	dst->virtual_apic_page_addr = vmcs_read64(VIRTUAL_APIC_PAGE_ADDR);
> +	dst->apic_access_addr = vmcs_read64(APIC_ACCESS_ADDR);
> +	if (enable_ept)
> +		dst->ept_pointer = vmcs_read64(EPT_POINTER);
> +	dst->guest_physical_address = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
> +	dst->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);
> +	dst->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
> +	if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
> +		dst->guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT);
> +	if (enable_ept) {
> +		dst->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
> +		dst->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
> +		dst->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
> +		dst->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
> +	}
> +	dst->pin_based_vm_exec_control = vmcs_read32(PIN_BASED_VM_EXEC_CONTROL);
> +	dst->cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
> +	dst->exception_bitmap = vmcs_read32(EXCEPTION_BITMAP);
> +	dst->page_fault_error_code_mask =
> +		vmcs_read32(PAGE_FAULT_ERROR_CODE_MASK);
> +	dst->page_fault_error_code_match =
> +		vmcs_read32(PAGE_FAULT_ERROR_CODE_MATCH);
> +	dst->cr3_target_count = vmcs_read32(CR3_TARGET_COUNT);
> +	dst->vm_exit_controls = vmcs_read32(VM_EXIT_CONTROLS);
> +	dst->vm_entry_controls = vmcs_read32(VM_ENTRY_CONTROLS);
> +	dst->vm_entry_intr_info_field = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
> +	dst->vm_entry_exception_error_code =
> +		vmcs_read32(VM_ENTRY_EXCEPTION_ERROR_CODE);
> +	dst->vm_entry_instruction_len = vmcs_read32(VM_ENTRY_INSTRUCTION_LEN);
> +	dst->tpr_threshold = vmcs_read32(TPR_THRESHOLD);
> +	dst->secondary_vm_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
> +	if (enable_vpid && dst->secondary_vm_exec_control &
> +	    SECONDARY_EXEC_ENABLE_VPID)
> +		dst->virtual_processor_id = vmcs_read16(VIRTUAL_PROCESSOR_ID);
> +	dst->vm_instruction_error = vmcs_read32(VM_INSTRUCTION_ERROR);
> +	dst->vm_exit_reason  = vmcs_read32(VM_EXIT_REASON);
> +	dst->vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
> +	dst->vm_exit_intr_error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
> +	dst->idt_vectoring_info_field = vmcs_read32(IDT_VECTORING_INFO_FIELD);
> +	dst->idt_vectoring_error_code = vmcs_read32(IDT_VECTORING_ERROR_CODE);
> +	dst->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
> +	dst->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
> +	dst->guest_es_limit = vmcs_read32(GUEST_ES_LIMIT);
> +	dst->guest_cs_limit = vmcs_read32(GUEST_CS_LIMIT);
> +	dst->guest_ss_limit = vmcs_read32(GUEST_SS_LIMIT);
> +	dst->guest_ds_limit = vmcs_read32(GUEST_DS_LIMIT);
> +	dst->guest_fs_limit = vmcs_read32(GUEST_FS_LIMIT);
> +	dst->guest_gs_limit = vmcs_read32(GUEST_GS_LIMIT);
> +	dst->guest_ldtr_limit = vmcs_read32(GUEST_LDTR_LIMIT);
> +	dst->guest_tr_limit = vmcs_read32(GUEST_TR_LIMIT);
> +	dst->guest_gdtr_limit = vmcs_read32(GUEST_GDTR_LIMIT);
> +	dst->guest_idtr_limit = vmcs_read32(GUEST_IDTR_LIMIT);
> +	dst->guest_es_ar_bytes = vmcs_read32(GUEST_ES_AR_BYTES);
> +	dst->guest_cs_ar_bytes = vmcs_read32(GUEST_CS_AR_BYTES);
> +	dst->guest_ss_ar_bytes = vmcs_read32(GUEST_SS_AR_BYTES);
> +	dst->guest_ds_ar_bytes = vmcs_read32(GUEST_DS_AR_BYTES);
> +	dst->guest_fs_ar_bytes = vmcs_read32(GUEST_FS_AR_BYTES);
> +	dst->guest_gs_ar_bytes = vmcs_read32(GUEST_GS_AR_BYTES);
> +	dst->guest_ldtr_ar_bytes = vmcs_read32(GUEST_LDTR_AR_BYTES);
> +	dst->guest_tr_ar_bytes = vmcs_read32(GUEST_TR_AR_BYTES);
> +	dst->guest_interruptibility_info =
> +		vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
> +	dst->guest_activity_state = vmcs_read32(GUEST_ACTIVITY_STATE);
> +	dst->guest_sysenter_cs = vmcs_read32(GUEST_SYSENTER_CS);
> +	dst->host_ia32_sysenter_cs = vmcs_read32(HOST_IA32_SYSENTER_CS);
> +	dst->cr0_guest_host_mask = vmcs_readl(CR0_GUEST_HOST_MASK);
> +	dst->cr4_guest_host_mask = vmcs_readl(CR4_GUEST_HOST_MASK);
> +	dst->cr0_read_shadow = vmcs_readl(CR0_READ_SHADOW);
> +	dst->cr4_read_shadow = vmcs_readl(CR4_READ_SHADOW);
> +	dst->cr3_target_value0 = vmcs_readl(CR3_TARGET_VALUE0);
> +	dst->cr3_target_value1 = vmcs_readl(CR3_TARGET_VALUE1);
> +	dst->cr3_target_value2 = vmcs_readl(CR3_TARGET_VALUE2);
> +	dst->cr3_target_value3 = vmcs_readl(CR3_TARGET_VALUE3);
> +	dst->exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> +	dst->guest_linear_address = vmcs_readl(GUEST_LINEAR_ADDRESS);
> +	dst->guest_cr0 = vmcs_readl(GUEST_CR0);
> +	dst->guest_cr3 = vmcs_readl(GUEST_CR3);
> +	dst->guest_cr4 = vmcs_readl(GUEST_CR4);
> +	dst->guest_es_base = vmcs_readl(GUEST_ES_BASE);
> +	dst->guest_cs_base = vmcs_readl(GUEST_CS_BASE);
> +	dst->guest_ss_base = vmcs_readl(GUEST_SS_BASE);
> +	dst->guest_ds_base = vmcs_readl(GUEST_DS_BASE);
> +	dst->guest_fs_base = vmcs_readl(GUEST_FS_BASE);
> +	dst->guest_gs_base = vmcs_readl(GUEST_GS_BASE);
> +	dst->guest_ldtr_base = vmcs_readl(GUEST_LDTR_BASE);
> +	dst->guest_tr_base = vmcs_readl(GUEST_TR_BASE);
> +	dst->guest_gdtr_base = vmcs_readl(GUEST_GDTR_BASE);
> +	dst->guest_idtr_base = vmcs_readl(GUEST_IDTR_BASE);
> +	dst->guest_dr7 = vmcs_readl(GUEST_DR7);
> +	dst->guest_rsp = vmcs_readl(GUEST_RSP);
> +	dst->guest_rip = vmcs_readl(GUEST_RIP);
> +	dst->guest_rflags = vmcs_readl(GUEST_RFLAGS);
> +	dst->guest_pending_dbg_exceptions =
> +		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
> +	dst->guest_sysenter_esp = vmcs_readl(GUEST_SYSENTER_ESP);
> +	dst->guest_sysenter_eip = vmcs_readl(GUEST_SYSENTER_EIP);
> +	dst->host_cr0 = vmcs_readl(HOST_CR0);
> +	dst->host_cr3 = vmcs_readl(HOST_CR3);
> +	dst->host_cr4 = vmcs_readl(HOST_CR4);
> +	dst->host_fs_base = vmcs_readl(HOST_FS_BASE);
> +	dst->host_gs_base = vmcs_readl(HOST_GS_BASE);
> +	dst->host_tr_base = vmcs_readl(HOST_TR_BASE);
> +	dst->host_gdtr_base = vmcs_readl(HOST_GDTR_BASE);
> +	dst->host_idtr_base = vmcs_readl(HOST_IDTR_BASE);
> +	dst->host_ia32_sysenter_esp = vmcs_readl(HOST_IA32_SYSENTER_ESP);
> +	dst->host_ia32_sysenter_eip = vmcs_readl(HOST_IA32_SYSENTER_EIP);
> +	dst->host_rsp = vmcs_readl(HOST_RSP);
> +	dst->host_rip = vmcs_readl(HOST_RIP);
> +	if (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PAT)
> +		dst->host_ia32_pat = vmcs_read64(HOST_IA32_PAT);
> +}
> +
> +/* prepare_vmcs_02 is called in when the L1 guest hypervisor runs its nested
> + * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
> + * with L0's wishes for its guest (vmsc01), so we can run the L2 guest in a
> + * way that will both be appropriate to L1's requests, and our needs.
> + */
> +int prepare_vmcs_02(struct kvm_vcpu *vcpu,
> +	struct shadow_vmcs *vmcs12, struct shadow_vmcs *vmcs01)
> +{
> +	u32 exec_control;
> +
> +	load_vmcs_common(vmcs12);
> +
> +	vmcs_write64(VMCS_LINK_POINTER, vmcs12->vmcs_link_pointer);
> +	vmcs_write64(IO_BITMAP_A, vmcs01->io_bitmap_a);
> +	vmcs_write64(IO_BITMAP_B, vmcs01->io_bitmap_b);
> +	if (cpu_has_vmx_msr_bitmap())
> +		vmcs_write64(MSR_BITMAP, vmcs01->msr_bitmap);
> +
> +	if (vmcs12->vm_entry_msr_load_count > 0 ||
> +			vmcs12->vm_exit_msr_load_count > 0 ||
> +			vmcs12->vm_exit_msr_store_count > 0) {
> +		printk(KERN_WARNING
> +			"%s: VMCS MSR_{LOAD,STORE} unsupported\n", __func__);
> +	}
> +
> +	if (nested_cpu_has_vmx_tpr_shadow(vcpu)) {
> +		struct page *page =
> +			nested_get_page(vcpu, vmcs12->virtual_apic_page_addr);
> +		if (!page)
> +			return 1;
> +		vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, page_to_phys(page));
> +		kvm_release_page_clean(page);
> +	}
> +
> +	if (nested_vm_need_virtualize_apic_accesses(vcpu)) {
> +		struct page *page =
> +			nested_get_page(vcpu, vmcs12->apic_access_addr);
> +		if (!page)
> +			return 1;
> +		vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page));
> +		kvm_release_page_clean(page);
> +	}
> +
> +	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
> +		     (vmcs01->pin_based_vm_exec_control |
> +		      vmcs12->pin_based_vm_exec_control));
> +	vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK,
> +		     (vmcs01->page_fault_error_code_mask &
> +		      vmcs12->page_fault_error_code_mask));
> +	vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH,
> +		     (vmcs01->page_fault_error_code_match &
> +		      vmcs12->page_fault_error_code_match));
> +
> +	if (cpu_has_secondary_exec_ctrls()) {
> +		u32 exec_control = vmcs01->secondary_vm_exec_control;
> +		if (nested_cpu_has_secondary_exec_ctrls(vcpu)) {
> +			exec_control |= vmcs12->secondary_vm_exec_control;
> +			if (!vm_need_virtualize_apic_accesses(vcpu->kvm) ||
> +			    !nested_vm_need_virtualize_apic_accesses(vcpu))
> +				exec_control &=
> +				~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> +		}
> +		vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
> +	}
> +
> +	load_vmcs_host_state(vmcs01);
> +
> +	if (vm_need_tpr_shadow(vcpu->kvm) &&
> +	    nested_cpu_has_vmx_tpr_shadow(vcpu))
> +		vmcs_write32(TPR_THRESHOLD, vmcs12->tpr_threshold);
> +
> +	if (enable_ept) {
> +		if (!nested_cpu_has_vmx_ept(vcpu)) {
> +			vmcs_write64(EPT_POINTER, vmcs01->ept_pointer);
> +			vmcs_write64(GUEST_PDPTR0, vmcs01->guest_pdptr0);
> +			vmcs_write64(GUEST_PDPTR1, vmcs01->guest_pdptr1);
> +			vmcs_write64(GUEST_PDPTR2, vmcs01->guest_pdptr2);
> +			vmcs_write64(GUEST_PDPTR3, vmcs01->guest_pdptr3);
> +		}
> +	}
> +
> +	exec_control = vmcs01->cpu_based_vm_exec_control;
> +	exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING;
> +	exec_control &= ~CPU_BASED_VIRTUAL_NMI_PENDING;
> +	exec_control &= ~CPU_BASED_TPR_SHADOW;
> +	exec_control |= vmcs12->cpu_based_vm_exec_control;
> +	if (!vm_need_tpr_shadow(vcpu->kvm) ||
> +	    vmcs12->virtual_apic_page_addr == 0) {
> +		exec_control &= ~CPU_BASED_TPR_SHADOW;
> +#ifdef CONFIG_X86_64
> +		exec_control |= CPU_BASED_CR8_STORE_EXITING |
> +			CPU_BASED_CR8_LOAD_EXITING;
> +#endif
> +	} else if (exec_control & CPU_BASED_TPR_SHADOW) {
> +#ifdef CONFIG_X86_64
> +		exec_control &= ~CPU_BASED_CR8_STORE_EXITING;
> +		exec_control &= ~CPU_BASED_CR8_LOAD_EXITING;
> +#endif
> +	}
> +	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control);
> +
> +	/* EXCEPTION_BITMAP and CR0_GUEST_HOST_MASK should basically be the
> +	 * bitwise-or of what L1 wants to trap for L2, and what we want to
> +	 * trap. However, vmx_fpu_activate/deactivate may have happened after
> +	 * we saved vmcs01, so we shouldn't trust its TS and NM_VECTOR bits
> +	 * and need to base them again on fpu_active. Note that CR0.TS also
> +	 * needs updating - we do this after this function returns (in
> +	 * nested_vmx_run).
> +	 */
> +	vmcs_write32(EXCEPTION_BITMAP,
> +		     ((vmcs01->exception_bitmap&~(1u<<NM_VECTOR)) |
> +		      (vcpu->fpu_active ? 0 : (1u<<NM_VECTOR)) |
> +		      vmcs12->exception_bitmap));
> +	vmcs_writel(CR0_GUEST_HOST_MASK, vmcs12->cr0_guest_host_mask |
> +			(vcpu->fpu_active ? 0 : X86_CR0_TS));
> +	vcpu->arch.cr0_guest_owned_bits = ~(vmcs12->cr0_guest_host_mask |
> +			(vcpu->fpu_active ? 0 : X86_CR0_TS));
> +
> +	vmcs_write32(VM_EXIT_CONTROLS,
> +		     (vmcs01->vm_exit_controls &
> +			(~(VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT)))
> +		       | vmcs12->vm_exit_controls);
> +
> +	vmcs_write32(VM_ENTRY_CONTROLS,
> +		     (vmcs01->vm_entry_controls &
> +			(~(VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE)))
> +		      | vmcs12->vm_entry_controls);
> +
> +	vmcs_writel(CR4_GUEST_HOST_MASK,
> +		    (vmcs01->cr4_guest_host_mask  &
> +		     vmcs12->cr4_guest_host_mask));
> +
> +	return 0;
> +}
> +
>  static struct kvm_x86_ops vmx_x86_ops = {
>  	.cpu_has_kvm_support = cpu_has_kvm_support,
>  	.disabled_by_bios = vmx_disabled_by_bios,

--
			Gleb.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 16/24] Implement VMLAUNCH and VMRESUME
  2010-06-13 12:30 ` [PATCH 16/24] Implement VMLAUNCH and VMRESUME Nadav Har'El
  2010-06-14 11:41   ` Avi Kivity
@ 2010-06-17 10:59   ` Gleb Natapov
  2010-09-16 16:06     ` Nadav Har'El
  1 sibling, 1 reply; 147+ messages in thread
From: Gleb Natapov @ 2010-06-17 10:59 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: avi, kvm

On Sun, Jun 13, 2010 at 03:30:46PM +0300, Nadav Har'El wrote:
> Implement the VMLAUNCH and VMRESUME instructions, allowing a guest
> hypervisor to run its own guests.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
> --- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
> @@ -272,6 +272,9 @@ struct __attribute__ ((__packed__)) vmcs
>  	struct shadow_vmcs shadow_vmcs;
>  
>  	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
> +
> +	int cpu;
> +	int launched;
>  };
>  
>  struct vmcs_list {
> @@ -297,6 +300,24 @@ struct nested_vmx {
>  	/* list of real (hardware) VMCS, one for each L2 guest of L1 */
>  	struct list_head l2_vmcs_list; /* a vmcs_list */
>  	int l2_vmcs_num;
> +
> +	/* Are we running a nested guest now */
> +	bool nested_mode;
> +	/* Level 1 state for switching to level 2 and back */
> +	struct  {
> +		u64 efer;
> +		unsigned long cr3;
> +		unsigned long cr4;
> +		u64 io_bitmap_a;
> +		u64 io_bitmap_b;
> +		u64 msr_bitmap;
> +		int cpu;
> +		int launched;
> +	} l1_state;
> +	/* Level 1 shadow vmcs for switching to level 2 and back */
> +	struct shadow_vmcs *l1_shadow_vmcs;
> +	/* Level 1 vmcs loaded into the processor */
> +	struct vmcs *l1_vmcs;
>  };
>  
>  enum vmcs_field_type {
> @@ -1407,6 +1428,19 @@ static void vmx_vcpu_load(struct kvm_vcp
>  			new_offset = vmcs_read64(TSC_OFFSET) + delta;
>  			vmcs_write64(TSC_OFFSET, new_offset);
>  		}
> +
> +		if (vmx->nested.l1_shadow_vmcs != NULL) {
> +			struct shadow_vmcs *l1svmcs =
> +				vmx->nested.l1_shadow_vmcs;
> +			l1svmcs->host_tr_base = vmcs_readl(HOST_TR_BASE);
> +			l1svmcs->host_gdtr_base = vmcs_readl(HOST_GDTR_BASE);
> +			l1svmcs->host_ia32_sysenter_esp =
> +				vmcs_readl(HOST_IA32_SYSENTER_ESP);
> +			if (tsc_this < vcpu->arch.host_tsc)
> +				l1svmcs->tsc_offset = vmcs_read64(TSC_OFFSET);
> +			if (vmx->nested.nested_mode)
> +				load_vmcs_host_state(l1svmcs);
> +		}
>  	}
>  }
>  
> @@ -2301,6 +2335,9 @@ static void free_l1_state(struct kvm_vcp
>  		kfree(list_item);
>  	}
>  	vmx->nested.l2_vmcs_num = 0;
> +
> +	kfree(vmx->nested.l1_shadow_vmcs);
> +	vmx->nested.l1_shadow_vmcs = NULL;
>  }
>  
>  static void free_kvm_area(void)
> @@ -4158,6 +4195,13 @@ static int handle_vmon(struct kvm_vcpu *
>  	INIT_LIST_HEAD(&(vmx->nested.l2_vmcs_list));
>  	vmx->nested.l2_vmcs_num = 0;
>  
> +	vmx->nested.l1_shadow_vmcs = kzalloc(PAGE_SIZE, GFP_KERNEL);
> +	if (!vmx->nested.l1_shadow_vmcs) {
> +		printk(KERN_INFO
> +			"couldn't allocate memory for l1_shadow_vmcs\n");
> +		return -ENOMEM;
> +	}
> +
>  	vmx->nested.vmxon = 1;
>  
>  	skip_emulated_instruction(vcpu);
> @@ -4348,6 +4392,42 @@ static int handle_vmclear(struct kvm_vcp
>  	return 1;
>  }
>  
> +static int nested_vmx_run(struct kvm_vcpu *vcpu);
> +
> +static int handle_launch_or_resume(struct kvm_vcpu *vcpu, bool launch)
> +{
> +	if (!nested_vmx_check_permission(vcpu))
> +		return 1;
> +
> +	if (!nested_map_current(vcpu))
> +		return 1;
> +	if (to_vmx(vcpu)->nested.current_l2_page->launch_state == launch) {
> +		/* Must use VMLAUNCH for the first time, VMRESUME later */
> +		set_rflags_to_vmx_fail_valid(vcpu);
> +		nested_unmap_current(vcpu);
> +		return 1;
> +	}
You should also check for MOV SS blocking. Why did Intel decide that vm
entry should fail in this case? Who knows, but the spec says so.
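The check would be something like (sketch; error number 26 should be
"VM entry with events blocked by MOV SS"):

	if (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
	    GUEST_INTR_STATE_MOV_SS) {
		/* fail with VMfailValid, error 26, and skip the entry */
		set_rflags_to_vmx_fail_valid(vcpu);
		nested_unmap_current(vcpu);
		return 1;
	}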

> +	nested_unmap_current(vcpu);
> +
> +	skip_emulated_instruction(vcpu);
> +
> +	nested_vmx_run(vcpu);
> +	return 1;
> +}
> +
> +/* Emulate the VMLAUNCH instruction */
> +static int handle_vmlaunch(struct kvm_vcpu *vcpu)
> +{
> +	return handle_launch_or_resume(vcpu, true);
> +}
> +
> +/* Emulate the VMRESUME instruction */
> +static int handle_vmresume(struct kvm_vcpu *vcpu)
> +{
> +
> +	return handle_launch_or_resume(vcpu, false);
> +}
> +
>  static inline bool nested_vmcs_read_any(struct kvm_vcpu *vcpu,
>  					unsigned long field, u64 *ret)
>  {
> @@ -4892,11 +4972,11 @@ static int (*kvm_vmx_exit_handlers[])(st
>  	[EXIT_REASON_INVLPG]		      = handle_invlpg,
>  	[EXIT_REASON_VMCALL]                  = handle_vmcall,
>  	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
> -	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
> +	[EXIT_REASON_VMLAUNCH]                = handle_vmlaunch,
>  	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
>  	[EXIT_REASON_VMPTRST]                 = handle_vmptrst,
>  	[EXIT_REASON_VMREAD]                  = handle_vmread,
> -	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
> +	[EXIT_REASON_VMRESUME]                = handle_vmresume,
>  	[EXIT_REASON_VMWRITE]                 = handle_vmwrite,
>  	[EXIT_REASON_VMOFF]                   = handle_vmoff,
>  	[EXIT_REASON_VMON]                    = handle_vmon,
> @@ -4958,7 +5038,8 @@ static int vmx_handle_exit(struct kvm_vc
>  		       "(0x%x) and exit reason is 0x%x\n",
>  		       __func__, vectoring_info, exit_reason);
>  
> -	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked)) {
> +	if (!vmx->nested.nested_mode &&
> +		unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked)) {
>  		if (vmx_interrupt_allowed(vcpu)) {
>  			vmx->soft_vnmi_blocked = 0;
>  		} else if (vmx->vnmi_blocked_time > 1000000000LL &&
> @@ -5771,6 +5852,138 @@ int prepare_vmcs_02(struct kvm_vcpu *vcp
>  	return 0;
>  }
>  
> +static int nested_vmx_run(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	vmx->nested.nested_mode = 1;
> +	sync_cached_regs_to_vmcs(vcpu);
> +	save_vmcs(vmx->nested.l1_shadow_vmcs);
> +
> +	vmx->nested.l1_state.efer = vcpu->arch.efer;
> +	if (!enable_ept)
> +		vmx->nested.l1_state.cr3 = vcpu->arch.cr3;
> +	vmx->nested.l1_state.cr4 = vcpu->arch.cr4;
> +
> +	if (!nested_map_current(vcpu)) {
> +		set_rflags_to_vmx_fail_valid(vcpu);
> +		return 1;
> +	}
> +
> +	if (cpu_has_vmx_msr_bitmap())
> +		vmx->nested.l1_state.msr_bitmap = vmcs_read64(MSR_BITMAP);
> +	else
> +		vmx->nested.l1_state.msr_bitmap = 0;
> +
> +	vmx->nested.l1_state.io_bitmap_a = vmcs_read64(IO_BITMAP_A);
> +	vmx->nested.l1_state.io_bitmap_b = vmcs_read64(IO_BITMAP_B);
> +	vmx->nested.l1_vmcs = vmx->vmcs;
> +	vmx->nested.l1_state.cpu = vcpu->cpu;
> +	vmx->nested.l1_state.launched = vmx->launched;
> +
> +	vmx->vmcs = nested_get_current_vmcs(vcpu);
> +	if (!vmx->vmcs) {
> +		printk(KERN_ERR "Missing VMCS\n");
> +		set_rflags_to_vmx_fail_valid(vcpu);
> +		return 1;
> +	}
> +
> +	vcpu->cpu = vmx->nested.current_l2_page->cpu;
> +	vmx->launched = vmx->nested.current_l2_page->launched;
> +
> +	if (!vmx->nested.current_l2_page->launch_state || !vmx->launched) {
> +		vmcs_clear(vmx->vmcs);
> +		vmx->launched = 0;
> +		vmx->nested.current_l2_page->launch_state = 1;
> +	}
> +
> +	vmx_vcpu_load(vcpu, get_cpu());
> +	put_cpu();
> +
> +	prepare_vmcs_02(vcpu,
> +		get_shadow_vmcs(vcpu), vmx->nested.l1_shadow_vmcs);
> +
> +	if (get_shadow_vmcs(vcpu)->vm_entry_controls &
> +	    VM_ENTRY_IA32E_MODE) {
> +		if (!((vcpu->arch.efer & EFER_LMA) &&
> +		      (vcpu->arch.efer & EFER_LME)))
> +			vcpu->arch.efer |= (EFER_LMA | EFER_LME);
> +	} else {
> +		if ((vcpu->arch.efer & EFER_LMA) ||
> +		    (vcpu->arch.efer & EFER_LME))
> +			vcpu->arch.efer = 0;
> +	}
> +
> +	/* vmx_set_cr0() sets the cr0 that L2 will read, to be the one that L1
> +	 * dictated, and takes appropriate actions for special cr0 bits (like
> +	 * real mode, etc.).
> +	 */
> +	vmx_set_cr0(vcpu,
> +		(get_shadow_vmcs(vcpu)->guest_cr0 &
> +			~get_shadow_vmcs(vcpu)->cr0_guest_host_mask) |
> +		(get_shadow_vmcs(vcpu)->cr0_read_shadow &
> +			get_shadow_vmcs(vcpu)->cr0_guest_host_mask));
> +
> +	/* However, vmx_set_cr0 incorrectly enforces KVM's relationship between
> +	 * GUEST_CR0 and CR0_READ_SHADOW, e.g., that the former is the same as
> +	 * the latter with with TS added if !fpu_active. We need to take the
> +	 * actual GUEST_CR0 that L1 wanted, just with added TS if !fpu_active
> +	 * like KVM wants (for the "lazy fpu" feature, to avoid the costly
> +	 * restoration of fpu registers until the FPU is really used).
> +	 */
> +	vmcs_writel(GUEST_CR0, get_shadow_vmcs(vcpu)->guest_cr0 |
> +		(vcpu->fpu_active ? 0 : X86_CR0_TS));
> +
> +	vmx_set_cr4(vcpu, get_shadow_vmcs(vcpu)->guest_cr4);
> +	vmcs_writel(CR4_READ_SHADOW,
> +		    get_shadow_vmcs(vcpu)->cr4_read_shadow);
> +
> +	/* we have to set the X86_CR0_PG bit of the cached cr0, because
> +	 * kvm_mmu_reset_context enables paging only if X86_CR0_PG is set in
> +	 * CR0 (we need the paging so that KVM treat this guest as a paging
> +	 * guest so we can easly forward page faults to L1.)
> +	 */
> +	vcpu->arch.cr0 |= X86_CR0_PG;
> +
> +	if (enable_ept && !nested_cpu_has_vmx_ept(vcpu)) {
> +		vmcs_write32(GUEST_CR3, get_shadow_vmcs(vcpu)->guest_cr3);
> +		vmx->vcpu.arch.cr3 = get_shadow_vmcs(vcpu)->guest_cr3;
> +	} else {
> +		int r;
> +		kvm_set_cr3(vcpu, get_shadow_vmcs(vcpu)->guest_cr3);
> +		kvm_mmu_reset_context(vcpu);
> +
> +		nested_unmap_current(vcpu);
> +
> +		r = kvm_mmu_load(vcpu);
> +		if (unlikely(r)) {
> +			printk(KERN_ERR "Error in kvm_mmu_load r %d\n", r);
> +			set_rflags_to_vmx_fail_valid(vcpu);
> +			/* switch back to L1 */
> +			vmx->nested.nested_mode = 0;
> +			vmx->vmcs = vmx->nested.l1_vmcs;
> +			vcpu->cpu = vmx->nested.l1_state.cpu;
> +			vmx->launched = vmx->nested.l1_state.launched;
> +
> +			vmx_vcpu_load(vcpu, get_cpu());
> +			put_cpu();
> +
> +			return 1;
> +		}
> +
> +		nested_map_current(vcpu);
> +	}
> +
> +	kvm_register_write(vcpu, VCPU_REGS_RSP,
> +			   get_shadow_vmcs(vcpu)->guest_rsp);
> +	kvm_register_write(vcpu, VCPU_REGS_RIP,
> +			   get_shadow_vmcs(vcpu)->guest_rip);
> +
> +	nested_unmap_current(vcpu);
> +
> +	return 1;
> +}
> +
>  static struct kvm_x86_ops vmx_x86_ops = {
>  	.cpu_has_kvm_support = cpu_has_kvm_support,
>  	.disabled_by_bios = vmx_disabled_by_bios,

--
			Gleb.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 22/24] Correct handling of idt vectoring info
  2010-06-13 12:33 ` [PATCH 22/24] Correct handling of idt vectoring info Nadav Har'El
@ 2010-06-17 11:58   ` Gleb Natapov
  2010-09-20  6:37     ` Nadav Har'El
  0 siblings, 1 reply; 147+ messages in thread
From: Gleb Natapov @ 2010-06-17 11:58 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: avi, kvm

On Sun, Jun 13, 2010 at 03:33:50PM +0300, Nadav Har'El wrote:
> This patch adds correct handling of IDT_VECTORING_INFO_FIELD for the nested
> case.
> 
> When a guest exits while handling an interrupt or exception, we get this
> information in IDT_VECTORING_INFO_FIELD in the VMCS. When L2 exits to L1,
> there's nothing we need to do, because L1 will see this field in vmcs12, and
> handle it itself. However, when L2 exits and L0 handles the exit itself and
> plans to return to L2, L0 must inject this event to L2.
> 
> In the normal non-nested case, the idt_vectoring_info case is treated after
> the exit. However, in the nested case a decision of whether to return to L2
This is not correct. In the normal non-nested case the idt_vectoring_info
is parsed after the exit into a vmx/svm-independent data structure (which
is saved/restored during VM migration). The reinjection happens on the
vmentry path.

> or L1 also happens during the injection phase (see the previous patches), so
> in the nested case we have to treat the idt_vectoring_info right after the
> injection, i.e., in the beginning of vmx_vcpu_run, which is the first time
> we know for sure if we're staying in L2 (i.e., nested_mode is true).
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
> --- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:30.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:30.000000000 +0300
> @@ -320,6 +320,10 @@ struct nested_vmx {
>  	struct vmcs *l1_vmcs;
>  	/* L2 must run next, and mustn't decide to exit to L1. */
>  	bool nested_run_pending;
> +	/* true if last exit was of L2, and had a valid idt_vectoring_info */
> +	bool valid_idt_vectoring_info;
> +	/* These are saved if valid_idt_vectoring_info */
> +	u32 vm_exit_instruction_len, idt_vectoring_error_code;
>  };
>  
>  enum vmcs_field_type {
> @@ -5460,6 +5464,22 @@ static void fixup_rmode_irq(struct vcpu_
>  		| vmx->rmode.irq.vector;
>  }
>  
> +static void nested_handle_valid_idt_vectoring_info(struct vcpu_vmx *vmx)
> +{
> +	int irq  = vmx->idt_vectoring_info & VECTORING_INFO_VECTOR_MASK;
> +	int type = vmx->idt_vectoring_info & VECTORING_INFO_TYPE_MASK;
> +	int errCodeValid = vmx->idt_vectoring_info &
> +		VECTORING_INFO_DELIVER_CODE_MASK;
> +	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
> +		irq | type | INTR_INFO_VALID_MASK | errCodeValid);
> +
> +	vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
> +		vmx->nested.vm_exit_instruction_len);
> +	if (errCodeValid)
> +		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
> +			vmx->nested.idt_vectoring_error_code);
> +}
> +
Why can't you do that using the existing exception/nmi/interrupt queues
that we have? Instead you effectively disable vmx_complete_interrupts() in
patch 18 when in nested mode and add logically the same code in this
patch, i.e. after the exit you save the idt event info into nested_vmx and
reinject it on vm entry.
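From memory, the existing non-nested flow is roughly this (simplified;
error codes and soft interrupts omitted), and the nested case could reuse
it instead of duplicating it:

	/* after the exit, in vmx_complete_interrupts(): */
	u8 vector = vmx->idt_vectoring_info & VECTORING_INFO_VECTOR_MASK;
	switch (vmx->idt_vectoring_info & VECTORING_INFO_TYPE_MASK) {
	case INTR_TYPE_NMI_INTR:
		vmx->vcpu.arch.nmi_injected = true;
		break;
	case INTR_TYPE_HARD_EXCEPTION:
		kvm_requeue_exception(&vmx->vcpu, vector);
		break;
	case INTR_TYPE_EXT_INTR:
		kvm_queue_interrupt(&vmx->vcpu, vector, false);
		break;
	}
	/* ...and on the next entry, inject_pending_event() turns the
	 * queued event back into VM_ENTRY_INTR_INFO_FIELD. */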

>  static inline void sync_cached_regs_to_vmcs(struct kvm_vcpu *vcpu)
>  {
>  	if (test_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_dirty))
> @@ -5481,6 +5501,9 @@ static void vmx_vcpu_run(struct kvm_vcpu
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  
> +	if (vmx->nested.nested_mode && vmx->nested.valid_idt_vectoring_info)
> +		nested_handle_valid_idt_vectoring_info(vmx);
> +
>  	/* Record the guest's net vcpu time for enforced NMI injections. */
>  	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked))
>  		vmx->entry_time = ktime_get();
> @@ -5600,6 +5623,16 @@ static void vmx_vcpu_run(struct kvm_vcpu
>  				  | (1 << VCPU_EXREG_PDPTR));
>  
>  	vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
> +
> +	vmx->nested.valid_idt_vectoring_info = vmx->nested.nested_mode &&
> +		(vmx->idt_vectoring_info & VECTORING_INFO_VALID_MASK);
> +	if (vmx->nested.valid_idt_vectoring_info) {
> +		vmx->nested.vm_exit_instruction_len =
> +			vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
> +		vmx->nested.idt_vectoring_error_code =
> +			vmcs_read32(IDT_VECTORING_ERROR_CODE);
> +	}
> +
>  	if (vmx->rmode.irq.pending)
>  		fixup_rmode_irq(vmx);
>  

--
			Gleb.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1
  2010-06-14  8:33   ` Avi Kivity
  2010-06-14  8:49     ` Nadav Har'El
  2010-06-16 12:24     ` Nadav Har'El
@ 2010-06-22 14:54     ` Nadav Har'El
  2010-06-22 16:53       ` Nadav Har'El
  2010-06-23  7:57       ` Avi Kivity
  2 siblings, 2 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-06-22 14:54 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

Hi,

On Mon, Jun 14, 2010, Avi Kivity wrote about "Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1":
> On 06/13/2010 03:25 PM, Nadav Har'El wrote:
> >+#define VMCS12_REVISION 0x11e57ed0
> >   
> 
> Where did this number come from?  It's not from real hardware, yes?

Since obviously this wasn't self-explanatory, I added the explanation in
a comment.

> >+struct __attribute__ ((__packed__)) vmcs12 {
> >   
> 
> __packed is a convenient define for this.

Thanks, I wasn't aware of this macro. I'm using it now.

> >+	/* According to the Intel spec, a VMCS region must start with the
> >+	 * following two fields. Then follow implementation-specific data.
> >+	 */
> >+	u32 revision_id;
> >+	u32 abort;
> >+};
> >   
> 
> Note that this structure becomes an ABI, it cannot change except in a 
> backward compatible way due to the need for live migration.  So I'd like 
> a documentation patch that adds a description of the content to 
> Documentation/kvm/.  It can be as simple as listing the structure 
> definition.

I still have reservations about documenting the vmcs12 structure, because
it is explicitly not meant to be accessible by guests, who should use the
VMREAD/VMWRITE ABI which *is* documented (in the VMX spec). But since you
prefer it, I can easily copy the structure definition into Documentation/kvm.
I'll add it as a separate patch.

> >+static struct page *nested_get_page(struct kvm_vcpu *vcpu, u64 vmcs_addr)
> >+{
> >+	struct page *vmcs_page =
> >+		gfn_to_page(vcpu->kvm, vmcs_addr>>  PAGE_SHIFT);
> >+
> >+	if (is_error_page(vmcs_page)) {
> >+		printk(KERN_ERR "%s error allocating page 0x%llx\n",
> >+		       __func__, vmcs_addr);
> >   
> 
> Those printks can be used by a malicious guest to spam the host logs.  
> Please wrap them with something that is conditional on a debug flag.
> 
> I'm not sure what we need to do with vmcs that is not in RAM.  It may 
> simplify things to return the error_page to the caller and set 
> KVM_REQ_TRIPLE_FAULT, so we don't have to deal with error handling later on.

This is a very good point. The approach in the patches I sent was to pin
the L1-specified VMCS page and map it every time it was needed, and unpin
and unmap it immediately afterward. This indeed will not work correctly if
called with interrupts disabled and for some reason the vmcs12 page is swapped
out...

I decided to reconsider this whole approach. In the new approach, we pin
and kmap the guest vmcs12 page at VMPTRLD time (when sleeping is fine),
and unpin and unmap it when a different VMPTRLD is done (or VMCLEAR, or
cleanup). Whenever we want to use this vmcs12 in the code - we can do so
without needing to swap in or map anything.

The code now handles the rare situation where the vmcs12 gets swapped
out, and is cleaner (almost 100 lines of ugly nested_map_current()/
nested_unmap_current() calls were eliminated). The only downside I see is that
when nested vmx is being used, a single page, the current vmcs of L1, is
always pinned and kmapped. I believe that pinning and mapping one single page
(no matter how many guests there are) is a small price to pay - we already
spend more than that in other places (e.g., one vmcs02 page per L2 guest).

Does this sound reasonable to you?

> >+	kvm_release_page_dirty(page);
> >   
> 
> Do we always dirty the page?
> 
> I guess it is no big deal even if we don't.

In the previous patch, you're right - it's pretty easy to know in each case
whether we modified the page or not. In the new patches, where the page is
pinned for significantly longer durations, it will always get modified
somehow so kvm_release_page_dirty() will always be the appropriate choice.

Here's the new patch. I will send the changes in vmptrld and all the places
that used to map/unmap the vmcs12 when I send new versions of the following
patches.
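
Just to sketch the direction (not the final code - that will be in the updated
handle_vmptrld patch), the VMPTRLD side will do roughly:

	if (vmx->nested.current_vmptr != -1ull) {
		kunmap(vmx->nested.current_vmcs12_page);
		nested_release_page(vmx->nested.current_vmcs12_page);
	}
	vmx->nested.current_vmptr = guest_vmcs_addr;
	vmx->nested.current_vmcs12_page = nested_get_page(vcpu, guest_vmcs_addr);
	vmx->nested.current_vmcs12 = kmap(vmx->nested.current_vmcs12_page);

plus the appropriate handling of a nested_get_page() failure.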

------
Subject: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1

An implementation of VMX needs to define a VMCS structure. This structure
is kept in guest memory, but is opaque to the guest (who can only read or
write it with VMX instructions).

This patch starts to define the VMCS structure which our nested VMX
implementation will present to L1. We call it "vmcs12", as it is the VMCS
that L1 keeps for its L2 guests.

This patch also adds the notion (as required by the VMX spec) of the "current
VMCS", and finally includes utility functions for mapping the guest-allocated
VMCSs in host memory.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/arch/x86/kvm/vmx.c	2010-06-22 15:57:45.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-06-22 15:57:45.000000000 +0300
@@ -126,6 +126,34 @@ struct shared_msr_entry {
 };
 
 /*
+ * struct vmcs12 describes the state that our guest hypervisor (L1) keeps for a
+ * single nested guest (L2), hence the name vmcs12. Any VMX implementation has
+ * a VMCS structure, and vmcs12 is our emulated VMX's VMCS. This structure is
+ * stored in guest memory specified by VMPTRLD, but is opaque to the guest,
+ * which must access it using VMREAD/VMWRITE/VMCLEAR instructions. More
+ * than one of these structures may exist, if L1 runs multiple L2 guests.
+ * nested_vmx_run() will use the data here to build a VMCS for the underlying
+ * hardware which will be used to run L2.
+ * This structure is packed in order to preserve the binary content after live
+ * migration. If there are changes in the content or layout, VMCS12_REVISION
+ * must be changed.
+ */
+struct __packed vmcs12 {
+	/* According to the Intel spec, a VMCS region must start with the
+	 * following two fields. Then follow implementation-specific data.
+	 */
+	u32 revision_id;
+	u32 abort;
+};
+
+/*
+ * VMCS12_REVISION is an arbitrary id that should be changed if the content or
+ * layout of struct vmcs12 is changed. MSR_IA32_VMX_BASIC returns this id, and
+ * VMPTRLD verifies that the VMCS region that L1 is loading contains this id.
+ */
+#define VMCS12_REVISION 0x11e57ed0
+
+/*
  * The nested_vmx structure is part of vcpu_vmx, and holds information we need
  * for correct emulation of VMX (i.e., nested VMX) on this vcpu. For example,
  * the current VMCS set by L1, a list of the VMCSs used to run the active
@@ -134,6 +162,12 @@ struct shared_msr_entry {
 struct nested_vmx {
 	/* Has the level1 guest done vmxon? */
 	bool vmxon;
+
+	/* The guest-physical address of the current VMCS L1 keeps for L2 */
+	gpa_t current_vmptr;
+	/* The host-usable pointer to the above */
+	struct page *current_vmcs12_page;
+	struct vmcs12 *current_vmcs12;
 };
 
 struct vcpu_vmx {
@@ -197,6 +231,21 @@ static inline struct vcpu_vmx *to_vmx(st
 	return container_of(vcpu, struct vcpu_vmx, vcpu);
 }
 
+static struct page *nested_get_page(struct kvm_vcpu *vcpu, gpa_t addr)
+{
+	struct page *page = gfn_to_page(vcpu->kvm, addr >> PAGE_SHIFT);
+	if (is_error_page(page)) {
+		kvm_release_page_clean(page);
+		return NULL;
+	}
+	return page;
+}
+
+static void nested_release_page(struct page *page)
+{
+	kvm_release_page_dirty(page);
+}
+
 static int init_rmode(struct kvm *kvm);
 static u64 construct_eptp(unsigned long root_hpa);
 static void kvm_cpu_vmxon(u64 addr);
@@ -3464,6 +3513,11 @@ static int handle_vmoff(struct kvm_vcpu 
 
 	to_vmx(vcpu)->nested.vmxon = false;
 
+	if (to_vmx(vcpu)->nested.current_vmptr != -1ull) {
+		kunmap(to_vmx(vcpu)->nested.current_vmcs12_page);
+		nested_release_page(to_vmx(vcpu)->nested.current_vmcs12_page);
+	}
+
 	skip_emulated_instruction(vcpu);
 	return 1;
 }
@@ -4136,6 +4190,10 @@ static void vmx_free_vcpu(struct kvm_vcp
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
 	free_vpid(vmx);
+	if (vmx->nested.vmxon && to_vmx(vcpu)->nested.current_vmptr != -1ull) {
+		kunmap(to_vmx(vcpu)->nested.current_vmcs12_page);
+		nested_release_page(to_vmx(vcpu)->nested.current_vmcs12_page);
+	}
 	vmx_free_vmcs(vcpu);
 	kfree(vmx->guest_msrs);
 	kvm_vcpu_uninit(vcpu);
@@ -4201,6 +4259,9 @@ static struct kvm_vcpu *vmx_create_vcpu(
 			goto free_vmcs;
 	}
 
+	vmx->nested.current_vmptr = -1ull;
+	vmx->nested.current_vmcs12 = NULL;
+
 	return &vmx->vcpu;
 
 free_vmcs:

-- 
Nadav Har'El                        |     Tuesday, Jun 22 2010, 10 Tammuz 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Corduroy pillows - they're making
http://nadav.harel.org.il           |headlines!

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1
  2010-06-22 14:54     ` Nadav Har'El
@ 2010-06-22 16:53       ` Nadav Har'El
  2010-06-23  8:07         ` Avi Kivity
  2010-06-23  7:57       ` Avi Kivity
  1 sibling, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-06-22 16:53 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

On Tue, Jun 22, 2010, Nadav Har'El wrote about "Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1":
> > Note that this structure becomes an ABI, it cannot change except in a 
> > backward compatible way due to the need for live migration.  So I'd like 
> > a documentation patch that adds a description of the content to 
> > Documentation/kvm/.  It can be as simple as listing the structure 
> > definition.

I decided that if I add a file in Documentation/kvm, it would be very useful
for it to describe the nested vmx feature in general, in addition to the
structure you asked to have documented. So here is the new patch I propose:


----
Subject: [PATCH 25/25] Documentation

This patch includes a brief introduction to the nested vmx feature in the
Documentation/kvm directory. The document also includes a copy of the
vmcs12 structure, as requested by Avi Kivity.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
--- .before/Documentation/kvm/nested-vmx.txt	2010-06-22 19:50:32.000000000 +0300
+++ .after/Documentation/kvm/nested-vmx.txt	2010-06-22 19:50:32.000000000 +0300
@@ -0,0 +1,233 @@
+Nested VMX
+==========
+
+Overview
+---------
+
+On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
+to easily and efficiently run guest operating systems. Normally, these guests
+*cannot* themselves be hypervisors running their own guests, because in VMX,
+guests cannot use VMX instructions.
+
+The "Nested VMX" feature adds this missing capability - of running guest
+hypervisors (which use VMX) with their own nested guests. It does so by
+allowing a guest to use VMX instructions, and correctly and efficiently
+emulating them using the single level of VMX available in the hardware.
+
+We describe in much greater detail the theory behind the nested VMX feature,
+its implementation and its performance characteristics, in IBM Research report
+H-0282, "The Turtles Project: Design and Implementation of Nested
+Virtualization", available at:
+
+        http://bit.ly/a0o9te
+
+
+Terminology
+-----------
+
+Single-level virtualization has two levels - the host (KVM) and the guests.
+In nested virtualization, we have three levels: The host (KVM), which we call
+L0, the guest hypervisor, which we call L1, and the nested guest, which we
+call L2.
+
+
+Known limitations
+-----------------
+
+The current code supports running Linux under a nested KVM using shadow
+page tables (with bypass_guest_pf disabled). It supports multiple nested
+hypervisors, which can run multiple guests. Only 64-bit nested hypervisors
+are supported. SMP is supported. Additional patches for running Windows under
+nested KVM, Linux under nested VMware server, and support for nested EPT,
+are currently running in the lab, and will be sent as follow-on patchsets.
+
+
+Running nested VMX
+------------------
+
+The nested VMX feature is disabled by default. It can be enabled by giving
+the "nested=1" option to the kvm-intel module.
+
+
+ABIs
+----
+
+Nested VMX aims to present a standard and (eventually) fully-functional VMX
+implementation for a guest hypervisor to use. As such, the official
+specification of the ABI that it provides is Intel's VMX specification,
+namely volume 3B of their "Intel 64 and IA-32 Architectures Software
+Developer's Manual". Not all of VMX's features are currently fully supported,
+but the goal is to eventually support them all, starting with the VMX features
+which are used in practice by popular hypervisors (KVM and others).
+
+As a VMX implementation, nested VMX presents a VMCS structure to L1.
+As mandated by the spec, other than the two fields revision_id and abort,
+this structure is *opaque* to its user, who is not supposed to know or care
+about its internal structure. Rather, the structure is accessed through the
+VMREAD and VMWRITE instructions.
+Still, for debugging purposes, KVM developers might be interested in the
+internals of this structure; this is struct vmcs12 from arch/x86/kvm/vmx.c.
+For convenience, we repeat its content here. If the internals of this structure
+change, live migration across KVM versions can break. VMCS12_REVISION
+(from vmx.c) should be changed if struct vmcs12 or its inner struct shadow_vmcs
+is ever changed.
+
+struct __packed vmcs12 {
+	/* According to the Intel spec, a VMCS region must start with the
+	 * following two fields. Then follow implementation-specific data.
+	 */
+	u32 revision_id;
+	u32 abort;
+
+	struct shadow_vmcs shadow_vmcs;
+
+	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
+
+	int cpu;
+	int launched;
+};
+
+struct __packed shadow_vmcs {
+	u16 virtual_processor_id;
+	u16 guest_es_selector;
+	u16 guest_cs_selector;
+	u16 guest_ss_selector;
+	u16 guest_ds_selector;
+	u16 guest_fs_selector;
+	u16 guest_gs_selector;
+	u16 guest_ldtr_selector;
+	u16 guest_tr_selector;
+	u16 host_es_selector;
+	u16 host_cs_selector;
+	u16 host_ss_selector;
+	u16 host_ds_selector;
+	u16 host_fs_selector;
+	u16 host_gs_selector;
+	u16 host_tr_selector;
+	u64 io_bitmap_a;
+	u64 io_bitmap_b;
+	u64 msr_bitmap;
+	u64 vm_exit_msr_store_addr;
+	u64 vm_exit_msr_load_addr;
+	u64 vm_entry_msr_load_addr;
+	u64 tsc_offset;
+	u64 virtual_apic_page_addr;
+	u64 apic_access_addr;
+	u64 ept_pointer;
+	u64 guest_physical_address;
+	u64 vmcs_link_pointer;
+	u64 guest_ia32_debugctl;
+	u64 guest_ia32_pat;
+	u64 guest_pdptr0;
+	u64 guest_pdptr1;
+	u64 guest_pdptr2;
+	u64 guest_pdptr3;
+	u64 host_ia32_pat;
+	u32 pin_based_vm_exec_control;
+	u32 cpu_based_vm_exec_control;
+	u32 exception_bitmap;
+	u32 page_fault_error_code_mask;
+	u32 page_fault_error_code_match;
+	u32 cr3_target_count;
+	u32 vm_exit_controls;
+	u32 vm_exit_msr_store_count;
+	u32 vm_exit_msr_load_count;
+	u32 vm_entry_controls;
+	u32 vm_entry_msr_load_count;
+	u32 vm_entry_intr_info_field;
+	u32 vm_entry_exception_error_code;
+	u32 vm_entry_instruction_len;
+	u32 tpr_threshold;
+	u32 secondary_vm_exec_control;
+	u32 vm_instruction_error;
+	u32 vm_exit_reason;
+	u32 vm_exit_intr_info;
+	u32 vm_exit_intr_error_code;
+	u32 idt_vectoring_info_field;
+	u32 idt_vectoring_error_code;
+	u32 vm_exit_instruction_len;
+	u32 vmx_instruction_info;
+	u32 guest_es_limit;
+	u32 guest_cs_limit;
+	u32 guest_ss_limit;
+	u32 guest_ds_limit;
+	u32 guest_fs_limit;
+	u32 guest_gs_limit;
+	u32 guest_ldtr_limit;
+	u32 guest_tr_limit;
+	u32 guest_gdtr_limit;
+	u32 guest_idtr_limit;
+	u32 guest_es_ar_bytes;
+	u32 guest_cs_ar_bytes;
+	u32 guest_ss_ar_bytes;
+	u32 guest_ds_ar_bytes;
+	u32 guest_fs_ar_bytes;
+	u32 guest_gs_ar_bytes;
+	u32 guest_ldtr_ar_bytes;
+	u32 guest_tr_ar_bytes;
+	u32 guest_interruptibility_info;
+	u32 guest_activity_state;
+	u32 guest_sysenter_cs;
+	u32 host_ia32_sysenter_cs;
+	unsigned long cr0_guest_host_mask;
+	unsigned long cr4_guest_host_mask;
+	unsigned long cr0_read_shadow;
+	unsigned long cr4_read_shadow;
+	unsigned long cr3_target_value0;
+	unsigned long cr3_target_value1;
+	unsigned long cr3_target_value2;
+	unsigned long cr3_target_value3;
+	unsigned long exit_qualification;
+	unsigned long guest_linear_address;
+	unsigned long guest_cr0;
+	unsigned long guest_cr3;
+	unsigned long guest_cr4;
+	unsigned long guest_es_base;
+	unsigned long guest_cs_base;
+	unsigned long guest_ss_base;
+	unsigned long guest_ds_base;
+	unsigned long guest_fs_base;
+	unsigned long guest_gs_base;
+	unsigned long guest_ldtr_base;
+	unsigned long guest_tr_base;
+	unsigned long guest_gdtr_base;
+	unsigned long guest_idtr_base;
+	unsigned long guest_dr7;
+	unsigned long guest_rsp;
+	unsigned long guest_rip;
+	unsigned long guest_rflags;
+	unsigned long guest_pending_dbg_exceptions;
+	unsigned long guest_sysenter_esp;
+	unsigned long guest_sysenter_eip;
+	unsigned long host_cr0;
+	unsigned long host_cr3;
+	unsigned long host_cr4;
+	unsigned long host_fs_base;
+	unsigned long host_gs_base;
+	unsigned long host_tr_base;
+	unsigned long host_gdtr_base;
+	unsigned long host_idtr_base;
+	unsigned long host_ia32_sysenter_esp;
+	unsigned long host_ia32_sysenter_eip;
+	unsigned long host_rsp;
+	unsigned long host_rip;
+};
+
+
+Authors
+-------
+
+These patches were written by:
+     Abel Gordon, abelg <at> il.ibm.com
+     Nadav Har'El, nyh <at> il.ibm.com
+     Orit Wasserman, oritw <at> il.ibm.com
+     Ben-Ami Yassor, benami <at> il.ibm.com
+     Muli Ben-Yehuda, muli <at> il.ibm.com
+
+With contributions by:
+     Anthony Liguori, aliguori <at> us.ibm.com
+     Mike Day, mdday <at> us.ibm.com
+
+And valuable reviews by:
+     Avi Kivity, avi <at> redhat.com
+     Gleb Natapov, gleb <at> redhat.com

-- 
Nadav Har'El                        |     Tuesday, Jun 22 2010, 11 Tammuz 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Jury: Twelve people who determine which
http://nadav.harel.org.il           |client has the better lawyer.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1
  2010-06-22 14:54     ` Nadav Har'El
  2010-06-22 16:53       ` Nadav Har'El
@ 2010-06-23  7:57       ` Avi Kivity
  2010-06-23  9:15         ` Alexander Graf
  2010-06-23 12:07         ` Nadav Har'El
  1 sibling, 2 replies; 147+ messages in thread
From: Avi Kivity @ 2010-06-23  7:57 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, Alexander Graf, Joerg Roedel

On 06/22/2010 05:54 PM, Nadav Har'El wrote:
>
>> I'm not sure what we need to do with vmcs that is not in RAM.  It may
>> simplify things to return the error_page to the caller and set
>> KVM_REQ_TRIPLE_FAULT, so we don't have to deal with error handling later on.
>>      
> This is a very good point. The approach in the patches I sent was to pin
> the L1-specified VMCS page and map it every time it was needed, and unpin
> and unmap it immediately afterward. This indeed will not work correctly if
> called with interrupts disabled and for some reason the vmcs12 page is swapped
> out...
>
> I decided to reconsider this whole approach. In the new approach, we pin
> and kmap the guest vmcs12 page at VMPTRLD time (when sleeping is fine),
> and unpin and unmap it when a different VMPTRLD is done (or VMCLEAR, or
> cleanup). Whenever we want to use this vmcs12 in the code - we can do so
> without needing to swap in or map anything.
>
> The code now handles the rare situation where the vmcs12 gets swapped
> out, and is cleaner (almost 100 lines of ugly nested_map_current()/
> nested_unmap_current() calls were eliminated). The only downside I see is that
> when nested vmx is being used, a single page, the current vmcs of L1, is
> always pinned and kmapped. I believe that pinning and mapping one single page
> (no matter how many guests there are) is a small price to pay - we already
> spend more than that in other places (e.g., one vmcs02 page per L2 guest).
>
> Does this sound reasonable to you?
>    

kmap() really should be avoided when possible.  It is for when we don't 
have a pte pointing to the page (for example, accessing a user page from 
outside its process context).

Really, the correct API is kvm_read_guest() and kvm_write_guest().  They 
can easily be wrapped in with something that takes a vmcs12 field and 
automatically references the vmptr:


   kvm_set_cr0(vcpu, gvmcs_read64(vcpu, guest_cr0));

This will take care of mark_page_dirty() etc.
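
E.g., roughly (just a sketch, not a proposal for the exact API - the field
offset into struct vmcs12 would come from offsetof() or a field enum):

   static u64 gvmcs_read64(struct kvm_vcpu *vcpu, unsigned long offset)
   {
       u64 val = 0;
       gpa_t gpa = to_vmx(vcpu)->nested.current_vmptr + offset;

       if (kvm_read_guest(vcpu->kvm, gpa, &val, sizeof(val)))
           kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
       return val;
   }

with gvmcs_write64() doing the same via kvm_write_guest().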

kvm_*_guest() is a slow API since it needs to search the memory slot 
list but that can be optimized easily by caching the memory slot (and 
invalidating the cache when memory mapping changes).  The optimized APIs 
can end up doing a pointer fetch and a get/put_user(), which is very 
efficient.

Now the only problem is access from atomic contexts, as far as I can 
tell there are two areas where this is needed:

1) interrupt injection
2) optimized VMREAD/VMWRITE emulation

There are other reasons to move interrupt injection out of atomic 
context.  If that's the only thing in the way of using 
kvm_read/write_guest(), I'll be happy to prioritize that work.

Alex, Joerg, will gvmcb_{read,write}{32,64}() work for nsvm?  All that 
kmapping is incredibly annoying.

I guess it's fine to start with the kmap() based implementation and 
change it later.

Note: you need to kunmap() and kvm_release_page_dirty() not only on 
vmxoff and vmptrld/vmclear, but also when ioctl(KVM_VCPU_RUN) exits to 
userspace.  That ensures that live migration sees the page dirtied and 
is able to copy it.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1
  2010-06-22 16:53       ` Nadav Har'El
@ 2010-06-23  8:07         ` Avi Kivity
  2010-08-08 15:09           ` Nadav Har'El
  0 siblings, 1 reply; 147+ messages in thread
From: Avi Kivity @ 2010-06-23  8:07 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

On 06/22/2010 07:53 PM, Nadav Har'El wrote:
> On Tue, Jun 22, 2010, Nadav Har'El wrote about "Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1":
>    
>>> Note that this structure becomes an ABI, it cannot change except in a
>>> backward compatible way due to the need for live migration.  So I'd like
>>> a documentation patch that adds a description of the content to
>>> Documentation/kvm/.  It can be as simple as listing the structure
>>> definition.
>>>        
> I decided that if I add a file in Documentation/kvm, it would be very useful
> for it to describe the nested vmx feature in general, in addition to the
> structure that you asked documented. So here is the new patch I propose:
>
>
>    

Great, that's always helpful.

> ----
> Subject: [PATCH 25/25] Documentation
>
> This patch includes a brief introduction to the nested vmx feature in the
> Documentation/kvm directory. The document also includes a copy of the
> vmcs12 structure, as requested by Avi Kivity.
>
>
> +
> +We describe in much greater detail the theory behind the nested VMX feature,
> +its implementation and its performance characteristics, in IBM Research report
> +H-0282, "The Turtles Project: Design and Implementation of Nested
> +Virtualization", available at:
> +
> +        http://bit.ly/a0o9te
>    

Please put the true url in here.

> +
> +Known limitations
> +-----------------
> +
> +The current code supports running Linux under a nested KVM using shadow
> +page tables (with bypass_guest_pf disabled).

Might as well remove this, since nvmx will not be merged with such a 
gaping hole.

In theory I ought to reject anything that doesn't comply with the spec.  
In practice I'll accept deviations from the spec, so long as

- those features aren't used by common guests
- when the features are attempted to be used, kvm will issue a warning

I don't think PFEC matching ought to present any implementation difficulty.
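
Roughly, the decision of whether a #PF in L2 should be reflected to L1 is just
(a sketch, reusing the get_shadow_vmcs() accessor from the patches):

   struct shadow_vmcs *svmcs = get_shadow_vmcs(vcpu);
   bool bit14 = svmcs->exception_bitmap & (1u << PF_VECTOR);
   bool match = (error_code & svmcs->page_fault_error_code_mask) ==
            svmcs->page_fault_error_code_match;

   /* per the spec, #PF goes to L1 iff bit 14 of the bitmap equals the match */
   bool reflect_to_l1 = (bit14 == match);

so it is a couple of lines in the exit-reflection path.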

> +ABIs
> +----
> +
> +Nested VMX aims to present a standard and (eventually) fully-functional VMX
> +implementation for a guest hypervisor to use. As such, the official
> +specification of the ABI that it provides is Intel's VMX specification,
> +namely volume 3B of their "Intel 64 and IA-32 Architectures Software
> +Developer's Manual". Not all of VMX's features are currently fully supported,
> +but the goal is to eventually support them all, starting with the VMX features
> +which are used in practice by popular hypervisors (KVM and others).
> +
> +As a VMX implementation, nested VMX presents a VMCS structure to L1.
> +As mandated by the spec, other than the two fields revision_id and abort,
> +this structure is *opaque* to its user, who is not supposed to know or care
> +about its internal structure. Rather, the structure is accessed through the
> +VMREAD and VMWRITE instructions.
> +Still, for debugging purposes, KVM developers might be interested in the
> +internals of this structure; this is struct vmcs12 from arch/x86/kvm/vmx.c.
> +For convenience, we repeat its content here. If the internals of this structure
> +change, live migration across KVM versions can break. VMCS12_REVISION
> +(from vmx.c) should be changed if struct vmcs12 or its inner struct shadow_vmcs
> +is ever changed.
>    

This is indeed great for debugging, we can add qemu commands to inspect 
a vmcs (and add the vmptr to 'info registers').

> +
> +struct __packed vmcs12 {
> +	/* According to the Intel spec, a VMCS region must start with the
> +	 * following two fields. Then follow implementation-specific data.
> +	 */
> +	u32 revision_id;
> +	u32 abort;
> +
> +	struct shadow_vmcs shadow_vmcs;
> +
> +	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
> +
> +	int cpu;
> +	int launched;
>    

Why is cpu needed?  In what way is launched != launch_state?

Please use explicitly sized types.

> +struct __packed shadow_vmcs {
>
>
> +	u32 host_ia32_sysenter_cs;
>    

u32 pad;

> +	unsigned long cr0_guest_host_mask;
> +	unsigned long cr4_guest_host_mask;
> +	unsigned long cr0_read_shadow;
> +	unsigned long cr4_read_shadow;
> +	unsigned long cr3_target_value0;
> +	unsigned long cr3_target_value1;
> +	unsigned long cr3_target_value2;
> +	unsigned long cr3_target_value3;
> +	unsigned long exit_qualification;
> +	unsigned long guest_linear_address;
> +	unsigned long guest_cr0;
> +	unsigned long guest_cr3;
> +	unsigned long guest_cr4;
> +	unsigned long guest_es_base;
> +	unsigned long guest_cs_base;
> +	unsigned long guest_ss_base;
> +	unsigned long guest_ds_base;
> +	unsigned long guest_fs_base;
> +	unsigned long guest_gs_base;
> +	unsigned long guest_ldtr_base;
> +	unsigned long guest_tr_base;
> +	unsigned long guest_gdtr_base;
> +	unsigned long guest_idtr_base;
> +	unsigned long guest_dr7;
> +	unsigned long guest_rsp;
> +	unsigned long guest_rip;
> +	unsigned long guest_rflags;
> +	unsigned long guest_pending_dbg_exceptions;
> +	unsigned long guest_sysenter_esp;
> +	unsigned long guest_sysenter_eip;
> +	unsigned long host_cr0;
> +	unsigned long host_cr3;
> +	unsigned long host_cr4;
> +	unsigned long host_fs_base;
> +	unsigned long host_gs_base;
> +	unsigned long host_tr_base;
> +	unsigned long host_gdtr_base;
> +	unsigned long host_idtr_base;
> +	unsigned long host_ia32_sysenter_esp;
> +	unsigned long host_ia32_sysenter_eip;
> +	unsigned long host_rsp;
> +	unsigned long host_rip;
>    

Use u64 instead of unsigned long, otherwise the size changes during live 
migration from a 32-bit host to a 64-bit host.

Reserve tons of space here.
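
I.e., something along these lines (illustrative only):

   u64 guest_cr0;
   u64 guest_cr3;
   u64 guest_cr4;
   ...
   u64 reserved[64];	/* room for future fields without breaking the layout */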

> +};
> +
> +
>    

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 6/24] Implement reading and writing of VMX MSRs
  2010-06-14  8:42   ` Avi Kivity
@ 2010-06-23  8:13     ` Nadav Har'El
  2010-06-23  8:24       ` Avi Kivity
  0 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-06-23  8:13 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

On Mon, Jun 14, 2010, Avi Kivity wrote about "Re: [PATCH 6/24] Implement reading and writing of VMX MSRs":
> On 06/13/2010 03:25 PM, Nadav Har'El wrote:
> >When the guest can use VMX instructions (when the "nested" module option is
> >on), it should also be able to read and write VMX MSRs, e.g., to query 
> >about
> >VMX capabilities. This patch adds this support.
>...
> >@@ -702,7 +702,11 @@ static u32 msrs_to_save[] = {
>...
> >+	MSR_IA32_FEATURE_CONTROL,  MSR_IA32_VMX_BASIC,
>...
> These are read only from the guest point of view, but we need write 
> support from the host to allow for tuning the features exposed to the guest.

Hi,

I'm afraid I did not understand what you meant.
There is a KVM_SET_MSRS ioctl, but it appears to do the same thing that a
guest's WRMSR would do, i.e., eventually call nested_vmx_set_msr().
In some of these MSRs (like VMX_BASIC) there's no point, if I understand
correctly, for the guest or the host to change anything (e.g., what would it
mean to change the VMCS length or revision id?). In others like
VMX_PROCBASED_CTLS I guess that we can allow it to be set if it doesn't try
to enable something not supported - is that what you had in mind? Or did you
mean something else?

Thanks,
Nadav.

-- 
Nadav Har'El                        |   Wednesday, Jun 23 2010, 11 Tammuz 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |AAAAA: the American Association for the
http://nadav.harel.org.il           |Abolition of Abbreviations and Acronyms

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 6/24] Implement reading and writing of VMX MSRs
  2010-06-23  8:13     ` Nadav Har'El
@ 2010-06-23  8:24       ` Avi Kivity
  0 siblings, 0 replies; 147+ messages in thread
From: Avi Kivity @ 2010-06-23  8:24 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

On 06/23/2010 11:13 AM, Nadav Har'El wrote:
> On Mon, Jun 14, 2010, Avi Kivity wrote about "Re: [PATCH 6/24] Implement reading and writing of VMX MSRs":
>    
>> On 06/13/2010 03:25 PM, Nadav Har'El wrote:
>>      
>>> When the guest can use VMX instructions (when the "nested" module option is
>>> on), it should also be able to read and write VMX MSRs, e.g., to query
>>> about
>>> VMX capabilities. This patch adds this support.
>>>        
>> ...
>>      
>>> @@ -702,7 +702,11 @@ static u32 msrs_to_save[] = {
>>>        
>> ...
>>      
>>> +	MSR_IA32_FEATURE_CONTROL,  MSR_IA32_VMX_BASIC,
>>>        
>> ...
>> These are read only from the guest point of view, but we need write
>> support from the host to allow for tuning the features exposed to the guest.
>>      
> Hi,
>
> I'm afraid I did not understand what you meant.
> There is a KVM_SET_MSRS ioctl, but it appears to do the same thing that a
> guest's WRMSR would do, i.e., eventually call nested_vmx_set_msr().
> In some of these MSRs (like VMX_BASIC) there's no point, if I understand
> correctly, for the guest or the host to change anything (e.g., what would it
> mean to change the VMCS length or revision id?). In others like
> VMX_PROCBASED_CTLS I guess that we can allow it to be set if it doesn't try
> to enable something not supported - is that what you had in mind? Or did you
> mean something else?
>    

I meant allowing host userspace to change the capability MSRs, such as 
VMX_PROCBASED_CTLS.  For example, if live migrating from a version of 
kvm that does not support EPT to a version that does, we want to 
downgrade the reported capabilities so the guest sees exactly the same 
capabilities.

It's the same thing we do with cpuid.
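
E.g., from userspace (a sketch only; error handling omitted, and
downgraded_procbased_ctls stands for whatever value the management tool
decides to expose):

   struct {
       struct kvm_msrs info;
       struct kvm_msr_entry entry;
   } msrs = {
       .info.nmsrs  = 1,
       .entry.index = MSR_IA32_VMX_PROCBASED_CTLS,
       .entry.data  = downgraded_procbased_ctls,
   };

   ioctl(vcpu_fd, KVM_SET_MSRS, &msrs);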

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1
  2010-06-23  7:57       ` Avi Kivity
@ 2010-06-23  9:15         ` Alexander Graf
  2010-06-23  9:24           ` Avi Kivity
  2010-06-23 12:07         ` Nadav Har'El
  1 sibling, 1 reply; 147+ messages in thread
From: Alexander Graf @ 2010-06-23  9:15 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Nadav Har'El, kvm, Joerg Roedel


On 23.06.2010, at 09:57, Avi Kivity wrote:

> On 06/22/2010 05:54 PM, Nadav Har'El wrote:
>> 
>>> I'm not sure what we need to do with vmcs that is not in RAM.  It may
>>> simplify things to return the error_page to the caller and set
>>> KVM_REQ_TRIPLE_FAULT, so we don't have to deal with error handling later on.
>>>     
>> This is a very good point. The approach in the patches I sent was to pin
>> the L1-specified VMCS page and map it every time it was needed, and unpin
>> and unmap it immediately afterward. This indeed will not work correctly if
>> called with interrupts disabled and for some reason the vmcs12 page is swapped
>> out...
>> 
>> I decided to reconsider this whole approach. In the new approach, we pin
>> and kmap the guest vmcs12 page at VMPTRLD time (when sleeping is fine),
>> and unpin and unmap it when a different VMPTRLD is done (or VMCLEAR, or
>> cleanup). Whenever we want to use this vmcs12 in the code - we can do so
>> without needing to swap in or map anything.
>> 
>> The code now handles the rare situation where the vmcs12 gets swapped
>> out, and is cleaner (almost 100 lines of ugly nested_map_current()/
>> nested_unmap_current() calls were eliminated). The only downside I see is that
>> when nested vmx is being used, a single page, the current vmcs of L1, is
>> always pinned and kmapped. I believe that pinning and mapping one single page
>> (no matter how many guests there are) is a small price to pay - we already
>> spend more than that in other places (e.g., one vmcs02 page per L2 guest).
>> 
>> Does this sound reasonable to you?
>>   
> 
> kmap() really should be avoided when possible.  It is for when we don't have a pte pointing to the page (for example, accessing a user page from outside its process context).
> 
> Really, the correct API is kvm_read_guest() and kvm_write_guest().  They can easily be wrapped in with something that takes a vmcs12 field and automatically references the vmptr:
> 
> 
>  kvm_set_cr0(vcpu, gvmcs_read64(vcpu, guest_cr0));
> 
> This will take care of mark_page_dirty() etc.
> 
> kvm_*_guest() is a slow API since it needs to search the memory slot list but that can be optimized easily by caching the memory slot (and invalidating the cache when memory mapping changes).  The optimized APIs can end up as doing a pointer fetch and a get/put_user(), which is very efficient.
> 
> Now the only problem is access from atomic contexts, as far as I can tell there are two areas where this is needed:
> 
> 1) interrupt injection
> 2) optimized VMREAD/VMWRITE emulation
> 
> There are other reasons to move interrupt injection out of atomic context.  If that's the only thing in the way of using kvm_read/write_guest(), I'll be happy to prioritize that work.
> 
> Alex, Joerg, will gvmcb_{read,write}{32,64}() work for nsvm?  All that kmapping is incredibly annoying.

I'm sceptical that we can actually get that to be as fast as a direct kmap, but apart from that I don't see an obvious reason why it wouldn't.


Alex


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1
  2010-06-23  9:15         ` Alexander Graf
@ 2010-06-23  9:24           ` Avi Kivity
  0 siblings, 0 replies; 147+ messages in thread
From: Avi Kivity @ 2010-06-23  9:24 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Nadav Har'El, kvm, Joerg Roedel

On 06/23/2010 12:15 PM, Alexander Graf wrote:
>
>> Alex, Joerg, will gvmcb_{read,write}{32,64}() work for nsvm?  All that kmapping is incredibly annoying.
>>      
> I'm sceptical that we can actually get that to be as fast as a direct kmap, but apart from that I don't see an obvious reason why it wouldn't.
>    

I'm thinking of

struct kvm_cached_guest_page {
     void __user **pmem;
};

static u64 kvm_read_cached_guest_page(struct kvm_vcpu *vcpu, struct 
kvm_cached_guest_page *kcgp, unsigned offset)
{
     u64 __user *mem = *rcu_dereference(kcgp->pmem) + offset;
     u64 ret;

     if (get_user(ret, mem))
         kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
     return ret;
}

We could even use our own exception handler to take the error case out 
of line.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1
  2010-06-23  7:57       ` Avi Kivity
  2010-06-23  9:15         ` Alexander Graf
@ 2010-06-23 12:07         ` Nadav Har'El
  2010-06-23 12:13           ` Avi Kivity
  1 sibling, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-06-23 12:07 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Alexander Graf, Joerg Roedel

On Wed, Jun 23, 2010, Avi Kivity wrote about "Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1":
> kmap() really should be avoided when possible.  It is for when we don't 
> have a pte pointing to the page (for example, accessing a user page from 
> outside its process context).

I'm afraid I do not follow. kmap() is needed for when you have 32-bit
pointers but more than 4GB of RAM. Why do we need to avoid using it?
I understand why one would like to avoid leaving many kmapped pages for
long durations, but here we're talking about just one page (per vcpu).

> Really, the correct API is kvm_read_guest() and kvm_write_guest().  They 
> can easily be wrapped in with something that takes a vmcs12 field and 
> automatically references the vmptr:
> 
>   kvm_set_cr0(vcpu, gvmcs_read64(vcpu, guest_cr0));

But as you also said, this doesn't solve the original problem you found
(what happens when the vmcs12 page is swapped out and is needed when
sleeping is not allowed), unless we make changes to the injection logic,
as you analyzed. I agree that we should consider it for a future fix.

If I understand the nested-SVM code correctly, they took a similar approach to
mine - except they pin and unpin the page on every entry and exit, instead of
on VMPTRLD (SVM doesn't have a notion of one current vmcb). But they still
don't (if I understand correctly) call any special function on every access.

> I guess it's fine to start with the kmap() based implementation and 
> change it later.

Great, thanks.

> Note: you need to kunmap() and kvm_release_page_dirty() not only on 
> vmxoff and vmptrld/vmclear, but also when ioctl(KVM_VCPU_RUN) exits to 
> userspace.  That ensures that live migration sees the page dirtied and 
> is able to copy it.

Thanks. I'll look into this and fix this.


-- 
Nadav Har'El                        |   Wednesday, Jun 23 2010, 11 Tammuz 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |The space between my ears was
http://nadav.harel.org.il           |intentionally left blank.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1
  2010-06-23 12:07         ` Nadav Har'El
@ 2010-06-23 12:13           ` Avi Kivity
  0 siblings, 0 replies; 147+ messages in thread
From: Avi Kivity @ 2010-06-23 12:13 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, Alexander Graf, Joerg Roedel

On 06/23/2010 03:07 PM, Nadav Har'El wrote:
> On Wed, Jun 23, 2010, Avi Kivity wrote about "Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1":
>    
>> kmap() really should be avoided when possible.  It is for when we don't
>> have a pte pointing to the page (for example, accessing a user page from
>> outside its process context).
>>      
> I'm afraid I do not follow. kmap() is needed for when you have 32-bit
> pointers but more than 4GB of RAM.

No, it's needed when you don't have a pte to a page.  You always have 
ptes to userspace pages when in process context, even if those pages are 
above the 900MB normally addressable by the kernel on i386.

> Why do we need to avoid using it?
> I understand why one would like to avoid leaving many kmapped pages for
> long durations, but here we're talking about just one page (per vcpu).
>    

It prevents neat stuff like ksm, transparent hugepages, page migration, 
swapping.  One page per vcpu is not a lot, but if it can be avoided, it 
is better.

>> Really, the correct API is kvm_read_guest() and kvm_write_guest().  They
>> can easily be wrapped in with something that takes a vmcs12 field and
>> automatically references the vmptr:
>>
>>    kvm_set_cr0(vcpu, gvmcs_read64(vcpu, guest_cr0));
>>      
> But as you also said, this doesn't solve the original problem which you found,
> (which was what happens when the vmcs12 page is swapped out and it is needed
> when sleeping is not allowed) unless we make changes to the injection logic,
> as you analyzed. I agree that we should consider it for a future fix.
>    

Is it only the injection logic that is involved?  If so, we're in good 
shape.

> If I understand the nested-SVM code correctly, they took a similar approach to
> mine - except they pin and unpin the page on every entry and exit, instead of
> on VMPTRLD (SVM doesn't have a notion of one current vmcb). But they still
> don't (if I understand correctly) call any special function on every access.
>    

Right, it will be better if they use kvm_*_guest() as well.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* RE: [PATCH 9/24] Implement VMCLEAR
  2010-06-13 12:27 ` [PATCH 9/24] Implement VMCLEAR Nadav Har'El
  2010-06-14  9:03   ` Avi Kivity
  2010-06-15 13:47   ` Gleb Natapov
@ 2010-07-06  2:56   ` Dong, Eddie
  2010-08-03 12:12     ` Nadav Har'El
  2 siblings, 1 reply; 147+ messages in thread
From: Dong, Eddie @ 2010-07-06  2:56 UTC (permalink / raw)
  To: Nadav Har'El, avi; +Cc: kvm, Dong, Eddie

Nadav Har'El wrote:
> This patch implements the VMCLEAR instruction.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
> --- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
> @@ -138,6 +138,8 @@ struct __attribute__ ((__packed__)) vmcs
>  	 */
>  	u32 revision_id;
>  	u32 abort;
> +
> +	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
>  };
> 
>  struct vmcs_list {
> @@ -3827,6 +3829,46 @@ static int read_guest_vmcs_gpa(struct kv
>  	return 0;
>  }
> 
> +static void clear_rflags_cf_zf(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long rflags;
> +	rflags = vmx_get_rflags(vcpu);
> +	rflags &= ~(X86_EFLAGS_CF | X86_EFLAGS_ZF);
> +	vmx_set_rflags(vcpu, rflags);
> +}
> +
> +/* Emulate the VMCLEAR instruction */
> +static int handle_vmclear(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	gpa_t guest_vmcs_addr, save_current_vmptr;
> +
> +	if (!nested_vmx_check_permission(vcpu))
> +		return 1;
> +
> +	if (read_guest_vmcs_gpa(vcpu, &guest_vmcs_addr))
> +		return 1;
> +

The SDM specifies alignment, range and reserved-bit checks here, which may generate
VMfail(VMCLEAR with invalid physical address), as well as an "addr != VMXON pointer" check.
Are these checks missing?
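
Something along these lines (a sketch only; nested.vmxon_ptr stands for
whatever field ends up recording the VMXON region address):

	if (!IS_ALIGNED(guest_vmcs_addr, PAGE_SIZE) ||
	    (guest_vmcs_addr >> cpuid_maxphyaddr(vcpu)) ||
	    guest_vmcs_addr == vmx->nested.vmxon_ptr) {
		/* plus setting the VM-instruction error field for a full VMfailValid() */
		set_rflags_to_vmx_fail_valid(vcpu);
		skip_emulated_instruction(vcpu);
		return 1;
	}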

> +	save_current_vmptr = vmx->nested.current_vmptr;
> +
> +	vmx->nested.current_vmptr = guest_vmcs_addr;
> +	if (!nested_map_current(vcpu))
> +		return 1;
> +	vmx->nested.current_l2_page->launch_state = 0;
> +	nested_unmap_current(vcpu);
> +
> +	nested_free_current_vmcs(vcpu);
> +
> +	if (save_current_vmptr == guest_vmcs_addr)
> +		vmx->nested.current_vmptr = -1ull;
> +	else
> +		vmx->nested.current_vmptr = save_current_vmptr;
> +
> +	skip_emulated_instruction(vcpu);
> +	clear_rflags_cf_zf(vcpu);

The SDM has a formal definition of VMsucceed. Clearing only CF/ZF is not sufficient, as SDM 2B 5.2 mentions.
Any special concern here?

BTW, should we define formal VMfail() & VMsucceed() APIs, for easier understanding and a clearer mapping to the SDM?
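
E.g., a VMsucceed() helper could simply be (sketch):

	static void nested_vmx_succeed(struct kvm_vcpu *vcpu)
	{
		vmx_set_rflags(vcpu, vmx_get_rflags(vcpu) &
			~(X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF |
			  X86_EFLAGS_ZF | X86_EFLAGS_SF | X86_EFLAGS_OF));
	}

with VMfailInvalid()/VMfailValid() counterparts following the SDM pseudo-code
(the latter also writing the VM-instruction error field).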

> +	return 1;
> +}
> +
>  static int handle_invlpg(struct kvm_vcpu *vcpu)
>  {
>  	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> @@ -4109,7 +4151,7 @@ static int (*kvm_vmx_exit_handlers[])(st
>  	[EXIT_REASON_HLT]                     = handle_halt,
>  	[EXIT_REASON_INVLPG]		      = handle_invlpg,
>  	[EXIT_REASON_VMCALL]                  = handle_vmcall,
> -	[EXIT_REASON_VMCLEAR]	              = handle_vmx_insn,
> +	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
>  	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
>  	[EXIT_REASON_VMPTRLD]                 = handle_vmx_insn,
>  	[EXIT_REASON_VMPTRST]                 = handle_vmx_insn,


^ permalink raw reply	[flat|nested] 147+ messages in thread

* RE: [PATCH 10/24] Implement VMPTRLD
  2010-06-13 12:27 ` [PATCH 10/24] Implement VMPTRLD Nadav Har'El
  2010-06-14  9:07   ` Avi Kivity
  2010-06-16 13:36   ` Gleb Natapov
@ 2010-07-06  3:09   ` Dong, Eddie
  2010-08-05 11:35     ` Nadav Har'El
  2 siblings, 1 reply; 147+ messages in thread
From: Dong, Eddie @ 2010-07-06  3:09 UTC (permalink / raw)
  To: Nadav Har'El, avi; +Cc: kvm, Dong, Eddie

Nadav Har'El wrote:
> This patch implements the VMPTRLD instruction.
> 
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
> --- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
> @@ -3829,6 +3829,26 @@ static int read_guest_vmcs_gpa(struct kv
>  	return 0;
>  }
> 
> +static void set_rflags_to_vmx_fail_invalid(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long rflags;
> +	rflags = vmx_get_rflags(vcpu);
> +	rflags |= X86_EFLAGS_CF;
> +	rflags &= ~X86_EFLAGS_PF & ~X86_EFLAGS_AF & ~X86_EFLAGS_ZF &
> +		~X86_EFLAGS_SF & ~X86_EFLAGS_OF;
> +	vmx_set_rflags(vcpu, rflags);
> +}
> +
> +static void set_rflags_to_vmx_fail_valid(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long rflags;
> +	rflags = vmx_get_rflags(vcpu);
> +	rflags |= X86_EFLAGS_ZF;
> +	rflags &= ~X86_EFLAGS_PF & ~X86_EFLAGS_AF & ~X86_EFLAGS_CF &
> +		~X86_EFLAGS_SF & ~X86_EFLAGS_OF;
> +	vmx_set_rflags(vcpu, rflags);
> +}
> +
>  static void clear_rflags_cf_zf(struct kvm_vcpu *vcpu)
>  {
>  	unsigned long rflags;
> @@ -3869,6 +3889,57 @@ static int handle_vmclear(struct kvm_vcp
>  	return 1;
>  }
> 
> +static bool verify_vmcs12_revision(struct kvm_vcpu *vcpu, gpa_t guest_vmcs_addr)
> +{
> +	bool ret;
> +	struct vmcs12 *vmcs12;
> +	struct page *vmcs_page = nested_get_page(vcpu, guest_vmcs_addr);
> +	if (vmcs_page == NULL)
> +		return 0;
> +	vmcs12 = (struct vmcs12 *)kmap_atomic(vmcs_page, KM_USER0);
> +	if (vmcs12->revision_id == VMCS12_REVISION)
> +		ret = 1;
> +	else {
> +		set_rflags_to_vmx_fail_valid(vcpu);
> +		ret = 0;
> +	}
> +	kunmap_atomic(vmcs12, KM_USER0);
> +	kvm_release_page_dirty(vmcs_page);
> +	return ret;
> +}
> +
> +/* Emulate the VMPTRLD instruction */
> +static int handle_vmptrld(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	gpa_t guest_vmcs_addr;
> +
> +	if (!nested_vmx_check_permission(vcpu))
> +		return 1;
> +
> +	if (read_guest_vmcs_gpa(vcpu, &guest_vmcs_addr)) {
> +		set_rflags_to_vmx_fail_invalid(vcpu);
> +		return 1;
> +	}
> +
> +	if (!verify_vmcs12_revision(vcpu, guest_vmcs_addr))
> +		return 1;
> +
> +	if (vmx->nested.current_vmptr != guest_vmcs_addr) {
> +		vmx->nested.current_vmptr = guest_vmcs_addr;
> +
> +		if (nested_create_current_vmcs(vcpu)) {
> +			printk(KERN_ERR "%s error could not allocate memory",
> +				__func__);
> +			return -ENOMEM;
> +		}
> +	}
> +
> +	clear_rflags_cf_zf(vcpu);
> +	skip_emulated_instruction(vcpu);

What about the "launch" state? Should we get that state from vmcs12 to distinguish the guest's VMLAUNCH from VMRESUME?

> +	return 1;
> +}
> +
>  static int handle_invlpg(struct kvm_vcpu *vcpu)
>  {
>  	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> @@ -4153,7 +4224,7 @@ static int (*kvm_vmx_exit_handlers[])(st
>  	[EXIT_REASON_VMCALL]                  = handle_vmcall,
>  	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
>  	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
> -	[EXIT_REASON_VMPTRLD]                 = handle_vmx_insn,
> +	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
>  	[EXIT_REASON_VMPTRST]                 = handle_vmx_insn,
>  	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
>  	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,


^ permalink raw reply	[flat|nested] 147+ messages in thread

* RE: [PATCH 14/24] Prepare vmcs02 from vmcs01 and vmcs12
  2010-06-13 12:29 ` [PATCH 14/24] Prepare vmcs02 from vmcs01 and vmcs12 Nadav Har'El
  2010-06-14 11:11   ` Avi Kivity
  2010-06-17  8:50   ` Gleb Natapov
@ 2010-07-06  6:25   ` Dong, Eddie
  2 siblings, 0 replies; 147+ messages in thread
From: Dong, Eddie @ 2010-07-06  6:25 UTC (permalink / raw)
  To: Nadav Har'El, avi; +Cc: kvm, Dong, Eddie

Nadav Har'El wrote:
> This patch contains code to prepare the VMCS which can be used to
> actually run the L2 guest, vmcs02. prepare_vmcs02 appropriately
> merges the information in shadow_vmcs that L1 built for L2 (vmcs12),
> and that in the VMCS that we built for L1 (vmcs01).
>
> VMREAD/WRITE can only access one VMCS at a time (the "current" VMCS),
> which makes it difficult for us to read from vmcs01 while writing to
> vmcs12. This is why we first make a copy of vmcs01 in memory
> (l1_shadow_vmcs) and then read that memory copy while writing to
> vmcs12.
>
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
> --- .before/arch/x86/kvm/vmx.c        2010-06-13 15:01:29.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c 2010-06-13 15:01:29.000000000 +0300
> @@ -849,6 +849,36 @@ static inline bool report_flexpriority(v
>       return flexpriority_enabled;
>  }
>
> +static inline bool nested_cpu_has_vmx_tpr_shadow(struct kvm_vcpu *vcpu)
> +{
> +     return cpu_has_vmx_tpr_shadow() &&
> +             get_shadow_vmcs(vcpu)->cpu_based_vm_exec_control &
> +             CPU_BASED_TPR_SHADOW;
> +}
> +
> +static inline bool nested_cpu_has_secondary_exec_ctrls(struct kvm_vcpu *vcpu)
> +{
> +     return cpu_has_secondary_exec_ctrls() &&
> +             get_shadow_vmcs(vcpu)->cpu_based_vm_exec_control &
> +             CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
> +}
> +
> +static inline bool nested_vm_need_virtualize_apic_accesses(struct kvm_vcpu
> +                                                       *vcpu)
> +{
> +     return nested_cpu_has_secondary_exec_ctrls(vcpu) &&
> +             (get_shadow_vmcs(vcpu)->secondary_vm_exec_control &
> +             SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
> +}
> +
> +static inline bool nested_cpu_has_vmx_ept(struct kvm_vcpu *vcpu)
> +{
> +     return nested_cpu_has_secondary_exec_ctrls(vcpu) &&
> +             (get_shadow_vmcs(vcpu)->secondary_vm_exec_control &
> +             SECONDARY_EXEC_ENABLE_EPT);
> +}
> +
> +
>  static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
>  {
>       int i;
> @@ -1292,6 +1322,39 @@ static void vmx_load_host_state(struct v
>       preempt_enable();
>  }
>
> +int load_vmcs_host_state(struct shadow_vmcs *src)
> +{
> +     vmcs_write16(HOST_ES_SELECTOR, src->host_es_selector);
> +     vmcs_write16(HOST_CS_SELECTOR, src->host_cs_selector);
> +     vmcs_write16(HOST_SS_SELECTOR, src->host_ss_selector);
> +     vmcs_write16(HOST_DS_SELECTOR, src->host_ds_selector);
> +     vmcs_write16(HOST_FS_SELECTOR, src->host_fs_selector);
> +     vmcs_write16(HOST_GS_SELECTOR, src->host_gs_selector);
> +     vmcs_write16(HOST_TR_SELECTOR, src->host_tr_selector);
> +
> +     vmcs_write64(TSC_OFFSET, src->tsc_offset);
> +
> +     if (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PAT)
> +             vmcs_write64(HOST_IA32_PAT, src->host_ia32_pat);
> +
> +     vmcs_write32(HOST_IA32_SYSENTER_CS, src->host_ia32_sysenter_cs);
> +
> +     vmcs_writel(HOST_CR0, src->host_cr0);
> +     vmcs_writel(HOST_CR3, src->host_cr3);
> +     vmcs_writel(HOST_CR4, src->host_cr4);
> +     vmcs_writel(HOST_FS_BASE, src->host_fs_base);
> +     vmcs_writel(HOST_GS_BASE, src->host_gs_base);
> +     vmcs_writel(HOST_TR_BASE, src->host_tr_base);
> +     vmcs_writel(HOST_GDTR_BASE, src->host_gdtr_base);
> +     vmcs_writel(HOST_IDTR_BASE, src->host_idtr_base);
> +     vmcs_writel(HOST_RSP, src->host_rsp);
> +     vmcs_writel(HOST_RIP, src->host_rip);
> +     vmcs_writel(HOST_IA32_SYSENTER_ESP, src->host_ia32_sysenter_esp);
> +     vmcs_writel(HOST_IA32_SYSENTER_EIP, src->host_ia32_sysenter_eip);
> +
> +     return 0;
> +}
> +
>  /*
>   * Switches to specified vcpu, until a matching vcpu_put(), but assumes
>   * vcpu mutex is already taken.
> @@ -1922,6 +1985,71 @@ static void vmclear_local_vcpus(void)
>               __vcpu_clear(vmx);
>  }
>
> +int load_vmcs_common(struct shadow_vmcs *src)
> +{
> +     vmcs_write16(GUEST_ES_SELECTOR, src->guest_es_selector);
> +     vmcs_write16(GUEST_CS_SELECTOR, src->guest_cs_selector);
> +     vmcs_write16(GUEST_SS_SELECTOR, src->guest_ss_selector);
> +     vmcs_write16(GUEST_DS_SELECTOR, src->guest_ds_selector);
> +     vmcs_write16(GUEST_FS_SELECTOR, src->guest_fs_selector);
> +     vmcs_write16(GUEST_GS_SELECTOR, src->guest_gs_selector);
> +     vmcs_write16(GUEST_LDTR_SELECTOR, src->guest_ldtr_selector);
> +     vmcs_write16(GUEST_TR_SELECTOR, src->guest_tr_selector);
> +
> +     vmcs_write64(GUEST_IA32_DEBUGCTL, src->guest_ia32_debugctl);
> +
> +     if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
> +             vmcs_write64(GUEST_IA32_PAT, src->guest_ia32_pat);
> +
> +     vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, src->vm_entry_intr_info_field);
> +     vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
> +             src->vm_entry_exception_error_code);
> +     vmcs_write32(VM_ENTRY_INSTRUCTION_LEN, src->vm_entry_instruction_len);
> +
> +     vmcs_write32(GUEST_ES_LIMIT, src->guest_es_limit);
> +     vmcs_write32(GUEST_CS_LIMIT, src->guest_cs_limit);
> +     vmcs_write32(GUEST_SS_LIMIT, src->guest_ss_limit);
> +     vmcs_write32(GUEST_DS_LIMIT, src->guest_ds_limit);
> +     vmcs_write32(GUEST_FS_LIMIT, src->guest_fs_limit);
> +     vmcs_write32(GUEST_GS_LIMIT, src->guest_gs_limit);
> +     vmcs_write32(GUEST_LDTR_LIMIT, src->guest_ldtr_limit);
> +     vmcs_write32(GUEST_TR_LIMIT, src->guest_tr_limit);
> +     vmcs_write32(GUEST_GDTR_LIMIT, src->guest_gdtr_limit);
> +     vmcs_write32(GUEST_IDTR_LIMIT, src->guest_idtr_limit);
> +     vmcs_write32(GUEST_ES_AR_BYTES, src->guest_es_ar_bytes);
> +     vmcs_write32(GUEST_CS_AR_BYTES, src->guest_cs_ar_bytes);
> +     vmcs_write32(GUEST_SS_AR_BYTES, src->guest_ss_ar_bytes);
> +     vmcs_write32(GUEST_DS_AR_BYTES, src->guest_ds_ar_bytes);
> +     vmcs_write32(GUEST_FS_AR_BYTES, src->guest_fs_ar_bytes);
> +     vmcs_write32(GUEST_GS_AR_BYTES, src->guest_gs_ar_bytes);
> +     vmcs_write32(GUEST_LDTR_AR_BYTES, src->guest_ldtr_ar_bytes);
> +     vmcs_write32(GUEST_TR_AR_BYTES, src->guest_tr_ar_bytes);
> +     vmcs_write32(GUEST_INTERRUPTIBILITY_INFO,
> +                  src->guest_interruptibility_info);
> +     vmcs_write32(GUEST_ACTIVITY_STATE, src->guest_activity_state);
> +     vmcs_write32(GUEST_SYSENTER_CS, src->guest_sysenter_cs);
> +
> +     vmcs_writel(GUEST_ES_BASE, src->guest_es_base);
> +     vmcs_writel(GUEST_CS_BASE, src->guest_cs_base);
> +     vmcs_writel(GUEST_SS_BASE, src->guest_ss_base);
> +     vmcs_writel(GUEST_DS_BASE, src->guest_ds_base);
> +     vmcs_writel(GUEST_FS_BASE, src->guest_fs_base);
> +     vmcs_writel(GUEST_GS_BASE, src->guest_gs_base);
> +     vmcs_writel(GUEST_LDTR_BASE, src->guest_ldtr_base);
> +     vmcs_writel(GUEST_TR_BASE, src->guest_tr_base);
> +     vmcs_writel(GUEST_GDTR_BASE, src->guest_gdtr_base);
> +     vmcs_writel(GUEST_IDTR_BASE, src->guest_idtr_base);
> +     vmcs_writel(GUEST_DR7, src->guest_dr7);
> +     vmcs_writel(GUEST_RSP, src->guest_rsp);
> +     vmcs_writel(GUEST_RIP, src->guest_rip);
> +     vmcs_writel(GUEST_RFLAGS, src->guest_rflags);
> +     vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS,
> +                 src->guest_pending_dbg_exceptions);
> +     vmcs_writel(GUEST_SYSENTER_ESP, src->guest_sysenter_esp);
> +     vmcs_writel(GUEST_SYSENTER_EIP, src->guest_sysenter_eip);
> +
> +     return 0;
> +}
>
>  /* Just like cpu_vmxoff(), but with the __kvm_handle_fault_on_reboot()
>   * tricks.
> @@ -5363,6 +5491,281 @@ static void vmx_set_supported_cpuid(u32
>  {
>  }
>
> +/* Make a copy of the current VMCS to ordinary memory. This is needed because
> + * in VMX you cannot read and write to two VMCS at the same time, so when we
> + * want to do this (in prepare_vmcs_02, which needs to read from vmcs01 while
> + * preparing vmcs02), we need to first save a copy of one VMCS's fields in
> + * memory, and then use that copy.
> + */
> +void save_vmcs(struct shadow_vmcs *dst)
> +{
> +     dst->guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR);
> +     dst->guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR);
> +     dst->guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR);
> +     dst->guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR);
> +     dst->guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR);
> +     dst->guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR);
> +     dst->guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR);
> +     dst->guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR);
> +     dst->host_es_selector = vmcs_read16(HOST_ES_SELECTOR);
> +     dst->host_cs_selector = vmcs_read16(HOST_CS_SELECTOR);
> +     dst->host_ss_selector = vmcs_read16(HOST_SS_SELECTOR);
> +     dst->host_ds_selector = vmcs_read16(HOST_DS_SELECTOR);
> +     dst->host_fs_selector = vmcs_read16(HOST_FS_SELECTOR);
> +     dst->host_gs_selector = vmcs_read16(HOST_GS_SELECTOR);
> +     dst->host_tr_selector = vmcs_read16(HOST_TR_SELECTOR);
> +     dst->io_bitmap_a = vmcs_read64(IO_BITMAP_A);
> +     dst->io_bitmap_b = vmcs_read64(IO_BITMAP_B);
> +     if (cpu_has_vmx_msr_bitmap())
> +             dst->msr_bitmap = vmcs_read64(MSR_BITMAP);
> +     dst->tsc_offset = vmcs_read64(TSC_OFFSET);
> +     dst->virtual_apic_page_addr = vmcs_read64(VIRTUAL_APIC_PAGE_ADDR);
> +     dst->apic_access_addr = vmcs_read64(APIC_ACCESS_ADDR);
> +     if (enable_ept)
> +             dst->ept_pointer = vmcs_read64(EPT_POINTER);
> +     dst->guest_physical_address = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
> +     dst->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);
> +     dst->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
> +     if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
> +             dst->guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT);
> +     if (enable_ept) {
> +             dst->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
> +             dst->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
> +             dst->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
> +             dst->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
> +     }
> +     dst->pin_based_vm_exec_control =
> +             vmcs_read32(PIN_BASED_VM_EXEC_CONTROL);
> +     dst->cpu_based_vm_exec_control =
> +             vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
> +     dst->exception_bitmap = vmcs_read32(EXCEPTION_BITMAP);
> +     dst->page_fault_error_code_mask =
> +             vmcs_read32(PAGE_FAULT_ERROR_CODE_MASK);
> +     dst->page_fault_error_code_match =
> +             vmcs_read32(PAGE_FAULT_ERROR_CODE_MATCH);
> +     dst->cr3_target_count = vmcs_read32(CR3_TARGET_COUNT);
> +     dst->vm_exit_controls = vmcs_read32(VM_EXIT_CONTROLS);
> +     dst->vm_entry_controls = vmcs_read32(VM_ENTRY_CONTROLS);
> +     dst->vm_entry_intr_info_field = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
> +     dst->vm_entry_exception_error_code =
> +             vmcs_read32(VM_ENTRY_EXCEPTION_ERROR_CODE);
> +     dst->vm_entry_instruction_len = vmcs_read32(VM_ENTRY_INSTRUCTION_LEN);
> +     dst->tpr_threshold = vmcs_read32(TPR_THRESHOLD);
> +     dst->secondary_vm_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
> +     if (enable_vpid && dst->secondary_vm_exec_control &
> +         SECONDARY_EXEC_ENABLE_VPID)
> +             dst->virtual_processor_id = vmcs_read16(VIRTUAL_PROCESSOR_ID);
> +     dst->vm_instruction_error = vmcs_read32(VM_INSTRUCTION_ERROR);
> +     dst->vm_exit_reason = vmcs_read32(VM_EXIT_REASON);
> +     dst->vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
> +     dst->vm_exit_intr_error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
> +     dst->idt_vectoring_info_field = vmcs_read32(IDT_VECTORING_INFO_FIELD);
> +     dst->idt_vectoring_error_code = vmcs_read32(IDT_VECTORING_ERROR_CODE);
> +     dst->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
> +     dst->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
> +     dst->guest_es_limit = vmcs_read32(GUEST_ES_LIMIT);
> +     dst->guest_cs_limit = vmcs_read32(GUEST_CS_LIMIT);
> +     dst->guest_ss_limit = vmcs_read32(GUEST_SS_LIMIT);
> +     dst->guest_ds_limit = vmcs_read32(GUEST_DS_LIMIT);
> +     dst->guest_fs_limit = vmcs_read32(GUEST_FS_LIMIT);
> +     dst->guest_gs_limit = vmcs_read32(GUEST_GS_LIMIT);
> +     dst->guest_ldtr_limit = vmcs_read32(GUEST_LDTR_LIMIT);
> +     dst->guest_tr_limit = vmcs_read32(GUEST_TR_LIMIT);
> +     dst->guest_gdtr_limit = vmcs_read32(GUEST_GDTR_LIMIT);
> +     dst->guest_idtr_limit = vmcs_read32(GUEST_IDTR_LIMIT);
> +     dst->guest_es_ar_bytes = vmcs_read32(GUEST_ES_AR_BYTES);
> +     dst->guest_cs_ar_bytes = vmcs_read32(GUEST_CS_AR_BYTES);
> +     dst->guest_ss_ar_bytes = vmcs_read32(GUEST_SS_AR_BYTES);
> +     dst->guest_ds_ar_bytes = vmcs_read32(GUEST_DS_AR_BYTES);
> +     dst->guest_fs_ar_bytes = vmcs_read32(GUEST_FS_AR_BYTES);
> +     dst->guest_gs_ar_bytes = vmcs_read32(GUEST_GS_AR_BYTES);
> +     dst->guest_ldtr_ar_bytes = vmcs_read32(GUEST_LDTR_AR_BYTES);
> +     dst->guest_tr_ar_bytes = vmcs_read32(GUEST_TR_AR_BYTES);
> +     dst->guest_interruptibility_info =
> +             vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
> +     dst->guest_activity_state = vmcs_read32(GUEST_ACTIVITY_STATE);
> +     dst->guest_sysenter_cs = vmcs_read32(GUEST_SYSENTER_CS);
> +     dst->host_ia32_sysenter_cs = vmcs_read32(HOST_IA32_SYSENTER_CS);
> +     dst->cr0_guest_host_mask = vmcs_readl(CR0_GUEST_HOST_MASK);
> +     dst->cr4_guest_host_mask = vmcs_readl(CR4_GUEST_HOST_MASK);
> +     dst->cr0_read_shadow = vmcs_readl(CR0_READ_SHADOW);
> +     dst->cr4_read_shadow = vmcs_readl(CR4_READ_SHADOW);
> +     dst->cr3_target_value0 = vmcs_readl(CR3_TARGET_VALUE0);
> +     dst->cr3_target_value1 = vmcs_readl(CR3_TARGET_VALUE1);
> +     dst->cr3_target_value2 = vmcs_readl(CR3_TARGET_VALUE2);
> +     dst->cr3_target_value3 = vmcs_readl(CR3_TARGET_VALUE3);
> +     dst->exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> +     dst->guest_linear_address = vmcs_readl(GUEST_LINEAR_ADDRESS);
> +     dst->guest_cr0 = vmcs_readl(GUEST_CR0);
> +     dst->guest_cr3 = vmcs_readl(GUEST_CR3);
> +     dst->guest_cr4 = vmcs_readl(GUEST_CR4);
> +     dst->guest_es_base = vmcs_readl(GUEST_ES_BASE);
> +     dst->guest_cs_base = vmcs_readl(GUEST_CS_BASE);
> +     dst->guest_ss_base = vmcs_readl(GUEST_SS_BASE);
> +     dst->guest_ds_base = vmcs_readl(GUEST_DS_BASE);
> +     dst->guest_fs_base = vmcs_readl(GUEST_FS_BASE);
> +     dst->guest_gs_base = vmcs_readl(GUEST_GS_BASE);
> +     dst->guest_ldtr_base = vmcs_readl(GUEST_LDTR_BASE);
> +     dst->guest_tr_base = vmcs_readl(GUEST_TR_BASE);
> +     dst->guest_gdtr_base = vmcs_readl(GUEST_GDTR_BASE);
> +     dst->guest_idtr_base = vmcs_readl(GUEST_IDTR_BASE);
> +     dst->guest_dr7 = vmcs_readl(GUEST_DR7);
> +     dst->guest_rsp = vmcs_readl(GUEST_RSP);
> +     dst->guest_rip = vmcs_readl(GUEST_RIP);
> +     dst->guest_rflags = vmcs_readl(GUEST_RFLAGS);
> +     dst->guest_pending_dbg_exceptions =
> +             vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
> +     dst->guest_sysenter_esp = vmcs_readl(GUEST_SYSENTER_ESP);
> +     dst->guest_sysenter_eip = vmcs_readl(GUEST_SYSENTER_EIP);
> +     dst->host_cr0 = vmcs_readl(HOST_CR0);
> +     dst->host_cr3 = vmcs_readl(HOST_CR3);
> +     dst->host_cr4 = vmcs_readl(HOST_CR4);
> +     dst->host_fs_base = vmcs_readl(HOST_FS_BASE);
> +     dst->host_gs_base = vmcs_readl(HOST_GS_BASE);
> +     dst->host_tr_base = vmcs_readl(HOST_TR_BASE);
> +     dst->host_gdtr_base = vmcs_readl(HOST_GDTR_BASE);
> +     dst->host_idtr_base = vmcs_readl(HOST_IDTR_BASE);
> +     dst->host_ia32_sysenter_esp = vmcs_readl(HOST_IA32_SYSENTER_ESP);
> +     dst->host_ia32_sysenter_eip = vmcs_readl(HOST_IA32_SYSENTER_EIP);
> +     dst->host_rsp = vmcs_readl(HOST_RSP);
> +     dst->host_rip = vmcs_readl(HOST_RIP);
> +     if (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PAT)
> +             dst->host_ia32_pat = vmcs_read64(HOST_IA32_PAT);
> +}
> +
> +/* prepare_vmcs_02 is called when the L1 guest hypervisor runs its nested
> + * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
> + * with L0's wishes for its guest (vmcs01), so we can run the L2 guest in a
> + * way that will both be appropriate to L1's requests, and our needs.
> + */
> +int prepare_vmcs_02(struct kvm_vcpu *vcpu,
> +     struct shadow_vmcs *vmcs12, struct shadow_vmcs *vmcs01)
> +{
> +     u32 exec_control;
> +
> +     load_vmcs_common(vmcs12);
> +
> +     vmcs_write64(VMCS_LINK_POINTER, vmcs12->vmcs_link_pointer);
> +     vmcs_write64(IO_BITMAP_A, vmcs01->io_bitmap_a);
> +     vmcs_write64(IO_BITMAP_B, vmcs01->io_bitmap_b);
> +     if (cpu_has_vmx_msr_bitmap())
> +             vmcs_write64(MSR_BITMAP, vmcs01->msr_bitmap);
> +
> +     if (vmcs12->vm_entry_msr_load_count > 0 ||
> +                     vmcs12->vm_exit_msr_load_count > 0 ||
> +                     vmcs12->vm_exit_msr_store_count > 0) {
> +             printk(KERN_WARNING
> +                     "%s: VMCS MSR_{LOAD,STORE} unsupported\n", __func__);
> +     }
> +
> +     if (nested_cpu_has_vmx_tpr_shadow(vcpu)) {
> +             struct page *page =
> +                     nested_get_page(vcpu, vmcs12->virtual_apic_page_addr);
> +             if (!page)
> +                     return 1;
> +             vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, page_to_phys(page));
> +             kvm_release_page_clean(page);
> +     }
> +
> +     if (nested_vm_need_virtualize_apic_accesses(vcpu)) {
> +             struct page *page =
> +                     nested_get_page(vcpu, vmcs12->apic_access_addr);
> +             if (!page)
> +                     return 1;
> +             vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page));
> +             kvm_release_page_clean(page);
> +     }
> +
> +     vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
> +                  (vmcs01->pin_based_vm_exec_control |
> +                   vmcs12->pin_based_vm_exec_control));
> +     vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK,
> +                  (vmcs01->page_fault_error_code_mask &
> +                   vmcs12->page_fault_error_code_mask));
> +     vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH,
> +                  (vmcs01->page_fault_error_code_match &
> +                   vmcs12->page_fault_error_code_match));
> +
> +     if (cpu_has_secondary_exec_ctrls()) {
> +             u32 exec_control = vmcs01->secondary_vm_exec_control;
> +             if (nested_cpu_has_secondary_exec_ctrls(vcpu)) {
> +                     exec_control |= vmcs12->secondary_vm_exec_control;
> +                     if (!vm_need_virtualize_apic_accesses(vcpu->kvm) ||
> +                         !nested_vm_need_virtualize_apic_accesses(vcpu))
> +                             exec_control &=
> +                             ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> +             }
> +             vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
> +     }
> +
> +     load_vmcs_host_state(vmcs01);
> +
> +     if (vm_need_tpr_shadow(vcpu->kvm) &&
> +         nested_cpu_has_vmx_tpr_shadow(vcpu))
> +             vmcs_write32(TPR_THRESHOLD, vmcs12->tpr_threshold);
> +
> +     if (enable_ept) {
> +             if (!nested_cpu_has_vmx_ept(vcpu)) {
> +                     vmcs_write64(EPT_POINTER, vmcs01->ept_pointer);
> +                     vmcs_write64(GUEST_PDPTR0, vmcs01->guest_pdptr0);
> +                     vmcs_write64(GUEST_PDPTR1, vmcs01->guest_pdptr1);
> +                     vmcs_write64(GUEST_PDPTR2, vmcs01->guest_pdptr2);
> +                     vmcs_write64(GUEST_PDPTR3, vmcs01->guest_pdptr3);
> +             }
> +     }
> +
> +     exec_control = vmcs01->cpu_based_vm_exec_control;
> +     exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING;
> +     exec_control &= ~CPU_BASED_VIRTUAL_NMI_PENDING;
> +     exec_control &= ~CPU_BASED_TPR_SHADOW;
> +     exec_control |= vmcs12->cpu_based_vm_exec_control;
> +     if (!vm_need_tpr_shadow(vcpu->kvm) ||
> +         vmcs12->virtual_apic_page_addr == 0) {
> +             exec_control &= ~CPU_BASED_TPR_SHADOW;
> +#ifdef CONFIG_X86_64
> +             exec_control |= CPU_BASED_CR8_STORE_EXITING |
> +                     CPU_BASED_CR8_LOAD_EXITING;
> +#endif
> +     } else if (exec_control & CPU_BASED_TPR_SHADOW) {
> +#ifdef CONFIG_X86_64
> +             exec_control &= ~CPU_BASED_CR8_STORE_EXITING;
> +             exec_control &= ~CPU_BASED_CR8_LOAD_EXITING;
> +#endif
> +     }
> +     vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control);
> +
> +     /* EXCEPTION_BITMAP and CR0_GUEST_HOST_MASK should basically be the
> +      * bitwise-or of what L1 wants to trap for L2, and what we want to
> +      * trap. However, vmx_fpu_activate/deactivate may have happened after
> +      * we saved vmcs01, so we shouldn't trust its TS and NM_VECTOR bits
> +      * and need to base them again on fpu_active. Note that CR0.TS also
> +      * needs updating - we do this after this function returns (in
> +      * nested_vmx_run).
> +      */
> +     vmcs_write32(EXCEPTION_BITMAP,
> +                  ((vmcs01->exception_bitmap&~(1u<<NM_VECTOR)) |
> +                   (vcpu->fpu_active ? 0 : (1u<<NM_VECTOR)) |
> +                   vmcs12->exception_bitmap));
> +     vmcs_writel(CR0_GUEST_HOST_MASK, vmcs12->cr0_guest_host_mask |
> +                     (vcpu->fpu_active ? 0 : X86_CR0_TS));
> +     vcpu->arch.cr0_guest_owned_bits = ~(vmcs12->cr0_guest_host_mask |
> +                     (vcpu->fpu_active ? 0 : X86_CR0_TS));
> +
> +     vmcs_write32(VM_EXIT_CONTROLS,
> +                  (vmcs01->vm_exit_controls &
> +                     (~(VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT)))
> +                    | vmcs12->vm_exit_controls);
> +
> +     vmcs_write32(VM_ENTRY_CONTROLS,
> +                  (vmcs01->vm_entry_controls &
> +                     (~(VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE)))
> +                   | vmcs12->vm_entry_controls);
> +
> +     vmcs_writel(CR4_GUEST_HOST_MASK,
> +                 (vmcs01->cr4_guest_host_mask  &
> +                  vmcs12->cr4_guest_host_mask));
> +
> +     return 0;
> +}
> +
>  static struct kvm_x86_ops vmx_x86_ops = {
>       .cpu_has_kvm_support = cpu_has_kvm_support,
>       .disabled_by_bios = vmx_disabled_by_bios,

I'm curious whether we need to save all VMCS fields when switching from L2 to L1 (save_vmcs). For example, TSC_OFFSET, PIN_BASED_VM_EXEC_CONTROL and CPU_BASED_VM_EXEC_CONTROL won't have changed during L2 execution.

The same applies to the host VMCS state.
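
To make the point concrete, here is a rough sketch of a reduced save path. The function name and the exact field split are illustrative only, and are not part of the posted patches:

	/* Hypothetical sketch: re-read only fields that L2 could have changed
	 * while it ran; control fields that only L0/L1 write (TSC_OFFSET,
	 * PIN_BASED_VM_EXEC_CONTROL, CPU_BASED_VM_EXEC_CONTROL, the host
	 * state, ...) would be kept from the values cached when vmcs02 was
	 * prepared.
	 */
	static void save_vmcs_guest_state(struct shadow_vmcs *dst)
	{
		dst->guest_rip = vmcs_readl(GUEST_RIP);
		dst->guest_rsp = vmcs_readl(GUEST_RSP);
		dst->guest_rflags = vmcs_readl(GUEST_RFLAGS);
		dst->guest_cr0 = vmcs_readl(GUEST_CR0);
		dst->guest_cr3 = vmcs_readl(GUEST_CR3);
		dst->guest_cr4 = vmcs_readl(GUEST_CR4);
		dst->guest_interruptibility_info =
			vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
		dst->vm_exit_reason = vmcs_read32(VM_EXIT_REASON);
		dst->exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
		/* ... remaining guest-writable fields ... */
	}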

^ permalink raw reply	[flat|nested] 147+ messages in thread

* RE: [PATCH 8/24] Hold a vmcs02 for each vmcs12
  2010-06-13 12:26 ` [PATCH 8/24] Hold a vmcs02 for each vmcs12 Nadav Har'El
  2010-06-14  8:57   ` Avi Kivity
@ 2010-07-06  9:50   ` Dong, Eddie
  2010-08-02 13:38     ` Nadav Har'El
  1 sibling, 1 reply; 147+ messages in thread
From: Dong, Eddie @ 2010-07-06  9:50 UTC (permalink / raw)
  To: Nadav Har'El, avi; +Cc: kvm, Dong, Eddie

> +/* Allocate an L0 VMCS (vmcs02) for the current L1 VMCS (vmcs12), if one
> + * does not already exist. The allocation is done in L0 memory, so to avoid
> + * denial-of-service attack by guests, we limit the number of concurrently-
> + * allocated vmcss. A well-behaving L1 will VMCLEAR unused vmcs12s and not
> + * trigger this limit.
> + */
> +static const int NESTED_MAX_VMCS = 256;
> +static int nested_create_current_vmcs(struct kvm_vcpu *vcpu)
> +{
> +	struct vmcs_list *new_l2_guest;
> +	struct vmcs *l2_vmcs;
> +
> +	if (nested_get_current_vmcs(vcpu))
> +		return 0; /* nothing to do - we already have a VMCS */
> +
> +	if (to_vmx(vcpu)->nested.l2_vmcs_num >= NESTED_MAX_VMCS)
> +		return -ENOMEM;
> +
> +	new_l2_guest = (struct vmcs_list *)
> +		kmalloc(sizeof(struct vmcs_list), GFP_KERNEL);
> +	if (!new_l2_guest)
> +		return -ENOMEM;
> +
> +	l2_vmcs = alloc_vmcs();

I didn't see where it is used. Any hints on its usage?

> +	if (!l2_vmcs) {
> +		kfree(new_l2_guest);
> +		return -ENOMEM;
> +	}
> +
> +	new_l2_guest->vmcs_addr = to_vmx(vcpu)->nested.current_vmptr;
> +	new_l2_guest->l2_vmcs = l2_vmcs;
> +	list_add(&(new_l2_guest->list),
> +		 &(to_vmx(vcpu)->nested.l2_vmcs_list));
> +	to_vmx(vcpu)->nested.l2_vmcs_num++;
> +	return 0;
> +}
> +

^ permalink raw reply	[flat|nested] 147+ messages in thread

* RE: [PATCH 0/24] Nested VMX, v5
  2010-06-13 12:22 [PATCH 0/24] Nested VMX, v5 Nadav Har'El
                   ` (24 preceding siblings ...)
  2010-06-14 12:34 ` [PATCH 0/24] Nested VMX, v5 Avi Kivity
@ 2010-07-09  8:59 ` Dong, Eddie
  2010-07-11  8:27   ` Nadav Har'El
  2010-07-15  3:27 ` Sheng Yang
  26 siblings, 1 reply; 147+ messages in thread
From: Dong, Eddie @ 2010-07-09  8:59 UTC (permalink / raw)
  To: Nadav Har'El, avi; +Cc: kvm, Dong, Eddie

Nadav Har'El wrote:
> Hi Avi,
> 
> This is a followup of our nested VMX patches that Orit Wasserman
> posted in December. We've addressed most of the comments and concerns
> that you and others on the mailing list had with the previous patch
> set. We hope you'll find these patches easier to understand, and
> suitable for applying to KVM. 
> 
> 
> The following 24 patches implement nested VMX support. The patches
> enable a guest to use the VMX APIs in order to run its own nested
> guests. I.e., it allows running hypervisors (that use VMX) under KVM.
> We describe the theory behind this work, our implementation, and its
> performance characteristics, 
> in IBM Research report H-0282, "The Turtles Project: Design and
> Implementation of Nested Virtualization", available at:
> 
> 	http://bit.ly/a0o9te
> 
> The current patches support running Linux under a nested KVM using
> shadow page table (with bypass_guest_pf disabled). They support
> multiple nested hypervisors, which can run multiple guests. Only
> 64-bit nested hypervisors are supported. SMP is supported. Additional
> patches for running Windows under nested KVM, and Linux under nested
> VMware server, and support for nested EPT, are currently running in
> the lab, and will be sent as follow-on patchsets. 
> 

Nadav & All:
	Thanks for the posting and in general the patches are well written. I like the concept of VMCSxy and I feel it is pretty clear (better than my previous naming as well), but there is some confusion inside, especially around the term "shadow", which I find quite hard to follow.

	Comments from me:
	1: Basically there are 2 different types of VMCS: one is defined by hardware, whose layout is unknown to the VMM; the other is defined by the VMM (this patch) and used for vmcs12.

	The former uses "struct vmcs" to describe its data instances, but the latter doesn't have a clear definition (or is it struct vmcs12?). I suggest we have a distinct struct for this, for example "struct sw_vmcs" (software vmcs) or "struct vvmcs" (virtual vmcs).

	2: vmcsxy (vmcs12, vmcs02, vmcs01) are names for instances of either "struct vmcs" or "struct sw_vmcs", not for the structs themselves. A clear distinction between data structure and instance helps, IMO.

	3: We could use a prefix or suffix in addition to vmcsxy to explicitly state the format of an instance. For example vmcs02 in the current patch is for hardware use, hence it is an instance of "struct vmcs", but vmcs01 is an instance of "struct sw_vmcs". A prefix or suffix would make this easier to understand.

	4: Rename l2_vmcs to vmcs02, l1_shadow_vmcs to vmcs01, and l1_vmcs to vmcs02; adding a prefix/suffix can strengthen the above vmcsxy concept.


	5: Guest VMPTRLD emulation. The current patch creates a vmcs02 instance each time the guest does VMPTRLD, and frees the instance at VMCLEAR. The code may fail if the number of (un-VMCLEARed) vmcs exceeds a certain threshold, to avoid denial of service. That is fine, but it brings additional complexity and may cost a lot of memory. I think we can emulate this using the concept of a "cached vmcs", in case the L1 VMM doesn't do VMCLEAR in time.  The L0 VMM can simply flush those vmcs02 to guest memory, i.e. vmcs12, as needed. For example, if the number of cached vmcs02 exceeds 10, we can flush automatically (a rough sketch of this follows below).
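
	Purely as an illustration of the idea, reusing the structures from patch 8/24: the threshold constant and the recycling policy below are made up, and a recycled vmcs02 would of course need a full rebuild from its new vmcs12 before the next launch.

	/* Hypothetical sketch, not part of the posted patches: instead of
	 * failing when too many vmcs02 are alive, recycle the oldest entry
	 * and rebind it to the vmcs12 the guest just VMPTRLDed.
	 */
	#define NESTED_MAX_CACHED_VMCS02 10	/* made-up threshold */

	static int nested_get_or_recycle_vmcs02(struct kvm_vcpu *vcpu)
	{
		struct vcpu_vmx *vmx = to_vmx(vcpu);
		struct vmcs_list *victim;

		if (vmx->nested.l2_vmcs_num < NESTED_MAX_CACHED_VMCS02)
			return nested_create_current_vmcs(vcpu);

		/* Cache is full: take the oldest entry on the list ... */
		victim = list_entry(vmx->nested.l2_vmcs_list.prev,
				    struct vmcs_list, list);
		vmcs_clear(victim->l2_vmcs);
		/* ... rebind it to the new vmcs12 and move it to the front. */
		victim->vmcs_addr = vmx->nested.current_vmptr;
		list_move(&victim->list, &vmx->nested.l2_vmcs_list);
		return 0;
	}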


Thx, Eddie




	

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 0/24] Nested VMX, v5
  2010-07-09  8:59 ` Dong, Eddie
@ 2010-07-11  8:27   ` Nadav Har'El
  2010-07-11 11:05     ` Alexander Graf
  2010-07-11 13:20     ` Avi Kivity
  0 siblings, 2 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-07-11  8:27 UTC (permalink / raw)
  To: Dong, Eddie; +Cc: avi, kvm

On Fri, Jul 09, 2010, Dong, Eddie wrote about "RE: [PATCH 0/24] Nested VMX, v5":
> 	Thanks for the posting and in general the patches are well written.
> I like the concept of VMCSxy and I feel it is pretty clear (better than my
> previous naming as well), but there is some confusion inside, especially
> around the term "shadow", which I find quite hard to follow.

Hi, and thanks for the excellent ideas. As you saw, I indeed started to
convert and converge the old terminology (including that ambiguous term
"shadow") into the new names vmcs01, vmcs02, vmcs12 - names which we
introduced in our technical report.
But I have not gone all the way with these changes. I should have, and I'll
do it now.

> 1: Basically there are 2 different types of VMCS: one is defined by hardware,
> whose layout is unknown to the VMM; the other is defined by the VMM (this
> patch) and used for vmcs12.
> The former uses "struct vmcs" to describe its data instances, but the latter
> doesn't have a clear definition (or is it struct vmcs12?). I suggest we have
> a distinct struct for this, for example "struct sw_vmcs" (software vmcs) or
> "struct vvmcs" (virtual vmcs).

I decided (but let me know if you have reservations) to use the name
"struct vmcs_fields" for the memory structure that contains the long list of
vmcs fields. I think this name describes the structure's content well.

As in the last version of the patches, this list of vmcs fields will not on
its own be vmcs12's structure, because vmcs12, as a spec-compliant vmcs, also
needs to contain a couple of additional fields in its beginning, and we also
need a few more runtime fields.

> 	2: vmcsxy (vmcs12, vmcs02, vmcs01) are names for instances of either
> "struct vmcs" or "struct sw_vmcs", not for the structs themselves. A clear
> distinction between data structure and instance helps, IMO.

I agree with you that using the name "vmcs12" for both the type (struct vmcs12)
and instance of another type (struct vmcs_fields *vmcs12) is somewhat strange,
but I can only think of two alternatives:

1. Invent a new name for "struct vmcs12", say "struct sw_vmcs" as you
   suggested. But I think it will just make things less clear, because we
   replace the self-explanatory name vmcs12 by a less clear name.

2. Stop separating "struct vmcs_fields" (formerly struct shadow_vmcs) and
   "struct vmcs12" which contains it and a few more fields - and instead 
   put everything in one structure (and call that sw_vmcs or whatever).
   These extra fields will not be useful for vmcs01, but it's not a terrible
   waste (because vmcs01 already doesn't use a lot of these fields).

Personally, I find these two alternatives even less appealing than the
current alternative (with "struct vmcs12" describing vmcs12's type, and
it contains a struct vmcs_fields inside). What do you think?

> 3: We could use a prefix or suffix in addition to vmcsxy to explicitly state
> the format of an instance. For example vmcs02 in the current patch is for
> hardware use, hence it is an instance of "struct vmcs", but vmcs01 is an
> instance of "struct sw_vmcs". A prefix or suffix would make this easier to
> understand.

I agree. After changing the old name struct shadow_vmcs to vmcs_fields, now
I can use a name like vmcs01_fields for the old l1_shadow_vmcs (memory copy
of vmcs01's fields) and vmcs01 for the old l1_vmcs (the actual hardware VMCS
used to run L1). This is indeed more readable, thanks.

> 4: Rename l2_vmcs to vmcs02, l1_shadow_vmcs to vmcs01, and l1_vmcs to
> vmcs02; adding a prefix/suffix can strengthen the above vmcsxy concept.

Good ideas.

I renamed l2_vmcs, l2_vmcs_list, and the like, to vmcs02.

I renamed l1_shadow_vmcs to vmcs01_fields, and l1_vmcs to vmcs01 (NOT vmcs02).

I renamed l2_shadow_vmcs, l2svmcs, nested_vmcs, and the like, to vmcs12
(I decided not to use the longer name vmcs12_fields, because I don't think it
adds any clarity). I also renamed get_shadow_vmcs to get_vmcs12_fields.

> 5: Guest VMPTRLD emulation. The current patch creates a vmcs02 instance each
> time the guest does VMPTRLD, and frees the instance at VMCLEAR. The code may
> fail if the number of (un-VMCLEARed) vmcs exceeds a certain threshold, to
> avoid denial of service. That is fine, but it brings additional complexity
> and may cost a lot of memory. I think we can emulate this using the concept
> of a "cached vmcs", in case the L1 VMM doesn't do VMCLEAR in time.  The L0
> VMM can simply flush those vmcs02 to guest memory, i.e. vmcs12, as needed.
> For example, if the number of cached vmcs02 exceeds 10, we can flush
> automatically.

Right. I've already discussed this idea over the list with Avi Kivity, and
it is on my todo list and definitely should be done.
The current approach is simpler, because I don't need to add special code for
rebuilding a forgotten vmcs02 from vmcs12 - the current prepare_vmcs02 only
updates some of the fields, and I'll need to do some testing to figure out
what exactly is missing for a full rebuild.  

I think the current code is "good enough" as an interim solution, because
users that follow the spec will not forget to VMCLEAR anyway (and if they
do, only they will suffer). And I wouldn't say that "a lot of memory" is
involved - at worst, an L1 can now cause 256 pages, or 1 MB, to be wasted on
this. More typically, an L1 will only have a few L2 guests, and only spend
a few pages for this - certainly much less than it would spend on actually
holding the L2's memory.

Thanks again for the review!

I don't want to attach the entire set of patches again now (before I respond
to the rest of the review comments, from Avi, Gleb, you, and others).
So in the meantime I'll include the new version of just one patch, so that
you can see an example of the changes I've made to the names.

=========
Subject: [PATCH 16/26] nVMX: Implement VMLAUNCH and VMRESUME

Implement the VMLAUNCH and VMRESUME instructions, allowing a guest
hypervisor to run its own guests.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  200 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 197 insertions(+), 3 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2010-07-11 11:11:11.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-07-11 11:11:11.000000000 +0300
@@ -279,6 +279,9 @@ struct __packed vmcs12 {
 	struct vmcs_fields fields;
 
 	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
+
+	int cpu;
+	int launched;
 };
 
 /*
@@ -313,6 +316,23 @@ struct nested_vmx {
 	/* list of real (hardware) VMCS, one for each L2 guest of L1 */
 	struct list_head vmcs02_list; /* a vmcs_list */
 	int vmcs02_num;
+
+	/* Are we running a nested guest now */
+	bool nested_mode;
+	/* Level 1 state for switching to level 2 and back */
+	struct  {
+		u64 efer;
+		unsigned long cr3;
+		unsigned long cr4;
+		u64 io_bitmap_a;
+		u64 io_bitmap_b;
+		u64 msr_bitmap;
+		int cpu;
+		int launched;
+	} l1_state;
+	/* Saving the VMCS that we used for running L1 */
+	struct vmcs *vmcs01;
+	struct vmcs_fields *vmcs01_fields;
 };
 
 enum vmcs_field_type {
@@ -1383,6 +1403,19 @@ static void vmx_vcpu_load(struct kvm_vcp
 			new_offset = vmcs_read64(TSC_OFFSET) + delta;
 			vmcs_write64(TSC_OFFSET, new_offset);
 		}
+
+		if (vmx->nested.vmcs01_fields != NULL) {
+			struct vmcs_fields *vmcs01 =
+				vmx->nested.vmcs01_fields;
+			vmcs01->host_tr_base = vmcs_readl(HOST_TR_BASE);
+			vmcs01->host_gdtr_base = vmcs_readl(HOST_GDTR_BASE);
+			vmcs01->host_ia32_sysenter_esp =
+				vmcs_readl(HOST_IA32_SYSENTER_ESP);
+			if (tsc_this < vcpu->arch.host_tsc)
+				vmcs01->tsc_offset = vmcs_read64(TSC_OFFSET);
+			if (vmx->nested.nested_mode)
+				load_vmcs_host_state(vmcs01);
+		}
 	}
 }
 
@@ -2278,6 +2311,9 @@ static void free_l1_state(struct kvm_vcp
 		kfree(list_item);
 	}
 	vmx->nested.vmcs02_num = 0;
+
+	kfree(vmx->nested.vmcs01_fields);
+	vmx->nested.vmcs01_fields = NULL;
 }
 
 static void free_kvm_area(void)
@@ -4141,6 +4177,10 @@ static int handle_vmon(struct kvm_vcpu *
 	INIT_LIST_HEAD(&(vmx->nested.vmcs02_list));
 	vmx->nested.vmcs02_num = 0;
 
+	vmx->nested.vmcs01_fields = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!vmx->nested.vmcs01_fields)
+		return -ENOMEM;
+
 	vmx->nested.vmxon = true;
 
 	skip_emulated_instruction(vcpu);
@@ -4339,6 +4379,38 @@ static int handle_vmclear(struct kvm_vcp
 	return 1;
 }
 
+static int nested_vmx_run(struct kvm_vcpu *vcpu);
+
+static int handle_launch_or_resume(struct kvm_vcpu *vcpu, bool launch)
+{
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (to_vmx(vcpu)->nested.current_vmcs12->launch_state == launch) {
+		/* Must use VMLAUNCH for the first time, VMRESUME later */
+		set_rflags_to_vmx_fail_valid(vcpu);
+		return 1;
+	}
+
+	skip_emulated_instruction(vcpu);
+
+	nested_vmx_run(vcpu);
+	return 1;
+}
+
+/* Emulate the VMLAUNCH instruction */
+static int handle_vmlaunch(struct kvm_vcpu *vcpu)
+{
+	return handle_launch_or_resume(vcpu, true);
+}
+
+/* Emulate the VMRESUME instruction */
+static int handle_vmresume(struct kvm_vcpu *vcpu)
+{
+
+	return handle_launch_or_resume(vcpu, false);
+}
+
 static inline bool vmcs12_read_any(struct kvm_vcpu *vcpu,
 					unsigned long field, u64 *ret)
 {
@@ -4869,11 +4941,11 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_INVLPG]		      = handle_invlpg,
 	[EXIT_REASON_VMCALL]                  = handle_vmcall,
 	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
-	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
+	[EXIT_REASON_VMLAUNCH]                = handle_vmlaunch,
 	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
 	[EXIT_REASON_VMPTRST]                 = handle_vmptrst,
 	[EXIT_REASON_VMREAD]                  = handle_vmread,
-	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
+	[EXIT_REASON_VMRESUME]                = handle_vmresume,
 	[EXIT_REASON_VMWRITE]                 = handle_vmwrite,
 	[EXIT_REASON_VMOFF]                   = handle_vmoff,
 	[EXIT_REASON_VMON]                    = handle_vmon,
@@ -4935,7 +5007,8 @@ static int vmx_handle_exit(struct kvm_vc
 		       "(0x%x) and exit reason is 0x%x\n",
 		       __func__, vectoring_info, exit_reason);
 
-	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked)) {
+	if (!vmx->nested.nested_mode &&
+		unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked)) {
 		if (vmx_interrupt_allowed(vcpu)) {
 			vmx->soft_vnmi_blocked = 0;
 		} else if (vmx->vnmi_blocked_time > 1000000000LL &&
@@ -5756,6 +5829,127 @@ int prepare_vmcs_02(struct kvm_vcpu *vcp
 	return 0;
 }
 
+static int nested_vmx_run(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	vmx->nested.nested_mode = 1;
+	sync_cached_regs_to_vmcs(vcpu);
+	save_vmcs(vmx->nested.vmcs01_fields);
+
+	vmx->nested.l1_state.efer = vcpu->arch.efer;
+	if (!enable_ept)
+		vmx->nested.l1_state.cr3 = vcpu->arch.cr3;
+	vmx->nested.l1_state.cr4 = vcpu->arch.cr4;
+
+	if (cpu_has_vmx_msr_bitmap())
+		vmx->nested.l1_state.msr_bitmap = vmcs_read64(MSR_BITMAP);
+	else
+		vmx->nested.l1_state.msr_bitmap = 0;
+
+	vmx->nested.l1_state.io_bitmap_a = vmcs_read64(IO_BITMAP_A);
+	vmx->nested.l1_state.io_bitmap_b = vmcs_read64(IO_BITMAP_B);
+	vmx->nested.vmcs01 = vmx->vmcs;
+	vmx->nested.l1_state.cpu = vcpu->cpu;
+	vmx->nested.l1_state.launched = vmx->launched;
+
+	vmx->vmcs = nested_get_current_vmcs(vcpu);
+	if (!vmx->vmcs) {
+		printk(KERN_ERR "Missing VMCS\n");
+		set_rflags_to_vmx_fail_valid(vcpu);
+		return 1;
+	}
+
+	vcpu->cpu = vmx->nested.current_vmcs12->cpu;
+	vmx->launched = vmx->nested.current_vmcs12->launched;
+
+	if (!vmx->nested.current_vmcs12->launch_state || !vmx->launched) {
+		vmcs_clear(vmx->vmcs);
+		vmx->launched = 0;
+		vmx->nested.current_vmcs12->launch_state = 1;
+	}
+
+	vmx_vcpu_load(vcpu, get_cpu());
+	put_cpu();
+
+	prepare_vmcs_02(vcpu,
+		get_vmcs12_fields(vcpu), vmx->nested.vmcs01_fields);
+
+	if (get_vmcs12_fields(vcpu)->vm_entry_controls &
+	    VM_ENTRY_IA32E_MODE) {
+		if (!((vcpu->arch.efer & EFER_LMA) &&
+		      (vcpu->arch.efer & EFER_LME)))
+			vcpu->arch.efer |= (EFER_LMA | EFER_LME);
+	} else {
+		if ((vcpu->arch.efer & EFER_LMA) ||
+		    (vcpu->arch.efer & EFER_LME))
+			vcpu->arch.efer = 0;
+	}
+
+	/* vmx_set_cr0() sets the cr0 that L2 will read, to be the one that L1
+	 * dictated, and takes appropriate actions for special cr0 bits (like
+	 * real mode, etc.).
+	 */
+	vmx_set_cr0(vcpu,
+		(get_vmcs12_fields(vcpu)->guest_cr0 &
+			~get_vmcs12_fields(vcpu)->cr0_guest_host_mask) |
+		(get_vmcs12_fields(vcpu)->cr0_read_shadow &
+			get_vmcs12_fields(vcpu)->cr0_guest_host_mask));
+
+	/* However, vmx_set_cr0 incorrectly enforces KVM's relationship between
+	 * GUEST_CR0 and CR0_READ_SHADOW, e.g., that the former is the same as
+	 * the latter with with TS added if !fpu_active. We need to take the
+	 * actual GUEST_CR0 that L1 wanted, just with added TS if !fpu_active
+	 * like KVM wants (for the "lazy fpu" feature, to avoid the costly
+	 * restoration of fpu registers until the FPU is really used).
+	 */
+	vmcs_writel(GUEST_CR0, get_vmcs12_fields(vcpu)->guest_cr0 |
+		(vcpu->fpu_active ? 0 : X86_CR0_TS));
+
+	vmx_set_cr4(vcpu, get_vmcs12_fields(vcpu)->guest_cr4);
+	vmcs_writel(CR4_READ_SHADOW,
+		    get_vmcs12_fields(vcpu)->cr4_read_shadow);
+
+	/* we have to set the X86_CR0_PG bit of the cached cr0, because
+	 * kvm_mmu_reset_context enables paging only if X86_CR0_PG is set in
+	 * CR0 (we need the paging so that KVM treat this guest as a paging
+	 * guest so we can easly forward page faults to L1.)
+	 */
+	vcpu->arch.cr0 |= X86_CR0_PG;
+
+	if (enable_ept && !nested_cpu_has_vmx_ept(vcpu)) {
+		vmcs_write32(GUEST_CR3, get_vmcs12_fields(vcpu)->guest_cr3);
+		vmx->vcpu.arch.cr3 = get_vmcs12_fields(vcpu)->guest_cr3;
+	} else {
+		int r;
+		kvm_set_cr3(vcpu, get_vmcs12_fields(vcpu)->guest_cr3);
+		kvm_mmu_reset_context(vcpu);
+
+		r = kvm_mmu_load(vcpu);
+		if (unlikely(r)) {
+			printk(KERN_ERR "Error in kvm_mmu_load r %d\n", r);
+			set_rflags_to_vmx_fail_valid(vcpu);
+			/* switch back to L1 */
+			vmx->nested.nested_mode = 0;
+			vmx->vmcs = vmx->nested.vmcs01;
+			vcpu->cpu = vmx->nested.l1_state.cpu;
+			vmx->launched = vmx->nested.l1_state.launched;
+
+			vmx_vcpu_load(vcpu, get_cpu());
+			put_cpu();
+
+			return 1;
+		}
+	}
+
+	kvm_register_write(vcpu, VCPU_REGS_RSP,
+			   get_vmcs12_fields(vcpu)->guest_rsp);
+	kvm_register_write(vcpu, VCPU_REGS_RIP,
+			   get_vmcs12_fields(vcpu)->guest_rip);
+
+	return 1;
+}
+
 static struct kvm_x86_ops vmx_x86_ops = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,


-- 
Nadav Har'El                        |      Sunday, Jul 11 2010, 29 Tammuz 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |The two most common elements in the
http://nadav.harel.org.il           |universe are hydrogen and stupidity.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 0/24] Nested VMX, v5
  2010-07-11  8:27   ` Nadav Har'El
@ 2010-07-11 11:05     ` Alexander Graf
  2010-07-11 12:49       ` Nadav Har'El
  2010-07-11 13:20     ` Avi Kivity
  1 sibling, 1 reply; 147+ messages in thread
From: Alexander Graf @ 2010-07-11 11:05 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Dong, Eddie, avi, kvm


On 11.07.2010, at 10:27, Nadav Har'El wrote:

> On Fri, Jul 09, 2010, Dong, Eddie wrote about "RE: [PATCH 0/24] Nested VMX, v5":
>> 	Thanks for the posting and in general the patches are well written.
>> I like the concept of VMCSxy and I feel it is pretty clear (better than my
>> previous naming as well), but there is some confusion inside, especially
>> around the term "shadow", which I find quite hard to follow.
> 
> Hi, and thanks for the excellent ideas. As you saw, I indeed started to
> convert and converge the old terminology (including that ambiguous term
> "shadow") into the new names vmcs01, vmcs02, vmcs12 - names which we
> introduced in our technical report.
> But I have not gone all the way with these changes. I should have, and I'll
> do it now.
> 
>> 1: Basically there are 2 different types of VMCS: one is defined by hardware,
>> whose layout is unknown to the VMM; the other is defined by the VMM (this
>> patch) and used for vmcs12.
>> The former uses "struct vmcs" to describe its data instances, but the latter
>> doesn't have a clear definition (or is it struct vmcs12?). I suggest we have
>> a distinct struct for this, for example "struct sw_vmcs" (software vmcs) or
>> "struct vvmcs" (virtual vmcs).
> 
> I decided (but let me know if you have reservations) to use the name
> "struct vmcs_fields" for the memory structure that contains the long list of
> vmcs fields. I think this name describes the structure's content well.
> 
> As in the last version of the patches, this list of vmcs fields will not on
> its own be vmcs12's structure, because vmcs12, as a spec-compliant vmcs, also
> needs to contain a couple of additional fields in its beginning, and we also
> need a few more runtime fields.

Thinking about this - it would be perfectly legal to split the VMCS into two separate structs, right? You could have one struct that you map directly into the guest, so modifications to that struct don't trap. Of course the L1 guest shouldn't be able to modify all fields of the VMCS, so you'd still keep a second struct around with shadow fields. While at it, also add a bitmap to store the dirtiness status of your fields in, if you need that.

That way a nesting-aware guest could use a PV memory write instead of the (slow) instruction emulation. That should dramatically speed up nested VMX.
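
As a purely hypothetical illustration of what such a PV write could look like on the L1 side (the dirty bitmap, its layout and the bit index are invented here; no such ABI exists in these patches):

	/* L1-side sketch: write the field straight into the vmcs12 page
	 * shared with L0 and record it in a dirty bitmap, instead of exiting
	 * with a trapped VMWRITE.  PV_VMCS_DIRTY_GUEST_RIP and the bitmap
	 * are made up for illustration only.
	 */
	enum { PV_VMCS_DIRTY_GUEST_RIP = 0 };

	static void pv_vmwrite_guest_rip(struct vmcs_fields *vmcs12,
					 unsigned long *dirty_bitmap, u64 value)
	{
		vmcs12->guest_rip = value;			  /* no exit */
		__set_bit(PV_VMCS_DIRTY_GUEST_RIP, dirty_bitmap); /* tell L0 */
	}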


Alex


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 0/24] Nested VMX, v5
  2010-07-11 11:05     ` Alexander Graf
@ 2010-07-11 12:49       ` Nadav Har'El
  2010-07-11 13:12         ` Avi Kivity
  0 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-07-11 12:49 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Dong, Eddie, avi, kvm

On Sun, Jul 11, 2010, Alexander Graf wrote about "Re: [PATCH 0/24] Nested VMX, v5":
> Thinking about this - it would be perfectly legal to split the VMCS into two separate structs, right? You could have one struct that you map directly into the guest, so modifications to that struct don't trap. Of course the L1 guest shouldn't be able to modify all fields of the VMCS, so you'd still keep a second struct around with shadow fields. While at it, also add a bitmap to store the dirtiness status of your fields in, if you need that.
> 
> That way a nesting-aware guest could use a PV memory write instead of the (slow) instruction emulation. That should dramatically speed up nested VMX.

Hi,

We already tried this idea, and described the results in our tech report
(see http://www.mulix.org/pubs/turtles/h-0282.pdf). 

We didn't do things quite as cleanly as you suggested - we didn't split the
structure and make only part of it available directly to the guest. Rather,
we only did what we had to do to get the performance improvement: we modified
L1 to access the VMCS directly, assuming the nested code's vmcs12 structure
layout, instead of calling vmread/vmwrite.
instead of calling vmread/vmwrite.

As you can see in the various benchmarks in section 4 (Evaluation) of the
report, the so-called PV vmread/vmwrite method had a noticeable effect, though
perhaps not as dramatic as you hoped. For example, for the kernbench benchmark,
nested kvm overhead (over single-level kvm virtualization) came down from
14.5% to 10.3%, and for the specjbb benchmark, the overhead came down from
7.8% to 6.3%. In a microbenchmark less representative of real-life workloads,
we were able to measure a halving of the overhead by adding the PV
vmread/vmwrite.

In any case, the obvious problem with this whole idea on VMX is that it
requires a modified guest hypervisor, which reduces its usefulness.
This is why we didn't think we should "advertise" the ability to bypass
vmread/vmwrite in L1 and write directly to the vmcs12's. But Avi Kivity
already asked me to add a document about the vmcs12 internal structure,
and once I've done that, I guess you can now consider it "fair" for nesting-
aware L1 guest hypervisors to actually use that internal structure to modify
vmcs12 directly, without vmread/vmwrite and exits.

By the way, I see on the KVM Forum 2010 schedule that Eddie Dong will be
talking about "Examining KVM as Nested Virtualization Friendly Guest".
I'm looking forward to reading the proceedings (unfortunately, I won't be
able to travel to the actual meeting).

Nadav.


-- 
Nadav Har'El                        |      Sunday, Jul 11 2010, 29 Tammuz 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |I used to work in a pickle factory, until
http://nadav.harel.org.il           |I got canned.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 0/24] Nested VMX, v5
  2010-07-11 12:49       ` Nadav Har'El
@ 2010-07-11 13:12         ` Avi Kivity
  2010-07-11 15:39           ` Nadav Har'El
  0 siblings, 1 reply; 147+ messages in thread
From: Avi Kivity @ 2010-07-11 13:12 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Alexander Graf, Dong, Eddie, kvm

On 07/11/2010 03:49 PM, Nadav Har'El wrote:
>
> In any case, the obvious problem with this whole idea on VMX is that it
> requires a modified guest hypervisor, which reduces its usefulness.
> This is why we didn't think we should "advertise" the ability to bypass
> vmread/vmwrite in L1 and write directly to the vmcs12's. But Avi Kivity
> already asked me to add a document about the vmcs12 internal structure,
> and once I've done that, I guess you can now consider it "fair" for nesting-
> aware L1 guest hypervisors to actually use that internal structure to modify
> vmcs12 directly, without vmread/vmwrite and exits.
>    

No, they can't, since (for writes) L0 might cache the information and 
not read it again.  For reads, L0 might choose to update vmcs12 on demand.

A pvvmread/write needs to communicate with L0 about what fields are 
valid (likely using available and dirty bitmaps).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 0/24] Nested VMX, v5
  2010-07-11  8:27   ` Nadav Har'El
  2010-07-11 11:05     ` Alexander Graf
@ 2010-07-11 13:20     ` Avi Kivity
  1 sibling, 0 replies; 147+ messages in thread
From: Avi Kivity @ 2010-07-11 13:20 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Dong, Eddie, kvm

On 07/11/2010 11:27 AM, Nadav Har'El wrote:
>
>
>> 1: Basically there are 2 different types of VMCS: one is defined by hardware,
>> whose layout is unknown to the VMM; the other is defined by the VMM (this
>> patch) and used for vmcs12.
>> The former uses "struct vmcs" to describe its data instances, but the latter
>> doesn't have a clear definition (or is it struct vmcs12?). I suggest we have
>> a distinct struct for this, for example "struct sw_vmcs" (software vmcs) or
>> "struct vvmcs" (virtual vmcs).
>>      
> I decided (but let me know if you have reservations) to use the name
> "struct vmcs_fields" for the memory structure that contains the long list of
> vmcs fields. I think this name describes the structure's content well.
>    

I liked vvmcs myself...

> As in the last version of the patches, this list of vmcs fields will not on
> its own be vmcs12's structure, because vmcs12, as a spec-compliant vmcs, also
> needs to contain a couple of additional fields in its beginning, and we also
> need a few more runtime fields.
>    

... for the spec-compliant vmcs in L1's memory.

>> 	2: vmcsxy (vmcs12, vmcs02, vmcs01) are names for instances of either
>> "struct vmcs" or "struct sw_vmcs", not for the structs themselves. A clear
>> distinction between data structure and instance helps, IMO.
>>      
> I agree with you that using the name "vmcs12" for both the type (struct vmcs12)
> and instance of another type (struct vmcs_fields *vmcs12) is somewhat strange,
> but I can only think of two alternatives:
>
> 1. Invent a new name for "struct vmcs12", say "struct sw_vmcs" as you
>     suggested. But I think it will just make things less clear, because we
>     replace the self-explanatory name vmcs12 by a less clear name.
>
> 2. Stop separating "struct vmcs_fields" (formerly struct shadow_vmcs) and
>     "struct vmcs12" which contains it and a few more fields - and instead
>     put everything in one structure (and call that sw_vmcs or whatever).
>    

I like this.

>     These extra fields will not be useful for vmcs01, but it's not a terrible
>     waste (because vmcs01 already doesn't use a lot of these fields).
>    

You don't really need vmcs01 to be a vvmcs (or sw_vmcs).  IIRC you only 
need it when copying around vmcss, which you can avoid completely by 
initializing vmcs01 and vmcs02 using common initialization routines for 
the host part.
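
For illustration, such a common routine could look roughly like the sketch below; the name and the exact field list are mine, not taken from the patches:

	/* Hypothetical shared helper: program the host-state fields that are
	 * the same for vmcs01 and vmcs02, so neither needs to copy them from
	 * the other.  Field list abbreviated.
	 */
	static void vmx_setup_constant_host_state(void)
	{
		vmcs_write16(HOST_CS_SELECTOR, __KERNEL_CS);
		vmcs_write16(HOST_DS_SELECTOR, __KERNEL_DS);
		vmcs_write16(HOST_ES_SELECTOR, __KERNEL_DS);
		vmcs_write16(HOST_SS_SELECTOR, __KERNEL_DS);
		vmcs_writel(HOST_CR4, read_cr4());
		/* ... CR0/CR3, TR/GDTR/IDTR bases, SYSENTER MSRs, HOST_RIP ... */
	}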

> Personally, I find these two alternatives even less appealing than the
> current alternative (with "struct vmcs12" describing vmcs12's type, and
> it contains a struct vmcs_fields inside). What do you think?
>    

IMO, vmcs_fields is artificial.  As soon as you eliminate the vmcs copy, 
you won't have any use for it, and then you can fold it into its container.

>> 5: Guest VMPTRLD emulation. The current patch creates a vmcs02 instance each
>> time the guest does VMPTRLD, and frees the instance at VMCLEAR. The code may
>> fail if the number of (un-VMCLEARed) vmcs exceeds a certain threshold, to
>> avoid denial of service. That is fine, but it brings additional complexity
>> and may cost a lot of memory. I think we can emulate this using the concept
>> of a "cached vmcs", in case the L1 VMM doesn't do VMCLEAR in time.  The L0
>> VMM can simply flush those vmcs02 to guest memory, i.e. vmcs12, as needed.
>> For example, if the number of cached vmcs02 exceeds 10, we can flush
>> automatically.
>>      
> Right. I've already discussed this idea over the list with Avi Kivity, and
> it is on my todo list and definitely should be done.
> The current approach is simpler, because I don't need to add special code for
> rebuilding a forgotten vmcs02 from vmcs12 - the current prepare_vmcs02 only
> updates some of the fields, and I'll need to do some testing to figure out
> what exactly is missing for a full rebuild.
>    

You already support "full rebuild" - that's what happens when you first 
see a vmcs, when you launch a guest.

> I think the current code is "good enough" as an ad-interim solution, because
> users that follow the spec will not forget to VMCLEAR anyway (and if they
> do, only they will suffer). And I wouldn't say that "a lot of memory" is
> involved - at worst, an L1 can now cause 256 pages, or 1 MB, to be wasted on
> this. More normally, an L1 will only have a few L2 guests, and only spend
> a few pages for this - certainly much much less than he'd spend on actually
> holding the L2's memory.
>    

It's perfectly legitimate for a guest to disappear a vmcs.  It might 
swap it to disk, or move it to a separate NUMA node.  While I don't 
expect the first, the second will probably happen sometime.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 0/24] Nested VMX, v5
  2010-07-11 13:12         ` Avi Kivity
@ 2010-07-11 15:39           ` Nadav Har'El
  2010-07-11 15:45             ` Avi Kivity
  0 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-07-11 15:39 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Alexander Graf, Dong, Eddie, kvm

On Sun, Jul 11, 2010, Avi Kivity wrote about "Re: [PATCH 0/24] Nested VMX, v5":
> >nesting-
> >aware L1 guest hypervisors to actually use that internal structure to 
> >modify
> >vmcs12 directly, without vmread/vmwrite and exits.
> >   
> 
> No, they can't, since (for writes) L0 might cache the information and 
> not read it again.  For reads, L0 might choose to update vmcs12 on demand.

Well, in the current version of the nested code, all L0 does on an L1 vmwrite
is to update the in-memory vmcs12 structure. It does not update vmcs02,
nor cache anything, nor remember what has changed and what hasn't. So replacing
it with a direct write to the memory structure should be fine...
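
(Roughly speaking, and without quoting the actual handle_vmwrite from the patch set, the emulated VMWRITE today amounts to something like the sketch below; vmcs02 is only refreshed later, in prepare_vmcs_02(), on the next entry into L2.)

	/* Simplified sketch of the current behaviour described above; the
	 * real handler also decodes the operands and validates the field
	 * encoding.
	 */
	static void vmcs12_store(struct vmcs_fields *vmcs12,
				 unsigned long field, u64 value)
	{
		switch (field) {
		case GUEST_RIP:
			vmcs12->guest_rip = value;
			break;
		case GUEST_RSP:
			vmcs12->guest_rsp = value;
			break;
		/* ... one case per supported VMCS field ... */
		}
	}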

Of course, this situation isn't optimal, and we *should* optimize the number of
unnecessary vmwrites on L2 entry and exit (and we actually tried some of this
in our tech report), but it's not in the current patch set.  When we do these
kinds of optimizations, you're right that:

> A pvvmread/write needs to communicate with L0 about what fields are 
> valid (likely using available and dirty bitmaps).


-- 
Nadav Har'El                        |           Sunday, Jul 11 2010, 1 Av 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |If marriage was illegal, only outlaws
http://nadav.harel.org.il           |would have in-laws.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 0/24] Nested VMX, v5
  2010-07-11 15:39           ` Nadav Har'El
@ 2010-07-11 15:45             ` Avi Kivity
  0 siblings, 0 replies; 147+ messages in thread
From: Avi Kivity @ 2010-07-11 15:45 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Alexander Graf, Dong, Eddie, kvm

On 07/11/2010 06:39 PM, Nadav Har'El wrote:
> On Sun, Jul 11, 2010, Avi Kivity wrote about "Re: [PATCH 0/24] Nested VMX, v5":
>    
>>> nesting-
>>> aware L1 guest hypervisors to actually use that internal structure to
>>> modify
>>> vmcs12 directly, without vmread/vmwrite and exits.
>>>
>>>        
>> No, they can't, since (for writes) L0 might cache the information and
>> not read it again.  For reads, L0 might choose to update vmcs12 on demand.
>>      
> Well, in the current version of the nested code, all L0 does on an L1 vmwrite
> is to update the in-memory vmcs12 structure. It does not update vmcs02,
> nor cache anything, nor remember what has changed and what hasn't. So replacing
> it with a direct write to the memory structure should be fine...
>    

Note you said "current version".  What if this later changes?

So, we cannot allow a guest to access vmcs12 directly.  There has to be 
a protocol that allows the guest to know what it can touch and what it 
can't (or, tell the host what the guest touched and what it hasn't).  
Otherwise, we lose the ability to optimize.

> Of course, this situation isn't optimal, and we *should* optimize the number of
> unnecessary vmwrites on L2 entry and exit (and we actually tried some of this
> in our tech report), but it's not in the current patch set.  When we do these
> kinds of optimizations, you're right that:
>
>    
>> A pvvmread/write needs to communicate with L0 about what fields are
>> valid (likely using available and dirty bitmaps).
>>      

It's right even before we do these optimizations, so a pv guest written 
before the optimizations can run on an optimized host.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 0/24] Nested VMX, v5
  2010-06-13 12:22 [PATCH 0/24] Nested VMX, v5 Nadav Har'El
                   ` (25 preceding siblings ...)
  2010-07-09  8:59 ` Dong, Eddie
@ 2010-07-15  3:27 ` Sheng Yang
  26 siblings, 0 replies; 147+ messages in thread
From: Sheng Yang @ 2010-07-15  3:27 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: avi, kvm

On Sunday 13 June 2010 20:22:33 Nadav Har'El wrote:
> Hi Avi,
> 
> This is a followup of our nested VMX patches that Orit Wasserman posted in
> December. We've addressed most of the comments and concerns that you and
> others on the mailing list had with the previous patch set. We hope you'll
> find these patches easier to understand, and suitable for applying to KVM.
> 
> 
> The following 24 patches implement nested VMX support. The patches enable a
> guest to use the VMX APIs in order to run its own nested guests. I.e., it
> allows running hypervisors (that use VMX) under KVM. We describe the theory
> behind this work, our implementation, and its performance characteristics,
> in IBM Research report H-0282, "The Turtles Project: Design and
> Implementation of Nested Virtualization", available at:
> 
> 	http://bit.ly/a0o9te
> 
> The current patches support running Linux under a nested KVM using shadow
> page table (with bypass_guest_pf disabled). They support multiple nested
> hypervisors, which can run multiple guests. Only 64-bit nested hypervisors
> are supported. SMP is supported. Additional patches for running Windows
> under nested KVM, and Linux under nested VMware server, and support for
> nested EPT, are currently running in the lab, and will be sent as
> follow-on patchsets.

Hi Nadav

Do you have a tree or code base and instructions for trying this patchset? I've spent
some time on it, but can't get it right...

--
regards
Yang, Sheng

> 
> These patches were written by:
>      Abel Gordon, abelg <at> il.ibm.com
>      Nadav Har'El, nyh <at> il.ibm.com
>      Orit Wasserman, oritw <at> il.ibm.com
>      Ben-Ami Yassor, benami <at> il.ibm.com
>      Muli Ben-Yehuda, muli <at> il.ibm.com
> 
> With contributions by:
>      Anthony Liguori, aliguori <at> us.ibm.com
>      Mike Day, mdday <at> us.ibm.com
> 
> This work was inspired by the nested SVM support by Alexander Graf and
> Joerg Roedel.
> 
> 
> Changes since v4:
> * Rebased to the current KVM tree.
> * Support for lazy FPU loading.
> * Implemented about 90 requests and suggestions made on the mailing list
>   regarding the previous version of this patch set.
> * Split the changes into many more, and better documented, patches.
> 
> --
> Nadav Har'El
> IBM Haifa Research Lab
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 7/24] Understanding guest pointers to vmcs12 structures
  2010-06-15 12:14   ` Gleb Natapov
@ 2010-08-01 15:16     ` Nadav Har'El
  2010-08-01 15:25       ` Gleb Natapov
  0 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-08-01 15:16 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: avi, kvm

On Tue, Jun 15, 2010, Gleb Natapov wrote about "Re: [PATCH 7/24] Understanding guest pointers to vmcs12 structures":
> > +/*
> > + * Decode the memory-address operand of a vmx instruction, according to the
> > + * Intel spec.
> > + */
>...
> > +static gva_t get_vmx_mem_address(struct kvm_vcpu *vcpu,
> > +				 unsigned long exit_qualification,
> > +				 u32 vmx_instruction_info)
> > +{
>...
> > +	if (is_reg) {
> > +		kvm_queue_exception(vcpu, UD_VECTOR);
> > +		return 0;
> Isn't zero a legitimate address for vmx operation?

Thanks. Please excuse my naivety, but is address 0 actually considered a
usable guest virtual address? If it is, do we have any possible value which is
considered invalid? Perhaps -1ull? I see that -1ull is used in a few places
in vmx.c, for example.

If all gva_t turn out to actually be valid addresses, I'll need to move to a
more complex (and uglier) success flag approach :(

-- 
Nadav Har'El                        |          Sunday, Aug  1 2010, 22 Av 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |The only "intuitive" interface is the
http://nadav.harel.org.il           |nipple. After that, it's all learned.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 7/24] Understanding guest pointers to vmcs12 structures
  2010-08-01 15:16     ` Nadav Har'El
@ 2010-08-01 15:25       ` Gleb Natapov
  2010-08-02  8:57         ` Nadav Har'El
  0 siblings, 1 reply; 147+ messages in thread
From: Gleb Natapov @ 2010-08-01 15:25 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: avi, kvm

On Sun, Aug 01, 2010 at 06:16:59PM +0300, Nadav Har'El wrote:
> On Tue, Jun 15, 2010, Gleb Natapov wrote about "Re: [PATCH 7/24] Understanding guest pointers to vmcs12 structures":
> > > +/*
> > > + * Decode the memory-address operand of a vmx instruction, according to the
> > > + * Intel spec.
> > > + */
> >...
> > > +static gva_t get_vmx_mem_address(struct kvm_vcpu *vcpu,
> > > +				 unsigned long exit_qualification,
> > > +				 u32 vmx_instruction_info)
> > > +{
> >...
> > > +	if (is_reg) {
> > > +		kvm_queue_exception(vcpu, UD_VECTOR);
> > > +		return 0;
> > Isn't zero a legitimate address for vmx operation?
> 
> Thanks. Please excuse my naivety, but is address 0 actually considered a
> usable guest virtual address? If it is, do we have any possible value which is
> considered invalid? Perhaps -1ull? I see that -1ull is used in a few places
> in vmx.c, for example.
> 
The guest can use any valid virtual address. There is UNMAPPED_GVA (~(gpa_t)0), which
at least cannot be valid if the address that your function returns has to be
page aligned. And not all virtual addresses are valid, BTW: for a 32-bit
guest a virtual address cannot be bigger than 32 bits, and for a 64-bit guest
a virtual address should be in canonical form.
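
(For illustration only, a check along those lines might look like the sketch below; the helper name is made up and this is not part of the posted patches.)

	/* Hypothetical: reject guest virtual addresses that can never be
	 * valid, per the description above.
	 */
	static bool nested_gva_plausible(struct kvm_vcpu *vcpu, gva_t gva)
	{
		if (!is_long_mode(vcpu))
			return gva == (u32)gva;	/* must fit in 32 bits */
		/* 64-bit: must be canonical (bits 63:47 sign-extend bit 47) */
		return ((s64)(gva << 16) >> 16) == (s64)gva;
	}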

> If all gva_t turn out to actually be valid addresses, I'll need to move to a
> more complex (and uglier) success flag approach :(
> 
> -- 
> Nadav Har'El                        |          Sunday, Aug  1 2010, 22 Av 5770
> nyh@math.technion.ac.il             |-----------------------------------------
> Phone +972-523-790466, ICQ 13349191 |The only "intuitive" interface is the
> http://nadav.harel.org.il           |nipple. After that, it's all learned.

--
			Gleb.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 7/24] Understanding guest pointers to vmcs12 structures
  2010-08-01 15:25       ` Gleb Natapov
@ 2010-08-02  8:57         ` Nadav Har'El
  0 siblings, 0 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-08-02  8:57 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: avi, kvm

On Sun, Aug 01, 2010, Gleb Natapov wrote about "Re: [PATCH 7/24] Understanding guest pointers to vmcs12 structures":
> A guest can use any valid virtual address. There is UNMAPPED_GVA (~(gpa_t)0), which
> at least cannot be valid if the address that your function returns has to be
> page-aligned.

Thanks. Unfortunately, I also use this function to decode non-page-aligned
addresses (such as an address given to VMWRITE to take a value from), so I
cannot use this nice trick.

> And not all virtual addresses are valid, BTW: for a 32-bit guest
> a virtual address cannot be wider than 32 bits, and for a 64-bit guest a
> virtual address must be in canonical form.

I guess this means that I can easily find a gva_t which is always invalid -
e.g., 1<<63 isn't a legal 32-bit address (of course), and also isn't a legal
canonical-form 64 (or rather 48)-bit address - so I could use that as a flag.

But I decided that, to make things clearer, I'll change the function to return
a success flag, and to return the gva_t itself through a given pointer:

static int get_vmx_mem_address(struct kvm_vcpu *vcpu,
                                 unsigned long exit_qualification,
                                 u32 vmx_instruction_info, gva_t *ret)

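Callers would then do something like this (just a sketch of the intended
calling convention; the exact error handling inside the decoder is still
being reworked):

        gva_t gva;

        if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
                        vmcs_read32(VMX_INSTRUCTION_INFO), &gva))
                return 1;       /* the decoder has already queued an exception */
        /* ... use gva ... */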

-- 
Nadav Har'El                        |          Monday, Aug  2 2010, 22 Av 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Can Microsoft make a product that doesn't
http://nadav.harel.org.il           |suck? Yes, a vacuum cleaner!

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 7/24] Understanding guest pointers to vmcs12 structures
  2010-06-14  8:48   ` Avi Kivity
@ 2010-08-02 12:25     ` Nadav Har'El
  2010-08-02 13:38       ` Avi Kivity
  0 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-08-02 12:25 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

Hi,

On Mon, Jun 14, 2010, Avi Kivity wrote about "Re: [PATCH 7/24] Understanding guest pointers to vmcs12 structures":
> On 06/13/2010 03:26 PM, Nadav Har'El wrote:
> >This patch includes a couple of utility functions for extracting pointer
> >operands of VMX instructions issued by L1 (a guest hypervisor), and
> >translating guest-given vmcs12 virtual addresses to guest-physical 
>>addresses.
>...
> >+#define VMX_OPERAND_IS_REG(vii)		((vii)&  (1u<<  10))
>...
> Since those defines are used just once, you can fold them into their 
> uses.  It doesn't add much to repeat the variable name.

Actually a few of these macros were used several times, but you're right,
it didn't make anything clearer and just made the code uglier. So I folded
them.

> >+	/* offset = Base + [Index * Scale] + Displacement */
> >+	addr = vmx_get_segment_base(vcpu, seg_reg);
> >+	if (base_is_valid)
> >+		addr += kvm_register_read(vcpu, base_reg);
> >+	if (index_is_valid)
> >+		addr += kvm_register_read(vcpu, index_reg)<<scaling;
> >+	addr += exit_qualification; /* holds the displacement */
> >   
> 
> Do we need a segment limit and access rights check?

You are absolutely right. The instructions we're emulating (VMREAD, VMWRITE,
VMPTRLD, etc.) should raise #GP on a number of segmentation errors, including
segment-limit violations, execute-only segments, non-canonical 64-bit addresses,
and other unlikely error cases.
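
For concreteness, one of those deferred checks would look roughly like this
(a sketch only; the canonical-address test is written out inline, and raising
#GP(0) for it is my reading of the SDM):

        /* 64-bit mode: a non-canonical operand address should raise #GP(0) */
        if (is_long_mode(vcpu) && (((s64)addr << 16) >> 16) != (s64)addr) {
                kvm_queue_exception_e(vcpu, GP_VECTOR, 0);
                return 1;
        }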

Achieving 100% accurate emulation in the error path would require quite a bit
of new code (here, and in many other places throughout the nested VMX code)
that isn't necessary for running a correctly-written guest hypervisor (such
as KVM or VMware). At worst, not emulating the error path accurately might
allow a broken L1 to do bad things to itself, but it doesn't allow it
to do anything bad to L0 or other L1's.

Would you accept that I'll add a TODO in the code here (and in similar cases)
and leave perfecting the error path to a later patch?

Thanks,
Nadav.


-- 
Nadav Har'El                        |          Monday, Aug  2 2010, 22 Av 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Wear short sleeves! Support your right to
http://nadav.harel.org.il           |bare arms!

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 7/24] Understanding guest pointers to vmcs12 structures
  2010-08-02 12:25     ` Nadav Har'El
@ 2010-08-02 13:38       ` Avi Kivity
  0 siblings, 0 replies; 147+ messages in thread
From: Avi Kivity @ 2010-08-02 13:38 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

  On 08/02/2010 03:25 PM, Nadav Har'El wrote:
>
>
>>> +	/* offset = Base + [Index * Scale] + Displacement */
>>> +	addr = vmx_get_segment_base(vcpu, seg_reg);
>>> +	if (base_is_valid)
>>> +		addr += kvm_register_read(vcpu, base_reg);
>>> +	if (index_is_valid)
>>> +		addr += kvm_register_read(vcpu, index_reg)<<scaling;
>>> +	addr += exit_qualification; /* holds the displacement */
>>>
>> Do we need a segment limit and access rights check?
> You are absolutely right. The instructions we're emulating (VMREAD, VMWRITE,
> VMPTRLD, etc.) should raise #GP on a number of segmentation errors, including
> segment-limit violations, execute-only segments, non-canonical 64-bit addresses,
> and other unlikely error cases.
>
> Achieving 100% accurate emulation in the error path would require quite a bit
> of new code (here, and in many other places throughout the nested VMX code)
> that isn't necessary for running a correctly-written guest hypervisor (such
> as KVM or VMware). At worst, not emulating the error path accurately might
> allow a broken L1 to do bad things to itself, but it doesn't allow it
> to do anything bad to L0 or other L1's.
>
> Would you accept that I'll add a TODO in the code here (and in similar cases)
> and leave perfecting the error path to a later patch?

Given that the x86 emulator doesn't get this right, yes.  But please do 
document all the points where this is wrong.  Silent failure is the 
worst kind of failure; at least that way we'll know where to look.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 8/24] Hold a vmcs02 for each vmcs12
  2010-07-06  9:50   ` Dong, Eddie
@ 2010-08-02 13:38     ` Nadav Har'El
  0 siblings, 0 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-08-02 13:38 UTC (permalink / raw)
  To: Dong, Eddie; +Cc: avi, kvm

On Tue, Jul 06, 2010, Dong, Eddie wrote about "RE: [PATCH 8/24] Hold a vmcs02 for each vmcs12":
> > +/* Allocate an L0 VMCS (vmcs02) for the current L1 VMCS (vmcs12), if
> > one + * does not already exist. The allocation is done in L0 memory,
>...
> > +static int nested_create_current_vmcs(struct kvm_vcpu *vcpu)
> > +{
> > +	struct vmcs_list *new_l2_guest;
> > +	struct vmcs *l2_vmcs;
> > +
> > +	if (nested_get_current_vmcs(vcpu))
> > +		return 0; /* nothing to do - we already have a VMCS */
> > +
> > +	if (to_vmx(vcpu)->nested.l2_vmcs_num >= NESTED_MAX_VMCS)
> > +		return -ENOMEM;
> > +
> > +	new_l2_guest = (struct vmcs_list *)
> > +		kmalloc(sizeof(struct vmcs_list), GFP_KERNEL);
> > +	if (!new_l2_guest)
> > +		return -ENOMEM;
> > +
> > +	l2_vmcs = alloc_vmcs();
> 
> I didn't see where it was used. Hints on the usage?

Hi, I'm afraid I didn't understand the question. Where is what used?

What nested_create_current_vmcs does (as the comment above it explains) is
to allocate a vmcs02 for the current vmcs12, or return one that has been
previously allocated (and saved in a list we hold of these mappings).

The alloc_vmcs() call you pointed to happens when there isn't yet a vmcs02
for this vmcs12, and we need to allocate a new one. A few lines down, this
new mapping from the vmcs12 address to the vmcs02 is stored in new_l2_guest,
which is then inserted into the list of mappings.
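
For reference, the lookup used by the "nothing to do" case above is just a
walk over that same list - roughly this (a sketch from memory of
nested_get_current_vmcs(); the real function is elsewhere in the patch series):

        static struct vmcs *nested_get_current_vmcs(struct kvm_vcpu *vcpu)
        {
                struct vmcs_list *list_item;

                list_for_each_entry(list_item,
                                &to_vmx(vcpu)->nested.l2_vmcs_list, list)
                        if (list_item->vmcs_addr ==
                                        to_vmx(vcpu)->nested.current_vmptr)
                                return list_item->l2_vmcs;
                return NULL;
        }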

Does this answer your question?

By the way, in my latest version of the code, l2_vmcs is now better named,
"vmcs02" :-)


> > +	if (!l2_vmcs) {
> > +		kfree(new_l2_guest);
> > +		return -ENOMEM;
> > +	}
> > +
> > +	new_l2_guest->vmcs_addr = to_vmx(vcpu)->nested.current_vmptr;
> > +	new_l2_guest->l2_vmcs = l2_vmcs;
> > +	list_add(&(new_l2_guest->list),
> > &(to_vmx(vcpu)->nested.l2_vmcs_list));
> > +	to_vmx(vcpu)->nested.l2_vmcs_num++; +	return 0;
> > +}

-- 
Nadav Har'El                        |          Monday, Aug  2 2010, 22 Av 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |A computer program does what you tell it
http://nadav.harel.org.il           |to do, not what you want it to do.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 9/24] Implement VMCLEAR
  2010-07-06  2:56   ` Dong, Eddie
@ 2010-08-03 12:12     ` Nadav Har'El
  0 siblings, 0 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-08-03 12:12 UTC (permalink / raw)
  To: Dong, Eddie; +Cc: avi, kvm

On Tue, Jul 06, 2010, Dong, Eddie wrote about "RE: [PATCH 9/24] Implement VMCLEAR":
> Nadav Har'El wrote:
> > This patch implements the VMCLEAR instruction.
>...
> The SDM specifies alignment, range, and reserved-bit checks, which may generate VMfail(VMCLEAR with invalid physical address),
> as well as an "addr != VMXON pointer" check.
> Missed?

Right. I will add some of the missing checks - e.g., currently if the given
address is not page-aligned, I chop off the last bits and pretend that it
is, which can cause problems (although not for correctly-written hypervisors).
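
Concretely, the alignment check I have in mind is something along these lines
(a sketch, written in terms of the failValid/succeed helpers discussed
elsewhere in this thread):

        if (!IS_ALIGNED(vmcs12_addr, PAGE_SIZE)) {
                nested_vmx_failValid(vcpu, VMXERR_VMCLEAR_INVALID_ADDRESS);
                skip_emulated_instruction(vcpu);
                return 1;
        }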

About the missing addr != VMXON pointer check: as I explained in a comment in
the code (handle_vmon()), this was a deliberate omission. The current
implementation doesn't store anything in the VMXON page (and I see no reason
why this would change in the future), so the VMXON emulation (handle_vmon())
doesn't even bother to save the pointer it is given, and VMCLEAR and VMPTRLD
don't check that the address they are given is different from that pointer,
since there is no real cause for concern even if the two coincide.

I can quite easily add the missing code to save the vmxon pointer and check
it on vmclear/vmptrld, but frankly, wouldn't it be rather pointless?
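
If we do decide it's worth it, the check itself would be trivial - something
like the following sketch (it assumes a new nested.vmxon_ptr field saved by
handle_vmon(), which the current code does not have):

        if (vmcs12_addr == vmx->nested.vmxon_ptr) {
                nested_vmx_failValid(vcpu, VMXERR_VMCLEAR_VMXON_POINTER);
                skip_emulated_instruction(vcpu);
                return 1;
        }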

> The SDM has a formal definition of VMsucceed. Clearing CF/ZF only is not sufficient, as SDM vol 2B, section 5.2 mentions.
> Any special concern here?
> 
> BTW, should we define formal VMfail() & VMsucceed() APIs, for easier understanding and a direct mapping to the SDM?

This is a good idea, and I'll do that.

-- 
Nadav Har'El                        |         Tuesday, Aug  3 2010, 23 Av 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Sign in zoo: Do not feed the animals. If
http://nadav.harel.org.il           |you have food give it to the guard on duty

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 13/24] Implement VMREAD and VMWRITE
  2010-06-16 15:03   ` Gleb Natapov
@ 2010-08-04 11:46     ` Nadav Har'El
  0 siblings, 0 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-08-04 11:46 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: avi, kvm

On Wed, Jun 16, 2010, Gleb Natapov wrote about "Re: [PATCH 13/24] Implement VMREAD and VMWRITE":
> > +		set_rflags_to_vmx_fail_valid(vcpu);
> > +		vmcs_write32(VM_INSTRUCTION_ERROR, 12);
> VM_INSTRUCTION_ERROR is read-only, and when do you transfer it to vmcs12 anyway?
> I think set_rflags_to_vmx_fail_valid() should get vm_instruction_error
> as a parameter and put it into vmcs12; that way you'll never forget to
> provide an error code in the fail_valid case - the compiler will remind you.

Good catch, and I now do exactly what you suggested.

Both you and Eddie Dong noticed that the functions that set the success and
failure flags weren't quite doing the right thing, and certainly the
vm_instruction_error needs to be set on vmcs12, not vmcs02 - and this needs
to be done on every failValid - not only in some of the places, as the code
previously did. I'm fixing all these cases.

I attach a new patch with just the 3 success/failure functions, and the list
of error codes (from the spec, vol 2B table 5-1).

> What about checking that vmcs field is read only?

Good idea - I'll do that.

----
Subject: [PATCH 09/26] nVMX: Success/failure of VMX instructions.

VMX instructions specify success or failure by setting certain RFLAGS bits.
This patch contains common functions to do this, and they will be used in
the following patches which emulate the various VMX instructions.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/include/asm/vmx.h |   31 +++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx.c         |   30 ++++++++++++++++++++++++++++++
 2 files changed, 61 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-08-04 14:40:56.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-08-04 14:40:56.000000000 +0300
@@ -3817,6 +3817,36 @@ static int read_guest_vmcs_gpa(struct kv
 	return 0;
 }
 
+/*
+ * The following 3 functions, nested_vmx_succeed()/failValid()/failInvalid(),
+ * set the success or error code of an emulated VMX instruction, as specified
+ * by Vol 2B, VMX Instruction Reference, "Conventions".
+ */
+static void nested_vmx_succeed(struct kvm_vcpu *vcpu)
+{
+	vmx_set_rflags(vcpu, vmx_get_rflags(vcpu)
+			& ~(X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF |
+		    	    X86_EFLAGS_ZF | X86_EFLAGS_SF | X86_EFLAGS_OF));
+}
+
+static void nested_vmx_failInvalid(struct kvm_vcpu *vcpu)
+{
+	vmx_set_rflags(vcpu, (vmx_get_rflags(vcpu)
+			& ~(X86_EFLAGS_PF | X86_EFLAGS_AF | X86_EFLAGS_ZF |
+			    X86_EFLAGS_SF | X86_EFLAGS_OF))
+			| X86_EFLAGS_CF);
+}
+
+static void nested_vmx_failValid(struct kvm_vcpu *vcpu,
+					u32 vm_instruction_error)
+{
+	vmx_set_rflags(vcpu, (vmx_get_rflags(vcpu)
+			& ~(X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF |
+			    X86_EFLAGS_SF | X86_EFLAGS_OF))
+			| X86_EFLAGS_ZF);
+	get_vmcs12_fields(vcpu)->vm_instruction_error = vm_instruction_error;
+}
+
 static int handle_invlpg(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
--- .before/arch/x86/include/asm/vmx.h	2010-08-04 14:40:56.000000000 +0300
+++ .after/arch/x86/include/asm/vmx.h	2010-08-04 14:40:56.000000000 +0300
@@ -409,4 +409,35 @@ struct vmx_msr_entry {
 	u64 value;
 } __aligned(16);
 
+/*
+ * VM-instruction error numbers
+ */
+enum vm_instruction_error_number {
+	VMXERR_VMCALL_IN_VMX_ROOT_OPERATION = 1,
+	VMXERR_VMCLEAR_INVALID_ADDRESS = 2,
+	VMXERR_VMCLEAR_VMXON_POINTER = 3,
+	VMXERR_VMLAUNCH_NONCLEAR_VMCS = 4,
+	VMXERR_VMRESUME_NONLAUNCHED_VMCS = 5,
+	VMXERR_VMRESUME_CORRUPTED_VMCS = 6,
+	VMXERR_ENTRY_INVALID_CONTROL_FIELD = 7,
+	VMXERR_ENTRY_INVALID_HOST_STATE_FIELD = 8,
+	VMXERR_VMPTRLD_INVALID_ADDRESS = 9,
+	VMXERR_VMPTRLD_VMXON_POINTER = 10,
+	VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID = 11,
+	VMXERR_UNSUPPORTED_VMCS_COMPONENT = 12,
+	VMXERR_VMWRITE_READ_ONLY_VMCS_COMPONENT = 13,
+	VMXERR_VMXON_IN_VMX_ROOT_OPERATION = 15,
+	VMXERR_ENTRY_INVALID_EXECUTIVE_VMCS_POINTER = 16,
+	VMXERR_ENTRY_NONLAUNCHED_EXECUTIVE_VMCS = 17,
+	VMXERR_ENTRY_EXECUTIVE_VMCS_POINTER_NOT_VMXON_POINTER = 18,
+	VMXERR_VMCALL_NONCLEAR_VMCS = 19,
+	VMXERR_VMCALL_INVALID_VM_EXIT_CONTROL_FIELDS = 20,
+	VMXERR_VMCALL_INCORRECT_MSEG_REVISION_ID = 22,
+	VMXERR_VMXOFF_UNDER_DUAL_MONITOR_TREATMENT_OF_SMIS_AND_SMM = 23,
+	VMXERR_VMCALL_INVALID_SMM_MONITOR_FEATURES = 24,
+	VMXERR_ENTRY_INVALID_VM_EXECUTION_CONTROL_FIELDS_IN_EXECUTIVE_VMCS = 25,
+	VMXERR_ENTRY_EVENTS_BLOCKED_BY_MOV_SS = 26,
+	VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID = 28,
+};
+
 #endif

-- 
Nadav Har'El                        |       Wednesday, Aug  4 2010, 24 Av 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |"A mathematician is a device for turning
http://nadav.harel.org.il           |coffee into theorems" -- P. Erdos

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 13/24] Implement VMREAD and VMWRITE
  2010-06-16 14:48     ` Gleb Natapov
@ 2010-08-04 13:42       ` Nadav Har'El
  0 siblings, 0 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-08-04 13:42 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Avi Kivity, kvm

On Wed, Jun 16, 2010, Gleb Natapov wrote about "Re: [PATCH 13/24] Implement VMREAD and VMWRITE":
> On Mon, Jun 14, 2010 at 12:36:02PM +0300, Avi Kivity wrote:
> > vmread doesn't support 64-bit writes to memory outside long mode, so
> > you'll have to truncate the write.
> > 
> > I think you'll be better off returning a 32-bit size in
> > vmcs_field_size() in these cases.
> > 
> Actually, the write should always be 32 bits long outside IA-32e mode and
> 64 bits long in 64-bit mode. Unused bits should be set to zero.

Thanks, good catch. Fixed.

The code now looks like:

        u64 field_value;
        if (!vmcs12_read_any(vcpu, field, &field_value))
                return 0;

        /* It's ok to use *_system, because handle_vmread verifies cpl=0 */
        kvm_write_guest_virt_system(gva, &field_value,
                             (is_long_mode(vcpu) ? 8 : 4), vcpu, NULL);
        return 1;

with vmcs12_read_any() reading the field, whatever its length, into a 64-bit
integer (zero-padding if the field is shorter), and then the write is either
64 or 32 bits depending only on is_long_mode(), not on the field's length.
A write may end up truncating the field, or zero-padding it, as necessary.

-- 
Nadav Har'El                        |       Wednesday, Aug  4 2010, 24 Av 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |The message above is just this
http://nadav.harel.org.il           |signature's way of propagating itself.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 13/24] Implement VMREAD and VMWRITE
  2010-06-14  9:36   ` Avi Kivity
  2010-06-16 14:48     ` Gleb Natapov
@ 2010-08-04 16:09     ` Nadav Har'El
  2010-08-04 16:41       ` Avi Kivity
  1 sibling, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-08-04 16:09 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

On Mon, Jun 14, 2010, Avi Kivity wrote about "Re: [PATCH 13/24] Implement VMREAD and VMWRITE":
> >+#ifdef CONFIG_X86_64
> >+	switch (vmcs_field_type(field)) {
> >+	case VMCS_FIELD_TYPE_U64: case VMCS_FIELD_TYPE_ULONG:
> >+		if (!is_long_mode(vcpu)) {
> >+			kvm_register_write(vcpu, reg+1, field_value>>  32);
> >   
> 
> What's this reg+1 thing?  I thought vmread simply ignores the upper half.

Thanks. Now that I look at this, I really can't figure out what it was
supposed to be doing. Maybe it was supposed to attempt to support running
64-bit guests on a 32-bit host, or something, I don't know. Anyway, I
removed it now.

> >+	kvm_write_guest_virt_system(gva,&field_value,
> >+			     vmcs_field_size(vmcs_field_type(field), vcpu),
> >+			     vcpu, NULL);
> >   
> 
> vmread doesn't support 64-bit writes to memory outside long mode, so 
> you'll have to truncate the write.
> 
> I think you'll be better off returning a 32-bit size in 
> vmcs_field_size() in these cases.

I think Gleb's correction (see my separate reply to him) was more accurate,
that the length of the write actually has nothing to do with the field size -
in 32-bit mode we always write 4 bytes, and in 64-bit mode, we always write 8
bytes - even if the given field to VMREAD was shorter or longer than those
sizes.

So now I have the code like this:
                kvm_write_guest_virt_system(gva, &field_value,
                             (is_long_mode(vcpu) ? 8 : 4), vcpu, NULL);

> >+	if (!nested_map_current(vcpu)) {
> >+		printk(KERN_INFO "%s invalid shadow vmcs\n", __func__);
> >+		set_rflags_to_vmx_fail_invalid(vcpu);
> >+		return 1;
> >+	}
> >   
> 
> Can do the read_any() here.

Right, and indeed it will make the code look better. Thanks, done.

> >+	if (read_succeed) {
> >+		clear_rflags_cf_zf(vcpu);
> >+		skip_emulated_instruction(vcpu);
> >+	} else {
> >+		set_rflags_to_vmx_fail_valid(vcpu);
> >+		vmcs_write32(VM_INSTRUCTION_ERROR, 12);
> >   
> 
> s_e_i() in any case but an exception.

Yes, I missed a bunch of those and will be a lot more careful now.
Of course the vmcs_write32 above was also completely broken and I fixed it
to write to vmcs12 (I discussed this issue in a reply to a different patch).

> >+		kvm_read_guest_virt(gva,&field_value,
> >+			vmcs_field_size(field_type, vcpu), vcpu, NULL);
> >   
> 
> Check for exception.

I am not sure what I should really do here... In emulating VMWRITE, we
try to read the value to be written from guest memory, and
kvm_read_guest_virt fails (returns != 0). What shall I do in such a case -
queue a PF_VECTOR? Or did you have something else in mind?

Thanks, and I'm attaching below a newer version of this patch with most
of your comments fixed (except the one I asked about in the last paragraph).
Nadav.

----
Subject: [PATCH 14/26] nVMX: Implement VMREAD and VMWRITE

Implement the VMREAD and VMWRITE instructions. With these instructions, L1
can read and write to the VMCS it is holding. The values are read or written
to the fields of the vmcs_fields structure introduced in the previous patch.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  182 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 180 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2010-08-04 19:07:21.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-08-04 19:07:21.000000000 +0300
@@ -4180,6 +4180,184 @@ static int handle_vmclear(struct kvm_vcp
 	return 1;
 }
 
+enum vmcs_field_type {
+	VMCS_FIELD_TYPE_U16 = 0,
+	VMCS_FIELD_TYPE_U64 = 1,
+	VMCS_FIELD_TYPE_U32 = 2,
+	VMCS_FIELD_TYPE_ULONG = 3
+};
+
+static inline int vmcs_field_type(unsigned long field)
+{
+	if (0x1 & field)	/* one of the *_HIGH fields, all are 32 bit */
+		return VMCS_FIELD_TYPE_U32;
+	return (field >> 13) & 0x3 ;
+}
+
+static inline int vmcs_field_size(int field_type, struct kvm_vcpu *vcpu)
+{
+	switch (field_type) {
+	case VMCS_FIELD_TYPE_U16:
+		return 2;
+	case VMCS_FIELD_TYPE_U32:
+		return 4;
+	case VMCS_FIELD_TYPE_U64:
+		return 8;
+	case VMCS_FIELD_TYPE_ULONG:
+		return is_long_mode(vcpu) ? 8 : 4;
+	}
+	BUG(); /* can never happen */
+}
+
+static inline int vmcs_field_readonly(unsigned long field)
+{
+	return (((field >> 10) & 0x3) == 1);
+}
+
+static inline bool vmcs12_read_any(struct kvm_vcpu *vcpu,
+					unsigned long field, u64 *ret)
+{
+	short offset = vmcs_field_to_offset(field);
+	char *p;
+
+	if (offset < 0)
+		return 0;
+
+	p = ((char *)(get_vmcs12_fields(vcpu))) + offset;
+
+	switch (vmcs_field_type(field)) {
+	case VMCS_FIELD_TYPE_ULONG:
+		*ret = *((unsigned long *)p);
+		return 1;
+	case VMCS_FIELD_TYPE_U16:
+		*ret = (u16) *((unsigned long *)p);
+		return 1;
+	case VMCS_FIELD_TYPE_U32:
+		*ret = (u32) *((unsigned long *)p);
+		return 1;
+	case VMCS_FIELD_TYPE_U64:
+		*ret = *((u64 *)p);
+		return 1;
+	default:
+		return 0; /* can never happen. */
+	}
+}
+
+static int handle_vmread(struct kvm_vcpu *vcpu)
+{
+	unsigned long field;
+	u64 field_value;
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	gva_t gva = 0;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	/* decode instruction info and find the field to read */
+	field = kvm_register_read(vcpu, (((vmx_instruction_info) >> 28) & 0xf));
+	if (!vmcs12_read_any(vcpu, field, &field_value)) {
+		nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	/*
+	 * Now check whether the request is to put the value in a register or
+	 * in memory. Note that the number of bits actually written is 32 or
+	 * 64, depending on the mode, not on the given field's length.
+	 */
+	if (vmx_instruction_info & (1u << 10)) {
+		kvm_register_write(vcpu, (((vmx_instruction_info) >> 3) & 0xf),
+			field_value);
+	} else {
+		if (get_vmx_mem_address(vcpu, exit_qualification,
+				vmx_instruction_info, &gva))
+			return 1;
+		/* ok to use *_system, because handle_vmread verified cpl=0 */
+		kvm_write_guest_virt_system(gva, &field_value,
+			     (is_long_mode(vcpu) ? 8 : 4), vcpu, NULL);
+	}
+
+	nested_vmx_succeed(vcpu);
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
+
+static int handle_vmwrite(struct kvm_vcpu *vcpu)
+{
+	unsigned long field;
+	u64 field_value = 0;
+	gva_t gva;
+	int field_type;
+	unsigned long exit_qualification   = vmcs_readl(EXIT_QUALIFICATION);
+	u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	char *p;
+	short offset;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	field = kvm_register_read(vcpu, (((vmx_instruction_info) >> 28) & 0xf));
+
+	if (vmcs_field_readonly(field)) {
+		nested_vmx_failValid(vcpu,
+			VMXERR_VMWRITE_READ_ONLY_VMCS_COMPONENT);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	field_type = vmcs_field_type(field);
+
+	offset = vmcs_field_to_offset(field);
+	if (offset < 0) {
+		nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+	p = ((char *) get_vmcs12_fields(vcpu)) + offset;
+
+	if (vmx_instruction_info & (1u << 10))
+		field_value = kvm_register_read(vcpu,
+			(((vmx_instruction_info) >> 3) & 0xf));
+	else {
+		if (get_vmx_mem_address(vcpu, exit_qualification,
+				vmx_instruction_info, &gva))
+			return 1;
+		kvm_read_guest_virt(gva, &field_value,
+			vmcs_field_size(field_type, vcpu), vcpu, NULL);
+	}
+
+	switch (field_type) {
+	case VMCS_FIELD_TYPE_U16:
+		*(u16 *)p = field_value;
+		break;
+	case VMCS_FIELD_TYPE_U32:
+		*(u32 *)p = field_value;
+		break;
+	case VMCS_FIELD_TYPE_U64:
+#ifdef CONFIG_X86_64
+		*(unsigned long *)p = field_value;
+#else
+		*(unsigned long *)p = field_value;
+		*(((unsigned long *)p)+1) = field_value >> 32;
+#endif
+		break;
+	case VMCS_FIELD_TYPE_ULONG:
+		*(unsigned long *)p = field_value;
+		break;
+	default:
+		nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	nested_vmx_succeed(vcpu);
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
 static bool verify_vmcs12_revision(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 {
 	if (vmcs12->revision_id == VMCS12_REVISION)
@@ -4546,9 +4724,9 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
 	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
 	[EXIT_REASON_VMPTRST]                 = handle_vmptrst,
-	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
+	[EXIT_REASON_VMREAD]                  = handle_vmread,
 	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
-	[EXIT_REASON_VMWRITE]                 = handle_vmx_insn,
+	[EXIT_REASON_VMWRITE]                 = handle_vmwrite,
 	[EXIT_REASON_VMOFF]                   = handle_vmoff,
 	[EXIT_REASON_VMON]                    = handle_vmon,
 	[EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,

-- 
Nadav Har'El                        |       Wednesday, Aug  4 2010, 25 Av 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |I am the boss of the house, and I have my
http://nadav.harel.org.il           |wife's permission to say so!

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 13/24] Implement VMREAD and VMWRITE
  2010-08-04 16:09     ` Nadav Har'El
@ 2010-08-04 16:41       ` Avi Kivity
  0 siblings, 0 replies; 147+ messages in thread
From: Avi Kivity @ 2010-08-04 16:41 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

  On 08/04/2010 07:09 PM, Nadav Har'El wrote:
>
>
>>> +		kvm_read_guest_virt(gva,&field_value,
>>> +			vmcs_field_size(field_type, vcpu), vcpu, NULL);
>>>
>> Check for exception.
> I am not sure what I should really do here... In emulating VMWRITE, we
> try to read the value to be written from guest memory, and
> kvm_read_guest_virt fails (returns != 0). What shall I do in such a case -
> queue a PF_VECTOR?

Yes indeed.  Plenty of kvm code does that (or needs to do it).
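
Something like this (just a sketch):

        if (kvm_read_guest_virt(gva, &field_value,
                        vmcs_field_size(field_type, vcpu), vcpu, NULL)) {
                kvm_queue_exception(vcpu, PF_VECTOR);
                return 1;
        }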

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 10/24] Implement VMPTRLD
  2010-06-14  9:07   ` Avi Kivity
@ 2010-08-05 11:13     ` Nadav Har'El
  0 siblings, 0 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-08-05 11:13 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

On Mon, Jun 14, 2010, Avi Kivity wrote about "Re: [PATCH 10/24] Implement VMPTRLD":
> >+	if (vmcs_page == NULL)
> >+		return 0;
> >   
> 
> Doesn't seem right.
>...
> >+
> >+	if (read_guest_vmcs_gpa(vcpu,&guest_vmcs_addr)) {
> >+		set_rflags_to_vmx_fail_invalid(vcpu);
> >   
> 
> Need to skip_emulated_instruction() in this case.

Thanks. I've "cleaned up my act" regarding error checking, and am now much
more careful to throw the right exception or set the right error code, and
to call skip_emulated_instruction when necessary.

Here is the new version, in case you want to look at it (of course, when I'm
done I'll send the whole patch set again).

----
Subject: [PATCH 11/26] nVMX: Implement VMPTRLD

This patch implements the VMPTRLD instruction.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   64 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 63 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c	2010-08-05 14:12:24.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-08-05 14:12:24.000000000 +0300
@@ -3882,6 +3882,68 @@ static int handle_vmclear(struct kvm_vcp
 	return 1;
 }
 
+/* Emulate the VMPTRLD instruction */
+static int handle_vmptrld(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	gva_t gva;
+	gpa_t vmcs12_addr;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+			vmcs_read32(VMX_INSTRUCTION_INFO), &gva))
+		return 1;
+
+	if (kvm_read_guest_virt(gva, &vmcs12_addr, sizeof(vmcs12_addr),
+				vcpu, NULL)) {
+		kvm_queue_exception(vcpu, PF_VECTOR);
+		return 1;
+	}
+
+	if (!IS_ALIGNED(vmcs12_addr, PAGE_SIZE)) {
+		nested_vmx_failValid(vcpu, VMXERR_VMPTRLD_INVALID_ADDRESS);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	if (vmx->nested.current_vmptr != vmcs12_addr) {
+		struct vmcs12 *new_vmcs12;
+		struct page *page;
+		page = nested_get_page(vcpu, vmcs12_addr);
+		if (page == NULL){
+			nested_vmx_failInvalid(vcpu);
+			skip_emulated_instruction(vcpu);
+			return 1;
+		}
+		new_vmcs12 = kmap(page);
+		if (new_vmcs12->revision_id != VMCS12_REVISION){
+			kunmap(page);
+			nested_release_page_clean(page);
+			nested_vmx_failValid(vcpu,
+				VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID);
+			skip_emulated_instruction(vcpu);
+			return 1;
+		}
+		if (vmx->nested.current_vmptr != -1ull){
+			kunmap(vmx->nested.current_vmcs12_page);
+			nested_release_page(vmx->nested.current_vmcs12_page);
+		}
+
+		vmx->nested.current_vmptr = vmcs12_addr;
+		vmx->nested.current_vmcs12 = new_vmcs12;
+		vmx->nested.current_vmcs12_page = page;
+
+		if (nested_create_current_vmcs(vcpu))
+			return -ENOMEM;
+	}
+
+	nested_vmx_succeed(vcpu);
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
 static int handle_invlpg(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
@@ -4166,7 +4228,7 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_VMCALL]                  = handle_vmcall,
 	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
 	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
-	[EXIT_REASON_VMPTRLD]                 = handle_vmx_insn,
+	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
 	[EXIT_REASON_VMPTRST]                 = handle_vmx_insn,
 	[EXIT_REASON_VMREAD]                  = handle_vmx_insn,
 	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,


-- 
Nadav Har'El                        |        Thursday, Aug  5 2010, 25 Av 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |We could wipe out world hunger if we knew
http://nadav.harel.org.il           |how to make AOL's Free CD's edible!

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 10/24] Implement VMPTRLD
  2010-07-06  3:09   ` Dong, Eddie
@ 2010-08-05 11:35     ` Nadav Har'El
  0 siblings, 0 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-08-05 11:35 UTC (permalink / raw)
  To: Dong, Eddie; +Cc: avi, kvm

On Tue, Jul 06, 2010, Dong, Eddie wrote about "RE: [PATCH 10/24] Implement VMPTRLD":
> Nadav Har'El wrote:
> > This patch implements the VMPTRLD instruction.
>..
> > +/* Emulate the VMPTRLD instruction */
> > +static int handle_vmptrld(struct kvm_vcpu *vcpu)
> > +{
>..
> 
> How about the "Launch" status? Should we get that status from vmcs1x to distinguish guest VMLaunch & VMResume?

What do you mean? What does VMPTRLD need to do with the launch status?

The VMCLEAR and VMLAUNCH/VMRESUME emulation need to clear and check the
emulated-VMX launch status (which we called vmcs12->launch_state), and indeed
they already do so (that code appears in other patches, not this specific one).

-- 
Nadav Har'El                        |        Thursday, Aug  5 2010, 25 Av 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |I am not a complete idiot - some parts
http://nadav.harel.org.il           |are missing.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 9/24] Implement VMCLEAR
  2010-06-15 13:54       ` Gleb Natapov
@ 2010-08-05 11:50         ` Nadav Har'El
  2010-08-05 11:53           ` Gleb Natapov
  2010-08-05 12:03           ` Avi Kivity
  0 siblings, 2 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-08-05 11:50 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Avi Kivity, kvm

On Tue, Jun 15, 2010, Gleb Natapov wrote about "Re: [PATCH 9/24] Implement VMCLEAR":
> On Tue, Jun 15, 2010 at 04:50:35PM +0300, Avi Kivity wrote:
> > On 06/15/2010 04:47 PM, Gleb Natapov wrote:
> > Architectural errors (bad alignment) should update flags.  Internal
> > errors (ENOMEM, vpmtr pointing outside of RAM) should not.
> > 
> vpmtr pointing outside of RAM is architectural error (or not?). SDM
> says "The operand of this instruction is always 64 bits and is always in
> memory", but may be they mean "not in register". Anyway internal errors
> should generate error exit to userspace which this patch is also
> missing.

I'm a bit puzzled what I am supposed to do when the guest-physical
address I get as a parameter to VMCLEAR (after I read this address from guest 
virtual memory) is beyond the guest's actual memory, i.e., gfn_to_page
fails on this address. Is this a normal "architectural error" and I should
VMfail(VMCLEAR with invalid physical address)? #GP? Or something else?
The SDM says
	"ensure that data for VMCS referenced by the operand is in memory"
but it doesn't appear to say what to do if that is not the case. When the
address itself is faulty (e.g., more than 32 bits in 32 bit mode) the SDM
says VMfail(VMCLEAR with invalid physical address) - but it doesn't say to
do that when the physical address is "simply" beyond the amount of available
memory.

In any case, I don't think this should be considered an internal error, or
that we have a reason to exit to user space in this case.

Thanks,
Nadav.

-- 
Nadav Har'El                        |        Thursday, Aug  5 2010, 25 Av 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Thousands of years ago, cats were
http://nadav.harel.org.il           |worshipped as gods. They never forgot.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 9/24] Implement VMCLEAR
  2010-08-05 11:50         ` Nadav Har'El
@ 2010-08-05 11:53           ` Gleb Natapov
  2010-08-05 12:01             ` Nadav Har'El
  2010-08-05 12:03           ` Avi Kivity
  1 sibling, 1 reply; 147+ messages in thread
From: Gleb Natapov @ 2010-08-05 11:53 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Avi Kivity, kvm

On Thu, Aug 05, 2010 at 02:50:25PM +0300, Nadav Har'El wrote:
> On Tue, Jun 15, 2010, Gleb Natapov wrote about "Re: [PATCH 9/24] Implement VMCLEAR":
> > On Tue, Jun 15, 2010 at 04:50:35PM +0300, Avi Kivity wrote:
> > > On 06/15/2010 04:47 PM, Gleb Natapov wrote:
> > > Architectural errors (bad alignment) should update flags.  Internal
> > > errors (ENOMEM, vpmtr pointing outside of RAM) should not.
> > > 
> > vpmtr pointing outside of RAM is architectural error (or not?). SDM
> > says "The operand of this instruction is always 64 bits and is always in
> > memory", but may be they mean "not in register". Anyway internal errors
> > should generate error exit to userspace which this patch is also
> > missing.
> 
> I'm a bit puzzled what I am supposed to do when the guest-physical
> address I get as a parameter to VMCLEAR (after I read this address from guest 
> virtual memory) is beyond the guest's actual memory, i.e., gfn_to_page
> fails on this address. Is this a normal "architectural error" and I should
> VMfail(VMCLEAR with invalid physical address)? #GP? Or something else?
> The SDM says
> 	"ensure that data for VMCS referenced by the operand is in memory"
> but it doesn't appear to say what to do if that is not the case. When the
> address itself is faulty (e.g., more than 32 bits in 32 bit mode) the SDM
> says VMfail(VMCLEAR with invalid physical address) - but it doesn't say to
> do that when the physical address is "simply" beyond the amount of available
> memory.
> 
> In any case, I don't think this should be considered an internal error, or
> that we have a reason to exit to user space in this case.
> 
But you can't emulate this either, no?

--
			Gleb.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 9/24] Implement VMCLEAR
  2010-08-05 11:53           ` Gleb Natapov
@ 2010-08-05 12:01             ` Nadav Har'El
  2010-08-05 12:05               ` Avi Kivity
  0 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-08-05 12:01 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Avi Kivity, kvm

On Thu, Aug 05, 2010, Gleb Natapov wrote about "Re: [PATCH 9/24] Implement VMCLEAR":
> > In any case, I don't think this should be considered an internal error, or
> > that we have a reason to exit to user space in this case.
> > 
> But you can't emulate this either, no?

I could, if I knew what to emulate ;-) Does anybody know what a real processor
with VMX does when you give it VMCLEAR with a physical address which is beyond
the amount of available memory? If the answer were "it throws #GP" or "it
does VMFail(Invalid Physical Address)" or something of this sort, I could
easily do this in the emulation too - I just don't know yet what it does...

-- 
Nadav Har'El                        |        Thursday, Aug  5 2010, 25 Av 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |"Guests, like fish, begin to smell after
http://nadav.harel.org.il           |three days." -- Benjamin Franklin

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 9/24] Implement VMCLEAR
  2010-08-05 11:50         ` Nadav Har'El
  2010-08-05 11:53           ` Gleb Natapov
@ 2010-08-05 12:03           ` Avi Kivity
  1 sibling, 0 replies; 147+ messages in thread
From: Avi Kivity @ 2010-08-05 12:03 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Gleb Natapov, kvm

  On 08/05/2010 02:50 PM, Nadav Har'El wrote:
> On Tue, Jun 15, 2010, Gleb Natapov wrote about "Re: [PATCH 9/24] Implement VMCLEAR":
>> On Tue, Jun 15, 2010 at 04:50:35PM +0300, Avi Kivity wrote:
>>> On 06/15/2010 04:47 PM, Gleb Natapov wrote:
>>> Architectural errors (bad alignment) should update flags.  Internal
>>> errors (ENOMEM, vpmtr pointing outside of RAM) should not.
>>>
>> vpmtr pointing outside of RAM is architectural error (or not?). SDM
>> says "The operand of this instruction is always 64 bits and is always in
>> memory", but may be they mean "not in register". Anyway internal errors
>> should generate error exit to userspace which this patch is also
>> missing.
> I'm a bit puzzled what I am supposed to do when the guest-physical
> address I get as a parameter to VMCLEAR (after I read this address from guest
> virtual memory) is beyond the guest's actual memory, i.e., gfn_to_page
> fails on this address. Is this a normal "architectural error" and I should
> VMfail(VMCLEAR with invalid physical address)? #GP? Or something else?
> The SDM says
> 	"ensure that data for VMCS referenced by the operand is in memory"
> but it doesn't appear to say what to do if that is not the case. When the
> address itself is faulty (e.g., more than 32 bits in 32 bit mode) the SDM
> says VMfail(VMCLEAR with invalid physical address) - but it doesn't say to
> do that when the physical address is "simply" beyond the amount of available
> memory.
>
> In any case, I don't think this should be considered an internal error, or
> that we have a reason to exit to user space in this case.

I think it's safe to KVM_REQ_SHUTDOWN in this case.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 9/24] Implement VMCLEAR
  2010-08-05 12:01             ` Nadav Har'El
@ 2010-08-05 12:05               ` Avi Kivity
  2010-08-05 12:10                 ` Nadav Har'El
  0 siblings, 1 reply; 147+ messages in thread
From: Avi Kivity @ 2010-08-05 12:05 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Gleb Natapov, kvm

  On 08/05/2010 03:01 PM, Nadav Har'El wrote:
> On Thu, Aug 05, 2010, Gleb Natapov wrote about "Re: [PATCH 9/24] Implement VMCLEAR":
>>> In any case, I don't think this should be considered an internal error, or
>>> that we have a reason to exit to user space in this case.
>>>
>> But you can't emulate this either, no?
> I could, if I knew what to emulate ;-) Does anybody know what a real processor
> with VMX does when you give it VMCLEAR with a physical address which is beyond
> the amount of available memory? If the answer was "it throws #GP" or "it
> does VMFail(Invalid Physical Address) or something of this sort, I could
> easily do this in the emulation too - I just don't know yet what it does...

As far as the processor is concerned, there is no end to physical 
memory.  The VMCLEAR will write some stuff out, and the chipset will 
throw it away.  However, eventually the guest will crash and burn, so it's 
better to take it out in a controlled way.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 9/24] Implement VMCLEAR
  2010-08-05 12:05               ` Avi Kivity
@ 2010-08-05 12:10                 ` Nadav Har'El
  2010-08-05 12:13                   ` Avi Kivity
  0 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-08-05 12:10 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Gleb Natapov, kvm

On Thu, Aug 05, 2010, Avi Kivity wrote about "Re: [PATCH 9/24] Implement VMCLEAR":
> As far as the processor is concerned, there is no end to physical 
> memory.  The VMCLEAR will write some stuff out, and the chipset will 
> throw it away.

So isn't what you're saying that for accurate VMCLEAR emulation, I should
simply ignore this error and have VMCLEAR do nothing in this case?

>  However, eventually the guest will crash and burn, 
> better to take it out in a controlled way.

Isn't it conceivable that some poorly-written hypervisor actually VMCLEARs
something beyond available memory, and doesn't crash because it never tries
to actually use that memory later? Do you think it's a good idea to kill
the guest immediately, in anticipation of a crash that will most likely come
later?

Thanks,
Nadav.

-- 
Nadav Har'El                        |        Thursday, Aug  5 2010, 25 Av 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Tact is the art of making a point without
http://nadav.harel.org.il           |making an enemy.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 9/24] Implement VMCLEAR
  2010-08-05 12:10                 ` Nadav Har'El
@ 2010-08-05 12:13                   ` Avi Kivity
  2010-08-05 12:29                     ` Nadav Har'El
  0 siblings, 1 reply; 147+ messages in thread
From: Avi Kivity @ 2010-08-05 12:13 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Gleb Natapov, kvm

  On 08/05/2010 03:10 PM, Nadav Har'El wrote:
> On Thu, Aug 05, 2010, Avi Kivity wrote about "Re: [PATCH 9/24] Implement VMCLEAR":
>> As far as the processor is concerned, there is no end to physical
>> memory.  The VMCLEAR will write some stuff out, and the chipset will
>> throw it away.
> So isn't what you're saying that for accurate VMCLEAR emulation, I should
> simply ignore this error and have VMCLEAR do nothing in this case?
>

That would be most accurate.

>>   However, eventually the guest will crash and burn,
>> better to take it out in a controlled way.
> Isn't it conceivable that some poorly-written hypervisor actually VMCLEARs
> something beyond available memory, and doesn't crash because it never tries
> to actually use that memory later?

It's very unlikely.

> Do you think it's a good idea to kill
> the guest immediately, in anticipation of a crash that will most likely come
> later?

At this stage, the problem is likely to be in the nvmx implementation.  
So it's best to report the problem as early as possible.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 9/24] Implement VMCLEAR
  2010-08-05 12:13                   ` Avi Kivity
@ 2010-08-05 12:29                     ` Nadav Har'El
  0 siblings, 0 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-08-05 12:29 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Gleb Natapov, kvm

On Thu, Aug 05, 2010, Avi Kivity wrote about "Re: [PATCH 9/24] Implement VMCLEAR":
> >Do you think it's a good idea to kill
> >the guest immediately, in anticipation of a crash that will most likely 
> >come
> >later?
> 
> At this stage, the problem is likely to be in the nvmx implementation.  
> So it's best to report the problem as early as possible.

Ok, here is the new version of this patch, implementing this case by doing a
triple fault (I hope this is what you meant by KVM_REQ_SHUTDOWN), and also
with other fixes as suggested by you, Gleb and Eddie.

----
Subject: [PATCH 10/26] nVMX: Implement VMCLEAR

This patch implements the VMCLEAR instruction.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   62 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 61 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c	2010-08-05 15:22:27.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-08-05 15:22:27.000000000 +0300
@@ -144,6 +144,8 @@ struct __packed vmcs12 {
 	 */
 	u32 revision_id;
 	u32 abort;
+
+	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
 };
 
 /*
@@ -3828,6 +3830,64 @@ static void nested_vmx_failValid(struct 
 	get_vmcs12_fields(vcpu)->vm_instruction_error = vm_instruction_error;
 }
 
+/* Emulate the VMCLEAR instruction */
+static int handle_vmclear(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	gva_t gva;
+	gpa_t vmcs12_addr;
+	struct vmcs12 *vmcs12;
+	struct page *page;
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+			vmcs_read32(VMX_INSTRUCTION_INFO), &gva))
+		return 1;
+
+	if (kvm_read_guest_virt(gva, &vmcs12_addr, sizeof(vmcs12_addr),
+				vcpu, NULL)) {
+		kvm_queue_exception(vcpu, PF_VECTOR);
+		return 1;
+	}
+
+	if (!IS_ALIGNED(vmcs12_addr, PAGE_SIZE)) {
+		nested_vmx_failValid(vcpu, VMXERR_VMCLEAR_INVALID_ADDRESS);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
+
+	if (vmcs12_addr == vmx->nested.current_vmptr){
+		kunmap(vmx->nested.current_vmcs12_page);
+		nested_release_page(vmx->nested.current_vmcs12_page);
+		vmx->nested.current_vmptr = -1ull;
+	}
+
+	page = nested_get_page(vcpu, vmcs12_addr);
+	if (page == NULL) {
+		/*
+		 * For accurate processor emulation, VMCLEAR beyond available
+		 * physical memory should do nothing at all. However, it is
+		 * possible that a nested vmx bug, not a guest hypervisor bug,
+		 * resulted in this case, so let's shut down before doing any
+		 * more damage:
+		 */
+		set_bit(KVM_REQ_TRIPLE_FAULT, &vcpu->requests);
+		return 1;
+	}
+	vmcs12 = kmap(page);
+	vmcs12->launch_state = 0;
+	kunmap(page);
+	nested_release_page(page);
+
+	nested_free_vmcs(vcpu, vmcs12_addr);
+
+	skip_emulated_instruction(vcpu);
+	nested_vmx_succeed(vcpu);
+	return 1;
+}
+
 static int handle_invlpg(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
@@ -4110,7 +4170,7 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_HLT]                     = handle_halt,
 	[EXIT_REASON_INVLPG]		      = handle_invlpg,
 	[EXIT_REASON_VMCALL]                  = handle_vmcall,
-	[EXIT_REASON_VMCLEAR]	              = handle_vmx_insn,
+	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
 	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
 	[EXIT_REASON_VMPTRLD]                 = handle_vmx_insn,
 	[EXIT_REASON_VMPTRST]                 = handle_vmx_insn,

-- 
Nadav Har'El                        |        Thursday, Aug  5 2010, 25 Av 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |A fine is a tax for doing wrong. A tax is
http://nadav.harel.org.il           |a fine for doing well.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1
  2010-06-23  8:07         ` Avi Kivity
@ 2010-08-08 15:09           ` Nadav Har'El
  2010-08-10  3:24             ` Avi Kivity
  0 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-08-08 15:09 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

On Wed, Jun 23, 2010, Avi Kivity wrote about "Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1":
> >+
> >+We describe in much greater detail the theory behind the nested VMX 
> >feature,
> >+its implementation and its performance characteristics, in IBM Research 
> >report
> >+H-0282, "The Turtles Project: Design and Implementation of Nested
> >+Virtualization", available at:
> >+
> >+        http://bit.ly/a0o9te
> 
> Please put the true url in here.

Done.
By the way, since I wrote this, our paper has also been accepted to OSDI 2010
(see http://www.usenix.org/events/osdi10/tech/), so later I will change the
link again to the conference paper.

> >+The current code support running Linux under a nested KVM using shadow
> >+page table (with bypass_guest_pf disabled).
> 
> Might as well remove this, since nvmx will not be merged with such a 
> gaping hole.
> 
> In theory I ought to reject anything that doesn't comply with the spec.  
> In practice I'll accept deviations from the spec, so long as
> 
> - those features aren't used by common guests
> - when the features are attempted to be used, kvm will issue a warning

Ok, I plugged the big gaping hole and left a small invisible hole ;-)

The situation now is that you no longer have to run kvm with bypass_guest_pf
disabled - not on L0 and not on L1. L1 guests will run normally, possibly with
bypass_guest_pf enabled. However, when L2 guests run, every page fault will
cause an exit - regardless of what L0 or L1 tried to define via
PFEC_MASK, PFEC_MATCH and EB[pf].
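
In vmcs02 terms this boils down to roughly the following (a sketch; "eb" here
stands for the exception bitmap value we compute for vmcs02):

        /* Trap every L2 page fault: with mask == match == 0 the PFEC test
         * always matches, so EB.PF alone decides whether a #PF exits. */
        eb |= 1u << PF_VECTOR;
        vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0);
        vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, 0);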

The reason why I said there is a "small hole" left is that now there is the
possibility that we inject L1 with a page fault that it didn't expect to get.
But in practice, this does not seem to cause any problems for either KVM
or VMware Server.

> I don't think PFEC matching ought to present any implementation difficulty.

Well, it is more complicated than it first appeared (at least to me).
One problem is that there is no real way (at least none that I thought of)
to "or" the pf-trapping desires of L0 and L1. I solved this by  traping all
page faults, which is unfortunate. The second problem, related to the first
one, when L0 gets a page fault while running L2, it is now quite diffcult to
figure out whether it should be injected into L1, i.e., whether L1 asked
for this specific page-fault trap to happen. We need check whether the
page_fault_error_code match's the L1-specified pfec_mask and pfec_match
(and eb.pf), but it's actually more complicated, because the
page_fault_error_code we got from the processor refers to the shadow page
tables, and we need to translate it back to what it would mean for L1's page
tables.

Doing this correctly would require me to spend quite a bit more time to
understand exactly how the shadow page tables code works, and I hesitate
whether I should do that now, when I know that common guest hypervisors
work perfectly without fixing this issue, and when most people would rather
use EPT and not shadow page tables anyway.

In any case, I left a TODO in the code about this, so it won't be forgotten.

-- 
Nadav Har'El                        |          Sunday, Aug  8 2010, 28 Av 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |It's no use crying over spilt milk -- it
http://nadav.harel.org.il           |only makes it salty for the cat.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1
  2010-08-08 15:09           ` Nadav Har'El
@ 2010-08-10  3:24             ` Avi Kivity
  0 siblings, 0 replies; 147+ messages in thread
From: Avi Kivity @ 2010-08-10  3:24 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm-devel

  On 08/08/2010 11:09 AM, Nadav Har'El wrote:
>
>>> +page table (with bypass_guest_pf disabled).
>> Might as well remove this, since nvmx will not be merged with such a
>> gaping hole.
>>
>> In theory I ought to reject anything that doesn't comply with the spec.
>> In practice I'll accept deviations from the spec, so long as
>>
>> - those features aren't used by common guests
>> - when the features are attempted to be used, kvm will issue a warning
> Ok, I plugged the big gaping hole and left a small invisible hole ;-)
>
> The situation now is that you no longer have to run kvm with bypass_guest_pf
> disabled - not on L0 and not on L1. L1 guests will run normally, possibly with
> bypass_guest_pf enabled. However, when L2 guests run, every page fault will
> cause an exit - regardless of what L0 or L1 tried to define via
> PFEC_MASK, PFEC_MATCH and EB[pf].
>
> The reason why I said there is a "small hole" left is that now there is the
> possibility that we inject L1 with a page fault that it didn't expect to get.
> But in practice, this does not seem to cause any problems for either KVM
> or VMware Server.

Not nice, but acceptable.  Spurious page faults are accepted by guests 
since they're often the result of concurrent faults on the same address.

>> I don't think PFEC matching ought to present any implementation difficulty.
> Well, it is more complicated than it first appeared (at least to me).
> One problem is that there is no real way (at least none that I thought of)
> to "or" the pf-trapping desires of L0 and L1.

If they use the same "sense" (bit 14 of EXCEPTION_BITMAP), you can AND 
the two PFEC_MASKs, and drop any bits remaining where PFEC_MATCH is 
different.  Not worth it, probably.
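
For the record, the merge would be something like this (a sketch; the variable
names are made up):

        /* Keep only the mask bits on which L0 and L1 agree about the expected
         * value; the merged test then fires whenever either original test
         * would (and possibly more often, which is safe). */
        u32 merged_mask  = (l0_pfec_mask & l1_pfec_mask) &
                           ~(l0_pfec_match ^ l1_pfec_match);
        u32 merged_match = l0_pfec_match & merged_mask;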

>   I solved this by trapping all
> page faults, which is unfortunate. The second problem, related to the first
> one, is that when L0 gets a page fault while running L2, it is now quite
> difficult to figure out whether it should be injected into L1, i.e., whether
> L1 asked for this specific page-fault trap to happen. We need to check whether
> the page_fault_error_code matches the L1-specified pfec_mask and pfec_match
> (and eb.pf), but it's actually more complicated, because the
> page_fault_error_code we got from the processor refers to the shadow page
> tables, and we need to translate it back to what it would mean for L1's page
> tables.

You can recover original PFEC by doing a walk_addr().

> Doing this correctly would require me to spend quite a bit more time to
> understand exactly how the shadow page tables code works, and I hesitate
> whether I should do that now, when I know that common guest hypervisors
> work perfectly without fixing this issue, and when most people would rather
> use EPT and not shadow page tables anyway.
>
> In any case, I left a TODO in the code about this, so it won't be forgotten.

Sure.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 18/24] Exiting from L2 to L1
  2010-06-14 12:04   ` Avi Kivity
@ 2010-09-12 14:05     ` Nadav Har'El
  2010-09-12 14:29       ` Avi Kivity
  2010-09-14 13:07     ` Nadav Har'El
  1 sibling, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-09-12 14:05 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

Hi,

Continuing to work on the nested VMX patches,

On Mon, Jun 14, 2010, Avi Kivity wrote about "Re: [PATCH 18/24] Exiting from L2 to L1":
> On 06/13/2010 03:31 PM, Nadav Har'El wrote:
>...
> >+/* prepare_vmcs_12 is called when the nested L2 guest exits and we want to
> >+ * prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12), and this
> >+ * function updates it to reflect the state of the registers during the 
> >exit,
>...
> >+	vmcs12->tsc_offset = vmcs_read64(TSC_OFFSET);
> >   
> 
> TSC_OFFSET cannot have changed.

Right. I cleaned up this function now, to only copy the fields that could
have changed, namely fields listed as guest-state or exit-information fields
in the spec. Control fields like this TSC_OFFSET and more examples you found
below, indeed could not have changed while L2 was running or during the exit.

> >+	vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
> >   
> Without msr bitmaps, cannot change.

I added a TODO before this (and a couple of others) for future optimization.
I'm not even convinced how much quicker it is to check the MSR bitmap before
doing vmcs_read64, vs. just going ahead and vmreading it in any case.

> >+    vmcs12->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);
>
> Can this change?

Well, according to the spec, (SDM vol 3B), VMCS link pointer is a guest-state
field, but it is listed as being for "future expansion". I guess that with
current hardware, it cannot change, but for future hardware it might. I'm
not sure if it's wiser to ignore this field for now (and shave a bit off
the l2->l1 switch time), or just copy it anyway, as I do now.
What would you prefer?

> >+	if (vmcs_config.vmentry_ctrl&  VM_ENTRY_LOAD_IA32_PAT)
> >+		vmcs12->guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT);
> >   
> 
> Should check for VM_EXIT_SAVE_IA32_PAT, no?

You're absolutely right. Fixed.

> >+	vmcs12->vm_entry_intr_info_field =
> >+		vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
> >   
> 
> Autocleared, no need to read.

Well, we need to clear the "valid bit" on exit, so we don't mistakenly inject
the same interrupt twice. There were two ways to do it: 1. clear it ourselves,
or 2. copy the value from vmcs02 where the processor already cleared it.
There are pros and cons for each approach, but I'll change like you suggest,
to clear it ourselves:

	vmcs12->vm_entry_intr_info_field &= ~INTR_INFO_VALID_MASK;

> >+	vmcs12->vm_instruction_error = vmcs_read32(VM_INSTRUCTION_ERROR);
>
> We don't want to pass this to the guest?

I didn't quite understand your question, but now that I look at it, I see
that VM_INSTRUCTION_ERROR has nothing to do with exits, but is only modified
when running VMX instructions, and our emulation of VMX instructions already
sets it appropriately, so no sense of copying it here.

> 
> >+	vmcs12->vm_exit_reason  = vmcs_read32(VM_EXIT_REASON);
> >+	vmcs12->vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
> >+	vmcs12->vm_exit_intr_error_code = 
> >vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
> >+	vmcs12->idt_vectoring_info_field =
> >+		vmcs_read32(IDT_VECTORING_INFO_FIELD);
> >+	vmcs12->idt_vectoring_error_code =
> >+		vmcs_read32(IDT_VECTORING_ERROR_CODE);
> >+	vmcs12->vm_exit_instruction_len = 
> >vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
> >+	vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
> >   
> 
> For the above, if the host handles the exit, we must not clobber guest 
> fields.  A subsequent guest vmread will see the changed values even 
> though from its point of view a vmexit has not occurred.
> 
> But no, that can't happen, since a vmread needs to have a vmexit first 
> to happen.  Still, best to delay this.+

All this code is in prepare_vmcs_12, which only gets called if we exit from
L2 to L1 - it doesn't get called when we exit from L2 to L0 (when the host
handles the exit).

As long as a certain field gets written to on *every* exit, and not just
on some of them, I believe we can safely copy the current values in vmcs02
to vmcs12, knowing these are current values from the current exit, and not
some old values we shouldn't be copying.

You may have a point (if that was your point?) that some fields are not
always written to - e.g., for most exits vm_exit_intr_info doesn't get
written to and just one "valid" bit is cleared. As the code is now, we copy
vmcs02's field, which might have been written earlier (e.g., during
an exit to L0) and not now, and an observant L1 might notice this value
it should not have seen. However, I don't see any problem with that, because
the "valid" bit would be correctly turned off, and the spec says that all
other bits are undefined (for irrelevant exits) and copying-old-values is one
legal setting for undefined bits...

> >+	/* If any of the CRO_GUEST_HOST_MASK bits are off, the L2 guest may
> >+	 * have changed some cr0 bits without us ever saving them in the 
> >shadow
> >+	 * vmcs. So we need to save these changes now.
>...
> >+
> >+	vmcs12->guest_cr4 = vmcs_readl(GUEST_CR4);
> 
> Can't we have the same issue with cr4?

I guess we can. I didn't think this (giving guest full control over cr4 bits)
was happening in KVM, but maybe I was wrong, or maybe this will happen in the
future, so no reason not to do it for cr4 as well. So I'll do it for cr4 as
well.

> Better to have some helpers to do the common magic, and not encode the 
> special knowledge about TS into it (make it generic).

I thought that since in current KVM code the only cr0_guest_owned_bits bit
that could possibly be turned on was TS, then I should only deal with that
bit. But you're right, no reason not to make it more general, to look for
any bits which L0 traps but L1 didn't think it was trapping. As in:

	long bits;
	bits = vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask;
	vmcs12->guest_cr0 = (vmcs_readl(GUEST_CR0) & bits) |
				(vmcs_readl(CR0_READ_SHADOW) & ~bits);

(the "bits" lists all the bits which are already fine in guest_cr0, i.e.,
either guest_owned_bit (so L2 wrote to it directly) or guest_host_mask
(so L1 didn't expect them to be updated in guest_cr0 anyway). All other
bits need to be copied from the read_shadow).

I don't know how to put this into a helper function, because these two
statements have so many dependencies on the word "cr0" that making one
that would work for either cr0 or cr4 seems too difficult to be worth the
trouble.

This reply is getting long, so I'll leave it about prepare_vmcs_12 and
will reply to your comments about the rest of this patch in a separate mail.

Thanks,
Nadav.

-- 
Nadav Har'El                        |       Sunday, Sep 12 2010, 4 Tishri 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |An Apple a day, keeps Windows away.
http://nadav.harel.org.il           |

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 18/24] Exiting from L2 to L1
  2010-09-12 14:05     ` Nadav Har'El
@ 2010-09-12 14:29       ` Avi Kivity
  2010-09-12 17:05         ` Nadav Har'El
  0 siblings, 1 reply; 147+ messages in thread
From: Avi Kivity @ 2010-09-12 14:29 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

  On 09/12/2010 04:05 PM, Nadav Har'El wrote:
> Hi,
>
> Continuing to work on the nested VMX patches,

Great.  Hopefully you can commit some time to it or you'll spend a lot 
of cycles just catching up.

Joerg just merged nested npt; as far as I can tell it is 100% in line 
with nested ept, but please take a look as well.

>
>>> +	vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
>>>
>> Without msr bitmaps, cannot change.
> I added a TODO before this (and a couple of others) for future optimization.
> I'm not even convinced how much quicker it is to check the MSR bitmap before
> doing vmcs_read64, vs just to going ahead and vmreading it in any case.

IIRC we don't support msr bitmaps now, so no need to check anything.

In general, avoid vmcs reads as much as possible.  Just think of your 
code running in a guest!

>>> +    vmcs12->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);
>> Can this change?
> Well, according to the spec, (SDM vol 3B), VMCS link pointer is a guest-state
> field, but it is listed as being for "future expansion". I guess that with
> current hardware, it cannot change, but for future hardware it might. I'm
> not sure if it's wiser to ignore this field for now (and shave a bit off
> the l2->l1 switch time), or just copy it anyway, as I do now.
> What would you prefer?

If it changes in the future, it can only be under the influence of some 
control or at least guest-discoverable capability.  Since we don't 
expose such control or capability, there's no need to copy it.

>>> +	vmcs12->vm_entry_intr_info_field =
>>> +		vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
>>>
>> Autocleared, no need to read.
> Well, we need to clear the "valid bit" on exit, so we don't mistakenly inject
> the same interrupt twice.

I don't think so.  We write this field as part of guest entry (that is, 
after the decision re which vmcs to use, yes?), so guest entry will 
always follow writing this field.  Since guest entry clears the field, 
reading it after an exit will necessarily return 0.

What can happen is that the contents of the field is transferred to the 
IDT_VECTORING_INFO field or VM_EXIT_INTR_INFO field.

(question: on a failed vmentry, is this field cleared?)

> There were two ways to do it: 1. clear it ourselves,
> or 2. copy the value from vmcs02 where the processor already cleared it.
> There are pros and cons for each approach, but I'll change like you suggest,
> to clear it ourselves:
>
> 	vmcs12->vm_entry_intr_info_field &= ~INTR_INFO_VALID_MASK;

That's really a temporary variable, I don't think you need to touch it.

>>> +	vmcs12->vm_exit_reason  = vmcs_read32(VM_EXIT_REASON);
>>> +	vmcs12->vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
>>> +	vmcs12->vm_exit_intr_error_code =
>>> vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
>>> +	vmcs12->idt_vectoring_info_field =
>>> +		vmcs_read32(IDT_VECTORING_INFO_FIELD);
>>> +	vmcs12->idt_vectoring_error_code =
>>> +		vmcs_read32(IDT_VECTORING_ERROR_CODE);
>>> +	vmcs12->vm_exit_instruction_len =
>>> vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
>>> +	vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
>>>
>> For the above, if the host handles the exit, we must not clobber guest
>> fields.  A subsequent guest vmread will see the changed values even
>> though from its point of view a vmexit has not occurred.
>>
>> But no, that can't happen, since a vmread needs to have a vmexit first
>> to happen.  Still, best to delay this.+
> All this code is in prepare_vmcs_12, which only gets called if we exit from
> L2 to L1 - it doesn't get called when we exit from L2 to L0 (when the host
> handles the exit).

Ok.  That answers my concern.

> As long as a certain field gets written to on *every* exit, and not just
> on some of them, I believe we can safely copy the current values in vmcs02
> to vmcs12, knowing these are current values from the current exit, and not
> some old values we shouldn't be copying.
>
> You may have a point (if that was your point?) that some fields are not
> always written to - e.g., for most exits vm_exit_intr_info doesn't get
> written to and just one "valid" bit is cleared. As the code is now, we copy
> vmcs02's field, which might have been written earlier (e.g., during
> an exit to L0) and not now, and an observant L1 might notice this value
> it should not have seen. However, I don't see any problem with that, because
> the "valid" bit would be correctly turned off, and the spec says that all
> other bits are undefined (for irrelevant exits) and copying-old-values is one
> legal setting for undefined bits...

No, I wasn't worried about that, simply misunderstood the code.

>>> +	/* If any of the CRO_GUEST_HOST_MASK bits are off, the L2 guest may
>>> +	 * have changed some cr0 bits without us ever saving them in the
>>> shadow
>>> +	 * vmcs. So we need to save these changes now.
>> ...
>>> +
>>> +	vmcs12->guest_cr4 = vmcs_readl(GUEST_CR4);
>> Can't we have the same issue with cr4?
> I guess we can. I didn't think this (giving guest full control over cr4 bits)
> was happening in KVM, but maybe I was wrong, or maybe this will happen in the
> future, so no reason not to do it for cr4 as well. So I'll do it for cr4 as
> well.

We give the guest partial control over cr4:

     #define KVM_CR4_GUEST_OWNED_BITS                                   \
              (X86_CR4_PVI | X86_CR4_DE | X86_CR4_PCE | X86_CR4_OSFXSR  \
               | X86_CR4_OSXMMEXCPT)

Plus PGE if EPT is enabled.

>> Better to have some helpers to do the common magic, and not encode the
>> special knowledge about TS into it (make it generic).
> I thought that since in current KVM code the only cr0_guest_owned_bits bit
> that could possibly be turned on was TS, then I should only deal with that
> bit. But you're right, no reason not to make it more general, to look for
> any bits which L0 traps but L1 didn't think it was trapping. As in:
>
> 	long bits;
> 	bits = vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask;
> 	vmcs12->guest_cr0 = (vmcs_readl(GUEST_CR0) & bits) |
> 				(vmcs_readl(CR0_READ_SHADOW) & ~bits);
>
> (the "bits" lists all the bits which are already fine in guest_cr0, i.e.,
> either guest_owned_bit (so L2 wrote to it directly) or guest_host_mask
> (so L1 didn't expect them to be updated in guest_cr0 anyway). All other
> bits need to be copied from the read_shadow).
>
> I don't know how to put this into a helper function, because these two
> statements have so many dependencies on the word "cr0" that making one
> that would work for either cr0 or cr4 seems too difficult to be worth the
> trouble.

I didn't mean register independent helper; one function for cr0 and one 
function for cr4 so the reader can just see the name and pretend to 
understand what it does, instead of seeing a bunch of incomprehensible 
bitwise operations.

(well, reading what I wrote, maybe I did mean a cr independent helper, 
but don't do it if it results in more complication)

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 18/24] Exiting from L2 to L1
  2010-09-12 14:29       ` Avi Kivity
@ 2010-09-12 17:05         ` Nadav Har'El
  2010-09-12 17:21           ` Avi Kivity
  0 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-09-12 17:05 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

On Sun, Sep 12, 2010, Avi Kivity wrote about "Re: [PATCH 18/24] Exiting from L2 to L1":
> Great.  Hopefully you can commit some time to it or you'll spend a lot 
> of cycles just catching up.

Right. I will.

> Joerg just merged nested npt; as far as I can tell it is 100% in line 
> with nested ept, but please take a look as well.

Indeed. Making nested EPT work based on the nested NPT work is one of the
first things I plan to do after the basic nested VMX patches are accepted.
As you know, we already have a version of nested EPT working in our testing
code, but I'll need to tweak it a bit to use the common nested NPT code.

> >>>+	vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
> >>>
> >>Without msr bitmaps, cannot change.
> >I added a TODO before this (and a couple of others) for future 
> >optimization.
> >I'm not even convinced how much quicker it is to check the MSR bitmap 
> >before
> >doing vmcs_read64, vs just to going ahead and vmreading it in any case.
> 
> IIRC we don't support msr bitmaps now, so no need to check anything.

I do think we support msr bitmaps... E.g., we have
nested_vmx_exit_handled_msr() to check whether L1 requires an exit for a
certain MSR access.  Where don't we support them? (but I'm not denying the
> possibility that this support still has holes or bugs).

> In general, avoid vmcs reads as much as possible.  Just think of your 
> code running in a guest!

Yes. On the other hand, I don't want to be sorry in the future when I want
to support some feature, but because I wanted to shave off 1% of the L2->L1
switching time, and 0.01% of the total run time (and I'm just making
numbers up...), I now need to find a dozen places where things need to change
to support this feature. On the other hand, this will likely happen anyway ;-)

> 
> >>>+    vmcs12->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);
> >>Can this change?
>..
> If it changes in the future, it can only be under the influence of some 
> control or at least guest-discoverable capability.  Since we don't 
> expose such control or capability, there's no need to copy it.

You convinced me. Removed it.

> >>>+	vmcs12->vm_entry_intr_info_field =
> >>>+		vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
> >>>
> >>Autocleared, no need to read.
> >Well, we need to clear the "valid bit" on exit, so we don't mistakenly 
> >inject
> >the same interrupt twice.
> 
> I don't think so.  We write this field as part of guest entry (that is, 
> after the decision re which vmcs to use, yes?), so guest entry will 
> always follow writing this field.  Since guest entry clears the field, 
> reading it after an exit will necessarily return 0.

Well, you obviously know the KVM code much better than I do, but from what
I saw, I thought (maybe I misunderstood) that in normal (non-nested) KVM,
this field only gets written on injection, not on every entry, so the code
relies on the fact that the processor turns off the "valid" bit during exit,
to avoid the same event being injected when the same field value is used for
another entry. I can only find code which resets this field in vmx_vcpu_reset(),
but that doesn't get called on every entry, right? Or am I missing something?

> What can happen is that the contents of the field is transferred to the 
> IDT_VECTORING_INFO field or VM_EXIT_INTR_INFO field.
> 
> (question: on a failed vmentry, is this field cleared?)

I don't know the answer :-)

> >There were two ways to do it: 1. clear it ourselves,
> >or 2. copy the value from vmcs02 where the processor already cleared it.
> >There are pros and cons for each approach, but I'll change like you 
> >suggest,
> >to clear it ourselves:
> >
> >	vmcs12->vm_entry_intr_info_field &= ~INTR_INFO_VALID_MASK;
> 
> That's really a temporary variable, I don't think you need to touch it.

But we need to emulate the correct VMX behavior. According to the spec, the
"valid" bit on this field needs to be cleared on vmexit, so we need to do it
also on emulated exits from L2 to L1. If we're sure that we already cleared it
earlier, then fine, but if not (and like I said, I failed to find this code),
we need to do it now, on exit - either by explicitly clearing the bit or by
copying a value where the processor cleared this bit (arguably, the former is
more correct emulation).

> I didn't mean register independent helper; one function for cr0 and one 
> function for cr4 so the reader can just see the name and pretend to 
> understand what it does, instead of seeing a bunch of incomprehensible 
> bitwise operations.

Ok, done:

/*
 * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date
 * because L2 may have changed some cr0 bits directly (see CRO_GUEST_HOST_MASK)
 * without L0 trapping the change and updating vmcs12.
 * This function returns the value we should put in vmcs12.guest_cr0. It's not
 * enough to just return the current (vmcs02) GUEST_CR0. This may not be the
 * guest cr0 that L1 thought it was giving its L2 guest - it is possible that
 * L1 wished to allow its guest to set a cr0 bit directly, but we (L0) asked
 * to trap this change and instead set just the read shadow. If this is the
 * case, we need to copy these read-shadow bits back to vmcs12.guest_cr0, where
 * L1 believes they already are.
 */
static inline unsigned long
vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs_fields *vmcs12){
	unsigned long guest_cr0_bits =
		vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask;
	return (vmcs_readl(GUEST_CR0) & guest_cr0_bits) |
		(vmcs_readl(CR0_READ_SHADOW) & ~guest_cr0_bits);
}

and the call becomes just:

	vmcs12->guest_cr0 = vmcs12_guest_cr0(vcpu, vmcs12);

which is easy to glance over (but doesn't say much about what it is doing).
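
For completeness, the cr4 counterpart ends up symmetric (it appears verbatim
in the full patch later in this thread):

static inline unsigned long
vmcs12_guest_cr4(struct kvm_vcpu *vcpu, struct vmcs_fields *vmcs12)
{
	unsigned long guest_cr4_bits =
		vcpu->arch.cr4_guest_owned_bits | vmcs12->cr4_guest_host_mask;
	return (vmcs_readl(GUEST_CR4) & guest_cr4_bits) |
		(vmcs_readl(CR4_READ_SHADOW) & ~guest_cr4_bits);
}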


-- 
Nadav Har'El                        |       Sunday, Sep 12 2010, 4 Tishri 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Sign above a shop selling burglar alarms:
http://nadav.harel.org.il           |"For the man who has everything"

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 18/24] Exiting from L2 to L1
  2010-09-12 17:05         ` Nadav Har'El
@ 2010-09-12 17:21           ` Avi Kivity
  2010-09-12 19:51             ` Nadav Har'El
  2010-09-13  5:53             ` Sheng Yang
  0 siblings, 2 replies; 147+ messages in thread
From: Avi Kivity @ 2010-09-12 17:21 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, Sheng Yang

  On 09/12/2010 07:05 PM, Nadav Har'El wrote:
>
>>>>> +	vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
>>>>>
>>>> Without msr bitmaps, cannot change.
>>> I added a TODO before this (and a couple of others) for future
>>> optimization.
>>> I'm not even convinced how much quicker it is to check the MSR bitmap
>>> before
>>> doing vmcs_read64, vs just to going ahead and vmreading it in any case.
>> IIRC we don't support msr bitmaps now, so no need to check anything.
> I do think we support msr bitmaps... E.g., we have
> nested_vmx_exit_handled_msr() to check whether L1 requires an exit for a
> certain MSR access.  Where don't we support them? (but I'm not denying the
> possiblity that this support still has holes or bugs).

I was just talking from memory.  If you do support them, that's great.

Note that kvm itself doesn't give the guest control of DEBUGCTLMSR, so you
should just be able to read it from the shadow value (which strangely
doesn't exist - I'll post a fix).

>> In general, avoid vmcs reads as much as possible.  Just think of your
>> code running in a guest!
> Yes. On the other hand, I don't want to be sorry in the future when I want
> to support some feature, but because I wanted to shave off 1% of the L2->L1
> switching time, and 0.01% of the total run time (and I'm just making
> numbers up...), I now need to find a dozen places where things need to change
> to support this feature. On the other hand, this will likely happen anyway ;-)

Well, with msrs you have two cases: those which the guest controls and 
those which are shadowed.  So all we need is a systematic way for 
dealing with the two types.

>>>>> +	vmcs12->vm_entry_intr_info_field =
>>>>> +		vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
>>>>>
>>>> Autocleared, no need to read.
>>> Well, we need to clear the "valid bit" on exit, so we don't mistakenly
>>> inject
>>> the same interrupt twice.
>> I don't think so.  We write this field as part of guest entry (that is,
>> after the decision re which vmcs to use, yes?), so guest entry will
>> always follow writing this field.  Since guest entry clears the field,
>> reading it after an exit will necessarily return 0.
> Well, you obviously know the KVM code much better than I do, but from what
> I saw, I thought (maybe I misunderstood) that in normal (non-nested) KVM,
> this field only gets written on injection, not on every entry, so the code
> relies on the fact that the processor turns off the "valid" bit during exit,
> to avoid the same event being injected when the same field value is used for
> another entry.

Correct.

> I can only find code which resets this field in vmx_vcpu_reset(),
> but that doesn't get called on every entry, right? Or am I missing something?

prepare_vmcs12() is called in response to a 2->1 vmexit which is first
trapped by L0, yes?  Because it's called immediately after a vmexit,
VM_ENTRY_INTR_INFO_FIELD is guaranteed to have been cleared by the 
processor.

There are two cases where VM_ENTRY_INTR_INFO_FIELD can potentially not 
be cleared by hardware:

1. if we call prepare_vmcs12() between injection and entry.  This cannot 
happen AFAICT.
2. if the vmexit was really a failed 1->2 vmentry, and if the processor 
doesn't clear VM_ENTRY_INTR_INFO_FIELD in response to vm entry failures 
(need to check scripture)

If neither of these are valid, the code can be removed.  If only the 
second, we might make it conditional.

>> What can happen is that the contents of the field is transferred to the
>> IDT_VECTORING_INFO field or VM_EXIT_INTR_INFO field.
>>
>> (question: on a failed vmentry, is this field cleared?)
> I don't know the answer :-)

Sheng?

>>> There were two ways to do it: 1. clear it ourselves,
>>> or 2. copy the value from vmcs02 where the processor already cleared it.
>>> There are pros and cons for each approach, but I'll change like you
>>> suggest,
>>> to clear it ourselves:
>>>
>>> 	vmcs12->vm_entry_intr_info_field &= ~INTR_INFO_VALID_MASK;
>> That's really a temporary variable, I don't think you need to touch it.
> But we need to emulate the correct VMX behavior. According to the spec, the
> "valid" bit on this field needs to be cleared on vmexit, so we need to do it
> also on emulated exits from L2 to L1. If we're sure that we already cleared it
> earlier, then fine, but if not (and like I said, I failed to find this code),
> we need to do it now, on exit - either by explicitly clearing the bit or by
> copying a value where the processor cleared this bit (arguably, the former is
> more correct emulation).

Sorry, I misread it as vmx->idt_vectoring_info which is a temporary 
variable used to cache IDT_VECTORING_INFO.  Ignore my remark.

>> I didn't mean register independent helper; one function for cr0 and one
>> function for cr4 so the reader can just see the name and pretend to
>> understand what it does, instead of seeing a bunch of incomprehensible
>> bitwise operations.
> Ok, done:
>
> /*
>   * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date
>   * because L2 may have changed some cr0 bits directly (see CRO_GUEST_HOST_MASK)
>   * without L0 trapping the change and updating vmcs12.
>   * This function returns the value we should put in vmcs12.guest_cr0. It's not
>   * enough to just return the current (vmcs02) GUEST_CR0. This may not be the
>   * guest cr0 that L1 thought it was giving its L2 guest - it is possible that
>   * L1 wished to allow its guest to set a cr0 bit directly, but we (L0) asked
>   * to trap this change and instead set just the read shadow. If this is the
>   * case, we need to copy these read-shadow bits back to vmcs12.guest_cr0, where
>   * L1 believes they already are.
>   */
> static inline unsigned long
> vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs_fields *vmcs12)

newline...

> {
> 	unsigned long guest_cr0_bits =
> 		vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask;
> 	return (vmcs_readl(GUEST_CR0) & guest_cr0_bits) |
> 		(vmcs_readl(CR0_READ_SHADOW) & ~guest_cr0_bits);
> }
>
> and the call becomes just:
>
> 	vmcs12->guest_cr0 = vmcs12_guest_cr0(vcpu, vmcs12);
>
> which is easy to glance over (but doesn't say much about what it is doing).

It's a little easier to digest, at least.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 18/24] Exiting from L2 to L1
  2010-09-12 17:21           ` Avi Kivity
@ 2010-09-12 19:51             ` Nadav Har'El
  2010-09-13  8:48               ` Avi Kivity
  2010-09-13  5:53             ` Sheng Yang
  1 sibling, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-09-12 19:51 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Sheng Yang

On Sun, Sep 12, 2010, Avi Kivity wrote about "Re: [PATCH 18/24] Exiting from L2 to L1":
> >>>>>+	vmcs12->vm_entry_intr_info_field =
> >>>>>+		vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
>...
> prepare_vmcs12() is called in response to a 2->1 vmexit which is first
> trapped by L0, yes?  Because it's called immediately after a vmexit,
> VM_ENTRY_INTR_INFO_FIELD is guaranteed to have been cleared by the 
> processor.

Indeed - it is cleared in the real vmcs, i.e., vmcs02, but we also need
to clear it in the emulated vmcs that L1 sees for L2, i.e., vmcs12.

The original code (quoted above) just copied the vmcs02 value to vmcs12,
which was (mostly) fine because the vmcs02 value has a correctly-cleared
bit - but you asked to avoid the vmread.

So the second option is to just explicitly remove the valid bit from
vmcs12->vm_entry_intr_info_field, which I do now.
vmcs12->vm_entry_intr_info_field was set by L1 before it entered L2, and now
that L2 is exiting back to L1, we need to clear the valid bit.

The more I think about it, the more I become convinced that the second
option is indeed better than the first option (the original code in the
patch).

> There are two cases where VM_ENTRY_INTR_INFO_FIELD can potentially not 
> be cleared by hardware:
>...
> If neither of these are valid, the code can be removed.  If only the 
> second, we might make it conditional.

Again, unless I'm misunderstanding what you mean, the hardware only
modified vmcs02 (the hardware vmcs), not vmcs12. We need to modify vmcs12
as well, to remove the "valid" bit. If we don't, when L1 enters into the same
L2 again, the same old value will be copied again from vmcs12 to vmcs02,
and cause an injection of the same interrupt again.

And by the way, I haven't said this enough, but thanks for your continued
reviews and all your very useful corrections for these patches!

Nadav.

-- 
Nadav Har'El                        |       Sunday, Sep 12 2010, 5 Tishri 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |I before E except after C. We live in a
http://nadav.harel.org.il           |weird society!

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 18/24] Exiting from L2 to L1
  2010-09-12 17:21           ` Avi Kivity
  2010-09-12 19:51             ` Nadav Har'El
@ 2010-09-13  5:53             ` Sheng Yang
  2010-09-13  8:52               ` Avi Kivity
  1 sibling, 1 reply; 147+ messages in thread
From: Sheng Yang @ 2010-09-13  5:53 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Nadav Har'El, kvm

On Monday 13 September 2010 01:21:29 Avi Kivity wrote:
>   On 09/12/2010 07:05 PM, Nadav Har'El wrote:
> >> I don't think so.  We write this field as part of guest entry (that is,
> >> after the decision re which vmcs to use, yes?), so guest entry will
> >> always follow writing this field.  Since guest entry clears the field,
> >> reading it after an exit will necessarily return 0.
> > 
> > Well, you obviously know the KVM code much better than I do, but from
> > what I saw, I thought (maybe I misunderstood) that in normal
> > (non-nested) KVM, this field only gets written on injection, not on
> > every entry, so the code relies on the fact that the processor turns off
> > the "valid" bit during exit, to avoid the same event being injected when
> > the same field value is used for another entry.
> 
> Correct.
> 
> > I can only find code which resets this field in vmx_vcpu_reset(),
> > but that doesn't get called on every entry, right? Or am I missing
> > something?
> 
> prepare_vmcs12() is called in response to a 2->1 vmexit which is first
> trapped by L0, yes?  Because it's called immediately after a vmexit,
> VM_ENTRY_INTR_INFO_FIELD is guaranteed to have been cleared by the
> processor.
> 
> There are two cases where VM_ENTRY_INTR_INFO_FIELD can potentially not
> be cleared by hardware:
> 
> 1. if we call prepare_vmcs12() between injection and entry.  This cannot
> happen AFAICT.
> 2. if the vmexit was really a failed 1->2 vmentry, and if the processor
> doesn't clear VM_ENTRY_INTR_INFO_FIELD in response to vm entry failures
> (need to check scripture)
> 
> If neither of these are valid, the code can be removed.  If only the
> second, we might make it conditional.
> 
> >> What can happen is that the contents of the field is transferred to the
> >> IDT_VECTORING_INFO field or VM_EXIT_INTR_INFO field.
> >> 
> >> (question: on a failed vmentry, is this field cleared?)
> > 
> > I don't know the answer :-)
> 
> Sheng?

According to SDM 23.7 "VM-ENTRY FAILURES DURING OR AFTER LOADING
GUEST STATE":

Although this process resembles that of a VM exit, many steps taken during a VM 
exit do not occur for these VM-entry failures:
• Most VM-exit information fields are not updated (see step 1 above).
• The valid bit in the VM-entry interruption-information field is *not* cleared.
• The guest-state area is not modified.
• No MSRs are saved into the VM-exit MSR-store area.

So a VM-entry failure would _keep_ the valid bit of VM_ENTRY_INTR_INFO_FIELD set.

--
regards
Yang, Sheng

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 18/24] Exiting from L2 to L1
  2010-09-12 19:51             ` Nadav Har'El
@ 2010-09-13  8:48               ` Avi Kivity
  0 siblings, 0 replies; 147+ messages in thread
From: Avi Kivity @ 2010-09-13  8:48 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, Sheng Yang

  On 09/12/2010 09:51 PM, Nadav Har'El wrote:
>
>> There are two cases where VM_ENTRY_INTR_INFO_FIELD can potentially not
>> be cleared by hardware:
>> ...
>> If neither of these are valid, the code can be removed.  If only the
>> second, we might make it conditional.
> Again, unless I'm misunderstanding what you mean, the hardware only
> modified vmcs02 (the hardware vmcs), not vmcs12. We need to modify vmcs12
> as well, to remove the "valid" bit. If we don't, when L1 enters into the same
> L2 again, the same old value will be copied again from vmcs12 to vmcs02,
> and cause an injection of the same interrupt again.

Yes, vmcs12 still needs to be updated.  So the code cannot be removed,
just the vmread.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 18/24] Exiting from L2 to L1
  2010-09-13  5:53             ` Sheng Yang
@ 2010-09-13  8:52               ` Avi Kivity
  2010-09-13  9:01                 ` Nadav Har'El
  0 siblings, 1 reply; 147+ messages in thread
From: Avi Kivity @ 2010-09-13  8:52 UTC (permalink / raw)
  To: Sheng Yang; +Cc: Nadav Har'El, kvm

  On 09/13/2010 07:53 AM, Sheng Yang wrote:
>>
>>>> What can happen is that the contents of the field is transferred to the
>>>> IDT_VECTORING_INFO field or VM_EXIT_INTR_INFO field.
>>>>
>>>> (question: on a failed vmentry, is this field cleared?)
>>> I don't know the answer :-)
>> Sheng?
> According to SDM 23.7 "VM-ENTRY FAILURES DURING OR AFTER LOADING
> GUEST STATE":
>
> Although this process resembles that of a VM exit, many steps taken during a VM
> exit do not occur for these VM-entry failures:
> • Most VM-exit information fields are not updated (see step 1 above).
> • The valid bit in the VM-entry interruption-information field is *not* cleared.
> • The guest-state area is not modified.
> • No MSRs are saved into the VM-exit MSR-store area.
>
> So a VM-entry failure would _keep_ the valid bit of VM_ENTRY_INTR_INFO_FIELD set.
>
>

Ok.  So if the exit was actually due to a failed vmentry, then we do 
need the vmread... (or alternatively, we can avoid clearing the field in 
the first place).

So the following options should work:

1.  vmcs12->vm_entry_intr_info_field = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
2.  if (!(exit_reason & FAILED_ENTRY))
        vmcs12->vm_entry_intr_info_field &= ~VALID;
3.  if (exit_reason & FAILED_ENTRY)
        vmcs12->vm_entry_intr_info_field =
            vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 18/24] Exiting from L2 to L1
  2010-09-13  8:52               ` Avi Kivity
@ 2010-09-13  9:01                 ` Nadav Har'El
  2010-09-13  9:34                   ` Avi Kivity
  0 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-09-13  9:01 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Sheng Yang, kvm

On Mon, Sep 13, 2010, Avi Kivity wrote about "Re: [PATCH 18/24] Exiting from L2 to L1":
> So the following options should work:
> 
> 1.  vmcs12->vm_entry_intr_info_field = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);

Right, this was the original code in the patch.

> 2.  if (!(exit_reason & FAILED_ENTRY))
>         vmcs12->vm_entry_intr_info_field &= ~VALID;

I now prefer this code. It doesn't do a vmread (but replaces it with a bunch
of extra instructions - which might be even slower overall...).

But the more interesting thing is that it doesn't copy irrelevant bits from
vmcs02 to vmcs12, bits that might not have been set by L1 but rather by L0
which previously injected an interrupt into the same L2. These bits shouldn't
matter (when !valid), but a nosy L1 might notice them...

> 3.  if (exit_reason & FAILED_ENTRY)
>         vmcs12->vm_entry_intr_info_field =
>             vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);

I think you meant the opposite condition?

	if (!(exit_reason & FAILED_ENTRY))
		vmcs12->vm_entry_intr_info_field =
			vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);

-- 
Nadav Har'El                        |       Monday, Sep 13 2010, 5 Tishri 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Always borrow money from pessimists. They
http://nadav.harel.org.il           |don't expect to be paid back.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 18/24] Exiting from L2 to L1
  2010-09-13  9:01                 ` Nadav Har'El
@ 2010-09-13  9:34                   ` Avi Kivity
  0 siblings, 0 replies; 147+ messages in thread
From: Avi Kivity @ 2010-09-13  9:34 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Sheng Yang, kvm

  On 09/13/2010 11:01 AM, Nadav Har'El wrote:
>
>> 3.  if (exit_reason & FAILED_ENTRY)
>>         vmcs12->vm_entry_intr_info_field =
>>             vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
> I think you meant the opposite condition?
>
> 	if (!(exit_reason & FAILED_ENTRY))
> 		vmcs12->vm_entry_intr_info_field =
> 			vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
>

Dunno, I think both are subtly broken.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 18/24] Exiting from L2 to L1
  2010-06-14 12:04   ` Avi Kivity
  2010-09-12 14:05     ` Nadav Har'El
@ 2010-09-14 13:07     ` Nadav Har'El
  1 sibling, 0 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-09-14 13:07 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

On Mon, Jun 14, 2010, Avi Kivity wrote about "Re: [PATCH 18/24] Exiting from L2 to L1":
> >+int switch_back_vmcs(struct kvm_vcpu *vcpu)
> >+{
> IIUC vpids are not exposed to the guest yet?  So the VPID should not 
> change between guest and nested guest.

Right. Removed.

> 
> >+
> >+	vmcs_write64(IO_BITMAP_A, src->io_bitmap_a);
> >+	vmcs_write64(IO_BITMAP_B, src->io_bitmap_b);
> >   
> 
> Why change the I/O bitmap?
>...
> Or the msr bitmap?  After all, we're switching the entire vmcs?
>...
> Why write all these?  What could have changed them?

You were right - most of these copies were utterly useless, and apparently
remained in our code since prehistory (when the same hardware vmcs was reused
for both L1 and L2). The great thing is that this removes several dozens of
vmwrites from the L2->L1 exit path in one fell swoop. In fact, the whole
function switch_back_vmcs() is now gone. Thanks for spotting this!


> >+	vmx_set_cr0(vcpu,
> >+		(vmx->nested.l1_shadow_vmcs->cr0_guest_host_mask&
> >+		vmx->nested.l1_shadow_vmcs->cr0_read_shadow) |
> >+		(~vmx->nested.l1_shadow_vmcs->cr0_guest_host_mask&
> >+		vmx->nested.l1_shadow_vmcs->guest_cr0));
> >   
> 
> Helper wanted.

Done. The new helper looks like this:

static inline unsigned long guest_readable_cr0(struct vmcs_fields *fields)
{
	return (fields->guest_cr0 & ~fields->cr0_guest_host_mask) |
		(fields->cr0_read_shadow & fields->cr0_guest_host_mask);
}

And is used in two places in the code (the above place, and another one).
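
For example, in the updated patch further below, the earlier open-coded
expression becomes simply:

	vmx_set_cr0(vcpu, guest_readable_cr0(vmcs01));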



> >+		vmcs_write64(GUEST_PDPTR3, src->guest_pdptr3);
> >+	}
> >   
> 
> A kvm_set_cr3(src->host_cr3) should do all that and more, no?
> >+
> >+	vmx_set_cr4(vcpu, vmx->nested.l1_state.cr4);
> >+
> >   
> 
> Again, the kvm_set_crx() versions have more meat.

I have to admit, I still don't understand this part of the code completely.
The fact that kvm_set_cr4 does more than vmx_set_cr4 doesn't always mean
that we want (or need) to do those things. In particular:
> 
> >+	if (enable_ept) {
> >+		vcpu->arch.cr3 = vmx->nested.l1_shadow_vmcs->guest_cr3;
> >+		vmcs_write32(GUEST_CR3, 
> >vmx->nested.l1_shadow_vmcs->guest_cr3);
> >+	} else {
> >+		kvm_set_cr3(vcpu, vmx->nested.l1_state.cr3);
> >+	}
> >   
> 
> kvm_set_cr3() will load the PDPTRs in the EPT case (correctly in case
> the nested guest was able to corrupt the guest's PDPT).

kvm_set_cr3 calls vmx_set_cr3 which calls ept_load_pdptrs which assumes
that vcpu->arch.pdptrs[] is correct. I am guessing (but am not yet completely
sure) that this code tried to avoid assuming that this cache is up-to-date.
Again, I still need to better understand this part of the code before I
can correct it (because, as the saying goes, "if it ain't broken, don't fix
it" - or at least fix it carefully).

> >+	kvm_mmu_reset_context(vcpu);
> >+	kvm_mmu_load(vcpu);
> >   
> 
> kvm_mmu_load() unneeded, usually.

Again, I'll need to look into this deeper and report back.
In the meantime, attached below is the current version of this patch.

Thanks,
Nadav.

Subject: [PATCH 19/26] nVMX: Exiting from L2 to L1

This patch implements nested_vmx_vmexit(), called when the nested L2 guest
exits and we want to run its L1 parent and let it handle this exit.

Note that this will not necessarily be called on every L2 exit. L0 may decide
to handle a particular exit on its own, without L1's involvement; In that
case, L0 will handle the exit, and resume running L2, without running L1 and
without calling nested_vmx_vmexit(). The logic for deciding whether to handle
a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
will appear in the next patch.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  242 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 241 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c	2010-09-14 15:02:37.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-09-14 15:02:37.000000000 +0200
@@ -4970,9 +4970,13 @@ static void vmx_complete_interrupts(stru
 	int type;
 	bool idtv_info_valid;
 
+	vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);
+
+	if (vmx->nested.nested_mode)
+		return;
+
 	exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
 
-	vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);
 
 	/* Handle machine checks before interrupts are enabled */
 	if ((vmx->exit_reason == EXIT_REASON_MCE_DURING_VMENTRY)
@@ -5961,6 +5965,242 @@ static int nested_vmx_run(struct kvm_vcp
 	return 1;
 }
 
+/*
+ * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date
+ * because L2 may have changed some cr0 bits directly (see CRO_GUEST_HOST_MASK)
+ * without L0 trapping the change and updating vmcs12.
+ * This function returns the value we should put in vmcs12.guest_cr0. It's not
+ * enough to just return the current (vmcs02) GUEST_CR0. This may not be the
+ * guest cr0 that L1 thought it was giving its L2 guest - it is possible that
+ * L1 wished to allow its guest to set a cr0 bit directly, but we (L0) asked
+ * to trap this change and instead set just the read shadow. If this is the
+ * case, we need to copy these read-shadow bits back to vmcs12.guest_cr0, where
+ * L1 believes they already are.
+ */
+static inline unsigned long
+vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs_fields *vmcs12)
+{
+	unsigned long guest_cr0_bits =
+		vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask;
+	return (vmcs_readl(GUEST_CR0) & guest_cr0_bits) |
+		(vmcs_readl(CR0_READ_SHADOW) & ~guest_cr0_bits);
+}
+
+static inline unsigned long
+vmcs12_guest_cr4(struct kvm_vcpu *vcpu, struct vmcs_fields *vmcs12)
+{
+	unsigned long guest_cr4_bits =
+		vcpu->arch.cr4_guest_owned_bits | vmcs12->cr4_guest_host_mask;
+	return (vmcs_readl(GUEST_CR4) & guest_cr4_bits) |
+		(vmcs_readl(CR4_READ_SHADOW) & ~guest_cr4_bits);
+}
+
+/*
+ * prepare_vmcs12 is called when the nested L2 guest exits and we want to
+ * prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12), and this
+ * function updates it to reflect the changes to the guest state while L2 was
+ * running (and perhaps made some exits which were handled directly by L0
+ * without going back to L1), and to reflect the exit reason.
+ * Note that we do not have to copy here all VMCS fields, just those that
+ * could have changed by the L2 guest or the exit - i.e., the guest-state and
+ * exit-information fields only. Other fields are modified by L1 with VMWRITE,
+ * which already writes to vmcs12 directly.
+ */
+void prepare_vmcs12(struct kvm_vcpu *vcpu)
+{
+	struct vmcs_fields *vmcs12 = get_vmcs12_fields(vcpu);
+
+	/* update guest state fields: */
+	vmcs12->guest_cr0 = vmcs12_guest_cr0(vcpu, vmcs12);
+	vmcs12->guest_cr4 = vmcs12_guest_cr4(vcpu, vmcs12);
+
+	vmcs12->guest_dr7 = vmcs_readl(GUEST_DR7);
+	vmcs12->guest_rsp = vmcs_readl(GUEST_RSP);
+	vmcs12->guest_rip = vmcs_readl(GUEST_RIP);
+	vmcs12->guest_rflags = vmcs_readl(GUEST_RFLAGS);
+
+	vmcs12->guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR);
+	vmcs12->guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR);
+	vmcs12->guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR);
+	vmcs12->guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR);
+	vmcs12->guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR);
+	vmcs12->guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR);
+	vmcs12->guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR);
+	vmcs12->guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR);
+	vmcs12->guest_es_limit = vmcs_read32(GUEST_ES_LIMIT);
+	vmcs12->guest_cs_limit = vmcs_read32(GUEST_CS_LIMIT);
+	vmcs12->guest_ss_limit = vmcs_read32(GUEST_SS_LIMIT);
+	vmcs12->guest_ds_limit = vmcs_read32(GUEST_DS_LIMIT);
+	vmcs12->guest_fs_limit = vmcs_read32(GUEST_FS_LIMIT);
+	vmcs12->guest_gs_limit = vmcs_read32(GUEST_GS_LIMIT);
+	vmcs12->guest_ldtr_limit = vmcs_read32(GUEST_LDTR_LIMIT);
+	vmcs12->guest_tr_limit = vmcs_read32(GUEST_TR_LIMIT);
+	vmcs12->guest_gdtr_limit = vmcs_read32(GUEST_GDTR_LIMIT);
+	vmcs12->guest_idtr_limit = vmcs_read32(GUEST_IDTR_LIMIT);
+	vmcs12->guest_es_ar_bytes = vmcs_read32(GUEST_ES_AR_BYTES);
+	vmcs12->guest_cs_ar_bytes = vmcs_read32(GUEST_CS_AR_BYTES);
+	vmcs12->guest_ss_ar_bytes = vmcs_read32(GUEST_SS_AR_BYTES);
+	vmcs12->guest_ds_ar_bytes = vmcs_read32(GUEST_DS_AR_BYTES);
+	vmcs12->guest_fs_ar_bytes = vmcs_read32(GUEST_FS_AR_BYTES);
+	vmcs12->guest_gs_ar_bytes = vmcs_read32(GUEST_GS_AR_BYTES);
+	vmcs12->guest_ldtr_ar_bytes = vmcs_read32(GUEST_LDTR_AR_BYTES);
+	vmcs12->guest_tr_ar_bytes = vmcs_read32(GUEST_TR_AR_BYTES);
+	vmcs12->guest_es_base = vmcs_readl(GUEST_ES_BASE);
+	vmcs12->guest_cs_base = vmcs_readl(GUEST_CS_BASE);
+	vmcs12->guest_ss_base = vmcs_readl(GUEST_SS_BASE);
+	vmcs12->guest_ds_base = vmcs_readl(GUEST_DS_BASE);
+	vmcs12->guest_fs_base = vmcs_readl(GUEST_FS_BASE);
+	vmcs12->guest_gs_base = vmcs_readl(GUEST_GS_BASE);
+	vmcs12->guest_ldtr_base = vmcs_readl(GUEST_LDTR_BASE);
+	vmcs12->guest_tr_base = vmcs_readl(GUEST_TR_BASE);
+	vmcs12->guest_gdtr_base = vmcs_readl(GUEST_GDTR_BASE);
+	vmcs12->guest_idtr_base = vmcs_readl(GUEST_IDTR_BASE);
+
+	/* TODO: These cannot have changed unless we have MSR bitmaps and
+	 * the relevant bit asks not to trap the change */
+	vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
+	if (vmcs_config.vmexit_ctrl & VM_EXIT_SAVE_IA32_PAT)
+		vmcs12->guest_ia32_pat = vmcs_read64(GUEST_IA32_PAT);
+	vmcs12->guest_sysenter_cs = vmcs_read32(GUEST_SYSENTER_CS);
+	vmcs12->guest_sysenter_esp = vmcs_readl(GUEST_SYSENTER_ESP);
+	vmcs12->guest_sysenter_eip = vmcs_readl(GUEST_SYSENTER_EIP);
+
+	vmcs12->guest_activity_state = vmcs_read32(GUEST_ACTIVITY_STATE);
+	vmcs12->guest_interruptibility_info =
+		vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
+	vmcs12->guest_pending_dbg_exceptions =
+		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
+	vmcs12->vmcs_link_pointer = vmcs_read64(VMCS_LINK_POINTER);
+
+	/* update exit information fields: */
+
+	vmcs12->vm_exit_reason  = vmcs_read32(VM_EXIT_REASON);
+	vmcs12->exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+
+	if (enable_ept) {
+		vmcs12->guest_physical_address =
+			vmcs_read64(GUEST_PHYSICAL_ADDRESS);
+		vmcs12->guest_linear_address = vmcs_readl(GUEST_LINEAR_ADDRESS);
+	}
+
+	vmcs12->vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+	vmcs12->vm_exit_intr_error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
+	vmcs12->idt_vectoring_info_field =
+		vmcs_read32(IDT_VECTORING_INFO_FIELD);
+	vmcs12->idt_vectoring_error_code =
+		vmcs_read32(IDT_VECTORING_ERROR_CODE);
+	vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+	vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+
+	/* clear vm-entry fields which are to be cleared on exit */
+	if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
+		vmcs12->vm_entry_intr_info_field &= ~INTR_INFO_VALID_MASK;
+}
+
+static int nested_vmx_vmexit(struct kvm_vcpu *vcpu, bool is_interrupt)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	int efer_offset;
+	struct vmcs_fields *vmcs01 = vmx->nested.vmcs01_fields;
+
+	if (!vmx->nested.nested_mode) {
+		printk(KERN_INFO "WARNING: %s called but not in nested mode\n",
+		       __func__);
+		return 0;
+	}
+
+	sync_cached_regs_to_vmcs(vcpu);
+
+	prepare_vmcs12(vcpu);
+	if (is_interrupt)
+		get_vmcs12_fields(vcpu)->vm_exit_reason =
+			EXIT_REASON_EXTERNAL_INTERRUPT;
+
+	vmx->nested.current_vmcs12->launched = vmx->launched;
+	vmx->nested.current_vmcs12->cpu = vcpu->cpu;
+
+	vmx->vmcs = vmx->nested.vmcs01;
+	vcpu->cpu = vmx->nested.l1_state.cpu;
+	vmx->launched = vmx->nested.l1_state.launched;
+
+	vmx_vcpu_load(vcpu, get_cpu());
+	put_cpu();
+
+	vcpu->arch.efer = vmx->nested.l1_state.efer;
+	if ((vcpu->arch.efer & EFER_LMA) &&
+	    !(vcpu->arch.efer & EFER_SCE))
+		vcpu->arch.efer |= EFER_SCE;
+
+	efer_offset = __find_msr_index(vmx, MSR_EFER);
+	if (update_transition_efer(vmx, efer_offset))
+		wrmsrl(MSR_EFER, vmx->guest_msrs[efer_offset].data);
+	
+	/*
+	 * L2 perhaps switched to real mode and set vmx->rmode, but we're back
+	 * in L1 and as it is running VMX, it can't be in real mode.
+	 */
+	vmx->rmode.vm86_active = 0;
+
+	/*
+	 * We're running a regular L1 guest again, so we do the regular KVM
+	 * thing: run vmx_set_cr0 with the cr0 bits the guest thinks it has.
+	 * vmx_set_cr0 might use slightly different bits on the new guest_cr0
+	 * it sets, e.g., add TS when !fpu_active.
+	 * Note that vmx_set_cr0 refers to rmode and efer set above.
+	 */
+	vmx_set_cr0(vcpu, guest_readable_cr0(vmcs01));
+	/*
+	 * If we did fpu_activate()/fpu_deactive() during l2's run, we need to
+	 * apply the same changes to l1's vmcs. We just set cr0 correctly, but
+	 * now we need to also update cr0_guest_host_mask and exception_bitmap.
+	 */
+	vmcs_write32(EXCEPTION_BITMAP,
+		(vmx->nested.vmcs01_fields->exception_bitmap &
+			~(1u<<NM_VECTOR)) |
+			(vcpu->fpu_active ? 0 : (1u<<NM_VECTOR)));
+	vcpu->arch.cr0_guest_owned_bits = (vcpu->fpu_active ? X86_CR0_TS : 0);
+	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
+
+
+	vmx_set_cr4(vcpu, vmx->nested.l1_state.cr4);
+
+	if (enable_ept) {
+		vcpu->arch.cr3 = vmcs01->guest_cr3;
+		vmcs_write32(GUEST_CR3, vmcs01->guest_cr3);
+		vmcs_write64(EPT_POINTER, vmcs01->ept_pointer);
+		vmcs_write64(GUEST_PDPTR0, vmcs01->guest_pdptr0);
+		vmcs_write64(GUEST_PDPTR1, vmcs01->guest_pdptr1);
+		vmcs_write64(GUEST_PDPTR2, vmcs01->guest_pdptr2);
+		vmcs_write64(GUEST_PDPTR3, vmcs01->guest_pdptr3);
+	} else {
+		kvm_set_cr3(vcpu, vmx->nested.l1_state.cr3);
+	}
+
+	kvm_register_write(vcpu, VCPU_REGS_RSP,
+			   vmx->nested.vmcs01_fields->guest_rsp);
+	kvm_register_write(vcpu, VCPU_REGS_RIP,
+			   vmx->nested.vmcs01_fields->guest_rip);
+
+	vmx->nested.nested_mode = 0;
+
+	kvm_mmu_reset_context(vcpu);
+	kvm_mmu_load(vcpu);
+
+	if (unlikely(vmx->fail)) {
+		/*
+		 * When L1 launches L2 and then we (L0) fail to launch L2,
+		 * we nested_vmx_vmexit back to L1, but now should let it know
+		 * that the VMLAUNCH failed - with the same error that we
+		 * got when launching L2.
+		 */
+		vmx->fail = 0;
+		nested_vmx_failValid(vcpu, vmcs_read32(VM_INSTRUCTION_ERROR));
+	} else
+		nested_vmx_succeed(vcpu);
+
+	return 0;
+}
+
 static struct kvm_x86_ops vmx_x86_ops = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,

-- 
Nadav Har'El                        |       Sunday, Sep 12 2010, 4 Tishri 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Anyone is entitled to their own opinions.
http://nadav.harel.org.il           |No one is entitled to their own facts.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 19/24] Deciding if L0 or L1 should handle an L2 exit
  2010-06-14 12:24   ` Avi Kivity
@ 2010-09-16 14:42     ` Nadav Har'El
  0 siblings, 0 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-09-16 14:42 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

On Mon, Jun 14, 2010, Avi Kivity wrote about "Re: [PATCH 19/24] Deciding if L0 or L1 should handle an L2 exit":
> On 06/13/2010 03:32 PM, Nadav Har'El wrote:
> >This patch contains the logic of whether an L2 exit should be handled by L0
> >and then L2 should be resumed, or whether L1 should be run to handle this
> >exit (using the nested_vmx_vmexit() function of the previous patch).
> >The basic idea is to let L1 handle the exit only if it actually asked to
> >trap this sort of event. For example, when L2 exits on a change to CR0,
>...
> >@@ -3819,6 +3841,8 @@ static int handle_exception(struct kvm_v
> >
> >  	if (is_no_device(intr_info)) {
> >  		vmx_fpu_activate(vcpu);
> >+		if (vmx->nested.nested_mode)
> >+			vmx->nested.nested_run_pending = 1;
> >  		return 1;
> >  	}
> >   
> 
> Isn't this true for many other exceptions?  #UD which we emulate (but 
> the guest doesn't trap), page faults which we handle completely...

I was trying to think why nested_run_pending=1 (forcing us to run L2 next)
is necessary in the specific case of #NM, and couldn't think of any convincing
reason. Sure, in most cases we would like to continue running L2 after L0
serviced this exception that L1 didn't care about, but in the rare cases
where L1 should run next (especially, user-space injecting an interrupt),
what's so wrong with L1 running next?
And, like you said, if it's important for #NM, why not for #PF or other
things?

Anyway, the code appears to run correctly also without this setting,
so I'm guessing that it's an historic artifact, from older code which
was written before the days of the lazy fpu loading. So I'm removing it.

Good catch. I was aware of this peculiar case in the code (and even
documented it in the patch's description), but should have stopped to
think if it is really such a special case, or simply an error. And now
I believe it was nothing more than an error.

> >+/* Return 1 if we should exit from L2 to L1 to handle a CR access exit,
> >+ * rather than handle it ourselves in L0. I.e., check if L1 wanted to
> >+ * intercept (via guest_host_mask etc.) the current event.
> >+ */
> >+static bool nested_vmx_exit_handled_cr(struct kvm_vcpu *vcpu,
> >+	struct shadow_vmcs *l2svmcs)
> >+{
>...
> >+		case 8:
> >+			if (l2svmcs->cpu_based_vm_exec_control&
> >+			    CPU_BASED_CR8_LOAD_EXITING)
> >+				return 1;
> >   
> 
> Should check TPR threshold here too if enabled.

I'll return to this issue in another mail. This one is getting long enough.

> >+	case 3: /* lmsw */
> >+		if (l2svmcs->cr0_guest_host_mask&
> >+		    (val ^ l2svmcs->cr0_read_shadow))
> >+			return 1;
> >   
> 
> Need to mask off bit 0 (cr0.pe) of val, since lmsw can't clear it.

Right. Also, lmsw only works on the first 4 bits of cr0: The first bit, it can
only turn on (like you said), and the next 3 bits, it can change at will. Any
other attempted changes to cr0 through lmsw should be ignored, and not cause
exits.  So I changed the code to this:

		if (vmcs12->cr0_guest_host_mask & 0xe &
		    (val ^ vmcs12->cr0_read_shadow))
			return 1;
		if ((vmcs12->cr0_guest_host_mask & 0x1) &&
		    !(vmcs12->cr0_read_shadow & 0x1) &&
		    (val & 0x1))
		    	return 1;

I wonder if there's a less ugly way to write the same thing...
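
One possible reformulation - just a sketch, and arguably no prettier - is to
first compute the set of cr0 bits that this lmsw can actually change (bits
1-3 freely, PE only from 0 to 1; X86_CR0_PE is the usual name for bit 0), and
then test that set against L1's mask in one place:

	unsigned long changed = (val ^ vmcs12->cr0_read_shadow) & 0xe;

	if (!(vmcs12->cr0_read_shadow & X86_CR0_PE) && (val & X86_CR0_PE))
		changed |= X86_CR0_PE;	/* lmsw can set PE, never clear it */

	if (vmcs12->cr0_guest_host_mask & changed)
		return 1;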

This LMSW is so 80s :( I wonder who's using it these days, and specifically
if it would bother anyone if lmsw suddenly acquired new "abilities" to
change bits it never could... Oh, the things that we do for backward
compatibility :-)

> >+/* Return 1 if we should exit from L2 to L1 to handle an exit, or 0 if we
> >+ * should handle it ourselves in L0. Only call this when in nested_mode 
> >(L2).
> >+ */
> >+static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu, bool afterexit)
>...
> >+	case EXIT_REASON_EXCEPTION_NMI:
> >+		if (is_external_interrupt(intr_info)&&
> >+		    (l2svmcs->pin_based_vm_exec_control&
> >+		     PIN_BASED_EXT_INTR_MASK))
> >+			r = 1;
> >   
> 
> A real external interrupt should never be handled by the guest, only a 
> virtual external interrupt.

You've opened a whole can of worms ;-) But it's very good that you did.

It appears that this nested_vmx_exit_handled() was called for two completely
different reasons, with a different "afterexit" parameter in each case (if you
remember, this flag had the puzzling name "kvm_override" in a previous
version). On normal exits, it was called with afterexit=1 and did exactly what
you wanted, i.e., never handled the exit in the guest:

		case EXIT_REASON_EXTERNAL_INTERRUPT:
			return 0;

The case which you saw was only relevant for the other place this function is
called - in exception injection. But most of the code in this function was
irrelevant and/or plain wrong in this case.

This part of the code was just terrible, and I couldn't leave it like this,
even if it was working (I'm sure you'll agree). So I now completely rewrote
this function to become two separate functions, with (hopefully) no irrelevant
or wrong code.

Here is the new version of the two relevant patches - this one and the one
dealing with exception injection:

Subject: [PATCH 20/26] nVMX: Deciding if L0 or L1 should handle an L2 exit

This patch contains the logic of whether an L2 exit should be handled by L0
and then L2 should be resumed, or whether L1 should be run to handle this
exit (using the nested_vmx_vmexit() function of the previous patch).

The basic idea is to let L1 handle the exit only if it actually asked to
trap this sort of event. For example, when L2 exits on a change to CR0,
we check L1's CR0_GUEST_HOST_MASK to see if L1 expressed interest in any
bit which changed; if it did, we exit to L1. But if it didn't, it means that
it is we (L0) who wished to trap this event, so we handle it ourselves.

The next two patches add additional logic of what to do when an interrupt or
exception is injected: Does L0 need to do it, should we exit to L1 to do it,
or should we resume L2 and keep the exception to be injected later.

We keep a new flag, "nested_run_pending", which can override the decision of
which should run next, L1 or L2. nested_run_pending=1 means that we *must* run
L2 next, not L1. This is necessary in several situations where, had L1 run on
bare metal, it would not have expected to be resumed at this stage. One
example is when L1 did a VMLAUNCH of L2 and therefore expects L2 to be run.
Nested_run_pending is especially intended to avoid switching to L1 in the
injection decision-point described above.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  204 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 204 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-09-16 16:38:03.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-09-16 16:38:03.000000000 +0200
@@ -338,6 +338,8 @@ struct nested_vmx {
 	/* Saving the VMCS that we used for running L1 */
 	struct vmcs *vmcs01;
 	struct vmcs_fields *vmcs01_fields;
+	/* L2 must run next, and mustn't decide to exit to L1. */
+	bool nested_run_pending;
 };
 
 struct vcpu_vmx {
@@ -848,6 +850,20 @@ static inline bool nested_cpu_has_vmx_ep
 		SECONDARY_EXEC_ENABLE_EPT);
 }
 
+static inline bool nested_cpu_has_vmx_msr_bitmap(struct kvm_vcpu *vcpu)
+{
+	return get_vmcs12_fields(vcpu)->cpu_based_vm_exec_control &
+		CPU_BASED_USE_MSR_BITMAPS;
+}
+
+static inline bool is_exception(u32 intr_info)
+{
+	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
+		== (INTR_TYPE_HARD_EXCEPTION | INTR_INFO_VALID_MASK);
+}
+
+static int nested_vmx_vmexit(struct kvm_vcpu *vcpu, bool is_interrupt);
+
 static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
 {
 	int i;
@@ -4879,6 +4895,182 @@ static const int kvm_vmx_max_exit_handle
 	ARRAY_SIZE(kvm_vmx_exit_handlers);
 
 /*
+ * Return 1 if we should exit from L2 to L1 to handle an MSR access,
+ * rather than handle it ourselves in L0. I.e., check L1's MSR bitmap whether
+ * it expressed interest in the current event (read or write a specific MSR).
+ */
+static bool nested_vmx_exit_handled_msr(struct kvm_vcpu *vcpu,
+	struct vmcs_fields *vmcs12, u32 exit_reason)
+{
+	u32 msr_index = vcpu->arch.regs[VCPU_REGS_RCX];
+	struct page *msr_bitmap_page;
+	void *va;
+	bool ret;
+
+	if (!cpu_has_vmx_msr_bitmap() || !nested_cpu_has_vmx_msr_bitmap(vcpu))
+		return 1;
+
+	msr_bitmap_page = nested_get_page(vcpu, vmcs12->msr_bitmap);
+	if (!msr_bitmap_page) {
+		printk(KERN_INFO "%s error in nested_get_page\n", __func__);
+		return 0;
+	}
+
+	va = kmap_atomic(msr_bitmap_page, KM_USER1);
+	if (exit_reason == EXIT_REASON_MSR_WRITE)
+		va += 0x800;
+	if (msr_index >= 0xc0000000) {
+		msr_index -= 0xc0000000;
+		va += 0x400;
+	}
+	if (msr_index > 0x1fff) {
+		kunmap_atomic(va, KM_USER1);
+		return 0;
+	}
+	ret = test_bit(msr_index, va);
+	kunmap_atomic(va, KM_USER1);
+	return ret;
+}
+
+/*
+ * Return 1 if we should exit from L2 to L1 to handle a CR access exit,
+ * rather than handle it ourselves in L0. I.e., check if L1 wanted to
+ * intercept (via guest_host_mask etc.) the current event.
+ */
+static bool nested_vmx_exit_handled_cr(struct kvm_vcpu *vcpu,
+	struct vmcs_fields *vmcs12)
+{
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	int cr = exit_qualification & 15;
+	int reg = (exit_qualification >> 8) & 15;
+	unsigned long val = kvm_register_read(vcpu, reg);
+
+	switch ((exit_qualification >> 4) & 3) {
+	case 0: /* mov to cr */
+		switch (cr) {
+		case 0:
+			if (vmcs12->cr0_guest_host_mask &
+			    (val ^ vmcs12->cr0_read_shadow))
+				return 1;
+			break;
+		case 3:
+			if (vmcs12->cpu_based_vm_exec_control &
+			    CPU_BASED_CR3_LOAD_EXITING)
+				return 1;
+			break;
+		case 4:
+			if (vmcs12->cr4_guest_host_mask &
+			    (vmcs12->cr4_read_shadow ^ val))
+				return 1;
+			break;
+		case 8:
+			if (vmcs12->cpu_based_vm_exec_control &
+			    CPU_BASED_CR8_LOAD_EXITING)
+				return 1;
+			// TODO: missing else if control & CPU_BASED_TPR_SHADOW
+			//   then set tpr shadow and if below tpr_threshold,
+			//   exit.
+			break;
+		}
+		break;
+	case 2: /* clts */
+		if (vmcs12->cr0_guest_host_mask & X86_CR0_TS)
+			return 1;
+		break;
+	case 1: /* mov from cr */
+		switch (cr) {
+		case 0:
+			return 1;
+		case 3:
+			if (vmcs12->cpu_based_vm_exec_control &
+			    CPU_BASED_CR3_STORE_EXITING)
+				return 1;
+			break;
+		case 4:
+			return 1;
+			break;
+		case 8:
+			if (vmcs12->cpu_based_vm_exec_control &
+			    CPU_BASED_CR8_STORE_EXITING)
+				return 1;
+			break;
+		}
+		break;
+	case 3: /* lmsw */
+		/*
+		 * lmsw can change bits 1..3 of cr0, and only set bit 0 of
+		 * cr0. Other attempted changes are ignored, with no exit.
+		 */
+		if (vmcs12->cr0_guest_host_mask & 0xe &
+		    (val ^ vmcs12->cr0_read_shadow))
+			return 1;
+		if ((vmcs12->cr0_guest_host_mask & 0x1) &&
+		    !(vmcs12->cr0_read_shadow & 0x1) &&
+		    (val & 0x1))
+			return 1;
+		break;
+	}
+	return 0;
+}
+
+/*
+ * Return 1 if we should exit from L2 to L1 to handle an exit, or 0 if we
+ * should handle it ourselves in L0 (and then continue L2). Only call this
+ * when in nested_mode (L2).
+ */
+static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
+{
+	u32 exit_reason = vmcs_read32(VM_EXIT_REASON);
+	u32 intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vmcs_fields *vmcs12 = get_vmcs12_fields(vcpu);
+
+	if (vmx->nested.nested_run_pending)
+		return 0;
+
+	if (unlikely(vmx->fail)) {
+		printk(KERN_INFO "%s failed vm entry %x\n",
+		       __func__, vmcs_read32(VM_INSTRUCTION_ERROR));
+		return 1;
+	}
+
+	switch (exit_reason) {
+	case EXIT_REASON_EXTERNAL_INTERRUPT:
+		return 0;
+	case EXIT_REASON_EXCEPTION_NMI:
+		if (!is_exception(intr_info))
+			return 0;
+		else if (is_page_fault(intr_info) && (!enable_ept))
+			return 0;
+		return (vmcs12->exception_bitmap &
+				(1u << (intr_info & INTR_INFO_VECTOR_MASK)));
+	case EXIT_REASON_EPT_VIOLATION:
+		return 0;
+	case EXIT_REASON_INVLPG:
+		return (vmcs12->cpu_based_vm_exec_control &
+				CPU_BASED_INVLPG_EXITING);
+	case EXIT_REASON_MSR_READ:
+	case EXIT_REASON_MSR_WRITE:
+		return nested_vmx_exit_handled_msr(vcpu, vmcs12, exit_reason);
+	case EXIT_REASON_CR_ACCESS:
+		return nested_vmx_exit_handled_cr(vcpu, vmcs12);
+	case EXIT_REASON_DR_ACCESS:
+		return (vmcs12->cpu_based_vm_exec_control &
+				CPU_BASED_MOV_DR_EXITING);
+	default:
+		/*
+		 * One particularly interesting case that is covered here is an
+		 * exit caused by L2 running a VMX instruction. L2 is guest
+		 * mode in L1's world, and according to the VMX spec running a
+		 * VMX instruction in guest mode should cause an exit to root
+		 * mode, i.e., to L1. This is why we need to return r=1 for
+		 * those exit reasons too. This enables further nesting: Like
+		 * L0 emulates VMX for L1, we now allow L1 to emulate VMX for
+		 * L2, who will then be able to run L3.
+		 */
+		return 1;
+	}
+}
+
+/*
  * The guest has exited.  See if we can fix it or if we need userspace
  * assistance.
  */
@@ -4894,6 +5086,17 @@ static int vmx_handle_exit(struct kvm_vc
 	if (vmx->emulation_required && emulate_invalid_guest_state)
 		return handle_invalid_guest_state(vcpu);
 
+	if (exit_reason == EXIT_REASON_VMLAUNCH ||
+	    exit_reason == EXIT_REASON_VMRESUME)
+		vmx->nested.nested_run_pending = 1;
+	else
+		vmx->nested.nested_run_pending = 0;
+
+	if (vmx->nested.nested_mode && nested_vmx_exit_handled(vcpu)) {
+		nested_vmx_vmexit(vcpu, false);
+		return 1;
+	}
+
 	/* Access CR3 don't cause VMExit in paging mode, so we need
 	 * to sync with guest real CR3. */
 	if (enable_ept && is_paging(vcpu))
@@ -5941,6 +6144,7 @@ static int nested_vmx_run(struct kvm_vcp
 		r = kvm_mmu_load(vcpu);
 		if (unlikely(r)) {
 			printk(KERN_ERR "Error in kvm_mmu_load r %d\n", r);
+			nested_vmx_vmexit(vcpu, false);
 			nested_vmx_failValid(vcpu,
 				VMXERR_VMRESUME_CORRUPTED_VMCS /* ? */);
 			/* switch back to L1 */

Subject: [PATCH 22/26] nVMX: Correct handling of exception injection

Similar to the previous patch, but concerning injection of exceptions rather
than external interrupts.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-09-16 16:38:04.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-09-16 16:38:04.000000000 +0200
@@ -1507,6 +1507,25 @@ static void skip_emulated_instruction(st
 	vmx_set_interrupt_shadow(vcpu, 0);
 }
 
+/*
+ * KVM wants to inject page-faults which it got to the guest. This function
+ * checks whether in a nested guest, we need to inject them to L1 or L2.
+ * This function assumes it is called with the exit reason in vmcs02 being
+ * a #PF exception (this is the only case in which KVM injects a #PF when L2
+ * is running).
+ */
+static int nested_pf_handled(struct kvm_vcpu *vcpu)
+{
+	struct vmcs_fields *vmcs12 = get_vmcs12_fields(vcpu);
+
+	/* TODO: also check PFEC_MATCH/MASK, not just EB.PF. */
+	if (!(vmcs12->exception_bitmap & (1u << PF_VECTOR)))
+		return 0;
+
+	nested_vmx_vmexit(vcpu, false);
+	return 1;
+}
+
 static void vmx_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
 				bool has_error_code, u32 error_code,
 				bool reinject)
@@ -1514,6 +1533,10 @@ static void vmx_queue_exception(struct k
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u32 intr_info = nr | INTR_INFO_VALID_MASK;
 
+	if (nr == PF_VECTOR && vmx->nested.nested_mode &&
+		nested_pf_handled(vcpu))
+		return;
+
 	if (has_error_code) {
 		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, error_code);
 		intr_info |= INTR_INFO_DELIVER_CODE_MASK;
@@ -3554,6 +3577,9 @@ static void vmx_inject_nmi(struct kvm_vc
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
+	if (vmx->nested.nested_mode)
+		return;
+
 	if (!cpu_has_virtual_nmis()) {
 		/*
 		 * Tracking the NMI-blocked state in software is built upon

-- 
Nadav Har'El                        |      Tuesday, Sep 14 2010, 6 Tishri 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |echo '[q]sa[ln0=aln256%Pln256/snlbx]
http://nadav.harel.org.il           |sb3135071790101768542287578439snlbxq'|dc

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 20/24] Correct handling of interrupt injection
  2010-06-14 12:29   ` Avi Kivity
  2010-06-14 12:48     ` Avi Kivity
@ 2010-09-16 15:25     ` Nadav Har'El
  1 sibling, 0 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-09-16 15:25 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

On Mon, Jun 14, 2010, Avi Kivity wrote about "Re: [PATCH 20/24] Correct handling of interrupt injection":
>..
> >However, properly doing what is described above requires invasive changes 
> >to
> >the flow of the existing code, which we elected not to do in this stage.
> >Instead we do something more simplistic and less efficient: we modify
>...
> 
> That's a little sad.

I agree. I'd like to change this code to do the proper thing (as I explained
in the patch's description), but as I said this will require some invasive
changes to existing KVM code outside vmx.c.
Seeing that the existing code also works, and that despite the small
performance hit there are much more pressing performance issues (namely, the
need for nested EPT) - with your permission I'd like to postpone fixing this
issue.

> >  static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
> >  {
> >+	if (to_vmx(vcpu)->nested.nested_mode&&  nested_exit_on_intr(vcpu)) {
> >+		if (to_vmx(vcpu)->nested.nested_run_pending)
> >+			return 0;
> >+		nested_vmx_vmexit(vcpu, true);
> >+		/* fall through to normal code, but now in L1, not L2 */
> >+	}
> >+
> >   
> 
> What exit is reported here?

When nested_vmx_vmexit is called with the second parameter true, as above,
it modifies the (vmcs12) exit reason to be EXIT_REASON_EXTERNAL_INTERRUPT.
A hack, but it does the right thing in this case because L1 doesn't even get
a chance to care about this exit reason before it exits again (as I tried to
explain in the patch's description).

-- 
Nadav Har'El                        |     Thursday, Sep 16 2010, 8 Tishri 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Debugging: the art of removing bugs.
http://nadav.harel.org.il           |Programming: the art of inserting them.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 16/24] Implement VMLAUNCH and VMRESUME
  2010-06-17 10:59   ` Gleb Natapov
@ 2010-09-16 16:06     ` Nadav Har'El
  0 siblings, 0 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-09-16 16:06 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: avi, kvm

On Thu, Jun 17, 2010, Gleb Natapov wrote about "Re: [PATCH 16/24] Implement VMLAUNCH and VMRESUME":
> > +static int handle_launch_or_resume(struct kvm_vcpu *vcpu, bool launch)
> > +{
> > +	if (!nested_vmx_check_permission(vcpu))
>...
> Should also check MOV SS blocking. Why Intel decided that vm entry
> should fail in this case? How knows, but spec says so.

Thanks. Added the check:

if (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & GUEST_INTR_STATE_MOV_SS) {
	nested_vmx_failValid(vcpu,
		VMXERR_ENTRY_EVENTS_BLOCKED_BY_MOV_SS);
	skip_emulated_instruction(vcpu);
	return 1;
}

Like you, I don't understand why this test is at all necessary...

-- 
Nadav Har'El                        |     Thursday, Sep 16 2010, 8 Tishri 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Help Microsoft stamp out piracy. Give
http://nadav.harel.org.il           |Linux to a friend today!

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 22/24] Correct handling of idt vectoring info
  2010-06-17 11:58   ` Gleb Natapov
@ 2010-09-20  6:37     ` Nadav Har'El
  2010-09-20  9:34       ` Gleb Natapov
  0 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-09-20  6:37 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: avi, kvm

On Thu, Jun 17, 2010, Gleb Natapov wrote about "Re: [PATCH 22/24] Correct handling of idt vectoring info":
> On Sun, Jun 13, 2010 at 03:33:50PM +0300, Nadav Har'El wrote:
> > In the normal non-nested case, the idt_vectoring_info case is treated after
> > the exit. However, in the nested case a decision of whether to return to L2
> This is not correct. On the normal non-nested case the idt_vectoring_info is
> parsed into vmx/svm independent data structure (which is saved/restored during
> VM migartion) after exit. The reinjection happens on vmentry path.

You're right that perhaps I overstated the difference between the nested
and normal case. In both cases, the code at the end of vmx_vcpu_run just
remembers what to do, and the actual injection (namely, setting the vmcs)
happens in the beginning of the next entry. I'll change the wording of the
patch's description to make it more accurate.

> Why can't you do that using existing exception/nmi/interrupt queues that
> we have, but instead you effectively disable vmx_complete_interrupts()
> by patch 18 when in nested mode and add logically same code in this
> patch. I.e after exit you save info about idt event into nested_vmx
> and reinject it on vm entry.

This is an interesting point.

The basic problem is that (as I explained in the patch's description) when
L2 exits to L1 with idt vectoring info, L0 should *not* do its normal
thing of injecting the event - it should basically do nothing, and just
leave the IDT_VECTORING_INFO_FIELD in vmcs12 for L1 to find and act upon.
So in this case we must eliminate the normal decision that KVM would make
to inject the event.
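
(Just to illustrate what "leaving it for L1" means, the relevant step on the
L2-to-L1 exit path would be roughly the following - a sketch only, with the
helper name made up and the vmcs12 field names assumed to follow this patch
set's vmcs_fields naming:)

static void nested_save_idt_vectoring_info(struct kvm_vcpu *vcpu)
{
	struct vmcs_fields *vmcs12 = get_vmcs12_fields(vcpu);

	/* Copy what the CPU reported into L1's VMCS, untouched, so that
	 * L1 - and not L0 - decides how to re-inject the event. */
	vmcs12->idt_vectoring_info_field =
		vmcs_read32(IDT_VECTORING_INFO_FIELD);
	vmcs12->idt_vectoring_error_code =
		vmcs_read32(IDT_VECTORING_ERROR_CODE);
	vmcs12->vm_exit_instruction_len =
		vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
}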

Perhaps it would have been possible to leave the decision as-is (i.e.,
not change vmx_complete_interrupts()), but instead disable the injection
itself in inject_pending_event() (in x86.c, not vmx.c) or in all of
vmx_queue_exception, vmx_set_nmi and vmx_set_irq. But I'm not sure this will
be a cleaner patch (and I'd especially like to avoid nested-specific changes
in x86.c), and I'm pretty sure that however I change this code, it's bound
to break in subtle ways. The current patch took some blood, toil, tears
and sweat (well, maybe everything except the blood...) of my coworkers 
to get right :-)

Nadav.

-- 
Nadav Har'El                        |      Sunday, Sep 19 2010, 12 Tishri 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Life is what happens to you while you're
http://nadav.harel.org.il           |busy making other plans. - John Lennon

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 22/24] Correct handling of idt vectoring info
  2010-09-20  6:37     ` Nadav Har'El
@ 2010-09-20  9:34       ` Gleb Natapov
  2010-09-20 10:03         ` Nadav Har'El
  0 siblings, 1 reply; 147+ messages in thread
From: Gleb Natapov @ 2010-09-20  9:34 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: avi, kvm

On Mon, Sep 20, 2010 at 08:37:04AM +0200, Nadav Har'El wrote:
> > Why can't you do that using existing exception/nmi/interrupt queues that
> > we have, but instead you effectively disable vmx_complete_interrupts()
> > by patch 18 when in nested mode and add logically same code in this
> > patch. I.e after exit you save info about idt event into nested_vmx
> > and reinject it on vm entry.
> 
> This is an interesting point.
> 
> The basic problem is that (as I explained in the patch's description) when
> L2 exits to L1 with idt vectoring info, L0 should *not* do its normal
> thing of injecting the event - it should basically do nothing, and just
> leave the IDT_VECTORING_INFO_FIELD in vmcs12 for L1 to find and act upon.
> So in this case we must eliminate the normal decision that KVM would make
> to inject the event.
> 
But your code disables normal path even if L0 is the one who should
handle exit and re-inject event into L2. Look at what nested SVM is doing.
It is checking in handle_exit() if vmexit should cause vmexit into L1
and if so they bypass regular code path by emulating exit instead, but if
L0 should handle the vmexit it uses regular code path.

> Perhaps it would have been possible to leave the decision as-is (i.e.,
> not change vmx_complete_interrupts()), but instead disable the injection
> itself in inject_pending_event() (in x86.c, not vmx.c) or in all of
> vmx_queue_exception, vmx_set_nmi and vmx_set_irq. But I'm not sure this will
> be a cleaner patch (and I'd especially like to avoid nested-specific changes
> in x86.c), and I'm pretty sure that however I change this code, it's bound
> to break in subtle ways. The current patch took some blood, toil, tears
> and sweat (well, maybe everything except the blood...) of my coworkers 
> to get right :-)
> 
Look at how SVM did it. VMX shouldn't be different.

--
			Gleb.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 22/24] Correct handling of idt vectoring info
  2010-09-20  9:34       ` Gleb Natapov
@ 2010-09-20 10:03         ` Nadav Har'El
  2010-09-20 10:11           ` Avi Kivity
  2010-09-20 10:20           ` Gleb Natapov
  0 siblings, 2 replies; 147+ messages in thread
From: Nadav Har'El @ 2010-09-20 10:03 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: avi, kvm

On Mon, Sep 20, 2010, Gleb Natapov wrote about "Re: [PATCH 22/24] Correct handling of idt vectoring info":
> But your code disables normal path even if L0 is the one who should
> handle exit and re-inject event into L2. Look at what nested SVM is doing.
> It is checking in handle_exit() if vmexit should cause vmexit into L1
> and if so they bypass regular code path by emulating exit instead, but if
> L0 should handle the vmexit it uses regular code path.

I wish it could be this simple. In vmx.c, we unfortunately don't have one
such decision point (of when to exit into L1), but two: one is the exit
handling (like in svm), but there's another one in the injection path
(vmx_queue_exception): namely, when KVM decides to inject a #PF (because
the guest, not it, should have gotten this #PF), we also need to exit to L1,
and we only discover this in the entry path, not the exit path.

We could have changed the code to do this special PF handling not in the
entrance but rather at the point at the exit when this event is being queued.
We probably should. But I'm afraid that this would require quite a bit of
changes in the non-nested vmx (and possibly x86) code, which we wanted to
avoid making. I'm also afraid that I don't understand all the reasons that
brought to the current situation :-(

> Look at how SVM did it. VMX shouldn't be different.

I'm afraid I know very little about the SVM architecture. Does SVM even have
a parallel of the IDT_VECTORING_INFO that this patch is trying to address?

I agree that the nested SVM's handle_exit() looks cleaner than the parallel
code in nested VMX. The root of all evil is that second exit decision point
in the injection phase, and I'll think some more if I can find a way to
avoid it without rocking the foundations too much.

-- 
Nadav Har'El                        |      Monday, Sep 20 2010, 12 Tishri 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Bore, n.: A person who talks when you
http://nadav.harel.org.il           |wish him to listen.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 22/24] Correct handling of idt vectoring info
  2010-09-20 10:03         ` Nadav Har'El
@ 2010-09-20 10:11           ` Avi Kivity
  2010-09-22 23:15             ` Nadav Har'El
  2010-09-20 10:20           ` Gleb Natapov
  1 sibling, 1 reply; 147+ messages in thread
From: Avi Kivity @ 2010-09-20 10:11 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Gleb Natapov, kvm

  On 09/20/2010 12:03 PM, Nadav Har'El wrote:
> On Mon, Sep 20, 2010, Gleb Natapov wrote about "Re: [PATCH 22/24] Correct handling of idt vectoring info":
> >  But your code disables normal path even if L0 is the one who should
> >  handle exit and re-inject event into L2. Look at what nested SVM is doing.
> >  It is checking in handle_exit() if vmexit should cause vmexit into L1
> >  and if so they bypass regular code path by emulating exit instead, but if
> >  L0 should handle the vmexit it uses regular code path.
>
> I wish it could be this simple. In vmx.c, we unfortunately don't have one
> such decision point (of when to exit into L1), but two: one is the exit
> handling (like in svm), but there's another one in the injection path
> (vmx_queue_exception): namely, when KVM decides to inject a #PF (because
> the guest, not it, should have gotten this #PF), we also need to exit to L1,
> and we only discover this in the entry path, not the exit path.

This is not specific to #PF; we queue other exceptions as well, for 
example #GP and #UD.  They all need to be checked against the EXCEPTION_BITMAP.

> We could have changed the code to do this special PF handling not in the
> entrance but rather at the point at the exit when this event is being queued.
> We probably should. But I'm afraid that this would require quite a bit of
> changes in the non-nested vmx (and possibly x86) code, which we wanted to
> avoid making. I'm also afraid that I don't understand all the reasons that
> brought to the current situation :-(

Maybe add a queue (like the exception queue) to hold those pending exits?

Then kvm_queue_exception() could check for an intercept and queue a 
vmexit instead.

> >  Look at how SVM did it. VMX shouldn't be different.
>
> I'm afraid I know very little about the SVM architecture. Does SVM even have
> a parallel of the IDT_VECTORING_INFO that this patch is trying to address?

It does. exit_int_info.

> I agree that the nested SVM's handle_exit() looks cleaner than the parallel
> code in nested VMX. The root of all evil is that second exit decision point
> in the injection phase, and I'll think some more if I can find a way to
> avoid it without rocking the foundations too much.
>

I think svm needs it too.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 22/24] Correct handling of idt vectoring info
  2010-09-20 10:03         ` Nadav Har'El
  2010-09-20 10:11           ` Avi Kivity
@ 2010-09-20 10:20           ` Gleb Natapov
  1 sibling, 0 replies; 147+ messages in thread
From: Gleb Natapov @ 2010-09-20 10:20 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: avi, kvm

On Mon, Sep 20, 2010 at 12:03:40PM +0200, Nadav Har'El wrote:
> On Mon, Sep 20, 2010, Gleb Natapov wrote about "Re: [PATCH 22/24] Correct handling of idt vectoring info":
> > But your code disables normal path even if L0 is the one who should
> > handle exit and re-inject event into L2. Look at what nested SVM is doing.
> > It is checking in handle_exit() if vmexit should cause vmexit into L1
> > and if so they bypass regular code path by emulating exit instead, but if
> > L0 should handle the vmexit it uses regular code path.
> 
> I wish it could be this simple. In vmx.c, we unfortunately don't have one
> such decision point (of when to exit into L1), but two: one is the exit
> handling (like in svm), but there's another one in the injection path
> (vmx_queue_exception): namely, when KVM decides to inject a #PF (because
> the guest, not it, should have gotten this #PF), we also need to exit to L1,
> and we only discover this in the entry path, not the exit path.
> 
SVM has exactly the same problem. What they do is, in svm_queue_exception(),
they check if the exception should generate a vmexit, and if so they set
the svm->nested.exit_required flag. They skip the next vmentry if the flag
is set and proceed directly to handle_exit(), where they check this flag
once again and emulate a nested vmexit. Their nested vmexit emulation
clears the exception/interrupt queue.
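
In VMX terms, that flow would look roughly like the sketch below (only an
illustration: nested.exit_required is a hypothetical field mirroring SVM's,
and get_vmcs12_fields() is the accessor from this patch set):

/* Called from vmx_queue_exception() before injecting anything. */
static bool nested_vmx_check_exception(struct vcpu_vmx *vmx, unsigned nr)
{
	if (!vmx->nested.nested_mode)
		return false;

	if (!(get_vmcs12_fields(&vmx->vcpu)->exception_bitmap & (1u << nr)))
		return false;

	/* L1 intercepts this exception: don't inject it into L2. Skip
	 * the next vmentry and let handle_exit() emulate an exit from
	 * L2 to L1 instead. */
	vmx->nested.exit_required = true;
	return true;
}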

> We could have changed the code to do this special PF handling not in the
> entrance but rather at the point at the exit when this event is being queued.
> We probably should. But I'm afraid that this would require quite a bit of
> changes in the non-nested vmx (and possibly x86) code, which we wanted to
> avoid making. I'm also afraid that I don't understand all the reasons that
> brought to the current situation :-(
> 
Nested SVM managed to do it without too much hassle. There is reinject
logic in svm_queue_exception() that I still do not understand, but the
same logic should be applicable to VMX too, since the SVM and VMX ways of
doing virtualization are very similar.

> > Look at how SVM did it. VMX shouldn't be different.
> 
> I'm afraid I know very little about the SVM architecture. Does SVM even have
> a parallel of the IDT_VECTORING_INFO that this patch is trying to address?
Exactly the same. The way event injection works in SVM and VMX is similar.
This allows us to maintain most of the logic in common code.

> 
> I agree that the nested SVM's handle_exit() looks cleaner than the parallel
> code in nested VMX. The root of all evil is that second exit decision point
> in the injection phase, and I'll think some more if I can find a way to
> avoid it without rocking the foundations too much.
> 

--
			Gleb.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 22/24] Correct handling of idt vectoring info
  2010-09-20 10:11           ` Avi Kivity
@ 2010-09-22 23:15             ` Nadav Har'El
  2010-09-26 15:14               ` Avi Kivity
  0 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-09-22 23:15 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Gleb Natapov, kvm

On Mon, Sep 20, 2010, Avi Kivity wrote about "Re: [PATCH 22/24] Correct handling of idt vectoring info":
> >I'm afraid I know very little about the SVM architecture. Does SVM even 
> >have
> >a parallel of the IDT_VECTORING_INFO that this patch is trying to address?
> 
> It does. exit_int_info.

Thanks. I guess I need to do some serious reading on this subject. I guessed
that exit_int_info was more of a parallel of VMX's vm_exit_intr_info field and
not idt_vectoring_info, but I guess I was wrong.

> >I agree that the nested SVM's handle_exit() looks cleaner than the parallel
> >code in nested VMX. The root of all evil is that second exit decision point
> >in the injection phase, and I'll think some more if I can find a way to
> >avoid it without rocking the foundations too much.
> >
> 
> I think svm needs it too.

Can you please clarify? I didn't understand what "it" refers to here.

Thanks,
Nadav.

-- 
Nadav Har'El                        |    Thursday, Sep 23 2010, 15 Tishri 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Birthdays are good for you - the more you
http://nadav.harel.org.il           |have the longer you live.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 16/24] Implement VMLAUNCH and VMRESUME
  2010-06-14 11:41   ` Avi Kivity
@ 2010-09-26 11:14     ` Nadav Har'El
  2010-09-26 12:56       ` Avi Kivity
  0 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-09-26 11:14 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

On Mon, Jun 14, 2010, Avi Kivity wrote about "Re: [PATCH 16/24] Implement VMLAUNCH and VMRESUME":
> >+	vmx_set_cr0(vcpu,
> >+		(get_shadow_vmcs(vcpu)->guest_cr0&
> >+			~get_shadow_vmcs(vcpu)->cr0_guest_host_mask) |
> >+		(get_shadow_vmcs(vcpu)->cr0_read_shadow&
> >+			get_shadow_vmcs(vcpu)->cr0_guest_host_mask));
> >+
> >+	/* However, vmx_set_cr0 incorrectly enforces KVM's relationship 
> >between
> >+	 * GUEST_CR0 and CR0_READ_SHADOW, e.g., that the former is the same 
> >as
> >+	 * the latter with TS added if !fpu_active. We need to take the
> >+	 * actual GUEST_CR0 that L1 wanted, just with added TS if !fpu_active
> >+	 * like KVM wants (for the "lazy fpu" feature, to avoid the costly
> >+	 * restoration of fpu registers until the FPU is really used).
> >+	 */
> >+	vmcs_writel(GUEST_CR0, get_shadow_vmcs(vcpu)->guest_cr0 |
> >+		(vcpu->fpu_active ? 0 : X86_CR0_TS));
> >   
> 
> Please update vmx_set_cr0() instead.

How would you like me to do that?
I could split vmx_set_cr0(vcpu, cr0) into a __vmx_set_cr0(vcpu, cr0, hw_cr0)
and vmx_set_cr0 that calls it. Is this what you had in mind? Won't it be
a little ugly? I agree, though, that it will avoid the vmwriting GUEST_CR0
twice in the nested case.

> >+	/* we have to set the X86_CR0_PG bit of the cached cr0, because
> >+	 * kvm_mmu_reset_context enables paging only if X86_CR0_PG is set in
> >+	 * CR0 (we need the paging so that KVM treat this guest as a paging
> >+	 * guest so we can easily forward page faults to L1.)
> >+	 */
> >+	vcpu->arch.cr0 |= X86_CR0_PG;
> >   
> 
> Since this version doesn't support unrestricted nested guests, cr0.pg 
> will be already set or we will have failed vmentry.

I believe without this "hack", things didn't work properly during boot of
L2, when cr0_read_shadow.pg was not yet set. I think PG is set in guest_cr0,
but not in cr0_read_shadow, which is what vcpu->arch.cr0 caches.

> >+	if (enable_ept&&  !nested_cpu_has_vmx_ept(vcpu)) {
> >   
> 
> We don't support nested ept yet, yes?

Right. It seems like this (and a couple of other places) were left from
our internal codebase (which did have nested ept). I'll clean it up.


-- 
Nadav Har'El                        |      Monday, Sep 20 2010, 12 Tishri 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |The two rules for success are: 1. Never
http://nadav.harel.org.il           |tell them everything you know.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 16/24] Implement VMLAUNCH and VMRESUME
  2010-09-26 11:14     ` Nadav Har'El
@ 2010-09-26 12:56       ` Avi Kivity
  2010-09-26 13:06         ` Nadav Har'El
  0 siblings, 1 reply; 147+ messages in thread
From: Avi Kivity @ 2010-09-26 12:56 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

  On 09/26/2010 01:14 PM, Nadav Har'El wrote:
> On Mon, Jun 14, 2010, Avi Kivity wrote about "Re: [PATCH 16/24] Implement VMLAUNCH and VMRESUME":
> >  >+	vmx_set_cr0(vcpu,
> >  >+		(get_shadow_vmcs(vcpu)->guest_cr0&
> >  >+			~get_shadow_vmcs(vcpu)->cr0_guest_host_mask) |
> >  >+		(get_shadow_vmcs(vcpu)->cr0_read_shadow&
> >  >+			get_shadow_vmcs(vcpu)->cr0_guest_host_mask));
> >  >+
> >  >+	/* However, vmx_set_cr0 incorrectly enforces KVM's relationship
> >  >between
> >  >+	 * GUEST_CR0 and CR0_READ_SHADOW, e.g., that the former is the same
> >  >as
> >  >+	 * the latter with TS added if !fpu_active. We need to take the
> >  >+	 * actual GUEST_CR0 that L1 wanted, just with added TS if !fpu_active
> >  >+	 * like KVM wants (for the "lazy fpu" feature, to avoid the costly
> >  >+	 * restoration of fpu registers until the FPU is really used).
> >  >+	 */
> >  >+	vmcs_writel(GUEST_CR0, get_shadow_vmcs(vcpu)->guest_cr0 |
> >  >+		(vcpu->fpu_active ? 0 : X86_CR0_TS));
> >  >
> >
> >  Please update vmx_set_cr0() instead.
>
> How would you like me to do that?
> I could split vmx_set_cr0(vcpu, cr0) into a __vmx_set_cr0(vcpu, cr0, hw_cr0)
> and vmx_set_cr0 that calls it. Is this what you had in mind? Won't it be
> a little ugly? I agree, though, that it will avoid the vmwriting GUEST_CR0
> twice in the nested case.

Just move the extra calculations into vmx_set_cr0().  Check if you're in 
nested mode, and if so apply cr0_guest_host_mask.

The vmlaunch/vmresume code becomes kvm_set_cr0(vcpu, 
get_shadow_vmcs(vcpu)->guest_cr0).

> >  >+	/* we have to set the X86_CR0_PG bit of the cached cr0, because
> >  >+	 * kvm_mmu_reset_context enables paging only if X86_CR0_PG is set in
> >  >+	 * CR0 (we need the paging so that KVM treat this guest as a paging
> >  >+	 * guest so we can easily forward page faults to L1.)
> >  >+	 */
> >  >+	vcpu->arch.cr0 |= X86_CR0_PG;
> >  >
> >
> >  Since this version doesn't support unrestricted nested guests, cr0.pg
> >  will be already set or we will have failed vmentry.
>
> I believe without this "hack", things didn't work properly during boot of
> L2, when cr0_read_shadow.pg was not yet set. I think PG is set in guest_cr0,
> but not in cr0_read_shadow, which is what vcpu->arch.cr0 caches.

I don't see how vcpu->arch.cr0 can cache cr0_read_shadow.  All the mmu 
calculations depend on vcpu->arch.cr0, which must be what the processor 
uses for translations.  cr0_read_shadow is only use to emulate read 
access to cr0 (note we need to both update the real CR0_READ_SHADOW, and 
to consider the virtual CR0_READ_SHADOW when emulating).



-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 16/24] Implement VMLAUNCH and VMRESUME
  2010-09-26 12:56       ` Avi Kivity
@ 2010-09-26 13:06         ` Nadav Har'El
  2010-09-26 13:51           ` Avi Kivity
  0 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-09-26 13:06 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

On Sun, Sep 26, 2010, Avi Kivity wrote about "Re: [PATCH 16/24] Implement VMLAUNCH and VMRESUME":
> I don't see how vcpu->arch.cr0 can cache cr0_read_shadow.

But this is precisely what the (unpatched) vmx_set_cr0 code does:
If you look at it, it takes a parameter "cr0" and builds an additional
variable "hw_cr0".
"cr0" gets written into CR0_READ_SHADOW, while "hw_cr0" gets written into
GUEST_CR0.
vcpu->arch.cr0 gets a copy of "cr0", not of "hw_cr0", i.e., it is a cache of
CR0_READ_SHADOW, not of GUEST_CR0.

Or am I missing something?

Thanks,
Nadav.

-- 
Nadav Har'El                        |      Sunday, Sep 26 2010, 18 Tishri 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |What's the greatest world-wide use of
http://nadav.harel.org.il           |cowhide? To hold cows together.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 16/24] Implement VMLAUNCH and VMRESUME
  2010-09-26 13:06         ` Nadav Har'El
@ 2010-09-26 13:51           ` Avi Kivity
  0 siblings, 0 replies; 147+ messages in thread
From: Avi Kivity @ 2010-09-26 13:51 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

  On 09/26/2010 03:06 PM, Nadav Har'El wrote:
> On Sun, Sep 26, 2010, Avi Kivity wrote about "Re: [PATCH 16/24] Implement VMLAUNCH and VMRESUME":
> >  I don't see how vcpu->arch.cr0 can cache cr0_read_shadow.
>
> But this is precisely what the (unpatched) vmx_set_cr0 code does:
> If you look at it, it takes a parameter "cr0" and builds an additional
> variable "hw_cr0".
> "cr0" gets written into CR0_READ_SHADOW, while "hw_cr0" gets written into
> GUEST_CR0.
> vcpu->arch.cr0 gets a copy of "cr0", not of "hw_cr0", i.e., it is a cache of
> CR0_READ_SHADOW, not of GUEST_CR0.
>
> Or am I missing something?
>

In vmx, cr0 is split into two registers, CR0_READ_SHADOW and GUEST_CR0.  
nvmx needs to split vCR0_READ_SHADOW and vGUEST_CR0 into three.

vCR0_READ_SHADOW can be assigned directly to CR0_READ_SHADOW.
vGUEST_CR0 can be copied to vcpu->arch.cr0 so the mmu acts according to 
the mode L1 thinks it places L2 into (but not what L2 thinks it is in).
vGUEST_CR0, appropriately munged (by ORing it with KVM_VM_CR0_ALWAYS_ON 
and doing the TS games) is assigned to GUEST_CR0.

We need to audit all code that touches vcpu->arch.cr0; but I think this 
split is the easiest one.  The only code that needs to change is the 
cr0/lmsw emulation code (writes need to consider vCR0_GUEST_HOST_MASK).
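
For concreteness, the split on the L2-entry path could look roughly like this
(a minimal sketch only; the helper name is made up, and the vmcs_fields layout
is the one from this patch series, not final code):

static void nested_load_cr0(struct kvm_vcpu *vcpu,
			    struct vmcs_fields *vmcs12)
{
	/* What L2 observes when it reads cr0: exactly L1's shadow. */
	vmcs_writel(CR0_READ_SHADOW, vmcs12->cr0_read_shadow);

	/* What the mmu acts on: the mode L1 thinks it put L2 into. */
	vcpu->arch.cr0 = vmcs12->guest_cr0;

	/* What the hardware runs with: L1's guest_cr0, munged the way
	 * KVM always munges cr0 (ALWAYS_ON bits plus the TS games). */
	vmcs_writel(GUEST_CR0, vmcs12->guest_cr0 | KVM_VM_CR0_ALWAYS_ON |
		    (vcpu->fpu_active ? 0 : X86_CR0_TS));
}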

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 22/24] Correct handling of idt vectoring info
  2010-09-22 23:15             ` Nadav Har'El
@ 2010-09-26 15:14               ` Avi Kivity
  2010-09-26 15:18                 ` Gleb Natapov
  0 siblings, 1 reply; 147+ messages in thread
From: Avi Kivity @ 2010-09-26 15:14 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: Gleb Natapov, kvm

  On 09/23/2010 01:15 AM, Nadav Har'El wrote:
> On Mon, Sep 20, 2010, Avi Kivity wrote about "Re: [PATCH 22/24] Correct handling of idt vectoring info":
> >  >I'm afraid I know very little about the SVM architecture. Does SVM even
> >  >have
> >  >a parallel of the IDT_VECTORING_INFO that this patch is trying to address?
> >
> >  It does. exit_int_info.
>
> Thanks. I guess I need to do some serious reading on this subject. I guessed
> that exit_int_info was more of a parallel of VMX's vm_exit_intr_info field and
> not idt_vectoring_info, but I guess I was wrong.

svm has separate intercepts for every exception, so it doesn't need the 
vector field of vm_exit_intr_info_field.  The error code is stored in 
exit_info_1, cr2 (for #PF) in exit_info_2.

> >  >I agree that the nested SVM's handle_exit() looks cleaner than the parallel
> >  >code in nested VMX. The root of all evil is that second exit decision point
> >  >in the injection phase, and I'll think some more if I can find a way to
> >  >avoid it without rocking the foundations too much.
> >  >
> >
> >  I think svm needs it too.
>
> Can you please clarify? I didn't understand what "it" refers to here.
>
>

Sorry, it was a week ago.

In general I think both svm and vmx need to go through the 
exception/interrupt queues.  That is, if you exit with 
IDT_VECTORING_INFO_VALID, you unpack it into the queue as a pending 
exception, and when you enter again you load it into 
VM_ENTRY_INTR_INFO_FIELD.  That's a bit of work, but it reduces the 
amount of code paths when L0 needs to inject an exception into L2 (like 
in emulation) - all it has to do is to queue it into the generic 
exception queue.
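
Roughly, the unpack step after an exit from L2 could look like this (a sketch
only, modeled on what vmx_complete_interrupts() already does; the helper name
is made up):

static void nested_unpack_idt_vectoring_info(struct kvm_vcpu *vcpu)
{
	u32 idt = vmcs_read32(IDT_VECTORING_INFO_FIELD);
	u8 vector = idt & VECTORING_INFO_VECTOR_MASK;

	if (!(idt & VECTORING_INFO_VALID_MASK))
		return;

	switch (idt & VECTORING_INFO_TYPE_MASK) {
	case INTR_TYPE_EXT_INTR:
		kvm_queue_interrupt(vcpu, vector, false);
		break;
	case INTR_TYPE_HARD_EXCEPTION:
		if (idt & VECTORING_INFO_DELIVER_CODE_MASK)
			kvm_requeue_exception_e(vcpu, vector,
				vmcs_read32(IDT_VECTORING_ERROR_CODE));
		else
			kvm_requeue_exception(vcpu, vector);
		break;
	}
	/* On the next entry the regular injection code turns the queued
	 * event back into VM_ENTRY_INTR_INFO_FIELD - whether we resume
	 * L2 or first emulate an exit to L1. */
}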

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 22/24] Correct handling of idt vectoring info
  2010-09-26 15:14               ` Avi Kivity
@ 2010-09-26 15:18                 ` Gleb Natapov
  0 siblings, 0 replies; 147+ messages in thread
From: Gleb Natapov @ 2010-09-26 15:18 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Nadav Har'El, kvm

On Sun, Sep 26, 2010 at 05:14:37PM +0200, Avi Kivity wrote:
>  On 09/23/2010 01:15 AM, Nadav Har'El wrote:
> >On Mon, Sep 20, 2010, Avi Kivity wrote about "Re: [PATCH 22/24] Correct handling of idt vectoring info":
> >>  >I'm afraid I know very little about the SVM architecture. Does SVM even
> >>  >have
> >>  >a parallel of the IDT_VECTORING_INFO that this patch is trying to address?
> >>
> >>  It does. exit_int_info.
> >
> >Thanks. I guess I need to do some serious reading on this subject. I guessed
> >that exit_int_info was more of a parallel of VMX's vm_exit_intr_info field and
> >not idt_vectoring_info, but I guess I was wrong.
> 
> svm has separate intercepts for every exception, so it doesn't need
> the vector field of vm_exit_intr_info_field.  The error code is
> stored in exit_info_1, cr2 (for #PF) in exit_info_2.
> 
> >>  >I agree that the nested SVM's handle_exit() looks cleaner than the parallel
> >>  >code in nested VMX. The root of all evil is that second exit decision point
> >>  >in the injection phase, and I'll think some more if I can find a way to
> >>  >avoid it without rocking the foundations too much.
> >>  >
> >>
> >>  I think svm needs it too.
> >
> >Can you please clarify? I didn't understand what "it" refers to here.
> >
> >
> 
> Sorry, it was a week ago.
> 
> In general I think both svm and vmx need to go through the
> exception/interrupt queues.  That is, if you exit with
> IDT_VECTORING_INFO_VALID, you unpack it into the queue as a pending
> exception, and when you enter again you load it into
> VM_ENTRY_INTR_INFO_FIELD.  That's a bit of work, but it reduces the
> amount of code paths when L0 needs to inject an exception into L2
> (like in emulation) - all it has to do is to queue it into the
> generic exception queue.
> 
And if L0 needs to reinject event directly into L2 it just uses regular
L0 code path instead of ah-hoc nested_handle_valid_idt_vectoring_info()
function.

--
			Gleb.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 0/24] Nested VMX, v5
  2010-06-15 10:00     ` Avi Kivity
@ 2010-10-17 12:03       ` Nadav Har'El
  2010-10-17 12:10         ` Avi Kivity
  0 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-10-17 12:03 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

On Tue, Jun 15, 2010, Avi Kivity wrote about "Re: [PATCH 0/24] Nested VMX, v5":
> I've tried to test the patches, but I see a vm-entry failure code 7 on 
> the very first vmentry.  Guest is Fedora 12 x86-64 (2.6.32.9-70.fc12).

Hi, as you can see, I posted a new set of patches, which apply to the current
trunk. Can you please give it another try? Thanks!

Please make sure you follow the instructions in the introduction to the
patch. In short, try running the L0 kernel with the "nosmp" option, give the
"-cpu host" option to qemu, and the "nested=1 ept=0 vpid=0" options to the
kvm-intel module in L0.

Thanks,
Nadav.

-- 
Nadav Har'El                        |      Sunday, Oct 17 2010, 9 Heshvan 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |This space is for sale - inquire inside.
http://nadav.harel.org.il           |

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 0/24] Nested VMX, v5
  2010-10-17 12:03       ` Nadav Har'El
@ 2010-10-17 12:10         ` Avi Kivity
  2010-10-17 12:39           ` Nadav Har'El
  0 siblings, 1 reply; 147+ messages in thread
From: Avi Kivity @ 2010-10-17 12:10 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

  On 10/17/2010 02:03 PM, Nadav Har'El wrote:
> On Tue, Jun 15, 2010, Avi Kivity wrote about "Re: [PATCH 0/24] Nested VMX, v5":
> >  I've tried to test the patches, but I see a vm-entry failure code 7 on
> >  the very first vmentry.  Guest is Fedora 12 x86-64 (2.6.32.9-70.fc12).
>
> Hi, as you can see, I posted a new set of patches, which apply to the current
> trunk. Can you please give it another try? Thanks!
>
> Please make sure you follow the instructions in the introduction to the
> patch. In short, try running the L0 kernel with the "nosmp" option,

What are the problems with smp?

>   give the
> "-cpu host" option to qemu,

Why is this needed?

> and the "nested=1 ept=0 vpid=0" options to the
> kvm-intel module in L0.

Why are those needed?  Seems trivial to support a nonept guest on an ept 
host - all you do is switch cr3 during vmentry and vmexit.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 0/24] Nested VMX, v5
  2010-10-17 12:10         ` Avi Kivity
@ 2010-10-17 12:39           ` Nadav Har'El
  2010-10-17 13:35             ` Avi Kivity
  0 siblings, 1 reply; 147+ messages in thread
From: Nadav Har'El @ 2010-10-17 12:39 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

On Sun, Oct 17, 2010, Avi Kivity wrote about "Re: [PATCH 0/24] Nested VMX, v5":
> >patch. In short, try running the L0 kernel with the "nosmp" option,
> What are the problems with smp?

Unfortunately, there appears to be a bug which causes KVM with nested VMX to
hang when SMP is enabled, even if you don't try to use more than one CPU for
the guest. I still need to debug this to figure out why.

> >  give the
> >"-cpu host" option to qemu,
> 
> Why is this needed?

Qemu has a list of cpu types, and for each type it lists its features. The
problem is that Qemu doesn't list the "VMX" feature for any of the CPUs,
even those (like core 2 duo). I have a trivial patch to qemu to add the "VMX"
feature to those CPUs, which is harmless even if KVM doesn't support nested
VMX (qemu will drop features which KVM doesn't support). But until I send
such a patch to qemu, the easiest workaround is just to use "-cpu host" -
which will (among other things) tell qemu to emulate a machine which has vmx,
just like the host does.

(I also explained this in the intro to v6 of the patch).

> 
> >and the "nested=1 ept=0 vpid=0" options to the
> >kvm-intel module in L0.
> 
> Why are those needed?  Seems trivial to support a nonept guest on an ept 
> host - all you do is switch cr3 during vmentry and vmexit.

nested=1 is needed because you asked for it *not* to be the default :-)

You're right, ept=1 on the host *could* be supported even before nested ept
is supported (this is the mode we called "shadow on ept" in the paper).
But at the moment, I believe it doesn't work correctly. I'll add making this
case work to my TODO list.

I'm not sure why vpid=0 is needed (but I verified that you get a failed entry
if you don't use it). I understood that there was some discussion on what is
the proper way to do nested vpid, and that in the meantime it isn't supported,
but I agree that it should have been possible to use vpid normally to run L1's
but avoid using it when running L2's. Again, I'll need to debug this issue
to understand how difficult it would be to fix this case.

Nadav.

-- 
Nadav Har'El                        |      Sunday, Oct 17 2010, 9 Heshvan 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Strike not only while the iron is hot,
http://nadav.harel.org.il           |make the iron hot by striking it.

^ permalink raw reply	[flat|nested] 147+ messages in thread

* Re: [PATCH 0/24] Nested VMX, v5
  2010-10-17 12:39           ` Nadav Har'El
@ 2010-10-17 13:35             ` Avi Kivity
  0 siblings, 0 replies; 147+ messages in thread
From: Avi Kivity @ 2010-10-17 13:35 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm

  On 10/17/2010 02:39 PM, Nadav Har'El wrote:
> On Sun, Oct 17, 2010, Avi Kivity wrote about "Re: [PATCH 0/24] Nested VMX, v5":
> >  >patch. In short, try running the L0 kernel with the "nosmp" option,
> >  What are the problems with smp?
>
> Unfortunately, there appears to be a bug which causes KVM with nested VMX to
> hang when SMP is enabled, even if you don't try to use more than one CPU for
> the guest. I still need to debug this to figure out why.

Well, that seems pretty critical.

> >  >   give the
> >  >"-cpu host" option to qemu,
> >
> >  Why is this needed?
>
> Qemu has a list of cpu types, and for each type it lists its features. The
> problem is that Qemu doesn't list the "VMX" feature for any of the CPUs,
> even those (like core 2 duo). I have a trivial patch to qemu to add the "VMX"
> feature to those CPUs, which is harmless even if KVM doesn't support nested
> VMX (qemu will drop features which KVM doesn't support). But until I send
> such a patch to qemu, the easiest workaround is just to use "-cpu host" -
> which will (among other things) tell qemu to emulate a machine which has vmx,
> just like the host does.
>
> (I also explained this in the intro to v6 of the patch).

Ok.  I think we can get that patch merged, just so you don't have to 
re-explain it over and over again.  Please post it to qemu-devel.

> >
> >  >and the "nested=1 ept=0 vpid=0" options to the
> >  >kvm-intel module in L0.
> >
> >  Why are those needed?  Seems trivial to support a nonept guest on an ept
> >  host - all you do is switch cr3 during vmentry and vmexit.
>
> nested=1 is needed because you asked for it *not* to be the default :-)
>
> You're right, ept=1 on the host *could* be supported even before nested ept
> is supported (this is the mode we called "shadow on ept" in the paper).
> But at the moment, I believe it doesn't work correctly. I'll add making this
> case work to my TODO list.
>
> I'm not sure why vpid=0 is needed (but I verified that you get a failed entry
> if you don't use it). I understood that there was some discussion on what is
> the proper way to do nested vpid, and that in the meantime it isn't supported,
> but I agree that it should have been possible to use vpid normally to run L1's
> but avoid using it when running L2's. Again, I'll need to debug this issue
> to understand how difficult it would be to fix this case.

My feeling is the smp and vpid failures are due to bugs.  vpid=0 in 
particular forces a tlb flush on every exit which might mask your true 
bug.  smp might be due to host vcpu migration.  Are we vmclearing the 
right vmcs?

ept=1 may not be due to a bug per se, but my feeling is that it should 
be very easy to implement.  In particular nsvm started out on npt (but 
not nnpt) and had issues with shadow-on-shadow (IIRC).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 147+ messages in thread

Thread overview: 147+ messages
2010-06-13 12:22 [PATCH 0/24] Nested VMX, v5 Nadav Har'El
2010-06-13 12:23 ` [PATCH 1/24] Move nested option from svm.c to x86.c Nadav Har'El
2010-06-14  8:11   ` Avi Kivity
2010-06-15 14:27     ` Nadav Har'El
2010-06-13 12:23 ` [PATCH 2/24] Add VMX and SVM to list of supported cpuid features Nadav Har'El
2010-06-14  8:13   ` Avi Kivity
2010-06-15 14:31     ` Nadav Har'El
2010-06-13 12:24 ` [PATCH 3/24] Implement VMXON and VMXOFF Nadav Har'El
2010-06-14  8:21   ` Avi Kivity
2010-06-16 11:14     ` Nadav Har'El
2010-06-16 11:26       ` Avi Kivity
2010-06-15 20:18   ` Marcelo Tosatti
2010-06-16  7:50     ` Nadav Har'El
2010-06-13 12:24 ` [PATCH 4/24] Allow setting the VMXE bit in CR4 Nadav Har'El
2010-06-15 11:09   ` Gleb Natapov
2010-06-15 14:44     ` Nadav Har'El
2010-06-13 12:25 ` [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1 Nadav Har'El
2010-06-14  8:33   ` Avi Kivity
2010-06-14  8:49     ` Nadav Har'El
2010-06-14 12:35       ` Avi Kivity
2010-06-16 12:24     ` Nadav Har'El
2010-06-16 13:10       ` Avi Kivity
2010-06-22 14:54     ` Nadav Har'El
2010-06-22 16:53       ` Nadav Har'El
2010-06-23  8:07         ` Avi Kivity
2010-08-08 15:09           ` Nadav Har'El
2010-08-10  3:24             ` Avi Kivity
2010-06-23  7:57       ` Avi Kivity
2010-06-23  9:15         ` Alexander Graf
2010-06-23  9:24           ` Avi Kivity
2010-06-23 12:07         ` Nadav Har'El
2010-06-23 12:13           ` Avi Kivity
2010-06-13 12:25 ` [PATCH 6/24] Implement reading and writing of VMX MSRs Nadav Har'El
2010-06-14  8:42   ` Avi Kivity
2010-06-23  8:13     ` Nadav Har'El
2010-06-23  8:24       ` Avi Kivity
2010-06-13 12:26 ` [PATCH 7/24] Understanding guest pointers to vmcs12 structures Nadav Har'El
2010-06-14  8:48   ` Avi Kivity
2010-08-02 12:25     ` Nadav Har'El
2010-08-02 13:38       ` Avi Kivity
2010-06-15 12:14   ` Gleb Natapov
2010-08-01 15:16     ` Nadav Har'El
2010-08-01 15:25       ` Gleb Natapov
2010-08-02  8:57         ` Nadav Har'El
2010-06-13 12:26 ` [PATCH 8/24] Hold a vmcs02 for each vmcs12 Nadav Har'El
2010-06-14  8:57   ` Avi Kivity
2010-07-06  9:50   ` Dong, Eddie
2010-08-02 13:38     ` Nadav Har'El
2010-06-13 12:27 ` [PATCH 9/24] Implement VMCLEAR Nadav Har'El
2010-06-14  9:03   ` Avi Kivity
2010-06-15 13:47   ` Gleb Natapov
2010-06-15 13:50     ` Avi Kivity
2010-06-15 13:54       ` Gleb Natapov
2010-08-05 11:50         ` Nadav Har'El
2010-08-05 11:53           ` Gleb Natapov
2010-08-05 12:01             ` Nadav Har'El
2010-08-05 12:05               ` Avi Kivity
2010-08-05 12:10                 ` Nadav Har'El
2010-08-05 12:13                   ` Avi Kivity
2010-08-05 12:29                     ` Nadav Har'El
2010-08-05 12:03           ` Avi Kivity
2010-07-06  2:56   ` Dong, Eddie
2010-08-03 12:12     ` Nadav Har'El
2010-06-13 12:27 ` [PATCH 10/24] Implement VMPTRLD Nadav Har'El
2010-06-14  9:07   ` Avi Kivity
2010-08-05 11:13     ` Nadav Har'El
2010-06-16 13:36   ` Gleb Natapov
2010-07-06  3:09   ` Dong, Eddie
2010-08-05 11:35     ` Nadav Har'El
2010-06-13 12:28 ` [PATCH 11/24] Implement VMPTRST Nadav Har'El
2010-06-14  9:15   ` Avi Kivity
2010-06-16 13:53     ` Gleb Natapov
2010-06-16 15:33       ` Nadav Har'El
2010-06-13 12:28 ` [PATCH 12/24] Add VMCS fields to the vmcs12 Nadav Har'El
2010-06-14  9:24   ` Avi Kivity
2010-06-16 14:18   ` Gleb Natapov
2010-06-13 12:29 ` [PATCH 13/24] Implement VMREAD and VMWRITE Nadav Har'El
2010-06-14  9:36   ` Avi Kivity
2010-06-16 14:48     ` Gleb Natapov
2010-08-04 13:42       ` Nadav Har'El
2010-08-04 16:09     ` Nadav Har'El
2010-08-04 16:41       ` Avi Kivity
2010-06-16 15:03   ` Gleb Natapov
2010-08-04 11:46     ` Nadav Har'El
2010-06-13 12:29 ` [PATCH 14/24] Prepare vmcs02 from vmcs01 and vmcs12 Nadav Har'El
2010-06-14 11:11   ` Avi Kivity
2010-06-17  8:50   ` Gleb Natapov
2010-07-06  6:25   ` Dong, Eddie
2010-06-13 12:30 ` [PATCH 15/24] Move register-syncing to a function Nadav Har'El
2010-06-13 12:30 ` [PATCH 16/24] Implement VMLAUNCH and VMRESUME Nadav Har'El
2010-06-14 11:41   ` Avi Kivity
2010-09-26 11:14     ` Nadav Har'El
2010-09-26 12:56       ` Avi Kivity
2010-09-26 13:06         ` Nadav Har'El
2010-09-26 13:51           ` Avi Kivity
2010-06-17 10:59   ` Gleb Natapov
2010-09-16 16:06     ` Nadav Har'El
2010-06-13 12:31 ` [PATCH 17/24] No need for handle_vmx_insn function any more Nadav Har'El
2010-06-13 12:31 ` [PATCH 18/24] Exiting from L2 to L1 Nadav Har'El
2010-06-14 12:04   ` Avi Kivity
2010-09-12 14:05     ` Nadav Har'El
2010-09-12 14:29       ` Avi Kivity
2010-09-12 17:05         ` Nadav Har'El
2010-09-12 17:21           ` Avi Kivity
2010-09-12 19:51             ` Nadav Har'El
2010-09-13  8:48               ` Avi Kivity
2010-09-13  5:53             ` Sheng Yang
2010-09-13  8:52               ` Avi Kivity
2010-09-13  9:01                 ` Nadav Har'El
2010-09-13  9:34                   ` Avi Kivity
2010-09-14 13:07     ` Nadav Har'El
2010-06-13 12:32 ` [PATCH 19/24] Deciding if L0 or L1 should handle an L2 exit Nadav Har'El
2010-06-14 12:24   ` Avi Kivity
2010-09-16 14:42     ` Nadav Har'El
2010-06-13 12:32 ` [PATCH 20/24] Correct handling of interrupt injection Nadav Har'El
2010-06-14 12:29   ` Avi Kivity
2010-06-14 12:48     ` Avi Kivity
2010-09-16 15:25     ` Nadav Har'El
2010-06-13 12:33 ` [PATCH 21/24] Correct handling of exception injection Nadav Har'El
2010-06-13 12:33 ` [PATCH 22/24] Correct handling of idt vectoring info Nadav Har'El
2010-06-17 11:58   ` Gleb Natapov
2010-09-20  6:37     ` Nadav Har'El
2010-09-20  9:34       ` Gleb Natapov
2010-09-20 10:03         ` Nadav Har'El
2010-09-20 10:11           ` Avi Kivity
2010-09-22 23:15             ` Nadav Har'El
2010-09-26 15:14               ` Avi Kivity
2010-09-26 15:18                 ` Gleb Natapov
2010-09-20 10:20           ` Gleb Natapov
2010-06-13 12:34 ` [PATCH 23/24] Handling of CR0.TS and #NM for Lazy FPU loading Nadav Har'El
2010-06-13 12:34 ` [PATCH 24/24] Miscellenous small corrections Nadav Har'El
2010-06-14 12:34 ` [PATCH 0/24] Nested VMX, v5 Avi Kivity
2010-06-14 13:03   ` Nadav Har'El
2010-06-15 10:00     ` Avi Kivity
2010-10-17 12:03       ` Nadav Har'El
2010-10-17 12:10         ` Avi Kivity
2010-10-17 12:39           ` Nadav Har'El
2010-10-17 13:35             ` Avi Kivity
2010-07-09  8:59 ` Dong, Eddie
2010-07-11  8:27   ` Nadav Har'El
2010-07-11 11:05     ` Alexander Graf
2010-07-11 12:49       ` Nadav Har'El
2010-07-11 13:12         ` Avi Kivity
2010-07-11 15:39           ` Nadav Har'El
2010-07-11 15:45             ` Avi Kivity
2010-07-11 13:20     ` Avi Kivity
2010-07-15  3:27 ` Sheng Yang
