From: "Nadav Har'El"
Subject: Re: [PATCH 0/24] Nested VMX, v5
Date: Sun, 11 Jul 2010 11:27:03 +0300
To: "Dong, Eddie"
Cc: avi@redhat.com, kvm@vger.kernel.org

On Fri, Jul 09, 2010, Dong, Eddie wrote about "RE: [PATCH 0/24] Nested VMX, v5":
> Thanks for the posting; in general the patches are well written.
> I like the concept of VMCSxy and I feel it is pretty clear (better than my
> previous naming as well), but there is some confusion inside, especially
> around the term "shadow", which I find quite hard to follow.

Hi, and thanks for the excellent ideas. As you saw, I indeed started to
convert and converge the old terminology (including that ambiguous term
"shadow") into the new names vmcs01, vmcs02, vmcs12 - names which we
introduced in our technical report. But I have not gone all the way with
these changes. I should have, and I'll do it now.

> 1: Basically there are 2 different types of VMCS. One is defined by the
> hardware, whose layout is unknown to the VMM. The other is defined by the
> VMM (this patch) and used for vmcs12. The former uses "struct vmcs" to
> describe its data instances, but the latter doesn't have a clear
> definition (or is it struct vmcs12?). I suggest we have a distinct struct
> for this, for example "struct sw_vmcs" (software vmcs) or "struct vvmcs"
> (virtual vmcs).

I decided (but let me know if you have reservations) to use the name
"struct vmcs_fields" for the memory structure that contains the long list
of vmcs fields. I think this name describes the structure's content well.
As in the last version of the patches, this list of vmcs fields will not,
on its own, be vmcs12's structure: vmcs12, as a spec-compliant vmcs, also
needs to contain a couple of additional fields at its beginning, and we
also need a few more runtime fields.

> 2: vmcsxy (vmcs12, vmcs02, vmcs01) are instances of either "struct vmcs"
> or "struct sw_vmcs", but not structs themselves. A clear distinction
> between data structure and instance helps, IMO.

I agree with you that using the name "vmcs12" for both the type (struct
vmcs12) and an instance of another type (struct vmcs_fields *vmcs12) is
somewhat strange, but I can only think of two alternatives:

1. Invent a new name for "struct vmcs12", say "struct sw_vmcs" as you
   suggested. But I think that will just make things less clear, because
   we replace the self-explanatory name vmcs12 with a less clear name.

2. Stop separating "struct vmcs_fields" (formerly struct shadow_vmcs)
   from "struct vmcs12", which contains it and a few more fields - and
   instead put everything in one structure (and call that sw_vmcs or
   whatever). The extra fields will not be useful for vmcs01, but it's
   not a terrible waste (vmcs01 already leaves a lot of these fields
   unused anyway).

Personally, I find these two alternatives even less appealing than the
current one (with "struct vmcs12" describing vmcs12's type, and containing
a struct vmcs_fields inside). What do you think?
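To make the current approach concrete, this is roughly what vmcs12's type
looks like with the new names (a sketch, not the literal code: the two
header fields stand for the spec-mandated fields at the beginning of a
vmcs, and their names here are only illustrative; the rest matches the
patch below):

	struct __packed vmcs12 {
		/* The couple of additional fields a spec-compliant vmcs
		 * needs at its beginning (illustrative names): */
		u32 revision_id;
		u32 abort;
		/* the long list of vmcs fields (former shadow_vmcs): */
		struct vmcs_fields fields;
		/* extra runtime fields, needed only for vmcs12: */
		bool launch_state; /* 0 after VMCLEAR, 1 after VMLAUNCH */
		int cpu;
		int launched;
	};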
> 3: We may use a prefix or suffix in addition to vmcsxy to explicitly
> state the format of an instance. For example, vmcs02 in the current patch
> is for hardware use, hence it is an instance of "struct vmcs", but vmcs01
> is an instance of "struct sw_vmcs". A prefix or postfix helps make this
> better understood.

I agree. After changing the old name struct shadow_vmcs to vmcs_fields, I
can now use a name like vmcs01_fields for the old l1_shadow_vmcs (the
memory copy of vmcs01's fields) and vmcs01 for the old l1_vmcs (the actual
hardware VMCS used to run L1). This is indeed more readable, thanks.

> 4: Rename l2_vmcs to vmcs02, l1_shadow_vmcs to vmcs01, l1_vmcs to
> vmcs02; together with the prefix/postfix this can strengthen the above
> concept of vmcsxy.

Good ideas. I renamed l2_vmcs, l2_vmcs_list, and the like, to vmcs02. I
renamed l1_shadow_vmcs to vmcs01_fields, and l1_vmcs to vmcs01 (NOT
vmcs02). I renamed l2_shadow_vmcs, l2svmcs, nested_vmcs, and the like, to
vmcs12 (I decided not to use the longer name vmcs12_fields, because I
don't think it adds any clarity). I also renamed get_shadow_vmcs to
get_vmcs12_fields.

> 5: Guest VMPTRLD emulation. The current patch creates a vmcs02 instance
> each time the guest does VMPTRLD, and frees the instance at VMCLEAR. The
> code may fail if the number of (un-vmcleared) vmcs exceeds a certain
> threshold, to avoid denial of service. That is fine, but it brings
> additional complexity and may cost a lot of memory. I think we can
> emulate this using the concept of a "cached vmcs", for the case where the
> L1 VMM doesn't do VMCLEAR in time: L0 can simply flush those vmcs02 back
> to guest memory, i.e. to vmcs12, as needed. For example, if the number of
> cached vmcs02 exceeds 10, we can flush automatically.

Right. I've already discussed this idea on the list with Avi Kivity; it
is on my todo list and definitely should be done. The current approach is
simpler, because I don't need to add special code for rebuilding a
forgotten vmcs02 from vmcs12 - the current prepare_vmcs02 only updates
some of the fields, and I'll need to do some testing to figure out what
exactly is missing for a full rebuild. I think the current code is "good
enough" as an interim solution, because users that follow the spec will
not forget to VMCLEAR anyway (and if they do, only they will suffer). And
I wouldn't say that "a lot of memory" is involved - at worst, an L1 can
now cause 256 pages, or 1 MB, to be wasted on this. More typically, an L1
will only have a few L2 guests, and will only spend a few pages on this -
certainly much, much less than it spends on actually holding the L2s'
memory.
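For point 5, the eventual flush logic would look roughly like the sketch
below. This is hypothetical code, not part of the patches:
NESTED_MAX_VMCS02, the exact layout of the vmcs02 list items, and
flush_vmcs02_to_vmcs12() are all placeholders for code that doesn't exist
yet.

	#define NESTED_MAX_VMCS02 10	/* arbitrary cap, as you suggested */

	/* Called after a new vmcs02 is created on VMPTRLD: if L1 keeps too
	 * many un-VMCLEARed vmcs around, flush the oldest vmcs02 back to its
	 * vmcs12 in guest memory and free the hardware VMCS, instead of
	 * failing the VMPTRLD. Assumes each list item holds a list_head
	 * "list" and a "struct vmcs *vmcs". */
	static void nested_maybe_flush_vmcs02(struct vcpu_vmx *vmx)
	{
		struct vmcs_list *oldest;

		if (vmx->nested.vmcs02_num <= NESTED_MAX_VMCS02)
			return;

		/* Assuming new entries are added at the head, the tail is
		 * the least recently VMPTRLDed vmcs02: */
		oldest = list_entry(vmx->nested.vmcs02_list.prev,
				    struct vmcs_list, list);

		/* Placeholder: write the hardware state back into the
		 * guest's vmcs12, so that a later VMPTRLD can rebuild the
		 * vmcs02 from it: */
		flush_vmcs02_to_vmcs12(vmx, oldest);

		list_del(&oldest->list);
		free_vmcs(oldest->vmcs);
		kfree(oldest);
		vmx->nested.vmcs02_num--;
	}

The hard part is exactly that placeholder: flushing a vmcs02 back to
guest memory, plus teaching prepare_vmcs02 to do a full rebuild
afterwards - which is why this is still on the todo list.

Thanks again for the review! I don't want to attach the entire set of
patches again now (before I respond to the rest of the review comments,
from Avi, Gleb, you, and others). So in the meantime I'll include the new
version of just one patch, so that you can see an example of the changes
I've made to the names.

=========

Subject: [PATCH 16/26] nVMX: Implement VMLAUNCH and VMRESUME

Implement the VMLAUNCH and VMRESUME instructions, allowing a guest
hypervisor to run its own guests.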
Signed-off-by: Nadav Har'El
---
 arch/x86/kvm/vmx.c |  200 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 197 insertions(+), 3 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2010-07-11 11:11:11.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2010-07-11 11:11:11.000000000 +0300
@@ -279,6 +279,9 @@ struct __packed vmcs12 {
 	struct vmcs_fields fields;
 	bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
+
+	int cpu;
+	int launched;
 };
 
 /*
@@ -313,6 +316,23 @@ struct nested_vmx {
 	/* list of real (hardware) VMCS, one for each L2 guest of L1 */
 	struct list_head vmcs02_list; /* a vmcs_list */
 	int vmcs02_num;
+
+	/* Are we running a nested guest now */
+	bool nested_mode;
+	/* Level 1 state for switching to level 2 and back */
+	struct {
+		u64 efer;
+		unsigned long cr3;
+		unsigned long cr4;
+		u64 io_bitmap_a;
+		u64 io_bitmap_b;
+		u64 msr_bitmap;
+		int cpu;
+		int launched;
+	} l1_state;
+	/* Saving the VMCS that we used for running L1 */
+	struct vmcs *vmcs01;
+	struct vmcs_fields *vmcs01_fields;
 };
 
 enum vmcs_field_type {
@@ -1383,6 +1403,19 @@ static void vmx_vcpu_load(struct kvm_vcp
 			new_offset = vmcs_read64(TSC_OFFSET) + delta;
 			vmcs_write64(TSC_OFFSET, new_offset);
 		}
+
+		if (vmx->nested.vmcs01_fields != NULL) {
+			struct vmcs_fields *vmcs01 =
+				vmx->nested.vmcs01_fields;
+			vmcs01->host_tr_base = vmcs_readl(HOST_TR_BASE);
+			vmcs01->host_gdtr_base = vmcs_readl(HOST_GDTR_BASE);
+			vmcs01->host_ia32_sysenter_esp =
+				vmcs_readl(HOST_IA32_SYSENTER_ESP);
+			if (tsc_this < vcpu->arch.host_tsc)
+				vmcs01->tsc_offset = vmcs_read64(TSC_OFFSET);
+			if (vmx->nested.nested_mode)
+				load_vmcs_host_state(vmcs01);
+		}
 	}
 }
 
@@ -2278,6 +2311,9 @@ static void free_l1_state(struct kvm_vcp
 		kfree(list_item);
 	}
 	vmx->nested.vmcs02_num = 0;
+
+	kfree(vmx->nested.vmcs01_fields);
+	vmx->nested.vmcs01_fields = NULL;
 }
 
 static void free_kvm_area(void)
@@ -4141,6 +4177,10 @@ static int handle_vmon(struct kvm_vcpu *
 	INIT_LIST_HEAD(&(vmx->nested.vmcs02_list));
 	vmx->nested.vmcs02_num = 0;
 
+	vmx->nested.vmcs01_fields = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!vmx->nested.vmcs01_fields)
+		return -ENOMEM;
+
 	vmx->nested.vmxon = true;
 
 	skip_emulated_instruction(vcpu);
@@ -4339,6 +4379,38 @@ static int handle_vmclear(struct kvm_vcp
 	return 1;
 }
 
+static int nested_vmx_run(struct kvm_vcpu *vcpu);
+
+static int handle_launch_or_resume(struct kvm_vcpu *vcpu, bool launch)
+{
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (to_vmx(vcpu)->nested.current_vmcs12->launch_state == launch) {
+		/* Must use VMLAUNCH for the first time, VMRESUME later */
+		set_rflags_to_vmx_fail_valid(vcpu);
+		return 1;
+	}
+
+	skip_emulated_instruction(vcpu);
+
+	nested_vmx_run(vcpu);
+	return 1;
+}
+
+/* Emulate the VMLAUNCH instruction */
+static int handle_vmlaunch(struct kvm_vcpu *vcpu)
+{
+	return handle_launch_or_resume(vcpu, true);
+}
+
+/* Emulate the VMRESUME instruction */
+static int handle_vmresume(struct kvm_vcpu *vcpu)
+{
+
+	return handle_launch_or_resume(vcpu, false);
+}
+
 static inline bool vmcs12_read_any(struct kvm_vcpu *vcpu,
 				   unsigned long field, u64 *ret)
 {
@@ -4869,11 +4941,11 @@ static int (*kvm_vmx_exit_handlers[])(st
 	[EXIT_REASON_INVLPG]                  = handle_invlpg,
 	[EXIT_REASON_VMCALL]                  = handle_vmcall,
 	[EXIT_REASON_VMCLEAR]                 = handle_vmclear,
-	[EXIT_REASON_VMLAUNCH]                = handle_vmx_insn,
+	[EXIT_REASON_VMLAUNCH]                = handle_vmlaunch,
 	[EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
 	[EXIT_REASON_VMPTRST]                 = handle_vmptrst,
 	[EXIT_REASON_VMREAD]                  = handle_vmread,
-	[EXIT_REASON_VMRESUME]                = handle_vmx_insn,
+	[EXIT_REASON_VMRESUME]                = handle_vmresume,
 	[EXIT_REASON_VMWRITE]                 = handle_vmwrite,
 	[EXIT_REASON_VMOFF]                   = handle_vmoff,
 	[EXIT_REASON_VMON]                    = handle_vmon,
@@ -4935,7 +5007,8 @@ static int vmx_handle_exit(struct kvm_vc
 		       "(0x%x) and exit reason is 0x%x\n",
 		       __func__, vectoring_info, exit_reason);
 
-	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked)) {
+	if (!vmx->nested.nested_mode &&
+	    unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked)) {
 		if (vmx_interrupt_allowed(vcpu)) {
 			vmx->soft_vnmi_blocked = 0;
 		} else if (vmx->vnmi_blocked_time > 1000000000LL &&
@@ -5756,6 +5829,127 @@ int prepare_vmcs_02(struct kvm_vcp
 	return 0;
 }
 
+static int nested_vmx_run(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	vmx->nested.nested_mode = 1;
+	sync_cached_regs_to_vmcs(vcpu);
+	save_vmcs(vmx->nested.vmcs01_fields);
+
+	vmx->nested.l1_state.efer = vcpu->arch.efer;
+	if (!enable_ept)
+		vmx->nested.l1_state.cr3 = vcpu->arch.cr3;
+	vmx->nested.l1_state.cr4 = vcpu->arch.cr4;
+
+	if (cpu_has_vmx_msr_bitmap())
+		vmx->nested.l1_state.msr_bitmap = vmcs_read64(MSR_BITMAP);
+	else
+		vmx->nested.l1_state.msr_bitmap = 0;
+
+	vmx->nested.l1_state.io_bitmap_a = vmcs_read64(IO_BITMAP_A);
+	vmx->nested.l1_state.io_bitmap_b = vmcs_read64(IO_BITMAP_B);
+	vmx->nested.vmcs01 = vmx->vmcs;
+	vmx->nested.l1_state.cpu = vcpu->cpu;
+	vmx->nested.l1_state.launched = vmx->launched;
+
+	vmx->vmcs = nested_get_current_vmcs(vcpu);
+	if (!vmx->vmcs) {
+		printk(KERN_ERR "Missing VMCS\n");
+		set_rflags_to_vmx_fail_valid(vcpu);
+		return 1;
+	}
+
+	vcpu->cpu = vmx->nested.current_vmcs12->cpu;
+	vmx->launched = vmx->nested.current_vmcs12->launched;
+
+	if (!vmx->nested.current_vmcs12->launch_state || !vmx->launched) {
+		vmcs_clear(vmx->vmcs);
+		vmx->launched = 0;
+		vmx->nested.current_vmcs12->launch_state = 1;
+	}
+
+	vmx_vcpu_load(vcpu, get_cpu());
+	put_cpu();
+
+	prepare_vmcs_02(vcpu,
+		get_vmcs12_fields(vcpu), vmx->nested.vmcs01_fields);
+
+	if (get_vmcs12_fields(vcpu)->vm_entry_controls &
+	    VM_ENTRY_IA32E_MODE) {
+		if (!((vcpu->arch.efer & EFER_LMA) &&
+		      (vcpu->arch.efer & EFER_LME)))
+			vcpu->arch.efer |= (EFER_LMA | EFER_LME);
+	} else {
+		if ((vcpu->arch.efer & EFER_LMA) ||
+		    (vcpu->arch.efer & EFER_LME))
+			vcpu->arch.efer = 0;
+	}
+
+	/* vmx_set_cr0() sets the cr0 that L2 will read, to be the one that L1
+	 * dictated, and takes appropriate actions for special cr0 bits (like
+	 * real mode, etc.).
+	 */
+	vmx_set_cr0(vcpu,
+		(get_vmcs12_fields(vcpu)->guest_cr0 &
+		 ~get_vmcs12_fields(vcpu)->cr0_guest_host_mask) |
+		(get_vmcs12_fields(vcpu)->cr0_read_shadow &
+		 get_vmcs12_fields(vcpu)->cr0_guest_host_mask));
+
+	/* However, vmx_set_cr0 incorrectly enforces KVM's relationship between
+	 * GUEST_CR0 and CR0_READ_SHADOW, e.g., that the former is the same as
+	 * the latter with TS added if !fpu_active. We need to take the
+	 * actual GUEST_CR0 that L1 wanted, just with added TS if !fpu_active
+	 * like KVM wants (for the "lazy fpu" feature, to avoid the costly
+	 * restoration of fpu registers until the FPU is really used).
+	 */
+	vmcs_writel(GUEST_CR0, get_vmcs12_fields(vcpu)->guest_cr0 |
+		(vcpu->fpu_active ? 0 : X86_CR0_TS));
+
+	vmx_set_cr4(vcpu, get_vmcs12_fields(vcpu)->guest_cr4);
+	vmcs_writel(CR4_READ_SHADOW,
+		get_vmcs12_fields(vcpu)->cr4_read_shadow);
+
+	/* We have to set the X86_CR0_PG bit of the cached cr0, because
+	 * kvm_mmu_reset_context enables paging only if X86_CR0_PG is set in
+	 * CR0 (we need the paging so that KVM treats this guest as a paging
+	 * guest, so we can easily forward page faults to L1.)
+	 */
+	vcpu->arch.cr0 |= X86_CR0_PG;
+
+	if (enable_ept && !nested_cpu_has_vmx_ept(vcpu)) {
+		vmcs_writel(GUEST_CR3, get_vmcs12_fields(vcpu)->guest_cr3);
+		vmx->vcpu.arch.cr3 = get_vmcs12_fields(vcpu)->guest_cr3;
+	} else {
+		int r;
+		kvm_set_cr3(vcpu, get_vmcs12_fields(vcpu)->guest_cr3);
+		kvm_mmu_reset_context(vcpu);
+
+		r = kvm_mmu_load(vcpu);
+		if (unlikely(r)) {
+			printk(KERN_ERR "Error in kvm_mmu_load r %d\n", r);
+			set_rflags_to_vmx_fail_valid(vcpu);
+			/* switch back to L1 */
+			vmx->nested.nested_mode = 0;
+			vmx->vmcs = vmx->nested.vmcs01;
+			vcpu->cpu = vmx->nested.l1_state.cpu;
+			vmx->launched = vmx->nested.l1_state.launched;
+
+			vmx_vcpu_load(vcpu, get_cpu());
+			put_cpu();
+
+			return 1;
+		}
+	}
+
+	kvm_register_write(vcpu, VCPU_REGS_RSP,
+		get_vmcs12_fields(vcpu)->guest_rsp);
+	kvm_register_write(vcpu, VCPU_REGS_RIP,
+		get_vmcs12_fields(vcpu)->guest_rip);
+
+	return 1;
+}
+
 static struct kvm_x86_ops vmx_x86_ops = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,

-- 
Nadav Har'El                        |     Sunday, Jul 11 2010, 29 Tammuz 5770
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |The two most common elements in the
http://nadav.harel.org.il           |universe are hydrogen and stupidity.