From: "Nadav Har'El" <nyh@math.technion.ac.il>
To: Avi Kivity <avi@redhat.com>
Cc: kvm@vger.kernel.org, gleb@redhat.com
Subject: Re: [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12
Date: Thu, 3 Feb 2011 14:57:32 +0200	[thread overview]
Message-ID: <20110203125732.GA19503@fermat.math.technion.ac.il> (raw)
In-Reply-To: <4D45372E.2050605@redhat.com>

On Sun, Jan 30, 2011, Avi Kivity wrote about "Re: [PATCH 07/29] nVMX: Hold a vmcs02 for each vmcs12":
>..
> >+static int nested_create_current_vmcs(struct kvm_vcpu *vcpu)
> >+{
>...
> >+	if (vmx->nested.vmcs02_num>= NESTED_MAX_VMCS)
> >+		return -ENOMEM;
> 
> I asked to replace this by dropping the entire vmcs02_list (or perhaps 
> just its tail).

Hi, here is a completely rewritten patch.

Now we make no guarantee to keep one vmcs02 for each vmcs12. Rather, we
keep a limited pool of vmcs02s. When possible, we reuse the same vmcs02
that we previously used for the current vmcs12. Otherwise, we take one of
the others (the least recently used) and use that instead. Of course, if
the pool is not yet full, we can also allocate a new vmcs02.

The current default size of the pool is 1, meaning that we keep just one
vmcs02 (per vcpu) and use it for every L2, of which there may be many.
Because in this version prepare_vmcs02 sets all vmcs02 fields on every
entry, and doesn't try to avoid setting rarely modified fields, there is
nothing to gain by starting from the vmcs02 previously used to run a
particular L2. In the future, when we have an optimized prepare_vmcs02
which doesn't set every field each time, it will make sense to increase
the pool size.
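
To make the intended use of this pool concrete, here is a rough sketch of
how the VMLAUNCH/VMRESUME emulation (added only in a later patch of this
series) might call into it. This is not code from the patch below, the
function name is purely illustrative, and prepare_vmcs02() is the later
patch's function mentioned above:

static int nested_run_l2_sketch(struct kvm_vcpu *vcpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	struct saved_vmcs *vmcs02;

	/*
	 * Reuse the vmcs02 last used for the current vmcs12, recycle the
	 * least recently used one, or allocate a new one if the pool is
	 * not full yet.
	 */
	vmcs02 = nested_get_current_vmcs02(vmx);
	if (!vmcs02)
		return -ENOMEM;

	/*
	 * A later patch will switch the hardware to vmcs02->vmcs, fill
	 * its fields with prepare_vmcs02(), and use vmcs02->launched to
	 * decide between VMLAUNCH and VMRESUME.
	 */
	return 0;
}

The point is only the call order: look up (or allocate) a vmcs02 for the
current vmcs12 before preparing it and entering L2.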



Subject: [PATCH 07/29] nVMX: Introduce vmcs02: VMCS used to run L2

We saw in a previous patch that L1 controls its L2 guest with a vmcs12.
L0 needs to create a real VMCS for running L2. We call that "vmcs02".
A later patch will contain the code, prepare_vmcs02(), for filling the vmcs02
fields. This patch only contains code for allocating vmcs02.

In this version, prepare_vmcs02() sets *all* of vmcs02's fields each time we
enter from L1 to L2, so keeping just one vmcs02 for the vcpu would have
sufficed: It could be reused even when L1 runs multiple L2 guests.
However, in future versions we'll probably want to add an optimization where
vmcs02 fields that rarely change will not be set each time. For that reason
it is beneficial to keep around several vmcs02s of L2 guests that have
recently run, so that potentially we could run these L2s again more
quickly, because fewer VMWRITEs to the vmcs02 will be needed.

This patch adds to each vcpu a vmcs02 pool, vmx->nested.vmcs02_pool,
which remembers the vmcs02s last used to run up to VMCS02_POOL_SIZE L2s.
Because in the current version prepare_vmcs02() sets all vmcs02 fields no
matter what we start with, we choose VMCS02_POOL_SIZE=1. I.e., one vmcs02
is allocated (and loaded onto the processor), and it is reused to enter any
L2 guest. In the future, when prepare_vmcs02() is optimized not to set all
fields every time, VMCS02_POOL_SIZE should be increased.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  135 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 135 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2011-02-03 14:46:53.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2011-02-03 14:46:53.000000000 +0200
@@ -117,6 +117,7 @@ static int ple_window = KVM_VMX_DEFAULT_
 module_param(ple_window, int, S_IRUGO);
 
 #define NR_AUTOLOAD_MSRS 1
+#define VMCS02_POOL_SIZE 1
 
 struct vmcs {
 	u32 revision_id;
@@ -159,6 +160,31 @@ struct __packed vmcs12 {
 #define VMCS12_REVISION 0x11e57ed0
 
 /*
+ * When we temporarily switch a vcpu's VMCS (e.g., stop using an L1's VMCS
+ * while we use L2's VMCS), and wish to save the previous VMCS, we must also
+ * remember on which CPU it was last loaded (vcpu->cpu), so when we return to
+ * using this VMCS we'll know if we're now running on a different CPU and need
+ * to clear the VMCS on the old CPU, and load it on the new one. Additionally,
+ * we need to remember whether this VMCS was launched (vmx->launched), so when
+ * we return to it we know whether to VMLAUNCH or to VMRESUME it (we cannot
+ * deduce this from other state, because it's possible that this VMCS had
+ * once been launched, but has since been cleared after a CPU switch, and
+ * now vmx->launched is 0).
+ */
+struct saved_vmcs {
+	struct vmcs *vmcs;
+	int cpu;
+	int launched;
+};
+
+/* Used to remember the last vmcs02 used for some recently used vmcs12s */
+struct vmcs02_list {
+	struct list_head list;
+	gpa_t vmcs12_addr;
+	struct saved_vmcs vmcs02;
+};
+
+/*
  * The nested_vmx structure is part of vcpu_vmx, and holds information we need
  * for correct emulation of VMX (i.e., nested VMX) on this vcpu. For example,
  * the current VMCS set by L1, a list of the VMCSs used to run the active
@@ -173,6 +199,10 @@ struct nested_vmx {
 	/* The host-usable pointer to the above */
 	struct page *current_vmcs12_page;
 	struct vmcs12 *current_vmcs12;
+
+	/* vmcs02_list cache of VMCSs recently used to run L2 guests */
+	struct list_head vmcs02_pool;
+	int vmcs02_num;
 };
 
 struct vcpu_vmx {
@@ -3965,6 +3995,106 @@ static int handle_invalid_op(struct kvm_
 }
 
 /*
+ * To run an L2 guest, we need a vmcs02 based on the L1-specified vmcs12.
+ * We could reuse a single VMCS for all the L2 guests, but we also want the
+ * option to allocate a separate vmcs02 for each separate loaded vmcs12 - this
+ * allows keeping them loaded on the processor, and in the future will allow
+ * optimizations where prepare_vmcs02 doesn't need to set all the fields on
+ * every entry if they never change.
+ * So we keep, in vmx->nested.vmcs02_pool, a cache of size VMCS02_POOL_SIZE
+ * (>=0) with a vmcs02 for each recently loaded vmcs12, most recent first.
+ *
+ * The following functions allocate and free a vmcs02 in this pool.
+ */
+
+static void __nested_free_saved_vmcs(void *arg)
+{
+	struct saved_vmcs *saved_vmcs = arg;
+
+	vmcs_clear(saved_vmcs->vmcs);
+	if (per_cpu(current_vmcs, saved_vmcs->cpu) == saved_vmcs->vmcs)
+		per_cpu(current_vmcs, saved_vmcs->cpu) = NULL;
+}
+
+/*
+ * Free a VMCS, but before that VMCLEAR it on the CPU where it was last loaded
+ * (the necessary information is in the saved_vmcs structure).
+ * See also vcpu_clear() (with different parameters and side-effects)
+ */
+static void nested_free_saved_vmcs(struct vcpu_vmx *vmx,
+		struct saved_vmcs *saved_vmcs)
+{
+	if (saved_vmcs->cpu != -1)
+		smp_call_function_single(saved_vmcs->cpu,
+				__nested_free_saved_vmcs, saved_vmcs, 1);
+
+	free_vmcs(saved_vmcs->vmcs);
+}
+
+/* Free and remove from pool a vmcs02 saved for a vmcs12 (if there is one) */
+static void nested_free_vmcs02(struct vcpu_vmx *vmx, gpa_t vmptr)
+{
+	struct vmcs02_list *item;
+	list_for_each_entry(item, &vmx->nested.vmcs02_pool, list)
+		if (item->vmcs12_addr == vmptr) {
+			nested_free_saved_vmcs(vmx, &item->vmcs02);
+			list_del(&item->list);
+			kfree(item);
+			vmx->nested.vmcs02_num--;
+			return;
+		}
+}
+
+/* Free all vmcs02 saved for this vcpu */
+static void nested_free_all_vmcs02(struct vcpu_vmx *vmx)
+{
+	struct vmcs02_list *item, *n;
+	list_for_each_entry_safe(item, n, &vmx->nested.vmcs02_pool, list) {
+		nested_free_saved_vmcs(vmx, &item->vmcs02);
+		list_del(&item->list);
+		kfree(item);
+	}
+	vmx->nested.vmcs02_num = 0;
+}
+
+/* Get a vmcs02 for the current vmcs12. */
+static struct saved_vmcs *nested_get_current_vmcs02(struct vcpu_vmx *vmx)
+{
+	struct vmcs02_list *item;
+	list_for_each_entry(item, &vmx->nested.vmcs02_pool, list)
+		if (item->vmcs12_addr == vmx->nested.current_vmptr) {
+			list_move(&item->list, &vmx->nested.vmcs02_pool);
+			return &item->vmcs02;
+		}
+
+	if (vmx->nested.vmcs02_num >= max(VMCS02_POOL_SIZE, 1)) {
+		/* Recycle the least recently used VMCS. */
+		item = list_entry(vmx->nested.vmcs02_pool.prev,
+			struct vmcs02_list, list);
+		item->vmcs12_addr = vmx->nested.current_vmptr;
+		list_move(&item->list, &vmx->nested.vmcs02_pool);
+		return &item->vmcs02;
+	}
+
+	/* Create a new vmcs02 */
+	item = (struct vmcs02_list *)
+		kmalloc(sizeof(struct vmcs02_list), GFP_KERNEL);
+	if (!item)
+		return NULL;
+	item->vmcs02.vmcs = alloc_vmcs();
+	if (!item->vmcs02.vmcs) {
+		kfree(item);
+		return NULL;
+	}
+	item->vmcs12_addr = vmx->nested.current_vmptr;
+	item->vmcs02.cpu = -1;
+	item->vmcs02.launched = 0;
+	list_add(&(item->list), &(vmx->nested.vmcs02_pool));
+	vmx->nested.vmcs02_num++;
+	return &item->vmcs02;
+}
+
+/*
  * Emulate the VMXON instruction.
  * Currently, we just remember that VMX is active, and do not save or even
  * inspect the argument to VMXON (the so-called "VMXON pointer") because we
@@ -4000,6 +4130,9 @@ static int handle_vmon(struct kvm_vcpu *
 		return 1;
 	}
 
+	INIT_LIST_HEAD(&(vmx->nested.vmcs02_pool));
+	vmx->nested.vmcs02_num = 0;
+
 	vmx->nested.vmxon = true;
 
 	skip_emulated_instruction(vcpu);
@@ -4050,6 +4183,8 @@ static void free_nested(struct vcpu_vmx 
 		nested_release_page(vmx->nested.current_vmcs12_page);
 		vmx->nested.current_vmptr = -1ull;
 	}
+
+	nested_free_all_vmcs02(vmx);
 }
 
 /* Emulate the VMXOFF instruction */

-- 
Nadav Har'El                        |    Thursday, Feb  3 2011, 29 Shevat 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Boat: A hole in the water surrounded by
http://nadav.harel.org.il           |wood into which one pours money.

