From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Nadav Har'El" <nyh@math.technion.ac.il>
Subject: Re: [PATCH 19/24] Deciding if L0 or L1 should handle an L2 exit
Date: Thu, 16 Sep 2010 16:42:49 +0200
Message-ID: <20100916144249.GA6303@fermat.math.technion.ac.il>
References: <1276431753-nyh@il.ibm.com> <201006131232.o5DCWIHl013120@rice.haifa.ibm.com> <4C161F62.9070506@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: kvm@vger.kernel.org
To: Avi Kivity <avi@redhat.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mailgw12.technion.ac.il ([132.68.225.12]:17816 "EHLO
	mailgw12.technion.ac.il" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753897Ab0IPO6P (ORCPT <rfc822;kvm@vger.kernel.org>);
	Thu, 16 Sep 2010 10:58:15 -0400
Content-Disposition: inline
In-Reply-To: <4C161F62.9070506@redhat.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

On Mon, Jun 14, 2010, Avi Kivity wrote about "Re: [PATCH 19/24] Deciding if L0 or L1 should handle an L2 exit":
> On 06/13/2010 03:32 PM, Nadav Har'El wrote:
> >This patch contains the logic of whether an L2 exit should be handled by L0
> >and then L2 should be resumed, or whether L1 should be run to handle this
> >exit (using the nested_vmx_vmexit() function of the previous patch).
> >The basic idea is to let L1 handle the exit only if it actually asked to
> >trap this sort of event. For example, when L2 exits on a change to CR0,
>...
> >@@ -3819,6 +3841,8 @@ static int handle_exception(struct kvm_v
> >
> >  	if (is_no_device(intr_info)) {
> >  		vmx_fpu_activate(vcpu);
> >+		if (vmx->nested.nested_mode)
> >+			vmx->nested.nested_run_pending = 1;
> >  		return 1;
> >  	}
> >   
> 
> Isn't this true for many other exceptions?  #UD which we emulate (but 
> the guest doesn't trap), page faults which we handle completely...

I was trying to think why nested_run_pending=1 (forcing us to run L2 next)
is necessary in the specific case of #NM, and couldn't think of any convincing
reason. Sure, in most cases we would like to continue running L2 after L0
serviced this exception that L1 didn't care about, but in the rare cases
where L1 should run next (especially, user-space injecting an interrupt),
what's so wrong with L1 running next?
And, like you said, if it's important for #NM, why not for #PF or other
things?

Anyway, the code appears to run correctly also without this setting,
so I'm guessing that it's an historic artifact, from older code which
was written before the days of the lazy fpu loading. So I'm removing it.

Good catch. I was aware of this peculiar case in the code (and even
documented it in the patch's description), but should have stopped to
think if it is really such a special case, or simply an error. And now
I believe it was nothing more than an error.

> >+/* Return 1 if we should exit from L2 to L1 to handle a CR access exit,
> >+ * rather than handle it ourselves in L0. I.e., check if L1 wanted to
> >+ * intercept (via guest_host_mask etc.) the current event.
> >+ */
> >+static bool nested_vmx_exit_handled_cr(struct kvm_vcpu *vcpu,
> >+	struct shadow_vmcs *l2svmcs)
> >+{
>...
> >+		case 8:
> >+			if (l2svmcs->cpu_based_vm_exec_control&
> >+			    CPU_BASED_CR8_LOAD_EXITING)
> >+				return 1;
> >   
> 
> Should check TPR threshold here too if enabled.

I'll return to this issue in another mail. This one is getting long enough.

> >+	case 3: /* lmsw */
> >+		if (l2svmcs->cr0_guest_host_mask&
> >+		    (val ^ l2svmcs->cr0_read_shadow))
> >+			return 1;
> >   
> 
> Need to mask off bit 0 (cr0.pe) of val, since lmsw can't clear it.

Right. Also, lmsw only works on the first 4 bits of cr0: The first bit, it can
only turn on (like you said), and the next 3 bits, it can change at will. Any
other attempted changes to cr0 through lmsw should be ignored, and not cause
exits.  So I changed the code to this:

		if (vmcs12->cr0_guest_host_mask & 0xe &
		    (val ^ vmcs12->cr0_read_shadow))
			return 1;
		if ((vmcs12->cr0_guest_host_mask & 0x1) &&
		    !(vmcs12->cr0_read_shadow & 0x1) &&
		    (val & 0x1))
		    	return 1;

I wonder if there's a less ugly way to write the same thing...

This LMSW is so 80s :( I wonder who's using it these days, and specifically
if it would bother anyone if lmsw suddenly acquired new "abilities" to
change bits it never could... Oh, the things that we do for backward
compatibility :-)

> >+/* Return 1 if we should exit from L2 to L1 to handle an exit, or 0 if we
> >+ * should handle it ourselves in L0. Only call this when in nested_mode 
> >(L2).
> >+ */
> >+static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu, bool afterexit)
>...
> >+	case EXIT_REASON_EXCEPTION_NMI:
> >+		if (is_external_interrupt(intr_info)&&
> >+		    (l2svmcs->pin_based_vm_exec_control&
> >+		     PIN_BASED_EXT_INTR_MASK))
> >+			r = 1;
> >   
> 
> A real external interrupt should never be handled by the guest, only a 
> virtual external interrupt.

You've opened a whole can of worms ;-) But it's very good that you did.

It appears that this nested_vmx_exit_handled() was called for two completely
different reasons, with different "afterexit" parameter (if you remember,
this flag had the puzzling name "kvm_override" in a previous version).
On normal exits, it is called with afterexit=1 and did exactly what you wanted,
i.e., never handle in the guest:

		case EXIT_REASON_EXTERNAL_INTERRUPT:
			return 0;

The case which you saw was only relevant for the other place this function is
called - in exception injection. But most of the code in this function was
irrelevant and/or plain wrong in this case.

This part of the code was just terrible, and I couldn't leave it like this,
even if it was working (I'm sure you'll agree). So I now completely rewrote
this function to become two separate functions, with (hopefully) no irrelevant
or wrong code.

Here is the new version of the two relevant patches - this one and the one
dealing with exception injection:

Subject: [PATCH 20/26] nVMX: Deciding if L0 or L1 should handle an L2 exit

This patch contains the logic of whether an L2 exit should be handled by L0
and then L2 should be resumed, or whether L1 should be run to handle this
exit (using the nested_vmx_vmexit() function of the previous patch).

The basic idea is to let L1 handle the exit only if it actually asked to
trap this sort of event. For example, when L2 exits on a change to CR0,
we check L1's CR0_GUEST_HOST_MASK to see if L1 expressed interest in any
bit which changed; If it did, we exit to L1. But if it didn't it means that
it is we (L0) that wished to trap this event, so we handle it ourselves.

The next two patches add additional logic of what to do when an interrupt or
exception is injected: Does L0 need to do it, should we exit to L1 to do it,
or should we resume L2 and keep the exception to be injected later.

We keep a new flag, "nested_run_pending", which can override the decision of
which should run next, L1 or L2. nested_run_pending=1 means that we *must* run
L2 next, not L1. This is necessary in several situations where had L1 run on
bare metal it would not have expected to be resumed at this stage. One
example is when L1 did a VMLAUNCH of L2 and therefore expects L2 to be run.
Nested_run_pending is especially intended to avoid switching to L1 in the
injection decision-point described above.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |  204 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 204 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-09-16 16:38:03.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-09-16 16:38:03.000000000 +0200
@@ -338,6 +338,8 @@ struct nested_vmx {
 	/* Saving the VMCS that we used for running L1 */
 	struct vmcs *vmcs01;
 	struct vmcs_fields *vmcs01_fields;
+	/* L2 must run next, and mustn't decide to exit to L1. */
+	bool nested_run_pending;
 };
 
 struct vcpu_vmx {
@@ -848,6 +850,20 @@ static inline bool nested_cpu_has_vmx_ep
 		SECONDARY_EXEC_ENABLE_EPT);
 }
 
+static inline bool nested_cpu_has_vmx_msr_bitmap(struct kvm_vcpu *vcpu)
+{
+	return get_vmcs12_fields(vcpu)->cpu_based_vm_exec_control &
+		CPU_BASED_USE_MSR_BITMAPS;
+}
+
+static inline bool is_exception(u32 intr_info)
+{
+	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
+		== (INTR_TYPE_HARD_EXCEPTION | INTR_INFO_VALID_MASK);
+}
+
+static int nested_vmx_vmexit(struct kvm_vcpu *vcpu, bool is_interrupt);
+
 static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
 {
 	int i;
@@ -4879,6 +4895,182 @@ static const int kvm_vmx_max_exit_handle
 	ARRAY_SIZE(kvm_vmx_exit_handlers);
 
 /*
+ * Return 1 if we should exit from L2 to L1 to handle an MSR access access,
+ * rather than handle it ourselves in L0. I.e., check L1's MSR bitmap whether
+ * it expressed interest in the current event (read or write a specific MSR).
+ */
+static bool nested_vmx_exit_handled_msr(struct kvm_vcpu *vcpu,
+	struct vmcs_fields *vmcs12, u32 exit_reason)
+{
+	u32 msr_index = vcpu->arch.regs[VCPU_REGS_RCX];
+	struct page *msr_bitmap_page;
+	void *va;
+	bool ret;
+
+	if (!cpu_has_vmx_msr_bitmap() || !nested_cpu_has_vmx_msr_bitmap(vcpu))
+		return 1;
+
+	msr_bitmap_page = nested_get_page(vcpu, vmcs12->msr_bitmap);
+	if (!msr_bitmap_page) {
+		printk(KERN_INFO "%s error in nested_get_page\n", __func__);
+		return 0;
+	}
+
+	va = kmap_atomic(msr_bitmap_page, KM_USER1);
+	if (exit_reason == EXIT_REASON_MSR_WRITE)
+		va += 0x800;
+	if (msr_index >= 0xc0000000) {
+		msr_index -= 0xc0000000;
+		va += 0x400;
+	}
+	if (msr_index > 0x1fff)
+		return 0;
+	ret = test_bit(msr_index, va);
+	kunmap_atomic(va, KM_USER1);
+	return ret;
+}
+
+/*
+ * Return 1 if we should exit from L2 to L1 to handle a CR access exit,
+ * rather than handle it ourselves in L0. I.e., check if L1 wanted to
+ * intercept (via guest_host_mask etc.) the current event.
+ */
+static bool nested_vmx_exit_handled_cr(struct kvm_vcpu *vcpu,
+	struct vmcs_fields *vmcs12)
+{
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	int cr = exit_qualification & 15;
+	int reg = (exit_qualification >> 8) & 15;
+	unsigned long val = kvm_register_read(vcpu, reg);
+
+	switch ((exit_qualification >> 4) & 3) {
+	case 0: /* mov to cr */
+		switch (cr) {
+		case 0:
+			if (vmcs12->cr0_guest_host_mask &
+			    (val ^ vmcs12->cr0_read_shadow))
+				return 1;
+			break;
+		case 3:
+			if (vmcs12->cpu_based_vm_exec_control &
+			    CPU_BASED_CR3_LOAD_EXITING)
+				return 1;
+			break;
+		case 4:
+			if (vmcs12->cr4_guest_host_mask &
+			    (vmcs12->cr4_read_shadow ^ val))
+				return 1;
+			break;
+		case 8:
+			if (vmcs12->cpu_based_vm_exec_control &
+			    CPU_BASED_CR8_LOAD_EXITING)
+				return 1;
+			// TODO: missing else if control & CPU_BASED_TPR_SHADOW
+			//   then set tpr shadow and if below tpr_threshold,
+			//   exit.
+			break;
+		}
+		break;
+	case 2: /* clts */
+		if (vmcs12->cr0_guest_host_mask & X86_CR0_TS)
+			return 1;
+		break;
+	case 1: /* mov from cr */
+		switch (cr) {
+		case 0:
+			return 1;
+		case 3:
+			if (vmcs12->cpu_based_vm_exec_control &
+			    CPU_BASED_CR3_STORE_EXITING)
+				return 1;
+			break;
+		case 4:
+			return 1;
+			break;
+		case 8:
+			if (vmcs12->cpu_based_vm_exec_control &
+			    CPU_BASED_CR8_STORE_EXITING)
+				return 1;
+			break;
+		}
+		break;
+	case 3: /* lmsw */
+		/*
+		 * lmsw can change bits 1..3 of cr0, and only set bit 0 of
+		 * cr0. Other attempted changes are ignored, with no exit.
+		 */
+		if (vmcs12->cr0_guest_host_mask & 0xe &
+		    (val ^ vmcs12->cr0_read_shadow))
+			return 1;
+		if ((vmcs12->cr0_guest_host_mask & 0x1) &&
+		    !(vmcs12->cr0_read_shadow & 0x1) &&
+		    (val & 0x1))
+		    	return 1;
+		break;
+	}
+	return 0;
+}
+
+/*
+ * Return 1 if we should exit from L2 to L1 to handle an exit, or 0 if we
+ * should handle it ourselves in L0 (and then continue L2). Only call this
+ * when in nested_mode (L2).
+ */
+static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
+{
+	u32 exit_reason = vmcs_read32(VM_EXIT_REASON);
+	u32 intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vmcs_fields *vmcs12 = get_vmcs12_fields(vcpu);
+
+	if (vmx->nested.nested_run_pending)
+		return 0;
+
+	if (unlikely(vmx->fail)) {
+		printk(KERN_INFO "%s failed vm entry %x\n",
+		       __func__, vmcs_read32(VM_INSTRUCTION_ERROR));
+		return 1;
+	}
+
+	switch (exit_reason) {
+	case EXIT_REASON_EXTERNAL_INTERRUPT:
+		return 0;
+	case EXIT_REASON_EXCEPTION_NMI:
+		if (!is_exception(intr_info))
+			return 0;
+		else if (is_page_fault(intr_info) && (!enable_ept))
+			return 0;
+		return (vmcs12->exception_bitmap &
+				(1u << (intr_info & INTR_INFO_VECTOR_MASK)));
+	case EXIT_REASON_EPT_VIOLATION:
+		return 0;
+	case EXIT_REASON_INVLPG:
+		return (vmcs12->cpu_based_vm_exec_control &
+				CPU_BASED_INVLPG_EXITING);
+	case EXIT_REASON_MSR_READ:
+	case EXIT_REASON_MSR_WRITE:
+		return nested_vmx_exit_handled_msr(vcpu, vmcs12, exit_reason);
+	case EXIT_REASON_CR_ACCESS:
+		return nested_vmx_exit_handled_cr(vcpu, vmcs12);
+	case EXIT_REASON_DR_ACCESS:
+		return (vmcs12->cpu_based_vm_exec_control &
+				CPU_BASED_MOV_DR_EXITING);
+	default:
+		/*
+		 * One particularly interesting case that is covered here is an
+		 * exit caused by L2 running a VMX instruction. L2 is guest
+		 * mode in L1's world, and according to the VMX spec running a
+		 * VMX instruction in guest mode should cause an exit to root
+		 * mode, i.e., to L1. This is why we need to return r=1 for
+		 * those exit reasons too. This enables further nesting: Like
+		 * L0 emulates VMX for L1, we now allow L1 to emulate VMX for
+		 * L2, who will then be able to run L3.
+		 */
+		return 1;
+	}
+}
+
+/*
  * The guest has exited.  See if we can fix it or if we need userspace
  * assistance.
  */
@@ -4894,6 +5086,17 @@ static int vmx_handle_exit(struct kvm_vc
 	if (vmx->emulation_required && emulate_invalid_guest_state)
 		return handle_invalid_guest_state(vcpu);
 
+	if (exit_reason == EXIT_REASON_VMLAUNCH ||
+	    exit_reason == EXIT_REASON_VMRESUME)
+		vmx->nested.nested_run_pending = 1;
+	else
+		vmx->nested.nested_run_pending = 0;
+
+	if (vmx->nested.nested_mode && nested_vmx_exit_handled(vcpu)) {
+		nested_vmx_vmexit(vcpu, false);
+		return 1;
+	}
+
 	/* Access CR3 don't cause VMExit in paging mode, so we need
 	 * to sync with guest real CR3. */
 	if (enable_ept && is_paging(vcpu))
@@ -5941,6 +6144,7 @@ static int nested_vmx_run(struct kvm_vcp
 		r = kvm_mmu_load(vcpu);
 		if (unlikely(r)) {
 			printk(KERN_ERR "Error in kvm_mmu_load r %d\n", r);
+			nested_vmx_vmexit(vcpu, false);
 			nested_vmx_failValid(vcpu,
 				VMXERR_VMRESUME_CORRUPTED_VMCS /* ? */);
 			/* switch back to L1 */

Subject: [PATCH 22/26] nVMX: Correct handling of exception injection

Similar to the previous patch, but concerning injection of exceptions rather
than external interrupts.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

--- .before/arch/x86/kvm/vmx.c	2010-09-16 16:38:04.000000000 +0200
+++ .after/arch/x86/kvm/vmx.c	2010-09-16 16:38:04.000000000 +0200
@@ -1507,6 +1507,25 @@ static void skip_emulated_instruction(st
 	vmx_set_interrupt_shadow(vcpu, 0);
 }
 
+/*
+ * KVM wants to inject page-faults which it got to the guest. This function
+ * checks whether in a nested guest, we need to inject them to L1 or L2.
+ * This function assumes it is called with the exit reason in vmcs02 being
+ * a #PF exception (this is the only case in which KVM injects a #PF when L2
+ * is running).
+ */
+static int nested_pf_handled(struct kvm_vcpu *vcpu)
+{
+	struct vmcs_fields *vmcs12 = get_vmcs12_fields(vcpu);
+
+	/* TODO: also check PFEC_MATCH/MASK, not just EB.PF. */
+	if (!(vmcs12->exception_bitmap & PF_VECTOR))
+		return 0;
+
+	nested_vmx_vmexit(vcpu, false);
+	return 1;
+}
+
 static void vmx_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
 				bool has_error_code, u32 error_code,
 				bool reinject)
@@ -1514,6 +1533,10 @@ static void vmx_queue_exception(struct k
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u32 intr_info = nr | INTR_INFO_VALID_MASK;
 
+	if (nr == PF_VECTOR && vmx->nested.nested_mode &&
+		nested_pf_handled(vcpu))
+		return;
+
 	if (has_error_code) {
 		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, error_code);
 		intr_info |= INTR_INFO_DELIVER_CODE_MASK;
@@ -3554,6 +3577,9 @@ static void vmx_inject_nmi(struct kvm_vc
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
+	if (vmx->nested.nested_mode)
+		return;
+
 	if (!cpu_has_virtual_nmis()) {
 		/*
 		 * Tracking the NMI-blocked state in software is built upon

-- 
Nadav Har'El                        |      Tuesday, Sep 14 2010, 6 Tishri 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |echo '[q]sa[ln0=aln256%Pln256/snlbx]
http://nadav.harel.org.il           |sb3135071790101768542287578439snlbxq'|dc