From mboxrd@z Thu Jan 1 00:00:00 1970
From: Avi Kivity
Subject: Re: [PATCH 14/24] Prepare vmcs02 from vmcs01 and vmcs12
Date: Mon, 14 Jun 2010 14:11:41 +0300
Message-ID: <4C160E6D.3080609@redhat.com>
References: <1276431753-nyh@il.ibm.com> <201006131229.o5DCTih4013041@rice.haifa.ibm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: kvm@vger.kernel.org
To: "Nadav Har'El"
Return-path: 
Received: from mx1.redhat.com ([209.132.183.28]:48601 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1756162Ab0FNLLq (ORCPT ); Mon, 14 Jun 2010 07:11:46 -0400
In-Reply-To: <201006131229.o5DCTih4013041@rice.haifa.ibm.com>
Sender: kvm-owner@vger.kernel.org
List-ID: 

On 06/13/2010 03:29 PM, Nadav Har'El wrote:
> This patch contains code to prepare the VMCS which can be used to actually
> run the L2 guest, vmcs02. prepare_vmcs02 appropriately merges the information
> in shadow_vmcs that L1 built for L2 (vmcs12), and that in the VMCS that we
> built for L1 (vmcs01).
>
> VMREAD/WRITE can only access one VMCS at a time (the "current" VMCS), which
> makes it difficult for us to read from vmcs01 while writing to vmcs12. This
> is why we first make a copy of vmcs01 in memory (l1_shadow_vmcs) and then
> read that memory copy while writing to vmcs12.
>
> Signed-off-by: Nadav Har'El
> ---
> --- .before/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c	2010-06-13 15:01:29.000000000 +0300
> @@ -849,6 +849,36 @@ static inline bool report_flexpriority(v
>  	return flexpriority_enabled;
>  }
>
> +static inline bool nested_cpu_has_vmx_tpr_shadow(struct kvm_vcpu *vcpu)
> +{
> +	return cpu_has_vmx_tpr_shadow() &&
> +		get_shadow_vmcs(vcpu)->cpu_based_vm_exec_control &
> +		CPU_BASED_TPR_SHADOW;

Operator precedence is with you, but the line width limit is not.  Use 
parentheses to improve readability.
> @@ -1292,6 +1322,39 @@ static void vmx_load_host_state(struct v
>  	preempt_enable();
>  }
>
> +int load_vmcs_host_state(struct shadow_vmcs *src)
> +{
> +	vmcs_write16(HOST_ES_SELECTOR, src->host_es_selector);
> +	vmcs_write16(HOST_CS_SELECTOR, src->host_cs_selector);
> +	vmcs_write16(HOST_SS_SELECTOR, src->host_ss_selector);
> +	vmcs_write16(HOST_DS_SELECTOR, src->host_ds_selector);
> +	vmcs_write16(HOST_FS_SELECTOR, src->host_fs_selector);
> +	vmcs_write16(HOST_GS_SELECTOR, src->host_gs_selector);
> +	vmcs_write16(HOST_TR_SELECTOR, src->host_tr_selector);

Why do we need to go through a shadow_vmcs for host fields?  Instead of 
cloning a vmcs, you can call a common init routine to initialize the 
host fields.

> +
> +	vmcs_write64(TSC_OFFSET, src->tsc_offset);

Don't you need to adjust for our TSC_OFFSET?

> +/* prepare_vmcs_02 is called in when the L1 guest hypervisor runs its nested
> + * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
> + * with L0's wishes for its guest (vmsc01), so we can run the L2 guest in a
> + * way that will both be appropriate to L1's requests, and our needs.
> + */
> +int prepare_vmcs_02(struct kvm_vcpu *vcpu,
> +	struct shadow_vmcs *vmcs12, struct shadow_vmcs *vmcs01)
> +{
> +	u32 exec_control;
> +
> +	load_vmcs_common(vmcs12);
> +
> +	vmcs_write64(VMCS_LINK_POINTER, vmcs12->vmcs_link_pointer);

Not sure about this.  We don't use the vmcs link pointer, it's better to 
keep it at its default of -1ull.

> +	vmcs_write64(IO_BITMAP_A, vmcs01->io_bitmap_a);
> +	vmcs_write64(IO_BITMAP_B, vmcs01->io_bitmap_b);

I guess merging the io bitmaps doesn't make much sense, at least at this 
stage.

> +	if (cpu_has_vmx_msr_bitmap())
> +		vmcs_write64(MSR_BITMAP, vmcs01->msr_bitmap);

However, merging the msr bitmaps is critical.  Perhaps you do it in a 
later patch.
> +
> +	if (vmcs12->vm_entry_msr_load_count > 0 ||
> +	    vmcs12->vm_exit_msr_load_count > 0 ||
> +	    vmcs12->vm_exit_msr_store_count > 0) {
> +		printk(KERN_WARNING
> +			"%s: VMCS MSR_{LOAD,STORE} unsupported\n", __func__);

Unfortunate, since kvm has started to use this feature.  For all 
unsupported mandatory features, we need reporting that is always enabled 
(i.e. no dprintk()), but to avoid flooding dmesg, use printk_ratelimit() 
or report just once.  Also, it's better to kill the guest 
(KVM_REQ_TRIPLE_FAULT, or a VM instruction error) than to let it 
continue incorrectly.

> +	}
> +
> +	if (nested_cpu_has_vmx_tpr_shadow(vcpu)) {
> +		struct page *page =
> +			nested_get_page(vcpu, vmcs12->virtual_apic_page_addr);
> +		if (!page)
> +			return 1;

?

> +		vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, page_to_phys(page));
> +		kvm_release_page_clean(page);

Hm.  The host can move this page around.  If it happens not to be mapped 
into the guest, we won't get any notification.  So we need to keep a 
reference to this page, or else force an exit from the mmu notifier 
callbacks if it is removed by the host.

> +	}
> +
> +	if (nested_vm_need_virtualize_apic_accesses(vcpu)) {
> +		struct page *page =
> +			nested_get_page(vcpu, vmcs12->apic_access_addr);
> +		if (!page)
> +			return 1;
> +		vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page));
> +		kvm_release_page_clean(page);
> +	}

Ditto.

> +
> +	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
> +		(vmcs01->pin_based_vm_exec_control |
> +		 vmcs12->pin_based_vm_exec_control));

We don't really want the guest's pin-based controls to influence our own 
(it doesn't really matter since ours are always set).  Rather, they 
should influence the interface between the local APIC and the guest.

Where do you check if the values are valid?  Otherwise the guest can 
easily crash where it expects a VM entry failure.
> +	vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK,
> +		(vmcs01->page_fault_error_code_mask &
> +		 vmcs12->page_fault_error_code_mask));
> +	vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH,
> +		(vmcs01->page_fault_error_code_match &
> +		 vmcs12->page_fault_error_code_match));

Bit 14 in the exception bitmap also plays a part, I think it 
significantly changes the picture if set in the guest (the host will 
have it clear always IIRC).  Can perhaps avoid by clearing the whole 
thing if the guest is inverting the match.

> +	if (enable_ept) {
> +		if (!nested_cpu_has_vmx_ept(vcpu)) {
> +			vmcs_write64(EPT_POINTER, vmcs01->ept_pointer);
> +			vmcs_write64(GUEST_PDPTR0, vmcs01->guest_pdptr0);
> +			vmcs_write64(GUEST_PDPTR1, vmcs01->guest_pdptr1);
> +			vmcs_write64(GUEST_PDPTR2, vmcs01->guest_pdptr2);
> +			vmcs_write64(GUEST_PDPTR3, vmcs01->guest_pdptr3);
> +		}

Currently we don't support guest ept, so the second condition can be 
avoided.

> +	}
> +
> +	exec_control = vmcs01->cpu_based_vm_exec_control;
> +	exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING;
> +	exec_control &= ~CPU_BASED_VIRTUAL_NMI_PENDING;
> +	exec_control &= ~CPU_BASED_TPR_SHADOW;
> +	exec_control |= vmcs12->cpu_based_vm_exec_control;
> +	if (!vm_need_tpr_shadow(vcpu->kvm) ||
> +	    vmcs12->virtual_apic_page_addr == 0) {
> +		exec_control &= ~CPU_BASED_TPR_SHADOW;

Why?

> +#ifdef CONFIG_X86_64
> +		exec_control |= CPU_BASED_CR8_STORE_EXITING |
> +			CPU_BASED_CR8_LOAD_EXITING;
> +#endif
> +	} else if (exec_control & CPU_BASED_TPR_SHADOW) {
> +#ifdef CONFIG_X86_64
> +		exec_control &= ~CPU_BASED_CR8_STORE_EXITING;
> +		exec_control &= ~CPU_BASED_CR8_LOAD_EXITING;
> +#endif

This should be part of the general checks on valid control values.

> +	}
> +	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control);
> +
> +	/* EXCEPTION_BITMAP and CR0_GUEST_HOST_MASK should basically be the
> +	 * bitwise-or of what L1 wants to trap for L2, and what we want to
> +	 * trap.
> +	 * However, vmx_fpu_activate/deactivate may have happened after
> +	 * we saved vmcs01, so we shouldn't trust its TS and NM_VECTOR bits
> +	 * and need to base them again on fpu_active. Note that CR0.TS also
> +	 * needs updating - we do this after this function returns (in
> +	 * nested_vmx_run).
> +	 */
> +	vmcs_write32(EXCEPTION_BITMAP,
> +		((vmcs01->exception_bitmap & ~(1u << NM_VECTOR)) |
> +		 (vcpu->fpu_active ? 0 : (1u << NM_VECTOR)) |
> +		 vmcs12->exception_bitmap));

Perhaps moving this to update_exception_bitmap will make this clearer.

> +	vmcs_writel(CR0_GUEST_HOST_MASK, vmcs12->cr0_guest_host_mask |
> +		(vcpu->fpu_active ? 0 : X86_CR0_TS));
> +	vcpu->arch.cr0_guest_owned_bits = ~(vmcs12->cr0_guest_host_mask |
> +		(vcpu->fpu_active ? 0 : X86_CR0_TS));

I'm worried that managing this in two separate places will cause problems.

> +
> +	vmcs_write32(VM_EXIT_CONTROLS,
> +		(vmcs01->vm_exit_controls &
> +		 (~(VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT)))
> +		 | vmcs12->vm_exit_controls);

Why do you drop PAT load/save?

> +
> +	vmcs_write32(VM_ENTRY_CONTROLS,
> +		(vmcs01->vm_entry_controls &
> +		 (~(VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE)))
> +		 | vmcs12->vm_entry_controls);
> +
> +	vmcs_writel(CR4_GUEST_HOST_MASK,
> +		(vmcs01->cr4_guest_host_mask &
> +		 vmcs12->cr4_guest_host_mask));
> +
> +	return 0;
> +}
> +
>  static struct kvm_x86_ops vmx_x86_ops = {
>  	.cpu_has_kvm_support = cpu_has_kvm_support,
>  	.disabled_by_bios = vmx_disabled_by_bios,

-- 
error compiling committee.c: too many arguments to function