From: Paolo Bonzini <pbonzini@redhat.com>
To: Andy Lutomirski <luto@amacapital.net>,
	kvm list <kvm@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	X86 ML <x86@kernel.org>, "H. Peter Anvin" <hpa@zytor.com>,
	Borislav Petkov <bpetkov@suse.de>
Subject: Re: RFC: Getting rid of LTR in VMX
Date: Mon, 20 Feb 2017 12:05:13 +0100	[thread overview]
Message-ID: <38d613dc-5b46-1f4f-04d1-53c01932e6d6@redhat.com> (raw)
In-Reply-To: <CALCETrVnHc1_M0_zEL=EBvMNGBj-ZqzvhrGBw9jKJ6HmL-qKDg@mail.gmail.com>



On 18/02/2017 04:29, Andy Lutomirski wrote:
> There's no code here because the patch is trivial, but I want to run
> the idea by you all first to see if there are any issues.
> 
> VMX is silly and forces the TSS limit to the minimum on VM exits.  KVM
> wastes lots of cycles bumping it back up to accommodate the io bitmap.

Actually looked at the code now...

reload_tss() is only invoked on exits to userspace, so removing it is a
nice-to-have that wouldn't show up on most workloads.  Still, it does
save 150-200 clock cycles (I tested by simply commenting out
reload_tss() in __vmx_load_host_state).

Another 100-150 cycles could be saved by using rdgsbase/wrgsbase,
instead of rdmsr/wrmsr, to read and write the kernel GS base.  Super
hacky patch after my signature.

vmx_save_host_state is really slow...  It would be nice if Intel
defined an XSAVES state component for it (similar to AMD's
vmload/vmsave).

Paolo

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 9856b73a21ad..e76bfec463bc 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2138,9 +2138,12 @@ static void vmx_save_host_state(struct kvm_vcpu *vcpu)
 #endif
 
 #ifdef CONFIG_X86_64
-	rdmsrl(MSR_KERNEL_GS_BASE, vmx->msr_host_kernel_gs_base);
+	local_irq_disable();
+	asm volatile("swapgs; rdgsbase %0" : "=r" (vmx->msr_host_kernel_gs_base));
 	if (is_long_mode(&vmx->vcpu))
-		wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_guest_kernel_gs_base);
+		asm volatile("wrgsbase %0" : : "r" (vmx->msr_guest_kernel_gs_base));
+	asm volatile("swapgs");
+	local_irq_enable();
 #endif
 	if (boot_cpu_has(X86_FEATURE_MPX))
 		rdmsrl(MSR_IA32_BNDCFGS, vmx->host_state.msr_host_bndcfgs);
@@ -2152,14 +2155,20 @@ static void vmx_save_host_state(struct kvm_vcpu *vcpu)
 
 static void __vmx_load_host_state(struct vcpu_vmx *vmx)
 {
+	unsigned long flags;
+
 	if (!vmx->host_state.loaded)
 		return;
 
 	++vmx->vcpu.stat.host_state_reload;
 	vmx->host_state.loaded = 0;
 #ifdef CONFIG_X86_64
+	local_irq_save(flags);
+	asm("swapgs");
 	if (is_long_mode(&vmx->vcpu))
-		rdmsrl(MSR_KERNEL_GS_BASE, vmx->msr_guest_kernel_gs_base);
+		asm volatile("rdgsbase %0" : "=r" (vmx->msr_guest_kernel_gs_base));
+	asm volatile("wrgsbase %0; swapgs" : : "r" (vmx->msr_host_kernel_gs_base));
+	local_irq_restore(flags);
 #endif
 	if (vmx->host_state.gs_ldt_reload_needed) {
 		kvm_load_ldt(vmx->host_state.ldt_sel);
@@ -2177,10 +2186,7 @@ static void __vmx_load_host_state(struct vcpu_vmx *vmx)
 		loadsegment(es, vmx->host_state.es_sel);
 	}
 #endif
-	reload_tss();
-#ifdef CONFIG_X86_64
-	wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_host_kernel_gs_base);
-#endif
+	//reload_tss();
 	if (vmx->host_state.msr_host_bndcfgs)
 		wrmsrl(MSR_IA32_BNDCFGS, vmx->host_state.msr_host_bndcfgs);
 	load_gdt(this_cpu_ptr(&host_gdt));
@@ -3469,6 +3475,7 @@ static int hardware_enable(void)
 		wrmsrl(MSR_IA32_FEATURE_CONTROL, old | test_bits);
 	}
 	cr4_set_bits(X86_CR4_VMXE);
+	cr4_set_bits(X86_CR4_FSGSBASE);
 
 	if (vmm_exclusive) {
 		kvm_cpu_vmxon(phys_addr);

> I propose that we rework this.  Add a percpu variable that indicates
> whether the TSS limit needs to be refreshed.  On task switch, if the
> new task has TIF_IO_BITMAP set, then check that flag and, if set,
> refresh TR and clear the flag.  On VMX exit, set the flag.
> 
> The TSS limit is (phew!) invisible to userspace, so we don't have ABI
> issues to worry about here.  We also shouldn't have security issues
> because a too-low TSS limit just results in unprivileged IO operations
> generating #GP, which is exactly what we want.
> 
> What do you all think?  I expect a speedup of a couple hundred cycles
> on each VM exit.

Thread overview: 8+ messages
2017-02-18  3:29 RFC: Getting rid of LTR in VMX Andy Lutomirski
2017-02-18 15:19 ` Paolo Bonzini
2017-02-20 11:05 ` Paolo Bonzini [this message]
2017-02-20 16:46   ` Andy Lutomirski
2017-02-20 16:51     ` Paolo Bonzini
2017-02-20 16:55       ` Andy Lutomirski
2017-02-20 22:02         ` Paolo Bonzini
2017-02-20 22:23           ` hpa
