* RFC: Getting rid of LTR in VMX
@ 2017-02-18  3:29 Andy Lutomirski
  2017-02-18 15:19 ` Paolo Bonzini
  2017-02-20 11:05 ` Paolo Bonzini
  0 siblings, 2 replies; 8+ messages in thread
From: Andy Lutomirski @ 2017-02-18  3:29 UTC (permalink / raw)
  To: kvm list, Paolo Bonzini, linux-kernel, X86 ML, H. Peter Anvin,
	Borislav Petkov

There's no code here because the patch is trivial, but I want to run
the idea by you all first to see if there are any issues.

VMX is silly and forces the TSS limit to the minimum on VM exits.  KVM
wastes lots of cycles bumping it back up to accommodate the io bitmap.

I propose that we rework this.  Add a percpu variable that indicates
whether the TSS limit needs to be refreshed.  On task switch, if the
new task has TIF_IO_BITMAP set, then check that flag and, if set,
refresh TR and clear the flag.  On VMX exit, set the flag.
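
A minimal sketch of what I mean; the names (tss_limit_stale,
invalidate_tss_limit, refresh_tss_limit) are made up for illustration
and are not existing kernel symbols:

#include <linux/percpu.h>
#include <linux/smp.h>
#include <asm/desc.h>

static DEFINE_PER_CPU(bool, tss_limit_stale);

/* KVM would call this right after a VM exit clobbers the TSS limit. */
static inline void invalidate_tss_limit(void)
{
	this_cpu_write(tss_limit_stale, true);
}

/*
 * The context switch path would call this only when the incoming task
 * has TIF_IO_BITMAP set, i.e. only when the io bitmap actually matters.
 */
static inline void refresh_tss_limit(void)
{
	if (this_cpu_read(tss_limit_stale)) {
		struct desc_struct *gdt = get_cpu_gdt_table(smp_processor_id());

		/* ltr only accepts an "available" TSS: clear the busy type. */
		gdt[GDT_ENTRY_TSS].type = 9;
		load_TR_desc();
		this_cpu_write(tss_limit_stale, false);
	}
}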

The TSS limit is (phew!) invisible to userspace, so we don't have ABI
issues to worry about here.  We also shouldn't have security issues
because a too-low TSS limit just results in unprivileged IO operations
generating #GP, which is exactly what we want.

What do you all think?  I expect a speedup of a couple hundred cycles
on each VM exit.


* Re: RFC: Getting rid of LTR in VMX
  2017-02-18  3:29 RFC: Getting rid of LTR in VMX Andy Lutomirski
@ 2017-02-18 15:19 ` Paolo Bonzini
  2017-02-20 11:05 ` Paolo Bonzini
  1 sibling, 0 replies; 8+ messages in thread
From: Paolo Bonzini @ 2017-02-18 15:19 UTC (permalink / raw)
  To: Andy Lutomirski, kvm list, linux-kernel, X86 ML, H. Peter Anvin,
	Borislav Petkov



On 18/02/2017 04:29, Andy Lutomirski wrote:
> There's no code here because the patch is trivial, but I want to run
> the idea by you all first to see if there are any issues.
> 
> VMX is silly and forces the TSS limit to the minimum on VM exits.  KVM
> wastes lots of cycles bumping it back up to accommodate the io bitmap.
> 
> I propose that we rework this.  Add a percpu variable that indicates
> whether the TSS limit needs to be refreshed.  On task switch, if the
> new task has TIF_IO_BITMAP set, then check that flag and, if set,
> refresh TR and clear the flag.  On VMX exit, set the flag.

That's nice.  An alternative is to just use the user return notifier,
which does not need any global change.  This is even simpler and can be
used for benchmarking.
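
A rough sketch of that alternative, assuming it lives in
arch/x86/kvm/vmx.c; the names (vmx_tss_urn, vmx_tss_urn_armed) are made
up, and refresh_tss_limit() stands for whatever ends up restoring the
TSS limit (e.g. clear the busy type on the TSS descriptor and ltr, as
reload_tss() does today):

#include <linux/user-return-notifier.h>
#include <linux/percpu.h>

static DEFINE_PER_CPU(bool, vmx_tss_urn_armed);

static void vmx_tss_user_return(struct user_return_notifier *urn)
{
	/* Restore the full TSS limit before userspace runs again. */
	refresh_tss_limit();	/* hypothetical helper, see the parent mail */
	user_return_notifier_unregister(urn);
	this_cpu_write(vmx_tss_urn_armed, false);
}

static DEFINE_PER_CPU(struct user_return_notifier, vmx_tss_urn) = {
	.on_user_return = vmx_tss_user_return,
};

/* Would replace the reload_tss() call in __vmx_load_host_state(). */
static void vmx_defer_tss_reload(void)
{
	if (!this_cpu_read(vmx_tss_urn_armed)) {
		user_return_notifier_register(this_cpu_ptr(&vmx_tss_urn));
		this_cpu_write(vmx_tss_urn_armed, true);
	}
}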

> The TSS limit is (phew!) invisible to userspace, so we don't have ABI
> issues to worry about here.  We also shouldn't have security issues
> because a too-low TSS limit just results in unprivileged IO operations
> generating #GP, which is exactly what we want.
> 
> What do you all think?  I expect a speedup of a couple hundred cycles
> on each VM exit.
> 


* Re: RFC: Getting rid of LTR in VMX
  2017-02-18  3:29 RFC: Getting rid of LTR in VMX Andy Lutomirski
  2017-02-18 15:19 ` Paolo Bonzini
@ 2017-02-20 11:05 ` Paolo Bonzini
  2017-02-20 16:46   ` Andy Lutomirski
  1 sibling, 1 reply; 8+ messages in thread
From: Paolo Bonzini @ 2017-02-20 11:05 UTC (permalink / raw)
  To: Andy Lutomirski, kvm list, linux-kernel, X86 ML, H. Peter Anvin,
	Borislav Petkov



On 18/02/2017 04:29, Andy Lutomirski wrote:
> There's no code here because the patch is trivial, but I want to run
> the idea by you all first to see if there are any issues.
> 
> VMX is silly and forces the TSS limit to the minimum on VM exits.  KVM
> wastes lots of cycles bumping it back up to accommodate the io bitmap.

Actually looked at the code now...

reload_tss() is only invoked for userspace exits, so removing it is a
nice-to-have that wouldn't show up on most workloads.  Still, it does
save 150-200 clock cycles (I just commented out reload_tss() in
__vmx_load_host_state to test).
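
For reference, reload_tss() is roughly the following: it marks the host
TSS descriptor "available" again in the GDT and reloads TR, and that ltr
is where the cycles go.

static void reload_tss(void)
{
	/*
	 * VT restores TR but not its size.  Useless.
	 */
	struct desc_ptr *gdt = this_cpu_ptr(&host_gdt);
	struct desc_struct *descs;

	descs = (void *)gdt->address;
	descs[GDT_ENTRY_TSS].type = 9;	/* make the TSS "available" again */
	load_TR_desc();			/* ltr reloads the full limit */
}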

Another 100-150 cycles could be saved if we could just use
rdgsbase/wrgsbase instead of rdmsr/wrmsr to read and write the kernel GS
base.  Super hacky patch after my signature.

vmx_save_host_state is really slow...  It would be nice if Intel
defined an XSAVES state to do it (just like AMD's vmload/vmsave).

Paolo

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 9856b73a21ad..e76bfec463bc 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2138,9 +2138,12 @@ static void vmx_save_host_state(struct kvm_vcpu *vcpu)
 #endif
 
 #ifdef CONFIG_X86_64
-	rdmsrl(MSR_KERNEL_GS_BASE, vmx->msr_host_kernel_gs_base);
+	local_irq_disable();
+	asm volatile("swapgs; rdgsbase %0" : "=r" (vmx->msr_host_kernel_gs_base));
 	if (is_long_mode(&vmx->vcpu))
-		wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_guest_kernel_gs_base);
+		asm volatile("wrgsbase %0" : : "r" (vmx->msr_guest_kernel_gs_base));
+	asm volatile("swapgs");
+	local_irq_enable();
 #endif
 	if (boot_cpu_has(X86_FEATURE_MPX))
 		rdmsrl(MSR_IA32_BNDCFGS, vmx->host_state.msr_host_bndcfgs);
@@ -2152,14 +2155,20 @@ static void vmx_save_host_state(struct kvm_vcpu *vcpu)
 
 static void __vmx_load_host_state(struct vcpu_vmx *vmx)
 {
+	unsigned long flags;
+
 	if (!vmx->host_state.loaded)
 		return;
 
 	++vmx->vcpu.stat.host_state_reload;
 	vmx->host_state.loaded = 0;
 #ifdef CONFIG_X86_64
+	local_irq_save(flags);
+	asm("swapgs");
 	if (is_long_mode(&vmx->vcpu))
-		rdmsrl(MSR_KERNEL_GS_BASE, vmx->msr_guest_kernel_gs_base);
+		asm volatile("rdgsbase %0" : "=r" (vmx->msr_guest_kernel_gs_base));
+	asm volatile("wrgsbase %0; swapgs" : : "r" (vmx->msr_host_kernel_gs_base));
+	local_irq_restore(flags);
 #endif
 	if (vmx->host_state.gs_ldt_reload_needed) {
 		kvm_load_ldt(vmx->host_state.ldt_sel);
@@ -2177,10 +2186,7 @@ static void __vmx_load_host_state(struct vcpu_vmx *vmx)
 		loadsegment(es, vmx->host_state.es_sel);
 	}
 #endif
-	reload_tss();
-#ifdef CONFIG_X86_64
-	wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_host_kernel_gs_base);
-#endif
+	//reload_tss();
 	if (vmx->host_state.msr_host_bndcfgs)
 		wrmsrl(MSR_IA32_BNDCFGS, vmx->host_state.msr_host_bndcfgs);
 	load_gdt(this_cpu_ptr(&host_gdt));
@@ -3469,6 +3475,7 @@ static int hardware_enable(void)
 		wrmsrl(MSR_IA32_FEATURE_CONTROL, old | test_bits);
 	}
 	cr4_set_bits(X86_CR4_VMXE);
+	cr4_set_bits(X86_CR4_FSGSBASE);
 
 	if (vmm_exclusive) {
 		kvm_cpu_vmxon(phys_addr);

> I propose that we rework this.  Add a percpu variable that indicates
> whether the TSS limit needs to be refreshed.  On task switch, if the
> new task has TIF_IO_BITMAP set, then check that flag and, if set,
> refresh TR and clear the flag.  On VMX exit, set the flag.
> 
> The TSS limit is (phew!) invisible to userspace, so we don't have ABI
> issues to worry about here.  We also shouldn't have security issues
> because a too-low TSS limit just results in unprivileged IO operations
> generating #GP, which is exactly what we want.
> 
> What do you all think?  I expect a speedup of a couple hundred cycles
> on each VM exit.


* Re: RFC: Getting rid of LTR in VMX
  2017-02-20 11:05 ` Paolo Bonzini
@ 2017-02-20 16:46   ` Andy Lutomirski
  2017-02-20 16:51     ` Paolo Bonzini
  0 siblings, 1 reply; 8+ messages in thread
From: Andy Lutomirski @ 2017-02-20 16:46 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm list, linux-kernel, X86 ML, H. Peter Anvin, Borislav Petkov

On Mon, Feb 20, 2017 at 3:05 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 18/02/2017 04:29, Andy Lutomirski wrote:
>> There's no code here because the patch is trivial, but I want to run
>> the idea by you all first to see if there are any issues.
>>
>> VMX is silly and forces the TSS limit to the minimum on VM exits.  KVM
>> wastes lots of cycles bumping it back up to accommodate the io bitmap.
>
> Actually looked at the code now...
>
> reload_tss is only invoked for userspace exits, so it is a nice-to-have
> but it wouldn't show on most workloads.  Still it does save 150-200
> clock cycles to remove it (I just commented out reload_tss() from
> __vmx_load_host_state to test).

That's for anything involving userspace or preemption, right?

>
> Another 100-150 could be saved if we could just use rdgsbase/wrgsbase,
> instead of rdmsr/wrmsr, to read and write the kernel GS.  Super hacky
> patch after sig.

I have a Real Patch Series (tm) to do that, but it has a couple of
unresolved corner cases so far.

That being said, vmx_save_host_state() is, um, poorly optimized.  I'll
try to find some time to fix the obvious things.  Meanwhile, I'll send
real patches for TR.

> +       cr4_set_bits(X86_CR4_FSGSBASE);

Nice root hole :-p

--Andy


* Re: RFC: Getting rid of LTR in VMX
  2017-02-20 16:46   ` Andy Lutomirski
@ 2017-02-20 16:51     ` Paolo Bonzini
  2017-02-20 16:55       ` Andy Lutomirski
  0 siblings, 1 reply; 8+ messages in thread
From: Paolo Bonzini @ 2017-02-20 16:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: kvm list, linux-kernel, X86 ML, H. Peter Anvin, Borislav Petkov



On 20/02/2017 17:46, Andy Lutomirski wrote:
> On Mon, Feb 20, 2017 at 3:05 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>>
>> On 18/02/2017 04:29, Andy Lutomirski wrote:
>>> There's no code here because the patch is trivial, but I want to run
>>> the idea by you all first to see if there are any issues.
>>>
>>> VMX is silly and forces the TSS limit to the minimum on VM exits.  KVM
>>> wastes lots of cycles bumping it back up to accommodate the io bitmap.
>>
>> Actually looked at the code now...
>>
>> reload_tss is only invoked for userspace exits, so it is a nice-to-have
>> but it wouldn't show on most workloads.  Still it does save 150-200
>> clock cycles to remove it (I just commented out reload_tss() from
>> __vmx_load_host_state to test).
> 
> That's for anything involving userspace or preemption, right?

Yes.  But 150-200 clock cycles are nothing compared to the cache misses
you get from preemption, so I'd ignore that.  Saving 300 clock cycles on
userspace exits from TR+GSBASE would be about 5% on my Haswell.

> That being said, vmx_save_host_state() is, um, poorly optimized.

Yeah, but again it doesn't run that often in practical cases.

Paolo


* Re: RFC: Getting rid of LTR in VMX
  2017-02-20 16:51     ` Paolo Bonzini
@ 2017-02-20 16:55       ` Andy Lutomirski
  2017-02-20 22:02         ` Paolo Bonzini
  0 siblings, 1 reply; 8+ messages in thread
From: Andy Lutomirski @ 2017-02-20 16:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm list, linux-kernel, X86 ML, H. Peter Anvin, Borislav Petkov

On Mon, Feb 20, 2017 at 8:51 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 20/02/2017 17:46, Andy Lutomirski wrote:
>> On Mon, Feb 20, 2017 at 3:05 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>
>>>
>>> On 18/02/2017 04:29, Andy Lutomirski wrote:
>>>> There's no code here because the patch is trivial, but I want to run
>>>> the idea by you all first to see if there are any issues.
>>>>
>>>> VMX is silly and forces the TSS limit to the minimum on VM exits.  KVM
>>>> wastes lots of cycles bumping it back up to accommodate the io bitmap.
>>>
>>> Actually looked at the code now...
>>>
>>> reload_tss is only invoked for userspace exits, so it is a nice-to-have
>>> but it wouldn't show on most workloads.  Still it does save 150-200
>>> clock cycles to remove it (I just commented out reload_tss() from
>>> __vmx_load_host_state to test).
>>
>> That's for anything involving userspace or preemption, right?
>
> Yes.  But 150-200 clock cycles are nothing compared to the cache misses
> you get from preemption, so I'd ignore that.  Saving 300 clock cycles on
> userspace exits from TR+GSBASE would be about 5% on my Haswell.

That's still 5% :)

>
>> That being said, vmx_save_host_state() is, um, poorly optimized.
>
> Yeah, but again it doesn't run that often in practical cases.

But it will if someone ever tries to upstream Google's approach of not
emulating anything in the kernel. :)


* Re: RFC: Getting rid of LTR in VMX
  2017-02-20 16:55       ` Andy Lutomirski
@ 2017-02-20 22:02         ` Paolo Bonzini
  2017-02-20 22:23           ` hpa
  0 siblings, 1 reply; 8+ messages in thread
From: Paolo Bonzini @ 2017-02-20 22:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: kvm list, linux-kernel, X86 ML, H. Peter Anvin, Borislav Petkov


> > Yes.  But 150-200 clock cycles are nothing compared to the cache misses
> > you get from preemption, so I'd ignore that.  Saving 300 clock cycles on
> > userspace exits from TR+GSBASE would be about 5% on my Haswell.
> 
> That's still 5% :)

Yes, 5% on userspace exits is good (though they're rare).

OTOH, 300 clock cycles on preemption are nothing to write home about.

Paolo


* Re: RFC: Getting rid of LTR in VMX
  2017-02-20 22:02         ` Paolo Bonzini
@ 2017-02-20 22:23           ` hpa
  0 siblings, 0 replies; 8+ messages in thread
From: hpa @ 2017-02-20 22:23 UTC (permalink / raw)
  To: Paolo Bonzini, Andy Lutomirski
  Cc: kvm list, linux-kernel, X86 ML, Borislav Petkov

On February 20, 2017 2:02:53 PM PST, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>> > Yes.  But 150-200 clock cycles are nothing compared to the cache misses
>> > you get from preemption, so I'd ignore that.  Saving 300 clock cycles on
>> > userspace exits from TR+GSBASE would be about 5% on my Haswell.
>> 
>> That's still 5% :)
>
>Yes, 5% on userspace exits is good (though they're rare).
>
>OTOH, 300 clock cycles on preemption are nothing to write home about.
>
>Paolo

Oh, I would say it is... especially when it's basically free.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
