linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Performance overhead of get_cycles_sync
@ 2007-12-11 13:11 Dor Laor
  2007-12-11 13:37 ` Ingo Molnar
  2007-12-11 14:27 ` Andi Kleen
  0 siblings, 2 replies; 14+ messages in thread
From: Dor Laor @ 2007-12-11 13:11 UTC (permalink / raw)
  To: mingo, tglx; +Cc: Linux Kernel Mailing List, kvm-devel

Hi Ingo, Thomas,

In the latest kernel (2.6.24-rc3) I noticed a drastic performance 
decrease for KVM networking.
The reason is many vmexit (exit reason is cpuid instruction) caused by 
calls to gettimeofday that uses tsc sourceclock.
read_tsc calls get_cycles_sync which might call cpuid in order to 
serialize the cpu.

Can you explain why the cpu needs to be serialized for every gettime call?
Do we need to be that accurate? (It will also slightly improve physical 
hosts).
I believe you have a reason and the answer is yes. In that case can you 
replace the serializing instruction
with an instruction that does not trigger vmexit? Maybe use 'ltr' for 
example?

Regards,
Dor.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Performance overhead of get_cycles_sync
  2007-12-11 13:11 Performance overhead of get_cycles_sync Dor Laor
@ 2007-12-11 13:37 ` Ingo Molnar
  2007-12-11 14:14   ` Dor Laor
       [not found]   ` <475E9A92.4030001@qumranet.com>
  2007-12-11 14:27 ` Andi Kleen
  1 sibling, 2 replies; 14+ messages in thread
From: Ingo Molnar @ 2007-12-11 13:37 UTC (permalink / raw)
  To: dor.laor; +Cc: tglx, Linux Kernel Mailing List, kvm-devel


* Dor Laor <dor.laor@gmail.com> wrote:

> Hi Ingo, Thomas,
>
> In the latest kernel (2.6.24-rc3) I noticed a drastic performance 
> decrease for KVM networking. The reason is many vmexit (exit reason is 
> cpuid instruction) caused by calls to gettimeofday that uses tsc 
> sourceclock. read_tsc calls get_cycles_sync which might call cpuid in 
> order to serialize the cpu.
>
> Can you explain why the cpu needs to be serialized for every gettime 
> call? Do we need to be that accurate? (It will also slightly improve 
> physical hosts). I believe you have a reason and the answer is yes. In 
> that case can you replace the serializing instruction with an 
> instruction that does not trigger vmexit? Maybe use 'ltr' for example?

hm, where exactly does it call CPUID?

	Ingo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Performance overhead of get_cycles_sync
  2007-12-11 13:37 ` Ingo Molnar
@ 2007-12-11 14:14   ` Dor Laor
       [not found]   ` <475E9A92.4030001@qumranet.com>
  1 sibling, 0 replies; 14+ messages in thread
From: Dor Laor @ 2007-12-11 14:14 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: tglx, Linux Kernel Mailing List, kvm-devel

Ingo Molnar wrote:
>
> * Dor Laor <dor.laor@gmail.com> wrote:
>
> > Hi Ingo, Thomas,
> >
> > In the latest kernel (2.6.24-rc3) I noticed a drastic performance
> > decrease for KVM networking. The reason is many vmexit (exit reason is
> > cpuid instruction) caused by calls to gettimeofday that uses tsc
> > sourceclock. read_tsc calls get_cycles_sync which might call cpuid in
> > order to serialize the cpu.
> >
> > Can you explain why the cpu needs to be serialized for every gettime
> > call? Do we need to be that accurate? (It will also slightly improve
> > physical hosts). I believe you have a reason and the answer is yes. In
> > that case can you replace the serializing instruction with an
> > instruction that does not trigger vmexit? Maybe use 'ltr' for example?
>
> hm, where exactly does it call CPUID?
>
>         Ingo
>
Here, commented out [include/asm-x86/tsc.h]:
/* Like get_cycles, but make sure the CPU is synchronized. */
static __always_inline cycles_t get_cycles_sync(void)
{
    unsigned long long ret;
    unsigned eax, edx;

    /*
       * Use RDTSCP if possible; it is guaranteed to be synchronous
      * and doesn't cause a VMEXIT on Hypervisors
     */
    alternative_io(ASM_NOP3, ".byte 0x0f,0x01,0xf9", X86_FEATURE_RDTSCP,
               ASM_OUTPUT2("=a" (eax), "=d" (edx)),
               "a" (0U), "d" (0U) : "ecx", "memory");
    ret = (((unsigned long long)edx) << 32) | ((unsigned long long)eax);
    if (ret)
        return ret;

    /*
     * Don't do an additional sync on CPUs where we know
     * RDTSC is already synchronous:
     */
//    alternative_io("cpuid", ASM_NOP2, X86_FEATURE_SYNC_RDTSC,
//              "=a" (eax), "0" (1) : "ebx","ecx","edx","memory");
    rdtscll(ret);

    return ret;
}


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Performance overhead of get_cycles_sync
  2007-12-11 13:11 Performance overhead of get_cycles_sync Dor Laor
  2007-12-11 13:37 ` Ingo Molnar
@ 2007-12-11 14:27 ` Andi Kleen
  2007-12-11 14:57   ` Dor Laor
  1 sibling, 1 reply; 14+ messages in thread
From: Andi Kleen @ 2007-12-11 14:27 UTC (permalink / raw)
  To: dor.laor, linux-kernel, tglx


[headers rewritten because of gmane crosspost breakage]

> In the latest kernel (2.6.24-rc3) I noticed a drastic performance 
> decrease for KVM networking.

That should not have changed for quite some time.

Also it depends on the CPU of course.

> The reason is many vmexit (exit reason is cpuid instruction) caused by
> calls to gettimeofday that uses tsc sourceclock.
> read_tsc calls get_cycles_sync which might call cpuid in order to 
> serialize the cpu.
>
> Can you explain why the cpu needs to be serialized for every gettime call?

Otherwise RDTSC can be speculated around and happen outside the protection
of the seqlock and that can sometimes lead to non monotonic time reporting.

Anyways after a lot of discussions it turns out there are ways to archive
this without CPUID and there is a solution implemented for this in ff
tree which I will submit for .25. It's a little complicated though
and not a quick fix.

> Do we need to be that accurate? (It will also slightly improve physical 
> hosts).
> I believe you have a reason and the answer is yes. In that case can you 
> replace the serializing instruction
> with an instruction that does not trigger vmexit? Maybe use 'ltr' for 
> example?

ltr doesn't synchronize RDTSC.

-Andi

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Performance overhead of get_cycles_sync
       [not found]   ` <475E9A92.4030001@qumranet.com>
@ 2007-12-11 14:27     ` Ingo Molnar
  2007-12-11 15:03       ` Dor Laor
                         ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Ingo Molnar @ 2007-12-11 14:27 UTC (permalink / raw)
  To: dor.laor; +Cc: tglx, Linux Kernel Mailing List, kvm-devel


* Dor Laor <dor.laor@gmail.com> wrote:

> Here [include/asm-x86/tsc.h]:
>
> /* Like get_cycles, but make sure the CPU is synchronized. */
> static __always_inline cycles_t get_cycles_sync(void)
> {
>    unsigned long long ret;
>    unsigned eax, edx;
>
>    /*
>       * Use RDTSCP if possible; it is guaranteed to be synchronous
>      * and doesn't cause a VMEXIT on Hypervisors
>     */
>    alternative_io(ASM_NOP3, ".byte 0x0f,0x01,0xf9", X86_FEATURE_RDTSCP,
>               ASM_OUTPUT2("=a" (eax), "=d" (edx)),
>               "a" (0U), "d" (0U) : "ecx", "memory");
>    ret = (((unsigned long long)edx) << 32) | ((unsigned long long)eax);
>    if (ret)
>        return ret;
>
>    /*
>     * Don't do an additional sync on CPUs where we know
>     * RDTSC is already synchronous:
>     */
> //    alternative_io("cpuid", ASM_NOP2, X86_FEATURE_SYNC_RDTSC,
> //              "=a" (eax), "0" (1) : "ebx","ecx","edx","memory");
>    rdtscll(ret);

The patch below should resolve this - could you please test and Ack it? 
But this CPUID was present in v2.6.23 too, so why did it only show up in 
2.6.24-rc for you?

	Ingo

-------------->
Subject: x86: fix get_cycles_sync() overhead
From: Ingo Molnar <mingo@elte.hu>

get_cycles_sync() is causing massive overhead in KVM networking:

   http://lkml.org/lkml/2007/12/11/54

remove the explicit CPUID serialization - it causes VM exits and is
pointless: we care about GTOD coherency but that goes to user-space
via a syscall, and syscalls are serialization points anyway.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/asm-x86/tsc.h |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

Index: linux-x86.q/include/asm-x86/tsc.h
===================================================================
--- linux-x86.q.orig/include/asm-x86/tsc.h
+++ linux-x86.q/include/asm-x86/tsc.h
@@ -39,8 +39,8 @@ static __always_inline cycles_t get_cycl
 	unsigned eax, edx;
 
 	/*
-  	 * Use RDTSCP if possible; it is guaranteed to be synchronous
- 	 * and doesn't cause a VMEXIT on Hypervisors
+	 * Use RDTSCP if possible; it is guaranteed to be synchronous
+	 * and doesn't cause a VMEXIT on Hypervisors
 	 */
 	alternative_io(ASM_NOP3, ".byte 0x0f,0x01,0xf9", X86_FEATURE_RDTSCP,
 		       ASM_OUTPUT2("=a" (eax), "=d" (edx)),
@@ -50,11 +50,11 @@ static __always_inline cycles_t get_cycl
 		return ret;
 
 	/*
-	 * Don't do an additional sync on CPUs where we know
-	 * RDTSC is already synchronous:
+	 * Use RDTSC on other CPUs. This might not be fully synchronous,
+	 * but it's not a problem: the only coherency we care about is
+	 * the GTOD output to user-space, and syscalls are synchronization
+	 * points anyway:
 	 */
-	alternative_io("cpuid", ASM_NOP2, X86_FEATURE_SYNC_RDTSC,
-			  "=a" (eax), "0" (1) : "ebx","ecx","edx","memory");
 	rdtscll(ret);
 
 	return ret;

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Performance overhead of get_cycles_sync
  2007-12-11 14:27 ` Andi Kleen
@ 2007-12-11 14:57   ` Dor Laor
  2007-12-11 15:02     ` Andi Kleen
  0 siblings, 1 reply; 14+ messages in thread
From: Dor Laor @ 2007-12-11 14:57 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, tglx

Andi Kleen wrote:
> [headers rewritten because of gmane crosspost breakage]
>
>   
>> In the latest kernel (2.6.24-rc3) I noticed a drastic performance 
>> decrease for KVM networking.
>>     
>
> That should not have changed for quite some time.
>
> Also it depends on the CPU of course.
>   
I didn't find the exact place of the change but using fedora 2.6.23-8 
there is no problem.
3aefbe0746580a710d4392a884ac1e4aac7c728f turn X86_FEATURE_SYNC_RDTSC  
off for most
intel cpus, but it was committed in May.

>   
>> The reason is many vmexit (exit reason is cpuid instruction) caused by
>> calls to gettimeofday that uses tsc sourceclock.
>> read_tsc calls get_cycles_sync which might call cpuid in order to 
>> serialize the cpu.
>>
>> Can you explain why the cpu needs to be serialized for every gettime call?
>>     
>
> Otherwise RDTSC can be speculated around and happen outside the protection
> of the seqlock and that can sometimes lead to non monotonic time reporting.
>   
What about moving the result into memory and calling mb() instead?
> Anyways after a lot of discussions it turns out there are ways to archive
> this without CPUID and there is a solution implemented for this in ff
> tree which I will submit for .25. It's a little complicated though
> and not a quick fix.
>
>   
>> Do we need to be that accurate? (It will also slightly improve physical 
>> hosts).
>> I believe you have a reason and the answer is yes. In that case can you 
>> replace the serializing instruction
>> with an instruction that does not trigger vmexit? Maybe use 'ltr' for 
>> example?
>>     
>
> ltr doesn't synchronize RDTSC.
>
>   
According to Intel spec it is a serializing instruction along with cpuid 
and others.
> -Andi
>
>   


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Performance overhead of get_cycles_sync
  2007-12-11 14:57   ` Dor Laor
@ 2007-12-11 15:02     ` Andi Kleen
  0 siblings, 0 replies; 14+ messages in thread
From: Andi Kleen @ 2007-12-11 15:02 UTC (permalink / raw)
  To: dor.laor; +Cc: Andi Kleen, linux-kernel, tglx

On Tue, Dec 11, 2007 at 04:57:16PM +0200, Dor Laor wrote:
> Andi Kleen wrote:
> >[headers rewritten because of gmane crosspost breakage]
> >
> >  
> >>In the latest kernel (2.6.24-rc3) I noticed a drastic performance 
> >>decrease for KVM networking.
> >>    
> >
> >That should not have changed for quite some time.
> >
> >Also it depends on the CPU of course.
> >  
> I didn't find the exact place of the change but using fedora 2.6.23-8 
> there is no problem.
> 3aefbe0746580a710d4392a884ac1e4aac7c728f turn X86_FEATURE_SYNC_RDTSC  
> off for most
> intel cpus, but it was committed in May.

Ah there was a change to enable it on Core2. On AMD it was 
always there except on CPUs with RDTSCP

Anyways the very likely right way (I'm still waiting for final
confirmation from Intel) to solve this is to use LFENCE which
happens to stop RDTSC and is relatively light weight and won't
cause vmexits. Unfortunately that doesn't work on AMD, but 
there MFENCE happens to work. But there are more problems
in this code which I have a (.24) patchkit to fix.

> According to Intel spec it is a serializing instruction along with cpuid 
> and others.

You're right. The real roblem is that it is ring 0 only and that can
be executed in ring 3. That is why Ingo's patch was wrong too.

-Andi

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Performance overhead of get_cycles_sync
  2007-12-11 14:27     ` Ingo Molnar
@ 2007-12-11 15:03       ` Dor Laor
  2007-12-11 16:35       ` Arjan van de Ven
  2007-12-11 21:26       ` [kvm-devel] " Joerg Roedel
  2 siblings, 0 replies; 14+ messages in thread
From: Dor Laor @ 2007-12-11 15:03 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: tglx, Linux Kernel Mailing List, kvm-devel

Ingo Molnar wrote:
> * Dor Laor <dor.laor@gmail.com> wrote:
>
>   
>> Here [include/asm-x86/tsc.h]:
>>
>> /* Like get_cycles, but make sure the CPU is synchronized. */
>> static __always_inline cycles_t get_cycles_sync(void)
>> {
>>    unsigned long long ret;
>>    unsigned eax, edx;
>>
>>    /*
>>       * Use RDTSCP if possible; it is guaranteed to be synchronous
>>      * and doesn't cause a VMEXIT on Hypervisors
>>     */
>>    alternative_io(ASM_NOP3, ".byte 0x0f,0x01,0xf9", X86_FEATURE_RDTSCP,
>>               ASM_OUTPUT2("=a" (eax), "=d" (edx)),
>>               "a" (0U), "d" (0U) : "ecx", "memory");
>>    ret = (((unsigned long long)edx) << 32) | ((unsigned long long)eax);
>>    if (ret)
>>        return ret;
>>
>>    /*
>>     * Don't do an additional sync on CPUs where we know
>>     * RDTSC is already synchronous:
>>     */
>> //    alternative_io("cpuid", ASM_NOP2, X86_FEATURE_SYNC_RDTSC,
>> //              "=a" (eax), "0" (1) : "ebx","ecx","edx","memory");
>>    rdtscll(ret);
>>     
>
> The patch below should resolve this - could you please test and Ack it? 
>   
It works, actually I already commented it out.

Acked-by: Dor Laor <dor.laor@qumranet.com>

But this CPUID was present in v2.6.23 too, so why did it only show up in
> 2.6.24-rc for you?
>
>   
I tried to figure out but all the code movements for i386 go in the way.
In the previous email I reported to Andi that Fedora kernel 2.6.23-8 did 
not suffer from it.
Thanks for the ultra fast reply :)
Dor
> 	Ingo
>
> -------------->
> Subject: x86: fix get_cycles_sync() overhead
> From: Ingo Molnar <mingo@elte.hu>
>
> get_cycles_sync() is causing massive overhead in KVM networking:
>
>    http://lkml.org/lkml/2007/12/11/54
>
> remove the explicit CPUID serialization - it causes VM exits and is
> pointless: we care about GTOD coherency but that goes to user-space
> via a syscall, and syscalls are serialization points anyway.
>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  include/asm-x86/tsc.h |   12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> Index: linux-x86.q/include/asm-x86/tsc.h
> ===================================================================
> --- linux-x86.q.orig/include/asm-x86/tsc.h
> +++ linux-x86.q/include/asm-x86/tsc.h
> @@ -39,8 +39,8 @@ static __always_inline cycles_t get_cycl
>  	unsigned eax, edx;
>  
>  	/*
> -  	 * Use RDTSCP if possible; it is guaranteed to be synchronous
> - 	 * and doesn't cause a VMEXIT on Hypervisors
> +	 * Use RDTSCP if possible; it is guaranteed to be synchronous
> +	 * and doesn't cause a VMEXIT on Hypervisors
>  	 */
>  	alternative_io(ASM_NOP3, ".byte 0x0f,0x01,0xf9", X86_FEATURE_RDTSCP,
>  		       ASM_OUTPUT2("=a" (eax), "=d" (edx)),
> @@ -50,11 +50,11 @@ static __always_inline cycles_t get_cycl
>  		return ret;
>  
>  	/*
> -	 * Don't do an additional sync on CPUs where we know
> -	 * RDTSC is already synchronous:
> +	 * Use RDTSC on other CPUs. This might not be fully synchronous,
> +	 * but it's not a problem: the only coherency we care about is
> +	 * the GTOD output to user-space, and syscalls are synchronization
> +	 * points anyway:
>  	 */
> -	alternative_io("cpuid", ASM_NOP2, X86_FEATURE_SYNC_RDTSC,
> -			  "=a" (eax), "0" (1) : "ebx","ecx","edx","memory");
>  	rdtscll(ret);
>  
>  	return ret;
>
>   


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Performance overhead of get_cycles_sync
  2007-12-11 14:27     ` Ingo Molnar
  2007-12-11 15:03       ` Dor Laor
@ 2007-12-11 16:35       ` Arjan van de Ven
  2007-12-11 17:03         ` Ingo Molnar
  2007-12-11 21:26       ` [kvm-devel] " Joerg Roedel
  2 siblings, 1 reply; 14+ messages in thread
From: Arjan van de Ven @ 2007-12-11 16:35 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: dor.laor, tglx, Linux Kernel Mailing List, kvm-devel

On Tue, 11 Dec 2007 15:27:17 +0100
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Dor Laor <dor.laor@gmail.com> wrote:
> 
> 
> The patch below should resolve this - could you please test and Ack
> it? But this CPUID was present in v2.6.23 too, so why did it only
> show up in 2.6.24-rc for you?

isn't this probably wrong since this code is also used in the vsyscall code..

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Performance overhead of get_cycles_sync
  2007-12-11 16:35       ` Arjan van de Ven
@ 2007-12-11 17:03         ` Ingo Molnar
  2007-12-11 17:23           ` Andi Kleen
  0 siblings, 1 reply; 14+ messages in thread
From: Ingo Molnar @ 2007-12-11 17:03 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: dor.laor, tglx, Linux Kernel Mailing List, kvm-devel


* Arjan van de Ven <arjan@infradead.org> wrote:

> > * Dor Laor <dor.laor@gmail.com> wrote:
> > 
> > The patch below should resolve this - could you please test and Ack 
> > it? But this CPUID was present in v2.6.23 too, so why did it only 
> > show up in 2.6.24-rc for you?
> 
> isn't this probably wrong since this code is also used in the vsyscall 
> code..

the TSC clocksource (and hence the vsyscall code) is turned off on 
systems that fail the TOD/CLOCK portion of this test:

  http://people.redhat.com/mingo/time-warp-test/time-warp-test.c

i.e. on the majority of systems in place.

	Ingo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Performance overhead of get_cycles_sync
  2007-12-11 17:03         ` Ingo Molnar
@ 2007-12-11 17:23           ` Andi Kleen
  2007-12-11 20:19             ` Ingo Molnar
  0 siblings, 1 reply; 14+ messages in thread
From: Andi Kleen @ 2007-12-11 17:23 UTC (permalink / raw)
  To: mingo; +Cc: linux-kernel

Ingo Molnar <mingo-X9Un+BFzKDI@public.gmane.org> writes:

> the TSC clocksource (and hence the vsyscall code) is turned off on 
> systems that fail the TOD/CLOCK portion of this test:

Which is not on core2 which was the question about. And if it was
turned off it wouldn't use get_cycles_sync() at all.

-Andi

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Performance overhead of get_cycles_sync
  2007-12-11 17:23           ` Andi Kleen
@ 2007-12-11 20:19             ` Ingo Molnar
  2007-12-11 20:29               ` Ingo Molnar
  0 siblings, 1 reply; 14+ messages in thread
From: Ingo Molnar @ 2007-12-11 20:19 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel


* Andi Kleen <ak@suse.de> wrote:

> Ingo Molnar <mingo-X9Un+BFzKDI@public.gmane.org> writes:
> 
> > the TSC clocksource (and hence the vsyscall code) is turned off on 
> > systems that fail the TOD/CLOCK portion of this test:
> 
> Which is not on core2 which was the question about. And if it was 
> turned off it wouldn't use get_cycles_sync() at all.

it is turned off on core2 too:

 # cat /sys/devices/system/clocksource/clocksource0/current_clocksource
 acpi_pm

	Ingo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Performance overhead of get_cycles_sync
  2007-12-11 20:19             ` Ingo Molnar
@ 2007-12-11 20:29               ` Ingo Molnar
  0 siblings, 0 replies; 14+ messages in thread
From: Ingo Molnar @ 2007-12-11 20:29 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, Thomas Gleixner, Arjan van de Ven, dor.laor, kvm-devel


* Ingo Molnar <mingo@elte.hu> wrote:

> > Which is not on core2 which was the question about. And if it was 
> > turned off it wouldn't use get_cycles_sync() at all.
> 
> it is turned off on core2 too:
> 
>  # cat /sys/devices/system/clocksource/clocksource0/current_clocksource
>  acpi_pm

but you are right that it's not turned off on all core2's, so my patch 
is wrong for them.

	Ingo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [kvm-devel] Performance overhead of get_cycles_sync
  2007-12-11 14:27     ` Ingo Molnar
  2007-12-11 15:03       ` Dor Laor
  2007-12-11 16:35       ` Arjan van de Ven
@ 2007-12-11 21:26       ` Joerg Roedel
  2 siblings, 0 replies; 14+ messages in thread
From: Joerg Roedel @ 2007-12-11 21:26 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andi Kleen, dor.laor, kvm-devel, Linux Kernel Mailing List

On Tue, Dec 11, 2007 at 03:27:17PM +0100, Ingo Molnar wrote:
> 
> * Dor Laor <dor.laor@gmail.com> wrote:
> 
> > Here [include/asm-x86/tsc.h]:
> >
> > /* Like get_cycles, but make sure the CPU is synchronized. */
> > static __always_inline cycles_t get_cycles_sync(void)
> > {
> >    unsigned long long ret;
> >    unsigned eax, edx;
> >
> >    /*
> >       * Use RDTSCP if possible; it is guaranteed to be synchronous
> >      * and doesn't cause a VMEXIT on Hypervisors
> >     */
> >    alternative_io(ASM_NOP3, ".byte 0x0f,0x01,0xf9", X86_FEATURE_RDTSCP,
> >               ASM_OUTPUT2("=a" (eax), "=d" (edx)),
> >               "a" (0U), "d" (0U) : "ecx", "memory");
> >    ret = (((unsigned long long)edx) << 32) | ((unsigned long long)eax);
> >    if (ret)
> >        return ret;
> >
> >    /*
> >     * Don't do an additional sync on CPUs where we know
> >     * RDTSC is already synchronous:
> >     */
> > //    alternative_io("cpuid", ASM_NOP2, X86_FEATURE_SYNC_RDTSC,
> > //              "=a" (eax), "0" (1) : "ebx","ecx","edx","memory");
> >    rdtscll(ret);
> 
> The patch below should resolve this - could you please test and Ack it? 
> But this CPUID was present in v2.6.23 too, so why did it only show up in 
> 2.6.24-rc for you?
> 
> 	Ingo
> 
> -------------->
> Subject: x86: fix get_cycles_sync() overhead
> From: Ingo Molnar <mingo@elte.hu>
> 
> get_cycles_sync() is causing massive overhead in KVM networking:
> 
>    http://lkml.org/lkml/2007/12/11/54
> 
> remove the explicit CPUID serialization - it causes VM exits and is
> pointless: we care about GTOD coherency but that goes to user-space
> via a syscall, and syscalls are serialization points anyway.
> 
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  include/asm-x86/tsc.h |   12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> Index: linux-x86.q/include/asm-x86/tsc.h
> ===================================================================
> --- linux-x86.q.orig/include/asm-x86/tsc.h
> +++ linux-x86.q/include/asm-x86/tsc.h
> @@ -39,8 +39,8 @@ static __always_inline cycles_t get_cycl
>  	unsigned eax, edx;
>  
>  	/*
> -  	 * Use RDTSCP if possible; it is guaranteed to be synchronous
> - 	 * and doesn't cause a VMEXIT on Hypervisors
> +	 * Use RDTSCP if possible; it is guaranteed to be synchronous
> +	 * and doesn't cause a VMEXIT on Hypervisors
>  	 */
>  	alternative_io(ASM_NOP3, ".byte 0x0f,0x01,0xf9", X86_FEATURE_RDTSCP,
>  		       ASM_OUTPUT2("=a" (eax), "=d" (edx)),
> @@ -50,11 +50,11 @@ static __always_inline cycles_t get_cycl
>  		return ret;
>  
>  	/*
> -	 * Don't do an additional sync on CPUs where we know
> -	 * RDTSC is already synchronous:
> +	 * Use RDTSC on other CPUs. This might not be fully synchronous,
> +	 * but it's not a problem: the only coherency we care about is
> +	 * the GTOD output to user-space, and syscalls are synchronization
> +	 * points anyway:
>  	 */
> -	alternative_io("cpuid", ASM_NOP2, X86_FEATURE_SYNC_RDTSC,
> -			  "=a" (eax), "0" (1) : "ebx","ecx","edx","memory");
>  	rdtscll(ret);
>  
>  	return ret;

I don't think this is a good idea. I discussed exactly this item with
Andi Kleen a while ago and afair the serializing instruction was
necessary to fix a backwards walking gettimeofday() on some K8
revisions. Andi Kleen can tell more details, I added him to the CC list.

Joerg

-- 
           |           AMD Saxony Limited Liability Company & Co. KG
 Operating |         Wilschdorfer Landstr. 101, 01109 Dresden, Germany
 System    |                  Register Court Dresden: HRA 4896
 Research  |              General Partner authorized to represent:
 Center    |             AMD Saxony LLC (Wilmington, Delaware, US)
           | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy



^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2007-12-11 21:30 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-12-11 13:11 Performance overhead of get_cycles_sync Dor Laor
2007-12-11 13:37 ` Ingo Molnar
2007-12-11 14:14   ` Dor Laor
     [not found]   ` <475E9A92.4030001@qumranet.com>
2007-12-11 14:27     ` Ingo Molnar
2007-12-11 15:03       ` Dor Laor
2007-12-11 16:35       ` Arjan van de Ven
2007-12-11 17:03         ` Ingo Molnar
2007-12-11 17:23           ` Andi Kleen
2007-12-11 20:19             ` Ingo Molnar
2007-12-11 20:29               ` Ingo Molnar
2007-12-11 21:26       ` [kvm-devel] " Joerg Roedel
2007-12-11 14:27 ` Andi Kleen
2007-12-11 14:57   ` Dor Laor
2007-12-11 15:02     ` Andi Kleen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).