From: Andy Lutomirski
To: x86@kernel.org
Cc: Thomas Gleixner, Ingo Molnar, Andi Kleen, linux-kernel@vger.kernel.org,
	Andy Lutomirski
Subject: [RFT/PATCH v2 2/6] x86-64: Optimize vread_tsc's barriers
Date: Wed, 6 Apr 2011 22:03:59 -0400
Message-Id: <80b43d57d15f7b141799a7634274ee3bfe5a5855.1302137785.git.luto@mit.edu>
X-Mailer: git-send-email 1.7.4

RDTSC is completely unordered on modern Intel and AMD CPUs.  The Intel
manual says that lfence;rdtsc causes all previous instructions to
complete before the tsc is read, and the AMD manual says to use
mfence;rdtsc to do the same thing.

We want a stronger guarantee, though: we want the tsc to be read
before any memory access that occurs after the call to vclock_gettime
(or vgettimeofday).  We currently guarantee that with a second lfence
or mfence.  This sequence is not really supported by the manual
(AFAICT) and it's also slow.

This patch changes the rdtsc to use implicit memory ordering instead
of the second fence.  The sequence looks like this:

	{l,m}fence
	rdtsc
	mov [something dependent on edx],[tmp]
	return [some function of tmp]

This means that the time stamp has to be read before the load, and the
return value depends on tmp.  All x86-64 chips guarantee that no
memory access after a load moves before that load.  This means that
all memory access after vread_tsc occurs after the time stamp is read.

The trick is that the answer should not actually change as a result of
the sneaky memory access.  I accomplish this by shifting rdx left by
32 bits, twice, to generate the number zero.  (I can't imagine that
any CPU can break that dependency.)  Then I use "zero" as an offset to
a memory access that we had to do anyway.

On Sandy Bridge (i7-2600), this improves a loop of
clock_gettime(CLOCK_MONOTONIC) by 5 ns/iter (from ~22.7 to ~17.7).
time-warp-test still passes.
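(Illustration only, not part of the patch: a minimal user-space sketch of
the same trick, assuming an Intel-style lfence barrier.  The names
shared_last and read_tsc_ordered are invented for the example; in the
kernel the load that absorbs the dependent "zero" offset is the
cycle_last read we need anyway.)

/* x86-64 only; build with gcc -O2. */
#include <stdint.h>
#include <stdio.h>

static uint64_t shared_last;	/* stand-in for cycle_last in this example */

static uint64_t read_tsc_ordered(void)
{
	uint64_t ret, zero, last;

	asm volatile ("lfence\n\t"		/* earlier instructions complete first (AMD wants mfence) */
		      "rdtsc\n\t"
		      "shl $0x20,%%rdx\n\t"
		      "or %%rdx,%%rax\n\t"	/* ret = (edx << 32) | eax */
		      "shl $0x20,%%rdx"		/* rdx is now zero, but data-dependent on rdtsc */
		      : "=a" (ret), "=d" (zero) : : "cc");

	/*
	 * Fold the dependent zero into the address of a load we needed
	 * anyway, so the load cannot be issued before the tsc is read.
	 */
	last = *(uint64_t *)((char *)&shared_last + zero);

	return ret >= last ? ret : last;
}

int main(void)
{
	shared_last = read_tsc_ordered();
	printf("tsc: %llu\n", (unsigned long long)shared_last);
	return 0;
}

(Which fence belongs before rdtsc depends on the CPU; in the kernel that
choice is hidden behind rdtsc_barrier().)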
I suspect that it's sufficient to just load something dependent on edx
without using the result, but I don't see any solid evidence in the
manual that CPUs won't eliminate useless loads.  I leave scary stuff
like that to the real experts.

Signed-off-by: Andy Lutomirski
---
 arch/x86/kernel/tsc.c |   37 ++++++++++++++++++++++++++++++-------
 1 files changed, 30 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index bc46566..858c084 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -767,18 +767,41 @@ static cycle_t read_tsc(struct clocksource *cs)
 static cycle_t __vsyscall_fn vread_tsc(void)
 {
 	cycle_t ret;
+	u64 zero, last;
 
 	/*
-	 * Surround the RDTSC by barriers, to make sure it's not
-	 * speculated to outside the seqlock critical section and
-	 * does not cause time warps:
+	 * rdtsc is unordered, and we want it to be ordered like
+	 * a load with respect to other CPUs (and we don't want
+	 * it to execute absurdly early wrt code on this CPU).
+	 * rdtsc_barrier() is a barrier that provides this ordering
+	 * with respect to *earlier* loads.  (Which barrier to use
+	 * depends on the CPU.)
 	 */
 	rdtsc_barrier();
-	ret = (cycle_t)vget_cycles();
-	rdtsc_barrier();
 
-	return ret >= VVAR(vsyscall_gtod_data).clock.cycle_last ?
-		ret : VVAR(vsyscall_gtod_data).clock.cycle_last;
+	asm volatile ("rdtsc\n\t"
+		      "shl $0x20,%%rdx\n\t"
+		      "or %%rdx,%%rax\n\t"
+		      "shl $0x20,%%rdx"
+		      : "=a" (ret), "=d" (zero) : : "cc");
+
+	/*
+	 * zero == 0, but as far as the processor is concerned, zero
+	 * depends on the output of rdtsc.  So we can use it as a
+	 * load barrier by loading something that depends on it.
+	 * x86-64 keeps all loads in order wrt each other, so this
+	 * ensures that rdtsc is ordered wrt all later loads.
+	 */
+
+	/*
+	 * This doesn't multiply 'zero' by anything, which *should*
+	 * generate nicer code, except that gcc cleverly embeds the
+	 * dereference into the cmp and the cmovae.  Oh, well.
+	 */
+	last = *( (cycle_t *)
+		  ((char *)&VVAR(vsyscall_gtod_data).clock.cycle_last + zero) );
+
+	return ret >= last ? ret : last;
 }
 
 #endif
-- 
1.7.4
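(Also illustrative, not part of the patch: a rough user-space harness for
the kind of clock_gettime(CLOCK_MONOTONIC) loop measured in the changelog.
The iteration count, names, and output format are invented; only the clock
and the per-iteration metric come from the numbers quoted above.  Older
glibc may need -lrt for clock_gettime.)

#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <time.h>

#define ITERS 10000000UL	/* arbitrary; large enough to average out noise */

static double timespec_ns(const struct timespec *ts)
{
	return ts->tv_sec * 1e9 + ts->tv_nsec;
}

int main(void)
{
	struct timespec start, end, dummy;
	unsigned long i;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < ITERS; i++)
		clock_gettime(CLOCK_MONOTONIC, &dummy);
	clock_gettime(CLOCK_MONOTONIC, &end);

	printf("%.1f ns/iter\n",
	       (timespec_ns(&end) - timespec_ns(&start)) / ITERS);
	return 0;
}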