From: Andrew Lutomirski <luto@mit.edu>
To: Ingo Molnar <mingo@elte.hu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
Nick Piggin <npiggin@suse.de>,
"David S. Miller" <davem@davemloft.net>,
Eric Dumazet <eric.dumazet@gmail.com>,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
x86@kernel.org, Thomas Gleixner <tglx@linutronix.de>,
Andi Kleen <andi@firstfloor.org>,
linux-kernel@vger.kernel.org
Subject: Re: [RFT/PATCH v2 2/6] x86-64: Optimize vread_tsc's barriers
Date: Thu, 7 Apr 2011 07:44:09 -0400
Message-ID: <BANLkTik=Gefr11qyJ1cDer3mv76SUu8h_g@mail.gmail.com>
In-Reply-To: <20110407082550.GG24879@elte.hu>
On Thu, Apr 7, 2011 at 4:25 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> (Cc:-ed more memory ordering folks.)
>
>> --- a/arch/x86/kernel/tsc.c
>> +++ b/arch/x86/kernel/tsc.c
>> @@ -767,18 +767,41 @@ static cycle_t read_tsc(struct clocksource *cs)
>> static cycle_t __vsyscall_fn vread_tsc(void)
>> {
>> cycle_t ret;
>> + u64 zero, last;
>>
>> /*
>> - * Surround the RDTSC by barriers, to make sure it's not
>> - * speculated to outside the seqlock critical section and
>> - * does not cause time warps:
>> + * rdtsc is unordered, and we want it to be ordered like
>> + * a load with respect to other CPUs (and we don't want
>> + * it to execute absurdly early wrt code on this CPU).
>> + * rdtsc_barrier() is a barrier that provides this ordering
>> + * with respect to *earlier* loads. (Which barrier to use
>> + * depends on the CPU.)
>> */
>> rdtsc_barrier();
>> - ret = (cycle_t)vget_cycles();
>> - rdtsc_barrier();
>>
>> - return ret >= VVAR(vsyscall_gtod_data).clock.cycle_last ?
>> - ret : VVAR(vsyscall_gtod_data).clock.cycle_last;
>> + asm volatile ("rdtsc\n\t"
>> + "shl $0x20,%%rdx\n\t"
>> + "or %%rdx,%%rax\n\t"
>> + "shl $0x20,%%rdx"
>> + : "=a" (ret), "=d" (zero) : : "cc");
>> +
>> + /*
>> + * zero == 0, but as far as the processor is concerned, zero
>> + * depends on the output of rdtsc. So we can use it as a
>> + * load barrier by loading something that depends on it.
>> + * x86-64 keeps all loads in order wrt each other, so this
>> + * ensures that rdtsc is ordered wrt all later loads.
>> + */
>> +
>> + /*
>> + * This doesn't multiply 'zero' by anything, which *should*
>> + * generate nicer code, except that gcc cleverly embeds the
>> + * dereference into the cmp and the cmovae. Oh, well.
>> + */
>> + last = *( (cycle_t *)
>> + ((char *)&VVAR(vsyscall_gtod_data).clock.cycle_last + zero) );
>> +
>> + return ret >= last ? ret : last;
>
> Ok, that's like totally sick ;-)
>
> It's a software barrier in essence, using data dependency obfuscation.
>
> First objection would be the extra memory references: we have a general
> aversion against memory references. The memory reference here is arguably
> special, it is to the stack so should be in cache and should be pretty fast.
>
> But the code really looks too tricky for its own good.
>
> For example this assumption:
>
>> The trick is that the answer should not actually change as a result
>> of the sneaky memory access. I accomplish this by shifting rdx left
>> by 32 bits, twice, to generate the number zero. (I can't imagine
>> that any CPU can break that dependency.) Then I use "zero" as an
>
> is not particularly future-proof. Yes, i doubt any CPU will bother to figure
> out the data dependency across such a sequence, but synthetic CPUs might and
> who knows what the far future brings.
That's fixable with a bit more work. Imagine:

    asm volatile ("rdtsc\n\t"
                  "shl $0x20,%%rdx\n\t"
                  "or %%rdx,%%rax\n\t"
                  "shr $0x3f,%%rdx"
                  : "=a" (ret), "=d" (zero_or_one) : : "cc");

    last = VVAR(vsyscall_gtod_data).clock.cycle_last[zero_or_one];
For this to work, cycle_last would have to be an array containing two
identical values, and we'd want to be a little careful to keep the
whole mess in one cacheline. Now I think it's really safe because
zero_or_one isn't constant at all, so the CPU has to wait for its
value no matter how clever it is.
>
> Also, do we *really* have RDTSC SMP-coherency guarantees on multi-socket CPUs
> today? It now works on multi-core, but on bigger NUMA i strongly doubt it. So
> this hack tries to preserve something that we won't be able to offer anyway.
>
> So the much better optimization would be to give up on exact GTOD coherency and
> just make sure the same task does not see time going backwards. If user-space
> wants precise coherency it can use synchronization primitives itself. By default
> it would get the fast and possibly off by a few cycles thing instead. We'd
> never seriously jump in time - only small jumps would happen in practice,
> depending on CPU parallelism effects.
>
> If we do that then the optimization would be to RDTSC and not use *any* of the
> barriers, neither the hardware ones nor your tricky software data-dependency
> obfuscation barrier.
IIRC I measured 200ns time warps on Sandy Bridge when I tried that.
(I think you won't see them that easily with time-warp-test in TSC
mode, because there's a locking instruction before the rdtsc and very
close to it.) It would save about 4ns more, I think, which isn't bad.
Personally, I'm working on this code because I'm migrating a bunch of
code that likes to timestamp itself from Windows to Linux, and one of
the really big attractions to Linux is that it has clock_gettime,
which is fast, pretty much warp-free, and actually shows *wall* time
with high precision. The closest thing that Windows has is
QueryPerformanceCounter, which is a giant PITA because it doesn't
track wall time (although it's still slightly faster than
clock_gettime even with this patch). If I have to re-add
software-enforced clock coherency to the Linux version, I'll be sad.
--Andy
>
> Note that doing this will also have other advantages: we won't really need
> alternatives patching, thus we could more easily move this code into .S - which
> would allow further optimizations, such as the elimination of this GCC
> inflicted slowdown:
>
>> + /*
>> + * This doesn't multiply 'zero' by anything, which *should*
>> + * generate nicer code, except that gcc cleverly embeds the
>> + * dereference into the cmp and the cmovae. Oh, well.
>> + */
>> + last = *( (cycle_t *)
>> + ((char *)&VVAR(vsyscall_gtod_data).clock.cycle_last + zero) );
>> +
>> + return ret >= last ? ret : last;
>
> Thanks,
>
> Ingo
>