From: Linus Torvalds
Date: Thu, 7 Apr 2011 10:20:31 -0700
Subject: Re: [RFT/PATCH v2 2/6] x86-64: Optimize vread_tsc's barriers
To: Andi Kleen
Cc: Andy Lutomirski, x86@kernel.org, Thomas Gleixner, Ingo Molnar,
    linux-kernel@vger.kernel.org
In-Reply-To: <20110407164245.GA21838@one.firstfloor.org>
References: <80b43d57d15f7b141799a7634274ee3bfe5a5855.1302137785.git.luto@mit.edu>
    <20110407164245.GA21838@one.firstfloor.org>

On Thu, Apr 7, 2011 at 9:42 AM, Andi Kleen wrote:
>
> I'm sure a single barrier would have fixed the testers, as you point out,
> but the goal wasn't to only fix testers.

You missed two points:

 - first off, do we have any reason to believe that the rdtsc would
   migrate down _anyway_? As AndyL says, both Intel and AMD seem to
   document only the "[lm]fence + rdtsc" thing with a single fence
   instruction before.

   Instruction scheduling isn't some kind of theoretical game. It's a
   very practical issue, and CPU schedulers are constrained to do a good
   job quickly and _effectively_.

   In other words, instructions don't just schedule randomly. In the
   presence of the barrier, is there any reason to believe that the rdtsc
   would really schedule oddly? There is never any reason to _delay_ an
   rdtsc (it can take no cache misses or wait on any other resources), so
   when it is not able to move up, where would it move?

   IOW, it's all about "practical vs theoretical".
   Sure, in theory the rdtsc could move down arbitrarily. In _practice_,
   the caller is going to use the result (and if it doesn't, the value
   doesn't matter), and thus the CPU will have data dependencies etc that
   constrain scheduling. But more practically, there's no reason to delay
   scheduling, because an rdtsc isn't going to be waiting for any
   interesting resources, and it's also not going to be holding up any
   more important resources (iow, sure, you'd like to schedule subsequent
   loads early, but those won't be fighting over the same resource with
   the rdtsc anyway, so I don't see any reason that would delay the rdtsc
   and move it down).

   So I suspect that one lfence (before) is basically the _same_ as the
   current two lfences (around). Now, I can't guarantee it, but in the
   absence of numbers to the contrary, there really isn't much reason to
   believe otherwise. Especially considering the Intel/AMD
   _documentation_.

   So we should at least try it, I think.

 - the reason "back-to-back" (with the extreme example being in a tight
   loop) matters is that if something isn't in a tight loop, any jitter
   we see in the time counting wouldn't be visible anyway. One random
   timestamp is meaningless on its own. It's only when you have multiple
   ones that you can compare them. No?

So _before_ we try some really clever data dependency trick with new
inline asm and magic "double shifts to create a zero" things, I really
would suggest just trying to remove the second lfence entirely and see
how that works. Maybe it doesn't work, but ...

                 Linus
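
For anyone who wants to try the comparison suggested above, here is a
rough userspace sketch of the two variants. It is not the kernel's
vread_tsc and not the tester mentioned in this thread: the function
names (tsc_fence_before, tsc_fence_around, count_backwards) are made up
for illustration, and it assumes GCC or Clang on x86-64, where
<x86intrin.h> provides _mm_lfence() and __rdtsc(). It is
single-threaded, so it only exercises the simplest back-to-back case.

```c
#include <stdint.h>
#include <x86intrin.h>  /* _mm_lfence(), __rdtsc(); GCC/Clang, x86-64 assumed */

/* One fence before the read: the "lfence + rdtsc" pattern. */
static inline uint64_t tsc_fence_before(void)
{
    _mm_lfence();       /* keep rdtsc from moving up past earlier loads */
    return __rdtsc();
}

/* Two fences around the read: the variant with the second lfence kept. */
static inline uint64_t tsc_fence_around(void)
{
    uint64_t t;

    _mm_lfence();
    t = __rdtsc();
    _mm_lfence();       /* the fence that might turn out to be unnecessary */
    return t;
}

/*
 * Take iters back-to-back readings in a tight loop and count how many
 * are smaller than the reading just before them.
 */
static unsigned long count_backwards(uint64_t (*read_tsc)(void),
                                     unsigned long iters)
{
    unsigned long bad = 0;
    uint64_t prev = read_tsc();

    for (unsigned long i = 0; i < iters; i++) {
        uint64_t now = read_tsc();

        if (now < prev)
            bad++;
        prev = now;
    }
    return bad;
}
```

Calling count_backwards(tsc_fence_before, n) and
count_backwards(tsc_fence_around, n) with a large n, and timing both
loops, gives a first impression of whether dropping the second fence
changes anything on a given machine. It proves nothing about other CPUs
or about readers running on several cores.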