From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1755953AbZCBVRv (ORCPT ); Mon, 2 Mar 2009 16:17:51 -0500
Received: (majordomo@vger.kernel.org) by vger.kernel.org
	id S1753760AbZCBVRd (ORCPT ); Mon, 2 Mar 2009 16:17:33 -0500
Received: from smtp1.linux-foundation.org ([140.211.169.13]:37692
	"EHLO smtp1.linux-foundation.org" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1753362AbZCBVRb (ORCPT );
	Mon, 2 Mar 2009 16:17:31 -0500
Date: Mon, 2 Mar 2009 13:16:23 -0800 (PST)
From: Linus Torvalds
X-X-Sender: torvalds@localhost.localdomain
To: Nick Piggin
cc: "H. Peter Anvin" , Arjan van de Ven , Andi Kleen , David Miller ,
	mingo@elte.hu, sqazi@google.com, linux-kernel@vger.kernel.org,
	tglx@linutronix.de
Subject: Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
In-Reply-To: <200903020106.51865.nickpiggin@yahoo.com.au>
Message-ID:
References: <20090228173813.6d86c0ef@infradead.org> <49A9E7A3.4050102@zytor.com>
	<200903020106.51865.nickpiggin@yahoo.com.au>
User-Agent: Alpine 2.00 (LFD 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, 2 Mar 2009, Nick Piggin wrote:
>
> I would expect any high performance CPU these days to combine entries
> in the store queue, even for normal store instructions (especially for
> linear memcpy patterns). Isn't this likely to be the case?

None of this really matters.

The big issue is that before you can do any write to any cacheline, if
the memory is cacheable, it needs the cache coherency protocol to
synchronize with any other CPUs that may have that line in the cache.

The _only_ time a write is "free" is when you already have that
cacheline in your own cache, and in an "exclusive" state. If that is
the case, then you know that you don't need to do anything else.

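As a toy illustration of the rule just stated, here is a minimal sketch
that assumes the textbook MESI protocol, where "exclusive" ownership
covers both the Modified and Exclusive states. The enum and function
names are made up for this example; this is not how any real CPU or the
kernel expresses it.

```c
#include <stdbool.h>

/* The four states of the textbook MESI protocol (assumed model). */
enum line_state {
	LINE_MODIFIED,   /* only copy, dirty */
	LINE_EXCLUSIVE,  /* only copy, clean */
	LINE_SHARED,     /* other caches may hold a copy */
	LINE_INVALID     /* not present in this cache */
};

/*
 * Hypothetical helper: returns true when a store to a line in state 's'
 * must first generate coherency traffic (an invalidate of other copies,
 * or a read-for-ownership) before the write can complete.
 */
bool store_needs_coherency_traffic(enum line_state s)
{
	/* Only a line this CPU already owns exclusively is "free". */
	return !(s == LINE_MODIFIED || s == LINE_EXCLUSIVE);
}
```
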
In _any_ other case, before you do the write, you need to make sure
that no other CPU in the system has that line in its cache. Whether you
do that with a "write and invalidate" model (which would be how a store
buffer would do it or a write-through cache would work), or whether you
do it with an "acquire exclusive cacheline" (which is how the cache
coherency protocol would do it), it's going to end up using cache
coherency bandwidth.

Of course, what will be the limiting factor is unclear. On a
single-socket thing, you don't have any cache coherency issues, and the
only bandwidth you'd end up using is the actual memory write at the
memory controller (which may be on-die, and entirely separate from the
cache coherency protocol). It may be idle and the write queue may be
deep enough that you reach memory speeds and the write buffer is the
optimal approach.

On many sockets, the limiting factor will almost certainly be the cache
coherency overhead (since the cache coherency traffic needs to go to
_all_ sockets, rather than just one stream to memory), at least unless
you have a good cache coherency filter that can filter out part of the
traffic based on whether it could be cached or not on some socket(s).

IOW, it's almost impossible to tell what is the best approach. It will
depend on number of sockets, it will depend on size of cache, and it
will depend on the capabilities and performance of the memory
controllers vs the cache coherency protocol.

On a "single shared bus" model, the "write with invalidate" is fine,
and it basically ends up working a lot like a single socket even if you
actually have multiple sockets - it just won't scale much beyond two
sockets. With HT or QPI, things are different, and the presence or
absence of a snoop filter could make a big difference for 4+ socket
situations.

There simply is no single answer.

And we really should keep that in mind. There is no right answer, and
the right answer will depend on hardware.

Playing cache games in software is almost always futile. It can be a
huge improvement, but it can be a huge deprovement too, and it really
tends to be only worth it if you (a) know your hardware really quite
well and (b) know your _load_ pretty well too.

We can play games in the kernel. We do know how many sockets there are.
We do know the cache size. We _could_ try to make an educated guess at
whether the next user of the data will be DMA or not. So there are
unquestionably heuristics we could apply, but I also do suspect that
they'd inevitably be pretty arbitrary.

I suspect that we could make some boot-time (or maybe CPU hotplug time)
decision that simply just sets a threshold value for when it is worth
using non-temporal stores. With smaller caches, and with a single
socket (or a single bus), it likely makes sense to use non-temporal
stores earlier.

But even with some rough heuristic, it will be wrong part of the time.
So I think "simple and predictable" in the end tends to be better than
"complex and still known to be broken".

Btw, the "simple and predictable" could literally look at _where_ in
the file the IO is. Because I know there are papers on the likelihood
of re-use of data depending on where in the file it is written. Data
written to low offsets is more likely to be accessed again (think
temp-files), while data written to big offsets is much more likely to
be random or to be written out (think databases or simply just large
streaming files).

So I suspect a "simple and predictable" algorithm could literally be
something like

 - use nontemporal stores only if you are writing a whole page, and the
   byte offset of the page is larger than 'x', where 'x' may optionally
   even depend on size of cache.

But removing it entirely may be fine too.

What I _don't_ think is fine is to think that you've "solved" it, or
that you should even try!

		Linus
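
For readers who want the "simple and predictable" rule above spelled
out, here is a minimal sketch of it as a plain C predicate. It is not
kernel code: the 4 KiB page size, the function name, and the way the
threshold 'x' is passed in are all assumptions made for illustration,
and the mail gives no concrete value for 'x'.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/*
 * Hypothetical predicate: should this copy use non-temporal stores?
 *
 *   file_offset - byte offset in the file where the write starts
 *   len         - number of bytes being written
 *   threshold   - the 'x' from the mail; the caller picks it, possibly
 *                 once at boot and possibly scaled by cache size
 */
bool use_nontemporal(uint64_t file_offset, size_t len, uint64_t threshold)
{
	/* "Whole page": page-aligned start and exactly one page long. */
	bool whole_page = (file_offset % SKETCH_PAGE_SIZE == 0) &&
			  (len == SKETCH_PAGE_SIZE);

	/* Only bypass the cache for whole pages beyond offset 'x'. */
	return whole_page && file_offset > threshold;
}
```

The point of keeping it this small is the one made above: the rule has
one tunable, is easy to reason about, and will still be wrong part of
the time.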