From: Nick Piggin <nickpiggin@yahoo.com.au>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>,
Arjan van de Ven <arjan@infradead.org>,
Andi Kleen <andi@firstfloor.org>,
David Miller <davem@davemloft.net>,
mingo@elte.hu, sqazi@google.com, linux-kernel@vger.kernel.org,
tglx@linutronix.de
Subject: Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
Date: Tue, 3 Mar 2009 15:20:59 +1100 [thread overview]
Message-ID: <200903031521.00217.nickpiggin@yahoo.com.au> (raw)
In-Reply-To: <alpine.LFD.2.00.0903021255020.3111@localhost.localdomain>
On Tuesday 03 March 2009 08:16:23 Linus Torvalds wrote:
> On Mon, 2 Mar 2009, Nick Piggin wrote:
> > I would expect any high performance CPU these days to combine entries
> > in the store queue, even for normal store instructions (especially for
> > linear memcpy patterns). Isn't this likely to be the case?
>
> None of this really matters.
Well, that's just what I was replying to. Of course nontemporal/uncached
stores can't avoid cache-coherency operations either, but somebody was hoping
that they would avoid the write-allocate / RMW behaviour. I replied because
I think modern CPUs can combine stores in their store queues to get
the same result for cacheable stores.
Of course that doesn't make it free, especially if it is a coherency protocol
that has to go out on the interconnect anyway. But avoiding the RAM read is a
good thing regardless.
> The big issue is that before you can do any write to any cacheline, if the
> memory is cacheable, it needs the cache coherency protocol to synchronize
> with any other CPU's that may have that line in the cache.
>
> The _only_ time a write is "free" is when you already have that cacheline
> in your own cache, and in an "exclusive" state. If that is the case, then
> you know that you don't need to do anything else.
>
> In _any_ other case, before you do the write, you need to make sure that
> no other CPU in the system has that line in its cache. Whether you do that
> with a "write and invalidate" model (which would be how a store buffer
> would do it or a write-through cache would work), or whether you do it
> with a "acquire exclusive cacheline" (which is how the cache coherency
> protocol would do it), it's going to end up using cache coherency
> bandwidth.
>
> Of course, what will be the limiting factor is unclear. On a single-socket
> thing, you don't have any cache coherency issues, and the only bandwidth
> you'd end up using is the actual memory write at the memory controller
> (which may be on-die, and entirely separate from the cache coherency
> protocol). It may be idle and the write queue may be deep enough that you
> reach memory speeds and the write buffer is the optimal approach.
>
> On many sockets, the limiting factor will almost certainly be the cache
> coherency overhead (since the cache coherency traffic needs to go to _all_
> sockets, rather than just one stream to memory), at least unless you have
> a good cache coherency filter that can filter out part of the traffic
> based on whether it could be cached or not on some socket(s).
>
> IOW, it's almost impossible to tell what is the best approach. It will
> depend on number of sockets, it will depend on size of cache, and it will
> depend on the capabilities and performance of the memory controllers vs
> the cache coherency protocol.
>
> On a "single shared bus" model, the "write with invalidate" is fine, and
> it basically ends up working a lot like a single socket even if you
> actually have multiple sockets - it just won't scale much beyond two
> sockets. With HT or QPI, things are different, and the presence or absence
> of a snoop filter could make a big difference for 4+ socket situations.
>
> There simply is no single answer.
>
> And we really should keep that in mind. There is no right answer, and the
> right answer will depend on hardware. Playing cache games in software is
> almost always futile. It can be a huge improvement, but it can be a huge
> deprovement too, and it really tends to be only worth it if you (a) know
> your hardware really quite well and (b) know your _load_ pretty well too.
>
> We can play games in the kernel. We do know how many sockets there are. We
> do know the cache size. We _could_ try to make an educated guess at
> whether the next user of the data will be DMA or not. So there are
> unquestionably heuristics we could apply, but I also do suspect that
> they'd inevitably be pretty arbitrary.
>
> I suspect that we could make some boot-time (or maybe CPU hotplug time)
> decision that simply just sets a threshold value for when it is worth
> using non-temporal stores. With smaller caches, and with a single socket
> (or a single bus), it likely makes sense to use non-temporal stores
> earlier.
>
> But even with some rough heuristic, it will be wrong part of the time. So
> I think "simple and predictable" in the end tends to be better than
> "complex and still known to be broken".
>
> Btw, the "simple and predictable" could literally look at _where_ in the
> file the IO is. Because I know there are papers on the likelihood of
> re-use of data depending on where in the file it is written. Data written
> to low offsets is more likely to be accessed again (think temp-files),
> while data written to big offsets are much more likely to be random or to
> be written out (think databases or simply just large streaming files).
>
> So I suspect a "simple and predictable" algorithm could literally be
> something like
>
> - use nontemporal stores only if you are writing a whole page, and the
> byte offset of the page is larger than 'x', where 'x' may optionally
> even depend on size of cache.
>
> But removing it entirely may be fine too.
>
> What I _don't_ think is fine is to think that you've "solved" it, or that
> you should even try!
Right. I don't know if you misunderstood me or aimed this post at the
general discussion rather than my reply specifically.
I know that even if a CPU does write-combining in the store buffer, and even
if it has "big-hammer" nontemporal stores as x86 apparently does, there are
still cases where nontemporal stores will win if the data doesn't get used
by the CPU again.
I agree that if a heuristic can't get it right a *significant* amount of
time, then it is not worthwhile. Even if it gets it right a little more
often than wrong, the unpredictability is a negative factor. I agree
completely with you there :)
I would like to remove it, as in Ingo's last patch, FWIW. But obviously there
are cases where nontemporal helps, so there will never be a single "right"
answer.