linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Nick Piggin <nickpiggin@yahoo.com.au>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>,
	Arjan van de Ven <arjan@infradead.org>,
	Andi Kleen <andi@firstfloor.org>,
	David Miller <davem@davemloft.net>,
	mingo@elte.hu, sqazi@google.com, linux-kernel@vger.kernel.org,
	tglx@linutronix.de
Subject: Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()
Date: Tue, 3 Mar 2009 15:20:59 +1100	[thread overview]
Message-ID: <200903031521.00217.nickpiggin@yahoo.com.au> (raw)
In-Reply-To: <alpine.LFD.2.00.0903021255020.3111@localhost.localdomain>

On Tuesday 03 March 2009 08:16:23 Linus Torvalds wrote:
> On Mon, 2 Mar 2009, Nick Piggin wrote:
> > I would expect any high performance CPU these days to combine entries
> > in the store queue, even for normal store instructions (especially for
> > linear memcpy patterns). Isn't this likely to be the case?
>
> None of this really matters.

Well that's just what I was replying to. Of course nontemporal/uncached
stores can't avoid cc operations either, but somebody was hoping that
they would avoid the write-allocate / RMW behaviour. I just replied because
I think that modern CPUs can combine stores in their store queues to get
the same result for cacheable stores.

Of course it doesn't make it free especially if it is a cc protocol that
has to go on the interconnect anyway. But avoiding the RAM read is a
good thing anyway.


> The big issue is that before you can do any write to any cacheline, if the
> memory is cacheable, it needs the cache coherency protocol to synchronize
> with any other CPU's that may have that line in the cache.
>
> The _only_ time a write is "free" is when you already have that cacheline
> in your own cache, and in an "exclusive" state. If that is the case, then
> you know that you don't need to do anything else.
>
> In _any_ other case, before you do the write, you need to make sure that
> no other CPU in the system has that line in its cache. Whether you do that
> with a "write and invalidate" model (which would be how a store buffer
> would do it or a write-through cache would work), or whether you do it
> with a "acquire exclusive cacheline" (which is how the cache coherency
> protocol would do it), it's going to end up using cache coherency
> bandwidth.
>
> Of course, what will be the limiting factor is unclear. On a single-socket
> thing, you don't have any cache coherency issues, an the only bandwidth
> you'd end up using is the actual memory write at the memory controller
> (which may be on-die, and entirely separate from the cache coherency
> protocol). It may be idle and the write queue may be deep enough that you
> reach memory speeds and the write buffer is the optimal approach.
>
> On many sockets, the limiting factor will almost certainly be the cache
> coherency overhead (since the cache coherency traffic needs to go to _all_
> sockets, rather than just one stream to memory), at least unless you have
> a good cache coherency filter that can filter out part of the traffic
> based on whether it could be cached or not on some socket(s).
>
> IOW, it's almost impossible to tell what is the best approach. It will
> depend on number of sockets, it will depend on size of cache, and it will
> depend on the capabilities and performance of the memory controllers vs
> the cache coherency protocol.
>
> On a "single shared bus" model, the "write with invalidate" is fine, and
> it basically ends up working a lot like a single socket even if you
> actually have multiple sockets - it just won't scale much beyond two
> sockets. With HT or QPI, things are different, and the presense or absense
> of a snoop filter could make a big difference for 4+ socket situations.
>
> There simply is no single answer.
>
> And we really should keep that in mind. There is no right answer, and the
> right answer will depend on hardware. Playing cache games in software is
> almost always futile. It can be a huge improvement, but it can be a huge
> deprovement too, and it really tends to be only worth it if you (a) know
> your hardware really quite well and (b) know your _load_ pretty well too.
>
> We can play games in the kernel. We do know how many sockets there are. We
> do know the cache size. We _could_ try to make an educated guess at
> whether the next user of the data will be DMA or not. So there are
> unquestionably heuristics we could apply, but I also do suspect that
> they'd inevitably be pretty arbitrary.
>
> I suspect that we could make some boot-time (or maybe CPU hotplug time)
> decision that simply just sets a threshold value for when it is worth
> using non-temporal stores. With smaller caches, and with a single socket
> (or a single bus), it likely makes sense to use non-temporal stores
> earlier.
>
> But even with some rough heuristic, it will be wrong part of the time. So
> I think "simple and predictable" in the end tends to be better than
> "complex and still known to be broken".
>
> Btw, the "simple and predictable" could literally look at _where_ in the
> file the IO is. Because I know there are papers on the likelihood of
> re-use of data depending on where in the file it is written. Data written
> to low offsets is more likely to be accessed again (think temp-files),
> while data written to big offsets are much more likely to be random or to
> be written out (think databases or simply just large streaming files).
>
> So I suspect a "simple and predictable" algorithm could literally be
> something like
>
>  - use nontemporal stores only if you are writing a whole page, and the
>    byte offset of the page is larger than 'x', where 'x' may optionally
>    even depend on size of cache.
>
> But removing it entirely may be fine too.
>
> What I _don't_ think is fine is to think that you've "solved" it, or that
> you should even try!

Right. I don't know if you misunderstood me or aimed this post at the
general discussion rather than my reply specifically.

I know even if a CPU does write combining in the store buffer and even
if it does have "big-hammer" nontemporal stores like x86 apparently does,
then there are still cases where nontemporal stores will win if the data
doesn't get used by the CPU again.

I agree that if a heuristic can't get it right a *significant* amount of
time, then it is not worthwhile. Even if it gets it right a little more
often than wrong, the unpredictability is a negative factor. I agree
completely with you there :)

I would like to remove it, as in Ingo's last patch, FWIW. But I can see
obviously there are cases where nontemporal helps, so there will never be
a "right" answer.


  parent reply	other threads:[~2009-03-03  4:21 UTC|newest]

Thread overview: 50+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-02-24  2:03 Performance regression in write() syscall Salman Qazi
2009-02-24  4:10 ` Nick Piggin
2009-02-24  4:28   ` Linus Torvalds
2009-02-24  9:02     ` Nick Piggin
2009-02-24 15:52       ` Linus Torvalds
2009-02-24 16:24         ` Andi Kleen
2009-02-24 16:51         ` Ingo Molnar
2009-02-25  3:23         ` Nick Piggin
2009-02-25  7:25           ` [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache() Ingo Molnar
2009-02-25  8:09             ` Nick Piggin
2009-02-25  8:29               ` Ingo Molnar
2009-02-25  8:59                 ` Nick Piggin
2009-02-25 12:01                   ` Ingo Molnar
2009-02-25 16:04             ` Linus Torvalds
2009-02-25 16:29               ` Ingo Molnar
2009-02-27 12:05               ` Nick Piggin
2009-02-28  8:29                 ` Ingo Molnar
2009-02-28 11:49                   ` Nick Piggin
2009-02-28 12:58                     ` Ingo Molnar
2009-02-28 17:16                       ` Linus Torvalds
2009-02-28 17:24                         ` Arjan van de Ven
2009-02-28 17:42                           ` Linus Torvalds
2009-02-28 17:53                             ` Arjan van de Ven
2009-02-28 18:05                             ` Andi Kleen
2009-02-28 18:27                             ` Ingo Molnar
2009-02-28 18:39                               ` Arjan van de Ven
2009-03-02 10:39                                 ` [PATCH] x86, mm: dont use non-temporal stores in pagecache accesses Ingo Molnar
2009-02-28 18:52                               ` [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache() Linus Torvalds
2009-03-01 14:19                                 ` Nick Piggin
2009-03-01  0:06                             ` David Miller
2009-03-01  0:40                               ` Andi Kleen
2009-03-01  0:28                                 ` H. Peter Anvin
2009-03-01  0:38                                   ` Arjan van de Ven
2009-03-01  1:48                                     ` Andi Kleen
2009-03-01  1:38                                       ` Arjan van de Ven
2009-03-01  1:40                                         ` H. Peter Anvin
2009-03-01 14:06                                           ` Nick Piggin
2009-03-02  4:46                                             ` H. Peter Anvin
2009-03-02  6:18                                               ` Nick Piggin
2009-03-02 21:16                                             ` Linus Torvalds
2009-03-02 21:25                                               ` Ingo Molnar
2009-03-03  4:30                                                 ` Nick Piggin
2009-03-03  4:20                                               ` Nick Piggin [this message]
2009-03-03  9:02                                                 ` Ingo Molnar
2009-03-04  3:37                                                   ` Nick Piggin
2009-03-01  2:07                                         ` Andi Kleen
2009-02-24  5:43   ` Performance regression in write() syscall Salman Qazi
2009-02-24 10:09 ` Andi Kleen
2009-02-24 16:13   ` Ingo Molnar
2009-02-24 16:51     ` Andi Kleen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=200903031521.00217.nickpiggin@yahoo.com.au \
    --to=nickpiggin@yahoo.com.au \
    --cc=andi@firstfloor.org \
    --cc=arjan@infradead.org \
    --cc=davem@davemloft.net \
    --cc=hpa@zytor.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=sqazi@google.com \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).