* Copy prefetch optimizations and non-coherent caches
@ 2003-09-06  6:07 Eugene Surovegin
  2003-09-06 13:20 ` Paul Mackerras
  0 siblings, 1 reply; 4+ messages in thread
From: Eugene Surovegin @ 2003-09-06  6:07 UTC (permalink / raw)
  To: linuxppc-embedded


Hi all!

There are read prefetch optimizations in several PPC-specific functions
responsible for copying memory (copy_page, __copy_tofrom_user). The current
implementations will try to prefetch up to 4 (MAX_COPY_PREFETCH) cache
lines _after_ the end of the source buffer.

Unfortunately, this is not a good idea on CPUs with non-coherent caches.
The prefetching may establish cache lines for memory ranges that require
exactly the opposite (e.g. a DMA read buffer).

Here is one example of the incorrect behavior (from real life :():

Let's consider two adjacent pages in RAM. The first page contains some data
and the second one is used as the target of a DMA read. Please note that
these two pages are completely unrelated to each other.

Before initiating the DMA read, we invalidate the cache lines for addresses
in the second page's range (e.g. with
consistent_sync_page(...., PCI_DMA_FROMDEVICE)).

Then, somewhere else in the kernel, we start copying from the _first_
page, e.g. to user space with __copy_tofrom_user. While copying, this
function calls dcbt to prefetch up to 4 cache lines of the source data.

This also prefetches 4 cache lines _after_ the end of the source buffer,
i.e. the _first_ 4 cache lines of the _second_ page. If the DMA read hasn't
yet filled this memory, the cache now holds the old contents, and we may
later read that stale data instead of what the DMA read actually delivered.
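
The scenario above can be modelled in a few lines of user-space C. This is
an illustrative sketch only, not kernel code: the helper name
prefetches_past_end is hypothetical, and it assumes a simplified loop that
prefetches the line MAX_COPY_PREFETCH lines ahead of the one being copied,
on a buffer that starts on a line boundary.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_COPY_PREFETCH 4u /* look-ahead distance in cache lines, as in the thread */

/*
 * Simplified model (hypothetical helper, not kernel code): while copying
 * source line i, the loop prefetches source line i + MAX_COPY_PREFETCH.
 * Returns how many of those prefetches land beyond the source buffer.
 * Assumes the buffer starts on a line boundary and len is a multiple of
 * line_size.
 */
size_t prefetches_past_end(size_t len, size_t line_size)
{
    assert(line_size != 0);

    size_t nlines = len / line_size;
    size_t past = 0;

    for (size_t i = 0; i < nlines; i++) {
        if (i + MAX_COPY_PREFETCH >= nlines)
            past++; /* target line belongs to whatever follows the buffer */
    }
    return past;
}
```

For a 4096-byte page and an assumed 32-byte line, prefetches_past_end(4096, 32)
returns 4: each of the last four iterations touches a line of the following
page.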

I think we should disable prefetch if CONFIG_NONCOHERENT_CACHE is defined.
Other, more complex solutions are possible, e.g. we could still prefetch
within our own buffer but not touch anything outside it (I'll try to do
some performance testing to determine whether it's worth the effort :).
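
The first option, turning prefetch off at build time, might look roughly
like the user-space sketch below. It is not the linked patch: the
copy_prefetch wrapper, COPY_PREFETCH_ENABLED, copy_lines and the 32-byte
line size are all assumptions made for illustration, and the GCC/Clang
builtin __builtin_prefetch stands in for dcbt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE      32u /* assumed cache line size */
#define PREFETCH_AHEAD 4u  /* look-ahead distance in lines */

/* Hypothetical wrapper: compiles to nothing on non-coherent configurations. */
#ifdef CONFIG_NONCOHERENT_CACHE
#define COPY_PREFETCH_ENABLED 0
#define copy_prefetch(addr) ((void)(addr))
#else
#define COPY_PREFETCH_ENABLED 1
#define copy_prefetch(addr) __builtin_prefetch((const void *)(addr), 0)
#endif

/* Copies nlines cache lines; the look-ahead is unclamped, as in the original loops. */
void copy_lines(void *dst, const void *src, size_t nlines)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(nlines == 0 || (dst != NULL && src != NULL));

    for (size_t i = 0; i < nlines; i++) {
        /* Integer arithmetic: the target may lie past the end of src. */
        uintptr_t target = (uintptr_t)s + (i + PREFETCH_AHEAD) * LINE_SIZE;

        copy_prefetch(target);
        memcpy(d + i * LINE_SIZE, s + i * LINE_SIZE, LINE_SIZE);
    }
}
```

Building with -DCONFIG_NONCOHERENT_CACHE removes every prefetch, which is
safe but gives up the whole benefit of the optimization.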

Patch against linuxppc-2.4: http://kernel.ebshome.net/read_prefetch-2.4.diff

Patch against linuxppc-2.5 (not tested, but compiles):
http://kernel.ebshome.net/read_prefetch-2.6.diff

Comments, suggestions...

Eugene.

P.S. I suspect that "speculative" dcache fetch past the end of the memory
on 440 is really copy_page() prefetching source data. I'm guilty :), it was
my interpretation of the strange machine check we saw a year ago ...


** Sent via the linuxppc-embedded mail list. See http://lists.linuxppc.org/


* Re: Copy prefetch optimizations and non-coherent caches
  2003-09-06  6:07 Copy prefetch optimizations and non-coherent caches Eugene Surovegin
@ 2003-09-06 13:20 ` Paul Mackerras
  2003-09-10  5:54   ` Eugene Surovegin
  2003-10-08 21:51   ` Matt Porter
  0 siblings, 2 replies; 4+ messages in thread
From: Paul Mackerras @ 2003-09-06 13:20 UTC (permalink / raw)
  To: Eugene Surovegin; +Cc: linuxppc-embedded


Eugene Surovegin writes:

> There are read prefetch optimization in several PPC specific functions
> responsible for copying memory (copy_page, __copy_tofrom_user). Current
> implementations will try to prefetch up to 4 (MAX_COPY_PREFETCH) cache
> lines _after_ the end of the source buffer.
>
> Unfortunately, it's not a good idea on non-coherent cache CPUs. This
> prefetching may establish cache lines for memory ranges that require
> exactly the opposite (e.g. DMA read buffer).

You are right.

> I think we should disable prefetch if CONFIG_NONCOHERENT_CACHE is defined.
> Other more complex solutions are possible, e.g. we can still prefetch our
> own buffer but don't touch anything outside (I'll try to do some
> performance testing to determine whether it's worth the effort :).

The measurements I did on a ppc64 kernel indicated that most
copy_tofrom_user calls were either for relatively small buffers
(i.e. less than 256 bytes) or were page-sized and page-aligned.
Therefore I wrote two routines: one optimized for small copies that
doesn't use any prefetching or dcbz's, and one optimized for page-sized
copies.

We could do something similar on ppc32 - we could do the small copy
case with no prefetching (or maybe we could just prefetch on the first
cache line), plus a page-copy case that does prefetching.  If you know
you are doing exactly one page, it shouldn't be hard to set up the
prefetching so you don't prefetch past the end of the source buffer.
In fact it should be possible to code up a relatively simple optimized
copy loop that avoids prefetching outside the source region if we just
assume that the source and destination addresses are
cacheline-aligned, and the size is a multiple of the cacheline size
and is at least 8 (say) cache lines.
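
A loop of the kind described above could be structured as in the following
user-space sketch. It is only an illustration under the stated assumptions
(line-aligned addresses, size a multiple of the line size, at least 8
lines); the function name copy_aligned_lines, the 32-byte line size and the
use of the GCC/Clang builtin __builtin_prefetch in place of dcbt are
assumptions, not code from any kernel tree.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LINE_SIZE      32u /* assumed cache line size */
#define PREFETCH_AHEAD 4u  /* look-ahead distance in lines */
#define MIN_LINES      8u  /* minimum size, in lines, from the text above */

/*
 * Copies len bytes line by line and returns the number of prefetches
 * issued. Callers are assumed to pass cacheline-aligned pointers.
 */
size_t copy_aligned_lines(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t nlines = len / LINE_SIZE;
    size_t issued = 0;
    size_t i;

    assert(dst != NULL && src != NULL);
    assert(len % LINE_SIZE == 0 && nlines >= MIN_LINES);

    /* Prime the pipeline with the first PREFETCH_AHEAD source lines. */
    for (i = 0; i < PREFETCH_AHEAD; i++) {
        __builtin_prefetch(s + i * LINE_SIZE, 0);
        issued++;
    }

    /* Main loop: copy line i while prefetching line i + PREFETCH_AHEAD. */
    for (i = 0; i < nlines - PREFETCH_AHEAD; i++) {
        __builtin_prefetch(s + (i + PREFETCH_AHEAD) * LINE_SIZE, 0);
        issued++;
        memcpy(d + i * LINE_SIZE, s + i * LINE_SIZE, LINE_SIZE);
    }

    /* Tail: the last PREFETCH_AHEAD lines were already prefetched. */
    for (; i < nlines; i++)
        memcpy(d + i * LINE_SIZE, s + i * LINE_SIZE, LINE_SIZE);

    return issued;
}
```

Splitting the loop this way keeps the main loop free of bounds checks: every
source line is prefetched exactly once and no address at or beyond src + len
is ever touched.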

Paul.


* Re: Copy prefetch optimizations and non-coherent caches
  2003-09-06 13:20 ` Paul Mackerras
@ 2003-09-10  5:54   ` Eugene Surovegin
  2003-10-08 21:51   ` Matt Porter
  1 sibling, 0 replies; 4+ messages in thread
From: Eugene Surovegin @ 2003-09-10  5:54 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-embedded


At 06:20 AM 9/6/2003, Paul Mackerras wrote:
> > I think we should disable prefetch if CONFIG_NONCOHERENT_CACHE is defined.
> > Other more complex solutions are possible, e.g. we can still prefetch our
> > own buffer but don't touch anything outside (I'll try to do some
> > performance testing to determine whether it's worth the effort :).

I did some simple testing on Ebony, and it turns out that read prefetching
gives a significant performance boost (~30%).

>The measurements I did on a ppc64 kernel indicated that most
>copy_tofrom_user calls were either for relatively small buffers
>(i.e. less than 256 bytes) or were page-sized and page-aligned.
>Therefore I did two routines, one optimized for small copies that
>didn't use any prefetching or dcbz's, and one optimized for page-sized
>copies.
>
>We could do something similar on ppc32 - we could do the small copy
>case with no prefetching (or maybe we could just prefetch on the first
>cache line), plus a page-copy case that does prefetching.  If you know
>you are doing exactly one page, it shouldn't be hard to set up the
>prefetching so you don't prefetch past the end of the source buffer.
>In fact it should be possible to code up a relatively simple optimized
>copy loop that avoids prefetching outside the source region if we just
>assume that the source and destination addresses are
>cacheline-aligned, and the size is a multiple of the cacheline size
>and is at least 8 (say) cache lines.

Frankly, I don't feel qualified to make such significant changes to
__copy_tofrom_user :).

So, I tried to modify copy_page and __copy_tofrom_user as little as
possible, making them safe for non-coherent caches while still using read
prefetching. The idea is simple: we prefetch as before, but only up to the
end of the source buffer.
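
The idea of stopping the look-ahead at the end of the source buffer can be
illustrated with a small user-space sketch. It does not reproduce the
patches linked below; the function name copy_lines_clamped, the 32-byte line
size and the use of the GCC/Clang builtin __builtin_prefetch in place of
dcbt are assumptions, and the initial warm-up prefetch of the first lines is
left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LINE_SIZE      32u /* assumed cache line size */
#define PREFETCH_AHEAD 4u  /* look-ahead distance in lines */

/*
 * Copies nlines cache lines from src to dst. A prefetch is issued only
 * when its target line is still inside the source buffer. Returns the
 * number of prefetches issued.
 */
size_t copy_lines_clamped(void *dst, const void *src, size_t nlines)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t issued = 0;

    assert(nlines == 0 || (dst != NULL && src != NULL));

    for (size_t i = 0; i < nlines; i++) {
        size_t target = i + PREFETCH_AHEAD;

        if (target < nlines) { /* never look past the last source line */
            __builtin_prefetch(s + target * LINE_SIZE, 0);
            issued++;
        }
        memcpy(d + i * LINE_SIZE, s + i * LINE_SIZE, LINE_SIZE);
    }
    return issued;
}
```

With an 8-line buffer the function issues 4 prefetches (for lines 4 to 7)
and none for the lines that follow the buffer.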

Updated patches:

  - against linuxppc-2.4: http://kernel.ebshome.net/read_prefetch-2.4-2.diff

  - against linuxppc-2.5 (not tested, but compiles):
http://kernel.ebshome.net/read_prefetch-2.6-2.diff

Eugene.



* Re: Copy prefetch optimizations and non-coherent caches
  2003-09-06 13:20 ` Paul Mackerras
  2003-09-10  5:54   ` Eugene Surovegin
@ 2003-10-08 21:51   ` Matt Porter
  1 sibling, 0 replies; 4+ messages in thread
From: Matt Porter @ 2003-10-08 21:51 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Eugene Surovegin, linuxppc-embedded


On Sat, Sep 06, 2003 at 11:20:11PM +1000, Paul Mackerras wrote:
>
> Eugene Surovegin writes:
> > I think we should disable prefetch if CONFIG_NONCOHERENT_CACHE is defined.
> > Other more complex solutions are possible, e.g. we can still prefetch our
> > own buffer but don't touch anything outside (I'll try to do some
> > performance testing to determine whether it's worth the effort :).
>
> The measurements I did on a ppc64 kernel indicated that most
> copy_tofrom_user calls were either for relatively small buffers
> (i.e. less than 256 bytes) or were page-sized and page-aligned.
> Therefore I did two routines, one optimized for small copies that
> didn't use any prefetching or dcbz's, and one optimized for page-sized
> copies.
>
> We could do something similar on ppc32 - we could do the small copy
> case with no prefetching (or maybe we could just prefetch on the first
> cache line), plus a page-copy case that does prefetching.  If you know
> you are doing exactly one page, it shouldn't be hard to set up the
> prefetching so you don't prefetch past the end of the source buffer.
> In fact it should be possible to code up a relatively simple optimized
> copy loop that avoids prefetching outside the source region if we just
> assume that the source and destination addresses are
> cacheline-aligned, and the size is a multiple of the cacheline size
> and is at least 8 (say) cache lines.

It seems that the current version of __copy_tofrom_user() is
optimized for the page-aligned/cacheline-aligned case, i.e. there
doesn't seem to be much overhead in detecting that there are 0 bytes
to the start of a cache line and then jumping to the prefetching
line copy. Would it be enough to simply detect <256-byte copies and
use the non-prefetching byte copy loop for those buffers, while
using the full current version for all other cases?

This is, of course, in combination with Eugene's patch to ensure
that no prefetch past the end of the buffer occurs.
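
The dispatch suggested above might be sketched in user-space C as follows.
This is an illustration only: copy_dispatch, the enum and both helpers are
hypothetical names, the 32-byte line size is an assumption, the GCC/Clang
builtin __builtin_prefetch stands in for dcbt, and the alignment handling of
the real __copy_tofrom_user() is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SMALL_COPY_LIMIT 256u /* threshold discussed in the thread */
#define LINE_SIZE        32u  /* assumed cache line size */
#define PREFETCH_AHEAD   4u   /* look-ahead distance in lines */

enum copy_path { COPY_SMALL, COPY_PREFETCH };

/* Small copies: plain byte loop, no prefetching at all. */
static void copy_small(unsigned char *d, const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        d[i] = s[i];
}

/* Larger copies: line-sized chunks with look-ahead clamped to the buffer. */
static void copy_with_prefetch(unsigned char *d, const unsigned char *s,
                               size_t len)
{
    for (size_t off = 0; off < len; off += LINE_SIZE) {
        size_t ahead = off + PREFETCH_AHEAD * LINE_SIZE;
        size_t chunk = len - off < LINE_SIZE ? len - off : LINE_SIZE;

        if (ahead < len) /* only prefetch bytes that belong to the source */
            __builtin_prefetch(s + ahead, 0);
        memcpy(d + off, s + off, chunk);
    }
}

/* Picks a path by size, performs the copy, and reports which path ran. */
enum copy_path copy_dispatch(void *dst, const void *src, size_t len)
{
    assert(len == 0 || (dst != NULL && src != NULL));

    if (len < SMALL_COPY_LIMIT) {
        copy_small(dst, src, len);
        return COPY_SMALL;
    }
    copy_with_prefetch(dst, src, len);
    return COPY_PREFETCH;
}
```

A 100-byte copy takes the byte loop, while a 512-byte copy takes the
prefetching path and still never prefetches beyond the source buffer.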

-Matt
