* Speeding up dma_unmap
@ 2016-01-27  8:32 Jason Holt
  2016-01-27 11:22 ` Ard Biesheuvel
  2016-01-27 12:23 ` Arnd Bergmann
  0 siblings, 2 replies; 8+ messages in thread
From: Jason Holt @ 2016-01-27  8:32 UTC (permalink / raw)
  To: linux-arm-kernel

I'm new to the DMA API and looking for a sanity check.

As I understand it, dma_unmap_* is slow (for data coming from a device
to the CPU) on some ARM CPUs because the *_inv_range() functions have
to iterate in cache line sized steps through the entire buffer,
telling the cache controller "invalidate this if you have it".

For buffers larger than the size of the data cache, might it be faster
to go the other direction and check each line of the cache to see if
it's inside the buffer, then invalidate it if it is?  (I believe the
buffer must be contiguous in physical memory, so I assume that'd be a
simple bottom <= x < top check).

So for a 256K L2 cache and 4MB buffer, we'd only have to check 256K
worth of cache lines instead of 4MB when we unmap.

Failing that, I suppose a very dirty hack would be to
data_cache_clean_and_invalidate if the only thing I cared about was
getting data from my DMA peripheral as fast as possible.  (I'm on
AM335X and seeing no more than 200MB/s from device to CPU with
dma_unmap_single, whereas the PRUs can write to main memory at
600MB/s.)


* Speeding up dma_unmap
  2016-01-27  8:32 Speeding up dma_unmap Jason Holt
@ 2016-01-27 11:22 ` Ard Biesheuvel
  2016-01-27 12:23 ` Arnd Bergmann
  1 sibling, 0 replies; 8+ messages in thread
From: Ard Biesheuvel @ 2016-01-27 11:22 UTC (permalink / raw)
  To: linux-arm-kernel

On 27 January 2016 at 09:32, Jason Holt <jholt@google.com> wrote:
> I'm new to the DMA API and looking for a sanity check.
>
> As I understand it, dma_unmap_* is slow (for data coming from a device
> to the CPU) on some ARM CPUs because the *_inv_range() functions have
> to iterate in cache line sized steps through the entire buffer,
> telling the cache controller "invalidate this if you have it".
>
> For buffers larger than the size of the data cache, might it be faster
> to go the other direction and check each line of the cache to see if
> it's inside the buffer, then invalidate it if it is?  (I believe the
> buffer must be contiguous in physical memory, so I assume that'd be a
> simple bottom < x < top check).
>
> So for a 256K L2 cache and 4MB buffer, we'd only have to check 256K
> worth of cache lines instead of 4MB when we unmap.
>
> Failing that, I suppose a very dirty hack would be to
> data_cache_clean_and_invalidate if the only thing I cared about was
> getting data from my DMA peripheral as fast as possible.  (I'm on
> AM335X and seeing no more than 200MB/s from device to CPU with
> dma_unmap_single, whereas the PRUs can write to main memory at
> 600MB/s.)
>

This may work in practice, but it violates the architecture, and may
cause hard to diagnose problems on coherent systems.

The reason is that cache maintenance by virtual address and cache
maintenance by set/way are completely different things, and set/way
operations are not broadcast to other cores or system caches, which
means you are not architecturally guaranteed to see the data in main
memory after you have invalidated your caches by set/way. None of this
is likely to affect your Cortex-A8 system, but it is not a good idea
in general. (Note that you would still need to invalidate the entire
cache, since making inferences about the relation between cache
geometry and the layout of physical memory is not portable either, and
since your buffer size exceeds the L2 set size, every cache line could
potentially hold some of your data anyway.)

-- 
Ard.


* Speeding up dma_unmap
  2016-01-27  8:32 Speeding up dma_unmap Jason Holt
  2016-01-27 11:22 ` Ard Biesheuvel
@ 2016-01-27 12:23 ` Arnd Bergmann
  2016-01-27 16:06   ` Catalin Marinas
  1 sibling, 1 reply; 8+ messages in thread
From: Arnd Bergmann @ 2016-01-27 12:23 UTC (permalink / raw)
  To: linux-arm-kernel

On Wednesday 27 January 2016 00:32:56 Jason Holt wrote:
> 
> Failing that, I suppose a very dirty hack would be to
> data_cache_clean_and_invalidate if the only thing I cared about was
> getting data from my DMA peripheral as fast as possible.  (I'm on
> AM335X and seeing no more than 200MB/s from device to CPU with
> dma_unmap_single, whereas the PRUs can write to main memory at
> 600MB/s.)

On your Cortex-A8, we could come up with a way to not invalidate
the cache at all on unmap, as the comment in __dma_page_dev_to_cpu()
says:

        /* FIXME: non-speculating: not required */
        /* in any case, don't bother invalidating if DMA to device */
        if (dir != DMA_TO_DEVICE) {
                outer_inv_range(paddr, paddr + size);

                dma_cache_maint_page(page, off, size, dir, dmac_unmap_area);
        }


We already do a cache-invalidate operation on dma_map(), and the kernel
is not allowed to access the memory in the meantime. On CPU cores
that do speculative prefetching (Cortex-A9 and higher), we may end
up reading cache lines back in randomly on a speculative prefetch,
but as far as I can tell, the Cortex-A8 (or A5/A7) won't do that.

How does the performance change if you hack that file to simply not
do the invalidate?

	Arnd


* Speeding up dma_unmap
  2016-01-27 12:23 ` Arnd Bergmann
@ 2016-01-27 16:06   ` Catalin Marinas
  2016-01-27 18:09     ` Russell King - ARM Linux
  0 siblings, 1 reply; 8+ messages in thread
From: Catalin Marinas @ 2016-01-27 16:06 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jan 27, 2016 at 01:23:27PM +0100, Arnd Bergmann wrote:
> On Wednesday 27 January 2016 00:32:56 Jason Holt wrote:
> > 
> > Failing that, I suppose a very dirty hack would be to
> > data_cache_clean_and_invalidate if the only thing I cared about was
> > getting data from my DMA peripheral as fast as possible.  (I'm on
> > AM335X and seeing no more than 200MB/s from device to CPU with
> > dma_unmap_single, whereas the PRUs can write to main memory at
> > 600MB/s.)
> 
> On your Cortex-A8, we could come up with a way to not invalidate
> the cache at all on unmap, as the comment in __dma_page_dev_to_cpu()
> says:
> 
>         /* FIXME: non-speculating: not required */
>         /* in any case, don't bother invalidating if DMA to device */
>         if (dir != DMA_TO_DEVICE) {
>                 outer_inv_range(paddr, paddr + size);
> 
>                 dma_cache_maint_page(page, off, size, dir, dmac_unmap_area);
>         }
> 
> We already do a cache-invalidate operation on dma_map(), and the kernel
> is not allowed to access the memory in the meantime. On CPU cores
> that do speculative prefetching (Cortex-A9 and higher), we may end
                                   ^^^^^^^^^^^^^^^^^^^^
I would say "Cortex-A9 and newer".

> up reading cache lines back in randomly on a speculative prefetch,
> but as far as I can tell, the Cortex-A8 (or A5/A7) won't do that.

Are you sure about A5 and A7? I'm not even sure about the A8 but there
are good chances that A7 and A5 do speculative prefetches.

-- 
Catalin


* Speeding up dma_unmap
  2016-01-27 16:06   ` Catalin Marinas
@ 2016-01-27 18:09     ` Russell King - ARM Linux
  2016-01-28 10:31       ` Catalin Marinas
  0 siblings, 1 reply; 8+ messages in thread
From: Russell King - ARM Linux @ 2016-01-27 18:09 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jan 27, 2016 at 04:06:30PM +0000, Catalin Marinas wrote:
> On Wed, Jan 27, 2016 at 01:23:27PM +0100, Arnd Bergmann wrote:
> > up reading cache lines back in randomly on a speculative prefetch,
> > but as far as I can tell, the Cortex-A8 (or A5/A7) won't do that.
> 
> Are you sure about A5 and A7? I'm not even sure about the A8 but there
> are good chances that A7 and A5 do speculative prefetches.

I thought when I was re-implementing the DMA API on ARM (which was
around early v7 times) that there were CPUs that did speculative
prefetching, which included the A8.  I seem to remember it was pretty
urgent to have the DMA API fixed for _any_ ARMv7 CPU because of the
speculative prefetching.


-- 
RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


* Speeding up dma_unmap
  2016-01-27 18:09     ` Russell King - ARM Linux
@ 2016-01-28 10:31       ` Catalin Marinas
  2016-01-28 11:20         ` Arnd Bergmann
  0 siblings, 1 reply; 8+ messages in thread
From: Catalin Marinas @ 2016-01-28 10:31 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jan 27, 2016 at 06:09:45PM +0000, Russell King - ARM Linux wrote:
> On Wed, Jan 27, 2016 at 04:06:30PM +0000, Catalin Marinas wrote:
> > On Wed, Jan 27, 2016 at 01:23:27PM +0100, Arnd Bergmann wrote:
> > > up reading cache lines back in randomly on a speculative prefetch,
> > > but as far as I can tell, the Cortex-A8 (or A5/A7) won't do that.
> > 
> > Are you sure about A5 and A7? I'm not even sure about the A8 but there
> > are good chances that A7 and A5 do speculative prefetches.
> 
> I thought when I was re-implementing the DMA API on ARM (which was
> around early v7 times) that there were CPUs that did speculative
> prefetching, which included the A8.  I seem to remember it was pretty
> urgent to have the DMA API fixed for _any_ ARMv7 CPU because of the
> speculative prefetching.

Indeed, it's a safe assumption to say that any ARMv7 CPU performs
speculative accesses. Even if some of them may only do I-cache
prefetching (just guessing), in the presence of a unified L2 this
distinction no longer matters.

-- 
Catalin


* Speeding up dma_unmap
  2016-01-28 10:31       ` Catalin Marinas
@ 2016-01-28 11:20         ` Arnd Bergmann
  2016-01-28 11:49           ` Catalin Marinas
  0 siblings, 1 reply; 8+ messages in thread
From: Arnd Bergmann @ 2016-01-28 11:20 UTC (permalink / raw)
  To: linux-arm-kernel

On Thursday 28 January 2016 10:31:06 Catalin Marinas wrote:
> On Wed, Jan 27, 2016 at 06:09:45PM +0000, Russell King - ARM Linux wrote:
> > On Wed, Jan 27, 2016 at 04:06:30PM +0000, Catalin Marinas wrote:
> > > On Wed, Jan 27, 2016 at 01:23:27PM +0100, Arnd Bergmann wrote:
> > > > up reading cache lines back in randomly on a speculative prefetch,
> > > > but as far as I can tell, the Cortex-A8 (or A5/A7) won't do that.
> > > 
> > > Are you sure about A5 and A7? I'm not even sure about the A8 but there
> > > are good chances that A7 and A5 do speculative prefetches.
> > 
> > I thought when I was re-implementing the DMA API on ARM (which was
> > around early v7 times) that there were CPUs that did speculative
> > prefetching, which included the A8.  I seem to remember it was pretty
> > urgent to have the DMA API fixed for _any_ ARMv7 CPU because of the
> > speculative prefetching.
> 
> Indeed, it's a safe assumption to say that any ARMv7 CPU performs
> speculative accesses. Even if some of them may only do I-cache
> prefetching (just guessing), in the presence of a unified L2 this
> distinction no longer matters.

Ok, I was thrown off by the code comment then, and by my incorrect
assumption that only the out-of-order cores were doing any speculative
execution (prefetch or not). According to the Cortex-A5 TRM, "The
Cortex-A5 MPCore data cache implements an automatic prefetcher that
monitors cache misses done by the processor. When a pattern is detected,
the automatic prefetcher starts linefills in the background."

I have looked at the documentation for a couple of cores and found that:

* Cortex-A9 always does speculative prefetching
* Cortex-A8 does not have this mentioned in the manual, which would
  be a hint that it indeed does not do it at all, but that could be
  wrong. It does explicitly mention prefetching into icache, and
  mentions prefetching using the PLD instruction and the L2 PLE.
* A5/A7/A15/A17 all do prefetching unless disabled in the ACTLR
  register. CPUs that have L2 caches can control this separately
  for L1 and L2 as needed.

This means that there are still some cores on which one could test
whether disabling the prefetching and the flushes in DMA unmap
provides any serious performance boost.

	Arnd


* Speeding up dma_unmap
  2016-01-28 11:20         ` Arnd Bergmann
@ 2016-01-28 11:49           ` Catalin Marinas
  0 siblings, 0 replies; 8+ messages in thread
From: Catalin Marinas @ 2016-01-28 11:49 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jan 28, 2016 at 12:20:55PM +0100, Arnd Bergmann wrote:
> On Thursday 28 January 2016 10:31:06 Catalin Marinas wrote:
> > On Wed, Jan 27, 2016 at 06:09:45PM +0000, Russell King - ARM Linux wrote:
> > > On Wed, Jan 27, 2016 at 04:06:30PM +0000, Catalin Marinas wrote:
> > > > On Wed, Jan 27, 2016 at 01:23:27PM +0100, Arnd Bergmann wrote:
> > > > > up reading cache lines back in randomly on a speculative prefetch,
> > > > > but as far as I can tell, the Cortex-A8 (or A5/A7) won't do that.
> > > > 
> > > > Are you sure about A5 and A7? I'm not even sure about the A8 but there
> > > > are good chances that A7 and A5 do speculative prefetches.
> > > 
> > > I thought when I was re-implementing the DMA API on ARM (which was
> > > around early v7 times) that there were CPUs that did speculative
> > > prefetching, which included the A8.  I seem to remember it was pretty
> > > urgent to have the DMA API fixed for _any_ ARMv7 CPU because of the
> > > speculative prefetching.
> > 
> > Indeed, it's a safe assumption to say that any ARMv7 CPU performs
> > speculative accesses. Even if some of them may only do I-cache
> > prefetching (just guessing), in the presence of a unified L2 this
> > distinction no longer matters.
[...]
> This means that there are still some cores on which one could try
> if disabling the prefetching and the flushes in DMA unmap provides
> any serious performance boost.

I think we need to look at the original use-case. There seems to be a
4MB buffer, how often is this mapped/unmapped? Would it be better off
with the coherent API than the streaming one?

-- 
Catalin

