* dma_sync_single_for_cpu takes a really long time
@ 2015-06-28 20:40 Sylvain Munaut
  2015-06-28 22:30 ` Russell King - ARM Linux
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Sylvain Munaut @ 2015-06-28 20:40 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,


I'm working on a DMA driver that uses the streaming DMA API to
synchronize the access between host and device. The data flow is
exclusively from the device to the host (video grabber).

As such, I call dma_sync_single_for_cpu when the hardware is done
writing a frame to make sure that the cpu gets up to date data when
accessing the zone.

However, this call takes a _long_ time to complete. For a 6-megabyte
buffer, it takes about 13 ms, which is just crazy ... at that rate it'd
be faster to just read random data from a random buffer to trash the
measly 512k of cache ...
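
For reference, the per-frame flow described above is roughly this (a
minimal sketch with placeholder names; start_capture(), wait_for_frame()
and process_frame() stand in for the real driver code, and the buffer is
mapped once with dma_map_single()):

        dma_addr_t dma = dma_map_single(dev, buf, frame_size, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, dma))
                return -ENOMEM;

        while (capturing) {
                /* hand the buffer back to the device for the next frame */
                dma_sync_single_for_device(dev, dma, frame_size, DMA_FROM_DEVICE);
                start_capture(dma);                     /* hypothetical */
                wait_for_frame();                       /* hypothetical */

                /* make the CPU see what the device just wrote;
                 * this is the call that costs ~13 ms here */
                dma_sync_single_for_cpu(dev, dma, frame_size, DMA_FROM_DEVICE);
                process_frame(buf, frame_size);         /* hypothetical */
        }

        dma_unmap_single(dev, dma, frame_size, DMA_FROM_DEVICE);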

Is there any alternative that's faster when dealing with large buffers ?

(The platform is a Zynq 7000 - Dual Cortex A9).


Cheers,

   Sylvain


* dma_sync_single_for_cpu takes a really long time
  2015-06-28 20:40 dma_sync_single_for_cpu takes a really long time Sylvain Munaut
@ 2015-06-28 22:30 ` Russell King - ARM Linux
  2015-06-29  6:07   ` Sylvain Munaut
  2015-06-29  9:09   ` Catalin Marinas
  2015-06-29  6:33 ` Mike Looijmans
  2015-06-29 10:25 ` Arnd Bergmann
  2 siblings, 2 replies; 12+ messages in thread
From: Russell King - ARM Linux @ 2015-06-28 22:30 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Jun 28, 2015 at 10:40:03PM +0200, Sylvain Munaut wrote:
> I'm working on a DMA driver that uses the the streaming DMA API to
> synchronize the access between host and device. The data flow is
> exclusively from the device to the host (video grabber).
> 
> As such, I call dma_sync_single_for_cpu when the hardware is done
> writing a frame to make sure that the cpu gets up to date data when
> accessing the zone.
> 
> However this call takes a _long_ time to complete. For a 6 Megabytes
> buffer, it takes about 13 ms which is just crazy ... at that rate it'd
> be faster to just read random data from a random buffer to trash the
> measly 512k of cache ...

Flushing a large chunk of memory one cache line at a time takes a long
time, there's really nothing "new" about that.

It's the expense that has to be paid for using cacheable mappings on a
CPU which is not DMA coherent - something which I've brought up over
the years with ARM, but it's not something that ARM believe is wanted
by their silicon partners.

What we _could_ do is decide that if the buffer is larger than some
factor of the cache size, to just flush the entire cache.  However, that
penalises the case where none of the data is in the cache - and in all
probability very little of the frame is actually sitting in the cache at
that moment.

However, if you're going to read the entire frame through a cacheable
mapping, you're probably going to end up flushing your cache several
times over through doing that - but that's probably something you're
doing in userspace, and so the kernel doesn't have the knowledge to know
that's what userspace will be doing (nor should it.)

There isn't a trivial solution to this problem, and I wish that ARM had
solved this issue by becoming a DMA-coherent architecture.

-- 
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.


* dma_sync_single_for_cpu takes a really long time
  2015-06-28 22:30 ` Russell King - ARM Linux
@ 2015-06-29  6:07   ` Sylvain Munaut
  2015-06-29  9:08     ` Russell King - ARM Linux
  2015-06-29  9:09   ` Catalin Marinas
  1 sibling, 1 reply; 12+ messages in thread
From: Sylvain Munaut @ 2015-06-29  6:07 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,


Thanks for the quick and detailed answer.


> Flushing a large chunk of memory one cache line at a time takes a long
> time, there's really nothing "new" about that.

So when invalidating the cache, you have to do it for every cache line
address? Isn't there an instruction to invalidate a whole range?


Also, I noticed that dma_sync_single_for_device takes a while too even
though I would have expected it to be a no-op for the FROM_DEVICE case.

I can guarantee that I never wrote to this memory zone, so there is nothing
in any write-back buffer. Is there any way to convey this guarantee to the
API? Or should I just not call dma_sync_single_for_device at all?



> It's the expense that has to be paid for using cacheable mappings on a
> CPU which is not DMA coherent - something which I've brought up over
> the years with ARM, but it's not something that ARM believe is wanted
> by their silicon partners.
>
> What we _could_ do is decide that if the buffer is larger than some
> factor of the cache size, to just flush the entire cache.  However, that
> penalises the case where none of the data is in the cache - and in all
> probably  very little of the frame is actually sitting in the cache at
> that moment.

If I wanted to give that a shot, how would I do that in my module?

As a start, I tried calling outer_inv_all() instead of outer_inv_range(),
but it turned out to be a really bad idea (it just freezes the system).


> However, if you're going to read the entire frame through a cacheable
> mapping, you're probably going to end up flushing your cache several
> times over through doing that

Isn't there some intermediate option between coherent and cacheable, a bit
like write combining but for reads?

After all, I don't really care if the data ends up in the cache; I'd
probably even prefer it didn't. But when reading a word, I'd want it to be
fetched by block, with prefetching and all that stuff.

The Zynq TRM mentions something about having independent control over inner
and outer cacheability, for instance. If only one were enabled, then at least
the other wouldn't have to be invalidated?


Cheers,

    Sylvain


* dma_sync_single_for_cpu takes a really long time
  2015-06-28 20:40 dma_sync_single_for_cpu takes a really long time Sylvain Munaut
  2015-06-28 22:30 ` Russell King - ARM Linux
@ 2015-06-29  6:33 ` Mike Looijmans
  2015-06-29 13:06   ` Sylvain Munaut
  2015-06-29 10:25 ` Arnd Bergmann
  2 siblings, 1 reply; 12+ messages in thread
From: Mike Looijmans @ 2015-06-29  6:33 UTC (permalink / raw)
  To: linux-arm-kernel

On 28-06-15 22:40, Sylvain Munaut wrote:
> Hi,
>
>
> I'm working on a DMA driver that uses the the streaming DMA API to
> synchronize the access between host and device. The data flow is
> exclusively from the device to the host (video grabber).
>
> As such, I call dma_sync_single_for_cpu when the hardware is done
> writing a frame to make sure that the cpu gets up to date data when
> accessing the zone.
>
> However this call takes a _long_ time to complete. For a 6 Megabytes
> buffer, it takes about 13 ms which is just crazy ... at that rate it'd
> be faster to just read random data from a random buffer to trash the
> measly 512k of cache ...

I have the same experience: the cache flush is so slow that it is about as
fast to just memcpy() the whole region.


> Is there any alternative that's faster when dealing with large buffers ?
>
> (The platform is a Zynq 7000 - Dual Cortex A9).

Yes.

You're on a Zynq, and that has an ACP port. Connect through that instead of
an HP port (the interface is almost the same), add "dma-coherent" to the
devicetree, and also add my patch that properly maps this into userspace.

The penalty of the ACP port is that it writes to memory a lot more slowly
(about half of the 600 MB/s you get from the HP port) because of all the
cache administration. The good news is that all memory will be cacheable
once more, and all the dma_sync_... calls will turn into no-ops. You don't
have to change your driver, and the logic also remains the same.


Another approach is to make your software uncached-memory friendly. If you
process the frames sequentially and use NEON instructions to fetch large
aligned chunks for further processing, the absence of caching won't matter much.
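
A rough userspace sketch of that idea (everything here is made up as an
illustration; it needs NEON enabled, e.g. -mfpu=neon, "src" would be the
mapped, uncached frame and "dst" an ordinary cached buffer):

        #include <arm_neon.h>
        #include <stddef.h>
        #include <stdint.h>

        static void copy_frame_neon(uint8_t *dst, const uint8_t *src, size_t len)
        {
                size_t i;

                /* pull 64 bytes per iteration so the uncached reads stay
                 * wide and sequential */
                for (i = 0; i + 64 <= len; i += 64) {
                        uint8x16_t a = vld1q_u8(src + i);
                        uint8x16_t b = vld1q_u8(src + i + 16);
                        uint8x16_t c = vld1q_u8(src + i + 32);
                        uint8x16_t d = vld1q_u8(src + i + 48);

                        vst1q_u8(dst + i,      a);
                        vst1q_u8(dst + i + 16, b);
                        vst1q_u8(dst + i + 32, c);
                        vst1q_u8(dst + i + 48, d);
                }
                for (; i < len; i++)            /* leftover tail bytes */
                        dst[i] = src[i];
        }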

M.


Kind regards,

Mike Looijmans
System Expert

TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com

Please consider the environment before printing this e-mail


* dma_sync_single_for_cpu takes a really long time
  2015-06-29  6:07   ` Sylvain Munaut
@ 2015-06-29  9:08     ` Russell King - ARM Linux
  2015-06-29  9:36       ` Catalin Marinas
  2015-06-29 12:30       ` Sylvain Munaut
  0 siblings, 2 replies; 12+ messages in thread
From: Russell King - ARM Linux @ 2015-06-29  9:08 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jun 29, 2015 at 08:07:52AM +0200, Sylvain Munaut wrote:
> Hi,
> 
> 
> Thanks for the quick and detailed answer.
> 
> 
> > Flushing a large chunk of memory one cache line at a time takes a long
> > time, there's really nothing "new" about that.
> 
> So when invalidating cache, you have to do it for every possible cache line
> address ? There is not an instruction to invalidate a whole range ?

Correct.

ARM did "have a go" at providing an instruction which operated on a cache
range in hardware, but it was a disaster, and was removed later on.  The
disaster about it is if you got an exception (eg, interrupt) while the
instruction was executing, it would stop doing the cache maintanence, and
jump to the exception handler.  When the exception handler returned, it
would restart the instruction, not from where it left off, but from the
very beginning.

With a sufficiently frequent interrupt rate and a large enough area, the
result is very effective at preventing the CPU from making any progress.

> Also, I noticed that dma_sync_single_for_device takes a while too even
> though I would have expected it to be a no-op for the FROM_DEVICE case.

In the FROM_DEVICE case, we perform cache maintenance before the DMA
starts, to ensure that there are no dirty cache lines which may get
evicted and overwrite the newly DMA'd data.

However, we also need to perform cache maintenance after DMA has finished
to ensure that the data in the cache is up to date with the newly DMA'd
data.  During the DMA operation, the CPU can speculatively load data into
its caches, which may or may not be the newly DMA'd data - we just don't
know.

> I can guarantee that I never wrote to this memory zone, so there is nothing
> in any write-back buffer, is there anyway to convey this guarantee to the
> API ? Or should I just not call dma_sync_single_for_device at all ?

It's not about whether you wrote to it.  It's whether the CPU speculatively
loaded data into its cache.

This is one of the penalties of having a non-coherent CPU cache with
features such as speculative prefetching to give a performance boost for
non-DMA cases - the DMA use case gets even worse, because the necessary
cache maintenance overheads double.  You can no longer rely on "this
memory area hasn't been touched by the program, so no data will be loaded
into the cache prior to my access" which you can with non-speculative
prefetching CPUs.

> > It's the expense that has to be paid for using cacheable mappings on a
> > CPU which is not DMA coherent - something which I've brought up over
> > the years with ARM, but it's not something that ARM believe is wanted
> > by their silicon partners.
> >
> > What we _could_ do is decide that if the buffer is larger than some
> > factor of the cache size, to just flush the entire cache.  However, that
> > penalises the case where none of the data is in the cache - and in all
> > probably  very little of the frame is actually sitting in the cache at
> > that moment.
> 
> If I wanted to give that a shot, how would I do that in my module ?
> 
> As a start, I tried calling outer_inv_all() instead of outer_inv_range(),
> but it turned out to be a really bad idea (just freezes the system)

_Invalidating_ the L2 destroys data in the cache which may not have been
written back - it's effectively undoing the data modifications that have
yet to be written back to memory.  That will cause things to break.

Also, the L2 cache has problems if you use the _all() functions (which
operate on cache set/way) and another CPU also wants to do some other
operation (like a sync, as part of a barrier.)

The trade-off is either never to use the _all() functions while other CPUs
are running, or pay a heavy penalty on every IO access and Linux memory
barrier caused by having to spinlock every L2 cache operation, and run
all L2 operations with interrupts disabled.

> > However, if you're going to read the entire frame through a cacheable
> > mapping, you're probably going to end up flushing your cache several
> > times over through doing that
> 
> Isn't there some intermediary between coherent and cacheable, a bit like
> write combine for read ?

Unfortunately not.  IIRC, some CPUs like PXA had a "read buffer" which
would do that, but that was a PXA specific extension, and never became
part of the ARM architecture itself.

> The Zynq TRM mention something about having independent control on inner
> and outer cacheability for instance. If only one was enabled, then at least
> the other wouldn't have to be invalidated ?

We then start running into other problems: there are only 8 memory types,
7 of which are usable (one is "implementation specific").  All of these
are already used by Linux...

I do feel your pain in this.  I think there has been some pressure on this
issue, because ARM finally made a coherent bus available on SMP systems,
which silicon vendors can use to maintain coherency with the caches.  It's
then up to silicon vendors to use that facility.

-- 
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.


* dma_sync_single_for_cpu takes a really long time
  2015-06-28 22:30 ` Russell King - ARM Linux
  2015-06-29  6:07   ` Sylvain Munaut
@ 2015-06-29  9:09   ` Catalin Marinas
  1 sibling, 0 replies; 12+ messages in thread
From: Catalin Marinas @ 2015-06-29  9:09 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Jun 28, 2015 at 11:30:40PM +0100, Russell King - ARM Linux wrote:
> On Sun, Jun 28, 2015 at 10:40:03PM +0200, Sylvain Munaut wrote:
> > I'm working on a DMA driver that uses the the streaming DMA API to
> > synchronize the access between host and device. The data flow is
> > exclusively from the device to the host (video grabber).
> > 
> > As such, I call dma_sync_single_for_cpu when the hardware is done
> > writing a frame to make sure that the cpu gets up to date data when
> > accessing the zone.
> > 
> > However this call takes a _long_ time to complete. For a 6 Megabytes
> > buffer, it takes about 13 ms which is just crazy ... at that rate it'd
> > be faster to just read random data from a random buffer to trash the
> > measly 512k of cache ...
> 
> Flushing a large chunk of memory one cache line at a time takes a long
> time, there's really nothing "new" about that.
> 
> It's the expense that has to be paid for using cacheable mappings on a
> CPU which is not DMA coherent - something which I've brought up over
> the years with ARM, but it's not something that ARM believe is wanted
> by their silicon partners.

You are slightly mistaken here. Over the years ARM introduced ACP, AMBA4
ACE, AMBA5 CHI. See Mike's reply in this thread about the presence of
the ACP on this system and some of its drawbacks. But even if there are
options to make a device coherent, vendors may decide not to for various
reasons (cache look-up latency, increased power, cache thrashing; the
latter was actually a case for introducing bit 22 in PL310 aux ctrl).
But it's definitely worth testing.

> What we _could_ do is decide that if the buffer is larger than some
> factor of the cache size, to just flush the entire cache.

This is not a safe operation for the inner cache in an SMP system unless
you disable the MMUs (some kind of stop_machine call with temporary loss
of coherency). The set/way ops are not broadcast to the other CPUs.

> However, if you're going to read the entire frame through a cacheable
> mapping, you're probably going to end up flushing your cache several
> times over through doing that - but that's probably something you're
> doing in userspace, and so the kernel doesn't have the knowledge to know
> that's what userspace will be doing (nor should it.)

Alternatively, create a non-cacheable mapping for this buffer.

-- 
Catalin


* dma_sync_single_for_cpu takes a really long time
  2015-06-29  9:08     ` Russell King - ARM Linux
@ 2015-06-29  9:36       ` Catalin Marinas
  2015-06-29 12:30       ` Sylvain Munaut
  1 sibling, 0 replies; 12+ messages in thread
From: Catalin Marinas @ 2015-06-29  9:36 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jun 29, 2015 at 10:08:04AM +0100, Russell King - ARM Linux wrote:
> On Mon, Jun 29, 2015 at 08:07:52AM +0200, Sylvain Munaut wrote:
> > > However, if you're going to read the entire frame through a cacheable
> > > mapping, you're probably going to end up flushing your cache several
> > > times over through doing that
> > 
> > Isn't there some intermediary between coherent and cacheable, a bit like
> > write combine for read ?
> 
> Unfortunately not.  IIRC, some CPUs like PXA had a "read buffer" which
> would do that, but that was a PXA specific extension, and never became
> part of the ARM architecture itself.

I'm not familiar with the PXA implementation but on A9, in combination
with the ACP, you can get "cacheable read no-allocate, write
no-allocate" accesses from the device side. This allows the CPU to just
use cacheable accesses.

If you don't want read-allocate for PL310, you can configure the ACP
outer cacheability attributes as shareable, non-cacheable and leave bit
22 in the PL310 aux ctrl register cleared. However, the latter requires
that all DMA goes through the ACP.

> > The Zynq TRM mention something about having independent control on inner
> > and outer cacheability for instance. If only one was enabled, then at least
> > the other wouldn't have to be invalidated ?
> 
> We then start running into other problems: there are only 8 memory types,
> 7 of which are usable (one is "implementation specific").  All of these
> are already used by Linux...

With the ACP, it's just hardware configuration so we wouldn't need
additional memory types in the kernel.

(as for the memory types, some of them are never used at the same time:
writeback, writealloc, writethrough)

-- 
Catalin


* dma_sync_single_for_cpu takes a really long time
  2015-06-28 20:40 dma_sync_single_for_cpu takes a really long time Sylvain Munaut
  2015-06-28 22:30 ` Russell King - ARM Linux
  2015-06-29  6:33 ` Mike Looijmans
@ 2015-06-29 10:25 ` Arnd Bergmann
  2 siblings, 0 replies; 12+ messages in thread
From: Arnd Bergmann @ 2015-06-29 10:25 UTC (permalink / raw)
  To: linux-arm-kernel

On Sunday 28 June 2015 22:40:03 Sylvain Munaut wrote:
> Hi,
> 
> 
> I'm working on a DMA driver that uses the the streaming DMA API to
> synchronize the access between host and device. The data flow is
> exclusively from the device to the host (video grabber).
> 
> As such, I call dma_sync_single_for_cpu when the hardware is done
> writing a frame to make sure that the cpu gets up to date data when
> accessing the zone.
> 
> However this call takes a _long_ time to complete. For a 6 Megabytes
> buffer, it takes about 13 ms which is just crazy ... at that rate it'd
> be faster to just read random data from a random buffer to trash the
> measly 512k of cache ...
> 
> Is there any alternative that's faster when dealing with large buffers ?
> 
> (The platform is a Zynq 7000 - Dual Cortex A9).

If the frame grabber is implemented in the FPGA, try using the
coherency port instead of the noncoherent port to memory and mark
the device as "dma-coherent", to avoid the explicit flushes.

Another alternative would be to use uncached memory for the buffer
and then read it using an optimized loop from the CPU, but that may
not fit your usage pattern.
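
A minimal sketch of that second alternative (names are placeholders):
on a non-coherent system like this, dma_alloc_coherent() hands back an
uncacheable mapping, so no dma_sync_* calls are needed, at the price of
uncached CPU reads:

        dma_addr_t dma;
        void *buf = dma_alloc_coherent(dev, frame_size, &dma, GFP_KERNEL);
        if (!buf)
                return -ENOMEM;

        /* point the frame grabber at "dma" and read the frames through
         * "buf" with a wide, sequential loop - no sync calls needed */

        dma_free_coherent(dev, frame_size, buf, dma);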

	Arnd


* dma_sync_single_for_cpu takes a really long time
  2015-06-29  9:08     ` Russell King - ARM Linux
  2015-06-29  9:36       ` Catalin Marinas
@ 2015-06-29 12:30       ` Sylvain Munaut
  2015-06-29 13:29         ` Russell King - ARM Linux
  1 sibling, 1 reply; 12+ messages in thread
From: Sylvain Munaut @ 2015-06-29 12:30 UTC (permalink / raw)
  To: linux-arm-kernel

>> I can guarantee that I never wrote to this memory zone, so there is nothing
>> in any write-back buffer, is there anyway to convey this guarantee to the
>> API ? Or should I just not call dma_sync_single_for_device at all ?
>
> It's not about whether you wrote to it.  It's whether the CPU speculatively
> loaded data into its cache.

That I don't understand.

I see how that zone can be in the cache (even though, if it's on page
boundaries, it should never have been speculatively prefetched, right?).
But if I never wrote to that buffer, the cache lines for it can't
possibly be marked as 'dirty'.

So doing a 'clean' on them should end up doing nothing, and the
sequence for a FROM_DEVICE exchange should be:

while (<transfer in progress>) {
   - Give the buffer to DMA ( dma_sync_single_for_device ) : Should be
no-op, but is not <===
   - Let the DMA do the write
   - Invalidate cache ( dma_sync_single_for_cpu )
   - Let the CPU do its thing on the data
}

Now I _do_ see that on the very first usage of the buffer I'd need to
do a clean. Because that memory could have been used for something
else before. But if I keep re-using that buffer and never write to it,
that only needs to be done once.


>> As a start, I tried calling outer_inv_all() instead of outer_inv_range(),
>> but it turned out to be a really bad idea (just freezes the system)
>
> _Invalidating_ the L2 destroyes data in the cache which may not have been
> written back - it's effectively undoing the data modifications that have
> yet to be written back to memory.  That's will cause things to break.

Yeah, that was a really dumb thing to try ... after reading up last
night on how caching works in ARM A9 and the PL310 L2, I see that.


> Also, the L2 cache has problems if you use the _all() functions (which
> operate on cache set/way) and another CPU also wants to do some other
> operation (like a sync, as part of a barrier.)

Oh, so even outer_flush_all() is not usable?


Cheers,

    Sylvain


* dma_sync_single_for_cpu takes a really long time
  2015-06-29  6:33 ` Mike Looijmans
@ 2015-06-29 13:06   ` Sylvain Munaut
  2015-06-29 13:24     ` Mike Looijmans
  0 siblings, 1 reply; 12+ messages in thread
From: Sylvain Munaut @ 2015-06-29 13:06 UTC (permalink / raw)
  To: linux-arm-kernel

@ Mike & @ Arnd :

Thanks for your suggestions.


> I have the same experience: The cache flush is so slow, that it is about as
> fast to just memcpy() the whole region.

So far it even looks like invalidating L1 takes 8 ms and L2 takes 4 ms,
which is pretty weird, since the L1 invalidate is a pretty tight loop;
why would invalidating something smaller and closer to the CPU take more
time? The loop is just:

1:
        mcr     p15, 0, r0, c7, c6, 1           @ invalidate D / U line
        add     r0, r0, r2                      @ advance by one cache line (r2 = line size)
        cmp     r0, r1                          @ reached the end address?
        blo     1b                              @ not yet: next line


Unless somehow I end up having highmem pages in there and the
dma_cache_maint_page loop has more work to do than I think.


> You're on a Zynq, and that has an ACP port. Connect through that instead of
> an HP port (interface is almost the same), add "dma-coherent" to the
> devicetree and also add my patch that properly maps this into userspace.
>
> The penalty of the ACP port is that it will write a lot slower to the memory
> (about half the speed of the 600MB/s you get from the HP port) because of
> all the cache administration. The good news is that all memory will be
> cacheable once more, and all the dma_sync_... calls will turn into no-ops.
> You don't have to change your driver and the logic also remains the same.

That's a pretty big downside. 600 MB/s write speed is already pretty
low (I mean, the raw DDR bandwidth should be close to 4 GB/s; sure, it's
DDR so you can never reach that, but for large, purely sequential accesses
I expected to get closer than that).

Also, doesn't having to share impact the ARM's access performance too much?

I guess the best flags to use for this are coherent write request
without L2 allocation.


> Another approach is to make your software uncached-memory friendly. If you
> process the frames sequentially and use NEON instructions to fetch large
> aligned chunks for further processing, the absense of caching won't matter
> much.

Yes, that was the next thing I was going to try.

Does using pre-load make any sense for uncached memory? I guess not.


Cheers,

   Sylvain


* dma_sync_single_for_cpu takes a really long time
  2015-06-29 13:06   ` Sylvain Munaut
@ 2015-06-29 13:24     ` Mike Looijmans
  0 siblings, 0 replies; 12+ messages in thread
From: Mike Looijmans @ 2015-06-29 13:24 UTC (permalink / raw)
  To: linux-arm-kernel

>> You're on a Zynq, and that has an ACP port. Connect through that instead of
>> an HP port (interface is almost the same), add "dma-coherent" to the
>> devicetree and also add my patch that properly maps this into userspace.
>>
>> The penalty of the ACP port is that it will write a lot slower to the memory
>> (about half the speed of the 600MB/s you get from the HP port) because of
>> all the cache administration. The good news is that all memory will be
>> cacheable once more, and all the dma_sync_... calls will turn into no-ops.
>> You don't have to change your driver and the logic also remains the same.
>
> That's a pretty big downside. 600 M/s write speed is already pretty
> low (I mean, DDR raw bw should be close to 4G/s, sure it's DDR so you
> can never reach that but still for large purely sequential access I
> expected to get closer than that).

I'm just repeating what's in the Zynq documentation. I did measure 599 MB/s
(simultaneously reading and writing at that speed), so it lives up to that.
The 600 MB/s appears to be a limitation of the HP port, not the DDR controller.

Xilinx also mentions 1200 MB/s for the ACP port in the same document, but
that's only the case when reading/writing data that is already in the L2 cache.

> Also, doesn't that impact the ARM access performance too much to have to share ?

That I haven't tested. I don't know if the snoop unit may become a bottleneck
here. I'd expect not, since the CPU interface is a lot faster than what the
ACP uses.

> I guess the best flags to use for this are coherent write request
> without L2 allocation.

That's the situation where you'll get about half the HP performance. It's
the ACP-to-DDR path that is slow.

If you want to process the data fast, use smaller chunks (32k or 64k works 
well) so that all data fits in the L2 cache. Use a bit less than 512k (the L2 
cache size) of buffer memory (for example 6x64k) and have the CPU process it 
in those small chunks as it arrives. Let the CPU "touch" all buffers so that 
they are present in the L2 cache before the logic reads or writes them.
Simply put: Process scan lines, not whole frames. That would make the data 
never hit DDR at all, and raise the processing speed by a significant factor.
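
(If you stay on the streaming API rather than the ACP, the same "small
chunks" idea can be sketched as syncing only the slice that has just
arrived; everything below except the dma_*/min_t calls is hypothetical,
and CHUNK would be tuned against the L2 size.)

        /* "dev", "dma", "buf" (u8 *) and "frame_size" refer to the
         * driver's existing dma_map_single() mapping */
        const size_t CHUNK = 64 * 1024;
        size_t off;

        for (off = 0; off < frame_size; off += CHUNK) {
                size_t len = min_t(size_t, CHUNK, frame_size - off);

                wait_until_device_passed(off + len);    /* hypothetical */
                dma_sync_single_range_for_cpu(dev, dma, off, len,
                                              DMA_FROM_DEVICE);
                process_slice(buf + off, len);          /* hypothetical */
        }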


>> Another approach is to make your software uncached-memory friendly. If you
>> process the frames sequentially and use NEON instructions to fetch large
>> aligned chunks for further processing, the absense of caching won't matter
>> much.
>
> Yes, that was the next thing I was going to try.
>
> Does using pre-load make anysense for uncached ? I guess not.

You could do some "preloading" by interleaving fetch and process instructions, 
so the CPU has some work to do while waiting for the DDR data. I haven't 
experimented with that either.


* dma_sync_single_for_cpu takes a really long time
  2015-06-29 12:30       ` Sylvain Munaut
@ 2015-06-29 13:29         ` Russell King - ARM Linux
  0 siblings, 0 replies; 12+ messages in thread
From: Russell King - ARM Linux @ 2015-06-29 13:29 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jun 29, 2015 at 02:30:19PM +0200, Sylvain Munaut wrote:
> >> I can guarantee that I never wrote to this memory zone, so there is nothing
> >> in any write-back buffer, is there anyway to convey this guarantee to the
> >> API ? Or should I just not call dma_sync_single_for_device at all ?
> >
> > It's not about whether you wrote to it.  It's whether the CPU speculatively
> > loaded data into its cache.
> 
> That I don't understand.
> 
> I see how that zone can be in cache (even though if it's on page
> boundaries, it should never have been speculatively prefetched right
> ?). But if I never wrote to that buffer, the cache lines for it can't
> possibly be marked as 'dirty'.

Any cache line can be speculatively prefetched, and remember that cache
lines are naturally aligned, there's nothing special about page boundaries
as such (a page boundary is also a cache line boundary.)  So it's possible
for the cache lines on either side of a page boundary to be speculated if
there is a cacheable mapping present.

> So doing a 'clean' on them should end up doing nothing. and the
> sequence for a FROM_DMA exchange should be :
> 
> while (<transfer in progress>) {
>    - Give the buffer to DMA ( dma_sync_single_for_device ) : Should be
> no-op, but is not <===
>    - Let the DMA do the write
>    - Invalidate cache ( dma_sync_single_for_cpu )
>    - Let the CPU do its thing on the data
> }
> 
> Now I _do_ see that on the very first usage of the buffer I'd need to
> do a clean. Because that memory could have been used for something
> else before. But if I keep re-using that buffer and never write to it,
> that only need to be done once.

Hmm...  I _guess_ there is no reason why:

	addr = dma_map_single(dev, virt, size, DMA_FROM_DEVICE);
	while (whatever) {
		let device DMA to addr
		dma_sync_single_for_cpu(dev, addr, size, DMA_FROM_DEVICE);
		let CPU _read_ the buffer
	}
	dma_unmap_single(dev, addr, size, DMA_FROM_DEVICE);

would not be safe - there are places in the DMA API documentation that
suggest that is valid (Documentation/DMA-API-HOWTO.txt), provided the
CPU does not write to the buffer.

The implication in the above document is that a driver can eliminate the
dma_sync_single_for_device() call _iff_ the code does not write to the
buffer.  In other words, it's up to the driver to omit that call depending
on the driver's coded behaviour, but it's not the architecture's decision
to make dma_sync_single_for_device(..., DMA_FROM_DEVICE) be a no-op
(since the architecture code can't know whether the driver wrote to the
buffer for some reason.)

There is a final point to remember when dealing with the above, and that
is whether you have cache lines overlapping at the beginning and/or end
of the buffer which may be written to, which could then be evicted,
overwriting the DMA'd data.

The above should work with the existing implementation, so I'd encourage
you to try it and report back.
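
On the overlapping-cache-line point: one simple way to avoid it (sketch
only, names made up) is to give the buffer its own whole pages, so that
nothing else can share a cache line with either end of it:

        unsigned long buf = __get_free_pages(GFP_KERNEL,
                                             get_order(frame_size));
        if (!buf)
                return -ENOMEM;
        /* then dma_map_single(dev, (void *)buf, frame_size,
         * DMA_FROM_DEVICE) as above */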

> > Also, the L2 cache has problems if you use the _all() functions (which
> > operate on cache set/way) and another CPU also wants to do some other
> > operation (like a sync, as part of a barrier.)
> 
> Oh so even outer_flush_all() is not usable ?

Correct - there are only three users of outer_flush_all() and all those
users only call the function in paths where the other CPUs have already
been shut down and IRQs are disabled.

-- 
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.

