All of lore.kernel.org
 help / color / mirror / Atom feed
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
@ 2015-04-23 11:52 Mike Looijmans
  2015-04-23 12:32 ` Arnd Bergmann
  0 siblings, 1 reply; 34+ messages in thread
From: Mike Looijmans @ 2015-04-23 11:52 UTC (permalink / raw)
  To: linux-arm-kernel

?I'm writing a driver that transfers data on a Zynq-7000 between the ARM and PL 
part, using a buffering scheme similar to IIO. So it allocates buffers, which 
the user can memory-map and then send/receive using ioctl calls.
The trouble I have it that transitioning these buffers into user space is 
costing more CPU time than actually copying them.


The self-written DMA controller is in logic, can queue up several transfers 
and operates fine.

The trouble I have with the driver is that for the DMA transfers, I want to 
use a zero-copy style interface, because the data usually consists of video 
frames, and moving them around in memory is prohibitively expensive.

I based my implementation on what IIO (industrial IO) does, and implemented 
IOCTL calls to the driver to allocate, free, enqueue and dequeue blocks that 
are owned by the driver. Each block can be mmapped into user space.

Using dma_alloc_coherent to allocate the blocks, and then just using them 
without any extra measures works just fine. The system can transfer data at 
600MB/s to and from DDR into logic with very little CPU intervention. However, 
the memory returned by dma_mmap_coherent appears to be uncached, because 
accessing this area from userspace is horribly slow (e.g. reading 200MB 
byte-by-byte in a simple for loop takes 20 seconds, while this takes about a 
second in malloced memory)

After reading some documentation, I decided that I should use a streaming DMA 
interface because my driver knows exactly when logic or CPU "owns" the data 
blocks. So instead of the "coherent" functions, I just kmalloc these buffers 
and then use dma_map_single_* to initialize them. Before and after DMA 
transfers, I call the appropriate dma_sync_single_for_* method.

By mapping the kmalloced memory into user space, I once again have speedy 
access to this memory, and caching is enabled. Data transfers also work well 
and extensive testing shows that this works well. However, the in-kernel 
performance is now completely crippled. The system spends so much time in the 
dma_sync_single... calls, the the CPU now becomes a limiting factor. This 
limits the transfer speeds to about only 150MB/s. This method is only about 
20% more CPU intensive than just copying the data from the DMA buffer into a 
user buffer using a copy_to_user call!


Since my DMA controller is pretty smart, I also experimented with transfers 
directly from user memory. This boiled down to calling "get_user_pages" and 
then constructing a scatter-gather list using sg_init_table and then adding 
those user pages. Then call "dma_map_sg" to translate and coalesce the pages 
into DMA requests. Just this pagetable housekeeping took about the same amount 
of processing time as the copy_from_user call, which made me abandon that code 
before even getting to the point of actually transferring the data.

Based on that experience, I'd think the dma_sync calls do similar things 
(walking page tables and changing some attributes) and that is where they 
spend so much time.

I also tried cheating by not calling the dma_sync methods at all, but this 
(surprisingly) led to hangups in the driver. I would have only expected data 
corruption, not "hang". I'm still investigating that route. Specifying 
"dma-coherent" has the same effect as it basically "nulls" the sync methods.

Also tried to replace the "dma_mmap_coherent" call by a simple 
"remap_pfn_range" so as to prevent setting the cache attributes on that 
region, but that didn't have any effect at all, it appears that the 
non-cachable property was already applied in the dma_alloc_coherent method.


I added some timing code to the "sync" calls, this is what I get (numbers in 
microseconds) when using 1MB blocks of streaming DMA transfers:

dma_sync_single_for_device(TO_DEVICE): 3336
dma_sync_single_for_device(FROM_DEVICE): 1991
dma_sync_single_for_cpu(FROM_DEVICE): 2175
dma_sync_single_for_cpu(TO_DEVICE): 0
dma_sync_single_for_device(TO_DEVICE): 3152
dma_sync_single_for_device(FROM_DEVICE): 1990
dma_sync_single_for_cpu(FROM_DEVICE): 2193
dma_sync_single_for_cpu(TO_DEVICE): 0


As you can see, the system spends 2 or 3 ms on "housekeeping" for each 
transition, except the cpu(TO_DEVICE) one which appears to be free which is 
perfectly logical because returning the outgoing buffer to the CPU should not 
need any special cache handling. I would have expected the 
for_device(FROM_DEVICE) to be free as well, but surprisingly this one takes up 
2ms as well.

Adding the numbers, it takes over 7 ms overhead to transfer 1MB data, hence 
1MB/0.007s or about 150MB/s would be the maximum possible data transfer rate.


Can anyone here offer some advise on this?


Kind regards,

Mike Looijmans
System Expert

TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com

Please consider the environment before printing this e-mail

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-04-23 11:52 dma_alloc_coherent versus streaming DMA, neither works satisfactory Mike Looijmans
@ 2015-04-23 12:32 ` Arnd Bergmann
  2015-04-23 13:05   ` Mike Looijmans
  2015-04-29  8:47   ` Mike Looijmans
  0 siblings, 2 replies; 34+ messages in thread
From: Arnd Bergmann @ 2015-04-23 12:32 UTC (permalink / raw)
  To: linux-arm-kernel

On Thursday 23 April 2015 13:52:34 Mike Looijmans wrote:
> Can anyone here offer some advise on this?
> 

The problem you are experiencing is a direct result of using hardware
without cache-coherency from user space. There is no software workaround
for this: If you want data to be cacheable *and* avoid doing manual cache
flushes each time data is passed between user space and hardware, you have
to use hardware that is cache-coherent.

You mentioned that you are using 'Zynq', which supports cache-coherent
DMA using the 'accelerator coherency port'. If you are able to connect
your device to that port, it should work, otherwise you should consider
using a different platform.

	Arnd

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-04-23 12:32 ` Arnd Bergmann
@ 2015-04-23 13:05   ` Mike Looijmans
  2015-04-29  8:47   ` Mike Looijmans
  1 sibling, 0 replies; 34+ messages in thread
From: Mike Looijmans @ 2015-04-23 13:05 UTC (permalink / raw)
  To: linux-arm-kernel

?On 23-04-15 14:32, Arnd Bergmann wrote:
> On Thursday 23 April 2015 13:52:34 Mike Looijmans wrote:
>> Can anyone here offer some advise on this?
>>
>
> The problem you are experiencing is a direct result of using hardware
> without cache-coherency from user space. There is no software workaround
> for this: If you want data to be cacheable *and* avoid doing manual cache
> flushes each time data is passed between user space and hardware, you have
> to use hardware that is cache-coherent.

I can live with manual flushes. I know exactly what and where to flush in the 
driver.

What bothers me so much is that just invalidating the data cache(s) takes 
longer than actually reading the data from uncached memory and writing it into 
cached memory elsewhere. That just does not sound right to me.

> You mentioned that you are using 'Zynq', which supports cache-coherent
> DMA using the 'accelerator coherency port'. If you are able to connect
> your device to that port, it should work, otherwise you should consider
> using a different platform.

I can probably hook up the DMA controller to the ACP port, however, I would 
expect this to have serious negative impact on system performance. The 
transfers are typically in the megabyte range (for example, video frames) so 
they will not fit into the L2 cache (which is 512k on the Zynq). I'd expect 
these transfers, when routed through the ACP, to "push" all data out of the L2 
cache, replacing it with data that isn't going to be read any time soon, and 
thus interfere with performance too much.

It might be worth a try though to see what will actually happen.

Mike.


Kind regards,

Mike Looijmans
System Expert

TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com

Please consider the environment before printing this e-mail

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-04-23 12:32 ` Arnd Bergmann
  2015-04-23 13:05   ` Mike Looijmans
@ 2015-04-29  8:47   ` Mike Looijmans
  2015-04-29  9:01     ` Arnd Bergmann
  1 sibling, 1 reply; 34+ messages in thread
From: Mike Looijmans @ 2015-04-29  8:47 UTC (permalink / raw)
  To: linux-arm-kernel

?On 23-04-15 14:32, Arnd Bergmann wrote:
> On Thursday 23 April 2015 13:52:34 Mike Looijmans wrote:
>> Can anyone here offer some advise on this?
>>
>
> The problem you are experiencing is a direct result of using hardware
> without cache-coherency from user space. There is no software workaround
> for this: If you want data to be cacheable *and* avoid doing manual cache
> flushes each time data is passed between user space and hardware, you have
> to use hardware that is cache-coherent.
>
> You mentioned that you are using 'Zynq', which supports cache-coherent
> DMA using the 'accelerator coherency port'. If you are able to connect
> your device to that port, it should work, otherwise you should consider
> using a different platform.

I tried as you suggested, and used the ACP instead of the HP to connect the 
logic to the CPU.

Without any further changes, this passes all tests with the exact same 
performance numbers. The reason for that is that the driver/DMA framework is 
unaware of the "coherency" hardware and still uses the manual cache 
flush/invalidate routines, so I figured out that adding a "dma-coherent" 
property to the devicetree node changes that.

However, with "dma-coherent" set for my driver, the system locks up at random 
points in the tests. Simple memory transfer tests fail with data mismatches 
(probably stale cache results). Running DMA tests usually results in the 
system completely locking up at some point.
Normal register read/write access is done through the AXI bus directly, not 
using the ACP at all.

Is the ACP hardware broken, is there some extra things my driver needs to be 
aware of, or is there something else I need to do here?


Kind regards,

Mike Looijmans
System Expert

TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com

Please consider the environment before printing this e-mail

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-04-29  8:47   ` Mike Looijmans
@ 2015-04-29  9:01     ` Arnd Bergmann
  2015-04-29  9:17       ` Russell King - ARM Linux
  0 siblings, 1 reply; 34+ messages in thread
From: Arnd Bergmann @ 2015-04-29  9:01 UTC (permalink / raw)
  To: linux-arm-kernel

On Wednesday 29 April 2015 10:47:05 Mike Looijmans wrote:
> On 23-04-15 14:32, Arnd Bergmann wrote:
> > On Thursday 23 April 2015 13:52:34 Mike Looijmans wrote:
> >> Can anyone here offer some advise on this?
> >>
> >
> > The problem you are experiencing is a direct result of using hardware
> > without cache-coherency from user space. There is no software workaround
> > for this: If you want data to be cacheable *and* avoid doing manual cache
> > flushes each time data is passed between user space and hardware, you have
> > to use hardware that is cache-coherent.
> >
> > You mentioned that you are using 'Zynq', which supports cache-coherent
> > DMA using the 'accelerator coherency port'. If you are able to connect
> > your device to that port, it should work, otherwise you should consider
> > using a different platform.
> 
> I tried as you suggested, and used the ACP instead of the HP to connect the 
> logic to the CPU.
> 
> Without any further changes, this passes all tests with the exact same 
> performance numbers. The reason for that is that the driver/DMA framework is 
> unaware of the "coherency" hardware and still uses the manual cache 
> flush/invalidate routines, so I figured out that adding a "dma-coherent" 
> property to the devicetree node changes that.

Correct.

> However, with "dma-coherent" set for my driver, the system locks up at random 
> points in the tests. Simple memory transfer tests fail with data mismatches 
> (probably stale cache results). Running DMA tests usually results in the 
> system completely locking up at some point.
> Normal register read/write access is done through the AXI bus directly, not 
> using the ACP at all.
> 
> Is the ACP hardware broken, is there some extra things my driver needs to be 
> aware of, or is there something else I need to do here?

You still need to synchronize MMIO register accesses with write buffers,
as the readl() and writel() functions do in the kernel.

In particular, after you have written a buffer to memory from the CPU,
you will need to do an outer_sync() before the MMIO write that triggers
the DMA. This is still much cheaper than doing the cache flush though.

On the inbound side, you normally need an MMIO read followed by a dsb
instruction to ensure that data is visible in the cache after you have
received an interrupt from the device. Again that is what readl()
does, but it won't be implied if you do the MMIO from user space.

Another possible problem would be if the driver mmaps the buffer in
uncached mode to user space. This is something your kernel driver has
to get right, it won't be handled automatically by setting the
"dma-coherent" property in DT.

There could of course be some other problem either in your FPGA code,
in your driver, or in your user space. I would not assume that the zynq
ACP itself is broken though.

	Arnd

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-04-29  9:01     ` Arnd Bergmann
@ 2015-04-29  9:17       ` Russell King - ARM Linux
  2015-04-29  9:47         ` Mike Looijmans
  0 siblings, 1 reply; 34+ messages in thread
From: Russell King - ARM Linux @ 2015-04-29  9:17 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Apr 29, 2015 at 11:01:35AM +0200, Arnd Bergmann wrote:
> You still need to synchronize MMIO register accesses with write buffers,
> as the readl() and writel() functions do in the kernel.
> 
> In particular, after you have written a buffer to memory from the CPU,
> you will need to do an outer_sync() before the MMIO write that triggers
> the DMA. This is still much cheaper than doing the cache flush though.

Note that outer_sync() is already done by readl/writel and/or the write
memory barriers (mb()/wmb()).

> Another possible problem would be if the driver mmaps the buffer in
> uncached mode to user space. This is something your kernel driver has
> to get right, it won't be handled automatically by setting the
> "dma-coherent" property in DT.

The buffer should also be mapped into userspace with the same memory
type and cache attributes as the kernel side mapping.  If using ACP,
then you probably want "normal memory, cacheable, writeback, read
allocate" or in the case of SMP, the same but "read/write allocate".

-- 
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-04-29  9:17       ` Russell King - ARM Linux
@ 2015-04-29  9:47         ` Mike Looijmans
  2015-04-29 10:07           ` Arnd Bergmann
  0 siblings, 1 reply; 34+ messages in thread
From: Mike Looijmans @ 2015-04-29  9:47 UTC (permalink / raw)
  To: linux-arm-kernel

?On 29-04-15 11:17, Russell King - ARM Linux wrote:
> On Wed, Apr 29, 2015 at 11:01:35AM +0200, Arnd Bergmann wrote:
>> You still need to synchronize MMIO register accesses with write buffers,
>> as the readl() and writel() functions do in the kernel.
>>
>> In particular, after you have written a buffer to memory from the CPU,
>> you will need to do an outer_sync() before the MMIO write that triggers
>> the DMA. This is still much cheaper than doing the cache flush though.
>
> Note that outer_sync() is already done by readl/writel and/or the write
> memory barriers (mb()/wmb()).

I initiate the DMA transfers using iowrite32() so if I understand correctly, 
I'm already doing the right thing here.

Just to be completely clear, there is no direct register access from user 
space, the driver does all MMIO. Userspace only gets an mmap for DMA buffers, 
and uses ioctl to initiate transfers.

>> Another possible problem would be if the driver mmaps the buffer in
>> uncached mode to user space. This is something your kernel driver has
>> to get right, it won't be handled automatically by setting the
>> "dma-coherent" property in DT.
>
> The buffer should also be mapped into userspace with the same memory
> type and cache attributes as the kernel side mapping.  If using ACP,
> then you probably want "normal memory, cacheable, writeback, read
> allocate" or in the case of SMP, the same but "read/write allocate".

I currently use dma_alloc_coherent() to allocate buffers and 
dma_mmap_coherent() to map them to user space. I was under the assumption that 
these would do the right thing. Is that correct? If not, then what should I use?




Kind regards,

Mike Looijmans
System Expert

TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com

Please consider the environment before printing this e-mail

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-04-29  9:47         ` Mike Looijmans
@ 2015-04-29 10:07           ` Arnd Bergmann
  2015-04-29 10:33             ` Mike Looijmans
  2015-04-29 11:09             ` Mike Looijmans
  0 siblings, 2 replies; 34+ messages in thread
From: Arnd Bergmann @ 2015-04-29 10:07 UTC (permalink / raw)
  To: linux-arm-kernel

On Wednesday 29 April 2015 11:47:37 Mike Looijmans wrote:
> On 29-04-15 11:17, Russell King - ARM Linux wrote:
> > On Wed, Apr 29, 2015 at 11:01:35AM +0200, Arnd Bergmann wrote:
> >> You still need to synchronize MMIO register accesses with write buffers,
> >> as the readl() and writel() functions do in the kernel.
> >>
> >> In particular, after you have written a buffer to memory from the CPU,
> >> you will need to do an outer_sync() before the MMIO write that triggers
> >> the DMA. This is still much cheaper than doing the cache flush though.
> >
> > Note that outer_sync() is already done by readl/writel and/or the write
> > memory barriers (mb()/wmb()).
> 
> I initiate the DMA transfers using iowrite32() so if I understand correctly, 
> I'm already doing the right thing here.
> 
> Just to be completely clear, there is no direct register access from user 
> space, the driver does all MMIO. Userspace only gets an mmap for DMA buffers, 
> and uses ioctl to initiate transfers.

Ok, that seems all fine then.

> >> Another possible problem would be if the driver mmaps the buffer in
> >> uncached mode to user space. This is something your kernel driver has
> >> to get right, it won't be handled automatically by setting the
> >> "dma-coherent" property in DT.
> >
> > The buffer should also be mapped into userspace with the same memory
> > type and cache attributes as the kernel side mapping.  If using ACP,
> > then you probably want "normal memory, cacheable, writeback, read
> > allocate" or in the case of SMP, the same but "read/write allocate".
> 
> I currently use dma_alloc_coherent() to allocate buffers and 
> dma_mmap_coherent() to map them to user space. I was under the assumption that 
> these would do the right thing. Is that correct? If not, then what should I use?

dma_mmap_coherent() is the right interface, but I've just looked at the
implementation of arm_dma_mmap() and I'm not sure that it actually uses the
correct vma->vm_page_prot value here, because I don't see where it takes
into account whether the device is coherent or not. Most ARM machines have
only noncoherent devices, and dma_mmap_coherent() is used rarely by drivers,
so it's quite possible that this interface got broken without anybody
noticing.

If my suspicion is correct, we should either change arm_coherent_dma_ops()
to refer to a different mmap() callback that does the right thing for
coherent devices, or change arm_dma_mmap() to look at dev->is_coherent.

	Arnd

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-04-29 10:07           ` Arnd Bergmann
@ 2015-04-29 10:33             ` Mike Looijmans
  2015-04-29 10:41               ` Arnd Bergmann
  2015-04-29 11:09             ` Mike Looijmans
  1 sibling, 1 reply; 34+ messages in thread
From: Mike Looijmans @ 2015-04-29 10:33 UTC (permalink / raw)
  To: linux-arm-kernel

?On 29-04-15 12:07, Arnd Bergmann wrote:
> On Wednesday 29 April 2015 11:47:37 Mike Looijmans wrote:
>> On 29-04-15 11:17, Russell King - ARM Linux wrote:
>>> On Wed, Apr 29, 2015 at 11:01:35AM +0200, Arnd Bergmann wrote:
>>>> You still need to synchronize MMIO register accesses with write buffers,
>>>> as the readl() and writel() functions do in the kernel.
>>>>
>>>> In particular, after you have written a buffer to memory from the CPU,
>>>> you will need to do an outer_sync() before the MMIO write that triggers
>>>> the DMA. This is still much cheaper than doing the cache flush though.
>>>
>>> Note that outer_sync() is already done by readl/writel and/or the write
>>> memory barriers (mb()/wmb()).
>>
>> I initiate the DMA transfers using iowrite32() so if I understand correctly,
>> I'm already doing the right thing here.
>>
>> Just to be completely clear, there is no direct register access from user
>> space, the driver does all MMIO. Userspace only gets an mmap for DMA buffers,
>> and uses ioctl to initiate transfers.
>
> Ok, that seems all fine then.
>
>>>> Another possible problem would be if the driver mmaps the buffer in
>>>> uncached mode to user space. This is something your kernel driver has
>>>> to get right, it won't be handled automatically by setting the
>>>> "dma-coherent" property in DT.
>>>
>>> The buffer should also be mapped into userspace with the same memory
>>> type and cache attributes as the kernel side mapping.  If using ACP,
>>> then you probably want "normal memory, cacheable, writeback, read
>>> allocate" or in the case of SMP, the same but "read/write allocate".
>>
>> I currently use dma_alloc_coherent() to allocate buffers and
>> dma_mmap_coherent() to map them to user space. I was under the assumption that
>> these would do the right thing. Is that correct? If not, then what should I use?
>
> dma_mmap_coherent() is the right interface, but I've just looked at the
> implementation of arm_dma_mmap() and I'm not sure that it actually uses the
> correct vma->vm_page_prot value here, because I don't see where it takes
> into account whether the device is coherent or not. Most ARM machines have
> only noncoherent devices, and dma_mmap_coherent() is used rarely by drivers,
> so it's quite possible that this interface got broken without anybody
> noticing.
>
> If my suspicion is correct, we should either change arm_coherent_dma_ops()
> to refer to a different mmap() callback that does the right thing for
> coherent devices, or change arm_dma_mmap() to look at dev->is_coherent.

Following the route, arch/arm/mm/dma-mapping.c uses pgprot_dmacoherent() which 
is defined in arch/arm/include/asm/pgtable.h and that just returns uncached 
memory.

If you can give me some hints as to what the correct flags would be, I can 
patch my kernel and test it.




Kind regards,

Mike Looijmans
System Expert

TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com

Please consider the environment before printing this e-mail

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-04-29 10:33             ` Mike Looijmans
@ 2015-04-29 10:41               ` Arnd Bergmann
  2015-04-29 12:49                 ` Mike Looijmans
  0 siblings, 1 reply; 34+ messages in thread
From: Arnd Bergmann @ 2015-04-29 10:41 UTC (permalink / raw)
  To: linux-arm-kernel

On Wednesday 29 April 2015 12:33:00 Mike Looijmans wrote:
> On 29-04-15 12:07, Arnd Bergmann wrote:
> > On Wednesday 29 April 2015 11:47:37 Mike Looijmans wrote:
> >> On 29-04-15 11:17, Russell King - ARM Linux wrote:
> >>> The buffer should also be mapped into userspace with the same memory
> >>> type and cache attributes as the kernel side mapping.  If using ACP,
> >>> then you probably want "normal memory, cacheable, writeback, read
> >>> allocate" or in the case of SMP, the same but "read/write allocate".
> >
> > If my suspicion is correct, we should either change arm_coherent_dma_ops()
> > to refer to a different mmap() callback that does the right thing for
> > coherent devices, or change arm_dma_mmap() to look at dev->is_coherent.
> 
> Following the route, arch/arm/mm/dma-mapping.c uses pgprot_dmacoherent() which 
> is defined in arch/arm/include/asm/pgtable.h and that just returns uncached 
> memory.
> 
> If you can give me some hints as to what the correct flags would be, I can 
> patch my kernel and test it.

Use the flags that Russell listed above.

	Arnd

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-04-29 10:07           ` Arnd Bergmann
  2015-04-29 10:33             ` Mike Looijmans
@ 2015-04-29 11:09             ` Mike Looijmans
  2015-04-29 12:35               ` Arnd Bergmann
  1 sibling, 1 reply; 34+ messages in thread
From: Mike Looijmans @ 2015-04-29 11:09 UTC (permalink / raw)
  To: linux-arm-kernel

?On 29-04-15 12:07, Arnd Bergmann wrote:
> On Wednesday 29 April 2015 11:47:37 Mike Looijmans wrote:
>> On 29-04-15 11:17, Russell King - ARM Linux wrote:
>>> On Wed, Apr 29, 2015 at 11:01:35AM +0200, Arnd Bergmann wrote:
>>>> You still need to synchronize MMIO register accesses with write buffers,
>>>> as the readl() and writel() functions do in the kernel.
>>>>
>>>> In particular, after you have written a buffer to memory from the CPU,
>>>> you will need to do an outer_sync() before the MMIO write that triggers
>>>> the DMA. This is still much cheaper than doing the cache flush though.
>>>
>>> Note that outer_sync() is already done by readl/writel and/or the write
>>> memory barriers (mb()/wmb()).
>>
>> I initiate the DMA transfers using iowrite32() so if I understand correctly,
>> I'm already doing the right thing here.
>>
>> Just to be completely clear, there is no direct register access from user
>> space, the driver does all MMIO. Userspace only gets an mmap for DMA buffers,
>> and uses ioctl to initiate transfers.
>
> Ok, that seems all fine then.
>
>>>> Another possible problem would be if the driver mmaps the buffer in
>>>> uncached mode to user space. This is something your kernel driver has
>>>> to get right, it won't be handled automatically by setting the
>>>> "dma-coherent" property in DT.
>>>
>>> The buffer should also be mapped into userspace with the same memory
>>> type and cache attributes as the kernel side mapping.  If using ACP,
>>> then you probably want "normal memory, cacheable, writeback, read
>>> allocate" or in the case of SMP, the same but "read/write allocate".
>>
>> I currently use dma_alloc_coherent() to allocate buffers and
>> dma_mmap_coherent() to map them to user space. I was under the assumption that
>> these would do the right thing. Is that correct? If not, then what should I use?
>
> dma_mmap_coherent() is the right interface, but I've just looked at the
> implementation of arm_dma_mmap() and I'm not sure that it actually uses the
> correct vma->vm_page_prot value here, because I don't see where it takes
> into account whether the device is coherent or not. Most ARM machines have
> only noncoherent devices, and dma_mmap_coherent() is used rarely by drivers,
> so it's quite possible that this interface got broken without anybody
> noticing.

My driver also supports 'classic' read/write by copying data to/from userspace 
using a DMA ringbuffer, which is also allocated using dma_alloc_coherent. That 
would suggest that the kernel mapping for this memory is also incorrect, I 
also get corrupted data in these transfers. These buffers are not being mapped 
to userspace at all. These tests all use page aligned transfer sizes.

M.


Kind regards,

Mike Looijmans
System Expert

TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com

Please consider the environment before printing this e-mail

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-04-29 11:09             ` Mike Looijmans
@ 2015-04-29 12:35               ` Arnd Bergmann
  2015-04-29 12:52                 ` Mike Looijmans
  0 siblings, 1 reply; 34+ messages in thread
From: Arnd Bergmann @ 2015-04-29 12:35 UTC (permalink / raw)
  To: linux-arm-kernel

On Wednesday 29 April 2015 13:09:05 Mike Looijmans wrote:
> On 29-04-15 12:07, Arnd Bergmann wrote:
> > On Wednesday 29 April 2015 11:47:37 Mike Looijmans wrote:
> >> I currently use dma_alloc_coherent() to allocate buffers and
> >> dma_mmap_coherent() to map them to user space. I was under the assumption that
> >> these would do the right thing. Is that correct? If not, then what should I use?
> >
> > dma_mmap_coherent() is the right interface, but I've just looked at the
> > implementation of arm_dma_mmap() and I'm not sure that it actually uses the
> > correct vma->vm_page_prot value here, because I don't see where it takes
> > into account whether the device is coherent or not. Most ARM machines have
> > only noncoherent devices, and dma_mmap_coherent() is used rarely by drivers,
> > so it's quite possible that this interface got broken without anybody
> > noticing.
> 
> My driver also supports 'classic' read/write by copying data to/from userspace 
> using a DMA ringbuffer, which is also allocated using dma_alloc_coherent. That 
> would suggest that the kernel mapping for this memory is also incorrect, I 
> also get corrupted data in these transfers. These buffers are not being mapped 
> to userspace at all. These tests all use page aligned transfer sizes.

Ok, if it's broken when you only use a kernel mapping and no user space
mapping, arm_dma_mmap() is not your (only) problem, but we should probably
still fix that if it's broken.

	Arnd

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-04-29 10:41               ` Arnd Bergmann
@ 2015-04-29 12:49                 ` Mike Looijmans
  2015-04-29 13:13                   ` Arnd Bergmann
  0 siblings, 1 reply; 34+ messages in thread
From: Mike Looijmans @ 2015-04-29 12:49 UTC (permalink / raw)
  To: linux-arm-kernel

?On 29-04-15 12:41, Arnd Bergmann wrote:
> On Wednesday 29 April 2015 12:33:00 Mike Looijmans wrote:
>> On 29-04-15 12:07, Arnd Bergmann wrote:
>>> On Wednesday 29 April 2015 11:47:37 Mike Looijmans wrote:
>>>> On 29-04-15 11:17, Russell King - ARM Linux wrote:
>>>>> The buffer should also be mapped into userspace with the same memory
>>>>> type and cache attributes as the kernel side mapping.  If using ACP,
>>>>> then you probably want "normal memory, cacheable, writeback, read
>>>>> allocate" or in the case of SMP, the same but "read/write allocate".
>>>
>>> If my suspicion is correct, we should either change arm_coherent_dma_ops()
>>> to refer to a different mmap() callback that does the right thing for
>>> coherent devices, or change arm_dma_mmap() to look at dev->is_coherent.
>>
>> Following the route, arch/arm/mm/dma-mapping.c uses pgprot_dmacoherent() which
>> is defined in arch/arm/include/asm/pgtable.h and that just returns uncached
>> memory.
>>
>> If you can give me some hints as to what the correct flags would be, I can
>> patch my kernel and test it.
>
> Use the flags that Russell listed above.

I would if I had a clue how to do that. For one thing, I don't understand all 
the L_PTE_... flag juggling that happens here.

I also tried just using kmalloc() to allocate the buffer, and then 
dma_map_single that. This is what the DMA documentation told me to do for 
non-coherent mappings. This works fine for the streaming-dma mode using the 
HP, but using dma-coherent this not only fails to work, it tends to completely 
lock the system.

Avoiding the streaming mapping and using only the coherent modes prevents the 
system locking up, it then doesn't do more harm than just corrupting data in 
the buffers.


Kind regards,

Mike Looijmans
System Expert

TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com

Please consider the environment before printing this e-mail

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-04-29 12:35               ` Arnd Bergmann
@ 2015-04-29 12:52                 ` Mike Looijmans
  2015-04-29 12:54                   ` Arnd Bergmann
  0 siblings, 1 reply; 34+ messages in thread
From: Mike Looijmans @ 2015-04-29 12:52 UTC (permalink / raw)
  To: linux-arm-kernel

?On 29-04-15 14:35, Arnd Bergmann wrote:
> On Wednesday 29 April 2015 13:09:05 Mike Looijmans wrote:
>> On 29-04-15 12:07, Arnd Bergmann wrote:
>>> On Wednesday 29 April 2015 11:47:37 Mike Looijmans wrote:
>>>> I currently use dma_alloc_coherent() to allocate buffers and
>>>> dma_mmap_coherent() to map them to user space. I was under the assumption that
>>>> these would do the right thing. Is that correct? If not, then what should I use?
>>>
>>> dma_mmap_coherent() is the right interface, but I've just looked at the
>>> implementation of arm_dma_mmap() and I'm not sure that it actually uses the
>>> correct vma->vm_page_prot value here, because I don't see where it takes
>>> into account whether the device is coherent or not. Most ARM machines have
>>> only noncoherent devices, and dma_mmap_coherent() is used rarely by drivers,
>>> so it's quite possible that this interface got broken without anybody
>>> noticing.
>>
>> My driver also supports 'classic' read/write by copying data to/from userspace
>> using a DMA ringbuffer, which is also allocated using dma_alloc_coherent. That
>> would suggest that the kernel mapping for this memory is also incorrect, I
>> also get corrupted data in these transfers. These buffers are not being mapped
>> to userspace at all. These tests all use page aligned transfer sizes.
>
> Ok, if it's broken when you only use a kernel mapping and no user space
> mapping, arm_dma_mmap() is not your (only) problem, but we should probably
> still fix that if it's broken.

The only difference in arm_coherent_dma_alloc() and arm_dma_alloc() is that 
the former passes a "true" to __dma_alloc(). That looks prettry suspicious, 
I'd at least expect different "prot" values.



Kind regards,

Mike Looijmans
System Expert

TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com

Please consider the environment before printing this e-mail

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-04-29 12:52                 ` Mike Looijmans
@ 2015-04-29 12:54                   ` Arnd Bergmann
  0 siblings, 0 replies; 34+ messages in thread
From: Arnd Bergmann @ 2015-04-29 12:54 UTC (permalink / raw)
  To: linux-arm-kernel

On Wednesday 29 April 2015 14:52:49 Mike Looijmans wrote:
> On 29-04-15 14:35, Arnd Bergmann wrote:
> > On Wednesday 29 April 2015 13:09:05 Mike Looijmans wrote:
> >> On 29-04-15 12:07, Arnd Bergmann wrote:
> >>> On Wednesday 29 April 2015 11:47:37 Mike Looijmans wrote:
> >>>> I currently use dma_alloc_coherent() to allocate buffers and
> >>>> dma_mmap_coherent() to map them to user space. I was under the assumption that
> >>>> these would do the right thing. Is that correct? If not, then what should I use?
> >>>
> >>> dma_mmap_coherent() is the right interface, but I've just looked at the
> >>> implementation of arm_dma_mmap() and I'm not sure that it actually uses the
> >>> correct vma->vm_page_prot value here, because I don't see where it takes
> >>> into account whether the device is coherent or not. Most ARM machines have
> >>> only noncoherent devices, and dma_mmap_coherent() is used rarely by drivers,
> >>> so it's quite possible that this interface got broken without anybody
> >>> noticing.
> >>
> >> My driver also supports 'classic' read/write by copying data to/from userspace
> >> using a DMA ringbuffer, which is also allocated using dma_alloc_coherent. That
> >> would suggest that the kernel mapping for this memory is also incorrect, I
> >> also get corrupted data in these transfers. These buffers are not being mapped
> >> to userspace at all. These tests all use page aligned transfer sizes.
> >
> > Ok, if it's broken when you only use a kernel mapping and no user space
> > mapping, arm_dma_mmap() is not your (only) problem, but we should probably
> > still fix that if it's broken.
> 
> The only difference in arm_coherent_dma_alloc() and arm_dma_alloc() is that 
> the former passes a "true" to __dma_alloc(). That looks prettry suspicious, 
> I'd at least expect different "prot" values.
> 
> 

No, if coherent=true is passed as the argument, the prot value is not used.

If we had this part wrong, we would notice very quickly.

	Arnd

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-04-29 12:49                 ` Mike Looijmans
@ 2015-04-29 13:13                   ` Arnd Bergmann
  2015-04-30 13:50                     ` Mike Looijmans
  2015-05-07 11:18                     ` Mike Looijmans
  0 siblings, 2 replies; 34+ messages in thread
From: Arnd Bergmann @ 2015-04-29 13:13 UTC (permalink / raw)
  To: linux-arm-kernel

On Wednesday 29 April 2015 14:49:26 Mike Looijmans wrote:
> On 29-04-15 12:41, Arnd Bergmann wrote:
> > On Wednesday 29 April 2015 12:33:00 Mike Looijmans wrote:
> >> On 29-04-15 12:07, Arnd Bergmann wrote:
> >>> On Wednesday 29 April 2015 11:47:37 Mike Looijmans wrote:
> >>>> On 29-04-15 11:17, Russell King - ARM Linux wrote:
> >>>>> The buffer should also be mapped into userspace with the same memory
> >>>>> type and cache attributes as the kernel side mapping.  If using ACP,
> >>>>> then you probably want "normal memory, cacheable, writeback, read
> >>>>> allocate" or in the case of SMP, the same but "read/write allocate".
> >>>
> >>> If my suspicion is correct, we should either change arm_coherent_dma_ops()
> >>> to refer to a different mmap() callback that does the right thing for
> >>> coherent devices, or change arm_dma_mmap() to look at dev->is_coherent.
> >>
> >> Following the route, arch/arm/mm/dma-mapping.c uses pgprot_dmacoherent() which
> >> is defined in arch/arm/include/asm/pgtable.h and that just returns uncached
> >> memory.
> >>
> >> If you can give me some hints as to what the correct flags would be, I can
> >> patch my kernel and test it.
> >
> > Use the flags that Russell listed above.
> 
> I would if I had a clue how to do that. For one thing, I don't understand all 
> the L_PTE_... flag juggling that happens here.
> 
> I also tried just using kmalloc() to allocate the buffer, and then 
> dma_map_single that. This is what the DMA documentation told me to do for 
> non-coherent mappings. This works fine for the streaming-dma mode using the 
> HP, but using dma-coherent this not only fails to work, it tends to completely 
> lock the system.
> 
> Avoiding the streaming mapping and using only the coherent modes prevents the 
> system locking up, it then doesn't do more harm than just corrupting data in 
> the buffers.

If I understand it right, you basically just skip the 'vma->vm_page_prot =
__get_dma_pgprot(attrs, vma->vm_page_prot);' step and get the right mapping
here, i.e. the pgprot value we use for all normal memory.

	Arnd

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-04-29 13:13                   ` Arnd Bergmann
@ 2015-04-30 13:50                     ` Mike Looijmans
  2015-04-30 13:54                       ` Arnd Bergmann
  2015-05-07 11:18                     ` Mike Looijmans
  1 sibling, 1 reply; 34+ messages in thread
From: Mike Looijmans @ 2015-04-30 13:50 UTC (permalink / raw)
  To: linux-arm-kernel

?On 29-04-15 15:13, Arnd Bergmann wrote:
> On Wednesday 29 April 2015 14:49:26 Mike Looijmans wrote:
>> On 29-04-15 12:41, Arnd Bergmann wrote:
>>> On Wednesday 29 April 2015 12:33:00 Mike Looijmans wrote:
>>>> On 29-04-15 12:07, Arnd Bergmann wrote:
>>>>> On Wednesday 29 April 2015 11:47:37 Mike Looijmans wrote:
>>>>>> On 29-04-15 11:17, Russell King - ARM Linux wrote:
>>>>>>> The buffer should also be mapped into userspace with the same memory
>>>>>>> type and cache attributes as the kernel side mapping.  If using ACP,
>>>>>>> then you probably want "normal memory, cacheable, writeback, read
>>>>>>> allocate" or in the case of SMP, the same but "read/write allocate".
>>>>>
>>>>> If my suspicion is correct, we should either change arm_coherent_dma_ops()
>>>>> to refer to a different mmap() callback that does the right thing for
>>>>> coherent devices, or change arm_dma_mmap() to look at dev->is_coherent.
>>>>
>>>> Following the route, arch/arm/mm/dma-mapping.c uses pgprot_dmacoherent() which
>>>> is defined in arch/arm/include/asm/pgtable.h and that just returns uncached
>>>> memory.
>>>>
>>>> If you can give me some hints as to what the correct flags would be, I can
>>>> patch my kernel and test it.
>>>
>>> Use the flags that Russell listed above.
>>
>> I would if I had a clue how to do that. For one thing, I don't understand all
>> the L_PTE_... flag juggling that happens here.
>>
>> I also tried just using kmalloc() to allocate the buffer, and then
>> dma_map_single that. This is what the DMA documentation told me to do for
>> non-coherent mappings. This works fine for the streaming-dma mode using the
>> HP, but using dma-coherent this not only fails to work, it tends to completely
>> lock the system.
>>
>> Avoiding the streaming mapping and using only the coherent modes prevents the
>> system locking up, it then doesn't do more harm than just corrupting data in
>> the buffers.
>
> If I understand it right, you basically just skip the 'vma->vm_page_prot =
> __get_dma_pgprot(attrs, vma->vm_page_prot);' step and get the right mapping
> here, i.e. the pgprot value we use for all normal memory.

Just to give you a status update, I tried that too (by adding a 
dma_mmap_coherent variant that omits the "prot" change, and some printks to 
verify that it actually does as expected).

Current status is that the ACP behaves exactly like the HP port, which it 
should not do. If I send data from logic via the ACP port through the L2 
cache, using a version of dma_sync that just invalidates the cache could 
(should?) result in data corruption. Instead, the data gets corrupted only if 
you do not invalidate the line. This is what the (non-coherent) HP port 
behaves like, as it writes directly to DDR.

Currently I'm assuming that the tools did something wrong in the bitstream, 
for example, wiring the "AWCACHE" and similar signals on the ACP to logic "0" 
instead of "1" while claiming to have wired them to "1" in the UI. A bug like 
that would also explain the behaviour I'm seeing now.

I'll let you know once I find out more.


Kind regards,

Mike Looijmans
System Expert

TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com

Please consider the environment before printing this e-mail

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-04-30 13:50                     ` Mike Looijmans
@ 2015-04-30 13:54                       ` Arnd Bergmann
  2015-05-01  6:08                         ` Mike Looijmans
  0 siblings, 1 reply; 34+ messages in thread
From: Arnd Bergmann @ 2015-04-30 13:54 UTC (permalink / raw)
  To: linux-arm-kernel

On Thursday 30 April 2015 15:50:15 Mike Looijmans wrote:
> 
> Just to give you a status update, I tried that too (by adding a 
> dma_mmap_coherent variant that omits the "prot" change, and some printks to 
> verify that it actually does as expected).
> 
> Current status is that the ACP behaves exactly like the HP port, which it 
> should not do. If I send data from logic via the ACP port through the L2 
> cache, using a version of dma_sync that just invalidates the cache could 
> (should?) result in data corruption. Instead, the data gets corrupted only if 
> you do not invalidate the line. This is what the (non-coherent) HP port 
> behaves like, as it writes directly to DDR.
> 
> Currently I'm assuming that the tools did something wrong in the bitstream, 
> for example, wiring the "AWCACHE" and similar signals on the ACP to logic "0" 
> instead of "1" while claiming to have wired them to "1" in the UI. A bug like 
> that would also explain the behaviour I'm seeing now.
> 
> I'll let you know once I find out more.
> 
> 

Ok, maybe you have to configure the SCU to include the ACP in the
cache coherency? That might not be done by default. I don't really
know anything about the SCU or the ACP, so I'm just taking wild
guesses here.

	Arnd

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-04-30 13:54                       ` Arnd Bergmann
@ 2015-05-01  6:08                         ` Mike Looijmans
  2015-05-01  7:01                           ` Mike Looijmans
  0 siblings, 1 reply; 34+ messages in thread
From: Mike Looijmans @ 2015-05-01  6:08 UTC (permalink / raw)
  To: linux-arm-kernel

?On 30-04-15 15:54, Arnd Bergmann wrote:
> On Thursday 30 April 2015 15:50:15 Mike Looijmans wrote:
>>
>> Just to give you a status update, I tried that too (by adding a
>> dma_mmap_coherent variant that omits the "prot" change, and some printks to
>> verify that it actually does as expected).
>>
>> Current status is that the ACP behaves exactly like the HP port, which it
>> should not do. If I send data from logic via the ACP port through the L2
>> cache, using a version of dma_sync that just invalidates the cache could
>> (should?) result in data corruption. Instead, the data gets corrupted only if
>> you do not invalidate the line. This is what the (non-coherent) HP port
>> behaves like, as it writes directly to DDR.
>>
>> Currently I'm assuming that the tools did something wrong in the bitstream,
>> for example, wiring the "AWCACHE" and similar signals on the ACP to logic "0"
>> instead of "1" while claiming to have wired them to "1" in the UI. A bug like
>> that would also explain the behaviour I'm seeing now.
>>
>> I'll let you know once I find out more.
>
> Ok, maybe you have to configure the SCU to include the ACP in the
> cache coherency? That might not be done by default. I don't really
> know anything about the SCU or the ACP, so I'm just taking wild
> guesses here.

The issue seems to be in these signals, the tool falsely claimed to have them 
set high, but it actually only sets half of them, which thus has no effect at 
all. I'll be able to test that in an hour or so once the bitstream is ready.

Interestingly, the coherency is not a property of the port or of the device 
itself, it is a property of the DMA transaction. One single master on this 
post can do both coherent and non-coherent transfers.
Now there's an interesting challenge, since the kernel appears to assume that 
one device can only do one type. My 'hardware' actually has multiple masters, 
which could potentially be tied to different ports or busses.



Kind regards,

Mike Looijmans
System Expert

TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com

Please consider the environment before printing this e-mail

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-05-01  6:08                         ` Mike Looijmans
@ 2015-05-01  7:01                           ` Mike Looijmans
  0 siblings, 0 replies; 34+ messages in thread
From: Mike Looijmans @ 2015-05-01  7:01 UTC (permalink / raw)
  To: linux-arm-kernel

?On 01-05-15 08:08, Mike Looijmans wrote:
> On 30-04-15 15:54, Arnd Bergmann wrote:
>> On Thursday 30 April 2015 15:50:15 Mike Looijmans wrote:
>>>
>>> Just to give you a status update, I tried that too (by adding a
>>> dma_mmap_coherent variant that omits the "prot" change, and some printks to
>>> verify that it actually does as expected).
>>>
>>> Current status is that the ACP behaves exactly like the HP port, which it
>>> should not do. If I send data from logic via the ACP port through the L2
>>> cache, using a version of dma_sync that just invalidates the cache could
>>> (should?) result in data corruption. Instead, the data gets corrupted only if
>>> you do not invalidate the line. This is what the (non-coherent) HP port
>>> behaves like, as it writes directly to DDR.
>>>
>>> Currently I'm assuming that the tools did something wrong in the bitstream,
>>> for example, wiring the "AWCACHE" and similar signals on the ACP to logic "0"
>>> instead of "1" while claiming to have wired them to "1" in the UI. A bug like
>>> that would also explain the behaviour I'm seeing now.
>>>
>>> I'll let you know once I find out more.
>>
>> Ok, maybe you have to configure the SCU to include the ACP in the
>> cache coherency? That might not be done by default. I don't really
>> know anything about the SCU or the ACP, so I'm just taking wild
>> guesses here.
>
> The issue seems to be in these signals, the tool falsely claimed to have them
> set high, but it actually only sets half of them, which thus has no effect at
> all. I'll be able to test that in an hour or so once the bitstream is ready.

Indeed, controlling the signals manually instead of relying on the tools makes 
the ACP operate correctly.

> Interestingly, the coherency is not a property of the port or of the device
> itself, it is a property of the DMA transaction. One single master on this
> post can do both coherent and non-coherent transfers.
> Now there's an interesting challenge, since the kernel appears to assume that
> one device can only do one type. My 'hardware' actually has multiple masters,
> which could potentially be tied to different ports or busses.

Measurements show that this is going to be important. Writing larger blocks to 
DDR using coherency transfers data at about 240MB/s, while without coherency 
it's 600MB/s. Have yet to test small datasets, I expect opposite results 
there, since they'll remain in the L2 cache.




Kind regards,

Mike Looijmans
System Expert

TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com

Please consider the environment before printing this e-mail

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-04-29 13:13                   ` Arnd Bergmann
  2015-04-30 13:50                     ` Mike Looijmans
@ 2015-05-07 11:18                     ` Mike Looijmans
  2015-05-07 11:56                       ` Arnd Bergmann
  2015-05-07 13:21                       ` Daniel Drake
  1 sibling, 2 replies; 34+ messages in thread
From: Mike Looijmans @ 2015-05-07 11:18 UTC (permalink / raw)
  To: linux-arm-kernel

?On 29-04-15 15:13, Arnd Bergmann wrote:
> On Wednesday 29 April 2015 14:49:26 Mike Looijmans wrote:
>> On 29-04-15 12:41, Arnd Bergmann wrote:
>>> On Wednesday 29 April 2015 12:33:00 Mike Looijmans wrote:
>>>> On 29-04-15 12:07, Arnd Bergmann wrote:
>>>>> On Wednesday 29 April 2015 11:47:37 Mike Looijmans wrote:
>>>>>> On 29-04-15 11:17, Russell King - ARM Linux wrote:
>>>>>>> The buffer should also be mapped into userspace with the same memory
>>>>>>> type and cache attributes as the kernel side mapping.  If using ACP,
>>>>>>> then you probably want "normal memory, cacheable, writeback, read
>>>>>>> allocate" or in the case of SMP, the same but "read/write allocate".
>>>>>
>>>>> If my suspicion is correct, we should either change arm_coherent_dma_ops()
>>>>> to refer to a different mmap() callback that does the right thing for
>>>>> coherent devices, or change arm_dma_mmap() to look at dev->is_coherent.
>>>>
>>>> Following the route, arch/arm/mm/dma-mapping.c uses pgprot_dmacoherent() which
>>>> is defined in arch/arm/include/asm/pgtable.h and that just returns uncached
>>>> memory.
>>>>
>>>> If you can give me some hints as to what the correct flags would be, I can
>>>> patch my kernel and test it.
>>>
>>> Use the flags that Russell listed above.
>>
>> I would if I had a clue how to do that. For one thing, I don't understand all
>> the L_PTE_... flag juggling that happens here.
>>
>> I also tried just using kmalloc() to allocate the buffer, and then
>> dma_map_single that. This is what the DMA documentation told me to do for
>> non-coherent mappings. This works fine for the streaming-dma mode using the
>> HP, but using dma-coherent this not only fails to work, it tends to completely
>> lock the system.
>>
>> Avoiding the streaming mapping and using only the coherent modes prevents the
>> system locking up, it then doesn't do more harm than just corrupting data in
>> the buffers.
>
> If I understand it right, you basically just skip the 'vma->vm_page_prot =
> __get_dma_pgprot(attrs, vma->vm_page_prot);' step and get the right mapping
> here, i.e. the pgprot value we use for all normal memory.

I reverted all my patches and workarounds. Indeed, the kernel needs a 
"coherent" version of the dma_mmap routine, as the current version will map it 
as non-cachable, resulting in a big performance hit (and nullifying the whole 
idea behind it).

I'll test it further on my 'hardware' and cook up a patch that correctly maps 
the coherent pages.

M.


Kind regards,

Mike Looijmans
System Expert

TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com

Please consider the environment before printing this e-mail

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-05-07 11:18                     ` Mike Looijmans
@ 2015-05-07 11:56                       ` Arnd Bergmann
  2015-05-07 13:21                       ` Daniel Drake
  1 sibling, 0 replies; 34+ messages in thread
From: Arnd Bergmann @ 2015-05-07 11:56 UTC (permalink / raw)
  To: linux-arm-kernel

On Thursday 07 May 2015 13:18:08 Mike Looijmans wrote:
> On 29-04-15 15:13, Arnd Bergmann wrote:
> > On Wednesday 29 April 2015 14:49:26 Mike Looijmans wrote:
> >> Avoiding the streaming mapping and using only the coherent modes prevents the
> >> system locking up, it then doesn't do more harm than just corrupting data in
> >> the buffers.
> >
> > If I understand it right, you basically just skip the 'vma->vm_page_prot =
> > __get_dma_pgprot(attrs, vma->vm_page_prot);' step and get the right mapping
> > here, i.e. the pgprot value we use for all normal memory.
> 
> I reverted all my patches and workarounds. Indeed, the kernel needs a 
> "coherent" version of the dma_mmap routine, as the current version will map it 
> as non-cachable, resulting in a big performance hit (and nullifying the whole 
> idea behind it).
> 
> I'll test it further on my 'hardware' and cook up a patch that correctly maps 
> the coherent pages.
> 

Ok, thanks!

	Arnd

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-05-07 11:18                     ` Mike Looijmans
  2015-05-07 11:56                       ` Arnd Bergmann
@ 2015-05-07 13:21                       ` Daniel Drake
  2015-05-07 13:31                         ` Mike Looijmans
  1 sibling, 1 reply; 34+ messages in thread
From: Daniel Drake @ 2015-05-07 13:21 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, May 7, 2015 at 5:18 AM, Mike Looijmans <mike.looijmans@topic.nl> wrote:
> I reverted all my patches and workarounds. Indeed, the kernel needs a
> "coherent" version of the dma_mmap routine, as the current version will map
> it as non-cachable, resulting in a big performance hit (and nullifying the
> whole idea behind it).
>
> I'll test it further on my 'hardware' and cook up a patch that correctly
> maps the coherent pages.

Sorry that I have only read this thread briefly, but I wonder if this
is what you are looking for:
http://lists.infradead.org/pipermail/linux-arm-kernel/2015-February/325489.html

Thanks
Daniel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-05-07 13:21                       ` Daniel Drake
@ 2015-05-07 13:31                         ` Mike Looijmans
  2015-05-07 14:08                           ` Mike Looijmans
  0 siblings, 1 reply; 34+ messages in thread
From: Mike Looijmans @ 2015-05-07 13:31 UTC (permalink / raw)
  To: linux-arm-kernel

?On 07-05-15 15:21, Daniel Drake wrote:
> On Thu, May 7, 2015 at 5:18 AM, Mike Looijmans <mike.looijmans@topic.nl> wrote:
>> I reverted all my patches and workarounds. Indeed, the kernel needs a
>> "coherent" version of the dma_mmap routine, as the current version will map
>> it as non-cachable, resulting in a big performance hit (and nullifying the
>> whole idea behind it).
>>
>> I'll test it further on my 'hardware' and cook up a patch that correctly
>> maps the coherent pages.
>
> Sorry that I have only read this thread briefly, but I wonder if this
> is what you are looking for:
> http://lists.infradead.org/pipermail/linux-arm-kernel/2015-February/325489.html

It's related, but targets another use case. This one does the same in case the 
driver requested non-consistent memory.

My use case was that I have hardware implemented coherency (through ACP) so 
the CPU's and device's view on memory is already consistent, regardless of the 
status of the cache.

The patches are complimentary, not overlapping.

Thanks for the link though, it's something I was also looking into, as I don't 
always need coherency.


Kind regards,

Mike Looijmans
System Expert

TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com

Please consider the environment before printing this e-mail

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-05-07 13:31                         ` Mike Looijmans
@ 2015-05-07 14:08                           ` Mike Looijmans
  2015-05-07 14:30                             ` Russell King - ARM Linux
  0 siblings, 1 reply; 34+ messages in thread
From: Mike Looijmans @ 2015-05-07 14:08 UTC (permalink / raw)
  To: linux-arm-kernel

?On 07-05-15 15:31, Mike Looijmans wrote:
> On 07-05-15 15:21, Daniel Drake wrote:
>> On Thu, May 7, 2015 at 5:18 AM, Mike Looijmans <mike.looijmans@topic.nl> wrote:
>>> I reverted all my patches and workarounds. Indeed, the kernel needs a
>>> "coherent" version of the dma_mmap routine, as the current version will map
>>> it as non-cachable, resulting in a big performance hit (and nullifying the
>>> whole idea behind it).
>>>
>>> I'll test it further on my 'hardware' and cook up a patch that correctly
>>> maps the coherent pages.
>>
>> Sorry that I have only read this thread briefly, but I wonder if this
>> is what you are looking for:
>> http://lists.infradead.org/pipermail/linux-arm-kernel/2015-February/325489.html
>
> It's related, but targets another use case. This one does the same in case the
> driver requested non-consistent memory.
>
> My use case was that I have hardware implemented coherency (through ACP) so
> the CPU's and device's view on memory is already consistent, regardless of the
> status of the cache.
>
> The patches are complimentary, not overlapping.
>
> Thanks for the link though, it's something I was also looking into, as I don't
> always need coherency.

I read the rest of the thread, apparently it was never integrated.

The patch for "non-consistent" is a BUG FIX, not some feature request or so. I 
was already wondering why my driver had to kalloc pages to get proper caching 
on it.

 From https://www.kernel.org/doc/Documentation/DMA-attributes.txt:
"""
DMA_ATTR_NON_CONSISTENT ... lets the platform to choose to return either
consistent or non-consistent memory as it sees fit.  By using this API,
you are guaranteeing to the platform that you have all the correct and
necessary sync points for this memory in the driver.
"""

The current ARM implementation is to *always* return memory that is 
non-cachable, even if the driver promises to do all the right things.
If the intention was that every implementation could get away with just 
ignoring the flag, the flag would not have existed. So the implementation 
should do the best it can do here, and the patch shows that it's just a simple 
one-liner to make it implement the flag as intended.

As for use cases, IIO is a candidate for this too, as it has explicit 
interfaces to move buffers to/from userspace without having to remap them over 
and over again. My usecase here is Dyplo, which uses a similar interface. If 
you do something as simple as "for (i=0;i<size;++i) sum += (char*)buffer[i]; 
in userspace on such a buffer, performance will collapse to about 1/20th 
because of the memory being non-cachable. The whole point of IIO is to prevent 
having to copy these buffers around.

(I'd rather add this plea to the that thread, but I'd have to figure out how 
to reply to a thread from the web...)

Mike.


Kind regards,

Mike Looijmans
System Expert

TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans@topicproducts.com
Website: www.topicproducts.com

Please consider the environment before printing this e-mail

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-05-07 14:08                           ` Mike Looijmans
@ 2015-05-07 14:30                             ` Russell King - ARM Linux
  2015-05-08  5:55                               ` Mike Looijmans
  0 siblings, 1 reply; 34+ messages in thread
From: Russell King - ARM Linux @ 2015-05-07 14:30 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, May 07, 2015 at 04:08:54PM +0200, Mike Looijmans wrote:
> I read the rest of the thread, apparently it was never integrated.
> 
> The patch for "non-consistent" is a BUG FIX, not some feature request or so.
> I was already wondering why my driver had to kalloc pages to get proper
> caching on it.

I disagree.

> From https://www.kernel.org/doc/Documentation/DMA-attributes.txt:
> """
> DMA_ATTR_NON_CONSISTENT ... lets the platform to choose to return either
> consistent or non-consistent memory as it sees fit.  By using this API,
> you are guaranteeing to the platform that you have all the correct and
> necessary sync points for this memory in the driver.
> """

DMA attributes are something that came in _after_ the DMA API had been
around for many years.  It's a "new feature" that was added to an
existing subsystem, and because there have been no need for it to be
implemented on ARM, the new feature was never implemented.

More than that, the vast majority of ARM hardware can't provide this
kind of memory, and there are _no_ kernel APIs to ensure that if
cacheable memory were to be returned, they could issue the necessary
cache flushes to ensure that the device could see the data.

So it's _not_ a bug fix, and there have been very good reasons not to
implement it.

-- 
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-05-07 14:30                             ` Russell King - ARM Linux
@ 2015-05-08  5:55                               ` Mike Looijmans
  2015-05-08  7:54                                 ` Arnd Bergmann
  2015-05-08 11:10                                 ` Russell King - ARM Linux
  0 siblings, 2 replies; 34+ messages in thread
From: Mike Looijmans @ 2015-05-08  5:55 UTC (permalink / raw)
  To: linux-arm-kernel

On 07-05-15 16:30, Russell King - ARM Linux wrote:
> On Thu, May 07, 2015 at 04:08:54PM +0200, Mike Looijmans wrote:
>> I read the rest of the thread, apparently it was never integrated.
>>
>> The patch for "non-consistent" is a BUG FIX, not some feature request or so.
>> I was already wondering why my driver had to kalloc pages to get proper
>> caching on it.
>
> I disagree.
>
>>  From https://www.kernel.org/doc/Documentation/DMA-attributes.txt:
>> """
>> DMA_ATTR_NON_CONSISTENT ... lets the platform to choose to return either
>> consistent or non-consistent memory as it sees fit.  By using this API,
>> you are guaranteeing to the platform that you have all the correct and
>> necessary sync points for this memory in the driver.
>> """
>
> DMA attributes are something that came in _after_ the DMA API had been
> around for many years.  It's a "new feature" that was added to an
> existing subsystem, and because there have been no need for it to be
> implemented on ARM, the new feature was never implemented.
>
> More than that, the vast majority of ARM hardware can't provide this
> kind of memory, and there are _no_ kernel APIs to ensure that if

By "non-coherent" memory I thought it meant the same kind of memory that 
kalloc would return. But from your answer it seems I am mistaken and 
this is something different?

> cacheable memory were to be returned, they could issue the necessary
> cache flushes to ensure that the device could see the data.

Then what do the dma_sync_... methods do?

It has been my understanding that one can use dma_map... and dma_sync... 
methods to make memory ranges visible to the device.

Using dma_sync on coherent memory is just a waste of resources. So how 
do i allocate memory that I'm supposed to use with dma_sync?

> So it's _not_ a bug fix, and there have been very good reasons not to
> implement it.


-- 
Mike Looijmans

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-05-08  5:55                               ` Mike Looijmans
@ 2015-05-08  7:54                                 ` Arnd Bergmann
  2015-05-08  8:31                                   ` Mike Looijmans
  2015-05-08 11:10                                 ` Russell King - ARM Linux
  1 sibling, 1 reply; 34+ messages in thread
From: Arnd Bergmann @ 2015-05-08  7:54 UTC (permalink / raw)
  To: linux-arm-kernel

On Friday 08 May 2015 07:55:26 Mike Looijmans wrote:
> On 07-05-15 16:30, Russell King - ARM Linux wrote:
> > On Thu, May 07, 2015 at 04:08:54PM +0200, Mike Looijmans wrote:
> >> I read the rest of the thread, apparently it was never integrated.
> >>
> >> The patch for "non-consistent" is a BUG FIX, not some feature request or so.
> >> I was already wondering why my driver had to kalloc pages to get proper
> >> caching on it.
> >
> > I disagree.
> >
> >>  From https://www.kernel.org/doc/Documentation/DMA-attributes.txt:
> >> """
> >> DMA_ATTR_NON_CONSISTENT ... lets the platform to choose to return either
> >> consistent or non-consistent memory as it sees fit.  By using this API,
> >> you are guaranteeing to the platform that you have all the correct and
> >> necessary sync points for this memory in the driver.
> >> """
> >
> > DMA attributes are something that came in _after_ the DMA API had been
> > around for many years.  It's a "new feature" that was added to an
> > existing subsystem, and because there have been no need for it to be
> > implemented on ARM, the new feature was never implemented.
> >
> > More than that, the vast majority of ARM hardware can't provide this
> > kind of memory, and there are _no_ kernel APIs to ensure that if
> 
> By "non-coherent" memory I thought it meant the same kind of memory that 
> kalloc would return. But from your answer it seems I am mistaken and 
> this is something different?

It depends: on a device that is actually cache-coherent,
dma_alloc_coherent() and dma_alloc_noncoherent() both return normal
memory.

On some architectures (not ARM) that are not fully coherent,
dma_alloc_coherent() has to return uncached memory, while
dma_alloc_noncoherent() is allowed to return cached memory but
requires a dma_cache_sync() operation.

dma_alloc_attrs() with DMA_ATTR_NON_CONSISTENT is a variant of that,
but I assume the idea is that you use dma_sync_single_fo_{cpu,device}()
on that memory, which can actually work on  ARM, unlike dma_cache_sync().

> > cacheable memory were to be returned, they could issue the necessary
> > cache flushes to ensure that the device could see the data.
> 
> Then what do the dma_sync_... methods do?
> 
> It has been my understanding that one can use dma_map... and dma_sync... 
> methods to make memory ranges visible to the device.

That is correct, but the DMA_ATTR_NON_CONSISTENT flag is not meaningful
with dma_map_...(), as that memory is not assumed to be consistent unless
you call dma_sync_...() to start with.

> Using dma_sync on coherent memory is just a waste of resources. So how 
> do i allocate memory that I'm supposed to use with dma_sync?

The traditional API (before the various attributes is):

dma_alloc_coherent()      --> never requires sync
dma_alloc_writecombine()  --> never requires sync, arch specific
dma_alloc_noncoherent()   --> dma_cache_sync(), arch specific
alloc_pages + dma_map_*() --> dma_sync_*

The dma_alloc_coherent() and dma_sync_*() interfaces are supposed to
determine themselves whether they need to do any cache management
based on whether the device is coherent already or not.

	Arnd

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-05-08  7:54                                 ` Arnd Bergmann
@ 2015-05-08  8:31                                   ` Mike Looijmans
  2015-05-08 13:19                                     ` Arnd Bergmann
  0 siblings, 1 reply; 34+ messages in thread
From: Mike Looijmans @ 2015-05-08  8:31 UTC (permalink / raw)
  To: linux-arm-kernel

?On 08-05-15 09:54, Arnd Bergmann wrote:
> On Friday 08 May 2015 07:55:26 Mike Looijmans wrote:
>> On 07-05-15 16:30, Russell King - ARM Linux wrote:
>>> On Thu, May 07, 2015 at 04:08:54PM +0200, Mike Looijmans wrote:
>>>> I read the rest of the thread, apparently it was never integrated.
>>>>
>>>> The patch for "non-consistent" is a BUG FIX, not some feature request or so.
>>>> I was already wondering why my driver had to kalloc pages to get proper
>>>> caching on it.
>>>
>>> I disagree.
>>>
>>>>   From https://www.kernel.org/doc/Documentation/DMA-attributes.txt:
>>>> """
>>>> DMA_ATTR_NON_CONSISTENT ... lets the platform to choose to return either
>>>> consistent or non-consistent memory as it sees fit.  By using this API,
>>>> you are guaranteeing to the platform that you have all the correct and
>>>> necessary sync points for this memory in the driver.
>>>> """
>>>
>>> DMA attributes are something that came in _after_ the DMA API had been
>>> around for many years.  It's a "new feature" that was added to an
>>> existing subsystem, and because there have been no need for it to be
>>> implemented on ARM, the new feature was never implemented.
>>>
>>> More than that, the vast majority of ARM hardware can't provide this
>>> kind of memory, and there are _no_ kernel APIs to ensure that if
>>
>> By "non-coherent" memory I thought it meant the same kind of memory that
>> kalloc would return. But from your answer it seems I am mistaken and
>> this is something different?
>
> It depends: on a device that is actually cache-coherent,
> dma_alloc_coherent() and dma_alloc_noncoherent() both return normal
> memory.
>
> On some architectures (not ARM) that are not fully coherent,
> dma_alloc_coherent() has to return uncached memory, while
> dma_alloc_noncoherent() is allowed to return cached memory but
> requires a dma_cache_sync() operation.
>
> dma_alloc_attrs() with DMA_ATTR_NON_CONSISTENT is a variant of that,
> but I assume the idea is that you use dma_sync_single_fo_{cpu,device}()
> on that memory, which can actually work on  ARM, unlike dma_cache_sync().

Ah, okay, I was misled by the names. I was under the impression that memory 
would be either "coherent" or "non-coherent". But what is called 
"non-coherent" here is actually something like "less-coherent", it isn't 
normal memory as alloc_pages would return, but it also isn't completely 
coherent. Is that a correct summary?

In that case, I stand corrected.

>>> cacheable memory were to be returned, they could issue the necessary
>>> cache flushes to ensure that the device could see the data.
>>
>> Then what do the dma_sync_... methods do?
>>
>> It has been my understanding that one can use dma_map... and dma_sync...
>> methods to make memory ranges visible to the device.
>
> That is correct, but the DMA_ATTR_NON_CONSISTENT flag is not meaningful
> with dma_map_...(), as that memory is not assumed to be consistent unless
> you call dma_sync_...() to start with.
>
>> Using dma_sync on coherent memory is just a waste of resources. So how
>> do i allocate memory that I'm supposed to use with dma_sync?
>
> The traditional API (before the various attributes is):
>
> dma_alloc_coherent()      --> never requires sync
> dma_alloc_writecombine()  --> never requires sync, arch specific
> dma_alloc_noncoherent()   --> dma_cache_sync(), arch specific
> alloc_pages + dma_map_*() --> dma_sync_*
>
> The dma_alloc_coherent() and dma_sync_*() interfaces are supposed to
> determine themselves whether they need to do any cache management
> based on whether the device is coherent already or not.

Okay, so in my case, I need to forget about the "non_coherent" stuff, it's 
something specific to a few platforms.

I was looking for an interface that would allocate memory for access by my 
device, but that would be just alloc_pages style memory. If my DMA controller 
is limited to say only the first GB of RAM, I'd set the DMA mask to "30 bits". 
If I just allocate memory using alloc_pages, the kernel doesn't know that I'd 
want it to be in the lower 1GB range, and could allocate it in a spot my 
device could not map.

Hence I'd expect there to be some "dma_alloc_pages(struct device* ...)" style 
of call to get memory that my device could access (and I was under the false 
impression that dma_alloc_noncoherent was the one I was looking for).

Currently I can get away with just using alloc_pages or kmalloc since my DMA 
controller happens to be able to access all memory. But I also want my device 
driver to work on 64-bit platforms (e.g. arm64 for the MPSOC and x86-64 for 
the PCIe version of the board).

M.


Kind regards,

Mike Looijmans
System Expert

TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com

Please consider the environment before printing this e-mail

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-05-08  5:55                               ` Mike Looijmans
  2015-05-08  7:54                                 ` Arnd Bergmann
@ 2015-05-08 11:10                                 ` Russell King - ARM Linux
  2015-05-08 12:17                                   ` Mike Looijmans
  1 sibling, 1 reply; 34+ messages in thread
From: Russell King - ARM Linux @ 2015-05-08 11:10 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, May 08, 2015 at 07:55:26AM +0200, Mike Looijmans wrote:
> On 07-05-15 16:30, Russell King - ARM Linux wrote:
> >On Thu, May 07, 2015 at 04:08:54PM +0200, Mike Looijmans wrote:
> >>I read the rest of the thread, apparently it was never integrated.
> >>
> >>The patch for "non-consistent" is a BUG FIX, not some feature request or so.
> >>I was already wondering why my driver had to kalloc pages to get proper
> >>caching on it.
> >
> >I disagree.
> >
> >> From https://www.kernel.org/doc/Documentation/DMA-attributes.txt:
> >>"""
> >>DMA_ATTR_NON_CONSISTENT ... lets the platform to choose to return either
> >>consistent or non-consistent memory as it sees fit.  By using this API,
> >>you are guaranteeing to the platform that you have all the correct and
> >>necessary sync points for this memory in the driver.
> >>"""
> >
> >DMA attributes are something that came in _after_ the DMA API had been
> >around for many years.  It's a "new feature" that was added to an
> >existing subsystem, and because there have been no need for it to be
> >implemented on ARM, the new feature was never implemented.
> >
> >More than that, the vast majority of ARM hardware can't provide this
> >kind of memory, and there are _no_ kernel APIs to ensure that if
> 
> By "non-coherent" memory I thought it meant the same kind of memory that
> kalloc would return. But from your answer it seems I am mistaken and this is
> something different?

Well, I do find the language used rather loose.  The term "coherent" is
well understood, but when the documentation mixes it with "consistent"
it becomes less clear whether the two terms mean the same thing or
something different.

First, let's clear up the API.  There are two parts of the DMA API, the
streaming API and the coherent API.

The streaming API is made up from:

- dma_map_single()
- dma_map_page()
- dma_map_sg()
- dma_sync_single_*()
- dma_sync_sg_*()
- dma_unmap_single() 
- dma_unmap_page()
- dma_unmap_sg()

The coherent API is made up from:

- dma_alloc_coherent() (or a derived function of this.)
- dma_zalloc_coherent()
- dma_free_coherent()
- dma_pool_create()
- dma_pool_alloc()
- dma_pool_free()
- dma_pool_destroy()

Using the streaming API functions with memory returned from the coherent
API is invalid, and in older ARM implementations will cause a WARN_ON()
or BUG_ON() to trigger (due to the need for coherent memory to be
remapped.)

>From the description of DMA_ATTR_NON_CONSISTENT talks about returning
memory.  None of the streaming API interfaces returns memory, only the
coherent API does.  So, DMA_ATTR_NON_CONSISTENT is something you'd pass
into dma_alloc_coherent().

Great, so you get your coherent-but-non-consistent memory (what's that?)
but the documentation says that you guarantee to the platform that you
have the necessary sync points in the driver for this.  What sync points
are those?  Is it talking about memory barriers, or is it talking about
some other cache flushing that the DMA API doesn't describe?

If it's talking about the driver needing to have memory barriers to
ensure proper ordering of accesses to the returned memory, that's
something modern ARMs always need irrespective of anything, so in that
sense, we always return non-consistent memory everywhere.

-- 
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-05-08 11:10                                 ` Russell King - ARM Linux
@ 2015-05-08 12:17                                   ` Mike Looijmans
  0 siblings, 0 replies; 34+ messages in thread
From: Mike Looijmans @ 2015-05-08 12:17 UTC (permalink / raw)
  To: linux-arm-kernel

?On 08-05-15 13:10, Russell King - ARM Linux wrote:
> Well, I do find the language used rather loose.  The term "coherent" is
> well understood, but when the documentation mixes it with "consistent"
> it becomes less clear whether the two terms mean the same thing or
> something different.

I re-read https://www.kernel.org/doc/Documentation/DMA-API.txt with this in mind.

"Part Ia" starts with "dma_alloc_coherent" and (friends) but then talks about 
"consistent" memory throughout the whole part. I think a 
s/consistent/coherent/g in this part will make things a lot clearer for everyone.

Shall I submit a patch for that?

Part Ib continues, but switches to using the word "coherent" instead.




Kind regards,

Mike Looijmans
System Expert

TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com

Please consider the environment before printing this e-mail

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-05-08  8:31                                   ` Mike Looijmans
@ 2015-05-08 13:19                                     ` Arnd Bergmann
  2015-05-08 14:18                                       ` Mike Looijmans
  0 siblings, 1 reply; 34+ messages in thread
From: Arnd Bergmann @ 2015-05-08 13:19 UTC (permalink / raw)
  To: linux-arm-kernel

On Friday 08 May 2015 10:31:53 Mike Looijmans wrote:
> On 08-05-15 09:54, Arnd Bergmann wrote:
> > On Friday 08 May 2015 07:55:26 Mike Looijmans wrote:
> >> On 07-05-15 16:30, Russell King - ARM Linux wrote:
> >>> On Thu, May 07, 2015 at 04:08:54PM +0200, Mike Looijmans wrote:
> >>>> I read the rest of the thread, apparently it was never integrated.
> >>>>
> >>>> The patch for "non-consistent" is a BUG FIX, not some feature request or so.
> >>>> I was already wondering why my driver had to kalloc pages to get proper
> >>>> caching on it.
> >>>
> >>> I disagree.
> >>>
> >>>>   From https://www.kernel.org/doc/Documentation/DMA-attributes.txt:
> >>>> """
> >>>> DMA_ATTR_NON_CONSISTENT ... lets the platform to choose to return either
> >>>> consistent or non-consistent memory as it sees fit.  By using this API,
> >>>> you are guaranteeing to the platform that you have all the correct and
> >>>> necessary sync points for this memory in the driver.
> >>>> """
> >>>
> >>> DMA attributes are something that came in _after_ the DMA API had been
> >>> around for many years.  It's a "new feature" that was added to an
> >>> existing subsystem, and because there have been no need for it to be
> >>> implemented on ARM, the new feature was never implemented.
> >>>
> >>> More than that, the vast majority of ARM hardware can't provide this
> >>> kind of memory, and there are _no_ kernel APIs to ensure that if
> >>
> >> By "non-coherent" memory I thought it meant the same kind of memory that
> >> kalloc would return. But from your answer it seems I am mistaken and
> >> this is something different?
> >
> > It depends: on a device that is actually cache-coherent,
> > dma_alloc_coherent() and dma_alloc_noncoherent() both return normal
> > memory.
> >
> > On some architectures (not ARM) that are not fully coherent,
> > dma_alloc_coherent() has to return uncached memory, while
> > dma_alloc_noncoherent() is allowed to return cached memory but
> > requires a dma_cache_sync() operation.
> >
> > dma_alloc_attrs() with DMA_ATTR_NON_CONSISTENT is a variant of that,
> > but I assume the idea is that you use dma_sync_single_fo_{cpu,device}()
> > on that memory, which can actually work on  ARM, unlike dma_cache_sync().
> 
> Ah, okay, I was misled by the names. I was under the impression that memory 
> would be either "coherent" or "non-coherent". But what is called 
> "non-coherent" here is actually something like "less-coherent", it isn't 
> normal memory as alloc_pages would return, but it also isn't completely 
> coherent. Is that a correct summary?
> 
> In that case, I stand corrected.

Almost. I think the part that you are still missing is that memory
itself it not coherent or non-coherent. It's the device access to
that memory that can be coherent or not with regard to the CPU.

The memory that is returned by alloc_pages can be coherent with one
device but non-coherent with another.

> I was looking for an interface that would allocate memory for access by my 
> device, but that would be just alloc_pages style memory. If my DMA controller 
> is limited to say only the first GB of RAM, I'd set the DMA mask to "30 bits". 
> If I just allocate memory using alloc_pages, the kernel doesn't know that I'd 
> want it to be in the lower 1GB range, and could allocate it in a spot my 
> device could not map.

If you have that low-memory restriction, you also need to ensure that there
is a ZONE_DMA that is large enough. ZONE_DMA should be sized to match the
common subset that all devices can access, so a GFP_DMA request returns
memory that is guaranteed to be accessible by all devices.

> Hence I'd expect there to be some "dma_alloc_pages(struct device* ...)" style 
> of call to get memory that my device could access (and I was under the false 
> impression that dma_alloc_noncoherent was the one I was looking for).
> 
> Currently I can get away with just using alloc_pages or kmalloc since my DMA 
> controller happens to be able to access all memory. But I also want my device 
> driver to work on 64-bit platforms (e.g. arm64 for the MPSOC and x86-64 for 
> the PCIe version of the board).

Those machines will have ZONE_DMA32, which refers to the first 4GB of memory,
so that should work fine. Alternatively you can use the iommu to get
all memory mapped into the space that is accessible by the device.
Also, on 64-bit x86 or ARM machines, all memory tends to be coherent, so
dma_sync_* will turn into a nop.

	Arnd

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-05-08 13:19                                     ` Arnd Bergmann
@ 2015-05-08 14:18                                       ` Mike Looijmans
  2015-05-08 14:27                                         ` Arnd Bergmann
  0 siblings, 1 reply; 34+ messages in thread
From: Mike Looijmans @ 2015-05-08 14:18 UTC (permalink / raw)
  To: linux-arm-kernel

?On 08-05-15 15:19, Arnd Bergmann wrote:
> On Friday 08 May 2015 10:31:53 Mike Looijmans wrote:
>> On 08-05-15 09:54, Arnd Bergmann wrote:
>>> On Friday 08 May 2015 07:55:26 Mike Looijmans wrote:
>>>> On 07-05-15 16:30, Russell King - ARM Linux wrote:
>>>>> On Thu, May 07, 2015 at 04:08:54PM +0200, Mike Looijmans wrote:
>>>>>> I read the rest of the thread, apparently it was never integrated.
>>>>>>
>>>>>> The patch for "non-consistent" is a BUG FIX, not some feature request or so.
>>>>>> I was already wondering why my driver had to kalloc pages to get proper
>>>>>> caching on it.
>>>>>
>>>>> I disagree.
>>>>>
>>>>>>    From https://www.kernel.org/doc/Documentation/DMA-attributes.txt:
>>>>>> """
>>>>>> DMA_ATTR_NON_CONSISTENT ... lets the platform to choose to return either
>>>>>> consistent or non-consistent memory as it sees fit.  By using this API,
>>>>>> you are guaranteeing to the platform that you have all the correct and
>>>>>> necessary sync points for this memory in the driver.
>>>>>> """
>>>>>
>>>>> DMA attributes are something that came in _after_ the DMA API had been
>>>>> around for many years.  It's a "new feature" that was added to an
>>>>> existing subsystem, and because there have been no need for it to be
>>>>> implemented on ARM, the new feature was never implemented.
>>>>>
>>>>> More than that, the vast majority of ARM hardware can't provide this
>>>>> kind of memory, and there are _no_ kernel APIs to ensure that if
>>>>
>>>> By "non-coherent" memory I thought it meant the same kind of memory that
>>>> kalloc would return. But from your answer it seems I am mistaken and
>>>> this is something different?
>>>
>>> It depends: on a device that is actually cache-coherent,
>>> dma_alloc_coherent() and dma_alloc_noncoherent() both return normal
>>> memory.
>>>
>>> On some architectures (not ARM) that are not fully coherent,
>>> dma_alloc_coherent() has to return uncached memory, while
>>> dma_alloc_noncoherent() is allowed to return cached memory but
>>> requires a dma_cache_sync() operation.
>>>
>>> dma_alloc_attrs() with DMA_ATTR_NON_CONSISTENT is a variant of that,
>>> but I assume the idea is that you use dma_sync_single_fo_{cpu,device}()
>>> on that memory, which can actually work on  ARM, unlike dma_cache_sync().
>>
>> Ah, okay, I was misled by the names. I was under the impression that memory
>> would be either "coherent" or "non-coherent". But what is called
>> "non-coherent" here is actually something like "less-coherent", it isn't
>> normal memory as alloc_pages would return, but it also isn't completely
>> coherent. Is that a correct summary?
>>
>> In that case, I stand corrected.
>
> Almost. I think the part that you are still missing is that memory
> itself it not coherent or non-coherent. It's the device access to
> that memory that can be coherent or not with regard to the CPU.
>
> The memory that is returned by alloc_pages can be coherent with one
> device but non-coherent with another.

Yeah, I did understand, but never made that explicit.

The Zynq is even more funky than most hardware - I can actually enable/disable 
(hardware) coherency for each memory burst. If I wanted to. Currently I just 
turn it on/off for the whole system at boot time, since I need to write its 
coherency status into the devicetree for the kernel to understand how my 
device accesses memory.

>> I was looking for an interface that would allocate memory for access by my
>> device, but that would be just alloc_pages style memory. If my DMA controller
>> is limited to say only the first GB of RAM, I'd set the DMA mask to "30 bits".
>> If I just allocate memory using alloc_pages, the kernel doesn't know that I'd
>> want it to be in the lower 1GB range, and could allocate it in a spot my
>> device could not map.
>
> If you have that low-memory restriction, you also need to ensure that there
> is a ZONE_DMA that is large enough. ZONE_DMA should be sized to match the
> common subset that all devices can access, so a GFP_DMA request returns
> memory that is guaranteed to be accessible by all devices.

GFP_DMA still carries that scary remark: "on x86, GFP_DMA guarantees to be 
within the first 16MB of available bus addresses". Which is really scary if 
you want to allocate multimegabyte buffers.

>> Hence I'd expect there to be some "dma_alloc_pages(struct device* ...)" style
>> of call to get memory that my device could access (and I was under the false
>> impression that dma_alloc_noncoherent was the one I was looking for).
>>
>> Currently I can get away with just using alloc_pages or kmalloc since my DMA
>> controller happens to be able to access all memory. But I also want my device
>> driver to work on 64-bit platforms (e.g. arm64 for the MPSOC and x86-64 for
>> the PCIe version of the board).
>
> Those machines will have ZONE_DMA32, which refers to the first 4GB of memory,
> so that should work fine. Alternatively you can use the iommu to get
> all memory mapped into the space that is accessible by the device.

With my first tests on the PC version I ran into a problem with the PCIe 
interface that the IP block that does the address translation would do 
something like addr=base_addr|offset instead of addr=base_addr+offset. Since 
base_addr was assigned dynamically (BAR) the board would fail to work (write 
to the wrong spot in memory, that is) if the BAR was aligned on a 4MB boundary 
and my buffer was on an "odd" 4MB address. I hacked around that by calling 
dma_alloc_coherent() until I got one that was evenly aligned and then get rid 
of the failed ones. Just an anecdote that shows that requirements for buffer 
alignment can be weird. Of course, we software guys are here to clean up the 
mess because fixing the bug in hardware is out of the question.

 > Also, on 64-bit x86 or ARM machines, all memory tends to be coherent, so
 > dma_sync_* will turn into a nop.

Well, that should make things easier, right? I just can't wait until I get 
that first MPSOC on my desk :)


Kind regards,

Mike Looijmans
System Expert

TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com

Please consider the environment before printing this e-mail

^ permalink raw reply	[flat|nested] 34+ messages in thread

* dma_alloc_coherent versus streaming DMA, neither works satisfactory
  2015-05-08 14:18                                       ` Mike Looijmans
@ 2015-05-08 14:27                                         ` Arnd Bergmann
  0 siblings, 0 replies; 34+ messages in thread
From: Arnd Bergmann @ 2015-05-08 14:27 UTC (permalink / raw)
  To: linux-arm-kernel

On Friday 08 May 2015 16:18:11 Mike Looijmans wrote:
> >
> > If you have that low-memory restriction, you also need to ensure that there
> > is a ZONE_DMA that is large enough. ZONE_DMA should be sized to match the
> > common subset that all devices can access, so a GFP_DMA request returns
> > memory that is guaranteed to be accessible by all devices.
> 
> GFP_DMA still carries that scary remark: "on x86, GFP_DMA guarantees to be 
> within the first 16MB of available bus addresses". Which is really scary if 
> you want to allocate multimegabyte buffers.

Don't worry about that on ARM: the size of the dma zone is platform specific,
so you can set it to whatever you want. The 16MB limit is only relevant
on systems that use ISA DMA (floppy, soundblaster, ...)

	Arnd

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2015-05-08 14:27 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-23 11:52 dma_alloc_coherent versus streaming DMA, neither works satisfactory Mike Looijmans
2015-04-23 12:32 ` Arnd Bergmann
2015-04-23 13:05   ` Mike Looijmans
2015-04-29  8:47   ` Mike Looijmans
2015-04-29  9:01     ` Arnd Bergmann
2015-04-29  9:17       ` Russell King - ARM Linux
2015-04-29  9:47         ` Mike Looijmans
2015-04-29 10:07           ` Arnd Bergmann
2015-04-29 10:33             ` Mike Looijmans
2015-04-29 10:41               ` Arnd Bergmann
2015-04-29 12:49                 ` Mike Looijmans
2015-04-29 13:13                   ` Arnd Bergmann
2015-04-30 13:50                     ` Mike Looijmans
2015-04-30 13:54                       ` Arnd Bergmann
2015-05-01  6:08                         ` Mike Looijmans
2015-05-01  7:01                           ` Mike Looijmans
2015-05-07 11:18                     ` Mike Looijmans
2015-05-07 11:56                       ` Arnd Bergmann
2015-05-07 13:21                       ` Daniel Drake
2015-05-07 13:31                         ` Mike Looijmans
2015-05-07 14:08                           ` Mike Looijmans
2015-05-07 14:30                             ` Russell King - ARM Linux
2015-05-08  5:55                               ` Mike Looijmans
2015-05-08  7:54                                 ` Arnd Bergmann
2015-05-08  8:31                                   ` Mike Looijmans
2015-05-08 13:19                                     ` Arnd Bergmann
2015-05-08 14:18                                       ` Mike Looijmans
2015-05-08 14:27                                         ` Arnd Bergmann
2015-05-08 11:10                                 ` Russell King - ARM Linux
2015-05-08 12:17                                   ` Mike Looijmans
2015-04-29 11:09             ` Mike Looijmans
2015-04-29 12:35               ` Arnd Bergmann
2015-04-29 12:52                 ` Mike Looijmans
2015-04-29 12:54                   ` Arnd Bergmann

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.