* dma_alloc_coherent versus streaming DMA, neither works satisfactory
@ 2015-04-23 11:52 Mike Looijmans
2015-04-23 12:32 ` Arnd Bergmann
0 siblings, 1 reply; 34+ messages in thread
From: Mike Looijmans @ 2015-04-23 11:52 UTC (permalink / raw)
To: linux-arm-kernel
?I'm writing a driver that transfers data on a Zynq-7000 between the ARM and PL
part, using a buffering scheme similar to IIO. So it allocates buffers, which
the user can memory-map and then send/receive using ioctl calls.
The trouble I have it that transitioning these buffers into user space is
costing more CPU time than actually copying them.
The self-written DMA controller is in logic, can queue up several transfers
and operates fine.
The trouble I have with the driver is that for the DMA transfers, I want to
use a zero-copy style interface, because the data usually consists of video
frames, and moving them around in memory is prohibitively expensive.
I based my implementation on what IIO (industrial IO) does, and implemented
IOCTL calls to the driver to allocate, free, enqueue and dequeue blocks that
are owned by the driver. Each block can be mmapped into user space.
Using dma_alloc_coherent to allocate the blocks, and then just using them
without any extra measures works just fine. The system can transfer data at
600MB/s to and from DDR into logic with very little CPU intervention. However,
the memory returned by dma_mmap_coherent appears to be uncached, because
accessing this area from userspace is horribly slow (e.g. reading 200MB
byte-by-byte in a simple for loop takes 20 seconds, while this takes about a
second in malloced memory)
After reading some documentation, I decided that I should use a streaming DMA
interface because my driver knows exactly when logic or CPU "owns" the data
blocks. So instead of the "coherent" functions, I just kmalloc these buffers
and then use dma_map_single_* to initialize them. Before and after DMA
transfers, I call the appropriate dma_sync_single_for_* method.
By mapping the kmalloced memory into user space, I once again have speedy
access to this memory, and caching is enabled. Data transfers also work well
and extensive testing shows that this works well. However, the in-kernel
performance is now completely crippled. The system spends so much time in the
dma_sync_single... calls, the the CPU now becomes a limiting factor. This
limits the transfer speeds to about only 150MB/s. This method is only about
20% more CPU intensive than just copying the data from the DMA buffer into a
user buffer using a copy_to_user call!
Since my DMA controller is pretty smart, I also experimented with transfers
directly from user memory. This boiled down to calling "get_user_pages" and
then constructing a scatter-gather list using sg_init_table and then adding
those user pages. Then call "dma_map_sg" to translate and coalesce the pages
into DMA requests. Just this pagetable housekeeping took about the same amount
of processing time as the copy_from_user call, which made me abandon that code
before even getting to the point of actually transferring the data.
Based on that experience, I'd think the dma_sync calls do similar things
(walking page tables and changing some attributes) and that is where they
spend so much time.
I also tried cheating by not calling the dma_sync methods at all, but this
(surprisingly) led to hangups in the driver. I would have only expected data
corruption, not "hang". I'm still investigating that route. Specifying
"dma-coherent" has the same effect as it basically "nulls" the sync methods.
Also tried to replace the "dma_mmap_coherent" call by a simple
"remap_pfn_range" so as to prevent setting the cache attributes on that
region, but that didn't have any effect at all, it appears that the
non-cachable property was already applied in the dma_alloc_coherent method.
I added some timing code to the "sync" calls, this is what I get (numbers in
microseconds) when using 1MB blocks of streaming DMA transfers:
dma_sync_single_for_device(TO_DEVICE): 3336
dma_sync_single_for_device(FROM_DEVICE): 1991
dma_sync_single_for_cpu(FROM_DEVICE): 2175
dma_sync_single_for_cpu(TO_DEVICE): 0
dma_sync_single_for_device(TO_DEVICE): 3152
dma_sync_single_for_device(FROM_DEVICE): 1990
dma_sync_single_for_cpu(FROM_DEVICE): 2193
dma_sync_single_for_cpu(TO_DEVICE): 0
As you can see, the system spends 2 or 3 ms on "housekeeping" for each
transition, except the cpu(TO_DEVICE) one which appears to be free which is
perfectly logical because returning the outgoing buffer to the CPU should not
need any special cache handling. I would have expected the
for_device(FROM_DEVICE) to be free as well, but surprisingly this one takes up
2ms as well.
Adding the numbers, it takes over 7 ms overhead to transfer 1MB data, hence
1MB/0.007s or about 150MB/s would be the maximum possible data transfer rate.
Can anyone here offer some advise on this?
Kind regards,
Mike Looijmans
System Expert
TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com
Please consider the environment before printing this e-mail
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-04-23 11:52 dma_alloc_coherent versus streaming DMA, neither works satisfactory Mike Looijmans
@ 2015-04-23 12:32 ` Arnd Bergmann
2015-04-23 13:05 ` Mike Looijmans
2015-04-29 8:47 ` Mike Looijmans
0 siblings, 2 replies; 34+ messages in thread
From: Arnd Bergmann @ 2015-04-23 12:32 UTC (permalink / raw)
To: linux-arm-kernel
On Thursday 23 April 2015 13:52:34 Mike Looijmans wrote:
> Can anyone here offer some advise on this?
>
The problem you are experiencing is a direct result of using hardware
without cache-coherency from user space. There is no software workaround
for this: If you want data to be cacheable *and* avoid doing manual cache
flushes each time data is passed between user space and hardware, you have
to use hardware that is cache-coherent.
You mentioned that you are using 'Zynq', which supports cache-coherent
DMA using the 'accelerator coherency port'. If you are able to connect
your device to that port, it should work, otherwise you should consider
using a different platform.
Arnd
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-04-23 12:32 ` Arnd Bergmann
@ 2015-04-23 13:05 ` Mike Looijmans
2015-04-29 8:47 ` Mike Looijmans
1 sibling, 0 replies; 34+ messages in thread
From: Mike Looijmans @ 2015-04-23 13:05 UTC (permalink / raw)
To: linux-arm-kernel
?On 23-04-15 14:32, Arnd Bergmann wrote:
> On Thursday 23 April 2015 13:52:34 Mike Looijmans wrote:
>> Can anyone here offer some advise on this?
>>
>
> The problem you are experiencing is a direct result of using hardware
> without cache-coherency from user space. There is no software workaround
> for this: If you want data to be cacheable *and* avoid doing manual cache
> flushes each time data is passed between user space and hardware, you have
> to use hardware that is cache-coherent.
I can live with manual flushes. I know exactly what and where to flush in the
driver.
What bothers me so much is that just invalidating the data cache(s) takes
longer than actually reading the data from uncached memory and writing it into
cached memory elsewhere. That just does not sound right to me.
> You mentioned that you are using 'Zynq', which supports cache-coherent
> DMA using the 'accelerator coherency port'. If you are able to connect
> your device to that port, it should work, otherwise you should consider
> using a different platform.
I can probably hook up the DMA controller to the ACP port, however, I would
expect this to have serious negative impact on system performance. The
transfers are typically in the megabyte range (for example, video frames) so
they will not fit into the L2 cache (which is 512k on the Zynq). I'd expect
these transfers, when routed through the ACP, to "push" all data out of the L2
cache, replacing it with data that isn't going to be read any time soon, and
thus interfere with performance too much.
It might be worth a try though to see what will actually happen.
Mike.
Kind regards,
Mike Looijmans
System Expert
TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com
Please consider the environment before printing this e-mail
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-04-23 12:32 ` Arnd Bergmann
2015-04-23 13:05 ` Mike Looijmans
@ 2015-04-29 8:47 ` Mike Looijmans
2015-04-29 9:01 ` Arnd Bergmann
1 sibling, 1 reply; 34+ messages in thread
From: Mike Looijmans @ 2015-04-29 8:47 UTC (permalink / raw)
To: linux-arm-kernel
?On 23-04-15 14:32, Arnd Bergmann wrote:
> On Thursday 23 April 2015 13:52:34 Mike Looijmans wrote:
>> Can anyone here offer some advise on this?
>>
>
> The problem you are experiencing is a direct result of using hardware
> without cache-coherency from user space. There is no software workaround
> for this: If you want data to be cacheable *and* avoid doing manual cache
> flushes each time data is passed between user space and hardware, you have
> to use hardware that is cache-coherent.
>
> You mentioned that you are using 'Zynq', which supports cache-coherent
> DMA using the 'accelerator coherency port'. If you are able to connect
> your device to that port, it should work, otherwise you should consider
> using a different platform.
I tried as you suggested, and used the ACP instead of the HP to connect the
logic to the CPU.
Without any further changes, this passes all tests with the exact same
performance numbers. The reason for that is that the driver/DMA framework is
unaware of the "coherency" hardware and still uses the manual cache
flush/invalidate routines, so I figured out that adding a "dma-coherent"
property to the devicetree node changes that.
However, with "dma-coherent" set for my driver, the system locks up at random
points in the tests. Simple memory transfer tests fail with data mismatches
(probably stale cache results). Running DMA tests usually results in the
system completely locking up at some point.
Normal register read/write access is done through the AXI bus directly, not
using the ACP at all.
Is the ACP hardware broken, is there some extra things my driver needs to be
aware of, or is there something else I need to do here?
Kind regards,
Mike Looijmans
System Expert
TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com
Please consider the environment before printing this e-mail
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-04-29 8:47 ` Mike Looijmans
@ 2015-04-29 9:01 ` Arnd Bergmann
2015-04-29 9:17 ` Russell King - ARM Linux
0 siblings, 1 reply; 34+ messages in thread
From: Arnd Bergmann @ 2015-04-29 9:01 UTC (permalink / raw)
To: linux-arm-kernel
On Wednesday 29 April 2015 10:47:05 Mike Looijmans wrote:
> On 23-04-15 14:32, Arnd Bergmann wrote:
> > On Thursday 23 April 2015 13:52:34 Mike Looijmans wrote:
> >> Can anyone here offer some advise on this?
> >>
> >
> > The problem you are experiencing is a direct result of using hardware
> > without cache-coherency from user space. There is no software workaround
> > for this: If you want data to be cacheable *and* avoid doing manual cache
> > flushes each time data is passed between user space and hardware, you have
> > to use hardware that is cache-coherent.
> >
> > You mentioned that you are using 'Zynq', which supports cache-coherent
> > DMA using the 'accelerator coherency port'. If you are able to connect
> > your device to that port, it should work, otherwise you should consider
> > using a different platform.
>
> I tried as you suggested, and used the ACP instead of the HP to connect the
> logic to the CPU.
>
> Without any further changes, this passes all tests with the exact same
> performance numbers. The reason for that is that the driver/DMA framework is
> unaware of the "coherency" hardware and still uses the manual cache
> flush/invalidate routines, so I figured out that adding a "dma-coherent"
> property to the devicetree node changes that.
Correct.
> However, with "dma-coherent" set for my driver, the system locks up at random
> points in the tests. Simple memory transfer tests fail with data mismatches
> (probably stale cache results). Running DMA tests usually results in the
> system completely locking up at some point.
> Normal register read/write access is done through the AXI bus directly, not
> using the ACP at all.
>
> Is the ACP hardware broken, is there some extra things my driver needs to be
> aware of, or is there something else I need to do here?
You still need to synchronize MMIO register accesses with write buffers,
as the readl() and writel() functions do in the kernel.
In particular, after you have written a buffer to memory from the CPU,
you will need to do an outer_sync() before the MMIO write that triggers
the DMA. This is still much cheaper than doing the cache flush though.
On the inbound side, you normally need an MMIO read followed by a dsb
instruction to ensure that data is visible in the cache after you have
received an interrupt from the device. Again that is what readl()
does, but it won't be implied if you do the MMIO from user space.
Another possible problem would be if the driver mmaps the buffer in
uncached mode to user space. This is something your kernel driver has
to get right, it won't be handled automatically by setting the
"dma-coherent" property in DT.
There could of course be some other problem either in your FPGA code,
in your driver, or in your user space. I would not assume that the zynq
ACP itself is broken though.
Arnd
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-04-29 9:01 ` Arnd Bergmann
@ 2015-04-29 9:17 ` Russell King - ARM Linux
2015-04-29 9:47 ` Mike Looijmans
0 siblings, 1 reply; 34+ messages in thread
From: Russell King - ARM Linux @ 2015-04-29 9:17 UTC (permalink / raw)
To: linux-arm-kernel
On Wed, Apr 29, 2015 at 11:01:35AM +0200, Arnd Bergmann wrote:
> You still need to synchronize MMIO register accesses with write buffers,
> as the readl() and writel() functions do in the kernel.
>
> In particular, after you have written a buffer to memory from the CPU,
> you will need to do an outer_sync() before the MMIO write that triggers
> the DMA. This is still much cheaper than doing the cache flush though.
Note that outer_sync() is already done by readl/writel and/or the write
memory barriers (mb()/wmb()).
> Another possible problem would be if the driver mmaps the buffer in
> uncached mode to user space. This is something your kernel driver has
> to get right, it won't be handled automatically by setting the
> "dma-coherent" property in DT.
The buffer should also be mapped into userspace with the same memory
type and cache attributes as the kernel side mapping. If using ACP,
then you probably want "normal memory, cacheable, writeback, read
allocate" or in the case of SMP, the same but "read/write allocate".
--
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-04-29 9:17 ` Russell King - ARM Linux
@ 2015-04-29 9:47 ` Mike Looijmans
2015-04-29 10:07 ` Arnd Bergmann
0 siblings, 1 reply; 34+ messages in thread
From: Mike Looijmans @ 2015-04-29 9:47 UTC (permalink / raw)
To: linux-arm-kernel
?On 29-04-15 11:17, Russell King - ARM Linux wrote:
> On Wed, Apr 29, 2015 at 11:01:35AM +0200, Arnd Bergmann wrote:
>> You still need to synchronize MMIO register accesses with write buffers,
>> as the readl() and writel() functions do in the kernel.
>>
>> In particular, after you have written a buffer to memory from the CPU,
>> you will need to do an outer_sync() before the MMIO write that triggers
>> the DMA. This is still much cheaper than doing the cache flush though.
>
> Note that outer_sync() is already done by readl/writel and/or the write
> memory barriers (mb()/wmb()).
I initiate the DMA transfers using iowrite32() so if I understand correctly,
I'm already doing the right thing here.
Just to be completely clear, there is no direct register access from user
space, the driver does all MMIO. Userspace only gets an mmap for DMA buffers,
and uses ioctl to initiate transfers.
>> Another possible problem would be if the driver mmaps the buffer in
>> uncached mode to user space. This is something your kernel driver has
>> to get right, it won't be handled automatically by setting the
>> "dma-coherent" property in DT.
>
> The buffer should also be mapped into userspace with the same memory
> type and cache attributes as the kernel side mapping. If using ACP,
> then you probably want "normal memory, cacheable, writeback, read
> allocate" or in the case of SMP, the same but "read/write allocate".
I currently use dma_alloc_coherent() to allocate buffers and
dma_mmap_coherent() to map them to user space. I was under the assumption that
these would do the right thing. Is that correct? If not, then what should I use?
Kind regards,
Mike Looijmans
System Expert
TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com
Please consider the environment before printing this e-mail
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-04-29 9:47 ` Mike Looijmans
@ 2015-04-29 10:07 ` Arnd Bergmann
2015-04-29 10:33 ` Mike Looijmans
2015-04-29 11:09 ` Mike Looijmans
0 siblings, 2 replies; 34+ messages in thread
From: Arnd Bergmann @ 2015-04-29 10:07 UTC (permalink / raw)
To: linux-arm-kernel
On Wednesday 29 April 2015 11:47:37 Mike Looijmans wrote:
> On 29-04-15 11:17, Russell King - ARM Linux wrote:
> > On Wed, Apr 29, 2015 at 11:01:35AM +0200, Arnd Bergmann wrote:
> >> You still need to synchronize MMIO register accesses with write buffers,
> >> as the readl() and writel() functions do in the kernel.
> >>
> >> In particular, after you have written a buffer to memory from the CPU,
> >> you will need to do an outer_sync() before the MMIO write that triggers
> >> the DMA. This is still much cheaper than doing the cache flush though.
> >
> > Note that outer_sync() is already done by readl/writel and/or the write
> > memory barriers (mb()/wmb()).
>
> I initiate the DMA transfers using iowrite32() so if I understand correctly,
> I'm already doing the right thing here.
>
> Just to be completely clear, there is no direct register access from user
> space, the driver does all MMIO. Userspace only gets an mmap for DMA buffers,
> and uses ioctl to initiate transfers.
Ok, that seems all fine then.
> >> Another possible problem would be if the driver mmaps the buffer in
> >> uncached mode to user space. This is something your kernel driver has
> >> to get right, it won't be handled automatically by setting the
> >> "dma-coherent" property in DT.
> >
> > The buffer should also be mapped into userspace with the same memory
> > type and cache attributes as the kernel side mapping. If using ACP,
> > then you probably want "normal memory, cacheable, writeback, read
> > allocate" or in the case of SMP, the same but "read/write allocate".
>
> I currently use dma_alloc_coherent() to allocate buffers and
> dma_mmap_coherent() to map them to user space. I was under the assumption that
> these would do the right thing. Is that correct? If not, then what should I use?
dma_mmap_coherent() is the right interface, but I've just looked at the
implementation of arm_dma_mmap() and I'm not sure that it actually uses the
correct vma->vm_page_prot value here, because I don't see where it takes
into account whether the device is coherent or not. Most ARM machines have
only noncoherent devices, and dma_mmap_coherent() is used rarely by drivers,
so it's quite possible that this interface got broken without anybody
noticing.
If my suspicion is correct, we should either change arm_coherent_dma_ops()
to refer to a different mmap() callback that does the right thing for
coherent devices, or change arm_dma_mmap() to look at dev->is_coherent.
Arnd
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-04-29 10:07 ` Arnd Bergmann
@ 2015-04-29 10:33 ` Mike Looijmans
2015-04-29 10:41 ` Arnd Bergmann
2015-04-29 11:09 ` Mike Looijmans
1 sibling, 1 reply; 34+ messages in thread
From: Mike Looijmans @ 2015-04-29 10:33 UTC (permalink / raw)
To: linux-arm-kernel
?On 29-04-15 12:07, Arnd Bergmann wrote:
> On Wednesday 29 April 2015 11:47:37 Mike Looijmans wrote:
>> On 29-04-15 11:17, Russell King - ARM Linux wrote:
>>> On Wed, Apr 29, 2015 at 11:01:35AM +0200, Arnd Bergmann wrote:
>>>> You still need to synchronize MMIO register accesses with write buffers,
>>>> as the readl() and writel() functions do in the kernel.
>>>>
>>>> In particular, after you have written a buffer to memory from the CPU,
>>>> you will need to do an outer_sync() before the MMIO write that triggers
>>>> the DMA. This is still much cheaper than doing the cache flush though.
>>>
>>> Note that outer_sync() is already done by readl/writel and/or the write
>>> memory barriers (mb()/wmb()).
>>
>> I initiate the DMA transfers using iowrite32() so if I understand correctly,
>> I'm already doing the right thing here.
>>
>> Just to be completely clear, there is no direct register access from user
>> space, the driver does all MMIO. Userspace only gets an mmap for DMA buffers,
>> and uses ioctl to initiate transfers.
>
> Ok, that seems all fine then.
>
>>>> Another possible problem would be if the driver mmaps the buffer in
>>>> uncached mode to user space. This is something your kernel driver has
>>>> to get right, it won't be handled automatically by setting the
>>>> "dma-coherent" property in DT.
>>>
>>> The buffer should also be mapped into userspace with the same memory
>>> type and cache attributes as the kernel side mapping. If using ACP,
>>> then you probably want "normal memory, cacheable, writeback, read
>>> allocate" or in the case of SMP, the same but "read/write allocate".
>>
>> I currently use dma_alloc_coherent() to allocate buffers and
>> dma_mmap_coherent() to map them to user space. I was under the assumption that
>> these would do the right thing. Is that correct? If not, then what should I use?
>
> dma_mmap_coherent() is the right interface, but I've just looked at the
> implementation of arm_dma_mmap() and I'm not sure that it actually uses the
> correct vma->vm_page_prot value here, because I don't see where it takes
> into account whether the device is coherent or not. Most ARM machines have
> only noncoherent devices, and dma_mmap_coherent() is used rarely by drivers,
> so it's quite possible that this interface got broken without anybody
> noticing.
>
> If my suspicion is correct, we should either change arm_coherent_dma_ops()
> to refer to a different mmap() callback that does the right thing for
> coherent devices, or change arm_dma_mmap() to look at dev->is_coherent.
Following the route, arch/arm/mm/dma-mapping.c uses pgprot_dmacoherent() which
is defined in arch/arm/include/asm/pgtable.h and that just returns uncached
memory.
If you can give me some hints as to what the correct flags would be, I can
patch my kernel and test it.
Kind regards,
Mike Looijmans
System Expert
TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com
Please consider the environment before printing this e-mail
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-04-29 10:33 ` Mike Looijmans
@ 2015-04-29 10:41 ` Arnd Bergmann
2015-04-29 12:49 ` Mike Looijmans
0 siblings, 1 reply; 34+ messages in thread
From: Arnd Bergmann @ 2015-04-29 10:41 UTC (permalink / raw)
To: linux-arm-kernel
On Wednesday 29 April 2015 12:33:00 Mike Looijmans wrote:
> On 29-04-15 12:07, Arnd Bergmann wrote:
> > On Wednesday 29 April 2015 11:47:37 Mike Looijmans wrote:
> >> On 29-04-15 11:17, Russell King - ARM Linux wrote:
> >>> The buffer should also be mapped into userspace with the same memory
> >>> type and cache attributes as the kernel side mapping. If using ACP,
> >>> then you probably want "normal memory, cacheable, writeback, read
> >>> allocate" or in the case of SMP, the same but "read/write allocate".
> >
> > If my suspicion is correct, we should either change arm_coherent_dma_ops()
> > to refer to a different mmap() callback that does the right thing for
> > coherent devices, or change arm_dma_mmap() to look at dev->is_coherent.
>
> Following the route, arch/arm/mm/dma-mapping.c uses pgprot_dmacoherent() which
> is defined in arch/arm/include/asm/pgtable.h and that just returns uncached
> memory.
>
> If you can give me some hints as to what the correct flags would be, I can
> patch my kernel and test it.
Use the flags that Russell listed above.
Arnd
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-04-29 10:07 ` Arnd Bergmann
2015-04-29 10:33 ` Mike Looijmans
@ 2015-04-29 11:09 ` Mike Looijmans
2015-04-29 12:35 ` Arnd Bergmann
1 sibling, 1 reply; 34+ messages in thread
From: Mike Looijmans @ 2015-04-29 11:09 UTC (permalink / raw)
To: linux-arm-kernel
?On 29-04-15 12:07, Arnd Bergmann wrote:
> On Wednesday 29 April 2015 11:47:37 Mike Looijmans wrote:
>> On 29-04-15 11:17, Russell King - ARM Linux wrote:
>>> On Wed, Apr 29, 2015 at 11:01:35AM +0200, Arnd Bergmann wrote:
>>>> You still need to synchronize MMIO register accesses with write buffers,
>>>> as the readl() and writel() functions do in the kernel.
>>>>
>>>> In particular, after you have written a buffer to memory from the CPU,
>>>> you will need to do an outer_sync() before the MMIO write that triggers
>>>> the DMA. This is still much cheaper than doing the cache flush though.
>>>
>>> Note that outer_sync() is already done by readl/writel and/or the write
>>> memory barriers (mb()/wmb()).
>>
>> I initiate the DMA transfers using iowrite32() so if I understand correctly,
>> I'm already doing the right thing here.
>>
>> Just to be completely clear, there is no direct register access from user
>> space, the driver does all MMIO. Userspace only gets an mmap for DMA buffers,
>> and uses ioctl to initiate transfers.
>
> Ok, that seems all fine then.
>
>>>> Another possible problem would be if the driver mmaps the buffer in
>>>> uncached mode to user space. This is something your kernel driver has
>>>> to get right, it won't be handled automatically by setting the
>>>> "dma-coherent" property in DT.
>>>
>>> The buffer should also be mapped into userspace with the same memory
>>> type and cache attributes as the kernel side mapping. If using ACP,
>>> then you probably want "normal memory, cacheable, writeback, read
>>> allocate" or in the case of SMP, the same but "read/write allocate".
>>
>> I currently use dma_alloc_coherent() to allocate buffers and
>> dma_mmap_coherent() to map them to user space. I was under the assumption that
>> these would do the right thing. Is that correct? If not, then what should I use?
>
> dma_mmap_coherent() is the right interface, but I've just looked at the
> implementation of arm_dma_mmap() and I'm not sure that it actually uses the
> correct vma->vm_page_prot value here, because I don't see where it takes
> into account whether the device is coherent or not. Most ARM machines have
> only noncoherent devices, and dma_mmap_coherent() is used rarely by drivers,
> so it's quite possible that this interface got broken without anybody
> noticing.
My driver also supports 'classic' read/write by copying data to/from userspace
using a DMA ringbuffer, which is also allocated using dma_alloc_coherent. That
would suggest that the kernel mapping for this memory is also incorrect, I
also get corrupted data in these transfers. These buffers are not being mapped
to userspace at all. These tests all use page aligned transfer sizes.
M.
Kind regards,
Mike Looijmans
System Expert
TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com
Please consider the environment before printing this e-mail
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-04-29 11:09 ` Mike Looijmans
@ 2015-04-29 12:35 ` Arnd Bergmann
2015-04-29 12:52 ` Mike Looijmans
0 siblings, 1 reply; 34+ messages in thread
From: Arnd Bergmann @ 2015-04-29 12:35 UTC (permalink / raw)
To: linux-arm-kernel
On Wednesday 29 April 2015 13:09:05 Mike Looijmans wrote:
> On 29-04-15 12:07, Arnd Bergmann wrote:
> > On Wednesday 29 April 2015 11:47:37 Mike Looijmans wrote:
> >> I currently use dma_alloc_coherent() to allocate buffers and
> >> dma_mmap_coherent() to map them to user space. I was under the assumption that
> >> these would do the right thing. Is that correct? If not, then what should I use?
> >
> > dma_mmap_coherent() is the right interface, but I've just looked at the
> > implementation of arm_dma_mmap() and I'm not sure that it actually uses the
> > correct vma->vm_page_prot value here, because I don't see where it takes
> > into account whether the device is coherent or not. Most ARM machines have
> > only noncoherent devices, and dma_mmap_coherent() is used rarely by drivers,
> > so it's quite possible that this interface got broken without anybody
> > noticing.
>
> My driver also supports 'classic' read/write by copying data to/from userspace
> using a DMA ringbuffer, which is also allocated using dma_alloc_coherent. That
> would suggest that the kernel mapping for this memory is also incorrect, I
> also get corrupted data in these transfers. These buffers are not being mapped
> to userspace at all. These tests all use page aligned transfer sizes.
Ok, if it's broken when you only use a kernel mapping and no user space
mapping, arm_dma_mmap() is not your (only) problem, but we should probably
still fix that if it's broken.
Arnd
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-04-29 10:41 ` Arnd Bergmann
@ 2015-04-29 12:49 ` Mike Looijmans
2015-04-29 13:13 ` Arnd Bergmann
0 siblings, 1 reply; 34+ messages in thread
From: Mike Looijmans @ 2015-04-29 12:49 UTC (permalink / raw)
To: linux-arm-kernel
?On 29-04-15 12:41, Arnd Bergmann wrote:
> On Wednesday 29 April 2015 12:33:00 Mike Looijmans wrote:
>> On 29-04-15 12:07, Arnd Bergmann wrote:
>>> On Wednesday 29 April 2015 11:47:37 Mike Looijmans wrote:
>>>> On 29-04-15 11:17, Russell King - ARM Linux wrote:
>>>>> The buffer should also be mapped into userspace with the same memory
>>>>> type and cache attributes as the kernel side mapping. If using ACP,
>>>>> then you probably want "normal memory, cacheable, writeback, read
>>>>> allocate" or in the case of SMP, the same but "read/write allocate".
>>>
>>> If my suspicion is correct, we should either change arm_coherent_dma_ops()
>>> to refer to a different mmap() callback that does the right thing for
>>> coherent devices, or change arm_dma_mmap() to look at dev->is_coherent.
>>
>> Following the route, arch/arm/mm/dma-mapping.c uses pgprot_dmacoherent() which
>> is defined in arch/arm/include/asm/pgtable.h and that just returns uncached
>> memory.
>>
>> If you can give me some hints as to what the correct flags would be, I can
>> patch my kernel and test it.
>
> Use the flags that Russell listed above.
I would if I had a clue how to do that. For one thing, I don't understand all
the L_PTE_... flag juggling that happens here.
I also tried just using kmalloc() to allocate the buffer, and then
dma_map_single that. This is what the DMA documentation told me to do for
non-coherent mappings. This works fine for the streaming-dma mode using the
HP, but using dma-coherent this not only fails to work, it tends to completely
lock the system.
Avoiding the streaming mapping and using only the coherent modes prevents the
system locking up, it then doesn't do more harm than just corrupting data in
the buffers.
Kind regards,
Mike Looijmans
System Expert
TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com
Please consider the environment before printing this e-mail
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-04-29 12:35 ` Arnd Bergmann
@ 2015-04-29 12:52 ` Mike Looijmans
2015-04-29 12:54 ` Arnd Bergmann
0 siblings, 1 reply; 34+ messages in thread
From: Mike Looijmans @ 2015-04-29 12:52 UTC (permalink / raw)
To: linux-arm-kernel
?On 29-04-15 14:35, Arnd Bergmann wrote:
> On Wednesday 29 April 2015 13:09:05 Mike Looijmans wrote:
>> On 29-04-15 12:07, Arnd Bergmann wrote:
>>> On Wednesday 29 April 2015 11:47:37 Mike Looijmans wrote:
>>>> I currently use dma_alloc_coherent() to allocate buffers and
>>>> dma_mmap_coherent() to map them to user space. I was under the assumption that
>>>> these would do the right thing. Is that correct? If not, then what should I use?
>>>
>>> dma_mmap_coherent() is the right interface, but I've just looked at the
>>> implementation of arm_dma_mmap() and I'm not sure that it actually uses the
>>> correct vma->vm_page_prot value here, because I don't see where it takes
>>> into account whether the device is coherent or not. Most ARM machines have
>>> only noncoherent devices, and dma_mmap_coherent() is used rarely by drivers,
>>> so it's quite possible that this interface got broken without anybody
>>> noticing.
>>
>> My driver also supports 'classic' read/write by copying data to/from userspace
>> using a DMA ringbuffer, which is also allocated using dma_alloc_coherent. That
>> would suggest that the kernel mapping for this memory is also incorrect, I
>> also get corrupted data in these transfers. These buffers are not being mapped
>> to userspace at all. These tests all use page aligned transfer sizes.
>
> Ok, if it's broken when you only use a kernel mapping and no user space
> mapping, arm_dma_mmap() is not your (only) problem, but we should probably
> still fix that if it's broken.
The only difference in arm_coherent_dma_alloc() and arm_dma_alloc() is that
the former passes a "true" to __dma_alloc(). That looks prettry suspicious,
I'd at least expect different "prot" values.
Kind regards,
Mike Looijmans
System Expert
TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com
Please consider the environment before printing this e-mail
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-04-29 12:52 ` Mike Looijmans
@ 2015-04-29 12:54 ` Arnd Bergmann
0 siblings, 0 replies; 34+ messages in thread
From: Arnd Bergmann @ 2015-04-29 12:54 UTC (permalink / raw)
To: linux-arm-kernel
On Wednesday 29 April 2015 14:52:49 Mike Looijmans wrote:
> On 29-04-15 14:35, Arnd Bergmann wrote:
> > On Wednesday 29 April 2015 13:09:05 Mike Looijmans wrote:
> >> On 29-04-15 12:07, Arnd Bergmann wrote:
> >>> On Wednesday 29 April 2015 11:47:37 Mike Looijmans wrote:
> >>>> I currently use dma_alloc_coherent() to allocate buffers and
> >>>> dma_mmap_coherent() to map them to user space. I was under the assumption that
> >>>> these would do the right thing. Is that correct? If not, then what should I use?
> >>>
> >>> dma_mmap_coherent() is the right interface, but I've just looked at the
> >>> implementation of arm_dma_mmap() and I'm not sure that it actually uses the
> >>> correct vma->vm_page_prot value here, because I don't see where it takes
> >>> into account whether the device is coherent or not. Most ARM machines have
> >>> only noncoherent devices, and dma_mmap_coherent() is used rarely by drivers,
> >>> so it's quite possible that this interface got broken without anybody
> >>> noticing.
> >>
> >> My driver also supports 'classic' read/write by copying data to/from userspace
> >> using a DMA ringbuffer, which is also allocated using dma_alloc_coherent. That
> >> would suggest that the kernel mapping for this memory is also incorrect, I
> >> also get corrupted data in these transfers. These buffers are not being mapped
> >> to userspace at all. These tests all use page aligned transfer sizes.
> >
> > Ok, if it's broken when you only use a kernel mapping and no user space
> > mapping, arm_dma_mmap() is not your (only) problem, but we should probably
> > still fix that if it's broken.
>
> The only difference in arm_coherent_dma_alloc() and arm_dma_alloc() is that
> the former passes a "true" to __dma_alloc(). That looks prettry suspicious,
> I'd at least expect different "prot" values.
>
>
No, if coherent=true is passed as the argument, the prot value is not used.
If we had this part wrong, we would notice very quickly.
Arnd
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-04-29 12:49 ` Mike Looijmans
@ 2015-04-29 13:13 ` Arnd Bergmann
2015-04-30 13:50 ` Mike Looijmans
2015-05-07 11:18 ` Mike Looijmans
0 siblings, 2 replies; 34+ messages in thread
From: Arnd Bergmann @ 2015-04-29 13:13 UTC (permalink / raw)
To: linux-arm-kernel
On Wednesday 29 April 2015 14:49:26 Mike Looijmans wrote:
> On 29-04-15 12:41, Arnd Bergmann wrote:
> > On Wednesday 29 April 2015 12:33:00 Mike Looijmans wrote:
> >> On 29-04-15 12:07, Arnd Bergmann wrote:
> >>> On Wednesday 29 April 2015 11:47:37 Mike Looijmans wrote:
> >>>> On 29-04-15 11:17, Russell King - ARM Linux wrote:
> >>>>> The buffer should also be mapped into userspace with the same memory
> >>>>> type and cache attributes as the kernel side mapping. If using ACP,
> >>>>> then you probably want "normal memory, cacheable, writeback, read
> >>>>> allocate" or in the case of SMP, the same but "read/write allocate".
> >>>
> >>> If my suspicion is correct, we should either change arm_coherent_dma_ops()
> >>> to refer to a different mmap() callback that does the right thing for
> >>> coherent devices, or change arm_dma_mmap() to look at dev->is_coherent.
> >>
> >> Following the route, arch/arm/mm/dma-mapping.c uses pgprot_dmacoherent() which
> >> is defined in arch/arm/include/asm/pgtable.h and that just returns uncached
> >> memory.
> >>
> >> If you can give me some hints as to what the correct flags would be, I can
> >> patch my kernel and test it.
> >
> > Use the flags that Russell listed above.
>
> I would if I had a clue how to do that. For one thing, I don't understand all
> the L_PTE_... flag juggling that happens here.
>
> I also tried just using kmalloc() to allocate the buffer, and then
> dma_map_single that. This is what the DMA documentation told me to do for
> non-coherent mappings. This works fine for the streaming-dma mode using the
> HP, but using dma-coherent this not only fails to work, it tends to completely
> lock the system.
>
> Avoiding the streaming mapping and using only the coherent modes prevents the
> system locking up, it then doesn't do more harm than just corrupting data in
> the buffers.
If I understand it right, you basically just skip the 'vma->vm_page_prot =
__get_dma_pgprot(attrs, vma->vm_page_prot);' step and get the right mapping
here, i.e. the pgprot value we use for all normal memory.
Arnd
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-04-29 13:13 ` Arnd Bergmann
@ 2015-04-30 13:50 ` Mike Looijmans
2015-04-30 13:54 ` Arnd Bergmann
2015-05-07 11:18 ` Mike Looijmans
1 sibling, 1 reply; 34+ messages in thread
From: Mike Looijmans @ 2015-04-30 13:50 UTC (permalink / raw)
To: linux-arm-kernel
?On 29-04-15 15:13, Arnd Bergmann wrote:
> On Wednesday 29 April 2015 14:49:26 Mike Looijmans wrote:
>> On 29-04-15 12:41, Arnd Bergmann wrote:
>>> On Wednesday 29 April 2015 12:33:00 Mike Looijmans wrote:
>>>> On 29-04-15 12:07, Arnd Bergmann wrote:
>>>>> On Wednesday 29 April 2015 11:47:37 Mike Looijmans wrote:
>>>>>> On 29-04-15 11:17, Russell King - ARM Linux wrote:
>>>>>>> The buffer should also be mapped into userspace with the same memory
>>>>>>> type and cache attributes as the kernel side mapping. If using ACP,
>>>>>>> then you probably want "normal memory, cacheable, writeback, read
>>>>>>> allocate" or in the case of SMP, the same but "read/write allocate".
>>>>>
>>>>> If my suspicion is correct, we should either change arm_coherent_dma_ops()
>>>>> to refer to a different mmap() callback that does the right thing for
>>>>> coherent devices, or change arm_dma_mmap() to look at dev->is_coherent.
>>>>
>>>> Following the route, arch/arm/mm/dma-mapping.c uses pgprot_dmacoherent() which
>>>> is defined in arch/arm/include/asm/pgtable.h and that just returns uncached
>>>> memory.
>>>>
>>>> If you can give me some hints as to what the correct flags would be, I can
>>>> patch my kernel and test it.
>>>
>>> Use the flags that Russell listed above.
>>
>> I would if I had a clue how to do that. For one thing, I don't understand all
>> the L_PTE_... flag juggling that happens here.
>>
>> I also tried just using kmalloc() to allocate the buffer, and then
>> dma_map_single that. This is what the DMA documentation told me to do for
>> non-coherent mappings. This works fine for the streaming-dma mode using the
>> HP, but using dma-coherent this not only fails to work, it tends to completely
>> lock the system.
>>
>> Avoiding the streaming mapping and using only the coherent modes prevents the
>> system locking up, it then doesn't do more harm than just corrupting data in
>> the buffers.
>
> If I understand it right, you basically just skip the 'vma->vm_page_prot =
> __get_dma_pgprot(attrs, vma->vm_page_prot);' step and get the right mapping
> here, i.e. the pgprot value we use for all normal memory.
Just to give you a status update, I tried that too (by adding a
dma_mmap_coherent variant that omits the "prot" change, and some printks to
verify that it actually does as expected).
Current status is that the ACP behaves exactly like the HP port, which it
should not do. If I send data from logic via the ACP port through the L2
cache, using a version of dma_sync that just invalidates the cache could
(should?) result in data corruption. Instead, the data gets corrupted only if
you do not invalidate the line. This is what the (non-coherent) HP port
behaves like, as it writes directly to DDR.
Currently I'm assuming that the tools did something wrong in the bitstream,
for example, wiring the "AWCACHE" and similar signals on the ACP to logic "0"
instead of "1" while claiming to have wired them to "1" in the UI. A bug like
that would also explain the behaviour I'm seeing now.
I'll let you know once I find out more.
Kind regards,
Mike Looijmans
System Expert
TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com
Please consider the environment before printing this e-mail
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-04-30 13:50 ` Mike Looijmans
@ 2015-04-30 13:54 ` Arnd Bergmann
2015-05-01 6:08 ` Mike Looijmans
0 siblings, 1 reply; 34+ messages in thread
From: Arnd Bergmann @ 2015-04-30 13:54 UTC (permalink / raw)
To: linux-arm-kernel
On Thursday 30 April 2015 15:50:15 Mike Looijmans wrote:
>
> Just to give you a status update, I tried that too (by adding a
> dma_mmap_coherent variant that omits the "prot" change, and some printks to
> verify that it actually does as expected).
>
> Current status is that the ACP behaves exactly like the HP port, which it
> should not do. If I send data from logic via the ACP port through the L2
> cache, using a version of dma_sync that just invalidates the cache could
> (should?) result in data corruption. Instead, the data gets corrupted only if
> you do not invalidate the line. This is what the (non-coherent) HP port
> behaves like, as it writes directly to DDR.
>
> Currently I'm assuming that the tools did something wrong in the bitstream,
> for example, wiring the "AWCACHE" and similar signals on the ACP to logic "0"
> instead of "1" while claiming to have wired them to "1" in the UI. A bug like
> that would also explain the behaviour I'm seeing now.
>
> I'll let you know once I find out more.
>
>
Ok, maybe you have to configure the SCU to include the ACP in the
cache coherency? That might not be done by default. I don't really
know anything about the SCU or the ACP, so I'm just taking wild
guesses here.
Arnd
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-04-30 13:54 ` Arnd Bergmann
@ 2015-05-01 6:08 ` Mike Looijmans
2015-05-01 7:01 ` Mike Looijmans
0 siblings, 1 reply; 34+ messages in thread
From: Mike Looijmans @ 2015-05-01 6:08 UTC (permalink / raw)
To: linux-arm-kernel
?On 30-04-15 15:54, Arnd Bergmann wrote:
> On Thursday 30 April 2015 15:50:15 Mike Looijmans wrote:
>>
>> Just to give you a status update, I tried that too (by adding a
>> dma_mmap_coherent variant that omits the "prot" change, and some printks to
>> verify that it actually does as expected).
>>
>> Current status is that the ACP behaves exactly like the HP port, which it
>> should not do. If I send data from logic via the ACP port through the L2
>> cache, using a version of dma_sync that just invalidates the cache could
>> (should?) result in data corruption. Instead, the data gets corrupted only if
>> you do not invalidate the line. This is what the (non-coherent) HP port
>> behaves like, as it writes directly to DDR.
>>
>> Currently I'm assuming that the tools did something wrong in the bitstream,
>> for example, wiring the "AWCACHE" and similar signals on the ACP to logic "0"
>> instead of "1" while claiming to have wired them to "1" in the UI. A bug like
>> that would also explain the behaviour I'm seeing now.
>>
>> I'll let you know once I find out more.
>
> Ok, maybe you have to configure the SCU to include the ACP in the
> cache coherency? That might not be done by default. I don't really
> know anything about the SCU or the ACP, so I'm just taking wild
> guesses here.
The issue seems to be in these signals, the tool falsely claimed to have them
set high, but it actually only sets half of them, which thus has no effect at
all. I'll be able to test that in an hour or so once the bitstream is ready.
Interestingly, the coherency is not a property of the port or of the device
itself, it is a property of the DMA transaction. One single master on this
post can do both coherent and non-coherent transfers.
Now there's an interesting challenge, since the kernel appears to assume that
one device can only do one type. My 'hardware' actually has multiple masters,
which could potentially be tied to different ports or busses.
Kind regards,
Mike Looijmans
System Expert
TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com
Please consider the environment before printing this e-mail
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-05-01 6:08 ` Mike Looijmans
@ 2015-05-01 7:01 ` Mike Looijmans
0 siblings, 0 replies; 34+ messages in thread
From: Mike Looijmans @ 2015-05-01 7:01 UTC (permalink / raw)
To: linux-arm-kernel
?On 01-05-15 08:08, Mike Looijmans wrote:
> On 30-04-15 15:54, Arnd Bergmann wrote:
>> On Thursday 30 April 2015 15:50:15 Mike Looijmans wrote:
>>>
>>> Just to give you a status update, I tried that too (by adding a
>>> dma_mmap_coherent variant that omits the "prot" change, and some printks to
>>> verify that it actually does as expected).
>>>
>>> Current status is that the ACP behaves exactly like the HP port, which it
>>> should not do. If I send data from logic via the ACP port through the L2
>>> cache, using a version of dma_sync that just invalidates the cache could
>>> (should?) result in data corruption. Instead, the data gets corrupted only if
>>> you do not invalidate the line. This is what the (non-coherent) HP port
>>> behaves like, as it writes directly to DDR.
>>>
>>> Currently I'm assuming that the tools did something wrong in the bitstream,
>>> for example, wiring the "AWCACHE" and similar signals on the ACP to logic "0"
>>> instead of "1" while claiming to have wired them to "1" in the UI. A bug like
>>> that would also explain the behaviour I'm seeing now.
>>>
>>> I'll let you know once I find out more.
>>
>> Ok, maybe you have to configure the SCU to include the ACP in the
>> cache coherency? That might not be done by default. I don't really
>> know anything about the SCU or the ACP, so I'm just taking wild
>> guesses here.
>
> The issue seems to be in these signals, the tool falsely claimed to have them
> set high, but it actually only sets half of them, which thus has no effect at
> all. I'll be able to test that in an hour or so once the bitstream is ready.
Indeed, controlling the signals manually instead of relying on the tools makes
the ACP operate correctly.
> Interestingly, the coherency is not a property of the port or of the device
> itself, it is a property of the DMA transaction. One single master on this
> post can do both coherent and non-coherent transfers.
> Now there's an interesting challenge, since the kernel appears to assume that
> one device can only do one type. My 'hardware' actually has multiple masters,
> which could potentially be tied to different ports or busses.
Measurements show that this is going to be important. Writing larger blocks to
DDR using coherency transfers data at about 240MB/s, while without coherency
it's 600MB/s. Have yet to test small datasets, I expect opposite results
there, since they'll remain in the L2 cache.
Kind regards,
Mike Looijmans
System Expert
TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com
Please consider the environment before printing this e-mail
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-04-29 13:13 ` Arnd Bergmann
2015-04-30 13:50 ` Mike Looijmans
@ 2015-05-07 11:18 ` Mike Looijmans
2015-05-07 11:56 ` Arnd Bergmann
2015-05-07 13:21 ` Daniel Drake
1 sibling, 2 replies; 34+ messages in thread
From: Mike Looijmans @ 2015-05-07 11:18 UTC (permalink / raw)
To: linux-arm-kernel
?On 29-04-15 15:13, Arnd Bergmann wrote:
> On Wednesday 29 April 2015 14:49:26 Mike Looijmans wrote:
>> On 29-04-15 12:41, Arnd Bergmann wrote:
>>> On Wednesday 29 April 2015 12:33:00 Mike Looijmans wrote:
>>>> On 29-04-15 12:07, Arnd Bergmann wrote:
>>>>> On Wednesday 29 April 2015 11:47:37 Mike Looijmans wrote:
>>>>>> On 29-04-15 11:17, Russell King - ARM Linux wrote:
>>>>>>> The buffer should also be mapped into userspace with the same memory
>>>>>>> type and cache attributes as the kernel side mapping. If using ACP,
>>>>>>> then you probably want "normal memory, cacheable, writeback, read
>>>>>>> allocate" or in the case of SMP, the same but "read/write allocate".
>>>>>
>>>>> If my suspicion is correct, we should either change arm_coherent_dma_ops()
>>>>> to refer to a different mmap() callback that does the right thing for
>>>>> coherent devices, or change arm_dma_mmap() to look at dev->is_coherent.
>>>>
>>>> Following the route, arch/arm/mm/dma-mapping.c uses pgprot_dmacoherent() which
>>>> is defined in arch/arm/include/asm/pgtable.h and that just returns uncached
>>>> memory.
>>>>
>>>> If you can give me some hints as to what the correct flags would be, I can
>>>> patch my kernel and test it.
>>>
>>> Use the flags that Russell listed above.
>>
>> I would if I had a clue how to do that. For one thing, I don't understand all
>> the L_PTE_... flag juggling that happens here.
>>
>> I also tried just using kmalloc() to allocate the buffer, and then
>> dma_map_single that. This is what the DMA documentation told me to do for
>> non-coherent mappings. This works fine for the streaming-dma mode using the
>> HP, but using dma-coherent this not only fails to work, it tends to completely
>> lock the system.
>>
>> Avoiding the streaming mapping and using only the coherent modes prevents the
>> system locking up, it then doesn't do more harm than just corrupting data in
>> the buffers.
>
> If I understand it right, you basically just skip the 'vma->vm_page_prot =
> __get_dma_pgprot(attrs, vma->vm_page_prot);' step and get the right mapping
> here, i.e. the pgprot value we use for all normal memory.
I reverted all my patches and workarounds. Indeed, the kernel needs a
"coherent" version of the dma_mmap routine, as the current version will map it
as non-cachable, resulting in a big performance hit (and nullifying the whole
idea behind it).
I'll test it further on my 'hardware' and cook up a patch that correctly maps
the coherent pages.
M.
Kind regards,
Mike Looijmans
System Expert
TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com
Please consider the environment before printing this e-mail
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-05-07 11:18 ` Mike Looijmans
@ 2015-05-07 11:56 ` Arnd Bergmann
2015-05-07 13:21 ` Daniel Drake
1 sibling, 0 replies; 34+ messages in thread
From: Arnd Bergmann @ 2015-05-07 11:56 UTC (permalink / raw)
To: linux-arm-kernel
On Thursday 07 May 2015 13:18:08 Mike Looijmans wrote:
> On 29-04-15 15:13, Arnd Bergmann wrote:
> > On Wednesday 29 April 2015 14:49:26 Mike Looijmans wrote:
> >> Avoiding the streaming mapping and using only the coherent modes prevents the
> >> system locking up, it then doesn't do more harm than just corrupting data in
> >> the buffers.
> >
> > If I understand it right, you basically just skip the 'vma->vm_page_prot =
> > __get_dma_pgprot(attrs, vma->vm_page_prot);' step and get the right mapping
> > here, i.e. the pgprot value we use for all normal memory.
>
> I reverted all my patches and workarounds. Indeed, the kernel needs a
> "coherent" version of the dma_mmap routine, as the current version will map it
> as non-cachable, resulting in a big performance hit (and nullifying the whole
> idea behind it).
>
> I'll test it further on my 'hardware' and cook up a patch that correctly maps
> the coherent pages.
>
Ok, thanks!
Arnd
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-05-07 11:18 ` Mike Looijmans
2015-05-07 11:56 ` Arnd Bergmann
@ 2015-05-07 13:21 ` Daniel Drake
2015-05-07 13:31 ` Mike Looijmans
1 sibling, 1 reply; 34+ messages in thread
From: Daniel Drake @ 2015-05-07 13:21 UTC (permalink / raw)
To: linux-arm-kernel
On Thu, May 7, 2015 at 5:18 AM, Mike Looijmans <mike.looijmans@topic.nl> wrote:
> I reverted all my patches and workarounds. Indeed, the kernel needs a
> "coherent" version of the dma_mmap routine, as the current version will map
> it as non-cachable, resulting in a big performance hit (and nullifying the
> whole idea behind it).
>
> I'll test it further on my 'hardware' and cook up a patch that correctly
> maps the coherent pages.
Sorry that I have only read this thread briefly, but I wonder if this
is what you are looking for:
http://lists.infradead.org/pipermail/linux-arm-kernel/2015-February/325489.html
Thanks
Daniel
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-05-07 13:21 ` Daniel Drake
@ 2015-05-07 13:31 ` Mike Looijmans
2015-05-07 14:08 ` Mike Looijmans
0 siblings, 1 reply; 34+ messages in thread
From: Mike Looijmans @ 2015-05-07 13:31 UTC (permalink / raw)
To: linux-arm-kernel
?On 07-05-15 15:21, Daniel Drake wrote:
> On Thu, May 7, 2015 at 5:18 AM, Mike Looijmans <mike.looijmans@topic.nl> wrote:
>> I reverted all my patches and workarounds. Indeed, the kernel needs a
>> "coherent" version of the dma_mmap routine, as the current version will map
>> it as non-cachable, resulting in a big performance hit (and nullifying the
>> whole idea behind it).
>>
>> I'll test it further on my 'hardware' and cook up a patch that correctly
>> maps the coherent pages.
>
> Sorry that I have only read this thread briefly, but I wonder if this
> is what you are looking for:
> http://lists.infradead.org/pipermail/linux-arm-kernel/2015-February/325489.html
It's related, but targets another use case. This one does the same in case the
driver requested non-consistent memory.
My use case was that I have hardware implemented coherency (through ACP) so
the CPU's and device's view on memory is already consistent, regardless of the
status of the cache.
The patches are complimentary, not overlapping.
Thanks for the link though, it's something I was also looking into, as I don't
always need coherency.
Kind regards,
Mike Looijmans
System Expert
TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com
Please consider the environment before printing this e-mail
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-05-07 13:31 ` Mike Looijmans
@ 2015-05-07 14:08 ` Mike Looijmans
2015-05-07 14:30 ` Russell King - ARM Linux
0 siblings, 1 reply; 34+ messages in thread
From: Mike Looijmans @ 2015-05-07 14:08 UTC (permalink / raw)
To: linux-arm-kernel
?On 07-05-15 15:31, Mike Looijmans wrote:
> On 07-05-15 15:21, Daniel Drake wrote:
>> On Thu, May 7, 2015 at 5:18 AM, Mike Looijmans <mike.looijmans@topic.nl> wrote:
>>> I reverted all my patches and workarounds. Indeed, the kernel needs a
>>> "coherent" version of the dma_mmap routine, as the current version will map
>>> it as non-cachable, resulting in a big performance hit (and nullifying the
>>> whole idea behind it).
>>>
>>> I'll test it further on my 'hardware' and cook up a patch that correctly
>>> maps the coherent pages.
>>
>> Sorry that I have only read this thread briefly, but I wonder if this
>> is what you are looking for:
>> http://lists.infradead.org/pipermail/linux-arm-kernel/2015-February/325489.html
>
> It's related, but targets another use case. This one does the same in case the
> driver requested non-consistent memory.
>
> My use case was that I have hardware implemented coherency (through ACP) so
> the CPU's and device's view on memory is already consistent, regardless of the
> status of the cache.
>
> The patches are complimentary, not overlapping.
>
> Thanks for the link though, it's something I was also looking into, as I don't
> always need coherency.
I read the rest of the thread, apparently it was never integrated.
The patch for "non-consistent" is a BUG FIX, not some feature request or so. I
was already wondering why my driver had to kalloc pages to get proper caching
on it.
From https://www.kernel.org/doc/Documentation/DMA-attributes.txt:
"""
DMA_ATTR_NON_CONSISTENT ... lets the platform to choose to return either
consistent or non-consistent memory as it sees fit. By using this API,
you are guaranteeing to the platform that you have all the correct and
necessary sync points for this memory in the driver.
"""
The current ARM implementation is to *always* return memory that is
non-cachable, even if the driver promises to do all the right things.
If the intention was that every implementation could get away with just
ignoring the flag, the flag would not have existed. So the implementation
should do the best it can do here, and the patch shows that it's just a simple
one-liner to make it implement the flag as intended.
As for use cases, IIO is a candidate for this too, as it has explicit
interfaces to move buffers to/from userspace without having to remap them over
and over again. My usecase here is Dyplo, which uses a similar interface. If
you do something as simple as "for (i=0;i<size;++i) sum += (char*)buffer[i];
in userspace on such a buffer, performance will collapse to about 1/20th
because of the memory being non-cachable. The whole point of IIO is to prevent
having to copy these buffers around.
(I'd rather add this plea to the that thread, but I'd have to figure out how
to reply to a thread from the web...)
Mike.
Kind regards,
Mike Looijmans
System Expert
TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans@topicproducts.com
Website: www.topicproducts.com
Please consider the environment before printing this e-mail
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-05-07 14:08 ` Mike Looijmans
@ 2015-05-07 14:30 ` Russell King - ARM Linux
2015-05-08 5:55 ` Mike Looijmans
0 siblings, 1 reply; 34+ messages in thread
From: Russell King - ARM Linux @ 2015-05-07 14:30 UTC (permalink / raw)
To: linux-arm-kernel
On Thu, May 07, 2015 at 04:08:54PM +0200, Mike Looijmans wrote:
> I read the rest of the thread, apparently it was never integrated.
>
> The patch for "non-consistent" is a BUG FIX, not some feature request or so.
> I was already wondering why my driver had to kalloc pages to get proper
> caching on it.
I disagree.
> From https://www.kernel.org/doc/Documentation/DMA-attributes.txt:
> """
> DMA_ATTR_NON_CONSISTENT ... lets the platform to choose to return either
> consistent or non-consistent memory as it sees fit. By using this API,
> you are guaranteeing to the platform that you have all the correct and
> necessary sync points for this memory in the driver.
> """
DMA attributes are something that came in _after_ the DMA API had been
around for many years. It's a "new feature" that was added to an
existing subsystem, and because there have been no need for it to be
implemented on ARM, the new feature was never implemented.
More than that, the vast majority of ARM hardware can't provide this
kind of memory, and there are _no_ kernel APIs to ensure that if
cacheable memory were to be returned, they could issue the necessary
cache flushes to ensure that the device could see the data.
So it's _not_ a bug fix, and there have been very good reasons not to
implement it.
--
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-05-07 14:30 ` Russell King - ARM Linux
@ 2015-05-08 5:55 ` Mike Looijmans
2015-05-08 7:54 ` Arnd Bergmann
2015-05-08 11:10 ` Russell King - ARM Linux
0 siblings, 2 replies; 34+ messages in thread
From: Mike Looijmans @ 2015-05-08 5:55 UTC (permalink / raw)
To: linux-arm-kernel
On 07-05-15 16:30, Russell King - ARM Linux wrote:
> On Thu, May 07, 2015 at 04:08:54PM +0200, Mike Looijmans wrote:
>> I read the rest of the thread, apparently it was never integrated.
>>
>> The patch for "non-consistent" is a BUG FIX, not some feature request or so.
>> I was already wondering why my driver had to kalloc pages to get proper
>> caching on it.
>
> I disagree.
>
>> From https://www.kernel.org/doc/Documentation/DMA-attributes.txt:
>> """
>> DMA_ATTR_NON_CONSISTENT ... lets the platform to choose to return either
>> consistent or non-consistent memory as it sees fit. By using this API,
>> you are guaranteeing to the platform that you have all the correct and
>> necessary sync points for this memory in the driver.
>> """
>
> DMA attributes are something that came in _after_ the DMA API had been
> around for many years. It's a "new feature" that was added to an
> existing subsystem, and because there have been no need for it to be
> implemented on ARM, the new feature was never implemented.
>
> More than that, the vast majority of ARM hardware can't provide this
> kind of memory, and there are _no_ kernel APIs to ensure that if
By "non-coherent" memory I thought it meant the same kind of memory that
kalloc would return. But from your answer it seems I am mistaken and
this is something different?
> cacheable memory were to be returned, they could issue the necessary
> cache flushes to ensure that the device could see the data.
Then what do the dma_sync_... methods do?
It has been my understanding that one can use dma_map... and dma_sync...
methods to make memory ranges visible to the device.
Using dma_sync on coherent memory is just a waste of resources. So how
do i allocate memory that I'm supposed to use with dma_sync?
> So it's _not_ a bug fix, and there have been very good reasons not to
> implement it.
--
Mike Looijmans
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-05-08 5:55 ` Mike Looijmans
@ 2015-05-08 7:54 ` Arnd Bergmann
2015-05-08 8:31 ` Mike Looijmans
2015-05-08 11:10 ` Russell King - ARM Linux
1 sibling, 1 reply; 34+ messages in thread
From: Arnd Bergmann @ 2015-05-08 7:54 UTC (permalink / raw)
To: linux-arm-kernel
On Friday 08 May 2015 07:55:26 Mike Looijmans wrote:
> On 07-05-15 16:30, Russell King - ARM Linux wrote:
> > On Thu, May 07, 2015 at 04:08:54PM +0200, Mike Looijmans wrote:
> >> I read the rest of the thread, apparently it was never integrated.
> >>
> >> The patch for "non-consistent" is a BUG FIX, not some feature request or so.
> >> I was already wondering why my driver had to kalloc pages to get proper
> >> caching on it.
> >
> > I disagree.
> >
> >> From https://www.kernel.org/doc/Documentation/DMA-attributes.txt:
> >> """
> >> DMA_ATTR_NON_CONSISTENT ... lets the platform to choose to return either
> >> consistent or non-consistent memory as it sees fit. By using this API,
> >> you are guaranteeing to the platform that you have all the correct and
> >> necessary sync points for this memory in the driver.
> >> """
> >
> > DMA attributes are something that came in _after_ the DMA API had been
> > around for many years. It's a "new feature" that was added to an
> > existing subsystem, and because there have been no need for it to be
> > implemented on ARM, the new feature was never implemented.
> >
> > More than that, the vast majority of ARM hardware can't provide this
> > kind of memory, and there are _no_ kernel APIs to ensure that if
>
> By "non-coherent" memory I thought it meant the same kind of memory that
> kalloc would return. But from your answer it seems I am mistaken and
> this is something different?
It depends: on a device that is actually cache-coherent,
dma_alloc_coherent() and dma_alloc_noncoherent() both return normal
memory.
On some architectures (not ARM) that are not fully coherent,
dma_alloc_coherent() has to return uncached memory, while
dma_alloc_noncoherent() is allowed to return cached memory but
requires a dma_cache_sync() operation.
dma_alloc_attrs() with DMA_ATTR_NON_CONSISTENT is a variant of that,
but I assume the idea is that you use dma_sync_single_fo_{cpu,device}()
on that memory, which can actually work on ARM, unlike dma_cache_sync().
> > cacheable memory were to be returned, they could issue the necessary
> > cache flushes to ensure that the device could see the data.
>
> Then what do the dma_sync_... methods do?
>
> It has been my understanding that one can use dma_map... and dma_sync...
> methods to make memory ranges visible to the device.
That is correct, but the DMA_ATTR_NON_CONSISTENT flag is not meaningful
with dma_map_...(), as that memory is not assumed to be consistent unless
you call dma_sync_...() to start with.
> Using dma_sync on coherent memory is just a waste of resources. So how
> do i allocate memory that I'm supposed to use with dma_sync?
The traditional API (before the various attributes is):
dma_alloc_coherent() --> never requires sync
dma_alloc_writecombine() --> never requires sync, arch specific
dma_alloc_noncoherent() --> dma_cache_sync(), arch specific
alloc_pages + dma_map_*() --> dma_sync_*
The dma_alloc_coherent() and dma_sync_*() interfaces are supposed to
determine themselves whether they need to do any cache management
based on whether the device is coherent already or not.
Arnd
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-05-08 7:54 ` Arnd Bergmann
@ 2015-05-08 8:31 ` Mike Looijmans
2015-05-08 13:19 ` Arnd Bergmann
0 siblings, 1 reply; 34+ messages in thread
From: Mike Looijmans @ 2015-05-08 8:31 UTC (permalink / raw)
To: linux-arm-kernel
?On 08-05-15 09:54, Arnd Bergmann wrote:
> On Friday 08 May 2015 07:55:26 Mike Looijmans wrote:
>> On 07-05-15 16:30, Russell King - ARM Linux wrote:
>>> On Thu, May 07, 2015 at 04:08:54PM +0200, Mike Looijmans wrote:
>>>> I read the rest of the thread, apparently it was never integrated.
>>>>
>>>> The patch for "non-consistent" is a BUG FIX, not some feature request or so.
>>>> I was already wondering why my driver had to kalloc pages to get proper
>>>> caching on it.
>>>
>>> I disagree.
>>>
>>>> From https://www.kernel.org/doc/Documentation/DMA-attributes.txt:
>>>> """
>>>> DMA_ATTR_NON_CONSISTENT ... lets the platform to choose to return either
>>>> consistent or non-consistent memory as it sees fit. By using this API,
>>>> you are guaranteeing to the platform that you have all the correct and
>>>> necessary sync points for this memory in the driver.
>>>> """
>>>
>>> DMA attributes are something that came in _after_ the DMA API had been
>>> around for many years. It's a "new feature" that was added to an
>>> existing subsystem, and because there have been no need for it to be
>>> implemented on ARM, the new feature was never implemented.
>>>
>>> More than that, the vast majority of ARM hardware can't provide this
>>> kind of memory, and there are _no_ kernel APIs to ensure that if
>>
>> By "non-coherent" memory I thought it meant the same kind of memory that
>> kalloc would return. But from your answer it seems I am mistaken and
>> this is something different?
>
> It depends: on a device that is actually cache-coherent,
> dma_alloc_coherent() and dma_alloc_noncoherent() both return normal
> memory.
>
> On some architectures (not ARM) that are not fully coherent,
> dma_alloc_coherent() has to return uncached memory, while
> dma_alloc_noncoherent() is allowed to return cached memory but
> requires a dma_cache_sync() operation.
>
> dma_alloc_attrs() with DMA_ATTR_NON_CONSISTENT is a variant of that,
> but I assume the idea is that you use dma_sync_single_fo_{cpu,device}()
> on that memory, which can actually work on ARM, unlike dma_cache_sync().
Ah, okay, I was misled by the names. I was under the impression that memory
would be either "coherent" or "non-coherent". But what is called
"non-coherent" here is actually something like "less-coherent", it isn't
normal memory as alloc_pages would return, but it also isn't completely
coherent. Is that a correct summary?
In that case, I stand corrected.
>>> cacheable memory were to be returned, they could issue the necessary
>>> cache flushes to ensure that the device could see the data.
>>
>> Then what do the dma_sync_... methods do?
>>
>> It has been my understanding that one can use dma_map... and dma_sync...
>> methods to make memory ranges visible to the device.
>
> That is correct, but the DMA_ATTR_NON_CONSISTENT flag is not meaningful
> with dma_map_...(), as that memory is not assumed to be consistent unless
> you call dma_sync_...() to start with.
>
>> Using dma_sync on coherent memory is just a waste of resources. So how
>> do i allocate memory that I'm supposed to use with dma_sync?
>
> The traditional API (before the various attributes is):
>
> dma_alloc_coherent() --> never requires sync
> dma_alloc_writecombine() --> never requires sync, arch specific
> dma_alloc_noncoherent() --> dma_cache_sync(), arch specific
> alloc_pages + dma_map_*() --> dma_sync_*
>
> The dma_alloc_coherent() and dma_sync_*() interfaces are supposed to
> determine themselves whether they need to do any cache management
> based on whether the device is coherent already or not.
Okay, so in my case, I need to forget about the "non_coherent" stuff, it's
something specific to a few platforms.
I was looking for an interface that would allocate memory for access by my
device, but that would be just alloc_pages style memory. If my DMA controller
is limited to say only the first GB of RAM, I'd set the DMA mask to "30 bits".
If I just allocate memory using alloc_pages, the kernel doesn't know that I'd
want it to be in the lower 1GB range, and could allocate it in a spot my
device could not map.
Hence I'd expect there to be some "dma_alloc_pages(struct device* ...)" style
of call to get memory that my device could access (and I was under the false
impression that dma_alloc_noncoherent was the one I was looking for).
Currently I can get away with just using alloc_pages or kmalloc since my DMA
controller happens to be able to access all memory. But I also want my device
driver to work on 64-bit platforms (e.g. arm64 for the MPSOC and x86-64 for
the PCIe version of the board).
M.
Kind regards,
Mike Looijmans
System Expert
TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com
Please consider the environment before printing this e-mail
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-05-08 5:55 ` Mike Looijmans
2015-05-08 7:54 ` Arnd Bergmann
@ 2015-05-08 11:10 ` Russell King - ARM Linux
2015-05-08 12:17 ` Mike Looijmans
1 sibling, 1 reply; 34+ messages in thread
From: Russell King - ARM Linux @ 2015-05-08 11:10 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, May 08, 2015 at 07:55:26AM +0200, Mike Looijmans wrote:
> On 07-05-15 16:30, Russell King - ARM Linux wrote:
> >On Thu, May 07, 2015 at 04:08:54PM +0200, Mike Looijmans wrote:
> >>I read the rest of the thread, apparently it was never integrated.
> >>
> >>The patch for "non-consistent" is a BUG FIX, not some feature request or so.
> >>I was already wondering why my driver had to kalloc pages to get proper
> >>caching on it.
> >
> >I disagree.
> >
> >> From https://www.kernel.org/doc/Documentation/DMA-attributes.txt:
> >>"""
> >>DMA_ATTR_NON_CONSISTENT ... lets the platform to choose to return either
> >>consistent or non-consistent memory as it sees fit. By using this API,
> >>you are guaranteeing to the platform that you have all the correct and
> >>necessary sync points for this memory in the driver.
> >>"""
> >
> >DMA attributes are something that came in _after_ the DMA API had been
> >around for many years. It's a "new feature" that was added to an
> >existing subsystem, and because there have been no need for it to be
> >implemented on ARM, the new feature was never implemented.
> >
> >More than that, the vast majority of ARM hardware can't provide this
> >kind of memory, and there are _no_ kernel APIs to ensure that if
>
> By "non-coherent" memory I thought it meant the same kind of memory that
> kalloc would return. But from your answer it seems I am mistaken and this is
> something different?
Well, I do find the language used rather loose. The term "coherent" is
well understood, but when the documentation mixes it with "consistent"
it becomes less clear whether the two terms mean the same thing or
something different.
First, let's clear up the API. There are two parts of the DMA API, the
streaming API and the coherent API.
The streaming API is made up from:
- dma_map_single()
- dma_map_page()
- dma_map_sg()
- dma_sync_single_*()
- dma_sync_sg_*()
- dma_unmap_single()
- dma_unmap_page()
- dma_unmap_sg()
The coherent API is made up from:
- dma_alloc_coherent() (or a derived function of this.)
- dma_zalloc_coherent()
- dma_free_coherent()
- dma_pool_create()
- dma_pool_alloc()
- dma_pool_free()
- dma_pool_destroy()
Using the streaming API functions with memory returned from the coherent
API is invalid, and in older ARM implementations will cause a WARN_ON()
or BUG_ON() to trigger (due to the need for coherent memory to be
remapped.)
>From the description of DMA_ATTR_NON_CONSISTENT talks about returning
memory. None of the streaming API interfaces returns memory, only the
coherent API does. So, DMA_ATTR_NON_CONSISTENT is something you'd pass
into dma_alloc_coherent().
Great, so you get your coherent-but-non-consistent memory (what's that?)
but the documentation says that you guarantee to the platform that you
have the necessary sync points in the driver for this. What sync points
are those? Is it talking about memory barriers, or is it talking about
some other cache flushing that the DMA API doesn't describe?
If it's talking about the driver needing to have memory barriers to
ensure proper ordering of accesses to the returned memory, that's
something modern ARMs always need irrespective of anything, so in that
sense, we always return non-consistent memory everywhere.
--
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-05-08 11:10 ` Russell King - ARM Linux
@ 2015-05-08 12:17 ` Mike Looijmans
0 siblings, 0 replies; 34+ messages in thread
From: Mike Looijmans @ 2015-05-08 12:17 UTC (permalink / raw)
To: linux-arm-kernel
?On 08-05-15 13:10, Russell King - ARM Linux wrote:
> Well, I do find the language used rather loose. The term "coherent" is
> well understood, but when the documentation mixes it with "consistent"
> it becomes less clear whether the two terms mean the same thing or
> something different.
I re-read https://www.kernel.org/doc/Documentation/DMA-API.txt with this in mind.
"Part Ia" starts with "dma_alloc_coherent" and (friends) but then talks about
"consistent" memory throughout the whole part. I think a
s/consistent/coherent/g in this part will make things a lot clearer for everyone.
Shall I submit a patch for that?
Part Ib continues, but switches to using the word "coherent" instead.
Kind regards,
Mike Looijmans
System Expert
TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com
Please consider the environment before printing this e-mail
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-05-08 8:31 ` Mike Looijmans
@ 2015-05-08 13:19 ` Arnd Bergmann
2015-05-08 14:18 ` Mike Looijmans
0 siblings, 1 reply; 34+ messages in thread
From: Arnd Bergmann @ 2015-05-08 13:19 UTC (permalink / raw)
To: linux-arm-kernel
On Friday 08 May 2015 10:31:53 Mike Looijmans wrote:
> On 08-05-15 09:54, Arnd Bergmann wrote:
> > On Friday 08 May 2015 07:55:26 Mike Looijmans wrote:
> >> On 07-05-15 16:30, Russell King - ARM Linux wrote:
> >>> On Thu, May 07, 2015 at 04:08:54PM +0200, Mike Looijmans wrote:
> >>>> I read the rest of the thread, apparently it was never integrated.
> >>>>
> >>>> The patch for "non-consistent" is a BUG FIX, not some feature request or so.
> >>>> I was already wondering why my driver had to kalloc pages to get proper
> >>>> caching on it.
> >>>
> >>> I disagree.
> >>>
> >>>> From https://www.kernel.org/doc/Documentation/DMA-attributes.txt:
> >>>> """
> >>>> DMA_ATTR_NON_CONSISTENT ... lets the platform to choose to return either
> >>>> consistent or non-consistent memory as it sees fit. By using this API,
> >>>> you are guaranteeing to the platform that you have all the correct and
> >>>> necessary sync points for this memory in the driver.
> >>>> """
> >>>
> >>> DMA attributes are something that came in _after_ the DMA API had been
> >>> around for many years. It's a "new feature" that was added to an
> >>> existing subsystem, and because there have been no need for it to be
> >>> implemented on ARM, the new feature was never implemented.
> >>>
> >>> More than that, the vast majority of ARM hardware can't provide this
> >>> kind of memory, and there are _no_ kernel APIs to ensure that if
> >>
> >> By "non-coherent" memory I thought it meant the same kind of memory that
> >> kalloc would return. But from your answer it seems I am mistaken and
> >> this is something different?
> >
> > It depends: on a device that is actually cache-coherent,
> > dma_alloc_coherent() and dma_alloc_noncoherent() both return normal
> > memory.
> >
> > On some architectures (not ARM) that are not fully coherent,
> > dma_alloc_coherent() has to return uncached memory, while
> > dma_alloc_noncoherent() is allowed to return cached memory but
> > requires a dma_cache_sync() operation.
> >
> > dma_alloc_attrs() with DMA_ATTR_NON_CONSISTENT is a variant of that,
> > but I assume the idea is that you use dma_sync_single_fo_{cpu,device}()
> > on that memory, which can actually work on ARM, unlike dma_cache_sync().
>
> Ah, okay, I was misled by the names. I was under the impression that memory
> would be either "coherent" or "non-coherent". But what is called
> "non-coherent" here is actually something like "less-coherent", it isn't
> normal memory as alloc_pages would return, but it also isn't completely
> coherent. Is that a correct summary?
>
> In that case, I stand corrected.
Almost. I think the part that you are still missing is that memory
itself it not coherent or non-coherent. It's the device access to
that memory that can be coherent or not with regard to the CPU.
The memory that is returned by alloc_pages can be coherent with one
device but non-coherent with another.
> I was looking for an interface that would allocate memory for access by my
> device, but that would be just alloc_pages style memory. If my DMA controller
> is limited to say only the first GB of RAM, I'd set the DMA mask to "30 bits".
> If I just allocate memory using alloc_pages, the kernel doesn't know that I'd
> want it to be in the lower 1GB range, and could allocate it in a spot my
> device could not map.
If you have that low-memory restriction, you also need to ensure that there
is a ZONE_DMA that is large enough. ZONE_DMA should be sized to match the
common subset that all devices can access, so a GFP_DMA request returns
memory that is guaranteed to be accessible by all devices.
> Hence I'd expect there to be some "dma_alloc_pages(struct device* ...)" style
> of call to get memory that my device could access (and I was under the false
> impression that dma_alloc_noncoherent was the one I was looking for).
>
> Currently I can get away with just using alloc_pages or kmalloc since my DMA
> controller happens to be able to access all memory. But I also want my device
> driver to work on 64-bit platforms (e.g. arm64 for the MPSOC and x86-64 for
> the PCIe version of the board).
Those machines will have ZONE_DMA32, which refers to the first 4GB of memory,
so that should work fine. Alternatively you can use the iommu to get
all memory mapped into the space that is accessible by the device.
Also, on 64-bit x86 or ARM machines, all memory tends to be coherent, so
dma_sync_* will turn into a nop.
Arnd
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-05-08 13:19 ` Arnd Bergmann
@ 2015-05-08 14:18 ` Mike Looijmans
2015-05-08 14:27 ` Arnd Bergmann
0 siblings, 1 reply; 34+ messages in thread
From: Mike Looijmans @ 2015-05-08 14:18 UTC (permalink / raw)
To: linux-arm-kernel
?On 08-05-15 15:19, Arnd Bergmann wrote:
> On Friday 08 May 2015 10:31:53 Mike Looijmans wrote:
>> On 08-05-15 09:54, Arnd Bergmann wrote:
>>> On Friday 08 May 2015 07:55:26 Mike Looijmans wrote:
>>>> On 07-05-15 16:30, Russell King - ARM Linux wrote:
>>>>> On Thu, May 07, 2015 at 04:08:54PM +0200, Mike Looijmans wrote:
>>>>>> I read the rest of the thread, apparently it was never integrated.
>>>>>>
>>>>>> The patch for "non-consistent" is a BUG FIX, not some feature request or so.
>>>>>> I was already wondering why my driver had to kalloc pages to get proper
>>>>>> caching on it.
>>>>>
>>>>> I disagree.
>>>>>
>>>>>> From https://www.kernel.org/doc/Documentation/DMA-attributes.txt:
>>>>>> """
>>>>>> DMA_ATTR_NON_CONSISTENT ... lets the platform to choose to return either
>>>>>> consistent or non-consistent memory as it sees fit. By using this API,
>>>>>> you are guaranteeing to the platform that you have all the correct and
>>>>>> necessary sync points for this memory in the driver.
>>>>>> """
>>>>>
>>>>> DMA attributes are something that came in _after_ the DMA API had been
>>>>> around for many years. It's a "new feature" that was added to an
>>>>> existing subsystem, and because there have been no need for it to be
>>>>> implemented on ARM, the new feature was never implemented.
>>>>>
>>>>> More than that, the vast majority of ARM hardware can't provide this
>>>>> kind of memory, and there are _no_ kernel APIs to ensure that if
>>>>
>>>> By "non-coherent" memory I thought it meant the same kind of memory that
>>>> kalloc would return. But from your answer it seems I am mistaken and
>>>> this is something different?
>>>
>>> It depends: on a device that is actually cache-coherent,
>>> dma_alloc_coherent() and dma_alloc_noncoherent() both return normal
>>> memory.
>>>
>>> On some architectures (not ARM) that are not fully coherent,
>>> dma_alloc_coherent() has to return uncached memory, while
>>> dma_alloc_noncoherent() is allowed to return cached memory but
>>> requires a dma_cache_sync() operation.
>>>
>>> dma_alloc_attrs() with DMA_ATTR_NON_CONSISTENT is a variant of that,
>>> but I assume the idea is that you use dma_sync_single_fo_{cpu,device}()
>>> on that memory, which can actually work on ARM, unlike dma_cache_sync().
>>
>> Ah, okay, I was misled by the names. I was under the impression that memory
>> would be either "coherent" or "non-coherent". But what is called
>> "non-coherent" here is actually something like "less-coherent", it isn't
>> normal memory as alloc_pages would return, but it also isn't completely
>> coherent. Is that a correct summary?
>>
>> In that case, I stand corrected.
>
> Almost. I think the part that you are still missing is that memory
> itself it not coherent or non-coherent. It's the device access to
> that memory that can be coherent or not with regard to the CPU.
>
> The memory that is returned by alloc_pages can be coherent with one
> device but non-coherent with another.
Yeah, I did understand, but never made that explicit.
The Zynq is even more funky than most hardware - I can actually enable/disable
(hardware) coherency for each memory burst. If I wanted to. Currently I just
turn it on/off for the whole system at boot time, since I need to write its
coherency status into the devicetree for the kernel to understand how my
device accesses memory.
>> I was looking for an interface that would allocate memory for access by my
>> device, but that would be just alloc_pages style memory. If my DMA controller
>> is limited to say only the first GB of RAM, I'd set the DMA mask to "30 bits".
>> If I just allocate memory using alloc_pages, the kernel doesn't know that I'd
>> want it to be in the lower 1GB range, and could allocate it in a spot my
>> device could not map.
>
> If you have that low-memory restriction, you also need to ensure that there
> is a ZONE_DMA that is large enough. ZONE_DMA should be sized to match the
> common subset that all devices can access, so a GFP_DMA request returns
> memory that is guaranteed to be accessible by all devices.
GFP_DMA still carries that scary remark: "on x86, GFP_DMA guarantees to be
within the first 16MB of available bus addresses". Which is really scary if
you want to allocate multimegabyte buffers.
>> Hence I'd expect there to be some "dma_alloc_pages(struct device* ...)" style
>> of call to get memory that my device could access (and I was under the false
>> impression that dma_alloc_noncoherent was the one I was looking for).
>>
>> Currently I can get away with just using alloc_pages or kmalloc since my DMA
>> controller happens to be able to access all memory. But I also want my device
>> driver to work on 64-bit platforms (e.g. arm64 for the MPSOC and x86-64 for
>> the PCIe version of the board).
>
> Those machines will have ZONE_DMA32, which refers to the first 4GB of memory,
> so that should work fine. Alternatively you can use the iommu to get
> all memory mapped into the space that is accessible by the device.
With my first tests on the PC version I ran into a problem with the PCIe
interface that the IP block that does the address translation would do
something like addr=base_addr|offset instead of addr=base_addr+offset. Since
base_addr was assigned dynamically (BAR) the board would fail to work (write
to the wrong spot in memory, that is) if the BAR was aligned on a 4MB boundary
and my buffer was on an "odd" 4MB address. I hacked around that by calling
dma_alloc_coherent() until I got one that was evenly aligned and then get rid
of the failed ones. Just an anecdote that shows that requirements for buffer
alignment can be weird. Of course, we software guys are here to clean up the
mess because fixing the bug in hardware is out of the question.
> Also, on 64-bit x86 or ARM machines, all memory tends to be coherent, so
> dma_sync_* will turn into a nop.
Well, that should make things easier, right? I just can't wait until I get
that first MPSOC on my desk :)
Kind regards,
Mike Looijmans
System Expert
TOPIC Embedded Products
Eindhovenseweg 32-C, NL-5683 KH Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
Telefax: +31 (0) 499 33 69 70
E-mail: mike.looijmans at topicproducts.com
Website: www.topicproducts.com
Please consider the environment before printing this e-mail
^ permalink raw reply [flat|nested] 34+ messages in thread
* dma_alloc_coherent versus streaming DMA, neither works satisfactory
2015-05-08 14:18 ` Mike Looijmans
@ 2015-05-08 14:27 ` Arnd Bergmann
0 siblings, 0 replies; 34+ messages in thread
From: Arnd Bergmann @ 2015-05-08 14:27 UTC (permalink / raw)
To: linux-arm-kernel
On Friday 08 May 2015 16:18:11 Mike Looijmans wrote:
> >
> > If you have that low-memory restriction, you also need to ensure that there
> > is a ZONE_DMA that is large enough. ZONE_DMA should be sized to match the
> > common subset that all devices can access, so a GFP_DMA request returns
> > memory that is guaranteed to be accessible by all devices.
>
> GFP_DMA still carries that scary remark: "on x86, GFP_DMA guarantees to be
> within the first 16MB of available bus addresses". Which is really scary if
> you want to allocate multimegabyte buffers.
Don't worry about that on ARM: the size of the dma zone is platform specific,
so you can set it to whatever you want. The 16MB limit is only relevant
on systems that use ISA DMA (floppy, soundblaster, ...)
Arnd
^ permalink raw reply [flat|nested] 34+ messages in thread
end of thread, other threads:[~2015-05-08 14:27 UTC | newest]
Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-23 11:52 dma_alloc_coherent versus streaming DMA, neither works satisfactory Mike Looijmans
2015-04-23 12:32 ` Arnd Bergmann
2015-04-23 13:05 ` Mike Looijmans
2015-04-29 8:47 ` Mike Looijmans
2015-04-29 9:01 ` Arnd Bergmann
2015-04-29 9:17 ` Russell King - ARM Linux
2015-04-29 9:47 ` Mike Looijmans
2015-04-29 10:07 ` Arnd Bergmann
2015-04-29 10:33 ` Mike Looijmans
2015-04-29 10:41 ` Arnd Bergmann
2015-04-29 12:49 ` Mike Looijmans
2015-04-29 13:13 ` Arnd Bergmann
2015-04-30 13:50 ` Mike Looijmans
2015-04-30 13:54 ` Arnd Bergmann
2015-05-01 6:08 ` Mike Looijmans
2015-05-01 7:01 ` Mike Looijmans
2015-05-07 11:18 ` Mike Looijmans
2015-05-07 11:56 ` Arnd Bergmann
2015-05-07 13:21 ` Daniel Drake
2015-05-07 13:31 ` Mike Looijmans
2015-05-07 14:08 ` Mike Looijmans
2015-05-07 14:30 ` Russell King - ARM Linux
2015-05-08 5:55 ` Mike Looijmans
2015-05-08 7:54 ` Arnd Bergmann
2015-05-08 8:31 ` Mike Looijmans
2015-05-08 13:19 ` Arnd Bergmann
2015-05-08 14:18 ` Mike Looijmans
2015-05-08 14:27 ` Arnd Bergmann
2015-05-08 11:10 ` Russell King - ARM Linux
2015-05-08 12:17 ` Mike Looijmans
2015-04-29 11:09 ` Mike Looijmans
2015-04-29 12:35 ` Arnd Bergmann
2015-04-29 12:52 ` Mike Looijmans
2015-04-29 12:54 ` Arnd Bergmann
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.