* Why dma_alloc_coherent don't return direct mapped vaddr?
@ 2022-07-21  3:28 Li Chen
  2022-07-21  7:06 ` Arnd Bergmann
  0 siblings, 1 reply; 15+ messages in thread
From: Li Chen @ 2022-07-21  3:28 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel

Hi Arnd,

dma_alloc_coherent() returns two addresses:
1. vaddr
2. dma_addr

I noticed that vaddr is not simply linear/direct mapped to dma_addr, which
means I cannot use virt_to_phys/virt_to_page to get the paddr/page. Instead,
I have to use dma_addr as the paddr and phys_to_page(dma_addr) to get the
struct page.

My question is: why doesn't dma_alloc_coherent() simply return
phys_to_virt(dma_addr)? IOW, why is vaddr not directly mapped to dma_addr?

Regards,
Li


* Re: Why dma_alloc_coherent don't return direct mapped vaddr?
  2022-07-21  3:28 Why dma_alloc_coherent don't return direct mapped vaddr? Li Chen
@ 2022-07-21  7:06 ` Arnd Bergmann
  2022-07-22  2:57   ` Li Chen
  0 siblings, 1 reply; 15+ messages in thread
From: Arnd Bergmann @ 2022-07-21  7:06 UTC (permalink / raw)
  To: Li Chen; +Cc: Arnd Bergmann, linux-arm-kernel

On Thu, Jul 21, 2022 at 5:28 AM Li Chen <me@linux.beauty> wrote:
>
> Hi Arnd,
>
> dma_alloc_coherent two addr:
> 1. vaddr.
> 2. dma_addr
>
> I noticed vaddr is not simply linear/direct mapped to dma_addr, which means I cannot use virt_to_phys/virt_to_page to get
> paddr/page. Instead, I should use dma_addr as paddr and phys_to_page(dma_addr) to get struct page.
>
> My question is why dma_alloc_coherent not simply return phys_to_virt(dma_addr)? IOW, why vaddr is
> not directly mapped to dma_addr?

dma_alloc_coherent() tries to allocate memory that is the correct type for
the device you pass. If the device is not itself marked as cache coherent
in the DT, then it has to use an uncached mapping.

The normal linear map of all memory into the kernel address space is
cacheable, so this device can't use it, and you instead get a new mapping
into kernel space at a different virtual address.

Note that you should never need a 'struct page' in this case, as the device
needs a physical address and access from kernel space just needs a pointer.
Calling phys_to_page() on a dma_addr_t is not portable because a lot of
devices have DMA addresses that are not the same number as the physical
address as seen by the CPU, or there may be an IOMMU in between.
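
For reference, a minimal sketch of the intended usage (the device pointer
and buffer size are placeholders, not your actual code):

#include <linux/dma-mapping.h>
#include <linux/sizes.h>

#define BUF_SIZE SZ_1M

static void *cpu_vaddr;      /* what the kernel dereferences */
static dma_addr_t dev_addr;  /* what gets programmed into the device */

static int alloc_buf(struct device *dev)
{
	cpu_vaddr = dma_alloc_coherent(dev, BUF_SIZE, &dev_addr, GFP_KERNEL);
	if (!cpu_vaddr)
		return -ENOMEM;
	/*
	 * Do NOT call virt_to_phys(cpu_vaddr) or phys_to_page(dev_addr):
	 * the two values live in different address spaces and need not
	 * correspond to each other.
	 */
	return 0;
}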

         Arnd


* Re: Why dma_alloc_coherent don't return direct mapped vaddr?
  2022-07-21  7:06 ` Arnd Bergmann
@ 2022-07-22  2:57   ` Li Chen
  2022-07-22  6:50     ` Arnd Bergmann
  0 siblings, 1 reply; 15+ messages in thread
From: Li Chen @ 2022-07-22  2:57 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel

Hi Arnd,
 ---- On Thu, 21 Jul 2022 15:06:57 +0800  Arnd Bergmann <arnd@arndb.de> wrote --- 
 > On Thu, Jul 21, 2022 at 5:28 AM Li Chen <me@linux.beauty> wrote:
 > >
 > > Hi Arnd,
 > >
 > > dma_alloc_coherent two addr:
 > > 1. vaddr.
 > > 2. dma_addr
 > >
 > > I noticed vaddr is not simply linear/direct mapped to dma_addr, which means I cannot use virt_to_phys/virt_to_page to get
 > > paddr/page. Instead, I should use dma_addr as paddr and phys_to_page(dma_addr) to get struct page.
 > >
 > > My question is why dma_alloc_coherent not simply return phys_to_virt(dma_addr)? IOW, why vaddr is
 > > not directly mapped to dma_addr?
 > 
 > dma_alloc_coherent() tries to allocate memory that is the correct type
 > for the device you
 > pass. If the device is not itself marked as cache coherent in the DT,
 > then it has to use
 > an uncached mapping.
 > 
 > The normal linear map of all memory into the kernel address space is
 > cacheable, so this
 > device can't use it, and you instead get a new mapping into kernel
 > space at a different
 > virtual address.
 > 
 > Note that you should never need a 'struct page' in this case, as the
 > device needs a physical
 > ddress and access from kernel space just needs a pointer. Calling
 > phys_to_page() on
 > a dma_addr_t is not portable because a lot of devices have DMA
 > addresses that are not
 > the same number as the physical address as seen by the CPU, or there
 > may be an IOMMU
 > in between.

Thanks for your answer! My device is a misc character device, just like
https://lwn.net/ml/linux-kernel/20220711122459.13773-5-me@linux.beauty/.
IIUC, its dma_addr is always the same as the physical address. If I want to
allocate from reserved memory and then mmap it to userspace with
vm_insert_pages, are cma_alloc/dma_alloc_contiguous/dma_alloc_from_contiguous
better choices?

Regards,
Li


* Re: Why dma_alloc_coherent don't return direct mapped vaddr?
  2022-07-22  2:57   ` Li Chen
@ 2022-07-22  6:50     ` Arnd Bergmann
  2022-07-22  8:19       ` Li Chen
  0 siblings, 1 reply; 15+ messages in thread
From: Arnd Bergmann @ 2022-07-22  6:50 UTC (permalink / raw)
  To: Li Chen; +Cc: Arnd Bergmann, linux-arm-kernel

On Fri, Jul 22, 2022 at 4:57 AM Li Chen <me@linux.beauty> wrote:
>  ---- On Thu, 21 Jul 2022 15:06:57 +0800  Arnd Bergmann <arnd@arndb.de> wrote ---
>  > in between.
>
> Thanks for your answer! My device is a misc character device, just like
> https://lwn.net/ml/linux-kernel/20220711122459.13773-5-me@linux.beauty/
> IIUC, its dma_addr is always the same with phy addr. If I want to alloc from
> reserved memory and then mmap to userspace with vm_insert_pages, are
> cma_alloc/dma_alloc_contigous/dma_alloc_from_contigous better choices?

In the driver, you should only ever use dma_alloc_coherent() for getting
a coherent DMA buffer, the other functions are just the implementation
details behind that.

To map this buffer to user space, your mmap() function should call
dma_mmap_coherent(), which in turn does the correct translation
from device specific dma_addr_t values into pages and uses the
correct caching attributes.
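
A hedged sketch of what that mmap() handler could look like (my_dev,
my_vaddr, my_dma_addr and my_size stand in for your driver's state):

static int my_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long len = vma->vm_end - vma->vm_start;

	if (len > my_size)
		return -EINVAL;

	/*
	 * dma_mmap_coherent() does the dma_addr_t -> pfn translation and
	 * picks the right (un)cached attributes for this platform.
	 */
	return dma_mmap_coherent(my_dev, vma, my_vaddr, my_dma_addr, len);
}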

       Arnd


* Re: Why dma_alloc_coherent don't return direct mapped vaddr?
  2022-07-22  6:50     ` Arnd Bergmann
@ 2022-07-22  8:19       ` Li Chen
  2022-07-22  9:06         ` Arnd Bergmann
  0 siblings, 1 reply; 15+ messages in thread
From: Li Chen @ 2022-07-22  8:19 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel


 ---- On Fri, 22 Jul 2022 14:50:17 +0800  Arnd Bergmann <arnd@arndb.de> wrote --- 
 > On Fri, Jul 22, 2022 at 4:57 AM Li Chen <me@linux.beauty> wrote:
 > >  ---- On Thu, 21 Jul 2022 15:06:57 +0800  Arnd Bergmann <arnd@arndb.de> wrote ---
 > >  > in between.
 > >
 > > Thanks for your answer! My device is a misc character device, just like
 > > https://lwn.net/ml/linux-kernel/20220711122459.13773-5-me@linux.beauty/
 > > IIUC, its dma_addr is always the same with phy addr. If I want to alloc from
 > > reserved memory and then mmap to userspace with vm_insert_pages, are
 > > cma_alloc/dma_alloc_contigous/dma_alloc_from_contigous better choices?
 > 
 > In the driver, you should only ever use dma_alloc_coherent() for getting
 > a coherent DMA buffer, the other functions are just the implementation
 > details behind that.
 > 
 > To map this buffer to user space, your mmap() function should call
 > dma_mmap_coherent(), which in turn does the correct translation
 > from device specific dma_addr_t values into pages and uses the
 > correct caching attributes.

Yeah, dma_mmap_coherent() is best if I don't care about direct I/O.
But if we need **direct I/O**, dma_mmap_coherent() cannot be used, because it
uses remap_pfn_range() internally, which marks the vma as VM_IO and VM_PFNMAP.
So I think I still have to go back to getting struct pages from the rmem and
using vm_insert_pages() to insert them into the vma, right?
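
Roughly what I have in mind (a sketch only; rmem_base is a made-up name for
the reserved region's base address, and it assumes the region is covered by
valid struct pages):

static int rmem_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long nr = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
	unsigned long i, left = nr;
	struct page **pages;
	int ret;

	pages = kvmalloc_array(nr, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;
	for (i = 0; i < nr; i++)
		pages[i] = phys_to_page(rmem_base + i * PAGE_SIZE);

	/*
	 * Keeps the VMA a normal mapping (no VM_PFNMAP/VM_IO), so that
	 * get_user_pages() and therefore direct I/O can work on it.
	 */
	ret = vm_insert_pages(vma, vma->vm_start, pages, &left);
	kvfree(pages);
	return ret;
}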

Regards,
Li


* Re: Why dma_alloc_coherent don't return direct mapped vaddr?
  2022-07-22  8:19       ` Li Chen
@ 2022-07-22  9:06         ` Arnd Bergmann
  2022-07-22 10:31           ` Li Chen
  0 siblings, 1 reply; 15+ messages in thread
From: Arnd Bergmann @ 2022-07-22  9:06 UTC (permalink / raw)
  To: Li Chen; +Cc: Arnd Bergmann, linux-arm-kernel

On Fri, Jul 22, 2022 at 10:19 AM Li Chen <me@linux.beauty> wrote:
>  ---- On Fri, 22 Jul 2022 14:50:17 +0800  Arnd Bergmann <arnd@arndb.de> wrote ---
>  > On Fri, Jul 22, 2022 at 4:57 AM Li Chen <me@linux.beauty> wrote:
>  > >  ---- On Thu, 21 Jul 2022 15:06:57 +0800  Arnd Bergmann <arnd@arndb.de> wrote ---
>  > >  > in between.
>  > >
>  > > Thanks for your answer! My device is a misc character device, just like
>  > > https://lwn.net/ml/linux-kernel/20220711122459.13773-5-me@linux.beauty/
>  > > IIUC, its dma_addr is always the same with phy addr. If I want to alloc from
>  > > reserved memory and then mmap to userspace with vm_insert_pages, are
>  > > cma_alloc/dma_alloc_contigous/dma_alloc_from_contigous better choices?
>  >
>  > In the driver, you should only ever use dma_alloc_coherent() for getting
>  > a coherent DMA buffer, the other functions are just the implementation
>  > details behind that.
>  >
>  > To map this buffer to user space, your mmap() function should call
>  > dma_mmap_coherent(), which in turn does the correct translation
>  > from device specific dma_addr_t values into pages and uses the
>  > correct caching attributes.
>
> Yeah, dma_mmap_coherent() is best if I don't care about direct IO.
> But if we need **direct I/O**, dma_mmap_cohere cannot be used because it uses
> remap_pfn_range internally, which will set vma to be VM_IO and VM_PFNMAP,
> so I think I still have to go back to get struct page from rmem and use
> vm_insert_pages to insert pages into vma, right?

I'm not entirely sure, but I suspect that direct I/O on pages that are mapped
uncacheable will cause data corruption somewhere as well: the direct I/O
code expects normal page cache pages, but these are clearly not.

Also, the coherent DMA API is not actually meant for transferring large
amounts of data. My guess is that what you are doing here is to use the
coherent API to map a large buffer uncached and then try to access the
uncached data in user space, which is inherently slow. Using direct I/O
appears to solve the problem by not actually using the uncached mapping
when sending the data to another device, but this is not the right approach.

Do you have an IOMMU, scatter/gather support or similar to back the
device? I think the only way to do what you want to achieve in a way
that is both safe and efficient would be to use normal page cache pages
allocated from user space, ideally using hugepage mappings, and then
mapping those into the device using the streaming DMA API to assign
them to the DMA master with get_user_pages_fast()/dma_map_sg()
and dma_sync_sg_for_{device,cpu}().
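
A sketch of that flow (not tested; my_dev/user_ptr/nr_pages are placeholders
and error unwinding of the pinned pages is omitted):

#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>

static int map_user_buf(struct device *my_dev, unsigned long user_ptr,
			unsigned long nr_pages, struct page **pages,
			struct sg_table *sgt)
{
	int got, ret;

	/* pin the user pages; FOLL_WRITE because the device writes to them */
	got = get_user_pages_fast(user_ptr, nr_pages, FOLL_WRITE, pages);
	if (got <= 0)
		return got ? got : -EFAULT;

	ret = sg_alloc_table_from_pages(sgt, pages, got, 0,
					(size_t)got << PAGE_SHIFT, GFP_KERNEL);
	if (ret)
		return ret;

	if (!dma_map_sg(my_dev, sgt->sgl, sgt->orig_nents, DMA_FROM_DEVICE))
		return -EIO;

	/* ... hand the sg entries to the DMA master, wait for completion ... */

	/*
	 * Transfer ownership back to the CPU; if the mapping is kept around
	 * for reuse, dma_sync_sg_for_cpu()/dma_sync_sg_for_device() do the
	 * cache maintenance at each handover instead.
	 */
	dma_unmap_sg(my_dev, sgt->sgl, sgt->orig_nents, DMA_FROM_DEVICE);
	return 0;
}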

         Arnd


* Re: Why dma_alloc_coherent don't return direct mapped vaddr?
  2022-07-22  9:06         ` Arnd Bergmann
@ 2022-07-22 10:31           ` Li Chen
  2022-07-22 11:06             ` Arnd Bergmann
  0 siblings, 1 reply; 15+ messages in thread
From: Li Chen @ 2022-07-22 10:31 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel


 ---- On Fri, 22 Jul 2022 17:06:36 +0800  Arnd Bergmann <arnd@arndb.de> wrote --- 
 > On Fri, Jul 22, 2022 at 10:19 AM Li Chen <me@linux.beauty> wrote:
 > >  ---- On Fri, 22 Jul 2022 14:50:17 +0800  Arnd Bergmann <arnd@arndb.de> wrote ---
 > >  > On Fri, Jul 22, 2022 at 4:57 AM Li Chen <me@linux.beauty> wrote:
 > >  > >  ---- On Thu, 21 Jul 2022 15:06:57 +0800  Arnd Bergmann <arnd@arndb.de> wrote ---
 > >  > >  > in between.
 > >  > >
 > >  > > Thanks for your answer! My device is a misc character device, just like
 > >  > > https://lwn.net/ml/linux-kernel/20220711122459.13773-5-me@linux.beauty/
 > >  > > IIUC, its dma_addr is always the same with phy addr. If I want to alloc from
 > >  > > reserved memory and then mmap to userspace with vm_insert_pages, are
 > >  > > cma_alloc/dma_alloc_contigous/dma_alloc_from_contigous better choices?
 > >  >
 > >  > In the driver, you should only ever use dma_alloc_coherent() for getting
 > >  > a coherent DMA buffer, the other functions are just the implementation
 > >  > details behind that.
 > >  >
 > >  > To map this buffer to user space, your mmap() function should call
 > >  > dma_mmap_coherent(), which in turn does the correct translation
 > >  > from device specific dma_addr_t values into pages and uses the
 > >  > correct caching attributes.
 > >
 > > Yeah, dma_mmap_coherent() is best if I don't care about direct IO.
 > > But if we need **direct I/O**, dma_mmap_cohere cannot be used because it uses
 > > remap_pfn_range internally, which will set vma to be VM_IO and VM_PFNMAP,
 > > so I think I still have to go back to get struct page from rmem and use
 > > vm_insert_pages to insert pages into vma, right?
 > 
 > I'm not entirely sure, but I suspect that direct I/O on pages that are mapped
 > uncacheable will cause data corruption somewhere as well: The direct i/o
 > code expects normal page cache pages, but these are clearly not.

direct I/O just bypasses the page cache, so I think you mean "normal pages"?
At least in my hundreds of attempts on the 512M rmem, the data doesn't get
corrupted; the crc32 of the resulting file is always correct after direct I/O.
 
 > Also, the coherent DMA API is not actually meant for transferring large
 > amounts of data. 

Agreed, that's why I also tried CMA APIs like cma_alloc/dma_alloc_from_contiguous, and they also worked fine.

 > My guess is that what you are doing here is to use the
 > coherent API to map a large buffer uncached and then try to access the
 > uncached data in user space, which is inherently slow. Using direct I/o
 > appears to solve the problem by not actually using the uncached mapping
 > when sending the data to another device, but this is not the right approach.

My case is to do direct I/O from this reserved memory to NVMe, and the
throughput is good, about 2.3GB/s, which is almost the same as fio's direct
I/O sequential write result (of course, fio uses cached, non-reserved memory).
 
 > Do you have an IOMMU, scatter/gather support or similar to back the
 > device? 

No. My misc char device is simply a pseudo device and has no real hardware
behind it. Our DSP will write raw data to this rmem, but that is another
story; we can ignore it here.

 > I think the only way to safely do what you want to achieve in  way
 > that is both safe and efficient would be to use normal page cache pages
 > allocated from user space, ideally using hugepage mappings, and then
 > mapping those into the device using the streaming DMA API to assign
 > them to the DMA master with get_user_pages_fast()/dma_map_sg()
 > and dma_sync_sg_for_{device,cpu}.

Thanks for your advice, but unfortunately the DSP can only write to
contiguous physical memory (it has no knowledge of the MMU), and pages
allocated from userspace are not physically contiguous.

Regards,
Li



* Re: Why dma_alloc_coherent don't return direct mapped vaddr?
  2022-07-22 10:31           ` Li Chen
@ 2022-07-22 11:06             ` Arnd Bergmann
  2022-07-25  2:50               ` Li Chen
  0 siblings, 1 reply; 15+ messages in thread
From: Arnd Bergmann @ 2022-07-22 11:06 UTC (permalink / raw)
  To: Li Chen; +Cc: Arnd Bergmann, linux-arm-kernel

On Fri, Jul 22, 2022 at 12:31 PM Li Chen <me@linux.beauty> wrote:
>  ---- On Fri, 22 Jul 2022 17:06:36 +0800  Arnd Bergmann <arnd@arndb.de> wrote ---
>  >
>  > I'm not entirely sure, but I suspect that direct I/O on pages that are mapped
>  > uncacheable will cause data corruption somewhere as well: The direct i/o
>  > code expects normal page cache pages, but these are clearly not.
>
> direct I/O just bypasses page cache, so I think you want to say "normal pages"?

All normal memory available to user space is in the page cache. What you bypass
with direct I/O is just the copy into another page cache page.

>  > Also, the coherent DMA API is not actually meant for transferring large
>  > amounts of data.
>
>  Agree, that's why I also tried cma API like cma_alloc/dma_alloc_from_contigous
> and they also worked fine.

Those two interfaces just return a 'struct page', so if you convert them into
a kernel pointer or map them into user space, you get a cacheable mapping.
Is that what you do? If so, then your device appears to be cache coherent
with the CPU, and you can just mark it as coherent in the devicetree.
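
By a cacheable mapping I mean roughly this pattern (illustrative only, not
your actual code):

	struct page *pg = dma_alloc_from_contiguous(dev, nr_pages, 0, false);
	void *vaddr = page_address(pg);  /* cached linear-map address */
	/* mapping 'pg' into user space, e.g. with vm_insert_pages(), is
	 * likewise a cacheable mapping */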

>  > My guess is that what you are doing here is to use the
>  > coherent API to map a large buffer uncached and then try to access the
>  > uncached data in user space, which is inherently slow. Using direct I/o
>  > appears to solve the problem by not actually using the uncached mapping
>  > when sending the data to another device, but this is not the right approach.
>
> My case is to do direct IO from this reserved memory to NVME, and the throughput is good, about
> 2.3GB/s, which is almost the same as fio's direct I/O seq write result (Of course, fio uses cached and non-reserved memory).
>
>  > Do you have an IOMMU, scatter/gather support or similar to back the
>  > device?
>
> No. My misc char device is simply a pseudo device and have no real hardware.
> Our dsp will writing raw data to this rmem, but that is another story, we can ignore it here.

It is the DSP that I'm talking about here; this is what makes all the
difference.
If the DSP is cache coherent and you mark it that way in DT, then everything
just becomes fast, and you don't have to use direct I/O. If the DSP is not
cache coherent, but you can program it to write into arbitrary memory page
cache pages allocated from user space, then you can use the streaming
mapping interface that does explicit cache management. This is of course
not as fast as coherent hardware, but it also allows accessing the data
through the CPU cache later.

>  > I think the only way to safely do what you want to achieve in  way
>  > that is both safe and efficient would be to use normal page cache pages
>  > allocated from user space, ideally using hugepage mappings, and then
>  > mapping those into the device using the streaming DMA API to assign
>  > them to the DMA master with get_user_pages_fast()/dma_map_sg()
>  > and dma_sync_sg_for_{device,cpu}.
>
> Thanks for your advice, but unfortunately, dsp can only write to contiguous
> physical memory(it doesn't know MMU), and pages allocated from
> userspace are not contiguous on physical memory.

Usually a DSP can run user-provided software, so you may be able to pass it
a scatter-gather list for the output data in addition to the buffer that it
uses for its code and intermediate buffers. If the goal is to store this
data in a file, you can even go as far
as calling mmap() on the file, and then letting the driver get the page
cache pages backing the file mapping, and then relying on the normal
file system writeback to store the data to disk.

         Arnd


* Re: Why dma_alloc_coherent don't return direct mapped vaddr?
  2022-07-22 11:06             ` Arnd Bergmann
@ 2022-07-25  2:50               ` Li Chen
  2022-07-25  7:03                 ` Arnd Bergmann
  0 siblings, 1 reply; 15+ messages in thread
From: Li Chen @ 2022-07-25  2:50 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel

Hi Arnd,
 ---- On Fri, 22 Jul 2022 20:06:35 +0900  Arnd Bergmann <arnd@arndb.de> wrote --- 
 > On Fri, Jul 22, 2022 at 12:31 PM Li Chen <me@linux.beauty> wrote:
 > >  ---- On Fri, 22 Jul 2022 17:06:36 +0800  Arnd Bergmann <arnd@arndb.de> wrote ---
 > >  >
 > >  > I'm not entirely sure, but I suspect that direct I/O on pages that are mapped
 > >  > uncacheable will cause data corruption somewhere as well: The direct i/o
 > >  > code expects normal page cache pages, but these are clearly not.
 > >
 > > direct I/O just bypasses page cache, so I think you want to say "normal pages"?
 > 
 > All normal memory available to user space is in the page cache. 

Just to make sure: does "all normal memory available to user space" mean
memory that comes from functions like malloc? If so, I think it is not in
the page cache. malloc will invoke mmap, then:

sys_mmap()
└→ do_mmap_pgoff()
   └→ mmap_region()
      └→ generic_file_mmap()       // file-mapping path
      └→ vma_set_anonymous(vma);   // anonymous-vma path

IIUC, mmap coming from malloc sets the vma to be anonymous, and such pages
are not "page cache pages" because they don't have files as their backing
store.

Please correct me if I am missing something.

 > What you bypass with direct I/O is just the copy into another page cache page.

 > >  > Also, the coherent DMA API is not actually meant for transferring large
 > >  > amounts of data.
 > >
 > >  Agree, that's why I also tried cma API like cma_alloc/dma_alloc_from_contigous
 > > and they also worked fine.
 > 
 > Those two interfaces just return a 'struct page', so if you convert them into
 > a kernel pointer or map them into user space, you get a cacheable mapping.
 > Is that what you do? 

Yes.

 > If so, then your device appears to be cache coherent
 > with the CPU, and you can just mark it as coherent in the devicetree.

Our DSP is not a cache-coherent device; there is no CCI to manage cache
coherence on any of our SoCs, so none of our peripherals are cache coherent.
Given that, it's not a good idea to use the CMA alloc APIs to allocate cached
pages, right?
 
 > >  > My guess is that what you are doing here is to use the
 > >  > coherent API to map a large buffer uncached and then try to access the
 > >  > uncached data in user space, which is inherently slow. Using direct I/o
 > >  > appears to solve the problem by not actually using the uncached mapping
 > >  > when sending the data to another device, but this is not the right approach.
 > >
 > > My case is to do direct IO from this reserved memory to NVME, and the throughput is good, about
 > > 2.3GB/s, which is almost the same as fio's direct I/O seq write result (Of course, fio uses cached and non-reserved memory).
 > >
 > >  > Do you have an IOMMU, scatter/gather support or similar to back the
 > >  > device?
 > >
 > > No. My misc char device is simply a pseudo device and have no real hardware.
 > > Our dsp will writing raw data to this rmem, but that is another story, we can ignore it here.
 > 
 > It is the DSP that I'm talking about here, this is what makes all the
 > difference.
 > If the DSP is cache coherent and you mark it that way in DT, then everything
 > just becomes fast, and you don't have to use direct I/O. If the DSP is not
 > cache coherent, but you can program it to write into arbitrary memory page
 > cache pages allocated from user space, then you can use the streaming
 > mapping interface that does explicit cache management. This is of course
 > not as fast as coherent hardware, but it also allows accessing the data
 > through the CPU cache later.
 
I'm afraid buffered I/O also cannot meet our throughput requirement. From
fio results on our NVMe, buffered I/O can only reach around 500MB/s, while
direct I/O can reach 2.3GB/s. fio shows the CPU is nearly 100% busy when
doing buffered I/O, so I used perf to profile it and found that
copy_from_user and spinlocks hog most of the CPU; CPU performance is the
bottleneck. I think even with the cache involved, buffered I/O throughput
cannot get much faster; we simply have too much raw data to write and read.

 > >  > I think the only way to safely do what you want to achieve in  way
 > >  > that is both safe and efficient would be to use normal page cache pages
 > >  > allocated from user space, ideally using hugepage mappings, and then
 > >  > mapping those into the device using the streaming DMA API to assign
 > >  > them to the DMA master with get_user_pages_fast()/dma_map_sg()
 > >  > and dma_sync_sg_for_{device,cpu}.
 > >
 > > Thanks for your advice, but unfortunately, dsp can only write to contiguous
 > > physical memory(it doesn't know MMU), and pages allocated from
 > > userspace are not contiguous on physical memory.
 > 
 > Usually what you can do with a DSP is that it can run user-provided
 > software, so if you can pass it a scatter-gather list for the output data
 > in addition to the buffer that it uses for its code and intermediate
 > buffers. If the goal is to store this data in a file, you can even go as far
 > as calling mmap() on the file, and then letting the driver get the page
 > cache pages backing the file mapping, and then relying on the normal
 > file system writeback to store the data to disk.

Our DSP doesn't support scatter-gather lists.
What does "intermediate buffers" mean? 

Regards,
Li


* Re: Why dma_alloc_coherent don't return direct mapped vaddr?
  2022-07-25  2:50               ` Li Chen
@ 2022-07-25  7:03                 ` Arnd Bergmann
  2022-07-25 11:06                   ` Li Chen
  0 siblings, 1 reply; 15+ messages in thread
From: Arnd Bergmann @ 2022-07-25  7:03 UTC (permalink / raw)
  To: Li Chen; +Cc: Arnd Bergmann, linux-arm-kernel

On Mon, Jul 25, 2022 at 4:50 AM Li Chen <me@linux.beauty> wrote:
>  ---- On Fri, 22 Jul 2022 20:06:35 +0900  Arnd Bergmann <arnd@arndb.de> wrote ---
>  > On Fri, Jul 22, 2022 at 12:31 PM Li Chen <me@linux.beauty> wrote:
>  > >  ---- On Fri, 22 Jul 2022 17:06:36 +0800  Arnd Bergmann <arnd@arndb.de> wrote ---
>  > >  >
>  > >  > I'm not entirely sure, but I suspect that direct I/O on pages that are mapped
>  > >  > uncacheable will cause data corruption somewhere as well: The direct i/o
>  > >  > code expects normal page cache pages, but these are clearly not.
>  > >
>  > > direct I/O just bypasses page cache, so I think you want to say "normal pages"?
>  >
>  > All normal memory available to user space is in the page cache.
>
> Just want to make sure that "all normal memory available to user space" come from functions
> like malloc? If so, I think they are not in the page cache. malloc will invoke mmap, then:
> sys_mmap()
> └→ do_mmap_pgoff()
>    └→ mmap_region()
>       └→ generic_file_mmap() // file mapping, then
>       └→ vma_set_anonymous(vma); // anon vma path
>
> IIUC, mmap coming from malloc set vma to be anonymous, and are not "page cache pages" because they
> don't have files as the backing stores.
>
> Please correct me if something I am missing.

I think both anonymous user space pages and file backed pages are commonly
considered 'page cache'. Anonymous memory is eventually backed by swap space,
which is similar, though not quite the same thing here.

When I wrote 'page cache', I meant both of these, as opposed to memory allocated
by a kernel driver.

>  > What you bypass with direct I/O is just the copy into another page cache page.
>
>  > >  > Also, the coherent DMA API is not actually meant for transferring large
>  > >  > amounts of data.
>  > >
>  > >  Agree, that's why I also tried cma API like cma_alloc/dma_alloc_from_contigous
>  > > and they also worked fine.
>  >
>  > Those two interfaces just return a 'struct page', so if you convert them into
>  > a kernel pointer or map them into user space, you get a cacheable mapping.
>  > Is that what you do?
>
> Yes.
>
>  > If so, then your device appears to be cache coherent
>  > with the CPU, and you can just mark it as coherent in the devicetree.
>
> Our DSP is not a cache coherent device, there is no CCI to manage cache coherence on all our SoCs, so
> all of our peripherals are not cache coherent devices, so it's not a good idea to use cma alloc api to allocate
> cached pages, right?

Using CMA or not is not the problem here; what you have to do for
correctness is to use the same mapping type in every place that maps the
pages into a page table. The two options you have are:

- Using uncached mappings from dma_alloc_coherent() in combination
  with dma_mmap_coherent(). You cannot use direct I/O on these, and
  any access through a pointer is slow.

- Using cached mappings from anywhere, and then flushing the caches
   during ownership transfers with dma_map_sg()/dma_unmap_sg()/
   dma_sync_sg_for_cpu()/dma_sync_sg_for_device().
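
With your single physically contiguous buffer, the second option boils down
to the single-buffer variants of those calls, roughly (sketch only, names
made up):

	/* 'buf' is a cached kernel mapping, e.g. page_address() of CMA pages */
	dma_addr_t dma = dma_map_single(dev, buf, size, DMA_FROM_DEVICE);

	if (dma_mapping_error(dev, dma))
		return -EIO;

	/* ... tell the DSP to write to 'dma', wait for it to finish ... */

	/* hands ownership (and cache maintenance) back to the CPU */
	dma_unmap_single(dev, dma, size, DMA_FROM_DEVICE);
	/* or, if the mapping is kept for reuse:
	 * dma_sync_single_for_cpu(dev, dma, size, DMA_FROM_DEVICE) before
	 * each CPU read and ..._for_device() before the next DSP write */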

>  > >  > My guess is that what you are doing here is to use the
>  > >  > coherent API to map a large buffer uncached and then try to access the
>  > >  > uncached data in user space, which is inherently slow. Using direct I/o
>  > >  > appears to solve the problem by not actually using the uncached mapping
>  > >  > when sending the data to another device, but this is not the right approach.
>  > >
>  > > My case is to do direct IO from this reserved memory to NVME, and the throughput is good, about
>  > > 2.3GB/s, which is almost the same as fio's direct I/O seq write result (Of course, fio uses cached and non-reserved memory).
>  > >
>  > >  > Do you have an IOMMU, scatter/gather support or similar to back the
>  > >  > device?
>  > >
>  > > No. My misc char device is simply a pseudo device and have no real hardware.
>  > > Our dsp will writing raw data to this rmem, but that is another story, we can ignore it here.
>  >
>  > It is the DSP that I'm talking about here, this is what makes all the
>  > difference.
>  > If the DSP is cache coherent and you mark it that way in DT, then everything
>  > just becomes fast, and you don't have to use direct I/O. If the DSP is not
>  > cache coherent, but you can program it to write into arbitrary memory page
>  > cache pages allocated from user space, then you can use the streaming
>  > mapping interface that does explicit cache management. This is of course
>  > not as fast as coherent hardware, but it also allows accessing the data
>  > through the CPU cache later.
>
> I'm afraid buffered IO also cannot meet our thourghput. From FIO results on our NVME, buffered I/O can only
> reach around 500MB/s, while direct I/O can reach 2.3GB/s. FIO tells me CPU is nearly 100% when doing buffered I/O, so
> I use perf to monitor functions and find copy_from_user and spinlock hog most CPU, our CPU performance is the bottleneck.
> I think even if the cache is involved, buffered I/O throughput also cannot get much faster, we have too much
> raw data to write and read.

copy_from_user() is particularly slow on uncached data. What throughput do you
get if you mmap() /dev/null and write that to a file?

>  > >  > I think the only way to safely do what you want to achieve in  way
>  > >  > that is both safe and efficient would be to use normal page cache pages
>  > >  > allocated from user space, ideally using hugepage mappings, and then
>  > >  > mapping those into the device using the streaming DMA API to assign
>  > >  > them to the DMA master with get_user_pages_fast()/dma_map_sg()
>  > >  > and dma_sync_sg_for_{device,cpu}.
>  > >
>  > > Thanks for your advice, but unfortunately, dsp can only write to contiguous
>  > > physical memory(it doesn't know MMU), and pages allocated from
>  > > userspace are not contiguous on physical memory.
>  >
>  > Usually what you can do with a DSP is that it can run user-provided
>  > software, so if you can pass it a scatter-gather list for the output data
>  > in addition to the buffer that it uses for its code and intermediate
>  > buffers. If the goal is to store this data in a file, you can even go as far
>  > as calling mmap() on the file, and then letting the driver get the page
>  > cache pages backing the file mapping, and then relying on the normal
>  > file system writeback to store the data to disk.
>
> Our DSP doesn't support scatter-gather lists.
> What does "intermediate buffers" mean?

I meant using a statically allocated memory area at a fixed location for
whatever the DSP does internally, and then copying it to the page cache
page that gets written to disk using the DSP to avoid any extra copies
on the CPU side.

This obviously involves changes to the interface that the DSP program
uses.

        Arnd


* Re: Why dma_alloc_coherent don't return direct mapped vaddr?
  2022-07-25  7:03                 ` Arnd Bergmann
@ 2022-07-25 11:06                   ` Li Chen
  2022-07-25 11:45                     ` Arnd Bergmann
  0 siblings, 1 reply; 15+ messages in thread
From: Li Chen @ 2022-07-25 11:06 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel


 ---- On Mon, 25 Jul 2022 16:03:30 +0900  Arnd Bergmann <arnd@arndb.de> wrote --- 
 > On Mon, Jul 25, 2022 at 4:50 AM Li Chen <me@linux.beauty> wrote:
 > >  ---- On Fri, 22 Jul 2022 20:06:35 +0900  Arnd Bergmann <arnd@arndb.de> wrote ---
 > >  > On Fri, Jul 22, 2022 at 12:31 PM Li Chen <me@linux.beauty> wrote:
 > >  > >  ---- On Fri, 22 Jul 2022 17:06:36 +0800  Arnd Bergmann <arnd@arndb.de> wrote ---
 > >  > >  >
 > >  > >  > I'm not entirely sure, but I suspect that direct I/O on pages that are mapped
 > >  > >  > uncacheable will cause data corruption somewhere as well: The direct i/o
 > >  > >  > code expects normal page cache pages, but these are clearly not.
 > >  > >
 > >  > > direct I/O just bypasses page cache, so I think you want to say "normal pages"?
 > >  >
 > >  > All normal memory available to user space is in the page cache.
 > >
 > > Just want to make sure that "all normal memory available to user space" come from functions
 > > like malloc? If so, I think they are not in the page cache. malloc will invoke mmap, then:
 > > sys_mmap()
 > > └→ do_mmap_pgoff()
 > >    └→ mmap_region()
 > >       └→ generic_file_mmap() // file mapping, then
 > >       └→ vma_set_anonymous(vma); // anon vma path
 > >
 > > IIUC, mmap coming from malloc set vma to be anonymous, and are not "page cache pages" because they
 > > don't have files as the backing stores.
 > >
 > > Please correct me if something I am missing.
 > 
 > I think both anonymous user space pages and file backed pages are commonly
 > considered 'page cache'. Anonymous memory is eventually backed by swap space,
 > which is similar to but not the same here.
 > 
 > When I wrote 'page cache', I meant both of these, as opposed to memory allocated
 > by a kernel driver.

Gotcha. 

 > >  > What you bypass with direct I/O is just the copy into another page cache page.
 > >
 > >  > >  > Also, the coherent DMA API is not actually meant for transferring large
 > >  > >  > amounts of data.
 > >  > >
 > >  > >  Agree, that's why I also tried cma API like cma_alloc/dma_alloc_from_contigous
 > >  > > and they also worked fine.
 > >  >
 > >  > Those two interfaces just return a 'struct page', so if you convert them into
 > >  > a kernel pointer or map them into user space, you get a cacheable mapping.
 > >  > Is that what you do?
 > >
 > > Yes.
 > >
 > >  > If so, then your device appears to be cache coherent
 > >  > with the CPU, and you can just mark it as coherent in the devicetree.
 > >
 > > Our DSP is not a cache coherent device, there is no CCI to manage cache coherence on all our SoCs, so
 > > all of our peripherals are not cache coherent devices, so it's not a good idea to use cma alloc api to allocate
 > > cached pages, right?
 > 
 > Using CMA or not is not the problem here, what you have to do for correctness
 > is to use the same mapping type in every place that maps the pages into
 > a page table. The two options you have are:
 > 
 > - Using uncached mappings from dma_alloc_coherent() in combination
 >   with dma_mmap_coherent(). You cannot use direct I/O on these, and
 >   any access through a pointer is slow.
 
Yes, very slow, around 300-500MB/s.


 > - Using cached mappings from anywhere, and then flushing the caches
 >    during ownership transfers with dma_map_sg()/dma_unmap_sg()/
 >    dma_sync_sg_for_cpu()/dma_sync_sg_for_device().

We just set up the physical address and other configuration and then send a
write command to the DSP, instead of using the kernel DMA engine API.
So our Linux DSP driver doesn't know whether the DSP uses a DMA controller
or something else to transfer the data. The userspace app will query before
writing to the file, and we invalidate the cache at "query" time if the
memory region is cacheable.

To learn how the cache affects buffered I/O throughput, I tried to allocate
cached mappings via the CMA API dma_alloc_contiguous, then mmap this region
to userspace via vm_insert_pages, so they are still cacheable page frames.
But the throughput is still as low as with uncached dma_alloc_coherent
memory, both around 300-500MB/s, much slower than the direct I/O throughput.
That seems weird, doesn't it?

 > >  > >  > My guess is that what you are doing here is to use the
 > >  > >  > coherent API to map a large buffer uncached and then try to access the
 > >  > >  > uncached data in user space, which is inherently slow. Using direct I/o
 > >  > >  > appears to solve the problem by not actually using the uncached mapping
 > >  > >  > when sending the data to another device, but this is not the right approach.
 > >  > >
 > >  > > My case is to do direct IO from this reserved memory to NVME, and the throughput is good, about
 > >  > > 2.3GB/s, which is almost the same as fio's direct I/O seq write result (Of course, fio uses cached and non-reserved memory).
 > >  > >
 > >  > >  > Do you have an IOMMU, scatter/gather support or similar to back the
 > >  > >  > device?
 > >  > >
 > >  > > No. My misc char device is simply a pseudo device and have no real hardware.
 > >  > > Our dsp will writing raw data to this rmem, but that is another story, we can ignore it here.
 > >  >
 > >  > It is the DSP that I'm talking about here, this is what makes all the
 > >  > difference.
 > >  > If the DSP is cache coherent and you mark it that way in DT, then everything
 > >  > just becomes fast, and you don't have to use direct I/O. If the DSP is not
 > >  > cache coherent, but you can program it to write into arbitrary memory page
 > >  > cache pages allocated from user space, then you can use the streaming
 > >  > mapping interface that does explicit cache management. This is of course
 > >  > not as fast as coherent hardware, but it also allows accessing the data
 > >  > through the CPU cache later.
 > >
 > > I'm afraid buffered IO also cannot meet our thourghput. From FIO results on our NVME, buffered I/O can only
 > > reach around 500MB/s, while direct I/O can reach 2.3GB/s. FIO tells me CPU is nearly 100% when doing buffered I/O, so
 > > I use perf to monitor functions and find copy_from_user and spinlock hog most CPU, our CPU performance is the bottleneck.
 > > I think even if the cache is involved, buffered I/O throughput also cannot get much faster, we have too much
 > > raw data to write and read.
 > 
 > copy_from_user() is particularly slow on uncached data. What throughput do you
 > get if you mmap() /dev/null and write that to a file?
 
mmap of /dev/null returns "No such device", but I do have the device node:
# ls -l /dev/null
crw-rw-rw-    1 root     root        1,   3 Nov 14 17:34 /dev/null

Per https://stackoverflow.com/a/40300651/6949852, /dev/null cannot be mmapped.

 > >  > >  > I think the only way to safely do what you want to achieve in  way
 > >  > >  > that is both safe and efficient would be to use normal page cache pages
 > >  > >  > allocated from user space, ideally using hugepage mappings, and then
 > >  > >  > mapping those into the device using the streaming DMA API to assign
 > >  > >  > them to the DMA master with get_user_pages_fast()/dma_map_sg()
 > >  > >  > and dma_sync_sg_for_{device,cpu}.
 > >  > >
 > >  > > Thanks for your advice, but unfortunately, dsp can only write to contiguous
 > >  > > physical memory(it doesn't know MMU), and pages allocated from
 > >  > > userspace are not contiguous on physical memory.
 > >  >
 > >  > Usually what you can do with a DSP is that it can run user-provided
 > >  > software, so if you can pass it a scatter-gather list for the output data
 > >  > in addition to the buffer that it uses for its code and intermediate
 > >  > buffers. If the goal is to store this data in a file, you can even go as far
 > >  > as calling mmap() on the file, and then letting the driver get the page
 > >  > cache pages backing the file mapping, and then relying on the normal
 > >  > file system writeback to store the data to disk.
 > >
 > > Our DSP doesn't support scatter-gather lists.
 > > What does "intermediate buffers" mean?
 > 
 > I meant using a statically allocated memory area at a fixed location for
 > whatever the DSP does internally, and then copying it to the page cache
 > page that gets written to disk using the DSP to avoid any extra copies
 > on the CPU side.
 > 
 > This obviously involves changes to the interface that the DSP program
 > uses.

Looks promising, but our DSP doesn't support sg lists :-(
So it seems impossible in our case. Do you know of any open source DSP driver
that uses such a DMA sg-list and/or copies its data into "page cache pages"?
Thanks.

Regards,
Li


* Re: Why dma_alloc_coherent don't return direct mapped vaddr?
  2022-07-25 11:06                   ` Li Chen
@ 2022-07-25 11:45                     ` Arnd Bergmann
  2022-07-26  6:50                       ` Li Chen
  0 siblings, 1 reply; 15+ messages in thread
From: Arnd Bergmann @ 2022-07-25 11:45 UTC (permalink / raw)
  To: Li Chen; +Cc: Arnd Bergmann, linux-arm-kernel

On Mon, Jul 25, 2022 at 1:06 PM Li Chen <me@linux.beauty> wrote:
>  ---- On Mon, 25 Jul 2022 16:03:30 +0900  Arnd Bergmann <arnd@arndb.de> wrote ---
>  > - Using cached mappings from anywhere, and then flushing the caches
>  >    during ownership transfers with dma_map_sg()/dma_unmap_sg()/
>  >    dma_sync_sg_for_cpu()/dma_sync_sg_for_device().
>
> We just set up phy addr and other configures then send write command to dsp instead of
> using kernel dma engine api.
> So, our Linux dsp driver doesn't know if dsp uses a dma controller or anything else to transfer
> data. Userspace app will query before writing to file, and we will invalidate cache when "query"
> if the memory region is cache-able memory.
>
> To learn about how cache can affect buffered I/O throughput, I tried to alloc cached mappings via cma API dma_alloc_contiguous,
> then mmap this region to userspace via vm_insert_pages, so they are still cache-able page frames.
> But the throughput is still as low as dma_alloc_coherent non-cache memory, both around 300-500MB/s, much
> slower than direct I/O throughput.
> it seems weird?

I'm not sure what is actually meant to happen when you have both cacheable
and uncached mappings for the same data; this is not all that well defined,
and it may be that you end up just getting uncached data in the end. You
clearly get either a cache miss here or stale data, and neither is good.

>  > copy_from_user() is particularly slow on uncached data. What throughput do you
>  > get if you mmap() /dev/null and write that to a file?
>
> mmap /dev/null return No such device, but I do have this device:
> # ls -l /dev/null
> crw-rw-rw-    1 root     root        1,   3 Nov 14 17:34 /dev/null
>
> Per https://stackoverflow.com/a/40300651/6949852, /dev/null is not allowed to be mmap-ed.

Sorry, I meant /dev/zero (or any other normal memory really).
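
I.e. something along these lines (userspace sketch; the paths are made up
and error checking/timing are omitted):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 512UL << 20;   /* 512 MiB, matching your rmem size */
	int zfd = open("/dev/zero", O_RDONLY);
	char *buf = mmap(NULL, len, PROT_READ, MAP_PRIVATE, zfd, 0);
	int fd = open("/mnt/nvme/test.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	write(fd, buf, len);   /* ordinary buffered write of cached memory */
	fsync(fd);
	return 0;
}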

>  > >  > >  > I think the only way to safely do what you want to achieve in  way
>  > >  > >  > that is both safe and efficient would be to use normal page cache pages
>  > >  > >  > allocated from user space, ideally using hugepage mappings, and then
>  > >  > >  > mapping those into the device using the streaming DMA API to assign
>  > >  > >  > them to the DMA master with get_user_pages_fast()/dma_map_sg()
>  > >  > >  > and dma_sync_sg_for_{device,cpu}.
>  > >  > >
>  > >  > > Thanks for your advice, but unfortunately, dsp can only write to contiguous
>  > >  > > physical memory(it doesn't know MMU), and pages allocated from
>  > >  > > userspace are not contiguous on physical memory.
>  > >  >
>  > >  > Usually what you can do with a DSP is that it can run user-provided
>  > >  > software, so if you can pass it a scatter-gather list for the output data
>  > >  > in addition to the buffer that it uses for its code and intermediate
>  > >  > buffers. If the goal is to store this data in a file, you can even go as far
>  > >  > as calling mmap() on the file, and then letting the driver get the page
>  > >  > cache pages backing the file mapping, and then relying on the normal
>  > >  > file system writeback to store the data to disk.
>  > >
>  > > Our DSP doesn't support scatter-gather lists.
>  > > What does "intermediate buffers" mean?
>  >
>  > I meant using a statically allocated memory area at a fixed location for
>  > whatever the DSP does internally, and then copying it to the page cache
>  > page that gets written to disk using the DSP to avoid any extra copies
>  > on the CPU side.
>  >
>  > This obviously involves changes to the interface that the DSP program
>  > uses.
>
> Looks promising, but our DSP doesn't support sg list:-(
> So it seems it is impossible for our case. Do you know is there any open source
> DSP driver that using such dma sg-list and/or copy its data to "page cache page"?

I don't recall any other driver that uses the page cache to write into a
file-backed mapping, but you can search the kernel sources for drivers that
use pin_user_pages() or a related function to get access to the user address
space, and extract the page numbers from that to pass into a hardware buffer.
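
The core pattern you would find there is roughly this (illustrative only,
not taken from a specific driver):

	struct page *pages[32];
	int i, got;

	got = pin_user_pages_fast(user_addr, ARRAY_SIZE(pages),
				  FOLL_WRITE | FOLL_LONGTERM, pages);
	if (got <= 0)
		return got ? got : -EFAULT;

	for (i = 0; i < got; i++) {
		/*
		 * page_to_phys(pages[i]) is a physical address the hardware
		 * could be given, one page at a time if it cannot do
		 * scatter-gather.
		 */
	}

	/* once the hardware has finished writing: */
	unpin_user_pages(pages, got);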

        Arnd


* Re: Why dma_alloc_coherent don't return direct mapped vaddr?
  2022-07-25 11:45                     ` Arnd Bergmann
@ 2022-07-26  6:50                       ` Li Chen
  2022-07-27  3:02                         ` Li Chen
  0 siblings, 1 reply; 15+ messages in thread
From: Li Chen @ 2022-07-26  6:50 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel


 ---- On Mon, 25 Jul 2022 20:45:10 +0900  Arnd Bergmann <arnd@arndb.de> wrote --- 
 > On Mon, Jul 25, 2022 at 1:06 PM Li Chen <me@linux.beauty> wrote:
 > >  ---- On Mon, 25 Jul 2022 16:03:30 +0900  Arnd Bergmann <arnd@arndb.de> wrote ---
 > >  > - Using cached mappings from anywhere, and then flushing the caches
 > >  >    during ownership transfers with dma_map_sg()/dma_unmap_sg()/
 > >  >    dma_sync_sg_for_cpu()/dma_sync_sg_for_device().
 > >
 > > We just set up phy addr and other configures then send write command to dsp instead of
 > > using kernel dma engine api.
 > > So, our Linux dsp driver doesn't know if dsp uses a dma controller or anything else to transfer
 > > data. Userspace app will query before writing to file, and we will invalidate cache when "query"
 > > if the memory region is cache-able memory.
 > >
 > > To learn about how cache can affect buffered I/O throughput, I tried to alloc cached mappings via cma API dma_alloc_contiguous,
 > > then mmap this region to userspace via vm_insert_pages, so they are still cache-able page frames.
 > > But the throughput is still as low as dma_alloc_coherent non-cache memory, both around 300-500MB/s, much
 > > slower than direct I/O throughput.
 > > it seems weird?
 > 
 > I'm not sure what is actually meant to happen when you have both cacheable
 > and uncached mappings for the same data, this it not all that well defined and
 > it may be that you end up just getting uncached data in the end. You
 > clearly either
 > get a cache miss here, or stale data, and either way is not good.
 
It's not the same data mapped both cacheable and uncached. I swapped the
kernel image between the two tests (one kernel driver uses
dma_alloc_contiguous and the other uses dma_alloc_coherent), so the pages
should be either cacheable or non-cacheable, never both at once.

 > >  > copy_from_user() is particularly slow on uncached data. What throughput do you
 > >  > get if you mmap() /dev/null and write that to a file?
 > >
 > > mmap /dev/null return No such device, but I do have this device:
 > > # ls -l /dev/null
 > > crw-rw-rw-    1 root     root        1,   3 Nov 14 17:34 /dev/null
 > >
 > > Per https://stackoverflow.com/a/40300651/6949852, /dev/null is not allowed to be mmap-ed.
 > 
 > Sorry, I meant /dev/zero (or any other normal memory really).
 
mmap-ing from /dev/zero and then doing a buffered write to the file still
gets very low throughput. Is this zero page a cacheable page?

From its mmap implementation:

static int mmap_zero(struct file *file, struct vm_area_struct *vma)
{
#ifndef CONFIG_MMU
	return -ENOSYS;
#endif
	if (vma->vm_flags & VM_SHARED)
		return shmem_zero_setup(vma);
	vma_set_anonymous(vma);
	return 0;
}

I tried both MAP_PRIVATE and MAP_SHARED, both still get slow throughput, no
notable change.

 > >  > >  > >  > I think the only way to safely do what you want to achieve in  way
 > >  > >  > >  > that is both safe and efficient would be to use normal page cache pages
 > >  > >  > >  > allocated from user space, ideally using hugepage mappings, and then
 > >  > >  > >  > mapping those into the device using the streaming DMA API to assign
 > >  > >  > >  > them to the DMA master with get_user_pages_fast()/dma_map_sg()
 > >  > >  > >  > and dma_sync_sg_for_{device,cpu}.
 > >  > >  > >
 > >  > >  > > Thanks for your advice, but unfortunately, dsp can only write to contiguous
 > >  > >  > > physical memory(it doesn't know MMU), and pages allocated from
 > >  > >  > > userspace are not contiguous on physical memory.
 > >  > >  >
 > >  > >  > Usually what you can do with a DSP is that it can run user-provided
 > >  > >  > software, so if you can pass it a scatter-gather list for the output data
 > >  > >  > in addition to the buffer that it uses for its code and intermediate
 > >  > >  > buffers. If the goal is to store this data in a file, you can even go as far
 > >  > >  > as calling mmap() on the file, and then letting the driver get the page
 > >  > >  > cache pages backing the file mapping, and then relying on the normal
 > >  > >  > file system writeback to store the data to disk.
 > >  > >
 > >  > > Our DSP doesn't support scatter-gather lists.
 > >  > > What does "intermediate buffers" mean?
 > >  >
 > >  > I meant using a statically allocated memory area at a fixed location for
 > >  > whatever the DSP does internally, and then copying it to the page cache
 > >  > page that gets written to disk using the DSP to avoid any extra copies
 > >  > on the CPU side.
 > >  >
 > >  > This obviously involves changes to the interface that the DSP program
 > >  > uses.
 > >
 > > Looks promising, but our DSP doesn't support sg list:-(
 > > So it seems it is impossible for our case. Do you know is there any open source
 > > DSP driver that using such dma sg-list and/or copy its data to "page cache page"?
 > 
 > I don't recall any other driver that uses the page cache to write into
 > a file-backed
 > mapping, but you can search the kernel sources for drivers that use
 > pin_user_pages() or a related function to get access to the user address space
 > and extract a page number of that to pass into a hardware buffer.

So, IIUC, this solution consists of the following steps:
step 1. Allocate normal memory from userspace with functions like malloc.
step 2. Find the malloc-ed memory, pin it with a pin_user_pages* function, then pass this virtually contiguous but not physically contiguous memory (hence the need for sg) to the DSP for writing.
step 3. unpin_user_pages*, then let the filesystem's writeback queue write the anonymous "page cache" back to files on disk.

Are these steps all right?

If they are, I have some noob questions:
For step 2, how can I find the page frames allocated by malloc? Hack brk/mmap to track them?
For step 2, if sg is not supported (we don't operate on the DSP's DMA controller directly, but send it a command, a physical address, etc.), is there any other way to do it? This is important; otherwise I still have to reserve physically contiguous memory for the DSP to write to.
For step 3, I haven't seen a trivial way to do it. These are still anonymous "page cache" pages with no file as their backing store (of course, swap is their backing store, but we want them written back to real files instead of swap), so it's a little tricky: how do I set the target files as these anonymous "page cache" pages' backing store?
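
Or, for step 3, would the idea instead be to mmap() the destination file from
userspace and pass that address in step 2, so that the pinned pages are
file-backed page cache pages and normal writeback applies? A userspace
sketch of what I mean (MY_DSP_WRITE and the paths are made up):

	int fd = open("/mnt/nvme/out.bin", O_RDWR | O_CREAT, 0644);
	ftruncate(fd, buf_size);
	void *dst = mmap(NULL, buf_size, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);

	/* the driver pins these file-backed pages for the DSP to write to */
	ioctl(dsp_fd, MY_DSP_WRITE, dst);
	msync(dst, buf_size, MS_SYNC);   /* or just let writeback handle it */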

Regards,
Li


* Re: Why dma_alloc_coherent don't return direct mapped vaddr?
  2022-07-26  6:50                       ` Li Chen
@ 2022-07-27  3:02                         ` Li Chen
  2022-07-27  6:40                           ` Arnd Bergmann
  0 siblings, 1 reply; 15+ messages in thread
From: Li Chen @ 2022-07-27  3:02 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-arm-kernel


 ---- On Tue, 26 Jul 2022 15:50:57 +0900  Li Chen <me@linux.beauty> wrote --- 
 > 
 >  ---- On Mon, 25 Jul 2022 20:45:10 +0900  Arnd Bergmann <arnd@arndb.de> wrote --- 
 >  > On Mon, Jul 25, 2022 at 1:06 PM Li Chen <me@linux.beauty> wrote:
 >  > >  ---- On Mon, 25 Jul 2022 16:03:30 +0900  Arnd Bergmann <arnd@arndb.de> wrote ---
 >  > >  > - Using cached mappings from anywhere, and then flushing the caches
 >  > >  >    during ownership transfers with dma_map_sg()/dma_unmap_sg()/
 >  > >  >    dma_sync_sg_for_cpu()/dma_sync_sg_for_device().
 >  > >
 >  > > We just set up phy addr and other configures then send write command to dsp instead of
 >  > > using kernel dma engine api.
 >  > > So, our Linux dsp driver doesn't know if dsp uses a dma controller or anything else to transfer
 >  > > data. Userspace app will query before writing to file, and we will invalidate cache when "query"
 >  > > if the memory region is cache-able memory.
 >  > >
 >  > > To learn about how cache can affect buffered I/O throughput, I tried to alloc cached mappings via cma API dma_alloc_contiguous,
 >  > > then mmap this region to userspace via vm_insert_pages, so they are still cache-able page frames.
 >  > > But the throughput is still as low as dma_alloc_coherent non-cache memory, both around 300-500MB/s, much
 >  > > slower than direct I/O throughput.
 >  > > it seems weird?
 >  > 
 >  > I'm not sure what is actually meant to happen when you have both cacheable
 >  > and uncached mappings for the same data, this it not all that well defined and
 >  > it may be that you end up just getting uncached data in the end. You
 >  > clearly either
 >  > get a cache miss here, or stale data, and either way is not good.
 >  
 > It's not the same data with both cacheable and uncached mappings.
 > I replaced the kernel image for the two tests (one kernel driver uses dma_alloc_contiguous and the other one uses dma_alloc_coherent),
 > so, the pages should be either cacheable or non-cacheable, not both.
 > 
 >  > >  > copy_from_user() is particularly slow on uncached data. What throughput do you
 >  > >  > get if you mmap() /dev/null and write that to a file?
 >  > >
 >  > > mmap of /dev/null returns "No such device", but I do have this device:
 >  > > # ls -l /dev/null
 >  > > crw-rw-rw-    1 root     root        1,   3 Nov 14 17:34 /dev/null
 >  > >
 >  > > Per https://stackoverflow.com/a/40300651/6949852, /dev/null is not allowed to be mmap-ed.
 >  > 
 >  > Sorry, I meant /dev/zero (or any other normal memory really).
 >  
 >  > mmap-ing from /dev/zero and then doing a buffered-I/O write to the file still gets very low throughput.
 >  > Is this zero page a cacheable page?
 > 
 > From its mmap implementation:
 > 
 > static int mmap_zero(struct file *file, struct vm_area_struct *vma)
 > {
 > #ifndef CONFIG_MMU
 >     return -ENOSYS;
 > #endif
 >     if (vma->vm_flags & VM_SHARED)
 >         return shmem_zero_setup(vma);
 >     vma_set_anonymous(vma);
 >     return 0;
 > }
 > 
 > I tried both MAP_PRIVATE and MAP_SHARED, both still get slow throughput, no
 > notable change.
 > 
 >  > >  > >  > >  > I think the only way to safely do what you want to achieve in  way
 >  > >  > >  > >  > that is both safe and efficient would be to use normal page cache pages
 >  > >  > >  > >  > allocated from user space, ideally using hugepage mappings, and then
 >  > >  > >  > >  > mapping those into the device using the streaming DMA API to assign
 >  > >  > >  > >  > them to the DMA master with get_user_pages_fast()/dma_map_sg()
 >  > >  > >  > >  > and dma_sync_sg_for_{device,cpu}.
 >  > >  > >  > >
 >  > >  > >  > > Thanks for your advice, but unfortunately the DSP can only write to contiguous
 >  > >  > >  > > physical memory (it knows nothing about the MMU), and pages allocated from
 >  > >  > >  > > userspace are not contiguous in physical memory.
 >  > >  > >  >
 >  > >  > >  > Usually a DSP can run user-provided
 >  > >  > >  > software, so you may be able to pass it a scatter-gather list for the output data
 >  > >  > >  > in addition to the buffer that it uses for its code and intermediate
 >  > >  > >  > buffers. If the goal is to store this data in a file, you can even go as far
 >  > >  > >  > as calling mmap() on the file, and then letting the driver get the page
 >  > >  > >  > cache pages backing the file mapping, and then relying on the normal
 >  > >  > >  > file system writeback to store the data to disk.
 >  > >  > >
 >  > >  > > Our DSP doesn't support scatter-gather lists.
 >  > >  > > What does "intermediate buffers" mean?
 >  > >  >
 >  > >  > I meant using a statically allocated memory area at a fixed location for
 >  > >  > whatever the DSP does internally, and then copying it to the page cache
 >  > >  > page that gets written to disk using the DSP to avoid any extra copies
 >  > >  > on the CPU side.
 >  > >  >
 >  > >  > This obviously involves changes to the interface that the DSP program
 >  > >  > uses.
 >  > >
 >  > > Looks promising, but our DSP doesn't support sg lists :-(
 >  > > So it seems impossible in our case. Do you know of any open source
 >  > > DSP driver that uses such a DMA sg list and/or copies its data to "page cache pages"?
 >  > 
 >  > I don't recall any other driver that uses the page cache to write into
 >  > a file-backed mapping, but you can search the kernel sources for drivers that use
 >  > pin_user_pages() or a related function to get access to the user address space
 >  > and extract page numbers from it to pass into a hardware buffer.
 > 
 > So, IIUC, this solution consists of the following steps:
 > step 1. alloc normal memory from userspace using functions like malloc.
 >  > step 2. find the malloc-ed memory, pin it with one of the pin_user_pages* functions, then pass this virtually contiguous but not physically contiguous memory (so scatter-gather would be needed) to the DSP for writing.
 >  > step 3. unpin_user_pages*, then let the filesystem's writeback queue write the anonymous "page cache" back to files on disk.
 > 
 > Are these steps all right?
 > 
 > If they are, I have some noob questions:
 >  > for step 2, how can I find the page frames allocated by malloc? Hook brk/mmap to track them?
 >  > for step 2, if scatter-gather is not supported (we don't drive the DSP's DMA controller directly, we only send it a command, a physical address, and so on), is there any other way to do this? This is important, because otherwise
 >  >                    I still have to reserve physically contiguous memory for the DSP to write into.
 >  > for step 3, I haven't seen a trivial way to do it. These are still anonymous "page cache" pages with no file as backing store (of course swap is their backing store, but
 >  >                   we want them written back to real files, not swap), so it's a little tricky: how do I set the target files as the backing store for these anonymous "page cache" pages?
 
It seems that SVA (Shared Virtual Addressing, https://lwn.net/Articles/747230/) would be a perfect solution for doing DMA from userspace with userspace-allocated memory,
but unfortunately our SoC doesn't have an SMMU, and we cannot operate the DSP's DMA engine directly from the Linux driver.

Regards,
Li


* Re: Why dma_alloc_coherent don't return direct mapped vaddr?
  2022-07-27  3:02                         ` Li Chen
@ 2022-07-27  6:40                           ` Arnd Bergmann
  0 siblings, 0 replies; 15+ messages in thread
From: Arnd Bergmann @ 2022-07-27  6:40 UTC (permalink / raw)
  To: Li Chen; +Cc: Arnd Bergmann, linux-arm-kernel

On Wed, Jul 27, 2022 at 5:02 AM Li Chen <me@linux.beauty> wrote:
>  ---- On Tue, 26 Jul 2022 15:50:57 +0900  Li Chen <me@linux.beauty> wrote ---
>
> It seems that SVA (Shared Virtual Addressing, https://lwn.net/Articles/747230/)
> would be a perfect solution for doing DMA from userspace with userspace-allocated
> memory, but unfortunately our SoC doesn't have an SMMU, and we cannot operate
> the DSP's DMA engine directly from the Linux driver.

SVA assumes that the DMA master is cache coherent with your CPU, so even
if you could make the pagetable side work with an IOMMU here, you would
still get incorrect data.
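
Roughly speaking, without that coherence every handoff of a shared buffer needs an explicit ownership transfer through the streaming DMA API, something like the sketch below (dsp_start_write() and dsp_wait_done() are just placeholders for however your driver starts the DSP write and waits for it to finish):

	/* sketch only: buffer ownership transfer for a non-coherent DMA master */
	dma_addr_t dma = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);

	if (dma_mapping_error(dev, dma))
		return -ENOMEM;

	dsp_start_write(dma, len);	/* placeholder: the device now owns the buffer */
	dsp_wait_done();		/* placeholder: wait for the DSP to finish */

	/* hand the buffer back to the CPU; stale cache lines get invalidated here */
	dma_unmap_single(dev, dma, len, DMA_FROM_DEVICE);

That explicit map/unmap (or dma_sync_*) step is what keeps the CPU from reading stale cache lines, and it is exactly the step SVA would otherwise let you skip.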

       Arnd


Thread overview: 15+ messages
2022-07-21  3:28 Why dma_alloc_coherent don't return direct mapped vaddr? Li Chen
2022-07-21  7:06 ` Arnd Bergmann
2022-07-22  2:57   ` Li Chen
2022-07-22  6:50     ` Arnd Bergmann
2022-07-22  8:19       ` Li Chen
2022-07-22  9:06         ` Arnd Bergmann
2022-07-22 10:31           ` Li Chen
2022-07-22 11:06             ` Arnd Bergmann
2022-07-25  2:50               ` Li Chen
2022-07-25  7:03                 ` Arnd Bergmann
2022-07-25 11:06                   ` Li Chen
2022-07-25 11:45                     ` Arnd Bergmann
2022-07-26  6:50                       ` Li Chen
2022-07-27  3:02                         ` Li Chen
2022-07-27  6:40                           ` Arnd Bergmann
