* Why dma_alloc_coherent don't return direct mapped vaddr?
From: Li Chen @ 2022-07-21  3:28 UTC
To: Arnd Bergmann; +Cc: linux-arm-kernel

Hi Arnd,

dma_alloc_coherent() returns two addresses:
1. vaddr
2. dma_addr

I noticed that vaddr is not simply linear/direct mapped to dma_addr, which means I cannot use virt_to_phys()/virt_to_page() to get the paddr/page. Instead, I should use dma_addr as the paddr and phys_to_page(dma_addr) to get the struct page.

My question is: why doesn't dma_alloc_coherent() simply return phys_to_virt(dma_addr)? IOW, why is vaddr not directly mapped to dma_addr?

Regards,
Li

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
* Re: Why dma_alloc_coherent don't return direct mapped vaddr?
From: Arnd Bergmann @ 2022-07-21  7:06 UTC
To: Li Chen; +Cc: Arnd Bergmann, linux-arm-kernel

On Thu, Jul 21, 2022 at 5:28 AM Li Chen <me@linux.beauty> wrote:
> I noticed vaddr is not simply linear/direct mapped to dma_addr, which means
> I cannot use virt_to_phys/virt_to_page to get paddr/page. Instead, I should
> use dma_addr as paddr and phys_to_page(dma_addr) to get struct page.
>
> My question is why dma_alloc_coherent not simply return
> phys_to_virt(dma_addr)? IOW, why vaddr is not directly mapped to dma_addr?

dma_alloc_coherent() tries to allocate memory that is the correct type for the device you pass. If the device is not itself marked as cache coherent in the DT, then it has to use an uncached mapping.

The normal linear map of all memory into the kernel address space is cacheable, so this device can't use it, and you instead get a new mapping into kernel space at a different virtual address.

Note that you should never need a 'struct page' in this case, as the device needs a physical address and access from kernel space just needs a pointer. Calling phys_to_page() on a dma_addr_t is not portable, because a lot of devices have DMA addresses that are not the same number as the physical address as seen by the CPU, or there may be an IOMMU in between.

Arnd
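A minimal sketch of what this implies for a driver (the mydev_* names are hypothetical, error handling is trimmed): keep both values that dma_alloc_coherent() hands back, and never derive one from the other with virt_to_phys()/phys_to_virt()/phys_to_page().

```c
#include <linux/dma-mapping.h>

struct mydev_buf {
	void *vaddr;     /* CPU pointer; may be a separate uncached mapping */
	dma_addr_t dma;  /* address to program into the device */
	size_t size;
};

static int mydev_alloc_buf(struct device *dev, struct mydev_buf *buf,
			   size_t size)
{
	buf->vaddr = dma_alloc_coherent(dev, size, &buf->dma, GFP_KERNEL);
	if (!buf->vaddr)
		return -ENOMEM;
	buf->size = size;

	/*
	 * Not portable: virt_to_phys(buf->vaddr) -- vaddr may not be in
	 * the linear map; phys_to_page(buf->dma) -- dma may differ from
	 * the CPU physical address (bus offset or IOMMU).
	 */
	return 0;
}
```

The buffer is later released with dma_free_coherent(dev, buf->size, buf->vaddr, buf->dma), again passing both handles.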
* Re: Why dma_alloc_coherent don't return direct mapped vaddr?
From: Li Chen @ 2022-07-22  2:57 UTC
To: Arnd Bergmann; +Cc: linux-arm-kernel

Hi Arnd,

---- On Thu, 21 Jul 2022 15:06:57 +0800 Arnd Bergmann <arnd@arndb.de> wrote ---
> Note that you should never need a 'struct page' in this case, as the device
> needs a physical address and access from kernel space just needs a pointer.
> Calling phys_to_page() on a dma_addr_t is not portable because a lot of
> devices have DMA addresses that are not the same number as the physical
> address as seen by the CPU, or there may be an IOMMU in between.

Thanks for your answer! My device is a misc character device, just like
https://lwn.net/ml/linux-kernel/20220711122459.13773-5-me@linux.beauty/.
IIUC, its dma_addr is always the same as its physical address.
If I want to allocate from reserved memory and then mmap it to userspace with vm_insert_pages(), are cma_alloc()/dma_alloc_contiguous()/dma_alloc_from_contiguous() better choices?

Regards,
Li
* Re: Why dma_alloc_coherent don't return direct mapped vaddr?
From: Arnd Bergmann @ 2022-07-22  6:50 UTC
To: Li Chen; +Cc: Arnd Bergmann, linux-arm-kernel

On Fri, Jul 22, 2022 at 4:57 AM Li Chen <me@linux.beauty> wrote:
> Thanks for your answer! My device is a misc character device, just like
> https://lwn.net/ml/linux-kernel/20220711122459.13773-5-me@linux.beauty/
> IIUC, its dma_addr is always the same with phy addr. If I want to alloc
> from reserved memory and then mmap to userspace with vm_insert_pages, are
> cma_alloc/dma_alloc_contiguous/dma_alloc_from_contiguous better choices?

In the driver, you should only ever use dma_alloc_coherent() for getting a coherent DMA buffer; the other functions are just the implementation details behind that.

To map this buffer to user space, your mmap() function should call dma_mmap_coherent(), which in turn does the correct translation from device-specific dma_addr_t values into pages and uses the correct caching attributes.

Arnd
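A sketch of the char-device mmap() handler Arnd describes (the mydev names are hypothetical, and the buffer is assumed to come from an earlier dma_alloc_coherent() call):

```c
#include <linux/dma-mapping.h>
#include <linux/fs.h>
#include <linux/mm.h>

struct mydev {
	struct device *dev;
	void *vaddr;
	dma_addr_t dma;
	size_t size;
};

static int mydev_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct mydev *md = file->private_data;
	size_t len = vma->vm_end - vma->vm_start;

	if (len > md->size)
		return -EINVAL;

	/*
	 * Translates (vaddr, dma) back to pfns and applies caching
	 * attributes that match the kernel-side mapping, instead of an
	 * open-coded remap_pfn_range() with a guessed pfn.
	 */
	return dma_mmap_coherent(md->dev, vma, md->vaddr, md->dma, len);
}
```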
* Re: Why dma_alloc_coherent don't return direct mapped vaddr?
From: Li Chen @ 2022-07-22  8:19 UTC
To: Arnd Bergmann; +Cc: linux-arm-kernel

---- On Fri, 22 Jul 2022 14:50:17 +0800 Arnd Bergmann <arnd@arndb.de> wrote ---
> To map this buffer to user space, your mmap() function should call
> dma_mmap_coherent(), which in turn does the correct translation
> from device specific dma_addr_t values into pages and uses the
> correct caching attributes.

Yeah, dma_mmap_coherent() is best if I don't care about direct I/O. But if we need **direct I/O**, dma_mmap_coherent() cannot be used, because it uses remap_pfn_range() internally, which marks the vma VM_IO and VM_PFNMAP. So I think I still have to go back to getting struct pages from the reserved memory and use vm_insert_pages() to insert them into the vma, right?

Regards,
Li
* Re: Why dma_alloc_coherent don't return direct mapped vaddr?
From: Arnd Bergmann @ 2022-07-22  9:06 UTC
To: Li Chen; +Cc: Arnd Bergmann, linux-arm-kernel

On Fri, Jul 22, 2022 at 10:19 AM Li Chen <me@linux.beauty> wrote:
> Yeah, dma_mmap_coherent() is best if I don't care about direct IO.
> But if we need **direct I/O**, dma_mmap_coherent cannot be used because it
> uses remap_pfn_range internally, which will set vma to be VM_IO and
> VM_PFNMAP, so I think I still have to go back to get struct page from rmem
> and use vm_insert_pages to insert pages into vma, right?
I'm not entirely sure, but I suspect that direct I/O on pages that are mapped uncacheable will cause data corruption somewhere as well: the direct I/O code expects normal page cache pages, but these are clearly not.

Also, the coherent DMA API is not actually meant for transferring large amounts of data. My guess is that what you are doing here is to use the coherent API to map a large buffer uncached and then try to access the uncached data in user space, which is inherently slow. Using direct I/O appears to solve the problem by not actually using the uncached mapping when sending the data to another device, but this is not the right approach.

Do you have an IOMMU, scatter/gather support or similar to back the device?

I think the only way to do what you want to achieve in a way that is both safe and efficient would be to use normal page cache pages allocated from user space, ideally using hugepage mappings, and then map those into the device using the streaming DMA API: assign them to the DMA master with get_user_pages_fast()/dma_map_sg() and dma_sync_sg_for_{device,cpu}.

Arnd
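A sketch of the streaming-DMA flow Arnd outlines, for a device writing into pinned user pages (the mydev_* name is hypothetical and error handling is abbreviated; ownership-transfer rules are in the kernel's DMA-API documentation):

```c
#include <linux/dma-mapping.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>

static int mydev_dma_to_user(struct device *dev, unsigned long uaddr,
			     size_t len)
{
	unsigned int nr = DIV_ROUND_UP(offset_in_page(uaddr) + len, PAGE_SIZE);
	struct page **pages;
	struct sg_table sgt;
	int pinned, i, ret = -EFAULT;

	pages = kvmalloc_array(nr, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	/* pin the user pages so they cannot be migrated or swapped */
	pinned = get_user_pages_fast(uaddr, nr, FOLL_WRITE, pages);
	if (pinned != nr)
		goto out_put;

	if (sg_alloc_table_from_pages(&sgt, pages, pinned,
				      offset_in_page(uaddr), len, GFP_KERNEL))
		goto out_put;

	/* hand ownership to the device; flushes/invalidates as needed */
	if (!dma_map_sg(dev, sgt.sgl, sgt.orig_nents, DMA_FROM_DEVICE))
		goto out_free;

	/* ... program the device with the mapped entries and wait ... */

	/* hand ownership back to the CPU before reading the data */
	dma_unmap_sg(dev, sgt.sgl, sgt.orig_nents, DMA_FROM_DEVICE);
	ret = 0;
out_free:
	sg_free_table(&sgt);
out_put:
	for (i = 0; i < pinned; i++)
		put_page(pages[i]);
	kvfree(pages);
	return ret;
}
```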
* Re: Why dma_alloc_coherent don't return direct mapped vaddr?
From: Li Chen @ 2022-07-22 10:31 UTC
To: Arnd Bergmann; +Cc: linux-arm-kernel

---- On Fri, 22 Jul 2022 17:06:36 +0800 Arnd Bergmann <arnd@arndb.de> wrote ---
> I'm not entirely sure, but I suspect that direct I/O on pages that are
> mapped uncacheable will cause data corruption somewhere as well: The direct
> i/o code expects normal page cache pages, but these are clearly not.

Direct I/O just bypasses the page cache, so I think you want to say "normal pages"? At least from my hundreds of attempts on a 512M rmem, the data doesn't get corrupted; the crc32 of the resulting file is always correct after direct I/O.

> Also, the coherent DMA API is not actually meant for transferring large
> amounts of data.

Agree, that's why I also tried CMA APIs like cma_alloc()/dma_alloc_from_contiguous(), and they also worked fine.

> My guess is that what you are doing here is to use the
> coherent API to map a large buffer uncached and then try to access the
> uncached data in user space, which is inherently slow. Using direct I/o
> appears to solve the problem by not actually using the uncached mapping
> when sending the data to another device, but this is not the right approach.

My case is to do direct I/O from this reserved memory to NVMe, and the throughput is good, about 2.3GB/s, which is almost the same as fio's direct I/O sequential write result (of course, fio uses cached and non-reserved memory).

> Do you have an IOMMU, scatter/gather support or similar to back the
> device?

No. My misc char device is simply a pseudo device and has no real hardware. Our DSP will write raw data to this rmem, but that is another story; we can ignore it here.

> I think the only way to do what you want to achieve in a way
> that is both safe and efficient would be to use normal page cache pages
> allocated from user space, ideally using hugepage mappings, and then
> mapping those into the device using the streaming DMA API to assign
> them to the DMA master with get_user_pages_fast()/dma_map_sg()
> and dma_sync_sg_for_{device,cpu}.
Thanks for your advice, but unfortunately the DSP can only write to contiguous physical memory (it doesn't know about the MMU), and pages allocated from userspace are not contiguous in physical memory.

Regards,
Li
* Re: Why dma_alloc_coherent don't return direct mapped vaddr?
From: Arnd Bergmann @ 2022-07-22 11:06 UTC
To: Li Chen; +Cc: Arnd Bergmann, linux-arm-kernel

On Fri, Jul 22, 2022 at 12:31 PM Li Chen <me@linux.beauty> wrote:
> direct I/O just bypasses page cache, so I think you want to say
> "normal pages"?

All normal memory available to user space is in the page cache. What you bypass with direct I/O is just the copy into another page cache page.

> > Also, the coherent DMA API is not actually meant for transferring large
> > amounts of data.
>
> Agree, that's why I also tried cma API like
> cma_alloc/dma_alloc_from_contiguous and they also worked fine.

Those two interfaces just return a 'struct page', so if you convert them into a kernel pointer or map them into user space, you get a cacheable mapping. Is that what you do? If so, then your device appears to be cache coherent with the CPU, and you can just mark it as coherent in the devicetree.
> My case is to do direct IO from this reserved memory to NVME, and the
> throughput is good, about 2.3GB/s, which is almost the same as fio's
> direct I/O seq write result.
>
> > Do you have an IOMMU, scatter/gather support or similar to back the
> > device?
>
> No. My misc char device is simply a pseudo device and have no real
> hardware. Our dsp will writing raw data to this rmem, but that is another
> story, we can ignore it here.

It is the DSP that I'm talking about here; this is what makes all the difference. If the DSP is cache coherent and you mark it that way in DT, then everything just becomes fast, and you don't have to use direct I/O. If the DSP is not cache coherent, but you can program it to write into arbitrary page cache pages allocated from user space, then you can use the streaming mapping interface that does explicit cache management. This is of course not as fast as coherent hardware, but it also allows accessing the data through the CPU cache later.

> Thanks for your advice, but unfortunately, dsp can only write to contiguous
> physical memory (it doesn't know MMU), and pages allocated from
> userspace are not contiguous on physical memory.

Usually what you can do with a DSP is that it can run user-provided software, so you can pass it a scatter-gather list for the output data in addition to the buffer that it uses for its code and intermediate buffers.
If the goal is to store this data in a file, you can even go as far as calling mmap() on the file, then letting the driver get the page cache pages backing the file mapping, and relying on the normal file system writeback to store the data to disk.

Arnd
* Re: Why dma_alloc_coherent don't return direct mapped vaddr?
From: Li Chen @ 2022-07-25  2:50 UTC
To: Arnd Bergmann; +Cc: linux-arm-kernel

Hi Arnd,

---- On Fri, 22 Jul 2022 20:06:35 +0900 Arnd Bergmann <arnd@arndb.de> wrote ---
> All normal memory available to user space is in the page cache.

Just want to make sure: does "all normal memory available to user space" come from functions like malloc? If so, I think it is not in the page cache. malloc will invoke mmap, then:

sys_mmap()
└→ do_mmap_pgoff()
   └→ mmap_region()
      ├→ generic_file_mmap()     // file-backed path
      └→ vma_set_anonymous(vma)  // anonymous path

IIUC, an mmap coming from malloc sets the vma to be anonymous, and those pages are not "page cache pages" because they don't have files as their backing stores.

Please correct me if I am missing something.

> What you bypass with direct I/O is just the copy into another page
> cache page.
>
> Those two interfaces just return a 'struct page', so if you convert them
> into a kernel pointer or map them into user space, you get a cacheable
> mapping. Is that what you do?

Yes.
> If so, then your device appears to be cache coherent
> with the CPU, and you can just mark it as coherent in the devicetree.

Our DSP is not a cache coherent device; there is no CCI to manage cache coherence on any of our SoCs, so none of our peripherals are cache coherent devices. So it's not a good idea to use the CMA alloc APIs to allocate cached pages, right?

> It is the DSP that I'm talking about here, this is what makes all the
> difference. If the DSP is cache coherent and you mark it that way in DT,
> then everything just becomes fast, and you don't have to use direct I/O.
> If the DSP is not cache coherent, but you can program it to write into
> arbitrary page cache pages allocated from user space, then you can use
> the streaming mapping interface that does explicit cache management. This
> is of course not as fast as coherent hardware, but it also allows
> accessing the data through the CPU cache later.

I'm afraid buffered I/O also cannot meet our throughput requirement. From fio results on our NVMe, buffered I/O can only reach around 500MB/s, while direct I/O can reach 2.3GB/s.
fio tells me the CPU is nearly 100% busy when doing buffered I/O, so I used perf to monitor functions and found that copy_from_user and spinlocks hog most of the CPU; our CPU performance is the bottleneck. I think even if the cache is involved, buffered I/O throughput still cannot get much faster; we have too much raw data to write and read.

> Usually what you can do with a DSP is that it can run user-provided
> software, so you can pass it a scatter-gather list for the output data
> in addition to the buffer that it uses for its code and intermediate
> buffers. If the goal is to store this data in a file, you can even go as
> far as calling mmap() on the file, and then letting the driver get the
> page cache pages backing the file mapping, and then relying on the normal
> file system writeback to store the data to disk.

Our DSP doesn't support scatter-gather lists. What does "intermediate buffers" mean?

Regards,
Li
* Re: Why dma_alloc_coherent don't return direct mapped vaddr?
From: Arnd Bergmann @ 2022-07-25  7:03 UTC
To: Li Chen; +Cc: Arnd Bergmann, linux-arm-kernel

On Mon, Jul 25, 2022 at 4:50 AM Li Chen <me@linux.beauty> wrote:
> Just want to make sure that "all normal memory available to user space"
> comes from functions like malloc? If so, I think it is not in the page
> cache. [...]
> IIUC, mmap coming from malloc sets the vma to be anonymous, and those
> pages are not "page cache pages" because they don't have files as their
> backing stores.
>
> Please correct me if I am missing something.

I think both anonymous user space pages and file-backed pages are commonly considered 'page cache'. Anonymous memory is eventually backed by swap space, which is similar to, but not the same as, a file backing store.

When I wrote 'page cache', I meant both of these, as opposed to memory allocated by a kernel driver.
> > Also, the coherent DMA API is not actually meant for transferring large
> > amounts of data.
>
> Agree, that's why I also tried CMA APIs like
> cma_alloc/dma_alloc_from_contiguous and they also worked fine.
>
> Our DSP is not a cache coherent device, there is no CCI to manage cache
> coherence on our SoCs, so none of our peripherals are cache coherent
> devices. So it's not a good idea to use the CMA alloc APIs to allocate
> cached pages, right?

Using CMA or not is not the problem here; what you have to do for correctness is to use the same mapping type in every place that maps the pages into a page table. The two options you have are:

- Using uncached mappings from dma_alloc_coherent() in combination
  with dma_mmap_coherent(). You cannot use direct I/O on these, and
  any access through a pointer is slow.

- Using cached mappings from anywhere, and then flushing the caches
  during ownership transfers with dma_map_sg()/dma_unmap_sg()/
  dma_sync_sg_for_cpu()/dma_sync_sg_for_device().

> My case is to do direct IO from this reserved memory to NVME, and the
> throughput is good, about 2.3GB/s, which is almost the same as fio's
> direct I/O seq write result (of course, fio uses cached and non-reserved
> memory).
> I'm afraid buffered I/O also cannot meet our throughput requirement. From
> fio results on our NVMe, buffered I/O can only reach around 500MB/s, while
> direct I/O can reach 2.3GB/s. fio tells me the CPU is nearly 100% busy
> when doing buffered I/O, so I used perf to monitor functions and found
> copy_from_user and spinlocks hog most of the CPU; our CPU performance is
> the bottleneck.

copy_from_user() is particularly slow on uncached data. What throughput do you get if you mmap() /dev/null and write that to a file?
> > Usually what you can do with a DSP is that it can run user-provided
> > software, so you can pass it a scatter-gather list for the output data
> > in addition to the buffer that it uses for its code and intermediate
> > buffers.
>
> Our DSP doesn't support scatter-gather lists.
> What does "intermediate buffers" mean?

I meant using a statically allocated memory area at a fixed location for whatever the DSP does internally, and then copying it to the page cache page that gets written to disk using the DSP, to avoid any extra copies on the CPU side. This obviously involves changes to the interface that the DSP program uses.

Arnd
* Re: Why dma_alloc_coherent don't return direct mapped vaddr? 2022-07-25 7:03 ` Arnd Bergmann @ 2022-07-25 11:06 ` Li Chen 2022-07-25 11:45 ` Arnd Bergmann 0 siblings, 1 reply; 15+ messages in thread From: Li Chen @ 2022-07-25 11:06 UTC (permalink / raw) To: Arnd Bergmann; +Cc: linux-arm-kernel ---- On Mon, 25 Jul 2022 16:03:30 +0900 Arnd Bergmann <arnd@arndb.de> wrote --- > On Mon, Jul 25, 2022 at 4:50 AM Li Chen <me@linux.beauty> wrote: > > ---- On Fri, 22 Jul 2022 20:06:35 +0900 Arnd Bergmann <arnd@arndb.de> wrote --- > > > On Fri, Jul 22, 2022 at 12:31 PM Li Chen <me@linux.beauty> wrote: > > > > ---- On Fri, 22 Jul 2022 17:06:36 +0800 Arnd Bergmann <arnd@arndb.de> wrote --- > > > > > > > > > > I'm not entirely sure, but I suspect that direct I/O on pages that are mapped > > > > > uncacheable will cause data corruption somewhere as well: The direct i/o > > > > > code expects normal page cache pages, but these are clearly not. > > > > > > > > direct I/O just bypasses page cache, so I think you want to say "normal pages"? > > > > > > All normal memory available to user space is in the page cache. > > > > Just want to make sure that "all normal memory available to user space" come from functions > > like malloc? If so, I think they are not in the page cache. malloc will invoke mmap, then: > > sys_mmap() > > └→ do_mmap_pgoff() > > └→ mmap_region() > > └→ generic_file_mmap() // file mapping, then > > └→ vma_set_anonymous(vma); // anon vma path > > > > IIUC, mmap coming from malloc set vma to be anonymous, and are not "page cache pages" because they > > don't have files as the backing stores. > > > > Please correct me if something I am missing. > > I think both anonymous user space pages and file backed pages are commonly > considered 'page cache'. Anonymous memory is eventually backed by swap space, > which is similar to but not the same here. > > When I wrote 'page cache', I meant both of these, as opposed to memory allocated > by a kernel driver. Gotcha. 
> > > What you bypass with direct I/O is just the copy into another page cache page. > > > > > > > Also, the coherent DMA API is not actually meant for transferring large > > > > > amounts of data. > > > > > > > > Agree, that's why I also tried cma API like cma_alloc/dma_alloc_from_contiguous > > > > and they also worked fine. > > > > > > Those two interfaces just return a 'struct page', so if you convert them into > > > a kernel pointer or map them into user space, you get a cacheable mapping. > > > Is that what you do? > > > > Yes. > > > > > If so, then your device appears to be cache coherent > > > with the CPU, and you can just mark it as coherent in the devicetree. > > > > Our DSP is not a cache coherent device, there is no CCI to manage cache coherence on all our SoCs, so > > all of our peripherals are not cache coherent devices, so it's not a good idea to use cma alloc api to allocate > > cached pages, right? > > Using CMA or not is not the problem here, what you have to do for correctness > is to use the same mapping type in every place that maps the pages into > a page table. The two options you have are: > > - Using uncached mappings from dma_alloc_coherent() in combination > with dma_mmap_coherent(). You cannot use direct I/O on these, and > any access through a pointer is slow. Yes, very slow, around 300-500MB/s. > - Using cached mappings from anywhere, and then flushing the caches > during ownership transfers with dma_map_sg()/dma_unmap_sg()/ > dma_sync_sg_for_cpu()/dma_sync_sg_for_device(). We just set up the physical address and other configuration, then send a write command to the dsp, instead of using the kernel DMA engine API. So our Linux dsp driver doesn't know whether the dsp uses a DMA controller or something else to transfer data. The userspace app will query before writing to the file, and we invalidate the cache on "query" if the memory region is cacheable memory. 
To learn about how cache can affect buffered I/O throughput, I tried to allocate cached mappings via the cma API dma_alloc_contiguous, then mmap this region to userspace via vm_insert_pages, so they are still cacheable page frames. But the throughput is still as low as dma_alloc_coherent non-cached memory, both around 300-500MB/s, much slower than direct I/O throughput. It seems weird? > > > > > My guess is that what you are doing here is to use the > > > > > coherent API to map a large buffer uncached and then try to access the > > > > > uncached data in user space, which is inherently slow. Using direct I/O > > > > > appears to solve the problem by not actually using the uncached mapping > > > > > when sending the data to another device, but this is not the right approach. > > > > > > > > My case is to do direct IO from this reserved memory to NVME, and the throughput is good, about > > > > 2.3GB/s, which is almost the same as fio's direct I/O seq write result (Of course, fio uses cached and non-reserved memory). > > > > > > > > > Do you have an IOMMU, scatter/gather support or similar to back the > > > > > device? > > > > > > > > No. My misc char device is simply a pseudo device and has no real hardware. > > > > Our dsp will write raw data to this rmem, but that is another story, we can ignore it here. > > > > > > It is the DSP that I'm talking about here, this is what makes all the > > > difference. > > > If the DSP is cache coherent and you mark it that way in DT, then everything > > > just becomes fast, and you don't have to use direct I/O. If the DSP is not > > > cache coherent, but you can program it to write into arbitrary memory page > > > cache pages allocated from user space, then you can use the streaming > > > mapping interface that does explicit cache management. This is of course > > > not as fast as coherent hardware, but it also allows accessing the data > > > through the CPU cache later. > > > > I'm afraid buffered IO also cannot meet our throughput needs. 
From FIO results on our NVME, buffered I/O can only > > reach around 500MB/s, while direct I/O can reach 2.3GB/s. FIO tells me CPU is nearly 100% when doing buffered I/O, so > > I use perf to monitor functions and find copy_from_user and spinlock hog most of the CPU; our CPU performance is the bottleneck. > > I think even if the cache is involved, buffered I/O throughput also cannot get much faster, we have too much > > raw data to write and read. > > copy_from_user() is particularly slow on uncached data. What throughput do you > get if you mmap() /dev/null and write that to a file? An mmap of /dev/null returns "No such device", but I do have this device: # ls -l /dev/null crw-rw-rw- 1 root root 1, 3 Nov 14 17:34 /dev/null Per https://stackoverflow.com/a/40300651/6949852, /dev/null is not allowed to be mmap-ed. > > > > > I think the only way to safely do what you want to achieve in a way > > > > > that is both safe and efficient would be to use normal page cache pages > > > > > allocated from user space, ideally using hugepage mappings, and then > > > > > mapping those into the device using the streaming DMA API to assign > > > > > them to the DMA master with get_user_pages_fast()/dma_map_sg() > > > > > and dma_sync_sg_for_{device,cpu}. > > > > > > > > Thanks for your advice, but unfortunately, dsp can only write to contiguous > > > > physical memory(it doesn't know MMU), and pages allocated from > > > > userspace are not contiguous on physical memory. > > > > > > Usually what you can do with a DSP is that it can run user-provided > > > software, so if you can pass it a scatter-gather list for the output data > > > in addition to the buffer that it uses for its code and intermediate > > > buffers. If the goal is to store this data in a file, you can even go as far > > > as calling mmap() on the file, and then letting the driver get the page > > > cache pages backing the file mapping, and then relying on the normal > > > file system writeback to store the data to disk. 
> > > > Our DSP doesn't support scatter-gather lists. > > What does "intermediate buffers" mean? > > I meant using a statically allocated memory area at a fixed location for > whatever the DSP does internally, and then copying it to the page cache > page that gets written to disk using the DSP to avoid any extra copies > on the CPU side. > > This obviously involves changes to the interface that the DSP program > uses. Looks promising, but our DSP doesn't support sg lists :-( So it seems impossible in our case. Do you know of any open source DSP driver that uses such a DMA sg-list and/or copies its data to page cache pages? Thanks. Regards, Li _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 15+ messages in thread
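For reference, the first of the two options Arnd lists (uncached dma_alloc_coherent() memory exported with dma_mmap_coherent()) would look roughly like this in a char driver's mmap handler. struct dsp_buf and all the names here are invented for illustration; this is a sketch, not code from any real driver:

```c
/* Hypothetical sketch: export a coherent DMA buffer to user space
 * with the matching mapping type. Structure and names are made up. */
struct dsp_buf {
	struct device *dev;
	void *vaddr;       /* from dma_alloc_coherent() */
	dma_addr_t dma;
	size_t size;
};

static int dsp_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct dsp_buf *buf = file->private_data;
	size_t want = vma->vm_end - vma->vm_start;

	if (want > buf->size)
		return -EINVAL;

	/*
	 * dma_mmap_coherent() applies the same attributes as the original
	 * allocation, so user space gets the same (possibly uncached) view
	 * as the kernel, avoiding mismatched mapping types.
	 */
	return dma_mmap_coherent(buf->dev, vma, buf->vaddr, buf->dma,
				 buf->size);
}
```

The point of the helper is exactly the correctness rule Arnd states: every page table that maps these pages uses one mapping type.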
* Re: Why dma_alloc_coherent don't return direct mapped vaddr? 2022-07-25 11:06 ` Li Chen @ 2022-07-25 11:45 ` Arnd Bergmann 2022-07-26 6:50 ` Li Chen 0 siblings, 1 reply; 15+ messages in thread From: Arnd Bergmann @ 2022-07-25 11:45 UTC (permalink / raw) To: Li Chen; +Cc: Arnd Bergmann, linux-arm-kernel On Mon, Jul 25, 2022 at 1:06 PM Li Chen <me@linux.beauty> wrote: > ---- On Mon, 25 Jul 2022 16:03:30 +0900 Arnd Bergmann <arnd@arndb.de> wrote --- > > - Using cached mappings from anywhere, and then flushing the caches > > during ownership transfers with dma_map_sg()/dma_unmap_sg()/ > > dma_sync_sg_for_cpu()/dma_sync_sg_for_device(). > > We just set up phy addr and other configures then send write command to dsp instead of > using kernel dma engine api. > So, our Linux dsp driver doesn't know if dsp uses a dma controller or anything else to transfer > data. Userspace app will query before writing to file, and we will invalidate cache when "query" > if the memory region is cache-able memory. > > To learn about how cache can affect buffered I/O throughput, I tried to alloc cached mappings via cma API dma_alloc_contiguous, > then mmap this region to userspace via vm_insert_pages, so they are still cache-able page frames. > But the throughput is still as low as dma_alloc_coherent non-cache memory, both around 300-500MB/s, much > slower than direct I/O throughput. > it seems weird? I'm not sure what is actually meant to happen when you have both cacheable and uncached mappings for the same data, this is not all that well defined and it may be that you end up just getting uncached data in the end. You clearly either get a cache miss here, or stale data, and either way is not good. > > copy_from_user() is particularly slow on uncached data. What throughput do you > get if you mmap() /dev/null and write that to a file? 
> > mmap /dev/null return No such device, but I do have this device: > # ls -l /dev/null > crw-rw-rw- 1 root root 1, 3 Nov 14 17:34 /dev/null > > Per https://stackoverflow.com/a/40300651/6949852, /dev/null is not allowed to be mmap-ed. Sorry, I meant /dev/zero (or any other normal memory really). > > > > > > I think the only way to safely do what you want to achieve in way > > > > > > that is both safe and efficient would be to use normal page cache pages > > > > > > allocated from user space, ideally using hugepage mappings, and then > > > > > > mapping those into the device using the streaming DMA API to assign > > > > > > them to the DMA master with get_user_pages_fast()/dma_map_sg() > > > > > > and dma_sync_sg_for_{device,cpu}. > > > > > > > > > > Thanks for your advice, but unfortunately, dsp can only write to contiguous > > > > > physical memory(it doesn't know MMU), and pages allocated from > > > > > userspace are not contiguous on physical memory. > > > > > > > > Usually what you can do with a DSP is that it can run user-provided > > > > software, so if you can pass it a scatter-gather list for the output data > > > > in addition to the buffer that it uses for its code and intermediate > > > > buffers. If the goal is to store this data in a file, you can even go as far > > > > as calling mmap() on the file, and then letting the driver get the page > > > > cache pages backing the file mapping, and then relying on the normal > > > > file system writeback to store the data to disk. > > > > > > Our DSP doesn't support scatter-gather lists. > > > What does "intermediate buffers" mean? > > > > I meant using a statically allocated memory area at a fixed location for > > whatever the DSP does internally, and then copying it to the page cache > > page that gets written to disk using the DSP to avoid any extra copies > > on the CPU side. > > > > This obviously involves changes to the interface that the DSP program > > uses. 
> > Looks promising, but our DSP doesn't support sg list:-( > So it seems it is impossible for our case. Do you know is there any open source > DSP driver that using such dma sg-list and/or copy its data to "page cache page"? I don't recall any other driver that uses the page cache to write into a file-backed mapping, but you can search the kernel sources for drivers that use pin_user_pages() or a related function to get access to the user address space and extract a page number of that to pass into a hardware buffer. Arnd _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Why dma_alloc_coherent don't return direct mapped vaddr? 2022-07-25 11:45 ` Arnd Bergmann @ 2022-07-26 6:50 ` Li Chen 2022-07-27 3:02 ` Li Chen 0 siblings, 1 reply; 15+ messages in thread From: Li Chen @ 2022-07-26 6:50 UTC (permalink / raw) To: Arnd Bergmann; +Cc: linux-arm-kernel ---- On Mon, 25 Jul 2022 20:45:10 +0900 Arnd Bergmann <arnd@arndb.de> wrote --- > On Mon, Jul 25, 2022 at 1:06 PM Li Chen <me@linux.beauty> wrote: > > ---- On Mon, 25 Jul 2022 16:03:30 +0900 Arnd Bergmann <arnd@arndb.de> wrote --- > > > - Using cached mappings from anywhere, and then flushing the caches > > > during ownership transfers with dma_map_sg()/dma_unmap_sg()/ > > > dma_sync_sg_for_cpu()/dma_sync_sg_for_device(). > > > > We just set up phy addr and other configures then send write command to dsp instead of > > using kernel dma engine api. > > So, our Linux dsp driver doesn't know if dsp uses a dma controller or anything else to transfer > > data. Userspace app will query before writing to file, and we will invalidate cache when "query" > > if the memory region is cache-able memory. > > > > To learn about how cache can affect buffered I/O throughput, I tried to alloc cached mappings via cma API dma_alloc_contiguous, > > then mmap this region to userspace via vm_insert_pages, so they are still cache-able page frames. > > But the throughput is still as low as dma_alloc_coherent non-cache memory, both around 300-500MB/s, much > > slower than direct I/O throughput. > > it seems weird? > > I'm not sure what is actually meant to happen when you have both cacheable > and uncached mappings for the same data, this it not all that well defined and > it may be that you end up just getting uncached data in the end. You > clearly either > get a cache miss here, or stale data, and either way is not good. It's not the same data with both cacheable and uncached mappings. 
I replaced the kernel image for the two tests (one kernel driver uses dma_alloc_contiguous and the other one uses dma_alloc_coherent), so, the pages should be either cacheable or non-cacheable, not both. > > > copy_from_user() is particularly slow on uncached data. What throughput do you > > > get if you mmap() /dev/null and write that to a file? > > > > mmap /dev/null return No such device, but I do have this device: > > # ls -l /dev/null > > crw-rw-rw- 1 root root 1, 3 Nov 14 17:34 /dev/null > > > > Per https://stackoverflow.com/a/40300651/6949852, /dev/null is not allowed to be mmap-ed. > > Sorry, I meant /dev/zero (or any other normal memory really). mmap from /dev/zero, and then buffered-I/O write to file still gets very slow throughput. Is this zero-page cache-able page? From its mmap implementation: static int mmap_zero(struct file *file, struct vm_area_struct *vma) { #ifndef CONFIG_MMU return -ENOSYS; #endif if (vma->vm_flags & VM_SHARED) return shmem_zero_setup(vma); vma_set_anonymous(vma); return 0; } I tried both MAP_PRIVATE and MAP_SHARED, both still get slow throughput, no notable change. > > > > > > > I think the only way to safely do what you want to achieve in way > > > > > > > that is both safe and efficient would be to use normal page cache pages > > > > > > > allocated from user space, ideally using hugepage mappings, and then > > > > > > > mapping those into the device using the streaming DMA API to assign > > > > > > > them to the DMA master with get_user_pages_fast()/dma_map_sg() > > > > > > > and dma_sync_sg_for_{device,cpu}. > > > > > > > > > > > > Thanks for your advice, but unfortunately, dsp can only write to contiguous > > > > > > physical memory(it doesn't know MMU), and pages allocated from > > > > > > userspace are not contiguous on physical memory. 
> > > > > > > > > > Usually what you can do with a DSP is that it can run user-provided > > > > > software, so if you can pass it a scatter-gather list for the output data > > > > > in addition to the buffer that it uses for its code and intermediate > > > > > buffers. If the goal is to store this data in a file, you can even go as far > > > > > as calling mmap() on the file, and then letting the driver get the page > > > > > cache pages backing the file mapping, and then relying on the normal > > > > > file system writeback to store the data to disk. > > > > > > > > Our DSP doesn't support scatter-gather lists. > > > > What does "intermediate buffers" mean? > > > > > > I meant using a statically allocated memory area at a fixed location for > > > whatever the DSP does internally, and then copying it to the page cache > > > page that gets written to disk using the DSP to avoid any extra copies > > > on the CPU side. > > > > > > This obviously involves changes to the interface that the DSP program > > > uses. > > > > Looks promising, but our DSP doesn't support sg list:-( > > So it seems it is impossible for our case. Do you know is there any open source > > DSP driver that using such dma sg-list and/or copy its data to "page cache page"? > > I don't recall any other driver that uses the page cache to write into > a file-backed > mapping, but you can search the kernel sources for drivers that use > pin_user_pages() or a related function to get access to the user address space > and extract a page number of that to pass into a hardware buffer. So, IIUC, this solution consists of the following steps: step 1. alloc normal memory from userspace using functions like malloc. step 2. find malloc-ed memory, then use pin_user_pages* function to pin this memory then pass this virtual contiguous but non-phy contiguous(so sg is needed) memory to DSP for writing. step 3. unpin_user_pages* then let filesystem's writeback queue to write anon "page cache" back to files on disk. 
Are these steps all right? If they are, I have some noob questions: for step 2, how can I find page frames allocated by malloc, hack brk/mmap to track? for step 2, if sg is not supported(we don't operate on DSP's dma controller directly, but send command and phy addr and etc to it), is there any other way to do it? This is important, otherwise I still have to reserve phy contiguous addr for DSP writing. for step 3, I haven't seen trival way to do it. Anyway, they are still anon "page cache" and have no files as backing store(Of course, swap is its backing store, but we hope it write back to real files instead of swap), so it's a little tricky: how to set target files as these anon “page cache"'s backing store? Regards, Li _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Why dma_alloc_coherent don't return direct mapped vaddr? 2022-07-26 6:50 ` Li Chen @ 2022-07-27 3:02 ` Li Chen 2022-07-27 6:40 ` Arnd Bergmann 0 siblings, 1 reply; 15+ messages in thread From: Li Chen @ 2022-07-27 3:02 UTC (permalink / raw) To: Arnd Bergmann; +Cc: linux-arm-kernel ---- On Tue, 26 Jul 2022 15:50:57 +0900 Li Chen <me@linux.beauty> wrote --- > > ---- On Mon, 25 Jul 2022 20:45:10 +0900 Arnd Bergmann <arnd@arndb.de> wrote --- > > On Mon, Jul 25, 2022 at 1:06 PM Li Chen <me@linux.beauty> wrote: > > > ---- On Mon, 25 Jul 2022 16:03:30 +0900 Arnd Bergmann <arnd@arndb.de> wrote --- > > > > - Using cached mappings from anywhere, and then flushing the caches > > > > during ownership transfers with dma_map_sg()/dma_unmap_sg()/ > > > > dma_sync_sg_for_cpu()/dma_sync_sg_for_device(). > > > > > > We just set up phy addr and other configures then send write command to dsp instead of > > > using kernel dma engine api. > > > So, our Linux dsp driver doesn't know if dsp uses a dma controller or anything else to transfer > > > data. Userspace app will query before writing to file, and we will invalidate cache when "query" > > > if the memory region is cache-able memory. > > > > > > To learn about how cache can affect buffered I/O throughput, I tried to alloc cached mappings via cma API dma_alloc_contiguous, > > > then mmap this region to userspace via vm_insert_pages, so they are still cache-able page frames. > > > But the throughput is still as low as dma_alloc_coherent non-cache memory, both around 300-500MB/s, much > > > slower than direct I/O throughput. > > > it seems weird? > > > > I'm not sure what is actually meant to happen when you have both cacheable > > and uncached mappings for the same data, this it not all that well defined and > > it may be that you end up just getting uncached data in the end. You > > clearly either > > get a cache miss here, or stale data, and either way is not good. 
> > It's not the same data with both cacheable and uncached mappings. > I replaced the kernel image for the two tests (one kernel driver uses dma_alloc_contiguous and the other one uses dma_alloc_coherent), > so, the pages should be either cacheable or non-cacheable, not both. > > > > > copy_from_user() is particularly slow on uncached data. What throughput do you > > > > get if you mmap() /dev/null and write that to a file? > > > > > > mmap /dev/null return No such device, but I do have this device: > > > # ls -l /dev/null > > > crw-rw-rw- 1 root root 1, 3 Nov 14 17:34 /dev/null > > > > > > Per https://stackoverflow.com/a/40300651/6949852, /dev/null is not allowed to be mmap-ed. > > > > Sorry, I meant /dev/zero (or any other normal memory really). > > mmap from /dev/zero, and then buffered-I/O write to file still gets very slow throughput. > Is this zero-page cache-able page? > > From its mmap implementation: > > static int mmap_zero(struct file *file, struct vm_area_struct *vma) > { > #ifndef CONFIG_MMU > return -ENOSYS; > #endif > if (vma->vm_flags & VM_SHARED) > return shmem_zero_setup(vma); > vma_set_anonymous(vma); > return 0; > } > > I tried both MAP_PRIVATE and MAP_SHARED, both still get slow throughput, no > notable change. > > > > > > > > > I think the only way to safely do what you want to achieve in way > > > > > > > > that is both safe and efficient would be to use normal page cache pages > > > > > > > > allocated from user space, ideally using hugepage mappings, and then > > > > > > > > mapping those into the device using the streaming DMA API to assign > > > > > > > > them to the DMA master with get_user_pages_fast()/dma_map_sg() > > > > > > > > and dma_sync_sg_for_{device,cpu}. > > > > > > > > > > > > > > Thanks for your advice, but unfortunately, dsp can only write to contiguous > > > > > > > physical memory(it doesn't know MMU), and pages allocated from > > > > > > > userspace are not contiguous on physical memory. 
> > > > > > > > > > > > Usually what you can do with a DSP is that it can run user-provided > > > > > > software, so if you can pass it a scatter-gather list for the output data > > > > > > in addition to the buffer that it uses for its code and intermediate > > > > > > buffers. If the goal is to store this data in a file, you can even go as far > > > > > > as calling mmap() on the file, and then letting the driver get the page > > > > > > cache pages backing the file mapping, and then relying on the normal > > > > > > file system writeback to store the data to disk. > > > > > > > > > > Our DSP doesn't support scatter-gather lists. > > > > > What does "intermediate buffers" mean? > > > > > > > > I meant using a statically allocated memory area at a fixed location for > > > > whatever the DSP does internally, and then copying it to the page cache > > > > page that gets written to disk using the DSP to avoid any extra copies > > > > on the CPU side. > > > > > > > > This obviously involves changes to the interface that the DSP program > > > > uses. > > > > > > Looks promising, but our DSP doesn't support sg list:-( > > > So it seems it is impossible for our case. Do you know is there any open source > > > DSP driver that using such dma sg-list and/or copy its data to "page cache page"? > > > > I don't recall any other driver that uses the page cache to write into > > a file-backed > > mapping, but you can search the kernel sources for drivers that use > > pin_user_pages() or a related function to get access to the user address space > > and extract a page number of that to pass into a hardware buffer. > > So, IIUC, this solution consists of the following steps: > step 1. alloc normal memory from userspace using functions like malloc. > step 2. find malloc-ed memory, then use pin_user_pages* function to pin this memory then pass this virtual contiguous but non-phy contiguous(so sg is needed) memory to DSP for writing. > step 3. 
unpin_user_pages* then let filesystem's writeback queue to write anon "page cache" back to files on disk. > > Are these steps all right? > > If they are, I have some noob questions: > > for step 2, how can I find page frames allocated by malloc, hack brk/mmap to track? > > for step 2, if sg is not supported(we don't operate on DSP's dma controller directly, but send command and phy addr and etc to it), is there any other way to do it? This is important, otherwise > > I still have to reserve phy contiguous addr for DSP writing. > > for step 3, I haven't seen trival way to do it. Anyway, they are still anon "page cache" and have no files as backing store(Of course, swap is its backing store, but > > we hope it write back to real files instead of swap), so it's a little tricky: how to set target files as these anon "page cache"'s backing store? It seems that SVA (Shared Virtual Addressing, https://lwn.net/Articles/747230/) is a perfect solution for doing DMA from userspace with userspace-allocated memory, but unfortunately our SoC doesn't have an SMMU, and we cannot operate the DSP's DMA engine directly from the Linux driver. Regards, Li _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Why dma_alloc_coherent don't return direct mapped vaddr? 2022-07-27 3:02 ` Li Chen @ 2022-07-27 6:40 ` Arnd Bergmann 0 siblings, 0 replies; 15+ messages in thread From: Arnd Bergmann @ 2022-07-27 6:40 UTC (permalink / raw) To: Li Chen; +Cc: Arnd Bergmann, linux-arm-kernel On Wed, Jul 27, 2022 at 5:02 AM Li Chen <me@linux.beauty> wrote: > ---- On Tue, 26 Jul 2022 15:50:57 +0900 Li Chen <me@linux.beauty> wrote --- > > It seems that SVA(Shared Virtual Addressing, https://lwn.net/Articles/747230/) > is a perfect solution to do DMA operation from userspace and using userspace > allocated memory, > but unfortunately, our SoC don't have SMMU and cannot operate on DSP's > DMA engine directly from Linux driver SVA assumes that the DMA master is cache coherent with your CPU, so even if you could make the pagetable side work with an IOMMU here, you still get incorrect data. Arnd _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2022-07-27 6:42 UTC | newest] Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-07-21 3:28 Why dma_alloc_coherent don't return direct mapped vaddr? Li Chen 2022-07-21 7:06 ` Arnd Bergmann 2022-07-22 2:57 ` Li Chen 2022-07-22 6:50 ` Arnd Bergmann 2022-07-22 8:19 ` Li Chen 2022-07-22 9:06 ` Arnd Bergmann 2022-07-22 10:31 ` Li Chen 2022-07-22 11:06 ` Arnd Bergmann 2022-07-25 2:50 ` Li Chen 2022-07-25 7:03 ` Arnd Bergmann 2022-07-25 11:06 ` Li Chen 2022-07-25 11:45 ` Arnd Bergmann 2022-07-26 6:50 ` Li Chen 2022-07-27 3:02 ` Li Chen 2022-07-27 6:40 ` Arnd Bergmann