linux-kernel.vger.kernel.org archive mirror
* Correct use of DMA api (Some newbie questions)
@ 2019-07-14 17:06 Nikolai Zhubr
  2019-07-17  6:28 ` Randy Dunlap
From: Nikolai Zhubr @ 2019-07-14 17:06 UTC
  To: linux-kernel

Hi all,

After reading some (apparently contradictory) revisions of the DMA API 
references in Documentation/DMA-*.txt, some (contradictory) discussions 
thereof, and even digging through the in-tree drivers in search of a 
good enlightening example, I still have to ask for advice.

I'm crafting a tiny driver (or rather, a kernel-mode helper) for a very 
special PCIe device. It actually works already, but performs 
differently on different kernels. I'm targeting x86 (i686) only 
(although preferably the driver should stay platform-neutral) and I 
need to support kernels 4.9+. Due to how the device is designed and 
used, very little has to be done in kernel space. The device has a 
large internal memory, which accumulates measurement data, and it is 
capable of transferring it to the host using DMA (with at least a 
32-bit address space available). Arranging memory for DMA is pretty 
much the only thing that userspace cannot reasonably do, so this needs 
to be in the driver. So my currently attempted layout is as follows:

1. In the (kernel-mode) driver, allocate a large contiguous block of 
physical memory to do DMA into. It will later be reused several times. 
This block does not need to have a kernel-mode virtual address because 
it will never be accessed from the driver directly. The block size is 
typically 128M and I use CMA=256M. Currently I use dma_alloc_coherent(), 
but I'm not convinced it really needs to be strictly coherent memory, 
for performance reasons, see below. Also, AFAICS on x86 
dma_alloc_coherent() always creates a kernel address mapping anyway, so 
maybe I'd better simply kmalloc() with a subsequent dma_map_single()? 
(See sketch 1 after this list.)

2. Upon DMA completion (from device to host), some sort of 
barrier/synchronization might be necessary (to be safe WRT speculative 
loads, cache, etc.), like dma_cache_sync() or dma_sync_single_for_cpu(). 
However, the latter looks like a no-op on x86 AFAICS, and the former is 
apparently flush_write_buffers(), which is not very involved either (an 
asm lock; nop) and does not look useful for my case. Currently I do not 
use either, and it seems to be OK, maybe by pure luck. So, is it really 
that trivially simple on x86, or am I just missing something horribly 
big here? (See sketch 2 after this list.)

3. mmap this buffer for userspace. Reading from it should be as fast as 
possible, therefore this block AFAICS should be cacheable (and 
prefetchable and whatever else helps performance), at least from the 
userspace context. It is not quite clear whether such properties depend 
on the block allocation method (step 1 above) or only on the remapping 
attributes. Currently, for mmap I employ dma_mmap_coherent(), but it 
also seems possible to use remap_pfn_range() and to change vm_page_prot 
somewhat. I've already found that e.g. pgprot_noncached hurts 
performance quite a lot, but supposedly without it some DMA barrier 
(step 2 above) still seems necessary? (See sketch 3 after this list.)
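
For reference, here is roughly what step 1 looks like in my driver 
right now. This is a heavily simplified sketch: error handling is 
trimmed and the struct/function names are made up for this mail.

#include <linux/pci.h>
#include <linux/dma-mapping.h>

#define MYDEV_BUF_SIZE (128UL << 20)	/* 128M measurement buffer */

struct mydev {
	struct pci_dev *pdev;
	void *buf_cpu;		/* kernel mapping, returned whether I need it or not */
	dma_addr_t buf_dma;	/* bus address programmed into the device */
};

/* Sketch 1: one large buffer allocated at probe time and reused */
static int mydev_alloc_buffer(struct mydev *md)
{
	int ret;

	/* the device can address at least 32 bits */
	ret = dma_set_mask_and_coherent(&md->pdev->dev, DMA_BIT_MASK(32));
	if (ret)
		return ret;

	/* with a buffer this large, this should come from the CMA area */
	md->buf_cpu = dma_alloc_coherent(&md->pdev->dev, MYDEV_BUF_SIZE,
					 &md->buf_dma, GFP_KERNEL);
	if (!md->buf_cpu)
		return -ENOMEM;

	return 0;
}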
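
And this is my understanding of where the sync calls would sit if I 
switched from coherent memory to a streaming mapping, i.e. if buf_dma 
came from dma_map_single() on a separately allocated buffer instead of 
dma_alloc_coherent(). Again, the helper names are made up and the 
device-programming details are omitted.

/* Sketch 2: CPU/device ownership handover for a streaming mapping */
static void mydev_start_transfer(struct mydev *md)
{
	/* hand the buffer to the device before programming the DMA engine */
	dma_sync_single_for_device(&md->pdev->dev, md->buf_dma,
				   MYDEV_BUF_SIZE, DMA_FROM_DEVICE);

	/* ... write buf_dma and the length into the device's DMA registers ... */
}

static void mydev_dma_complete(struct mydev *md)
{
	/* give the buffer back to the CPU before anything reads the data */
	dma_sync_single_for_cpu(&md->pdev->dev, md->buf_dma,
				MYDEV_BUF_SIZE, DMA_FROM_DEVICE);

	/* from here on the data may be handed to userspace */
}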
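
Finally, the mmap handler as I currently have it, using 
dma_mmap_coherent() on the buffer from sketch 1 (simplified; it assumes 
open() stored my device struct in file->private_data).

/* Sketch 3: map the DMA buffer into userspace, keeping it cacheable */
static int mydev_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct mydev *md = file->private_data;
	unsigned long len = vma->vm_end - vma->vm_start;

	if (len > MYDEV_BUF_SIZE)
		return -EINVAL;

	/* no pgprot_noncached() here, so userspace reads stay cacheable */
	return dma_mmap_coherent(&md->pdev->dev, vma, md->buf_cpu,
				 md->buf_dma, len);
}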

Any hints greatly appreciated,

Regards,
Nikolai


* Re: Correct use of DMA api (Some newbie questions)
  2019-07-14 17:06 Correct use of DMA api (Some newbie questions) Nikolai Zhubr
@ 2019-07-17  6:28 ` Randy Dunlap
From: Randy Dunlap @ 2019-07-17  6:28 UTC
  To: Nikolai Zhubr, linux-kernel

On 7/14/19 10:06 AM, Nikolai Zhubr wrote:
> Hi all,
> 
> After reading some (apparently contradictory) revisions of the DMA API
> references in Documentation/DMA-*.txt, some (contradictory) discussions
> thereof, and even digging through the in-tree drivers in search of a
> good enlightening example, I still have to ask for advice.
[...]
> Any hints greatly appreciated,
> 
> Regards,
> Nikolai

Hi,

I suggest that you try some mailing list(s) besides linux-kernel.
The MAINTAINERS file has these possibilities:

dmaengine@vger.kernel.org
iommu@lists.linux-foundation.org

or just try linux-mm@vger.kernel.org

-- 
~Randy

