linux-kernel.vger.kernel.org archive mirror
* Correct use of DMA api (Some newbie questions)
@ 2019-07-14 17:06 Nikolai Zhubr
  2019-07-17  6:28 ` Randy Dunlap
From: Nikolai Zhubr @ 2019-07-14 17:06 UTC
  To: linux-kernel

Hi all,

After reading some (apparently contradictory) revisions of the DMA API 
references in Documentation/DMA-*.txt, some (contradictory) discussions 
thereof, and even digging through the in-tree drivers in search of a 
good enlightening example, I still have to ask for advice.

I'm crafting a tiny driver (or rather, a kernel-mode helper) for a very 
special PCIe device. It actually works already, but performs 
differently on different kernels. I'm targeting x86 (i686) only 
(although preferably the driver should stay platform-neutral) and I 
need to support kernels 4.9+. Due to how the device is designed and 
used, very little has to be done in kernel space. The device has a 
large internal memory, which accumulates measurement data, and it is 
capable of transferring it to the host using DMA (with at least a 
32-bit address space available). Arranging memory for DMA is pretty 
much the only thing that userspace cannot reasonably do, so this needs 
to be in the driver. So my currently attempted layout is as follows:

1. In the (kernel-mode) driver, allocate a large contiguous block of 
physical memory to do DMA into. It will later be reused several times. 
This block does not need to have a kernel-mode virtual address because 
it will never be accessed from the driver directly. The block size is 
typically 128M and I use CMA=256M. Currently I use dma_alloc_coherent(), 
but I'm not convinced it really needs to be strictly coherent memory, 
for performance reasons, see below. Also, AFAICS on x86 
dma_alloc_coherent() always creates a kernel address mapping anyway, so 
maybe I'd better simply kmalloc() with a subsequent dma_map_single()? 
(See sketch 1 after this list.)

2. Upon DMA completion (from device to host), some sort of 
barrier/synchronization might be necessary (to be safe WRT speculative 
loads, cache, etc.), like dma_cache_sync() or dma_sync_single_for_cpu(). 
However, the latter looks like a no-op on x86 AFAICS, and the former is 
apparently flush_write_buffers(), which is not very involved either (an 
asm lock; nop) and does not look useful for my case. Currently I do not 
use either, and it seems to be OK, maybe by pure luck. So, is it really 
that trivially simple on x86, or am I just missing something horribly 
big here? (See sketch 2 after this list.)

3. mmap this buffer for userspace. Reading from it should be as fast as 
possible, therefore this block AFAICS should be cacheable (and 
prefetchable and whatever else helps performance), at least from the 
userspace context. It is not quite clear whether such properties depend 
on the block allocation method (step 1 above) or only on the remapping 
attributes. Currently, for mmap I employ dma_mmap_coherent(), but it 
also seems possible to use remap_pfn_range() and to change vm_page_prot 
somewhat. I've already found that e.g. pgprot_noncached hurts 
performance quite a lot, but supposedly without it some DMA barrier 
(step 2 above) still seems necessary? (See sketch 3 after this list.)
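
For reference, here is roughly what step 1 looks like in my driver 
right now. This is a heavily simplified sketch: error handling is 
trimmed and the struct/function names are made up for this mail.

#include <linux/pci.h>
#include <linux/dma-mapping.h>

#define MYDEV_BUF_SIZE (128UL << 20)	/* 128M measurement buffer */

struct mydev {
	struct pci_dev *pdev;
	void *buf_cpu;		/* kernel mapping, returned whether I need it or not */
	dma_addr_t buf_dma;	/* bus address programmed into the device */
};

/* Sketch 1: one large buffer allocated at probe time and reused */
static int mydev_alloc_buffer(struct mydev *md)
{
	int ret;

	/* the device can address at least 32 bits */
	ret = dma_set_mask_and_coherent(&md->pdev->dev, DMA_BIT_MASK(32));
	if (ret)
		return ret;

	/* with a buffer this large, this should come from the CMA area */
	md->buf_cpu = dma_alloc_coherent(&md->pdev->dev, MYDEV_BUF_SIZE,
					 &md->buf_dma, GFP_KERNEL);
	if (!md->buf_cpu)
		return -ENOMEM;

	return 0;
}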
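
And this is my understanding of where the sync calls would sit if I 
switched from coherent memory to a streaming mapping, i.e. if buf_dma 
came from dma_map_single() on a separately allocated buffer instead of 
dma_alloc_coherent(). Again, the helper names are made up and the 
device-programming details are omitted.

/* Sketch 2: CPU/device ownership handover for a streaming mapping */
static void mydev_start_transfer(struct mydev *md)
{
	/* hand the buffer to the device before programming the DMA engine */
	dma_sync_single_for_device(&md->pdev->dev, md->buf_dma,
				   MYDEV_BUF_SIZE, DMA_FROM_DEVICE);

	/* ... write buf_dma and the length into the device's DMA registers ... */
}

static void mydev_dma_complete(struct mydev *md)
{
	/* give the buffer back to the CPU before anything reads the data */
	dma_sync_single_for_cpu(&md->pdev->dev, md->buf_dma,
				MYDEV_BUF_SIZE, DMA_FROM_DEVICE);

	/* from here on the data may be handed to userspace */
}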
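
Finally, the mmap handler as I currently have it, using 
dma_mmap_coherent() on the buffer from sketch 1 (simplified; it assumes 
open() stored my device struct in file->private_data).

/* Sketch 3: map the DMA buffer into userspace, keeping it cacheable */
static int mydev_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct mydev *md = file->private_data;
	unsigned long len = vma->vm_end - vma->vm_start;

	if (len > MYDEV_BUF_SIZE)
		return -EINVAL;

	/* no pgprot_noncached() here, so userspace reads stay cacheable */
	return dma_mmap_coherent(&md->pdev->dev, vma, md->buf_cpu,
				 md->buf_dma, len);
}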

Any hints greatly appreciated,

Regards,
Nikolai


* Re: Correct use of DMA api (Some newbie questions)
  2019-07-14 17:06 Correct use of DMA api (Some newbie questions) Nikolai Zhubr
@ 2019-07-17  6:28 ` Randy Dunlap
From: Randy Dunlap @ 2019-07-17  6:28 UTC
  To: Nikolai Zhubr, linux-kernel

On 7/14/19 10:06 AM, Nikolai Zhubr wrote:
> Hi all,
> 
> After reading some (apparently contradictory) revisions of the DMA API
> references in Documentation/DMA-*.txt, some (contradictory) discussions
> thereof, and even digging through the in-tree drivers in search of a
> good enlightening example, I still have to ask for advice.
[...]
> Any hints greatly appreciated,
> 
> Regards,
> Nikolai

Hi,

I suggest that you try some mailing list(s) besides linux-kernel.
The MAINTAINERS file has these possibilities:

dmaengine@vger.kernel.org
iommu@lists.linux-foundation.org

or just try linux-mm@vger.kernel.org

-- 
~Randy

