* [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support
@ 2014-07-22 12:33 Laurent Pinchart
  2014-07-23  2:17 ` Kuninori Morimoto
                   ` (3 more replies)
  0 siblings, 4 replies; 78+ messages in thread
From: Laurent Pinchart @ 2014-07-22 12:33 UTC (permalink / raw)
  To: linux-sh

Hello,

This patch set implements support for hardware descriptor lists in the R-Car
Gen2 DMAC driver.

The DMAC supports reconfiguring itself after every chunk from a list of
hardware transfer descriptors in physically contiguous memory. This reduces
the number of interrupts required for processing a DMA transfer.

In theory the transfer throughput can be slightly increased and the CPU load
slightly decreased, but in practice the gain might not be significant as most
DMAC users, if not all, perform small DMA transfers to physically contiguous
memory, resulting in a single chunk per transfer. I'll perform performance
tests and will post results shortly.

The code has been tested by artificially lowering the maximum chunk size to
4096 bytes and running dmatest, which completed successfully. Morimoto-san, is
there an easy way to test cyclic transfers with your audio driver?

The patches apply on top of the "[PATCH v2 0/8] R-Car Gen2 DMA Controller
driver" series previously posted to the dmaengine and linux-sh mailing list.

The RFC status of this series comes from the way hardware descriptor memory
is allocated. The DMAC has an internal descriptor memory of 128 entries shared
between all channels, and also supports storing descriptors in system memory.
Using the DMAC internal descriptor memory speeds up descriptor fetch
operations compared to system memory.

Several options are thus possible:

1. Allocate one descriptor list per DMA transfer request with the DMA coherent
allocation API. This is the currently implemented option. The upside is
simplicity, the downsides are slower descriptor fetch operations (compared to
using internal memory) and higher memory usage, as dma_alloc_coherent() can't
allocate less than one page. Memory allocation and freeing also introduce
overhead, but that's partly alleviated by caching memory (patch 5/5).

2. Allocate pages of physically contiguous memory using the DMA coherent
allocation API as a backend, and manually allocate descriptor lists from
within those pages. The upside is lower memory usage, the downsides are
slower descriptor fetch operations (compared to using internal memory) and
higher complexity. As memory will be preallocated, the overhead at transfer
descriptor preparation time will be negligible, except when the driver runs
out of preallocated memory and needs to perform a new allocation.

3. Manually allocate descriptor lists from the DMAC internal memory. This has
the upside of speeding up descriptor fetch operations, and the downside of
limiting the total number of descriptors in use at any given time to 128 at
most (and possibly fewer in practice due to fragmentation). Note that failures
to allocate descriptor memory are not fatal: the driver falls back to not
using hardware descriptor lists in that case.

4. A mix of options 2 and 3, allocating descriptors from internal memory when
available, and falling back to system memory otherwise. This is the most
efficient option from a descriptor fetch point of view, but is also the most
complex to implement.

My gut feeling is that the overhead introduced by fetching descriptors from
external memory will not be significant, but that's just a gut feeling.
Comments and ideas will be appreciated. I plan to keep the current
implementation for now unless someone strongly believes it needs to be
changed.

Laurent Pinchart (5):
  dmaengine: rcar-dmac: Rename rcar_dmac_hw_desc to rcar_dmac_xfer_chunk
  dmaengine: rcar-dmac: Fix typo in register definition
  dmaengine: rcar-dmac: Compute maximum chunk size at runtime
  dmaengine: rcar-dmac: Implement support for hardware descriptor lists
  dmaengine: rcar-dmac: Cache hardware descriptors memory

 drivers/dma/sh/rcar-dmac.c | 432 +++++++++++++++++++++++++++++++++------------
 1 file changed, 324 insertions(+), 108 deletions(-)

-- 
Regards,

Laurent Pinchart



* Re: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support
  2014-07-22 12:33 [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support Laurent Pinchart
@ 2014-07-23  2:17 ` Kuninori Morimoto
  2014-07-23 10:28     ` Laurent Pinchart
  2014-07-23  9:48 ` [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support Laurent Pinchart
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 78+ messages in thread
From: Kuninori Morimoto @ 2014-07-23  2:17 UTC (permalink / raw)
  To: linux-sh


Hi Laurent

> The code has been tested by artificially lowering the maximum chunk size to
> 4096 bytes and running dmatest, which completed successfully. Morimoto-san, is
> there an easy way to test cyclic transfers with your audio driver?

Thank you for your offer.
I tested this patch set with the audio DMAC (= cyclic transfer),
but it doesn't work for me.

First of all, this sound driver, which uses cyclic transfers,
worked well with the shdma-base driver.
I had sent the audio DMA support platform-side patches before.
But, of course, I'm happy to update the sound driver side.

I will re-send my audio DMAC support patches after this email.

My troubles are...

1. "filter" still can't care "dma0" or "audma0"

   	dmac0: dma-controller@e6700000 {
	..
	};
	dmac1: dma-controller@e6720000 {
	...
	};
	audma0: dma-controller@ec700000 {
	...
	};
	audma1: dma-controller@ec720000 {
	...
	};
	audmapp: audio-dma-pp@0xec740000 {
	...
	};

  the audio driver requests "audma0, 0x01",
  but the filter accepts it as "dmac0, 0x01"

2. cyclic transfer doesn't work

   I got attached error.

----------------------
Playing WAVE '/home/Calm_16bit_48k.wav' : Signed 16 bit Little Endian, Rate 48000 Hz, Stereo
------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at /opt/home/morimoto/linux/drivers/dma/sh/rcar-dmac.c:1264 rcar_dmac_isr_channel+0x68/0x184()
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.16.0-rc4-02824-gfc6caf1-dirty #291
Backtrace: 
[<c0011bbc>] (dump_backtrace) from [<c0011df8>] (show_stack+0x18/0x1c)
 r6:c0516218 r5:00000009 r4:00000000 r3:00200000
[<c0011de0>] (show_stack) from [<c0437e44>] (dump_stack+0x7c/0x98)
[<c0437dc8>] (dump_stack) from [<c0024dac>] (warn_slowpath_common+0x68/0x8c)
 r4:00000000 r3:600f0193
[<c0024d44>] (warn_slowpath_common) from [<c0024ea8>] (warn_slowpath_null+0x24/0x2c)
 r8:edbe6800 r7:00000000 r6:00000000 r5:3e0c1810 r4:ee278010
[<c0024e84>] (warn_slowpath_null) from [<c01d4710>] (rcar_dmac_isr_channel+0x68/0x184)
[<c01d46a8>] (rcar_dmac_isr_channel) from [<c005f2dc>] (handle_irq_event_percpu+0x38/0x130)
 r6:00000000 r5:00000160 r4:ee3ab3c0 r3:c01d46a8
[<c005f2a4>] (handle_irq_event_percpu) from [<c005f41c>] (handle_irq_event+0x48/0x68)
 r10:00000000 r9:00000014 r8:edbe6800 r7:c0587c9c r6:c0587f40 r5:c05a0c84
 r4:ee818c40
[<c005f3d4>] (handle_irq_event) from [<c00621c0>] (handle_fasteoi_irq+0xbc/0x144)
 r5:c05a0c84 r4:ee818c40
[<c0062104>] (handle_fasteoi_irq) from [<c005ed04>] (generic_handle_irq+0x28/0x38)
 r5:c0583bcc r4:00000160
[<c005ecdc>] (generic_handle_irq) from [<c000f184>] (handle_IRQ+0x70/0x98)
 r4:00000160 r3:000001a7
[<c000f114>] (handle_IRQ) from [<c0009320>] (gic_handle_irq+0x44/0x68)
 r6:c058e88c r5:c0587c68 r4:f0002000 r3:000001a0
[<c00092dc>] (gic_handle_irq) from [<c00129c0>] (__irq_svc+0x40/0x50)
Exception stack(0xc0587c68 to 0xc0587cb0)
7c60:                   c05b308c 00000000 0d4a0d4a 0d4b0d4a ee231a00 edb71b80
7c80: 00000000 00000000 edbe6800 00000014 00000000 c0587cbc c0587cc0 c0587cb0
7ca0: c03871f0 c043c4c0 600f0113 ffffffff
 r6:ffffffff r5:600f0113 r4:c043c4c0 r3:c03871f0
[<c043c49c>] (_raw_spin_lock) from [<c03871f0>] (ip_defrag+0xaac/0xcb8)
[<c0386744>] (ip_defrag) from [<c0385a10>] (ip_local_deliver+0x5c/0x268)
 r10:edb71b80 r9:c058f24c r8:00000008 r7:c05b2f40 r6:edb71b80 r5:edb71b98
 r4:ee22332e
[<c03859b4>] (ip_local_deliver) from [<c038626c>] (ip_rcv+0x650/0x6f0)
 r7:c05b2f40 r6:edb71b80 r5:edb71b98 r4:ee22332e
[<c0385c1c>] (ip_rcv) from [<c035e6a0>] (__netif_receive_skb_core+0x470/0x50c)
 r9:c058f24c r8:00000008 r7:c05900d0 r6:00000000 r5:c058f238 r4:00000000
[<c035e230>] (__netif_receive_skb_core) from [<c035e908>] (__netif_receive_skb+0x2c/0x80)
 r10:000005ea r9:edbe6a78 r8:edbe6d10 r7:0000003f r6:0000003f r5:c058f238
 r4:edb71b80
[<c035e8dc>] (__netif_receive_skb) from [<c035e9c0>] (netif_receive_skb_internal+0x64/0xa4)
 r5:c058f238 r4:edb71b80
[<c035e95c>] (netif_receive_skb_internal) from [<c0362200>] (netif_receive_skb+0x10/0x14)
 r5:edb71b80 r4:edbe6800
[<c03621f0>] (netif_receive_skb) from [<c027b230>] (sh_eth_poll+0x214/0x494)
[<c027b01c>] (sh_eth_poll) from [<c0362784>] (net_rx_action+0xb8/0x174)
 r10:c05880c0 r9:00000040 r8:c05ba8b8 r7:eef9fc88 r6:0000012c r5:eef9fc80
 r4:edbe6d10
[<c03626cc>] (net_rx_action) from [<c0028a04>] (__do_softirq+0xf0/0x22c)
 r10:c0586000 r9:00000100 r8:c058808c r7:c0588080 r6:c0586000 r5:c0586018
 r4:00000008
[<c0028914>] (__do_softirq) from [<c0028da4>] (irq_exit+0x8c/0xe8)
 r10:00000000 r9:413fc0f2 r8:ef7fccc0 r7:c0587f74 r6:00000000 r5:c0583bcc
 r4:c0586028
[<c0028d18>] (irq_exit) from [<c000f188>] (handle_IRQ+0x74/0x98)
 r4:000000c2 r3:000001a7
[<c000f114>] (handle_IRQ) from [<c0009320>] (gic_handle_irq+0x44/0x68)
 r6:c058e88c r5:c0587f40 r4:f0002000 r3:000001a0
[<c00092dc>] (gic_handle_irq) from [<c00129c0>] (__irq_svc+0x40/0x50)
Exception stack(0xc0587f40 to 0xc0587f88)
7f40: eef9e490 00000000 0000404a 00000000 c0586018 c0586000 c0586000 c0579814
7f60: ef7fccc0 413fc0f2 00000000 c0587f94 c0587f98 c0587f88 c000f484 c000f488
7f80: 600f0013 ffffffff
 r6:ffffffff r5:600f0013 r4:c000f488 r3:c000f484
[<c000f45c>] (arch_cpu_idle) from [<c0056428>] (cpu_startup_entry+0xf4/0x154)
[<c0056334>] (cpu_startup_entry) from [<c0434ea8>] (rest_init+0x68/0x80)
[<c0434e40>] (rest_init) from [<c0548b74>] (start_kernel+0x2cc/0x31c)
[<c05488a8>] (start_kernel) from [<40008074>] (0x40008074)
----------------------

Best regards
---
Kuninori Morimoto


* Re: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support
  2014-07-22 12:33 [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support Laurent Pinchart
  2014-07-23  2:17 ` Kuninori Morimoto
@ 2014-07-23  9:48 ` Laurent Pinchart
  2014-07-23 23:56 ` Kuninori Morimoto
  2014-07-24  0:12 ` [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support Laurent Pinchart
  3 siblings, 0 replies; 78+ messages in thread
From: Laurent Pinchart @ 2014-07-23  9:48 UTC (permalink / raw)
  To: linux-sh

Hi Morimoto-san,

On Tuesday 22 July 2014 19:17:23 Kuninori Morimoto wrote:
> Hi Laurent
> 
> > The code has been tested by artificially lowering the maximum chunk size
> > to 4096 bytes and running dmatest, which completed successfully. Morimoto-
> > san, is there an easy way to test cyclic transfers with your audio driver
> > ?
>
> Thank you for your offer.
> I tested this patchset with audio DMAC (= cyclic transfer)
> but, it doesn't work for me.
> 
> First of all, this sound driver, which uses cyclic transfers,
> worked well with the shdma-base driver.
> I had sent the audio DMA support platform-side patches before.
> But, of course I'm happy to update sound driver side.
> 
> I will re-send my audio DMAC support patches after this email.
> 
> My troubles are...
> 
> 1. "filter" still can't care "dma0" or "audma0"
> 
>  	dmac0: dma-controller@e6700000 {
> 	..
> 	};
> 	dmac1: dma-controller@e6720000 {
> 	...
> 	};
> 	audma0: dma-controller@ec700000 {
> 	...
> 	};
> 	audma1: dma-controller@ec720000 {
> 	...
> 	};
> 	audmapp: audio-dma-pp@0xec740000 {
> 	...
> 	};
> 
>   the audio driver requests "audma0, 0x01",
>   but the filter accepts it as "dmac0, 0x01"

Indeed, I've fixed the rcar-dmac driver to ignore channels handled by a 
different driver, but not channels handled by the same driver but a different 
device. I'll fix that.

By the way, I've noticed an issue with the snd_soc_rcar driver. If a
dma_request_slave_channel_compat() call fails in a module probe operation, the
rsnd_probe() function will return an error immediately:
        for_each_rsnd_dai(rdai, priv, i) {
                ret = rsnd_dai_call(probe, &rdai->playback, rdai);
                if (ret)
                        return ret;

                ret = rsnd_dai_call(probe, &rdai->capture, rdai);
                if (ret)
                        return ret;
        }

The modules that have been successfully probed are not cleaned up, so the DMA
channels they have allocated are never released.

> 2. cyclic transfer doesn't work
> 
>    I got attached error.

I'm not too surprised as I haven't tested cyclic DMA yet :-) I'll fix it.

> ----------------------
> Playing WAVE '/home/Calm_16bit_48k.wav' : Signed 16 bit Little Endian, Rate
> 48000 Hz, Stereo ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 0 at
> /opt/home/morimoto/linux/drivers/dma/sh/rcar-dmac.c:1264
> rcar_dmac_isr_channel+0x68/0x184() CPU: 0 PID: 0 Comm: swapper/0 Not
> tainted 3.16.0-rc4-02824-gfc6caf1-dirty #291 Backtrace:
> [<c0011bbc>] (dump_backtrace) from [<c0011df8>] (show_stack+0x18/0x1c)
>  r6:c0516218 r5:00000009 r4:00000000 r3:00200000
> [<c0011de0>] (show_stack) from [<c0437e44>] (dump_stack+0x7c/0x98)
> [<c0437dc8>] (dump_stack) from [<c0024dac>] (warn_slowpath_common+0x68/0x8c)
> r4:00000000 r3:600f0193
> [<c0024d44>] (warn_slowpath_common) from [<c0024ea8>]
> (warn_slowpath_null+0x24/0x2c) r8:edbe6800 r7:00000000 r6:00000000
> r5:3e0c1810 r4:ee278010
> [<c0024e84>] (warn_slowpath_null) from [<c01d4710>]
> (rcar_dmac_isr_channel+0x68/0x184) [<c01d46a8>] (rcar_dmac_isr_channel)
> from [<c005f2dc>] (handle_irq_event_percpu+0x38/0x130) r6:00000000
> r5:00000160 r4:ee3ab3c0 r3:c01d46a8
> [<c005f2a4>] (handle_irq_event_percpu) from [<c005f41c>]
> (handle_irq_event+0x48/0x68) r10:00000000 r9:00000014 r8:edbe6800
> r7:c0587c9c r6:c0587f40 r5:c05a0c84 r4:ee818c40
> [<c005f3d4>] (handle_irq_event) from [<c00621c0>]
> (handle_fasteoi_irq+0xbc/0x144) r5:c05a0c84 r4:ee818c40
> [<c0062104>] (handle_fasteoi_irq) from [<c005ed04>]
> (generic_handle_irq+0x28/0x38) r5:c0583bcc r4:00000160
> [<c005ecdc>] (generic_handle_irq) from [<c000f184>] (handle_IRQ+0x70/0x98)
>  r4:00000160 r3:000001a7
> [<c000f114>] (handle_IRQ) from [<c0009320>] (gic_handle_irq+0x44/0x68)
>  r6:c058e88c r5:c0587c68 r4:f0002000 r3:000001a0
> [<c00092dc>] (gic_handle_irq) from [<c00129c0>] (__irq_svc+0x40/0x50)
> Exception stack(0xc0587c68 to 0xc0587cb0)
> 7c60:                   c05b308c 00000000 0d4a0d4a 0d4b0d4a ee231a00
> edb71b80 7c80: 00000000 00000000 edbe6800 00000014 00000000 c0587cbc
> c0587cc0 c0587cb0 7ca0: c03871f0 c043c4c0 600f0113 ffffffff
>  r6:ffffffff r5:600f0113 r4:c043c4c0 r3:c03871f0
> [<c043c49c>] (_raw_spin_lock) from [<c03871f0>] (ip_defrag+0xaac/0xcb8)
> [<c0386744>] (ip_defrag) from [<c0385a10>] (ip_local_deliver+0x5c/0x268)
>  r10:edb71b80 r9:c058f24c r8:00000008 r7:c05b2f40 r6:edb71b80 r5:edb71b98
>  r4:ee22332e
> [<c03859b4>] (ip_local_deliver) from [<c038626c>] (ip_rcv+0x650/0x6f0)
>  r7:c05b2f40 r6:edb71b80 r5:edb71b98 r4:ee22332e
> [<c0385c1c>] (ip_rcv) from [<c035e6a0>]
> (__netif_receive_skb_core+0x470/0x50c) r9:c058f24c r8:00000008 r7:c05900d0
> r6:00000000 r5:c058f238 r4:00000000 [<c035e230>] (__netif_receive_skb_core)
> from [<c035e908>] (__netif_receive_skb+0x2c/0x80) r10:000005ea r9:edbe6a78
> r8:edbe6d10 r7:0000003f r6:0000003f r5:c058f238 r4:edb71b80
> [<c035e8dc>] (__netif_receive_skb) from [<c035e9c0>]
> (netif_receive_skb_internal+0x64/0xa4) r5:c058f238 r4:edb71b80
> [<c035e95c>] (netif_receive_skb_internal) from [<c0362200>]
> (netif_receive_skb+0x10/0x14) r5:edb71b80 r4:edbe6800
> [<c03621f0>] (netif_receive_skb) from [<c027b230>] (sh_eth_poll+0x214/0x494)
> [<c027b01c>] (sh_eth_poll) from [<c0362784>] (net_rx_action+0xb8/0x174)
> r10:c05880c0 r9:00000040 r8:c05ba8b8 r7:eef9fc88 r6:0000012c r5:eef9fc80
> r4:edbe6d10
> [<c03626cc>] (net_rx_action) from [<c0028a04>] (__do_softirq+0xf0/0x22c)
>  r10:c0586000 r9:00000100 r8:c058808c r7:c0588080 r6:c0586000 r5:c0586018
>  r4:00000008
> [<c0028914>] (__do_softirq) from [<c0028da4>] (irq_exit+0x8c/0xe8)
>  r10:00000000 r9:413fc0f2 r8:ef7fccc0 r7:c0587f74 r6:00000000 r5:c0583bcc
>  r4:c0586028
> [<c0028d18>] (irq_exit) from [<c000f188>] (handle_IRQ+0x74/0x98)
>  r4:000000c2 r3:000001a7
> [<c000f114>] (handle_IRQ) from [<c0009320>] (gic_handle_irq+0x44/0x68)
>  r6:c058e88c r5:c0587f40 r4:f0002000 r3:000001a0
> [<c00092dc>] (gic_handle_irq) from [<c00129c0>] (__irq_svc+0x40/0x50)
> Exception stack(0xc0587f40 to 0xc0587f88)
> 7f40: eef9e490 00000000 0000404a 00000000 c0586018 c0586000 c0586000
> c0579814 7f60: ef7fccc0 413fc0f2 00000000 c0587f94 c0587f98 c0587f88
> c000f484 c000f488 7f80: 600f0013 ffffffff
>  r6:ffffffff r5:600f0013 r4:c000f488 r3:c000f484
> [<c000f45c>] (arch_cpu_idle) from [<c0056428>]
> (cpu_startup_entry+0xf4/0x154) [<c0056334>] (cpu_startup_entry) from
> [<c0434ea8>] (rest_init+0x68/0x80) [<c0434e40>] (rest_init) from
> [<c0548b74>] (start_kernel+0x2cc/0x31c) [<c05488a8>] (start_kernel) from
> [<40008074>] (0x40008074)
> ----------------------

-- 
Regards,

Laurent Pinchart



* Re: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support
  2014-07-23  2:17 ` Kuninori Morimoto
@ 2014-07-23 10:28     ` Laurent Pinchart
  0 siblings, 0 replies; 78+ messages in thread
From: Laurent Pinchart @ 2014-07-23 10:28 UTC (permalink / raw)
  To: Kuninori Morimoto; +Cc: dmaengine, linux-sh, Magnus Damm, Linux-ALSA

Hi Morimoto-san,

On Tuesday 22 July 2014 19:17:23 Kuninori Morimoto wrote:
> Hi Laurent
> 
> > The code has been tested by artificially lowering the maximum chunk size
> > to 4096 bytes and running dmatest, which completed successfully. Morimoto-
> > san, is there an easy way to test cyclic transfers with your audio driver
> > ?
>
> Thank you for your offer.
> I tested this patchset with audio DMAC (= cyclic transfer)
> but, it doesn't work for me.
> 
> First of all, this sound driver, which uses cyclic transfers,
> worked well with the shdma-base driver.
> I had sent the audio DMA support platform-side patches before.
> But, of course I'm happy to update sound driver side.
> 
> I will re-send my audio DMAC support patches after this email.
> 
> My troubles are...

[snip]

> 2. cyclic transfer doesn't work
> 
>    I got attached error.

[snip]

I can reproduce that, but I get this error first.

[   16.207027] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:103
[   16.215795] in_atomic(): 1, irqs_disabled(): 128, pid: 1319, name: aplay
[   16.222636] CPU: 0 PID: 1319 Comm: aplay Not tainted 3.16.0-rc5-02821-g12a72a3 #2501
[   16.230536] Backtrace: 
[   16.233056] [<c00121e4>] (dump_backtrace) from [<c0012598>] (show_stack+0x18/0x1c)
[   16.240778]  r6:ffffffff r5:c04aa7c0 r4:00000000 r3:00000000
[   16.246593] [<c0012580>] (show_stack) from [<c032ea84>] (dump_stack+0x8c/0xc0)
[   16.253967] [<c032e9f8>] (dump_stack) from [<c0049e98>] (__might_sleep+0xcc/0x108)
[   16.261689]  r4:e8dd2000 r3:00000093
[   16.265357] [<c0049dcc>] (__might_sleep) from [<c0332134>] (mutex_lock+0x20/0x70)
[   16.272990]  r5:00000000 r4:e900fe00
[   16.276657] [<c0332114>] (mutex_lock) from [<c01fa4dc>] (regmap_lock_mutex+0x10/0x14)
[   16.284644]  r4:e900fe00 r3:00000000
[   16.288309] [<c01fa4cc>] (regmap_lock_mutex) from [<c01fb9dc>] (regmap_update_bits+0x2c/0x64)
[   16.297009] [<c01fb9b0>] (regmap_update_bits) from [<c01fba90>] (regmap_fields_write+0x38/0x44)
[   16.305883]  r7:e8d9d990 r6:00000004 r5:00000040 r4:f0368200
[   16.311701] [<c01fba58>] (regmap_fields_write) from [<bf0ec280>] (rsnd_write+0x30/0x4c [snd_soc_rcar])
[   16.321195]  r5:e93a4c00 r4:e8d1f898
[   16.324866] [<bf0ec250>] (rsnd_write [snd_soc_rcar]) from [<bf0ec884>] (rsnd_src_set_convert_rate.isra.6+0xf8/0x144 [snd_soc_rcar])
[   16.336940] [<bf0ec78c>] (rsnd_src_set_convert_rate.isra.6 [snd_soc_rcar]) from [<bf0ec8fc>] (rsnd_src_init_gen2+0x2c/0xc4 [snd_soc_rcar])
[   16.349624]  r6:00000004 r5:e8d9d810 r4:e8d1f898 r3:bf0ec8d0
[   16.355438] [<bf0ec8d0>] (rsnd_src_init_gen2 [snd_soc_rcar]) from [<bf0ea640>] (rsnd_soc_dai_trigger+0x1cc/0x22c [snd_soc_rcar])
[   16.367236]  r5:e8d9d810 r4:e8d9d824
[   16.370916] [<bf0ea474>] (rsnd_soc_dai_trigger [snd_soc_rcar]) from [<bf0c51ec>] (soc_pcm_trigger+0xa8/0xf8 [snd_soc_core])
[   16.382271]  r10:00002000 r9:00002000 r8:e9290d00 r7:e8d9d700 r6:00000001 r5:e99fb500
[   16.390301]  r4:e8ef3810
[   16.392910] [<bf0c5144>] (soc_pcm_trigger [snd_soc_core]) from [<bf094704>] (snd_pcm_do_start+0x34/0x38 [snd_pcm])
[   16.403467]  r8:bf09e050 r7:00000000 r6:00000003 r5:e99fb500 r4:bf09e050 r3:bf0c5144
[   16.411421] [<bf0946d0>] (snd_pcm_do_start [snd_pcm]) from [<bf0941f8>] (snd_pcm_action_single+0x40/0x80 [snd_pcm])
[   16.422079] [<bf0941b8>] (snd_pcm_action_single [snd_pcm]) from [<bf09443c>] (snd_pcm_action+0xcc/0xd0 [snd_pcm])
[   16.432547]  r7:00000003 r6:e99fb5c8 r5:bf09e4c0 r4:e99fb500
[   16.438366] [<bf094370>] (snd_pcm_action [snd_pcm]) from [<bf097098>] (snd_pcm_start+0x1c/0x24 [snd_pcm])
[   16.448125]  r8:00000000 r7:e8dd2000 r6:e93a4c00 r5:bf09e4c0 r4:e99fb500 r3:00002000
[   16.456083] [<bf09707c>] (snd_pcm_start [snd_pcm]) from [<bf09b094>] (snd_pcm_lib_write1+0x40c/0x4f0 [snd_pcm])
[   16.466391] [<bf09ac88>] (snd_pcm_lib_write1 [snd_pcm]) from [<bf09b244>] (snd_pcm_lib_write+0x64/0x78 [snd_pcm])
[   16.476860]  r10:be91ea4c r9:e8dd2000 r8:e8d66488 r7:00000000 r6:0002c780 r5:00000800
[   16.484889]  r4:e99fb500
[   16.487494] [<bf09b1e0>] (snd_pcm_lib_write [snd_pcm]) from [<bf096c38>] (snd_pcm_playback_ioctl1+0x134/0x4c8 [snd_pcm])
[   16.498583]  r6:00000000 r5:be91ea4c r4:e99fb500
[   16.503330] [<bf096b04>] (snd_pcm_playback_ioctl1 [snd_pcm]) from [<bf096ffc>] (snd_pcm_playback_ioctl+0x30/0x3c [snd_pcm])
[   16.514685]  r8:e8d66488 r7:be91ea4c r6:00000004 r5:e93d3880 r4:e93d3880
[   16.521575] [<bf096fcc>] (snd_pcm_playback_ioctl [snd_pcm]) from [<c00d7c10>] (do_vfs_ioctl+0x80/0x5c8)
[   16.531163] [<c00d7b90>] (do_vfs_ioctl) from [<c00d8194>] (SyS_ioctl+0x3c/0x60)
[   16.538618]  r10:00000000 r9:e8dd2000 r8:00000004 r7:be91ea4c r6:400c4150 r5:e93d3880
[   16.546648]  r4:e93d3880
[   16.549243] [<c00d8158>] (SyS_ioctl) from [<c000f8a0>] (ret_fast_syscall+0x0/0x30)
[   16.556964]  r8:c000fa24 r7:00000036 r6:00000000 r5:0002c498 r4:0002c448 r3:be91ea4c

The rsnd_soc_dai_trigger() function takes a spinlock, making the context
atomic, which regmap doesn't like as it locks a mutex.

It might be possible to fix this by setting the fast_io field in both the
regmap_config and regmap_bus structures in sound/soc/sh/rcar/gen.c. regmap
will then use a spinlock instead of a mutex. However, even if I believe that
change makes sense and should be done, another atomic context issue will come
from the rcar-dmac driver, which allocates memory in the prep_dma_cyclic
function, called by rsnd_dma_start() from rsnd_soc_dai_trigger() with the
spinlock held.

What context is the rsnd_soc_dai_trigger() function called in by the ALSA
core? If it's guaranteed to be a sleepable context, would it make sense to
replace the rsnd_priv spinlock with a mutex?
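For reference, the suggested regmap change would look roughly like this (a sketch only; the structure tags and the other initializers in sound/soc/sh/rcar/gen.c are assumed/elided):

```c
static const struct regmap_config rsnd_regmap_config = {
	/* ... existing fields ... */
	.fast_io = true,	/* regmap uses a spinlock, not a mutex */
};

static const struct regmap_bus rsnd_regmap_bus = {
	/* ... existing fields ... */
	.fast_io = true,
};
```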

-- 
Regards,

Laurent Pinchart




* DMA engine API issue (was: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support)
  2014-07-23 10:28     ` Laurent Pinchart
  (?)
@ 2014-07-23 11:07       ` Laurent Pinchart
  -1 siblings, 0 replies; 78+ messages in thread
From: Laurent Pinchart @ 2014-07-23 11:07 UTC (permalink / raw)
  To: Kuninori Morimoto
  Cc: Linux-ALSA, linux-sh, Vinod Koul, Magnus Damm, dmaengine,
	Maxime Ripard, linux-arm-kernel

(Expanding the CC list)

On Wednesday 23 July 2014 12:28:47 Laurent Pinchart wrote:
> On Tuesday 22 July 2014 19:17:23 Kuninori Morimoto wrote:
> > Hi Laurent
> > 
> > > The code has been tested by artificially lowering the maximum chunk size
> > > to 4096 bytes and running dmatest, which completed successfully.
> > > Morimoto- san, is there an easy way to test cyclic transfers with your
> > > audio driver ?
> > 
> > Thank you for your offer.
> > I tested this patchset with audio DMAC (= cyclic transfer)
> > but, it doesn't work for me.
> > 
> > First of all, this sound driver, which uses cyclic transfers,
> > worked well with the shdma-base driver.
> > I had sent the audio DMA support platform-side patches before.
> > But, of course I'm happy to update sound driver side.
> > 
> > I will re-send my audio DMAC support patches after this email.
> > 
> > My troubles are...
> 
> [snip]
> 
> > 2. cyclic transfer doesn't work
> > 
> >    I got attached error.
> 
> [snip]
> 
> I can reproduce that, but I have this error coming up before.
> 
> [   16.207027] BUG: sleeping function called from invalid context at
> kernel/locking/mutex.c:103 [   16.215795] in_atomic(): 1, irqs_disabled():
> 128, pid: 1319, name: aplay [   16.222636] CPU: 0 PID: 1319 Comm: aplay Not
> tainted 3.16.0-rc5-02821-g12a72a3 #2501 [   16.230536] Backtrace:
> [   16.233056] [<c00121e4>] (dump_backtrace) from [<c0012598>]
> (show_stack+0x18/0x1c) [   16.240778]  r6:ffffffff r5:c04aa7c0 r4:00000000
> r3:00000000
> [   16.246593] [<c0012580>] (show_stack) from [<c032ea84>]
> (dump_stack+0x8c/0xc0) [   16.253967] [<c032e9f8>] (dump_stack) from
> [<c0049e98>] (__might_sleep+0xcc/0x108) [   16.261689]  r4:e8dd2000
> r3:00000093
> [   16.265357] [<c0049dcc>] (__might_sleep) from [<c0332134>]
> (mutex_lock+0x20/0x70) [   16.272990]  r5:00000000 r4:e900fe00
> [   16.276657] [<c0332114>] (mutex_lock) from [<c01fa4dc>]
> (regmap_lock_mutex+0x10/0x14) [   16.284644]  r4:e900fe00 r3:00000000
> [   16.288309] [<c01fa4cc>] (regmap_lock_mutex) from [<c01fb9dc>]
> (regmap_update_bits+0x2c/0x64) [   16.297009] [<c01fb9b0>]
> (regmap_update_bits) from [<c01fba90>] (regmap_fields_write+0x38/0x44) [  
> 16.305883]  r7:e8d9d990 r6:00000004 r5:00000040 r4:f0368200
> [   16.311701] [<c01fba58>] (regmap_fields_write) from [<bf0ec280>]
> (rsnd_write+0x30/0x4c [snd_soc_rcar]) [   16.321195]  r5:e93a4c00
> r4:e8d1f898
> [   16.324866] [<bf0ec250>] (rsnd_write [snd_soc_rcar]) from [<bf0ec884>]
> (rsnd_src_set_convert_rate.isra.6+0xf8/0x144 [snd_soc_rcar]) [   16.336940]
> [<bf0ec78c>] (rsnd_src_set_convert_rate.isra.6 [snd_soc_rcar]) from
> [<bf0ec8fc>] (rsnd_src_init_gen2+0x2c/0xc4 [snd_soc_rcar]) [   16.349624] 
> r6:00000004 r5:e8d9d810 r4:e8d1f898 r3:bf0ec8d0
> [   16.355438] [<bf0ec8d0>] (rsnd_src_init_gen2 [snd_soc_rcar]) from
> [<bf0ea640>] (rsnd_soc_dai_trigger+0x1cc/0x22c [snd_soc_rcar]) [  
> 16.367236]  r5:e8d9d810 r4:e8d9d824
> [   16.370916] [<bf0ea474>] (rsnd_soc_dai_trigger [snd_soc_rcar]) from
> [<bf0c51ec>] (soc_pcm_trigger+0xa8/0xf8 [snd_soc_core]) [   16.382271] 
> r10:00002000 r9:00002000 r8:e9290d00 r7:e8d9d700 r6:00000001 r5:e99fb500 [ 
>  16.390301]  r4:e8ef3810
> [   16.392910] [<bf0c5144>] (soc_pcm_trigger [snd_soc_core]) from
> [<bf094704>] (snd_pcm_do_start+0x34/0x38 [snd_pcm]) [   16.403467] 
> r8:bf09e050 r7:00000000 r6:00000003 r5:e99fb500 r4:bf09e050 r3:bf0c5144 [  
> 16.411421] [<bf0946d0>] (snd_pcm_do_start [snd_pcm]) from [<bf0941f8>]
> (snd_pcm_action_single+0x40/0x80 [snd_pcm]) [   16.422079] [<bf0941b8>]
> (snd_pcm_action_single [snd_pcm]) from [<bf09443c>]
> (snd_pcm_action+0xcc/0xd0 [snd_pcm]) [   16.432547]  r7:00000003
> r6:e99fb5c8 r5:bf09e4c0 r4:e99fb500
> [   16.438366] [<bf094370>] (snd_pcm_action [snd_pcm]) from [<bf097098>]
> (snd_pcm_start+0x1c/0x24 [snd_pcm]) [   16.448125]  r8:00000000 r7:e8dd2000
> r6:e93a4c00 r5:bf09e4c0 r4:e99fb500 r3:00002000 [   16.456083] [<bf09707c>]
> (snd_pcm_start [snd_pcm]) from [<bf09b094>] (snd_pcm_lib_write1+0x40c/0x4f0
> [snd_pcm]) [   16.466391] [<bf09ac88>] (snd_pcm_lib_write1 [snd_pcm]) from
> [<bf09b244>] (snd_pcm_lib_write+0x64/0x78 [snd_pcm]) [   16.476860] 
> r10:be91ea4c r9:e8dd2000 r8:e8d66488 r7:00000000 r6:0002c780 r5:00000800 [ 
>  16.484889]  r4:e99fb500
> [   16.487494] [<bf09b1e0>] (snd_pcm_lib_write [snd_pcm]) from [<bf096c38>]
> (snd_pcm_playback_ioctl1+0x134/0x4c8 [snd_pcm]) [   16.498583]  r6:00000000
> r5:be91ea4c r4:e99fb500
> [   16.503330] [<bf096b04>] (snd_pcm_playback_ioctl1 [snd_pcm]) from
> [<bf096ffc>] (snd_pcm_playback_ioctl+0x30/0x3c [snd_pcm]) [   16.514685] 
> r8:e8d66488 r7:be91ea4c r6:00000004 r5:e93d3880 r4:e93d3880 [   16.521575]
> [<bf096fcc>] (snd_pcm_playback_ioctl [snd_pcm]) from [<c00d7c10>]
> (do_vfs_ioctl+0x80/0x5c8) [   16.531163] [<c00d7b90>] (do_vfs_ioctl) from
> [<c00d8194>] (SyS_ioctl+0x3c/0x60) [   16.538618]  r10:00000000 r9:e8dd2000
> r8:00000004 r7:be91ea4c r6:400c4150 r5:e93d3880 [   16.546648]  r4:e93d3880
> [   16.549243] [<c00d8158>] (SyS_ioctl) from [<c000f8a0>]
> (ret_fast_syscall+0x0/0x30) [   16.556964]  r8:c000fa24 r7:00000036
> r6:00000000 r5:0002c498 r4:0002c448 r3:be91ea4c
> 
> The rsnd_soc_dai_trigger() function takes a spinlock, making the context
> atomic, which regmap doesn't like as it locks a mutex.
> 
> It might be possible to fix this by setting the fast_io field in both the
> regmap_config and regmap_bus structures in sound/soc/sh/rcar/gen.c. regmap
> will then use a spinlock instead of a mutex. However, even though I believe
> that change makes sense and should be done, another atomic context issue
> will come from the rcar-dmac driver, which allocates memory in the
> prep_dma_cyclic function, called by rsnd_dma_start() from
> rsnd_soc_dai_trigger() with the spinlock held.
> 
> What context is the rsnd_soc_dai_trigger() function called in by the ALSA
> core? If it's guaranteed to be a sleepable context, would it make sense to
> replace the rsnd_priv spinlock with a mutex?
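
For reference, the fast_io change mentioned above would be roughly the
following (a sketch only; the structure variable names in
sound/soc/sh/rcar/gen.c are assumed and all other fields are elided):

```c
/* Sketch: opting regmap into spinlock-based locking for this bus.
 * Variable names are illustrative; existing fields elided. */
static struct regmap_config rsnd_regmap_config = {
	/* ... existing register layout fields ... */
	.fast_io = true,	/* lock with a spinlock instead of a mutex */
};

static struct regmap_bus rsnd_regmap_bus = {
	/* ... existing .reg_read / .reg_write ops ... */
	.fast_io = true,	/* bus I/O is safe in atomic context */
};
```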

Answering myself here, that's not an option, as the trigger function is called 
in atomic context (replacing the spinlock with a mutex produced a clear BUG) 
due to snd_pcm_lib_write1() taking a spinlock.

Now we have a core issue. On one side there's rsnd_soc_dai_trigger() being 
called in atomic context, and on the other side the function ends up calling 
dmaengine_prep_dma_cyclic() which needs to allocate memory. To make this more 
fun, the DMA engine API is undocumented and completely silent on whether the 
prep functions are allowed to sleep. The question is, who's wrong?

Now, if you're tempted to say that I'm wrong to allocate memory with 
GFP_KERNEL in the DMA engine driver, please read on first :-) I've tried 
replacing GFP_KERNEL with GFP_ATOMIC allocations, and then ran into a more 
complex problem to solve.

The rcar-dmac DMA engine driver uses runtime PM. When not used, the device is 
suspended. The driver calls pm_runtime_get_sync() to resume the device, and 
needs to do so when a descriptor is submitted. This operation, currently 
performed in the tx_submit handler, could be moved to the prep_dma_cyclic or 
issue_pending handler, but the three operations are called in a row from 
rsnd_dma_start(), itself ultimately called from snd_pcm_lib_write1() with the 
spinlock held. This means I have no place in my DMA engine driver where I can 
resume the device.
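
To illustrate the constraint, here is roughly where the resume currently
sits (a sketch, not the exact rcar-dmac code; helper and field names such
as to_rcar_dmac_chan() are assumptions):

```c
/* Sketch: tx_submit as currently arranged, resuming the device
 * synchronously. pm_runtime_get_sync() may sleep, which breaks when the
 * whole prep/submit/issue sequence runs under the caller's spinlock. */
static dma_cookie_t rcar_dmac_tx_submit(struct dma_async_tx_descriptor *tx)
{
	struct rcar_dmac_chan *chan = to_rcar_dmac_chan(tx->chan);
	unsigned long flags;
	dma_cookie_t cookie;

	pm_runtime_get_sync(chan->chan.device->dev);	/* may sleep */

	spin_lock_irqsave(&chan->lock, flags);
	cookie = dma_cookie_assign(tx);
	list_add_tail(&to_rcar_dmac_desc(tx)->node, &chan->desc_pending);
	spin_unlock_irqrestore(&chan->lock, flags);

	return cookie;
}
```

Moving the pm_runtime_get_sync() call into prep_dma_cyclic or
issue_pending doesn't help, as those run under the same spinlock.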

One could argue that the rcar-dmac driver could use a work queue to handle 
power management. That's correct, but the additional complexity, which would 
be required in *all* DMA engine drivers, seems too high to me. If we need to 
go that way, this is really a task that the DMA engine core should handle.
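
For completeness, the per-driver work queue approach being dismissed here
would look something like this (a sketch with assumed names, showing the
extra machinery every DMA engine driver would need to duplicate):

```c
/* Sketch: deferring runtime PM to a work item so that issue_pending()
 * remains callable from atomic context. All names are illustrative. */
static void rcar_dmac_issue_work(struct work_struct *work)
{
	struct rcar_dmac_chan *chan =
		container_of(work, struct rcar_dmac_chan, issue_work);
	unsigned long flags;

	pm_runtime_get_sync(chan->chan.device->dev);	/* sleepable here */

	spin_lock_irqsave(&chan->lock, flags);
	rcar_dmac_chan_start_xfer(chan);	/* hypothetical helper */
	spin_unlock_irqrestore(&chan->lock, flags);
}

static void rcar_dmac_issue_pending(struct dma_chan *c)
{
	struct rcar_dmac_chan *chan = to_rcar_dmac_chan(c);

	/* Safe in atomic context: only queues the work item. */
	schedule_work(&chan->issue_work);
}
```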

Let's start by answering the background question and updating the DMA engine 
API documentation once and for all: in which context are drivers allowed to 
call the prep, tx_submit and issue_pending functions?

-- 
Regards,

Laurent Pinchart


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support
  2014-07-22 12:33 [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support Laurent Pinchart
  2014-07-23  2:17 ` Kuninori Morimoto
  2014-07-23  9:48 ` [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support Laurent Pinchart
@ 2014-07-23 23:56 ` Kuninori Morimoto
  2014-07-24  8:51   ` [PATCH] ASoC: rsnd: fixup dai remove callback operation Kuninori Morimoto
  2014-07-24  0:12 ` [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support Laurent Pinchart
  3 siblings, 1 reply; 78+ messages in thread
From: Kuninori Morimoto @ 2014-07-23 23:56 UTC (permalink / raw)
  To: linux-sh


Hi Laurent

> > 1. "filter" still can't care "dma0" or "audma0"
> > 
> >  	dmac0: dma-controller@e6700000 {
> > 	..
> > 	};
> > 	dmac1: dma-controller@e6720000 {
> > 	...
> > 	};
> > 	audma0: dma-controller@ec700000 {
> > 	...
> > 	};
> > 	audma1: dma-controller@ec720000 {
> > 	...
> > 	};
> > 	audmapp: audio-dma-pp@0xec740000 {
> > 	...
> > 	};
> > 
> >   audio driver requests "audma0, 0x01",
> >   but, filter accepts it as "dmac0, 0x01"
> 
> Indeed, I've fixed the rcar-dmac driver to ignore channels handled by a 
> different driver, but not channels handled by the same driver but a different 
> device. I'll fix that.

Thank you

> By the way, I've noticed an issue with the snd_soc_rcar driver. If a 
> dma_request_slave_channel_compat() call fails in a module probe operation, the 
> rsnd_probe() function will return an error immediately
> 
>         for_each_rsnd_dai(rdai, priv, i) {
>                 ret = rsnd_dai_call(probe, &rdai->playback, rdai);
>                 if (ret)
>                         return ret;
> 
>                 ret = rsnd_dai_call(probe, &rdai->capture, rdai);
>                 if (ret)
>                         return ret;
>         }
>
> The modules that have been successfully probed are not cleaned up, so DMA 
> channels allocated by modules successfully probed are never released.

Hmm... indeed.
Thank you. I will fix.
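
One possible shape for that fix (a sketch only; the remove callback and
the second loop index are assumed, following the rsnd_dai_call()
convention shown in the quoted code):

```c
/* Sketch: unwind already-probed modules on failure so that DMA
 * channels allocated by earlier iterations are released. */
for_each_rsnd_dai(rdai, priv, i) {
	ret = rsnd_dai_call(probe, &rdai->playback, rdai);
	if (ret)
		goto err_remove;

	ret = rsnd_dai_call(probe, &rdai->capture, rdai);
	if (ret)
		goto err_remove;
}

return 0;

err_remove:
	/* Call remove on everything probed so far, up to index i. */
	for_each_rsnd_dai(rdai, priv, j) {
		if (j > i)
			break;
		rsnd_dai_call(remove, &rdai->playback, rdai);
		rsnd_dai_call(remove, &rdai->capture, rdai);
	}
	return ret;
```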

> > 2. cyclic transfer doesn't work
> > 
> >    I got attached error.
> 
> I'm not too surprised as I haven't tested cyclic DMA yet :-) I'll fix it.

Hehehe :)

Please let me know if you need my help

Best regards
---
Kuninori Morimoto

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support
  2014-07-22 12:33 [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support Laurent Pinchart
                   ` (2 preceding siblings ...)
  2014-07-23 23:56 ` Kuninori Morimoto
@ 2014-07-24  0:12 ` Laurent Pinchart
  3 siblings, 0 replies; 78+ messages in thread
From: Laurent Pinchart @ 2014-07-24  0:12 UTC (permalink / raw)
  To: linux-sh

Hi Morimoto-san,

On Wednesday 23 July 2014 16:56:56 Kuninori Morimoto wrote:

[snip]

> > > 2. cyclic transfer doesn't work
> > > 
> > >    I got attached error.
> > 
> > I'm not too surprised as I haven't tested cyclic DMA yet :-) I'll fix it.
> 
> Hehehe :)
> 
> Please let me know if you need my help

As explained in another e-mail I've run into issues with sleep operations in 
atomic context, and I believe we first need to clarify the DMA engine API 
requirements in terms of context. Your opinion would be appreciated.

-- 
Regards,

Laurent Pinchart


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: DMA engine API issue (was: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support)
  2014-07-23 11:07       ` Laurent Pinchart
  (?)
@ 2014-07-24  0:46         ` Kuninori Morimoto
  -1 siblings, 0 replies; 78+ messages in thread
From: Kuninori Morimoto @ 2014-07-24  0:46 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: Kuninori Morimoto, dmaengine, linux-sh, Magnus Damm, Linux-ALSA,
	Vinod Koul, linux-arm-kernel, Maxime Ripard


Hi Laurent

> > The rsnd_soc_dai_trigger() function takes a spinlock, making the context
> > atomic, which regmap doesn't like as it locks a mutex.
> > 
> > It might be possible to fix this by setting the fast_io field in both the
> > regmap_config and regmap_bus structures in sound/soc/sh/rcar/gen.c. regmap
> > will then use a spinlock instead of a mutex. However, even though I believe
> > that change makes sense and should be done, another atomic context issue
> > will come from the rcar-dmac driver, which allocates memory in the
> > prep_dma_cyclic function, called by rsnd_dma_start() from
> > rsnd_soc_dai_trigger() with the spinlock held.
> > 
> > What context is the rsnd_soc_dai_trigger() function called in by the ALSA
> > core? If it's guaranteed to be a sleepable context, would it make sense to
> > replace the rsnd_priv spinlock with a mutex?
> 
> Answering myself here, that's not an option, as the trigger function is called 
> in atomic context (replacing the spinlock with a mutex produced a clear BUG) 
> due to snd_pcm_lib_write1() taking a spinlock.
> 
> Now we have a core issue. On one side there's rsnd_soc_dai_trigger() being 
> called in atomic context, and on the other side the function ends up calling 
> dmaengine_prep_dma_cyclic() which needs to allocate memory. To make this more 
> fun, the DMA engine API is undocumented and completely silent on whether the 
> prep functions are allowed to sleep. The question is, who's wrong?
> 
> Now, if you're tempted to say that I'm wrong to allocate memory with 
> GFP_KERNEL in the DMA engine driver, please read on first :-) I've tried 
> replacing GFP_KERNEL with GFP_ATOMIC allocations, and then ran into a more 
> complex problem to solve.
> 
> The rcar-dmac DMA engine driver uses runtime PM. When not used, the device is 
> suspended. The driver calls pm_runtime_get_sync() to resume the device, and 
> needs to do so when a descriptor is submitted. This operation, currently 
> performed in the tx_submit handler, could be moved to the prep_dma_cyclic or 
> issue_pending handler, but the three operations are called in a row from 
> rsnd_dma_start(), itself ultimately called from snd_pcm_lib_write1() with the 
> spinlock held. This means I have no place in my DMA engine driver where I can 
> resume the device.
> 
> One could argue that the rcar-dmac driver could use a work queue to handle 
> power management. That's correct, but the additional complexity, which would 
> be required in *all* DMA engine drivers, seems too high to me. If we need to 
> go that way, this is really a task that the DMA engine core should handle.
> 
> Let's start by answering the background question and updating the DMA engine 
> API documentation once and for all: in which context are drivers allowed to 
> call the prep, tx_submit and issue_pending functions?

The rsnd driver (and the sound/soc/sh/fsi driver too) uses prep_dma_cyclic()
now, but it used prep_slave_single() before.
Back then, it called the prep function from a work queue in the dai_trigger
function.
How about using the same method with prep_dma_cyclic()?
Do you think your issue would be solved if the sound driver called
prep_dma_cyclic() from a work queue?
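
That suggestion could be sketched as follows (field and function names
are assumptions, not the actual rsnd code), moving the whole
prep/submit/issue sequence into a sleepable work item in the sound driver:

```c
/* Sketch: running the dmaengine calls from a work item instead of the
 * atomic trigger path. Field names are illustrative. */
static void rsnd_dma_start_work(struct work_struct *work)
{
	struct rsnd_dma *dma = container_of(work, struct rsnd_dma, work);
	struct dma_async_tx_descriptor *desc;

	/* Sleepable context: GFP_KERNEL allocations inside prep are fine. */
	desc = dmaengine_prep_dma_cyclic(dma->chan, dma->buf_addr,
					 dma->buf_len, dma->period_len,
					 dma->direction, DMA_PREP_INTERRUPT);
	if (!desc)
		return;

	dmaengine_submit(desc);
	dma_async_issue_pending(dma->chan);
}

/* Called from the (atomic) trigger path: only queues the work. */
static void rsnd_dma_start(struct rsnd_dma *dma)
{
	schedule_work(&dma->work);
}
```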


Best regards
---
Kuninori Morimoto

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: DMA engine API issue (was: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support)
@ 2014-07-24  0:46         ` Kuninori Morimoto
  0 siblings, 0 replies; 78+ messages in thread
From: Kuninori Morimoto @ 2014-07-24  0:46 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: Kuninori Morimoto, dmaengine, linux-sh, Magnus Damm, Linux-ALSA,
	Vinod Koul, linux-arm-kernel, Maxime Ripard


Hi Laurent

> > The rsnd_soc_dai_trigger() function takes a spinlock, making the context
> > atomic, which regmap doesn't like as it locks a mutex.
> > 
> > It might be possible to fix this by setting the fast_io field in both the
> > regmap_config and regmap_bus structures in sound/soc/sh/rcar/gen.c. regmap
> > will then use a spinlock instead of a mutex. However, even if I believe that
> > change makes sense and should be done, another atomic context issue will
> > come from the rcar-dmac driver which allocates memory in the
> > prep_dma_cyclic function, called by rsnd_dma_start() from
> > rsnd_soc_dai_trigger() with the spinlock help.
> > 
> > What context is the rsnd_soc_dai_trigger() function called in by the alsa
> > core ? If it's guaranteed to be a sleepable context, would it make sense to
> > replace the rsnd_priv spinlock with a mutex ?
> 
> Answering myself here, that's not an option, as the trigger function is called 
> in atomic context (replacing the spinlock with a mutex produced a clear BUG) 
> due to snd_pcm_lib_write1() taking a spinlock.
> 
> Now we have a core issue. On one side there's rsnd_soc_dai_trigger() being 
> called in atomic context, and on the other side the function ends up calling 
> dmaengine_prep_dma_cyclic() which needs to allocate memory. To make this more 
> func the DMA engine API is undocumented and completely silent on whether the 
> prep functions are allowed to sleep. The question is, who's wrong ?
> 
> Now, if you're tempted to say that I'm wrong by allocating memory with 
> GFP_KERNEL in the DMA engine driver, please read on first :-) I've tried 
> replacing GFP_KERNEL with GFP_ATOMIC allocations, and then ran into a problem 
> more complex to solve.
> 
> The rcar-dmac DMA engine driver uses runtime PM. When not used, the device is 
> suspended. The driver calls pm_runtime_get_sync() to resume the device, and 
> needs to do so when a descriptor is submitted. This operation, currently 
> performed in the tx_submit handler, could be moved to the prep_dma_cyclic or 
> issue_pending handler, but the three operations are called in a row from 
> rsnd_dma_start(), itself ultimately called from snd_pcm_lib_write1() with the 
> spinlock held. This means I have no place in my DMA engine driver where I can 
> resume the device.
> 
> One could argue that the rcar-dmac driver could use a work queue to handle 
> power management. That's correct, but the additional complexity, which would 
> be required in *all* DMA engine drivers, seems too high to me. If we need to go 
> that way, this is really a task that the DMA engine core should handle.
> 
> Let's start by answering the background question and updating the DMA engine 
> API documentation once and for all : in which context are drivers allowed to 
> call the prep, tx_submit and issue_pending functions ?
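For reference, the regmap change discussed in the quote above would roughly look like the following. This is only a sketch against sound/soc/sh/rcar/gen.c; the fields shown besides fast_io are placeholders standing in for the existing rsnd configuration:

```c
/* Sketch only: with fast_io set, regmap uses an internal spinlock
 * instead of a mutex, making register access legal in atomic context.
 */
static const struct regmap_config rsnd_regmap_config = {
	.reg_bits	= 32,
	.val_bits	= 32,
	.fast_io	= true,		/* spinlock instead of mutex */
	/* ... existing fields unchanged ... */
};

static const struct regmap_bus rsnd_regmap_bus = {
	.fast_io	= true,		/* must match the config */
	/* ... existing read/write callbacks ... */
};
```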

The rsnd driver (and the sound/soc/sh/fsi driver too) uses prep_dma_cyclic() now,
but it used prep_slave_single() before, called from a work queue in the
dai_trigger function.
How about using the same method with prep_dma_cyclic() ?
Do you think your issue would be solved if the sound driver called
prep_dma_cyclic() from a work queue ?
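The deferral suggested above could be sketched as follows. All names here (rsnd_dma, rsnd_dma_start_work) are hypothetical and only illustrate moving the prep/submit/issue calls out of the atomic trigger path; the work item would be set up with INIT_WORK() at stream initialisation:

```c
/* Sketch: defer the dmaengine calls that may sleep to process context. */
struct rsnd_dma {
	struct dma_chan *chan;
	struct work_struct start_work;	/* INIT_WORK()ed at init time */
	dma_addr_t buf;
	size_t buf_len;
	size_t period_len;
};

static void rsnd_dma_start_work(struct work_struct *work)
{
	struct rsnd_dma *dma = container_of(work, struct rsnd_dma, start_work);
	struct dma_async_tx_descriptor *desc;

	/* Runs in a kernel thread, not under the trigger spinlock, so
	 * GFP_KERNEL allocations inside the DMA engine driver are fine. */
	desc = dmaengine_prep_dma_cyclic(dma->chan, dma->buf, dma->buf_len,
					 dma->period_len, DMA_MEM_TO_DEV, 0);
	if (!desc)
		return;

	dmaengine_submit(desc);
	dma_async_issue_pending(dma->chan);
}

/* Called from the (atomic) trigger path: just schedule the work. */
static void rsnd_dma_start(struct rsnd_dma *dma)
{
	schedule_work(&dma->start_work);
}
```

The downside, as discussed later in the thread, is that every dmaengine client would need this machinery.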


Best regards
---
Kuninori Morimoto

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: DMA engine API issue (was: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support)
  2014-07-24  0:46         ` Kuninori Morimoto
  (?)
@ 2014-07-24  1:35           ` Kuninori Morimoto
  -1 siblings, 0 replies; 78+ messages in thread
From: Kuninori Morimoto @ 2014-07-24  1:35 UTC (permalink / raw)
  To: Kuninori Morimoto
  Cc: Laurent Pinchart, dmaengine, linux-sh, Magnus Damm, Linux-ALSA,
	Vinod Koul, linux-arm-kernel, Maxime Ripard


Hi Laurent again

> > > The rsnd_soc_dai_trigger() function takes a spinlock, making the context
> > > atomic, which regmap doesn't like as it locks a mutex.
> > > 
> > > It might be possible to fix this by setting the fast_io field in both the
> > > regmap_config and regmap_bus structures in sound/soc/sh/rcar/gen.c. regmap
> > > will then use a spinlock instead of a mutex. However, even if I believe that
> > > change makes sense and should be done, another atomic context issue will
> > > come from the rcar-dmac driver which allocates memory in the
> > > prep_dma_cyclic function, called by rsnd_dma_start() from
> > > rsnd_soc_dai_trigger() with the spinlock held.
> > > 
> > > What context is the rsnd_soc_dai_trigger() function called in by the alsa
> > > core ? If it's guaranteed to be a sleepable context, would it make sense to
> > > replace the rsnd_priv spinlock with a mutex ?
> > 
> > Answering myself here, that's not an option, as the trigger function is called 
> > in atomic context (replacing the spinlock with a mutex produced a clear BUG) 
> > due to snd_pcm_lib_write1() taking a spinlock.
> > 
> > Now we have a core issue. On one side there's rsnd_soc_dai_trigger() being 
> > called in atomic context, and on the other side the function ends up calling 
> > dmaengine_prep_dma_cyclic() which needs to allocate memory. To make this more 
> > fun, the DMA engine API is undocumented and completely silent on whether the 
> > prep functions are allowed to sleep. The question is, who's wrong ?
> > 
> > Now, if you're tempted to say that I'm wrong by allocating memory with 
> > GFP_KERNEL in the DMA engine driver, please read on first :-) I've tried 
> > replacing GFP_KERNEL with GFP_ATOMIC allocations, and then ran into a problem 
> > more complex to solve.
> > 
> > The rcar-dmac DMA engine driver uses runtime PM. When not used, the device is 
> > suspended. The driver calls pm_runtime_get_sync() to resume the device, and 
> > needs to do so when a descriptor is submitted. This operation, currently 
> > performed in the tx_submit handler, could be moved to the prep_dma_cyclic or 
> > issue_pending handler, but the three operations are called in a row from 
> > rsnd_dma_start(), itself ultimately called from snd_pcm_lib_write1() with the 
> > spinlock held. This means I have no place in my DMA engine driver where I can 
> > resume the device.
> > 
> > One could argue that the rcar-dmac driver could use a work queue to handle 
> > power management. That's correct, but the additional complexity, which would 
> > be required in *all* DMA engine drivers, seems too high to me. If we need to go 
> > that way, this is really a task that the DMA engine core should handle.
> > 
> > Let's start by answering the background question and updating the DMA engine 
> > API documentation once and for all : in which context are drivers allowed to 
> > call the prep, tx_submit and issue_pending functions ?
> 
> The rsnd driver (and the sound/soc/sh/fsi driver too) uses prep_dma_cyclic() now,
> but it used prep_slave_single() before, called from a work queue in the
> dai_trigger function.
> How about using the same method with prep_dma_cyclic() ?
> Do you think your issue would be solved if the sound driver called
> prep_dma_cyclic() from a work queue ?

Sorry, this doesn't solve the issue.
dmaengine_prep_dma_cyclic() is used in
${LINUX}/sound/core/pcm_dmaengine.c,
and the situation there is the same as ours.

Hmm..
From a quick check, other DMA engine drivers use GFP_ATOMIC
in their cyclic/prep_slave_sg functions...


Best regards
---
Kuninori Morimoto

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: DMA engine API issue (was: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support)
  2014-07-23 11:07       ` Laurent Pinchart
  (?)
@ 2014-07-24  4:52         ` Vinod Koul
  -1 siblings, 0 replies; 78+ messages in thread
From: Vinod Koul @ 2014-07-24  4:52 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: Kuninori Morimoto, dmaengine, linux-sh, Magnus Damm, Linux-ALSA,
	linux-arm-kernel, Maxime Ripard

On Wed, Jul 23, 2014 at 01:07:43PM +0200, Laurent Pinchart wrote:
> > The rsnd_soc_dai_trigger() function takes a spinlock, making the context
> > atomic, which regmap doesn't like as it locks a mutex.
> > 
> > It might be possible to fix this by setting the fast_io field in both the
> > regmap_config and regmap_bus structures in sound/soc/sh/rcar/gen.c. regmap
> > will then use a spinlock instead of a mutex. However, even if I believe that
> > change makes sense and should be done, another atomic context issue will
> > come from the rcar-dmac driver which allocates memory in the
> > prep_dma_cyclic function, called by rsnd_dma_start() from
> > rsnd_soc_dai_trigger() with the spinlock held.
> > 
> > What context is the rsnd_soc_dai_trigger() function called in by the alsa
> > core ? If it's guaranteed to be a sleepable context, would it make sense to
> > replace the rsnd_priv spinlock with a mutex ?
> 
> Answering myself here, that's not an option, as the trigger function is called 
> in atomic context (replacing the spinlock with a mutex produced a clear BUG) 
> due to snd_pcm_lib_write1() taking a spinlock.
> 
> Now we have a core issue. On one side there's rsnd_soc_dai_trigger() being 
> called in atomic context, and on the other side the function ends up calling 
> dmaengine_prep_dma_cyclic() which needs to allocate memory. To make this more 
> fun, the DMA engine API is undocumented and completely silent on whether the 
> prep functions are allowed to sleep. The question is, who's wrong ?
> 
> Now, if you're tempted to say that I'm wrong by allocating memory with 
> GFP_KERNEL in the DMA engine driver, please read on first :-) I've tried 
> replacing GFP_KERNEL with GFP_ATOMIC allocations, and then ran into a problem 
> more complex to solve.
> 
> The rcar-dmac DMA engine driver uses runtime PM. When not used, the device is 
> suspended. The driver calls pm_runtime_get_sync() to resume the device, and 
> needs to do so when a descriptor is submitted. This operation, currently 
> performed in the tx_submit handler, could be moved to the prep_dma_cyclic or 
> issue_pending handler, but the three operations are called in a row from 
> rsnd_dma_start(), itself ultimately called from snd_pcm_lib_write1() with the 
> spinlock held. This means I have no place in my DMA engine driver where I can 
> resume the device.
> 
> One could argue that the rcar-dmac driver could use a work queue to handle 
> power management. That's correct, but the additional complexity, which would 
> be required in *all* DMA engine drivers, seems too high to me. If we need to go 
> that way, this is really a task that the DMA engine core should handle.
> 
> Let's start by answering the background question and updating the DMA engine 
> API documentation once and for all : in which context are drivers allowed to 
> call the prep, tx_submit and issue_pending functions ?
I think this was brought up some time back, and we have clarified that all
_prep functions can be invoked in atomic context.

This is the reason why we have been pushing folks to use GFP_NOWAIT in
memory allocations during prepare.

Thanks for pointing out that the documentation doesn't say so; I will send a
patch for that.
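In practice that means descriptor allocation in a prep callback looks like the
following. This is a generic sketch (foo_desc and the surrounding driver are
invented for illustration), not the actual rcar-dmac code:

```c
/* Sketch: prep callbacks may run in atomic context, so descriptor
 * memory must be allocated with GFP_NOWAIT -- no sleeping, and unlike
 * GFP_ATOMIC, without dipping into the emergency reserves.
 */
static struct dma_async_tx_descriptor *
foo_prep_dma_cyclic(struct dma_chan *chan, dma_addr_t buf_addr,
		    size_t buf_len, size_t period_len,
		    enum dma_transfer_direction dir, unsigned long flags)
{
	struct foo_desc *desc;

	desc = kzalloc(sizeof(*desc), GFP_NOWAIT);	/* not GFP_KERNEL */
	if (!desc)
		return NULL;

	/* ... fill in the hardware descriptor chain ... */

	return &desc->async_tx;
}
```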

On issue_pending and tx_submit, yes these should be allowed to be called
from atomic context too.

Lastly, just to clarify: the callback invoked after a descriptor completes
can also be used to submit new descriptors, so drivers should drop their
locks before invoking the callback.
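That last point, dropping the channel lock around the completion callback,
typically looks like this in a driver's completion tasklet. Again a generic
sketch: foo_chan, foo_desc and foo_next_completed_desc() are invented names:

```c
/* Sketch: the completion callback may call prep/submit again, so the
 * channel lock must not be held while invoking it. */
static void foo_dma_tasklet(unsigned long data)
{
	struct foo_chan *chan = (struct foo_chan *)data;
	struct foo_desc *desc;
	unsigned long flags;

	spin_lock_irqsave(&chan->lock, flags);

	while ((desc = foo_next_completed_desc(chan))) {
		dma_async_tx_callback callback = desc->async_tx.callback;
		void *param = desc->async_tx.callback_param;

		/* Drop the lock: the callback may submit new descriptors,
		 * which would retake chan->lock and deadlock otherwise. */
		spin_unlock_irqrestore(&chan->lock, flags);
		if (callback)
			callback(param);
		spin_lock_irqsave(&chan->lock, flags);
	}

	spin_unlock_irqrestore(&chan->lock, flags);
}
```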

HTH

-- 
~Vinod

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: DMA engine API issue (was: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support)
  2014-07-24  1:35           ` Kuninori Morimoto
  (?)
@ 2014-07-24  4:53             ` Vinod Koul
  -1 siblings, 0 replies; 78+ messages in thread
From: Vinod Koul @ 2014-07-24  4:53 UTC (permalink / raw)
  To: Kuninori Morimoto
  Cc: Laurent Pinchart, dmaengine, linux-sh, Magnus Damm, Linux-ALSA,
	linux-arm-kernel, Maxime Ripard

On Wed, Jul 23, 2014 at 06:35:17PM -0700, Kuninori Morimoto wrote:
> 
> Hi Laurent again
> 
> > > > The rsnd_soc_dai_trigger() function takes a spinlock, making the context
> > > > atomic, which regmap doesn't like as it locks a mutex.
> > > > 
> > > > It might be possible to fix this by setting the fast_io field in both the
> > > > regmap_config and regmap_bus structures in sound/soc/sh/rcar/gen.c. regmap
> > > > will then use a spinlock instead of a mutex. However, even if I believe that
> > > > change makes sense and should be done, another atomic context issue will
> > > > come from the rcar-dmac driver which allocates memory in the
> > > > prep_dma_cyclic function, called by rsnd_dma_start() from
> > > > rsnd_soc_dai_trigger() with the spinlock held.
> > > > 
> > > > What context is the rsnd_soc_dai_trigger() function called in by the alsa
> > > > core ? If it's guaranteed to be a sleepable context, would it make sense to
> > > > replace the rsnd_priv spinlock with a mutex ?
> > > 
> > > Answering myself here, that's not an option, as the trigger function is called 
> > > in atomic context (replacing the spinlock with a mutex produced a clear BUG) 
> > > due to snd_pcm_lib_write1() taking a spinlock.
> > > 
> > > Now we have a core issue. On one side there's rsnd_soc_dai_trigger() being 
> > > called in atomic context, and on the other side the function ends up calling 
> > > dmaengine_prep_dma_cyclic() which needs to allocate memory. To make this more 
> > > fun, the DMA engine API is undocumented and completely silent on whether the 
> > > prep functions are allowed to sleep. The question is, who's wrong ?
> > > 
> > > Now, if you're tempted to say that I'm wrong by allocating memory with 
> > > GFP_KERNEL in the DMA engine driver, please read on first :-) I've tried 
> > > replacing GFP_KERNEL with GFP_ATOMIC allocations, and then ran into a problem 
> > > more complex to solve.
> > > 
> > > The rcar-dmac DMA engine driver uses runtime PM. When not used, the device is 
> > > suspended. The driver calls pm_runtime_get_sync() to resume the device, and 
> > > needs to do so when a descriptor is submitted. This operation, currently 
> > > performed in the tx_submit handler, could be moved to the prep_dma_cyclic or 
> > > issue_pending handler, but the three operations are called in a row from 
> > > rsnd_dma_start(), itself ultimately called from snd_pcm_lib_write1() with the 
> > > spinlock held. This means I have no place in my DMA engine driver where I can 
> > > resume the device.
> > > 
> > > One could argue that the rcar-dmac driver could use a work queue to handle 
> > > power management. That's correct, but the additional complexity, which would 
> > > be required in *all* DMA engine drivers, seems too high to me. If we need to go 
> > > that way, this is really a task that the DMA engine core should handle.
> > > 
> > > Let's start by answering the background question and updating the DMA engine 
> > > API documentation once and for all : in which context are drivers allowed to 
> > > call the prep, tx_submit and issue_pending functions ?
> > 
> > The rsnd driver (and the sound/soc/sh/fsi driver too) uses prep_dma_cyclic() now,
> > but it used prep_slave_single() before, called from a work queue in the
> > dai_trigger function.
> > How about using the same method with prep_dma_cyclic() ?
> > Do you think your issue would be solved if the sound driver called
> > prep_dma_cyclic() from a work queue ?
> 
> Sorry, this doesn't solve the issue.
> dmaengine_prep_dma_cyclic() is used in
> ${LINUX}/sound/core/pcm_dmaengine.c,
> and the situation there is the same as ours.
> 
> Hmm..
> From a quick check, other DMA engine drivers use GFP_ATOMIC
> in their cyclic/prep_slave_sg functions...
And that's only partially right. You need to use GFP_NOWAIT.

-- 
~Vinod

^ permalink raw reply	[flat|nested] 78+ messages in thread

* DMA engine API issue (was: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support)
@ 2014-07-24  4:53             ` Vinod Koul
  0 siblings, 0 replies; 78+ messages in thread
From: Vinod Koul @ 2014-07-24  4:53 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jul 23, 2014 at 06:35:17PM -0700, Kuninori Morimoto wrote:
> 
> Hi Laurent again
> 
> > > > The rsnd_soc_dai_trigger() function takes a spinlock, making the context
> > > > atomic, which regmap doesn't like as it locks a mutex.
> > > > 
> > > > It might be possible to fix this by setting the fast_io field in both the
> > > > regmap_config and regmap_bus structures in sound/soc/sh/rcar/gen.c. regmap
> > > > will then use a spinlock instead of a mutex. However, even if I believe that
> > > > change makes sense and should be done, another atomic context issue will
> > > > come from the rcar-dmac driver which allocates memory in the
> > > > prep_dma_cyclic function, called by rsnd_dma_start() from
> > > > rsnd_soc_dai_trigger() with the spinlock help.
> > > > 
> > > > What context is the rsnd_soc_dai_trigger() function called in by the alsa
> > > > core ? If it's guaranteed to be a sleepable context, would it make sense to
> > > > replace the rsnd_priv spinlock with a mutex ?
> > > 
> > > Answering myself here, that's not an option, as the trigger function is called 
> > > in atomic context (replacing the spinlock with a mutex produced a clear BUG) 
> > > due to snd_pcm_lib_write1() taking a spinlock.
> > > 
> > > Now we have a core issue. On one side there's rsnd_soc_dai_trigger() being 
> > > called in atomic context, and on the other side the function ends up calling 
> > > dmaengine_prep_dma_cyclic() which needs to allocate memory. To make this more 
> > > func the DMA engine API is undocumented and completely silent on whether the 
> > > prep functions are allowed to sleep. The question is, who's wrong ?
> > > 
> > > Now, if you're tempted to say that I'm wrong by allocating memory with 
> > > GFP_KERNEL in the DMA engine driver, please read on first :-) I've tried 
> > > replacing GFP_KERNEL with GFP_ATOMIC allocations, and then ran into a problem 
> > > more complex to solve.
> > > 
> > > The rcar-dmac DMA engine driver uses runtime PM. When not used, the device is 
> > > suspended. The driver calls pm_runtime_get_sync() to resume the device, and 
> > > needs to do so when a descriptor is submitted. This operation, currently 
> > > performed in the tx_submit handler, could be moved to the prep_dma_cyclic or 
> > > issue_pending handler, but the three operations are called in a row from 
> > > rsnd_dma_start(), itself ultimately called from snd_pcm_lib_write1() with the 
> > > spinlock held. This means I have no place in my DMA engine driver where I can 
> > > resume the device.
> > > 
> > > One could argue that the rcar-dmac driver could use a work queue to handle 
> > > power management. That's correct, but the additional complexity, which would 
> > > be required in *all* DMA engine drivers, seems too high to me. If we need to go 
> > > that way, this is really a task that the DMA engine core should handle.
> > > 
> > > Let's start by answering the background question and updating the DMA engine 
> > > API documentation once and for all: in which context are drivers allowed to 
> > > call the prep, tx_submit and issue_pending functions ?
> > 
> > The rsnd driver (and the sound/soc/sh/fsi driver too) is using prep_dma_cyclic() now,
> > but it used prep_slave_single() before, calling it from a work queue in the
> > dai_trigger function.
> > How about using the same method with prep_dma_cyclic() ?
> > Do you think your issue would be solved if the sound driver called
> > prep_dma_cyclic() from a work queue ?
> 
> Sorry, this doesn't solve the issue.
> dmaengine_prep_dma_cyclic() is used in
> ${LINUX}/sound/core/pcm_dmaengine.c,
> and the situation there is the same as ours.
> 
> Hmm..
> In my quick check, other DMAEngine drivers are using GFP_ATOMIC
> in cyclic/prep_slave_sg functions...
And that's partially right. You need to use GFP_NOWAIT.

-- 
~Vinod

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: DMA engine API issue (was: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support)
@ 2014-07-24  4:52         ` Vinod Koul
  0 siblings, 0 replies; 78+ messages in thread
From: Vinod Koul @ 2014-07-24  4:58 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: Kuninori Morimoto, dmaengine, linux-sh, Magnus Damm, Linux-ALSA,
	linux-arm-kernel, Maxime Ripard

On Wed, Jul 23, 2014 at 01:07:43PM +0200, Laurent Pinchart wrote:
> > The rsnd_soc_dai_trigger() function takes a spinlock, making the context
> > atomic, which regmap doesn't like as it locks a mutex.
> > 
> > It might be possible to fix this by setting the fast_io field in both the
> > regmap_config and regmap_bus structures in sound/soc/sh/rcar/gen.c. regmap
> > will then use a spinlock instead of a mutex. However, even if I believe that
> > change makes sense and should be done, another atomic context issue will
> > come from the rcar-dmac driver which allocates memory in the
> > prep_dma_cyclic function, called by rsnd_dma_start() from
> > rsnd_soc_dai_trigger() with the spinlock held.
> > 
> > What context is the rsnd_soc_dai_trigger() function called in by the alsa
> > core ? If it's guaranteed to be a sleepable context, would it make sense to
> > replace the rsnd_priv spinlock with a mutex ?
> 
> Answering myself here, that's not an option, as the trigger function is called 
> in atomic context (replacing the spinlock with a mutex produced a clear BUG) 
> due to snd_pcm_lib_write1() taking a spinlock.
> 
> Now we have a core issue. On one side there's rsnd_soc_dai_trigger() being 
> called in atomic context, and on the other side the function ends up calling 
> dmaengine_prep_dma_cyclic() which needs to allocate memory. To make this more 
> fun, the DMA engine API is undocumented and completely silent on whether the 
> prep functions are allowed to sleep. The question is, who's wrong ?
> 
> Now, if you're tempted to say that I'm wrong by allocating memory with 
> GFP_KERNEL in the DMA engine driver, please read on first :-) I've tried 
> replacing GFP_KERNEL with GFP_ATOMIC allocations, and then ran into a problem 
> more complex to solve.
> 
> The rcar-dmac DMA engine driver uses runtime PM. When not used, the device is 
> suspended. The driver calls pm_runtime_get_sync() to resume the device, and 
> needs to do so when a descriptor is submitted. This operation, currently 
> performed in the tx_submit handler, could be moved to the prep_dma_cyclic or 
> issue_pending handler, but the three operations are called in a row from 
> rsnd_dma_start(), itself ultimately called from snd_pcm_lib_write1() with the 
> spinlock held. This means I have no place in my DMA engine driver where I can 
> resume the device.
> 
> One could argue that the rcar-dmac driver could use a work queue to handle 
> power management. That's correct, but the additional complexity, which would 
> be required in *all* DMA engine drivers, seems too high to me. If we need to go 
> that way, this is really a task that the DMA engine core should handle.
> 
> Let's start by answering the background question and updating the DMA engine 
> API documentation once and for all: in which context are drivers allowed to 
> call the prep, tx_submit and issue_pending functions ?
I think this was brought up some time back and we have clarified that all
_prep functions can be invoked in atomic context.

This is the reason why we have been pushing folks to use GFP_NOWAIT in
memory allocations during prepare.

Thanks for pointing out that the documentation doesn't say so; I will send a
patch for that.

On issue_pending and tx_submit, yes these should be allowed to be called
from atomic context too.

Lastly, just to clarify: the callback invoked after a descriptor is complete
can also be used to submit new descriptors, so folks are dropping locks
before invoking the callback.

HTH

-- 
~Vinod

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH] ASoC: rsnd: fixup dai remove callback operation
  2014-07-23 23:56 ` Kuninori Morimoto
@ 2014-07-24  8:51   ` Kuninori Morimoto
  2014-07-25 17:50     ` Mark Brown
  0 siblings, 1 reply; 78+ messages in thread
From: Kuninori Morimoto @ 2014-07-24  8:51 UTC (permalink / raw)
  To: Mark Brown
  Cc: Linux-ALSA, Simon, Liam Girdwood, Kuninori Morimoto, Laurent Pinchart

From: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>

The rsnd driver uses SSI/SRC/DVC, which are based on "mod" operations.
These "mod"s support "probe" and "remove" callbacks.

rsnd_probe() should call "remove" if "probe" failed, since "probe" might
already hold a DMAEngine handle. Some mod's "remove" callback might then
be called without "probe" having been called, but that is not a problem,
because "remove" does nothing in that case.

So all mods' "remove" callbacks should be called in the error path of
rsnd_probe() and in rsnd_remove().

Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
---
 sound/soc/sh/rcar/core.c |   22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/sound/soc/sh/rcar/core.c b/sound/soc/sh/rcar/core.c
index 907d480..1b4d8700 100644
--- a/sound/soc/sh/rcar/core.c
+++ b/sound/soc/sh/rcar/core.c
@@ -1041,11 +1041,11 @@ static int rsnd_probe(struct platform_device *pdev)
 	for_each_rsnd_dai(rdai, priv, i) {
 		ret = rsnd_dai_call(probe, &rdai->playback, rdai);
 		if (ret)
-			return ret;
+			goto exit_snd_probe;
 
 		ret = rsnd_dai_call(probe, &rdai->capture, rdai);
 		if (ret)
-			return ret;
+			goto exit_snd_probe;
 	}
 
 	/*
@@ -1073,6 +1073,11 @@ static int rsnd_probe(struct platform_device *pdev)
 
 exit_snd_soc:
 	snd_soc_unregister_platform(dev);
+exit_snd_probe:
+	for_each_rsnd_dai(rdai, priv, i) {
+		rsnd_dai_call(remove, &rdai->playback, rdai);
+		rsnd_dai_call(remove, &rdai->capture, rdai);
+	}
 
 	return ret;
 }
@@ -1081,21 +1086,16 @@ static int rsnd_remove(struct platform_device *pdev)
 {
 	struct rsnd_priv *priv = dev_get_drvdata(&pdev->dev);
 	struct rsnd_dai *rdai;
-	int ret, i;
+	int ret = 0, i;
 
 	pm_runtime_disable(&pdev->dev);
 
 	for_each_rsnd_dai(rdai, priv, i) {
-		ret = rsnd_dai_call(remove, &rdai->playback, rdai);
-		if (ret)
-			return ret;
-
-		ret = rsnd_dai_call(remove, &rdai->capture, rdai);
-		if (ret)
-			return ret;
+		ret |= rsnd_dai_call(remove, &rdai->playback, rdai);
+		ret |= rsnd_dai_call(remove, &rdai->capture, rdai);
 	}
 
-	return 0;
+	return ret;
 }
 
 static struct platform_driver rsnd_driver = {
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [alsa-devel] DMA engine API issue
  2014-07-23 11:07       ` Laurent Pinchart
  (?)
@ 2014-07-24 12:29         ` Lars-Peter Clausen
  -1 siblings, 0 replies; 78+ messages in thread
From: Lars-Peter Clausen @ 2014-07-24 12:29 UTC (permalink / raw)
  To: Laurent Pinchart, Kuninori Morimoto
  Cc: Linux-ALSA, linux-sh, Vinod Koul, Magnus Damm, dmaengine,
	Maxime Ripard, linux-arm-kernel

On 07/23/2014 01:07 PM, Laurent Pinchart wrote:
[...]
> Let's start by answering the background question and updating the DMA engine
> API documentation once and for good : in which context are drivers allowed to
> call the prep, tx_submit and issue_pending functions ?
>

I think the expectation is that these functions can be called in any 
context. Maybe what's missing is a way to tell the DMA engine driver to get 
ready and that it is going to be used very soon. This could be done from the 
sound devices open() callback.

- Lars

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: DMA engine API issue (was: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support)
  2014-07-23 11:07       ` Laurent Pinchart
  (?)
@ 2014-07-24 12:51         ` Russell King - ARM Linux
  -1 siblings, 0 replies; 78+ messages in thread
From: Russell King - ARM Linux @ 2014-07-24 12:51 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: Kuninori Morimoto, Linux-ALSA, linux-sh, Vinod Koul, Magnus Damm,
	dmaengine, Maxime Ripard, linux-arm-kernel

On Wed, Jul 23, 2014 at 01:07:43PM +0200, Laurent Pinchart wrote:
> Now we have a core issue. On one side there's rsnd_soc_dai_trigger() being 
> called in atomic context, and on the other side the function ends up calling 
> dmaengine_prep_dma_cyclic() which needs to allocate memory. To make this more 
> fun, the DMA engine API is undocumented and completely silent on whether the 
> prep functions are allowed to sleep. The question is, who's wrong ?

For slave DMA drivers, there is the expectation that the prepare
functions will be callable from tasklet context, without any locks
held by the driver.  So, it's expected that the prepare functions
will work from tasklet context.

I don't think we've ever specified whether they should be callable
from interrupt context, but in practice, we have drivers which do
exactly that, so I think the decision has already been made - they
will be callable from IRQ context, and so GFP_ATOMIC is required
in the driver.

> The rcar-dmac DMA engine driver uses runtime PM. When not used, the device is 
> suspended. The driver calls pm_runtime_get_sync() to resume the device, and 
> needs to do so when a descriptor is submitted. This operation, currently 
> performed in the tx_submit handler, could be moved to the prep_dma_cyclic or 
> issue_pending handler, but the three operations are called in a row from 
> rsnd_dma_start(), itself ultimately called from snd_pcm_lib_write1() with the 
> spinlock held. This means I have no place in my DMA engine driver where I can 
> resume the device.

Right, runtime PM with DMA engine drivers is hard.  The best that can
be done right now is to pm_runtime_get() in the alloc_chan_resources()
method and put it in free_chan_resources() if you don't want to do the
workqueue thing.

There's a problem with the workqueue thing though - by doing so, you
make it asynchronous to the starting of the DMA.  The DMA engine API
allows for delayed starting (it's actually the normal thing for DMA
engine), but that may not always be appropriate or desirable.

> One could argue that the rcar-dmac driver could use a work queue to
> handle power management. That's correct, but the additional complexity,
> which would be required in *all* DMA engine drivers, seem too high to
> me. If we need to go that way, this is really a task that the DMA
> engine core should handle.

As I mention above, the problem with that is getting the workqueue to
run soon enough that it doesn't cause a performance degradation or
other issues.

There's also expectations from other code - OMAP for example explicitly
needs DMA to be started on the hardware before the audio block can be
enabled (from what I remember, it tickles an erratum if this is not
done.)

> Let's start by answering the background question and updating the DMA engine 
> API documentation once and for good : in which context are drivers allowed to 
> call the prep, tx_submit and issue_pending functions ?

IRQs-off contexts. :)

-- 
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] ASoC: rsnd: fixup dai remove callback operation
  2014-07-24  8:51   ` [PATCH] ASoC: rsnd: fixup dai remove callback operation Kuninori Morimoto
@ 2014-07-25 17:50     ` Mark Brown
  0 siblings, 0 replies; 78+ messages in thread
From: Mark Brown @ 2014-07-25 17:50 UTC (permalink / raw)
  To: Kuninori Morimoto
  Cc: Linux-ALSA, Simon, Liam Girdwood, Kuninori Morimoto, Laurent Pinchart


[-- Attachment #1.1: Type: text/plain, Size: 293 bytes --]

On Thu, Jul 24, 2014 at 01:51:31AM -0700, Kuninori Morimoto wrote:
> From: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
> 
> The rsnd driver uses SSI/SRC/DVC, which are based on "mod" operations.
> These "mod"s support "probe" and "remove" callbacks.

Applied, thanks.

[-- Attachment #1.2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: DMA engine API issue (was: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support)
  2014-07-24  4:52         ` Vinod Koul
  (?)
@ 2014-08-01  8:51           ` Laurent Pinchart
  -1 siblings, 0 replies; 78+ messages in thread
From: Laurent Pinchart @ 2014-08-01  8:51 UTC (permalink / raw)
  To: Vinod Koul
  Cc: Linux-ALSA, Russell King - ARM Linux, linux-sh, Magnus Damm,
	dmaengine, Maxime Ripard, Kuninori Morimoto, linux-arm-kernel

Hi Vinod,

On Thursday 24 July 2014 10:22:48 Vinod Koul wrote:
> On Wed, Jul 23, 2014 at 01:07:43PM +0200, Laurent Pinchart wrote:
> > > The rsnd_soc_dai_trigger() function takes a spinlock, making the context
> > > atomic, which regmap doesn't like as it locks a mutex.
> > > 
> > > It might be possible to fix this by setting the fast_io field in both
> > > the regmap_config and regmap_bus structures in sound/soc/sh/rcar/gen.c.
> > > regmap will then use a spinlock instead of a mutex. However, even if I
> > > believe that change makes sense and should be done, another atomic
> > > context issue will come from the rcar-dmac driver which allocates memory
> > > in the prep_dma_cyclic function, called by rsnd_dma_start() from
> > > rsnd_soc_dai_trigger() with the spinlock held.
> > > 
> > > What context is the rsnd_soc_dai_trigger() function called in by the
> > > alsa core ? If it's guaranteed to be a sleepable context, would it make
> > > sense to replace the rsnd_priv spinlock with a mutex ?
> > 
> > Answering myself here, that's not an option, as the trigger function is
> > called in atomic context (replacing the spinlock with a mutex produced a
> > clear BUG) due to snd_pcm_lib_write1() taking a spinlock.
> > 
> > Now we have a core issue. On one side there's rsnd_soc_dai_trigger() being
> > called in atomic context, and on the other side the function ends up
> > calling dmaengine_prep_dma_cyclic() which needs to allocate memory. To
> > make this more fun, the DMA engine API is undocumented and completely
> > silent on whether the prep functions are allowed to sleep. The question
> > is, who's wrong ?
> > 
> > Now, if you're tempted to say that I'm wrong by allocating memory with
> > GFP_KERNEL in the DMA engine driver, please read on first :-) I've tried
> > replacing GFP_KERNEL with GFP_ATOMIC allocations, and then ran into a
> > problem more complex to solve.
> > 
> > The rcar-dmac DMA engine driver uses runtime PM. When not used, the device
> > is suspended. The driver calls pm_runtime_get_sync() to resume the
> > device, and needs to do so when a descriptor is submitted. This
> > operation, currently performed in the tx_submit handler, could be moved
> > to the prep_dma_cyclic or issue_pending handler, but the three operations
> > are called in a row from rsnd_dma_start(), itself ultimately called from
> > snd_pcm_lib_write1() with the spinlock held. This means I have no place
> > in my DMA engine driver where I can resume the device.
> > 
> > One could argue that the rcar-dmac driver could use a work queue to handle
> > power management. That's correct, but the additional complexity, which
> > would be required in *all* DMA engine drivers, seems too high to me. If we
> > need to go that way, this is really a task that the DMA engine core
> > should handle.
> > 
> > Let's start by answering the background question and updating the DMA
> > engine API documentation once and for all: in which context are drivers
> > allowed to call the prep, tx_submit and issue_pending functions ?
> 
> I think this was bought up sometime back and we have cleared that all _prep
> functions can be invoked in atomic context.
> 
> This is the reason why we have been pushing folks to use GFP_NOWAIT is
> memory allocations during prepare.

From the replies I've received it's pretty clear that the prep functions need 
to be callable from atomic context. I'll respond to this in more depth in a 
reply to Russell's e-mail.

> Thanks for pointing out documentation doesn't say so, will send a patch for
> that.

I wish that was all that is missing from the documentation ;-) Luckily Maxime 
Ripard has sent a patch that documents DMA engine from a DMA engine driver's 
point of view. While not perfect (I'm going to review it), it's a nice 
starting point to (hopefully) get to a properly documented framework.

> On issue_pending and tx_submit, yes these should be allowed to be called
> from atomic context too.

I'll take this opportunity to question why we have a separation between 
tx_submit and issue_pending. What's the rationale for that, especially given 
that dma_issue_pending_all() might kick in at any point and issue pending 
transfers for all devices. A driver could thus see its submitted but not 
issued transactions being issued before it explicitly calls 
dma_async_issue_pending().

The DMA_PRIVATE capability flag seems to play a role here, but it's far from 
being clear how that mechanism is supposed to work. This should be documented 
as well, and any light you could shed on this dark corner of the API would 
help.

Similarly, the DMA engine API is split into functions with different prefixes 
(mostly dmaengine_*, dma_async_* and dma_*), and various unprefixed niceties 
such as async_tx_ack or txd_lock. If there's a rationale for that (beyond just 
historical reasons) it should also be documented, otherwise a cleanup would 
help all the confused DMA engine users (myself included). I might be able to 
find a bit of time to work on that, but I'll first need to correctly 
understand where we come from and where we are. Again, information would be 
welcome and fully appreciated.

> Lastly, just to clarify the callback invoked after descriptor is complete
> can also be used to submit new descriptors, so folks are dropping locking
> before invoking the callback

-- 
Regards,

Laurent Pinchart

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: DMA engine API issue (was: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support)
  2014-07-24 12:51         ` Russell King - ARM Linux
  (?)
@ 2014-08-01  9:24           ` Laurent Pinchart
  -1 siblings, 0 replies; 78+ messages in thread
From: Laurent Pinchart @ 2014-08-01  9:24 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Kuninori Morimoto, Linux-ALSA, linux-sh, Vinod Koul, Magnus Damm,
	dmaengine, Maxime Ripard, linux-arm-kernel

Hi Russell,

On Thursday 24 July 2014 13:51:21 Russell King - ARM Linux wrote:
> On Wed, Jul 23, 2014 at 01:07:43PM +0200, Laurent Pinchart wrote:
> > Now we have a core issue. On one side there's rsnd_soc_dai_trigger() being
> > called in atomic context, and on the other side the function ends up
> > calling dmaengine_prep_dma_cyclic() which needs to allocate memory. To
> > make this more fun, the DMA engine API is undocumented and completely
> > silent on whether the prep functions are allowed to sleep. The question
> > is, who's wrong ?
>
> For slave DMA drivers, there is the expectation that the prepare functions
> will be callable from tasklet context, without any locks held by the driver. 
> So, it's expected that the prepare functions will work from tasklet context.
> 
> I don't think we've ever specified whether they should be callable from
> interrupt context, but in practice, we have drivers which do exactly that,
> so I think the decision has already been made - they will be callable from
> IRQ context, and so GFP_ATOMIC is required in the driver.

I agree with you, the decision has made itself to use allocation routines that 
will not sleep. However,

$ grep -r GFP_NOWAIT drivers/dma | wc -l
22
$ grep -r GFP_ATOMIC drivers/dma | wc -l
24

Looks like a draw to me :-) I'm pretty sure most of the decisions to use 
GFP_NOWAIT or GFP_ATOMIC are cases of cargo-cult programming instead of 
resulting from a thoughtful process. We should document this in Maxime's new 
DMA engine internals documentation.

This being said, I wonder whether allowing the prep functions to be called in 
atomic context was a sane decision. We've pushed the whole DMA engine API to 
atomic context, leading to such niceties as memory allocation pressure and the 
impossibility to implement runtime PM support without resorting to a work 
queue for the sole purpose of power management, in *every* DMA engine driver. 
I can't help thinking something is broken.

I'd like your opinion on that. Whether fixing it would be worth the effort is 
a different question, so we could certainly conclude that we'll have to live 
with an imperfect design for the time being.

> > The rcar-dmac DMA engine driver uses runtime PM. When not used, the device
> > is suspended. The driver calls pm_runtime_get_sync() to resume the
> > device, and needs to do so when a descriptor is submitted. This
> > operation, currently performed in the tx_submit handler, could be moved
> > to the prep_dma_cyclic or issue_pending handler, but the three operations
> > are called in a row from rsnd_dma_start(), itself ultimately called from
> > snd_pcm_lib_write1() with the spinlock held. This means I have no place
> > in my DMA engine driver where I can resume the device.
> 
> Right, runtime PM with DMA engine drivers is hard. The best that can be done
> right now is to pm_runtime_get() in the alloc_chan_resources() method and
> put it in free_chan_resources() if you don't want to do the workqueue thing.

That's an easy enough implementation, but given that channels are usually 
requested at probe time and released at remove time, that would be roughly 
equivalent to no PM at all.

> There's a problem with the workqueue thing though - by doing so, you make it
> asynchronous to the starting of the DMA. The DMA engine API allows for
> delayed starting (it's actually the normal thing for DMA engine), but that
> may not always be appropriate or desirable.

If I'm not mistaken, the DMA engine API doesn't provide a way to synchronously 
start DMA transfers. An implementation could do so, but no guarantee is 
offered by the API to the caller. I agree, however, that it could be an issue.

Doesn't this call for a new pair of open/close-like functions in the API ? 
They would be called right before a client driver starts using a channel, and 
right after it stops using it. Those functions would be allowed to sleep.

Beside simplifying power management, those functions could also be used for 
lazy hardware channel allocation. Renesas SoCs have DMA engines that include 
general-purpose channels, usable by any slave. There are more slaves than 
hardware channels, so not all slaves can be used at the same time. At the 
moment the hardware channel is allocated when requesting the DMA engine 
channel, at probe time for most slave drivers. This breaks when too many 
slaves get registered, even if they're not all used at the same time. Some 
kind of lazy/delayed allocation scheme would be useful.

Another option would be to request the DMA engine channel later than probe 
time, when the channel will actually be used. However, that would break 
deferred probing, and would possibly degrade performance.

> > One could argue that the rcar-dmac driver could use a work queue to
> > handle power management. That's correct, but the additional complexity,
> > which would be required in *all* DMA engine drivers, seems too high to
> > me. If we need to go that way, this is really a task that the DMA
> > engine core should handle.
> 
> As I mention above, the problem with that is getting the workqueue to run
> soon enough that it doesn't cause performance degradation or other issues.

That's why I liked the ability to sleep in the prep functions ;-)

> There's also expectations from other code - OMAP for example explicitly
> needs DMA to be started on the hardware before the audio block can be
> enabled (from what I remember, it tickles an erratum if this is not done.)

Nice. That pretty much bans the usage of a workqueue then, we need something 
else.

> > Let's start by answering the background question and updating the DMA
> > engine API documentation once and for good : in which context are drivers
> > allowed to call the prep, tx_submit and issue_pending functions ?
> 
> IRQs-off contexts. :)

-- 
Regards,

Laurent Pinchart

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: DMA engine API issue (was: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support)
  2014-08-01  8:51           ` Laurent Pinchart
  (?)
@ 2014-08-01 14:30             ` Russell King - ARM Linux
  -1 siblings, 0 replies; 78+ messages in thread
From: Russell King - ARM Linux @ 2014-08-01 14:30 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: Vinod Koul, Kuninori Morimoto, dmaengine, linux-sh, Magnus Damm,
	Linux-ALSA, linux-arm-kernel, Maxime Ripard

On Fri, Aug 01, 2014 at 10:51:26AM +0200, Laurent Pinchart wrote:
> I'll take this opportunity to question why we have a separation between 
> tx_submit and issue_pending. What's the rationale for that, especially given 
> that dma_issue_pending_all() might kick in at any point and issue pending 
> transfers for all devices. A driver could thus see its submitted but not 
> issued transactions being issued before it explicitly calls 
> dma_async_issue_pending().

A prepared but not submitted transaction is not a pending transaction.

The split is necessary so that a callback can be attached to the
transaction.  This partially comes from the async-tx API, and also
gets a lot of use with the slave API.

The prepare function allocates the descriptor and does the initial
setup, but does not mark the descriptor as a pending transaction.
It returns the descriptor, and the caller is then free to add a
callback function and data pointer to the descriptor before finally
submitting it.  This sequence must occur in a timely manner as some
DMA engine implementations hold a lock between the prepare and submit
callbacks (Dan explicitly permits this as part of the API.)
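In code, the sequence described above looks roughly like this against the
slave API (a sketch only: channel setup, DMA mapping and error handling are
elided, and my_dma_complete/my_data are placeholder names):

```c
/* Sketch of the prep -> attach callback -> submit -> issue sequence.
 * Assumes "chan" was obtained earlier via the slave API and buf/len
 * describe a DMA-mapped buffer. */
struct dma_async_tx_descriptor *desc;
dma_cookie_t cookie;

desc = dmaengine_prep_slave_single(chan, buf, len, DMA_MEM_TO_DEV,
				   DMA_PREP_INTERRUPT);
if (!desc)
	return -ENOMEM;

/* The window between prepare and submit: attach the completion
 * callback here.  Keep this window short - the engine may hold a
 * lock across prepare/submit, as noted above. */
desc->callback = my_dma_complete;
desc->callback_param = my_data;

cookie = dmaengine_submit(desc);	/* transaction is now pending */
dma_async_issue_pending(chan);		/* tell the engine to start it */
```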

> The DMA_PRIVATE capability flag seems to play a role here, but it's far from 
> being clear how that mechanism is supposed to work. This should be documented 
> as well, and any light you could shed on this dark corner of the API would 
> help.

Why do you think that DMA_PRIVATE has a bearing on the callbacks?  It
doesn't.  DMA_PRIVATE is all about channel allocation as I explained
yesterday, and whether the channel is available for async_tx usage.

A channel marked DMA_PRIVATE is not available for async_tx usage at
any moment.  A channel without DMA_PRIVATE is available for async_tx
usage until it is allocated for the slave API - at which point the
generic DMA engine code will mark the channel with DMA_PRIVATE,
thereby taking it away from async_tx API usage.  When the slave API
frees the channel, DMA_PRIVATE will be cleared, making the channel
available for async_tx usage.

So, basically, DMA_PRIVATE set -> async_tx usage not allowed.
DMA_PRIVATE clear -> async_tx usage permitted.  It really is that
simple.

> Similarly, the DMA engine API is split in functions with different
> prefixes (mostly dmaengine_*, dma_async_*, dma_*, and various
> unprefixed niceties such as async_tx_ack or txd_lock). If there's a
> rationale for that (beyond just historical reasons) it should also
> be documented, otherwise a cleanup would help all the confused DMA
> engine users (myself included).

dmaengine_* are generally the interface functions to the DMA engine code,
which have been recently introduced to avoid the multiple levels of
pointer indirection having to be typed in every driver.

dma_async_* are the DMA engine interface functions for the async_tx API.

dma_* predate the dmaengine_* naming, and are probably too generic, so
should probably end up being renamed to dmaengine_*.

txd_* are all about the DMA engine descriptors.

async_tx_* are the higher level async_tx API functions.

-- 
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.


* Re: DMA engine API issue (was: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support)
  2014-08-01  8:51           ` Laurent Pinchart
@ 2014-08-01 17:07             ` Vinod Koul
  -1 siblings, 0 replies; 78+ messages in thread
From: Vinod Koul @ 2014-08-01 17:07 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: Kuninori Morimoto, dmaengine, linux-sh, Magnus Damm, Linux-ALSA,
	linux-arm-kernel, Maxime Ripard, Russell King - ARM Linux

On Fri, Aug 01, 2014 at 10:51:26AM +0200, Laurent Pinchart wrote:
> Hi Vinod,
> 
> On Thursday 24 July 2014 10:22:48 Vinod Koul wrote:
> > On Wed, Jul 23, 2014 at 01:07:43PM +0200, Laurent Pinchart wrote:
> > > > The rsnd_soc_dai_trigger() function takes a spinlock, making the context
> > > > atomic, which regmap doesn't like as it locks a mutex.
> > > > 
> > > > It might be possible to fix this by setting the fast_io field in both
> > > > the regmap_config and regmap_bus structures in sound/soc/sh/rcar/gen.c.
> > > > regmap will then use a spinlock instead of a mutex. However, even if I
> > > > believe that change makes sense and should be done, another atomic
> > > > context issue will come from the rcar-dmac driver which allocates memory
> > > > in the prep_dma_cyclic function, called by rsnd_dma_start() from
> > > > rsnd_soc_dai_trigger() with the spinlock held.
> > > > 
> > > > What context is the rsnd_soc_dai_trigger() function called in by the
> > > > alsa core ? If it's guaranteed to be a sleepable context, would it make
> > > > sense to replace the rsnd_priv spinlock with a mutex ?
> > > 
> > > Answering myself here, that's not an option, as the trigger function is
> > > called in atomic context (replacing the spinlock with a mutex produced a
> > > clear BUG) due to snd_pcm_lib_write1() taking a spinlock.
> > > 
> > > Now we have a core issue. On one side there's rsnd_soc_dai_trigger() being
> > > called in atomic context, and on the other side the function ends up
> > > calling dmaengine_prep_dma_cyclic() which needs to allocate memory. To
> > > make this more fun, the DMA engine API is undocumented and completely
> > > silent on whether the prep functions are allowed to sleep. The question
> > > is, who's wrong ?
> > > 
> > > Now, if you're tempted to say that I'm wrong by allocating memory with
> > > GFP_KERNEL in the DMA engine driver, please read on first :-) I've tried
> > > replacing GFP_KERNEL with GFP_ATOMIC allocations, and then ran into a
> > > problem more complex to solve.
> > > 
> > > The rcar-dmac DMA engine driver uses runtime PM. When not used, the device
> > > is suspended. The driver calls pm_runtime_get_sync() to resume the
> > > device, and needs to do so when a descriptor is submitted. This
> > > operation, currently performed in the tx_submit handler, could be moved
> > > to the prep_dma_cyclic or issue_pending handler, but the three operations
> > > are called in a row from rsnd_dma_start(), itself ultimately called from
> > > snd_pcm_lib_write1() with the spinlock held. This means I have no place
> > > in my DMA engine driver where I can resume the device.
> > > 
> > > One could argue that the rcar-dmac driver could use a work queue to handle
> > > power management. That's correct, but the additional complexity, which
> > > would be required in *all* DMA engine drivers, seem too high to me. If we
> > > need to go that way, this is really a task that the DMA engine core
> > > should handle.
> > > 
> > > Let's start by answering the background question and updating the DMA
> > > engine API documentation once and for all : in which context are drivers
> > > allowed to call the prep, tx_submit and issue_pending functions ?
> > 
> > I think this was brought up some time back, and we have clarified that all
> > _prep functions can be invoked in atomic context.
> > 
> > This is the reason why we have been pushing folks to use GFP_NOWAIT in
> > memory allocations during prepare.
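In a driver, that guidance boils down to something like the following sketch
(the foo_* driver and its foo_desc structure are hypothetical; only the
allocation matters here):

```c
static struct dma_async_tx_descriptor *
foo_prep_dma_cyclic(struct dma_chan *chan, dma_addr_t buf_addr,
		    size_t buf_len, size_t period_len,
		    enum dma_transfer_direction dir, unsigned long flags)
{
	struct foo_desc *desc;

	/* prep handlers may run in atomic context, so never GFP_KERNEL
	 * here.  GFP_NOWAIT fails fast instead of sleeping and, unlike
	 * GFP_ATOMIC, does not dip into the emergency reserves. */
	desc = kzalloc(sizeof(*desc), GFP_NOWAIT);
	if (!desc)
		return NULL;

	/* ... build the hardware descriptor chain ... */
	return &desc->txd;
}
```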
> 
> From the replies I've received it's pretty clear that the prep functions need 
> to be callable from atomic context. I'll respond to this in more depth in a 
> reply to Russell's e-mail.
> 
> > Thanks for pointing out documentation doesn't say so, will send a patch for
> > that.
> 
> I wish that was all that is missing from the documentation ;-) Luckily Maxime 
> Ripard has sent a patch that documents DMA engine from a DMA engine driver's 
> point of view. While not perfect (I'm going to review it), it's a nice 
> starting point to (hopefully) get to a properly documented framework.
Sure, please do point out any other instances.

Russell did improve it a lot, but the documentation doesn't get updated very
quickly. New users like you can spot issues far more easily than long-time
developers.

The commit log also provides a very valuable record of why a particular
thing was done.

> > from atomic context too.
> 
> I'll take this opportunity to question why we have a separation between 
> tx_submit and issue_pending. What's the rationale for that, especially given 
> that dma_issue_pending_all() might kick in at any point and issue pending 
> transfers for all devices. A driver could thus see its submitted but not 
> issued transactions being issued before it explicitly calls 
> dma_async_issue_pending().
The API states that you need to get a channel, then prepare a descriptor
and submit it back. Descriptors can be prepared in any order; the submit
order is the order in which they run on the DMA engine. Submitting marks
the descriptor as pending, and the driver typically pushes it onto a
pending_list.

Lastly, invoke dma_async_issue_pending() to start the pending ones.

You have the flexibility to prepare descriptors and issue them in the order
you like. You can also attach the callbacks required for the descriptors here.


> 
> The DMA_PRIVATE capability flag seems to play a role here, but it's far from 
> being clear how that mechanism is supposed to work. This should be documented 
> as well, and any light you could shed on this dark corner of the API would 
> help.
Ah, it is not so dark.

If you look closely at dmaengine channel allocation, the flag only marks
whether a channel is to be used privately or for async_tx. Thus slave
drivers must set DMA_PRIVATE.

> 
> Similarly, the DMA engine API is split in functions with different prefixes 
> (mostly dmaengine_*, dma_async_*, dma_*, and various unprefixed niceties such 
> as async_tx_ack or txd_lock). If there's a rationale for that (beyond just 
> historical reasons) it should also be documented, otherwise a cleanup would 
> help all the confused DMA engine users (myself included). I might be able to 
> find a bit of time to work on that, but I'll first need to correctly 
> understand where we come from and where we are. Again, information would be 
> welcome and fully appreciated.
History. The DMA engine was developed for async_tx (hence the async_tx
prefix).

Most of the dmaengine APIs are prefixed dmaengine_; the async_* ones are
specifically for async_tx usage, and the txd_* ones are descriptor related.

HTH

-- 
~Vinod


* Re: DMA engine API issue (was: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support)
  2014-08-01 14:30             ` Russell King - ARM Linux
@ 2014-08-01 17:09               ` Vinod Koul
  -1 siblings, 0 replies; 78+ messages in thread
From: Vinod Koul @ 2014-08-01 17:09 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Laurent Pinchart, Kuninori Morimoto, dmaengine, linux-sh,
	Magnus Damm, Linux-ALSA, linux-arm-kernel, Maxime Ripard

On Fri, Aug 01, 2014 at 03:30:20PM +0100, Russell King - ARM Linux wrote:
> On Fri, Aug 01, 2014 at 10:51:26AM +0200, Laurent Pinchart wrote:
> > I'll take this opportunity to question why we have a separation between 
> > tx_submit and issue_pending. What's the rationale for that, especially given 
> > that dma_issue_pending_all() might kick in at any point and issue pending 
> > transfers for all devices. A driver could thus see its submitted but not 
> > issued transactions being issued before it explicitly calls 
> > dma_async_issue_pending().
> 
> A prepared but not submitted transaction is not a pending transaction.
> 
> The split is necessary so that a callback can be attached to the
> transaction.  This partially comes from the async-tx API, and also
> gets a lot of use with the slave API.
> 
> The prepare function allocates the descriptor and does the initial
> setup, but does not mark the descriptor as a pending transaction.
> It returns the descriptor, and the caller is then free to add a
> callback function and data pointer to the descriptor before finally
> submitting it.  This sequence must occur in a timely manner as some
> DMA engine implementations hold a lock between the prepare and submit
> callbacks (Dan explicitly permits this as part of the API.)
> 
> > The DMA_PRIVATE capability flag seems to play a role here, but it's far from 
> > being clear how that mechanism is supposed to work. This should be documented 
> > as well, and any light you could shed on this dark corner of the API would 
> > help.
> 
> Why do you think that DMA_PRIVATE has a bearing on the callbacks?  It
> doesn't.  DMA_PRIVATE is all about channel allocation as I explained
> yesterday, and whether the channel is available for async_tx usage.
> 
> A channel marked DMA_PRIVATE is not available for async_tx usage at
> any moment.  A channel without DMA_PRIVATE is available for async_tx
> usage until it is allocated for the slave API - at which point the
> generic DMA engine code will mark the channel with DMA_PRIVATE,
> thereby taking it away from async_tx API usage.  When the slave API
> frees the channel, DMA_PRIVATE will be cleared, making the channel
> available for async_tx usage.
> 
> So, basically, DMA_PRIVATE set -> async_tx usage not allowed.
> DMA_PRIVATE clear -> async_tx usage permitted.  It really is that
> simple.
> 
> > Similarly, the DMA engine API is split in functions with different
> > prefixes (mostly dmaengine_*, dma_async_*, dma_*, and various
> > unprefixed niceties such as async_tx_ack or txd_lock). If there's a
> > rationale for that (beyond just historical reasons) it should also
> > be documented, otherwise a cleanup would help all the confused DMA
> > engine users (myself included).
> 
> dmaengine_* are generally the interface functions to the DMA engine code,
> which have been recently introduced to avoid the multiple levels of
> pointer indirection having to be typed in every driver.
> 
> dma_async_* are the DMA engine interface functions for the async_tx API.
> 
> dma_* predate the dmaengine_* naming, and are probably too generic, so
> should probably end up being renamed to dmaengine_*.
> 
> txd_* are all about the DMA engine descriptors.
> 
> async_tx_* are the higher level async_tx API functions.

Ah, looks like I repeated your good answers. I should have read all the
replies first.

-- 
~Vinod


* Re: DMA engine API issue (was: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support)
@ 2014-08-01 17:07             ` Vinod Koul
  0 siblings, 0 replies; 78+ messages in thread
From: Vinod Koul @ 2014-08-01 17:19 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: Kuninori Morimoto, dmaengine, linux-sh, Magnus Damm, Linux-ALSA,
	linux-arm-kernel, Maxime Ripard, Russell King - ARM Linux

On Fri, Aug 01, 2014 at 10:51:26AM +0200, Laurent Pinchart wrote:
> Hi Vinod,
> 
> On Thursday 24 July 2014 10:22:48 Vinod Koul wrote:
> > On Wed, Jul 23, 2014 at 01:07:43PM +0200, Laurent Pinchart wrote:
> > > > The rsnd_soc_dai_trigger() function takes a spinlock, making the context
> > > > atomic, which regmap doesn't like as it locks a mutex.
> > > > 
> > > > It might be possible to fix this by setting the fast_io field in both
> > > > the regmap_config and regmap_bus structures in sound/soc/sh/rcar/gen.c.
> > > > regmap will then use a spinlock instead of a mutex. However, even if I
> > > > believe that change makes sense and should be done, another atomic
> > > > context issue will come from the rcar-dmac driver which allocates memory
> > > > in the prep_dma_cyclic function, called by rsnd_dma_start() from
> > > > rsnd_soc_dai_trigger() with the spinlock held.
> > > > 
> > > > What context is the rsnd_soc_dai_trigger() function called in by the
> > > > alsa core ? If it's guaranteed to be a sleepable context, would it make
> > > > sense to replace the rsnd_priv spinlock with a mutex ?
> > > 
> > > Answering myself here, that's not an option, as the trigger function is
> > > called in atomic context (replacing the spinlock with a mutex produced a
> > > clear BUG) due to snd_pcm_lib_write1() taking a spinlock.
> > > 
> > > Now we have a core issue. On one side there's rsnd_soc_dai_trigger() being
> > > called in atomic context, and on the other side the function ends up
> > > calling dmaengine_prep_dma_cyclic() which needs to allocate memory. To
> > > make this more fun, the DMA engine API is undocumented and completely
> > > silent on whether the prep functions are allowed to sleep. The question
> > > is, who's wrong ?
> > > 
> > > Now, if you're tempted to say that I'm wrong by allocating memory with
> > > GFP_KERNEL in the DMA engine driver, please read on first :-) I've tried
> > > replacing GFP_KERNEL with GFP_ATOMIC allocations, and then ran into a
> > > problem more complex to solve.
> > > 
> > > The rcar-dmac DMA engine driver uses runtime PM. When not used, the device
> > > is suspended. The driver calls pm_runtime_get_sync() to resume the
> > > device, and needs to do so when a descriptor is submitted. This
> > > operation, currently performed in the tx_submit handler, could be moved
> > > to the prep_dma_cyclic or issue_pending handler, but the three operations
> > > are called in a row from rsnd_dma_start(), itself ultimately called from
> > > snd_pcm_lib_write1() with the spinlock held. This means I have no place
> > > in my DMA engine driver where I can resume the device.
> > > 
> > > One could argue that the rcar-dmac driver could use a work queue to handle
> > > power management. That's correct, but the additional complexity, which
> > > would be required in *all* DMA engine drivers, seems too high to me. If we
> > > need to go that way, this is really a task that the DMA engine core
> > > should handle.
> > > 
> > > Let's start by answering the background question and updating the DMA
> > > engine API documentation once and for good : in which context are drivers
> > > allowed to call the prep, tx_submit and issue_pending functions ?
> > 
> > I think this was brought up some time back and we have clarified that all _prep
> > functions can be invoked in atomic context.
> > 
> > This is the reason why we have been pushing folks to use GFP_NOWAIT in
> > memory allocations during prepare.
> 
> From the replies I've received it's pretty clear that the prep functions need 
> to be callable from atomic context. I'll respond to this in more depth in a 
> reply to Russell's e-mail.
> 
> > Thanks for pointing out documentation doesn't say so, will send a patch for
> > that.
> 
> I wish that was all that is missing from the documentation ;-) Luckily Maxime 
> Ripard has sent a patch that documents DMA engine from a DMA engine driver's 
> point of view. While not perfect (I'm going to review it), it's a nice 
> starting point to (hopefully) get to a properly documented framework.
Sure, please do point out any other instance.

Russell did improve it a lot, but the documentation doesn't get updated very
quickly. New users like you can spot issues far more easily than others.

Also, the commit log now provides a very valuable source of information on
why a particular thing was done.

> > from atomic context too.
> 
> I'll take this opportunity to question why we have a separation between 
> tx_submit and issue_pending. What's the rationale for that, especially given 
> that dma_issue_pending_all() might kick in at any point and issue pending 
> transfers for all devices. A driver could thus see its submitted but not 
> issued transactions being issued before it explicitly calls 
> dma_async_issue_pending().
The API states that you need to get a channel, then prepare a descriptor
and submit it back. Prepares can happen in any order; the submit order is
the one that runs on the dmaengine. The submit marks the descriptor as
pending, and typically you should have a pending_list that the descriptor
is pushed to.

And lastly invoke dma_async_issue_pending() to start the pending ones.

You have the flexibility to prepare descriptors and issue in the order you
like. You can also attach the callback required for descriptors here.
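To make the sequence above concrete, a slave DMA user might write roughly the following. This is a hedged sketch, not code from this thread: `buf_dma`, `len`, `my_dma_complete` and `my_data` are made-up names, and error handling is trimmed.

```c
struct dma_chan *chan;
struct dma_async_tx_descriptor *desc;
dma_cookie_t cookie;

chan = dma_request_slave_channel(dev, "tx");	/* get a channel */
if (!chan)
	return -ENODEV;

/* prepare a descriptor (here a single slave transfer to the device) */
desc = dmaengine_prep_slave_single(chan, buf_dma, len,
				   DMA_MEM_TO_DEV, DMA_PREP_INTERRUPT);
if (!desc)
	return -ENOMEM;

/* attach the completion callback before submitting */
desc->callback = my_dma_complete;
desc->callback_param = my_data;

cookie = dmaengine_submit(desc);		/* marks it as pending */
if (dma_submit_error(cookie))
	return -EIO;

dma_async_issue_pending(chan);			/* actually start pending ones */
```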


> 
> The DMA_PRIVATE capability flag seems to play a role here, but it's far from 
> being clear how that mechanism is supposed to work. This should be documented 
> as well, and any light you could shed on this dark corner of the API would 
> help.
Ah, it is not so dark.

If you look closely at dmaengine channel allocation, the flag only marks
whether a channel is to be used privately or for async_tx. Thus slave
devices must set DMA_PRIVATE.
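For illustration, this is roughly how a slave-only DMA controller driver opts out of async_tx when registering its channels. A sketch only; `priv` and its `dma_dev` member are a hypothetical driver structure.

```c
struct dma_device *ddev = &priv->dma_dev;	/* hypothetical driver data */
int ret;

dma_cap_zero(ddev->cap_mask);
dma_cap_set(DMA_SLAVE, ddev->cap_mask);		/* slave transfers only */
dma_cap_set(DMA_PRIVATE, ddev->cap_mask);	/* never offered to async_tx */

ret = dma_async_device_register(ddev);		/* channels are now slave-only */
```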

> 
> Similarly, the DMA engine API is split in functions with different prefixes 
> (mostly dmaengine_*, dma_async_*, dma_*, and various unprefixed niceties such 
> as async_tx_ack or txd_lock. If there's a rationale for that (beyond just 
> historical reasons) it should also be documented, otherwise a cleanup would 
> help all the confused DMA engine users (myself included). I might be able to 
> find a bit of time to work on that, but I'll first need to correctly 
> understand where we come from and where we are. Again, information would be 
> welcome and fully appreciated.
History. DMA engine was developed for async_tx (hence the async_tx prefix).

I think most of the dmaengine APIs are dmaengine_*; the async_* ones are
specifically for async_tx usage, and the txd_* ones are descriptor related.

HTH

-- 
~Vinod

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: DMA engine API issue (was: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support)
  2014-08-01 14:30             ` Russell King - ARM Linux
  (?)
@ 2014-08-04 13:47               ` Geert Uytterhoeven
  -1 siblings, 0 replies; 78+ messages in thread
From: Geert Uytterhoeven @ 2014-08-04 13:47 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Laurent Pinchart, Vinod Koul, Kuninori Morimoto, dmaengine,
	Linux-sh list, Magnus Damm, Linux-ALSA, linux-arm-kernel,
	Maxime Ripard

Hi Russell,

On Fri, Aug 1, 2014 at 4:30 PM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Fri, Aug 01, 2014 at 10:51:26AM +0200, Laurent Pinchart wrote:
>> I'll take this opportunity to question why we have a separation between
>> tx_submit and issue_pending. What's the rationale for that, especially given
>> that dma_issue_pending_all() might kick in at any point and issue pending
>> transfers for all devices. A driver could thus see its submitted but not
>> issued transactions being issued before it explicitly calls
>> dma_async_issue_pending().
>
> A prepared but not submitted transaction is not a pending transaction.
>
> The split is necessary so that a callback can be attached to the
> transaction.  This partially comes from the async-tx API, and also
> gets a lot of use with the slave API.
>
> The prepare function allocates the descriptor and does the initial
> setup, but does not mark the descriptor as a pending transaction.
> It returns the descriptor, and the caller is then free to add a
> callback function and data pointer to the descriptor before finally
> submitting it.  This sequence must occur in a timely manner as some
> DMA engine implementations hold a lock between the prepare and submit
> callbacks (Dan explicitly permits this as part of the API.)

I think you misunderstood the question: Laurent asked about
dmaengine_submit() (step 2) and dma_async_issue_pending() (step 3),
while your answer is about dmaengine_prep_slave_*() (step 1) and
dmaengine_submit() (step 2).

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: DMA engine API issue (was: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support)
  2014-08-01 17:07             ` Vinod Koul
  (?)
@ 2014-08-04 16:50               ` Laurent Pinchart
  -1 siblings, 0 replies; 78+ messages in thread
From: Laurent Pinchart @ 2014-08-04 16:50 UTC (permalink / raw)
  To: Vinod Koul
  Cc: Kuninori Morimoto, dmaengine, linux-sh, Magnus Damm, Linux-ALSA,
	linux-arm-kernel, Maxime Ripard, Russell King - ARM Linux

Hi Vinod,

On Friday 01 August 2014 22:37:44 Vinod Koul wrote:
> On Fri, Aug 01, 2014 at 10:51:26AM +0200, Laurent Pinchart wrote:
> > On Thursday 24 July 2014 10:22:48 Vinod Koul wrote:
> >> On Wed, Jul 23, 2014 at 01:07:43PM +0200, Laurent Pinchart wrote:
> >>>> The rsnd_soc_dai_trigger() function takes a spinlock, making the
> >>>> context atomic, which regmap doesn't like as it locks a mutex.
> >>>> 
> >>>> It might be possible to fix this by setting the fast_io field in
> >>>> both the regmap_config and regmap_bus structures in
> >>>> sound/soc/sh/rcar/gen.c.
> >>>> regmap will then use a spinlock instead of a mutex. However, even if
> >>>> I believe that change makes sense and should be done, another atomic
> >>>> context issue will come from the rcar-dmac driver which allocates
> >>>> memory in the prep_dma_cyclic function, called by rsnd_dma_start()
> >>>> from rsnd_soc_dai_trigger() with the spinlock held.
> >>>> 
> >>>> What context is the rsnd_soc_dai_trigger() function called in by the
> >>>> alsa core ? If it's guaranteed to be a sleepable context, would it
> >>>> make sense to replace the rsnd_priv spinlock with a mutex ?
> >>> 
> >>> Answering myself here, that's not an option, as the trigger function
> >>> is called in atomic context (replacing the spinlock with a mutex
> >>> produced a clear BUG) due to snd_pcm_lib_write1() taking a spinlock.
> >>> 
> >>> Now we have a core issue. On one side there's rsnd_soc_dai_trigger()
> >>> being called in atomic context, and on the other side the function
> >>> ends up calling dmaengine_prep_dma_cyclic() which needs to allocate
> >>> memory. To make this more fun, the DMA engine API is undocumented and
> >>> completely silent on whether the prep functions are allowed to sleep.
> >>> The question is, who's wrong ?
> >>> 
> >>> Now, if you're tempted to say that I'm wrong by allocating memory with
> >>> GFP_KERNEL in the DMA engine driver, please read on first :-) I've
> >>> tried replacing GFP_KERNEL with GFP_ATOMIC allocations, and then ran
> >>> into a problem more complex to solve.
> >>> 
> >>> The rcar-dmac DMA engine driver uses runtime PM. When not used, the
> >>> device is suspended. The driver calls pm_runtime_get_sync() to resume
> >>> the device, and needs to do so when a descriptor is submitted. This
> >>> operation, currently performed in the tx_submit handler, could be
> >>> moved to the prep_dma_cyclic or issue_pending handler, but the three
> >>> operations are called in a row from rsnd_dma_start(), itself
> >>> ultimately called from snd_pcm_lib_write1() with the spinlock held.
> >>> This means I have no place in my DMA engine driver where I can resume
> >>> the device.
> >>> 
> >>> One could argue that the rcar-dmac driver could use a work queue to
> >>> handle power management. That's correct, but the additional
> >>> complexity, which would be required in *all* DMA engine drivers, seems
> >>> too high to me. If we need to go that way, this is really a task that
> >>> the DMA engine core should handle.
> >>> 
> >>> Let's start by answering the background question and updating the DMA
> >>> engine API documentation once and for good : in which context are
> >>> drivers allowed to call the prep, tx_submit and issue_pending
> >>> functions ?
> >> 
> >> I think this was brought up some time back and we have clarified that all
> >> _prep functions can be invoked in atomic context.
> >> 
> >> This is the reason why we have been pushing folks to use GFP_NOWAIT in
> >> memory allocations during prepare.
> > 
> > From the replies I've received it's pretty clear that the prep functions
> > need to be callable from atomic context. I'll respond to this in more
> > depth in a reply to Russell's e-mail.
> > 
> > > Thanks for pointing out documentation doesn't say so, will send a patch
> > > for that.
> > 
> > I wish that was all that is missing from the documentation ;-) Luckily
> > Maxime Ripard has sent a patch that documents DMA engine from a DMA
> > engine driver's point of view. While not perfect (I'm going to review
> > it), it's a nice starting point to (hopefully) get to a properly
> > documented framework.
>
> Sure, please do point out any other instance.

Don't worry, I will :-)

> Russell did improve it a lot, but the documentation doesn't get updated very
> quickly. New users like you can spot issues far more easily than others.
> 
> Also, the commit log now provides a very valuable source of information on
> why a particular thing was done.

Come on. That's the theory (and we should of course aim for it). We all know 
the difference between theory and practice : in theory they're the same.

More seriously speaking, I've dived into the git history more times than I 
should have to retain some mental sanity, and it leaves way too many questions 
unanswered.

> > > from atomic context too.
> > 
> > I'll take this opportunity to question why we have a separation between
> > tx_submit and issue_pending. What's the rationale for that, especially
> > given that dma_issue_pending_all() might kick in at any point and issue
> > pending transfers for all devices. A driver could thus see its submitted
> > but not issued transactions being issued before it explicitly calls
> > dma_async_issue_pending().
> 
> The API states that you need to get a channel, then prepare a descriptor
> and submit it back. Prepares can happen in any order; the submit order is
> the one that runs on the dmaengine. The submit marks the descriptor as
> pending, and typically you should have a pending_list that the descriptor
> is pushed to.
> 
> And lastly invoke dma_async_issue_pending() to start the pending ones.
> 
> You have the flexibility to prepare descriptors and issue in the order you
> like. You can also attach the callback required for descriptors here.

The question was why is there a dma_async_issue_pending() operation at all ? 
Why can't dmaengine_submit() trigger the transfer start ? The only 
explanation is a small comment in dmaengine.h that states

 * This allows drivers to push copies to HW in batches,
 * reducing MMIO writes where possible.

I don't think that's applicable for DMA slave transfers. Is it still 
applicable for anything else ?

Quoting git log, the reason is

commit c13c8260da3155f2cefb63b0d1b7dcdcb405c644
Author: Chris Leech <christopher.leech@intel.com>
Date:   Tue May 23 17:18:44 2006 -0700

    [I/OAT]: DMA memcpy subsystem
    
    Provides an API for offloading memory copies to DMA devices
    
    Signed-off-by: Chris Leech <christopher.leech@intel.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

;-)

> > The DMA_PRIVATE capability flag seems to play a role here, but it's far
> > from being clear how that mechanism is supposed to work. This should be
> > documented as well, and any light you could shed on this dark corner of
> > the API would help.
> 
> Ah, it is not so dark.
> 
> If you look closely at dmaengine channel allocation, the flag only marks
> whether a channel is to be used privately or for async_tx. Thus slave
> devices must set DMA_PRIVATE.

In order to avoid scattering one topic across multiple mails, I'll pursue this 
one in a reply to Russell.

> > Similarly, the DMA engine API is split in functions with different
> > prefixes (mostly dmaengine_*, dma_async_*, dma_*, and various unprefixed
> > niceties such as async_tx_ack or txd_lock. If there's a rationale for that
> > (beyond just historical reasons) it should also be documented, otherwise a
> > cleanup would help all the confused DMA engine users (myself included). I
> > might be able to find a bit of time to work on that, but I'll first need
> > to correctly understand where we come from and where we are. Again,
> > information would be welcome and fully appreciated.
> 
> History. DMA engine was developed for async_tx (hence the async_tx prefix).
> 
> I think most of the dmaengine APIs are dmaengine_*; the async_* ones are
> specifically for async_tx usage, and the txd_* ones are descriptor related.

Ditto here.

-- 
Regards,

Laurent Pinchart


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: DMA engine API issue (was: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support)
@ 2014-08-04 16:50               ` Laurent Pinchart
  0 siblings, 0 replies; 78+ messages in thread
From: Laurent Pinchart @ 2014-08-04 16:50 UTC (permalink / raw)
  To: Vinod Koul
  Cc: Kuninori Morimoto, dmaengine, linux-sh, Magnus Damm, Linux-ALSA,
	linux-arm-kernel, Maxime Ripard, Russell King - ARM Linux

Hi Vinod,

On Friday 01 August 2014 22:37:44 Vinod Koul wrote:
> On Fri, Aug 01, 2014 at 10:51:26AM +0200, Laurent Pinchart wrote:
> > On Thursday 24 July 2014 10:22:48 Vinod Koul wrote:
> >> On Wed, Jul 23, 2014 at 01:07:43PM +0200, Laurent Pinchart wrote:
> >>>> The rsnd_soc_dai_trigger() function takes a spinlock, making the
> >>>> context atomic, which regmap doesn't like as it locks a mutex.
> >>>> 
> >>>> It might be possible to fix this by setting the fast_io field in
> >>>> both the regmap_config and regmap_bus structures in
> >>>> sound/soc/sh/rcar/gen.c.
> >>>> regmap will then use a spinlock instead of a mutex. However, even if
> >>>> I believe that change makes sense and should be done, another atomic
> >>>> context issue will come from the rcar-dmac driver which allocates
> >>>> memory in the prep_dma_cyclic function, called by rsnd_dma_start()
> >>>> from rsnd_soc_dai_trigger() with the spinlock help.
> >>>> 
> >>>> What context is the rsnd_soc_dai_trigger() function called in by the
> >>>> alsa core ? If it's guaranteed to be a sleepable context, would it
> >>>> make sense to replace the rsnd_priv spinlock with a mutex ?
> >>> 
> >>> Answering myself here, that's not an option, as the trigger function
> >>> is called in atomic context (replacing the spinlock with a mutex
> >>> produced a clear BUG) due to snd_pcm_lib_write1() taking a spinlock.
> >>> 
> >>> Now we have a core issue. On one side there's rsnd_soc_dai_trigger()
> >>> being called in atomic context, and on the other side the function
> >>> ends up calling dmaengine_prep_dma_cyclic() which needs to allocate
> >>> memory. To make this more func the DMA engine API is undocumented and
> >>> completely silent on whether the prep functions are allowed to sleep.
> >>> The question is, who's wrong ?
> >>> 
> >>> Now, if you're tempted to say that I'm wrong by allocating memory with
> >>> GFP_KERNEL in the DMA engine driver, please read on first :-) I've
> >>> tried replacing GFP_KERNEL with GFP_ATOMIC allocations, and then ran
> >>> into a problem more complex to solve.
> >>> 
> >>> The rcar-dmac DMA engine driver uses runtime PM. When not used, the
> >>> device is suspended. The driver calls pm_runtime_get_sync() to resume
> >>> the device, and needs to do so when a descriptor is submitted. This
> >>> operation, currently performed in the tx_submit handler, could be
> >>> moved to the prep_dma_cyclic or issue_pending handler, but the three
> >>> operations are called in a row from rsnd_dma_start(), itself
> >>> ultimately called from snd_pcm_lib_write1() with the spinlock held.
> >>> This means I have no place in my DMA engine driver where I can resume
> >>> the device.
> >>> 
> >>> One could argue that the rcar-dmac driver could use a work queue to
> >>> handle power management. That's correct, but the additional
> >>> complexity, which would be required in *all* DMA engine drivers, seem
> >>> too high to me. If we need to go that way, this is really a task that
> >>> the DMA engine core should handle.
> >>> 
> >>> Let's start by answering the background question and updating the DMA
> >>> engine API documentation once and for good : in which context are
> >>> drivers allowed to call the prep, tx_submit and issue_pending
> >>> functions ?
> >> 
> >> I think this was bought up sometime back and we have cleared that all
> >> _prep functions can be invoked in atomic context.
> >> 
> >> This is the reason why we have been pushing folks to use GFP_NOWAIT is
> >> memory allocations during prepare.
> > 
> > From the replies I've received it's pretty clear that the prep functions
> > need to be callable from atomic context. I'll respond to this in more
> > depth in a reply to Russell's e-mail.
> > 
> > > Thanks for pointing out documentation doesn't say so, will send a patch
> > > for that.
> > 
> > I wish that was all that is missing from the documentation ;-) Luckily
> > Maxime Ripard has sent a patch that documents DMA engine from a DMA
> > engine driver's point of view. While not perfect (I'm going to review
> > it), it's a nice starting point to (hopefully) get to a properly
> > documented framework.
>
> Sure, please do point out any other instance.

Don't worry, I will :-)

> Russell did improve it a lot. But then Documentation gets not so quick
> updates. Yes new users like you can list issues far more easily than others.
> 
> Also now commit log provides a very valuable source of why a particular
> thing was done.

Come on. That's the theory (and we should of course aim for it). We all know 
the difference between theory and practice : in theory they're the same.

More seriously speaking, I've dived into the git history more times than I 
should have to retain some mental sanity, and it leaves way too many questions 
unanswered.

> > > from atomic context too.
> > 
> > I'll take this opportunity to question why we have a separation between
> > tx_submit and issue_pending. What's the rationale for that, especially
> > given that dma_issue_pending_all() might kick in at any point and issue
> > pending transfers for all devices. A driver could thus see its submitted
> > but not issued transactions being issued before it explicitly calls
> > dma_async_issue_pending().
> 
> The  API states that you need to get a channel, then prepare a descriptor
> and submit it back. Prepare can be in any order. The submit order is the one
> which is run on dmaengine. The submit marks the descriptor as pending.
> Typically you should have a pending_list which the descriptor should be
> pushed to.
> 
> And lastly invoke dma_async_issue_pending() to start the pending ones.
> 
> You have the flexibility to prepare descriptors and issue in the order you
> like. You can also attach the callback required for descriptors here.

The question was why is there a dma_async_issue_pending() operation at all ? 
Why can't dmaengine_submit() triggers the transfer start ? The only 
explanation is a small comment in dmaengine.h that states

 * This allows drivers to push copies to HW in batches,
 * reducing MMIO writes where possible.

I don't think that's applicable for DMA slave transfers. Is it still 
applicable for anything else ?

Quoting git log, the reason is

commit c13c8260da3155f2cefb63b0d1b7dcdcb405c644
Author: Chris Leech <christopher.leech@intel.com>
Date:   Tue May 23 17:18:44 2006 -0700

    [I/OAT]: DMA memcpy subsystem
    
    Provides an API for offloading memory copies to DMA devices
    
    Signed-off-by: Chris Leech <christopher.leech@intel.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

;-)

> > The DMA_PRIVATE capability flag seems to play a role here, but it's far
> > from being clear how that mechanism is supposed to work. This should be
> > documented as well, and any light you could shed on this dark corner of
> > the API would help.
> 
> Ah it is not so dark.
> 
> if you look closely at dmaengine channel allocation it is only for marking
> if channel is privately to be used or for async_tx.
> Thus slave devices must set DMA_PRIVATE

In order to avoid scattering one topic across multiple mails, I'll pursue this 
one in a reply to Russell.

> > Similarly, the DMA engine API is split in functions with different
> > prefixes (mostly dmaengine_*, dma_async_*, dma_*, and various unprefixed
> > niceties such as async_tx_ack or txd_lock). If there's a rationale for that
> > (beyond just historical reasons) it should also be documented, otherwise a
> > cleanup would help all the confused DMA engine users (myself included). I
> > might be able to find a bit of time to work on that, but I'll first need
> > to correctly understand where we come from and where we are. Again,
> > information would be welcome and fully appreciated.
> 
> History. DMA engine was developed for async_tx. (hence async_tx)
> 
> I think most of dmaengine APIs are dmaengine_. the async_ ones are
> specifically for async_tx usage.
> txd ones are descriptor related.

Ditto here.

-- 
Regards,

Laurent Pinchart


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: DMA engine API issue (was: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support)
  2014-08-01 14:30             ` Russell King - ARM Linux
  (?)
@ 2014-08-04 17:00               ` Laurent Pinchart
  -1 siblings, 0 replies; 78+ messages in thread
From: Laurent Pinchart @ 2014-08-04 17:00 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Vinod Koul, Kuninori Morimoto, dmaengine, linux-sh, Magnus Damm,
	Linux-ALSA, linux-arm-kernel, Maxime Ripard

Hi Russell,

On Friday 01 August 2014 15:30:20 Russell King - ARM Linux wrote:
> On Fri, Aug 01, 2014 at 10:51:26AM +0200, Laurent Pinchart wrote:
> > I'll take this opportunity to question why we have a separation between
> > tx_submit and issue_pending. What's the rationale for that, especially
> > given that dma_issue_pending_all() might kick in at any point and issue
> > pending transfers for all devices. A driver could thus see its submitted
> > but not issued transactions being issued before it explicitly calls
> > dma_async_issue_pending().
> 
> A prepared but not submitted transaction is not a pending transaction.
> 
> The split is necessary so that a callback can be attached to the
> transaction.  This partially comes from the async-tx API, and also
> gets a lot of use with the slave API.
> 
> The prepare function allocates the descriptor and does the initial
> setup, but does not mark the descriptor as a pending transaction.
> It returns the descriptor, and the caller is then free to add a
> callback function and data pointer to the descriptor before finally
> submitting it.

No disagreement on that. However, as Geert pointed out, my question was 
related to the split between dmaengine_submit() and dma_async_issue_pending(), 
not between the prep_* functions and dmaengine_submit().

>  This sequence must occur in a timely manner as some DMA engine
> implementations hold a lock between the prepare and submit callbacks (Dan
> explicitly permits this as part of the API.)

That really triggers a red alarm in the part of my brain that deals with API 
design, but I suppose it would be too difficult to change that.

> > The DMA_PRIVATE capability flag seems to play a role here, but it's far
> > from being clear how that mechanism is supposed to work. This should be
> > documented as well, and any light you could shed on this dark corner of
> > the API would help.
> 
> Why do you think that DMA_PRIVATE has a bearing in the callbacks? It
> doesn't.

Not on callbacks, but on how pending descriptors are pushed to the hardware. 
The flag is explicitly checked in dma_issue_pending_all().

> DMA_PRIVATE is all about channel allocation as I explained yesterday, and
> whether the channel is available for async_tx usage.
> 
> A channel marked DMA_PRIVATE is not available for async_tx usage at
> any moment.  A channel without DMA_PRIVATE is available for async_tx
> usage until it is allocated for the slave API - at which point the
> generic DMA engine code will mark the channel with DMA_PRIVATE,
> thereby taking it away from async_tx API usage.  When the slave API
> frees the channel, DMA_PRIVATE will be cleared, making the channel
> available for async_tx usage.
> 
> So, basically, DMA_PRIVATE set -> async_tx usage not allowed.
> DMA_PRIVATE clear -> async_tx usage permitted.  It really is that
> simple.

DMA_PRIVATE is a dma_device flag, not a dma_chan flag. As soon as one channel 
is allocated by __dma_request_channel() the whole device is marked with 
DMA_PRIVATE, making all channels private. What am I missing ?

> > Similarly, the DMA engine API is split in functions with different
> > prefixes (mostly dmaengine_*, dma_async_*, dma_*, and various
> > unprefixed niceties such as async_tx_ack or txd_lock). If there's a
> > rationale for that (beyond just historical reasons) it should also
> > be documented, otherwise a cleanup would help all the confused DMA
> > engine users (myself included).
> 
> dmaengine_* are generally the interface functions to the DMA engine code,
> which have been recently introduced to avoid the multiple levels of
> pointer indirection having to be typed in every driver.
> 
> dma_async_* are the DMA engine interface functions for the async_tx API.
> 
> dma_* predate the dmaengine_* naming, and are probably too generic, so
> should probably end up being renamed to dmaengine_*.

Thank you for the confirmation. I'll see if I can cook up a patch. It will 
likely be pretty large and broad, but I guess there's no way around it.

> txd_* are all about the DMA engine descriptors.
> 
> async_tx_* are the higher level async_tx API functions.

Thank you for the information. How about the dma_async_* functions, should 
they be renamed to dmaengine_* as well ? Or are some of them part of the 
async_tx_* API ?

-- 
Regards,

Laurent Pinchart


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: DMA engine API issue (was: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support)
  2014-08-04 17:00               ` Laurent Pinchart
  (?)
@ 2014-08-04 17:54                 ` Russell King - ARM Linux
  -1 siblings, 0 replies; 78+ messages in thread
From: Russell King - ARM Linux @ 2014-08-04 17:54 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: Vinod Koul, Kuninori Morimoto, dmaengine, linux-sh, Magnus Damm,
	Linux-ALSA, linux-arm-kernel, Maxime Ripard

On Mon, Aug 04, 2014 at 07:00:36PM +0200, Laurent Pinchart wrote:
> Hi Russell,
> 
> On Friday 01 August 2014 15:30:20 Russell King - ARM Linux wrote:
> >  This sequence must occur in a timely manner as some DMA engine
> > implementations hold a lock between the prepare and submit callbacks (Dan
> > explicitly permits this as part of the API.)
> 
> That really triggers a red alarm in the part of my brain that deals with API 
> design, but I suppose it would be too difficult to change that.

Mine too, but there's not a lot which can be done about it without
changing a lot of users.

> > > The DMA_PRIVATE capability flag seems to play a role here, but it's far
> > > from being clear how that mechanism is supposed to work. This should be
> > > documented as well, and any light you could shed on this dark corner of
> > > the API would help.
> > 
> > Why do you think that DMA_PRIVATE has a bearing in the callbacks? It
> > doesn't.
> 
> Not on callbacks, but on how pending descriptors are pushed to the hardware. 
> The flag is explicitly checked in dma_issue_pending_all().

Right.  So, let me put a question to you - what do you think is the
effect of the check in dma_issue_pending_all()?

I'll give you a hint - disregard the comment at the top of the function,
because that's out of date.

> > DMA_PRIVATE is all about channel allocation as I explained yesterday, and
> > whether the channel is available for async_tx usage.
> > 
> > A channel marked DMA_PRIVATE is not available for async_tx usage at
> > any moment.  A channel without DMA_PRIVATE is available for async_tx
> > usage until it is allocated for the slave API - at which point the
> > generic DMA engine code will mark the channel with DMA_PRIVATE,
> > thereby taking it away from async_tx API usage.  When the slave API
> > frees the channel, DMA_PRIVATE will be cleared, making the channel
> > available for async_tx usage.
> > 
> > So, basically, DMA_PRIVATE set -> async_tx usage not allowed.
> > DMA_PRIVATE clear -> async_tx usage permitted.  It really is that
> > simple.
> 
> DMA_PRIVATE is a dma_device flag, not a dma_chan flag. As soon as one channel 
> is allocated by __dma_request_channel() the whole device is marked with 
> DMA_PRIVATE, making all channels private. What am I missing ?

I can't answer that - I don't know why the previous authors decided to
make it a DMA-device wide property - presumably there are DMA controllers
where this matters.

However, one thing to realise is that a dma_device is a virtual concept -
it is a set of channels which share a common set of properties.  It is not
a physical device.  It is entirely reasonable for a set of channels on a
physical device to be shared between two different dma_device instances
and handed out by the driver code as needed.

> > > Similarly, the DMA engine API is split in functions with different
> > > prefixes (mostly dmaengine_*, dma_async_*, dma_*, and various
> > > unprefixed niceties such as async_tx_ack or txd_lock). If there's a
> > > rationale for that (beyond just historical reasons) it should also
> > > be documented, otherwise a cleanup would help all the confused DMA
> > > engine users (myself included).
> > 
> > dmaengine_* are generally the interface functions to the DMA engine code,
> > which have been recently introduced to avoid the multiple levels of
> > pointer indirection having to be typed in every driver.
> > 
> > dma_async_* are the DMA engine interface functions for the async_tx API.
> > 
> > dma_* predate the dmaengine_* naming, and are probably too generic, so
> > should probably end up being renamed to dmaengine_*.
> 
> Thank you for the confirmation. I'll see if I can cook up a patch. It will 
> likely be pretty large and broad though, but I guess there's no way around it. 
> 
> > txd_* are all about the DMA engine descriptors.
> > 
> > async_tx_* are the higher level async_tx API functions.
> 
> Thank you for the information. How about the dma_async_* functions, should 
> they be renamed to dmaengine_* as well ? Or are some of them part of the 
> async_tx_* API ?

Well, these:

dma_async_device_register
dma_async_device_unregister
dma_async_tx_descriptor_init

are more DMA engine core <-> DMA engine driver interface functions than
user functions.  The remainder of the dma_async_* functions are internal
to the async_tx API.

-- 
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: DMA engine API issue (was: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support)
@ 2014-08-04 17:54                 ` Russell King - ARM Linux
  0 siblings, 0 replies; 78+ messages in thread
From: Russell King - ARM Linux @ 2014-08-04 17:54 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: Vinod Koul, Kuninori Morimoto, dmaengine, linux-sh, Magnus Damm,
	Linux-ALSA, linux-arm-kernel, Maxime Ripard

On Mon, Aug 04, 2014 at 07:00:36PM +0200, Laurent Pinchart wrote:
> Hi Russell,
> 
> On Friday 01 August 2014 15:30:20 Russell King - ARM Linux wrote:
> >  This sequence must occur in a timely manner as some DMA engine
> > implementations hold a lock between the prepare and submit callbacks (Dan
> > explicitly permits this as part of the API.)
> 
> That really triggers a red alarm in the part of my brain that deals with API 
> design, but I suppose it would be too difficult to change that.

Mine too, but there's not a lot which can be done about it without
changing a lot of users.

> > > The DMA_PRIVATE capability flag seems to play a role here, but it's far
> > > from being clear how that mechanism is supposed to work. This should be
> > > documented as well, and any light you could shed on this dark corner of
> > > the API would help.
> > 
> > Why do you think that DMA_PRIVATE has a bearing in the callbacks? It
> > doesn't.
> 
> Not on callbacks, but on how pending descriptors are pushed to the hardware. 
> The flag is explicitly checked in dma_issue_pending_all().

Right.  So, let me put a question to you - what do you think is the
effect of the check in dma_issue_pending_all()?

I'll give you a hint - disregard the comment at the top of the function,
because that's out of date.

> > DMA_PRIVATE is all about channel allocation as I explained yesterday, and
> > whether the channel is available for async_tx usage.
> > 
> > A channel marked DMA_PRIVATE is not available for async_tx usage at
> > any moment.  A channel without DMA_PRIVATE is available for async_tx
> > usage until it is allocated for the slave API - at which point the
> > generic DMA engine code will mark the channel with DMA_PRIVATE,
> > thereby taking it away from async_tx API usage.  When the slave API
> > frees the channel, DMA_PRIVATE will be cleared, making the channel
> > available for async_tx usage.
> > 
> > So, basically, DMA_PRIVATE set -> async_tx usage not allowed.
> > DMA_PRIVATE clear -> async_tx usage permitted.  It really is that
> > simple.
> 
> DMA_PRIVATE is a dma_device flag, not a dma_chan flag. As soon as one channel 
> is allocated by __dma_request_channel() the whole device is marked with 
> DMA_PRIVATE, making all channels private. What am I missing ?

I can't answer that - I don't know why the previous authors decided to
make it a DMA-device wide property - presumably there are DMA controllers
where this matters.

However, one thing to realise is that a dma_device is a virtual concept -
it is a set of channels which share a common set of properties.  It is not
a physical device.  It is entirely reasonable for a set of channels on a
physical device to be shared between two different dma_device instances
and handed out by the driver code as needed.
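
The DMA_PRIVATE life cycle described in the quoted text can be sketched as a
small userspace mock (illustrative only, not the kernel implementation; the
real core keeps a per-device reference count rather than the single flag used
here, and all names below are invented for the sketch):

```c
#include <stdbool.h>

/* Userspace mock of the DMA_PRIVATE life cycle: a channel allocation
 * through the slave API marks the whole device private, taking it away
 * from async_tx; releasing the channel makes it available again. */
#define DMA_PRIVATE (1u << 0)

struct mock_dma_device {
	unsigned int cap_mask;
	bool flag_was_clear;	/* did the slave API set DMA_PRIVATE? */
};

/* Slave API allocates a channel: mark the whole device private. */
static void mock_request_channel(struct mock_dma_device *dev)
{
	dev->flag_was_clear = !(dev->cap_mask & DMA_PRIVATE);
	dev->cap_mask |= DMA_PRIVATE;
}

/* Slave API frees the channel: clear the flag only if we set it. */
static void mock_release_channel(struct mock_dma_device *dev)
{
	if (dev->flag_was_clear)
		dev->cap_mask &= ~DMA_PRIVATE;
}

/* async_tx may only use devices without DMA_PRIVATE. */
static bool mock_async_tx_usable(const struct mock_dma_device *dev)
{
	return !(dev->cap_mask & DMA_PRIVATE);
}
```

Note the device-wide scope discussed above falls out naturally here: the flag
lives on the (mock) device, so requesting one channel hides every channel of
that device from async_tx.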

> > > Similarly, the DMA engine API is split in functions with different
> > > prefixes (mostly dmaengine_*, dma_async_*, dma_*, and various
> > > unprefixed niceties such as async_tx_ack or txd_lock. If there's a
> > > rationale for that (beyond just historical reasons) it should also
> > > be documented, otherwise a cleanup would help all the confused DMA
> > > engine users (myself included).
> > 
> > dmaengine_* are generally the interface functions to the DMA engine code,
> > which have been recently introduced to avoid the multiple levels of
> > pointer indirection having to be typed in every driver.
> > 
> > dma_async_* are the DMA engine interface functions for the async_tx API.
> > 
> > dma_* predate the dmaengine_* naming, and are probably too generic, so
> > should probably end up being renamed to dmaengine_*.
> 
> Thank you for the confirmation. I'll see if I can cook up a patch. It will 
> likely be pretty large and broad though, but I guess there's no way around it. 
> 
> > txd_* are all about the DMA engine descriptors.
> > 
> > async_tx_* are the higher level async_tx API functions.
> 
> Thank you for the information. How about the dma_async_* functions, should 
> they be renamed to dmaengine_* as well ? Or are some of them part of the 
> async_tx_* API ?

Well, these:

dma_async_device_register
dma_async_device_unregister
dma_async_tx_descriptor_init

are more DMA engine core <-> DMA engine driver interface functions than
user functions.  The remainder of the dma_async_* functions are internal
to the async_tx API.

-- 
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: DMA engine API issue
  2014-08-04 16:50               ` Laurent Pinchart
  (?)
@ 2014-08-04 18:03                 ` Lars-Peter Clausen
  -1 siblings, 0 replies; 78+ messages in thread
From: Lars-Peter Clausen @ 2014-08-04 18:03 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: Vinod Koul, Kuninori Morimoto, dmaengine, linux-sh, Magnus Damm,
	Linux-ALSA, linux-arm-kernel, Maxime Ripard,
	Russell King - ARM Linux

On 08/04/2014 06:50 PM, Laurent Pinchart wrote:
[...]
>>>> from atomic context too.
>>>
>>> I'll take this opportunity to question why we have a separation between
>>> tx_submit and issue_pending. What's the rationale for that, especially
>>> given that dma_issue_pending_all() might kick in at any point and issue
>>> pending transfers for all devices. A driver could thus see its submitted
>>> but not issued transactions being issued before it explicitly calls
>>> dma_async_issue_pending().
>>
>> The  API states that you need to get a channel, then prepare a descriptor
>> and submit it back. Prepare can be in any order. The submit order is the one
>> which is run on dmaengine. The submit marks the descriptor as pending.
>> Typically you should have a pending_list which the descriptor should be
>> pushed to.
>>
>> And lastly invoke dma_async_issue_pending() to start the pending ones.
>>
>> You have the flexibility to prepare descriptors and issue in the order you
>> like. You can also attach the callback required for descriptors here.
>
> The question was why is there a dma_async_issue_pending() operation at all ?
> Why can't dmaengine_submit() trigger the transfer start ? The only
> explanation is a small comment in dmaengine.h that states
>
>   * This allows drivers to push copies to HW in batches,
>   * reducing MMIO writes where possible.
>
> I don't think that's applicable for DMA slave transfers. Is it still
> applicable for anything else ?
[...]

If the hardware has scatter-gather support it allows the driver to chain the 
descriptors before submitting them, which reduces the latency between the 
transfers as well as the IO overhead. The flaw with the current 
implementation is that there is only one global chain per channel instead of 
e.g. having the possibility to build up a chain in a driver and then submit and 
start the chain. Some drivers have virtual channels where each channel 
basically acts as the chain and once issue pending is called the chain is 
mapped to a real channel which then executes it.

- Lars

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: DMA engine API issue
  2014-08-04 18:03                 ` Lars-Peter Clausen
  (?)
@ 2014-08-04 18:32                   ` Russell King - ARM Linux
  -1 siblings, 0 replies; 78+ messages in thread
From: Russell King - ARM Linux @ 2014-08-04 18:32 UTC (permalink / raw)
  To: Lars-Peter Clausen
  Cc: Laurent Pinchart, Vinod Koul, Kuninori Morimoto, dmaengine,
	linux-sh, Magnus Damm, Linux-ALSA, linux-arm-kernel,
	Maxime Ripard

On Mon, Aug 04, 2014 at 08:03:45PM +0200, Lars-Peter Clausen wrote:
> If the hardware has scatter gather support it allows the driver to chain 
> the descriptors before submitting them, which reduces the latency between 
> the transfers as well as the IO overhead.

While partially true, that's not the full story...

BTW, you're talking about stuff in DMA engine not being clear, but you're
using confusing terminology.  Descriptors vs transactions.  The prepare
functions return a transaction.  Descriptors are the hardware data
structures which describe the transaction.  I'll take what you're talking
about above as "chain the previous transaction descriptors to the next
transaction descriptors".

> The flaw with the current  
> implementation is that there is only one global chain per channel instead 
> of e.g. having the possibility to build up a chain in a driver and then 
> submit and start the chain. Some drivers have virtual channels where each 
> channel basically acts as the chain and once issue pending is called 
> the chain is mapped to a real channel which then executes it.

Most DMA engines are unable to program anything except the parameters for
the next stage of the transfer.  In order to switch between "channels",
many DMA engine implementations need the help of the CPU to reprogram the
physical channel configuration.  Chaining two different channels which
may ultimately end up on the same physical channel would be a bug in that
case.

Where the real flaw exists is the way that a lot of people write their
DMA engine drivers - in particular how they deal with the end of a
transfer.

Many driver implementations receive an interrupt from the DMA controller,
and either queue a tasklet, or they check the existing transfer, mark it
as completed in some way, and queue a tasklet.

When the tasklet runs, they then look to see if there's another transfer
which they can start, and they then start it.

That is horribly inefficient - it is much better to do all the DMA
manipulation in IRQ context.  So, when the channel completes the
existing transfer, you move the transaction to the queue of completed
transfers and queue the tasklet, check whether there's a transaction for
the same channel pending, and if so, start it immediately.

This means that your inter-transfer gap is reduced down from the
interrupt latency plus tasklet latency, to just the interrupt latency.
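
The completion pattern recommended above can be sketched as a userspace mock
(all names invented for illustration): the IRQ handler moves the finished
transaction to a completed list and starts the next pending transaction
immediately, deferring only the client callbacks to tasklet context.

```c
#include <stdbool.h>
#include <stddef.h>

/* Mock channel with a pending queue, a completed list, and one
 * active transaction. */
#define MAXQ 8

struct mock_chan {
	int pending[MAXQ]; size_t npending;
	int completed[MAXQ]; size_t ncompleted;
	int active; bool busy;
};

/* Pop the head of the pending queue and "program" it. */
static void mock_start_next(struct mock_chan *c)
{
	if (c->npending == 0) {
		c->busy = false;
		return;
	}
	c->active = c->pending[0];
	for (size_t i = 1; i < c->npending; i++)
		c->pending[i - 1] = c->pending[i];
	c->npending--;
	c->busy = true;
}

/* Runs in (mock) IRQ context: complete the current transaction and
 * start the next one right away, so the inter-transfer gap is only
 * the interrupt latency.  A tasklet would be scheduled here purely
 * to run the client callbacks. */
static void mock_irq_handler(struct mock_chan *c)
{
	c->completed[c->ncompleted++] = c->active;
	mock_start_next(c);
}
```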

Controllers such as OMAP (if their hardware scatter chains were used)
do have the ability to reprogram the entire channel configuration from
an appropriate transaction, and so /could/ start the next transfer
entirely automatically - but I never added support for the hardware
scatterlists as I have been told that TI measurements indicated that
it did not gain any performance to use them.  Had this been implemented,
it would mean that OMAP would only need to issue an interrupt to notify
completion of a transfer (so the driver would only have to work out
how many dma transactions had been completed.)

In this case, it is important that we do batch up the entries (since
an already in progress descriptor should not be modified), but I
suspect in the case of slave DMA, it is rarely the case that there
is more than one or two descriptors queued at any moment.

-- 
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: DMA engine API issue
  2014-08-04 18:32                   ` Russell King - ARM Linux
  (?)
@ 2014-08-04 23:12                     ` Laurent Pinchart
  -1 siblings, 0 replies; 78+ messages in thread
From: Laurent Pinchart @ 2014-08-04 23:12 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Lars-Peter Clausen, Vinod Koul, Kuninori Morimoto, dmaengine,
	linux-sh, Magnus Damm, Linux-ALSA, linux-arm-kernel,
	Maxime Ripard

Hi Russell,

On Monday 04 August 2014 19:32:25 Russell King - ARM Linux wrote:
> On Mon, Aug 04, 2014 at 08:03:45PM +0200, Lars-Peter Clausen wrote:
> > If the hardware has scatter gather support it allows the driver to chain
> > the descriptors before submitting them, which reduces the latency between
> > the transfers as well as the IO over overhead.
> 
> While partially true, that's not the full story...
> 
> BTW, you're talking about stuff in DMA engine not being clear, but you're
> using confusing terminology.  Descriptors vs transactions.  The prepare
> functions return a transaction.  Descriptors are the hardware data
> structures which describe the transaction.  I'll take what you're talking
> about above as "chain the previous transaction descriptors to the next
> transaction descriptors".

Well, the prep_* functions return a struct dma_async_tx_descriptor, documented 
as an "async transaction descriptor".

There are several types of descriptors, transactions and transfers involved, 
with different names depending on where you look.

- At the highest level, we have the DMA engine representation of a transaction 
in the form of a struct dma_async_tx_descriptor (even this is slightly 
misleading, as tx is a common abbreviation of transmit or transmission, but 
not of transaction).

- One level lower, when the high-level transaction targets non-contiguous 
memory (from the device point of view) the transaction is split into 
contiguous chunks. The device might be able to execute a list (or table, 
depending on the implementation) of chunks on its own without requiring 
software intervention. If it isn't, the driver will need to submit the next 
chunk in the completion interrupt of the previous chunk. Even when the device 
supports executing multiple chunks on its own, it might be limited in the 
number of chunks it can chain, requiring software intervention to handle one 
transaction descriptor.

- At the lowest level, the hardware will perform the transfer by repeating 
transfer cycles, reading a data unit from the source and writing to the 
destination. When the source or destination supports it, the read and/or write 
operations can also be grouped in bursts.

If we want to lower the confusion we should decide on names for those 
different levels and stick to them.

The highest level unit is called a transaction by (at least some parts of) the 
API, the name sounds good enough to me. "Transaction" could thus refer to that 
operation, and "transaction descriptor" to the struct dma_async_tx_descriptor 
instance.

We could then say that a transaction is split into transfers, each of them 
targeting a contiguous piece of memory of both the source and the destination, 
and that transfers are split into transfer cycles, each of them transferring 
one data unit or element. I'm also open to other proposals (including using 
the name "chunk" for one of the elements).
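
The three-level naming proposed above (transaction -> transfers -> transfer
cycles) can be made concrete with a small sketch; the struct names below are
invented for illustration and do not exist in the dmaengine API:

```c
#include <stddef.h>

/* One contiguous transfer (a "chunk"): contiguous on both the source
 * and destination sides. */
struct mock_transfer {
	size_t len;	/* bytes in this transfer */
	size_t unit;	/* data unit moved per transfer cycle */
};

/* The whole transaction, i.e. what a single prep_* call describes. */
struct mock_transaction {
	struct mock_transfer xfers[4];
	size_t nxfers;
};

/* Number of transfer cycles the hardware performs for the whole
 * transaction: each transfer is len/unit cycles (ignoring bursts). */
static size_t mock_cycles(const struct mock_transaction *t)
{
	size_t cycles = 0;
	for (size_t i = 0; i < t->nxfers; i++)
		cycles += t->xfers[i].len / t->xfers[i].unit;
	return cycles;
}
```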

> > The flaw with the current implementation is that there is only one global
> > chain per channel instead of e.g. having the possibility to build up a
> > chain in a driver and then submit and start the chain.

Well, that's not completely true, the API supports scatterlists, so you could 
create a single transaction descriptor that spans several unrelated transfers 
(as long as they can use the same channel, for instance targeting the same 
device for slave transactions).

> > Some drivers have virtual channels where each channel basically acts as
> > the chain and once issue pending is called it is the chain is mapped to a
> > real channel which then executes it.
> 
> Most DMA engines are unable to program anything except the parameters for
> the next stage of the transfer.  In order to switch between "channels",
> many DMA engine implementations need the help of the CPU to reprogram the
> physical channel configuration.  Chaining two different channels which
> may ultimately end up on the same physical channel would be a bug in that
> case.

I'm mostly familiar with DMA engines designed for slave transfers. The ones 
I've seen have channels that are programmed and run independently, usually 
with some logic to arbitrate bus access. When they support executing lists or 
arrays of transfers the hardware transfer descriptors include the source and 
destination addresses and the number of elements to be transferred. The 
identifier of the slave device (basically the DMA request line to which the 
slave is connected) is constant across all chained transfers.

I'm not sure what you mean by "switching between channels". Could you please 
explain that ?

> Where the real flaw exists is the way that a lot of people write their
> DMA engine drivers - in particular how they deal with the end of a
> transfer.
> 
> Many driver implementations receive an interrupt from the DMA controller,
> and either queue a tasklet, or they check the existing transfer, mark it
> as completed in some way, and queue a tasklet.
> 
> When the tasklet runs, they then look to see if there's another transfer
> which they can start, and they then start it.
> 
> That is horribly inefficient - it is much better to do all the DMA
> manipulation in IRQ context.  So, when the channel completes the
> existing transfer, you move the transaction to the queue of completed
> transfers and queue the tasklet, check whether there's a transaction for
> the same channel pending, and if so, start it immediately.
> 
> This means that your inter-transfer gap is reduced down from the
> interrupt latency plus tasklet latency, to just the interrupt latency.

I totally agree. This should be documented to avoid this kind of mistake in 
the future. Maxime, if you can find time for it, could you add this to the 
next version of your documentation patch ?
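
The scheme Russell describes could be modelled roughly as follows (a minimal userspace sketch with hypothetical names, standing in for real spinlock-protected driver state): the completion interrupt itself retires the finished transaction and starts the next pending one, leaving only the client callbacks to the tasklet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct txn { struct txn *next; };

struct chan {
	struct txn *active;	/* transaction on the hardware */
	struct txn *pending;	/* head of the pending queue */
	struct txn *completed;	/* head of the completed queue */
	bool tasklet_scheduled;
};

/* Called from the (simulated) completion interrupt. */
static void chan_irq(struct chan *c)
{
	struct txn *done = c->active;

	/* Move the finished transaction to the completed queue; its
	 * callback will run later, from the tasklet. */
	done->next = c->completed;
	c->completed = done;
	c->tasklet_scheduled = true;

	/* Start the next pending transaction right here, so the
	 * inter-transfer gap is just the interrupt latency rather than
	 * interrupt latency plus tasklet latency. */
	c->active = c->pending;
	if (c->pending)
		c->pending = c->pending->next;
}
```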

> Controllers such as OMAP (if their hardware scatter chains were used)
> do have the ability to reprogram the entire channel configuration from
> an appropriate transaction, and so /could/ start the next transfer
> entirely automatically - but I never added support for the hardware
> scatterlists as I have been told that TI measurements indicated that
> it did not gain any performance to use them.  Had this been implemented,
> it would mean that OMAP would only need to issue an interrupt to notify
> completion of a transfer (so the driver would only have to work out
> how many dma transactions had been completed.)
> 
> In this case, it is important that we do batch up the entries (since
> an already in progress descriptor should not be modified), but I
> suspect in the case of slave DMA, it is rarely the case that there
> is more than one or two descriptors queued at any moment.

I agree, in most cases there's only one or a few transaction descriptors 
queued for slave DMA. There could then be a larger number of hardware transfer 
descriptors to represent one transaction descriptor, but those would have been 
created by the DMA engine driver from a single transaction descriptor, so 
there would be no problem chaining the transfers.

How about the memcpy (non-slave) DMA ? Do client drivers submit lots of small 
DMA transactions that should be chained for optimal performance ?

-- 
Regards,

Laurent Pinchart


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: DMA engine API issue (was: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support)
  2014-08-04 16:50               ` Laurent Pinchart
  (?)
@ 2014-08-05 16:56                 ` Vinod Koul
  -1 siblings, 0 replies; 78+ messages in thread
From: Vinod Koul @ 2014-08-05 16:56 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: Kuninori Morimoto, dmaengine, linux-sh, Magnus Damm, Linux-ALSA,
	linux-arm-kernel, Maxime Ripard, Russell King - ARM Linux

On Mon, Aug 04, 2014 at 06:50:17PM +0200, Laurent Pinchart wrote:
 
> The question was why is there a dma_async_issue_pending() operation at all ? 
> Why can't dmaengine_submit() trigger the transfer start ? The only 
> explanation is a small comment in dmaengine.h that states
> 
>  * This allows drivers to push copies to HW in batches,
>  * reducing MMIO writes where possible.
> 
> I don't think that's applicable for DMA slave transfers. Is it still 
> applicable for anything else ?
why not?

If your hw supports sg-lists of, say, length 8 and you prepare two
descriptors for lengths of 3 and 5, what prevents issue_pending from
submitting them to hardware in one shot while still getting an interrupt ?

I know designware and intel-dma do support that. It is a different point
that the drivers don't make use of it.
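
The batching Vinod describes could be sketched like this (a simplified userspace model, not the designware or intel-dma code; all names are hypothetical): issue_pending merges as many already-prepared descriptors as fit into one hardware sg-list program.

```c
#include <assert.h>
#include <stddef.h>

/* If the hardware executes sg-lists of up to hw_max entries,
 * issue_pending can coalesce several prepared descriptors into a
 * single hardware submission.  Returns how many descriptors were
 * submitted in one shot; completion interrupts still fire per
 * descriptor boundary. */
static size_t issue_pending_batch(const size_t *desc_lens, size_t n,
				  size_t hw_max)
{
	size_t used = 0, batched = 0;

	for (size_t i = 0; i < n; i++) {
		if (used + desc_lens[i] > hw_max)
			break;		/* next descriptor doesn't fit */
		used += desc_lens[i];
		batched++;		/* this descriptor joins the batch */
	}
	return batched;
}
```

With Vinod's example of two descriptors of lengths 3 and 5 and a hardware limit of 8, both go out in a single submission.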

-- 
~Vinod

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: DMA engine API issue (was: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support)
  2014-08-04 17:54                 ` Russell King - ARM Linux
  (?)
@ 2014-08-05 23:19                   ` Laurent Pinchart
  -1 siblings, 0 replies; 78+ messages in thread
From: Laurent Pinchart @ 2014-08-05 23:19 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Vinod Koul, Kuninori Morimoto, dmaengine, linux-sh, Magnus Damm,
	Linux-ALSA, linux-arm-kernel, Maxime Ripard

Hi Russell,

On Monday 04 August 2014 18:54:58 Russell King - ARM Linux wrote:
> On Mon, Aug 04, 2014 at 07:00:36PM +0200, Laurent Pinchart wrote:
> > On Friday 01 August 2014 15:30:20 Russell King - ARM Linux wrote:
> >>  This sequence must occur in a timely manner as some DMA engine
> >> 
> >> implementations hold a lock between the prepare and submit callbacks
> >> (Dan explicitly permits this as part of the API.)
> > 
> > That really triggers a red alarm in the part of my brain that deals with
> > API design, but I suppose it would be too difficult to change that.
> 
> Mine too, but there's not a lot which can be done about it without changing a
> lot of users.

Well, the good side is that it "only" requires enough motivation and free time 
then :-)

> >>> The DMA_PRIVATE capability flag seems to play a role here, but it's
> >>> far from being clear how that mechanism is supposed to work. This
> >>> should be documented as well, and any light you could shed on this
> >>> dark corner of the API would help.
> >> 
> >> Why do you think that DMA_PRIVATE has a bearing in the callbacks? It
> >> doesn't.
> > 
> > Not on callbacks, but on how pending descriptors are pushed to the
> > hardware. The flag is explicitly checked in dma_issue_pending_all().
> 
> Right.  So, let me put a question to you - what do you think is the
> effect of the check in dma_issue_pending_all()?
> 
> I'll give you a hint - disregard the comment at the top of the function,
> because that's out of date.

I suppose the idea is that dma_issue_pending_all() is only used for memcpy 
offload, and can thus ignore channels used for slave transfers.

The code seems to be buggy though. A DMA engine that can serve both memcpy and 
slave transfers could have one channel allocated for memcpy first, then a 
second channel allocated for slave transfers. This would cause the DMA_PRIVATE 
flag to be set, which will prevent dma_issue_pending_all() from calling the 
device_issue_pending operation of the memcpy channel.
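
A stripped-down model of that behaviour (the real dma_issue_pending_all() lives in the dmaengine core; everything below is simplified and the names are illustrative only):

```c
#include <assert.h>
#include <stddef.h>

#define DMA_PRIVATE (1u << 0)	/* simplified capability bit */

struct dma_chan { int client_count; int issued; };
struct dma_device {
	unsigned int cap_mask;	/* device-wide, not per-channel */
	struct dma_chan *chans;
	size_t nchans;
};

/* Devices marked DMA_PRIVATE are skipped wholesale.  Because the flag
 * is device-wide, one slave allocation stops the device's remaining
 * memcpy channels from being serviced here too - the issue raised
 * above. */
static void issue_pending_all(struct dma_device *devs, size_t ndevs)
{
	for (size_t d = 0; d < ndevs; d++) {
		if (devs[d].cap_mask & DMA_PRIVATE)
			continue;
		for (size_t c = 0; c < devs[d].nchans; c++)
			if (devs[d].chans[c].client_count)
				devs[d].chans[c].issued++;
	}
}
```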

> >> DMA_PRIVATE is all about channel allocation as I explained yesterday,
> >> and whether the channel is available for async_tx usage.
> >> 
> >> A channel marked DMA_PRIVATE is not available for async_tx usage at
> >> any moment.  A channel without DMA_PRIVATE is available for async_tx
> >> usage until it is allocated for the slave API - at which point the
> >> generic DMA engine code will mark the channel with DMA_PRIVATE,
> >> thereby taking it away from async_tx API usage.  When the slave API
> >> frees the channel, DMA_PRIVATE will be cleared, making the channel
> >> available for async_tx usage.
> >> 
> >> So, basically, DMA_PRIVATE set -> async_tx usage not allowed.
> >> DMA_PRIVATE clear -> async_tx usage permitted.  It really is that
> >> simple.
> > 
> > DMA_PRIVATE is a dma_device flag, not a dma_chan flag. As soon as one
> > channel is allocated by __dma_request_channel() the whole device is
> > marked with DMA_PRIVATE, making all channels private. What am I missing ?
> 
> I can't answer that - I don't know why the previous authors decided to
> make it a DMA-device wide property - presumably there are DMA controllers
> where this matters.

If I understand the history correctly, the reason to make DMA_PRIVATE a device 
flag is to avoid starving slaves by allocating all channels of a device for 
memcpy. If the DMA_PRIVATE flag is set by the DMA engine driver that works as 
expected. If the DMA engine can be used for both, though, there's no 
guarantee, and the behaviour won't be very predictable.

By the way, shouldn't DMA_PRIVATE be renamed to something more explicit, such 
as DMA_NO_MEMCPY or DMA_SLAVE_ONLY ?

> However, one thing to realise is that a dma_device is a virtual concept -
> it is a set of channels which share a common set of properties.  It is not
> a physical device.  It is entirely reasonable for a set of channels on a
> physical device to be shared between two different dma_device instances
> and handed out by the driver code as needed.

When the channels are independent, sure, but they sometimes share hardware 
resources. For instance the Renesas R-Car Gen2 SoCs have two general-purpose DMA 
engines usable for both memcpy and slave transfers, each of them having 15 
channels. Each DMA engine arbitrates memory accesses from the different 
channels, using either fixed priorities, or a round-robin arbitration. In that 
case it wouldn't make much sense to split the 15 channels across several 
dma_device instances.

I actually have the opposite problem, in my case channels of physically 
separate DMA engines can be used interchangeably to serve the system's slaves. 
Using the DMA engine DT bindings, DT nodes of the slaves currently reference a 
specific DMA engine, even if they can be served by both. This leads to limited 
dynamic channel allocation capabilities (especially when taking into account 
lazy channel allocation as mentioned in another mail in this thread).

> >>> Similarly, the DMA engine API is split in functions with different
> >>> prefixes (mostly dmaengine_*, dma_async_*, dma_*, and various
> >>> unprefixed niceties such as async_tx_ack or txd_lock). If there's a
> >>> rationale for that (beyond just historical reasons) it should also
> >>> be documented, otherwise a cleanup would help all the confused DMA
> >>> engine users (myself included).
> >> 
> >> dmaengine_* are generally the interface functions to the DMA engine
> >> code, which have been recently introduced to avoid the multiple levels of
> >> pointer indirection having to be typed in every driver.
> >> 
> >> dma_async_* are the DMA engine interface functions for the async_tx API.
> >> 
> >> dma_* predate the dmaengine_* naming, and are probably too generic, so
> >> should probably end up being renamed to dmaengine_*.
> > 
> > Thank you for the confirmation. I'll see if I can cook up a patch. It will
> > likely be pretty large and broad though, but I guess there's no way around
> > it.
> >
> >> txd_* are all about the DMA engine descriptors.
> >> 
> >> async_tx_* are the higher level async_tx API functions.
> > 
> > Thank you for the information. How about the dma_async_* functions, should
> > they be renamed to dmaengine_* as well ? Or are some of them part of the
> > async_tx_* API ?
> 
> Well, these:
> 
> dma_async_device_register
> dma_async_device_unregister
> dma_async_tx_descriptor_init
> 
> are more DMA engine core <-> DMA engine driver interface functions than
> user functions.  The remainder of the dma_async_* functions are internal
> to the async_tx API.

I was also thinking about dma_async_issue_pending(). Isn't that function part 
of the DMA engine API rather than the async API ?

-- 
Regards,

Laurent Pinchart


^ permalink raw reply	[flat|nested] 78+ messages in thread

> > it.
> >
> >> txd_* are all about the DMA engine descriptors.
> >> 
> >> async_tx_* are the higher level async_tx API functions.
> > 
> > Thank you for the information. How about the dma_async_* functions, should
> > they be renamed to dmaengine_* as well ? Or are some of them part of the
> > async_tx_* API ?
> 
> Well, these:
> 
> dma_async_device_register
> dma_async_device_unregister
> dma_async_tx_descriptor_init
> 
> are more DMA engine core <-> DMA engine driver interface functions than
> user functions.  The remainder of the dma_async_* functions are internal
> to the async_tx API.

I was also thinking about dma_async_issue_pending(). Isn't that function part 
of the DMA engine API rather than the async API ?

-- 
Regards,

Laurent Pinchart


^ permalink raw reply	[flat|nested] 78+ messages in thread

* DMA engine API issue (was: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support)
@ 2014-08-05 23:19                   ` Laurent Pinchart
  0 siblings, 0 replies; 78+ messages in thread
From: Laurent Pinchart @ 2014-08-05 23:19 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Russell,

On Monday 04 August 2014 18:54:58 Russell King - ARM Linux wrote:
> On Mon, Aug 04, 2014 at 07:00:36PM +0200, Laurent Pinchart wrote:
> > On Friday 01 August 2014 15:30:20 Russell King - ARM Linux wrote:
> >>  This sequence must occur in a timely manner as some DMA engine
> >> 
> >> implementations hold a lock between the prepare and submit callbacks
> >> (Dan explicitly permits this as part of the API.)
> > 
> > That really triggers a red alarm in the part of my brain that deals with
> > API design, but I suppose it would be too difficult to change that.
> 
> Mine too, but there's not a lot which can be done about it without changing a
> lot of users.

Well, the good side is that it "only" requires enough motivation and free time 
then :-)

> >>> The DMA_PRIVATE capability flag seems to play a role here, but it's
> >>> far from being clear how that mechanism is supposed to work. This
> >>> should be documented as well, and any light you could shed on this
> >>> dark corner of the API would help.
> >> 
> >> Why do you think that DMA_PRIVATE has a bearing in the callbacks? It
> >> doesn't.
> > 
> > Not on callbacks, but on how pending descriptors are pushed to the
> > hardware. The flag is explicitly checked in dma_issue_pending_all().
> 
> Right.  So, let me put a question to you - what do you think is the
> effect of the check in dma_issue_pending_all()?
> 
> I'll give you a hint - disregard the comment at the top of the function,
> because that's out of date.

I suppose the idea is that dma_issue_pending_all() is only used for memcpy 
offload, and can thus ignore channels used for slave transfers.

The code seems to be buggy though. A DMA engine that can serve both memcpy and 
slave transfers could have one channel allocated for memcpy first, then a 
second channel allocated for slave transfers. This would cause the DMA_PRIVATE 
flag to be set, which will prevent dma_issue_pending_all() from calling the 
device_issue_pending operation of the memcpy channel.

> >> DMA_PRIVATE is all about channel allocation as I explained yesterday,
> >> and whether the channel is available for async_tx usage.
> >> 
> >> A channel marked DMA_PRIVATE is not available for async_tx usage at
> >> any moment.  A channel without DMA_PRIVATE is available for async_tx
> >> usage until it is allocated for the slave API - at which point the
> >> generic DMA engine code will mark the channel with DMA_PRIVATE,
> >> thereby taking it away from async_tx API usage.  When the slave API
> >> frees the channel, DMA_PRIVATE will be cleared, making the channel
> >> available for async_tx usage.
> >> 
> >> So, basically, DMA_PRIVATE set -> async_tx usage not allowed.
> >> DMA_PRIVATE clear -> async_tx usage permitted.  It really is that
> >> simple.
> > 
> > DMA_PRIVATE is a dma_device flag, not a dma_chan flag. As soon as one
> > channel is allocated by __dma_request_channel() the whole device is
> > marked with DMA_PRIVATE, making all channels private. What am I missing ?
> 
> I can't answer that - I don't know why the previous authors decided to
> make it a DMA-device wide property - presumably there are DMA controllers
> where this matters.

If I understand the history correctly, the reason for making DMA_PRIVATE a 
device flag is to avoid starving slaves by allocating all channels of a device 
for memcpy. When the DMA engine driver itself sets the DMA_PRIVATE flag, that 
works as expected. When the DMA engine can be used for both, though, there's no 
such guarantee, and the behaviour won't be very predictable.

By the way, shouldn't DMA_PRIVATE be renamed to something more explicit, such 
as DMA_NO_MEMCPY or DMA_SLAVE_ONLY?

> However, one thing to realise is that a dma_device is a virtual concept -
> it is a set of channels which share a common set of properties.  It is not
> a physical device.  It is entirely reasonable for a set of channels on a
> physical device to be shared between two different dma_device instances
> and handed out by the driver code as needed.

When the channels are independent, sure, but they sometimes share hardware 
resources. For instance the Renesas R-Car Gen2 SoCs have two general-purpose 
DMA engines usable for both memcpy and slave transfers, each of them having 15 
channels. Each DMA engine arbitrates memory accesses from the different 
channels, using either fixed priorities or round-robin arbitration. In that 
case it wouldn't make much sense to split the 15 channels across several 
dma_device instances.

I actually have the opposite problem, in my case channels of physically 
separate DMA engines can be used interchangeably to serve the system's slaves. 
Using the DMA engine DT bindings, DT nodes of the slaves currently reference a 
specific DMA engine, even if they can be served by both. This leads to limited 
dynamic channel allocation capabilities (especially when taking into account 
lazy channel allocation as mentioned in another mail in this thread).

> >>> Similarly, the DMA engine API is split in functions with different
> >>> prefixes (mostly dmaengine_*, dma_async_*, dma_*, and various
> >>> unprefixed niceties such as async_tx_ack or txd_lock). If there's a
> >>> rationale for that (beyond just historical reasons) it should also
> >>> be documented, otherwise a cleanup would help all the confused DMA
> >>> engine users (myself included).
> >> 
> >> dmaengine_* are generally the interface functions to the DMA engine
> >> code, which have been recently introduced to avoid the multiple levels of
> >> pointer indirection having to be typed in every driver.
> >> 
> >> dma_async_* are the DMA engine interface functions for the async_tx API.
> >> 
> >> dma_* predate the dmaengine_* naming, and are probably too generic, so
> >> should probably end up being renamed to dmaengine_*.
> > 
> > Thank you for the confirmation. I'll see if I can cook up a patch. It will
> > likely be pretty large and broad though, but I guess there's no way around
> > it.
> >
> >> txd_* are all about the DMA engine descriptors.
> >> 
> >> async_tx_* are the higher level async_tx API functions.
> > 
> > Thank you for the information. How about the dma_async_* functions, should
> > they be renamed to dmaengine_* as well ? Or are some of them part of the
> > async_tx_* API ?
> 
> Well, these:
> 
> dma_async_device_register
> dma_async_device_unregister
> dma_async_tx_descriptor_init
> 
> are more DMA engine core <-> DMA engine driver interface functions than
> user functions.  The remainder of the dma_async_* functions are internal
> to the async_tx API.

I was also thinking about dma_async_issue_pending(). Isn't that function part 
of the DMA engine API rather than the async_tx API?

-- 
Regards,

Laurent Pinchart

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: DMA engine API issue (was: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support)
  2014-08-05 23:19                   ` Laurent Pinchart
  (?)
@ 2014-08-06  7:17                     ` Geert Uytterhoeven
  -1 siblings, 0 replies; 78+ messages in thread
From: Geert Uytterhoeven @ 2014-08-06  7:17 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: Russell King - ARM Linux, Vinod Koul, Kuninori Morimoto,
	dmaengine, Linux-sh list, Magnus Damm, Linux-ALSA,
	linux-arm-kernel, Maxime Ripard

Hi Laurent,

On Wed, Aug 6, 2014 at 1:19 AM, Laurent Pinchart
<laurent.pinchart@ideasonboard.com> wrote:
> I actually have the opposite problem, in my case channels of physically
> separate DMA engines can be used interchangeably to serve the system's slaves.
> Using the DMA engine DT bindings, DT nodes of the slaves currently reference a
> specific DMA engine, even if they can be served by both. This leads to limited
> dynamic channel allocation capabilities (especially when taking into account
> lazy channel allocation as mentioned in another mail in this thread).

What about adding a property to the first one, referencing the second
(or the other way around, don't know what's the easiest to implement)?

        dmac0: dma-controller@e6700000 {
                ...
                renesas,alternative = <&dmac1>;
                ...
        };

        dmac1: dma-controller@e6720000 {
                ...
        };

That would avoid having to bind a slave device explicitly to a single
dmac, or having to bind all slave devices to all dmacs.

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: DMA engine API issue (was: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support)
  2014-08-06  7:17                     ` Geert Uytterhoeven
  (?)
@ 2014-08-06 11:04                       ` Arnd Bergmann
  -1 siblings, 0 replies; 78+ messages in thread
From: Arnd Bergmann @ 2014-08-06 11:04 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Geert Uytterhoeven, Laurent Pinchart, Linux-ALSA,
	Russell King - ARM Linux, Linux-sh list, Vinod Koul, Magnus Damm,
	dmaengine, Maxime Ripard, Kuninori Morimoto

On Wednesday 06 August 2014, Geert Uytterhoeven wrote:
> > I actually have the opposite problem, in my case channels of physically
> > separate DMA engines can be used interchangeably to serve the system's slaves.
> > Using the DMA engine DT bindings, DT nodes of the slaves currently reference a
> > specific DMA engine, even if they can be served by both. This leads to limited
> > dynamic channel allocation capabilities (especially when taking into account
> > lazy channel allocation as mentioned in another mail in this thread).
> 
> What about adding a property to the first one, referencing the second
> (or the other way around, don't know what's the easiest to implement)?
> 
>         dmac0: dma-controller@e6700000 {
>                 ...
>                 renesas,alternative = <&dmac1>;
>                 ...
>         };
> 
>         dmac1: dma-controller@e6720000 {
>                 ...
>         };
> 
> That would avoid having to bind a slave device explicitly to a single
> dmac, or having to bind all slave devices to all dmacs.

We have a perfectly fine way to express this with the existing binding
already: you just list channels for both (or more) controllers for each
slave, and let the dma subsystem pick one. This was a compromise we
reached when we initially introduced the dma slave binding, the downside
being that we have to name every reference from a slave to a controller,
even though almost all of them are "rx", "tx" or "data".
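
As an illustration of that compromise, a slave served by both controllers would 
list one channel per name from each of them. The node below is made up for the 
example (the controller labels follow Geert's snippet, and the request-line 
specifiers 0x21/0x22 are placeholders); only the shape of the dmas/dma-names 
properties matters:

```dts
        uart0: serial {
                ...
                dmas = <&dmac0 0x21>, <&dmac0 0x22>,
                       <&dmac1 0x21>, <&dmac1 0x22>;
                dma-names = "tx", "rx", "tx", "rx";
        };
```

The DMA subsystem can then pick any "tx"/"rx" pair, from either controller, at 
channel request time.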

I believe what happened though is that the initial implementation in the
kernel was to just pick the first channel for a given name and try that
but give up if it fails. This works for the majority of users and I had
expected someone to implement a smarter strategy as needed.

The easiest way would be to just randomize the order of the channels
during lookup and then try them all, but there is a potential problem
with this failing sometimes in nondeterministic ways.
Another alternative would be for the dma controller to report back
some form of "utilization" number to the dma subsystem and have the
common code pick the least utilized engine that is connected to that
slave.

	Arnd

^ permalink raw reply	[flat|nested] 78+ messages in thread

end of thread, other threads:[~2014-08-06 11:04 UTC | newest]

Thread overview: 78+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-22 12:33 [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support Laurent Pinchart
2014-07-23  2:17 ` Kuninori Morimoto
2014-07-23 10:28   ` Laurent Pinchart
2014-07-23 10:28     ` Laurent Pinchart
2014-07-23 11:07     ` DMA engine API issue (was: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support) Laurent Pinchart
2014-07-23 11:07       ` Laurent Pinchart
2014-07-23 11:07       ` Laurent Pinchart
2014-07-24  0:46       ` Kuninori Morimoto
2014-07-24  0:46         ` Kuninori Morimoto
2014-07-24  0:46         ` Kuninori Morimoto
2014-07-24  1:35         ` Kuninori Morimoto
2014-07-24  1:35           ` Kuninori Morimoto
2014-07-24  1:35           ` Kuninori Morimoto
2014-07-24  4:53           ` Vinod Koul
2014-07-24  4:59             ` Vinod Koul
2014-07-24  4:53             ` Vinod Koul
2014-07-24  4:52       ` Vinod Koul
2014-07-24  4:58         ` Vinod Koul
2014-07-24  4:52         ` Vinod Koul
2014-08-01  8:51         ` Laurent Pinchart
2014-08-01  8:51           ` Laurent Pinchart
2014-08-01  8:51           ` Laurent Pinchart
2014-08-01 14:30           ` Russell King - ARM Linux
2014-08-01 14:30             ` Russell King - ARM Linux
2014-08-01 14:30             ` Russell King - ARM Linux
2014-08-01 17:09             ` Vinod Koul
2014-08-01 17:21               ` Vinod Koul
2014-08-01 17:09               ` Vinod Koul
2014-08-04 13:47             ` Geert Uytterhoeven
2014-08-04 13:47               ` Geert Uytterhoeven
2014-08-04 13:47               ` Geert Uytterhoeven
2014-08-04 17:00             ` Laurent Pinchart
2014-08-04 17:00               ` Laurent Pinchart
2014-08-04 17:00               ` Laurent Pinchart
2014-08-04 17:54               ` Russell King - ARM Linux
2014-08-04 17:54                 ` Russell King - ARM Linux
2014-08-04 17:54                 ` Russell King - ARM Linux
2014-08-05 23:19                 ` Laurent Pinchart
2014-08-05 23:19                   ` Laurent Pinchart
2014-08-05 23:19                   ` Laurent Pinchart
2014-08-06  7:17                   ` Geert Uytterhoeven
2014-08-06  7:17                     ` Geert Uytterhoeven
2014-08-06  7:17                     ` Geert Uytterhoeven
2014-08-06 11:04                     ` Arnd Bergmann
2014-08-06 11:04                       ` Arnd Bergmann
2014-08-06 11:04                       ` Arnd Bergmann
2014-08-01 17:07           ` Vinod Koul
2014-08-01 17:19             ` Vinod Koul
2014-08-01 17:07             ` Vinod Koul
2014-08-04 16:50             ` Laurent Pinchart
2014-08-04 16:50               ` Laurent Pinchart
2014-08-04 16:50               ` Laurent Pinchart
2014-08-04 18:03               ` DMA engine API issue Lars-Peter Clausen
2014-08-04 18:03                 ` Lars-Peter Clausen
2014-08-04 18:03                 ` Lars-Peter Clausen
2014-08-04 18:32                 ` Russell King - ARM Linux
2014-08-04 18:32                   ` Russell King - ARM Linux
2014-08-04 18:32                   ` Russell King - ARM Linux
2014-08-04 23:12                   ` Laurent Pinchart
2014-08-04 23:12                     ` Laurent Pinchart
2014-08-04 23:12                     ` Laurent Pinchart
2014-08-05 16:56               ` DMA engine API issue (was: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support) Vinod Koul
2014-08-05 17:08                 ` Vinod Koul
2014-08-05 16:56                 ` Vinod Koul
2014-07-24 12:29       ` [alsa-devel] DMA engine API issue Lars-Peter Clausen
2014-07-24 12:29         ` Lars-Peter Clausen
2014-07-24 12:29         ` Lars-Peter Clausen
2014-07-24 12:51       ` DMA engine API issue (was: [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support) Russell King - ARM Linux
2014-07-24 12:51         ` Russell King - ARM Linux
2014-07-24 12:51         ` Russell King - ARM Linux
2014-08-01  9:24         ` Laurent Pinchart
2014-08-01  9:24           ` Laurent Pinchart
2014-08-01  9:24           ` Laurent Pinchart
2014-07-23  9:48 ` [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support Laurent Pinchart
2014-07-23 23:56 ` Kuninori Morimoto
2014-07-24  8:51   ` [PATCH] ASoC: rsnd: fixup dai remove callback operation Kuninori Morimoto
2014-07-25 17:50     ` Mark Brown
2014-07-24  0:12 ` [PATCH/RFC 0/5] R-Car Gen2 DMAC hardware descriptor list support Laurent Pinchart
