All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 0/3] Add Xilinx AXI Video DMA Engine driver
@ 2014-02-15 12:00 ` Srikanth Thokala
  0 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-15 12:00 UTC (permalink / raw)
  To: dan.j.williams, vinod.koul, michal.simek, grant.likely, robh+dt
  Cc: levex, lars, andriy.shevchenko, dmaengine, linux-arm-kernel,
	linux-kernel, devicetree, Srikanth Thokala

Hi,

This is the driver for Xilinx AXI Video Direct Memory Access Engine.
It is a soft IP core, which provides high-bandwidth direct memory
access between memory and AXI4-Stream video type target peripherals
including peripherals which support AXI4-Stream Video Protocol. The
core provides efficient two dimensional DMA operations with independent
asynchronous read and write channel operation.

For more information on the IP, please refer to
http://www.xilinx.com/support/documentation/ip_documentation/axi_vdma/v6_1/pg020_axi_vdma.pdf

This patch also provides a test client, which assumes read and write channels
of the core are configured in a back-to-back connection.  It transfers
data on the write channel, read and verify the data on the read channel.

Use cases:
++++++++++
1. Xilinx Video Targeted Reference design
   http://www.wiki.xilinx.com/Zynq+Base+TRD+14.5
2. Common Display Framework
   http://events.linuxfoundation.org/sites/events/files/slides/20131024-elce.pdf

Regards,
Srikanth

Changes in v3:
- Created a separate patch for the DT binding documentation as suggested by
  Vinod, Thanks.
- Added support for interleaved frames with non-contiguous memory as suggested
  by Lars, Thanks.
- Rebased on v3.14.0-rc2 

Srikanth Thokala (3):
  dma: Support multiple interleaved frames with non-contiguous memory
  dma: Add Xilinx Video DMA DT Binding Documentation
  dma: Add Xilinx AXI Video Direct Memory Access Engine driver support

 .../devicetree/bindings/dma/xilinx/xilinx_vdma.txt |   75 ++
 Documentation/dmaengine.txt                        |    2 +-
 drivers/dma/Kconfig                                |   14 +
 drivers/dma/Makefile                               |    1 +
 drivers/dma/imx-dma.c                              |    3 +-
 drivers/dma/sirf-dma.c                             |    3 +-
 drivers/dma/xilinx/Makefile                        |    1 +
 drivers/dma/xilinx/xilinx_vdma.c                   | 1388 ++++++++++++++++++++
 drivers/media/platform/m2m-deinterlace.c           |    2 +-
 include/linux/amba/xilinx_dma.h                    |   46 +
 include/linux/dmaengine.h                          |    6 +-
 11 files changed, 1534 insertions(+), 7 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/dma/xilinx/xilinx_vdma.txt
 create mode 100644 drivers/dma/xilinx/Makefile
 create mode 100644 drivers/dma/xilinx/xilinx_vdma.c
 create mode 100644 include/linux/amba/xilinx_dma.h

-- 
1.7.9.5


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH v3 0/3] Add Xilinx AXI Video DMA Engine driver
@ 2014-02-15 12:00 ` Srikanth Thokala
  0 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-15 12:00 UTC (permalink / raw)
  To: dan.j.williams, vinod.koul, michal.simek, grant.likely, robh+dt
  Cc: devicetree, levex, lars, linux-kernel, Srikanth Thokala,
	dmaengine, andriy.shevchenko, linux-arm-kernel

Hi,

This is the driver for Xilinx AXI Video Direct Memory Access Engine.
It is a soft IP core, which provides high-bandwidth direct memory
access between memory and AXI4-Stream video type target peripherals
including peripherals which support AXI4-Stream Video Protocol. The
core provides efficient two dimensional DMA operations with independent
asynchronous read and write channel operation.

For more information on the IP, please refer to
http://www.xilinx.com/support/documentation/ip_documentation/axi_vdma/v6_1/pg020_axi_vdma.pdf

This patch also provides a test client, which assumes read and write channels
of the core are configured in a back-to-back connection.  It transfers
data on the write channel, read and verify the data on the read channel.

Use cases:
++++++++++
1. Xilinx Video Targeted Reference design
   http://www.wiki.xilinx.com/Zynq+Base+TRD+14.5
2. Common Display Framework
   http://events.linuxfoundation.org/sites/events/files/slides/20131024-elce.pdf

Regards,
Srikanth

Changes in v3:
- Created a separate patch for the DT binding documentation as suggested by
  Vinod, Thanks.
- Added support for interleaved frames with non-contiguous memory as suggested
  by Lars, Thanks.
- Rebased on v3.14.0-rc2 

Srikanth Thokala (3):
  dma: Support multiple interleaved frames with non-contiguous memory
  dma: Add Xilinx Video DMA DT Binding Documentation
  dma: Add Xilinx AXI Video Direct Memory Access Engine driver support

 .../devicetree/bindings/dma/xilinx/xilinx_vdma.txt |   75 ++
 Documentation/dmaengine.txt                        |    2 +-
 drivers/dma/Kconfig                                |   14 +
 drivers/dma/Makefile                               |    1 +
 drivers/dma/imx-dma.c                              |    3 +-
 drivers/dma/sirf-dma.c                             |    3 +-
 drivers/dma/xilinx/Makefile                        |    1 +
 drivers/dma/xilinx/xilinx_vdma.c                   | 1388 ++++++++++++++++++++
 drivers/media/platform/m2m-deinterlace.c           |    2 +-
 include/linux/amba/xilinx_dma.h                    |   46 +
 include/linux/dmaengine.h                          |    6 +-
 11 files changed, 1534 insertions(+), 7 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/dma/xilinx/xilinx_vdma.txt
 create mode 100644 drivers/dma/xilinx/Makefile
 create mode 100644 drivers/dma/xilinx/xilinx_vdma.c
 create mode 100644 include/linux/amba/xilinx_dma.h

-- 
1.7.9.5

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH v3 0/3] Add Xilinx AXI Video DMA Engine driver
@ 2014-02-15 12:00 ` Srikanth Thokala
  0 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-15 12:00 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

This is the driver for Xilinx AXI Video Direct Memory Access Engine.
It is a soft IP core, which provides high-bandwidth direct memory
access between memory and AXI4-Stream video type target peripherals
including peripherals which support AXI4-Stream Video Protocol. The
core provides efficient two dimensional DMA operations with independent
asynchronous read and write channel operation.

For more information on the IP, please refer to
http://www.xilinx.com/support/documentation/ip_documentation/axi_vdma/v6_1/pg020_axi_vdma.pdf

This patch also provides a test client, which assumes read and write channels
of the core are configured in a back-to-back connection.  It transfers
data on the write channel, read and verify the data on the read channel.

Use cases:
++++++++++
1. Xilinx Video Targeted Reference design
   http://www.wiki.xilinx.com/Zynq+Base+TRD+14.5
2. Common Display Framework
   http://events.linuxfoundation.org/sites/events/files/slides/20131024-elce.pdf

Regards,
Srikanth

Changes in v3:
- Created a separate patch for the DT binding documentation as suggested by
  Vinod, Thanks.
- Added support for interleaved frames with non-contiguous memory as suggested
  by Lars, Thanks.
- Rebased on v3.14.0-rc2 

Srikanth Thokala (3):
  dma: Support multiple interleaved frames with non-contiguous memory
  dma: Add Xilinx Video DMA DT Binding Documentation
  dma: Add Xilinx AXI Video Direct Memory Access Engine driver support

 .../devicetree/bindings/dma/xilinx/xilinx_vdma.txt |   75 ++
 Documentation/dmaengine.txt                        |    2 +-
 drivers/dma/Kconfig                                |   14 +
 drivers/dma/Makefile                               |    1 +
 drivers/dma/imx-dma.c                              |    3 +-
 drivers/dma/sirf-dma.c                             |    3 +-
 drivers/dma/xilinx/Makefile                        |    1 +
 drivers/dma/xilinx/xilinx_vdma.c                   | 1388 ++++++++++++++++++++
 drivers/media/platform/m2m-deinterlace.c           |    2 +-
 include/linux/amba/xilinx_dma.h                    |   46 +
 include/linux/dmaengine.h                          |    6 +-
 11 files changed, 1534 insertions(+), 7 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/dma/xilinx/xilinx_vdma.txt
 create mode 100644 drivers/dma/xilinx/Makefile
 create mode 100644 drivers/dma/xilinx/xilinx_vdma.c
 create mode 100644 include/linux/amba/xilinx_dma.h

-- 
1.7.9.5

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-15 12:00   ` Srikanth Thokala
  0 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-15 12:00 UTC (permalink / raw)
  To: dan.j.williams, vinod.koul, michal.simek, grant.likely, robh+dt
  Cc: levex, lars, andriy.shevchenko, dmaengine, linux-arm-kernel,
	linux-kernel, devicetree, Srikanth Thokala

The current implementation of interleaved DMA API support multiple
frames only when the memory is contiguous by incrementing src_start/
dst_start members of interleaved template.

But, when the memory is non-contiguous it will restrict slave device
to not submit multiple frames in a batch.  This patch handles this
issue by allowing the slave device to send array of interleaved dma
templates each having a different memory location.

Signed-off-by: Srikanth Thokala <sthokal@xilinx.com>
---
 Documentation/dmaengine.txt              |    2 +-
 drivers/dma/imx-dma.c                    |    3 ++-
 drivers/dma/sirf-dma.c                   |    3 ++-
 drivers/media/platform/m2m-deinterlace.c |    2 +-
 include/linux/dmaengine.h                |    6 +++---
 5 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/Documentation/dmaengine.txt b/Documentation/dmaengine.txt
index 879b6e3..c642614 100644
--- a/Documentation/dmaengine.txt
+++ b/Documentation/dmaengine.txt
@@ -94,7 +94,7 @@ The slave DMA usage consists of following steps:
 		size_t period_len, enum dma_data_direction direction);
 
 	struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
-		struct dma_chan *chan, struct dma_interleaved_template *xt,
+		struct dma_chan *chan, struct dma_interleaved_template **xts,
 		unsigned long flags);
 
    The peripheral driver is expected to have mapped the scatterlist for
diff --git a/drivers/dma/imx-dma.c b/drivers/dma/imx-dma.c
index 6f9ac20..e2c52ce 100644
--- a/drivers/dma/imx-dma.c
+++ b/drivers/dma/imx-dma.c
@@ -954,12 +954,13 @@ static struct dma_async_tx_descriptor *imxdma_prep_dma_memcpy(
 }
 
 static struct dma_async_tx_descriptor *imxdma_prep_dma_interleaved(
-	struct dma_chan *chan, struct dma_interleaved_template *xt,
+	struct dma_chan *chan, struct dma_interleaved_template **xts,
 	unsigned long flags)
 {
 	struct imxdma_channel *imxdmac = to_imxdma_chan(chan);
 	struct imxdma_engine *imxdma = imxdmac->imxdma;
 	struct imxdma_desc *desc;
+	struct dma_interleaved_template *xt = *xts;
 
 	dev_dbg(imxdma->dev, "%s channel: %d src_start=0x%llx dst_start=0x%llx\n"
 		"   src_sgl=%s dst_sgl=%s numf=%zu frame_size=%zu\n", __func__,
diff --git a/drivers/dma/sirf-dma.c b/drivers/dma/sirf-dma.c
index d4d3a31..b6a150b 100644
--- a/drivers/dma/sirf-dma.c
+++ b/drivers/dma/sirf-dma.c
@@ -509,12 +509,13 @@ sirfsoc_dma_tx_status(struct dma_chan *chan, dma_cookie_t cookie,
 }
 
 static struct dma_async_tx_descriptor *sirfsoc_dma_prep_interleaved(
-	struct dma_chan *chan, struct dma_interleaved_template *xt,
+	struct dma_chan *chan, struct dma_interleaved_template **xts,
 	unsigned long flags)
 {
 	struct sirfsoc_dma *sdma = dma_chan_to_sirfsoc_dma(chan);
 	struct sirfsoc_dma_chan *schan = dma_chan_to_sirfsoc_dma_chan(chan);
 	struct sirfsoc_dma_desc *sdesc = NULL;
+	struct dma_interleaved_template *xt = *xts;
 	unsigned long iflags;
 	int ret;
 
diff --git a/drivers/media/platform/m2m-deinterlace.c b/drivers/media/platform/m2m-deinterlace.c
index 6bb86b5..468110a 100644
--- a/drivers/media/platform/m2m-deinterlace.c
+++ b/drivers/media/platform/m2m-deinterlace.c
@@ -343,7 +343,7 @@ static void deinterlace_issue_dma(struct deinterlace_ctx *ctx, int op,
 	ctx->xt->dst_sgl = true;
 	flags = DMA_CTRL_ACK | DMA_PREP_INTERRUPT;
 
-	tx = dmadev->device_prep_interleaved_dma(chan, ctx->xt, flags);
+	tx = dmadev->device_prep_interleaved_dma(chan, &ctx->xt, flags);
 	if (tx == NULL) {
 		v4l2_warn(&pcdev->v4l2_dev, "DMA interleaved prep error\n");
 		return;
diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index c5c92d5..2f77a9a 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -675,7 +675,7 @@ struct dma_device {
 		size_t period_len, enum dma_transfer_direction direction,
 		unsigned long flags, void *context);
 	struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
-		struct dma_chan *chan, struct dma_interleaved_template *xt,
+		struct dma_chan *chan, struct dma_interleaved_template **xts,
 		unsigned long flags);
 	int (*device_control)(struct dma_chan *chan, enum dma_ctrl_cmd cmd,
 		unsigned long arg);
@@ -752,10 +752,10 @@ static inline struct dma_async_tx_descriptor *dmaengine_prep_dma_cyclic(
 }
 
 static inline struct dma_async_tx_descriptor *dmaengine_prep_interleaved_dma(
-		struct dma_chan *chan, struct dma_interleaved_template *xt,
+		struct dma_chan *chan, struct dma_interleaved_template **xts,
 		unsigned long flags)
 {
-	return chan->device->device_prep_interleaved_dma(chan, xt, flags);
+	return chan->device->device_prep_interleaved_dma(chan, xts, flags);
 }
 
 static inline int dma_get_slave_caps(struct dma_chan *chan, struct dma_slave_caps *caps)
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-15 12:00   ` Srikanth Thokala
  0 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-15 12:00 UTC (permalink / raw)
  To: dan.j.williams-ral2JQCrhuEAvxtiuMwx3w,
	vinod.koul-ral2JQCrhuEAvxtiuMwx3w,
	michal.simek-gjFFaj9aHVfQT0dZR+AlfA,
	grant.likely-QSEj5FYQhm4dnm+yROfE0A,
	robh+dt-DgEjT+Ai2ygdnm+yROfE0A
  Cc: levex-vYTEC60ixJUAvxtiuMwx3w, lars-Qo5EllUWu/uELgA04lAiVw,
	andriy.shevchenko-VuQAYsv1563Yd54FQh9/CA,
	dmaengine-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA, Srikanth Thokala

The current implementation of interleaved DMA API support multiple
frames only when the memory is contiguous by incrementing src_start/
dst_start members of interleaved template.

But, when the memory is non-contiguous it will restrict slave device
to not submit multiple frames in a batch.  This patch handles this
issue by allowing the slave device to send array of interleaved dma
templates each having a different memory location.

Signed-off-by: Srikanth Thokala <sthokal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org>
---
 Documentation/dmaengine.txt              |    2 +-
 drivers/dma/imx-dma.c                    |    3 ++-
 drivers/dma/sirf-dma.c                   |    3 ++-
 drivers/media/platform/m2m-deinterlace.c |    2 +-
 include/linux/dmaengine.h                |    6 +++---
 5 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/Documentation/dmaengine.txt b/Documentation/dmaengine.txt
index 879b6e3..c642614 100644
--- a/Documentation/dmaengine.txt
+++ b/Documentation/dmaengine.txt
@@ -94,7 +94,7 @@ The slave DMA usage consists of following steps:
 		size_t period_len, enum dma_data_direction direction);
 
 	struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
-		struct dma_chan *chan, struct dma_interleaved_template *xt,
+		struct dma_chan *chan, struct dma_interleaved_template **xts,
 		unsigned long flags);
 
    The peripheral driver is expected to have mapped the scatterlist for
diff --git a/drivers/dma/imx-dma.c b/drivers/dma/imx-dma.c
index 6f9ac20..e2c52ce 100644
--- a/drivers/dma/imx-dma.c
+++ b/drivers/dma/imx-dma.c
@@ -954,12 +954,13 @@ static struct dma_async_tx_descriptor *imxdma_prep_dma_memcpy(
 }
 
 static struct dma_async_tx_descriptor *imxdma_prep_dma_interleaved(
-	struct dma_chan *chan, struct dma_interleaved_template *xt,
+	struct dma_chan *chan, struct dma_interleaved_template **xts,
 	unsigned long flags)
 {
 	struct imxdma_channel *imxdmac = to_imxdma_chan(chan);
 	struct imxdma_engine *imxdma = imxdmac->imxdma;
 	struct imxdma_desc *desc;
+	struct dma_interleaved_template *xt = *xts;
 
 	dev_dbg(imxdma->dev, "%s channel: %d src_start=0x%llx dst_start=0x%llx\n"
 		"   src_sgl=%s dst_sgl=%s numf=%zu frame_size=%zu\n", __func__,
diff --git a/drivers/dma/sirf-dma.c b/drivers/dma/sirf-dma.c
index d4d3a31..b6a150b 100644
--- a/drivers/dma/sirf-dma.c
+++ b/drivers/dma/sirf-dma.c
@@ -509,12 +509,13 @@ sirfsoc_dma_tx_status(struct dma_chan *chan, dma_cookie_t cookie,
 }
 
 static struct dma_async_tx_descriptor *sirfsoc_dma_prep_interleaved(
-	struct dma_chan *chan, struct dma_interleaved_template *xt,
+	struct dma_chan *chan, struct dma_interleaved_template **xts,
 	unsigned long flags)
 {
 	struct sirfsoc_dma *sdma = dma_chan_to_sirfsoc_dma(chan);
 	struct sirfsoc_dma_chan *schan = dma_chan_to_sirfsoc_dma_chan(chan);
 	struct sirfsoc_dma_desc *sdesc = NULL;
+	struct dma_interleaved_template *xt = *xts;
 	unsigned long iflags;
 	int ret;
 
diff --git a/drivers/media/platform/m2m-deinterlace.c b/drivers/media/platform/m2m-deinterlace.c
index 6bb86b5..468110a 100644
--- a/drivers/media/platform/m2m-deinterlace.c
+++ b/drivers/media/platform/m2m-deinterlace.c
@@ -343,7 +343,7 @@ static void deinterlace_issue_dma(struct deinterlace_ctx *ctx, int op,
 	ctx->xt->dst_sgl = true;
 	flags = DMA_CTRL_ACK | DMA_PREP_INTERRUPT;
 
-	tx = dmadev->device_prep_interleaved_dma(chan, ctx->xt, flags);
+	tx = dmadev->device_prep_interleaved_dma(chan, &ctx->xt, flags);
 	if (tx == NULL) {
 		v4l2_warn(&pcdev->v4l2_dev, "DMA interleaved prep error\n");
 		return;
diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index c5c92d5..2f77a9a 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -675,7 +675,7 @@ struct dma_device {
 		size_t period_len, enum dma_transfer_direction direction,
 		unsigned long flags, void *context);
 	struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
-		struct dma_chan *chan, struct dma_interleaved_template *xt,
+		struct dma_chan *chan, struct dma_interleaved_template **xts,
 		unsigned long flags);
 	int (*device_control)(struct dma_chan *chan, enum dma_ctrl_cmd cmd,
 		unsigned long arg);
@@ -752,10 +752,10 @@ static inline struct dma_async_tx_descriptor *dmaengine_prep_dma_cyclic(
 }
 
 static inline struct dma_async_tx_descriptor *dmaengine_prep_interleaved_dma(
-		struct dma_chan *chan, struct dma_interleaved_template *xt,
+		struct dma_chan *chan, struct dma_interleaved_template **xts,
 		unsigned long flags)
 {
-	return chan->device->device_prep_interleaved_dma(chan, xt, flags);
+	return chan->device->device_prep_interleaved_dma(chan, xts, flags);
 }
 
 static inline int dma_get_slave_caps(struct dma_chan *chan, struct dma_slave_caps *caps)
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-15 12:00   ` Srikanth Thokala
  0 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-15 12:00 UTC (permalink / raw)
  To: linux-arm-kernel

The current implementation of interleaved DMA API support multiple
frames only when the memory is contiguous by incrementing src_start/
dst_start members of interleaved template.

But, when the memory is non-contiguous it will restrict slave device
to not submit multiple frames in a batch.  This patch handles this
issue by allowing the slave device to send array of interleaved dma
templates each having a different memory location.

Signed-off-by: Srikanth Thokala <sthokal@xilinx.com>
---
 Documentation/dmaengine.txt              |    2 +-
 drivers/dma/imx-dma.c                    |    3 ++-
 drivers/dma/sirf-dma.c                   |    3 ++-
 drivers/media/platform/m2m-deinterlace.c |    2 +-
 include/linux/dmaengine.h                |    6 +++---
 5 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/Documentation/dmaengine.txt b/Documentation/dmaengine.txt
index 879b6e3..c642614 100644
--- a/Documentation/dmaengine.txt
+++ b/Documentation/dmaengine.txt
@@ -94,7 +94,7 @@ The slave DMA usage consists of following steps:
 		size_t period_len, enum dma_data_direction direction);
 
 	struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
-		struct dma_chan *chan, struct dma_interleaved_template *xt,
+		struct dma_chan *chan, struct dma_interleaved_template **xts,
 		unsigned long flags);
 
    The peripheral driver is expected to have mapped the scatterlist for
diff --git a/drivers/dma/imx-dma.c b/drivers/dma/imx-dma.c
index 6f9ac20..e2c52ce 100644
--- a/drivers/dma/imx-dma.c
+++ b/drivers/dma/imx-dma.c
@@ -954,12 +954,13 @@ static struct dma_async_tx_descriptor *imxdma_prep_dma_memcpy(
 }
 
 static struct dma_async_tx_descriptor *imxdma_prep_dma_interleaved(
-	struct dma_chan *chan, struct dma_interleaved_template *xt,
+	struct dma_chan *chan, struct dma_interleaved_template **xts,
 	unsigned long flags)
 {
 	struct imxdma_channel *imxdmac = to_imxdma_chan(chan);
 	struct imxdma_engine *imxdma = imxdmac->imxdma;
 	struct imxdma_desc *desc;
+	struct dma_interleaved_template *xt = *xts;
 
 	dev_dbg(imxdma->dev, "%s channel: %d src_start=0x%llx dst_start=0x%llx\n"
 		"   src_sgl=%s dst_sgl=%s numf=%zu frame_size=%zu\n", __func__,
diff --git a/drivers/dma/sirf-dma.c b/drivers/dma/sirf-dma.c
index d4d3a31..b6a150b 100644
--- a/drivers/dma/sirf-dma.c
+++ b/drivers/dma/sirf-dma.c
@@ -509,12 +509,13 @@ sirfsoc_dma_tx_status(struct dma_chan *chan, dma_cookie_t cookie,
 }
 
 static struct dma_async_tx_descriptor *sirfsoc_dma_prep_interleaved(
-	struct dma_chan *chan, struct dma_interleaved_template *xt,
+	struct dma_chan *chan, struct dma_interleaved_template **xts,
 	unsigned long flags)
 {
 	struct sirfsoc_dma *sdma = dma_chan_to_sirfsoc_dma(chan);
 	struct sirfsoc_dma_chan *schan = dma_chan_to_sirfsoc_dma_chan(chan);
 	struct sirfsoc_dma_desc *sdesc = NULL;
+	struct dma_interleaved_template *xt = *xts;
 	unsigned long iflags;
 	int ret;
 
diff --git a/drivers/media/platform/m2m-deinterlace.c b/drivers/media/platform/m2m-deinterlace.c
index 6bb86b5..468110a 100644
--- a/drivers/media/platform/m2m-deinterlace.c
+++ b/drivers/media/platform/m2m-deinterlace.c
@@ -343,7 +343,7 @@ static void deinterlace_issue_dma(struct deinterlace_ctx *ctx, int op,
 	ctx->xt->dst_sgl = true;
 	flags = DMA_CTRL_ACK | DMA_PREP_INTERRUPT;
 
-	tx = dmadev->device_prep_interleaved_dma(chan, ctx->xt, flags);
+	tx = dmadev->device_prep_interleaved_dma(chan, &ctx->xt, flags);
 	if (tx == NULL) {
 		v4l2_warn(&pcdev->v4l2_dev, "DMA interleaved prep error\n");
 		return;
diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index c5c92d5..2f77a9a 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -675,7 +675,7 @@ struct dma_device {
 		size_t period_len, enum dma_transfer_direction direction,
 		unsigned long flags, void *context);
 	struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
-		struct dma_chan *chan, struct dma_interleaved_template *xt,
+		struct dma_chan *chan, struct dma_interleaved_template **xts,
 		unsigned long flags);
 	int (*device_control)(struct dma_chan *chan, enum dma_ctrl_cmd cmd,
 		unsigned long arg);
@@ -752,10 +752,10 @@ static inline struct dma_async_tx_descriptor *dmaengine_prep_dma_cyclic(
 }
 
 static inline struct dma_async_tx_descriptor *dmaengine_prep_interleaved_dma(
-		struct dma_chan *chan, struct dma_interleaved_template *xt,
+		struct dma_chan *chan, struct dma_interleaved_template **xts,
 		unsigned long flags)
 {
-	return chan->device->device_prep_interleaved_dma(chan, xt, flags);
+	return chan->device->device_prep_interleaved_dma(chan, xts, flags);
 }
 
 static inline int dma_get_slave_caps(struct dma_chan *chan, struct dma_slave_caps *caps)
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v3 2/3] dma: Add Xilinx Video DMA DT Binding Documentation
  2014-02-15 12:00 ` Srikanth Thokala
@ 2014-02-15 12:00   ` Srikanth Thokala
  -1 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-15 12:00 UTC (permalink / raw)
  To: dan.j.williams, vinod.koul, michal.simek, grant.likely, robh+dt
  Cc: levex, lars, andriy.shevchenko, dmaengine, linux-arm-kernel,
	linux-kernel, devicetree, Srikanth Thokala

Device-tree binding documentation of Xilinx Video DMA Engine

Signed-off-by: Srikanth Thokala <sthokal@xilinx.com>
---
Changes in v3:
None

Changes in v2:
- Removed device-id DT property, as suggested by Arnd Bergmann
- Properly documented DT bindings as suggested by Arnd Bergmann
---
 .../devicetree/bindings/dma/xilinx/xilinx_vdma.txt |   75 ++++++++++++++++++++
 1 file changed, 75 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/dma/xilinx/xilinx_vdma.txt

diff --git a/Documentation/devicetree/bindings/dma/xilinx/xilinx_vdma.txt b/Documentation/devicetree/bindings/dma/xilinx/xilinx_vdma.txt
new file mode 100644
index 0000000..ab8be1a
--- /dev/null
+++ b/Documentation/devicetree/bindings/dma/xilinx/xilinx_vdma.txt
@@ -0,0 +1,75 @@
+Xilinx AXI VDMA engine, it does transfers between memory and video devices.
+It can be configured to have one channel or two channels. If configured
+as two channels, one is to transmit to the video device and another is
+to receive from the video device.
+
+Required properties:
+- compatible: Should be "xlnx,axi-vdma-1.00.a"
+- #dma-cells: Should be <1>, see "dmas" property below
+- reg: Should contain VDMA registers location and length.
+- xlnx,num-fstores: Should be the number of framebuffers as configured in h/w.
+- dma-channel child node: Should have atleast one channel and can have upto
+	two channels per device. This node specifies the properties of each
+	DMA channel (see child node properties below).
+
+Optional properties:
+- xlnx,include-sg: Tells whether configured for Scatter-mode in
+	the hardware.
+- xlnx,flush-fsync: Tells whether which channel to Flush on Frame sync.
+	It takes following values:
+	{1}, flush both channels
+	{2}, flush mm2s channel
+	{3}, flush s2mm channel
+
+Required child node properties:
+- compatible: It should be either "xlnx,axi-vdma-mm2s-channel" or
+	"xlnx,axi-vdma-s2mm-channel".
+- interrupts: Should contain per channel VDMA interrupts.
+- xlnx,data-width: Should contain the stream data width, take values
+	{32,64...1024}.
+
+Option child node properties:
+- xlnx,include-dre: Tells whether hardware is configured for Data
+	Realignment Engine.
+- xlnx,genlock-mode: Tells whether Genlock synchronization is
+	enabled/disabled in hardware.
+
+Example:
+++++++++
+
+axi_vdma_0: axivdma@40030000 {
+	compatible = "xlnx,axi-vdma-1.00.a";
+	#dma_cells = <1>;
+	reg = < 0x40030000 0x10000 >;
+	xlnx,num-fstores = <0x8>;
+	xlnx,flush-fsync = <0x1>;
+	dma-channel@40030000 {
+		compatible = "xlnx,axi-vdma-mm2s-channel";
+		interrupts = < 0 54 4 >;
+		xlnx,datawidth = <0x40>;
+	} ;
+	dma-channel@40030030 {
+		compatible = "xlnx,axi-vdma-s2mm-channel";
+		interrupts = < 0 53 4 >;
+		xlnx,datawidth = <0x40>;
+	} ;
+} ;
+
+
+* DMA client
+
+Required properties:
+- dmas: a list of <[Video DMA device phandle] [Channel ID]> pairs,
+	where Channel ID is '0' for write/tx and '1' for read/rx
+	channel.
+- dma-names: a list of DMA channel names, one per "dmas" entry
+
+Example:
+++++++++
+
+vdmatest_0: vdmatest@0 {
+	compatible ="xlnx,axi-vdma-test-1.00.a";
+	dmas = <&axi_vdma_0 0
+		&axi_vdma_0 1>;
+	dma-names = "vdma0", "vdma1";
+} ;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v3 2/3] dma: Add Xilinx Video DMA DT Binding Documentation
@ 2014-02-15 12:00   ` Srikanth Thokala
  0 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-15 12:00 UTC (permalink / raw)
  To: linux-arm-kernel

Device-tree binding documentation of Xilinx Video DMA Engine

Signed-off-by: Srikanth Thokala <sthokal@xilinx.com>
---
Changes in v3:
None

Changes in v2:
- Removed device-id DT property, as suggested by Arnd Bergmann
- Properly documented DT bindings as suggested by Arnd Bergmann
---
 .../devicetree/bindings/dma/xilinx/xilinx_vdma.txt |   75 ++++++++++++++++++++
 1 file changed, 75 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/dma/xilinx/xilinx_vdma.txt

diff --git a/Documentation/devicetree/bindings/dma/xilinx/xilinx_vdma.txt b/Documentation/devicetree/bindings/dma/xilinx/xilinx_vdma.txt
new file mode 100644
index 0000000..ab8be1a
--- /dev/null
+++ b/Documentation/devicetree/bindings/dma/xilinx/xilinx_vdma.txt
@@ -0,0 +1,75 @@
+Xilinx AXI VDMA engine, it does transfers between memory and video devices.
+It can be configured to have one channel or two channels. If configured
+as two channels, one is to transmit to the video device and another is
+to receive from the video device.
+
+Required properties:
+- compatible: Should be "xlnx,axi-vdma-1.00.a"
+- #dma-cells: Should be <1>, see "dmas" property below
+- reg: Should contain VDMA registers location and length.
+- xlnx,num-fstores: Should be the number of framebuffers as configured in h/w.
+- dma-channel child node: Should have atleast one channel and can have upto
+	two channels per device. This node specifies the properties of each
+	DMA channel (see child node properties below).
+
+Optional properties:
+- xlnx,include-sg: Tells whether configured for Scatter-mode in
+	the hardware.
+- xlnx,flush-fsync: Tells whether which channel to Flush on Frame sync.
+	It takes following values:
+	{1}, flush both channels
+	{2}, flush mm2s channel
+	{3}, flush s2mm channel
+
+Required child node properties:
+- compatible: It should be either "xlnx,axi-vdma-mm2s-channel" or
+	"xlnx,axi-vdma-s2mm-channel".
+- interrupts: Should contain per channel VDMA interrupts.
+- xlnx,data-width: Should contain the stream data width, take values
+	{32,64...1024}.
+
+Option child node properties:
+- xlnx,include-dre: Tells whether hardware is configured for Data
+	Realignment Engine.
+- xlnx,genlock-mode: Tells whether Genlock synchronization is
+	enabled/disabled in hardware.
+
+Example:
+++++++++
+
+axi_vdma_0: axivdma at 40030000 {
+	compatible = "xlnx,axi-vdma-1.00.a";
+	#dma_cells = <1>;
+	reg = < 0x40030000 0x10000 >;
+	xlnx,num-fstores = <0x8>;
+	xlnx,flush-fsync = <0x1>;
+	dma-channel at 40030000 {
+		compatible = "xlnx,axi-vdma-mm2s-channel";
+		interrupts = < 0 54 4 >;
+		xlnx,datawidth = <0x40>;
+	} ;
+	dma-channel at 40030030 {
+		compatible = "xlnx,axi-vdma-s2mm-channel";
+		interrupts = < 0 53 4 >;
+		xlnx,datawidth = <0x40>;
+	} ;
+} ;
+
+
+* DMA client
+
+Required properties:
+- dmas: a list of <[Video DMA device phandle] [Channel ID]> pairs,
+	where Channel ID is '0' for write/tx and '1' for read/rx
+	channel.
+- dma-names: a list of DMA channel names, one per "dmas" entry
+
+Example:
+++++++++
+
+vdmatest_0: vdmatest at 0 {
+	compatible ="xlnx,axi-vdma-test-1.00.a";
+	dmas = <&axi_vdma_0 0
+		&axi_vdma_0 1>;
+	dma-names = "vdma0", "vdma1";
+} ;
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v3 3/3] dma: Add Xilinx AXI Video Direct Memory Access Engine driver support
  2014-02-15 12:00 ` Srikanth Thokala
@ 2014-02-15 12:00   ` Srikanth Thokala
  -1 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-15 12:00 UTC (permalink / raw)
  To: dan.j.williams, vinod.koul, michal.simek, grant.likely, robh+dt
  Cc: levex, lars, andriy.shevchenko, dmaengine, linux-arm-kernel,
	linux-kernel, devicetree, Srikanth Thokala

This is the driver for the AXI Video Direct Memory Access (AXI
VDMA) core, which is a soft Xilinx IP core that provides high-
bandwidth direct memory access between memory and AXI4-Stream
type video target peripherals. The core provides efficient two
dimensional DMA operations with independent asynchronous read
and write channel operation.

This module works on Zynq (ARM Based SoC) and Microblaze platforms.

Signed-off-by: Srikanth Thokala <sthokal@xilinx.com>
Reviewed-by: Levente Kurusa <levex@linux.com>
---
NOTE:
- Created a separate directory 'dma/xilinx' as Xilinx has two more
  DMA IPs and we are also planning to upstream these drivers soon.

Changes in v3:
- Implemented interleaved DMA API as suggested by Vinod and Lars, Thanks.
- Use dma_slave_config generic API as suggested by Vinod and Lars, Thanks.
- Simplified cookie implementation as sugguested by Vinod, Thanks.
- Simplified *_xlate function by using dma_get_slave_channel() as
  suggested by Lars, Thanks.
- Proper indentation of constants and maintained same multi-line of
  comments as suggested by Andy Shevchenko, Thanks.
- Modified to use request_irq() instead of devm_request_irq as suggested
  by Lars, Thanks.
- Fixed minor comments suggested by Andy and Levente, Thanks.

Changes in v2:
- Removed DMA Test client module from the patchset as suggested
  by Andy Shevchenko
- Returning with error, if registration of DMA to node fails
- Fixed typo errors
- Used BIT() macro at applicable places
- Added missing header file to the patchset
- Changed copyright year to include 2014
---
 drivers/dma/Kconfig              |   14 +
 drivers/dma/Makefile             |    1 +
 drivers/dma/xilinx/Makefile      |    1 +
 drivers/dma/xilinx/xilinx_vdma.c | 1388 ++++++++++++++++++++++++++++++++++++++
 include/linux/amba/xilinx_dma.h  |   46 ++
 5 files changed, 1450 insertions(+)
 create mode 100644 drivers/dma/xilinx/Makefile
 create mode 100644 drivers/dma/xilinx/xilinx_vdma.c
 create mode 100644 include/linux/amba/xilinx_dma.h

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 9bed1a2..53a1fa6 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -350,6 +350,20 @@ config MOXART_DMA
 	help
 	  Enable support for the MOXA ART SoC DMA controller.
 
+config XILINX_VDMA
+	tristate "Xilinx AXI VDMA Engine"
+	depends on (ARCH_ZYNQ || MICROBLAZE)
+	select DMA_ENGINE
+	help
+	  Enable support for Xilinx AXI VDMA Soft IP.
+
+	  This engine provides high-bandwidth direct memory access
+	  between memory and AXI4-Stream video type target
+	  peripherals including peripherals which support AXI4-
+	  Stream Video Protocol.  It has two stream interfaces/
+	  channels, Memory Mapped to Stream (MM2S) and Stream to
+	  Memory Mapped (S2MM) for the data transfers.
+
 config DMA_ENGINE
 	bool
 
diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile
index a029d0f4..fa38a77 100644
--- a/drivers/dma/Makefile
+++ b/drivers/dma/Makefile
@@ -44,3 +44,4 @@ obj-$(CONFIG_DMA_JZ4740) += dma-jz4740.o
 obj-$(CONFIG_TI_CPPI41) += cppi41.o
 obj-$(CONFIG_K3_DMA) += k3dma.o
 obj-$(CONFIG_MOXART_DMA) += moxart-dma.o
+obj-y += xilinx/
diff --git a/drivers/dma/xilinx/Makefile b/drivers/dma/xilinx/Makefile
new file mode 100644
index 0000000..3c4e9f2
--- /dev/null
+++ b/drivers/dma/xilinx/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_XILINX_VDMA) += xilinx_vdma.o
diff --git a/drivers/dma/xilinx/xilinx_vdma.c b/drivers/dma/xilinx/xilinx_vdma.c
new file mode 100644
index 0000000..fde6a9c
--- /dev/null
+++ b/drivers/dma/xilinx/xilinx_vdma.c
@@ -0,0 +1,1388 @@
+/*
+ * DMA driver for Xilinx Video DMA Engine
+ *
+ * Copyright (C) 2010-2014 Xilinx, Inc. All rights reserved.
+ *
+ * Based on the Freescale DMA driver.
+ *
+ * Description:
+ * The AXI Video Direct Memory Access (AXI VDMA) core is a soft Xilinx IP
+ * core that provides high-bandwidth direct memory access between memory
+ * and AXI4-Stream type video target peripherals. The core provides efficient
+ * two dimensional DMA operations with independent asynchronous read (S2MM)
+ * and write (MM2S) channel operation. It can be configured to have either
+ * one channel or two channels. If configured as two channels, one is to
+ * transmit to the video device (MM2S) and another is to receive from the
+ * video device (S2MM). Initialization, status, interrupt and management
+ * registers are accessed through an AXI4-Lite slave interface.
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/amba/xilinx_dma.h>
+#include <linux/bitops.h>
+#include <linux/dmapool.h>
+#include <linux/init.h>
+#include <linux/interrupt.h>
+#include <linux/io.h>
+#include <linux/module.h>
+#include <linux/of_address.h>
+#include <linux/of_dma.h>
+#include <linux/of_platform.h>
+#include <linux/of_irq.h>
+#include <linux/slab.h>
+
+#include "../dmaengine.h"
+
+/* Register/Descriptor Offsets */
+#define XILINX_VDMA_MM2S_CTRL_OFFSET		0x0000
+#define XILINX_VDMA_S2MM_CTRL_OFFSET		0x0030
+#define XILINX_VDMA_MM2S_DESC_OFFSET		0x0050
+#define XILINX_VDMA_S2MM_DESC_OFFSET		0x00a0
+
+/* Control Registers */
+#define XILINX_VDMA_REG_DMACR			0x0000
+#define XILINX_VDMA_DMACR_DELAY_MAX		0xff
+#define XILINX_VDMA_DMACR_DELAY_SHIFT		24
+#define XILINX_VDMA_DMACR_FRAME_COUNT_MAX	0xff
+#define XILINX_VDMA_DMACR_FRAME_COUNT_SHIFT	16
+#define XILINX_VDMA_DMACR_ERR_IRQ		BIT(14)
+#define XILINX_VDMA_DMACR_DLY_CNT_IRQ		BIT(13)
+#define XILINX_VDMA_DMACR_FRM_CNT_IRQ		BIT(12)
+#define XILINX_VDMA_DMACR_MASTER_SHIFT		8
+#define XILINX_VDMA_DMACR_FSYNCSRC_SHIFT	5
+#define XILINX_VDMA_DMACR_FRAMECNT_EN		BIT(4)
+#define XILINX_VDMA_DMACR_GENLOCK_EN		BIT(3)
+#define XILINX_VDMA_DMACR_RESET			BIT(2)
+#define XILINX_VDMA_DMACR_CIRC_EN		BIT(1)
+#define XILINX_VDMA_DMACR_RUNSTOP		BIT(0)
+#define XILINX_VDMA_DMACR_FSYNCSRC_MASK		GENMASK(6, 5)
+
+#define XILINX_VDMA_REG_DMASR			0x0004
+#define XILINX_VDMA_DMASR_EOL_LATE_ERR		BIT(15)
+#define XILINX_VDMA_DMASR_ERR_IRQ		BIT(14)
+#define XILINX_VDMA_DMASR_DLY_CNT_IRQ		BIT(13)
+#define XILINX_VDMA_DMASR_FRM_CNT_IRQ		BIT(12)
+#define XILINX_VDMA_DMASR_SOF_LATE_ERR		BIT(11)
+#define XILINX_VDMA_DMASR_SG_DEC_ERR		BIT(10)
+#define XILINX_VDMA_DMASR_SG_SLV_ERR		BIT(9)
+#define XILINX_VDMA_DMASR_EOF_EARLY_ERR		BIT(8)
+#define XILINX_VDMA_DMASR_SOF_EARLY_ERR		BIT(7)
+#define XILINX_VDMA_DMASR_DMA_DEC_ERR		BIT(6)
+#define XILINX_VDMA_DMASR_DMA_SLAVE_ERR		BIT(5)
+#define XILINX_VDMA_DMASR_DMA_INT_ERR		BIT(4)
+#define XILINX_VDMA_DMASR_IDLE			BIT(1)
+#define XILINX_VDMA_DMASR_HALTED		BIT(0)
+#define XILINX_VDMA_DMASR_DELAY_MASK		GENMASK(31, 24)
+#define XILINX_VDMA_DMASR_FRAME_COUNT_MASK	GENMASK(23, 16)
+
+#define XILINX_VDMA_REG_CURDESC			0x0008
+#define XILINX_VDMA_REG_TAILDESC		0x0010
+#define XILINX_VDMA_REG_REG_INDEX		0x0014
+#define XILINX_VDMA_REG_FRMSTORE		0x0018
+#define XILINX_VDMA_REG_THRESHOLD		0x001c
+#define XILINX_VDMA_REG_FRMPTR_STS		0x0024
+#define XILINX_VDMA_REG_PARK_PTR		0x0028
+#define XILINX_VDMA_PARK_PTR_WR_REF_SHIFT	8
+#define XILINX_VDMA_PARK_PTR_RD_REF_SHIFT	0
+#define XILINX_VDMA_REG_VDMA_VERSION		0x002c
+
+/* Register Direct Mode Registers */
+#define XILINX_VDMA_REG_VSIZE			0x0000
+#define XILINX_VDMA_REG_HSIZE			0x0004
+
+#define XILINX_VDMA_REG_FRMDLY_STRIDE		0x0008
+#define XILINX_VDMA_FRMDLY_STRIDE_FRMDLY_SHIFT	24
+#define XILINX_VDMA_FRMDLY_STRIDE_STRIDE_SHIFT	0
+
+#define XILINX_VDMA_REG_START_ADDRESS(n)	(0x000c + 4 * (n))
+
+/* HW specific definitions */
+#define XILINX_VDMA_MAX_CHANS_PER_DEVICE	0x2
+
+#define XILINX_VDMA_DMAXR_ALL_IRQ_MASK	\
+		(XILINX_VDMA_DMASR_FRM_CNT_IRQ | \
+		 XILINX_VDMA_DMASR_DLY_CNT_IRQ | \
+		 XILINX_VDMA_DMASR_ERR_IRQ)
+
+#define XILINX_VDMA_DMASR_ALL_ERR_MASK	\
+		(XILINX_VDMA_DMASR_EOL_LATE_ERR | \
+		 XILINX_VDMA_DMASR_SOF_LATE_ERR | \
+		 XILINX_VDMA_DMASR_SG_DEC_ERR | \
+		 XILINX_VDMA_DMASR_SG_SLV_ERR | \
+		 XILINX_VDMA_DMASR_EOF_EARLY_ERR | \
+		 XILINX_VDMA_DMASR_SOF_EARLY_ERR | \
+		 XILINX_VDMA_DMASR_DMA_DEC_ERR | \
+		 XILINX_VDMA_DMASR_DMA_SLAVE_ERR | \
+		 XILINX_VDMA_DMASR_DMA_INT_ERR)
+
+/*
+ * Recoverable errors are DMA Internal error, SOF Early, EOF Early
+ * and SOF Late. They are only recoverable when C_FLUSH_ON_FSYNC
+ * is enabled in the h/w system.
+ */
+#define XILINX_VDMA_DMASR_ERR_RECOVER_MASK	\
+		(XILINX_VDMA_DMASR_SOF_LATE_ERR | \
+		 XILINX_VDMA_DMASR_EOF_EARLY_ERR | \
+		 XILINX_VDMA_DMASR_SOF_EARLY_ERR | \
+		 XILINX_VDMA_DMASR_DMA_INT_ERR)
+
+/* Axi VDMA Flush on Fsync bits */
+#define XILINX_VDMA_FLUSH_S2MM		3
+#define XILINX_VDMA_FLUSH_MM2S		2
+#define XILINX_VDMA_FLUSH_BOTH		1
+
+/* Delay loop counter to prevent hardware failure */
+#define XILINX_VDMA_LOOP_COUNT		1000000
+
+/**
+ * struct xilinx_vdma_desc_hw - Hardware Descriptor
+ * @next_desc: Next Descriptor Pointer @0x00
+ * @pad1: Reserved @0x04
+ * @buf_addr: Buffer address @0x08
+ * @pad2: Reserved @0x0C
+ * @vsize: Vertical Size @0x10
+ * @hsize: Horizontal Size @0x14
+ * @stride: Number of bytes between the first
+ *	    pixels of each horizontal line @0x18
+ */
+struct xilinx_vdma_desc_hw {
+	u32 next_desc;
+	u32 pad1;
+	u32 buf_addr;
+	u32 pad2;
+	u32 vsize;
+	u32 hsize;
+	u32 stride;
+} __aligned(64);
+
+/**
+ * struct xilinx_vdma_tx_segment - Descriptor segment
+ * @hw: Hardware descriptor
+ * @node: Node in the descriptor segments list
+ * @phys: Physical address of segment
+ */
+struct xilinx_vdma_tx_segment {
+	struct xilinx_vdma_desc_hw hw;
+	struct list_head node;
+	dma_addr_t phys;
+} __aligned(64);
+
+/**
+ * struct xilinx_vdma_tx_descriptor - Per Transaction structure
+ * @async_tx: Async transaction descriptor
+ * @segments: TX segments list
+ * @node: Node in the channel descriptors list
+ */
+struct xilinx_vdma_tx_descriptor {
+	struct dma_async_tx_descriptor async_tx;
+	struct list_head segments;
+	struct list_head node;
+};
+
+/**
+ * struct xilinx_vdma_chan - Driver specific VDMA channel structure
+ * @xdev: Driver specific device structure
+ * @ctrl_offset: Control registers offset
+ * @desc_offset: TX descriptor registers offset
+ * @lock: Descriptor operation lock
+ * @pending_list: Descriptors waiting
+ * @active_desc: Active descriptor
+ * @done_list: Complete descriptors
+ * @common: DMA common channel
+ * @desc_pool: Descriptors pool
+ * @dev: The dma device
+ * @irq: Channel IRQ
+ * @id: Channel ID
+ * @direction: Transfer direction
+ * @num_frms: Number of frames
+ * @has_sg: Support scatter transfers
+ * @genlock: Support genlock mode
+ * @err: Channel has errors
+ * @tasklet: Cleanup work after irq
+ * @config: Device configuration info
+ * @flush_on_fsync: Flush on Frame sync
+ */
+struct xilinx_vdma_chan {
+	struct xilinx_vdma_device *xdev;
+	u32 ctrl_offset;
+	u32 desc_offset;
+	spinlock_t lock;
+	struct list_head pending_list;
+	struct xilinx_vdma_tx_descriptor *active_desc;
+	struct list_head done_list;
+	struct dma_chan common;
+	struct dma_pool *desc_pool;
+	struct device *dev;
+	int irq;
+	int id;
+	enum dma_transfer_direction direction;
+	int num_frms;
+	bool has_sg;
+	bool genlock;
+	bool err;
+	struct tasklet_struct tasklet;
+	struct xilinx_vdma_config config;
+	bool flush_on_fsync;
+};
+
+/**
+ * struct xilinx_vdma_device - VDMA device structure
+ * @regs: I/O mapped base address
+ * @dev: Device Structure
+ * @common: DMA device structure
+ * @chan: Driver specific VDMA channel
+ * @has_sg: Specifies whether Scatter-Gather is present or not
+ * @flush_on_fsync: Flush on frame sync
+ */
+struct xilinx_vdma_device {
+	void __iomem *regs;
+	struct device *dev;
+	struct dma_device common;
+	struct xilinx_vdma_chan *chan[XILINX_VDMA_MAX_CHANS_PER_DEVICE];
+	bool has_sg;
+	u32 flush_on_fsync;
+};
+
+/* Macros */
+#define to_xilinx_chan(chan) \
+	container_of(chan, struct xilinx_vdma_chan, common)
+#define to_vdma_tx_descriptor(tx) \
+	container_of(tx, struct xilinx_vdma_tx_descriptor, async_tx)
+#define to_xilinx_vdma_config(cfg) \
+	container_of(cfg, struct xilinx_vdma_config, slvcfg)
+
+/* IO accessors */
+static inline u32 vdma_read(struct xilinx_vdma_chan *chan, u32 reg)
+{
+	return ioread32(chan->xdev->regs + reg);
+}
+
+static inline void vdma_write(struct xilinx_vdma_chan *chan, u32 reg, u32 value)
+{
+	iowrite32(value, chan->xdev->regs + reg);
+}
+
+static inline void vdma_desc_write(struct xilinx_vdma_chan *chan, u32 reg,
+				   u32 value)
+{
+	vdma_write(chan, chan->desc_offset + reg, value);
+}
+
+static inline u32 vdma_ctrl_read(struct xilinx_vdma_chan *chan, u32 reg)
+{
+	return vdma_read(chan, chan->ctrl_offset + reg);
+}
+
+static inline void vdma_ctrl_write(struct xilinx_vdma_chan *chan, u32 reg,
+				   u32 value)
+{
+	vdma_write(chan, chan->ctrl_offset + reg, value);
+}
+
+static inline void vdma_ctrl_clr(struct xilinx_vdma_chan *chan, u32 reg,
+				 u32 clr)
+{
+	vdma_ctrl_write(chan, reg, vdma_ctrl_read(chan, reg) & ~clr);
+}
+
+static inline void vdma_ctrl_set(struct xilinx_vdma_chan *chan, u32 reg,
+				 u32 set)
+{
+	vdma_ctrl_write(chan, reg, vdma_ctrl_read(chan, reg) | set);
+}
+
+/* -----------------------------------------------------------------------------
+ * Descriptors and segments alloc and free
+ */
+
+/**
+ * xilinx_vdma_alloc_tx_segment - Allocate transaction segment
+ * @chan: Driver specific VDMA channel
+ *
+ * Return: The allocated segment on success and NULL on failure.
+ */
+static struct xilinx_vdma_tx_segment *
+xilinx_vdma_alloc_tx_segment(struct xilinx_vdma_chan *chan)
+{
+	struct xilinx_vdma_tx_segment *segment;
+	dma_addr_t phys;
+
+	segment = dma_pool_alloc(chan->desc_pool, GFP_ATOMIC, &phys);
+	if (!segment)
+		return NULL;
+
+	memset(segment, 0, sizeof(*segment));
+	segment->phys = phys;
+
+	return segment;
+}
+
+/**
+ * xilinx_vdma_free_tx_segment - Free transaction segment
+ * @chan: Driver specific VDMA channel
+ * @segment: VDMA transaction segment
+ */
+static void xilinx_vdma_free_tx_segment(struct xilinx_vdma_chan *chan,
+					struct xilinx_vdma_tx_segment *segment)
+{
+	dma_pool_free(chan->desc_pool, segment, segment->phys);
+}
+
+/**
+ * xilinx_vdma_tx_descriptor - Allocate transaction descriptor
+ * @chan: Driver specific VDMA channel
+ *
+ * Return: The allocated descriptor on success and NULL on failure.
+ */
+static struct xilinx_vdma_tx_descriptor *
+xilinx_vdma_alloc_tx_descriptor(struct xilinx_vdma_chan *chan)
+{
+	struct xilinx_vdma_tx_descriptor *desc;
+
+	desc = kzalloc(sizeof(*desc), GFP_KERNEL);
+	if (!desc)
+		return NULL;
+
+	INIT_LIST_HEAD(&desc->segments);
+
+	return desc;
+}
+
+/**
+ * xilinx_vdma_free_tx_descriptor - Free transaction descriptor
+ * @chan: Driver specific VDMA channel
+ * @desc: VDMA transaction descriptor
+ */
+static void
+xilinx_vdma_free_tx_descriptor(struct xilinx_vdma_chan *chan,
+			       struct xilinx_vdma_tx_descriptor *desc)
+{
+	struct xilinx_vdma_tx_segment *segment, *next;
+
+	if (!desc)
+		return;
+
+	list_for_each_entry_safe(segment, next, &desc->segments, node) {
+		list_del(&segment->node);
+		xilinx_vdma_free_tx_segment(chan, segment);
+	}
+
+	kfree(desc);
+}
+
+/* Required functions */
+
+/**
+ * xilinx_vdma_free_desc_list - Free descriptors list
+ * @chan: Driver specific VDMA channel
+ * @list: List to parse and delete the descriptor
+ */
+static void xilinx_vdma_free_desc_list(struct xilinx_vdma_chan *chan,
+					struct list_head *list)
+{
+	struct xilinx_vdma_tx_descriptor *desc, *next;
+
+	list_for_each_entry_safe(desc, next, list, node) {
+		list_del(&desc->node);
+		xilinx_vdma_free_tx_descriptor(chan, desc);
+	}
+}
+
+/**
+ * xilinx_vdma_free_descriptors - Free channel descriptors
+ * @chan: Driver specific VDMA channel
+ */
+static void xilinx_vdma_free_descriptors(struct xilinx_vdma_chan *chan)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&chan->lock, flags);
+
+	xilinx_vdma_free_desc_list(chan, &chan->pending_list);
+	xilinx_vdma_free_desc_list(chan, &chan->done_list);
+
+	xilinx_vdma_free_tx_descriptor(chan, chan->active_desc);
+	chan->active_desc = NULL;
+
+	spin_unlock_irqrestore(&chan->lock, flags);
+}
+
+/**
+ * xilinx_vdma_free_chan_resources - Free channel resources
+ * @dchan: DMA channel
+ */
+static void xilinx_vdma_free_chan_resources(struct dma_chan *dchan)
+{
+	struct xilinx_vdma_chan *chan = to_xilinx_chan(dchan);
+
+	dev_dbg(chan->dev, "Free all channel resources.\n");
+
+	xilinx_vdma_free_descriptors(chan);
+	dma_pool_destroy(chan->desc_pool);
+	chan->desc_pool = NULL;
+}
+
+/**
+ * xilinx_vdma_chan_desc_cleanup - Clean channel descriptors
+ * @chan: Driver specific VDMA channel
+ */
+static void xilinx_vdma_chan_desc_cleanup(struct xilinx_vdma_chan *chan)
+{
+	struct xilinx_vdma_tx_descriptor *desc, *next;
+	unsigned long flags;
+
+	spin_lock_irqsave(&chan->lock, flags);
+
+	list_for_each_entry_safe(desc, next, &chan->done_list, node) {
+		dma_async_tx_callback callback;
+		void *callback_param;
+
+		/* Remove from the list of running transactions */
+		list_del(&desc->node);
+
+		/* Run the link descriptor callback function */
+		callback = desc->async_tx.callback;
+		callback_param = desc->async_tx.callback_param;
+		if (callback) {
+			spin_unlock_irqrestore(&chan->lock, flags);
+			callback(callback_param);
+			spin_lock_irqsave(&chan->lock, flags);
+		}
+
+		/* Run any dependencies, then free the descriptor */
+		dma_run_dependencies(&desc->async_tx);
+		xilinx_vdma_free_tx_descriptor(chan, desc);
+	}
+
+	spin_unlock_irqrestore(&chan->lock, flags);
+}
+
+/**
+ * xilinx_vdma_do_tasklet - Schedule completion tasklet
+ * @data: Pointer to the Xilinx VDMA channel structure
+ */
+static void xilinx_vdma_do_tasklet(unsigned long data)
+{
+	struct xilinx_vdma_chan *chan = (struct xilinx_vdma_chan *)data;
+
+	xilinx_vdma_chan_desc_cleanup(chan);
+}
+
+/**
+ * xilinx_vdma_alloc_chan_resources - Allocate channel resources
+ * @dchan: DMA channel
+ *
+ * Return: '0' on success and failure value on error
+ */
+static int xilinx_vdma_alloc_chan_resources(struct dma_chan *dchan)
+{
+	struct xilinx_vdma_chan *chan = to_xilinx_chan(dchan);
+
+	/* Has this channel already been allocated? */
+	if (chan->desc_pool)
+		return 0;
+
+	/*
+	 * We need the descriptor to be aligned to 64bytes
+	 * for meeting Xilinx VDMA specification requirement.
+	 */
+	chan->desc_pool = dma_pool_create("xilinx_vdma_desc_pool",
+				chan->dev,
+				sizeof(struct xilinx_vdma_tx_segment),
+				__alignof__(struct xilinx_vdma_tx_segment), 0);
+	if (!chan->desc_pool) {
+		dev_err(chan->dev,
+			"unable to allocate channel %d descriptor pool\n",
+			chan->id);
+		return -ENOMEM;
+	}
+
+	dma_cookie_init(dchan);
+	return 0;
+}
+
+/**
+ * xilinx_vdma_tx_status - Get VDMA transaction status
+ * @dchan: DMA channel
+ * @cookie: Transaction identifier
+ * @txstate: Transaction state
+ *
+ * Return: DMA transaction status
+ */
+static enum dma_status xilinx_vdma_tx_status(struct dma_chan *dchan,
+					dma_cookie_t cookie,
+					struct dma_tx_state *txstate)
+{
+	return dma_cookie_status(dchan, cookie, txstate);
+}
+
+/**
+ * xilinx_vdma_is_running - Check if VDMA channel is running
+ * @chan: Driver specific VDMA channel
+ *
+ * Return: '1' if running, '0' if not.
+ */
+static int xilinx_vdma_is_running(struct xilinx_vdma_chan *chan)
+{
+	return !(vdma_ctrl_read(chan, XILINX_VDMA_REG_DMASR) &
+		 XILINX_VDMA_DMASR_HALTED) &&
+		(vdma_ctrl_read(chan, XILINX_VDMA_REG_DMACR) &
+		 XILINX_VDMA_DMACR_RUNSTOP);
+}
+
+/**
+ * xilinx_vdma_is_idle - Check if VDMA channel is idle
+ * @chan: Driver specific VDMA channel
+ *
+ * Return: '1' if idle, '0' if not.
+ */
+static int xilinx_vdma_is_idle(struct xilinx_vdma_chan *chan)
+{
+	return vdma_ctrl_read(chan, XILINX_VDMA_REG_DMASR) &
+		XILINX_VDMA_DMASR_IDLE;
+}
+
+/**
+ * xilinx_vdma_halt - Halt VDMA channel
+ * @chan: Driver specific VDMA channel
+ */
+static void xilinx_vdma_halt(struct xilinx_vdma_chan *chan)
+{
+	int loop = XILINX_VDMA_LOOP_COUNT;
+
+	vdma_ctrl_clr(chan, XILINX_VDMA_REG_DMACR, XILINX_VDMA_DMACR_RUNSTOP);
+
+	/* Wait for the hardware to halt */
+	do {
+		if (vdma_ctrl_read(chan, XILINX_VDMA_REG_DMASR) &
+		    XILINX_VDMA_DMASR_HALTED)
+			break;
+	} while (loop--);
+
+	if (!loop) {
+		dev_err(chan->dev, "Cannot stop channel %p: %x\n",
+			chan, vdma_ctrl_read(chan, XILINX_VDMA_REG_DMASR));
+		chan->err = true;
+	}
+
+	return;
+}
+
+/**
+ * xilinx_vdma_start - Start VDMA channel
+ * @chan: Driver specific VDMA channel
+ */
+static void xilinx_vdma_start(struct xilinx_vdma_chan *chan)
+{
+	int loop = XILINX_VDMA_LOOP_COUNT;
+
+	vdma_ctrl_set(chan, XILINX_VDMA_REG_DMACR, XILINX_VDMA_DMACR_RUNSTOP);
+
+	/* Wait for the hardware to start */
+	do {
+		if (!(vdma_ctrl_read(chan, XILINX_VDMA_REG_DMASR) &
+		      XILINX_VDMA_DMASR_HALTED))
+			break;
+	} while (loop--);
+
+	if (!loop) {
+		dev_err(chan->dev, "Cannot start channel %p: %x\n",
+			chan, vdma_ctrl_read(chan, XILINX_VDMA_REG_DMASR));
+
+		chan->err = true;
+	}
+
+	return;
+}
+
+/**
+ * xilinx_vdma_start_transfer - Starts VDMA transfer
+ * @chan: Driver specific channel struct pointer
+ */
+static void xilinx_vdma_start_transfer(struct xilinx_vdma_chan *chan)
+{
+	struct xilinx_vdma_config *config = &chan->config;
+	struct xilinx_vdma_tx_descriptor *desc;
+	unsigned long flags;
+	u32 reg;
+	struct xilinx_vdma_tx_segment *head, *tail = NULL;
+
+	if (chan->err)
+		return;
+
+	spin_lock_irqsave(&chan->lock, flags);
+
+	/* There's already an active descriptor, bail out. */
+	if (chan->active_desc)
+		goto out_unlock;
+
+	if (list_empty(&chan->pending_list))
+		goto out_unlock;
+
+	desc = list_first_entry(&chan->pending_list,
+				struct xilinx_vdma_tx_descriptor, node);
+
+	/* If it is SG mode and hardware is busy, cannot submit */
+	if (chan->has_sg && xilinx_vdma_is_running(chan) &&
+	    !xilinx_vdma_is_idle(chan)) {
+		dev_dbg(chan->dev, "DMA controller still busy\n");
+		goto out_unlock;
+	}
+
+	if (chan->err)
+		goto out_unlock;
+
+	/*
+	 * If hardware is idle, then all descriptors on the running lists are
+	 * done, start new transfers
+	 */
+	if (chan->has_sg) {
+		head = list_first_entry(&desc->segments,
+					struct xilinx_vdma_tx_segment, node);
+		tail = list_entry(desc->segments.prev,
+				  struct xilinx_vdma_tx_segment, node);
+
+		vdma_ctrl_write(chan, XILINX_VDMA_REG_CURDESC, head->phys);
+	}
+
+	/* Configure the hardware using info in the config structure */
+	reg = vdma_ctrl_read(chan, XILINX_VDMA_REG_DMACR);
+
+	if (config->frm_cnt_en)
+		reg |= XILINX_VDMA_DMACR_FRAMECNT_EN;
+	else
+		reg &= ~XILINX_VDMA_DMACR_FRAMECNT_EN;
+
+	/*
+	 * With SG, start with circular mode, so that BDs can be fetched.
+	 * In direct register mode, if not parking, enable circular mode
+	 */
+	if (chan->has_sg || !config->park)
+		reg |= XILINX_VDMA_DMACR_CIRC_EN;
+
+	if (config->park)
+		reg &= ~XILINX_VDMA_DMACR_CIRC_EN;
+
+	vdma_ctrl_write(chan, XILINX_VDMA_REG_DMACR, reg);
+
+	if (config->park && (config->park_frm >= 0) &&
+			(config->park_frm < chan->num_frms)) {
+		if (chan->direction == DMA_MEM_TO_DEV)
+			vdma_write(chan, XILINX_VDMA_REG_PARK_PTR,
+				config->park_frm <<
+					XILINX_VDMA_PARK_PTR_RD_REF_SHIFT);
+		else
+			vdma_write(chan, XILINX_VDMA_REG_PARK_PTR,
+				config->park_frm <<
+					XILINX_VDMA_PARK_PTR_WR_REF_SHIFT);
+	}
+
+	/* Start the hardware */
+	xilinx_vdma_start(chan);
+
+	if (chan->err)
+		goto out_unlock;
+
+	/* Start the transfer */
+	if (chan->has_sg) {
+		vdma_ctrl_write(chan, XILINX_VDMA_REG_TAILDESC, tail->phys);
+	} else {
+		struct xilinx_vdma_tx_segment *segment, *last = NULL;
+		int i = 0;
+
+		list_for_each_entry(segment, &desc->segments, node) {
+			vdma_desc_write(chan,
+					XILINX_VDMA_REG_START_ADDRESS(i++),
+					segment->hw.buf_addr);
+			last = segment;
+		}
+
+		if (!last)
+			goto out_unlock;
+
+		/* HW expects these parameters to be same for one transaction */
+		vdma_desc_write(chan, XILINX_VDMA_REG_HSIZE, last->hw.hsize);
+		vdma_desc_write(chan, XILINX_VDMA_REG_FRMDLY_STRIDE,
+				last->hw.stride);
+		vdma_desc_write(chan, XILINX_VDMA_REG_VSIZE, last->hw.vsize);
+	}
+
+	list_del(&desc->node);
+	chan->active_desc = desc;
+
+out_unlock:
+	spin_unlock_irqrestore(&chan->lock, flags);
+}
+
+/**
+ * xilinx_vdma_issue_pending - Issue pending transactions
+ * @dchan: DMA channel
+ */
+static void xilinx_vdma_issue_pending(struct dma_chan *dchan)
+{
+	struct xilinx_vdma_chan *chan = to_xilinx_chan(dchan);
+
+	xilinx_vdma_start_transfer(chan);
+}
+
+/**
+ * xilinx_vdma_complete_descriptor - Mark the active descriptor as complete
+ * @chan : xilinx DMA channel
+ *
+ * CONTEXT: hardirq
+ */
+static void xilinx_vdma_complete_descriptor(struct xilinx_vdma_chan *chan)
+{
+	struct xilinx_vdma_tx_descriptor *desc;
+	unsigned long flags;
+
+	spin_lock_irqsave(&chan->lock, flags);
+
+	desc = chan->active_desc;
+	if (!desc) {
+		dev_dbg(chan->dev, "no running descriptors\n");
+		goto out_unlock;
+	}
+
+	dma_cookie_complete(&desc->async_tx);
+	list_add_tail(&desc->node, &chan->done_list);
+
+	chan->active_desc = NULL;
+
+out_unlock:
+	spin_unlock_irqrestore(&chan->lock, flags);
+}
+
+/**
+ * xilinx_vdma_reset - Reset VDMA channel
+ * @chan: Driver specific VDMA channel
+ *
+ * Return: '0' on success and failure value on error
+ */
+static int xilinx_vdma_reset(struct xilinx_vdma_chan *chan)
+{
+	int loop = XILINX_VDMA_LOOP_COUNT;
+	u32 tmp;
+
+	vdma_ctrl_set(chan, XILINX_VDMA_REG_DMACR, XILINX_VDMA_DMACR_RESET);
+
+	tmp = vdma_ctrl_read(chan, XILINX_VDMA_REG_DMACR) &
+		XILINX_VDMA_DMACR_RESET;
+
+	/* Wait for the hardware to finish reset */
+	do {
+		tmp = vdma_ctrl_read(chan, XILINX_VDMA_REG_DMACR) &
+			XILINX_VDMA_DMACR_RESET;
+	} while (loop-- && tmp);
+
+	if (!loop) {
+		dev_err(chan->dev, "reset timeout, cr %x, sr %x\n",
+			vdma_ctrl_read(chan, XILINX_VDMA_REG_DMACR),
+			vdma_ctrl_read(chan, XILINX_VDMA_REG_DMASR));
+		return -ETIMEDOUT;
+	}
+
+	chan->err = false;
+
+	return 0;
+}
+
+/**
+ * xilinx_vdma_chan_reset - Reset VDMA channel and enable interrupts
+ * @chan: Driver specific VDMA channel
+ *
+ * Return: '0' on success and failure value on error
+ */
+static int xilinx_vdma_chan_reset(struct xilinx_vdma_chan *chan)
+{
+	int err;
+
+	/* Reset VDMA */
+	err = xilinx_vdma_reset(chan);
+	if (err)
+		return err;
+
+	/* Enable interrupts */
+	vdma_ctrl_set(chan, XILINX_VDMA_REG_DMACR,
+		      XILINX_VDMA_DMAXR_ALL_IRQ_MASK);
+
+	return 0;
+}
+
+/**
+ * xilinx_vdma_irq_handler - VDMA Interrupt handler
+ * @irq: IRQ number
+ * @data: Pointer to the Xilinx VDMA channel structure
+ *
+ * Return: IRQ_HANDLED/IRQ_NONE
+ */
+static irqreturn_t xilinx_vdma_irq_handler(int irq, void *data)
+{
+	struct xilinx_vdma_chan *chan = data;
+	u32 status;
+
+	/* Read the status and ack the interrupts. */
+	status = vdma_ctrl_read(chan, XILINX_VDMA_REG_DMASR);
+	if (!(status & XILINX_VDMA_DMAXR_ALL_IRQ_MASK))
+		return IRQ_NONE;
+
+	vdma_ctrl_write(chan, XILINX_VDMA_REG_DMASR,
+			status & XILINX_VDMA_DMAXR_ALL_IRQ_MASK);
+
+	if (status & XILINX_VDMA_DMASR_ERR_IRQ) {
+		/*
+		 * An error occurred. If C_FLUSH_ON_FSYNC is enabled and the
+		 * error is recoverable, ignore it. Otherwise flag the error.
+		 *
+		 * Only recoverable errors can be cleared in the DMASR register,
+		 * make sure not to write to other error bits to 1.
+		 */
+		u32 errors = status & XILINX_VDMA_DMASR_ALL_ERR_MASK;
+		vdma_ctrl_write(chan, XILINX_VDMA_REG_DMASR,
+				errors & XILINX_VDMA_DMASR_ERR_RECOVER_MASK);
+
+		if (!chan->flush_on_fsync ||
+		    (errors & ~XILINX_VDMA_DMASR_ERR_RECOVER_MASK)) {
+			dev_err(chan->dev,
+				"Channel %p has errors %x, cdr %x tdr %x\n",
+				chan, errors,
+				vdma_ctrl_read(chan, XILINX_VDMA_REG_CURDESC),
+				vdma_ctrl_read(chan, XILINX_VDMA_REG_TAILDESC));
+			chan->err = true;
+		}
+	}
+
+	if (status & XILINX_VDMA_DMASR_DLY_CNT_IRQ) {
+		/*
+		 * Device takes too long to do the transfer when user requires
+		 * responsiveness.
+		 */
+		dev_dbg(chan->dev, "Inter-packet latency too long\n");
+	}
+
+	if (status & XILINX_VDMA_DMASR_FRM_CNT_IRQ) {
+		xilinx_vdma_complete_descriptor(chan);
+		xilinx_vdma_start_transfer(chan);
+	}
+
+	tasklet_schedule(&chan->tasklet);
+	return IRQ_HANDLED;
+}
+
+/**
+ * xilinx_vdma_tx_submit - Submit DMA transaction
+ * @tx: Async transaction descriptor
+ *
+ * Return: cookie value on success and failure value on error
+ */
+static dma_cookie_t xilinx_vdma_tx_submit(struct dma_async_tx_descriptor *tx)
+{
+	struct xilinx_vdma_tx_descriptor *desc = to_vdma_tx_descriptor(tx);
+	struct xilinx_vdma_chan *chan = to_xilinx_chan(tx->chan);
+	dma_cookie_t cookie;
+	unsigned long flags;
+	int err;
+
+	if (chan->err) {
+		/*
+		 * If reset fails, need to hard reset the system.
+		 * Channel is no longer functional
+		 */
+		err = xilinx_vdma_chan_reset(chan);
+		if (err < 0)
+			return err;
+	}
+
+	spin_lock_irqsave(&chan->lock, flags);
+
+	cookie = dma_cookie_assign(tx);
+
+	/* Append the transaction to the pending transactions queue. */
+	list_add_tail(&desc->node, &chan->pending_list);
+
+	spin_unlock_irqrestore(&chan->lock, flags);
+
+	return cookie;
+}
+
+/**
+ * xilinx_vdma_prep_slave_sg - prepare a descriptor for a DMA_SLAVE transaction
+ * @dchan: DMA channel
+ * @xts: Array of interleaved template structs
+ * @flags: transfer ack flags
+ *
+ * Return: Async transaction descriptor on success and NULL on failure
+ */
+static struct dma_async_tx_descriptor *
+xilinx_vdma_dma_prep_interleaved(struct dma_chan *dchan,
+				 struct dma_interleaved_template **xts,
+				 unsigned long flags)
+{
+	struct xilinx_vdma_chan *chan = to_xilinx_chan(dchan);
+	struct xilinx_vdma_tx_descriptor *desc;
+	struct xilinx_vdma_tx_segment *segment;
+	struct xilinx_vdma_tx_segment *prev = NULL;
+	struct dma_interleaved_template *xt = *xts;
+	u32 frmno = 0;
+
+	/* Count number of frames */
+	for (; xts[frmno] != NULL; frmno++)
+		;
+
+	if (frmno != chan->num_frms) {
+		dev_err(chan->dev,
+			"number of entries %d not the same as num stores %d\n",
+			frmno, chan->num_frms);
+		return NULL;
+	}
+
+	/* Allocate a transaction descriptor. */
+	desc = xilinx_vdma_alloc_tx_descriptor(chan);
+	if (!desc)
+		return NULL;
+
+	dma_async_tx_descriptor_init(&desc->async_tx, &chan->common);
+	desc->async_tx.tx_submit = xilinx_vdma_tx_submit;
+	desc->async_tx.cookie = 0;
+	async_tx_ack(&desc->async_tx);
+
+	/* Build the list of transaction segments. */
+	for (; (xt = *xts); xts++) {
+		struct xilinx_vdma_desc_hw *hw;
+
+		if ((xt->dir != DMA_MEM_TO_DEV) && (xt->dir != DMA_DEV_TO_MEM))
+			goto error;
+
+		if (!xt->numf || !xt->sgl[0].size)
+			goto error;
+
+		/* Allocate the link descriptor from DMA pool */
+		segment = xilinx_vdma_alloc_tx_segment(chan);
+		if (!segment)
+			goto error;
+
+		/* Fill in the hardware descriptor */
+		hw = &segment->hw;
+		if (xt->dir != DMA_MEM_TO_DEV)
+			hw->buf_addr = xt->dst_start;
+		else
+			hw->buf_addr = xt->src_start;
+		hw->vsize = xt->numf;
+		hw->hsize = xt->sgl[0].size;
+		hw->stride = (chan->config.frm_dly <<
+			      XILINX_VDMA_FRMDLY_STRIDE_FRMDLY_SHIFT) |
+			     (xt->sgl[0].icg <<
+			      XILINX_VDMA_FRMDLY_STRIDE_STRIDE_SHIFT);
+
+		if (prev)
+			prev->hw.next_desc = segment->phys;
+
+		/* Insert the segment into the descriptor segments list. */
+		list_add_tail(&segment->node, &desc->segments);
+
+		prev = segment;
+	}
+
+	/* Link the last hardware descriptor with the first. */
+	segment = list_first_entry(&desc->segments,
+				   struct xilinx_vdma_tx_segment, node);
+	prev->hw.next_desc = segment->phys;
+
+	return &desc->async_tx;
+
+error:
+	xilinx_vdma_free_tx_descriptor(chan, desc);
+	return NULL;
+}
+
+/**
+ * xilinx_vdma_terminate_all - Halt the channel and free descriptors
+ * @chan: Driver specific VDMA Channel pointer
+ */
+static void xilinx_vdma_terminate_all(struct xilinx_vdma_chan *chan)
+{
+	/* Halt the DMA engine */
+	xilinx_vdma_halt(chan);
+
+	/* Remove and free all of the descriptors in the lists */
+	xilinx_vdma_free_descriptors(chan);
+}
+
+/**
+ * xilinx_vdma_slave_config - Configure VDMA channel
+ * Run-time configuration for Axi VDMA, supports:
+ * . halt the channel
+ * . configure interrupt coalescing and inter-packet delay threshold
+ * . start/stop parking
+ * . enable genlock
+ *
+ * @chan: Driver specific VDMA Channel pointer
+ * @slvcfg: DMA Slave configuration pointer
+ *
+ * Return: '0' on success and failure value on error
+ */
+static int xilinx_vdma_slave_config(struct xilinx_vdma_chan *chan,
+				    struct dma_slave_config *slvcfg)
+{
+	u32 dmacr;
+	struct xilinx_vdma_config *cfg = to_xilinx_vdma_config(slvcfg);
+
+	if (cfg->reset)
+		return xilinx_vdma_chan_reset(chan);
+
+	dmacr = vdma_ctrl_read(chan, XILINX_VDMA_REG_DMACR);
+
+	chan->config.frm_dly = cfg->frm_dly;
+	chan->config.park = cfg->park;
+
+	/* genlock settings */
+	chan->config.gen_lock = cfg->gen_lock;
+	chan->config.master = cfg->master;
+
+	if (cfg->gen_lock && chan->genlock) {
+		dmacr |= XILINX_VDMA_DMACR_GENLOCK_EN;
+		dmacr |= cfg->master << XILINX_VDMA_DMACR_MASTER_SHIFT;
+	}
+
+	chan->config.frm_cnt_en = cfg->frm_cnt_en;
+	if (cfg->park)
+		chan->config.park_frm = cfg->park_frm;
+	else
+		chan->config.park_frm = -1;
+
+	chan->config.coalesc = cfg->coalesc;
+	chan->config.delay = cfg->delay;
+
+	if (cfg->coalesc <= XILINX_VDMA_DMACR_FRAME_COUNT_MAX) {
+		dmacr |= cfg->coalesc << XILINX_VDMA_DMACR_FRAME_COUNT_SHIFT;
+		chan->config.coalesc = cfg->coalesc;
+	}
+
+	if (cfg->delay <= XILINX_VDMA_DMACR_DELAY_MAX) {
+		dmacr |= cfg->delay << XILINX_VDMA_DMACR_DELAY_SHIFT;
+		chan->config.delay = cfg->delay;
+	}
+
+	/* FSync Source selection */
+	dmacr &= ~XILINX_VDMA_DMACR_FSYNCSRC_MASK;
+	dmacr |= cfg->ext_fsync << XILINX_VDMA_DMACR_FSYNCSRC_SHIFT;
+
+	vdma_ctrl_write(chan, XILINX_VDMA_REG_DMACR, dmacr);
+	return 0;
+}
+
+/**
+ * xilinx_vdma_device_control - Configure DMA channel of the device
+ * @dchan: DMA Channel pointer
+ * @cmd: DMA control command
+ * @arg: Channel configuration
+ *
+ * Return: '0' on success and failure value on error
+ */
+static int xilinx_vdma_device_control(struct dma_chan *dchan,
+				      enum dma_ctrl_cmd cmd, unsigned long arg)
+{
+	struct xilinx_vdma_chan *chan = to_xilinx_chan(dchan);
+
+	switch (cmd) {
+	case DMA_TERMINATE_ALL:
+		xilinx_vdma_terminate_all(chan);
+		return 0;
+	case DMA_SLAVE_CONFIG:
+		return xilinx_vdma_slave_config(chan,
+						(struct dma_slave_config *)arg);
+	default:
+		return -ENXIO;
+	}
+}
+
+/* -----------------------------------------------------------------------------
+ * Probe and remove
+ */
+
+/**
+ * xilinx_vdma_chan_remove - Per Channel remove function
+ * @chan: Driver specific VDMA channel
+ */
+static void xilinx_vdma_chan_remove(struct xilinx_vdma_chan *chan)
+{
+	/* Disable all interrupts */
+	vdma_ctrl_clr(chan, XILINX_VDMA_REG_DMACR,
+		      XILINX_VDMA_DMAXR_ALL_IRQ_MASK);
+
+	if (chan->irq > 0)
+		free_irq(chan->irq, chan);
+
+	tasklet_kill(&chan->tasklet);
+
+	list_del(&chan->common.device_node);
+}
+
+/**
+ * xilinx_vdma_chan_probe - Per Channel Probing
+ * It get channel features from the device tree entry and
+ * initialize special channel handling routines
+ *
+ * @xdev: Driver specific device structure
+ * @node: Device node
+ *
+ * Return: '0' on success and failure value on error
+ */
+static int xilinx_vdma_chan_probe(struct xilinx_vdma_device *xdev,
+				  struct device_node *node)
+{
+	struct xilinx_vdma_chan *chan;
+	bool has_dre = false;
+	u32 value, width;
+	int err;
+
+	/* Allocate and initialize the channel structure */
+	chan = devm_kzalloc(xdev->dev, sizeof(*chan), GFP_KERNEL);
+	if (!chan)
+		return -ENOMEM;
+
+	chan->dev = xdev->dev;
+	chan->xdev = xdev;
+	chan->has_sg = xdev->has_sg;
+
+	spin_lock_init(&chan->lock);
+	INIT_LIST_HEAD(&chan->pending_list);
+	INIT_LIST_HEAD(&chan->done_list);
+
+	/* Retrieve the channel properties from the device tree */
+	has_dre = of_property_read_bool(node, "xlnx,include-dre");
+
+	chan->genlock = of_property_read_bool(node, "xlnx,genlock-mode");
+
+	err = of_property_read_u32(node, "xlnx,datawidth", &value);
+	if (err) {
+		dev_err(xdev->dev, "missing xlnx,datawidth property\n");
+		return err;
+	}
+	width = value >> 3; /* Convert bits to bytes */
+
+	/* If data width is greater than 8 bytes, DRE is not in hw */
+	if (width > 8)
+		has_dre = false;
+
+	if (!has_dre)
+		xdev->common.copy_align = fls(width - 1);
+
+	if (of_device_is_compatible(node, "xlnx,axi-vdma-mm2s-channel")) {
+		chan->direction = DMA_MEM_TO_DEV;
+		chan->id = 0;
+
+		chan->ctrl_offset = XILINX_VDMA_MM2S_CTRL_OFFSET;
+		chan->desc_offset = XILINX_VDMA_MM2S_DESC_OFFSET;
+
+		if (xdev->flush_on_fsync == XILINX_VDMA_FLUSH_BOTH ||
+		    xdev->flush_on_fsync == XILINX_VDMA_FLUSH_MM2S)
+			chan->flush_on_fsync = true;
+	} else if (of_device_is_compatible(node,
+					    "xlnx,axi-vdma-s2mm-channel")) {
+		chan->direction = DMA_DEV_TO_MEM;
+		chan->id = 1;
+
+		chan->ctrl_offset = XILINX_VDMA_S2MM_CTRL_OFFSET;
+		chan->desc_offset = XILINX_VDMA_S2MM_DESC_OFFSET;
+
+		if (xdev->flush_on_fsync == XILINX_VDMA_FLUSH_BOTH ||
+		    xdev->flush_on_fsync == XILINX_VDMA_FLUSH_S2MM)
+			chan->flush_on_fsync = true;
+	} else {
+		dev_err(xdev->dev, "Invalid channel compatible node\n");
+		return -EINVAL;
+	}
+
+	/* Request the interrupt */
+	chan->irq = irq_of_parse_and_map(node, 0);
+	err = request_irq(chan->irq, xilinx_vdma_irq_handler, IRQF_SHARED,
+			  "xilinx-vdma-controller", chan);
+	if (err) {
+		dev_err(xdev->dev, "unable to request IRQ %d\n", chan->irq);
+		return err;
+	}
+
+	/* Initialize the tasklet */
+	tasklet_init(&chan->tasklet, xilinx_vdma_do_tasklet,
+			(unsigned long)chan);
+
+	/*
+	 * Initialize the DMA channel and add it to the DMA engine channels
+	 * list.
+	 */
+	chan->common.device = &xdev->common;
+
+	list_add_tail(&chan->common.device_node, &xdev->common.channels);
+	xdev->chan[chan->id] = chan;
+
+	/* Reset the channel */
+	err = xilinx_vdma_chan_reset(chan);
+	if (err < 0) {
+		dev_err(xdev->dev, "Reset channel failed\n");
+		return err;
+	}
+
+	return 0;
+}
+
+/**
+ * of_dma_xilinx_xlate - Translation function
+ * @dma_spec: Pointer to DMA specifier as found in the device tree
+ * @ofdma: Pointer to DMA controller data
+ *
+ * Return: DMA channel pointer on success and NULL on error
+ */
+static struct dma_chan *of_dma_xilinx_xlate(struct of_phandle_args *dma_spec,
+						struct of_dma *ofdma)
+{
+	struct xilinx_vdma_device *xdev = ofdma->of_dma_data;
+	int chan_id = dma_spec->args[0];
+
+	if (chan_id >= XILINX_VDMA_MAX_CHANS_PER_DEVICE)
+		return NULL;
+
+	return dma_get_slave_channel(&xdev->chan[chan_id]->common);
+}
+
+/**
+ * xilinx_vdma_probe - Driver probe function
+ * @pdev: Pointer to the platform_device structure
+ *
+ * Return: '0' on success and failure value on error
+ */
+static int xilinx_vdma_probe(struct platform_device *pdev)
+{
+	struct device_node *node = pdev->dev.of_node;
+	struct xilinx_vdma_device *xdev;
+	struct device_node *child;
+	struct resource *io;
+	u32 num_frames;
+	int i, err;
+
+	dev_info(&pdev->dev, "Probing xilinx axi vdma engine\n");
+
+	/* Allocate and initialize the DMA engine structure */
+	xdev = devm_kzalloc(&pdev->dev, sizeof(*xdev), GFP_KERNEL);
+	if (!xdev)
+		return -ENOMEM;
+
+	xdev->dev = &pdev->dev;
+
+	/* Request and map I/O memory */
+	io = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+	xdev->regs = devm_ioremap_resource(&pdev->dev, io);
+	if (IS_ERR(xdev->regs))
+		return PTR_ERR(xdev->regs);
+
+	/* Retrieve the DMA engine properties from the device tree */
+	xdev->has_sg = of_property_read_bool(node, "xlnx,include-sg");
+
+	err = of_property_read_u32(node, "xlnx,num-fstores", &num_frames);
+	if (err < 0) {
+		dev_err(xdev->dev, "missing xlnx,num-fstores property\n");
+		return err;
+	}
+
+	err = of_property_read_u32(node, "xlnx,flush-fsync",
+					&xdev->flush_on_fsync);
+	if (err < 0)
+		dev_warn(xdev->dev, "missing xlnx,flush-fsync property\n");
+
+	/* Initialize the DMA engine */
+	xdev->common.dev = &pdev->dev;
+
+	INIT_LIST_HEAD(&xdev->common.channels);
+	dma_cap_set(DMA_SLAVE, xdev->common.cap_mask);
+	dma_cap_set(DMA_PRIVATE, xdev->common.cap_mask);
+
+	xdev->common.device_alloc_chan_resources =
+				xilinx_vdma_alloc_chan_resources;
+	xdev->common.device_free_chan_resources =
+				xilinx_vdma_free_chan_resources;
+	xdev->common.device_prep_interleaved_dma =
+				xilinx_vdma_dma_prep_interleaved;
+	xdev->common.device_control = xilinx_vdma_device_control;
+	xdev->common.device_tx_status = xilinx_vdma_tx_status;
+	xdev->common.device_issue_pending = xilinx_vdma_issue_pending;
+
+	platform_set_drvdata(pdev, xdev);
+
+	/* Initialize the channels */
+	for_each_child_of_node(node, child) {
+		err = xilinx_vdma_chan_probe(xdev, child);
+		if (err < 0)
+			goto error;
+	}
+
+	for (i = 0; i < XILINX_VDMA_MAX_CHANS_PER_DEVICE; i++)
+		if (xdev->chan[i])
+			xdev->chan[i]->num_frms = num_frames;
+
+	/* Register the DMA engine with the core */
+	dma_async_device_register(&xdev->common);
+
+	err = of_dma_controller_register(node, of_dma_xilinx_xlate,
+					 xdev);
+	if (err < 0) {
+		dev_err(&pdev->dev, "Unable to register DMA to DT\n");
+		dma_async_device_unregister(&xdev->common);
+		goto error;
+	}
+
+	return 0;
+
+error:
+	for (i = 0; i < XILINX_VDMA_MAX_CHANS_PER_DEVICE; i++)
+		if (xdev->chan[i])
+			xilinx_vdma_chan_remove(xdev->chan[i]);
+
+	return err;
+}
+
+/**
+ * xilinx_vdma_remove - Driver remove function
+ * @pdev: Pointer to the platform_device structure
+ *
+ * Return: Always '0'
+ */
+static int xilinx_vdma_remove(struct platform_device *pdev)
+{
+	struct xilinx_vdma_device *xdev = platform_get_drvdata(pdev);
+	int i;
+
+	of_dma_controller_free(pdev->dev.of_node);
+
+	dma_async_device_unregister(&xdev->common);
+
+	for (i = 0; i < XILINX_VDMA_MAX_CHANS_PER_DEVICE; i++)
+		if (xdev->chan[i])
+			xilinx_vdma_chan_remove(xdev->chan[i]);
+
+	return 0;
+}
+
+static const struct of_device_id xilinx_vdma_of_ids[] = {
+	{ .compatible = "xlnx,axi-vdma-1.00.a",},
+	{}
+};
+
+static struct platform_driver xilinx_vdma_driver = {
+	.driver = {
+		.name = "xilinx-vdma",
+		.owner = THIS_MODULE,
+		.of_match_table = xilinx_vdma_of_ids,
+	},
+	.probe = xilinx_vdma_probe,
+	.remove = xilinx_vdma_remove,
+};
+
+module_platform_driver(xilinx_vdma_driver);
+
+MODULE_AUTHOR("Xilinx, Inc.");
+MODULE_DESCRIPTION("Xilinx VDMA driver");
+MODULE_LICENSE("GPL v2");
diff --git a/include/linux/amba/xilinx_dma.h b/include/linux/amba/xilinx_dma.h
new file mode 100644
index 0000000..146406c
--- /dev/null
+++ b/include/linux/amba/xilinx_dma.h
@@ -0,0 +1,46 @@
+/*
+ * Xilinx DMA Engine drivers support header file
+ *
+ * Copyright (C) 2010-2014 Xilinx, Inc. All rights reserved.
+ *
+ * This is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#ifndef __DMA_XILINX_DMA_H
+#define __DMA_XILINX_DMA_H
+
+#include <linux/dma-mapping.h>
+#include <linux/dmaengine.h>
+
+/**
+ * struct xilinx_vdma_config - VDMA Configuration structure
+ * @slvcfg: DMA Slave configuration
+ * @frm_dly: Frame delay
+ * @gen_lock: Whether in gen-lock mode
+ * @master: Master that it syncs to
+ * @frm_cnt_en: Enable frame count enable
+ * @park: Whether wants to park
+ * @park_frm: Frame to park on
+ * @coalesc: Interrupt coalescing threshold
+ * @delay: Delay counter
+ * @reset: Reset Channel
+ * @ext_fsync: External Frame Sync source
+ */
+struct xilinx_vdma_config {
+	struct dma_slave_config slvcfg;
+	int frm_dly;
+	int gen_lock;
+	int master;
+	int frm_cnt_en;
+	int park;
+	int park_frm;
+	int coalesc;
+	int delay;
+	int reset;
+	int ext_fsync;
+};
+
+#endif
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v3 3/3] dma: Add Xilinx AXI Video Direct Memory Access Engine driver support
@ 2014-02-15 12:00   ` Srikanth Thokala
  0 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-15 12:00 UTC (permalink / raw)
  To: linux-arm-kernel

This is the driver for the AXI Video Direct Memory Access (AXI
VDMA) core, which is a soft Xilinx IP core that provides high-
bandwidth direct memory access between memory and AXI4-Stream
type video target peripherals. The core provides efficient two
dimensional DMA operations with independent asynchronous read
and write channel operation.

This module works on Zynq (ARM Based SoC) and Microblaze platforms.

Signed-off-by: Srikanth Thokala <sthokal@xilinx.com>
Reviewed-by: Levente Kurusa <levex@linux.com>
---
NOTE:
- Created a separate directory 'dma/xilinx' as Xilinx has two more
  DMA IPs and we are also planning to upstream these drivers soon.

Changes in v3:
- Implemented interleaved DMA API as suggested by Vinod and Lars, Thanks.
- Use dma_slave_config generic API as suggested by Vinod and Lars, Thanks.
- Simplified cookie implementation as sugguested by Vinod, Thanks.
- Simplified *_xlate function by using dma_get_slave_channel() as
  suggested by Lars, Thanks.
- Proper indentation of constants and maintained same multi-line of
  comments as suggested by Andy Shevchenko, Thanks.
- Modified to use request_irq() instead of devm_request_irq as suggested
  by Lars, Thanks.
- Fixed minor comments suggested by Andy and Levente, Thanks.

Changes in v2:
- Removed DMA Test client module from the patchset as suggested
  by Andy Shevchenko
- Returning with error, if registration of DMA to node fails
- Fixed typo errors
- Used BIT() macro at applicable places
- Added missing header file to the patchset
- Changed copyright year to include 2014
---
 drivers/dma/Kconfig              |   14 +
 drivers/dma/Makefile             |    1 +
 drivers/dma/xilinx/Makefile      |    1 +
 drivers/dma/xilinx/xilinx_vdma.c | 1388 ++++++++++++++++++++++++++++++++++++++
 include/linux/amba/xilinx_dma.h  |   46 ++
 5 files changed, 1450 insertions(+)
 create mode 100644 drivers/dma/xilinx/Makefile
 create mode 100644 drivers/dma/xilinx/xilinx_vdma.c
 create mode 100644 include/linux/amba/xilinx_dma.h

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 9bed1a2..53a1fa6 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -350,6 +350,20 @@ config MOXART_DMA
 	help
 	  Enable support for the MOXA ART SoC DMA controller.
 
+config XILINX_VDMA
+	tristate "Xilinx AXI VDMA Engine"
+	depends on (ARCH_ZYNQ || MICROBLAZE)
+	select DMA_ENGINE
+	help
+	  Enable support for Xilinx AXI VDMA Soft IP.
+
+	  This engine provides high-bandwidth direct memory access
+	  between memory and AXI4-Stream video type target
+	  peripherals including peripherals which support AXI4-
+	  Stream Video Protocol.  It has two stream interfaces/
+	  channels, Memory Mapped to Stream (MM2S) and Stream to
+	  Memory Mapped (S2MM) for the data transfers.
+
 config DMA_ENGINE
 	bool
 
diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile
index a029d0f4..fa38a77 100644
--- a/drivers/dma/Makefile
+++ b/drivers/dma/Makefile
@@ -44,3 +44,4 @@ obj-$(CONFIG_DMA_JZ4740) += dma-jz4740.o
 obj-$(CONFIG_TI_CPPI41) += cppi41.o
 obj-$(CONFIG_K3_DMA) += k3dma.o
 obj-$(CONFIG_MOXART_DMA) += moxart-dma.o
+obj-y += xilinx/
diff --git a/drivers/dma/xilinx/Makefile b/drivers/dma/xilinx/Makefile
new file mode 100644
index 0000000..3c4e9f2
--- /dev/null
+++ b/drivers/dma/xilinx/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_XILINX_VDMA) += xilinx_vdma.o
diff --git a/drivers/dma/xilinx/xilinx_vdma.c b/drivers/dma/xilinx/xilinx_vdma.c
new file mode 100644
index 0000000..fde6a9c
--- /dev/null
+++ b/drivers/dma/xilinx/xilinx_vdma.c
@@ -0,0 +1,1388 @@
+/*
+ * DMA driver for Xilinx Video DMA Engine
+ *
+ * Copyright (C) 2010-2014 Xilinx, Inc. All rights reserved.
+ *
+ * Based on the Freescale DMA driver.
+ *
+ * Description:
+ * The AXI Video Direct Memory Access (AXI VDMA) core is a soft Xilinx IP
+ * core that provides high-bandwidth direct memory access between memory
+ * and AXI4-Stream type video target peripherals. The core provides efficient
+ * two dimensional DMA operations with independent asynchronous read (S2MM)
+ * and write (MM2S) channel operation. It can be configured to have either
+ * one channel or two channels. If configured as two channels, one is to
+ * transmit to the video device (MM2S) and another is to receive from the
+ * video device (S2MM). Initialization, status, interrupt and management
+ * registers are accessed through an AXI4-Lite slave interface.
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/amba/xilinx_dma.h>
+#include <linux/bitops.h>
+#include <linux/dmapool.h>
+#include <linux/init.h>
+#include <linux/interrupt.h>
+#include <linux/io.h>
+#include <linux/module.h>
+#include <linux/of_address.h>
+#include <linux/of_dma.h>
+#include <linux/of_platform.h>
+#include <linux/of_irq.h>
+#include <linux/slab.h>
+
+#include "../dmaengine.h"
+
+/* Register/Descriptor Offsets */
+#define XILINX_VDMA_MM2S_CTRL_OFFSET		0x0000
+#define XILINX_VDMA_S2MM_CTRL_OFFSET		0x0030
+#define XILINX_VDMA_MM2S_DESC_OFFSET		0x0050
+#define XILINX_VDMA_S2MM_DESC_OFFSET		0x00a0
+
+/* Control Registers */
+#define XILINX_VDMA_REG_DMACR			0x0000
+#define XILINX_VDMA_DMACR_DELAY_MAX		0xff
+#define XILINX_VDMA_DMACR_DELAY_SHIFT		24
+#define XILINX_VDMA_DMACR_FRAME_COUNT_MAX	0xff
+#define XILINX_VDMA_DMACR_FRAME_COUNT_SHIFT	16
+#define XILINX_VDMA_DMACR_ERR_IRQ		BIT(14)
+#define XILINX_VDMA_DMACR_DLY_CNT_IRQ		BIT(13)
+#define XILINX_VDMA_DMACR_FRM_CNT_IRQ		BIT(12)
+#define XILINX_VDMA_DMACR_MASTER_SHIFT		8
+#define XILINX_VDMA_DMACR_FSYNCSRC_SHIFT	5
+#define XILINX_VDMA_DMACR_FRAMECNT_EN		BIT(4)
+#define XILINX_VDMA_DMACR_GENLOCK_EN		BIT(3)
+#define XILINX_VDMA_DMACR_RESET			BIT(2)
+#define XILINX_VDMA_DMACR_CIRC_EN		BIT(1)
+#define XILINX_VDMA_DMACR_RUNSTOP		BIT(0)
+#define XILINX_VDMA_DMACR_FSYNCSRC_MASK		GENMASK(6, 5)
+
+#define XILINX_VDMA_REG_DMASR			0x0004
+#define XILINX_VDMA_DMASR_EOL_LATE_ERR		BIT(15)
+#define XILINX_VDMA_DMASR_ERR_IRQ		BIT(14)
+#define XILINX_VDMA_DMASR_DLY_CNT_IRQ		BIT(13)
+#define XILINX_VDMA_DMASR_FRM_CNT_IRQ		BIT(12)
+#define XILINX_VDMA_DMASR_SOF_LATE_ERR		BIT(11)
+#define XILINX_VDMA_DMASR_SG_DEC_ERR		BIT(10)
+#define XILINX_VDMA_DMASR_SG_SLV_ERR		BIT(9)
+#define XILINX_VDMA_DMASR_EOF_EARLY_ERR		BIT(8)
+#define XILINX_VDMA_DMASR_SOF_EARLY_ERR		BIT(7)
+#define XILINX_VDMA_DMASR_DMA_DEC_ERR		BIT(6)
+#define XILINX_VDMA_DMASR_DMA_SLAVE_ERR		BIT(5)
+#define XILINX_VDMA_DMASR_DMA_INT_ERR		BIT(4)
+#define XILINX_VDMA_DMASR_IDLE			BIT(1)
+#define XILINX_VDMA_DMASR_HALTED		BIT(0)
+#define XILINX_VDMA_DMASR_DELAY_MASK		GENMASK(31, 24)
+#define XILINX_VDMA_DMASR_FRAME_COUNT_MASK	GENMASK(23, 16)
+
+#define XILINX_VDMA_REG_CURDESC			0x0008
+#define XILINX_VDMA_REG_TAILDESC		0x0010
+#define XILINX_VDMA_REG_REG_INDEX		0x0014
+#define XILINX_VDMA_REG_FRMSTORE		0x0018
+#define XILINX_VDMA_REG_THRESHOLD		0x001c
+#define XILINX_VDMA_REG_FRMPTR_STS		0x0024
+#define XILINX_VDMA_REG_PARK_PTR		0x0028
+#define XILINX_VDMA_PARK_PTR_WR_REF_SHIFT	8
+#define XILINX_VDMA_PARK_PTR_RD_REF_SHIFT	0
+#define XILINX_VDMA_REG_VDMA_VERSION		0x002c
+
+/* Register Direct Mode Registers */
+#define XILINX_VDMA_REG_VSIZE			0x0000
+#define XILINX_VDMA_REG_HSIZE			0x0004
+
+#define XILINX_VDMA_REG_FRMDLY_STRIDE		0x0008
+#define XILINX_VDMA_FRMDLY_STRIDE_FRMDLY_SHIFT	24
+#define XILINX_VDMA_FRMDLY_STRIDE_STRIDE_SHIFT	0
+
+#define XILINX_VDMA_REG_START_ADDRESS(n)	(0x000c + 4 * (n))
+
+/* HW specific definitions */
+#define XILINX_VDMA_MAX_CHANS_PER_DEVICE	0x2
+
+#define XILINX_VDMA_DMAXR_ALL_IRQ_MASK	\
+		(XILINX_VDMA_DMASR_FRM_CNT_IRQ | \
+		 XILINX_VDMA_DMASR_DLY_CNT_IRQ | \
+		 XILINX_VDMA_DMASR_ERR_IRQ)
+
+#define XILINX_VDMA_DMASR_ALL_ERR_MASK	\
+		(XILINX_VDMA_DMASR_EOL_LATE_ERR | \
+		 XILINX_VDMA_DMASR_SOF_LATE_ERR | \
+		 XILINX_VDMA_DMASR_SG_DEC_ERR | \
+		 XILINX_VDMA_DMASR_SG_SLV_ERR | \
+		 XILINX_VDMA_DMASR_EOF_EARLY_ERR | \
+		 XILINX_VDMA_DMASR_SOF_EARLY_ERR | \
+		 XILINX_VDMA_DMASR_DMA_DEC_ERR | \
+		 XILINX_VDMA_DMASR_DMA_SLAVE_ERR | \
+		 XILINX_VDMA_DMASR_DMA_INT_ERR)
+
+/*
+ * Recoverable errors are DMA Internal error, SOF Early, EOF Early
+ * and SOF Late. They are only recoverable when C_FLUSH_ON_FSYNC
+ * is enabled in the h/w system.
+ */
+#define XILINX_VDMA_DMASR_ERR_RECOVER_MASK	\
+		(XILINX_VDMA_DMASR_SOF_LATE_ERR | \
+		 XILINX_VDMA_DMASR_EOF_EARLY_ERR | \
+		 XILINX_VDMA_DMASR_SOF_EARLY_ERR | \
+		 XILINX_VDMA_DMASR_DMA_INT_ERR)
+
+/* Axi VDMA Flush on Fsync bits */
+#define XILINX_VDMA_FLUSH_S2MM		3
+#define XILINX_VDMA_FLUSH_MM2S		2
+#define XILINX_VDMA_FLUSH_BOTH		1
+
+/* Delay loop counter to prevent hardware failure */
+#define XILINX_VDMA_LOOP_COUNT		1000000
+
+/**
+ * struct xilinx_vdma_desc_hw - Hardware Descriptor
+ * @next_desc: Next Descriptor Pointer @0x00
+ * @pad1: Reserved @0x04
+ * @buf_addr: Buffer address @0x08
+ * @pad2: Reserved @0x0C
+ * @vsize: Vertical Size @0x10
+ * @hsize: Horizontal Size @0x14
+ * @stride: Number of bytes between the first
+ *	    pixels of each horizontal line @0x18
+ */
+struct xilinx_vdma_desc_hw {
+	u32 next_desc;
+	u32 pad1;
+	u32 buf_addr;
+	u32 pad2;
+	u32 vsize;
+	u32 hsize;
+	u32 stride;
+} __aligned(64);
+
+/**
+ * struct xilinx_vdma_tx_segment - Descriptor segment
+ * @hw: Hardware descriptor
+ * @node: Node in the descriptor segments list
+ * @phys: Physical address of segment
+ */
+struct xilinx_vdma_tx_segment {
+	struct xilinx_vdma_desc_hw hw;
+	struct list_head node;
+	dma_addr_t phys;
+} __aligned(64);
+
+/**
+ * struct xilinx_vdma_tx_descriptor - Per Transaction structure
+ * @async_tx: Async transaction descriptor
+ * @segments: TX segments list
+ * @node: Node in the channel descriptors list
+ */
+struct xilinx_vdma_tx_descriptor {
+	struct dma_async_tx_descriptor async_tx;
+	struct list_head segments;
+	struct list_head node;
+};
+
+/**
+ * struct xilinx_vdma_chan - Driver specific VDMA channel structure
+ * @xdev: Driver specific device structure
+ * @ctrl_offset: Control registers offset
+ * @desc_offset: TX descriptor registers offset
+ * @lock: Descriptor operation lock
+ * @pending_list: Descriptors waiting
+ * @active_desc: Active descriptor
+ * @done_list: Complete descriptors
+ * @common: DMA common channel
+ * @desc_pool: Descriptors pool
+ * @dev: The dma device
+ * @irq: Channel IRQ
+ * @id: Channel ID
+ * @direction: Transfer direction
+ * @num_frms: Number of frames
+ * @has_sg: Support scatter transfers
+ * @genlock: Support genlock mode
+ * @err: Channel has errors
+ * @tasklet: Cleanup work after irq
+ * @config: Device configuration info
+ * @flush_on_fsync: Flush on Frame sync
+ */
+struct xilinx_vdma_chan {
+	struct xilinx_vdma_device *xdev;
+	u32 ctrl_offset;
+	u32 desc_offset;
+	spinlock_t lock;
+	struct list_head pending_list;
+	struct xilinx_vdma_tx_descriptor *active_desc;
+	struct list_head done_list;
+	struct dma_chan common;
+	struct dma_pool *desc_pool;
+	struct device *dev;
+	int irq;
+	int id;
+	enum dma_transfer_direction direction;
+	int num_frms;
+	bool has_sg;
+	bool genlock;
+	bool err;
+	struct tasklet_struct tasklet;
+	struct xilinx_vdma_config config;
+	bool flush_on_fsync;
+};
+
+/**
+ * struct xilinx_vdma_device - VDMA device structure
+ * @regs: I/O mapped base address
+ * @dev: Device Structure
+ * @common: DMA device structure
+ * @chan: Driver specific VDMA channel
+ * @has_sg: Specifies whether Scatter-Gather is present or not
+ * @flush_on_fsync: Flush on frame sync
+ */
+struct xilinx_vdma_device {
+	void __iomem *regs;
+	struct device *dev;
+	struct dma_device common;
+	struct xilinx_vdma_chan *chan[XILINX_VDMA_MAX_CHANS_PER_DEVICE];
+	bool has_sg;
+	u32 flush_on_fsync;
+};
+
+/* Macros */
+#define to_xilinx_chan(chan) \
+	container_of(chan, struct xilinx_vdma_chan, common)
+#define to_vdma_tx_descriptor(tx) \
+	container_of(tx, struct xilinx_vdma_tx_descriptor, async_tx)
+#define to_xilinx_vdma_config(cfg) \
+	container_of(cfg, struct xilinx_vdma_config, slvcfg)
+
+/* IO accessors */
+static inline u32 vdma_read(struct xilinx_vdma_chan *chan, u32 reg)
+{
+	return ioread32(chan->xdev->regs + reg);
+}
+
+static inline void vdma_write(struct xilinx_vdma_chan *chan, u32 reg, u32 value)
+{
+	iowrite32(value, chan->xdev->regs + reg);
+}
+
+static inline void vdma_desc_write(struct xilinx_vdma_chan *chan, u32 reg,
+				   u32 value)
+{
+	vdma_write(chan, chan->desc_offset + reg, value);
+}
+
+static inline u32 vdma_ctrl_read(struct xilinx_vdma_chan *chan, u32 reg)
+{
+	return vdma_read(chan, chan->ctrl_offset + reg);
+}
+
+static inline void vdma_ctrl_write(struct xilinx_vdma_chan *chan, u32 reg,
+				   u32 value)
+{
+	vdma_write(chan, chan->ctrl_offset + reg, value);
+}
+
+static inline void vdma_ctrl_clr(struct xilinx_vdma_chan *chan, u32 reg,
+				 u32 clr)
+{
+	vdma_ctrl_write(chan, reg, vdma_ctrl_read(chan, reg) & ~clr);
+}
+
+static inline void vdma_ctrl_set(struct xilinx_vdma_chan *chan, u32 reg,
+				 u32 set)
+{
+	vdma_ctrl_write(chan, reg, vdma_ctrl_read(chan, reg) | set);
+}
+
+/* -----------------------------------------------------------------------------
+ * Descriptors and segments alloc and free
+ */
+
+/**
+ * xilinx_vdma_alloc_tx_segment - Allocate transaction segment
+ * @chan: Driver specific VDMA channel
+ *
+ * Return: The allocated segment on success and NULL on failure.
+ */
+static struct xilinx_vdma_tx_segment *
+xilinx_vdma_alloc_tx_segment(struct xilinx_vdma_chan *chan)
+{
+	struct xilinx_vdma_tx_segment *segment;
+	dma_addr_t phys;
+
+	segment = dma_pool_alloc(chan->desc_pool, GFP_ATOMIC, &phys);
+	if (!segment)
+		return NULL;
+
+	memset(segment, 0, sizeof(*segment));
+	segment->phys = phys;
+
+	return segment;
+}
+
+/**
+ * xilinx_vdma_free_tx_segment - Free transaction segment
+ * @chan: Driver specific VDMA channel
+ * @segment: VDMA transaction segment
+ */
+static void xilinx_vdma_free_tx_segment(struct xilinx_vdma_chan *chan,
+					struct xilinx_vdma_tx_segment *segment)
+{
+	dma_pool_free(chan->desc_pool, segment, segment->phys);
+}
+
+/**
+ * xilinx_vdma_tx_descriptor - Allocate transaction descriptor
+ * @chan: Driver specific VDMA channel
+ *
+ * Return: The allocated descriptor on success and NULL on failure.
+ */
+static struct xilinx_vdma_tx_descriptor *
+xilinx_vdma_alloc_tx_descriptor(struct xilinx_vdma_chan *chan)
+{
+	struct xilinx_vdma_tx_descriptor *desc;
+
+	desc = kzalloc(sizeof(*desc), GFP_KERNEL);
+	if (!desc)
+		return NULL;
+
+	INIT_LIST_HEAD(&desc->segments);
+
+	return desc;
+}
+
+/**
+ * xilinx_vdma_free_tx_descriptor - Free transaction descriptor
+ * @chan: Driver specific VDMA channel
+ * @desc: VDMA transaction descriptor
+ */
+static void
+xilinx_vdma_free_tx_descriptor(struct xilinx_vdma_chan *chan,
+			       struct xilinx_vdma_tx_descriptor *desc)
+{
+	struct xilinx_vdma_tx_segment *segment, *next;
+
+	if (!desc)
+		return;
+
+	list_for_each_entry_safe(segment, next, &desc->segments, node) {
+		list_del(&segment->node);
+		xilinx_vdma_free_tx_segment(chan, segment);
+	}
+
+	kfree(desc);
+}
+
+/* Required functions */
+
+/**
+ * xilinx_vdma_free_desc_list - Free descriptors list
+ * @chan: Driver specific VDMA channel
+ * @list: List to parse and delete the descriptor
+ */
+static void xilinx_vdma_free_desc_list(struct xilinx_vdma_chan *chan,
+					struct list_head *list)
+{
+	struct xilinx_vdma_tx_descriptor *desc, *next;
+
+	list_for_each_entry_safe(desc, next, list, node) {
+		list_del(&desc->node);
+		xilinx_vdma_free_tx_descriptor(chan, desc);
+	}
+}
+
+/**
+ * xilinx_vdma_free_descriptors - Free channel descriptors
+ * @chan: Driver specific VDMA channel
+ */
+static void xilinx_vdma_free_descriptors(struct xilinx_vdma_chan *chan)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&chan->lock, flags);
+
+	xilinx_vdma_free_desc_list(chan, &chan->pending_list);
+	xilinx_vdma_free_desc_list(chan, &chan->done_list);
+
+	xilinx_vdma_free_tx_descriptor(chan, chan->active_desc);
+	chan->active_desc = NULL;
+
+	spin_unlock_irqrestore(&chan->lock, flags);
+}
+
+/**
+ * xilinx_vdma_free_chan_resources - Free channel resources
+ * @dchan: DMA channel
+ */
+static void xilinx_vdma_free_chan_resources(struct dma_chan *dchan)
+{
+	struct xilinx_vdma_chan *chan = to_xilinx_chan(dchan);
+
+	dev_dbg(chan->dev, "Free all channel resources.\n");
+
+	xilinx_vdma_free_descriptors(chan);
+	dma_pool_destroy(chan->desc_pool);
+	chan->desc_pool = NULL;
+}
+
+/**
+ * xilinx_vdma_chan_desc_cleanup - Clean channel descriptors
+ * @chan: Driver specific VDMA channel
+ */
+static void xilinx_vdma_chan_desc_cleanup(struct xilinx_vdma_chan *chan)
+{
+	struct xilinx_vdma_tx_descriptor *desc, *next;
+	unsigned long flags;
+
+	spin_lock_irqsave(&chan->lock, flags);
+
+	list_for_each_entry_safe(desc, next, &chan->done_list, node) {
+		dma_async_tx_callback callback;
+		void *callback_param;
+
+		/* Remove from the list of running transactions */
+		list_del(&desc->node);
+
+		/* Run the link descriptor callback function */
+		callback = desc->async_tx.callback;
+		callback_param = desc->async_tx.callback_param;
+		if (callback) {
+			spin_unlock_irqrestore(&chan->lock, flags);
+			callback(callback_param);
+			spin_lock_irqsave(&chan->lock, flags);
+		}
+
+		/* Run any dependencies, then free the descriptor */
+		dma_run_dependencies(&desc->async_tx);
+		xilinx_vdma_free_tx_descriptor(chan, desc);
+	}
+
+	spin_unlock_irqrestore(&chan->lock, flags);
+}
+
+/**
+ * xilinx_vdma_do_tasklet - Schedule completion tasklet
+ * @data: Pointer to the Xilinx VDMA channel structure
+ */
+static void xilinx_vdma_do_tasklet(unsigned long data)
+{
+	struct xilinx_vdma_chan *chan = (struct xilinx_vdma_chan *)data;
+
+	xilinx_vdma_chan_desc_cleanup(chan);
+}
+
+/**
+ * xilinx_vdma_alloc_chan_resources - Allocate channel resources
+ * @dchan: DMA channel
+ *
+ * Return: '0' on success and failure value on error
+ */
+static int xilinx_vdma_alloc_chan_resources(struct dma_chan *dchan)
+{
+	struct xilinx_vdma_chan *chan = to_xilinx_chan(dchan);
+
+	/* Has this channel already been allocated? */
+	if (chan->desc_pool)
+		return 0;
+
+	/*
+	 * We need the descriptor to be aligned to 64bytes
+	 * for meeting Xilinx VDMA specification requirement.
+	 */
+	chan->desc_pool = dma_pool_create("xilinx_vdma_desc_pool",
+				chan->dev,
+				sizeof(struct xilinx_vdma_tx_segment),
+				__alignof__(struct xilinx_vdma_tx_segment), 0);
+	if (!chan->desc_pool) {
+		dev_err(chan->dev,
+			"unable to allocate channel %d descriptor pool\n",
+			chan->id);
+		return -ENOMEM;
+	}
+
+	dma_cookie_init(dchan);
+	return 0;
+}
+
+/**
+ * xilinx_vdma_tx_status - Get VDMA transaction status
+ * @dchan: DMA channel
+ * @cookie: Transaction identifier
+ * @txstate: Transaction state
+ *
+ * Return: DMA transaction status
+ */
+static enum dma_status xilinx_vdma_tx_status(struct dma_chan *dchan,
+					dma_cookie_t cookie,
+					struct dma_tx_state *txstate)
+{
+	return dma_cookie_status(dchan, cookie, txstate);
+}
+
+/**
+ * xilinx_vdma_is_running - Check if VDMA channel is running
+ * @chan: Driver specific VDMA channel
+ *
+ * Return: '1' if running, '0' if not.
+ */
+static int xilinx_vdma_is_running(struct xilinx_vdma_chan *chan)
+{
+	return !(vdma_ctrl_read(chan, XILINX_VDMA_REG_DMASR) &
+		 XILINX_VDMA_DMASR_HALTED) &&
+		(vdma_ctrl_read(chan, XILINX_VDMA_REG_DMACR) &
+		 XILINX_VDMA_DMACR_RUNSTOP);
+}
+
+/**
+ * xilinx_vdma_is_idle - Check if VDMA channel is idle
+ * @chan: Driver specific VDMA channel
+ *
+ * Return: '1' if idle, '0' if not.
+ */
+static int xilinx_vdma_is_idle(struct xilinx_vdma_chan *chan)
+{
+	return vdma_ctrl_read(chan, XILINX_VDMA_REG_DMASR) &
+		XILINX_VDMA_DMASR_IDLE;
+}
+
+/**
+ * xilinx_vdma_halt - Halt VDMA channel
+ * @chan: Driver specific VDMA channel
+ */
+static void xilinx_vdma_halt(struct xilinx_vdma_chan *chan)
+{
+	int loop = XILINX_VDMA_LOOP_COUNT;
+
+	vdma_ctrl_clr(chan, XILINX_VDMA_REG_DMACR, XILINX_VDMA_DMACR_RUNSTOP);
+
+	/* Wait for the hardware to halt */
+	do {
+		if (vdma_ctrl_read(chan, XILINX_VDMA_REG_DMASR) &
+		    XILINX_VDMA_DMASR_HALTED)
+			break;
+	} while (loop--);
+
+	if (!loop) {
+		dev_err(chan->dev, "Cannot stop channel %p: %x\n",
+			chan, vdma_ctrl_read(chan, XILINX_VDMA_REG_DMASR));
+		chan->err = true;
+	}
+
+	return;
+}
+
+/**
+ * xilinx_vdma_start - Start VDMA channel
+ * @chan: Driver specific VDMA channel
+ */
+static void xilinx_vdma_start(struct xilinx_vdma_chan *chan)
+{
+	int loop = XILINX_VDMA_LOOP_COUNT;
+
+	vdma_ctrl_set(chan, XILINX_VDMA_REG_DMACR, XILINX_VDMA_DMACR_RUNSTOP);
+
+	/* Wait for the hardware to start */
+	do {
+		if (!(vdma_ctrl_read(chan, XILINX_VDMA_REG_DMASR) &
+		      XILINX_VDMA_DMASR_HALTED))
+			break;
+	} while (loop--);
+
+	if (!loop) {
+		dev_err(chan->dev, "Cannot start channel %p: %x\n",
+			chan, vdma_ctrl_read(chan, XILINX_VDMA_REG_DMASR));
+
+		chan->err = true;
+	}
+
+	return;
+}
+
+/**
+ * xilinx_vdma_start_transfer - Starts VDMA transfer
+ * @chan: Driver specific channel struct pointer
+ */
+static void xilinx_vdma_start_transfer(struct xilinx_vdma_chan *chan)
+{
+	struct xilinx_vdma_config *config = &chan->config;
+	struct xilinx_vdma_tx_descriptor *desc;
+	unsigned long flags;
+	u32 reg;
+	struct xilinx_vdma_tx_segment *head, *tail = NULL;
+
+	if (chan->err)
+		return;
+
+	spin_lock_irqsave(&chan->lock, flags);
+
+	/* There's already an active descriptor, bail out. */
+	if (chan->active_desc)
+		goto out_unlock;
+
+	if (list_empty(&chan->pending_list))
+		goto out_unlock;
+
+	desc = list_first_entry(&chan->pending_list,
+				struct xilinx_vdma_tx_descriptor, node);
+
+	/* If it is SG mode and hardware is busy, cannot submit */
+	if (chan->has_sg && xilinx_vdma_is_running(chan) &&
+	    !xilinx_vdma_is_idle(chan)) {
+		dev_dbg(chan->dev, "DMA controller still busy\n");
+		goto out_unlock;
+	}
+
+	if (chan->err)
+		goto out_unlock;
+
+	/*
+	 * If hardware is idle, then all descriptors on the running lists are
+	 * done, start new transfers
+	 */
+	if (chan->has_sg) {
+		head = list_first_entry(&desc->segments,
+					struct xilinx_vdma_tx_segment, node);
+		tail = list_entry(desc->segments.prev,
+				  struct xilinx_vdma_tx_segment, node);
+
+		vdma_ctrl_write(chan, XILINX_VDMA_REG_CURDESC, head->phys);
+	}
+
+	/* Configure the hardware using info in the config structure */
+	reg = vdma_ctrl_read(chan, XILINX_VDMA_REG_DMACR);
+
+	if (config->frm_cnt_en)
+		reg |= XILINX_VDMA_DMACR_FRAMECNT_EN;
+	else
+		reg &= ~XILINX_VDMA_DMACR_FRAMECNT_EN;
+
+	/*
+	 * With SG, start with circular mode, so that BDs can be fetched.
+	 * In direct register mode, if not parking, enable circular mode
+	 */
+	if (chan->has_sg || !config->park)
+		reg |= XILINX_VDMA_DMACR_CIRC_EN;
+
+	if (config->park)
+		reg &= ~XILINX_VDMA_DMACR_CIRC_EN;
+
+	vdma_ctrl_write(chan, XILINX_VDMA_REG_DMACR, reg);
+
+	if (config->park && (config->park_frm >= 0) &&
+			(config->park_frm < chan->num_frms)) {
+		if (chan->direction == DMA_MEM_TO_DEV)
+			vdma_write(chan, XILINX_VDMA_REG_PARK_PTR,
+				config->park_frm <<
+					XILINX_VDMA_PARK_PTR_RD_REF_SHIFT);
+		else
+			vdma_write(chan, XILINX_VDMA_REG_PARK_PTR,
+				config->park_frm <<
+					XILINX_VDMA_PARK_PTR_WR_REF_SHIFT);
+	}
+
+	/* Start the hardware */
+	xilinx_vdma_start(chan);
+
+	if (chan->err)
+		goto out_unlock;
+
+	/* Start the transfer */
+	if (chan->has_sg) {
+		vdma_ctrl_write(chan, XILINX_VDMA_REG_TAILDESC, tail->phys);
+	} else {
+		struct xilinx_vdma_tx_segment *segment, *last = NULL;
+		int i = 0;
+
+		list_for_each_entry(segment, &desc->segments, node) {
+			vdma_desc_write(chan,
+					XILINX_VDMA_REG_START_ADDRESS(i++),
+					segment->hw.buf_addr);
+			last = segment;
+		}
+
+		if (!last)
+			goto out_unlock;
+
+		/* HW expects these parameters to be same for one transaction */
+		vdma_desc_write(chan, XILINX_VDMA_REG_HSIZE, last->hw.hsize);
+		vdma_desc_write(chan, XILINX_VDMA_REG_FRMDLY_STRIDE,
+				last->hw.stride);
+		vdma_desc_write(chan, XILINX_VDMA_REG_VSIZE, last->hw.vsize);
+	}
+
+	list_del(&desc->node);
+	chan->active_desc = desc;
+
+out_unlock:
+	spin_unlock_irqrestore(&chan->lock, flags);
+}
+
+/**
+ * xilinx_vdma_issue_pending - Issue pending transactions
+ * @dchan: DMA channel
+ */
+static void xilinx_vdma_issue_pending(struct dma_chan *dchan)
+{
+	struct xilinx_vdma_chan *chan = to_xilinx_chan(dchan);
+
+	xilinx_vdma_start_transfer(chan);
+}
+
+/**
+ * xilinx_vdma_complete_descriptor - Mark the active descriptor as complete
+ * @chan : xilinx DMA channel
+ *
+ * CONTEXT: hardirq
+ */
+static void xilinx_vdma_complete_descriptor(struct xilinx_vdma_chan *chan)
+{
+	struct xilinx_vdma_tx_descriptor *desc;
+	unsigned long flags;
+
+	spin_lock_irqsave(&chan->lock, flags);
+
+	desc = chan->active_desc;
+	if (!desc) {
+		dev_dbg(chan->dev, "no running descriptors\n");
+		goto out_unlock;
+	}
+
+	dma_cookie_complete(&desc->async_tx);
+	list_add_tail(&desc->node, &chan->done_list);
+
+	chan->active_desc = NULL;
+
+out_unlock:
+	spin_unlock_irqrestore(&chan->lock, flags);
+}
+
+/**
+ * xilinx_vdma_reset - Reset VDMA channel
+ * @chan: Driver specific VDMA channel
+ *
+ * Return: '0' on success and failure value on error
+ */
+static int xilinx_vdma_reset(struct xilinx_vdma_chan *chan)
+{
+	int loop = XILINX_VDMA_LOOP_COUNT;
+	u32 tmp;
+
+	vdma_ctrl_set(chan, XILINX_VDMA_REG_DMACR, XILINX_VDMA_DMACR_RESET);
+
+	tmp = vdma_ctrl_read(chan, XILINX_VDMA_REG_DMACR) &
+		XILINX_VDMA_DMACR_RESET;
+
+	/* Wait for the hardware to finish reset */
+	do {
+		tmp = vdma_ctrl_read(chan, XILINX_VDMA_REG_DMACR) &
+			XILINX_VDMA_DMACR_RESET;
+	} while (loop-- && tmp);
+
+	if (!loop) {
+		dev_err(chan->dev, "reset timeout, cr %x, sr %x\n",
+			vdma_ctrl_read(chan, XILINX_VDMA_REG_DMACR),
+			vdma_ctrl_read(chan, XILINX_VDMA_REG_DMASR));
+		return -ETIMEDOUT;
+	}
+
+	chan->err = false;
+
+	return 0;
+}
+
+/**
+ * xilinx_vdma_chan_reset - Reset VDMA channel and enable interrupts
+ * @chan: Driver specific VDMA channel
+ *
+ * Return: '0' on success and failure value on error
+ */
+static int xilinx_vdma_chan_reset(struct xilinx_vdma_chan *chan)
+{
+	int err;
+
+	/* Reset VDMA */
+	err = xilinx_vdma_reset(chan);
+	if (err)
+		return err;
+
+	/* Enable interrupts */
+	vdma_ctrl_set(chan, XILINX_VDMA_REG_DMACR,
+		      XILINX_VDMA_DMAXR_ALL_IRQ_MASK);
+
+	return 0;
+}
+
+/**
+ * xilinx_vdma_irq_handler - VDMA Interrupt handler
+ * @irq: IRQ number
+ * @data: Pointer to the Xilinx VDMA channel structure
+ *
+ * Return: IRQ_HANDLED/IRQ_NONE
+ */
+static irqreturn_t xilinx_vdma_irq_handler(int irq, void *data)
+{
+	struct xilinx_vdma_chan *chan = data;
+	u32 status;
+
+	/* Read the status and ack the interrupts. */
+	status = vdma_ctrl_read(chan, XILINX_VDMA_REG_DMASR);
+	if (!(status & XILINX_VDMA_DMAXR_ALL_IRQ_MASK))
+		return IRQ_NONE;
+
+	vdma_ctrl_write(chan, XILINX_VDMA_REG_DMASR,
+			status & XILINX_VDMA_DMAXR_ALL_IRQ_MASK);
+
+	if (status & XILINX_VDMA_DMASR_ERR_IRQ) {
+		/*
+		 * An error occurred. If C_FLUSH_ON_FSYNC is enabled and the
+		 * error is recoverable, ignore it. Otherwise flag the error.
+		 *
+		 * Only recoverable errors can be cleared in the DMASR register,
+		 * make sure not to write to other error bits to 1.
+		 */
+		u32 errors = status & XILINX_VDMA_DMASR_ALL_ERR_MASK;
+		vdma_ctrl_write(chan, XILINX_VDMA_REG_DMASR,
+				errors & XILINX_VDMA_DMASR_ERR_RECOVER_MASK);
+
+		if (!chan->flush_on_fsync ||
+		    (errors & ~XILINX_VDMA_DMASR_ERR_RECOVER_MASK)) {
+			dev_err(chan->dev,
+				"Channel %p has errors %x, cdr %x tdr %x\n",
+				chan, errors,
+				vdma_ctrl_read(chan, XILINX_VDMA_REG_CURDESC),
+				vdma_ctrl_read(chan, XILINX_VDMA_REG_TAILDESC));
+			chan->err = true;
+		}
+	}
+
+	if (status & XILINX_VDMA_DMASR_DLY_CNT_IRQ) {
+		/*
+		 * Device takes too long to do the transfer when user requires
+		 * responsiveness.
+		 */
+		dev_dbg(chan->dev, "Inter-packet latency too long\n");
+	}
+
+	if (status & XILINX_VDMA_DMASR_FRM_CNT_IRQ) {
+		xilinx_vdma_complete_descriptor(chan);
+		xilinx_vdma_start_transfer(chan);
+	}
+
+	tasklet_schedule(&chan->tasklet);
+	return IRQ_HANDLED;
+}
+
+/**
+ * xilinx_vdma_tx_submit - Submit DMA transaction
+ * @tx: Async transaction descriptor
+ *
+ * Return: cookie value on success and failure value on error
+ */
+static dma_cookie_t xilinx_vdma_tx_submit(struct dma_async_tx_descriptor *tx)
+{
+	struct xilinx_vdma_tx_descriptor *desc = to_vdma_tx_descriptor(tx);
+	struct xilinx_vdma_chan *chan = to_xilinx_chan(tx->chan);
+	dma_cookie_t cookie;
+	unsigned long flags;
+	int err;
+
+	if (chan->err) {
+		/*
+		 * If reset fails, need to hard reset the system.
+		 * Channel is no longer functional
+		 */
+		err = xilinx_vdma_chan_reset(chan);
+		if (err < 0)
+			return err;
+	}
+
+	spin_lock_irqsave(&chan->lock, flags);
+
+	cookie = dma_cookie_assign(tx);
+
+	/* Append the transaction to the pending transactions queue. */
+	list_add_tail(&desc->node, &chan->pending_list);
+
+	spin_unlock_irqrestore(&chan->lock, flags);
+
+	return cookie;
+}
+
+/**
+ * xilinx_vdma_prep_slave_sg - prepare a descriptor for a DMA_SLAVE transaction
+ * @dchan: DMA channel
+ * @xts: Array of interleaved template structs
+ * @flags: transfer ack flags
+ *
+ * Return: Async transaction descriptor on success and NULL on failure
+ */
+static struct dma_async_tx_descriptor *
+xilinx_vdma_dma_prep_interleaved(struct dma_chan *dchan,
+				 struct dma_interleaved_template **xts,
+				 unsigned long flags)
+{
+	struct xilinx_vdma_chan *chan = to_xilinx_chan(dchan);
+	struct xilinx_vdma_tx_descriptor *desc;
+	struct xilinx_vdma_tx_segment *segment;
+	struct xilinx_vdma_tx_segment *prev = NULL;
+	struct dma_interleaved_template *xt = *xts;
+	u32 frmno = 0;
+
+	/* Count number of frames */
+	for (; xts[frmno] != NULL; frmno++)
+		;
+
+	if (frmno != chan->num_frms) {
+		dev_err(chan->dev,
+			"number of entries %d not the same as num stores %d\n",
+			frmno, chan->num_frms);
+		return NULL;
+	}
+
+	/* Allocate a transaction descriptor. */
+	desc = xilinx_vdma_alloc_tx_descriptor(chan);
+	if (!desc)
+		return NULL;
+
+	dma_async_tx_descriptor_init(&desc->async_tx, &chan->common);
+	desc->async_tx.tx_submit = xilinx_vdma_tx_submit;
+	desc->async_tx.cookie = 0;
+	async_tx_ack(&desc->async_tx);
+
+	/* Build the list of transaction segments. */
+	for (; (xt = *xts); xts++) {
+		struct xilinx_vdma_desc_hw *hw;
+
+		if ((xt->dir != DMA_MEM_TO_DEV) && (xt->dir != DMA_DEV_TO_MEM))
+			goto error;
+
+		if (!xt->numf || !xt->sgl[0].size)
+			goto error;
+
+		/* Allocate the link descriptor from DMA pool */
+		segment = xilinx_vdma_alloc_tx_segment(chan);
+		if (!segment)
+			goto error;
+
+		/* Fill in the hardware descriptor */
+		hw = &segment->hw;
+		if (xt->dir != DMA_MEM_TO_DEV)
+			hw->buf_addr = xt->dst_start;
+		else
+			hw->buf_addr = xt->src_start;
+		hw->vsize = xt->numf;
+		hw->hsize = xt->sgl[0].size;
+		hw->stride = (chan->config.frm_dly <<
+			      XILINX_VDMA_FRMDLY_STRIDE_FRMDLY_SHIFT) |
+			     (xt->sgl[0].icg <<
+			      XILINX_VDMA_FRMDLY_STRIDE_STRIDE_SHIFT);
+
+		if (prev)
+			prev->hw.next_desc = segment->phys;
+
+		/* Insert the segment into the descriptor segments list. */
+		list_add_tail(&segment->node, &desc->segments);
+
+		prev = segment;
+	}
+
+	/* Link the last hardware descriptor with the first. */
+	segment = list_first_entry(&desc->segments,
+				   struct xilinx_vdma_tx_segment, node);
+	prev->hw.next_desc = segment->phys;
+
+	return &desc->async_tx;
+
+error:
+	xilinx_vdma_free_tx_descriptor(chan, desc);
+	return NULL;
+}
+
+/**
+ * xilinx_vdma_terminate_all - Halt the channel and free descriptors
+ * @chan: Driver specific VDMA Channel pointer
+ */
+static void xilinx_vdma_terminate_all(struct xilinx_vdma_chan *chan)
+{
+	/* Halt the DMA engine */
+	xilinx_vdma_halt(chan);
+
+	/* Remove and free all of the descriptors in the lists */
+	xilinx_vdma_free_descriptors(chan);
+}
+
+/**
+ * xilinx_vdma_slave_config - Configure VDMA channel
+ * Run-time configuration for Axi VDMA, supports:
+ * . halt the channel
+ * . configure interrupt coalescing and inter-packet delay threshold
+ * . start/stop parking
+ * . enable genlock
+ *
+ * @chan: Driver specific VDMA Channel pointer
+ * @slvcfg: DMA Slave configuration pointer
+ *
+ * Return: '0' on success and failure value on error
+ */
+static int xilinx_vdma_slave_config(struct xilinx_vdma_chan *chan,
+				    struct dma_slave_config *slvcfg)
+{
+	u32 dmacr;
+	struct xilinx_vdma_config *cfg = to_xilinx_vdma_config(slvcfg);
+
+	if (cfg->reset)
+		return xilinx_vdma_chan_reset(chan);
+
+	dmacr = vdma_ctrl_read(chan, XILINX_VDMA_REG_DMACR);
+
+	chan->config.frm_dly = cfg->frm_dly;
+	chan->config.park = cfg->park;
+
+	/* genlock settings */
+	chan->config.gen_lock = cfg->gen_lock;
+	chan->config.master = cfg->master;
+
+	if (cfg->gen_lock && chan->genlock) {
+		dmacr |= XILINX_VDMA_DMACR_GENLOCK_EN;
+		dmacr |= cfg->master << XILINX_VDMA_DMACR_MASTER_SHIFT;
+	}
+
+	chan->config.frm_cnt_en = cfg->frm_cnt_en;
+	if (cfg->park)
+		chan->config.park_frm = cfg->park_frm;
+	else
+		chan->config.park_frm = -1;
+
+	chan->config.coalesc = cfg->coalesc;
+	chan->config.delay = cfg->delay;
+
+	if (cfg->coalesc <= XILINX_VDMA_DMACR_FRAME_COUNT_MAX) {
+		dmacr |= cfg->coalesc << XILINX_VDMA_DMACR_FRAME_COUNT_SHIFT;
+		chan->config.coalesc = cfg->coalesc;
+	}
+
+	if (cfg->delay <= XILINX_VDMA_DMACR_DELAY_MAX) {
+		dmacr |= cfg->delay << XILINX_VDMA_DMACR_DELAY_SHIFT;
+		chan->config.delay = cfg->delay;
+	}
+
+	/* FSync Source selection */
+	dmacr &= ~XILINX_VDMA_DMACR_FSYNCSRC_MASK;
+	dmacr |= cfg->ext_fsync << XILINX_VDMA_DMACR_FSYNCSRC_SHIFT;
+
+	vdma_ctrl_write(chan, XILINX_VDMA_REG_DMACR, dmacr);
+	return 0;
+}
+
+/**
+ * xilinx_vdma_device_control - Configure DMA channel of the device
+ * @dchan: DMA Channel pointer
+ * @cmd: DMA control command
+ * @arg: Channel configuration
+ *
+ * Return: '0' on success and failure value on error
+ */
+static int xilinx_vdma_device_control(struct dma_chan *dchan,
+				      enum dma_ctrl_cmd cmd, unsigned long arg)
+{
+	struct xilinx_vdma_chan *chan = to_xilinx_chan(dchan);
+
+	switch (cmd) {
+	case DMA_TERMINATE_ALL:
+		xilinx_vdma_terminate_all(chan);
+		return 0;
+	case DMA_SLAVE_CONFIG:
+		return xilinx_vdma_slave_config(chan,
+						(struct dma_slave_config *)arg);
+	default:
+		return -ENXIO;
+	}
+}
+
+/* -----------------------------------------------------------------------------
+ * Probe and remove
+ */
+
+/**
+ * xilinx_vdma_chan_remove - Per Channel remove function
+ * @chan: Driver specific VDMA channel
+ */
+static void xilinx_vdma_chan_remove(struct xilinx_vdma_chan *chan)
+{
+	/* Disable all interrupts */
+	vdma_ctrl_clr(chan, XILINX_VDMA_REG_DMACR,
+		      XILINX_VDMA_DMAXR_ALL_IRQ_MASK);
+
+	if (chan->irq > 0)
+		free_irq(chan->irq, chan);
+
+	tasklet_kill(&chan->tasklet);
+
+	list_del(&chan->common.device_node);
+}
+
+/**
+ * xilinx_vdma_chan_probe - Per Channel Probing
+ * It get channel features from the device tree entry and
+ * initialize special channel handling routines
+ *
+ * @xdev: Driver specific device structure
+ * @node: Device node
+ *
+ * Return: '0' on success and failure value on error
+ */
+static int xilinx_vdma_chan_probe(struct xilinx_vdma_device *xdev,
+				  struct device_node *node)
+{
+	struct xilinx_vdma_chan *chan;
+	bool has_dre = false;
+	u32 value, width;
+	int err;
+
+	/* Allocate and initialize the channel structure */
+	chan = devm_kzalloc(xdev->dev, sizeof(*chan), GFP_KERNEL);
+	if (!chan)
+		return -ENOMEM;
+
+	chan->dev = xdev->dev;
+	chan->xdev = xdev;
+	chan->has_sg = xdev->has_sg;
+
+	spin_lock_init(&chan->lock);
+	INIT_LIST_HEAD(&chan->pending_list);
+	INIT_LIST_HEAD(&chan->done_list);
+
+	/* Retrieve the channel properties from the device tree */
+	has_dre = of_property_read_bool(node, "xlnx,include-dre");
+
+	chan->genlock = of_property_read_bool(node, "xlnx,genlock-mode");
+
+	err = of_property_read_u32(node, "xlnx,datawidth", &value);
+	if (err) {
+		dev_err(xdev->dev, "missing xlnx,datawidth property\n");
+		return err;
+	}
+	width = value >> 3; /* Convert bits to bytes */
+
+	/* If data width is greater than 8 bytes, DRE is not in hw */
+	if (width > 8)
+		has_dre = false;
+
+	if (!has_dre)
+		xdev->common.copy_align = fls(width - 1);
+
+	if (of_device_is_compatible(node, "xlnx,axi-vdma-mm2s-channel")) {
+		chan->direction = DMA_MEM_TO_DEV;
+		chan->id = 0;
+
+		chan->ctrl_offset = XILINX_VDMA_MM2S_CTRL_OFFSET;
+		chan->desc_offset = XILINX_VDMA_MM2S_DESC_OFFSET;
+
+		if (xdev->flush_on_fsync == XILINX_VDMA_FLUSH_BOTH ||
+		    xdev->flush_on_fsync == XILINX_VDMA_FLUSH_MM2S)
+			chan->flush_on_fsync = true;
+	} else if (of_device_is_compatible(node,
+					    "xlnx,axi-vdma-s2mm-channel")) {
+		chan->direction = DMA_DEV_TO_MEM;
+		chan->id = 1;
+
+		chan->ctrl_offset = XILINX_VDMA_S2MM_CTRL_OFFSET;
+		chan->desc_offset = XILINX_VDMA_S2MM_DESC_OFFSET;
+
+		if (xdev->flush_on_fsync == XILINX_VDMA_FLUSH_BOTH ||
+		    xdev->flush_on_fsync == XILINX_VDMA_FLUSH_S2MM)
+			chan->flush_on_fsync = true;
+	} else {
+		dev_err(xdev->dev, "Invalid channel compatible node\n");
+		return -EINVAL;
+	}
+
+	/* Request the interrupt */
+	chan->irq = irq_of_parse_and_map(node, 0);
+	err = request_irq(chan->irq, xilinx_vdma_irq_handler, IRQF_SHARED,
+			  "xilinx-vdma-controller", chan);
+	if (err) {
+		dev_err(xdev->dev, "unable to request IRQ %d\n", chan->irq);
+		return err;
+	}
+
+	/* Initialize the tasklet */
+	tasklet_init(&chan->tasklet, xilinx_vdma_do_tasklet,
+			(unsigned long)chan);
+
+	/*
+	 * Initialize the DMA channel and add it to the DMA engine channels
+	 * list.
+	 */
+	chan->common.device = &xdev->common;
+
+	list_add_tail(&chan->common.device_node, &xdev->common.channels);
+	xdev->chan[chan->id] = chan;
+
+	/* Reset the channel */
+	err = xilinx_vdma_chan_reset(chan);
+	if (err < 0) {
+		dev_err(xdev->dev, "Reset channel failed\n");
+		return err;
+	}
+
+	return 0;
+}
+
+/**
+ * of_dma_xilinx_xlate - Translation function
+ * @dma_spec: Pointer to DMA specifier as found in the device tree
+ * @ofdma: Pointer to DMA controller data
+ *
+ * Return: DMA channel pointer on success and NULL on error
+ */
+static struct dma_chan *of_dma_xilinx_xlate(struct of_phandle_args *dma_spec,
+						struct of_dma *ofdma)
+{
+	struct xilinx_vdma_device *xdev = ofdma->of_dma_data;
+	int chan_id = dma_spec->args[0];
+
+	if (chan_id >= XILINX_VDMA_MAX_CHANS_PER_DEVICE)
+		return NULL;
+
+	return dma_get_slave_channel(&xdev->chan[chan_id]->common);
+}
+
+/**
+ * xilinx_vdma_probe - Driver probe function
+ * @pdev: Pointer to the platform_device structure
+ *
+ * Return: '0' on success and failure value on error
+ */
+static int xilinx_vdma_probe(struct platform_device *pdev)
+{
+	struct device_node *node = pdev->dev.of_node;
+	struct xilinx_vdma_device *xdev;
+	struct device_node *child;
+	struct resource *io;
+	u32 num_frames;
+	int i, err;
+
+	dev_info(&pdev->dev, "Probing xilinx axi vdma engine\n");
+
+	/* Allocate and initialize the DMA engine structure */
+	xdev = devm_kzalloc(&pdev->dev, sizeof(*xdev), GFP_KERNEL);
+	if (!xdev)
+		return -ENOMEM;
+
+	xdev->dev = &pdev->dev;
+
+	/* Request and map I/O memory */
+	io = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+	xdev->regs = devm_ioremap_resource(&pdev->dev, io);
+	if (IS_ERR(xdev->regs))
+		return PTR_ERR(xdev->regs);
+
+	/* Retrieve the DMA engine properties from the device tree */
+	xdev->has_sg = of_property_read_bool(node, "xlnx,include-sg");
+
+	err = of_property_read_u32(node, "xlnx,num-fstores", &num_frames);
+	if (err < 0) {
+		dev_err(xdev->dev, "missing xlnx,num-fstores property\n");
+		return err;
+	}
+
+	err = of_property_read_u32(node, "xlnx,flush-fsync",
+					&xdev->flush_on_fsync);
+	if (err < 0)
+		dev_warn(xdev->dev, "missing xlnx,flush-fsync property\n");
+
+	/* Initialize the DMA engine */
+	xdev->common.dev = &pdev->dev;
+
+	INIT_LIST_HEAD(&xdev->common.channels);
+	dma_cap_set(DMA_SLAVE, xdev->common.cap_mask);
+	dma_cap_set(DMA_PRIVATE, xdev->common.cap_mask);
+
+	xdev->common.device_alloc_chan_resources =
+				xilinx_vdma_alloc_chan_resources;
+	xdev->common.device_free_chan_resources =
+				xilinx_vdma_free_chan_resources;
+	xdev->common.device_prep_interleaved_dma =
+				xilinx_vdma_dma_prep_interleaved;
+	xdev->common.device_control = xilinx_vdma_device_control;
+	xdev->common.device_tx_status = xilinx_vdma_tx_status;
+	xdev->common.device_issue_pending = xilinx_vdma_issue_pending;
+
+	platform_set_drvdata(pdev, xdev);
+
+	/* Initialize the channels */
+	for_each_child_of_node(node, child) {
+		err = xilinx_vdma_chan_probe(xdev, child);
+		if (err < 0)
+			goto error;
+	}
+
+	for (i = 0; i < XILINX_VDMA_MAX_CHANS_PER_DEVICE; i++)
+		if (xdev->chan[i])
+			xdev->chan[i]->num_frms = num_frames;
+
+	/* Register the DMA engine with the core */
+	dma_async_device_register(&xdev->common);
+
+	err = of_dma_controller_register(node, of_dma_xilinx_xlate,
+					 xdev);
+	if (err < 0) {
+		dev_err(&pdev->dev, "Unable to register DMA to DT\n");
+		dma_async_device_unregister(&xdev->common);
+		goto error;
+	}
+
+	return 0;
+
+error:
+	for (i = 0; i < XILINX_VDMA_MAX_CHANS_PER_DEVICE; i++)
+		if (xdev->chan[i])
+			xilinx_vdma_chan_remove(xdev->chan[i]);
+
+	return err;
+}
+
+/**
+ * xilinx_vdma_remove - Driver remove function
+ * @pdev: Pointer to the platform_device structure
+ *
+ * Return: Always '0'
+ */
+static int xilinx_vdma_remove(struct platform_device *pdev)
+{
+	struct xilinx_vdma_device *xdev = platform_get_drvdata(pdev);
+	int i;
+
+	of_dma_controller_free(pdev->dev.of_node);
+
+	dma_async_device_unregister(&xdev->common);
+
+	for (i = 0; i < XILINX_VDMA_MAX_CHANS_PER_DEVICE; i++)
+		if (xdev->chan[i])
+			xilinx_vdma_chan_remove(xdev->chan[i]);
+
+	return 0;
+}
+
+static const struct of_device_id xilinx_vdma_of_ids[] = {
+	{ .compatible = "xlnx,axi-vdma-1.00.a",},
+	{}
+};
+
+static struct platform_driver xilinx_vdma_driver = {
+	.driver = {
+		.name = "xilinx-vdma",
+		.owner = THIS_MODULE,
+		.of_match_table = xilinx_vdma_of_ids,
+	},
+	.probe = xilinx_vdma_probe,
+	.remove = xilinx_vdma_remove,
+};
+
+module_platform_driver(xilinx_vdma_driver);
+
+MODULE_AUTHOR("Xilinx, Inc.");
+MODULE_DESCRIPTION("Xilinx VDMA driver");
+MODULE_LICENSE("GPL v2");
diff --git a/include/linux/amba/xilinx_dma.h b/include/linux/amba/xilinx_dma.h
new file mode 100644
index 0000000..146406c
--- /dev/null
+++ b/include/linux/amba/xilinx_dma.h
@@ -0,0 +1,46 @@
+/*
+ * Xilinx DMA Engine drivers support header file
+ *
+ * Copyright (C) 2010-2014 Xilinx, Inc. All rights reserved.
+ *
+ * This is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#ifndef __DMA_XILINX_DMA_H
+#define __DMA_XILINX_DMA_H
+
+#include <linux/dma-mapping.h>
+#include <linux/dmaengine.h>
+
+/**
+ * struct xilinx_vdma_config - VDMA Configuration structure
+ * @slvcfg: DMA Slave configuration
+ * @frm_dly: Frame delay
+ * @gen_lock: Whether in gen-lock mode
+ * @master: Master that it syncs to
+ * @frm_cnt_en: Enable frame count enable
+ * @park: Whether wants to park
+ * @park_frm: Frame to park on
+ * @coalesc: Interrupt coalescing threshold
+ * @delay: Delay counter
+ * @reset: Reset Channel
+ * @ext_fsync: External Frame Sync source
+ */
+struct xilinx_vdma_config {
+	struct dma_slave_config slvcfg;
+	int frm_dly;
+	int gen_lock;
+	int master;
+	int frm_cnt_en;
+	int park;
+	int park_frm;
+	int coalesc;
+	int delay;
+	int reset;
+	int ext_fsync;
+};
+
+#endif
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
  2014-02-15 12:00   ` Srikanth Thokala
@ 2014-02-17  8:43     ` Vinod Koul
  -1 siblings, 0 replies; 65+ messages in thread
From: Vinod Koul @ 2014-02-17  8:43 UTC (permalink / raw)
  To: Srikanth Thokala
  Cc: dan.j.williams, michal.simek, grant.likely, robh+dt, levex, lars,
	andriy.shevchenko, dmaengine, linux-arm-kernel, linux-kernel,
	devicetree

On Sat, Feb 15, 2014 at 05:30:17PM +0530, Srikanth Thokala wrote:
> The current implementation of interleaved DMA API support multiple
> frames only when the memory is contiguous by incrementing src_start/
> dst_start members of interleaved template.
> 
> But, when the memory is non-contiguous it will restrict slave device
> to not submit multiple frames in a batch.  This patch handles this
> issue by allowing the slave device to send array of interleaved dma
> templates each having a different memory location.
This seems to be missing the numbers of templates you are sending, wouldnt this
require sending ARRAY_SiZE too?

And why send double pointer?

-- 
~Vinod

> 
> Signed-off-by: Srikanth Thokala <sthokal@xilinx.com>
> ---
>  Documentation/dmaengine.txt              |    2 +-
>  drivers/dma/imx-dma.c                    |    3 ++-
>  drivers/dma/sirf-dma.c                   |    3 ++-
>  drivers/media/platform/m2m-deinterlace.c |    2 +-
>  include/linux/dmaengine.h                |    6 +++---
>  5 files changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/dmaengine.txt b/Documentation/dmaengine.txt
> index 879b6e3..c642614 100644
> --- a/Documentation/dmaengine.txt
> +++ b/Documentation/dmaengine.txt
> @@ -94,7 +94,7 @@ The slave DMA usage consists of following steps:
>  		size_t period_len, enum dma_data_direction direction);
>  
>  	struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
> -		struct dma_chan *chan, struct dma_interleaved_template *xt,
> +		struct dma_chan *chan, struct dma_interleaved_template **xts,
>  		unsigned long flags);
>  
>     The peripheral driver is expected to have mapped the scatterlist for
> diff --git a/drivers/dma/imx-dma.c b/drivers/dma/imx-dma.c
> index 6f9ac20..e2c52ce 100644
> --- a/drivers/dma/imx-dma.c
> +++ b/drivers/dma/imx-dma.c
> @@ -954,12 +954,13 @@ static struct dma_async_tx_descriptor *imxdma_prep_dma_memcpy(
>  }
>  
>  static struct dma_async_tx_descriptor *imxdma_prep_dma_interleaved(
> -	struct dma_chan *chan, struct dma_interleaved_template *xt,
> +	struct dma_chan *chan, struct dma_interleaved_template **xts,
>  	unsigned long flags)
>  {
>  	struct imxdma_channel *imxdmac = to_imxdma_chan(chan);
>  	struct imxdma_engine *imxdma = imxdmac->imxdma;
>  	struct imxdma_desc *desc;
> +	struct dma_interleaved_template *xt = *xts;
>  
>  	dev_dbg(imxdma->dev, "%s channel: %d src_start=0x%llx dst_start=0x%llx\n"
>  		"   src_sgl=%s dst_sgl=%s numf=%zu frame_size=%zu\n", __func__,
> diff --git a/drivers/dma/sirf-dma.c b/drivers/dma/sirf-dma.c
> index d4d3a31..b6a150b 100644
> --- a/drivers/dma/sirf-dma.c
> +++ b/drivers/dma/sirf-dma.c
> @@ -509,12 +509,13 @@ sirfsoc_dma_tx_status(struct dma_chan *chan, dma_cookie_t cookie,
>  }
>  
>  static struct dma_async_tx_descriptor *sirfsoc_dma_prep_interleaved(
> -	struct dma_chan *chan, struct dma_interleaved_template *xt,
> +	struct dma_chan *chan, struct dma_interleaved_template **xts,
>  	unsigned long flags)
>  {
>  	struct sirfsoc_dma *sdma = dma_chan_to_sirfsoc_dma(chan);
>  	struct sirfsoc_dma_chan *schan = dma_chan_to_sirfsoc_dma_chan(chan);
>  	struct sirfsoc_dma_desc *sdesc = NULL;
> +	struct dma_interleaved_template *xt = *xts;
>  	unsigned long iflags;
>  	int ret;
>  
> diff --git a/drivers/media/platform/m2m-deinterlace.c b/drivers/media/platform/m2m-deinterlace.c
> index 6bb86b5..468110a 100644
> --- a/drivers/media/platform/m2m-deinterlace.c
> +++ b/drivers/media/platform/m2m-deinterlace.c
> @@ -343,7 +343,7 @@ static void deinterlace_issue_dma(struct deinterlace_ctx *ctx, int op,
>  	ctx->xt->dst_sgl = true;
>  	flags = DMA_CTRL_ACK | DMA_PREP_INTERRUPT;
>  
> -	tx = dmadev->device_prep_interleaved_dma(chan, ctx->xt, flags);
> +	tx = dmadev->device_prep_interleaved_dma(chan, &ctx->xt, flags);
>  	if (tx == NULL) {
>  		v4l2_warn(&pcdev->v4l2_dev, "DMA interleaved prep error\n");
>  		return;
> diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
> index c5c92d5..2f77a9a 100644
> --- a/include/linux/dmaengine.h
> +++ b/include/linux/dmaengine.h
> @@ -675,7 +675,7 @@ struct dma_device {
>  		size_t period_len, enum dma_transfer_direction direction,
>  		unsigned long flags, void *context);
>  	struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
> -		struct dma_chan *chan, struct dma_interleaved_template *xt,
> +		struct dma_chan *chan, struct dma_interleaved_template **xts,
>  		unsigned long flags);
>  	int (*device_control)(struct dma_chan *chan, enum dma_ctrl_cmd cmd,
>  		unsigned long arg);
> @@ -752,10 +752,10 @@ static inline struct dma_async_tx_descriptor *dmaengine_prep_dma_cyclic(
>  }
>  
>  static inline struct dma_async_tx_descriptor *dmaengine_prep_interleaved_dma(
> -		struct dma_chan *chan, struct dma_interleaved_template *xt,
> +		struct dma_chan *chan, struct dma_interleaved_template **xts,
>  		unsigned long flags)
>  {
> -	return chan->device->device_prep_interleaved_dma(chan, xt, flags);
> +	return chan->device->device_prep_interleaved_dma(chan, xts, flags);
>  }
>  
>  static inline int dma_get_slave_caps(struct dma_chan *chan, struct dma_slave_caps *caps)
> -- 
> 1.7.9.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe dmaengine" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-17  8:43     ` Vinod Koul
  0 siblings, 0 replies; 65+ messages in thread
From: Vinod Koul @ 2014-02-17  8:43 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Feb 15, 2014 at 05:30:17PM +0530, Srikanth Thokala wrote:
> The current implementation of interleaved DMA API support multiple
> frames only when the memory is contiguous by incrementing src_start/
> dst_start members of interleaved template.
> 
> But, when the memory is non-contiguous it will restrict slave device
> to not submit multiple frames in a batch.  This patch handles this
> issue by allowing the slave device to send array of interleaved dma
> templates each having a different memory location.
This seems to be missing the numbers of templates you are sending, wouldnt this
require sending ARRAY_SiZE too?

And why send double pointer?

-- 
~Vinod

> 
> Signed-off-by: Srikanth Thokala <sthokal@xilinx.com>
> ---
>  Documentation/dmaengine.txt              |    2 +-
>  drivers/dma/imx-dma.c                    |    3 ++-
>  drivers/dma/sirf-dma.c                   |    3 ++-
>  drivers/media/platform/m2m-deinterlace.c |    2 +-
>  include/linux/dmaengine.h                |    6 +++---
>  5 files changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/dmaengine.txt b/Documentation/dmaengine.txt
> index 879b6e3..c642614 100644
> --- a/Documentation/dmaengine.txt
> +++ b/Documentation/dmaengine.txt
> @@ -94,7 +94,7 @@ The slave DMA usage consists of following steps:
>  		size_t period_len, enum dma_data_direction direction);
>  
>  	struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
> -		struct dma_chan *chan, struct dma_interleaved_template *xt,
> +		struct dma_chan *chan, struct dma_interleaved_template **xts,
>  		unsigned long flags);
>  
>     The peripheral driver is expected to have mapped the scatterlist for
> diff --git a/drivers/dma/imx-dma.c b/drivers/dma/imx-dma.c
> index 6f9ac20..e2c52ce 100644
> --- a/drivers/dma/imx-dma.c
> +++ b/drivers/dma/imx-dma.c
> @@ -954,12 +954,13 @@ static struct dma_async_tx_descriptor *imxdma_prep_dma_memcpy(
>  }
>  
>  static struct dma_async_tx_descriptor *imxdma_prep_dma_interleaved(
> -	struct dma_chan *chan, struct dma_interleaved_template *xt,
> +	struct dma_chan *chan, struct dma_interleaved_template **xts,
>  	unsigned long flags)
>  {
>  	struct imxdma_channel *imxdmac = to_imxdma_chan(chan);
>  	struct imxdma_engine *imxdma = imxdmac->imxdma;
>  	struct imxdma_desc *desc;
> +	struct dma_interleaved_template *xt = *xts;
>  
>  	dev_dbg(imxdma->dev, "%s channel: %d src_start=0x%llx dst_start=0x%llx\n"
>  		"   src_sgl=%s dst_sgl=%s numf=%zu frame_size=%zu\n", __func__,
> diff --git a/drivers/dma/sirf-dma.c b/drivers/dma/sirf-dma.c
> index d4d3a31..b6a150b 100644
> --- a/drivers/dma/sirf-dma.c
> +++ b/drivers/dma/sirf-dma.c
> @@ -509,12 +509,13 @@ sirfsoc_dma_tx_status(struct dma_chan *chan, dma_cookie_t cookie,
>  }
>  
>  static struct dma_async_tx_descriptor *sirfsoc_dma_prep_interleaved(
> -	struct dma_chan *chan, struct dma_interleaved_template *xt,
> +	struct dma_chan *chan, struct dma_interleaved_template **xts,
>  	unsigned long flags)
>  {
>  	struct sirfsoc_dma *sdma = dma_chan_to_sirfsoc_dma(chan);
>  	struct sirfsoc_dma_chan *schan = dma_chan_to_sirfsoc_dma_chan(chan);
>  	struct sirfsoc_dma_desc *sdesc = NULL;
> +	struct dma_interleaved_template *xt = *xts;
>  	unsigned long iflags;
>  	int ret;
>  
> diff --git a/drivers/media/platform/m2m-deinterlace.c b/drivers/media/platform/m2m-deinterlace.c
> index 6bb86b5..468110a 100644
> --- a/drivers/media/platform/m2m-deinterlace.c
> +++ b/drivers/media/platform/m2m-deinterlace.c
> @@ -343,7 +343,7 @@ static void deinterlace_issue_dma(struct deinterlace_ctx *ctx, int op,
>  	ctx->xt->dst_sgl = true;
>  	flags = DMA_CTRL_ACK | DMA_PREP_INTERRUPT;
>  
> -	tx = dmadev->device_prep_interleaved_dma(chan, ctx->xt, flags);
> +	tx = dmadev->device_prep_interleaved_dma(chan, &ctx->xt, flags);
>  	if (tx == NULL) {
>  		v4l2_warn(&pcdev->v4l2_dev, "DMA interleaved prep error\n");
>  		return;
> diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
> index c5c92d5..2f77a9a 100644
> --- a/include/linux/dmaengine.h
> +++ b/include/linux/dmaengine.h
> @@ -675,7 +675,7 @@ struct dma_device {
>  		size_t period_len, enum dma_transfer_direction direction,
>  		unsigned long flags, void *context);
>  	struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
> -		struct dma_chan *chan, struct dma_interleaved_template *xt,
> +		struct dma_chan *chan, struct dma_interleaved_template **xts,
>  		unsigned long flags);
>  	int (*device_control)(struct dma_chan *chan, enum dma_ctrl_cmd cmd,
>  		unsigned long arg);
> @@ -752,10 +752,10 @@ static inline struct dma_async_tx_descriptor *dmaengine_prep_dma_cyclic(
>  }
>  
>  static inline struct dma_async_tx_descriptor *dmaengine_prep_interleaved_dma(
> -		struct dma_chan *chan, struct dma_interleaved_template *xt,
> +		struct dma_chan *chan, struct dma_interleaved_template **xts,
>  		unsigned long flags)
>  {
> -	return chan->device->device_prep_interleaved_dma(chan, xt, flags);
> +	return chan->device->device_prep_interleaved_dma(chan, xts, flags);
>  }
>  
>  static inline int dma_get_slave_caps(struct dma_chan *chan, struct dma_slave_caps *caps)
> -- 
> 1.7.9.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe dmaengine" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
  2014-02-17  8:43     ` Vinod Koul
@ 2014-02-17  9:29       ` Srikanth Thokala
  -1 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-17  9:29 UTC (permalink / raw)
  To: Vinod Koul
  Cc: Srikanth Thokala, dan.j.williams, michal.simek, Grant Likely,
	robh+dt, Levente Kurusa, Lars-Peter Clausen, andriy.shevchenko,
	dmaengine, linux-arm-kernel, linux-kernel, devicetree

On Mon, Feb 17, 2014 at 2:13 PM, Vinod Koul <vinod.koul@intel.com> wrote:
> On Sat, Feb 15, 2014 at 05:30:17PM +0530, Srikanth Thokala wrote:
>> The current implementation of interleaved DMA API support multiple
>> frames only when the memory is contiguous by incrementing src_start/
>> dst_start members of interleaved template.
>>
>> But, when the memory is non-contiguous it will restrict slave device
>> to not submit multiple frames in a batch.  This patch handles this
>> issue by allowing the slave device to send array of interleaved dma
>> templates each having a different memory location.
> This seems to be missing the numbers of templates you are sending, wouldnt this
> require sending ARRAY_SiZE too?
>
> And why send double pointer?

Array size is not required, when we pass the double pointer.  The last
element would be
pointed to NULL and we could get the number of templates from this condition.
Here is an example snippet,

In slave device driver,

        struct dma_interleaved_template **xts;

        xts = kcalloc(frm_cnt+1, sizeof(struct
dma_interleaved_template *), GFP_KERNEL);
        /* Error check for xts */
        for (i = 0; i < frm_cnt; i++) {
                xts[i] = kmalloc(sizeof(struct
dma_interleaved_template), GFP_KERNEL);
                /* Error check for xts[i] */
        }
        xts[i] = NULL;

In DMA engine driver,  we could get the number of frames by,

        for (; xts[frmno] != NULL; frmno++);

I felt this way is simpler than adding an extra argument to the API.
Please let me know
your opinion and suggest me a better way.

Thanks
Srikanth

>
> --
> ~Vinod
>
>>
>> Signed-off-by: Srikanth Thokala <sthokal@xilinx.com>
>> ---
>>  Documentation/dmaengine.txt              |    2 +-
>>  drivers/dma/imx-dma.c                    |    3 ++-
>>  drivers/dma/sirf-dma.c                   |    3 ++-
>>  drivers/media/platform/m2m-deinterlace.c |    2 +-
>>  include/linux/dmaengine.h                |    6 +++---
>>  5 files changed, 9 insertions(+), 7 deletions(-)
>>
>> diff --git a/Documentation/dmaengine.txt b/Documentation/dmaengine.txt
>> index 879b6e3..c642614 100644
>> --- a/Documentation/dmaengine.txt
>> +++ b/Documentation/dmaengine.txt
>> @@ -94,7 +94,7 @@ The slave DMA usage consists of following steps:
>>               size_t period_len, enum dma_data_direction direction);
>>
>>       struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
>> -             struct dma_chan *chan, struct dma_interleaved_template *xt,
>> +             struct dma_chan *chan, struct dma_interleaved_template **xts,
>>               unsigned long flags);
>>
>>     The peripheral driver is expected to have mapped the scatterlist for
>> diff --git a/drivers/dma/imx-dma.c b/drivers/dma/imx-dma.c
>> index 6f9ac20..e2c52ce 100644
>> --- a/drivers/dma/imx-dma.c
>> +++ b/drivers/dma/imx-dma.c
>> @@ -954,12 +954,13 @@ static struct dma_async_tx_descriptor *imxdma_prep_dma_memcpy(
>>  }
>>
>>  static struct dma_async_tx_descriptor *imxdma_prep_dma_interleaved(
>> -     struct dma_chan *chan, struct dma_interleaved_template *xt,
>> +     struct dma_chan *chan, struct dma_interleaved_template **xts,
>>       unsigned long flags)
>>  {
>>       struct imxdma_channel *imxdmac = to_imxdma_chan(chan);
>>       struct imxdma_engine *imxdma = imxdmac->imxdma;
>>       struct imxdma_desc *desc;
>> +     struct dma_interleaved_template *xt = *xts;
>>
>>       dev_dbg(imxdma->dev, "%s channel: %d src_start=0x%llx dst_start=0x%llx\n"
>>               "   src_sgl=%s dst_sgl=%s numf=%zu frame_size=%zu\n", __func__,
>> diff --git a/drivers/dma/sirf-dma.c b/drivers/dma/sirf-dma.c
>> index d4d3a31..b6a150b 100644
>> --- a/drivers/dma/sirf-dma.c
>> +++ b/drivers/dma/sirf-dma.c
>> @@ -509,12 +509,13 @@ sirfsoc_dma_tx_status(struct dma_chan *chan, dma_cookie_t cookie,
>>  }
>>
>>  static struct dma_async_tx_descriptor *sirfsoc_dma_prep_interleaved(
>> -     struct dma_chan *chan, struct dma_interleaved_template *xt,
>> +     struct dma_chan *chan, struct dma_interleaved_template **xts,
>>       unsigned long flags)
>>  {
>>       struct sirfsoc_dma *sdma = dma_chan_to_sirfsoc_dma(chan);
>>       struct sirfsoc_dma_chan *schan = dma_chan_to_sirfsoc_dma_chan(chan);
>>       struct sirfsoc_dma_desc *sdesc = NULL;
>> +     struct dma_interleaved_template *xt = *xts;
>>       unsigned long iflags;
>>       int ret;
>>
>> diff --git a/drivers/media/platform/m2m-deinterlace.c b/drivers/media/platform/m2m-deinterlace.c
>> index 6bb86b5..468110a 100644
>> --- a/drivers/media/platform/m2m-deinterlace.c
>> +++ b/drivers/media/platform/m2m-deinterlace.c
>> @@ -343,7 +343,7 @@ static void deinterlace_issue_dma(struct deinterlace_ctx *ctx, int op,
>>       ctx->xt->dst_sgl = true;
>>       flags = DMA_CTRL_ACK | DMA_PREP_INTERRUPT;
>>
>> -     tx = dmadev->device_prep_interleaved_dma(chan, ctx->xt, flags);
>> +     tx = dmadev->device_prep_interleaved_dma(chan, &ctx->xt, flags);
>>       if (tx == NULL) {
>>               v4l2_warn(&pcdev->v4l2_dev, "DMA interleaved prep error\n");
>>               return;
>> diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
>> index c5c92d5..2f77a9a 100644
>> --- a/include/linux/dmaengine.h
>> +++ b/include/linux/dmaengine.h
>> @@ -675,7 +675,7 @@ struct dma_device {
>>               size_t period_len, enum dma_transfer_direction direction,
>>               unsigned long flags, void *context);
>>       struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
>> -             struct dma_chan *chan, struct dma_interleaved_template *xt,
>> +             struct dma_chan *chan, struct dma_interleaved_template **xts,
>>               unsigned long flags);
>>       int (*device_control)(struct dma_chan *chan, enum dma_ctrl_cmd cmd,
>>               unsigned long arg);
>> @@ -752,10 +752,10 @@ static inline struct dma_async_tx_descriptor *dmaengine_prep_dma_cyclic(
>>  }
>>
>>  static inline struct dma_async_tx_descriptor *dmaengine_prep_interleaved_dma(
>> -             struct dma_chan *chan, struct dma_interleaved_template *xt,
>> +             struct dma_chan *chan, struct dma_interleaved_template **xts,
>>               unsigned long flags)
>>  {
>> -     return chan->device->device_prep_interleaved_dma(chan, xt, flags);
>> +     return chan->device->device_prep_interleaved_dma(chan, xts, flags);
>>  }
>>
>>  static inline int dma_get_slave_caps(struct dma_chan *chan, struct dma_slave_caps *caps)
>> --
>> 1.7.9.5
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe dmaengine" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> --
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-17  9:29       ` Srikanth Thokala
  0 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-17  9:29 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Feb 17, 2014 at 2:13 PM, Vinod Koul <vinod.koul@intel.com> wrote:
> On Sat, Feb 15, 2014 at 05:30:17PM +0530, Srikanth Thokala wrote:
>> The current implementation of interleaved DMA API support multiple
>> frames only when the memory is contiguous by incrementing src_start/
>> dst_start members of interleaved template.
>>
>> But, when the memory is non-contiguous it will restrict slave device
>> to not submit multiple frames in a batch.  This patch handles this
>> issue by allowing the slave device to send array of interleaved dma
>> templates each having a different memory location.
> This seems to be missing the numbers of templates you are sending, wouldnt this
> require sending ARRAY_SiZE too?
>
> And why send double pointer?

Array size is not required, when we pass the double pointer.  The last
element would be
pointed to NULL and we could get the number of templates from this condition.
Here is an example snippet,

In slave device driver,

        struct dma_interleaved_template **xts;

        xts = kcalloc(frm_cnt+1, sizeof(struct
dma_interleaved_template *), GFP_KERNEL);
        /* Error check for xts */
        for (i = 0; i < frm_cnt; i++) {
                xts[i] = kmalloc(sizeof(struct
dma_interleaved_template), GFP_KERNEL);
                /* Error check for xts[i] */
        }
        xts[i] = NULL;

In DMA engine driver,  we could get the number of frames by,

        for (; xts[frmno] != NULL; frmno++);

I felt this way is simpler than adding an extra argument to the API.
Please let me know
your opinion and suggest me a better way.

Thanks
Srikanth

>
> --
> ~Vinod
>
>>
>> Signed-off-by: Srikanth Thokala <sthokal@xilinx.com>
>> ---
>>  Documentation/dmaengine.txt              |    2 +-
>>  drivers/dma/imx-dma.c                    |    3 ++-
>>  drivers/dma/sirf-dma.c                   |    3 ++-
>>  drivers/media/platform/m2m-deinterlace.c |    2 +-
>>  include/linux/dmaengine.h                |    6 +++---
>>  5 files changed, 9 insertions(+), 7 deletions(-)
>>
>> diff --git a/Documentation/dmaengine.txt b/Documentation/dmaengine.txt
>> index 879b6e3..c642614 100644
>> --- a/Documentation/dmaengine.txt
>> +++ b/Documentation/dmaengine.txt
>> @@ -94,7 +94,7 @@ The slave DMA usage consists of following steps:
>>               size_t period_len, enum dma_data_direction direction);
>>
>>       struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
>> -             struct dma_chan *chan, struct dma_interleaved_template *xt,
>> +             struct dma_chan *chan, struct dma_interleaved_template **xts,
>>               unsigned long flags);
>>
>>     The peripheral driver is expected to have mapped the scatterlist for
>> diff --git a/drivers/dma/imx-dma.c b/drivers/dma/imx-dma.c
>> index 6f9ac20..e2c52ce 100644
>> --- a/drivers/dma/imx-dma.c
>> +++ b/drivers/dma/imx-dma.c
>> @@ -954,12 +954,13 @@ static struct dma_async_tx_descriptor *imxdma_prep_dma_memcpy(
>>  }
>>
>>  static struct dma_async_tx_descriptor *imxdma_prep_dma_interleaved(
>> -     struct dma_chan *chan, struct dma_interleaved_template *xt,
>> +     struct dma_chan *chan, struct dma_interleaved_template **xts,
>>       unsigned long flags)
>>  {
>>       struct imxdma_channel *imxdmac = to_imxdma_chan(chan);
>>       struct imxdma_engine *imxdma = imxdmac->imxdma;
>>       struct imxdma_desc *desc;
>> +     struct dma_interleaved_template *xt = *xts;
>>
>>       dev_dbg(imxdma->dev, "%s channel: %d src_start=0x%llx dst_start=0x%llx\n"
>>               "   src_sgl=%s dst_sgl=%s numf=%zu frame_size=%zu\n", __func__,
>> diff --git a/drivers/dma/sirf-dma.c b/drivers/dma/sirf-dma.c
>> index d4d3a31..b6a150b 100644
>> --- a/drivers/dma/sirf-dma.c
>> +++ b/drivers/dma/sirf-dma.c
>> @@ -509,12 +509,13 @@ sirfsoc_dma_tx_status(struct dma_chan *chan, dma_cookie_t cookie,
>>  }
>>
>>  static struct dma_async_tx_descriptor *sirfsoc_dma_prep_interleaved(
>> -     struct dma_chan *chan, struct dma_interleaved_template *xt,
>> +     struct dma_chan *chan, struct dma_interleaved_template **xts,
>>       unsigned long flags)
>>  {
>>       struct sirfsoc_dma *sdma = dma_chan_to_sirfsoc_dma(chan);
>>       struct sirfsoc_dma_chan *schan = dma_chan_to_sirfsoc_dma_chan(chan);
>>       struct sirfsoc_dma_desc *sdesc = NULL;
>> +     struct dma_interleaved_template *xt = *xts;
>>       unsigned long iflags;
>>       int ret;
>>
>> diff --git a/drivers/media/platform/m2m-deinterlace.c b/drivers/media/platform/m2m-deinterlace.c
>> index 6bb86b5..468110a 100644
>> --- a/drivers/media/platform/m2m-deinterlace.c
>> +++ b/drivers/media/platform/m2m-deinterlace.c
>> @@ -343,7 +343,7 @@ static void deinterlace_issue_dma(struct deinterlace_ctx *ctx, int op,
>>       ctx->xt->dst_sgl = true;
>>       flags = DMA_CTRL_ACK | DMA_PREP_INTERRUPT;
>>
>> -     tx = dmadev->device_prep_interleaved_dma(chan, ctx->xt, flags);
>> +     tx = dmadev->device_prep_interleaved_dma(chan, &ctx->xt, flags);
>>       if (tx == NULL) {
>>               v4l2_warn(&pcdev->v4l2_dev, "DMA interleaved prep error\n");
>>               return;
>> diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
>> index c5c92d5..2f77a9a 100644
>> --- a/include/linux/dmaengine.h
>> +++ b/include/linux/dmaengine.h
>> @@ -675,7 +675,7 @@ struct dma_device {
>>               size_t period_len, enum dma_transfer_direction direction,
>>               unsigned long flags, void *context);
>>       struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
>> -             struct dma_chan *chan, struct dma_interleaved_template *xt,
>> +             struct dma_chan *chan, struct dma_interleaved_template **xts,
>>               unsigned long flags);
>>       int (*device_control)(struct dma_chan *chan, enum dma_ctrl_cmd cmd,
>>               unsigned long arg);
>> @@ -752,10 +752,10 @@ static inline struct dma_async_tx_descriptor *dmaengine_prep_dma_cyclic(
>>  }
>>
>>  static inline struct dma_async_tx_descriptor *dmaengine_prep_interleaved_dma(
>> -             struct dma_chan *chan, struct dma_interleaved_template *xt,
>> +             struct dma_chan *chan, struct dma_interleaved_template **xts,
>>               unsigned long flags)
>>  {
>> -     return chan->device->device_prep_interleaved_dma(chan, xt, flags);
>> +     return chan->device->device_prep_interleaved_dma(chan, xts, flags);
>>  }
>>
>>  static inline int dma_get_slave_caps(struct dma_chan *chan, struct dma_slave_caps *caps)
>> --
>> 1.7.9.5
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe dmaengine" in
>> the body of a message to majordomo at vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> --
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-17  9:35         ` Lars-Peter Clausen
  0 siblings, 0 replies; 65+ messages in thread
From: Lars-Peter Clausen @ 2014-02-17  9:35 UTC (permalink / raw)
  To: Srikanth Thokala
  Cc: Vinod Koul, dan.j.williams, michal.simek, Grant Likely, robh+dt,
	Levente Kurusa, andriy.shevchenko, dmaengine, linux-arm-kernel,
	linux-kernel, devicetree

On 02/17/2014 10:29 AM, Srikanth Thokala wrote:
> On Mon, Feb 17, 2014 at 2:13 PM, Vinod Koul <vinod.koul@intel.com> wrote:
>> On Sat, Feb 15, 2014 at 05:30:17PM +0530, Srikanth Thokala wrote:
>>> The current implementation of interleaved DMA API support multiple
>>> frames only when the memory is contiguous by incrementing src_start/
>>> dst_start members of interleaved template.
>>>
>>> But, when the memory is non-contiguous it will restrict slave device
>>> to not submit multiple frames in a batch.  This patch handles this
>>> issue by allowing the slave device to send array of interleaved dma
>>> templates each having a different memory location.
>> This seems to be missing the numbers of templates you are sending, wouldnt this
>> require sending ARRAY_SiZE too?
>>
>> And why send double pointer?
>
> Array size is not required, when we pass the double pointer.  The last
> element would be
> pointed to NULL and we could get the number of templates from this condition.
> Here is an example snippet,
>
> In slave device driver,
>
>          struct dma_interleaved_template **xts;
>
>          xts = kcalloc(frm_cnt+1, sizeof(struct
> dma_interleaved_template *), GFP_KERNEL);
>          /* Error check for xts */
>          for (i = 0; i < frm_cnt; i++) {
>                  xts[i] = kmalloc(sizeof(struct
> dma_interleaved_template), GFP_KERNEL);
>                  /* Error check for xts[i] */
>          }
>          xts[i] = NULL;
>
> In DMA engine driver,  we could get the number of frames by,
>
>          for (; xts[frmno] != NULL; frmno++);
>
> I felt this way is simpler than adding an extra argument to the API.
> Please let me know
> your opinion and suggest me a better way.

I think Vinod's suggestion of passing in an array of interleaved_templates 
and the size of the array is better than what you are currently doing.

Btw. you also need to update the current implementations and users of the 
API accordingly.

>
>>
>> --
>> ~Vinod
>>
>>>
>>> Signed-off-by: Srikanth Thokala <sthokal@xilinx.com>
>>> ---
>>>   Documentation/dmaengine.txt              |    2 +-
>>>   drivers/dma/imx-dma.c                    |    3 ++-
>>>   drivers/dma/sirf-dma.c                   |    3 ++-
>>>   drivers/media/platform/m2m-deinterlace.c |    2 +-
>>>   include/linux/dmaengine.h                |    6 +++---
>>>   5 files changed, 9 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/Documentation/dmaengine.txt b/Documentation/dmaengine.txt
>>> index 879b6e3..c642614 100644
>>> --- a/Documentation/dmaengine.txt
>>> +++ b/Documentation/dmaengine.txt
>>> @@ -94,7 +94,7 @@ The slave DMA usage consists of following steps:
>>>                size_t period_len, enum dma_data_direction direction);
>>>
>>>        struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
>>> -             struct dma_chan *chan, struct dma_interleaved_template *xt,
>>> +             struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>                unsigned long flags);
>>>
>>>      The peripheral driver is expected to have mapped the scatterlist for
>>> diff --git a/drivers/dma/imx-dma.c b/drivers/dma/imx-dma.c
>>> index 6f9ac20..e2c52ce 100644
>>> --- a/drivers/dma/imx-dma.c
>>> +++ b/drivers/dma/imx-dma.c
>>> @@ -954,12 +954,13 @@ static struct dma_async_tx_descriptor *imxdma_prep_dma_memcpy(
>>>   }
>>>
>>>   static struct dma_async_tx_descriptor *imxdma_prep_dma_interleaved(
>>> -     struct dma_chan *chan, struct dma_interleaved_template *xt,
>>> +     struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>        unsigned long flags)
>>>   {
>>>        struct imxdma_channel *imxdmac = to_imxdma_chan(chan);
>>>        struct imxdma_engine *imxdma = imxdmac->imxdma;
>>>        struct imxdma_desc *desc;
>>> +     struct dma_interleaved_template *xt = *xts;
>>>
>>>        dev_dbg(imxdma->dev, "%s channel: %d src_start=0x%llx dst_start=0x%llx\n"
>>>                "   src_sgl=%s dst_sgl=%s numf=%zu frame_size=%zu\n", __func__,
>>> diff --git a/drivers/dma/sirf-dma.c b/drivers/dma/sirf-dma.c
>>> index d4d3a31..b6a150b 100644
>>> --- a/drivers/dma/sirf-dma.c
>>> +++ b/drivers/dma/sirf-dma.c
>>> @@ -509,12 +509,13 @@ sirfsoc_dma_tx_status(struct dma_chan *chan, dma_cookie_t cookie,
>>>   }
>>>
>>>   static struct dma_async_tx_descriptor *sirfsoc_dma_prep_interleaved(
>>> -     struct dma_chan *chan, struct dma_interleaved_template *xt,
>>> +     struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>        unsigned long flags)
>>>   {
>>>        struct sirfsoc_dma *sdma = dma_chan_to_sirfsoc_dma(chan);
>>>        struct sirfsoc_dma_chan *schan = dma_chan_to_sirfsoc_dma_chan(chan);
>>>        struct sirfsoc_dma_desc *sdesc = NULL;
>>> +     struct dma_interleaved_template *xt = *xts;
>>>        unsigned long iflags;
>>>        int ret;
>>>
>>> diff --git a/drivers/media/platform/m2m-deinterlace.c b/drivers/media/platform/m2m-deinterlace.c
>>> index 6bb86b5..468110a 100644
>>> --- a/drivers/media/platform/m2m-deinterlace.c
>>> +++ b/drivers/media/platform/m2m-deinterlace.c
>>> @@ -343,7 +343,7 @@ static void deinterlace_issue_dma(struct deinterlace_ctx *ctx, int op,
>>>        ctx->xt->dst_sgl = true;
>>>        flags = DMA_CTRL_ACK | DMA_PREP_INTERRUPT;
>>>
>>> -     tx = dmadev->device_prep_interleaved_dma(chan, ctx->xt, flags);
>>> +     tx = dmadev->device_prep_interleaved_dma(chan, &ctx->xt, flags);
>>>        if (tx == NULL) {
>>>                v4l2_warn(&pcdev->v4l2_dev, "DMA interleaved prep error\n");
>>>                return;
>>> diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
>>> index c5c92d5..2f77a9a 100644
>>> --- a/include/linux/dmaengine.h
>>> +++ b/include/linux/dmaengine.h
>>> @@ -675,7 +675,7 @@ struct dma_device {
>>>                size_t period_len, enum dma_transfer_direction direction,
>>>                unsigned long flags, void *context);
>>>        struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
>>> -             struct dma_chan *chan, struct dma_interleaved_template *xt,
>>> +             struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>                unsigned long flags);
>>>        int (*device_control)(struct dma_chan *chan, enum dma_ctrl_cmd cmd,
>>>                unsigned long arg);
>>> @@ -752,10 +752,10 @@ static inline struct dma_async_tx_descriptor *dmaengine_prep_dma_cyclic(
>>>   }
>>>
>>>   static inline struct dma_async_tx_descriptor *dmaengine_prep_interleaved_dma(
>>> -             struct dma_chan *chan, struct dma_interleaved_template *xt,
>>> +             struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>                unsigned long flags)
>>>   {
>>> -     return chan->device->device_prep_interleaved_dma(chan, xt, flags);
>>> +     return chan->device->device_prep_interleaved_dma(chan, xts, flags);
>>>   }
>>>
>>>   static inline int dma_get_slave_caps(struct dma_chan *chan, struct dma_slave_caps *caps)
>>> --
>>> 1.7.9.5
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe dmaengine" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>> --
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-17  9:35         ` Lars-Peter Clausen
  0 siblings, 0 replies; 65+ messages in thread
From: Lars-Peter Clausen @ 2014-02-17  9:35 UTC (permalink / raw)
  To: Srikanth Thokala
  Cc: Vinod Koul, dan.j.williams-ral2JQCrhuEAvxtiuMwx3w,
	michal.simek-gjFFaj9aHVfQT0dZR+AlfA, Grant Likely,
	robh+dt-DgEjT+Ai2ygdnm+yROfE0A, Levente Kurusa,
	andriy.shevchenko-b5Z7lJ3WibVrdx17CPfAsdBPR1lH4CV8,
	dmaengine-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA

On 02/17/2014 10:29 AM, Srikanth Thokala wrote:
> On Mon, Feb 17, 2014 at 2:13 PM, Vinod Koul <vinod.koul-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>> On Sat, Feb 15, 2014 at 05:30:17PM +0530, Srikanth Thokala wrote:
>>> The current implementation of interleaved DMA API support multiple
>>> frames only when the memory is contiguous by incrementing src_start/
>>> dst_start members of interleaved template.
>>>
>>> But, when the memory is non-contiguous it will restrict slave device
>>> to not submit multiple frames in a batch.  This patch handles this
>>> issue by allowing the slave device to send array of interleaved dma
>>> templates each having a different memory location.
>> This seems to be missing the numbers of templates you are sending, wouldnt this
>> require sending ARRAY_SiZE too?
>>
>> And why send double pointer?
>
> Array size is not required, when we pass the double pointer.  The last
> element would be
> pointed to NULL and we could get the number of templates from this condition.
> Here is an example snippet,
>
> In slave device driver,
>
>          struct dma_interleaved_template **xts;
>
>          xts = kcalloc(frm_cnt+1, sizeof(struct
> dma_interleaved_template *), GFP_KERNEL);
>          /* Error check for xts */
>          for (i = 0; i < frm_cnt; i++) {
>                  xts[i] = kmalloc(sizeof(struct
> dma_interleaved_template), GFP_KERNEL);
>                  /* Error check for xts[i] */
>          }
>          xts[i] = NULL;
>
> In DMA engine driver,  we could get the number of frames by,
>
>          for (; xts[frmno] != NULL; frmno++);
>
> I felt this way is simpler than adding an extra argument to the API.
> Please let me know
> your opinion and suggest me a better way.

I think Vinod's suggestion of passing in an array of interleaved_templates 
and the size of the array is better than what you are currently doing.

Btw. you also need to update the current implementations and users of the 
API accordingly.

>
>>
>> --
>> ~Vinod
>>
>>>
>>> Signed-off-by: Srikanth Thokala <sthokal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org>
>>> ---
>>>   Documentation/dmaengine.txt              |    2 +-
>>>   drivers/dma/imx-dma.c                    |    3 ++-
>>>   drivers/dma/sirf-dma.c                   |    3 ++-
>>>   drivers/media/platform/m2m-deinterlace.c |    2 +-
>>>   include/linux/dmaengine.h                |    6 +++---
>>>   5 files changed, 9 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/Documentation/dmaengine.txt b/Documentation/dmaengine.txt
>>> index 879b6e3..c642614 100644
>>> --- a/Documentation/dmaengine.txt
>>> +++ b/Documentation/dmaengine.txt
>>> @@ -94,7 +94,7 @@ The slave DMA usage consists of following steps:
>>>                size_t period_len, enum dma_data_direction direction);
>>>
>>>        struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
>>> -             struct dma_chan *chan, struct dma_interleaved_template *xt,
>>> +             struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>                unsigned long flags);
>>>
>>>      The peripheral driver is expected to have mapped the scatterlist for
>>> diff --git a/drivers/dma/imx-dma.c b/drivers/dma/imx-dma.c
>>> index 6f9ac20..e2c52ce 100644
>>> --- a/drivers/dma/imx-dma.c
>>> +++ b/drivers/dma/imx-dma.c
>>> @@ -954,12 +954,13 @@ static struct dma_async_tx_descriptor *imxdma_prep_dma_memcpy(
>>>   }
>>>
>>>   static struct dma_async_tx_descriptor *imxdma_prep_dma_interleaved(
>>> -     struct dma_chan *chan, struct dma_interleaved_template *xt,
>>> +     struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>        unsigned long flags)
>>>   {
>>>        struct imxdma_channel *imxdmac = to_imxdma_chan(chan);
>>>        struct imxdma_engine *imxdma = imxdmac->imxdma;
>>>        struct imxdma_desc *desc;
>>> +     struct dma_interleaved_template *xt = *xts;
>>>
>>>        dev_dbg(imxdma->dev, "%s channel: %d src_start=0x%llx dst_start=0x%llx\n"
>>>                "   src_sgl=%s dst_sgl=%s numf=%zu frame_size=%zu\n", __func__,
>>> diff --git a/drivers/dma/sirf-dma.c b/drivers/dma/sirf-dma.c
>>> index d4d3a31..b6a150b 100644
>>> --- a/drivers/dma/sirf-dma.c
>>> +++ b/drivers/dma/sirf-dma.c
>>> @@ -509,12 +509,13 @@ sirfsoc_dma_tx_status(struct dma_chan *chan, dma_cookie_t cookie,
>>>   }
>>>
>>>   static struct dma_async_tx_descriptor *sirfsoc_dma_prep_interleaved(
>>> -     struct dma_chan *chan, struct dma_interleaved_template *xt,
>>> +     struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>        unsigned long flags)
>>>   {
>>>        struct sirfsoc_dma *sdma = dma_chan_to_sirfsoc_dma(chan);
>>>        struct sirfsoc_dma_chan *schan = dma_chan_to_sirfsoc_dma_chan(chan);
>>>        struct sirfsoc_dma_desc *sdesc = NULL;
>>> +     struct dma_interleaved_template *xt = *xts;
>>>        unsigned long iflags;
>>>        int ret;
>>>
>>> diff --git a/drivers/media/platform/m2m-deinterlace.c b/drivers/media/platform/m2m-deinterlace.c
>>> index 6bb86b5..468110a 100644
>>> --- a/drivers/media/platform/m2m-deinterlace.c
>>> +++ b/drivers/media/platform/m2m-deinterlace.c
>>> @@ -343,7 +343,7 @@ static void deinterlace_issue_dma(struct deinterlace_ctx *ctx, int op,
>>>        ctx->xt->dst_sgl = true;
>>>        flags = DMA_CTRL_ACK | DMA_PREP_INTERRUPT;
>>>
>>> -     tx = dmadev->device_prep_interleaved_dma(chan, ctx->xt, flags);
>>> +     tx = dmadev->device_prep_interleaved_dma(chan, &ctx->xt, flags);
>>>        if (tx == NULL) {
>>>                v4l2_warn(&pcdev->v4l2_dev, "DMA interleaved prep error\n");
>>>                return;
>>> diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
>>> index c5c92d5..2f77a9a 100644
>>> --- a/include/linux/dmaengine.h
>>> +++ b/include/linux/dmaengine.h
>>> @@ -675,7 +675,7 @@ struct dma_device {
>>>                size_t period_len, enum dma_transfer_direction direction,
>>>                unsigned long flags, void *context);
>>>        struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
>>> -             struct dma_chan *chan, struct dma_interleaved_template *xt,
>>> +             struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>                unsigned long flags);
>>>        int (*device_control)(struct dma_chan *chan, enum dma_ctrl_cmd cmd,
>>>                unsigned long arg);
>>> @@ -752,10 +752,10 @@ static inline struct dma_async_tx_descriptor *dmaengine_prep_dma_cyclic(
>>>   }
>>>
>>>   static inline struct dma_async_tx_descriptor *dmaengine_prep_interleaved_dma(
>>> -             struct dma_chan *chan, struct dma_interleaved_template *xt,
>>> +             struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>                unsigned long flags)
>>>   {
>>> -     return chan->device->device_prep_interleaved_dma(chan, xt, flags);
>>> +     return chan->device->device_prep_interleaved_dma(chan, xts, flags);
>>>   }
>>>
>>>   static inline int dma_get_slave_caps(struct dma_chan *chan, struct dma_slave_caps *caps)
>>> --
>>> 1.7.9.5
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe dmaengine" in
>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>> --
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-17  9:35         ` Lars-Peter Clausen
  0 siblings, 0 replies; 65+ messages in thread
From: Lars-Peter Clausen @ 2014-02-17  9:35 UTC (permalink / raw)
  To: linux-arm-kernel

On 02/17/2014 10:29 AM, Srikanth Thokala wrote:
> On Mon, Feb 17, 2014 at 2:13 PM, Vinod Koul <vinod.koul@intel.com> wrote:
>> On Sat, Feb 15, 2014 at 05:30:17PM +0530, Srikanth Thokala wrote:
>>> The current implementation of interleaved DMA API support multiple
>>> frames only when the memory is contiguous by incrementing src_start/
>>> dst_start members of interleaved template.
>>>
>>> But, when the memory is non-contiguous it will restrict slave device
>>> to not submit multiple frames in a batch.  This patch handles this
>>> issue by allowing the slave device to send array of interleaved dma
>>> templates each having a different memory location.
>> This seems to be missing the numbers of templates you are sending, wouldnt this
>> require sending ARRAY_SiZE too?
>>
>> And why send double pointer?
>
> Array size is not required, when we pass the double pointer.  The last
> element would be
> pointed to NULL and we could get the number of templates from this condition.
> Here is an example snippet,
>
> In slave device driver,
>
>          struct dma_interleaved_template **xts;
>
>          xts = kcalloc(frm_cnt+1, sizeof(struct
> dma_interleaved_template *), GFP_KERNEL);
>          /* Error check for xts */
>          for (i = 0; i < frm_cnt; i++) {
>                  xts[i] = kmalloc(sizeof(struct
> dma_interleaved_template), GFP_KERNEL);
>                  /* Error check for xts[i] */
>          }
>          xts[i] = NULL;
>
> In DMA engine driver,  we could get the number of frames by,
>
>          for (; xts[frmno] != NULL; frmno++);
>
> I felt this way is simpler than adding an extra argument to the API.
> Please let me know
> your opinion and suggest me a better way.

I think Vinod's suggestion of passing in an array of interleaved_templates 
and the size of the array is better than what you are currently doing.

Btw. you also need to update the current implementations and users of the 
API accordingly.

>
>>
>> --
>> ~Vinod
>>
>>>
>>> Signed-off-by: Srikanth Thokala <sthokal@xilinx.com>
>>> ---
>>>   Documentation/dmaengine.txt              |    2 +-
>>>   drivers/dma/imx-dma.c                    |    3 ++-
>>>   drivers/dma/sirf-dma.c                   |    3 ++-
>>>   drivers/media/platform/m2m-deinterlace.c |    2 +-
>>>   include/linux/dmaengine.h                |    6 +++---
>>>   5 files changed, 9 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/Documentation/dmaengine.txt b/Documentation/dmaengine.txt
>>> index 879b6e3..c642614 100644
>>> --- a/Documentation/dmaengine.txt
>>> +++ b/Documentation/dmaengine.txt
>>> @@ -94,7 +94,7 @@ The slave DMA usage consists of following steps:
>>>                size_t period_len, enum dma_data_direction direction);
>>>
>>>        struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
>>> -             struct dma_chan *chan, struct dma_interleaved_template *xt,
>>> +             struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>                unsigned long flags);
>>>
>>>      The peripheral driver is expected to have mapped the scatterlist for
>>> diff --git a/drivers/dma/imx-dma.c b/drivers/dma/imx-dma.c
>>> index 6f9ac20..e2c52ce 100644
>>> --- a/drivers/dma/imx-dma.c
>>> +++ b/drivers/dma/imx-dma.c
>>> @@ -954,12 +954,13 @@ static struct dma_async_tx_descriptor *imxdma_prep_dma_memcpy(
>>>   }
>>>
>>>   static struct dma_async_tx_descriptor *imxdma_prep_dma_interleaved(
>>> -     struct dma_chan *chan, struct dma_interleaved_template *xt,
>>> +     struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>        unsigned long flags)
>>>   {
>>>        struct imxdma_channel *imxdmac = to_imxdma_chan(chan);
>>>        struct imxdma_engine *imxdma = imxdmac->imxdma;
>>>        struct imxdma_desc *desc;
>>> +     struct dma_interleaved_template *xt = *xts;
>>>
>>>        dev_dbg(imxdma->dev, "%s channel: %d src_start=0x%llx dst_start=0x%llx\n"
>>>                "   src_sgl=%s dst_sgl=%s numf=%zu frame_size=%zu\n", __func__,
>>> diff --git a/drivers/dma/sirf-dma.c b/drivers/dma/sirf-dma.c
>>> index d4d3a31..b6a150b 100644
>>> --- a/drivers/dma/sirf-dma.c
>>> +++ b/drivers/dma/sirf-dma.c
>>> @@ -509,12 +509,13 @@ sirfsoc_dma_tx_status(struct dma_chan *chan, dma_cookie_t cookie,
>>>   }
>>>
>>>   static struct dma_async_tx_descriptor *sirfsoc_dma_prep_interleaved(
>>> -     struct dma_chan *chan, struct dma_interleaved_template *xt,
>>> +     struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>        unsigned long flags)
>>>   {
>>>        struct sirfsoc_dma *sdma = dma_chan_to_sirfsoc_dma(chan);
>>>        struct sirfsoc_dma_chan *schan = dma_chan_to_sirfsoc_dma_chan(chan);
>>>        struct sirfsoc_dma_desc *sdesc = NULL;
>>> +     struct dma_interleaved_template *xt = *xts;
>>>        unsigned long iflags;
>>>        int ret;
>>>
>>> diff --git a/drivers/media/platform/m2m-deinterlace.c b/drivers/media/platform/m2m-deinterlace.c
>>> index 6bb86b5..468110a 100644
>>> --- a/drivers/media/platform/m2m-deinterlace.c
>>> +++ b/drivers/media/platform/m2m-deinterlace.c
>>> @@ -343,7 +343,7 @@ static void deinterlace_issue_dma(struct deinterlace_ctx *ctx, int op,
>>>        ctx->xt->dst_sgl = true;
>>>        flags = DMA_CTRL_ACK | DMA_PREP_INTERRUPT;
>>>
>>> -     tx = dmadev->device_prep_interleaved_dma(chan, ctx->xt, flags);
>>> +     tx = dmadev->device_prep_interleaved_dma(chan, &ctx->xt, flags);
>>>        if (tx == NULL) {
>>>                v4l2_warn(&pcdev->v4l2_dev, "DMA interleaved prep error\n");
>>>                return;
>>> diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
>>> index c5c92d5..2f77a9a 100644
>>> --- a/include/linux/dmaengine.h
>>> +++ b/include/linux/dmaengine.h
>>> @@ -675,7 +675,7 @@ struct dma_device {
>>>                size_t period_len, enum dma_transfer_direction direction,
>>>                unsigned long flags, void *context);
>>>        struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
>>> -             struct dma_chan *chan, struct dma_interleaved_template *xt,
>>> +             struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>                unsigned long flags);
>>>        int (*device_control)(struct dma_chan *chan, enum dma_ctrl_cmd cmd,
>>>                unsigned long arg);
>>> @@ -752,10 +752,10 @@ static inline struct dma_async_tx_descriptor *dmaengine_prep_dma_cyclic(
>>>   }
>>>
>>>   static inline struct dma_async_tx_descriptor *dmaengine_prep_interleaved_dma(
>>> -             struct dma_chan *chan, struct dma_interleaved_template *xt,
>>> +             struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>                unsigned long flags)
>>>   {
>>> -     return chan->device->device_prep_interleaved_dma(chan, xt, flags);
>>> +     return chan->device->device_prep_interleaved_dma(chan, xts, flags);
>>>   }
>>>
>>>   static inline int dma_get_slave_caps(struct dma_chan *chan, struct dma_slave_caps *caps)
>>> --
>>> 1.7.9.5
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe dmaengine" in
>>> the body of a message to majordomo at vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>> --
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo at vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-17  9:42           ` Srikanth Thokala
  0 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-17  9:42 UTC (permalink / raw)
  To: Lars-Peter Clausen
  Cc: Srikanth Thokala, Vinod Koul, dan.j.williams, michal.simek,
	Grant Likely, robh+dt, Levente Kurusa, andriy.shevchenko,
	dmaengine, linux-arm-kernel, linux-kernel, devicetree

On Mon, Feb 17, 2014 at 3:05 PM, Lars-Peter Clausen <lars@metafoo.de> wrote:
> On 02/17/2014 10:29 AM, Srikanth Thokala wrote:
>>
>> On Mon, Feb 17, 2014 at 2:13 PM, Vinod Koul <vinod.koul@intel.com> wrote:
>>>
>>> On Sat, Feb 15, 2014 at 05:30:17PM +0530, Srikanth Thokala wrote:
>>>>
>>>> The current implementation of interleaved DMA API support multiple
>>>> frames only when the memory is contiguous by incrementing src_start/
>>>> dst_start members of interleaved template.
>>>>
>>>> But, when the memory is non-contiguous it will restrict slave device
>>>> to not submit multiple frames in a batch.  This patch handles this
>>>> issue by allowing the slave device to send array of interleaved dma
>>>> templates each having a different memory location.
>>>
>>> This seems to be missing the numbers of templates you are sending,
>>> wouldnt this
>>> require sending ARRAY_SiZE too?
>>>
>>> And why send double pointer?
>>
>>
>> Array size is not required, when we pass the double pointer.  The last
>> element would be
>> pointed to NULL and we could get the number of templates from this
>> condition.
>> Here is an example snippet,
>>
>> In slave device driver,
>>
>>          struct dma_interleaved_template **xts;
>>
>>          xts = kcalloc(frm_cnt+1, sizeof(struct
>> dma_interleaved_template *), GFP_KERNEL);
>>          /* Error check for xts */
>>          for (i = 0; i < frm_cnt; i++) {
>>                  xts[i] = kmalloc(sizeof(struct
>> dma_interleaved_template), GFP_KERNEL);
>>                  /* Error check for xts[i] */
>>          }
>>          xts[i] = NULL;
>>
>> In DMA engine driver,  we could get the number of frames by,
>>
>>          for (; xts[frmno] != NULL; frmno++);
>>
>> I felt this way is simpler than adding an extra argument to the API.
>> Please let me know
>> your opinion and suggest me a better way.
>
>
> I think Vinod's suggestion of passing in an array of interleaved_templates
> and the size of the array is better than what you are currently doing.

Ok, Lars.  I will update with this in my v4. Thanks.

>
> Btw. you also need to update the current implementations and users of the
> API accordingly.

Yes, I have updated them in this patch.

Thanks
Srikanth


>
>
>>
>>>
>>> --
>>> ~Vinod
>>>
>>>>
>>>> Signed-off-by: Srikanth Thokala <sthokal@xilinx.com>
>>>> ---
>>>>   Documentation/dmaengine.txt              |    2 +-
>>>>   drivers/dma/imx-dma.c                    |    3 ++-
>>>>   drivers/dma/sirf-dma.c                   |    3 ++-
>>>>   drivers/media/platform/m2m-deinterlace.c |    2 +-
>>>>   include/linux/dmaengine.h                |    6 +++---
>>>>   5 files changed, 9 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/Documentation/dmaengine.txt b/Documentation/dmaengine.txt
>>>> index 879b6e3..c642614 100644
>>>> --- a/Documentation/dmaengine.txt
>>>> +++ b/Documentation/dmaengine.txt
>>>> @@ -94,7 +94,7 @@ The slave DMA usage consists of following steps:
>>>>                size_t period_len, enum dma_data_direction direction);
>>>>
>>>>        struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
>>>> -             struct dma_chan *chan, struct dma_interleaved_template
>>>> *xt,
>>>> +             struct dma_chan *chan, struct dma_interleaved_template
>>>> **xts,
>>>>                unsigned long flags);
>>>>
>>>>      The peripheral driver is expected to have mapped the scatterlist
>>>> for
>>>> diff --git a/drivers/dma/imx-dma.c b/drivers/dma/imx-dma.c
>>>> index 6f9ac20..e2c52ce 100644
>>>> --- a/drivers/dma/imx-dma.c
>>>> +++ b/drivers/dma/imx-dma.c
>>>> @@ -954,12 +954,13 @@ static struct dma_async_tx_descriptor
>>>> *imxdma_prep_dma_memcpy(
>>>>   }
>>>>
>>>>   static struct dma_async_tx_descriptor *imxdma_prep_dma_interleaved(
>>>> -     struct dma_chan *chan, struct dma_interleaved_template *xt,
>>>> +     struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>>        unsigned long flags)
>>>>   {
>>>>        struct imxdma_channel *imxdmac = to_imxdma_chan(chan);
>>>>        struct imxdma_engine *imxdma = imxdmac->imxdma;
>>>>        struct imxdma_desc *desc;
>>>> +     struct dma_interleaved_template *xt = *xts;
>>>>
>>>>        dev_dbg(imxdma->dev, "%s channel: %d src_start=0x%llx
>>>> dst_start=0x%llx\n"
>>>>                "   src_sgl=%s dst_sgl=%s numf=%zu frame_size=%zu\n",
>>>> __func__,
>>>> diff --git a/drivers/dma/sirf-dma.c b/drivers/dma/sirf-dma.c
>>>> index d4d3a31..b6a150b 100644
>>>> --- a/drivers/dma/sirf-dma.c
>>>> +++ b/drivers/dma/sirf-dma.c
>>>> @@ -509,12 +509,13 @@ sirfsoc_dma_tx_status(struct dma_chan *chan,
>>>> dma_cookie_t cookie,
>>>>   }
>>>>
>>>>   static struct dma_async_tx_descriptor *sirfsoc_dma_prep_interleaved(
>>>> -     struct dma_chan *chan, struct dma_interleaved_template *xt,
>>>> +     struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>>        unsigned long flags)
>>>>   {
>>>>        struct sirfsoc_dma *sdma = dma_chan_to_sirfsoc_dma(chan);
>>>>        struct sirfsoc_dma_chan *schan =
>>>> dma_chan_to_sirfsoc_dma_chan(chan);
>>>>        struct sirfsoc_dma_desc *sdesc = NULL;
>>>> +     struct dma_interleaved_template *xt = *xts;
>>>>        unsigned long iflags;
>>>>        int ret;
>>>>
>>>> diff --git a/drivers/media/platform/m2m-deinterlace.c
>>>> b/drivers/media/platform/m2m-deinterlace.c
>>>> index 6bb86b5..468110a 100644
>>>> --- a/drivers/media/platform/m2m-deinterlace.c
>>>> +++ b/drivers/media/platform/m2m-deinterlace.c
>>>> @@ -343,7 +343,7 @@ static void deinterlace_issue_dma(struct
>>>> deinterlace_ctx *ctx, int op,
>>>>        ctx->xt->dst_sgl = true;
>>>>        flags = DMA_CTRL_ACK | DMA_PREP_INTERRUPT;
>>>>
>>>> -     tx = dmadev->device_prep_interleaved_dma(chan, ctx->xt, flags);
>>>> +     tx = dmadev->device_prep_interleaved_dma(chan, &ctx->xt, flags);
>>>>        if (tx == NULL) {
>>>>                v4l2_warn(&pcdev->v4l2_dev, "DMA interleaved prep
>>>> error\n");
>>>>                return;
>>>> diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
>>>> index c5c92d5..2f77a9a 100644
>>>> --- a/include/linux/dmaengine.h
>>>> +++ b/include/linux/dmaengine.h
>>>> @@ -675,7 +675,7 @@ struct dma_device {
>>>>                size_t period_len, enum dma_transfer_direction direction,
>>>>                unsigned long flags, void *context);
>>>>        struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
>>>> -             struct dma_chan *chan, struct dma_interleaved_template
>>>> *xt,
>>>> +             struct dma_chan *chan, struct dma_interleaved_template
>>>> **xts,
>>>>                unsigned long flags);
>>>>        int (*device_control)(struct dma_chan *chan, enum dma_ctrl_cmd
>>>> cmd,
>>>>                unsigned long arg);
>>>> @@ -752,10 +752,10 @@ static inline struct dma_async_tx_descriptor
>>>> *dmaengine_prep_dma_cyclic(
>>>>   }
>>>>
>>>>   static inline struct dma_async_tx_descriptor
>>>> *dmaengine_prep_interleaved_dma(
>>>> -             struct dma_chan *chan, struct dma_interleaved_template
>>>> *xt,
>>>> +             struct dma_chan *chan, struct dma_interleaved_template
>>>> **xts,
>>>>                unsigned long flags)
>>>>   {
>>>> -     return chan->device->device_prep_interleaved_dma(chan, xt, flags);
>>>> +     return chan->device->device_prep_interleaved_dma(chan, xts,
>>>> flags);
>>>>   }
>>>>
>>>>   static inline int dma_get_slave_caps(struct dma_chan *chan, struct
>>>> dma_slave_caps *caps)
>>>> --
>>>> 1.7.9.5
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe dmaengine" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>> --
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel"
>>> in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> Please read the FAQ at  http://www.tux.org/lkml/
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-17  9:42           ` Srikanth Thokala
  0 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-17  9:42 UTC (permalink / raw)
  To: Lars-Peter Clausen
  Cc: Srikanth Thokala, Vinod Koul,
	dan.j.williams-ral2JQCrhuEAvxtiuMwx3w,
	michal.simek-gjFFaj9aHVfQT0dZR+AlfA, Grant Likely,
	robh+dt-DgEjT+Ai2ygdnm+yROfE0A, Levente Kurusa,
	andriy.shevchenko-b5Z7lJ3WibVrdx17CPfAsdBPR1lH4CV8,
	dmaengine-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA

On Mon, Feb 17, 2014 at 3:05 PM, Lars-Peter Clausen <lars-Qo5EllUWu/uELgA04lAiVw@public.gmane.org> wrote:
> On 02/17/2014 10:29 AM, Srikanth Thokala wrote:
>>
>> On Mon, Feb 17, 2014 at 2:13 PM, Vinod Koul <vinod.koul-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>>>
>>> On Sat, Feb 15, 2014 at 05:30:17PM +0530, Srikanth Thokala wrote:
>>>>
>>>> The current implementation of interleaved DMA API support multiple
>>>> frames only when the memory is contiguous by incrementing src_start/
>>>> dst_start members of interleaved template.
>>>>
>>>> But, when the memory is non-contiguous it will restrict slave device
>>>> to not submit multiple frames in a batch.  This patch handles this
>>>> issue by allowing the slave device to send array of interleaved dma
>>>> templates each having a different memory location.
>>>
>>> This seems to be missing the numbers of templates you are sending,
>>> wouldnt this
>>> require sending ARRAY_SiZE too?
>>>
>>> And why send double pointer?
>>
>>
>> Array size is not required, when we pass the double pointer.  The last
>> element would be
>> pointed to NULL and we could get the number of templates from this
>> condition.
>> Here is an example snippet,
>>
>> In slave device driver,
>>
>>          struct dma_interleaved_template **xts;
>>
>>          xts = kcalloc(frm_cnt+1, sizeof(struct
>> dma_interleaved_template *), GFP_KERNEL);
>>          /* Error check for xts */
>>          for (i = 0; i < frm_cnt; i++) {
>>                  xts[i] = kmalloc(sizeof(struct
>> dma_interleaved_template), GFP_KERNEL);
>>                  /* Error check for xts[i] */
>>          }
>>          xts[i] = NULL;
>>
>> In DMA engine driver,  we could get the number of frames by,
>>
>>          for (; xts[frmno] != NULL; frmno++);
>>
>> I felt this way is simpler than adding an extra argument to the API.
>> Please let me know
>> your opinion and suggest me a better way.
>
>
> I think Vinod's suggestion of passing in an array of interleaved_templates
> and the size of the array is better than what you are currently doing.

Ok, Lars.  I will update with this in my v4. Thanks.

>
> Btw. you also need to update the current implementations and users of the
> API accordingly.

Yes, I have updated them in this patch.

Thanks
Srikanth


>
>
>>
>>>
>>> --
>>> ~Vinod
>>>
>>>>
>>>> Signed-off-by: Srikanth Thokala <sthokal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org>
>>>> ---
>>>>   Documentation/dmaengine.txt              |    2 +-
>>>>   drivers/dma/imx-dma.c                    |    3 ++-
>>>>   drivers/dma/sirf-dma.c                   |    3 ++-
>>>>   drivers/media/platform/m2m-deinterlace.c |    2 +-
>>>>   include/linux/dmaengine.h                |    6 +++---
>>>>   5 files changed, 9 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/Documentation/dmaengine.txt b/Documentation/dmaengine.txt
>>>> index 879b6e3..c642614 100644
>>>> --- a/Documentation/dmaengine.txt
>>>> +++ b/Documentation/dmaengine.txt
>>>> @@ -94,7 +94,7 @@ The slave DMA usage consists of following steps:
>>>>                size_t period_len, enum dma_data_direction direction);
>>>>
>>>>        struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
>>>> -             struct dma_chan *chan, struct dma_interleaved_template
>>>> *xt,
>>>> +             struct dma_chan *chan, struct dma_interleaved_template
>>>> **xts,
>>>>                unsigned long flags);
>>>>
>>>>      The peripheral driver is expected to have mapped the scatterlist
>>>> for
>>>> diff --git a/drivers/dma/imx-dma.c b/drivers/dma/imx-dma.c
>>>> index 6f9ac20..e2c52ce 100644
>>>> --- a/drivers/dma/imx-dma.c
>>>> +++ b/drivers/dma/imx-dma.c
>>>> @@ -954,12 +954,13 @@ static struct dma_async_tx_descriptor
>>>> *imxdma_prep_dma_memcpy(
>>>>   }
>>>>
>>>>   static struct dma_async_tx_descriptor *imxdma_prep_dma_interleaved(
>>>> -     struct dma_chan *chan, struct dma_interleaved_template *xt,
>>>> +     struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>>        unsigned long flags)
>>>>   {
>>>>        struct imxdma_channel *imxdmac = to_imxdma_chan(chan);
>>>>        struct imxdma_engine *imxdma = imxdmac->imxdma;
>>>>        struct imxdma_desc *desc;
>>>> +     struct dma_interleaved_template *xt = *xts;
>>>>
>>>>        dev_dbg(imxdma->dev, "%s channel: %d src_start=0x%llx
>>>> dst_start=0x%llx\n"
>>>>                "   src_sgl=%s dst_sgl=%s numf=%zu frame_size=%zu\n",
>>>> __func__,
>>>> diff --git a/drivers/dma/sirf-dma.c b/drivers/dma/sirf-dma.c
>>>> index d4d3a31..b6a150b 100644
>>>> --- a/drivers/dma/sirf-dma.c
>>>> +++ b/drivers/dma/sirf-dma.c
>>>> @@ -509,12 +509,13 @@ sirfsoc_dma_tx_status(struct dma_chan *chan,
>>>> dma_cookie_t cookie,
>>>>   }
>>>>
>>>>   static struct dma_async_tx_descriptor *sirfsoc_dma_prep_interleaved(
>>>> -     struct dma_chan *chan, struct dma_interleaved_template *xt,
>>>> +     struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>>        unsigned long flags)
>>>>   {
>>>>        struct sirfsoc_dma *sdma = dma_chan_to_sirfsoc_dma(chan);
>>>>        struct sirfsoc_dma_chan *schan =
>>>> dma_chan_to_sirfsoc_dma_chan(chan);
>>>>        struct sirfsoc_dma_desc *sdesc = NULL;
>>>> +     struct dma_interleaved_template *xt = *xts;
>>>>        unsigned long iflags;
>>>>        int ret;
>>>>
>>>> diff --git a/drivers/media/platform/m2m-deinterlace.c
>>>> b/drivers/media/platform/m2m-deinterlace.c
>>>> index 6bb86b5..468110a 100644
>>>> --- a/drivers/media/platform/m2m-deinterlace.c
>>>> +++ b/drivers/media/platform/m2m-deinterlace.c
>>>> @@ -343,7 +343,7 @@ static void deinterlace_issue_dma(struct
>>>> deinterlace_ctx *ctx, int op,
>>>>        ctx->xt->dst_sgl = true;
>>>>        flags = DMA_CTRL_ACK | DMA_PREP_INTERRUPT;
>>>>
>>>> -     tx = dmadev->device_prep_interleaved_dma(chan, ctx->xt, flags);
>>>> +     tx = dmadev->device_prep_interleaved_dma(chan, &ctx->xt, flags);
>>>>        if (tx == NULL) {
>>>>                v4l2_warn(&pcdev->v4l2_dev, "DMA interleaved prep
>>>> error\n");
>>>>                return;
>>>> diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
>>>> index c5c92d5..2f77a9a 100644
>>>> --- a/include/linux/dmaengine.h
>>>> +++ b/include/linux/dmaengine.h
>>>> @@ -675,7 +675,7 @@ struct dma_device {
>>>>                size_t period_len, enum dma_transfer_direction direction,
>>>>                unsigned long flags, void *context);
>>>>        struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
>>>> -             struct dma_chan *chan, struct dma_interleaved_template
>>>> *xt,
>>>> +             struct dma_chan *chan, struct dma_interleaved_template
>>>> **xts,
>>>>                unsigned long flags);
>>>>        int (*device_control)(struct dma_chan *chan, enum dma_ctrl_cmd
>>>> cmd,
>>>>                unsigned long arg);
>>>> @@ -752,10 +752,10 @@ static inline struct dma_async_tx_descriptor
>>>> *dmaengine_prep_dma_cyclic(
>>>>   }
>>>>
>>>>   static inline struct dma_async_tx_descriptor
>>>> *dmaengine_prep_interleaved_dma(
>>>> -             struct dma_chan *chan, struct dma_interleaved_template
>>>> *xt,
>>>> +             struct dma_chan *chan, struct dma_interleaved_template
>>>> **xts,
>>>>                unsigned long flags)
>>>>   {
>>>> -     return chan->device->device_prep_interleaved_dma(chan, xt, flags);
>>>> +     return chan->device->device_prep_interleaved_dma(chan, xts,
>>>> flags);
>>>>   }
>>>>
>>>>   static inline int dma_get_slave_caps(struct dma_chan *chan, struct
>>>> dma_slave_caps *caps)
>>>> --
>>>> 1.7.9.5
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe dmaengine" in
>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>> --
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel"
>>> in
>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> Please read the FAQ at  http://www.tux.org/lkml/
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-17  9:42           ` Srikanth Thokala
  0 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-17  9:42 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Feb 17, 2014 at 3:05 PM, Lars-Peter Clausen <lars@metafoo.de> wrote:
> On 02/17/2014 10:29 AM, Srikanth Thokala wrote:
>>
>> On Mon, Feb 17, 2014 at 2:13 PM, Vinod Koul <vinod.koul@intel.com> wrote:
>>>
>>> On Sat, Feb 15, 2014 at 05:30:17PM +0530, Srikanth Thokala wrote:
>>>>
>>>> The current implementation of interleaved DMA API support multiple
>>>> frames only when the memory is contiguous by incrementing src_start/
>>>> dst_start members of interleaved template.
>>>>
>>>> But, when the memory is non-contiguous it will restrict slave device
>>>> to not submit multiple frames in a batch.  This patch handles this
>>>> issue by allowing the slave device to send array of interleaved dma
>>>> templates each having a different memory location.
>>>
>>> This seems to be missing the numbers of templates you are sending,
>>> wouldnt this
>>> require sending ARRAY_SiZE too?
>>>
>>> And why send double pointer?
>>
>>
>> Array size is not required, when we pass the double pointer.  The last
>> element would be
>> pointed to NULL and we could get the number of templates from this
>> condition.
>> Here is an example snippet,
>>
>> In slave device driver,
>>
>>          struct dma_interleaved_template **xts;
>>
>>          xts = kcalloc(frm_cnt+1, sizeof(struct
>> dma_interleaved_template *), GFP_KERNEL);
>>          /* Error check for xts */
>>          for (i = 0; i < frm_cnt; i++) {
>>                  xts[i] = kmalloc(sizeof(struct
>> dma_interleaved_template), GFP_KERNEL);
>>                  /* Error check for xts[i] */
>>          }
>>          xts[i] = NULL;
>>
>> In DMA engine driver,  we could get the number of frames by,
>>
>>          for (; xts[frmno] != NULL; frmno++);
>>
>> I felt this way is simpler than adding an extra argument to the API.
>> Please let me know
>> your opinion and suggest me a better way.
>
>
> I think Vinod's suggestion of passing in an array of interleaved_templates
> and the size of the array is better than what you are currently doing.

Ok, Lars.  I will update with this in my v4. Thanks.

>
> Btw. you also need to update the current implementations and users of the
> API accordingly.

Yes, I have updated them in this patch.

Thanks
Srikanth


>
>
>>
>>>
>>> --
>>> ~Vinod
>>>
>>>>
>>>> Signed-off-by: Srikanth Thokala <sthokal@xilinx.com>
>>>> ---
>>>>   Documentation/dmaengine.txt              |    2 +-
>>>>   drivers/dma/imx-dma.c                    |    3 ++-
>>>>   drivers/dma/sirf-dma.c                   |    3 ++-
>>>>   drivers/media/platform/m2m-deinterlace.c |    2 +-
>>>>   include/linux/dmaengine.h                |    6 +++---
>>>>   5 files changed, 9 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/Documentation/dmaengine.txt b/Documentation/dmaengine.txt
>>>> index 879b6e3..c642614 100644
>>>> --- a/Documentation/dmaengine.txt
>>>> +++ b/Documentation/dmaengine.txt
>>>> @@ -94,7 +94,7 @@ The slave DMA usage consists of following steps:
>>>>                size_t period_len, enum dma_data_direction direction);
>>>>
>>>>        struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
>>>> -             struct dma_chan *chan, struct dma_interleaved_template
>>>> *xt,
>>>> +             struct dma_chan *chan, struct dma_interleaved_template
>>>> **xts,
>>>>                unsigned long flags);
>>>>
>>>>      The peripheral driver is expected to have mapped the scatterlist
>>>> for
>>>> diff --git a/drivers/dma/imx-dma.c b/drivers/dma/imx-dma.c
>>>> index 6f9ac20..e2c52ce 100644
>>>> --- a/drivers/dma/imx-dma.c
>>>> +++ b/drivers/dma/imx-dma.c
>>>> @@ -954,12 +954,13 @@ static struct dma_async_tx_descriptor
>>>> *imxdma_prep_dma_memcpy(
>>>>   }
>>>>
>>>>   static struct dma_async_tx_descriptor *imxdma_prep_dma_interleaved(
>>>> -     struct dma_chan *chan, struct dma_interleaved_template *xt,
>>>> +     struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>>        unsigned long flags)
>>>>   {
>>>>        struct imxdma_channel *imxdmac = to_imxdma_chan(chan);
>>>>        struct imxdma_engine *imxdma = imxdmac->imxdma;
>>>>        struct imxdma_desc *desc;
>>>> +     struct dma_interleaved_template *xt = *xts;
>>>>
>>>>        dev_dbg(imxdma->dev, "%s channel: %d src_start=0x%llx
>>>> dst_start=0x%llx\n"
>>>>                "   src_sgl=%s dst_sgl=%s numf=%zu frame_size=%zu\n",
>>>> __func__,
>>>> diff --git a/drivers/dma/sirf-dma.c b/drivers/dma/sirf-dma.c
>>>> index d4d3a31..b6a150b 100644
>>>> --- a/drivers/dma/sirf-dma.c
>>>> +++ b/drivers/dma/sirf-dma.c
>>>> @@ -509,12 +509,13 @@ sirfsoc_dma_tx_status(struct dma_chan *chan,
>>>> dma_cookie_t cookie,
>>>>   }
>>>>
>>>>   static struct dma_async_tx_descriptor *sirfsoc_dma_prep_interleaved(
>>>> -     struct dma_chan *chan, struct dma_interleaved_template *xt,
>>>> +     struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>>        unsigned long flags)
>>>>   {
>>>>        struct sirfsoc_dma *sdma = dma_chan_to_sirfsoc_dma(chan);
>>>>        struct sirfsoc_dma_chan *schan =
>>>> dma_chan_to_sirfsoc_dma_chan(chan);
>>>>        struct sirfsoc_dma_desc *sdesc = NULL;
>>>> +     struct dma_interleaved_template *xt = *xts;
>>>>        unsigned long iflags;
>>>>        int ret;
>>>>
>>>> diff --git a/drivers/media/platform/m2m-deinterlace.c
>>>> b/drivers/media/platform/m2m-deinterlace.c
>>>> index 6bb86b5..468110a 100644
>>>> --- a/drivers/media/platform/m2m-deinterlace.c
>>>> +++ b/drivers/media/platform/m2m-deinterlace.c
>>>> @@ -343,7 +343,7 @@ static void deinterlace_issue_dma(struct
>>>> deinterlace_ctx *ctx, int op,
>>>>        ctx->xt->dst_sgl = true;
>>>>        flags = DMA_CTRL_ACK | DMA_PREP_INTERRUPT;
>>>>
>>>> -     tx = dmadev->device_prep_interleaved_dma(chan, ctx->xt, flags);
>>>> +     tx = dmadev->device_prep_interleaved_dma(chan, &ctx->xt, flags);
>>>>        if (tx == NULL) {
>>>>                v4l2_warn(&pcdev->v4l2_dev, "DMA interleaved prep
>>>> error\n");
>>>>                return;
>>>> diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
>>>> index c5c92d5..2f77a9a 100644
>>>> --- a/include/linux/dmaengine.h
>>>> +++ b/include/linux/dmaengine.h
>>>> @@ -675,7 +675,7 @@ struct dma_device {
>>>>                size_t period_len, enum dma_transfer_direction direction,
>>>>                unsigned long flags, void *context);
>>>>        struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
>>>> -             struct dma_chan *chan, struct dma_interleaved_template
>>>> *xt,
>>>> +             struct dma_chan *chan, struct dma_interleaved_template
>>>> **xts,
>>>>                unsigned long flags);
>>>>        int (*device_control)(struct dma_chan *chan, enum dma_ctrl_cmd
>>>> cmd,
>>>>                unsigned long arg);
>>>> @@ -752,10 +752,10 @@ static inline struct dma_async_tx_descriptor
>>>> *dmaengine_prep_dma_cyclic(
>>>>   }
>>>>
>>>>   static inline struct dma_async_tx_descriptor
>>>> *dmaengine_prep_interleaved_dma(
>>>> -             struct dma_chan *chan, struct dma_interleaved_template
>>>> *xt,
>>>> +             struct dma_chan *chan, struct dma_interleaved_template
>>>> **xts,
>>>>                unsigned long flags)
>>>>   {
>>>> -     return chan->device->device_prep_interleaved_dma(chan, xt, flags);
>>>> +     return chan->device->device_prep_interleaved_dma(chan, xts,
>>>> flags);
>>>>   }
>>>>
>>>>   static inline int dma_get_slave_caps(struct dma_chan *chan, struct
>>>> dma_slave_caps *caps)
>>>> --
>>>> 1.7.9.5
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe dmaengine" in
>>>> the body of a message to majordomo at vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>> --
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel"
>>> in
>>> the body of a message to majordomo at vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> Please read the FAQ at  http://www.tux.org/lkml/
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
  2014-02-17  9:42           ` Srikanth Thokala
@ 2014-02-17  9:44             ` Lars-Peter Clausen
  -1 siblings, 0 replies; 65+ messages in thread
From: Lars-Peter Clausen @ 2014-02-17  9:44 UTC (permalink / raw)
  To: Srikanth Thokala
  Cc: Vinod Koul, dan.j.williams, michal.simek, Grant Likely, robh+dt,
	Levente Kurusa, andriy.shevchenko, dmaengine, linux-arm-kernel,
	linux-kernel, devicetree

On 02/17/2014 10:42 AM, Srikanth Thokala wrote:
> On Mon, Feb 17, 2014 at 3:05 PM, Lars-Peter Clausen <lars@metafoo.de> wrote:
>> On 02/17/2014 10:29 AM, Srikanth Thokala wrote:
>>>
>>> On Mon, Feb 17, 2014 at 2:13 PM, Vinod Koul <vinod.koul@intel.com> wrote:
>>>>
>>>> On Sat, Feb 15, 2014 at 05:30:17PM +0530, Srikanth Thokala wrote:
>>>>>
>>>>> The current implementation of interleaved DMA API support multiple
>>>>> frames only when the memory is contiguous by incrementing src_start/
>>>>> dst_start members of interleaved template.
>>>>>
>>>>> But, when the memory is non-contiguous it will restrict slave device
>>>>> to not submit multiple frames in a batch.  This patch handles this
>>>>> issue by allowing the slave device to send array of interleaved dma
>>>>> templates each having a different memory location.
>>>>
>>>> This seems to be missing the numbers of templates you are sending,
>>>> wouldnt this
>>>> require sending ARRAY_SiZE too?
>>>>
>>>> And why send double pointer?
>>>
>>>
>>> Array size is not required, when we pass the double pointer.  The last
>>> element would be
>>> pointed to NULL and we could get the number of templates from this
>>> condition.
>>> Here is an example snippet,
>>>
>>> In slave device driver,
>>>
>>>           struct dma_interleaved_template **xts;
>>>
>>>           xts = kcalloc(frm_cnt+1, sizeof(struct
>>> dma_interleaved_template *), GFP_KERNEL);
>>>           /* Error check for xts */
>>>           for (i = 0; i < frm_cnt; i++) {
>>>                   xts[i] = kmalloc(sizeof(struct
>>> dma_interleaved_template), GFP_KERNEL);
>>>                   /* Error check for xts[i] */
>>>           }
>>>           xts[i] = NULL;
>>>
>>> In DMA engine driver,  we could get the number of frames by,
>>>
>>>           for (; xts[frmno] != NULL; frmno++);
>>>
>>> I felt this way is simpler than adding an extra argument to the API.
>>> Please let me know
>>> your opinion and suggest me a better way.
>>
>>
>> I think Vinod's suggestion of passing in an array of interleaved_templates
>> and the size of the array is better than what you are currently doing.
>
> Ok, Lars.  I will update with this in my v4. Thanks.
>
>>
>> Btw. you also need to update the current implementations and users of the
>> API accordingly.
>
> Yes, I have updated them in this patch.

But you didn't update them accordingly to your proposed semantics. The 
caller didn't NULL terminate the array and the drivers did blindly assume 
there will always be exactly one transfer.

>
> Thanks
> Srikanth
>
>
>>
>>
>>>
>>>>
>>>> --
>>>> ~Vinod
>>>>
>>>>>
>>>>> Signed-off-by: Srikanth Thokala <sthokal@xilinx.com>
>>>>> ---
>>>>>    Documentation/dmaengine.txt              |    2 +-
>>>>>    drivers/dma/imx-dma.c                    |    3 ++-
>>>>>    drivers/dma/sirf-dma.c                   |    3 ++-
>>>>>    drivers/media/platform/m2m-deinterlace.c |    2 +-
>>>>>    include/linux/dmaengine.h                |    6 +++---
>>>>>    5 files changed, 9 insertions(+), 7 deletions(-)
>>>>>
>>>>> diff --git a/Documentation/dmaengine.txt b/Documentation/dmaengine.txt
>>>>> index 879b6e3..c642614 100644
>>>>> --- a/Documentation/dmaengine.txt
>>>>> +++ b/Documentation/dmaengine.txt
>>>>> @@ -94,7 +94,7 @@ The slave DMA usage consists of following steps:
>>>>>                 size_t period_len, enum dma_data_direction direction);
>>>>>
>>>>>         struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
>>>>> -             struct dma_chan *chan, struct dma_interleaved_template
>>>>> *xt,
>>>>> +             struct dma_chan *chan, struct dma_interleaved_template
>>>>> **xts,
>>>>>                 unsigned long flags);
>>>>>
>>>>>       The peripheral driver is expected to have mapped the scatterlist
>>>>> for
>>>>> diff --git a/drivers/dma/imx-dma.c b/drivers/dma/imx-dma.c
>>>>> index 6f9ac20..e2c52ce 100644
>>>>> --- a/drivers/dma/imx-dma.c
>>>>> +++ b/drivers/dma/imx-dma.c
>>>>> @@ -954,12 +954,13 @@ static struct dma_async_tx_descriptor
>>>>> *imxdma_prep_dma_memcpy(
>>>>>    }
>>>>>
>>>>>    static struct dma_async_tx_descriptor *imxdma_prep_dma_interleaved(
>>>>> -     struct dma_chan *chan, struct dma_interleaved_template *xt,
>>>>> +     struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>>>         unsigned long flags)
>>>>>    {
>>>>>         struct imxdma_channel *imxdmac = to_imxdma_chan(chan);
>>>>>         struct imxdma_engine *imxdma = imxdmac->imxdma;
>>>>>         struct imxdma_desc *desc;
>>>>> +     struct dma_interleaved_template *xt = *xts;
>>>>>
>>>>>         dev_dbg(imxdma->dev, "%s channel: %d src_start=0x%llx
>>>>> dst_start=0x%llx\n"
>>>>>                 "   src_sgl=%s dst_sgl=%s numf=%zu frame_size=%zu\n",
>>>>> __func__,
>>>>> diff --git a/drivers/dma/sirf-dma.c b/drivers/dma/sirf-dma.c
>>>>> index d4d3a31..b6a150b 100644
>>>>> --- a/drivers/dma/sirf-dma.c
>>>>> +++ b/drivers/dma/sirf-dma.c
>>>>> @@ -509,12 +509,13 @@ sirfsoc_dma_tx_status(struct dma_chan *chan,
>>>>> dma_cookie_t cookie,
>>>>>    }
>>>>>
>>>>>    static struct dma_async_tx_descriptor *sirfsoc_dma_prep_interleaved(
>>>>> -     struct dma_chan *chan, struct dma_interleaved_template *xt,
>>>>> +     struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>>>         unsigned long flags)
>>>>>    {
>>>>>         struct sirfsoc_dma *sdma = dma_chan_to_sirfsoc_dma(chan);
>>>>>         struct sirfsoc_dma_chan *schan =
>>>>> dma_chan_to_sirfsoc_dma_chan(chan);
>>>>>         struct sirfsoc_dma_desc *sdesc = NULL;
>>>>> +     struct dma_interleaved_template *xt = *xts;
>>>>>         unsigned long iflags;
>>>>>         int ret;
>>>>>
>>>>> diff --git a/drivers/media/platform/m2m-deinterlace.c
>>>>> b/drivers/media/platform/m2m-deinterlace.c
>>>>> index 6bb86b5..468110a 100644
>>>>> --- a/drivers/media/platform/m2m-deinterlace.c
>>>>> +++ b/drivers/media/platform/m2m-deinterlace.c
>>>>> @@ -343,7 +343,7 @@ static void deinterlace_issue_dma(struct
>>>>> deinterlace_ctx *ctx, int op,
>>>>>         ctx->xt->dst_sgl = true;
>>>>>         flags = DMA_CTRL_ACK | DMA_PREP_INTERRUPT;
>>>>>
>>>>> -     tx = dmadev->device_prep_interleaved_dma(chan, ctx->xt, flags);
>>>>> +     tx = dmadev->device_prep_interleaved_dma(chan, &ctx->xt, flags);
>>>>>         if (tx == NULL) {
>>>>>                 v4l2_warn(&pcdev->v4l2_dev, "DMA interleaved prep
>>>>> error\n");
>>>>>                 return;
>>>>> diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
>>>>> index c5c92d5..2f77a9a 100644
>>>>> --- a/include/linux/dmaengine.h
>>>>> +++ b/include/linux/dmaengine.h
>>>>> @@ -675,7 +675,7 @@ struct dma_device {
>>>>>                 size_t period_len, enum dma_transfer_direction direction,
>>>>>                 unsigned long flags, void *context);
>>>>>         struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
>>>>> -             struct dma_chan *chan, struct dma_interleaved_template
>>>>> *xt,
>>>>> +             struct dma_chan *chan, struct dma_interleaved_template
>>>>> **xts,
>>>>>                 unsigned long flags);
>>>>>         int (*device_control)(struct dma_chan *chan, enum dma_ctrl_cmd
>>>>> cmd,
>>>>>                 unsigned long arg);
>>>>> @@ -752,10 +752,10 @@ static inline struct dma_async_tx_descriptor
>>>>> *dmaengine_prep_dma_cyclic(
>>>>>    }
>>>>>
>>>>>    static inline struct dma_async_tx_descriptor
>>>>> *dmaengine_prep_interleaved_dma(
>>>>> -             struct dma_chan *chan, struct dma_interleaved_template
>>>>> *xt,
>>>>> +             struct dma_chan *chan, struct dma_interleaved_template
>>>>> **xts,
>>>>>                 unsigned long flags)
>>>>>    {
>>>>> -     return chan->device->device_prep_interleaved_dma(chan, xt, flags);
>>>>> +     return chan->device->device_prep_interleaved_dma(chan, xts,
>>>>> flags);
>>>>>    }
>>>>>
>>>>>    static inline int dma_get_slave_caps(struct dma_chan *chan, struct
>>>>> dma_slave_caps *caps)
>>>>> --
>>>>> 1.7.9.5
>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe dmaengine" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>>> --
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel"
>>>> in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> Please read the FAQ at  http://www.tux.org/lkml/
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-17  9:44             ` Lars-Peter Clausen
  0 siblings, 0 replies; 65+ messages in thread
From: Lars-Peter Clausen @ 2014-02-17  9:44 UTC (permalink / raw)
  To: linux-arm-kernel

On 02/17/2014 10:42 AM, Srikanth Thokala wrote:
> On Mon, Feb 17, 2014 at 3:05 PM, Lars-Peter Clausen <lars@metafoo.de> wrote:
>> On 02/17/2014 10:29 AM, Srikanth Thokala wrote:
>>>
>>> On Mon, Feb 17, 2014 at 2:13 PM, Vinod Koul <vinod.koul@intel.com> wrote:
>>>>
>>>> On Sat, Feb 15, 2014 at 05:30:17PM +0530, Srikanth Thokala wrote:
>>>>>
>>>>> The current implementation of interleaved DMA API support multiple
>>>>> frames only when the memory is contiguous by incrementing src_start/
>>>>> dst_start members of interleaved template.
>>>>>
>>>>> But, when the memory is non-contiguous it will restrict slave device
>>>>> to not submit multiple frames in a batch.  This patch handles this
>>>>> issue by allowing the slave device to send array of interleaved dma
>>>>> templates each having a different memory location.
>>>>
>>>> This seems to be missing the numbers of templates you are sending,
>>>> wouldnt this
>>>> require sending ARRAY_SiZE too?
>>>>
>>>> And why send double pointer?
>>>
>>>
>>> Array size is not required, when we pass the double pointer.  The last
>>> element would be
>>> pointed to NULL and we could get the number of templates from this
>>> condition.
>>> Here is an example snippet,
>>>
>>> In slave device driver,
>>>
>>>           struct dma_interleaved_template **xts;
>>>
>>>           xts = kcalloc(frm_cnt+1, sizeof(struct
>>> dma_interleaved_template *), GFP_KERNEL);
>>>           /* Error check for xts */
>>>           for (i = 0; i < frm_cnt; i++) {
>>>                   xts[i] = kmalloc(sizeof(struct
>>> dma_interleaved_template), GFP_KERNEL);
>>>                   /* Error check for xts[i] */
>>>           }
>>>           xts[i] = NULL;
>>>
>>> In DMA engine driver,  we could get the number of frames by,
>>>
>>>           for (; xts[frmno] != NULL; frmno++);
>>>
>>> I felt this way is simpler than adding an extra argument to the API.
>>> Please let me know
>>> your opinion and suggest me a better way.
>>
>>
>> I think Vinod's suggestion of passing in an array of interleaved_templates
>> and the size of the array is better than what you are currently doing.
>
> Ok, Lars.  I will update with this in my v4. Thanks.
>
>>
>> Btw. you also need to update the current implementations and users of the
>> API accordingly.
>
> Yes, I have updated them in this patch.

But you didn't update them accordingly to your proposed semantics. The 
caller didn't NULL terminate the array and the drivers did blindly assume 
there will always be exactly one transfer.

>
> Thanks
> Srikanth
>
>
>>
>>
>>>
>>>>
>>>> --
>>>> ~Vinod
>>>>
>>>>>
>>>>> Signed-off-by: Srikanth Thokala <sthokal@xilinx.com>
>>>>> ---
>>>>>    Documentation/dmaengine.txt              |    2 +-
>>>>>    drivers/dma/imx-dma.c                    |    3 ++-
>>>>>    drivers/dma/sirf-dma.c                   |    3 ++-
>>>>>    drivers/media/platform/m2m-deinterlace.c |    2 +-
>>>>>    include/linux/dmaengine.h                |    6 +++---
>>>>>    5 files changed, 9 insertions(+), 7 deletions(-)
>>>>>
>>>>> diff --git a/Documentation/dmaengine.txt b/Documentation/dmaengine.txt
>>>>> index 879b6e3..c642614 100644
>>>>> --- a/Documentation/dmaengine.txt
>>>>> +++ b/Documentation/dmaengine.txt
>>>>> @@ -94,7 +94,7 @@ The slave DMA usage consists of following steps:
>>>>>                 size_t period_len, enum dma_data_direction direction);
>>>>>
>>>>>         struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
>>>>> -             struct dma_chan *chan, struct dma_interleaved_template
>>>>> *xt,
>>>>> +             struct dma_chan *chan, struct dma_interleaved_template
>>>>> **xts,
>>>>>                 unsigned long flags);
>>>>>
>>>>>       The peripheral driver is expected to have mapped the scatterlist
>>>>> for
>>>>> diff --git a/drivers/dma/imx-dma.c b/drivers/dma/imx-dma.c
>>>>> index 6f9ac20..e2c52ce 100644
>>>>> --- a/drivers/dma/imx-dma.c
>>>>> +++ b/drivers/dma/imx-dma.c
>>>>> @@ -954,12 +954,13 @@ static struct dma_async_tx_descriptor
>>>>> *imxdma_prep_dma_memcpy(
>>>>>    }
>>>>>
>>>>>    static struct dma_async_tx_descriptor *imxdma_prep_dma_interleaved(
>>>>> -     struct dma_chan *chan, struct dma_interleaved_template *xt,
>>>>> +     struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>>>         unsigned long flags)
>>>>>    {
>>>>>         struct imxdma_channel *imxdmac = to_imxdma_chan(chan);
>>>>>         struct imxdma_engine *imxdma = imxdmac->imxdma;
>>>>>         struct imxdma_desc *desc;
>>>>> +     struct dma_interleaved_template *xt = *xts;
>>>>>
>>>>>         dev_dbg(imxdma->dev, "%s channel: %d src_start=0x%llx
>>>>> dst_start=0x%llx\n"
>>>>>                 "   src_sgl=%s dst_sgl=%s numf=%zu frame_size=%zu\n",
>>>>> __func__,
>>>>> diff --git a/drivers/dma/sirf-dma.c b/drivers/dma/sirf-dma.c
>>>>> index d4d3a31..b6a150b 100644
>>>>> --- a/drivers/dma/sirf-dma.c
>>>>> +++ b/drivers/dma/sirf-dma.c
>>>>> @@ -509,12 +509,13 @@ sirfsoc_dma_tx_status(struct dma_chan *chan,
>>>>> dma_cookie_t cookie,
>>>>>    }
>>>>>
>>>>>    static struct dma_async_tx_descriptor *sirfsoc_dma_prep_interleaved(
>>>>> -     struct dma_chan *chan, struct dma_interleaved_template *xt,
>>>>> +     struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>>>         unsigned long flags)
>>>>>    {
>>>>>         struct sirfsoc_dma *sdma = dma_chan_to_sirfsoc_dma(chan);
>>>>>         struct sirfsoc_dma_chan *schan =
>>>>> dma_chan_to_sirfsoc_dma_chan(chan);
>>>>>         struct sirfsoc_dma_desc *sdesc = NULL;
>>>>> +     struct dma_interleaved_template *xt = *xts;
>>>>>         unsigned long iflags;
>>>>>         int ret;
>>>>>
>>>>> diff --git a/drivers/media/platform/m2m-deinterlace.c
>>>>> b/drivers/media/platform/m2m-deinterlace.c
>>>>> index 6bb86b5..468110a 100644
>>>>> --- a/drivers/media/platform/m2m-deinterlace.c
>>>>> +++ b/drivers/media/platform/m2m-deinterlace.c
>>>>> @@ -343,7 +343,7 @@ static void deinterlace_issue_dma(struct
>>>>> deinterlace_ctx *ctx, int op,
>>>>>         ctx->xt->dst_sgl = true;
>>>>>         flags = DMA_CTRL_ACK | DMA_PREP_INTERRUPT;
>>>>>
>>>>> -     tx = dmadev->device_prep_interleaved_dma(chan, ctx->xt, flags);
>>>>> +     tx = dmadev->device_prep_interleaved_dma(chan, &ctx->xt, flags);
>>>>>         if (tx == NULL) {
>>>>>                 v4l2_warn(&pcdev->v4l2_dev, "DMA interleaved prep
>>>>> error\n");
>>>>>                 return;
>>>>> diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
>>>>> index c5c92d5..2f77a9a 100644
>>>>> --- a/include/linux/dmaengine.h
>>>>> +++ b/include/linux/dmaengine.h
>>>>> @@ -675,7 +675,7 @@ struct dma_device {
>>>>>                 size_t period_len, enum dma_transfer_direction direction,
>>>>>                 unsigned long flags, void *context);
>>>>>         struct dma_async_tx_descriptor *(*device_prep_interleaved_dma)(
>>>>> -             struct dma_chan *chan, struct dma_interleaved_template
>>>>> *xt,
>>>>> +             struct dma_chan *chan, struct dma_interleaved_template
>>>>> **xts,
>>>>>                 unsigned long flags);
>>>>>         int (*device_control)(struct dma_chan *chan, enum dma_ctrl_cmd
>>>>> cmd,
>>>>>                 unsigned long arg);
>>>>> @@ -752,10 +752,10 @@ static inline struct dma_async_tx_descriptor
>>>>> *dmaengine_prep_dma_cyclic(
>>>>>    }
>>>>>
>>>>>    static inline struct dma_async_tx_descriptor
>>>>> *dmaengine_prep_interleaved_dma(
>>>>> -             struct dma_chan *chan, struct dma_interleaved_template
>>>>> *xt,
>>>>> +             struct dma_chan *chan, struct dma_interleaved_template
>>>>> **xts,
>>>>>                 unsigned long flags)
>>>>>    {
>>>>> -     return chan->device->device_prep_interleaved_dma(chan, xt, flags);
>>>>> +     return chan->device->device_prep_interleaved_dma(chan, xts,
>>>>> flags);
>>>>>    }
>>>>>
>>>>>    static inline int dma_get_slave_caps(struct dma_chan *chan, struct
>>>>> dma_slave_caps *caps)
>>>>> --
>>>>> 1.7.9.5
>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe dmaengine" in
>>>>> the body of a message to majordomo at vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>>> --
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel"
>>>> in
>>>> the body of a message to majordomo at vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> Please read the FAQ at  http://www.tux.org/lkml/
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo at vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-17  9:52               ` Srikanth Thokala
  0 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-17  9:52 UTC (permalink / raw)
  To: Lars-Peter Clausen
  Cc: Srikanth Thokala, Vinod Koul, dan.j.williams, michal.simek,
	Grant Likely, robh+dt, Levente Kurusa, andriy.shevchenko,
	dmaengine, linux-arm-kernel, linux-kernel, devicetree

On Mon, Feb 17, 2014 at 3:14 PM, Lars-Peter Clausen <lars@metafoo.de> wrote:
> On 02/17/2014 10:42 AM, Srikanth Thokala wrote:
>>
>> On Mon, Feb 17, 2014 at 3:05 PM, Lars-Peter Clausen <lars@metafoo.de>
>> wrote:
>>>
>>> On 02/17/2014 10:29 AM, Srikanth Thokala wrote:
>>>>
>>>>
>>>> On Mon, Feb 17, 2014 at 2:13 PM, Vinod Koul <vinod.koul@intel.com>
>>>> wrote:
>>>>>
>>>>>
>>>>> On Sat, Feb 15, 2014 at 05:30:17PM +0530, Srikanth Thokala wrote:
>>>>>>
>>>>>>
>>>>>> The current implementation of interleaved DMA API support multiple
>>>>>> frames only when the memory is contiguous by incrementing src_start/
>>>>>> dst_start members of interleaved template.
>>>>>>
>>>>>> But, when the memory is non-contiguous it will restrict slave device
>>>>>> to not submit multiple frames in a batch.  This patch handles this
>>>>>> issue by allowing the slave device to send array of interleaved dma
>>>>>> templates each having a different memory location.
>>>>>
>>>>>
>>>>> This seems to be missing the numbers of templates you are sending,
>>>>> wouldnt this
>>>>> require sending ARRAY_SiZE too?
>>>>>
>>>>> And why send double pointer?
>>>>
>>>>
>>>>
>>>> Array size is not required, when we pass the double pointer.  The last
>>>> element would be
>>>> pointed to NULL and we could get the number of templates from this
>>>> condition.
>>>> Here is an example snippet,
>>>>
>>>> In slave device driver,
>>>>
>>>>           struct dma_interleaved_template **xts;
>>>>
>>>>           xts = kcalloc(frm_cnt+1, sizeof(struct
>>>> dma_interleaved_template *), GFP_KERNEL);
>>>>           /* Error check for xts */
>>>>           for (i = 0; i < frm_cnt; i++) {
>>>>                   xts[i] = kmalloc(sizeof(struct
>>>> dma_interleaved_template), GFP_KERNEL);
>>>>                   /* Error check for xts[i] */
>>>>           }
>>>>           xts[i] = NULL;
>>>>
>>>> In DMA engine driver,  we could get the number of frames by,
>>>>
>>>>           for (; xts[frmno] != NULL; frmno++);
>>>>
>>>> I felt this way is simpler than adding an extra argument to the API.
>>>> Please let me know
>>>> your opinion and suggest me a better way.
>>>
>>>
>>>
>>> I think Vinod's suggestion of passing in an array of
>>> interleaved_templates
>>> and the size of the array is better than what you are currently doing.
>>
>>
>> Ok, Lars.  I will update with this in my v4. Thanks.
>>
>>>
>>> Btw. you also need to update the current implementations and users of the
>>> API accordingly.
>>
>>
>> Yes, I have updated them in this patch.
>
>
> But you didn't update them accordingly to your proposed semantics. The
> caller didn't NULL terminate the array and the drivers did blindly assume
> there will always be exactly one transfer.

Ok, got it.  I will correct in v4.  Thanks for pointing this.

Srikanth

>
>
>>
>> Thanks
>> Srikanth
>>
>>
>>>
>>>
>>>>
>>>>>
>>>>> --
>>>>> ~Vinod
>>>>>
>>>>>>
>>>>>> Signed-off-by: Srikanth Thokala <sthokal@xilinx.com>
>>>>>> ---
>>>>>>    Documentation/dmaengine.txt              |    2 +-
>>>>>>    drivers/dma/imx-dma.c                    |    3 ++-
>>>>>>    drivers/dma/sirf-dma.c                   |    3 ++-
>>>>>>    drivers/media/platform/m2m-deinterlace.c |    2 +-
>>>>>>    include/linux/dmaengine.h                |    6 +++---
>>>>>>    5 files changed, 9 insertions(+), 7 deletions(-)
>>>>>>
>>>>>> diff --git a/Documentation/dmaengine.txt b/Documentation/dmaengine.txt
>>>>>> index 879b6e3..c642614 100644
>>>>>> --- a/Documentation/dmaengine.txt
>>>>>> +++ b/Documentation/dmaengine.txt
>>>>>> @@ -94,7 +94,7 @@ The slave DMA usage consists of following steps:
>>>>>>                 size_t period_len, enum dma_data_direction direction);
>>>>>>
>>>>>>         struct dma_async_tx_descriptor
>>>>>> *(*device_prep_interleaved_dma)(
>>>>>> -             struct dma_chan *chan, struct dma_interleaved_template
>>>>>> *xt,
>>>>>> +             struct dma_chan *chan, struct dma_interleaved_template
>>>>>> **xts,
>>>>>>                 unsigned long flags);
>>>>>>
>>>>>>       The peripheral driver is expected to have mapped the scatterlist
>>>>>> for
>>>>>> diff --git a/drivers/dma/imx-dma.c b/drivers/dma/imx-dma.c
>>>>>> index 6f9ac20..e2c52ce 100644
>>>>>> --- a/drivers/dma/imx-dma.c
>>>>>> +++ b/drivers/dma/imx-dma.c
>>>>>> @@ -954,12 +954,13 @@ static struct dma_async_tx_descriptor
>>>>>> *imxdma_prep_dma_memcpy(
>>>>>>    }
>>>>>>
>>>>>>    static struct dma_async_tx_descriptor *imxdma_prep_dma_interleaved(
>>>>>> -     struct dma_chan *chan, struct dma_interleaved_template *xt,
>>>>>> +     struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>>>>         unsigned long flags)
>>>>>>    {
>>>>>>         struct imxdma_channel *imxdmac = to_imxdma_chan(chan);
>>>>>>         struct imxdma_engine *imxdma = imxdmac->imxdma;
>>>>>>         struct imxdma_desc *desc;
>>>>>> +     struct dma_interleaved_template *xt = *xts;
>>>>>>
>>>>>>         dev_dbg(imxdma->dev, "%s channel: %d src_start=0x%llx
>>>>>> dst_start=0x%llx\n"
>>>>>>                 "   src_sgl=%s dst_sgl=%s numf=%zu frame_size=%zu\n",
>>>>>> __func__,
>>>>>> diff --git a/drivers/dma/sirf-dma.c b/drivers/dma/sirf-dma.c
>>>>>> index d4d3a31..b6a150b 100644
>>>>>> --- a/drivers/dma/sirf-dma.c
>>>>>> +++ b/drivers/dma/sirf-dma.c
>>>>>> @@ -509,12 +509,13 @@ sirfsoc_dma_tx_status(struct dma_chan *chan,
>>>>>> dma_cookie_t cookie,
>>>>>>    }
>>>>>>
>>>>>>    static struct dma_async_tx_descriptor
>>>>>> *sirfsoc_dma_prep_interleaved(
>>>>>> -     struct dma_chan *chan, struct dma_interleaved_template *xt,
>>>>>> +     struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>>>>         unsigned long flags)
>>>>>>    {
>>>>>>         struct sirfsoc_dma *sdma = dma_chan_to_sirfsoc_dma(chan);
>>>>>>         struct sirfsoc_dma_chan *schan =
>>>>>> dma_chan_to_sirfsoc_dma_chan(chan);
>>>>>>         struct sirfsoc_dma_desc *sdesc = NULL;
>>>>>> +     struct dma_interleaved_template *xt = *xts;
>>>>>>         unsigned long iflags;
>>>>>>         int ret;
>>>>>>
>>>>>> diff --git a/drivers/media/platform/m2m-deinterlace.c
>>>>>> b/drivers/media/platform/m2m-deinterlace.c
>>>>>> index 6bb86b5..468110a 100644
>>>>>> --- a/drivers/media/platform/m2m-deinterlace.c
>>>>>> +++ b/drivers/media/platform/m2m-deinterlace.c
>>>>>> @@ -343,7 +343,7 @@ static void deinterlace_issue_dma(struct
>>>>>> deinterlace_ctx *ctx, int op,
>>>>>>         ctx->xt->dst_sgl = true;
>>>>>>         flags = DMA_CTRL_ACK | DMA_PREP_INTERRUPT;
>>>>>>
>>>>>> -     tx = dmadev->device_prep_interleaved_dma(chan, ctx->xt, flags);
>>>>>> +     tx = dmadev->device_prep_interleaved_dma(chan, &ctx->xt, flags);
>>>>>>         if (tx == NULL) {
>>>>>>                 v4l2_warn(&pcdev->v4l2_dev, "DMA interleaved prep
>>>>>> error\n");
>>>>>>                 return;
>>>>>> diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
>>>>>> index c5c92d5..2f77a9a 100644
>>>>>> --- a/include/linux/dmaengine.h
>>>>>> +++ b/include/linux/dmaengine.h
>>>>>> @@ -675,7 +675,7 @@ struct dma_device {
>>>>>>                 size_t period_len, enum dma_transfer_direction
>>>>>> direction,
>>>>>>                 unsigned long flags, void *context);
>>>>>>         struct dma_async_tx_descriptor
>>>>>> *(*device_prep_interleaved_dma)(
>>>>>> -             struct dma_chan *chan, struct dma_interleaved_template
>>>>>> *xt,
>>>>>> +             struct dma_chan *chan, struct dma_interleaved_template
>>>>>> **xts,
>>>>>>                 unsigned long flags);
>>>>>>         int (*device_control)(struct dma_chan *chan, enum dma_ctrl_cmd
>>>>>> cmd,
>>>>>>                 unsigned long arg);
>>>>>> @@ -752,10 +752,10 @@ static inline struct dma_async_tx_descriptor
>>>>>> *dmaengine_prep_dma_cyclic(
>>>>>>    }
>>>>>>
>>>>>>    static inline struct dma_async_tx_descriptor
>>>>>> *dmaengine_prep_interleaved_dma(
>>>>>> -             struct dma_chan *chan, struct dma_interleaved_template
>>>>>> *xt,
>>>>>> +             struct dma_chan *chan, struct dma_interleaved_template
>>>>>> **xts,
>>>>>>                 unsigned long flags)
>>>>>>    {
>>>>>> -     return chan->device->device_prep_interleaved_dma(chan, xt,
>>>>>> flags);
>>>>>> +     return chan->device->device_prep_interleaved_dma(chan, xts,
>>>>>> flags);
>>>>>>    }
>>>>>>
>>>>>>    static inline int dma_get_slave_caps(struct dma_chan *chan, struct
>>>>>> dma_slave_caps *caps)
>>>>>> --
>>>>>> 1.7.9.5
>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe dmaengine"
>>>>>> in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel"
>>>>> in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>> Please read the FAQ at  http://www.tux.org/lkml/
>>>
>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel"
>>> in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> Please read the FAQ at  http://www.tux.org/lkml/
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-17  9:52               ` Srikanth Thokala
  0 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-17  9:52 UTC (permalink / raw)
  To: Lars-Peter Clausen
  Cc: Srikanth Thokala, Vinod Koul,
	dan.j.williams-ral2JQCrhuEAvxtiuMwx3w,
	michal.simek-gjFFaj9aHVfQT0dZR+AlfA, Grant Likely,
	robh+dt-DgEjT+Ai2ygdnm+yROfE0A, Levente Kurusa,
	andriy.shevchenko-b5Z7lJ3WibVrdx17CPfAsdBPR1lH4CV8,
	dmaengine-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA

On Mon, Feb 17, 2014 at 3:14 PM, Lars-Peter Clausen <lars-Qo5EllUWu/uELgA04lAiVw@public.gmane.org> wrote:
> On 02/17/2014 10:42 AM, Srikanth Thokala wrote:
>>
>> On Mon, Feb 17, 2014 at 3:05 PM, Lars-Peter Clausen <lars-Qo5EllUWu/uELgA04lAiVw@public.gmane.org>
>> wrote:
>>>
>>> On 02/17/2014 10:29 AM, Srikanth Thokala wrote:
>>>>
>>>>
>>>> On Mon, Feb 17, 2014 at 2:13 PM, Vinod Koul <vinod.koul-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
>>>> wrote:
>>>>>
>>>>>
>>>>> On Sat, Feb 15, 2014 at 05:30:17PM +0530, Srikanth Thokala wrote:
>>>>>>
>>>>>>
>>>>>> The current implementation of interleaved DMA API support multiple
>>>>>> frames only when the memory is contiguous by incrementing src_start/
>>>>>> dst_start members of interleaved template.
>>>>>>
>>>>>> But, when the memory is non-contiguous it will restrict slave device
>>>>>> to not submit multiple frames in a batch.  This patch handles this
>>>>>> issue by allowing the slave device to send array of interleaved dma
>>>>>> templates each having a different memory location.
>>>>>
>>>>>
>>>>> This seems to be missing the numbers of templates you are sending,
>>>>> wouldnt this
>>>>> require sending ARRAY_SiZE too?
>>>>>
>>>>> And why send double pointer?
>>>>
>>>>
>>>>
>>>> Array size is not required, when we pass the double pointer.  The last
>>>> element would be
>>>> pointed to NULL and we could get the number of templates from this
>>>> condition.
>>>> Here is an example snippet,
>>>>
>>>> In slave device driver,
>>>>
>>>>           struct dma_interleaved_template **xts;
>>>>
>>>>           xts = kcalloc(frm_cnt+1, sizeof(struct
>>>> dma_interleaved_template *), GFP_KERNEL);
>>>>           /* Error check for xts */
>>>>           for (i = 0; i < frm_cnt; i++) {
>>>>                   xts[i] = kmalloc(sizeof(struct
>>>> dma_interleaved_template), GFP_KERNEL);
>>>>                   /* Error check for xts[i] */
>>>>           }
>>>>           xts[i] = NULL;
>>>>
>>>> In DMA engine driver,  we could get the number of frames by,
>>>>
>>>>           for (; xts[frmno] != NULL; frmno++);
>>>>
>>>> I felt this way is simpler than adding an extra argument to the API.
>>>> Please let me know
>>>> your opinion and suggest me a better way.
>>>
>>>
>>>
>>> I think Vinod's suggestion of passing in an array of
>>> interleaved_templates
>>> and the size of the array is better than what you are currently doing.
>>
>>
>> Ok, Lars.  I will update with this in my v4. Thanks.
>>
>>>
>>> Btw. you also need to update the current implementations and users of the
>>> API accordingly.
>>
>>
>> Yes, I have updated them in this patch.
>
>
> But you didn't update them accordingly to your proposed semantics. The
> caller didn't NULL terminate the array and the drivers did blindly assume
> there will always be exactly one transfer.

Ok, got it.  I will correct in v4.  Thanks for pointing this.

Srikanth

>
>
>>
>> Thanks
>> Srikanth
>>
>>
>>>
>>>
>>>>
>>>>>
>>>>> --
>>>>> ~Vinod
>>>>>
>>>>>>
>>>>>> Signed-off-by: Srikanth Thokala <sthokal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org>
>>>>>> ---
>>>>>>    Documentation/dmaengine.txt              |    2 +-
>>>>>>    drivers/dma/imx-dma.c                    |    3 ++-
>>>>>>    drivers/dma/sirf-dma.c                   |    3 ++-
>>>>>>    drivers/media/platform/m2m-deinterlace.c |    2 +-
>>>>>>    include/linux/dmaengine.h                |    6 +++---
>>>>>>    5 files changed, 9 insertions(+), 7 deletions(-)
>>>>>>
>>>>>> diff --git a/Documentation/dmaengine.txt b/Documentation/dmaengine.txt
>>>>>> index 879b6e3..c642614 100644
>>>>>> --- a/Documentation/dmaengine.txt
>>>>>> +++ b/Documentation/dmaengine.txt
>>>>>> @@ -94,7 +94,7 @@ The slave DMA usage consists of following steps:
>>>>>>                 size_t period_len, enum dma_data_direction direction);
>>>>>>
>>>>>>         struct dma_async_tx_descriptor
>>>>>> *(*device_prep_interleaved_dma)(
>>>>>> -             struct dma_chan *chan, struct dma_interleaved_template
>>>>>> *xt,
>>>>>> +             struct dma_chan *chan, struct dma_interleaved_template
>>>>>> **xts,
>>>>>>                 unsigned long flags);
>>>>>>
>>>>>>       The peripheral driver is expected to have mapped the scatterlist
>>>>>> for
>>>>>> diff --git a/drivers/dma/imx-dma.c b/drivers/dma/imx-dma.c
>>>>>> index 6f9ac20..e2c52ce 100644
>>>>>> --- a/drivers/dma/imx-dma.c
>>>>>> +++ b/drivers/dma/imx-dma.c
>>>>>> @@ -954,12 +954,13 @@ static struct dma_async_tx_descriptor
>>>>>> *imxdma_prep_dma_memcpy(
>>>>>>    }
>>>>>>
>>>>>>    static struct dma_async_tx_descriptor *imxdma_prep_dma_interleaved(
>>>>>> -     struct dma_chan *chan, struct dma_interleaved_template *xt,
>>>>>> +     struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>>>>         unsigned long flags)
>>>>>>    {
>>>>>>         struct imxdma_channel *imxdmac = to_imxdma_chan(chan);
>>>>>>         struct imxdma_engine *imxdma = imxdmac->imxdma;
>>>>>>         struct imxdma_desc *desc;
>>>>>> +     struct dma_interleaved_template *xt = *xts;
>>>>>>
>>>>>>         dev_dbg(imxdma->dev, "%s channel: %d src_start=0x%llx
>>>>>> dst_start=0x%llx\n"
>>>>>>                 "   src_sgl=%s dst_sgl=%s numf=%zu frame_size=%zu\n",
>>>>>> __func__,
>>>>>> diff --git a/drivers/dma/sirf-dma.c b/drivers/dma/sirf-dma.c
>>>>>> index d4d3a31..b6a150b 100644
>>>>>> --- a/drivers/dma/sirf-dma.c
>>>>>> +++ b/drivers/dma/sirf-dma.c
>>>>>> @@ -509,12 +509,13 @@ sirfsoc_dma_tx_status(struct dma_chan *chan,
>>>>>> dma_cookie_t cookie,
>>>>>>    }
>>>>>>
>>>>>>    static struct dma_async_tx_descriptor
>>>>>> *sirfsoc_dma_prep_interleaved(
>>>>>> -     struct dma_chan *chan, struct dma_interleaved_template *xt,
>>>>>> +     struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>>>>         unsigned long flags)
>>>>>>    {
>>>>>>         struct sirfsoc_dma *sdma = dma_chan_to_sirfsoc_dma(chan);
>>>>>>         struct sirfsoc_dma_chan *schan =
>>>>>> dma_chan_to_sirfsoc_dma_chan(chan);
>>>>>>         struct sirfsoc_dma_desc *sdesc = NULL;
>>>>>> +     struct dma_interleaved_template *xt = *xts;
>>>>>>         unsigned long iflags;
>>>>>>         int ret;
>>>>>>
>>>>>> diff --git a/drivers/media/platform/m2m-deinterlace.c
>>>>>> b/drivers/media/platform/m2m-deinterlace.c
>>>>>> index 6bb86b5..468110a 100644
>>>>>> --- a/drivers/media/platform/m2m-deinterlace.c
>>>>>> +++ b/drivers/media/platform/m2m-deinterlace.c
>>>>>> @@ -343,7 +343,7 @@ static void deinterlace_issue_dma(struct
>>>>>> deinterlace_ctx *ctx, int op,
>>>>>>         ctx->xt->dst_sgl = true;
>>>>>>         flags = DMA_CTRL_ACK | DMA_PREP_INTERRUPT;
>>>>>>
>>>>>> -     tx = dmadev->device_prep_interleaved_dma(chan, ctx->xt, flags);
>>>>>> +     tx = dmadev->device_prep_interleaved_dma(chan, &ctx->xt, flags);
>>>>>>         if (tx == NULL) {
>>>>>>                 v4l2_warn(&pcdev->v4l2_dev, "DMA interleaved prep
>>>>>> error\n");
>>>>>>                 return;
>>>>>> diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
>>>>>> index c5c92d5..2f77a9a 100644
>>>>>> --- a/include/linux/dmaengine.h
>>>>>> +++ b/include/linux/dmaengine.h
>>>>>> @@ -675,7 +675,7 @@ struct dma_device {
>>>>>>                 size_t period_len, enum dma_transfer_direction
>>>>>> direction,
>>>>>>                 unsigned long flags, void *context);
>>>>>>         struct dma_async_tx_descriptor
>>>>>> *(*device_prep_interleaved_dma)(
>>>>>> -             struct dma_chan *chan, struct dma_interleaved_template
>>>>>> *xt,
>>>>>> +             struct dma_chan *chan, struct dma_interleaved_template
>>>>>> **xts,
>>>>>>                 unsigned long flags);
>>>>>>         int (*device_control)(struct dma_chan *chan, enum dma_ctrl_cmd
>>>>>> cmd,
>>>>>>                 unsigned long arg);
>>>>>> @@ -752,10 +752,10 @@ static inline struct dma_async_tx_descriptor
>>>>>> *dmaengine_prep_dma_cyclic(
>>>>>>    }
>>>>>>
>>>>>>    static inline struct dma_async_tx_descriptor
>>>>>> *dmaengine_prep_interleaved_dma(
>>>>>> -             struct dma_chan *chan, struct dma_interleaved_template
>>>>>> *xt,
>>>>>> +             struct dma_chan *chan, struct dma_interleaved_template
>>>>>> **xts,
>>>>>>                 unsigned long flags)
>>>>>>    {
>>>>>> -     return chan->device->device_prep_interleaved_dma(chan, xt,
>>>>>> flags);
>>>>>> +     return chan->device->device_prep_interleaved_dma(chan, xts,
>>>>>> flags);
>>>>>>    }
>>>>>>
>>>>>>    static inline int dma_get_slave_caps(struct dma_chan *chan, struct
>>>>>> dma_slave_caps *caps)
>>>>>> --
>>>>>> 1.7.9.5
>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe dmaengine"
>>>>>> in
>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel"
>>>>> in
>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>> Please read the FAQ at  http://www.tux.org/lkml/
>>>
>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel"
>>> in
>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> Please read the FAQ at  http://www.tux.org/lkml/
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-17  9:52               ` Srikanth Thokala
  0 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-17  9:52 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Feb 17, 2014 at 3:14 PM, Lars-Peter Clausen <lars@metafoo.de> wrote:
> On 02/17/2014 10:42 AM, Srikanth Thokala wrote:
>>
>> On Mon, Feb 17, 2014 at 3:05 PM, Lars-Peter Clausen <lars@metafoo.de>
>> wrote:
>>>
>>> On 02/17/2014 10:29 AM, Srikanth Thokala wrote:
>>>>
>>>>
>>>> On Mon, Feb 17, 2014 at 2:13 PM, Vinod Koul <vinod.koul@intel.com>
>>>> wrote:
>>>>>
>>>>>
>>>>> On Sat, Feb 15, 2014 at 05:30:17PM +0530, Srikanth Thokala wrote:
>>>>>>
>>>>>>
>>>>>> The current implementation of interleaved DMA API support multiple
>>>>>> frames only when the memory is contiguous by incrementing src_start/
>>>>>> dst_start members of interleaved template.
>>>>>>
>>>>>> But, when the memory is non-contiguous it will restrict slave device
>>>>>> to not submit multiple frames in a batch.  This patch handles this
>>>>>> issue by allowing the slave device to send array of interleaved dma
>>>>>> templates each having a different memory location.
>>>>>
>>>>>
>>>>> This seems to be missing the numbers of templates you are sending,
>>>>> wouldnt this
>>>>> require sending ARRAY_SiZE too?
>>>>>
>>>>> And why send double pointer?
>>>>
>>>>
>>>>
>>>> Array size is not required, when we pass the double pointer.  The last
>>>> element would be
>>>> pointed to NULL and we could get the number of templates from this
>>>> condition.
>>>> Here is an example snippet,
>>>>
>>>> In slave device driver,
>>>>
>>>>           struct dma_interleaved_template **xts;
>>>>
>>>>           xts = kcalloc(frm_cnt+1, sizeof(struct
>>>> dma_interleaved_template *), GFP_KERNEL);
>>>>           /* Error check for xts */
>>>>           for (i = 0; i < frm_cnt; i++) {
>>>>                   xts[i] = kmalloc(sizeof(struct
>>>> dma_interleaved_template), GFP_KERNEL);
>>>>                   /* Error check for xts[i] */
>>>>           }
>>>>           xts[i] = NULL;
>>>>
>>>> In DMA engine driver,  we could get the number of frames by,
>>>>
>>>>           for (; xts[frmno] != NULL; frmno++);
>>>>
>>>> I felt this way is simpler than adding an extra argument to the API.
>>>> Please let me know
>>>> your opinion and suggest me a better way.
>>>
>>>
>>>
>>> I think Vinod's suggestion of passing in an array of
>>> interleaved_templates
>>> and the size of the array is better than what you are currently doing.
>>
>>
>> Ok, Lars.  I will update with this in my v4. Thanks.
>>
>>>
>>> Btw. you also need to update the current implementations and users of the
>>> API accordingly.
>>
>>
>> Yes, I have updated them in this patch.
>
>
> But you didn't update them accordingly to your proposed semantics. The
> caller didn't NULL terminate the array and the drivers did blindly assume
> there will always be exactly one transfer.

Ok, got it.  I will correct in v4.  Thanks for pointing this.

Srikanth

>
>
>>
>> Thanks
>> Srikanth
>>
>>
>>>
>>>
>>>>
>>>>>
>>>>> --
>>>>> ~Vinod
>>>>>
>>>>>>
>>>>>> Signed-off-by: Srikanth Thokala <sthokal@xilinx.com>
>>>>>> ---
>>>>>>    Documentation/dmaengine.txt              |    2 +-
>>>>>>    drivers/dma/imx-dma.c                    |    3 ++-
>>>>>>    drivers/dma/sirf-dma.c                   |    3 ++-
>>>>>>    drivers/media/platform/m2m-deinterlace.c |    2 +-
>>>>>>    include/linux/dmaengine.h                |    6 +++---
>>>>>>    5 files changed, 9 insertions(+), 7 deletions(-)
>>>>>>
>>>>>> diff --git a/Documentation/dmaengine.txt b/Documentation/dmaengine.txt
>>>>>> index 879b6e3..c642614 100644
>>>>>> --- a/Documentation/dmaengine.txt
>>>>>> +++ b/Documentation/dmaengine.txt
>>>>>> @@ -94,7 +94,7 @@ The slave DMA usage consists of following steps:
>>>>>>                 size_t period_len, enum dma_data_direction direction);
>>>>>>
>>>>>>         struct dma_async_tx_descriptor
>>>>>> *(*device_prep_interleaved_dma)(
>>>>>> -             struct dma_chan *chan, struct dma_interleaved_template
>>>>>> *xt,
>>>>>> +             struct dma_chan *chan, struct dma_interleaved_template
>>>>>> **xts,
>>>>>>                 unsigned long flags);
>>>>>>
>>>>>>       The peripheral driver is expected to have mapped the scatterlist
>>>>>> for
>>>>>> diff --git a/drivers/dma/imx-dma.c b/drivers/dma/imx-dma.c
>>>>>> index 6f9ac20..e2c52ce 100644
>>>>>> --- a/drivers/dma/imx-dma.c
>>>>>> +++ b/drivers/dma/imx-dma.c
>>>>>> @@ -954,12 +954,13 @@ static struct dma_async_tx_descriptor
>>>>>> *imxdma_prep_dma_memcpy(
>>>>>>    }
>>>>>>
>>>>>>    static struct dma_async_tx_descriptor *imxdma_prep_dma_interleaved(
>>>>>> -     struct dma_chan *chan, struct dma_interleaved_template *xt,
>>>>>> +     struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>>>>         unsigned long flags)
>>>>>>    {
>>>>>>         struct imxdma_channel *imxdmac = to_imxdma_chan(chan);
>>>>>>         struct imxdma_engine *imxdma = imxdmac->imxdma;
>>>>>>         struct imxdma_desc *desc;
>>>>>> +     struct dma_interleaved_template *xt = *xts;
>>>>>>
>>>>>>         dev_dbg(imxdma->dev, "%s channel: %d src_start=0x%llx
>>>>>> dst_start=0x%llx\n"
>>>>>>                 "   src_sgl=%s dst_sgl=%s numf=%zu frame_size=%zu\n",
>>>>>> __func__,
>>>>>> diff --git a/drivers/dma/sirf-dma.c b/drivers/dma/sirf-dma.c
>>>>>> index d4d3a31..b6a150b 100644
>>>>>> --- a/drivers/dma/sirf-dma.c
>>>>>> +++ b/drivers/dma/sirf-dma.c
>>>>>> @@ -509,12 +509,13 @@ sirfsoc_dma_tx_status(struct dma_chan *chan,
>>>>>> dma_cookie_t cookie,
>>>>>>    }
>>>>>>
>>>>>>    static struct dma_async_tx_descriptor
>>>>>> *sirfsoc_dma_prep_interleaved(
>>>>>> -     struct dma_chan *chan, struct dma_interleaved_template *xt,
>>>>>> +     struct dma_chan *chan, struct dma_interleaved_template **xts,
>>>>>>         unsigned long flags)
>>>>>>    {
>>>>>>         struct sirfsoc_dma *sdma = dma_chan_to_sirfsoc_dma(chan);
>>>>>>         struct sirfsoc_dma_chan *schan =
>>>>>> dma_chan_to_sirfsoc_dma_chan(chan);
>>>>>>         struct sirfsoc_dma_desc *sdesc = NULL;
>>>>>> +     struct dma_interleaved_template *xt = *xts;
>>>>>>         unsigned long iflags;
>>>>>>         int ret;
>>>>>>
>>>>>> diff --git a/drivers/media/platform/m2m-deinterlace.c
>>>>>> b/drivers/media/platform/m2m-deinterlace.c
>>>>>> index 6bb86b5..468110a 100644
>>>>>> --- a/drivers/media/platform/m2m-deinterlace.c
>>>>>> +++ b/drivers/media/platform/m2m-deinterlace.c
>>>>>> @@ -343,7 +343,7 @@ static void deinterlace_issue_dma(struct
>>>>>> deinterlace_ctx *ctx, int op,
>>>>>>         ctx->xt->dst_sgl = true;
>>>>>>         flags = DMA_CTRL_ACK | DMA_PREP_INTERRUPT;
>>>>>>
>>>>>> -     tx = dmadev->device_prep_interleaved_dma(chan, ctx->xt, flags);
>>>>>> +     tx = dmadev->device_prep_interleaved_dma(chan, &ctx->xt, flags);
>>>>>>         if (tx == NULL) {
>>>>>>                 v4l2_warn(&pcdev->v4l2_dev, "DMA interleaved prep
>>>>>> error\n");
>>>>>>                 return;
>>>>>> diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
>>>>>> index c5c92d5..2f77a9a 100644
>>>>>> --- a/include/linux/dmaengine.h
>>>>>> +++ b/include/linux/dmaengine.h
>>>>>> @@ -675,7 +675,7 @@ struct dma_device {
>>>>>>                 size_t period_len, enum dma_transfer_direction
>>>>>> direction,
>>>>>>                 unsigned long flags, void *context);
>>>>>>         struct dma_async_tx_descriptor
>>>>>> *(*device_prep_interleaved_dma)(
>>>>>> -             struct dma_chan *chan, struct dma_interleaved_template
>>>>>> *xt,
>>>>>> +             struct dma_chan *chan, struct dma_interleaved_template
>>>>>> **xts,
>>>>>>                 unsigned long flags);
>>>>>>         int (*device_control)(struct dma_chan *chan, enum dma_ctrl_cmd
>>>>>> cmd,
>>>>>>                 unsigned long arg);
>>>>>> @@ -752,10 +752,10 @@ static inline struct dma_async_tx_descriptor
>>>>>> *dmaengine_prep_dma_cyclic(
>>>>>>    }
>>>>>>
>>>>>>    static inline struct dma_async_tx_descriptor
>>>>>> *dmaengine_prep_interleaved_dma(
>>>>>> -             struct dma_chan *chan, struct dma_interleaved_template
>>>>>> *xt,
>>>>>> +             struct dma_chan *chan, struct dma_interleaved_template
>>>>>> **xts,
>>>>>>                 unsigned long flags)
>>>>>>    {
>>>>>> -     return chan->device->device_prep_interleaved_dma(chan, xt,
>>>>>> flags);
>>>>>> +     return chan->device->device_prep_interleaved_dma(chan, xts,
>>>>>> flags);
>>>>>>    }
>>>>>>
>>>>>>    static inline int dma_get_slave_caps(struct dma_chan *chan, struct
>>>>>> dma_slave_caps *caps)
>>>>>> --
>>>>>> 1.7.9.5
>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe dmaengine"
>>>>>> in
>>>>>> the body of a message to majordomo at vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel"
>>>>> in
>>>>> the body of a message to majordomo at vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>> Please read the FAQ at  http://www.tux.org/lkml/
>>>
>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel"
>>> in
>>> the body of a message to majordomo at vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> Please read the FAQ at  http://www.tux.org/lkml/
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
  2014-02-15 12:00   ` Srikanth Thokala
  (?)
@ 2014-02-17  9:57     ` Jassi Brar
  -1 siblings, 0 replies; 65+ messages in thread
From: Jassi Brar @ 2014-02-17  9:57 UTC (permalink / raw)
  To: Srikanth Thokala
  Cc: Williams, Dan J, Koul, Vinod, michal.simek, grant.likely,
	robh+dt, devicetree, levex, lars, lkml, dmaengine,
	andriy.shevchenko, linux-arm-kernel

On 15 February 2014 17:30, Srikanth Thokala <sthokal@xilinx.com> wrote:
> The current implementation of interleaved DMA API support multiple
> frames only when the memory is contiguous by incrementing src_start/
> dst_start members of interleaved template.
>
> But, when the memory is non-contiguous it will restrict slave device
> to not submit multiple frames in a batch.  This patch handles this
> issue by allowing the slave device to send array of interleaved dma
> templates each having a different memory location.
>
How fragmented could be memory in your case? Is it inefficient to
submit separate transfers for each segment/frame?
It will help if you could give a typical example (chunk size and gap
in bytes) of what you worry about.

Thanks,
Jassi

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-17  9:57     ` Jassi Brar
  0 siblings, 0 replies; 65+ messages in thread
From: Jassi Brar @ 2014-02-17  9:57 UTC (permalink / raw)
  To: Srikanth Thokala
  Cc: Williams, Dan J, Koul, Vinod, michal.simek, grant.likely,
	robh+dt, devicetree, levex, lars, lkml, dmaengine,
	andriy.shevchenko, linux-arm-kernel

On 15 February 2014 17:30, Srikanth Thokala <sthokal@xilinx.com> wrote:
> The current implementation of interleaved DMA API support multiple
> frames only when the memory is contiguous by incrementing src_start/
> dst_start members of interleaved template.
>
> But, when the memory is non-contiguous it will restrict slave device
> to not submit multiple frames in a batch.  This patch handles this
> issue by allowing the slave device to send array of interleaved dma
> templates each having a different memory location.
>
How fragmented could be memory in your case? Is it inefficient to
submit separate transfers for each segment/frame?
It will help if you could give a typical example (chunk size and gap
in bytes) of what you worry about.

Thanks,
Jassi

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-17  9:57     ` Jassi Brar
  0 siblings, 0 replies; 65+ messages in thread
From: Jassi Brar @ 2014-02-17  9:57 UTC (permalink / raw)
  To: linux-arm-kernel

On 15 February 2014 17:30, Srikanth Thokala <sthokal@xilinx.com> wrote:
> The current implementation of interleaved DMA API support multiple
> frames only when the memory is contiguous by incrementing src_start/
> dst_start members of interleaved template.
>
> But, when the memory is non-contiguous it will restrict slave device
> to not submit multiple frames in a batch.  This patch handles this
> issue by allowing the slave device to send array of interleaved dma
> templates each having a different memory location.
>
How fragmented could be memory in your case? Is it inefficient to
submit separate transfers for each segment/frame?
It will help if you could give a typical example (chunk size and gap
in bytes) of what you worry about.

Thanks,
Jassi

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-18 11:28       ` Srikanth Thokala
  0 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-18 11:28 UTC (permalink / raw)
  To: Jassi Brar
  Cc: Srikanth Thokala, Williams, Dan J, Koul, Vinod, michal.simek,
	Grant Likely, robh+dt, devicetree, Levente Kurusa,
	Lars-Peter Clausen, lkml, dmaengine, Andy Shevchenko,
	linux-arm-kernel

On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
> On 15 February 2014 17:30, Srikanth Thokala <sthokal@xilinx.com> wrote:
>> The current implementation of interleaved DMA API support multiple
>> frames only when the memory is contiguous by incrementing src_start/
>> dst_start members of interleaved template.
>>
>> But, when the memory is non-contiguous it will restrict slave device
>> to not submit multiple frames in a batch.  This patch handles this
>> issue by allowing the slave device to send array of interleaved dma
>> templates each having a different memory location.
>>
> How fragmented could be memory in your case? Is it inefficient to
> submit separate transfers for each segment/frame?
> It will help if you could give a typical example (chunk size and gap
> in bytes) of what you worry about.

With scatter-gather engine feature in the hardware, submitting separate
transfers for each frame look inefficient. As an example, our DMA engine
supports up to 16 video frames, with each frame (a typical video frame
size) being contiguous in memory but frames are scattered into different
locations. We could not definitely submit frame by frame as it would be
software overhead (HW interrupting for each frame) resulting in video lags.

By this approach, it will allow slave device to submit multiple frames at
once.

Srikanth

>
> Thanks,
> Jassi
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-18 11:28       ` Srikanth Thokala
  0 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-18 11:28 UTC (permalink / raw)
  To: Jassi Brar
  Cc: Srikanth Thokala, Williams, Dan J, Koul, Vinod,
	michal.simek-gjFFaj9aHVfQT0dZR+AlfA, Grant Likely,
	robh+dt-DgEjT+Ai2ygdnm+yROfE0A,
	devicetree-u79uwXL29TY76Z2rM5mHXA, Levente Kurusa,
	Lars-Peter Clausen, lkml, dmaengine-u79uwXL29TY76Z2rM5mHXA,
	Andy Shevchenko,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar <jaswinder.singh-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org> wrote:
> On 15 February 2014 17:30, Srikanth Thokala <sthokal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org> wrote:
>> The current implementation of interleaved DMA API support multiple
>> frames only when the memory is contiguous by incrementing src_start/
>> dst_start members of interleaved template.
>>
>> But, when the memory is non-contiguous it will restrict slave device
>> to not submit multiple frames in a batch.  This patch handles this
>> issue by allowing the slave device to send array of interleaved dma
>> templates each having a different memory location.
>>
> How fragmented could be memory in your case? Is it inefficient to
> submit separate transfers for each segment/frame?
> It will help if you could give a typical example (chunk size and gap
> in bytes) of what you worry about.

With scatter-gather engine feature in the hardware, submitting separate
transfers for each frame look inefficient. As an example, our DMA engine
supports up to 16 video frames, with each frame (a typical video frame
size) being contiguous in memory but frames are scattered into different
locations. We could not definitely submit frame by frame as it would be
software overhead (HW interrupting for each frame) resulting in video lags.

By this approach, it will allow slave device to submit multiple frames at
once.

Srikanth

>
> Thanks,
> Jassi
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-18 11:28       ` Srikanth Thokala
  0 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-18 11:28 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
> On 15 February 2014 17:30, Srikanth Thokala <sthokal@xilinx.com> wrote:
>> The current implementation of interleaved DMA API support multiple
>> frames only when the memory is contiguous by incrementing src_start/
>> dst_start members of interleaved template.
>>
>> But, when the memory is non-contiguous it will restrict slave device
>> to not submit multiple frames in a batch.  This patch handles this
>> issue by allowing the slave device to send array of interleaved dma
>> templates each having a different memory location.
>>
> How fragmented could be memory in your case? Is it inefficient to
> submit separate transfers for each segment/frame?
> It will help if you could give a typical example (chunk size and gap
> in bytes) of what you worry about.

With scatter-gather engine feature in the hardware, submitting separate
transfers for each frame look inefficient. As an example, our DMA engine
supports up to 16 video frames, with each frame (a typical video frame
size) being contiguous in memory but frames are scattered into different
locations. We could not definitely submit frame by frame as it would be
software overhead (HW interrupting for each frame) resulting in video lags.

By this approach, it will allow slave device to submit multiple frames at
once.

Srikanth

>
> Thanks,
> Jassi
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
  2014-02-18 11:28       ` Srikanth Thokala
  (?)
@ 2014-02-18 16:50         ` Jassi Brar
  -1 siblings, 0 replies; 65+ messages in thread
From: Jassi Brar @ 2014-02-18 16:50 UTC (permalink / raw)
  To: Srikanth Thokala
  Cc: Williams, Dan J, Koul, Vinod, michal.simek, Grant Likely,
	robh+dt, devicetree, Levente Kurusa, Lars-Peter Clausen, lkml,
	dmaengine, Andy Shevchenko, linux-arm-kernel

On 18 February 2014 16:58, Srikanth Thokala <sthokal@xilinx.com> wrote:
> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
>> On 15 February 2014 17:30, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>> The current implementation of interleaved DMA API support multiple
>>> frames only when the memory is contiguous by incrementing src_start/
>>> dst_start members of interleaved template.
>>>
>>> But, when the memory is non-contiguous it will restrict slave device
>>> to not submit multiple frames in a batch.  This patch handles this
>>> issue by allowing the slave device to send array of interleaved dma
>>> templates each having a different memory location.
>>>
>> How fragmented could be memory in your case? Is it inefficient to
>> submit separate transfers for each segment/frame?
>> It will help if you could give a typical example (chunk size and gap
>> in bytes) of what you worry about.
>
> With scatter-gather engine feature in the hardware, submitting separate
> transfers for each frame look inefficient. As an example, our DMA engine
> supports up to 16 video frames, with each frame (a typical video frame
> size) being contiguous in memory but frames are scattered into different
> locations. We could not definitely submit frame by frame as it would be
> software overhead (HW interrupting for each frame) resulting in video lags.
>
IIUIC, it is 30fps and one dma interrupt per frame ... it doesn't seem
inefficient at all. Even poor-latency audio would generate a higher
interrupt-rate. So the "inefficiency concern" doesn't seem valid to
me.

Not to mean we shouldn't strive to reduce the interrupt-rate further.
Another option is to emulate the ring-buffer scheme of ALSA.... which
should be possible since for a session of video playback the frame
buffers' locations wouldn't change.

Yet another option is to use the full potential of the
interleaved-xfer api as such. It seems you confuse a 'video frame'
with the interleaved-xfer api's 'frame'. They are different.

Assuming your one video frame is F bytes long and Gk is the gap in
bytes between end of frame [k] and start of frame [k+1] and  Gi != Gj
for i!=j
In the context of interleaved-xfer api, you have just 1 Frame of 16
chunks. Each chunk is Fbytes and the inter-chunk-gap(ICG) is Gk  where
0<=k<15
So for your use-case .....
  dma_interleaved_template.numf = 1   /* just 1 frame */
  dma_interleaved_template.frame_size = 16  /* containing 16 chunks */
   ...... //other parameters

You have 3 options to choose from and all should work just as fine.
Otherwise please state your problem in real numbers (video-frames'
size, count & gap in bytes).

-Jassi

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-18 16:50         ` Jassi Brar
  0 siblings, 0 replies; 65+ messages in thread
From: Jassi Brar @ 2014-02-18 16:50 UTC (permalink / raw)
  To: Srikanth Thokala
  Cc: Williams, Dan J, Koul, Vinod, michal.simek, Grant Likely,
	robh+dt, devicetree, Levente Kurusa, Lars-Peter Clausen, lkml,
	dmaengine, Andy Shevchenko, linux-arm-kernel

On 18 February 2014 16:58, Srikanth Thokala <sthokal@xilinx.com> wrote:
> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
>> On 15 February 2014 17:30, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>> The current implementation of interleaved DMA API support multiple
>>> frames only when the memory is contiguous by incrementing src_start/
>>> dst_start members of interleaved template.
>>>
>>> But, when the memory is non-contiguous it will restrict slave device
>>> to not submit multiple frames in a batch.  This patch handles this
>>> issue by allowing the slave device to send array of interleaved dma
>>> templates each having a different memory location.
>>>
>> How fragmented could be memory in your case? Is it inefficient to
>> submit separate transfers for each segment/frame?
>> It will help if you could give a typical example (chunk size and gap
>> in bytes) of what you worry about.
>
> With scatter-gather engine feature in the hardware, submitting separate
> transfers for each frame look inefficient. As an example, our DMA engine
> supports up to 16 video frames, with each frame (a typical video frame
> size) being contiguous in memory but frames are scattered into different
> locations. We could not definitely submit frame by frame as it would be
> software overhead (HW interrupting for each frame) resulting in video lags.
>
IIUIC, it is 30fps and one dma interrupt per frame ... it doesn't seem
inefficient at all. Even poor-latency audio would generate a higher
interrupt-rate. So the "inefficiency concern" doesn't seem valid to
me.

Not to mean we shouldn't strive to reduce the interrupt-rate further.
Another option is to emulate the ring-buffer scheme of ALSA.... which
should be possible since for a session of video playback the frame
buffers' locations wouldn't change.

Yet another option is to use the full potential of the
interleaved-xfer api as such. It seems you confuse a 'video frame'
with the interleaved-xfer api's 'frame'. They are different.

Assuming your one video frame is F bytes long and Gk is the gap in
bytes between end of frame [k] and start of frame [k+1] and  Gi != Gj
for i!=j
In the context of interleaved-xfer api, you have just 1 Frame of 16
chunks. Each chunk is Fbytes and the inter-chunk-gap(ICG) is Gk  where
0<=k<15
So for your use-case .....
  dma_interleaved_template.numf = 1   /* just 1 frame */
  dma_interleaved_template.frame_size = 16  /* containing 16 chunks */
   ...... //other parameters

You have 3 options to choose from and all should work just as fine.
Otherwise please state your problem in real numbers (video-frames'
size, count & gap in bytes).

-Jassi

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-18 16:50         ` Jassi Brar
  0 siblings, 0 replies; 65+ messages in thread
From: Jassi Brar @ 2014-02-18 16:50 UTC (permalink / raw)
  To: linux-arm-kernel

On 18 February 2014 16:58, Srikanth Thokala <sthokal@xilinx.com> wrote:
> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
>> On 15 February 2014 17:30, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>> The current implementation of interleaved DMA API support multiple
>>> frames only when the memory is contiguous by incrementing src_start/
>>> dst_start members of interleaved template.
>>>
>>> But, when the memory is non-contiguous it will restrict slave device
>>> to not submit multiple frames in a batch.  This patch handles this
>>> issue by allowing the slave device to send array of interleaved dma
>>> templates each having a different memory location.
>>>
>> How fragmented could be memory in your case? Is it inefficient to
>> submit separate transfers for each segment/frame?
>> It will help if you could give a typical example (chunk size and gap
>> in bytes) of what you worry about.
>
> With scatter-gather engine feature in the hardware, submitting separate
> transfers for each frame look inefficient. As an example, our DMA engine
> supports up to 16 video frames, with each frame (a typical video frame
> size) being contiguous in memory but frames are scattered into different
> locations. We could not definitely submit frame by frame as it would be
> software overhead (HW interrupting for each frame) resulting in video lags.
>
IIUIC, it is 30fps and one dma interrupt per frame ... it doesn't seem
inefficient at all. Even poor-latency audio would generate a higher
interrupt-rate. So the "inefficiency concern" doesn't seem valid to
me.

Not to mean we shouldn't strive to reduce the interrupt-rate further.
Another option is to emulate the ring-buffer scheme of ALSA.... which
should be possible since for a session of video playback the frame
buffers' locations wouldn't change.

Yet another option is to use the full potential of the
interleaved-xfer api as such. It seems you confuse a 'video frame'
with the interleaved-xfer api's 'frame'. They are different.

Assuming your one video frame is F bytes long and Gk is the gap in
bytes between end of frame [k] and start of frame [k+1] and  Gi != Gj
for i!=j
In the context of interleaved-xfer api, you have just 1 Frame of 16
chunks. Each chunk is Fbytes and the inter-chunk-gap(ICG) is Gk  where
0<=k<15
So for your use-case .....
  dma_interleaved_template.numf = 1   /* just 1 frame */
  dma_interleaved_template.frame_size = 16  /* containing 16 chunks */
   ...... //other parameters

You have 3 options to choose from and all should work just as fine.
Otherwise please state your problem in real numbers (video-frames'
size, count & gap in bytes).

-Jassi

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
  2014-02-18 16:50         ` Jassi Brar
  (?)
@ 2014-02-18 17:46           ` Srikanth Thokala
  -1 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-18 17:46 UTC (permalink / raw)
  To: Jassi Brar
  Cc: Srikanth Thokala, Williams, Dan J, Koul, Vinod, michal.simek,
	Grant Likely, robh+dt, devicetree, Levente Kurusa,
	Lars-Peter Clausen, lkml, dmaengine, Andy Shevchenko,
	linux-arm-kernel

On Tue, Feb 18, 2014 at 10:20 PM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
> On 18 February 2014 16:58, Srikanth Thokala <sthokal@xilinx.com> wrote:
>> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
>>> On 15 February 2014 17:30, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>> The current implementation of interleaved DMA API support multiple
>>>> frames only when the memory is contiguous by incrementing src_start/
>>>> dst_start members of interleaved template.
>>>>
>>>> But, when the memory is non-contiguous it will restrict slave device
>>>> to not submit multiple frames in a batch.  This patch handles this
>>>> issue by allowing the slave device to send array of interleaved dma
>>>> templates each having a different memory location.
>>>>
>>> How fragmented could be memory in your case? Is it inefficient to
>>> submit separate transfers for each segment/frame?
>>> It will help if you could give a typical example (chunk size and gap
>>> in bytes) of what you worry about.
>>
>> With scatter-gather engine feature in the hardware, submitting separate
>> transfers for each frame look inefficient. As an example, our DMA engine
>> supports up to 16 video frames, with each frame (a typical video frame
>> size) being contiguous in memory but frames are scattered into different
>> locations. We could not definitely submit frame by frame as it would be
>> software overhead (HW interrupting for each frame) resulting in video lags.
>>
> IIUIC, it is 30fps and one dma interrupt per frame ... it doesn't seem
> inefficient at all. Even poor-latency audio would generate a higher
> interrupt-rate. So the "inefficiency concern" doesn't seem valid to
> me.
>
> Not to mean we shouldn't strive to reduce the interrupt-rate further.
> Another option is to emulate the ring-buffer scheme of ALSA.... which
> should be possible since for a session of video playback the frame
> buffers' locations wouldn't change.
>
> Yet another option is to use the full potential of the
> interleaved-xfer api as such. It seems you confuse a 'video frame'
> with the interleaved-xfer api's 'frame'. They are different.
>
> Assuming your one video frame is F bytes long and Gk is the gap in
> bytes between end of frame [k] and start of frame [k+1] and  Gi != Gj
> for i!=j
> In the context of interleaved-xfer api, you have just 1 Frame of 16
> chunks. Each chunk is Fbytes and the inter-chunk-gap(ICG) is Gk  where
> 0<=k<15
> So for your use-case .....
>   dma_interleaved_template.numf = 1   /* just 1 frame */
>   dma_interleaved_template.frame_size = 16  /* containing 16 chunks */
>    ...... //other parameters
>
> You have 3 options to choose from and all should work just as fine.
> Otherwise please state your problem in real numbers (video-frames'
> size, count & gap in bytes).

Initially I interpreted interleaved template the same.  But, Lars corrected me
in the subsequent discussion and let me put it here briefly,

In the interleaved template, each frame represents a line of size denoted by
chunk.size and the stride by icg.  'numf' represent number of frames i.e.
number of lines.

In video frame context,
chunk.size -> hsize
chunk.icg -> stride
numf -> vsize
and frame_size is always 1 as it will have only one chunk in a line.

So, the API would not allow to pass multiple frames and we came up with a
resolution to pass an array of interleaved template structs to handle this.

Srikanth

>
> -Jassi
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-18 17:46           ` Srikanth Thokala
  0 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-18 17:46 UTC (permalink / raw)
  To: Jassi Brar
  Cc: Srikanth Thokala, Williams, Dan J, Koul, Vinod, michal.simek,
	Grant Likely, robh+dt, devicetree, Levente Kurusa,
	Lars-Peter Clausen, lkml, dmaengine, Andy Shevchenko,
	linux-arm-kernel

On Tue, Feb 18, 2014 at 10:20 PM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
> On 18 February 2014 16:58, Srikanth Thokala <sthokal@xilinx.com> wrote:
>> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
>>> On 15 February 2014 17:30, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>> The current implementation of interleaved DMA API support multiple
>>>> frames only when the memory is contiguous by incrementing src_start/
>>>> dst_start members of interleaved template.
>>>>
>>>> But, when the memory is non-contiguous it will restrict slave device
>>>> to not submit multiple frames in a batch.  This patch handles this
>>>> issue by allowing the slave device to send array of interleaved dma
>>>> templates each having a different memory location.
>>>>
>>> How fragmented could be memory in your case? Is it inefficient to
>>> submit separate transfers for each segment/frame?
>>> It will help if you could give a typical example (chunk size and gap
>>> in bytes) of what you worry about.
>>
>> With scatter-gather engine feature in the hardware, submitting separate
>> transfers for each frame look inefficient. As an example, our DMA engine
>> supports up to 16 video frames, with each frame (a typical video frame
>> size) being contiguous in memory but frames are scattered into different
>> locations. We could not definitely submit frame by frame as it would be
>> software overhead (HW interrupting for each frame) resulting in video lags.
>>
> IIUIC, it is 30fps and one dma interrupt per frame ... it doesn't seem
> inefficient at all. Even poor-latency audio would generate a higher
> interrupt-rate. So the "inefficiency concern" doesn't seem valid to
> me.
>
> Not to mean we shouldn't strive to reduce the interrupt-rate further.
> Another option is to emulate the ring-buffer scheme of ALSA.... which
> should be possible since for a session of video playback the frame
> buffers' locations wouldn't change.
>
> Yet another option is to use the full potential of the
> interleaved-xfer api as such. It seems you confuse a 'video frame'
> with the interleaved-xfer api's 'frame'. They are different.
>
> Assuming your one video frame is F bytes long and Gk is the gap in
> bytes between end of frame [k] and start of frame [k+1] and  Gi != Gj
> for i!=j
> In the context of interleaved-xfer api, you have just 1 Frame of 16
> chunks. Each chunk is Fbytes and the inter-chunk-gap(ICG) is Gk  where
> 0<=k<15
> So for your use-case .....
>   dma_interleaved_template.numf = 1   /* just 1 frame */
>   dma_interleaved_template.frame_size = 16  /* containing 16 chunks */
>    ...... //other parameters
>
> You have 3 options to choose from and all should work just as fine.
> Otherwise please state your problem in real numbers (video-frames'
> size, count & gap in bytes).

Initially I interpreted interleaved template the same.  But, Lars corrected me
in the subsequent discussion and let me put it here briefly,

In the interleaved template, each frame represents a line of size denoted by
chunk.size and the stride by icg.  'numf' represent number of frames i.e.
number of lines.

In video frame context,
chunk.size -> hsize
chunk.icg -> stride
numf -> vsize
and frame_size is always 1 as it will have only one chunk in a line.

So, the API would not allow to pass multiple frames and we came up with a
resolution to pass an array of interleaved template structs to handle this.

Srikanth

>
> -Jassi
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-18 17:46           ` Srikanth Thokala
  0 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-18 17:46 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Feb 18, 2014 at 10:20 PM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
> On 18 February 2014 16:58, Srikanth Thokala <sthokal@xilinx.com> wrote:
>> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
>>> On 15 February 2014 17:30, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>> The current implementation of interleaved DMA API support multiple
>>>> frames only when the memory is contiguous by incrementing src_start/
>>>> dst_start members of interleaved template.
>>>>
>>>> But, when the memory is non-contiguous it will restrict slave device
>>>> to not submit multiple frames in a batch.  This patch handles this
>>>> issue by allowing the slave device to send array of interleaved dma
>>>> templates each having a different memory location.
>>>>
>>> How fragmented could be memory in your case? Is it inefficient to
>>> submit separate transfers for each segment/frame?
>>> It will help if you could give a typical example (chunk size and gap
>>> in bytes) of what you worry about.
>>
>> With scatter-gather engine feature in the hardware, submitting separate
>> transfers for each frame look inefficient. As an example, our DMA engine
>> supports up to 16 video frames, with each frame (a typical video frame
>> size) being contiguous in memory but frames are scattered into different
>> locations. We could not definitely submit frame by frame as it would be
>> software overhead (HW interrupting for each frame) resulting in video lags.
>>
> IIUIC, it is 30fps and one dma interrupt per frame ... it doesn't seem
> inefficient at all. Even poor-latency audio would generate a higher
> interrupt-rate. So the "inefficiency concern" doesn't seem valid to
> me.
>
> Not to mean we shouldn't strive to reduce the interrupt-rate further.
> Another option is to emulate the ring-buffer scheme of ALSA.... which
> should be possible since for a session of video playback the frame
> buffers' locations wouldn't change.
>
> Yet another option is to use the full potential of the
> interleaved-xfer api as such. It seems you confuse a 'video frame'
> with the interleaved-xfer api's 'frame'. They are different.
>
> Assuming your one video frame is F bytes long and Gk is the gap in
> bytes between end of frame [k] and start of frame [k+1] and  Gi != Gj
> for i!=j
> In the context of interleaved-xfer api, you have just 1 Frame of 16
> chunks. Each chunk is Fbytes and the inter-chunk-gap(ICG) is Gk  where
> 0<=k<15
> So for your use-case .....
>   dma_interleaved_template.numf = 1   /* just 1 frame */
>   dma_interleaved_template.frame_size = 16  /* containing 16 chunks */
>    ...... //other parameters
>
> You have 3 options to choose from and all should work just as fine.
> Otherwise please state your problem in real numbers (video-frames'
> size, count & gap in bytes).

Initially I interpreted interleaved template the same.  But, Lars corrected me
in the subsequent discussion and let me put it here briefly,

In the interleaved template, each frame represents a line of size denoted by
chunk.size and the stride by icg.  'numf' represent number of frames i.e.
number of lines.

In video frame context,
chunk.size -> hsize
chunk.icg -> stride
numf -> vsize
and frame_size is always 1 as it will have only one chunk in a line.

So, the API would not allow to pass multiple frames and we came up with a
resolution to pass an array of interleaved template structs to handle this.

Srikanth

>
> -Jassi
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
  2014-02-18 17:46           ` Srikanth Thokala
  (?)
@ 2014-02-18 19:03             ` Jassi Brar
  -1 siblings, 0 replies; 65+ messages in thread
From: Jassi Brar @ 2014-02-18 19:03 UTC (permalink / raw)
  To: Srikanth Thokala
  Cc: Williams, Dan J, Koul, Vinod, michal.simek, Grant Likely,
	robh+dt, devicetree, Levente Kurusa, Lars-Peter Clausen, lkml,
	dmaengine, Andy Shevchenko, linux-arm-kernel

On 18 February 2014 23:16, Srikanth Thokala <sthokal@xilinx.com> wrote:
> On Tue, Feb 18, 2014 at 10:20 PM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
>> On 18 February 2014 16:58, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
>>>> On 15 February 2014 17:30, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>>> The current implementation of interleaved DMA API support multiple
>>>>> frames only when the memory is contiguous by incrementing src_start/
>>>>> dst_start members of interleaved template.
>>>>>
>>>>> But, when the memory is non-contiguous it will restrict slave device
>>>>> to not submit multiple frames in a batch.  This patch handles this
>>>>> issue by allowing the slave device to send array of interleaved dma
>>>>> templates each having a different memory location.
>>>>>
>>>> How fragmented could be memory in your case? Is it inefficient to
>>>> submit separate transfers for each segment/frame?
>>>> It will help if you could give a typical example (chunk size and gap
>>>> in bytes) of what you worry about.
>>>
>>> With scatter-gather engine feature in the hardware, submitting separate
>>> transfers for each frame look inefficient. As an example, our DMA engine
>>> supports up to 16 video frames, with each frame (a typical video frame
>>> size) being contiguous in memory but frames are scattered into different
>>> locations. We could not definitely submit frame by frame as it would be
>>> software overhead (HW interrupting for each frame) resulting in video lags.
>>>
>> IIUIC, it is 30fps and one dma interrupt per frame ... it doesn't seem
>> inefficient at all. Even poor-latency audio would generate a higher
>> interrupt-rate. So the "inefficiency concern" doesn't seem valid to
>> me.
>>
>> Not to mean we shouldn't strive to reduce the interrupt-rate further.
>> Another option is to emulate the ring-buffer scheme of ALSA.... which
>> should be possible since for a session of video playback the frame
>> buffers' locations wouldn't change.
>>
>> Yet another option is to use the full potential of the
>> interleaved-xfer api as such. It seems you confuse a 'video frame'
>> with the interleaved-xfer api's 'frame'. They are different.
>>
>> Assuming your one video frame is F bytes long and Gk is the gap in
>> bytes between end of frame [k] and start of frame [k+1] and  Gi != Gj
>> for i!=j
>> In the context of interleaved-xfer api, you have just 1 Frame of 16
>> chunks. Each chunk is Fbytes and the inter-chunk-gap(ICG) is Gk  where
>> 0<=k<15
>> So for your use-case .....
>>   dma_interleaved_template.numf = 1   /* just 1 frame */
>>   dma_interleaved_template.frame_size = 16  /* containing 16 chunks */
>>    ...... //other parameters
>>
>> You have 3 options to choose from and all should work just as fine.
>> Otherwise please state your problem in real numbers (video-frames'
>> size, count & gap in bytes).
>
> Initially I interpreted interleaved template the same.  But, Lars corrected me
> in the subsequent discussion and let me put it here briefly,
>
> In the interleaved template, each frame represents a line of size denoted by
> chunk.size and the stride by icg.  'numf' represent number of frames i.e.
> number of lines.
>
> In video frame context,
> chunk.size -> hsize
> chunk.icg -> stride
> numf -> vsize
> and frame_size is always 1 as it will have only one chunk in a line.
>
But you said in your last post
  "with each frame (a typical video frame size) being contiguous in memory"
 ... which is not true from what you write above. Anyways, my first 2
suggestions still hold.

> So, the API would not allow to pass multiple frames and we came up with a
> resolution to pass an array of interleaved template structs to handle this.
>
Yeah the API doesn't allow such xfers that don't fall into any
'regular expression' of a transfer and also because no controller
natively supports such xfers -- your controller will break your
request up into 16 transfers and program them individually, right?
   BTW if you insist you could still express the 16 video frames as 1
interleaved-xfer frame with frame_size = (vsize + 1) * 16   ;)

Again, I would suggest you implement ring-buffer type scheme. Say
prepare 16 interleaved xfer templates and queue them. Upon each
xfer-done callback (i.e frame rendered), update the data and queue it
back. It might be much simpler for your actual case. At 30fps, 33ms to
queue a dma request should _not_ result in any frame-drop.

-Jassi

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-18 19:03             ` Jassi Brar
  0 siblings, 0 replies; 65+ messages in thread
From: Jassi Brar @ 2014-02-18 19:03 UTC (permalink / raw)
  To: Srikanth Thokala
  Cc: Williams, Dan J, Koul, Vinod, michal.simek, Grant Likely,
	robh+dt, devicetree, Levente Kurusa, Lars-Peter Clausen, lkml,
	dmaengine, Andy Shevchenko, linux-arm-kernel

On 18 February 2014 23:16, Srikanth Thokala <sthokal@xilinx.com> wrote:
> On Tue, Feb 18, 2014 at 10:20 PM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
>> On 18 February 2014 16:58, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
>>>> On 15 February 2014 17:30, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>>> The current implementation of interleaved DMA API support multiple
>>>>> frames only when the memory is contiguous by incrementing src_start/
>>>>> dst_start members of interleaved template.
>>>>>
>>>>> But, when the memory is non-contiguous it will restrict slave device
>>>>> to not submit multiple frames in a batch.  This patch handles this
>>>>> issue by allowing the slave device to send array of interleaved dma
>>>>> templates each having a different memory location.
>>>>>
>>>> How fragmented could be memory in your case? Is it inefficient to
>>>> submit separate transfers for each segment/frame?
>>>> It will help if you could give a typical example (chunk size and gap
>>>> in bytes) of what you worry about.
>>>
>>> With scatter-gather engine feature in the hardware, submitting separate
>>> transfers for each frame look inefficient. As an example, our DMA engine
>>> supports up to 16 video frames, with each frame (a typical video frame
>>> size) being contiguous in memory but frames are scattered into different
>>> locations. We could not definitely submit frame by frame as it would be
>>> software overhead (HW interrupting for each frame) resulting in video lags.
>>>
>> IIUIC, it is 30fps and one dma interrupt per frame ... it doesn't seem
>> inefficient at all. Even poor-latency audio would generate a higher
>> interrupt-rate. So the "inefficiency concern" doesn't seem valid to
>> me.
>>
>> Not to mean we shouldn't strive to reduce the interrupt-rate further.
>> Another option is to emulate the ring-buffer scheme of ALSA.... which
>> should be possible since for a session of video playback the frame
>> buffers' locations wouldn't change.
>>
>> Yet another option is to use the full potential of the
>> interleaved-xfer api as such. It seems you confuse a 'video frame'
>> with the interleaved-xfer api's 'frame'. They are different.
>>
>> Assuming your one video frame is F bytes long and Gk is the gap in
>> bytes between end of frame [k] and start of frame [k+1] and  Gi != Gj
>> for i!=j
>> In the context of interleaved-xfer api, you have just 1 Frame of 16
>> chunks. Each chunk is Fbytes and the inter-chunk-gap(ICG) is Gk  where
>> 0<=k<15
>> So for your use-case .....
>>   dma_interleaved_template.numf = 1   /* just 1 frame */
>>   dma_interleaved_template.frame_size = 16  /* containing 16 chunks */
>>    ...... //other parameters
>>
>> You have 3 options to choose from and all should work just as fine.
>> Otherwise please state your problem in real numbers (video-frames'
>> size, count & gap in bytes).
>
> Initially I interpreted interleaved template the same.  But, Lars corrected me
> in the subsequent discussion and let me put it here briefly,
>
> In the interleaved template, each frame represents a line of size denoted by
> chunk.size and the stride by icg.  'numf' represent number of frames i.e.
> number of lines.
>
> In video frame context,
> chunk.size -> hsize
> chunk.icg -> stride
> numf -> vsize
> and frame_size is always 1 as it will have only one chunk in a line.
>
But you said in your last post
  "with each frame (a typical video frame size) being contiguous in memory"
 ... which is not true from what you write above. Anyways, my first 2
suggestions still hold.

> So, the API would not allow to pass multiple frames and we came up with a
> resolution to pass an array of interleaved template structs to handle this.
>
Yeah the API doesn't allow such xfers that don't fall into any
'regular expression' of a transfer and also because no controller
natively supports such xfers -- your controller will break your
request up into 16 transfers and program them individually, right?
   BTW if you insist you could still express the 16 video frames as 1
interleaved-xfer frame with frame_size = (vsize + 1) * 16   ;)

Again, I would suggest you implement ring-buffer type scheme. Say
prepare 16 interleaved xfer templates and queue them. Upon each
xfer-done callback (i.e frame rendered), update the data and queue it
back. It might be much simpler for your actual case. At 30fps, 33ms to
queue a dma request should _not_ result in any frame-drop.

-Jassi

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-18 19:03             ` Jassi Brar
  0 siblings, 0 replies; 65+ messages in thread
From: Jassi Brar @ 2014-02-18 19:03 UTC (permalink / raw)
  To: linux-arm-kernel

On 18 February 2014 23:16, Srikanth Thokala <sthokal@xilinx.com> wrote:
> On Tue, Feb 18, 2014 at 10:20 PM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
>> On 18 February 2014 16:58, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
>>>> On 15 February 2014 17:30, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>>> The current implementation of interleaved DMA API support multiple
>>>>> frames only when the memory is contiguous by incrementing src_start/
>>>>> dst_start members of interleaved template.
>>>>>
>>>>> But, when the memory is non-contiguous it will restrict slave device
>>>>> to not submit multiple frames in a batch.  This patch handles this
>>>>> issue by allowing the slave device to send array of interleaved dma
>>>>> templates each having a different memory location.
>>>>>
>>>> How fragmented could be memory in your case? Is it inefficient to
>>>> submit separate transfers for each segment/frame?
>>>> It will help if you could give a typical example (chunk size and gap
>>>> in bytes) of what you worry about.
>>>
>>> With scatter-gather engine feature in the hardware, submitting separate
>>> transfers for each frame look inefficient. As an example, our DMA engine
>>> supports up to 16 video frames, with each frame (a typical video frame
>>> size) being contiguous in memory but frames are scattered into different
>>> locations. We could not definitely submit frame by frame as it would be
>>> software overhead (HW interrupting for each frame) resulting in video lags.
>>>
>> IIUIC, it is 30fps and one dma interrupt per frame ... it doesn't seem
>> inefficient at all. Even poor-latency audio would generate a higher
>> interrupt-rate. So the "inefficiency concern" doesn't seem valid to
>> me.
>>
>> Not to mean we shouldn't strive to reduce the interrupt-rate further.
>> Another option is to emulate the ring-buffer scheme of ALSA.... which
>> should be possible since for a session of video playback the frame
>> buffers' locations wouldn't change.
>>
>> Yet another option is to use the full potential of the
>> interleaved-xfer api as such. It seems you confuse a 'video frame'
>> with the interleaved-xfer api's 'frame'. They are different.
>>
>> Assuming your one video frame is F bytes long and Gk is the gap in
>> bytes between end of frame [k] and start of frame [k+1] and  Gi != Gj
>> for i!=j
>> In the context of interleaved-xfer api, you have just 1 Frame of 16
>> chunks. Each chunk is Fbytes and the inter-chunk-gap(ICG) is Gk  where
>> 0<=k<15
>> So for your use-case .....
>>   dma_interleaved_template.numf = 1   /* just 1 frame */
>>   dma_interleaved_template.frame_size = 16  /* containing 16 chunks */
>>    ...... //other parameters
>>
>> You have 3 options to choose from and all should work just as fine.
>> Otherwise please state your problem in real numbers (video-frames'
>> size, count & gap in bytes).
>
> Initially I interpreted interleaved template the same.  But, Lars corrected me
> in the subsequent discussion and let me put it here briefly,
>
> In the interleaved template, each frame represents a line of size denoted by
> chunk.size and the stride by icg.  'numf' represent number of frames i.e.
> number of lines.
>
> In video frame context,
> chunk.size -> hsize
> chunk.icg -> stride
> numf -> vsize
> and frame_size is always 1 as it will have only one chunk in a line.
>
But you said in your last post
  "with each frame (a typical video frame size) being contiguous in memory"
 ... which is not true from what you write above. Anyways, my first 2
suggestions still hold.

> So, the API would not allow to pass multiple frames and we came up with a
> resolution to pass an array of interleaved template structs to handle this.
>
Yeah the API doesn't allow such xfers that don't fall into any
'regular expression' of a transfer and also because no controller
natively supports such xfers -- your controller will break your
request up into 16 transfers and program them individually, right?
   BTW if you insist you could still express the 16 video frames as 1
interleaved-xfer frame with frame_size = (vsize + 1) * 16   ;)

Again, I would suggest you implement ring-buffer type scheme. Say
prepare 16 interleaved xfer templates and queue them. Upon each
xfer-done callback (i.e frame rendered), update the data and queue it
back. It might be much simpler for your actual case. At 30fps, 33ms to
queue a dma request should _not_ result in any frame-drop.

-Jassi

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-20  9:24               ` Srikanth Thokala
  0 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-20  9:24 UTC (permalink / raw)
  To: Jassi Brar
  Cc: Srikanth Thokala, Williams, Dan J, Koul, Vinod, Michal Simek,
	Grant Likely, Rob Herring, devicetree, Levente Kurusa,
	Lars-Peter Clausen, lkml, dmaengine, Andy Shevchenko,
	linux-arm-kernel

On Wed, Feb 19, 2014 at 12:33 AM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
> On 18 February 2014 23:16, Srikanth Thokala <sthokal@xilinx.com> wrote:
>> On Tue, Feb 18, 2014 at 10:20 PM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
>>> On 18 February 2014 16:58, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
>>>>> On 15 February 2014 17:30, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>>>> The current implementation of interleaved DMA API support multiple
>>>>>> frames only when the memory is contiguous by incrementing src_start/
>>>>>> dst_start members of interleaved template.
>>>>>>
>>>>>> But, when the memory is non-contiguous it will restrict slave device
>>>>>> to not submit multiple frames in a batch.  This patch handles this
>>>>>> issue by allowing the slave device to send array of interleaved dma
>>>>>> templates each having a different memory location.
>>>>>>
>>>>> How fragmented could be memory in your case? Is it inefficient to
>>>>> submit separate transfers for each segment/frame?
>>>>> It will help if you could give a typical example (chunk size and gap
>>>>> in bytes) of what you worry about.
>>>>
>>>> With scatter-gather engine feature in the hardware, submitting separate
>>>> transfers for each frame look inefficient. As an example, our DMA engine
>>>> supports up to 16 video frames, with each frame (a typical video frame
>>>> size) being contiguous in memory but frames are scattered into different
>>>> locations. We could not definitely submit frame by frame as it would be
>>>> software overhead (HW interrupting for each frame) resulting in video lags.
>>>>
>>> IIUIC, it is 30fps and one dma interrupt per frame ... it doesn't seem
>>> inefficient at all. Even poor-latency audio would generate a higher
>>> interrupt-rate. So the "inefficiency concern" doesn't seem valid to
>>> me.
>>>
>>> Not to mean we shouldn't strive to reduce the interrupt-rate further.
>>> Another option is to emulate the ring-buffer scheme of ALSA.... which
>>> should be possible since for a session of video playback the frame
>>> buffers' locations wouldn't change.
>>>
>>> Yet another option is to use the full potential of the
>>> interleaved-xfer api as such. It seems you confuse a 'video frame'
>>> with the interleaved-xfer api's 'frame'. They are different.
>>>
>>> Assuming your one video frame is F bytes long and Gk is the gap in
>>> bytes between end of frame [k] and start of frame [k+1] and  Gi != Gj
>>> for i!=j
>>> In the context of interleaved-xfer api, you have just 1 Frame of 16
>>> chunks. Each chunk is Fbytes and the inter-chunk-gap(ICG) is Gk  where
>>> 0<=k<15
>>> So for your use-case .....
>>>   dma_interleaved_template.numf = 1   /* just 1 frame */
>>>   dma_interleaved_template.frame_size = 16  /* containing 16 chunks */
>>>    ...... //other parameters
>>>
>>> You have 3 options to choose from and all should work just as fine.
>>> Otherwise please state your problem in real numbers (video-frames'
>>> size, count & gap in bytes).
>>
>> Initially I interpreted interleaved template the same.  But, Lars corrected me
>> in the subsequent discussion and let me put it here briefly,
>>
>> In the interleaved template, each frame represents a line of size denoted by
>> chunk.size and the stride by icg.  'numf' represent number of frames i.e.
>> number of lines.
>>
>> In video frame context,
>> chunk.size -> hsize
>> chunk.icg -> stride
>> numf -> vsize
>> and frame_size is always 1 as it will have only one chunk in a line.
>>
> But you said in your last post
>   "with each frame (a typical video frame size) being contiguous in memory"
>  ... which is not true from what you write above. Anyways, my first 2
> suggestions still hold.

Yes, each video frame is contiguous and they can be scattered.

>
>> So, the API would not allow to pass multiple frames and we came up with a
>> resolution to pass an array of interleaved template structs to handle this.
>>
> Yeah the API doesn't allow such xfers that don't fall into any
> 'regular expression' of a transfer and also because no controller
> natively supports such xfers -- your controller will break your
> request up into 16 transfers and program them individually, right?

No, it will not program individually.  It has a scatter-gather engine where we
could update the current pointer to the first frame and tail pointer to the last
frame and hardware does the transfer and raise an interrupt when all the frames
are completed (ie. when current reaches tail).

>    BTW if you insist you could still express the 16 video frames as 1
> interleaved-xfer frame with frame_size = (vsize + 1) * 16   ;)
>
> Again, I would suggest you implement ring-buffer type scheme. Say
> prepare 16 interleaved xfer templates and queue them. Upon each
> xfer-done callback (i.e frame rendered), update the data and queue it
> back. It might be much simpler for your actual case. At 30fps, 33ms to
> queue a dma request should _not_ result in any frame-drop.

The driver has similar implementation, where each desc (handling 16 frames)
is pushed to pending queue and then to done queue.  Slave device still can
add desc's to the pending queue and whenever the transfer of 16 frames is
completed we move each desc to done list.

Srikanth

>
> -Jassi
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-20  9:24               ` Srikanth Thokala
  0 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-20  9:24 UTC (permalink / raw)
  To: Jassi Brar
  Cc: Srikanth Thokala, Williams, Dan J, Koul, Vinod, Michal Simek,
	Grant Likely, Rob Herring, devicetree-u79uwXL29TY76Z2rM5mHXA,
	Levente Kurusa, Lars-Peter Clausen, lkml,
	dmaengine-u79uwXL29TY76Z2rM5mHXA, Andy Shevchenko,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Wed, Feb 19, 2014 at 12:33 AM, Jassi Brar <jaswinder.singh-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org> wrote:
> On 18 February 2014 23:16, Srikanth Thokala <sthokal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org> wrote:
>> On Tue, Feb 18, 2014 at 10:20 PM, Jassi Brar <jaswinder.singh-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org> wrote:
>>> On 18 February 2014 16:58, Srikanth Thokala <sthokal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org> wrote:
>>>> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar <jaswinder.singh-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org> wrote:
>>>>> On 15 February 2014 17:30, Srikanth Thokala <sthokal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>> The current implementation of interleaved DMA API support multiple
>>>>>> frames only when the memory is contiguous by incrementing src_start/
>>>>>> dst_start members of interleaved template.
>>>>>>
>>>>>> But, when the memory is non-contiguous it will restrict slave device
>>>>>> to not submit multiple frames in a batch.  This patch handles this
>>>>>> issue by allowing the slave device to send array of interleaved dma
>>>>>> templates each having a different memory location.
>>>>>>
>>>>> How fragmented could be memory in your case? Is it inefficient to
>>>>> submit separate transfers for each segment/frame?
>>>>> It will help if you could give a typical example (chunk size and gap
>>>>> in bytes) of what you worry about.
>>>>
>>>> With scatter-gather engine feature in the hardware, submitting separate
>>>> transfers for each frame look inefficient. As an example, our DMA engine
>>>> supports up to 16 video frames, with each frame (a typical video frame
>>>> size) being contiguous in memory but frames are scattered into different
>>>> locations. We could not definitely submit frame by frame as it would be
>>>> software overhead (HW interrupting for each frame) resulting in video lags.
>>>>
>>> IIUIC, it is 30fps and one dma interrupt per frame ... it doesn't seem
>>> inefficient at all. Even poor-latency audio would generate a higher
>>> interrupt-rate. So the "inefficiency concern" doesn't seem valid to
>>> me.
>>>
>>> Not to mean we shouldn't strive to reduce the interrupt-rate further.
>>> Another option is to emulate the ring-buffer scheme of ALSA.... which
>>> should be possible since for a session of video playback the frame
>>> buffers' locations wouldn't change.
>>>
>>> Yet another option is to use the full potential of the
>>> interleaved-xfer api as such. It seems you confuse a 'video frame'
>>> with the interleaved-xfer api's 'frame'. They are different.
>>>
>>> Assuming your one video frame is F bytes long and Gk is the gap in
>>> bytes between end of frame [k] and start of frame [k+1] and  Gi != Gj
>>> for i!=j
>>> In the context of interleaved-xfer api, you have just 1 Frame of 16
>>> chunks. Each chunk is Fbytes and the inter-chunk-gap(ICG) is Gk  where
>>> 0<=k<15
>>> So for your use-case .....
>>>   dma_interleaved_template.numf = 1   /* just 1 frame */
>>>   dma_interleaved_template.frame_size = 16  /* containing 16 chunks */
>>>    ...... //other parameters
>>>
>>> You have 3 options to choose from and all should work just as fine.
>>> Otherwise please state your problem in real numbers (video-frames'
>>> size, count & gap in bytes).
>>
>> Initially I interpreted interleaved template the same.  But, Lars corrected me
>> in the subsequent discussion and let me put it here briefly,
>>
>> In the interleaved template, each frame represents a line of size denoted by
>> chunk.size and the stride by icg.  'numf' represent number of frames i.e.
>> number of lines.
>>
>> In video frame context,
>> chunk.size -> hsize
>> chunk.icg -> stride
>> numf -> vsize
>> and frame_size is always 1 as it will have only one chunk in a line.
>>
> But you said in your last post
>   "with each frame (a typical video frame size) being contiguous in memory"
>  ... which is not true from what you write above. Anyways, my first 2
> suggestions still hold.

Yes, each video frame is contiguous and they can be scattered.

>
>> So, the API would not allow to pass multiple frames and we came up with a
>> resolution to pass an array of interleaved template structs to handle this.
>>
> Yeah the API doesn't allow such xfers that don't fall into any
> 'regular expression' of a transfer and also because no controller
> natively supports such xfers -- your controller will break your
> request up into 16 transfers and program them individually, right?

No, it will not program individually.  It has a scatter-gather engine where we
could update the current pointer to the first frame and tail pointer to the last
frame and hardware does the transfer and raise an interrupt when all the frames
are completed (ie. when current reaches tail).

>    BTW if you insist you could still express the 16 video frames as 1
> interleaved-xfer frame with frame_size = (vsize + 1) * 16   ;)
>
> Again, I would suggest you implement ring-buffer type scheme. Say
> prepare 16 interleaved xfer templates and queue them. Upon each
> xfer-done callback (i.e frame rendered), update the data and queue it
> back. It might be much simpler for your actual case. At 30fps, 33ms to
> queue a dma request should _not_ result in any frame-drop.

The driver has similar implementation, where each desc (handling 16 frames)
is pushed to pending queue and then to done queue.  Slave device still can
add desc's to the pending queue and whenever the transfer of 16 frames is
completed we move each desc to done list.

Srikanth

>
> -Jassi
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-20  9:24               ` Srikanth Thokala
  0 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-20  9:24 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Feb 19, 2014 at 12:33 AM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
> On 18 February 2014 23:16, Srikanth Thokala <sthokal@xilinx.com> wrote:
>> On Tue, Feb 18, 2014 at 10:20 PM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
>>> On 18 February 2014 16:58, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
>>>>> On 15 February 2014 17:30, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>>>> The current implementation of interleaved DMA API support multiple
>>>>>> frames only when the memory is contiguous by incrementing src_start/
>>>>>> dst_start members of interleaved template.
>>>>>>
>>>>>> But, when the memory is non-contiguous it will restrict slave device
>>>>>> to not submit multiple frames in a batch.  This patch handles this
>>>>>> issue by allowing the slave device to send array of interleaved dma
>>>>>> templates each having a different memory location.
>>>>>>
>>>>> How fragmented could be memory in your case? Is it inefficient to
>>>>> submit separate transfers for each segment/frame?
>>>>> It will help if you could give a typical example (chunk size and gap
>>>>> in bytes) of what you worry about.
>>>>
>>>> With scatter-gather engine feature in the hardware, submitting separate
>>>> transfers for each frame look inefficient. As an example, our DMA engine
>>>> supports up to 16 video frames, with each frame (a typical video frame
>>>> size) being contiguous in memory but frames are scattered into different
>>>> locations. We could not definitely submit frame by frame as it would be
>>>> software overhead (HW interrupting for each frame) resulting in video lags.
>>>>
>>> IIUIC, it is 30fps and one dma interrupt per frame ... it doesn't seem
>>> inefficient at all. Even poor-latency audio would generate a higher
>>> interrupt-rate. So the "inefficiency concern" doesn't seem valid to
>>> me.
>>>
>>> Not to mean we shouldn't strive to reduce the interrupt-rate further.
>>> Another option is to emulate the ring-buffer scheme of ALSA.... which
>>> should be possible since for a session of video playback the frame
>>> buffers' locations wouldn't change.
>>>
>>> Yet another option is to use the full potential of the
>>> interleaved-xfer api as such. It seems you confuse a 'video frame'
>>> with the interleaved-xfer api's 'frame'. They are different.
>>>
>>> Assuming your one video frame is F bytes long and Gk is the gap in
>>> bytes between end of frame [k] and start of frame [k+1] and  Gi != Gj
>>> for i!=j
>>> In the context of interleaved-xfer api, you have just 1 Frame of 16
>>> chunks. Each chunk is Fbytes and the inter-chunk-gap(ICG) is Gk  where
>>> 0<=k<15
>>> So for your use-case .....
>>>   dma_interleaved_template.numf = 1   /* just 1 frame */
>>>   dma_interleaved_template.frame_size = 16  /* containing 16 chunks */
>>>    ...... //other parameters
>>>
>>> You have 3 options to choose from and all should work just as fine.
>>> Otherwise please state your problem in real numbers (video-frames'
>>> size, count & gap in bytes).
>>
>> Initially I interpreted interleaved template the same.  But, Lars corrected me
>> in the subsequent discussion and let me put it here briefly,
>>
>> In the interleaved template, each frame represents a line of size denoted by
>> chunk.size and the stride by icg.  'numf' represent number of frames i.e.
>> number of lines.
>>
>> In video frame context,
>> chunk.size -> hsize
>> chunk.icg -> stride
>> numf -> vsize
>> and frame_size is always 1 as it will have only one chunk in a line.
>>
> But you said in your last post
>   "with each frame (a typical video frame size) being contiguous in memory"
>  ... which is not true from what you write above. Anyways, my first 2
> suggestions still hold.

Yes, each video frame is contiguous and they can be scattered.

>
>> So, the API would not allow to pass multiple frames and we came up with a
>> resolution to pass an array of interleaved template structs to handle this.
>>
> Yeah the API doesn't allow such xfers that don't fall into any
> 'regular expression' of a transfer and also because no controller
> natively supports such xfers -- your controller will break your
> request up into 16 transfers and program them individually, right?

No, it will not program individually.  It has a scatter-gather engine where we
could update the current pointer to the first frame and tail pointer to the last
frame and hardware does the transfer and raise an interrupt when all the frames
are completed (ie. when current reaches tail).

>    BTW if you insist you could still express the 16 video frames as 1
> interleaved-xfer frame with frame_size = (vsize + 1) * 16   ;)
>
> Again, I would suggest you implement ring-buffer type scheme. Say
> prepare 16 interleaved xfer templates and queue them. Upon each
> xfer-done callback (i.e frame rendered), update the data and queue it
> back. It might be much simpler for your actual case. At 30fps, 33ms to
> queue a dma request should _not_ result in any frame-drop.

The driver has similar implementation, where each desc (handling 16 frames)
is pushed to pending queue and then to done queue.  Slave device still can
add desc's to the pending queue and whenever the transfer of 16 frames is
completed we move each desc to done list.

Srikanth

>
> -Jassi
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-20  9:53                 ` Jassi Brar
  0 siblings, 0 replies; 65+ messages in thread
From: Jassi Brar @ 2014-02-20  9:53 UTC (permalink / raw)
  To: Srikanth Thokala
  Cc: Williams, Dan J, Koul, Vinod, Michal Simek, Grant Likely,
	Rob Herring, devicetree, Levente Kurusa, Lars-Peter Clausen,
	lkml, dmaengine, Andy Shevchenko, linux-arm-kernel

On 20 February 2014 14:54, Srikanth Thokala <sthokal@xilinx.com> wrote:
> On Wed, Feb 19, 2014 at 12:33 AM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
>> On 18 February 2014 23:16, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>> On Tue, Feb 18, 2014 at 10:20 PM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
>>>> On 18 February 2014 16:58, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>>> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
>>>>>> On 15 February 2014 17:30, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>>>>> The current implementation of interleaved DMA API support multiple
>>>>>>> frames only when the memory is contiguous by incrementing src_start/
>>>>>>> dst_start members of interleaved template.
>>>>>>>
>>>>>>> But, when the memory is non-contiguous it will restrict slave device
>>>>>>> to not submit multiple frames in a batch.  This patch handles this
>>>>>>> issue by allowing the slave device to send array of interleaved dma
>>>>>>> templates each having a different memory location.
>>>>>>>
>>>>>> How fragmented could be memory in your case? Is it inefficient to
>>>>>> submit separate transfers for each segment/frame?
>>>>>> It will help if you could give a typical example (chunk size and gap
>>>>>> in bytes) of what you worry about.
>>>>>
>>>>> With scatter-gather engine feature in the hardware, submitting separate
>>>>> transfers for each frame look inefficient. As an example, our DMA engine
>>>>> supports up to 16 video frames, with each frame (a typical video frame
>>>>> size) being contiguous in memory but frames are scattered into different
>>>>> locations. We could not definitely submit frame by frame as it would be
>>>>> software overhead (HW interrupting for each frame) resulting in video lags.
>>>>>
>>>> IIUIC, it is 30fps and one dma interrupt per frame ... it doesn't seem
>>>> inefficient at all. Even poor-latency audio would generate a higher
>>>> interrupt-rate. So the "inefficiency concern" doesn't seem valid to
>>>> me.
>>>>
>>>> Not to mean we shouldn't strive to reduce the interrupt-rate further.
>>>> Another option is to emulate the ring-buffer scheme of ALSA.... which
>>>> should be possible since for a session of video playback the frame
>>>> buffers' locations wouldn't change.
>>>>
>>>> Yet another option is to use the full potential of the
>>>> interleaved-xfer api as such. It seems you confuse a 'video frame'
>>>> with the interleaved-xfer api's 'frame'. They are different.
>>>>
>>>> Assuming your one video frame is F bytes long and Gk is the gap in
>>>> bytes between end of frame [k] and start of frame [k+1] and  Gi != Gj
>>>> for i!=j
>>>> In the context of interleaved-xfer api, you have just 1 Frame of 16
>>>> chunks. Each chunk is Fbytes and the inter-chunk-gap(ICG) is Gk  where
>>>> 0<=k<15
>>>> So for your use-case .....
>>>>   dma_interleaved_template.numf = 1   /* just 1 frame */
>>>>   dma_interleaved_template.frame_size = 16  /* containing 16 chunks */
>>>>    ...... //other parameters
>>>>
>>>> You have 3 options to choose from and all should work just as fine.
>>>> Otherwise please state your problem in real numbers (video-frames'
>>>> size, count & gap in bytes).
>>>
>>> Initially I interpreted interleaved template the same.  But, Lars corrected me
>>> in the subsequent discussion and let me put it here briefly,
>>>
>>> In the interleaved template, each frame represents a line of size denoted by
>>> chunk.size and the stride by icg.  'numf' represent number of frames i.e.
>>> number of lines.
>>>
>>> In video frame context,
>>> chunk.size -> hsize
>>> chunk.icg -> stride
>>> numf -> vsize
>>> and frame_size is always 1 as it will have only one chunk in a line.
>>>
>> But you said in your last post
>>   "with each frame (a typical video frame size) being contiguous in memory"
>>  ... which is not true from what you write above. Anyways, my first 2
>> suggestions still hold.
>
> Yes, each video frame is contiguous and they can be scattered.
>
I assume by contiguous frame you mean as in framebuffer?  Which is an
array of bytes.
If yes, then you should do as I suggest first,  frame_size=16  and numf=1.

If no, then it seems you are already doing the right thing.... the
ring-buffer scheme. Please share some stats how the current api is
causing you overhead because that is a very common case (many
controllers support LLI) and you have 467ms (@30fps with 16-frames
ring-buffer) to queue in before you see any frame drop.

Regards,
Jassi

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-20  9:53                 ` Jassi Brar
  0 siblings, 0 replies; 65+ messages in thread
From: Jassi Brar @ 2014-02-20  9:53 UTC (permalink / raw)
  To: Srikanth Thokala
  Cc: Williams, Dan J, Koul, Vinod, Michal Simek, Grant Likely,
	Rob Herring, devicetree-u79uwXL29TY76Z2rM5mHXA, Levente Kurusa,
	Lars-Peter Clausen, lkml, dmaengine-u79uwXL29TY76Z2rM5mHXA,
	Andy Shevchenko,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On 20 February 2014 14:54, Srikanth Thokala <sthokal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org> wrote:
> On Wed, Feb 19, 2014 at 12:33 AM, Jassi Brar <jaswinder.singh-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org> wrote:
>> On 18 February 2014 23:16, Srikanth Thokala <sthokal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org> wrote:
>>> On Tue, Feb 18, 2014 at 10:20 PM, Jassi Brar <jaswinder.singh-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org> wrote:
>>>> On 18 February 2014 16:58, Srikanth Thokala <sthokal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar <jaswinder.singh-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org> wrote:
>>>>>> On 15 February 2014 17:30, Srikanth Thokala <sthokal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>>> The current implementation of interleaved DMA API support multiple
>>>>>>> frames only when the memory is contiguous by incrementing src_start/
>>>>>>> dst_start members of interleaved template.
>>>>>>>
>>>>>>> But, when the memory is non-contiguous it will restrict slave device
>>>>>>> to not submit multiple frames in a batch.  This patch handles this
>>>>>>> issue by allowing the slave device to send array of interleaved dma
>>>>>>> templates each having a different memory location.
>>>>>>>
>>>>>> How fragmented could be memory in your case? Is it inefficient to
>>>>>> submit separate transfers for each segment/frame?
>>>>>> It will help if you could give a typical example (chunk size and gap
>>>>>> in bytes) of what you worry about.
>>>>>
>>>>> With scatter-gather engine feature in the hardware, submitting separate
>>>>> transfers for each frame look inefficient. As an example, our DMA engine
>>>>> supports up to 16 video frames, with each frame (a typical video frame
>>>>> size) being contiguous in memory but frames are scattered into different
>>>>> locations. We could not definitely submit frame by frame as it would be
>>>>> software overhead (HW interrupting for each frame) resulting in video lags.
>>>>>
>>>> IIUIC, it is 30fps and one dma interrupt per frame ... it doesn't seem
>>>> inefficient at all. Even poor-latency audio would generate a higher
>>>> interrupt-rate. So the "inefficiency concern" doesn't seem valid to
>>>> me.
>>>>
>>>> Not to mean we shouldn't strive to reduce the interrupt-rate further.
>>>> Another option is to emulate the ring-buffer scheme of ALSA.... which
>>>> should be possible since for a session of video playback the frame
>>>> buffers' locations wouldn't change.
>>>>
>>>> Yet another option is to use the full potential of the
>>>> interleaved-xfer api as such. It seems you confuse a 'video frame'
>>>> with the interleaved-xfer api's 'frame'. They are different.
>>>>
>>>> Assuming your one video frame is F bytes long and Gk is the gap in
>>>> bytes between end of frame [k] and start of frame [k+1] and  Gi != Gj
>>>> for i!=j
>>>> In the context of interleaved-xfer api, you have just 1 Frame of 16
>>>> chunks. Each chunk is Fbytes and the inter-chunk-gap(ICG) is Gk  where
>>>> 0<=k<15
>>>> So for your use-case .....
>>>>   dma_interleaved_template.numf = 1   /* just 1 frame */
>>>>   dma_interleaved_template.frame_size = 16  /* containing 16 chunks */
>>>>    ...... //other parameters
>>>>
>>>> You have 3 options to choose from and all should work just as fine.
>>>> Otherwise please state your problem in real numbers (video-frames'
>>>> size, count & gap in bytes).
>>>
>>> Initially I interpreted interleaved template the same.  But, Lars corrected me
>>> in the subsequent discussion and let me put it here briefly,
>>>
>>> In the interleaved template, each frame represents a line of size denoted by
>>> chunk.size and the stride by icg.  'numf' represent number of frames i.e.
>>> number of lines.
>>>
>>> In video frame context,
>>> chunk.size -> hsize
>>> chunk.icg -> stride
>>> numf -> vsize
>>> and frame_size is always 1 as it will have only one chunk in a line.
>>>
>> But you said in your last post
>>   "with each frame (a typical video frame size) being contiguous in memory"
>>  ... which is not true from what you write above. Anyways, my first 2
>> suggestions still hold.
>
> Yes, each video frame is contiguous and they can be scattered.
>
I assume by contiguous frame you mean as in framebuffer?  Which is an
array of bytes.
If yes, then you should do as I suggest first,  frame_size=16  and numf=1.

If no, then it seems you are already doing the right thing.... the
ring-buffer scheme. Please share some stats how the current api is
causing you overhead because that is a very common case (many
controllers support LLI) and you have 467ms (@30fps with 16-frames
ring-buffer) to queue in before you see any frame drop.

Regards,
Jassi
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-20  9:53                 ` Jassi Brar
  0 siblings, 0 replies; 65+ messages in thread
From: Jassi Brar @ 2014-02-20  9:53 UTC (permalink / raw)
  To: linux-arm-kernel

On 20 February 2014 14:54, Srikanth Thokala <sthokal@xilinx.com> wrote:
> On Wed, Feb 19, 2014 at 12:33 AM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
>> On 18 February 2014 23:16, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>> On Tue, Feb 18, 2014 at 10:20 PM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
>>>> On 18 February 2014 16:58, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>>> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
>>>>>> On 15 February 2014 17:30, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>>>>> The current implementation of interleaved DMA API support multiple
>>>>>>> frames only when the memory is contiguous by incrementing src_start/
>>>>>>> dst_start members of interleaved template.
>>>>>>>
>>>>>>> But, when the memory is non-contiguous it will restrict slave device
>>>>>>> to not submit multiple frames in a batch.  This patch handles this
>>>>>>> issue by allowing the slave device to send array of interleaved dma
>>>>>>> templates each having a different memory location.
>>>>>>>
>>>>>> How fragmented could be memory in your case? Is it inefficient to
>>>>>> submit separate transfers for each segment/frame?
>>>>>> It will help if you could give a typical example (chunk size and gap
>>>>>> in bytes) of what you worry about.
>>>>>
>>>>> With scatter-gather engine feature in the hardware, submitting separate
>>>>> transfers for each frame look inefficient. As an example, our DMA engine
>>>>> supports up to 16 video frames, with each frame (a typical video frame
>>>>> size) being contiguous in memory but frames are scattered into different
>>>>> locations. We could not definitely submit frame by frame as it would be
>>>>> software overhead (HW interrupting for each frame) resulting in video lags.
>>>>>
>>>> IIUIC, it is 30fps and one dma interrupt per frame ... it doesn't seem
>>>> inefficient at all. Even poor-latency audio would generate a higher
>>>> interrupt-rate. So the "inefficiency concern" doesn't seem valid to
>>>> me.
>>>>
>>>> Not to mean we shouldn't strive to reduce the interrupt-rate further.
>>>> Another option is to emulate the ring-buffer scheme of ALSA.... which
>>>> should be possible since for a session of video playback the frame
>>>> buffers' locations wouldn't change.
>>>>
>>>> Yet another option is to use the full potential of the
>>>> interleaved-xfer api as such. It seems you confuse a 'video frame'
>>>> with the interleaved-xfer api's 'frame'. They are different.
>>>>
>>>> Assuming your one video frame is F bytes long and Gk is the gap in
>>>> bytes between end of frame [k] and start of frame [k+1] and  Gi != Gj
>>>> for i!=j
>>>> In the context of interleaved-xfer api, you have just 1 Frame of 16
>>>> chunks. Each chunk is Fbytes and the inter-chunk-gap(ICG) is Gk  where
>>>> 0<=k<15
>>>> So for your use-case .....
>>>>   dma_interleaved_template.numf = 1   /* just 1 frame */
>>>>   dma_interleaved_template.frame_size = 16  /* containing 16 chunks */
>>>>    ...... //other parameters
>>>>
>>>> You have 3 options to choose from and all should work just as fine.
>>>> Otherwise please state your problem in real numbers (video-frames'
>>>> size, count & gap in bytes).
>>>
>>> Initially I interpreted interleaved template the same.  But, Lars corrected me
>>> in the subsequent discussion and let me put it here briefly,
>>>
>>> In the interleaved template, each frame represents a line of size denoted by
>>> chunk.size and the stride by icg.  'numf' represent number of frames i.e.
>>> number of lines.
>>>
>>> In video frame context,
>>> chunk.size -> hsize
>>> chunk.icg -> stride
>>> numf -> vsize
>>> and frame_size is always 1 as it will have only one chunk in a line.
>>>
>> But you said in your last post
>>   "with each frame (a typical video frame size) being contiguous in memory"
>>  ... which is not true from what you write above. Anyways, my first 2
>> suggestions still hold.
>
> Yes, each video frame is contiguous and they can be scattered.
>
I assume by contiguous frame you mean as in framebuffer?  Which is an
array of bytes.
If yes, then you should do as I suggest first,  frame_size=16  and numf=1.

If no, then it seems you are already doing the right thing.... the
ring-buffer scheme. Please share some stats how the current api is
causing you overhead because that is a very common case (many
controllers support LLI) and you have 467ms (@30fps with 16-frames
ring-buffer) to queue in before you see any frame drop.

Regards,
Jassi

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
  2014-02-20  9:53                 ` Jassi Brar
  (?)
  (?)
@ 2014-02-21 14:37                 ` Srikanth Thokala
  2014-02-24  2:09                     ` Jassi Brar
  -1 siblings, 1 reply; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-21 14:37 UTC (permalink / raw)
  To: Jassi Brar
  Cc: Srikanth Thokala, Williams, Dan J, Koul, Vinod, Michal Simek,
	Grant Likely, Rob Herring, devicetree, Levente Kurusa,
	Lars-Peter Clausen, lkml, dmaengine, Andy Shevchenko,
	linux-arm-kernel

[-- Attachment #1: Type: text/plain, Size: 7380 bytes --]

On Thu, Feb 20, 2014 at 3:23 PM, Jassi Brar <jaswinder.singh@linaro.org>
wrote:
> On 20 February 2014 14:54, Srikanth Thokala <sthokal@xilinx.com> wrote:
>> On Wed, Feb 19, 2014 at 12:33 AM, Jassi Brar <jaswinder.singh@linaro.org>
wrote:
>>> On 18 February 2014 23:16, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>> On Tue, Feb 18, 2014 at 10:20 PM, Jassi Brar <
jaswinder.singh@linaro.org> wrote:
>>>>> On 18 February 2014 16:58, Srikanth Thokala <sthokal@xilinx.com>
wrote:
>>>>>> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar <
jaswinder.singh@linaro.org> wrote:
>>>>>>> On 15 February 2014 17:30, Srikanth Thokala <sthokal@xilinx.com>
wrote:
>>>>>>>> The current implementation of interleaved DMA API support multiple
>>>>>>>> frames only when the memory is contiguous by incrementing
src_start/
>>>>>>>> dst_start members of interleaved template.
>>>>>>>>
>>>>>>>> But, when the memory is non-contiguous it will restrict slave
device
>>>>>>>> to not submit multiple frames in a batch.  This patch handles this
>>>>>>>> issue by allowing the slave device to send array of interleaved dma
>>>>>>>> templates each having a different memory location.
>>>>>>>>
>>>>>>> How fragmented could be memory in your case? Is it inefficient to
>>>>>>> submit separate transfers for each segment/frame?
>>>>>>> It will help if you could give a typical example (chunk size and gap
>>>>>>> in bytes) of what you worry about.
>>>>>>
>>>>>> With scatter-gather engine feature in the hardware, submitting
separate
>>>>>> transfers for each frame look inefficient. As an example, our DMA
engine
>>>>>> supports up to 16 video frames, with each frame (a typical video
frame
>>>>>> size) being contiguous in memory but frames are scattered into
different
>>>>>> locations. We could not definitely submit frame by frame as it would
be
>>>>>> software overhead (HW interrupting for each frame) resulting in
video lags.
>>>>>>
>>>>> IIUIC, it is 30fps and one dma interrupt per frame ... it doesn't seem
>>>>> inefficient at all. Even poor-latency audio would generate a higher
>>>>> interrupt-rate. So the "inefficiency concern" doesn't seem valid to
>>>>> me.
>>>>>
>>>>> Not to mean we shouldn't strive to reduce the interrupt-rate further.
>>>>> Another option is to emulate the ring-buffer scheme of ALSA.... which
>>>>> should be possible since for a session of video playback the frame
>>>>> buffers' locations wouldn't change.
>>>>>
>>>>> Yet another option is to use the full potential of the
>>>>> interleaved-xfer api as such. It seems you confuse a 'video frame'
>>>>> with the interleaved-xfer api's 'frame'. They are different.
>>>>>
>>>>> Assuming your one video frame is F bytes long and Gk is the gap in
>>>>> bytes between end of frame [k] and start of frame [k+1] and  Gi != Gj
>>>>> for i!=j
>>>>> In the context of interleaved-xfer api, you have just 1 Frame of 16
>>>>> chunks. Each chunk is Fbytes and the inter-chunk-gap(ICG) is Gk  where
>>>>> 0<=k<15
>>>>> So for your use-case .....
>>>>>   dma_interleaved_template.numf = 1   /* just 1 frame */
>>>>>   dma_interleaved_template.frame_size = 16  /* containing 16 chunks */
>>>>>    ...... //other parameters
>>>>>
>>>>> You have 3 options to choose from and all should work just as fine.
>>>>> Otherwise please state your problem in real numbers (video-frames'
>>>>> size, count & gap in bytes).
>>>>
>>>> Initially I interpreted interleaved template the same.  But, Lars
corrected me
>>>> in the subsequent discussion and let me put it here briefly,
>>>>
>>>> In the interleaved template, each frame represents a line of size
denoted by
>>>> chunk.size and the stride by icg.  'numf' represent number of frames
i.e.
>>>> number of lines.
>>>>
>>>> In video frame context,
>>>> chunk.size -> hsize
>>>> chunk.icg -> stride
>>>> numf -> vsize
>>>> and frame_size is always 1 as it will have only one chunk in a line.
>>>>
>>> But you said in your last post
>>>   "with each frame (a typical video frame size) being contiguous in
memory"
>>>  ... which is not true from what you write above. Anyways, my first 2
>>> suggestions still hold.
>>
>> Yes, each video frame is contiguous and they can be scattered.
>>
> I assume by contiguous frame you mean as in framebuffer?  Which is an
> array of bytes.
> If yes, then you should do as I suggest first,  frame_size=16  and numf=1.

I think am confusing you.  I would like to explain with an example.  Lets
say
each video frame is 4k size starting at address 0x10004000 (-0x10005000) and
other frame at 0x20002000 (-0x20003000), and so on.  So, the frames are
scattered in memory and as the template doesnt allow multiple src_start/
dst_start we could not use single template to fill the HW descriptors (of
frames).  So, I feel your suggestion might not work if the frames are
scattered.
Also, how could we get 'vsize' value in your approach?  More importantly,
we are overriding the semantics of interleaved template members.

>
> If no, then it seems you are already doing the right thing.... the
> ring-buffer scheme. Please share some stats how the current api is
> causing you overhead because that is a very common case (many
> controllers support LLI) and you have 467ms (@30fps with 16-frames
> ring-buffer) to queue in before you see any frame drop.

As I mentioned earlier in the thread, our hardware has a SG engine where by
we could send multiple frames in a batch.  Using the original implementation
of interleaved API, we have three options to transfer.

One is to send frame by frame to the hardware.  We get a async_tx desc for
each frame and then we submit it to hardware triggering it to transfer this
BD.
We queue the next descriptor on the pending queue and whenever there is
a completion interrupt we submit the next BD on this queue to hardware. In
this implementation we are not efficiently using the SG engine in the
hardware
as we transferring frame by frame even HW allows us to transfer multiple
frames.

The second option is to queue all the BDs until the maximum frames that HW
is
capable to transfer in a batch and then submit to SG engine in the HW.  With
this approach I feel there will be additional software overhead to track
the number
of maximum transfers and few additional cycles to release the cookie of
each desc.
Here, each desc represents a frame.

The last option is the current implementation of the driver along with this
change in
API.  It will allow us to send array of interleaved templates wherein we
could allocate
a single async desc which will handle multiple frames (or segments) and
just submit
this desc to HW.  Then we program the current to first frame, tail to last
frame and
HW will complete this transfer.  Here, each desc represents multiple frames.

My point here is the driver should use hardware resources efficiently and I
feel the
driver will be inefficient if we dont use them.

I request you to please go through our hardware specification to get an
overview of
implementation and provide me your suggestions.  Here is the link,
http://www.xilinx.com/support/documentation/ip_documentation/axi_vdma/v6_1/pg020_axi_vdma.pdf

Srikanth

>
> Regards,
> Jassi
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

[-- Attachment #2: Type: text/html, Size: 10223 bytes --]

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-24  2:09                     ` Jassi Brar
  0 siblings, 0 replies; 65+ messages in thread
From: Jassi Brar @ 2014-02-24  2:09 UTC (permalink / raw)
  To: Srikanth Thokala
  Cc: Williams, Dan J, Koul, Vinod, Michal Simek, Grant Likely,
	Rob Herring, devicetree, Levente Kurusa, Lars-Peter Clausen,
	lkml, dmaengine, Andy Shevchenko, linux-arm-kernel

On 21 February 2014 23:37, Srikanth Thokala <sthokal@xilinx.com> wrote:
> On Thu, Feb 20, 2014 at 3:23 PM, Jassi Brar <jaswinder.singh@linaro.org>
> wrote:
>> On 20 February 2014 14:54, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>> On Wed, Feb 19, 2014 at 12:33 AM, Jassi Brar <jaswinder.singh@linaro.org>
>>> wrote:
>>>> On 18 February 2014 23:16, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>>> On Tue, Feb 18, 2014 at 10:20 PM, Jassi Brar
>>>>> <jaswinder.singh@linaro.org> wrote:
>>>>>> On 18 February 2014 16:58, Srikanth Thokala <sthokal@xilinx.com>
>>>>>> wrote:
>>>>>>> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar
>>>>>>> <jaswinder.singh@linaro.org> wrote:
>>>>>>>> On 15 February 2014 17:30, Srikanth Thokala <sthokal@xilinx.com>
>>>>>>>> wrote:
>>>>>>>>> The current implementation of interleaved DMA API support multiple
>>>>>>>>> frames only when the memory is contiguous by incrementing
>>>>>>>>> src_start/
>>>>>>>>> dst_start members of interleaved template.
>>>>>>>>>
>>>>>>>>> But, when the memory is non-contiguous it will restrict slave
>>>>>>>>> device
>>>>>>>>> to not submit multiple frames in a batch.  This patch handles this
>>>>>>>>> issue by allowing the slave device to send array of interleaved dma
>>>>>>>>> templates each having a different memory location.
>>>>>>>>>
>>>>>>>> How fragmented could be memory in your case? Is it inefficient to
>>>>>>>> submit separate transfers for each segment/frame?
>>>>>>>> It will help if you could give a typical example (chunk size and gap
>>>>>>>> in bytes) of what you worry about.
>>>>>>>
>>>>>>> With scatter-gather engine feature in the hardware, submitting
>>>>>>> separate
>>>>>>> transfers for each frame look inefficient. As an example, our DMA
>>>>>>> engine
>>>>>>> supports up to 16 video frames, with each frame (a typical video
>>>>>>> frame
>>>>>>> size) being contiguous in memory but frames are scattered into
>>>>>>> different
>>>>>>> locations. We could not definitely submit frame by frame as it would
>>>>>>> be
>>>>>>> software overhead (HW interrupting for each frame) resulting in video
>>>>>>> lags.
>>>>>>>
>>>>>> IIUIC, it is 30fps and one dma interrupt per frame ... it doesn't seem
>>>>>> inefficient at all. Even poor-latency audio would generate a higher
>>>>>> interrupt-rate. So the "inefficiency concern" doesn't seem valid to
>>>>>> me.
>>>>>>
>>>>>> Not to mean we shouldn't strive to reduce the interrupt-rate further.
>>>>>> Another option is to emulate the ring-buffer scheme of ALSA.... which
>>>>>> should be possible since for a session of video playback the frame
>>>>>> buffers' locations wouldn't change.
>>>>>>
>>>>>> Yet another option is to use the full potential of the
>>>>>> interleaved-xfer api as such. It seems you confuse a 'video frame'
>>>>>> with the interleaved-xfer api's 'frame'. They are different.
>>>>>>
>>>>>> Assuming your one video frame is F bytes long and Gk is the gap in
>>>>>> bytes between end of frame [k] and start of frame [k+1] and  Gi != Gj
>>>>>> for i!=j
>>>>>> In the context of interleaved-xfer api, you have just 1 Frame of 16
>>>>>> chunks. Each chunk is Fbytes and the inter-chunk-gap(ICG) is Gk  where
>>>>>> 0<=k<15
>>>>>> So for your use-case .....
>>>>>>   dma_interleaved_template.numf = 1   /* just 1 frame */
>>>>>>   dma_interleaved_template.frame_size = 16  /* containing 16 chunks */
>>>>>>    ...... //other parameters
>>>>>>
>>>>>> You have 3 options to choose from and all should work just as fine.
>>>>>> Otherwise please state your problem in real numbers (video-frames'
>>>>>> size, count & gap in bytes).
>>>>>
>>>>> Initially I interpreted interleaved template the same.  But, Lars
>>>>> corrected me
>>>>> in the subsequent discussion and let me put it here briefly,
>>>>>
>>>>> In the interleaved template, each frame represents a line of size
>>>>> denoted by
>>>>> chunk.size and the stride by icg.  'numf' represent number of frames
>>>>> i.e.
>>>>> number of lines.
>>>>>
>>>>> In video frame context,
>>>>> chunk.size -> hsize
>>>>> chunk.icg -> stride
>>>>> numf -> vsize
>>>>> and frame_size is always 1 as it will have only one chunk in a line.
>>>>>
>>>> But you said in your last post
>>>>   "with each frame (a typical video frame size) being contiguous in
>>>> memory"
>>>>  ... which is not true from what you write above. Anyways, my first 2
>>>> suggestions still hold.
>>>
>>> Yes, each video frame is contiguous and they can be scattered.
>>>
>> I assume by contiguous frame you mean as in framebuffer?  Which is an
>> array of bytes.
>> If yes, then you should do as I suggest first,  frame_size=16  and numf=1.
>
> I think am confusing you.  I would like to explain with an example.  Lets
> say
> each video frame is 4k size starting at address 0x10004000 (-0x10005000) and
> other frame at 0x20002000 (-0x20003000), and so on.
>
As I said plz dont confuse video frame with DMA frame.... in video
frame the stride is constant(zero or not) whereas in DMA context the
stride must be zero for the frame to be called contiguous.

> So, the frames are
> scattered in memory and as the template doesnt allow multiple src_start/
> dst_start we could not use single template to fill the HW descriptors (of
> frames).  So, I feel your suggestion might not work if the frames are
> scattered.
> Also, how could we get 'vsize' value in your approach?
>
The client driver(video driver) should know the frame parameters.
Practically you'll have to populate 16 transfer templates (frame
attributes and locations won't change for a session) and submit to be
transferred _cyclically_.

>  More importantly,
> we are overriding the semantics of interleaved template members.
>
Not at all. Interleaved-dma isn't meant for only constant stride/icg
transfers. Rather it's for identical frames with random strides.

>
>>
>> If no, then it seems you are already doing the right thing.... the
>> ring-buffer scheme. Please share some stats how the current api is
>> causing you overhead because that is a very common case (many
>> controllers support LLI) and you have 467ms (@30fps with 16-frames
>> ring-buffer) to queue in before you see any frame drop.
>
> As I mentioned earlier in the thread, our hardware has a SG engine where by
> we could send multiple frames in a batch.  Using the original implementation
> of interleaved API, we have three options to transfer.
>
> One is to send frame by frame to the hardware.  We get a async_tx desc for
> each frame and then we submit it to hardware triggering it to transfer this
> BD.
> We queue the next descriptor on the pending queue and whenever there is
> a completion interrupt we submit the next BD on this queue to hardware. In
> this implementation we are not efficiently using the SG engine in the
> hardware
> as we transferring frame by frame even HW allows us to transfer multiple
> frames.
>
Sending one frame at a time will likely cause jitters. That shouldn't be done.

> The second option is to queue all the BDs until the maximum frames that HW
> is
> capable to transfer in a batch and then submit to SG engine in the HW.  With
> this approach I feel there will be additional software overhead to track the
> number
> of maximum transfers and few additional cycles to release the cookie of each
> desc.
> Here, each desc represents a frame.
>
APIs are written for the gcd of h/w. We should optimize only for
majority and here isn't even anything gained after changing the api.
What you want to 'optimize' has been there since ever. Nobody
considers that overhead. BTW you don't even want to spend a few 'extra
cycles' but what about every other platform that doesn't support this
and will have to scan for such transfer requests?

> The last option is the current implementation of the driver along with this
> change in
> API.  It will allow us to send array of interleaved templates wherein we
> could allocate
> a single async desc which will handle multiple frames (or segments) and just
> submit
> this desc to HW.  Then we program the current to first frame, tail to last
> frame and
> HW will complete this transfer.  Here, each desc represents multiple frames.
>
> My point here is the driver should use hardware resources efficiently and I
> feel the
> driver will be inefficient if we dont use them.
>
I am afraid you are getting carried away by the 'awesomeness' of your
hardware. RingBuffers/Cyclic transfers  are meant for cases just like
yours.
Consider the following ... if you queue 16 frames and don't care to
track before all are transmitted, you'll have very high latency. Video
seeks will take unacceptably long and give the impression of a slow
system. Whereas if you get callbacks for each frame rendered, you
could updates frames from the next one thereby having very quick
response time.

Anyways there are already ways to achieve what you want (however a bad
idea that might be). So I am not convinced the change to api is any
useful (your hardware won't run any more efficiently with the change).

Regards,
Jassi

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-24  2:09                     ` Jassi Brar
  0 siblings, 0 replies; 65+ messages in thread
From: Jassi Brar @ 2014-02-24  2:09 UTC (permalink / raw)
  To: Srikanth Thokala
  Cc: Williams, Dan J, Koul, Vinod, Michal Simek, Grant Likely,
	Rob Herring, devicetree-u79uwXL29TY76Z2rM5mHXA, Levente Kurusa,
	Lars-Peter Clausen, lkml, dmaengine-u79uwXL29TY76Z2rM5mHXA,
	Andy Shevchenko,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On 21 February 2014 23:37, Srikanth Thokala <sthokal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org> wrote:
> On Thu, Feb 20, 2014 at 3:23 PM, Jassi Brar <jaswinder.singh-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>
> wrote:
>> On 20 February 2014 14:54, Srikanth Thokala <sthokal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org> wrote:
>>> On Wed, Feb 19, 2014 at 12:33 AM, Jassi Brar <jaswinder.singh-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>
>>> wrote:
>>>> On 18 February 2014 23:16, Srikanth Thokala <sthokal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>> On Tue, Feb 18, 2014 at 10:20 PM, Jassi Brar
>>>>> <jaswinder.singh-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org> wrote:
>>>>>> On 18 February 2014 16:58, Srikanth Thokala <sthokal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org>
>>>>>> wrote:
>>>>>>> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar
>>>>>>> <jaswinder.singh-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org> wrote:
>>>>>>>> On 15 February 2014 17:30, Srikanth Thokala <sthokal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org>
>>>>>>>> wrote:
>>>>>>>>> The current implementation of interleaved DMA API support multiple
>>>>>>>>> frames only when the memory is contiguous by incrementing
>>>>>>>>> src_start/
>>>>>>>>> dst_start members of interleaved template.
>>>>>>>>>
>>>>>>>>> But, when the memory is non-contiguous it will restrict slave
>>>>>>>>> device
>>>>>>>>> to not submit multiple frames in a batch.  This patch handles this
>>>>>>>>> issue by allowing the slave device to send array of interleaved dma
>>>>>>>>> templates each having a different memory location.
>>>>>>>>>
>>>>>>>> How fragmented could be memory in your case? Is it inefficient to
>>>>>>>> submit separate transfers for each segment/frame?
>>>>>>>> It will help if you could give a typical example (chunk size and gap
>>>>>>>> in bytes) of what you worry about.
>>>>>>>
>>>>>>> With scatter-gather engine feature in the hardware, submitting
>>>>>>> separate
>>>>>>> transfers for each frame look inefficient. As an example, our DMA
>>>>>>> engine
>>>>>>> supports up to 16 video frames, with each frame (a typical video
>>>>>>> frame
>>>>>>> size) being contiguous in memory but frames are scattered into
>>>>>>> different
>>>>>>> locations. We could not definitely submit frame by frame as it would
>>>>>>> be
>>>>>>> software overhead (HW interrupting for each frame) resulting in video
>>>>>>> lags.
>>>>>>>
>>>>>> IIUIC, it is 30fps and one dma interrupt per frame ... it doesn't seem
>>>>>> inefficient at all. Even poor-latency audio would generate a higher
>>>>>> interrupt-rate. So the "inefficiency concern" doesn't seem valid to
>>>>>> me.
>>>>>>
>>>>>> Not to mean we shouldn't strive to reduce the interrupt-rate further.
>>>>>> Another option is to emulate the ring-buffer scheme of ALSA.... which
>>>>>> should be possible since for a session of video playback the frame
>>>>>> buffers' locations wouldn't change.
>>>>>>
>>>>>> Yet another option is to use the full potential of the
>>>>>> interleaved-xfer api as such. It seems you confuse a 'video frame'
>>>>>> with the interleaved-xfer api's 'frame'. They are different.
>>>>>>
>>>>>> Assuming your one video frame is F bytes long and Gk is the gap in
>>>>>> bytes between end of frame [k] and start of frame [k+1] and  Gi != Gj
>>>>>> for i!=j
>>>>>> In the context of interleaved-xfer api, you have just 1 Frame of 16
>>>>>> chunks. Each chunk is Fbytes and the inter-chunk-gap(ICG) is Gk  where
>>>>>> 0<=k<15
>>>>>> So for your use-case .....
>>>>>>   dma_interleaved_template.numf = 1   /* just 1 frame */
>>>>>>   dma_interleaved_template.frame_size = 16  /* containing 16 chunks */
>>>>>>    ...... //other parameters
>>>>>>
>>>>>> You have 3 options to choose from and all should work just as fine.
>>>>>> Otherwise please state your problem in real numbers (video-frames'
>>>>>> size, count & gap in bytes).
>>>>>
>>>>> Initially I interpreted interleaved template the same.  But, Lars
>>>>> corrected me
>>>>> in the subsequent discussion and let me put it here briefly,
>>>>>
>>>>> In the interleaved template, each frame represents a line of size
>>>>> denoted by
>>>>> chunk.size and the stride by icg.  'numf' represent number of frames
>>>>> i.e.
>>>>> number of lines.
>>>>>
>>>>> In video frame context,
>>>>> chunk.size -> hsize
>>>>> chunk.icg -> stride
>>>>> numf -> vsize
>>>>> and frame_size is always 1 as it will have only one chunk in a line.
>>>>>
>>>> But you said in your last post
>>>>   "with each frame (a typical video frame size) being contiguous in
>>>> memory"
>>>>  ... which is not true from what you write above. Anyways, my first 2
>>>> suggestions still hold.
>>>
>>> Yes, each video frame is contiguous and they can be scattered.
>>>
>> I assume by contiguous frame you mean as in framebuffer?  Which is an
>> array of bytes.
>> If yes, then you should do as I suggest first,  frame_size=16  and numf=1.
>
> I think am confusing you.  I would like to explain with an example.  Lets
> say
> each video frame is 4k size starting at address 0x10004000 (-0x10005000) and
> other frame at 0x20002000 (-0x20003000), and so on.
>
As I said plz dont confuse video frame with DMA frame.... in video
frame the stride is constant(zero or not) whereas in DMA context the
stride must be zero for the frame to be called contiguous.

> So, the frames are
> scattered in memory and as the template doesnt allow multiple src_start/
> dst_start we could not use single template to fill the HW descriptors (of
> frames).  So, I feel your suggestion might not work if the frames are
> scattered.
> Also, how could we get 'vsize' value in your approach?
>
The client driver(video driver) should know the frame parameters.
Practically you'll have to populate 16 transfer templates (frame
attributes and locations won't change for a session) and submit to be
transferred _cyclically_.

>  More importantly,
> we are overriding the semantics of interleaved template members.
>
Not at all. Interleaved-dma isn't meant for only constant stride/icg
transfers. Rather it's for identical frames with random strides.

>
>>
>> If no, then it seems you are already doing the right thing.... the
>> ring-buffer scheme. Please share some stats how the current api is
>> causing you overhead because that is a very common case (many
>> controllers support LLI) and you have 467ms (@30fps with 16-frames
>> ring-buffer) to queue in before you see any frame drop.
>
> As I mentioned earlier in the thread, our hardware has a SG engine where by
> we could send multiple frames in a batch.  Using the original implementation
> of interleaved API, we have three options to transfer.
>
> One is to send frame by frame to the hardware.  We get a async_tx desc for
> each frame and then we submit it to hardware triggering it to transfer this
> BD.
> We queue the next descriptor on the pending queue and whenever there is
> a completion interrupt we submit the next BD on this queue to hardware. In
> this implementation we are not efficiently using the SG engine in the
> hardware
> as we transferring frame by frame even HW allows us to transfer multiple
> frames.
>
Sending one frame at a time will likely cause jitters. That shouldn't be done.

> The second option is to queue all the BDs until the maximum frames that HW
> is
> capable to transfer in a batch and then submit to SG engine in the HW.  With
> this approach I feel there will be additional software overhead to track the
> number
> of maximum transfers and few additional cycles to release the cookie of each
> desc.
> Here, each desc represents a frame.
>
APIs are written for the gcd of h/w. We should optimize only for
majority and here isn't even anything gained after changing the api.
What you want to 'optimize' has been there since ever. Nobody
considers that overhead. BTW you don't even want to spend a few 'extra
cycles' but what about every other platform that doesn't support this
and will have to scan for such transfer requests?

> The last option is the current implementation of the driver along with this
> change in
> API.  It will allow us to send array of interleaved templates wherein we
> could allocate
> a single async desc which will handle multiple frames (or segments) and just
> submit
> this desc to HW.  Then we program the current to first frame, tail to last
> frame and
> HW will complete this transfer.  Here, each desc represents multiple frames.
>
> My point here is the driver should use hardware resources efficiently and I
> feel the
> driver will be inefficient if we dont use them.
>
I am afraid you are getting carried away by the 'awesomeness' of your
hardware. RingBuffers/Cyclic transfers  are meant for cases just like
yours.
Consider the following ... if you queue 16 frames and don't care to
track before all are transmitted, you'll have very high latency. Video
seeks will take unacceptably long and give the impression of a slow
system. Whereas if you get callbacks for each frame rendered, you
could updates frames from the next one thereby having very quick
response time.

Anyways there are already ways to achieve what you want (however a bad
idea that might be). So I am not convinced the change to api is any
useful (your hardware won't run any more efficiently with the change).

Regards,
Jassi
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-24  2:09                     ` Jassi Brar
  0 siblings, 0 replies; 65+ messages in thread
From: Jassi Brar @ 2014-02-24  2:09 UTC (permalink / raw)
  To: linux-arm-kernel

On 21 February 2014 23:37, Srikanth Thokala <sthokal@xilinx.com> wrote:
> On Thu, Feb 20, 2014 at 3:23 PM, Jassi Brar <jaswinder.singh@linaro.org>
> wrote:
>> On 20 February 2014 14:54, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>> On Wed, Feb 19, 2014 at 12:33 AM, Jassi Brar <jaswinder.singh@linaro.org>
>>> wrote:
>>>> On 18 February 2014 23:16, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>>> On Tue, Feb 18, 2014 at 10:20 PM, Jassi Brar
>>>>> <jaswinder.singh@linaro.org> wrote:
>>>>>> On 18 February 2014 16:58, Srikanth Thokala <sthokal@xilinx.com>
>>>>>> wrote:
>>>>>>> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar
>>>>>>> <jaswinder.singh@linaro.org> wrote:
>>>>>>>> On 15 February 2014 17:30, Srikanth Thokala <sthokal@xilinx.com>
>>>>>>>> wrote:
>>>>>>>>> The current implementation of interleaved DMA API support multiple
>>>>>>>>> frames only when the memory is contiguous by incrementing
>>>>>>>>> src_start/
>>>>>>>>> dst_start members of interleaved template.
>>>>>>>>>
>>>>>>>>> But, when the memory is non-contiguous it will restrict slave
>>>>>>>>> device
>>>>>>>>> to not submit multiple frames in a batch.  This patch handles this
>>>>>>>>> issue by allowing the slave device to send array of interleaved dma
>>>>>>>>> templates each having a different memory location.
>>>>>>>>>
>>>>>>>> How fragmented could be memory in your case? Is it inefficient to
>>>>>>>> submit separate transfers for each segment/frame?
>>>>>>>> It will help if you could give a typical example (chunk size and gap
>>>>>>>> in bytes) of what you worry about.
>>>>>>>
>>>>>>> With scatter-gather engine feature in the hardware, submitting
>>>>>>> separate
>>>>>>> transfers for each frame look inefficient. As an example, our DMA
>>>>>>> engine
>>>>>>> supports up to 16 video frames, with each frame (a typical video
>>>>>>> frame
>>>>>>> size) being contiguous in memory but frames are scattered into
>>>>>>> different
>>>>>>> locations. We could not definitely submit frame by frame as it would
>>>>>>> be
>>>>>>> software overhead (HW interrupting for each frame) resulting in video
>>>>>>> lags.
>>>>>>>
>>>>>> IIUIC, it is 30fps and one dma interrupt per frame ... it doesn't seem
>>>>>> inefficient at all. Even poor-latency audio would generate a higher
>>>>>> interrupt-rate. So the "inefficiency concern" doesn't seem valid to
>>>>>> me.
>>>>>>
>>>>>> Not to mean we shouldn't strive to reduce the interrupt-rate further.
>>>>>> Another option is to emulate the ring-buffer scheme of ALSA.... which
>>>>>> should be possible since for a session of video playback the frame
>>>>>> buffers' locations wouldn't change.
>>>>>>
>>>>>> Yet another option is to use the full potential of the
>>>>>> interleaved-xfer api as such. It seems you confuse a 'video frame'
>>>>>> with the interleaved-xfer api's 'frame'. They are different.
>>>>>>
>>>>>> Assuming your one video frame is F bytes long and Gk is the gap in
>>>>>> bytes between end of frame [k] and start of frame [k+1] and  Gi != Gj
>>>>>> for i!=j
>>>>>> In the context of interleaved-xfer api, you have just 1 Frame of 16
>>>>>> chunks. Each chunk is Fbytes and the inter-chunk-gap(ICG) is Gk  where
>>>>>> 0<=k<15
>>>>>> So for your use-case .....
>>>>>>   dma_interleaved_template.numf = 1   /* just 1 frame */
>>>>>>   dma_interleaved_template.frame_size = 16  /* containing 16 chunks */
>>>>>>    ...... //other parameters
>>>>>>
>>>>>> You have 3 options to choose from and all should work just as fine.
>>>>>> Otherwise please state your problem in real numbers (video-frames'
>>>>>> size, count & gap in bytes).
>>>>>
>>>>> Initially I interpreted interleaved template the same.  But, Lars
>>>>> corrected me
>>>>> in the subsequent discussion and let me put it here briefly,
>>>>>
>>>>> In the interleaved template, each frame represents a line of size
>>>>> denoted by
>>>>> chunk.size and the stride by icg.  'numf' represent number of frames
>>>>> i.e.
>>>>> number of lines.
>>>>>
>>>>> In video frame context,
>>>>> chunk.size -> hsize
>>>>> chunk.icg -> stride
>>>>> numf -> vsize
>>>>> and frame_size is always 1 as it will have only one chunk in a line.
>>>>>
>>>> But you said in your last post
>>>>   "with each frame (a typical video frame size) being contiguous in
>>>> memory"
>>>>  ... which is not true from what you write above. Anyways, my first 2
>>>> suggestions still hold.
>>>
>>> Yes, each video frame is contiguous and they can be scattered.
>>>
>> I assume by contiguous frame you mean as in framebuffer?  Which is an
>> array of bytes.
>> If yes, then you should do as I suggest first,  frame_size=16  and numf=1.
>
> I think am confusing you.  I would like to explain with an example.  Lets
> say
> each video frame is 4k size starting at address 0x10004000 (-0x10005000) and
> other frame at 0x20002000 (-0x20003000), and so on.
>
As I said plz dont confuse video frame with DMA frame.... in video
frame the stride is constant(zero or not) whereas in DMA context the
stride must be zero for the frame to be called contiguous.

> So, the frames are
> scattered in memory and as the template doesnt allow multiple src_start/
> dst_start we could not use single template to fill the HW descriptors (of
> frames).  So, I feel your suggestion might not work if the frames are
> scattered.
> Also, how could we get 'vsize' value in your approach?
>
The client driver(video driver) should know the frame parameters.
Practically you'll have to populate 16 transfer templates (frame
attributes and locations won't change for a session) and submit to be
transferred _cyclically_.

>  More importantly,
> we are overriding the semantics of interleaved template members.
>
Not at all. Interleaved-dma isn't meant for only constant stride/icg
transfers. Rather it's for identical frames with random strides.

>
>>
>> If no, then it seems you are already doing the right thing.... the
>> ring-buffer scheme. Please share some stats how the current api is
>> causing you overhead because that is a very common case (many
>> controllers support LLI) and you have 467ms (@30fps with 16-frames
>> ring-buffer) to queue in before you see any frame drop.
>
> As I mentioned earlier in the thread, our hardware has a SG engine where by
> we could send multiple frames in a batch.  Using the original implementation
> of interleaved API, we have three options to transfer.
>
> One is to send frame by frame to the hardware.  We get a async_tx desc for
> each frame and then we submit it to hardware triggering it to transfer this
> BD.
> We queue the next descriptor on the pending queue and whenever there is
> a completion interrupt we submit the next BD on this queue to hardware. In
> this implementation we are not efficiently using the SG engine in the
> hardware
> as we transferring frame by frame even HW allows us to transfer multiple
> frames.
>
Sending one frame at a time will likely cause jitters. That shouldn't be done.

> The second option is to queue all the BDs until the maximum frames that HW
> is
> capable to transfer in a batch and then submit to SG engine in the HW.  With
> this approach I feel there will be additional software overhead to track the
> number
> of maximum transfers and few additional cycles to release the cookie of each
> desc.
> Here, each desc represents a frame.
>
APIs are written for the gcd of h/w. We should optimize only for
majority and here isn't even anything gained after changing the api.
What you want to 'optimize' has been there since ever. Nobody
considers that overhead. BTW you don't even want to spend a few 'extra
cycles' but what about every other platform that doesn't support this
and will have to scan for such transfer requests?

> The last option is the current implementation of the driver along with this
> change in
> API.  It will allow us to send array of interleaved templates wherein we
> could allocate
> a single async desc which will handle multiple frames (or segments) and just
> submit
> this desc to HW.  Then we program the current to first frame, tail to last
> frame and
> HW will complete this transfer.  Here, each desc represents multiple frames.
>
> My point here is the driver should use hardware resources efficiently and I
> feel the
> driver will be inefficient if we dont use them.
>
I am afraid you are getting carried away by the 'awesomeness' of your
hardware. RingBuffers/Cyclic transfers  are meant for cases just like
yours.
Consider the following ... if you queue 16 frames and don't care to
track before all are transmitted, you'll have very high latency. Video
seeks will take unacceptably long and give the impression of a slow
system. Whereas if you get callbacks for each frame rendered, you
could updates frames from the next one thereby having very quick
response time.

Anyways there are already ways to achieve what you want (however a bad
idea that might be). So I am not convinced the change to api is any
useful (your hardware won't run any more efficiently with the change).

Regards,
Jassi

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
  2014-02-24  2:09                     ` Jassi Brar
  (?)
@ 2014-02-26 14:21                       ` Srikanth Thokala
  -1 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-26 14:21 UTC (permalink / raw)
  To: Jassi Brar
  Cc: Srikanth Thokala, Williams, Dan J, Koul, Vinod, Michal Simek,
	Grant Likely, Rob Herring, devicetree, Levente Kurusa,
	Lars-Peter Clausen, lkml, dmaengine, Andy Shevchenko,
	linux-arm-kernel

On Mon, Feb 24, 2014 at 7:39 AM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
> On 21 February 2014 23:37, Srikanth Thokala <sthokal@xilinx.com> wrote:
>> On Thu, Feb 20, 2014 at 3:23 PM, Jassi Brar <jaswinder.singh@linaro.org>
>> wrote:
>>> On 20 February 2014 14:54, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>> On Wed, Feb 19, 2014 at 12:33 AM, Jassi Brar <jaswinder.singh@linaro.org>
>>>> wrote:
>>>>> On 18 February 2014 23:16, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>>>> On Tue, Feb 18, 2014 at 10:20 PM, Jassi Brar
>>>>>> <jaswinder.singh@linaro.org> wrote:
>>>>>>> On 18 February 2014 16:58, Srikanth Thokala <sthokal@xilinx.com>
>>>>>>> wrote:
>>>>>>>> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar
>>>>>>>> <jaswinder.singh@linaro.org> wrote:
>>>>>>>>> On 15 February 2014 17:30, Srikanth Thokala <sthokal@xilinx.com>
>>>>>>>>> wrote:
>>>>>>>>>> The current implementation of interleaved DMA API support multiple
>>>>>>>>>> frames only when the memory is contiguous by incrementing
>>>>>>>>>> src_start/
>>>>>>>>>> dst_start members of interleaved template.
>>>>>>>>>>
>>>>>>>>>> But, when the memory is non-contiguous it will restrict slave
>>>>>>>>>> device
>>>>>>>>>> to not submit multiple frames in a batch.  This patch handles this
>>>>>>>>>> issue by allowing the slave device to send array of interleaved dma
>>>>>>>>>> templates each having a different memory location.
>>>>>>>>>>
>>>>>>>>> How fragmented could be memory in your case? Is it inefficient to
>>>>>>>>> submit separate transfers for each segment/frame?
>>>>>>>>> It will help if you could give a typical example (chunk size and gap
>>>>>>>>> in bytes) of what you worry about.
>>>>>>>>
>>>>>>>> With scatter-gather engine feature in the hardware, submitting
>>>>>>>> separate
>>>>>>>> transfers for each frame look inefficient. As an example, our DMA
>>>>>>>> engine
>>>>>>>> supports up to 16 video frames, with each frame (a typical video
>>>>>>>> frame
>>>>>>>> size) being contiguous in memory but frames are scattered into
>>>>>>>> different
>>>>>>>> locations. We could not definitely submit frame by frame as it would
>>>>>>>> be
>>>>>>>> software overhead (HW interrupting for each frame) resulting in video
>>>>>>>> lags.
>>>>>>>>
>>>>>>> IIUIC, it is 30fps and one dma interrupt per frame ... it doesn't seem
>>>>>>> inefficient at all. Even poor-latency audio would generate a higher
>>>>>>> interrupt-rate. So the "inefficiency concern" doesn't seem valid to
>>>>>>> me.
>>>>>>>
>>>>>>> Not to mean we shouldn't strive to reduce the interrupt-rate further.
>>>>>>> Another option is to emulate the ring-buffer scheme of ALSA.... which
>>>>>>> should be possible since for a session of video playback the frame
>>>>>>> buffers' locations wouldn't change.
>>>>>>>
>>>>>>> Yet another option is to use the full potential of the
>>>>>>> interleaved-xfer api as such. It seems you confuse a 'video frame'
>>>>>>> with the interleaved-xfer api's 'frame'. They are different.
>>>>>>>
>>>>>>> Assuming your one video frame is F bytes long and Gk is the gap in
>>>>>>> bytes between end of frame [k] and start of frame [k+1] and  Gi != Gj
>>>>>>> for i!=j
>>>>>>> In the context of interleaved-xfer api, you have just 1 Frame of 16
>>>>>>> chunks. Each chunk is Fbytes and the inter-chunk-gap(ICG) is Gk  where
>>>>>>> 0<=k<15
>>>>>>> So for your use-case .....
>>>>>>>   dma_interleaved_template.numf = 1   /* just 1 frame */
>>>>>>>   dma_interleaved_template.frame_size = 16  /* containing 16 chunks */
>>>>>>>    ...... //other parameters
>>>>>>>
>>>>>>> You have 3 options to choose from and all should work just as fine.
>>>>>>> Otherwise please state your problem in real numbers (video-frames'
>>>>>>> size, count & gap in bytes).
>>>>>>
>>>>>> Initially I interpreted interleaved template the same.  But, Lars
>>>>>> corrected me
>>>>>> in the subsequent discussion and let me put it here briefly,
>>>>>>
>>>>>> In the interleaved template, each frame represents a line of size
>>>>>> denoted by
>>>>>> chunk.size and the stride by icg.  'numf' represent number of frames
>>>>>> i.e.
>>>>>> number of lines.
>>>>>>
>>>>>> In video frame context,
>>>>>> chunk.size -> hsize
>>>>>> chunk.icg -> stride
>>>>>> numf -> vsize
>>>>>> and frame_size is always 1 as it will have only one chunk in a line.
>>>>>>
>>>>> But you said in your last post
>>>>>   "with each frame (a typical video frame size) being contiguous in
>>>>> memory"
>>>>>  ... which is not true from what you write above. Anyways, my first 2
>>>>> suggestions still hold.
>>>>
>>>> Yes, each video frame is contiguous and they can be scattered.
>>>>
>>> I assume by contiguous frame you mean as in framebuffer?  Which is an
>>> array of bytes.
>>> If yes, then you should do as I suggest first,  frame_size=16  and numf=1.
>>
>> I think am confusing you.  I would like to explain with an example.  Lets
>> say
>> each video frame is 4k size starting at address 0x10004000 (-0x10005000) and
>> other frame at 0x20002000 (-0x20003000), and so on.
>>
> As I said plz dont confuse video frame with DMA frame.... in video
> frame the stride is constant(zero or not) whereas in DMA context the
> stride must be zero for the frame to be called contiguous.
>
>> So, the frames are
>> scattered in memory and as the template doesnt allow multiple src_start/
>> dst_start we could not use single template to fill the HW descriptors (of
>> frames).  So, I feel your suggestion might not work if the frames are
>> scattered.
>> Also, how could we get 'vsize' value in your approach?
>>
> The client driver(video driver) should know the frame parameters.
> Practically you'll have to populate 16 transfer templates (frame
> attributes and locations won't change for a session) and submit to be
> transferred _cyclically_.
>
>>  More importantly,
>> we are overriding the semantics of interleaved template members.
>>
> Not at all. Interleaved-dma isn't meant for only constant stride/icg
> transfers. Rather it's for identical frames with random strides.
>
>>
>>>
>>> If no, then it seems you are already doing the right thing.... the
>>> ring-buffer scheme. Please share some stats how the current api is
>>> causing you overhead because that is a very common case (many
>>> controllers support LLI) and you have 467ms (@30fps with 16-frames
>>> ring-buffer) to queue in before you see any frame drop.
>>
>> As I mentioned earlier in the thread, our hardware has a SG engine where by
>> we could send multiple frames in a batch.  Using the original implementation
>> of interleaved API, we have three options to transfer.
>>
>> One is to send frame by frame to the hardware.  We get a async_tx desc for
>> each frame and then we submit it to hardware triggering it to transfer this
>> BD.
>> We queue the next descriptor on the pending queue and whenever there is
>> a completion interrupt we submit the next BD on this queue to hardware. In
>> this implementation we are not efficiently using the SG engine in the
>> hardware
>> as we transferring frame by frame even HW allows us to transfer multiple
>> frames.
>>
> Sending one frame at a time will likely cause jitters. That shouldn't be done.
>
>> The second option is to queue all the BDs until the maximum frames that HW
>> is
>> capable to transfer in a batch and then submit to SG engine in the HW.  With
>> this approach I feel there will be additional software overhead to track the
>> number
>> of maximum transfers and few additional cycles to release the cookie of each
>> desc.
>> Here, each desc represents a frame.
>>
> APIs are written for the gcd of h/w. We should optimize only for
> majority and here isn't even anything gained after changing the api.
> What you want to 'optimize' has been there since ever. Nobody
> considers that overhead. BTW you don't even want to spend a few 'extra
> cycles' but what about every other platform that doesn't support this
> and will have to scan for such transfer requests?
>
>> The last option is the current implementation of the driver along with this
>> change in
>> API.  It will allow us to send array of interleaved templates wherein we
>> could allocate
>> a single async desc which will handle multiple frames (or segments) and just
>> submit
>> this desc to HW.  Then we program the current to first frame, tail to last
>> frame and
>> HW will complete this transfer.  Here, each desc represents multiple frames.
>>
>> My point here is the driver should use hardware resources efficiently and I
>> feel the
>> driver will be inefficient if we dont use them.
>>
> I am afraid you are getting carried away by the 'awesomeness' of your
> hardware. RingBuffers/Cyclic transfers  are meant for cases just like
> yours.
> Consider the following ... if you queue 16 frames and don't care to
> track before all are transmitted, you'll have very high latency. Video
> seeks will take unacceptably long and give the impression of a slow
> system. Whereas if you get callbacks for each frame rendered, you
> could updates frames from the next one thereby having very quick
> response time.

Not at all, at least in our hardware, when we submit transfers it is most
likely to be completed. There are few errors, but mostly are recoverable
and it doesnt stop the transfer unless there is a critical error which needs
a reset to the whole system.  So, at least in my use case there will be
no latency.  I am not saying hardware is great, but it is the IP implementation.

Regarding this API change, I had earlier explained my use case in my
v2 thread.
Lars and Vinod came up with this resolution to allow array of
interleaved templates.
I also feel this is reasonable change for the subsystem.

Lars/Vinod, could you also comment on this thread for an early resolution?
Based on everyone's opinion I can take a call for my v4 patch.

Thanks
Srikanth

>
> Anyways there are already ways to achieve what you want (however a bad
> idea that might be). So I am not convinced the change to api is any
> useful (your hardware won't run any more efficiently with the change).
>
> Regards,
> Jassi
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-26 14:21                       ` Srikanth Thokala
  0 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-26 14:21 UTC (permalink / raw)
  To: Jassi Brar
  Cc: Srikanth Thokala, Williams, Dan J, Koul, Vinod, Michal Simek,
	Grant Likely, Rob Herring, devicetree, Levente Kurusa,
	Lars-Peter Clausen, lkml, dmaengine, Andy Shevchenko,
	linux-arm-kernel

On Mon, Feb 24, 2014 at 7:39 AM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
> On 21 February 2014 23:37, Srikanth Thokala <sthokal@xilinx.com> wrote:
>> On Thu, Feb 20, 2014 at 3:23 PM, Jassi Brar <jaswinder.singh@linaro.org>
>> wrote:
>>> On 20 February 2014 14:54, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>> On Wed, Feb 19, 2014 at 12:33 AM, Jassi Brar <jaswinder.singh@linaro.org>
>>>> wrote:
>>>>> On 18 February 2014 23:16, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>>>> On Tue, Feb 18, 2014 at 10:20 PM, Jassi Brar
>>>>>> <jaswinder.singh@linaro.org> wrote:
>>>>>>> On 18 February 2014 16:58, Srikanth Thokala <sthokal@xilinx.com>
>>>>>>> wrote:
>>>>>>>> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar
>>>>>>>> <jaswinder.singh@linaro.org> wrote:
>>>>>>>>> On 15 February 2014 17:30, Srikanth Thokala <sthokal@xilinx.com>
>>>>>>>>> wrote:
>>>>>>>>>> The current implementation of interleaved DMA API support multiple
>>>>>>>>>> frames only when the memory is contiguous by incrementing
>>>>>>>>>> src_start/
>>>>>>>>>> dst_start members of interleaved template.
>>>>>>>>>>
>>>>>>>>>> But, when the memory is non-contiguous it will restrict slave
>>>>>>>>>> device
>>>>>>>>>> to not submit multiple frames in a batch.  This patch handles this
>>>>>>>>>> issue by allowing the slave device to send array of interleaved dma
>>>>>>>>>> templates each having a different memory location.
>>>>>>>>>>
>>>>>>>>> How fragmented could be memory in your case? Is it inefficient to
>>>>>>>>> submit separate transfers for each segment/frame?
>>>>>>>>> It will help if you could give a typical example (chunk size and gap
>>>>>>>>> in bytes) of what you worry about.
>>>>>>>>
>>>>>>>> With scatter-gather engine feature in the hardware, submitting
>>>>>>>> separate
>>>>>>>> transfers for each frame look inefficient. As an example, our DMA
>>>>>>>> engine
>>>>>>>> supports up to 16 video frames, with each frame (a typical video
>>>>>>>> frame
>>>>>>>> size) being contiguous in memory but frames are scattered into
>>>>>>>> different
>>>>>>>> locations. We could not definitely submit frame by frame as it would
>>>>>>>> be
>>>>>>>> software overhead (HW interrupting for each frame) resulting in video
>>>>>>>> lags.
>>>>>>>>
>>>>>>> IIUIC, it is 30fps and one dma interrupt per frame ... it doesn't seem
>>>>>>> inefficient at all. Even poor-latency audio would generate a higher
>>>>>>> interrupt-rate. So the "inefficiency concern" doesn't seem valid to
>>>>>>> me.
>>>>>>>
>>>>>>> Not to mean we shouldn't strive to reduce the interrupt-rate further.
>>>>>>> Another option is to emulate the ring-buffer scheme of ALSA.... which
>>>>>>> should be possible since for a session of video playback the frame
>>>>>>> buffers' locations wouldn't change.
>>>>>>>
>>>>>>> Yet another option is to use the full potential of the
>>>>>>> interleaved-xfer api as such. It seems you confuse a 'video frame'
>>>>>>> with the interleaved-xfer api's 'frame'. They are different.
>>>>>>>
>>>>>>> Assuming your one video frame is F bytes long and Gk is the gap in
>>>>>>> bytes between end of frame [k] and start of frame [k+1] and  Gi != Gj
>>>>>>> for i!=j
>>>>>>> In the context of interleaved-xfer api, you have just 1 Frame of 16
>>>>>>> chunks. Each chunk is Fbytes and the inter-chunk-gap(ICG) is Gk  where
>>>>>>> 0<=k<15
>>>>>>> So for your use-case .....
>>>>>>>   dma_interleaved_template.numf = 1   /* just 1 frame */
>>>>>>>   dma_interleaved_template.frame_size = 16  /* containing 16 chunks */
>>>>>>>    ...... //other parameters
>>>>>>>
>>>>>>> You have 3 options to choose from and all should work just as fine.
>>>>>>> Otherwise please state your problem in real numbers (video-frames'
>>>>>>> size, count & gap in bytes).
>>>>>>
>>>>>> Initially I interpreted interleaved template the same.  But, Lars
>>>>>> corrected me
>>>>>> in the subsequent discussion and let me put it here briefly,
>>>>>>
>>>>>> In the interleaved template, each frame represents a line of size
>>>>>> denoted by
>>>>>> chunk.size and the stride by icg.  'numf' represent number of frames
>>>>>> i.e.
>>>>>> number of lines.
>>>>>>
>>>>>> In video frame context,
>>>>>> chunk.size -> hsize
>>>>>> chunk.icg -> stride
>>>>>> numf -> vsize
>>>>>> and frame_size is always 1 as it will have only one chunk in a line.
>>>>>>
>>>>> But you said in your last post
>>>>>   "with each frame (a typical video frame size) being contiguous in
>>>>> memory"
>>>>>  ... which is not true from what you write above. Anyways, my first 2
>>>>> suggestions still hold.
>>>>
>>>> Yes, each video frame is contiguous and they can be scattered.
>>>>
>>> I assume by contiguous frame you mean as in framebuffer?  Which is an
>>> array of bytes.
>>> If yes, then you should do as I suggest first,  frame_size=16  and numf=1.
>>
>> I think am confusing you.  I would like to explain with an example.  Lets
>> say
>> each video frame is 4k size starting at address 0x10004000 (-0x10005000) and
>> other frame at 0x20002000 (-0x20003000), and so on.
>>
> As I said plz dont confuse video frame with DMA frame.... in video
> frame the stride is constant(zero or not) whereas in DMA context the
> stride must be zero for the frame to be called contiguous.
>
>> So, the frames are
>> scattered in memory and as the template doesnt allow multiple src_start/
>> dst_start we could not use single template to fill the HW descriptors (of
>> frames).  So, I feel your suggestion might not work if the frames are
>> scattered.
>> Also, how could we get 'vsize' value in your approach?
>>
> The client driver(video driver) should know the frame parameters.
> Practically you'll have to populate 16 transfer templates (frame
> attributes and locations won't change for a session) and submit to be
> transferred _cyclically_.
>
>>  More importantly,
>> we are overriding the semantics of interleaved template members.
>>
> Not at all. Interleaved-dma isn't meant for only constant stride/icg
> transfers. Rather it's for identical frames with random strides.
>
>>
>>>
>>> If no, then it seems you are already doing the right thing.... the
>>> ring-buffer scheme. Please share some stats how the current api is
>>> causing you overhead because that is a very common case (many
>>> controllers support LLI) and you have 467ms (@30fps with 16-frames
>>> ring-buffer) to queue in before you see any frame drop.
>>
>> As I mentioned earlier in the thread, our hardware has a SG engine where by
>> we could send multiple frames in a batch.  Using the original implementation
>> of interleaved API, we have three options to transfer.
>>
>> One is to send frame by frame to the hardware.  We get a async_tx desc for
>> each frame and then we submit it to hardware triggering it to transfer this
>> BD.
>> We queue the next descriptor on the pending queue and whenever there is
>> a completion interrupt we submit the next BD on this queue to hardware. In
>> this implementation we are not efficiently using the SG engine in the
>> hardware
>> as we transferring frame by frame even HW allows us to transfer multiple
>> frames.
>>
> Sending one frame at a time will likely cause jitters. That shouldn't be done.
>
>> The second option is to queue all the BDs until the maximum frames that HW
>> is
>> capable to transfer in a batch and then submit to SG engine in the HW.  With
>> this approach I feel there will be additional software overhead to track the
>> number
>> of maximum transfers and few additional cycles to release the cookie of each
>> desc.
>> Here, each desc represents a frame.
>>
> APIs are written for the gcd of h/w. We should optimize only for
> majority and here isn't even anything gained after changing the api.
> What you want to 'optimize' has been there since ever. Nobody
> considers that overhead. BTW you don't even want to spend a few 'extra
> cycles' but what about every other platform that doesn't support this
> and will have to scan for such transfer requests?
>
>> The last option is the current implementation of the driver along with this
>> change in
>> API.  It will allow us to send array of interleaved templates wherein we
>> could allocate
>> a single async desc which will handle multiple frames (or segments) and just
>> submit
>> this desc to HW.  Then we program the current to first frame, tail to last
>> frame and
>> HW will complete this transfer.  Here, each desc represents multiple frames.
>>
>> My point here is the driver should use hardware resources efficiently and I
>> feel the
>> driver will be inefficient if we dont use them.
>>
> I am afraid you are getting carried away by the 'awesomeness' of your
> hardware. RingBuffers/Cyclic transfers  are meant for cases just like
> yours.
> Consider the following ... if you queue 16 frames and don't care to
> track before all are transmitted, you'll have very high latency. Video
> seeks will take unacceptably long and give the impression of a slow
> system. Whereas if you get callbacks for each frame rendered, you
> could updates frames from the next one thereby having very quick
> response time.

Not at all, at least in our hardware, when we submit transfers it is most
likely to be completed. There are few errors, but mostly are recoverable
and it doesnt stop the transfer unless there is a critical error which needs
a reset to the whole system.  So, at least in my use case there will be
no latency.  I am not saying hardware is great, but it is the IP implementation.

Regarding this API change, I had earlier explained my use case in my
v2 thread.
Lars and Vinod came up with this resolution to allow array of
interleaved templates.
I also feel this is reasonable change for the subsystem.

Lars/Vinod, could you also comment on this thread for an early resolution?
Based on everyone's opinion I can take a call for my v4 patch.

Thanks
Srikanth

>
> Anyways there are already ways to achieve what you want (however a bad
> idea that might be). So I am not convinced the change to api is any
> useful (your hardware won't run any more efficiently with the change).
>
> Regards,
> Jassi
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-26 14:21                       ` Srikanth Thokala
  0 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-26 14:21 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Feb 24, 2014 at 7:39 AM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
> On 21 February 2014 23:37, Srikanth Thokala <sthokal@xilinx.com> wrote:
>> On Thu, Feb 20, 2014 at 3:23 PM, Jassi Brar <jaswinder.singh@linaro.org>
>> wrote:
>>> On 20 February 2014 14:54, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>> On Wed, Feb 19, 2014 at 12:33 AM, Jassi Brar <jaswinder.singh@linaro.org>
>>>> wrote:
>>>>> On 18 February 2014 23:16, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>>>> On Tue, Feb 18, 2014 at 10:20 PM, Jassi Brar
>>>>>> <jaswinder.singh@linaro.org> wrote:
>>>>>>> On 18 February 2014 16:58, Srikanth Thokala <sthokal@xilinx.com>
>>>>>>> wrote:
>>>>>>>> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar
>>>>>>>> <jaswinder.singh@linaro.org> wrote:
>>>>>>>>> On 15 February 2014 17:30, Srikanth Thokala <sthokal@xilinx.com>
>>>>>>>>> wrote:
>>>>>>>>>> The current implementation of interleaved DMA API support multiple
>>>>>>>>>> frames only when the memory is contiguous by incrementing
>>>>>>>>>> src_start/
>>>>>>>>>> dst_start members of interleaved template.
>>>>>>>>>>
>>>>>>>>>> But, when the memory is non-contiguous it will restrict slave
>>>>>>>>>> device
>>>>>>>>>> to not submit multiple frames in a batch.  This patch handles this
>>>>>>>>>> issue by allowing the slave device to send array of interleaved dma
>>>>>>>>>> templates each having a different memory location.
>>>>>>>>>>
>>>>>>>>> How fragmented could be memory in your case? Is it inefficient to
>>>>>>>>> submit separate transfers for each segment/frame?
>>>>>>>>> It will help if you could give a typical example (chunk size and gap
>>>>>>>>> in bytes) of what you worry about.
>>>>>>>>
>>>>>>>> With scatter-gather engine feature in the hardware, submitting
>>>>>>>> separate
>>>>>>>> transfers for each frame look inefficient. As an example, our DMA
>>>>>>>> engine
>>>>>>>> supports up to 16 video frames, with each frame (a typical video
>>>>>>>> frame
>>>>>>>> size) being contiguous in memory but frames are scattered into
>>>>>>>> different
>>>>>>>> locations. We could not definitely submit frame by frame as it would
>>>>>>>> be
>>>>>>>> software overhead (HW interrupting for each frame) resulting in video
>>>>>>>> lags.
>>>>>>>>
>>>>>>> IIUIC, it is 30fps and one dma interrupt per frame ... it doesn't seem
>>>>>>> inefficient at all. Even poor-latency audio would generate a higher
>>>>>>> interrupt-rate. So the "inefficiency concern" doesn't seem valid to
>>>>>>> me.
>>>>>>>
>>>>>>> Not to mean we shouldn't strive to reduce the interrupt-rate further.
>>>>>>> Another option is to emulate the ring-buffer scheme of ALSA.... which
>>>>>>> should be possible since for a session of video playback the frame
>>>>>>> buffers' locations wouldn't change.
>>>>>>>
>>>>>>> Yet another option is to use the full potential of the
>>>>>>> interleaved-xfer api as such. It seems you confuse a 'video frame'
>>>>>>> with the interleaved-xfer api's 'frame'. They are different.
>>>>>>>
>>>>>>> Assuming your one video frame is F bytes long and Gk is the gap in
>>>>>>> bytes between end of frame [k] and start of frame [k+1] and  Gi != Gj
>>>>>>> for i!=j
>>>>>>> In the context of interleaved-xfer api, you have just 1 Frame of 16
>>>>>>> chunks. Each chunk is Fbytes and the inter-chunk-gap(ICG) is Gk  where
>>>>>>> 0<=k<15
>>>>>>> So for your use-case .....
>>>>>>>   dma_interleaved_template.numf = 1   /* just 1 frame */
>>>>>>>   dma_interleaved_template.frame_size = 16  /* containing 16 chunks */
>>>>>>>    ...... //other parameters
>>>>>>>
>>>>>>> You have 3 options to choose from and all should work just as fine.
>>>>>>> Otherwise please state your problem in real numbers (video-frames'
>>>>>>> size, count & gap in bytes).
>>>>>>
>>>>>> Initially I interpreted interleaved template the same.  But, Lars
>>>>>> corrected me
>>>>>> in the subsequent discussion and let me put it here briefly,
>>>>>>
>>>>>> In the interleaved template, each frame represents a line of size
>>>>>> denoted by
>>>>>> chunk.size and the stride by icg.  'numf' represent number of frames
>>>>>> i.e.
>>>>>> number of lines.
>>>>>>
>>>>>> In video frame context,
>>>>>> chunk.size -> hsize
>>>>>> chunk.icg -> stride
>>>>>> numf -> vsize
>>>>>> and frame_size is always 1 as it will have only one chunk in a line.
>>>>>>
>>>>> But you said in your last post
>>>>>   "with each frame (a typical video frame size) being contiguous in
>>>>> memory"
>>>>>  ... which is not true from what you write above. Anyways, my first 2
>>>>> suggestions still hold.
>>>>
>>>> Yes, each video frame is contiguous and they can be scattered.
>>>>
>>> I assume by contiguous frame you mean as in framebuffer?  Which is an
>>> array of bytes.
>>> If yes, then you should do as I suggest first,  frame_size=16  and numf=1.
>>
>> I think am confusing you.  I would like to explain with an example.  Lets
>> say
>> each video frame is 4k size starting at address 0x10004000 (-0x10005000) and
>> other frame at 0x20002000 (-0x20003000), and so on.
>>
> As I said plz dont confuse video frame with DMA frame.... in video
> frame the stride is constant(zero or not) whereas in DMA context the
> stride must be zero for the frame to be called contiguous.
>
>> So, the frames are
>> scattered in memory and as the template doesnt allow multiple src_start/
>> dst_start we could not use single template to fill the HW descriptors (of
>> frames).  So, I feel your suggestion might not work if the frames are
>> scattered.
>> Also, how could we get 'vsize' value in your approach?
>>
> The client driver(video driver) should know the frame parameters.
> Practically you'll have to populate 16 transfer templates (frame
> attributes and locations won't change for a session) and submit to be
> transferred _cyclically_.
>
>>  More importantly,
>> we are overriding the semantics of interleaved template members.
>>
> Not at all. Interleaved-dma isn't meant for only constant stride/icg
> transfers. Rather it's for identical frames with random strides.
>
>>
>>>
>>> If no, then it seems you are already doing the right thing.... the
>>> ring-buffer scheme. Please share some stats how the current api is
>>> causing you overhead because that is a very common case (many
>>> controllers support LLI) and you have 467ms (@30fps with 16-frames
>>> ring-buffer) to queue in before you see any frame drop.
>>
>> As I mentioned earlier in the thread, our hardware has a SG engine where by
>> we could send multiple frames in a batch.  Using the original implementation
>> of interleaved API, we have three options to transfer.
>>
>> One is to send frame by frame to the hardware.  We get a async_tx desc for
>> each frame and then we submit it to hardware triggering it to transfer this
>> BD.
>> We queue the next descriptor on the pending queue and whenever there is
>> a completion interrupt we submit the next BD on this queue to hardware. In
>> this implementation we are not efficiently using the SG engine in the
>> hardware
>> as we transferring frame by frame even HW allows us to transfer multiple
>> frames.
>>
> Sending one frame at a time will likely cause jitters. That shouldn't be done.
>
>> The second option is to queue all the BDs until the maximum frames that HW
>> is
>> capable to transfer in a batch and then submit to SG engine in the HW.  With
>> this approach I feel there will be additional software overhead to track the
>> number
>> of maximum transfers and few additional cycles to release the cookie of each
>> desc.
>> Here, each desc represents a frame.
>>
> APIs are written for the gcd of h/w. We should optimize only for
> majority and here isn't even anything gained after changing the api.
> What you want to 'optimize' has been there since ever. Nobody
> considers that overhead. BTW you don't even want to spend a few 'extra
> cycles' but what about every other platform that doesn't support this
> and will have to scan for such transfer requests?
>
>> The last option is the current implementation of the driver along with this
>> change in
>> API.  It will allow us to send array of interleaved templates wherein we
>> could allocate
>> a single async desc which will handle multiple frames (or segments) and just
>> submit
>> this desc to HW.  Then we program the current to first frame, tail to last
>> frame and
>> HW will complete this transfer.  Here, each desc represents multiple frames.
>>
>> My point here is the driver should use hardware resources efficiently and I
>> feel the
>> driver will be inefficient if we dont use them.
>>
> I am afraid you are getting carried away by the 'awesomeness' of your
> hardware. RingBuffers/Cyclic transfers  are meant for cases just like
> yours.
> Consider the following ... if you queue 16 frames and don't care to
> track before all are transmitted, you'll have very high latency. Video
> seeks will take unacceptably long and give the impression of a slow
> system. Whereas if you get callbacks for each frame rendered, you
> could updates frames from the next one thereby having very quick
> response time.

Not at all, at least in our hardware, when we submit transfers it is most
likely to be completed. There are few errors, but mostly are recoverable
and it doesnt stop the transfer unless there is a critical error which needs
a reset to the whole system.  So, at least in my use case there will be
no latency.  I am not saying hardware is great, but it is the IP implementation.

Regarding this API change, I had earlier explained my use case in my
v2 thread.
Lars and Vinod came up with this resolution to allow array of
interleaved templates.
I also feel this is reasonable change for the subsystem.

Lars/Vinod, could you also comment on this thread for an early resolution?
Based on everyone's opinion I can take a call for my v4 patch.

Thanks
Srikanth

>
> Anyways there are already ways to achieve what you want (however a bad
> idea that might be). So I am not convinced the change to api is any
> useful (your hardware won't run any more efficiently with the change).
>
> Regards,
> Jassi
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-26 15:02                         ` Jassi Brar
  0 siblings, 0 replies; 65+ messages in thread
From: Jassi Brar @ 2014-02-26 15:02 UTC (permalink / raw)
  To: Srikanth Thokala
  Cc: Williams, Dan J, Koul, Vinod, Michal Simek, Grant Likely,
	Rob Herring, devicetree, Levente Kurusa, Lars-Peter Clausen,
	lkml, dmaengine, Andy Shevchenko, linux-arm-kernel

On 26 February 2014 23:21, Srikanth Thokala <sthokal@xilinx.com> wrote:
> On Mon, Feb 24, 2014 at 7:39 AM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
>> On 21 February 2014 23:37, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>> On Thu, Feb 20, 2014 at 3:23 PM, Jassi Brar <jaswinder.singh@linaro.org>
>>> wrote:
>>>> On 20 February 2014 14:54, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>>> On Wed, Feb 19, 2014 at 12:33 AM, Jassi Brar <jaswinder.singh@linaro.org>
>>>>> wrote:
>>>>>> On 18 February 2014 23:16, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>>>>> On Tue, Feb 18, 2014 at 10:20 PM, Jassi Brar
>>>>>>> <jaswinder.singh@linaro.org> wrote:
>>>>>>>> On 18 February 2014 16:58, Srikanth Thokala <sthokal@xilinx.com>
>>>>>>>> wrote:
>>>>>>>>> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar
>>>>>>>>> <jaswinder.singh@linaro.org> wrote:
>>>>>>>>>> On 15 February 2014 17:30, Srikanth Thokala <sthokal@xilinx.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> The current implementation of interleaved DMA API support multiple
>>>>>>>>>>> frames only when the memory is contiguous by incrementing
>>>>>>>>>>> src_start/
>>>>>>>>>>> dst_start members of interleaved template.
>>>>>>>>>>>
>>>>>>>>>>> But, when the memory is non-contiguous it will restrict slave
>>>>>>>>>>> device
>>>>>>>>>>> to not submit multiple frames in a batch.  This patch handles this
>>>>>>>>>>> issue by allowing the slave device to send array of interleaved dma
>>>>>>>>>>> templates each having a different memory location.
>>>>>>>>>>>
>>>>>>>>>> How fragmented could be memory in your case? Is it inefficient to
>>>>>>>>>> submit separate transfers for each segment/frame?
>>>>>>>>>> It will help if you could give a typical example (chunk size and gap
>>>>>>>>>> in bytes) of what you worry about.
>>>>>>>>>
>>>>>>>>> With scatter-gather engine feature in the hardware, submitting
>>>>>>>>> separate
>>>>>>>>> transfers for each frame look inefficient. As an example, our DMA
>>>>>>>>> engine
>>>>>>>>> supports up to 16 video frames, with each frame (a typical video
>>>>>>>>> frame
>>>>>>>>> size) being contiguous in memory but frames are scattered into
>>>>>>>>> different
>>>>>>>>> locations. We could not definitely submit frame by frame as it would
>>>>>>>>> be
>>>>>>>>> software overhead (HW interrupting for each frame) resulting in video
>>>>>>>>> lags.
>>>>>>>>>
>>>>>>>> IIUIC, it is 30fps and one dma interrupt per frame ... it doesn't seem
>>>>>>>> inefficient at all. Even poor-latency audio would generate a higher
>>>>>>>> interrupt-rate. So the "inefficiency concern" doesn't seem valid to
>>>>>>>> me.
>>>>>>>>
>>>>>>>> Not to mean we shouldn't strive to reduce the interrupt-rate further.
>>>>>>>> Another option is to emulate the ring-buffer scheme of ALSA.... which
>>>>>>>> should be possible since for a session of video playback the frame
>>>>>>>> buffers' locations wouldn't change.
>>>>>>>>
>>>>>>>> Yet another option is to use the full potential of the
>>>>>>>> interleaved-xfer api as such. It seems you confuse a 'video frame'
>>>>>>>> with the interleaved-xfer api's 'frame'. They are different.
>>>>>>>>
>>>>>>>> Assuming your one video frame is F bytes long and Gk is the gap in
>>>>>>>> bytes between end of frame [k] and start of frame [k+1] and  Gi != Gj
>>>>>>>> for i!=j
>>>>>>>> In the context of interleaved-xfer api, you have just 1 Frame of 16
>>>>>>>> chunks. Each chunk is Fbytes and the inter-chunk-gap(ICG) is Gk  where
>>>>>>>> 0<=k<15
>>>>>>>> So for your use-case .....
>>>>>>>>   dma_interleaved_template.numf = 1   /* just 1 frame */
>>>>>>>>   dma_interleaved_template.frame_size = 16  /* containing 16 chunks */
>>>>>>>>    ...... //other parameters
>>>>>>>>
>>>>>>>> You have 3 options to choose from and all should work just as fine.
>>>>>>>> Otherwise please state your problem in real numbers (video-frames'
>>>>>>>> size, count & gap in bytes).
>>>>>>>
>>>>>>> Initially I interpreted interleaved template the same.  But, Lars
>>>>>>> corrected me
>>>>>>> in the subsequent discussion and let me put it here briefly,
>>>>>>>
>>>>>>> In the interleaved template, each frame represents a line of size
>>>>>>> denoted by
>>>>>>> chunk.size and the stride by icg.  'numf' represent number of frames
>>>>>>> i.e.
>>>>>>> number of lines.
>>>>>>>
>>>>>>> In video frame context,
>>>>>>> chunk.size -> hsize
>>>>>>> chunk.icg -> stride
>>>>>>> numf -> vsize
>>>>>>> and frame_size is always 1 as it will have only one chunk in a line.
>>>>>>>
>>>>>> But you said in your last post
>>>>>>   "with each frame (a typical video frame size) being contiguous in
>>>>>> memory"
>>>>>>  ... which is not true from what you write above. Anyways, my first 2
>>>>>> suggestions still hold.
>>>>>
>>>>> Yes, each video frame is contiguous and they can be scattered.
>>>>>
>>>> I assume by contiguous frame you mean as in framebuffer?  Which is an
>>>> array of bytes.
>>>> If yes, then you should do as I suggest first,  frame_size=16  and numf=1.
>>>
>>> I think am confusing you.  I would like to explain with an example.  Lets
>>> say
>>> each video frame is 4k size starting at address 0x10004000 (-0x10005000) and
>>> other frame at 0x20002000 (-0x20003000), and so on.
>>>
>> As I said plz dont confuse video frame with DMA frame.... in video
>> frame the stride is constant(zero or not) whereas in DMA context the
>> stride must be zero for the frame to be called contiguous.
>>
>>> So, the frames are
>>> scattered in memory and as the template doesnt allow multiple src_start/
>>> dst_start we could not use single template to fill the HW descriptors (of
>>> frames).  So, I feel your suggestion might not work if the frames are
>>> scattered.
>>> Also, how could we get 'vsize' value in your approach?
>>>
>> The client driver(video driver) should know the frame parameters.
>> Practically you'll have to populate 16 transfer templates (frame
>> attributes and locations won't change for a session) and submit to be
>> transferred _cyclically_.
>>
>>>  More importantly,
>>> we are overriding the semantics of interleaved template members.
>>>
>> Not at all. Interleaved-dma isn't meant for only constant stride/icg
>> transfers. Rather it's for identical frames with random strides.
>>
>>>
>>>>
>>>> If no, then it seems you are already doing the right thing.... the
>>>> ring-buffer scheme. Please share some stats how the current api is
>>>> causing you overhead because that is a very common case (many
>>>> controllers support LLI) and you have 467ms (@30fps with 16-frames
>>>> ring-buffer) to queue in before you see any frame drop.
>>>
>>> As I mentioned earlier in the thread, our hardware has a SG engine where by
>>> we could send multiple frames in a batch.  Using the original implementation
>>> of interleaved API, we have three options to transfer.
>>>
>>> One is to send frame by frame to the hardware.  We get a async_tx desc for
>>> each frame and then we submit it to hardware triggering it to transfer this
>>> BD.
>>> We queue the next descriptor on the pending queue and whenever there is
>>> a completion interrupt we submit the next BD on this queue to hardware. In
>>> this implementation we are not efficiently using the SG engine in the
>>> hardware
>>> as we transferring frame by frame even HW allows us to transfer multiple
>>> frames.
>>>
>> Sending one frame at a time will likely cause jitters. That shouldn't be done.
>>
>>> The second option is to queue all the BDs until the maximum frames that HW
>>> is
>>> capable to transfer in a batch and then submit to SG engine in the HW.  With
>>> this approach I feel there will be additional software overhead to track the
>>> number
>>> of maximum transfers and few additional cycles to release the cookie of each
>>> desc.
>>> Here, each desc represents a frame.
>>>
>> APIs are written for the gcd of h/w. We should optimize only for
>> majority and here isn't even anything gained after changing the api.
>> What you want to 'optimize' has been there since ever. Nobody
>> considers that overhead. BTW you don't even want to spend a few 'extra
>> cycles' but what about every other platform that doesn't support this
>> and will have to scan for such transfer requests?
>>
>>> The last option is the current implementation of the driver along with this
>>> change in
>>> API.  It will allow us to send array of interleaved templates wherein we
>>> could allocate
>>> a single async desc which will handle multiple frames (or segments) and just
>>> submit
>>> this desc to HW.  Then we program the current to first frame, tail to last
>>> frame and
>>> HW will complete this transfer.  Here, each desc represents multiple frames.
>>>
>>> My point here is the driver should use hardware resources efficiently and I
>>> feel the
>>> driver will be inefficient if we dont use them.
>>>
>> I am afraid you are getting carried away by the 'awesomeness' of your
>> hardware. RingBuffers/Cyclic transfers  are meant for cases just like
>> yours.
>> Consider the following ... if you queue 16 frames and don't care to
>> track before all are transmitted, you'll have very high latency. Video
>> seeks will take unacceptably long and give the impression of a slow
>> system. Whereas if you get callbacks for each frame rendered, you
>> could updates frames from the next one thereby having very quick
>> response time.
>
> Not at all, at least in our hardware, when we submit transfers it is most
> likely to be completed. There are few errors, but mostly are recoverable
> and it doesnt stop the transfer unless there is a critical error which needs
> a reset to the whole system.  So, at least in my use case there will be
> no latency.  I am not saying hardware is great, but it is the IP implementation.
>
I am not talking about the error cases.
Apply your patch locally so that you queue 16frames and not get
notified upon each frame 'rendered'...  now click on 'seek bar' of the
video player. See how slow it is to jump to play from the new
location.  Or if the frames are to be encoded after dma transfer...
see the 'padding' that would need to be done at the end.  These
concerns are common with audio subsystem using ring-buffers.
Anyways that is just FYI, I don't care how you implement your platform.

> Regarding this API change, I had earlier explained my use case in my
> v2 thread.
> Lars and Vinod came up with this resolution to allow array of
> interleaved templates.
> I also feel this is reasonable change for the subsystem.
>
I don't see you getting any better performance from your hardware and
I certainly don't find your usecase anything new (many dma controllers
support LLI and they work just fine as such). And I have already
explained how you could do, whatever you want, without this change.
So a polite NAK from me.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-26 15:02                         ` Jassi Brar
  0 siblings, 0 replies; 65+ messages in thread
From: Jassi Brar @ 2014-02-26 15:02 UTC (permalink / raw)
  To: Srikanth Thokala
  Cc: Williams, Dan J, Koul, Vinod, Michal Simek, Grant Likely,
	Rob Herring, devicetree-u79uwXL29TY76Z2rM5mHXA, Levente Kurusa,
	Lars-Peter Clausen, lkml, dmaengine-u79uwXL29TY76Z2rM5mHXA,
	Andy Shevchenko,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On 26 February 2014 23:21, Srikanth Thokala <sthokal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org> wrote:
> On Mon, Feb 24, 2014 at 7:39 AM, Jassi Brar <jaswinder.singh-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org> wrote:
>> On 21 February 2014 23:37, Srikanth Thokala <sthokal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org> wrote:
>>> On Thu, Feb 20, 2014 at 3:23 PM, Jassi Brar <jaswinder.singh-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>
>>> wrote:
>>>> On 20 February 2014 14:54, Srikanth Thokala <sthokal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>> On Wed, Feb 19, 2014 at 12:33 AM, Jassi Brar <jaswinder.singh-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>
>>>>> wrote:
>>>>>> On 18 February 2014 23:16, Srikanth Thokala <sthokal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>>> On Tue, Feb 18, 2014 at 10:20 PM, Jassi Brar
>>>>>>> <jaswinder.singh-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org> wrote:
>>>>>>>> On 18 February 2014 16:58, Srikanth Thokala <sthokal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org>
>>>>>>>> wrote:
>>>>>>>>> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar
>>>>>>>>> <jaswinder.singh-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org> wrote:
>>>>>>>>>> On 15 February 2014 17:30, Srikanth Thokala <sthokal-gjFFaj9aHVfQT0dZR+AlfA@public.gmane.org>
>>>>>>>>>> wrote:
>>>>>>>>>>> The current implementation of interleaved DMA API support multiple
>>>>>>>>>>> frames only when the memory is contiguous by incrementing
>>>>>>>>>>> src_start/
>>>>>>>>>>> dst_start members of interleaved template.
>>>>>>>>>>>
>>>>>>>>>>> But, when the memory is non-contiguous it will restrict slave
>>>>>>>>>>> device
>>>>>>>>>>> to not submit multiple frames in a batch.  This patch handles this
>>>>>>>>>>> issue by allowing the slave device to send array of interleaved dma
>>>>>>>>>>> templates each having a different memory location.
>>>>>>>>>>>
>>>>>>>>>> How fragmented could be memory in your case? Is it inefficient to
>>>>>>>>>> submit separate transfers for each segment/frame?
>>>>>>>>>> It will help if you could give a typical example (chunk size and gap
>>>>>>>>>> in bytes) of what you worry about.
>>>>>>>>>
>>>>>>>>> With scatter-gather engine feature in the hardware, submitting
>>>>>>>>> separate
>>>>>>>>> transfers for each frame look inefficient. As an example, our DMA
>>>>>>>>> engine
>>>>>>>>> supports up to 16 video frames, with each frame (a typical video
>>>>>>>>> frame
>>>>>>>>> size) being contiguous in memory but frames are scattered into
>>>>>>>>> different
>>>>>>>>> locations. We could not definitely submit frame by frame as it would
>>>>>>>>> be
>>>>>>>>> software overhead (HW interrupting for each frame) resulting in video
>>>>>>>>> lags.
>>>>>>>>>
>>>>>>>> IIUIC, it is 30fps and one dma interrupt per frame ... it doesn't seem
>>>>>>>> inefficient at all. Even poor-latency audio would generate a higher
>>>>>>>> interrupt-rate. So the "inefficiency concern" doesn't seem valid to
>>>>>>>> me.
>>>>>>>>
>>>>>>>> Not to mean we shouldn't strive to reduce the interrupt-rate further.
>>>>>>>> Another option is to emulate the ring-buffer scheme of ALSA.... which
>>>>>>>> should be possible since for a session of video playback the frame
>>>>>>>> buffers' locations wouldn't change.
>>>>>>>>
>>>>>>>> Yet another option is to use the full potential of the
>>>>>>>> interleaved-xfer api as such. It seems you confuse a 'video frame'
>>>>>>>> with the interleaved-xfer api's 'frame'. They are different.
>>>>>>>>
>>>>>>>> Assuming your one video frame is F bytes long and Gk is the gap in
>>>>>>>> bytes between end of frame [k] and start of frame [k+1] and  Gi != Gj
>>>>>>>> for i!=j
>>>>>>>> In the context of interleaved-xfer api, you have just 1 Frame of 16
>>>>>>>> chunks. Each chunk is Fbytes and the inter-chunk-gap(ICG) is Gk  where
>>>>>>>> 0<=k<15
>>>>>>>> So for your use-case .....
>>>>>>>>   dma_interleaved_template.numf = 1   /* just 1 frame */
>>>>>>>>   dma_interleaved_template.frame_size = 16  /* containing 16 chunks */
>>>>>>>>    ...... //other parameters
>>>>>>>>
>>>>>>>> You have 3 options to choose from and all should work just as fine.
>>>>>>>> Otherwise please state your problem in real numbers (video-frames'
>>>>>>>> size, count & gap in bytes).
>>>>>>>
>>>>>>> Initially I interpreted interleaved template the same.  But, Lars
>>>>>>> corrected me
>>>>>>> in the subsequent discussion and let me put it here briefly,
>>>>>>>
>>>>>>> In the interleaved template, each frame represents a line of size
>>>>>>> denoted by
>>>>>>> chunk.size and the stride by icg.  'numf' represent number of frames
>>>>>>> i.e.
>>>>>>> number of lines.
>>>>>>>
>>>>>>> In video frame context,
>>>>>>> chunk.size -> hsize
>>>>>>> chunk.icg -> stride
>>>>>>> numf -> vsize
>>>>>>> and frame_size is always 1 as it will have only one chunk in a line.
>>>>>>>
>>>>>> But you said in your last post
>>>>>>   "with each frame (a typical video frame size) being contiguous in
>>>>>> memory"
>>>>>>  ... which is not true from what you write above. Anyways, my first 2
>>>>>> suggestions still hold.
>>>>>
>>>>> Yes, each video frame is contiguous and they can be scattered.
>>>>>
>>>> I assume by contiguous frame you mean as in framebuffer?  Which is an
>>>> array of bytes.
>>>> If yes, then you should do as I suggest first,  frame_size=16  and numf=1.
>>>
>>> I think am confusing you.  I would like to explain with an example.  Lets
>>> say
>>> each video frame is 4k size starting at address 0x10004000 (-0x10005000) and
>>> other frame at 0x20002000 (-0x20003000), and so on.
>>>
>> As I said plz dont confuse video frame with DMA frame.... in video
>> frame the stride is constant(zero or not) whereas in DMA context the
>> stride must be zero for the frame to be called contiguous.
>>
>>> So, the frames are
>>> scattered in memory and as the template doesnt allow multiple src_start/
>>> dst_start we could not use single template to fill the HW descriptors (of
>>> frames).  So, I feel your suggestion might not work if the frames are
>>> scattered.
>>> Also, how could we get 'vsize' value in your approach?
>>>
>> The client driver(video driver) should know the frame parameters.
>> Practically you'll have to populate 16 transfer templates (frame
>> attributes and locations won't change for a session) and submit to be
>> transferred _cyclically_.
>>
>>>  More importantly,
>>> we are overriding the semantics of interleaved template members.
>>>
>> Not at all. Interleaved-dma isn't meant for only constant stride/icg
>> transfers. Rather it's for identical frames with random strides.
>>
>>>
>>>>
>>>> If no, then it seems you are already doing the right thing.... the
>>>> ring-buffer scheme. Please share some stats how the current api is
>>>> causing you overhead because that is a very common case (many
>>>> controllers support LLI) and you have 467ms (@30fps with 16-frames
>>>> ring-buffer) to queue in before you see any frame drop.
>>>
>>> As I mentioned earlier in the thread, our hardware has a SG engine where by
>>> we could send multiple frames in a batch.  Using the original implementation
>>> of interleaved API, we have three options to transfer.
>>>
>>> One is to send frame by frame to the hardware.  We get a async_tx desc for
>>> each frame and then we submit it to hardware triggering it to transfer this
>>> BD.
>>> We queue the next descriptor on the pending queue and whenever there is
>>> a completion interrupt we submit the next BD on this queue to hardware. In
>>> this implementation we are not efficiently using the SG engine in the
>>> hardware
>>> as we transferring frame by frame even HW allows us to transfer multiple
>>> frames.
>>>
>> Sending one frame at a time will likely cause jitters. That shouldn't be done.
>>
>>> The second option is to queue all the BDs until the maximum frames that HW
>>> is
>>> capable to transfer in a batch and then submit to SG engine in the HW.  With
>>> this approach I feel there will be additional software overhead to track the
>>> number
>>> of maximum transfers and few additional cycles to release the cookie of each
>>> desc.
>>> Here, each desc represents a frame.
>>>
>> APIs are written for the gcd of h/w. We should optimize only for
>> majority and here isn't even anything gained after changing the api.
>> What you want to 'optimize' has been there since ever. Nobody
>> considers that overhead. BTW you don't even want to spend a few 'extra
>> cycles' but what about every other platform that doesn't support this
>> and will have to scan for such transfer requests?
>>
>>> The last option is the current implementation of the driver along with this
>>> change in
>>> API.  It will allow us to send array of interleaved templates wherein we
>>> could allocate
>>> a single async desc which will handle multiple frames (or segments) and just
>>> submit
>>> this desc to HW.  Then we program the current to first frame, tail to last
>>> frame and
>>> HW will complete this transfer.  Here, each desc represents multiple frames.
>>>
>>> My point here is the driver should use hardware resources efficiently and I
>>> feel the
>>> driver will be inefficient if we dont use them.
>>>
>> I am afraid you are getting carried away by the 'awesomeness' of your
>> hardware. RingBuffers/Cyclic transfers  are meant for cases just like
>> yours.
>> Consider the following ... if you queue 16 frames and don't care to
>> track before all are transmitted, you'll have very high latency. Video
>> seeks will take unacceptably long and give the impression of a slow
>> system. Whereas if you get callbacks for each frame rendered, you
>> could updates frames from the next one thereby having very quick
>> response time.
>
> Not at all, at least in our hardware, when we submit transfers it is most
> likely to be completed. There are few errors, but mostly are recoverable
> and it doesnt stop the transfer unless there is a critical error which needs
> a reset to the whole system.  So, at least in my use case there will be
> no latency.  I am not saying hardware is great, but it is the IP implementation.
>
I am not talking about the error cases.
Apply your patch locally so that you queue 16frames and not get
notified upon each frame 'rendered'...  now click on 'seek bar' of the
video player. See how slow it is to jump to play from the new
location.  Or if the frames are to be encoded after dma transfer...
see the 'padding' that would need to be done at the end.  These
concerns are common with audio subsystem using ring-buffers.
Anyways that is just FYI, I don't care how you implement your platform.

> Regarding this API change, I had earlier explained my use case in my
> v2 thread.
> Lars and Vinod came up with this resolution to allow array of
> interleaved templates.
> I also feel this is reasonable change for the subsystem.
>
I don't see you getting any better performance from your hardware and
I certainly don't find your usecase anything new (many dma controllers
support LLI and they work just fine as such). And I have already
explained how you could do, whatever you want, without this change.
So a polite NAK from me.
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-26 15:02                         ` Jassi Brar
  0 siblings, 0 replies; 65+ messages in thread
From: Jassi Brar @ 2014-02-26 15:02 UTC (permalink / raw)
  To: linux-arm-kernel

On 26 February 2014 23:21, Srikanth Thokala <sthokal@xilinx.com> wrote:
> On Mon, Feb 24, 2014 at 7:39 AM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
>> On 21 February 2014 23:37, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>> On Thu, Feb 20, 2014 at 3:23 PM, Jassi Brar <jaswinder.singh@linaro.org>
>>> wrote:
>>>> On 20 February 2014 14:54, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>>> On Wed, Feb 19, 2014 at 12:33 AM, Jassi Brar <jaswinder.singh@linaro.org>
>>>>> wrote:
>>>>>> On 18 February 2014 23:16, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>>>>> On Tue, Feb 18, 2014 at 10:20 PM, Jassi Brar
>>>>>>> <jaswinder.singh@linaro.org> wrote:
>>>>>>>> On 18 February 2014 16:58, Srikanth Thokala <sthokal@xilinx.com>
>>>>>>>> wrote:
>>>>>>>>> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar
>>>>>>>>> <jaswinder.singh@linaro.org> wrote:
>>>>>>>>>> On 15 February 2014 17:30, Srikanth Thokala <sthokal@xilinx.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> The current implementation of interleaved DMA API support multiple
>>>>>>>>>>> frames only when the memory is contiguous by incrementing
>>>>>>>>>>> src_start/
>>>>>>>>>>> dst_start members of interleaved template.
>>>>>>>>>>>
>>>>>>>>>>> But, when the memory is non-contiguous it will restrict slave
>>>>>>>>>>> device
>>>>>>>>>>> to not submit multiple frames in a batch.  This patch handles this
>>>>>>>>>>> issue by allowing the slave device to send array of interleaved dma
>>>>>>>>>>> templates each having a different memory location.
>>>>>>>>>>>
>>>>>>>>>> How fragmented could be memory in your case? Is it inefficient to
>>>>>>>>>> submit separate transfers for each segment/frame?
>>>>>>>>>> It will help if you could give a typical example (chunk size and gap
>>>>>>>>>> in bytes) of what you worry about.
>>>>>>>>>
>>>>>>>>> With scatter-gather engine feature in the hardware, submitting
>>>>>>>>> separate
>>>>>>>>> transfers for each frame look inefficient. As an example, our DMA
>>>>>>>>> engine
>>>>>>>>> supports up to 16 video frames, with each frame (a typical video
>>>>>>>>> frame
>>>>>>>>> size) being contiguous in memory but frames are scattered into
>>>>>>>>> different
>>>>>>>>> locations. We could not definitely submit frame by frame as it would
>>>>>>>>> be
>>>>>>>>> software overhead (HW interrupting for each frame) resulting in video
>>>>>>>>> lags.
>>>>>>>>>
>>>>>>>> IIUIC, it is 30fps and one dma interrupt per frame ... it doesn't seem
>>>>>>>> inefficient at all. Even poor-latency audio would generate a higher
>>>>>>>> interrupt-rate. So the "inefficiency concern" doesn't seem valid to
>>>>>>>> me.
>>>>>>>>
>>>>>>>> Not to mean we shouldn't strive to reduce the interrupt-rate further.
>>>>>>>> Another option is to emulate the ring-buffer scheme of ALSA.... which
>>>>>>>> should be possible since for a session of video playback the frame
>>>>>>>> buffers' locations wouldn't change.
>>>>>>>>
>>>>>>>> Yet another option is to use the full potential of the
>>>>>>>> interleaved-xfer api as such. It seems you confuse a 'video frame'
>>>>>>>> with the interleaved-xfer api's 'frame'. They are different.
>>>>>>>>
>>>>>>>> Assuming your one video frame is F bytes long and Gk is the gap in
>>>>>>>> bytes between end of frame [k] and start of frame [k+1] and  Gi != Gj
>>>>>>>> for i!=j
>>>>>>>> In the context of interleaved-xfer api, you have just 1 Frame of 16
>>>>>>>> chunks. Each chunk is Fbytes and the inter-chunk-gap(ICG) is Gk  where
>>>>>>>> 0<=k<15
>>>>>>>> So for your use-case .....
>>>>>>>>   dma_interleaved_template.numf = 1   /* just 1 frame */
>>>>>>>>   dma_interleaved_template.frame_size = 16  /* containing 16 chunks */
>>>>>>>>    ...... //other parameters
>>>>>>>>
>>>>>>>> You have 3 options to choose from and all should work just as fine.
>>>>>>>> Otherwise please state your problem in real numbers (video-frames'
>>>>>>>> size, count & gap in bytes).
>>>>>>>
>>>>>>> Initially I interpreted interleaved template the same.  But, Lars
>>>>>>> corrected me
>>>>>>> in the subsequent discussion and let me put it here briefly,
>>>>>>>
>>>>>>> In the interleaved template, each frame represents a line of size
>>>>>>> denoted by
>>>>>>> chunk.size and the stride by icg.  'numf' represent number of frames
>>>>>>> i.e.
>>>>>>> number of lines.
>>>>>>>
>>>>>>> In video frame context,
>>>>>>> chunk.size -> hsize
>>>>>>> chunk.icg -> stride
>>>>>>> numf -> vsize
>>>>>>> and frame_size is always 1 as it will have only one chunk in a line.
>>>>>>>
>>>>>> But you said in your last post
>>>>>>   "with each frame (a typical video frame size) being contiguous in
>>>>>> memory"
>>>>>>  ... which is not true from what you write above. Anyways, my first 2
>>>>>> suggestions still hold.
>>>>>
>>>>> Yes, each video frame is contiguous and they can be scattered.
>>>>>
>>>> I assume by contiguous frame you mean as in framebuffer?  Which is an
>>>> array of bytes.
>>>> If yes, then you should do as I suggest first,  frame_size=16  and numf=1.
>>>
>>> I think am confusing you.  I would like to explain with an example.  Lets
>>> say
>>> each video frame is 4k size starting at address 0x10004000 (-0x10005000) and
>>> other frame at 0x20002000 (-0x20003000), and so on.
>>>
>> As I said plz dont confuse video frame with DMA frame.... in video
>> frame the stride is constant(zero or not) whereas in DMA context the
>> stride must be zero for the frame to be called contiguous.
>>
>>> So, the frames are
>>> scattered in memory and as the template doesnt allow multiple src_start/
>>> dst_start we could not use single template to fill the HW descriptors (of
>>> frames).  So, I feel your suggestion might not work if the frames are
>>> scattered.
>>> Also, how could we get 'vsize' value in your approach?
>>>
>> The client driver(video driver) should know the frame parameters.
>> Practically you'll have to populate 16 transfer templates (frame
>> attributes and locations won't change for a session) and submit to be
>> transferred _cyclically_.
>>
>>>  More importantly,
>>> we are overriding the semantics of interleaved template members.
>>>
>> Not at all. Interleaved-dma isn't meant for only constant stride/icg
>> transfers. Rather it's for identical frames with random strides.
>>
>>>
>>>>
>>>> If no, then it seems you are already doing the right thing.... the
>>>> ring-buffer scheme. Please share some stats how the current api is
>>>> causing you overhead because that is a very common case (many
>>>> controllers support LLI) and you have 467ms (@30fps with 16-frames
>>>> ring-buffer) to queue in before you see any frame drop.
>>>
>>> As I mentioned earlier in the thread, our hardware has a SG engine where by
>>> we could send multiple frames in a batch.  Using the original implementation
>>> of interleaved API, we have three options to transfer.
>>>
>>> One is to send frame by frame to the hardware.  We get a async_tx desc for
>>> each frame and then we submit it to hardware triggering it to transfer this
>>> BD.
>>> We queue the next descriptor on the pending queue and whenever there is
>>> a completion interrupt we submit the next BD on this queue to hardware. In
>>> this implementation we are not efficiently using the SG engine in the
>>> hardware
>>> as we transferring frame by frame even HW allows us to transfer multiple
>>> frames.
>>>
>> Sending one frame at a time will likely cause jitters. That shouldn't be done.
>>
>>> The second option is to queue all the BDs until the maximum frames that HW
>>> is
>>> capable to transfer in a batch and then submit to SG engine in the HW.  With
>>> this approach I feel there will be additional software overhead to track the
>>> number
>>> of maximum transfers and few additional cycles to release the cookie of each
>>> desc.
>>> Here, each desc represents a frame.
>>>
>> APIs are written for the gcd of h/w. We should optimize only for
>> majority and here isn't even anything gained after changing the api.
>> What you want to 'optimize' has been there since ever. Nobody
>> considers that overhead. BTW you don't even want to spend a few 'extra
>> cycles' but what about every other platform that doesn't support this
>> and will have to scan for such transfer requests?
>>
>>> The last option is the current implementation of the driver along with this
>>> change in
>>> API.  It will allow us to send array of interleaved templates wherein we
>>> could allocate
>>> a single async desc which will handle multiple frames (or segments) and just
>>> submit
>>> this desc to HW.  Then we program the current to first frame, tail to last
>>> frame and
>>> HW will complete this transfer.  Here, each desc represents multiple frames.
>>>
>>> My point here is the driver should use hardware resources efficiently and I
>>> feel the
>>> driver will be inefficient if we dont use them.
>>>
>> I am afraid you are getting carried away by the 'awesomeness' of your
>> hardware. RingBuffers/Cyclic transfers  are meant for cases just like
>> yours.
>> Consider the following ... if you queue 16 frames and don't care to
>> track before all are transmitted, you'll have very high latency. Video
>> seeks will take unacceptably long and give the impression of a slow
>> system. Whereas if you get callbacks for each frame rendered, you
>> could updates frames from the next one thereby having very quick
>> response time.
>
> Not at all, at least in our hardware, when we submit transfers it is most
> likely to be completed. There are few errors, but mostly are recoverable
> and it doesnt stop the transfer unless there is a critical error which needs
> a reset to the whole system.  So, at least in my use case there will be
> no latency.  I am not saying hardware is great, but it is the IP implementation.
>
I am not talking about the error cases.
Apply your patch locally so that you queue 16frames and not get
notified upon each frame 'rendered'...  now click on 'seek bar' of the
video player. See how slow it is to jump to play from the new
location.  Or if the frames are to be encoded after dma transfer...
see the 'padding' that would need to be done at the end.  These
concerns are common with audio subsystem using ring-buffers.
Anyways that is just FYI, I don't care how you implement your platform.

> Regarding this API change, I had earlier explained my use case in my
> v2 thread.
> Lars and Vinod came up with this resolution to allow array of
> interleaved templates.
> I also feel this is reasonable change for the subsystem.
>
I don't see you getting any better performance from your hardware and
I certainly don't find your usecase anything new (many dma controllers
support LLI and they work just fine as such). And I have already
explained how you could do, whatever you want, without this change.
So a polite NAK from me.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
  2014-02-26 15:02                         ` Jassi Brar
  (?)
@ 2014-02-27 18:04                           ` Srikanth Thokala
  -1 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-27 18:04 UTC (permalink / raw)
  To: Jassi Brar
  Cc: Srikanth Thokala, Williams, Dan J, Koul, Vinod, Michal Simek,
	Grant Likely, Rob Herring, devicetree, Levente Kurusa,
	Lars-Peter Clausen, lkml, dmaengine, Andy Shevchenko,
	linux-arm-kernel

On Wed, Feb 26, 2014 at 8:32 PM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
> On 26 February 2014 23:21, Srikanth Thokala <sthokal@xilinx.com> wrote:
>> On Mon, Feb 24, 2014 at 7:39 AM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
>>> On 21 February 2014 23:37, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>> On Thu, Feb 20, 2014 at 3:23 PM, Jassi Brar <jaswinder.singh@linaro.org>
>>>> wrote:
>>>>> On 20 February 2014 14:54, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>>>> On Wed, Feb 19, 2014 at 12:33 AM, Jassi Brar <jaswinder.singh@linaro.org>
>>>>>> wrote:
>>>>>>> On 18 February 2014 23:16, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>>>>>> On Tue, Feb 18, 2014 at 10:20 PM, Jassi Brar
>>>>>>>> <jaswinder.singh@linaro.org> wrote:
>>>>>>>>> On 18 February 2014 16:58, Srikanth Thokala <sthokal@xilinx.com>
>>>>>>>>> wrote:
>>>>>>>>>> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar
>>>>>>>>>> <jaswinder.singh@linaro.org> wrote:
>>>>>>>>>>> On 15 February 2014 17:30, Srikanth Thokala <sthokal@xilinx.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> The current implementation of interleaved DMA API support multiple
>>>>>>>>>>>> frames only when the memory is contiguous by incrementing
>>>>>>>>>>>> src_start/
>>>>>>>>>>>> dst_start members of interleaved template.
>>>>>>>>>>>>
>>>>>>>>>>>> But, when the memory is non-contiguous it will restrict slave
>>>>>>>>>>>> device
>>>>>>>>>>>> to not submit multiple frames in a batch.  This patch handles this
>>>>>>>>>>>> issue by allowing the slave device to send array of interleaved dma
>>>>>>>>>>>> templates each having a different memory location.
>>>>>>>>>>>>
>>>>>>>>>>> How fragmented could be memory in your case? Is it inefficient to
>>>>>>>>>>> submit separate transfers for each segment/frame?
>>>>>>>>>>> It will help if you could give a typical example (chunk size and gap
>>>>>>>>>>> in bytes) of what you worry about.
>>>>>>>>>>
>>>>>>>>>> With scatter-gather engine feature in the hardware, submitting
>>>>>>>>>> separate
>>>>>>>>>> transfers for each frame look inefficient. As an example, our DMA
>>>>>>>>>> engine
>>>>>>>>>> supports up to 16 video frames, with each frame (a typical video
>>>>>>>>>> frame
>>>>>>>>>> size) being contiguous in memory but frames are scattered into
>>>>>>>>>> different
>>>>>>>>>> locations. We could not definitely submit frame by frame as it would
>>>>>>>>>> be
>>>>>>>>>> software overhead (HW interrupting for each frame) resulting in video
>>>>>>>>>> lags.
>>>>>>>>>>
>>>>>>>>> IIUIC, it is 30fps and one dma interrupt per frame ... it doesn't seem
>>>>>>>>> inefficient at all. Even poor-latency audio would generate a higher
>>>>>>>>> interrupt-rate. So the "inefficiency concern" doesn't seem valid to
>>>>>>>>> me.
>>>>>>>>>
>>>>>>>>> Not to mean we shouldn't strive to reduce the interrupt-rate further.
>>>>>>>>> Another option is to emulate the ring-buffer scheme of ALSA.... which
>>>>>>>>> should be possible since for a session of video playback the frame
>>>>>>>>> buffers' locations wouldn't change.
>>>>>>>>>
>>>>>>>>> Yet another option is to use the full potential of the
>>>>>>>>> interleaved-xfer api as such. It seems you confuse a 'video frame'
>>>>>>>>> with the interleaved-xfer api's 'frame'. They are different.
>>>>>>>>>
>>>>>>>>> Assuming your one video frame is F bytes long and Gk is the gap in
>>>>>>>>> bytes between end of frame [k] and start of frame [k+1] and  Gi != Gj
>>>>>>>>> for i!=j
>>>>>>>>> In the context of interleaved-xfer api, you have just 1 Frame of 16
>>>>>>>>> chunks. Each chunk is Fbytes and the inter-chunk-gap(ICG) is Gk  where
>>>>>>>>> 0<=k<15
>>>>>>>>> So for your use-case .....
>>>>>>>>>   dma_interleaved_template.numf = 1   /* just 1 frame */
>>>>>>>>>   dma_interleaved_template.frame_size = 16  /* containing 16 chunks */
>>>>>>>>>    ...... //other parameters
>>>>>>>>>
>>>>>>>>> You have 3 options to choose from and all should work just as fine.
>>>>>>>>> Otherwise please state your problem in real numbers (video-frames'
>>>>>>>>> size, count & gap in bytes).
>>>>>>>>
>>>>>>>> Initially I interpreted interleaved template the same.  But, Lars
>>>>>>>> corrected me
>>>>>>>> in the subsequent discussion and let me put it here briefly,
>>>>>>>>
>>>>>>>> In the interleaved template, each frame represents a line of size
>>>>>>>> denoted by
>>>>>>>> chunk.size and the stride by icg.  'numf' represent number of frames
>>>>>>>> i.e.
>>>>>>>> number of lines.
>>>>>>>>
>>>>>>>> In video frame context,
>>>>>>>> chunk.size -> hsize
>>>>>>>> chunk.icg -> stride
>>>>>>>> numf -> vsize
>>>>>>>> and frame_size is always 1 as it will have only one chunk in a line.
>>>>>>>>
>>>>>>> But you said in your last post
>>>>>>>   "with each frame (a typical video frame size) being contiguous in
>>>>>>> memory"
>>>>>>>  ... which is not true from what you write above. Anyways, my first 2
>>>>>>> suggestions still hold.
>>>>>>
>>>>>> Yes, each video frame is contiguous and they can be scattered.
>>>>>>
>>>>> I assume by contiguous frame you mean as in framebuffer?  Which is an
>>>>> array of bytes.
>>>>> If yes, then you should do as I suggest first,  frame_size=16  and numf=1.
>>>>
>>>> I think am confusing you.  I would like to explain with an example.  Lets
>>>> say
>>>> each video frame is 4k size starting at address 0x10004000 (-0x10005000) and
>>>> other frame at 0x20002000 (-0x20003000), and so on.
>>>>
>>> As I said plz dont confuse video frame with DMA frame.... in video
>>> frame the stride is constant(zero or not) whereas in DMA context the
>>> stride must be zero for the frame to be called contiguous.
>>>
>>>> So, the frames are
>>>> scattered in memory and as the template doesnt allow multiple src_start/
>>>> dst_start we could not use single template to fill the HW descriptors (of
>>>> frames).  So, I feel your suggestion might not work if the frames are
>>>> scattered.
>>>> Also, how could we get 'vsize' value in your approach?
>>>>
>>> The client driver(video driver) should know the frame parameters.
>>> Practically you'll have to populate 16 transfer templates (frame
>>> attributes and locations won't change for a session) and submit to be
>>> transferred _cyclically_.
>>>
>>>>  More importantly,
>>>> we are overriding the semantics of interleaved template members.
>>>>
>>> Not at all. Interleaved-dma isn't meant for only constant stride/icg
>>> transfers. Rather it's for identical frames with random strides.
>>>
>>>>
>>>>>
>>>>> If no, then it seems you are already doing the right thing.... the
>>>>> ring-buffer scheme. Please share some stats how the current api is
>>>>> causing you overhead because that is a very common case (many
>>>>> controllers support LLI) and you have 467ms (@30fps with 16-frames
>>>>> ring-buffer) to queue in before you see any frame drop.
>>>>
>>>> As I mentioned earlier in the thread, our hardware has a SG engine where by
>>>> we could send multiple frames in a batch.  Using the original implementation
>>>> of interleaved API, we have three options to transfer.
>>>>
>>>> One is to send frame by frame to the hardware.  We get a async_tx desc for
>>>> each frame and then we submit it to hardware triggering it to transfer this
>>>> BD.
>>>> We queue the next descriptor on the pending queue and whenever there is
>>>> a completion interrupt we submit the next BD on this queue to hardware. In
>>>> this implementation we are not efficiently using the SG engine in the
>>>> hardware
>>>> as we transferring frame by frame even HW allows us to transfer multiple
>>>> frames.
>>>>
>>> Sending one frame at a time will likely cause jitters. That shouldn't be done.
>>>
>>>> The second option is to queue all the BDs until the maximum frames that HW
>>>> is
>>>> capable to transfer in a batch and then submit to SG engine in the HW.  With
>>>> this approach I feel there will be additional software overhead to track the
>>>> number
>>>> of maximum transfers and few additional cycles to release the cookie of each
>>>> desc.
>>>> Here, each desc represents a frame.
>>>>
>>> APIs are written for the gcd of h/w. We should optimize only for
>>> majority and here isn't even anything gained after changing the api.
>>> What you want to 'optimize' has been there since ever. Nobody
>>> considers that overhead. BTW you don't even want to spend a few 'extra
>>> cycles' but what about every other platform that doesn't support this
>>> and will have to scan for such transfer requests?
>>>
>>>> The last option is the current implementation of the driver along with this
>>>> change in
>>>> API.  It will allow us to send array of interleaved templates wherein we
>>>> could allocate
>>>> a single async desc which will handle multiple frames (or segments) and just
>>>> submit
>>>> this desc to HW.  Then we program the current to first frame, tail to last
>>>> frame and
>>>> HW will complete this transfer.  Here, each desc represents multiple frames.
>>>>
>>>> My point here is the driver should use hardware resources efficiently and I
>>>> feel the
>>>> driver will be inefficient if we dont use them.
>>>>
>>> I am afraid you are getting carried away by the 'awesomeness' of your
>>> hardware. RingBuffers/Cyclic transfers  are meant for cases just like
>>> yours.
>>> Consider the following ... if you queue 16 frames and don't care to
>>> track before all are transmitted, you'll have very high latency. Video
>>> seeks will take unacceptably long and give the impression of a slow
>>> system. Whereas if you get callbacks for each frame rendered, you
>>> could updates frames from the next one thereby having very quick
>>> response time.
>>
>> Not at all, at least in our hardware, when we submit transfers it is most
>> likely to be completed. There are few errors, but mostly are recoverable
>> and it doesnt stop the transfer unless there is a critical error which needs
>> a reset to the whole system.  So, at least in my use case there will be
>> no latency.  I am not saying hardware is great, but it is the IP implementation.
>>
> I am not talking about the error cases.
> Apply your patch locally so that you queue 16frames and not get
> notified upon each frame 'rendered'...  now click on 'seek bar' of the
> video player. See how slow it is to jump to play from the new
> location.  Or if the frames are to be encoded after dma transfer...
> see the 'padding' that would need to be done at the end.  These
> concerns are common with audio subsystem using ring-buffers.
> Anyways that is just FYI, I don't care how you implement your platform.
>
>> Regarding this API change, I had earlier explained my use case in my
>> v2 thread.
>> Lars and Vinod came up with this resolution to allow array of
>> interleaved templates.
>> I also feel this is reasonable change for the subsystem.
>>
> I don't see you getting any better performance from your hardware and
> I certainly don't find your usecase anything new (many dma controllers
> support LLI and they work just fine as such). And I have already
> explained how you could do, whatever you want, without this change.
> So a polite NAK from me.

Ok.  I would go with your suggestion of having frame_size = 16 and will
send v4 dropping this change.

Thanks for the valuable suggestions.

Srikanth

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-27 18:04                           ` Srikanth Thokala
  0 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-27 18:04 UTC (permalink / raw)
  To: Jassi Brar
  Cc: devicetree, Levente Kurusa, Lars-Peter Clausen, Koul, Vinod,
	lkml, Michal Simek, Srikanth Thokala, dmaengine, Rob Herring,
	Grant Likely, Williams, Dan J, Andy Shevchenko, linux-arm-kernel

On Wed, Feb 26, 2014 at 8:32 PM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
> On 26 February 2014 23:21, Srikanth Thokala <sthokal@xilinx.com> wrote:
>> On Mon, Feb 24, 2014 at 7:39 AM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
>>> On 21 February 2014 23:37, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>> On Thu, Feb 20, 2014 at 3:23 PM, Jassi Brar <jaswinder.singh@linaro.org>
>>>> wrote:
>>>>> On 20 February 2014 14:54, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>>>> On Wed, Feb 19, 2014 at 12:33 AM, Jassi Brar <jaswinder.singh@linaro.org>
>>>>>> wrote:
>>>>>>> On 18 February 2014 23:16, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>>>>>> On Tue, Feb 18, 2014 at 10:20 PM, Jassi Brar
>>>>>>>> <jaswinder.singh@linaro.org> wrote:
>>>>>>>>> On 18 February 2014 16:58, Srikanth Thokala <sthokal@xilinx.com>
>>>>>>>>> wrote:
>>>>>>>>>> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar
>>>>>>>>>> <jaswinder.singh@linaro.org> wrote:
>>>>>>>>>>> On 15 February 2014 17:30, Srikanth Thokala <sthokal@xilinx.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> The current implementation of interleaved DMA API support multiple
>>>>>>>>>>>> frames only when the memory is contiguous by incrementing
>>>>>>>>>>>> src_start/
>>>>>>>>>>>> dst_start members of interleaved template.
>>>>>>>>>>>>
>>>>>>>>>>>> But, when the memory is non-contiguous it will restrict slave
>>>>>>>>>>>> device
>>>>>>>>>>>> to not submit multiple frames in a batch.  This patch handles this
>>>>>>>>>>>> issue by allowing the slave device to send array of interleaved dma
>>>>>>>>>>>> templates each having a different memory location.
>>>>>>>>>>>>
>>>>>>>>>>> How fragmented could be memory in your case? Is it inefficient to
>>>>>>>>>>> submit separate transfers for each segment/frame?
>>>>>>>>>>> It will help if you could give a typical example (chunk size and gap
>>>>>>>>>>> in bytes) of what you worry about.
>>>>>>>>>>
>>>>>>>>>> With scatter-gather engine feature in the hardware, submitting
>>>>>>>>>> separate
>>>>>>>>>> transfers for each frame look inefficient. As an example, our DMA
>>>>>>>>>> engine
>>>>>>>>>> supports up to 16 video frames, with each frame (a typical video
>>>>>>>>>> frame
>>>>>>>>>> size) being contiguous in memory but frames are scattered into
>>>>>>>>>> different
>>>>>>>>>> locations. We could not definitely submit frame by frame as it would
>>>>>>>>>> be
>>>>>>>>>> software overhead (HW interrupting for each frame) resulting in video
>>>>>>>>>> lags.
>>>>>>>>>>
>>>>>>>>> IIUIC, it is 30fps and one dma interrupt per frame ... it doesn't seem
>>>>>>>>> inefficient at all. Even poor-latency audio would generate a higher
>>>>>>>>> interrupt-rate. So the "inefficiency concern" doesn't seem valid to
>>>>>>>>> me.
>>>>>>>>>
>>>>>>>>> Not to mean we shouldn't strive to reduce the interrupt-rate further.
>>>>>>>>> Another option is to emulate the ring-buffer scheme of ALSA.... which
>>>>>>>>> should be possible since for a session of video playback the frame
>>>>>>>>> buffers' locations wouldn't change.
>>>>>>>>>
>>>>>>>>> Yet another option is to use the full potential of the
>>>>>>>>> interleaved-xfer api as such. It seems you confuse a 'video frame'
>>>>>>>>> with the interleaved-xfer api's 'frame'. They are different.
>>>>>>>>>
>>>>>>>>> Assuming your one video frame is F bytes long and Gk is the gap in
>>>>>>>>> bytes between end of frame [k] and start of frame [k+1] and  Gi != Gj
>>>>>>>>> for i!=j
>>>>>>>>> In the context of interleaved-xfer api, you have just 1 Frame of 16
>>>>>>>>> chunks. Each chunk is Fbytes and the inter-chunk-gap(ICG) is Gk  where
>>>>>>>>> 0<=k<15
>>>>>>>>> So for your use-case .....
>>>>>>>>>   dma_interleaved_template.numf = 1   /* just 1 frame */
>>>>>>>>>   dma_interleaved_template.frame_size = 16  /* containing 16 chunks */
>>>>>>>>>    ...... //other parameters
>>>>>>>>>
>>>>>>>>> You have 3 options to choose from and all should work just as fine.
>>>>>>>>> Otherwise please state your problem in real numbers (video-frames'
>>>>>>>>> size, count & gap in bytes).
>>>>>>>>
>>>>>>>> Initially I interpreted interleaved template the same.  But, Lars
>>>>>>>> corrected me
>>>>>>>> in the subsequent discussion and let me put it here briefly,
>>>>>>>>
>>>>>>>> In the interleaved template, each frame represents a line of size
>>>>>>>> denoted by
>>>>>>>> chunk.size and the stride by icg.  'numf' represent number of frames
>>>>>>>> i.e.
>>>>>>>> number of lines.
>>>>>>>>
>>>>>>>> In video frame context,
>>>>>>>> chunk.size -> hsize
>>>>>>>> chunk.icg -> stride
>>>>>>>> numf -> vsize
>>>>>>>> and frame_size is always 1 as it will have only one chunk in a line.
>>>>>>>>
>>>>>>> But you said in your last post
>>>>>>>   "with each frame (a typical video frame size) being contiguous in
>>>>>>> memory"
>>>>>>>  ... which is not true from what you write above. Anyways, my first 2
>>>>>>> suggestions still hold.
>>>>>>
>>>>>> Yes, each video frame is contiguous and they can be scattered.
>>>>>>
>>>>> I assume by contiguous frame you mean as in framebuffer?  Which is an
>>>>> array of bytes.
>>>>> If yes, then you should do as I suggest first,  frame_size=16  and numf=1.
>>>>
>>>> I think am confusing you.  I would like to explain with an example.  Lets
>>>> say
>>>> each video frame is 4k size starting at address 0x10004000 (-0x10005000) and
>>>> other frame at 0x20002000 (-0x20003000), and so on.
>>>>
>>> As I said plz dont confuse video frame with DMA frame.... in video
>>> frame the stride is constant(zero or not) whereas in DMA context the
>>> stride must be zero for the frame to be called contiguous.
>>>
>>>> So, the frames are
>>>> scattered in memory and as the template doesnt allow multiple src_start/
>>>> dst_start we could not use single template to fill the HW descriptors (of
>>>> frames).  So, I feel your suggestion might not work if the frames are
>>>> scattered.
>>>> Also, how could we get 'vsize' value in your approach?
>>>>
>>> The client driver(video driver) should know the frame parameters.
>>> Practically you'll have to populate 16 transfer templates (frame
>>> attributes and locations won't change for a session) and submit to be
>>> transferred _cyclically_.
>>>
>>>>  More importantly,
>>>> we are overriding the semantics of interleaved template members.
>>>>
>>> Not at all. Interleaved-dma isn't meant for only constant stride/icg
>>> transfers. Rather it's for identical frames with random strides.
>>>
>>>>
>>>>>
>>>>> If no, then it seems you are already doing the right thing.... the
>>>>> ring-buffer scheme. Please share some stats how the current api is
>>>>> causing you overhead because that is a very common case (many
>>>>> controllers support LLI) and you have 467ms (@30fps with 16-frames
>>>>> ring-buffer) to queue in before you see any frame drop.
>>>>
>>>> As I mentioned earlier in the thread, our hardware has a SG engine where by
>>>> we could send multiple frames in a batch.  Using the original implementation
>>>> of interleaved API, we have three options to transfer.
>>>>
>>>> One is to send frame by frame to the hardware.  We get a async_tx desc for
>>>> each frame and then we submit it to hardware triggering it to transfer this
>>>> BD.
>>>> We queue the next descriptor on the pending queue and whenever there is
>>>> a completion interrupt we submit the next BD on this queue to hardware. In
>>>> this implementation we are not efficiently using the SG engine in the
>>>> hardware
>>>> as we transferring frame by frame even HW allows us to transfer multiple
>>>> frames.
>>>>
>>> Sending one frame at a time will likely cause jitters. That shouldn't be done.
>>>
>>>> The second option is to queue all the BDs until the maximum frames that HW
>>>> is
>>>> capable to transfer in a batch and then submit to SG engine in the HW.  With
>>>> this approach I feel there will be additional software overhead to track the
>>>> number
>>>> of maximum transfers and few additional cycles to release the cookie of each
>>>> desc.
>>>> Here, each desc represents a frame.
>>>>
>>> APIs are written for the gcd of h/w. We should optimize only for
>>> majority and here isn't even anything gained after changing the api.
>>> What you want to 'optimize' has been there since ever. Nobody
>>> considers that overhead. BTW you don't even want to spend a few 'extra
>>> cycles' but what about every other platform that doesn't support this
>>> and will have to scan for such transfer requests?
>>>
>>>> The last option is the current implementation of the driver along with this
>>>> change in
>>>> API.  It will allow us to send array of interleaved templates wherein we
>>>> could allocate
>>>> a single async desc which will handle multiple frames (or segments) and just
>>>> submit
>>>> this desc to HW.  Then we program the current to first frame, tail to last
>>>> frame and
>>>> HW will complete this transfer.  Here, each desc represents multiple frames.
>>>>
>>>> My point here is the driver should use hardware resources efficiently and I
>>>> feel the
>>>> driver will be inefficient if we dont use them.
>>>>
>>> I am afraid you are getting carried away by the 'awesomeness' of your
>>> hardware. RingBuffers/Cyclic transfers  are meant for cases just like
>>> yours.
>>> Consider the following ... if you queue 16 frames and don't care to
>>> track before all are transmitted, you'll have very high latency. Video
>>> seeks will take unacceptably long and give the impression of a slow
>>> system. Whereas if you get callbacks for each frame rendered, you
>>> could updates frames from the next one thereby having very quick
>>> response time.
>>
>> Not at all, at least in our hardware, when we submit transfers it is most
>> likely to be completed. There are few errors, but mostly are recoverable
>> and it doesnt stop the transfer unless there is a critical error which needs
>> a reset to the whole system.  So, at least in my use case there will be
>> no latency.  I am not saying hardware is great, but it is the IP implementation.
>>
> I am not talking about the error cases.
> Apply your patch locally so that you queue 16frames and not get
> notified upon each frame 'rendered'...  now click on 'seek bar' of the
> video player. See how slow it is to jump to play from the new
> location.  Or if the frames are to be encoded after dma transfer...
> see the 'padding' that would need to be done at the end.  These
> concerns are common with audio subsystem using ring-buffers.
> Anyways that is just FYI, I don't care how you implement your platform.
>
>> Regarding this API change, I had earlier explained my use case in my
>> v2 thread.
>> Lars and Vinod came up with this resolution to allow array of
>> interleaved templates.
>> I also feel this is reasonable change for the subsystem.
>>
> I don't see you getting any better performance from your hardware and
> I certainly don't find your usecase anything new (many dma controllers
> support LLI and they work just fine as such). And I have already
> explained how you could do, whatever you want, without this change.
> So a polite NAK from me.

Ok.  I would go with your suggestion of having frame_size = 16 and will
send v4 dropping this change.

Thanks for the valuable suggestions.

Srikanth

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory
@ 2014-02-27 18:04                           ` Srikanth Thokala
  0 siblings, 0 replies; 65+ messages in thread
From: Srikanth Thokala @ 2014-02-27 18:04 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Feb 26, 2014 at 8:32 PM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
> On 26 February 2014 23:21, Srikanth Thokala <sthokal@xilinx.com> wrote:
>> On Mon, Feb 24, 2014 at 7:39 AM, Jassi Brar <jaswinder.singh@linaro.org> wrote:
>>> On 21 February 2014 23:37, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>> On Thu, Feb 20, 2014 at 3:23 PM, Jassi Brar <jaswinder.singh@linaro.org>
>>>> wrote:
>>>>> On 20 February 2014 14:54, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>>>> On Wed, Feb 19, 2014 at 12:33 AM, Jassi Brar <jaswinder.singh@linaro.org>
>>>>>> wrote:
>>>>>>> On 18 February 2014 23:16, Srikanth Thokala <sthokal@xilinx.com> wrote:
>>>>>>>> On Tue, Feb 18, 2014 at 10:20 PM, Jassi Brar
>>>>>>>> <jaswinder.singh@linaro.org> wrote:
>>>>>>>>> On 18 February 2014 16:58, Srikanth Thokala <sthokal@xilinx.com>
>>>>>>>>> wrote:
>>>>>>>>>> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar
>>>>>>>>>> <jaswinder.singh@linaro.org> wrote:
>>>>>>>>>>> On 15 February 2014 17:30, Srikanth Thokala <sthokal@xilinx.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> The current implementation of interleaved DMA API support multiple
>>>>>>>>>>>> frames only when the memory is contiguous by incrementing
>>>>>>>>>>>> src_start/
>>>>>>>>>>>> dst_start members of interleaved template.
>>>>>>>>>>>>
>>>>>>>>>>>> But, when the memory is non-contiguous it will restrict slave
>>>>>>>>>>>> device
>>>>>>>>>>>> to not submit multiple frames in a batch.  This patch handles this
>>>>>>>>>>>> issue by allowing the slave device to send array of interleaved dma
>>>>>>>>>>>> templates each having a different memory location.
>>>>>>>>>>>>
>>>>>>>>>>> How fragmented could be memory in your case? Is it inefficient to
>>>>>>>>>>> submit separate transfers for each segment/frame?
>>>>>>>>>>> It will help if you could give a typical example (chunk size and gap
>>>>>>>>>>> in bytes) of what you worry about.
>>>>>>>>>>
>>>>>>>>>> With scatter-gather engine feature in the hardware, submitting
>>>>>>>>>> separate
>>>>>>>>>> transfers for each frame look inefficient. As an example, our DMA
>>>>>>>>>> engine
>>>>>>>>>> supports up to 16 video frames, with each frame (a typical video
>>>>>>>>>> frame
>>>>>>>>>> size) being contiguous in memory but frames are scattered into
>>>>>>>>>> different
>>>>>>>>>> locations. We could not definitely submit frame by frame as it would
>>>>>>>>>> be
>>>>>>>>>> software overhead (HW interrupting for each frame) resulting in video
>>>>>>>>>> lags.
>>>>>>>>>>
>>>>>>>>> IIUIC, it is 30fps and one dma interrupt per frame ... it doesn't seem
>>>>>>>>> inefficient at all. Even poor-latency audio would generate a higher
>>>>>>>>> interrupt-rate. So the "inefficiency concern" doesn't seem valid to
>>>>>>>>> me.
>>>>>>>>>
>>>>>>>>> Not to mean we shouldn't strive to reduce the interrupt-rate further.
>>>>>>>>> Another option is to emulate the ring-buffer scheme of ALSA.... which
>>>>>>>>> should be possible since for a session of video playback the frame
>>>>>>>>> buffers' locations wouldn't change.
>>>>>>>>>
>>>>>>>>> Yet another option is to use the full potential of the
>>>>>>>>> interleaved-xfer api as such. It seems you confuse a 'video frame'
>>>>>>>>> with the interleaved-xfer api's 'frame'. They are different.
>>>>>>>>>
>>>>>>>>> Assuming your one video frame is F bytes long and Gk is the gap in
>>>>>>>>> bytes between end of frame [k] and start of frame [k+1] and  Gi != Gj
>>>>>>>>> for i!=j
>>>>>>>>> In the context of interleaved-xfer api, you have just 1 Frame of 16
>>>>>>>>> chunks. Each chunk is Fbytes and the inter-chunk-gap(ICG) is Gk  where
>>>>>>>>> 0<=k<15
>>>>>>>>> So for your use-case .....
>>>>>>>>>   dma_interleaved_template.numf = 1   /* just 1 frame */
>>>>>>>>>   dma_interleaved_template.frame_size = 16  /* containing 16 chunks */
>>>>>>>>>    ...... //other parameters
>>>>>>>>>
>>>>>>>>> You have 3 options to choose from and all should work just as fine.
>>>>>>>>> Otherwise please state your problem in real numbers (video-frames'
>>>>>>>>> size, count & gap in bytes).
>>>>>>>>
>>>>>>>> Initially I interpreted interleaved template the same.  But, Lars
>>>>>>>> corrected me
>>>>>>>> in the subsequent discussion and let me put it here briefly,
>>>>>>>>
>>>>>>>> In the interleaved template, each frame represents a line of size
>>>>>>>> denoted by
>>>>>>>> chunk.size and the stride by icg.  'numf' represent number of frames
>>>>>>>> i.e.
>>>>>>>> number of lines.
>>>>>>>>
>>>>>>>> In video frame context,
>>>>>>>> chunk.size -> hsize
>>>>>>>> chunk.icg -> stride
>>>>>>>> numf -> vsize
>>>>>>>> and frame_size is always 1 as it will have only one chunk in a line.
>>>>>>>>
>>>>>>> But you said in your last post
>>>>>>>   "with each frame (a typical video frame size) being contiguous in
>>>>>>> memory"
>>>>>>>  ... which is not true from what you write above. Anyways, my first 2
>>>>>>> suggestions still hold.
>>>>>>
>>>>>> Yes, each video frame is contiguous and they can be scattered.
>>>>>>
>>>>> I assume by contiguous frame you mean as in framebuffer?  Which is an
>>>>> array of bytes.
>>>>> If yes, then you should do as I suggest first,  frame_size=16  and numf=1.
>>>>
>>>> I think am confusing you.  I would like to explain with an example.  Lets
>>>> say
>>>> each video frame is 4k size starting at address 0x10004000 (-0x10005000) and
>>>> other frame at 0x20002000 (-0x20003000), and so on.
>>>>
>>> As I said plz dont confuse video frame with DMA frame.... in video
>>> frame the stride is constant(zero or not) whereas in DMA context the
>>> stride must be zero for the frame to be called contiguous.
>>>
>>>> So, the frames are
>>>> scattered in memory and as the template doesnt allow multiple src_start/
>>>> dst_start we could not use single template to fill the HW descriptors (of
>>>> frames).  So, I feel your suggestion might not work if the frames are
>>>> scattered.
>>>> Also, how could we get 'vsize' value in your approach?
>>>>
>>> The client driver(video driver) should know the frame parameters.
>>> Practically you'll have to populate 16 transfer templates (frame
>>> attributes and locations won't change for a session) and submit to be
>>> transferred _cyclically_.
>>>
>>>>  More importantly,
>>>> we are overriding the semantics of interleaved template members.
>>>>
>>> Not at all. Interleaved-dma isn't meant for only constant stride/icg
>>> transfers. Rather it's for identical frames with random strides.
>>>
>>>>
>>>>>
>>>>> If no, then it seems you are already doing the right thing.... the
>>>>> ring-buffer scheme. Please share some stats how the current api is
>>>>> causing you overhead because that is a very common case (many
>>>>> controllers support LLI) and you have 467ms (@30fps with 16-frames
>>>>> ring-buffer) to queue in before you see any frame drop.
>>>>
>>>> As I mentioned earlier in the thread, our hardware has a SG engine where by
>>>> we could send multiple frames in a batch.  Using the original implementation
>>>> of interleaved API, we have three options to transfer.
>>>>
>>>> One is to send frame by frame to the hardware.  We get a async_tx desc for
>>>> each frame and then we submit it to hardware triggering it to transfer this
>>>> BD.
>>>> We queue the next descriptor on the pending queue and whenever there is
>>>> a completion interrupt we submit the next BD on this queue to hardware. In
>>>> this implementation we are not efficiently using the SG engine in the
>>>> hardware
>>>> as we transferring frame by frame even HW allows us to transfer multiple
>>>> frames.
>>>>
>>> Sending one frame at a time will likely cause jitters. That shouldn't be done.
>>>
>>>> The second option is to queue all the BDs until the maximum frames that HW
>>>> is
>>>> capable to transfer in a batch and then submit to SG engine in the HW.  With
>>>> this approach I feel there will be additional software overhead to track the
>>>> number
>>>> of maximum transfers and few additional cycles to release the cookie of each
>>>> desc.
>>>> Here, each desc represents a frame.
>>>>
>>> APIs are written for the gcd of h/w. We should optimize only for
>>> majority and here isn't even anything gained after changing the api.
>>> What you want to 'optimize' has been there since ever. Nobody
>>> considers that overhead. BTW you don't even want to spend a few 'extra
>>> cycles' but what about every other platform that doesn't support this
>>> and will have to scan for such transfer requests?
>>>
>>>> The last option is the current implementation of the driver along with this
>>>> change in
>>>> API.  It will allow us to send array of interleaved templates wherein we
>>>> could allocate
>>>> a single async desc which will handle multiple frames (or segments) and just
>>>> submit
>>>> this desc to HW.  Then we program the current to first frame, tail to last
>>>> frame and
>>>> HW will complete this transfer.  Here, each desc represents multiple frames.
>>>>
>>>> My point here is the driver should use hardware resources efficiently and I
>>>> feel the
>>>> driver will be inefficient if we dont use them.
>>>>
>>> I am afraid you are getting carried away by the 'awesomeness' of your
>>> hardware. RingBuffers/Cyclic transfers  are meant for cases just like
>>> yours.
>>> Consider the following ... if you queue 16 frames and don't care to
>>> track before all are transmitted, you'll have very high latency. Video
>>> seeks will take unacceptably long and give the impression of a slow
>>> system. Whereas if you get callbacks for each frame rendered, you
>>> could updates frames from the next one thereby having very quick
>>> response time.
>>
>> Not at all, at least in our hardware, when we submit transfers it is most
>> likely to be completed. There are few errors, but mostly are recoverable
>> and it doesnt stop the transfer unless there is a critical error which needs
>> a reset to the whole system.  So, at least in my use case there will be
>> no latency.  I am not saying hardware is great, but it is the IP implementation.
>>
> I am not talking about the error cases.
> Apply your patch locally so that you queue 16frames and not get
> notified upon each frame 'rendered'...  now click on 'seek bar' of the
> video player. See how slow it is to jump to play from the new
> location.  Or if the frames are to be encoded after dma transfer...
> see the 'padding' that would need to be done at the end.  These
> concerns are common with audio subsystem using ring-buffers.
> Anyways that is just FYI, I don't care how you implement your platform.
>
>> Regarding this API change, I had earlier explained my use case in my
>> v2 thread.
>> Lars and Vinod came up with this resolution to allow array of
>> interleaved templates.
>> I also feel this is reasonable change for the subsystem.
>>
> I don't see you getting any better performance from your hardware and
> I certainly don't find your usecase anything new (many dma controllers
> support LLI and they work just fine as such). And I have already
> explained how you could do, whatever you want, without this change.
> So a polite NAK from me.

Ok.  I would go with your suggestion of having frame_size = 16 and will
send v4 dropping this change.

Thanks for the valuable suggestions.

Srikanth

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 2/3] dma: Add Xilinx Video DMA DT Binding Documentation
  2014-02-15 12:00   ` Srikanth Thokala
  (?)
@ 2014-04-22 13:34     ` Rob Herring
  -1 siblings, 0 replies; 65+ messages in thread
From: Rob Herring @ 2014-04-22 13:34 UTC (permalink / raw)
  To: Srikanth Thokala
  Cc: Dan Williams, Vinod Koul, Michal Simek, Grant Likely,
	Rob Herring, levex, Lars-Peter Clausen, Andy Shevchenko,
	dmaengine, linux-arm-kernel, linux-kernel, devicetree

On Sat, Feb 15, 2014 at 6:00 AM, Srikanth Thokala <sthokal@xilinx.com> wrote:
> Device-tree binding documentation of Xilinx Video DMA Engine
>
> Signed-off-by: Srikanth Thokala <sthokal@xilinx.com>

A few typos below, but otherwise:

Acked-by: Rob Herring <robh@kernel.org>

> ---
> Changes in v3:
> None
>
> Changes in v2:
> - Removed device-id DT property, as suggested by Arnd Bergmann
> - Properly documented DT bindings as suggested by Arnd Bergmann
> ---
>  .../devicetree/bindings/dma/xilinx/xilinx_vdma.txt |   75 ++++++++++++++++++++
>  1 file changed, 75 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/dma/xilinx/xilinx_vdma.txt
>
> diff --git a/Documentation/devicetree/bindings/dma/xilinx/xilinx_vdma.txt b/Documentation/devicetree/bindings/dma/xilinx/xilinx_vdma.txt
> new file mode 100644
> index 0000000..ab8be1a
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/dma/xilinx/xilinx_vdma.txt
> @@ -0,0 +1,75 @@
> +Xilinx AXI VDMA engine, it does transfers between memory and video devices.
> +It can be configured to have one channel or two channels. If configured
> +as two channels, one is to transmit to the video device and another is
> +to receive from the video device.
> +
> +Required properties:
> +- compatible: Should be "xlnx,axi-vdma-1.00.a"
> +- #dma-cells: Should be <1>, see "dmas" property below
> +- reg: Should contain VDMA registers location and length.
> +- xlnx,num-fstores: Should be the number of framebuffers as configured in h/w.
> +- dma-channel child node: Should have atleast one channel and can have upto

s/atleast/at least/
s/upto/up to/

> +       two channels per device. This node specifies the properties of each
> +       DMA channel (see child node properties below).
> +
> +Optional properties:
> +- xlnx,include-sg: Tells whether configured for Scatter-mode in
> +       the hardware.
> +- xlnx,flush-fsync: Tells whether which channel to Flush on Frame sync.

s/whether//

> +       It takes following values:
> +       {1}, flush both channels
> +       {2}, flush mm2s channel
> +       {3}, flush s2mm channel
> +
> +Required child node properties:
> +- compatible: It should be either "xlnx,axi-vdma-mm2s-channel" or
> +       "xlnx,axi-vdma-s2mm-channel".
> +- interrupts: Should contain per channel VDMA interrupts.
> +- xlnx,data-width: Should contain the stream data width, take values
> +       {32,64...1024}.
> +
> +Option child node properties:

s/Option/Optional/

> +- xlnx,include-dre: Tells whether hardware is configured for Data
> +       Realignment Engine.
> +- xlnx,genlock-mode: Tells whether Genlock synchronization is
> +       enabled/disabled in hardware.
> +
> +Example:
> +++++++++
> +
> +axi_vdma_0: axivdma@40030000 {
> +       compatible = "xlnx,axi-vdma-1.00.a";
> +       #dma_cells = <1>;
> +       reg = < 0x40030000 0x10000 >;
> +       xlnx,num-fstores = <0x8>;
> +       xlnx,flush-fsync = <0x1>;
> +       dma-channel@40030000 {
> +               compatible = "xlnx,axi-vdma-mm2s-channel";
> +               interrupts = < 0 54 4 >;
> +               xlnx,datawidth = <0x40>;
> +       } ;
> +       dma-channel@40030030 {
> +               compatible = "xlnx,axi-vdma-s2mm-channel";
> +               interrupts = < 0 53 4 >;
> +               xlnx,datawidth = <0x40>;
> +       } ;
> +} ;
> +
> +
> +* DMA client
> +
> +Required properties:
> +- dmas: a list of <[Video DMA device phandle] [Channel ID]> pairs,
> +       where Channel ID is '0' for write/tx and '1' for read/rx
> +       channel.
> +- dma-names: a list of DMA channel names, one per "dmas" entry
> +
> +Example:
> +++++++++
> +
> +vdmatest_0: vdmatest@0 {
> +       compatible ="xlnx,axi-vdma-test-1.00.a";
> +       dmas = <&axi_vdma_0 0
> +               &axi_vdma_0 1>;
> +       dma-names = "vdma0", "vdma1";
> +} ;
> --
> 1.7.9.5
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 2/3] dma: Add Xilinx Video DMA DT Binding Documentation
@ 2014-04-22 13:34     ` Rob Herring
  0 siblings, 0 replies; 65+ messages in thread
From: Rob Herring @ 2014-04-22 13:34 UTC (permalink / raw)
  To: Srikanth Thokala
  Cc: Dan Williams, Vinod Koul, Michal Simek, Grant Likely,
	Rob Herring, levex, Lars-Peter Clausen, Andy Shevchenko,
	dmaengine, linux-arm-kernel, linux-kernel, devicetree

On Sat, Feb 15, 2014 at 6:00 AM, Srikanth Thokala <sthokal@xilinx.com> wrote:
> Device-tree binding documentation of Xilinx Video DMA Engine
>
> Signed-off-by: Srikanth Thokala <sthokal@xilinx.com>

A few typos below, but otherwise:

Acked-by: Rob Herring <robh@kernel.org>

> ---
> Changes in v3:
> None
>
> Changes in v2:
> - Removed device-id DT property, as suggested by Arnd Bergmann
> - Properly documented DT bindings as suggested by Arnd Bergmann
> ---
>  .../devicetree/bindings/dma/xilinx/xilinx_vdma.txt |   75 ++++++++++++++++++++
>  1 file changed, 75 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/dma/xilinx/xilinx_vdma.txt
>
> diff --git a/Documentation/devicetree/bindings/dma/xilinx/xilinx_vdma.txt b/Documentation/devicetree/bindings/dma/xilinx/xilinx_vdma.txt
> new file mode 100644
> index 0000000..ab8be1a
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/dma/xilinx/xilinx_vdma.txt
> @@ -0,0 +1,75 @@
> +Xilinx AXI VDMA engine, it does transfers between memory and video devices.
> +It can be configured to have one channel or two channels. If configured
> +as two channels, one is to transmit to the video device and another is
> +to receive from the video device.
> +
> +Required properties:
> +- compatible: Should be "xlnx,axi-vdma-1.00.a"
> +- #dma-cells: Should be <1>, see "dmas" property below
> +- reg: Should contain VDMA registers location and length.
> +- xlnx,num-fstores: Should be the number of framebuffers as configured in h/w.
> +- dma-channel child node: Should have atleast one channel and can have upto

s/atleast/at least/
s/upto/up to/

> +       two channels per device. This node specifies the properties of each
> +       DMA channel (see child node properties below).
> +
> +Optional properties:
> +- xlnx,include-sg: Tells whether configured for Scatter-mode in
> +       the hardware.
> +- xlnx,flush-fsync: Tells whether which channel to Flush on Frame sync.

s/whether//

> +       It takes following values:
> +       {1}, flush both channels
> +       {2}, flush mm2s channel
> +       {3}, flush s2mm channel
> +
> +Required child node properties:
> +- compatible: It should be either "xlnx,axi-vdma-mm2s-channel" or
> +       "xlnx,axi-vdma-s2mm-channel".
> +- interrupts: Should contain per channel VDMA interrupts.
> +- xlnx,data-width: Should contain the stream data width, take values
> +       {32,64...1024}.
> +
> +Option child node properties:

s/Option/Optional/

> +- xlnx,include-dre: Tells whether hardware is configured for Data
> +       Realignment Engine.
> +- xlnx,genlock-mode: Tells whether Genlock synchronization is
> +       enabled/disabled in hardware.
> +
> +Example:
> +++++++++
> +
> +axi_vdma_0: axivdma@40030000 {
> +       compatible = "xlnx,axi-vdma-1.00.a";
> +       #dma_cells = <1>;
> +       reg = < 0x40030000 0x10000 >;
> +       xlnx,num-fstores = <0x8>;
> +       xlnx,flush-fsync = <0x1>;
> +       dma-channel@40030000 {
> +               compatible = "xlnx,axi-vdma-mm2s-channel";
> +               interrupts = < 0 54 4 >;
> +               xlnx,datawidth = <0x40>;
> +       } ;
> +       dma-channel@40030030 {
> +               compatible = "xlnx,axi-vdma-s2mm-channel";
> +               interrupts = < 0 53 4 >;
> +               xlnx,datawidth = <0x40>;
> +       } ;
> +} ;
> +
> +
> +* DMA client
> +
> +Required properties:
> +- dmas: a list of <[Video DMA device phandle] [Channel ID]> pairs,
> +       where Channel ID is '0' for write/tx and '1' for read/rx
> +       channel.
> +- dma-names: a list of DMA channel names, one per "dmas" entry
> +
> +Example:
> +++++++++
> +
> +vdmatest_0: vdmatest@0 {
> +       compatible ="xlnx,axi-vdma-test-1.00.a";
> +       dmas = <&axi_vdma_0 0
> +               &axi_vdma_0 1>;
> +       dma-names = "vdma0", "vdma1";
> +} ;
> --
> 1.7.9.5
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH v3 2/3] dma: Add Xilinx Video DMA DT Binding Documentation
@ 2014-04-22 13:34     ` Rob Herring
  0 siblings, 0 replies; 65+ messages in thread
From: Rob Herring @ 2014-04-22 13:34 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Feb 15, 2014 at 6:00 AM, Srikanth Thokala <sthokal@xilinx.com> wrote:
> Device-tree binding documentation of Xilinx Video DMA Engine
>
> Signed-off-by: Srikanth Thokala <sthokal@xilinx.com>

A few typos below, but otherwise:

Acked-by: Rob Herring <robh@kernel.org>

> ---
> Changes in v3:
> None
>
> Changes in v2:
> - Removed device-id DT property, as suggested by Arnd Bergmann
> - Properly documented DT bindings as suggested by Arnd Bergmann
> ---
>  .../devicetree/bindings/dma/xilinx/xilinx_vdma.txt |   75 ++++++++++++++++++++
>  1 file changed, 75 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/dma/xilinx/xilinx_vdma.txt
>
> diff --git a/Documentation/devicetree/bindings/dma/xilinx/xilinx_vdma.txt b/Documentation/devicetree/bindings/dma/xilinx/xilinx_vdma.txt
> new file mode 100644
> index 0000000..ab8be1a
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/dma/xilinx/xilinx_vdma.txt
> @@ -0,0 +1,75 @@
> +Xilinx AXI VDMA engine, it does transfers between memory and video devices.
> +It can be configured to have one channel or two channels. If configured
> +as two channels, one is to transmit to the video device and another is
> +to receive from the video device.
> +
> +Required properties:
> +- compatible: Should be "xlnx,axi-vdma-1.00.a"
> +- #dma-cells: Should be <1>, see "dmas" property below
> +- reg: Should contain VDMA registers location and length.
> +- xlnx,num-fstores: Should be the number of framebuffers as configured in h/w.
> +- dma-channel child node: Should have atleast one channel and can have upto

s/atleast/at least/
s/upto/up to/

> +       two channels per device. This node specifies the properties of each
> +       DMA channel (see child node properties below).
> +
> +Optional properties:
> +- xlnx,include-sg: Tells whether configured for Scatter-mode in
> +       the hardware.
> +- xlnx,flush-fsync: Tells whether which channel to Flush on Frame sync.

s/whether//

> +       It takes following values:
> +       {1}, flush both channels
> +       {2}, flush mm2s channel
> +       {3}, flush s2mm channel
> +
> +Required child node properties:
> +- compatible: It should be either "xlnx,axi-vdma-mm2s-channel" or
> +       "xlnx,axi-vdma-s2mm-channel".
> +- interrupts: Should contain per channel VDMA interrupts.
> +- xlnx,data-width: Should contain the stream data width, take values
> +       {32,64...1024}.
> +
> +Option child node properties:

s/Option/Optional/

> +- xlnx,include-dre: Tells whether hardware is configured for Data
> +       Realignment Engine.
> +- xlnx,genlock-mode: Tells whether Genlock synchronization is
> +       enabled/disabled in hardware.
> +
> +Example:
> +++++++++
> +
> +axi_vdma_0: axivdma at 40030000 {
> +       compatible = "xlnx,axi-vdma-1.00.a";
> +       #dma_cells = <1>;
> +       reg = < 0x40030000 0x10000 >;
> +       xlnx,num-fstores = <0x8>;
> +       xlnx,flush-fsync = <0x1>;
> +       dma-channel at 40030000 {
> +               compatible = "xlnx,axi-vdma-mm2s-channel";
> +               interrupts = < 0 54 4 >;
> +               xlnx,datawidth = <0x40>;
> +       } ;
> +       dma-channel at 40030030 {
> +               compatible = "xlnx,axi-vdma-s2mm-channel";
> +               interrupts = < 0 53 4 >;
> +               xlnx,datawidth = <0x40>;
> +       } ;
> +} ;
> +
> +
> +* DMA client
> +
> +Required properties:
> +- dmas: a list of <[Video DMA device phandle] [Channel ID]> pairs,
> +       where Channel ID is '0' for write/tx and '1' for read/rx
> +       channel.
> +- dma-names: a list of DMA channel names, one per "dmas" entry
> +
> +Example:
> +++++++++
> +
> +vdmatest_0: vdmatest at 0 {
> +       compatible ="xlnx,axi-vdma-test-1.00.a";
> +       dmas = <&axi_vdma_0 0
> +               &axi_vdma_0 1>;
> +       dma-names = "vdma0", "vdma1";
> +} ;
> --
> 1.7.9.5
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 2/3] dma: Add Xilinx Video DMA DT Binding Documentation
  2014-04-22 13:34     ` Rob Herring
  (?)
@ 2014-04-22 13:41       ` Michal Simek
  -1 siblings, 0 replies; 65+ messages in thread
From: Michal Simek @ 2014-04-22 13:41 UTC (permalink / raw)
  To: Rob Herring
  Cc: Srikanth Thokala, Dan Williams, Vinod Koul, Michal Simek,
	Grant Likely, Rob Herring, levex, Lars-Peter Clausen,
	Andy Shevchenko, dmaengine, linux-arm-kernel, linux-kernel,
	devicetree

[-- Attachment #1: Type: text/plain, Size: 672 bytes --]

On 04/22/2014 03:34 PM, Rob Herring wrote:
> On Sat, Feb 15, 2014 at 6:00 AM, Srikanth Thokala <sthokal@xilinx.com> wrote:
>> Device-tree binding documentation of Xilinx Video DMA Engine
>>
>> Signed-off-by: Srikanth Thokala <sthokal@xilinx.com>
> 
> A few typos below, but otherwise:
> 
> Acked-by: Rob Herring <robh@kernel.org>

Thanks Rob,
Michal

-- 
Michal Simek, Ing. (M.Eng), OpenPGP -> KeyID: FE3D1F91
w: www.monstr.eu p: +42-0-721842854
Maintainer of Linux kernel - Microblaze cpu - http://www.monstr.eu/fdt/
Maintainer of Linux kernel - Xilinx Zynq ARM architecture
Microblaze U-BOOT custodian and responsible for u-boot arm zynq platform



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 2/3] dma: Add Xilinx Video DMA DT Binding Documentation
@ 2014-04-22 13:41       ` Michal Simek
  0 siblings, 0 replies; 65+ messages in thread
From: Michal Simek @ 2014-04-22 13:41 UTC (permalink / raw)
  To: Rob Herring
  Cc: Srikanth Thokala, Dan Williams, Vinod Koul, Michal Simek,
	Grant Likely, Rob Herring, levex, Lars-Peter Clausen,
	Andy Shevchenko, dmaengine, linux-arm-kernel, linux-kernel,
	devicetree

[-- Attachment #1: Type: text/plain, Size: 672 bytes --]

On 04/22/2014 03:34 PM, Rob Herring wrote:
> On Sat, Feb 15, 2014 at 6:00 AM, Srikanth Thokala <sthokal@xilinx.com> wrote:
>> Device-tree binding documentation of Xilinx Video DMA Engine
>>
>> Signed-off-by: Srikanth Thokala <sthokal@xilinx.com>
> 
> A few typos below, but otherwise:
> 
> Acked-by: Rob Herring <robh@kernel.org>

Thanks Rob,
Michal

-- 
Michal Simek, Ing. (M.Eng), OpenPGP -> KeyID: FE3D1F91
w: www.monstr.eu p: +42-0-721842854
Maintainer of Linux kernel - Microblaze cpu - http://www.monstr.eu/fdt/
Maintainer of Linux kernel - Xilinx Zynq ARM architecture
Microblaze U-BOOT custodian and responsible for u-boot arm zynq platform



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH v3 2/3] dma: Add Xilinx Video DMA DT Binding Documentation
@ 2014-04-22 13:41       ` Michal Simek
  0 siblings, 0 replies; 65+ messages in thread
From: Michal Simek @ 2014-04-22 13:41 UTC (permalink / raw)
  To: linux-arm-kernel

On 04/22/2014 03:34 PM, Rob Herring wrote:
> On Sat, Feb 15, 2014 at 6:00 AM, Srikanth Thokala <sthokal@xilinx.com> wrote:
>> Device-tree binding documentation of Xilinx Video DMA Engine
>>
>> Signed-off-by: Srikanth Thokala <sthokal@xilinx.com>
> 
> A few typos below, but otherwise:
> 
> Acked-by: Rob Herring <robh@kernel.org>

Thanks Rob,
Michal

-- 
Michal Simek, Ing. (M.Eng), OpenPGP -> KeyID: FE3D1F91
w: www.monstr.eu p: +42-0-721842854
Maintainer of Linux kernel - Microblaze cpu - http://www.monstr.eu/fdt/
Maintainer of Linux kernel - Xilinx Zynq ARM architecture
Microblaze U-BOOT custodian and responsible for u-boot arm zynq platform


-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 263 bytes
Desc: OpenPGP digital signature
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20140422/a37415b4/attachment.sig>

^ permalink raw reply	[flat|nested] 65+ messages in thread

end of thread, other threads:[~2014-04-22 13:41 UTC | newest]

Thread overview: 65+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-15 12:00 [PATCH v3 0/3] Add Xilinx AXI Video DMA Engine driver Srikanth Thokala
2014-02-15 12:00 ` Srikanth Thokala
2014-02-15 12:00 ` Srikanth Thokala
2014-02-15 12:00 ` [PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory Srikanth Thokala
2014-02-15 12:00   ` Srikanth Thokala
2014-02-15 12:00   ` Srikanth Thokala
2014-02-17  8:43   ` Vinod Koul
2014-02-17  8:43     ` Vinod Koul
2014-02-17  9:29     ` Srikanth Thokala
2014-02-17  9:29       ` Srikanth Thokala
2014-02-17  9:35       ` Lars-Peter Clausen
2014-02-17  9:35         ` Lars-Peter Clausen
2014-02-17  9:35         ` Lars-Peter Clausen
2014-02-17  9:42         ` Srikanth Thokala
2014-02-17  9:42           ` Srikanth Thokala
2014-02-17  9:42           ` Srikanth Thokala
2014-02-17  9:44           ` Lars-Peter Clausen
2014-02-17  9:44             ` Lars-Peter Clausen
2014-02-17  9:52             ` Srikanth Thokala
2014-02-17  9:52               ` Srikanth Thokala
2014-02-17  9:52               ` Srikanth Thokala
2014-02-17  9:57   ` Jassi Brar
2014-02-17  9:57     ` Jassi Brar
2014-02-17  9:57     ` Jassi Brar
2014-02-18 11:28     ` Srikanth Thokala
2014-02-18 11:28       ` Srikanth Thokala
2014-02-18 11:28       ` Srikanth Thokala
2014-02-18 16:50       ` Jassi Brar
2014-02-18 16:50         ` Jassi Brar
2014-02-18 16:50         ` Jassi Brar
2014-02-18 17:46         ` Srikanth Thokala
2014-02-18 17:46           ` Srikanth Thokala
2014-02-18 17:46           ` Srikanth Thokala
2014-02-18 19:03           ` Jassi Brar
2014-02-18 19:03             ` Jassi Brar
2014-02-18 19:03             ` Jassi Brar
2014-02-20  9:24             ` Srikanth Thokala
2014-02-20  9:24               ` Srikanth Thokala
2014-02-20  9:24               ` Srikanth Thokala
2014-02-20  9:53               ` Jassi Brar
2014-02-20  9:53                 ` Jassi Brar
2014-02-20  9:53                 ` Jassi Brar
2014-02-21 14:37                 ` Srikanth Thokala
2014-02-24  2:09                   ` Jassi Brar
2014-02-24  2:09                     ` Jassi Brar
2014-02-24  2:09                     ` Jassi Brar
2014-02-26 14:21                     ` Srikanth Thokala
2014-02-26 14:21                       ` Srikanth Thokala
2014-02-26 14:21                       ` Srikanth Thokala
2014-02-26 15:02                       ` Jassi Brar
2014-02-26 15:02                         ` Jassi Brar
2014-02-26 15:02                         ` Jassi Brar
2014-02-27 18:04                         ` Srikanth Thokala
2014-02-27 18:04                           ` Srikanth Thokala
2014-02-27 18:04                           ` Srikanth Thokala
2014-02-15 12:00 ` [PATCH v3 2/3] dma: Add Xilinx Video DMA DT Binding Documentation Srikanth Thokala
2014-02-15 12:00   ` Srikanth Thokala
2014-04-22 13:34   ` Rob Herring
2014-04-22 13:34     ` Rob Herring
2014-04-22 13:34     ` Rob Herring
2014-04-22 13:41     ` Michal Simek
2014-04-22 13:41       ` Michal Simek
2014-04-22 13:41       ` Michal Simek
2014-02-15 12:00 ` [PATCH v3 3/3] dma: Add Xilinx AXI Video Direct Memory Access Engine driver support Srikanth Thokala
2014-02-15 12:00   ` Srikanth Thokala

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.