dmaengine.vger.kernel.org archive mirror
* [PATCH for-next 0/4] dmaengine: ti: k3-udma: Updates for next
@ 2020-01-27 13:21 Peter Ujfalusi
  2020-01-27 13:21 ` [PATCH for-next 1/4] dmaengine: ti: k3-udma: Use ktime/usleep_range based TX completion check Peter Ujfalusi
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Peter Ujfalusi @ 2020-01-27 13:21 UTC
  To: vkoul
  Cc: dmaengine, linux-kernel, dan.j.williams, grygorii.strashko, vigneshr

Hi Vinod,

Based on customer reports we have identified two issues with the UDMA driver:

TX completion (1st patch):
The scheduled-work based workaround for checking completion worked well for
UART, but it had a significant impact on SPI performance.
The underlying issue comes from the fact that we have a split data movement
architecture: in order to know that the transfer is really done we need to
check the remote end's (PDMA) byte counter.
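
A minimal sketch of the check (using the register accessor names from the
1st patch; the real logic lives in udma_is_desc_really_done()):

	/* TX is only really done once the peer (PDMA) caught up with UDMA */
	peer_bcnt = udma_tchanrt_read(uc->tchan, UDMA_TCHAN_RT_PEER_BCNT_REG);
	bcnt = udma_tchanrt_read(uc->tchan, UDMA_TCHAN_RT_BCNT_REG);
	if (peer_bcnt < bcnt)
		return false; /* data is still in flight towards the peer */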

RX channel teardown with stale data in PDMA (2nd patch):
If we try to stop the RX DMA channel (teardown), PDMA tries to flush the
data it might have received from a peripheral, but if UDMA does not have a
packet to use for this draining then it pushes back on the PDMA and the
flush never completes.
The workaround is to use a dummy descriptor for flush purposes when the
channel is terminated and we do not have an active transfer (no descriptor
for UDMA).
This allows UDMA to drain the data so that the teardown can complete.
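
A minimal sketch of the resulting stop path (udma_push_to_ring() in the
2nd patch treats idx == -1 as "use the flush descriptor"):

	/* DEV_TO_MEM stop: give UDMA a descriptor to drain stale data into */
	if (!uc->cyclic && !uc->desc)
		udma_push_to_ring(uc, -1);
	/* ...only then request the channel teardown */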

The last two patches use common code to set up the TR parameters for
slave_sg, cyclic and memcpy. The setup code is the same as what we used for
memcpy; with the change we can handle 4.2GB sg elements and periods in the
cyclic case. It is also nice that we have a single function to do the
configuration.
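
For reference, the 4.2GB figure comes from the two 16-bit TR counters:
with icnt0 and icnt1 each limited to SZ_64K - 1, one element can cover up
to 65535 * 65535 = 4294836225 bytes (~4.2GB).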

Regards,
Peter
---
Peter Ujfalusi (3):
  dmaengine: ti: k3-udma: Workaround for RX teardown with stale data in
    peer
  dmaengine: ti: k3-udma: Move the TR counter calculation to helper
    function
  dmaengine: ti: k3-udma: Use the TR counter helper for slave_sg and
    cyclic

Vignesh Raghavendra (1):
  dmaengine: ti: k3-udma: Use ktime/usleep_range based TX completion
    check

 drivers/dma/ti/k3-udma.c | 452 +++++++++++++++++++++++++++++----------
 1 file changed, 343 insertions(+), 109 deletions(-)

-- 
Peter

Texas Instruments Finland Oy, Porkkalankatu 22, 00180 Helsinki.
Y-tunnus/Business ID: 0615521-4. Kotipaikka/Domicile: Helsinki



* [PATCH for-next 1/4] dmaengine: ti: k3-udma: Use ktime/usleep_range based TX completion check
  2020-01-27 13:21 [PATCH for-next 0/4] dmaengine: ti: k3-udma: Updates for next Peter Ujfalusi
@ 2020-01-27 13:21 ` Peter Ujfalusi
  2020-01-28 11:48   ` Vinod Koul
  2020-01-27 13:21 ` [PATCH for-next 2/4] dmaengine: ti: k3-udma: Workaround for RX teardown with stale data in peer Peter Ujfalusi
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: Peter Ujfalusi @ 2020-01-27 13:21 UTC
  To: vkoul
  Cc: dmaengine, linux-kernel, dan.j.williams, grygorii.strashko, vigneshr

From: Vignesh Raghavendra <vigneshr@ti.com>

In some cases (McSPI for example) the jiffies and delayed_work based
workaround can cause a big throughput drop.

Switch to a ktime/usleep_range based implementation to be able to
sustain speed for PDMA based peripherals.
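
The core of the new estimation, as a sketch (the complete loop is in the
diff below): measure the peer's drain rate between two polls and sleep for
the predicted remaining time instead of re-queueing delayed work:

	time_diff = ktime_sub(uc->tx_drain.tstamp, time_diff) + 1;
	residue_diff -= uc->tx_drain.residue;
	if (residue_diff) {
		/* ns the peer needs to drain the remaining residue */
		delay = (time_diff / residue_diff) * uc->tx_drain.residue;
		usleep_range(ktime_to_us(delay), ktime_to_us(delay) + 10);
	}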

Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
Signed-off-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
---
 drivers/dma/ti/k3-udma.c | 80 ++++++++++++++++++++++++++--------------
 1 file changed, 53 insertions(+), 27 deletions(-)

diff --git a/drivers/dma/ti/k3-udma.c b/drivers/dma/ti/k3-udma.c
index ea79c2df28e0..fb59c869a6a7 100644
--- a/drivers/dma/ti/k3-udma.c
+++ b/drivers/dma/ti/k3-udma.c
@@ -5,6 +5,7 @@
  */
 
 #include <linux/kernel.h>
+#include <linux/delay.h>
 #include <linux/dmaengine.h>
 #include <linux/dma-mapping.h>
 #include <linux/dmapool.h>
@@ -169,7 +170,7 @@ enum udma_chan_state {
 
 struct udma_tx_drain {
 	struct delayed_work work;
-	unsigned long jiffie;
+	ktime_t tstamp;
 	u32 residue;
 };
 
@@ -946,9 +947,10 @@ static bool udma_is_desc_really_done(struct udma_chan *uc, struct udma_desc *d)
 	peer_bcnt = udma_tchanrt_read(uc->tchan, UDMA_TCHAN_RT_PEER_BCNT_REG);
 	bcnt = udma_tchanrt_read(uc->tchan, UDMA_TCHAN_RT_BCNT_REG);
 
+	/* Transfer is incomplete, store current residue and time stamp */
 	if (peer_bcnt < bcnt) {
 		uc->tx_drain.residue = bcnt - peer_bcnt;
-		uc->tx_drain.jiffie = jiffies;
+		uc->tx_drain.tstamp = ktime_get();
 		return false;
 	}
 
@@ -961,35 +963,59 @@ static void udma_check_tx_completion(struct work_struct *work)
 					    tx_drain.work.work);
 	bool desc_done = true;
 	u32 residue_diff;
-	unsigned long jiffie_diff, delay;
+	ktime_t time_diff;
+	unsigned long delay;
+
+	while (1) {
+		if (uc->desc) {
+			/* Get previous residue and time stamp */
+			residue_diff = uc->tx_drain.residue;
+			time_diff = uc->tx_drain.tstamp;
+			/*
+			 * Get current residue and time stamp or see if
+			 * transfer is complete
+			 */
+			desc_done = udma_is_desc_really_done(uc, uc->desc);
+		}
 
-	if (uc->desc) {
-		residue_diff = uc->tx_drain.residue;
-		jiffie_diff = uc->tx_drain.jiffie;
-		desc_done = udma_is_desc_really_done(uc, uc->desc);
-	}
-
-	if (!desc_done) {
-		jiffie_diff = uc->tx_drain.jiffie - jiffie_diff;
-		residue_diff -= uc->tx_drain.residue;
-		if (residue_diff) {
-			/* Try to guess when we should check next time */
-			residue_diff /= jiffie_diff;
-			delay = uc->tx_drain.residue / residue_diff / 3;
-			if (jiffies_to_msecs(delay) < 5)
-				delay = 0;
-		} else {
-			/* No progress, check again in 1 second  */
-			delay = HZ;
+		if (!desc_done) {
+			/*
+			 * Find the time delta and residue delta w.r.t
+			 * previous poll
+			 */
+			time_diff = ktime_sub(uc->tx_drain.tstamp,
+					      time_diff) + 1;
+			residue_diff -= uc->tx_drain.residue;
+			if (residue_diff) {
+				/*
+				 * Try to guess when we should check
+				 * next time by calculating rate at
+				 * which data is being drained at the
+				 * peer device
+				 */
+				delay = (time_diff / residue_diff) *
+					uc->tx_drain.residue;
+			} else {
+				/* No progress, check again in 1 second  */
+				schedule_delayed_work(&uc->tx_drain.work, HZ);
+				break;
+			}
+
+			usleep_range(ktime_to_us(delay),
+				     ktime_to_us(delay) + 10);
+			continue;
 		}
 
-		schedule_delayed_work(&uc->tx_drain.work, delay);
-	} else if (uc->desc) {
-		struct udma_desc *d = uc->desc;
+		if (uc->desc) {
+			struct udma_desc *d = uc->desc;
+
+			uc->bcnt += d->residue;
+			udma_start(uc);
+			vchan_cookie_complete(&d->vd);
+			break;
+		}
 
-		uc->bcnt += d->residue;
-		udma_start(uc);
-		vchan_cookie_complete(&d->vd);
+		break;
 	}
 }
 
-- 
Peter




* [PATCH for-next 2/4] dmaengine: ti: k3-udma: Workaround for RX teardown with stale data in peer
  2020-01-27 13:21 [PATCH for-next 0/4] dmaengine: ti: k3-udma: Updates for next Peter Ujfalusi
  2020-01-27 13:21 ` [PATCH for-next 1/4] dmaengine: ti: k3-udma: Use ktime/usleep_range based TX completion check Peter Ujfalusi
@ 2020-01-27 13:21 ` Peter Ujfalusi
  2020-01-27 13:21 ` [PATCH for-next 3/4] dmaengine: ti: k3-udma: Move the TR counter calculation to helper function Peter Ujfalusi
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Peter Ujfalusi @ 2020-01-27 13:21 UTC
  To: vkoul
  Cc: dmaengine, linux-kernel, dan.j.williams, grygorii.strashko, vigneshr

When a channel is asked to be stopped (teardown) and we do not have an
active descriptor to receive stale data buffered on the remote side, the
teardown will not complete, as UDMA needs a descriptor to be able to
flush out the DMA pipe.
The peer tries to push the data to UDMA during teardown, but UDMA pushes
back because it has no descriptor which would allow it to drain the data.

The workaround is to create a 1K 'trashcan' buffer to receive the
discarded data and to set up flush descriptors for both packet mode and
TR mode channels.
When a channel is stopped and there is no active descriptor, a flush
descriptor is pushed to the ring for UDMA before the teardown is
initiated.
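
A small sketch of how the right flush descriptor is picked at runtime
(from the diff below; hwdescs[0] is the TR mode descriptor, hwdescs[1]
the packet mode one):

	static inline dma_addr_t udma_get_rx_flush_hwdesc_paddr(struct udma_chan *uc)
	{
		/* config.pkt_mode is 0 or 1, indexing the two descriptors */
		return uc->ud->rx_flush.hwdescs[uc->config.pkt_mode].cppi5_desc_paddr;
	}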

Signed-off-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
---
 drivers/dma/ti/k3-udma.c | 168 +++++++++++++++++++++++++++++++++++----
 1 file changed, 151 insertions(+), 17 deletions(-)

diff --git a/drivers/dma/ti/k3-udma.c b/drivers/dma/ti/k3-udma.c
index fb59c869a6a7..cb9259e104b4 100644
--- a/drivers/dma/ti/k3-udma.c
+++ b/drivers/dma/ti/k3-udma.c
@@ -97,6 +97,24 @@ struct udma_match_data {
 	u32 level_start_idx[];
 };
 
+struct udma_hwdesc {
+	size_t cppi5_desc_size;
+	void *cppi5_desc_vaddr;
+	dma_addr_t cppi5_desc_paddr;
+
+	/* TR descriptor internal pointers */
+	void *tr_req_base;
+	struct cppi5_tr_resp_t *tr_resp_base;
+};
+
+struct udma_rx_flush {
+	struct udma_hwdesc hwdescs[2];
+
+	size_t buffer_size;
+	void *buffer_vaddr;
+	dma_addr_t buffer_paddr;
+};
+
 struct udma_dev {
 	struct dma_device ddev;
 	struct device *dev;
@@ -113,6 +131,8 @@ struct udma_dev {
 	struct list_head desc_to_purge;
 	spinlock_t lock;
 
+	struct udma_rx_flush rx_flush;
+
 	int tchan_cnt;
 	int echan_cnt;
 	int rchan_cnt;
@@ -131,16 +151,6 @@ struct udma_dev {
 	u32 psil_base;
 };
 
-struct udma_hwdesc {
-	size_t cppi5_desc_size;
-	void *cppi5_desc_vaddr;
-	dma_addr_t cppi5_desc_paddr;
-
-	/* TR descriptor internal pointers */
-	void *tr_req_base;
-	struct cppi5_tr_resp_t *tr_resp_base;
-};
-
 struct udma_desc {
 	struct virt_dma_desc vd;
 
@@ -552,12 +562,17 @@ static void udma_sync_for_device(struct udma_chan *uc, int idx)
 	}
 }
 
+static inline dma_addr_t udma_get_rx_flush_hwdesc_paddr(struct udma_chan *uc)
+{
+	return uc->ud->rx_flush.hwdescs[uc->config.pkt_mode].cppi5_desc_paddr;
+}
+
 static int udma_push_to_ring(struct udma_chan *uc, int idx)
 {
 	struct udma_desc *d = uc->desc;
-
 	struct k3_ring *ring = NULL;
-	int ret = -EINVAL;
+	dma_addr_t paddr;
+	int ret;
 
 	switch (uc->config.dir) {
 	case DMA_DEV_TO_MEM:
@@ -568,21 +583,37 @@ static int udma_push_to_ring(struct udma_chan *uc, int idx)
 		ring = uc->tchan->t_ring;
 		break;
 	default:
-		break;
+		return -EINVAL;
 	}
 
-	if (ring) {
-		dma_addr_t desc_addr = udma_curr_cppi5_desc_paddr(d, idx);
+	/* RX flush packet: idx == -1 is only passed in case of DEV_TO_MEM */
+	if (idx == -1) {
+		paddr = udma_get_rx_flush_hwdesc_paddr(uc);
+	} else {
+		paddr = udma_curr_cppi5_desc_paddr(d, idx);
 
 		wmb(); /* Ensure that writes are not moved over this point */
 		udma_sync_for_device(uc, idx);
-		ret = k3_ringacc_ring_push(ring, &desc_addr);
-		uc->in_ring_cnt++;
 	}
 
+	ret = k3_ringacc_ring_push(ring, &paddr);
+	if (!ret)
+		uc->in_ring_cnt++;
+
 	return ret;
 }
 
+static bool udma_desc_is_rx_flush(struct udma_chan *uc, dma_addr_t addr)
+{
+	if (uc->config.dir != DMA_DEV_TO_MEM)
+		return false;
+
+	if (addr == udma_get_rx_flush_hwdesc_paddr(uc))
+		return true;
+
+	return false;
+}
+
 static int udma_pop_from_ring(struct udma_chan *uc, dma_addr_t *addr)
 {
 	struct k3_ring *ring = NULL;
@@ -611,6 +642,10 @@ static int udma_pop_from_ring(struct udma_chan *uc, dma_addr_t *addr)
 		if (cppi5_desc_is_tdcm(*addr))
 			return ret;
 
+		/* Check for flush descriptor */
+		if (udma_desc_is_rx_flush(uc, *addr))
+			return -ENOENT;
+
 		d = udma_udma_desc_from_paddr(uc, *addr);
 
 		if (d)
@@ -891,6 +926,9 @@ static int udma_stop(struct udma_chan *uc)
 
 	switch (uc->config.dir) {
 	case DMA_DEV_TO_MEM:
+		if (!uc->cyclic && !uc->desc)
+			udma_push_to_ring(uc, -1);
+
 		udma_rchanrt_write(uc->rchan, UDMA_RCHAN_RT_PEER_RT_EN_REG,
 				   UDMA_PEER_RT_EN_ENABLE |
 				   UDMA_PEER_RT_EN_TEARDOWN);
@@ -3274,6 +3312,98 @@ static int udma_setup_resources(struct udma_dev *ud)
 	return ch_count;
 }
 
+static int udma_setup_rx_flush(struct udma_dev *ud)
+{
+	struct udma_rx_flush *rx_flush = &ud->rx_flush;
+	struct cppi5_desc_hdr_t *tr_desc;
+	struct cppi5_tr_type1_t *tr_req;
+	struct cppi5_host_desc_t *desc;
+	struct device *dev = ud->dev;
+	struct udma_hwdesc *hwdesc;
+	size_t tr_size;
+
+	/* Allocate 1K buffer for discarded data on RX channel teardown */
+	rx_flush->buffer_size = SZ_1K;
+	rx_flush->buffer_vaddr = devm_kzalloc(dev, rx_flush->buffer_size,
+					      GFP_KERNEL);
+	if (!rx_flush->buffer_vaddr)
+		return -ENOMEM;
+
+	rx_flush->buffer_paddr = dma_map_single(dev, rx_flush->buffer_vaddr,
+						rx_flush->buffer_size,
+						DMA_TO_DEVICE);
+	if (dma_mapping_error(dev, rx_flush->buffer_paddr))
+		return -ENOMEM;
+
+	/* Set up descriptor to be used for TR mode */
+	hwdesc = &rx_flush->hwdescs[0];
+	tr_size = sizeof(struct cppi5_tr_type1_t);
+	hwdesc->cppi5_desc_size = cppi5_trdesc_calc_size(tr_size, 1);
+	hwdesc->cppi5_desc_size = ALIGN(hwdesc->cppi5_desc_size,
+					ud->desc_align);
+
+	hwdesc->cppi5_desc_vaddr = devm_kzalloc(dev, hwdesc->cppi5_desc_size,
+						GFP_KERNEL);
+	if (!hwdesc->cppi5_desc_vaddr)
+		return -ENOMEM;
+
+	hwdesc->cppi5_desc_paddr = dma_map_single(dev, hwdesc->cppi5_desc_vaddr,
+						  hwdesc->cppi5_desc_size,
+						  DMA_TO_DEVICE);
+	if (dma_mapping_error(dev, hwdesc->cppi5_desc_paddr))
+		return -ENOMEM;
+
+	/* Start of the TR req records */
+	hwdesc->tr_req_base = hwdesc->cppi5_desc_vaddr + tr_size;
+	/* Start address of the TR response array */
+	hwdesc->tr_resp_base = hwdesc->tr_req_base + tr_size;
+
+	tr_desc = hwdesc->cppi5_desc_vaddr;
+	cppi5_trdesc_init(tr_desc, 1, tr_size, 0, 0);
+	cppi5_desc_set_pktids(tr_desc, 0, CPPI5_INFO1_DESC_FLOWID_DEFAULT);
+	cppi5_desc_set_retpolicy(tr_desc, 0, 0);
+
+	tr_req = hwdesc->tr_req_base;
+	cppi5_tr_init(&tr_req->flags, CPPI5_TR_TYPE1, false, false,
+		      CPPI5_TR_EVENT_SIZE_COMPLETION, 0);
+	cppi5_tr_csf_set(&tr_req->flags, CPPI5_TR_CSF_SUPR_EVT);
+
+	tr_req->addr = rx_flush->buffer_paddr;
+	tr_req->icnt0 = rx_flush->buffer_size;
+	tr_req->icnt1 = 1;
+
+	/* Set up descriptor to be used for packet mode */
+	hwdesc = &rx_flush->hwdescs[1];
+	hwdesc->cppi5_desc_size = ALIGN(sizeof(struct cppi5_host_desc_t) +
+					CPPI5_INFO0_HDESC_EPIB_SIZE +
+					CPPI5_INFO0_HDESC_PSDATA_MAX_SIZE,
+					ud->desc_align);
+
+	hwdesc->cppi5_desc_vaddr = devm_kzalloc(dev, hwdesc->cppi5_desc_size,
+						GFP_KERNEL);
+	if (!hwdesc->cppi5_desc_vaddr)
+		return -ENOMEM;
+
+	hwdesc->cppi5_desc_paddr = dma_map_single(dev, hwdesc->cppi5_desc_vaddr,
+						  hwdesc->cppi5_desc_size,
+						  DMA_TO_DEVICE);
+	if (dma_mapping_error(dev, hwdesc->cppi5_desc_paddr))
+		return -ENOMEM;
+
+	desc = hwdesc->cppi5_desc_vaddr;
+	cppi5_hdesc_init(desc, 0, 0);
+	cppi5_desc_set_pktids(&desc->hdr, 0, CPPI5_INFO1_DESC_FLOWID_DEFAULT);
+	cppi5_desc_set_retpolicy(&desc->hdr, 0, 0);
+
+	cppi5_hdesc_attach_buf(desc,
+			       rx_flush->buffer_paddr, rx_flush->buffer_size,
+			       rx_flush->buffer_paddr, rx_flush->buffer_size);
+
+	dma_sync_single_for_device(dev, hwdesc->cppi5_desc_paddr,
+				   hwdesc->cppi5_desc_size, DMA_TO_DEVICE);
+	return 0;
+}
+
 #define TI_UDMAC_BUSWIDTHS	(BIT(DMA_SLAVE_BUSWIDTH_1_BYTE) | \
 				 BIT(DMA_SLAVE_BUSWIDTH_2_BYTES) | \
 				 BIT(DMA_SLAVE_BUSWIDTH_3_BYTES) | \
@@ -3387,6 +3517,10 @@ static int udma_probe(struct platform_device *pdev)
 	if (ud->desc_align < dma_get_cache_alignment())
 		ud->desc_align = dma_get_cache_alignment();
 
+	ret = udma_setup_rx_flush(ud);
+	if (ret)
+		return ret;
+
 	for (i = 0; i < ud->tchan_cnt; i++) {
 		struct udma_tchan *tchan = &ud->tchans[i];
 
-- 
Peter




* [PATCH for-next 3/4] dmaengine: ti: k3-udma: Move the TR counter calculation to helper function
  2020-01-27 13:21 [PATCH for-next 0/4] dmaengine: ti: k3-udma: Updates for next Peter Ujfalusi
  2020-01-27 13:21 ` [PATCH for-next 1/4] dmaengine: ti: k3-udma: Use ktime/usleep_range based TX completion check Peter Ujfalusi
  2020-01-27 13:21 ` [PATCH for-next 2/4] dmaengine: ti: k3-udma: Workaround for RX teardown with stale data in peer Peter Ujfalusi
@ 2020-01-27 13:21 ` Peter Ujfalusi
  2020-01-27 13:21 ` [PATCH for-next 4/4] dmaengine: ti: k3-udma: Use the TR counter helper for slave_sg and cyclic Peter Ujfalusi
  2020-01-28 10:15 ` [PATCH for-next 0/4] dmaengine: ti: k3-udma: Updates for next Peter Ujfalusi
  4 siblings, 0 replies; 13+ messages in thread
From: Peter Ujfalusi @ 2020-01-27 13:21 UTC
  To: vkoul
  Cc: dmaengine, linux-kernel, dan.j.williams, grygorii.strashko, vigneshr

Move the TR counter parameter configuration code out from the prep_memcpy
callback into a helper function to provide generic, reusable code for
other TR based transfers.
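
A worked example of what the helper computes (the length here is just an
illustration): for a 200000 byte, 4 byte aligned transfer __ffs() yields
align_to = 2, so:

	u16 tr0_cnt0, tr0_cnt1, tr1_cnt0;
	int num_tr;

	num_tr = udma_get_tr_counters(200000, 2, &tr0_cnt0,
				      &tr0_cnt1, &tr1_cnt0);
	/*
	 * num_tr == 2, tr0_cnt0 == SZ_64K - BIT(2) == 65532,
	 * tr0_cnt1 == 200000 / 65532 == 3, tr1_cnt0 == 200000 % 65532 == 3404
	 */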

Signed-off-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
---
 drivers/dma/ti/k3-udma.c | 74 +++++++++++++++++++++++++++-------------
 1 file changed, 51 insertions(+), 23 deletions(-)

diff --git a/drivers/dma/ti/k3-udma.c b/drivers/dma/ti/k3-udma.c
index cb9259e104b4..9b00013d6f63 100644
--- a/drivers/dma/ti/k3-udma.c
+++ b/drivers/dma/ti/k3-udma.c
@@ -2029,6 +2029,51 @@ static struct udma_desc *udma_alloc_tr_desc(struct udma_chan *uc,
 	return d;
 }
 
+/**
+ * udma_get_tr_counters - calculate TR counters for a given length
+ * @len: Length of the transfer
+ * @align_to: Preferred alignment
+ * @tr0_cnt0: First TR icnt0
+ * @tr0_cnt1: First TR icnt1
+ * @tr1_cnt0: Second (if used) TR icnt0
+ *
+ * For len < SZ_64K only one TR is enough, tr1_cnt0 is not updated
+ * For len >= SZ_64K two TRs are used in a simple way:
+ * First TR: SZ_64K-alignment blocks (tr0_cnt0, tr0_cnt1)
+ * Second TR: the remaining length (tr1_cnt0)
+ *
+ * Returns the number of TRs the length needs (1 or 2), or
+ * -EINVAL if the length cannot be supported
+ */
+static int udma_get_tr_counters(size_t len, unsigned long align_to,
+				u16 *tr0_cnt0, u16 *tr0_cnt1, u16 *tr1_cnt0)
+{
+	if (len < SZ_64K) {
+		*tr0_cnt0 = len;
+		*tr0_cnt1 = 1;
+
+		return 1;
+	}
+
+	if (align_to > 3)
+		align_to = 3;
+
+realign:
+	*tr0_cnt0 = SZ_64K - BIT(align_to);
+	if (len / *tr0_cnt0 >= SZ_64K) {
+		if (align_to) {
+			align_to--;
+			goto realign;
+		}
+		return -EINVAL;
+	}
+
+	*tr0_cnt1 = len / *tr0_cnt0;
+	*tr1_cnt0 = len % *tr0_cnt0;
+
+	return 2;
+}
+
 static struct udma_desc *
 udma_prep_slave_sg_tr(struct udma_chan *uc, struct scatterlist *sgl,
 		      unsigned int sglen, enum dma_transfer_direction dir,
@@ -2581,29 +2626,12 @@ udma_prep_dma_memcpy(struct dma_chan *chan, dma_addr_t dest, dma_addr_t src,
 		return NULL;
 	}
 
-	if (len < SZ_64K) {
-		num_tr = 1;
-		tr0_cnt0 = len;
-		tr0_cnt1 = 1;
-	} else {
-		unsigned long align_to = __ffs(src | dest);
-
-		if (align_to > 3)
-			align_to = 3;
-		/*
-		 * Keep simple: tr0: SZ_64K-alignment blocks,
-		 *		tr1: the remaining
-		 */
-		num_tr = 2;
-		tr0_cnt0 = (SZ_64K - BIT(align_to));
-		if (len / tr0_cnt0 >= SZ_64K) {
-			dev_err(uc->ud->dev, "size %zu is not supported\n",
-				len);
-			return NULL;
-		}
-
-		tr0_cnt1 = len / tr0_cnt0;
-		tr1_cnt0 = len % tr0_cnt0;
+	num_tr = udma_get_tr_counters(len, __ffs(src | dest), &tr0_cnt0,
+				      &tr0_cnt1, &tr1_cnt0);
+	if (num_tr < 0) {
+		dev_err(uc->ud->dev, "size %zu is not supported\n",
+			len);
+		return NULL;
 	}
 
 	d = udma_alloc_tr_desc(uc, tr_size, num_tr, DMA_MEM_TO_MEM);
-- 
Peter




* [PATCH for-next 4/4] dmaengine: ti: k3-udma: Use the TR counter helper for slave_sg and cyclic
  2020-01-27 13:21 [PATCH for-next 0/4] dmaengine: ti: k3-udma: Updates for next Peter Ujfalusi
                   ` (2 preceding siblings ...)
  2020-01-27 13:21 ` [PATCH for-next 3/4] dmaengine: ti: k3-udma: Move the TR counter calculation to helper function Peter Ujfalusi
@ 2020-01-27 13:21 ` Peter Ujfalusi
  2020-01-28 10:15 ` [PATCH for-next 0/4] dmaengine: ti: k3-udma: Updates for next Peter Ujfalusi
  4 siblings, 0 replies; 13+ messages in thread
From: Peter Ujfalusi @ 2020-01-27 13:21 UTC
  To: vkoul
  Cc: dmaengine, linux-kernel, dan.j.williams, grygorii.strashko, vigneshr

Use the generic TR setup function to get the TR counters for both cyclic
and slave_sg transfers.
This way the period_size for cyclic and sg_dma_len() for slave_sg can be
as large as (SZ_64K - 1) * (SZ_64K - 1) and we can handle cases where the
length is >SZ_64K and a prime number.
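
As a worked example (hypothetical length, byte aligned buffer so
align_to = 0): a prime sg_dma_len() of 65537 now yields two TRs with
tr0_cnt0 = 65535, tr0_cnt1 = 1 and tr1_cnt0 = 2, instead of overflowing
the 16-bit icnt1 counter as the old per-burst setup would have done.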

Signed-off-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
---
 drivers/dma/ti/k3-udma.c | 130 ++++++++++++++++++++++++++-------------
 1 file changed, 88 insertions(+), 42 deletions(-)

diff --git a/drivers/dma/ti/k3-udma.c b/drivers/dma/ti/k3-udma.c
index 9b00013d6f63..1dba47c662c4 100644
--- a/drivers/dma/ti/k3-udma.c
+++ b/drivers/dma/ti/k3-udma.c
@@ -2079,31 +2079,31 @@ udma_prep_slave_sg_tr(struct udma_chan *uc, struct scatterlist *sgl,
 		      unsigned int sglen, enum dma_transfer_direction dir,
 		      unsigned long tx_flags, void *context)
 {
-	enum dma_slave_buswidth dev_width;
 	struct scatterlist *sgent;
 	struct udma_desc *d;
-	size_t tr_size;
 	struct cppi5_tr_type1_t *tr_req = NULL;
+	u16 tr0_cnt0, tr0_cnt1, tr1_cnt0;
 	unsigned int i;
-	u32 burst;
+	size_t tr_size;
+	int num_tr = 0;
+	int tr_idx = 0;
 
-	if (dir == DMA_DEV_TO_MEM) {
-		dev_width = uc->cfg.src_addr_width;
-		burst = uc->cfg.src_maxburst;
-	} else if (dir == DMA_MEM_TO_DEV) {
-		dev_width = uc->cfg.dst_addr_width;
-		burst = uc->cfg.dst_maxburst;
-	} else {
-		dev_err(uc->ud->dev, "%s: bad direction?\n", __func__);
+	if (!is_slave_direction(dir)) {
+		dev_err(uc->ud->dev, "Only slave transfers are supported\n");
 		return NULL;
 	}
 
-	if (!burst)
-		burst = 1;
+	/* estimate the number of TRs we will need */
+	for_each_sg(sgl, sgent, sglen, i) {
+		if (sg_dma_len(sgent) < SZ_64K)
+			num_tr++;
+		else
+			num_tr += 2;
+	}
 
 	/* Now allocate and setup the descriptor. */
 	tr_size = sizeof(struct cppi5_tr_type1_t);
-	d = udma_alloc_tr_desc(uc, tr_size, sglen, dir);
+	d = udma_alloc_tr_desc(uc, tr_size, num_tr, dir);
 	if (!d)
 		return NULL;
 
@@ -2111,19 +2111,46 @@ udma_prep_slave_sg_tr(struct udma_chan *uc, struct scatterlist *sgl,
 
 	tr_req = d->hwdesc[0].tr_req_base;
 	for_each_sg(sgl, sgent, sglen, i) {
-		d->residue += sg_dma_len(sgent);
+		dma_addr_t sg_addr = sg_dma_address(sgent);
+
+		num_tr = udma_get_tr_counters(sg_dma_len(sgent), __ffs(sg_addr),
+					      &tr0_cnt0, &tr0_cnt1, &tr1_cnt0);
+		if (num_tr < 0) {
+			dev_err(uc->ud->dev, "size %u is not supported\n",
+				sg_dma_len(sgent));
+			udma_free_hwdesc(uc, d);
+			kfree(d);
+			return NULL;
+		}
 
 		cppi5_tr_init(&tr_req[i].flags, CPPI5_TR_TYPE1, false, false,
 			      CPPI5_TR_EVENT_SIZE_COMPLETION, 0);
 		cppi5_tr_csf_set(&tr_req[i].flags, CPPI5_TR_CSF_SUPR_EVT);
 
-		tr_req[i].addr = sg_dma_address(sgent);
-		tr_req[i].icnt0 = burst * dev_width;
-		tr_req[i].dim1 = burst * dev_width;
-		tr_req[i].icnt1 = sg_dma_len(sgent) / tr_req[i].icnt0;
+		tr_req[tr_idx].addr = sg_addr;
+		tr_req[tr_idx].icnt0 = tr0_cnt0;
+		tr_req[tr_idx].icnt1 = tr0_cnt1;
+		tr_req[tr_idx].dim1 = tr0_cnt0;
+		tr_idx++;
+
+		if (num_tr == 2) {
+			cppi5_tr_init(&tr_req[tr_idx].flags, CPPI5_TR_TYPE1,
+				      false, false,
+				      CPPI5_TR_EVENT_SIZE_COMPLETION, 0);
+			cppi5_tr_csf_set(&tr_req[tr_idx].flags,
+					 CPPI5_TR_CSF_SUPR_EVT);
+
+			tr_req[tr_idx].addr = sg_addr + tr0_cnt1 * tr0_cnt0;
+			tr_req[tr_idx].icnt0 = tr1_cnt0;
+			tr_req[tr_idx].icnt1 = 1;
+			tr_req[tr_idx].dim1 = tr1_cnt0;
+			tr_idx++;
+		}
+
+		d->residue += sg_dma_len(sgent);
 	}
 
-	cppi5_tr_csf_set(&tr_req[i - 1].flags, CPPI5_TR_CSF_EOP);
+	cppi5_tr_csf_set(&tr_req[tr_idx - 1].flags, CPPI5_TR_CSF_EOP);
 
 	return d;
 }
@@ -2428,47 +2455,66 @@ udma_prep_dma_cyclic_tr(struct udma_chan *uc, dma_addr_t buf_addr,
 			size_t buf_len, size_t period_len,
 			enum dma_transfer_direction dir, unsigned long flags)
 {
-	enum dma_slave_buswidth dev_width;
 	struct udma_desc *d;
-	size_t tr_size;
+	size_t tr_size, period_addr;
 	struct cppi5_tr_type1_t *tr_req;
-	unsigned int i;
 	unsigned int periods = buf_len / period_len;
-	u32 burst;
+	u16 tr0_cnt0, tr0_cnt1, tr1_cnt0;
+	unsigned int i;
+	int num_tr;
 
-	if (dir == DMA_DEV_TO_MEM) {
-		dev_width = uc->cfg.src_addr_width;
-		burst = uc->cfg.src_maxburst;
-	} else if (dir == DMA_MEM_TO_DEV) {
-		dev_width = uc->cfg.dst_addr_width;
-		burst = uc->cfg.dst_maxburst;
-	} else {
-		dev_err(uc->ud->dev, "%s: bad direction?\n", __func__);
+	if (!is_slave_direction(dir)) {
+		dev_err(uc->ud->dev, "Only slave cyclic is supported\n");
 		return NULL;
 	}
 
-	if (!burst)
-		burst = 1;
+	num_tr = udma_get_tr_counters(period_len, __ffs(buf_addr), &tr0_cnt0,
+				      &tr0_cnt1, &tr1_cnt0);
+	if (num_tr < 0) {
+		dev_err(uc->ud->dev, "size %zu is not supported\n",
+			period_len);
+		return NULL;
+	}
 
 	/* Now allocate and setup the descriptor. */
 	tr_size = sizeof(struct cppi5_tr_type1_t);
-	d = udma_alloc_tr_desc(uc, tr_size, periods, dir);
+	d = udma_alloc_tr_desc(uc, tr_size, periods * num_tr, dir);
 	if (!d)
 		return NULL;
 
 	tr_req = d->hwdesc[0].tr_req_base;
+	period_addr = buf_addr;
 	for (i = 0; i < periods; i++) {
-		cppi5_tr_init(&tr_req[i].flags, CPPI5_TR_TYPE1, false, false,
-			      CPPI5_TR_EVENT_SIZE_COMPLETION, 0);
+		int tr_idx = i * num_tr;
 
-		tr_req[i].addr = buf_addr + period_len * i;
-		tr_req[i].icnt0 = dev_width;
-		tr_req[i].icnt1 = period_len / dev_width;
-		tr_req[i].dim1 = dev_width;
+		cppi5_tr_init(&tr_req[tr_idx].flags, CPPI5_TR_TYPE1, false,
+			      false, CPPI5_TR_EVENT_SIZE_COMPLETION, 0);
+
+		tr_req[tr_idx].addr = period_addr;
+		tr_req[tr_idx].icnt0 = tr0_cnt0;
+		tr_req[tr_idx].icnt1 = tr0_cnt1;
+		tr_req[tr_idx].dim1 = tr0_cnt0;
+
+		if (num_tr == 2) {
+			cppi5_tr_csf_set(&tr_req[tr_idx].flags,
+					 CPPI5_TR_CSF_SUPR_EVT);
+			tr_idx++;
+
+			cppi5_tr_init(&tr_req[tr_idx].flags, CPPI5_TR_TYPE1,
+				      false, false,
+				      CPPI5_TR_EVENT_SIZE_COMPLETION, 0);
+
+			tr_req[tr_idx].addr = period_addr + tr0_cnt1 * tr0_cnt0;
+			tr_req[tr_idx].icnt0 = tr1_cnt0;
+			tr_req[tr_idx].icnt1 = 1;
+			tr_req[tr_idx].dim1 = tr1_cnt0;
+		}
 
 		if (!(flags & DMA_PREP_INTERRUPT))
-			cppi5_tr_csf_set(&tr_req[i].flags,
+			cppi5_tr_csf_set(&tr_req[tr_idx].flags,
 					 CPPI5_TR_CSF_SUPR_EVT);
+
+		period_addr += period_len;
 	}
 
 	return d;
-- 
Peter




* Re: [PATCH for-next 0/4] dmaengine: ti: k3-udma: Updates for next
  2020-01-27 13:21 [PATCH for-next 0/4] dmaengine: ti: k3-udma: Updates for next Peter Ujfalusi
                   ` (3 preceding siblings ...)
  2020-01-27 13:21 ` [PATCH for-next 4/4] dmaengine: ti: k3-udma: Use the TR counter helper for slave_sg and cyclic Peter Ujfalusi
@ 2020-01-28 10:15 ` Peter Ujfalusi
  2020-01-28 11:50   ` Vinod Koul
  4 siblings, 1 reply; 13+ messages in thread
From: Peter Ujfalusi @ 2020-01-28 10:15 UTC
  To: vkoul
  Cc: dmaengine, linux-kernel, dan.j.williams, grygorii.strashko, vigneshr

Vinod,

On 27/01/2020 15.21, Peter Ujfalusi wrote:
> Hi Vinod,
> 
> Based on customer reports we have identified two issues with the UDMA driver:
> 
> TX completion (1st patch):
> The scheduled-work based workaround for checking completion worked well for
> UART, but it had a significant impact on SPI performance.
> The underlying issue comes from the fact that we have a split data movement
> architecture: in order to know that the transfer is really done we need to
> check the remote end's (PDMA) byte counter.
>
> RX channel teardown with stale data in PDMA (2nd patch):
> If we try to stop the RX DMA channel (teardown), PDMA tries to flush the
> data it might have received from a peripheral, but if UDMA does not have a
> packet to use for this draining then it pushes back on the PDMA and the
> flush never completes.
> The workaround is to use a dummy descriptor for flush purposes when the
> channel is terminated and we do not have an active transfer (no descriptor
> for UDMA).
> This allows UDMA to drain the data so that the teardown can complete.
>
> The last two patches use common code to set up the TR parameters for
> slave_sg, cyclic and memcpy. The setup code is the same as what we used for
> memcpy; with the change we can handle 4.2GB sg elements and periods in the
> cyclic case. It is also nice that we have a single function to do the
> configuration.

I have marked these patches as for-next since 5.5 had not been released yet.
Would it be possible to have these as fixes for 5.6?

Thanks,
- Péter

> 
> Regards,
> Peter
> ---
> Peter Ujfalusi (3):
>   dmaengine: ti: k3-udma: Workaround for RX teardown with stale data in
>     peer
>   dmaengine: ti: k3-udma: Move the TR counter calculation to helper
>     function
>   dmaengine: ti: k3-udma: Use the TR counter helper for slave_sg and
>     cyclic
> 
> Vignesh Raghavendra (1):
>   dmaengine: ti: k3-udma: Use ktime/usleep_range based TX completion
>     check
> 
>  drivers/dma/ti/k3-udma.c | 452 +++++++++++++++++++++++++++++----------
>  1 file changed, 343 insertions(+), 109 deletions(-)
> 



* Re: [PATCH for-next 1/4] dmaengine: ti: k3-udma: Use ktime/usleep_range based TX completion check
  2020-01-27 13:21 ` [PATCH for-next 1/4] dmaengine: ti: k3-udma: Use ktime/usleep_range based TX completion check Peter Ujfalusi
@ 2020-01-28 11:48   ` Vinod Koul
  2020-01-28 12:05     ` Vignesh Raghavendra
  0 siblings, 1 reply; 13+ messages in thread
From: Vinod Koul @ 2020-01-28 11:48 UTC
  To: Peter Ujfalusi
  Cc: dmaengine, linux-kernel, dan.j.williams, grygorii.strashko, vigneshr

On 27-01-20, 15:21, Peter Ujfalusi wrote:
> From: Vignesh Raghavendra <vigneshr@ti.com>
> 
> In some cases (McSPI for example) the jiffies and delayed_work based
> workaround can cause a big throughput drop.
> 
> Switch to a ktime/usleep_range based implementation to be able to
> sustain speed for PDMA based peripherals.
> 
> Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
> Signed-off-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
> ---
>  drivers/dma/ti/k3-udma.c | 80 ++++++++++++++++++++++++++--------------
>  1 file changed, 53 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/dma/ti/k3-udma.c b/drivers/dma/ti/k3-udma.c
> index ea79c2df28e0..fb59c869a6a7 100644
> --- a/drivers/dma/ti/k3-udma.c
> +++ b/drivers/dma/ti/k3-udma.c
> @@ -5,6 +5,7 @@
>   */
>  
>  #include <linux/kernel.h>
> +#include <linux/delay.h>
>  #include <linux/dmaengine.h>
>  #include <linux/dma-mapping.h>
>  #include <linux/dmapool.h>
> @@ -169,7 +170,7 @@ enum udma_chan_state {
>  
>  struct udma_tx_drain {
>  	struct delayed_work work;
> -	unsigned long jiffie;
> +	ktime_t tstamp;
>  	u32 residue;
>  };
>  
> @@ -946,9 +947,10 @@ static bool udma_is_desc_really_done(struct udma_chan *uc, struct udma_desc *d)
>  	peer_bcnt = udma_tchanrt_read(uc->tchan, UDMA_TCHAN_RT_PEER_BCNT_REG);
>  	bcnt = udma_tchanrt_read(uc->tchan, UDMA_TCHAN_RT_BCNT_REG);
>  
> +	/* Transfer is incomplete, store current residue and time stamp */
>  	if (peer_bcnt < bcnt) {
>  		uc->tx_drain.residue = bcnt - peer_bcnt;
> -		uc->tx_drain.jiffie = jiffies;
> +		uc->tx_drain.tstamp = ktime_get();

Any reason why ktime_get() is better than jiffies..?

>  		return false;
>  	}
>  
> @@ -961,35 +963,59 @@ static void udma_check_tx_completion(struct work_struct *work)
>  					    tx_drain.work.work);
>  	bool desc_done = true;
>  	u32 residue_diff;
> -	unsigned long jiffie_diff, delay;
> +	ktime_t time_diff;
> +	unsigned long delay;
> +
> +	while (1) {
> +		if (uc->desc) {
> +			/* Get previous residue and time stamp */
> +			residue_diff = uc->tx_drain.residue;
> +			time_diff = uc->tx_drain.tstamp;
> +			/*
> +			 * Get current residue and time stamp or see if
> +			 * transfer is complete
> +			 */
> +			desc_done = udma_is_desc_really_done(uc, uc->desc);
> +		}
>  
> -	if (uc->desc) {
> -		residue_diff = uc->tx_drain.residue;
> -		jiffie_diff = uc->tx_drain.jiffie;
> -		desc_done = udma_is_desc_really_done(uc, uc->desc);
> -	}
> -
> -	if (!desc_done) {
> -		jiffie_diff = uc->tx_drain.jiffie - jiffie_diff;
> -		residue_diff -= uc->tx_drain.residue;
> -		if (residue_diff) {
> -			/* Try to guess when we should check next time */
> -			residue_diff /= jiffie_diff;
> -			delay = uc->tx_drain.residue / residue_diff / 3;
> -			if (jiffies_to_msecs(delay) < 5)
> -				delay = 0;
> -		} else {
> -			/* No progress, check again in 1 second  */
> -			delay = HZ;
> +		if (!desc_done) {
> +			/*
> +			 * Find the time delta and residue delta w.r.t
> +			 * previous poll
> +			 */
> +			time_diff = ktime_sub(uc->tx_drain.tstamp,
> +					      time_diff) + 1;
> +			residue_diff -= uc->tx_drain.residue;
> +			if (residue_diff) {
> +				/*
> +				 * Try to guess when we should check
> +				 * next time by calculating rate at
> +				 * which data is being drained at the
> +				 * peer device
> +				 */
> +				delay = (time_diff / residue_diff) *
> +					uc->tx_drain.residue;
> +			} else {
> +				/* No progress, check again in 1 second  */
> +				schedule_delayed_work(&uc->tx_drain.work, HZ);
> +				break;
> +			}
> +
> +			usleep_range(ktime_to_us(delay),
> +				     ktime_to_us(delay) + 10);
> +			continue;
>  		}
>  
> -		schedule_delayed_work(&uc->tx_drain.work, delay);
> -	} else if (uc->desc) {
> -		struct udma_desc *d = uc->desc;
> +		if (uc->desc) {
> +			struct udma_desc *d = uc->desc;
> +
> +			uc->bcnt += d->residue;
> +			udma_start(uc);
> +			vchan_cookie_complete(&d->vd);
> +			break;
> +		}
>  
> -		uc->bcnt += d->residue;
> -		udma_start(uc);
> -		vchan_cookie_complete(&d->vd);
> +		break;
>  	}
>  }
>  
> -- 
> Peter
> 

-- 
~Vinod


* Re: [PATCH for-next 0/4] dmaengine: ti: k3-udma: Updates for next
  2020-01-28 10:15 ` [PATCH for-next 0/4] dmaengine: ti: k3-udma: Updates for next Peter Ujfalusi
@ 2020-01-28 11:50   ` Vinod Koul
  2020-01-28 12:37     ` Peter Ujfalusi
  0 siblings, 1 reply; 13+ messages in thread
From: Vinod Koul @ 2020-01-28 11:50 UTC
  To: Peter Ujfalusi
  Cc: dmaengine, linux-kernel, dan.j.williams, grygorii.strashko, vigneshr

On 28-01-20, 12:15, Peter Ujfalusi wrote:
> Vinod,
> 
> On 27/01/2020 15.21, Peter Ujfalusi wrote:
> > Hi Vinod,
> > 
> > Based on customer reports we have identified two issues with the UDMA driver:
> > 
> > TX completion (1st patch):
> > The scheduled-work based workaround for checking completion worked well for
> > UART, but it had a significant impact on SPI performance.
> > The underlying issue comes from the fact that we have a split data movement
> > architecture: in order to know that the transfer is really done we need to
> > check the remote end's (PDMA) byte counter.
> >
> > RX channel teardown with stale data in PDMA (2nd patch):
> > If we try to stop the RX DMA channel (teardown), PDMA tries to flush the
> > data it might have received from a peripheral, but if UDMA does not have a
> > packet to use for this draining then it pushes back on the PDMA and the
> > flush never completes.
> > The workaround is to use a dummy descriptor for flush purposes when the
> > channel is terminated and we do not have an active transfer (no descriptor
> > for UDMA).
> > This allows UDMA to drain the data so that the teardown can complete.
> >
> > The last two patches use common code to set up the TR parameters for
> > slave_sg, cyclic and memcpy. The setup code is the same as what we used for
> > memcpy; with the change we can handle 4.2GB sg elements and periods in the
> > cyclic case. It is also nice that we have a single function to do the
> > configuration.
> 
> I have marked these patches as for-next since 5.5 had not been released yet.
> Would it be possible to have these as fixes for 5.6?

Sure, but are they really fixes? Why can't they go into the next release :)

They seem to improve things for sure, but do we want to call them
fixes..?

-- 
~Vinod


* Re: [PATCH for-next 1/4] dmaengine: ti: k3-udma: Use ktime/usleep_range based TX completion check
  2020-01-28 11:48   ` Vinod Koul
@ 2020-01-28 12:05     ` Vignesh Raghavendra
  2020-01-28 12:44       ` Vinod Koul
  0 siblings, 1 reply; 13+ messages in thread
From: Vignesh Raghavendra @ 2020-01-28 12:05 UTC
  To: Vinod Koul, Peter Ujfalusi
  Cc: dmaengine, linux-kernel, dan.j.williams, grygorii.strashko

Hi Vinod,

On 1/28/2020 5:18 PM, Vinod Koul wrote:
> On 27-01-20, 15:21, Peter Ujfalusi wrote:
>> From: Vignesh Raghavendra <vigneshr@ti.com>
>>
>> In some cases (McSPI for example) the jiffies and delayed_work based
>> workaround can cause a big throughput drop.
>>
>> Switch to a ktime/usleep_range based implementation to be able to
>> sustain speed for PDMA based peripherals.
>>
>> Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
>> Signed-off-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
>> ---
>>  drivers/dma/ti/k3-udma.c | 80 ++++++++++++++++++++++++++--------------
>>  1 file changed, 53 insertions(+), 27 deletions(-)
>>
>> diff --git a/drivers/dma/ti/k3-udma.c b/drivers/dma/ti/k3-udma.c
>> index ea79c2df28e0..fb59c869a6a7 100644
>> --- a/drivers/dma/ti/k3-udma.c
>> +++ b/drivers/dma/ti/k3-udma.c
>> @@ -5,6 +5,7 @@
>>   */
>>  
>>  #include <linux/kernel.h>
>> +#include <linux/delay.h>
>>  #include <linux/dmaengine.h>
>>  #include <linux/dma-mapping.h>
>>  #include <linux/dmapool.h>
>> @@ -169,7 +170,7 @@ enum udma_chan_state {
>>  
>>  struct udma_tx_drain {
>>  	struct delayed_work work;
>> -	unsigned long jiffie;
>> +	ktime_t tstamp;
>>  	u32 residue;
>>  };
>>  
>> @@ -946,9 +947,10 @@ static bool udma_is_desc_really_done(struct udma_chan *uc, struct udma_desc *d)
>>  	peer_bcnt = udma_tchanrt_read(uc->tchan, UDMA_TCHAN_RT_PEER_BCNT_REG);
>>  	bcnt = udma_tchanrt_read(uc->tchan, UDMA_TCHAN_RT_BCNT_REG);
>>  
>> +	/* Transfer is incomplete, store current residue and time stamp */
>>  	if (peer_bcnt < bcnt) {
>>  		uc->tx_drain.residue = bcnt - peer_bcnt;
>> -		uc->tx_drain.jiffie = jiffies;
>> +		uc->tx_drain.tstamp = ktime_get();
> 
> Any reason why ktime_get() is better than jiffies..?

Resolution of jiffies is 4ms (with HZ=250). ktime_t has better resolution
(up to ns scale). With jiffies, I observed that the code was either always
polling the DMA progress counters (which affects HW data transfer speed)
or sleeping too long, both causing performance loss. Switching to ktime_t
provides a better prediction of how long the transfer takes to complete.
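
As a worked example (assuming HZ=250, i.e. one jiffy = 4ms): with the
peer draining at ~10MB/s and 20KB of residue left, the remaining drain
time is ~2ms. A jiffies based delta rounds this to either 0 (continuous
polling) or a full jiffy (4ms oversleep), while the ktime based estimate
delay = (time_diff / residue_diff) * residue lands close to the real 2ms.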

Regards
Vignesh



* Re: [PATCH for-next 0/4] dmaengine: ti: k3-udma: Updates for next
  2020-01-28 11:50   ` Vinod Koul
@ 2020-01-28 12:37     ` Peter Ujfalusi
  2020-01-30 13:19       ` Peter Ujfalusi
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Ujfalusi @ 2020-01-28 12:37 UTC
  To: Vinod Koul
  Cc: dmaengine, linux-kernel, dan.j.williams, grygorii.strashko, vigneshr

Hi Vinod,

On 28/01/2020 13.50, Vinod Koul wrote:
> On 28-01-20, 12:15, Peter Ujfalusi wrote:
>> Vinod,
>>
>> On 27/01/2020 15.21, Peter Ujfalusi wrote:
>>> Hi Vinod,
>>>
>>> Based on customer reports we have identified two issues with the UDMA driver:
>>>
>>> TX completion (1st patch):
>>> The scheduled-work based workaround for checking completion worked well for
>>> UART, but it had a significant impact on SPI performance.
>>> The underlying issue comes from the fact that we have a split data movement
>>> architecture: in order to know that the transfer is really done we need to
>>> check the remote end's (PDMA) byte counter.
>>>
>>> RX channel teardown with stale data in PDMA (2nd patch):
>>> If we try to stop the RX DMA channel (teardown), PDMA tries to flush the
>>> data it might have received from a peripheral, but if UDMA does not have a
>>> packet to use for this draining then it pushes back on the PDMA and the
>>> flush never completes.
>>> The workaround is to use a dummy descriptor for flush purposes when the
>>> channel is terminated and we do not have an active transfer (no descriptor
>>> for UDMA).
>>> This allows UDMA to drain the data so that the teardown can complete.
>>>
>>> The last two patches use common code to set up the TR parameters for
>>> slave_sg, cyclic and memcpy. The setup code is the same as what we used for
>>> memcpy; with the change we can handle 4.2GB sg elements and periods in the
>>> cyclic case. It is also nice that we have a single function to do the
>>> configuration.
>>
>> I have marked these patches as for-next since 5.5 had not been released yet.
>> Would it be possible to have these as fixes for 5.6?
> 
> Sure, but are they really fixes? Why can't they go into the next release :)
> 
> They seem to improve things for sure, but do we want to call them
> fixes..?

I would say that the first two patches are fixes:
The TX completion check fixes the performance hit caused by the early TX
completion workaround which used jiffies+work.

The second patch fixes a case when we have stale data during RX and no
active transfer. For example, the UART reads 1000 bytes, but the other end
is 'streaming' the data and the UART+PDMA keeps receiving data after the
1000 bytes.
Recovering from this state is not easy and it might not even succeed at
HW level.

The last two, I agree, are not fixing much, but they do correct the
slave_sg TR setup (and improve the cyclic case as well).
With that I could send the ASoC platform wrapper for UDMA with
period_bytes_max = 4.2GB ;)
I have SZ_512K in there atm; with the old calculation SZ_64K is the
maximum, so not a big issue.

I think the first two patches are fix candidates as they fix a regression
(albeit a regression between the series) and a real-world channel lockup
discovered too late for the initial driver.

- Péter



* Re: [PATCH for-next 1/4] dmaengine: ti: k3-udma: Use ktime/usleep_range based TX completion check
  2020-01-28 12:05     ` Vignesh Raghavendra
@ 2020-01-28 12:44       ` Vinod Koul
  2020-02-11 10:13         ` Peter Ujfalusi
  0 siblings, 1 reply; 13+ messages in thread
From: Vinod Koul @ 2020-01-28 12:44 UTC
  To: Vignesh Raghavendra
  Cc: Peter Ujfalusi, dmaengine, linux-kernel, dan.j.williams,
	grygorii.strashko

On 28-01-20, 17:35, Vignesh Raghavendra wrote:

> >> +	/* Transfer is incomplete, store current residue and time stamp */
> >>  	if (peer_bcnt < bcnt) {
> >>  		uc->tx_drain.residue = bcnt - peer_bcnt;
> >> -		uc->tx_drain.jiffie = jiffies;
> >> +		uc->tx_drain.tstamp = ktime_get();
> > 
> > Any reason why ktime_get() is better than jiffies..?
> 
> Resolution of jiffies is 4ms (with HZ=250). ktime_t has better resolution
> (up to ns scale). With jiffies, I observed that the code was either always
> polling the DMA progress counters (which affects HW data transfer speed)
> or sleeping too long, both causing performance loss. Switching to ktime_t
> provides a better prediction of how long the transfer takes to complete.

Thanks for the explanation, I think it is good info to add to the changelog.

-- 
~Vinod


* Re: [PATCH for-next 0/4] dmaengine: ti: k3-udma: Updates for next
  2020-01-28 12:37     ` Peter Ujfalusi
@ 2020-01-30 13:19       ` Peter Ujfalusi
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Ujfalusi @ 2020-01-30 13:19 UTC
  To: Vinod Koul
  Cc: dmaengine, linux-kernel, dan.j.williams, grygorii.strashko, vigneshr

Hi Vinod,

On 28/01/2020 14.37, Peter Ujfalusi wrote:
> Hi Vinod,
> 
> On 28/01/2020 13.50, Vinod Koul wrote:
>> On 28-01-20, 12:15, Peter Ujfalusi wrote:
>>> Vinod,
>>>
>>> On 27/01/2020 15.21, Peter Ujfalusi wrote:
>>>> Hi Vinod,
>>>>
>>>> Based on customer reports we have identified two issues with the UDMA driver:
>>>>
>>>> TX completion (1st patch):
>>>> The scheduled-work based workaround for checking completion worked well for
>>>> UART, but it had a significant impact on SPI performance.
>>>> The underlying issue comes from the fact that we have a split data movement
>>>> architecture: in order to know that the transfer is really done we need to
>>>> check the remote end's (PDMA) byte counter.
>>>>
>>>> RX channel teardown with stale data in PDMA (2nd patch):
>>>> If we try to stop the RX DMA channel (teardown), PDMA tries to flush the
>>>> data it might have received from a peripheral, but if UDMA does not have a
>>>> packet to use for this draining then it pushes back on the PDMA and the
>>>> flush never completes.
>>>> The workaround is to use a dummy descriptor for flush purposes when the
>>>> channel is terminated and we do not have an active transfer (no descriptor
>>>> for UDMA).
>>>> This allows UDMA to drain the data so that the teardown can complete.
>>>>
>>>> The last two patches use common code to set up the TR parameters for
>>>> slave_sg, cyclic and memcpy. The setup code is the same as what we used for
>>>> memcpy; with the change we can handle 4.2GB sg elements and periods in the
>>>> cyclic case. It is also nice that we have a single function to do the
>>>> configuration.
>>>
>>> I have marked these patches as for-next since 5.5 had not been released yet.
>>> Would it be possible to have these as fixes for 5.6?
>>
>> Sure, but are they really fixes? Why can't they go into the next release :)
>>
>> They seem to improve things for sure, but do we want to call them
>> fixes..?
> 
> I would say that the first two patches are fixes:
> The TX completion check fixes the performance hit caused by the early TX
> completion workaround which used jiffies+work.
> 
> The second patch fixes a case when we have stale data during RX and no
> active transfer. For example, the UART reads 1000 bytes, but the other end
> is 'streaming' the data and the UART+PDMA keeps receiving data after the
> 1000 bytes.
> Recovering from this state is not easy and it might not even succeed at
> HW level.
> 
> The last two, I agree, are not fixing much, but they do correct the
> slave_sg TR setup (and improve the cyclic case as well).
> With that I could send the ASoC platform wrapper for UDMA with
> period_bytes_max = 4.2GB ;)
> I have SZ_512K in there atm; with the old calculation SZ_64K is the
> maximum, so not a big issue.

Actually this also fixes a real bug in the driver for the slave_sg_tr case:
if sg_dma_len(sgent) is not a multiple of (burst * dev_width) then we end
up with missing bytes as the counters are not set up correctly.
The client driver with which we tested slave_sg_tr was always giving
sg_len == 1 and the buffer was aligned, but when I tuned the client to
pass a list, things got broken.
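
As a worked example of the old breakage (numbers are illustrative): with
dev_width = 4 and burst = 1 the old code set icnt0 = 4, and for
sg_dma_len() = 1002 it programmed icnt1 = 1002 / 4 = 250, so only 1000
bytes were transferred and 2 bytes were silently dropped. The helper
instead programs tr0_cnt0 = 1002, tr0_cnt1 = 1 (a single TR, as
1002 < SZ_64K).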

> 
> I think the first two patches are fix candidates as they fix a regression
> (albeit a regression between the series) and a real-world channel lockup
> discovered too late for the initial driver.
> 
> - Péter
> 
> 

- Péter



* Re: [PATCH for-next 1/4] dmaengine: ti: k3-udma: Use ktime/usleep_range based TX completion check
  2020-01-28 12:44       ` Vinod Koul
@ 2020-02-11 10:13         ` Peter Ujfalusi
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Ujfalusi @ 2020-02-11 10:13 UTC
  To: Vinod Koul, Vignesh Raghavendra
  Cc: dmaengine, linux-kernel, dan.j.williams, grygorii.strashko



On 28/01/2020 14.44, Vinod Koul wrote:
> On 28-01-20, 17:35, Vignesh Raghavendra wrote:
> 
>>>> +	/* Transfer is incomplete, store current residue and time stamp */
>>>>  	if (peer_bcnt < bcnt) {
>>>>  		uc->tx_drain.residue = bcnt - peer_bcnt;
>>>> -		uc->tx_drain.jiffie = jiffies;
>>>> +		uc->tx_drain.tstamp = ktime_get();
>>>
>>> Any reason why ktime_get() is better than jiffies..?
>>
>> Resolution of jiffies is 4ms (with HZ=250). ktime_t has better resolution
>> (up to ns scale). With jiffies, I observed that the code was either always
>> polling the DMA progress counters (which affects HW data transfer speed)
>> or sleeping too long, both causing performance loss. Switching to ktime_t
>> provides a better prediction of how long the transfer takes to complete.
> 
> Thanks for explanation, i think it is good info to add in changelog.

It turns out that this patch causes a lockup with UART stress testing.
The strange thing is that we have an identical patch in production with
4.19 without issues.

I'll send two series for the UDMA update, as we have found a way to
induce a kernel crash with experimental UART patches:
one with patches as must-have bug fixes for 5.6 and another one with
lower priority fixes.

- Péter


