* [PATCH 0/2] dmaengine: ti: k3-udma: memcpy throughput improvement
From: Peter Ujfalusi @ 2020-12-14 8:13 UTC
To: vkoul
Cc: dan.j.williams, linux-kernel, dmaengine, vigneshr, grygorii.strashko, kishon

Hi,

Newer members of the K3 family (after AM654) have support for burst_size
configuration for each DMA channel.

The HW default value is 64 bytes, but on higher-throughput channels it can be
increased to 256 bytes (UCHANs) or 128 bytes (HCHANs).

Aligning the buffers and the length of the transfer to the burst size also
increases the throughput.

Numbers gathered on j721e (UCHAN pair):
echo 8000000 > /sys/module/dmatest/parameters/test_buf_size
echo 2000 > /sys/module/dmatest/parameters/timeout
echo 50 > /sys/module/dmatest/parameters/iterations
echo 1 > /sys/module/dmatest/parameters/max_channels

Prior to this patch: ~1.3 GB/s
After this patch: ~1.8 GB/s
with 1 byte alignment: ~1.7 GB/s

The patches are on top of the AM64 support series:
https://lore.kernel.org/lkml/20201208090440.31792-1-peter.ujfalusi@ti.com/

Regards,
Peter
---
Peter Ujfalusi (2):
  dmaengine: Extend the dmaengine_alignment for 128 and 256 bytes
  dmaengine: ti: k3-udma: Add support for burst_size configuration for
    mem2mem

 drivers/dma/ti/k3-udma.c  | 115 ++++++++++++++++++++++++++++++++++++--
 include/linux/dmaengine.h |   2 +
 2 files changed, 112 insertions(+), 5 deletions(-)

-- 
Peter

Texas Instruments Finland Oy, Porkkalankatu 22, 00180 Helsinki.
Y-tunnus/Business ID: 0615521-4. Kotipaikka/Domicile: Helsinki
* [PATCH 1/2] dmaengine: Extend the dmaengine_alignment for 128 and 256 bytes
From: Peter Ujfalusi @ 2020-12-14 8:13 UTC

Some DMA devices can benefit from a higher order of alignment than the
maximum of 64 bytes currently defined. Define 128 and 256 bytes alignment
for these devices.

Signed-off-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
---
 include/linux/dmaengine.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index 68130f5f599e..004736b6a9c8 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -745,6 +745,8 @@ enum dmaengine_alignment {
 	DMAENGINE_ALIGN_16_BYTES = 4,
 	DMAENGINE_ALIGN_32_BYTES = 5,
 	DMAENGINE_ALIGN_64_BYTES = 6,
+	DMAENGINE_ALIGN_128_BYTES = 7,
+	DMAENGINE_ALIGN_256_BYTES = 8,
 };
 
 /**
-- 
Peter
* [PATCH 2/2] dmaengine: ti: k3-udma: Add support for burst_size configuration for mem2mem
From: Peter Ujfalusi @ 2020-12-14 8:13 UTC

The UDMA and BCDMA can provide higher throughput if the burst_size of the
channel is changed from its default (which is 64 bytes) for Ultra-high
and high capacity channels.

This performance benefit is even more visible when the buffers are aligned
with the burst_size configuration.

The am654 does not have a way to change the burst size, but it is using
64-byte bursts, so increasing the copy_align from 8 bytes to 64 (and
clients taking that into account) can increase the throughput as well.
Numbers gathered on j721e:
echo 8000000 > /sys/module/dmatest/parameters/test_buf_size
echo 2000 > /sys/module/dmatest/parameters/timeout
echo 50 > /sys/module/dmatest/parameters/iterations
echo 1 > /sys/module/dmatest/parameters/max_channels

Prior to this patch: ~1.3 GB/s
After this patch: ~1.8 GB/s
with 1 byte alignment: ~1.7 GB/s

Signed-off-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
---
 drivers/dma/ti/k3-udma.c | 115 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 110 insertions(+), 5 deletions(-)

diff --git a/drivers/dma/ti/k3-udma.c b/drivers/dma/ti/k3-udma.c
index 87157cbae1b8..54e4ccb1b37e 100644
--- a/drivers/dma/ti/k3-udma.c
+++ b/drivers/dma/ti/k3-udma.c
@@ -121,6 +121,11 @@ struct udma_oes_offsets {
 #define UDMA_FLAG_PDMA_ACC32	BIT(0)
 #define UDMA_FLAG_PDMA_BURST	BIT(1)
 #define UDMA_FLAG_TDTYPE	BIT(2)
+#define UDMA_FLAG_BURST_SIZE	BIT(3)
+#define UDMA_FLAGS_J7_CLASS	(UDMA_FLAG_PDMA_ACC32 | \
+				 UDMA_FLAG_PDMA_BURST | \
+				 UDMA_FLAG_TDTYPE | \
+				 UDMA_FLAG_BURST_SIZE)
 
 struct udma_match_data {
 	enum k3_dma_type type;
@@ -128,6 +133,7 @@ struct udma_match_data {
 	bool enable_memcpy_support;
 	u32 flags;
 	u32 statictr_z_mask;
+	u8 burst_size[3];
 };
 
 struct udma_soc_data {
@@ -436,6 +442,18 @@ static void k3_configure_chan_coherency(struct dma_chan *chan, u32 asel)
 	}
 }
 
+static u8 udma_get_chan_tpl_index(struct udma_tpl *tpl_map, int chan_id)
+{
+	int i;
+
+	for (i = 0; i < tpl_map->levels; i++) {
+		if (chan_id >= tpl_map->start_idx[i])
+			return i;
+	}
+
+	return 0;
+}
+
 static void udma_reset_uchan(struct udma_chan *uc)
 {
 	memset(&uc->config, 0, sizeof(uc->config));
@@ -1811,6 +1829,7 @@ static int udma_tisci_m2m_channel_config(struct udma_chan *uc)
 	const struct ti_sci_rm_udmap_ops *tisci_ops = tisci_rm->tisci_udmap_ops;
 	struct udma_tchan *tchan = uc->tchan;
 	struct udma_rchan *rchan = uc->rchan;
+	u8 burst_size = 0;
 	int ret = 0;
 
 	/* Non synchronized - mem to mem type of transfer */
@@ -1818,6 +1837,12 @@ static int udma_tisci_m2m_channel_config(struct udma_chan *uc)
 	struct ti_sci_msg_rm_udmap_tx_ch_cfg req_tx = { 0 };
 	struct ti_sci_msg_rm_udmap_rx_ch_cfg req_rx = { 0 };
 
+	if (ud->match_data->flags & UDMA_FLAG_BURST_SIZE) {
+		u8 tpl = udma_get_chan_tpl_index(&ud->tchan_tpl, tchan->id);
+
+		burst_size = ud->match_data->burst_size[tpl];
+	}
+
 	req_tx.valid_params = TISCI_UDMA_TCHAN_VALID_PARAMS;
 	req_tx.nav_id = tisci_rm->tisci_dev_id;
 	req_tx.index = tchan->id;
@@ -1825,6 +1850,10 @@ static int udma_tisci_m2m_channel_config(struct udma_chan *uc)
 	req_tx.tx_fetch_size = sizeof(struct cppi5_desc_hdr_t) >> 2;
 	req_tx.txcq_qnum = tc_ring;
 	req_tx.tx_atype = ud->atype;
+	if (burst_size) {
+		req_tx.valid_params |= TI_SCI_MSG_VALUE_RM_UDMAP_CH_BURST_SIZE_VALID;
+		req_tx.tx_burst_size = burst_size;
+	}
 
 	ret = tisci_ops->tx_ch_cfg(tisci_rm->tisci, &req_tx);
 	if (ret) {
@@ -1839,6 +1868,10 @@ static int udma_tisci_m2m_channel_config(struct udma_chan *uc)
 	req_rx.rxcq_qnum = tc_ring;
 	req_rx.rx_chan_type = TI_SCI_RM_UDMAP_CHAN_TYPE_3RDP_BCOPY_PBRR;
 	req_rx.rx_atype = ud->atype;
+	if (burst_size) {
+		req_rx.valid_params |= TI_SCI_MSG_VALUE_RM_UDMAP_CH_BURST_SIZE_VALID;
+		req_rx.rx_burst_size = burst_size;
+	}
 
 	ret = tisci_ops->rx_ch_cfg(tisci_rm->tisci, &req_rx);
 	if (ret)
@@ -1854,12 +1887,23 @@ static int bcdma_tisci_m2m_channel_config(struct udma_chan *uc)
 	const struct ti_sci_rm_udmap_ops *tisci_ops = tisci_rm->tisci_udmap_ops;
 	struct ti_sci_msg_rm_udmap_tx_ch_cfg req_tx = { 0 };
 	struct udma_bchan *bchan = uc->bchan;
+	u8 burst_size = 0;
 	int ret = 0;
 
+	if (ud->match_data->flags & UDMA_FLAG_BURST_SIZE) {
+		u8 tpl = udma_get_chan_tpl_index(&ud->bchan_tpl, bchan->id);
+
+		burst_size = ud->match_data->burst_size[tpl];
+	}
+
 	req_tx.valid_params = TISCI_BCDMA_BCHAN_VALID_PARAMS;
 	req_tx.nav_id = tisci_rm->tisci_dev_id;
 	req_tx.extended_ch_type = TI_SCI_RM_BCDMA_EXTENDED_CH_TYPE_BCHAN;
 	req_tx.index = bchan->id;
+	if (burst_size) {
+		req_tx.valid_params |= TI_SCI_MSG_VALUE_RM_UDMAP_CH_BURST_SIZE_VALID;
+		req_tx.tx_burst_size = burst_size;
+	}
 
 	ret = tisci_ops->tx_ch_cfg(tisci_rm->tisci, &req_tx);
 	if (ret)
@@ -4167,6 +4211,11 @@ static struct udma_match_data am654_main_data = {
 	.psil_base = 0x1000,
 	.enable_memcpy_support = true,
 	.statictr_z_mask = GENMASK(11, 0),
+	.burst_size = {
+		TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_64_BYTES, /* Normal Channels */
+		TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_64_BYTES, /* H Channels */
+		0, /* No UH Channels */
+	},
 };
 
 static struct udma_match_data am654_mcu_data = {
@@ -4174,38 +4223,63 @@ static struct udma_match_data am654_mcu_data = {
 	.psil_base = 0x6000,
 	.enable_memcpy_support = false,
 	.statictr_z_mask = GENMASK(11, 0),
+	.burst_size = {
+		TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_64_BYTES, /* Normal Channels */
+		TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_64_BYTES, /* H Channels */
+		0, /* No UH Channels */
+	},
 };
 
 static struct udma_match_data j721e_main_data = {
 	.type = DMA_TYPE_UDMA,
 	.psil_base = 0x1000,
 	.enable_memcpy_support = true,
-	.flags = UDMA_FLAG_PDMA_ACC32 | UDMA_FLAG_PDMA_BURST | UDMA_FLAG_TDTYPE,
+	.flags = UDMA_FLAGS_J7_CLASS,
 	.statictr_z_mask = GENMASK(23, 0),
+	.burst_size = {
+		TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_64_BYTES, /* Normal Channels */
+		TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_256_BYTES, /* H Channels */
+		TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_256_BYTES, /* UH Channels */
+	},
 };
 
 static struct udma_match_data j721e_mcu_data = {
 	.type = DMA_TYPE_UDMA,
 	.psil_base = 0x6000,
 	.enable_memcpy_support = false, /* MEM_TO_MEM is slow via MCU UDMA */
-	.flags = UDMA_FLAG_PDMA_ACC32 | UDMA_FLAG_PDMA_BURST | UDMA_FLAG_TDTYPE,
+	.flags = UDMA_FLAGS_J7_CLASS,
 	.statictr_z_mask = GENMASK(23, 0),
+	.burst_size = {
+		TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_64_BYTES, /* Normal Channels */
+		TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_128_BYTES, /* H Channels */
+		0, /* No UH Channels */
+	},
 };
 
 static struct udma_match_data am64_bcdma_data = {
 	.type = DMA_TYPE_BCDMA,
 	.psil_base = 0x2000, /* for tchan and rchan, not applicable to bchan */
 	.enable_memcpy_support = true, /* Supported via bchan */
-	.flags = UDMA_FLAG_PDMA_ACC32 | UDMA_FLAG_PDMA_BURST | UDMA_FLAG_TDTYPE,
+	.flags = UDMA_FLAGS_J7_CLASS,
 	.statictr_z_mask = GENMASK(23, 0),
+	.burst_size = {
+		TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_64_BYTES, /* Normal Channels */
+		0, /* No H Channels */
+		0, /* No UH Channels */
+	},
 };
 
 static struct udma_match_data am64_pktdma_data = {
 	.type = DMA_TYPE_PKTDMA,
 	.psil_base = 0x1000,
 	.enable_memcpy_support = false, /* PKTDMA does not support MEM_TO_MEM */
-	.flags = UDMA_FLAG_PDMA_ACC32 | UDMA_FLAG_PDMA_BURST | UDMA_FLAG_TDTYPE,
+	.flags = UDMA_FLAGS_J7_CLASS,
	.statictr_z_mask = GENMASK(23, 0),
+	.burst_size = {
+		TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_64_BYTES, /* Normal Channels */
+		0, /* No H Channels */
+		0, /* No UH Channels */
+	},
 };
 
 static const struct of_device_id udma_of_match[] = {
@@ -5045,6 +5119,35 @@ static void udma_dbg_summary_show(struct seq_file *s,
 }
 #endif /* CONFIG_DEBUG_FS */
 
+static enum dmaengine_alignment udma_get_copy_align(struct udma_dev *ud)
+{
+	const struct udma_match_data *match_data = ud->match_data;
+	u8 tpl;
+
+	if (!match_data->enable_memcpy_support)
+		return DMAENGINE_ALIGN_8_BYTES;
+
+	/* Get the highest TPL level the device supports for memcpy */
+	if (ud->bchan_cnt) {
+		tpl = udma_get_chan_tpl_index(&ud->bchan_tpl, 0);
+	} else if (ud->tchan_cnt) {
+		tpl = udma_get_chan_tpl_index(&ud->tchan_tpl, 0);
+	} else {
+		return DMAENGINE_ALIGN_8_BYTES;
+	}
+
+	switch (match_data->burst_size[tpl]) {
+		case TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_256_BYTES:
+			return DMAENGINE_ALIGN_256_BYTES;
+		case TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_128_BYTES:
+			return DMAENGINE_ALIGN_128_BYTES;
+		case TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_64_BYTES:
+			fallthrough;
+		default:
+			return DMAENGINE_ALIGN_64_BYTES;
+	}
+}
+
 #define TI_UDMAC_BUSWIDTHS	(BIT(DMA_SLAVE_BUSWIDTH_1_BYTE) | \
 				 BIT(DMA_SLAVE_BUSWIDTH_2_BYTES) | \
 				 BIT(DMA_SLAVE_BUSWIDTH_3_BYTES) | \
@@ -5201,7 +5304,6 @@ static int udma_probe(struct platform_device *pdev)
 	ud->ddev.dst_addr_widths = TI_UDMAC_BUSWIDTHS;
 	ud->ddev.directions = BIT(DMA_DEV_TO_MEM) | BIT(DMA_MEM_TO_DEV);
 	ud->ddev.residue_granularity = DMA_RESIDUE_GRANULARITY_BURST;
-	ud->ddev.copy_align = DMAENGINE_ALIGN_8_BYTES;
 	ud->ddev.desc_metadata_modes = DESC_METADATA_CLIENT |
 				       DESC_METADATA_ENGINE;
 	if (ud->match_data->enable_memcpy_support &&
@@ -5283,6 +5385,9 @@ static int udma_probe(struct platform_device *pdev)
 		INIT_DELAYED_WORK(&uc->tx_drain.work, udma_check_tx_completion);
 	}
 
+	/* Configure the copy_align to the maximum burst size the device supports */
+	ud->ddev.copy_align = udma_get_copy_align(ud);
+
 	ret = dma_async_device_register(&ud->ddev);
 	if (ret) {
 		dev_err(dev, "failed to register slave DMA engine: %d\n", ret);
-- 
Peter
* Re: [PATCH 2/2] dmaengine: ti: k3-udma: Add support for burst_size configuration for mem2mem
From: Vinod Koul @ 2021-01-12 10:16 UTC

On 14-12-20, 10:13, Peter Ujfalusi wrote:
> The UDMA and BCDMA can provide higher throughput if the burst_size of the
> channel is changed from it's default (which is 64 bytes) for Ultra-high
> and high capacity channels.
>
> This performance benefit is even more visible when the buffers are aligned
> with the burst_size configuration.
>
> The am654 does not have a way to change the burst size, but it is using
> 64 bytes burst, so increasing the copy_align from 8 bytes to 64 (and
> clients taking that into account) can increase the throughput as well.
>
> Numbers gathered on j721e:
> echo 8000000 > /sys/module/dmatest/parameters/test_buf_size
> echo 2000 > /sys/module/dmatest/parameters/timeout
> echo 50 > /sys/module/dmatest/parameters/iterations
> echo 1 > /sys/module/dmatest/parameters/max_channels
>
> Prior this patch: ~1.3 GB/s
> After this patch: ~1.8 GB/s
> with 1 byte alignment: ~1.7 GB/s
>
> Signed-off-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
> ---
>  drivers/dma/ti/k3-udma.c | 115 +++++++++++++++++++++++++++++++++++++--
>  1 file changed, 110 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/dma/ti/k3-udma.c b/drivers/dma/ti/k3-udma.c
> index 87157cbae1b8..54e4ccb1b37e 100644
> --- a/drivers/dma/ti/k3-udma.c
> +++ b/drivers/dma/ti/k3-udma.c
> @@ -121,6 +121,11 @@ struct udma_oes_offsets {
>  #define UDMA_FLAG_PDMA_ACC32	BIT(0)
>  #define UDMA_FLAG_PDMA_BURST	BIT(1)
>  #define UDMA_FLAG_TDTYPE	BIT(2)
> +#define UDMA_FLAG_BURST_SIZE	BIT(3)
> +#define UDMA_FLAGS_J7_CLASS	(UDMA_FLAG_PDMA_ACC32 | \
> +				 UDMA_FLAG_PDMA_BURST | \
> +				 UDMA_FLAG_TDTYPE | \
> +				 UDMA_FLAG_BURST_SIZE)
>
>  struct udma_match_data {
>  	enum k3_dma_type type;
> @@ -128,6 +133,7 @@ struct udma_match_data {
>  	bool enable_memcpy_support;
>  	u32 flags;
>  	u32 statictr_z_mask;
> +	u8 burst_size[3];
>  };
>
>  struct udma_soc_data {
> @@ -436,6 +442,18 @@ static void k3_configure_chan_coherency(struct dma_chan *chan, u32 asel)
>  	}
>  }
>
> +static u8 udma_get_chan_tpl_index(struct udma_tpl *tpl_map, int chan_id)
> +{
> +	int i;
> +
> +	for (i = 0; i < tpl_map->levels; i++) {
> +		if (chan_id >= tpl_map->start_idx[i])
> +			return i;
> +	}

Braces seem not required

> +
> +	return 0;
> +}
> +
>  static void udma_reset_uchan(struct udma_chan *uc)
>  {
>  	memset(&uc->config, 0, sizeof(uc->config));
> @@ -1811,6 +1829,7 @@ static int udma_tisci_m2m_channel_config(struct udma_chan *uc)
>  	const struct ti_sci_rm_udmap_ops *tisci_ops = tisci_rm->tisci_udmap_ops;
>  	struct udma_tchan *tchan = uc->tchan;
>  	struct udma_rchan *rchan = uc->rchan;
> +	u8 burst_size = 0;
>  	int ret = 0;
>
>  	/* Non synchronized - mem to mem type of transfer */
> @@ -1818,6 +1837,12 @@ static int udma_tisci_m2m_channel_config(struct udma_chan *uc)
>  	struct ti_sci_msg_rm_udmap_tx_ch_cfg req_tx = { 0 };
>  	struct ti_sci_msg_rm_udmap_rx_ch_cfg req_rx = { 0 };
>
> +	if (ud->match_data->flags & UDMA_FLAG_BURST_SIZE) {
> +		u8 tpl = udma_get_chan_tpl_index(&ud->tchan_tpl, tchan->id);

Can we define variable at function start please

> +
> +		burst_size = ud->match_data->burst_size[tpl];
> +	}
> +
>  	req_tx.valid_params = TISCI_UDMA_TCHAN_VALID_PARAMS;
>  	req_tx.nav_id = tisci_rm->tisci_dev_id;
>  	req_tx.index = tchan->id;
> @@ -1825,6 +1850,10 @@ static int udma_tisci_m2m_channel_config(struct udma_chan *uc)
>  	req_tx.tx_fetch_size = sizeof(struct cppi5_desc_hdr_t) >> 2;
>  	req_tx.txcq_qnum = tc_ring;
>  	req_tx.tx_atype = ud->atype;
> +	if (burst_size) {
> +		req_tx.valid_params |= TI_SCI_MSG_VALUE_RM_UDMAP_CH_BURST_SIZE_VALID;
> +		req_tx.tx_burst_size = burst_size;
> +	}
>
>  	ret = tisci_ops->tx_ch_cfg(tisci_rm->tisci, &req_tx);
>  	if (ret) {
> @@ -1839,6 +1868,10 @@ static int udma_tisci_m2m_channel_config(struct udma_chan *uc)
>  	req_rx.rxcq_qnum = tc_ring;
>  	req_rx.rx_chan_type = TI_SCI_RM_UDMAP_CHAN_TYPE_3RDP_BCOPY_PBRR;
>  	req_rx.rx_atype = ud->atype;
> +	if (burst_size) {
> +		req_rx.valid_params |= TI_SCI_MSG_VALUE_RM_UDMAP_CH_BURST_SIZE_VALID;
> +		req_rx.rx_burst_size = burst_size;
> +	}
>
>  	ret = tisci_ops->rx_ch_cfg(tisci_rm->tisci, &req_rx);
>  	if (ret)
> @@ -1854,12 +1887,23 @@ static int bcdma_tisci_m2m_channel_config(struct udma_chan *uc)
>  	const struct ti_sci_rm_udmap_ops *tisci_ops = tisci_rm->tisci_udmap_ops;
>  	struct ti_sci_msg_rm_udmap_tx_ch_cfg req_tx = { 0 };
>  	struct udma_bchan *bchan = uc->bchan;
> +	u8 burst_size = 0;
>  	int ret = 0;
>
> +	if (ud->match_data->flags & UDMA_FLAG_BURST_SIZE) {
> +		u8 tpl = udma_get_chan_tpl_index(&ud->bchan_tpl, bchan->id);

here as well

> +
> +		burst_size = ud->match_data->burst_size[tpl];
> +	}
> +
>  	req_tx.valid_params = TISCI_BCDMA_BCHAN_VALID_PARAMS;
>  	req_tx.nav_id = tisci_rm->tisci_dev_id;
>  	req_tx.extended_ch_type = TI_SCI_RM_BCDMA_EXTENDED_CH_TYPE_BCHAN;
>  	req_tx.index = bchan->id;
> +	if (burst_size) {
> +		req_tx.valid_params |= TI_SCI_MSG_VALUE_RM_UDMAP_CH_BURST_SIZE_VALID;
> +		req_tx.tx_burst_size = burst_size;
> +	}
>
>  	ret = tisci_ops->tx_ch_cfg(tisci_rm->tisci, &req_tx);
>  	if (ret)
> @@ -4167,6 +4211,11 @@ static struct udma_match_data am654_main_data = {
>  	.psil_base = 0x1000,
>  	.enable_memcpy_support = true,
>  	.statictr_z_mask = GENMASK(11, 0),
> +	.burst_size = {
> +		TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_64_BYTES, /* Normal Channels */
> +		TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_64_BYTES, /* H Channels */
> +		0, /* No UH Channels */
> +	},
>  };
>
>  static struct udma_match_data am654_mcu_data = {
> @@ -4174,38 +4223,63 @@ static struct udma_match_data am654_mcu_data = {
>  	.psil_base = 0x6000,
>  	.enable_memcpy_support = false,
>  	.statictr_z_mask = GENMASK(11, 0),
> +	.burst_size = {
> +		TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_64_BYTES, /* Normal Channels */
> +		TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_64_BYTES, /* H Channels */
> +		0, /* No UH Channels */
> +	},
>  };
>
>  static struct udma_match_data j721e_main_data = {
>  	.type = DMA_TYPE_UDMA,
>  	.psil_base = 0x1000,
>  	.enable_memcpy_support = true,
> -	.flags = UDMA_FLAG_PDMA_ACC32 | UDMA_FLAG_PDMA_BURST | UDMA_FLAG_TDTYPE,
> +	.flags = UDMA_FLAGS_J7_CLASS,
>  	.statictr_z_mask = GENMASK(23, 0),
> +	.burst_size = {
> +		TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_64_BYTES, /* Normal Channels */
> +		TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_256_BYTES, /* H Channels */
> +		TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_256_BYTES, /* UH Channels */
> +	},
>  };
>
>  static struct udma_match_data j721e_mcu_data = {
>  	.type = DMA_TYPE_UDMA,
>  	.psil_base = 0x6000,
>  	.enable_memcpy_support = false, /* MEM_TO_MEM is slow via MCU UDMA */
> -	.flags = UDMA_FLAG_PDMA_ACC32 | UDMA_FLAG_PDMA_BURST | UDMA_FLAG_TDTYPE,
> +	.flags = UDMA_FLAGS_J7_CLASS,
>  	.statictr_z_mask = GENMASK(23, 0),
> +	.burst_size = {
> +		TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_64_BYTES, /* Normal Channels */
> +		TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_128_BYTES, /* H Channels */
> +		0, /* No UH Channels */
> +	},
>  };
>
>  static struct udma_match_data am64_bcdma_data = {
>  	.type = DMA_TYPE_BCDMA,
>  	.psil_base = 0x2000, /* for tchan and rchan, not applicable to bchan */
>  	.enable_memcpy_support = true, /* Supported via bchan */
> -	.flags = UDMA_FLAG_PDMA_ACC32 | UDMA_FLAG_PDMA_BURST | UDMA_FLAG_TDTYPE,
> +	.flags = UDMA_FLAGS_J7_CLASS,
>  	.statictr_z_mask = GENMASK(23, 0),
> +	.burst_size = {
> +		TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_64_BYTES, /* Normal Channels */
> +		0, /* No H Channels */
> +		0, /* No UH Channels */
> +	},
>  };
>
>  static struct udma_match_data am64_pktdma_data = {
>  	.type = DMA_TYPE_PKTDMA,
>  	.psil_base = 0x1000,
>  	.enable_memcpy_support = false, /* PKTDMA does not support MEM_TO_MEM */
> -	.flags = UDMA_FLAG_PDMA_ACC32 | UDMA_FLAG_PDMA_BURST | UDMA_FLAG_TDTYPE,
> +	.flags = UDMA_FLAGS_J7_CLASS,
>  	.statictr_z_mask = GENMASK(23, 0),
> +	.burst_size = {
> +		TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_64_BYTES, /* Normal Channels */
> +		0, /* No H Channels */
> +		0, /* No UH Channels */
> +	},
>  };
>
>  static const struct of_device_id udma_of_match[] = {
> @@ -5045,6 +5119,35 @@ static void udma_dbg_summary_show(struct seq_file *s,
>  }
>  #endif /* CONFIG_DEBUG_FS */
>
> +static enum dmaengine_alignment udma_get_copy_align(struct udma_dev *ud)
> +{
> +	const struct udma_match_data *match_data = ud->match_data;
> +	u8 tpl;
> +
> +	if (!match_data->enable_memcpy_support)
> +		return DMAENGINE_ALIGN_8_BYTES;
> +
> +	/* Get the highest TPL level the device supports for memcpy */
> +	if (ud->bchan_cnt) {
> +		tpl = udma_get_chan_tpl_index(&ud->bchan_tpl, 0);
> +	} else if (ud->tchan_cnt) {
> +		tpl = udma_get_chan_tpl_index(&ud->tchan_tpl, 0);
> +	} else {
> +		return DMAENGINE_ALIGN_8_BYTES;
> +	}

Braces seem not required

> +
> +	switch (match_data->burst_size[tpl]) {
> +		case TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_256_BYTES:
> +			return DMAENGINE_ALIGN_256_BYTES;
> +		case TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_128_BYTES:
> +			return DMAENGINE_ALIGN_128_BYTES;
> +		case TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_64_BYTES:
> +			fallthrough;
> +		default:
> +			return DMAENGINE_ALIGN_64_BYTES;

ah, we are supposed to have case at same indent as switch, pls run
checkpatch to have these flagged off

-- 
~Vinod
* Re: [PATCH 2/2] dmaengine: ti: k3-udma: Add support for burst_size configuration for mem2mem
From: Péter Ujfalusi @ 2021-01-13 7:39 UTC

Hi Vinod,

On 1/12/21 12:16 PM, Vinod Koul wrote:
> On 14-12-20, 10:13, Peter Ujfalusi wrote:
>> The UDMA and BCDMA can provide higher throughput if the burst_size of the
>> channel is changed from it's default (which is 64 bytes) for Ultra-high
>> and high capacity channels.
>>
>> This performance benefit is even more visible when the buffers are aligned
>> with the burst_size configuration.
>>
>> The am654 does not have a way to change the burst size, but it is using
>> 64 bytes burst, so increasing the copy_align from 8 bytes to 64 (and
>> clients taking that into account) can increase the throughput as well.
>>
>> Numbers gathered on j721e:
>> echo 8000000 > /sys/module/dmatest/parameters/test_buf_size
>> echo 2000 > /sys/module/dmatest/parameters/timeout
>> echo 50 > /sys/module/dmatest/parameters/iterations
>> echo 1 > /sys/module/dmatest/parameters/max_channels
>>
>> Prior this patch: ~1.3 GB/s
>> After this patch: ~1.8 GB/s
>> with 1 byte alignment: ~1.7 GB/s
>>
>> Signed-off-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
>> ---
>>  drivers/dma/ti/k3-udma.c | 115 +++++++++++++++++++++++++++++++++++++--
>>  1 file changed, 110 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/dma/ti/k3-udma.c b/drivers/dma/ti/k3-udma.c
>> index 87157cbae1b8..54e4ccb1b37e 100644
>> --- a/drivers/dma/ti/k3-udma.c
>> +++ b/drivers/dma/ti/k3-udma.c
>> @@ -121,6 +121,11 @@ struct udma_oes_offsets {
>>  #define UDMA_FLAG_PDMA_ACC32	BIT(0)
>>  #define UDMA_FLAG_PDMA_BURST	BIT(1)
>>  #define UDMA_FLAG_TDTYPE	BIT(2)
>> +#define UDMA_FLAG_BURST_SIZE	BIT(3)
>> +#define UDMA_FLAGS_J7_CLASS	(UDMA_FLAG_PDMA_ACC32 | \
>> +				 UDMA_FLAG_PDMA_BURST | \
>> +				 UDMA_FLAG_TDTYPE | \
>> +				 UDMA_FLAG_BURST_SIZE)
>>
>>  struct udma_match_data {
>>  	enum k3_dma_type type;
>> @@ -128,6 +133,7 @@ struct udma_match_data {
>>  	bool enable_memcpy_support;
>>  	u32 flags;
>>  	u32 statictr_z_mask;
>> +	u8 burst_size[3];
>>  };
>>
>>  struct udma_soc_data {
>> @@ -436,6 +442,18 @@ static void k3_configure_chan_coherency(struct dma_chan *chan, u32 asel)
>>  	}
>>  }
>>
>> +static u8 udma_get_chan_tpl_index(struct udma_tpl *tpl_map, int chan_id)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < tpl_map->levels; i++) {
>> +		if (chan_id >= tpl_map->start_idx[i])
>> +			return i;
>> +	}
>
> Braces seem not required

True, they are not strictly needed but I prefer to have them when I have
any condition in the loop.

>> +
>> +	return 0;
>> +}
>> +
>>  static void udma_reset_uchan(struct udma_chan *uc)
>>  {
>>  	memset(&uc->config, 0, sizeof(uc->config));
>> @@ -1811,6 +1829,7 @@ static int udma_tisci_m2m_channel_config(struct udma_chan *uc)
>>  	const struct ti_sci_rm_udmap_ops *tisci_ops = tisci_rm->tisci_udmap_ops;
>>  	struct udma_tchan *tchan = uc->tchan;
>>  	struct udma_rchan *rchan = uc->rchan;
>> +	u8 burst_size = 0;
>>  	int ret = 0;
>>
>>  	/* Non synchronized - mem to mem type of transfer */
>> @@ -1818,6 +1837,12 @@ static int udma_tisci_m2m_channel_config(struct udma_chan *uc)
>>  	struct ti_sci_msg_rm_udmap_tx_ch_cfg req_tx = { 0 };
>>  	struct ti_sci_msg_rm_udmap_rx_ch_cfg req_rx = { 0 };
>>
>> +	if (ud->match_data->flags & UDMA_FLAG_BURST_SIZE) {
>> +		u8 tpl = udma_get_chan_tpl_index(&ud->tchan_tpl, tchan->id);
>
> Can we define variable at function start please

The 'tpl' is only used within this if branch, it looks a bit cleaner
imho, but if you insist, I can move the definition.

...

>> +static enum dmaengine_alignment udma_get_copy_align(struct udma_dev *ud)
>> +{
>> +	const struct udma_match_data *match_data = ud->match_data;
>> +	u8 tpl;
>> +
>> +	if (!match_data->enable_memcpy_support)
>> +		return DMAENGINE_ALIGN_8_BYTES;
>> +
>> +	/* Get the highest TPL level the device supports for memcpy */
>> +	if (ud->bchan_cnt) {
>> +		tpl = udma_get_chan_tpl_index(&ud->bchan_tpl, 0);
>> +	} else if (ud->tchan_cnt) {
>> +		tpl = udma_get_chan_tpl_index(&ud->tchan_tpl, 0);
>> +	} else {
>> +		return DMAENGINE_ALIGN_8_BYTES;
>> +	}
>
> Braces seem not required

Very true.

>
>> +
>> +	switch (match_data->burst_size[tpl]) {
>> +		case TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_256_BYTES:
>> +			return DMAENGINE_ALIGN_256_BYTES;
>> +		case TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_128_BYTES:
>> +			return DMAENGINE_ALIGN_128_BYTES;
>> +		case TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_64_BYTES:
>> +			fallthrough;
>> +		default:
>> +			return DMAENGINE_ALIGN_64_BYTES;
>
> ah, we are supposed to have case at same indent as switch, pls run
> checkpatch to have these flagged off

Yes, they should be.

The other me did a sloppy job for sure, this should have been screaming
even without checkpatch...
This has been done in a rush during the last days to close on the
backlog item which got the most votes.

-- 
Péter
* Re: [PATCH 2/2] dmaengine: ti: k3-udma: Add support for burst_size configuration for mem2mem
From: Vinod Koul @ 2021-01-13 10:43 UTC

On 13-01-21, 09:39, Péter Ujfalusi wrote:
> Hi Vinod,
>
> On 1/12/21 12:16 PM, Vinod Koul wrote:
> > On 14-12-20, 10:13, Peter Ujfalusi wrote:
> >> The UDMA and BCDMA can provide higher throughput if the burst_size of the
> >> channel is changed from it's default (which is 64 bytes) for Ultra-high
> >> and high capacity channels.
> >>
> >> This performance benefit is even more visible when the buffers are aligned
> >> with the burst_size configuration.
> >>
> >> The am654 does not have a way to change the burst size, but it is using
> >> 64 bytes burst, so increasing the copy_align from 8 bytes to 64 (and
> >> clients taking that into account) can increase the throughput as well.
> >>
> >> Numbers gathered on j721e:
> >> echo 8000000 > /sys/module/dmatest/parameters/test_buf_size
> >> echo 2000 > /sys/module/dmatest/parameters/timeout
> >> echo 50 > /sys/module/dmatest/parameters/iterations
> >> echo 1 > /sys/module/dmatest/parameters/max_channels
> >>
> >> Prior this patch: ~1.3 GB/s
> >> After this patch: ~1.8 GB/s
> >> with 1 byte alignment: ~1.7 GB/s
> >>
> >> Signed-off-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
> >> ---
> >>  drivers/dma/ti/k3-udma.c | 115 +++++++++++++++++++++++++++++++++++++--
> >>  1 file changed, 110 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/drivers/dma/ti/k3-udma.c b/drivers/dma/ti/k3-udma.c
> >> index 87157cbae1b8..54e4ccb1b37e 100644
> >> --- a/drivers/dma/ti/k3-udma.c
> >> +++ b/drivers/dma/ti/k3-udma.c
> >> @@ -121,6 +121,11 @@ struct udma_oes_offsets {
> >>  #define UDMA_FLAG_PDMA_ACC32	BIT(0)
> >>  #define UDMA_FLAG_PDMA_BURST	BIT(1)
> >>  #define UDMA_FLAG_TDTYPE	BIT(2)
> >> +#define UDMA_FLAG_BURST_SIZE	BIT(3)
> >> +#define UDMA_FLAGS_J7_CLASS	(UDMA_FLAG_PDMA_ACC32 | \
> >> +				 UDMA_FLAG_PDMA_BURST | \
> >> +				 UDMA_FLAG_TDTYPE | \
> >> +				 UDMA_FLAG_BURST_SIZE)
> >>
> >>  struct udma_match_data {
> >>  	enum k3_dma_type type;
> >> @@ -128,6 +133,7 @@ struct udma_match_data {
> >>  	bool enable_memcpy_support;
> >>  	u32 flags;
> >>  	u32 statictr_z_mask;
> >> +	u8 burst_size[3];
> >>  };
> >>
> >>  struct udma_soc_data {
> >> @@ -436,6 +442,18 @@ static void k3_configure_chan_coherency(struct dma_chan *chan, u32 asel)
> >>  	}
> >>  }
> >>
> >> +static u8 udma_get_chan_tpl_index(struct udma_tpl *tpl_map, int chan_id)
> >> +{
> >> +	int i;
> >> +
> >> +	for (i = 0; i < tpl_map->levels; i++) {
> >> +		if (chan_id >= tpl_map->start_idx[i])
> >> +			return i;
> >> +	}
> >
> > Braces seem not required
>
> True, they are not strictly needed but I prefer to have them when I have
> any condition in the loop.

ok

> >>  static void udma_reset_uchan(struct udma_chan *uc)
> >>  {
> >>  	memset(&uc->config, 0, sizeof(uc->config));
> >> @@ -1811,6 +1829,7 @@ static int udma_tisci_m2m_channel_config(struct udma_chan *uc)
> >>  	const struct ti_sci_rm_udmap_ops *tisci_ops = tisci_rm->tisci_udmap_ops;
> >>  	struct udma_tchan *tchan = uc->tchan;
> >>  	struct udma_rchan *rchan = uc->rchan;
> >> +	u8 burst_size = 0;
> >>  	int ret = 0;
> >>
> >>  	/* Non synchronized - mem to mem type of transfer */
> >> @@ -1818,6 +1837,12 @@ static int udma_tisci_m2m_channel_config(struct udma_chan *uc)
> >>  	struct ti_sci_msg_rm_udmap_tx_ch_cfg req_tx = { 0 };
> >>  	struct ti_sci_msg_rm_udmap_rx_ch_cfg req_rx = { 0 };
> >>
> >> +	if (ud->match_data->flags & UDMA_FLAG_BURST_SIZE) {
> >> +		u8 tpl = udma_get_chan_tpl_index(&ud->tchan_tpl, tchan->id);
> >
> > Can we define variable at function start please
>
> The 'tpl' is only used within this if branch, it looks a bit cleaner
> imho, but if you insist, I can move the definition.

yeah lets be consistent and keep them at the start of the function
please

> >> +	switch (match_data->burst_size[tpl]) {
> >> +		case TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_256_BYTES:
> >> +			return DMAENGINE_ALIGN_256_BYTES;
> >> +		case TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_128_BYTES:
> >> +			return DMAENGINE_ALIGN_128_BYTES;
> >> +		case TI_SCI_RM_UDMAP_CHAN_BURST_SIZE_64_BYTES:
> >> +			fallthrough;
> >> +		default:
> >> +			return DMAENGINE_ALIGN_64_BYTES;
> >
> > ah, we are supposed to have case at same indent as switch, pls run
> > checkpatch to have these flagged off
>
> Yes, they should be.
>
> The other me did a sloppy job for sure, this should have been screaming
> even without checkpatch...
> This has been done in a rush during the last days to close on the
> backlog item which got the most votes.

no worries, that is where reviews help :)

-- 
~Vinod
* Re: [PATCH 0/2] dmaengine: ti: k3-udma: memcpy throughput improvement
From: Kishon Vijay Abraham I @ 2021-01-12 3:37 UTC

Hi,

On 14/12/20 1:43 pm, Peter Ujfalusi wrote:
> Hi,
>
> Newer members of the KS3 family (after AM654) have support for burst_size
> configuration for each DMA channel.
>
> The HW default value is 64 bytes but on higher throughput channels it can be
> increased to 256 bytes (UCHANs) or 128 byes (HCHANs).
>
> Aligning the buffers and length of the transfer to the burst size also increases
> the throughput.
>
> Numbers gathered on j721e (UCHAN pair):
> echo 8000000 > /sys/module/dmatest/parameters/test_buf_size
> echo 2000 > /sys/module/dmatest/parameters/timeout
> echo 50 > /sys/module/dmatest/parameters/iterations
> echo 1 > /sys/module/dmatest/parameters/max_channels
>
> Prior to this patch: ~1.3 GB/s
> After this patch: ~1.8 GB/s
> with 1 byte alignment: ~1.7 GB/s
>
> The patches are on top of the AM64 support series:
> https://lore.kernel.org/lkml/20201208090440.31792-1-peter.ujfalusi@ti.com/

FWIW, tested this series with PCIe RC<->EP (using pcitest utility)

Without this series
READ  => Size: 67108864 bytes  DMA: YES  Time: 0.137854270 seconds  Rate: 475400 KB/s
WRITE => Size: 67108864 bytes  DMA: YES  Time: 0.049701495 seconds  Rate: 1318592 KB/s

With this series
READ  => Size: 67108864 bytes  DMA: YES  Time: 0.045611175 seconds  Rate: 1436840 KB/s
WRITE => Size: 67108864 bytes  DMA: YES  Time: 0.042737440 seconds  Rate: 1533456 KB/s

Tested-by: Kishon Vijay Abraham I <kishon@ti.com>

Thanks
Kishon