* [PATCH net-next v2 0/2] net: mvneta: improve rx performance
From: Jisheng Zhang @ 2017-02-17 10:02 UTC
  To: thomas.petazzoni, davem, arnd
  Cc: linux-arm-kernel, netdev, linux-kernel, Jisheng Zhang

In hot code paths such as mvneta_rx_hwbm() and mvneta_rx_swbm(), we may
access fields of rx_desc. The rx_desc is allocated by
dma_alloc_coherent(), so it is uncacheable if the device isn't cache
coherent, and reading from uncached memory is fairly slow.

patch1 reuses the status value that was already read, to avoid reading
the status field of rx_desc again.

patch2 uses cacheable memory to store the rx buffer DMA address.

We get the following performance data on Marvell BG4CT Platforms
(tested with iperf):

before the patch:
recving 1GB in mvneta_rx_swbm() costs 149265960 ns

after the patch:
recving 1GB in mvneta_rx_swbm() costs 1421565640 ns

We saved 4.76% time.

RFC: can we do a similar modification for tx? If yes, I can prepare a v2.


Basically, these two patches do what Arnd mentioned in [1].

Hi Arnd,

I added a "Suggested-by" tag with your name; I hope you don't mind ;)

Thanks

[1] https://www.spinics.net/lists/netdev/msg405889.html

Since v1:
  - correct the performance data typo

Jisheng Zhang (2):
  net: mvneta: avoid getting status from rx_desc as much as possible
  net: mvneta: Use cacheable memory to store the rx buffer DMA address

 drivers/net/ethernet/marvell/mvneta.c | 36 ++++++++++++++++++++---------------
 1 file changed, 21 insertions(+), 15 deletions(-)

-- 
2.11.0
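
The 4.76% figure quoted above can be reproduced with a few lines of C. This is only a worked-arithmetic sketch: the helper name percent_time_saved() is hypothetical, and the "before" value used in the usage note is the corrected one given in the 2/2 commit message (1492659600 ns).

```c
#include <stdint.h>

/*
 * Relative time saved, in percent, when a run drops from before_ns to
 * after_ns. Hypothetical helper for illustration; before_ns must be
 * non-zero. The subtraction is done in double so that a slower "after"
 * run yields a negative result instead of wrapping around.
 */
static double percent_time_saved(uint64_t before_ns, uint64_t after_ns)
{
	return 100.0 * ((double)before_ns - (double)after_ns) /
	       (double)before_ns;
}
```

Calling percent_time_saved(1492659600, 1421565640) gives roughly 4.76, which matches the saving stated in the cover letter.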

* [PATCH net-next v2 1/2] net: mvneta: avoid getting status from rx_desc as much as possible
From: Jisheng Zhang @ 2017-02-17 10:02 UTC
  To: thomas.petazzoni, davem, arnd
  Cc: linux-arm-kernel, netdev, linux-kernel, Jisheng Zhang

In the hot code path mvneta_rx_hwbm(), rx_desc->status is read twice.
The rx_desc is allocated by dma_alloc_coherent(), so it is uncacheable
if the device isn't cache-coherent, and reading from uncached memory is
fairly slow. So reuse the already-read rx_status to avoid a second read
from uncached memory.

Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/ethernet/marvell/mvneta.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 61dd4462411c..06df72b8da85 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -313,8 +313,8 @@
 	((addr >= txq->tso_hdrs_phys) && \
 	 (addr < txq->tso_hdrs_phys + txq->size * TSO_HEADER_SIZE))
 
-#define MVNETA_RX_GET_BM_POOL_ID(rxd) \
-	(((rxd)->status & MVNETA_RXD_BM_POOL_MASK) >> MVNETA_RXD_BM_POOL_SHIFT)
+#define MVNETA_RX_GET_BM_POOL_ID(status) \
+	(((status) & MVNETA_RXD_BM_POOL_MASK) >> MVNETA_RXD_BM_POOL_SHIFT)
 
 struct mvneta_statistic {
 	unsigned short offset;
@@ -1900,7 +1900,7 @@ static void mvneta_rxq_drop_pkts(struct mvneta_port *pp,
 		for (i = 0; i < rx_done; i++) {
 			struct mvneta_rx_desc *rx_desc =
 						  mvneta_rxq_next_desc_get(rxq);
-			u8 pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_desc);
+			u8 pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_desc->status);
 			struct mvneta_bm_pool *bm_pool;
 
 			bm_pool = &pp->bm_priv->bm_pools[pool_id];
@@ -2075,7 +2075,7 @@ static int mvneta_rx_hwbm(struct mvneta_port *pp, int rx_todo,
 		rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
 		data = (u8 *)(uintptr_t)rx_desc->buf_cookie;
 		phys_addr = rx_desc->buf_phys_addr;
-		pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_desc);
+		pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_status);
 		bm_pool = &pp->bm_priv->bm_pools[pool_id];
 
 		if (!mvneta_rxq_desc_is_first_last(rx_status) ||
-- 
2.11.0
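
The idea behind this patch, reading the status word once and deriving everything else from the local copy, can be shown in a small user-space sketch. The struct, the rx_read_once() helper and the mask values below are illustrative assumptions and are not copied from mvneta.c; the volatile qualifier only stands in for memory where every access is a real (slow) load.

```c
#include <stdint.h>

/* Illustrative field layout; the real definitions live in mvneta.c. */
#define RXD_BM_POOL_SHIFT 13
#define RXD_BM_POOL_MASK  (0x3u << RXD_BM_POOL_SHIFT)

/* Hypothetical, trimmed-down descriptor in "uncached" memory. */
struct rx_desc_sketch {
	volatile uint32_t status;
};

/* Takes the already-read status value, not the descriptor pointer. */
static inline uint8_t bm_pool_id_from_status(uint32_t status)
{
	return (uint8_t)((status & RXD_BM_POOL_MASK) >> RXD_BM_POOL_SHIFT);
}

struct rx_info {
	uint32_t status;
	uint8_t pool_id;
};

static struct rx_info rx_read_once(const struct rx_desc_sketch *desc)
{
	struct rx_info info;

	info.status = desc->status;	/* the only load from the descriptor */
	info.pool_id = bm_pool_id_from_status(info.status);
	return info;
}
```

Because the macro-like helper takes a value rather than a pointer, a caller that already holds the status cannot trigger a second descriptor read by accident.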

* [PATCH net-next v2 2/2] net: mvneta: Use cacheable memory to store the rx buffer DMA address
From: Jisheng Zhang @ 2017-02-17 10:02 UTC
  To: thomas.petazzoni, davem, arnd
  Cc: linux-arm-kernel, netdev, linux-kernel, Jisheng Zhang

In hot code paths such as mvneta_rx_hwbm() and mvneta_rx_swbm(), the
buf_phys_addr field of rx_desc is accessed. The rx_desc is allocated by
dma_alloc_coherent(), so it is uncacheable if the device isn't cache
coherent, and reading from uncached memory is fairly slow. This patch uses
cacheable memory to store the rx buffer DMA address. We get the
following performance data on Marvell BG4CT Platforms (tested with
iperf):

before the patch:
recving 1GB in mvneta_rx_swbm() costs 1492659600 ns

after the patch:
recving 1GB in mvneta_rx_swbm() costs 1421565640 ns

We saved 4.76% time.

Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/ethernet/marvell/mvneta.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 06df72b8da85..e24c3028fe1d 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -580,6 +580,9 @@ struct mvneta_rx_queue {
 	/* Virtual address of the RX buffer */
 	void  **buf_virt_addr;
 
+	/* DMA address of the RX buffer */
+	dma_addr_t *buf_dma_addr;
+
 	/* Virtual address of the RX DMA descriptors array */
 	struct mvneta_rx_desc *descs;
 
@@ -1617,6 +1620,7 @@ static void mvneta_rx_desc_fill(struct mvneta_rx_desc *rx_desc,
 
 	rx_desc->buf_phys_addr = phys_addr;
 	i = rx_desc - rxq->descs;
+	rxq->buf_dma_addr[i] = phys_addr;
 	rxq->buf_virt_addr[i] = virt_addr;
 }
 
@@ -1900,22 +1904,22 @@ static void mvneta_rxq_drop_pkts(struct mvneta_port *pp,
 		for (i = 0; i < rx_done; i++) {
 			struct mvneta_rx_desc *rx_desc =
 						  mvneta_rxq_next_desc_get(rxq);
+			int index = rx_desc - rxq->descs;
 			u8 pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_desc->status);
 			struct mvneta_bm_pool *bm_pool;
 
 			bm_pool = &pp->bm_priv->bm_pools[pool_id];
 			/* Return dropped buffer to the pool */
 			mvneta_bm_pool_put_bp(pp->bm_priv, bm_pool,
-					      rx_desc->buf_phys_addr);
+					      rxq->buf_dma_addr[index]);
 		}
 		return;
 	}
 
 	for (i = 0; i < rxq->size; i++) {
-		struct mvneta_rx_desc *rx_desc = rxq->descs + i;
 		void *data = rxq->buf_virt_addr[i];
 
-		dma_unmap_single(pp->dev->dev.parent, rx_desc->buf_phys_addr,
+		dma_unmap_single(pp->dev->dev.parent, rxq->buf_dma_addr[i],
 				 MVNETA_RX_BUF_SIZE(pp->pkt_size), DMA_FROM_DEVICE);
 		mvneta_frag_free(pp->frag_size, data);
 	}
@@ -1953,7 +1957,7 @@ static int mvneta_rx_swbm(struct mvneta_port *pp, int rx_todo,
 		rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
 		index = rx_desc - rxq->descs;
 		data = rxq->buf_virt_addr[index];
-		phys_addr = rx_desc->buf_phys_addr;
+		phys_addr = rxq->buf_dma_addr[index];
 
 		if (!mvneta_rxq_desc_is_first_last(rx_status) ||
 		    (rx_status & MVNETA_RXD_ERR_SUMMARY)) {
@@ -2062,6 +2066,7 @@ static int mvneta_rx_hwbm(struct mvneta_port *pp, int rx_todo,
 	/* Fairness NAPI loop */
 	while (rx_done < rx_todo) {
 		struct mvneta_rx_desc *rx_desc = mvneta_rxq_next_desc_get(rxq);
+		int index = rx_desc - rxq->descs;
 		struct mvneta_bm_pool *bm_pool = NULL;
 		struct sk_buff *skb;
 		unsigned char *data;
@@ -2074,7 +2079,7 @@ static int mvneta_rx_hwbm(struct mvneta_port *pp, int rx_todo,
 		rx_status = rx_desc->status;
 		rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
 		data = (u8 *)(uintptr_t)rx_desc->buf_cookie;
-		phys_addr = rx_desc->buf_phys_addr;
+		phys_addr = rxq->buf_dma_addr[index];
 		pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_status);
 		bm_pool = &pp->bm_priv->bm_pools[pool_id];
 
@@ -2082,8 +2087,7 @@ static int mvneta_rx_hwbm(struct mvneta_port *pp, int rx_todo,
 		    (rx_status & MVNETA_RXD_ERR_SUMMARY)) {
 err_drop_frame_ret_pool:
 			/* Return the buffer to the pool */
-			mvneta_bm_pool_put_bp(pp->bm_priv, bm_pool,
-					      rx_desc->buf_phys_addr);
+			mvneta_bm_pool_put_bp(pp->bm_priv, bm_pool, phys_addr);
 err_drop_frame:
 			dev->stats.rx_errors++;
 			mvneta_rx_error(pp, rx_desc);
@@ -2098,7 +2102,7 @@ static int mvneta_rx_hwbm(struct mvneta_port *pp, int rx_todo,
 				goto err_drop_frame_ret_pool;
 
 			dma_sync_single_range_for_cpu(dev->dev.parent,
-			                              rx_desc->buf_phys_addr,
+			                              phys_addr,
 			                              MVNETA_MH_SIZE + NET_SKB_PAD,
 			                              rx_bytes,
 			                              DMA_FROM_DEVICE);
@@ -2114,8 +2118,7 @@ static int mvneta_rx_hwbm(struct mvneta_port *pp, int rx_todo,
 			rcvd_bytes += rx_bytes;
 
 			/* Return the buffer to the pool */
-			mvneta_bm_pool_put_bp(pp->bm_priv, bm_pool,
-					      rx_desc->buf_phys_addr);
+			mvneta_bm_pool_put_bp(pp->bm_priv, bm_pool, phys_addr);
 
 			/* leave the descriptor and buffer untouched */
 			continue;
@@ -4019,7 +4022,10 @@ static int mvneta_init(struct device *dev, struct mvneta_port *pp)
 		rxq->buf_virt_addr = devm_kmalloc(pp->dev->dev.parent,
 						  rxq->size * sizeof(void *),
 						  GFP_KERNEL);
-		if (!rxq->buf_virt_addr)
+		rxq->buf_dma_addr = devm_kmalloc(pp->dev->dev.parent,
+						 rxq->size * sizeof(dma_addr_t),
+						 GFP_KERNEL);
+		if (!rxq->buf_virt_addr || !rxq->buf_dma_addr)
 			return -ENOMEM;
 	}
 
-- 
2.11.0
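
The shadow-array approach in this patch can be sketched in plain user-space C. Everything here is a simplified assumption for illustration: the *_sketch types, rxq_init(), rxq_desc_fill(), rxq_buf_addr() and rxq_free() are hypothetical names, calloc() stands in for both the coherent ring allocation and devm_kmalloc(), and the sketch assumes that software writes every address it later reads back.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t dma_addr_sketch_t;	/* stand-in for dma_addr_t */

/* Hypothetical, trimmed-down descriptor. */
struct rx_desc_sketch {
	uint32_t status;
	uint32_t buf_phys_addr;
};

struct rx_queue_sketch {
	size_t size;
	struct rx_desc_sketch *descs;		/* stands in for the coherent ring */
	dma_addr_sketch_t *buf_dma_addr;	/* shadow copy in ordinary memory */
};

static int rxq_init(struct rx_queue_sketch *rxq, size_t size)
{
	rxq->size = size;
	rxq->descs = calloc(size, sizeof(*rxq->descs));
	rxq->buf_dma_addr = calloc(size, sizeof(*rxq->buf_dma_addr));
	if (!rxq->descs || !rxq->buf_dma_addr) {
		free(rxq->descs);
		free(rxq->buf_dma_addr);
		rxq->descs = NULL;
		rxq->buf_dma_addr = NULL;
		return -1;
	}
	return 0;
}

/* Write the address to the descriptor and mirror it in the shadow array. */
static void rxq_desc_fill(struct rx_queue_sketch *rxq,
			  struct rx_desc_sketch *desc, dma_addr_sketch_t addr)
{
	size_t i = (size_t)(desc - rxq->descs);

	desc->buf_phys_addr = (uint32_t)addr;
	rxq->buf_dma_addr[i] = addr;
}

/* Hot path: look the address up without touching the descriptor. */
static dma_addr_sketch_t rxq_buf_addr(const struct rx_queue_sketch *rxq,
				      const struct rx_desc_sketch *desc)
{
	return rxq->buf_dma_addr[desc - rxq->descs];
}

static void rxq_free(struct rx_queue_sketch *rxq)
{
	free(rxq->descs);
	free(rxq->buf_dma_addr);
}
```

The index is recovered from the descriptor pointer, as in the patch, so no extra per-descriptor state is needed. The shadow entry is only as good as the last rxq_desc_fill() call: if anything other than this code path were to update buf_phys_addr, the two copies would diverge.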

* Re: [PATCH net-next v2 0/2] net: mvneta: improve rx performance
From: Jisheng Zhang @ 2017-02-17 10:09 UTC
  To: thomas.petazzoni, davem, arnd; +Cc: linux-arm-kernel, netdev, linux-kernel

On Fri, 17 Feb 2017 18:02:31 +0800
Jisheng Zhang <jszhang@marvell.com> wrote:

> In hot code path such as mvneta_rx_hwbm() and mvneta_rx_swbm(), we may
> access fields of rx_desc. The rx_desc is allocated by
> dma_alloc_coherent, it's uncacheable if the device isn't cache
> coherent, reading from uncached memory is fairly slow.
> 
> patch1 reuses the read out status to getting status field of rx_desc
> again.
> 
> patch2 uses cacheable memory to store the rx buffer DMA address.
> 
> We get the following performance data on Marvell BG4CT Platforms
> (tested with iperf):
> 
> before the patch:
> recving 1GB in mvneta_rx_swbm() costs 149265960 ns

Oops, I still didn't correct the typo here; it should be 1492659600 ns.

Sorry about that. I expect there will be review comments, so I'll fix this
typo in v3 when I address them.

> 
> after the patch:
> recving 1GB in mvneta_rx_swbm() costs 1421565640 ns
> 
> We saved 4.76% time.
> 
> RFC: can we do similar modification for tx? If yes, I can prepare a v2.
> 
> 
> Basically, these two patches do what Arnd mentioned in [1].
> 
> Hi Arnd,
> 
> I added "Suggested-by you" tag, I hope you don't mind ;)
> 
> Thanks
> 
> [1] https://www.spinics.net/lists/netdev/msg405889.html
> 
> Since v1:
>   - correct the performance data typo
> 
> Jisheng Zhang (2):
>   net: mvneta: avoid getting status from rx_desc as much as possible
>   net: mvneta: Use cacheable memory to store the rx buffer DMA address
> 
>  drivers/net/ethernet/marvell/mvneta.c | 36 ++++++++++++++++++++---------------
>  1 file changed, 21 insertions(+), 15 deletions(-)
> 

* Re: [PATCH net-next v2 0/2] net: mvneta: improve rx performance
  2017-02-17 10:02 ` Jisheng Zhang
@ 2017-02-17 10:37   ` Gregory CLEMENT
  -1 siblings, 0 replies; 24+ messages in thread
From: Gregory CLEMENT @ 2017-02-17 10:37 UTC (permalink / raw)
  To: Jisheng Zhang
  Cc: thomas.petazzoni, davem, arnd, netdev, linux-kernel, linux-arm-kernel

Hi Jisheng,
 
On Fri, Feb 17 2017, Jisheng Zhang <jszhang@marvell.com> wrote:

> In hot code path such as mvneta_rx_hwbm() and mvneta_rx_swbm(), we may
> access fields of rx_desc. The rx_desc is allocated by
> dma_alloc_coherent, it's uncacheable if the device isn't cache
> coherent, reading from uncached memory is fairly slow.

Did you test it with HWBM support?

I am not sure it will work in this case.

Gregory

>
> patch1 reuses the read out status to getting status field of rx_desc
> again.
>
> patch2 uses cacheable memory to store the rx buffer DMA address.
>
> We get the following performance data on Marvell BG4CT Platforms
> (tested with iperf):
>
> before the patch:
> recving 1GB in mvneta_rx_swbm() costs 149265960 ns
>
> after the patch:
> recving 1GB in mvneta_rx_swbm() costs 1421565640 ns
>
> We saved 4.76% time.
>
> RFC: can we do similar modification for tx? If yes, I can prepare a v2.
>
>
> Basically, these two patches do what Arnd mentioned in [1].
>
> Hi Arnd,
>
> I added "Suggested-by you" tag, I hope you don't mind ;)
>
> Thanks
>
> [1] https://www.spinics.net/lists/netdev/msg405889.html
>
> Since v1:
>   - correct the performance data typo
>
> Jisheng Zhang (2):
>   net: mvneta: avoid getting status from rx_desc as much as possible
>   net: mvneta: Use cacheable memory to store the rx buffer DMA address
>
>  drivers/net/ethernet/marvell/mvneta.c | 36 ++++++++++++++++++++---------------
>  1 file changed, 21 insertions(+), 15 deletions(-)
>
> -- 
> 2.11.0
>
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

-- 
Gregory Clement, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next v2 0/2] net: mvneta: improve rx performance
  2017-02-17 10:37   ` Gregory CLEMENT
  (?)
@ 2017-02-17 10:44     ` Jisheng Zhang
  -1 siblings, 0 replies; 24+ messages in thread
From: Jisheng Zhang @ 2017-02-17 10:44 UTC (permalink / raw)
  To: Gregory CLEMENT
  Cc: thomas.petazzoni, davem, arnd, netdev, linux-kernel, linux-arm-kernel

On Fri, 17 Feb 2017 11:37:21 +0100 Gregory CLEMENT wrote:

> Hi Jisheng,
>  
>  On ven., févr. 17 2017, Jisheng Zhang <jszhang@marvell.com> wrote:
> 
> > In hot code path such as mvneta_rx_hwbm() and mvneta_rx_swbm(), we may
> > access fields of rx_desc. The rx_desc is allocated by
> > dma_alloc_coherent, it's uncacheable if the device isn't cache
> > coherent, reading from uncached memory is fairly slow.  
> 
> Did you test it with HWBM support?

No, I didn't test it because I lack such HW, so I would appreciate it if
someone could test with HWBM-capable HW.

> 
> I am not sure it will work in this case.

IMHO, as long as the mvneta HW doesn't update rx_desc->buf_phys_addr, it
should still work. I don't have an HWBM background, so the above may be
wrong. If this doesn't work for HWBM, I'll submit a v3 that modifies
mvneta_rx_swbm() only.

Thanks,
Jisheng

> 
> Gregory
> 
> >
> > patch1 reuses the read out status to getting status field of rx_desc
> > again.
> >
> > patch2 uses cacheable memory to store the rx buffer DMA address.
> >
> > We get the following performance data on Marvell BG4CT Platforms
> > (tested with iperf):
> >
> > before the patch:
> > recving 1GB in mvneta_rx_swbm() costs 149265960 ns
> >
> > after the patch:
> > recving 1GB in mvneta_rx_swbm() costs 1421565640 ns
> >
> > We saved 4.76% time.
> >
> > RFC: can we do similar modification for tx? If yes, I can prepare a v2.
> >
> >
> > Basically, these two patches do what Arnd mentioned in [1].
> >
> > Hi Arnd,
> >
> > I added "Suggested-by you" tag, I hope you don't mind ;)
> >
> > Thanks
> >
> > [1] https://www.spinics.net/lists/netdev/msg405889.html
> >
> > Since v1:
> >   - correct the performance data typo
> >
> > Jisheng Zhang (2):
> >   net: mvneta: avoid getting status from rx_desc as much as possible
> >   net: mvneta: Use cacheable memory to store the rx buffer DMA address
> >
> >  drivers/net/ethernet/marvell/mvneta.c | 36 ++++++++++++++++++++---------------
> >  1 file changed, 21 insertions(+), 15 deletions(-)
> >
> > -- 
> > 2.11.0
> >
> >
> > _______________________________________________
> > linux-arm-kernel mailing list
> > linux-arm-kernel@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel  
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next v2 2/2] net: mvneta: Use cacheable memory to store the rx buffer DMA address
  2017-02-17 10:02   ` Jisheng Zhang
@ 2017-02-17 13:30     ` Gregory CLEMENT
  -1 siblings, 0 replies; 24+ messages in thread
From: Gregory CLEMENT @ 2017-02-17 13:30 UTC (permalink / raw)
  To: Jisheng Zhang
  Cc: thomas.petazzoni, davem, arnd, netdev, linux-kernel, linux-arm-kernel

Hi Jisheng,
 
On Fri, Feb 17 2017, Jisheng Zhang <jszhang@marvell.com> wrote:

> In hot code path such as mvneta_rx_hwbm() and mvneta_rx_swbm, the
> buf_phys_addr field of rx_dec is accessed. The rx_desc is allocated by
> dma_alloc_coherent, it's uncacheable if the device isn't cache
> coherent, reading from uncached memory is fairly slow. This patch uses
> cacheable memory to store the rx buffer DMA address. We get the
> following performance data on Marvell BG4CT Platforms (tested with
> iperf):
>
> before the patch:
> recving 1GB in mvneta_rx_swbm() costs 1492659600 ns
>
> after the patch:
> recving 1GB in mvneta_rx_swbm() costs 1421565640 ns
>
> We saved 4.76% time.

I have just tested it and as I feared, with HWBM enabled, a simple iperf
just doesn't work.

Gregory

>
> Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
> Suggested-by: Arnd Bergmann <arnd@arndb.de>
> ---
>  drivers/net/ethernet/marvell/mvneta.c | 28 +++++++++++++++++-----------
>  1 file changed, 17 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
> index 06df72b8da85..e24c3028fe1d 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -580,6 +580,9 @@ struct mvneta_rx_queue {
>  	/* Virtual address of the RX buffer */
>  	void  **buf_virt_addr;
>  
> +	/* DMA address of the RX buffer */
> +	dma_addr_t *buf_dma_addr;
> +
>  	/* Virtual address of the RX DMA descriptors array */
>  	struct mvneta_rx_desc *descs;
>  
> @@ -1617,6 +1620,7 @@ static void mvneta_rx_desc_fill(struct mvneta_rx_desc *rx_desc,
>  
>  	rx_desc->buf_phys_addr = phys_addr;
>  	i = rx_desc - rxq->descs;
> +	rxq->buf_dma_addr[i] = phys_addr;
>  	rxq->buf_virt_addr[i] = virt_addr;
>  }
>  
> @@ -1900,22 +1904,22 @@ static void mvneta_rxq_drop_pkts(struct mvneta_port *pp,
>  		for (i = 0; i < rx_done; i++) {
>  			struct mvneta_rx_desc *rx_desc =
>  						  mvneta_rxq_next_desc_get(rxq);
> +			int index = rx_desc - rxq->descs;
>  			u8 pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_desc->status);
>  			struct mvneta_bm_pool *bm_pool;
>  
>  			bm_pool = &pp->bm_priv->bm_pools[pool_id];
>  			/* Return dropped buffer to the pool */
>  			mvneta_bm_pool_put_bp(pp->bm_priv, bm_pool,
> -					      rx_desc->buf_phys_addr);
> +					      rxq->buf_dma_addr[index]);
>  		}
>  		return;
>  	}
>  
>  	for (i = 0; i < rxq->size; i++) {
> -		struct mvneta_rx_desc *rx_desc = rxq->descs + i;
>  		void *data = rxq->buf_virt_addr[i];
>  
> -		dma_unmap_single(pp->dev->dev.parent, rx_desc->buf_phys_addr,
> +		dma_unmap_single(pp->dev->dev.parent, rxq->buf_dma_addr[i],
>  				 MVNETA_RX_BUF_SIZE(pp->pkt_size), DMA_FROM_DEVICE);
>  		mvneta_frag_free(pp->frag_size, data);
>  	}
> @@ -1953,7 +1957,7 @@ static int mvneta_rx_swbm(struct mvneta_port *pp, int rx_todo,
>  		rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
>  		index = rx_desc - rxq->descs;
>  		data = rxq->buf_virt_addr[index];
> -		phys_addr = rx_desc->buf_phys_addr;
> +		phys_addr = rxq->buf_dma_addr[index];
>  
>  		if (!mvneta_rxq_desc_is_first_last(rx_status) ||
>  		    (rx_status & MVNETA_RXD_ERR_SUMMARY)) {
> @@ -2062,6 +2066,7 @@ static int mvneta_rx_hwbm(struct mvneta_port *pp, int rx_todo,
>  	/* Fairness NAPI loop */
>  	while (rx_done < rx_todo) {
>  		struct mvneta_rx_desc *rx_desc = mvneta_rxq_next_desc_get(rxq);
> +		int index = rx_desc - rxq->descs;
>  		struct mvneta_bm_pool *bm_pool = NULL;
>  		struct sk_buff *skb;
>  		unsigned char *data;
> @@ -2074,7 +2079,7 @@ static int mvneta_rx_hwbm(struct mvneta_port *pp, int rx_todo,
>  		rx_status = rx_desc->status;
>  		rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
>  		data = (u8 *)(uintptr_t)rx_desc->buf_cookie;
> -		phys_addr = rx_desc->buf_phys_addr;
> +		phys_addr = rxq->buf_dma_addr[index];
>  		pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_status);
>  		bm_pool = &pp->bm_priv->bm_pools[pool_id];
>  
> @@ -2082,8 +2087,7 @@ static int mvneta_rx_hwbm(struct mvneta_port *pp, int rx_todo,
>  		    (rx_status & MVNETA_RXD_ERR_SUMMARY)) {
>  err_drop_frame_ret_pool:
>  			/* Return the buffer to the pool */
> -			mvneta_bm_pool_put_bp(pp->bm_priv, bm_pool,
> -					      rx_desc->buf_phys_addr);
> +			mvneta_bm_pool_put_bp(pp->bm_priv, bm_pool, phys_addr);
>  err_drop_frame:
>  			dev->stats.rx_errors++;
>  			mvneta_rx_error(pp, rx_desc);
> @@ -2098,7 +2102,7 @@ static int mvneta_rx_hwbm(struct mvneta_port *pp, int rx_todo,
>  				goto err_drop_frame_ret_pool;
>  
>  			dma_sync_single_range_for_cpu(dev->dev.parent,
> -			                              rx_desc->buf_phys_addr,
> +			                              phys_addr,
>  			                              MVNETA_MH_SIZE + NET_SKB_PAD,
>  			                              rx_bytes,
>  			                              DMA_FROM_DEVICE);
> @@ -2114,8 +2118,7 @@ static int mvneta_rx_hwbm(struct mvneta_port *pp, int rx_todo,
>  			rcvd_bytes += rx_bytes;
>  
>  			/* Return the buffer to the pool */
> -			mvneta_bm_pool_put_bp(pp->bm_priv, bm_pool,
> -					      rx_desc->buf_phys_addr);
> +			mvneta_bm_pool_put_bp(pp->bm_priv, bm_pool, phys_addr);
>  
>  			/* leave the descriptor and buffer untouched */
>  			continue;
> @@ -4019,7 +4022,10 @@ static int mvneta_init(struct device *dev, struct mvneta_port *pp)
>  		rxq->buf_virt_addr = devm_kmalloc(pp->dev->dev.parent,
>  						  rxq->size * sizeof(void *),
>  						  GFP_KERNEL);
> -		if (!rxq->buf_virt_addr)
> +		rxq->buf_dma_addr = devm_kmalloc(pp->dev->dev.parent,
> +						 rxq->size * sizeof(dma_addr_t),
> +						 GFP_KERNEL);
> +		if (!rxq->buf_virt_addr || !rxq->buf_dma_addr)
>  			return -ENOMEM;
>  	}
>  
> -- 
> 2.11.0

-- 
Gregory Clement, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next v2 1/2] net: mvneta: avoid getting status from rx_desc as much as possible
  2017-02-17 10:02   ` Jisheng Zhang
@ 2017-02-17 13:35     ` Gregory CLEMENT
  -1 siblings, 0 replies; 24+ messages in thread
From: Gregory CLEMENT @ 2017-02-17 13:35 UTC (permalink / raw)
  To: Jisheng Zhang
  Cc: thomas.petazzoni, davem, arnd, linux-arm-kernel, netdev, linux-kernel

Hi Jisheng,
 
On Fri, Feb 17 2017, Jisheng Zhang <jszhang@marvell.com> wrote:

> In hot code path mvneta_rx_hwbm(), the rx_desc->status is read twice.
> The rx_desc is allocated by dma_alloc_coherent, it's uncacheable if
> the device isn't cache-coherent, reading from uncached memory is
> fairly slow. So reuse the read out rx_status to avoid the second
> reading from uncached memory.
>
> Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
> Suggested-by: Arnd Bergmann <arnd@arndb.de>

This one is OK and I didn't see a regression:

Tested-by: Gregory CLEMENT <gregory.clement@free-electrons.com>

Gregory
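
The pattern this patch applies can be sketched outside the driver as
follows. This is only an illustration: every DEMO_/demo_ name, the mask and
shift values, and the struct layout are assumptions made for the example,
not the driver's real definitions. The volatile qualifier merely models
uncached memory, where every access is a real load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins; not the real mvneta definitions. */
#define DEMO_RXD_BM_POOL_SHIFT	13
#define DEMO_RXD_BM_POOL_MASK	(0x3u << DEMO_RXD_BM_POOL_SHIFT)
#define DEMO_RXD_ERR_SUMMARY	(1u << 16)

/* Takes the status *value*, so the caller decides when memory is read. */
#define DEMO_RX_GET_BM_POOL_ID(status) \
	(((status) & DEMO_RXD_BM_POOL_MASK) >> DEMO_RXD_BM_POOL_SHIFT)

struct demo_rx_desc {
	volatile uint32_t status;	/* models a slow, uncached field */
};

struct demo_rx_info {
	uint8_t pool_id;
	int has_error;
};

static struct demo_rx_info demo_parse_desc(const struct demo_rx_desc *desc)
{
	struct demo_rx_info info;
	uint32_t rx_status;

	assert(desc != NULL);

	/* The only load from the descriptor; everything below reuses it. */
	rx_status = desc->status;

	info.pool_id = (uint8_t)DEMO_RX_GET_BM_POOL_ID(rx_status);
	info.has_error = (rx_status & DEMO_RXD_ERR_SUMMARY) != 0;
	return info;
}
```

Had the macro taken the descriptor pointer instead, each use would have
hidden another load from the descriptor inside the macro expansion.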


> ---
>  drivers/net/ethernet/marvell/mvneta.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
> index 61dd4462411c..06df72b8da85 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -313,8 +313,8 @@
>  	((addr >= txq->tso_hdrs_phys) && \
>  	 (addr < txq->tso_hdrs_phys + txq->size * TSO_HEADER_SIZE))
>  
> -#define MVNETA_RX_GET_BM_POOL_ID(rxd) \
> -	(((rxd)->status & MVNETA_RXD_BM_POOL_MASK) >> MVNETA_RXD_BM_POOL_SHIFT)
> +#define MVNETA_RX_GET_BM_POOL_ID(status) \
> +	(((status) & MVNETA_RXD_BM_POOL_MASK) >> MVNETA_RXD_BM_POOL_SHIFT)
>  
>  struct mvneta_statistic {
>  	unsigned short offset;
> @@ -1900,7 +1900,7 @@ static void mvneta_rxq_drop_pkts(struct mvneta_port *pp,
>  		for (i = 0; i < rx_done; i++) {
>  			struct mvneta_rx_desc *rx_desc =
>  						  mvneta_rxq_next_desc_get(rxq);
> -			u8 pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_desc);
> +			u8 pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_desc->status);
>  			struct mvneta_bm_pool *bm_pool;
>  
>  			bm_pool = &pp->bm_priv->bm_pools[pool_id];
> @@ -2075,7 +2075,7 @@ static int mvneta_rx_hwbm(struct mvneta_port *pp, int rx_todo,
>  		rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
>  		data = (u8 *)(uintptr_t)rx_desc->buf_cookie;
>  		phys_addr = rx_desc->buf_phys_addr;
> -		pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_desc);
> +		pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_status);
>  		bm_pool = &pp->bm_priv->bm_pools[pool_id];
>  
>  		if (!mvneta_rxq_desc_is_first_last(rx_status) ||
> -- 
> 2.11.0
>

-- 
Gregory Clement, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net-next v2 2/2] net: mvneta: Use cacheable memory to store the rx buffer DMA address
  2017-02-17 13:30     ` Gregory CLEMENT
@ 2017-02-17 13:55       ` Thomas Petazzoni
  -1 siblings, 0 replies; 24+ messages in thread
From: Thomas Petazzoni @ 2017-02-17 13:55 UTC (permalink / raw)
  To: Gregory CLEMENT
  Cc: Jisheng Zhang, davem, arnd, netdev, linux-kernel, linux-arm-kernel

Hello,

On Fri, 17 Feb 2017 14:30:03 +0100, Gregory CLEMENT wrote:

> I have just tested it and as I feared, with HWBM enabled, a simple iperf
> just doesn't work.

And that's expected: the whole point of HWBM is that the buffer into
which an RX packet is placed is allocated by the HW, and its address is
stored in the RX descriptor. So the following code:

> >  	rx_desc->buf_phys_addr = phys_addr;
> >  	i = rx_desc - rxq->descs;
> > +	rxq->buf_dma_addr[i] = phys_addr;

does not make sense, because it's not the SW that refills the RX
descriptors with the addresses of the RX buffers. That is done by the HW.

With HWBM, I believe you have no choice but to read the physical
address from the RX descriptor. But you can probably optimize things a
little bit by reading it only once, and then storing it into a
cacheable variable.

So maybe:

 - For SWBM, use the strategy proposed by Jisheng
 - For HWBM, at the beginning of the RX completion path, read
   rx_desc->buf_phys_addr once and store it in rxq->buf_dma_addr[index]
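
The two strategies above could be sketched roughly as follows. This is a simplified userspace model, not driver code: the structures are cut-down stand-ins, RING_SIZE and the helper names swbm_refill(), swbm_rx_addr() and hwbm_rx_addr() are hypothetical, and the descs[] array merely models the (possibly uncached) coherent ring.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 4	/* hypothetical ring size for the sketch */

/* Cut-down descriptor; in the driver it lives in coherent DMA memory. */
struct rx_desc {
	uint32_t status;
	uint32_t buf_phys_addr;
};

struct rx_queue {
	struct rx_desc descs[RING_SIZE];	/* models the uncached ring */
	uint32_t buf_dma_addr[RING_SIZE];	/* cacheable shadow copy */
};

/* SWBM: software refills the descriptor, so it can mirror the address. */
static void swbm_refill(struct rx_queue *rxq, struct rx_desc *rx_desc,
			uint32_t phys_addr)
{
	size_t i = (size_t)(rx_desc - rxq->descs);

	rx_desc->buf_phys_addr = phys_addr;
	rxq->buf_dma_addr[i] = phys_addr;
}

/* SWBM RX path: take the address from the shadow array, not the ring. */
static uint32_t swbm_rx_addr(const struct rx_queue *rxq,
			     const struct rx_desc *rx_desc)
{
	return rxq->buf_dma_addr[rx_desc - rxq->descs];
}

/* HWBM RX path: hardware wrote the address, so read the descriptor
 * exactly once and keep the value in cacheable memory afterwards. */
static uint32_t hwbm_rx_addr(struct rx_queue *rxq,
			     const struct rx_desc *rx_desc)
{
	size_t i = (size_t)(rx_desc - rxq->descs);
	uint32_t phys_addr = rx_desc->buf_phys_addr;	/* single ring read */

	rxq->buf_dma_addr[i] = phys_addr;
	return phys_addr;
}
```

In the HWBM case the shadow entry is only valid after hwbm_rx_addr() has run for that descriptor, since nothing in software knows the address before the hardware fills it in.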

Of course that's just a very rough proposal. I've been looking mainly
at mvpp2 lately, and I'm not sure I still remember how mvneta works in
detail.

Best regards,

Thomas
-- 
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com



* Re: [PATCH net-next v2 2/2] net: mvneta: Use cacheable memory to store the rx buffer DMA address
  2017-02-17 13:55       ` Thomas Petazzoni
@ 2017-02-17 15:20         ` Gregory CLEMENT
  -1 siblings, 0 replies; 24+ messages in thread
From: Gregory CLEMENT @ 2017-02-17 15:20 UTC (permalink / raw)
  To: Thomas Petazzoni
  Cc: Jisheng Zhang, davem, arnd, netdev, linux-kernel, linux-arm-kernel

Hi Thomas,
 
On Fri, Feb 17 2017, Thomas Petazzoni <thomas.petazzoni@free-electrons.com> wrote:

> Does not make sense, because it's not the SW that refills the RX
> descriptors with the address of the RX buffers. It's done by the HW.
>
> With HWBM, I believe you have no choice but to read the physical
> address from the RX descriptor. But you can probably optimize things a
> little bit by reading it only once, and then storing it into a
> cacheable variable.
>
> So maybe:
>
>  - For SWBM, use the strategy proposed by Jisheng
>  - For HWBM, at the beginning of the RX completion path, read once the
>    rx_desc->buf_phys_addr, and store it in rxq->buf_dma_addr[index]


For the HWBM path, storing rx_desc->buf_phys_addr in
rxq->buf_dma_addr[index] is not useful, as we only use it in a single
function.

But a quick improvement could be to use the phys_addr variable. Indeed,
we store the value of rx_desc->buf_phys_addr in it but never use it
afterwards; instead we keep reading rx_desc->buf_phys_addr.
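
To illustrate the idea, here is a small userspace sketch rather than the actual mvneta_rx_hwbm() code. The accessor read_buf_phys_addr() and the consumers sync_for_cpu() and pool_put() are hypothetical names that only model "a read from the descriptor" and "places where the address is needed"; the counter is there to make the number of descriptor reads visible.

```c
#include <stdint.h>

/* Cut-down descriptor; in the driver it lives in coherent DMA memory. */
struct rx_desc {
	uint32_t status;
	uint32_t buf_phys_addr;
};

static unsigned int desc_reads;	/* counts modelled descriptor reads */

/* Hypothetical accessor standing in for a (slow) uncached read. */
static uint32_t read_buf_phys_addr(const struct rx_desc *rx_desc)
{
	desc_reads++;
	return rx_desc->buf_phys_addr;
}

/* Hypothetical consumers of the buffer address. */
static uint32_t synced_addr, released_addr;

static void sync_for_cpu(uint32_t phys_addr)
{
	synced_addr = phys_addr;
}

static void pool_put(uint32_t phys_addr)
{
	released_addr = phys_addr;
}

/* Read the descriptor field once into the local phys_addr, then use the
 * local everywhere instead of going back to rx_desc->buf_phys_addr. */
static void rx_one(const struct rx_desc *rx_desc)
{
	uint32_t phys_addr = read_buf_phys_addr(rx_desc);

	sync_for_cpu(phys_addr);
	pool_put(phys_addr);
}
```

With the local variable, the number of descriptor reads per packet stays at one no matter how many later uses of the address are added.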

Gregory

>
> Of course that's just a very rough proposal. I've been looking mainly
> at mvpp2 lately, and I'm not sure I still remember how mvneta works in
> the details.
>
> Best regards,
>
> Thomas
> -- 
> Thomas Petazzoni, CTO, Free Electrons
> Embedded Linux and Kernel engineering
> http://free-electrons.com

-- 
Gregory Clement, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com




Thread overview: 24+ messages
2017-02-17 10:02 [PATCH net-next v2 0/2] net: mvneta: improve rx performance Jisheng Zhang
2017-02-17 10:02 ` Jisheng Zhang
2017-02-17 10:02 ` [PATCH net-next v2 1/2] net: mvneta: avoid getting status from rx_desc as much as possible Jisheng Zhang
2017-02-17 10:02   ` Jisheng Zhang
2017-02-17 10:02   ` Jisheng Zhang
2017-02-17 13:35   ` Gregory CLEMENT
2017-02-17 13:35     ` Gregory CLEMENT
2017-02-17 10:02 ` [PATCH net-next v2 2/2] net: mvneta: Use cacheable memory to store the rx buffer DMA address Jisheng Zhang
2017-02-17 10:02   ` Jisheng Zhang
2017-02-17 10:02   ` Jisheng Zhang
2017-02-17 13:30   ` Gregory CLEMENT
2017-02-17 13:30     ` Gregory CLEMENT
2017-02-17 13:55     ` Thomas Petazzoni
2017-02-17 13:55       ` Thomas Petazzoni
2017-02-17 15:20       ` Gregory CLEMENT
2017-02-17 15:20         ` Gregory CLEMENT
2017-02-17 10:09 ` [PATCH net-next v2 0/2] net: mvneta: improve rx performance Jisheng Zhang
2017-02-17 10:09   ` Jisheng Zhang
2017-02-17 10:09   ` Jisheng Zhang
2017-02-17 10:37 ` Gregory CLEMENT
2017-02-17 10:37   ` Gregory CLEMENT
2017-02-17 10:44   ` Jisheng Zhang
2017-02-17 10:44     ` Jisheng Zhang
2017-02-17 10:44     ` Jisheng Zhang
