* [PATCH net-next v3 0/4] net: mvneta: improve rx/tx performance
@ 2017-02-20 12:53 ` Jisheng Zhang
  0 siblings, 0 replies; 22+ messages in thread
From: Jisheng Zhang @ 2017-02-20 12:53 UTC (permalink / raw)
  To: thomas.petazzoni, davem, arnd, gregory.clement, mw
  Cc: linux-arm-kernel, netdev, linux-kernel, Jisheng Zhang

In hot code paths such as mvneta_rx_swbm(), we access fields of
rx_desc and tx_desc. These DMA descriptors are allocated by
dma_alloc_coherent; they are uncacheable if the device isn't
cache-coherent, and reading from uncached memory is fairly slow.

patch1 reuses the already-read status to avoid reading the status
field of rx_desc again.

patch2 avoids getting buf_phys_addr from rx_desc again in
mvneta_rx_hwbm by reusing the phys_addr variable.

patch3 avoids reading from tx_desc as much as possible by storing what
we need in local variables.

We get the following performance data on Marvell BG4CT platforms
(tested with iperf):

before the patch:
sending 1GB in mvneta_tx() (TSO disabled) costs 793553760 ns

after the patch:
sending 1GB in mvneta_tx() (TSO disabled) costs 719953800 ns

We saved 9.2% of the time.

patch4 uses cacheable memory to store the rx buffer DMA address.

We get the following performance data on Marvell BG4CT platforms
(tested with iperf):

before the patch:
receiving 1GB in mvneta_rx_swbm() costs 1492659600 ns

after the patch:
receiving 1GB in mvneta_rx_swbm() costs 1421565640 ns

We saved 4.76% of the time.

Basically, patch1 and patch4 do what Arnd mentioned in [1].

Hi Arnd,

I added a "Suggested-by" tag for you; I hope you don't mind ;)

Thanks

[1] https://www.spinics.net/lists/netdev/msg405889.html

Since v2:
  - add Gregory's ack to patch1
  - only get rx buffer DMA address from cacheable memory for mvneta_rx_swbm()
  - add patch 2 to read rx_desc->buf_phys_addr once in mvneta_rx_hwbm()
  - add patch 3 to avoid reading from tx_desc as much as possible

Since v1:
  - correct the performance data typo


Jisheng Zhang (4):
  net: mvneta: avoid getting status from rx_desc as much as possible
  net: mvneta: avoid getting buf_phys_addr from rx_desc again
  net: mvneta: avoid reading from tx_desc as much as possible
  net: mvneta: Use cacheable memory to store the rx buffer DMA address

 drivers/net/ethernet/marvell/mvneta.c | 80 +++++++++++++++++++----------------
 1 file changed, 43 insertions(+), 37 deletions(-)

-- 
2.11.0

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH net-next v3 1/4] net: mvneta: avoid getting status from rx_desc as much as possible
  2017-02-20 12:53 ` Jisheng Zhang
@ 2017-02-20 12:53   ` Jisheng Zhang
  -1 siblings, 0 replies; 22+ messages in thread
From: Jisheng Zhang @ 2017-02-20 12:53 UTC (permalink / raw)
  To: thomas.petazzoni, davem, arnd, gregory.clement, mw
  Cc: linux-arm-kernel, netdev, linux-kernel, Jisheng Zhang

In the hot code path mvneta_rx_hwbm(), rx_desc->status is read twice.
The rx_desc is allocated by dma_alloc_coherent; it's uncacheable if
the device isn't cache-coherent, and reading from uncached memory is
fairly slow. So reuse the already-read rx_status to avoid a second
read from uncached memory.

Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Tested-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
---
 drivers/net/ethernet/marvell/mvneta.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 61dd4462411c..06df72b8da85 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -313,8 +313,8 @@
 	((addr >= txq->tso_hdrs_phys) && \
 	 (addr < txq->tso_hdrs_phys + txq->size * TSO_HEADER_SIZE))
 
-#define MVNETA_RX_GET_BM_POOL_ID(rxd) \
-	(((rxd)->status & MVNETA_RXD_BM_POOL_MASK) >> MVNETA_RXD_BM_POOL_SHIFT)
+#define MVNETA_RX_GET_BM_POOL_ID(status) \
+	(((status) & MVNETA_RXD_BM_POOL_MASK) >> MVNETA_RXD_BM_POOL_SHIFT)
 
 struct mvneta_statistic {
 	unsigned short offset;
@@ -1900,7 +1900,7 @@ static void mvneta_rxq_drop_pkts(struct mvneta_port *pp,
 		for (i = 0; i < rx_done; i++) {
 			struct mvneta_rx_desc *rx_desc =
 						  mvneta_rxq_next_desc_get(rxq);
-			u8 pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_desc);
+			u8 pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_desc->status);
 			struct mvneta_bm_pool *bm_pool;
 
 			bm_pool = &pp->bm_priv->bm_pools[pool_id];
@@ -2075,7 +2075,7 @@ static int mvneta_rx_hwbm(struct mvneta_port *pp, int rx_todo,
 		rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
 		data = (u8 *)(uintptr_t)rx_desc->buf_cookie;
 		phys_addr = rx_desc->buf_phys_addr;
-		pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_desc);
+		pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_status);
 		bm_pool = &pp->bm_priv->bm_pools[pool_id];
 
 		if (!mvneta_rxq_desc_is_first_last(rx_status) ||
-- 
2.11.0


* [PATCH net-next v3 2/4] net: mvneta: avoid getting buf_phys_addr from rx_desc again
  2017-02-20 12:53 ` Jisheng Zhang
@ 2017-02-20 12:53   ` Jisheng Zhang
  -1 siblings, 0 replies; 22+ messages in thread
From: Jisheng Zhang @ 2017-02-20 12:53 UTC (permalink / raw)
  To: thomas.petazzoni, davem, arnd, gregory.clement, mw
  Cc: linux-arm-kernel, netdev, linux-kernel, Jisheng Zhang

In the hot code path mvneta_rx_hwbm(), rx_desc->buf_phys_addr is read
four times. The rx_desc is allocated by dma_alloc_coherent; it's
uncacheable if the device isn't cache-coherent, and reading from
uncached memory is fairly slow. So reuse the phys_addr variable that
already holds the value to avoid the extra reads from uncached memory.

Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
Suggested-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
---
 drivers/net/ethernet/marvell/mvneta.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 06df72b8da85..a25042801eec 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -2082,8 +2082,7 @@ static int mvneta_rx_hwbm(struct mvneta_port *pp, int rx_todo,
 		    (rx_status & MVNETA_RXD_ERR_SUMMARY)) {
 err_drop_frame_ret_pool:
 			/* Return the buffer to the pool */
-			mvneta_bm_pool_put_bp(pp->bm_priv, bm_pool,
-					      rx_desc->buf_phys_addr);
+			mvneta_bm_pool_put_bp(pp->bm_priv, bm_pool, phys_addr);
 err_drop_frame:
 			dev->stats.rx_errors++;
 			mvneta_rx_error(pp, rx_desc);
@@ -2098,7 +2097,7 @@ static int mvneta_rx_hwbm(struct mvneta_port *pp, int rx_todo,
 				goto err_drop_frame_ret_pool;
 
 			dma_sync_single_range_for_cpu(dev->dev.parent,
-			                              rx_desc->buf_phys_addr,
+			                              phys_addr,
 			                              MVNETA_MH_SIZE + NET_SKB_PAD,
 			                              rx_bytes,
 			                              DMA_FROM_DEVICE);
@@ -2114,8 +2113,7 @@ static int mvneta_rx_hwbm(struct mvneta_port *pp, int rx_todo,
 			rcvd_bytes += rx_bytes;
 
 			/* Return the buffer to the pool */
-			mvneta_bm_pool_put_bp(pp->bm_priv, bm_pool,
-					      rx_desc->buf_phys_addr);
+			mvneta_bm_pool_put_bp(pp->bm_priv, bm_pool, phys_addr);
 
 			/* leave the descriptor and buffer untouched */
 			continue;
-- 
2.11.0


* [PATCH net-next v3 3/4] net: mvneta: avoid reading from tx_desc as much as possible
  2017-02-20 12:53 ` Jisheng Zhang
@ 2017-02-20 12:53   ` Jisheng Zhang
  -1 siblings, 0 replies; 22+ messages in thread
From: Jisheng Zhang @ 2017-02-20 12:53 UTC (permalink / raw)
  To: thomas.petazzoni, davem, arnd, gregory.clement, mw
  Cc: linux-arm-kernel, netdev, linux-kernel, Jisheng Zhang

In hot code paths such as mvneta_tx() and mvneta_txq_bufs_free(), we
access tx_desc several times. The tx_desc is allocated by
dma_alloc_coherent; it's uncacheable if the device isn't
cache-coherent, and reading from uncached memory is fairly slow. So
use local variables to store what we need and avoid extra reads from
uncached memory.

We get the following performance data on Marvell BG4CT platforms
(tested with iperf):

before the patch:
sending 1GB in mvneta_tx() (TSO disabled) costs 793553760 ns

after the patch:
sending 1GB in mvneta_tx() (TSO disabled) costs 719953800 ns

We saved 9.2% of the time.

Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
---
 drivers/net/ethernet/marvell/mvneta.c | 50 ++++++++++++++++++-----------------
 1 file changed, 26 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index a25042801eec..b6cda4131c78 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -1770,6 +1770,7 @@ static void mvneta_txq_bufs_free(struct mvneta_port *pp,
 		struct mvneta_tx_desc *tx_desc = txq->descs +
 			txq->txq_get_index;
 		struct sk_buff *skb = txq->tx_skb[txq->txq_get_index];
+		u32 dma_addr = tx_desc->buf_phys_addr;
 
 		if (skb) {
 			bytes_compl += skb->len;
@@ -1778,9 +1779,8 @@ static void mvneta_txq_bufs_free(struct mvneta_port *pp,
 
 		mvneta_txq_inc_get(txq);
 
-		if (!IS_TSO_HEADER(txq, tx_desc->buf_phys_addr))
-			dma_unmap_single(pp->dev->dev.parent,
-					 tx_desc->buf_phys_addr,
+		if (!IS_TSO_HEADER(txq, dma_addr))
+			dma_unmap_single(pp->dev->dev.parent, dma_addr,
 					 tx_desc->data_size, DMA_TO_DEVICE);
 		if (!skb)
 			continue;
@@ -2191,17 +2191,18 @@ mvneta_tso_put_data(struct net_device *dev, struct mvneta_tx_queue *txq,
 		    bool last_tcp, bool is_last)
 {
 	struct mvneta_tx_desc *tx_desc;
+	dma_addr_t dma_addr;
 
 	tx_desc = mvneta_txq_next_desc_get(txq);
 	tx_desc->data_size = size;
-	tx_desc->buf_phys_addr = dma_map_single(dev->dev.parent, data,
-						size, DMA_TO_DEVICE);
-	if (unlikely(dma_mapping_error(dev->dev.parent,
-		     tx_desc->buf_phys_addr))) {
+
+	dma_addr = dma_map_single(dev->dev.parent, data, size, DMA_TO_DEVICE);
+	if (unlikely(dma_mapping_error(dev->dev.parent, dma_addr))) {
 		mvneta_txq_desc_put(txq);
 		return -ENOMEM;
 	}
 
+	tx_desc->buf_phys_addr = dma_addr;
 	tx_desc->command = 0;
 	txq->tx_skb[txq->txq_put_index] = NULL;
 
@@ -2278,9 +2279,10 @@ static int mvneta_tx_tso(struct sk_buff *skb, struct net_device *dev,
 	 */
 	for (i = desc_count - 1; i >= 0; i--) {
 		struct mvneta_tx_desc *tx_desc = txq->descs + i;
-		if (!IS_TSO_HEADER(txq, tx_desc->buf_phys_addr))
+		u32 dma_addr = tx_desc->buf_phys_addr;
+		if (!IS_TSO_HEADER(txq, dma_addr))
 			dma_unmap_single(pp->dev->dev.parent,
-					 tx_desc->buf_phys_addr,
+					 dma_addr,
 					 tx_desc->data_size,
 					 DMA_TO_DEVICE);
 		mvneta_txq_desc_put(txq);
@@ -2296,21 +2298,20 @@ static int mvneta_tx_frag_process(struct mvneta_port *pp, struct sk_buff *skb,
 	int i, nr_frags = skb_shinfo(skb)->nr_frags;
 
 	for (i = 0; i < nr_frags; i++) {
+		dma_addr_t dma_addr;
 		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 		void *addr = page_address(frag->page.p) + frag->page_offset;
 
 		tx_desc = mvneta_txq_next_desc_get(txq);
 		tx_desc->data_size = frag->size;
 
-		tx_desc->buf_phys_addr =
-			dma_map_single(pp->dev->dev.parent, addr,
-				       tx_desc->data_size, DMA_TO_DEVICE);
-
-		if (dma_mapping_error(pp->dev->dev.parent,
-				      tx_desc->buf_phys_addr)) {
+		dma_addr = dma_map_single(pp->dev->dev.parent, addr,
+					  frag->size, DMA_TO_DEVICE);
+		if (dma_mapping_error(pp->dev->dev.parent, dma_addr)) {
 			mvneta_txq_desc_put(txq);
 			goto error;
 		}
+		tx_desc->buf_phys_addr = dma_addr;
 
 		if (i == nr_frags - 1) {
 			/* Last descriptor */
@@ -2351,7 +2352,8 @@ static int mvneta_tx(struct sk_buff *skb, struct net_device *dev)
 	struct mvneta_tx_desc *tx_desc;
 	int len = skb->len;
 	int frags = 0;
-	u32 tx_cmd;
+	u32 tx_cmd, size;
+	dma_addr_t dma_addr;
 
 	if (!netif_running(dev))
 		goto out;
@@ -2368,17 +2370,17 @@ static int mvneta_tx(struct sk_buff *skb, struct net_device *dev)
 
 	tx_cmd = mvneta_skb_tx_csum(pp, skb);
 
-	tx_desc->data_size = skb_headlen(skb);
+	size = skb_headlen(skb);
+	tx_desc->data_size = size;
 
-	tx_desc->buf_phys_addr = dma_map_single(dev->dev.parent, skb->data,
-						tx_desc->data_size,
-						DMA_TO_DEVICE);
-	if (unlikely(dma_mapping_error(dev->dev.parent,
-				       tx_desc->buf_phys_addr))) {
+	dma_addr = dma_map_single(dev->dev.parent, skb->data,
+				  size, DMA_TO_DEVICE);
+	if (unlikely(dma_mapping_error(dev->dev.parent, dma_addr))) {
 		mvneta_txq_desc_put(txq);
 		frags = 0;
 		goto out;
 	}
+	tx_desc->buf_phys_addr = dma_addr;
 
 	if (frags == 1) {
 		/* First and Last descriptor */
@@ -2395,8 +2397,8 @@ static int mvneta_tx(struct sk_buff *skb, struct net_device *dev)
 		/* Continue with other skb fragments */
 		if (mvneta_tx_frag_process(pp, skb, txq)) {
 			dma_unmap_single(dev->dev.parent,
-					 tx_desc->buf_phys_addr,
-					 tx_desc->data_size,
+					 dma_addr,
+					 size,
 					 DMA_TO_DEVICE);
 			mvneta_txq_desc_put(txq);
 			frags = 0;
-- 
2.11.0


* [PATCH net-next v3 4/4] net: mvneta: Use cacheable memory to store the rx buffer DMA address
  2017-02-20 12:53 ` Jisheng Zhang
@ 2017-02-20 12:53   ` Jisheng Zhang
  -1 siblings, 0 replies; 22+ messages in thread
From: Jisheng Zhang @ 2017-02-20 12:53 UTC (permalink / raw)
  To: thomas.petazzoni, davem, arnd, gregory.clement, mw
  Cc: linux-arm-kernel, netdev, linux-kernel, Jisheng Zhang

In hot code paths such as mvneta_rx_swbm(), the buf_phys_addr field of
rx_desc is accessed. The rx_desc is allocated by dma_alloc_coherent;
it's uncacheable if the device isn't cache-coherent, and reading from
uncached memory is fairly slow. This patch uses cacheable memory to
store the rx buffer DMA address. We get the following performance data
on Marvell BG4CT platforms (tested with iperf):

before the patch:
receiving 1GB in mvneta_rx_swbm() costs 1492659600 ns

after the patch:
receiving 1GB in mvneta_rx_swbm() costs 1421565640 ns

We saved 4.76% of the time.

Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/ethernet/marvell/mvneta.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index b6cda4131c78..ccd3f2601446 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -580,6 +580,9 @@ struct mvneta_rx_queue {
 	/* Virtual address of the RX buffer */
 	void  **buf_virt_addr;
 
+	/* DMA address of the RX buffer */
+	dma_addr_t *buf_dma_addr;
+
 	/* Virtual address of the RX DMA descriptors array */
 	struct mvneta_rx_desc *descs;
 
@@ -1617,6 +1620,7 @@ static void mvneta_rx_desc_fill(struct mvneta_rx_desc *rx_desc,
 
 	rx_desc->buf_phys_addr = phys_addr;
 	i = rx_desc - rxq->descs;
+	rxq->buf_dma_addr[i] = phys_addr;
 	rxq->buf_virt_addr[i] = virt_addr;
 }
 
@@ -1912,10 +1916,9 @@ static void mvneta_rxq_drop_pkts(struct mvneta_port *pp,
 	}
 
 	for (i = 0; i < rxq->size; i++) {
-		struct mvneta_rx_desc *rx_desc = rxq->descs + i;
 		void *data = rxq->buf_virt_addr[i];
 
-		dma_unmap_single(pp->dev->dev.parent, rx_desc->buf_phys_addr,
+		dma_unmap_single(pp->dev->dev.parent, rxq->buf_dma_addr[i],
 				 MVNETA_RX_BUF_SIZE(pp->pkt_size), DMA_FROM_DEVICE);
 		mvneta_frag_free(pp->frag_size, data);
 	}
@@ -1953,7 +1956,7 @@ static int mvneta_rx_swbm(struct mvneta_port *pp, int rx_todo,
 		rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
 		index = rx_desc - rxq->descs;
 		data = rxq->buf_virt_addr[index];
-		phys_addr = rx_desc->buf_phys_addr;
+		phys_addr = rxq->buf_dma_addr[index];
 
 		if (!mvneta_rxq_desc_is_first_last(rx_status) ||
 		    (rx_status & MVNETA_RXD_ERR_SUMMARY)) {
@@ -4019,7 +4022,10 @@ static int mvneta_init(struct device *dev, struct mvneta_port *pp)
 		rxq->buf_virt_addr = devm_kmalloc(pp->dev->dev.parent,
 						  rxq->size * sizeof(void *),
 						  GFP_KERNEL);
-		if (!rxq->buf_virt_addr)
+		rxq->buf_dma_addr = devm_kmalloc(pp->dev->dev.parent,
+						 rxq->size * sizeof(dma_addr_t),
+						 GFP_KERNEL);
+		if (!rxq->buf_virt_addr || !rxq->buf_dma_addr)
 			return -ENOMEM;
 	}
 
-- 
2.11.0


 		rxq->buf_virt_addr = devm_kmalloc(pp->dev->dev.parent,
 						  rxq->size * sizeof(void *),
 						  GFP_KERNEL);
-		if (!rxq->buf_virt_addr)
+		rxq->buf_dma_addr = devm_kmalloc(pp->dev->dev.parent,
+						 rxq->size * sizeof(dma_addr_t),
+						 GFP_KERNEL);
+		if (!rxq->buf_virt_addr || !rxq->buf_dma_addr)
 			return -ENOMEM;
 	}
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next v3 0/4] net: mvneta: improve rx/tx performance
  2017-02-20 12:53 ` Jisheng Zhang
@ 2017-02-20 14:21   ` Gregory CLEMENT
  -1 siblings, 0 replies; 22+ messages in thread
From: Gregory CLEMENT @ 2017-02-20 14:21 UTC (permalink / raw)
  To: Jisheng Zhang
  Cc: thomas.petazzoni, davem, arnd, mw, linux-arm-kernel, netdev,
	linux-kernel

Hi Jisheng,
 
 On lun., févr. 20 2017, Jisheng Zhang <jszhang@marvell.com> wrote:

> In hot code path such as mvneta_rx_swbm(), we access fields of rx_desc
> and tx_desc. These DMA descs are allocated by dma_alloc_coherent, they
> are uncacheable if the device isn't cache coherent, reading from
> uncached memory is fairly slow.
>
> patch1 reuses the read out status to getting status field of rx_desc
> again.
>
> patch2 avoids getting buf_phys_addr from rx_desc again in
> mvneta_rx_hwbm by reusing the phys_addr variable.
>
> patch3 avoids reading from tx_desc as much as possible by store what
> we need in local variable.
>
> We get the following performance data on Marvell BG4CT Platforms
> (tested with iperf):
>
> before the patch:
> sending 1GB in mvneta_tx()(disabled TSO) costs 793553760ns
>
> after the patch:
> sending 1GB in mvneta_tx()(disabled TSO) costs 719953800ns
>
> we saved 9.2% time.
>
> patch4 uses cacheable memory to store the rx buffer DMA address.
>
> We get the following performance data on Marvell BG4CT Platforms
> (tested with iperf):
>
> before the patch:
> recving 1GB in mvneta_rx_swbm() costs 1492659600 ns
>
> after the patch:
> recving 1GB in mvneta_rx_swbm() costs 1421565640 ns

Could you explain how you got these numbers?

Receiving 1GB in 1.42 seconds means a bandwidth of
8/1.42 = 5.63 Gb/s, which would mean you are using at least a 10Gb
interface.
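
Gregory's back-of-the-envelope calculation can be reproduced with a few
lines (an illustrative sketch; the helper name is mine, not from the thread):

```python
def implied_bandwidth_gbps(nbytes, seconds):
    """Throughput in Gb/s if `nbytes` were moved in `seconds` of wall time."""
    return nbytes * 8 / seconds / 1e9

# 1 GB (decimal) against the ~1.42 s reported for the patched mvneta_rx_swbm()
print(round(implied_bandwidth_gbps(1e9, 1.421565640), 2))  # 5.63
```

That figure is well above mvneta's 1 Gb/s line rate, which is the basis of
the objection: the 1.42 s cannot be a wall-clock transfer time.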

When I used iperf I didn't have this kind of granularity:
iperf -c 192.168.10.1 -n 1024M
------------------------------------------------------------
Client connecting to 192.168.10.19, TCP port 5001
TCP window size: 43.8 KByte (default)
------------------------------------------------------------
[  3] local 192.168.10.28 port 53086 connected with 192.168.10.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 9.1 sec  1.00 GBytes   942 Mbits/sec

Also, without HWBM enabled (so with the same configuration as your test),
I didn't notice any improvement with the patch set applied. But at
least I didn't see any regression, with or without HWBM.

Gregory

>
> We saved 4.76% time.
>
> Basically, patch1 and patch4 do what Arnd mentioned in [1].
>
> Hi Arnd,
>
> I added "Suggested-by you" tag, I hope you don't mind ;)
>
> Thanks
>
> [1] https://www.spinics.net/lists/netdev/msg405889.html
>
> Since v2:
>   - add Gregory's ack to patch1
>   - only get rx buffer DMA address from cacheable memory for mvneta_rx_swbm()
>   - add patch 2 to read rx_desc->buf_phys_addr once in mvneta_rx_hwbm()
>   - add patch 3 to avoid reading from tx_desc as much as possible
>
> Since v1:
>   - correct the performance data typo
>
>
> Jisheng Zhang (4):
>   net: mvneta: avoid getting status from rx_desc as much as possible
>   net: mvneta: avoid getting buf_phys_addr from rx_desc again
>   net: mvneta: avoid reading from tx_desc as much as possible
>   net: mvneta: Use cacheable memory to store the rx buffer DMA address
>
>  drivers/net/ethernet/marvell/mvneta.c | 80 +++++++++++++++++++----------------
>  1 file changed, 43 insertions(+), 37 deletions(-)
>
> -- 
> 2.11.0
>

-- 
Gregory Clement, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next v3 0/4] net: mvneta: improve rx/tx performance
  2017-02-20 14:21   ` Gregory CLEMENT
  (?)
@ 2017-02-21  4:37     ` Jisheng Zhang
  -1 siblings, 0 replies; 22+ messages in thread
From: Jisheng Zhang @ 2017-02-21  4:37 UTC (permalink / raw)
  To: Gregory CLEMENT
  Cc: thomas.petazzoni, davem, arnd, mw, linux-arm-kernel, netdev,
	linux-kernel

Hi Gregory,

On Mon, 20 Feb 2017 15:21:35 +0100 Gregory CLEMENT wrote:

> Hi Jisheng,
>  
>  On lun., févr. 20 2017, Jisheng Zhang <jszhang@marvell.com> wrote:
> 
> > In hot code path such as mvneta_rx_swbm(), we access fields of rx_desc
> > and tx_desc. These DMA descs are allocated by dma_alloc_coherent, they
> > are uncacheable if the device isn't cache coherent, reading from
> > uncached memory is fairly slow.
> >
> > patch1 reuses the read out status to getting status field of rx_desc
> > again.
> >
> > patch2 avoids getting buf_phys_addr from rx_desc again in
> > mvneta_rx_hwbm by reusing the phys_addr variable.
> >
> > patch3 avoids reading from tx_desc as much as possible by store what
> > we need in local variable.
> >
> > We get the following performance data on Marvell BG4CT Platforms
> > (tested with iperf):
> >
> > before the patch:
> > sending 1GB in mvneta_tx()(disabled TSO) costs 793553760ns
> >
> > after the patch:
> > sending 1GB in mvneta_tx()(disabled TSO) costs 719953800ns
> >
> > we saved 9.2% time.
> >
> > patch4 uses cacheable memory to store the rx buffer DMA address.
> >
> > We get the following performance data on Marvell BG4CT Platforms
> > (tested with iperf):
> >
> > before the patch:
> > recving 1GB in mvneta_rx_swbm() costs 1492659600 ns
> >
> > after the patch:
> > recving 1GB in mvneta_rx_swbm() costs 1421565640 ns  
> 
> Could you explain who you get this number?

Thanks for your review.

The measurement is simple: record how much time is spent in mvneta_rx_swbm()
while receiving 1GB of data, with something like the code below:

mvneta_rx_swbm()
{
	static u64 total_time;
	u64 t1, t2;
	static u64 count;

	t1 = sched_clock();
	...

	if (rcvd_pkts) {
		...
		t2 = sched_clock() - t1;
		total_time += t2;
		count += rcvd_bytes;
		if (count >= 0x40000000) {
			printk("!!!! %lld %lld\n", total_time, count);
			total_time = 0;
			count = 0;
		}
		...
	}
	...
}

> 
> receiving 1GB in 1.42 second means having a bandwidth of
> 8/1.42=5.63 Gb/s, that means that you are using at least a 10Gb
> interface.

Hmm, we only measured the time spent in mvneta_rx_swbm() itself, so we
can't derive the bandwidth as 8/1.42. What do you think?
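
The distinction Jisheng is drawing here, accumulated time inside one function
versus wall-clock transfer time, can be illustrated in userspace (a hedged
sketch with a made-up workload, not the driver code):

```python
import time

def hot_path(n=10000):
    # stand-in for the work done inside mvneta_rx_swbm()
    s = 0
    for i in range(n):
        s += i
    return s

accumulated = 0.0
wall_start = time.perf_counter()
for _ in range(50):
    t1 = time.perf_counter()
    hot_path()
    accumulated += time.perf_counter() - t1   # only in-function time, as with sched_clock()
    time.sleep(0.001)                         # everything else: IRQs, network stack, idle
wall = time.perf_counter() - wall_start

# The summed in-function time is far below the wall-clock duration, so
# dividing bytes transferred by it overstates the apparent bandwidth.
print(accumulated < wall)  # True
```

This is why the 1.42 s figure does not contradict a 1 Gb/s link: it counts
only the driver's RX routine, not the whole transfer.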

> 
> When I used iperf I didn't have this kind of granularity:
> iperf -c 192.168.10.1 -n 1024M
> ------------------------------------------------------------
> Client connecting to 192.168.10.19, TCP port 5001
> TCP window size: 43.8 KByte (default)
> ------------------------------------------------------------
> [  3] local 192.168.10.28 port 53086 connected with 192.168.10.1 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0- 9.1 sec  1.00 GBytes   942 Mbits/sec
> 
> Also without HWBM enabled (so with the same configuration of your test),
> I didn't noticed any improvement with the patch set applied. But at

From a bandwidth point of view, yes, there's no improvement. But from a
CPU time/load point of view, I do see a small improvement. Could you also
run a simple test on your side to see whether you get similar improvement
data?

Thanks,
Jisheng



> least I didn't see any regression with or without HWBM.
> 
> Gregory
> 
> >
> > We saved 4.76% time.
> >
> > Basically, patch1 and patch4 do what Arnd mentioned in [1].
> >
> > Hi Arnd,
> >
> > I added "Suggested-by you" tag, I hope you don't mind ;)
> >
> > Thanks
> >
> > [1] https://www.spinics.net/lists/netdev/msg405889.html
> >
> > Since v2:
> >   - add Gregory's ack to patch1
> >   - only get rx buffer DMA address from cacheable memory for mvneta_rx_swbm()
> >   - add patch 2 to read rx_desc->buf_phys_addr once in mvneta_rx_hwbm()
> >   - add patch 3 to avoid reading from tx_desc as much as possible
> >
> > Since v1:
> >   - correct the performance data typo
> >
> >
> > Jisheng Zhang (4):
> >   net: mvneta: avoid getting status from rx_desc as much as possible
> >   net: mvneta: avoid getting buf_phys_addr from rx_desc again
> >   net: mvneta: avoid reading from tx_desc as much as possible
> >   net: mvneta: Use cacheable memory to store the rx buffer DMA address
> >
> >  drivers/net/ethernet/marvell/mvneta.c | 80 +++++++++++++++++++----------------
> >  1 file changed, 43 insertions(+), 37 deletions(-)
> >
> > -- 
> > 2.11.0
> >  
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next v3 0/4] net: mvneta: improve rx/tx performance
  2017-02-21  4:37     ` Jisheng Zhang
@ 2017-02-21 16:16       ` David Miller
  -1 siblings, 0 replies; 22+ messages in thread
From: David Miller @ 2017-02-21 16:16 UTC (permalink / raw)
  To: jszhang
  Cc: gregory.clement, thomas.petazzoni, arnd, mw, linux-arm-kernel,
	netdev, linux-kernel

From: Jisheng Zhang <jszhang@marvell.com>
Date: Tue, 21 Feb 2017 12:37:40 +0800

> Thanks for your review.
> 
> The measurement is simple: record how much time we spent in mvneta_rx_swbm()
> for receiving 1GB data, something as below:

Please use a standard tool for measuring performance, rather than profiling
the driver and trying to derive numbers that way.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next v3 0/4] net: mvneta: improve rx/tx performance
  2017-02-21 16:16       ` David Miller
@ 2017-02-21 16:35         ` Marcin Wojtas
  -1 siblings, 0 replies; 22+ messages in thread
From: Marcin Wojtas @ 2017-02-21 16:35 UTC (permalink / raw)
  To: Jisheng Zhang
  Cc: David Miller, Gregory Clément, Thomas Petazzoni,
	Arnd Bergmann, linux-arm-kernel, netdev, linux-kernel

Hi Jisheng,

2017-02-21 17:16 GMT+01:00 David Miller <davem@davemloft.net>:
> From: Jisheng Zhang <jszhang@marvell.com>
> Date: Tue, 21 Feb 2017 12:37:40 +0800
>
>> Thanks for your review.
>>
>> The measurement is simple: record how much time we spent in mvneta_rx_swbm()
>> for receiving 1GB data, something as below:
>
> Please use a standard tool for measuring performance, rather than profiling
> the driver and trying to derive numbers that way.

If possible in your setup, I suggest pushing 64B (and other sizes of)
packets uni- or bidirectionally via 2 ports in L2 bridge mode. It's a
good stress test, and you'd get some meaningful numbers (also check CPU
consumption with mpstat in the meantime).

Best regards,
Marcin

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next v3 0/4] net: mvneta: improve rx/tx performance
  2017-02-21 16:16       ` David Miller
@ 2017-02-24 11:56         ` Jisheng Zhang
  -1 siblings, 0 replies; 22+ messages in thread
From: Jisheng Zhang @ 2017-02-24 11:56 UTC (permalink / raw)
  To: David Miller, mw
  Cc: gregory.clement, thomas.petazzoni, arnd, linux-arm-kernel,
	netdev, linux-kernel

Hi David, Marcin,

On Tue, 21 Feb 2017 11:16:02 -0500 David Miller wrote:

> From: Jisheng Zhang <jszhang@marvell.com>
> Date: Tue, 21 Feb 2017 12:37:40 +0800
> 
> > Thanks for your review.
> > 
> > The measurement is simple: record how much time we spent in mvneta_rx_swbm()
> > for receiving 1GB data, something as below:  
> 
> Please use a standard tool for measuring performance, rather than profiling
> the driver and trying to derive numbers that way.

Got your point. I will try to get performance with standard tool and cook a
v4 once rc1 is released.

Thanks,
Jisheng

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2017-02-24 12:02 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-20 12:53 [PATCH net-next v3 0/4] net: mvneta: improve rx/tx performance Jisheng Zhang
2017-02-20 12:53 ` Jisheng Zhang
2017-02-20 12:53 ` [PATCH net-next v3 1/4] net: mvneta: avoid getting status from rx_desc as much as possible Jisheng Zhang
2017-02-20 12:53   ` Jisheng Zhang
2017-02-20 12:53   ` Jisheng Zhang
2017-02-20 12:53 ` [PATCH net-next v3 2/4] net: mvneta: avoid getting buf_phys_addr from rx_desc again Jisheng Zhang
2017-02-20 12:53   ` Jisheng Zhang
2017-02-20 12:53 ` [PATCH net-next v3 3/4] net: mvneta: avoid reading from tx_desc as much as possible Jisheng Zhang
2017-02-20 12:53   ` Jisheng Zhang
2017-02-20 12:53 ` [PATCH net-next v3 4/4] net: mvneta: Use cacheable memory to store the rx buffer DMA address Jisheng Zhang
2017-02-20 12:53   ` Jisheng Zhang
2017-02-20 14:21 ` [PATCH net-next v3 0/4] net: mvneta: improve rx/tx performance Gregory CLEMENT
2017-02-20 14:21   ` Gregory CLEMENT
2017-02-21  4:37   ` Jisheng Zhang
2017-02-21  4:37     ` Jisheng Zhang
2017-02-21  4:37     ` Jisheng Zhang
2017-02-21 16:16     ` David Miller
2017-02-21 16:16       ` David Miller
2017-02-21 16:35       ` Marcin Wojtas
2017-02-21 16:35         ` Marcin Wojtas
2017-02-24 11:56       ` Jisheng Zhang
2017-02-24 11:56         ` Jisheng Zhang
