* [PATCH net-next] bcm63xx_enet: batch process rx path
@ 2020-12-04  5:46 Sieng Piaw Liew
  2020-12-04  5:46 ` [PATCH net-next] bcm63xx_enet: add BQL support Sieng Piaw Liew
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Sieng Piaw Liew @ 2020-12-04  5:46 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: bcm-kernel-feedback-list, Sieng Piaw Liew, netdev

Use netif_receive_skb_list to batch process rx skb.
Tested on BCM6328 320 MHz using iperf3 -M 512, increasing performance
by 12.5%.

Before:
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-30.00  sec   120 MBytes  33.7 Mbits/sec  277         sender
[  4]   0.00-30.00  sec   120 MBytes  33.5 Mbits/sec            receiver

After:
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-30.00  sec   136 MBytes  37.9 Mbits/sec  203         sender
[  4]   0.00-30.00  sec   135 MBytes  37.7 Mbits/sec            receiver

Signed-off-by: Sieng Piaw Liew <liew.s.piaw@gmail.com>
---
 drivers/net/ethernet/broadcom/bcm63xx_enet.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bcm63xx_enet.c b/drivers/net/ethernet/broadcom/bcm63xx_enet.c
index 916824cca3fd..b82b7805c36a 100644
--- a/drivers/net/ethernet/broadcom/bcm63xx_enet.c
+++ b/drivers/net/ethernet/broadcom/bcm63xx_enet.c
@@ -297,10 +297,12 @@ static void bcm_enet_refill_rx_timer(struct timer_list *t)
 static int bcm_enet_receive_queue(struct net_device *dev, int budget)
 {
 	struct bcm_enet_priv *priv;
+	struct list_head rx_list;
 	struct device *kdev;
 	int processed;
 
 	priv = netdev_priv(dev);
+	INIT_LIST_HEAD(&rx_list);
 	kdev = &priv->pdev->dev;
 	processed = 0;
 
@@ -391,10 +393,12 @@ static int bcm_enet_receive_queue(struct net_device *dev, int budget)
 		skb->protocol = eth_type_trans(skb, dev);
 		dev->stats.rx_packets++;
 		dev->stats.rx_bytes += len;
-		netif_receive_skb(skb);
+		list_add_tail(&skb->list, &rx_list);
 
 	} while (--budget > 0);
 
+	netif_receive_skb_list(&rx_list);
+
 	if (processed || !priv->rx_desc_count) {
 		bcm_enet_refill_rx(dev);
 
-- 
2.17.1



* [PATCH net-next] bcm63xx_enet: add BQL support
  2020-12-04  5:46 [PATCH net-next] bcm63xx_enet: batch process rx path Sieng Piaw Liew
@ 2020-12-04  5:46 ` Sieng Piaw Liew
  2020-12-04 18:27   ` Florian Fainelli
  2020-12-04  5:46 ` [PATCH net-next] bcm63xx_enet: alloc rx skb with NET_IP_ALIGN Sieng Piaw Liew
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: Sieng Piaw Liew @ 2020-12-04  5:46 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: bcm-kernel-feedback-list, Sieng Piaw Liew, netdev

Add Byte Queue Limits support to reduce/remove bufferbloat in
bcm63xx_enet.

Signed-off-by: Sieng Piaw Liew <liew.s.piaw@gmail.com>
---
 drivers/net/ethernet/broadcom/bcm63xx_enet.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bcm63xx_enet.c b/drivers/net/ethernet/broadcom/bcm63xx_enet.c
index b82b7805c36a..c1eba5fa3258 100644
--- a/drivers/net/ethernet/broadcom/bcm63xx_enet.c
+++ b/drivers/net/ethernet/broadcom/bcm63xx_enet.c
@@ -417,9 +417,11 @@ static int bcm_enet_receive_queue(struct net_device *dev, int budget)
 static int bcm_enet_tx_reclaim(struct net_device *dev, int force)
 {
 	struct bcm_enet_priv *priv;
+	unsigned int bytes;
 	int released;
 
 	priv = netdev_priv(dev);
+	bytes = 0;
 	released = 0;
 
 	while (priv->tx_desc_count < priv->tx_ring_size) {
@@ -456,10 +458,13 @@ static int bcm_enet_tx_reclaim(struct net_device *dev, int force)
 		if (desc->len_stat & DMADESC_UNDER_MASK)
 			dev->stats.tx_errors++;
 
+		bytes += skb->len;
 		dev_kfree_skb(skb);
 		released++;
 	}
 
+	netdev_completed_queue(dev, released, bytes);
+
 	if (netif_queue_stopped(dev) && released)
 		netif_wake_queue(dev);
 
@@ -626,6 +631,8 @@ bcm_enet_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	desc->len_stat = len_stat;
 	wmb();
 
+	netdev_sent_queue(dev, skb->len);
+
 	/* kick tx dma */
 	enet_dmac_writel(priv, priv->dma_chan_en_mask,
 				 ENETDMAC_CHANCFG, priv->tx_chan);
@@ -1069,6 +1076,7 @@ static int bcm_enet_open(struct net_device *dev)
 	else
 		bcm_enet_adjust_link(dev);
 
+	netdev_reset_queue(dev);
 	netif_start_queue(dev);
 	return 0;
 
@@ -2246,6 +2254,7 @@ static int bcm_enetsw_open(struct net_device *dev)
 			 ENETDMAC_IRMASK, priv->tx_chan);
 
 	netif_carrier_on(dev);
+	netdev_reset_queue(dev);
 	netif_start_queue(dev);
 
 	/* apply override config for bypass_link ports here. */
-- 
2.17.1



* [PATCH net-next] bcm63xx_enet: alloc rx skb with NET_IP_ALIGN
  2020-12-04  5:46 [PATCH net-next] bcm63xx_enet: batch process rx path Sieng Piaw Liew
  2020-12-04  5:46 ` [PATCH net-next] bcm63xx_enet: add BQL support Sieng Piaw Liew
@ 2020-12-04  5:46 ` Sieng Piaw Liew
  2020-12-04 18:28   ` Florian Fainelli
  2020-12-04  5:46 ` [PATCH net-next] bcm63xx_enet: convert to build_skb Sieng Piaw Liew
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: Sieng Piaw Liew @ 2020-12-04  5:46 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: bcm-kernel-feedback-list, Sieng Piaw Liew, netdev

Use netdev_alloc_skb_ip_align on newer SoCs with integrated switch
(enetsw) when refilling RX. Increases packet processing performance
by 30% (with netif_receive_skb_list).

Non-enetsw SoCs cannot function with the extra pad so continue to use
the regular netdev_alloc_skb.

Tested on BCM6328 320 MHz and iperf3 -M 512 to measure packet/sec
performance.

Before:
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-30.00 sec 120 MBytes 33.7 Mbits/sec 277 sender
[ 4] 0.00-30.00 sec 120 MBytes 33.5 Mbits/sec receiver

After (+netif_receive_skb_list):
[ 4] 0.00-30.00 sec 155 MBytes 43.3 Mbits/sec 354 sender
[ 4] 0.00-30.00 sec 154 MBytes 43.1 Mbits/sec receiver

Signed-off-by: Sieng Piaw Liew <liew.s.piaw@gmail.com>
---
 drivers/net/ethernet/broadcom/bcm63xx_enet.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bcm63xx_enet.c b/drivers/net/ethernet/broadcom/bcm63xx_enet.c
index c1eba5fa3258..92bff4758216 100644
--- a/drivers/net/ethernet/broadcom/bcm63xx_enet.c
+++ b/drivers/net/ethernet/broadcom/bcm63xx_enet.c
@@ -237,7 +237,10 @@ static int bcm_enet_refill_rx(struct net_device *dev)
 		desc = &priv->rx_desc_cpu[desc_idx];
 
 		if (!priv->rx_skb[desc_idx]) {
-			skb = netdev_alloc_skb(dev, priv->rx_skb_size);
+			if (priv->enet_is_sw)
+				skb = netdev_alloc_skb_ip_align(dev, priv->rx_skb_size);
+			else
+				skb = netdev_alloc_skb(dev, priv->rx_skb_size);
 			if (!skb)
 				break;
 			priv->rx_skb[desc_idx] = skb;
-- 
2.17.1



* [PATCH net-next] bcm63xx_enet: convert to build_skb
  2020-12-04  5:46 [PATCH net-next] bcm63xx_enet: batch process rx path Sieng Piaw Liew
  2020-12-04  5:46 ` [PATCH net-next] bcm63xx_enet: add BQL support Sieng Piaw Liew
  2020-12-04  5:46 ` [PATCH net-next] bcm63xx_enet: alloc rx skb with NET_IP_ALIGN Sieng Piaw Liew
@ 2020-12-04  5:46 ` Sieng Piaw Liew
  2020-12-04 10:14   ` Eric Dumazet
  2020-12-04  9:50 ` [PATCH net-next] bcm63xx_enet: batch process rx path Eric Dumazet
  2020-12-04 18:24 ` Florian Fainelli
  4 siblings, 1 reply; 10+ messages in thread
From: Sieng Piaw Liew @ 2020-12-04  5:46 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: bcm-kernel-feedback-list, Sieng Piaw Liew, netdev

We can increase the efficiency of the rx path by using buffers to
receive packets, then building SKBs around them just before passing
them into the network stack. In contrast, preallocating SKBs too early
reduces CPU cache efficiency.

Check if we're in NAPI context when refilling RX. Normally we're almost
always running in NAPI context. Dispatch to napi_alloc_frag directly
instead of relying on netdev_alloc_frag which still runs
local_bh_disable/enable.

Tested on BCM6328 320 MHz and iperf3 -M 512 to measure packet/sec
performance. Included netif_receive_skb_list and NET_IP_ALIGN
optimizations.

Before:
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  49.9 MBytes  41.9 Mbits/sec  197         sender
[  4]   0.00-10.00  sec  49.3 MBytes  41.3 Mbits/sec            receiver

After:
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-30.00  sec   171 MBytes  47.8 Mbits/sec  272         sender
[  4]   0.00-30.00  sec   170 MBytes  47.6 Mbits/sec            receiver

Signed-off-by: Sieng Piaw Liew <liew.s.piaw@gmail.com>
---
 drivers/net/ethernet/broadcom/bcm63xx_enet.c | 156 ++++++++-----------
 drivers/net/ethernet/broadcom/bcm63xx_enet.h |  14 +-
 2 files changed, 79 insertions(+), 91 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bcm63xx_enet.c b/drivers/net/ethernet/broadcom/bcm63xx_enet.c
index 92bff4758216..4ace540f0c92 100644
--- a/drivers/net/ethernet/broadcom/bcm63xx_enet.c
+++ b/drivers/net/ethernet/broadcom/bcm63xx_enet.c
@@ -220,7 +220,7 @@ static void bcm_enet_mdio_write_mii(struct net_device *dev, int mii_id,
 /*
  * refill rx queue
  */
-static int bcm_enet_refill_rx(struct net_device *dev)
+static int bcm_enet_refill_rx(struct net_device *dev, bool napi_mode)
 {
 	struct bcm_enet_priv *priv;
 
@@ -228,29 +228,29 @@ static int bcm_enet_refill_rx(struct net_device *dev)
 
 	while (priv->rx_desc_count < priv->rx_ring_size) {
 		struct bcm_enet_desc *desc;
-		struct sk_buff *skb;
-		dma_addr_t p;
 		int desc_idx;
 		u32 len_stat;
 
 		desc_idx = priv->rx_dirty_desc;
 		desc = &priv->rx_desc_cpu[desc_idx];
 
-		if (!priv->rx_skb[desc_idx]) {
-			if (priv->enet_is_sw)
-				skb = netdev_alloc_skb_ip_align(dev, priv->rx_skb_size);
+		if (!priv->rx_buf[desc_idx]) {
+			void *buf;
+
+			if (likely(napi_mode))
+				buf = napi_alloc_frag(priv->rx_frag_size);
 			else
-				skb = netdev_alloc_skb(dev, priv->rx_skb_size);
-			if (!skb)
+				buf = netdev_alloc_frag(priv->rx_frag_size);
+			if (unlikely(!buf))
 				break;
-			priv->rx_skb[desc_idx] = skb;
-			p = dma_map_single(&priv->pdev->dev, skb->data,
-					   priv->rx_skb_size,
-					   DMA_FROM_DEVICE);
-			desc->address = p;
+			priv->rx_buf[desc_idx] = buf;
+			desc->address = dma_map_single(&priv->pdev->dev,
+						       buf + priv->rx_buf_offset,
+						       priv->rx_buf_size,
+						       DMA_FROM_DEVICE);
 		}
 
-		len_stat = priv->rx_skb_size << DMADESC_LENGTH_SHIFT;
+		len_stat = priv->rx_buf_size << DMADESC_LENGTH_SHIFT;
 		len_stat |= DMADESC_OWNER_MASK;
 		if (priv->rx_dirty_desc == priv->rx_ring_size - 1) {
 			len_stat |= (DMADESC_WRAP_MASK >> priv->dma_desc_shift);
@@ -290,7 +290,7 @@ static void bcm_enet_refill_rx_timer(struct timer_list *t)
 	struct net_device *dev = priv->net_dev;
 
 	spin_lock(&priv->rx_lock);
-	bcm_enet_refill_rx(dev);
+	bcm_enet_refill_rx(dev, false);
 	spin_unlock(&priv->rx_lock);
 }
 
@@ -320,6 +320,7 @@ static int bcm_enet_receive_queue(struct net_device *dev, int budget)
 		int desc_idx;
 		u32 len_stat;
 		unsigned int len;
+		void *buf;
 
 		desc_idx = priv->rx_curr_desc;
 		desc = &priv->rx_desc_cpu[desc_idx];
@@ -365,16 +366,14 @@ static int bcm_enet_receive_queue(struct net_device *dev, int budget)
 		}
 
 		/* valid packet */
-		skb = priv->rx_skb[desc_idx];
+		buf = priv->rx_buf[desc_idx];
 		len = (len_stat & DMADESC_LENGTH_MASK) >> DMADESC_LENGTH_SHIFT;
 		/* don't include FCS */
 		len -= 4;
 
 		if (len < copybreak) {
-			struct sk_buff *nskb;
-
-			nskb = napi_alloc_skb(&priv->napi, len);
-			if (!nskb) {
+			skb = napi_alloc_skb(&priv->napi, len);
+			if (unlikely(!skb)) {
 				/* forget packet, just rearm desc */
 				dev->stats.rx_dropped++;
 				continue;
@@ -382,14 +381,20 @@ static int bcm_enet_receive_queue(struct net_device *dev, int budget)
 
 			dma_sync_single_for_cpu(kdev, desc->address,
 						len, DMA_FROM_DEVICE);
-			memcpy(nskb->data, skb->data, len);
+			memcpy(skb->data, buf + priv->rx_buf_offset, len);
 			dma_sync_single_for_device(kdev, desc->address,
 						   len, DMA_FROM_DEVICE);
-			skb = nskb;
 		} else {
-			dma_unmap_single(&priv->pdev->dev, desc->address,
-					 priv->rx_skb_size, DMA_FROM_DEVICE);
-			priv->rx_skb[desc_idx] = NULL;
+			dma_unmap_single(kdev, desc->address,
+					 priv->rx_buf_size, DMA_FROM_DEVICE);
+			skb = build_skb(buf, priv->rx_frag_size);
+			if (unlikely(!skb)) {
+				skb_free_frag(buf);
+				dev->stats.rx_dropped++;
+				continue;
+			}
+			priv->rx_buf[desc_idx] = NULL;
+			skb_reserve(skb, priv->rx_buf_offset);
 		}
 
 		skb_put(skb, len);
@@ -403,7 +408,7 @@ static int bcm_enet_receive_queue(struct net_device *dev, int budget)
 	netif_receive_skb_list(&rx_list);
 
 	if (processed || !priv->rx_desc_count) {
-		bcm_enet_refill_rx(dev);
+		bcm_enet_refill_rx(dev, true);
 
 		/* kick rx dma */
 		enet_dmac_writel(priv, priv->dma_chan_en_mask,
@@ -859,6 +864,24 @@ static void bcm_enet_adjust_link(struct net_device *dev)
 		priv->pause_tx ? "tx" : "off");
 }
 
+static void bcm_enet_free_rx_buf_ring(struct device *kdev, struct bcm_enet_priv *priv)
+{
+	int i;
+
+	for (i = 0; i < priv->rx_ring_size; i++) {
+		struct bcm_enet_desc *desc;
+
+		if (!priv->rx_buf[i])
+			continue;
+
+		desc = &priv->rx_desc_cpu[i];
+		dma_unmap_single(kdev, desc->address, priv->rx_buf_size,
+				 DMA_FROM_DEVICE);
+		skb_free_frag(priv->rx_buf[i]);
+	}
+	kfree(priv->rx_buf);
+}
+
 /*
  * open callback, allocate dma rings & buffers and start rx operation
  */
@@ -968,10 +991,10 @@ static int bcm_enet_open(struct net_device *dev)
 	priv->tx_curr_desc = 0;
 	spin_lock_init(&priv->tx_lock);
 
-	/* init & fill rx ring with skbs */
-	priv->rx_skb = kcalloc(priv->rx_ring_size, sizeof(struct sk_buff *),
+	/* init & fill rx ring with buffers */
+	priv->rx_buf = kcalloc(priv->rx_ring_size, sizeof(void *),
 			       GFP_KERNEL);
-	if (!priv->rx_skb) {
+	if (!priv->rx_buf) {
 		ret = -ENOMEM;
 		goto out_free_tx_skb;
 	}
@@ -988,7 +1011,7 @@ static int bcm_enet_open(struct net_device *dev)
 		enet_dmac_writel(priv, ENETDMA_BUFALLOC_FORCE_MASK | 0,
 				ENETDMAC_BUFALLOC, priv->rx_chan);
 
-	if (bcm_enet_refill_rx(dev)) {
+	if (bcm_enet_refill_rx(dev, false)) {
 		dev_err(kdev, "cannot allocate rx skb queue\n");
 		ret = -ENOMEM;
 		goto out;
@@ -1084,18 +1107,7 @@ static int bcm_enet_open(struct net_device *dev)
 	return 0;
 
 out:
-	for (i = 0; i < priv->rx_ring_size; i++) {
-		struct bcm_enet_desc *desc;
-
-		if (!priv->rx_skb[i])
-			continue;
-
-		desc = &priv->rx_desc_cpu[i];
-		dma_unmap_single(kdev, desc->address, priv->rx_skb_size,
-				 DMA_FROM_DEVICE);
-		kfree_skb(priv->rx_skb[i]);
-	}
-	kfree(priv->rx_skb);
+	bcm_enet_free_rx_buf_ring(kdev, priv);
 
 out_free_tx_skb:
 	kfree(priv->tx_skb);
@@ -1174,7 +1186,6 @@ static int bcm_enet_stop(struct net_device *dev)
 {
 	struct bcm_enet_priv *priv;
 	struct device *kdev;
-	int i;
 
 	priv = netdev_priv(dev);
 	kdev = &priv->pdev->dev;
@@ -1201,21 +1212,10 @@ static int bcm_enet_stop(struct net_device *dev)
 	/* force reclaim of all tx buffers */
 	bcm_enet_tx_reclaim(dev, 1);
 
-	/* free the rx skb ring */
-	for (i = 0; i < priv->rx_ring_size; i++) {
-		struct bcm_enet_desc *desc;
-
-		if (!priv->rx_skb[i])
-			continue;
-
-		desc = &priv->rx_desc_cpu[i];
-		dma_unmap_single(kdev, desc->address, priv->rx_skb_size,
-				 DMA_FROM_DEVICE);
-		kfree_skb(priv->rx_skb[i]);
-	}
+	/* free the rx buffer ring */
+	bcm_enet_free_rx_buf_ring(kdev, priv);
 
 	/* free remaining allocated memory */
-	kfree(priv->rx_skb);
 	kfree(priv->tx_skb);
 	dma_free_coherent(kdev, priv->rx_desc_alloc_size,
 			  priv->rx_desc_cpu, priv->rx_desc_dma);
@@ -1637,9 +1637,12 @@ static int bcm_enet_change_mtu(struct net_device *dev, int new_mtu)
 	 * align rx buffer size to dma burst len, account FCS since
 	 * it's appended
 	 */
-	priv->rx_skb_size = ALIGN(actual_mtu + ETH_FCS_LEN,
+	priv->rx_buf_size = ALIGN(actual_mtu + ETH_FCS_LEN,
 				  priv->dma_maxburst * 4);
 
+	priv->rx_frag_size = SKB_DATA_ALIGN(priv->rx_buf_size + priv->rx_buf_offset)
+						+ SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+
 	dev->mtu = new_mtu;
 	return 0;
 }
@@ -1724,6 +1727,7 @@ static int bcm_enet_probe(struct platform_device *pdev)
 
 	priv->enet_is_sw = false;
 	priv->dma_maxburst = BCMENET_DMA_MAXBURST;
+	priv->rx_buf_offset = NET_SKB_PAD;
 
 	ret = bcm_enet_change_mtu(dev, dev->mtu);
 	if (ret)
@@ -2151,10 +2155,10 @@ static int bcm_enetsw_open(struct net_device *dev)
 	priv->tx_curr_desc = 0;
 	spin_lock_init(&priv->tx_lock);
 
-	/* init & fill rx ring with skbs */
-	priv->rx_skb = kcalloc(priv->rx_ring_size, sizeof(struct sk_buff *),
+	/* init & fill rx ring with buffers */
+	priv->rx_buf = kcalloc(priv->rx_ring_size, sizeof(void *),
 			       GFP_KERNEL);
-	if (!priv->rx_skb) {
+	if (!priv->rx_buf) {
 		dev_err(kdev, "cannot allocate rx skb queue\n");
 		ret = -ENOMEM;
 		goto out_free_tx_skb;
@@ -2202,7 +2206,7 @@ static int bcm_enetsw_open(struct net_device *dev)
 	enet_dma_writel(priv, ENETDMA_BUFALLOC_FORCE_MASK | 0,
 			ENETDMA_BUFALLOC_REG(priv->rx_chan));
 
-	if (bcm_enet_refill_rx(dev)) {
+	if (bcm_enet_refill_rx(dev, false)) {
 		dev_err(kdev, "cannot allocate rx skb queue\n");
 		ret = -ENOMEM;
 		goto out;
@@ -2303,18 +2307,7 @@ static int bcm_enetsw_open(struct net_device *dev)
 	return 0;
 
 out:
-	for (i = 0; i < priv->rx_ring_size; i++) {
-		struct bcm_enet_desc *desc;
-
-		if (!priv->rx_skb[i])
-			continue;
-
-		desc = &priv->rx_desc_cpu[i];
-		dma_unmap_single(kdev, desc->address, priv->rx_skb_size,
-				 DMA_FROM_DEVICE);
-		kfree_skb(priv->rx_skb[i]);
-	}
-	kfree(priv->rx_skb);
+	bcm_enet_free_rx_buf_ring(kdev, priv);
 
 out_free_tx_skb:
 	kfree(priv->tx_skb);
@@ -2343,7 +2336,6 @@ static int bcm_enetsw_stop(struct net_device *dev)
 {
 	struct bcm_enet_priv *priv;
 	struct device *kdev;
-	int i;
 
 	priv = netdev_priv(dev);
 	kdev = &priv->pdev->dev;
@@ -2364,21 +2356,10 @@ static int bcm_enetsw_stop(struct net_device *dev)
 	/* force reclaim of all tx buffers */
 	bcm_enet_tx_reclaim(dev, 1);
 
-	/* free the rx skb ring */
-	for (i = 0; i < priv->rx_ring_size; i++) {
-		struct bcm_enet_desc *desc;
-
-		if (!priv->rx_skb[i])
-			continue;
-
-		desc = &priv->rx_desc_cpu[i];
-		dma_unmap_single(kdev, desc->address, priv->rx_skb_size,
-				 DMA_FROM_DEVICE);
-		kfree_skb(priv->rx_skb[i]);
-	}
+	/* free the rx buffer ring */
+	bcm_enet_free_rx_buf_ring(kdev, priv);
 
 	/* free remaining allocated memory */
-	kfree(priv->rx_skb);
 	kfree(priv->tx_skb);
 	dma_free_coherent(kdev, priv->rx_desc_alloc_size,
 			  priv->rx_desc_cpu, priv->rx_desc_dma);
@@ -2675,6 +2656,7 @@ static int bcm_enetsw_probe(struct platform_device *pdev)
 	priv->rx_ring_size = BCMENET_DEF_RX_DESC;
 	priv->tx_ring_size = BCMENET_DEF_TX_DESC;
 	priv->dma_maxburst = BCMENETSW_DMA_MAXBURST;
+	priv->rx_buf_offset = NET_SKB_PAD + NET_IP_ALIGN;
 
 	pd = dev_get_platdata(&pdev->dev);
 	if (pd) {
diff --git a/drivers/net/ethernet/broadcom/bcm63xx_enet.h b/drivers/net/ethernet/broadcom/bcm63xx_enet.h
index 1d3c917eb830..78f1830fb3cb 100644
--- a/drivers/net/ethernet/broadcom/bcm63xx_enet.h
+++ b/drivers/net/ethernet/broadcom/bcm63xx_enet.h
@@ -230,11 +230,17 @@ struct bcm_enet_priv {
 	/* next dirty rx descriptor to refill */
 	int rx_dirty_desc;
 
-	/* size of allocated rx skbs */
-	unsigned int rx_skb_size;
+	/* size of allocated rx buffers */
+	unsigned int rx_buf_size;
 
-	/* list of skb given to hw for rx */
-	struct sk_buff **rx_skb;
+	/* allocated rx buffer offset */
+	unsigned int rx_buf_offset;
+
+	/* size of allocated rx frag */
+	unsigned int rx_frag_size;
+
+	/* list of buffer given to hw for rx */
+	void **rx_buf;
 
 	/* used when rx skb allocation failed, so we defer rx queue
 	 * refill */
-- 
2.17.1



* Re: [PATCH net-next] bcm63xx_enet: batch process rx path
  2020-12-04  5:46 [PATCH net-next] bcm63xx_enet: batch process rx path Sieng Piaw Liew
                   ` (2 preceding siblings ...)
  2020-12-04  5:46 ` [PATCH net-next] bcm63xx_enet: convert to build_skb Sieng Piaw Liew
@ 2020-12-04  9:50 ` Eric Dumazet
  2020-12-09  3:33   ` Sieng Piaw Liew
  2020-12-04 18:24 ` Florian Fainelli
  4 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2020-12-04  9:50 UTC (permalink / raw)
  To: Sieng Piaw Liew, Florian Fainelli; +Cc: bcm-kernel-feedback-list, netdev



On 12/4/20 6:46 AM, Sieng Piaw Liew wrote:
> Use netif_receive_skb_list to batch process rx skb.
> Tested on BCM6328 320 MHz using iperf3 -M 512, increasing performance
> by 12.5%.
> 



Well, the real question is why you do not simply use GRO,
to get 100% performance gain or more for TCP flows.


netif_receive_skb_list() is no longer needed;
the GRO layer already uses batching for non-TCP packets.

We probably should mark it as deprecated.

diff --git a/drivers/net/ethernet/broadcom/bcm63xx_enet.c b/drivers/net/ethernet/broadcom/bcm63xx_enet.c
index 916824cca3fda194c42fefec7f514ced1a060043..6fdbe231b7c1b27f523889bda8a20ab7eaab65a6 100644
--- a/drivers/net/ethernet/broadcom/bcm63xx_enet.c
+++ b/drivers/net/ethernet/broadcom/bcm63xx_enet.c
@@ -391,7 +391,7 @@ static int bcm_enet_receive_queue(struct net_device *dev, int budget)
                skb->protocol = eth_type_trans(skb, dev);
                dev->stats.rx_packets++;
                dev->stats.rx_bytes += len;
-               netif_receive_skb(skb);
+               napi_gro_receive(&priv->napi, skb);
 
        } while (--budget > 0);
 


* Re: [PATCH net-next] bcm63xx_enet: convert to build_skb
  2020-12-04  5:46 ` [PATCH net-next] bcm63xx_enet: convert to build_skb Sieng Piaw Liew
@ 2020-12-04 10:14   ` Eric Dumazet
  0 siblings, 0 replies; 10+ messages in thread
From: Eric Dumazet @ 2020-12-04 10:14 UTC (permalink / raw)
  To: Sieng Piaw Liew, Florian Fainelli; +Cc: bcm-kernel-feedback-list, netdev



On 12/4/20 6:46 AM, Sieng Piaw Liew wrote:
> We can increase the efficiency of the rx path by using buffers to
> receive packets, then building SKBs around them just before passing
> them into the network stack. In contrast, preallocating SKBs too early
> reduces CPU cache efficiency.
> 
> Check if we're in NAPI context when refilling RX. Normally we're almost
> always running in NAPI context. Dispatch to napi_alloc_frag directly
> instead of relying on netdev_alloc_frag which still runs
> local_bh_disable/enable.
> 
> Tested on BCM6328 320 MHz and iperf3 -M 512 to measure packet/sec
> performance. Included netif_receive_skb_list and NET_IP_ALIGN
> optimizations.
> 
> Before:
> [ ID] Interval           Transfer     Bandwidth       Retr
> [  4]   0.00-10.00  sec  49.9 MBytes  41.9 Mbits/sec  197         sender
> [  4]   0.00-10.00  sec  49.3 MBytes  41.3 Mbits/sec            receiver
> 
> After:
> [ ID] Interval           Transfer     Bandwidth       Retr
> [  4]   0.00-30.00  sec   171 MBytes  47.8 Mbits/sec  272         sender
> [  4]   0.00-30.00  sec   170 MBytes  47.6 Mbits/sec            receiver

Please test this again after GRO has been added to this driver.

Problem with build_skb() is that overall skb truesize after GRO might be increased
a lot, since we have sizeof(struct skb_shared_info) added overhead per MSS,
and this can double the truesize depending on device MTU.

This matters on long RTT flows, because an inflation of skb->truesize reduces
TCP receive window quite a lot.

Ideally if you want best performance, this driver should use napi_gro_frags(),
so that skb->len/skb->truesize is the smallest one.

In order to test your change you need to set up a testbed with 
10ms or 50ms delay between the hosts, unless this driver is only used
by hosts on the same LAN (which I doubt)
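
For readers unfamiliar with that API, here is a rough sketch of the
napi_gro_frags() receive path Eric describes. It is an illustration
only, not code from this thread: the helper name is made up, the
page/offset bookkeeping is simplified, and it borrows the
rx_buf_offset/rx_frag_size fields introduced by the build_skb patch.
Descriptor handling, stats and error paths are omitted.

static void bcm_enet_rx_frag_gro(struct bcm_enet_priv *priv,
				 void *buf, unsigned int len)
{
	/* reuse the per-NAPI skb instead of allocating one per packet */
	struct sk_buff *skb = napi_get_frags(&priv->napi);

	if (unlikely(!skb)) {
		skb_free_frag(buf);
		return;
	}

	/* attach the buffer as a page fragment; skb->truesize grows by
	 * the fragment size only, which keeps GRO'd skbs small
	 * (page/offset handling simplified for illustration)
	 */
	skb_add_rx_frag(skb, 0, virt_to_page(buf),
			offset_in_page(buf) + priv->rx_buf_offset,
			len, priv->rx_frag_size);

	/* GRO pulls the ethernet header and coalesces TCP segments */
	napi_gro_frags(&priv->napi);
}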




* Re: [PATCH net-next] bcm63xx_enet: batch process rx path
  2020-12-04  5:46 [PATCH net-next] bcm63xx_enet: batch process rx path Sieng Piaw Liew
                   ` (3 preceding siblings ...)
  2020-12-04  9:50 ` [PATCH net-next] bcm63xx_enet: batch process rx path Eric Dumazet
@ 2020-12-04 18:24 ` Florian Fainelli
  4 siblings, 0 replies; 10+ messages in thread
From: Florian Fainelli @ 2020-12-04 18:24 UTC (permalink / raw)
  To: Sieng Piaw Liew; +Cc: bcm-kernel-feedback-list, netdev



On 12/3/2020 9:46 PM, Sieng Piaw Liew wrote:
> Use netif_receive_skb_list to batch process rx skb.
> Tested on BCM6328 320 MHz using iperf3 -M 512, increasing performance
> by 12.5%.
> 
> Before:
> [ ID] Interval           Transfer     Bandwidth       Retr
> [  4]   0.00-30.00  sec   120 MBytes  33.7 Mbits/sec  277         sender
> [  4]   0.00-30.00  sec   120 MBytes  33.5 Mbits/sec            receiver
> 
> After:
> [ ID] Interval           Transfer     Bandwidth       Retr
> [  4]   0.00-30.00  sec   136 MBytes  37.9 Mbits/sec  203         sender
> [  4]   0.00-30.00  sec   135 MBytes  37.7 Mbits/sec            receiver
> 
> Signed-off-by: Sieng Piaw Liew <liew.s.piaw@gmail.com>

Your patches are all dependent on one another and are part of a series,
so please add a cover letter and order them so that they can be applied
in the correct order, after you address Eric's feedback. Thank you
-- 
Florian


* Re: [PATCH net-next] bcm63xx_enet: add BQL support
  2020-12-04  5:46 ` [PATCH net-next] bcm63xx_enet: add BQL support Sieng Piaw Liew
@ 2020-12-04 18:27   ` Florian Fainelli
  0 siblings, 0 replies; 10+ messages in thread
From: Florian Fainelli @ 2020-12-04 18:27 UTC (permalink / raw)
  To: Sieng Piaw Liew, Florian Fainelli; +Cc: bcm-kernel-feedback-list, netdev



On 12/3/2020 9:46 PM, Sieng Piaw Liew wrote:
> Add Byte Queue Limits support to reduce/remove bufferbloat in
> bcm63xx_enet.
> 
> Signed-off-by: Sieng Piaw Liew <liew.s.piaw@gmail.com>
> ---
>  drivers/net/ethernet/broadcom/bcm63xx_enet.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/net/ethernet/broadcom/bcm63xx_enet.c b/drivers/net/ethernet/broadcom/bcm63xx_enet.c
> index b82b7805c36a..c1eba5fa3258 100644
> --- a/drivers/net/ethernet/broadcom/bcm63xx_enet.c
> +++ b/drivers/net/ethernet/broadcom/bcm63xx_enet.c
> @@ -417,9 +417,11 @@ static int bcm_enet_receive_queue(struct net_device *dev, int budget)
>  static int bcm_enet_tx_reclaim(struct net_device *dev, int force)
>  {
>  	struct bcm_enet_priv *priv;
> +	unsigned int bytes;
>  	int released;
>  
>  	priv = netdev_priv(dev);
> +	bytes = 0;
>  	released = 0;
>  
>  	while (priv->tx_desc_count < priv->tx_ring_size) {
> @@ -456,10 +458,13 @@ static int bcm_enet_tx_reclaim(struct net_device *dev, int force)
>  		if (desc->len_stat & DMADESC_UNDER_MASK)
>  			dev->stats.tx_errors++;
>  
> +		bytes += skb->len;
>  		dev_kfree_skb(skb);
>  		released++;
>  	}
>  
> +	netdev_completed_queue(dev, released, bytes);
> +
>  	if (netif_queue_stopped(dev) && released)
>  		netif_wake_queue(dev);
>  
> @@ -626,6 +631,8 @@ bcm_enet_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  	desc->len_stat = len_stat;
>  	wmb();
>  
> +	netdev_sent_queue(dev, skb->len);
> +
>  	/* kick tx dma */
>  	enet_dmac_writel(priv, priv->dma_chan_en_mask,
>  				 ENETDMAC_CHANCFG, priv->tx_chan);
> @@ -1069,6 +1076,7 @@ static int bcm_enet_open(struct net_device *dev)
>  	else
>  		bcm_enet_adjust_link(dev);
>  
> +	netdev_reset_queue(dev);
>  	netif_start_queue(dev);
>  	return 0;
>  
> @@ -2246,6 +2254,7 @@ static int bcm_enetsw_open(struct net_device *dev)
>  			 ENETDMAC_IRMASK, priv->tx_chan);
>  
>  	netif_carrier_on(dev);
> +	netdev_reset_queue(dev);
>  	netif_start_queue(dev);

Doing netdev_reset_queue() in bcm_enetsw_stop() and bcm_enet_stop()
would feel more natural and be consistent with where the resources are
being shut down. With that fixed:

Acked-by: Florian Fainelli <f.fainelli@gmail.com>
-- 
Florian
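
For reference, the placement Florian suggests would look roughly like
the hunk below. This is an illustration only, not a patch from this
thread; bcm_enetsw_stop() would get the same call next to its own
bcm_enet_tx_reclaim(dev, 1).

@@ static int bcm_enet_stop(struct net_device *dev)
 	/* force reclaim of all tx buffers */
 	bcm_enet_tx_reclaim(dev, 1);
 
+	/* reset BQL state together with the rest of the tx resources */
+	netdev_reset_queue(dev);
+
 	/* free the rx skb ring */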


* Re: [PATCH net-next] bcm63xx_enet: alloc rx skb with NET_IP_ALIGN
  2020-12-04  5:46 ` [PATCH net-next] bcm63xx_enet: alloc rx skb with NET_IP_ALIGN Sieng Piaw Liew
@ 2020-12-04 18:28   ` Florian Fainelli
  0 siblings, 0 replies; 10+ messages in thread
From: Florian Fainelli @ 2020-12-04 18:28 UTC (permalink / raw)
  To: Sieng Piaw Liew, Florian Fainelli; +Cc: bcm-kernel-feedback-list, netdev



On 12/3/2020 9:46 PM, Sieng Piaw Liew wrote:
> Use netdev_alloc_skb_ip_align on newer SoCs with integrated switch
> (enetsw) when refilling RX. Increases packet processing performance
> by 30% (with netif_receive_skb_list).
> 
> Non-enetsw SoCs cannot function with the extra pad so continue to use
> the regular netdev_alloc_skb.
> 
> Tested on BCM6328 320 MHz and iperf3 -M 512 to measure packet/sec
> performance.
> 
> Before:
> [ ID] Interval Transfer Bandwidth Retr
> [ 4] 0.00-30.00 sec 120 MBytes 33.7 Mbits/sec 277 sender
> [ 4] 0.00-30.00 sec 120 MBytes 33.5 Mbits/sec receiver
> 
> After (+netif_receive_skb_list):
> [ 4] 0.00-30.00 sec 155 MBytes 43.3 Mbits/sec 354 sender
> [ 4] 0.00-30.00 sec 154 MBytes 43.1 Mbits/sec receiver
> 
> Signed-off-by: Sieng Piaw Liew <liew.s.piaw@gmail.com>

Acked-by: Florian Fainelli <f.fainelli@gmail.com>
-- 
Florian


* Re: [PATCH net-next] bcm63xx_enet: batch process rx path
  2020-12-04  9:50 ` [PATCH net-next] bcm63xx_enet: batch process rx path Eric Dumazet
@ 2020-12-09  3:33   ` Sieng Piaw Liew
  0 siblings, 0 replies; 10+ messages in thread
From: Sieng Piaw Liew @ 2020-12-09  3:33 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Florian Fainelli, bcm-kernel-feedback-list, netdev, liew.s.piaw

On Fri, Dec 04, 2020 at 10:50:45AM +0100, Eric Dumazet wrote:
> 
> 
> On 12/4/20 6:46 AM, Sieng Piaw Liew wrote:
> > Use netif_receive_skb_list to batch process rx skb.
> > Tested on BCM6328 320 MHz using iperf3 -M 512, increasing performance
> > by 12.5%.
> > 
> 
> 
> 
> Well, the real question is why you do not simply use GRO,
> to get 100% performance gain or more for TCP flows.
> 
> 
> netif_receive_skb_list() is no longer needed;
> the GRO layer already uses batching for non-TCP packets.
> 
> We probably should mark it as deprecated.
> 
> diff --git a/drivers/net/ethernet/broadcom/bcm63xx_enet.c b/drivers/net/ethernet/broadcom/bcm63xx_enet.c
> index 916824cca3fda194c42fefec7f514ced1a060043..6fdbe231b7c1b27f523889bda8a20ab7eaab65a6 100644
> --- a/drivers/net/ethernet/broadcom/bcm63xx_enet.c
> +++ b/drivers/net/ethernet/broadcom/bcm63xx_enet.c
> @@ -391,7 +391,7 @@ static int bcm_enet_receive_queue(struct net_device *dev, int budget)
>                 skb->protocol = eth_type_trans(skb, dev);
>                 dev->stats.rx_packets++;
>                 dev->stats.rx_bytes += len;
> -               netif_receive_skb(skb);
> +               napi_gro_receive(&priv->napi, skb);
>  
>         } while (--budget > 0);
>  

The bcm63xx router SoC has neither enough CPU power nor a hardware
accelerator to perform checksum validation fast enough for GRO/GSO.

I have tested napi_gro_receive() on a LAN-WAN setup. The resulting
bandwidth dropped from 95Mbps wire speed down to 80Mbps, and it was
inconsistent, with spikes and drops of >5Mbps.

The ag71xx driver for ath79 router SoC reverted its use for the same
reason.
http://lists.infradead.org/pipermail/lede-commits/2017-October/004864.html
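
To make the checksum point concrete: GRO only comes cheap when the NIC
validates the L4 checksum, so the stack can skip the software check. A
minimal sketch, assuming a hypothetical hardware checksum status bit
(bcm63xx_enet has no such offload, which is the point above):

	/* hw_csum_ok is hypothetical; this driver has no rx csum offload */
	if (hw_csum_ok)
		skb->ip_summed = CHECKSUM_UNNECESSARY;
	napi_gro_receive(&priv->napi, skb);

Without CHECKSUM_UNNECESSARY (or CHECKSUM_COMPLETE), the GRO layer has
to compute the checksum of every segment in software before it can
merge them, which is the per-packet cost described above.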


