netdev.vger.kernel.org archive mirror
* [PATCH v1 0/6] *** net: fec: Enable Software TSO to improve the tx performance ***
@ 2014-05-30  2:05 Fugang Duan
  2014-05-30  2:05 ` [PATCH v1 1/6] net: fec: Factorize the .xmit transmit function Fugang Duan
                   ` (5 more replies)
  0 siblings, 6 replies; 28+ messages in thread
From: Fugang Duan @ 2014-05-30  2:05 UTC (permalink / raw)
  To: b20596, davem
  Cc: ezequiel.garcia, netdev, shawn.guo, bhutchings, b38611, stephen

Add SG and software TSO support for FEC.
This feature improves outbound throughput performance.
Tested on an imx6dl sabresd board, running iperf TCP tests shows:
	* 82% improvement compared with the driver without the SG & TSO patches

$ ethtool -K eth0 sg on
$ ethtool -K eth0 tso on
[  3] local 10.192.242.108 port 35388 connected with 10.192.242.167 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 3.0 sec   181 MBytes   506 Mbits/sec

$ ethtool -K eth0 sg off
$ ethtool -K eth0 tso off
[  3] local 10.192.242.108 port 52618 connected with 10.192.242.167 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 3.0 sec  99.5 MBytes   278 Mbits/sec

The FEC hardware supports IP header and TCP/UDP checksum offload and can
transfer one frame across multiple buffer descriptors, but it does not
support hardware TSO, so FEC bandwidth is limited by this.

This patch set just enables the TSO feature.  Tested on an imx6dl sabresd
board, it shows an 82% improvement.  Software TSO was also tested on the
imx6sx platform: CPU load drops from 100% to 52% at the same bandwidth.
Later, I will send patches to make the FEC driver support the imx6sx
platform.
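
For reference, the software TSO in this series is built on the generic
net/tso.h helpers. A rough sketch of the per-segment loop (descriptor
queueing and error handling elided; see patch 6/6 for the real code,
where hdr points into a DMA-able header area of TSO_HEADER_SIZE per BD):

	struct tso_t tso;
	int hdr_len = skb_transport_offset(skb) + tcp_hdrlen(skb);
	int total_len = skb->len - hdr_len;
	char *hdr;	/* per-BD slot in the DMA-able header area */

	tso_start(skb, &tso);
	while (total_len > 0) {
		int data_left = min_t(int, skb_shinfo(skb)->gso_size, total_len);

		total_len -= data_left;
		/* rebuild MAC + IP + TCP headers for this segment */
		tso_build_hdr(skb, hdr, &tso, data_left, total_len == 0);
		/* ...queue one BD pointing at the rebuilt headers... */

		while (data_left > 0) {
			int size = min_t(int, tso.size, data_left);

			/* ...queue one BD for size bytes at tso.data... */
			data_left -= size;
			tso_build_data(skb, &tso, size);
		}
	}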


Fugang Duan (6):
  net: fec: Factorize the .xmit transmit function
  net: fec: Enable IP header hardware checksum
  net: fec: Factorize feature setting
  net: fec: Increase buffer descriptor entry number
  net: fec: Add Scatter/gather support
  net: fec: Add software TSO support

 drivers/net/ethernet/freescale/fec.h      |   12 +-
 drivers/net/ethernet/freescale/fec_main.c |  510 +++++++++++++++++++++++-----
 2 files changed, 427 insertions(+), 95 deletions(-)

-- 
1.7.8


* [PATCH v1 1/6] net: fec: Factorize the .xmit transmit function
  2014-05-30  2:05 [PATCH v1 0/6] *** net: fec: Enable Software TSO to improve the tx performance *** Fugang Duan
@ 2014-05-30  2:05 ` Fugang Duan
  2014-05-30  2:05 ` [PATCH v1 2/6] net: fec: Enable IP header hardware checksum Fugang Duan
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Fugang Duan @ 2014-05-30  2:05 UTC (permalink / raw)
  To: b20596, davem
  Cc: ezequiel.garcia, netdev, shawn.guo, bhutchings, b38611, stephen

Make the code more readable and easier to extend with features like
SG and TSO by moving the common transmit path into one function.

The patch also factors the BD index calculation out into its own
function.

Signed-off-by: Fugang Duan <B38611@freescale.com>
---
 drivers/net/ethernet/freescale/fec_main.c |   87 ++++++++++++++++++-----------
 1 files changed, 54 insertions(+), 33 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 4d989b2..96c1a07 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -286,6 +286,25 @@ struct bufdesc *fec_enet_get_prevdesc(struct bufdesc *bdp, struct fec_enet_priva
 		return (new_bd < base) ? (new_bd + ring_size) : new_bd;
 }
 
+static inline
+int fec_enet_get_bd_index(struct bufdesc *bdp, struct fec_enet_private *fep)
+{
+	struct bufdesc *base;
+	int index;
+
+	if (bdp >= fep->tx_bd_base)
+		base = fep->tx_bd_base;
+	else
+		base = fep->rx_bd_base;
+
+	if (fep->bufdesc_ex)
+		index = (struct bufdesc_ex *)bdp - (struct bufdesc_ex *)base;
+	else
+		index = bdp - base;
+
+	return index;
+}
+
 static void *swap_buffer(void *bufaddr, int len)
 {
 	int i;
@@ -312,8 +331,7 @@ fec_enet_clear_csum(struct sk_buff *skb, struct net_device *ndev)
 	return 0;
 }
 
-static netdev_tx_t
-fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
+static int txq_submit_skb(struct sk_buff *skb, struct net_device *ndev)
 {
 	struct fec_enet_private *fep = netdev_priv(ndev);
 	const struct platform_device_id *id_entry =
@@ -328,14 +346,6 @@ fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 
 	status = bdp->cbd_sc;
 
-	if (status & BD_ENET_TX_READY) {
-		/* Ooops.  All transmit buffers are full.  Bail out.
-		 * This should not happen, since ndev->tbusy should be set.
-		 */
-		netdev_err(ndev, "tx queue full!\n");
-		return NETDEV_TX_BUSY;
-	}
-
 	/* Protocol checksum off-load for TCP and UDP. */
 	if (fec_enet_clear_csum(skb, ndev)) {
 		dev_kfree_skb_any(skb);
@@ -349,16 +359,7 @@ fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 	bufaddr = skb->data;
 	bdp->cbd_datlen = skb->len;
 
-	/*
-	 * On some FEC implementations data must be aligned on
-	 * 4-byte boundaries. Use bounce buffers to copy data
-	 * and get it aligned. Ugh.
-	 */
-	if (fep->bufdesc_ex)
-		index = (struct bufdesc_ex *)bdp -
-			(struct bufdesc_ex *)fep->tx_bd_base;
-	else
-		index = bdp - fep->tx_bd_base;
+	index = fec_enet_get_bd_index(bdp, fep);
 
 	if (((unsigned long) bufaddr) & FEC_ALIGNMENT) {
 		memcpy(fep->tx_bounce[index], skb->data, skb->len);
@@ -432,15 +433,43 @@ fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 
 	fep->cur_tx = bdp;
 
-	if (fep->cur_tx == fep->dirty_tx)
-		netif_stop_queue(ndev);
-
 	/* Trigger transmission start */
 	writel(0, fep->hwp + FEC_X_DES_ACTIVE);
 
 	return NETDEV_TX_OK;
 }
 
+static netdev_tx_t
+fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
+{
+	struct fec_enet_private *fep = netdev_priv(ndev);
+	struct bufdesc *bdp;
+	unsigned short	status;
+	int ret;
+
+	/* Fill in a Tx ring entry */
+	bdp = fep->cur_tx;
+
+	status = bdp->cbd_sc;
+
+	if (status & BD_ENET_TX_READY) {
+		/* Ooops.  All transmit buffers are full.  Bail out.
+		 * This should not happen, since ndev->tbusy should be set.
+		 */
+		netdev_err(ndev, "tx queue full!\n");
+		return NETDEV_TX_BUSY;
+	}
+
+	ret = txq_submit_skb(skb, ndev);
+	if (ret == -EBUSY)
+		return NETDEV_TX_BUSY;
+
+	if (fep->cur_tx == fep->dirty_tx)
+		netif_stop_queue(ndev);
+
+	return NETDEV_TX_OK;
+}
+
 /* Init RX & TX buffer descriptors
  */
 static void fec_enet_bd_init(struct net_device *dev)
@@ -769,11 +798,7 @@ fec_enet_tx(struct net_device *ndev)
 		if (bdp == fep->cur_tx)
 			break;
 
-		if (fep->bufdesc_ex)
-			index = (struct bufdesc_ex *)bdp -
-				(struct bufdesc_ex *)fep->tx_bd_base;
-		else
-			index = bdp - fep->tx_bd_base;
+		index = fec_enet_get_bd_index(bdp, fep);
 
 		skb = fep->tx_skbuff[index];
 		dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr, skb->len,
@@ -920,11 +945,7 @@ fec_enet_rx(struct net_device *ndev, int budget)
 		pkt_len = bdp->cbd_datlen;
 		ndev->stats.rx_bytes += pkt_len;
 
-		if (fep->bufdesc_ex)
-			index = (struct bufdesc_ex *)bdp -
-				(struct bufdesc_ex *)fep->rx_bd_base;
-		else
-			index = bdp - fep->rx_bd_base;
+		index = fec_enet_get_bd_index(bdp, fep);
 		data = fep->rx_skbuff[index]->data;
 		dma_sync_single_for_cpu(&fep->pdev->dev, bdp->cbd_bufaddr,
 					FEC_ENET_RX_FRSIZE, DMA_FROM_DEVICE);
-- 
1.7.8


* [PATCH v1 2/6] net: fec: Enable IP header hardware checksum
  2014-05-30  2:05 [PATCH v1 0/6] *** net: fec: Enable Software TSO to improve the tx performance *** Fugang Duan
  2014-05-30  2:05 ` [PATCH v1 1/6] net: fec: Factorize the .xmit transmit function Fugang Duan
@ 2014-05-30  2:05 ` Fugang Duan
  2014-05-30  2:05 ` [PATCH v1 3/6] net: fec: Factorize feature setting Fugang Duan
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Fugang Duan @ 2014-05-30  2:05 UTC (permalink / raw)
  To: b20596, davem
  Cc: ezequiel.garcia, netdev, shawn.guo, bhutchings, b38611, stephen

The IP header checksum is calculated by the network layer by default.
To support software TSO, it is better to have the hardware calculate
the IP header checksum.

The FEC hardware checksum feature requires the checksum field in the
frame to be zero, otherwise the calculated checksum is not correct.

For a segmented TCP packet, the hardware calculating the IP header
checksum again doesn't bring any impact. For software TSO, the
hardware-calculated checksum brings better performance.

Signed-off-by: Fugang Duan <B38611@freescale.com>
---
 drivers/net/ethernet/freescale/fec_main.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 96c1a07..d7c1998 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -326,6 +326,7 @@ fec_enet_clear_csum(struct sk_buff *skb, struct net_device *ndev)
 	if (unlikely(skb_cow_head(skb, 0)))
 		return -1;
 
+	ip_hdr(skb)->check = 0;
 	*(__sum16 *)(skb->head + skb->csum_start + skb->csum_offset) = 0;
 
 	return 0;
@@ -407,7 +408,7 @@ static int txq_submit_skb(struct sk_buff *skb, struct net_device *ndev)
 			 * are done by the kernel
 			 */
 			if (skb->ip_summed == CHECKSUM_PARTIAL)
-				ebdp->cbd_esc |= BD_ENET_TX_PINS;
+				ebdp->cbd_esc |= BD_ENET_TX_PINS | BD_ENET_TX_IINS;
 		}
 	}
 
-- 
1.7.8


* [PATCH v1 3/6] net: fec: Factorize feature setting
  2014-05-30  2:05 [PATCH v1 0/6] *** net: fec: Enable Software TSO to improve the tx performance *** Fugang Duan
  2014-05-30  2:05 ` [PATCH v1 1/6] net: fec: Factorize the .xmit transmit function Fugang Duan
  2014-05-30  2:05 ` [PATCH v1 2/6] net: fec: Enable IP header hardware checksum Fugang Duan
@ 2014-05-30  2:05 ` Fugang Duan
  2014-05-30  2:05 ` [PATCH v1 4/6] net: fec: Increase buffer descriptor entry number Fugang Duan
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Fugang Duan @ 2014-05-30  2:05 UTC (permalink / raw)
  To: b20596, davem
  Cc: ezequiel.garcia, netdev, shawn.guo, bhutchings, b38611, stephen

In order to enhance code readability, let's factor out the
feature list.

Signed-off-by: Fugang Duan <B38611@freescale.com>
---
 drivers/net/ethernet/freescale/fec_main.c |    8 +++-----
 1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index d7c1998..79f8ac0 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -2096,21 +2096,19 @@ static int fec_enet_init(struct net_device *ndev)
 	writel(FEC_RX_DISABLED_IMASK, fep->hwp + FEC_IMASK);
 	netif_napi_add(ndev, &fep->napi, fec_enet_rx_napi, NAPI_POLL_WEIGHT);
 
-	if (id_entry->driver_data & FEC_QUIRK_HAS_VLAN) {
+	if (id_entry->driver_data & FEC_QUIRK_HAS_VLAN)
 		/* enable hw VLAN support */
 		ndev->features |= NETIF_F_HW_VLAN_CTAG_RX;
-		ndev->hw_features |= NETIF_F_HW_VLAN_CTAG_RX;
-	}
 
 	if (id_entry->driver_data & FEC_QUIRK_HAS_CSUM) {
 		/* enable hw accelerator */
 		ndev->features |= (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM
 				| NETIF_F_RXCSUM);
-		ndev->hw_features |= (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM
-				| NETIF_F_RXCSUM);
 		fep->csum_flags |= FLAG_RX_CSUM_ENABLED;
 	}
 
+	ndev->hw_features = ndev->features;
+
 	fec_restart(ndev, 0);
 
 	return 0;
-- 
1.7.8


* [PATCH v1 4/6] net: fec: Increase buffer descriptor entry number
  2014-05-30  2:05 [PATCH v1 0/6] *** net: fec: Enable Software TSO to improve the tx performance *** Fugang Duan
                   ` (2 preceding siblings ...)
  2014-05-30  2:05 ` [PATCH v1 3/6] net: fec: Factorize feature setting Fugang Duan
@ 2014-05-30  2:05 ` Fugang Duan
  2014-05-30  9:10   ` David Laight
  2014-05-30  2:05 ` [PATCH v1 5/6] net: fec: Add Scatter/gather support Fugang Duan
  2014-05-30  2:05 ` [PATCH v1 6/6] net: fec: Add software TSO support Fugang Duan
  5 siblings, 1 reply; 28+ messages in thread
From: Fugang Duan @ 2014-05-30  2:05 UTC (permalink / raw)
  To: b20596, davem
  Cc: ezequiel.garcia, netdev, shawn.guo, bhutchings, b38611, stephen

In order to support SG and software TSO, let's increase the number of BD entries.

Signed-off-by: Fugang Duan <B38611@freescale.com>
---
 drivers/net/ethernet/freescale/fec.h      |    6 +++---
 drivers/net/ethernet/freescale/fec_main.c |   21 ++++++++++++---------
 2 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fec.h b/drivers/net/ethernet/freescale/fec.h
index 3b8d6d1..74fbd49 100644
--- a/drivers/net/ethernet/freescale/fec.h
+++ b/drivers/net/ethernet/freescale/fec.h
@@ -240,14 +240,14 @@ struct bufdesc_ex {
  * the skbuffer directly.
  */
 
-#define FEC_ENET_RX_PAGES	8
+#define FEC_ENET_RX_PAGES	256
 #define FEC_ENET_RX_FRSIZE	2048
 #define FEC_ENET_RX_FRPPG	(PAGE_SIZE / FEC_ENET_RX_FRSIZE)
 #define RX_RING_SIZE		(FEC_ENET_RX_FRPPG * FEC_ENET_RX_PAGES)
 #define FEC_ENET_TX_FRSIZE	2048
 #define FEC_ENET_TX_FRPPG	(PAGE_SIZE / FEC_ENET_TX_FRSIZE)
-#define TX_RING_SIZE		16	/* Must be power of two */
-#define TX_RING_MOD_MASK	15	/*   for this to work */
+#define TX_RING_SIZE		512	/* Must be power of two */
+#define TX_RING_MOD_MASK	511	/*   for this to work */
 
 #define BD_ENET_RX_INT          0x00800000
 #define BD_ENET_RX_PTP          ((ushort)0x0400)
diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 79f8ac0..714100f 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -172,10 +172,6 @@ MODULE_PARM_DESC(macaddr, "FEC Ethernet MAC address");
 #endif
 #endif /* CONFIG_M5272 */
 
-#if (((RX_RING_SIZE + TX_RING_SIZE) * 32) > PAGE_SIZE)
-#error "FEC: descriptor ring size constants too large"
-#endif
-
 /* Interrupt events/masks. */
 #define FEC_ENET_HBERR	((uint)0x80000000)	/* Heartbeat error */
 #define FEC_ENET_BABR	((uint)0x40000000)	/* Babbling receiver */
@@ -2060,9 +2056,20 @@ static int fec_enet_init(struct net_device *ndev)
 	const struct platform_device_id *id_entry =
 				platform_get_device_id(fep->pdev);
 	struct bufdesc *cbd_base;
+	int bd_size;
+
+	/* init the tx & rx ring size */
+	fep->tx_ring_size = TX_RING_SIZE;
+	fep->rx_ring_size = RX_RING_SIZE;
+
+	if (fep->bufdesc_ex)
+		bd_size = sizeof(struct bufdesc_ex);
+	else
+		bd_size = sizeof(struct bufdesc);
+	bd_size *= (fep->tx_ring_size + fep->rx_ring_size);
 
 	/* Allocate memory for buffer descriptors. */
-	cbd_base = dma_alloc_coherent(NULL, PAGE_SIZE, &fep->bd_dma,
+	cbd_base = dma_alloc_coherent(NULL, bd_size, &fep->bd_dma,
 				      GFP_KERNEL);
 	if (!cbd_base)
 		return -ENOMEM;
@@ -2076,10 +2083,6 @@ static int fec_enet_init(struct net_device *ndev)
 	/* make sure MAC we just acquired is programmed into the hw */
 	fec_set_mac_address(ndev, NULL);
 
-	/* init the tx & rx ring size */
-	fep->tx_ring_size = TX_RING_SIZE;
-	fep->rx_ring_size = RX_RING_SIZE;
-
 	/* Set receive and transmit descriptor base. */
 	fep->rx_bd_base = cbd_base;
 	if (fep->bufdesc_ex)
-- 
1.7.8


* [PATCH v1 5/6] net: fec: Add Scatter/gather support
  2014-05-30  2:05 [PATCH v1 0/6] *** net: fec: Enable Software TSO to improve the tx performance *** Fugang Duan
                   ` (3 preceding siblings ...)
  2014-05-30  2:05 ` [PATCH v1 4/6] net: fec: Increase buffer descriptor entry number Fugang Duan
@ 2014-05-30  2:05 ` Fugang Duan
  2014-05-30  4:37   ` Eric Dumazet
  2014-05-30 14:26   ` Frank.Li
  2014-05-30  2:05 ` [PATCH v1 6/6] net: fec: Add software TSO support Fugang Duan
  5 siblings, 2 replies; 28+ messages in thread
From: Fugang Duan @ 2014-05-30  2:05 UTC (permalink / raw)
  To: b20596, davem
  Cc: ezequiel.garcia, netdev, shawn.guo, bhutchings, b38611, stephen

Add scatter/gather support for FEC.
This feature improves outbound throughput performance.
Running iperf tests shows a 55.4% improvement, tested on an imx6dl
sabresd board.

$ ethtool -K eth0 sg off
[  3] local 10.192.242.108 port 52618 connected with 10.192.242.167 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 3.0 sec  99.5 MBytes   278 Mbits/sec

$ ethtool -K eth0 sg on
$ iperf -c 10.192.242.167 -t 3 &
[  3] local 10.192.242.108 port 52617 connected with 10.192.242.167 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 3.0 sec   154 MBytes   432 Mbits/sec

Signed-off-by: Fugang Duan <B38611@freescale.com>
---
 drivers/net/ethernet/freescale/fec.h      |    2 +-
 drivers/net/ethernet/freescale/fec_main.c |  230 ++++++++++++++++++++++-------
 2 files changed, 178 insertions(+), 54 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fec.h b/drivers/net/ethernet/freescale/fec.h
index 74fbd49..57f3ecf 100644
--- a/drivers/net/ethernet/freescale/fec.h
+++ b/drivers/net/ethernet/freescale/fec.h
@@ -221,7 +221,7 @@ struct bufdesc_ex {
 #define BD_ENET_TX_RCMASK       ((ushort)0x003c)
 #define BD_ENET_TX_UN           ((ushort)0x0002)
 #define BD_ENET_TX_CSL          ((ushort)0x0001)
-#define BD_ENET_TX_STATS        ((ushort)0x03ff)        /* All status bits */
+#define BD_ENET_TX_STATS        ((ushort)0x0fff)        /* All status bits */
 
 /*enhanced buffer descriptor control/status used by Ethernet transmit*/
 #define BD_ENET_TX_INT          0x40000000
diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 714100f..a96ed3a 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -301,6 +301,23 @@ int fec_enet_get_bd_index(struct bufdesc *bdp, struct fec_enet_private *fep)
 	return index;
 }
 
+static inline
+int fec_enet_txdesc_entry_free(struct fec_enet_private *fep)
+{
+	int entries;
+
+	if (fep->bufdesc_ex)
+		entries = (struct bufdesc_ex *)fep->dirty_tx -
+				(struct bufdesc_ex *)fep->cur_tx;
+	else
+		entries = fep->dirty_tx - fep->cur_tx;
+
+	if (fep->cur_tx >= fep->dirty_tx)
+		entries += fep->tx_ring_size;
+
+	return entries;
+}
+
 static void *swap_buffer(void *bufaddr, int len)
 {
 	int i;
@@ -328,20 +345,116 @@ fec_enet_clear_csum(struct sk_buff *skb, struct net_device *ndev)
 	return 0;
 }
 
-static int txq_submit_skb(struct sk_buff *skb, struct net_device *ndev)
+static void
+fec_enet_submit_work(struct bufdesc *bdp, struct fec_enet_private *fep)
+{
+	const struct platform_device_id *id_entry =
+				platform_get_device_id(fep->pdev);
+	struct bufdesc *bdp_pre;
+
+	bdp_pre = fec_enet_get_prevdesc(bdp, fep);
+	if ((id_entry->driver_data & FEC_QUIRK_ERR006358) &&
+	    !(bdp_pre->cbd_sc & BD_ENET_TX_READY)) {
+		fep->delay_work.trig_tx = true;
+		schedule_delayed_work(&(fep->delay_work.delay_work),
+					msecs_to_jiffies(1));
+	}
+}
+
+static int
+fec_enet_txq_submit_frag_skb(struct sk_buff *skb, struct net_device *ndev)
 {
 	struct fec_enet_private *fep = netdev_priv(ndev);
 	const struct platform_device_id *id_entry =
 				platform_get_device_id(fep->pdev);
-	struct bufdesc *bdp, *bdp_pre;
-	void *bufaddr;
-	unsigned short	status;
+	struct bufdesc *bdp = fep->cur_tx;
+	struct bufdesc_ex *ebdp;
+	int nr_frags = skb_shinfo(skb)->nr_frags;
+	int frag, frag_len;
+	unsigned short status;
+	unsigned int estatus = 0;
+	skb_frag_t *this_frag;
 	unsigned int index;
+	void *bufaddr;
+	int i;
 
-	/* Fill in a Tx ring entry */
+	for (frag = 0; frag < nr_frags; frag++) {
+		this_frag = &skb_shinfo(skb)->frags[frag];
+		bdp = fec_enet_get_nextdesc(bdp, fep);
+		ebdp = (struct bufdesc_ex *)bdp;
+
+		status = bdp->cbd_sc;
+		status &= ~BD_ENET_TX_STATS;
+		status |= (BD_ENET_TX_TC | BD_ENET_TX_READY);
+		frag_len = skb_shinfo(skb)->frags[frag].size;
+
+		/* Handle the last BD specially */
+		if (frag == nr_frags - 1) {
+			status |= (BD_ENET_TX_INTR | BD_ENET_TX_LAST);
+			if (fep->bufdesc_ex) {
+				estatus |= BD_ENET_TX_INT;
+				if (unlikely(skb_shinfo(skb)->tx_flags &
+					SKBTX_HW_TSTAMP && fep->hwts_tx_en))
+					estatus |= BD_ENET_TX_TS;
+			}
+		}
+
+		if (fep->bufdesc_ex) {
+			if (skb->ip_summed == CHECKSUM_PARTIAL)
+				estatus |= BD_ENET_TX_PINS | BD_ENET_TX_IINS;
+			ebdp->cbd_bdu = 0;
+			ebdp->cbd_esc = estatus;
+		}
+
+		bufaddr = page_address(this_frag->page.p) + this_frag->page_offset;
+
+		index = fec_enet_get_bd_index(bdp, fep);
+		if (((unsigned long) bufaddr) & FEC_ALIGNMENT) {
+			memcpy(fep->tx_bounce[index], bufaddr, frag_len);
+			bufaddr = fep->tx_bounce[index];
+		}
+		if (id_entry->driver_data & FEC_QUIRK_SWAP_FRAME)
+			swap_buffer(bufaddr, frag_len);
+		bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, bufaddr,
+						frag_len, DMA_TO_DEVICE);
+		if (dma_mapping_error(&fep->pdev->dev, bdp->cbd_bufaddr)) {
+			dev_kfree_skb_any(skb);
+			if (net_ratelimit())
+				netdev_err(ndev, "Tx DMA memory map failed\n");
+			goto dma_mapping_error;
+		}
+
+		bdp->cbd_datlen = frag_len;
+		bdp->cbd_sc = status;
+	}
+
+	fep->cur_tx = bdp;
+
+	return 0;
+
+dma_mapping_error:
 	bdp = fep->cur_tx;
+	for (i = 0; i < frag; i++) {
+		bdp = fec_enet_get_nextdesc(bdp, fep);
+		dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr,
+				bdp->cbd_datlen, DMA_TO_DEVICE);
+	}
+	return NETDEV_TX_OK;
+}
 
-	status = bdp->cbd_sc;
+static int fec_enet_txq_submit_skb(struct sk_buff *skb, struct net_device *ndev)
+{
+	struct fec_enet_private *fep = netdev_priv(ndev);
+	const struct platform_device_id *id_entry =
+				platform_get_device_id(fep->pdev);
+	int nr_frags = skb_shinfo(skb)->nr_frags;
+	struct bufdesc *bdp, *last_bdp;
+	void *bufaddr;
+	unsigned short status;
+	unsigned short buflen;
+	unsigned int estatus = 0;
+	unsigned int index;
+	int ret;
 
 	/* Protocol checksum off-load for TCP and UDP. */
 	if (fec_enet_clear_csum(skb, ndev)) {
@@ -349,17 +462,18 @@ static int txq_submit_skb(struct sk_buff *skb, struct net_device *ndev)
 		return NETDEV_TX_OK;
 	}
 
-	/* Clear all of the status flags */
+	/* Fill in a Tx ring entry */
+	bdp = fep->cur_tx;
+	status = bdp->cbd_sc;
 	status &= ~BD_ENET_TX_STATS;
 
 	/* Set buffer length and buffer pointer */
 	bufaddr = skb->data;
-	bdp->cbd_datlen = skb->len;
+	buflen = skb_headlen(skb);
 
 	index = fec_enet_get_bd_index(bdp, fep);
-
 	if (((unsigned long) bufaddr) & FEC_ALIGNMENT) {
-		memcpy(fep->tx_bounce[index], skb->data, skb->len);
+		memcpy(fep->tx_bounce[index], skb->data, buflen);
 		bufaddr = fep->tx_bounce[index];
 	}
 
@@ -369,62 +483,66 @@ static int txq_submit_skb(struct sk_buff *skb, struct net_device *ndev)
 	 * swap every frame going to and coming from the controller.
 	 */
 	if (id_entry->driver_data & FEC_QUIRK_SWAP_FRAME)
-		swap_buffer(bufaddr, skb->len);
-
-	/* Save skb pointer */
-	fep->tx_skbuff[index] = skb;
+		swap_buffer(bufaddr, buflen);
 
 	/* Push the data cache so the CPM does not get stale memory
 	 * data.
 	 */
 	bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, bufaddr,
-			skb->len, DMA_TO_DEVICE);
+					buflen, DMA_TO_DEVICE);
 	if (dma_mapping_error(&fep->pdev->dev, bdp->cbd_bufaddr)) {
-		bdp->cbd_bufaddr = 0;
-		fep->tx_skbuff[index] = NULL;
 		dev_kfree_skb_any(skb);
 		if (net_ratelimit())
 			netdev_err(ndev, "Tx DMA memory map failed\n");
 		return NETDEV_TX_OK;
 	}
 
+	if (nr_frags) {
+		ret = fec_enet_txq_submit_frag_skb(skb, ndev);
+		if (ret)
+			return ret;
+	} else {
+		status |= (BD_ENET_TX_INTR | BD_ENET_TX_LAST);
+		if (fep->bufdesc_ex) {
+			estatus = BD_ENET_TX_INT;
+			if (unlikely(skb_shinfo(skb)->tx_flags &
+				SKBTX_HW_TSTAMP && fep->hwts_tx_en))
+				estatus |= BD_ENET_TX_TS;
+		}
+	}
+
 	if (fep->bufdesc_ex) {
 
 		struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
-		ebdp->cbd_bdu = 0;
+
 		if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP &&
-			fep->hwts_tx_en)) {
-			ebdp->cbd_esc = (BD_ENET_TX_TS | BD_ENET_TX_INT);
+			fep->hwts_tx_en))
 			skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
-		} else {
-			ebdp->cbd_esc = BD_ENET_TX_INT;
 
-			/* Enable protocol checksum flags
-			 * We do not bother with the IP Checksum bits as they
-			 * are done by the kernel
-			 */
-			if (skb->ip_summed == CHECKSUM_PARTIAL)
-				ebdp->cbd_esc |= BD_ENET_TX_PINS | BD_ENET_TX_IINS;
-		}
+		if (skb->ip_summed == CHECKSUM_PARTIAL)
+			estatus |= BD_ENET_TX_PINS | BD_ENET_TX_IINS;
+
+		ebdp->cbd_bdu = 0;
+		ebdp->cbd_esc = estatus;
 	}
 
+	last_bdp = fep->cur_tx;
+	index = fec_enet_get_bd_index(last_bdp, fep);
+	/* Save skb pointer */
+	fep->tx_skbuff[index] = skb;
+
+	bdp->cbd_datlen = buflen;
+
 	/* Send it on its way.  Tell FEC it's ready, interrupt when done,
 	 * it's the last BD of the frame, and to put the CRC on the end.
 	 */
-	status |= (BD_ENET_TX_READY | BD_ENET_TX_INTR
-			| BD_ENET_TX_LAST | BD_ENET_TX_TC);
+	status |= (BD_ENET_TX_READY | BD_ENET_TX_TC);
 	bdp->cbd_sc = status;
 
-	bdp_pre = fec_enet_get_prevdesc(bdp, fep);
-	if ((id_entry->driver_data & FEC_QUIRK_ERR006358) &&
-	    !(bdp_pre->cbd_sc & BD_ENET_TX_READY)) {
-		fep->delay_work.trig_tx = true;
-		schedule_delayed_work(&(fep->delay_work.delay_work),
-					msecs_to_jiffies(1));
-	}
+	fec_enet_submit_work(bdp, fep);
 
 	/* If this was the last BD in the ring, start at the beginning again. */
-	bdp = fec_enet_get_nextdesc(bdp, fep);
+	bdp = fec_enet_get_nextdesc(last_bdp, fep);
 
 	skb_tx_timestamp(skb);
 
@@ -433,7 +551,7 @@ static int txq_submit_skb(struct sk_buff *skb, struct net_device *ndev)
 	/* Trigger transmission start */
 	writel(0, fep->hwp + FEC_X_DES_ACTIVE);
 
-	return NETDEV_TX_OK;
+	return 0;
 }
 
 static netdev_tx_t
@@ -442,6 +560,7 @@ fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 	struct fec_enet_private *fep = netdev_priv(ndev);
 	struct bufdesc *bdp;
 	unsigned short	status;
+	int entries_free;
 	int ret;
 
 	/* Fill in a Tx ring entry */
@@ -453,15 +572,17 @@ fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 		/* Ooops.  All transmit buffers are full.  Bail out.
 		 * This should not happen, since ndev->tbusy should be set.
 		 */
-		netdev_err(ndev, "tx queue full!\n");
+		if (net_ratelimit())
+			netdev_err(ndev, "tx queue full!\n");
 		return NETDEV_TX_BUSY;
 	}
 
-	ret = txq_submit_skb(skb, ndev);
-	if (ret == -EBUSY)
-		return NETDEV_TX_BUSY;
+	ret = fec_enet_txq_submit_skb(skb, ndev);
+	if (ret)
+		return ret;
 
-	if (fep->cur_tx == fep->dirty_tx)
+	entries_free = fec_enet_txdesc_entry_free(fep);
+	if (entries_free < MAX_SKB_FRAGS + 1)
 		netif_stop_queue(ndev);
 
 	return NETDEV_TX_OK;
@@ -782,6 +903,7 @@ fec_enet_tx(struct net_device *ndev)
 	unsigned short status;
 	struct	sk_buff	*skb;
 	int	index = 0;
+	int	entries;
 
 	fep = netdev_priv(ndev);
 	bdp = fep->dirty_tx;
@@ -798,9 +920,13 @@ fec_enet_tx(struct net_device *ndev)
 		index = fec_enet_get_bd_index(bdp, fep);
 
 		skb = fep->tx_skbuff[index];
-		dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr, skb->len,
+		dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr, bdp->cbd_datlen,
 				DMA_TO_DEVICE);
 		bdp->cbd_bufaddr = 0;
+		if (!skb) {
+			bdp = fec_enet_get_nextdesc(bdp, fep);
+			continue;
+		}
 
 		/* Check for errors. */
 		if (status & (BD_ENET_TX_HB | BD_ENET_TX_LC |
@@ -819,7 +945,7 @@ fec_enet_tx(struct net_device *ndev)
 				ndev->stats.tx_carrier_errors++;
 		} else {
 			ndev->stats.tx_packets++;
-			ndev->stats.tx_bytes += bdp->cbd_datlen;
+			ndev->stats.tx_bytes += skb->len;
 		}
 
 		if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS) &&
@@ -856,15 +982,13 @@ fec_enet_tx(struct net_device *ndev)
 
 		/* Since we have freed up a buffer, the ring is no longer full
 		 */
-		if (fep->dirty_tx != fep->cur_tx) {
-			if (netif_queue_stopped(ndev))
-				netif_wake_queue(ndev);
-		}
+		entries = fec_enet_txdesc_entry_free(fep);
+		if (entries >= MAX_SKB_FRAGS + 1 && netif_queue_stopped(ndev))
+			netif_wake_queue(ndev);
 	}
 	return;
 }
 
-
 /* During a receive, the cur_rx points to the current incoming buffer.
  * When we update through the ring, if the next incoming buffer has
  * not been given to the system, we just set the empty indicator,
@@ -2106,7 +2230,7 @@ static int fec_enet_init(struct net_device *ndev)
 	if (id_entry->driver_data & FEC_QUIRK_HAS_CSUM) {
 		/* enable hw accelerator */
 		ndev->features |= (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM
-				| NETIF_F_RXCSUM);
+				| NETIF_F_RXCSUM | NETIF_F_SG);
 		fep->csum_flags |= FLAG_RX_CSUM_ENABLED;
 	}
 
-- 
1.7.8


* [PATCH v1 6/6] net: fec: Add software TSO support
  2014-05-30  2:05 [PATCH v1 0/6] *** net: fec: Enable Software TSO to improve the tx performance *** Fugang Duan
                   ` (4 preceding siblings ...)
  2014-05-30  2:05 ` [PATCH v1 5/6] net: fec: Add Scatter/gather support Fugang Duan
@ 2014-05-30  2:05 ` Fugang Duan
  2014-05-30  6:30   ` Eric Dumazet
  5 siblings, 1 reply; 28+ messages in thread
From: Fugang Duan @ 2014-05-30  2:05 UTC (permalink / raw)
  To: b20596, davem
  Cc: ezequiel.garcia, netdev, shawn.guo, bhutchings, b38611, stephen

Add software TSO support for FEC.
This feature improves outbound throughput performance.
Tested on an imx6dl sabresd board, running iperf TCP tests shows:
- 16.2% improvement compared with the FEC SG patch
- 82% improvement compared with the driver without the SG & TSO patches

$ ethtool -K eth0 tso on
[  3] local 10.192.242.108 port 35388 connected with 10.192.242.167 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 3.0 sec   181 MBytes   506 Mbits/sec

Signed-off-by: Fugang Duan <B38611@freescale.com>
---
 drivers/net/ethernet/freescale/fec.h      |    4 +
 drivers/net/ethernet/freescale/fec_main.c |  185 ++++++++++++++++++++++++++++-
 2 files changed, 187 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fec.h b/drivers/net/ethernet/freescale/fec.h
index 57f3ecf..caba87d 100644
--- a/drivers/net/ethernet/freescale/fec.h
+++ b/drivers/net/ethernet/freescale/fec.h
@@ -299,6 +299,10 @@ struct fec_enet_private {
 	unsigned short tx_ring_size;
 	unsigned short rx_ring_size;
 
+	/* Software TSO */
+	char *tso_hdrs;
+	dma_addr_t tso_hdrs_dma;
+
 	struct	platform_device *pdev;
 
 	int	opened;
diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index a96ed3a..cd20344 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -36,6 +36,7 @@
 #include <linux/in.h>
 #include <linux/ip.h>
 #include <net/ip.h>
+#include <net/tso.h>
 #include <linux/tcp.h>
 #include <linux/udp.h>
 #include <linux/icmp.h>
@@ -227,6 +228,8 @@ MODULE_PARM_DESC(macaddr, "FEC Ethernet MAC address");
 #define FEC_PAUSE_FLAG_AUTONEG	0x1
 #define FEC_PAUSE_FLAG_ENABLE	0x2
 
+#define TSO_HEADER_SIZE		128
+
 static int mii_cnt;
 
 static inline
@@ -554,6 +557,174 @@ static int fec_enet_txq_submit_skb(struct sk_buff *skb, struct net_device *ndev)
 	return 0;
 }
 
+static int
+fec_enet_txq_put_data_tso(struct sk_buff *skb, struct net_device *ndev,
+			struct bufdesc *bdp, int index, char *data,
+			int size, bool last_tcp, bool is_last)
+{
+	struct fec_enet_private *fep = netdev_priv(ndev);
+	const struct platform_device_id *id_entry =
+				platform_get_device_id(fep->pdev);
+	struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
+	unsigned short status;
+	unsigned int estatus = 0;
+
+	status = bdp->cbd_sc;
+	status &= ~BD_ENET_TX_STATS;
+
+	status |= (BD_ENET_TX_TC | BD_ENET_TX_READY);
+	bdp->cbd_datlen = size;
+
+	if (((unsigned long) data) & FEC_ALIGNMENT) {
+		memcpy(fep->tx_bounce[index], data, size);
+		data = fep->tx_bounce[index];
+	}
+	if (id_entry->driver_data & FEC_QUIRK_SWAP_FRAME)
+		swap_buffer(data, size);
+
+	bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, data,
+					size, DMA_TO_DEVICE);
+	if (dma_mapping_error(&fep->pdev->dev, bdp->cbd_bufaddr)) {
+		dev_kfree_skb_any(skb);
+		if (net_ratelimit())
+			netdev_err(ndev, "Tx DMA memory map failed\n");
+		return NETDEV_TX_BUSY;
+	}
+
+	if (fep->bufdesc_ex) {
+		if (skb->ip_summed == CHECKSUM_PARTIAL)
+			estatus |= BD_ENET_TX_PINS | BD_ENET_TX_IINS;
+		ebdp->cbd_bdu = 0;
+		ebdp->cbd_esc = estatus;
+	}
+
+	/* Handle the last BD specially */
+	if (last_tcp)
+		status |= (BD_ENET_TX_LAST | BD_ENET_TX_TC);
+	if (is_last) {
+		status |= BD_ENET_TX_INTR;
+		if (fep->bufdesc_ex)
+			ebdp->cbd_esc |= BD_ENET_TX_INT;
+	}
+
+	bdp->cbd_sc = status;
+
+	return 0;
+}
+
+static void
+fec_enet_txq_put_hdr_tso(struct sk_buff *skb, struct net_device *ndev,
+			struct bufdesc *bdp, int index)
+{
+	struct fec_enet_private *fep = netdev_priv(ndev);
+	const struct platform_device_id *id_entry =
+				platform_get_device_id(fep->pdev);
+	int hdr_len = skb_transport_offset(skb) + tcp_hdrlen(skb);
+	struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
+	void *bufaddr;
+	unsigned short status;
+	unsigned int estatus = 0;
+
+	status = bdp->cbd_sc;
+	status &= ~BD_ENET_TX_STATS;
+	status |= (BD_ENET_TX_TC | BD_ENET_TX_READY);
+
+	bufaddr = fep->tso_hdrs + index * TSO_HEADER_SIZE;
+	if (((unsigned long) bufaddr) & FEC_ALIGNMENT) {
+		memcpy(fep->tx_bounce[index], skb->data, hdr_len);
+		bufaddr = fep->tx_bounce[index];
+	}
+	if (id_entry->driver_data & FEC_QUIRK_SWAP_FRAME)
+		swap_buffer(bufaddr, hdr_len);
+
+	bdp->cbd_bufaddr = fep->tso_hdrs_dma + index * TSO_HEADER_SIZE;
+	bdp->cbd_datlen = hdr_len;
+
+	if (fep->bufdesc_ex) {
+		if (skb->ip_summed == CHECKSUM_PARTIAL)
+			estatus |= BD_ENET_TX_PINS | BD_ENET_TX_IINS;
+		ebdp->cbd_bdu = 0;
+		ebdp->cbd_esc = estatus;
+	}
+
+	bdp->cbd_sc = status;
+}
+
+static int fec_enet_txq_submit_tso(struct sk_buff *skb, struct net_device *ndev)
+{
+	struct fec_enet_private *fep = netdev_priv(ndev);
+	int hdr_len = skb_transport_offset(skb) + tcp_hdrlen(skb);
+	int total_len, data_left;
+	struct bufdesc *bdp = fep->cur_tx;
+	struct tso_t tso;
+	unsigned int index = 0;
+	int ret;
+
+	if (tso_count_descs(skb) >= fec_enet_txdesc_entry_free(fep)) {
+		if (net_ratelimit())
+			netdev_err(ndev, "tx queue full!\n");
+		return NETDEV_TX_BUSY;
+	}
+
+	/* Protocol checksum off-load for TCP and UDP. */
+	if (fec_enet_clear_csum(skb, ndev)) {
+		dev_kfree_skb_any(skb);
+		return NETDEV_TX_OK;
+	}
+
+	/* Initialize the TSO handler, and prepare the first payload */
+	tso_start(skb, &tso);
+
+	total_len = skb->len - hdr_len;
+	while (total_len > 0) {
+		char *hdr;
+
+		index = fec_enet_get_bd_index(bdp, fep);
+		data_left = min_t(int, skb_shinfo(skb)->gso_size, total_len);
+		total_len -= data_left;
+
+		/* prepare packet headers: MAC + IP + TCP */
+		hdr = fep->tso_hdrs + index * TSO_HEADER_SIZE;
+		tso_build_hdr(skb, hdr, &tso, data_left, total_len == 0);
+		fec_enet_txq_put_hdr_tso(skb, ndev, bdp, index);
+
+		while (data_left > 0) {
+			int size;
+
+			size = min_t(int, tso.size, data_left);
+			bdp = fec_enet_get_nextdesc(bdp, fep);
+			index = fec_enet_get_bd_index(bdp, fep);
+			ret = fec_enet_txq_put_data_tso(skb, ndev, bdp, index, tso.data,
+							size, size == data_left,
+							total_len == 0);
+			if (ret)
+				goto err_release;
+
+			data_left -= size;
+			tso_build_data(skb, &tso, size);
+		}
+
+		bdp = fec_enet_get_nextdesc(bdp, fep);
+	}
+
+	/* Save skb pointer */
+	fep->tx_skbuff[index] = skb;
+
+	fec_enet_submit_work(bdp, fep);
+
+	skb_tx_timestamp(skb);
+	fep->cur_tx = bdp;
+
+	/* Trigger transmission start */
+	writel(0, fep->hwp + FEC_X_DES_ACTIVE);
+
+	return 0;
+
+err_release:
+	/* TODO: Release all used data descriptors for TSO */
+	return ret;
+}
+
 static netdev_tx_t
 fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 {
@@ -577,7 +748,10 @@ fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 		return NETDEV_TX_BUSY;
 	}
 
-	ret = fec_enet_txq_submit_skb(skb, ndev);
+	if (skb_is_gso(skb))
+		ret = fec_enet_txq_submit_tso(skb, ndev);
+	else
+		ret = fec_enet_txq_submit_skb(skb, ndev);
 	if (ret)
 		return ret;
 
@@ -2198,6 +2372,13 @@ static int fec_enet_init(struct net_device *ndev)
 	if (!cbd_base)
 		return -ENOMEM;
 
+	fep->tso_hdrs = dma_alloc_coherent(NULL, fep->tx_ring_size * TSO_HEADER_SIZE,
+						&fep->tso_hdrs_dma, GFP_KERNEL);
+	if (!fep->tso_hdrs) {
+		dma_free_coherent(NULL, bd_size, cbd_base, fep->bd_dma);
+		return -ENOMEM;
+	}
+
 	memset(cbd_base, 0, PAGE_SIZE);
 
 	fep->netdev = ndev;
@@ -2230,7 +2411,7 @@ static int fec_enet_init(struct net_device *ndev)
 	if (id_entry->driver_data & FEC_QUIRK_HAS_CSUM) {
 		/* enable hw accelerator */
 		ndev->features |= (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM
-				| NETIF_F_RXCSUM | NETIF_F_SG);
+				| NETIF_F_RXCSUM | NETIF_F_SG | NETIF_F_TSO);
 		fep->csum_flags |= FLAG_RX_CSUM_ENABLED;
 	}
 
-- 
1.7.8


* Re: [PATCH v1 5/6] net: fec: Add Scatter/gather support
  2014-05-30  2:05 ` [PATCH v1 5/6] net: fec: Add Scatter/gather support Fugang Duan
@ 2014-05-30  4:37   ` Eric Dumazet
  2014-05-30  4:48     ` fugang.duan
  2014-05-30 14:26   ` Frank.Li
  1 sibling, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2014-05-30  4:37 UTC (permalink / raw)
  To: Fugang Duan
  Cc: b20596, davem, ezequiel.garcia, netdev, shawn.guo, bhutchings, stephen

On Fri, 2014-05-30 at 10:05 +0800, Fugang Duan wrote:
> Add scatter/gather support for FEC.
> This feature improves outbound throughput performance.
> Running iperf tests shows a 55.4% improvement, tested on an imx6dl
> sabresd board.


> +			bufaddr = fep->tx_bounce[index];
> +		}
> +		if (id_entry->driver_data & FEC_QUIRK_SWAP_FRAME)
> +			swap_buffer(bufaddr, frag_len);


bufaddr is a page fragment and can be shared (using sendfile() for
example), so we cannot modify its content using swap_buffer().

In order to do the swap, you need to force a copy into the tx_bounce buffer.
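
Something like this (untested), reusing the bounce buffer you already
have, would leave the shared fragment untouched:

	if (((unsigned long) bufaddr) & FEC_ALIGNMENT ||
	    id_entry->driver_data & FEC_QUIRK_SWAP_FRAME) {
		memcpy(fep->tx_bounce[index], bufaddr, frag_len);
		bufaddr = fep->tx_bounce[index];

		if (id_entry->driver_data & FEC_QUIRK_SWAP_FRAME)
			swap_buffer(bufaddr, frag_len);
	}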



> +		bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, bufaddr,
> +						frag_len, DMA_TO_DEVICE);
> +		if (dma_mapping_error(&fep->pdev->dev, bdp->cbd_bufaddr)) {
> +			dev_kfree_skb_any(skb);
> +			if (net_ratelimit())
> +				netdev_err(ndev, "Tx DMA memory map failed\n");
> +			goto dma_mapping_error;
> +		}
> +
> +		bdp->cbd_datlen = frag_len;
> +		bdp->cbd_sc = status;
> +	}
> +
> +	fep->cur_tx = bdp;
> +
> +	return 0;
> +
> +dma_mapping_error:
>  	bdp = fep->cur_tx;
> +	for (i = 0; i < frag; i++) {
> +		bdp = fec_enet_get_nextdesc(bdp, fep);
> +		dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr,
> +				bdp->cbd_datlen, DMA_TO_DEVICE);
> +	}
> +	return NETDEV_TX_OK;
> +}
>  
> -	status = bdp->cbd_sc;
> +static int fec_enet_txq_submit_skb(struct sk_buff *skb, struct net_device *ndev)
> +{
> +	struct fec_enet_private *fep = netdev_priv(ndev);
> +	const struct platform_device_id *id_entry =
> +				platform_get_device_id(fep->pdev);
> +	int nr_frags = skb_shinfo(skb)->nr_frags;
> +	struct bufdesc *bdp, *last_bdp;
> +	void *bufaddr;
> +	unsigned short status;
> +	unsigned short buflen;
> +	unsigned int estatus = 0;
> +	unsigned int index;
> +	int ret;
>  
>  	/* Protocol checksum off-load for TCP and UDP. */
>  	if (fec_enet_clear_csum(skb, ndev)) {
> @@ -349,17 +462,18 @@ static int txq_submit_skb(struct sk_buff *skb, struct net_device *ndev)
>  		return NETDEV_TX_OK;
>  	}
>  
> -	/* Clear all of the status flags */
> +	/* Fill in a Tx ring entry */
> +	bdp = fep->cur_tx;
> +	status = bdp->cbd_sc;
>  	status &= ~BD_ENET_TX_STATS;
>  
>  	/* Set buffer length and buffer pointer */
>  	bufaddr = skb->data;
> -	bdp->cbd_datlen = skb->len;
> +	buflen = skb_headlen(skb);
>  
>  	index = fec_enet_get_bd_index(bdp, fep);
> -
>  	if (((unsigned long) bufaddr) & FEC_ALIGNMENT) {
> -		memcpy(fep->tx_bounce[index], skb->data, skb->len);
> +		memcpy(fep->tx_bounce[index], skb->data, buflen);
>  		bufaddr = fep->tx_bounce[index];
>  	}
>  
> @@ -369,62 +483,66 @@ static int txq_submit_skb(struct sk_buff *skb, struct net_device *ndev)
>  	 * swap every frame going to and coming from the controller.
>  	 */
>  	if (id_entry->driver_data & FEC_QUIRK_SWAP_FRAME)
> -		swap_buffer(bufaddr, skb->len);
> -
> -	/* Save skb pointer */
> -	fep->tx_skbuff[index] = skb;
> +		swap_buffer(bufaddr, buflen);

Same problem here, a driver is certainly not allowed to mess with
skb->data.


* RE: [PATCH v1 5/6] net: fec: Add Scatter/gather support
  2014-05-30  4:37   ` Eric Dumazet
@ 2014-05-30  4:48     ` fugang.duan
  0 siblings, 0 replies; 28+ messages in thread
From: fugang.duan @ 2014-05-30  4:48 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Frank.Li, davem, ezequiel.garcia, netdev, shawn.guo, bhutchings, stephen

From: Eric Dumazet <eric.dumazet@gmail.com> Date: Friday, May 30, 2014 12:38 PM
>To: Duan Fugang-B38611
>Cc: Li Frank-B20596; davem@davemloft.net; ezequiel.garcia@free-
>electrons.com; netdev@vger.kernel.org; shawn.guo@linaro.org;
>bhutchings@solarflare.com; stephen@networkplumber.org
>Subject: Re: [PATCH v1 5/6] net: fec: Add Scatter/gather support
>
>On Fri, 2014-05-30 at 10:05 +0800, Fugang Duan wrote:
>> Add scatter/gather support for FEC.
>> This feature improves outbound throughput performance.
>> Running iperf tests shows a 55.4% improvement, tested on an imx6dl
>> sabresd board.
>
>
>> +			bufaddr = fep->tx_bounce[index];
>> +		}
>> +		if (id_entry->driver_data & FEC_QUIRK_SWAP_FRAME)
>> +			swap_buffer(bufaddr, frag_len);
>
>
>bufaddr is a page fragment and can be shared (using sendfile() for
>example), we cannot modify the content using swap_buffer().
>
>In order to do the swap, you need to force the copy to tx_bounce buffer.
>
Yes, you are right. I will change it in the next version.
I tested on the imx6dl platform, whose FEC supports hardware swap, so I didn't see the issue.
Thanks.

>
>> +		bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, bufaddr,
>> +						frag_len, DMA_TO_DEVICE);
>> +		if (dma_mapping_error(&fep->pdev->dev, bdp->cbd_bufaddr)) {
>> +			dev_kfree_skb_any(skb);
>> +			if (net_ratelimit())
>> +				netdev_err(ndev, "Tx DMA memory map failed\n");
>> +			goto dma_mapping_error;
>> +		}
>> +
>> +		bdp->cbd_datlen = frag_len;
>> +		bdp->cbd_sc = status;
>> +	}
>> +
>> +	fep->cur_tx = bdp;
>> +
>> +	return 0;
>> +
>> +dma_mapping_error:
>>  	bdp = fep->cur_tx;
>> +	for (i = 0; i < frag; i++) {
>> +		bdp = fec_enet_get_nextdesc(bdp, fep);
>> +		dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr,
>> +				bdp->cbd_datlen, DMA_TO_DEVICE);
>> +	}
>> +	return NETDEV_TX_OK;
>> +}
>>
>> -	status = bdp->cbd_sc;
>> +static int fec_enet_txq_submit_skb(struct sk_buff *skb, struct
>> +net_device *ndev) {
>> +	struct fec_enet_private *fep = netdev_priv(ndev);
>> +	const struct platform_device_id *id_entry =
>> +				platform_get_device_id(fep->pdev);
>> +	int nr_frags = skb_shinfo(skb)->nr_frags;
>> +	struct bufdesc *bdp, *last_bdp;
>> +	void *bufaddr;
>> +	unsigned short status;
>> +	unsigned short buflen;
>> +	unsigned int estatus = 0;
>> +	unsigned int index;
>> +	int ret;
>>
>>  	/* Protocol checksum off-load for TCP and UDP. */
>>  	if (fec_enet_clear_csum(skb, ndev)) { @@ -349,17 +462,18 @@ static
>> int txq_submit_skb(struct sk_buff *skb, struct net_device *ndev)
>>  		return NETDEV_TX_OK;
>>  	}
>>
>> -	/* Clear all of the status flags */
>> +	/* Fill in a Tx ring entry */
>> +	bdp = fep->cur_tx;
>> +	status = bdp->cbd_sc;
>>  	status &= ~BD_ENET_TX_STATS;
>>
>>  	/* Set buffer length and buffer pointer */
>>  	bufaddr = skb->data;
>> -	bdp->cbd_datlen = skb->len;
>> +	buflen = skb_headlen(skb);
>>
>>  	index = fec_enet_get_bd_index(bdp, fep);
>> -
>>  	if (((unsigned long) bufaddr) & FEC_ALIGNMENT) {
>> -		memcpy(fep->tx_bounce[index], skb->data, skb->len);
>> +		memcpy(fep->tx_bounce[index], skb->data, buflen);
>>  		bufaddr = fep->tx_bounce[index];
>>  	}
>>
>> @@ -369,62 +483,66 @@ static int txq_submit_skb(struct sk_buff *skb,
>struct net_device *ndev)
>>  	 * swap every frame going to and coming from the controller.
>>  	 */
>>  	if (id_entry->driver_data & FEC_QUIRK_SWAP_FRAME)
>> -		swap_buffer(bufaddr, skb->len);
>> -
>> -	/* Save skb pointer */
>> -	fep->tx_skbuff[index] = skb;
>> +		swap_buffer(bufaddr, buflen);
>
>Same problem here, a driver is certainly not allowed to mess with
>skb->data.
>
Thanks, as above.

Andy


* Re: [PATCH v1 6/6] net: fec: Add software TSO support
  2014-05-30  2:05 ` [PATCH v1 6/6] net: fec: Add software TSO support Fugang Duan
@ 2014-05-30  6:30   ` Eric Dumazet
  2014-05-30  7:16     ` fugang.duan
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2014-05-30  6:30 UTC (permalink / raw)
  To: Fugang Duan
  Cc: b20596, davem, ezequiel.garcia, netdev, shawn.guo, bhutchings, stephen

On Fri, 2014-05-30 at 10:05 +0800, Fugang Duan wrote:

> +	if (((unsigned long) data) & FEC_ALIGNMENT) {
> +		memcpy(fep->tx_bounce[index], data, size);
> +		data = fep->tx_bounce[index];
> +	}

Now that you have SG support, maybe you could avoid copying the whole
fragment, and only copy the beginning to reach the required alignment.

Not sure it's a win, as it requires 2 descriptors instead of one,
and tso_count_descs() would have to be changed as well.
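
Roughly (untested), using the existing bounce buffer for the head only:

	/* bytes needed to reach the next 4-byte boundary */
	unsigned int head = -((unsigned long) data) & FEC_ALIGNMENT;

	/* first BD: bounce only the unaligned head */
	memcpy(fep->tx_bounce[index], data, head);
	/* second BD: map data + head / size - head, now aligned,
	 * with dma_map_single() as today
	 */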

Do you have an idea of how often this bouncing happens for normal
workloads (ie not synthetic benchmarks) ?

Even for non TSO frames, we have an 32bit aligned IP header, so the
Ethernet header is not aligned to a 4 bytes boundary. I suspect this
driver had to bounce all TX frames ?

I am wondering if most of the TSO gain you have comes from this
alignment problem you had before this patch and the SG one.

It looks like you could tweak tcp_sendmsg() to make sure a fragment
always start at a 16 bytes boundary or something...

It should not really matter with iperf because it naturally generates
aligned fragments (A new page starts with offset=0 and iperf uses 128KB
writes...)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index eb1dde37e678..be99af2d54e6 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1220,6 +1220,7 @@ new_segment:
 					merge = false;
 				}
 
+				pfrag->offset = ALIGN(pfrag->offset, 16);
 				copy = min_t(int, copy, pfrag->size - pfrag->offset);
 
 				if (!sk_wmem_schedule(sk, copy))


* RE: [PATCH v1 6/6] net: fec: Add software TSO support
  2014-05-30  6:30   ` Eric Dumazet
@ 2014-05-30  7:16     ` fugang.duan
  2014-05-30 16:21       ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: fugang.duan @ 2014-05-30  7:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Frank.Li, davem, ezequiel.garcia, netdev, shawn.guo, bhutchings, stephen

From: Eric Dumazet <eric.dumazet@gmail.com> Date: Friday, May 30, 2014 2:30 PM
>To: Duan Fugang-B38611
>Cc: Li Frank-B20596; davem@davemloft.net; ezequiel.garcia@free-
>electrons.com; netdev@vger.kernel.org; shawn.guo@linaro.org;
>bhutchings@solarflare.com; stephen@networkplumber.org
>Subject: Re: [PATCH v1 6/6] net: fec: Add software TSO support
>
>On Fri, 2014-05-30 at 10:05 +0800, Fugang Duan wrote:
>
>> +	if (((unsigned long) data) & FEC_ALIGNMENT) {
>> +		memcpy(fep->tx_bounce[index], data, size);
>> +		data = fep->tx_bounce[index];
>> +	}
>
>Now that you have SG support, maybe you could avoid copying the whole
>fragment, and only copy the beginning to reach the required alignment.
>
>Not sure it's a win, as it requires 2 descriptors instead of one, and
>tso_count_descs() would have to be changed as well.
>
>Do you have an idea of how often this bouncing happens for normal
>workloads (ie not synthetic benchmarks) ?
>
>Even for non-TSO frames, we have a 32bit aligned IP header, so the
>Ethernet header is not aligned to a 4 bytes boundary. I suspect this
>driver had to bounce all TX frames ?
>
Yes, testing found that it bounces all TX frames.
Using 2 descriptors to transfer one part makes the driver more complicated; of course,
performance should be better.

Some background information:
The imx6dl FEC hardware has a bandwidth issue that limits it to 400 ~ 700Mbps. Current performance with TSO is 506Mbps, with CPU load about 40%.
Later chips with the FEC IP, such as imx6sx, support byte alignment. On imx6sx FEC without SW TSO, tx bandwidth is 840~870Mbps at 100% CPU load;
with software TSO, tx bandwidth is 840Mbps at 48% CPU load.

>I am wondering if most part of the TSO gain you have comes from this
>alignment problem you had before this patch and the SG one.
>
>It looks like you could tweak tcp_sendmsg() to make sure a fragment always
>start at a 16 bytes boundary or something...
>
>It should not really matter with iperf because it naturally generates
>aligned fragments (A new page starts with offset=0 and iperf uses 128KB
>writes...)
>
>diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index
>eb1dde37e678..be99af2d54e6 100644
>--- a/net/ipv4/tcp.c
>+++ b/net/ipv4/tcp.c
>@@ -1220,6 +1220,7 @@ new_segment:
> 					merge = false;
> 				}
>
>+				pfrag->offset = ALIGN(pfrag->offset, 16);
> 				copy = min_t(int, copy, pfrag->size - pfrag-
>>offset);
>
> 				if (!sk_wmem_schedule(sk, copy))
>
>
>
The solution of tweaking tcp_sendmsg() is better; it doesn't bring any impact to the driver.
I don't know whether anybody would agree with that change?

Thanks,
Andy


* RE: [PATCH v1 4/6] net: fec: Increase buffer descriptor entry number
  2014-05-30  2:05 ` [PATCH v1 4/6] net: fec: Increase buffer descriptor entry number Fugang Duan
@ 2014-05-30  9:10   ` David Laight
  2014-05-30  9:42     ` fugang.duan
  0 siblings, 1 reply; 28+ messages in thread
From: David Laight @ 2014-05-30  9:10 UTC (permalink / raw)
  To: 'Fugang Duan', b20596, davem
  Cc: ezequiel.garcia, netdev, shawn.guo, bhutchings, stephen

From: Fugang Duan
> In order to support SG and software TSO, let's increase the number of BD entries.

Software TSO shouldn't significantly increase the number of
descriptors required.
I'd guess it makes a lot of packets have 2 fragments.

...
> -#define FEC_ENET_RX_PAGES	8
> +#define FEC_ENET_RX_PAGES	256

Supporting TSO shouldn't need more RX descriptors.
While 16 descriptors isn't that many, 512 is a lot.
If PAGE_SIZE is greater than 4k it is a huge amount of memory.

>  #define FEC_ENET_RX_FRSIZE	2048
>  #define FEC_ENET_RX_FRPPG	(PAGE_SIZE / FEC_ENET_RX_FRSIZE)
>  #define RX_RING_SIZE		(FEC_ENET_RX_FRPPG * FEC_ENET_RX_PAGES)
>  #define FEC_ENET_TX_FRSIZE	2048
>  #define FEC_ENET_TX_FRPPG	(PAGE_SIZE / FEC_ENET_TX_FRSIZE)
> -#define TX_RING_SIZE		16	/* Must be power of two */
> -#define TX_RING_MOD_MASK	15	/*   for this to work */
> +#define TX_RING_SIZE		512	/* Must be power of two */
> +#define TX_RING_MOD_MASK	511	/*   for this to work */

Does the driver support BQL (or similar) in order to limit
the amount of queued tx traffic?
Otherwise you've significantly increased the latency for
connections other than the one doing bulk tx.

	David


* RE: [PATCH v1 4/6] net: fec: Increase buffer descriptor entry number
  2014-05-30  9:10   ` David Laight
@ 2014-05-30  9:42     ` fugang.duan
  2014-05-30 13:13       ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: fugang.duan @ 2014-05-30  9:42 UTC (permalink / raw)
  To: David Laight, Frank.Li, davem
  Cc: ezequiel.garcia, netdev, shawn.guo, bhutchings, stephen

From: David Laight <David.Laight@ACULAB.COM> Date: Friday, May 30, 2014 5:11 PM
>To: Duan Fugang-B38611; Li Frank-B20596; davem@davemloft.net
>Cc: ezequiel.garcia@free-electrons.com; netdev@vger.kernel.org;
>shawn.guo@linaro.org; bhutchings@solarflare.com;
>stephen@networkplumber.org
>Subject: RE: [PATCH v1 4/6] net: fec: Increase buffer descriptor entry
>number
>
>From: Fugang Duan
>> In order to support SG and software TSO, let's increase the number of BD entries.
>
>Software TSO shouldn't significantly increase the number of descriptors
>required.
>I'd guess it makes a lot of packets have 2 fragments.
>
>...
>> -#define FEC_ENET_RX_PAGES	8
>> +#define FEC_ENET_RX_PAGES	256
>
>Supporting TSO shouldn't need more RX descriptors.
>While 16 descriptors isn't that many, 512 is a lot.
>If PAGE_SIZE is greater than 4k it is a huge amount of memory.
>
Yes, 16 descriptors is too few; increasing the BD ring size can improve the performance.
And I want to add a later patch to support interrupt coalescing, which needs more BD descriptors.

>>  #define FEC_ENET_RX_FRSIZE	2048
>>  #define FEC_ENET_RX_FRPPG	(PAGE_SIZE / FEC_ENET_RX_FRSIZE)
>>  #define RX_RING_SIZE		(FEC_ENET_RX_FRPPG * FEC_ENET_RX_PAGES)
>>  #define FEC_ENET_TX_FRSIZE	2048
>>  #define FEC_ENET_TX_FRPPG	(PAGE_SIZE / FEC_ENET_TX_FRSIZE)
>> -#define TX_RING_SIZE		16	/* Must be power of two */
>> -#define TX_RING_MOD_MASK	15	/*   for this to work */
>> +#define TX_RING_SIZE		512	/* Must be power of two */
>> +#define TX_RING_MOD_MASK	511	/*   for this to work */
>
>Does the driver support BQL (or similar) in order to limit the amount of
>queued tx traffic?
>Otherwise you've significantly increased the latency for connections other
>than one doing bulk tx.
>
>	David
>
The driver still doesn't support BQL.
I will add that feature to the FEC driver. Thanks for your advice.
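
As a first sketch (not tested yet), I think the hooks are the usual
three BQL calls, with pkts_compl/bytes_compl accumulated over the
completion loop:

	/* in fec_enet_start_xmit(), once the skb is queued */
	netdev_sent_queue(ndev, skb->len);

	/* in fec_enet_tx(), after walking the reclaimed descriptors */
	netdev_completed_queue(ndev, pkts_compl, bytes_compl);

	/* when the ring is torn down or restarted */
	netdev_reset_queue(ndev);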

Thanks,
Andy


* RE: [PATCH v1 4/6] net: fec: Increase buffer descriptor entry number
  2014-05-30  9:42     ` fugang.duan
@ 2014-05-30 13:13       ` Eric Dumazet
  2014-05-30 14:01         ` ezequiel.garcia
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2014-05-30 13:13 UTC (permalink / raw)
  To: fugang.duan
  Cc: David Laight, Frank.Li, davem, ezequiel.garcia, netdev,
	shawn.guo, bhutchings, stephen

On Fri, 2014-05-30 at 09:42 +0000, fugang.duan@freescale.com wrote:
> From: David Laight <David.Laight@ACULAB.COM> Date: Friday, May 30, 2014 5:11 PM
> >
> >Does the driver support BQL (or similar) in order to limit the amount of
> >queued tx traffic?
> >Otherwise you've significantly increased the latency for connections other
> >than one doing bulk tx.
> >
> >	David
> >
> The driver still doesn't support BQL.
> I will add that feature to the FEC driver. Thanks for your advice.

Note that a full size TSO packet (44 or 45 MSS) requires about 88 or 90
descriptors.

So I do not think BQL is really needed, because a 512-slot TX ring won't
add much latency: about 5 ms max.
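
(Back of the envelope: 64KB of payload at a 1448-byte MSS is ~45
segments, and software TSO spends one header descriptor plus at least
one payload descriptor per segment, hence ~90 slots. At two descriptors
per frame, 512 slots is ~256 full-size frames, about 3 Mbit queued,
which drains in roughly 5 ms at the ~600 Mbit/s rates seen here.)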

BQL is generally nice for NICs supporting hardware TSO, where a 64KB TSO
packet consumes 3 or 4 descriptors.

Also note that TCP Small Queues should limit TX ring occupancy of a
single bulk flow anyway.


* Re: [PATCH v1 4/6] net: fec: Increase buffer descriptor entry number
  2014-05-30 13:13       ` Eric Dumazet
@ 2014-05-30 14:01         ` ezequiel.garcia
  2014-05-30 14:54           ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: ezequiel.garcia @ 2014-05-30 14:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: fugang.duan, David Laight, Frank.Li, davem, netdev, shawn.guo,
	bhutchings, stephen

Hello Eric,

On 30 May 06:13 AM, Eric Dumazet wrote:
> On Fri, 2014-05-30 at 09:42 +0000, fugang.duan@freescale.com wrote:
> > From: David Laight <David.Laight@ACULAB.COM> Data: Friday, May 30, 2014 5:11 PM
> > >
> > >Does the driver support BQL (or similar) in order to limit the amount of
> > >queued tx traffic?
> > >Otherwise you've significantly increased the latency for connections other
> > >than one doing bulk tx.
> > >
> > >	David
> > >
> > The driver still doesn't support BQL.
> > I will add that feature to the FEC driver. Thanks for your advice.
> 
> Note that a full size TSO packet (44 or 45 MSS) requires about 88 or 90
> descriptors.
> 

What's the rationale behind those numbers?

Following the Solarflare commit you pointed me a while ago, I've set
gso_max_segs to limit the number of descriptors in the mvneta and
mv643xx_eth drivers.

My understanding reading the sfc commit 7e6d06f0de3f7 is that the
maximum number of required descriptors would be around ~250:

/* Header and payload descriptor for each output segment, plus
 * one for every input fragment boundary within a segment
 */
unsigned int max_descs = EFX_TSO_MAX_SEGS * 2 + MAX_SKB_FRAGS;

Maybe I've misunderstood something?

PS: I was also about to add BQL support to the drivers...
-- 
Ezequiel García, Free Electrons
Embedded Linux, Kernel and Android Engineering
http://free-electrons.com


* RE: [PATCH v1 5/6] net: fec: Add Scatter/gather support
  2014-05-30  2:05 ` [PATCH v1 5/6] net: fec: Add Scatter/gather support Fugang Duan
  2014-05-30  4:37   ` Eric Dumazet
@ 2014-05-30 14:26   ` Frank.Li
  1 sibling, 0 replies; 28+ messages in thread
From: Frank.Li @ 2014-05-30 14:26 UTC (permalink / raw)
  To: fugang.duan, davem
  Cc: ezequiel.garcia, netdev, shawn.guo, bhutchings, fugang.duan, stephen



> -----Original Message-----
> From: Fugang Duan [mailto:b38611@freescale.com]
> Sent: Thursday, May 29, 2014 9:06 PM
> To: Li Frank-B20596; davem@davemloft.net
> Cc: ezequiel.garcia@free-electrons.com; netdev@vger.kernel.org;
> shawn.guo@linaro.org; bhutchings@solarflare.com; Duan Fugang-B38611;
> stephen@networkplumber.org
> Subject: [PATCH v1 5/6] net: fec: Add Scatter/gather support
> 
> Add Scatter/gather support for FEC.
> This feature improves outbound throughput performance.
> Running iperf tests shows a 55.4% improvement, tested on an imx6dl sabresd
> board.
> 
> $ ethtool -K eth0 sg off
> [  3] local 10.192.242.108 port 52618 connected with 10.192.242.167 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0- 3.0 sec  99.5 MBytes   278 Mbits/sec
> 
> $ ethtool -K eth0 sg on
> $ iperf -c 10.192.242.167 -t 3 &
> [  3] local 10.192.242.108 port 52617 connected with 10.192.242.167 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0- 3.0 sec   154 MBytes   432 Mbits/sec
> 
> Signed-off-by: Fugang Duan <B38611@freescale.com>
> ---
>  drivers/net/ethernet/freescale/fec.h      |    2 +-
>  drivers/net/ethernet/freescale/fec_main.c |  230 ++++++++++++++++++++++-------
>  2 files changed, 178 insertions(+), 54 deletions(-)
> 
> diff --git a/drivers/net/ethernet/freescale/fec.h b/drivers/net/ethernet/freescale/fec.h
> index 74fbd49..57f3ecf 100644
> --- a/drivers/net/ethernet/freescale/fec.h
> +++ b/drivers/net/ethernet/freescale/fec.h
> @@ -221,7 +221,7 @@ struct bufdesc_ex {
>  #define BD_ENET_TX_RCMASK       ((ushort)0x003c)
>  #define BD_ENET_TX_UN           ((ushort)0x0002)
>  #define BD_ENET_TX_CSL          ((ushort)0x0001)
> -#define BD_ENET_TX_STATS        ((ushort)0x03ff)        /* All status bits */
> +#define BD_ENET_TX_STATS        ((ushort)0x0fff)        /* All status bits */
> 
>  /*enhanced buffer descriptor control/status used by Ethernet transmit*/
>  #define BD_ENET_TX_INT          0x40000000
> diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
> index 714100f..a96ed3a 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -301,6 +301,23 @@ int fec_enet_get_bd_index(struct bufdesc *bdp, struct fec_enet_private *fep)
>  	return index;
>  }
> 
> +static inline
> +int fec_enet_txdesc_entry_free(struct fec_enet_private *fep)
> +{

Suggest changing it to fec_enet_get_free_txdesc_num;
"_free" looks like it frees some memory or resource.

> +	int entries;
> +
> +	if (fep->bufdesc_ex)
> +		entries = (struct bufdesc_ex *)fep->dirty_tx -
> +				(struct bufdesc_ex *)fep->cur_tx;
> +	else
> +		entries = fep->dirty_tx - fep->cur_tx;
> +
> +	if (fep->cur_tx >= fep->dirty_tx)
> +		entries += fep->tx_ring_size;
> +
> +	return entries;
> +}
> +
>  static void *swap_buffer(void *bufaddr, int len)
>  {
>  	int i;
> @@ -328,20 +345,116 @@ fec_enet_clear_csum(struct sk_buff *skb, struct net_device *ndev)
>  	return 0;
>  }
> 
> -static int txq_submit_skb(struct sk_buff *skb, struct net_device *ndev)
> +static void
> +fec_enet_submit_work(struct bufdesc *bdp, struct fec_enet_private *fep)
> +{
> +	const struct platform_device_id *id_entry =
> +				platform_get_device_id(fep->pdev);
> +	struct bufdesc *bdp_pre;
> +
> +	bdp_pre = fec_enet_get_prevdesc(bdp, fep);
> +	if ((id_entry->driver_data & FEC_QUIRK_ERR006358) &&
> +	    !(bdp_pre->cbd_sc & BD_ENET_TX_READY)) {
> +		fep->delay_work.trig_tx = true;
> +		schedule_delayed_work(&(fep->delay_work.delay_work),
> +					msecs_to_jiffies(1));
> +	}
> +}
> +
> +static int
> +fec_enet_txq_submit_frag_skb(struct sk_buff *skb, struct net_device *ndev)
>  {
>  	struct fec_enet_private *fep = netdev_priv(ndev);
>  	const struct platform_device_id *id_entry =
>  				platform_get_device_id(fep->pdev);
> -	struct bufdesc *bdp, *bdp_pre;
> -	void *bufaddr;
> -	unsigned short	status;
> +	struct bufdesc *bdp = fep->cur_tx;
> +	struct bufdesc_ex *ebdp;
> +	int nr_frags = skb_shinfo(skb)->nr_frags;
> +	int frag, frag_len;
> +	unsigned short status;
> +	unsigned int estatus = 0;
> +	skb_frag_t *this_frag;
>  	unsigned int index;
> +	void *bufaddr;
> +	int i;
> 
> -	/* Fill in a Tx ring entry */
> +	for (frag = 0; frag < nr_frags; frag++) {
> +		this_frag = &skb_shinfo(skb)->frags[frag];
> +		bdp = fec_enet_get_nextdesc(bdp, fep);
> +		ebdp = (struct bufdesc_ex *)bdp;
> +
> +		status = bdp->cbd_sc;
> +		status &= ~BD_ENET_TX_STATS;
> +		status |= (BD_ENET_TX_TC | BD_ENET_TX_READY);
> +		frag_len = skb_shinfo(skb)->frags[frag].size;
> +
> +		/* Handle the last BD specially */
> +		if (frag == nr_frags - 1) {
> +			status |= (BD_ENET_TX_INTR | BD_ENET_TX_LAST);
> +			if (fep->bufdesc_ex) {
> +				estatus |= BD_ENET_TX_INT;
> +				if (unlikely(skb_shinfo(skb)->tx_flags &
> +					SKBTX_HW_TSTAMP && fep->hwts_tx_en))
> +					estatus |= BD_ENET_TX_TS;
> +			}
> +		}
> +
> +		if (fep->bufdesc_ex) {
> +			if (skb->ip_summed == CHECKSUM_PARTIAL)
> +				estatus |= BD_ENET_TX_PINS | BD_ENET_TX_IINS;
> +			ebdp->cbd_bdu = 0;
> +			ebdp->cbd_esc = estatus;
> +		}
> +
> +		bufaddr = page_address(this_frag->page.p) + this_frag->page_offset;
> +
> +		index = fec_enet_get_bd_index(bdp, fep);
> +		if (((unsigned long) bufaddr) & FEC_ALIGNMENT) {
> +			memcpy(fep->tx_bounce[index], bufaddr, frag_len);
> +			bufaddr = fep->tx_bounce[index];
> +		}
> +		if (id_entry->driver_data & FEC_QUIRK_SWAP_FRAME)
> +			swap_buffer(bufaddr, frag_len);
> +		bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, bufaddr,
> +						frag_len, DMA_TO_DEVICE);
> +		if (dma_mapping_error(&fep->pdev->dev, bdp->cbd_bufaddr)) {
> +			dev_kfree_skb_any(skb);
> +			if (net_ratelimit())
> +				netdev_err(ndev, "Tx DMA memory map failed\n");
> +			goto dma_mapping_error;
> +		}
> +
> +		bdp->cbd_datlen = frag_len;
> +		bdp->cbd_sc = status;
> +	}
> +
> +	fep->cur_tx = bdp;
> +
> +	return 0;
> +
> +dma_mapping_error:
>  	bdp = fep->cur_tx;
> +	for (i = 0; i < frag; i++) {
> +		bdp = fec_enet_get_nextdesc(bdp, fep);
> +		dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr,
> +				bdp->cbd_datlen, DMA_TO_DEVICE);
> +	}
> +	return NETDEV_TX_OK;
> +}
> 
> -	status = bdp->cbd_sc;
> +static int fec_enet_txq_submit_skb(struct sk_buff *skb, struct net_device *ndev)
> +{
> +	struct fec_enet_private *fep = netdev_priv(ndev);
> +	const struct platform_device_id *id_entry =
> +				platform_get_device_id(fep->pdev);
> +	int nr_frags = skb_shinfo(skb)->nr_frags;
> +	struct bufdesc *bdp, *last_bdp;
> +	void *bufaddr;
> +	unsigned short status;
> +	unsigned short buflen;
> +	unsigned int estatus = 0;
> +	unsigned int index;
> +	int ret;
> 
>  	/* Protocol checksum off-load for TCP and UDP. */
>  	if (fec_enet_clear_csum(skb, ndev)) {
> @@ -349,17 +462,18 @@ static int txq_submit_skb(struct sk_buff *skb, struct net_device *ndev)
>  		return NETDEV_TX_OK;
>  	}
> 
> -	/* Clear all of the status flags */
> +	/* Fill in a Tx ring entry */
> +	bdp = fep->cur_tx;
> +	status = bdp->cbd_sc;
>  	status &= ~BD_ENET_TX_STATS;
> 
>  	/* Set buffer length and buffer pointer */
>  	bufaddr = skb->data;
> -	bdp->cbd_datlen = skb->len;
> +	buflen = skb_headlen(skb);
> 
>  	index = fec_enet_get_bd_index(bdp, fep);
> -
>  	if (((unsigned long) bufaddr) & FEC_ALIGNMENT) {
> -		memcpy(fep->tx_bounce[index], skb->data, skb->len);
> +		memcpy(fep->tx_bounce[index], skb->data, buflen);
>  		bufaddr = fep->tx_bounce[index];
>  	}
> 
> @@ -369,62 +483,66 @@ static int txq_submit_skb(struct sk_buff *skb, struct net_device *ndev)
>  	 * swap every frame going to and coming from the controller.
>  	 */
>  	if (id_entry->driver_data & FEC_QUIRK_SWAP_FRAME)
> -		swap_buffer(bufaddr, skb->len);
> -
> -	/* Save skb pointer */
> -	fep->tx_skbuff[index] = skb;
> +		swap_buffer(bufaddr, buflen);
> 
>  	/* Push the data cache so the CPM does not get stale memory
>  	 * data.
>  	 */
>  	bdp->cbd_bufaddr = dma_map_single(&fep->pdev->dev, bufaddr,
> -			skb->len, DMA_TO_DEVICE);
> +					buflen, DMA_TO_DEVICE);
>  	if (dma_mapping_error(&fep->pdev->dev, bdp->cbd_bufaddr)) {
> -		bdp->cbd_bufaddr = 0;
> -		fep->tx_skbuff[index] = NULL;
>  		dev_kfree_skb_any(skb);
>  		if (net_ratelimit())
>  			netdev_err(ndev, "Tx DMA memory map failed\n");
>  		return NETDEV_TX_OK;
>  	}
> 
> +	if (nr_frags) {
> +		ret = fec_enet_txq_submit_frag_skb(skb, ndev);

The process of sending skb->data is similar to that for the frags.
Can you combine them into one function?

Best regards
Frank Li

> +		if (ret)
> +			return ret;
> +	} else {
> +		status |= (BD_ENET_TX_INTR | BD_ENET_TX_LAST);
> +		if (fep->bufdesc_ex) {
> +			estatus = BD_ENET_TX_INT;
> +			if (unlikely(skb_shinfo(skb)->tx_flags &
> +				SKBTX_HW_TSTAMP && fep->hwts_tx_en))
> +				estatus |= BD_ENET_TX_TS;
> +		}
> +	}
> +
>  	if (fep->bufdesc_ex) {
> 
>  		struct bufdesc_ex *ebdp = (struct bufdesc_ex *)bdp;
> -		ebdp->cbd_bdu = 0;
> +
>  		if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP &&
> -			fep->hwts_tx_en)) {
> -			ebdp->cbd_esc = (BD_ENET_TX_TS | BD_ENET_TX_INT);
> +			fep->hwts_tx_en))
>  			skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
> -		} else {
> -			ebdp->cbd_esc = BD_ENET_TX_INT;
> 
> -			/* Enable protocol checksum flags
> -			 * We do not bother with the IP Checksum bits as they
> -			 * are done by the kernel
> -			 */
> -			if (skb->ip_summed == CHECKSUM_PARTIAL)
> -				ebdp->cbd_esc |= BD_ENET_TX_PINS | BD_ENET_TX_IINS;
> -		}
> +		if (skb->ip_summed == CHECKSUM_PARTIAL)
> +			estatus |= BD_ENET_TX_PINS | BD_ENET_TX_IINS;
> +
> +		ebdp->cbd_bdu = 0;
> +		ebdp->cbd_esc = estatus;
>  	}
> 
> +	last_bdp = fep->cur_tx;
> +	index = fec_enet_get_bd_index(last_bdp, fep);
> +	/* Save skb pointer */
> +	fep->tx_skbuff[index] = skb;
> +
> +	bdp->cbd_datlen = buflen;
> +
>  	/* Send it on its way.  Tell FEC it's ready, interrupt when done,
>  	 * it's the last BD of the frame, and to put the CRC on the end.
>  	 */
> -	status |= (BD_ENET_TX_READY | BD_ENET_TX_INTR
> -			| BD_ENET_TX_LAST | BD_ENET_TX_TC);
> +	status |= (BD_ENET_TX_READY | BD_ENET_TX_TC);
>  	bdp->cbd_sc = status;
> 
> -	bdp_pre = fec_enet_get_prevdesc(bdp, fep);
> -	if ((id_entry->driver_data & FEC_QUIRK_ERR006358) &&
> -	    !(bdp_pre->cbd_sc & BD_ENET_TX_READY)) {
> -		fep->delay_work.trig_tx = true;
> -		schedule_delayed_work(&(fep->delay_work.delay_work),
> -					msecs_to_jiffies(1));
> -	}
> +	fec_enet_submit_work(bdp, fep);
> 
>  	/* If this was the last BD in the ring, start at the beginning again. */
> -	bdp = fec_enet_get_nextdesc(bdp, fep);
> +	bdp = fec_enet_get_nextdesc(last_bdp, fep);
> 
>  	skb_tx_timestamp(skb);
> 
> @@ -433,7 +551,7 @@ static int txq_submit_skb(struct sk_buff *skb, struct net_device *ndev)
>  	/* Trigger transmission start */
>  	writel(0, fep->hwp + FEC_X_DES_ACTIVE);
> 
> -	return NETDEV_TX_OK;
> +	return 0;
>  }
> 
>  static netdev_tx_t
> @@ -442,6 +560,7 @@ fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
>  	struct fec_enet_private *fep = netdev_priv(ndev);
>  	struct bufdesc *bdp;
>  	unsigned short	status;
> +	int entries_free;
>  	int ret;
> 
>  	/* Fill in a Tx ring entry */
> @@ -453,15 +572,17 @@ fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
>  		/* Ooops.  All transmit buffers are full.  Bail out.
>  		 * This should not happen, since ndev->tbusy should be set.
>  		 */
> -		netdev_err(ndev, "tx queue full!\n");
> +		if (net_ratelimit())
> +			netdev_err(ndev, "tx queue full!\n");
>  		return NETDEV_TX_BUSY;
>  	}
> 
> -	ret = txq_submit_skb(skb, ndev);
> -	if (ret == -EBUSY)
> -		return NETDEV_TX_BUSY;
> +	ret = fec_enet_txq_submit_skb(skb, ndev);
> +	if (ret)
> +		return ret;
> 
> -	if (fep->cur_tx == fep->dirty_tx)
> +	entries_free = fec_enet_txdesc_entry_free(fep);
> +	if (entries_free < MAX_SKB_FRAGS + 1)
>  		netif_stop_queue(ndev);
> 
>  	return NETDEV_TX_OK;
> @@ -782,6 +903,7 @@ fec_enet_tx(struct net_device *ndev)
>  	unsigned short status;
>  	struct	sk_buff	*skb;
>  	int	index = 0;
> +	int	entries;
> 
>  	fep = netdev_priv(ndev);
>  	bdp = fep->dirty_tx;
> @@ -798,9 +920,13 @@ fec_enet_tx(struct net_device *ndev)
>  		index = fec_enet_get_bd_index(bdp, fep);
> 
>  		skb = fep->tx_skbuff[index];
> -		dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr, skb->len,
> +		dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr, bdp->cbd_datlen,
>  				DMA_TO_DEVICE);
>  		bdp->cbd_bufaddr = 0;
> +		if (!skb) {
> +			bdp = fec_enet_get_nextdesc(bdp, fep);
> +			continue;
> +		}
> 
>  		/* Check for errors. */
>  		if (status & (BD_ENET_TX_HB | BD_ENET_TX_LC |
> @@ -819,7 +945,7 @@ fec_enet_tx(struct net_device *ndev)
>  				ndev->stats.tx_carrier_errors++;
>  		} else {
>  			ndev->stats.tx_packets++;
> -			ndev->stats.tx_bytes += bdp->cbd_datlen;
> +			ndev->stats.tx_bytes += skb->len;
>  		}
> 
>  		if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS) &&
> @@ -856,15 +982,13 @@ fec_enet_tx(struct net_device *ndev)
> 
>  		/* Since we have freed up a buffer, the ring is no longer full
>  		 */
> -		if (fep->dirty_tx != fep->cur_tx) {
> -			if (netif_queue_stopped(ndev))
> -				netif_wake_queue(ndev);
> -		}
> +		entries = fec_enet_txdesc_entry_free(fep);
> +		if (entries >= MAX_SKB_FRAGS + 1 && netif_queue_stopped(ndev))
> +			netif_wake_queue(ndev);
>  	}
>  	return;
>  }
> 
> -
>  /* During a receive, the cur_rx points to the current incoming buffer.
>   * When we update through the ring, if the next incoming buffer has
>   * not been given to the system, we just set the empty indicator,
> @@ -2106,7 +2230,7 @@ static int fec_enet_init(struct net_device *ndev)
>  	if (id_entry->driver_data & FEC_QUIRK_HAS_CSUM) {
>  		/* enable hw accelerator */
>  		ndev->features |= (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM
> -				| NETIF_F_RXCSUM);
> +				| NETIF_F_RXCSUM | NETIF_F_SG);
>  		fep->csum_flags |= FLAG_RX_CSUM_ENABLED;
>  	}
> 
> --
> 1.7.8

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 4/6] net: fec: Increase buffer descriptor entry number
  2014-05-30 14:01         ` ezequiel.garcia
@ 2014-05-30 14:54           ` Eric Dumazet
  2014-05-30 15:08             ` fugang.duan
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2014-05-30 14:54 UTC (permalink / raw)
  To: ezequiel.garcia
  Cc: fugang.duan, David Laight, Frank.Li, davem, netdev, shawn.guo,
	bhutchings, stephen

On Fri, 2014-05-30 at 11:01 -0300, ezequiel.garcia@free-electrons.com
wrote:
> Hello Eric,
> 
> On 30 May 06:13 AM, Eric Dumazet wrote:
> > On Fri, 2014-05-30 at 09:42 +0000, fugang.duan@freescale.com wrote:
> > > From: David Laight <David.Laight@ACULAB.COM> Date: Friday, May 30, 2014 5:11 PM
> > > >
> > > >Does the driver support BQL (or similar) in order to limit the amount of
> > > >queued tx traffic?
> > > >Otherwise you've significantly increased the latency for connections other
> > > >than one doing bulk tx.
> > > >
> > > >	David
> > > >
> > > The driver still doesn't support BQL.
> > > I will add the feature to FEC. Thanks for your advice.
> > 
> > Note that a full size TSO packet (44 or 45 MSS) requires about 88 or 90
> > descriptors.
> > 
> 
> What's the rationale behind those numbers?
> 

64KB TSO packet, with MSS=1460 -> 44 segments  (44*1460 = 64240)
with MSS=1448 (TCP timestamps) -> 45 segments  (45*1448 = 65160)

This software TSO emulation uses at least 2 descriptors per MSS

one descriptor to hold the headers (ethernet + ip + tcp)
one descriptor (or two) to hold the payload for this MSS
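
In code terms, roughly (a back-of-the-envelope sketch, not fec code):

	/* worst case for one 64KB software-TSO send, assuming no
	 * extra descriptor for frag boundaries */
	static int tso_desc_estimate(void)
	{
		int mss  = 1448;          /* MSS with TCP timestamps */
		int segs = 65160 / mss;   /* 45 segments */
		return segs * 2;          /* header + payload desc = 90 */
	}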

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH v1 4/6] net: fec: Increase buffer descriptor entry number
  2014-05-30 14:54           ` Eric Dumazet
@ 2014-05-30 15:08             ` fugang.duan
  2014-05-30 15:34               ` David Laight
  2014-05-30 15:40               ` Eric Dumazet
  0 siblings, 2 replies; 28+ messages in thread
From: fugang.duan @ 2014-05-30 15:08 UTC (permalink / raw)
  To: Eric Dumazet, ezequiel.garcia
  Cc: David Laight, Frank.Li, davem, netdev, shawn.guo, bhutchings, stephen

From: Eric Dumazet <eric.dumazet@gmail.com> Date: Friday, May 30, 2014 10:54 PM
>To: ezequiel.garcia@free-electrons.com
>Cc: Duan Fugang-B38611; David Laight; Li Frank-B20596; davem@davemloft.net;
>netdev@vger.kernel.org; shawn.guo@linaro.org; bhutchings@solarflare.com;
>stephen@networkplumber.org
>Subject: Re: [PATCH v1 4/6] net: fec: Increase buffer descriptor entry
>number
>
>On Fri, 2014-05-30 at 11:01 -0300, ezequiel.garcia@free-electrons.com
>wrote:
>> Hello Eric,
>>
>> On 30 May 06:13 AM, Eric Dumazet wrote:
>> > On Fri, 2014-05-30 at 09:42 +0000, fugang.duan@freescale.com wrote:
>> > > From: David Laight <David.Laight@ACULAB.COM> Date: Friday, May 30,
>> > > 2014 5:11 PM
>> > > >
>> > > >Does the driver support BQL (or similar) in order to limit the
>> > > >amount of queued tx traffic?
>> > > >Otherwise you've significantly increased the latency for
>> > > >connections other than one doing bulk tx.
>> > > >
>> > > >	David
>> > > >
>> > > The driver still doesn't support BQL.
>> > > I will add the feature to FEC. Thanks for your advice.
>> >
>> > Note that a full size TSO packet (44 or 45 MSS) requires about 88 or
>> > 90 descriptors.
>> >
>>
>> What's the rationale behind those numbers?
>>
>
>64KB TSO packet, with MSS=1460 -> 44 segments  (44*1460 = 64240) with
>MSS=1448 (TCP timestamps) -> 45 segments  (45*1448 = 65160)
>
>This software TSO emulation uses at least 2 descriptors per MSS
>
>one descriptor to hold the headers (ethernet + ip + tcp) one descriptor
>(or two) to hold the payload for this MSS
>
Thanks for Eric's detailed explanation.

If the frag page data does not match the alignment required by the ethernet DMA controller, three descriptors are needed for one MSS:
one descriptor for the headers, one for the leading unaligned bytes copied from the frag page, and one for the rest of the frag page.

So one frame may cost 3 x 45 = 135 descriptors.

And I will add interrupt coalescing support for tx and rx, which also costs some more descriptors.

So setting the descriptor slots to 512 is not big, just reasonable. What do you think?


Thanks,
Andy 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH v1 4/6] net: fec: Increase buffer descriptor entry number
  2014-05-30 15:08             ` fugang.duan
@ 2014-05-30 15:34               ` David Laight
  2014-05-30 15:52                 ` fugang.duan
  2014-05-30 15:58                 ` Eric Dumazet
  2014-05-30 15:40               ` Eric Dumazet
  1 sibling, 2 replies; 28+ messages in thread
From: David Laight @ 2014-05-30 15:34 UTC (permalink / raw)
  To: 'fugang.duan@freescale.com', Eric Dumazet, ezequiel.garcia
  Cc: Frank.Li, davem, netdev, shawn.guo, bhutchings, stephen

From: fugang.duan@freescale.com
> >64KB TSO packet, with MSS=1460 -> 44 segments  (44*1460 = 64240) with
> >MSS=1448 (TCP timestamps) -> 45 segments  (45*1448 = 65160)
> >
> >This software TSO emulation uses at least 2 descriptors per MSS
> >
> >one descriptor to hold the headers (ethernet + ip + tcp) one descriptor
> >(or two) to hold the payload for this MSS
> >
> Thanks for Eric's detailed explanation.
> 
> If the frag page data does not match the alignment required by the ethernet
> DMA controller, three descriptors are needed for one MSS:
> one descriptor for the headers, one for the leading unaligned bytes copied
> from the frag page, and one for the rest of the frag page.
> 
> So one frame may cost 3 x 45 = 135 descriptors.

No - that is 45 frames, typically needing 3 ring entries each.

> And I will add interrupt coalescing support for tx and rx, which also costs some more descriptors.
> 
> So setting the descriptor slots to 512 is not big, just reasonable. What do you think?

Software TSO generates lots of separate ethernet frames; there is no
absolute requirement to be able to put all of them into the tx ring at once.

The required size for the tx ring is much more likely to be related
to any interrupt mitigation that delays the refilling of ring entries.
512 sounds like a lot of tx ring entries.

The receive ring doesn't need to allow for fragments - your driver
is in control of allocating the buffers.
I didn't understand why you've based the number of rx ring entries
on PAGE_SIZE - IIRC that might be 64k, or even larger.
I'd have expected 128 or 256 rx ring entries to be typical, but
how many are needed is a separate issue.
If the system can keep up with the maximum ethernet data rate
I'd expect that a smaller number would be fine.
If it can't keep up you'll lose packets anyway.
Aggressive power saving (with wake on LAN) might need more.

	David


^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH v1 4/6] net: fec: Increase buffer descriptor entry number
  2014-05-30 15:08             ` fugang.duan
  2014-05-30 15:34               ` David Laight
@ 2014-05-30 15:40               ` Eric Dumazet
  2014-05-30 15:46                 ` fugang.duan
  1 sibling, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2014-05-30 15:40 UTC (permalink / raw)
  To: fugang.duan
  Cc: ezequiel.garcia, David Laight, Frank.Li, davem, netdev,
	shawn.guo, bhutchings, stephen

On Fri, 2014-05-30 at 15:08 +0000, fugang.duan@freescale.com wrote:

> If the frag page data does not match the alignment required by the ethernet DMA controller, three descriptors are needed for one MSS:
> one descriptor for the headers, one for the leading unaligned bytes copied from the frag page, and one for the rest of the frag page.
> 

You could avoid the 2nd descriptor by copying the unaligned part of the
payload into the first descriptor (the one containing the headers: at about
66 bytes of headers you have room; if not, increase the 128-byte room to 192 bytes).
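
Roughly (a sketch; hdr_bounce, hdr_len and payload are made-up names,
not the driver's identifiers):

	/* fold the misaligned head of the payload into the header
	 * bounce buffer; the rest of the payload is then aligned
	 * and fits in a single extra descriptor */
	unsigned int mis  = (unsigned long)payload & FEC_ALIGNMENT;
	unsigned int head = mis ? FEC_ALIGNMENT + 1 - mis : 0;

	memcpy(hdr_bounce + hdr_len, payload, head);
	/* desc 1: hdr_bounce, length hdr_len + head
	 * desc 2: payload + head, length len - head */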

> So one frame may cost 3 x 45 = 135 descriptors.

May... but in general it would be closer to 2 * 45.


> 
> So setting the descriptor slots to 512 is not big, just reasonable. What do you think?

I believe it is fine: that is about five 64KB TSO packets.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH v1 4/6] net: fec: Increase buffer descriptor entry number
  2014-05-30 15:40               ` Eric Dumazet
@ 2014-05-30 15:46                 ` fugang.duan
  0 siblings, 0 replies; 28+ messages in thread
From: fugang.duan @ 2014-05-30 15:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: ezequiel.garcia, David Laight, Frank.Li, davem, netdev,
	shawn.guo, bhutchings, stephen

From: Eric Dumazet <eric.dumazet@gmail.com> Date: Friday, May 30, 2014 11:41 PM
>To: Duan Fugang-B38611
>Cc: ezequiel.garcia@free-electrons.com; David Laight; Li Frank-B20596;
>davem@davemloft.net; netdev@vger.kernel.org; shawn.guo@linaro.org;
>bhutchings@solarflare.com; stephen@networkplumber.org
>Subject: RE: [PATCH v1 4/6] net: fec: Increase buffer descriptor entry
>number
>
>On Fri, 2014-05-30 at 15:08 +0000, fugang.duan@freescale.com wrote:
>
>> If the frag page data does not match the alignment required by the ethernet
>> DMA controller, three descriptors are needed for one MSS:
>> one descriptor for the headers, one for the leading unaligned bytes copied
>> from the frag page, and one for the rest of the frag page.
>>
>
>You could avoid the 2nd descriptor by copying the unaligned part of the
>payload into the first descriptor (the one containing the headers: at about
>66 bytes of headers you have room; if not, increase the 128-byte room to 192 bytes).
>
>> So one frame may cost 3 x 45 = 135 descriptors.
>
>May... but in general it would be closer to 2 * 45.
>
Good idea!

>
>>
>> So setting the descriptor slots to 512 is not big, just reasonable. What
>> do you think?
>
>I believe it is fine: that is about five 64KB TSO packets.
>

Thanks,
Andy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH v1 4/6] net: fec: Increase buffer descriptor entry number
  2014-05-30 15:34               ` David Laight
@ 2014-05-30 15:52                 ` fugang.duan
  2014-05-30 15:58                 ` Eric Dumazet
  1 sibling, 0 replies; 28+ messages in thread
From: fugang.duan @ 2014-05-30 15:52 UTC (permalink / raw)
  To: David Laight, Eric Dumazet, ezequiel.garcia
  Cc: Frank.Li, davem, netdev, shawn.guo, bhutchings, stephen

Hi, David

From: David Laight <David.Laight@ACULAB.COM> Date: Friday, May 30, 2014 11:35 PM
>To: Duan Fugang-B38611; Eric Dumazet; ezequiel.garcia@free-electrons.com
>Cc: Li Frank-B20596; davem@davemloft.net; netdev@vger.kernel.org;
>shawn.guo@linaro.org; bhutchings@solarflare.com;
>stephen@networkplumber.org
>Subject: RE: [PATCH v1 4/6] net: fec: Increase buffer descriptor entry
>number
>
>From: fugang.duan@freescale.com
>> >64KB TSO packet, with MSS=1460 -> 44 segments  (44*1460 = 64240) with
>> >MSS=1448 (TCP timestamps) -> 45 segments  (45*1448 = 65160)
>> >
>> >This software TSO emulation uses at least 2 descriptors per MSS
>> >
>> >one descriptor to hold the headers (ethernet + ip + tcp) one
>> >descriptor (or two) to hold the payload for this MSS
>> >
>> Thanks for Eric's detailed explanation.
>>
>> If the frag page data does not match the alignment required by the
>> ethernet DMA controller, three descriptors are needed for one MSS:
>> one descriptor for the headers, one for the leading unaligned bytes copied
>> from the frag page, and one for the rest of the frag page.
>>
>> So one frame may cost 3 x 45 = 135 descriptors.
>
>No - that is 45 frames, typically needing 3 ring entries each.
>
>> And I will add interrupt coalescing support for tx and rx, which also
>> costs some more descriptors.
>>
>> So setting the descriptor slots to 512 is not big, just reasonable. What
>> do you think?
>
>Software TSO generates lots of separate ethernet frames; there is no
>absolute requirement to be able to put all of them into the tx ring at
>once.
>
>The required size for the tx ring is much more likely to be related to any
>interrupt mitigation that delays the refilling of ring entries.
>512 sounds like a lot of tx ring entries.
>
>The receive ring doesn't need to allow for fragments - your driver is in
>control of allocating the buffers.
>I didn't understand why you've based the number of rx ring entries on
>PAGE_SIZE - IIRC that might be 64k, or even larger.
>I'd have expected 128 or 256 rx ring entries to be typical, but how many
>are needed is a separate issue.
>If the system can keep up with the maximum ethernet data rate I'd expect
>that a smaller number would be fine.
>If it can't keep up you'll lose packets anyway.
>Aggressive power saving (with wake on LAN) might need more.
>
>	David

Thanks for your information. I will do more testing to weigh a smaller number against the maximum data rate.
Summarizing your and Eric's thoughts, maybe rx set to 256 and tx set to 512 are fine.

Thanks,
Andy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH v1 4/6] net: fec: Increase buffer descriptor entry number
  2014-05-30 15:34               ` David Laight
  2014-05-30 15:52                 ` fugang.duan
@ 2014-05-30 15:58                 ` Eric Dumazet
  2014-05-30 16:35                   ` ezequiel.garcia
  1 sibling, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2014-05-30 15:58 UTC (permalink / raw)
  To: David Laight
  Cc: 'fugang.duan@freescale.com',
	ezequiel.garcia, Frank.Li, davem, netdev, shawn.guo, bhutchings,
	stephen

On Fri, 2014-05-30 at 15:34 +0000, David Laight wrote:

> Software TSO generates lots of separate ethernet frames; there is no
> absolute requirement to be able to put all of them into the tx ring at once.

Yes, there is an absolute requirement. It is the driver's responsibility to
stop the queue if the available slots in the TX ring would not allow the
following packet to be sent.

That's why a TSO emulation needs to set gso_max_segs to some sane value

(sfc sets it to 100)

In Fugang's patch, this part was the wrong way to deal with this:

+       if (tso_count_descs(skb) >= fec_enet_txdesc_entry_free(fep)) {
+               if (net_ratelimit())
+                       netdev_err(ndev, "tx queue full!\n");
+               return NETDEV_TX_BUSY;
+       }
+

This was copy/pasted from buggy
drivers/net/ethernet/marvell/mv643xx_eth.c
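
The usual shape is the opposite: queue the skb, then stop the queue
while a worst-case packet might not fit, and re-wake it from TX
completion. A sketch (worst_case_descs is a made-up bound, kept sane
by capping ndev->gso_max_segs):

	/* end of ndo_start_xmit() */
	if (fec_enet_txdesc_entry_free(fep) < worst_case_descs)
		netif_stop_queue(ndev);

	return NETDEV_TX_OK;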

> 
> The required size for the tx ring is much more likely to be related
> to any interrupt mitigation that delays the refilling of ring entries.
> 512 sounds like a lot of tx ring entries.

512 slots -> 256 frames (considering SG: headers/payload use 2 desc)

-> 3 ms on a Gbit NIC. It's about the right size.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH v1 6/6] net: fec: Add software TSO support
  2014-05-30  7:16     ` fugang.duan
@ 2014-05-30 16:21       ` Eric Dumazet
  2014-06-01  0:55         ` fugang.duan
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2014-05-30 16:21 UTC (permalink / raw)
  To: fugang.duan
  Cc: Frank.Li, davem, ezequiel.garcia, netdev, shawn.guo, bhutchings, stephen

On Fri, 2014-05-30 at 07:16 +0000, fugang.duan@freescale.com wrote:

> Yes, testing found it bounces all TX frames.
> Using 2 descriptors to transfer one part brings more complexity to the driver. Of course,
> performance must be better.
> 

How does the cpu handle misaligned 32-bit accesses?

> Digression information:
> Imx6dl FEC HW has a bandwidth issue limiting it to 400 ~ 700Mbps. Current performance with TSO is 506Mbps, with cpu loading about 40%.
> Later chips with the FEC IP support byte alignment, such as imx6sx. On imx6sx FEC, without SW TSO, tx bandwidth is 840~870Mbps and cpu loading is 100%;
> after software TSO, tx bandwidth is 840Mbps and cpu loading is 48%.

Since you have some cpu cycles, have you tried to always use the bounce
thing, using one descriptor per MSS, instead of two?

(headers + payload)

This might help to get better bandwidth, by lowering overhead on DMA and
on the NIC.
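
I.e. build each segment completely in its bounce buffer, a sketch
(bounce, hdrs, payload and mss are illustrative names):

	memcpy(bounce, hdrs, hdr_len);            /* ethernet + ip + tcp */
	memcpy(bounce + hdr_len, payload, mss);   /* this MSS of payload */
	/* one descriptor: bounce, length hdr_len + mss */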

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 4/6] net: fec: Increase buffer descriptor entry number
  2014-05-30 15:58                 ` Eric Dumazet
@ 2014-05-30 16:35                   ` ezequiel.garcia
  0 siblings, 0 replies; 28+ messages in thread
From: ezequiel.garcia @ 2014-05-30 16:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Laight, 'fugang.duan@freescale.com',
	Frank.Li, davem, netdev, shawn.guo, bhutchings, stephen

On 30 May 08:58 AM, Eric Dumazet wrote:
> On Fri, 2014-05-30 at 15:34 +0000, David Laight wrote:
> 
> > Software TSO generates lots of separate ethernet frames; there is no
> > absolute requirement to be able to put all of them into the tx ring at once.
> 
> Yes, there is an absolute requirement. It is the driver's responsibility to
> stop the queue if the available slots in the TX ring would not allow the
> following packet to be sent.
> 
> That's why a TSO emulation needs to set gso_max_segs to some sane value
> 
> (sfc sets it to 100)
> 
> In Fugang's patch, this part was the wrong way to deal with this:
> 
> +       if (tso_count_descs(skb) >= fec_enet_txdesc_entry_free(fep)) {
> +               if (net_ratelimit())
> +                       netdev_err(ndev, "tx queue full!\n");
> +               return NETDEV_TX_BUSY;
> +       }
> +
> 
> This was copy/pasted from buggy
> drivers/net/ethernet/marvell/mv643xx_eth.c
> 

Indeed. I'm just about to submit the fixes for those, which may be useful
for fixing it in this driver as well.
-- 
Ezequiel García, Free Electrons
Embedded Linux, Kernel and Android Engineering
http://free-electrons.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH v1 6/6] net: fec: Add software TSO support
  2014-05-30 16:21       ` Eric Dumazet
@ 2014-06-01  0:55         ` fugang.duan
  2014-06-01  1:39           ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: fugang.duan @ 2014-06-01  0:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Frank.Li, davem, ezequiel.garcia, netdev, shawn.guo, bhutchings, stephen

From: Eric Dumazet <eric.dumazet@gmail.com> Date: Saturday, May 31, 2014 12:22 AM
>To: Duan Fugang-B38611
>Cc: Li Frank-B20596; davem@davemloft.net; ezequiel.garcia@free-
>electrons.com; netdev@vger.kernel.org; shawn.guo@linaro.org;
>bhutchings@solarflare.com; stephen@networkplumber.org
>Subject: RE: [PATCH v1 6/6] net: fec: Add software TSO support
>
>On Fri, 2014-05-30 at 07:16 +0000, fugang.duan@freescale.com wrote:
>
>> Yes, testing found it bounces all TX frames.
>> Using 2 descriptors to transfer one part brings more complexity
>> to the driver. Of course, performance must be better.
>>
>
>How does the cpu handle misaligned 32-bit accesses?

You mean use an extra descriptor for the misaligned bytes? Or attach the misaligned bytes to the header descriptor?
I haven't done/tested that.
Since cpu loading is light, I want to keep the current method (one descriptor for the headers, one descriptor per MSS).
>
>> Digression information:
>> Imx6dl FEC HW has a bandwidth issue limiting it to 400 ~ 700Mbps. Current
>> performance with TSO is 506Mbps, with cpu loading about 40%.
>> Later chips with the FEC IP support byte alignment, such as imx6sx. On
>> imx6sx FEC, without SW TSO, tx bandwidth is 840~870Mbps and cpu loading is 100%;
>> after software TSO, tx bandwidth is 840Mbps and cpu loading is 48%.
>
>Since you have some cpu cycles, have you tried to always use the bounce
>thing, using one descriptor per MSS, instead of two?
>
>(headers + payload)
>
>This might help to get better bandwidth, by lowering overhead on DMA and
>on the NIC.
>
Imx6dl itself has a hw bandwidth limitation. That the performance can reach 506Mbps is very amazing.

Imx6sx hw has no bandwidth limitation. The above test result came from connecting to one FPGA board (iperf server), whose rx speed may be limited.
So I connected to an Apple MacBook to test again; test results (with the patches applied to our internal kernel 3.10.31):
High mem disabled: tx bandwidth 942Mbps, cpu loading is 65%.
High mem enabled: tx bandwidth 930Mbps, cpu loading is 100%.
=> I don't know why enabling the kernel highmem config causes so much performance drop ???

For your above suggestion of "using one descriptor per MSS, instead of two":
Yes, for the imx6dl soc we already do it like this. The imx6sx soc FEC supports byte alignment, so it also uses one descriptor per MSS.

Thanks for your suggestion and response. Do you know why highmem causes such a performance drop for SW TSO?

Thanks,
Andy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH v1 6/6] net: fec: Add software TSO support
  2014-06-01  0:55         ` fugang.duan
@ 2014-06-01  1:39           ` Eric Dumazet
  2014-06-01  2:25             ` fugang.duan
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2014-06-01  1:39 UTC (permalink / raw)
  To: fugang.duan
  Cc: Frank.Li, davem, ezequiel.garcia, netdev, shawn.guo, bhutchings, stephen

On Sun, 2014-06-01 at 00:55 +0000, fugang.duan@freescale.com wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com> Date: Saturday, May 31, 2014 12:22 AM
> >
> >How does the cpu handle misaligned 32-bit accesses?
> 
> You mean use an extra descriptor for the misaligned bytes? Or attach the misaligned bytes to the header descriptor?

I meant, maybe you could avoid forcing alignment of the IP/TCP headers on a
4-byte boundary, and instead use a +2 byte offset.

This way, frames would be 4-byte aligned from the Ethernet header.
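
One reading of this, as a sketch (hdr_bounce is a made-up name):

	/* copy the frame into the aligned bounce buffer starting at the
	 * Ethernet header, instead of reserving 2 bytes to align the IP
	 * header: DMA then sees an aligned buffer, and the IP/TCP
	 * headers simply land at offsets 14 and 34 */
	memcpy(hdr_bounce, skb->data, hdr_len);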

> So I connected to an Apple MacBook to test again; test results (with the patches applied to our internal kernel 3.10.31):
> High mem disabled: tx bandwidth 942Mbps, cpu loading is 65%.
> High mem enabled: tx bandwidth 930Mbps, cpu loading is 100%.
> => I don't know why enabling the kernel highmem config causes so much performance drop ???
> 
> For your above suggestion of "using one descriptor per MSS, instead of two":
> Yes, for the imx6dl soc we already do it like this. The imx6sx soc FEC supports byte alignment, so it also uses one descriptor per MSS.
> 
> Thanks for your suggestion and response. Do you know why highmem causes such a performance drop for SW TSO?

Check NETIF_F_HIGHDMA: your driver might be able to advertise its
support.

Check your copies, because you might then need kmap()

For an example, read efx_skb_copy_bits_to_pio() in
drivers/net/ethernet/sfc/tx.c
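
A minimal version of that copy for a possibly-highmem frag would be
something like (a sketch; dst is a made-up destination pointer):

	struct page *page = skb_frag_page(frag);
	u8 *vaddr = kmap_atomic(page);

	memcpy(dst, vaddr + frag->page_offset, skb_frag_size(frag));
	kunmap_atomic(vaddr);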

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH v1 6/6] net: fec: Add software TSO support
  2014-06-01  1:39           ` Eric Dumazet
@ 2014-06-01  2:25             ` fugang.duan
  0 siblings, 0 replies; 28+ messages in thread
From: fugang.duan @ 2014-06-01  2:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Frank.Li, davem, ezequiel.garcia, netdev, shawn.guo, bhutchings, stephen

From: Eric Dumazet <eric.dumazet@gmail.com> Date: Sunday, June 01, 2014 9:40 AM
>To: Duan Fugang-B38611
>Cc: Li Frank-B20596; davem@davemloft.net; ezequiel.garcia@free-
>electrons.com; netdev@vger.kernel.org; shawn.guo@linaro.org;
>bhutchings@solarflare.com; stephen@networkplumber.org
>Subject: RE: [PATCH v1 6/6] net: fec: Add software TSO support
>
>On Sun, 2014-06-01 at 00:55 +0000, fugang.duan@freescale.com wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com> Date: Saturday, May 31,
>> 2014 12:22 AM
>> So I connected to an Apple MacBook to test again; test results (with the
>> patches applied to our internal kernel 3.10.31):
>> High mem disabled: tx bandwidth 942Mbps, cpu loading is 65%.
>> High mem enabled: tx bandwidth 930Mbps, cpu loading is 100%.
>> => I don't know why enabling the kernel highmem config causes so much
>> performance drop ???
>>
>> For your above suggestion of "using one descriptor per MSS, instead of two":
>> Yes, for the imx6dl soc we already do it like this. The imx6sx soc FEC
>> supports byte alignment, so it also uses one descriptor per MSS.
>>
>> Thanks for your suggestion and response. Do you know why highmem causes
>> such a performance drop for SW TSO?
>
>Check NETIF_F_HIGHDMA: your driver might be able to advertise its support.
>
Indeed, I got that test result after adding the "NETIF_F_HIGHDMA" feature; otherwise, the performance is worse.
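
(For reference, assuming the usual pattern, that amounts to setting

	ndev->features |= NETIF_F_HIGHDMA;

at probe/init time.)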

>Check your copies, because you might then need kmap()
>
>For an example, read efx_skb_copy_bits_to_pio() in
>drivers/net/ethernet/sfc/tx.c
>
I will check it.

Thanks,
Andy

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2014-06-01  2:25 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-30  2:05 [PATCH v1 0/6] *** net: fec: Enable Software TSO to improve the tx performance *** Fugang Duan
2014-05-30  2:05 ` [PATCH v1 1/6] net: fec: Factorize the .xmit transmit function Fugang Duan
2014-05-30  2:05 ` [PATCH v1 2/6] net: fec: Enable IP header hardware checksum Fugang Duan
2014-05-30  2:05 ` [PATCH v1 3/6] net: fec: Factorize feature setting Fugang Duan
2014-05-30  2:05 ` [PATCH v1 4/6] net: fec: Increase buffer descriptor entry number Fugang Duan
2014-05-30  9:10   ` David Laight
2014-05-30  9:42     ` fugang.duan
2014-05-30 13:13       ` Eric Dumazet
2014-05-30 14:01         ` ezequiel.garcia
2014-05-30 14:54           ` Eric Dumazet
2014-05-30 15:08             ` fugang.duan
2014-05-30 15:34               ` David Laight
2014-05-30 15:52                 ` fugang.duan
2014-05-30 15:58                 ` Eric Dumazet
2014-05-30 16:35                   ` ezequiel.garcia
2014-05-30 15:40               ` Eric Dumazet
2014-05-30 15:46                 ` fugang.duan
2014-05-30  2:05 ` [PATCH v1 5/6] net: fec: Add Scatter/gather support Fugang Duan
2014-05-30  4:37   ` Eric Dumazet
2014-05-30  4:48     ` fugang.duan
2014-05-30 14:26   ` Frank.Li
2014-05-30  2:05 ` [PATCH v1 6/6] net: fec: Add software TSO support Fugang Duan
2014-05-30  6:30   ` Eric Dumazet
2014-05-30  7:16     ` fugang.duan
2014-05-30 16:21       ` Eric Dumazet
2014-06-01  0:55         ` fugang.duan
2014-06-01  1:39           ` Eric Dumazet
2014-06-01  2:25             ` fugang.duan
