* [PATCH 0/3] mvneta: software TSO implementation
@ 2014-04-10 22:57 Ezequiel Garcia
  2014-04-10 22:58 ` [PATCH 1/3] net: mvneta: Factorize feature setting Ezequiel Garcia
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Ezequiel Garcia @ 2014-04-10 22:57 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Thomas Petazzoni, Gregory Clement,
	Simon Guinot, Willy Tarreau, Tawfik Bayouk, Lior Amsalem,
	Ezequiel Garcia

This patchset adds a software TSO implementation to our mvneta driver,
which is already in use on a number of platforms. When enabled, TSO
shows both a CPU usage reduction and throughput improvements.

The TSO feature can be completely disabled at runtime using ethtool:

  $ ethtool -K eth0 tso {on,off}

When TSO is disabled, the only remaining overhead is the DMA buffers for the
TSO headers, which are allocated when the tx queues are initialized. Follow-up
patches can release/re-allocate these buffers upon TSO disabling/re-enabling.

In most of my tests I've used iperf, where the improvement is most striking,
with a Plat'home Openblocks AX/3 board as the iperf client (tx).

Measuring the CPU usage with vmstat shows a substantial CPU usage drop when TSO
is on (~15% vs. ~25%). HTTP-based tests performed by Willy Tarreau have shown
nice performance improvements.

The first two patches are trivial cleanups, and the third patch does all the
hard work. The TSO implementation was originally prepared by Simon Guinot;
I did some style massaging and other cleanup.

Any comments about this are much appreciated.

Ezequiel Garcia (3):
  net: mvneta: Factorize feature setting
  net: mvneta: Clean mvneta_tx() sk_buff handling
  net: mvneta: Introduce a software TSO implementation

 drivers/net/ethernet/marvell/mvneta.c | 226 ++++++++++++++++++++++++++++++++--
 1 file changed, 215 insertions(+), 11 deletions(-)

-- 
1.9.1

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/3] net: mvneta: Factorize feature setting
  2014-04-10 22:57 [PATCH 0/3] mvneta: software TSO implementation Ezequiel Garcia
@ 2014-04-10 22:58 ` Ezequiel Garcia
  2014-04-10 22:58 ` [PATCH 2/3] net: mvneta: Clean mvneta_tx() sk_buff handling Ezequiel Garcia
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Ezequiel Garcia @ 2014-04-10 22:58 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Thomas Petazzoni, Gregory Clement,
	Simon Guinot, Willy Tarreau, Tawfik Bayouk, Lior Amsalem,
	Ezequiel Garcia

In order to ease the addition of new features, let's factorize the
feature list.

Signed-off-by: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
---
 drivers/net/ethernet/marvell/mvneta.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index d04b1c3..b0db2c4 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -2863,8 +2863,8 @@ static int mvneta_probe(struct platform_device *pdev)
 	netif_napi_add(dev, &pp->napi, mvneta_poll, pp->weight);
 
 	dev->features = NETIF_F_SG | NETIF_F_IP_CSUM;
-	dev->hw_features |= NETIF_F_SG | NETIF_F_IP_CSUM;
-	dev->vlan_features |= NETIF_F_SG | NETIF_F_IP_CSUM;
+	dev->hw_features |= dev->features;
+	dev->vlan_features |= dev->features;
 	dev->priv_flags |= IFF_UNICAST_FLT;
 
 	err = register_netdev(dev);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/3] net: mvneta: Clean mvneta_tx() sk_buff handling
  2014-04-10 22:57 [PATCH 0/3] mvneta: software TSO implementation Ezequiel Garcia
  2014-04-10 22:58 ` [PATCH 1/3] net: mvneta: Factorize feature setting Ezequiel Garcia
@ 2014-04-10 22:58 ` Ezequiel Garcia
  2014-04-10 22:58 ` [PATCH 3/3] net: mvneta: Introduce a software TSO implementation Ezequiel Garcia
  2014-04-11  0:51 ` [PATCH 0/3] mvneta: " David Miller
  3 siblings, 0 replies; 17+ messages in thread
From: Ezequiel Garcia @ 2014-04-10 22:58 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Thomas Petazzoni, Gregory Clement,
	Simon Guinot, Willy Tarreau, Tawfik Bayouk, Lior Amsalem,
	Ezequiel Garcia

Rework mvneta_tx() so that the code that performs the final handling
before an sk_buff is transmitted is done only if the number of fragments
processed is positive.

This is preparation work to introduce a software TSO routine.

Signed-off-by: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
---
 drivers/net/ethernet/marvell/mvneta.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index b0db2c4..e5bd3ca 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -1584,7 +1584,6 @@ static int mvneta_tx(struct sk_buff *skb, struct net_device *dev)
 	u16 txq_id = skb_get_queue_mapping(skb);
 	struct mvneta_tx_queue *txq = &pp->txqs[txq_id];
 	struct mvneta_tx_desc *tx_desc;
-	struct netdev_queue *nq;
 	int frags = 0;
 	u32 tx_cmd;
 
@@ -1592,7 +1591,6 @@ static int mvneta_tx(struct sk_buff *skb, struct net_device *dev)
 		goto out;
 
 	frags = skb_shinfo(skb)->nr_frags + 1;
-	nq    = netdev_get_tx_queue(dev, txq_id);
 
 	/* Get a descriptor for the first part of the packet */
 	tx_desc = mvneta_txq_next_desc_get(txq);
@@ -1635,15 +1633,16 @@ static int mvneta_tx(struct sk_buff *skb, struct net_device *dev)
 		}
 	}
 
-	txq->count += frags;
-	mvneta_txq_pend_desc_add(pp, txq, frags);
-
-	if (txq->size - txq->count < MAX_SKB_FRAGS + 1)
-		netif_tx_stop_queue(nq);
-
 out:
 	if (frags > 0) {
 		struct mvneta_pcpu_stats *stats = this_cpu_ptr(pp->stats);
+		struct netdev_queue *nq = netdev_get_tx_queue(dev, txq_id);
+
+		txq->count += frags;
+		mvneta_txq_pend_desc_add(pp, txq, frags);
+
+		if (txq->size - txq->count < MAX_SKB_FRAGS + 1)
+			netif_tx_stop_queue(nq);
 
 		u64_stats_update_begin(&stats->syncp);
 		stats->tx_packets++;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 3/3] net: mvneta: Introduce a software TSO implementation
  2014-04-10 22:57 [PATCH 0/3] mvneta: software TSO implementation Ezequiel Garcia
  2014-04-10 22:58 ` [PATCH 1/3] net: mvneta: Factorize feature setting Ezequiel Garcia
  2014-04-10 22:58 ` [PATCH 2/3] net: mvneta: Clean mvneta_tx() sk_buff handling Ezequiel Garcia
@ 2014-04-10 22:58 ` Ezequiel Garcia
  2014-04-10 23:16   ` Eric Dumazet
  2014-05-05 14:47   ` Ezequiel Garcia
  2014-04-11  0:51 ` [PATCH 0/3] mvneta: " David Miller
  3 siblings, 2 replies; 17+ messages in thread
From: Ezequiel Garcia @ 2014-04-10 22:58 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Thomas Petazzoni, Gregory Clement,
	Simon Guinot, Willy Tarreau, Tawfik Bayouk, Lior Amsalem,
	Ezequiel Garcia, Simon Guinot

This commit implements software TSO, which reduces the CPU
usage significantly while keeping the outbound throughput
at line rate.

Tested on a Plat'home Openblocks AX/3 board acting as iperf client (tx).
Measured CPU usage shows a substantial drop, from roughly 25% down to 15%.

Other tests performed by Willy Tarreau show performance improvements:
Willy reported that "[..] turning the TSO flag on immediately increases the
HTTP request rate from 1680 to 1820 per second (30 kB objects)".

Tested-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Simon Guinot <sguinot@lacie.com>
Signed-off-by: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
---
 drivers/net/ethernet/marvell/mvneta.c | 207 +++++++++++++++++++++++++++++++++-
 1 file changed, 206 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index e5bd3ca..cd6b998 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -244,6 +244,9 @@
 
 #define MVNETA_TX_MTU_MAX		0x3ffff
 
+/* TSO header size */
+#define TSO_HEADER_SIZE 128
+
 /* Max number of Rx descriptors */
 #define MVNETA_MAX_RXD 128
 
@@ -413,6 +416,12 @@ struct mvneta_tx_queue {
 
 	/* Index of the next TX DMA descriptor to process */
 	int next_desc_to_proc;
+
+	/* DMA buffers for TSO headers */
+	char *tso_hdrs;
+
+	/* DMA address of TSO headers */
+	dma_addr_t tso_hdrs_phys;
 };
 
 struct mvneta_rx_queue {
@@ -1519,6 +1528,181 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
 	return rx_done;
 }
 
+static inline void
+mvneta_tso_build_hdr(struct net_device *dev, struct mvneta_tx_queue *txq,
+		     struct sk_buff *skb, int hdr_len, int size,
+		     u32 tcp_seq, u16 ip_id, bool is_last)
+{
+	struct mvneta_port *pp = netdev_priv(dev);
+	struct mvneta_tx_desc *tx_desc;
+	struct iphdr *iph;
+	struct tcphdr *tcph;
+	char *mac;
+	int mac_hdr_len = skb_network_offset(skb);
+
+	mac = txq->tso_hdrs + txq->txq_put_index * TSO_HEADER_SIZE;
+	memcpy(mac, skb->data, hdr_len);
+
+	iph = (struct iphdr *)(mac + mac_hdr_len);
+	iph->id = htons(ip_id);
+	iph->tot_len = htons(size + hdr_len - mac_hdr_len);
+
+	tcph = (struct tcphdr *)(mac + skb_transport_offset(skb));
+	tcph->seq = htonl(tcp_seq);
+
+	if (!is_last) {
+		/* Clear all special flags for not last packet */
+		tcph->psh = 0;
+		tcph->fin = 0;
+		tcph->rst = 0;
+	}
+
+	txq->tx_skb[txq->txq_put_index] = NULL;
+	tx_desc = mvneta_txq_next_desc_get(txq);
+	tx_desc->data_size = hdr_len;
+	tx_desc->command = mvneta_skb_tx_csum(pp, skb);
+	tx_desc->command |= MVNETA_TXD_F_DESC;
+	tx_desc->buf_phys_addr = txq->tso_hdrs_phys +
+				 txq->txq_put_index * TSO_HEADER_SIZE;
+	mvneta_txq_inc_put(txq);
+}
+
+static inline int
+mvneta_tso_build_data(struct net_device *dev, struct mvneta_tx_queue *txq,
+		      struct sk_buff *skb, char *frag_ptr, int frag_size,
+		      int data_left, bool is_last)
+{
+	int size;
+	struct mvneta_tx_desc *tx_desc;
+
+	size = (frag_size < data_left) ? frag_size : data_left;
+
+	tx_desc = mvneta_txq_next_desc_get(txq);
+	tx_desc->data_size = size;
+	tx_desc->buf_phys_addr = dma_map_single(dev->dev.parent, frag_ptr,
+						size, DMA_TO_DEVICE);
+	if (unlikely(dma_mapping_error(dev->dev.parent,
+		     tx_desc->buf_phys_addr))) {
+		mvneta_txq_desc_put(txq);
+		return 0;
+	}
+
+	tx_desc->command = 0;
+	txq->tx_skb[txq->txq_put_index] = NULL;
+
+	if (size == data_left) {
+		/* last descriptor in the TCP packet */
+		tx_desc->command = MVNETA_TXD_L_DESC;
+
+		/* last descriptor in SKB */
+		if (is_last)
+			txq->tx_skb[txq->txq_put_index] = skb;
+	}
+	mvneta_txq_inc_put(txq);
+	return size;
+}
+
+static int mvneta_tx_tso(struct sk_buff *skb, struct net_device *dev,
+			 struct mvneta_tx_queue *txq)
+{
+	int total_len, hdr_len, size, frag_size, data_left;
+	int desc_count;
+	u16 ip_id;
+	u32 tcp_seq;
+	skb_frag_t *frag;
+	int frag_idx = 0;
+	char *frag_ptr;
+	const struct tcphdr *th = tcp_hdr(skb);
+	struct mvneta_port *pp = netdev_priv(dev);
+	int i;
+
+	/* Calculate expected number of TX descriptors */
+	desc_count = skb_shinfo(skb)->gso_segs * 2 + skb_shinfo(skb)->nr_frags;
+	if ((txq->count + desc_count) >= txq->size)
+		return 0;
+
+	total_len = skb->len;
+	hdr_len = (skb_transport_offset(skb) + tcp_hdrlen(skb));
+
+	total_len -= hdr_len;
+	ip_id = ntohs(ip_hdr(skb)->id);
+	tcp_seq = ntohl(th->seq);
+
+	frag_size = skb_headlen(skb);
+	frag_ptr = skb->data;
+
+	if (frag_size < hdr_len)
+		return 0;
+
+	frag_size -= hdr_len;
+	frag_ptr += hdr_len;
+	if (frag_size == 0) {
+		frag = &skb_shinfo(skb)->frags[frag_idx];
+
+		/* Move to next segment */
+		frag_size = frag->size;
+		frag_ptr = page_address(frag->page.p) + frag->page_offset;
+		frag_idx++;
+	}
+	desc_count = 0;
+
+	while (total_len > 0) {
+		data_left = (skb_shinfo(skb)->gso_size < total_len) ?
+				skb_shinfo(skb)->gso_size : total_len;
+		desc_count++;
+		total_len -= data_left;
+
+		/* prepare packet headers: MAC + IP + TCP */
+		mvneta_tso_build_hdr(dev, txq, skb, hdr_len, data_left,
+				     tcp_seq, ip_id, total_len == 0);
+		ip_id++;
+
+		while (data_left > 0) {
+			desc_count++;
+
+			size = mvneta_tso_build_data(dev, txq, skb,
+						     frag_ptr, frag_size,
+						     data_left, total_len == 0);
+			if (size == 0)
+				goto err_release;
+
+			data_left -= size;
+			tcp_seq += size;
+
+			frag_size -= size;
+			frag_ptr += size;
+
+			if ((frag_size == 0) &&
+			    (frag_idx < skb_shinfo(skb)->nr_frags)) {
+				frag = &skb_shinfo(skb)->frags[frag_idx];
+
+				/* Move to next segment */
+				frag_size = frag->size;
+				frag_ptr = page_address(frag->page.p) +
+					   frag->page_offset;
+				frag_idx++;
+			}
+		}
+	}
+
+	return desc_count;
+
+err_release:
+	/* Release all used data descriptors; header descriptors must not
+	 * be DMA-unmapped.
+	 */
+	for (i = desc_count - 1; i >= 0; i--) {
+		struct mvneta_tx_desc *tx_desc = txq->descs + i;
+		if (!(tx_desc->command & MVNETA_TXD_F_DESC))
+			dma_unmap_single(pp->dev->dev.parent,
+					 tx_desc->buf_phys_addr,
+					 tx_desc->data_size,
+					 DMA_TO_DEVICE);
+		mvneta_txq_desc_put(txq);
+	}
+	return 0;
+}
+
 /* Handle tx fragmentation processing */
 static int mvneta_tx_frag_process(struct mvneta_port *pp, struct sk_buff *skb,
 				  struct mvneta_tx_queue *txq)
@@ -1590,6 +1774,11 @@ static int mvneta_tx(struct sk_buff *skb, struct net_device *dev)
 	if (!netif_running(dev))
 		goto out;
 
+	if (skb_is_gso(skb)) {
+		frags = mvneta_tx_tso(skb, dev, txq);
+		goto out;
+	}
+
 	frags = skb_shinfo(skb)->nr_frags + 1;
 
 	/* Get a descriptor for the first part of the packet */
@@ -2108,6 +2297,18 @@ static int mvneta_txq_init(struct mvneta_port *pp,
 				  txq->descs, txq->descs_phys);
 		return -ENOMEM;
 	}
+
+	/* Allocate DMA buffers for TSO MAC/IP/TCP headers */
+	txq->tso_hdrs = dma_alloc_coherent(pp->dev->dev.parent,
+					   txq->size * TSO_HEADER_SIZE,
+					   &txq->tso_hdrs_phys, GFP_KERNEL);
+	if (txq->tso_hdrs == NULL) {
+		kfree(txq->tx_skb);
+		dma_free_coherent(pp->dev->dev.parent,
+				  txq->size * MVNETA_DESC_ALIGNED_SIZE,
+				  txq->descs, txq->descs_phys);
+		return -ENOMEM;
+	}
 	mvneta_tx_done_pkts_coal_set(pp, txq, txq->done_pkts_coal);
 
 	return 0;
@@ -2119,6 +2320,10 @@ static void mvneta_txq_deinit(struct mvneta_port *pp,
 {
 	kfree(txq->tx_skb);
 
+	if (txq->tso_hdrs)
+		dma_free_coherent(pp->dev->dev.parent,
+				  txq->size * TSO_HEADER_SIZE,
+				  txq->tso_hdrs, txq->tso_hdrs_phys);
 	if (txq->descs)
 		dma_free_coherent(pp->dev->dev.parent,
 				  txq->size * MVNETA_DESC_ALIGNED_SIZE,
@@ -2861,7 +3066,7 @@ static int mvneta_probe(struct platform_device *pdev)
 
 	netif_napi_add(dev, &pp->napi, mvneta_poll, pp->weight);
 
-	dev->features = NETIF_F_SG | NETIF_F_IP_CSUM;
+	dev->features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_TSO;
 	dev->hw_features |= dev->features;
 	dev->vlan_features |= dev->features;
 	dev->priv_flags |= IFF_UNICAST_FLT;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/3] net: mvneta: Introduce a software TSO implementation
  2014-04-10 22:58 ` [PATCH 3/3] net: mvneta: Introduce a software TSO implementation Ezequiel Garcia
@ 2014-04-10 23:16   ` Eric Dumazet
  2014-05-05 14:47   ` Ezequiel Garcia
  1 sibling, 0 replies; 17+ messages in thread
From: Eric Dumazet @ 2014-04-10 23:16 UTC (permalink / raw)
  To: Ezequiel Garcia
  Cc: netdev, David S. Miller, Thomas Petazzoni, Gregory Clement,
	Simon Guinot, Willy Tarreau, Tawfik Bayouk, Lior Amsalem,
	Simon Guinot

On Thu, 2014-04-10 at 19:58 -0300, Ezequiel Garcia wrote:
> +	/* Calculate expected number of TX descriptors */
> +	desc_count = skb_shinfo(skb)->gso_segs * 2 + skb_shinfo(skb)->nr_frags;
> +	if ((txq->count + desc_count) >= txq->size)
> +		return 0;
> +

Hmm.. have you tried to play with small MSS ?

I think you should set dev->gso_max_segs to a sensible value...

see commit 7e6d06f0de3f7 for details.
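
For illustration only, here is a rough sketch of what that could look like
in the driver probe path (MVNETA_MAX_TSO_SEGS is a made-up placeholder; the
real limit would have to be derived from the TX ring size):

/* Hypothetical sketch, not part of the posted series: cap the GSO
 * segment count so that the worst case of the desc_count estimate
 * (2 descriptors per segment plus the S/G fragments) always fits
 * in the TX ring.
 */
#define MVNETA_MAX_TSO_SEGS	100	/* placeholder value */

	dev->features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_TSO;
	dev->gso_max_segs = MVNETA_MAX_TSO_SEGS;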

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/3] mvneta: software TSO implementation
  2014-04-10 22:57 [PATCH 0/3] mvneta: software TSO implementation Ezequiel Garcia
                   ` (2 preceding siblings ...)
  2014-04-10 22:58 ` [PATCH 3/3] net: mvneta: Introduce a software TSO implementation Ezequiel Garcia
@ 2014-04-11  0:51 ` David Miller
  2014-04-11  1:02   ` Eric Dumazet
  3 siblings, 1 reply; 17+ messages in thread
From: David Miller @ 2014-04-11  0:51 UTC (permalink / raw)
  To: ezequiel.garcia
  Cc: netdev, eric.dumazet, thomas.petazzoni, gregory.clement,
	simon.guinot, w, tawfik, alior


This is not really driver specific code.

Please create an appropriate abstraction, and put this code in a
common place so that any driver can do this.

Thanks.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/3] mvneta: software TSO implementation
  2014-04-11  0:51 ` [PATCH 0/3] mvneta: " David Miller
@ 2014-04-11  1:02   ` Eric Dumazet
  2014-04-11  5:48     ` Willy Tarreau
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2014-04-11  1:02 UTC (permalink / raw)
  To: David Miller
  Cc: ezequiel.garcia, netdev, thomas.petazzoni, gregory.clement,
	simon.guinot, w, tawfik, alior

On Thu, 2014-04-10 at 20:51 -0400, David Miller wrote:
> This is not really driver specific code.
> 
> Please create an appropriate abstraction, and put this code in a
> common place so that any driver can do this.
> 
> Thanks.

Note a prior implementation existed in

drivers/net/ethernet/tile/tilegx.c

Not sure how it can be properly factorized, but worth trying.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/3] mvneta: software TSO implementation
  2014-04-11  1:02   ` Eric Dumazet
@ 2014-04-11  5:48     ` Willy Tarreau
  2014-04-11  6:20       ` David Miller
  2014-04-11 19:08       ` Ezequiel Garcia
  0 siblings, 2 replies; 17+ messages in thread
From: Willy Tarreau @ 2014-04-11  5:48 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, ezequiel.garcia, netdev, thomas.petazzoni,
	gregory.clement, simon.guinot, tawfik, alior

On Thu, Apr 10, 2014 at 06:02:30PM -0700, Eric Dumazet wrote:
> On Thu, 2014-04-10 at 20:51 -0400, David Miller wrote:
> > This is not really driver specific code.
> > 
> > Please create an appropriate abstraction, and put this code in a
> > common place so that any driver can do this.
> > 
> > Thanks.
> 
> Note a prior implementation existed in
> 
> drivers/net/ethernet/tile/tilegx.c
> 
> Not sure how it can be properly factorized, but worth trying.

I also tried to find how to do this for mv643xx_eth a few years ago
based on other drivers (eg: tilegx) and failed to find anything
common. At this level, we're really playing with the NIC's descriptors,
and many of them have different capabilities (eg: HW checksums for all
frames, VLAN or none, etc...). I don't really see how this could be
factorized, and since many NICs do support TSO, I'm not convinced by
the effort.

Willy

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/3] mvneta: software TSO implementation
  2014-04-11  5:48     ` Willy Tarreau
@ 2014-04-11  6:20       ` David Miller
  2014-04-11  6:30         ` Willy Tarreau
  2014-04-11 19:08       ` Ezequiel Garcia
  1 sibling, 1 reply; 17+ messages in thread
From: David Miller @ 2014-04-11  6:20 UTC (permalink / raw)
  To: w
  Cc: eric.dumazet, ezequiel.garcia, netdev, thomas.petazzoni,
	gregory.clement, simon.guinot, tawfik, alior

From: Willy Tarreau <w@1wt.eu>
Date: Fri, 11 Apr 2014 07:48:47 +0200

> I also tried to find how to do this for mv643xx_eth a few years ago
> based on other drivers (eg: tilegx) and failed to find anything
> common. At this level, we're really playing with the NIC's descriptors,
> and many of them have different capabilities (eg: HW checksums for all
> frames, VLAN or none, etc...). I don't really see how this could be
> factorized, and since many NICs do support TSO, I'm not convinced by
> the effort.

I wanted to hook this into the NIU driver as well, which also lacks
TSO support.

The descriptor and checksum part is chip specific, but the segmentation
of the buffer segments, calculating the number of necessary descriptors,
etc. absolutely is not.

Please work on finding a suitable abstraction, it's possible.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/3] mvneta: software TSO implementation
  2014-04-11  6:20       ` David Miller
@ 2014-04-11  6:30         ` Willy Tarreau
  2014-04-11 16:58           ` David Miller
  0 siblings, 1 reply; 17+ messages in thread
From: Willy Tarreau @ 2014-04-11  6:30 UTC (permalink / raw)
  To: David Miller
  Cc: eric.dumazet, ezequiel.garcia, netdev, thomas.petazzoni,
	gregory.clement, simon.guinot, tawfik, alior

On Fri, Apr 11, 2014 at 02:20:06AM -0400, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Fri, 11 Apr 2014 07:48:47 +0200
> 
> > I also tried to find how to do this for mv643xx_eth a few years ago
> > based on other drivers (eg: tilegx) and failed to find anything
> > common. At this level, we're really playing with the NIC's descriptors,
> > and many of them have different capabilities (eg: HW checksums for all
> > frames, VLAN or none, etc...). I don't really see how this could be
> > factorized, and since many NICs do support TSO, I'm not convinced by
> > the effort.
> 
> I wanted to hook this into the NIU driver as well, which also lacks
> TSO support.
> 
> The descriptor and checksum part is chip specific, but the segmentation
> of the buffer segments, calculating the number of necessary descriptors,
> etc. absolutely is not.

Agreed.

> Please work on finding a suitable abstraction, it's possible.

Do you think for example that having an xmit_tso() function which would
iterate over all segments and call something like :
   ->build_hdr()
   ->build_data()
   ->unroll_all()

would be sufficient ?
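
Something like this very rough sketch, just to illustrate the shape (the
names and prototypes are made up for the example, not taken from any
existing code):

/* Illustrative only: per-driver hooks a generic xmit_tso() could call */
struct tso_ops {
	/* fill one header descriptor for the current segment */
	void (*build_hdr)(struct net_device *dev, struct sk_buff *skb,
			  int hdr_len, int size, u32 tcp_seq,
			  u16 ip_id, bool is_last);
	/* fill one data descriptor, return bytes consumed or 0 on error */
	int  (*build_data)(struct net_device *dev, struct sk_buff *skb,
			   char *frag_ptr, int frag_size,
			   int data_left, bool is_last);
	/* undo everything already queued after a failure */
	void (*unroll_all)(struct net_device *dev, int desc_count);
};

int xmit_tso(struct sk_buff *skb, struct net_device *dev,
	     const struct tso_ops *ops);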

I fear that relying on function pointers in the inner loop can
significantly offset the CPU savings, but it could be attempted.

I think that the fact you want to do it for NIU is a good opportunity
to try to do it the right way in the most generic mode, but we'll
surely need your help on this!

Willy

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/3] mvneta: software TSO implementation
  2014-04-11  6:30         ` Willy Tarreau
@ 2014-04-11 16:58           ` David Miller
  0 siblings, 0 replies; 17+ messages in thread
From: David Miller @ 2014-04-11 16:58 UTC (permalink / raw)
  To: w
  Cc: eric.dumazet, ezequiel.garcia, netdev, thomas.petazzoni,
	gregory.clement, simon.guinot, tawfik, alior

From: Willy Tarreau <w@1wt.eu>
Date: Fri, 11 Apr 2014 08:30:21 +0200

> Do you think for example that having an xmit_tso() function which would
> iterate over all segments and call something like :
>    ->build_hdr()
>    ->build_data()
>    ->unroll_all()
> 
> would be sufficient ?

Yes, something involving callbacks is probably going to be necessary.

> I think that the fact you want to do it for NIU is a good opportunity
> to try to do it the right way in the most generic mode, but we'll
> surely need your help on this!

I can help for sure :)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/3] mvneta: software TSO implementation
  2014-04-11  5:48     ` Willy Tarreau
  2014-04-11  6:20       ` David Miller
@ 2014-04-11 19:08       ` Ezequiel Garcia
  2014-04-11 19:43         ` Willy Tarreau
  1 sibling, 1 reply; 17+ messages in thread
From: Ezequiel Garcia @ 2014-04-11 19:08 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Eric Dumazet, David Miller, netdev, thomas.petazzoni,
	gregory.clement, simon.guinot, tawfik, alior

On Apr 11, Willy Tarreau wrote:
> On Thu, Apr 10, 2014 at 06:02:30PM -0700, Eric Dumazet wrote:
> > On Thu, 2014-04-10 at 20:51 -0400, David Miller wrote:
> > > This is not really driver specific code.
> > > 
> > > Please create an appropriate abstraction, and put this code in a
> > > common place so that any driver can do this.
> > > 
> > > Thanks.
> > 
> > Note a prior implementation existed in
> > 
> > drivers/net/ethernet/tile/tilegx.c
> > 
> > Not sure how it can be properly factorized, but worth trying.
> 
> I also tried to find how to do this for mv643xx_eth a few years ago
> 
[..]

Can you share *any* code about this attempt on mv643xx_eth? It would be
helpful to see other TSO cases, to have more examples in order to come
up with a proper abstraction.

-- 
Ezequiel García, Free Electrons
Embedded Linux, Kernel and Android Engineering
http://free-electrons.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/3] mvneta: software TSO implementation
  2014-04-11 19:08       ` Ezequiel Garcia
@ 2014-04-11 19:43         ` Willy Tarreau
  0 siblings, 0 replies; 17+ messages in thread
From: Willy Tarreau @ 2014-04-11 19:43 UTC (permalink / raw)
  To: Ezequiel Garcia
  Cc: Eric Dumazet, David Miller, netdev, thomas.petazzoni,
	gregory.clement, simon.guinot, tawfik, alior

On Fri, Apr 11, 2014 at 04:08:27PM -0300, Ezequiel Garcia wrote:
> On Apr 11, Willy Tarreau wrote:
> > On Thu, Apr 10, 2014 at 06:02:30PM -0700, Eric Dumazet wrote:
> > > On Thu, 2014-04-10 at 20:51 -0400, David Miller wrote:
> > > > This is not really driver specific code.
> > > > 
> > > > Please create an appropriate abstraction, and put this code in a
> > > > common place so that any driver can do this.
> > > > 
> > > > Thanks.
> > > 
> > > Note a prior implementation existed in
> > > 
> > > drivers/net/ethernet/tile/tilegx.c
> > > 
> > > Not sure how it can be properly factorized, but worth trying.
> > 
> > I also tried to find how to do this for mv643xx_eth a few years ago
> > 
> [..]
> 
> Can you share *any* code about this attempt on mv643xx_eth? It would be
> helpful to see other TSO cases, to have more examples in order to come
> up with a proper abstraction.

Unfortunately no. I remember that I picked the equivalent code from
mv_eth in the Kirkwood LSP and tried to adapt it to the mainline
driver, but unfortunately by then I didn't really understand what the
driver was doing, despite Eric explaining a few things to me about GSO
and segmentation. Too many things were obscure to me and I had to give
up. The only vague memories I have are that this NIC doesn't support
Tx csum on VLANs or IPv6. That's about all, I'm afraid.

By the way, I could be wrong but I also have some memories about this
in the myri10ge driver. I recall it using software TSO for VLANs or
IPv6, or a combination of both; I don't exactly remember.
It's also possible that it depends on the firmware versions. Anyway the
driver is clean and well commented, it's easier to read than average.

Hoping this helps,
Willy

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/3] net: mvneta: Introduce a software TSO implementation
  2014-04-10 22:58 ` [PATCH 3/3] net: mvneta: Introduce a software TSO implementation Ezequiel Garcia
  2014-04-10 23:16   ` Eric Dumazet
@ 2014-05-05 14:47   ` Ezequiel Garcia
  2014-05-07  6:04     ` Willy Tarreau
  2014-05-21  2:11     ` Ben Hutchings
  1 sibling, 2 replies; 17+ messages in thread
From: Ezequiel Garcia @ 2014-05-05 14:47 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, cmetcalf, Thomas Petazzoni,
	Gregory Clement, Simon Guinot, Willy Tarreau, Tawfik Bayouk,
	Lior Amsalem, Simon Guinot

Hi all,

On 10 Apr 07:58 PM, Ezequiel Garcia wrote:
[..]
> +
> +	/* Calculate expected number of TX descriptors */
> +	desc_count = skb_shinfo(skb)->gso_segs * 2 + skb_shinfo(skb)->nr_frags;
> +	if ((txq->count + desc_count) >= txq->size)
> +		return 0;
> +

Is this calculation correct? Does it give the exact number of needed
descriptors or is it an approximation?

Tilera's tilegx driver does a much stricter descriptor count (see
tso_count_edescs). This function loops through the skb_frag_t fragments
as it's done later in the data egress, hence strictly counting the
needed descriptors.

However, as it's a much heavier routine than the one shown above,
I'm wondering if we can get away without it.
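
For reference, a strict count along those lines might look roughly like
this (only a sketch of the idea, not the actual tilegx code):

static int strict_tso_desc_count(struct sk_buff *skb)
{
	int hdr_len = skb_transport_offset(skb) + tcp_hdrlen(skb);
	int total_len = skb->len - hdr_len;
	int frag_size = skb_headlen(skb) - hdr_len;
	int frag_idx = 0;
	int count = 0;

	while (total_len > 0) {
		int data_left = min_t(int, skb_shinfo(skb)->gso_size, total_len);

		total_len -= data_left;
		count++;				/* header descriptor */

		while (data_left > 0) {
			int size;

			if (frag_size == 0) {
				const skb_frag_t *frag =
					&skb_shinfo(skb)->frags[frag_idx++];

				frag_size = skb_frag_size(frag);
			}
			size = min(frag_size, data_left);
			data_left -= size;
			frag_size -= size;
			count++;			/* data descriptor */
		}
	}
	return count;
}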

Willy, Any ideas here?
-- 
Ezequiel García, Free Electrons
Embedded Linux, Kernel and Android Engineering
http://free-electrons.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/3] net: mvneta: Introduce a software TSO implementation
  2014-05-05 14:47   ` Ezequiel Garcia
@ 2014-05-07  6:04     ` Willy Tarreau
  2014-05-21  2:11     ` Ben Hutchings
  1 sibling, 0 replies; 17+ messages in thread
From: Willy Tarreau @ 2014-05-07  6:04 UTC (permalink / raw)
  To: Ezequiel Garcia
  Cc: netdev, David S. Miller, Eric Dumazet, cmetcalf,
	Thomas Petazzoni, Gregory Clement, Simon Guinot, Tawfik Bayouk,
	Lior Amsalem, Simon Guinot

Hi Ezequiel,

On Mon, May 05, 2014 at 11:47:02AM -0300, Ezequiel Garcia wrote:
> Hi all,
> 
> On 10 Apr 07:58 PM, Ezequiel Garcia wrote:
> [..]
> > +
> > +	/* Calculate expected number of TX descriptors */
> > +	desc_count = skb_shinfo(skb)->gso_segs * 2 + skb_shinfo(skb)->nr_frags;
> > +	if ((txq->count + desc_count) >= txq->size)
> > +		return 0;
> > +
> 
> Is this calculation correct? Does it give the exact number of needed
> descriptors or is it an approximation?
> 
> Tilera's tilegx driver does a much stricter descriptor count (see
> tso_count_edescs). This function loops through the skb_frag_t fragments
> as it's done later in the data egress, hence strictly counting the
> needed descriptors.
> 
> However, as it's a much heavier routine than the one shown above,
> I'm wondering if we can get away without it.
> 
> Willy, Any ideas here?

I think we don't absolutely need to have the exact value, provided we
don't fall into the case where we cannot send a single skb with an
empty queue. Here we have 532 tx descriptors max, which means that
with an MSS below 246 bytes, we may fall into this situation.
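
Back of the envelope, assuming a ~64 kB TSO payload and the desc_count
estimate from the patch (gso_segs * 2 + nr_frags):

  gso_segs   ~= 65536 / 246 ~= 267
  desc_count >= 267 * 2     =  534  > 532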

Eric pointed out commit 7e6d06f0de3f7 which applies a limit on the
number of segments for the sfc driver, I think you should take a look
at it. I haven't played with gso_max_segs yet, but there are clear
explanations in the commit message and in the comments that can serve
as a nice starting point.

Hoping this helps :-/

Willy

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/3] net: mvneta: Introduce a software TSO implementation
  2014-05-05 14:47   ` Ezequiel Garcia
  2014-05-07  6:04     ` Willy Tarreau
@ 2014-05-21  2:11     ` Ben Hutchings
  2014-05-22 17:47       ` Ezequiel Garcia
  1 sibling, 1 reply; 17+ messages in thread
From: Ben Hutchings @ 2014-05-21  2:11 UTC (permalink / raw)
  To: Ezequiel Garcia
  Cc: netdev, David S. Miller, Eric Dumazet, cmetcalf,
	Thomas Petazzoni, Gregory Clement, Simon Guinot, Willy Tarreau,
	Tawfik Bayouk, Lior Amsalem, Simon Guinot

On Mon, 2014-05-05 at 11:47 -0300, Ezequiel Garcia wrote:
> Hi all,
> 
> On 10 Apr 07:58 PM, Ezequiel Garcia wrote:
> [..]
> > +
> > +	/* Calculate expected number of TX descriptors */
> > +	desc_count = skb_shinfo(skb)->gso_segs * 2 + skb_shinfo(skb)->nr_frags;
> > +	if ((txq->count + desc_count) >= txq->size)
> > +		return 0;
> > +
> 
> > Is this calculation correct? Does it give the exact number of needed
> descriptors or is it an approximation?
> 
> Tilera's tilegx driver does a much stricter descriptor count (see
> > tso_count_edescs). This function loops through the skb_frag_t fragments
> as it's done later in the data egress, hence strictly counting the
> needed descriptors.
> 
> However, as it's a much heavier routine than the one shown above,
> I'm wondering if we can get away without it.
> 
> Willy, Any ideas here?

There are (at least) three potential bugs you need to avoid:

1. You must not underestimate the number of descriptors required here,
otherwise the ring may overflow.  Hopefully that's obvious.
2. You must ensure that the worst case number of descriptors does
actually fit in the ring, and will probably need to set
net_dev->gso_max_segs accordingly.  Eric pointed that out already.
3. If you stop the queue because an skb doesn't fit, you should not wake
it before there is enough space.

A simple way to do this is:

- Set a maximum number of segments to support (for sfc, I reckoned that
100 was enough)
- Calculate the maximum number of descriptors that could be needed for
that number of segments
- Stop the queue when the free space in the ring is less than that
maximum
- Wake the queue when the free space reaches a threshold somewhere
between that maximum and completely free

It may make more sense to work backward from the ring size to maximum
number of segments.
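
A very rough sketch of this scheme (the names and numbers are made up just
to illustrate, they are not from any existing driver):

#define MAX_TSO_SEGS		100
#define MAX_DESCS_PER_SKB	(MAX_TSO_SEGS * 2 + MAX_SKB_FRAGS)
#define WAKE_THRESHOLD		(2 * MAX_DESCS_PER_SKB)

	/* after queueing an skb: stop if another worst-case skb may not fit */
	if (txq->size - txq->count < MAX_DESCS_PER_SKB)
		netif_tx_stop_queue(nq);

	/* on tx completion: only wake once a comfortable margin is free */
	if (netif_tx_queue_stopped(nq) &&
	    txq->size - txq->count >= WAKE_THRESHOLD)
		netif_tx_wake_queue(nq);

(together with setting net_dev->gso_max_segs to MAX_TSO_SEGS, as in point 2.)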

Ben.

-- 
Ben Hutchings
Life would be so much easier if we could look at the source code.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/3] net: mvneta: Introduce a software TSO implementation
  2014-05-21  2:11     ` Ben Hutchings
@ 2014-05-22 17:47       ` Ezequiel Garcia
  0 siblings, 0 replies; 17+ messages in thread
From: Ezequiel Garcia @ 2014-05-22 17:47 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: netdev, David S. Miller, Eric Dumazet, cmetcalf,
	Thomas Petazzoni, Gregory Clement, Simon Guinot, Willy Tarreau,
	Tawfik Bayouk, Lior Amsalem, Simon Guinot

On 21 May 03:11 AM, Ben Hutchings wrote:
> On Mon, 2014-05-05 at 11:47 -0300, Ezequiel Garcia wrote:
> > Hi all,
> > 
> > On 10 Apr 07:58 PM, Ezequiel Garcia wrote:
> > [..]
> > > +
> > > +	/* Calculate expected number of TX descriptors */
> > > +	desc_count = skb_shinfo(skb)->gso_segs * 2 + skb_shinfo(skb)->nr_frags;
> > > +	if ((txq->count + desc_count) >= txq->size)
> > > +		return 0;
> > > +
> > 
> > Is this calculation correct? Does it give the exact number of needed
> > descriptors or is it an approximation?
> > 
> 
> There are (at least) three potential bugs you need to avoid:
> 
> 1. You must not underestimate the number of descriptors required here,
> otherwise the ring may overflow.  Hopefully that's obvious.

Yes, fully understood.

> 2. You must ensure that the worst case number of descriptors does
> actually fit in the ring, and will probably need to set
> net_dev->gso_max_segs accordingly.  Eric pointed that out already.

Yeah, I still need to take a deep look at the commit pointed out by Eric.

> 3. If you stop the queue because an skb doesn't fit, you should not wake
> it before there is enough space.
> 
> A simple way to do this is:
> 
> - Set a maximum number of segments to support (for sfc, I reckoned that
> 100 was enough)
> - Calculate the maximum number of descriptors that could be needed for
> that number of segments
> - Stop the queue when the free space in the ring is less than that
> maximum
> - Wake the queue when the free space reaches a threshold somewhere
> between that maximum and completely free
> 
> It may make more sense to work backward from the ring size to maximum
> number of segments.
> 

Hm, this seems like a good suggestion. I'm wondering if this can be added to
the generic code, although this is likely each driver's responsibility.

BTW, have you seen the proposal for the generic TSO API?

http://permalink.gmane.org/gmane.linux.network/316852

Any comments about it would be much appreciated, as we'd really like
to see TSO implemented on our drivers.

Thanks,
-- 
Ezequiel García, Free Electrons
Embedded Linux, Kernel and Android Engineering
http://free-electrons.com

^ permalink raw reply	[flat|nested] 17+ messages in thread
