* [Intel-wired-lan] [next PATCH 0/4] i40e/i40evf: Improve throughput by reducing PCIe overhead
@ 2016-02-17 19:02 Alexander Duyck
  2016-02-17 19:02 ` [Intel-wired-lan] [next PATCH 1/4] i40e/i40evf: Break up xmit_descriptor_count from maybe_stop_tx Alexander Duyck
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Alexander Duyck @ 2016-02-17 19:02 UTC (permalink / raw)
  To: intel-wired-lan

This patch series is meant to address some minor performance issues and to
improve the throughput for the adapter.

On my system I saw a 0.5 Gbps improvement in throughput with these patches.

Specific items addressed include the use of modulo division in the Tx
fast-path, redundant checks against CHECKSUM_PARTIAL, and the use of only
8K of data per TXD when the maximum supported value is 16K - 1.
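
To make the last item concrete: at an 8K cap a 32K-aligned block costs 4
data descriptors, while the 12K-aligned cap introduced in patch 4 brings
that down to 3.  A minimal standalone sketch of that arithmetic
(userspace C; the helper name is hypothetical, not anything in the
driver):

	#include <stdio.h>

	/* ceil(len / cap): data descriptors needed for a cap-aligned buffer */
	static unsigned int desc_needed(unsigned int len, unsigned int cap)
	{
		return (len + cap - 1) / cap;
	}

	int main(void)
	{
		printf("32K block @ 8K cap : %u descriptors\n",
		       desc_needed(32 * 1024, 8 * 1024));	/* 4 */
		printf("32K block @ 12K cap: %u descriptors\n",
		       desc_needed(32 * 1024, 12 * 1024));	/* 3 */
		return 0;
	}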

---

Alexander Duyck (4):
      i40e/i40evf: Break up xmit_descriptor_count from maybe_stop_tx
      i40e/i40evf: Rewrite logic for 8 descriptor per packet check
      i40e/i40evf: Move Tx checksum closer to TSO
      i40e/i40evf: Allow up to 12K bytes of data per Tx descriptor instead of 8K


 drivers/net/ethernet/intel/i40e/i40e_fcoe.c   |   20 ++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  199 +++++++++++--------------
 drivers/net/ethernet/intel/i40e/i40e_txrx.h   |   96 ++++++++++++
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c |  190 +++++++++++-------------
 drivers/net/ethernet/intel/i40evf/i40e_txrx.h |   94 ++++++++++++
 5 files changed, 374 insertions(+), 225 deletions(-)

--


* [Intel-wired-lan] [next PATCH 1/4] i40e/i40evf: Break up xmit_descriptor_count from maybe_stop_tx
  2016-02-17 19:02 [Intel-wired-lan] [next PATCH 0/4] i40e/i40evf: Improve throughput by reducing PCIe overhead Alexander Duyck
@ 2016-02-17 19:02 ` Alexander Duyck
  2016-02-17 23:21   ` Bowers, AndrewX
  2016-02-17 19:02 ` [Intel-wired-lan] [next PATCH 2/4] i40e/i40evf: Rewrite logic for 8 descriptor per packet check Alexander Duyck
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 8+ messages in thread
From: Alexander Duyck @ 2016-02-17 19:02 UTC (permalink / raw)
  To: intel-wired-lan

In an upcoming patch I would like to have access to the descriptor count
used for the data portion of the frame.  For this reason I am splitting
the descriptor count function off from the function that stops the ring.

Also, in order to reduce unnecessary duplication of code, I am moving the
slow-path portions of the code out of line so that we can just jump to
them and process them instead of having to build them into each function
that calls them.
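
The shape of that split, in miniature (standalone C with illustrative
names, not the driver's; the real functions are in the diff below):

	/* fast path stays inline: one compare; the rare stop/wake logic
	 * moves out of line so callers don't each inline a copy of it
	 */
	#define likely(x)	__builtin_expect(!!(x), 1)

	struct ring { int unused; };

	/* out-of-line slow path, roughly ~ __i40e_maybe_stop_tx() */
	static int __ring_maybe_stop(struct ring *r, int size)
	{
		/* stop the queue, re-check, wake or report busy */
		return (r->unused >= size) ? 0 : -1;	/* -1 ~ -EBUSY */
	}

	static inline int ring_maybe_stop(struct ring *r, int size)
	{
		if (likely(r->unused >= size))
			return 0;
		return __ring_maybe_stop(r, size);
	}

	int main(void)
	{
		struct ring r = { .unused = 64 };
		return ring_maybe_stop(&r, 8);	/* fast path: returns 0 */
	}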

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
 drivers/net/ethernet/intel/i40e/i40e_fcoe.c   |   14 ++++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   71 +++++--------------------
 drivers/net/ethernet/intel/i40e/i40e_txrx.h   |   44 +++++++++++++++
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c |   64 +++++------------------
 drivers/net/ethernet/intel/i40evf/i40e_txrx.h |   42 +++++++++++++++
 5 files changed, 123 insertions(+), 112 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_fcoe.c b/drivers/net/ethernet/intel/i40e/i40e_fcoe.c
index 7c66ce416ec7..518d72ea1059 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_fcoe.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_fcoe.c
@@ -1359,16 +1359,26 @@ static netdev_tx_t i40e_fcoe_xmit_frame(struct sk_buff *skb,
 	struct i40e_ring *tx_ring = vsi->tx_rings[skb->queue_mapping];
 	struct i40e_tx_buffer *first;
 	u32 tx_flags = 0;
+	int fso, count;
 	u8 hdr_len = 0;
 	u8 sof = 0;
 	u8 eof = 0;
-	int fso;
 
 	if (i40e_fcoe_set_skb_header(skb))
 		goto out_drop;
 
-	if (!i40e_xmit_descriptor_count(skb, tx_ring))
+	count = i40e_xmit_descriptor_count(skb);
+
+	/* need: 1 descriptor per page * PAGE_SIZE/I40E_MAX_DATA_PER_TXD,
+	 *       + 1 desc for skb_head_len/I40E_MAX_DATA_PER_TXD,
+	 *       + 4 desc gap to avoid the cache line where head is,
+	 *       + 1 desc for context descriptor,
+	 * otherwise try next time
+	 */
+	if (i40e_maybe_stop_tx(tx_ring, count + 4 + 1)) {
+		tx_ring->tx_stats.tx_busy++;
 		return NETDEV_TX_BUSY;
+	}
 
 	/* prepare the xmit flags */
 	if (i40e_tx_prepare_vlan_flags(skb, tx_ring, &tx_flags))
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 1d3afa7dda18..f03657022b0f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2576,7 +2576,7 @@ static void i40e_create_tx_ctx(struct i40e_ring *tx_ring,
  *
  * Returns -EBUSY if a stop is needed, else 0
  **/
-static inline int __i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size)
+int __i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size)
 {
 	netif_stop_subqueue(tx_ring->netdev, tx_ring->queue_index);
 	/* Memory barrier before checking head and tail */
@@ -2593,24 +2593,6 @@ static inline int __i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size)
 }
 
 /**
- * i40e_maybe_stop_tx - 1st level check for tx stop conditions
- * @tx_ring: the ring to be checked
- * @size:    the size buffer we want to assure is available
- *
- * Returns 0 if stop is not needed
- **/
-#ifdef I40E_FCOE
-inline int i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size)
-#else
-static inline int i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size)
-#endif
-{
-	if (likely(I40E_DESC_UNUSED(tx_ring) >= size))
-		return 0;
-	return __i40e_maybe_stop_tx(tx_ring, size);
-}
-
-/**
  * i40e_chk_linearize - Check if there are more than 8 fragments per packet
  * @skb:      send buffer
  * @tx_flags: collected send information
@@ -2870,43 +2852,6 @@ dma_error:
 }
 
 /**
- * i40e_xmit_descriptor_count - calculate number of tx descriptors needed
- * @skb:     send buffer
- * @tx_ring: ring to send buffer on
- *
- * Returns number of data descriptors needed for this skb. Returns 0 to indicate
- * there is not enough descriptors available in this ring since we need at least
- * one descriptor.
- **/
-#ifdef I40E_FCOE
-inline int i40e_xmit_descriptor_count(struct sk_buff *skb,
-				      struct i40e_ring *tx_ring)
-#else
-static inline int i40e_xmit_descriptor_count(struct sk_buff *skb,
-					     struct i40e_ring *tx_ring)
-#endif
-{
-	unsigned int f;
-	int count = 0;
-
-	/* need: 1 descriptor per page * PAGE_SIZE/I40E_MAX_DATA_PER_TXD,
-	 *       + 1 desc for skb_head_len/I40E_MAX_DATA_PER_TXD,
-	 *       + 4 desc gap to avoid the cache line where head is,
-	 *       + 1 desc for context descriptor,
-	 * otherwise try next time
-	 */
-	for (f = 0; f < skb_shinfo(skb)->nr_frags; f++)
-		count += TXD_USE_COUNT(skb_shinfo(skb)->frags[f].size);
-
-	count += TXD_USE_COUNT(skb_headlen(skb));
-	if (i40e_maybe_stop_tx(tx_ring, count + 4 + 1)) {
-		tx_ring->tx_stats.tx_busy++;
-		return 0;
-	}
-	return count;
-}
-
-/**
  * i40e_xmit_frame_ring - Sends buffer on Tx ring
  * @skb:     send buffer
  * @tx_ring: ring to send buffer on
@@ -2924,14 +2869,24 @@ static netdev_tx_t i40e_xmit_frame_ring(struct sk_buff *skb,
 	__be16 protocol;
 	u32 td_cmd = 0;
 	u8 hdr_len = 0;
+	int tso, count;
 	int tsyn;
-	int tso;
 
 	/* prefetch the data, we'll need it later */
 	prefetch(skb->data);
 
-	if (0 == i40e_xmit_descriptor_count(skb, tx_ring))
+	count = i40e_xmit_descriptor_count(skb);
+
+	/* need: 1 descriptor per page * PAGE_SIZE/I40E_MAX_DATA_PER_TXD,
+	 *       + 1 desc for skb_head_len/I40E_MAX_DATA_PER_TXD,
+	 *       + 4 desc gap to avoid the cache line where head is,
+	 *       + 1 desc for context descriptor,
+	 * otherwise try next time
+	 */
+	if (i40e_maybe_stop_tx(tx_ring, count + 4 + 1)) {
+		tx_ring->tx_stats.tx_busy++;
 		return NETDEV_TX_BUSY;
+	}
 
 	/* prepare the xmit flags */
 	if (i40e_tx_prepare_vlan_flags(skb, tx_ring, &tx_flags))
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index fde5f42524fb..48a2ab8a8ec7 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -331,13 +331,12 @@ int i40e_napi_poll(struct napi_struct *napi, int budget);
 void i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
 		 struct i40e_tx_buffer *first, u32 tx_flags,
 		 const u8 hdr_len, u32 td_cmd, u32 td_offset);
-int i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size);
-int i40e_xmit_descriptor_count(struct sk_buff *skb, struct i40e_ring *tx_ring);
 int i40e_tx_prepare_vlan_flags(struct sk_buff *skb,
 			       struct i40e_ring *tx_ring, u32 *flags);
 #endif
 void i40e_force_wb(struct i40e_vsi *vsi, struct i40e_q_vector *q_vector);
 u32 i40e_get_tx_pending(struct i40e_ring *ring, bool in_sw);
+int __i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size);
 
 /**
  * i40e_get_head - Retrieve head from head writeback
@@ -352,4 +351,45 @@ static inline u32 i40e_get_head(struct i40e_ring *tx_ring)
 
 	return le32_to_cpu(*(volatile __le32 *)head);
 }
+
+/**
+ * i40e_xmit_descriptor_count - calculate number of tx descriptors needed
+ * @skb:     send buffer
+ * @tx_ring: ring to send buffer on
+ *
+ * Returns number of data descriptors needed for this skb. Returns 0 to indicate
+ * there is not enough descriptors available in this ring since we need at least
+ * one descriptor.
+ **/
+static inline int i40e_xmit_descriptor_count(struct sk_buff *skb)
+{
+	const struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[0];
+	unsigned int nr_frags = skb_shinfo(skb)->nr_frags;
+	int count = 0, size = skb_headlen(skb);
+
+	for (;;) {
+		count += TXD_USE_COUNT(size);
+
+		if (!nr_frags--)
+			break;
+
+		size = skb_frag_size(frag++);
+	}
+
+	return count;
+}
+
+/**
+ * i40e_maybe_stop_tx - 1st level check for tx stop conditions
+ * @tx_ring: the ring to be checked
+ * @size:    the size buffer we want to assure is available
+ *
+ * Returns 0 if stop is not needed
+ **/
+static inline int i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size)
+{
+	if (likely(I40E_DESC_UNUSED(tx_ring) >= size))
+		return 0;
+	return __i40e_maybe_stop_tx(tx_ring, size);
+}
 #endif /* _I40E_TXRX_H_ */
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 16589c0b79a3..78d9ce4693c6 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -1856,7 +1856,7 @@ linearize_chk_done:
  *
  * Returns -EBUSY if a stop is needed, else 0
  **/
-static inline int __i40evf_maybe_stop_tx(struct i40e_ring *tx_ring, int size)
+int __i40evf_maybe_stop_tx(struct i40e_ring *tx_ring, int size)
 {
 	netif_stop_subqueue(tx_ring->netdev, tx_ring->queue_index);
 	/* Memory barrier before checking head and tail */
@@ -1873,20 +1873,6 @@ static inline int __i40evf_maybe_stop_tx(struct i40e_ring *tx_ring, int size)
 }
 
 /**
- * i40evf_maybe_stop_tx - 1st level check for tx stop conditions
- * @tx_ring: the ring to be checked
- * @size:    the size buffer we want to assure is available
- *
- * Returns 0 if stop is not needed
- **/
-static inline int i40evf_maybe_stop_tx(struct i40e_ring *tx_ring, int size)
-{
-	if (likely(I40E_DESC_UNUSED(tx_ring) >= size))
-		return 0;
-	return __i40evf_maybe_stop_tx(tx_ring, size);
-}
-
-/**
  * i40evf_tx_map - Build the Tx descriptor
  * @tx_ring:  ring to send buffer on
  * @skb:      send buffer
@@ -2001,7 +1987,7 @@ static inline void i40evf_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
 	netdev_tx_sent_queue(netdev_get_tx_queue(tx_ring->netdev,
 						 tx_ring->queue_index),
 						 first->bytecount);
-	i40evf_maybe_stop_tx(tx_ring, DESC_NEEDED);
+	i40e_maybe_stop_tx(tx_ring, DESC_NEEDED);
 
 	/* Algorithm to optimize tail and RS bit setting:
 	 * if xmit_more is supported
@@ -2084,38 +2070,6 @@ dma_error:
 }
 
 /**
- * i40evf_xmit_descriptor_count - calculate number of tx descriptors needed
- * @skb:     send buffer
- * @tx_ring: ring to send buffer on
- *
- * Returns number of data descriptors needed for this skb. Returns 0 to indicate
- * there is not enough descriptors available in this ring since we need at least
- * one descriptor.
- **/
-static inline int i40evf_xmit_descriptor_count(struct sk_buff *skb,
-					       struct i40e_ring *tx_ring)
-{
-	unsigned int f;
-	int count = 0;
-
-	/* need: 1 descriptor per page * PAGE_SIZE/I40E_MAX_DATA_PER_TXD,
-	 *       + 1 desc for skb_head_len/I40E_MAX_DATA_PER_TXD,
-	 *       + 4 desc gap to avoid the cache line where head is,
-	 *       + 1 desc for context descriptor,
-	 * otherwise try next time
-	 */
-	for (f = 0; f < skb_shinfo(skb)->nr_frags; f++)
-		count += TXD_USE_COUNT(skb_shinfo(skb)->frags[f].size);
-
-	count += TXD_USE_COUNT(skb_headlen(skb));
-	if (i40evf_maybe_stop_tx(tx_ring, count + 4 + 1)) {
-		tx_ring->tx_stats.tx_busy++;
-		return 0;
-	}
-	return count;
-}
-
-/**
  * i40e_xmit_frame_ring - Sends buffer on Tx ring
  * @skb:     send buffer
  * @tx_ring: ring to send buffer on
@@ -2133,13 +2087,23 @@ static netdev_tx_t i40e_xmit_frame_ring(struct sk_buff *skb,
 	__be16 protocol;
 	u32 td_cmd = 0;
 	u8 hdr_len = 0;
-	int tso;
+	int tso, count;
 
 	/* prefetch the data, we'll need it later */
 	prefetch(skb->data);
 
-	if (0 == i40evf_xmit_descriptor_count(skb, tx_ring))
+	count = i40e_xmit_descriptor_count(skb);
+
+	/* need: 1 descriptor per page * PAGE_SIZE/I40E_MAX_DATA_PER_TXD,
+	 *       + 1 desc for skb_head_len/I40E_MAX_DATA_PER_TXD,
+	 *       + 4 desc gap to avoid the cache line where head is,
+	 *       + 1 desc for context descriptor,
+	 * otherwise try next time
+	 */
+	if (i40e_maybe_stop_tx(tx_ring, count + 4 + 1)) {
+		tx_ring->tx_stats.tx_busy++;
 		return NETDEV_TX_BUSY;
+	}
 
 	/* prepare the xmit flags */
 	if (i40evf_tx_prepare_vlan_flags(skb, tx_ring, &tx_flags))
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.h b/drivers/net/ethernet/intel/i40evf/i40e_txrx.h
index 6ea8701cf066..228cc76be6f6 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.h
@@ -326,6 +326,7 @@ void i40evf_free_rx_resources(struct i40e_ring *rx_ring);
 int i40evf_napi_poll(struct napi_struct *napi, int budget);
 void i40evf_force_wb(struct i40e_vsi *vsi, struct i40e_q_vector *q_vector);
 u32 i40evf_get_tx_pending(struct i40e_ring *ring, bool in_sw);
+int __i40evf_maybe_stop_tx(struct i40e_ring *tx_ring, int size);
 
 /**
  * i40e_get_head - Retrieve head from head writeback
@@ -340,4 +341,45 @@ static inline u32 i40e_get_head(struct i40e_ring *tx_ring)
 
 	return le32_to_cpu(*(volatile __le32 *)head);
 }
+
+/**
+ * i40e_xmit_descriptor_count - calculate number of tx descriptors needed
+ * @skb:     send buffer
+ * @tx_ring: ring to send buffer on
+ *
+ * Returns number of data descriptors needed for this skb. Returns 0 to indicate
+ * there is not enough descriptors available in this ring since we need at least
+ * one descriptor.
+ **/
+static inline int i40e_xmit_descriptor_count(struct sk_buff *skb)
+{
+	const struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[0];
+	unsigned int nr_frags = skb_shinfo(skb)->nr_frags;
+	int count = 0, size = skb_headlen(skb);
+
+	for (;;) {
+		count += TXD_USE_COUNT(size);
+
+		if (!nr_frags--)
+			break;
+
+		size = skb_frag_size(frag++);
+	}
+
+	return count;
+}
+
+/**
+ * i40e_maybe_stop_tx - 1st level check for tx stop conditions
+ * @tx_ring: the ring to be checked
+ * @size:    the size buffer we want to assure is available
+ *
+ * Returns 0 if stop is not needed
+ **/
+static inline int i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size)
+{
+	if (likely(I40E_DESC_UNUSED(tx_ring) >= size))
+		return 0;
+	return __i40evf_maybe_stop_tx(tx_ring, size);
+}
 #endif /* _I40E_TXRX_H_ */



* [Intel-wired-lan] [next PATCH 2/4] i40e/i40evf: Rewrite logic for 8 descriptor per packet check
  2016-02-17 19:02 [Intel-wired-lan] [next PATCH 0/4] i40e/i40evf: Improve throughput by reducing PCIe overhead Alexander Duyck
  2016-02-17 19:02 ` [Intel-wired-lan] [next PATCH 1/4] i40e/i40evf: Break up xmit_descriptor_count from maybe_stop_tx Alexander Duyck
@ 2016-02-17 19:02 ` Alexander Duyck
  2016-02-17 19:02 ` [Intel-wired-lan] [next PATCH 3/4] i40e/i40evf: Move Tx checksum closer to TSO Alexander Duyck
  2016-02-17 19:03 ` [Intel-wired-lan] [next PATCH 4/4] i40e/i40evf: Allow up to 12K bytes of data per Tx descriptor instead of 8K Alexander Duyck
  3 siblings, 0 replies; 8+ messages in thread
From: Alexander Duyck @ 2016-02-17 19:02 UTC (permalink / raw)
  To: intel-wired-lan

This patch rewrites the logic for how we determine whether we can
transmit the frame as-is or whether it needs to be linearized.

The previous code for this function used a mix of division and modulo
arithmetic as part of computing whether we need to take the slow path.
Instead I have replaced this with a simple sliding window which tells us
if the frame is capable of causing a single packet to span more
descriptors than the hardware allows.

The logic for the scan is fairly simple.  If any given group of 6
fragments totals less than gso_size - 1, then it is possible for us to
have one byte coming out of the fragment ahead of the group, the 6
fragments themselves, and one or more bytes coming out of the fragment
after them.  That is 8 fragments for one segment which, together with the
header, exceeds what we can allow, so we send such frames to be
linearized.
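
A standalone restatement of that scan (plain arrays instead of
skb_shinfo(), names hypothetical; the driver only calls this once the
descriptor count already exceeds 8):

	#include <assert.h>
	#include <stdbool.h>

	#define MAX_BUFFER_TXD	8	/* ~ I40E_MAX_BUFFER_TXD */

	static bool needs_linearize(const int *frag_size, int nr_frags,
				    int gso_size)
	{
		int sum, i;

		if (!gso_size)
			return true;	/* not TSO; > 8 buffers won't fit */
		if (nr_frags < MAX_BUFFER_TXD)
			return false;

		/* seed with frags 1-5; the 1 models the single byte the
		 * fragment ahead of the window may contribute
		 */
		sum = 1 - gso_size;
		for (i = 1; i <= 5; i++)
			sum += frag_size[i];

		/* slide a 6-frag window across the interior fragments */
		for (i = 6; i < nr_frags - 1; i++) {
			sum += frag_size[i];
			if (sum < 0)
				return true;	/* a segment could span > 8 buffers */
			sum -= frag_size[i - 5];
		}

		return false;
	}

	int main(void)
	{
		int tiny[9]  = { 256, 256, 256, 256, 256, 256, 256, 256, 256 };
		int pages[9] = { 4096, 4096, 4096, 4096, 4096, 4096, 4096, 4096, 4096 };

		assert(needs_linearize(tiny, 9, 4096));		/* must linearize */
		assert(!needs_linearize(pages, 9, 1448));	/* fine as-is */
		return 0;
	}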

Arguably the use of modulo might be more exact, as the approach I propose
may generate some false positives.  However the likelihood of us taking
much of a hit for those false positives is fairly low, and I would rather
not add more overhead in the case where we are receiving a frame composed
of 4K pages.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
 drivers/net/ethernet/intel/i40e/i40e_fcoe.c   |    6 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  105 ++++++++++++++-----------
 drivers/net/ethernet/intel/i40e/i40e_txrx.h   |   19 +++++
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c |  105 ++++++++++++++-----------
 drivers/net/ethernet/intel/i40evf/i40e_txrx.h |   19 +++++
 5 files changed, 162 insertions(+), 92 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_fcoe.c b/drivers/net/ethernet/intel/i40e/i40e_fcoe.c
index 518d72ea1059..052df93f1da4 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_fcoe.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_fcoe.c
@@ -1368,6 +1368,12 @@ static netdev_tx_t i40e_fcoe_xmit_frame(struct sk_buff *skb,
 		goto out_drop;
 
 	count = i40e_xmit_descriptor_count(skb);
+	if (i40e_chk_linearize(skb, count)) {
+		if (__skb_linearize(skb))
+			goto out_drop;
+		count = TXD_USE_COUNT(skb->len);
+		tx_ring->tx_stats.tx_linearize++;
+	}
 
 	/* need: 1 descriptor per page * PAGE_SIZE/I40E_MAX_DATA_PER_TXD,
 	 *       + 1 desc for skb_head_len/I40E_MAX_DATA_PER_TXD,
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index f03657022b0f..5123646a895f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2593,59 +2593,71 @@ int __i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size)
 }
 
 /**
- * i40e_chk_linearize - Check if there are more than 8 fragments per packet
+ * __i40e_chk_linearize - Check if there are more than 8 fragments per packet
  * @skb:      send buffer
- * @tx_flags: collected send information
  *
  * Note: Our HW can't scatter-gather more than 8 fragments to build
  * a packet on the wire and so we need to figure out the cases where we
  * need to linearize the skb.
  **/
-static bool i40e_chk_linearize(struct sk_buff *skb, u32 tx_flags)
+bool __i40e_chk_linearize(struct sk_buff *skb)
 {
-	struct skb_frag_struct *frag;
-	bool linearize = false;
-	unsigned int size = 0;
-	u16 num_frags;
-	u16 gso_segs;
+	const struct skb_frag_struct *frag, *stale;
+	int gso_size, nr_frags, sum;
 
-	num_frags = skb_shinfo(skb)->nr_frags;
-	gso_segs = skb_shinfo(skb)->gso_segs;
+	/* check to see if TSO is enabled, if so we may get a reprieve */
+	gso_size = skb_shinfo(skb)->gso_size;
+	if (unlikely(!gso_size))
+		return true;
 
-	if (tx_flags & (I40E_TX_FLAGS_TSO | I40E_TX_FLAGS_FSO)) {
-		u16 j = 0;
+	/* no need to check if number of frags is less than 8 */
+	nr_frags = skb_shinfo(skb)->nr_frags;
+	if (nr_frags < I40E_MAX_BUFFER_TXD)
+		return false;
 
-		if (num_frags < (I40E_MAX_BUFFER_TXD))
-			goto linearize_chk_done;
-		/* try the simple math, if we have too many frags per segment */
-		if (DIV_ROUND_UP((num_frags + gso_segs), gso_segs) >
-		    I40E_MAX_BUFFER_TXD) {
-			linearize = true;
-			goto linearize_chk_done;
-		}
-		frag = &skb_shinfo(skb)->frags[0];
-		/* we might still have more fragments per segment */
-		do {
-			size += skb_frag_size(frag);
-			frag++; j++;
-			if ((size >= skb_shinfo(skb)->gso_size) &&
-			    (j < I40E_MAX_BUFFER_TXD)) {
-				size = (size % skb_shinfo(skb)->gso_size);
-				j = (size) ? 1 : 0;
-			}
-			if (j == I40E_MAX_BUFFER_TXD) {
-				linearize = true;
-				break;
-			}
-			num_frags--;
-		} while (num_frags);
-	} else {
-		if (num_frags >= I40E_MAX_BUFFER_TXD)
-			linearize = true;
+	/* We need to walk through the list and validate that each group
+	 * of 6 fragments totals at least gso_size.  However we don't need
+	 * to perform such validation on the first or last 6 since the first
+	 * 6 cannot inherit any data from a descriptor before them, and the
+	 * last 6 cannot inherit any data from a descriptor after them.
+	 */
+	nr_frags -= I40E_MAX_BUFFER_TXD - 1;
+	frag = &skb_shinfo(skb)->frags[0];
+
+	/* Initialize sum to the negative value of gso_size minus 1.  We
+	 * use this as the worst case scenario in which the frag ahead
+	 * of us only provides one byte which is why we are limited to 6
+	 * descriptors for a single transmit as the header and previous
+	 * fragment are already consuming 2 descriptors.
+	 */
+	sum = 1 - gso_size;
+
+	/* Add size of frags 1 through 5 to create our initial sum */
+	sum += skb_frag_size(++frag);
+	sum += skb_frag_size(++frag);
+	sum += skb_frag_size(++frag);
+	sum += skb_frag_size(++frag);
+	sum += skb_frag_size(++frag);
+
+	/* Walk through fragments adding latest fragment, testing it, and
+	 * then removing stale fragments from the sum.
+	 */
+	stale = &skb_shinfo(skb)->frags[0];
+	for (;;) {
+		sum += skb_frag_size(++frag);
+
+		/* if sum is negative we failed to make sufficient progress */
+		if (sum < 0)
+			return true;
+
+		/* use pre-decrement to avoid processing last fragment */
+		if (!--nr_frags)
+			break;
+
+		sum -= skb_frag_size(++stale);
 	}
 
-linearize_chk_done:
-	return linearize;
+	return false;
 }
 
 /**
@@ -2876,6 +2888,12 @@ static netdev_tx_t i40e_xmit_frame_ring(struct sk_buff *skb,
 	prefetch(skb->data);
 
 	count = i40e_xmit_descriptor_count(skb);
+	if (i40e_chk_linearize(skb, count)) {
+		if (__skb_linearize(skb))
+			goto out_drop;
+		count = TXD_USE_COUNT(skb->len);
+		tx_ring->tx_stats.tx_linearize++;
+	}
 
 	/* need: 1 descriptor per page * PAGE_SIZE/I40E_MAX_DATA_PER_TXD,
 	 *       + 1 desc for skb_head_len/I40E_MAX_DATA_PER_TXD,
@@ -2916,11 +2934,6 @@ static netdev_tx_t i40e_xmit_frame_ring(struct sk_buff *skb,
 	if (tsyn)
 		tx_flags |= I40E_TX_FLAGS_TSYN;
 
-	if (i40e_chk_linearize(skb, tx_flags)) {
-		if (skb_linearize(skb))
-			goto out_drop;
-		tx_ring->tx_stats.tx_linearize++;
-	}
 	skb_tx_timestamp(skb);
 
 	/* always enable CRC insertion offload */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index 48a2ab8a8ec7..8a3a163cc475 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -337,6 +337,7 @@ int i40e_tx_prepare_vlan_flags(struct sk_buff *skb,
 void i40e_force_wb(struct i40e_vsi *vsi, struct i40e_q_vector *q_vector);
 u32 i40e_get_tx_pending(struct i40e_ring *ring, bool in_sw);
 int __i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size);
+bool __i40e_chk_linearize(struct sk_buff *skb);
 
 /**
  * i40e_get_head - Retrieve head from head writeback
@@ -392,4 +393,22 @@ static inline int i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size)
 		return 0;
 	return __i40e_maybe_stop_tx(tx_ring, size);
 }
+
+/**
+ * i40e_chk_linearize - Check if there are more than 8 fragments per packet
+ * @skb:      send buffer
+ * @count:    number of buffers used
+ *
+ * Note: Our HW can't scatter-gather more than 8 fragments to build
+ * a packet on the wire and so we need to figure out the cases where we
+ * need to linearize the skb.
+ **/
+static inline bool i40e_chk_linearize(struct sk_buff *skb, int count)
+{
+	/* we can only support up to 8 data buffers for a single send */
+	if (likely(count <= I40E_MAX_BUFFER_TXD))
+		return false;
+
+	return __i40e_chk_linearize(skb);
+}
 #endif /* _I40E_TXRX_H_ */
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 78d9ce4693c6..1dd1cc12304b 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -1794,59 +1794,71 @@ static void i40e_create_tx_ctx(struct i40e_ring *tx_ring,
 }
 
 /**
- * i40e_chk_linearize - Check if there are more than 8 fragments per packet
+ * __i40evf_chk_linearize - Check if there are more than 8 fragments per packet
  * @skb:      send buffer
- * @tx_flags: collected send information
  *
  * Note: Our HW can't scatter-gather more than 8 fragments to build
  * a packet on the wire and so we need to figure out the cases where we
  * need to linearize the skb.
  **/
-static bool i40e_chk_linearize(struct sk_buff *skb, u32 tx_flags)
+bool __i40evf_chk_linearize(struct sk_buff *skb)
 {
-	struct skb_frag_struct *frag;
-	bool linearize = false;
-	unsigned int size = 0;
-	u16 num_frags;
-	u16 gso_segs;
+	const struct skb_frag_struct *frag, *stale;
+	int gso_size, nr_frags, sum;
 
-	num_frags = skb_shinfo(skb)->nr_frags;
-	gso_segs = skb_shinfo(skb)->gso_segs;
+	/* check to see if TSO is enabled, if so we may get a reprieve */
+	gso_size = skb_shinfo(skb)->gso_size;
+	if (unlikely(!gso_size))
+		return true;
 
-	if (tx_flags & (I40E_TX_FLAGS_TSO | I40E_TX_FLAGS_FSO)) {
-		u16 j = 0;
+	/* no need to check if number of frags is less than 8 */
+	nr_frags = skb_shinfo(skb)->nr_frags;
+	if (nr_frags < I40E_MAX_BUFFER_TXD)
+		return false;
 
-		if (num_frags < (I40E_MAX_BUFFER_TXD))
-			goto linearize_chk_done;
-		/* try the simple math, if we have too many frags per segment */
-		if (DIV_ROUND_UP((num_frags + gso_segs), gso_segs) >
-		    I40E_MAX_BUFFER_TXD) {
-			linearize = true;
-			goto linearize_chk_done;
-		}
-		frag = &skb_shinfo(skb)->frags[0];
-		/* we might still have more fragments per segment */
-		do {
-			size += skb_frag_size(frag);
-			frag++; j++;
-			if ((size >= skb_shinfo(skb)->gso_size) &&
-			    (j < I40E_MAX_BUFFER_TXD)) {
-				size = (size % skb_shinfo(skb)->gso_size);
-				j = (size) ? 1 : 0;
-			}
-			if (j == I40E_MAX_BUFFER_TXD) {
-				linearize = true;
-				break;
-			}
-			num_frags--;
-		} while (num_frags);
-	} else {
-		if (num_frags >= I40E_MAX_BUFFER_TXD)
-			linearize = true;
+	/* We need to walk through the list and validate that each group
+	 * of 6 fragments totals at least gso_size.  However we don't need
+	 * to perform such validation on the first or last 6 since the first
+	 * 6 cannot inherit any data from a descriptor before them, and the
+	 * last 6 cannot inherit any data from a descriptor after them.
+	 */
+	nr_frags -= I40E_MAX_BUFFER_TXD - 1;
+	frag = &skb_shinfo(skb)->frags[0];
+
+	/* Initialize sum to the negative value of gso_size minus 1.  We
+	 * use this as the worst case scenario in which the frag ahead
+	 * of us only provides one byte which is why we are limited to 6
+	 * descriptors for a single transmit as the header and previous
+	 * fragment are already consuming 2 descriptors.
+	 */
+	sum = 1 - gso_size;
+
+	/* Add size of frags 1 through 5 to create our initial sum */
+	sum += skb_frag_size(++frag);
+	sum += skb_frag_size(++frag);
+	sum += skb_frag_size(++frag);
+	sum += skb_frag_size(++frag);
+	sum += skb_frag_size(++frag);
+
+	/* Walk through fragments adding latest fragment, testing it, and
+	 * then removing stale fragments from the sum.
+	 */
+	stale = &skb_shinfo(skb)->frags[0];
+	for (;;) {
+		sum += skb_frag_size(++frag);
+
+		/* if sum is negative we failed to make sufficient progress */
+		if (sum < 0)
+			return true;
+
+		/* use pre-decrement to avoid processing last fragment */
+		if (!--nr_frags)
+			break;
+
+		sum -= skb_frag_size(++stale);
 	}
 
-linearize_chk_done:
-	return linearize;
+	return false;
 }
 
 /**
@@ -2093,6 +2105,12 @@ static netdev_tx_t i40e_xmit_frame_ring(struct sk_buff *skb,
 	prefetch(skb->data);
 
 	count = i40e_xmit_descriptor_count(skb);
+	if (i40e_chk_linearize(skb, count)) {
+		if (__skb_linearize(skb))
+			goto out_drop;
+		count = TXD_USE_COUNT(skb->len);
+		tx_ring->tx_stats.tx_linearize++;
+	}
 
 	/* need: 1 descriptor per page * PAGE_SIZE/I40E_MAX_DATA_PER_TXD,
 	 *       + 1 desc for skb_head_len/I40E_MAX_DATA_PER_TXD,
@@ -2128,11 +2146,6 @@ static netdev_tx_t i40e_xmit_frame_ring(struct sk_buff *skb,
 	else if (tso)
 		tx_flags |= I40E_TX_FLAGS_TSO;
 
-	if (i40e_chk_linearize(skb, tx_flags)) {
-		if (skb_linearize(skb))
-			goto out_drop;
-		tx_ring->tx_stats.tx_linearize++;
-	}
 	skb_tx_timestamp(skb);
 
 	/* always enable CRC insertion offload */
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.h b/drivers/net/ethernet/intel/i40evf/i40e_txrx.h
index 228cc76be6f6..8c5da4f89fd0 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.h
@@ -327,6 +327,7 @@ int i40evf_napi_poll(struct napi_struct *napi, int budget);
 void i40evf_force_wb(struct i40e_vsi *vsi, struct i40e_q_vector *q_vector);
 u32 i40evf_get_tx_pending(struct i40e_ring *ring, bool in_sw);
 int __i40evf_maybe_stop_tx(struct i40e_ring *tx_ring, int size);
+bool __i40evf_chk_linearize(struct sk_buff *skb);
 
 /**
  * i40e_get_head - Retrieve head from head writeback
@@ -382,4 +383,22 @@ static inline int i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size)
 		return 0;
 	return __i40evf_maybe_stop_tx(tx_ring, size);
 }
+
+/**
+ * i40e_chk_linearize - Check if there are more than 8 fragments per packet
+ * @skb:      send buffer
+ * @count:    number of buffers used
+ *
+ * Note: Our HW can't scatter-gather more than 8 fragments to build
+ * a packet on the wire and so we need to figure out the cases where we
+ * need to linearize the skb.
+ **/
+static inline bool i40e_chk_linearize(struct sk_buff *skb, int count)
+{
+	/* we can only support up to 8 data buffers for a single send */
+	if (likely(count <= I40E_MAX_BUFFER_TXD))
+		return false;
+
+	return __i40evf_chk_linearize(skb);
+}
 #endif /* _I40E_TXRX_H_ */



* [Intel-wired-lan] [next PATCH 3/4] i40e/i40evf: Move Tx checksum closer to TSO
  2016-02-17 19:02 [Intel-wired-lan] [next PATCH 0/4] i40e/i40evf: Improve throughput by reducing PCIe overhead Alexander Duyck
  2016-02-17 19:02 ` [Intel-wired-lan] [next PATCH 1/4] i40e/i40evf: Break up xmit_descriptor_count from maybe_stop_tx Alexander Duyck
  2016-02-17 19:02 ` [Intel-wired-lan] [next PATCH 2/4] i40e/i40evf: Rewrite logic for 8 descriptor per packet check Alexander Duyck
@ 2016-02-17 19:02 ` Alexander Duyck
  2016-02-17 19:03 ` [Intel-wired-lan] [next PATCH 4/4] i40e/i40evf: Allow up to 12K bytes of data per Tx descriptor instead of 8K Alexander Duyck
  3 siblings, 0 replies; 8+ messages in thread
From: Alexander Duyck @ 2016-02-17 19:02 UTC (permalink / raw)
  To: intel-wired-lan

On all of the other Intel drivers we place the checksum handling close to
TSO, as the two have a significant amount in common, and keeping them
together helps to reduce the decision tree for how to handle the frame.
The first check in TSO is to see if checksumming is offloaded, and if it
is not we can skip _BOTH_ TSO and Tx checksum offload based on that
single check.
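
The reduced decision tree, as a toy (CHECKSUM_PARTIAL is the stack's
request for offload; everything else here is made up for illustration,
not the driver's code):

	#include <assert.h>
	#include <stdbool.h>

	enum csum_mode { CHECKSUM_NONE, CHECKSUM_PARTIAL };

	struct fake_skb {
		enum csum_mode ip_summed;
		bool gso;		/* stands in for skb_is_gso() */
	};

	/* 0 = no offload work, 1 = csum only, 2 = TSO (which implies csum) */
	static int offload_work(const struct fake_skb *skb)
	{
		/* TSO requires checksum offload, so one test on
		 * CHECKSUM_PARTIAL can skip both paths at once
		 */
		if (skb->ip_summed != CHECKSUM_PARTIAL)
			return 0;
		return skb->gso ? 2 : 1;
	}

	int main(void)
	{
		struct fake_skb plain = { CHECKSUM_NONE, false };
		struct fake_skb tso = { CHECKSUM_PARTIAL, true };

		assert(offload_work(&plain) == 0);
		assert(offload_work(&tso) == 2);
		return 0;
	}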

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   12 ++++++------
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c |   10 +++++-----
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 5123646a895f..cb52f39d514a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2929,6 +2929,12 @@ static netdev_tx_t i40e_xmit_frame_ring(struct sk_buff *skb,
 	else if (tso)
 		tx_flags |= I40E_TX_FLAGS_TSO;
 
+	/* Always offload the checksum, since it's in the data descriptor */
+	tso = i40e_tx_enable_csum(skb, &tx_flags, &td_cmd, &td_offset,
+				  tx_ring, &cd_tunneling);
+	if (tso < 0)
+		goto out_drop;
+
 	tsyn = i40e_tsyn(tx_ring, skb, tx_flags, &cd_type_cmd_tso_mss);
 
 	if (tsyn)
@@ -2939,12 +2945,6 @@ static netdev_tx_t i40e_xmit_frame_ring(struct sk_buff *skb,
 	/* always enable CRC insertion offload */
 	td_cmd |= I40E_TX_DESC_CMD_ICRC;
 
-	/* Always offload the checksum, since it's in the data descriptor */
-	tso = i40e_tx_enable_csum(skb, &tx_flags, &td_cmd, &td_offset,
-				  tx_ring, &cd_tunneling);
-	if (tso < 0)
-		goto out_drop;
-
 	i40e_create_tx_ctx(tx_ring, cd_type_cmd_tso_mss,
 			   cd_tunneling, cd_l2tag2);
 
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 1dd1cc12304b..686a95fe48bd 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -2146,17 +2146,17 @@ static netdev_tx_t i40e_xmit_frame_ring(struct sk_buff *skb,
 	else if (tso)
 		tx_flags |= I40E_TX_FLAGS_TSO;
 
-	skb_tx_timestamp(skb);
-
-	/* always enable CRC insertion offload */
-	td_cmd |= I40E_TX_DESC_CMD_ICRC;
-
 	/* Always offload the checksum, since it's in the data descriptor */
 	tso = i40e_tx_enable_csum(skb, &tx_flags, &td_cmd, &td_offset,
 				  tx_ring, &cd_tunneling);
 	if (tso < 0)
 		goto out_drop;
 
+	skb_tx_timestamp(skb);
+
+	/* always enable CRC insertion offload */
+	td_cmd |= I40E_TX_DESC_CMD_ICRC;
+
 	i40e_create_tx_ctx(tx_ring, cd_type_cmd_tso_mss,
 			   cd_tunneling, cd_l2tag2);
 



* [Intel-wired-lan] [next PATCH 4/4] i40e/i40evf: Allow up to 12K bytes of data per Tx descriptor instead of 8K
  2016-02-17 19:02 [Intel-wired-lan] [next PATCH 0/4] i40e/i40evf: Improve throughput by reducing PCIe overhead Alexander Duyck
                   ` (2 preceding siblings ...)
  2016-02-17 19:02 ` [Intel-wired-lan] [next PATCH 3/4] i40e/i40evf: Move Tx checksum closer to TSO Alexander Duyck
@ 2016-02-17 19:03 ` Alexander Duyck
  2016-02-17 21:38   ` Jeff Kirsher
  3 siblings, 1 reply; 8+ messages in thread
From: Alexander Duyck @ 2016-02-17 19:03 UTC (permalink / raw)
  To: intel-wired-lan

From what I can tell the practical limitation on the size of the Tx data
buffer is the fact that the size field in the Tx descriptor is limited to
14 bits.  As such we cannot use 16K as is typically used on the other
Intel drivers.  However artificially limiting ourselves to 8K can be
expensive as this means that we will consume up to 10 descriptors (1
context, 1 for header, and 9 for a non-8K-aligned payload) in a single
send.

I propose that we can reduce this by increasing the maximum data for a 4K
aligned block to 12K.  We can reduce the descriptors used for a 32K
aligned block by 1 by increasing the size like this.  In addition we
still have 4K - 1 of unused space, which we can use as a bit of extra
padding when dealing with data that is not aligned to 4K.
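
That padding is the -dma & (I40E_MAX_READ_REQ_SIZE - 1) adjustment
visible in the diff below; a small sketch of what it buys for a
misaligned buffer (userspace C, constants mirror the patch):

	#include <assert.h>
	#include <stdint.h>

	#define MAX_READ_REQ	4096u		/* 4K PCIe read request */
	#define MAX_ALIGNED	(12u * 1024)	/* 12K per descriptor */

	int main(void)
	{
		/* a DMA address 1K past a 4K boundary */
		uint64_t dma = 0x10000400ull;
		unsigned int max_data = MAX_ALIGNED +
			(unsigned int)(-dma & (MAX_READ_REQ - 1));

		/* first chunk absorbs up to 4K - 1 extra bytes ... */
		assert(max_data == 15 * 1024);
		/* ... so it stays under the 16K - 1 hardware limit ... */
		assert(max_data < 16 * 1024);
		/* ... and the next descriptor starts 4K aligned */
		assert(((dma + max_data) & (MAX_READ_REQ - 1)) == 0);
		return 0;
	}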

By aligning the descriptors after the first to 4K we can improve the
efficiency of PCIe accesses, as we can avoid using byte enables and can
fetch full TLP transactions after the first fetch of the buffer.  Below
are the results of testing before and after with this patch:

Recv   Send   Send                         Utilization      Service Demand
Socket Socket Message  Elapsed             Send     Recv    Send    Recv
Size   Size   Size     Time    Throughput  local    remote  local   remote
bytes  bytes  bytes    secs.   10^6bits/s  % S      % U     us/KB   us/KB
Before:
87380  16384  16384    10.00     33682.24  20.27    -1.00   0.592   -1.00
After:
87380  16384  16384    10.00     34204.08  20.54    -1.00   0.590   -1.00

So the net result of this patch is that we have a small gain in throughput
due to a reduction in overhead for putting together the frame.
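
The diff below also swaps TXD_USE_COUNT's DIV_ROUND_UP for a reciprocal
multiply.  A standalone check of that trick against plain ceil division
(userspace C, names mine):

	#include <assert.h>
	#include <stdint.h>

	#define MAX_ALIGNED	(12u * 1024)	/* ~ I40E_MAX_DATA_PER_TXD_ALIGNED */

	/* reciprocal-multiply form of DIV_ROUND_UP(size, 12K) */
	static unsigned int txd_use_count(unsigned int size)
	{
		const unsigned int max = MAX_ALIGNED;
		const unsigned int reciprocal =
			(unsigned int)(((1ull << 32) - 1 + (max / 2)) / max);
		unsigned int adjust = ~(uint32_t)0;

		/* if we rounded up on the reciprocal pull down the adjustment */
		if ((max * reciprocal) > adjust)
			adjust = ~(uint32_t)(reciprocal - 1);

		return (uint32_t)((((uint64_t)size * reciprocal) + adjust) >> 32);
	}

	int main(void)
	{
		unsigned int size;

		/* exact over the sizes a frag or linear buffer can take */
		for (size = 1; size <= 128u * 1024; size++)
			assert(txd_use_count(size) ==
			       (size + MAX_ALIGNED - 1) / MAX_ALIGNED);
		return 0;
	}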

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   13 ++++++---
 drivers/net/ethernet/intel/i40e/i40e_txrx.h   |   35 +++++++++++++++++++++++--
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c |   13 ++++++---
 drivers/net/ethernet/intel/i40evf/i40e_txrx.h |   35 +++++++++++++++++++++++--
 4 files changed, 82 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index cb52f39d514a..f870b8da4551 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2716,6 +2716,8 @@ static inline void i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
 	tx_bi = first;
 
 	for (frag = &skb_shinfo(skb)->frags[0];; frag++) {
+		unsigned int max_data = I40E_MAX_DATA_PER_TXD_ALIGNED;
+
 		if (dma_mapping_error(tx_ring->dev, dma))
 			goto dma_error;
 
@@ -2723,12 +2725,14 @@ static inline void i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
 		dma_unmap_len_set(tx_bi, len, size);
 		dma_unmap_addr_set(tx_bi, dma, dma);
 
+		/* align size to end of page */
+		max_data += -dma & (I40E_MAX_READ_REQ_SIZE - 1);
 		tx_desc->buffer_addr = cpu_to_le64(dma);
 
 		while (unlikely(size > I40E_MAX_DATA_PER_TXD)) {
 			tx_desc->cmd_type_offset_bsz =
 				build_ctob(td_cmd, td_offset,
-					   I40E_MAX_DATA_PER_TXD, td_tag);
+					   max_data, td_tag);
 
 			tx_desc++;
 			i++;
@@ -2739,9 +2743,10 @@ static inline void i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
 				i = 0;
 			}
 
-			dma += I40E_MAX_DATA_PER_TXD;
-			size -= I40E_MAX_DATA_PER_TXD;
+			dma += max_data;
+			size -= max_data;
 
+			max_data = I40E_MAX_DATA_PER_TXD_ALIGNED;
 			tx_desc->buffer_addr = cpu_to_le64(dma);
 		}
 
@@ -2891,7 +2896,7 @@ static netdev_tx_t i40e_xmit_frame_ring(struct sk_buff *skb,
 	if (i40e_chk_linearize(skb, count)) {
 		if (__skb_linearize(skb))
 			goto out_drop;
-		count = TXD_USE_COUNT(skb->len);
+		count = i40e_txd_use_count(skb->len);
 		tx_ring->tx_stats.tx_linearize++;
 	}
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index 8a3a163cc475..8b049cd77064 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -146,10 +146,39 @@ enum i40e_dyn_idx_t {
 
 #define I40E_MAX_BUFFER_TXD	8
 #define I40E_MIN_TX_LEN		17
-#define I40E_MAX_DATA_PER_TXD	8192
+
+/* The size limit for a transmit buffer in a descriptor is (16K - 1).
+ * In order to align with the read requests we will align the value to
+ * the nearest 4K which represents our maximum read request size.
+ */
+#define I40E_MAX_READ_REQ_SIZE		4096
+#define I40E_MAX_DATA_PER_TXD		(16 * 1024 - 1)
+#define I40E_MAX_DATA_PER_TXD_ALIGNED \
+	(I40E_MAX_DATA_PER_TXD & ~(I40E_MAX_READ_REQ_SIZE - 1))
+
+/* This ugly bit of math is equivalent to DIV_ROUND_UP(size, X) where X is
+ * the value I40E_MAX_DATA_PER_TXD_ALIGNED.  It is needed due to the fact
+ * that 12K is not a power of 2 and division is expensive.  It is used to
+ * approximate the number of descriptors used per linear buffer.  Note
+ * that this will overestimate in some cases as it doesn't account for the
+ * fact that we will add up to 4K - 1 in aligning the 12K buffer, however
+ * the error should not impact things much as large buffers usually mean
+ * we will use fewer descriptors than there are frags in an skb.
+ */
+static inline unsigned int i40e_txd_use_count(unsigned int size)
+{
+	const unsigned int max = I40E_MAX_DATA_PER_TXD_ALIGNED;
+	const unsigned int reciprocal = ((1ull << 32) - 1 + (max / 2)) / max;
+	unsigned int adjust = ~(u32)0;
+
+	/* if we rounded up on the reciprocal pull down the adjustment */
+	if ((max * reciprocal) > adjust)
+		adjust = ~(u32)(reciprocal - 1);
+
+	return (u32)((((u64)size * reciprocal) + adjust) >> 32);
+}
 
 /* Tx Descriptors needed, worst case */
-#define TXD_USE_COUNT(S) DIV_ROUND_UP((S), I40E_MAX_DATA_PER_TXD)
 #define DESC_NEEDED (MAX_SKB_FRAGS + 4)
 #define I40E_MIN_DESC_PENDING	4
 
@@ -369,7 +398,7 @@ static inline int i40e_xmit_descriptor_count(struct sk_buff *skb)
 	int count = 0, size = skb_headlen(skb);
 
 	for (;;) {
-		count += TXD_USE_COUNT(size);
+		count += i40e_txd_use_count(size);
 
 		if (!nr_frags--)
 			break;
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 686a95fe48bd..31466fe8dca1 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -1934,6 +1934,8 @@ static inline void i40evf_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
 	tx_bi = first;
 
 	for (frag = &skb_shinfo(skb)->frags[0];; frag++) {
+		unsigned int max_data = I40E_MAX_DATA_PER_TXD_ALIGNED;
+
 		if (dma_mapping_error(tx_ring->dev, dma))
 			goto dma_error;
 
@@ -1941,12 +1943,14 @@ static inline void i40evf_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
 		dma_unmap_len_set(tx_bi, len, size);
 		dma_unmap_addr_set(tx_bi, dma, dma);
 
+		/* align size to end of page */
+		max_data += -dma & (I40E_MAX_READ_REQ_SIZE - 1);
 		tx_desc->buffer_addr = cpu_to_le64(dma);
 
 		while (unlikely(size > I40E_MAX_DATA_PER_TXD)) {
 			tx_desc->cmd_type_offset_bsz =
 				build_ctob(td_cmd, td_offset,
-					   I40E_MAX_DATA_PER_TXD, td_tag);
+					   max_data, td_tag);
 
 			tx_desc++;
 			i++;
@@ -1957,9 +1961,10 @@ static inline void i40evf_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
 				i = 0;
 			}
 
-			dma += I40E_MAX_DATA_PER_TXD;
-			size -= I40E_MAX_DATA_PER_TXD;
+			dma += max_data;
+			size -= max_data;
 
+			max_data = I40E_MAX_DATA_PER_TXD_ALIGNED;
 			tx_desc->buffer_addr = cpu_to_le64(dma);
 		}
 
@@ -2108,7 +2113,7 @@ static netdev_tx_t i40e_xmit_frame_ring(struct sk_buff *skb,
 	if (i40e_chk_linearize(skb, count)) {
 		if (__skb_linearize(skb))
 			goto out_drop;
-		count = TXD_USE_COUNT(skb->len);
+		count = i40e_txd_use_count(skb->len);
 		tx_ring->tx_stats.tx_linearize++;
 	}
 
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.h b/drivers/net/ethernet/intel/i40evf/i40e_txrx.h
index 8c5da4f89fd0..34096b1e4782 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.h
@@ -146,10 +146,39 @@ enum i40e_dyn_idx_t {
 
 #define I40E_MAX_BUFFER_TXD	8
 #define I40E_MIN_TX_LEN		17
-#define I40E_MAX_DATA_PER_TXD	8192
+
+/* The size limit for a transmit buffer in a descriptor is (16K - 1).
+ * In order to align with the read requests we will align the value to
+ * the nearest 4K which represents our maximum read request size.
+ */
+#define I40E_MAX_READ_REQ_SIZE		4096
+#define I40E_MAX_DATA_PER_TXD		(16 * 1024 - 1)
+#define I40E_MAX_DATA_PER_TXD_ALIGNED \
+	(I40E_MAX_DATA_PER_TXD & ~(I40E_MAX_READ_REQ_SIZE - 1))
+
+/* This ugly bit of math is equivalent to DIV_ROUND_UP(size, X) where X is
+ * the value I40E_MAX_DATA_PER_TXD_ALIGNED.  It is needed due to the fact
+ * that 12K is not a power of 2 and division is expensive.  It is used to
+ * approximate the number of descriptors used per linear buffer.  Note
+ * that this will overestimate in some cases as it doesn't account for the
+ * fact that we will add up to 4K - 1 in aligning the 12K buffer, however
+ * the error should not impact things much as large buffers usually mean
+ * we will use fewer descriptors than there are frags in an skb.
+ */
+static inline unsigned int i40e_txd_use_count(unsigned int size)
+{
+	const unsigned int max = I40E_MAX_DATA_PER_TXD_ALIGNED;
+	const unsigned int reciprocal = ((1ull << 32) - 1 + (max / 2)) / max;
+	unsigned int adjust = ~(u32)0;
+
+	/* if we rounded up on the reciprocal pull down the adjustment */
+	if ((max * reciprocal) > adjust)
+		adjust = ~(u32)(reciprocal - 1);
+
+	return (u32)((((u64)size * reciprocal) + adjust) >> 32);
+}
 
 /* Tx Descriptors needed, worst case */
-#define TXD_USE_COUNT(S) DIV_ROUND_UP((S), I40E_MAX_DATA_PER_TXD)
 #define DESC_NEEDED (MAX_SKB_FRAGS + 4)
 #define I40E_MIN_DESC_PENDING	4
 
@@ -359,7 +388,7 @@ static inline int i40e_xmit_descriptor_count(struct sk_buff *skb)
 	int count = 0, size = skb_headlen(skb);
 
 	for (;;) {
-		count += TXD_USE_COUNT(size);
+		count += i40e_txd_use_count(size);
 
 		if (!nr_frags--)
 			break;



* [Intel-wired-lan] [next PATCH 4/4] i40e/i40evf: Allow up to 12K bytes of data per Tx descriptor instead of 8K
  2016-02-17 19:03 ` [Intel-wired-lan] [next PATCH 4/4] i40e/i40evf: Allow up to 12K bytes of data per Tx descriptor instead of 8K Alexander Duyck
@ 2016-02-17 21:38   ` Jeff Kirsher
  2016-02-17 21:56     ` Alexander Duyck
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff Kirsher @ 2016-02-17 21:38 UTC (permalink / raw)
  To: intel-wired-lan

On Wed, 2016-02-17 at 11:03 -0800, Alexander Duyck wrote:
> From what I can tell the practical limitation on the size of the Tx
> data buffer is the fact that the size field in the Tx descriptor is
> limited to 14 bits.  As such we cannot use 16K as is typically used on
> the other Intel drivers.  However artificially limiting ourselves to
> 8K can be expensive as this means that we will consume up to 10
> descriptors (1 context, 1 for header, and 9 for a non-8K-aligned
> payload) in a single send.
> 
> I propose that we can reduce this by increasing the maximum data for a
> 4K aligned block to 12K.  We can reduce the descriptors used for a 32K
> aligned block by 1 by increasing the size like this.  In addition we
> still have 4K - 1 of unused space, which we can use as a bit of extra
> padding when dealing with data that is not aligned to 4K.
> 
> By aligning the descriptors after the first to 4K we can improve the
> efficiency of PCIe accesses, as we can avoid using byte enables and can
> fetch full TLP transactions after the first fetch of the buffer.  Below
> are the results of testing before and after with this patch:
> 
> Recv   Send   Send                         Utilization      Service Demand
> Socket Socket Message  Elapsed             Send     Recv    Send    Recv
> Size   Size   Size     Time    Throughput  local    remote  local   remote
> bytes  bytes  bytes    secs.   10^6bits/s  % S      % U     us/KB   us/KB
> Before:
> 87380  16384  16384    10.00     33682.24  20.27    -1.00   0.592   -1.00
> After:
> 87380  16384  16384    10.00     34204.08  20.54    -1.00   0.590   -1.00
> 
> So the net result of this patch is that we have a small gain in
> throughput due to a reduction in overhead for putting together the
> frame.
> 
> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   13 ++++++---
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h   |   35 +++++++++++++++++++++++--
>  drivers/net/ethernet/intel/i40evf/i40e_txrx.c |   13 ++++++---
>  drivers/net/ethernet/intel/i40evf/i40e_txrx.h |   35 +++++++++++++++++++++++--
>  4 files changed, 82 insertions(+), 14 deletions(-)

Getting a compile error after applying this patch to my tree, here is
what I am getting:

drivers/net/ethernet/intel/i40e/i40e_fcoe.c: In function
'i40e_fcoe_xmit_frame':
drivers/net/ethernet/intel/i40e/i40e_fcoe.c:1374:11: error: implicit
declaration of function 'TXD_USE_COUNT' [-Werror=implicit-function-
declaration]
   count = TXD_USE_COUNT(skb->len);
           ^
cc1: some warnings being treated as errors


* [Intel-wired-lan] [next PATCH 4/4] i40e/i40evf: Allow up to 12K bytes of data per Tx descriptor instead of 8K
  2016-02-17 21:38   ` Jeff Kirsher
@ 2016-02-17 21:56     ` Alexander Duyck
  0 siblings, 0 replies; 8+ messages in thread
From: Alexander Duyck @ 2016-02-17 21:56 UTC (permalink / raw)
  To: intel-wired-lan

On Wed, Feb 17, 2016 at 1:38 PM, Jeff Kirsher
<jeffrey.t.kirsher@intel.com> wrote:
> On Wed, 2016-02-17 at 11:03 -0800, Alexander Duyck wrote:
>> From what I can tell the practical limitation on the size of the Tx
>> data buffer is the fact that the size field in the Tx descriptor is
>> limited to 14 bits.  As such we cannot use 16K as is typically used on
>> the other Intel drivers.  However artificially limiting ourselves to
>> 8K can be expensive as this means that we will consume up to 10
>> descriptors (1 context, 1 for header, and 9 for a non-8K-aligned
>> payload) in a single send.
>>
>> I propose that we can reduce this by increasing the maximum data for a
>> 4K aligned block to 12K.  We can reduce the descriptors used for a 32K
>> aligned block by 1 by increasing the size like this.  In addition we
>> still have 4K - 1 of unused space, which we can use as a bit of extra
>> padding when dealing with data that is not aligned to 4K.
>>
>> By aligning the descriptors after the first to 4K we can improve the
>> efficiency of PCIe accesses, as we can avoid using byte enables and can
>> fetch full TLP transactions after the first fetch of the buffer.  Below
>> are the results of testing before and after with this patch:
>>
>> Recv   Send   Send                         Utilization      Service Demand
>> Socket Socket Message  Elapsed             Send     Recv    Send    Recv
>> Size   Size   Size     Time    Throughput  local    remote  local   remote
>> bytes  bytes  bytes    secs.   10^6bits/s  % S      % U     us/KB   us/KB
>> Before:
>> 87380  16384  16384    10.00     33682.24  20.27    -1.00   0.592   -1.00
>> After:
>> 87380  16384  16384    10.00     34204.08  20.54    -1.00   0.590   -1.00
>>
>> So the net result of this patch is that we have a small gain in
>> throughput due to a reduction in overhead for putting together the
>> frame.
>>
>> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
>> ---
>>  drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   13 ++++++---
>>  drivers/net/ethernet/intel/i40e/i40e_txrx.h   |   35 +++++++++++++++++++++++--
>>  drivers/net/ethernet/intel/i40evf/i40e_txrx.c |   13 ++++++---
>>  drivers/net/ethernet/intel/i40evf/i40e_txrx.h |   35 +++++++++++++++++++++++--
>>  4 files changed, 82 insertions(+), 14 deletions(-)
>
> Getting a compile error after applying this patch to my tree, here is
> what I am getting:
>
> drivers/net/ethernet/intel/i40e/i40e_fcoe.c: In function
> 'i40e_fcoe_xmit_frame':
> drivers/net/ethernet/intel/i40e/i40e_fcoe.c:1374:11: error: implicit
> declaration of function 'TXD_USE_COUNT' [-Werror=implicit-function-
> declaration]
>    count = TXD_USE_COUNT(skb->len);
>            ^
> cc1: some warnings being treated as errors

Sorry about that.  I don't think I was testing with FCoE enabled.
I'll fix it up and resubmit.

- Alex


* [Intel-wired-lan] [next PATCH 1/4] i40e/i40evf: Break up xmit_descriptor_count from maybe_stop_tx
  2016-02-17 19:02 ` [Intel-wired-lan] [next PATCH 1/4] i40e/i40evf: Break up xmit_descriptor_count from maybe_stop_tx Alexander Duyck
@ 2016-02-17 23:21   ` Bowers, AndrewX
  0 siblings, 0 replies; 8+ messages in thread
From: Bowers, AndrewX @ 2016-02-17 23:21 UTC (permalink / raw)
  To: intel-wired-lan

> -----Original Message-----
> From: Intel-wired-lan [mailto:intel-wired-lan-bounces at lists.osuosl.org] On
> Behalf Of Alexander Duyck
> Sent: Wednesday, February 17, 2016 11:03 AM
> To: intel-wired-lan at lists.osuosl.org
> Subject: [Intel-wired-lan] [next PATCH 1/4] i40e/i40evf: Break up
> xmit_descriptor_count from maybe_stop_tx
> 
> In an upcoming patch I would like to have access to the descriptor count
> used for the data portion of the frame.  For this reason I am splitting
> the descriptor count function off from the function that stops the ring.
> 
> Also, in order to reduce unnecessary duplication of code, I am moving the
> slow-path portions of the code out of line so that we can just jump to
> them and process them instead of having to build them into each function
> that calls them.
> 
> Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_fcoe.c   |   14 ++++-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   71 +++++--------------------
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h   |   44 +++++++++++++++
>  drivers/net/ethernet/intel/i40evf/i40e_txrx.c |   64 +++++------------------
>  drivers/net/ethernet/intel/i40evf/i40e_txrx.h |   42 +++++++++++++++
>  5 files changed, 123 insertions(+), 112 deletions(-)

Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Patch code changes correctly applied

