All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net-next 0/2] i40e: improve AF_XDP performance
@ 2020-06-22 13:21 ` Magnus Karlsson
  0 siblings, 0 replies; 10+ messages in thread
From: Magnus Karlsson @ 2020-06-22 13:21 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, intel-wired-lan, jeffrey.t.kirsher
  Cc: maciej.fijalkowski, maciejromanfijalkowski, netdev

This small series improves AF_XDP performance for the i40e NIC. The
first patch optimizes the Tx completion path for AF_XDP and the second
one removes a division in the data path for the normal SKB path, XDP
as well as AF_XDP. Overall, the throughput of the l2fwd application in
xpdsock improves with around 4% on my machine.

This patch has been applied against commit 29a720c1042f ("Merge branch
'Marvell-mvpp2-improvements'")

Thanks: Magnus

Magnus Karlsson (2):
  i40e: optimize AF_XDP Tx completion path
  i40e: eliminate division in napi_poll data path

 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 15 +++++++++----
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |  1 +
 drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 34 ++++++++++++++++-------------
 3 files changed, 31 insertions(+), 19 deletions(-)

--
2.7.4

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Intel-wired-lan] [PATCH net-next 0/2] i40e: improve AF_XDP performance
@ 2020-06-22 13:21 ` Magnus Karlsson
  0 siblings, 0 replies; 10+ messages in thread
From: Magnus Karlsson @ 2020-06-22 13:21 UTC (permalink / raw)
  To: intel-wired-lan

This small series improves AF_XDP performance for the i40e NIC. The
first patch optimizes the Tx completion path for AF_XDP and the second
one removes a division in the data path for the normal SKB path, XDP
as well as AF_XDP. Overall, the throughput of the l2fwd application in
xpdsock improves with around 4% on my machine.

This patch has been applied against commit 29a720c1042f ("Merge branch
'Marvell-mvpp2-improvements'")

Thanks: Magnus

Magnus Karlsson (2):
  i40e: optimize AF_XDP Tx completion path
  i40e: eliminate division in napi_poll data path

 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 15 +++++++++----
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |  1 +
 drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 34 ++++++++++++++++-------------
 3 files changed, 31 insertions(+), 19 deletions(-)

--
2.7.4

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH net-next 1/2] i40e: optimize AF_XDP Tx completion path
  2020-06-22 13:21 ` [Intel-wired-lan] " Magnus Karlsson
@ 2020-06-22 13:21   ` Magnus Karlsson
  -1 siblings, 0 replies; 10+ messages in thread
From: Magnus Karlsson @ 2020-06-22 13:21 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, intel-wired-lan, jeffrey.t.kirsher
  Cc: maciej.fijalkowski, maciejromanfijalkowski, netdev

Improve the performance of the AF_XDP zero-copy Tx completion
path. When there are no XDP buffers being sent using XDP_TX or
XDP_REDIRECT, we do not have go through the SW ring to clean up any
entries since the AF_XDP path does not use these. In these cases, just
fast forward the next-to-use counter and skip going through the SW
ring. The limit on the maximum number of entries to complete is also
removed since the algorithm is now O(1). To simplify the code path, the
maximum number of entries to complete for the XDP path is therefore
also increased from 256 to 512 (the default number of Tx HW
descriptors). This should be fine since the completion in the XDP path
is faster than in the SKB path that has 256 as the maximum number.

This patch provides around 4% throughput improvement for the l2fwd
application in xdpsock on my machine.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c |  1 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |  1 +
 drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 34 ++++++++++++++++-------------
 3 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index f9555c8..0ce9d1e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -3538,6 +3538,7 @@ static int i40e_xmit_xdp_ring(struct xdp_frame *xdpf,
 	 */
 	smp_wmb();
 
+	xdp_ring->xdp_tx_active++;
 	i++;
 	if (i == xdp_ring->count)
 		i = 0;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index 5c25597..c16fcd9 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -371,6 +371,7 @@ struct i40e_ring {
 	/* used in interrupt processing */
 	u16 next_to_use;
 	u16 next_to_clean;
+	u16 xdp_tx_active;
 
 	u8 atr_sample_rate;
 	u8 atr_count;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 7276580..c8cd6a6 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -378,6 +378,7 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
  **/
 static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
 {
+	unsigned int sent_frames = 0, total_bytes = 0;
 	struct i40e_tx_desc *tx_desc = NULL;
 	struct i40e_tx_buffer *tx_bi;
 	bool work_done = true;
@@ -408,6 +409,9 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
 				   | I40E_TX_DESC_CMD_EOP,
 				   0, desc.len, 0);
 
+		sent_frames++;
+		total_bytes += tx_bi->bytecount;
+
 		xdp_ring->next_to_use++;
 		if (xdp_ring->next_to_use == xdp_ring->count)
 			xdp_ring->next_to_use = 0;
@@ -420,6 +424,7 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
 		i40e_xdp_ring_update_tail(xdp_ring);
 
 		xsk_umem_consume_tx_done(xdp_ring->xsk_umem);
+		i40e_update_tx_stats(xdp_ring, sent_frames, total_bytes);
 	}
 
 	return !!budget && work_done;
@@ -434,6 +439,7 @@ static void i40e_clean_xdp_tx_buffer(struct i40e_ring *tx_ring,
 				     struct i40e_tx_buffer *tx_bi)
 {
 	xdp_return_frame(tx_bi->xdpf);
+	tx_ring->xdp_tx_active--;
 	dma_unmap_single(tx_ring->dev,
 			 dma_unmap_addr(tx_bi, dma),
 			 dma_unmap_len(tx_bi, len), DMA_TO_DEVICE);
@@ -450,24 +456,23 @@ static void i40e_clean_xdp_tx_buffer(struct i40e_ring *tx_ring,
 bool i40e_clean_xdp_tx_irq(struct i40e_vsi *vsi,
 			   struct i40e_ring *tx_ring, int napi_budget)
 {
-	unsigned int ntc, total_bytes = 0, budget = vsi->work_limit;
-	u32 i, completed_frames, frames_ready, xsk_frames = 0;
+	unsigned int ntc, budget = vsi->work_limit;
 	struct xdp_umem *umem = tx_ring->xsk_umem;
+	u32 i, completed_frames, xsk_frames = 0;
 	u32 head_idx = i40e_get_head(tx_ring);
 	bool work_done = true, xmit_done;
 	struct i40e_tx_buffer *tx_bi;
 
 	if (head_idx < tx_ring->next_to_clean)
 		head_idx += tx_ring->count;
-	frames_ready = head_idx - tx_ring->next_to_clean;
+	completed_frames = head_idx - tx_ring->next_to_clean;
 
-	if (frames_ready == 0) {
+	if (completed_frames == 0)
 		goto out_xmit;
-	} else if (frames_ready > budget) {
-		completed_frames = budget;
-		work_done = false;
-	} else {
-		completed_frames = frames_ready;
+
+	if (likely(!tx_ring->xdp_tx_active)) {
+		xsk_frames = completed_frames;
+		goto skip;
 	}
 
 	ntc = tx_ring->next_to_clean;
@@ -475,18 +480,18 @@ bool i40e_clean_xdp_tx_irq(struct i40e_vsi *vsi,
 	for (i = 0; i < completed_frames; i++) {
 		tx_bi = &tx_ring->tx_bi[ntc];
 
-		if (tx_bi->xdpf)
+		if (tx_bi->xdpf) {
 			i40e_clean_xdp_tx_buffer(tx_ring, tx_bi);
-		else
+			tx_bi->xdpf = NULL;
+		} else {
 			xsk_frames++;
-
-		tx_bi->xdpf = NULL;
-		total_bytes += tx_bi->bytecount;
+		}
 
 		if (++ntc >= tx_ring->count)
 			ntc = 0;
 	}
 
+skip:
 	tx_ring->next_to_clean += completed_frames;
 	if (unlikely(tx_ring->next_to_clean >= tx_ring->count))
 		tx_ring->next_to_clean -= tx_ring->count;
@@ -495,7 +500,6 @@ bool i40e_clean_xdp_tx_irq(struct i40e_vsi *vsi,
 		xsk_umem_complete_tx(umem, xsk_frames);
 
 	i40e_arm_wb(tx_ring, vsi, budget);
-	i40e_update_tx_stats(tx_ring, completed_frames, total_bytes);
 
 out_xmit:
 	if (xsk_umem_uses_need_wakeup(tx_ring->xsk_umem))
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [Intel-wired-lan] [PATCH net-next 1/2] i40e: optimize AF_XDP Tx completion path
@ 2020-06-22 13:21   ` Magnus Karlsson
  0 siblings, 0 replies; 10+ messages in thread
From: Magnus Karlsson @ 2020-06-22 13:21 UTC (permalink / raw)
  To: intel-wired-lan

Improve the performance of the AF_XDP zero-copy Tx completion
path. When there are no XDP buffers being sent using XDP_TX or
XDP_REDIRECT, we do not have go through the SW ring to clean up any
entries since the AF_XDP path does not use these. In these cases, just
fast forward the next-to-use counter and skip going through the SW
ring. The limit on the maximum number of entries to complete is also
removed since the algorithm is now O(1). To simplify the code path, the
maximum number of entries to complete for the XDP path is therefore
also increased from 256 to 512 (the default number of Tx HW
descriptors). This should be fine since the completion in the XDP path
is faster than in the SKB path that has 256 as the maximum number.

This patch provides around 4% throughput improvement for the l2fwd
application in xdpsock on my machine.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c |  1 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |  1 +
 drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 34 ++++++++++++++++-------------
 3 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index f9555c8..0ce9d1e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -3538,6 +3538,7 @@ static int i40e_xmit_xdp_ring(struct xdp_frame *xdpf,
 	 */
 	smp_wmb();
 
+	xdp_ring->xdp_tx_active++;
 	i++;
 	if (i == xdp_ring->count)
 		i = 0;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index 5c25597..c16fcd9 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -371,6 +371,7 @@ struct i40e_ring {
 	/* used in interrupt processing */
 	u16 next_to_use;
 	u16 next_to_clean;
+	u16 xdp_tx_active;
 
 	u8 atr_sample_rate;
 	u8 atr_count;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 7276580..c8cd6a6 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -378,6 +378,7 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
  **/
 static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
 {
+	unsigned int sent_frames = 0, total_bytes = 0;
 	struct i40e_tx_desc *tx_desc = NULL;
 	struct i40e_tx_buffer *tx_bi;
 	bool work_done = true;
@@ -408,6 +409,9 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
 				   | I40E_TX_DESC_CMD_EOP,
 				   0, desc.len, 0);
 
+		sent_frames++;
+		total_bytes += tx_bi->bytecount;
+
 		xdp_ring->next_to_use++;
 		if (xdp_ring->next_to_use == xdp_ring->count)
 			xdp_ring->next_to_use = 0;
@@ -420,6 +424,7 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
 		i40e_xdp_ring_update_tail(xdp_ring);
 
 		xsk_umem_consume_tx_done(xdp_ring->xsk_umem);
+		i40e_update_tx_stats(xdp_ring, sent_frames, total_bytes);
 	}
 
 	return !!budget && work_done;
@@ -434,6 +439,7 @@ static void i40e_clean_xdp_tx_buffer(struct i40e_ring *tx_ring,
 				     struct i40e_tx_buffer *tx_bi)
 {
 	xdp_return_frame(tx_bi->xdpf);
+	tx_ring->xdp_tx_active--;
 	dma_unmap_single(tx_ring->dev,
 			 dma_unmap_addr(tx_bi, dma),
 			 dma_unmap_len(tx_bi, len), DMA_TO_DEVICE);
@@ -450,24 +456,23 @@ static void i40e_clean_xdp_tx_buffer(struct i40e_ring *tx_ring,
 bool i40e_clean_xdp_tx_irq(struct i40e_vsi *vsi,
 			   struct i40e_ring *tx_ring, int napi_budget)
 {
-	unsigned int ntc, total_bytes = 0, budget = vsi->work_limit;
-	u32 i, completed_frames, frames_ready, xsk_frames = 0;
+	unsigned int ntc, budget = vsi->work_limit;
 	struct xdp_umem *umem = tx_ring->xsk_umem;
+	u32 i, completed_frames, xsk_frames = 0;
 	u32 head_idx = i40e_get_head(tx_ring);
 	bool work_done = true, xmit_done;
 	struct i40e_tx_buffer *tx_bi;
 
 	if (head_idx < tx_ring->next_to_clean)
 		head_idx += tx_ring->count;
-	frames_ready = head_idx - tx_ring->next_to_clean;
+	completed_frames = head_idx - tx_ring->next_to_clean;
 
-	if (frames_ready == 0) {
+	if (completed_frames == 0)
 		goto out_xmit;
-	} else if (frames_ready > budget) {
-		completed_frames = budget;
-		work_done = false;
-	} else {
-		completed_frames = frames_ready;
+
+	if (likely(!tx_ring->xdp_tx_active)) {
+		xsk_frames = completed_frames;
+		goto skip;
 	}
 
 	ntc = tx_ring->next_to_clean;
@@ -475,18 +480,18 @@ bool i40e_clean_xdp_tx_irq(struct i40e_vsi *vsi,
 	for (i = 0; i < completed_frames; i++) {
 		tx_bi = &tx_ring->tx_bi[ntc];
 
-		if (tx_bi->xdpf)
+		if (tx_bi->xdpf) {
 			i40e_clean_xdp_tx_buffer(tx_ring, tx_bi);
-		else
+			tx_bi->xdpf = NULL;
+		} else {
 			xsk_frames++;
-
-		tx_bi->xdpf = NULL;
-		total_bytes += tx_bi->bytecount;
+		}
 
 		if (++ntc >= tx_ring->count)
 			ntc = 0;
 	}
 
+skip:
 	tx_ring->next_to_clean += completed_frames;
 	if (unlikely(tx_ring->next_to_clean >= tx_ring->count))
 		tx_ring->next_to_clean -= tx_ring->count;
@@ -495,7 +500,6 @@ bool i40e_clean_xdp_tx_irq(struct i40e_vsi *vsi,
 		xsk_umem_complete_tx(umem, xsk_frames);
 
 	i40e_arm_wb(tx_ring, vsi, budget);
-	i40e_update_tx_stats(tx_ring, completed_frames, total_bytes);
 
 out_xmit:
 	if (xsk_umem_uses_need_wakeup(tx_ring->xsk_umem))
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH net-next 2/2] i40e: eliminate division in napi_poll data path
  2020-06-22 13:21 ` [Intel-wired-lan] " Magnus Karlsson
@ 2020-06-22 13:21   ` Magnus Karlsson
  -1 siblings, 0 replies; 10+ messages in thread
From: Magnus Karlsson @ 2020-06-22 13:21 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, intel-wired-lan, jeffrey.t.kirsher
  Cc: maciej.fijalkowski, maciejromanfijalkowski, netdev

Eliminate a division in the napi_poll data path. This division is
executed even though it is only needed in the rare case when there are
not enough interrupt lines so they have to be shared between queue
pairs. Instead, just test for this case and only execute the division
if needed. The code has been lifted from the ice driver.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 0ce9d1e..da73a7d 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2595,10 +2595,16 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
 	if (budget <= 0)
 		goto tx_only;
 
-	/* We attempt to distribute budget to each Rx queue fairly, but don't
-	 * allow the budget to go below 1 because that would exit polling early.
-	 */
-	budget_per_ring = max(budget/q_vector->num_ringpairs, 1);
+	/* normally we have 1 Rx ring per q_vector */
+	if (unlikely(q_vector->num_ringpairs > 1))
+		/* We attempt to distribute budget to each Rx queue fairly, but
+		 * don't allow the budget to go below 1 because that would exit
+		 * polling early.
+		 */
+		budget_per_ring = max_t(int, budget / q_vector->num_ringpairs, 1);
+	else
+		/* Max of 1 Rx ring in this q_vector so give it the budget */
+		budget_per_ring = budget;
 
 	i40e_for_each_ring(ring, q_vector->rx) {
 		int cleaned = ring->xsk_umem ?
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [Intel-wired-lan] [PATCH net-next 2/2] i40e: eliminate division in napi_poll data path
@ 2020-06-22 13:21   ` Magnus Karlsson
  0 siblings, 0 replies; 10+ messages in thread
From: Magnus Karlsson @ 2020-06-22 13:21 UTC (permalink / raw)
  To: intel-wired-lan

Eliminate a division in the napi_poll data path. This division is
executed even though it is only needed in the rare case when there are
not enough interrupt lines so they have to be shared between queue
pairs. Instead, just test for this case and only execute the division
if needed. The code has been lifted from the ice driver.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 0ce9d1e..da73a7d 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2595,10 +2595,16 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
 	if (budget <= 0)
 		goto tx_only;
 
-	/* We attempt to distribute budget to each Rx queue fairly, but don't
-	 * allow the budget to go below 1 because that would exit polling early.
-	 */
-	budget_per_ring = max(budget/q_vector->num_ringpairs, 1);
+	/* normally we have 1 Rx ring per q_vector */
+	if (unlikely(q_vector->num_ringpairs > 1))
+		/* We attempt to distribute budget to each Rx queue fairly, but
+		 * don't allow the budget to go below 1 because that would exit
+		 * polling early.
+		 */
+		budget_per_ring = max_t(int, budget / q_vector->num_ringpairs, 1);
+	else
+		/* Max of 1 Rx ring in this q_vector so give it the budget */
+		budget_per_ring = budget;
 
 	i40e_for_each_ring(ring, q_vector->rx) {
 		int cleaned = ring->xsk_umem ?
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [Intel-wired-lan] [PATCH net-next 1/2] i40e: optimize AF_XDP Tx completion path
  2020-06-22 13:21   ` [Intel-wired-lan] " Magnus Karlsson
@ 2020-06-23  1:00     ` Samudrala, Sridhar
  -1 siblings, 0 replies; 10+ messages in thread
From: Samudrala, Sridhar @ 2020-06-23  1:00 UTC (permalink / raw)
  To: Magnus Karlsson, bjorn.topel, intel-wired-lan, jeffrey.t.kirsher
  Cc: maciejromanfijalkowski, maciej.fijalkowski, netdev



On 6/22/2020 6:21 AM, Magnus Karlsson wrote:
> Improve the performance of the AF_XDP zero-copy Tx completion
> path. When there are no XDP buffers being sent using XDP_TX or
> XDP_REDIRECT, we do not have go through the SW ring to clean up any
> entries since the AF_XDP path does not use these. In these cases, just
> fast forward the next-to-use counter and skip going through the SW
> ring. The limit on the maximum number of entries to complete is also
> removed since the algorithm is now O(1). To simplify the code path, the
> maximum number of entries to complete for the XDP path is therefore
> also increased from 256 to 512 (the default number of Tx HW
> descriptors). This should be fine since the completion in the XDP path
> is faster than in the SKB path that has 256 as the maximum number.
> 
> This patch provides around 4% throughput improvement for the l2fwd
> application in xdpsock on my machine.
> 
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---
>   drivers/net/ethernet/intel/i40e/i40e_txrx.c |  1 +
>   drivers/net/ethernet/intel/i40e/i40e_txrx.h |  1 +
>   drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 34 ++++++++++++++++-------------
>   3 files changed, 21 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index f9555c8..0ce9d1e 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -3538,6 +3538,7 @@ static int i40e_xmit_xdp_ring(struct xdp_frame *xdpf,
>   	 */
>   	smp_wmb();
>   
> +	xdp_ring->xdp_tx_active++;
>   	i++;
>   	if (i == xdp_ring->count)
>   		i = 0;
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> index 5c25597..c16fcd9 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> @@ -371,6 +371,7 @@ struct i40e_ring {
>   	/* used in interrupt processing */
>   	u16 next_to_use;
>   	u16 next_to_clean;
> +	u16 xdp_tx_active;
>   
>   	u8 atr_sample_rate;
>   	u8 atr_count;
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> index 7276580..c8cd6a6 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> @@ -378,6 +378,7 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
>    **/
>   static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
>   {
> +	unsigned int sent_frames = 0, total_bytes = 0;
>   	struct i40e_tx_desc *tx_desc = NULL;
>   	struct i40e_tx_buffer *tx_bi;
>   	bool work_done = true;
> @@ -408,6 +409,9 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
>   				   | I40E_TX_DESC_CMD_EOP,
>   				   0, desc.len, 0);
>   
> +		sent_frames++;
> +		total_bytes += tx_bi->bytecount;
> +
>   		xdp_ring->next_to_use++;
>   		if (xdp_ring->next_to_use == xdp_ring->count)
>   			xdp_ring->next_to_use = 0;
> @@ -420,6 +424,7 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
>   		i40e_xdp_ring_update_tail(xdp_ring);
>   
>   		xsk_umem_consume_tx_done(xdp_ring->xsk_umem);
> +		i40e_update_tx_stats(xdp_ring, sent_frames, total_bytes);
>   	}
>   
>   	return !!budget && work_done;
> @@ -434,6 +439,7 @@ static void i40e_clean_xdp_tx_buffer(struct i40e_ring *tx_ring,
>   				     struct i40e_tx_buffer *tx_bi)
>   {
>   	xdp_return_frame(tx_bi->xdpf);
> +	tx_ring->xdp_tx_active--;
>   	dma_unmap_single(tx_ring->dev,
>   			 dma_unmap_addr(tx_bi, dma),
>   			 dma_unmap_len(tx_bi, len), DMA_TO_DEVICE);
> @@ -450,24 +456,23 @@ static void i40e_clean_xdp_tx_buffer(struct i40e_ring *tx_ring,
>   bool i40e_clean_xdp_tx_irq(struct i40e_vsi *vsi,
>   			   struct i40e_ring *tx_ring, int napi_budget)

napi_budget is not used. so the 3rd arg can be removed

>   {
> -	unsigned int ntc, total_bytes = 0, budget = vsi->work_limit;
> -	u32 i, completed_frames, frames_ready, xsk_frames = 0;
> +	unsigned int ntc, budget = vsi->work_limit;
>   	struct xdp_umem *umem = tx_ring->xsk_umem;
> +	u32 i, completed_frames, xsk_frames = 0;
>   	u32 head_idx = i40e_get_head(tx_ring);
>   	bool work_done = true, xmit_done;

work_done is no longer required

>   	struct i40e_tx_buffer *tx_bi;
>   
>   	if (head_idx < tx_ring->next_to_clean)
>   		head_idx += tx_ring->count;
> -	frames_ready = head_idx - tx_ring->next_to_clean;
> +	completed_frames = head_idx - tx_ring->next_to_clean;
>   
> -	if (frames_ready == 0) {
> +	if (completed_frames == 0)
>   		goto out_xmit;
> -	} else if (frames_ready > budget) {
> -		completed_frames = budget;
> -		work_done = false;
> -	} else {
> -		completed_frames = frames_ready;
> +
> +	if (likely(!tx_ring->xdp_tx_active)) {
> +		xsk_frames = completed_frames;
> +		goto skip;
>   	}
>   
>   	ntc = tx_ring->next_to_clean;
> @@ -475,18 +480,18 @@ bool i40e_clean_xdp_tx_irq(struct i40e_vsi *vsi,
>   	for (i = 0; i < completed_frames; i++) {
>   		tx_bi = &tx_ring->tx_bi[ntc];
>   
> -		if (tx_bi->xdpf)
> +		if (tx_bi->xdpf) {
>   			i40e_clean_xdp_tx_buffer(tx_ring, tx_bi);
> -		else
> +			tx_bi->xdpf = NULL;
> +		} else {
>   			xsk_frames++;
> -
> -		tx_bi->xdpf = NULL;
> -		total_bytes += tx_bi->bytecount;
> +		}
>   
>   		if (++ntc >= tx_ring->count)
>   			ntc = 0;
>   	}
>   
> +skip:
>   	tx_ring->next_to_clean += completed_frames;
>   	if (unlikely(tx_ring->next_to_clean >= tx_ring->count))
>   		tx_ring->next_to_clean -= tx_ring->count;
> @@ -495,7 +500,6 @@ bool i40e_clean_xdp_tx_irq(struct i40e_vsi *vsi,
>   		xsk_umem_complete_tx(umem, xsk_frames);
>   
>   	i40e_arm_wb(tx_ring, vsi, budget);

I guess budget here should be replaced with completed_frames

> -	i40e_update_tx_stats(tx_ring, completed_frames, total_bytes);
>   
>   out_xmit:
>   	if (xsk_umem_uses_need_wakeup(tx_ring->xsk_umem))


> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Intel-wired-lan] [PATCH net-next 1/2] i40e: optimize AF_XDP Tx completion path
@ 2020-06-23  1:00     ` Samudrala, Sridhar
  0 siblings, 0 replies; 10+ messages in thread
From: Samudrala, Sridhar @ 2020-06-23  1:00 UTC (permalink / raw)
  To: intel-wired-lan



On 6/22/2020 6:21 AM, Magnus Karlsson wrote:
> Improve the performance of the AF_XDP zero-copy Tx completion
> path. When there are no XDP buffers being sent using XDP_TX or
> XDP_REDIRECT, we do not have go through the SW ring to clean up any
> entries since the AF_XDP path does not use these. In these cases, just
> fast forward the next-to-use counter and skip going through the SW
> ring. The limit on the maximum number of entries to complete is also
> removed since the algorithm is now O(1). To simplify the code path, the
> maximum number of entries to complete for the XDP path is therefore
> also increased from 256 to 512 (the default number of Tx HW
> descriptors). This should be fine since the completion in the XDP path
> is faster than in the SKB path that has 256 as the maximum number.
> 
> This patch provides around 4% throughput improvement for the l2fwd
> application in xdpsock on my machine.
> 
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---
>   drivers/net/ethernet/intel/i40e/i40e_txrx.c |  1 +
>   drivers/net/ethernet/intel/i40e/i40e_txrx.h |  1 +
>   drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 34 ++++++++++++++++-------------
>   3 files changed, 21 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index f9555c8..0ce9d1e 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -3538,6 +3538,7 @@ static int i40e_xmit_xdp_ring(struct xdp_frame *xdpf,
>   	 */
>   	smp_wmb();
>   
> +	xdp_ring->xdp_tx_active++;
>   	i++;
>   	if (i == xdp_ring->count)
>   		i = 0;
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> index 5c25597..c16fcd9 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> @@ -371,6 +371,7 @@ struct i40e_ring {
>   	/* used in interrupt processing */
>   	u16 next_to_use;
>   	u16 next_to_clean;
> +	u16 xdp_tx_active;
>   
>   	u8 atr_sample_rate;
>   	u8 atr_count;
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> index 7276580..c8cd6a6 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> @@ -378,6 +378,7 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
>    **/
>   static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
>   {
> +	unsigned int sent_frames = 0, total_bytes = 0;
>   	struct i40e_tx_desc *tx_desc = NULL;
>   	struct i40e_tx_buffer *tx_bi;
>   	bool work_done = true;
> @@ -408,6 +409,9 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
>   				   | I40E_TX_DESC_CMD_EOP,
>   				   0, desc.len, 0);
>   
> +		sent_frames++;
> +		total_bytes += tx_bi->bytecount;
> +
>   		xdp_ring->next_to_use++;
>   		if (xdp_ring->next_to_use == xdp_ring->count)
>   			xdp_ring->next_to_use = 0;
> @@ -420,6 +424,7 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
>   		i40e_xdp_ring_update_tail(xdp_ring);
>   
>   		xsk_umem_consume_tx_done(xdp_ring->xsk_umem);
> +		i40e_update_tx_stats(xdp_ring, sent_frames, total_bytes);
>   	}
>   
>   	return !!budget && work_done;
> @@ -434,6 +439,7 @@ static void i40e_clean_xdp_tx_buffer(struct i40e_ring *tx_ring,
>   				     struct i40e_tx_buffer *tx_bi)
>   {
>   	xdp_return_frame(tx_bi->xdpf);
> +	tx_ring->xdp_tx_active--;
>   	dma_unmap_single(tx_ring->dev,
>   			 dma_unmap_addr(tx_bi, dma),
>   			 dma_unmap_len(tx_bi, len), DMA_TO_DEVICE);
> @@ -450,24 +456,23 @@ static void i40e_clean_xdp_tx_buffer(struct i40e_ring *tx_ring,
>   bool i40e_clean_xdp_tx_irq(struct i40e_vsi *vsi,
>   			   struct i40e_ring *tx_ring, int napi_budget)

napi_budget is not used. so the 3rd arg can be removed

>   {
> -	unsigned int ntc, total_bytes = 0, budget = vsi->work_limit;
> -	u32 i, completed_frames, frames_ready, xsk_frames = 0;
> +	unsigned int ntc, budget = vsi->work_limit;
>   	struct xdp_umem *umem = tx_ring->xsk_umem;
> +	u32 i, completed_frames, xsk_frames = 0;
>   	u32 head_idx = i40e_get_head(tx_ring);
>   	bool work_done = true, xmit_done;

work_done is no longer required

>   	struct i40e_tx_buffer *tx_bi;
>   
>   	if (head_idx < tx_ring->next_to_clean)
>   		head_idx += tx_ring->count;
> -	frames_ready = head_idx - tx_ring->next_to_clean;
> +	completed_frames = head_idx - tx_ring->next_to_clean;
>   
> -	if (frames_ready == 0) {
> +	if (completed_frames == 0)
>   		goto out_xmit;
> -	} else if (frames_ready > budget) {
> -		completed_frames = budget;
> -		work_done = false;
> -	} else {
> -		completed_frames = frames_ready;
> +
> +	if (likely(!tx_ring->xdp_tx_active)) {
> +		xsk_frames = completed_frames;
> +		goto skip;
>   	}
>   
>   	ntc = tx_ring->next_to_clean;
> @@ -475,18 +480,18 @@ bool i40e_clean_xdp_tx_irq(struct i40e_vsi *vsi,
>   	for (i = 0; i < completed_frames; i++) {
>   		tx_bi = &tx_ring->tx_bi[ntc];
>   
> -		if (tx_bi->xdpf)
> +		if (tx_bi->xdpf) {
>   			i40e_clean_xdp_tx_buffer(tx_ring, tx_bi);
> -		else
> +			tx_bi->xdpf = NULL;
> +		} else {
>   			xsk_frames++;
> -
> -		tx_bi->xdpf = NULL;
> -		total_bytes += tx_bi->bytecount;
> +		}
>   
>   		if (++ntc >= tx_ring->count)
>   			ntc = 0;
>   	}
>   
> +skip:
>   	tx_ring->next_to_clean += completed_frames;
>   	if (unlikely(tx_ring->next_to_clean >= tx_ring->count))
>   		tx_ring->next_to_clean -= tx_ring->count;
> @@ -495,7 +500,6 @@ bool i40e_clean_xdp_tx_irq(struct i40e_vsi *vsi,
>   		xsk_umem_complete_tx(umem, xsk_frames);
>   
>   	i40e_arm_wb(tx_ring, vsi, budget);

I guess budget here should be replaced with completed_frames

> -	i40e_update_tx_stats(tx_ring, completed_frames, total_bytes);
>   
>   out_xmit:
>   	if (xsk_umem_uses_need_wakeup(tx_ring->xsk_umem))


> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Intel-wired-lan] [PATCH net-next 1/2] i40e: optimize AF_XDP Tx completion path
  2020-06-23  1:00     ` Samudrala, Sridhar
@ 2020-06-23  5:02       ` Magnus Karlsson
  -1 siblings, 0 replies; 10+ messages in thread
From: Magnus Karlsson @ 2020-06-23  5:02 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Magnus Karlsson, Björn Töpel, intel-wired-lan,
	jeffrey.t.kirsher, Maciej Fijalkowski, Fijalkowski, Maciej,
	Network Development

On Tue, Jun 23, 2020 at 3:00 AM Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
>
>
>
> On 6/22/2020 6:21 AM, Magnus Karlsson wrote:
> > Improve the performance of the AF_XDP zero-copy Tx completion
> > path. When there are no XDP buffers being sent using XDP_TX or
> > XDP_REDIRECT, we do not have go through the SW ring to clean up any
> > entries since the AF_XDP path does not use these. In these cases, just
> > fast forward the next-to-use counter and skip going through the SW
> > ring. The limit on the maximum number of entries to complete is also
> > removed since the algorithm is now O(1). To simplify the code path, the
> > maximum number of entries to complete for the XDP path is therefore
> > also increased from 256 to 512 (the default number of Tx HW
> > descriptors). This should be fine since the completion in the XDP path
> > is faster than in the SKB path that has 256 as the maximum number.
> >
> > This patch provides around 4% throughput improvement for the l2fwd
> > application in xdpsock on my machine.
> >
> > Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> > ---
> >   drivers/net/ethernet/intel/i40e/i40e_txrx.c |  1 +
> >   drivers/net/ethernet/intel/i40e/i40e_txrx.h |  1 +
> >   drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 34 ++++++++++++++++-------------
> >   3 files changed, 21 insertions(+), 15 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > index f9555c8..0ce9d1e 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > @@ -3538,6 +3538,7 @@ static int i40e_xmit_xdp_ring(struct xdp_frame *xdpf,
> >        */
> >       smp_wmb();
> >
> > +     xdp_ring->xdp_tx_active++;
> >       i++;
> >       if (i == xdp_ring->count)
> >               i = 0;
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> > index 5c25597..c16fcd9 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> > @@ -371,6 +371,7 @@ struct i40e_ring {
> >       /* used in interrupt processing */
> >       u16 next_to_use;
> >       u16 next_to_clean;
> > +     u16 xdp_tx_active;
> >
> >       u8 atr_sample_rate;
> >       u8 atr_count;
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > index 7276580..c8cd6a6 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > @@ -378,6 +378,7 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
> >    **/
> >   static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
> >   {
> > +     unsigned int sent_frames = 0, total_bytes = 0;
> >       struct i40e_tx_desc *tx_desc = NULL;
> >       struct i40e_tx_buffer *tx_bi;
> >       bool work_done = true;
> > @@ -408,6 +409,9 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
> >                                  | I40E_TX_DESC_CMD_EOP,
> >                                  0, desc.len, 0);
> >
> > +             sent_frames++;
> > +             total_bytes += tx_bi->bytecount;
> > +
> >               xdp_ring->next_to_use++;
> >               if (xdp_ring->next_to_use == xdp_ring->count)
> >                       xdp_ring->next_to_use = 0;
> > @@ -420,6 +424,7 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
> >               i40e_xdp_ring_update_tail(xdp_ring);
> >
> >               xsk_umem_consume_tx_done(xdp_ring->xsk_umem);
> > +             i40e_update_tx_stats(xdp_ring, sent_frames, total_bytes);
> >       }
> >
> >       return !!budget && work_done;
> > @@ -434,6 +439,7 @@ static void i40e_clean_xdp_tx_buffer(struct i40e_ring *tx_ring,
> >                                    struct i40e_tx_buffer *tx_bi)
> >   {
> >       xdp_return_frame(tx_bi->xdpf);
> > +     tx_ring->xdp_tx_active--;
> >       dma_unmap_single(tx_ring->dev,
> >                        dma_unmap_addr(tx_bi, dma),
> >                        dma_unmap_len(tx_bi, len), DMA_TO_DEVICE);
> > @@ -450,24 +456,23 @@ static void i40e_clean_xdp_tx_buffer(struct i40e_ring *tx_ring,
> >   bool i40e_clean_xdp_tx_irq(struct i40e_vsi *vsi,
> >                          struct i40e_ring *tx_ring, int napi_budget)
>
> napi_budget is not used. so the 3rd arg can be removed

Thanks Sridhar. You are correct on all accounts in this mail. Will fix
and send a v2.

/Magnus

> >   {
> > -     unsigned int ntc, total_bytes = 0, budget = vsi->work_limit;
> > -     u32 i, completed_frames, frames_ready, xsk_frames = 0;
> > +     unsigned int ntc, budget = vsi->work_limit;
> >       struct xdp_umem *umem = tx_ring->xsk_umem;
> > +     u32 i, completed_frames, xsk_frames = 0;
> >       u32 head_idx = i40e_get_head(tx_ring);
> >       bool work_done = true, xmit_done;
>
> work_done is no longer required
>
> >       struct i40e_tx_buffer *tx_bi;
> >
> >       if (head_idx < tx_ring->next_to_clean)
> >               head_idx += tx_ring->count;
> > -     frames_ready = head_idx - tx_ring->next_to_clean;
> > +     completed_frames = head_idx - tx_ring->next_to_clean;
> >
> > -     if (frames_ready == 0) {
> > +     if (completed_frames == 0)
> >               goto out_xmit;
> > -     } else if (frames_ready > budget) {
> > -             completed_frames = budget;
> > -             work_done = false;
> > -     } else {
> > -             completed_frames = frames_ready;
> > +
> > +     if (likely(!tx_ring->xdp_tx_active)) {
> > +             xsk_frames = completed_frames;
> > +             goto skip;
> >       }
> >
> >       ntc = tx_ring->next_to_clean;
> > @@ -475,18 +480,18 @@ bool i40e_clean_xdp_tx_irq(struct i40e_vsi *vsi,
> >       for (i = 0; i < completed_frames; i++) {
> >               tx_bi = &tx_ring->tx_bi[ntc];
> >
> > -             if (tx_bi->xdpf)
> > +             if (tx_bi->xdpf) {
> >                       i40e_clean_xdp_tx_buffer(tx_ring, tx_bi);
> > -             else
> > +                     tx_bi->xdpf = NULL;
> > +             } else {
> >                       xsk_frames++;
> > -
> > -             tx_bi->xdpf = NULL;
> > -             total_bytes += tx_bi->bytecount;
> > +             }
> >
> >               if (++ntc >= tx_ring->count)
> >                       ntc = 0;
> >       }
> >
> > +skip:
> >       tx_ring->next_to_clean += completed_frames;
> >       if (unlikely(tx_ring->next_to_clean >= tx_ring->count))
> >               tx_ring->next_to_clean -= tx_ring->count;
> > @@ -495,7 +500,6 @@ bool i40e_clean_xdp_tx_irq(struct i40e_vsi *vsi,
> >               xsk_umem_complete_tx(umem, xsk_frames);
> >
> >       i40e_arm_wb(tx_ring, vsi, budget);
>
> I guess budget here should be replaced with completed_frames
>
> > -     i40e_update_tx_stats(tx_ring, completed_frames, total_bytes);
> >
> >   out_xmit:
> >       if (xsk_umem_uses_need_wakeup(tx_ring->xsk_umem))
>
>
> >
> _______________________________________________
> Intel-wired-lan mailing list
> Intel-wired-lan@osuosl.org
> https://lists.osuosl.org/mailman/listinfo/intel-wired-lan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Intel-wired-lan] [PATCH net-next 1/2] i40e: optimize AF_XDP Tx completion path
@ 2020-06-23  5:02       ` Magnus Karlsson
  0 siblings, 0 replies; 10+ messages in thread
From: Magnus Karlsson @ 2020-06-23  5:02 UTC (permalink / raw)
  To: intel-wired-lan

On Tue, Jun 23, 2020 at 3:00 AM Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
>
>
>
> On 6/22/2020 6:21 AM, Magnus Karlsson wrote:
> > Improve the performance of the AF_XDP zero-copy Tx completion
> > path. When there are no XDP buffers being sent using XDP_TX or
> > XDP_REDIRECT, we do not have go through the SW ring to clean up any
> > entries since the AF_XDP path does not use these. In these cases, just
> > fast forward the next-to-use counter and skip going through the SW
> > ring. The limit on the maximum number of entries to complete is also
> > removed since the algorithm is now O(1). To simplify the code path, the
> > maximum number of entries to complete for the XDP path is therefore
> > also increased from 256 to 512 (the default number of Tx HW
> > descriptors). This should be fine since the completion in the XDP path
> > is faster than in the SKB path that has 256 as the maximum number.
> >
> > This patch provides around 4% throughput improvement for the l2fwd
> > application in xdpsock on my machine.
> >
> > Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> > ---
> >   drivers/net/ethernet/intel/i40e/i40e_txrx.c |  1 +
> >   drivers/net/ethernet/intel/i40e/i40e_txrx.h |  1 +
> >   drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 34 ++++++++++++++++-------------
> >   3 files changed, 21 insertions(+), 15 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > index f9555c8..0ce9d1e 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > @@ -3538,6 +3538,7 @@ static int i40e_xmit_xdp_ring(struct xdp_frame *xdpf,
> >        */
> >       smp_wmb();
> >
> > +     xdp_ring->xdp_tx_active++;
> >       i++;
> >       if (i == xdp_ring->count)
> >               i = 0;
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> > index 5c25597..c16fcd9 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> > @@ -371,6 +371,7 @@ struct i40e_ring {
> >       /* used in interrupt processing */
> >       u16 next_to_use;
> >       u16 next_to_clean;
> > +     u16 xdp_tx_active;
> >
> >       u8 atr_sample_rate;
> >       u8 atr_count;
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > index 7276580..c8cd6a6 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > @@ -378,6 +378,7 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
> >    **/
> >   static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
> >   {
> > +     unsigned int sent_frames = 0, total_bytes = 0;
> >       struct i40e_tx_desc *tx_desc = NULL;
> >       struct i40e_tx_buffer *tx_bi;
> >       bool work_done = true;
> > @@ -408,6 +409,9 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
> >                                  | I40E_TX_DESC_CMD_EOP,
> >                                  0, desc.len, 0);
> >
> > +             sent_frames++;
> > +             total_bytes += tx_bi->bytecount;
> > +
> >               xdp_ring->next_to_use++;
> >               if (xdp_ring->next_to_use == xdp_ring->count)
> >                       xdp_ring->next_to_use = 0;
> > @@ -420,6 +424,7 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
> >               i40e_xdp_ring_update_tail(xdp_ring);
> >
> >               xsk_umem_consume_tx_done(xdp_ring->xsk_umem);
> > +             i40e_update_tx_stats(xdp_ring, sent_frames, total_bytes);
> >       }
> >
> >       return !!budget && work_done;
> > @@ -434,6 +439,7 @@ static void i40e_clean_xdp_tx_buffer(struct i40e_ring *tx_ring,
> >                                    struct i40e_tx_buffer *tx_bi)
> >   {
> >       xdp_return_frame(tx_bi->xdpf);
> > +     tx_ring->xdp_tx_active--;
> >       dma_unmap_single(tx_ring->dev,
> >                        dma_unmap_addr(tx_bi, dma),
> >                        dma_unmap_len(tx_bi, len), DMA_TO_DEVICE);
> > @@ -450,24 +456,23 @@ static void i40e_clean_xdp_tx_buffer(struct i40e_ring *tx_ring,
> >   bool i40e_clean_xdp_tx_irq(struct i40e_vsi *vsi,
> >                          struct i40e_ring *tx_ring, int napi_budget)
>
> napi_budget is not used. so the 3rd arg can be removed

Thanks Sridhar. You are correct on all accounts in this mail. Will fix
and send a v2.

/Magnus

> >   {
> > -     unsigned int ntc, total_bytes = 0, budget = vsi->work_limit;
> > -     u32 i, completed_frames, frames_ready, xsk_frames = 0;
> > +     unsigned int ntc, budget = vsi->work_limit;
> >       struct xdp_umem *umem = tx_ring->xsk_umem;
> > +     u32 i, completed_frames, xsk_frames = 0;
> >       u32 head_idx = i40e_get_head(tx_ring);
> >       bool work_done = true, xmit_done;
>
> work_done is no longer required
>
> >       struct i40e_tx_buffer *tx_bi;
> >
> >       if (head_idx < tx_ring->next_to_clean)
> >               head_idx += tx_ring->count;
> > -     frames_ready = head_idx - tx_ring->next_to_clean;
> > +     completed_frames = head_idx - tx_ring->next_to_clean;
> >
> > -     if (frames_ready == 0) {
> > +     if (completed_frames == 0)
> >               goto out_xmit;
> > -     } else if (frames_ready > budget) {
> > -             completed_frames = budget;
> > -             work_done = false;
> > -     } else {
> > -             completed_frames = frames_ready;
> > +
> > +     if (likely(!tx_ring->xdp_tx_active)) {
> > +             xsk_frames = completed_frames;
> > +             goto skip;
> >       }
> >
> >       ntc = tx_ring->next_to_clean;
> > @@ -475,18 +480,18 @@ bool i40e_clean_xdp_tx_irq(struct i40e_vsi *vsi,
> >       for (i = 0; i < completed_frames; i++) {
> >               tx_bi = &tx_ring->tx_bi[ntc];
> >
> > -             if (tx_bi->xdpf)
> > +             if (tx_bi->xdpf) {
> >                       i40e_clean_xdp_tx_buffer(tx_ring, tx_bi);
> > -             else
> > +                     tx_bi->xdpf = NULL;
> > +             } else {
> >                       xsk_frames++;
> > -
> > -             tx_bi->xdpf = NULL;
> > -             total_bytes += tx_bi->bytecount;
> > +             }
> >
> >               if (++ntc >= tx_ring->count)
> >                       ntc = 0;
> >       }
> >
> > +skip:
> >       tx_ring->next_to_clean += completed_frames;
> >       if (unlikely(tx_ring->next_to_clean >= tx_ring->count))
> >               tx_ring->next_to_clean -= tx_ring->count;
> > @@ -495,7 +500,6 @@ bool i40e_clean_xdp_tx_irq(struct i40e_vsi *vsi,
> >               xsk_umem_complete_tx(umem, xsk_frames);
> >
> >       i40e_arm_wb(tx_ring, vsi, budget);
>
> I guess budget here should be replaced with completed_frames
>
> > -     i40e_update_tx_stats(tx_ring, completed_frames, total_bytes);
> >
> >   out_xmit:
> >       if (xsk_umem_uses_need_wakeup(tx_ring->xsk_umem))
>
>
> >
> _______________________________________________
> Intel-wired-lan mailing list
> Intel-wired-lan at osuosl.org
> https://lists.osuosl.org/mailman/listinfo/intel-wired-lan

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2020-06-23  5:02 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-22 13:21 [PATCH net-next 0/2] i40e: improve AF_XDP performance Magnus Karlsson
2020-06-22 13:21 ` [Intel-wired-lan] " Magnus Karlsson
2020-06-22 13:21 ` [PATCH net-next 1/2] i40e: optimize AF_XDP Tx completion path Magnus Karlsson
2020-06-22 13:21   ` [Intel-wired-lan] " Magnus Karlsson
2020-06-23  1:00   ` Samudrala, Sridhar
2020-06-23  1:00     ` Samudrala, Sridhar
2020-06-23  5:02     ` Magnus Karlsson
2020-06-23  5:02       ` Magnus Karlsson
2020-06-22 13:21 ` [PATCH net-next 2/2] i40e: eliminate division in napi_poll data path Magnus Karlsson
2020-06-22 13:21   ` [Intel-wired-lan] " Magnus Karlsson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.