bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 intel-next 0/4] XDP_TX improvements for ice
@ 2021-07-05 16:43 Maciej Fijalkowski
  2021-07-05 16:43 ` [PATCH v2 intel-next 1/4] ice: unify xdp_rings accesses Maciej Fijalkowski
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: Maciej Fijalkowski @ 2021-07-05 16:43 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: netdev, bpf, davem, anthony.l.nguyen, kuba, bjorn,
	magnus.karlsson, joamaki, toke, Maciej Fijalkowski

Hi,

this is a second revision of a series around XDP_TX improvements for ice
driver. When compared to v1 (which can be found under [1]), two new
patches are introduced that are focused on improving the performance for
XDP_TX as Jussi reported that the numbers were pretty low on his side.
Furthermore the fallback path is now based on static branch, as
suggested by Toke on v1. This means that there's no further need for a
standalone net_device_ops that were serving the locked version of
ndo_xdp_xmit callback.

Idea from 2nd patch is borrowed from a joint work that was done against
OOT driver among with Sridhar Samudrala, Jesse Brandeburg and Piotr
Raczynski, where we working on fixing the scaling issues for Tx AF_XDP
ZC path.

Last but not least, with this series I observe the improvement of
performance by around 30%.

Thanks!
Maciej

[1] : https://lore.kernel.org/bpf/20210601113236.42651-1-maciej.fijalkowski@intel.com/

Maciej Fijalkowski (4):
  ice: unify xdp_rings accesses
  ice: optimize XDP_TX descriptor processing
  ice: do not create xdp_frame on XDP_TX
  ice: introduce XDP_TX fallback path

 drivers/net/ethernet/intel/ice/ice.h          |   3 +
 drivers/net/ethernet/intel/ice/ice_lib.c      |   4 +-
 drivers/net/ethernet/intel/ice/ice_main.c     |  45 ++++++-
 drivers/net/ethernet/intel/ice/ice_txrx.c     | 123 +++++++++++++++---
 drivers/net/ethernet/intel/ice/ice_txrx.h     |   6 +-
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c |  29 +++--
 drivers/net/ethernet/intel/ice/ice_txrx_lib.h |  16 +++
 7 files changed, 190 insertions(+), 36 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v2 intel-next 1/4] ice: unify xdp_rings accesses
  2021-07-05 16:43 [PATCH v2 intel-next 0/4] XDP_TX improvements for ice Maciej Fijalkowski
@ 2021-07-05 16:43 ` Maciej Fijalkowski
  2021-07-05 16:43 ` [PATCH v2 intel-next 2/4] ice: optimize XDP_TX descriptor processing Maciej Fijalkowski
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Maciej Fijalkowski @ 2021-07-05 16:43 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: netdev, bpf, davem, anthony.l.nguyen, kuba, bjorn,
	magnus.karlsson, joamaki, toke, Maciej Fijalkowski

There has been a long lasting issue of improper xdp_rings indexing for
XDP_TX and XDP_REDIRECT actions. Given that currently rx_ring->q_index
is mixed with smp_processor_id(), there could be a situation where Tx
descriptors are produced onto XDP Tx ring, but tail is never bumped -
for example pin a particular queue id to non-matching IRQ line.

Address this problem by ignoring the user ring count setting and always
initialize the xdp_rings array to be of num_possible_cpus() size. Then,
always use the smp_processor_id() as an index to xdp_rings array. This
provides serialization as at given time only a single softirq can run on
a particular CPU.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_lib.c      | 2 +-
 drivers/net/ethernet/intel/ice/ice_main.c     | 2 +-
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
index dde9802c6c72..20d02e9d6635 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -3154,7 +3154,7 @@ int ice_vsi_rebuild(struct ice_vsi *vsi, bool init_vsi)
 
 		ice_vsi_map_rings_to_vectors(vsi);
 		if (ice_is_xdp_ena_vsi(vsi)) {
-			vsi->num_xdp_txq = vsi->alloc_rxq;
+			vsi->num_xdp_txq = num_possible_cpus();
 			ret = ice_prepare_xdp_rings(vsi, vsi->xdp_prog);
 			if (ret)
 				goto err_vectors;
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index ef8d1815af56..f26db305b797 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -2626,7 +2626,7 @@ ice_xdp_setup_prog(struct ice_vsi *vsi, struct bpf_prog *prog,
 	}
 
 	if (!ice_is_xdp_ena_vsi(vsi) && prog) {
-		vsi->num_xdp_txq = vsi->alloc_rxq;
+		vsi->num_xdp_txq = num_possible_cpus();
 		xdp_ring_err = ice_prepare_xdp_rings(vsi, prog);
 		if (xdp_ring_err)
 			NL_SET_ERR_MSG_MOD(extack, "Setting up XDP Tx resources failed");
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
index 171397dcf00a..4fd01a153d35 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -295,7 +295,7 @@ void ice_finalize_xdp_rx(struct ice_ring *rx_ring, unsigned int xdp_res)
 
 	if (xdp_res & ICE_XDP_TX) {
 		struct ice_ring *xdp_ring =
-			rx_ring->vsi->xdp_rings[rx_ring->q_index];
+			rx_ring->vsi->xdp_rings[smp_processor_id()];
 
 		ice_xdp_ring_update_tail(xdp_ring);
 	}
-- 
2.20.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v2 intel-next 2/4] ice: optimize XDP_TX descriptor processing
  2021-07-05 16:43 [PATCH v2 intel-next 0/4] XDP_TX improvements for ice Maciej Fijalkowski
  2021-07-05 16:43 ` [PATCH v2 intel-next 1/4] ice: unify xdp_rings accesses Maciej Fijalkowski
@ 2021-07-05 16:43 ` Maciej Fijalkowski
  2021-07-05 16:43 ` [PATCH v2 intel-next 3/4] ice: do not create xdp_frame on XDP_TX Maciej Fijalkowski
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Maciej Fijalkowski @ 2021-07-05 16:43 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: netdev, bpf, davem, anthony.l.nguyen, kuba, bjorn,
	magnus.karlsson, joamaki, toke, Maciej Fijalkowski

Improve the performance of XDP_TX by reducing the number of times we set
the RS bit. Instead of setting it for each packet, set it at the end of
the batch right before the tail bump. This results in reduced PCIe
traffic as HW will issue less writebacks.

For that purpose, introduce the next_rs_idx field onto ice_ring so that
the descriptor with RS bit set can be tracked.

This will allow for calculating the amount of descriptors that are ready
to be cleaned. DD bit can be checked only on the descriptor that
next_rs_idx points to. This logic can not be combined into existing
ice_clean_tx_irq routine so let's introduce a separate function
dedicated for cleaning XDP rings and drop the XDP specific bits from the
existing one.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_txrx.c     | 79 ++++++++++++++++---
 drivers/net/ethernet/intel/ice/ice_txrx.h     |  5 +-
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 13 ++-
 3 files changed, 76 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 6ee8e0032d52..23b97c9579fb 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -194,6 +194,63 @@ void ice_free_tx_ring(struct ice_ring *tx_ring)
 	}
 }
 
+/**
+ * ice_clean_xdp_irq - Reclaim resources after transmit completes on XDP ring
+ * @xdp_ring: XDP ring to clean
+ *
+ * Returns true if there's any budget left (e.g. the clean is finished)
+ */
+static bool ice_clean_xdp_irq(struct ice_ring *xdp_ring)
+{
+	unsigned int total_bytes = 0, total_pkts = 0;
+	u16 next_rs_idx = xdp_ring->next_rs_idx;
+	u16 ntc = xdp_ring->next_to_clean;
+	struct ice_tx_desc *next_rs_desc;
+	struct ice_tx_buf *tx_buf;
+	u16 frames_ready = 0;
+	u16 frames_to_clean;
+	int i;
+
+	next_rs_desc = ICE_TX_DESC(xdp_ring, next_rs_idx);
+	if (next_rs_desc->cmd_type_offset_bsz &
+	    cpu_to_le64(ICE_TX_DESC_DTYPE_DESC_DONE)) {
+		if (next_rs_idx >= ntc)
+			frames_ready = next_rs_idx - ntc;
+		else
+			frames_ready = next_rs_idx + xdp_ring->count - ntc;
+	}
+
+	if (!frames_ready)
+		return true;
+
+	frames_to_clean = min_t(u16, frames_ready, ICE_DFLT_IRQ_WORK);
+
+	for (i = 0; i < frames_to_clean; i++) {
+		tx_buf = &xdp_ring->tx_buf[ntc];
+
+		total_bytes += tx_buf->bytecount;
+		/* normally tx_buf->gso_segs was taken but at this point
+		 * it's always 1 for us
+		 */
+		total_pkts++;
+
+		page_frag_free(tx_buf->raw_buf);
+		dma_unmap_single(xdp_ring->dev, dma_unmap_addr(tx_buf, dma),
+				 dma_unmap_len(tx_buf, len), DMA_TO_DEVICE);
+		dma_unmap_len_set(tx_buf, len, 0);
+		tx_buf->raw_buf = NULL;
+
+		ntc++;
+		if (ntc >= xdp_ring->count)
+			ntc = 0;
+	}
+
+	xdp_ring->next_to_clean = ntc;
+	ice_update_tx_ring_stats(xdp_ring, total_pkts, total_bytes);
+
+	return frames_ready <= frames_to_clean;
+}
+
 /**
  * ice_clean_tx_irq - Reclaim resources after transmit completes
  * @tx_ring: Tx ring to clean
@@ -238,11 +295,8 @@ static bool ice_clean_tx_irq(struct ice_ring *tx_ring, int napi_budget)
 		total_bytes += tx_buf->bytecount;
 		total_pkts += tx_buf->gso_segs;
 
-		if (ice_ring_is_xdp(tx_ring))
-			page_frag_free(tx_buf->raw_buf);
-		else
-			/* free the skb */
-			napi_consume_skb(tx_buf->skb, napi_budget);
+		/* free the skb */
+		napi_consume_skb(tx_buf->skb, napi_budget);
 
 		/* unmap skb header data */
 		dma_unmap_single(tx_ring->dev,
@@ -297,10 +351,6 @@ static bool ice_clean_tx_irq(struct ice_ring *tx_ring, int napi_budget)
 	tx_ring->next_to_clean = i;
 
 	ice_update_tx_ring_stats(tx_ring, total_pkts, total_bytes);
-
-	if (ice_ring_is_xdp(tx_ring))
-		return !!budget;
-
 	netdev_tx_completed_queue(txring_txq(tx_ring), total_pkts,
 				  total_bytes);
 
@@ -1396,9 +1446,14 @@ int ice_napi_poll(struct napi_struct *napi, int budget)
 	 * budget and be more aggressive about cleaning up the Tx descriptors.
 	 */
 	ice_for_each_ring(ring, q_vector->tx) {
-		bool wd = ring->xsk_pool ?
-			  ice_clean_tx_irq_zc(ring, budget) :
-			  ice_clean_tx_irq(ring, budget);
+		bool wd;
+
+		if (ring->xsk_pool)
+			wd = ice_clean_tx_irq_zc(ring, budget);
+		else if (ice_ring_is_xdp(ring))
+			wd = ice_clean_xdp_irq(ring);
+		else
+			wd = ice_clean_tx_irq(ring, budget);
 
 		if (!wd)
 			clean_complete = false;
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
index 1e46e80f3d6f..b43d471ce05d 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -283,7 +283,10 @@ struct ice_ring {
 	/* used in interrupt processing */
 	u16 next_to_use;
 	u16 next_to_clean;
-	u16 next_to_alloc;
+	union {
+		u16 next_to_alloc;
+		u16 next_rs_idx;
+	};
 
 	/* stats structs */
 	struct ice_q_stats	stats;
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
index 4fd01a153d35..0b3d51c9869b 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -244,19 +244,13 @@ int ice_xmit_xdp_ring(void *data, u16 size, struct ice_ring *xdp_ring)
 
 	tx_desc = ICE_TX_DESC(xdp_ring, i);
 	tx_desc->buf_addr = cpu_to_le64(dma);
-	tx_desc->cmd_type_offset_bsz = ice_build_ctob(ICE_TXD_LAST_DESC_CMD, 0,
+	tx_desc->cmd_type_offset_bsz = ice_build_ctob(ICE_TX_DESC_CMD_EOP, 0,
 						      size, 0);
 
-	/* Make certain all of the status bits have been updated
-	 * before next_to_watch is written.
-	 */
-	smp_wmb();
-
+	xdp_ring->next_rs_idx = i;
 	i++;
 	if (i == xdp_ring->count)
 		i = 0;
-
-	tx_buf->next_to_watch = tx_desc;
 	xdp_ring->next_to_use = i;
 
 	return ICE_XDP_TX;
@@ -296,7 +290,10 @@ void ice_finalize_xdp_rx(struct ice_ring *rx_ring, unsigned int xdp_res)
 	if (xdp_res & ICE_XDP_TX) {
 		struct ice_ring *xdp_ring =
 			rx_ring->vsi->xdp_rings[smp_processor_id()];
+		struct ice_tx_desc *next_rs_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_rs_idx);
 
+		next_rs_desc->cmd_type_offset_bsz |=
+			cpu_to_le64(ICE_TX_DESC_CMD_RS << ICE_TXD_QW1_CMD_S);
 		ice_xdp_ring_update_tail(xdp_ring);
 	}
 }
-- 
2.20.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v2 intel-next 3/4] ice: do not create xdp_frame on XDP_TX
  2021-07-05 16:43 [PATCH v2 intel-next 0/4] XDP_TX improvements for ice Maciej Fijalkowski
  2021-07-05 16:43 ` [PATCH v2 intel-next 1/4] ice: unify xdp_rings accesses Maciej Fijalkowski
  2021-07-05 16:43 ` [PATCH v2 intel-next 2/4] ice: optimize XDP_TX descriptor processing Maciej Fijalkowski
@ 2021-07-05 16:43 ` Maciej Fijalkowski
  2021-07-05 16:43 ` [PATCH v2 intel-next 4/4] ice: introduce XDP_TX fallback path Maciej Fijalkowski
  2021-07-05 17:06 ` [PATCH v2 intel-next 0/4] XDP_TX improvements for ice Toke Høiland-Jørgensen
  4 siblings, 0 replies; 6+ messages in thread
From: Maciej Fijalkowski @ 2021-07-05 16:43 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: netdev, bpf, davem, anthony.l.nguyen, kuba, bjorn,
	magnus.karlsson, joamaki, toke, Maciej Fijalkowski

xdp_frame is not needed for XDP_TX data path in ice driver case.
For this data path cleaning of sent descriptor will not happen anywhere
outside of the driver, which means that carrying the information about
the underlying memory model via xdp_frame will not be used. Therefore,
this conversion can be simply dropped, which would relieve CPU a bit.

Call directly ice_xmit_xdp_ring instead of ice_xmit_xdp_buff for XDP_TX
action returned from BPF program.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_txrx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 23b97c9579fb..fef1f74562e5 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -586,7 +586,7 @@ ice_run_xdp(struct ice_ring *rx_ring, struct xdp_buff *xdp,
 		return ICE_XDP_PASS;
 	case XDP_TX:
 		xdp_ring = rx_ring->vsi->xdp_rings[smp_processor_id()];
-		result = ice_xmit_xdp_buff(xdp, xdp_ring);
+		result = ice_xmit_xdp_ring(xdp->data, xdp->data_end - xdp->data, xdp_ring);
 		if (result == ICE_XDP_CONSUMED)
 			goto out_failure;
 		return result;
-- 
2.20.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v2 intel-next 4/4] ice: introduce XDP_TX fallback path
  2021-07-05 16:43 [PATCH v2 intel-next 0/4] XDP_TX improvements for ice Maciej Fijalkowski
                   ` (2 preceding siblings ...)
  2021-07-05 16:43 ` [PATCH v2 intel-next 3/4] ice: do not create xdp_frame on XDP_TX Maciej Fijalkowski
@ 2021-07-05 16:43 ` Maciej Fijalkowski
  2021-07-05 17:06 ` [PATCH v2 intel-next 0/4] XDP_TX improvements for ice Toke Høiland-Jørgensen
  4 siblings, 0 replies; 6+ messages in thread
From: Maciej Fijalkowski @ 2021-07-05 16:43 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: netdev, bpf, davem, anthony.l.nguyen, kuba, bjorn,
	magnus.karlsson, joamaki, toke, Maciej Fijalkowski

Under rare circumstances there might be a situation where a requirement
of having XDP Tx queue per CPU could not be fulfilled and some of the Tx
resources have to be shared between CPUs. This yields a need for placing
accesses to xdp_rings array inside a critical section protected by
spinlock. These accesses happen to be in the hot path, so let's
introduce the static branch that will be triggered from the control
plane when driver could not provide Tx queue dedicated for XDP on each
CPU.

Currently, the design that has been picked is to allow any number of XDP
Tx queues that is at least half of a count of CPUs that platform has.
For lower number driver will bail out with a response to user that there
were not enough Tx resources that would allow configuring XDP. The
sharing of rings is signalled via static branch enablement which in turn
indicates that lock for xdp_ring accesses needs to be taken in hot path.

Approach based on static branch has no impact on performance of a
non-fallback path. The static branch will act as a global driver
switch, meaning that if one PF got out of Tx resources, then other PFs
that ice driver is servicing will suffer. However, given the fact that
HW that ice driver is handling has 1024 Tx queues per each PF, this is
currently an unlikely scenario.

One thing to note in the fallback path is that it can result in a
situation where one NAPI instance is producing descriptors to the
xdp_ring that belongs to different NAPI, which would mean that an
explicit raise of a software interrupt needs to be issued so that remote
NAPI is scheduled and descriptors are cleaned on the xdp_ring.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 drivers/net/ethernet/intel/ice/ice.h          |  3 ++
 drivers/net/ethernet/intel/ice/ice_lib.c      |  4 +-
 drivers/net/ethernet/intel/ice/ice_main.c     | 45 +++++++++++++++++--
 drivers/net/ethernet/intel/ice/ice_txrx.c     | 44 +++++++++++++++---
 drivers/net/ethernet/intel/ice/ice_txrx.h     |  1 +
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 22 ++++++---
 drivers/net/ethernet/intel/ice/ice_txrx_lib.h | 16 +++++++
 7 files changed, 117 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index a450343fbb92..3ddf079a6bf3 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -158,6 +158,8 @@
 
 #define ice_pf_to_dev(pf) (&((pf)->pdev->dev))
 
+DECLARE_STATIC_KEY_FALSE(ice_xdp_locking_key);
+
 struct ice_txq_meta {
 	u32 q_teid;	/* Tx-scheduler element identifier */
 	u16 q_id;	/* Entry in VSI's txq_map bitmap */
@@ -647,6 +649,7 @@ int ice_up(struct ice_vsi *vsi);
 int ice_down(struct ice_vsi *vsi);
 int ice_vsi_cfg(struct ice_vsi *vsi);
 struct ice_vsi *ice_lb_vsi_setup(struct ice_pf *pf, struct ice_port_info *pi);
+int ice_vsi_determine_xdp_res(struct ice_vsi *vsi);
 int ice_prepare_xdp_rings(struct ice_vsi *vsi, struct bpf_prog *prog);
 int ice_destroy_xdp_rings(struct ice_vsi *vsi);
 int
diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
index 20d02e9d6635..9aa8e711aaef 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -3154,7 +3154,9 @@ int ice_vsi_rebuild(struct ice_vsi *vsi, bool init_vsi)
 
 		ice_vsi_map_rings_to_vectors(vsi);
 		if (ice_is_xdp_ena_vsi(vsi)) {
-			vsi->num_xdp_txq = num_possible_cpus();
+			ret = ice_vsi_determine_xdp_res(vsi);
+			if (ret)
+				goto err_vectors;
 			ret = ice_prepare_xdp_rings(vsi, vsi->xdp_prog);
 			if (ret)
 				goto err_vectors;
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index f26db305b797..151868c7c485 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -42,6 +42,8 @@ MODULE_PARM_DESC(debug, "netif level (0=none,...,16=all)");
 #endif /* !CONFIG_DYNAMIC_DEBUG */
 
 static DEFINE_IDA(ice_aux_ida);
+DEFINE_STATIC_KEY_FALSE(ice_xdp_locking_key);
+EXPORT_SYMBOL(ice_xdp_locking_key);
 
 static struct workqueue_struct *ice_wq;
 static const struct net_device_ops ice_netdev_safe_mode_ops;
@@ -2382,6 +2384,7 @@ static int ice_xdp_alloc_setup_rings(struct ice_vsi *vsi)
 			goto free_xdp_rings;
 		ice_set_ring_xdp(xdp_ring);
 		xdp_ring->xsk_pool = ice_xsk_pool(xdp_ring);
+		spin_lock_init(&xdp_ring->tx_lock);
 	}
 
 	return 0;
@@ -2447,6 +2450,10 @@ int ice_prepare_xdp_rings(struct ice_vsi *vsi, struct bpf_prog *prog)
 	if (__ice_vsi_get_qs(&xdp_qs_cfg))
 		goto err_map_xdp;
 
+	if (static_key_enabled(&ice_xdp_locking_key))
+		netdev_warn(vsi->netdev,
+			    "Could not allocate one XDP Tx ring per CPU, XDP_TX/XDP_REDIRECT actions will be slower\n");
+
 	if (ice_xdp_alloc_setup_rings(vsi))
 		goto clear_xdp_rings;
 
@@ -2563,6 +2570,9 @@ int ice_destroy_xdp_rings(struct ice_vsi *vsi)
 	devm_kfree(ice_pf_to_dev(pf), vsi->xdp_rings);
 	vsi->xdp_rings = NULL;
 
+	if (static_key_enabled(&ice_xdp_locking_key))
+		static_branch_dec(&ice_xdp_locking_key);
+
 	if (ice_is_reset_in_progress(pf->state) || !vsi->q_vectors[0])
 		return 0;
 
@@ -2597,6 +2607,29 @@ static void ice_vsi_rx_napi_schedule(struct ice_vsi *vsi)
 	}
 }
 
+/**
+ * ice_vsi_determine_xdp_res - figure out how many Tx qs can XDP have
+ * @vsi: VSI to determine the count of XDP Tx qs
+ *
+ * returns 0 if Tx qs count is higher than at least half of CPU count,
+ * -ENOMEM otherwise
+ */
+int ice_vsi_determine_xdp_res(struct ice_vsi *vsi)
+{
+	u16 avail = ice_get_avail_txq_count(vsi->back);
+	u16 cpus = num_possible_cpus();
+
+	if (avail < cpus / 2)
+		return -ENOMEM;
+
+	vsi->num_xdp_txq = min_t(u16, avail, cpus);
+
+	if (vsi->num_xdp_txq < cpus)
+		static_branch_inc(&ice_xdp_locking_key);
+
+	return 0;
+}
+
 /**
  * ice_xdp_setup_prog - Add or remove XDP eBPF program
  * @vsi: VSI to setup XDP for
@@ -2626,10 +2659,14 @@ ice_xdp_setup_prog(struct ice_vsi *vsi, struct bpf_prog *prog,
 	}
 
 	if (!ice_is_xdp_ena_vsi(vsi) && prog) {
-		vsi->num_xdp_txq = num_possible_cpus();
-		xdp_ring_err = ice_prepare_xdp_rings(vsi, prog);
-		if (xdp_ring_err)
-			NL_SET_ERR_MSG_MOD(extack, "Setting up XDP Tx resources failed");
+		xdp_ring_err = ice_vsi_determine_xdp_res(vsi);
+		if (xdp_ring_err) {
+			NL_SET_ERR_MSG_MOD(extack, "Not enough Tx resources for XDP");
+		} else {
+			xdp_ring_err = ice_prepare_xdp_rings(vsi, prog);
+			if (xdp_ring_err)
+				NL_SET_ERR_MSG_MOD(extack, "Setting up XDP Tx resources failed");
+		}
 	} else if (ice_is_xdp_ena_vsi(vsi) && !prog) {
 		xdp_ring_err = ice_destroy_xdp_rings(vsi);
 		if (xdp_ring_err)
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index fef1f74562e5..97d205fb19da 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -577,19 +577,35 @@ ice_run_xdp(struct ice_ring *rx_ring, struct xdp_buff *xdp,
 	    struct bpf_prog *xdp_prog)
 {
 	struct ice_ring *xdp_ring;
-	int err, result;
+	int err;
 	u32 act;
+	u16 idx;
 
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
 	switch (act) {
 	case XDP_PASS:
 		return ICE_XDP_PASS;
 	case XDP_TX:
-		xdp_ring = rx_ring->vsi->xdp_rings[smp_processor_id()];
-		result = ice_xmit_xdp_ring(xdp->data, xdp->data_end - xdp->data, xdp_ring);
-		if (result == ICE_XDP_CONSUMED)
+		idx = smp_processor_id();
+		if (static_branch_unlikely(&ice_xdp_locking_key)) {
+			idx %= rx_ring->vsi->num_xdp_txq;
+			xdp_ring = rx_ring->vsi->xdp_rings[idx];
+			spin_lock(&xdp_ring->tx_lock);
+		} else {
+			xdp_ring = rx_ring->vsi->xdp_rings[idx];
+		}
+		err = ice_xmit_xdp_ring(xdp->data, xdp->data_end - xdp->data, xdp_ring);
+		if (static_branch_unlikely(&ice_xdp_locking_key)) {
+			spin_unlock(&xdp_ring->tx_lock);
+			if (err == ICE_XDP_CONSUMED) {
+				/* kick the remote NAPI that xdp_ring belongs to */
+				ice_trigger_sw_intr(&xdp_ring->vsi->back->hw, xdp_ring->q_vector);
+				goto out_failure;
+			}
+		}
+		if (err == ICE_XDP_CONSUMED)
 			goto out_failure;
-		return result;
+		return err;
 	case XDP_REDIRECT:
 		err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
 		if (err)
@@ -638,7 +654,14 @@ ice_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
 	if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
 		return -EINVAL;
 
-	xdp_ring = vsi->xdp_rings[queue_index];
+	if (static_branch_unlikely(&ice_xdp_locking_key)) {
+		queue_index %= vsi->num_xdp_txq;
+		xdp_ring = vsi->xdp_rings[queue_index];
+		spin_lock(&xdp_ring->tx_lock);
+	} else {
+		xdp_ring = vsi->xdp_rings[queue_index];
+	}
+
 	for (i = 0; i < n; i++) {
 		struct xdp_frame *xdpf = frames[i];
 		int err;
@@ -649,8 +672,15 @@ ice_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
 		nxmit++;
 	}
 
-	if (unlikely(flags & XDP_XMIT_FLUSH))
+	if (unlikely(flags & XDP_XMIT_FLUSH)) {
+		ice_xdp_ring_set_rs(xdp_ring);
 		ice_xdp_ring_update_tail(xdp_ring);
+	}
+
+	if (static_branch_unlikely(&ice_xdp_locking_key)) {
+		spin_unlock(&xdp_ring->tx_lock);
+		ice_trigger_sw_intr(&vsi->back->hw, xdp_ring->q_vector);
+	}
 
 	return nxmit;
 }
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
index b43d471ce05d..63538cb3b632 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -303,6 +303,7 @@ struct ice_ring {
 	u16 rx_offset;
 	/* CL3 - 3rd cacheline starts here */
 	struct xdp_rxq_info xdp_rxq;
+	spinlock_t tx_lock;
 	struct sk_buff *skb;
 	/* CLX - the below items are only accessed infrequently and should be
 	 * in their own cache line if possible
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
index 0b3d51c9869b..164008e26869 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -2,6 +2,7 @@
 /* Copyright (c) 2019, Intel Corporation. */
 
 #include "ice_txrx_lib.h"
+#include "ice_base.h"
 
 /**
  * ice_release_rx_desc - Store the new tail and head values
@@ -288,12 +289,21 @@ void ice_finalize_xdp_rx(struct ice_ring *rx_ring, unsigned int xdp_res)
 		xdp_do_flush_map();
 
 	if (xdp_res & ICE_XDP_TX) {
-		struct ice_ring *xdp_ring =
-			rx_ring->vsi->xdp_rings[smp_processor_id()];
-		struct ice_tx_desc *next_rs_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_rs_idx);
-
-		next_rs_desc->cmd_type_offset_bsz |=
-			cpu_to_le64(ICE_TX_DESC_CMD_RS << ICE_TXD_QW1_CMD_S);
+		u16 idx = smp_processor_id();
+		struct ice_ring *xdp_ring;
+
+		if (static_branch_unlikely(&ice_xdp_locking_key)) {
+			idx %= rx_ring->vsi->num_xdp_txq;
+			xdp_ring = rx_ring->vsi->xdp_rings[idx];
+			spin_lock(&xdp_ring->tx_lock);
+		} else {
+			xdp_ring = rx_ring->vsi->xdp_rings[idx];
+		}
+		ice_xdp_ring_set_rs(xdp_ring);
 		ice_xdp_ring_update_tail(xdp_ring);
+		if (static_branch_unlikely(&ice_xdp_locking_key)) {
+			spin_unlock(&xdp_ring->tx_lock);
+			ice_trigger_sw_intr(&rx_ring->vsi->back->hw, xdp_ring->q_vector);
+		}
 	}
 }
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.h b/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
index 05ac30752902..319b4bfcecfc 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
@@ -31,6 +31,22 @@ ice_build_ctob(u64 td_cmd, u64 td_offset, unsigned int size, u64 td_tag)
 			   (td_tag    << ICE_TXD_QW1_L2TAG1_S));
 }
 
+/**
+ * ice_xdp_ring_set_rs - Sets RS bit on a Tx descriptor
+ * @xdp_ring: XDP Tx ring
+ *
+ * This function sets RS bit on the last descriptor of a batch,
+ * before the update of HW tail register
+ */
+static inline void ice_xdp_ring_set_rs(struct ice_ring *xdp_ring)
+{
+	struct ice_tx_desc *next_rs_desc;
+
+	next_rs_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_rs_idx);
+	next_rs_desc->cmd_type_offset_bsz |=
+		cpu_to_le64(ICE_TX_DESC_CMD_RS << ICE_TXD_QW1_CMD_S);
+}
+
 /**
  * ice_xdp_ring_update_tail - Updates the XDP Tx ring tail register
  * @xdp_ring: XDP Tx ring
-- 
2.20.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v2 intel-next 0/4] XDP_TX improvements for ice
  2021-07-05 16:43 [PATCH v2 intel-next 0/4] XDP_TX improvements for ice Maciej Fijalkowski
                   ` (3 preceding siblings ...)
  2021-07-05 16:43 ` [PATCH v2 intel-next 4/4] ice: introduce XDP_TX fallback path Maciej Fijalkowski
@ 2021-07-05 17:06 ` Toke Høiland-Jørgensen
  4 siblings, 0 replies; 6+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-07-05 17:06 UTC (permalink / raw)
  To: Maciej Fijalkowski, intel-wired-lan
  Cc: netdev, bpf, davem, anthony.l.nguyen, kuba, bjorn,
	magnus.karlsson, joamaki, Maciej Fijalkowski

Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:

> Hi,
>
> this is a second revision of a series around XDP_TX improvements for ice
> driver. When compared to v1 (which can be found under [1]), two new
> patches are introduced that are focused on improving the performance for
> XDP_TX as Jussi reported that the numbers were pretty low on his side.
> Furthermore the fallback path is now based on static branch, as
> suggested by Toke on v1. This means that there's no further need for a
> standalone net_device_ops that were serving the locked version of
> ndo_xdp_xmit callback.
>
> Idea from 2nd patch is borrowed from a joint work that was done against
> OOT driver among with Sridhar Samudrala, Jesse Brandeburg and Piotr
> Raczynski, where we working on fixing the scaling issues for Tx AF_XDP
> ZC path.
>
> Last but not least, with this series I observe the improvement of
> performance by around 30%.

Wow, "but not least" indeed! :D
You can't just drop that at the end like that! I want details -
whaddyamean, "by around 30%"? In all cases? Only for TX? Gimme numbers! :)

-Toke


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-07-05 17:06 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-05 16:43 [PATCH v2 intel-next 0/4] XDP_TX improvements for ice Maciej Fijalkowski
2021-07-05 16:43 ` [PATCH v2 intel-next 1/4] ice: unify xdp_rings accesses Maciej Fijalkowski
2021-07-05 16:43 ` [PATCH v2 intel-next 2/4] ice: optimize XDP_TX descriptor processing Maciej Fijalkowski
2021-07-05 16:43 ` [PATCH v2 intel-next 3/4] ice: do not create xdp_frame on XDP_TX Maciej Fijalkowski
2021-07-05 16:43 ` [PATCH v2 intel-next 4/4] ice: introduce XDP_TX fallback path Maciej Fijalkowski
2021-07-05 17:06 ` [PATCH v2 intel-next 0/4] XDP_TX improvements for ice Toke Høiland-Jørgensen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).