* [PATCH v3 bpf-next 0/7] xsk: Intel driver improvements
@ 2022-01-21 12:00 Maciej Fijalkowski
  2022-01-21 12:00 ` [PATCH bpf-next v3 1/7] ice: remove likely for napi_complete_done Maciej Fijalkowski
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Maciej Fijalkowski @ 2022-01-21 12:00 UTC (permalink / raw)
  To: bpf, ast, daniel
  Cc: netdev, magnus.karlsson, alexandr.lobakin, Maciej Fijalkowski

Hi,

Unfortunately, scalability issues similar to the ones that were
addressed for XDP processing in ice also exist for XDP in the zero-copy
driver used by AF_XDP. Let's resolve them in mostly the same way as we
did in [0] and utilize the Tx batching API from the xsk buffer pool.

Move the array of Tx descriptors that is used with the batching approach
to the xsk buffer pool. This means that future users of this API will
not have to carry the array on their own side; they can simply refer to
the pool's tx_descs array.
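
To give an idea, the consumer side of the batching API then boils down
to the sketch below (simplified from the i40e/ice changes in this set):

	struct xdp_desc *descs = xdp_ring->xsk_pool->tx_descs;
	u32 nb_pkts;

	/* descriptors are written directly into the pool-owned array */
	nb_pkts = xsk_tx_peek_release_desc_batch(xdp_ring->xsk_pool, budget);
	if (!nb_pkts)
		return true;
	/* ...produce nb_pkts HW Tx descriptors out of descs[]... */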

We also improve the Rx side by extending ice_alloc_rx_bufs_zc() to
handle the ring wrap and bump the Rx tail more frequently. This aligns
the Rx side with Tx and was needed for the l2fwd scenario.

Below are the performance improvements that this set brings, measured
with the xdpsock app in busy-poll mode for the 1-core and 2-core setups.
Both Tx and Rx rings were sized to 1k entries and the busy-poll budget
was 256.

----------------------------------------------------------------
     |      txonly       |      l2fwd      |      rxdrop
----------------------------------------------------------------
1C   |       149%        |       14%       |        3%
----------------------------------------------------------------
2C   |       134%        |       20%       |        5%
----------------------------------------------------------------

The next step will be to introduce batching on the Rx side.


v3:
* drop likely() that was wrapping napi_complete_done (patch 1)
* handle ring wrap on Rx side when allocating buffers (patch 2)
* introduce configurable Tx threshold (patch 3)
* respect NAPI budget when cleaning Tx descriptors in ZC (patch 6)
v2:
* introduce new patch that resets @next_dd and @next_rs fields
* use batching API for AF_XDP Tx on ice side

Thanks,
Maciej

[0]: https://lore.kernel.org/bpf/20211015162908.145341-8-anthony.l.nguyen@intel.com/

Maciej Fijalkowski (6):
  ice: remove likely for napi_complete_done
  ice: xsk: handle SW XDP ring wrap and bump tail more often
  ice: make Tx threshold dependent on ring length
  ice: xsk: avoid potential dead AF_XDP Tx processing
  ice: xsk: improve AF_XDP ZC Tx and use batching API
  ice: xsk: borrow xdp_tx_active logic from i40e

Magnus Karlsson (1):
  i40e: xsk: move tmp desc array from driver to pool

 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  11 -
 drivers/net/ethernet/intel/i40e/i40e_txrx.h   |   1 -
 drivers/net/ethernet/intel/i40e/i40e_xsk.c    |   4 +-
 drivers/net/ethernet/intel/ice/ice_ethtool.c  |   3 +
 drivers/net/ethernet/intel/ice/ice_main.c     |   5 +-
 drivers/net/ethernet/intel/ice/ice_txrx.c     |   7 +-
 drivers/net/ethernet/intel/ice/ice_txrx.h     |  12 +-
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c |  15 +-
 drivers/net/ethernet/intel/ice/ice_xsk.c      | 366 ++++++++++++------
 drivers/net/ethernet/intel/ice/ice_xsk.h      |  27 +-
 include/net/xdp_sock_drv.h                    |   5 +-
 include/net/xsk_buff_pool.h                   |   1 +
 net/xdp/xsk.c                                 |  13 +-
 net/xdp/xsk_buff_pool.c                       |   7 +
 net/xdp/xsk_queue.h                           |  12 +-
 15 files changed, 327 insertions(+), 162 deletions(-)

-- 
2.33.1



* [PATCH bpf-next v3 1/7] ice: remove likely for napi_complete_done
  2022-01-21 12:00 [PATCH v3 bpf-next 0/7] xsk: Intel driver improvements Maciej Fijalkowski
@ 2022-01-21 12:00 ` Maciej Fijalkowski
  2022-01-21 12:00 ` [PATCH bpf-next v3 2/7] ice: xsk: handle SW XDP ring wrap and bump tail more often Maciej Fijalkowski
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Maciej Fijalkowski @ 2022-01-21 12:00 UTC (permalink / raw)
  To: bpf, ast, daniel
  Cc: netdev, magnus.karlsson, alexandr.lobakin, Maciej Fijalkowski

Remove the likely() before napi_complete_done() as this is the unlikely
case when busy-poll is used. Removing it has a positive performance
impact for busy-poll and no negative impact on the regular case.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_txrx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 3e38695f1c9d..e661d0e45b9b 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -1513,7 +1513,7 @@ int ice_napi_poll(struct napi_struct *napi, int budget)
 	/* Exit the polling mode, but don't re-enable interrupts if stack might
 	 * poll us due to busy-polling
 	 */
-	if (likely(napi_complete_done(napi, work_done))) {
+	if (napi_complete_done(napi, work_done)) {
 		ice_net_dim(q_vector);
 		ice_enable_interrupt(q_vector);
 	} else {
-- 
2.33.1



* [PATCH bpf-next v3 2/7] ice: xsk: handle SW XDP ring wrap and bump tail more often
  2022-01-21 12:00 [PATCH v3 bpf-next 0/7] xsk: Intel driver improvements Maciej Fijalkowski
  2022-01-21 12:00 ` [PATCH bpf-next v3 1/7] ice: remove likely for napi_complete_done Maciej Fijalkowski
@ 2022-01-21 12:00 ` Maciej Fijalkowski
  2022-01-21 12:29   ` Alexander Lobakin
  2022-01-21 12:00 ` [PATCH bpf-next v3 3/7] ice: make Tx threshold dependent on ring length Maciej Fijalkowski
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Maciej Fijalkowski @ 2022-01-21 12:00 UTC (permalink / raw)
  To: bpf, ast, daniel
  Cc: netdev, magnus.karlsson, alexandr.lobakin, Maciej Fijalkowski

Currently, if ice_clean_rx_irq_zc() processed the whole ring and
next_to_use != 0, then ice_alloc_rx_bufs_zc() would not refill the whole
ring even if the XSK buffer pool had enough free entries (either from
the fill ring or the internal recycle mechanism) - this is because ring
wrap is not handled.

Improve the logic in ice_alloc_rx_bufs_zc() to address the problem
above. Do not clamp the count of buffers that is passed to
xsk_buff_alloc_batch() when next_to_use + buffer count >=
rx_ring->count, but rather split it into two calls to the mentioned
function - one for the part up until the wrap and one for the part after
the wrap.
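
The core of the change, simplified from the diff below:

	if (ntu + count >= rx_ring->count) {
		/* fill up to the end of the ring... */
		filled = ice_fill_rx_descs(pool, xdp, rx_desc,
					   rx_ring->count - ntu);
		/* ...then wrap to index 0 for the remainder */
		rx_desc = ICE_RX_DESC(rx_ring, 0);
		xdp = ice_xdp_buf(rx_ring, 0);
		ntu = 0;
		count -= filled;
	}
	ice_fill_rx_descs(pool, xdp, rx_desc, count);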

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_txrx.h |  2 +
 drivers/net/ethernet/intel/ice/ice_xsk.c  | 99 ++++++++++++++++++-----
 2 files changed, 81 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
index b7b3bd4816f0..94a46e0e5ed0 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -111,6 +111,8 @@ static inline int ice_skb_pad(void)
 	(u16)((((R)->next_to_clean > (R)->next_to_use) ? 0 : (R)->count) + \
 	      (R)->next_to_clean - (R)->next_to_use - 1)
 
+#define ICE_RING_QUARTER(R) ((R)->count / 4)
+
 #define ICE_TX_FLAGS_TSO	BIT(0)
 #define ICE_TX_FLAGS_HW_VLAN	BIT(1)
 #define ICE_TX_FLAGS_SW_VLAN	BIT(2)
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 2388837d6d6c..0463fc594d08 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -359,33 +359,28 @@ int ice_xsk_pool_setup(struct ice_vsi *vsi, struct xsk_buff_pool *pool, u16 qid)
 }
 
 /**
- * ice_alloc_rx_bufs_zc - allocate a number of Rx buffers
- * @rx_ring: Rx ring
+ * ice_fill_rx_descs - pick buffers from XSK buffer pool and use it
+ * @pool: XSK Buffer pool to pull the buffers from
+ * @xdp: SW ring of xdp_buff that will hold the buffers
+ * @rx_desc: Pointer to Rx descriptors that will be filled
  * @count: The number of buffers to allocate
  *
  * This function allocates a number of Rx buffers from the fill ring
  * or the internal recycle mechanism and places them on the Rx ring.
  *
- * Returns true if all allocations were successful, false if any fail.
+ * Note that ring wrap should be handled by caller of this function.
+ *
+ * Returns the amount of allocated Rx descriptors
  */
-bool ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring, u16 count)
+static u16 ice_fill_rx_descs(struct xsk_buff_pool *pool, struct xdp_buff **xdp,
+			     union ice_32b_rx_flex_desc *rx_desc, u16 count)
 {
-	union ice_32b_rx_flex_desc *rx_desc;
-	u16 ntu = rx_ring->next_to_use;
-	struct xdp_buff **xdp;
-	u32 nb_buffs, i;
 	dma_addr_t dma;
+	u16 buffs;
+	int i;
 
-	rx_desc = ICE_RX_DESC(rx_ring, ntu);
-	xdp = ice_xdp_buf(rx_ring, ntu);
-
-	nb_buffs = min_t(u16, count, rx_ring->count - ntu);
-	nb_buffs = xsk_buff_alloc_batch(rx_ring->xsk_pool, xdp, nb_buffs);
-	if (!nb_buffs)
-		return false;
-
-	i = nb_buffs;
-	while (i--) {
+	buffs = xsk_buff_alloc_batch(pool, xdp, count);
+	for (i = 0; i < buffs; i++) {
 		dma = xsk_buff_xdp_get_dma(*xdp);
 		rx_desc->read.pkt_addr = cpu_to_le64(dma);
 		rx_desc->wb.status_error0 = 0;
@@ -394,13 +389,77 @@ bool ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring, u16 count)
 		xdp++;
 	}
 
+	return buffs;
+}
+
+/**
+ * __ice_alloc_rx_bufs_zc - allocate a number of Rx buffers
+ * @rx_ring: Rx ring
+ * @count: The number of buffers to allocate
+ *
+ * Place the @count of descriptors onto Rx ring. Handle the ring wrap
+ * for case where space from next_to_use up to the end of ring is less
+ * than @count. Finally do a tail bump.
+ *
+ * Returns true if all allocations were successful, false if any fail.
+ */
+static bool __ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring, u16 count)
+{
+	union ice_32b_rx_flex_desc *rx_desc;
+	u32 nb_buffs_extra = 0, nb_buffs;
+	u16 ntu = rx_ring->next_to_use;
+	u16 total_count = count;
+	struct xdp_buff **xdp;
+
+	rx_desc = ICE_RX_DESC(rx_ring, ntu);
+	xdp = ice_xdp_buf(rx_ring, ntu);
+
+	if (ntu + count >= rx_ring->count) {
+		nb_buffs_extra = ice_fill_rx_descs(rx_ring->xsk_pool, xdp,
+						   rx_desc,
+						   rx_ring->count - ntu);
+		rx_desc = ICE_RX_DESC(rx_ring, 0);
+		xdp = ice_xdp_buf(rx_ring, 0);
+		ntu = 0;
+		count -= nb_buffs_extra;
+		ice_release_rx_desc(rx_ring, 0);
+	}
+
+	nb_buffs = ice_fill_rx_descs(rx_ring->xsk_pool, xdp, rx_desc, count);
+
 	ntu += nb_buffs;
 	if (ntu == rx_ring->count)
 		ntu = 0;
 
-	ice_release_rx_desc(rx_ring, ntu);
+	if (rx_ring->next_to_use != ntu)
+		ice_release_rx_desc(rx_ring, ntu);
+
+	return total_count == (nb_buffs_extra + nb_buffs);
+}
+
+/**
+ * ice_alloc_rx_bufs_zc - allocate a number of Rx buffers
+ * @rx_ring: Rx ring
+ * @count: The number of buffers to allocate
+ *
+ * Wrapper for internal allocation routine; figure out how many tail
+ * bumps should take place based on the given threshold
+ *
+ * Returns true if all calls to internal alloc routine succeeded
+ */
+bool ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring, u16 count)
+{
+	u16 rx_thresh = ICE_RING_QUARTER(rx_ring);
+	u16 batched, leftover, i, tail_bumps;
+
+	batched = count & ~(rx_thresh - 1);
+	tail_bumps = batched / rx_thresh;
+	leftover = count & (rx_thresh - 1);
 
-	return count == nb_buffs;
+	for (i = 0; i < tail_bumps; i++)
+		if (!__ice_alloc_rx_bufs_zc(rx_ring, rx_thresh))
+			return false;
+	return __ice_alloc_rx_bufs_zc(rx_ring, leftover);
 }
 
 /**
-- 
2.33.1



* [PATCH bpf-next v3 3/7] ice: make Tx threshold dependent on ring length
  2022-01-21 12:00 [PATCH v3 bpf-next 0/7] xsk: Intel driver improvements Maciej Fijalkowski
  2022-01-21 12:00 ` [PATCH bpf-next v3 1/7] ice: remove likely for napi_complete_done Maciej Fijalkowski
  2022-01-21 12:00 ` [PATCH bpf-next v3 2/7] ice: xsk: handle SW XDP ring wrap and bump tail more often Maciej Fijalkowski
@ 2022-01-21 12:00 ` Maciej Fijalkowski
  2022-01-21 12:40   ` Rx: " Alexander Lobakin
  2022-01-21 12:00 ` [PATCH bpf-next v3 4/7] i40e: xsk: move tmp desc array from driver to pool Maciej Fijalkowski
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Maciej Fijalkowski @ 2022-01-21 12:00 UTC (permalink / raw)
  To: bpf, ast, daniel
  Cc: netdev, magnus.karlsson, alexandr.lobakin, Maciej Fijalkowski

XDP_TX workloads use the concept of a Tx threshold that indicates the
interval of setting the RS bit on descriptors, which in turn tells the
HW to generate an interrupt to signal the completion of Tx on the HW
side. It is currently based on a constant value of 32, which might not
work out well for various ring sizes combined with, for example, the
batch size that can be set via SO_BUSY_POLL_BUDGET.

Internal tests based on AF_XDP showed that the most convenient setting
of the mentioned threshold is a quarter of the ring length.

Introduce a @tx_thresh field in the ice_tx_ring struct that will store
the threshold value and use it in favor of ICE_TX_THRESH.
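
For example, with the 1k rings used for the numbers in the cover letter
(assuming 1k means 1024 descriptors), this works out to:

	tx_thresh = ICE_RING_QUARTER(ring) = 1024 / 4 = 256
	next_rs = next_dd = tx_thresh - 1 = 255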

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_ethtool.c  |  3 +++
 drivers/net/ethernet/intel/ice/ice_main.c     |  5 +++--
 drivers/net/ethernet/intel/ice/ice_txrx.h     |  7 ++++++-
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 14 ++++++++------
 4 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c b/drivers/net/ethernet/intel/ice/ice_ethtool.c
index e2e3ef7fba7f..bfa5e5d167ab 100644
--- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
+++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
@@ -2803,6 +2803,9 @@ ice_set_ringparam(struct net_device *netdev, struct ethtool_ringparam *ring,
 		/* clone ring and setup updated count */
 		xdp_rings[i] = *vsi->xdp_rings[i];
 		xdp_rings[i].count = new_tx_cnt;
+		xdp_rings[i].tx_thresh = ice_get_tx_threshold(&xdp_rings[i]);
+		xdp_rings[i].next_dd = xdp_rings[i].tx_thresh - 1;
+		xdp_rings[i].next_rs = xdp_rings[i].tx_thresh - 1;
 		xdp_rings[i].desc = NULL;
 		xdp_rings[i].tx_buf = NULL;
 		err = ice_setup_tx_ring(&xdp_rings[i]);
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 30814435f779..0fd12a7f6d22 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -2495,10 +2495,11 @@ static int ice_xdp_alloc_setup_rings(struct ice_vsi *vsi)
 		xdp_ring->reg_idx = vsi->txq_map[xdp_q_idx];
 		xdp_ring->vsi = vsi;
 		xdp_ring->netdev = NULL;
-		xdp_ring->next_dd = ICE_TX_THRESH - 1;
-		xdp_ring->next_rs = ICE_TX_THRESH - 1;
 		xdp_ring->dev = dev;
 		xdp_ring->count = vsi->num_tx_desc;
+		xdp_ring->tx_thresh = ice_get_tx_threshold(xdp_ring);
+		xdp_ring->next_dd = xdp_ring->tx_thresh - 1;
+		xdp_ring->next_rs = xdp_ring->tx_thresh - 1;
 		WRITE_ONCE(vsi->xdp_rings[i], xdp_ring);
 		if (ice_setup_tx_ring(xdp_ring))
 			goto free_xdp_rings;
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
index 94a46e0e5ed0..09c8ad2f7403 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -13,7 +13,6 @@
 #define ICE_MAX_CHAINED_RX_BUFS	5
 #define ICE_MAX_BUF_TXD		8
 #define ICE_MIN_TX_LEN		17
-#define ICE_TX_THRESH		32
 
 /* The size limit for a transmit buffer in a descriptor is (16K - 1).
  * In order to align with the read requests we will align the value to
@@ -333,6 +332,7 @@ struct ice_tx_ring {
 	struct ice_channel *ch;
 	struct ice_ptp_tx *tx_tstamps;
 	spinlock_t tx_lock;
+	u16 tx_thresh;
 	u32 txq_teid;			/* Added Tx queue TEID */
 #define ICE_TX_FLAGS_RING_XDP		BIT(0)
 	u8 flags;
@@ -355,6 +355,11 @@ static inline void ice_clear_ring_build_skb_ena(struct ice_rx_ring *ring)
 	ring->flags &= ~ICE_RX_FLAGS_RING_BUILD_SKB;
 }
 
+static inline u16 ice_get_tx_threshold(struct ice_tx_ring *tx_ring)
+{
+	return ICE_RING_QUARTER(tx_ring);
+}
+
 static inline bool ice_ring_ch_enabled(struct ice_tx_ring *ring)
 {
 	return !!ring->ch;
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
index 0e87b98e0966..5706b5405373 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -222,6 +222,7 @@ ice_receive_skb(struct ice_rx_ring *rx_ring, struct sk_buff *skb, u16 vlan_tag)
 static void ice_clean_xdp_irq(struct ice_tx_ring *xdp_ring)
 {
 	unsigned int total_bytes = 0, total_pkts = 0;
+	u16 tx_thresh = xdp_ring->tx_thresh;
 	u16 ntc = xdp_ring->next_to_clean;
 	struct ice_tx_desc *next_dd_desc;
 	u16 next_dd = xdp_ring->next_dd;
@@ -233,7 +234,7 @@ static void ice_clean_xdp_irq(struct ice_tx_ring *xdp_ring)
 	    cpu_to_le64(ICE_TX_DESC_DTYPE_DESC_DONE)))
 		return;
 
-	for (i = 0; i < ICE_TX_THRESH; i++) {
+	for (i = 0; i < tx_thresh; i++) {
 		tx_buf = &xdp_ring->tx_buf[ntc];
 
 		total_bytes += tx_buf->bytecount;
@@ -254,9 +255,9 @@ static void ice_clean_xdp_irq(struct ice_tx_ring *xdp_ring)
 	}
 
 	next_dd_desc->cmd_type_offset_bsz = 0;
-	xdp_ring->next_dd = xdp_ring->next_dd + ICE_TX_THRESH;
+	xdp_ring->next_dd = xdp_ring->next_dd + tx_thresh;
 	if (xdp_ring->next_dd > xdp_ring->count)
-		xdp_ring->next_dd = ICE_TX_THRESH - 1;
+		xdp_ring->next_dd = tx_thresh - 1;
 	xdp_ring->next_to_clean = ntc;
 	ice_update_tx_ring_stats(xdp_ring, total_pkts, total_bytes);
 }
@@ -269,12 +270,13 @@ static void ice_clean_xdp_irq(struct ice_tx_ring *xdp_ring)
  */
 int ice_xmit_xdp_ring(void *data, u16 size, struct ice_tx_ring *xdp_ring)
 {
+	u16 tx_thresh = xdp_ring->tx_thresh;
 	u16 i = xdp_ring->next_to_use;
 	struct ice_tx_desc *tx_desc;
 	struct ice_tx_buf *tx_buf;
 	dma_addr_t dma;
 
-	if (ICE_DESC_UNUSED(xdp_ring) < ICE_TX_THRESH)
+	if (ICE_DESC_UNUSED(xdp_ring) < tx_thresh)
 		ice_clean_xdp_irq(xdp_ring);
 
 	if (!unlikely(ICE_DESC_UNUSED(xdp_ring))) {
@@ -306,7 +308,7 @@ int ice_xmit_xdp_ring(void *data, u16 size, struct ice_tx_ring *xdp_ring)
 		tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_rs);
 		tx_desc->cmd_type_offset_bsz |=
 			cpu_to_le64(ICE_TX_DESC_CMD_RS << ICE_TXD_QW1_CMD_S);
-		xdp_ring->next_rs = ICE_TX_THRESH - 1;
+		xdp_ring->next_rs = tx_thresh - 1;
 	}
 	xdp_ring->next_to_use = i;
 
@@ -314,7 +316,7 @@ int ice_xmit_xdp_ring(void *data, u16 size, struct ice_tx_ring *xdp_ring)
 		tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_rs);
 		tx_desc->cmd_type_offset_bsz |=
 			cpu_to_le64(ICE_TX_DESC_CMD_RS << ICE_TXD_QW1_CMD_S);
-		xdp_ring->next_rs += ICE_TX_THRESH;
+		xdp_ring->next_rs += tx_thresh;
 	}
 
 	return ICE_XDP_TX;
-- 
2.33.1



* [PATCH bpf-next v3 4/7] i40e: xsk: move tmp desc array from driver to pool
  2022-01-21 12:00 [PATCH v3 bpf-next 0/7] xsk: Intel driver improvements Maciej Fijalkowski
                   ` (2 preceding siblings ...)
  2022-01-21 12:00 ` [PATCH bpf-next v3 3/7] ice: make Tx threshold dependent on ring length Maciej Fijalkowski
@ 2022-01-21 12:00 ` Maciej Fijalkowski
  2022-01-21 12:00 ` [PATCH bpf-next v3 5/7] ice: xsk: avoid potential dead AF_XDP Tx processing Maciej Fijalkowski
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Maciej Fijalkowski @ 2022-01-21 12:00 UTC (permalink / raw)
  To: bpf, ast, daniel; +Cc: netdev, magnus.karlsson, alexandr.lobakin

From: Magnus Karlsson <magnus.karlsson@intel.com>

Move desc_array from the driver to the pool. The reason behind this is
that we can then reuse this array as temporary storage for descriptors
in all zero-copy drivers that use the batched interface. This will make
it easier to add batching to more drivers.

i40e is the only driver that currently has a batched Tx zero-copy
implementation, so there is no need to touch any other driver.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 11 -----------
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |  1 -
 drivers/net/ethernet/intel/i40e/i40e_xsk.c  |  4 ++--
 include/net/xdp_sock_drv.h                  |  5 ++---
 include/net/xsk_buff_pool.h                 |  1 +
 net/xdp/xsk.c                               | 13 ++++++-------
 net/xdp/xsk_buff_pool.c                     |  7 +++++++
 net/xdp/xsk_queue.h                         | 12 ++++++------
 8 files changed, 24 insertions(+), 30 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 66cc79500c10..af9c88e71452 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -830,8 +830,6 @@ void i40e_free_tx_resources(struct i40e_ring *tx_ring)
 	i40e_clean_tx_ring(tx_ring);
 	kfree(tx_ring->tx_bi);
 	tx_ring->tx_bi = NULL;
-	kfree(tx_ring->xsk_descs);
-	tx_ring->xsk_descs = NULL;
 
 	if (tx_ring->desc) {
 		dma_free_coherent(tx_ring->dev, tx_ring->size,
@@ -1433,13 +1431,6 @@ int i40e_setup_tx_descriptors(struct i40e_ring *tx_ring)
 	if (!tx_ring->tx_bi)
 		goto err;
 
-	if (ring_is_xdp(tx_ring)) {
-		tx_ring->xsk_descs = kcalloc(I40E_MAX_NUM_DESCRIPTORS, sizeof(*tx_ring->xsk_descs),
-					     GFP_KERNEL);
-		if (!tx_ring->xsk_descs)
-			goto err;
-	}
-
 	u64_stats_init(&tx_ring->syncp);
 
 	/* round up to nearest 4K */
@@ -1463,8 +1454,6 @@ int i40e_setup_tx_descriptors(struct i40e_ring *tx_ring)
 	return 0;
 
 err:
-	kfree(tx_ring->xsk_descs);
-	tx_ring->xsk_descs = NULL;
 	kfree(tx_ring->tx_bi);
 	tx_ring->tx_bi = NULL;
 	return -ENOMEM;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index bfc2845c99d1..f6d91fa1562e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -390,7 +390,6 @@ struct i40e_ring {
 	u16 rx_offset;
 	struct xdp_rxq_info xdp_rxq;
 	struct xsk_buff_pool *xsk_pool;
-	struct xdp_desc *xsk_descs;      /* For storing descriptors in the AF_XDP ZC path */
 } ____cacheline_internodealigned_in_smp;
 
 static inline bool ring_uses_build_skb(struct i40e_ring *ring)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 945b1bb9c6f4..9f349aaca9ff 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -467,11 +467,11 @@ static void i40e_set_rs_bit(struct i40e_ring *xdp_ring)
  **/
 static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
 {
-	struct xdp_desc *descs = xdp_ring->xsk_descs;
+	struct xdp_desc *descs = xdp_ring->xsk_pool->tx_descs;
 	u32 nb_pkts, nb_processed = 0;
 	unsigned int total_bytes = 0;
 
-	nb_pkts = xsk_tx_peek_release_desc_batch(xdp_ring->xsk_pool, descs, budget);
+	nb_pkts = xsk_tx_peek_release_desc_batch(xdp_ring->xsk_pool, budget);
 	if (!nb_pkts)
 		return true;
 
diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 443d45951564..4aa031849668 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -13,7 +13,7 @@
 
 void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries);
 bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc);
-u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, struct xdp_desc *desc, u32 max);
+u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 max);
 void xsk_tx_release(struct xsk_buff_pool *pool);
 struct xsk_buff_pool *xsk_get_pool_from_qid(struct net_device *dev,
 					    u16 queue_id);
@@ -142,8 +142,7 @@ static inline bool xsk_tx_peek_desc(struct xsk_buff_pool *pool,
 	return false;
 }
 
-static inline u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, struct xdp_desc *desc,
-						 u32 max)
+static inline u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 max)
 {
 	return 0;
 }
diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index ddeefc4a1040..5554ee75e7da 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -60,6 +60,7 @@ struct xsk_buff_pool {
 	 */
 	dma_addr_t *dma_pages;
 	struct xdp_buff_xsk *heads;
+	struct xdp_desc *tx_descs;
 	u64 chunk_mask;
 	u64 addrs_cnt;
 	u32 free_list_cnt;
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 28ef3f4465ae..2abd64e4d589 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -343,9 +343,9 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc)
 }
 EXPORT_SYMBOL(xsk_tx_peek_desc);
 
-static u32 xsk_tx_peek_release_fallback(struct xsk_buff_pool *pool, struct xdp_desc *descs,
-					u32 max_entries)
+static u32 xsk_tx_peek_release_fallback(struct xsk_buff_pool *pool, u32 max_entries)
 {
+	struct xdp_desc *descs = pool->tx_descs;
 	u32 nb_pkts = 0;
 
 	while (nb_pkts < max_entries && xsk_tx_peek_desc(pool, &descs[nb_pkts]))
@@ -355,8 +355,7 @@ static u32 xsk_tx_peek_release_fallback(struct xsk_buff_pool *pool, struct xdp_d
 	return nb_pkts;
 }
 
-u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, struct xdp_desc *descs,
-				   u32 max_entries)
+u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 max_entries)
 {
 	struct xdp_sock *xs;
 	u32 nb_pkts;
@@ -365,7 +364,7 @@ u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, struct xdp_desc *
 	if (!list_is_singular(&pool->xsk_tx_list)) {
 		/* Fallback to the non-batched version */
 		rcu_read_unlock();
-		return xsk_tx_peek_release_fallback(pool, descs, max_entries);
+		return xsk_tx_peek_release_fallback(pool, max_entries);
 	}
 
 	xs = list_first_or_null_rcu(&pool->xsk_tx_list, struct xdp_sock, tx_list);
@@ -374,7 +373,7 @@ u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, struct xdp_desc *
 		goto out;
 	}
 
-	nb_pkts = xskq_cons_peek_desc_batch(xs->tx, descs, pool, max_entries);
+	nb_pkts = xskq_cons_peek_desc_batch(xs->tx, pool, max_entries);
 	if (!nb_pkts) {
 		xs->tx->queue_empty_descs++;
 		goto out;
@@ -386,7 +385,7 @@ u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, struct xdp_desc *
 	 * packets. This avoids having to implement any buffering in
 	 * the Tx path.
 	 */
-	nb_pkts = xskq_prod_reserve_addr_batch(pool->cq, descs, nb_pkts);
+	nb_pkts = xskq_prod_reserve_addr_batch(pool->cq, pool->tx_descs, nb_pkts);
 	if (!nb_pkts)
 		goto out;
 
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index fd39bb660ebc..b34fca6ada86 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -37,6 +37,7 @@ void xp_destroy(struct xsk_buff_pool *pool)
 	if (!pool)
 		return;
 
+	kvfree(pool->tx_descs);
 	kvfree(pool->heads);
 	kvfree(pool);
 }
@@ -58,6 +59,12 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
 	if (!pool->heads)
 		goto out;
 
+	if (xs->tx) {
+		pool->tx_descs = kcalloc(xs->tx->nentries, sizeof(*pool->tx_descs), GFP_KERNEL);
+		if (!pool->tx_descs)
+			goto out;
+	}
+
 	pool->chunk_mask = ~((u64)umem->chunk_size - 1);
 	pool->addrs_cnt = umem->size;
 	pool->heads_cnt = umem->chunks;
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index e9aa2c236356..638138fbe475 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -205,11 +205,11 @@ static inline bool xskq_cons_read_desc(struct xsk_queue *q,
 	return false;
 }
 
-static inline u32 xskq_cons_read_desc_batch(struct xsk_queue *q,
-					    struct xdp_desc *descs,
-					    struct xsk_buff_pool *pool, u32 max)
+static inline u32 xskq_cons_read_desc_batch(struct xsk_queue *q, struct xsk_buff_pool *pool,
+					    u32 max)
 {
 	u32 cached_cons = q->cached_cons, nb_entries = 0;
+	struct xdp_desc *descs = pool->tx_descs;
 
 	while (cached_cons != q->cached_prod && nb_entries < max) {
 		struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
@@ -282,12 +282,12 @@ static inline bool xskq_cons_peek_desc(struct xsk_queue *q,
 	return xskq_cons_read_desc(q, desc, pool);
 }
 
-static inline u32 xskq_cons_peek_desc_batch(struct xsk_queue *q, struct xdp_desc *descs,
-					    struct xsk_buff_pool *pool, u32 max)
+static inline u32 xskq_cons_peek_desc_batch(struct xsk_queue *q, struct xsk_buff_pool *pool,
+					    u32 max)
 {
 	u32 entries = xskq_cons_nb_entries(q, max);
 
-	return xskq_cons_read_desc_batch(q, descs, pool, entries);
+	return xskq_cons_read_desc_batch(q, pool, entries);
 }
 
 /* To improve performance in the xskq_cons_release functions, only update local state here.
-- 
2.33.1



* [PATCH bpf-next v3 5/7] ice: xsk: avoid potential dead AF_XDP Tx processing
  2022-01-21 12:00 [PATCH v3 bpf-next 0/7] xsk: Intel driver improvements Maciej Fijalkowski
                   ` (3 preceding siblings ...)
  2022-01-21 12:00 ` [PATCH bpf-next v3 4/7] i40e: xsk: move tmp desc array from driver to pool Maciej Fijalkowski
@ 2022-01-21 12:00 ` Maciej Fijalkowski
  2022-01-21 12:00 ` [PATCH bpf-next v3 6/7] ice: xsk: improve AF_XDP ZC Tx and use batching API Maciej Fijalkowski
  2022-01-21 12:00 ` [PATCH bpf-next v3 7/7] ice: xsk: borrow xdp_tx_active logic from i40e Maciej Fijalkowski
  6 siblings, 0 replies; 16+ messages in thread
From: Maciej Fijalkowski @ 2022-01-21 12:00 UTC (permalink / raw)
  To: bpf, ast, daniel
  Cc: netdev, magnus.karlsson, alexandr.lobakin, Maciej Fijalkowski

Commit 9610bd988df9 ("ice: optimize XDP_TX workloads") introduced
@next_dd and @next_rs to the ice_tx_ring struct. Currently, their state
is not restored in ice_clean_tx_ring(), which has not caused any trouble
as the XDP rings are gone after we're done with the XDP prog on the
interface.

For the upcoming usage of the mentioned fields in AF_XDP, this might
expose us to a potential dead Tx side. The scenario would look like the
following (based on xdpsock):

- two xdpsock instances are spawned in Tx mode
- one of them is killed
- XDP prog is kept on the interface due to the other xdpsock still
  running
  * this means that the XDP rings stayed in place
- xdpsock is launched again on the same queue id that was terminated on
- the @next_dd and @next_rs settings are bogus, therefore the transmit
  side is broken

To protect us from the above, restore the default @next_rs and @next_dd
values when cleaning the Tx ring.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_txrx.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index e661d0e45b9b..bfb9158b10a4 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -173,6 +173,9 @@ void ice_clean_tx_ring(struct ice_tx_ring *tx_ring)
 
 	tx_ring->next_to_use = 0;
 	tx_ring->next_to_clean = 0;
+	tx_ring->tx_thresh = ice_get_tx_threshold(tx_ring);
+	tx_ring->next_dd = tx_ring->tx_thresh - 1;
+	tx_ring->next_rs = tx_ring->tx_thresh - 1;
 
 	if (!tx_ring->netdev)
 		return;
-- 
2.33.1



* [PATCH bpf-next v3 6/7] ice: xsk: improve AF_XDP ZC Tx and use batching API
  2022-01-21 12:00 [PATCH v3 bpf-next 0/7] xsk: Intel driver improvements Maciej Fijalkowski
                   ` (4 preceding siblings ...)
  2022-01-21 12:00 ` [PATCH bpf-next v3 5/7] ice: xsk: avoid potential dead AF_XDP Tx processing Maciej Fijalkowski
@ 2022-01-21 12:00 ` Maciej Fijalkowski
  2022-01-21 12:54   ` Alexander Lobakin
  2022-01-21 12:00 ` [PATCH bpf-next v3 7/7] ice: xsk: borrow xdp_tx_active logic from i40e Maciej Fijalkowski
  6 siblings, 1 reply; 16+ messages in thread
From: Maciej Fijalkowski @ 2022-01-21 12:00 UTC (permalink / raw)
  To: bpf, ast, daniel
  Cc: netdev, magnus.karlsson, alexandr.lobakin, Maciej Fijalkowski

Apply the logic that was done for regular XDP in commit 9610bd988df9
("ice: optimize XDP_TX workloads") to the ZC side of the driver. On top
of that, introduce batching to Tx that is inspired by i40e's
implementation, with adjustments to the cleaning logic - take the NAPI
budget into account in ice_clean_xdp_irq_zc().

Separating the stats structs onto separate cache lines seemed to improve
performance.
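
The Tx side of ice_napi_poll() then boils down to the following,
simplified from the diff below:

	/* ice_xmit_zc(), simplified */
	if (budget < tx_thresh)
		budget += ice_clean_xdp_irq_zc(xdp_ring, napi_budget);

	nb_pkts = xsk_tx_peek_release_desc_batch(xdp_ring->xsk_pool, budget);
	if (!nb_pkts)
		return true;

	/* produce HW Tx descriptors out of descs[]; the real code splits
	 * this on a ring wrap and sets the RS bit every tx_thresh descs
	 */
	ice_fill_tx_hw_ring(xdp_ring, descs, nb_pkts, &total_bytes);

	ice_xdp_ring_update_tail(xdp_ring);
	ice_update_tx_ring_stats(xdp_ring, nb_pkts, total_bytes);

	return nb_pkts < budget;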

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_txrx.c |   2 +-
 drivers/net/ethernet/intel/ice/ice_txrx.h |   2 +-
 drivers/net/ethernet/intel/ice/ice_xsk.c  | 256 ++++++++++++++--------
 drivers/net/ethernet/intel/ice/ice_xsk.h  |  27 ++-
 4 files changed, 186 insertions(+), 101 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index bfb9158b10a4..7ab8c700c884 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -1463,7 +1463,7 @@ int ice_napi_poll(struct napi_struct *napi, int budget)
 		bool wd;
 
 		if (tx_ring->xsk_pool)
-			wd = ice_clean_tx_irq_zc(tx_ring, budget);
+			wd = ice_xmit_zc(tx_ring, ICE_DESC_UNUSED(tx_ring), budget);
 		else if (ice_ring_is_xdp(tx_ring))
 			wd = true;
 		else
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
index 09c8ad2f7403..191f9b8c50ee 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -322,9 +322,9 @@ struct ice_tx_ring {
 	u16 count;			/* Number of descriptors */
 	u16 q_index;			/* Queue number of ring */
 	/* stats structs */
+	struct ice_txq_stats tx_stats;
 	struct ice_q_stats	stats;
 	struct u64_stats_sync syncp;
-	struct ice_txq_stats tx_stats;
 
 	/* CL3 - 3rd cacheline starts here */
 	struct rcu_head rcu;		/* to avoid race on free */
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 0463fc594d08..4b6e54f75af6 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -671,134 +671,208 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
 }
 
 /**
- * ice_xmit_zc - Completes AF_XDP entries, and cleans XDP entries
+ * ice_clean_xdp_tx_buf - Free and unmap XDP Tx buffer
  * @xdp_ring: XDP Tx ring
- * @budget: max number of frames to xmit
- *
- * Returns true if cleanup/transmission is done.
+ * @tx_buf: Tx buffer to clean
  */
-static bool ice_xmit_zc(struct ice_tx_ring *xdp_ring, int budget)
+static void
+ice_clean_xdp_tx_buf(struct ice_tx_ring *xdp_ring, struct ice_tx_buf *tx_buf)
 {
-	struct ice_tx_desc *tx_desc = NULL;
-	bool work_done = true;
-	struct xdp_desc desc;
-	dma_addr_t dma;
-
-	while (likely(budget-- > 0)) {
-		struct ice_tx_buf *tx_buf;
-
-		if (unlikely(!ICE_DESC_UNUSED(xdp_ring))) {
-			xdp_ring->tx_stats.tx_busy++;
-			work_done = false;
-			break;
-		}
+	xdp_return_frame((struct xdp_frame *)tx_buf->raw_buf);
+	dma_unmap_single(xdp_ring->dev, dma_unmap_addr(tx_buf, dma),
+			 dma_unmap_len(tx_buf, len), DMA_TO_DEVICE);
+	dma_unmap_len_set(tx_buf, len, 0);
+}
 
-		tx_buf = &xdp_ring->tx_buf[xdp_ring->next_to_use];
+/**
+ * ice_clean_xdp_irq_zc - Reclaim resources after transmit completes on XDP ring
+ * @xdp_ring: XDP ring to clean
+ * @napi_budget: amount of descriptors that NAPI allows us to clean
+ *
+ * Returns count of cleaned descriptors
+ */
+static u16 ice_clean_xdp_irq_zc(struct ice_tx_ring *xdp_ring, int napi_budget)
+{
+	u16 tx_thresh = xdp_ring->tx_thresh;
+	int budget = napi_budget / tx_thresh;
+	u16 ntc = xdp_ring->next_to_clean;
+	struct ice_tx_desc *next_dd_desc;
+	u16 next_dd = xdp_ring->next_dd;
+	u16 desc_cnt = xdp_ring->count;
+	struct ice_tx_buf *tx_buf;
+	u16 cleared_dds = 0;
+	u32 xsk_frames = 0;
+	u16 i;
 
-		if (!xsk_tx_peek_desc(xdp_ring->xsk_pool, &desc))
+	do {
+		next_dd_desc = ICE_TX_DESC(xdp_ring, next_dd);
+		if (!(next_dd_desc->cmd_type_offset_bsz &
+		    cpu_to_le64(ICE_TX_DESC_DTYPE_DESC_DONE)))
 			break;
 
-		dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc.addr);
-		xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma,
-						 desc.len);
+		cleared_dds++;
+		xsk_frames = 0;
 
-		tx_buf->bytecount = desc.len;
+		for (i = 0; i < tx_thresh; i++) {
+			tx_buf = &xdp_ring->tx_buf[ntc];
 
-		tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_to_use);
-		tx_desc->buf_addr = cpu_to_le64(dma);
-		tx_desc->cmd_type_offset_bsz =
-			ice_build_ctob(ICE_TXD_LAST_DESC_CMD, 0, desc.len, 0);
+			if (tx_buf->raw_buf) {
+				ice_clean_xdp_tx_buf(xdp_ring, tx_buf);
+				tx_buf->raw_buf = NULL;
+			} else {
+				xsk_frames++;
+			}
 
-		xdp_ring->next_to_use++;
-		if (xdp_ring->next_to_use == xdp_ring->count)
-			xdp_ring->next_to_use = 0;
-	}
+			ntc++;
+			if (ntc >= xdp_ring->count)
+				ntc = 0;
+		}
+		if (xsk_frames)
+			xsk_tx_completed(xdp_ring->xsk_pool, xsk_frames);
+		next_dd_desc->cmd_type_offset_bsz = 0;
+		next_dd = next_dd + tx_thresh;
+		if (next_dd >= desc_cnt)
+			next_dd = tx_thresh - 1;
+	} while (budget--);
 
-	if (tx_desc) {
-		ice_xdp_ring_update_tail(xdp_ring);
-		xsk_tx_release(xdp_ring->xsk_pool);
-	}
+	xdp_ring->next_to_clean = ntc;
+	xdp_ring->next_dd = next_dd;
 
-	return budget > 0 && work_done;
+	return cleared_dds * tx_thresh;
 }
 
 /**
- * ice_clean_xdp_tx_buf - Free and unmap XDP Tx buffer
- * @xdp_ring: XDP Tx ring
- * @tx_buf: Tx buffer to clean
+ * ice_xmit_pkt - produce a single HW Tx descriptor out of AF_XDP descriptor
+ * @xdp_ring: XDP ring to produce the HW Tx descriptor on
+ * @desc: AF_XDP descriptor to pull the DMA address and length from
+ * @total_bytes: bytes accumulator that will be used for stats update
  */
-static void
-ice_clean_xdp_tx_buf(struct ice_tx_ring *xdp_ring, struct ice_tx_buf *tx_buf)
+static void ice_xmit_pkt(struct ice_tx_ring *xdp_ring, struct xdp_desc *desc,
+			 unsigned int *total_bytes)
 {
-	xdp_return_frame((struct xdp_frame *)tx_buf->raw_buf);
-	dma_unmap_single(xdp_ring->dev, dma_unmap_addr(tx_buf, dma),
-			 dma_unmap_len(tx_buf, len), DMA_TO_DEVICE);
-	dma_unmap_len_set(tx_buf, len, 0);
+	struct ice_tx_desc *tx_desc;
+	dma_addr_t dma;
+
+	dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc->addr);
+	xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma, desc->len);
+
+	tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_to_use++);
+	tx_desc->buf_addr = cpu_to_le64(dma);
+	tx_desc->cmd_type_offset_bsz = ice_build_ctob(ICE_TX_DESC_CMD_EOP,
+						      0, desc->len, 0);
+
+	*total_bytes += desc->len;
 }
 
 /**
- * ice_clean_tx_irq_zc - Completes AF_XDP entries, and cleans XDP entries
- * @xdp_ring: XDP Tx ring
- * @budget: NAPI budget
- *
- * Returns true if cleanup/tranmission is done.
+ * ice_xmit_pkt_batch - produce a batch of HW Tx descriptors out of AF_XDP descriptors
+ * @xdp_ring: XDP ring to produce the HW Tx descriptors on
+ * @descs: AF_XDP descriptors to pull the DMA addresses and lengths from
+ * @total_bytes: bytes accumulator that will be used for stats update
  */
-bool ice_clean_tx_irq_zc(struct ice_tx_ring *xdp_ring, int budget)
+static void ice_xmit_pkt_batch(struct ice_tx_ring *xdp_ring, struct xdp_desc *descs,
+			       unsigned int *total_bytes)
 {
-	int total_packets = 0, total_bytes = 0;
-	s16 ntc = xdp_ring->next_to_clean;
+	u16 tx_thresh = xdp_ring->tx_thresh;
+	u16 ntu = xdp_ring->next_to_use;
 	struct ice_tx_desc *tx_desc;
-	struct ice_tx_buf *tx_buf;
-	u32 xsk_frames = 0;
-	bool xmit_done;
+	dma_addr_t dma;
+	u32 i;
 
-	tx_desc = ICE_TX_DESC(xdp_ring, ntc);
-	tx_buf = &xdp_ring->tx_buf[ntc];
-	ntc -= xdp_ring->count;
+	loop_unrolled_for(i = 0; i < PKTS_PER_BATCH; i++) {
+		dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, descs[i].addr);
+		xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma, descs[i].len);
 
-	do {
-		if (!(tx_desc->cmd_type_offset_bsz &
-		      cpu_to_le64(ICE_TX_DESC_DTYPE_DESC_DONE)))
-			break;
+		tx_desc = ICE_TX_DESC(xdp_ring, ntu++);
+		tx_desc->buf_addr = cpu_to_le64(dma);
+		tx_desc->cmd_type_offset_bsz = ice_build_ctob(ICE_TX_DESC_CMD_EOP,
+							      0, descs[i].len, 0);
 
-		total_bytes += tx_buf->bytecount;
-		total_packets++;
+		*total_bytes += descs[i].len;
+	}
 
-		if (tx_buf->raw_buf) {
-			ice_clean_xdp_tx_buf(xdp_ring, tx_buf);
-			tx_buf->raw_buf = NULL;
-		} else {
-			xsk_frames++;
-		}
+	xdp_ring->next_to_use = ntu;
 
-		tx_desc->cmd_type_offset_bsz = 0;
-		tx_buf++;
-		tx_desc++;
-		ntc++;
+	if (xdp_ring->next_to_use > xdp_ring->next_rs) {
+		tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_rs);
+		tx_desc->cmd_type_offset_bsz |=
+			cpu_to_le64(ICE_TX_DESC_CMD_RS << ICE_TXD_QW1_CMD_S);
+		xdp_ring->next_rs += tx_thresh;
+	}
+}
 
-		if (unlikely(!ntc)) {
-			ntc -= xdp_ring->count;
-			tx_buf = xdp_ring->tx_buf;
-			tx_desc = ICE_TX_DESC(xdp_ring, 0);
-		}
+/**
+ * ice_fill_tx_hw_ring - produce the number of Tx descriptors onto ring
+ * @xdp_ring: XDP ring to produce the HW Tx descriptors on
+ * @descs: AF_XDP descriptors to pull the DMA addresses and lengths from
+ * @nb_pkts: count of packets to be send
+ * @total_bytes: bytes accumulator that will be used for stats update
+ */
+static void ice_fill_tx_hw_ring(struct ice_tx_ring *xdp_ring, struct xdp_desc *descs,
+				u32 nb_pkts, unsigned int *total_bytes)
+{
+	u16 tx_thresh = xdp_ring->tx_thresh;
+	struct ice_tx_desc *tx_desc;
+	u32 batched, leftover, i;
+
+	batched = nb_pkts & ~(PKTS_PER_BATCH - 1);
+	leftover = nb_pkts & (PKTS_PER_BATCH - 1);
+	for (i = 0; i < batched; i += PKTS_PER_BATCH)
+		ice_xmit_pkt_batch(xdp_ring, &descs[i], total_bytes);
+	for (i = batched; i < batched + leftover; i++)
+		ice_xmit_pkt(xdp_ring, &descs[i], total_bytes);
+
+	if (xdp_ring->next_to_use > xdp_ring->next_rs) {
+		tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_rs);
+		tx_desc->cmd_type_offset_bsz |=
+			cpu_to_le64(ICE_TX_DESC_CMD_RS << ICE_TXD_QW1_CMD_S);
+		xdp_ring->next_rs += tx_thresh;
+	}
+}
 
-		prefetch(tx_desc);
+/**
+ * ice_xmit_zc - take entries from XSK Tx ring and place them onto HW Tx ring
+ * @xdp_ring: XDP ring to produce the HW Tx descriptors on
+ * @budget: number of free descriptors on HW Tx ring that can be used
+ * @napi_budget: amount of descriptors that NAPI allows us to clean
+ *
+ * Returns true if there is no more work that needs to be done, false otherwise
+ */
+bool ice_xmit_zc(struct ice_tx_ring *xdp_ring, u32 budget, int napi_budget)
+{
+	struct xdp_desc *descs = xdp_ring->xsk_pool->tx_descs;
+	u16 tx_thresh = xdp_ring->tx_thresh;
+	u32 nb_pkts, nb_processed = 0;
+	unsigned int total_bytes = 0;
+	struct ice_tx_desc *tx_desc;
 
-	} while (likely(--budget));
+	if (budget < tx_thresh)
+		budget += ice_clean_xdp_irq_zc(xdp_ring, napi_budget);
+
+	nb_pkts = xsk_tx_peek_release_desc_batch(xdp_ring->xsk_pool, budget);
+	if (!nb_pkts)
+		return true;
+
+	if (xdp_ring->next_to_use + nb_pkts >= xdp_ring->count) {
+		nb_processed = xdp_ring->count - xdp_ring->next_to_use;
+		ice_fill_tx_hw_ring(xdp_ring, descs, nb_processed, &total_bytes);
+		tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_rs);
+		tx_desc->cmd_type_offset_bsz |=
+			cpu_to_le64(ICE_TX_DESC_CMD_RS << ICE_TXD_QW1_CMD_S);
+		xdp_ring->next_rs = tx_thresh - 1;
+		xdp_ring->next_to_use = 0;
+	}
 
-	ntc += xdp_ring->count;
-	xdp_ring->next_to_clean = ntc;
+	ice_fill_tx_hw_ring(xdp_ring, &descs[nb_processed], nb_pkts - nb_processed,
+			    &total_bytes);
 
-	if (xsk_frames)
-		xsk_tx_completed(xdp_ring->xsk_pool, xsk_frames);
+	ice_xdp_ring_update_tail(xdp_ring);
+	ice_update_tx_ring_stats(xdp_ring, nb_pkts, total_bytes);
 
 	if (xsk_uses_need_wakeup(xdp_ring->xsk_pool))
 		xsk_set_tx_need_wakeup(xdp_ring->xsk_pool);
 
-	ice_update_tx_ring_stats(xdp_ring, total_packets, total_bytes);
-	xmit_done = ice_xmit_zc(xdp_ring, ICE_DFLT_IRQ_WORK);
-
-	return budget > 0 && xmit_done;
+	return nb_pkts < budget;
 }
 
 /**
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.h b/drivers/net/ethernet/intel/ice/ice_xsk.h
index 4c7bd8e9dfc4..0cbb5793b5b8 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.h
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.h
@@ -6,19 +6,37 @@
 #include "ice_txrx.h"
 #include "ice.h"
 
+#define PKTS_PER_BATCH 8
+
+#ifdef __clang__
+#define loop_unrolled_for _Pragma("clang loop unroll_count(8)") for
+#elif __GNUC__ >= 4
+#define loop_unrolled_for _Pragma("GCC unroll 8") for
+#else
+#define loop_unrolled_for for
+#endif
+
 struct ice_vsi;
 
 #ifdef CONFIG_XDP_SOCKETS
 int ice_xsk_pool_setup(struct ice_vsi *vsi, struct xsk_buff_pool *pool,
 		       u16 qid);
 int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget);
-bool ice_clean_tx_irq_zc(struct ice_tx_ring *xdp_ring, int budget);
 int ice_xsk_wakeup(struct net_device *netdev, u32 queue_id, u32 flags);
 bool ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring, u16 count);
 bool ice_xsk_any_rx_ring_ena(struct ice_vsi *vsi);
 void ice_xsk_clean_rx_ring(struct ice_rx_ring *rx_ring);
 void ice_xsk_clean_xdp_ring(struct ice_tx_ring *xdp_ring);
+bool ice_xmit_zc(struct ice_tx_ring *xdp_ring, u32 budget, int napi_budget);
 #else
+static inline bool
+ice_xmit_zc(struct ice_tx_ring __always_unused *xdp_ring,
+	    u32 __always_unused budget,
+	    int __always_unused napi_budget)
+{
+	return false;
+}
+
 static inline int
 ice_xsk_pool_setup(struct ice_vsi __always_unused *vsi,
 		   struct xsk_buff_pool __always_unused *pool,
@@ -34,13 +52,6 @@ ice_clean_rx_irq_zc(struct ice_rx_ring __always_unused *rx_ring,
 	return 0;
 }
 
-static inline bool
-ice_clean_tx_irq_zc(struct ice_tx_ring __always_unused *xdp_ring,
-		    int __always_unused budget)
-{
-	return false;
-}
-
 static inline bool
 ice_alloc_rx_bufs_zc(struct ice_rx_ring __always_unused *rx_ring,
 		     u16 __always_unused count)
-- 
2.33.1



* [PATCH bpf-next v3 7/7] ice: xsk: borrow xdp_tx_active logic from i40e
  2022-01-21 12:00 [PATCH v3 bpf-next 0/7] xsk: Intel driver improvements Maciej Fijalkowski
                   ` (5 preceding siblings ...)
  2022-01-21 12:00 ` [PATCH bpf-next v3 6/7] ice: xsk: improve AF_XDP ZC Tx and use batching API Maciej Fijalkowski
@ 2022-01-21 12:00 ` Maciej Fijalkowski
  6 siblings, 0 replies; 16+ messages in thread
From: Maciej Fijalkowski @ 2022-01-21 12:00 UTC (permalink / raw)
  To: bpf, ast, daniel
  Cc: netdev, magnus.karlsson, alexandr.lobakin, Maciej Fijalkowski

One of the things that commit 5574ff7b7b3d ("i40e: optimize AF_XDP Tx
completion path") introduced was the @xdp_tx_active field. Its usage
from i40e can be adapted to the ice driver and give us positive
performance results.

If the descriptor that @next_dd points to has been sent by HW (its DD
bit is set), then we are sure that there are ICE_TX_THRESH descriptors
ready to be cleaned. If @xdp_tx_active is 0, which means that the
related xdp_ring is not used for XDP_{TX, REDIRECT} workloads, then we
know how many XSK entries should be placed in the completion queue, IOW
walking through the ring can be skipped.
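
The core of the cleaning change, simplified from the diff below:

	if (likely(!xdp_ring->xdp_tx_active)) {
		/* no XDP_TX/XDP_REDIRECT buffers in flight - the whole
		 * threshold-worth of entries came from XSK
		 */
		xsk_frames = tx_thresh;
		goto skip;
	}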

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_txrx.h     |  1 +
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c |  1 +
 drivers/net/ethernet/intel/ice/ice_xsk.c      | 15 ++++++++++++---
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
index 191f9b8c50ee..5c1d38ba5275 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -334,6 +334,7 @@ struct ice_tx_ring {
 	spinlock_t tx_lock;
 	u16 tx_thresh;
 	u32 txq_teid;			/* Added Tx queue TEID */
+	u16 xdp_tx_active;
 #define ICE_TX_FLAGS_RING_XDP		BIT(0)
 	u8 flags;
 	u8 dcb_tc;			/* Traffic class of ring */
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
index 5706b5405373..93c61c7feed8 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -302,6 +302,7 @@ int ice_xmit_xdp_ring(void *data, u16 size, struct ice_tx_ring *xdp_ring)
 	tx_desc->cmd_type_offset_bsz = ice_build_ctob(ICE_TX_DESC_CMD_EOP, 0,
 						      size, 0);
 
+	xdp_ring->xdp_tx_active++;
 	i++;
 	if (i == xdp_ring->count) {
 		i = 0;
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 4b6e54f75af6..adf246e05c41 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -679,6 +679,7 @@ static void
 ice_clean_xdp_tx_buf(struct ice_tx_ring *xdp_ring, struct ice_tx_buf *tx_buf)
 {
 	xdp_return_frame((struct xdp_frame *)tx_buf->raw_buf);
+	xdp_ring->xdp_tx_active--;
 	dma_unmap_single(xdp_ring->dev, dma_unmap_addr(tx_buf, dma),
 			 dma_unmap_len(tx_buf, len), DMA_TO_DEVICE);
 	dma_unmap_len_set(tx_buf, len, 0);
@@ -695,12 +696,11 @@ static u16 ice_clean_xdp_irq_zc(struct ice_tx_ring *xdp_ring, int napi_budget)
 {
 	u16 tx_thresh = xdp_ring->tx_thresh;
 	int budget = napi_budget / tx_thresh;
-	u16 ntc = xdp_ring->next_to_clean;
 	struct ice_tx_desc *next_dd_desc;
 	u16 next_dd = xdp_ring->next_dd;
 	u16 desc_cnt = xdp_ring->count;
 	struct ice_tx_buf *tx_buf;
-	u16 cleared_dds = 0;
+	u16 ntc, cleared_dds = 0;
 	u32 xsk_frames = 0;
 	u16 i;
 
@@ -712,6 +712,12 @@ static u16 ice_clean_xdp_irq_zc(struct ice_tx_ring *xdp_ring, int napi_budget)
 
 		cleared_dds++;
 		xsk_frames = 0;
+		if (likely(!xdp_ring->xdp_tx_active)) {
+			xsk_frames = tx_thresh;
+			goto skip;
+		}
+
+		ntc = xdp_ring->next_to_clean;
 
 		for (i = 0; i < tx_thresh; i++) {
 			tx_buf = &xdp_ring->tx_buf[ntc];
@@ -727,6 +733,10 @@ static u16 ice_clean_xdp_irq_zc(struct ice_tx_ring *xdp_ring, int napi_budget)
 			if (ntc >= xdp_ring->count)
 				ntc = 0;
 		}
+skip:
+		xdp_ring->next_to_clean += tx_thresh;
+		if (xdp_ring->next_to_clean >= desc_cnt)
+			xdp_ring->next_to_clean -= desc_cnt;
 		if (xsk_frames)
 			xsk_tx_completed(xdp_ring->xsk_pool, xsk_frames);
 		next_dd_desc->cmd_type_offset_bsz = 0;
@@ -735,7 +745,6 @@ static u16 ice_clean_xdp_irq_zc(struct ice_tx_ring *xdp_ring, int napi_budget)
 			next_dd = tx_thresh - 1;
 	} while (budget--);
 
-	xdp_ring->next_to_clean = ntc;
 	xdp_ring->next_dd = next_dd;
 
 	return cleared_dds * tx_thresh;
-- 
2.33.1



* Re: [PATCH bpf-next v3 2/7] ice: xsk: handle SW XDP ring wrap and bump tail more often
  2022-01-21 12:00 ` [PATCH bpf-next v3 2/7] ice: xsk: handle SW XDP ring wrap and bump tail more often Maciej Fijalkowski
@ 2022-01-21 12:29   ` Alexander Lobakin
  2022-01-21 14:31     ` Maciej Fijalkowski
  0 siblings, 1 reply; 16+ messages in thread
From: Alexander Lobakin @ 2022-01-21 12:29 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Alexander Lobakin, bpf, ast, daniel, netdev, magnus.karlsson

From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Date: Fri, 21 Jan 2022 13:00:06 +0100

> Currently, if ice_clean_rx_irq_zc() processed the whole ring and
> next_to_use != 0, then ice_alloc_rx_bufs_zc() would not refill the whole
> ring even if the XSK buffer pool had enough free entries (either from
> the fill ring or the internal recycle mechanism) - this is because ring
> wrap is not handled.
> 
> Improve the logic in ice_alloc_rx_bufs_zc() to address the problem
> above. Do not clamp the count of buffers that is passed to
> xsk_buff_alloc_batch() when next_to_use + buffer count >=
> rx_ring->count, but rather split it into two calls to the mentioned
> function - one for the part up until the wrap and one for the part after
> the wrap.
> 
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> ---
>  drivers/net/ethernet/intel/ice/ice_txrx.h |  2 +
>  drivers/net/ethernet/intel/ice/ice_xsk.c  | 99 ++++++++++++++++++-----
>  2 files changed, 81 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
> index b7b3bd4816f0..94a46e0e5ed0 100644
> --- a/drivers/net/ethernet/intel/ice/ice_txrx.h
> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
> @@ -111,6 +111,8 @@ static inline int ice_skb_pad(void)
>  	(u16)((((R)->next_to_clean > (R)->next_to_use) ? 0 : (R)->count) + \
>  	      (R)->next_to_clean - (R)->next_to_use - 1)
>  
> +#define ICE_RING_QUARTER(R) ((R)->count / 4)

I would use `>> 2` here just to show off :D

> +
>  #define ICE_TX_FLAGS_TSO	BIT(0)
>  #define ICE_TX_FLAGS_HW_VLAN	BIT(1)
>  #define ICE_TX_FLAGS_SW_VLAN	BIT(2)
> diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
> index 2388837d6d6c..0463fc594d08 100644
> --- a/drivers/net/ethernet/intel/ice/ice_xsk.c
> +++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
> @@ -359,33 +359,28 @@ int ice_xsk_pool_setup(struct ice_vsi *vsi, struct xsk_buff_pool *pool, u16 qid)
>  }
>  
>  /**
> - * ice_alloc_rx_bufs_zc - allocate a number of Rx buffers
> - * @rx_ring: Rx ring
> + * ice_fill_rx_descs - pick buffers from XSK buffer pool and use it
> + * @pool: XSK Buffer pool to pull the buffers from
> + * @xdp: SW ring of xdp_buff that will hold the buffers
> + * @rx_desc: Pointer to Rx descriptors that will be filled
>   * @count: The number of buffers to allocate
>   *
>   * This function allocates a number of Rx buffers from the fill ring
>   * or the internal recycle mechanism and places them on the Rx ring.
>   *
> - * Returns true if all allocations were successful, false if any fail.
> + * Note that ring wrap should be handled by caller of this function.
> + *
> + * Returns the amount of allocated Rx descriptors
>   */
> -bool ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring, u16 count)
> +static u16 ice_fill_rx_descs(struct xsk_buff_pool *pool, struct xdp_buff **xdp,
> +			     union ice_32b_rx_flex_desc *rx_desc, u16 count)
>  {
> -	union ice_32b_rx_flex_desc *rx_desc;
> -	u16 ntu = rx_ring->next_to_use;
> -	struct xdp_buff **xdp;
> -	u32 nb_buffs, i;
>  	dma_addr_t dma;
> +	u16 buffs;
> +	int i;
>  
> -	rx_desc = ICE_RX_DESC(rx_ring, ntu);
> -	xdp = ice_xdp_buf(rx_ring, ntu);
> -
> -	nb_buffs = min_t(u16, count, rx_ring->count - ntu);
> -	nb_buffs = xsk_buff_alloc_batch(rx_ring->xsk_pool, xdp, nb_buffs);
> -	if (!nb_buffs)
> -		return false;
> -
> -	i = nb_buffs;
> -	while (i--) {
> +	buffs = xsk_buff_alloc_batch(pool, xdp, count);
> +	for (i = 0; i < buffs; i++) {
>  		dma = xsk_buff_xdp_get_dma(*xdp);
>  		rx_desc->read.pkt_addr = cpu_to_le64(dma);
>  		rx_desc->wb.status_error0 = 0;
> @@ -394,13 +389,77 @@ bool ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring, u16 count)
>  		xdp++;
>  	}
>  
> +	return buffs;
> +}
> +
> +/**
> + * __ice_alloc_rx_bufs_zc - allocate a number of Rx buffers
> + * @rx_ring: Rx ring
> + * @count: The number of buffers to allocate
> + *
> + * Place the @count of descriptors onto Rx ring. Handle the ring wrap
> + * for case where space from next_to_use up to the end of ring is less
> + * than @count. Finally do a tail bump.
> + *
> + * Returns true if all allocations were successful, false if any fail.
> + */
> +static bool __ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring, u16 count)
> +{
> +	union ice_32b_rx_flex_desc *rx_desc;
> +	u32 nb_buffs_extra = 0, nb_buffs;
> +	u16 ntu = rx_ring->next_to_use;
> +	u16 total_count = count;
> +	struct xdp_buff **xdp;
> +
> +	rx_desc = ICE_RX_DESC(rx_ring, ntu);
> +	xdp = ice_xdp_buf(rx_ring, ntu);
> +
> +	if (ntu + count >= rx_ring->count) {
> +		nb_buffs_extra = ice_fill_rx_descs(rx_ring->xsk_pool, xdp,
> +						   rx_desc,
> +						   rx_ring->count - ntu);
> +		rx_desc = ICE_RX_DESC(rx_ring, 0);
> +		xdp = ice_xdp_buf(rx_ring, 0);
> +		ntu = 0;
> +		count -= nb_buffs_extra;
> +		ice_release_rx_desc(rx_ring, 0);
> +	}
> +
> +	nb_buffs = ice_fill_rx_descs(rx_ring->xsk_pool, xdp, rx_desc, count);
> +
>  	ntu += nb_buffs;
>  	if (ntu == rx_ring->count)
>  		ntu = 0;
>  
> -	ice_release_rx_desc(rx_ring, ntu);
> +	if (rx_ring->next_to_use != ntu)
> +		ice_release_rx_desc(rx_ring, ntu);
> +
> +	return total_count == (nb_buffs_extra + nb_buffs);
> +}
> +
> +/**
> + * ice_alloc_rx_bufs_zc - allocate a number of Rx buffers
> + * @rx_ring: Rx ring
> + * @count: The number of buffers to allocate
> + *
> + * Wrapper for internal allocation routine; figure out how many tail
> + * bumps should take place based on the given threshold
> + *
> + * Returns true if all calls to internal alloc routine succeeded
> + */
> +bool ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring, u16 count)
> +{
> +	u16 rx_thresh = ICE_RING_QUARTER(rx_ring);
> +	u16 batched, leftover, i, tail_bumps;
> +
> +	batched = count & ~(rx_thresh - 1);

The ring size can be a non-power-of-two, unfortunately; it is only
required to be aligned to 32: [0]. So it can be e.g. 96 and the mask
will break then.
You could use roundup_pow_of_two(ICE_RING_QUARTER(rx_ring)), but that
might be a little slower due to the fls_long() (bit search) inside.

(I would generally prohibit non-pow-2 ring sizes at all from inside
 the Ethtool callbacks since it makes no sense to me :p)

Also, it's not recommended to open-code an align-down since we have
the ALIGN_DOWN(value, pow_of_two_alignment) macro. The macro hell
inside expands to the same op you do here.
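
Something along these lines maybe (untested sketch, only to illustrate
both points; assumes ICE_RING_QUARTER() from this patch):

	u16 rx_thresh = roundup_pow_of_two(ICE_RING_QUARTER(rx_ring));
	u16 batched, leftover, tail_bumps;

	/* ALIGN_DOWN() wants a power-of-two alignment, hence the
	 * roundup_pow_of_two() above
	 */
	batched = ALIGN_DOWN(count, rx_thresh);
	tail_bumps = batched / rx_thresh;
	leftover = count - batched;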

> +	tail_bumps = batched / rx_thresh;
> +	leftover = count & (rx_thresh - 1);
>  
> -	return count == nb_buffs;
> +	for (i = 0; i < tail_bumps; i++)
> +		if (!__ice_alloc_rx_bufs_zc(rx_ring, rx_thresh))
> +			return false;
> +	return __ice_alloc_rx_bufs_zc(rx_ring, leftover);
>  }
>  
>  /**
> -- 
> 2.33.1

[0] https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/intel/ice/ice_ethtool.c#L2729

Thanks,
Al


* Rx: [PATCH bpf-next v3 3/7] ice: make Tx threshold dependent on ring length
  2022-01-21 12:00 ` [PATCH bpf-next v3 3/7] ice: make Tx threshold dependent on ring length Maciej Fijalkowski
@ 2022-01-21 12:40   ` Alexander Lobakin
  2022-01-21 14:34     ` Maciej Fijalkowski
  0 siblings, 1 reply; 16+ messages in thread
From: Alexander Lobakin @ 2022-01-21 12:40 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Alexander Lobakin, bpf, ast, daniel, netdev, magnus.karlsson

From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Date: Fri, 21 Jan 2022 13:00:07 +0100

> XDP_TX workloads use a concept of Tx threshold that indicates the
> interval of setting RS bit on descriptors which in turn tells the HW to
> generate an interrupt to signal the completion of Tx on HW side. It is
> currently based on a constant value of 32 which might not work out well
> for various sizes of ring combined with for example batch size that can
> be set via SO_BUSY_POLL_BUDGET.
> 
> Internal tests based on AF_XDP showed that most convenient setup of
> mentioned threshold is when it is equal to quarter of a ring length.
> 
> Introduce @tx_thresh field in ice_tx_ring struct that will store the
> value of threshold and use it in favor of ICE_TX_THRESH.
> 
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> ---
>  drivers/net/ethernet/intel/ice/ice_ethtool.c  |  3 +++
>  drivers/net/ethernet/intel/ice/ice_main.c     |  5 +++--
>  drivers/net/ethernet/intel/ice/ice_txrx.h     |  7 ++++++-
>  drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 14 ++++++++------
>  4 files changed, 20 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c b/drivers/net/ethernet/intel/ice/ice_ethtool.c
> index e2e3ef7fba7f..bfa5e5d167ab 100644
> --- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
> +++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
> @@ -2803,6 +2803,9 @@ ice_set_ringparam(struct net_device *netdev, struct ethtool_ringparam *ring,
>  		/* clone ring and setup updated count */
>  		xdp_rings[i] = *vsi->xdp_rings[i];
>  		xdp_rings[i].count = new_tx_cnt;
> +		xdp_rings[i].tx_thresh = ice_get_tx_threshold(&xdp_rings[i]);
> +		xdp_rings[i].next_dd = xdp_rings[i].tx_thresh - 1;
> +		xdp_rings[i].next_rs = xdp_rings[i].tx_thresh - 1;
>  		xdp_rings[i].desc = NULL;
>  		xdp_rings[i].tx_buf = NULL;
>  		err = ice_setup_tx_ring(&xdp_rings[i]);
> diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
> index 30814435f779..0fd12a7f6d22 100644
> --- a/drivers/net/ethernet/intel/ice/ice_main.c
> +++ b/drivers/net/ethernet/intel/ice/ice_main.c
> @@ -2495,10 +2495,11 @@ static int ice_xdp_alloc_setup_rings(struct ice_vsi *vsi)
>  		xdp_ring->reg_idx = vsi->txq_map[xdp_q_idx];
>  		xdp_ring->vsi = vsi;
>  		xdp_ring->netdev = NULL;
> -		xdp_ring->next_dd = ICE_TX_THRESH - 1;
> -		xdp_ring->next_rs = ICE_TX_THRESH - 1;
>  		xdp_ring->dev = dev;
>  		xdp_ring->count = vsi->num_tx_desc;
> +		xdp_ring->tx_thresh = ice_get_tx_threshold(xdp_ring);
> +		xdp_ring->next_dd = xdp_ring->tx_thresh - 1;
> +		xdp_ring->next_rs = xdp_ring->tx_thresh - 1;
>  		WRITE_ONCE(vsi->xdp_rings[i], xdp_ring);
>  		if (ice_setup_tx_ring(xdp_ring))
>  			goto free_xdp_rings;
> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
> index 94a46e0e5ed0..09c8ad2f7403 100644
> --- a/drivers/net/ethernet/intel/ice/ice_txrx.h
> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
> @@ -13,7 +13,6 @@
>  #define ICE_MAX_CHAINED_RX_BUFS	5
>  #define ICE_MAX_BUF_TXD		8
>  #define ICE_MIN_TX_LEN		17
> -#define ICE_TX_THRESH		32
>  
>  /* The size limit for a transmit buffer in a descriptor is (16K - 1).
>   * In order to align with the read requests we will align the value to
> @@ -333,6 +332,7 @@ struct ice_tx_ring {
>  	struct ice_channel *ch;
>  	struct ice_ptp_tx *tx_tstamps;
>  	spinlock_t tx_lock;
> +	u16 tx_thresh;

Creating a 2-byte hole here, but it's likely ok since I don't see a
place to pack it nicely. A u8 at the end is not an option, obviously.

(unless you want to store the order rather than the threshold
itself -- that would require an additional macro, but it would also
align nicely with the fact that we need the threshold to be a
power of two)
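
Roughly like this (hypothetical field/macro names, untested):

	/* ice_txrx.h: a u8 fits at the tail of ice_tx_ring, no new hole */
	u8 tx_thresh_order;		/* log2 of the Tx threshold */

#define ICE_TX_THRESH(ring)	(1U << (ring)->tx_thresh_order)

	/* at ring setup, assuming the ring quarter is a power of two: */
	xdp_ring->tx_thresh_order = ilog2(ICE_RING_QUARTER(xdp_ring));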

>  	u32 txq_teid;			/* Added Tx queue TEID */
>  #define ICE_TX_FLAGS_RING_XDP		BIT(0)
>  	u8 flags;
> @@ -355,6 +355,11 @@ static inline void ice_clear_ring_build_skb_ena(struct ice_rx_ring *ring)
>  	ring->flags &= ~ICE_RX_FLAGS_RING_BUILD_SKB;
>  }
>  
> +static inline u16 ice_get_tx_threshold(struct ice_tx_ring *tx_ring)
> +{
> +	return ICE_RING_QUARTER(tx_ring);
> +}
> +
>  static inline bool ice_ring_ch_enabled(struct ice_tx_ring *ring)
>  {
>  	return !!ring->ch;
> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
> index 0e87b98e0966..5706b5405373 100644
> --- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
> +++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
> @@ -222,6 +222,7 @@ ice_receive_skb(struct ice_rx_ring *rx_ring, struct sk_buff *skb, u16 vlan_tag)
>  static void ice_clean_xdp_irq(struct ice_tx_ring *xdp_ring)
>  {
>  	unsigned int total_bytes = 0, total_pkts = 0;
> +	u16 tx_thresh = xdp_ring->tx_thresh;
>  	u16 ntc = xdp_ring->next_to_clean;
>  	struct ice_tx_desc *next_dd_desc;
>  	u16 next_dd = xdp_ring->next_dd;
> @@ -233,7 +234,7 @@ static void ice_clean_xdp_irq(struct ice_tx_ring *xdp_ring)
>  	    cpu_to_le64(ICE_TX_DESC_DTYPE_DESC_DONE)))
>  		return;
>  
> -	for (i = 0; i < ICE_TX_THRESH; i++) {
> +	for (i = 0; i < tx_thresh; i++) {
>  		tx_buf = &xdp_ring->tx_buf[ntc];
>  
>  		total_bytes += tx_buf->bytecount;
> @@ -254,9 +255,9 @@ static void ice_clean_xdp_irq(struct ice_tx_ring *xdp_ring)
>  	}
>  
>  	next_dd_desc->cmd_type_offset_bsz = 0;
> -	xdp_ring->next_dd = xdp_ring->next_dd + ICE_TX_THRESH;
> +	xdp_ring->next_dd = xdp_ring->next_dd + tx_thresh;
>  	if (xdp_ring->next_dd > xdp_ring->count)
> -		xdp_ring->next_dd = ICE_TX_THRESH - 1;
> +		xdp_ring->next_dd = tx_thresh - 1;
>  	xdp_ring->next_to_clean = ntc;
>  	ice_update_tx_ring_stats(xdp_ring, total_pkts, total_bytes);
>  }
> @@ -269,12 +270,13 @@ static void ice_clean_xdp_irq(struct ice_tx_ring *xdp_ring)
>   */
>  int ice_xmit_xdp_ring(void *data, u16 size, struct ice_tx_ring *xdp_ring)
>  {
> +	u16 tx_thresh = xdp_ring->tx_thresh;
>  	u16 i = xdp_ring->next_to_use;
>  	struct ice_tx_desc *tx_desc;
>  	struct ice_tx_buf *tx_buf;
>  	dma_addr_t dma;
>  
> -	if (ICE_DESC_UNUSED(xdp_ring) < ICE_TX_THRESH)
> +	if (ICE_DESC_UNUSED(xdp_ring) < tx_thresh)
>  		ice_clean_xdp_irq(xdp_ring);
>  
>  	if (!unlikely(ICE_DESC_UNUSED(xdp_ring))) {
> @@ -306,7 +308,7 @@ int ice_xmit_xdp_ring(void *data, u16 size, struct ice_tx_ring *xdp_ring)
>  		tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_rs);
>  		tx_desc->cmd_type_offset_bsz |=
>  			cpu_to_le64(ICE_TX_DESC_CMD_RS << ICE_TXD_QW1_CMD_S);
> -		xdp_ring->next_rs = ICE_TX_THRESH - 1;
> +		xdp_ring->next_rs = tx_thresh - 1;
>  	}
>  	xdp_ring->next_to_use = i;
>  
> @@ -314,7 +316,7 @@ int ice_xmit_xdp_ring(void *data, u16 size, struct ice_tx_ring *xdp_ring)
>  		tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_rs);
>  		tx_desc->cmd_type_offset_bsz |=
>  			cpu_to_le64(ICE_TX_DESC_CMD_RS << ICE_TXD_QW1_CMD_S);
> -		xdp_ring->next_rs += ICE_TX_THRESH;
> +		xdp_ring->next_rs += tx_thresh;
>  	}
>  
>  	return ICE_XDP_TX;
> -- 
> 2.33.1

Thanks,
Al


* Re: [PATCH bpf-next v3 6/7] ice: xsk: improve AF_XDP ZC Tx and use batching API
  2022-01-21 12:00 ` [PATCH bpf-next v3 6/7] ice: xsk: improve AF_XDP ZC Tx and use batching API Maciej Fijalkowski
@ 2022-01-21 12:54   ` Alexander Lobakin
  2022-01-21 13:17     ` Alexander Lobakin
  0 siblings, 1 reply; 16+ messages in thread
From: Alexander Lobakin @ 2022-01-21 12:54 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Alexander Lobakin, bpf, ast, daniel, netdev, magnus.karlsson

From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Date: Fri, 21 Jan 2022 13:00:10 +0100

> Apply the logic that was done for regular XDP from commit 9610bd988df9
> ("ice: optimize XDP_TX workloads") to the ZC side of the driver. On top
> of that, introduce batching to Tx that is inspired by i40e's
> implementation with adjustments to the cleaning logic - take into the
> account NAPI budget in ice_clean_xdp_irq_zc().
> 
> Separating the stats structs onto separate cache lines seemed to improve
> the performance.
> 
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> ---
>  drivers/net/ethernet/intel/ice/ice_txrx.c |   2 +-
>  drivers/net/ethernet/intel/ice/ice_txrx.h |   2 +-
>  drivers/net/ethernet/intel/ice/ice_xsk.c  | 256 ++++++++++++++--------
>  drivers/net/ethernet/intel/ice/ice_xsk.h  |  27 ++-
>  4 files changed, 186 insertions(+), 101 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
> index bfb9158b10a4..7ab8c700c884 100644
> --- a/drivers/net/ethernet/intel/ice/ice_txrx.c
> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
> @@ -1463,7 +1463,7 @@ int ice_napi_poll(struct napi_struct *napi, int budget)
>  		bool wd;
>  
>  		if (tx_ring->xsk_pool)
> -			wd = ice_clean_tx_irq_zc(tx_ring, budget);
> +			wd = ice_xmit_zc(tx_ring, ICE_DESC_UNUSED(tx_ring), budget);
>  		else if (ice_ring_is_xdp(tx_ring))
>  			wd = true;
>  		else
> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
> index 09c8ad2f7403..191f9b8c50ee 100644
> --- a/drivers/net/ethernet/intel/ice/ice_txrx.h
> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
> @@ -322,9 +322,9 @@ struct ice_tx_ring {
>  	u16 count;			/* Number of descriptors */
>  	u16 q_index;			/* Queue number of ring */
>  	/* stats structs */
> +	struct ice_txq_stats tx_stats;
>  	struct ice_q_stats	stats;
>  	struct u64_stats_sync syncp;
> -	struct ice_txq_stats tx_stats;
>  
>  	/* CL3 - 3rd cacheline starts here */
>  	struct rcu_head rcu;		/* to avoid race on free */
> diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
> index 0463fc594d08..4b6e54f75af6 100644
> --- a/drivers/net/ethernet/intel/ice/ice_xsk.c
> +++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
> @@ -671,134 +671,208 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
>  }
>  
>  /**
> - * ice_xmit_zc - Completes AF_XDP entries, and cleans XDP entries
> + * ice_clean_xdp_tx_buf - Free and unmap XDP Tx buffer
>   * @xdp_ring: XDP Tx ring
> - * @budget: max number of frames to xmit
> - *
> - * Returns true if cleanup/transmission is done.
> + * @tx_buf: Tx buffer to clean
>   */
> -static bool ice_xmit_zc(struct ice_tx_ring *xdp_ring, int budget)
> +static void
> +ice_clean_xdp_tx_buf(struct ice_tx_ring *xdp_ring, struct ice_tx_buf *tx_buf)
>  {
> -	struct ice_tx_desc *tx_desc = NULL;
> -	bool work_done = true;
> -	struct xdp_desc desc;
> -	dma_addr_t dma;
> -
> -	while (likely(budget-- > 0)) {
> -		struct ice_tx_buf *tx_buf;
> -
> -		if (unlikely(!ICE_DESC_UNUSED(xdp_ring))) {
> -			xdp_ring->tx_stats.tx_busy++;
> -			work_done = false;
> -			break;
> -		}
> +	xdp_return_frame((struct xdp_frame *)tx_buf->raw_buf);
> +	dma_unmap_single(xdp_ring->dev, dma_unmap_addr(tx_buf, dma),
> +			 dma_unmap_len(tx_buf, len), DMA_TO_DEVICE);
> +	dma_unmap_len_set(tx_buf, len, 0);
> +}
>  
> -		tx_buf = &xdp_ring->tx_buf[xdp_ring->next_to_use];
> +/**
> + * ice_clean_xdp_irq_zc - Reclaim resources after transmit completes on XDP ring
> + * @xdp_ring: XDP ring to clean
> + * @napi_budget: amount of descriptors that NAPI allows us to clean
> + *
> + * Returns count of cleaned descriptors
> + */
> +static u16 ice_clean_xdp_irq_zc(struct ice_tx_ring *xdp_ring, int napi_budget)
> +{
> +	u16 tx_thresh = xdp_ring->tx_thresh;
> +	int budget = napi_budget / tx_thresh;
> +	u16 ntc = xdp_ring->next_to_clean;
> +	struct ice_tx_desc *next_dd_desc;

@next_dd_desc is used only inside the `do { } while`, so it can be moved
there to reduce the scope.
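
I mean (sketch):

	do {
		struct ice_tx_desc *next_dd_desc = ICE_TX_DESC(xdp_ring, next_dd);

		if (!(next_dd_desc->cmd_type_offset_bsz &
		      cpu_to_le64(ICE_TX_DESC_DTYPE_DESC_DONE)))
			break;
		...
	} while (budget--);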

> +	u16 next_dd = xdp_ring->next_dd;
> +	u16 desc_cnt = xdp_ring->count;
> +	struct ice_tx_buf *tx_buf;
> +	u16 cleared_dds = 0;
> +	u32 xsk_frames = 0;
> +	u16 i;

Same with these 5, from @desc_cnt to @i.

>  
> -		if (!xsk_tx_peek_desc(xdp_ring->xsk_pool, &desc))
> +	do {
> +		next_dd_desc = ICE_TX_DESC(xdp_ring, next_dd);
> +		if (!(next_dd_desc->cmd_type_offset_bsz &
> +		    cpu_to_le64(ICE_TX_DESC_DTYPE_DESC_DONE)))
>  			break;
>  
> -		dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc.addr);
> -		xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma,
> -						 desc.len);
> +		cleared_dds++;
> +		xsk_frames = 0;
>  
> -		tx_buf->bytecount = desc.len;
> +		for (i = 0; i < tx_thresh; i++) {
> +			tx_buf = &xdp_ring->tx_buf[ntc];
>  
> -		tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_to_use);
> -		tx_desc->buf_addr = cpu_to_le64(dma);
> -		tx_desc->cmd_type_offset_bsz =
> -			ice_build_ctob(ICE_TXD_LAST_DESC_CMD, 0, desc.len, 0);
> +			if (tx_buf->raw_buf) {
> +				ice_clean_xdp_tx_buf(xdp_ring, tx_buf);
> +				tx_buf->raw_buf = NULL;
> +			} else {
> +				xsk_frames++;
> +			}
>  
> -		xdp_ring->next_to_use++;
> -		if (xdp_ring->next_to_use == xdp_ring->count)
> -			xdp_ring->next_to_use = 0;
> -	}
> +			ntc++;
> +			if (ntc >= xdp_ring->count)
> +				ntc = 0;
> +		}
> +		if (xsk_frames)
> +			xsk_tx_completed(xdp_ring->xsk_pool, xsk_frames);
> +		next_dd_desc->cmd_type_offset_bsz = 0;
> +		next_dd = next_dd + tx_thresh;
> +		if (next_dd >= desc_cnt)
> +			next_dd = tx_thresh - 1;
> +	} while (budget--);
>  
> -	if (tx_desc) {
> -		ice_xdp_ring_update_tail(xdp_ring);
> -		xsk_tx_release(xdp_ring->xsk_pool);
> -	}
> +	xdp_ring->next_to_clean = ntc;
> +	xdp_ring->next_dd = next_dd;
>  
> -	return budget > 0 && work_done;
> +	return cleared_dds * tx_thresh;
>  }
>  
>  /**
> - * ice_clean_xdp_tx_buf - Free and unmap XDP Tx buffer
> - * @xdp_ring: XDP Tx ring
> - * @tx_buf: Tx buffer to clean
> + * ice_xmit_pkt - produce a single HW Tx descriptor out of AF_XDP descriptor
> + * @xdp_ring: XDP ring to produce the HW Tx descriptor on
> + * @desc: AF_XDP descriptor to pull the DMA address and length from
> + * @total_bytes: bytes accumulator that will be used for stats update
>   */
> -static void
> -ice_clean_xdp_tx_buf(struct ice_tx_ring *xdp_ring, struct ice_tx_buf *tx_buf)
> +static void ice_xmit_pkt(struct ice_tx_ring *xdp_ring, struct xdp_desc *desc,
> +			 unsigned int *total_bytes)
>  {
> -	xdp_return_frame((struct xdp_frame *)tx_buf->raw_buf);
> -	dma_unmap_single(xdp_ring->dev, dma_unmap_addr(tx_buf, dma),
> -			 dma_unmap_len(tx_buf, len), DMA_TO_DEVICE);
> -	dma_unmap_len_set(tx_buf, len, 0);
> +	struct ice_tx_desc *tx_desc;
> +	dma_addr_t dma;
> +
> +	dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc->addr);
> +	xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma, desc->len);
> +
> +	tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_to_use++);
> +	tx_desc->buf_addr = cpu_to_le64(dma);
> +	tx_desc->cmd_type_offset_bsz = ice_build_ctob(ICE_TX_DESC_CMD_EOP,
> +						      0, desc->len, 0);
> +
> +	*total_bytes += desc->len;
>  }
>  
>  /**
> - * ice_clean_tx_irq_zc - Completes AF_XDP entries, and cleans XDP entries
> - * @xdp_ring: XDP Tx ring
> - * @budget: NAPI budget
> - *
> - * Returns true if cleanup/tranmission is done.
> + * ice_xmit_pkt_batch - produce a batch of HW Tx descriptors out of AF_XDP descriptors
> + * @xdp_ring: XDP ring to produce the HW Tx descriptors on
> + * @descs: AF_XDP descriptors to pull the DMA addresses and lengths from
> + * @total_bytes: bytes accumulator that will be used for stats update
>   */
> -bool ice_clean_tx_irq_zc(struct ice_tx_ring *xdp_ring, int budget)
> +static void ice_xmit_pkt_batch(struct ice_tx_ring *xdp_ring, struct xdp_desc *descs,
> +			       unsigned int *total_bytes)
>  {
> -	int total_packets = 0, total_bytes = 0;
> -	s16 ntc = xdp_ring->next_to_clean;
> +	u16 tx_thresh = xdp_ring->tx_thresh;
> +	u16 ntu = xdp_ring->next_to_use;
>  	struct ice_tx_desc *tx_desc;
> -	struct ice_tx_buf *tx_buf;
> -	u32 xsk_frames = 0;
> -	bool xmit_done;
> +	dma_addr_t dma;

Same with @dma here.

> +	u32 i;
>  
> -	tx_desc = ICE_TX_DESC(xdp_ring, ntc);
> -	tx_buf = &xdp_ring->tx_buf[ntc];
> -	ntc -= xdp_ring->count;
> +	loop_unrolled_for(i = 0; i < PKTS_PER_BATCH; i++) {
> +		dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, descs[i].addr);
> +		xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma, descs[i].len);
>  
> -	do {
> -		if (!(tx_desc->cmd_type_offset_bsz &
> -		      cpu_to_le64(ICE_TX_DESC_DTYPE_DESC_DONE)))
> -			break;
> +		tx_desc = ICE_TX_DESC(xdp_ring, ntu++);
> +		tx_desc->buf_addr = cpu_to_le64(dma);
> +		tx_desc->cmd_type_offset_bsz = ice_build_ctob(ICE_TX_DESC_CMD_EOP,
> +							      0, descs[i].len, 0);
>  
> -		total_bytes += tx_buf->bytecount;
> -		total_packets++;
> +		*total_bytes += descs[i].len;
> +	}
>  
> -		if (tx_buf->raw_buf) {
> -			ice_clean_xdp_tx_buf(xdp_ring, tx_buf);
> -			tx_buf->raw_buf = NULL;
> -		} else {
> -			xsk_frames++;
> -		}
> +	xdp_ring->next_to_use = ntu;
>  
> -		tx_desc->cmd_type_offset_bsz = 0;
> -		tx_buf++;
> -		tx_desc++;
> -		ntc++;
> +	if (xdp_ring->next_to_use > xdp_ring->next_rs) {
> +		tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_rs);
> +		tx_desc->cmd_type_offset_bsz |=
> +			cpu_to_le64(ICE_TX_DESC_CMD_RS << ICE_TXD_QW1_CMD_S);
> +		xdp_ring->next_rs += tx_thresh;
> +	}
> +}
>  
> -		if (unlikely(!ntc)) {
> -			ntc -= xdp_ring->count;
> -			tx_buf = xdp_ring->tx_buf;
> -			tx_desc = ICE_TX_DESC(xdp_ring, 0);
> -		}
> +/**
> + * ice_fill_tx_hw_ring - produce the number of Tx descriptors onto ring
> + * @xdp_ring: XDP ring to produce the HW Tx descriptors on
> + * @descs: AF_XDP descriptors to pull the DMA addresses and lengths from
> + * @nb_pkts: count of packets to be send
> + * @total_bytes: bytes accumulator that will be used for stats update
> + */
> +static void ice_fill_tx_hw_ring(struct ice_tx_ring *xdp_ring, struct xdp_desc *descs,
> +				u32 nb_pkts, unsigned int *total_bytes)
> +{
> +	u16 tx_thresh = xdp_ring->tx_thresh;
> +	struct ice_tx_desc *tx_desc;

And @tx_desc as well.

> +	u32 batched, leftover, i;
> +
> +	batched = nb_pkts & ~(PKTS_PER_BATCH - 1);
> +	leftover = nb_pkts & (PKTS_PER_BATCH - 1);
> +	for (i = 0; i < batched; i += PKTS_PER_BATCH)
> +		ice_xmit_pkt_batch(xdp_ring, &descs[i], total_bytes);
> +	for (i = batched; i < batched + leftover; i++)
> +		ice_xmit_pkt(xdp_ring, &descs[i], total_bytes);
> +
> +	if (xdp_ring->next_to_use > xdp_ring->next_rs) {
> +		tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_rs);
> +		tx_desc->cmd_type_offset_bsz |=
> +			cpu_to_le64(ICE_TX_DESC_CMD_RS << ICE_TXD_QW1_CMD_S);
> +		xdp_ring->next_rs += tx_thresh;
> +	}
> +}
>  
> -		prefetch(tx_desc);
> +/**
> + * ice_xmit_zc - take entries from XSK Tx ring and place them onto HW Tx ring
> + * @xdp_ring: XDP ring to produce the HW Tx descriptors on
> + * @budget: number of free descriptors on HW Tx ring that can be used
> + * @napi_budget: amount of descriptors that NAPI allows us to clean
> + *
> + * Returns true if there is no more work that needs to be done, false otherwise
> + */
> +bool ice_xmit_zc(struct ice_tx_ring *xdp_ring, u32 budget, int napi_budget)
> +{
> +	struct xdp_desc *descs = xdp_ring->xsk_pool->tx_descs;
> +	u16 tx_thresh = xdp_ring->tx_thresh;
> +	u32 nb_pkts, nb_processed = 0;
> +	unsigned int total_bytes = 0;
> +	struct ice_tx_desc *tx_desc;

And this @tx_desc.

>  
> -	} while (likely(--budget));
> +	if (budget < tx_thresh)
> +		budget += ice_clean_xdp_irq_zc(xdp_ring, napi_budget);
> +
> +	nb_pkts = xsk_tx_peek_release_desc_batch(xdp_ring->xsk_pool, budget);
> +	if (!nb_pkts)
> +		return true;
> +
> +	if (xdp_ring->next_to_use + nb_pkts >= xdp_ring->count) {
> +		nb_processed = xdp_ring->count - xdp_ring->next_to_use;
> +		ice_fill_tx_hw_ring(xdp_ring, descs, nb_processed, &total_bytes);
> +		tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_rs);
> +		tx_desc->cmd_type_offset_bsz |=
> +			cpu_to_le64(ICE_TX_DESC_CMD_RS << ICE_TXD_QW1_CMD_S);
> +		xdp_ring->next_rs = tx_thresh - 1;
> +		xdp_ring->next_to_use = 0;
> +	}
>  
> -	ntc += xdp_ring->count;
> -	xdp_ring->next_to_clean = ntc;
> +	ice_fill_tx_hw_ring(xdp_ring, &descs[nb_processed], nb_pkts - nb_processed,
> +			    &total_bytes);
>  
> -	if (xsk_frames)
> -		xsk_tx_completed(xdp_ring->xsk_pool, xsk_frames);
> +	ice_xdp_ring_update_tail(xdp_ring);
> +	ice_update_tx_ring_stats(xdp_ring, nb_pkts, total_bytes);
>  
>  	if (xsk_uses_need_wakeup(xdp_ring->xsk_pool))
>  		xsk_set_tx_need_wakeup(xdp_ring->xsk_pool);
>  
> -	ice_update_tx_ring_stats(xdp_ring, total_packets, total_bytes);
> -	xmit_done = ice_xmit_zc(xdp_ring, ICE_DFLT_IRQ_WORK);
> -
> -	return budget > 0 && xmit_done;
> +	return nb_pkts < budget;
>  }
>  
>  /**
> diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.h b/drivers/net/ethernet/intel/ice/ice_xsk.h
> index 4c7bd8e9dfc4..0cbb5793b5b8 100644
> --- a/drivers/net/ethernet/intel/ice/ice_xsk.h
> +++ b/drivers/net/ethernet/intel/ice/ice_xsk.h
> @@ -6,19 +6,37 @@
>  #include "ice_txrx.h"
>  #include "ice.h"
>  
> +#define PKTS_PER_BATCH 8
> +
> +#ifdef __clang__
> +#define loop_unrolled_for _Pragma("clang loop unroll_count(8)") for
> +#elif __GNUC__ >= 4
> +#define loop_unrolled_for _Pragma("GCC unroll 8") for
> +#else
> +#define loop_unrolled_for for
> +#endif
> +
>  struct ice_vsi;
>  
>  #ifdef CONFIG_XDP_SOCKETS
>  int ice_xsk_pool_setup(struct ice_vsi *vsi, struct xsk_buff_pool *pool,
>  		       u16 qid);
>  int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget);
> -bool ice_clean_tx_irq_zc(struct ice_tx_ring *xdp_ring, int budget);
>  int ice_xsk_wakeup(struct net_device *netdev, u32 queue_id, u32 flags);
>  bool ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring, u16 count);
>  bool ice_xsk_any_rx_ring_ena(struct ice_vsi *vsi);
>  void ice_xsk_clean_rx_ring(struct ice_rx_ring *rx_ring);
>  void ice_xsk_clean_xdp_ring(struct ice_tx_ring *xdp_ring);
> +bool ice_xmit_zc(struct ice_tx_ring *xdp_ring, u32 budget, int napi_budget);
>  #else
> +static inline bool
> +ice_xmit_zc(struct ice_tx_ring __always_unused *xdp_ring,
> +	    u32 __always_unused budget,
> +	    int __always_unused napi_budget)
> +{
> +	return false;
> +}
> +
>  static inline int
>  ice_xsk_pool_setup(struct ice_vsi __always_unused *vsi,
>  		   struct xsk_buff_pool __always_unused *pool,
> @@ -34,13 +52,6 @@ ice_clean_rx_irq_zc(struct ice_rx_ring __always_unused *rx_ring,
>  	return 0;
>  }
>  
> -static inline bool
> -ice_clean_tx_irq_zc(struct ice_tx_ring __always_unused *xdp_ring,
> -		    int __always_unused budget)
> -{
> -	return false;
> -}
> -
>  static inline bool
>  ice_alloc_rx_bufs_zc(struct ice_rx_ring __always_unused *rx_ring,
>  		     u16 __always_unused count)
> -- 
> 2.33.1

Thanks,
Al


* Re: [PATCH bpf-next v3 6/7] ice: xsk: improve AF_XDP ZC Tx and use batching API
  2022-01-21 12:54   ` Alexander Lobakin
@ 2022-01-21 13:17     ` Alexander Lobakin
  2022-01-21 14:41       ` Maciej Fijalkowski
  0 siblings, 1 reply; 16+ messages in thread
From: Alexander Lobakin @ 2022-01-21 13:17 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Alexander Lobakin, bpf, ast, daniel, netdev, magnus.karlsson

From: Alexander Lobakin <alexandr.lobakin@intel.com>
Date: Fri, 21 Jan 2022 13:54:47 +0100

> From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> Date: Fri, 21 Jan 2022 13:00:10 +0100
> 
> > Apply the logic that was done for regular XDP from commit 9610bd988df9
> > ("ice: optimize XDP_TX workloads") to the ZC side of the driver. On top
> > of that, introduce batching to Tx that is inspired by i40e's
> > implementation with adjustments to the cleaning logic - take into the
> > account NAPI budget in ice_clean_xdp_irq_zc().
> > 
> > Separating the stats structs onto separate cache lines seemed to improve
> > the performance.
> > 
> > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > ---
> >  drivers/net/ethernet/intel/ice/ice_txrx.c |   2 +-
> >  drivers/net/ethernet/intel/ice/ice_txrx.h |   2 +-
> >  drivers/net/ethernet/intel/ice/ice_xsk.c  | 256 ++++++++++++++--------
> >  drivers/net/ethernet/intel/ice/ice_xsk.h  |  27 ++-
> >  4 files changed, 186 insertions(+), 101 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
> > index bfb9158b10a4..7ab8c700c884 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_txrx.c
> > +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
> > @@ -1463,7 +1463,7 @@ int ice_napi_poll(struct napi_struct *napi, int budget)
> >  		bool wd;
> >  
> >  		if (tx_ring->xsk_pool)
> > -			wd = ice_clean_tx_irq_zc(tx_ring, budget);
> > +			wd = ice_xmit_zc(tx_ring, ICE_DESC_UNUSED(tx_ring), budget);
> >  		else if (ice_ring_is_xdp(tx_ring))
> >  			wd = true;
> >  		else
> > diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
> > index 09c8ad2f7403..191f9b8c50ee 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_txrx.h
> > +++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
> > @@ -322,9 +322,9 @@ struct ice_tx_ring {
> >  	u16 count;			/* Number of descriptors */
> >  	u16 q_index;			/* Queue number of ring */
> >  	/* stats structs */
> > +	struct ice_txq_stats tx_stats;
> >  	struct ice_q_stats	stats;
> >  	struct u64_stats_sync syncp;
> > -	struct ice_txq_stats tx_stats;
> >  
> >  	/* CL3 - 3rd cacheline starts here */
> >  	struct rcu_head rcu;		/* to avoid race on free */
> > diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > index 0463fc594d08..4b6e54f75af6 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_xsk.c
> > +++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > @@ -671,134 +671,208 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
> >  }
> >  
> >  /**
> > - * ice_xmit_zc - Completes AF_XDP entries, and cleans XDP entries
> > + * ice_clean_xdp_tx_buf - Free and unmap XDP Tx buffer
> >   * @xdp_ring: XDP Tx ring
> > - * @budget: max number of frames to xmit
> > - *
> > - * Returns true if cleanup/transmission is done.
> > + * @tx_buf: Tx buffer to clean
> >   */
> > -static bool ice_xmit_zc(struct ice_tx_ring *xdp_ring, int budget)
> > +static void
> > +ice_clean_xdp_tx_buf(struct ice_tx_ring *xdp_ring, struct ice_tx_buf *tx_buf)
> >  {
> > -	struct ice_tx_desc *tx_desc = NULL;
> > -	bool work_done = true;
> > -	struct xdp_desc desc;
> > -	dma_addr_t dma;
> > -
> > -	while (likely(budget-- > 0)) {
> > -		struct ice_tx_buf *tx_buf;
> > -
> > -		if (unlikely(!ICE_DESC_UNUSED(xdp_ring))) {
> > -			xdp_ring->tx_stats.tx_busy++;
> > -			work_done = false;
> > -			break;
> > -		}
> > +	xdp_return_frame((struct xdp_frame *)tx_buf->raw_buf);
> > +	dma_unmap_single(xdp_ring->dev, dma_unmap_addr(tx_buf, dma),
> > +			 dma_unmap_len(tx_buf, len), DMA_TO_DEVICE);
> > +	dma_unmap_len_set(tx_buf, len, 0);
> > +}
> >  
> > -		tx_buf = &xdp_ring->tx_buf[xdp_ring->next_to_use];
> > +/**
> > + * ice_clean_xdp_irq_zc - Reclaim resources after transmit completes on XDP ring
> > + * @xdp_ring: XDP ring to clean
> > + * @napi_budget: amount of descriptors that NAPI allows us to clean
> > + *
> > + * Returns count of cleaned descriptors
> > + */
> > +static u16 ice_clean_xdp_irq_zc(struct ice_tx_ring *xdp_ring, int napi_budget)
> > +{
> > +	u16 tx_thresh = xdp_ring->tx_thresh;
> > +	int budget = napi_budget / tx_thresh;
> > +	u16 ntc = xdp_ring->next_to_clean;
> > +	struct ice_tx_desc *next_dd_desc;
> 
> @next_dd_desc is used only inside the `do { } while`, so it can be moved
> there to reduce the scope.
> 
> > +	u16 next_dd = xdp_ring->next_dd;
> > +	u16 desc_cnt = xdp_ring->count;
> > +	struct ice_tx_buf *tx_buf;
> > +	u16 cleared_dds = 0;
> > +	u32 xsk_frames = 0;
> > +	u16 i;
> 
> Same with these 5, from @desc_cnt to @i.
> 
> >  
> > -		if (!xsk_tx_peek_desc(xdp_ring->xsk_pool, &desc))
> > +	do {
> > +		next_dd_desc = ICE_TX_DESC(xdp_ring, next_dd);
> > +		if (!(next_dd_desc->cmd_type_offset_bsz &
> > +		    cpu_to_le64(ICE_TX_DESC_DTYPE_DESC_DONE)))
> >  			break;
> >  
> > -		dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc.addr);
> > -		xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma,
> > -						 desc.len);
> > +		cleared_dds++;
> > +		xsk_frames = 0;
> >  
> > -		tx_buf->bytecount = desc.len;
> > +		for (i = 0; i < tx_thresh; i++) {
> > +			tx_buf = &xdp_ring->tx_buf[ntc];
> >  
> > -		tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_to_use);
> > -		tx_desc->buf_addr = cpu_to_le64(dma);
> > -		tx_desc->cmd_type_offset_bsz =
> > -			ice_build_ctob(ICE_TXD_LAST_DESC_CMD, 0, desc.len, 0);
> > +			if (tx_buf->raw_buf) {
> > +				ice_clean_xdp_tx_buf(xdp_ring, tx_buf);
> > +				tx_buf->raw_buf = NULL;
> > +			} else {
> > +				xsk_frames++;
> > +			}
> >  
> > -		xdp_ring->next_to_use++;
> > -		if (xdp_ring->next_to_use == xdp_ring->count)
> > -			xdp_ring->next_to_use = 0;
> > -	}
> > +			ntc++;
> > +			if (ntc >= xdp_ring->count)
> > +				ntc = 0;
> > +		}
> > +		if (xsk_frames)
> > +			xsk_tx_completed(xdp_ring->xsk_pool, xsk_frames);
> > +		next_dd_desc->cmd_type_offset_bsz = 0;
> > +		next_dd = next_dd + tx_thresh;
> > +		if (next_dd >= desc_cnt)
> > +			next_dd = tx_thresh - 1;
> > +	} while (budget--);
> >  
> > -	if (tx_desc) {
> > -		ice_xdp_ring_update_tail(xdp_ring);
> > -		xsk_tx_release(xdp_ring->xsk_pool);
> > -	}
> > +	xdp_ring->next_to_clean = ntc;
> > +	xdp_ring->next_dd = next_dd;
> >  
> > -	return budget > 0 && work_done;
> > +	return cleared_dds * tx_thresh;
> >  }
> >  
> >  /**
> > - * ice_clean_xdp_tx_buf - Free and unmap XDP Tx buffer
> > - * @xdp_ring: XDP Tx ring
> > - * @tx_buf: Tx buffer to clean
> > + * ice_xmit_pkt - produce a single HW Tx descriptor out of AF_XDP descriptor
> > + * @xdp_ring: XDP ring to produce the HW Tx descriptor on
> > + * @desc: AF_XDP descriptor to pull the DMA address and length from
> > + * @total_bytes: bytes accumulator that will be used for stats update
> >   */
> > -static void
> > -ice_clean_xdp_tx_buf(struct ice_tx_ring *xdp_ring, struct ice_tx_buf *tx_buf)
> > +static void ice_xmit_pkt(struct ice_tx_ring *xdp_ring, struct xdp_desc *desc,
> > +			 unsigned int *total_bytes)
> >  {
> > -	xdp_return_frame((struct xdp_frame *)tx_buf->raw_buf);
> > -	dma_unmap_single(xdp_ring->dev, dma_unmap_addr(tx_buf, dma),
> > -			 dma_unmap_len(tx_buf, len), DMA_TO_DEVICE);
> > -	dma_unmap_len_set(tx_buf, len, 0);
> > +	struct ice_tx_desc *tx_desc;
> > +	dma_addr_t dma;
> > +
> > +	dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc->addr);
> > +	xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma, desc->len);
> > +
> > +	tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_to_use++);
> > +	tx_desc->buf_addr = cpu_to_le64(dma);
> > +	tx_desc->cmd_type_offset_bsz = ice_build_ctob(ICE_TX_DESC_CMD_EOP,
> > +						      0, desc->len, 0);
> > +
> > +	*total_bytes += desc->len;
> >  }
> >  
> >  /**
> > - * ice_clean_tx_irq_zc - Completes AF_XDP entries, and cleans XDP entries
> > - * @xdp_ring: XDP Tx ring
> > - * @budget: NAPI budget
> > - *
> > - * Returns true if cleanup/tranmission is done.
> > + * ice_xmit_pkt_batch - produce a batch of HW Tx descriptors out of AF_XDP descriptors
> > + * @xdp_ring: XDP ring to produce the HW Tx descriptors on
> > + * @descs: AF_XDP descriptors to pull the DMA addresses and lengths from
> > + * @total_bytes: bytes accumulator that will be used for stats update
> >   */
> > -bool ice_clean_tx_irq_zc(struct ice_tx_ring *xdp_ring, int budget)
> > +static void ice_xmit_pkt_batch(struct ice_tx_ring *xdp_ring, struct xdp_desc *descs,
> > +			       unsigned int *total_bytes)
> >  {
> > -	int total_packets = 0, total_bytes = 0;
> > -	s16 ntc = xdp_ring->next_to_clean;
> > +	u16 tx_thresh = xdp_ring->tx_thresh;
> > +	u16 ntu = xdp_ring->next_to_use;
> >  	struct ice_tx_desc *tx_desc;
> > -	struct ice_tx_buf *tx_buf;
> > -	u32 xsk_frames = 0;
> > -	bool xmit_done;
> > +	dma_addr_t dma;
> 
> Same with @dma here.
> 
> > +	u32 i;
> >  
> > -	tx_desc = ICE_TX_DESC(xdp_ring, ntc);
> > -	tx_buf = &xdp_ring->tx_buf[ntc];
> > -	ntc -= xdp_ring->count;
> > +	loop_unrolled_for(i = 0; i < PKTS_PER_BATCH; i++) {
> > +		dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, descs[i].addr);
> > +		xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma, descs[i].len);
> >  
> > -	do {
> > -		if (!(tx_desc->cmd_type_offset_bsz &
> > -		      cpu_to_le64(ICE_TX_DESC_DTYPE_DESC_DONE)))
> > -			break;
> > +		tx_desc = ICE_TX_DESC(xdp_ring, ntu++);
> > +		tx_desc->buf_addr = cpu_to_le64(dma);
> > +		tx_desc->cmd_type_offset_bsz = ice_build_ctob(ICE_TX_DESC_CMD_EOP,
> > +							      0, descs[i].len, 0);
> >  
> > -		total_bytes += tx_buf->bytecount;
> > -		total_packets++;
> > +		*total_bytes += descs[i].len;
> > +	}
> >  
> > -		if (tx_buf->raw_buf) {
> > -			ice_clean_xdp_tx_buf(xdp_ring, tx_buf);
> > -			tx_buf->raw_buf = NULL;
> > -		} else {
> > -			xsk_frames++;
> > -		}
> > +	xdp_ring->next_to_use = ntu;
> >  
> > -		tx_desc->cmd_type_offset_bsz = 0;
> > -		tx_buf++;
> > -		tx_desc++;
> > -		ntc++;
> > +	if (xdp_ring->next_to_use > xdp_ring->next_rs) {
> > +		tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_rs);
> > +		tx_desc->cmd_type_offset_bsz |=
> > +			cpu_to_le64(ICE_TX_DESC_CMD_RS << ICE_TXD_QW1_CMD_S);
> > +		xdp_ring->next_rs += tx_thresh;
> > +	}
> > +}
> >  
> > -		if (unlikely(!ntc)) {
> > -			ntc -= xdp_ring->count;
> > -			tx_buf = xdp_ring->tx_buf;
> > -			tx_desc = ICE_TX_DESC(xdp_ring, 0);
> > -		}
> > +/**
> > + * ice_fill_tx_hw_ring - produce the number of Tx descriptors onto ring
> > + * @xdp_ring: XDP ring to produce the HW Tx descriptors on
> > + * @descs: AF_XDP descriptors to pull the DMA addresses and lengths from
> > + * @nb_pkts: count of packets to be send
> > + * @total_bytes: bytes accumulator that will be used for stats update
> > + */
> > +static void ice_fill_tx_hw_ring(struct ice_tx_ring *xdp_ring, struct xdp_desc *descs,
> > +				u32 nb_pkts, unsigned int *total_bytes)
> > +{
> > +	u16 tx_thresh = xdp_ring->tx_thresh;
> > +	struct ice_tx_desc *tx_desc;
> 
> And @tx_desc as well.
> 
> > +	u32 batched, leftover, i;
> > +
> > +	batched = nb_pkts & ~(PKTS_PER_BATCH - 1);
> > +	leftover = nb_pkts & (PKTS_PER_BATCH - 1);
> > +	for (i = 0; i < batched; i += PKTS_PER_BATCH)
> > +		ice_xmit_pkt_batch(xdp_ring, &descs[i], total_bytes);
> > +	for (i = batched; i < batched + leftover; i++)

Breh, I overlooked that. @i will equal @batched after exiting the
first loop, so the assignment here is redundant (probably harmless
tho if the compilers are smart enough).
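
I.e. the tail loop could just continue from @i (sketch):

	for (i = 0; i < batched; i += PKTS_PER_BATCH)
		ice_xmit_pkt_batch(xdp_ring, &descs[i], total_bytes);
	for (; i < nb_pkts; i++)
		ice_xmit_pkt(xdp_ring, &descs[i], total_bytes);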

> > +		ice_xmit_pkt(xdp_ring, &descs[i], total_bytes);
> > +
> > +	if (xdp_ring->next_to_use > xdp_ring->next_rs) {
> > +		tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_rs);
> > +		tx_desc->cmd_type_offset_bsz |=
> > +			cpu_to_le64(ICE_TX_DESC_CMD_RS << ICE_TXD_QW1_CMD_S);
> > +		xdp_ring->next_rs += tx_thresh;
> > +	}
> > +}
> >  
> > -		prefetch(tx_desc);
> > +/**
> > + * ice_xmit_zc - take entries from XSK Tx ring and place them onto HW Tx ring
> > + * @xdp_ring: XDP ring to produce the HW Tx descriptors on
> > + * @budget: number of free descriptors on HW Tx ring that can be used
> > + * @napi_budget: amount of descriptors that NAPI allows us to clean
> > + *
> > + * Returns true if there is no more work that needs to be done, false otherwise
> > + */
> > +bool ice_xmit_zc(struct ice_tx_ring *xdp_ring, u32 budget, int napi_budget)
> > +{
> > +	struct xdp_desc *descs = xdp_ring->xsk_pool->tx_descs;
> > +	u16 tx_thresh = xdp_ring->tx_thresh;
> > +	u32 nb_pkts, nb_processed = 0;
> > +	unsigned int total_bytes = 0;
> > +	struct ice_tx_desc *tx_desc;
> 
> And this @tx_desc.
> 
> >  
> > -	} while (likely(--budget));
> > +	if (budget < tx_thresh)
> > +		budget += ice_clean_xdp_irq_zc(xdp_ring, napi_budget);
> > +
> > +	nb_pkts = xsk_tx_peek_release_desc_batch(xdp_ring->xsk_pool, budget);
> > +	if (!nb_pkts)
> > +		return true;
> > +
> > +	if (xdp_ring->next_to_use + nb_pkts >= xdp_ring->count) {
> > +		nb_processed = xdp_ring->count - xdp_ring->next_to_use;
> > +		ice_fill_tx_hw_ring(xdp_ring, descs, nb_processed, &total_bytes);
> > +		tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_rs);
> > +		tx_desc->cmd_type_offset_bsz |=
> > +			cpu_to_le64(ICE_TX_DESC_CMD_RS << ICE_TXD_QW1_CMD_S);
> > +		xdp_ring->next_rs = tx_thresh - 1;
> > +		xdp_ring->next_to_use = 0;
> > +	}
> >  
> > -	ntc += xdp_ring->count;
> > -	xdp_ring->next_to_clean = ntc;
> > +	ice_fill_tx_hw_ring(xdp_ring, &descs[nb_processed], nb_pkts - nb_processed,
> > +			    &total_bytes);
> >  
> > -	if (xsk_frames)
> > -		xsk_tx_completed(xdp_ring->xsk_pool, xsk_frames);
> > +	ice_xdp_ring_update_tail(xdp_ring);
> > +	ice_update_tx_ring_stats(xdp_ring, nb_pkts, total_bytes);
> >  
> >  	if (xsk_uses_need_wakeup(xdp_ring->xsk_pool))
> >  		xsk_set_tx_need_wakeup(xdp_ring->xsk_pool);
> >  
> > -	ice_update_tx_ring_stats(xdp_ring, total_packets, total_bytes);
> > -	xmit_done = ice_xmit_zc(xdp_ring, ICE_DFLT_IRQ_WORK);
> > -
> > -	return budget > 0 && xmit_done;
> > +	return nb_pkts < budget;
> >  }
> >  
> >  /**
> > diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.h b/drivers/net/ethernet/intel/ice/ice_xsk.h
> > index 4c7bd8e9dfc4..0cbb5793b5b8 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_xsk.h
> > +++ b/drivers/net/ethernet/intel/ice/ice_xsk.h
> > @@ -6,19 +6,37 @@
> >  #include "ice_txrx.h"
> >  #include "ice.h"
> >  
> > +#define PKTS_PER_BATCH 8
> > +
> > +#ifdef __clang__
> > +#define loop_unrolled_for _Pragma("clang loop unroll_count(8)") for
> > +#elif __GNUC__ >= 4
> > +#define loop_unrolled_for _Pragma("GCC unroll 8") for
> > +#else
> > +#define loop_unrolled_for for
> > +#endif
> > +
> >  struct ice_vsi;
> >  
> >  #ifdef CONFIG_XDP_SOCKETS
> >  int ice_xsk_pool_setup(struct ice_vsi *vsi, struct xsk_buff_pool *pool,
> >  		       u16 qid);
> >  int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget);
> > -bool ice_clean_tx_irq_zc(struct ice_tx_ring *xdp_ring, int budget);
> >  int ice_xsk_wakeup(struct net_device *netdev, u32 queue_id, u32 flags);
> >  bool ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring, u16 count);
> >  bool ice_xsk_any_rx_ring_ena(struct ice_vsi *vsi);
> >  void ice_xsk_clean_rx_ring(struct ice_rx_ring *rx_ring);
> >  void ice_xsk_clean_xdp_ring(struct ice_tx_ring *xdp_ring);
> > +bool ice_xmit_zc(struct ice_tx_ring *xdp_ring, u32 budget, int napi_budget);
> >  #else
> > +static inline bool
> > +ice_xmit_zc(struct ice_tx_ring __always_unused *xdp_ring,
> > +	    u32 __always_unused budget,
> > +	    int __always_unused napi_budget)
> > +{
> > +	return false;
> > +}
> > +
> >  static inline int
> >  ice_xsk_pool_setup(struct ice_vsi __always_unused *vsi,
> >  		   struct xsk_buff_pool __always_unused *pool,
> > @@ -34,13 +52,6 @@ ice_clean_rx_irq_zc(struct ice_rx_ring __always_unused *rx_ring,
> >  	return 0;
> >  }
> >  
> > -static inline bool
> > -ice_clean_tx_irq_zc(struct ice_tx_ring __always_unused *xdp_ring,
> > -		    int __always_unused budget)
> > -{
> > -	return false;
> > -}
> > -
> >  static inline bool
> >  ice_alloc_rx_bufs_zc(struct ice_rx_ring __always_unused *rx_ring,
> >  		     u16 __always_unused count)
> > -- 
> > 2.33.1
> 
> Thanks,
> Al

Al


* Re: [PATCH bpf-next v3 2/7] ice: xsk: handle SW XDP ring wrap and bump tail more often
  2022-01-21 12:29   ` Alexander Lobakin
@ 2022-01-21 14:31     ` Maciej Fijalkowski
  2022-01-24 16:44       ` Maciej Fijalkowski
  0 siblings, 1 reply; 16+ messages in thread
From: Maciej Fijalkowski @ 2022-01-21 14:31 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: bpf, ast, daniel, netdev, magnus.karlsson, jesse.brandeburg

On Fri, Jan 21, 2022 at 01:29:20PM +0100, Alexander Lobakin wrote:
> From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> Date: Fri, 21 Jan 2022 13:00:06 +0100
> 
> > Currently, if ice_clean_rx_irq_zc() processed the whole ring and
> > next_to_use != 0, then ice_alloc_rx_buf_zc() would not refill the whole
> > ring even if the XSK buffer pool would have enough free entries (either
> > from fill ring or the internal recycle mechanism) - it is because ring
> > wrap is not handled.
> > 
> > Improve the logic in ice_alloc_rx_buf_zc() to address the problem above.
> > Do not clamp the count of buffers that is passed to
> > xsk_buff_alloc_batch() in case when next_to_use + buffer count >=
> > rx_ring->count,  but rather split it and have two calls to the mentioned
> > function - one for the part up until the wrap and one for the part after
> > the wrap.
> > 
> > Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > ---
> >  drivers/net/ethernet/intel/ice/ice_txrx.h |  2 +
> >  drivers/net/ethernet/intel/ice/ice_xsk.c  | 99 ++++++++++++++++++-----
> >  2 files changed, 81 insertions(+), 20 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
> > index b7b3bd4816f0..94a46e0e5ed0 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_txrx.h
> > +++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
> > @@ -111,6 +111,8 @@ static inline int ice_skb_pad(void)
> >  	(u16)((((R)->next_to_clean > (R)->next_to_use) ? 0 : (R)->count) + \
> >  	      (R)->next_to_clean - (R)->next_to_use - 1)
> >  
> > +#define ICE_RING_QUARTER(R) ((R)->count / 4)
> 
> I would use `>> 2` here just to show off :D

:)


(...)

> 
> > +
> > +/**
> > + * ice_alloc_rx_bufs_zc - allocate a number of Rx buffers
> > + * @rx_ring: Rx ring
> > + * @count: The number of buffers to allocate
> > + *
> > + * Wrapper for internal allocation routine; figure out how many tail
> > + * bumps should take place based on the given threshold
> > + *
> > + * Returns true if all calls to internal alloc routine succeeded
> > + */
> > +bool ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring, u16 count)
> > +{
> > +	u16 rx_thresh = ICE_RING_QUARTER(rx_ring);
> > +	u16 batched, leftover, i, tail_bumps;
> > +
> > +	batched = count & ~(rx_thresh - 1);
> 
> The ring size can be a non-power-of-two, unfortunately; it is only
> required to be aligned to 32: [0]. So it can be e.g. 96 and the mask
> will break then.

Ugh nice catch!

> You could use roundup_pow_of_two(ICE_RING_QUARTER(rx_ring)), but that
> might be a little slower due to the fls_long() (bit search) inside.
> 
> (I would generally prohibit non-pow-2 ring sizes at all from inside
>  the Ethtool callbacks since it makes no sense to me :p)

Although the user would lose some of the freedom, it makes a lot of
sense to me. Jesse, what's your view?

> 
> Also, it's not recommended to open-code an align-down since we have
> the ALIGN_DOWN(value, pow_of_two_alignment) macro. The macro hell
> inside expands to the same op you do here.

ack I'll try to use existing macros.

> 
> > +	tail_bumps = batched / rx_thresh;
> > +	leftover = count & (rx_thresh - 1);
> >  
> > -	return count == nb_buffs;
> > +	for (i = 0; i < tail_bumps; i++)
> > +		if (!__ice_alloc_rx_bufs_zc(rx_ring, rx_thresh))
> > +			return false;
> > +	return __ice_alloc_rx_bufs_zc(rx_ring, leftover);
> >  }
> >  
> >  /**
> > -- 
> > 2.33.1
> 
> [0] https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/intel/ice/ice_ethtool.c#L2729
> 
> Thanks,
> Al


* Re: Rx: [PATCH bpf-next v3 3/7] ice: make Tx threshold dependent on ring length
  2022-01-21 12:40   ` Rx: " Alexander Lobakin
@ 2022-01-21 14:34     ` Maciej Fijalkowski
  0 siblings, 0 replies; 16+ messages in thread
From: Maciej Fijalkowski @ 2022-01-21 14:34 UTC (permalink / raw)
  To: Alexander Lobakin; +Cc: bpf, ast, daniel, netdev, magnus.karlsson

On Fri, Jan 21, 2022 at 01:40:29PM +0100, Alexander Lobakin wrote:
> From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> Date: Fri, 21 Jan 2022 13:00:07 +0100
> 
> > XDP_TX workloads use a concept of Tx threshold that indicates the
> > interval of setting RS bit on descriptors which in turn tells the HW to
> > generate an interrupt to signal the completion of Tx on HW side. It is
> > currently based on a constant value of 32 which might not work out well
> > for various sizes of ring combined with for example batch size that can
> > be set via SO_BUSY_POLL_BUDGET.
> > 
> > Internal tests based on AF_XDP showed that most convenient setup of
> > mentioned threshold is when it is equal to quarter of a ring length.
> > 
> > Introduce @tx_thresh field in ice_tx_ring struct that will store the
> > value of threshold and use it in favor of ICE_TX_THRESH.
> > 
> > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > ---
> >  drivers/net/ethernet/intel/ice/ice_ethtool.c  |  3 +++
> >  drivers/net/ethernet/intel/ice/ice_main.c     |  5 +++--
> >  drivers/net/ethernet/intel/ice/ice_txrx.h     |  7 ++++++-
> >  drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 14 ++++++++------
> >  4 files changed, 20 insertions(+), 9 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c b/drivers/net/ethernet/intel/ice/ice_ethtool.c
> > index e2e3ef7fba7f..bfa5e5d167ab 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
> > +++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
> > @@ -2803,6 +2803,9 @@ ice_set_ringparam(struct net_device *netdev, struct ethtool_ringparam *ring,
> >  		/* clone ring and setup updated count */
> >  		xdp_rings[i] = *vsi->xdp_rings[i];
> >  		xdp_rings[i].count = new_tx_cnt;
> > +		xdp_rings[i].tx_thresh = ice_get_tx_threshold(&xdp_rings[i]);
> > +		xdp_rings[i].next_dd = xdp_rings[i].tx_thresh - 1;
> > +		xdp_rings[i].next_rs = xdp_rings[i].tx_thresh - 1;
> >  		xdp_rings[i].desc = NULL;
> >  		xdp_rings[i].tx_buf = NULL;
> >  		err = ice_setup_tx_ring(&xdp_rings[i]);
> > diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
> > index 30814435f779..0fd12a7f6d22 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_main.c
> > +++ b/drivers/net/ethernet/intel/ice/ice_main.c
> > @@ -2495,10 +2495,11 @@ static int ice_xdp_alloc_setup_rings(struct ice_vsi *vsi)
> >  		xdp_ring->reg_idx = vsi->txq_map[xdp_q_idx];
> >  		xdp_ring->vsi = vsi;
> >  		xdp_ring->netdev = NULL;
> > -		xdp_ring->next_dd = ICE_TX_THRESH - 1;
> > -		xdp_ring->next_rs = ICE_TX_THRESH - 1;
> >  		xdp_ring->dev = dev;
> >  		xdp_ring->count = vsi->num_tx_desc;
> > +		xdp_ring->tx_thresh = ice_get_tx_threshold(xdp_ring);
> > +		xdp_ring->next_dd = xdp_ring->tx_thresh - 1;
> > +		xdp_ring->next_rs = xdp_ring->tx_thresh - 1;
> >  		WRITE_ONCE(vsi->xdp_rings[i], xdp_ring);
> >  		if (ice_setup_tx_ring(xdp_ring))
> >  			goto free_xdp_rings;
> > diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
> > index 94a46e0e5ed0..09c8ad2f7403 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_txrx.h
> > +++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
> > @@ -13,7 +13,6 @@
> >  #define ICE_MAX_CHAINED_RX_BUFS	5
> >  #define ICE_MAX_BUF_TXD		8
> >  #define ICE_MIN_TX_LEN		17
> > -#define ICE_TX_THRESH		32
> >  
> >  /* The size limit for a transmit buffer in a descriptor is (16K - 1).
> >   * In order to align with the read requests we will align the value to
> > @@ -333,6 +332,7 @@ struct ice_tx_ring {
> >  	struct ice_channel *ch;
> >  	struct ice_ptp_tx *tx_tstamps;
> >  	spinlock_t tx_lock;
> > +	u16 tx_thresh;
> 
> Creating 2-byte hole here, but it's likely ok since I don't see a
> place to pack it nicely. u8 at the end is not an option obviously.
> 
> (unless you want to store the order rather than the threshold
> itself -- would require an additional macro, but at the same time
> align nicely with the fact that we need the threshold to be of
> power of two)

I started to wonder if we really need this to be embedded in the Tx ring
struct. The calculation is not that expensive and it would always stay
adjusted to the ring length.
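
I.e. just deriving it at the top of the hot paths instead (sketch):

	/* in ice_clean_xdp_irq() and friends, instead of reading
	 * xdp_ring->tx_thresh:
	 */
	u16 tx_thresh = ICE_RING_QUARTER(xdp_ring);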

> 
> >  	u32 txq_teid;			/* Added Tx queue TEID */
> >  #define ICE_TX_FLAGS_RING_XDP		BIT(0)
> >  	u8 flags;
> > @@ -355,6 +355,11 @@ static inline void ice_clear_ring_build_skb_ena(struct ice_rx_ring *ring)
> >  	ring->flags &= ~ICE_RX_FLAGS_RING_BUILD_SKB;
> >  }
> >  
> > +static inline u16 ice_get_tx_threshold(struct ice_tx_ring *tx_ring)
> > +{
> > +	return ICE_RING_QUARTER(tx_ring);
> > +}
> > +
> >  static inline bool ice_ring_ch_enabled(struct ice_tx_ring *ring)
> >  {
> >  	return !!ring->ch;
> > diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
> > index 0e87b98e0966..5706b5405373 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
> > +++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
> > @@ -222,6 +222,7 @@ ice_receive_skb(struct ice_rx_ring *rx_ring, struct sk_buff *skb, u16 vlan_tag)
> >  static void ice_clean_xdp_irq(struct ice_tx_ring *xdp_ring)
> >  {
> >  	unsigned int total_bytes = 0, total_pkts = 0;
> > +	u16 tx_thresh = xdp_ring->tx_thresh;
> >  	u16 ntc = xdp_ring->next_to_clean;
> >  	struct ice_tx_desc *next_dd_desc;
> >  	u16 next_dd = xdp_ring->next_dd;
> > @@ -233,7 +234,7 @@ static void ice_clean_xdp_irq(struct ice_tx_ring *xdp_ring)
> >  	    cpu_to_le64(ICE_TX_DESC_DTYPE_DESC_DONE)))
> >  		return;
> >  
> > -	for (i = 0; i < ICE_TX_THRESH; i++) {
> > +	for (i = 0; i < tx_thresh; i++) {
> >  		tx_buf = &xdp_ring->tx_buf[ntc];
> >  
> >  		total_bytes += tx_buf->bytecount;
> > @@ -254,9 +255,9 @@ static void ice_clean_xdp_irq(struct ice_tx_ring *xdp_ring)
> >  	}
> >  
> >  	next_dd_desc->cmd_type_offset_bsz = 0;
> > -	xdp_ring->next_dd = xdp_ring->next_dd + ICE_TX_THRESH;
> > +	xdp_ring->next_dd = xdp_ring->next_dd + tx_thresh;
> >  	if (xdp_ring->next_dd > xdp_ring->count)
> > -		xdp_ring->next_dd = ICE_TX_THRESH - 1;
> > +		xdp_ring->next_dd = tx_thresh - 1;
> >  	xdp_ring->next_to_clean = ntc;
> >  	ice_update_tx_ring_stats(xdp_ring, total_pkts, total_bytes);
> >  }
> > @@ -269,12 +270,13 @@ static void ice_clean_xdp_irq(struct ice_tx_ring *xdp_ring)
> >   */
> >  int ice_xmit_xdp_ring(void *data, u16 size, struct ice_tx_ring *xdp_ring)
> >  {
> > +	u16 tx_thresh = xdp_ring->tx_thresh;
> >  	u16 i = xdp_ring->next_to_use;
> >  	struct ice_tx_desc *tx_desc;
> >  	struct ice_tx_buf *tx_buf;
> >  	dma_addr_t dma;
> >  
> > -	if (ICE_DESC_UNUSED(xdp_ring) < ICE_TX_THRESH)
> > +	if (ICE_DESC_UNUSED(xdp_ring) < tx_thresh)
> >  		ice_clean_xdp_irq(xdp_ring);
> >  
> >  	if (!unlikely(ICE_DESC_UNUSED(xdp_ring))) {
> > @@ -306,7 +308,7 @@ int ice_xmit_xdp_ring(void *data, u16 size, struct ice_tx_ring *xdp_ring)
> >  		tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_rs);
> >  		tx_desc->cmd_type_offset_bsz |=
> >  			cpu_to_le64(ICE_TX_DESC_CMD_RS << ICE_TXD_QW1_CMD_S);
> > -		xdp_ring->next_rs = ICE_TX_THRESH - 1;
> > +		xdp_ring->next_rs = tx_thresh - 1;
> >  	}
> >  	xdp_ring->next_to_use = i;
> >  
> > @@ -314,7 +316,7 @@ int ice_xmit_xdp_ring(void *data, u16 size, struct ice_tx_ring *xdp_ring)
> >  		tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_rs);
> >  		tx_desc->cmd_type_offset_bsz |=
> >  			cpu_to_le64(ICE_TX_DESC_CMD_RS << ICE_TXD_QW1_CMD_S);
> > -		xdp_ring->next_rs += ICE_TX_THRESH;
> > +		xdp_ring->next_rs += tx_thresh;
> >  	}
> >  
> >  	return ICE_XDP_TX;
> > -- 
> > 2.33.1
> 
> Thanks,
> Al


* Re: [PATCH bpf-next v3 6/7] ice: xsk: improve AF_XDP ZC Tx and use batching API
  2022-01-21 13:17     ` Alexander Lobakin
@ 2022-01-21 14:41       ` Maciej Fijalkowski
  0 siblings, 0 replies; 16+ messages in thread
From: Maciej Fijalkowski @ 2022-01-21 14:41 UTC (permalink / raw)
  To: Alexander Lobakin; +Cc: bpf, ast, daniel, netdev, magnus.karlsson

On Fri, Jan 21, 2022 at 02:17:42PM +0100, Alexander Lobakin wrote:
> From: Alexander Lobakin <alexandr.lobakin@intel.com>
> Date: Fri, 21 Jan 2022 13:54:47 +0100
> 
> > From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > Date: Fri, 21 Jan 2022 13:00:10 +0100
> > 
> > > Apply the logic that was done for regular XDP from commit 9610bd988df9
> > > ("ice: optimize XDP_TX workloads") to the ZC side of the driver. On top
> > > of that, introduce batching to Tx that is inspired by i40e's
> > > implementation with adjustments to the cleaning logic - take into the
> > > account NAPI budget in ice_clean_xdp_irq_zc().
> > > 
> > > Separating the stats structs onto separate cache lines seemed to improve
> > > the performance.
> > > 
> > > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > > ---
> > >  drivers/net/ethernet/intel/ice/ice_txrx.c |   2 +-
> > >  drivers/net/ethernet/intel/ice/ice_txrx.h |   2 +-
> > >  drivers/net/ethernet/intel/ice/ice_xsk.c  | 256 ++++++++++++++--------
> > >  drivers/net/ethernet/intel/ice/ice_xsk.h  |  27 ++-
> > >  4 files changed, 186 insertions(+), 101 deletions(-)
> > > 
> > > +/**
> > > + * ice_fill_tx_hw_ring - produce the number of Tx descriptors onto ring
> > > + * @xdp_ring: XDP ring to produce the HW Tx descriptors on
> > > + * @descs: AF_XDP descriptors to pull the DMA addresses and lengths from
> > > + * @nb_pkts: count of packets to be send
> > > + * @total_bytes: bytes accumulator that will be used for stats update
> > > + */
> > > +static void ice_fill_tx_hw_ring(struct ice_tx_ring *xdp_ring, struct xdp_desc *descs,
> > > +				u32 nb_pkts, unsigned int *total_bytes)
> > > +{
> > > +	u16 tx_thresh = xdp_ring->tx_thresh;
> > > +	struct ice_tx_desc *tx_desc;
> > 
> > And @tx_desc as well.
> > 
> > > +	u32 batched, leftover, i;
> > > +
> > > +	batched = nb_pkts & ~(PKTS_PER_BATCH - 1);
> > > +	leftover = nb_pkts & (PKTS_PER_BATCH - 1);
> > > +	for (i = 0; i < batched; i += PKTS_PER_BATCH)
> > > +		ice_xmit_pkt_batch(xdp_ring, &descs[i], total_bytes);
> > > +	for (i = batched; i < batched + leftover; i++)
> 
> Breh, I overlooked that. @i will equal @batched after exiting the
> first loop, so the assignment here is redundant (probably harmless
> tho if the compilers are smart enough).

I can drop this and scope variables properly, thanks!

> 
> > > +		ice_xmit_pkt(xdp_ring, &descs[i], total_bytes);
> > > +
> > > +	if (xdp_ring->next_to_use > xdp_ring->next_rs) {
> > > +		tx_desc = ICE_TX_DESC(xdp_ring, xdp_ring->next_rs);
> > > +		tx_desc->cmd_type_offset_bsz |=
> > > +			cpu_to_le64(ICE_TX_DESC_CMD_RS << ICE_TXD_QW1_CMD_S);
> > > +		xdp_ring->next_rs += tx_thresh;
> > > +	}
> > > +}
> > >  
> > > -		prefetch(tx_desc);
> > > +/**


* Re: [PATCH bpf-next v3 2/7] ice: xsk: handle SW XDP ring wrap and bump tail more often
  2022-01-21 14:31     ` Maciej Fijalkowski
@ 2022-01-24 16:44       ` Maciej Fijalkowski
  0 siblings, 0 replies; 16+ messages in thread
From: Maciej Fijalkowski @ 2022-01-24 16:44 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: bpf, ast, daniel, netdev, magnus.karlsson, jesse.brandeburg

On Fri, Jan 21, 2022 at 03:31:31PM +0100, Maciej Fijalkowski wrote:
> On Fri, Jan 21, 2022 at 01:29:20PM +0100, Alexander Lobakin wrote:
> > From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > Date: Fri, 21 Jan 2022 13:00:06 +0100
> > 
> > > Currently, if ice_clean_rx_irq_zc() processed the whole ring and
> > > next_to_use != 0, then ice_alloc_rx_buf_zc() would not refill the whole
> > > ring even if the XSK buffer pool would have enough free entries (either
> > > from fill ring or the internal recycle mechanism) - it is because ring
> > > wrap is not handled.
> > > 
> > > Improve the logic in ice_alloc_rx_buf_zc() to address the problem above.
> > > Do not clamp the count of buffers that is passed to
> > > xsk_buff_alloc_batch() in case when next_to_use + buffer count >=
> > > rx_ring->count,  but rather split it and have two calls to the mentioned
> > > function - one for the part up until the wrap and one for the part after
> > > the wrap.
> > > 
> > > Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> > > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > > ---
> > >  drivers/net/ethernet/intel/ice/ice_txrx.h |  2 +
> > >  drivers/net/ethernet/intel/ice/ice_xsk.c  | 99 ++++++++++++++++++-----
> > >  2 files changed, 81 insertions(+), 20 deletions(-)
> > > 
> > > diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
> > > index b7b3bd4816f0..94a46e0e5ed0 100644
> > > --- a/drivers/net/ethernet/intel/ice/ice_txrx.h
> > > +++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
> > > @@ -111,6 +111,8 @@ static inline int ice_skb_pad(void)
> > >  	(u16)((((R)->next_to_clean > (R)->next_to_use) ? 0 : (R)->count) + \
> > >  	      (R)->next_to_clean - (R)->next_to_use - 1)
> > >  
> > > +#define ICE_RING_QUARTER(R) ((R)->count / 4)
> > 
> > I would use `>> 2` here just to show off :D
> 
> :)
> 
> 
> (...)
> 
> > 
> > > +
> > > +/**
> > > + * ice_alloc_rx_bufs_zc - allocate a number of Rx buffers
> > > + * @rx_ring: Rx ring
> > > + * @count: The number of buffers to allocate
> > > + *
> > > + * Wrapper for internal allocation routine; figure out how many tail
> > > + * bumps should take place based on the given threshold
> > > + *
> > > + * Returns true if all calls to internal alloc routine succeeded
> > > + */
> > > +bool ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring, u16 count)
> > > +{
> > > +	u16 rx_thresh = ICE_RING_QUARTER(rx_ring);
> > > +	u16 batched, leftover, i, tail_bumps;
> > > +
> > > +	batched = count & ~(rx_thresh - 1);
> > 
> > Unfortunately, the ring size can be a non-power-of-two; it is only
> > aligned to 32: [0]. So it can be e.g. 96 and the mask will
> > break then.
> 
> Ugh nice catch!
> 
> > You could use roundup_pow_of_two(ICE_RING_QUARTER(rx_ring)), but it
> > might be a little slower due to the fls_long() (bit search) inside.
> > 
> > (I would generally prohibit non-pow-2 ring sizes altogether in the
> >  Ethtool callbacks since they make no sense to me :p)
> 
> Although the user would lose some of the freedom, it makes a lot of
> sense to me. Jesse, what's your view?

I decided to go with an approach that forbids xsk socket attachment when
the ring length (either Tx or Rx) is not a power of two. Sending v4 with
a separate patch for that.
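
With pow(2) ring lengths guaranteed, the wrapper body could then use the
ALIGN_DOWN() macro you mentioned - rough sketch, untested, with the same
helper names as in this patch:

	u16 rx_thresh = ICE_RING_QUARTER(rx_ring);
	u16 batched = ALIGN_DOWN(count, rx_thresh);	/* full tail bumps */
	u16 tail_bumps = batched / rx_thresh;
	u16 leftover = count - batched;
	u16 i;

	for (i = 0; i < tail_bumps; i++)
		if (!__ice_alloc_rx_bufs_zc(rx_ring, rx_thresh))
			return false;
	return __ice_alloc_rx_bufs_zc(rx_ring, leftover);

ALIGN_DOWN() wants a pow(2) alignment, which holds here since a quarter
of a pow(2) ring length is itself a pow(2).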

> 
> > 
> > Also, it's not recommended to open-code align-down since we have
> > the ALIGN_DOWN(value, pow_of_two_alignment) macro. The macro hell
> > inside expands to the same op you do here.
> 
> ack I'll try to use existing macros.
> 
> > 
> > > +	tail_bumps = batched / rx_thresh;
> > > +	leftover = count & (rx_thresh - 1);
> > >  
> > > -	return count == nb_buffs;
> > > +	for (i = 0; i < tail_bumps; i++)
> > > +		if (!__ice_alloc_rx_bufs_zc(rx_ring, rx_thresh))
> > > +			return false;
> > > +	return __ice_alloc_rx_bufs_zc(rx_ring, leftover);
> > >  }
> > >  
> > >  /**
> > > -- 
> > > 2.33.1
> > 
> > [0] https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/intel/ice/ice_ethtool.c#L2729
> > 
> > Thanks,
> > Al


Thread overview: 16+ messages
2022-01-21 12:00 [PATCH v3 bpf-next 0/7] xsk: Intel driver improvements Maciej Fijalkowski
2022-01-21 12:00 ` [PATCH bpf-next v3 1/7] ice: remove likely for napi_complete_done Maciej Fijalkowski
2022-01-21 12:00 ` [PATCH bpf-next v3 2/7] ice: xsk: handle SW XDP ring wrap and bump tail more often Maciej Fijalkowski
2022-01-21 12:29   ` Alexander Lobakin
2022-01-21 14:31     ` Maciej Fijalkowski
2022-01-24 16:44       ` Maciej Fijalkowski
2022-01-21 12:00 ` [PATCH bpf-next v3 3/7] ice: make Tx threshold dependent on ring length Maciej Fijalkowski
2022-01-21 12:40   ` Rx: " Alexander Lobakin
2022-01-21 14:34     ` Maciej Fijalkowski
2022-01-21 12:00 ` [PATCH bpf-next v3 4/7] i40e: xsk: move tmp desc array from driver to pool Maciej Fijalkowski
2022-01-21 12:00 ` [PATCH bpf-next v3 5/7] ice: xsk: avoid potential dead AF_XDP Tx processing Maciej Fijalkowski
2022-01-21 12:00 ` [PATCH bpf-next v3 6/7] ice: xsk: improve AF_XDP ZC Tx and use batching API Maciej Fijalkowski
2022-01-21 12:54   ` Alexander Lobakin
2022-01-21 13:17     ` Alexander Lobakin
2022-01-21 14:41       ` Maciej Fijalkowski
2022-01-21 12:00 ` [PATCH bpf-next v3 7/7] ice: xsk: borrow xdp_tx_active logic from i40e Maciej Fijalkowski
