bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH bpf-next 0/6] xsk: i40e: Tx performance improvements
@ 2020-11-04 14:08 Magnus Karlsson
  2020-11-04 14:08 ` [PATCH bpf-next 1/6] i40e: introduce lazy Tx completions for AF_XDP zero-copy Magnus Karlsson
                   ` (5 more replies)
  0 siblings, 6 replies; 21+ messages in thread
From: Magnus Karlsson @ 2020-11-04 14:08 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, ast, daniel, netdev, jonathan.lemon
  Cc: Magnus Karlsson, bpf, jeffrey.t.kirsher, anthony.l.nguyen,
	maciej.fijalkowski, maciejromanfijalkowski, intel-wired-lan

Subject: xsk: i40e: Tx performance improvements

This patch set improves the performance of mainly the Tx processing of
AF_XDP sockets. Though, patch 4 also improves the Rx path. All in all,
this patch set improves the throughput of the l2fwd xdpsock
application by around 14%. If we just take a look at Tx processing part,
it is improved by nearly 50%.

Hopefully the new batched Tx interfaces should be of value to other
drivers implementing AF_XDP zero-copy support. But patch #4 is generic
and will improve performance of all drivers when using AF_XDP sockets
(under the premises explained in that patch).

@Daniel. In patch 4, I apply all the padding required to hinder the
adjacency prefetcher to prefetch the wrong things. After this patch
set, I will submit another patch set that introduces
____cacheline_padding_in_smp in include/linux/cache.h according to your
suggestions. The last patch in that patch set will then convert the
explicit paddings that we have now to ____cacheline_padding_in_smp.

This patch has been applied against commit d0b3d2d7e50d ("Merge branch 'selftests/bpf: Migrate test_tcpbpf_user to be a part of test_progs'")

Structure of the patch set:

Patch 1: Introduce lazy Tx completions in the i40e driver.
Patch 2: For the xdpsock sample, increment Tx stats at sending instead
         of at completion.
Patch 3: Remove an unnecessary sw ring access from the Tx path in i40e.
Patch 4: Introduce padding between all pointers and fields in the ring.
Patch 5: Introduce batched Tx descriptor interfaces.
Patch 6: Use the new batched interfaces in the i40e driver to get higher
         throughput.

Thanks: Magnus

Magnus Karlsson (6):
  i40e: introduce lazy Tx completions for AF_XDP zero-copy
  samples/bpf: increment Tx stats at sending
  i40e: remove unnecessary sw_ring access from xsk Tx
  xsk: introduce padding between more ring pointers
  xsk: introduce batched Tx descriptor interfaces
  i40e: use batched xsk Tx interfaces to increase performance

 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |   2 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c    |   4 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c    |  14 ++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.h    |   3 +-
 drivers/net/ethernet/intel/i40e/i40e_xsk.c     | 140 +++++++++++++++++--------
 include/net/xdp_sock_drv.h                     |   7 ++
 net/xdp/xsk.c                                  |  43 ++++++++
 net/xdp/xsk_queue.h                            |  93 +++++++++++++---
 samples/bpf/xdpsock_user.c                     |   6 +-
 9 files changed, 249 insertions(+), 63 deletions(-)

--
2.7.4

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH bpf-next 1/6] i40e: introduce lazy Tx completions for AF_XDP zero-copy
  2020-11-04 14:08 [PATCH bpf-next 0/6] xsk: i40e: Tx performance improvements Magnus Karlsson
@ 2020-11-04 14:08 ` Magnus Karlsson
  2020-11-04 23:33   ` Jakub Kicinski
  2020-11-04 14:08 ` [PATCH bpf-next 2/6] samples/bpf: increment Tx stats at sending Magnus Karlsson
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Magnus Karlsson @ 2020-11-04 14:08 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, ast, daniel, netdev, jonathan.lemon
  Cc: bpf, jeffrey.t.kirsher, anthony.l.nguyen, maciej.fijalkowski,
	maciejromanfijalkowski, intel-wired-lan

From: Magnus Karlsson <magnus.karlsson@intel.com>

Introduce lazy Tx completions when a queue is used for AF_XDP
zero-copy. In the current design, each time we get into the NAPI poll
loop we try to complete as many Tx packets as possible from the
NIC. This is performed by reading the head pointer register in the NIC
that tells us how many packets have been completed. Reading this
register is expensive as it is across PCIe, so let us try to limit the
number of times it is read by only completing Tx packets to user-space
when the number of available descriptors in the Tx HW ring is below
some threshold. This will decrease the number of reads issued to the
NIC and improves performance with 1.5% - 2% for the l2fwd xdpsock
microbenchmark.

The threshold is set to the minimum possible size that the HW ring can
have. This so that we do not run into a scenario where the threshold
is higher than the configured number of descriptors in the HW ring.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_xsk.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 6acede0..f8815b3 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -9,6 +9,8 @@
 #include "i40e_txrx_common.h"
 #include "i40e_xsk.h"
 
+#define I40E_TX_COMPLETION_THRESHOLD I40E_MIN_NUM_DESCRIPTORS
+
 int i40e_alloc_rx_bi_zc(struct i40e_ring *rx_ring)
 {
 	unsigned long sz = sizeof(*rx_ring->rx_bi_zc) * rx_ring->count;
@@ -460,12 +462,15 @@ static void i40e_clean_xdp_tx_buffer(struct i40e_ring *tx_ring,
  **/
 bool i40e_clean_xdp_tx_irq(struct i40e_vsi *vsi, struct i40e_ring *tx_ring)
 {
+	u32 i, completed_frames, xsk_frames = 0, head_idx;
 	struct xsk_buff_pool *bp = tx_ring->xsk_pool;
-	u32 i, completed_frames, xsk_frames = 0;
-	u32 head_idx = i40e_get_head(tx_ring);
 	struct i40e_tx_buffer *tx_bi;
 	unsigned int ntc;
 
+	if (I40E_DESC_UNUSED(tx_ring) >= I40E_TX_COMPLETION_THRESHOLD)
+		goto out_xmit;
+
+	head_idx = i40e_get_head(tx_ring);
 	if (head_idx < tx_ring->next_to_clean)
 		head_idx += tx_ring->count;
 	completed_frames = head_idx - tx_ring->next_to_clean;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH bpf-next 2/6] samples/bpf: increment Tx stats at sending
  2020-11-04 14:08 [PATCH bpf-next 0/6] xsk: i40e: Tx performance improvements Magnus Karlsson
  2020-11-04 14:08 ` [PATCH bpf-next 1/6] i40e: introduce lazy Tx completions for AF_XDP zero-copy Magnus Karlsson
@ 2020-11-04 14:08 ` Magnus Karlsson
  2020-11-09 20:47   ` [Intel-wired-lan] " John Fastabend
  2020-11-04 14:08 ` [PATCH bpf-next 3/6] i40e: remove unnecessary sw_ring access from xsk Tx Magnus Karlsson
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Magnus Karlsson @ 2020-11-04 14:08 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, ast, daniel, netdev, jonathan.lemon
  Cc: bpf, jeffrey.t.kirsher, anthony.l.nguyen, maciej.fijalkowski,
	maciejromanfijalkowski, intel-wired-lan

From: Magnus Karlsson <magnus.karlsson@intel.com>

Increment the statistics over how many Tx packets have been sent at
the time of sending instead of at the time of completion. This as a
completion event means that the buffer has been sent AND returned to
user space. The packet always gets sent shortly after sendto() is
called. The kernel might, for performance reasons, decide to not
return every single buffer to user space immediately after sending,
for example, only after a batch of packets have been
transmitted. Incrementing the number of packets sent at completion,
will in that case be confusing as if you send a single packet, the
counter might show zero for a while even though the packet has been
transmitted.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 samples/bpf/xdpsock_user.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index 1149e94..2567f0d 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -1146,7 +1146,6 @@ static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk,
 		xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
 		xsk_ring_cons__release(&xsk->umem->cq, rcvd);
 		xsk->outstanding_tx -= rcvd;
-		xsk->ring_stats.tx_npkts += rcvd;
 	}
 }
 
@@ -1168,7 +1167,6 @@ static inline void complete_tx_only(struct xsk_socket_info *xsk,
 	if (rcvd > 0) {
 		xsk_ring_cons__release(&xsk->umem->cq, rcvd);
 		xsk->outstanding_tx -= rcvd;
-		xsk->ring_stats.tx_npkts += rcvd;
 	}
 }
 
@@ -1260,6 +1258,7 @@ static void tx_only(struct xsk_socket_info *xsk, u32 *frame_nb, int batch_size)
 	}
 
 	xsk_ring_prod__submit(&xsk->tx, batch_size);
+	xsk->ring_stats.tx_npkts += batch_size;
 	xsk->outstanding_tx += batch_size;
 	*frame_nb += batch_size;
 	*frame_nb %= NUM_FRAMES;
@@ -1348,6 +1347,7 @@ static void l2fwd(struct xsk_socket_info *xsk, struct pollfd *fds)
 		}
 		return;
 	}
+	xsk->ring_stats.rx_npkts += rcvd;
 
 	ret = xsk_ring_prod__reserve(&xsk->tx, rcvd, &idx_tx);
 	while (ret != rcvd) {
@@ -1379,7 +1379,7 @@ static void l2fwd(struct xsk_socket_info *xsk, struct pollfd *fds)
 	xsk_ring_prod__submit(&xsk->tx, rcvd);
 	xsk_ring_cons__release(&xsk->rx, rcvd);
 
-	xsk->ring_stats.rx_npkts += rcvd;
+	xsk->ring_stats.tx_npkts += rcvd;
 	xsk->outstanding_tx += rcvd;
 }
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH bpf-next 3/6] i40e: remove unnecessary sw_ring access from xsk Tx
  2020-11-04 14:08 [PATCH bpf-next 0/6] xsk: i40e: Tx performance improvements Magnus Karlsson
  2020-11-04 14:08 ` [PATCH bpf-next 1/6] i40e: introduce lazy Tx completions for AF_XDP zero-copy Magnus Karlsson
  2020-11-04 14:08 ` [PATCH bpf-next 2/6] samples/bpf: increment Tx stats at sending Magnus Karlsson
@ 2020-11-04 14:08 ` Magnus Karlsson
  2020-11-09 20:48   ` [Intel-wired-lan] " John Fastabend
  2020-11-04 14:09 ` [PATCH bpf-next 4/6] xsk: introduce padding between more ring pointers Magnus Karlsson
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Magnus Karlsson @ 2020-11-04 14:08 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, ast, daniel, netdev, jonathan.lemon
  Cc: bpf, jeffrey.t.kirsher, anthony.l.nguyen, maciej.fijalkowski,
	maciejromanfijalkowski, intel-wired-lan

From: Magnus Karlsson <magnus.karlsson@intel.com>

Remove the unnecessary access to the software ring for the AF_XDP
zero-copy driver. This was used to record the length of the packet so
that the driver Tx completion code could sum this up to produce the
total bytes sent. This is now performed during the transmission of the
packet, so no need to record this in the software ring.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_xsk.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index f8815b3..eabe1a3 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -394,7 +394,6 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
 {
 	unsigned int sent_frames = 0, total_bytes = 0;
 	struct i40e_tx_desc *tx_desc = NULL;
-	struct i40e_tx_buffer *tx_bi;
 	struct xdp_desc desc;
 	dma_addr_t dma;
 
@@ -406,9 +405,6 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
 		xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma,
 						 desc.len);
 
-		tx_bi = &xdp_ring->tx_bi[xdp_ring->next_to_use];
-		tx_bi->bytecount = desc.len;
-
 		tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use);
 		tx_desc->buffer_addr = cpu_to_le64(dma);
 		tx_desc->cmd_type_offset_bsz =
@@ -417,7 +413,7 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
 				   0, desc.len, 0);
 
 		sent_frames++;
-		total_bytes += tx_bi->bytecount;
+		total_bytes += desc.len;
 
 		xdp_ring->next_to_use++;
 		if (xdp_ring->next_to_use == xdp_ring->count)
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH bpf-next 4/6] xsk: introduce padding between more ring pointers
  2020-11-04 14:08 [PATCH bpf-next 0/6] xsk: i40e: Tx performance improvements Magnus Karlsson
                   ` (2 preceding siblings ...)
  2020-11-04 14:08 ` [PATCH bpf-next 3/6] i40e: remove unnecessary sw_ring access from xsk Tx Magnus Karlsson
@ 2020-11-04 14:09 ` Magnus Karlsson
  2020-11-09 20:43   ` [Intel-wired-lan] " John Fastabend
  2020-11-04 14:09 ` [PATCH bpf-next 5/6] xsk: introduce batched Tx descriptor interfaces Magnus Karlsson
  2020-11-04 14:09 ` [PATCH bpf-next 6/6] i40e: use batched xsk Tx interfaces to increase performance Magnus Karlsson
  5 siblings, 1 reply; 21+ messages in thread
From: Magnus Karlsson @ 2020-11-04 14:09 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, ast, daniel, netdev, jonathan.lemon
  Cc: bpf, jeffrey.t.kirsher, anthony.l.nguyen, maciej.fijalkowski,
	maciejromanfijalkowski, intel-wired-lan

From: Magnus Karlsson <magnus.karlsson@intel.com>

Introduce one cache line worth of padding between the consumer pointer
and the flags field as well as between the flags field and the start
of the descriptors in all the lockless rings. This so that the x86 HW
adjacency prefetcher will not prefetch the adjacent pointer/field when
only one pointer/field is going to be used. This improves throughput
performance for the l2fwd sample app with 1% on my machine with HW
prefetching turned on in the BIOS.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 net/xdp/xsk_queue.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index cdb9cf3..74fac80 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -18,9 +18,11 @@ struct xdp_ring {
 	/* Hinder the adjacent cache prefetcher to prefetch the consumer
 	 * pointer if the producer pointer is touched and vice versa.
 	 */
-	u32 pad ____cacheline_aligned_in_smp;
+	u32 pad1 ____cacheline_aligned_in_smp;
 	u32 consumer ____cacheline_aligned_in_smp;
+	u32 pad2 ____cacheline_aligned_in_smp;
 	u32 flags;
+	u32 pad3 ____cacheline_aligned_in_smp;
 };
 
 /* Used for the RX and TX queues for packets */
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH bpf-next 5/6] xsk: introduce batched Tx descriptor interfaces
  2020-11-04 14:08 [PATCH bpf-next 0/6] xsk: i40e: Tx performance improvements Magnus Karlsson
                   ` (3 preceding siblings ...)
  2020-11-04 14:09 ` [PATCH bpf-next 4/6] xsk: introduce padding between more ring pointers Magnus Karlsson
@ 2020-11-04 14:09 ` Magnus Karlsson
  2020-11-09 21:06   ` [Intel-wired-lan] " John Fastabend
  2020-11-04 14:09 ` [PATCH bpf-next 6/6] i40e: use batched xsk Tx interfaces to increase performance Magnus Karlsson
  5 siblings, 1 reply; 21+ messages in thread
From: Magnus Karlsson @ 2020-11-04 14:09 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, ast, daniel, netdev, jonathan.lemon
  Cc: bpf, jeffrey.t.kirsher, anthony.l.nguyen, maciej.fijalkowski,
	maciejromanfijalkowski, intel-wired-lan

From: Magnus Karlsson <magnus.karlsson@intel.com>

Introduce batched descriptor interfaces in the xsk core code for the
Tx path to be used in the driver to write a code path with higher
performance. This interface will be used by the i40e driver in the
next patch. Though other drivers would likely benefit from this new
interface too.

Note that batching is only implemented for the common case when
there is only one socket bound to the same device and queue id. When
this is not the case, we fall back to the old non-batched version of
the function.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/net/xdp_sock_drv.h |  7 ++++
 net/xdp/xsk.c              | 43 ++++++++++++++++++++++
 net/xdp/xsk_queue.h        | 89 +++++++++++++++++++++++++++++++++++++++-------
 3 files changed, 126 insertions(+), 13 deletions(-)

diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 5b1ee8a..4e295541 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -13,6 +13,7 @@
 
 void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries);
 bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc);
+u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, struct xdp_desc *desc, u32 max);
 void xsk_tx_release(struct xsk_buff_pool *pool);
 struct xsk_buff_pool *xsk_get_pool_from_qid(struct net_device *dev,
 					    u16 queue_id);
@@ -128,6 +129,12 @@ static inline bool xsk_tx_peek_desc(struct xsk_buff_pool *pool,
 	return false;
 }
 
+static inline u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, struct xdp_desc *desc,
+						 u32 max)
+{
+	return 0;
+}
+
 static inline void xsk_tx_release(struct xsk_buff_pool *pool)
 {
 }
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index b71a32e..dd75b5f 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -332,6 +332,49 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc)
 }
 EXPORT_SYMBOL(xsk_tx_peek_desc);
 
+u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, struct xdp_desc *descs,
+				   u32 max_entries)
+{
+	struct xdp_sock *xs;
+	u32 nb_pkts;
+
+	rcu_read_lock();
+	if (!list_is_singular(&pool->xsk_tx_list)) {
+		/* Fallback to the non-batched version */
+		rcu_read_unlock();
+		return xsk_tx_peek_desc(pool, &descs[0]) ? 1 : 0;
+	}
+
+	xs = list_first_or_null_rcu(&pool->xsk_tx_list, struct xdp_sock, tx_list);
+
+	nb_pkts = xskq_cons_peek_desc_batch(xs->tx, descs, pool, max_entries);
+	if (!nb_pkts) {
+		xs->tx->queue_empty_descs++;
+		goto out;
+	}
+
+	/* This is the backpressure mechanism for the Tx path. Try to
+	 * reserve space in the completion queue for all packets, but
+	 * if there are fewer slots available, just process that many
+	 * packets. This avoids having to implement any buffering in
+	 * the Tx path.
+	 */
+	nb_pkts = xskq_prod_reserve_addr_batch(pool->cq, descs, nb_pkts);
+	if (!nb_pkts)
+		goto out;
+
+	xskq_cons_release_n(xs->tx, nb_pkts);
+	__xskq_cons_release(xs->tx);
+	xs->sk.sk_write_space(&xs->sk);
+	rcu_read_unlock();
+	return nb_pkts;
+
+out:
+	rcu_read_unlock();
+	return 0;
+}
+EXPORT_SYMBOL(xsk_tx_peek_release_desc_batch);
+
 static int xsk_wakeup(struct xdp_sock *xs, u8 flags)
 {
 	struct net_device *dev = xs->dev;
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 74fac80..a85c7e9 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -199,6 +199,33 @@ static inline bool xskq_cons_read_desc(struct xsk_queue *q,
 	return false;
 }
 
+static inline u32 xskq_cons_read_desc_batch(struct xsk_queue *q,
+					    struct xdp_desc *descs,
+					    struct xsk_buff_pool *pool, u32 max)
+{
+	u32 cached_cons = q->cached_cons, nb_entries = 0;
+
+	while (cached_cons != q->cached_prod && nb_entries < max) {
+		struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
+		u32 idx = cached_cons & q->ring_mask;
+
+		descs[nb_entries] = ring->desc[idx];
+		if (unlikely(!xskq_cons_is_valid_desc(q, &descs[nb_entries], pool))) {
+			if (nb_entries) {
+				/* Invalid entry detected. Return what we have. */
+				return nb_entries;
+			}
+			/* Use non-batch version to progress beyond invalid entry/entries */
+			return xskq_cons_read_desc(q, descs, pool) ? 1 : 0;
+		}
+
+		nb_entries++;
+		cached_cons++;
+	}
+
+	return nb_entries;
+}
+
 /* Functions for consumers */
 
 static inline void __xskq_cons_release(struct xsk_queue *q)
@@ -220,17 +247,22 @@ static inline void xskq_cons_get_entries(struct xsk_queue *q)
 	__xskq_cons_peek(q);
 }
 
-static inline bool xskq_cons_has_entries(struct xsk_queue *q, u32 cnt)
+static inline u32 xskq_cons_nb_entries(struct xsk_queue *q, u32 max)
 {
 	u32 entries = q->cached_prod - q->cached_cons;
 
-	if (entries >= cnt)
-		return true;
+	if (entries >= max)
+		return max;
 
 	__xskq_cons_peek(q);
 	entries = q->cached_prod - q->cached_cons;
 
-	return entries >= cnt;
+	return entries >= max ? max : entries;
+}
+
+static inline bool xskq_cons_has_entries(struct xsk_queue *q, u32 cnt)
+{
+	return xskq_cons_nb_entries(q, cnt) >= cnt ? true : false;
 }
 
 static inline bool xskq_cons_peek_addr_unchecked(struct xsk_queue *q, u64 *addr)
@@ -249,16 +281,28 @@ static inline bool xskq_cons_peek_desc(struct xsk_queue *q,
 	return xskq_cons_read_desc(q, desc, pool);
 }
 
+static inline u32 xskq_cons_peek_desc_batch(struct xsk_queue *q, struct xdp_desc *descs,
+					    struct xsk_buff_pool *pool, u32 max)
+{
+	u32 entries = xskq_cons_nb_entries(q, max);
+
+	return xskq_cons_read_desc_batch(q, descs, pool, entries);
+}
+
+/* To improve performance in the xskq_cons_release functions, only update local state here.
+ * Reflect this to global state when we get new entries from the ring in
+ * xskq_cons_get_entries() and whenever Rx or Tx processing are completed in the NAPI loop.
+ */
 static inline void xskq_cons_release(struct xsk_queue *q)
 {
-	/* To improve performance, only update local state here.
-	 * Reflect this to global state when we get new entries
-	 * from the ring in xskq_cons_get_entries() and whenever
-	 * Rx or Tx processing are completed in the NAPI loop.
-	 */
 	q->cached_cons++;
 }
 
+static inline void xskq_cons_release_n(struct xsk_queue *q, u32 cnt)
+{
+	q->cached_cons += cnt;
+}
+
 static inline bool xskq_cons_is_full(struct xsk_queue *q)
 {
 	/* No barriers needed since data is not accessed */
@@ -268,18 +312,23 @@ static inline bool xskq_cons_is_full(struct xsk_queue *q)
 
 /* Functions for producers */
 
-static inline bool xskq_prod_is_full(struct xsk_queue *q)
+static inline u32 xskq_prod_nb_free(struct xsk_queue *q, u32 max)
 {
 	u32 free_entries = q->nentries - (q->cached_prod - q->cached_cons);
 
-	if (free_entries)
-		return false;
+	if (free_entries >= max)
+		return max;
 
 	/* Refresh the local tail pointer */
 	q->cached_cons = READ_ONCE(q->ring->consumer);
 	free_entries = q->nentries - (q->cached_prod - q->cached_cons);
 
-	return !free_entries;
+	return free_entries >= max ? max : free_entries;
+}
+
+static inline bool xskq_prod_is_full(struct xsk_queue *q)
+{
+	return xskq_prod_nb_free(q, 1) ? false : true;
 }
 
 static inline int xskq_prod_reserve(struct xsk_queue *q)
@@ -304,6 +353,20 @@ static inline int xskq_prod_reserve_addr(struct xsk_queue *q, u64 addr)
 	return 0;
 }
 
+static inline u32 xskq_prod_reserve_addr_batch(struct xsk_queue *q, struct xdp_desc *descs,
+					       u32 max)
+{
+	struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
+	u32 nb_entries, i;
+
+	nb_entries = xskq_prod_nb_free(q, max);
+
+	/* A, matches D */
+	for (i = 0; i < nb_entries; i++)
+		ring->desc[q->cached_prod++ & q->ring_mask] = descs[i].addr;
+	return nb_entries;
+}
+
 static inline int xskq_prod_reserve_desc(struct xsk_queue *q,
 					 u64 addr, u32 len)
 {
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH bpf-next 6/6] i40e: use batched xsk Tx interfaces to increase performance
  2020-11-04 14:08 [PATCH bpf-next 0/6] xsk: i40e: Tx performance improvements Magnus Karlsson
                   ` (4 preceding siblings ...)
  2020-11-04 14:09 ` [PATCH bpf-next 5/6] xsk: introduce batched Tx descriptor interfaces Magnus Karlsson
@ 2020-11-04 14:09 ` Magnus Karlsson
  2020-11-04 23:01   ` Maciej Fijalkowski
  2020-11-09 21:10   ` [Intel-wired-lan] " John Fastabend
  5 siblings, 2 replies; 21+ messages in thread
From: Magnus Karlsson @ 2020-11-04 14:09 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, ast, daniel, netdev, jonathan.lemon
  Cc: bpf, jeffrey.t.kirsher, anthony.l.nguyen, maciej.fijalkowski,
	maciejromanfijalkowski, intel-wired-lan

From: Magnus Karlsson <magnus.karlsson@intel.com>

Use the new batched xsk interfaces for the Tx path in the i40e driver
to improve performance. On my machine, this yields a throughput
increase of 4% for the l2fwd sample app in xdpsock. If we instead just
look at the Tx part, this patch set increases throughput with above
20% for Tx.

Note that I had to explicitly loop unroll the inner loop to get to
this performance level, by using a pragma. It is honored by both clang
and gcc and should be ignored by versions that do not support
it. Using the -funroll-loops compiler command line switch on the
source file resulted in a loop unrolling on a higher level that
lead to a performance decrease instead of an increase.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |   2 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c    |   4 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c    |  14 ++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.h    |   3 +-
 drivers/net/ethernet/intel/i40e/i40e_xsk.c     | 127 ++++++++++++++++++-------
 5 files changed, 110 insertions(+), 40 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 26ba1f3..dc34867 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -2025,7 +2025,7 @@ static int i40e_set_ringparam(struct net_device *netdev,
 			 */
 			tx_rings[i].desc = NULL;
 			tx_rings[i].rx_bi = NULL;
-			err = i40e_setup_tx_descriptors(&tx_rings[i]);
+			err = i40e_setup_tx_descriptors(&tx_rings[i], false);
 			if (err) {
 				while (i) {
 					i--;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 4f8a2154..c93774a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3030,13 +3030,13 @@ static int i40e_vsi_setup_tx_resources(struct i40e_vsi *vsi)
 	int i, err = 0;
 
 	for (i = 0; i < vsi->num_queue_pairs && !err; i++)
-		err = i40e_setup_tx_descriptors(vsi->tx_rings[i]);
+		err = i40e_setup_tx_descriptors(vsi->tx_rings[i], false);
 
 	if (!i40e_enabled_xdp_vsi(vsi))
 		return err;
 
 	for (i = 0; i < vsi->num_queue_pairs && !err; i++)
-		err = i40e_setup_tx_descriptors(vsi->xdp_rings[i]);
+		err = i40e_setup_tx_descriptors(vsi->xdp_rings[i], true);
 
 	return err;
 }
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index d43ce13..3e13e0e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -676,6 +676,8 @@ void i40e_free_tx_resources(struct i40e_ring *tx_ring)
 	i40e_clean_tx_ring(tx_ring);
 	kfree(tx_ring->tx_bi);
 	tx_ring->tx_bi = NULL;
+	kfree(tx_ring->xsk_descs);
+	tx_ring->xsk_descs = NULL;
 
 	if (tx_ring->desc) {
 		dma_free_coherent(tx_ring->dev, tx_ring->size,
@@ -1259,10 +1261,11 @@ void i40e_clean_programming_status(struct i40e_ring *rx_ring, u64 qword0_raw,
 /**
  * i40e_setup_tx_descriptors - Allocate the Tx descriptors
  * @tx_ring: the tx ring to set up
+ * @xdp_ring: true if this is an XDP Tx ring
  *
  * Return 0 on success, negative on error
  **/
-int i40e_setup_tx_descriptors(struct i40e_ring *tx_ring)
+int i40e_setup_tx_descriptors(struct i40e_ring *tx_ring, bool xdp_ring)
 {
 	struct device *dev = tx_ring->dev;
 	int bi_size;
@@ -1277,6 +1280,13 @@ int i40e_setup_tx_descriptors(struct i40e_ring *tx_ring)
 	if (!tx_ring->tx_bi)
 		goto err;
 
+	if (xdp_ring) {
+		tx_ring->xsk_descs = kcalloc(I40E_MAX_NUM_DESCRIPTORS, sizeof(*tx_ring->xsk_descs),
+					     GFP_KERNEL);
+		if (!tx_ring->xsk_descs)
+			goto err;
+	}
+
 	u64_stats_init(&tx_ring->syncp);
 
 	/* round up to nearest 4K */
@@ -1300,6 +1310,8 @@ int i40e_setup_tx_descriptors(struct i40e_ring *tx_ring)
 	return 0;
 
 err:
+	kfree(tx_ring->xsk_descs);
+	tx_ring->xsk_descs = NULL;
 	kfree(tx_ring->tx_bi);
 	tx_ring->tx_bi = NULL;
 	return -ENOMEM;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index 2feed92..628d5d7 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -389,6 +389,7 @@ struct i40e_ring {
 	struct i40e_channel *ch;
 	struct xdp_rxq_info xdp_rxq;
 	struct xsk_buff_pool *xsk_pool;
+	struct xdp_desc *xsk_descs;      /* For storing descriptors in the AF_XDP ZC path */
 } ____cacheline_internodealigned_in_smp;
 
 static inline bool ring_uses_build_skb(struct i40e_ring *ring)
@@ -451,7 +452,7 @@ bool i40e_alloc_rx_buffers(struct i40e_ring *rxr, u16 cleaned_count);
 netdev_tx_t i40e_lan_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
 void i40e_clean_tx_ring(struct i40e_ring *tx_ring);
 void i40e_clean_rx_ring(struct i40e_ring *rx_ring);
-int i40e_setup_tx_descriptors(struct i40e_ring *tx_ring);
+int i40e_setup_tx_descriptors(struct i40e_ring *tx_ring, bool xdp_ring);
 int i40e_setup_rx_descriptors(struct i40e_ring *rx_ring);
 void i40e_free_tx_resources(struct i40e_ring *tx_ring);
 void i40e_free_rx_resources(struct i40e_ring *rx_ring);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index eabe1a3..515d856 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -383,6 +383,78 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
 	return failure ? budget : (int)total_rx_packets;
 }
 
+static void i40e_xmit_pkt(struct i40e_ring *xdp_ring, struct xdp_desc *desc,
+			  unsigned int *total_bytes)
+{
+	struct i40e_tx_desc *tx_desc;
+	dma_addr_t dma;
+
+	dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc->addr);
+	xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma, desc->len);
+
+	tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use++);
+	tx_desc->buffer_addr = cpu_to_le64(dma);
+	tx_desc->cmd_type_offset_bsz = build_ctob(I40E_TX_DESC_CMD_ICRC | I40E_TX_DESC_CMD_EOP,
+						  0, desc->len, 0);
+
+	*total_bytes += desc->len;
+}
+
+/* This value should match the pragma below. Why 4? It is strictly
+ * empirical. It seems to be a good compromise between the advantage
+ * of having simultaneous outstanding reads to the DMA array that can
+ * hide each others latency and the disadvantage of having a larger
+ * code path.
+ */
+#define PKTS_PER_BATCH 4
+
+static void i40e_xmit_pkt_batch(struct i40e_ring *xdp_ring, struct xdp_desc *desc,
+				unsigned int *total_bytes)
+{
+	u16 ntu = xdp_ring->next_to_use;
+	struct i40e_tx_desc *tx_desc;
+	dma_addr_t dma;
+	u32 i;
+
+#pragma GCC unroll 4
+	for (i = 0; i < PKTS_PER_BATCH; i++) {
+		dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc[i].addr);
+		xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma, desc[i].len);
+
+		tx_desc = I40E_TX_DESC(xdp_ring, ntu++);
+		tx_desc->buffer_addr = cpu_to_le64(dma);
+		tx_desc->cmd_type_offset_bsz = build_ctob(I40E_TX_DESC_CMD_ICRC |
+							  I40E_TX_DESC_CMD_EOP,
+							  0, desc[i].len, 0);
+
+		*total_bytes += desc[i].len;
+	}
+
+	xdp_ring->next_to_use = ntu;
+}
+
+static void i40e_fill_tx_hw_ring(struct i40e_ring *xdp_ring, struct xdp_desc *descs, u32 nb_pkts,
+				 unsigned int *total_bytes)
+{
+	u32 batched, leftover, i;
+
+	batched = nb_pkts & ~(PKTS_PER_BATCH - 1);
+	leftover = nb_pkts & (PKTS_PER_BATCH - 1);
+	for (i = 0; i < batched; i += PKTS_PER_BATCH)
+		i40e_xmit_pkt_batch(xdp_ring, &descs[i], total_bytes);
+	for (i = batched; i < batched + leftover; i++)
+		i40e_xmit_pkt(xdp_ring, &descs[i], total_bytes);
+}
+
+static void i40e_set_rs_bit(struct i40e_ring *xdp_ring)
+{
+	u16 ntu = xdp_ring->next_to_use ? xdp_ring->next_to_use - 1 : xdp_ring->count - 1;
+	struct i40e_tx_desc *tx_desc;
+
+	tx_desc = I40E_TX_DESC(xdp_ring, ntu);
+	tx_desc->cmd_type_offset_bsz |= (I40E_TX_DESC_CMD_RS << I40E_TXD_QW1_CMD_SHIFT);
+}
+
 /**
  * i40e_xmit_zc - Performs zero-copy Tx AF_XDP
  * @xdp_ring: XDP Tx ring
@@ -392,45 +464,30 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
  **/
 static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
 {
-	unsigned int sent_frames = 0, total_bytes = 0;
-	struct i40e_tx_desc *tx_desc = NULL;
-	struct xdp_desc desc;
-	dma_addr_t dma;
-
-	while (budget-- > 0) {
-		if (!xsk_tx_peek_desc(xdp_ring->xsk_pool, &desc))
-			break;
-
-		dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc.addr);
-		xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma,
-						 desc.len);
-
-		tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use);
-		tx_desc->buffer_addr = cpu_to_le64(dma);
-		tx_desc->cmd_type_offset_bsz =
-			build_ctob(I40E_TX_DESC_CMD_ICRC
-				   | I40E_TX_DESC_CMD_EOP,
-				   0, desc.len, 0);
-
-		sent_frames++;
-		total_bytes += desc.len;
-
-		xdp_ring->next_to_use++;
-		if (xdp_ring->next_to_use == xdp_ring->count)
-			xdp_ring->next_to_use = 0;
+	struct xdp_desc *descs = xdp_ring->xsk_descs;
+	u32 nb_pkts, nb_processed = 0;
+	unsigned int total_bytes = 0;
+
+	nb_pkts = xsk_tx_peek_release_desc_batch(xdp_ring->xsk_pool, descs, budget);
+	if (!nb_pkts)
+		return false;
+
+	if (xdp_ring->next_to_use + nb_pkts >= xdp_ring->count) {
+		nb_processed = xdp_ring->count - xdp_ring->next_to_use;
+		i40e_fill_tx_hw_ring(xdp_ring, descs, nb_processed, &total_bytes);
+		xdp_ring->next_to_use = 0;
 	}
 
-	if (tx_desc) {
-		/* Request an interrupt for the last frame and bump tail ptr. */
-		tx_desc->cmd_type_offset_bsz |= (I40E_TX_DESC_CMD_RS <<
-						 I40E_TXD_QW1_CMD_SHIFT);
-		i40e_xdp_ring_update_tail(xdp_ring);
+	i40e_fill_tx_hw_ring(xdp_ring, &descs[nb_processed], nb_pkts - nb_processed,
+			     &total_bytes);
 
-		xsk_tx_release(xdp_ring->xsk_pool);
-		i40e_update_tx_stats(xdp_ring, sent_frames, total_bytes);
-	}
+	/* Request an interrupt for the last frame and bump tail ptr. */
+	i40e_set_rs_bit(xdp_ring);
+	i40e_xdp_ring_update_tail(xdp_ring);
+
+	i40e_update_tx_stats(xdp_ring, nb_pkts, total_bytes);
 
-	return !!budget;
+	return true;
 }
 
 /**
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH bpf-next 6/6] i40e: use batched xsk Tx interfaces to increase performance
  2020-11-04 14:09 ` [PATCH bpf-next 6/6] i40e: use batched xsk Tx interfaces to increase performance Magnus Karlsson
@ 2020-11-04 23:01   ` Maciej Fijalkowski
  2020-11-05  7:19     ` Magnus Karlsson
  2020-11-09 21:10   ` [Intel-wired-lan] " John Fastabend
  1 sibling, 1 reply; 21+ messages in thread
From: Maciej Fijalkowski @ 2020-11-04 23:01 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: magnus.karlsson, bjorn.topel, ast, daniel, netdev,
	jonathan.lemon, bpf, jeffrey.t.kirsher, anthony.l.nguyen,
	maciejromanfijalkowski, intel-wired-lan

On Wed, Nov 04, 2020 at 03:09:02PM +0100, Magnus Karlsson wrote:
> From: Magnus Karlsson <magnus.karlsson@intel.com>
> 
> Use the new batched xsk interfaces for the Tx path in the i40e driver
> to improve performance. On my machine, this yields a throughput
> increase of 4% for the l2fwd sample app in xdpsock. If we instead just
> look at the Tx part, this patch set increases throughput with above
> 20% for Tx.
> 
> Note that I had to explicitly loop unroll the inner loop to get to
> this performance level, by using a pragma. It is honored by both clang
> and gcc and should be ignored by versions that do not support
> it. Using the -funroll-loops compiler command line switch on the
> source file resulted in a loop unrolling on a higher level that
> lead to a performance decrease instead of an increase.
> 
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c |   2 +-
>  drivers/net/ethernet/intel/i40e/i40e_main.c    |   4 +-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c    |  14 ++-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h    |   3 +-
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c     | 127 ++++++++++++++++++-------
>  5 files changed, 110 insertions(+), 40 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
> index 26ba1f3..dc34867 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
> @@ -2025,7 +2025,7 @@ static int i40e_set_ringparam(struct net_device *netdev,
>  			 */
>  			tx_rings[i].desc = NULL;
>  			tx_rings[i].rx_bi = NULL;
> -			err = i40e_setup_tx_descriptors(&tx_rings[i]);
> +			err = i40e_setup_tx_descriptors(&tx_rings[i], false);
>  			if (err) {
>  				while (i) {
>  					i--;
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index 4f8a2154..c93774a 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -3030,13 +3030,13 @@ static int i40e_vsi_setup_tx_resources(struct i40e_vsi *vsi)
>  	int i, err = 0;
>  
>  	for (i = 0; i < vsi->num_queue_pairs && !err; i++)
> -		err = i40e_setup_tx_descriptors(vsi->tx_rings[i]);
> +		err = i40e_setup_tx_descriptors(vsi->tx_rings[i], false);
>  
>  	if (!i40e_enabled_xdp_vsi(vsi))
>  		return err;
>  
>  	for (i = 0; i < vsi->num_queue_pairs && !err; i++)
> -		err = i40e_setup_tx_descriptors(vsi->xdp_rings[i]);
> +		err = i40e_setup_tx_descriptors(vsi->xdp_rings[i], true);

One last suggestion: I think that you could just call ring_is_xdp(tx_ring)
within i40e_setup_tx_descriptors() and based on that allocate the
xsk_descs array, instead of parametrizing the function. At the time of
setting up tx descs the I40E_TXR_FLAGS_XDP flag is already set.

I think this would be a cleaner way of doing it.

>  
>  	return err;
>  }
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index d43ce13..3e13e0e 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -676,6 +676,8 @@ void i40e_free_tx_resources(struct i40e_ring *tx_ring)
>  	i40e_clean_tx_ring(tx_ring);
>  	kfree(tx_ring->tx_bi);
>  	tx_ring->tx_bi = NULL;
> +	kfree(tx_ring->xsk_descs);
> +	tx_ring->xsk_descs = NULL;
>  
>  	if (tx_ring->desc) {
>  		dma_free_coherent(tx_ring->dev, tx_ring->size,
> @@ -1259,10 +1261,11 @@ void i40e_clean_programming_status(struct i40e_ring *rx_ring, u64 qword0_raw,
>  /**
>   * i40e_setup_tx_descriptors - Allocate the Tx descriptors
>   * @tx_ring: the tx ring to set up
> + * @xdp_ring: true if this is an XDP Tx ring
>   *
>   * Return 0 on success, negative on error
>   **/
> -int i40e_setup_tx_descriptors(struct i40e_ring *tx_ring)
> +int i40e_setup_tx_descriptors(struct i40e_ring *tx_ring, bool xdp_ring)
>  {
>  	struct device *dev = tx_ring->dev;
>  	int bi_size;
> @@ -1277,6 +1280,13 @@ int i40e_setup_tx_descriptors(struct i40e_ring *tx_ring)
>  	if (!tx_ring->tx_bi)
>  		goto err;
>  
> +	if (xdp_ring) {
> +		tx_ring->xsk_descs = kcalloc(I40E_MAX_NUM_DESCRIPTORS, sizeof(*tx_ring->xsk_descs),
> +					     GFP_KERNEL);
> +		if (!tx_ring->xsk_descs)
> +			goto err;
> +	}
> +
>  	u64_stats_init(&tx_ring->syncp);
>  
>  	/* round up to nearest 4K */
> @@ -1300,6 +1310,8 @@ int i40e_setup_tx_descriptors(struct i40e_ring *tx_ring)
>  	return 0;
>  
>  err:
> +	kfree(tx_ring->xsk_descs);
> +	tx_ring->xsk_descs = NULL;
>  	kfree(tx_ring->tx_bi);
>  	tx_ring->tx_bi = NULL;
>  	return -ENOMEM;
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> index 2feed92..628d5d7 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> @@ -389,6 +389,7 @@ struct i40e_ring {
>  	struct i40e_channel *ch;
>  	struct xdp_rxq_info xdp_rxq;
>  	struct xsk_buff_pool *xsk_pool;
> +	struct xdp_desc *xsk_descs;      /* For storing descriptors in the AF_XDP ZC path */
>  } ____cacheline_internodealigned_in_smp;
>  
>  static inline bool ring_uses_build_skb(struct i40e_ring *ring)
> @@ -451,7 +452,7 @@ bool i40e_alloc_rx_buffers(struct i40e_ring *rxr, u16 cleaned_count);
>  netdev_tx_t i40e_lan_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
>  void i40e_clean_tx_ring(struct i40e_ring *tx_ring);
>  void i40e_clean_rx_ring(struct i40e_ring *rx_ring);
> -int i40e_setup_tx_descriptors(struct i40e_ring *tx_ring);
> +int i40e_setup_tx_descriptors(struct i40e_ring *tx_ring, bool xdp_ring);
>  int i40e_setup_rx_descriptors(struct i40e_ring *rx_ring);
>  void i40e_free_tx_resources(struct i40e_ring *tx_ring);
>  void i40e_free_rx_resources(struct i40e_ring *rx_ring);
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> index eabe1a3..515d856 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> @@ -383,6 +383,78 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
>  	return failure ? budget : (int)total_rx_packets;
>  }
>  
> +static void i40e_xmit_pkt(struct i40e_ring *xdp_ring, struct xdp_desc *desc,
> +			  unsigned int *total_bytes)
> +{
> +	struct i40e_tx_desc *tx_desc;
> +	dma_addr_t dma;
> +
> +	dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc->addr);
> +	xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma, desc->len);
> +
> +	tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use++);
> +	tx_desc->buffer_addr = cpu_to_le64(dma);
> +	tx_desc->cmd_type_offset_bsz = build_ctob(I40E_TX_DESC_CMD_ICRC | I40E_TX_DESC_CMD_EOP,
> +						  0, desc->len, 0);
> +
> +	*total_bytes += desc->len;
> +}
> +
> +/* This value should match the pragma below. Why 4? It is strictly
> + * empirical. It seems to be a good compromise between the advantage
> + * of having simultaneous outstanding reads to the DMA array that can
> + * hide each others latency and the disadvantage of having a larger
> + * code path.
> + */
> +#define PKTS_PER_BATCH 4
> +
> +static void i40e_xmit_pkt_batch(struct i40e_ring *xdp_ring, struct xdp_desc *desc,
> +				unsigned int *total_bytes)
> +{
> +	u16 ntu = xdp_ring->next_to_use;
> +	struct i40e_tx_desc *tx_desc;
> +	dma_addr_t dma;
> +	u32 i;
> +
> +#pragma GCC unroll 4
> +	for (i = 0; i < PKTS_PER_BATCH; i++) {
> +		dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc[i].addr);
> +		xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma, desc[i].len);
> +
> +		tx_desc = I40E_TX_DESC(xdp_ring, ntu++);
> +		tx_desc->buffer_addr = cpu_to_le64(dma);
> +		tx_desc->cmd_type_offset_bsz = build_ctob(I40E_TX_DESC_CMD_ICRC |
> +							  I40E_TX_DESC_CMD_EOP,
> +							  0, desc[i].len, 0);
> +
> +		*total_bytes += desc[i].len;
> +	}
> +
> +	xdp_ring->next_to_use = ntu;
> +}
> +
> +static void i40e_fill_tx_hw_ring(struct i40e_ring *xdp_ring, struct xdp_desc *descs, u32 nb_pkts,
> +				 unsigned int *total_bytes)
> +{
> +	u32 batched, leftover, i;
> +
> +	batched = nb_pkts & ~(PKTS_PER_BATCH - 1);
> +	leftover = nb_pkts & (PKTS_PER_BATCH - 1);
> +	for (i = 0; i < batched; i += PKTS_PER_BATCH)
> +		i40e_xmit_pkt_batch(xdp_ring, &descs[i], total_bytes);
> +	for (i = batched; i < batched + leftover; i++)
> +		i40e_xmit_pkt(xdp_ring, &descs[i], total_bytes);
> +}
> +
> +static void i40e_set_rs_bit(struct i40e_ring *xdp_ring)
> +{
> +	u16 ntu = xdp_ring->next_to_use ? xdp_ring->next_to_use - 1 : xdp_ring->count - 1;
> +	struct i40e_tx_desc *tx_desc;
> +
> +	tx_desc = I40E_TX_DESC(xdp_ring, ntu);
> +	tx_desc->cmd_type_offset_bsz |= (I40E_TX_DESC_CMD_RS << I40E_TXD_QW1_CMD_SHIFT);
> +}
> +
>  /**
>   * i40e_xmit_zc - Performs zero-copy Tx AF_XDP
>   * @xdp_ring: XDP Tx ring
> @@ -392,45 +464,30 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
>   **/
>  static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
>  {
> -	unsigned int sent_frames = 0, total_bytes = 0;
> -	struct i40e_tx_desc *tx_desc = NULL;
> -	struct xdp_desc desc;
> -	dma_addr_t dma;
> -
> -	while (budget-- > 0) {
> -		if (!xsk_tx_peek_desc(xdp_ring->xsk_pool, &desc))
> -			break;
> -
> -		dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc.addr);
> -		xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma,
> -						 desc.len);
> -
> -		tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use);
> -		tx_desc->buffer_addr = cpu_to_le64(dma);
> -		tx_desc->cmd_type_offset_bsz =
> -			build_ctob(I40E_TX_DESC_CMD_ICRC
> -				   | I40E_TX_DESC_CMD_EOP,
> -				   0, desc.len, 0);
> -
> -		sent_frames++;
> -		total_bytes += desc.len;
> -
> -		xdp_ring->next_to_use++;
> -		if (xdp_ring->next_to_use == xdp_ring->count)
> -			xdp_ring->next_to_use = 0;
> +	struct xdp_desc *descs = xdp_ring->xsk_descs;
> +	u32 nb_pkts, nb_processed = 0;
> +	unsigned int total_bytes = 0;
> +
> +	nb_pkts = xsk_tx_peek_release_desc_batch(xdp_ring->xsk_pool, descs, budget);
> +	if (!nb_pkts)
> +		return false;
> +
> +	if (xdp_ring->next_to_use + nb_pkts >= xdp_ring->count) {
> +		nb_processed = xdp_ring->count - xdp_ring->next_to_use;
> +		i40e_fill_tx_hw_ring(xdp_ring, descs, nb_processed, &total_bytes);
> +		xdp_ring->next_to_use = 0;
>  	}
>  
> -	if (tx_desc) {
> -		/* Request an interrupt for the last frame and bump tail ptr. */
> -		tx_desc->cmd_type_offset_bsz |= (I40E_TX_DESC_CMD_RS <<
> -						 I40E_TXD_QW1_CMD_SHIFT);
> -		i40e_xdp_ring_update_tail(xdp_ring);
> +	i40e_fill_tx_hw_ring(xdp_ring, &descs[nb_processed], nb_pkts - nb_processed,
> +			     &total_bytes);
>  
> -		xsk_tx_release(xdp_ring->xsk_pool);
> -		i40e_update_tx_stats(xdp_ring, sent_frames, total_bytes);
> -	}
> +	/* Request an interrupt for the last frame and bump tail ptr. */
> +	i40e_set_rs_bit(xdp_ring);
> +	i40e_xdp_ring_update_tail(xdp_ring);
> +
> +	i40e_update_tx_stats(xdp_ring, nb_pkts, total_bytes);
>  
> -	return !!budget;
> +	return true;
>  }
>  
>  /**
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH bpf-next 1/6] i40e: introduce lazy Tx completions for AF_XDP zero-copy
  2020-11-04 14:08 ` [PATCH bpf-next 1/6] i40e: introduce lazy Tx completions for AF_XDP zero-copy Magnus Karlsson
@ 2020-11-04 23:33   ` Jakub Kicinski
  2020-11-04 23:35     ` Jakub Kicinski
  2020-11-05 14:17     ` Magnus Karlsson
  0 siblings, 2 replies; 21+ messages in thread
From: Jakub Kicinski @ 2020-11-04 23:33 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: magnus.karlsson, bjorn.topel, ast, daniel, netdev,
	jonathan.lemon, bpf, jeffrey.t.kirsher, anthony.l.nguyen,
	maciej.fijalkowski, maciejromanfijalkowski, intel-wired-lan

On Wed,  4 Nov 2020 15:08:57 +0100 Magnus Karlsson wrote:
> From: Magnus Karlsson <magnus.karlsson@intel.com>
> 
> Introduce lazy Tx completions when a queue is used for AF_XDP
> zero-copy. In the current design, each time we get into the NAPI poll
> loop we try to complete as many Tx packets as possible from the
> NIC. This is performed by reading the head pointer register in the NIC
> that tells us how many packets have been completed. Reading this
> register is expensive as it is across PCIe, so let us try to limit the
> number of times it is read by only completing Tx packets to user-space
> when the number of available descriptors in the Tx HW ring is below
> some threshold. This will decrease the number of reads issued to the
> NIC and improves performance with 1.5% - 2% for the l2fwd xdpsock
> microbenchmark.
> 
> The threshold is set to the minimum possible size that the HW ring can
> have. This so that we do not run into a scenario where the threshold
> is higher than the configured number of descriptors in the HW ring.
> 
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>

I feel like this needs a big fat warning somewhere.

It's perfectly fine to never complete TCP packets, but AF_XDP could be
used to implement protocols in user space. What if someone wants to
implement something like TSQ?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH bpf-next 1/6] i40e: introduce lazy Tx completions for AF_XDP zero-copy
  2020-11-04 23:33   ` Jakub Kicinski
@ 2020-11-04 23:35     ` Jakub Kicinski
  2020-11-05 14:17     ` Magnus Karlsson
  1 sibling, 0 replies; 21+ messages in thread
From: Jakub Kicinski @ 2020-11-04 23:35 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: magnus.karlsson, bjorn.topel, ast, daniel, netdev,
	jonathan.lemon, bpf, jeffrey.t.kirsher, anthony.l.nguyen,
	maciej.fijalkowski, maciejromanfijalkowski, intel-wired-lan

On Wed, 4 Nov 2020 15:33:20 -0800 Jakub Kicinski wrote:
> I feel like this needs a big fat warning somewhere.
> 
> It's perfectly fine to never complete TCP packets,

s/TCP/normal XDP/, sorry

> but AF_XDP could be used to implement protocols in user space. What
> if someone wants to implement something like TSQ?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH bpf-next 6/6] i40e: use batched xsk Tx interfaces to increase performance
  2020-11-04 23:01   ` Maciej Fijalkowski
@ 2020-11-05  7:19     ` Magnus Karlsson
  0 siblings, 0 replies; 21+ messages in thread
From: Magnus Karlsson @ 2020-11-05  7:19 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Karlsson, Magnus, Björn Töpel, Alexei Starovoitov,
	Daniel Borkmann, Network Development, Jonathan Lemon, bpf,
	jeffrey.t.kirsher, anthony.l.nguyen, Maciej Fijalkowski,
	intel-wired-lan

On Thu, Nov 5, 2020 at 12:12 AM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Wed, Nov 04, 2020 at 03:09:02PM +0100, Magnus Karlsson wrote:
> > From: Magnus Karlsson <magnus.karlsson@intel.com>
> >
> > Use the new batched xsk interfaces for the Tx path in the i40e driver
> > to improve performance. On my machine, this yields a throughput
> > increase of 4% for the l2fwd sample app in xdpsock. If we instead just
> > look at the Tx part, this patch set increases throughput with above
> > 20% for Tx.
> >
> > Note that I had to explicitly loop unroll the inner loop to get to
> > this performance level, by using a pragma. It is honored by both clang
> > and gcc and should be ignored by versions that do not support
> > it. Using the -funroll-loops compiler command line switch on the
> > source file resulted in a loop unrolling on a higher level that
> > lead to a performance decrease instead of an increase.
> >
> > Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> > ---
> >  drivers/net/ethernet/intel/i40e/i40e_ethtool.c |   2 +-
> >  drivers/net/ethernet/intel/i40e/i40e_main.c    |   4 +-
> >  drivers/net/ethernet/intel/i40e/i40e_txrx.c    |  14 ++-
> >  drivers/net/ethernet/intel/i40e/i40e_txrx.h    |   3 +-
> >  drivers/net/ethernet/intel/i40e/i40e_xsk.c     | 127 ++++++++++++++++++-------
> >  5 files changed, 110 insertions(+), 40 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
> > index 26ba1f3..dc34867 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
> > @@ -2025,7 +2025,7 @@ static int i40e_set_ringparam(struct net_device *netdev,
> >                        */
> >                       tx_rings[i].desc = NULL;
> >                       tx_rings[i].rx_bi = NULL;
> > -                     err = i40e_setup_tx_descriptors(&tx_rings[i]);
> > +                     err = i40e_setup_tx_descriptors(&tx_rings[i], false);
> >                       if (err) {
> >                               while (i) {
> >                                       i--;
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
> > index 4f8a2154..c93774a 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> > @@ -3030,13 +3030,13 @@ static int i40e_vsi_setup_tx_resources(struct i40e_vsi *vsi)
> >       int i, err = 0;
> >
> >       for (i = 0; i < vsi->num_queue_pairs && !err; i++)
> > -             err = i40e_setup_tx_descriptors(vsi->tx_rings[i]);
> > +             err = i40e_setup_tx_descriptors(vsi->tx_rings[i], false);
> >
> >       if (!i40e_enabled_xdp_vsi(vsi))
> >               return err;
> >
> >       for (i = 0; i < vsi->num_queue_pairs && !err; i++)
> > -             err = i40e_setup_tx_descriptors(vsi->xdp_rings[i]);
> > +             err = i40e_setup_tx_descriptors(vsi->xdp_rings[i], true);
>
> One last suggestion: I think that you could just call ring_is_xdp(tx_ring)
> within i40e_setup_tx_descriptors() and based on that allocate the
> xsk_descs array, instead of parametrizing the function. At the time of
> setting up tx descs the I40E_TXR_FLAGS_XDP flag is already set.
>
> I think this would be a cleaner way of doing it.

Yes, good suggestion. Will do that.

> >
> >       return err;
> >  }
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > index d43ce13..3e13e0e 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > @@ -676,6 +676,8 @@ void i40e_free_tx_resources(struct i40e_ring *tx_ring)
> >       i40e_clean_tx_ring(tx_ring);
> >       kfree(tx_ring->tx_bi);
> >       tx_ring->tx_bi = NULL;
> > +     kfree(tx_ring->xsk_descs);
> > +     tx_ring->xsk_descs = NULL;
> >
> >       if (tx_ring->desc) {
> >               dma_free_coherent(tx_ring->dev, tx_ring->size,
> > @@ -1259,10 +1261,11 @@ void i40e_clean_programming_status(struct i40e_ring *rx_ring, u64 qword0_raw,
> >  /**
> >   * i40e_setup_tx_descriptors - Allocate the Tx descriptors
> >   * @tx_ring: the tx ring to set up
> > + * @xdp_ring: true if this is an XDP Tx ring
> >   *
> >   * Return 0 on success, negative on error
> >   **/
> > -int i40e_setup_tx_descriptors(struct i40e_ring *tx_ring)
> > +int i40e_setup_tx_descriptors(struct i40e_ring *tx_ring, bool xdp_ring)
> >  {
> >       struct device *dev = tx_ring->dev;
> >       int bi_size;
> > @@ -1277,6 +1280,13 @@ int i40e_setup_tx_descriptors(struct i40e_ring *tx_ring)
> >       if (!tx_ring->tx_bi)
> >               goto err;
> >
> > +     if (xdp_ring) {
> > +             tx_ring->xsk_descs = kcalloc(I40E_MAX_NUM_DESCRIPTORS, sizeof(*tx_ring->xsk_descs),
> > +                                          GFP_KERNEL);
> > +             if (!tx_ring->xsk_descs)
> > +                     goto err;
> > +     }
> > +
> >       u64_stats_init(&tx_ring->syncp);
> >
> >       /* round up to nearest 4K */
> > @@ -1300,6 +1310,8 @@ int i40e_setup_tx_descriptors(struct i40e_ring *tx_ring)
> >       return 0;
> >
> >  err:
> > +     kfree(tx_ring->xsk_descs);
> > +     tx_ring->xsk_descs = NULL;
> >       kfree(tx_ring->tx_bi);
> >       tx_ring->tx_bi = NULL;
> >       return -ENOMEM;
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> > index 2feed92..628d5d7 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> > @@ -389,6 +389,7 @@ struct i40e_ring {
> >       struct i40e_channel *ch;
> >       struct xdp_rxq_info xdp_rxq;
> >       struct xsk_buff_pool *xsk_pool;
> > +     struct xdp_desc *xsk_descs;      /* For storing descriptors in the AF_XDP ZC path */
> >  } ____cacheline_internodealigned_in_smp;
> >
> >  static inline bool ring_uses_build_skb(struct i40e_ring *ring)
> > @@ -451,7 +452,7 @@ bool i40e_alloc_rx_buffers(struct i40e_ring *rxr, u16 cleaned_count);
> >  netdev_tx_t i40e_lan_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
> >  void i40e_clean_tx_ring(struct i40e_ring *tx_ring);
> >  void i40e_clean_rx_ring(struct i40e_ring *rx_ring);
> > -int i40e_setup_tx_descriptors(struct i40e_ring *tx_ring);
> > +int i40e_setup_tx_descriptors(struct i40e_ring *tx_ring, bool xdp_ring);
> >  int i40e_setup_rx_descriptors(struct i40e_ring *rx_ring);
> >  void i40e_free_tx_resources(struct i40e_ring *tx_ring);
> >  void i40e_free_rx_resources(struct i40e_ring *rx_ring);
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > index eabe1a3..515d856 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > @@ -383,6 +383,78 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
> >       return failure ? budget : (int)total_rx_packets;
> >  }
> >
> > +static void i40e_xmit_pkt(struct i40e_ring *xdp_ring, struct xdp_desc *desc,
> > +                       unsigned int *total_bytes)
> > +{
> > +     struct i40e_tx_desc *tx_desc;
> > +     dma_addr_t dma;
> > +
> > +     dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc->addr);
> > +     xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma, desc->len);
> > +
> > +     tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use++);
> > +     tx_desc->buffer_addr = cpu_to_le64(dma);
> > +     tx_desc->cmd_type_offset_bsz = build_ctob(I40E_TX_DESC_CMD_ICRC | I40E_TX_DESC_CMD_EOP,
> > +                                               0, desc->len, 0);
> > +
> > +     *total_bytes += desc->len;
> > +}
> > +
> > +/* This value should match the pragma below. Why 4? It is strictly
> > + * empirical. It seems to be a good compromise between the advantage
> > + * of having simultaneous outstanding reads to the DMA array that can
> > + * hide each others latency and the disadvantage of having a larger
> > + * code path.
> > + */
> > +#define PKTS_PER_BATCH 4
> > +
> > +static void i40e_xmit_pkt_batch(struct i40e_ring *xdp_ring, struct xdp_desc *desc,
> > +                             unsigned int *total_bytes)
> > +{
> > +     u16 ntu = xdp_ring->next_to_use;
> > +     struct i40e_tx_desc *tx_desc;
> > +     dma_addr_t dma;
> > +     u32 i;
> > +
> > +#pragma GCC unroll 4
> > +     for (i = 0; i < PKTS_PER_BATCH; i++) {
> > +             dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc[i].addr);
> > +             xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma, desc[i].len);
> > +
> > +             tx_desc = I40E_TX_DESC(xdp_ring, ntu++);
> > +             tx_desc->buffer_addr = cpu_to_le64(dma);
> > +             tx_desc->cmd_type_offset_bsz = build_ctob(I40E_TX_DESC_CMD_ICRC |
> > +                                                       I40E_TX_DESC_CMD_EOP,
> > +                                                       0, desc[i].len, 0);
> > +
> > +             *total_bytes += desc[i].len;
> > +     }
> > +
> > +     xdp_ring->next_to_use = ntu;
> > +}
> > +
> > +static void i40e_fill_tx_hw_ring(struct i40e_ring *xdp_ring, struct xdp_desc *descs, u32 nb_pkts,
> > +                              unsigned int *total_bytes)
> > +{
> > +     u32 batched, leftover, i;
> > +
> > +     batched = nb_pkts & ~(PKTS_PER_BATCH - 1);
> > +     leftover = nb_pkts & (PKTS_PER_BATCH - 1);
> > +     for (i = 0; i < batched; i += PKTS_PER_BATCH)
> > +             i40e_xmit_pkt_batch(xdp_ring, &descs[i], total_bytes);
> > +     for (i = batched; i < batched + leftover; i++)
> > +             i40e_xmit_pkt(xdp_ring, &descs[i], total_bytes);
> > +}
> > +
> > +static void i40e_set_rs_bit(struct i40e_ring *xdp_ring)
> > +{
> > +     u16 ntu = xdp_ring->next_to_use ? xdp_ring->next_to_use - 1 : xdp_ring->count - 1;
> > +     struct i40e_tx_desc *tx_desc;
> > +
> > +     tx_desc = I40E_TX_DESC(xdp_ring, ntu);
> > +     tx_desc->cmd_type_offset_bsz |= (I40E_TX_DESC_CMD_RS << I40E_TXD_QW1_CMD_SHIFT);
> > +}
> > +
> >  /**
> >   * i40e_xmit_zc - Performs zero-copy Tx AF_XDP
> >   * @xdp_ring: XDP Tx ring
> > @@ -392,45 +464,30 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
> >   **/
> >  static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
> >  {
> > -     unsigned int sent_frames = 0, total_bytes = 0;
> > -     struct i40e_tx_desc *tx_desc = NULL;
> > -     struct xdp_desc desc;
> > -     dma_addr_t dma;
> > -
> > -     while (budget-- > 0) {
> > -             if (!xsk_tx_peek_desc(xdp_ring->xsk_pool, &desc))
> > -                     break;
> > -
> > -             dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc.addr);
> > -             xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma,
> > -                                              desc.len);
> > -
> > -             tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use);
> > -             tx_desc->buffer_addr = cpu_to_le64(dma);
> > -             tx_desc->cmd_type_offset_bsz =
> > -                     build_ctob(I40E_TX_DESC_CMD_ICRC
> > -                                | I40E_TX_DESC_CMD_EOP,
> > -                                0, desc.len, 0);
> > -
> > -             sent_frames++;
> > -             total_bytes += desc.len;
> > -
> > -             xdp_ring->next_to_use++;
> > -             if (xdp_ring->next_to_use == xdp_ring->count)
> > -                     xdp_ring->next_to_use = 0;
> > +     struct xdp_desc *descs = xdp_ring->xsk_descs;
> > +     u32 nb_pkts, nb_processed = 0;
> > +     unsigned int total_bytes = 0;
> > +
> > +     nb_pkts = xsk_tx_peek_release_desc_batch(xdp_ring->xsk_pool, descs, budget);
> > +     if (!nb_pkts)
> > +             return false;
> > +
> > +     if (xdp_ring->next_to_use + nb_pkts >= xdp_ring->count) {
> > +             nb_processed = xdp_ring->count - xdp_ring->next_to_use;
> > +             i40e_fill_tx_hw_ring(xdp_ring, descs, nb_processed, &total_bytes);
> > +             xdp_ring->next_to_use = 0;
> >       }
> >
> > -     if (tx_desc) {
> > -             /* Request an interrupt for the last frame and bump tail ptr. */
> > -             tx_desc->cmd_type_offset_bsz |= (I40E_TX_DESC_CMD_RS <<
> > -                                              I40E_TXD_QW1_CMD_SHIFT);
> > -             i40e_xdp_ring_update_tail(xdp_ring);
> > +     i40e_fill_tx_hw_ring(xdp_ring, &descs[nb_processed], nb_pkts - nb_processed,
> > +                          &total_bytes);
> >
> > -             xsk_tx_release(xdp_ring->xsk_pool);
> > -             i40e_update_tx_stats(xdp_ring, sent_frames, total_bytes);
> > -     }
> > +     /* Request an interrupt for the last frame and bump tail ptr. */
> > +     i40e_set_rs_bit(xdp_ring);
> > +     i40e_xdp_ring_update_tail(xdp_ring);
> > +
> > +     i40e_update_tx_stats(xdp_ring, nb_pkts, total_bytes);
> >
> > -     return !!budget;
> > +     return true;
> >  }
> >
> >  /**
> > --
> > 2.7.4
> >

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH bpf-next 1/6] i40e: introduce lazy Tx completions for AF_XDP zero-copy
  2020-11-04 23:33   ` Jakub Kicinski
  2020-11-04 23:35     ` Jakub Kicinski
@ 2020-11-05 14:17     ` Magnus Karlsson
  2020-11-05 15:45       ` Jakub Kicinski
  1 sibling, 1 reply; 21+ messages in thread
From: Magnus Karlsson @ 2020-11-05 14:17 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Karlsson, Magnus, Björn Töpel, Alexei Starovoitov,
	Daniel Borkmann, Network Development, Jonathan Lemon, bpf,
	jeffrey.t.kirsher, anthony.l.nguyen, Fijalkowski, Maciej,
	Maciej Fijalkowski, intel-wired-lan

On Thu, Nov 5, 2020 at 12:33 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed,  4 Nov 2020 15:08:57 +0100 Magnus Karlsson wrote:
> > From: Magnus Karlsson <magnus.karlsson@intel.com>
> >
> > Introduce lazy Tx completions when a queue is used for AF_XDP
> > zero-copy. In the current design, each time we get into the NAPI poll
> > loop we try to complete as many Tx packets as possible from the
> > NIC. This is performed by reading the head pointer register in the NIC
> > that tells us how many packets have been completed. Reading this
> > register is expensive as it is across PCIe, so let us try to limit the
> > number of times it is read by only completing Tx packets to user-space
> > when the number of available descriptors in the Tx HW ring is below
> > some threshold. This will decrease the number of reads issued to the
> > NIC and improves performance with 1.5% - 2% for the l2fwd xdpsock
> > microbenchmark.
> >
> > The threshold is set to the minimum possible size that the HW ring can
> > have. This so that we do not run into a scenario where the threshold
> > is higher than the configured number of descriptors in the HW ring.
> >
> > Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
>
> I feel like this needs a big fat warning somewhere.
>
> It's perfectly fine to never complete TCP packets, but AF_XDP could be
> used to implement protocols in user space. What if someone wants to
> implement something like TSQ?

I might misunderstand you, but with TSQ here (for something that
bypasses qdisk and any buffering and just goes straight to the driver)
you mean the ability to have just a few buffers outstanding and
continuously reuse these? If so, that is likely best achieved by
setting a low Tx queue size on the NIC. Note that even without this
patch, completions could be delayed. Though this patch makes that the
normal case. In any way, I think this calls for some improved
documentation.

I also discovered a corner case that will lead to a deadlock if the
completion ring size is half the size of the Tx NIC ring size. This
needs to be fixed, so I will spin a v2.

Thanks: Magnus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH bpf-next 1/6] i40e: introduce lazy Tx completions for AF_XDP zero-copy
  2020-11-05 14:17     ` Magnus Karlsson
@ 2020-11-05 15:45       ` Jakub Kicinski
  2020-11-06 19:09         ` Magnus Karlsson
  0 siblings, 1 reply; 21+ messages in thread
From: Jakub Kicinski @ 2020-11-05 15:45 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Karlsson, Magnus, Björn Töpel, Alexei Starovoitov,
	Daniel Borkmann, Network Development, Jonathan Lemon, bpf,
	jeffrey.t.kirsher, anthony.l.nguyen, Fijalkowski, Maciej,
	Maciej Fijalkowski, intel-wired-lan

On Thu, 5 Nov 2020 15:17:50 +0100 Magnus Karlsson wrote:
> > I feel like this needs a big fat warning somewhere.
> >
> > It's perfectly fine to never complete TCP packets, but AF_XDP could be
> > used to implement protocols in user space. What if someone wants to
> > implement something like TSQ?  
> 
> I might misunderstand you, but with TSQ here (for something that
> bypasses qdisk and any buffering and just goes straight to the driver)
> you mean the ability to have just a few buffers outstanding and
> continuously reuse these? If so, that is likely best achieved by
> setting a low Tx queue size on the NIC. Note that even without this
> patch, completions could be delayed. Though this patch makes that the
> normal case. In any way, I think this calls for some improved
> documentation.

TSQ tries to limit the amount of data the TCP stack queues into TC/sched
and drivers. Say 1MB ~ 16 GSO frames. It will not queue more data until
some of the transfer is reported as completed. 

IIUC you're allowing up to 64 descriptors to linger without reporting
back that the transfer is done. That means that user space implementing
a scheme similar to TSQ may see its transfers stalled.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH bpf-next 1/6] i40e: introduce lazy Tx completions for AF_XDP zero-copy
  2020-11-05 15:45       ` Jakub Kicinski
@ 2020-11-06 19:09         ` Magnus Karlsson
  0 siblings, 0 replies; 21+ messages in thread
From: Magnus Karlsson @ 2020-11-06 19:09 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Karlsson, Magnus, Björn Töpel, Alexei Starovoitov,
	Daniel Borkmann, Network Development, Jonathan Lemon, bpf,
	jeffrey.t.kirsher, anthony.l.nguyen, Fijalkowski, Maciej,
	Maciej Fijalkowski, intel-wired-lan

On Thu, Nov 5, 2020 at 4:45 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 5 Nov 2020 15:17:50 +0100 Magnus Karlsson wrote:
> > > I feel like this needs a big fat warning somewhere.
> > >
> > > It's perfectly fine to never complete TCP packets, but AF_XDP could be
> > > used to implement protocols in user space. What if someone wants to
> > > implement something like TSQ?
> >
> > I might misunderstand you, but with TSQ here (for something that
> > bypasses qdisk and any buffering and just goes straight to the driver)
> > you mean the ability to have just a few buffers outstanding and
> > continuously reuse these? If so, that is likely best achieved by
> > setting a low Tx queue size on the NIC. Note that even without this
> > patch, completions could be delayed. Though this patch makes that the
> > normal case. In any way, I think this calls for some improved
> > documentation.
>
> TSQ tries to limit the amount of data the TCP stack queues into TC/sched
> and drivers. Say 1MB ~ 16 GSO frames. It will not queue more data until
> some of the transfer is reported as completed.

Thanks. Got it. There is one more use case I can think of for quick
completions of Tx buffers and that is if you have metadata associated
with the completion, for example a Tx time stamp. Not that this
capability exists today, but hopefully it will get added at some
point.

Anyway after some more thinking, I would like to remove this patch
from the patch set and put it on the shelf for a while. The reason
behind this is that if we can get a good busy poll solution for AF_XDP
sockets, then we do not need this patch. With busy-poll the choice of
when to complete Tx buffers would be left to the application in a nice
way. If the application would like to quickly get buffers completed
(at the cost of some performance) it would call sendto() (or friends)
soon after it put the packet on the Tx ring. If max throughput is
desired with no regard to when a buffer is returned, then sendto()
would be called only after a large batch of packets have been put on
the Tx ring. No need for any threshold or new knob, in other words,
much nicer. So let us wait for Björn's busy poll patches and see where
it leads. Please protest if you do not agree. Otherwise I will submit
a v2 without this patch and with Maciej's proposed simplification.

> IIUC you're allowing up to 64 descriptors to linger without reporting
> back that the transfer is done. That means that user space implementing
> a scheme similar to TSQ may see its transfers stalled.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [Intel-wired-lan] [PATCH bpf-next 4/6] xsk: introduce padding between more ring pointers
  2020-11-04 14:09 ` [PATCH bpf-next 4/6] xsk: introduce padding between more ring pointers Magnus Karlsson
@ 2020-11-09 20:43   ` John Fastabend
  0 siblings, 0 replies; 21+ messages in thread
From: John Fastabend @ 2020-11-09 20:43 UTC (permalink / raw)
  To: Magnus Karlsson, magnus.karlsson, bjorn.topel, ast, daniel,
	netdev, jonathan.lemon
  Cc: maciejromanfijalkowski, intel-wired-lan, bpf

Magnus Karlsson wrote:
> From: Magnus Karlsson <magnus.karlsson@intel.com>
> 
> Introduce one cache line worth of padding between the consumer pointer
> and the flags field as well as between the flags field and the start
> of the descriptors in all the lockless rings. This so that the x86 HW
> adjacency prefetcher will not prefetch the adjacent pointer/field when
> only one pointer/field is going to be used. This improves throughput
> performance for the l2fwd sample app with 1% on my machine with HW
> prefetching turned on in the BIOS.
> 
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---

Acked-by: John Fastabend <john.fastabend@gmail.com>

>  net/xdp/xsk_queue.h | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> index cdb9cf3..74fac80 100644
> --- a/net/xdp/xsk_queue.h
> +++ b/net/xdp/xsk_queue.h
> @@ -18,9 +18,11 @@ struct xdp_ring {
>  	/* Hinder the adjacent cache prefetcher to prefetch the consumer
>  	 * pointer if the producer pointer is touched and vice versa.
>  	 */
> -	u32 pad ____cacheline_aligned_in_smp;
> +	u32 pad1 ____cacheline_aligned_in_smp;
>  	u32 consumer ____cacheline_aligned_in_smp;
> +	u32 pad2 ____cacheline_aligned_in_smp;
>  	u32 flags;
> +	u32 pad3 ____cacheline_aligned_in_smp;
>  };
>  
>  /* Used for the RX and TX queues for packets */
> -- 
> 2.7.4
> 
> _______________________________________________
> Intel-wired-lan mailing list
> Intel-wired-lan@osuosl.org
> https://lists.osuosl.org/mailman/listinfo/intel-wired-lan



^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [Intel-wired-lan] [PATCH bpf-next 2/6] samples/bpf: increment Tx stats at sending
  2020-11-04 14:08 ` [PATCH bpf-next 2/6] samples/bpf: increment Tx stats at sending Magnus Karlsson
@ 2020-11-09 20:47   ` John Fastabend
  2020-11-10  7:12     ` Magnus Karlsson
  0 siblings, 1 reply; 21+ messages in thread
From: John Fastabend @ 2020-11-09 20:47 UTC (permalink / raw)
  To: Magnus Karlsson, magnus.karlsson, bjorn.topel, ast, daniel,
	netdev, jonathan.lemon
  Cc: maciejromanfijalkowski, intel-wired-lan, bpf

Magnus Karlsson wrote:
> From: Magnus Karlsson <magnus.karlsson@intel.com>
> 
> Increment the statistics over how many Tx packets have been sent at
> the time of sending instead of at the time of completion. This as a
> completion event means that the buffer has been sent AND returned to
> user space. The packet always gets sent shortly after sendto() is
> called. The kernel might, for performance reasons, decide to not
> return every single buffer to user space immediately after sending,
> for example, only after a batch of packets have been
> transmitted. Incrementing the number of packets sent at completion,
> will in that case be confusing as if you send a single packet, the
> counter might show zero for a while even though the packet has been
> transmitted.
> 
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---

LGTM. Just one question then if we wanted to know the old value, packet
completion counter it looks like (tx_npkts - outstanding_tx) would give
that value?

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [Intel-wired-lan] [PATCH bpf-next 3/6] i40e: remove unnecessary sw_ring access from xsk Tx
  2020-11-04 14:08 ` [PATCH bpf-next 3/6] i40e: remove unnecessary sw_ring access from xsk Tx Magnus Karlsson
@ 2020-11-09 20:48   ` John Fastabend
  0 siblings, 0 replies; 21+ messages in thread
From: John Fastabend @ 2020-11-09 20:48 UTC (permalink / raw)
  To: Magnus Karlsson, magnus.karlsson, bjorn.topel, ast, daniel,
	netdev, jonathan.lemon
  Cc: maciejromanfijalkowski, intel-wired-lan, bpf

Magnus Karlsson wrote:
> From: Magnus Karlsson <magnus.karlsson@intel.com>
> 
> Remove the unnecessary access to the software ring for the AF_XDP
> zero-copy driver. This was used to record the length of the packet so
> that the driver Tx completion code could sum this up to produce the
> total bytes sent. This is now performed during the transmission of the
> packet, so no need to record this in the software ring.
> 
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---

Acked-by: John Fastabend <john.fastabend@gmail.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [Intel-wired-lan] [PATCH bpf-next 5/6] xsk: introduce batched Tx descriptor interfaces
  2020-11-04 14:09 ` [PATCH bpf-next 5/6] xsk: introduce batched Tx descriptor interfaces Magnus Karlsson
@ 2020-11-09 21:06   ` John Fastabend
  2020-11-10  8:28     ` Magnus Karlsson
  0 siblings, 1 reply; 21+ messages in thread
From: John Fastabend @ 2020-11-09 21:06 UTC (permalink / raw)
  To: Magnus Karlsson, magnus.karlsson, bjorn.topel, ast, daniel,
	netdev, jonathan.lemon
  Cc: maciejromanfijalkowski, intel-wired-lan, bpf

Magnus Karlsson wrote:
> From: Magnus Karlsson <magnus.karlsson@intel.com>
> 
> Introduce batched descriptor interfaces in the xsk core code for the
> Tx path to be used in the driver to write a code path with higher
> performance. This interface will be used by the i40e driver in the
> next patch. Though other drivers would likely benefit from this new
> interface too.
> 
> Note that batching is only implemented for the common case when
> there is only one socket bound to the same device and queue id. When
> this is not the case, we fall back to the old non-batched version of
> the function.
> 
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---
>  include/net/xdp_sock_drv.h |  7 ++++
>  net/xdp/xsk.c              | 43 ++++++++++++++++++++++
>  net/xdp/xsk_queue.h        | 89 +++++++++++++++++++++++++++++++++++++++-------
>  3 files changed, 126 insertions(+), 13 deletions(-)
> 
> diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
> index 5b1ee8a..4e295541 100644
> --- a/include/net/xdp_sock_drv.h
> +++ b/include/net/xdp_sock_drv.h
> @@ -13,6 +13,7 @@
>  
>  void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries);
>  bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc);
> +u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, struct xdp_desc *desc, u32 max);
>  void xsk_tx_release(struct xsk_buff_pool *pool);
>  struct xsk_buff_pool *xsk_get_pool_from_qid(struct net_device *dev,
>  					    u16 queue_id);
> @@ -128,6 +129,12 @@ static inline bool xsk_tx_peek_desc(struct xsk_buff_pool *pool,
>  	return false;
>  }
>  
> +static inline u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, struct xdp_desc *desc,
> +						 u32 max)
> +{
> +	return 0;
> +}
> +
>  static inline void xsk_tx_release(struct xsk_buff_pool *pool)
>  {
>  }
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index b71a32e..dd75b5f 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -332,6 +332,49 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc)
>  }
>  EXPORT_SYMBOL(xsk_tx_peek_desc);
>  
> +u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, struct xdp_desc *descs,
> +				   u32 max_entries)
> +{
> +	struct xdp_sock *xs;
> +	u32 nb_pkts;
> +
> +	rcu_read_lock();
> +	if (!list_is_singular(&pool->xsk_tx_list)) {
> +		/* Fallback to the non-batched version */
> +		rcu_read_unlock();
> +		return xsk_tx_peek_desc(pool, &descs[0]) ? 1 : 0;
> +	}
> +
> +	xs = list_first_or_null_rcu(&pool->xsk_tx_list, struct xdp_sock, tx_list);

I'm not seeing how we avoid the null check here? Can you add a comment on why this
is safe? I see the bind/unbind routines is it possible to unbind while this is
running or do we have some locking here.

> +
> +	nb_pkts = xskq_cons_peek_desc_batch(xs->tx, descs, pool, max_entries);
> +	if (!nb_pkts) {
> +		xs->tx->queue_empty_descs++;
> +		goto out;
> +	}
> +
> +	/* This is the backpressure mechanism for the Tx path. Try to
> +	 * reserve space in the completion queue for all packets, but
> +	 * if there are fewer slots available, just process that many
> +	 * packets. This avoids having to implement any buffering in
> +	 * the Tx path.
> +	 */
> +	nb_pkts = xskq_prod_reserve_addr_batch(pool->cq, descs, nb_pkts);
> +	if (!nb_pkts)
> +		goto out;
> +
> +	xskq_cons_release_n(xs->tx, nb_pkts);
> +	__xskq_cons_release(xs->tx);
> +	xs->sk.sk_write_space(&xs->sk);

Can you move the out label here? Looks like nb_pkts = 0 in all cases
where goto out is used.

> +	rcu_read_unlock();
> +	return nb_pkts;
> +
> +out:
> +	rcu_read_unlock();
> +	return 0;
> +}
> +EXPORT_SYMBOL(xsk_tx_peek_release_desc_batch);
> +
>  static int xsk_wakeup(struct xdp_sock *xs, u8 flags)
>  {
>  	struct net_device *dev = xs->dev;

[...]

Other than above question LGTM.

Thanks,
John

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [Intel-wired-lan] [PATCH bpf-next 6/6] i40e: use batched xsk Tx interfaces to increase performance
  2020-11-04 14:09 ` [PATCH bpf-next 6/6] i40e: use batched xsk Tx interfaces to increase performance Magnus Karlsson
  2020-11-04 23:01   ` Maciej Fijalkowski
@ 2020-11-09 21:10   ` John Fastabend
  1 sibling, 0 replies; 21+ messages in thread
From: John Fastabend @ 2020-11-09 21:10 UTC (permalink / raw)
  To: Magnus Karlsson, magnus.karlsson, bjorn.topel, ast, daniel,
	netdev, jonathan.lemon
  Cc: maciejromanfijalkowski, intel-wired-lan, bpf

Magnus Karlsson wrote:
> From: Magnus Karlsson <magnus.karlsson@intel.com>
> 
> Use the new batched xsk interfaces for the Tx path in the i40e driver
> to improve performance. On my machine, this yields a throughput
> increase of 4% for the l2fwd sample app in xdpsock. If we instead just
> look at the Tx part, this patch set increases throughput with above
> 20% for Tx.
> 
> Note that I had to explicitly loop unroll the inner loop to get to
> this performance level, by using a pragma. It is honored by both clang
> and gcc and should be ignored by versions that do not support
> it. Using the -funroll-loops compiler command line switch on the
> source file resulted in a loop unrolling on a higher level that
> lead to a performance decrease instead of an increase.
> 
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c |   2 +-
>  drivers/net/ethernet/intel/i40e/i40e_main.c    |   4 +-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c    |  14 ++-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h    |   3 +-
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c     | 127 ++++++++++++++++++-------
>  5 files changed, 110 insertions(+), 40 deletions(-)
> 

LGTM, although I mostly just reviewed the API usage. Maciej's seems like
a nice cleanup.

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Intel-wired-lan] [PATCH bpf-next 2/6] samples/bpf: increment Tx stats at sending
  2020-11-09 20:47   ` [Intel-wired-lan] " John Fastabend
@ 2020-11-10  7:12     ` Magnus Karlsson
  0 siblings, 0 replies; 21+ messages in thread
From: Magnus Karlsson @ 2020-11-10  7:12 UTC (permalink / raw)
  To: John Fastabend
  Cc: Karlsson, Magnus, Björn Töpel, Alexei Starovoitov,
	Daniel Borkmann, Network Development, Jonathan Lemon,
	Maciej Fijalkowski, intel-wired-lan, bpf

On Mon, Nov 9, 2020 at 9:47 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Magnus Karlsson wrote:
> > From: Magnus Karlsson <magnus.karlsson@intel.com>
> >
> > Increment the statistics over how many Tx packets have been sent at
> > the time of sending instead of at the time of completion. This as a
> > completion event means that the buffer has been sent AND returned to
> > user space. The packet always gets sent shortly after sendto() is
> > called. The kernel might, for performance reasons, decide to not
> > return every single buffer to user space immediately after sending,
> > for example, only after a batch of packets have been
> > transmitted. Incrementing the number of packets sent at completion,
> > will in that case be confusing as if you send a single packet, the
> > counter might show zero for a while even though the packet has been
> > transmitted.
> >
> > Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> > ---
>
> LGTM. Just one question then if we wanted to know the old value, packet
> completion counter it looks like (tx_npkts - outstanding_tx) would give
> that value?

That is correct.

> Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Intel-wired-lan] [PATCH bpf-next 5/6] xsk: introduce batched Tx descriptor interfaces
  2020-11-09 21:06   ` [Intel-wired-lan] " John Fastabend
@ 2020-11-10  8:28     ` Magnus Karlsson
  0 siblings, 0 replies; 21+ messages in thread
From: Magnus Karlsson @ 2020-11-10  8:28 UTC (permalink / raw)
  To: John Fastabend
  Cc: Karlsson, Magnus, Björn Töpel, Alexei Starovoitov,
	Daniel Borkmann, Network Development, Jonathan Lemon,
	Maciej Fijalkowski, intel-wired-lan, bpf

On Mon, Nov 9, 2020 at 10:06 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Magnus Karlsson wrote:
> > From: Magnus Karlsson <magnus.karlsson@intel.com>
> >
> > Introduce batched descriptor interfaces in the xsk core code for the
> > Tx path to be used in the driver to write a code path with higher
> > performance. This interface will be used by the i40e driver in the
> > next patch. Though other drivers would likely benefit from this new
> > interface too.
> >
> > Note that batching is only implemented for the common case when
> > there is only one socket bound to the same device and queue id. When
> > this is not the case, we fall back to the old non-batched version of
> > the function.
> >
> > Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> > ---
> >  include/net/xdp_sock_drv.h |  7 ++++
> >  net/xdp/xsk.c              | 43 ++++++++++++++++++++++
> >  net/xdp/xsk_queue.h        | 89 +++++++++++++++++++++++++++++++++++++++-------
> >  3 files changed, 126 insertions(+), 13 deletions(-)
> >
> > diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
> > index 5b1ee8a..4e295541 100644
> > --- a/include/net/xdp_sock_drv.h
> > +++ b/include/net/xdp_sock_drv.h
> > @@ -13,6 +13,7 @@
> >
> >  void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries);
> >  bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc);
> > +u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, struct xdp_desc *desc, u32 max);
> >  void xsk_tx_release(struct xsk_buff_pool *pool);
> >  struct xsk_buff_pool *xsk_get_pool_from_qid(struct net_device *dev,
> >                                           u16 queue_id);
> > @@ -128,6 +129,12 @@ static inline bool xsk_tx_peek_desc(struct xsk_buff_pool *pool,
> >       return false;
> >  }
> >
> > +static inline u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, struct xdp_desc *desc,
> > +                                              u32 max)
> > +{
> > +     return 0;
> > +}
> > +
> >  static inline void xsk_tx_release(struct xsk_buff_pool *pool)
> >  {
> >  }
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index b71a32e..dd75b5f 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -332,6 +332,49 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc)
> >  }
> >  EXPORT_SYMBOL(xsk_tx_peek_desc);
> >
> > +u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, struct xdp_desc *descs,
> > +                                u32 max_entries)
> > +{
> > +     struct xdp_sock *xs;
> > +     u32 nb_pkts;
> > +
> > +     rcu_read_lock();
> > +     if (!list_is_singular(&pool->xsk_tx_list)) {
> > +             /* Fallback to the non-batched version */
> > +             rcu_read_unlock();
> > +             return xsk_tx_peek_desc(pool, &descs[0]) ? 1 : 0;
> > +     }
> > +
> > +     xs = list_first_or_null_rcu(&pool->xsk_tx_list, struct xdp_sock, tx_list);
>
> I'm not seeing how we avoid the null check here? Can you add a comment on why this
> is safe? I see the bind/unbind routines is it possible to unbind while this is
> running or do we have some locking here.

You are correct. The entry can disappear between list_is_singluar and
list_first_or_null_rcu. There are 3 possibilities at this point:

0 entries: as you point out, we need to test for this and exit since
the socket does not exist anymore.
1 entry: everything is working as expected.
>1 entry: we only process the first socket in the list. This is fine since this can only happen when we add a second socket to the list and the next time we enter this function list_is_singular() will not be true anymore, so we will use the fallback version that will process packets from all sockets. So the only thing that will happen in this rare case is that the start of processing for the second socket is delayed ever so slightly.

In summary, I will add a test for !xs and exit in that case.

> > +
> > +     nb_pkts = xskq_cons_peek_desc_batch(xs->tx, descs, pool, max_entries);
> > +     if (!nb_pkts) {
> > +             xs->tx->queue_empty_descs++;
> > +             goto out;
> > +     }
> > +
> > +     /* This is the backpressure mechanism for the Tx path. Try to
> > +      * reserve space in the completion queue for all packets, but
> > +      * if there are fewer slots available, just process that many
> > +      * packets. This avoids having to implement any buffering in
> > +      * the Tx path.
> > +      */
> > +     nb_pkts = xskq_prod_reserve_addr_batch(pool->cq, descs, nb_pkts);
> > +     if (!nb_pkts)
> > +             goto out;
> > +
> > +     xskq_cons_release_n(xs->tx, nb_pkts);
> > +     __xskq_cons_release(xs->tx);
> > +     xs->sk.sk_write_space(&xs->sk);
>
> Can you move the out label here? Looks like nb_pkts = 0 in all cases
> where goto out is used.

Nice simplification. Will fix.

Thanks: Magnus

> > +     rcu_read_unlock();
> > +     return nb_pkts;
> > +
> > +out:
> > +     rcu_read_unlock();
> > +     return 0;
> > +}
> > +EXPORT_SYMBOL(xsk_tx_peek_release_desc_batch);
> > +
> >  static int xsk_wakeup(struct xdp_sock *xs, u8 flags)
> >  {
> >       struct net_device *dev = xs->dev;
>
> [...]
>
> Other than above question LGTM.
>
> Thanks,
> John

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2020-11-10  8:28 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-04 14:08 [PATCH bpf-next 0/6] xsk: i40e: Tx performance improvements Magnus Karlsson
2020-11-04 14:08 ` [PATCH bpf-next 1/6] i40e: introduce lazy Tx completions for AF_XDP zero-copy Magnus Karlsson
2020-11-04 23:33   ` Jakub Kicinski
2020-11-04 23:35     ` Jakub Kicinski
2020-11-05 14:17     ` Magnus Karlsson
2020-11-05 15:45       ` Jakub Kicinski
2020-11-06 19:09         ` Magnus Karlsson
2020-11-04 14:08 ` [PATCH bpf-next 2/6] samples/bpf: increment Tx stats at sending Magnus Karlsson
2020-11-09 20:47   ` [Intel-wired-lan] " John Fastabend
2020-11-10  7:12     ` Magnus Karlsson
2020-11-04 14:08 ` [PATCH bpf-next 3/6] i40e: remove unnecessary sw_ring access from xsk Tx Magnus Karlsson
2020-11-09 20:48   ` [Intel-wired-lan] " John Fastabend
2020-11-04 14:09 ` [PATCH bpf-next 4/6] xsk: introduce padding between more ring pointers Magnus Karlsson
2020-11-09 20:43   ` [Intel-wired-lan] " John Fastabend
2020-11-04 14:09 ` [PATCH bpf-next 5/6] xsk: introduce batched Tx descriptor interfaces Magnus Karlsson
2020-11-09 21:06   ` [Intel-wired-lan] " John Fastabend
2020-11-10  8:28     ` Magnus Karlsson
2020-11-04 14:09 ` [PATCH bpf-next 6/6] i40e: use batched xsk Tx interfaces to increase performance Magnus Karlsson
2020-11-04 23:01   ` Maciej Fijalkowski
2020-11-05  7:19     ` Magnus Karlsson
2020-11-09 21:10   ` [Intel-wired-lan] " John Fastabend

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).