bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH bpf-next v2 0/5] xsk: i40e: Tx performance improvements
@ 2020-11-10 11:01 Magnus Karlsson
  2020-11-10 11:01 ` [PATCH bpf-next v2 1/5] samples/bpf: increment Tx stats at sending Magnus Karlsson
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Magnus Karlsson @ 2020-11-10 11:01 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, ast, daniel, netdev,
	jonathan.lemon, kuba, john.fastabend
  Cc: Magnus Karlsson, bpf, jeffrey.t.kirsher, anthony.l.nguyen,
	maciej.fijalkowski, maciejromanfijalkowski, intel-wired-lan

This patch set improves the performance of mainly the Tx processing of
AF_XDP sockets. Though, patch 3 also improves the Rx path. All in all,
this patch set improves the throughput of the l2fwd xdpsock
application by around 11%. If we just take a look at Tx processing part,
it is improved by 35% to 40%.

Hopefully the new batched Tx interfaces should be of value to other
drivers implementing AF_XDP zero-copy support. But patch #3 is generic
and will improve performance of all drivers when using AF_XDP sockets
(under the premises explained in that patch).

@Daniel. In patch 3, I apply all the padding required to hinder the
adjacency prefetcher to prefetch the wrong things. After this patch
set, I will submit another patch set that introduces
____cacheline_padding_in_smp in include/linux/cache.h according to your
suggestions. The last patch in that patch set will then convert the
explicit paddings that we have now to ____cacheline_padding_in_smp.

v1 -> v2:
* Removed added parameter in i40e_setup_tx_descriptors and adopted a
  simpler solution [Maciej]
* Added test for !xs in xsk_tx_peek_release_desc_batch() [John]
* Simplified return path in xsk_tx_peek_release_desc_batch() [John]
* Dropped patch #1 in v1 that introduced lazy completions. Hopefully
  this is not needed when we get busy poll. [Jakub]
* Iterate over local variable in xskq_prod_reserve_addr_batch() for
  improved performance
* Fixed the fallback path in xsk_tx_peek_release_desc_batch() so that
  it also produces a batch of descriptors, albeit by using the slower
  (but more general) older code. This improves the performance of the
  case when multiple sockets are sharing the same device and queue id.

This patch has been applied against commit f52b8fd33257 ("bpf: selftest: Use static globals in tcp_hdr_options and btf_skc_cls_ingress")

Structure of the patch set:

Patch 1: For the xdpsock sample, increment Tx stats at sending instead
         of at completion.
Patch 2: Remove an unnecessary sw ring access from the Tx path in i40e.
Patch 3: Introduce padding between all pointers and fields in the ring.
Patch 4: Introduce batched Tx descriptor interfaces.
Patch 5: Use the new batched interfaces in the i40e driver to get higher
         throughput.

Thanks: Magnus

Magnus Karlsson (5):
  samples/bpf: increment Tx stats at sending
  i40e: remove unnecessary sw_ring access from xsk Tx
  xsk: introduce padding between more ring pointers
  xsk: introduce batched Tx descriptor interfaces
  i40e: use batched xsk Tx interfaces to increase performance

 drivers/net/ethernet/intel/i40e/i40e_txrx.c |  11 +++
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |   1 +
 drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 131 +++++++++++++++++++---------
 include/net/xdp_sock_drv.h                  |   7 ++
 net/xdp/xsk.c                               |  57 ++++++++++++
 net/xdp/xsk_queue.h                         |  96 +++++++++++++++++---
 samples/bpf/xdpsock_user.c                  |   6 +-
 7 files changed, 253 insertions(+), 56 deletions(-)

--
2.7.4

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH bpf-next v2 1/5] samples/bpf: increment Tx stats at sending
  2020-11-10 11:01 [PATCH bpf-next v2 0/5] xsk: i40e: Tx performance improvements Magnus Karlsson
@ 2020-11-10 11:01 ` Magnus Karlsson
  2020-11-10 11:01 ` [PATCH bpf-next v2 2/5] i40e: remove unnecessary sw_ring access from xsk Tx Magnus Karlsson
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Magnus Karlsson @ 2020-11-10 11:01 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, ast, daniel, netdev,
	jonathan.lemon, kuba, john.fastabend
  Cc: bpf, jeffrey.t.kirsher, anthony.l.nguyen, maciej.fijalkowski,
	maciejromanfijalkowski, intel-wired-lan

From: Magnus Karlsson <magnus.karlsson@intel.com>

Increment the statistics over how many Tx packets have been sent at
the time of sending instead of at the time of completion. This as a
completion event means that the buffer has been sent AND returned to
user space. The packet always gets sent shortly after sendto() is
called. The kernel might, for performance reasons, decide to not
return every single buffer to user space immediately after sending,
for example, only after a batch of packets have been
transmitted. Incrementing the number of packets sent at completion,
will in that case be confusing as if you send a single packet, the
counter might show zero for a while even though the packet has been
transmitted.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
---
 samples/bpf/xdpsock_user.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index 1149e94..2567f0d 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -1146,7 +1146,6 @@ static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk,
 		xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
 		xsk_ring_cons__release(&xsk->umem->cq, rcvd);
 		xsk->outstanding_tx -= rcvd;
-		xsk->ring_stats.tx_npkts += rcvd;
 	}
 }
 
@@ -1168,7 +1167,6 @@ static inline void complete_tx_only(struct xsk_socket_info *xsk,
 	if (rcvd > 0) {
 		xsk_ring_cons__release(&xsk->umem->cq, rcvd);
 		xsk->outstanding_tx -= rcvd;
-		xsk->ring_stats.tx_npkts += rcvd;
 	}
 }
 
@@ -1260,6 +1258,7 @@ static void tx_only(struct xsk_socket_info *xsk, u32 *frame_nb, int batch_size)
 	}
 
 	xsk_ring_prod__submit(&xsk->tx, batch_size);
+	xsk->ring_stats.tx_npkts += batch_size;
 	xsk->outstanding_tx += batch_size;
 	*frame_nb += batch_size;
 	*frame_nb %= NUM_FRAMES;
@@ -1348,6 +1347,7 @@ static void l2fwd(struct xsk_socket_info *xsk, struct pollfd *fds)
 		}
 		return;
 	}
+	xsk->ring_stats.rx_npkts += rcvd;
 
 	ret = xsk_ring_prod__reserve(&xsk->tx, rcvd, &idx_tx);
 	while (ret != rcvd) {
@@ -1379,7 +1379,7 @@ static void l2fwd(struct xsk_socket_info *xsk, struct pollfd *fds)
 	xsk_ring_prod__submit(&xsk->tx, rcvd);
 	xsk_ring_cons__release(&xsk->rx, rcvd);
 
-	xsk->ring_stats.rx_npkts += rcvd;
+	xsk->ring_stats.tx_npkts += rcvd;
 	xsk->outstanding_tx += rcvd;
 }
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH bpf-next v2 2/5] i40e: remove unnecessary sw_ring access from xsk Tx
  2020-11-10 11:01 [PATCH bpf-next v2 0/5] xsk: i40e: Tx performance improvements Magnus Karlsson
  2020-11-10 11:01 ` [PATCH bpf-next v2 1/5] samples/bpf: increment Tx stats at sending Magnus Karlsson
@ 2020-11-10 11:01 ` Magnus Karlsson
  2020-11-10 11:01 ` [PATCH bpf-next v2 3/5] xsk: introduce padding between more ring pointers Magnus Karlsson
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Magnus Karlsson @ 2020-11-10 11:01 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, ast, daniel, netdev,
	jonathan.lemon, kuba, john.fastabend
  Cc: bpf, jeffrey.t.kirsher, anthony.l.nguyen, maciej.fijalkowski,
	maciejromanfijalkowski, intel-wired-lan

From: Magnus Karlsson <magnus.karlsson@intel.com>

Remove the unnecessary access to the software ring for the AF_XDP
zero-copy driver. This was used to record the length of the packet so
that the driver Tx completion code could sum this up to produce the
total bytes sent. This is now performed during the transmission of the
packet, so no need to record this in the software ring.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
---
 drivers/net/ethernet/intel/i40e/i40e_xsk.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 6acede0..61aa1fc 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -392,7 +392,6 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
 {
 	unsigned int sent_frames = 0, total_bytes = 0;
 	struct i40e_tx_desc *tx_desc = NULL;
-	struct i40e_tx_buffer *tx_bi;
 	struct xdp_desc desc;
 	dma_addr_t dma;
 
@@ -404,9 +403,6 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
 		xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma,
 						 desc.len);
 
-		tx_bi = &xdp_ring->tx_bi[xdp_ring->next_to_use];
-		tx_bi->bytecount = desc.len;
-
 		tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use);
 		tx_desc->buffer_addr = cpu_to_le64(dma);
 		tx_desc->cmd_type_offset_bsz =
@@ -415,7 +411,7 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
 				   0, desc.len, 0);
 
 		sent_frames++;
-		total_bytes += tx_bi->bytecount;
+		total_bytes += desc.len;
 
 		xdp_ring->next_to_use++;
 		if (xdp_ring->next_to_use == xdp_ring->count)
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH bpf-next v2 3/5] xsk: introduce padding between more ring pointers
  2020-11-10 11:01 [PATCH bpf-next v2 0/5] xsk: i40e: Tx performance improvements Magnus Karlsson
  2020-11-10 11:01 ` [PATCH bpf-next v2 1/5] samples/bpf: increment Tx stats at sending Magnus Karlsson
  2020-11-10 11:01 ` [PATCH bpf-next v2 2/5] i40e: remove unnecessary sw_ring access from xsk Tx Magnus Karlsson
@ 2020-11-10 11:01 ` Magnus Karlsson
  2020-11-10 11:01 ` [PATCH bpf-next v2 4/5] xsk: introduce batched Tx descriptor interfaces Magnus Karlsson
  2020-11-10 11:01 ` [PATCH bpf-next v2 5/5] i40e: use batched xsk Tx interfaces to increase performance Magnus Karlsson
  4 siblings, 0 replies; 12+ messages in thread
From: Magnus Karlsson @ 2020-11-10 11:01 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, ast, daniel, netdev,
	jonathan.lemon, kuba, john.fastabend
  Cc: bpf, jeffrey.t.kirsher, anthony.l.nguyen, maciej.fijalkowski,
	maciejromanfijalkowski, intel-wired-lan

From: Magnus Karlsson <magnus.karlsson@intel.com>

Introduce one cache line worth of padding between the consumer pointer
and the flags field as well as between the flags field and the start
of the descriptors in all the lockless rings. This so that the x86 HW
adjacency prefetcher will not prefetch the adjacent pointer/field when
only one pointer/field is going to be used. This improves throughput
performance for the l2fwd sample app with 1% on my machine with HW
prefetching turned on in the BIOS.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
---
 net/xdp/xsk_queue.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index cdb9cf3..74fac80 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -18,9 +18,11 @@ struct xdp_ring {
 	/* Hinder the adjacent cache prefetcher to prefetch the consumer
 	 * pointer if the producer pointer is touched and vice versa.
 	 */
-	u32 pad ____cacheline_aligned_in_smp;
+	u32 pad1 ____cacheline_aligned_in_smp;
 	u32 consumer ____cacheline_aligned_in_smp;
+	u32 pad2 ____cacheline_aligned_in_smp;
 	u32 flags;
+	u32 pad3 ____cacheline_aligned_in_smp;
 };
 
 /* Used for the RX and TX queues for packets */
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH bpf-next v2 4/5] xsk: introduce batched Tx descriptor interfaces
  2020-11-10 11:01 [PATCH bpf-next v2 0/5] xsk: i40e: Tx performance improvements Magnus Karlsson
                   ` (2 preceding siblings ...)
  2020-11-10 11:01 ` [PATCH bpf-next v2 3/5] xsk: introduce padding between more ring pointers Magnus Karlsson
@ 2020-11-10 11:01 ` Magnus Karlsson
  2020-11-11  8:40   ` John Fastabend
  2020-11-10 11:01 ` [PATCH bpf-next v2 5/5] i40e: use batched xsk Tx interfaces to increase performance Magnus Karlsson
  4 siblings, 1 reply; 12+ messages in thread
From: Magnus Karlsson @ 2020-11-10 11:01 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, ast, daniel, netdev,
	jonathan.lemon, kuba, john.fastabend
  Cc: bpf, jeffrey.t.kirsher, anthony.l.nguyen, maciej.fijalkowski,
	maciejromanfijalkowski, intel-wired-lan

From: Magnus Karlsson <magnus.karlsson@intel.com>

Introduce batched descriptor interfaces in the xsk core code for the
Tx path to be used in the driver to write a code path with higher
performance. This interface will be used by the i40e driver in the
next patch. Though other drivers would likely benefit from this new
interface too.

Note that batching is only implemented for the common case when
there is only one socket bound to the same device and queue id. When
this is not the case, we fall back to the old non-batched version of
the function.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/net/xdp_sock_drv.h |  7 ++++
 net/xdp/xsk.c              | 57 ++++++++++++++++++++++++++++
 net/xdp/xsk_queue.h        | 92 +++++++++++++++++++++++++++++++++++++++-------
 3 files changed, 143 insertions(+), 13 deletions(-)

diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 5b1ee8a..4e295541 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -13,6 +13,7 @@
 
 void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries);
 bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc);
+u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, struct xdp_desc *desc, u32 max);
 void xsk_tx_release(struct xsk_buff_pool *pool);
 struct xsk_buff_pool *xsk_get_pool_from_qid(struct net_device *dev,
 					    u16 queue_id);
@@ -128,6 +129,12 @@ static inline bool xsk_tx_peek_desc(struct xsk_buff_pool *pool,
 	return false;
 }
 
+static inline u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, struct xdp_desc *desc,
+						 u32 max)
+{
+	return 0;
+}
+
 static inline void xsk_tx_release(struct xsk_buff_pool *pool)
 {
 }
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index b71a32e..37f70cd 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -332,6 +332,63 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc)
 }
 EXPORT_SYMBOL(xsk_tx_peek_desc);
 
+static u32 xsk_tx_peek_release_fallback(struct xsk_buff_pool *pool, struct xdp_desc *descs,
+					u32 max_entries)
+{
+	u32 nb_pkts = 0;
+
+	while (nb_pkts < max_entries && xsk_tx_peek_desc(pool, &descs[nb_pkts]))
+		nb_pkts++;
+
+	xsk_tx_release(pool);
+	return nb_pkts;
+}
+
+u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, struct xdp_desc *descs,
+				   u32 max_entries)
+{
+	struct xdp_sock *xs;
+	u32 nb_pkts;
+
+	rcu_read_lock();
+	if (!list_is_singular(&pool->xsk_tx_list)) {
+		/* Fallback to the non-batched version */
+		rcu_read_unlock();
+		return xsk_tx_peek_release_fallback(pool, descs, max_entries);
+	}
+
+	xs = list_first_or_null_rcu(&pool->xsk_tx_list, struct xdp_sock, tx_list);
+	if (!xs) {
+		nb_pkts = 0;
+		goto out;
+	}
+
+	nb_pkts = xskq_cons_peek_desc_batch(xs->tx, descs, pool, max_entries);
+	if (!nb_pkts) {
+		xs->tx->queue_empty_descs++;
+		goto out;
+	}
+
+	/* This is the backpressure mechanism for the Tx path. Try to
+	 * reserve space in the completion queue for all packets, but
+	 * if there are fewer slots available, just process that many
+	 * packets. This avoids having to implement any buffering in
+	 * the Tx path.
+	 */
+	nb_pkts = xskq_prod_reserve_addr_batch(pool->cq, descs, nb_pkts);
+	if (!nb_pkts)
+		goto out;
+
+	xskq_cons_release_n(xs->tx, nb_pkts);
+	__xskq_cons_release(xs->tx);
+	xs->sk.sk_write_space(&xs->sk);
+
+out:
+	rcu_read_unlock();
+	return nb_pkts;
+}
+EXPORT_SYMBOL(xsk_tx_peek_release_desc_batch);
+
 static int xsk_wakeup(struct xdp_sock *xs, u8 flags)
 {
 	struct net_device *dev = xs->dev;
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 74fac80..7350f87 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -199,6 +199,33 @@ static inline bool xskq_cons_read_desc(struct xsk_queue *q,
 	return false;
 }
 
+static inline u32 xskq_cons_read_desc_batch(struct xsk_queue *q,
+					    struct xdp_desc *descs,
+					    struct xsk_buff_pool *pool, u32 max)
+{
+	u32 cached_cons = q->cached_cons, nb_entries = 0;
+
+	while (cached_cons != q->cached_prod && nb_entries < max) {
+		struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
+		u32 idx = cached_cons & q->ring_mask;
+
+		descs[nb_entries] = ring->desc[idx];
+		if (unlikely(!xskq_cons_is_valid_desc(q, &descs[nb_entries], pool))) {
+			if (nb_entries) {
+				/* Invalid entry detected. Return what we have. */
+				return nb_entries;
+			}
+			/* Use non-batch version to progress beyond invalid entry/entries */
+			return xskq_cons_read_desc(q, descs, pool) ? 1 : 0;
+		}
+
+		nb_entries++;
+		cached_cons++;
+	}
+
+	return nb_entries;
+}
+
 /* Functions for consumers */
 
 static inline void __xskq_cons_release(struct xsk_queue *q)
@@ -220,17 +247,22 @@ static inline void xskq_cons_get_entries(struct xsk_queue *q)
 	__xskq_cons_peek(q);
 }
 
-static inline bool xskq_cons_has_entries(struct xsk_queue *q, u32 cnt)
+static inline u32 xskq_cons_nb_entries(struct xsk_queue *q, u32 max)
 {
 	u32 entries = q->cached_prod - q->cached_cons;
 
-	if (entries >= cnt)
-		return true;
+	if (entries >= max)
+		return max;
 
 	__xskq_cons_peek(q);
 	entries = q->cached_prod - q->cached_cons;
 
-	return entries >= cnt;
+	return entries >= max ? max : entries;
+}
+
+static inline bool xskq_cons_has_entries(struct xsk_queue *q, u32 cnt)
+{
+	return xskq_cons_nb_entries(q, cnt) >= cnt ? true : false;
 }
 
 static inline bool xskq_cons_peek_addr_unchecked(struct xsk_queue *q, u64 *addr)
@@ -249,16 +281,28 @@ static inline bool xskq_cons_peek_desc(struct xsk_queue *q,
 	return xskq_cons_read_desc(q, desc, pool);
 }
 
+static inline u32 xskq_cons_peek_desc_batch(struct xsk_queue *q, struct xdp_desc *descs,
+					    struct xsk_buff_pool *pool, u32 max)
+{
+	u32 entries = xskq_cons_nb_entries(q, max);
+
+	return xskq_cons_read_desc_batch(q, descs, pool, entries);
+}
+
+/* To improve performance in the xskq_cons_release functions, only update local state here.
+ * Reflect this to global state when we get new entries from the ring in
+ * xskq_cons_get_entries() and whenever Rx or Tx processing are completed in the NAPI loop.
+ */
 static inline void xskq_cons_release(struct xsk_queue *q)
 {
-	/* To improve performance, only update local state here.
-	 * Reflect this to global state when we get new entries
-	 * from the ring in xskq_cons_get_entries() and whenever
-	 * Rx or Tx processing are completed in the NAPI loop.
-	 */
 	q->cached_cons++;
 }
 
+static inline void xskq_cons_release_n(struct xsk_queue *q, u32 cnt)
+{
+	q->cached_cons += cnt;
+}
+
 static inline bool xskq_cons_is_full(struct xsk_queue *q)
 {
 	/* No barriers needed since data is not accessed */
@@ -268,18 +312,23 @@ static inline bool xskq_cons_is_full(struct xsk_queue *q)
 
 /* Functions for producers */
 
-static inline bool xskq_prod_is_full(struct xsk_queue *q)
+static inline u32 xskq_prod_nb_free(struct xsk_queue *q, u32 max)
 {
 	u32 free_entries = q->nentries - (q->cached_prod - q->cached_cons);
 
-	if (free_entries)
-		return false;
+	if (free_entries >= max)
+		return max;
 
 	/* Refresh the local tail pointer */
 	q->cached_cons = READ_ONCE(q->ring->consumer);
 	free_entries = q->nentries - (q->cached_prod - q->cached_cons);
 
-	return !free_entries;
+	return free_entries >= max ? max : free_entries;
+}
+
+static inline bool xskq_prod_is_full(struct xsk_queue *q)
+{
+	return xskq_prod_nb_free(q, 1) ? false : true;
 }
 
 static inline int xskq_prod_reserve(struct xsk_queue *q)
@@ -304,6 +353,23 @@ static inline int xskq_prod_reserve_addr(struct xsk_queue *q, u64 addr)
 	return 0;
 }
 
+static inline u32 xskq_prod_reserve_addr_batch(struct xsk_queue *q, struct xdp_desc *descs,
+					       u32 max)
+{
+	struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
+	u32 nb_entries, i, cached_prod;
+
+	nb_entries = xskq_prod_nb_free(q, max);
+
+	/* A, matches D */
+	cached_prod = q->cached_prod;
+	for (i = 0; i < nb_entries; i++)
+		ring->desc[cached_prod++ & q->ring_mask] = descs[i].addr;
+	q->cached_prod = cached_prod;
+
+	return nb_entries;
+}
+
 static inline int xskq_prod_reserve_desc(struct xsk_queue *q,
 					 u64 addr, u32 len)
 {
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH bpf-next v2 5/5] i40e: use batched xsk Tx interfaces to increase performance
  2020-11-10 11:01 [PATCH bpf-next v2 0/5] xsk: i40e: Tx performance improvements Magnus Karlsson
                   ` (3 preceding siblings ...)
  2020-11-10 11:01 ` [PATCH bpf-next v2 4/5] xsk: introduce batched Tx descriptor interfaces Magnus Karlsson
@ 2020-11-10 11:01 ` Magnus Karlsson
  2020-11-11  1:37   ` kernel test robot
  4 siblings, 1 reply; 12+ messages in thread
From: Magnus Karlsson @ 2020-11-10 11:01 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, ast, daniel, netdev,
	jonathan.lemon, kuba, john.fastabend
  Cc: bpf, jeffrey.t.kirsher, anthony.l.nguyen, maciej.fijalkowski,
	maciejromanfijalkowski, intel-wired-lan

From: Magnus Karlsson <magnus.karlsson@intel.com>

Use the new batched xsk interfaces for the Tx path in the i40e driver
to improve performance. On my machine, this yields a throughput
increase of 4% for the l2fwd sample app in xdpsock. If we instead just
look at the Tx part, this patch set increases throughput with above
20% for Tx.

Note that I had to explicitly loop unroll the inner loop to get to
this performance level, by using a pragma. It is honored by both clang
and gcc and should be ignored by versions that do not support
it. Using the -funroll-loops compiler command line switch on the
source file resulted in a loop unrolling on a higher level that
lead to a performance decrease instead of an increase.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c |  11 +++
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |   1 +
 drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 127 ++++++++++++++++++++--------
 3 files changed, 104 insertions(+), 35 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index d43ce13..c21548c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -676,6 +676,8 @@ void i40e_free_tx_resources(struct i40e_ring *tx_ring)
 	i40e_clean_tx_ring(tx_ring);
 	kfree(tx_ring->tx_bi);
 	tx_ring->tx_bi = NULL;
+	kfree(tx_ring->xsk_descs);
+	tx_ring->xsk_descs = NULL;
 
 	if (tx_ring->desc) {
 		dma_free_coherent(tx_ring->dev, tx_ring->size,
@@ -1277,6 +1279,13 @@ int i40e_setup_tx_descriptors(struct i40e_ring *tx_ring)
 	if (!tx_ring->tx_bi)
 		goto err;
 
+	if (ring_is_xdp(tx_ring)) {
+		tx_ring->xsk_descs = kcalloc(I40E_MAX_NUM_DESCRIPTORS, sizeof(*tx_ring->xsk_descs),
+					     GFP_KERNEL);
+		if (!tx_ring->xsk_descs)
+			goto err;
+	}
+
 	u64_stats_init(&tx_ring->syncp);
 
 	/* round up to nearest 4K */
@@ -1300,6 +1309,8 @@ int i40e_setup_tx_descriptors(struct i40e_ring *tx_ring)
 	return 0;
 
 err:
+	kfree(tx_ring->xsk_descs);
+	tx_ring->xsk_descs = NULL;
 	kfree(tx_ring->tx_bi);
 	tx_ring->tx_bi = NULL;
 	return -ENOMEM;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index 2feed92..5f531b1 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -389,6 +389,7 @@ struct i40e_ring {
 	struct i40e_channel *ch;
 	struct xdp_rxq_info xdp_rxq;
 	struct xsk_buff_pool *xsk_pool;
+	struct xdp_desc *xsk_descs;      /* For storing descriptors in the AF_XDP ZC path */
 } ____cacheline_internodealigned_in_smp;
 
 static inline bool ring_uses_build_skb(struct i40e_ring *ring)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 61aa1fc..a271a02 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -381,6 +381,78 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
 	return failure ? budget : (int)total_rx_packets;
 }
 
+static void i40e_xmit_pkt(struct i40e_ring *xdp_ring, struct xdp_desc *desc,
+			  unsigned int *total_bytes)
+{
+	struct i40e_tx_desc *tx_desc;
+	dma_addr_t dma;
+
+	dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc->addr);
+	xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma, desc->len);
+
+	tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use++);
+	tx_desc->buffer_addr = cpu_to_le64(dma);
+	tx_desc->cmd_type_offset_bsz = build_ctob(I40E_TX_DESC_CMD_ICRC | I40E_TX_DESC_CMD_EOP,
+						  0, desc->len, 0);
+
+	*total_bytes += desc->len;
+}
+
+/* This value should match the pragma below. Why 4? It is strictly
+ * empirical. It seems to be a good compromise between the advantage
+ * of having simultaneous outstanding reads to the DMA array that can
+ * hide each others latency and the disadvantage of having a larger
+ * code path.
+ */
+#define PKTS_PER_BATCH 4
+
+static void i40e_xmit_pkt_batch(struct i40e_ring *xdp_ring, struct xdp_desc *desc,
+				unsigned int *total_bytes)
+{
+	u16 ntu = xdp_ring->next_to_use;
+	struct i40e_tx_desc *tx_desc;
+	dma_addr_t dma;
+	u32 i;
+
+#pragma GCC unroll 4
+	for (i = 0; i < PKTS_PER_BATCH; i++) {
+		dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc[i].addr);
+		xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma, desc[i].len);
+
+		tx_desc = I40E_TX_DESC(xdp_ring, ntu++);
+		tx_desc->buffer_addr = cpu_to_le64(dma);
+		tx_desc->cmd_type_offset_bsz = build_ctob(I40E_TX_DESC_CMD_ICRC |
+							  I40E_TX_DESC_CMD_EOP,
+							  0, desc[i].len, 0);
+
+		*total_bytes += desc[i].len;
+	}
+
+	xdp_ring->next_to_use = ntu;
+}
+
+static void i40e_fill_tx_hw_ring(struct i40e_ring *xdp_ring, struct xdp_desc *descs, u32 nb_pkts,
+				 unsigned int *total_bytes)
+{
+	u32 batched, leftover, i;
+
+	batched = nb_pkts & ~(PKTS_PER_BATCH - 1);
+	leftover = nb_pkts & (PKTS_PER_BATCH - 1);
+	for (i = 0; i < batched; i += PKTS_PER_BATCH)
+		i40e_xmit_pkt_batch(xdp_ring, &descs[i], total_bytes);
+	for (i = batched; i < batched + leftover; i++)
+		i40e_xmit_pkt(xdp_ring, &descs[i], total_bytes);
+}
+
+static void i40e_set_rs_bit(struct i40e_ring *xdp_ring)
+{
+	u16 ntu = xdp_ring->next_to_use ? xdp_ring->next_to_use - 1 : xdp_ring->count - 1;
+	struct i40e_tx_desc *tx_desc;
+
+	tx_desc = I40E_TX_DESC(xdp_ring, ntu);
+	tx_desc->cmd_type_offset_bsz |= (I40E_TX_DESC_CMD_RS << I40E_TXD_QW1_CMD_SHIFT);
+}
+
 /**
  * i40e_xmit_zc - Performs zero-copy Tx AF_XDP
  * @xdp_ring: XDP Tx ring
@@ -390,45 +462,30 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
  **/
 static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
 {
-	unsigned int sent_frames = 0, total_bytes = 0;
-	struct i40e_tx_desc *tx_desc = NULL;
-	struct xdp_desc desc;
-	dma_addr_t dma;
-
-	while (budget-- > 0) {
-		if (!xsk_tx_peek_desc(xdp_ring->xsk_pool, &desc))
-			break;
-
-		dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc.addr);
-		xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma,
-						 desc.len);
-
-		tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use);
-		tx_desc->buffer_addr = cpu_to_le64(dma);
-		tx_desc->cmd_type_offset_bsz =
-			build_ctob(I40E_TX_DESC_CMD_ICRC
-				   | I40E_TX_DESC_CMD_EOP,
-				   0, desc.len, 0);
-
-		sent_frames++;
-		total_bytes += desc.len;
-
-		xdp_ring->next_to_use++;
-		if (xdp_ring->next_to_use == xdp_ring->count)
-			xdp_ring->next_to_use = 0;
+	struct xdp_desc *descs = xdp_ring->xsk_descs;
+	u32 nb_pkts, nb_processed = 0;
+	unsigned int total_bytes = 0;
+
+	nb_pkts = xsk_tx_peek_release_desc_batch(xdp_ring->xsk_pool, descs, budget);
+	if (!nb_pkts)
+		return false;
+
+	if (xdp_ring->next_to_use + nb_pkts >= xdp_ring->count) {
+		nb_processed = xdp_ring->count - xdp_ring->next_to_use;
+		i40e_fill_tx_hw_ring(xdp_ring, descs, nb_processed, &total_bytes);
+		xdp_ring->next_to_use = 0;
 	}
 
-	if (tx_desc) {
-		/* Request an interrupt for the last frame and bump tail ptr. */
-		tx_desc->cmd_type_offset_bsz |= (I40E_TX_DESC_CMD_RS <<
-						 I40E_TXD_QW1_CMD_SHIFT);
-		i40e_xdp_ring_update_tail(xdp_ring);
+	i40e_fill_tx_hw_ring(xdp_ring, &descs[nb_processed], nb_pkts - nb_processed,
+			     &total_bytes);
 
-		xsk_tx_release(xdp_ring->xsk_pool);
-		i40e_update_tx_stats(xdp_ring, sent_frames, total_bytes);
-	}
+	/* Request an interrupt for the last frame and bump tail ptr. */
+	i40e_set_rs_bit(xdp_ring);
+	i40e_xdp_ring_update_tail(xdp_ring);
+
+	i40e_update_tx_stats(xdp_ring, nb_pkts, total_bytes);
 
-	return !!budget;
+	return true;
 }
 
 /**
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH bpf-next v2 5/5] i40e: use batched xsk Tx interfaces to increase performance
  2020-11-10 11:01 ` [PATCH bpf-next v2 5/5] i40e: use batched xsk Tx interfaces to increase performance Magnus Karlsson
@ 2020-11-11  1:37   ` kernel test robot
  2020-11-11 11:57     ` Magnus Karlsson
  0 siblings, 1 reply; 12+ messages in thread
From: kernel test robot @ 2020-11-11  1:37 UTC (permalink / raw)
  To: Magnus Karlsson, magnus.karlsson, bjorn.topel, ast, daniel,
	netdev, jonathan.lemon, kuba, john.fastabend
  Cc: kbuild-all, clang-built-linux, bpf, jeffrey.t.kirsher

[-- Attachment #1: Type: text/plain, Size: 2610 bytes --]

Hi Magnus,

I love your patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/0day-ci/linux/commits/Magnus-Karlsson/xsk-i40e-Tx-performance-improvements/20201110-190310
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: powerpc64-randconfig-r025-20201110 (attached as .config)
compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project 4d81c8adb6ed9840257f6cb6b93f60856d422a15)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install powerpc64 cross compiling tool for clang build
        # apt-get install binutils-powerpc64-linux-gnu
        # https://github.com/0day-ci/linux/commit/b016bbeac6692a93e61b28efa430d64645032b5e
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Magnus-Karlsson/xsk-i40e-Tx-performance-improvements/20201110-190310
        git checkout b016bbeac6692a93e61b28efa430d64645032b5e
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=powerpc64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> drivers/net/ethernet/intel/i40e/i40e_xsk.c:417:13: warning: unknown pragma ignored [-Wunknown-pragmas]
   #pragma GCC unroll 4
               ^
   1 warning generated.

vim +417 drivers/net/ethernet/intel/i40e/i40e_xsk.c

   408	
   409	static void i40e_xmit_pkt_batch(struct i40e_ring *xdp_ring, struct xdp_desc *desc,
   410					unsigned int *total_bytes)
   411	{
   412		u16 ntu = xdp_ring->next_to_use;
   413		struct i40e_tx_desc *tx_desc;
   414		dma_addr_t dma;
   415		u32 i;
   416	
 > 417	#pragma GCC unroll 4
   418		for (i = 0; i < PKTS_PER_BATCH; i++) {
   419			dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc[i].addr);
   420			xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma, desc[i].len);
   421	
   422			tx_desc = I40E_TX_DESC(xdp_ring, ntu++);
   423			tx_desc->buffer_addr = cpu_to_le64(dma);
   424			tx_desc->cmd_type_offset_bsz = build_ctob(I40E_TX_DESC_CMD_ICRC |
   425								  I40E_TX_DESC_CMD_EOP,
   426								  0, desc[i].len, 0);
   427	
   428			*total_bytes += desc[i].len;
   429		}
   430	
   431		xdp_ring->next_to_use = ntu;
   432	}
   433	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 29926 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH bpf-next v2 4/5] xsk: introduce batched Tx descriptor interfaces
  2020-11-10 11:01 ` [PATCH bpf-next v2 4/5] xsk: introduce batched Tx descriptor interfaces Magnus Karlsson
@ 2020-11-11  8:40   ` John Fastabend
  0 siblings, 0 replies; 12+ messages in thread
From: John Fastabend @ 2020-11-11  8:40 UTC (permalink / raw)
  To: Magnus Karlsson, magnus.karlsson, bjorn.topel, ast, daniel,
	netdev, jonathan.lemon, kuba, john.fastabend
  Cc: bpf, jeffrey.t.kirsher, anthony.l.nguyen, maciej.fijalkowski,
	maciejromanfijalkowski, intel-wired-lan

Magnus Karlsson wrote:
> From: Magnus Karlsson <magnus.karlsson@intel.com>
> 
> Introduce batched descriptor interfaces in the xsk core code for the
> Tx path to be used in the driver to write a code path with higher
> performance. This interface will be used by the i40e driver in the
> next patch. Though other drivers would likely benefit from this new
> interface too.
> 
> Note that batching is only implemented for the common case when
> there is only one socket bound to the same device and queue id. When
> this is not the case, we fall back to the old non-batched version of
> the function.
> 
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH bpf-next v2 5/5] i40e: use batched xsk Tx interfaces to increase performance
  2020-11-11  1:37   ` kernel test robot
@ 2020-11-11 11:57     ` Magnus Karlsson
  2020-11-11 19:16       ` Nick Desaulniers
  0 siblings, 1 reply; 12+ messages in thread
From: Magnus Karlsson @ 2020-11-11 11:57 UTC (permalink / raw)
  To: kernel test robot
  Cc: Karlsson, Magnus, Björn Töpel, Alexei Starovoitov,
	Daniel Borkmann, Network Development, Jonathan Lemon,
	Jakub Kicinski, John Fastabend, kbuild-all, clang-built-linux,
	bpf, jeffrey.t.kirsher

On Wed, Nov 11, 2020 at 2:38 AM kernel test robot <lkp@intel.com> wrote:
>
> Hi Magnus,
>
> I love your patch! Perhaps something to improve:
>
> [auto build test WARNING on bpf-next/master]
>
> url:    https://github.com/0day-ci/linux/commits/Magnus-Karlsson/xsk-i40e-Tx-performance-improvements/20201110-190310
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
> config: powerpc64-randconfig-r025-20201110 (attached as .config)
> compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project 4d81c8adb6ed9840257f6cb6b93f60856d422a15)
> reproduce (this is a W=1 build):
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # install powerpc64 cross compiling tool for clang build
>         # apt-get install binutils-powerpc64-linux-gnu
>         # https://github.com/0day-ci/linux/commit/b016bbeac6692a93e61b28efa430d64645032b5e
>         git remote add linux-review https://github.com/0day-ci/linux
>         git fetch --no-tags linux-review Magnus-Karlsson/xsk-i40e-Tx-performance-improvements/20201110-190310
>         git checkout b016bbeac6692a93e61b28efa430d64645032b5e
>         # save the attached .config to linux build tree
>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=powerpc64
>
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot <lkp@intel.com>
>
> All warnings (new ones prefixed by >>):
>
> >> drivers/net/ethernet/intel/i40e/i40e_xsk.c:417:13: warning: unknown pragma ignored [-Wunknown-pragmas]
>    #pragma GCC unroll 4
>                ^
>    1 warning generated.

And I was hoping that unknown pragmas would be ignored, but that will
obviously not be the case with -Wunknown-pragmas added. The unrolling
of this inner loop where the code spends most of its time gives me
nearly 1 Mpps extra in performance which is substantial, so I would
like to get this unrolled in some way, but without the warning. Need
some advice please. Here are some options that comes in mind:

#1: Suppress unknown pragma warnings in this file only by adding
CFLAGS_i40e_xsk.o += -Wno-unknown-pragmas (or whatever that option
might be) in the Makefile

#2: Force the compiler to loop-unroll the loop with for example a
switch statement with four cases that all fall through. This will make
the code less readable.

#3: Manually loop-unroll the loop. This will make the code even less
readable than #2.

I prefer #1 as I like to keep the code readable, but you might have
other better suggestions on how to tackle this.

Thanks: Magnus

> vim +417 drivers/net/ethernet/intel/i40e/i40e_xsk.c
>
>    408
>    409  static void i40e_xmit_pkt_batch(struct i40e_ring *xdp_ring, struct xdp_desc *desc,
>    410                                  unsigned int *total_bytes)
>    411  {
>    412          u16 ntu = xdp_ring->next_to_use;
>    413          struct i40e_tx_desc *tx_desc;
>    414          dma_addr_t dma;
>    415          u32 i;
>    416
>  > 417  #pragma GCC unroll 4
>    418          for (i = 0; i < PKTS_PER_BATCH; i++) {
>    419                  dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc[i].addr);
>    420                  xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma, desc[i].len);
>    421
>    422                  tx_desc = I40E_TX_DESC(xdp_ring, ntu++);
>    423                  tx_desc->buffer_addr = cpu_to_le64(dma);
>    424                  tx_desc->cmd_type_offset_bsz = build_ctob(I40E_TX_DESC_CMD_ICRC |
>    425                                                            I40E_TX_DESC_CMD_EOP,
>    426                                                            0, desc[i].len, 0);
>    427
>    428                  *total_bytes += desc[i].len;
>    429          }
>    430
>    431          xdp_ring->next_to_use = ntu;
>    432  }
>    433
>
> ---
> 0-DAY CI Kernel Test Service, Intel Corporation
> https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH bpf-next v2 5/5] i40e: use batched xsk Tx interfaces to increase performance
  2020-11-11 11:57     ` Magnus Karlsson
@ 2020-11-11 19:16       ` Nick Desaulniers
  2020-11-12  7:45         ` Magnus Karlsson
  0 siblings, 1 reply; 12+ messages in thread
From: Nick Desaulniers @ 2020-11-11 19:16 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: kernel test robot, Karlsson, Magnus, Björn Töpel,
	Alexei Starovoitov, Daniel Borkmann, Network Development,
	Jonathan Lemon, Jakub Kicinski, John Fastabend, kbuild-all,
	clang-built-linux, bpf, Jeff Kirsher

On Wed, Nov 11, 2020 at 3:57 AM Magnus Karlsson
<magnus.karlsson@gmail.com> wrote:
>
> On Wed, Nov 11, 2020 at 2:38 AM kernel test robot <lkp@intel.com> wrote:
> >
> > Hi Magnus,
> >
> > I love your patch! Perhaps something to improve:
> >
> > [auto build test WARNING on bpf-next/master]
> >
> > url:    https://github.com/0day-ci/linux/commits/Magnus-Karlsson/xsk-i40e-Tx-performance-improvements/20201110-190310
> > base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
> > config: powerpc64-randconfig-r025-20201110 (attached as .config)
> > compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project 4d81c8adb6ed9840257f6cb6b93f60856d422a15)

^ Note: clang

> > reproduce (this is a W=1 build):
> >         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> >         chmod +x ~/bin/make.cross
> >         # install powerpc64 cross compiling tool for clang build
> >         # apt-get install binutils-powerpc64-linux-gnu
> >         # https://github.com/0day-ci/linux/commit/b016bbeac6692a93e61b28efa430d64645032b5e
> >         git remote add linux-review https://github.com/0day-ci/linux
> >         git fetch --no-tags linux-review Magnus-Karlsson/xsk-i40e-Tx-performance-improvements/20201110-190310
> >         git checkout b016bbeac6692a93e61b28efa430d64645032b5e
> >         # save the attached .config to linux build tree
> >         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=powerpc64
> >
> > If you fix the issue, kindly add following tag as appropriate
> > Reported-by: kernel test robot <lkp@intel.com>
> >
> > All warnings (new ones prefixed by >>):
> >
> > >> drivers/net/ethernet/intel/i40e/i40e_xsk.c:417:13: warning: unknown pragma ignored [-Wunknown-pragmas]
> >    #pragma GCC unroll 4
> >                ^
> >    1 warning generated.
>
> And I was hoping that unknown pragmas would be ignored, but that will
> obviously not be the case with -Wunknown-pragmas added. The unrolling
> of this inner loop where the code spends most of its time gives me
> nearly 1 Mpps extra in performance which is substantial, so I would
> like to get this unrolled in some way, but without the warning. Need
> some advice please. Here are some options that comes in mind:
>
> #1: Suppress unknown pragma warnings in this file only by adding
> CFLAGS_i40e_xsk.o += -Wno-unknown-pragmas (or whatever that option
> might be) in the Makefile
>
> #2: Force the compiler to loop-unroll the loop with for example a
> switch statement with four cases that all fall through. This will make
> the code less readable.
>
> #3: Manually loop-unroll the loop. This will make the code even less
> readable than #2.

#4 support both compilers.  Note Clang's syntax is slightly different
here; it doesn't accept GCC specific pragmas, and uses a slightly
different form:
https://clang.llvm.org/docs/LanguageExtensions.html#loop-unrolling .
If you wrap that in a macro based on `#ifdef __clang__`, that should
do the trick.

>
> I prefer #1 as I like to keep the code readable, but you might have
> other better suggestions on how to tackle this.
>
> Thanks: Magnus
>
> > vim +417 drivers/net/ethernet/intel/i40e/i40e_xsk.c
> >
> >    408
> >    409  static void i40e_xmit_pkt_batch(struct i40e_ring *xdp_ring, struct xdp_desc *desc,
> >    410                                  unsigned int *total_bytes)
> >    411  {
> >    412          u16 ntu = xdp_ring->next_to_use;
> >    413          struct i40e_tx_desc *tx_desc;
> >    414          dma_addr_t dma;
> >    415          u32 i;
> >    416
> >  > 417  #pragma GCC unroll 4
> >    418          for (i = 0; i < PKTS_PER_BATCH; i++) {
> >    419                  dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc[i].addr);
> >    420                  xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma, desc[i].len);
> >    421
> >    422                  tx_desc = I40E_TX_DESC(xdp_ring, ntu++);
> >    423                  tx_desc->buffer_addr = cpu_to_le64(dma);
> >    424                  tx_desc->cmd_type_offset_bsz = build_ctob(I40E_TX_DESC_CMD_ICRC |
> >    425                                                            I40E_TX_DESC_CMD_EOP,
> >    426                                                            0, desc[i].len, 0);
> >    427
> >    428                  *total_bytes += desc[i].len;
> >    429          }
> >    430
> >    431          xdp_ring->next_to_use = ntu;
> >    432  }
> >    433
> >
> > ---
> > 0-DAY CI Kernel Test Service, Intel Corporation
> > https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
>
> --
> You received this message because you are subscribed to the Google Groups "Clang Built Linux" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to clang-built-linux+unsubscribe@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/clang-built-linux/CAJ8uoz2aDjLPtcTgZ_pO-%3DS9TgXm3c57rN8TTPXdqT7HOOKrhA%40mail.gmail.com.



-- 
Thanks,
~Nick Desaulniers

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH bpf-next v2 5/5] i40e: use batched xsk Tx interfaces to increase performance
  2020-11-11 19:16       ` Nick Desaulniers
@ 2020-11-12  7:45         ` Magnus Karlsson
  2020-11-12 19:39           ` Nick Desaulniers
  0 siblings, 1 reply; 12+ messages in thread
From: Magnus Karlsson @ 2020-11-12  7:45 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: kernel test robot, Karlsson, Magnus, Björn Töpel,
	Alexei Starovoitov, Daniel Borkmann, Network Development,
	Jonathan Lemon, Jakub Kicinski, John Fastabend, kbuild-all,
	clang-built-linux, bpf, Jeff Kirsher

On Wed, Nov 11, 2020 at 8:16 PM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> On Wed, Nov 11, 2020 at 3:57 AM Magnus Karlsson
> <magnus.karlsson@gmail.com> wrote:
> >
> > On Wed, Nov 11, 2020 at 2:38 AM kernel test robot <lkp@intel.com> wrote:
> > >
> > > Hi Magnus,
> > >
> > > I love your patch! Perhaps something to improve:
> > >
> > > [auto build test WARNING on bpf-next/master]
> > >
> > > url:    https://github.com/0day-ci/linux/commits/Magnus-Karlsson/xsk-i40e-Tx-performance-improvements/20201110-190310
> > > base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
> > > config: powerpc64-randconfig-r025-20201110 (attached as .config)
> > > compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project 4d81c8adb6ed9840257f6cb6b93f60856d422a15)
>
> ^ Note: clang
>
> > > reproduce (this is a W=1 build):
> > >         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> > >         chmod +x ~/bin/make.cross
> > >         # install powerpc64 cross compiling tool for clang build
> > >         # apt-get install binutils-powerpc64-linux-gnu
> > >         # https://github.com/0day-ci/linux/commit/b016bbeac6692a93e61b28efa430d64645032b5e
> > >         git remote add linux-review https://github.com/0day-ci/linux
> > >         git fetch --no-tags linux-review Magnus-Karlsson/xsk-i40e-Tx-performance-improvements/20201110-190310
> > >         git checkout b016bbeac6692a93e61b28efa430d64645032b5e
> > >         # save the attached .config to linux build tree
> > >         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=powerpc64
> > >
> > > If you fix the issue, kindly add following tag as appropriate
> > > Reported-by: kernel test robot <lkp@intel.com>
> > >
> > > All warnings (new ones prefixed by >>):
> > >
> > > >> drivers/net/ethernet/intel/i40e/i40e_xsk.c:417:13: warning: unknown pragma ignored [-Wunknown-pragmas]
> > >    #pragma GCC unroll 4
> > >                ^
> > >    1 warning generated.
> >
> > And I was hoping that unknown pragmas would be ignored, but that will
> > obviously not be the case with -Wunknown-pragmas added. The unrolling
> > of this inner loop where the code spends most of its time gives me
> > nearly 1 Mpps extra in performance which is substantial, so I would
> > like to get this unrolled in some way, but without the warning. Need
> > some advice please. Here are some options that comes in mind:
> >
> > #1: Suppress unknown pragma warnings in this file only by adding
> > CFLAGS_i40e_xsk.o += -Wno-unknown-pragmas (or whatever that option
> > might be) in the Makefile
> >
> > #2: Force the compiler to loop-unroll the loop with for example a
> > switch statement with four cases that all fall through. This will make
> > the code less readable.
> >
> > #3: Manually loop-unroll the loop. This will make the code even less
> > readable than #2.
>
> #4 support both compilers.  Note Clang's syntax is slightly different
> here; it doesn't accept GCC specific pragmas, and uses a slightly
> different form:
> https://clang.llvm.org/docs/LanguageExtensions.html#loop-unrolling .
> If you wrap that in a macro based on `#ifdef __clang__`, that should
> do the trick.

Yes, that did the trick. Tried it out with the compiler explorer at
https://godbolt.org/ and it compiles nicely even for clang-powerpc64.
Will spin a v3.

Thank you: Magnus

> >
> > I prefer #1 as I like to keep the code readable, but you might have
> > other better suggestions on how to tackle this.
> >
> > Thanks: Magnus
> >
> > > vim +417 drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > >
> > >    408
> > >    409  static void i40e_xmit_pkt_batch(struct i40e_ring *xdp_ring, struct xdp_desc *desc,
> > >    410                                  unsigned int *total_bytes)
> > >    411  {
> > >    412          u16 ntu = xdp_ring->next_to_use;
> > >    413          struct i40e_tx_desc *tx_desc;
> > >    414          dma_addr_t dma;
> > >    415          u32 i;
> > >    416
> > >  > 417  #pragma GCC unroll 4
> > >    418          for (i = 0; i < PKTS_PER_BATCH; i++) {
> > >    419                  dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc[i].addr);
> > >    420                  xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma, desc[i].len);
> > >    421
> > >    422                  tx_desc = I40E_TX_DESC(xdp_ring, ntu++);
> > >    423                  tx_desc->buffer_addr = cpu_to_le64(dma);
> > >    424                  tx_desc->cmd_type_offset_bsz = build_ctob(I40E_TX_DESC_CMD_ICRC |
> > >    425                                                            I40E_TX_DESC_CMD_EOP,
> > >    426                                                            0, desc[i].len, 0);
> > >    427
> > >    428                  *total_bytes += desc[i].len;
> > >    429          }
> > >    430
> > >    431          xdp_ring->next_to_use = ntu;
> > >    432  }
> > >    433
> > >
> > > ---
> > > 0-DAY CI Kernel Test Service, Intel Corporation
> > > https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
> >
> > --
> > You received this message because you are subscribed to the Google Groups "Clang Built Linux" group.
> > To unsubscribe from this group and stop receiving emails from it, send an email to clang-built-linux+unsubscribe@googlegroups.com.
> > To view this discussion on the web visit https://groups.google.com/d/msgid/clang-built-linux/CAJ8uoz2aDjLPtcTgZ_pO-%3DS9TgXm3c57rN8TTPXdqT7HOOKrhA%40mail.gmail.com.
>
>
>
> --
> Thanks,
> ~Nick Desaulniers

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH bpf-next v2 5/5] i40e: use batched xsk Tx interfaces to increase performance
  2020-11-12  7:45         ` Magnus Karlsson
@ 2020-11-12 19:39           ` Nick Desaulniers
  0 siblings, 0 replies; 12+ messages in thread
From: Nick Desaulniers @ 2020-11-12 19:39 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: kernel test robot, Karlsson, Magnus, Björn Töpel,
	Alexei Starovoitov, Daniel Borkmann, Network Development,
	Jonathan Lemon, Jakub Kicinski, John Fastabend, kbuild-all,
	clang-built-linux, bpf, Jeff Kirsher

On Wed, Nov 11, 2020 at 11:45 PM Magnus Karlsson
<magnus.karlsson@gmail.com> wrote:
>
> On Wed, Nov 11, 2020 at 8:16 PM Nick Desaulniers
> <ndesaulniers@google.com> wrote:
> >
> > On Wed, Nov 11, 2020 at 3:57 AM Magnus Karlsson
> > <magnus.karlsson@gmail.com> wrote:
> > >
> > > On Wed, Nov 11, 2020 at 2:38 AM kernel test robot <lkp@intel.com> wrote:
> > > >
> > > > Hi Magnus,
> > > >
> > > > I love your patch! Perhaps something to improve:
> > > >
> > > > [auto build test WARNING on bpf-next/master]
> > > >
> > > > url:    https://github.com/0day-ci/linux/commits/Magnus-Karlsson/xsk-i40e-Tx-performance-improvements/20201110-190310
> > > > base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
> > > > config: powerpc64-randconfig-r025-20201110 (attached as .config)
> > > > compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project 4d81c8adb6ed9840257f6cb6b93f60856d422a15)
> >
> > ^ Note: clang
> >
> > > > reproduce (this is a W=1 build):
> > > >         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> > > >         chmod +x ~/bin/make.cross
> > > >         # install powerpc64 cross compiling tool for clang build
> > > >         # apt-get install binutils-powerpc64-linux-gnu
> > > >         # https://github.com/0day-ci/linux/commit/b016bbeac6692a93e61b28efa430d64645032b5e
> > > >         git remote add linux-review https://github.com/0day-ci/linux
> > > >         git fetch --no-tags linux-review Magnus-Karlsson/xsk-i40e-Tx-performance-improvements/20201110-190310
> > > >         git checkout b016bbeac6692a93e61b28efa430d64645032b5e
> > > >         # save the attached .config to linux build tree
> > > >         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=powerpc64
> > > >
> > > > If you fix the issue, kindly add following tag as appropriate
> > > > Reported-by: kernel test robot <lkp@intel.com>
> > > >
> > > > All warnings (new ones prefixed by >>):
> > > >
> > > > >> drivers/net/ethernet/intel/i40e/i40e_xsk.c:417:13: warning: unknown pragma ignored [-Wunknown-pragmas]
> > > >    #pragma GCC unroll 4
> > > >                ^
> > > >    1 warning generated.
> > >
> > > And I was hoping that unknown pragmas would be ignored, but that will
> > > obviously not be the case with -Wunknown-pragmas added. The unrolling
> > > of this inner loop where the code spends most of its time gives me
> > > nearly 1 Mpps extra in performance which is substantial, so I would
> > > like to get this unrolled in some way, but without the warning. Need
> > > some advice please. Here are some options that comes in mind:
> > >
> > > #1: Suppress unknown pragma warnings in this file only by adding
> > > CFLAGS_i40e_xsk.o += -Wno-unknown-pragmas (or whatever that option
> > > might be) in the Makefile
> > >
> > > #2: Force the compiler to loop-unroll the loop with for example a
> > > switch statement with four cases that all fall through. This will make
> > > the code less readable.
> > >
> > > #3: Manually loop-unroll the loop. This will make the code even less
> > > readable than #2.
> >
> > #4 support both compilers.  Note Clang's syntax is slightly different
> > here; it doesn't accept GCC specific pragmas, and uses a slightly
> > different form:
> > https://clang.llvm.org/docs/LanguageExtensions.html#loop-unrolling .
> > If you wrap that in a macro based on `#ifdef __clang__`, that should
> > do the trick.
>
> Yes, that did the trick. Tried it out with the compiler explorer at
> https://godbolt.org/ and it compiles nicely even for clang-powerpc64.
> Will spin a v3.
>
> Thank you: Magnus

Great job Magnus, I appreciate it!

>
> > >
> > > I prefer #1 as I like to keep the code readable, but you might have
> > > other better suggestions on how to tackle this.
> > >
> > > Thanks: Magnus
> > >
> > > > vim +417 drivers/net/ethernet/intel/i40e/i40e_xsk.c
> > > >
> > > >    408
> > > >    409  static void i40e_xmit_pkt_batch(struct i40e_ring *xdp_ring, struct xdp_desc *desc,
> > > >    410                                  unsigned int *total_bytes)
> > > >    411  {
> > > >    412          u16 ntu = xdp_ring->next_to_use;
> > > >    413          struct i40e_tx_desc *tx_desc;
> > > >    414          dma_addr_t dma;
> > > >    415          u32 i;
> > > >    416
> > > >  > 417  #pragma GCC unroll 4
> > > >    418          for (i = 0; i < PKTS_PER_BATCH; i++) {
> > > >    419                  dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc[i].addr);
> > > >    420                  xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma, desc[i].len);
> > > >    421
> > > >    422                  tx_desc = I40E_TX_DESC(xdp_ring, ntu++);
> > > >    423                  tx_desc->buffer_addr = cpu_to_le64(dma);
> > > >    424                  tx_desc->cmd_type_offset_bsz = build_ctob(I40E_TX_DESC_CMD_ICRC |
> > > >    425                                                            I40E_TX_DESC_CMD_EOP,
> > > >    426                                                            0, desc[i].len, 0);
> > > >    427
> > > >    428                  *total_bytes += desc[i].len;
> > > >    429          }
> > > >    430
> > > >    431          xdp_ring->next_to_use = ntu;
> > > >    432  }
> > > >    433
> > > >
> > > > ---
> > > > 0-DAY CI Kernel Test Service, Intel Corporation
> > > > https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
> > >
> > > --
> > > You received this message because you are subscribed to the Google Groups "Clang Built Linux" group.
> > > To unsubscribe from this group and stop receiving emails from it, send an email to clang-built-linux+unsubscribe@googlegroups.com.
> > > To view this discussion on the web visit https://groups.google.com/d/msgid/clang-built-linux/CAJ8uoz2aDjLPtcTgZ_pO-%3DS9TgXm3c57rN8TTPXdqT7HOOKrhA%40mail.gmail.com.
> >
> >
> >
> > --
> > Thanks,
> > ~Nick Desaulniers



-- 
Thanks,
~Nick Desaulniers

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-11-12 19:39 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-10 11:01 [PATCH bpf-next v2 0/5] xsk: i40e: Tx performance improvements Magnus Karlsson
2020-11-10 11:01 ` [PATCH bpf-next v2 1/5] samples/bpf: increment Tx stats at sending Magnus Karlsson
2020-11-10 11:01 ` [PATCH bpf-next v2 2/5] i40e: remove unnecessary sw_ring access from xsk Tx Magnus Karlsson
2020-11-10 11:01 ` [PATCH bpf-next v2 3/5] xsk: introduce padding between more ring pointers Magnus Karlsson
2020-11-10 11:01 ` [PATCH bpf-next v2 4/5] xsk: introduce batched Tx descriptor interfaces Magnus Karlsson
2020-11-11  8:40   ` John Fastabend
2020-11-10 11:01 ` [PATCH bpf-next v2 5/5] i40e: use batched xsk Tx interfaces to increase performance Magnus Karlsson
2020-11-11  1:37   ` kernel test robot
2020-11-11 11:57     ` Magnus Karlsson
2020-11-11 19:16       ` Nick Desaulniers
2020-11-12  7:45         ` Magnus Karlsson
2020-11-12 19:39           ` Nick Desaulniers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).