* [pull request][net-next 00/10] mlx5 Multi packet tx descriptors for SKBs
@ 2020-09-03 21:00 Saeed Mahameed
  2020-09-03 21:00 ` [net-next 01/10] net/mlx5e: Refactor inline header size calculation in the TX path Saeed Mahameed
                   ` (9 more replies)
  0 siblings, 10 replies; 23+ messages in thread
From: Saeed Mahameed @ 2020-09-03 21:00 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski; +Cc: netdev, Saeed Mahameed

Hi Dave & Jakub

This series adds support for Multi packet tx descriptors for SKBs.
For more information please see tag log below.

One note worth mentioning here: Maxim had to do some manual function
inlining in tx.c to avoid a performance drop caused by the refactoring
and the extraction of functions for reuse. I hope this is not a big
deal, as the alternative would be to avoid code reuse, which would make
the mlx5 TX path __uglier__.

Please pull and let me know if there is any problem.

Thanks,
Saeed.

---
The following changes since commit 08aaa0819d5cce78a10c2fcea17057d07698691f:

  Merge branch 'l2tp-miscellaneous-cleanups' (2020-09-03 12:19:04 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-updates-2020-09-03

for you to fetch changes up to b960a70f9cf90524f2b4e67d202e7240c3ef6928:

  net/mlx5e: Enhanced TX MPWQE for SKBs (2020-09-03 13:56:10 -0700)

----------------------------------------------------------------
mlx5-updates-2020-09-03

Multi packet TX descriptor support for SKBs.

This series introduces some refactoring of the regular TX data path in
mlx5 and adds the Enhanced TX MPWQE feature support. MPWQE stands for
multi-packet work queue element, and it can serve multiple packets,
reducing the PCI bandwidth spent on control traffic. It should improve
performance in scenarios where PCI is the bottleneck, and xmit_more is
signaled by the kernel. The refactoring done in this series also
improves the packet rate on its own.
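
To illustrate the xmit_more side of this, here is a minimal, hypothetical
sketch (not the actual mlx5 code) of the doorbell-batching pattern: post a
descriptor per packet, but ring the doorbell only when the stack stops
signaling xmit_more, so several packets share one PCI write.

/* Toy driver sketch of doorbell batching with xmit_more. The toy_*
 * helpers are made up for illustration; netdev_xmit_more() is the real
 * kernel API consulted here.
 */
static netdev_tx_t toy_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct toy_sq *sq = toy_select_sq(dev, skb);

	toy_post_wqe(sq, skb);			/* write the descriptor(s) */

	if (!netdev_xmit_more() || toy_sq_is_full(sq))
		toy_ring_doorbell(sq);		/* one doorbell for the batch */

	return NETDEV_TX_OK;
}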

MPWQE is already implemented in the XDP TX path; this series adds
MPWQE support to the regular kernel SKB TX path.

MPWQE is supported on ConnectX-5 and newer devices. For older devices,
the driver keeps backward compatibility with the regular (single-packet)
WQE descriptor.

MPWQE is not compatible with certain offloads and features, such as TLS
offload, TSO and non-linear SKBs. If such incompatible features are in
use, the driver gracefully falls back to a regular single-packet WQE for
that SKB.
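
For clarity, the per-SKB fallback decision boils down to a check along
these lines (hypothetical helper name and shape, not the exact code from
the final patch):

/* Hypothetical sketch: an SKB may join the TX MPWQE session only when
 * none of the incompatible features are in use; otherwise it is sent
 * with a regular single-packet WQE.
 */
static bool toy_skb_fits_mpwqe(const struct sk_buff *skb)
{
	if (skb_is_gso(skb))			/* TSO */
		return false;
	if (skb_is_nonlinear(skb))		/* non-linear SKB */
		return false;
	if (toy_skb_uses_tls_offload(skb))	/* TLS offload (made-up helper) */
		return false;
	return true;
}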

Prior to the final patch, "net/mlx5e: Enhanced TX MPWQE for SKBs", which
adds the actual support, Maxim refactored the TX data path to split it
into stages and smaller helper functions that can be reused by both the
legacy path and the new MPWQE feature.

Due to this refactoring and the increased number of helper functions,
Maxim had to manually tune the inlining of these functions in tx.c to
get the maximum performance and the expected gains from MPWQE.

Performance effect:

All of the changes below were tested with UDP packet-rate and pktgen
tests; no performance impact was seen in the TCP single-stream and
XDP_TX single-stream tests.

CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (x86_64)
NIC: Mellanox ConnectX-6 Dx

1)  Refactoring #1: Refactor xmit functions & manual inlining
This change has no performance impact in TCP single stream test and
XDP_TX single stream test.

UDP pktgen (burst 32), single stream:
  Packet rate: 17.55 Mpps -> 19.23 Mpps
  Instructions per packet: 420 -> 360
  Cycles per packet: 165 -> 142

2) Refactoring #2: Support multiple SKBs in a TX WQE
First building block needed to support multiple SKBs in the downstream MPWQE patch.

UDP pktgen (burst 32), single stream:
  Packet rate: 19.23 Mpps -> 19.12 Mpps
  Instructions per packet: 360 -> 354
  Cycles per packet: 142 -> 140

3) MPWQE Feature for SKBs (final patch)

UDP pktgen, 64-byte packets, single stream, MPWQE off:
  Packet rate: 19.12 Mpps -> 20.02 Mpps
  Instructions per packet: 354 -> 347
  Cycles per packet: 140 -> 129

UDP pktgen, 64-byte packets, single stream, MPWQE on:
  Packet rate: 19.12 Mpps -> 20.67 Mpps
  Instructions per packet: 354 -> 335
  Cycles per packet: 140 -> 124

Enabling MPWQE can reduce PCI bandwidth usage:
  PCI Gen2, pktgen at fixed rate of 36864000 pps on 24 CPU cores:
    Inbound PCI utilization with MPWQE off: 81.3%
    Inbound PCI utilization with MPWQE on: 59.3%
  PCI Gen3, pktgen at fixed rate of 56064005 pps on 24 CPU cores:
    Inbound PCI utilization with MPWQE off: 65.8%
    Inbound PCI utilization with MPWQE on: 49.2%

Enabling MPWQE can also reduce CPU load, increasing the packet rate when
the CPU is the bottleneck:
  PCI Gen2, pktgen at full rate on 24 CPU cores:
    Packet rate with MPWQE off: 37.4 Mpps
    Packet rate with MPWQE on: 49.1 Mpps
  PCI Gen3, pktgen at full rate on 24 CPU cores:
    Packet rate with MPWQE off: 56.2 Mpps
    Packet rate with MPWQE on: 67.0 Mpps

Burst size in all pktgen tests is 32.

To avoid performance degradation when MPWQE is off, manual optimizations
of function inlining were performed. It's especially important to keep
mlx5e_sq_xmit_mpwqe noinline; otherwise gcc inlines it automatically and
bloats mlx5e_xmit, slowing it down and reducing the maximum gain seen
with MPWQE. To avoid this, there were two options:
1. Drop the refactoring and duplicate the TX data path into two huge
functions.
2. Refactor and reuse code with manual inlining, as done in this series
(a small sketch of the attribute usage follows).
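
The attribute usage behind option 2 is ordinary compiler annotations;
roughly (illustrative toy shapes only, using the kernel's noinline and
__always_inline macros):

/* Illustrative only: keep the large MPWQE session path out of line so it
 * does not bloat the hot mlx5e_xmit body, and force the small per-WQE
 * helper inline even though it has more than one caller.
 */
static noinline void toy_sq_xmit_mpwqe(struct toy_sq *sq, struct sk_buff *skb)
{
	/* large, colder body */
}

static __always_inline void toy_sq_xmit_wqe(struct toy_sq *sq, struct sk_buff *skb)
{
	/* small, hot body that must stay inlined in every caller */
}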

-Saeed.

----------------------------------------------------------------
Maxim Mikityanskiy (10):
      net/mlx5e: Refactor inline header size calculation in the TX path
      net/mlx5e: Refactor xmit functions
      net/mlx5e: Small improvements for XDP TX MPWQE logic
      net/mlx5e: Unify constants for WQE_EMPTY_DS_COUNT
      net/mlx5e: Move the TLS resync check out of the function
      net/mlx5e: Support multiple SKBs in a TX WQE
      net/mlx5e: Generalize TX MPWQE checks for full session
      net/mlx5e: Rename xmit-related structs to generalize them
      net/mlx5e: Move TX code into functions to be used by MPWQE
      net/mlx5e: Enhanced TX MPWQE for SKBs

 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  30 +-
 drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h  | 102 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c   |  33 +-
 drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h   |  60 +-
 .../net/ethernet/mellanox/mlx5/core/en/xsk/tx.c    |   2 +-
 .../mellanox/mlx5/core/en_accel/en_accel.h         |  32 +-
 .../ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c |   3 -
 .../mellanox/mlx5/core/en_accel/ktls_txrx.h        |  20 +-
 .../mellanox/mlx5/core/en_accel/tls_rxtx.c         |   8 +-
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |  15 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  18 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c |   6 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   4 +
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c    | 653 +++++++++++++++------
 14 files changed, 659 insertions(+), 327 deletions(-)


* [net-next 01/10] net/mlx5e: Refactor inline header size calculation in the TX path
  2020-09-03 21:00 [pull request][net-next 00/10] mlx5 Multi packet tx descriptors for SKBs Saeed Mahameed
@ 2020-09-03 21:00 ` Saeed Mahameed
  2020-09-03 21:00 ` [net-next 02/10] net/mlx5e: Refactor xmit functions Saeed Mahameed
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Saeed Mahameed @ 2020-09-03 21:00 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski
  Cc: netdev, Maxim Mikityanskiy, Saeed Mahameed

From: Maxim Mikityanskiy <maximmi@mellanox.com>

As preparation for the next patch, don't increase ihs to calculate
ds_cnt and then decrease it again; instead, calculate the intermediate
value in a temporary variable. This code has the same number of
arithmetic operations, but now allows splitting out the ds_cnt
calculation, which will be done in the next patch.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index da596de3abba..e15aa53ff83e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -307,9 +307,9 @@ void mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 	ds_cnt += skb_shinfo(skb)->nr_frags;
 
 	if (ihs) {
-		ihs += !!skb_vlan_tag_present(skb) * VLAN_HLEN;
+		u16 inl = ihs + !!skb_vlan_tag_present(skb) * VLAN_HLEN - INL_HDR_START_SZ;
 
-		ds_cnt_inl = DIV_ROUND_UP(ihs - INL_HDR_START_SZ, MLX5_SEND_WQE_DS);
+		ds_cnt_inl = DIV_ROUND_UP(inl, MLX5_SEND_WQE_DS);
 		ds_cnt += ds_cnt_inl;
 	}
 
@@ -348,12 +348,12 @@ void mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 	eseg->mss = mss;
 
 	if (ihs) {
-		eseg->inline_hdr.sz = cpu_to_be16(ihs);
 		if (skb_vlan_tag_present(skb)) {
-			ihs -= VLAN_HLEN;
+			eseg->inline_hdr.sz = cpu_to_be16(ihs + VLAN_HLEN);
 			mlx5e_insert_vlan(eseg->inline_hdr.start, skb, ihs);
 			stats->added_vlan_packets++;
 		} else {
+			eseg->inline_hdr.sz = cpu_to_be16(ihs);
 			memcpy(eseg->inline_hdr.start, skb->data, ihs);
 		}
 		dseg += ds_cnt_inl;
-- 
2.26.2



* [net-next 02/10] net/mlx5e: Refactor xmit functions
  2020-09-03 21:00 [pull request][net-next 00/10] mlx5 Multi packet tx descriptors for SKBs Saeed Mahameed
  2020-09-03 21:00 ` [net-next 01/10] net/mlx5e: Refactor inline header size calculation in the TX path Saeed Mahameed
@ 2020-09-03 21:00 ` Saeed Mahameed
  2020-09-04 15:27   ` Willem de Bruijn
  2020-09-03 21:00 ` [net-next 03/10] net/mlx5e: Small improvements for XDP TX MPWQE logic Saeed Mahameed
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 23+ messages in thread
From: Saeed Mahameed @ 2020-09-03 21:00 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski
  Cc: netdev, Maxim Mikityanskiy, Saeed Mahameed

From: Maxim Mikityanskiy <maximmi@mellanox.com>

A huge function mlx5e_sq_xmit was split into several to achieve multiple
goals:

1. Reuse the code in IPoIB.

2. Better integrate with TLS, IPSEC, GENEVE and checksum offloads. Now
it's possible to reserve space in the WQ before running eseg-based
offloads, so:

2.1. It's no longer necessary to copy cseg and eseg after
mlx5e_fill_sq_frag_edge.

2.2. mlx5e_txqsq_get_next_pi will be used instead of the legacy
mlx5e_fill_sq_frag_edge for better code maintainability and reuse.

3. Prepare for the upcoming TX MPWQE for SKBs. It will intervene after
mlx5e_sq_calc_wqe_attr to check if it's possible to use MPWQE, and the
code flow will split into two paths: MPWQE and non-MPWQE.

Two high-level functions are provided to send packets:

* mlx5e_xmit is called by the networking stack, runs offloads and sends
the packet. In one of the following patches, MPWQE support will be added
to this flow.

* mlx5e_sq_xmit_simple is called by the TLS offload, runs only the
checksum offload and sends the packet.
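
Condensed, the refactored mlx5e_xmit flow is (lifted from the diff below,
with error handling and the accel begin/finish hooks trimmed):

	mlx5e_sq_xmit_prepare(sq, skb, &accel, &attr);	/* opcode, mss, ihs, bytes */
	mlx5e_sq_calc_wqe_attr(skb, &attr, &wqe_attr);	/* ds_cnt, num_wqebbs */
	pi  = mlx5e_txqsq_get_next_pi(sq, wqe_attr.num_wqebbs);
	wqe = MLX5E_TX_FETCH_WQE(sq, pi);
	mlx5e_txwqe_build_eseg_csum(sq, skb, &wqe->eth);
	mlx5e_sq_xmit_wqe(sq, skb, &attr, &wqe_attr, wqe, pi, netdev_xmit_more());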

This change has no performance impact in TCP single stream test and
XDP_TX single stream test.

UDP pktgen (burst 32), single stream:
  Packet rate: 17.55 Mpps -> 19.23 Mpps
  Instructions per packet: 420 -> 360
  Cycles per packet: 165 -> 142

CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (x86_64)
NIC: Mellanox ConnectX-6 Dx

To get this performance gain, manual optimizations of function inlining
were performed. It's important to have mlx5e_sq_xmit_wqe inlined,
otherwise the packet rate drops by 1 Mpps in the UDP pktgen test.
__always_inline is required, because gcc stops inlining it once it's
called from two places (mlx5e_xmit and mlx5e_sq_xmit_simple).

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/en/txrx.h |  63 +--
 .../mellanox/mlx5/core/en_accel/en_accel.h    |   5 +
 .../mellanox/mlx5/core/en_accel/tls_rxtx.c    |   6 +-
 .../net/ethernet/mellanox/mlx5/core/en_tx.c   | 391 ++++++++++--------
 4 files changed, 243 insertions(+), 222 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
index 9334c9c3e208..d4ee22789ab0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
@@ -41,8 +41,6 @@ void mlx5e_free_rx_in_progress_descs(struct mlx5e_rq *rq);
 u16 mlx5e_select_queue(struct net_device *dev, struct sk_buff *skb,
 		       struct net_device *sb_dev);
 netdev_tx_t mlx5e_xmit(struct sk_buff *skb, struct net_device *dev);
-void mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
-		   struct mlx5e_tx_wqe *wqe, u16 pi, bool xmit_more);
 bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget);
 void mlx5e_free_txqsq_descs(struct mlx5e_txqsq *sq);
 
@@ -188,23 +186,6 @@ static inline u16 mlx5e_icosq_get_next_pi(struct mlx5e_icosq *sq, u16 size)
 	return pi;
 }
 
-static inline void
-mlx5e_fill_sq_frag_edge(struct mlx5e_txqsq *sq, struct mlx5_wq_cyc *wq,
-			u16 pi, u16 nnops)
-{
-	struct mlx5e_tx_wqe_info *edge_wi, *wi = &sq->db.wqe_info[pi];
-
-	edge_wi = wi + nnops;
-
-	/* fill sq frag edge with nops to avoid wqe wrapping two pages */
-	for (; wi < edge_wi; wi++) {
-		memset(wi, 0, sizeof(*wi));
-		wi->num_wqebbs = 1;
-		mlx5e_post_nop(wq, sq->sqn, &sq->pc);
-	}
-	sq->stats->nop += nnops;
-}
-
 static inline void
 mlx5e_notify_hw(struct mlx5_wq_cyc *wq, u16 pc, void __iomem *uar_map,
 		struct mlx5_wqe_ctrl_seg *ctrl)
@@ -223,29 +204,6 @@ mlx5e_notify_hw(struct mlx5_wq_cyc *wq, u16 pc, void __iomem *uar_map,
 	mlx5_write64((__be32 *)ctrl, uar_map);
 }
 
-static inline bool mlx5e_transport_inline_tx_wqe(struct mlx5_wqe_ctrl_seg *cseg)
-{
-	return cseg && !!cseg->tis_tir_num;
-}
-
-static inline u8
-mlx5e_tx_wqe_inline_mode(struct mlx5e_txqsq *sq, struct mlx5_wqe_ctrl_seg *cseg,
-			 struct sk_buff *skb)
-{
-	u8 mode;
-
-	if (mlx5e_transport_inline_tx_wqe(cseg))
-		return MLX5_INLINE_MODE_TCP_UDP;
-
-	mode = sq->min_inline_mode;
-
-	if (skb_vlan_tag_present(skb) &&
-	    test_bit(MLX5E_SQ_STATE_VLAN_NEED_L2_INLINE, &sq->state))
-		mode = max_t(u8, MLX5_INLINE_MODE_L2, mode);
-
-	return mode;
-}
-
 static inline void mlx5e_cq_arm(struct mlx5e_cq *cq)
 {
 	struct mlx5_core_cq *mcq;
@@ -286,6 +244,27 @@ mlx5e_tx_dma_unmap(struct device *pdev, struct mlx5e_sq_dma *dma)
 	}
 }
 
+static inline void mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq,
+					       struct sk_buff *skb,
+					       struct mlx5_wqe_eth_seg *eseg)
+{
+	if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
+		eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM;
+		if (skb->encapsulation) {
+			eseg->cs_flags |= MLX5_ETH_WQE_L3_INNER_CSUM |
+					  MLX5_ETH_WQE_L4_INNER_CSUM;
+			sq->stats->csum_partial_inner++;
+		} else {
+			eseg->cs_flags |= MLX5_ETH_WQE_L4_CSUM;
+			sq->stats->csum_partial++;
+		}
+	} else {
+		sq->stats->csum_none++;
+	}
+}
+
+void mlx5e_sq_xmit_simple(struct mlx5e_txqsq *sq, struct sk_buff *skb, bool xmit_more);
+
 static inline void mlx5e_rqwq_reset(struct mlx5e_rq *rq)
 {
 	if (rq->wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ) {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
index 110476bdeffb..23d4ef5ab9c5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
@@ -145,6 +145,11 @@ static inline bool mlx5e_accel_tx_finish(struct mlx5e_priv *priv,
 	}
 #endif
 
+#if IS_ENABLED(CONFIG_GENEVE)
+	if (skb->encapsulation)
+		mlx5e_tx_tunnel_accel(skb, &wqe->eth);
+#endif
+
 	return true;
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
index b0c31d49ff8d..c36560b3e93d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
@@ -189,12 +189,10 @@ static bool mlx5e_tls_handle_ooo(struct mlx5e_tls_offload_context_tx *context,
 				 struct mlx5e_tls *tls)
 {
 	u32 tcp_seq = ntohl(tcp_hdr(skb)->seq);
-	struct mlx5e_tx_wqe *wqe;
 	struct sync_info info;
 	struct sk_buff *nskb;
 	int linear_len = 0;
 	int headln;
-	u16 pi;
 	int i;
 
 	sq->stats->tls_ooo++;
@@ -246,9 +244,7 @@ static bool mlx5e_tls_handle_ooo(struct mlx5e_tls_offload_context_tx *context,
 	sq->stats->tls_resync_bytes += nskb->len;
 	mlx5e_tls_complete_sync_skb(skb, nskb, tcp_seq, headln,
 				    cpu_to_be64(info.rcd_sn));
-	pi = mlx5_wq_cyc_ctr2ix(&sq->wq, sq->pc);
-	wqe = MLX5E_TX_FETCH_WQE(sq, pi);
-	mlx5e_sq_xmit(sq, nskb, wqe, pi, true);
+	mlx5e_sq_xmit_simple(sq, nskb, true);
 
 	return true;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index e15aa53ff83e..f967bc0573c0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -144,23 +144,6 @@ static inline void mlx5e_insert_vlan(void *start, struct sk_buff *skb, u16 ihs)
 	memcpy(&vhdr->h_vlan_encapsulated_proto, skb->data + cpy1_sz, cpy2_sz);
 }
 
-static inline void
-mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb, struct mlx5_wqe_eth_seg *eseg)
-{
-	if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
-		eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM;
-		if (skb->encapsulation) {
-			eseg->cs_flags |= MLX5_ETH_WQE_L3_INNER_CSUM |
-					  MLX5_ETH_WQE_L4_INNER_CSUM;
-			sq->stats->csum_partial_inner++;
-		} else {
-			eseg->cs_flags |= MLX5_ETH_WQE_L4_CSUM;
-			sq->stats->csum_partial++;
-		}
-	} else
-		sq->stats->csum_none++;
-}
-
 static inline u16
 mlx5e_tx_get_gso_ihs(struct mlx5e_txqsq *sq, struct sk_buff *skb)
 {
@@ -232,22 +215,121 @@ mlx5e_txwqe_build_dsegs(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 	return -ENOMEM;
 }
 
+struct mlx5e_tx_attr {
+	u32 num_bytes;
+	u16 headlen;
+	u16 ihs;
+	__be16 mss;
+	u8 opcode;
+};
+
+struct mlx5e_tx_wqe_attr {
+	u16 ds_cnt;
+	u16 ds_cnt_inl;
+	u8 num_wqebbs;
+};
+
+static inline u8
+mlx5e_tx_wqe_inline_mode(struct mlx5e_txqsq *sq, struct sk_buff *skb,
+			 struct mlx5e_accel_tx_state *accel)
+{
+	u8 mode;
+
+#ifdef CONFIG_MLX5_EN_TLS
+	if (accel && accel->tls.tls_tisn)
+		return MLX5_INLINE_MODE_TCP_UDP;
+#endif
+
+	mode = sq->min_inline_mode;
+
+	if (skb_vlan_tag_present(skb) &&
+	    test_bit(MLX5E_SQ_STATE_VLAN_NEED_L2_INLINE, &sq->state))
+		mode = max_t(u8, MLX5_INLINE_MODE_L2, mode);
+
+	return mode;
+}
+
+static inline void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
+					 struct mlx5e_accel_tx_state *accel,
+					 struct mlx5e_tx_attr *attr)
+{
+	struct mlx5e_sq_stats *stats = sq->stats;
+
+	if (skb_is_gso(skb)) {
+		u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb);
+
+		*attr = (struct mlx5e_tx_attr) {
+			.opcode    = MLX5_OPCODE_LSO,
+			.mss       = cpu_to_be16(skb_shinfo(skb)->gso_size),
+			.ihs       = ihs,
+			.num_bytes = skb->len + (skb_shinfo(skb)->gso_segs - 1) * ihs,
+			.headlen   = skb_headlen(skb) - ihs,
+		};
+
+		stats->packets += skb_shinfo(skb)->gso_segs;
+	} else {
+		u8 mode = mlx5e_tx_wqe_inline_mode(sq, skb, accel);
+		u16 ihs = mlx5e_calc_min_inline(mode, skb);
+
+		*attr = (struct mlx5e_tx_attr) {
+			.opcode    = MLX5_OPCODE_SEND,
+			.mss       = cpu_to_be16(0),
+			.ihs       = ihs,
+			.num_bytes = max_t(unsigned int, skb->len, ETH_ZLEN),
+			.headlen   = skb_headlen(skb) - ihs,
+		};
+
+		stats->packets++;
+	}
+
+	stats->bytes += attr->num_bytes;
+}
+
+static inline void mlx5e_sq_calc_wqe_attr(struct sk_buff *skb,
+					  const struct mlx5e_tx_attr *attr,
+					  struct mlx5e_tx_wqe_attr *wqe_attr)
+{
+	u16 ds_cnt = sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS;
+	u16 ds_cnt_inl = 0;
+
+	ds_cnt += !!attr->headlen + skb_shinfo(skb)->nr_frags;
+
+	if (attr->ihs) {
+		u16 inl = attr->ihs - INL_HDR_START_SZ;
+
+		if (skb_vlan_tag_present(skb))
+			inl += VLAN_HLEN;
+
+		ds_cnt_inl = DIV_ROUND_UP(inl, MLX5_SEND_WQE_DS);
+		ds_cnt += ds_cnt_inl;
+	}
+
+	*wqe_attr = (struct mlx5e_tx_wqe_attr) {
+		.ds_cnt     = ds_cnt,
+		.ds_cnt_inl = ds_cnt_inl,
+		.num_wqebbs = DIV_ROUND_UP(ds_cnt, MLX5_SEND_WQEBB_NUM_DS),
+	};
+}
+
 static inline void
 mlx5e_txwqe_complete(struct mlx5e_txqsq *sq, struct sk_buff *skb,
-		     u8 opcode, u16 ds_cnt, u8 num_wqebbs, u32 num_bytes, u8 num_dma,
+		     const struct mlx5e_tx_attr *attr,
+		     const struct mlx5e_tx_wqe_attr *wqe_attr, u8 num_dma,
 		     struct mlx5e_tx_wqe_info *wi, struct mlx5_wqe_ctrl_seg *cseg,
 		     bool xmit_more)
 {
 	struct mlx5_wq_cyc *wq = &sq->wq;
 	bool send_doorbell;
 
-	wi->num_bytes = num_bytes;
-	wi->num_dma = num_dma;
-	wi->num_wqebbs = num_wqebbs;
-	wi->skb = skb;
+	*wi = (struct mlx5e_tx_wqe_info) {
+		.skb = skb,
+		.num_bytes = attr->num_bytes,
+		.num_dma = num_dma,
+		.num_wqebbs = wqe_attr->num_wqebbs,
+	};
 
-	cseg->opmod_idx_opcode = cpu_to_be32((sq->pc << 8) | opcode);
-	cseg->qpn_ds           = cpu_to_be32((sq->sqn << 8) | ds_cnt);
+	cseg->opmod_idx_opcode = cpu_to_be32((sq->pc << 8) | attr->opcode);
+	cseg->qpn_ds           = cpu_to_be32((sq->sqn << 8) | wqe_attr->ds_cnt);
 
 	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP))
 		skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
@@ -258,105 +340,44 @@ mlx5e_txwqe_complete(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 		sq->stats->stopped++;
 	}
 
-	send_doorbell = __netdev_tx_sent_queue(sq->txq, num_bytes,
-					       xmit_more);
+	send_doorbell = __netdev_tx_sent_queue(sq->txq, attr->num_bytes, xmit_more);
 	if (send_doorbell)
 		mlx5e_notify_hw(wq, sq->pc, sq->uar_map, cseg);
 }
 
-void mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
-		   struct mlx5e_tx_wqe *wqe, u16 pi, bool xmit_more)
+static __always_inline void
+mlx5e_sq_xmit_wqe(struct mlx5e_txqsq *sq, struct sk_buff *skb,
+		  const struct mlx5e_tx_attr *attr, const struct mlx5e_tx_wqe_attr *wqe_attr,
+		  struct mlx5e_tx_wqe *wqe, u16 pi, bool xmit_more)
 {
-	struct mlx5_wq_cyc *wq = &sq->wq;
 	struct mlx5_wqe_ctrl_seg *cseg;
 	struct mlx5_wqe_eth_seg  *eseg;
 	struct mlx5_wqe_data_seg *dseg;
 	struct mlx5e_tx_wqe_info *wi;
 
 	struct mlx5e_sq_stats *stats = sq->stats;
-	u16 headlen, ihs, contig_wqebbs_room;
-	u16 ds_cnt, ds_cnt_inl = 0;
-	u8 num_wqebbs, opcode;
-	u32 num_bytes;
 	int num_dma;
-	__be16 mss;
 
-	/* Calc ihs and ds cnt, no writes to wqe yet */
-	ds_cnt = sizeof(*wqe) / MLX5_SEND_WQE_DS;
-	if (skb_is_gso(skb)) {
-		opcode    = MLX5_OPCODE_LSO;
-		mss       = cpu_to_be16(skb_shinfo(skb)->gso_size);
-		ihs       = mlx5e_tx_get_gso_ihs(sq, skb);
-		num_bytes = skb->len + (skb_shinfo(skb)->gso_segs - 1) * ihs;
-		stats->packets += skb_shinfo(skb)->gso_segs;
-	} else {
-		u8 mode = mlx5e_tx_wqe_inline_mode(sq, &wqe->ctrl, skb);
-
-		opcode    = MLX5_OPCODE_SEND;
-		mss       = 0;
-		ihs       = mlx5e_calc_min_inline(mode, skb);
-		num_bytes = max_t(unsigned int, skb->len, ETH_ZLEN);
-		stats->packets++;
-	}
-
-	stats->bytes     += num_bytes;
 	stats->xmit_more += xmit_more;
 
-	headlen = skb->len - ihs - skb->data_len;
-	ds_cnt += !!headlen;
-	ds_cnt += skb_shinfo(skb)->nr_frags;
-
-	if (ihs) {
-		u16 inl = ihs + !!skb_vlan_tag_present(skb) * VLAN_HLEN - INL_HDR_START_SZ;
-
-		ds_cnt_inl = DIV_ROUND_UP(inl, MLX5_SEND_WQE_DS);
-		ds_cnt += ds_cnt_inl;
-	}
-
-	num_wqebbs = DIV_ROUND_UP(ds_cnt, MLX5_SEND_WQEBB_NUM_DS);
-	contig_wqebbs_room = mlx5_wq_cyc_get_contig_wqebbs(wq, pi);
-	if (unlikely(contig_wqebbs_room < num_wqebbs)) {
-#ifdef CONFIG_MLX5_EN_IPSEC
-		struct mlx5_wqe_eth_seg cur_eth = wqe->eth;
-#endif
-#ifdef CONFIG_MLX5_EN_TLS
-		struct mlx5_wqe_ctrl_seg cur_ctrl = wqe->ctrl;
-#endif
-		mlx5e_fill_sq_frag_edge(sq, wq, pi, contig_wqebbs_room);
-		pi = mlx5_wq_cyc_ctr2ix(wq, sq->pc);
-		wqe = MLX5E_TX_FETCH_WQE(sq, pi);
-#ifdef CONFIG_MLX5_EN_IPSEC
-		wqe->eth = cur_eth;
-#endif
-#ifdef CONFIG_MLX5_EN_TLS
-		wqe->ctrl = cur_ctrl;
-#endif
-	}
-
 	/* fill wqe */
 	wi   = &sq->db.wqe_info[pi];
 	cseg = &wqe->ctrl;
 	eseg = &wqe->eth;
 	dseg =  wqe->data;
 
-#if IS_ENABLED(CONFIG_GENEVE)
-	if (skb->encapsulation)
-		mlx5e_tx_tunnel_accel(skb, eseg);
-#endif
-	mlx5e_txwqe_build_eseg_csum(sq, skb, eseg);
-
-	eseg->mss = mss;
+	eseg->mss = attr->mss;
 
-	if (ihs) {
+	if (attr->ihs) {
 		if (skb_vlan_tag_present(skb)) {
-			eseg->inline_hdr.sz = cpu_to_be16(ihs + VLAN_HLEN);
-			mlx5e_insert_vlan(eseg->inline_hdr.start, skb, ihs);
+			eseg->inline_hdr.sz = cpu_to_be16(attr->ihs + VLAN_HLEN);
+			mlx5e_insert_vlan(eseg->inline_hdr.start, skb, attr->ihs);
 			stats->added_vlan_packets++;
 		} else {
-			eseg->inline_hdr.sz = cpu_to_be16(ihs);
-			memcpy(eseg->inline_hdr.start, skb->data, ihs);
+			eseg->inline_hdr.sz = cpu_to_be16(attr->ihs);
+			memcpy(eseg->inline_hdr.start, skb->data, attr->ihs);
 		}
-		dseg += ds_cnt_inl;
+		dseg += wqe_attr->ds_cnt_inl;
 	} else if (skb_vlan_tag_present(skb)) {
 		eseg->insert.type = cpu_to_be16(MLX5_ETH_WQE_INSERT_VLAN);
 		if (skb->vlan_proto == cpu_to_be16(ETH_P_8021AD))
@@ -365,12 +386,12 @@ void mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 		stats->added_vlan_packets++;
 	}
 
-	num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + ihs, headlen, dseg);
+	num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr->ihs,
+					  attr->headlen, dseg);
 	if (unlikely(num_dma < 0))
 		goto err_drop;
 
-	mlx5e_txwqe_complete(sq, skb, opcode, ds_cnt, num_wqebbs, num_bytes,
-			     num_dma, wi, cseg, xmit_more);
+	mlx5e_txwqe_complete(sq, skb, attr, wqe_attr, num_dma, wi, cseg, xmit_more);
 
 	return;
 
@@ -383,6 +404,8 @@ netdev_tx_t mlx5e_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct mlx5e_priv *priv = netdev_priv(dev);
 	struct mlx5e_accel_tx_state accel = {};
+	struct mlx5e_tx_wqe_attr wqe_attr;
+	struct mlx5e_tx_attr attr;
 	struct mlx5e_tx_wqe *wqe;
 	struct mlx5e_txqsq *sq;
 	u16 pi;
@@ -393,19 +416,64 @@ netdev_tx_t mlx5e_xmit(struct sk_buff *skb, struct net_device *dev)
 	if (unlikely(!mlx5e_accel_tx_begin(dev, sq, skb, &accel)))
 		goto out;
 
-	pi = mlx5_wq_cyc_ctr2ix(&sq->wq, sq->pc);
+	mlx5e_sq_xmit_prepare(sq, skb, &accel, &attr);
+	mlx5e_sq_calc_wqe_attr(skb, &attr, &wqe_attr);
+	pi = mlx5e_txqsq_get_next_pi(sq, wqe_attr.num_wqebbs);
 	wqe = MLX5E_TX_FETCH_WQE(sq, pi);
 
 	/* May update the WQE, but may not post other WQEs. */
 	if (unlikely(!mlx5e_accel_tx_finish(priv, sq, skb, wqe, &accel)))
 		goto out;
 
-	mlx5e_sq_xmit(sq, skb, wqe, pi, netdev_xmit_more());
+	mlx5e_txwqe_build_eseg_csum(sq, skb, &wqe->eth);
+	mlx5e_sq_xmit_wqe(sq, skb, &attr, &wqe_attr, wqe, pi, netdev_xmit_more());
 
 out:
 	return NETDEV_TX_OK;
 }
 
+void mlx5e_sq_xmit_simple(struct mlx5e_txqsq *sq, struct sk_buff *skb, bool xmit_more)
+{
+	struct mlx5e_tx_wqe_attr wqe_attr;
+	struct mlx5e_tx_attr attr;
+	struct mlx5e_tx_wqe *wqe;
+	u16 pi;
+
+	mlx5e_sq_xmit_prepare(sq, skb, NULL, &attr);
+	mlx5e_sq_calc_wqe_attr(skb, &attr, &wqe_attr);
+	pi = mlx5e_txqsq_get_next_pi(sq, wqe_attr.num_wqebbs);
+	wqe = MLX5E_TX_FETCH_WQE(sq, pi);
+	mlx5e_txwqe_build_eseg_csum(sq, skb, &wqe->eth);
+	mlx5e_sq_xmit_wqe(sq, skb, &attr, &wqe_attr, wqe, pi, xmit_more);
+}
+
+static inline void mlx5e_tx_wi_dma_unmap(struct mlx5e_txqsq *sq,
+					 struct mlx5e_tx_wqe_info *wi,
+					 u32 *dma_fifo_cc)
+{
+	int i;
+
+	for (i = 0; i < wi->num_dma; i++) {
+		struct mlx5e_sq_dma *dma = mlx5e_dma_get(sq, (*dma_fifo_cc)++);
+
+		mlx5e_tx_dma_unmap(sq->pdev, dma);
+	}
+}
+
+static inline void mlx5e_consume_skb(struct mlx5e_txqsq *sq, struct sk_buff *skb,
+				     struct mlx5_cqe64 *cqe, int napi_budget)
+{
+	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP)) {
+		struct skb_shared_hwtstamps hwts = {};
+		u64 ts = get_cqe_ts(cqe);
+
+		hwts.hwtstamp = mlx5_timecounter_cyc2time(sq->clock, ts);
+		skb_tstamp_tx(skb, &hwts);
+	}
+
+	napi_consume_skb(skb, napi_budget);
+}
+
 bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 {
 	struct mlx5e_sq_stats *stats;
@@ -452,7 +520,6 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 
 		do {
 			struct sk_buff *skb;
-			int j;
 
 			last_wqe = (sqcc == wqe_counter);
 
@@ -460,33 +527,18 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 			wi = &sq->db.wqe_info[ci];
 			skb = wi->skb;
 
+			sqcc += wi->num_wqebbs;
+
 			if (unlikely(!skb)) {
 				mlx5e_ktls_tx_handle_resync_dump_comp(sq, wi, &dma_fifo_cc);
-				sqcc += wi->num_wqebbs;
 				continue;
 			}
 
-			if (unlikely(skb_shinfo(skb)->tx_flags &
-				     SKBTX_HW_TSTAMP)) {
-				struct skb_shared_hwtstamps hwts = {};
-
-				hwts.hwtstamp =
-					mlx5_timecounter_cyc2time(sq->clock,
-								  get_cqe_ts(cqe));
-				skb_tstamp_tx(skb, &hwts);
-			}
-
-			for (j = 0; j < wi->num_dma; j++) {
-				struct mlx5e_sq_dma *dma =
-					mlx5e_dma_get(sq, dma_fifo_cc++);
-
-				mlx5e_tx_dma_unmap(sq->pdev, dma);
-			}
+			mlx5e_tx_wi_dma_unmap(sq, wi, &dma_fifo_cc);
+			mlx5e_consume_skb(sq, wi->skb, cqe, napi_budget);
 
 			npkts++;
 			nbytes += wi->num_bytes;
-			sqcc += wi->num_wqebbs;
-			napi_consume_skb(skb, napi_budget);
 		} while (!last_wqe);
 
 		if (unlikely(get_cqe_opcode(cqe) == MLX5_CQE_REQ_ERR)) {
@@ -531,7 +583,6 @@ void mlx5e_free_txqsq_descs(struct mlx5e_txqsq *sq)
 	u32 dma_fifo_cc, nbytes = 0;
 	u16 ci, sqcc, npkts = 0;
 	struct sk_buff *skb;
-	int i;
 
 	sqcc = sq->cc;
 	dma_fifo_cc = sq->dma_fifo_cc;
@@ -541,23 +592,18 @@ void mlx5e_free_txqsq_descs(struct mlx5e_txqsq *sq)
 		wi = &sq->db.wqe_info[ci];
 		skb = wi->skb;
 
+		sqcc += wi->num_wqebbs;
+
 		if (!skb) {
 			mlx5e_ktls_tx_handle_resync_dump_comp(sq, wi, &dma_fifo_cc);
-			sqcc += wi->num_wqebbs;
 			continue;
 		}
 
-		for (i = 0; i < wi->num_dma; i++) {
-			struct mlx5e_sq_dma *dma =
-				mlx5e_dma_get(sq, dma_fifo_cc++);
-
-			mlx5e_tx_dma_unmap(sq->pdev, dma);
-		}
-
+		mlx5e_tx_wi_dma_unmap(sq, wi, &dma_fifo_cc);
 		dev_kfree_skb_any(skb);
+
 		npkts++;
 		nbytes += wi->num_bytes;
-		sqcc += wi->num_wqebbs;
 	}
 
 	sq->dma_fifo_cc = dma_fifo_cc;
@@ -576,9 +622,34 @@ mlx5i_txwqe_build_datagram(struct mlx5_av *av, u32 dqpn, u32 dqkey,
 	dseg->av.key.qkey.qkey = cpu_to_be32(dqkey);
 }
 
+static void mlx5i_sq_calc_wqe_attr(struct sk_buff *skb,
+				   const struct mlx5e_tx_attr *attr,
+				   struct mlx5e_tx_wqe_attr *wqe_attr)
+{
+	u16 ds_cnt = sizeof(struct mlx5i_tx_wqe) / MLX5_SEND_WQE_DS;
+	u16 ds_cnt_inl = 0;
+
+	ds_cnt += !!attr->headlen + skb_shinfo(skb)->nr_frags;
+
+	if (attr->ihs) {
+		u16 inl = attr->ihs - INL_HDR_START_SZ;
+
+		ds_cnt_inl = DIV_ROUND_UP(inl, MLX5_SEND_WQE_DS);
+		ds_cnt += ds_cnt_inl;
+	}
+
+	*wqe_attr = (struct mlx5e_tx_wqe_attr) {
+		.ds_cnt     = ds_cnt,
+		.ds_cnt_inl = ds_cnt_inl,
+		.num_wqebbs = DIV_ROUND_UP(ds_cnt, MLX5_SEND_WQEBB_NUM_DS),
+	};
+}
+
 void mlx5i_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 		   struct mlx5_av *av, u32 dqpn, u32 dqkey, bool xmit_more)
 {
+	struct mlx5e_tx_wqe_attr wqe_attr;
+	struct mlx5e_tx_attr attr;
 	struct mlx5i_tx_wqe *wqe;
 
 	struct mlx5_wqe_datagram_seg *datagram;
@@ -588,47 +659,17 @@ void mlx5i_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 	struct mlx5e_tx_wqe_info *wi;
 
 	struct mlx5e_sq_stats *stats = sq->stats;
-	u16 ds_cnt, ds_cnt_inl = 0;
-	u8 num_wqebbs, opcode;
-	u16 headlen, ihs, pi;
-	u32 num_bytes;
 	int num_dma;
-	__be16 mss;
+	u16 pi;
 
-	/* Calc ihs and ds cnt, no writes to wqe yet */
-	ds_cnt = sizeof(*wqe) / MLX5_SEND_WQE_DS;
-	if (skb_is_gso(skb)) {
-		opcode    = MLX5_OPCODE_LSO;
-		mss       = cpu_to_be16(skb_shinfo(skb)->gso_size);
-		ihs       = mlx5e_tx_get_gso_ihs(sq, skb);
-		num_bytes = skb->len + (skb_shinfo(skb)->gso_segs - 1) * ihs;
-		stats->packets += skb_shinfo(skb)->gso_segs;
-	} else {
-		u8 mode = mlx5e_tx_wqe_inline_mode(sq, NULL, skb);
+	mlx5e_sq_xmit_prepare(sq, skb, NULL, &attr);
+	mlx5i_sq_calc_wqe_attr(skb, &attr, &wqe_attr);
 
-		opcode    = MLX5_OPCODE_SEND;
-		mss       = 0;
-		ihs       = mlx5e_calc_min_inline(mode, skb);
-		num_bytes = max_t(unsigned int, skb->len, ETH_ZLEN);
-		stats->packets++;
-	}
+	pi = mlx5e_txqsq_get_next_pi(sq, wqe_attr.num_wqebbs);
+	wqe = MLX5I_SQ_FETCH_WQE(sq, pi);
 
-	stats->bytes     += num_bytes;
 	stats->xmit_more += xmit_more;
 
-	headlen = skb->len - ihs - skb->data_len;
-	ds_cnt += !!headlen;
-	ds_cnt += skb_shinfo(skb)->nr_frags;
-
-	if (ihs) {
-		ds_cnt_inl = DIV_ROUND_UP(ihs - INL_HDR_START_SZ, MLX5_SEND_WQE_DS);
-		ds_cnt += ds_cnt_inl;
-	}
-
-	num_wqebbs = DIV_ROUND_UP(ds_cnt, MLX5_SEND_WQEBB_NUM_DS);
-	pi = mlx5e_txqsq_get_next_pi(sq, num_wqebbs);
-	wqe = MLX5I_SQ_FETCH_WQE(sq, pi);
-
 	/* fill wqe */
 	wi       = &sq->db.wqe_info[pi];
 	cseg     = &wqe->ctrl;
@@ -640,20 +681,20 @@ void mlx5i_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 
 	mlx5e_txwqe_build_eseg_csum(sq, skb, eseg);
 
-	eseg->mss = mss;
+	eseg->mss = attr.mss;
 
-	if (ihs) {
-		memcpy(eseg->inline_hdr.start, skb->data, ihs);
-		eseg->inline_hdr.sz = cpu_to_be16(ihs);
-		dseg += ds_cnt_inl;
+	if (attr.ihs) {
+		memcpy(eseg->inline_hdr.start, skb->data, attr.ihs);
+		eseg->inline_hdr.sz = cpu_to_be16(attr.ihs);
+		dseg += wqe_attr.ds_cnt_inl;
 	}
 
-	num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + ihs, headlen, dseg);
+	num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr.ihs,
+					  attr.headlen, dseg);
 	if (unlikely(num_dma < 0))
 		goto err_drop;
 
-	mlx5e_txwqe_complete(sq, skb, opcode, ds_cnt, num_wqebbs, num_bytes,
-			     num_dma, wi, cseg, xmit_more);
+	mlx5e_txwqe_complete(sq, skb, &attr, &wqe_attr, num_dma, wi, cseg, xmit_more);
 
 	return;
 
-- 
2.26.2



* [net-next 03/10] net/mlx5e: Small improvements for XDP TX MPWQE logic
  2020-09-03 21:00 [pull request][net-next 00/10] mlx5 Multi packet tx descriptors for SKBs Saeed Mahameed
  2020-09-03 21:00 ` [net-next 01/10] net/mlx5e: Refactor inline header size calculation in the TX path Saeed Mahameed
  2020-09-03 21:00 ` [net-next 02/10] net/mlx5e: Refactor xmit functions Saeed Mahameed
@ 2020-09-03 21:00 ` Saeed Mahameed
  2020-09-03 21:00 ` [net-next 04/10] net/mlx5e: Unify constants for WQE_EMPTY_DS_COUNT Saeed Mahameed
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Saeed Mahameed @ 2020-09-03 21:00 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski
  Cc: netdev, Maxim Mikityanskiy, Saeed Mahameed

From: Maxim Mikityanskiy <maximmi@mellanox.com>

Use MLX5E_XDP_MPW_MAX_WQEBBS to reserve space for an MPWQE, because it's
actually the maximum size an MPWQE can take.

Reorganize the logic that checks when to close the MPWQE session:

1. Put all checks into a single function.

2. When inline is on, make only one comparison: if it's false, the less
strict one will also be false. The compiler probably optimized this out
anyway, but it's clearer to reflect it in the code as well.

The MLX5E_XDP_INLINE_WQE_* defines are also changed to make the
calculations more correct from a logical point of view. Although
MLX5E_XDP_INLINE_WQE_MAX_DS_CNT used to be 16 and keeps that value, it
was calculated as DIV_ROUND_UP(max inline packet size, MLX5_SEND_WQE_DS),
whereas the numerator should also have included sizeof(struct
mlx5_wqe_inline_seg).
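
For reference, the values work out the same either way, assuming
MLX5_SEND_WQE_DS is 16 bytes and sizeof(struct mlx5_wqe_inline_seg) is
4 bytes (stated here as an assumption about the mlx5 headers):

  Old: SZ_THRSD   = 256 - 4 = 252
       MAX_DS_CNT = DIV_ROUND_UP(252, 16) = 16
  New: MAX_DS_CNT = 16
       SZ_THRSD   = 16 * 16 - 4 = 252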

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c |  5 ++---
 drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h | 16 +++++++++-------
 2 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 145592788de5..7fccd2ea7dc9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -198,7 +198,7 @@ static void mlx5e_xdp_mpwqe_session_start(struct mlx5e_xdpsq *sq)
 	struct mlx5e_xdpsq_stats *stats = sq->stats;
 	u16 pi;
 
-	pi = mlx5e_xdpsq_get_next_pi(sq, MLX5_SEND_WQE_MAX_WQEBBS);
+	pi = mlx5e_xdpsq_get_next_pi(sq, MLX5E_XDP_MPW_MAX_WQEBBS);
 	session->wqe = MLX5E_TX_FETCH_WQE(sq, pi);
 
 	net_prefetchw(session->wqe->data);
@@ -284,8 +284,7 @@ mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq, struct mlx5e_xdp_xmit_data *x
 
 	mlx5e_xdp_mpwqe_add_dseg(sq, xdptxd, stats);
 
-	if (unlikely(mlx5e_xdp_no_room_for_inline_pkt(session) ||
-		     session->ds_count == MLX5E_XDP_MPW_MAX_NUM_DS))
+	if (unlikely(mlx5e_xdp_mpqwe_is_full(session)))
 		mlx5e_xdp_mpwqe_complete(sq);
 
 	mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo, xdpi);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
index e806c13d491f..615bf04f4a54 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
@@ -42,9 +42,10 @@
 	(sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS)
 #define MLX5E_XDP_TX_DS_COUNT (MLX5E_XDP_TX_EMPTY_DS_COUNT + 1 /* SG DS */)
 
-#define MLX5E_XDP_INLINE_WQE_SZ_THRSD (256 - sizeof(struct mlx5_wqe_inline_seg))
-#define MLX5E_XDP_INLINE_WQE_MAX_DS_CNT \
-	DIV_ROUND_UP(MLX5E_XDP_INLINE_WQE_SZ_THRSD, MLX5_SEND_WQE_DS)
+#define MLX5E_XDP_INLINE_WQE_MAX_DS_CNT 16
+#define MLX5E_XDP_INLINE_WQE_SZ_THRSD \
+	(MLX5E_XDP_INLINE_WQE_MAX_DS_CNT * MLX5_SEND_WQE_DS - \
+	 sizeof(struct mlx5_wqe_inline_seg))
 
 /* The mult of MLX5_SEND_WQE_MAX_WQEBBS * MLX5_SEND_WQEBB_NUM_DS
  * (16 * 4 == 64) does not fit in the 6-bit DS field of Ctrl Segment.
@@ -141,11 +142,12 @@ static inline void mlx5e_xdp_update_inline_state(struct mlx5e_xdpsq *sq)
 		session->inline_on = 1;
 }
 
-static inline bool
-mlx5e_xdp_no_room_for_inline_pkt(struct mlx5e_xdp_mpwqe *session)
+static inline bool mlx5e_xdp_mpqwe_is_full(struct mlx5e_xdp_mpwqe *session)
 {
-	return session->inline_on &&
-	       session->ds_count + MLX5E_XDP_INLINE_WQE_MAX_DS_CNT > MLX5E_XDP_MPW_MAX_NUM_DS;
+	if (session->inline_on)
+		return session->ds_count + MLX5E_XDP_INLINE_WQE_MAX_DS_CNT >
+		       MLX5E_XDP_MPW_MAX_NUM_DS;
+	return session->ds_count == MLX5E_XDP_MPW_MAX_NUM_DS;
 }
 
 struct mlx5e_xdp_wqe_info {
-- 
2.26.2



* [net-next 04/10] net/mlx5e: Unify constants for WQE_EMPTY_DS_COUNT
  2020-09-03 21:00 [pull request][net-next 00/10] mlx5 Multi packet tx descriptors for SKBs Saeed Mahameed
                   ` (2 preceding siblings ...)
  2020-09-03 21:00 ` [net-next 03/10] net/mlx5e: Small improvements for XDP TX MPWQE logic Saeed Mahameed
@ 2020-09-03 21:00 ` Saeed Mahameed
  2020-09-04 15:05   ` Willem de Bruijn
  2020-09-03 21:00 ` [net-next 05/10] net/mlx5e: Move the TLS resync check out of the function Saeed Mahameed
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 23+ messages in thread
From: Saeed Mahameed @ 2020-09-03 21:00 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski
  Cc: netdev, Maxim Mikityanskiy, Saeed Mahameed

From: Maxim Mikityanskiy <maximmi@mellanox.com>

A constant for the number of DS in an empty WQE (i.e. a WQE without data
segments) is needed in multiple places (normal TX data path, MPWQE in
XDP), but currently we have a constant for XDP and an inline formula in
normal TX. This patch introduces a common constant.

Additionally, mlx5e_xdp_mpwqe_session_start is converted to use struct
assignment, because the code nearby is touched.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/en/txrx.h |  2 ++
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 13 +++++++-----
 .../net/ethernet/mellanox/mlx5/core/en/xdp.h  | 21 +++++++------------
 .../net/ethernet/mellanox/mlx5/core/en_tx.c   |  2 +-
 4 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
index d4ee22789ab0..155b89998891 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
@@ -7,6 +7,8 @@
 #include "en.h"
 #include <linux/indirect_call_wrapper.h>
 
+#define MLX5E_TX_WQE_EMPTY_DS_COUNT (sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS)
+
 #define INL_HDR_START_SZ (sizeof(((struct mlx5_wqe_eth_seg *)NULL)->inline_hdr.start))
 
 enum mlx5e_icosq_wqe_type {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 7fccd2ea7dc9..81cd9a04bcb0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -196,16 +196,19 @@ static void mlx5e_xdp_mpwqe_session_start(struct mlx5e_xdpsq *sq)
 {
 	struct mlx5e_xdp_mpwqe *session = &sq->mpwqe;
 	struct mlx5e_xdpsq_stats *stats = sq->stats;
+	struct mlx5e_tx_wqe *wqe;
 	u16 pi;
 
 	pi = mlx5e_xdpsq_get_next_pi(sq, MLX5E_XDP_MPW_MAX_WQEBBS);
-	session->wqe = MLX5E_TX_FETCH_WQE(sq, pi);
-
+	wqe = MLX5E_TX_FETCH_WQE(sq, pi);
 	net_prefetchw(session->wqe->data);
-	session->ds_count  = MLX5E_XDP_TX_EMPTY_DS_COUNT;
-	session->pkt_count = 0;
 
-	mlx5e_xdp_update_inline_state(sq);
+	*session = (struct mlx5e_xdp_mpwqe) {
+		.wqe = wqe,
+		.ds_count = MLX5E_TX_WQE_EMPTY_DS_COUNT,
+		.pkt_count = 0,
+		.inline_on = mlx5e_xdp_get_inline_state(sq, session->inline_on),
+	};
 
 	stats->mpwqe++;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
index 615bf04f4a54..96d6b1553bab 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
@@ -38,9 +38,7 @@
 #include "en/txrx.h"
 
 #define MLX5E_XDP_MIN_INLINE (ETH_HLEN + VLAN_HLEN)
-#define MLX5E_XDP_TX_EMPTY_DS_COUNT \
-	(sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS)
-#define MLX5E_XDP_TX_DS_COUNT (MLX5E_XDP_TX_EMPTY_DS_COUNT + 1 /* SG DS */)
+#define MLX5E_XDP_TX_DS_COUNT (MLX5E_TX_WQE_EMPTY_DS_COUNT + 1 /* SG DS */)
 
 #define MLX5E_XDP_INLINE_WQE_MAX_DS_CNT 16
 #define MLX5E_XDP_INLINE_WQE_SZ_THRSD \
@@ -123,23 +121,20 @@ static inline void mlx5e_xmit_xdp_doorbell(struct mlx5e_xdpsq *sq)
 /* Enable inline WQEs to shift some load from a congested HCA (HW) to
  * a less congested cpu (SW).
  */
-static inline void mlx5e_xdp_update_inline_state(struct mlx5e_xdpsq *sq)
+static inline bool mlx5e_xdp_get_inline_state(struct mlx5e_xdpsq *sq, bool cur)
 {
 	u16 outstanding = sq->xdpi_fifo_pc - sq->xdpi_fifo_cc;
-	struct mlx5e_xdp_mpwqe *session = &sq->mpwqe;
 
 #define MLX5E_XDP_INLINE_WATERMARK_LOW	10
 #define MLX5E_XDP_INLINE_WATERMARK_HIGH 128
 
-	if (session->inline_on) {
-		if (outstanding <= MLX5E_XDP_INLINE_WATERMARK_LOW)
-			session->inline_on = 0;
-		return;
-	}
+	if (cur && outstanding <= MLX5E_XDP_INLINE_WATERMARK_LOW)
+		return false;
+
+	if (!cur && outstanding >= MLX5E_XDP_INLINE_WATERMARK_HIGH)
+		return true;
 
-	/* inline is false */
-	if (outstanding >= MLX5E_XDP_INLINE_WATERMARK_HIGH)
-		session->inline_on = 1;
+	return cur;
 }
 
 static inline bool mlx5e_xdp_mpqwe_is_full(struct mlx5e_xdp_mpwqe *session)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index f967bc0573c0..46bdbbbfaf65 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -289,7 +289,7 @@ static inline void mlx5e_sq_calc_wqe_attr(struct sk_buff *skb,
 					  const struct mlx5e_tx_attr *attr,
 					  struct mlx5e_tx_wqe_attr *wqe_attr)
 {
-	u16 ds_cnt = sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS;
+	u16 ds_cnt = MLX5E_TX_WQE_EMPTY_DS_COUNT;
 	u16 ds_cnt_inl = 0;
 
 	ds_cnt += !!attr->headlen + skb_shinfo(skb)->nr_frags;
-- 
2.26.2



* [net-next 05/10] net/mlx5e: Move the TLS resync check out of the function
  2020-09-03 21:00 [pull request][net-next 00/10] mlx5 Multi packet tx descriptors for SKBs Saeed Mahameed
                   ` (3 preceding siblings ...)
  2020-09-03 21:00 ` [net-next 04/10] net/mlx5e: Unify constants for WQE_EMPTY_DS_COUNT Saeed Mahameed
@ 2020-09-03 21:00 ` Saeed Mahameed
  2020-09-03 21:00 ` [net-next 06/10] net/mlx5e: Support multiple SKBs in a TX WQE Saeed Mahameed
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Saeed Mahameed @ 2020-09-03 21:00 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski
  Cc: netdev, Maxim Mikityanskiy, Saeed Mahameed

From: Maxim Mikityanskiy <maximmi@mellanox.com>

Before this patch, mlx5e_ktls_tx_handle_resync_dump_comp checked for
resync_dump_frag_page. This happened for all WQEs without an SKB,
including padding WQEs, and required a function call. Normally, padding
WQEs happen more often than TLS resyncs. Take this check out of the
function and put it into an inline wrapper to save a function call on
all padding WQEs.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c |  3 ---
 .../mellanox/mlx5/core/en_accel/ktls_txrx.h        | 14 +++++++++++---
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c    |  4 ++--
 3 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
index f4861545b236..b140e13fdcc8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
@@ -345,9 +345,6 @@ void mlx5e_ktls_tx_handle_resync_dump_comp(struct mlx5e_txqsq *sq,
 	struct mlx5e_sq_stats *stats;
 	struct mlx5e_sq_dma *dma;
 
-	if (!wi->resync_dump_frag_page)
-		return;
-
 	dma = mlx5e_dma_get(sq, (*dma_fifo_cc)++);
 	stats = sq->stats;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_txrx.h
index ff4c740af10b..fcfb156cf09d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_txrx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_txrx.h
@@ -29,11 +29,19 @@ void mlx5e_ktls_handle_get_psv_completion(struct mlx5e_icosq_wqe_info *wi,
 void mlx5e_ktls_tx_handle_resync_dump_comp(struct mlx5e_txqsq *sq,
 					   struct mlx5e_tx_wqe_info *wi,
 					   u32 *dma_fifo_cc);
+static inline void
+mlx5e_ktls_tx_try_handle_resync_dump_comp(struct mlx5e_txqsq *sq,
+					  struct mlx5e_tx_wqe_info *wi,
+					  u32 *dma_fifo_cc)
+{
+	if (unlikely(wi->resync_dump_frag_page))
+		mlx5e_ktls_tx_handle_resync_dump_comp(sq, wi, dma_fifo_cc);
+}
 #else
 static inline void
-mlx5e_ktls_tx_handle_resync_dump_comp(struct mlx5e_txqsq *sq,
-				      struct mlx5e_tx_wqe_info *wi,
-				      u32 *dma_fifo_cc)
+mlx5e_ktls_tx_try_handle_resync_dump_comp(struct mlx5e_txqsq *sq,
+					  struct mlx5e_tx_wqe_info *wi,
+					  u32 *dma_fifo_cc)
 {
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 46bdbbbfaf65..869b3313dabf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -530,7 +530,7 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 			sqcc += wi->num_wqebbs;
 
 			if (unlikely(!skb)) {
-				mlx5e_ktls_tx_handle_resync_dump_comp(sq, wi, &dma_fifo_cc);
+				mlx5e_ktls_tx_try_handle_resync_dump_comp(sq, wi, &dma_fifo_cc);
 				continue;
 			}
 
@@ -595,7 +595,7 @@ void mlx5e_free_txqsq_descs(struct mlx5e_txqsq *sq)
 		sqcc += wi->num_wqebbs;
 
 		if (!skb) {
-			mlx5e_ktls_tx_handle_resync_dump_comp(sq, wi, &dma_fifo_cc);
+			mlx5e_ktls_tx_try_handle_resync_dump_comp(sq, wi, &dma_fifo_cc);
 			continue;
 		}
 
-- 
2.26.2



* [net-next 06/10] net/mlx5e: Support multiple SKBs in a TX WQE
  2020-09-03 21:00 [pull request][net-next 00/10] mlx5 Multi packet tx descriptors for SKBs Saeed Mahameed
                   ` (4 preceding siblings ...)
  2020-09-03 21:00 ` [net-next 05/10] net/mlx5e: Move the TLS resync check out of the function Saeed Mahameed
@ 2020-09-03 21:00 ` Saeed Mahameed
  2020-09-03 22:46   ` Jakub Kicinski
  2020-09-03 21:00 ` [net-next 07/10] net/mlx5e: Generalize TX MPWQE checks for full session Saeed Mahameed
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 23+ messages in thread
From: Saeed Mahameed @ 2020-09-03 21:00 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski
  Cc: netdev, Maxim Mikityanskiy, Tariq Toukan, Saeed Mahameed

From: Maxim Mikityanskiy <maximmi@mellanox.com>

TX MPWQE support for SKBs is coming in one of the following patches, and
a single MPWQE can send multiple SKBs. This commit prepares the TX path
code to handle such cases:

1. An additional FIFO for SKBs is added, just like the FIFO for DMA
chunks.

2. struct mlx5e_tx_wqe_info will contain num_fifo_pkts. If a given WQE
contains only one packet, num_fifo_pkts will be zero, and the SKB will
be stored in mlx5e_tx_wqe_info, as usual. If num_fifo_pkts > 0, the SKB
pointer will be NULL, and the SKBs will be stored in the FIFO.

This change has no performance impact in TCP single stream test and
XDP_TX single stream test.

UDP pktgen (burst 32), single stream:
  Packet rate: 19.23 Mpps -> 19.12 Mpps
  Instructions per packet: 360 -> 354
  Cycles per packet: 142 -> 140

CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (x86_64)
NIC: Mellanox ConnectX-6 Dx

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  4 ++
 .../net/ethernet/mellanox/mlx5/core/en/txrx.h | 18 +++++
 .../mellanox/mlx5/core/en_accel/ktls_txrx.h   | 10 ++-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  7 +-
 .../net/ethernet/mellanox/mlx5/core/en_tx.c   | 71 ++++++++++++++-----
 5 files changed, 89 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 4f33658da25a..6ab60074fca9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -317,11 +317,13 @@ struct mlx5e_txqsq {
 
 	/* dirtied @completion */
 	u16                        cc;
+	u16                        skb_fifo_cc;
 	u32                        dma_fifo_cc;
 	struct dim                 dim; /* Adaptive Moderation */
 
 	/* dirtied @xmit */
 	u16                        pc ____cacheline_aligned_in_smp;
+	u16                        skb_fifo_pc;
 	u32                        dma_fifo_pc;
 
 	struct mlx5e_cq            cq;
@@ -329,9 +331,11 @@ struct mlx5e_txqsq {
 	/* read only */
 	struct mlx5_wq_cyc         wq;
 	u32                        dma_fifo_mask;
+	u16                        skb_fifo_mask;
 	struct mlx5e_sq_stats     *stats;
 	struct {
 		struct mlx5e_sq_dma       *dma_fifo;
+		struct sk_buff           **skb_fifo;
 		struct mlx5e_tx_wqe_info  *wqe_info;
 	} db;
 	void __iomem              *uar_map;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
index 155b89998891..7baac2971758 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
@@ -105,6 +105,7 @@ struct mlx5e_tx_wqe_info {
 	u32 num_bytes;
 	u8 num_wqebbs;
 	u8 num_dma;
+	u8 num_fifo_pkts;
 #ifdef CONFIG_MLX5_EN_TLS
 	struct page *resync_dump_frag_page;
 #endif
@@ -231,6 +232,23 @@ mlx5e_dma_push(struct mlx5e_txqsq *sq, dma_addr_t addr, u32 size,
 	dma->type = map_type;
 }
 
+static inline struct sk_buff **mlx5e_skb_fifo_get(struct mlx5e_txqsq *sq, u16 i)
+{
+	return &sq->db.skb_fifo[i & sq->skb_fifo_mask];
+}
+
+static inline void mlx5e_skb_fifo_push(struct mlx5e_txqsq *sq, struct sk_buff *skb)
+{
+	struct sk_buff **skb_item = mlx5e_skb_fifo_get(sq, sq->skb_fifo_pc++);
+
+	*skb_item = skb;
+}
+
+static inline struct sk_buff *mlx5e_skb_fifo_pop(struct mlx5e_txqsq *sq)
+{
+	return *mlx5e_skb_fifo_get(sq, sq->skb_fifo_cc++);
+}
+
 static inline void
 mlx5e_tx_dma_unmap(struct device *pdev, struct mlx5e_sq_dma *dma)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_txrx.h
index fcfb156cf09d..7521c9be735b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_txrx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_txrx.h
@@ -29,20 +29,24 @@ void mlx5e_ktls_handle_get_psv_completion(struct mlx5e_icosq_wqe_info *wi,
 void mlx5e_ktls_tx_handle_resync_dump_comp(struct mlx5e_txqsq *sq,
 					   struct mlx5e_tx_wqe_info *wi,
 					   u32 *dma_fifo_cc);
-static inline void
+static inline bool
 mlx5e_ktls_tx_try_handle_resync_dump_comp(struct mlx5e_txqsq *sq,
 					  struct mlx5e_tx_wqe_info *wi,
 					  u32 *dma_fifo_cc)
 {
-	if (unlikely(wi->resync_dump_frag_page))
+	if (unlikely(wi->resync_dump_frag_page)) {
 		mlx5e_ktls_tx_handle_resync_dump_comp(sq, wi, dma_fifo_cc);
+		return true;
+	}
+	return false;
 }
 #else
-static inline void
+static inline bool
 mlx5e_ktls_tx_try_handle_resync_dump_comp(struct mlx5e_txqsq *sq,
 					  struct mlx5e_tx_wqe_info *wi,
 					  u32 *dma_fifo_cc)
 {
+	return false;
 }
 
 #endif /* CONFIG_MLX5_EN_TLS */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 26834625556d..b413aa168e4e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -1040,6 +1040,7 @@ static void mlx5e_free_icosq(struct mlx5e_icosq *sq)
 static void mlx5e_free_txqsq_db(struct mlx5e_txqsq *sq)
 {
 	kvfree(sq->db.wqe_info);
+	kvfree(sq->db.skb_fifo);
 	kvfree(sq->db.dma_fifo);
 }
 
@@ -1051,15 +1052,19 @@ static int mlx5e_alloc_txqsq_db(struct mlx5e_txqsq *sq, int numa)
 	sq->db.dma_fifo = kvzalloc_node(array_size(df_sz,
 						   sizeof(*sq->db.dma_fifo)),
 					GFP_KERNEL, numa);
+	sq->db.skb_fifo = kvzalloc_node(array_size(df_sz,
+						   sizeof(*sq->db.skb_fifo)),
+					GFP_KERNEL, numa);
 	sq->db.wqe_info = kvzalloc_node(array_size(wq_sz,
 						   sizeof(*sq->db.wqe_info)),
 					GFP_KERNEL, numa);
-	if (!sq->db.dma_fifo || !sq->db.wqe_info) {
+	if (!sq->db.dma_fifo || !sq->db.skb_fifo || !sq->db.wqe_info) {
 		mlx5e_free_txqsq_db(sq);
 		return -ENOMEM;
 	}
 
 	sq->dma_fifo_mask = df_sz - 1;
+	sq->skb_fifo_mask = df_sz - 1;
 
 	return 0;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 869b3313dabf..9ced350150b3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -326,6 +326,7 @@ mlx5e_txwqe_complete(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 		.num_bytes = attr->num_bytes,
 		.num_dma = num_dma,
 		.num_wqebbs = wqe_attr->num_wqebbs,
+		.num_fifo_pkts = 0,
 	};
 
 	cseg->opmod_idx_opcode = cpu_to_be32((sq->pc << 8) | attr->opcode);
@@ -474,6 +475,20 @@ static inline void mlx5e_consume_skb(struct mlx5e_txqsq *sq, struct sk_buff *skb
 	napi_consume_skb(skb, napi_budget);
 }
 
+static inline void mlx5e_tx_wi_consume_fifo_skbs(struct mlx5e_txqsq *sq,
+						 struct mlx5e_tx_wqe_info *wi,
+						 struct mlx5_cqe64 *cqe,
+						 int napi_budget)
+{
+	int i;
+
+	for (i = 0; i < wi->num_fifo_pkts; i++) {
+		struct sk_buff *skb = mlx5e_skb_fifo_pop(sq);
+
+		mlx5e_consume_skb(sq, skb, cqe, napi_budget);
+	}
+}
+
 bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 {
 	struct mlx5e_sq_stats *stats;
@@ -519,26 +534,33 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 		wqe_counter = be16_to_cpu(cqe->wqe_counter);
 
 		do {
-			struct sk_buff *skb;
-
 			last_wqe = (sqcc == wqe_counter);
 
 			ci = mlx5_wq_cyc_ctr2ix(&sq->wq, sqcc);
 			wi = &sq->db.wqe_info[ci];
-			skb = wi->skb;
 
 			sqcc += wi->num_wqebbs;
 
-			if (unlikely(!skb)) {
-				mlx5e_ktls_tx_try_handle_resync_dump_comp(sq, wi, &dma_fifo_cc);
+			if (likely(wi->skb)) {
+				mlx5e_tx_wi_dma_unmap(sq, wi, &dma_fifo_cc);
+				mlx5e_consume_skb(sq, wi->skb, cqe, napi_budget);
+
+				npkts++;
+				nbytes += wi->num_bytes;
 				continue;
 			}
 
-			mlx5e_tx_wi_dma_unmap(sq, wi, &dma_fifo_cc);
-			mlx5e_consume_skb(sq, wi->skb, cqe, napi_budget);
+			if (unlikely(mlx5e_ktls_tx_try_handle_resync_dump_comp(sq, wi,
+									       &dma_fifo_cc)))
+				continue;
 
-			npkts++;
-			nbytes += wi->num_bytes;
+			if (wi->num_fifo_pkts) {
+				mlx5e_tx_wi_dma_unmap(sq, wi, &dma_fifo_cc);
+				mlx5e_tx_wi_consume_fifo_skbs(sq, wi, cqe, napi_budget);
+
+				npkts += wi->num_fifo_pkts;
+				nbytes += wi->num_bytes;
+			}
 		} while (!last_wqe);
 
 		if (unlikely(get_cqe_opcode(cqe) == MLX5_CQE_REQ_ERR)) {
@@ -577,12 +599,19 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 	return (i == MLX5E_TX_CQ_POLL_BUDGET);
 }
 
+static void mlx5e_tx_wi_kfree_fifo_skbs(struct mlx5e_txqsq *sq, struct mlx5e_tx_wqe_info *wi)
+{
+	int i;
+
+	for (i = 0; i < wi->num_fifo_pkts; i++)
+		dev_kfree_skb_any(mlx5e_skb_fifo_pop(sq));
+}
+
 void mlx5e_free_txqsq_descs(struct mlx5e_txqsq *sq)
 {
 	struct mlx5e_tx_wqe_info *wi;
 	u32 dma_fifo_cc, nbytes = 0;
 	u16 ci, sqcc, npkts = 0;
-	struct sk_buff *skb;
 
 	sqcc = sq->cc;
 	dma_fifo_cc = sq->dma_fifo_cc;
@@ -590,20 +619,28 @@ void mlx5e_free_txqsq_descs(struct mlx5e_txqsq *sq)
 	while (sqcc != sq->pc) {
 		ci = mlx5_wq_cyc_ctr2ix(&sq->wq, sqcc);
 		wi = &sq->db.wqe_info[ci];
-		skb = wi->skb;
 
 		sqcc += wi->num_wqebbs;
 
-		if (!skb) {
-			mlx5e_ktls_tx_try_handle_resync_dump_comp(sq, wi, &dma_fifo_cc);
+		if (likely(wi->skb)) {
+			mlx5e_tx_wi_dma_unmap(sq, wi, &dma_fifo_cc);
+			dev_kfree_skb_any(wi->skb);
+
+			npkts++;
+			nbytes += wi->num_bytes;
 			continue;
 		}
 
-		mlx5e_tx_wi_dma_unmap(sq, wi, &dma_fifo_cc);
-		dev_kfree_skb_any(skb);
+		if (unlikely(mlx5e_ktls_tx_try_handle_resync_dump_comp(sq, wi, &dma_fifo_cc)))
+			continue;
 
-		npkts++;
-		nbytes += wi->num_bytes;
+		if (wi->num_fifo_pkts) {
+			mlx5e_tx_wi_dma_unmap(sq, wi, &dma_fifo_cc);
+			mlx5e_tx_wi_kfree_fifo_skbs(sq, wi);
+
+			npkts += wi->num_fifo_pkts;
+			nbytes += wi->num_bytes;
+		}
 	}
 
 	sq->dma_fifo_cc = dma_fifo_cc;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [net-next 07/10] net/mlx5e: Generalize TX MPWQE checks for full session
  2020-09-03 21:00 [pull request][net-next 00/10] mlx5 Multi packet tx descriptors for SKBs Saeed Mahameed
                   ` (5 preceding siblings ...)
  2020-09-03 21:00 ` [net-next 06/10] net/mlx5e: Support multiple SKBs in a TX WQE Saeed Mahameed
@ 2020-09-03 21:00 ` Saeed Mahameed
  2020-09-03 21:00 ` [net-next 08/10] net/mlx5e: Rename xmit-related structs to generalize them Saeed Mahameed
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Saeed Mahameed @ 2020-09-03 21:00 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski
  Cc: netdev, Maxim Mikityanskiy, Saeed Mahameed

From: Maxim Mikityanskiy <maximmi@mellanox.com>

As preparation for the upcoming TX MPWQE for SKBs, create a function
(mlx5e_tx_mpwqe_is_full) to check whether an MPWQE session is full. This
function will be shared by the MPWQE code for XDP and for SKBs. The
defines are renamed and moved so that they are no longer XDP-specific.
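
For reference, the arithmetic behind the lowered bound (a worked example
based on the values in the comment in the hunk below):

  MLX5_SEND_WQE_MAX_WQEBBS * MLX5_SEND_WQEBB_NUM_DS = 16 * 4 = 64, which
  exceeds 63, the largest value the 6-bit DS field of the Ctrl Segment
  can hold.

  L1_CACHE_BYTES < 128:  MLX5E_TX_MPW_MAX_WQEBBS = 16 - 1 = 15,
                         MLX5E_TX_MPW_MAX_NUM_DS = 15 * 4 = 60
  L1_CACHE_BYTES >= 128: MLX5E_TX_MPW_MAX_WQEBBS = 16 - 2 = 14,
                         MLX5E_TX_MPW_MAX_NUM_DS = 14 * 4 = 56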

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/en/txrx.h  | 18 ++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c   |  2 +-
 .../net/ethernet/mellanox/mlx5/core/en/xdp.h   | 18 ++----------------
 3 files changed, 21 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
index 7baac2971758..09cf4236439e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
@@ -9,6 +9,19 @@
 
 #define MLX5E_TX_WQE_EMPTY_DS_COUNT (sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS)
 
+/* The mult of MLX5_SEND_WQE_MAX_WQEBBS * MLX5_SEND_WQEBB_NUM_DS
+ * (16 * 4 == 64) does not fit in the 6-bit DS field of Ctrl Segment.
+ * We use a bound lower that MLX5_SEND_WQE_MAX_WQEBBS to let a
+ * full-session WQE be cache-aligned.
+ */
+#if L1_CACHE_BYTES < 128
+#define MLX5E_TX_MPW_MAX_WQEBBS (MLX5_SEND_WQE_MAX_WQEBBS - 1)
+#else
+#define MLX5E_TX_MPW_MAX_WQEBBS (MLX5_SEND_WQE_MAX_WQEBBS - 2)
+#endif
+
+#define MLX5E_TX_MPW_MAX_NUM_DS (MLX5E_TX_MPW_MAX_WQEBBS * MLX5_SEND_WQEBB_NUM_DS)
+
 #define INL_HDR_START_SZ (sizeof(((struct mlx5_wqe_eth_seg *)NULL)->inline_hdr.start))
 
 enum mlx5e_icosq_wqe_type {
@@ -285,6 +298,11 @@ static inline void mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq,
 
 void mlx5e_sq_xmit_simple(struct mlx5e_txqsq *sq, struct sk_buff *skb, bool xmit_more);
 
+static inline bool mlx5e_tx_mpwqe_is_full(struct mlx5e_xdp_mpwqe *session)
+{
+	return session->ds_count == MLX5E_TX_MPW_MAX_NUM_DS;
+}
+
 static inline void mlx5e_rqwq_reset(struct mlx5e_rq *rq)
 {
 	if (rq->wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ) {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 81cd9a04bcb0..0edd4ebeb90c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -199,7 +199,7 @@ static void mlx5e_xdp_mpwqe_session_start(struct mlx5e_xdpsq *sq)
 	struct mlx5e_tx_wqe *wqe;
 	u16 pi;
 
-	pi = mlx5e_xdpsq_get_next_pi(sq, MLX5E_XDP_MPW_MAX_WQEBBS);
+	pi = mlx5e_xdpsq_get_next_pi(sq, MLX5E_TX_MPW_MAX_WQEBBS);
 	wqe = MLX5E_TX_FETCH_WQE(sq, pi);
 	net_prefetchw(session->wqe->data);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
index 96d6b1553bab..0dc38acab5a8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
@@ -45,20 +45,6 @@
 	(MLX5E_XDP_INLINE_WQE_MAX_DS_CNT * MLX5_SEND_WQE_DS - \
 	 sizeof(struct mlx5_wqe_inline_seg))
 
-/* The mult of MLX5_SEND_WQE_MAX_WQEBBS * MLX5_SEND_WQEBB_NUM_DS
- * (16 * 4 == 64) does not fit in the 6-bit DS field of Ctrl Segment.
- * We use a bound lower that MLX5_SEND_WQE_MAX_WQEBBS to let a
- * full-session WQE be cache-aligned.
- */
-#if L1_CACHE_BYTES < 128
-#define MLX5E_XDP_MPW_MAX_WQEBBS (MLX5_SEND_WQE_MAX_WQEBBS - 1)
-#else
-#define MLX5E_XDP_MPW_MAX_WQEBBS (MLX5_SEND_WQE_MAX_WQEBBS - 2)
-#endif
-
-#define MLX5E_XDP_MPW_MAX_NUM_DS \
-	(MLX5E_XDP_MPW_MAX_WQEBBS * MLX5_SEND_WQEBB_NUM_DS)
-
 struct mlx5e_xsk_param;
 int mlx5e_xdp_max_mtu(struct mlx5e_params *params, struct mlx5e_xsk_param *xsk);
 bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct mlx5e_dma_info *di,
@@ -141,8 +127,8 @@ static inline bool mlx5e_xdp_mpqwe_is_full(struct mlx5e_xdp_mpwqe *session)
 {
 	if (session->inline_on)
 		return session->ds_count + MLX5E_XDP_INLINE_WQE_MAX_DS_CNT >
-		       MLX5E_XDP_MPW_MAX_NUM_DS;
-	return session->ds_count == MLX5E_XDP_MPW_MAX_NUM_DS;
+		       MLX5E_TX_MPW_MAX_NUM_DS;
+	return mlx5e_tx_mpwqe_is_full(session);
 }
 
 struct mlx5e_xdp_wqe_info {
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [net-next 08/10] net/mlx5e: Rename xmit-related structs to generalize them
  2020-09-03 21:00 [pull request][net-next 00/10] mlx5 Multi packet tx descriptors for SKBs Saeed Mahameed
                   ` (6 preceding siblings ...)
  2020-09-03 21:00 ` [net-next 07/10] net/mlx5e: Generalize TX MPWQE checks for full session Saeed Mahameed
@ 2020-09-03 21:00 ` Saeed Mahameed
  2020-09-03 21:00 ` [net-next 09/10] net/mlx5e: Move TX code into functions to be used by MPWQE Saeed Mahameed
  2020-09-03 21:00 ` [net-next 10/10] net/mlx5e: Enhanced TX MPWQE for SKBs Saeed Mahameed
  9 siblings, 0 replies; 23+ messages in thread
From: Saeed Mahameed @ 2020-09-03 21:00 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski
  Cc: netdev, Maxim Mikityanskiy, Tariq Toukan, Saeed Mahameed

From: Maxim Mikityanskiy <maximmi@mellanox.com>

As preparation for the upcoming TX MPWQE support for SKBs, rename struct
mlx5e_xdp_mpwqe to mlx5e_tx_mpwqe and move it above struct mlx5e_txqsq.
This structure will be reused in the regular SQ and in the regular TX
data path. Also rename mlx5e_xdp_xmit_data to mlx5e_xmit_data - it will
be used in the upcoming TX MPWQE flow.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  | 22 +++++++++----------
 .../net/ethernet/mellanox/mlx5/core/en/txrx.h |  2 +-
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 16 +++++++-------
 .../net/ethernet/mellanox/mlx5/core/en/xdp.h  | 10 ++++-----
 .../ethernet/mellanox/mlx5/core/en/xsk/tx.c   |  2 +-
 5 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 6ab60074fca9..3511836f0f4a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -312,6 +312,14 @@ enum {
 	MLX5E_SQ_STATE_PENDING_XSK_TX,
 };
 
+struct mlx5e_tx_mpwqe {
+	/* Current MPWQE session */
+	struct mlx5e_tx_wqe *wqe;
+	u8 ds_count;
+	u8 pkt_count;
+	u8 inline_on;
+};
+
 struct mlx5e_txqsq {
 	/* data path */
 
@@ -402,7 +410,7 @@ struct mlx5e_xdp_info {
 	};
 };
 
-struct mlx5e_xdp_xmit_data {
+struct mlx5e_xmit_data {
 	dma_addr_t  dma_addr;
 	void       *data;
 	u32         len;
@@ -415,18 +423,10 @@ struct mlx5e_xdp_info_fifo {
 	u32 mask;
 };
 
-struct mlx5e_xdp_mpwqe {
-	/* Current MPWQE session */
-	struct mlx5e_tx_wqe *wqe;
-	u8                   ds_count;
-	u8                   pkt_count;
-	u8                   inline_on;
-};
-
 struct mlx5e_xdpsq;
 typedef int (*mlx5e_fp_xmit_xdp_frame_check)(struct mlx5e_xdpsq *);
 typedef bool (*mlx5e_fp_xmit_xdp_frame)(struct mlx5e_xdpsq *,
-					struct mlx5e_xdp_xmit_data *,
+					struct mlx5e_xmit_data *,
 					struct mlx5e_xdp_info *,
 					int);
 
@@ -441,7 +441,7 @@ struct mlx5e_xdpsq {
 	u32                        xdpi_fifo_pc ____cacheline_aligned_in_smp;
 	u16                        pc;
 	struct mlx5_wqe_ctrl_seg   *doorbell_cseg;
-	struct mlx5e_xdp_mpwqe     mpwqe;
+	struct mlx5e_tx_mpwqe      mpwqe;
 
 	struct mlx5e_cq            cq;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
index 09cf4236439e..1ac4607fba08 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
@@ -298,7 +298,7 @@ static inline void mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq,
 
 void mlx5e_sq_xmit_simple(struct mlx5e_txqsq *sq, struct sk_buff *skb, bool xmit_more);
 
-static inline bool mlx5e_tx_mpwqe_is_full(struct mlx5e_xdp_mpwqe *session)
+static inline bool mlx5e_tx_mpwqe_is_full(struct mlx5e_tx_mpwqe *session)
 {
 	return session->ds_count == MLX5E_TX_MPW_MAX_NUM_DS;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 0edd4ebeb90c..adacc4f9a3bf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -59,7 +59,7 @@ static inline bool
 mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq,
 		    struct mlx5e_dma_info *di, struct xdp_buff *xdp)
 {
-	struct mlx5e_xdp_xmit_data xdptxd;
+	struct mlx5e_xmit_data xdptxd;
 	struct mlx5e_xdp_info xdpi;
 	struct xdp_frame *xdpf;
 	dma_addr_t dma_addr;
@@ -194,7 +194,7 @@ static u16 mlx5e_xdpsq_get_next_pi(struct mlx5e_xdpsq *sq, u16 size)
 
 static void mlx5e_xdp_mpwqe_session_start(struct mlx5e_xdpsq *sq)
 {
-	struct mlx5e_xdp_mpwqe *session = &sq->mpwqe;
+	struct mlx5e_tx_mpwqe *session = &sq->mpwqe;
 	struct mlx5e_xdpsq_stats *stats = sq->stats;
 	struct mlx5e_tx_wqe *wqe;
 	u16 pi;
@@ -203,7 +203,7 @@ static void mlx5e_xdp_mpwqe_session_start(struct mlx5e_xdpsq *sq)
 	wqe = MLX5E_TX_FETCH_WQE(sq, pi);
 	net_prefetchw(session->wqe->data);
 
-	*session = (struct mlx5e_xdp_mpwqe) {
+	*session = (struct mlx5e_tx_mpwqe) {
 		.wqe = wqe,
 		.ds_count = MLX5E_TX_WQE_EMPTY_DS_COUNT,
 		.pkt_count = 0,
@@ -216,7 +216,7 @@ static void mlx5e_xdp_mpwqe_session_start(struct mlx5e_xdpsq *sq)
 void mlx5e_xdp_mpwqe_complete(struct mlx5e_xdpsq *sq)
 {
 	struct mlx5_wq_cyc       *wq    = &sq->wq;
-	struct mlx5e_xdp_mpwqe *session = &sq->mpwqe;
+	struct mlx5e_tx_mpwqe *session = &sq->mpwqe;
 	struct mlx5_wqe_ctrl_seg *cseg = &session->wqe->ctrl;
 	u16 ds_count = session->ds_count;
 	u16 pi = mlx5_wq_cyc_ctr2ix(wq, sq->pc);
@@ -261,10 +261,10 @@ INDIRECT_CALLABLE_SCOPE int mlx5e_xmit_xdp_frame_check_mpwqe(struct mlx5e_xdpsq
 }
 
 INDIRECT_CALLABLE_SCOPE bool
-mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq, struct mlx5e_xdp_xmit_data *xdptxd,
+mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
 			   struct mlx5e_xdp_info *xdpi, int check_result)
 {
-	struct mlx5e_xdp_mpwqe *session = &sq->mpwqe;
+	struct mlx5e_tx_mpwqe *session = &sq->mpwqe;
 	struct mlx5e_xdpsq_stats *stats = sq->stats;
 
 	if (unlikely(xdptxd->len > sq->hw_mtu)) {
@@ -308,7 +308,7 @@ INDIRECT_CALLABLE_SCOPE int mlx5e_xmit_xdp_frame_check(struct mlx5e_xdpsq *sq)
 }
 
 INDIRECT_CALLABLE_SCOPE bool
-mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xdp_xmit_data *xdptxd,
+mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
 		     struct mlx5e_xdp_info *xdpi, int check_result)
 {
 	struct mlx5_wq_cyc       *wq   = &sq->wq;
@@ -505,7 +505,7 @@ int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
 
 	for (i = 0; i < n; i++) {
 		struct xdp_frame *xdpf = frames[i];
-		struct mlx5e_xdp_xmit_data xdptxd;
+		struct mlx5e_xmit_data xdptxd;
 		struct mlx5e_xdp_info xdpi;
 		bool ret;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
index 0dc38acab5a8..4bd8af478a4a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
@@ -58,11 +58,11 @@ int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
 		   u32 flags);
 
 INDIRECT_CALLABLE_DECLARE(bool mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq,
-							  struct mlx5e_xdp_xmit_data *xdptxd,
+							  struct mlx5e_xmit_data *xdptxd,
 							  struct mlx5e_xdp_info *xdpi,
 							  int check_result));
 INDIRECT_CALLABLE_DECLARE(bool mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq,
-						    struct mlx5e_xdp_xmit_data *xdptxd,
+						    struct mlx5e_xmit_data *xdptxd,
 						    struct mlx5e_xdp_info *xdpi,
 						    int check_result));
 INDIRECT_CALLABLE_DECLARE(int mlx5e_xmit_xdp_frame_check_mpwqe(struct mlx5e_xdpsq *sq));
@@ -123,7 +123,7 @@ static inline bool mlx5e_xdp_get_inline_state(struct mlx5e_xdpsq *sq, bool cur)
 	return cur;
 }
 
-static inline bool mlx5e_xdp_mpqwe_is_full(struct mlx5e_xdp_mpwqe *session)
+static inline bool mlx5e_xdp_mpqwe_is_full(struct mlx5e_tx_mpwqe *session)
 {
 	if (session->inline_on)
 		return session->ds_count + MLX5E_XDP_INLINE_WQE_MAX_DS_CNT >
@@ -138,10 +138,10 @@ struct mlx5e_xdp_wqe_info {
 
 static inline void
 mlx5e_xdp_mpwqe_add_dseg(struct mlx5e_xdpsq *sq,
-			 struct mlx5e_xdp_xmit_data *xdptxd,
+			 struct mlx5e_xmit_data *xdptxd,
 			 struct mlx5e_xdpsq_stats *stats)
 {
-	struct mlx5e_xdp_mpwqe *session = &sq->mpwqe;
+	struct mlx5e_tx_mpwqe *session = &sq->mpwqe;
 	struct mlx5_wqe_data_seg *dseg =
 		(struct mlx5_wqe_data_seg *)session->wqe + session->ds_count;
 	u32 dma_len = xdptxd->len;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
index aa91cbdfe969..fb671a457129 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
@@ -67,8 +67,8 @@ static void mlx5e_xsk_tx_post_err(struct mlx5e_xdpsq *sq,
 bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget)
 {
 	struct xsk_buff_pool *pool = sq->xsk_pool;
+	struct mlx5e_xmit_data xdptxd;
 	struct mlx5e_xdp_info xdpi;
-	struct mlx5e_xdp_xmit_data xdptxd;
 	bool work_done = true;
 	bool flush = false;
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [net-next 09/10] net/mlx5e: Move TX code into functions to be used by MPWQE
  2020-09-03 21:00 [pull request][net-next 00/10] mlx5 Multi packet tx descriptors for SKBs Saeed Mahameed
                   ` (7 preceding siblings ...)
  2020-09-03 21:00 ` [net-next 08/10] net/mlx5e: Rename xmit-related structs to generalize them Saeed Mahameed
@ 2020-09-03 21:00 ` Saeed Mahameed
  2020-09-04 15:06   ` Willem de Bruijn
  2020-09-03 21:00 ` [net-next 10/10] net/mlx5e: Enhanced TX MPWQE for SKBs Saeed Mahameed
  9 siblings, 1 reply; 23+ messages in thread
From: Saeed Mahameed @ 2020-09-03 21:00 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski
  Cc: netdev, Maxim Mikityanskiy, Saeed Mahameed

From: Maxim Mikityanskiy <maximmi@mellanox.com>

mlx5e_txwqe_complete performs some actions that can be moved into
separate functions:

1. Update the flags needed for hardware timestamping.

2. Stop the TX queue if it's full.

Move these actions into separate functions so they can be reused by the
MPWQE code in the following commit and to keep each function's
responsibilities clear.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/en_tx.c   | 23 ++++++++++++++-----
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 9ced350150b3..3b68c8333875 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -311,6 +311,20 @@ static inline void mlx5e_sq_calc_wqe_attr(struct sk_buff *skb,
 	};
 }
 
+static inline void mlx5e_tx_skb_update_hwts_flags(struct sk_buff *skb)
+{
+	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP))
+		skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
+}
+
+static inline void mlx5e_tx_check_stop(struct mlx5e_txqsq *sq)
+{
+	if (unlikely(!mlx5e_wqc_has_room_for(&sq->wq, sq->cc, sq->pc, sq->stop_room))) {
+		netif_tx_stop_queue(sq->txq);
+		sq->stats->stopped++;
+	}
+}
+
 static inline void
 mlx5e_txwqe_complete(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 		     const struct mlx5e_tx_attr *attr,
@@ -332,14 +346,11 @@ mlx5e_txwqe_complete(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 	cseg->opmod_idx_opcode = cpu_to_be32((sq->pc << 8) | attr->opcode);
 	cseg->qpn_ds           = cpu_to_be32((sq->sqn << 8) | wqe_attr->ds_cnt);
 
-	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP))
-		skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
+	mlx5e_tx_skb_update_hwts_flags(skb);
 
 	sq->pc += wi->num_wqebbs;
-	if (unlikely(!mlx5e_wqc_has_room_for(wq, sq->cc, sq->pc, sq->stop_room))) {
-		netif_tx_stop_queue(sq->txq);
-		sq->stats->stopped++;
-	}
+
+	mlx5e_tx_check_stop(sq);
 
 	send_doorbell = __netdev_tx_sent_queue(sq->txq, attr->num_bytes, xmit_more);
 	if (send_doorbell)
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [net-next 10/10] net/mlx5e: Enhanced TX MPWQE for SKBs
  2020-09-03 21:00 [pull request][net-next 00/10] mlx5 Multi packet tx descriptors for SKBs Saeed Mahameed
                   ` (8 preceding siblings ...)
  2020-09-03 21:00 ` [net-next 09/10] net/mlx5e: Move TX code into functions to be used by MPWQE Saeed Mahameed
@ 2020-09-03 21:00 ` Saeed Mahameed
  9 siblings, 0 replies; 23+ messages in thread
From: Saeed Mahameed @ 2020-09-03 21:00 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski
  Cc: netdev, Maxim Mikityanskiy, Saeed Mahameed

From: Maxim Mikityanskiy <maximmi@mellanox.com>

This commit adds support for the Enhanced TX MPWQE feature in the regular
(SKB) data path. An MPWQE (multi-packet work queue element) can serve
multiple packets, reducing the PCI bandwidth spent on control traffic.

Two new stats (tx*_mpwqe_blks and tx*_mpwqe_pkts) are added. The feature
is on by default and controlled by the skb_tx_mpwqe private flag.
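
For example, the private flag and the new counters can be inspected with
standard ethtool commands (the interface name below is just a placeholder):

  ethtool --show-priv-flags eth0 | grep mpwqe
  ethtool --set-priv-flags eth0 skb_tx_mpwqe off
  ethtool -S eth0 | grep mpwqe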

In an MPWQE, the eseg is shared among all packets, so eseg-based offloads
(IPSEC, GENEVE, checksum) run on a separate eseg that is compared to the
eseg of the current MPWQE session to decide whether the new packet can be
added to the same session.
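
A minimal sketch of that comparison (the actual helper is
mlx5e_tx_mpwqe_same_eseg in the en_tx.c hunk below; MLX5E_ACCEL_ESEG_LEN
covers the offload-touched part of the eseg, i.e. the bytes before the
mss field):

	/* the session can absorb the packet only if the offload-written
	 * prefix of its eseg matches the one built for this packet
	 */
	return !memcmp(&sq->mpwqe.wqe->eth, eseg,
		       offsetof(struct mlx5_wqe_eth_seg, mss));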

MPWQE is not compatible with certain offloads and features, such as TLS
offload, TSO, nonlinear SKBs. If such incompatible features are in use,
the driver gracefully falls back to non-MPWQE.
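
Roughly, the dispatch added to mlx5e_xmit looks like this (an abridged
sketch of the en_tx.c hunk below, with the eseg setup and error handling
omitted):

	if (test_bit(MLX5E_SQ_STATE_MPWQE, &sq->state)) {
		if (mlx5e_tx_skb_supports_mpwqe(skb, &attr)) {
			/* linear SKB, no VLAN insertion, no inline headers:
			 * add the packet to the MPWQE session
			 */
			mlx5e_sq_xmit_mpwqe(sq, skb, &eseg, netdev_xmit_more());
			return NETDEV_TX_OK;
		}
		/* incompatible packet: close any open session and fall back
		 * to the regular single-packet WQE path
		 */
		mlx5e_tx_mpwqe_ensure_complete(sq);
	}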

This change has no performance impact in TCP single stream test and
XDP_TX single stream test.

UDP pktgen, 64-byte packets, single stream, MPWQE off:
  Packet rate: 19.12 Mpps -> 20.02 Mpps
  Instructions per packet: 354 -> 347
  Cycles per packet: 140 -> 129

UDP pktgen, 64-byte packets, single stream, MPWQE on:
  Packet rate: 19.12 Mpps -> 20.67 Mpps
  Instructions per packet: 354 -> 335
  Cycles per packet: 140 -> 124

Enabling MPWQE can reduce PCI bandwidth:
  PCI Gen2, pktgen at fixed rate of 36864000 pps on 24 CPU cores:
    Inbound PCI utilization with MPWQE off: 81.3%
    Inbound PCI utilization with MPWQE on: 59.3%
  PCI Gen3, pktgen at fixed rate of 56064005 pps on 24 CPU cores:
    Inbound PCI utilization with MPWQE off: 65.8%
    Inbound PCI utilization with MPWQE on: 49.2%

Enabling MPWQE can also reduce CPU load, increasing the packet rate in
case of CPU bottleneck:
  PCI Gen2, pktgen at full rate on 24 CPU cores:
    Packet rate with MPWQE off: 37.4 Mpps
    Packet rate with MPWQE on: 49.1 Mpps
  PCI Gen3, pktgen at full rate on 24 CPU cores:
    Packet rate with MPWQE off: 56.2 Mpps
    Packet rate with MPWQE on: 67.0 Mpps

Burst size in all pktgen tests is 32.

CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (x86_64)
NIC: Mellanox ConnectX-6 Dx

To avoid performance degradation when MPWQE is off, function inlining was
tuned manually. It's especially important to keep mlx5e_sq_xmit_mpwqe
noinline; otherwise gcc inlines it automatically, which bloats mlx5e_xmit
and slows it down.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |   4 +
 .../net/ethernet/mellanox/mlx5/core/en/txrx.h |   1 +
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |   1 +
 .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |   1 +
 .../mellanox/mlx5/core/en_accel/en_accel.h    |  29 +--
 .../mellanox/mlx5/core/en_accel/tls_rxtx.c    |   2 +
 .../ethernet/mellanox/mlx5/core/en_ethtool.c  |  15 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  11 ++
 .../ethernet/mellanox/mlx5/core/en_stats.c    |   6 +
 .../ethernet/mellanox/mlx5/core/en_stats.h    |   4 +
 .../net/ethernet/mellanox/mlx5/core/en_tx.c   | 184 +++++++++++++++++-
 11 files changed, 240 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 3511836f0f4a..2abb0857ede0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -221,6 +221,7 @@ enum mlx5e_priv_flag {
 	MLX5E_PFLAG_RX_STRIDING_RQ,
 	MLX5E_PFLAG_RX_NO_CSUM_COMPLETE,
 	MLX5E_PFLAG_XDP_TX_MPWQE,
+	MLX5E_PFLAG_SKB_TX_MPWQE,
 	MLX5E_NUM_PFLAGS, /* Keep last */
 };
 
@@ -304,6 +305,7 @@ struct mlx5e_sq_dma {
 
 enum {
 	MLX5E_SQ_STATE_ENABLED,
+	MLX5E_SQ_STATE_MPWQE,
 	MLX5E_SQ_STATE_RECOVERING,
 	MLX5E_SQ_STATE_IPSEC,
 	MLX5E_SQ_STATE_AM,
@@ -315,6 +317,7 @@ enum {
 struct mlx5e_tx_mpwqe {
 	/* Current MPWQE session */
 	struct mlx5e_tx_wqe *wqe;
+	u32 bytes_count;
 	u8 ds_count;
 	u8 pkt_count;
 	u8 inline_on;
@@ -333,6 +336,7 @@ struct mlx5e_txqsq {
 	u16                        pc ____cacheline_aligned_in_smp;
 	u16                        skb_fifo_pc;
 	u32                        dma_fifo_pc;
+	struct mlx5e_tx_mpwqe      mpwqe;
 
 	struct mlx5e_cq            cq;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
index 1ac4607fba08..749881987094 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
@@ -297,6 +297,7 @@ static inline void mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq,
 }
 
 void mlx5e_sq_xmit_simple(struct mlx5e_txqsq *sq, struct sk_buff *skb, bool xmit_more);
+void mlx5e_tx_mpwqe_ensure_complete(struct mlx5e_txqsq *sq);
 
 static inline bool mlx5e_tx_mpwqe_is_full(struct mlx5e_tx_mpwqe *session)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index adacc4f9a3bf..307eb64889c0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -205,6 +205,7 @@ static void mlx5e_xdp_mpwqe_session_start(struct mlx5e_xdpsq *sq)
 
 	*session = (struct mlx5e_tx_mpwqe) {
 		.wqe = wqe,
+		.bytes_count = 0,
 		.ds_count = MLX5E_TX_WQE_EMPTY_DS_COUNT,
 		.pkt_count = 0,
 		.inline_on = mlx5e_xdp_get_inline_state(sq, session->inline_on),
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
index 4bd8af478a4a..d487e5e37162 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
@@ -147,6 +147,7 @@ mlx5e_xdp_mpwqe_add_dseg(struct mlx5e_xdpsq *sq,
 	u32 dma_len = xdptxd->len;
 
 	session->pkt_count++;
+	session->bytes_count += dma_len;
 
 	if (session->inline_on && dma_len <= MLX5E_XDP_INLINE_WQE_SZ_THRSD) {
 		struct mlx5_wqe_inline_seg *inline_dseg =
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
index 23d4ef5ab9c5..2ea1cdc1ca54 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
@@ -128,31 +128,38 @@ static inline bool mlx5e_accel_tx_begin(struct net_device *dev,
 	return true;
 }
 
-static inline bool mlx5e_accel_tx_finish(struct mlx5e_priv *priv,
-					 struct mlx5e_txqsq *sq,
-					 struct sk_buff *skb,
-					 struct mlx5e_tx_wqe *wqe,
-					 struct mlx5e_accel_tx_state *state)
-{
-#ifdef CONFIG_MLX5_EN_TLS
-	mlx5e_tls_handle_tx_wqe(sq, &wqe->ctrl, &state->tls);
-#endif
+/* Part of the eseg touched by TX offloads */
+#define MLX5E_ACCEL_ESEG_LEN offsetof(struct mlx5_wqe_eth_seg, mss)
 
+static inline bool mlx5e_accel_tx_eseg(struct mlx5e_priv *priv,
+				       struct mlx5e_txqsq *sq,
+				       struct sk_buff *skb,
+				       struct mlx5_wqe_eth_seg *eseg)
+{
 #ifdef CONFIG_MLX5_EN_IPSEC
 	if (test_bit(MLX5E_SQ_STATE_IPSEC, &sq->state)) {
-		if (unlikely(!mlx5e_ipsec_handle_tx_skb(priv, &wqe->eth, skb)))
+		if (unlikely(!mlx5e_ipsec_handle_tx_skb(priv, eseg, skb)))
 			return false;
 	}
 #endif
 
 #if IS_ENABLED(CONFIG_GENEVE)
 	if (skb->encapsulation)
-		mlx5e_tx_tunnel_accel(skb, &wqe->eth);
+		mlx5e_tx_tunnel_accel(skb, eseg);
 #endif
 
 	return true;
 }
 
+static inline void mlx5e_accel_tx_finish(struct mlx5e_txqsq *sq,
+					 struct mlx5e_tx_wqe *wqe,
+					 struct mlx5e_accel_tx_state *state)
+{
+#ifdef CONFIG_MLX5_EN_TLS
+	mlx5e_tls_handle_tx_wqe(sq, &wqe->ctrl, &state->tls);
+#endif
+}
+
 static inline int mlx5e_accel_init_rx(struct mlx5e_priv *priv)
 {
 	return mlx5e_ktls_init_rx(priv);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
index c36560b3e93d..6982b193ee8a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
@@ -270,6 +270,8 @@ bool mlx5e_tls_handle_tx_skb(struct net_device *netdev, struct mlx5e_txqsq *sq,
 	if (!datalen)
 		return true;
 
+	mlx5e_tx_mpwqe_ensure_complete(sq);
+
 	tls_ctx = tls_get_ctx(skb->sk);
 	if (WARN_ON_ONCE(tls_ctx->netdev != netdev))
 		goto err_out;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 5cb1e4839eb7..2c34bb57048c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -1901,7 +1901,7 @@ static int set_pflag_rx_no_csum_complete(struct net_device *netdev, bool enable)
 	return 0;
 }
 
-static int set_pflag_xdp_tx_mpwqe(struct net_device *netdev, bool enable)
+static int set_pflag_tx_mpwqe_common(struct net_device *netdev, u32 flag, bool enable)
 {
 	struct mlx5e_priv *priv = netdev_priv(netdev);
 	struct mlx5_core_dev *mdev = priv->mdev;
@@ -1913,7 +1913,7 @@ static int set_pflag_xdp_tx_mpwqe(struct net_device *netdev, bool enable)
 
 	new_channels.params = priv->channels.params;
 
-	MLX5E_SET_PFLAG(&new_channels.params, MLX5E_PFLAG_XDP_TX_MPWQE, enable);
+	MLX5E_SET_PFLAG(&new_channels.params, flag, enable);
 
 	if (!test_bit(MLX5E_STATE_OPENED, &priv->state)) {
 		priv->channels.params = new_channels.params;
@@ -1924,6 +1924,16 @@ static int set_pflag_xdp_tx_mpwqe(struct net_device *netdev, bool enable)
 	return err;
 }
 
+static int set_pflag_xdp_tx_mpwqe(struct net_device *netdev, bool enable)
+{
+	return set_pflag_tx_mpwqe_common(netdev, MLX5E_PFLAG_XDP_TX_MPWQE, enable);
+}
+
+static int set_pflag_skb_tx_mpwqe(struct net_device *netdev, bool enable)
+{
+	return set_pflag_tx_mpwqe_common(netdev, MLX5E_PFLAG_SKB_TX_MPWQE, enable);
+}
+
 static const struct pflag_desc mlx5e_priv_flags[MLX5E_NUM_PFLAGS] = {
 	{ "rx_cqe_moder",        set_pflag_rx_cqe_based_moder },
 	{ "tx_cqe_moder",        set_pflag_tx_cqe_based_moder },
@@ -1931,6 +1941,7 @@ static const struct pflag_desc mlx5e_priv_flags[MLX5E_NUM_PFLAGS] = {
 	{ "rx_striding_rq",      set_pflag_rx_striding_rq },
 	{ "rx_no_csum_complete", set_pflag_rx_no_csum_complete },
 	{ "xdp_tx_mpwqe",        set_pflag_xdp_tx_mpwqe },
+	{ "skb_tx_mpwqe",        set_pflag_skb_tx_mpwqe },
 };
 
 static int mlx5e_handle_pflag(struct net_device *netdev,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index b413aa168e4e..f8ad4a724a63 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -1075,6 +1075,12 @@ static int mlx5e_calc_sq_stop_room(struct mlx5e_txqsq *sq, u8 log_sq_size)
 
 	sq->stop_room  = mlx5e_tls_get_stop_room(sq);
 	sq->stop_room += mlx5e_stop_room_for_wqe(MLX5_SEND_WQE_MAX_WQEBBS);
+	if (test_bit(MLX5E_SQ_STATE_MPWQE, &sq->state))
+		/* A MPWQE can take up to the maximum-sized WQE + all the normal
+		 * stop room can be taken if a new packet breaks the active
+		 * MPWQE session and allocates its WQEs right away.
+		 */
+		sq->stop_room += mlx5e_stop_room_for_wqe(MLX5_SEND_WQE_MAX_WQEBBS);
 
 	if (WARN_ON(sq->stop_room >= sq_size)) {
 		netdev_err(sq->channel->netdev, "Stop room %hu is bigger than the SQ size %d\n",
@@ -1116,6 +1122,8 @@ static int mlx5e_alloc_txqsq(struct mlx5e_channel *c,
 		set_bit(MLX5E_SQ_STATE_IPSEC, &sq->state);
 	if (mlx5_accel_is_tls_device(c->priv->mdev))
 		set_bit(MLX5E_SQ_STATE_TLS, &sq->state);
+	if (param->is_mpw)
+		set_bit(MLX5E_SQ_STATE_MPWQE, &sq->state);
 	err = mlx5e_calc_sq_stop_room(sq, params->log_sq_size);
 	if (err)
 		return err;
@@ -2168,6 +2176,7 @@ static void mlx5e_build_sq_param(struct mlx5e_priv *priv,
 	mlx5e_build_sq_param_common(priv, param);
 	MLX5_SET(wq, wq, log_wq_sz, params->log_sq_size);
 	MLX5_SET(sqc, sqc, allow_swp, allow_swp);
+	param->is_mpw = MLX5E_GET_PFLAG(params, MLX5E_PFLAG_SKB_TX_MPWQE);
 	mlx5e_build_tx_cq_param(priv, params, &param->cqp);
 }
 
@@ -4721,6 +4730,8 @@ void mlx5e_build_nic_params(struct mlx5e_priv *priv,
 	params->log_sq_size = is_kdump_kernel() ?
 		MLX5E_PARAMS_MINIMUM_LOG_SQ_SIZE :
 		MLX5E_PARAMS_DEFAULT_LOG_SQ_SIZE;
+	MLX5E_SET_PFLAG(params, MLX5E_PFLAG_SKB_TX_MPWQE,
+			MLX5_CAP_ETH(mdev, enhanced_multi_pkt_send_wqe));
 
 	/* XDP SQ */
 	MLX5E_SET_PFLAG(params, MLX5E_PFLAG_XDP_TX_MPWQE,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index e3b2f59408e6..20d7815ffbf4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -98,6 +98,8 @@ static const struct counter_desc sw_stats_desc[] = {
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_tso_inner_bytes) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_added_vlan_packets) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_nop) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_mpwqe_blks) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_mpwqe_pkts) },
 
 #ifdef CONFIG_MLX5_EN_TLS
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_tls_encrypted_packets) },
@@ -353,6 +355,8 @@ static MLX5E_DECLARE_STATS_GRP_OP_UPDATE_STATS(sw)
 			s->tx_tso_inner_bytes	+= sq_stats->tso_inner_bytes;
 			s->tx_added_vlan_packets += sq_stats->added_vlan_packets;
 			s->tx_nop               += sq_stats->nop;
+			s->tx_mpwqe_blks        += sq_stats->mpwqe_blks;
+			s->tx_mpwqe_pkts        += sq_stats->mpwqe_pkts;
 			s->tx_queue_stopped	+= sq_stats->stopped;
 			s->tx_queue_wake	+= sq_stats->wake;
 			s->tx_queue_dropped	+= sq_stats->dropped;
@@ -1527,6 +1531,8 @@ static const struct counter_desc sq_stats_desc[] = {
 	{ MLX5E_DECLARE_TX_STAT(struct mlx5e_sq_stats, csum_partial_inner) },
 	{ MLX5E_DECLARE_TX_STAT(struct mlx5e_sq_stats, added_vlan_packets) },
 	{ MLX5E_DECLARE_TX_STAT(struct mlx5e_sq_stats, nop) },
+	{ MLX5E_DECLARE_TX_STAT(struct mlx5e_sq_stats, mpwqe_blks) },
+	{ MLX5E_DECLARE_TX_STAT(struct mlx5e_sq_stats, mpwqe_pkts) },
 #ifdef CONFIG_MLX5_EN_TLS
 	{ MLX5E_DECLARE_TX_STAT(struct mlx5e_sq_stats, tls_encrypted_packets) },
 	{ MLX5E_DECLARE_TX_STAT(struct mlx5e_sq_stats, tls_encrypted_bytes) },
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
index 2e1cca1923b9..fd198965ba82 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
@@ -117,6 +117,8 @@ struct mlx5e_sw_stats {
 	u64 tx_tso_inner_bytes;
 	u64 tx_added_vlan_packets;
 	u64 tx_nop;
+	u64 tx_mpwqe_blks;
+	u64 tx_mpwqe_pkts;
 	u64 rx_lro_packets;
 	u64 rx_lro_bytes;
 	u64 rx_ecn_mark;
@@ -345,6 +347,8 @@ struct mlx5e_sq_stats {
 	u64 csum_partial_inner;
 	u64 added_vlan_packets;
 	u64 nop;
+	u64 mpwqe_blks;
+	u64 mpwqe_pkts;
 #ifdef CONFIG_MLX5_EN_TLS
 	u64 tls_encrypted_packets;
 	u64 tls_encrypted_bytes;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 3b68c8333875..d8f1acca37f8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -412,6 +412,166 @@ mlx5e_sq_xmit_wqe(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 	dev_kfree_skb_any(skb);
 }
 
+static inline bool mlx5e_tx_skb_supports_mpwqe(struct sk_buff *skb, struct mlx5e_tx_attr *attr)
+{
+	return !skb_is_nonlinear(skb) && !skb_vlan_tag_present(skb) && !attr->ihs;
+}
+
+static inline bool mlx5e_tx_mpwqe_same_eseg(struct mlx5e_txqsq *sq, struct mlx5_wqe_eth_seg *eseg)
+{
+	struct mlx5e_tx_mpwqe *session = &sq->mpwqe;
+
+	/* Assumes the session is already running and has at least one packet. */
+	return !memcmp(&session->wqe->eth, eseg, MLX5E_ACCEL_ESEG_LEN);
+}
+
+static void mlx5e_tx_mpwqe_session_start(struct mlx5e_txqsq *sq,
+					 struct mlx5_wqe_eth_seg *eseg)
+{
+	struct mlx5e_tx_mpwqe *session = &sq->mpwqe;
+	struct mlx5e_tx_wqe *wqe;
+	u16 pi;
+
+	pi = mlx5e_txqsq_get_next_pi(sq, MLX5E_TX_MPW_MAX_WQEBBS);
+	wqe = MLX5E_TX_FETCH_WQE(sq, pi);
+	prefetchw(wqe->data);
+
+	*session = (struct mlx5e_tx_mpwqe) {
+		.wqe = wqe,
+		.bytes_count = 0,
+		.ds_count = MLX5E_TX_WQE_EMPTY_DS_COUNT,
+		.pkt_count = 0,
+		.inline_on = 0,
+	};
+
+	memcpy(&session->wqe->eth, eseg, MLX5E_ACCEL_ESEG_LEN);
+
+	sq->stats->mpwqe_blks++;
+}
+
+static inline bool mlx5e_tx_mpwqe_session_is_active(struct mlx5e_txqsq *sq)
+{
+	return sq->mpwqe.wqe;
+}
+
+static inline void mlx5e_tx_mpwqe_add_dseg(struct mlx5e_txqsq *sq, struct mlx5e_xmit_data *txd)
+{
+	struct mlx5e_tx_mpwqe *session = &sq->mpwqe;
+	struct mlx5_wqe_data_seg *dseg;
+
+	dseg = (struct mlx5_wqe_data_seg *)session->wqe + session->ds_count;
+
+	session->pkt_count++;
+	session->bytes_count += txd->len;
+
+	dseg->addr = cpu_to_be64(txd->dma_addr);
+	dseg->byte_count = cpu_to_be32(txd->len);
+	dseg->lkey = sq->mkey_be;
+	session->ds_count++;
+
+	sq->stats->mpwqe_pkts++;
+}
+
+static struct mlx5_wqe_ctrl_seg *mlx5e_tx_mpwqe_session_complete(struct mlx5e_txqsq *sq)
+{
+	struct mlx5e_tx_mpwqe *session = &sq->mpwqe;
+	u8 ds_count = session->ds_count;
+	struct mlx5_wqe_ctrl_seg *cseg;
+	struct mlx5e_tx_wqe_info *wi;
+	u16 pi;
+
+	cseg = &session->wqe->ctrl;
+	cseg->opmod_idx_opcode = cpu_to_be32((sq->pc << 8) | MLX5_OPCODE_ENHANCED_MPSW);
+	cseg->qpn_ds = cpu_to_be32((sq->sqn << 8) | ds_count);
+
+	pi = mlx5_wq_cyc_ctr2ix(&sq->wq, sq->pc);
+	wi = &sq->db.wqe_info[pi];
+	*wi = (struct mlx5e_tx_wqe_info) {
+		.skb = NULL,
+		.num_bytes = session->bytes_count,
+		.num_wqebbs = DIV_ROUND_UP(ds_count, MLX5_SEND_WQEBB_NUM_DS),
+		.num_dma = session->pkt_count,
+		.num_fifo_pkts = session->pkt_count,
+	};
+
+	sq->pc += wi->num_wqebbs;
+
+	session->wqe = NULL;
+
+	mlx5e_tx_check_stop(sq);
+
+	return cseg;
+}
+
+static noinline void
+mlx5e_sq_xmit_mpwqe(struct mlx5e_txqsq *sq, struct sk_buff *skb,
+		    struct mlx5_wqe_eth_seg *eseg, bool xmit_more)
+{
+	struct mlx5_wqe_ctrl_seg *cseg;
+	struct mlx5e_xmit_data txd;
+
+	if (!mlx5e_tx_mpwqe_session_is_active(sq)) {
+		mlx5e_tx_mpwqe_session_start(sq, eseg);
+	} else if (!mlx5e_tx_mpwqe_same_eseg(sq, eseg)) {
+		mlx5e_tx_mpwqe_session_complete(sq);
+		mlx5e_tx_mpwqe_session_start(sq, eseg);
+	}
+
+	sq->stats->xmit_more += xmit_more;
+
+	txd.data = skb->data;
+	txd.len = skb->len;
+
+	txd.dma_addr = dma_map_single(sq->pdev, txd.data, txd.len, DMA_TO_DEVICE);
+	if (unlikely(dma_mapping_error(sq->pdev, txd.dma_addr)))
+		goto err_unmap;
+	mlx5e_dma_push(sq, txd.dma_addr, txd.len, MLX5E_DMA_MAP_SINGLE);
+
+	mlx5e_skb_fifo_push(sq, skb);
+
+	mlx5e_tx_mpwqe_add_dseg(sq, &txd);
+
+	mlx5e_tx_skb_update_hwts_flags(skb);
+
+	if (unlikely(mlx5e_tx_mpwqe_is_full(&sq->mpwqe))) {
+		/* Might stop the queue and affect the retval of __netdev_tx_sent_queue. */
+		cseg = mlx5e_tx_mpwqe_session_complete(sq);
+
+		if (__netdev_tx_sent_queue(sq->txq, txd.len, xmit_more))
+			mlx5e_notify_hw(&sq->wq, sq->pc, sq->uar_map, cseg);
+	} else if (__netdev_tx_sent_queue(sq->txq, txd.len, xmit_more)) {
+		/* Might stop the queue, but we were asked to ring the doorbell anyway. */
+		cseg = mlx5e_tx_mpwqe_session_complete(sq);
+
+		mlx5e_notify_hw(&sq->wq, sq->pc, sq->uar_map, cseg);
+	}
+
+	return;
+
+err_unmap:
+	mlx5e_dma_unmap_wqe_err(sq, 1);
+	sq->stats->dropped++;
+	dev_kfree_skb_any(skb);
+}
+
+void mlx5e_tx_mpwqe_ensure_complete(struct mlx5e_txqsq *sq)
+{
+	/* Unlikely in non-MPWQE workloads; not important in MPWQE workloads. */
+	if (unlikely(mlx5e_tx_mpwqe_session_is_active(sq)))
+		mlx5e_tx_mpwqe_session_complete(sq);
+}
+
+static inline bool mlx5e_txwqe_build_eseg(struct mlx5e_priv *priv, struct mlx5e_txqsq *sq,
+					  struct sk_buff *skb, struct mlx5_wqe_eth_seg *eseg)
+{
+	if (unlikely(!mlx5e_accel_tx_eseg(priv, sq, skb, eseg)))
+		return false;
+
+	mlx5e_txwqe_build_eseg_csum(sq, skb, eseg);
+
+	return true;
+}
+
 netdev_tx_t mlx5e_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct mlx5e_priv *priv = netdev_priv(dev);
@@ -426,21 +586,35 @@ netdev_tx_t mlx5e_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	/* May send SKBs and WQEs. */
 	if (unlikely(!mlx5e_accel_tx_begin(dev, sq, skb, &accel)))
-		goto out;
+		return NETDEV_TX_OK;
 
 	mlx5e_sq_xmit_prepare(sq, skb, &accel, &attr);
+
+	if (test_bit(MLX5E_SQ_STATE_MPWQE, &sq->state)) {
+		if (mlx5e_tx_skb_supports_mpwqe(skb, &attr)) {
+			struct mlx5_wqe_eth_seg eseg = {};
+
+			if (unlikely(!mlx5e_txwqe_build_eseg(priv, sq, skb, &eseg)))
+				return NETDEV_TX_OK;
+
+			mlx5e_sq_xmit_mpwqe(sq, skb, &eseg, netdev_xmit_more());
+			return NETDEV_TX_OK;
+		}
+
+		mlx5e_tx_mpwqe_ensure_complete(sq);
+	}
+
 	mlx5e_sq_calc_wqe_attr(skb, &attr, &wqe_attr);
 	pi = mlx5e_txqsq_get_next_pi(sq, wqe_attr.num_wqebbs);
 	wqe = MLX5E_TX_FETCH_WQE(sq, pi);
 
 	/* May update the WQE, but may not post other WQEs. */
-	if (unlikely(!mlx5e_accel_tx_finish(priv, sq, skb, wqe, &accel)))
-		goto out;
+	mlx5e_accel_tx_finish(sq, wqe, &accel);
+	if (unlikely(!mlx5e_txwqe_build_eseg(priv, sq, skb, &wqe->eth)))
+		return NETDEV_TX_OK;
 
-	mlx5e_txwqe_build_eseg_csum(sq, skb, &wqe->eth);
 	mlx5e_sq_xmit_wqe(sq, skb, &attr, &wqe_attr, wqe, pi, netdev_xmit_more());
 
-out:
 	return NETDEV_TX_OK;
 }
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [net-next 06/10] net/mlx5e: Support multiple SKBs in a TX WQE
  2020-09-03 21:00 ` [net-next 06/10] net/mlx5e: Support multiple SKBs in a TX WQE Saeed Mahameed
@ 2020-09-03 22:46   ` Jakub Kicinski
  2020-09-08  8:59     ` Maxim Mikityanskiy
  0 siblings, 1 reply; 23+ messages in thread
From: Jakub Kicinski @ 2020-09-03 22:46 UTC (permalink / raw)
  To: Saeed Mahameed; +Cc: David S. Miller, netdev, Maxim Mikityanskiy, Tariq Toukan

On Thu, 3 Sep 2020 14:00:18 -0700 Saeed Mahameed wrote:
> +static inline void mlx5e_tx_wi_consume_fifo_skbs(struct mlx5e_txqsq *sq,
> +						 struct mlx5e_tx_wqe_info *wi,
> +						 struct mlx5_cqe64 *cqe,
> +						 int napi_budget)
> +{
> +	int i;
> +
> +	for (i = 0; i < wi->num_fifo_pkts; i++) {
> +		struct sk_buff *skb = mlx5e_skb_fifo_pop(sq);
> +
> +		mlx5e_consume_skb(sq, skb, cqe, napi_budget);
> +	}
> +}

The compiler was not inlining this one either?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [net-next 04/10] net/mlx5e: Unify constants for WQE_EMPTY_DS_COUNT
  2020-09-03 21:00 ` [net-next 04/10] net/mlx5e: Unify constants for WQE_EMPTY_DS_COUNT Saeed Mahameed
@ 2020-09-04 15:05   ` Willem de Bruijn
  2020-09-08  8:59     ` Maxim Mikityanskiy
  0 siblings, 1 reply; 23+ messages in thread
From: Willem de Bruijn @ 2020-09-04 15:05 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, Jakub Kicinski, Network Development, Maxim Mikityanskiy

On Thu, Sep 3, 2020 at 11:01 PM Saeed Mahameed <saeedm@nvidia.com> wrote:
>
> From: Maxim Mikityanskiy <maximmi@mellanox.com>
>
> A constant for the number of DS in an empty WQE (i.e. a WQE without data
> segments) is needed in multiple places (normal TX data path, MPWQE in
> XDP), but currently we have a constant for XDP and an inline formula in
> normal TX. This patch introduces a common constant.
>
> Additionally, mlx5e_xdp_mpwqe_session_start is converted to use struct
> assignment, because the code nearby is touched.
>
> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> ---
>  .../net/ethernet/mellanox/mlx5/core/en/txrx.h |  2 ++
>  .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 13 +++++++-----
>  .../net/ethernet/mellanox/mlx5/core/en/xdp.h  | 21 +++++++------------
>  .../net/ethernet/mellanox/mlx5/core/en_tx.c   |  2 +-
>  4 files changed, 19 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> index d4ee22789ab0..155b89998891 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> @@ -7,6 +7,8 @@
>  #include "en.h"
>  #include <linux/indirect_call_wrapper.h>
>
> +#define MLX5E_TX_WQE_EMPTY_DS_COUNT (sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS)
> +

Out of curiosity, what is the logic for dividing this struct by 16?

struct mlx5e_tx_wqe {
        struct mlx5_wqe_ctrl_seg ctrl;
        struct mlx5_wqe_eth_seg  eth;
        struct mlx5_wqe_data_seg data[0];
};

>  #define INL_HDR_START_SZ (sizeof(((struct mlx5_wqe_eth_seg *)NULL)->inline_hdr.start))
>
>  enum mlx5e_icosq_wqe_type {
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> index 7fccd2ea7dc9..81cd9a04bcb0 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> @@ -196,16 +196,19 @@ static void mlx5e_xdp_mpwqe_session_start(struct mlx5e_xdpsq *sq)
>  {
>         struct mlx5e_xdp_mpwqe *session = &sq->mpwqe;
>         struct mlx5e_xdpsq_stats *stats = sq->stats;
> +       struct mlx5e_tx_wqe *wqe;
>         u16 pi;
>
>         pi = mlx5e_xdpsq_get_next_pi(sq, MLX5E_XDP_MPW_MAX_WQEBBS);
> -       session->wqe = MLX5E_TX_FETCH_WQE(sq, pi);
> -
> +       wqe = MLX5E_TX_FETCH_WQE(sq, pi);
>         net_prefetchw(session->wqe->data);

Is this prefetch still valid? And is the temporary variable wqe still
needed at all?


> -       session->ds_count  = MLX5E_XDP_TX_EMPTY_DS_COUNT;
> -       session->pkt_count = 0;
>
> -       mlx5e_xdp_update_inline_state(sq);
> +       *session = (struct mlx5e_xdp_mpwqe) {
> +               .wqe = wqe,
> +               .ds_count = MLX5E_TX_WQE_EMPTY_DS_COUNT,
> +               .pkt_count = 0,
> +               .inline_on = mlx5e_xdp_get_inline_state(sq, session->inline_on),
> +       };
>
>         stats->mpwqe++;
>  }

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [net-next 09/10] net/mlx5e: Move TX code into functions to be used by MPWQE
  2020-09-03 21:00 ` [net-next 09/10] net/mlx5e: Move TX code into functions to be used by MPWQE Saeed Mahameed
@ 2020-09-04 15:06   ` Willem de Bruijn
  2020-09-08  8:59     ` Maxim Mikityanskiy
  0 siblings, 1 reply; 23+ messages in thread
From: Willem de Bruijn @ 2020-09-04 15:06 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, Jakub Kicinski, Network Development, Maxim Mikityanskiy

On Thu, Sep 3, 2020 at 11:01 PM Saeed Mahameed <saeedm@nvidia.com> wrote:
>
> From: Maxim Mikityanskiy <maximmi@mellanox.com>
>
> mlx5e_txwqe_complete performs some actions that can be taken to separate
> functions:
>
> 1. Update the flags needed for hardware timestamping.
>
> 2. Stop the TX queue if it's full.
>
> Take these actions into separate functions to be reused by the MPWQE
> code in the following commit and to maintain clear responsibilities of
> functions.
>
> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> ---
>  .../net/ethernet/mellanox/mlx5/core/en_tx.c   | 23 ++++++++++++++-----
>  1 file changed, 17 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> index 9ced350150b3..3b68c8333875 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> @@ -311,6 +311,20 @@ static inline void mlx5e_sq_calc_wqe_attr(struct sk_buff *skb,
>         };
>  }
>
> +static inline void mlx5e_tx_skb_update_hwts_flags(struct sk_buff *skb)
> +{
> +       if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP))
> +               skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
> +}

Subjective, but this helper adds a level of indirection and introduces
code churn without simplifying anything, imho.

> +static inline void mlx5e_tx_check_stop(struct mlx5e_txqsq *sq)
> +{
> +       if (unlikely(!mlx5e_wqc_has_room_for(&sq->wq, sq->cc, sq->pc, sq->stop_room))) {
> +               netif_tx_stop_queue(sq->txq);
> +               sq->stats->stopped++;
> +       }
> +}
> +
>  static inline void
>  mlx5e_txwqe_complete(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>                      const struct mlx5e_tx_attr *attr,
> @@ -332,14 +346,11 @@ mlx5e_txwqe_complete(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>         cseg->opmod_idx_opcode = cpu_to_be32((sq->pc << 8) | attr->opcode);
>         cseg->qpn_ds           = cpu_to_be32((sq->sqn << 8) | wqe_attr->ds_cnt);
>
> -       if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP))
> -               skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
> +       mlx5e_tx_skb_update_hwts_flags(skb);
>
>         sq->pc += wi->num_wqebbs;
> -       if (unlikely(!mlx5e_wqc_has_room_for(wq, sq->cc, sq->pc, sq->stop_room))) {
> -               netif_tx_stop_queue(sq->txq);
> -               sq->stats->stopped++;
> -       }
> +
> +       mlx5e_tx_check_stop(sq);
>
>         send_doorbell = __netdev_tx_sent_queue(sq->txq, attr->num_bytes, xmit_more);
>         if (send_doorbell)
> --
> 2.26.2
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [net-next 02/10] net/mlx5e: Refactor xmit functions
  2020-09-03 21:00 ` [net-next 02/10] net/mlx5e: Refactor xmit functions Saeed Mahameed
@ 2020-09-04 15:27   ` Willem de Bruijn
  2020-09-08  8:58     ` Maxim Mikityanskiy
  0 siblings, 1 reply; 23+ messages in thread
From: Willem de Bruijn @ 2020-09-04 15:27 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, Jakub Kicinski, Network Development, Maxim Mikityanskiy

On Thu, Sep 3, 2020 at 11:00 PM Saeed Mahameed <saeedm@nvidia.com> wrote:
>
> From: Maxim Mikityanskiy <maximmi@mellanox.com>
>
> A huge function mlx5e_sq_xmit was split into several to achieve multiple
> goals:
>
> 1. Reuse the code in IPoIB.
>
> 2. Better integrate with TLS, IPSEC, GENEVE and checksum offloads. Now
> it's possible to reserve space in the WQ before running eseg-based
> offloads, so:
>
> 2.1. It's not needed to copy cseg and eseg after mlx5e_fill_sq_frag_edge
> anymore.
>
> 2.2. mlx5e_txqsq_get_next_pi will be used instead of the legacy
> mlx5e_fill_sq_frag_edge for better code maintainability and reuse.
>
> 3. Prepare for the upcoming TX MPWQE for SKBs. It will intervene after
> mlx5e_sq_calc_wqe_attr to check if it's possible to use MPWQE, and the
> code flow will split into two paths: MPWQE and non-MPWQE.
>
> Two high-level functions are provided to send packets:
>
> * mlx5e_xmit is called by the networking stack, runs offloads and sends
> the packet. In one of the following patches, MPWQE support will be added
> to this flow.
>
> * mlx5e_sq_xmit_simple is called by the TLS offload, runs only the
> checksum offload and sends the packet.
>
> This change has no performance impact in TCP single stream test and
> XDP_TX single stream test.
>
> UDP pktgen (burst 32), single stream:
>   Packet rate: 17.55 Mpps -> 19.23 Mpps
>   Instructions per packet: 420 -> 360
>   Cycles per packet: 165 -> 142
>
> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (x86_64)
> NIC: Mellanox ConnectX-6 Dx
>
> To get this performance gain, manual optimizations of function inlining
> were performed. It's important to have mlx5e_sq_xmit_wqe inline,
> otherwise the packet rate will be 1 Mpps less in UDP pktgen test.
> __always_inline is required, because gcc uninlines it when it's called
> from two places (mlx5e_xmit and mlx5e_sq_xmit_simple).
>
> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> ---
>  .../net/ethernet/mellanox/mlx5/core/en/txrx.h |  63 +--
>  .../mellanox/mlx5/core/en_accel/en_accel.h    |   5 +
>  .../mellanox/mlx5/core/en_accel/tls_rxtx.c    |   6 +-
>  .../net/ethernet/mellanox/mlx5/core/en_tx.c   | 391 ++++++++++--------
>  4 files changed, 243 insertions(+), 222 deletions(-)

This combines a lot of changes. Including supposed noops, but with
subtle changes, like converting to struct initializers.

Probably deserves to be broken up a bit more.

For instance, a pure noop patch that moves
mlx5e_txwqe_build_eseg_csum, a separate patch for
mlx5e_tx_wqe_inline_mode (the change to which is not trivial in
itself), introduction of mlx5e_sq_xmit_prepare, ..

Is, especially after this refactoring, mlx5e_xmit still considerably
more complex than mlx5e_sq_xmit_simple? It does not look like that
separate function is really necessary. At least after this patch.


>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> index 9334c9c3e208..d4ee22789ab0 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> @@ -41,8 +41,6 @@ void mlx5e_free_rx_in_progress_descs(struct mlx5e_rq *rq);
>  u16 mlx5e_select_queue(struct net_device *dev, struct sk_buff *skb,
>                        struct net_device *sb_dev);
>  netdev_tx_t mlx5e_xmit(struct sk_buff *skb, struct net_device *dev);
> -void mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
> -                  struct mlx5e_tx_wqe *wqe, u16 pi, bool xmit_more);
>  bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget);
>  void mlx5e_free_txqsq_descs(struct mlx5e_txqsq *sq);
>
> @@ -188,23 +186,6 @@ static inline u16 mlx5e_icosq_get_next_pi(struct mlx5e_icosq *sq, u16 size)
>         return pi;
>  }
>
> -static inline void
> -mlx5e_fill_sq_frag_edge(struct mlx5e_txqsq *sq, struct mlx5_wq_cyc *wq,
> -                       u16 pi, u16 nnops)
> -{
> -       struct mlx5e_tx_wqe_info *edge_wi, *wi = &sq->db.wqe_info[pi];
> -
> -       edge_wi = wi + nnops;
> -
> -       /* fill sq frag edge with nops to avoid wqe wrapping two pages */
> -       for (; wi < edge_wi; wi++) {
> -               memset(wi, 0, sizeof(*wi));
> -               wi->num_wqebbs = 1;
> -               mlx5e_post_nop(wq, sq->sqn, &sq->pc);
> -       }
> -       sq->stats->nop += nnops;
> -}
> -
>  static inline void
>  mlx5e_notify_hw(struct mlx5_wq_cyc *wq, u16 pc, void __iomem *uar_map,
>                 struct mlx5_wqe_ctrl_seg *ctrl)
> @@ -223,29 +204,6 @@ mlx5e_notify_hw(struct mlx5_wq_cyc *wq, u16 pc, void __iomem *uar_map,
>         mlx5_write64((__be32 *)ctrl, uar_map);
>  }
>
> -static inline bool mlx5e_transport_inline_tx_wqe(struct mlx5_wqe_ctrl_seg *cseg)
> -{
> -       return cseg && !!cseg->tis_tir_num;
> -}
> -
> -static inline u8
> -mlx5e_tx_wqe_inline_mode(struct mlx5e_txqsq *sq, struct mlx5_wqe_ctrl_seg *cseg,
> -                        struct sk_buff *skb)
> -{
> -       u8 mode;
> -
> -       if (mlx5e_transport_inline_tx_wqe(cseg))
> -               return MLX5_INLINE_MODE_TCP_UDP;
> -
> -       mode = sq->min_inline_mode;
> -
> -       if (skb_vlan_tag_present(skb) &&
> -           test_bit(MLX5E_SQ_STATE_VLAN_NEED_L2_INLINE, &sq->state))
> -               mode = max_t(u8, MLX5_INLINE_MODE_L2, mode);
> -
> -       return mode;
> -}
> -
>  static inline void mlx5e_cq_arm(struct mlx5e_cq *cq)
>  {
>         struct mlx5_core_cq *mcq;
> @@ -286,6 +244,27 @@ mlx5e_tx_dma_unmap(struct device *pdev, struct mlx5e_sq_dma *dma)
>         }
>  }
>
> +static inline void mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq,
> +                                              struct sk_buff *skb,
> +                                              struct mlx5_wqe_eth_seg *eseg)
> +{
> +       if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
> +               eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM;
> +               if (skb->encapsulation) {
> +                       eseg->cs_flags |= MLX5_ETH_WQE_L3_INNER_CSUM |
> +                                         MLX5_ETH_WQE_L4_INNER_CSUM;
> +                       sq->stats->csum_partial_inner++;
> +               } else {
> +                       eseg->cs_flags |= MLX5_ETH_WQE_L4_CSUM;
> +                       sq->stats->csum_partial++;
> +               }
> +       } else {
> +               sq->stats->csum_none++;
> +       }
> +}
> +
> +void mlx5e_sq_xmit_simple(struct mlx5e_txqsq *sq, struct sk_buff *skb, bool xmit_more);
> +
>  static inline void mlx5e_rqwq_reset(struct mlx5e_rq *rq)
>  {
>         if (rq->wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ) {
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
> index 110476bdeffb..23d4ef5ab9c5 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
> @@ -145,6 +145,11 @@ static inline bool mlx5e_accel_tx_finish(struct mlx5e_priv *priv,
>         }
>  #endif
>
> +#if IS_ENABLED(CONFIG_GENEVE)
> +       if (skb->encapsulation)
> +               mlx5e_tx_tunnel_accel(skb, &wqe->eth);
> +#endif
> +
>         return true;
>  }
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
> index b0c31d49ff8d..c36560b3e93d 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
> @@ -189,12 +189,10 @@ static bool mlx5e_tls_handle_ooo(struct mlx5e_tls_offload_context_tx *context,
>                                  struct mlx5e_tls *tls)
>  {
>         u32 tcp_seq = ntohl(tcp_hdr(skb)->seq);
> -       struct mlx5e_tx_wqe *wqe;
>         struct sync_info info;
>         struct sk_buff *nskb;
>         int linear_len = 0;
>         int headln;
> -       u16 pi;
>         int i;
>
>         sq->stats->tls_ooo++;
> @@ -246,9 +244,7 @@ static bool mlx5e_tls_handle_ooo(struct mlx5e_tls_offload_context_tx *context,
>         sq->stats->tls_resync_bytes += nskb->len;
>         mlx5e_tls_complete_sync_skb(skb, nskb, tcp_seq, headln,
>                                     cpu_to_be64(info.rcd_sn));
> -       pi = mlx5_wq_cyc_ctr2ix(&sq->wq, sq->pc);
> -       wqe = MLX5E_TX_FETCH_WQE(sq, pi);
> -       mlx5e_sq_xmit(sq, nskb, wqe, pi, true);
> +       mlx5e_sq_xmit_simple(sq, nskb, true);
>
>         return true;
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> index e15aa53ff83e..f967bc0573c0 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> @@ -144,23 +144,6 @@ static inline void mlx5e_insert_vlan(void *start, struct sk_buff *skb, u16 ihs)
>         memcpy(&vhdr->h_vlan_encapsulated_proto, skb->data + cpy1_sz, cpy2_sz);
>  }
>
> -static inline void
> -mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb, struct mlx5_wqe_eth_seg *eseg)
> -{
> -       if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
> -               eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM;
> -               if (skb->encapsulation) {
> -                       eseg->cs_flags |= MLX5_ETH_WQE_L3_INNER_CSUM |
> -                                         MLX5_ETH_WQE_L4_INNER_CSUM;
> -                       sq->stats->csum_partial_inner++;
> -               } else {
> -                       eseg->cs_flags |= MLX5_ETH_WQE_L4_CSUM;
> -                       sq->stats->csum_partial++;
> -               }
> -       } else
> -               sq->stats->csum_none++;
> -}
> -
>  static inline u16
>  mlx5e_tx_get_gso_ihs(struct mlx5e_txqsq *sq, struct sk_buff *skb)
>  {
> @@ -232,22 +215,121 @@ mlx5e_txwqe_build_dsegs(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>         return -ENOMEM;
>  }
>
> +struct mlx5e_tx_attr {
> +       u32 num_bytes;
> +       u16 headlen;
> +       u16 ihs;
> +       __be16 mss;
> +       u8 opcode;
> +};
> +
> +struct mlx5e_tx_wqe_attr {
> +       u16 ds_cnt;
> +       u16 ds_cnt_inl;
> +       u8 num_wqebbs;
> +};
> +
> +static inline u8
> +mlx5e_tx_wqe_inline_mode(struct mlx5e_txqsq *sq, struct sk_buff *skb,
> +                        struct mlx5e_accel_tx_state *accel)
> +{
> +       u8 mode;
> +
> +#ifdef CONFIG_MLX5_EN_TLS
> +       if (accel && accel->tls.tls_tisn)
> +               return MLX5_INLINE_MODE_TCP_UDP;
> +#endif
> +
> +       mode = sq->min_inline_mode;
> +
> +       if (skb_vlan_tag_present(skb) &&
> +           test_bit(MLX5E_SQ_STATE_VLAN_NEED_L2_INLINE, &sq->state))
> +               mode = max_t(u8, MLX5_INLINE_MODE_L2, mode);
> +
> +       return mode;
> +}
> +
> +static inline void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
> +                                        struct mlx5e_accel_tx_state *accel,
> +                                        struct mlx5e_tx_attr *attr)
> +{
> +       struct mlx5e_sq_stats *stats = sq->stats;
> +
> +       if (skb_is_gso(skb)) {
> +               u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb);
> +
> +               *attr = (struct mlx5e_tx_attr) {
> +                       .opcode    = MLX5_OPCODE_LSO,
> +                       .mss       = cpu_to_be16(skb_shinfo(skb)->gso_size),
> +                       .ihs       = ihs,
> +                       .num_bytes = skb->len + (skb_shinfo(skb)->gso_segs - 1) * ihs,
> +                       .headlen   = skb_headlen(skb) - ihs,
> +               };
> +
> +               stats->packets += skb_shinfo(skb)->gso_segs;
> +       } else {
> +               u8 mode = mlx5e_tx_wqe_inline_mode(sq, skb, accel);
> +               u16 ihs = mlx5e_calc_min_inline(mode, skb);
> +
> +               *attr = (struct mlx5e_tx_attr) {
> +                       .opcode    = MLX5_OPCODE_SEND,
> +                       .mss       = cpu_to_be16(0),
> +                       .ihs       = ihs,
> +                       .num_bytes = max_t(unsigned int, skb->len, ETH_ZLEN),
> +                       .headlen   = skb_headlen(skb) - ihs,
> +               };
> +
> +               stats->packets++;
> +       }
> +
> +       stats->bytes += attr->num_bytes;
> +}
> +
> +static inline void mlx5e_sq_calc_wqe_attr(struct sk_buff *skb,
> +                                         const struct mlx5e_tx_attr *attr,
> +                                         struct mlx5e_tx_wqe_attr *wqe_attr)
> +{
> +       u16 ds_cnt = sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS;
> +       u16 ds_cnt_inl = 0;
> +
> +       ds_cnt += !!attr->headlen + skb_shinfo(skb)->nr_frags;
> +
> +       if (attr->ihs) {
> +               u16 inl = attr->ihs - INL_HDR_START_SZ;
> +
> +               if (skb_vlan_tag_present(skb))
> +                       inl += VLAN_HLEN;
> +
> +               ds_cnt_inl = DIV_ROUND_UP(inl, MLX5_SEND_WQE_DS);
> +               ds_cnt += ds_cnt_inl;
> +       }
> +
> +       *wqe_attr = (struct mlx5e_tx_wqe_attr) {
> +               .ds_cnt     = ds_cnt,
> +               .ds_cnt_inl = ds_cnt_inl,
> +               .num_wqebbs = DIV_ROUND_UP(ds_cnt, MLX5_SEND_WQEBB_NUM_DS),
> +       };
> +}
> +
>  static inline void
>  mlx5e_txwqe_complete(struct mlx5e_txqsq *sq, struct sk_buff *skb,
> -                    u8 opcode, u16 ds_cnt, u8 num_wqebbs, u32 num_bytes, u8 num_dma,
> +                    const struct mlx5e_tx_attr *attr,
> +                    const struct mlx5e_tx_wqe_attr *wqe_attr, u8 num_dma,
>                      struct mlx5e_tx_wqe_info *wi, struct mlx5_wqe_ctrl_seg *cseg,
>                      bool xmit_more)
>  {
>         struct mlx5_wq_cyc *wq = &sq->wq;
>         bool send_doorbell;
>
> -       wi->num_bytes = num_bytes;
> -       wi->num_dma = num_dma;
> -       wi->num_wqebbs = num_wqebbs;
> -       wi->skb = skb;
> +       *wi = (struct mlx5e_tx_wqe_info) {
> +               .skb = skb,
> +               .num_bytes = attr->num_bytes,
> +               .num_dma = num_dma,
> +               .num_wqebbs = wqe_attr->num_wqebbs,
> +       };
>
> -       cseg->opmod_idx_opcode = cpu_to_be32((sq->pc << 8) | opcode);
> -       cseg->qpn_ds           = cpu_to_be32((sq->sqn << 8) | ds_cnt);
> +       cseg->opmod_idx_opcode = cpu_to_be32((sq->pc << 8) | attr->opcode);
> +       cseg->qpn_ds           = cpu_to_be32((sq->sqn << 8) | wqe_attr->ds_cnt);
>
>         if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP))
>                 skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
> @@ -258,105 +340,44 @@ mlx5e_txwqe_complete(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>                 sq->stats->stopped++;
>         }
>
> -       send_doorbell = __netdev_tx_sent_queue(sq->txq, num_bytes,
> -                                              xmit_more);
> +       send_doorbell = __netdev_tx_sent_queue(sq->txq, attr->num_bytes, xmit_more);
>         if (send_doorbell)
>                 mlx5e_notify_hw(wq, sq->pc, sq->uar_map, cseg);
>  }
>
> -void mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
> -                  struct mlx5e_tx_wqe *wqe, u16 pi, bool xmit_more)
> +static __always_inline void
> +mlx5e_sq_xmit_wqe(struct mlx5e_txqsq *sq, struct sk_buff *skb,
> +                 const struct mlx5e_tx_attr *attr, const struct mlx5e_tx_wqe_attr *wqe_attr,
> +                 struct mlx5e_tx_wqe *wqe, u16 pi, bool xmit_more)
>  {
> -       struct mlx5_wq_cyc *wq = &sq->wq;
>         struct mlx5_wqe_ctrl_seg *cseg;
>         struct mlx5_wqe_eth_seg  *eseg;
>         struct mlx5_wqe_data_seg *dseg;
>         struct mlx5e_tx_wqe_info *wi;
>
>         struct mlx5e_sq_stats *stats = sq->stats;
> -       u16 headlen, ihs, contig_wqebbs_room;
> -       u16 ds_cnt, ds_cnt_inl = 0;
> -       u8 num_wqebbs, opcode;
> -       u32 num_bytes;
>         int num_dma;
> -       __be16 mss;
>
> -       /* Calc ihs and ds cnt, no writes to wqe yet */
> -       ds_cnt = sizeof(*wqe) / MLX5_SEND_WQE_DS;
> -       if (skb_is_gso(skb)) {
> -               opcode    = MLX5_OPCODE_LSO;
> -               mss       = cpu_to_be16(skb_shinfo(skb)->gso_size);
> -               ihs       = mlx5e_tx_get_gso_ihs(sq, skb);
> -               num_bytes = skb->len + (skb_shinfo(skb)->gso_segs - 1) * ihs;
> -               stats->packets += skb_shinfo(skb)->gso_segs;
> -       } else {
> -               u8 mode = mlx5e_tx_wqe_inline_mode(sq, &wqe->ctrl, skb);
> -
> -               opcode    = MLX5_OPCODE_SEND;
> -               mss       = 0;
> -               ihs       = mlx5e_calc_min_inline(mode, skb);
> -               num_bytes = max_t(unsigned int, skb->len, ETH_ZLEN);
> -               stats->packets++;
> -       }
> -
> -       stats->bytes     += num_bytes;
>         stats->xmit_more += xmit_more;
>
> -       headlen = skb->len - ihs - skb->data_len;
> -       ds_cnt += !!headlen;
> -       ds_cnt += skb_shinfo(skb)->nr_frags;
> -
> -       if (ihs) {
> -               u16 inl = ihs + !!skb_vlan_tag_present(skb) * VLAN_HLEN - INL_HDR_START_SZ;
> -
> -               ds_cnt_inl = DIV_ROUND_UP(inl, MLX5_SEND_WQE_DS);
> -               ds_cnt += ds_cnt_inl;
> -       }
> -
> -       num_wqebbs = DIV_ROUND_UP(ds_cnt, MLX5_SEND_WQEBB_NUM_DS);
> -       contig_wqebbs_room = mlx5_wq_cyc_get_contig_wqebbs(wq, pi);
> -       if (unlikely(contig_wqebbs_room < num_wqebbs)) {
> -#ifdef CONFIG_MLX5_EN_IPSEC
> -               struct mlx5_wqe_eth_seg cur_eth = wqe->eth;
> -#endif
> -#ifdef CONFIG_MLX5_EN_TLS
> -               struct mlx5_wqe_ctrl_seg cur_ctrl = wqe->ctrl;
> -#endif
> -               mlx5e_fill_sq_frag_edge(sq, wq, pi, contig_wqebbs_room);
> -               pi = mlx5_wq_cyc_ctr2ix(wq, sq->pc);
> -               wqe = MLX5E_TX_FETCH_WQE(sq, pi);
> -#ifdef CONFIG_MLX5_EN_IPSEC
> -               wqe->eth = cur_eth;
> -#endif
> -#ifdef CONFIG_MLX5_EN_TLS
> -               wqe->ctrl = cur_ctrl;
> -#endif
> -       }
> -
>         /* fill wqe */
>         wi   = &sq->db.wqe_info[pi];
>         cseg = &wqe->ctrl;
>         eseg = &wqe->eth;
>         dseg =  wqe->data;
>
> -#if IS_ENABLED(CONFIG_GENEVE)
> -       if (skb->encapsulation)
> -               mlx5e_tx_tunnel_accel(skb, eseg);
> -#endif
> -       mlx5e_txwqe_build_eseg_csum(sq, skb, eseg);
> -
> -       eseg->mss = mss;
> +       eseg->mss = attr->mss;
>
> -       if (ihs) {
> +       if (attr->ihs) {
>                 if (skb_vlan_tag_present(skb)) {
> -                       eseg->inline_hdr.sz = cpu_to_be16(ihs + VLAN_HLEN);
> -                       mlx5e_insert_vlan(eseg->inline_hdr.start, skb, ihs);
> +                       eseg->inline_hdr.sz = cpu_to_be16(attr->ihs + VLAN_HLEN);
> +                       mlx5e_insert_vlan(eseg->inline_hdr.start, skb, attr->ihs);
>                         stats->added_vlan_packets++;
>                 } else {
> -                       eseg->inline_hdr.sz = cpu_to_be16(ihs);
> -                       memcpy(eseg->inline_hdr.start, skb->data, ihs);
> +                       eseg->inline_hdr.sz = cpu_to_be16(attr->ihs);
> +                       memcpy(eseg->inline_hdr.start, skb->data, attr->ihs);
>                 }
> -               dseg += ds_cnt_inl;
> +               dseg += wqe_attr->ds_cnt_inl;
>         } else if (skb_vlan_tag_present(skb)) {
>                 eseg->insert.type = cpu_to_be16(MLX5_ETH_WQE_INSERT_VLAN);
>                 if (skb->vlan_proto == cpu_to_be16(ETH_P_8021AD))
> @@ -365,12 +386,12 @@ void mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>                 stats->added_vlan_packets++;
>         }
>
> -       num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + ihs, headlen, dseg);
> +       num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr->ihs,
> +                                         attr->headlen, dseg);
>         if (unlikely(num_dma < 0))
>                 goto err_drop;
>
> -       mlx5e_txwqe_complete(sq, skb, opcode, ds_cnt, num_wqebbs, num_bytes,
> -                            num_dma, wi, cseg, xmit_more);
> +       mlx5e_txwqe_complete(sq, skb, attr, wqe_attr, num_dma, wi, cseg, xmit_more);
>
>         return;
>
> @@ -383,6 +404,8 @@ netdev_tx_t mlx5e_xmit(struct sk_buff *skb, struct net_device *dev)
>  {
>         struct mlx5e_priv *priv = netdev_priv(dev);
>         struct mlx5e_accel_tx_state accel = {};
> +       struct mlx5e_tx_wqe_attr wqe_attr;
> +       struct mlx5e_tx_attr attr;
>         struct mlx5e_tx_wqe *wqe;
>         struct mlx5e_txqsq *sq;
>         u16 pi;
> @@ -393,19 +416,64 @@ netdev_tx_t mlx5e_xmit(struct sk_buff *skb, struct net_device *dev)
>         if (unlikely(!mlx5e_accel_tx_begin(dev, sq, skb, &accel)))
>                 goto out;
>
> -       pi = mlx5_wq_cyc_ctr2ix(&sq->wq, sq->pc);
> +       mlx5e_sq_xmit_prepare(sq, skb, &accel, &attr);
> +       mlx5e_sq_calc_wqe_attr(skb, &attr, &wqe_attr);
> +       pi = mlx5e_txqsq_get_next_pi(sq, wqe_attr.num_wqebbs);
>         wqe = MLX5E_TX_FETCH_WQE(sq, pi);
>
>         /* May update the WQE, but may not post other WQEs. */
>         if (unlikely(!mlx5e_accel_tx_finish(priv, sq, skb, wqe, &accel)))
>                 goto out;
>
> -       mlx5e_sq_xmit(sq, skb, wqe, pi, netdev_xmit_more());
> +       mlx5e_txwqe_build_eseg_csum(sq, skb, &wqe->eth);
> +       mlx5e_sq_xmit_wqe(sq, skb, &attr, &wqe_attr, wqe, pi, netdev_xmit_more());
>
>  out:
>         return NETDEV_TX_OK;
>  }
>
> +void mlx5e_sq_xmit_simple(struct mlx5e_txqsq *sq, struct sk_buff *skb, bool xmit_more)
> +{
> +       struct mlx5e_tx_wqe_attr wqe_attr;
> +       struct mlx5e_tx_attr attr;
> +       struct mlx5e_tx_wqe *wqe;
> +       u16 pi;
> +
> +       mlx5e_sq_xmit_prepare(sq, skb, NULL, &attr);
> +       mlx5e_sq_calc_wqe_attr(skb, &attr, &wqe_attr);
> +       pi = mlx5e_txqsq_get_next_pi(sq, wqe_attr.num_wqebbs);
> +       wqe = MLX5E_TX_FETCH_WQE(sq, pi);
> +       mlx5e_txwqe_build_eseg_csum(sq, skb, &wqe->eth);
> +       mlx5e_sq_xmit_wqe(sq, skb, &attr, &wqe_attr, wqe, pi, xmit_more);
> +}
> +
> +static inline void mlx5e_tx_wi_dma_unmap(struct mlx5e_txqsq *sq,
> +                                        struct mlx5e_tx_wqe_info *wi,
> +                                        u32 *dma_fifo_cc)
> +{
> +       int i;
> +
> +       for (i = 0; i < wi->num_dma; i++) {
> +               struct mlx5e_sq_dma *dma = mlx5e_dma_get(sq, (*dma_fifo_cc)++);
> +
> +               mlx5e_tx_dma_unmap(sq->pdev, dma);
> +       }
> +}
> +
> +static inline void mlx5e_consume_skb(struct mlx5e_txqsq *sq, struct sk_buff *skb,
> +                                    struct mlx5_cqe64 *cqe, int napi_budget)
> +{
> +       if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP)) {
> +               struct skb_shared_hwtstamps hwts = {};
> +               u64 ts = get_cqe_ts(cqe);
> +
> +               hwts.hwtstamp = mlx5_timecounter_cyc2time(sq->clock, ts);
> +               skb_tstamp_tx(skb, &hwts);
> +       }
> +
> +       napi_consume_skb(skb, napi_budget);
> +}
> +
>  bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
>  {
>         struct mlx5e_sq_stats *stats;
> @@ -452,7 +520,6 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
>
>                 do {
>                         struct sk_buff *skb;
> -                       int j;
>
>                         last_wqe = (sqcc == wqe_counter);
>
> @@ -460,33 +527,18 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
>                         wi = &sq->db.wqe_info[ci];
>                         skb = wi->skb;
>
> +                       sqcc += wi->num_wqebbs;
> +
>                         if (unlikely(!skb)) {
>                                 mlx5e_ktls_tx_handle_resync_dump_comp(sq, wi, &dma_fifo_cc);
> -                               sqcc += wi->num_wqebbs;
>                                 continue;
>                         }
>
> -                       if (unlikely(skb_shinfo(skb)->tx_flags &
> -                                    SKBTX_HW_TSTAMP)) {
> -                               struct skb_shared_hwtstamps hwts = {};
> -
> -                               hwts.hwtstamp =
> -                                       mlx5_timecounter_cyc2time(sq->clock,
> -                                                                 get_cqe_ts(cqe));
> -                               skb_tstamp_tx(skb, &hwts);
> -                       }
> -
> -                       for (j = 0; j < wi->num_dma; j++) {
> -                               struct mlx5e_sq_dma *dma =
> -                                       mlx5e_dma_get(sq, dma_fifo_cc++);
> -
> -                               mlx5e_tx_dma_unmap(sq->pdev, dma);
> -                       }
> +                       mlx5e_tx_wi_dma_unmap(sq, wi, &dma_fifo_cc);
> +                       mlx5e_consume_skb(sq, wi->skb, cqe, napi_budget);
>
>                         npkts++;
>                         nbytes += wi->num_bytes;
> -                       sqcc += wi->num_wqebbs;
> -                       napi_consume_skb(skb, napi_budget);
>                 } while (!last_wqe);
>
>                 if (unlikely(get_cqe_opcode(cqe) == MLX5_CQE_REQ_ERR)) {
> @@ -531,7 +583,6 @@ void mlx5e_free_txqsq_descs(struct mlx5e_txqsq *sq)
>         u32 dma_fifo_cc, nbytes = 0;
>         u16 ci, sqcc, npkts = 0;
>         struct sk_buff *skb;
> -       int i;
>
>         sqcc = sq->cc;
>         dma_fifo_cc = sq->dma_fifo_cc;
> @@ -541,23 +592,18 @@ void mlx5e_free_txqsq_descs(struct mlx5e_txqsq *sq)
>                 wi = &sq->db.wqe_info[ci];
>                 skb = wi->skb;
>
> +               sqcc += wi->num_wqebbs;
> +
>                 if (!skb) {
>                         mlx5e_ktls_tx_handle_resync_dump_comp(sq, wi, &dma_fifo_cc);
> -                       sqcc += wi->num_wqebbs;
>                         continue;
>                 }
>
> -               for (i = 0; i < wi->num_dma; i++) {
> -                       struct mlx5e_sq_dma *dma =
> -                               mlx5e_dma_get(sq, dma_fifo_cc++);
> -
> -                       mlx5e_tx_dma_unmap(sq->pdev, dma);
> -               }
> -
> +               mlx5e_tx_wi_dma_unmap(sq, wi, &dma_fifo_cc);
>                 dev_kfree_skb_any(skb);
> +
>                 npkts++;
>                 nbytes += wi->num_bytes;
> -               sqcc += wi->num_wqebbs;
>         }
>
>         sq->dma_fifo_cc = dma_fifo_cc;
> @@ -576,9 +622,34 @@ mlx5i_txwqe_build_datagram(struct mlx5_av *av, u32 dqpn, u32 dqkey,
>         dseg->av.key.qkey.qkey = cpu_to_be32(dqkey);
>  }
>
> +static void mlx5i_sq_calc_wqe_attr(struct sk_buff *skb,
> +                                  const struct mlx5e_tx_attr *attr,
> +                                  struct mlx5e_tx_wqe_attr *wqe_attr)
> +{
> +       u16 ds_cnt = sizeof(struct mlx5i_tx_wqe) / MLX5_SEND_WQE_DS;
> +       u16 ds_cnt_inl = 0;
> +
> +       ds_cnt += !!attr->headlen + skb_shinfo(skb)->nr_frags;
> +
> +       if (attr->ihs) {
> +               u16 inl = attr->ihs - INL_HDR_START_SZ;
> +
> +               ds_cnt_inl = DIV_ROUND_UP(inl, MLX5_SEND_WQE_DS);
> +               ds_cnt += ds_cnt_inl;
> +       }
> +
> +       *wqe_attr = (struct mlx5e_tx_wqe_attr) {
> +               .ds_cnt     = ds_cnt,
> +               .ds_cnt_inl = ds_cnt_inl,
> +               .num_wqebbs = DIV_ROUND_UP(ds_cnt, MLX5_SEND_WQEBB_NUM_DS),
> +       };
> +}
> +
>  void mlx5i_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>                    struct mlx5_av *av, u32 dqpn, u32 dqkey, bool xmit_more)
>  {
> +       struct mlx5e_tx_wqe_attr wqe_attr;
> +       struct mlx5e_tx_attr attr;
>         struct mlx5i_tx_wqe *wqe;
>
>         struct mlx5_wqe_datagram_seg *datagram;
> @@ -588,47 +659,17 @@ void mlx5i_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>         struct mlx5e_tx_wqe_info *wi;
>
>         struct mlx5e_sq_stats *stats = sq->stats;
> -       u16 ds_cnt, ds_cnt_inl = 0;
> -       u8 num_wqebbs, opcode;
> -       u16 headlen, ihs, pi;
> -       u32 num_bytes;
>         int num_dma;
> -       __be16 mss;
> +       u16 pi;
>
> -       /* Calc ihs and ds cnt, no writes to wqe yet */
> -       ds_cnt = sizeof(*wqe) / MLX5_SEND_WQE_DS;
> -       if (skb_is_gso(skb)) {
> -               opcode    = MLX5_OPCODE_LSO;
> -               mss       = cpu_to_be16(skb_shinfo(skb)->gso_size);
> -               ihs       = mlx5e_tx_get_gso_ihs(sq, skb);
> -               num_bytes = skb->len + (skb_shinfo(skb)->gso_segs - 1) * ihs;
> -               stats->packets += skb_shinfo(skb)->gso_segs;
> -       } else {
> -               u8 mode = mlx5e_tx_wqe_inline_mode(sq, NULL, skb);
> +       mlx5e_sq_xmit_prepare(sq, skb, NULL, &attr);
> +       mlx5i_sq_calc_wqe_attr(skb, &attr, &wqe_attr);
>
> -               opcode    = MLX5_OPCODE_SEND;
> -               mss       = 0;
> -               ihs       = mlx5e_calc_min_inline(mode, skb);
> -               num_bytes = max_t(unsigned int, skb->len, ETH_ZLEN);
> -               stats->packets++;
> -       }
> +       pi = mlx5e_txqsq_get_next_pi(sq, wqe_attr.num_wqebbs);
> +       wqe = MLX5I_SQ_FETCH_WQE(sq, pi);
>
> -       stats->bytes     += num_bytes;
>         stats->xmit_more += xmit_more;
>
> -       headlen = skb->len - ihs - skb->data_len;
> -       ds_cnt += !!headlen;
> -       ds_cnt += skb_shinfo(skb)->nr_frags;
> -
> -       if (ihs) {
> -               ds_cnt_inl = DIV_ROUND_UP(ihs - INL_HDR_START_SZ, MLX5_SEND_WQE_DS);
> -               ds_cnt += ds_cnt_inl;
> -       }
> -
> -       num_wqebbs = DIV_ROUND_UP(ds_cnt, MLX5_SEND_WQEBB_NUM_DS);
> -       pi = mlx5e_txqsq_get_next_pi(sq, num_wqebbs);
> -       wqe = MLX5I_SQ_FETCH_WQE(sq, pi);
> -
>         /* fill wqe */
>         wi       = &sq->db.wqe_info[pi];
>         cseg     = &wqe->ctrl;
> @@ -640,20 +681,20 @@ void mlx5i_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>
>         mlx5e_txwqe_build_eseg_csum(sq, skb, eseg);
>
> -       eseg->mss = mss;
> +       eseg->mss = attr.mss;
>
> -       if (ihs) {
> -               memcpy(eseg->inline_hdr.start, skb->data, ihs);
> -               eseg->inline_hdr.sz = cpu_to_be16(ihs);
> -               dseg += ds_cnt_inl;
> +       if (attr.ihs) {
> +               memcpy(eseg->inline_hdr.start, skb->data, attr.ihs);
> +               eseg->inline_hdr.sz = cpu_to_be16(attr.ihs);
> +               dseg += wqe_attr.ds_cnt_inl;
>         }
>
> -       num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + ihs, headlen, dseg);
> +       num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr.ihs,
> +                                         attr.headlen, dseg);
>         if (unlikely(num_dma < 0))
>                 goto err_drop;
>
> -       mlx5e_txwqe_complete(sq, skb, opcode, ds_cnt, num_wqebbs, num_bytes,
> -                            num_dma, wi, cseg, xmit_more);
> +       mlx5e_txwqe_complete(sq, skb, &attr, &wqe_attr, num_dma, wi, cseg, xmit_more);
>
>         return;
>
> --
> 2.26.2
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [net-next 02/10] net/mlx5e: Refactor xmit functions
  2020-09-04 15:27   ` Willem de Bruijn
@ 2020-09-08  8:58     ` Maxim Mikityanskiy
  2020-09-08  9:08       ` Willem de Bruijn
  0 siblings, 1 reply; 23+ messages in thread
From: Maxim Mikityanskiy @ 2020-09-08  8:58 UTC (permalink / raw)
  To: Willem de Bruijn, Saeed Mahameed
  Cc: David S. Miller, Jakub Kicinski, Network Development, Maxim Mikityanskiy

On 2020-09-04 18:27, Willem de Bruijn wrote:
> On Thu, Sep 3, 2020 at 11:00 PM Saeed Mahameed <saeedm@nvidia.com> wrote:
>>
>> From: Maxim Mikityanskiy <maximmi@mellanox.com>
>>
>> A huge function mlx5e_sq_xmit was split into several to achieve multiple
>> goals:
>>
>> 1. Reuse the code in IPoIB.
>>
>> 2. Better integrate with TLS, IPSEC, GENEVE and checksum offloads. Now
>> it's possible to reserve space in the WQ before running eseg-based
>> offloads, so:
>>
>> 2.1. It's not needed to copy cseg and eseg after mlx5e_fill_sq_frag_edge
>> anymore.
>>
>> 2.2. mlx5e_txqsq_get_next_pi will be used instead of the legacy
>> mlx5e_fill_sq_frag_edge for better code maintainability and reuse.
>>
>> 3. Prepare for the upcoming TX MPWQE for SKBs. It will intervene after
>> mlx5e_sq_calc_wqe_attr to check if it's possible to use MPWQE, and the
>> code flow will split into two paths: MPWQE and non-MPWQE.
>>
>> Two high-level functions are provided to send packets:
>>
>> * mlx5e_xmit is called by the networking stack, runs offloads and sends
>> the packet. In one of the following patches, MPWQE support will be added
>> to this flow.
>>
>> * mlx5e_sq_xmit_simple is called by the TLS offload, runs only the
>> checksum offload and sends the packet.
>>
>> This change has no performance impact in TCP single stream test and
>> XDP_TX single stream test.
>>
>> UDP pktgen (burst 32), single stream:
>>    Packet rate: 17.55 Mpps -> 19.23 Mpps
>>    Instructions per packet: 420 -> 360
>>    Cycles per packet: 165 -> 142
>>
>> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (x86_64)
>> NIC: Mellanox ConnectX-6 Dx
>>
>> To get this performance gain, manual optimizations of function inlining
>> were performed. It's important to have mlx5e_sq_xmit_wqe inline,
>> otherwise the packet rate will be 1 Mpps less in UDP pktgen test.
>> __always_inline is required, because gcc uninlines it when it's called
>> from two places (mlx5e_xmit and mlx5e_sq_xmit_simple).
>>
>> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
>> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
>> ---
>>   .../net/ethernet/mellanox/mlx5/core/en/txrx.h |  63 +--
>>   .../mellanox/mlx5/core/en_accel/en_accel.h    |   5 +
>>   .../mellanox/mlx5/core/en_accel/tls_rxtx.c    |   6 +-
>>   .../net/ethernet/mellanox/mlx5/core/en_tx.c   | 391 ++++++++++--------
>>   4 files changed, 243 insertions(+), 222 deletions(-)
> 
> This combines a lot of changes. Including supposed noops, but with
> subtle changes, like converting to struct initializers.

Struct initializers are mostly used in the new code. I can split out the 
only converted occurrence.

> Probably deserves to be broken up a bit more.
> 
> For instance, a pure noop patch that moves
> mlx5e_txwqe_build_eseg_csum,

OK. Not sure I really need to move it though.

> a separate patch for
> mlx5e_tx_wqe_inline_mode (the change to which is not trivial in
> itself),

The change to this function is trivial:

-       if (mlx5e_transport_inline_tx_wqe(cseg))
+#ifdef CONFIG_MLX5_EN_TLS
+       if (accel && accel->tls.tls_tisn)
                 return MLX5_INLINE_MODE_TCP_UDP;
+#endif

I can't do this change in a separate patch, as `accel` is introduced by
this patch. I can move the function in a separate patch, though.

> introduction of mlx5e_sq_xmit_prepare, ..

While it's possible to introduce mlx5e_sq_xmit_prepare and
mlx5e_sq_calc_wqe_attr in separate patches, I don't think it makes sense
to do so. First, it's one logical change; second, such a separation
would produce a lot of churn like `ihs` -> `attr->ihs`, which would be
overwritten again in the second patch.

> Is, especially after this refactoring, mlx5e_xmit still considerably
> more complex than mlx5e_sq_xmit_simple? It does not look like that
> separate function is really necessary.

The purpose of the simple version is to be called in cases where the
driver produces an SKB by itself (e.g., in the kTLS offload), and we
don't want to run the acceleration offloads (including kTLS) or MPWQE
from such contexts. It's not about saving a few CPU cycles; it's about
making it safer by skipping all the code that shouldn't run in such
contexts anyway.
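
For reference, a rough side-by-side sketch of the call flow of the two
entry points as they end up in this patch (names taken from the diff;
error paths, stats and the later MPWQE hook are omitted):

  mlx5e_xmit (stack path)          mlx5e_sq_xmit_simple (driver SKBs)
  -----------------------          ----------------------------------
  mlx5e_accel_tx_begin             -
  mlx5e_sq_xmit_prepare            mlx5e_sq_xmit_prepare
  mlx5e_sq_calc_wqe_attr           mlx5e_sq_calc_wqe_attr
  mlx5e_txqsq_get_next_pi          mlx5e_txqsq_get_next_pi
  MLX5E_TX_FETCH_WQE               MLX5E_TX_FETCH_WQE
  mlx5e_accel_tx_finish            -
  mlx5e_txwqe_build_eseg_csum      mlx5e_txwqe_build_eseg_csum
  mlx5e_sq_xmit_wqe                mlx5e_sq_xmit_wqe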

> At least after this patch.

Well, if you look at the final version, the difference is more 
significant :)

> 
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
>> index 9334c9c3e208..d4ee22789ab0 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
>> @@ -41,8 +41,6 @@ void mlx5e_free_rx_in_progress_descs(struct mlx5e_rq *rq);
>>   u16 mlx5e_select_queue(struct net_device *dev, struct sk_buff *skb,
>>                         struct net_device *sb_dev);
>>   netdev_tx_t mlx5e_xmit(struct sk_buff *skb, struct net_device *dev);
>> -void mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>> -                  struct mlx5e_tx_wqe *wqe, u16 pi, bool xmit_more);
>>   bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget);
>>   void mlx5e_free_txqsq_descs(struct mlx5e_txqsq *sq);
>>
>> @@ -188,23 +186,6 @@ static inline u16 mlx5e_icosq_get_next_pi(struct mlx5e_icosq *sq, u16 size)
>>          return pi;
>>   }
>>
>> -static inline void
>> -mlx5e_fill_sq_frag_edge(struct mlx5e_txqsq *sq, struct mlx5_wq_cyc *wq,
>> -                       u16 pi, u16 nnops)
>> -{
>> -       struct mlx5e_tx_wqe_info *edge_wi, *wi = &sq->db.wqe_info[pi];
>> -
>> -       edge_wi = wi + nnops;
>> -
>> -       /* fill sq frag edge with nops to avoid wqe wrapping two pages */
>> -       for (; wi < edge_wi; wi++) {
>> -               memset(wi, 0, sizeof(*wi));
>> -               wi->num_wqebbs = 1;
>> -               mlx5e_post_nop(wq, sq->sqn, &sq->pc);
>> -       }
>> -       sq->stats->nop += nnops;
>> -}
>> -
>>   static inline void
>>   mlx5e_notify_hw(struct mlx5_wq_cyc *wq, u16 pc, void __iomem *uar_map,
>>                  struct mlx5_wqe_ctrl_seg *ctrl)
>> @@ -223,29 +204,6 @@ mlx5e_notify_hw(struct mlx5_wq_cyc *wq, u16 pc, void __iomem *uar_map,
>>          mlx5_write64((__be32 *)ctrl, uar_map);
>>   }
>>
>> -static inline bool mlx5e_transport_inline_tx_wqe(struct mlx5_wqe_ctrl_seg *cseg)
>> -{
>> -       return cseg && !!cseg->tis_tir_num;
>> -}
>> -
>> -static inline u8
>> -mlx5e_tx_wqe_inline_mode(struct mlx5e_txqsq *sq, struct mlx5_wqe_ctrl_seg *cseg,
>> -                        struct sk_buff *skb)
>> -{
>> -       u8 mode;
>> -
>> -       if (mlx5e_transport_inline_tx_wqe(cseg))
>> -               return MLX5_INLINE_MODE_TCP_UDP;
>> -
>> -       mode = sq->min_inline_mode;
>> -
>> -       if (skb_vlan_tag_present(skb) &&
>> -           test_bit(MLX5E_SQ_STATE_VLAN_NEED_L2_INLINE, &sq->state))
>> -               mode = max_t(u8, MLX5_INLINE_MODE_L2, mode);
>> -
>> -       return mode;
>> -}
>> -
>>   static inline void mlx5e_cq_arm(struct mlx5e_cq *cq)
>>   {
>>          struct mlx5_core_cq *mcq;
>> @@ -286,6 +244,27 @@ mlx5e_tx_dma_unmap(struct device *pdev, struct mlx5e_sq_dma *dma)
>>          }
>>   }
>>
>> +static inline void mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq,
>> +                                              struct sk_buff *skb,
>> +                                              struct mlx5_wqe_eth_seg *eseg)
>> +{
>> +       if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
>> +               eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM;
>> +               if (skb->encapsulation) {
>> +                       eseg->cs_flags |= MLX5_ETH_WQE_L3_INNER_CSUM |
>> +                                         MLX5_ETH_WQE_L4_INNER_CSUM;
>> +                       sq->stats->csum_partial_inner++;
>> +               } else {
>> +                       eseg->cs_flags |= MLX5_ETH_WQE_L4_CSUM;
>> +                       sq->stats->csum_partial++;
>> +               }
>> +       } else {
>> +               sq->stats->csum_none++;
>> +       }
>> +}
>> +
>> +void mlx5e_sq_xmit_simple(struct mlx5e_txqsq *sq, struct sk_buff *skb, bool xmit_more);
>> +
>>   static inline void mlx5e_rqwq_reset(struct mlx5e_rq *rq)
>>   {
>>          if (rq->wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ) {
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
>> index 110476bdeffb..23d4ef5ab9c5 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/en_accel.h
>> @@ -145,6 +145,11 @@ static inline bool mlx5e_accel_tx_finish(struct mlx5e_priv *priv,
>>          }
>>   #endif
>>
>> +#if IS_ENABLED(CONFIG_GENEVE)
>> +       if (skb->encapsulation)
>> +               mlx5e_tx_tunnel_accel(skb, &wqe->eth);
>> +#endif
>> +
>>          return true;
>>   }
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
>> index b0c31d49ff8d..c36560b3e93d 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/tls_rxtx.c
>> @@ -189,12 +189,10 @@ static bool mlx5e_tls_handle_ooo(struct mlx5e_tls_offload_context_tx *context,
>>                                   struct mlx5e_tls *tls)
>>   {
>>          u32 tcp_seq = ntohl(tcp_hdr(skb)->seq);
>> -       struct mlx5e_tx_wqe *wqe;
>>          struct sync_info info;
>>          struct sk_buff *nskb;
>>          int linear_len = 0;
>>          int headln;
>> -       u16 pi;
>>          int i;
>>
>>          sq->stats->tls_ooo++;
>> @@ -246,9 +244,7 @@ static bool mlx5e_tls_handle_ooo(struct mlx5e_tls_offload_context_tx *context,
>>          sq->stats->tls_resync_bytes += nskb->len;
>>          mlx5e_tls_complete_sync_skb(skb, nskb, tcp_seq, headln,
>>                                      cpu_to_be64(info.rcd_sn));
>> -       pi = mlx5_wq_cyc_ctr2ix(&sq->wq, sq->pc);
>> -       wqe = MLX5E_TX_FETCH_WQE(sq, pi);
>> -       mlx5e_sq_xmit(sq, nskb, wqe, pi, true);
>> +       mlx5e_sq_xmit_simple(sq, nskb, true);
>>
>>          return true;
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
>> index e15aa53ff83e..f967bc0573c0 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
>> @@ -144,23 +144,6 @@ static inline void mlx5e_insert_vlan(void *start, struct sk_buff *skb, u16 ihs)
>>          memcpy(&vhdr->h_vlan_encapsulated_proto, skb->data + cpy1_sz, cpy2_sz);
>>   }
>>
>> -static inline void
>> -mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb, struct mlx5_wqe_eth_seg *eseg)
>> -{
>> -       if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
>> -               eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM;
>> -               if (skb->encapsulation) {
>> -                       eseg->cs_flags |= MLX5_ETH_WQE_L3_INNER_CSUM |
>> -                                         MLX5_ETH_WQE_L4_INNER_CSUM;
>> -                       sq->stats->csum_partial_inner++;
>> -               } else {
>> -                       eseg->cs_flags |= MLX5_ETH_WQE_L4_CSUM;
>> -                       sq->stats->csum_partial++;
>> -               }
>> -       } else
>> -               sq->stats->csum_none++;
>> -}
>> -
>>   static inline u16
>>   mlx5e_tx_get_gso_ihs(struct mlx5e_txqsq *sq, struct sk_buff *skb)
>>   {
>> @@ -232,22 +215,121 @@ mlx5e_txwqe_build_dsegs(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>>          return -ENOMEM;
>>   }
>>
>> +struct mlx5e_tx_attr {
>> +       u32 num_bytes;
>> +       u16 headlen;
>> +       u16 ihs;
>> +       __be16 mss;
>> +       u8 opcode;
>> +};
>> +
>> +struct mlx5e_tx_wqe_attr {
>> +       u16 ds_cnt;
>> +       u16 ds_cnt_inl;
>> +       u8 num_wqebbs;
>> +};
>> +
>> +static inline u8
>> +mlx5e_tx_wqe_inline_mode(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>> +                        struct mlx5e_accel_tx_state *accel)
>> +{
>> +       u8 mode;
>> +
>> +#ifdef CONFIG_MLX5_EN_TLS
>> +       if (accel && accel->tls.tls_tisn)
>> +               return MLX5_INLINE_MODE_TCP_UDP;
>> +#endif
>> +
>> +       mode = sq->min_inline_mode;
>> +
>> +       if (skb_vlan_tag_present(skb) &&
>> +           test_bit(MLX5E_SQ_STATE_VLAN_NEED_L2_INLINE, &sq->state))
>> +               mode = max_t(u8, MLX5_INLINE_MODE_L2, mode);
>> +
>> +       return mode;
>> +}
>> +
>> +static inline void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>> +                                        struct mlx5e_accel_tx_state *accel,
>> +                                        struct mlx5e_tx_attr *attr)
>> +{
>> +       struct mlx5e_sq_stats *stats = sq->stats;
>> +
>> +       if (skb_is_gso(skb)) {
>> +               u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb);
>> +
>> +               *attr = (struct mlx5e_tx_attr) {
>> +                       .opcode    = MLX5_OPCODE_LSO,
>> +                       .mss       = cpu_to_be16(skb_shinfo(skb)->gso_size),
>> +                       .ihs       = ihs,
>> +                       .num_bytes = skb->len + (skb_shinfo(skb)->gso_segs - 1) * ihs,
>> +                       .headlen   = skb_headlen(skb) - ihs,
>> +               };
>> +
>> +               stats->packets += skb_shinfo(skb)->gso_segs;
>> +       } else {
>> +               u8 mode = mlx5e_tx_wqe_inline_mode(sq, skb, accel);
>> +               u16 ihs = mlx5e_calc_min_inline(mode, skb);
>> +
>> +               *attr = (struct mlx5e_tx_attr) {
>> +                       .opcode    = MLX5_OPCODE_SEND,
>> +                       .mss       = cpu_to_be16(0),
>> +                       .ihs       = ihs,
>> +                       .num_bytes = max_t(unsigned int, skb->len, ETH_ZLEN),
>> +                       .headlen   = skb_headlen(skb) - ihs,
>> +               };
>> +
>> +               stats->packets++;
>> +       }
>> +
>> +       stats->bytes += attr->num_bytes;
>> +}
>> +
>> +static inline void mlx5e_sq_calc_wqe_attr(struct sk_buff *skb,
>> +                                         const struct mlx5e_tx_attr *attr,
>> +                                         struct mlx5e_tx_wqe_attr *wqe_attr)
>> +{
>> +       u16 ds_cnt = sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS;
>> +       u16 ds_cnt_inl = 0;
>> +
>> +       ds_cnt += !!attr->headlen + skb_shinfo(skb)->nr_frags;
>> +
>> +       if (attr->ihs) {
>> +               u16 inl = attr->ihs - INL_HDR_START_SZ;
>> +
>> +               if (skb_vlan_tag_present(skb))
>> +                       inl += VLAN_HLEN;
>> +
>> +               ds_cnt_inl = DIV_ROUND_UP(inl, MLX5_SEND_WQE_DS);
>> +               ds_cnt += ds_cnt_inl;
>> +       }
>> +
>> +       *wqe_attr = (struct mlx5e_tx_wqe_attr) {
>> +               .ds_cnt     = ds_cnt,
>> +               .ds_cnt_inl = ds_cnt_inl,
>> +               .num_wqebbs = DIV_ROUND_UP(ds_cnt, MLX5_SEND_WQEBB_NUM_DS),
>> +       };
>> +}
>> +
>>   static inline void
>>   mlx5e_txwqe_complete(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>> -                    u8 opcode, u16 ds_cnt, u8 num_wqebbs, u32 num_bytes, u8 num_dma,
>> +                    const struct mlx5e_tx_attr *attr,
>> +                    const struct mlx5e_tx_wqe_attr *wqe_attr, u8 num_dma,
>>                       struct mlx5e_tx_wqe_info *wi, struct mlx5_wqe_ctrl_seg *cseg,
>>                       bool xmit_more)
>>   {
>>          struct mlx5_wq_cyc *wq = &sq->wq;
>>          bool send_doorbell;
>>
>> -       wi->num_bytes = num_bytes;
>> -       wi->num_dma = num_dma;
>> -       wi->num_wqebbs = num_wqebbs;
>> -       wi->skb = skb;
>> +       *wi = (struct mlx5e_tx_wqe_info) {
>> +               .skb = skb,
>> +               .num_bytes = attr->num_bytes,
>> +               .num_dma = num_dma,
>> +               .num_wqebbs = wqe_attr->num_wqebbs,
>> +       };
>>
>> -       cseg->opmod_idx_opcode = cpu_to_be32((sq->pc << 8) | opcode);
>> -       cseg->qpn_ds           = cpu_to_be32((sq->sqn << 8) | ds_cnt);
>> +       cseg->opmod_idx_opcode = cpu_to_be32((sq->pc << 8) | attr->opcode);
>> +       cseg->qpn_ds           = cpu_to_be32((sq->sqn << 8) | wqe_attr->ds_cnt);
>>
>>          if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP))
>>                  skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
>> @@ -258,105 +340,44 @@ mlx5e_txwqe_complete(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>>                  sq->stats->stopped++;
>>          }
>>
>> -       send_doorbell = __netdev_tx_sent_queue(sq->txq, num_bytes,
>> -                                              xmit_more);
>> +       send_doorbell = __netdev_tx_sent_queue(sq->txq, attr->num_bytes, xmit_more);
>>          if (send_doorbell)
>>                  mlx5e_notify_hw(wq, sq->pc, sq->uar_map, cseg);
>>   }
>>
>> -void mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>> -                  struct mlx5e_tx_wqe *wqe, u16 pi, bool xmit_more)
>> +static __always_inline void
>> +mlx5e_sq_xmit_wqe(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>> +                 const struct mlx5e_tx_attr *attr, const struct mlx5e_tx_wqe_attr *wqe_attr,
>> +                 struct mlx5e_tx_wqe *wqe, u16 pi, bool xmit_more)
>>   {
>> -       struct mlx5_wq_cyc *wq = &sq->wq;
>>          struct mlx5_wqe_ctrl_seg *cseg;
>>          struct mlx5_wqe_eth_seg  *eseg;
>>          struct mlx5_wqe_data_seg *dseg;
>>          struct mlx5e_tx_wqe_info *wi;
>>
>>          struct mlx5e_sq_stats *stats = sq->stats;
>> -       u16 headlen, ihs, contig_wqebbs_room;
>> -       u16 ds_cnt, ds_cnt_inl = 0;
>> -       u8 num_wqebbs, opcode;
>> -       u32 num_bytes;
>>          int num_dma;
>> -       __be16 mss;
>>
>> -       /* Calc ihs and ds cnt, no writes to wqe yet */
>> -       ds_cnt = sizeof(*wqe) / MLX5_SEND_WQE_DS;
>> -       if (skb_is_gso(skb)) {
>> -               opcode    = MLX5_OPCODE_LSO;
>> -               mss       = cpu_to_be16(skb_shinfo(skb)->gso_size);
>> -               ihs       = mlx5e_tx_get_gso_ihs(sq, skb);
>> -               num_bytes = skb->len + (skb_shinfo(skb)->gso_segs - 1) * ihs;
>> -               stats->packets += skb_shinfo(skb)->gso_segs;
>> -       } else {
>> -               u8 mode = mlx5e_tx_wqe_inline_mode(sq, &wqe->ctrl, skb);
>> -
>> -               opcode    = MLX5_OPCODE_SEND;
>> -               mss       = 0;
>> -               ihs       = mlx5e_calc_min_inline(mode, skb);
>> -               num_bytes = max_t(unsigned int, skb->len, ETH_ZLEN);
>> -               stats->packets++;
>> -       }
>> -
>> -       stats->bytes     += num_bytes;
>>          stats->xmit_more += xmit_more;
>>
>> -       headlen = skb->len - ihs - skb->data_len;
>> -       ds_cnt += !!headlen;
>> -       ds_cnt += skb_shinfo(skb)->nr_frags;
>> -
>> -       if (ihs) {
>> -               u16 inl = ihs + !!skb_vlan_tag_present(skb) * VLAN_HLEN - INL_HDR_START_SZ;
>> -
>> -               ds_cnt_inl = DIV_ROUND_UP(inl, MLX5_SEND_WQE_DS);
>> -               ds_cnt += ds_cnt_inl;
>> -       }
>> -
>> -       num_wqebbs = DIV_ROUND_UP(ds_cnt, MLX5_SEND_WQEBB_NUM_DS);
>> -       contig_wqebbs_room = mlx5_wq_cyc_get_contig_wqebbs(wq, pi);
>> -       if (unlikely(contig_wqebbs_room < num_wqebbs)) {
>> -#ifdef CONFIG_MLX5_EN_IPSEC
>> -               struct mlx5_wqe_eth_seg cur_eth = wqe->eth;
>> -#endif
>> -#ifdef CONFIG_MLX5_EN_TLS
>> -               struct mlx5_wqe_ctrl_seg cur_ctrl = wqe->ctrl;
>> -#endif
>> -               mlx5e_fill_sq_frag_edge(sq, wq, pi, contig_wqebbs_room);
>> -               pi = mlx5_wq_cyc_ctr2ix(wq, sq->pc);
>> -               wqe = MLX5E_TX_FETCH_WQE(sq, pi);
>> -#ifdef CONFIG_MLX5_EN_IPSEC
>> -               wqe->eth = cur_eth;
>> -#endif
>> -#ifdef CONFIG_MLX5_EN_TLS
>> -               wqe->ctrl = cur_ctrl;
>> -#endif
>> -       }
>> -
>>          /* fill wqe */
>>          wi   = &sq->db.wqe_info[pi];
>>          cseg = &wqe->ctrl;
>>          eseg = &wqe->eth;
>>          dseg =  wqe->data;
>>
>> -#if IS_ENABLED(CONFIG_GENEVE)
>> -       if (skb->encapsulation)
>> -               mlx5e_tx_tunnel_accel(skb, eseg);
>> -#endif
>> -       mlx5e_txwqe_build_eseg_csum(sq, skb, eseg);
>> -
>> -       eseg->mss = mss;
>> +       eseg->mss = attr->mss;
>>
>> -       if (ihs) {
>> +       if (attr->ihs) {
>>                  if (skb_vlan_tag_present(skb)) {
>> -                       eseg->inline_hdr.sz = cpu_to_be16(ihs + VLAN_HLEN);
>> -                       mlx5e_insert_vlan(eseg->inline_hdr.start, skb, ihs);
>> +                       eseg->inline_hdr.sz = cpu_to_be16(attr->ihs + VLAN_HLEN);
>> +                       mlx5e_insert_vlan(eseg->inline_hdr.start, skb, attr->ihs);
>>                          stats->added_vlan_packets++;
>>                  } else {
>> -                       eseg->inline_hdr.sz = cpu_to_be16(ihs);
>> -                       memcpy(eseg->inline_hdr.start, skb->data, ihs);
>> +                       eseg->inline_hdr.sz = cpu_to_be16(attr->ihs);
>> +                       memcpy(eseg->inline_hdr.start, skb->data, attr->ihs);
>>                  }
>> -               dseg += ds_cnt_inl;
>> +               dseg += wqe_attr->ds_cnt_inl;
>>          } else if (skb_vlan_tag_present(skb)) {
>>                  eseg->insert.type = cpu_to_be16(MLX5_ETH_WQE_INSERT_VLAN);
>>                  if (skb->vlan_proto == cpu_to_be16(ETH_P_8021AD))
>> @@ -365,12 +386,12 @@ void mlx5e_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>>                  stats->added_vlan_packets++;
>>          }
>>
>> -       num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + ihs, headlen, dseg);
>> +       num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr->ihs,
>> +                                         attr->headlen, dseg);
>>          if (unlikely(num_dma < 0))
>>                  goto err_drop;
>>
>> -       mlx5e_txwqe_complete(sq, skb, opcode, ds_cnt, num_wqebbs, num_bytes,
>> -                            num_dma, wi, cseg, xmit_more);
>> +       mlx5e_txwqe_complete(sq, skb, attr, wqe_attr, num_dma, wi, cseg, xmit_more);
>>
>>          return;
>>
>> @@ -383,6 +404,8 @@ netdev_tx_t mlx5e_xmit(struct sk_buff *skb, struct net_device *dev)
>>   {
>>          struct mlx5e_priv *priv = netdev_priv(dev);
>>          struct mlx5e_accel_tx_state accel = {};
>> +       struct mlx5e_tx_wqe_attr wqe_attr;
>> +       struct mlx5e_tx_attr attr;
>>          struct mlx5e_tx_wqe *wqe;
>>          struct mlx5e_txqsq *sq;
>>          u16 pi;
>> @@ -393,19 +416,64 @@ netdev_tx_t mlx5e_xmit(struct sk_buff *skb, struct net_device *dev)
>>          if (unlikely(!mlx5e_accel_tx_begin(dev, sq, skb, &accel)))
>>                  goto out;
>>
>> -       pi = mlx5_wq_cyc_ctr2ix(&sq->wq, sq->pc);
>> +       mlx5e_sq_xmit_prepare(sq, skb, &accel, &attr);
>> +       mlx5e_sq_calc_wqe_attr(skb, &attr, &wqe_attr);
>> +       pi = mlx5e_txqsq_get_next_pi(sq, wqe_attr.num_wqebbs);
>>          wqe = MLX5E_TX_FETCH_WQE(sq, pi);
>>
>>          /* May update the WQE, but may not post other WQEs. */
>>          if (unlikely(!mlx5e_accel_tx_finish(priv, sq, skb, wqe, &accel)))
>>                  goto out;
>>
>> -       mlx5e_sq_xmit(sq, skb, wqe, pi, netdev_xmit_more());
>> +       mlx5e_txwqe_build_eseg_csum(sq, skb, &wqe->eth);
>> +       mlx5e_sq_xmit_wqe(sq, skb, &attr, &wqe_attr, wqe, pi, netdev_xmit_more());
>>
>>   out:
>>          return NETDEV_TX_OK;
>>   }
>>
>> +void mlx5e_sq_xmit_simple(struct mlx5e_txqsq *sq, struct sk_buff *skb, bool xmit_more)
>> +{
>> +       struct mlx5e_tx_wqe_attr wqe_attr;
>> +       struct mlx5e_tx_attr attr;
>> +       struct mlx5e_tx_wqe *wqe;
>> +       u16 pi;
>> +
>> +       mlx5e_sq_xmit_prepare(sq, skb, NULL, &attr);
>> +       mlx5e_sq_calc_wqe_attr(skb, &attr, &wqe_attr);
>> +       pi = mlx5e_txqsq_get_next_pi(sq, wqe_attr.num_wqebbs);
>> +       wqe = MLX5E_TX_FETCH_WQE(sq, pi);
>> +       mlx5e_txwqe_build_eseg_csum(sq, skb, &wqe->eth);
>> +       mlx5e_sq_xmit_wqe(sq, skb, &attr, &wqe_attr, wqe, pi, xmit_more);
>> +}
>> +
>> +static inline void mlx5e_tx_wi_dma_unmap(struct mlx5e_txqsq *sq,
>> +                                        struct mlx5e_tx_wqe_info *wi,
>> +                                        u32 *dma_fifo_cc)
>> +{
>> +       int i;
>> +
>> +       for (i = 0; i < wi->num_dma; i++) {
>> +               struct mlx5e_sq_dma *dma = mlx5e_dma_get(sq, (*dma_fifo_cc)++);
>> +
>> +               mlx5e_tx_dma_unmap(sq->pdev, dma);
>> +       }
>> +}
>> +
>> +static inline void mlx5e_consume_skb(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>> +                                    struct mlx5_cqe64 *cqe, int napi_budget)
>> +{
>> +       if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP)) {
>> +               struct skb_shared_hwtstamps hwts = {};
>> +               u64 ts = get_cqe_ts(cqe);
>> +
>> +               hwts.hwtstamp = mlx5_timecounter_cyc2time(sq->clock, ts);
>> +               skb_tstamp_tx(skb, &hwts);
>> +       }
>> +
>> +       napi_consume_skb(skb, napi_budget);
>> +}
>> +
>>   bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
>>   {
>>          struct mlx5e_sq_stats *stats;
>> @@ -452,7 +520,6 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
>>
>>                  do {
>>                          struct sk_buff *skb;
>> -                       int j;
>>
>>                          last_wqe = (sqcc == wqe_counter);
>>
>> @@ -460,33 +527,18 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
>>                          wi = &sq->db.wqe_info[ci];
>>                          skb = wi->skb;
>>
>> +                       sqcc += wi->num_wqebbs;
>> +
>>                          if (unlikely(!skb)) {
>>                                  mlx5e_ktls_tx_handle_resync_dump_comp(sq, wi, &dma_fifo_cc);
>> -                               sqcc += wi->num_wqebbs;
>>                                  continue;
>>                          }
>>
>> -                       if (unlikely(skb_shinfo(skb)->tx_flags &
>> -                                    SKBTX_HW_TSTAMP)) {
>> -                               struct skb_shared_hwtstamps hwts = {};
>> -
>> -                               hwts.hwtstamp =
>> -                                       mlx5_timecounter_cyc2time(sq->clock,
>> -                                                                 get_cqe_ts(cqe));
>> -                               skb_tstamp_tx(skb, &hwts);
>> -                       }
>> -
>> -                       for (j = 0; j < wi->num_dma; j++) {
>> -                               struct mlx5e_sq_dma *dma =
>> -                                       mlx5e_dma_get(sq, dma_fifo_cc++);
>> -
>> -                               mlx5e_tx_dma_unmap(sq->pdev, dma);
>> -                       }
>> +                       mlx5e_tx_wi_dma_unmap(sq, wi, &dma_fifo_cc);
>> +                       mlx5e_consume_skb(sq, wi->skb, cqe, napi_budget);
>>
>>                          npkts++;
>>                          nbytes += wi->num_bytes;
>> -                       sqcc += wi->num_wqebbs;
>> -                       napi_consume_skb(skb, napi_budget);
>>                  } while (!last_wqe);
>>
>>                  if (unlikely(get_cqe_opcode(cqe) == MLX5_CQE_REQ_ERR)) {
>> @@ -531,7 +583,6 @@ void mlx5e_free_txqsq_descs(struct mlx5e_txqsq *sq)
>>          u32 dma_fifo_cc, nbytes = 0;
>>          u16 ci, sqcc, npkts = 0;
>>          struct sk_buff *skb;
>> -       int i;
>>
>>          sqcc = sq->cc;
>>          dma_fifo_cc = sq->dma_fifo_cc;
>> @@ -541,23 +592,18 @@ void mlx5e_free_txqsq_descs(struct mlx5e_txqsq *sq)
>>                  wi = &sq->db.wqe_info[ci];
>>                  skb = wi->skb;
>>
>> +               sqcc += wi->num_wqebbs;
>> +
>>                  if (!skb) {
>>                          mlx5e_ktls_tx_handle_resync_dump_comp(sq, wi, &dma_fifo_cc);
>> -                       sqcc += wi->num_wqebbs;
>>                          continue;
>>                  }
>>
>> -               for (i = 0; i < wi->num_dma; i++) {
>> -                       struct mlx5e_sq_dma *dma =
>> -                               mlx5e_dma_get(sq, dma_fifo_cc++);
>> -
>> -                       mlx5e_tx_dma_unmap(sq->pdev, dma);
>> -               }
>> -
>> +               mlx5e_tx_wi_dma_unmap(sq, wi, &dma_fifo_cc);
>>                  dev_kfree_skb_any(skb);
>> +
>>                  npkts++;
>>                  nbytes += wi->num_bytes;
>> -               sqcc += wi->num_wqebbs;
>>          }
>>
>>          sq->dma_fifo_cc = dma_fifo_cc;
>> @@ -576,9 +622,34 @@ mlx5i_txwqe_build_datagram(struct mlx5_av *av, u32 dqpn, u32 dqkey,
>>          dseg->av.key.qkey.qkey = cpu_to_be32(dqkey);
>>   }
>>
>> +static void mlx5i_sq_calc_wqe_attr(struct sk_buff *skb,
>> +                                  const struct mlx5e_tx_attr *attr,
>> +                                  struct mlx5e_tx_wqe_attr *wqe_attr)
>> +{
>> +       u16 ds_cnt = sizeof(struct mlx5i_tx_wqe) / MLX5_SEND_WQE_DS;
>> +       u16 ds_cnt_inl = 0;
>> +
>> +       ds_cnt += !!attr->headlen + skb_shinfo(skb)->nr_frags;
>> +
>> +       if (attr->ihs) {
>> +               u16 inl = attr->ihs - INL_HDR_START_SZ;
>> +
>> +               ds_cnt_inl = DIV_ROUND_UP(inl, MLX5_SEND_WQE_DS);
>> +               ds_cnt += ds_cnt_inl;
>> +       }
>> +
>> +       *wqe_attr = (struct mlx5e_tx_wqe_attr) {
>> +               .ds_cnt     = ds_cnt,
>> +               .ds_cnt_inl = ds_cnt_inl,
>> +               .num_wqebbs = DIV_ROUND_UP(ds_cnt, MLX5_SEND_WQEBB_NUM_DS),
>> +       };
>> +}
>> +
>>   void mlx5i_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>>                     struct mlx5_av *av, u32 dqpn, u32 dqkey, bool xmit_more)
>>   {
>> +       struct mlx5e_tx_wqe_attr wqe_attr;
>> +       struct mlx5e_tx_attr attr;
>>          struct mlx5i_tx_wqe *wqe;
>>
>>          struct mlx5_wqe_datagram_seg *datagram;
>> @@ -588,47 +659,17 @@ void mlx5i_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>>          struct mlx5e_tx_wqe_info *wi;
>>
>>          struct mlx5e_sq_stats *stats = sq->stats;
>> -       u16 ds_cnt, ds_cnt_inl = 0;
>> -       u8 num_wqebbs, opcode;
>> -       u16 headlen, ihs, pi;
>> -       u32 num_bytes;
>>          int num_dma;
>> -       __be16 mss;
>> +       u16 pi;
>>
>> -       /* Calc ihs and ds cnt, no writes to wqe yet */
>> -       ds_cnt = sizeof(*wqe) / MLX5_SEND_WQE_DS;
>> -       if (skb_is_gso(skb)) {
>> -               opcode    = MLX5_OPCODE_LSO;
>> -               mss       = cpu_to_be16(skb_shinfo(skb)->gso_size);
>> -               ihs       = mlx5e_tx_get_gso_ihs(sq, skb);
>> -               num_bytes = skb->len + (skb_shinfo(skb)->gso_segs - 1) * ihs;
>> -               stats->packets += skb_shinfo(skb)->gso_segs;
>> -       } else {
>> -               u8 mode = mlx5e_tx_wqe_inline_mode(sq, NULL, skb);
>> +       mlx5e_sq_xmit_prepare(sq, skb, NULL, &attr);
>> +       mlx5i_sq_calc_wqe_attr(skb, &attr, &wqe_attr);
>>
>> -               opcode    = MLX5_OPCODE_SEND;
>> -               mss       = 0;
>> -               ihs       = mlx5e_calc_min_inline(mode, skb);
>> -               num_bytes = max_t(unsigned int, skb->len, ETH_ZLEN);
>> -               stats->packets++;
>> -       }
>> +       pi = mlx5e_txqsq_get_next_pi(sq, wqe_attr.num_wqebbs);
>> +       wqe = MLX5I_SQ_FETCH_WQE(sq, pi);
>>
>> -       stats->bytes     += num_bytes;
>>          stats->xmit_more += xmit_more;
>>
>> -       headlen = skb->len - ihs - skb->data_len;
>> -       ds_cnt += !!headlen;
>> -       ds_cnt += skb_shinfo(skb)->nr_frags;
>> -
>> -       if (ihs) {
>> -               ds_cnt_inl = DIV_ROUND_UP(ihs - INL_HDR_START_SZ, MLX5_SEND_WQE_DS);
>> -               ds_cnt += ds_cnt_inl;
>> -       }
>> -
>> -       num_wqebbs = DIV_ROUND_UP(ds_cnt, MLX5_SEND_WQEBB_NUM_DS);
>> -       pi = mlx5e_txqsq_get_next_pi(sq, num_wqebbs);
>> -       wqe = MLX5I_SQ_FETCH_WQE(sq, pi);
>> -
>>          /* fill wqe */
>>          wi       = &sq->db.wqe_info[pi];
>>          cseg     = &wqe->ctrl;
>> @@ -640,20 +681,20 @@ void mlx5i_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>>
>>          mlx5e_txwqe_build_eseg_csum(sq, skb, eseg);
>>
>> -       eseg->mss = mss;
>> +       eseg->mss = attr.mss;
>>
>> -       if (ihs) {
>> -               memcpy(eseg->inline_hdr.start, skb->data, ihs);
>> -               eseg->inline_hdr.sz = cpu_to_be16(ihs);
>> -               dseg += ds_cnt_inl;
>> +       if (attr.ihs) {
>> +               memcpy(eseg->inline_hdr.start, skb->data, attr.ihs);
>> +               eseg->inline_hdr.sz = cpu_to_be16(attr.ihs);
>> +               dseg += wqe_attr.ds_cnt_inl;
>>          }
>>
>> -       num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + ihs, headlen, dseg);
>> +       num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr.ihs,
>> +                                         attr.headlen, dseg);
>>          if (unlikely(num_dma < 0))
>>                  goto err_drop;
>>
>> -       mlx5e_txwqe_complete(sq, skb, opcode, ds_cnt, num_wqebbs, num_bytes,
>> -                            num_dma, wi, cseg, xmit_more);
>> +       mlx5e_txwqe_complete(sq, skb, &attr, &wqe_attr, num_dma, wi, cseg, xmit_more);
>>
>>          return;
>>
>> --
>> 2.26.2
>>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [net-next 04/10] net/mlx5e: Unify constants for WQE_EMPTY_DS_COUNT
  2020-09-04 15:05   ` Willem de Bruijn
@ 2020-09-08  8:59     ` Maxim Mikityanskiy
  2020-09-08  9:06       ` Willem de Bruijn
  0 siblings, 1 reply; 23+ messages in thread
From: Maxim Mikityanskiy @ 2020-09-08  8:59 UTC (permalink / raw)
  To: Willem de Bruijn, Saeed Mahameed
  Cc: David S. Miller, Jakub Kicinski, Network Development, Maxim Mikityanskiy

On 2020-09-04 18:05, Willem de Bruijn wrote:
> On Thu, Sep 3, 2020 at 11:01 PM Saeed Mahameed <saeedm@nvidia.com> wrote:
>>
>> From: Maxim Mikityanskiy <maximmi@mellanox.com>
>>
>> A constant for the number of DS in an empty WQE (i.e. a WQE without data
>> segments) is needed in multiple places (normal TX data path, MPWQE in
>> XDP), but currently we have a constant for XDP and an inline formula in
>> normal TX. This patch introduces a common constant.
>>
>> Additionally, mlx5e_xdp_mpwqe_session_start is converted to use struct
>> assignment, because the code nearby is touched.
>>
>> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
>> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
>> ---
>>   .../net/ethernet/mellanox/mlx5/core/en/txrx.h |  2 ++
>>   .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 13 +++++++-----
>>   .../net/ethernet/mellanox/mlx5/core/en/xdp.h  | 21 +++++++------------
>>   .../net/ethernet/mellanox/mlx5/core/en_tx.c   |  2 +-
>>   4 files changed, 19 insertions(+), 19 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
>> index d4ee22789ab0..155b89998891 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
>> @@ -7,6 +7,8 @@
>>   #include "en.h"
>>   #include <linux/indirect_call_wrapper.h>
>>
>> +#define MLX5E_TX_WQE_EMPTY_DS_COUNT (sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS)
>> +
> 
> Out of curiosity, what is the logic for dividing this struct by 16?

The hardware needs the size of a WQE in DS units (one DS is 16 bytes). An 
empty WQE takes 2 DS (for the ctrl and eth segments), and this macro gives 
that initial size of an empty WQE (2 DS). As data segments are added to the 
WQE, it grows, and its size in DS grows with it.
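
Purely to illustrate the arithmetic, here is a minimal userspace sketch 
(the segment layouts are stubbed to 16 bytes each to match the real ctrl 
and eth segments; they are not the actual mlx5 headers):

	#include <stdio.h>

	#define MLX5_SEND_WQE_DS 16u	/* one data segment unit is 16 bytes */

	/* Stubs standing in for the real mlx5 segment layouts. */
	struct mlx5_wqe_ctrl_seg { unsigned char raw[16]; };
	struct mlx5_wqe_eth_seg  { unsigned char raw[16]; };
	struct mlx5_wqe_data_seg { unsigned char raw[16]; };

	struct mlx5e_tx_wqe {
		struct mlx5_wqe_ctrl_seg ctrl;
		struct mlx5_wqe_eth_seg  eth;
		struct mlx5_wqe_data_seg data[];	/* contributes 0 to sizeof */
	};

	#define MLX5E_TX_WQE_EMPTY_DS_COUNT \
		(sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS)

	int main(void)
	{
		/* (16 + 16) bytes / 16 bytes per DS == 2 DS for an empty WQE */
		printf("empty WQE = %zu DS\n", MLX5E_TX_WQE_EMPTY_DS_COUNT);
		return 0;
	}

Each data segment added afterwards then adds one more DS to that count.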

> struct mlx5e_tx_wqe {
>          struct mlx5_wqe_ctrl_seg ctrl;
>          struct mlx5_wqe_eth_seg  eth;
>          struct mlx5_wqe_data_seg data[0];
> };
> 
>>   #define INL_HDR_START_SZ (sizeof(((struct mlx5_wqe_eth_seg *)NULL)->inline_hdr.start))
>>
>>   enum mlx5e_icosq_wqe_type {
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
>> index 7fccd2ea7dc9..81cd9a04bcb0 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
>> @@ -196,16 +196,19 @@ static void mlx5e_xdp_mpwqe_session_start(struct mlx5e_xdpsq *sq)
>>   {
>>          struct mlx5e_xdp_mpwqe *session = &sq->mpwqe;
>>          struct mlx5e_xdpsq_stats *stats = sq->stats;
>> +       struct mlx5e_tx_wqe *wqe;
>>          u16 pi;
>>
>>          pi = mlx5e_xdpsq_get_next_pi(sq, MLX5E_XDP_MPW_MAX_WQEBBS);
>> -       session->wqe = MLX5E_TX_FETCH_WQE(sq, pi);
>> -
>> +       wqe = MLX5E_TX_FETCH_WQE(sq, pi);
>>          net_prefetchw(session->wqe->data);
> 
> Is this prefetch still valid?

It should be:

net_prefetchw(wqe->data);

Probably a bad rebase.

> And is the temporary variable wqe still
> needed at all?

We want to prefetch as early as possible (before filling *session).
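
Putting the two points together, a sketch of how the corrected function 
would read (assembled from the hunk above; the only intended change is 
prefetching through the local wqe pointer before *session is overwritten):

	static void mlx5e_xdp_mpwqe_session_start(struct mlx5e_xdpsq *sq)
	{
		struct mlx5e_xdp_mpwqe *session = &sq->mpwqe;
		struct mlx5e_xdpsq_stats *stats = sq->stats;
		struct mlx5e_tx_wqe *wqe;
		u16 pi;

		pi = mlx5e_xdpsq_get_next_pi(sq, MLX5E_XDP_MPW_MAX_WQEBBS);
		wqe = MLX5E_TX_FETCH_WQE(sq, pi);
		net_prefetchw(wqe->data);	/* prefetch before *session is rewritten */

		*session = (struct mlx5e_xdp_mpwqe) {
			.wqe = wqe,
			.ds_count = MLX5E_TX_WQE_EMPTY_DS_COUNT,
			.pkt_count = 0,
			.inline_on = mlx5e_xdp_get_inline_state(sq, session->inline_on),
		};

		stats->mpwqe++;
	}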

> 
>> -       session->ds_count  = MLX5E_XDP_TX_EMPTY_DS_COUNT;
>> -       session->pkt_count = 0;
>>
>> -       mlx5e_xdp_update_inline_state(sq);
>> +       *session = (struct mlx5e_xdp_mpwqe) {
>> +               .wqe = wqe,
>> +               .ds_count = MLX5E_TX_WQE_EMPTY_DS_COUNT,
>> +               .pkt_count = 0,
>> +               .inline_on = mlx5e_xdp_get_inline_state(sq, session->inline_on),
>> +       };
>>
>>          stats->mpwqe++;
>>   }



* Re: [net-next 09/10] net/mlx5e: Move TX code into functions to be used by MPWQE
  2020-09-04 15:06   ` Willem de Bruijn
@ 2020-09-08  8:59     ` Maxim Mikityanskiy
  2020-09-08  9:04       ` Willem de Bruijn
  0 siblings, 1 reply; 23+ messages in thread
From: Maxim Mikityanskiy @ 2020-09-08  8:59 UTC (permalink / raw)
  To: Willem de Bruijn, Saeed Mahameed
  Cc: David S. Miller, Jakub Kicinski, Network Development, Maxim Mikityanskiy

On 2020-09-04 18:06, Willem de Bruijn wrote:
> On Thu, Sep 3, 2020 at 11:01 PM Saeed Mahameed <saeedm@nvidia.com> wrote:
>>
>> From: Maxim Mikityanskiy <maximmi@mellanox.com>
>>
>> mlx5e_txwqe_complete performs some actions that can be taken to separate
>> functions:
>>
>> 1. Update the flags needed for hardware timestamping.
>>
>> 2. Stop the TX queue if it's full.
>>
>> Take these actions into separate functions to be reused by the MPWQE
>> code in the following commit and to maintain clear responsibilities of
>> functions.
>>
>> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
>> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
>> ---
>>   .../net/ethernet/mellanox/mlx5/core/en_tx.c   | 23 ++++++++++++++-----
>>   1 file changed, 17 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
>> index 9ced350150b3..3b68c8333875 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
>> @@ -311,6 +311,20 @@ static inline void mlx5e_sq_calc_wqe_attr(struct sk_buff *skb,
>>          };
>>   }
>>
>> +static inline void mlx5e_tx_skb_update_hwts_flags(struct sk_buff *skb)
>> +{
>> +       if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP))
>> +               skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
>> +}
> 
> Subjective, but this helper adds a level of indirection and introduces
> code churn without simplifying anything, imho.

It's added for the sake of being reused in non-MPWQE and MPWQE flows.
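
If it helps to picture that reuse, here is a toy, userspace-compilable 
model of the same split (types and flag values are stubbed for 
illustration only; the real MPWQE caller lands in the next patch):

	#include <stdio.h>

	#define SKBTX_HW_TSTAMP   (1 << 0)	/* stubbed flag values */
	#define SKBTX_IN_PROGRESS (1 << 2)

	struct sk_buff { unsigned int tx_flags; };	/* stubbed type */

	/* The two-line timestamp update lives in one helper... */
	static inline void tx_skb_update_hwts_flags(struct sk_buff *skb)
	{
		if (skb->tx_flags & SKBTX_HW_TSTAMP)
			skb->tx_flags |= SKBTX_IN_PROGRESS;
	}

	/* ...and both the per-SKB path and the MPWQE path call it. */
	static void txwqe_complete(struct sk_buff *skb)
	{
		tx_skb_update_hwts_flags(skb);
	}

	static void tx_mpwqe_add_skb(struct sk_buff *skb)
	{
		tx_skb_update_hwts_flags(skb);
	}

	int main(void)
	{
		struct sk_buff a = { .tx_flags = SKBTX_HW_TSTAMP };
		struct sk_buff b = { .tx_flags = SKBTX_HW_TSTAMP };

		txwqe_complete(&a);
		tx_mpwqe_add_skb(&b);
		printf("%d %d\n", !!(a.tx_flags & SKBTX_IN_PROGRESS),
		       !!(b.tx_flags & SKBTX_IN_PROGRESS));
		return 0;
	}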

>> +static inline void mlx5e_tx_check_stop(struct mlx5e_txqsq *sq)
>> +{
>> +       if (unlikely(!mlx5e_wqc_has_room_for(&sq->wq, sq->cc, sq->pc, sq->stop_room))) {
>> +               netif_tx_stop_queue(sq->txq);
>> +               sq->stats->stopped++;
>> +       }
>> +}
>> +
>>   static inline void
>>   mlx5e_txwqe_complete(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>>                       const struct mlx5e_tx_attr *attr,
>> @@ -332,14 +346,11 @@ mlx5e_txwqe_complete(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>>          cseg->opmod_idx_opcode = cpu_to_be32((sq->pc << 8) | attr->opcode);
>>          cseg->qpn_ds           = cpu_to_be32((sq->sqn << 8) | wqe_attr->ds_cnt);
>>
>> -       if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP))
>> -               skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
>> +       mlx5e_tx_skb_update_hwts_flags(skb);
>>
>>          sq->pc += wi->num_wqebbs;
>> -       if (unlikely(!mlx5e_wqc_has_room_for(wq, sq->cc, sq->pc, sq->stop_room))) {
>> -               netif_tx_stop_queue(sq->txq);
>> -               sq->stats->stopped++;
>> -       }
>> +
>> +       mlx5e_tx_check_stop(sq);
>>
>>          send_doorbell = __netdev_tx_sent_queue(sq->txq, attr->num_bytes, xmit_more);
>>          if (send_doorbell)
>> --
>> 2.26.2
>>



* Re: [net-next 06/10] net/mlx5e: Support multiple SKBs in a TX WQE
  2020-09-03 22:46   ` Jakub Kicinski
@ 2020-09-08  8:59     ` Maxim Mikityanskiy
  2020-09-08 18:07       ` Jakub Kicinski
  0 siblings, 1 reply; 23+ messages in thread
From: Maxim Mikityanskiy @ 2020-09-08  8:59 UTC (permalink / raw)
  To: Jakub Kicinski, Saeed Mahameed
  Cc: David S. Miller, netdev, Maxim Mikityanskiy, Tariq Toukan

On 2020-09-04 01:46, Jakub Kicinski wrote:
> On Thu, 3 Sep 2020 14:00:18 -0700 Saeed Mahameed wrote:
>> +static inline void mlx5e_tx_wi_consume_fifo_skbs(struct mlx5e_txqsq *sq,
>> +						 struct mlx5e_tx_wqe_info *wi,
>> +						 struct mlx5_cqe64 *cqe,
>> +						 int napi_budget)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < wi->num_fifo_pkts; i++) {
>> +		struct sk_buff *skb = mlx5e_skb_fifo_pop(sq);
>> +
>> +		mlx5e_consume_skb(sq, skb, cqe, napi_budget);
>> +	}
>> +}
> 
> The compiler was not inlining this one either?

Regarding this one, gcc inlines it automatically, but I went on the safe 
side and inlined it explicitly - it's small and called for every WQE, so 
we never want it to be non-inline.


* Re: [net-next 09/10] net/mlx5e: Move TX code into functions to be used by MPWQE
  2020-09-08  8:59     ` Maxim Mikityanskiy
@ 2020-09-08  9:04       ` Willem de Bruijn
  0 siblings, 0 replies; 23+ messages in thread
From: Willem de Bruijn @ 2020-09-08  9:04 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: Saeed Mahameed, David S. Miller, Jakub Kicinski,
	Network Development, Maxim Mikityanskiy

On Tue, Sep 8, 2020 at 11:00 AM Maxim Mikityanskiy <maximmi@nvidia.com> wrote:
>
> On 2020-09-04 18:06, Willem de Bruijn wrote:
> > On Thu, Sep 3, 2020 at 11:01 PM Saeed Mahameed <saeedm@nvidia.com> wrote:
> >>
> >> From: Maxim Mikityanskiy <maximmi@mellanox.com>
> >>
> >> mlx5e_txwqe_complete performs some actions that can be taken to separate
> >> functions:
> >>
> >> 1. Update the flags needed for hardware timestamping.
> >>
> >> 2. Stop the TX queue if it's full.
> >>
> >> Take these actions into separate functions to be reused by the MPWQE
> >> code in the following commit and to maintain clear responsibilities of
> >> functions.
> >>
> >> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
> >> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> >> ---
> >>   .../net/ethernet/mellanox/mlx5/core/en_tx.c   | 23 ++++++++++++++-----
> >>   1 file changed, 17 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> >> index 9ced350150b3..3b68c8333875 100644
> >> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> >> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> >> @@ -311,6 +311,20 @@ static inline void mlx5e_sq_calc_wqe_attr(struct sk_buff *skb,
> >>          };
> >>   }
> >>
> >> +static inline void mlx5e_tx_skb_update_hwts_flags(struct sk_buff *skb)
> >> +{
> >> +       if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP))
> >> +               skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
> >> +}
> >
> > Subjective, but this helper adds a level of indirection and introduces
> > code churn without simplifying anything, imho.
>
> It's added for the sake of being reused in non-MPWQE and MPWQE flows.

I understand. I'm just saying that a helper for two lines, whose
function is already clear, just adds a layer of obfuscation. As said, that
is subjective, so just keep it as is, since you disagree.


* Re: [net-next 04/10] net/mlx5e: Unify constants for WQE_EMPTY_DS_COUNT
  2020-09-08  8:59     ` Maxim Mikityanskiy
@ 2020-09-08  9:06       ` Willem de Bruijn
  0 siblings, 0 replies; 23+ messages in thread
From: Willem de Bruijn @ 2020-09-08  9:06 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: Saeed Mahameed, David S. Miller, Jakub Kicinski,
	Network Development, Maxim Mikityanskiy

On Tue, Sep 8, 2020 at 10:59 AM Maxim Mikityanskiy <maximmi@nvidia.com> wrote:
>
> On 2020-09-04 18:05, Willem de Bruijn wrote:
> > On Thu, Sep 3, 2020 at 11:01 PM Saeed Mahameed <saeedm@nvidia.com> wrote:
> >>
> >> From: Maxim Mikityanskiy <maximmi@mellanox.com>
> >>
> >> A constant for the number of DS in an empty WQE (i.e. a WQE without data
> >> segments) is needed in multiple places (normal TX data path, MPWQE in
> >> XDP), but currently we have a constant for XDP and an inline formula in
> >> normal TX. This patch introduces a common constant.
> >>
> >> Additionally, mlx5e_xdp_mpwqe_session_start is converted to use struct
> >> assignment, because the code nearby is touched.
> >>
> >> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
> >> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> >> ---
> >>   .../net/ethernet/mellanox/mlx5/core/en/txrx.h |  2 ++
> >>   .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 13 +++++++-----
> >>   .../net/ethernet/mellanox/mlx5/core/en/xdp.h  | 21 +++++++------------
> >>   .../net/ethernet/mellanox/mlx5/core/en_tx.c   |  2 +-
> >>   4 files changed, 19 insertions(+), 19 deletions(-)
> >>
> >> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> >> index d4ee22789ab0..155b89998891 100644
> >> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> >> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> >> @@ -7,6 +7,8 @@
> >>   #include "en.h"
> >>   #include <linux/indirect_call_wrapper.h>
> >>
> >> +#define MLX5E_TX_WQE_EMPTY_DS_COUNT (sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS)
> >> +
> >
> > Out of curiosity, what is the logic for dividing this struct by 16?
>
> The hardware needs the size of a WQE in DS units (16 bytes). An empty
> WQE takes 2 DS (for the ctrl and eth segments), and this macro is this
> initial size of an empty WQE (2 DS). As data segments are added to the
> WQE, it grows, and its size in DS also grows.
>
> > struct mlx5e_tx_wqe {
> >          struct mlx5_wqe_ctrl_seg ctrl;
> >          struct mlx5_wqe_eth_seg  eth;
> >          struct mlx5_wqe_data_seg data[0];
> > };

Thanks. It was not obvious to me that the first two segments are the same
size as the data segs. But that actually is pretty logical. Ack.


* Re: [net-next 02/10] net/mlx5e: Refactor xmit functions
  2020-09-08  8:58     ` Maxim Mikityanskiy
@ 2020-09-08  9:08       ` Willem de Bruijn
  0 siblings, 0 replies; 23+ messages in thread
From: Willem de Bruijn @ 2020-09-08  9:08 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: Saeed Mahameed, David S. Miller, Jakub Kicinski,
	Network Development, Maxim Mikityanskiy

On Tue, Sep 8, 2020 at 10:59 AM Maxim Mikityanskiy <maximmi@nvidia.com> wrote:
>
> On 2020-09-04 18:27, Willem de Bruijn wrote:
> > On Thu, Sep 3, 2020 at 11:00 PM Saeed Mahameed <saeedm@nvidia.com> wrote:
> >>
> >> From: Maxim Mikityanskiy <maximmi@mellanox.com>
> >>
> >> A huge function mlx5e_sq_xmit was split into several to achieve multiple
> >> goals:
> >>
> >> 1. Reuse the code in IPoIB.
> >>
> >> 2. Better integrate with TLS, IPSEC, GENEVE and checksum offloads. Now
> >> it's possible to reserve space in the WQ before running eseg-based
> >> offloads, so:
> >>
> >> 2.1. It's not needed to copy cseg and eseg after mlx5e_fill_sq_frag_edge
> >> anymore.
> >>
> >> 2.2. mlx5e_txqsq_get_next_pi will be used instead of the legacy
> >> mlx5e_fill_sq_frag_edge for better code maintainability and reuse.
> >>
> >> 3. Prepare for the upcoming TX MPWQE for SKBs. It will intervene after
> >> mlx5e_sq_calc_wqe_attr to check if it's possible to use MPWQE, and the
> >> code flow will split into two paths: MPWQE and non-MPWQE.
> >>
> >> Two high-level functions are provided to send packets:
> >>
> >> * mlx5e_xmit is called by the networking stack, runs offloads and sends
> >> the packet. In one of the following patches, MPWQE support will be added
> >> to this flow.
> >>
> >> * mlx5e_sq_xmit_simple is called by the TLS offload, runs only the
> >> checksum offload and sends the packet.
> >>
> >> This change has no performance impact in TCP single stream test and
> >> XDP_TX single stream test.
> >>
> >> UDP pktgen (burst 32), single stream:
> >>    Packet rate: 17.55 Mpps -> 19.23 Mpps
> >>    Instructions per packet: 420 -> 360
> >>    Cycles per packet: 165 -> 142
> >>
> >> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (x86_64)
> >> NIC: Mellanox ConnectX-6 Dx
> >>
> >> To get this performance gain, manual optimizations of function inlining
> >> were performed. It's important to have mlx5e_sq_xmit_wqe inline,
> >> otherwise the packet rate will be 1 Mpps less in UDP pktgen test.
> >> __always_inline is required, because gcc uninlines it when it's called
> >> from two places (mlx5e_xmit and mlx5e_sq_xmit_simple).
> >>
> >> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
> >> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> >> ---
> >>   .../net/ethernet/mellanox/mlx5/core/en/txrx.h |  63 +--
> >>   .../mellanox/mlx5/core/en_accel/en_accel.h    |   5 +
> >>   .../mellanox/mlx5/core/en_accel/tls_rxtx.c    |   6 +-
> >>   .../net/ethernet/mellanox/mlx5/core/en_tx.c   | 391 ++++++++++--------
> >>   4 files changed, 243 insertions(+), 222 deletions(-)
> >
> > This combines a lot of changes. Including supposed noops, but with
> > subtle changes, like converting to struct initializers.
>
> Struct initializers are mostly used in the new code. I can split out the
> only converted occurrence.
>
> > Probably deserves to be broken up a bit more.
> >
> > For instance, a pure noop patch that moves
> > mlx5e_txwqe_build_eseg_csum,
>
> OK. Not sure I really need to move it though.

Even better.

In general, I don't really care how this patch is simplified, but as
is, it is long and combines code moves, refactors that are supposedly
noops, and new functionality. I imagine there must be some strategy
to break it up into sensible, manageable chunks.


* Re: [net-next 06/10] net/mlx5e: Support multiple SKBs in a TX WQE
  2020-09-08  8:59     ` Maxim Mikityanskiy
@ 2020-09-08 18:07       ` Jakub Kicinski
  0 siblings, 0 replies; 23+ messages in thread
From: Jakub Kicinski @ 2020-09-08 18:07 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: Saeed Mahameed, netdev, Maxim Mikityanskiy, Tariq Toukan

On Tue, 8 Sep 2020 11:59:54 +0300 Maxim Mikityanskiy wrote:
> On 2020-09-04 01:46, Jakub Kicinski wrote:
> > On Thu, 3 Sep 2020 14:00:18 -0700 Saeed Mahameed wrote:  
> >> +static inline void mlx5e_tx_wi_consume_fifo_skbs(struct mlx5e_txqsq *sq,
> >> +						 struct mlx5e_tx_wqe_info *wi,
> >> +						 struct mlx5_cqe64 *cqe,
> >> +						 int napi_budget)
> >> +{
> >> +	int i;
> >> +
> >> +	for (i = 0; i < wi->num_fifo_pkts; i++) {
> >> +		struct sk_buff *skb = mlx5e_skb_fifo_pop(sq);
> >> +
> >> +		mlx5e_consume_skb(sq, skb, cqe, napi_budget);
> >> +	}
> >> +}  
> > 
> > The compiler was not inlining this one either?  
> 
> Regarding this one, gcc inlines it automatically, but I went on the safe 
> side and inlined it explicitly - it's small and called for every WQE, so 
> we never want it to be non-inline.

Everyone always wants to be on the safe side :/ That's not an argument
we accept in this context.


Thread overview: 23+ messages
2020-09-03 21:00 [pull request][net-next 00/10] mlx5 Multi packet tx descriptors for SKBs Saeed Mahameed
2020-09-03 21:00 ` [net-next 01/10] net/mlx5e: Refactor inline header size calculation in the TX path Saeed Mahameed
2020-09-03 21:00 ` [net-next 02/10] net/mlx5e: Refactor xmit functions Saeed Mahameed
2020-09-04 15:27   ` Willem de Bruijn
2020-09-08  8:58     ` Maxim Mikityanskiy
2020-09-08  9:08       ` Willem de Bruijn
2020-09-03 21:00 ` [net-next 03/10] net/mlx5e: Small improvements for XDP TX MPWQE logic Saeed Mahameed
2020-09-03 21:00 ` [net-next 04/10] net/mlx5e: Unify constants for WQE_EMPTY_DS_COUNT Saeed Mahameed
2020-09-04 15:05   ` Willem de Bruijn
2020-09-08  8:59     ` Maxim Mikityanskiy
2020-09-08  9:06       ` Willem de Bruijn
2020-09-03 21:00 ` [net-next 05/10] net/mlx5e: Move the TLS resync check out of the function Saeed Mahameed
2020-09-03 21:00 ` [net-next 06/10] net/mlx5e: Support multiple SKBs in a TX WQE Saeed Mahameed
2020-09-03 22:46   ` Jakub Kicinski
2020-09-08  8:59     ` Maxim Mikityanskiy
2020-09-08 18:07       ` Jakub Kicinski
2020-09-03 21:00 ` [net-next 07/10] net/mlx5e: Generalize TX MPWQE checks for full session Saeed Mahameed
2020-09-03 21:00 ` [net-next 08/10] net/mlx5e: Rename xmit-related structs to generalize them Saeed Mahameed
2020-09-03 21:00 ` [net-next 09/10] net/mlx5e: Move TX code into functions to be used by MPWQE Saeed Mahameed
2020-09-04 15:06   ` Willem de Bruijn
2020-09-08  8:59     ` Maxim Mikityanskiy
2020-09-08  9:04       ` Willem de Bruijn
2020-09-03 21:00 ` [net-next 10/10] net/mlx5e: Enhanced TX MPWQE for SKBs Saeed Mahameed
