All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net-next 00/10] mlx4 XDP performance improvements
@ 2017-06-15 11:35 Tariq Toukan
  2017-06-15 11:35 ` [PATCH net-next 01/10] net/mlx4_en: Remove unused argument in TX datapath function Tariq Toukan
                   ` (10 more replies)
  0 siblings, 11 replies; 14+ messages in thread
From: Tariq Toukan @ 2017-06-15 11:35 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eran Ben Elisha, Saeed Mahameed, kernel-team, Tariq Toukan

Hi Dave,

This patchset contains data-path improvements, mainly for XDP_DROP
and XDP_TX cases.

Main patches:
* Patch 2 by Saeed allows enabling optimized A0 RX steering (in HW) when
  setting a single RX ring.
  With this configuration, HW packet-rate dramatically improves,
  reaching 28.1 Mpps in XDP_DROP case for both IPv4 (37% gain)
  and IPv6 (53% gain).
* Patch 6 enhances the XDP xmit function. Among other changes, now we
  ring one doorbell per NAPI. Patch gives 17% gain in XDP_TX case.
* Patch 7 obsoletes the NAPI of XDP_TX completion queue and integrates its
  poll into the respective RX NAPI. Patch gives 15% gain in XDP_TX case.

Series generated against net-next commit:
f7aec129a356 rxrpc: Cache the congestion window setting

Thanks,
Tariq.


Saeed Mahameed (1):
  net/mlx4_en: Optimized single ring steering

Tariq Toukan (9):
  net/mlx4_en: Remove unused argument in TX datapath function
  net/mlx4_en: Improve receive data-path
  net/mlx4_en: Improve transmit CQ polling
  net/mlx4_en: Improve stack xmit function
  net/mlx4_en: Improve XDP xmit function
  net/mlx4_en: Poll XDP TX completion queue in RX NAPI
  net/mlx4_en: Increase default TX ring size
  net/mlx4_en: Replace TXBB_SIZE multiplications with shift operations
  net/mlx4_en: Refactor mlx4_en_free_tx_desc

 drivers/net/ethernet/mellanox/mlx4/en_cq.c     |  25 +-
 drivers/net/ethernet/mellanox/mlx4/en_main.c   |   6 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |  20 +-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 145 ++++++++----
 drivers/net/ethernet/mellanox/mlx4/en_tx.c     | 305 ++++++++++++-------------
 drivers/net/ethernet/mellanox/mlx4/main.c      |   4 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  23 +-
 7 files changed, 293 insertions(+), 235 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH net-next 01/10] net/mlx4_en: Remove unused argument in TX datapath function
  2017-06-15 11:35 [PATCH net-next 00/10] mlx4 XDP performance improvements Tariq Toukan
@ 2017-06-15 11:35 ` Tariq Toukan
  2017-06-15 11:35 ` [PATCH net-next 02/10] net/mlx4_en: Optimized single ring steering Tariq Toukan
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Tariq Toukan @ 2017-06-15 11:35 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eran Ben Elisha, Saeed Mahameed, kernel-team,
	Tariq Toukan, Eric Dumazet

Remove owner argument, as it is obsolete and unused.
This also saves the overhead of calculating its value in data-path.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c   | 10 ++++------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  6 +++---
 2 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 6ffd1849a604..37386abea54c 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -265,7 +265,7 @@ static void mlx4_en_stamp_wqe(struct mlx4_en_priv *priv,
 
 u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv,
 			 struct mlx4_en_tx_ring *ring,
-			 int index, u8 owner, u64 timestamp,
+			 int index, u64 timestamp,
 			 int napi_mode)
 {
 	struct mlx4_en_tx_info *tx_info = &ring->tx_info[index];
@@ -344,7 +344,7 @@ u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv,
 
 u32 mlx4_en_recycle_tx_desc(struct mlx4_en_priv *priv,
 			    struct mlx4_en_tx_ring *ring,
-			    int index, u8 owner, u64 timestamp,
+			    int index, u64 timestamp,
 			    int napi_mode)
 {
 	struct mlx4_en_tx_info *tx_info = &ring->tx_info[index];
@@ -381,8 +381,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
 	while (ring->cons != ring->prod) {
 		ring->last_nr_txbb = ring->free_tx_desc(priv, ring,
 						ring->cons & ring->size_mask,
-						!!(ring->cons & ring->size), 0,
-						0 /* Non-NAPI caller */);
+						0, 0 /* Non-NAPI caller */);
 		ring->cons += ring->last_nr_txbb;
 		cnt++;
 	}
@@ -464,8 +463,7 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
 			/* free next descriptor */
 			last_nr_txbb = ring->free_tx_desc(
 					priv, ring, ring_index,
-					!!((ring_cons + txbbs_skipped) &
-					ring->size), timestamp, napi_budget);
+					timestamp, napi_budget);
 
 			mlx4_en_stamp_wqe(priv, ring, stamp_index,
 					  !!((ring_cons + txbbs_stamp) &
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 8c4f63946b14..08b2906a98af 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -276,7 +276,7 @@ struct mlx4_en_tx_ring {
 	struct netdev_queue	*tx_queue;
 	u32			(*free_tx_desc)(struct mlx4_en_priv *priv,
 						struct mlx4_en_tx_ring *ring,
-						int index, u8 owner,
+						int index,
 						u64 timestamp, int napi_mode);
 	struct mlx4_en_rx_ring	*recycle_ring;
 
@@ -723,11 +723,11 @@ int mlx4_en_process_rx_cq(struct net_device *dev,
 int mlx4_en_poll_tx_cq(struct napi_struct *napi, int budget);
 u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv,
 			 struct mlx4_en_tx_ring *ring,
-			 int index, u8 owner, u64 timestamp,
+			 int index, u64 timestamp,
 			 int napi_mode);
 u32 mlx4_en_recycle_tx_desc(struct mlx4_en_priv *priv,
 			    struct mlx4_en_tx_ring *ring,
-			    int index, u8 owner, u64 timestamp,
+			    int index, u64 timestamp,
 			    int napi_mode);
 void mlx4_en_fill_qp_context(struct mlx4_en_priv *priv, int size, int stride,
 		int is_tx, int rss, int qpn, int cqn, int user_prio,
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH net-next 02/10] net/mlx4_en: Optimized single ring steering
  2017-06-15 11:35 [PATCH net-next 00/10] mlx4 XDP performance improvements Tariq Toukan
  2017-06-15 11:35 ` [PATCH net-next 01/10] net/mlx4_en: Remove unused argument in TX datapath function Tariq Toukan
@ 2017-06-15 11:35 ` Tariq Toukan
  2017-06-15 11:35 ` [PATCH net-next 03/10] net/mlx4_en: Improve receive data-path Tariq Toukan
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Tariq Toukan @ 2017-06-15 11:35 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eran Ben Elisha, Saeed Mahameed, kernel-team,
	Tariq Toukan, Eric Dumazet

From: Saeed Mahameed <saeedm@mellanox.com>

Avoid touching RX QP RSS context when loading with only
one RX ring, to allow optimized A0 RX steering.

Enable by:
- loading mlx4_core with module param: log_num_mgm_entry_size = -6.
- then: ethtool -L <interface> rx 1

Performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

XDP_DROP packet rate:
-------------------------------------
     | Before    | After     | Gain |
IPv4 | 20.5 Mpps | 28.1 Mpps |  37% |
IPv6 | 18.4 Mpps | 28.1 Mpps |  53% |
-------------------------------------

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_main.c   |  6 ++--
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 12 ++++---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 48 ++++++++++++++++++++------
 drivers/net/ethernet/mellanox/mlx4/main.c      |  4 +--
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  2 +-
 5 files changed, 50 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_main.c b/drivers/net/ethernet/mellanox/mlx4/en_main.c
index d94f981eafc4..56cdf38d150e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_main.c
@@ -125,9 +125,9 @@ void mlx4_en_update_loopback_state(struct net_device *dev,
 		priv->flags |= MLX4_EN_FLAG_ENABLE_HW_LOOPBACK;
 
 	mutex_lock(&priv->mdev->state_lock);
-	if (priv->mdev->dev->caps.flags2 &
-	    MLX4_DEV_CAP_FLAG2_UPDATE_QP_SRC_CHECK_LB &&
-	    priv->rss_map.indir_qp.qpn) {
+	if ((priv->mdev->dev->caps.flags2 &
+	     MLX4_DEV_CAP_FLAG2_UPDATE_QP_SRC_CHECK_LB) &&
+	    priv->rss_map.indir_qp && priv->rss_map.indir_qp->qpn) {
 		int i;
 		int err = 0;
 		int loopback = !!(features & NETIF_F_LOOPBACK);
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index c1de75fc399a..51ce111b719e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -596,6 +596,8 @@ static int mlx4_en_get_qp(struct mlx4_en_priv *priv)
 		return err;
 	}
 
+	en_info(priv, "Steering Mode %d\n", dev->caps.steering_mode);
+
 	if (dev->caps.steering_mode == MLX4_STEERING_MODE_A0) {
 		int base_qpn = mlx4_get_base_qpn(dev, priv->port);
 		*qpn = base_qpn + index;
@@ -1010,7 +1012,7 @@ static void mlx4_en_do_multicast(struct mlx4_en_priv *priv,
 				memcpy(&mc_list[10], mclist->addr, ETH_ALEN);
 				mc_list[5] = priv->port;
 				err = mlx4_multicast_detach(mdev->dev,
-							    &priv->rss_map.indir_qp,
+							    priv->rss_map.indir_qp,
 							    mc_list,
 							    MLX4_PROT_ETH,
 							    mclist->reg_id);
@@ -1032,7 +1034,7 @@ static void mlx4_en_do_multicast(struct mlx4_en_priv *priv,
 				/* needed for B0 steering support */
 				mc_list[5] = priv->port;
 				err = mlx4_multicast_attach(mdev->dev,
-							    &priv->rss_map.indir_qp,
+							    priv->rss_map.indir_qp,
 							    mc_list,
 							    priv->port, 0,
 							    MLX4_PROT_ETH,
@@ -1742,7 +1744,7 @@ int mlx4_en_start_port(struct net_device *dev)
 	/* Attach rx QP to bradcast address */
 	eth_broadcast_addr(&mc_list[10]);
 	mc_list[5] = priv->port; /* needed for B0 steering support */
-	if (mlx4_multicast_attach(mdev->dev, &priv->rss_map.indir_qp, mc_list,
+	if (mlx4_multicast_attach(mdev->dev, priv->rss_map.indir_qp, mc_list,
 				  priv->port, 0, MLX4_PROT_ETH,
 				  &priv->broadcast_id))
 		mlx4_warn(mdev, "Failed Attaching Broadcast\n");
@@ -1866,12 +1868,12 @@ void mlx4_en_stop_port(struct net_device *dev, int detach)
 	/* Detach All multicasts */
 	eth_broadcast_addr(&mc_list[10]);
 	mc_list[5] = priv->port; /* needed for B0 steering support */
-	mlx4_multicast_detach(mdev->dev, &priv->rss_map.indir_qp, mc_list,
+	mlx4_multicast_detach(mdev->dev, priv->rss_map.indir_qp, mc_list,
 			      MLX4_PROT_ETH, priv->broadcast_id);
 	list_for_each_entry(mclist, &priv->curr_list, list) {
 		memcpy(&mc_list[10], mclist->addr, ETH_ALEN);
 		mc_list[5] = priv->port;
-		mlx4_multicast_detach(mdev->dev, &priv->rss_map.indir_qp,
+		mlx4_multicast_detach(mdev->dev, priv->rss_map.indir_qp,
 				      mc_list, MLX4_PROT_ETH, mclist->reg_id);
 		if (mclist->tunnel_reg_id)
 			mlx4_flow_detach(mdev->dev, mclist->tunnel_reg_id);
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 77abd1813047..c4edae854f1b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -1099,11 +1099,14 @@ int mlx4_en_config_rss_steer(struct mlx4_en_priv *priv)
 	int i, qpn;
 	int err = 0;
 	int good_qps = 0;
+	u8 flags;
 
 	en_dbg(DRV, priv, "Configuring rss steering\n");
+
+	flags = priv->rx_ring_num == 1 ? MLX4_RESERVE_A0_QP : 0;
 	err = mlx4_qp_reserve_range(mdev->dev, priv->rx_ring_num,
 				    priv->rx_ring_num,
-				    &rss_map->base_qpn, 0);
+				    &rss_map->base_qpn, flags);
 	if (err) {
 		en_err(priv, "Failed reserving %d qps\n", priv->rx_ring_num);
 		return err;
@@ -1120,13 +1123,28 @@ int mlx4_en_config_rss_steer(struct mlx4_en_priv *priv)
 		++good_qps;
 	}
 
+	if (priv->rx_ring_num == 1) {
+		rss_map->indir_qp = &rss_map->qps[0];
+		priv->base_qpn = rss_map->indir_qp->qpn;
+		en_info(priv, "Optimized Non-RSS steering\n");
+		return 0;
+	}
+
+	rss_map->indir_qp = kzalloc(sizeof(*rss_map->indir_qp), GFP_KERNEL);
+	if (!rss_map->indir_qp) {
+		err = -ENOMEM;
+		goto rss_err;
+	}
+
 	/* Configure RSS indirection qp */
-	err = mlx4_qp_alloc(mdev->dev, priv->base_qpn, &rss_map->indir_qp, GFP_KERNEL);
+	err = mlx4_qp_alloc(mdev->dev, priv->base_qpn, rss_map->indir_qp,
+			    GFP_KERNEL);
 	if (err) {
 		en_err(priv, "Failed to allocate RSS indirection QP\n");
 		goto rss_err;
 	}
-	rss_map->indir_qp.event = mlx4_en_sqp_event;
+
+	rss_map->indir_qp->event = mlx4_en_sqp_event;
 	mlx4_en_fill_qp_context(priv, 0, 0, 0, 1, priv->base_qpn,
 				priv->rx_ring[0]->cqn, -1, &context);
 
@@ -1164,8 +1182,9 @@ int mlx4_en_config_rss_steer(struct mlx4_en_priv *priv)
 		err = -EINVAL;
 		goto indir_err;
 	}
+
 	err = mlx4_qp_to_ready(mdev->dev, &priv->res.mtt, &context,
-			       &rss_map->indir_qp, &rss_map->indir_state);
+			       rss_map->indir_qp, &rss_map->indir_state);
 	if (err)
 		goto indir_err;
 
@@ -1173,9 +1192,11 @@ int mlx4_en_config_rss_steer(struct mlx4_en_priv *priv)
 
 indir_err:
 	mlx4_qp_modify(mdev->dev, NULL, rss_map->indir_state,
-		       MLX4_QP_STATE_RST, NULL, 0, 0, &rss_map->indir_qp);
-	mlx4_qp_remove(mdev->dev, &rss_map->indir_qp);
-	mlx4_qp_free(mdev->dev, &rss_map->indir_qp);
+		       MLX4_QP_STATE_RST, NULL, 0, 0, rss_map->indir_qp);
+	mlx4_qp_remove(mdev->dev, rss_map->indir_qp);
+	mlx4_qp_free(mdev->dev, rss_map->indir_qp);
+	kfree(rss_map->indir_qp);
+	rss_map->indir_qp = NULL;
 rss_err:
 	for (i = 0; i < good_qps; i++) {
 		mlx4_qp_modify(mdev->dev, NULL, rss_map->state[i],
@@ -1193,10 +1214,15 @@ void mlx4_en_release_rss_steer(struct mlx4_en_priv *priv)
 	struct mlx4_en_rss_map *rss_map = &priv->rss_map;
 	int i;
 
-	mlx4_qp_modify(mdev->dev, NULL, rss_map->indir_state,
-		       MLX4_QP_STATE_RST, NULL, 0, 0, &rss_map->indir_qp);
-	mlx4_qp_remove(mdev->dev, &rss_map->indir_qp);
-	mlx4_qp_free(mdev->dev, &rss_map->indir_qp);
+	if (priv->rx_ring_num > 1) {
+		mlx4_qp_modify(mdev->dev, NULL, rss_map->indir_state,
+			       MLX4_QP_STATE_RST, NULL, 0, 0,
+			       rss_map->indir_qp);
+		mlx4_qp_remove(mdev->dev, rss_map->indir_qp);
+		mlx4_qp_free(mdev->dev, rss_map->indir_qp);
+		kfree(rss_map->indir_qp);
+		rss_map->indir_qp = NULL;
+	}
 
 	for (i = 0; i < priv->rx_ring_num; i++) {
 		mlx4_qp_modify(mdev->dev, NULL, rss_map->state[i],
diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
index ccae3c6593c4..457e070bca46 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -2356,8 +2356,8 @@ static int mlx4_init_hca(struct mlx4_dev *dev)
 					MLX4_A0_STEERING_TABLE_SIZE;
 			}
 
-			mlx4_dbg(dev, "DMFS high rate steer mode is: %s\n",
-				 dmfs_high_rate_steering_mode_str(
+			mlx4_info(dev, "DMFS high rate steer mode is: %s\n",
+				  dmfs_high_rate_steering_mode_str(
 					dev->caps.dmfs_high_steer_mode));
 		}
 	} else {
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 08b2906a98af..41f4f8f9f300 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -431,7 +431,7 @@ struct mlx4_en_rss_map {
 	int base_qpn;
 	struct mlx4_qp qps[MAX_RX_RINGS];
 	enum mlx4_qp_state state[MAX_RX_RINGS];
-	struct mlx4_qp indir_qp;
+	struct mlx4_qp *indir_qp;
 	enum mlx4_qp_state indir_state;
 };
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH net-next 03/10] net/mlx4_en: Improve receive data-path
  2017-06-15 11:35 [PATCH net-next 00/10] mlx4 XDP performance improvements Tariq Toukan
  2017-06-15 11:35 ` [PATCH net-next 01/10] net/mlx4_en: Remove unused argument in TX datapath function Tariq Toukan
  2017-06-15 11:35 ` [PATCH net-next 02/10] net/mlx4_en: Optimized single ring steering Tariq Toukan
@ 2017-06-15 11:35 ` Tariq Toukan
  2017-06-15 11:35 ` [PATCH net-next 04/10] net/mlx4_en: Improve transmit CQ polling Tariq Toukan
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Tariq Toukan @ 2017-06-15 11:35 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eran Ben Elisha, Saeed Mahameed, kernel-team,
	Tariq Toukan, Eric Dumazet

Several small performance improvements in RX datapath,
including:
- Compiler branch predictor hints.
- Replace a multiplication with a shift operation.
- Minimize variables scope.
- Write-prefetch for packet header.
- Avoid trinary-operator ("?") when value can be preset in a matching
  branch.
- Save a branch by updating RX ring doorbell within
  mlx4_en_refill_rx_buffers(), which now returns void.

Performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Single queue no-RSS optimization ON
(enable by ethtool -L <interface> rx 1).

XDP_DROP packet rate:
Same (28.1 Mpps), lower CPU utilization (from ~100% to ~92%).

Drop packets in TC:
-------------------------------------
     | Before    | After     | Gain |
IPv4 | 4.14 Mpps | 4.18 Mpps |   1% |
-------------------------------------

XDP_TX packet rate:
-------------------------------------
     | Before    | After     | Gain |
IPv4 | 10.1 Mpps | 10.3 Mpps |   2% |
IPv6 | 10.1 Mpps | 10.3 Mpps |   2% |
-------------------------------------

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 73 ++++++++++++++++--------------
 1 file changed, 39 insertions(+), 34 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index c4edae854f1b..507c48ef2674 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -134,10 +134,11 @@ static int mlx4_en_prepare_rx_desc(struct mlx4_en_priv *priv,
 				   struct mlx4_en_rx_ring *ring, int index,
 				   gfp_t gfp)
 {
-	struct mlx4_en_rx_desc *rx_desc = ring->buf + (index * ring->stride);
+	struct mlx4_en_rx_desc *rx_desc = ring->buf +
+		(index << ring->log_stride);
 	struct mlx4_en_rx_alloc *frags = ring->rx_info +
 					(index << priv->log_rx_info);
-	if (ring->page_cache.index > 0) {
+	if (likely(ring->page_cache.index > 0)) {
 		/* XDP uses a single page per frame */
 		if (!frags->page) {
 			ring->page_cache.index--;
@@ -178,6 +179,7 @@ static void mlx4_en_free_rx_desc(const struct mlx4_en_priv *priv,
 	}
 }
 
+/* Function not in fast-path */
 static int mlx4_en_fill_rx_buffers(struct mlx4_en_priv *priv)
 {
 	struct mlx4_en_rx_ring *ring;
@@ -539,14 +541,14 @@ static void validate_loopback(struct mlx4_en_priv *priv, void *va)
 	priv->loopback_ok = 1;
 }
 
-static bool mlx4_en_refill_rx_buffers(struct mlx4_en_priv *priv,
+static void mlx4_en_refill_rx_buffers(struct mlx4_en_priv *priv,
 				      struct mlx4_en_rx_ring *ring)
 {
 	u32 missing = ring->actual_size - (ring->prod - ring->cons);
 
 	/* Try to batch allocations, but not too much. */
 	if (missing < 8)
-		return false;
+		return;
 	do {
 		if (mlx4_en_prepare_rx_desc(priv, ring,
 					    ring->prod & ring->size_mask,
@@ -554,9 +556,9 @@ static bool mlx4_en_refill_rx_buffers(struct mlx4_en_priv *priv,
 					    __GFP_MEMALLOC))
 			break;
 		ring->prod++;
-	} while (--missing);
+	} while (likely(--missing));
 
-	return true;
+	mlx4_en_update_rx_prod_db(ring);
 }
 
 /* When hardware doesn't strip the vlan, we need to calculate the checksum
@@ -637,21 +639,14 @@ static int check_csum(struct mlx4_cqe *cqe, struct sk_buff *skb, void *va,
 int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int budget)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
-	struct mlx4_en_dev *mdev = priv->mdev;
-	struct mlx4_cqe *cqe;
-	struct mlx4_en_rx_ring *ring = priv->rx_ring[cq->ring];
-	struct mlx4_en_rx_alloc *frags;
+	int factor = priv->cqe_factor;
+	struct mlx4_en_rx_ring *ring;
 	struct bpf_prog *xdp_prog;
+	int cq_ring = cq->ring;
 	int doorbell_pending;
-	struct sk_buff *skb;
-	int index;
-	int nr;
-	unsigned int length;
+	struct mlx4_cqe *cqe;
 	int polled = 0;
-	int ip_summed;
-	int factor = priv->cqe_factor;
-	u64 timestamp;
-	bool l2_tunnel;
+	int index;
 
 	if (unlikely(!priv->port_up))
 		return 0;
@@ -659,6 +654,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	if (unlikely(budget <= 0))
 		return polled;
 
+	ring = priv->rx_ring[cq_ring];
+
 	/* Protect accesses to: ring->xdp_prog, priv->mac_hash list */
 	rcu_read_lock();
 	xdp_prog = rcu_dereference(ring->xdp_prog);
@@ -673,10 +670,17 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	/* Process all completed CQEs */
 	while (XNOR(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK,
 		    cq->mcq.cons_index & cq->size)) {
+		struct mlx4_en_rx_alloc *frags;
+		enum pkt_hash_types hash_type;
+		struct sk_buff *skb;
+		unsigned int length;
+		int ip_summed;
 		void *va;
+		int nr;
 
 		frags = ring->rx_info + (index << priv->log_rx_info);
 		va = page_address(frags[0].page) + frags[0].page_offset;
+		prefetchw(va);
 		/*
 		 * make sure we read the CQE after we read the ownership bit
 		 */
@@ -768,7 +772,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 				break;
 			case XDP_TX:
 				if (likely(!mlx4_en_xmit_frame(ring, frags, dev,
-							length, cq->ring,
+							length, cq_ring,
 							&doorbell_pending))) {
 					frags[0].page = NULL;
 					goto next;
@@ -790,24 +794,27 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		ring->packets++;
 
 		skb = napi_get_frags(&cq->napi);
-		if (!skb)
+		if (unlikely(!skb))
 			goto next;
 
 		if (unlikely(ring->hwtstamp_rx_filter == HWTSTAMP_FILTER_ALL)) {
-			timestamp = mlx4_en_get_cqe_ts(cqe);
-			mlx4_en_fill_hwtstamps(mdev, skb_hwtstamps(skb),
+			u64 timestamp = mlx4_en_get_cqe_ts(cqe);
+
+			mlx4_en_fill_hwtstamps(priv->mdev, skb_hwtstamps(skb),
 					       timestamp);
 		}
-		skb_record_rx_queue(skb, cq->ring);
+		skb_record_rx_queue(skb, cq_ring);
 
 		if (likely(dev->features & NETIF_F_RXCSUM)) {
 			if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
 						      MLX4_CQE_STATUS_UDP)) {
 				if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
 				    cqe->checksum == cpu_to_be16(0xffff)) {
-					ip_summed = CHECKSUM_UNNECESSARY;
-					l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
+					bool l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
 						(cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
+
+					ip_summed = CHECKSUM_UNNECESSARY;
+					hash_type = PKT_HASH_TYPE_L4;
 					if (l2_tunnel)
 						skb->csum_level = 1;
 					ring->csum_ok++;
@@ -822,6 +829,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 						goto csum_none;
 					} else {
 						ip_summed = CHECKSUM_COMPLETE;
+						hash_type = PKT_HASH_TYPE_L3;
 						ring->csum_complete++;
 					}
 				} else {
@@ -831,16 +839,14 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		} else {
 csum_none:
 			ip_summed = CHECKSUM_NONE;
+			hash_type = PKT_HASH_TYPE_L3;
 			ring->csum_none++;
 		}
 		skb->ip_summed = ip_summed;
 		if (dev->features & NETIF_F_RXHASH)
 			skb_set_hash(skb,
 				     be32_to_cpu(cqe->immed_rss_invalid),
-				     (ip_summed == CHECKSUM_UNNECESSARY) ?
-					PKT_HASH_TYPE_L4 :
-					PKT_HASH_TYPE_L3);
-
+				     hash_type);
 
 		if ((cqe->vlan_my_qpn &
 		     cpu_to_be32(MLX4_CQE_CVLAN_PRESENT_MASK)) &&
@@ -867,13 +873,13 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		++cq->mcq.cons_index;
 		index = (cq->mcq.cons_index) & ring->size_mask;
 		cqe = mlx4_en_get_cqe(cq->buf, index, priv->cqe_size) + factor;
-		if (++polled == budget)
+		if (unlikely(++polled == budget))
 			break;
 	}
 
 	rcu_read_unlock();
 
-	if (polled) {
+	if (likely(polled)) {
 		if (doorbell_pending)
 			mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);
 
@@ -883,8 +889,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	}
 	AVG_PERF_COUNTER(priv->pstats.rx_coal_avg, polled);
 
-	if (mlx4_en_refill_rx_buffers(priv, ring))
-		mlx4_en_update_rx_prod_db(ring);
+	mlx4_en_refill_rx_buffers(priv, ring);
 
 	return polled;
 }
@@ -936,7 +941,7 @@ int mlx4_en_poll_rx_cq(struct napi_struct *napi, int budget)
 			done--;
 	}
 	/* Done for now */
-	if (napi_complete_done(napi, done))
+	if (likely(napi_complete_done(napi, done)))
 		mlx4_en_arm_cq(priv, cq);
 	return done;
 }
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH net-next 04/10] net/mlx4_en: Improve transmit CQ polling
  2017-06-15 11:35 [PATCH net-next 00/10] mlx4 XDP performance improvements Tariq Toukan
                   ` (2 preceding siblings ...)
  2017-06-15 11:35 ` [PATCH net-next 03/10] net/mlx4_en: Improve receive data-path Tariq Toukan
@ 2017-06-15 11:35 ` Tariq Toukan
  2017-06-15 11:35 ` [PATCH net-next 05/10] net/mlx4_en: Improve stack xmit function Tariq Toukan
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Tariq Toukan @ 2017-06-15 11:35 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eran Ben Elisha, Saeed Mahameed, kernel-team,
	Tariq Toukan, Eric Dumazet

Several small performance improvements in TX CQ polling,
including:
- Compiler branch predictor hints.
- Minimize variables scope.
- More proper check of cq type.
- Use boolean instead of int for a binary indication.

Performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Packet-rate tests for both regular stack and XDP use cases:
No noticeable gain, no degradation.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 37386abea54c..2d5e0da1de2f 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -402,8 +402,7 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
 	struct mlx4_cq *mcq = &cq->mcq;
 	struct mlx4_en_tx_ring *ring = priv->tx_ring[cq->type][cq->ring];
 	struct mlx4_cqe *cqe;
-	u16 index;
-	u16 new_index, ring_index, stamp_index;
+	u16 index, ring_index, stamp_index;
 	u32 txbbs_skipped = 0;
 	u32 txbbs_stamp = 0;
 	u32 cons_index = mcq->cons_index;
@@ -418,7 +417,7 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
 	u32 last_nr_txbb;
 	u32 ring_cons;
 
-	if (!priv->port_up)
+	if (unlikely(!priv->port_up))
 		return true;
 
 	netdev_txq_bql_complete_prefetchw(ring->tx_queue);
@@ -433,6 +432,8 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
 	/* Process all completed CQEs */
 	while (XNOR(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK,
 			cons_index & size) && (done < budget)) {
+		u16 new_index;
+
 		/*
 		 * make sure we read the CQE after we read the
 		 * ownership bit
@@ -479,7 +480,6 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
 		cqe = mlx4_en_get_cqe(buf, index, priv->cqe_size) + factor;
 	}
 
-
 	/*
 	 * To prevent CQ overflow we first update CQ consumer and only then
 	 * the ring consumer.
@@ -492,7 +492,7 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
 	ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
 	ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;
 
-	if (ring->free_tx_desc == mlx4_en_recycle_tx_desc)
+	if (cq->type == TX_XDP)
 		return done < budget;
 
 	netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
@@ -504,6 +504,7 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
 		netif_tx_wake_queue(ring->tx_queue);
 		ring->wake_queue++;
 	}
+
 	return done < budget;
 }
 
@@ -524,7 +525,7 @@ int mlx4_en_poll_tx_cq(struct napi_struct *napi, int budget)
 	struct mlx4_en_cq *cq = container_of(napi, struct mlx4_en_cq, napi);
 	struct net_device *dev = cq->dev;
 	struct mlx4_en_priv *priv = netdev_priv(dev);
-	int clean_complete;
+	bool clean_complete;
 
 	clean_complete = mlx4_en_process_tx_cq(dev, cq, budget);
 	if (!clean_complete)
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH net-next 05/10] net/mlx4_en: Improve stack xmit function
  2017-06-15 11:35 [PATCH net-next 00/10] mlx4 XDP performance improvements Tariq Toukan
                   ` (3 preceding siblings ...)
  2017-06-15 11:35 ` [PATCH net-next 04/10] net/mlx4_en: Improve transmit CQ polling Tariq Toukan
@ 2017-06-15 11:35 ` Tariq Toukan
  2017-06-15 11:35 ` [PATCH net-next 06/10] net/mlx4_en: Improve XDP " Tariq Toukan
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Tariq Toukan @ 2017-06-15 11:35 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eran Ben Elisha, Saeed Mahameed, kernel-team,
	Tariq Toukan, Eric Dumazet

Several small code and performance improvements in stack TX datapath,
including:
- Compiler branch predictor hints.
- Minimize variables scope.
- Move tx_info non-inline flow handling to a separate function.
- Calculate data_offset in compile time rather than in runtime
  (for !lso_header_size branch).
- Avoid trinary-operator ("?") when value can be preset in a matching
  branch.

Performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Gain is too small to be measurable, no degradation sensed.
Results are similar for IPv4 and IPv6.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c | 151 +++++++++++++++++------------
 1 file changed, 87 insertions(+), 64 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 2d5e0da1de2f..58f4b322587b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -774,37 +774,101 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
 	}
 }
 
+static bool mlx4_en_build_dma_wqe(struct mlx4_en_priv *priv,
+				  struct skb_shared_info *shinfo,
+				  struct mlx4_wqe_data_seg *data,
+				  struct sk_buff *skb,
+				  int lso_header_size,
+				  __be32 mr_key,
+				  struct mlx4_en_tx_info *tx_info)
+{
+	struct device *ddev = priv->ddev;
+	dma_addr_t dma = 0;
+	u32 byte_count = 0;
+	int i_frag;
+
+	/* Map fragments if any */
+	for (i_frag = shinfo->nr_frags - 1; i_frag >= 0; i_frag--) {
+		const struct skb_frag_struct *frag;
+
+		frag = &shinfo->frags[i_frag];
+		byte_count = skb_frag_size(frag);
+		dma = skb_frag_dma_map(ddev, frag,
+				       0, byte_count,
+				       DMA_TO_DEVICE);
+		if (dma_mapping_error(ddev, dma))
+			goto tx_drop_unmap;
+
+		data->addr = cpu_to_be64(dma);
+		data->lkey = mr_key;
+		dma_wmb();
+		data->byte_count = cpu_to_be32(byte_count);
+		--data;
+	}
+
+	/* Map linear part if needed */
+	if (tx_info->linear) {
+		byte_count = skb_headlen(skb) - lso_header_size;
+
+		dma = dma_map_single(ddev, skb->data +
+				     lso_header_size, byte_count,
+				     PCI_DMA_TODEVICE);
+		if (dma_mapping_error(ddev, dma))
+			goto tx_drop_unmap;
+
+		data->addr = cpu_to_be64(dma);
+		data->lkey = mr_key;
+		dma_wmb();
+		data->byte_count = cpu_to_be32(byte_count);
+	}
+	/* tx completion can avoid cache line miss for common cases */
+	tx_info->map0_dma = dma;
+	tx_info->map0_byte_count = byte_count;
+
+	return true;
+
+tx_drop_unmap:
+	en_err(priv, "DMA mapping error\n");
+
+	while (++i_frag < shinfo->nr_frags) {
+		++data;
+		dma_unmap_page(ddev, (dma_addr_t)be64_to_cpu(data->addr),
+			       be32_to_cpu(data->byte_count),
+			       PCI_DMA_TODEVICE);
+	}
+
+	return false;
+}
+
 netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct skb_shared_info *shinfo = skb_shinfo(skb);
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	union mlx4_wqe_qpn_vlan	qpn_vlan = {};
-	struct device *ddev = priv->ddev;
 	struct mlx4_en_tx_ring *ring;
 	struct mlx4_en_tx_desc *tx_desc;
 	struct mlx4_wqe_data_seg *data;
 	struct mlx4_en_tx_info *tx_info;
-	int tx_ind = 0;
+	int tx_ind;
 	int nr_txbb;
 	int desc_size;
 	int real_size;
 	u32 index, bf_index;
 	__be32 op_own;
-	u16 vlan_proto = 0;
-	int i_frag;
 	int lso_header_size;
 	void *fragptr = NULL;
 	bool bounce = false;
 	bool send_doorbell;
 	bool stop_queue;
 	bool inline_ok;
+	u8 data_offset;
 	u32 ring_cons;
 	bool bf_ok;
 
 	tx_ind = skb_get_queue_mapping(skb);
 	ring = priv->tx_ring[TX][tx_ind];
 
-	if (!priv->port_up)
+	if (unlikely(!priv->port_up))
 		goto tx_drop;
 
 	/* fetch ring->cons far ahead before needing it to avoid stall */
@@ -826,6 +890,8 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	bf_ok = ring->bf_enabled;
 	if (skb_vlan_tag_present(skb)) {
+		u16 vlan_proto;
+
 		qpn_vlan.vlan_tag = cpu_to_be16(skb_vlan_tag_get(skb));
 		vlan_proto = be16_to_cpu(skb->vlan_proto);
 		if (vlan_proto == ETH_P_8021AD)
@@ -862,64 +928,31 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	tx_info->skb = skb;
 	tx_info->nr_txbb = nr_txbb;
 
-	data = &tx_desc->data;
-	if (lso_header_size)
-		data = ((void *)&tx_desc->lso + ALIGN(lso_header_size + 4,
-						      DS_SIZE));
+	if (!lso_header_size) {
+		data = &tx_desc->data;
+		data_offset = offsetof(struct mlx4_en_tx_desc, data);
+	} else {
+		int lso_align = ALIGN(lso_header_size + 4, DS_SIZE);
+
+		data = (void *)&tx_desc->lso + lso_align;
+		data_offset = offsetof(struct mlx4_en_tx_desc, lso) + lso_align;
+	}
 
 	/* valid only for none inline segments */
-	tx_info->data_offset = (void *)data - (void *)tx_desc;
+	tx_info->data_offset = data_offset;
 
 	tx_info->inl = inline_ok;
 
-	tx_info->linear = (lso_header_size < skb_headlen(skb) &&
-			   !inline_ok) ? 1 : 0;
+	tx_info->linear = lso_header_size < skb_headlen(skb) && !inline_ok;
 
 	tx_info->nr_maps = shinfo->nr_frags + tx_info->linear;
 	data += tx_info->nr_maps - 1;
 
-	if (!tx_info->inl) {
-		dma_addr_t dma = 0;
-		u32 byte_count = 0;
-
-		/* Map fragments if any */
-		for (i_frag = shinfo->nr_frags - 1; i_frag >= 0; i_frag--) {
-			const struct skb_frag_struct *frag;
-
-			frag = &shinfo->frags[i_frag];
-			byte_count = skb_frag_size(frag);
-			dma = skb_frag_dma_map(ddev, frag,
-					       0, byte_count,
-					       DMA_TO_DEVICE);
-			if (dma_mapping_error(ddev, dma))
-				goto tx_drop_unmap;
-
-			data->addr = cpu_to_be64(dma);
-			data->lkey = ring->mr_key;
-			dma_wmb();
-			data->byte_count = cpu_to_be32(byte_count);
-			--data;
-		}
-
-		/* Map linear part if needed */
-		if (tx_info->linear) {
-			byte_count = skb_headlen(skb) - lso_header_size;
-
-			dma = dma_map_single(ddev, skb->data +
-					     lso_header_size, byte_count,
-					     PCI_DMA_TODEVICE);
-			if (dma_mapping_error(ddev, dma))
-				goto tx_drop_unmap;
-
-			data->addr = cpu_to_be64(dma);
-			data->lkey = ring->mr_key;
-			dma_wmb();
-			data->byte_count = cpu_to_be32(byte_count);
-		}
-		/* tx completion can avoid cache line miss for common cases */
-		tx_info->map0_dma = dma;
-		tx_info->map0_byte_count = byte_count;
-	}
+	if (!tx_info->inl)
+		if (!mlx4_en_build_dma_wqe(priv, shinfo, data, skb,
+					   lso_header_size, ring->mr_key,
+					   tx_info))
+			goto tx_drop_count;
 
 	/*
 	 * For timestamping add flag to skb_shinfo and
@@ -1055,16 +1088,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 	return NETDEV_TX_OK;
 
-tx_drop_unmap:
-	en_err(priv, "DMA mapping error\n");
-
-	while (++i_frag < shinfo->nr_frags) {
-		++data;
-		dma_unmap_page(ddev, (dma_addr_t) be64_to_cpu(data->addr),
-			       be32_to_cpu(data->byte_count),
-			       PCI_DMA_TODEVICE);
-	}
-
 tx_drop_count:
 	ring->tx_dropped++;
 tx_drop:
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH net-next 06/10] net/mlx4_en: Improve XDP xmit function
  2017-06-15 11:35 [PATCH net-next 00/10] mlx4 XDP performance improvements Tariq Toukan
                   ` (4 preceding siblings ...)
  2017-06-15 11:35 ` [PATCH net-next 05/10] net/mlx4_en: Improve stack xmit function Tariq Toukan
@ 2017-06-15 11:35 ` Tariq Toukan
  2017-06-15 11:35 ` [PATCH net-next 07/10] net/mlx4_en: Poll XDP TX completion queue in RX NAPI Tariq Toukan
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Tariq Toukan @ 2017-06-15 11:35 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eran Ben Elisha, Saeed Mahameed, kernel-team,
	Tariq Toukan, Eric Dumazet

Several performance improvements in XDP TX datapath,
including:
- Ring a single doorbell for XDP TX ring per NAPI budget,
  instead of doing it per a lower threshold (was 8).
  This includes removing the flow of immediate doorbell ringing
  in case of a full TX ring.
- Compiler branch predictor hints.
- Calculate values in compile time rather than in runtime.

Performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Single queue no-RSS optimization ON.

XDP_TX packet rate:
-------------------------------------
     | Before    | After     | Gain |
IPv4 | 10.3 Mpps | 12.0 Mpps |  17% |
IPv6 | 10.3 Mpps | 12.0 Mpps |  17% |
-------------------------------------

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_tx.c   | 59 +++++++++-------------------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  3 +-
 3 files changed, 21 insertions(+), 43 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 507c48ef2674..747e4d7d7693 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -643,7 +643,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	struct mlx4_en_rx_ring *ring;
 	struct bpf_prog *xdp_prog;
 	int cq_ring = cq->ring;
-	int doorbell_pending;
+	bool doorbell_pending;
 	struct mlx4_cqe *cqe;
 	int polled = 0;
 	int index;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 58f4b322587b..01bb43879221 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -1095,51 +1095,40 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	return NETDEV_TX_OK;
 }
 
+#define MLX4_EN_XDP_TX_NRTXBB  1
+#define MLX4_EN_XDP_TX_REAL_SZ (((CTRL_SIZE + MLX4_EN_XDP_TX_NRTXBB * DS_SIZE) \
+				 / 16) & 0x3f)
+
 netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
 			       struct mlx4_en_rx_alloc *frame,
 			       struct net_device *dev, unsigned int length,
-			       int tx_ind, int *doorbell_pending)
+			       int tx_ind, bool *doorbell_pending)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	union mlx4_wqe_qpn_vlan	qpn_vlan = {};
-	struct mlx4_en_tx_ring *ring;
 	struct mlx4_en_tx_desc *tx_desc;
-	struct mlx4_wqe_data_seg *data;
 	struct mlx4_en_tx_info *tx_info;
-	int index, bf_index;
-	bool send_doorbell;
-	int nr_txbb = 1;
-	bool stop_queue;
+	struct mlx4_wqe_data_seg *data;
+	struct mlx4_en_tx_ring *ring;
 	dma_addr_t dma;
-	int real_size;
 	__be32 op_own;
-	u32 ring_cons;
-	bool bf_ok;
+	int index;
 
-	BUILD_BUG_ON_MSG(ALIGN(CTRL_SIZE + DS_SIZE, TXBB_SIZE) != TXBB_SIZE,
-			 "mlx4_en_xmit_frame requires minimum size tx desc");
+	if (unlikely(!priv->port_up))
+		goto tx_drop;
 
 	ring = priv->tx_ring[TX_XDP][tx_ind];
 
-	if (!priv->port_up)
-		goto tx_drop;
-
-	if (mlx4_en_is_tx_ring_full(ring))
+	if (unlikely(mlx4_en_is_tx_ring_full(ring)))
 		goto tx_drop_count;
 
-	/* fetch ring->cons far ahead before needing it to avoid stall */
-	ring_cons = READ_ONCE(ring->cons);
-
 	index = ring->prod & ring->size_mask;
 	tx_info = &ring->tx_info[index];
 
-	bf_ok = ring->bf_enabled;
-
 	/* Track current inflight packets for performance analysis */
 	AVG_PERF_COUNTER(priv->pstats.inflight_avg,
-			 (u32)(ring->prod - ring_cons - 1));
+			 (u32)(ring->prod - READ_ONCE(ring->cons) - 1));
 
-	bf_index = ring->prod;
 	tx_desc = ring->buf + index * TXBB_SIZE;
 	data = &tx_desc->data;
 
@@ -1149,9 +1138,9 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
 	frame->page = NULL;
 	tx_info->map0_dma = dma;
 	tx_info->map0_byte_count = PAGE_SIZE;
-	tx_info->nr_txbb = nr_txbb;
+	tx_info->nr_txbb = MLX4_EN_XDP_TX_NRTXBB;
 	tx_info->nr_bytes = max_t(unsigned int, length, ETH_ZLEN);
-	tx_info->data_offset = (void *)data - (void *)tx_desc;
+	tx_info->data_offset = offsetof(struct mlx4_en_tx_desc, data);
 	tx_info->ts_requested = 0;
 	tx_info->nr_maps = 1;
 	tx_info->linear = 1;
@@ -1175,23 +1164,13 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
 	rx_ring->xdp_tx++;
 	AVG_PERF_COUNTER(priv->pstats.tx_pktsz_avg, length);
 
-	ring->prod += nr_txbb;
+	ring->prod += MLX4_EN_XDP_TX_NRTXBB;
 
-	stop_queue = mlx4_en_is_tx_ring_full(ring);
-	send_doorbell = stop_queue ||
-				*doorbell_pending > MLX4_EN_DOORBELL_BUDGET;
-	bf_ok &= send_doorbell;
+	qpn_vlan.fence_size = MLX4_EN_XDP_TX_REAL_SZ;
 
-	real_size = ((CTRL_SIZE + nr_txbb * DS_SIZE) / 16) & 0x3f;
-
-	if (bf_ok)
-		qpn_vlan.bf_qpn = ring->doorbell_qpn | cpu_to_be32(real_size);
-	else
-		qpn_vlan.fence_size = real_size;
-
-	mlx4_en_tx_write_desc(ring, tx_desc, qpn_vlan, TXBB_SIZE, bf_index,
-			      op_own, bf_ok, send_doorbell);
-	*doorbell_pending = send_doorbell ? 0 : *doorbell_pending + 1;
+	mlx4_en_tx_write_desc(ring, tx_desc, qpn_vlan, TXBB_SIZE, 0,
+			      op_own, false, false);
+	*doorbell_pending = true;
 
 	return NETDEV_TX_OK;
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 41f4f8f9f300..c52edb717add 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -121,7 +121,6 @@
 					 MLX4_EN_NUM_UP)
 
 #define MLX4_EN_DEFAULT_TX_WORK		256
-#define MLX4_EN_DOORBELL_BUDGET		8
 
 /* Target number of packets to coalesce with interrupt moderation */
 #define MLX4_EN_RX_COAL_TARGET	44
@@ -689,7 +688,7 @@ u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb,
 netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
 			       struct mlx4_en_rx_alloc *frame,
 			       struct net_device *dev, unsigned int length,
-			       int tx_ind, int *doorbell_pending);
+			       int tx_ind, bool *doorbell_pending);
 void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring);
 bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
 			struct mlx4_en_rx_alloc *frame);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH net-next 07/10] net/mlx4_en: Poll XDP TX completion queue in RX NAPI
  2017-06-15 11:35 [PATCH net-next 00/10] mlx4 XDP performance improvements Tariq Toukan
                   ` (5 preceding siblings ...)
  2017-06-15 11:35 ` [PATCH net-next 06/10] net/mlx4_en: Improve XDP " Tariq Toukan
@ 2017-06-15 11:35 ` Tariq Toukan
  2017-06-15 11:35 ` [PATCH net-next 08/10] net/mlx4_en: Increase default TX ring size Tariq Toukan
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Tariq Toukan @ 2017-06-15 11:35 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eran Ben Elisha, Saeed Mahameed, kernel-team,
	Tariq Toukan, Eric Dumazet

Instead of having their own NAPIs, XDP TX completion queues get
polled within the corresponding RX NAPI.
This prevents any possible race on TX ring prod/cons indices,
between the context that issues the transmits (RX NAPI) and the
context that handles the completions (was previously done in
a separate NAPI).

This also improves performance, as it decreases the number
of NAPIs running on a CPU, saving the overhead of syncing
and switching between the contexts.

Performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Single queue no-RSS optimization ON.

XDP_TX packet rate:
-------------------------------------
     | Before    | After     | Gain |
IPv4 | 12.0 Mpps | 13.8 Mpps |  15% |
IPv6 | 12.0 Mpps | 13.8 Mpps |  15% |
-------------------------------------

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_cq.c     | 25 ++++++++++++++++++-------
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |  8 +++++---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 22 +++++++++++++++++++---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c     |  5 +++--
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  7 ++++++-
 5 files changed, 51 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_cq.c b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
index 09dd3776db76..85fe17e4dcfb 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
@@ -146,16 +146,25 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq,
 	if (err)
 		goto free_eq;
 
-	cq->mcq.comp  = cq->type != RX ? mlx4_en_tx_irq : mlx4_en_rx_irq;
 	cq->mcq.event = mlx4_en_cq_event;
 
-	if (cq->type != RX)
+	switch (cq->type) {
+	case TX:
+		cq->mcq.comp = mlx4_en_tx_irq;
 		netif_tx_napi_add(cq->dev, &cq->napi, mlx4_en_poll_tx_cq,
 				  NAPI_POLL_WEIGHT);
-	else
+		napi_enable(&cq->napi);
+		break;
+	case RX:
+		cq->mcq.comp = mlx4_en_rx_irq;
 		netif_napi_add(cq->dev, &cq->napi, mlx4_en_poll_rx_cq, 64);
-
-	napi_enable(&cq->napi);
+		napi_enable(&cq->napi);
+		break;
+	case TX_XDP:
+		/* nothing regarding napi, it's shared with rx ring */
+		cq->xdp_busy = false;
+		break;
+	}
 
 	return 0;
 
@@ -184,8 +193,10 @@ void mlx4_en_destroy_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq **pcq)
 
 void mlx4_en_deactivate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq)
 {
-	napi_disable(&cq->napi);
-	netif_napi_del(&cq->napi);
+	if (cq->type != TX_XDP) {
+		napi_disable(&cq->napi);
+		netif_napi_del(&cq->napi);
+	}
 
 	mlx4_cq_free(priv->mdev->dev, &cq->mcq);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 51ce111b719e..99c02bb4f302 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -1679,13 +1679,15 @@ int mlx4_en_start_port(struct net_device *dev)
 			if (t != TX_XDP) {
 				tx_ring->tx_queue = netdev_get_tx_queue(dev, i);
 				tx_ring->recycle_ring = NULL;
+
+				/* Arm CQ for TX completions */
+				mlx4_en_arm_cq(priv, cq);
+
 			} else {
 				mlx4_en_init_recycle_ring(priv, i);
+				/* XDP TX CQ should never be armed */
 			}
 
-			/* Arm CQ for TX completions */
-			mlx4_en_arm_cq(priv, cq);
-
 			/* Set initial ownership of all Tx TXBBs to SW (1) */
 			for (j = 0; j < tx_ring->buf_size; j += STAMP_STRIDE)
 				*((u32 *)(tx_ring->buf + j)) = 0xffffffff;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 747e4d7d7693..e5fb89505a13 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -880,8 +880,10 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	rcu_read_unlock();
 
 	if (likely(polled)) {
-		if (doorbell_pending)
-			mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);
+		if (doorbell_pending) {
+			priv->tx_cq[TX_XDP][cq_ring]->xdp_busy = true;
+			mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq_ring]);
+		}
 
 		mlx4_cq_set_ci(&cq->mcq);
 		wmb(); /* ensure HW sees CQ consumer before we post new buffers */
@@ -912,16 +914,30 @@ int mlx4_en_poll_rx_cq(struct napi_struct *napi, int budget)
 	struct mlx4_en_cq *cq = container_of(napi, struct mlx4_en_cq, napi);
 	struct net_device *dev = cq->dev;
 	struct mlx4_en_priv *priv = netdev_priv(dev);
+	struct mlx4_en_cq *xdp_tx_cq = NULL;
+	bool clean_complete = true;
 	int done;
 
+	if (priv->tx_ring_num[TX_XDP]) {
+		xdp_tx_cq = priv->tx_cq[TX_XDP][cq->ring];
+		if (xdp_tx_cq->xdp_busy) {
+			clean_complete = mlx4_en_process_tx_cq(dev, xdp_tx_cq,
+							       budget);
+			xdp_tx_cq->xdp_busy = !clean_complete;
+		}
+	}
+
 	done = mlx4_en_process_rx_cq(dev, cq, budget);
 
 	/* If we used up all the quota - we're probably not done yet... */
-	if (done == budget) {
+	if (done == budget || !clean_complete) {
 		const struct cpumask *aff;
 		struct irq_data *idata;
 		int cpu_curr;
 
+		/* in case we got here because of !clean_complete */
+		done = budget;
+
 		INC_PERF_COUNTER(priv->pstats.napi_quota);
 
 		cpu_curr = smp_processor_id();
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 01bb43879221..500442c60342 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -395,8 +395,8 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
 	return cnt;
 }
 
-static bool mlx4_en_process_tx_cq(struct net_device *dev,
-				  struct mlx4_en_cq *cq, int napi_budget)
+bool mlx4_en_process_tx_cq(struct net_device *dev,
+			   struct mlx4_en_cq *cq, int napi_budget)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	struct mlx4_cq *mcq = &cq->mcq;
@@ -1176,6 +1176,7 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
 
 tx_drop_count:
 	rx_ring->xdp_tx_full++;
+	*doorbell_pending = true;
 tx_drop:
 	return NETDEV_TX_BUSY;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index c52edb717add..bde3bc869b70 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -358,7 +358,10 @@ struct mlx4_en_cq {
 	struct mlx4_hwq_resources wqres;
 	int                     ring;
 	struct net_device      *dev;
-	struct napi_struct	napi;
+	union {
+		struct napi_struct napi;
+		bool               xdp_busy;
+	};
 	int size;
 	int buf_size;
 	int vector;
@@ -720,6 +723,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev,
 			  int budget);
 int mlx4_en_poll_rx_cq(struct napi_struct *napi, int budget);
 int mlx4_en_poll_tx_cq(struct napi_struct *napi, int budget);
+bool mlx4_en_process_tx_cq(struct net_device *dev,
+			   struct mlx4_en_cq *cq, int napi_budget);
 u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv,
 			 struct mlx4_en_tx_ring *ring,
 			 int index, u64 timestamp,
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH net-next 08/10] net/mlx4_en: Increase default TX ring size
  2017-06-15 11:35 [PATCH net-next 00/10] mlx4 XDP performance improvements Tariq Toukan
                   ` (6 preceding siblings ...)
  2017-06-15 11:35 ` [PATCH net-next 07/10] net/mlx4_en: Poll XDP TX completion queue in RX NAPI Tariq Toukan
@ 2017-06-15 11:35 ` Tariq Toukan
  2017-06-15 11:35 ` [PATCH net-next 09/10] net/mlx4_en: Replace TXBB_SIZE multiplications with shift operations Tariq Toukan
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Tariq Toukan @ 2017-06-15 11:35 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eran Ben Elisha, Saeed Mahameed, kernel-team,
	Tariq Toukan, Eric Dumazet

Increase the default TX ring size (from 512 to 1024) to match
the RX ring size.
This gives the XDP TX ring a better chance to keep up with the
rate of its RX ring in case of a high load of XDP_TX actions.

Tested:
Ethtool counter rx_xdp_tx_full used to increase, after applying this
patch it stopped.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index bde3bc869b70..7f61b5829709 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -115,8 +115,8 @@
 #define MLX4_EN_MIN_TX_RING_P_UP	1
 #define MLX4_EN_MAX_TX_RING_P_UP	32
 #define MLX4_EN_NUM_UP			8
-#define MLX4_EN_DEF_TX_RING_SIZE	512
 #define MLX4_EN_DEF_RX_RING_SIZE  	1024
+#define MLX4_EN_DEF_TX_RING_SIZE	MLX4_EN_DEF_RX_RING_SIZE
 #define MAX_TX_RINGS			(MLX4_EN_MAX_TX_RING_P_UP * \
 					 MLX4_EN_NUM_UP)
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH net-next 09/10] net/mlx4_en: Replace TXBB_SIZE multiplications with shift operations
  2017-06-15 11:35 [PATCH net-next 00/10] mlx4 XDP performance improvements Tariq Toukan
                   ` (7 preceding siblings ...)
  2017-06-15 11:35 ` [PATCH net-next 08/10] net/mlx4_en: Increase default TX ring size Tariq Toukan
@ 2017-06-15 11:35 ` Tariq Toukan
  2017-06-20  8:45   ` David Laight
  2017-06-15 11:35 ` [PATCH net-next 10/10] net/mlx4_en: Refactor mlx4_en_free_tx_desc Tariq Toukan
  2017-06-16  2:53 ` [PATCH net-next 00/10] mlx4 XDP performance improvements David Miller
  10 siblings, 1 reply; 14+ messages in thread
From: Tariq Toukan @ 2017-06-15 11:35 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eran Ben Elisha, Saeed Mahameed, kernel-team,
	Tariq Toukan, Eric Dumazet

Define LOG_TXBB_SIZE, log of TXBB_SIZE, and use it with a shift
operation instead of a multiplication with TXBB_SIZE.
Operations are equivalent as TXBB_SIZE is a power of two.

Performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Gain is too small to be measurable, no degradation sensed.
Results are similar for IPv4 and IPv6.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c   | 26 ++++++++++++++------------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  3 ++-
 2 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 500442c60342..efc4411b1e44 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -234,23 +234,24 @@ static void mlx4_en_stamp_wqe(struct mlx4_en_priv *priv,
 			      u8 owner)
 {
 	__be32 stamp = cpu_to_be32(STAMP_VAL | (!!owner << STAMP_SHIFT));
-	struct mlx4_en_tx_desc *tx_desc = ring->buf + index * TXBB_SIZE;
+	struct mlx4_en_tx_desc *tx_desc = ring->buf + (index << LOG_TXBB_SIZE);
 	struct mlx4_en_tx_info *tx_info = &ring->tx_info[index];
 	void *end = ring->buf + ring->buf_size;
 	__be32 *ptr = (__be32 *)tx_desc;
 	int i;
 
 	/* Optimize the common case when there are no wraparounds */
-	if (likely((void *)tx_desc + tx_info->nr_txbb * TXBB_SIZE <= end)) {
+	if (likely((void *)tx_desc +
+		   (tx_info->nr_txbb << LOG_TXBB_SIZE) <= end)) {
 		/* Stamp the freed descriptor */
-		for (i = 0; i < tx_info->nr_txbb * TXBB_SIZE;
+		for (i = 0; i < tx_info->nr_txbb << LOG_TXBB_SIZE;
 		     i += STAMP_STRIDE) {
 			*ptr = stamp;
 			ptr += STAMP_DWORDS;
 		}
 	} else {
 		/* Stamp the freed descriptor */
-		for (i = 0; i < tx_info->nr_txbb * TXBB_SIZE;
+		for (i = 0; i < tx_info->nr_txbb << LOG_TXBB_SIZE;
 		     i += STAMP_STRIDE) {
 			*ptr = stamp;
 			ptr += STAMP_DWORDS;
@@ -269,7 +270,7 @@ u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv,
 			 int napi_mode)
 {
 	struct mlx4_en_tx_info *tx_info = &ring->tx_info[index];
-	struct mlx4_en_tx_desc *tx_desc = ring->buf + index * TXBB_SIZE;
+	struct mlx4_en_tx_desc *tx_desc = ring->buf + (index << LOG_TXBB_SIZE);
 	struct mlx4_wqe_data_seg *data = (void *) tx_desc + tx_info->data_offset;
 	void *end = ring->buf + ring->buf_size;
 	struct sk_buff *skb = tx_info->skb;
@@ -289,7 +290,8 @@ u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv,
 	}
 
 	/* Optimize the common case when there are no wraparounds */
-	if (likely((void *) tx_desc + tx_info->nr_txbb * TXBB_SIZE <= end)) {
+	if (likely((void *)tx_desc +
+		   (tx_info->nr_txbb << LOG_TXBB_SIZE) <= end)) {
 		if (!tx_info->inl) {
 			if (tx_info->linear)
 				dma_unmap_single(priv->ddev,
@@ -542,7 +544,7 @@ static struct mlx4_en_tx_desc *mlx4_en_bounce_to_desc(struct mlx4_en_priv *priv,
 						      u32 index,
 						      unsigned int desc_size)
 {
-	u32 copy = (ring->size - index) * TXBB_SIZE;
+	u32 copy = (ring->size - index) << LOG_TXBB_SIZE;
 	int i;
 
 	for (i = desc_size - copy - 4; i >= 0; i -= 4) {
@@ -557,12 +559,12 @@ static struct mlx4_en_tx_desc *mlx4_en_bounce_to_desc(struct mlx4_en_priv *priv,
 		if ((i & (TXBB_SIZE - 1)) == 0)
 			wmb();
 
-		*((u32 *) (ring->buf + index * TXBB_SIZE + i)) =
+		*((u32 *)(ring->buf + (index << LOG_TXBB_SIZE) + i)) =
 			*((u32 *) (ring->bounce_buf + i));
 	}
 
 	/* Return real descriptor location */
-	return ring->buf + index * TXBB_SIZE;
+	return ring->buf + (index << LOG_TXBB_SIZE);
 }
 
 /* Decide if skb can be inlined in tx descriptor to avoid dma mapping
@@ -881,7 +883,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	/* Align descriptor to TXBB size */
 	desc_size = ALIGN(real_size, TXBB_SIZE);
-	nr_txbb = desc_size / TXBB_SIZE;
+	nr_txbb = desc_size >> LOG_TXBB_SIZE;
 	if (unlikely(nr_txbb > MAX_DESC_TXBBS)) {
 		if (netif_msg_tx_err(priv))
 			en_warn(priv, "Oversized header or SG list\n");
@@ -916,7 +918,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	/* See if we have enough space for whole descriptor TXBB for setting
 	 * SW ownership on next descriptor; if not, use a bounce buffer. */
 	if (likely(index + nr_txbb <= ring->size))
-		tx_desc = ring->buf + index * TXBB_SIZE;
+		tx_desc = ring->buf + (index << LOG_TXBB_SIZE);
 	else {
 		tx_desc = (struct mlx4_en_tx_desc *) ring->bounce_buf;
 		bounce = true;
@@ -1129,7 +1131,7 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
 	AVG_PERF_COUNTER(priv->pstats.inflight_avg,
 			 (u32)(ring->prod - READ_ONCE(ring->cons) - 1));
 
-	tx_desc = ring->buf + index * TXBB_SIZE;
+	tx_desc = ring->buf + (index << LOG_TXBB_SIZE);
 	data = &tx_desc->data;
 
 	dma = frame->dma;
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 7f61b5829709..963b77d51b48 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -72,7 +72,8 @@
 #define DEF_RX_RINGS		16
 #define MAX_RX_RINGS		128
 #define MIN_RX_RINGS		4
-#define TXBB_SIZE		64
+#define LOG_TXBB_SIZE		6
+#define TXBB_SIZE		BIT(LOG_TXBB_SIZE)
 #define HEADROOM		(2048 / TXBB_SIZE + 1)
 #define STAMP_STRIDE		64
 #define STAMP_DWORDS		(STAMP_STRIDE / 4)
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH net-next 10/10] net/mlx4_en: Refactor mlx4_en_free_tx_desc
  2017-06-15 11:35 [PATCH net-next 00/10] mlx4 XDP performance improvements Tariq Toukan
                   ` (8 preceding siblings ...)
  2017-06-15 11:35 ` [PATCH net-next 09/10] net/mlx4_en: Replace TXBB_SIZE multiplications with shift operations Tariq Toukan
@ 2017-06-15 11:35 ` Tariq Toukan
  2017-06-16  2:53 ` [PATCH net-next 00/10] mlx4 XDP performance improvements David Miller
  10 siblings, 0 replies; 14+ messages in thread
From: Tariq Toukan @ 2017-06-15 11:35 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eran Ben Elisha, Saeed Mahameed, kernel-team,
	Tariq Toukan, Eric Dumazet

Some code re-ordering, functionally equivalent.

- The !tx_info->inl check is evaluated anyway in both flows
  (common case/end case). Run it first, this might finish
  the flows earlier.
- dma_unmap calls are identical in both flows, get it out
  of the if block into the common area.

Performance tests:
Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Gain is too small to be measurable, no degradation sensed.
Results are similar for IPv4 and IPv6.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: kernel-team@fb.com
Cc: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c | 45 +++++++++++-------------------
 1 file changed, 16 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index efc4411b1e44..7d69d939ee2d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -289,20 +289,20 @@ u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv,
 		skb_tstamp_tx(skb, &hwts);
 	}
 
-	/* Optimize the common case when there are no wraparounds */
-	if (likely((void *)tx_desc +
-		   (tx_info->nr_txbb << LOG_TXBB_SIZE) <= end)) {
-		if (!tx_info->inl) {
-			if (tx_info->linear)
-				dma_unmap_single(priv->ddev,
-						tx_info->map0_dma,
-						tx_info->map0_byte_count,
-						PCI_DMA_TODEVICE);
-			else
-				dma_unmap_page(priv->ddev,
-					       tx_info->map0_dma,
-					       tx_info->map0_byte_count,
-					       PCI_DMA_TODEVICE);
+	if (!tx_info->inl) {
+		if (tx_info->linear)
+			dma_unmap_single(priv->ddev,
+					 tx_info->map0_dma,
+					 tx_info->map0_byte_count,
+					 PCI_DMA_TODEVICE);
+		else
+			dma_unmap_page(priv->ddev,
+				       tx_info->map0_dma,
+				       tx_info->map0_byte_count,
+				       PCI_DMA_TODEVICE);
+		/* Optimize the common case when there are no wraparounds */
+		if (likely((void *)tx_desc +
+			   (tx_info->nr_txbb << LOG_TXBB_SIZE) <= end)) {
 			for (i = 1; i < nr_maps; i++) {
 				data++;
 				dma_unmap_page(priv->ddev,
@@ -310,23 +310,10 @@ u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv,
 					be32_to_cpu(data->byte_count),
 					PCI_DMA_TODEVICE);
 			}
-		}
-	} else {
-		if (!tx_info->inl) {
-			if ((void *) data >= end) {
+		} else {
+			if ((void *)data >= end)
 				data = ring->buf + ((void *)data - end);
-			}
 
-			if (tx_info->linear)
-				dma_unmap_single(priv->ddev,
-						tx_info->map0_dma,
-						tx_info->map0_byte_count,
-						PCI_DMA_TODEVICE);
-			else
-				dma_unmap_page(priv->ddev,
-					       tx_info->map0_dma,
-					       tx_info->map0_byte_count,
-					       PCI_DMA_TODEVICE);
 			for (i = 1; i < nr_maps; i++) {
 				data++;
 				/* Check for wraparound before unmapping */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next 00/10] mlx4 XDP performance improvements
  2017-06-15 11:35 [PATCH net-next 00/10] mlx4 XDP performance improvements Tariq Toukan
                   ` (9 preceding siblings ...)
  2017-06-15 11:35 ` [PATCH net-next 10/10] net/mlx4_en: Refactor mlx4_en_free_tx_desc Tariq Toukan
@ 2017-06-16  2:53 ` David Miller
  10 siblings, 0 replies; 14+ messages in thread
From: David Miller @ 2017-06-16  2:53 UTC (permalink / raw)
  To: tariqt; +Cc: netdev, eranbe, saeedm, kernel-team

From: Tariq Toukan <tariqt@mellanox.com>
Date: Thu, 15 Jun 2017 14:35:30 +0300

> This patchset contains data-path improvements, mainly for XDP_DROP
> and XDP_TX cases.
> 
> Main patches:
> * Patch 2 by Saeed allows enabling optimized A0 RX steering (in HW) when
>   setting a single RX ring.
>   With this configuration, HW packet-rate dramatically improves,
>   reaching 28.1 Mpps in XDP_DROP case for both IPv4 (37% gain)
>   and IPv6 (53% gain).
> * Patch 6 enhances the XDP xmit function. Among other changes, now we
>   ring one doorbell per NAPI. Patch gives 17% gain in XDP_TX case.
> * Patch 7 obsoletes the NAPI of XDP_TX completion queue and integrates its
>   poll into the respective RX NAPI. Patch gives 15% gain in XDP_TX case.
> 
> Series generated against net-next commit:
> f7aec129a356 rxrpc: Cache the congestion window setting

Series applied, thanks.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH net-next 09/10] net/mlx4_en: Replace TXBB_SIZE multiplications with shift operations
  2017-06-15 11:35 ` [PATCH net-next 09/10] net/mlx4_en: Replace TXBB_SIZE multiplications with shift operations Tariq Toukan
@ 2017-06-20  8:45   ` David Laight
  2017-06-20 10:53     ` Tariq Toukan
  0 siblings, 1 reply; 14+ messages in thread
From: David Laight @ 2017-06-20  8:45 UTC (permalink / raw)
  To: 'Tariq Toukan', David S. Miller
  Cc: netdev, Eran Ben Elisha, Saeed Mahameed, kernel-team, Eric Dumazet

From: Tariq Toukan
> Sent: 15 June 2017 12:36
> Define LOG_TXBB_SIZE, log of TXBB_SIZE, and use it with a shift
> operation instead of a multiplication with TXBB_SIZE.
> Operations are equivalent as TXBB_SIZE is a power of two.
> 
> Performance tests:
> Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
> 
> Gain is too small to be measurable, no degradation sensed.
> Results are similar for IPv4 and IPv6.

I can't imagine there is any difference at all.
The compiler will use a shift for a 'multiply by a constant power of 2'.

...
If you want to save a few cycles I think the loop:
> -		for (i = 0; i < tx_info->nr_txbb * TXBB_SIZE;
> +		for (i = 0; i < tx_info->nr_txbb << LOG_TXBB_SIZE;
requires the compiler generate code to read nr_txbb every
iteration.
Caching the values might help (unless it causes a different
register spill).

	David

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next 09/10] net/mlx4_en: Replace TXBB_SIZE multiplications with shift operations
  2017-06-20  8:45   ` David Laight
@ 2017-06-20 10:53     ` Tariq Toukan
  0 siblings, 0 replies; 14+ messages in thread
From: Tariq Toukan @ 2017-06-20 10:53 UTC (permalink / raw)
  To: David Laight, 'Tariq Toukan', David S. Miller
  Cc: netdev, Eran Ben Elisha, Saeed Mahameed, kernel-team, Eric Dumazet



On 20/06/2017 11:45 AM, David Laight wrote:
> From: Tariq Toukan
>> Sent: 15 June 2017 12:36
>> Define LOG_TXBB_SIZE, log of TXBB_SIZE, and use it with a shift
>> operation instead of a multiplication with TXBB_SIZE.
>> Operations are equivalent as TXBB_SIZE is a power of two.
>>
>> Performance tests:
>> Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
>>
>> Gain is too small to be measurable, no degradation sensed.
>> Results are similar for IPv4 and IPv6.
> 
> I can't imagine there is any difference at all.
> The compiler will use a shift for a 'multiply by a constant power of 2'.
Yeah i guess my compiler does that, because it's a constant known at 
compile-time.

> 
> ...
> If you want to save a few cycles I think the loop:
>> -		for (i = 0; i < tx_info->nr_txbb * TXBB_SIZE;
>> +		for (i = 0; i < tx_info->nr_txbb << LOG_TXBB_SIZE;
> requires the compiler generate code to read nr_txbb every
> iteration.
> Caching the values might help (unless it causes a different
> register spill).

That sounds good!
I'll prepare and send this after testing.
I'll also look for similar cases in driver.

> 
> 	David
> 

Thank you David!

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2017-06-20 10:53 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-15 11:35 [PATCH net-next 00/10] mlx4 XDP performance improvements Tariq Toukan
2017-06-15 11:35 ` [PATCH net-next 01/10] net/mlx4_en: Remove unused argument in TX datapath function Tariq Toukan
2017-06-15 11:35 ` [PATCH net-next 02/10] net/mlx4_en: Optimized single ring steering Tariq Toukan
2017-06-15 11:35 ` [PATCH net-next 03/10] net/mlx4_en: Improve receive data-path Tariq Toukan
2017-06-15 11:35 ` [PATCH net-next 04/10] net/mlx4_en: Improve transmit CQ polling Tariq Toukan
2017-06-15 11:35 ` [PATCH net-next 05/10] net/mlx4_en: Improve stack xmit function Tariq Toukan
2017-06-15 11:35 ` [PATCH net-next 06/10] net/mlx4_en: Improve XDP " Tariq Toukan
2017-06-15 11:35 ` [PATCH net-next 07/10] net/mlx4_en: Poll XDP TX completion queue in RX NAPI Tariq Toukan
2017-06-15 11:35 ` [PATCH net-next 08/10] net/mlx4_en: Increase default TX ring size Tariq Toukan
2017-06-15 11:35 ` [PATCH net-next 09/10] net/mlx4_en: Replace TXBB_SIZE multiplications with shift operations Tariq Toukan
2017-06-20  8:45   ` David Laight
2017-06-20 10:53     ` Tariq Toukan
2017-06-15 11:35 ` [PATCH net-next 10/10] net/mlx4_en: Refactor mlx4_en_free_tx_desc Tariq Toukan
2017-06-16  2:53 ` [PATCH net-next 00/10] mlx4 XDP performance improvements David Miller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.