* [PATCH RFC 00/11] mlx5 RX refactoring and XDP support
@ 2016-09-07 12:42 Saeed Mahameed
  2016-09-07 12:42 ` [PATCH RFC 01/11] net/mlx5e: Single flow order-0 pages for Striding RQ Saeed Mahameed
                   ` (11 more replies)
  0 siblings, 12 replies; 72+ messages in thread
From: Saeed Mahameed @ 2016-09-07 12:42 UTC (permalink / raw)
  To: iovisor-dev
  Cc: netdev, Tariq Toukan, Brenden Blanco, Alexei Starovoitov,
	Tom Herbert, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Eric Dumazet, Jamal Hadi Salim, Saeed Mahameed

Hi All,

This patch set introduces some important RX data-path refactoring,
addressing mlx5e memory allocation/management improvements, and adds XDP support.

Submitting as RFC since we would like to get early feedback while we
continue reviewing, testing, and completing the performance analysis in house.

In detail:
From Tariq, three patches that address the page allocation and memory
fragmentation issues of mlx5e striding RQ, where we used to allocate order-5
pages and then split them into order-0 pages.  We now allocate only order-0
pages and default to what we used to call the fallback mechanism (ConnectX-4
UMR) to virtually map them into the device as one big chunk.  In the last two
patches of his series, Tariq introduces a mapped-pages internal cache API for
the mlx5e driver, to recover from the performance degradation we hit now that
we allocate 32 order-0 pages whenever the striding RQ RX path requires them.
Those two mapped-pages cache API patches are needed later in my XDP series.
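
Roughly, the per-WQE flow in Tariq's patches looks like this (a simplified
sketch condensed from patches 01-02 below; error handling and page
refcounting are trimmed):

    for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
            err = mlx5e_page_alloc_mapped(rq, &wi->umr.dma_info[i]);
            if (unlikely(err))
                    goto err_unmap;
            /* record each order-0 page's DMA address in the MTT array */
            wi->umr.mtt[i] = cpu_to_be64(wi->umr.dma_info[i].addr | MLX5_EN_WR);
    }
    /* a UMR WQE then asks the HCA to map those pages as one virtually
     * contiguous buffer backing this striding-RQ WQE
     */
    mlx5e_post_umr_wqe(rq, ix);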

XDP support:
To have proper XDP support, a "page per packet" scheme is a prerequisite, and
neither of our two RX modes can trivially provide it.  Striding RQ is not an
option, since the whole idea behind it is to share memory and use a minimum of
HW descriptors for as many packets as we can.

The other mode is the regular RX ring mode, where we have a HW descriptor per
packet, but the main issue is that ring SKBs are allocated in advance and
skb->data is mapped directly to the device (a drop decision would not be as
fast as we want, since we would still need to free the skb).  For that we also
refactored the regular RQ mode: we now allocate a page per packet and use
build_skb.  To overcome the page allocator overhead, we use the page cache
API.  For those who have ConnectX-4 LX, where striding RQ is the default, if
XDP is requested we will move to regular ring RQ mode, and move back to
striding mode when XDP is turned off.
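
To illustrate the drop-decision point, a rough sketch (only
mlx5e_page_release(), added in patch 03, is real; the function and the
other helper names here are hypothetical):

static void rx_cqe_sketch(struct mlx5e_rq *rq, struct mlx5e_dma_info *di,
                          void *data, u32 len)
{
        if (xdp_wants_drop(rq, data, len)) {    /* hypothetical XDP hook */
                /* no SKB exists yet, so dropping is just recycling the page */
                mlx5e_page_release(rq, di, true);
                return;
        }
        /* only packets going up the stack pay for an SKB (build_skb) */
        pass_to_stack(rq, build_skb_around(di, data, len)); /* hypothetical */
}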

Some issues still need to be addressed: having a page per packet is not as
perfect as it seems, since driver memory consumption went up.  As future work,
we need to share pages between multiple packets when XDP is off, especially on
systems with a large PAGE_SIZE.

XDP TX forwarding support is added in the last two patches.
Nothing really special there :).

You will find many more details and initial performance numbers in the
individual commit messages.

Thanks,
Saeed.

Rana Shahout (1):
  net/mlx5e: XDP fast RX drop bpf programs support

Saeed Mahameed (7):
  net/mlx5e: Build RX SKB on demand
  net/mlx5e: Union RQ RX info per RQ type
  net/mlx5e: Slightly reduce hardware LRO size
  net/mlx5e: Dynamic RQ type infrastructure
  net/mlx5e: Have a clear separation between different SQ types
  net/mlx5e: XDP TX forwarding support
  net/mlx5e: XDP TX xmit more

Tariq Toukan (3):
  net/mlx5e: Single flow order-0 pages for Striding RQ
  net/mlx5e: Introduce API for RX mapped pages
  net/mlx5e: Implement RX mapped page cache for page recycle

 drivers/net/ethernet/mellanox/mlx5/core/en.h       | 144 +++--
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 581 +++++++++++++++----
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    | 618 ++++++++++-----------
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |  32 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c    |  61 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  |  67 ++-
 6 files changed, 998 insertions(+), 505 deletions(-)

-- 
2.7.4


* [PATCH RFC 01/11] net/mlx5e: Single flow order-0 pages for Striding RQ
  2016-09-07 12:42 [PATCH RFC 00/11] mlx5 RX refactoring and XDP support Saeed Mahameed
@ 2016-09-07 12:42 ` Saeed Mahameed
       [not found]   ` <1473252152-11379-2-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2016-09-07 12:42 ` [PATCH RFC 02/11] net/mlx5e: Introduce API for RX mapped pages Saeed Mahameed
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 72+ messages in thread
From: Saeed Mahameed @ 2016-09-07 12:42 UTC (permalink / raw)
  To: iovisor-dev
  Cc: netdev, Tariq Toukan, Brenden Blanco, Alexei Starovoitov,
	Tom Herbert, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Eric Dumazet, Jamal Hadi Salim, Saeed Mahameed

From: Tariq Toukan <tariqt@mellanox.com>

To improve the memory consumption scheme, we drop the flow that allocates
and splits high-order pages in Striding RQ, and stay with a single
Striding RQ flow that uses order-0 pages.

Moving to fragmented memory allows the use of larger MPWQEs,
which reduces the number of UMR posts and filler CQEs.

Moving to a single flow allows several optimizations that improve
performance, especially in production servers where we would anyway
fall back to order-0 allocations:
- inline functions that were called via function pointers.
- improve the UMR post process.

This patch alone is expected to give a slight performance reduction.
However, the new memory scheme makes it possible to use a fair-sized
page cache that does not inflate the memory footprint, which will more
than make up for that reduction and even give a significant gain.
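
For reference, the arithmetic behind the larger MPWQE, as a small
stand-alone model (4K pages assumed; the UMR MTT alignment is taken here
as 64 bytes, which is an assumption, not a value from this patch):

#include <stdio.h>

#define ALIGN(x, a)        (((x) + (a) - 1) & ~((a) - 1))
#define PAGE_SHIFT         12
#define MPWRQ_LOG_WQE_SZ   18      /* was 17 before this patch */
#define UMR_MTT_ALIGNMENT  64      /* assumed */

int main(void)
{
        unsigned int pages_per_wqe = 1u << (MPWRQ_LOG_WQE_SZ - PAGE_SHIFT);
        unsigned int mtt_sz = ALIGN(pages_per_wqe * 8 /* sizeof(__be64) */,
                                    UMR_MTT_ALIGNMENT);

        /* 18 - 12 = 6 -> 64 order-0 pages per 256KB WQE, 512B of MTTs */
        printf("pages/WQE=%u mtt bytes/WQE=%u\n", pages_per_wqe, mtt_sz);
        return 0;
}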

We ran pktgen single-stream benchmarks, with iptables-raw-drop:

Single stride, 64 bytes:
* 4,739,057 - baseline
* 4,749,550 - this patch
no reduction

Larger packets, no page cross, 1024 bytes:
* 3,982,361 - baseline
* 3,845,682 - this patch
3.5% reduction

Larger packets, every 3rd packet crosses a page, 1500 bytes:
* 3,731,189 - baseline
* 3,579,414 - this patch
4% reduction

Fixes: 461017cb006a ("net/mlx5e: Support RX multi-packet WQE (Striding RQ)")
Fixes: bc77b240b3c5 ("net/mlx5e: Add fragmented memory support for RX multi packet WQE")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  54 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 136 ++++++++--
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    | 292 ++++-----------------
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   4 -
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  |   2 +-
 5 files changed, 184 insertions(+), 304 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index bf722aa..075cdfc 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -62,12 +62,12 @@
 #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE                0xd
 
 #define MLX5E_PARAMS_MINIMUM_LOG_RQ_SIZE_MPW            0x1
-#define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW            0x4
+#define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW            0x3
 #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW            0x6
 
 #define MLX5_MPWRQ_LOG_STRIDE_SIZE		6  /* >= 6, HW restriction */
 #define MLX5_MPWRQ_LOG_STRIDE_SIZE_CQE_COMPRESS	8  /* >= 6, HW restriction */
-#define MLX5_MPWRQ_LOG_WQE_SZ			17
+#define MLX5_MPWRQ_LOG_WQE_SZ			18
 #define MLX5_MPWRQ_WQE_PAGE_ORDER  (MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT > 0 ? \
 				    MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT : 0)
 #define MLX5_MPWRQ_PAGES_PER_WQE		BIT(MLX5_MPWRQ_WQE_PAGE_ORDER)
@@ -293,8 +293,8 @@ struct mlx5e_rq {
 	u32                    wqe_sz;
 	struct sk_buff       **skb;
 	struct mlx5e_mpw_info *wqe_info;
+	void                  *mtt_no_align;
 	__be32                 mkey_be;
-	__be32                 umr_mkey_be;
 
 	struct device         *pdev;
 	struct net_device     *netdev;
@@ -323,32 +323,15 @@ struct mlx5e_rq {
 
 struct mlx5e_umr_dma_info {
 	__be64                *mtt;
-	__be64                *mtt_no_align;
 	dma_addr_t             mtt_addr;
-	struct mlx5e_dma_info *dma_info;
+	struct mlx5e_dma_info  dma_info[MLX5_MPWRQ_PAGES_PER_WQE];
+	struct mlx5e_umr_wqe   wqe;
 };
 
 struct mlx5e_mpw_info {
-	union {
-		struct mlx5e_dma_info     dma_info;
-		struct mlx5e_umr_dma_info umr;
-	};
+	struct mlx5e_umr_dma_info umr;
 	u16 consumed_strides;
 	u16 skbs_frags[MLX5_MPWRQ_PAGES_PER_WQE];
-
-	void (*dma_pre_sync)(struct device *pdev,
-			     struct mlx5e_mpw_info *wi,
-			     u32 wqe_offset, u32 len);
-	void (*add_skb_frag)(struct mlx5e_rq *rq,
-			     struct sk_buff *skb,
-			     struct mlx5e_mpw_info *wi,
-			     u32 page_idx, u32 frag_offset, u32 len);
-	void (*copy_skb_header)(struct device *pdev,
-				struct sk_buff *skb,
-				struct mlx5e_mpw_info *wi,
-				u32 page_idx, u32 offset,
-				u32 headlen);
-	void (*free_wqe)(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi);
 };
 
 struct mlx5e_tx_wqe_info {
@@ -706,24 +689,11 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
 void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
 bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq);
 int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
-int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
+int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe,	u16 ix);
 void mlx5e_dealloc_rx_wqe(struct mlx5e_rq *rq, u16 ix);
 void mlx5e_dealloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix);
-void mlx5e_post_rx_fragmented_mpwqe(struct mlx5e_rq *rq);
-void mlx5e_complete_rx_linear_mpwqe(struct mlx5e_rq *rq,
-				    struct mlx5_cqe64 *cqe,
-				    u16 byte_cnt,
-				    struct mlx5e_mpw_info *wi,
-				    struct sk_buff *skb);
-void mlx5e_complete_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
-					struct mlx5_cqe64 *cqe,
-					u16 byte_cnt,
-					struct mlx5e_mpw_info *wi,
-					struct sk_buff *skb);
-void mlx5e_free_rx_linear_mpwqe(struct mlx5e_rq *rq,
-				struct mlx5e_mpw_info *wi);
-void mlx5e_free_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
-				    struct mlx5e_mpw_info *wi);
+void mlx5e_post_rx_mpwqe(struct mlx5e_rq *rq);
+void mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi);
 struct mlx5_cqe64 *mlx5e_get_cqe(struct mlx5e_cq *cq);
 
 void mlx5e_rx_am(struct mlx5e_rq *rq);
@@ -810,6 +780,12 @@ static inline void mlx5e_cq_arm(struct mlx5e_cq *cq)
 	mlx5_cq_arm(mcq, MLX5_CQ_DB_REQ_NOT, mcq->uar->map, NULL, cq->wq.cc);
 }
 
+static inline u32 mlx5e_get_wqe_mtt_offset(struct mlx5e_rq *rq, u16 wqe_ix)
+{
+	return rq->mpwqe_mtt_offset +
+		wqe_ix * ALIGN(MLX5_MPWRQ_PAGES_PER_WQE, 8);
+}
+
 static inline int mlx5e_get_max_num_channels(struct mlx5_core_dev *mdev)
 {
 	return min_t(int, mdev->priv.eq_table.num_comp_vectors,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 2459c7f..0db4d3b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -138,7 +138,6 @@ static void mlx5e_update_sw_counters(struct mlx5e_priv *priv)
 		s->rx_csum_unnecessary_inner += rq_stats->csum_unnecessary_inner;
 		s->rx_wqe_err   += rq_stats->wqe_err;
 		s->rx_mpwqe_filler += rq_stats->mpwqe_filler;
-		s->rx_mpwqe_frag   += rq_stats->mpwqe_frag;
 		s->rx_buff_alloc_err += rq_stats->buff_alloc_err;
 		s->rx_cqe_compress_blks += rq_stats->cqe_compress_blks;
 		s->rx_cqe_compress_pkts += rq_stats->cqe_compress_pkts;
@@ -298,6 +297,107 @@ static void mlx5e_disable_async_events(struct mlx5e_priv *priv)
 #define MLX5E_HW2SW_MTU(hwmtu) (hwmtu - (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN))
 #define MLX5E_SW2HW_MTU(swmtu) (swmtu + (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN))
 
+static inline int mlx5e_get_wqe_mtt_sz(void)
+{
+	/* UMR copies MTTs in units of MLX5_UMR_MTT_ALIGNMENT bytes.
+	 * To avoid copying garbage after the mtt array, we allocate
+	 * a little more.
+	 */
+	return ALIGN(MLX5_MPWRQ_PAGES_PER_WQE * sizeof(__be64),
+		     MLX5_UMR_MTT_ALIGNMENT);
+}
+
+static inline void mlx5e_build_umr_wqe(struct mlx5e_rq *rq, struct mlx5e_sq *sq,
+				       struct mlx5e_umr_wqe *wqe, u16 ix)
+{
+	struct mlx5_wqe_ctrl_seg      *cseg = &wqe->ctrl;
+	struct mlx5_wqe_umr_ctrl_seg *ucseg = &wqe->uctrl;
+	struct mlx5_wqe_data_seg      *dseg = &wqe->data;
+	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
+	u8 ds_cnt = DIV_ROUND_UP(sizeof(*wqe), MLX5_SEND_WQE_DS);
+	u32 umr_wqe_mtt_offset = mlx5e_get_wqe_mtt_offset(rq, ix);
+
+	cseg->qpn_ds    = cpu_to_be32((sq->sqn << MLX5_WQE_CTRL_QPN_SHIFT) |
+				      ds_cnt);
+	cseg->fm_ce_se  = MLX5_WQE_CTRL_CQ_UPDATE;
+	cseg->imm       = rq->mkey_be;
+
+	ucseg->flags = MLX5_UMR_TRANSLATION_OFFSET_EN;
+	ucseg->klm_octowords =
+		cpu_to_be16(MLX5_MTT_OCTW(MLX5_MPWRQ_PAGES_PER_WQE));
+	ucseg->bsf_octowords =
+		cpu_to_be16(MLX5_MTT_OCTW(umr_wqe_mtt_offset));
+	ucseg->mkey_mask     = cpu_to_be64(MLX5_MKEY_MASK_FREE);
+
+	dseg->lkey = sq->mkey_be;
+	dseg->addr = cpu_to_be64(wi->umr.mtt_addr);
+}
+
+static int mlx5e_rq_alloc_mpwqe_info(struct mlx5e_rq *rq,
+				     struct mlx5e_channel *c)
+{
+	int wq_sz = mlx5_wq_ll_get_size(&rq->wq);
+	int mtt_sz = mlx5e_get_wqe_mtt_sz();
+	int mtt_alloc = mtt_sz + MLX5_UMR_ALIGN - 1;
+	int i;
+
+	rq->wqe_info = kzalloc_node(wq_sz * sizeof(*rq->wqe_info),
+				    GFP_KERNEL, cpu_to_node(c->cpu));
+	if (!rq->wqe_info)
+		goto err_out;
+
+	/* We allocate more than mtt_sz as we will align the pointer */
+	rq->mtt_no_align = kzalloc_node(mtt_alloc * wq_sz, GFP_KERNEL,
+					cpu_to_node(c->cpu));
+	if (unlikely(!rq->mtt_no_align))
+		goto err_free_wqe_info;
+
+	for (i = 0; i < wq_sz; i++) {
+		struct mlx5e_mpw_info *wi = &rq->wqe_info[i];
+
+		wi->umr.mtt = PTR_ALIGN(rq->mtt_no_align + i * mtt_alloc,
+					MLX5_UMR_ALIGN);
+		wi->umr.mtt_addr = dma_map_single(c->pdev, wi->umr.mtt, mtt_sz,
+						  PCI_DMA_TODEVICE);
+		if (unlikely(dma_mapping_error(c->pdev, wi->umr.mtt_addr)))
+			goto err_unmap_mtts;
+
+		mlx5e_build_umr_wqe(rq, &c->icosq, &wi->umr.wqe, i);
+	}
+
+	return 0;
+
+err_unmap_mtts:
+	while (--i >= 0) {
+		struct mlx5e_mpw_info *wi = &rq->wqe_info[i];
+
+		dma_unmap_single(c->pdev, wi->umr.mtt_addr, mtt_sz,
+				 PCI_DMA_TODEVICE);
+	}
+	kfree(rq->mtt_no_align);
+err_free_wqe_info:
+	kfree(rq->wqe_info);
+
+err_out:
+	return -ENOMEM;
+}
+
+static void mlx5e_rq_free_mpwqe_info(struct mlx5e_rq *rq)
+{
+	int wq_sz = mlx5_wq_ll_get_size(&rq->wq);
+	int mtt_sz = mlx5e_get_wqe_mtt_sz();
+	int i;
+
+	for (i = 0; i < wq_sz; i++) {
+		struct mlx5e_mpw_info *wi = &rq->wqe_info[i];
+
+		dma_unmap_single(rq->pdev, wi->umr.mtt_addr, mtt_sz,
+				 PCI_DMA_TODEVICE);
+	}
+	kfree(rq->mtt_no_align);
+	kfree(rq->wqe_info);
+}
+
 static int mlx5e_create_rq(struct mlx5e_channel *c,
 			   struct mlx5e_rq_param *param,
 			   struct mlx5e_rq *rq)
@@ -322,14 +422,16 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 
 	wq_sz = mlx5_wq_ll_get_size(&rq->wq);
 
+	rq->wq_type = priv->params.rq_wq_type;
+	rq->pdev    = c->pdev;
+	rq->netdev  = c->netdev;
+	rq->tstamp  = &priv->tstamp;
+	rq->channel = c;
+	rq->ix      = c->ix;
+	rq->priv    = c->priv;
+
 	switch (priv->params.rq_wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
-		rq->wqe_info = kzalloc_node(wq_sz * sizeof(*rq->wqe_info),
-					    GFP_KERNEL, cpu_to_node(c->cpu));
-		if (!rq->wqe_info) {
-			err = -ENOMEM;
-			goto err_rq_wq_destroy;
-		}
 		rq->handle_rx_cqe = mlx5e_handle_rx_cqe_mpwrq;
 		rq->alloc_wqe = mlx5e_alloc_rx_mpwqe;
 		rq->dealloc_wqe = mlx5e_dealloc_rx_mpwqe;
@@ -341,6 +443,10 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 		rq->mpwqe_num_strides = BIT(priv->params.mpwqe_log_num_strides);
 		rq->wqe_sz = rq->mpwqe_stride_sz * rq->mpwqe_num_strides;
 		byte_count = rq->wqe_sz;
+		rq->mkey_be = cpu_to_be32(c->priv->umr_mkey.key);
+		err = mlx5e_rq_alloc_mpwqe_info(rq, c);
+		if (err)
+			goto err_rq_wq_destroy;
 		break;
 	default: /* MLX5_WQ_TYPE_LINKED_LIST */
 		rq->skb = kzalloc_node(wq_sz * sizeof(*rq->skb), GFP_KERNEL,
@@ -359,27 +465,19 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 		rq->wqe_sz = SKB_DATA_ALIGN(rq->wqe_sz);
 		byte_count = rq->wqe_sz;
 		byte_count |= MLX5_HW_START_PADDING;
+		rq->mkey_be = c->mkey_be;
 	}
 
 	for (i = 0; i < wq_sz; i++) {
 		struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(&rq->wq, i);
 
 		wqe->data.byte_count = cpu_to_be32(byte_count);
+		wqe->data.lkey = rq->mkey_be;
 	}
 
 	INIT_WORK(&rq->am.work, mlx5e_rx_am_work);
 	rq->am.mode = priv->params.rx_cq_period_mode;
 
-	rq->wq_type = priv->params.rq_wq_type;
-	rq->pdev    = c->pdev;
-	rq->netdev  = c->netdev;
-	rq->tstamp  = &priv->tstamp;
-	rq->channel = c;
-	rq->ix      = c->ix;
-	rq->priv    = c->priv;
-	rq->mkey_be = c->mkey_be;
-	rq->umr_mkey_be = cpu_to_be32(c->priv->umr_mkey.key);
-
 	return 0;
 
 err_rq_wq_destroy:
@@ -392,7 +490,7 @@ static void mlx5e_destroy_rq(struct mlx5e_rq *rq)
 {
 	switch (rq->wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
-		kfree(rq->wqe_info);
+		mlx5e_rq_free_mpwqe_info(rq);
 		break;
 	default: /* MLX5_WQ_TYPE_LINKED_LIST */
 		kfree(rq->skb);
@@ -530,7 +628,7 @@ static void mlx5e_free_rx_descs(struct mlx5e_rq *rq)
 
 	/* UMR WQE (if in progress) is always at wq->head */
 	if (test_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state))
-		mlx5e_free_rx_fragmented_mpwqe(rq, &rq->wqe_info[wq->head]);
+		mlx5e_free_rx_mpwqe(rq, &rq->wqe_info[wq->head]);
 
 	while (!mlx5_wq_ll_is_empty(wq)) {
 		wqe_ix_be = *wq->tail_next;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index b6f8ebb..8ad4d32 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -200,7 +200,6 @@ int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
 
 	*((dma_addr_t *)skb->cb) = dma_addr;
 	wqe->data.addr = cpu_to_be64(dma_addr);
-	wqe->data.lkey = rq->mkey_be;
 
 	rq->skb[ix] = skb;
 
@@ -231,44 +230,11 @@ static inline int mlx5e_mpwqe_strides_per_page(struct mlx5e_rq *rq)
 	return rq->mpwqe_num_strides >> MLX5_MPWRQ_WQE_PAGE_ORDER;
 }
 
-static inline void
-mlx5e_dma_pre_sync_linear_mpwqe(struct device *pdev,
-				struct mlx5e_mpw_info *wi,
-				u32 wqe_offset, u32 len)
-{
-	dma_sync_single_for_cpu(pdev, wi->dma_info.addr + wqe_offset,
-				len, DMA_FROM_DEVICE);
-}
-
-static inline void
-mlx5e_dma_pre_sync_fragmented_mpwqe(struct device *pdev,
-				    struct mlx5e_mpw_info *wi,
-				    u32 wqe_offset, u32 len)
-{
-	/* No dma pre sync for fragmented MPWQE */
-}
-
-static inline void
-mlx5e_add_skb_frag_linear_mpwqe(struct mlx5e_rq *rq,
-				struct sk_buff *skb,
-				struct mlx5e_mpw_info *wi,
-				u32 page_idx, u32 frag_offset,
-				u32 len)
-{
-	unsigned int truesize =	ALIGN(len, rq->mpwqe_stride_sz);
-
-	wi->skbs_frags[page_idx]++;
-	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
-			&wi->dma_info.page[page_idx], frag_offset,
-			len, truesize);
-}
-
-static inline void
-mlx5e_add_skb_frag_fragmented_mpwqe(struct mlx5e_rq *rq,
-				    struct sk_buff *skb,
-				    struct mlx5e_mpw_info *wi,
-				    u32 page_idx, u32 frag_offset,
-				    u32 len)
+static inline void mlx5e_add_skb_frag_mpwqe(struct mlx5e_rq *rq,
+					    struct sk_buff *skb,
+					    struct mlx5e_mpw_info *wi,
+					    u32 page_idx, u32 frag_offset,
+					    u32 len)
 {
 	unsigned int truesize =	ALIGN(len, rq->mpwqe_stride_sz);
 
@@ -282,24 +248,11 @@ mlx5e_add_skb_frag_fragmented_mpwqe(struct mlx5e_rq *rq,
 }
 
 static inline void
-mlx5e_copy_skb_header_linear_mpwqe(struct device *pdev,
-				   struct sk_buff *skb,
-				   struct mlx5e_mpw_info *wi,
-				   u32 page_idx, u32 offset,
-				   u32 headlen)
-{
-	struct page *page = &wi->dma_info.page[page_idx];
-
-	skb_copy_to_linear_data(skb, page_address(page) + offset,
-				ALIGN(headlen, sizeof(long)));
-}
-
-static inline void
-mlx5e_copy_skb_header_fragmented_mpwqe(struct device *pdev,
-				       struct sk_buff *skb,
-				       struct mlx5e_mpw_info *wi,
-				       u32 page_idx, u32 offset,
-				       u32 headlen)
+mlx5e_copy_skb_header_mpwqe(struct device *pdev,
+			    struct sk_buff *skb,
+			    struct mlx5e_mpw_info *wi,
+			    u32 page_idx, u32 offset,
+			    u32 headlen)
 {
 	u16 headlen_pg = min_t(u32, headlen, PAGE_SIZE - offset);
 	struct mlx5e_dma_info *dma_info = &wi->umr.dma_info[page_idx];
@@ -324,46 +277,9 @@ mlx5e_copy_skb_header_fragmented_mpwqe(struct device *pdev,
 	}
 }
 
-static u32 mlx5e_get_wqe_mtt_offset(struct mlx5e_rq *rq, u16 wqe_ix)
-{
-	return rq->mpwqe_mtt_offset +
-		wqe_ix * ALIGN(MLX5_MPWRQ_PAGES_PER_WQE, 8);
-}
-
-static void mlx5e_build_umr_wqe(struct mlx5e_rq *rq,
-				struct mlx5e_sq *sq,
-				struct mlx5e_umr_wqe *wqe,
-				u16 ix)
+static inline void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
 {
-	struct mlx5_wqe_ctrl_seg      *cseg = &wqe->ctrl;
-	struct mlx5_wqe_umr_ctrl_seg *ucseg = &wqe->uctrl;
-	struct mlx5_wqe_data_seg      *dseg = &wqe->data;
 	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
-	u8 ds_cnt = DIV_ROUND_UP(sizeof(*wqe), MLX5_SEND_WQE_DS);
-	u32 umr_wqe_mtt_offset = mlx5e_get_wqe_mtt_offset(rq, ix);
-
-	memset(wqe, 0, sizeof(*wqe));
-	cseg->opmod_idx_opcode =
-		cpu_to_be32((sq->pc << MLX5_WQE_CTRL_WQE_INDEX_SHIFT) |
-			    MLX5_OPCODE_UMR);
-	cseg->qpn_ds    = cpu_to_be32((sq->sqn << MLX5_WQE_CTRL_QPN_SHIFT) |
-				      ds_cnt);
-	cseg->fm_ce_se  = MLX5_WQE_CTRL_CQ_UPDATE;
-	cseg->imm       = rq->umr_mkey_be;
-
-	ucseg->flags = MLX5_UMR_TRANSLATION_OFFSET_EN;
-	ucseg->klm_octowords =
-		cpu_to_be16(MLX5_MTT_OCTW(MLX5_MPWRQ_PAGES_PER_WQE));
-	ucseg->bsf_octowords =
-		cpu_to_be16(MLX5_MTT_OCTW(umr_wqe_mtt_offset));
-	ucseg->mkey_mask     = cpu_to_be64(MLX5_MKEY_MASK_FREE);
-
-	dseg->lkey = sq->mkey_be;
-	dseg->addr = cpu_to_be64(wi->umr.mtt_addr);
-}
-
-static void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
-{
 	struct mlx5e_sq *sq = &rq->channel->icosq;
 	struct mlx5_wq_cyc *wq = &sq->wq;
 	struct mlx5e_umr_wqe *wqe;
@@ -378,30 +294,22 @@ static void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
 	}
 
 	wqe = mlx5_wq_cyc_get_wqe(wq, pi);
-	mlx5e_build_umr_wqe(rq, sq, wqe, ix);
+	memcpy(wqe, &wi->umr.wqe, sizeof(*wqe));
+	wqe->ctrl.opmod_idx_opcode =
+		cpu_to_be32((sq->pc << MLX5_WQE_CTRL_WQE_INDEX_SHIFT) |
+			    MLX5_OPCODE_UMR);
+
 	sq->ico_wqe_info[pi].opcode = MLX5_OPCODE_UMR;
 	sq->ico_wqe_info[pi].num_wqebbs = num_wqebbs;
 	sq->pc += num_wqebbs;
 	mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
 }
 
-static inline int mlx5e_get_wqe_mtt_sz(void)
-{
-	/* UMR copies MTTs in units of MLX5_UMR_MTT_ALIGNMENT bytes.
-	 * To avoid copying garbage after the mtt array, we allocate
-	 * a little more.
-	 */
-	return ALIGN(MLX5_MPWRQ_PAGES_PER_WQE * sizeof(__be64),
-		     MLX5_UMR_MTT_ALIGNMENT);
-}
-
-static int mlx5e_alloc_and_map_page(struct mlx5e_rq *rq,
-				    struct mlx5e_mpw_info *wi,
-				    int i)
+static inline int mlx5e_alloc_and_map_page(struct mlx5e_rq *rq,
+					   struct mlx5e_mpw_info *wi,
+					   int i)
 {
-	struct page *page;
-
-	page = dev_alloc_page();
+	struct page *page = dev_alloc_page();
 	if (unlikely(!page))
 		return -ENOMEM;
 
@@ -417,47 +325,25 @@ static int mlx5e_alloc_and_map_page(struct mlx5e_rq *rq,
 	return 0;
 }
 
-static int mlx5e_alloc_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
-					   struct mlx5e_rx_wqe *wqe,
-					   u16 ix)
+static int mlx5e_alloc_rx_umr_mpwqe(struct mlx5e_rq *rq,
+				    struct mlx5e_rx_wqe *wqe,
+				    u16 ix)
 {
 	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
-	int mtt_sz = mlx5e_get_wqe_mtt_sz();
 	u64 dma_offset = (u64)mlx5e_get_wqe_mtt_offset(rq, ix) << PAGE_SHIFT;
+	int pg_strides = mlx5e_mpwqe_strides_per_page(rq);
+	int err;
 	int i;
 
-	wi->umr.dma_info = kmalloc(sizeof(*wi->umr.dma_info) *
-				   MLX5_MPWRQ_PAGES_PER_WQE,
-				   GFP_ATOMIC);
-	if (unlikely(!wi->umr.dma_info))
-		goto err_out;
-
-	/* We allocate more than mtt_sz as we will align the pointer */
-	wi->umr.mtt_no_align = kzalloc(mtt_sz + MLX5_UMR_ALIGN - 1,
-				       GFP_ATOMIC);
-	if (unlikely(!wi->umr.mtt_no_align))
-		goto err_free_umr;
-
-	wi->umr.mtt = PTR_ALIGN(wi->umr.mtt_no_align, MLX5_UMR_ALIGN);
-	wi->umr.mtt_addr = dma_map_single(rq->pdev, wi->umr.mtt, mtt_sz,
-					  PCI_DMA_TODEVICE);
-	if (unlikely(dma_mapping_error(rq->pdev, wi->umr.mtt_addr)))
-		goto err_free_mtt;
-
 	for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
-		if (unlikely(mlx5e_alloc_and_map_page(rq, wi, i)))
+		err = mlx5e_alloc_and_map_page(rq, wi, i);
+		if (unlikely(err))
 			goto err_unmap;
-		page_ref_add(wi->umr.dma_info[i].page,
-			     mlx5e_mpwqe_strides_per_page(rq));
+		page_ref_add(wi->umr.dma_info[i].page, pg_strides);
 		wi->skbs_frags[i] = 0;
 	}
 
 	wi->consumed_strides = 0;
-	wi->dma_pre_sync = mlx5e_dma_pre_sync_fragmented_mpwqe;
-	wi->add_skb_frag = mlx5e_add_skb_frag_fragmented_mpwqe;
-	wi->copy_skb_header = mlx5e_copy_skb_header_fragmented_mpwqe;
-	wi->free_wqe     = mlx5e_free_rx_fragmented_mpwqe;
-	wqe->data.lkey = rq->umr_mkey_be;
 	wqe->data.addr = cpu_to_be64(dma_offset);
 
 	return 0;
@@ -466,41 +352,28 @@ err_unmap:
 	while (--i >= 0) {
 		dma_unmap_page(rq->pdev, wi->umr.dma_info[i].addr, PAGE_SIZE,
 			       PCI_DMA_FROMDEVICE);
-		page_ref_sub(wi->umr.dma_info[i].page,
-			     mlx5e_mpwqe_strides_per_page(rq));
+		page_ref_sub(wi->umr.dma_info[i].page, pg_strides);
 		put_page(wi->umr.dma_info[i].page);
 	}
-	dma_unmap_single(rq->pdev, wi->umr.mtt_addr, mtt_sz, PCI_DMA_TODEVICE);
-
-err_free_mtt:
-	kfree(wi->umr.mtt_no_align);
-
-err_free_umr:
-	kfree(wi->umr.dma_info);
 
-err_out:
-	return -ENOMEM;
+	return err;
 }
 
-void mlx5e_free_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
-				    struct mlx5e_mpw_info *wi)
+void mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi)
 {
-	int mtt_sz = mlx5e_get_wqe_mtt_sz();
+	int pg_strides = mlx5e_mpwqe_strides_per_page(rq);
 	int i;
 
 	for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
 		dma_unmap_page(rq->pdev, wi->umr.dma_info[i].addr, PAGE_SIZE,
 			       PCI_DMA_FROMDEVICE);
 		page_ref_sub(wi->umr.dma_info[i].page,
-			mlx5e_mpwqe_strides_per_page(rq) - wi->skbs_frags[i]);
+			     pg_strides - wi->skbs_frags[i]);
 		put_page(wi->umr.dma_info[i].page);
 	}
-	dma_unmap_single(rq->pdev, wi->umr.mtt_addr, mtt_sz, PCI_DMA_TODEVICE);
-	kfree(wi->umr.mtt_no_align);
-	kfree(wi->umr.dma_info);
 }
 
-void mlx5e_post_rx_fragmented_mpwqe(struct mlx5e_rq *rq)
+void mlx5e_post_rx_mpwqe(struct mlx5e_rq *rq)
 {
 	struct mlx5_wq_ll *wq = &rq->wq;
 	struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(wq, wq->head);
@@ -508,12 +381,11 @@ void mlx5e_post_rx_fragmented_mpwqe(struct mlx5e_rq *rq)
 	clear_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state);
 
 	if (unlikely(test_bit(MLX5E_RQ_STATE_FLUSH, &rq->state))) {
-		mlx5e_free_rx_fragmented_mpwqe(rq, &rq->wqe_info[wq->head]);
+		mlx5e_free_rx_mpwqe(rq, &rq->wqe_info[wq->head]);
 		return;
 	}
 
 	mlx5_wq_ll_push(wq, be16_to_cpu(wqe->next.next_wqe_index));
-	rq->stats.mpwqe_frag++;
 
 	/* ensure wqes are visible to device before updating doorbell record */
 	dma_wmb();
@@ -521,84 +393,23 @@ void mlx5e_post_rx_fragmented_mpwqe(struct mlx5e_rq *rq)
 	mlx5_wq_ll_update_db_record(wq);
 }
 
-static int mlx5e_alloc_rx_linear_mpwqe(struct mlx5e_rq *rq,
-				       struct mlx5e_rx_wqe *wqe,
-				       u16 ix)
-{
-	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
-	gfp_t gfp_mask;
-	int i;
-
-	gfp_mask = GFP_ATOMIC | __GFP_COLD | __GFP_MEMALLOC;
-	wi->dma_info.page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
-					     MLX5_MPWRQ_WQE_PAGE_ORDER);
-	if (unlikely(!wi->dma_info.page))
-		return -ENOMEM;
-
-	wi->dma_info.addr = dma_map_page(rq->pdev, wi->dma_info.page, 0,
-					 rq->wqe_sz, PCI_DMA_FROMDEVICE);
-	if (unlikely(dma_mapping_error(rq->pdev, wi->dma_info.addr))) {
-		put_page(wi->dma_info.page);
-		return -ENOMEM;
-	}
-
-	/* We split the high-order page into order-0 ones and manage their
-	 * reference counter to minimize the memory held by small skb fragments
-	 */
-	split_page(wi->dma_info.page, MLX5_MPWRQ_WQE_PAGE_ORDER);
-	for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
-		page_ref_add(&wi->dma_info.page[i],
-			     mlx5e_mpwqe_strides_per_page(rq));
-		wi->skbs_frags[i] = 0;
-	}
-
-	wi->consumed_strides = 0;
-	wi->dma_pre_sync = mlx5e_dma_pre_sync_linear_mpwqe;
-	wi->add_skb_frag = mlx5e_add_skb_frag_linear_mpwqe;
-	wi->copy_skb_header = mlx5e_copy_skb_header_linear_mpwqe;
-	wi->free_wqe     = mlx5e_free_rx_linear_mpwqe;
-	wqe->data.lkey = rq->mkey_be;
-	wqe->data.addr = cpu_to_be64(wi->dma_info.addr);
-
-	return 0;
-}
-
-void mlx5e_free_rx_linear_mpwqe(struct mlx5e_rq *rq,
-				struct mlx5e_mpw_info *wi)
-{
-	int i;
-
-	dma_unmap_page(rq->pdev, wi->dma_info.addr, rq->wqe_sz,
-		       PCI_DMA_FROMDEVICE);
-	for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
-		page_ref_sub(&wi->dma_info.page[i],
-			mlx5e_mpwqe_strides_per_page(rq) - wi->skbs_frags[i]);
-		put_page(&wi->dma_info.page[i]);
-	}
-}
-
-int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
+int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe,	u16 ix)
 {
 	int err;
 
-	err = mlx5e_alloc_rx_linear_mpwqe(rq, wqe, ix);
-	if (unlikely(err)) {
-		err = mlx5e_alloc_rx_fragmented_mpwqe(rq, wqe, ix);
-		if (unlikely(err))
-			return err;
-		set_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state);
-		mlx5e_post_umr_wqe(rq, ix);
-		return -EBUSY;
-	}
-
-	return 0;
+	err = mlx5e_alloc_rx_umr_mpwqe(rq, wqe, ix);
+	if (unlikely(err))
+		return err;
+	set_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state);
+	mlx5e_post_umr_wqe(rq, ix);
+	return -EBUSY;
 }
 
 void mlx5e_dealloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 {
 	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
 
-	wi->free_wqe(rq, wi);
+	mlx5e_free_rx_mpwqe(rq, wi);
 }
 
 #define RQ_CANNOT_POST(rq) \
@@ -617,9 +428,10 @@ bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq)
 		int err;
 
 		err = rq->alloc_wqe(rq, wqe, wq->head);
+		if (err == -EBUSY)
+			return true;
 		if (unlikely(err)) {
-			if (err != -EBUSY)
-				rq->stats.buff_alloc_err++;
+			rq->stats.buff_alloc_err++;
 			break;
 		}
 
@@ -823,7 +635,6 @@ static inline void mlx5e_mpwqe_fill_rx_skb(struct mlx5e_rq *rq,
 					   u32 cqe_bcnt,
 					   struct sk_buff *skb)
 {
-	u32 consumed_bytes = ALIGN(cqe_bcnt, rq->mpwqe_stride_sz);
 	u16 stride_ix      = mpwrq_get_cqe_stride_index(cqe);
 	u32 wqe_offset     = stride_ix * rq->mpwqe_stride_sz;
 	u32 head_offset    = wqe_offset & (PAGE_SIZE - 1);
@@ -837,21 +648,20 @@ static inline void mlx5e_mpwqe_fill_rx_skb(struct mlx5e_rq *rq,
 		page_idx++;
 		frag_offset -= PAGE_SIZE;
 	}
-	wi->dma_pre_sync(rq->pdev, wi, wqe_offset, consumed_bytes);
 
 	while (byte_cnt) {
 		u32 pg_consumed_bytes =
 			min_t(u32, PAGE_SIZE - frag_offset, byte_cnt);
 
-		wi->add_skb_frag(rq, skb, wi, page_idx, frag_offset,
-				 pg_consumed_bytes);
+		mlx5e_add_skb_frag_mpwqe(rq, skb, wi, page_idx, frag_offset,
+					 pg_consumed_bytes);
 		byte_cnt -= pg_consumed_bytes;
 		frag_offset = 0;
 		page_idx++;
 	}
 	/* copy header */
-	wi->copy_skb_header(rq->pdev, skb, wi, head_page_idx, head_offset,
-			    headlen);
+	mlx5e_copy_skb_header_mpwqe(rq->pdev, skb, wi, head_page_idx,
+				    head_offset, headlen);
 	/* skb linear part was allocated with headlen and aligned to long */
 	skb->tail += headlen;
 	skb->len  += headlen;
@@ -896,7 +706,7 @@ mpwrq_cqe_out:
 	if (likely(wi->consumed_strides < rq->mpwqe_num_strides))
 		return;
 
-	wi->free_wqe(rq, wi);
+	mlx5e_free_rx_mpwqe(rq, wi);
 	mlx5_wq_ll_pop(&rq->wq, cqe->wqe_id, &wqe->next.next_wqe_index);
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
index 499487c..1f56543 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
@@ -73,7 +73,6 @@ struct mlx5e_sw_stats {
 	u64 tx_xmit_more;
 	u64 rx_wqe_err;
 	u64 rx_mpwqe_filler;
-	u64 rx_mpwqe_frag;
 	u64 rx_buff_alloc_err;
 	u64 rx_cqe_compress_blks;
 	u64 rx_cqe_compress_pkts;
@@ -105,7 +104,6 @@ static const struct counter_desc sw_stats_desc[] = {
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_xmit_more) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_wqe_err) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_mpwqe_filler) },
-	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_mpwqe_frag) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_buff_alloc_err) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_cqe_compress_blks) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_cqe_compress_pkts) },
@@ -274,7 +272,6 @@ struct mlx5e_rq_stats {
 	u64 lro_bytes;
 	u64 wqe_err;
 	u64 mpwqe_filler;
-	u64 mpwqe_frag;
 	u64 buff_alloc_err;
 	u64 cqe_compress_blks;
 	u64 cqe_compress_pkts;
@@ -290,7 +287,6 @@ static const struct counter_desc rq_stats_desc[] = {
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, lro_bytes) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, wqe_err) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, mpwqe_filler) },
-	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, mpwqe_frag) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, buff_alloc_err) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, cqe_compress_blks) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, cqe_compress_pkts) },
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
index 9bf33bb..08d8b0c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
@@ -87,7 +87,7 @@ static void mlx5e_poll_ico_cq(struct mlx5e_cq *cq)
 		case MLX5_OPCODE_NOP:
 			break;
 		case MLX5_OPCODE_UMR:
-			mlx5e_post_rx_fragmented_mpwqe(&sq->channel->rq);
+			mlx5e_post_rx_mpwqe(&sq->channel->rq);
 			break;
 		default:
 			WARN_ONCE(true,
-- 
2.7.4


* [PATCH RFC 02/11] net/mlx5e: Introduce API for RX mapped pages
  2016-09-07 12:42 [PATCH RFC 00/11] mlx5 RX refactoring and XDP support Saeed Mahameed
  2016-09-07 12:42 ` [PATCH RFC 01/11] net/mlx5e: Single flow order-0 pages for Striding RQ Saeed Mahameed
@ 2016-09-07 12:42 ` Saeed Mahameed
  2016-09-07 12:42 ` [PATCH RFC 03/11] net/mlx5e: Implement RX mapped page cache for page recycle Saeed Mahameed
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 72+ messages in thread
From: Saeed Mahameed @ 2016-09-07 12:42 UTC (permalink / raw)
  To: iovisor-dev
  Cc: netdev, Tariq Toukan, Brenden Blanco, Alexei Starovoitov,
	Tom Herbert, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Eric Dumazet, Jamal Hadi Salim, Saeed Mahameed

From: Tariq Toukan <tariqt@mellanox.com>

Manage the allocation and deallocation of mapped RX pages only
through dedicated API functions.
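
The intended pairing, shown in a hypothetical caller (the real users are
the MPWQE alloc/free paths touched below):

        struct mlx5e_dma_info di;

        if (unlikely(mlx5e_page_alloc_mapped(rq, &di)))
                return -ENOMEM;         /* page alloc or DMA mapping failed */

        /* ... program di.addr into a descriptor, receive into di.page ... */

        mlx5e_page_release(rq, &di);    /* dma_unmap_page() + put_page() */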

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 46 +++++++++++++++----------
 1 file changed, 27 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 8ad4d32..c1cb510 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -305,26 +305,32 @@ static inline void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
 	mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
 }
 
-static inline int mlx5e_alloc_and_map_page(struct mlx5e_rq *rq,
-					   struct mlx5e_mpw_info *wi,
-					   int i)
+static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
+					  struct mlx5e_dma_info *dma_info)
 {
 	struct page *page = dev_alloc_page();
+
 	if (unlikely(!page))
 		return -ENOMEM;
 
-	wi->umr.dma_info[i].page = page;
-	wi->umr.dma_info[i].addr = dma_map_page(rq->pdev, page, 0, PAGE_SIZE,
-						PCI_DMA_FROMDEVICE);
-	if (unlikely(dma_mapping_error(rq->pdev, wi->umr.dma_info[i].addr))) {
+	dma_info->page = page;
+	dma_info->addr = dma_map_page(rq->pdev, page, 0, PAGE_SIZE,
+				      DMA_FROM_DEVICE);
+	if (unlikely(dma_mapping_error(rq->pdev, dma_info->addr))) {
 		put_page(page);
 		return -ENOMEM;
 	}
-	wi->umr.mtt[i] = cpu_to_be64(wi->umr.dma_info[i].addr | MLX5_EN_WR);
 
 	return 0;
 }
 
+static inline void mlx5e_page_release(struct mlx5e_rq *rq,
+				      struct mlx5e_dma_info *dma_info)
+{
+	dma_unmap_page(rq->pdev, dma_info->addr, PAGE_SIZE, DMA_FROM_DEVICE);
+	put_page(dma_info->page);
+}
+
 static int mlx5e_alloc_rx_umr_mpwqe(struct mlx5e_rq *rq,
 				    struct mlx5e_rx_wqe *wqe,
 				    u16 ix)
@@ -336,10 +342,13 @@ static int mlx5e_alloc_rx_umr_mpwqe(struct mlx5e_rq *rq,
 	int i;
 
 	for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
-		err = mlx5e_alloc_and_map_page(rq, wi, i);
+		struct mlx5e_dma_info *dma_info = &wi->umr.dma_info[i];
+
+		err = mlx5e_page_alloc_mapped(rq, dma_info);
 		if (unlikely(err))
 			goto err_unmap;
-		page_ref_add(wi->umr.dma_info[i].page, pg_strides);
+		wi->umr.mtt[i] = cpu_to_be64(dma_info->addr | MLX5_EN_WR);
+		page_ref_add(dma_info->page, pg_strides);
 		wi->skbs_frags[i] = 0;
 	}
 
@@ -350,10 +359,10 @@ static int mlx5e_alloc_rx_umr_mpwqe(struct mlx5e_rq *rq,
 
 err_unmap:
 	while (--i >= 0) {
-		dma_unmap_page(rq->pdev, wi->umr.dma_info[i].addr, PAGE_SIZE,
-			       PCI_DMA_FROMDEVICE);
-		page_ref_sub(wi->umr.dma_info[i].page, pg_strides);
-		put_page(wi->umr.dma_info[i].page);
+		struct mlx5e_dma_info *dma_info = &wi->umr.dma_info[i];
+
+		page_ref_sub(dma_info->page, pg_strides);
+		mlx5e_page_release(rq, dma_info);
 	}
 
 	return err;
@@ -365,11 +374,10 @@ void mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi)
 	int i;
 
 	for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
-		dma_unmap_page(rq->pdev, wi->umr.dma_info[i].addr, PAGE_SIZE,
-			       PCI_DMA_FROMDEVICE);
-		page_ref_sub(wi->umr.dma_info[i].page,
-			     pg_strides - wi->skbs_frags[i]);
-		put_page(wi->umr.dma_info[i].page);
+		struct mlx5e_dma_info *dma_info = &wi->umr.dma_info[i];
+
+		page_ref_sub(dma_info->page, pg_strides - wi->skbs_frags[i]);
+		mlx5e_page_release(rq, dma_info);
 	}
 }
 
-- 
2.7.4


* [PATCH RFC 03/11] net/mlx5e: Implement RX mapped page cache for page recycle
  2016-09-07 12:42 [PATCH RFC 00/11] mlx5 RX refactoring and XDP support Saeed Mahameed
  2016-09-07 12:42 ` [PATCH RFC 01/11] net/mlx5e: Single flow order-0 pages for Striding RQ Saeed Mahameed
  2016-09-07 12:42 ` [PATCH RFC 02/11] net/mlx5e: Introduce API for RX mapped pages Saeed Mahameed
@ 2016-09-07 12:42 ` Saeed Mahameed
       [not found]   ` <1473252152-11379-4-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2016-09-07 12:42 ` [PATCH RFC 04/11] net/mlx5e: Build RX SKB on demand Saeed Mahameed
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 72+ messages in thread
From: Saeed Mahameed @ 2016-09-07 12:42 UTC (permalink / raw)
  To: iovisor-dev
  Cc: netdev, Tariq Toukan, Brenden Blanco, Alexei Starovoitov,
	Tom Herbert, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Eric Dumazet, Jamal Hadi Salim, Saeed Mahameed

From: Tariq Toukan <tariqt@mellanox.com>

Instead of reallocating and mapping pages for the RX data-path, recycle
already used pages in a per-ring cache.
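
The cache is a head/tail ring over a power-of-two array; below is a
minimal stand-alone model of the indexing only (the page refcount check,
DMA sync and stats of the real code are omitted):

#include <stdbool.h>
#include <stdio.h>

#define CACHE_SIZE 128                  /* must be a power of two */

struct page_cache {
        unsigned int head, tail;
        void *slot[CACHE_SIZE];
};

/* returns false when full; the caller then unmaps and frees the page */
static bool cache_put(struct page_cache *c, void *page)
{
        unsigned int tail_next = (c->tail + 1) & (CACHE_SIZE - 1);

        if (tail_next == c->head)
                return false;
        c->slot[c->tail] = page;
        c->tail = tail_next;
        return true;
}

/* returns NULL when empty; the caller then allocates a fresh page */
static void *cache_get(struct page_cache *c)
{
        void *page;

        if (c->head == c->tail)
                return NULL;
        page = c->slot[c->head];
        c->head = (c->head + 1) & (CACHE_SIZE - 1);
        return page;
}

int main(void)
{
        struct page_cache c = { .head = 0, .tail = 0 };
        int page;

        cache_put(&c, &page);
        printf("reused: %d\n", cache_get(&c) == (void *)&page);
        return 0;
}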

We ran pktgen single-stream benchmarks, with iptables-raw-drop:

Single stride, 64 bytes:
* 4,739,057 - baseline
* 4,749,550 - order0 no cache
* 4,786,899 - order0 with cache
1% gain

Larger packets, no page cross, 1024 bytes:
* 3,982,361 - baseline
* 3,845,682 - order0 no cache
* 4,127,852 - order0 with cache
3.7% gain

Larger packets, every 3rd packet crosses a page, 1500 bytes:
* 3,731,189 - baseline
* 3,579,414 - order0 no cache
* 3,931,708 - order0 with cache
5.4% gain

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h       | 16 ++++++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 15 ++++++
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    | 57 ++++++++++++++++++++--
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h | 16 ++++++
 4 files changed, 99 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 075cdfc..afbdf70 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -287,6 +287,18 @@ struct mlx5e_rx_am { /* Adaptive Moderation */
 	u8					tired;
 };
 
+/* a single cache unit is capable to serve one napi call (for non-striding rq)
+ * or a MPWQE (for striding rq).
+ */
+#define MLX5E_CACHE_UNIT	(MLX5_MPWRQ_PAGES_PER_WQE > NAPI_POLL_WEIGHT ? \
+				 MLX5_MPWRQ_PAGES_PER_WQE : NAPI_POLL_WEIGHT)
+#define MLX5E_CACHE_SIZE	(2 * roundup_pow_of_two(MLX5E_CACHE_UNIT))
+struct mlx5e_page_cache {
+	u32 head;
+	u32 tail;
+	struct mlx5e_dma_info page_cache[MLX5E_CACHE_SIZE];
+};
+
 struct mlx5e_rq {
 	/* data path */
 	struct mlx5_wq_ll      wq;
@@ -301,6 +313,8 @@ struct mlx5e_rq {
 	struct mlx5e_tstamp   *tstamp;
 	struct mlx5e_rq_stats  stats;
 	struct mlx5e_cq        cq;
+	struct mlx5e_page_cache page_cache;
+
 	mlx5e_fp_handle_rx_cqe handle_rx_cqe;
 	mlx5e_fp_alloc_wqe     alloc_wqe;
 	mlx5e_fp_dealloc_wqe   dealloc_wqe;
@@ -685,6 +699,8 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget);
 int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget);
 void mlx5e_free_tx_descs(struct mlx5e_sq *sq);
 
+void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
+			bool recycle);
 void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
 void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
 bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 0db4d3b..c84702c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -141,6 +141,10 @@ static void mlx5e_update_sw_counters(struct mlx5e_priv *priv)
 		s->rx_buff_alloc_err += rq_stats->buff_alloc_err;
 		s->rx_cqe_compress_blks += rq_stats->cqe_compress_blks;
 		s->rx_cqe_compress_pkts += rq_stats->cqe_compress_pkts;
+		s->rx_cache_reuse += rq_stats->cache_reuse;
+		s->rx_cache_full  += rq_stats->cache_full;
+		s->rx_cache_empty += rq_stats->cache_empty;
+		s->rx_cache_busy  += rq_stats->cache_busy;
 
 		for (j = 0; j < priv->params.num_tc; j++) {
 			sq_stats = &priv->channel[i]->sq[j].stats;
@@ -478,6 +482,9 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 	INIT_WORK(&rq->am.work, mlx5e_rx_am_work);
 	rq->am.mode = priv->params.rx_cq_period_mode;
 
+	rq->page_cache.head = 0;
+	rq->page_cache.tail = 0;
+
 	return 0;
 
 err_rq_wq_destroy:
@@ -488,6 +495,8 @@ err_rq_wq_destroy:
 
 static void mlx5e_destroy_rq(struct mlx5e_rq *rq)
 {
+	int i;
+
 	switch (rq->wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
 		mlx5e_rq_free_mpwqe_info(rq);
@@ -496,6 +505,12 @@ static void mlx5e_destroy_rq(struct mlx5e_rq *rq)
 		kfree(rq->skb);
 	}
 
+	for (i = rq->page_cache.head; i != rq->page_cache.tail;
+	     i = (i + 1) & (MLX5E_CACHE_SIZE - 1)) {
+		struct mlx5e_dma_info *dma_info = &rq->page_cache.page_cache[i];
+
+		mlx5e_page_release(rq, dma_info, false);
+	}
 	mlx5_wq_destroy(&rq->wq_ctrl);
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index c1cb510..8e02af3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -305,11 +305,55 @@ static inline void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
 	mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
 }
 
+static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
+				      struct mlx5e_dma_info *dma_info)
+{
+	struct mlx5e_page_cache *cache = &rq->page_cache;
+	u32 tail_next = (cache->tail + 1) & (MLX5E_CACHE_SIZE - 1);
+
+	if (tail_next == cache->head) {
+		rq->stats.cache_full++;
+		return false;
+	}
+
+	cache->page_cache[cache->tail] = *dma_info;
+	cache->tail = tail_next;
+	return true;
+}
+
+static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq,
+				      struct mlx5e_dma_info *dma_info)
+{
+	struct mlx5e_page_cache *cache = &rq->page_cache;
+
+	if (unlikely(cache->head == cache->tail)) {
+		rq->stats.cache_empty++;
+		return false;
+	}
+
+	if (page_ref_count(cache->page_cache[cache->head].page) != 1) {
+		rq->stats.cache_busy++;
+		return false;
+	}
+
+	*dma_info = cache->page_cache[cache->head];
+	cache->head = (cache->head + 1) & (MLX5E_CACHE_SIZE - 1);
+	rq->stats.cache_reuse++;
+
+	dma_sync_single_for_device(rq->pdev, dma_info->addr, PAGE_SIZE,
+				   DMA_FROM_DEVICE);
+	return true;
+}
+
 static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
 					  struct mlx5e_dma_info *dma_info)
 {
-	struct page *page = dev_alloc_page();
+	struct page *page;
+
+	if (mlx5e_rx_cache_get(rq, dma_info))
+		return 0;
 
+	page = dev_alloc_page();
 	if (unlikely(!page))
 		return -ENOMEM;
 
@@ -324,9 +368,12 @@ static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
 	return 0;
 }
 
-static inline void mlx5e_page_release(struct mlx5e_rq *rq,
-				      struct mlx5e_dma_info *dma_info)
+void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
+			bool recycle)
 {
+	if (likely(recycle) && mlx5e_rx_cache_put(rq, dma_info))
+		return;
+
 	dma_unmap_page(rq->pdev, dma_info->addr, PAGE_SIZE, DMA_FROM_DEVICE);
 	put_page(dma_info->page);
 }
@@ -362,7 +409,7 @@ err_unmap:
 		struct mlx5e_dma_info *dma_info = &wi->umr.dma_info[i];
 
 		page_ref_sub(dma_info->page, pg_strides);
-		mlx5e_page_release(rq, dma_info);
+		mlx5e_page_release(rq, dma_info, true);
 	}
 
 	return err;
@@ -377,7 +424,7 @@ void mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi)
 		struct mlx5e_dma_info *dma_info = &wi->umr.dma_info[i];
 
 		page_ref_sub(dma_info->page, pg_strides - wi->skbs_frags[i]);
-		mlx5e_page_release(rq, dma_info);
+		mlx5e_page_release(rq, dma_info, true);
 	}
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
index 1f56543..6af8d79 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
@@ -76,6 +76,10 @@ struct mlx5e_sw_stats {
 	u64 rx_buff_alloc_err;
 	u64 rx_cqe_compress_blks;
 	u64 rx_cqe_compress_pkts;
+	u64 rx_cache_reuse;
+	u64 rx_cache_full;
+	u64 rx_cache_empty;
+	u64 rx_cache_busy;
 
 	/* Special handling counters */
 	u64 link_down_events_phy;
@@ -107,6 +111,10 @@ static const struct counter_desc sw_stats_desc[] = {
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_buff_alloc_err) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_cqe_compress_blks) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_cqe_compress_pkts) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_cache_reuse) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_cache_full) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_cache_empty) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_cache_busy) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, link_down_events_phy) },
 };
 
@@ -275,6 +283,10 @@ struct mlx5e_rq_stats {
 	u64 buff_alloc_err;
 	u64 cqe_compress_blks;
 	u64 cqe_compress_pkts;
+	u64 cache_reuse;
+	u64 cache_full;
+	u64 cache_empty;
+	u64 cache_busy;
 };
 
 static const struct counter_desc rq_stats_desc[] = {
@@ -290,6 +302,10 @@ static const struct counter_desc rq_stats_desc[] = {
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, buff_alloc_err) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, cqe_compress_blks) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, cqe_compress_pkts) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, cache_reuse) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, cache_full) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, cache_empty) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, cache_busy) },
 };
 
 struct mlx5e_sq_stats {
-- 
2.7.4


* [PATCH RFC 04/11] net/mlx5e: Build RX SKB on demand
  2016-09-07 12:42 [PATCH RFC 00/11] mlx5 RX refactoring and XDP support Saeed Mahameed
                   ` (2 preceding siblings ...)
  2016-09-07 12:42 ` [PATCH RFC 03/11] net/mlx5e: Implement RX mapped page cache for page recycle Saeed Mahameed
@ 2016-09-07 12:42 ` Saeed Mahameed
  2016-09-07 17:34   ` Alexei Starovoitov
       [not found]   ` <1473252152-11379-5-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2016-09-07 12:42 ` [PATCH RFC 05/11] net/mlx5e: Union RQ RX info per RQ type Saeed Mahameed
                   ` (7 subsequent siblings)
  11 siblings, 2 replies; 72+ messages in thread
From: Saeed Mahameed @ 2016-09-07 12:42 UTC (permalink / raw)
  To: iovisor-dev
  Cc: netdev, Tariq Toukan, Brenden Blanco, Alexei Starovoitov,
	Tom Herbert, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Eric Dumazet, Jamal Hadi Salim, Saeed Mahameed

For the non-striding RQ configuration, before this patch we had a ring of
pre-allocated SKBs whose skb->data buffers were mapped to the device.

For robustness and better RX data buffer management, we now allocate a
page per packet and build_skb() around it.

This patch (which is a prerequisite for XDP) will actually reduce
performance for normal stack usage, because we are now hitting a bottleneck
in the page allocator.  A later page-reuse mechanism patch will be needed
to restore, or even improve, performance in comparison to the old RX scheme.
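
As a rough sanity check that a single order-0 page is enough for the default
MTU, following the sizing done in this patch (typical x86_64 values assumed;
NET_SKB_PAD, SMP_CACHE_BYTES and sizeof(struct skb_shared_info) are
configuration dependent):

#include <stdio.h>

#define ALIGN(x, a)       (((x) + (a) - 1) & ~((a) - 1))
#define PAGE_SIZE         4096
#define SMP_CACHE_BYTES   64
#define NET_SKB_PAD       64            /* MLX5_RX_HEADROOM in this patch */
#define SKB_SHINFO_SZ     320           /* approx sizeof(struct skb_shared_info) */

int main(void)
{
        unsigned int mtu = 1500;
        unsigned int wqe_sz = mtu + 14 + 4 + 4; /* MLX5E_SW2HW_MTU: ETH+VLAN+FCS */
        unsigned int frag_sz = NET_SKB_PAD + wqe_sz +
                               ALIGN(SKB_SHINFO_SZ, SMP_CACHE_BYTES);

        frag_sz = ALIGN(frag_sz, SMP_CACHE_BYTES);  /* SKB_DATA_ALIGN */
        /* 64 + 1522 + 320 = 1906 -> 1920 after alignment: fits in one page */
        printf("frag_sz=%u npages=%u\n", frag_sz,
               (frag_sz + PAGE_SIZE - 1) / PAGE_SIZE);
        return 0;
}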

Packet rate performance testing was done with pktgen 64B packets on xmit
side and TC drop action on RX side.

CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Comparison is done between:
 1.Baseline, before 'net/mlx5e: Build RX SKB on demand'
 2.Build SKB with RX page cache (This patch)

Streams    Baseline    Build SKB+page-cache    Improvement
-----------------------------------------------------------
1          4.33Mpps      5.51Mpps                27%
2          7.35Mpps      11.5Mpps                52%
4          14.0Mpps      16.3Mpps                16%
8          22.2Mpps      29.6Mpps                20%
16         24.8Mpps      34.0Mpps                17%

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |  10 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |  31 +++-
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   | 215 +++++++++++-----------
 3 files changed, 133 insertions(+), 123 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index afbdf70..a346112 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -65,6 +65,8 @@
 #define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW            0x3
 #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW            0x6
 
+#define MLX5_RX_HEADROOM NET_SKB_PAD
+
 #define MLX5_MPWRQ_LOG_STRIDE_SIZE		6  /* >= 6, HW restriction */
 #define MLX5_MPWRQ_LOG_STRIDE_SIZE_CQE_COMPRESS	8  /* >= 6, HW restriction */
 #define MLX5_MPWRQ_LOG_WQE_SZ			18
@@ -302,10 +304,14 @@ struct mlx5e_page_cache {
 struct mlx5e_rq {
 	/* data path */
 	struct mlx5_wq_ll      wq;
-	u32                    wqe_sz;
-	struct sk_buff       **skb;
+
+	struct mlx5e_dma_info *dma_info;
 	struct mlx5e_mpw_info *wqe_info;
 	void                  *mtt_no_align;
+	struct {
+		u8             page_order;
+		u32            wqe_sz;    /* wqe data buffer size */
+	} buff;
 	__be32                 mkey_be;
 
 	struct device         *pdev;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index c84702c..c9f1dea 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -411,6 +411,8 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 	void *rqc = param->rqc;
 	void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq);
 	u32 byte_count;
+	u32 frag_sz;
+	int npages;
 	int wq_sz;
 	int err;
 	int i;
@@ -445,29 +447,40 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 
 		rq->mpwqe_stride_sz = BIT(priv->params.mpwqe_log_stride_sz);
 		rq->mpwqe_num_strides = BIT(priv->params.mpwqe_log_num_strides);
-		rq->wqe_sz = rq->mpwqe_stride_sz * rq->mpwqe_num_strides;
-		byte_count = rq->wqe_sz;
+
+		rq->buff.wqe_sz = rq->mpwqe_stride_sz * rq->mpwqe_num_strides;
+		byte_count = rq->buff.wqe_sz;
 		rq->mkey_be = cpu_to_be32(c->priv->umr_mkey.key);
 		err = mlx5e_rq_alloc_mpwqe_info(rq, c);
 		if (err)
 			goto err_rq_wq_destroy;
 		break;
 	default: /* MLX5_WQ_TYPE_LINKED_LIST */
-		rq->skb = kzalloc_node(wq_sz * sizeof(*rq->skb), GFP_KERNEL,
-				       cpu_to_node(c->cpu));
-		if (!rq->skb) {
+		rq->dma_info = kzalloc_node(wq_sz * sizeof(*rq->dma_info), GFP_KERNEL,
+					    cpu_to_node(c->cpu));
+		if (!rq->dma_info) {
 			err = -ENOMEM;
 			goto err_rq_wq_destroy;
 		}
+
 		rq->handle_rx_cqe = mlx5e_handle_rx_cqe;
 		rq->alloc_wqe = mlx5e_alloc_rx_wqe;
 		rq->dealloc_wqe = mlx5e_dealloc_rx_wqe;
 
-		rq->wqe_sz = (priv->params.lro_en) ?
+		rq->buff.wqe_sz = (priv->params.lro_en) ?
 				priv->params.lro_wqe_sz :
 				MLX5E_SW2HW_MTU(priv->netdev->mtu);
-		rq->wqe_sz = SKB_DATA_ALIGN(rq->wqe_sz);
-		byte_count = rq->wqe_sz;
+		byte_count = rq->buff.wqe_sz;
+
+		/* calc the required page order */
+		frag_sz = MLX5_RX_HEADROOM +
+			  byte_count /* packet data */ +
+			  SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+		frag_sz = SKB_DATA_ALIGN(frag_sz);
+
+		npages = DIV_ROUND_UP(frag_sz, PAGE_SIZE);
+		rq->buff.page_order = order_base_2(npages);
+
 		byte_count |= MLX5_HW_START_PADDING;
 		rq->mkey_be = c->mkey_be;
 	}
@@ -502,7 +515,7 @@ static void mlx5e_destroy_rq(struct mlx5e_rq *rq)
 		mlx5e_rq_free_mpwqe_info(rq);
 		break;
 	default: /* MLX5_WQ_TYPE_LINKED_LIST */
-		kfree(rq->skb);
+		kfree(rq->dma_info);
 	}
 
 	for (i = rq->page_cache.head; i != rq->page_cache.tail;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 8e02af3..2f5bc6f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -179,50 +179,99 @@ unlock:
 	mutex_unlock(&priv->state_lock);
 }
 
-int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
+#define RQ_PAGE_SIZE(rq) ((1 << rq->buff.page_order) << PAGE_SHIFT)
+
+static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
+				      struct mlx5e_dma_info *dma_info)
 {
-	struct sk_buff *skb;
-	dma_addr_t dma_addr;
+	struct mlx5e_page_cache *cache = &rq->page_cache;
+	u32 tail_next = (cache->tail + 1) & (MLX5E_CACHE_SIZE - 1);
 
-	skb = napi_alloc_skb(rq->cq.napi, rq->wqe_sz);
-	if (unlikely(!skb))
-		return -ENOMEM;
+	if (tail_next == cache->head) {
+		rq->stats.cache_full++;
+		return false;
+	}
+
+	cache->page_cache[cache->tail] = *dma_info;
+	cache->tail = tail_next;
+	return true;
+}
+
+static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq,
+				      struct mlx5e_dma_info *dma_info)
+{
+	struct mlx5e_page_cache *cache = &rq->page_cache;
+
+	if (unlikely(cache->head == cache->tail)) {
+		rq->stats.cache_empty++;
+		return false;
+	}
+
+	if (page_ref_count(cache->page_cache[cache->head].page) != 1) {
+		rq->stats.cache_busy++;
+		return false;
+	}
+
+	*dma_info = cache->page_cache[cache->head];
+	cache->head = (cache->head + 1) & (MLX5E_CACHE_SIZE - 1);
+	rq->stats.cache_reuse++;
+
+	dma_sync_single_for_device(rq->pdev, dma_info->addr,
+				   RQ_PAGE_SIZE(rq),
+				   DMA_FROM_DEVICE);
+	return true;
+}
 
-	dma_addr = dma_map_single(rq->pdev,
-				  /* hw start padding */
-				  skb->data,
-				  /* hw end padding */
-				  rq->wqe_sz,
-				  DMA_FROM_DEVICE);
+static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
+					  struct mlx5e_dma_info *dma_info)
+{
+	struct page *page;
 
-	if (unlikely(dma_mapping_error(rq->pdev, dma_addr)))
-		goto err_free_skb;
+	if (mlx5e_rx_cache_get(rq, dma_info))
+		return 0;
 
-	*((dma_addr_t *)skb->cb) = dma_addr;
-	wqe->data.addr = cpu_to_be64(dma_addr);
+	page = dev_alloc_pages(rq->buff.page_order);
+	if (unlikely(!page))
+		return -ENOMEM;
 
-	rq->skb[ix] = skb;
+	dma_info->page = page;
+	dma_info->addr = dma_map_page(rq->pdev, page, 0,
+				      RQ_PAGE_SIZE(rq), DMA_FROM_DEVICE);
+	if (unlikely(dma_mapping_error(rq->pdev, dma_info->addr))) {
+		put_page(page);
+		return -ENOMEM;
+	}
 
 	return 0;
+}
 
-err_free_skb:
-	dev_kfree_skb(skb);
+void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
+			bool recycle)
+{
+	if (likely(recycle) && mlx5e_rx_cache_put(rq, dma_info))
+		return;
+
+	dma_unmap_page(rq->pdev, dma_info->addr, RQ_PAGE_SIZE(rq),
+		       DMA_FROM_DEVICE);
+	put_page(dma_info->page);
+}
+
+int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
+{
+	struct mlx5e_dma_info *di = &rq->dma_info[ix];
 
-	return -ENOMEM;
+	if (unlikely(mlx5e_page_alloc_mapped(rq, di)))
+		return -ENOMEM;
+
+	wqe->data.addr = cpu_to_be64(di->addr + MLX5_RX_HEADROOM);
+	return 0;
 }
 
 void mlx5e_dealloc_rx_wqe(struct mlx5e_rq *rq, u16 ix)
 {
-	struct sk_buff *skb = rq->skb[ix];
+	struct mlx5e_dma_info *di = &rq->dma_info[ix];
 
-	if (skb) {
-		rq->skb[ix] = NULL;
-		dma_unmap_single(rq->pdev,
-				 *((dma_addr_t *)skb->cb),
-				 rq->wqe_sz,
-				 DMA_FROM_DEVICE);
-		dev_kfree_skb(skb);
-	}
+	mlx5e_page_release(rq, di, true);
 }
 
 static inline int mlx5e_mpwqe_strides_per_page(struct mlx5e_rq *rq)
@@ -305,79 +354,6 @@ static inline void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
 	mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
 }
 
-static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
-				      struct mlx5e_dma_info *dma_info)
-{
-	struct mlx5e_page_cache *cache = &rq->page_cache;
-	u32 tail_next = (cache->tail + 1) & (MLX5E_CACHE_SIZE - 1);
-
-	if (tail_next == cache->head) {
-		rq->stats.cache_full++;
-		return false;
-	}
-
-	cache->page_cache[cache->tail] = *dma_info;
-	cache->tail = tail_next;
-	return true;
-}
-
-static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq,
-				      struct mlx5e_dma_info *dma_info)
-{
-	struct mlx5e_page_cache *cache = &rq->page_cache;
-
-	if (unlikely(cache->head == cache->tail)) {
-		rq->stats.cache_empty++;
-		return false;
-	}
-
-	if (page_ref_count(cache->page_cache[cache->head].page) != 1) {
-		rq->stats.cache_busy++;
-		return false;
-	}
-
-	*dma_info = cache->page_cache[cache->head];
-	cache->head = (cache->head + 1) & (MLX5E_CACHE_SIZE - 1);
-	rq->stats.cache_reuse++;
-
-	dma_sync_single_for_device(rq->pdev, dma_info->addr, PAGE_SIZE,
-				   DMA_FROM_DEVICE);
-	return true;
-}
-
-static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
-					  struct mlx5e_dma_info *dma_info)
-{
-	struct page *page;
-
-	if (mlx5e_rx_cache_get(rq, dma_info))
-		return 0;
-
-	page = dev_alloc_page();
-	if (unlikely(!page))
-		return -ENOMEM;
-
-	dma_info->page = page;
-	dma_info->addr = dma_map_page(rq->pdev, page, 0, PAGE_SIZE,
-				      DMA_FROM_DEVICE);
-	if (unlikely(dma_mapping_error(rq->pdev, dma_info->addr))) {
-		put_page(page);
-		return -ENOMEM;
-	}
-
-	return 0;
-}
-
-void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
-			bool recycle)
-{
-	if (likely(recycle) && mlx5e_rx_cache_put(rq, dma_info))
-		return;
-
-	dma_unmap_page(rq->pdev, dma_info->addr, PAGE_SIZE, DMA_FROM_DEVICE);
-	put_page(dma_info->page);
-}
-
 static int mlx5e_alloc_rx_umr_mpwqe(struct mlx5e_rq *rq,
 				    struct mlx5e_rx_wqe *wqe,
 				    u16 ix)
@@ -448,7 +424,7 @@ void mlx5e_post_rx_mpwqe(struct mlx5e_rq *rq)
 	mlx5_wq_ll_update_db_record(wq);
 }
 
-int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe,	u16 ix)
+int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
 {
 	int err;
 
@@ -650,31 +626,46 @@ static inline void mlx5e_complete_rx_cqe(struct mlx5e_rq *rq,
 
 void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 {
+	struct mlx5e_dma_info *di;
 	struct mlx5e_rx_wqe *wqe;
-	struct sk_buff *skb;
 	__be16 wqe_counter_be;
+	struct sk_buff *skb;
 	u16 wqe_counter;
 	u32 cqe_bcnt;
+	void *va;
 
 	wqe_counter_be = cqe->wqe_counter;
 	wqe_counter    = be16_to_cpu(wqe_counter_be);
 	wqe            = mlx5_wq_ll_get_wqe(&rq->wq, wqe_counter);
-	skb            = rq->skb[wqe_counter];
-	prefetch(skb->data);
-	rq->skb[wqe_counter] = NULL;
+	di             = &rq->dma_info[wqe_counter];
+	va             = page_address(di->page);
 
-	dma_unmap_single(rq->pdev,
-			 *((dma_addr_t *)skb->cb),
-			 rq->wqe_sz,
-			 DMA_FROM_DEVICE);
+	dma_sync_single_range_for_cpu(rq->pdev,
+				      di->addr,
+				      MLX5_RX_HEADROOM,
+				      rq->buff.wqe_sz,
+				      DMA_FROM_DEVICE);
+	prefetch(va + MLX5_RX_HEADROOM);
 
 	if (unlikely((cqe->op_own >> 4) != MLX5_CQE_RESP_SEND)) {
 		rq->stats.wqe_err++;
-		dev_kfree_skb(skb);
+		mlx5e_page_release(rq, di, true);
 		goto wq_ll_pop;
 	}
 
+	skb = build_skb(va, RQ_PAGE_SIZE(rq));
+	if (unlikely(!skb)) {
+		rq->stats.buff_alloc_err++;
+		mlx5e_page_release(rq, di, true);
+		goto wq_ll_pop;
+	}
+
+	/* queue up for recycling ..*/
+	page_ref_inc(di->page);
+	mlx5e_page_release(rq, di, true);
+
 	cqe_bcnt = be32_to_cpu(cqe->byte_cnt);
+	skb_reserve(skb, MLX5_RX_HEADROOM);
 	skb_put(skb, cqe_bcnt);
 
 	mlx5e_complete_rx_cqe(rq, cqe, cqe_bcnt, skb);
-- 
2.7.4

* [PATCH RFC 05/11] net/mlx5e: Union RQ RX info per RQ type
  2016-09-07 12:42 [PATCH RFC 00/11] mlx5 RX refactoring and XDP support Saeed Mahameed
                   ` (3 preceding siblings ...)
  2016-09-07 12:42 ` [PATCH RFC 04/11] net/mlx5e: Build RX SKB on demand Saeed Mahameed
@ 2016-09-07 12:42 ` Saeed Mahameed
  2016-09-07 12:42 ` [PATCH RFC 06/11] net/mlx5e: Slightly reduce hardware LRO size Saeed Mahameed
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 72+ messages in thread
From: Saeed Mahameed @ 2016-09-07 12:42 UTC (permalink / raw)
  To: iovisor-dev
  Cc: netdev, Tariq Toukan, Brenden Blanco, Alexei Starovoitov,
	Tom Herbert, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Eric Dumazet, Jamal Hadi Salim, Saeed Mahameed

We have two types of RX RQs, and each uses its own set of info arrays
and structures in the RX data path.  Today those structures are
mutually exclusive per RQ type, so only one kind is allocated on RQ
creation, according to the RQ type.

For better cache locality and to minimize the sizeof(struct mlx5e_rq),
this patch defines them as a union.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h      | 14 ++++++----
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 32 +++++++++++------------
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   | 10 +++----
 3 files changed, 30 insertions(+), 26 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index a346112..7dfb34e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -305,9 +305,14 @@ struct mlx5e_rq {
 	/* data path */
 	struct mlx5_wq_ll      wq;
 
-	struct mlx5e_dma_info *dma_info;
-	struct mlx5e_mpw_info *wqe_info;
-	void                  *mtt_no_align;
+	union {
+		struct mlx5e_dma_info *dma_info;
+		struct {
+			struct mlx5e_mpw_info *info;
+			void                  *mtt_no_align;
+			u32                    mtt_offset;
+		} mpwqe;
+	};
 	struct {
 		u8             page_order;
 		u32            wqe_sz;    /* wqe data buffer size */
@@ -327,7 +332,6 @@ struct mlx5e_rq {
 
 	unsigned long          state;
 	int                    ix;
-	u32                    mpwqe_mtt_offset;
 
 	struct mlx5e_rx_am     am; /* Adaptive Moderation */
 
@@ -804,7 +808,7 @@ static inline void mlx5e_cq_arm(struct mlx5e_cq *cq)
 
 static inline u32 mlx5e_get_wqe_mtt_offset(struct mlx5e_rq *rq, u16 wqe_ix)
 {
-	return rq->mpwqe_mtt_offset +
+	return rq->mpwqe.mtt_offset +
 		wqe_ix * ALIGN(MLX5_MPWRQ_PAGES_PER_WQE, 8);
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index c9f1dea..9f0f5f6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -317,7 +317,7 @@ static inline void mlx5e_build_umr_wqe(struct mlx5e_rq *rq, struct mlx5e_sq *sq,
 	struct mlx5_wqe_ctrl_seg      *cseg = &wqe->ctrl;
 	struct mlx5_wqe_umr_ctrl_seg *ucseg = &wqe->uctrl;
 	struct mlx5_wqe_data_seg      *dseg = &wqe->data;
-	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
+	struct mlx5e_mpw_info *wi = &rq->mpwqe.info[ix];
 	u8 ds_cnt = DIV_ROUND_UP(sizeof(*wqe), MLX5_SEND_WQE_DS);
 	u32 umr_wqe_mtt_offset = mlx5e_get_wqe_mtt_offset(rq, ix);
 
@@ -345,21 +345,21 @@ static int mlx5e_rq_alloc_mpwqe_info(struct mlx5e_rq *rq,
 	int mtt_alloc = mtt_sz + MLX5_UMR_ALIGN - 1;
 	int i;
 
-	rq->wqe_info = kzalloc_node(wq_sz * sizeof(*rq->wqe_info),
-				    GFP_KERNEL, cpu_to_node(c->cpu));
-	if (!rq->wqe_info)
+	rq->mpwqe.info = kzalloc_node(wq_sz * sizeof(*rq->mpwqe.info),
+				      GFP_KERNEL, cpu_to_node(c->cpu));
+	if (!rq->mpwqe.info)
 		goto err_out;
 
 	/* We allocate more than mtt_sz as we will align the pointer */
-	rq->mtt_no_align = kzalloc_node(mtt_alloc * wq_sz, GFP_KERNEL,
+	rq->mpwqe.mtt_no_align = kzalloc_node(mtt_alloc * wq_sz, GFP_KERNEL,
 					cpu_to_node(c->cpu));
-	if (unlikely(!rq->mtt_no_align))
+	if (unlikely(!rq->mpwqe.mtt_no_align))
 		goto err_free_wqe_info;
 
 	for (i = 0; i < wq_sz; i++) {
-		struct mlx5e_mpw_info *wi = &rq->wqe_info[i];
+		struct mlx5e_mpw_info *wi = &rq->mpwqe.info[i];
 
-		wi->umr.mtt = PTR_ALIGN(rq->mtt_no_align + i * mtt_alloc,
+		wi->umr.mtt = PTR_ALIGN(rq->mpwqe.mtt_no_align + i * mtt_alloc,
 					MLX5_UMR_ALIGN);
 		wi->umr.mtt_addr = dma_map_single(c->pdev, wi->umr.mtt, mtt_sz,
 						  PCI_DMA_TODEVICE);
@@ -373,14 +373,14 @@ static int mlx5e_rq_alloc_mpwqe_info(struct mlx5e_rq *rq,
 
 err_unmap_mtts:
 	while (--i >= 0) {
-		struct mlx5e_mpw_info *wi = &rq->wqe_info[i];
+		struct mlx5e_mpw_info *wi = &rq->mpwqe.info[i];
 
 		dma_unmap_single(c->pdev, wi->umr.mtt_addr, mtt_sz,
 				 PCI_DMA_TODEVICE);
 	}
-	kfree(rq->mtt_no_align);
+	kfree(rq->mpwqe.mtt_no_align);
 err_free_wqe_info:
-	kfree(rq->wqe_info);
+	kfree(rq->mpwqe.info);
 
 err_out:
 	return -ENOMEM;
@@ -393,13 +393,13 @@ static void mlx5e_rq_free_mpwqe_info(struct mlx5e_rq *rq)
 	int i;
 
 	for (i = 0; i < wq_sz; i++) {
-		struct mlx5e_mpw_info *wi = &rq->wqe_info[i];
+		struct mlx5e_mpw_info *wi = &rq->mpwqe.info[i];
 
 		dma_unmap_single(rq->pdev, wi->umr.mtt_addr, mtt_sz,
 				 PCI_DMA_TODEVICE);
 	}
-	kfree(rq->mtt_no_align);
-	kfree(rq->wqe_info);
+	kfree(rq->mpwqe.mtt_no_align);
+	kfree(rq->mpwqe.info);
 }
 
 static int mlx5e_create_rq(struct mlx5e_channel *c,
@@ -442,7 +442,7 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 		rq->alloc_wqe = mlx5e_alloc_rx_mpwqe;
 		rq->dealloc_wqe = mlx5e_dealloc_rx_mpwqe;
 
-		rq->mpwqe_mtt_offset = c->ix *
+		rq->mpwqe.mtt_offset = c->ix *
 			MLX5E_REQUIRED_MTTS(1, BIT(priv->params.log_rq_size));
 
 		rq->mpwqe_stride_sz = BIT(priv->params.mpwqe_log_stride_sz);
@@ -656,7 +656,7 @@ static void mlx5e_free_rx_descs(struct mlx5e_rq *rq)
 
 	/* UMR WQE (if in progress) is always at wq->head */
 	if (test_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state))
-		mlx5e_free_rx_mpwqe(rq, &rq->wqe_info[wq->head]);
+		mlx5e_free_rx_mpwqe(rq, &rq->mpwqe.info[wq->head]);
 
 	while (!mlx5_wq_ll_is_empty(wq)) {
 		wqe_ix_be = *wq->tail_next;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 2f5bc6f..95f9b1e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -328,7 +328,7 @@ mlx5e_copy_skb_header_mpwqe(struct device *pdev,
 
 static inline void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
 {
-	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
+	struct mlx5e_mpw_info *wi = &rq->mpwqe.info[ix];
 	struct mlx5e_sq *sq = &rq->channel->icosq;
 	struct mlx5_wq_cyc *wq = &sq->wq;
 	struct mlx5e_umr_wqe *wqe;
@@ -358,7 +358,7 @@ static int mlx5e_alloc_rx_umr_mpwqe(struct mlx5e_rq *rq,
 				    struct mlx5e_rx_wqe *wqe,
 				    u16 ix)
 {
-	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
+	struct mlx5e_mpw_info *wi = &rq->mpwqe.info[ix];
 	u64 dma_offset = (u64)mlx5e_get_wqe_mtt_offset(rq, ix) << PAGE_SHIFT;
 	int pg_strides = mlx5e_mpwqe_strides_per_page(rq);
 	int err;
@@ -412,7 +412,7 @@ void mlx5e_post_rx_mpwqe(struct mlx5e_rq *rq)
 	clear_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state);
 
 	if (unlikely(test_bit(MLX5E_RQ_STATE_FLUSH, &rq->state))) {
-		mlx5e_free_rx_mpwqe(rq, &rq->wqe_info[wq->head]);
+		mlx5e_free_rx_mpwqe(rq, &rq->mpwqe.info[wq->head]);
 		return;
 	}
 
@@ -438,7 +438,7 @@ int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
 
 void mlx5e_dealloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 {
-	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
+	struct mlx5e_mpw_info *wi = &rq->mpwqe.info[ix];
 
 	mlx5e_free_rx_mpwqe(rq, wi);
 }
@@ -717,7 +717,7 @@ void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 {
 	u16 cstrides       = mpwrq_get_cqe_consumed_strides(cqe);
 	u16 wqe_id         = be16_to_cpu(cqe->wqe_id);
-	struct mlx5e_mpw_info *wi = &rq->wqe_info[wqe_id];
+	struct mlx5e_mpw_info *wi = &rq->mpwqe.info[wqe_id];
 	struct mlx5e_rx_wqe  *wqe = mlx5_wq_ll_get_wqe(&rq->wq, wqe_id);
 	struct sk_buff *skb;
 	u16 cqe_bcnt;
-- 
2.7.4

* [PATCH RFC 06/11] net/mlx5e: Slightly reduce hardware LRO size
  2016-09-07 12:42 [PATCH RFC 00/11] mlx5 RX refactoring and XDP support Saeed Mahameed
                   ` (4 preceding siblings ...)
  2016-09-07 12:42 ` [PATCH RFC 05/11] net/mlx5e: Union RQ RX info per RQ type Saeed Mahameed
@ 2016-09-07 12:42 ` Saeed Mahameed
  2016-09-07 12:42 ` [PATCH RFC 07/11] net/mlx5e: Dynamic RQ type infrastructure Saeed Mahameed
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 72+ messages in thread
From: Saeed Mahameed @ 2016-09-07 12:42 UTC (permalink / raw)
  To: iovisor-dev
  Cc: netdev, Tariq Toukan, Brenden Blanco, Alexei Starovoitov,
	Tom Herbert, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Eric Dumazet, Jamal Hadi Salim, Saeed Mahameed

Before this patch the LRO buffer size was 64K.  Now that build_skb
requires extra room, the headroom plus sizeof(struct skb_shared_info)
added to the data buffer makes the WQE size (page_frag_size) slightly
larger than 64K, which demands an order-5 page instead of order-4 on
4K-page systems.

We take those extra bytes out of the hardware LRO data size so that
the required page order does not increase when hardware LRO is
enabled.
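
To illustrate the arithmetic (a sketch only, not part of the code
change; it assumes a 4K PAGE_SIZE system and reuses the driver's own
macros):

        /*
         *   frag_sz = MLX5_RX_HEADROOM                 (RX headroom)
         *           + lro_wqe_sz                       (64KB = 16 pages)
         *           + SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
         *
         * Anything beyond the bare 64KB pushes
         * DIV_ROUND_UP(frag_sz, PAGE_SIZE) from 16 to 17, and
         * order_base_2(17) = 5, i.e. an order-5 allocation.  Trimming
         * lro_wqe_sz by those same extra bytes keeps frag_sz within
         * 16 pages, so order_base_2(16) = 4 still holds.
         */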

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 9f0f5f6..17f84f9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3185,8 +3185,11 @@ static void mlx5e_build_nic_netdev_priv(struct mlx5_core_dev *mdev,
 	mlx5e_build_default_indir_rqt(mdev, priv->params.indirection_rqt,
 				      MLX5E_INDIR_RQT_SIZE, profile->max_nch(mdev));
 
-	priv->params.lro_wqe_sz            =
-		MLX5E_PARAMS_DEFAULT_LRO_WQE_SZ;
+	priv->params.lro_wqe_sz =
+		MLX5E_PARAMS_DEFAULT_LRO_WQE_SZ -
+		/* Extra room needed for build_skb */
+		MLX5_RX_HEADROOM -
+		SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 
 	/* Initialize pflags */
 	MLX5E_SET_PRIV_FLAG(priv, MLX5E_PFLAG_RX_CQE_BASED_MODER,
-- 
2.7.4

* [PATCH RFC 07/11] net/mlx5e: Dynamic RQ type infrastructure
  2016-09-07 12:42 [PATCH RFC 00/11] mlx5 RX refactoring and XDP support Saeed Mahameed
                   ` (5 preceding siblings ...)
  2016-09-07 12:42 ` [PATCH RFC 06/11] net/mlx5e: Slightly reduce hardware LRO size Saeed Mahameed
@ 2016-09-07 12:42 ` Saeed Mahameed
  2016-09-07 12:42 ` [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support Saeed Mahameed
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 72+ messages in thread
From: Saeed Mahameed @ 2016-09-07 12:42 UTC (permalink / raw)
  To: iovisor-dev
  Cc: netdev, Tariq Toukan, Brenden Blanco, Alexei Starovoitov,
	Tom Herbert, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Eric Dumazet, Jamal Hadi Salim, Saeed Mahameed

Add two helper functions to allow dynamic changes of RQ type.

mlx5e_set_rq_priv_params and mlx5e_set_rq_type_params will be
used on netdev creation to determine the default RQ type.

This will be needed later by the downstream XDP support patches.
When enabling XDP we will dynamically move from the striding RQ type
to the linked-list RQ type.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 92 ++++++++++++-----------
 1 file changed, 50 insertions(+), 42 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 17f84f9..a6a2e60 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -69,6 +69,47 @@ struct mlx5e_channel_param {
 	struct mlx5e_cq_param      icosq_cq;
 };
 
+static bool mlx5e_check_fragmented_striding_rq_cap(struct mlx5_core_dev *mdev)
+{
+	return MLX5_CAP_GEN(mdev, striding_rq) &&
+		MLX5_CAP_GEN(mdev, umr_ptr_rlky) &&
+		MLX5_CAP_ETH(mdev, reg_umr_sq);
+}
+
+static void mlx5e_set_rq_type_params(struct mlx5e_priv *priv, u8 rq_type)
+{
+	priv->params.rq_wq_type = rq_type;
+	switch (priv->params.rq_wq_type) {
+	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
+		priv->params.log_rq_size = MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW;
+		priv->params.mpwqe_log_stride_sz = priv->params.rx_cqe_compress ?
+			MLX5_MPWRQ_LOG_STRIDE_SIZE_CQE_COMPRESS :
+			MLX5_MPWRQ_LOG_STRIDE_SIZE;
+		priv->params.mpwqe_log_num_strides = MLX5_MPWRQ_LOG_WQE_SZ -
+			priv->params.mpwqe_log_stride_sz;
+		break;
+	default: /* MLX5_WQ_TYPE_LINKED_LIST */
+		priv->params.log_rq_size = MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE;
+	}
+	priv->params.min_rx_wqes = mlx5_min_rx_wqes(priv->params.rq_wq_type,
+					       BIT(priv->params.log_rq_size));
+
+	mlx5_core_info(priv->mdev,
+		       "MLX5E: StrdRq(%d) RqSz(%ld) StrdSz(%ld) RxCqeCmprss(%d)\n",
+		       priv->params.rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ,
+		       BIT(priv->params.log_rq_size),
+		       BIT(priv->params.mpwqe_log_stride_sz),
+		       priv->params.rx_cqe_compress_admin);
+}
+
+static void mlx5e_set_rq_priv_params(struct mlx5e_priv *priv)
+{
+	u8 rq_type = mlx5e_check_fragmented_striding_rq_cap(priv->mdev) ?
+		    MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ :
+		    MLX5_WQ_TYPE_LINKED_LIST;
+	mlx5e_set_rq_type_params(priv, rq_type);
+}
+
 static void mlx5e_update_carrier(struct mlx5e_priv *priv)
 {
 	struct mlx5_core_dev *mdev = priv->mdev;
@@ -3036,13 +3077,6 @@ void mlx5e_build_default_indir_rqt(struct mlx5_core_dev *mdev,
 		indirection_rqt[i] = i % num_channels;
 }
 
-static bool mlx5e_check_fragmented_striding_rq_cap(struct mlx5_core_dev *mdev)
-{
-	return MLX5_CAP_GEN(mdev, striding_rq) &&
-		MLX5_CAP_GEN(mdev, umr_ptr_rlky) &&
-		MLX5_CAP_ETH(mdev, reg_umr_sq);
-}
-
 static int mlx5e_get_pci_bw(struct mlx5_core_dev *mdev, u32 *pci_bw)
 {
 	enum pcie_link_width width;
@@ -3122,11 +3156,13 @@ static void mlx5e_build_nic_netdev_priv(struct mlx5_core_dev *mdev,
 					 MLX5_CQ_PERIOD_MODE_START_FROM_CQE :
 					 MLX5_CQ_PERIOD_MODE_START_FROM_EQE;
 
-	priv->params.log_sq_size           =
-		MLX5E_PARAMS_DEFAULT_LOG_SQ_SIZE;
-	priv->params.rq_wq_type = mlx5e_check_fragmented_striding_rq_cap(mdev) ?
-		MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ :
-		MLX5_WQ_TYPE_LINKED_LIST;
+	priv->mdev                         = mdev;
+	priv->netdev                       = netdev;
+	priv->params.num_channels          = profile->max_nch(mdev);
+	priv->profile                      = profile;
+	priv->ppriv                        = ppriv;
+
+	priv->params.log_sq_size = MLX5E_PARAMS_DEFAULT_LOG_SQ_SIZE;
 
 	/* set CQE compression */
 	priv->params.rx_cqe_compress_admin = false;
@@ -3139,33 +3175,11 @@ static void mlx5e_build_nic_netdev_priv(struct mlx5_core_dev *mdev,
 		priv->params.rx_cqe_compress_admin =
 			cqe_compress_heuristic(link_speed, pci_bw);
 	}
-
 	priv->params.rx_cqe_compress = priv->params.rx_cqe_compress_admin;
 
-	switch (priv->params.rq_wq_type) {
-	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
-		priv->params.log_rq_size = MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW;
-		priv->params.mpwqe_log_stride_sz =
-			priv->params.rx_cqe_compress ?
-			MLX5_MPWRQ_LOG_STRIDE_SIZE_CQE_COMPRESS :
-			MLX5_MPWRQ_LOG_STRIDE_SIZE;
-		priv->params.mpwqe_log_num_strides = MLX5_MPWRQ_LOG_WQE_SZ -
-			priv->params.mpwqe_log_stride_sz;
+	mlx5e_set_rq_priv_params(priv);
+	if (priv->params.rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ)
 		priv->params.lro_en = true;
-		break;
-	default: /* MLX5_WQ_TYPE_LINKED_LIST */
-		priv->params.log_rq_size = MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE;
-	}
-
-	mlx5_core_info(mdev,
-		       "MLX5E: StrdRq(%d) RqSz(%ld) StrdSz(%ld) RxCqeCmprss(%d)\n",
-		       priv->params.rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ,
-		       BIT(priv->params.log_rq_size),
-		       BIT(priv->params.mpwqe_log_stride_sz),
-		       priv->params.rx_cqe_compress_admin);
-
-	priv->params.min_rx_wqes = mlx5_min_rx_wqes(priv->params.rq_wq_type,
-					    BIT(priv->params.log_rq_size));
 
 	priv->params.rx_am_enabled = MLX5_CAP_GEN(mdev, cq_moderation);
 	mlx5e_set_rx_cq_mode_params(&priv->params, cq_period_mode);
@@ -3195,12 +3209,6 @@ static void mlx5e_build_nic_netdev_priv(struct mlx5_core_dev *mdev,
 	MLX5E_SET_PRIV_FLAG(priv, MLX5E_PFLAG_RX_CQE_BASED_MODER,
 			    priv->params.rx_cq_period_mode == MLX5_CQ_PERIOD_MODE_START_FROM_CQE);
 
-	priv->mdev                         = mdev;
-	priv->netdev                       = netdev;
-	priv->params.num_channels          = profile->max_nch(mdev);
-	priv->profile                      = profile;
-	priv->ppriv                        = ppriv;
-
 #ifdef CONFIG_MLX5_CORE_EN_DCB
 	mlx5e_ets_init(priv);
 #endif
-- 
2.7.4

* [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support
  2016-09-07 12:42 [PATCH RFC 00/11] mlx5 RX refactoring and XDP support Saeed Mahameed
                   ` (6 preceding siblings ...)
  2016-09-07 12:42 ` [PATCH RFC 07/11] net/mlx5e: Dynamic RQ type infrastructure Saeed Mahameed
@ 2016-09-07 12:42 ` Saeed Mahameed
  2016-09-07 13:32   ` Or Gerlitz
                     ` (2 more replies)
  2016-09-07 12:42 ` [PATCH RFC 09/11] net/mlx5e: Have a clear separation between different SQ types Saeed Mahameed
                   ` (3 subsequent siblings)
  11 siblings, 3 replies; 72+ messages in thread
From: Saeed Mahameed @ 2016-09-07 12:42 UTC (permalink / raw)
  To: iovisor-dev
  Cc: netdev, Tariq Toukan, Brenden Blanco, Alexei Starovoitov,
	Tom Herbert, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Eric Dumazet, Jamal Hadi Salim, Rana Shahout,
	Saeed Mahameed

From: Rana Shahout <ranas@mellanox.com>

Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx5e driver.

When XDP is on, we make sure to change the channels' RQ type to
MLX5_WQ_TYPE_LINKED_LIST rather than the "striding RQ" type, to
ensure a "page per packet".

On XDP set, we fail if HW LRO is enabled and ask the user to turn it
off first.  Since HW LRO is always on by default on ConnectX4-LX,
this will be annoying, but we prefer not to force LRO off from the
XDP set function.

A full channels reset (close/open) is required only when turning XDP
on or off.

When XDP set is called just to exchange programs, we update each
RQ's xdp program on the fly.  To synchronize with the current RX
data path activity of that RQ, we temporarily disable the RQ and
ensure the RX path is not running, then quickly update and re-enable
it.  For that we do:
	- rq.state = disabled
	- napi_synchronize
	- xchg(rq->xdp_prg)
	- rq.state = enabled
	- napi_schedule // Just in case we've missed an IRQ

Packet rate performance testing was done with pktgen sending 64B
packets on the TX side, comparing a TC drop action on the RX side
with XDP fast drop.

CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Comparison is done between:
	1. Baseline, Before this patch with TC drop action
	2. This patch with TC drop action
	3. This patch with XDP RX fast drop

Streams    Baseline(TC drop)    TC drop    XDP fast Drop
--------------------------------------------------------------
1           5.51Mpps            5.14Mpps     13.5Mpps
2           11.5Mpps            10.0Mpps     25.1Mpps
4           16.3Mpps            17.2Mpps     35.4Mpps
8           29.6Mpps            28.2Mpps     45.8Mpps*
16          34.0Mpps            30.1Mpps     45.8Mpps*

It seems that there is around a ~5% packet rate degradation between
the baseline and this patch with a single stream when comparing TC
drop; it might be related to XDP code overhead or new cache misses
added by the XDP code.

*My transmitter was limited to 45Mpps, so for 8/16 streams the
transmitter is the bottleneck, and it seems that XDP drop can handle
more.

Signed-off-by: Rana Shahout <ranas@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 100 ++++++++++++++++++++-
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    |  26 +++++-
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   4 +
 4 files changed, 130 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 7dfb34e..729bae8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -334,6 +334,7 @@ struct mlx5e_rq {
 	int                    ix;
 
 	struct mlx5e_rx_am     am; /* Adaptive Moderation */
+	struct bpf_prog       *xdp_prog;
 
 	/* control */
 	struct mlx5_wq_ctrl    wq_ctrl;
@@ -627,6 +628,7 @@ struct mlx5e_priv {
 	/* priv data path fields - start */
 	struct mlx5e_sq            **txq_to_sq_map;
 	int channeltc_to_txq_map[MLX5E_MAX_NUM_CHANNELS][MLX5E_MAX_NUM_TC];
+	struct bpf_prog *xdp_prog;
 	/* priv data path fields - end */
 
 	unsigned long              state;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index a6a2e60..dab8486 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -34,6 +34,7 @@
 #include <net/pkt_cls.h>
 #include <linux/mlx5/fs.h>
 #include <net/vxlan.h>
+#include <linux/bpf.h>
 #include "en.h"
 #include "en_tc.h"
 #include "eswitch.h"
@@ -104,7 +105,8 @@ static void mlx5e_set_rq_type_params(struct mlx5e_priv *priv, u8 rq_type)
 
 static void mlx5e_set_rq_priv_params(struct mlx5e_priv *priv)
 {
-	u8 rq_type = mlx5e_check_fragmented_striding_rq_cap(priv->mdev) ?
+	u8 rq_type = mlx5e_check_fragmented_striding_rq_cap(priv->mdev) &&
+		    !priv->xdp_prog ?
 		    MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ :
 		    MLX5_WQ_TYPE_LINKED_LIST;
 	mlx5e_set_rq_type_params(priv, rq_type);
@@ -177,6 +179,7 @@ static void mlx5e_update_sw_counters(struct mlx5e_priv *priv)
 		s->rx_csum_none	+= rq_stats->csum_none;
 		s->rx_csum_complete += rq_stats->csum_complete;
 		s->rx_csum_unnecessary_inner += rq_stats->csum_unnecessary_inner;
+		s->rx_xdp_drop += rq_stats->xdp_drop;
 		s->rx_wqe_err   += rq_stats->wqe_err;
 		s->rx_mpwqe_filler += rq_stats->mpwqe_filler;
 		s->rx_buff_alloc_err += rq_stats->buff_alloc_err;
@@ -476,6 +479,7 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 	rq->channel = c;
 	rq->ix      = c->ix;
 	rq->priv    = c->priv;
+	rq->xdp_prog = priv->xdp_prog;
 
 	switch (priv->params.rq_wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
@@ -539,6 +543,9 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 	rq->page_cache.head = 0;
 	rq->page_cache.tail = 0;
 
+	if (rq->xdp_prog)
+		bpf_prog_add(rq->xdp_prog, 1);
+
 	return 0;
 
 err_rq_wq_destroy:
@@ -551,6 +558,9 @@ static void mlx5e_destroy_rq(struct mlx5e_rq *rq)
 {
 	int i;
 
+	if (rq->xdp_prog)
+		bpf_prog_put(rq->xdp_prog);
+
 	switch (rq->wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
 		mlx5e_rq_free_mpwqe_info(rq);
@@ -2953,6 +2963,92 @@ static void mlx5e_tx_timeout(struct net_device *dev)
 		schedule_work(&priv->tx_timeout_work);
 }
 
+static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
+{
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+	struct bpf_prog *old_prog;
+	int err = 0;
+	bool reset, was_opened;
+	int i;
+
+	mutex_lock(&priv->state_lock);
+
+	if ((netdev->features & NETIF_F_LRO) && prog) {
+		netdev_warn(netdev, "can't set XDP while LRO is on, disable LRO first\n");
+		err = -EINVAL;
+		goto unlock;
+	}
+
+	was_opened = test_bit(MLX5E_STATE_OPENED, &priv->state);
+	/* no need for full reset when exchanging programs */
+	reset = (!priv->xdp_prog || !prog);
+
+	if (was_opened && reset)
+		mlx5e_close_locked(netdev);
+
+	/* exchange programs */
+	old_prog = xchg(&priv->xdp_prog, prog);
+	if (prog)
+		bpf_prog_add(prog, 1);
+	if (old_prog)
+		bpf_prog_put(old_prog);
+
+	if (reset) /* change RQ type according to priv->xdp_prog */
+		mlx5e_set_rq_priv_params(priv);
+
+	if (was_opened && reset)
+		mlx5e_open_locked(netdev);
+
+	if (!test_bit(MLX5E_STATE_OPENED, &priv->state) || reset)
+		goto unlock;
+
+	/* exchanging programs w/o reset, we update ref counts on behalf
+	 * of the channels RQs here.
+	 */
+	bpf_prog_add(prog, priv->params.num_channels);
+	for (i = 0; i < priv->params.num_channels; i++) {
+		struct mlx5e_channel *c = priv->channel[i];
+
+		set_bit(MLX5E_RQ_STATE_FLUSH, &c->rq.state);
+		napi_synchronize(&c->napi);
+		/* prevent mlx5e_poll_rx_cq from accessing rq->xdp_prog */
+
+		old_prog = xchg(&c->rq.xdp_prog, prog);
+
+		clear_bit(MLX5E_RQ_STATE_FLUSH, &c->rq.state);
+		/* napi_schedule in case we have missed anything */
+		set_bit(MLX5E_CHANNEL_NAPI_SCHED, &c->flags);
+		napi_schedule(&c->napi);
+
+		if (old_prog)
+			bpf_prog_put(old_prog);
+	}
+
+unlock:
+	mutex_unlock(&priv->state_lock);
+	return err;
+}
+
+static bool mlx5e_xdp_attached(struct net_device *dev)
+{
+	struct mlx5e_priv *priv = netdev_priv(dev);
+
+	return !!priv->xdp_prog;
+}
+
+static int mlx5e_xdp(struct net_device *dev, struct netdev_xdp *xdp)
+{
+	switch (xdp->command) {
+	case XDP_SETUP_PROG:
+		return mlx5e_xdp_set(dev, xdp->prog);
+	case XDP_QUERY_PROG:
+		xdp->prog_attached = mlx5e_xdp_attached(dev);
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
 static const struct net_device_ops mlx5e_netdev_ops_basic = {
 	.ndo_open                = mlx5e_open,
 	.ndo_stop                = mlx5e_close,
@@ -2972,6 +3068,7 @@ static const struct net_device_ops mlx5e_netdev_ops_basic = {
 	.ndo_rx_flow_steer	 = mlx5e_rx_flow_steer,
 #endif
 	.ndo_tx_timeout          = mlx5e_tx_timeout,
+	.ndo_xdp		 = mlx5e_xdp,
 };
 
 static const struct net_device_ops mlx5e_netdev_ops_sriov = {
@@ -3003,6 +3100,7 @@ static const struct net_device_ops mlx5e_netdev_ops_sriov = {
 	.ndo_set_vf_link_state   = mlx5e_set_vf_link_state,
 	.ndo_get_vf_stats        = mlx5e_get_vf_stats,
 	.ndo_tx_timeout          = mlx5e_tx_timeout,
+	.ndo_xdp		 = mlx5e_xdp,
 };
 
 static int mlx5e_check_required_hca_cap(struct mlx5_core_dev *mdev)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 95f9b1e..cde34c8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -624,8 +624,20 @@ static inline void mlx5e_complete_rx_cqe(struct mlx5e_rq *rq,
 	napi_gro_receive(rq->cq.napi, skb);
 }
 
+static inline enum xdp_action mlx5e_xdp_handle(struct mlx5e_rq *rq,
+					       const struct bpf_prog *prog,
+					       void *data, u32 len)
+{
+	struct xdp_buff xdp;
+
+	xdp.data = data;
+	xdp.data_end = xdp.data + len;
+	return bpf_prog_run_xdp(prog, &xdp);
+}
+
 void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 {
+	struct bpf_prog *xdp_prog = READ_ONCE(rq->xdp_prog);
 	struct mlx5e_dma_info *di;
 	struct mlx5e_rx_wqe *wqe;
 	__be16 wqe_counter_be;
@@ -646,6 +658,7 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 				      rq->buff.wqe_sz,
 				      DMA_FROM_DEVICE);
 	prefetch(va + MLX5_RX_HEADROOM);
+	cqe_bcnt = be32_to_cpu(cqe->byte_cnt);
 
 	if (unlikely((cqe->op_own >> 4) != MLX5_CQE_RESP_SEND)) {
 		rq->stats.wqe_err++;
@@ -653,6 +666,18 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 		goto wq_ll_pop;
 	}
 
+	if (xdp_prog) {
+		enum xdp_action act =
+			mlx5e_xdp_handle(rq, xdp_prog, va + MLX5_RX_HEADROOM,
+					 cqe_bcnt);
+
+		if (act != XDP_PASS) {
+			rq->stats.xdp_drop++;
+			mlx5e_page_release(rq, di, true);
+			goto wq_ll_pop;
+		}
+	}
+
 	skb = build_skb(va, RQ_PAGE_SIZE(rq));
 	if (unlikely(!skb)) {
 		rq->stats.buff_alloc_err++;
@@ -664,7 +689,6 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 	page_ref_inc(di->page);
 	mlx5e_page_release(rq, di, true);
 
-	cqe_bcnt = be32_to_cpu(cqe->byte_cnt);
 	skb_reserve(skb, MLX5_RX_HEADROOM);
 	skb_put(skb, cqe_bcnt);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
index 6af8d79..084d6c8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
@@ -65,6 +65,7 @@ struct mlx5e_sw_stats {
 	u64 rx_csum_none;
 	u64 rx_csum_complete;
 	u64 rx_csum_unnecessary_inner;
+	u64 rx_xdp_drop;
 	u64 tx_csum_partial;
 	u64 tx_csum_partial_inner;
 	u64 tx_queue_stopped;
@@ -100,6 +101,7 @@ static const struct counter_desc sw_stats_desc[] = {
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_csum_none) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_csum_complete) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_csum_unnecessary_inner) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xdp_drop) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_csum_partial) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_csum_partial_inner) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_queue_stopped) },
@@ -278,6 +280,7 @@ struct mlx5e_rq_stats {
 	u64 csum_none;
 	u64 lro_packets;
 	u64 lro_bytes;
+	u64 xdp_drop;
 	u64 wqe_err;
 	u64 mpwqe_filler;
 	u64 buff_alloc_err;
@@ -295,6 +298,7 @@ static const struct counter_desc rq_stats_desc[] = {
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, csum_complete) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, csum_unnecessary_inner) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, csum_none) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, xdp_drop) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, lro_packets) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, lro_bytes) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, wqe_err) },
-- 
2.7.4

* [PATCH RFC 09/11] net/mlx5e: Have a clear separation between different SQ types
  2016-09-07 12:42 [PATCH RFC 00/11] mlx5 RX refactoring and XDP support Saeed Mahameed
                   ` (7 preceding siblings ...)
  2016-09-07 12:42 ` [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support Saeed Mahameed
@ 2016-09-07 12:42 ` Saeed Mahameed
  2016-09-07 12:42 ` [PATCH RFC 10/11] net/mlx5e: XDP TX forwarding support Saeed Mahameed
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 72+ messages in thread
From: Saeed Mahameed @ 2016-09-07 12:42 UTC (permalink / raw)
  To: iovisor-dev
  Cc: netdev, Tariq Toukan, Brenden Blanco, Alexei Starovoitov,
	Tom Herbert, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Eric Dumazet, Jamal Hadi Salim, Saeed Mahameed

Make a clear separation between regular SQ (TXQ) and ICO SQ creation
and destruction, and turn their mutually exclusive information
structures into a union.

Don't allocate the redundant TXQ skb/wqe_info/dma_fifo arrays for the
ICO SQ, and use a more accurate SQ edge for the ICO SQ than for the
TXQ SQ.
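
For reference, the SQ "edge" is the last producer index from which the
largest WQE of that SQ type still fits without wrapping the cyclic work
queue.  Roughly (illustrative only, max_wqebbs_for_sq_type stands in
for the per-type constant used in the code below):

        /*
         *      sq->edge = (sq->wq.sz_m1 + 1) - max_wqebbs_for_sq_type;
         *
         * A TXQ SQ must reserve room for MLX5_SEND_WQE_MAX_WQEBBS,
         * while an ICO SQ only ever posts NOP/UMR WQEs, so reserving
         * just MLX5E_ICOSQ_MAX_WQEBBS wastes less of the ring on the
         * NOP padding used to avoid WQE wrap-around.
         */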

In preparation for XDP TX support.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |  23 +++-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 121 ++++++++++++++--------
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |   8 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c   |  28 ++---
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c |   2 +-
 5 files changed, 118 insertions(+), 64 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 729bae8..b2da9bf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -101,6 +101,9 @@
 #define MLX5E_UPDATE_STATS_INTERVAL    200 /* msecs */
 #define MLX5E_SQ_BF_BUDGET             16
 
+#define MLX5E_ICOSQ_MAX_WQEBBS \
+	(DIV_ROUND_UP(sizeof(struct mlx5e_umr_wqe), MLX5_SEND_WQE_BB))
+
 #define MLX5E_NUM_MAIN_GROUPS 9
 
 static inline u16 mlx5_min_rx_wqes(int wq_type, u32 wq_size)
@@ -386,6 +389,11 @@ struct mlx5e_ico_wqe_info {
 	u8  num_wqebbs;
 };
 
+enum mlx5e_sq_type {
+	MLX5E_SQ_TXQ,
+	MLX5E_SQ_ICO
+};
+
 struct mlx5e_sq {
 	/* data path */
 
@@ -403,10 +411,15 @@ struct mlx5e_sq {
 
 	struct mlx5e_cq            cq;
 
-	/* pointers to per packet info: write@xmit, read@completion */
-	struct sk_buff           **skb;
-	struct mlx5e_sq_dma       *dma_fifo;
-	struct mlx5e_tx_wqe_info  *wqe_info;
+	/* pointers to per tx element info: write@xmit, read@completion */
+	union {
+		struct {
+			struct sk_buff           **skb;
+			struct mlx5e_sq_dma       *dma_fifo;
+			struct mlx5e_tx_wqe_info  *wqe_info;
+		} txq;
+		struct mlx5e_ico_wqe_info *ico_wqe;
+	} db;
 
 	/* read only */
 	struct mlx5_wq_cyc         wq;
@@ -428,8 +441,8 @@ struct mlx5e_sq {
 	struct mlx5_uar            uar;
 	struct mlx5e_channel      *channel;
 	int                        tc;
-	struct mlx5e_ico_wqe_info *ico_wqe_info;
 	u32                        rate_limit;
+	u8                         type;
 } ____cacheline_aligned_in_smp;
 
 static inline bool mlx5e_sq_has_room_for(struct mlx5e_sq *sq, u16 n)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index dab8486..8baeb9e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -51,7 +51,7 @@ struct mlx5e_sq_param {
 	struct mlx5_wq_param       wq;
 	u16                        max_inline;
 	u8                         min_inline_mode;
-	bool                       icosq;
+	enum mlx5e_sq_type         type;
 };
 
 struct mlx5e_cq_param {
@@ -742,8 +742,8 @@ static int mlx5e_open_rq(struct mlx5e_channel *c,
 	if (param->am_enabled)
 		set_bit(MLX5E_RQ_STATE_AM, &c->rq.state);
 
-	sq->ico_wqe_info[pi].opcode     = MLX5_OPCODE_NOP;
-	sq->ico_wqe_info[pi].num_wqebbs = 1;
+	sq->db.ico_wqe[pi].opcode     = MLX5_OPCODE_NOP;
+	sq->db.ico_wqe[pi].num_wqebbs = 1;
 	mlx5e_send_nop(sq, true); /* trigger mlx5e_post_rx_wqes() */
 
 	return 0;
@@ -767,26 +767,43 @@ static void mlx5e_close_rq(struct mlx5e_rq *rq)
 	mlx5e_destroy_rq(rq);
 }
 
-static void mlx5e_free_sq_db(struct mlx5e_sq *sq)
+static void mlx5e_free_sq_ico_db(struct mlx5e_sq *sq)
 {
-	kfree(sq->wqe_info);
-	kfree(sq->dma_fifo);
-	kfree(sq->skb);
+	kfree(sq->db.ico_wqe);
 }
 
-static int mlx5e_alloc_sq_db(struct mlx5e_sq *sq, int numa)
+static int mlx5e_alloc_sq_ico_db(struct mlx5e_sq *sq, int numa)
+{
+	u8 wq_sz = mlx5_wq_cyc_get_size(&sq->wq);
+
+	sq->db.ico_wqe = kzalloc_node(sizeof(*sq->db.ico_wqe) * wq_sz,
+				      GFP_KERNEL, numa);
+	if (!sq->db.ico_wqe)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void mlx5e_free_sq_txq_db(struct mlx5e_sq *sq)
+{
+	kfree(sq->db.txq.wqe_info);
+	kfree(sq->db.txq.dma_fifo);
+	kfree(sq->db.txq.skb);
+}
+
+static int mlx5e_alloc_sq_txq_db(struct mlx5e_sq *sq, int numa)
 {
 	int wq_sz = mlx5_wq_cyc_get_size(&sq->wq);
 	int df_sz = wq_sz * MLX5_SEND_WQEBB_NUM_DS;
 
-	sq->skb = kzalloc_node(wq_sz * sizeof(*sq->skb), GFP_KERNEL, numa);
-	sq->dma_fifo = kzalloc_node(df_sz * sizeof(*sq->dma_fifo), GFP_KERNEL,
-				    numa);
-	sq->wqe_info = kzalloc_node(wq_sz * sizeof(*sq->wqe_info), GFP_KERNEL,
-				    numa);
-
-	if (!sq->skb || !sq->dma_fifo || !sq->wqe_info) {
-		mlx5e_free_sq_db(sq);
+	sq->db.txq.skb = kzalloc_node(wq_sz * sizeof(*sq->db.txq.skb),
+				      GFP_KERNEL, numa);
+	sq->db.txq.dma_fifo = kzalloc_node(df_sz * sizeof(*sq->db.txq.dma_fifo),
+					   GFP_KERNEL, numa);
+	sq->db.txq.wqe_info = kzalloc_node(wq_sz * sizeof(*sq->db.txq.wqe_info),
+					   GFP_KERNEL, numa);
+	if (!sq->db.txq.skb || !sq->db.txq.dma_fifo || !sq->db.txq.wqe_info) {
+		mlx5e_free_sq_txq_db(sq);
 		return -ENOMEM;
 	}
 
@@ -795,6 +812,30 @@ static int mlx5e_alloc_sq_db(struct mlx5e_sq *sq, int numa)
 	return 0;
 }
 
+static void mlx5e_free_sq_db(struct mlx5e_sq *sq)
+{
+	switch (sq->type) {
+	case MLX5E_SQ_TXQ:
+		mlx5e_free_sq_txq_db(sq);
+		break;
+	case MLX5E_SQ_ICO:
+		mlx5e_free_sq_ico_db(sq);
+		break;
+	}
+}
+
+static int mlx5e_alloc_sq_db(struct mlx5e_sq *sq, int numa)
+{
+	switch (sq->type) {
+	case MLX5E_SQ_TXQ:
+		return mlx5e_alloc_sq_txq_db(sq, numa);
+	case MLX5E_SQ_ICO:
+		return mlx5e_alloc_sq_ico_db(sq, numa);
+	}
+
+	return 0;
+}
+
 static int mlx5e_create_sq(struct mlx5e_channel *c,
 			   int tc,
 			   struct mlx5e_sq_param *param,
@@ -805,8 +846,16 @@ static int mlx5e_create_sq(struct mlx5e_channel *c,
 
 	void *sqc = param->sqc;
 	void *sqc_wq = MLX5_ADDR_OF(sqc, sqc, wq);
+	u16 sq_max_wqebbs;
 	int err;
 
+	sq->type      = param->type;
+	sq->pdev      = c->pdev;
+	sq->tstamp    = &priv->tstamp;
+	sq->mkey_be   = c->mkey_be;
+	sq->channel   = c;
+	sq->tc        = tc;
+
 	err = mlx5_alloc_map_uar(mdev, &sq->uar, !!MLX5_CAP_GEN(mdev, bf));
 	if (err)
 		return err;
@@ -835,18 +884,8 @@ static int mlx5e_create_sq(struct mlx5e_channel *c,
 	if (err)
 		goto err_sq_wq_destroy;
 
-	if (param->icosq) {
-		u8 wq_sz = mlx5_wq_cyc_get_size(&sq->wq);
-
-		sq->ico_wqe_info = kzalloc_node(sizeof(*sq->ico_wqe_info) *
-						wq_sz,
-						GFP_KERNEL,
-						cpu_to_node(c->cpu));
-		if (!sq->ico_wqe_info) {
-			err = -ENOMEM;
-			goto err_free_sq_db;
-		}
-	} else {
+	sq_max_wqebbs = MLX5_SEND_WQE_MAX_WQEBBS;
+	if (sq->type == MLX5E_SQ_TXQ) {
 		int txq_ix;
 
 		txq_ix = c->ix + tc * priv->params.num_channels;
@@ -854,19 +893,14 @@ static int mlx5e_create_sq(struct mlx5e_channel *c,
 		priv->txq_to_sq_map[txq_ix] = sq;
 	}
 
-	sq->pdev      = c->pdev;
-	sq->tstamp    = &priv->tstamp;
-	sq->mkey_be   = c->mkey_be;
-	sq->channel   = c;
-	sq->tc        = tc;
-	sq->edge      = (sq->wq.sz_m1 + 1) - MLX5_SEND_WQE_MAX_WQEBBS;
+	if (sq->type == MLX5E_SQ_ICO)
+		sq_max_wqebbs = MLX5E_ICOSQ_MAX_WQEBBS;
+
+	sq->edge      = (sq->wq.sz_m1 + 1) - sq_max_wqebbs;
 	sq->bf_budget = MLX5E_SQ_BF_BUDGET;
 
 	return 0;
 
-err_free_sq_db:
-	mlx5e_free_sq_db(sq);
-
 err_sq_wq_destroy:
 	mlx5_wq_destroy(&sq->wq_ctrl);
 
@@ -881,7 +915,6 @@ static void mlx5e_destroy_sq(struct mlx5e_sq *sq)
 	struct mlx5e_channel *c = sq->channel;
 	struct mlx5e_priv *priv = c->priv;
 
-	kfree(sq->ico_wqe_info);
 	mlx5e_free_sq_db(sq);
 	mlx5_wq_destroy(&sq->wq_ctrl);
 	mlx5_unmap_free_uar(priv->mdev, &sq->uar);
@@ -910,11 +943,12 @@ static int mlx5e_enable_sq(struct mlx5e_sq *sq, struct mlx5e_sq_param *param)
 
 	memcpy(sqc, param->sqc, sizeof(param->sqc));
 
-	MLX5_SET(sqc,  sqc, tis_num_0, param->icosq ? 0 : priv->tisn[sq->tc]);
+	MLX5_SET(sqc,  sqc, tis_num_0, param->type == MLX5E_SQ_ICO ?
+				       0 : priv->tisn[sq->tc]);
 	MLX5_SET(sqc,  sqc, cqn,		sq->cq.mcq.cqn);
 	MLX5_SET(sqc,  sqc, min_wqe_inline_mode, sq->min_inline_mode);
 	MLX5_SET(sqc,  sqc, state,		MLX5_SQC_STATE_RST);
-	MLX5_SET(sqc,  sqc, tis_lst_sz,		param->icosq ? 0 : 1);
+	MLX5_SET(sqc,  sqc, tis_lst_sz, param->type == MLX5E_SQ_ICO ? 0 : 1);
 	MLX5_SET(sqc,  sqc, flush_in_error_en,	1);
 
 	MLX5_SET(wq,   wq, wq_type,       MLX5_WQ_TYPE_CYCLIC);
@@ -1029,8 +1063,10 @@ static void mlx5e_close_sq(struct mlx5e_sq *sq)
 		netif_tx_disable_queue(sq->txq);
 
 		/* last doorbell out, godspeed .. */
-		if (mlx5e_sq_has_room_for(sq, 1))
+		if (mlx5e_sq_has_room_for(sq, 1)) {
+			sq->db.txq.skb[(sq->pc & sq->wq.sz_m1)] = NULL;
 			mlx5e_send_nop(sq, true);
+		}
 	}
 
 	mlx5e_disable_sq(sq);
@@ -1507,6 +1543,7 @@ static void mlx5e_build_sq_param(struct mlx5e_priv *priv,
 
 	param->max_inline = priv->params.tx_max_inline;
 	param->min_inline_mode = priv->params.tx_min_inline_mode;
+	param->type = MLX5E_SQ_TXQ;
 }
 
 static void mlx5e_build_common_cq_param(struct mlx5e_priv *priv,
@@ -1580,7 +1617,7 @@ static void mlx5e_build_icosq_param(struct mlx5e_priv *priv,
 	MLX5_SET(wq, wq, log_wq_sz, log_wq_size);
 	MLX5_SET(sqc, sqc, reg_umr, MLX5_CAP_ETH(priv->mdev, reg_umr_sq));
 
-	param->icosq = true;
+	param->type = MLX5E_SQ_ICO;
 }
 
 static void mlx5e_build_channel_param(struct mlx5e_priv *priv, struct mlx5e_channel_param *cparam)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index cde34c8..eb489e9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -337,8 +337,8 @@ static inline void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
 
 	/* fill sq edge with nops to avoid wqe wrap around */
 	while ((pi = (sq->pc & wq->sz_m1)) > sq->edge) {
-		sq->ico_wqe_info[pi].opcode = MLX5_OPCODE_NOP;
-		sq->ico_wqe_info[pi].num_wqebbs = 1;
+		sq->db.ico_wqe[pi].opcode = MLX5_OPCODE_NOP;
+		sq->db.ico_wqe[pi].num_wqebbs = 1;
 		mlx5e_send_nop(sq, true);
 	}
 
@@ -348,8 +348,8 @@ static inline void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
 		cpu_to_be32((sq->pc << MLX5_WQE_CTRL_WQE_INDEX_SHIFT) |
 			    MLX5_OPCODE_UMR);
 
-	sq->ico_wqe_info[pi].opcode = MLX5_OPCODE_UMR;
-	sq->ico_wqe_info[pi].num_wqebbs = num_wqebbs;
+	sq->db.ico_wqe[pi].opcode = MLX5_OPCODE_UMR;
+	sq->db.ico_wqe[pi].num_wqebbs = num_wqebbs;
 	sq->pc += num_wqebbs;
 	mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 988eca9..a728303 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -52,7 +52,6 @@ void mlx5e_send_nop(struct mlx5e_sq *sq, bool notify_hw)
 	cseg->opmod_idx_opcode = cpu_to_be32((sq->pc << 8) | MLX5_OPCODE_NOP);
 	cseg->qpn_ds           = cpu_to_be32((sq->sqn << 8) | 0x01);
 
-	sq->skb[pi] = NULL;
 	sq->pc++;
 	sq->stats.nop++;
 
@@ -82,15 +81,15 @@ static inline void mlx5e_dma_push(struct mlx5e_sq *sq,
 				  u32 size,
 				  enum mlx5e_dma_map_type map_type)
 {
-	sq->dma_fifo[sq->dma_fifo_pc & sq->dma_fifo_mask].addr = addr;
-	sq->dma_fifo[sq->dma_fifo_pc & sq->dma_fifo_mask].size = size;
-	sq->dma_fifo[sq->dma_fifo_pc & sq->dma_fifo_mask].type = map_type;
+	sq->db.txq.dma_fifo[sq->dma_fifo_pc & sq->dma_fifo_mask].addr = addr;
+	sq->db.txq.dma_fifo[sq->dma_fifo_pc & sq->dma_fifo_mask].size = size;
+	sq->db.txq.dma_fifo[sq->dma_fifo_pc & sq->dma_fifo_mask].type = map_type;
 	sq->dma_fifo_pc++;
 }
 
 static inline struct mlx5e_sq_dma *mlx5e_dma_get(struct mlx5e_sq *sq, u32 i)
 {
-	return &sq->dma_fifo[i & sq->dma_fifo_mask];
+	return &sq->db.txq.dma_fifo[i & sq->dma_fifo_mask];
 }
 
 static void mlx5e_dma_unmap_wqe_err(struct mlx5e_sq *sq, u8 num_dma)
@@ -221,7 +220,7 @@ static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_sq *sq, struct sk_buff *skb)
 
 	u16 pi = sq->pc & wq->sz_m1;
 	struct mlx5e_tx_wqe      *wqe  = mlx5_wq_cyc_get_wqe(wq, pi);
-	struct mlx5e_tx_wqe_info *wi   = &sq->wqe_info[pi];
+	struct mlx5e_tx_wqe_info *wi   = &sq->db.txq.wqe_info[pi];
 
 	struct mlx5_wqe_ctrl_seg *cseg = &wqe->ctrl;
 	struct mlx5_wqe_eth_seg  *eseg = &wqe->eth;
@@ -341,7 +340,7 @@ static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_sq *sq, struct sk_buff *skb)
 	cseg->opmod_idx_opcode = cpu_to_be32((sq->pc << 8) | opcode);
 	cseg->qpn_ds           = cpu_to_be32((sq->sqn << 8) | ds_cnt);
 
-	sq->skb[pi] = skb;
+	sq->db.txq.skb[pi] = skb;
 
 	wi->num_wqebbs = DIV_ROUND_UP(ds_cnt, MLX5_SEND_WQEBB_NUM_DS);
 	sq->pc += wi->num_wqebbs;
@@ -367,8 +366,10 @@ static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_sq *sq, struct sk_buff *skb)
 	}
 
 	/* fill sq edge with nops to avoid wqe wrap around */
-	while ((sq->pc & wq->sz_m1) > sq->edge)
+	while ((pi = (sq->pc & wq->sz_m1)) > sq->edge) {
+		sq->db.txq.skb[pi] = NULL;
 		mlx5e_send_nop(sq, false);
+	}
 
 	if (bf)
 		sq->bf_budget--;
@@ -442,8 +443,8 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 			last_wqe = (sqcc == wqe_counter);
 
 			ci = sqcc & sq->wq.sz_m1;
-			skb = sq->skb[ci];
-			wi = &sq->wqe_info[ci];
+			skb = sq->db.txq.skb[ci];
+			wi = &sq->db.txq.wqe_info[ci];
 
 			if (unlikely(!skb)) { /* nop */
 				sqcc++;
@@ -499,10 +500,13 @@ void mlx5e_free_tx_descs(struct mlx5e_sq *sq)
 	u16 ci;
 	int i;
 
+	if (sq->type != MLX5E_SQ_TXQ)
+		return;
+
 	while (sq->cc != sq->pc) {
 		ci = sq->cc & sq->wq.sz_m1;
-		skb = sq->skb[ci];
-		wi = &sq->wqe_info[ci];
+		skb = sq->db.txq.skb[ci];
+		wi = &sq->db.txq.wqe_info[ci];
 
 		if (!skb) { /* nop */
 			sq->cc++;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
index 08d8b0c..47cd561 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
@@ -72,7 +72,7 @@ static void mlx5e_poll_ico_cq(struct mlx5e_cq *cq)
 
 	do {
 		u16 ci = be16_to_cpu(cqe->wqe_counter) & wq->sz_m1;
-		struct mlx5e_ico_wqe_info *icowi = &sq->ico_wqe_info[ci];
+		struct mlx5e_ico_wqe_info *icowi = &sq->db.ico_wqe[ci];
 
 		mlx5_cqwq_pop(&cq->wq);
 		sqcc += icowi->num_wqebbs;
-- 
2.7.4

* [PATCH RFC 10/11] net/mlx5e: XDP TX forwarding support
  2016-09-07 12:42 [PATCH RFC 00/11] mlx5 RX refactoring and XDP support Saeed Mahameed
                   ` (8 preceding siblings ...)
  2016-09-07 12:42 ` [PATCH RFC 09/11] net/mlx5e: Have a clear separation between different SQ types Saeed Mahameed
@ 2016-09-07 12:42 ` Saeed Mahameed
  2016-09-07 12:42 ` [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more Saeed Mahameed
       [not found] ` <1473252152-11379-1-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  11 siblings, 0 replies; 72+ messages in thread
From: Saeed Mahameed @ 2016-09-07 12:42 UTC (permalink / raw)
  To: iovisor-dev
  Cc: netdev, Tariq Toukan, Brenden Blanco, Alexei Starovoitov,
	Tom Herbert, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Eric Dumazet, Jamal Hadi Salim, Saeed Mahameed

Add support for XDP_TX forwarding from the xdp program.
Using XDP, the user can now loop packets back out of the same port.

We create a dedicated TX SQ for each channel that serves XDP
programs returning the XDP_TX action, to loop packets back to the
wire directly from the channel's RQ RX path.

For that, RX pages will now need to be mapped bi-directionally; on an
XDP_TX action we sync the page back to the device and then queue it
into the SQ for transmission.  The XDP xmit frame function reports
back to the RX path whether the page was consumed (transmitted); if
so, the RX path forgets about that page as if it were released to the
stack.  Later, on XDP TX completion, the page is released back to the
page cache.
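
Roughly, the intended RX path flow on XDP_TX looks like this
(illustrative pseudo-code only; xdp_sq_xmit() is a placeholder name,
the real code is in the en_rx.c changes of this patch):

        if (act == XDP_TX) {
                /* make the (possibly rewritten) packet visible to HW */
                dma_sync_single_for_device(rq->pdev, di->addr, len,
                                           DMA_BIDIRECTIONAL);
                if (xdp_sq_xmit(&rq->channel->xdp_sq, di, MLX5_RX_HEADROOM, len))
                        return; /* page handed over to the XDP SQ */
                /* SQ full: count rq->stats.xdp_tx_full and recycle */
                mlx5e_page_release(rq, di, true);
        }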

For simplicity this patch will hit a doorbell on every XDP TX packet.

The next patch will introduce an xmit_more-like mechanism that
queues up more than one packet into the SQ without notifying the
hardware; once the RX napi loop is done, we hit the doorbell once for
all XDP TX packets from the previous loop.  This should drastically
improve XDP TX performance.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  24 ++++-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  93 +++++++++++++++--
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    | 115 +++++++++++++++++----
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   8 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c    |  39 ++++++-
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  |  65 +++++++++++-
 6 files changed, 308 insertions(+), 36 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index b2da9bf..df2c9e0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -104,6 +104,14 @@
 #define MLX5E_ICOSQ_MAX_WQEBBS \
 	(DIV_ROUND_UP(sizeof(struct mlx5e_umr_wqe), MLX5_SEND_WQE_BB))
 
+#define MLX5E_XDP_MIN_INLINE (ETH_HLEN + VLAN_HLEN)
+#define MLX5E_XDP_IHS_DS_COUNT \
+	DIV_ROUND_UP(MLX5E_XDP_MIN_INLINE - 2, MLX5_SEND_WQE_DS)
+#define MLX5E_XDP_TX_DS_COUNT \
+	(MLX5E_XDP_IHS_DS_COUNT + (sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS) + 1 /* SG DS */)
+#define MLX5E_XDP_TX_WQEBBS \
+	DIV_ROUND_UP(MLX5E_XDP_TX_DS_COUNT, MLX5_SEND_WQEBB_NUM_DS)
+
 #define MLX5E_NUM_MAIN_GROUPS 9
 
 static inline u16 mlx5_min_rx_wqes(int wq_type, u32 wq_size)
@@ -319,6 +327,7 @@ struct mlx5e_rq {
 	struct {
 		u8             page_order;
 		u32            wqe_sz;    /* wqe data buffer size */
+		u8             map_dir;   /* dma map direction */
 	} buff;
 	__be32                 mkey_be;
 
@@ -384,14 +393,15 @@ enum {
 	MLX5E_SQ_STATE_BF_ENABLE,
 };
 
-struct mlx5e_ico_wqe_info {
+struct mlx5e_sq_wqe_info {
 	u8  opcode;
 	u8  num_wqebbs;
 };
 
 enum mlx5e_sq_type {
 	MLX5E_SQ_TXQ,
-	MLX5E_SQ_ICO
+	MLX5E_SQ_ICO,
+	MLX5E_SQ_XDP
 };
 
 struct mlx5e_sq {
@@ -418,7 +428,11 @@ struct mlx5e_sq {
 			struct mlx5e_sq_dma       *dma_fifo;
 			struct mlx5e_tx_wqe_info  *wqe_info;
 		} txq;
-		struct mlx5e_ico_wqe_info *ico_wqe;
+		struct mlx5e_sq_wqe_info *ico_wqe;
+		struct {
+			struct mlx5e_sq_wqe_info  *wqe_info;
+			struct mlx5e_dma_info     *di;
+		} xdp;
 	} db;
 
 	/* read only */
@@ -458,8 +472,10 @@ enum channel_flags {
 struct mlx5e_channel {
 	/* data path */
 	struct mlx5e_rq            rq;
+	struct mlx5e_sq            xdp_sq;
 	struct mlx5e_sq            sq[MLX5E_MAX_NUM_TC];
 	struct mlx5e_sq            icosq;   /* internal control operations */
+	bool                       xdp;
 	struct napi_struct         napi;
 	struct device             *pdev;
 	struct net_device         *netdev;
@@ -722,7 +738,7 @@ void mlx5e_cq_error_event(struct mlx5_core_cq *mcq, enum mlx5_event event);
 int mlx5e_napi_poll(struct napi_struct *napi, int budget);
 bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget);
 int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget);
-void mlx5e_free_tx_descs(struct mlx5e_sq *sq);
+void mlx5e_free_sq_descs(struct mlx5e_sq *sq);
 
 void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
 			bool recycle);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 8baeb9e..1d9c01f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -64,6 +64,7 @@ struct mlx5e_cq_param {
 struct mlx5e_channel_param {
 	struct mlx5e_rq_param      rq;
 	struct mlx5e_sq_param      sq;
+	struct mlx5e_sq_param      xdp_sq;
 	struct mlx5e_sq_param      icosq;
 	struct mlx5e_cq_param      rx_cq;
 	struct mlx5e_cq_param      tx_cq;
@@ -180,6 +181,8 @@ static void mlx5e_update_sw_counters(struct mlx5e_priv *priv)
 		s->rx_csum_complete += rq_stats->csum_complete;
 		s->rx_csum_unnecessary_inner += rq_stats->csum_unnecessary_inner;
 		s->rx_xdp_drop += rq_stats->xdp_drop;
+		s->rx_xdp_tx += rq_stats->xdp_tx;
+		s->rx_xdp_tx_full += rq_stats->xdp_tx_full;
 		s->rx_wqe_err   += rq_stats->wqe_err;
 		s->rx_mpwqe_filler += rq_stats->mpwqe_filler;
 		s->rx_buff_alloc_err += rq_stats->buff_alloc_err;
@@ -481,6 +484,10 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 	rq->priv    = c->priv;
 	rq->xdp_prog = priv->xdp_prog;
 
+	rq->buff.map_dir = DMA_FROM_DEVICE;
+	if (rq->xdp_prog)
+		rq->buff.map_dir = DMA_BIDIRECTIONAL;
+
 	switch (priv->params.rq_wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
 		rq->handle_rx_cqe = mlx5e_handle_rx_cqe_mpwrq;
@@ -767,6 +774,28 @@ static void mlx5e_close_rq(struct mlx5e_rq *rq)
 	mlx5e_destroy_rq(rq);
 }
 
+static void mlx5e_free_sq_xdp_db(struct mlx5e_sq *sq)
+{
+	kfree(sq->db.xdp.di);
+	kfree(sq->db.xdp.wqe_info);
+}
+
+static int mlx5e_alloc_sq_xdp_db(struct mlx5e_sq *sq, int numa)
+{
+	int wq_sz = mlx5_wq_cyc_get_size(&sq->wq);
+
+	sq->db.xdp.di = kzalloc_node(sizeof(*sq->db.xdp.di) * wq_sz,
+				     GFP_KERNEL, numa);
+	sq->db.xdp.wqe_info = kzalloc_node(sizeof(*sq->db.xdp.wqe_info) * wq_sz,
+					   GFP_KERNEL, numa);
+	if (!sq->db.xdp.di || !sq->db.xdp.wqe_info) {
+		mlx5e_free_sq_xdp_db(sq);
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
 static void mlx5e_free_sq_ico_db(struct mlx5e_sq *sq)
 {
 	kfree(sq->db.ico_wqe);
@@ -821,6 +850,9 @@ static void mlx5e_free_sq_db(struct mlx5e_sq *sq)
 	case MLX5E_SQ_ICO:
 		mlx5e_free_sq_ico_db(sq);
 		break;
+	case MLX5E_SQ_XDP:
+		mlx5e_free_sq_xdp_db(sq);
+		break;
 	}
 }
 
@@ -831,11 +863,24 @@ static int mlx5e_alloc_sq_db(struct mlx5e_sq *sq, int numa)
 		return mlx5e_alloc_sq_txq_db(sq, numa);
 	case MLX5E_SQ_ICO:
 		return mlx5e_alloc_sq_ico_db(sq, numa);
+	case MLX5E_SQ_XDP:
+		return mlx5e_alloc_sq_xdp_db(sq, numa);
 	}
 
 	return 0;
 }
 
+static int mlx5e_sq_get_max_wqebbs(u8 sq_type)
+{
+	switch (sq_type) {
+	case MLX5E_SQ_ICO:
+		return MLX5E_ICOSQ_MAX_WQEBBS;
+	case MLX5E_SQ_XDP:
+		return MLX5E_XDP_TX_WQEBBS;
+	}
+	return MLX5_SEND_WQE_MAX_WQEBBS;
+}
+
 static int mlx5e_create_sq(struct mlx5e_channel *c,
 			   int tc,
 			   struct mlx5e_sq_param *param,
@@ -846,7 +891,6 @@ static int mlx5e_create_sq(struct mlx5e_channel *c,
 
 	void *sqc = param->sqc;
 	void *sqc_wq = MLX5_ADDR_OF(sqc, sqc, wq);
-	u16 sq_max_wqebbs;
 	int err;
 
 	sq->type      = param->type;
@@ -884,7 +928,6 @@ static int mlx5e_create_sq(struct mlx5e_channel *c,
 	if (err)
 		goto err_sq_wq_destroy;
 
-	sq_max_wqebbs = MLX5_SEND_WQE_MAX_WQEBBS;
 	if (sq->type == MLX5E_SQ_TXQ) {
 		int txq_ix;
 
@@ -893,10 +936,7 @@ static int mlx5e_create_sq(struct mlx5e_channel *c,
 		priv->txq_to_sq_map[txq_ix] = sq;
 	}
 
-	if (sq->type == MLX5E_SQ_ICO)
-		sq_max_wqebbs = MLX5E_ICOSQ_MAX_WQEBBS;
-
-	sq->edge      = (sq->wq.sz_m1 + 1) - sq_max_wqebbs;
+	sq->edge = (sq->wq.sz_m1 + 1) - mlx5e_sq_get_max_wqebbs(sq->type);
 	sq->bf_budget = MLX5E_SQ_BF_BUDGET;
 
 	return 0;
@@ -1070,7 +1110,7 @@ static void mlx5e_close_sq(struct mlx5e_sq *sq)
 	}
 
 	mlx5e_disable_sq(sq);
-	mlx5e_free_tx_descs(sq);
+	mlx5e_free_sq_descs(sq);
 	mlx5e_destroy_sq(sq);
 }
 
@@ -1431,14 +1471,31 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 		}
 	}
 
+	if (priv->xdp_prog) {
+		/* XDP SQ CQ params are the same as normal TXQ SQ CQ params */
+		err = mlx5e_open_cq(c, &cparam->tx_cq, &c->xdp_sq.cq,
+				    priv->params.tx_cq_moderation);
+		if (err)
+			goto err_close_sqs;
+
+		err = mlx5e_open_sq(c, 0, &cparam->xdp_sq, &c->xdp_sq);
+		if (err) {
+			mlx5e_close_cq(&c->xdp_sq.cq);
+			goto err_close_sqs;
+		}
+	}
+
+	c->xdp = !!priv->xdp_prog;
 	err = mlx5e_open_rq(c, &cparam->rq, &c->rq);
 	if (err)
-		goto err_close_sqs;
+		goto err_close_xdp_sq;
 
 	netif_set_xps_queue(netdev, get_cpu_mask(c->cpu), ix);
 	*cp = c;
 
 	return 0;
+err_close_xdp_sq:
+	mlx5e_close_sq(&c->xdp_sq);
 
 err_close_sqs:
 	mlx5e_close_sqs(c);
@@ -1467,9 +1524,13 @@ err_napi_del:
 static void mlx5e_close_channel(struct mlx5e_channel *c)
 {
 	mlx5e_close_rq(&c->rq);
+	if (c->xdp)
+		mlx5e_close_sq(&c->xdp_sq);
 	mlx5e_close_sqs(c);
 	mlx5e_close_sq(&c->icosq);
 	napi_disable(&c->napi);
+	if (c->xdp)
+		mlx5e_close_cq(&c->xdp_sq.cq);
 	mlx5e_close_cq(&c->rq.cq);
 	mlx5e_close_tx_cqs(c);
 	mlx5e_close_cq(&c->icosq.cq);
@@ -1620,12 +1681,28 @@ static void mlx5e_build_icosq_param(struct mlx5e_priv *priv,
 	param->type = MLX5E_SQ_ICO;
 }
 
+static void mlx5e_build_xdpsq_param(struct mlx5e_priv *priv,
+				    struct mlx5e_sq_param *param)
+{
+	void *sqc = param->sqc;
+	void *wq = MLX5_ADDR_OF(sqc, sqc, wq);
+
+	mlx5e_build_sq_param_common(priv, param);
+	MLX5_SET(wq, wq, log_wq_sz,     priv->params.log_sq_size);
+
+	param->max_inline = priv->params.tx_max_inline;
+	/* For now, XDP SQs will support only L2 inline mode */
+	param->min_inline_mode = MLX5_INLINE_MODE_NONE;
+	param->type = MLX5E_SQ_XDP;
+}
+
 static void mlx5e_build_channel_param(struct mlx5e_priv *priv, struct mlx5e_channel_param *cparam)
 {
 	u8 icosq_log_wq_sz = MLX5E_PARAMS_MINIMUM_LOG_SQ_SIZE;
 
 	mlx5e_build_rq_param(priv, &cparam->rq);
 	mlx5e_build_sq_param(priv, &cparam->sq);
+	mlx5e_build_xdpsq_param(priv, &cparam->xdp_sq);
 	mlx5e_build_icosq_param(priv, &cparam->icosq, icosq_log_wq_sz);
 	mlx5e_build_rx_cq_param(priv, &cparam->rx_cq);
 	mlx5e_build_tx_cq_param(priv, &cparam->tx_cq);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index eb489e9..912a0e2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -236,7 +236,7 @@ static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
 
 	dma_info->page = page;
 	dma_info->addr = dma_map_page(rq->pdev, page, 0,
-				      RQ_PAGE_SIZE(rq), DMA_FROM_DEVICE);
+				      RQ_PAGE_SIZE(rq), rq->buff.map_dir);
 	if (unlikely(dma_mapping_error(rq->pdev, dma_info->addr))) {
 		put_page(page);
 		return -ENOMEM;
@@ -252,7 +252,7 @@ void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
 		return;
 
 	dma_unmap_page(rq->pdev, dma_info->addr, RQ_PAGE_SIZE(rq),
-		       DMA_FROM_DEVICE);
+		       rq->buff.map_dir);
 	put_page(dma_info->page);
 }
 
@@ -624,15 +624,100 @@ static inline void mlx5e_complete_rx_cqe(struct mlx5e_rq *rq,
 	napi_gro_receive(rq->cq.napi, skb);
 }
 
-static inline enum xdp_action mlx5e_xdp_handle(struct mlx5e_rq *rq,
-					       const struct bpf_prog *prog,
-					       void *data, u32 len)
+static inline bool mlx5e_xmit_xdp_frame(struct mlx5e_sq *sq,
+					struct mlx5e_dma_info *di,
+					unsigned int data_offset,
+					int len)
 {
+	struct mlx5_wq_cyc       *wq   = &sq->wq;
+	u16                      pi    = sq->pc & wq->sz_m1;
+	struct mlx5e_tx_wqe      *wqe  = mlx5_wq_cyc_get_wqe(wq, pi);
+	struct mlx5e_sq_wqe_info *wi   = &sq->db.xdp.wqe_info[pi];
+
+	struct mlx5_wqe_ctrl_seg *cseg = &wqe->ctrl;
+	struct mlx5_wqe_eth_seg  *eseg = &wqe->eth;
+	struct mlx5_wqe_data_seg *dseg;
+
+	dma_addr_t dma_addr  = di->addr + data_offset + MLX5E_XDP_MIN_INLINE;
+	unsigned int dma_len = len - MLX5E_XDP_MIN_INLINE;
+	void *data           = page_address(di->page) + data_offset;
+
+	if (unlikely(!mlx5e_sq_has_room_for(sq, MLX5E_XDP_TX_WQEBBS))) {
+		sq->channel->rq.stats.xdp_tx_full++;
+		return false;
+	}
+
+	dma_sync_single_for_device(sq->pdev, dma_addr, dma_len, DMA_TO_DEVICE);
+
+	memset(wqe, 0, sizeof(*wqe));
+
+	/* copy the inline part */
+	memcpy(eseg->inline_hdr_start, data, MLX5E_XDP_MIN_INLINE);
+	eseg->inline_hdr_sz = cpu_to_be16(MLX5E_XDP_MIN_INLINE);
+
+	dseg = (struct mlx5_wqe_data_seg *)cseg + (MLX5E_XDP_TX_DS_COUNT - 1);
+
+	/* write the dma part */
+	dseg->addr       = cpu_to_be64(dma_addr);
+	dseg->byte_count = cpu_to_be32(dma_len);
+	dseg->lkey       = sq->mkey_be;
+
+	cseg->opmod_idx_opcode = cpu_to_be32((sq->pc << 8) | MLX5_OPCODE_SEND);
+	cseg->qpn_ds = cpu_to_be32((sq->sqn << 8) | MLX5E_XDP_TX_DS_COUNT);
+
+	sq->db.xdp.di[pi] = *di;
+	wi->opcode     = MLX5_OPCODE_SEND;
+	wi->num_wqebbs = MLX5E_XDP_TX_WQEBBS;
+	sq->pc += MLX5E_XDP_TX_WQEBBS;
+
+	/* TODO: xmit more */
+	wqe->ctrl.fm_ce_se = MLX5_WQE_CTRL_CQ_UPDATE;
+	mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
+
+	/* fill sq edge with nops to avoid wqe wrap around */
+	while ((pi = (sq->pc & wq->sz_m1)) > sq->edge) {
+		sq->db.xdp.wqe_info[pi].opcode = MLX5_OPCODE_NOP;
+		mlx5e_send_nop(sq, false);
+	}
+	return true;
+}
+
+/* returns true if packet was consumed by xdp */
+static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
+				    const struct bpf_prog *prog,
+				    struct mlx5e_dma_info *di,
+				    void *data, u16 len)
+{
+	bool consumed = false;
 	struct xdp_buff xdp;
+	u32 act;
+
+	if (!prog)
+		return false;
 
 	xdp.data = data;
 	xdp.data_end = xdp.data + len;
-	return bpf_prog_run_xdp(prog, &xdp);
+	act = bpf_prog_run_xdp(prog, &xdp);
+	switch (act) {
+	case XDP_PASS:
+		return false;
+	case XDP_TX:
+		consumed = mlx5e_xmit_xdp_frame(&rq->channel->xdp_sq, di,
+						MLX5_RX_HEADROOM,
+						len);
+		rq->stats.xdp_tx += consumed;
+		return consumed;
+	default:
+		bpf_warn_invalid_xdp_action(act);
+		return false;
+	case XDP_ABORTED:
+	case XDP_DROP:
+		rq->stats.xdp_drop++;
+		mlx5e_page_release(rq, di, true);
+		return true;
+	}
+
+	return false;
 }
 
 void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
@@ -643,21 +728,22 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 	__be16 wqe_counter_be;
 	struct sk_buff *skb;
 	u16 wqe_counter;
+	void *va, *data;
 	u32 cqe_bcnt;
-	void *va;
 
 	wqe_counter_be = cqe->wqe_counter;
 	wqe_counter    = be16_to_cpu(wqe_counter_be);
 	wqe            = mlx5_wq_ll_get_wqe(&rq->wq, wqe_counter);
 	di             = &rq->dma_info[wqe_counter];
 	va             = page_address(di->page);
+	data           = va + MLX5_RX_HEADROOM;
 
 	dma_sync_single_range_for_cpu(rq->pdev,
 				      di->addr,
 				      MLX5_RX_HEADROOM,
 				      rq->buff.wqe_sz,
 				      DMA_FROM_DEVICE);
-	prefetch(va + MLX5_RX_HEADROOM);
+	prefetch(data);
 	cqe_bcnt = be32_to_cpu(cqe->byte_cnt);
 
 	if (unlikely((cqe->op_own >> 4) != MLX5_CQE_RESP_SEND)) {
@@ -666,17 +752,8 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 		goto wq_ll_pop;
 	}
 
-	if (xdp_prog) {
-		enum xdp_action act =
-			mlx5e_xdp_handle(rq, xdp_prog, va + MLX5_RX_HEADROOM,
-					 cqe_bcnt);
-
-		if (act != XDP_PASS) {
-			rq->stats.xdp_drop++;
-			mlx5e_page_release(rq, di, true);
-			goto wq_ll_pop;
-		}
-	}
+	if (mlx5e_xdp_handle(rq, xdp_prog, di, data, cqe_bcnt))
+		goto wq_ll_pop; /* page/packet was consumed by XDP */
 
 	skb = build_skb(va, RQ_PAGE_SIZE(rq));
 	if (unlikely(!skb)) {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
index 084d6c8..57452fd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
@@ -66,6 +66,8 @@ struct mlx5e_sw_stats {
 	u64 rx_csum_complete;
 	u64 rx_csum_unnecessary_inner;
 	u64 rx_xdp_drop;
+	u64 rx_xdp_tx;
+	u64 rx_xdp_tx_full;
 	u64 tx_csum_partial;
 	u64 tx_csum_partial_inner;
 	u64 tx_queue_stopped;
@@ -102,6 +104,8 @@ static const struct counter_desc sw_stats_desc[] = {
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_csum_complete) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_csum_unnecessary_inner) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xdp_drop) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xdp_tx) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xdp_tx_full) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_csum_partial) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_csum_partial_inner) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_queue_stopped) },
@@ -281,6 +285,8 @@ struct mlx5e_rq_stats {
 	u64 lro_packets;
 	u64 lro_bytes;
 	u64 xdp_drop;
+	u64 xdp_tx;
+	u64 xdp_tx_full;
 	u64 wqe_err;
 	u64 mpwqe_filler;
 	u64 buff_alloc_err;
@@ -299,6 +305,8 @@ static const struct counter_desc rq_stats_desc[] = {
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, csum_unnecessary_inner) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, csum_none) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, xdp_drop) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, xdp_tx) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, xdp_tx_full) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, lro_packets) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, lro_bytes) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, wqe_err) },
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index a728303..7191035 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -493,16 +493,13 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 	return (i == MLX5E_TX_CQ_POLL_BUDGET);
 }
 
-void mlx5e_free_tx_descs(struct mlx5e_sq *sq)
+static void mlx5e_free_txq_sq_descs(struct mlx5e_sq *sq)
 {
 	struct mlx5e_tx_wqe_info *wi;
 	struct sk_buff *skb;
 	u16 ci;
 	int i;
 
-	if (sq->type != MLX5E_SQ_TXQ)
-		return;
-
 	while (sq->cc != sq->pc) {
 		ci = sq->cc & sq->wq.sz_m1;
 		skb = sq->db.txq.skb[ci];
@@ -524,3 +521,37 @@ void mlx5e_free_tx_descs(struct mlx5e_sq *sq)
 		sq->cc += wi->num_wqebbs;
 	}
 }
+
+static void mlx5e_free_xdp_sq_descs(struct mlx5e_sq *sq)
+{
+	struct mlx5e_sq_wqe_info *wi;
+	struct mlx5e_dma_info *di;
+	u16 ci;
+
+	while (sq->cc != sq->pc) {
+		ci = sq->cc & sq->wq.sz_m1;
+		di = &sq->db.xdp.di[ci];
+		wi = &sq->db.xdp.wqe_info[ci];
+
+		if (wi->opcode == MLX5_OPCODE_NOP) {
+			sq->cc++;
+			continue;
+		}
+
+		sq->cc += wi->num_wqebbs;
+
+		mlx5e_page_release(&sq->channel->rq, di, false);
+	}
+}
+
+void mlx5e_free_sq_descs(struct mlx5e_sq *sq)
+{
+	switch (sq->type) {
+	case MLX5E_SQ_TXQ:
+		mlx5e_free_txq_sq_descs(sq);
+		break;
+	case MLX5E_SQ_XDP:
+		mlx5e_free_xdp_sq_descs(sq);
+		break;
+	}
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
index 47cd561..397285d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
@@ -72,7 +72,7 @@ static void mlx5e_poll_ico_cq(struct mlx5e_cq *cq)
 
 	do {
 		u16 ci = be16_to_cpu(cqe->wqe_counter) & wq->sz_m1;
-		struct mlx5e_ico_wqe_info *icowi = &sq->db.ico_wqe[ci];
+		struct mlx5e_sq_wqe_info *icowi = &sq->db.ico_wqe[ci];
 
 		mlx5_cqwq_pop(&cq->wq);
 		sqcc += icowi->num_wqebbs;
@@ -105,6 +105,66 @@ static void mlx5e_poll_ico_cq(struct mlx5e_cq *cq)
 	sq->cc = sqcc;
 }
 
+static inline bool mlx5e_poll_xdp_tx_cq(struct mlx5e_cq *cq)
+{
+	struct mlx5e_sq *sq;
+	u16 sqcc;
+	int i;
+
+	sq = container_of(cq, struct mlx5e_sq, cq);
+
+	if (unlikely(test_bit(MLX5E_SQ_STATE_FLUSH, &sq->state)))
+		return false;
+
+	/* sq->cc must be updated only after mlx5_cqwq_update_db_record(),
+	 * otherwise a cq overrun may occur
+	 */
+	sqcc = sq->cc;
+
+	for (i = 0; i < MLX5E_TX_CQ_POLL_BUDGET; i++) {
+		struct mlx5_cqe64 *cqe;
+		u16 wqe_counter;
+		bool last_wqe;
+
+		cqe = mlx5e_get_cqe(cq);
+		if (!cqe)
+			break;
+
+		mlx5_cqwq_pop(&cq->wq);
+
+		wqe_counter = be16_to_cpu(cqe->wqe_counter);
+
+		do {
+			struct mlx5e_sq_wqe_info *wi;
+			struct mlx5e_dma_info *di;
+			u16 ci;
+
+			last_wqe = (sqcc == wqe_counter);
+
+			ci = sqcc & sq->wq.sz_m1;
+			di = &sq->db.xdp.di[ci];
+			wi = &sq->db.xdp.wqe_info[ci];
+
+			if (unlikely(wi->opcode == MLX5_OPCODE_NOP)) {
+				sqcc++;
+				continue;
+			}
+
+			sqcc += wi->num_wqebbs;
+			/* Recycle RX page */
+			mlx5e_page_release(&cq->channel->rq, di, true);
+		} while (!last_wqe);
+	}
+
+	mlx5_cqwq_update_db_record(&cq->wq);
+
+	/* ensure cq space is freed before enabling more cqes */
+	wmb();
+
+	sq->cc = sqcc;
+	return (i == MLX5E_TX_CQ_POLL_BUDGET);
+}
+
 int mlx5e_napi_poll(struct napi_struct *napi, int budget)
 {
 	struct mlx5e_channel *c = container_of(napi, struct mlx5e_channel,
@@ -121,6 +181,9 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
 	work_done = mlx5e_poll_rx_cq(&c->rq.cq, budget);
 	busy |= work_done == budget;
 
+	if (c->xdp)
+		busy |= mlx5e_poll_xdp_tx_cq(&c->xdp_sq.cq);
+
 	mlx5e_poll_ico_cq(&c->icosq.cq);
 
 	busy |= mlx5e_post_rx_wqes(&c->rq);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
  2016-09-07 12:42 [PATCH RFC 00/11] mlx5 RX refactoring and XDP support Saeed Mahameed
                   ` (9 preceding siblings ...)
  2016-09-07 12:42 ` [PATCH RFC 10/11] net/mlx5e: XDP TX forwarding support Saeed Mahameed
@ 2016-09-07 12:42 ` Saeed Mahameed
  2016-09-07 13:44   ` John Fastabend
                     ` (2 more replies)
       [not found] ` <1473252152-11379-1-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  11 siblings, 3 replies; 72+ messages in thread
From: Saeed Mahameed @ 2016-09-07 12:42 UTC (permalink / raw)
  To: iovisor-dev
  Cc: netdev, Tariq Toukan, Brenden Blanco, Alexei Starovoitov,
	Tom Herbert, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Eric Dumazet, Jamal Hadi Salim, Saeed Mahameed

Previously we rang the XDP SQ doorbell on every forwarded XDP packet.

Here we introduce an xmit_more-like mechanism that queues up more than
one packet into the SQ (up to the RX napi budget) without notifying the
hardware.

Once the RX napi budget is consumed and we exit the RX napi loop, we
flush (ring the doorbell for) all XDP forwarded packets, if any.
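
Condensed, the RX poll flow after this patch looks roughly like the
sketch below (simplified from the diff in this patch; the compressed-CQE
path and the RQ flush/state checks are omitted):

/* Simplified sketch of the batching flow introduced by this patch */
int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
{
	struct mlx5e_rq *rq = container_of(cq, struct mlx5e_rq, cq);
	bool xdp_doorbell = false;
	int work_done;

	for (work_done = 0; work_done < budget; work_done++) {
		struct mlx5_cqe64 *cqe = mlx5e_get_cqe(cq);

		if (!cqe)
			break;
		mlx5_cqwq_pop(&cq->wq);
		/* an XDP_TX packet only posts a WQE and sets xdp_doorbell */
		rq->handle_rx_cqe(rq, cqe, &xdp_doorbell);
	}

	if (xdp_doorbell) /* one doorbell covers the whole napi batch */
		mlx5e_xmit_xdp_doorbell(&rq->channel->xdp_sq);

	mlx5_cqwq_update_db_record(&cq->wq);

	return work_done;
}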

XDP forward packet rate:

Comparing XDP with and w/o xmit more (bulk transmit):

Streams     XDP TX       XDP TX (xmit more)
---------------------------------------------------
1           4.90Mpps      7.50Mpps
2           9.50Mpps      14.8Mpps
4           16.5Mpps      25.1Mpps
8           21.5Mpps      27.5Mpps*
16          24.1Mpps      27.5Mpps*

*It seems we hit a wall at 27.5Mpps for 8 and 16 streams; we will be
working on the analysis and will publish the conclusions later.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h    |  9 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 57 +++++++++++++++++++------
 2 files changed, 49 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index df2c9e0..6846208 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -265,7 +265,8 @@ struct mlx5e_cq {
 
 struct mlx5e_rq;
 typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq *rq,
-				       struct mlx5_cqe64 *cqe);
+				       struct mlx5_cqe64 *cqe,
+				       bool *xdp_doorbell);
 typedef int (*mlx5e_fp_alloc_wqe)(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe,
 				  u16 ix);
 
@@ -742,8 +743,10 @@ void mlx5e_free_sq_descs(struct mlx5e_sq *sq);
 
 void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
 			bool recycle);
-void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
-void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
+void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
+			 bool *xdp_doorbell);
+void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
+			       bool *xdp_doorbell);
 bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq);
 int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
 int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe,	u16 ix);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 912a0e2..ed93251 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -117,7 +117,8 @@ static inline void mlx5e_decompress_cqe_no_hash(struct mlx5e_rq *rq,
 static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
 					     struct mlx5e_cq *cq,
 					     int update_owner_only,
-					     int budget_rem)
+					     int budget_rem,
+					     bool *xdp_doorbell)
 {
 	u32 cqcc = cq->wq.cc + update_owner_only;
 	u32 cqe_count;
@@ -131,7 +132,7 @@ static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
 			mlx5e_read_mini_arr_slot(cq, cqcc);
 
 		mlx5e_decompress_cqe_no_hash(rq, cq, cqcc);
-		rq->handle_rx_cqe(rq, &cq->title);
+		rq->handle_rx_cqe(rq, &cq->title, xdp_doorbell);
 	}
 	mlx5e_cqes_update_owner(cq, cq->wq.cc, cqcc - cq->wq.cc);
 	cq->wq.cc = cqcc;
@@ -143,15 +144,16 @@ static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
 
 static inline u32 mlx5e_decompress_cqes_start(struct mlx5e_rq *rq,
 					      struct mlx5e_cq *cq,
-					      int budget_rem)
+					      int budget_rem,
+					      bool *xdp_doorbell)
 {
 	mlx5e_read_title_slot(rq, cq, cq->wq.cc);
 	mlx5e_read_mini_arr_slot(cq, cq->wq.cc + 1);
 	mlx5e_decompress_cqe(rq, cq, cq->wq.cc);
-	rq->handle_rx_cqe(rq, &cq->title);
+	rq->handle_rx_cqe(rq, &cq->title, xdp_doorbell);
 	cq->mini_arr_idx++;
 
-	return mlx5e_decompress_cqes_cont(rq, cq, 1, budget_rem) - 1;
+	return mlx5e_decompress_cqes_cont(rq, cq, 1, budget_rem, xdp_doorbell) - 1;
 }
 
 void mlx5e_modify_rx_cqe_compression(struct mlx5e_priv *priv, bool val)
@@ -670,23 +672,36 @@ static inline bool mlx5e_xmit_xdp_frame(struct mlx5e_sq *sq,
 	wi->num_wqebbs = MLX5E_XDP_TX_WQEBBS;
 	sq->pc += MLX5E_XDP_TX_WQEBBS;
 
-	/* TODO: xmit more */
+	/* mlx5e_xmit_xdp_doorbell will be called after the RX napi loop */
+	return true;
+}
+
+static inline void mlx5e_xmit_xdp_doorbell(struct mlx5e_sq *sq)
+{
+	struct mlx5_wq_cyc *wq = &sq->wq;
+	struct mlx5e_tx_wqe *wqe;
+	u16 pi = (sq->pc - MLX5E_XDP_TX_WQEBBS) & wq->sz_m1; /* last pi */
+
+	wqe  = mlx5_wq_cyc_get_wqe(wq, pi);
+
 	wqe->ctrl.fm_ce_se = MLX5_WQE_CTRL_CQ_UPDATE;
 	mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
 
+#if 0 /* enable this code only if MLX5E_XDP_TX_WQEBBS > 1 */
 	/* fill sq edge with nops to avoid wqe wrap around */
 	while ((pi = (sq->pc & wq->sz_m1)) > sq->edge) {
 		sq->db.xdp.wqe_info[pi].opcode = MLX5_OPCODE_NOP;
 		mlx5e_send_nop(sq, false);
 	}
-	return true;
+#endif
 }
 
 /* returns true if packet was consumed by xdp */
 static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 				    const struct bpf_prog *prog,
 				    struct mlx5e_dma_info *di,
-				    void *data, u16 len)
+				    void *data, u16 len,
+				    bool *xdp_doorbell)
 {
 	bool consumed = false;
 	struct xdp_buff xdp;
@@ -705,7 +720,13 @@ static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 		consumed = mlx5e_xmit_xdp_frame(&rq->channel->xdp_sq, di,
 						MLX5_RX_HEADROOM,
 						len);
+		if (unlikely(!consumed) && (*xdp_doorbell)) {
+			/* SQ is full, ring doorbell */
+			mlx5e_xmit_xdp_doorbell(&rq->channel->xdp_sq);
+			*xdp_doorbell = false;
+		}
 		rq->stats.xdp_tx += consumed;
+		*xdp_doorbell |= consumed;
 		return consumed;
 	default:
 		bpf_warn_invalid_xdp_action(act);
@@ -720,7 +741,8 @@ static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 	return false;
 }
 
-void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
+void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
+			 bool *xdp_doorbell)
 {
 	struct bpf_prog *xdp_prog = READ_ONCE(rq->xdp_prog);
 	struct mlx5e_dma_info *di;
@@ -752,7 +774,7 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 		goto wq_ll_pop;
 	}
 
-	if (mlx5e_xdp_handle(rq, xdp_prog, di, data, cqe_bcnt))
+	if (mlx5e_xdp_handle(rq, xdp_prog, di, data, cqe_bcnt, xdp_doorbell))
 		goto wq_ll_pop; /* page/packet was consumed by XDP */
 
 	skb = build_skb(va, RQ_PAGE_SIZE(rq));
@@ -814,7 +836,8 @@ static inline void mlx5e_mpwqe_fill_rx_skb(struct mlx5e_rq *rq,
 	skb->len  += headlen;
 }
 
-void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
+void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
+			       bool *xdp_doorbell)
 {
 	u16 cstrides       = mpwrq_get_cqe_consumed_strides(cqe);
 	u16 wqe_id         = be16_to_cpu(cqe->wqe_id);
@@ -860,13 +883,15 @@ mpwrq_cqe_out:
 int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
 {
 	struct mlx5e_rq *rq = container_of(cq, struct mlx5e_rq, cq);
+	bool xdp_doorbell = false;
 	int work_done = 0;
 
 	if (unlikely(test_bit(MLX5E_RQ_STATE_FLUSH, &rq->state)))
 		return 0;
 
 	if (cq->decmprs_left)
-		work_done += mlx5e_decompress_cqes_cont(rq, cq, 0, budget);
+		work_done += mlx5e_decompress_cqes_cont(rq, cq, 0, budget,
+							&xdp_doorbell);
 
 	for (; work_done < budget; work_done++) {
 		struct mlx5_cqe64 *cqe = mlx5e_get_cqe(cq);
@@ -877,15 +902,19 @@ int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
 		if (mlx5_get_cqe_format(cqe) == MLX5_COMPRESSED) {
 			work_done +=
 				mlx5e_decompress_cqes_start(rq, cq,
-							    budget - work_done);
+							    budget - work_done,
+							    &xdp_doorbell);
 			continue;
 		}
 
 		mlx5_cqwq_pop(&cq->wq);
 
-		rq->handle_rx_cqe(rq, cqe);
+		rq->handle_rx_cqe(rq, cqe, &xdp_doorbell);
 	}
 
+	if (xdp_doorbell)
+		mlx5e_xmit_xdp_doorbell(&rq->channel->xdp_sq);
+
 	mlx5_cqwq_update_db_record(&cq->wq);
 
 	/* ensure cq space is freed before enabling more cqes */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support
  2016-09-07 12:42 ` [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support Saeed Mahameed
@ 2016-09-07 13:32   ` Or Gerlitz
       [not found]     ` <CAJ3xEMhh=fu+mrCGAjv1PDdGn9GPLJv9MssMzwzvppoqZUY01A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
       [not found]   ` <1473252152-11379-9-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2016-09-08 10:58   ` Jamal Hadi Salim
  2 siblings, 1 reply; 72+ messages in thread
From: Or Gerlitz @ 2016-09-07 13:32 UTC (permalink / raw)
  To: Saeed Mahameed, Tariq Toukan, Rana Shahout
  Cc: iovisor-dev, Linux Netdev List, Brenden Blanco,
	Alexei Starovoitov, Tom Herbert, Martin KaFai Lau,
	Jesper Dangaard Brouer, Daniel Borkmann, Eric Dumazet,
	Jamal Hadi Salim

On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed <saeedm@mellanox.com> wrote:

> Packet rate performance testing was done with pktgen 64B packets and on
> TX side and, TC drop action on RX side compared to XDP fast drop.
>
> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
>
> Comparison is done between:
>         1. Baseline, Before this patch with TC drop action
>         2. This patch with TC drop action
>         3. This patch with XDP RX fast drop
>
> Streams    Baseline(TC drop)    TC drop    XDP fast Drop
> --------------------------------------------------------------
> 1           5.51Mpps            5.14Mpps     13.5Mpps
> 2           11.5Mpps            10.0Mpps     25.1Mpps
> 4           16.3Mpps            17.2Mpps     35.4Mpps
> 8           29.6Mpps            28.2Mpps     45.8Mpps*
> 16          34.0Mpps            30.1Mpps     45.8Mpps*

Rana, Guys, congrat!!

When you say X streams, is each stream mapped by RSS to a different RX
ring, or are we on the same RX ring for all rows of the above table?

In the CX3 work, we had X sender "streams" that all mapped to the same
RX ring; I don't think we went beyond one RX ring.

Here, I guess you want to first get an initial max for N pktgen TX
threads all sending the same stream so you land on a single RX ring,
and then move to M * N pktgen TX threads to max that further.

I don't see how the current Linux stack would be able to happily drive 34M PPS
(== allocate SKB, etc, you know...) on a single CPU, Jesper?

Or.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
  2016-09-07 12:42 ` [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more Saeed Mahameed
@ 2016-09-07 13:44   ` John Fastabend
       [not found]     ` <57D019B2.7070007-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2016-09-07 14:41   ` Eric Dumazet
       [not found]   ` <1473252152-11379-12-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2 siblings, 1 reply; 72+ messages in thread
From: John Fastabend @ 2016-09-07 13:44 UTC (permalink / raw)
  To: Saeed Mahameed, iovisor-dev
  Cc: netdev, Tariq Toukan, Brenden Blanco, Alexei Starovoitov,
	Tom Herbert, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Eric Dumazet, Jamal Hadi Salim

On 16-09-07 05:42 AM, Saeed Mahameed wrote:
> Previously we rang XDP SQ doorbell on every forwarded XDP packet.
> 
> Here we introduce a xmit more like mechanism that will queue up more
> than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
> 
> Once RX napi budget is consumed and we exit napi RX loop, we will
> flush (doorbell) all XDP looped packets in case there are such.
> 
> XDP forward packet rate:
> 
> Comparing XDP with and w/o xmit more (bulk transmit):
> 
> Streams     XDP TX       XDP TX (xmit more)
> ---------------------------------------------------
> 1           4.90Mpps      7.50Mpps
> 2           9.50Mpps      14.8Mpps
> 4           16.5Mpps      25.1Mpps
> 8           21.5Mpps      27.5Mpps*
> 16          24.1Mpps      27.5Mpps*
> 

Hi Saeed,

How many cores are you using with these numbers? Just a single
core? Or are streams being RSS'd across cores somehow.

> *It seems we hit a wall of 27.5Mpps, for 8 and 16 streams,
> we will be working on the analysis and will publish the conclusions
> later.
> 

Thanks,
John

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
       [not found]     ` <57D019B2.7070007-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2016-09-07 14:40       ` Saeed Mahameed via iovisor-dev
  0 siblings, 0 replies; 72+ messages in thread
From: Saeed Mahameed via iovisor-dev @ 2016-09-07 14:40 UTC (permalink / raw)
  To: John Fastabend
  Cc: Tom Herbert, iovisor-dev, Jamal Hadi Salim, Saeed Mahameed,
	Eric Dumazet, Linux Netdev List

On Wed, Sep 7, 2016 at 4:44 PM, John Fastabend via iovisor-dev
<iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy@public.gmane.org> wrote:
> On 16-09-07 05:42 AM, Saeed Mahameed wrote:
>> Previously we rang XDP SQ doorbell on every forwarded XDP packet.
>>
>> Here we introduce a xmit more like mechanism that will queue up more
>> than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
>>
>> Once RX napi budget is consumed and we exit napi RX loop, we will
>> flush (doorbell) all XDP looped packets in case there are such.
>>
>> XDP forward packet rate:
>>
>> Comparing XDP with and w/o xmit more (bulk transmit):
>>
>> Streams     XDP TX       XDP TX (xmit more)
>> ---------------------------------------------------
>> 1           4.90Mpps      7.50Mpps
>> 2           9.50Mpps      14.8Mpps
>> 4           16.5Mpps      25.1Mpps
>> 8           21.5Mpps      27.5Mpps*
>> 16          24.1Mpps      27.5Mpps*
>>
>
> Hi Saeed,
>
> How many cores are you using with these numbers? Just a single
> core? Or are streams being RSS'd across cores somehow.
>

Hi John,

Right, I should have been clearer here: the number of streams refers to
the number of active RSS cores.
We just manipulate the number of rings with ethtool -L to test this.

>> *It seems we hit a wall of 27.5Mpps, for 8 and 16 streams,
>> we will be working on the analysis and will publish the conclusions
>> later.
>>
>
> Thanks,
> John
> _______________________________________________
> iovisor-dev mailing list
> iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy@public.gmane.org
> https://lists.iovisor.org/mailman/listinfo/iovisor-dev

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
  2016-09-07 12:42 ` [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more Saeed Mahameed
  2016-09-07 13:44   ` John Fastabend
@ 2016-09-07 14:41   ` Eric Dumazet
       [not found]     ` <1473259302.10725.31.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
       [not found]   ` <1473252152-11379-12-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2 siblings, 1 reply; 72+ messages in thread
From: Eric Dumazet @ 2016-09-07 14:41 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: iovisor-dev, netdev, Tariq Toukan, Brenden Blanco,
	Alexei Starovoitov, Tom Herbert, Martin KaFai Lau,
	Jesper Dangaard Brouer, Daniel Borkmann, Eric Dumazet,
	Jamal Hadi Salim

On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
> Previously we rang XDP SQ doorbell on every forwarded XDP packet.
> 
> Here we introduce a xmit more like mechanism that will queue up more
> than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
> 
> Once RX napi budget is consumed and we exit napi RX loop, we will
> flush (doorbell) all XDP looped packets in case there are such.

Why does this idea depend on XDP?

It looks like we could apply it to any driver having one IRQ servicing
one RX and one TX, without XDP being involved.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support
       [not found]     ` <CAJ3xEMhh=fu+mrCGAjv1PDdGn9GPLJv9MssMzwzvppoqZUY01A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-09-07 14:48       ` Saeed Mahameed via iovisor-dev
       [not found]         ` <CALzJLG8_F28kQOPqTTLJRMsf9BOQvm3K2hAraCzabnXV4yKUgg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 72+ messages in thread
From: Saeed Mahameed via iovisor-dev @ 2016-09-07 14:48 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Linux Netdev List, iovisor-dev, Jamal Hadi Salim, Saeed Mahameed,
	Eric Dumazet, Tom Herbert, Rana Shahout

On Wed, Sep 7, 2016 at 4:32 PM, Or Gerlitz <gerlitz.or-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>
>> Packet rate performance testing was done with pktgen 64B packets and on
>> TX side and, TC drop action on RX side compared to XDP fast drop.
>>
>> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
>>
>> Comparison is done between:
>>         1. Baseline, Before this patch with TC drop action
>>         2. This patch with TC drop action
>>         3. This patch with XDP RX fast drop
>>
>> Streams    Baseline(TC drop)    TC drop    XDP fast Drop
>> --------------------------------------------------------------
>> 1           5.51Mpps            5.14Mpps     13.5Mpps
>> 2           11.5Mpps            10.0Mpps     25.1Mpps
>> 4           16.3Mpps            17.2Mpps     35.4Mpps
>> 8           29.6Mpps            28.2Mpps     45.8Mpps*
>> 16          34.0Mpps            30.1Mpps     45.8Mpps*
>
> Rana, Guys, congrat!!
>
> When you say X streams, does each stream mapped by RSS to different RX ring?
> or we're on the same RX ring for all rows of the above table?

Yes, I will make this clearer in the actual submission.
Here we are talking about different RSS core rings.

>
> In the CX3 work, we had X sender "streams" that all mapped to the same RX ring,
> I don't think we went beyond one RX ring.

Here we did. The first row is what you are describing; the other rows
are the same test with an increasing number of RSS receiving cores.
The xmit side is sending as many streams as possible, spread as
uniformly as possible across the different RSS cores on the receiver.

>
> Here, I guess you want to 1st get an initial max for N pktgen TX
> threads all sending
> the same stream so you land on single RX ring, and then move to M * N pktgen TX
> threads to max that further.
>
> I don't see how the current Linux stack would be able to happily drive 34M PPS
> (== allocate SKB, etc, you know...) on a single CPU, Jesper?
>
> Or.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
       [not found]     ` <1473259302.10725.31.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
@ 2016-09-07 15:08       ` Saeed Mahameed via iovisor-dev
  2016-09-07 15:32         ` Eric Dumazet
  0 siblings, 1 reply; 72+ messages in thread
From: Saeed Mahameed via iovisor-dev @ 2016-09-07 15:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linux Netdev List, iovisor-dev, Jamal Hadi Salim, Saeed Mahameed,
	Eric Dumazet, Tom Herbert

On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
>> Previously we rang XDP SQ doorbell on every forwarded XDP packet.
>>
>> Here we introduce a xmit more like mechanism that will queue up more
>> than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
>>
>> Once RX napi budget is consumed and we exit napi RX loop, we will
>> flush (doorbell) all XDP looped packets in case there are such.
>
> Why is this idea depends on XDP ?
>
> It looks like we could apply it to any driver having one IRQ servicing
> one RX and one TX, without XDP being involved.
>

Yes, but it is more complicated than the XDP case, where the RX ring
posts the TX descriptors and, once done, hits the doorbell once for all
the TX descriptors it posted; that is the only possible place to hit a
doorbell for the XDP TX ring.

For regular TX and RX rings sharing the same IRQ, there is no such
simple connection between them, and hitting a doorbell from the RX ring
napi would race with the xmit ndo function of the TX ring.

How do you synchronize in such a case?
Isn't the existing xmit_more mechanism sufficient? Maybe we can have a
fence from the RX napi function that holds the xmit queue until done
and then flushes the TX queue with the right xmit_more flags set,
without the need to explicitly intervene in the TX flow (hitting the
doorbell).

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
  2016-09-07 15:08       ` Saeed Mahameed via iovisor-dev
@ 2016-09-07 15:32         ` Eric Dumazet
  2016-09-07 16:57           ` Saeed Mahameed
  0 siblings, 1 reply; 72+ messages in thread
From: Eric Dumazet @ 2016-09-07 15:32 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Saeed Mahameed, iovisor-dev, Linux Netdev List, Tariq Toukan,
	Brenden Blanco, Alexei Starovoitov, Tom Herbert,
	Martin KaFai Lau, Jesper Dangaard Brouer, Daniel Borkmann,
	Eric Dumazet, Jamal Hadi Salim

On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
> >> Previously we rang XDP SQ doorbell on every forwarded XDP packet.
> >>
> >> Here we introduce a xmit more like mechanism that will queue up more
> >> than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
> >>
> >> Once RX napi budget is consumed and we exit napi RX loop, we will
> >> flush (doorbell) all XDP looped packets in case there are such.
> >
> > Why is this idea depends on XDP ?
> >
> > It looks like we could apply it to any driver having one IRQ servicing
> > one RX and one TX, without XDP being involved.
> >
> 
> Yes but it is more complicated than XDP case, where the RX ring posts
> the TX descriptors and once done
> the RX ring hits the doorbell once for all the TX descriptors it
> posted, and it is the only possible place to hit a doorbell
> for XDP TX ring.
> 
> For regular TX and RX ring sharing the same IRQ, there is no such
> simple connection between them, and hitting a doorbell
> from RX ring napi would race with xmit ndo function of the TX ring.
> 
> How do you synchronize in such case ?
> isn't the existing xmit more mechanism sufficient enough ?

Only if a qdisc is present and pressure is high enough.

But in a forwarding setup, we likely receive at a lower rate than the
NIC can transmit.

A simple cmpxchg could be used to synchronize the thing, if we really
cared about doorbell cost (i.e. if the cost of this cmpxchg() is way
smaller than the doorbell one).
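
For illustration only (not part of this series): both the xmit path and
the napi poll could mark a shared "doorbell pending" flag, and whichever
path atomically clears it issues the single doorbell write. A minimal
sketch, assuming a new atomic_t db_pending field in struct mlx5e_sq
(which does not exist today) and reusing mlx5e_tx_notify_hw():

/* Hypothetical sketch; relies on linux/atomic.h and driver context. */
static void mlx5e_sq_defer_doorbell(struct mlx5e_sq *sq)
{
	/* producer side (xmit path or XDP post): a doorbell is owed */
	atomic_set(&sq->db_pending, 1);
}

static void mlx5e_sq_flush_doorbell(struct mlx5e_sq *sq,
				    struct mlx5_wqe_ctrl_seg *ctrl)
{
	/* whoever wins the cmpxchg rings the doorbell exactly once */
	if (atomic_cmpxchg(&sq->db_pending, 1, 0) == 1)
		mlx5e_tx_notify_hw(sq, ctrl, 0);
}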

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support
       [not found]         ` <CALzJLG8_F28kQOPqTTLJRMsf9BOQvm3K2hAraCzabnXV4yKUgg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-09-07 16:54           ` Tom Herbert via iovisor-dev
       [not found]             ` <CALx6S35b_MZXiGR-b1SB+VNifPHDfQNDZdz-6vk0t3bKNwen+w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 72+ messages in thread
From: Tom Herbert via iovisor-dev @ 2016-09-07 16:54 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Linux Netdev List, iovisor-dev, Jamal Hadi Salim, Saeed Mahameed,
	Eric Dumazet, Rana Shahout, Or Gerlitz

On Wed, Sep 7, 2016 at 7:48 AM, Saeed Mahameed
<saeedm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
> On Wed, Sep 7, 2016 at 4:32 PM, Or Gerlitz <gerlitz.or-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>>
>>> Packet rate performance testing was done with pktgen 64B packets and on
>>> TX side and, TC drop action on RX side compared to XDP fast drop.
>>>
>>> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
>>>
>>> Comparison is done between:
>>>         1. Baseline, Before this patch with TC drop action
>>>         2. This patch with TC drop action
>>>         3. This patch with XDP RX fast drop
>>>
>>> Streams    Baseline(TC drop)    TC drop    XDP fast Drop
>>> --------------------------------------------------------------
>>> 1           5.51Mpps            5.14Mpps     13.5Mpps
>>> 2           11.5Mpps            10.0Mpps     25.1Mpps
>>> 4           16.3Mpps            17.2Mpps     35.4Mpps
>>> 8           29.6Mpps            28.2Mpps     45.8Mpps*
>>> 16          34.0Mpps            30.1Mpps     45.8Mpps*
>>
>> Rana, Guys, congrat!!
>>
>> When you say X streams, does each stream mapped by RSS to different RX ring?
>> or we're on the same RX ring for all rows of the above table?
>
> Yes, I will make this more clear in the actual submission,
> Here we are talking about different RSS core rings.
>
>>
>> In the CX3 work, we had X sender "streams" that all mapped to the same RX ring,
>> I don't think we went beyond one RX ring.
>
> Here we did, the first row is what you are describing the other rows
> are the same test
> with increasing the number of the RSS receiving cores, The xmit side is sending
> as many streams as possible to be as much uniformly spread as possible
> across the
> different RSS cores on the receiver.
>
Hi Saeed,

Please report CPU utilization also. The expectation is that
performance should scale linearly with increasing number of CPUs (i.e.
pps/CPU_utilization should be constant).
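
For illustration, using the XDP fast-drop column above and assuming each
stream keeps one core fully busy: 13.5Mpps on 1 core and 25.1Mpps on 2
cores are both roughly 13Mpps per core, while 45.8Mpps on 16 cores is
only about 2.9Mpps per core; so either the cores are far from fully
utilized or the bottleneck has moved off the CPU, which is exactly what
the CPU utilization numbers would show.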

Tom

>>
>> Here, I guess you want to 1st get an initial max for N pktgen TX
>> threads all sending
>> the same stream so you land on single RX ring, and then move to M * N pktgen TX
>> threads to max that further.
>>
>> I don't see how the current Linux stack would be able to happily drive 34M PPS
>> (== allocate SKB, etc, you know...) on a single CPU, Jesper?
>>
>> Or.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
  2016-09-07 15:32         ` Eric Dumazet
@ 2016-09-07 16:57           ` Saeed Mahameed
       [not found]             ` <CALzJLG9iVpS2qH5Ryc_DtEjrQMhcKD+qrLrGn=vet=_9N8eXPw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 72+ messages in thread
From: Saeed Mahameed @ 2016-09-07 16:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Saeed Mahameed, iovisor-dev, Linux Netdev List, Tariq Toukan,
	Brenden Blanco, Alexei Starovoitov, Tom Herbert,
	Martin KaFai Lau, Jesper Dangaard Brouer, Daniel Borkmann,
	Eric Dumazet, Jamal Hadi Salim

On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
>> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
>> >> Previously we rang XDP SQ doorbell on every forwarded XDP packet.
>> >>
>> >> Here we introduce a xmit more like mechanism that will queue up more
>> >> than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
>> >>
>> >> Once RX napi budget is consumed and we exit napi RX loop, we will
>> >> flush (doorbell) all XDP looped packets in case there are such.
>> >
>> > Why is this idea depends on XDP ?
>> >
>> > It looks like we could apply it to any driver having one IRQ servicing
>> > one RX and one TX, without XDP being involved.
>> >
>>
>> Yes but it is more complicated than XDP case, where the RX ring posts
>> the TX descriptors and once done
>> the RX ring hits the doorbell once for all the TX descriptors it
>> posted, and it is the only possible place to hit a doorbell
>> for XDP TX ring.
>>
>> For regular TX and RX ring sharing the same IRQ, there is no such
>> simple connection between them, and hitting a doorbell
>> from RX ring napi would race with xmit ndo function of the TX ring.
>>
>> How do you synchronize in such case ?
>> isn't the existing xmit more mechanism sufficient enough ?
>
> Only if a qdisc is present and pressure is high enough.
>
> But in a forwarding setup, we likely receive at a lower rate than the
> NIC can transmit.
>

Jesper has a similar idea: make the qdisc think it is under pressure
when the device TX ring is idle most of the time. I think his idea can
come in handy here.
I am not fully involved in the details; maybe he can elaborate more.

But if it works, it will be transparent to napi, and xmit more will
happen by design.

> A simple cmpxchg could be used to synchronize the thing, if we really
> cared about doorbell cost. (Ie if the cost of this cmpxchg() is way
> smaller than doorbell one)
>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support
       [not found]             ` <CALx6S35b_MZXiGR-b1SB+VNifPHDfQNDZdz-6vk0t3bKNwen+w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-09-07 17:07               ` Saeed Mahameed via iovisor-dev
       [not found]                 ` <CALzJLG9bu3-=Ybq+Lk1fvAe5AohVHAaPpa9RQqd1QVe-7XPyhw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 72+ messages in thread
From: Saeed Mahameed via iovisor-dev @ 2016-09-07 17:07 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Linux Netdev List, iovisor-dev, Jamal Hadi Salim, Saeed Mahameed,
	Eric Dumazet, Rana Shahout, Or Gerlitz

On Wed, Sep 7, 2016 at 7:54 PM, Tom Herbert <tom-BjP2VixgY4xUbtYUoyoikg@public.gmane.org> wrote:
> On Wed, Sep 7, 2016 at 7:48 AM, Saeed Mahameed
> <saeedm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
>> On Wed, Sep 7, 2016 at 4:32 PM, Or Gerlitz <gerlitz.or-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>> On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>>>
>>>> Packet rate performance testing was done with pktgen 64B packets and on
>>>> TX side and, TC drop action on RX side compared to XDP fast drop.
>>>>
>>>> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
>>>>
>>>> Comparison is done between:
>>>>         1. Baseline, Before this patch with TC drop action
>>>>         2. This patch with TC drop action
>>>>         3. This patch with XDP RX fast drop
>>>>
>>>> Streams    Baseline(TC drop)    TC drop    XDP fast Drop
>>>> --------------------------------------------------------------
>>>> 1           5.51Mpps            5.14Mpps     13.5Mpps
>>>> 2           11.5Mpps            10.0Mpps     25.1Mpps
>>>> 4           16.3Mpps            17.2Mpps     35.4Mpps
>>>> 8           29.6Mpps            28.2Mpps     45.8Mpps*
>>>> 16          34.0Mpps            30.1Mpps     45.8Mpps*
>>>
>>> Rana, Guys, congrat!!
>>>
>>> When you say X streams, does each stream mapped by RSS to different RX ring?
>>> or we're on the same RX ring for all rows of the above table?
>>
>> Yes, I will make this more clear in the actual submission,
>> Here we are talking about different RSS core rings.
>>
>>>
>>> In the CX3 work, we had X sender "streams" that all mapped to the same RX ring,
>>> I don't think we went beyond one RX ring.
>>
>> Here we did, the first row is what you are describing the other rows
>> are the same test
>> with increasing the number of the RSS receiving cores, The xmit side is sending
>> as many streams as possible to be as much uniformly spread as possible
>> across the
>> different RSS cores on the receiver.
>>
> Hi Saeed,
>
> Please report CPU utilization also. The expectation is that
> performance should scale linearly with increasing number of CPUs (i.e.
> pps/CPU_utilization should be constant).
>

Hi Tom

That was my expectation too.

We didn't do the full analysis yet; it could be that RSS was not
spreading the workload evenly across all the cores.
Those numbers are from my humble machine with quick and dirty testing;
the idea of this submission is to let folks look at the code while we
continue testing and analyzing those patches.

Anyway we will share more accurate results when we have them, with CPU
utilization statistics as well.

Thanks,
Saeed.

> Tom
>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 01/11] net/mlx5e: Single flow order-0 pages for Striding RQ
       [not found]   ` <1473252152-11379-2-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-09-07 17:31     ` Alexei Starovoitov via iovisor-dev
       [not found]       ` <20160907173131.GA64688-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
  2016-09-07 19:18       ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 72+ messages in thread
From: Alexei Starovoitov via iovisor-dev @ 2016-09-07 17:31 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, iovisor-dev, Jamal Hadi Salim,
	Eric Dumazet, Tom Herbert

On Wed, Sep 07, 2016 at 03:42:22PM +0300, Saeed Mahameed wrote:
> From: Tariq Toukan <tariqt-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> 
> To improve the memory consumption scheme, we omit the flow that
> demands and splits high-order pages in Striding RQ, and stay
> with a single Striding RQ flow that uses order-0 pages.
> 
> Moving to fragmented memory allows the use of larger MPWQEs,
> which reduces the number of UMR posts and filler CQEs.
> 
> Moving to a single flow allows several optimizations that improve
> performance, especially in production servers where we would
> anyway fallback to order-0 allocations:
> - inline functions that were called via function pointers.
> - improve the UMR post process.
> 
> This patch alone is expected to give a slight performance reduction.
> However, the new memory scheme gives the possibility to use a page-cache
> of a fair size, that doesn't inflate the memory footprint, which will
> dramatically fix the reduction and even give a huge gain.
> 
> We ran pktgen single-stream benchmarks, with iptables-raw-drop:
> 
> Single stride, 64 bytes:
> * 4,739,057 - baseline
> * 4,749,550 - this patch
> no reduction
> 
> Larger packets, no page cross, 1024 bytes:
> * 3,982,361 - baseline
> * 3,845,682 - this patch
> 3.5% reduction
> 
> Larger packets, every 3rd packet crosses a page, 1500 bytes:
> * 3,731,189 - baseline
> * 3,579,414 - this patch
> 4% reduction

imo it's not a realistic use case, but would be good to mention that
patch 3 brings performance back for this use case anyway.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 04/11] net/mlx5e: Build RX SKB on demand
  2016-09-07 12:42 ` [PATCH RFC 04/11] net/mlx5e: Build RX SKB on demand Saeed Mahameed
@ 2016-09-07 17:34   ` Alexei Starovoitov
       [not found]     ` <20160907173449.GB64688-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
       [not found]   ` <1473252152-11379-5-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  1 sibling, 1 reply; 72+ messages in thread
From: Alexei Starovoitov @ 2016-09-07 17:34 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: iovisor-dev, netdev, Tariq Toukan, Brenden Blanco, Tom Herbert,
	Martin KaFai Lau, Jesper Dangaard Brouer, Daniel Borkmann,
	Eric Dumazet, Jamal Hadi Salim

On Wed, Sep 07, 2016 at 03:42:25PM +0300, Saeed Mahameed wrote:
> For non-striding RQ configuration before this patch we had a ring
> with pre-allocated SKBs and mapped the SKB->data buffers for
> device.
> 
> For robustness and better RX data buffers management, we allocate a
> page per packet and build_skb around it.
> 
> This patch (which is a prerequisite for XDP) will actually reduce
> performance for normal stack usage, because we are now hitting a bottleneck
> in the page allocator. A later patch of page reuse mechanism will be
> needed to restore or even improve performance in comparison to the old
> RX scheme.
> 
> Packet rate performance testing was done with pktgen 64B packets on xmit
> side and TC drop action on RX side.
> 
> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
> 
> Comparison is done between:
>  1.Baseline, before 'net/mlx5e: Build RX SKB on demand'
>  2.Build SKB with RX page cache (This patch)
> 
> Streams    Baseline    Build SKB+page-cache    Improvement
> -----------------------------------------------------------
> 1          4.33Mpps      5.51Mpps                27%
> 2          7.35Mpps      11.5Mpps                52%
> 4          14.0Mpps      16.3Mpps                16%
> 8          22.2Mpps      29.6Mpps                20%
> 16         24.8Mpps      34.0Mpps                17%

Impressive gains for build_skb. I think it should help ip forwarding too
and likely tcp_rr. tcp_stream shouldn't see any difference.
If you can benchmark that along with pktgen+tc_drop it would
help to better understand the impact of the changes.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
       [not found]             ` <CALzJLG9iVpS2qH5Ryc_DtEjrQMhcKD+qrLrGn=vet=_9N8eXPw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-09-07 18:19               ` Eric Dumazet via iovisor-dev
       [not found]                 ` <1473272346.10725.73.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
  2016-09-07 18:22               ` Jesper Dangaard Brouer via iovisor-dev
  1 sibling, 1 reply; 72+ messages in thread
From: Eric Dumazet via iovisor-dev @ 2016-09-07 18:19 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Linux Netdev List, iovisor-dev, Jamal Hadi Salim, Saeed Mahameed,
	Eric Dumazet, Tom Herbert

On Wed, 2016-09-07 at 19:57 +0300, Saeed Mahameed wrote:

> Jesper has a similar Idea to make the qdisc think it is under
> pressure, when the device
> TX ring is idle most of the time, i think his idea can come in handy here.
> I am not fully involved in the details, maybe he can elaborate more.
> 
> But if it works, it will be transparent to napi, and xmit more will
> happen by design.

I do not think qdisc is relevant here.

Right now, skb->xmit_more is set only by qdisc layer (and pktgen tool),
because only this layer can know if more packets are to come.

What I am saying is that regardless of skb->xmit_more being set or not
(for example if no qdisc is even used), a NAPI driver can arm a bit
asking for the doorbell to be rung at the end of NAPI.

I am not saying this must be done, only that the idea could be extended
to non XDP world, if we care enough.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
       [not found]             ` <CALzJLG9iVpS2qH5Ryc_DtEjrQMhcKD+qrLrGn=vet=_9N8eXPw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2016-09-07 18:19               ` Eric Dumazet via iovisor-dev
@ 2016-09-07 18:22               ` Jesper Dangaard Brouer via iovisor-dev
       [not found]                 ` <20160907202234.55e18ef3-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 72+ messages in thread
From: Jesper Dangaard Brouer via iovisor-dev @ 2016-09-07 18:22 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Eric Dumazet, Linux Netdev List, iovisor-dev, Jamal Hadi Salim,
	Saeed Mahameed, Eric Dumazet, Tom Herbert


On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed <saeedm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:  
> >> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:  
> >> > On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:  
[...]
> >
> > Only if a qdisc is present and pressure is high enough.
> >
> > But in a forwarding setup, we likely receive at a lower rate than the
> > NIC can transmit.

Yes, I can confirm this happens in my experiments.

> >  
> 
> Jesper has a similar Idea to make the qdisc think it is under
> pressure, when the device TX ring is idle most of the time, i think
> his idea can come in handy here. I am not fully involved in the
> details, maybe he can elaborate more.
> 
> But if it works, it will be transparent to napi, and xmit more will
> happen by design.

Yes. I have some ideas around getting more bulking going from the qdisc
layer, by having the drivers provide some feedback to the qdisc layer
indicating xmit_more should be possible.  This will be a topic at the
Network Performance Workshop[1] at NetDev 1.2, where I will hopefully
challenge people to come up with a good solution ;-)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

[1] http://netdevconf.org/1.2/session.html?jesper-performance-workshop

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 03/11] net/mlx5e: Implement RX mapped page cache for page recycle
       [not found]   ` <1473252152-11379-4-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-09-07 18:45     ` Jesper Dangaard Brouer via iovisor-dev
       [not found]       ` <20160907204501.08cc4ede-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 72+ messages in thread
From: Jesper Dangaard Brouer via iovisor-dev @ 2016-09-07 18:45 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, iovisor-dev, Jamal Hadi Salim,
	Eric Dumazet, Tom Herbert


On Wed,  7 Sep 2016 15:42:24 +0300 Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:

> From: Tariq Toukan <tariqt-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> 
> Instead of reallocating and mapping pages for RX data-path,
> recycle already used pages in a per ring cache.
> 
> We ran pktgen single-stream benchmarks, with iptables-raw-drop:
> 
> Single stride, 64 bytes:
> * 4,739,057 - baseline
> * 4,749,550 - order0 no cache
> * 4,786,899 - order0 with cache
> 1% gain
> 
> Larger packets, no page cross, 1024 bytes:
> * 3,982,361 - baseline
> * 3,845,682 - order0 no cache
> * 4,127,852 - order0 with cache
> 3.7% gain
> 
> Larger packets, every 3rd packet crosses a page, 1500 bytes:
> * 3,731,189 - baseline
> * 3,579,414 - order0 no cache
> * 3,931,708 - order0 with cache
> 5.4% gain
> 
> Signed-off-by: Tariq Toukan <tariqt-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en.h       | 16 ++++++
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 15 ++++++
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    | 57 ++++++++++++++++++++--
>  drivers/net/ethernet/mellanox/mlx5/core/en_stats.h | 16 ++++++
>  4 files changed, 99 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index 075cdfc..afbdf70 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> @@ -287,6 +287,18 @@ struct mlx5e_rx_am { /* Adaptive Moderation */
>  	u8					tired;
>  };
>  
> +/* a single cache unit is capable to serve one napi call (for non-striding rq)
> + * or a MPWQE (for striding rq).
> + */
> +#define MLX5E_CACHE_UNIT	(MLX5_MPWRQ_PAGES_PER_WQE > NAPI_POLL_WEIGHT ? \
> +				 MLX5_MPWRQ_PAGES_PER_WQE : NAPI_POLL_WEIGHT)
> +#define MLX5E_CACHE_SIZE	(2 * roundup_pow_of_two(MLX5E_CACHE_UNIT))
> +struct mlx5e_page_cache {
> +	u32 head;
> +	u32 tail;
> +	struct mlx5e_dma_info page_cache[MLX5E_CACHE_SIZE];
> +};
> +
[...]
>  
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index c1cb510..8e02af3 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -305,11 +305,55 @@ static inline void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
>  	mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
>  }
>  
> +static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
> +				      struct mlx5e_dma_info *dma_info)
> +{
> +	struct mlx5e_page_cache *cache = &rq->page_cache;
> +	u32 tail_next = (cache->tail + 1) & (MLX5E_CACHE_SIZE - 1);
> +
> +	if (tail_next == cache->head) {
> +		rq->stats.cache_full++;
> +		return false;
> +	}
> +
> +	cache->page_cache[cache->tail] = *dma_info;
> +	cache->tail = tail_next;
> +	return true;
> +}
> +
> +static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq,
> +				      struct mlx5e_dma_info *dma_info)
> +{
> +	struct mlx5e_page_cache *cache = &rq->page_cache;
> +
> +	if (unlikely(cache->head == cache->tail)) {
> +		rq->stats.cache_empty++;
> +		return false;
> +	}
> +
> +	if (page_ref_count(cache->page_cache[cache->head].page) != 1) {
> +		rq->stats.cache_busy++;
> +		return false;
> +	}

Hmmm... doesn't this cause "blocking" of the page_cache recycle
facility until the page at the head of the queue gets its (page)
refcount decremented?  A real use-case could fairly easily trigger
this...
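
To make the concern concrete, here is a small stand-alone simulation
of the head/tail logic above (plain user-space C, invented names, not
the driver code).  A single page whose refcount is still elevated at
the head of the ring prevents reuse of every page queued behind it:

#include <stdbool.h>
#include <stdio.h>

#define CACHE_SIZE 8	/* power of two, like MLX5E_CACHE_SIZE */

struct fake_page { int refcount; };

struct page_cache {
	unsigned int head, tail;
	struct fake_page *ring[CACHE_SIZE];
};

static bool cache_put(struct page_cache *c, struct fake_page *p)
{
	unsigned int tail_next = (c->tail + 1) & (CACHE_SIZE - 1);

	if (tail_next == c->head)
		return false;			/* cache_full */
	c->ring[c->tail] = p;
	c->tail = tail_next;
	return true;
}

static struct fake_page *cache_get(struct page_cache *c)
{
	struct fake_page *p;

	if (c->head == c->tail)
		return NULL;			/* cache_empty */
	p = c->ring[c->head];
	if (p->refcount != 1)
		return NULL;			/* cache_busy: head blocks everything behind it */
	c->head = (c->head + 1) & (CACHE_SIZE - 1);
	return p;
}

int main(void)
{
	struct page_cache c = { 0 };
	struct fake_page busy = { .refcount = 2 };	/* still held, e.g. by a socket queue */
	struct fake_page idle = { .refcount = 1 };	/* already released, ready for reuse */

	cache_put(&c, &busy);
	cache_put(&c, &idle);

	/* 'idle' is reusable, but get() keeps failing while 'busy' sits at head */
	printf("got %p (expected nil while head is busy)\n", (void *)cache_get(&c));

	busy.refcount = 1;				/* head finally released */
	printf("got %p after head was released\n", (void *)cache_get(&c));
	return 0;
}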

> +
> +	*dma_info = cache->page_cache[cache->head];
> +	cache->head = (cache->head + 1) & (MLX5E_CACHE_SIZE - 1);
> +	rq->stats.cache_reuse++;
> +
> +	dma_sync_single_for_device(rq->pdev, dma_info->addr, PAGE_SIZE,
> +				   DMA_FROM_DEVICE);
> +	return true;
> +}
> +
>  static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
>  					  struct mlx5e_dma_info *dma_info)
>  {
> -	struct page *page = dev_alloc_page();
> +	struct page *page;
> +
> +	if (mlx5e_rx_cache_get(rq, dma_info))
> +		return 0;
>  
> +	page = dev_alloc_page();
>  	if (unlikely(!page))
>  		return -ENOMEM;

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH RFC 01/11] net/mlx5e: Single flow order-0 pages for Striding RQ
  2016-09-07 12:42 ` [PATCH RFC 01/11] net/mlx5e: Single flow order-0 pages for Striding RQ Saeed Mahameed
@ 2016-09-07 19:18       ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 72+ messages in thread
From: Jesper Dangaard Brouer via iovisor-dev @ 2016-09-07 19:18 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, iovisor-dev, Jamal Hadi Salim,
	linux-mm, Eric Dumazet, Tom Herbert


On Wed,  7 Sep 2016 15:42:22 +0300 Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:

> From: Tariq Toukan <tariqt-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> 
> To improve the memory consumption scheme, we omit the flow that
> demands and splits high-order pages in Striding RQ, and stay
> with a single Striding RQ flow that uses order-0 pages.

Thank you for doing this! MM-list people thank you!

For others to understand what this means:  This driver was doing
split_page() on high-order pages (for Striding RQ).  This was really
bad because it fragments the page allocator and quickly depletes the
available high-order pages.

(I've left rest of patch intact below, if some MM people should be
interested in looking at the changes).

There is even a funny comment in split_page() relevant to this:

/* [...]
 * Note: this is probably too low level an operation for use in drivers.
 * Please consult with lkml before using this in your driver.
 */
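
For readers outside netdev, a condensed illustration of the old and
new schemes (not the driver's exact code -- the real diff is kept
below):

/* Old scheme, removed by this patch: one high-order allocation that
 * is then split into order-0 pages.  On a fragmented system the
 * high-order allocation becomes expensive or fails, and every RX ring
 * pins scarce high-order memory.
 */
struct page *page = alloc_pages_node(NUMA_NO_NODE,
				     GFP_ATOMIC | __GFP_COLD | __GFP_MEMALLOC,
				     order);		/* e.g. order 5 */
split_page(page, order);				/* 2^order order-0 pages */
/* ... dma_map the chunk, hand page[0 .. (1 << order) - 1] to the WQE ... */

/* New scheme: 2^order independent order-0 allocations; the ConnectX-4
 * UMR maps them so the device still sees one virtually contiguous
 * chunk, without requiring physically contiguous memory.
 */
for (i = 0; i < (1 << order); i++) {
	struct page *pg = dev_alloc_page();		/* order-0 only */
	if (unlikely(!pg))
		goto err_unwind;			/* cheap to recover */
	/* ... dma_map_page(pg), write its address into the MTT array ... */
}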


> Moving to fragmented memory allows the use of larger MPWQEs,
> which reduces the number of UMR posts and filler CQEs.
> 
> Moving to a single flow allows several optimizations that improve
> performance, especially in production servers where we would
> anyway fallback to order-0 allocations:
> - inline functions that were called via function pointers.
> - improve the UMR post process.
> 
> This patch alone is expected to give a slight performance reduction.
> However, the new memory scheme gives the possibility to use a page-cache
> of a fair size, that doesn't inflate the memory footprint, which will
> dramatically fix the reduction and even give a huge gain.
> 
> We ran pktgen single-stream benchmarks, with iptables-raw-drop:
> 
> Single stride, 64 bytes:
> * 4,739,057 - baseline
> * 4,749,550 - this patch
> no reduction
> 
> Larger packets, no page cross, 1024 bytes:
> * 3,982,361 - baseline
> * 3,845,682 - this patch
> 3.5% reduction
> 
> Larger packets, every 3rd packet crosses a page, 1500 bytes:
> * 3,731,189 - baseline
> * 3,579,414 - this patch
> 4% reduction
> 

Well, the reduction does not really matter that much, because your
baseline benchmarks are from a freshly booted system, where you have
not yet fragmented and depleted the high-order pages... ;-)


> Fixes: 461017cb006a ("net/mlx5e: Support RX multi-packet WQE (Striding RQ)")
> Fixes: bc77b240b3c5 ("net/mlx5e: Add fragmented memory support for RX multi packet WQE")
> Signed-off-by: Tariq Toukan <tariqt-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en.h       |  54 ++--
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 136 ++++++++--
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    | 292 ++++-----------------
>  drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   4 -
>  drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  |   2 +-
>  5 files changed, 184 insertions(+), 304 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index bf722aa..075cdfc 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> @@ -62,12 +62,12 @@
>  #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE                0xd
>  
>  #define MLX5E_PARAMS_MINIMUM_LOG_RQ_SIZE_MPW            0x1
> -#define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW            0x4
> +#define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW            0x3
>  #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW            0x6
>  
>  #define MLX5_MPWRQ_LOG_STRIDE_SIZE		6  /* >= 6, HW restriction */
>  #define MLX5_MPWRQ_LOG_STRIDE_SIZE_CQE_COMPRESS	8  /* >= 6, HW restriction */
> -#define MLX5_MPWRQ_LOG_WQE_SZ			17
> +#define MLX5_MPWRQ_LOG_WQE_SZ			18
>  #define MLX5_MPWRQ_WQE_PAGE_ORDER  (MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT > 0 ? \
>  				    MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT : 0)
>  #define MLX5_MPWRQ_PAGES_PER_WQE		BIT(MLX5_MPWRQ_WQE_PAGE_ORDER)
> @@ -293,8 +293,8 @@ struct mlx5e_rq {
>  	u32                    wqe_sz;
>  	struct sk_buff       **skb;
>  	struct mlx5e_mpw_info *wqe_info;
> +	void                  *mtt_no_align;
>  	__be32                 mkey_be;
> -	__be32                 umr_mkey_be;
>  
>  	struct device         *pdev;
>  	struct net_device     *netdev;
> @@ -323,32 +323,15 @@ struct mlx5e_rq {
>  
>  struct mlx5e_umr_dma_info {
>  	__be64                *mtt;
> -	__be64                *mtt_no_align;
>  	dma_addr_t             mtt_addr;
> -	struct mlx5e_dma_info *dma_info;
> +	struct mlx5e_dma_info  dma_info[MLX5_MPWRQ_PAGES_PER_WQE];
> +	struct mlx5e_umr_wqe   wqe;
>  };
>  
>  struct mlx5e_mpw_info {
> -	union {
> -		struct mlx5e_dma_info     dma_info;
> -		struct mlx5e_umr_dma_info umr;
> -	};
> +	struct mlx5e_umr_dma_info umr;
>  	u16 consumed_strides;
>  	u16 skbs_frags[MLX5_MPWRQ_PAGES_PER_WQE];
> -
> -	void (*dma_pre_sync)(struct device *pdev,
> -			     struct mlx5e_mpw_info *wi,
> -			     u32 wqe_offset, u32 len);
> -	void (*add_skb_frag)(struct mlx5e_rq *rq,
> -			     struct sk_buff *skb,
> -			     struct mlx5e_mpw_info *wi,
> -			     u32 page_idx, u32 frag_offset, u32 len);
> -	void (*copy_skb_header)(struct device *pdev,
> -				struct sk_buff *skb,
> -				struct mlx5e_mpw_info *wi,
> -				u32 page_idx, u32 offset,
> -				u32 headlen);
> -	void (*free_wqe)(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi);
>  };
>  
>  struct mlx5e_tx_wqe_info {
> @@ -706,24 +689,11 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
>  void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
>  bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq);
>  int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
> -int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
> +int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe,	u16 ix);
>  void mlx5e_dealloc_rx_wqe(struct mlx5e_rq *rq, u16 ix);
>  void mlx5e_dealloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix);
> -void mlx5e_post_rx_fragmented_mpwqe(struct mlx5e_rq *rq);
> -void mlx5e_complete_rx_linear_mpwqe(struct mlx5e_rq *rq,
> -				    struct mlx5_cqe64 *cqe,
> -				    u16 byte_cnt,
> -				    struct mlx5e_mpw_info *wi,
> -				    struct sk_buff *skb);
> -void mlx5e_complete_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
> -					struct mlx5_cqe64 *cqe,
> -					u16 byte_cnt,
> -					struct mlx5e_mpw_info *wi,
> -					struct sk_buff *skb);
> -void mlx5e_free_rx_linear_mpwqe(struct mlx5e_rq *rq,
> -				struct mlx5e_mpw_info *wi);
> -void mlx5e_free_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
> -				    struct mlx5e_mpw_info *wi);
> +void mlx5e_post_rx_mpwqe(struct mlx5e_rq *rq);
> +void mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi);
>  struct mlx5_cqe64 *mlx5e_get_cqe(struct mlx5e_cq *cq);
>  
>  void mlx5e_rx_am(struct mlx5e_rq *rq);
> @@ -810,6 +780,12 @@ static inline void mlx5e_cq_arm(struct mlx5e_cq *cq)
>  	mlx5_cq_arm(mcq, MLX5_CQ_DB_REQ_NOT, mcq->uar->map, NULL, cq->wq.cc);
>  }
>  
> +static inline u32 mlx5e_get_wqe_mtt_offset(struct mlx5e_rq *rq, u16 wqe_ix)
> +{
> +	return rq->mpwqe_mtt_offset +
> +		wqe_ix * ALIGN(MLX5_MPWRQ_PAGES_PER_WQE, 8);
> +}
> +
>  static inline int mlx5e_get_max_num_channels(struct mlx5_core_dev *mdev)
>  {
>  	return min_t(int, mdev->priv.eq_table.num_comp_vectors,
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index 2459c7f..0db4d3b 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -138,7 +138,6 @@ static void mlx5e_update_sw_counters(struct mlx5e_priv *priv)
>  		s->rx_csum_unnecessary_inner += rq_stats->csum_unnecessary_inner;
>  		s->rx_wqe_err   += rq_stats->wqe_err;
>  		s->rx_mpwqe_filler += rq_stats->mpwqe_filler;
> -		s->rx_mpwqe_frag   += rq_stats->mpwqe_frag;
>  		s->rx_buff_alloc_err += rq_stats->buff_alloc_err;
>  		s->rx_cqe_compress_blks += rq_stats->cqe_compress_blks;
>  		s->rx_cqe_compress_pkts += rq_stats->cqe_compress_pkts;
> @@ -298,6 +297,107 @@ static void mlx5e_disable_async_events(struct mlx5e_priv *priv)
>  #define MLX5E_HW2SW_MTU(hwmtu) (hwmtu - (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN))
>  #define MLX5E_SW2HW_MTU(swmtu) (swmtu + (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN))
>  
> +static inline int mlx5e_get_wqe_mtt_sz(void)
> +{
> +	/* UMR copies MTTs in units of MLX5_UMR_MTT_ALIGNMENT bytes.
> +	 * To avoid copying garbage after the mtt array, we allocate
> +	 * a little more.
> +	 */
> +	return ALIGN(MLX5_MPWRQ_PAGES_PER_WQE * sizeof(__be64),
> +		     MLX5_UMR_MTT_ALIGNMENT);
> +}
> +
> +static inline void mlx5e_build_umr_wqe(struct mlx5e_rq *rq, struct mlx5e_sq *sq,
> +				       struct mlx5e_umr_wqe *wqe, u16 ix)
> +{
> +	struct mlx5_wqe_ctrl_seg      *cseg = &wqe->ctrl;
> +	struct mlx5_wqe_umr_ctrl_seg *ucseg = &wqe->uctrl;
> +	struct mlx5_wqe_data_seg      *dseg = &wqe->data;
> +	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
> +	u8 ds_cnt = DIV_ROUND_UP(sizeof(*wqe), MLX5_SEND_WQE_DS);
> +	u32 umr_wqe_mtt_offset = mlx5e_get_wqe_mtt_offset(rq, ix);
> +
> +	cseg->qpn_ds    = cpu_to_be32((sq->sqn << MLX5_WQE_CTRL_QPN_SHIFT) |
> +				      ds_cnt);
> +	cseg->fm_ce_se  = MLX5_WQE_CTRL_CQ_UPDATE;
> +	cseg->imm       = rq->mkey_be;
> +
> +	ucseg->flags = MLX5_UMR_TRANSLATION_OFFSET_EN;
> +	ucseg->klm_octowords =
> +		cpu_to_be16(MLX5_MTT_OCTW(MLX5_MPWRQ_PAGES_PER_WQE));
> +	ucseg->bsf_octowords =
> +		cpu_to_be16(MLX5_MTT_OCTW(umr_wqe_mtt_offset));
> +	ucseg->mkey_mask     = cpu_to_be64(MLX5_MKEY_MASK_FREE);
> +
> +	dseg->lkey = sq->mkey_be;
> +	dseg->addr = cpu_to_be64(wi->umr.mtt_addr);
> +}
> +
> +static int mlx5e_rq_alloc_mpwqe_info(struct mlx5e_rq *rq,
> +				     struct mlx5e_channel *c)
> +{
> +	int wq_sz = mlx5_wq_ll_get_size(&rq->wq);
> +	int mtt_sz = mlx5e_get_wqe_mtt_sz();
> +	int mtt_alloc = mtt_sz + MLX5_UMR_ALIGN - 1;
> +	int i;
> +
> +	rq->wqe_info = kzalloc_node(wq_sz * sizeof(*rq->wqe_info),
> +				    GFP_KERNEL, cpu_to_node(c->cpu));
> +	if (!rq->wqe_info)
> +		goto err_out;
> +
> +	/* We allocate more than mtt_sz as we will align the pointer */
> +	rq->mtt_no_align = kzalloc_node(mtt_alloc * wq_sz, GFP_KERNEL,
> +					cpu_to_node(c->cpu));
> +	if (unlikely(!rq->mtt_no_align))
> +		goto err_free_wqe_info;
> +
> +	for (i = 0; i < wq_sz; i++) {
> +		struct mlx5e_mpw_info *wi = &rq->wqe_info[i];
> +
> +		wi->umr.mtt = PTR_ALIGN(rq->mtt_no_align + i * mtt_alloc,
> +					MLX5_UMR_ALIGN);
> +		wi->umr.mtt_addr = dma_map_single(c->pdev, wi->umr.mtt, mtt_sz,
> +						  PCI_DMA_TODEVICE);
> +		if (unlikely(dma_mapping_error(c->pdev, wi->umr.mtt_addr)))
> +			goto err_unmap_mtts;
> +
> +		mlx5e_build_umr_wqe(rq, &c->icosq, &wi->umr.wqe, i);
> +	}
> +
> +	return 0;
> +
> +err_unmap_mtts:
> +	while (--i >= 0) {
> +		struct mlx5e_mpw_info *wi = &rq->wqe_info[i];
> +
> +		dma_unmap_single(c->pdev, wi->umr.mtt_addr, mtt_sz,
> +				 PCI_DMA_TODEVICE);
> +	}
> +	kfree(rq->mtt_no_align);
> +err_free_wqe_info:
> +	kfree(rq->wqe_info);
> +
> +err_out:
> +	return -ENOMEM;
> +}
> +
> +static void mlx5e_rq_free_mpwqe_info(struct mlx5e_rq *rq)
> +{
> +	int wq_sz = mlx5_wq_ll_get_size(&rq->wq);
> +	int mtt_sz = mlx5e_get_wqe_mtt_sz();
> +	int i;
> +
> +	for (i = 0; i < wq_sz; i++) {
> +		struct mlx5e_mpw_info *wi = &rq->wqe_info[i];
> +
> +		dma_unmap_single(rq->pdev, wi->umr.mtt_addr, mtt_sz,
> +				 PCI_DMA_TODEVICE);
> +	}
> +	kfree(rq->mtt_no_align);
> +	kfree(rq->wqe_info);
> +}
> +
>  static int mlx5e_create_rq(struct mlx5e_channel *c,
>  			   struct mlx5e_rq_param *param,
>  			   struct mlx5e_rq *rq)
> @@ -322,14 +422,16 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
>  
>  	wq_sz = mlx5_wq_ll_get_size(&rq->wq);
>  
> +	rq->wq_type = priv->params.rq_wq_type;
> +	rq->pdev    = c->pdev;
> +	rq->netdev  = c->netdev;
> +	rq->tstamp  = &priv->tstamp;
> +	rq->channel = c;
> +	rq->ix      = c->ix;
> +	rq->priv    = c->priv;
> +
>  	switch (priv->params.rq_wq_type) {
>  	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
> -		rq->wqe_info = kzalloc_node(wq_sz * sizeof(*rq->wqe_info),
> -					    GFP_KERNEL, cpu_to_node(c->cpu));
> -		if (!rq->wqe_info) {
> -			err = -ENOMEM;
> -			goto err_rq_wq_destroy;
> -		}
>  		rq->handle_rx_cqe = mlx5e_handle_rx_cqe_mpwrq;
>  		rq->alloc_wqe = mlx5e_alloc_rx_mpwqe;
>  		rq->dealloc_wqe = mlx5e_dealloc_rx_mpwqe;
> @@ -341,6 +443,10 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
>  		rq->mpwqe_num_strides = BIT(priv->params.mpwqe_log_num_strides);
>  		rq->wqe_sz = rq->mpwqe_stride_sz * rq->mpwqe_num_strides;
>  		byte_count = rq->wqe_sz;
> +		rq->mkey_be = cpu_to_be32(c->priv->umr_mkey.key);
> +		err = mlx5e_rq_alloc_mpwqe_info(rq, c);
> +		if (err)
> +			goto err_rq_wq_destroy;
>  		break;
>  	default: /* MLX5_WQ_TYPE_LINKED_LIST */
>  		rq->skb = kzalloc_node(wq_sz * sizeof(*rq->skb), GFP_KERNEL,
> @@ -359,27 +465,19 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
>  		rq->wqe_sz = SKB_DATA_ALIGN(rq->wqe_sz);
>  		byte_count = rq->wqe_sz;
>  		byte_count |= MLX5_HW_START_PADDING;
> +		rq->mkey_be = c->mkey_be;
>  	}
>  
>  	for (i = 0; i < wq_sz; i++) {
>  		struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(&rq->wq, i);
>  
>  		wqe->data.byte_count = cpu_to_be32(byte_count);
> +		wqe->data.lkey = rq->mkey_be;
>  	}
>  
>  	INIT_WORK(&rq->am.work, mlx5e_rx_am_work);
>  	rq->am.mode = priv->params.rx_cq_period_mode;
>  
> -	rq->wq_type = priv->params.rq_wq_type;
> -	rq->pdev    = c->pdev;
> -	rq->netdev  = c->netdev;
> -	rq->tstamp  = &priv->tstamp;
> -	rq->channel = c;
> -	rq->ix      = c->ix;
> -	rq->priv    = c->priv;
> -	rq->mkey_be = c->mkey_be;
> -	rq->umr_mkey_be = cpu_to_be32(c->priv->umr_mkey.key);
> -
>  	return 0;
>  
>  err_rq_wq_destroy:
> @@ -392,7 +490,7 @@ static void mlx5e_destroy_rq(struct mlx5e_rq *rq)
>  {
>  	switch (rq->wq_type) {
>  	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
> -		kfree(rq->wqe_info);
> +		mlx5e_rq_free_mpwqe_info(rq);
>  		break;
>  	default: /* MLX5_WQ_TYPE_LINKED_LIST */
>  		kfree(rq->skb);
> @@ -530,7 +628,7 @@ static void mlx5e_free_rx_descs(struct mlx5e_rq *rq)
>  
>  	/* UMR WQE (if in progress) is always at wq->head */
>  	if (test_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state))
> -		mlx5e_free_rx_fragmented_mpwqe(rq, &rq->wqe_info[wq->head]);
> +		mlx5e_free_rx_mpwqe(rq, &rq->wqe_info[wq->head]);
>  
>  	while (!mlx5_wq_ll_is_empty(wq)) {
>  		wqe_ix_be = *wq->tail_next;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index b6f8ebb..8ad4d32 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -200,7 +200,6 @@ int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
>  
>  	*((dma_addr_t *)skb->cb) = dma_addr;
>  	wqe->data.addr = cpu_to_be64(dma_addr);
> -	wqe->data.lkey = rq->mkey_be;
>  
>  	rq->skb[ix] = skb;
>  
> @@ -231,44 +230,11 @@ static inline int mlx5e_mpwqe_strides_per_page(struct mlx5e_rq *rq)
>  	return rq->mpwqe_num_strides >> MLX5_MPWRQ_WQE_PAGE_ORDER;
>  }
>  
> -static inline void
> -mlx5e_dma_pre_sync_linear_mpwqe(struct device *pdev,
> -				struct mlx5e_mpw_info *wi,
> -				u32 wqe_offset, u32 len)
> -{
> -	dma_sync_single_for_cpu(pdev, wi->dma_info.addr + wqe_offset,
> -				len, DMA_FROM_DEVICE);
> -}
> -
> -static inline void
> -mlx5e_dma_pre_sync_fragmented_mpwqe(struct device *pdev,
> -				    struct mlx5e_mpw_info *wi,
> -				    u32 wqe_offset, u32 len)
> -{
> -	/* No dma pre sync for fragmented MPWQE */
> -}
> -
> -static inline void
> -mlx5e_add_skb_frag_linear_mpwqe(struct mlx5e_rq *rq,
> -				struct sk_buff *skb,
> -				struct mlx5e_mpw_info *wi,
> -				u32 page_idx, u32 frag_offset,
> -				u32 len)
> -{
> -	unsigned int truesize =	ALIGN(len, rq->mpwqe_stride_sz);
> -
> -	wi->skbs_frags[page_idx]++;
> -	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
> -			&wi->dma_info.page[page_idx], frag_offset,
> -			len, truesize);
> -}
> -
> -static inline void
> -mlx5e_add_skb_frag_fragmented_mpwqe(struct mlx5e_rq *rq,
> -				    struct sk_buff *skb,
> -				    struct mlx5e_mpw_info *wi,
> -				    u32 page_idx, u32 frag_offset,
> -				    u32 len)
> +static inline void mlx5e_add_skb_frag_mpwqe(struct mlx5e_rq *rq,
> +					    struct sk_buff *skb,
> +					    struct mlx5e_mpw_info *wi,
> +					    u32 page_idx, u32 frag_offset,
> +					    u32 len)
>  {
>  	unsigned int truesize =	ALIGN(len, rq->mpwqe_stride_sz);
>  
> @@ -282,24 +248,11 @@ mlx5e_add_skb_frag_fragmented_mpwqe(struct mlx5e_rq *rq,
>  }
>  
>  static inline void
> -mlx5e_copy_skb_header_linear_mpwqe(struct device *pdev,
> -				   struct sk_buff *skb,
> -				   struct mlx5e_mpw_info *wi,
> -				   u32 page_idx, u32 offset,
> -				   u32 headlen)
> -{
> -	struct page *page = &wi->dma_info.page[page_idx];
> -
> -	skb_copy_to_linear_data(skb, page_address(page) + offset,
> -				ALIGN(headlen, sizeof(long)));
> -}
> -
> -static inline void
> -mlx5e_copy_skb_header_fragmented_mpwqe(struct device *pdev,
> -				       struct sk_buff *skb,
> -				       struct mlx5e_mpw_info *wi,
> -				       u32 page_idx, u32 offset,
> -				       u32 headlen)
> +mlx5e_copy_skb_header_mpwqe(struct device *pdev,
> +			    struct sk_buff *skb,
> +			    struct mlx5e_mpw_info *wi,
> +			    u32 page_idx, u32 offset,
> +			    u32 headlen)
>  {
>  	u16 headlen_pg = min_t(u32, headlen, PAGE_SIZE - offset);
>  	struct mlx5e_dma_info *dma_info = &wi->umr.dma_info[page_idx];
> @@ -324,46 +277,9 @@ mlx5e_copy_skb_header_fragmented_mpwqe(struct device *pdev,
>  	}
>  }
>  
> -static u32 mlx5e_get_wqe_mtt_offset(struct mlx5e_rq *rq, u16 wqe_ix)
> -{
> -	return rq->mpwqe_mtt_offset +
> -		wqe_ix * ALIGN(MLX5_MPWRQ_PAGES_PER_WQE, 8);
> -}
> -
> -static void mlx5e_build_umr_wqe(struct mlx5e_rq *rq,
> -				struct mlx5e_sq *sq,
> -				struct mlx5e_umr_wqe *wqe,
> -				u16 ix)
> +static inline void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
>  {
> -	struct mlx5_wqe_ctrl_seg      *cseg = &wqe->ctrl;
> -	struct mlx5_wqe_umr_ctrl_seg *ucseg = &wqe->uctrl;
> -	struct mlx5_wqe_data_seg      *dseg = &wqe->data;
>  	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
> -	u8 ds_cnt = DIV_ROUND_UP(sizeof(*wqe), MLX5_SEND_WQE_DS);
> -	u32 umr_wqe_mtt_offset = mlx5e_get_wqe_mtt_offset(rq, ix);
> -
> -	memset(wqe, 0, sizeof(*wqe));
> -	cseg->opmod_idx_opcode =
> -		cpu_to_be32((sq->pc << MLX5_WQE_CTRL_WQE_INDEX_SHIFT) |
> -			    MLX5_OPCODE_UMR);
> -	cseg->qpn_ds    = cpu_to_be32((sq->sqn << MLX5_WQE_CTRL_QPN_SHIFT) |
> -				      ds_cnt);
> -	cseg->fm_ce_se  = MLX5_WQE_CTRL_CQ_UPDATE;
> -	cseg->imm       = rq->umr_mkey_be;
> -
> -	ucseg->flags = MLX5_UMR_TRANSLATION_OFFSET_EN;
> -	ucseg->klm_octowords =
> -		cpu_to_be16(MLX5_MTT_OCTW(MLX5_MPWRQ_PAGES_PER_WQE));
> -	ucseg->bsf_octowords =
> -		cpu_to_be16(MLX5_MTT_OCTW(umr_wqe_mtt_offset));
> -	ucseg->mkey_mask     = cpu_to_be64(MLX5_MKEY_MASK_FREE);
> -
> -	dseg->lkey = sq->mkey_be;
> -	dseg->addr = cpu_to_be64(wi->umr.mtt_addr);
> -}
> -
> -static void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
> -{
>  	struct mlx5e_sq *sq = &rq->channel->icosq;
>  	struct mlx5_wq_cyc *wq = &sq->wq;
>  	struct mlx5e_umr_wqe *wqe;
> @@ -378,30 +294,22 @@ static void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
>  	}
>  
>  	wqe = mlx5_wq_cyc_get_wqe(wq, pi);
> -	mlx5e_build_umr_wqe(rq, sq, wqe, ix);
> +	memcpy(wqe, &wi->umr.wqe, sizeof(*wqe));
> +	wqe->ctrl.opmod_idx_opcode =
> +		cpu_to_be32((sq->pc << MLX5_WQE_CTRL_WQE_INDEX_SHIFT) |
> +			    MLX5_OPCODE_UMR);
> +
>  	sq->ico_wqe_info[pi].opcode = MLX5_OPCODE_UMR;
>  	sq->ico_wqe_info[pi].num_wqebbs = num_wqebbs;
>  	sq->pc += num_wqebbs;
>  	mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
>  }
>  
> -static inline int mlx5e_get_wqe_mtt_sz(void)
> -{
> -	/* UMR copies MTTs in units of MLX5_UMR_MTT_ALIGNMENT bytes.
> -	 * To avoid copying garbage after the mtt array, we allocate
> -	 * a little more.
> -	 */
> -	return ALIGN(MLX5_MPWRQ_PAGES_PER_WQE * sizeof(__be64),
> -		     MLX5_UMR_MTT_ALIGNMENT);
> -}
> -
> -static int mlx5e_alloc_and_map_page(struct mlx5e_rq *rq,
> -				    struct mlx5e_mpw_info *wi,
> -				    int i)
> +static inline int mlx5e_alloc_and_map_page(struct mlx5e_rq *rq,
> +					   struct mlx5e_mpw_info *wi,
> +					   int i)
>  {
> -	struct page *page;
> -
> -	page = dev_alloc_page();
> +	struct page *page = dev_alloc_page();
>  	if (unlikely(!page))
>  		return -ENOMEM;
>  
> @@ -417,47 +325,25 @@ static int mlx5e_alloc_and_map_page(struct mlx5e_rq *rq,
>  	return 0;
>  }
>  
> -static int mlx5e_alloc_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
> -					   struct mlx5e_rx_wqe *wqe,
> -					   u16 ix)
> +static int mlx5e_alloc_rx_umr_mpwqe(struct mlx5e_rq *rq,
> +				    struct mlx5e_rx_wqe *wqe,
> +				    u16 ix)
>  {
>  	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
> -	int mtt_sz = mlx5e_get_wqe_mtt_sz();
>  	u64 dma_offset = (u64)mlx5e_get_wqe_mtt_offset(rq, ix) << PAGE_SHIFT;
> +	int pg_strides = mlx5e_mpwqe_strides_per_page(rq);
> +	int err;
>  	int i;
>  
> -	wi->umr.dma_info = kmalloc(sizeof(*wi->umr.dma_info) *
> -				   MLX5_MPWRQ_PAGES_PER_WQE,
> -				   GFP_ATOMIC);
> -	if (unlikely(!wi->umr.dma_info))
> -		goto err_out;
> -
> -	/* We allocate more than mtt_sz as we will align the pointer */
> -	wi->umr.mtt_no_align = kzalloc(mtt_sz + MLX5_UMR_ALIGN - 1,
> -				       GFP_ATOMIC);
> -	if (unlikely(!wi->umr.mtt_no_align))
> -		goto err_free_umr;
> -
> -	wi->umr.mtt = PTR_ALIGN(wi->umr.mtt_no_align, MLX5_UMR_ALIGN);
> -	wi->umr.mtt_addr = dma_map_single(rq->pdev, wi->umr.mtt, mtt_sz,
> -					  PCI_DMA_TODEVICE);
> -	if (unlikely(dma_mapping_error(rq->pdev, wi->umr.mtt_addr)))
> -		goto err_free_mtt;
> -
>  	for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
> -		if (unlikely(mlx5e_alloc_and_map_page(rq, wi, i)))
> +		err = mlx5e_alloc_and_map_page(rq, wi, i);
> +		if (unlikely(err))
>  			goto err_unmap;
> -		page_ref_add(wi->umr.dma_info[i].page,
> -			     mlx5e_mpwqe_strides_per_page(rq));
> +		page_ref_add(wi->umr.dma_info[i].page, pg_strides);
>  		wi->skbs_frags[i] = 0;
>  	}
>  
>  	wi->consumed_strides = 0;
> -	wi->dma_pre_sync = mlx5e_dma_pre_sync_fragmented_mpwqe;
> -	wi->add_skb_frag = mlx5e_add_skb_frag_fragmented_mpwqe;
> -	wi->copy_skb_header = mlx5e_copy_skb_header_fragmented_mpwqe;
> -	wi->free_wqe     = mlx5e_free_rx_fragmented_mpwqe;
> -	wqe->data.lkey = rq->umr_mkey_be;
>  	wqe->data.addr = cpu_to_be64(dma_offset);
>  
>  	return 0;
> @@ -466,41 +352,28 @@ err_unmap:
>  	while (--i >= 0) {
>  		dma_unmap_page(rq->pdev, wi->umr.dma_info[i].addr, PAGE_SIZE,
>  			       PCI_DMA_FROMDEVICE);
> -		page_ref_sub(wi->umr.dma_info[i].page,
> -			     mlx5e_mpwqe_strides_per_page(rq));
> +		page_ref_sub(wi->umr.dma_info[i].page, pg_strides);
>  		put_page(wi->umr.dma_info[i].page);
>  	}
> -	dma_unmap_single(rq->pdev, wi->umr.mtt_addr, mtt_sz, PCI_DMA_TODEVICE);
> -
> -err_free_mtt:
> -	kfree(wi->umr.mtt_no_align);
> -
> -err_free_umr:
> -	kfree(wi->umr.dma_info);
>  
> -err_out:
> -	return -ENOMEM;
> +	return err;
>  }
>  
> -void mlx5e_free_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
> -				    struct mlx5e_mpw_info *wi)
> +void mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi)
>  {
> -	int mtt_sz = mlx5e_get_wqe_mtt_sz();
> +	int pg_strides = mlx5e_mpwqe_strides_per_page(rq);
>  	int i;
>  
>  	for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
>  		dma_unmap_page(rq->pdev, wi->umr.dma_info[i].addr, PAGE_SIZE,
>  			       PCI_DMA_FROMDEVICE);
>  		page_ref_sub(wi->umr.dma_info[i].page,
> -			mlx5e_mpwqe_strides_per_page(rq) - wi->skbs_frags[i]);
> +			     pg_strides - wi->skbs_frags[i]);
>  		put_page(wi->umr.dma_info[i].page);
>  	}
> -	dma_unmap_single(rq->pdev, wi->umr.mtt_addr, mtt_sz, PCI_DMA_TODEVICE);
> -	kfree(wi->umr.mtt_no_align);
> -	kfree(wi->umr.dma_info);
>  }
>  
> -void mlx5e_post_rx_fragmented_mpwqe(struct mlx5e_rq *rq)
> +void mlx5e_post_rx_mpwqe(struct mlx5e_rq *rq)
>  {
>  	struct mlx5_wq_ll *wq = &rq->wq;
>  	struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(wq, wq->head);
> @@ -508,12 +381,11 @@ void mlx5e_post_rx_fragmented_mpwqe(struct mlx5e_rq *rq)
>  	clear_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state);
>  
>  	if (unlikely(test_bit(MLX5E_RQ_STATE_FLUSH, &rq->state))) {
> -		mlx5e_free_rx_fragmented_mpwqe(rq, &rq->wqe_info[wq->head]);
> +		mlx5e_free_rx_mpwqe(rq, &rq->wqe_info[wq->head]);
>  		return;
>  	}
>  
>  	mlx5_wq_ll_push(wq, be16_to_cpu(wqe->next.next_wqe_index));
> -	rq->stats.mpwqe_frag++;
>  
>  	/* ensure wqes are visible to device before updating doorbell record */
>  	dma_wmb();
> @@ -521,84 +393,23 @@ void mlx5e_post_rx_fragmented_mpwqe(struct mlx5e_rq *rq)
>  	mlx5_wq_ll_update_db_record(wq);
>  }
>  
> -static int mlx5e_alloc_rx_linear_mpwqe(struct mlx5e_rq *rq,
> -				       struct mlx5e_rx_wqe *wqe,
> -				       u16 ix)
> -{
> -	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
> -	gfp_t gfp_mask;
> -	int i;
> -
> -	gfp_mask = GFP_ATOMIC | __GFP_COLD | __GFP_MEMALLOC;
> -	wi->dma_info.page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
> -					     MLX5_MPWRQ_WQE_PAGE_ORDER);
> -	if (unlikely(!wi->dma_info.page))
> -		return -ENOMEM;
> -
> -	wi->dma_info.addr = dma_map_page(rq->pdev, wi->dma_info.page, 0,
> -					 rq->wqe_sz, PCI_DMA_FROMDEVICE);
> -	if (unlikely(dma_mapping_error(rq->pdev, wi->dma_info.addr))) {
> -		put_page(wi->dma_info.page);
> -		return -ENOMEM;
> -	}
> -
> -	/* We split the high-order page into order-0 ones and manage their
> -	 * reference counter to minimize the memory held by small skb fragments
> -	 */
> -	split_page(wi->dma_info.page, MLX5_MPWRQ_WQE_PAGE_ORDER);
> -	for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
> -		page_ref_add(&wi->dma_info.page[i],
> -			     mlx5e_mpwqe_strides_per_page(rq));
> -		wi->skbs_frags[i] = 0;
> -	}
> -
> -	wi->consumed_strides = 0;
> -	wi->dma_pre_sync = mlx5e_dma_pre_sync_linear_mpwqe;
> -	wi->add_skb_frag = mlx5e_add_skb_frag_linear_mpwqe;
> -	wi->copy_skb_header = mlx5e_copy_skb_header_linear_mpwqe;
> -	wi->free_wqe     = mlx5e_free_rx_linear_mpwqe;
> -	wqe->data.lkey = rq->mkey_be;
> -	wqe->data.addr = cpu_to_be64(wi->dma_info.addr);
> -
> -	return 0;
> -}
> -
> -void mlx5e_free_rx_linear_mpwqe(struct mlx5e_rq *rq,
> -				struct mlx5e_mpw_info *wi)
> -{
> -	int i;
> -
> -	dma_unmap_page(rq->pdev, wi->dma_info.addr, rq->wqe_sz,
> -		       PCI_DMA_FROMDEVICE);
> -	for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
> -		page_ref_sub(&wi->dma_info.page[i],
> -			mlx5e_mpwqe_strides_per_page(rq) - wi->skbs_frags[i]);
> -		put_page(&wi->dma_info.page[i]);
> -	}
> -}
> -
> -int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
> +int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe,	u16 ix)
>  {
>  	int err;
>  
> -	err = mlx5e_alloc_rx_linear_mpwqe(rq, wqe, ix);
> -	if (unlikely(err)) {
> -		err = mlx5e_alloc_rx_fragmented_mpwqe(rq, wqe, ix);
> -		if (unlikely(err))
> -			return err;
> -		set_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state);
> -		mlx5e_post_umr_wqe(rq, ix);
> -		return -EBUSY;
> -	}
> -
> -	return 0;
> +	err = mlx5e_alloc_rx_umr_mpwqe(rq, wqe, ix);
> +	if (unlikely(err))
> +		return err;
> +	set_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state);
> +	mlx5e_post_umr_wqe(rq, ix);
> +	return -EBUSY;
>  }
>  
>  void mlx5e_dealloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>  {
>  	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
>  
> -	wi->free_wqe(rq, wi);
> +	mlx5e_free_rx_mpwqe(rq, wi);
>  }
>  
>  #define RQ_CANNOT_POST(rq) \
> @@ -617,9 +428,10 @@ bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq)
>  		int err;
>  
>  		err = rq->alloc_wqe(rq, wqe, wq->head);
> +		if (err == -EBUSY)
> +			return true;
>  		if (unlikely(err)) {
> -			if (err != -EBUSY)
> -				rq->stats.buff_alloc_err++;
> +			rq->stats.buff_alloc_err++;
>  			break;
>  		}
>  
> @@ -823,7 +635,6 @@ static inline void mlx5e_mpwqe_fill_rx_skb(struct mlx5e_rq *rq,
>  					   u32 cqe_bcnt,
>  					   struct sk_buff *skb)
>  {
> -	u32 consumed_bytes = ALIGN(cqe_bcnt, rq->mpwqe_stride_sz);
>  	u16 stride_ix      = mpwrq_get_cqe_stride_index(cqe);
>  	u32 wqe_offset     = stride_ix * rq->mpwqe_stride_sz;
>  	u32 head_offset    = wqe_offset & (PAGE_SIZE - 1);
> @@ -837,21 +648,20 @@ static inline void mlx5e_mpwqe_fill_rx_skb(struct mlx5e_rq *rq,
>  		page_idx++;
>  		frag_offset -= PAGE_SIZE;
>  	}
> -	wi->dma_pre_sync(rq->pdev, wi, wqe_offset, consumed_bytes);
>  
>  	while (byte_cnt) {
>  		u32 pg_consumed_bytes =
>  			min_t(u32, PAGE_SIZE - frag_offset, byte_cnt);
>  
> -		wi->add_skb_frag(rq, skb, wi, page_idx, frag_offset,
> -				 pg_consumed_bytes);
> +		mlx5e_add_skb_frag_mpwqe(rq, skb, wi, page_idx, frag_offset,
> +					 pg_consumed_bytes);
>  		byte_cnt -= pg_consumed_bytes;
>  		frag_offset = 0;
>  		page_idx++;
>  	}
>  	/* copy header */
> -	wi->copy_skb_header(rq->pdev, skb, wi, head_page_idx, head_offset,
> -			    headlen);
> +	mlx5e_copy_skb_header_mpwqe(rq->pdev, skb, wi, head_page_idx,
> +				    head_offset, headlen);
>  	/* skb linear part was allocated with headlen and aligned to long */
>  	skb->tail += headlen;
>  	skb->len  += headlen;
> @@ -896,7 +706,7 @@ mpwrq_cqe_out:
>  	if (likely(wi->consumed_strides < rq->mpwqe_num_strides))
>  		return;
>  
> -	wi->free_wqe(rq, wi);
> +	mlx5e_free_rx_mpwqe(rq, wi);
>  	mlx5_wq_ll_pop(&rq->wq, cqe->wqe_id, &wqe->next.next_wqe_index);
>  }
>  
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
> index 499487c..1f56543 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
> @@ -73,7 +73,6 @@ struct mlx5e_sw_stats {
>  	u64 tx_xmit_more;
>  	u64 rx_wqe_err;
>  	u64 rx_mpwqe_filler;
> -	u64 rx_mpwqe_frag;
>  	u64 rx_buff_alloc_err;
>  	u64 rx_cqe_compress_blks;
>  	u64 rx_cqe_compress_pkts;
> @@ -105,7 +104,6 @@ static const struct counter_desc sw_stats_desc[] = {
>  	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_xmit_more) },
>  	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_wqe_err) },
>  	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_mpwqe_filler) },
> -	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_mpwqe_frag) },
>  	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_buff_alloc_err) },
>  	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_cqe_compress_blks) },
>  	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_cqe_compress_pkts) },
> @@ -274,7 +272,6 @@ struct mlx5e_rq_stats {
>  	u64 lro_bytes;
>  	u64 wqe_err;
>  	u64 mpwqe_filler;
> -	u64 mpwqe_frag;
>  	u64 buff_alloc_err;
>  	u64 cqe_compress_blks;
>  	u64 cqe_compress_pkts;
> @@ -290,7 +287,6 @@ static const struct counter_desc rq_stats_desc[] = {
>  	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, lro_bytes) },
>  	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, wqe_err) },
>  	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, mpwqe_filler) },
> -	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, mpwqe_frag) },
>  	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, buff_alloc_err) },
>  	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, cqe_compress_blks) },
>  	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, cqe_compress_pkts) },
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
> index 9bf33bb..08d8b0c 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
> @@ -87,7 +87,7 @@ static void mlx5e_poll_ico_cq(struct mlx5e_cq *cq)
>  		case MLX5_OPCODE_NOP:
>  			break;
>  		case MLX5_OPCODE_UMR:
> -			mlx5e_post_rx_fragmented_mpwqe(&sq->channel->rq);
> +			mlx5e_post_rx_mpwqe(&sq->channel->rq);
>  			break;
>  		default:
>  			WARN_ONCE(true,



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH RFC 01/11] net/mlx5e: Single flow order-0 pages for Striding RQ
@ 2016-09-07 19:18       ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 72+ messages in thread
From: Jesper Dangaard Brouer @ 2016-09-07 19:18 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: iovisor-dev, netdev, Tariq Toukan, Brenden Blanco,
	Alexei Starovoitov, Tom Herbert, Martin KaFai Lau,
	Daniel Borkmann, Eric Dumazet, Jamal Hadi Salim, brouer,
	linux-mm


On Wed,  7 Sep 2016 15:42:22 +0300 Saeed Mahameed <saeedm@mellanox.com> wrote:

> From: Tariq Toukan <tariqt@mellanox.com>
> 
> To improve the memory consumption scheme, we omit the flow that
> demands and splits high-order pages in Striding RQ, and stay
> with a single Striding RQ flow that uses order-0 pages.

Thank you for doing this! MM-list people thank you!

For others to understand what this means:  This driver was doing
split_page() on high-order pages (for Striding RQ).  This was really
bad because it fragments the page allocator and quickly depletes the
available high-order pages.

(I've left rest of patch intact below, if some MM people should be
interested in looking at the changes).

There is even a funny comment in split_page() relevant to this:

/* [...]
 * Note: this is probably too low level an operation for use in drivers.
 * Please consult with lkml before using this in your driver.
 */


> Moving to fragmented memory allows the use of larger MPWQEs,
> which reduces the number of UMR posts and filler CQEs.
> 
> Moving to a single flow allows several optimizations that improve
> performance, especially in production servers where we would
> anyway fallback to order-0 allocations:
> - inline functions that were called via function pointers.
> - improve the UMR post process.
> 
> This patch alone is expected to give a slight performance reduction.
> However, the new memory scheme gives the possibility to use a page-cache
> of a fair size, that doesn't inflate the memory footprint, which will
> dramatically fix the reduction and even give a huge gain.
> 
> We ran pktgen single-stream benchmarks, with iptables-raw-drop:
> 
> Single stride, 64 bytes:
> * 4,739,057 - baseline
> * 4,749,550 - this patch
> no reduction
> 
> Larger packets, no page cross, 1024 bytes:
> * 3,982,361 - baseline
> * 3,845,682 - this patch
> 3.5% reduction
> 
> Larger packets, every 3rd packet crosses a page, 1500 bytes:
> * 3,731,189 - baseline
> * 3,579,414 - this patch
> 4% reduction
> 

Well, the reduction does not really matter that much, because your
baseline benchmarks are from a freshly booted system, where you have
not yet fragmented and depleted the high-order pages... ;-)


> Fixes: 461017cb006a ("net/mlx5e: Support RX multi-packet WQE (Striding RQ)")
> Fixes: bc77b240b3c5 ("net/mlx5e: Add fragmented memory support for RX multi packet WQE")
> Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en.h       |  54 ++--
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 136 ++++++++--
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    | 292 ++++-----------------
>  drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   4 -
>  drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  |   2 +-
>  5 files changed, 184 insertions(+), 304 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index bf722aa..075cdfc 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> @@ -62,12 +62,12 @@
>  #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE                0xd
>  
>  #define MLX5E_PARAMS_MINIMUM_LOG_RQ_SIZE_MPW            0x1
> -#define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW            0x4
> +#define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW            0x3
>  #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW            0x6
>  
>  #define MLX5_MPWRQ_LOG_STRIDE_SIZE		6  /* >= 6, HW restriction */
>  #define MLX5_MPWRQ_LOG_STRIDE_SIZE_CQE_COMPRESS	8  /* >= 6, HW restriction */
> -#define MLX5_MPWRQ_LOG_WQE_SZ			17
> +#define MLX5_MPWRQ_LOG_WQE_SZ			18
>  #define MLX5_MPWRQ_WQE_PAGE_ORDER  (MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT > 0 ? \
>  				    MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT : 0)
>  #define MLX5_MPWRQ_PAGES_PER_WQE		BIT(MLX5_MPWRQ_WQE_PAGE_ORDER)
> @@ -293,8 +293,8 @@ struct mlx5e_rq {
>  	u32                    wqe_sz;
>  	struct sk_buff       **skb;
>  	struct mlx5e_mpw_info *wqe_info;
> +	void                  *mtt_no_align;
>  	__be32                 mkey_be;
> -	__be32                 umr_mkey_be;
>  
>  	struct device         *pdev;
>  	struct net_device     *netdev;
> @@ -323,32 +323,15 @@ struct mlx5e_rq {
>  
>  struct mlx5e_umr_dma_info {
>  	__be64                *mtt;
> -	__be64                *mtt_no_align;
>  	dma_addr_t             mtt_addr;
> -	struct mlx5e_dma_info *dma_info;
> +	struct mlx5e_dma_info  dma_info[MLX5_MPWRQ_PAGES_PER_WQE];
> +	struct mlx5e_umr_wqe   wqe;
>  };
>  
>  struct mlx5e_mpw_info {
> -	union {
> -		struct mlx5e_dma_info     dma_info;
> -		struct mlx5e_umr_dma_info umr;
> -	};
> +	struct mlx5e_umr_dma_info umr;
>  	u16 consumed_strides;
>  	u16 skbs_frags[MLX5_MPWRQ_PAGES_PER_WQE];
> -
> -	void (*dma_pre_sync)(struct device *pdev,
> -			     struct mlx5e_mpw_info *wi,
> -			     u32 wqe_offset, u32 len);
> -	void (*add_skb_frag)(struct mlx5e_rq *rq,
> -			     struct sk_buff *skb,
> -			     struct mlx5e_mpw_info *wi,
> -			     u32 page_idx, u32 frag_offset, u32 len);
> -	void (*copy_skb_header)(struct device *pdev,
> -				struct sk_buff *skb,
> -				struct mlx5e_mpw_info *wi,
> -				u32 page_idx, u32 offset,
> -				u32 headlen);
> -	void (*free_wqe)(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi);
>  };
>  
>  struct mlx5e_tx_wqe_info {
> @@ -706,24 +689,11 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
>  void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
>  bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq);
>  int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
> -int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
> +int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe,	u16 ix);
>  void mlx5e_dealloc_rx_wqe(struct mlx5e_rq *rq, u16 ix);
>  void mlx5e_dealloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix);
> -void mlx5e_post_rx_fragmented_mpwqe(struct mlx5e_rq *rq);
> -void mlx5e_complete_rx_linear_mpwqe(struct mlx5e_rq *rq,
> -				    struct mlx5_cqe64 *cqe,
> -				    u16 byte_cnt,
> -				    struct mlx5e_mpw_info *wi,
> -				    struct sk_buff *skb);
> -void mlx5e_complete_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
> -					struct mlx5_cqe64 *cqe,
> -					u16 byte_cnt,
> -					struct mlx5e_mpw_info *wi,
> -					struct sk_buff *skb);
> -void mlx5e_free_rx_linear_mpwqe(struct mlx5e_rq *rq,
> -				struct mlx5e_mpw_info *wi);
> -void mlx5e_free_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
> -				    struct mlx5e_mpw_info *wi);
> +void mlx5e_post_rx_mpwqe(struct mlx5e_rq *rq);
> +void mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi);
>  struct mlx5_cqe64 *mlx5e_get_cqe(struct mlx5e_cq *cq);
>  
>  void mlx5e_rx_am(struct mlx5e_rq *rq);
> @@ -810,6 +780,12 @@ static inline void mlx5e_cq_arm(struct mlx5e_cq *cq)
>  	mlx5_cq_arm(mcq, MLX5_CQ_DB_REQ_NOT, mcq->uar->map, NULL, cq->wq.cc);
>  }
>  
> +static inline u32 mlx5e_get_wqe_mtt_offset(struct mlx5e_rq *rq, u16 wqe_ix)
> +{
> +	return rq->mpwqe_mtt_offset +
> +		wqe_ix * ALIGN(MLX5_MPWRQ_PAGES_PER_WQE, 8);
> +}
> +
>  static inline int mlx5e_get_max_num_channels(struct mlx5_core_dev *mdev)
>  {
>  	return min_t(int, mdev->priv.eq_table.num_comp_vectors,
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index 2459c7f..0db4d3b 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -138,7 +138,6 @@ static void mlx5e_update_sw_counters(struct mlx5e_priv *priv)
>  		s->rx_csum_unnecessary_inner += rq_stats->csum_unnecessary_inner;
>  		s->rx_wqe_err   += rq_stats->wqe_err;
>  		s->rx_mpwqe_filler += rq_stats->mpwqe_filler;
> -		s->rx_mpwqe_frag   += rq_stats->mpwqe_frag;
>  		s->rx_buff_alloc_err += rq_stats->buff_alloc_err;
>  		s->rx_cqe_compress_blks += rq_stats->cqe_compress_blks;
>  		s->rx_cqe_compress_pkts += rq_stats->cqe_compress_pkts;
> @@ -298,6 +297,107 @@ static void mlx5e_disable_async_events(struct mlx5e_priv *priv)
>  #define MLX5E_HW2SW_MTU(hwmtu) (hwmtu - (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN))
>  #define MLX5E_SW2HW_MTU(swmtu) (swmtu + (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN))
>  
> +static inline int mlx5e_get_wqe_mtt_sz(void)
> +{
> +	/* UMR copies MTTs in units of MLX5_UMR_MTT_ALIGNMENT bytes.
> +	 * To avoid copying garbage after the mtt array, we allocate
> +	 * a little more.
> +	 */
> +	return ALIGN(MLX5_MPWRQ_PAGES_PER_WQE * sizeof(__be64),
> +		     MLX5_UMR_MTT_ALIGNMENT);
> +}
> +
> +static inline void mlx5e_build_umr_wqe(struct mlx5e_rq *rq, struct mlx5e_sq *sq,
> +				       struct mlx5e_umr_wqe *wqe, u16 ix)
> +{
> +	struct mlx5_wqe_ctrl_seg      *cseg = &wqe->ctrl;
> +	struct mlx5_wqe_umr_ctrl_seg *ucseg = &wqe->uctrl;
> +	struct mlx5_wqe_data_seg      *dseg = &wqe->data;
> +	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
> +	u8 ds_cnt = DIV_ROUND_UP(sizeof(*wqe), MLX5_SEND_WQE_DS);
> +	u32 umr_wqe_mtt_offset = mlx5e_get_wqe_mtt_offset(rq, ix);
> +
> +	cseg->qpn_ds    = cpu_to_be32((sq->sqn << MLX5_WQE_CTRL_QPN_SHIFT) |
> +				      ds_cnt);
> +	cseg->fm_ce_se  = MLX5_WQE_CTRL_CQ_UPDATE;
> +	cseg->imm       = rq->mkey_be;
> +
> +	ucseg->flags = MLX5_UMR_TRANSLATION_OFFSET_EN;
> +	ucseg->klm_octowords =
> +		cpu_to_be16(MLX5_MTT_OCTW(MLX5_MPWRQ_PAGES_PER_WQE));
> +	ucseg->bsf_octowords =
> +		cpu_to_be16(MLX5_MTT_OCTW(umr_wqe_mtt_offset));
> +	ucseg->mkey_mask     = cpu_to_be64(MLX5_MKEY_MASK_FREE);
> +
> +	dseg->lkey = sq->mkey_be;
> +	dseg->addr = cpu_to_be64(wi->umr.mtt_addr);
> +}
> +
> +static int mlx5e_rq_alloc_mpwqe_info(struct mlx5e_rq *rq,
> +				     struct mlx5e_channel *c)
> +{
> +	int wq_sz = mlx5_wq_ll_get_size(&rq->wq);
> +	int mtt_sz = mlx5e_get_wqe_mtt_sz();
> +	int mtt_alloc = mtt_sz + MLX5_UMR_ALIGN - 1;
> +	int i;
> +
> +	rq->wqe_info = kzalloc_node(wq_sz * sizeof(*rq->wqe_info),
> +				    GFP_KERNEL, cpu_to_node(c->cpu));
> +	if (!rq->wqe_info)
> +		goto err_out;
> +
> +	/* We allocate more than mtt_sz as we will align the pointer */
> +	rq->mtt_no_align = kzalloc_node(mtt_alloc * wq_sz, GFP_KERNEL,
> +					cpu_to_node(c->cpu));
> +	if (unlikely(!rq->mtt_no_align))
> +		goto err_free_wqe_info;
> +
> +	for (i = 0; i < wq_sz; i++) {
> +		struct mlx5e_mpw_info *wi = &rq->wqe_info[i];
> +
> +		wi->umr.mtt = PTR_ALIGN(rq->mtt_no_align + i * mtt_alloc,
> +					MLX5_UMR_ALIGN);
> +		wi->umr.mtt_addr = dma_map_single(c->pdev, wi->umr.mtt, mtt_sz,
> +						  PCI_DMA_TODEVICE);
> +		if (unlikely(dma_mapping_error(c->pdev, wi->umr.mtt_addr)))
> +			goto err_unmap_mtts;
> +
> +		mlx5e_build_umr_wqe(rq, &c->icosq, &wi->umr.wqe, i);
> +	}
> +
> +	return 0;
> +
> +err_unmap_mtts:
> +	while (--i >= 0) {
> +		struct mlx5e_mpw_info *wi = &rq->wqe_info[i];
> +
> +		dma_unmap_single(c->pdev, wi->umr.mtt_addr, mtt_sz,
> +				 PCI_DMA_TODEVICE);
> +	}
> +	kfree(rq->mtt_no_align);
> +err_free_wqe_info:
> +	kfree(rq->wqe_info);
> +
> +err_out:
> +	return -ENOMEM;
> +}
> +
> +static void mlx5e_rq_free_mpwqe_info(struct mlx5e_rq *rq)
> +{
> +	int wq_sz = mlx5_wq_ll_get_size(&rq->wq);
> +	int mtt_sz = mlx5e_get_wqe_mtt_sz();
> +	int i;
> +
> +	for (i = 0; i < wq_sz; i++) {
> +		struct mlx5e_mpw_info *wi = &rq->wqe_info[i];
> +
> +		dma_unmap_single(rq->pdev, wi->umr.mtt_addr, mtt_sz,
> +				 PCI_DMA_TODEVICE);
> +	}
> +	kfree(rq->mtt_no_align);
> +	kfree(rq->wqe_info);
> +}
> +
>  static int mlx5e_create_rq(struct mlx5e_channel *c,
>  			   struct mlx5e_rq_param *param,
>  			   struct mlx5e_rq *rq)
> @@ -322,14 +422,16 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
>  
>  	wq_sz = mlx5_wq_ll_get_size(&rq->wq);
>  
> +	rq->wq_type = priv->params.rq_wq_type;
> +	rq->pdev    = c->pdev;
> +	rq->netdev  = c->netdev;
> +	rq->tstamp  = &priv->tstamp;
> +	rq->channel = c;
> +	rq->ix      = c->ix;
> +	rq->priv    = c->priv;
> +
>  	switch (priv->params.rq_wq_type) {
>  	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
> -		rq->wqe_info = kzalloc_node(wq_sz * sizeof(*rq->wqe_info),
> -					    GFP_KERNEL, cpu_to_node(c->cpu));
> -		if (!rq->wqe_info) {
> -			err = -ENOMEM;
> -			goto err_rq_wq_destroy;
> -		}
>  		rq->handle_rx_cqe = mlx5e_handle_rx_cqe_mpwrq;
>  		rq->alloc_wqe = mlx5e_alloc_rx_mpwqe;
>  		rq->dealloc_wqe = mlx5e_dealloc_rx_mpwqe;
> @@ -341,6 +443,10 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
>  		rq->mpwqe_num_strides = BIT(priv->params.mpwqe_log_num_strides);
>  		rq->wqe_sz = rq->mpwqe_stride_sz * rq->mpwqe_num_strides;
>  		byte_count = rq->wqe_sz;
> +		rq->mkey_be = cpu_to_be32(c->priv->umr_mkey.key);
> +		err = mlx5e_rq_alloc_mpwqe_info(rq, c);
> +		if (err)
> +			goto err_rq_wq_destroy;
>  		break;
>  	default: /* MLX5_WQ_TYPE_LINKED_LIST */
>  		rq->skb = kzalloc_node(wq_sz * sizeof(*rq->skb), GFP_KERNEL,
> @@ -359,27 +465,19 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
>  		rq->wqe_sz = SKB_DATA_ALIGN(rq->wqe_sz);
>  		byte_count = rq->wqe_sz;
>  		byte_count |= MLX5_HW_START_PADDING;
> +		rq->mkey_be = c->mkey_be;
>  	}
>  
>  	for (i = 0; i < wq_sz; i++) {
>  		struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(&rq->wq, i);
>  
>  		wqe->data.byte_count = cpu_to_be32(byte_count);
> +		wqe->data.lkey = rq->mkey_be;
>  	}
>  
>  	INIT_WORK(&rq->am.work, mlx5e_rx_am_work);
>  	rq->am.mode = priv->params.rx_cq_period_mode;
>  
> -	rq->wq_type = priv->params.rq_wq_type;
> -	rq->pdev    = c->pdev;
> -	rq->netdev  = c->netdev;
> -	rq->tstamp  = &priv->tstamp;
> -	rq->channel = c;
> -	rq->ix      = c->ix;
> -	rq->priv    = c->priv;
> -	rq->mkey_be = c->mkey_be;
> -	rq->umr_mkey_be = cpu_to_be32(c->priv->umr_mkey.key);
> -
>  	return 0;
>  
>  err_rq_wq_destroy:
> @@ -392,7 +490,7 @@ static void mlx5e_destroy_rq(struct mlx5e_rq *rq)
>  {
>  	switch (rq->wq_type) {
>  	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
> -		kfree(rq->wqe_info);
> +		mlx5e_rq_free_mpwqe_info(rq);
>  		break;
>  	default: /* MLX5_WQ_TYPE_LINKED_LIST */
>  		kfree(rq->skb);
> @@ -530,7 +628,7 @@ static void mlx5e_free_rx_descs(struct mlx5e_rq *rq)
>  
>  	/* UMR WQE (if in progress) is always at wq->head */
>  	if (test_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state))
> -		mlx5e_free_rx_fragmented_mpwqe(rq, &rq->wqe_info[wq->head]);
> +		mlx5e_free_rx_mpwqe(rq, &rq->wqe_info[wq->head]);
>  
>  	while (!mlx5_wq_ll_is_empty(wq)) {
>  		wqe_ix_be = *wq->tail_next;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index b6f8ebb..8ad4d32 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -200,7 +200,6 @@ int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
>  
>  	*((dma_addr_t *)skb->cb) = dma_addr;
>  	wqe->data.addr = cpu_to_be64(dma_addr);
> -	wqe->data.lkey = rq->mkey_be;
>  
>  	rq->skb[ix] = skb;
>  
> @@ -231,44 +230,11 @@ static inline int mlx5e_mpwqe_strides_per_page(struct mlx5e_rq *rq)
>  	return rq->mpwqe_num_strides >> MLX5_MPWRQ_WQE_PAGE_ORDER;
>  }
>  
> -static inline void
> -mlx5e_dma_pre_sync_linear_mpwqe(struct device *pdev,
> -				struct mlx5e_mpw_info *wi,
> -				u32 wqe_offset, u32 len)
> -{
> -	dma_sync_single_for_cpu(pdev, wi->dma_info.addr + wqe_offset,
> -				len, DMA_FROM_DEVICE);
> -}
> -
> -static inline void
> -mlx5e_dma_pre_sync_fragmented_mpwqe(struct device *pdev,
> -				    struct mlx5e_mpw_info *wi,
> -				    u32 wqe_offset, u32 len)
> -{
> -	/* No dma pre sync for fragmented MPWQE */
> -}
> -
> -static inline void
> -mlx5e_add_skb_frag_linear_mpwqe(struct mlx5e_rq *rq,
> -				struct sk_buff *skb,
> -				struct mlx5e_mpw_info *wi,
> -				u32 page_idx, u32 frag_offset,
> -				u32 len)
> -{
> -	unsigned int truesize =	ALIGN(len, rq->mpwqe_stride_sz);
> -
> -	wi->skbs_frags[page_idx]++;
> -	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
> -			&wi->dma_info.page[page_idx], frag_offset,
> -			len, truesize);
> -}
> -
> -static inline void
> -mlx5e_add_skb_frag_fragmented_mpwqe(struct mlx5e_rq *rq,
> -				    struct sk_buff *skb,
> -				    struct mlx5e_mpw_info *wi,
> -				    u32 page_idx, u32 frag_offset,
> -				    u32 len)
> +static inline void mlx5e_add_skb_frag_mpwqe(struct mlx5e_rq *rq,
> +					    struct sk_buff *skb,
> +					    struct mlx5e_mpw_info *wi,
> +					    u32 page_idx, u32 frag_offset,
> +					    u32 len)
>  {
>  	unsigned int truesize =	ALIGN(len, rq->mpwqe_stride_sz);
>  
> @@ -282,24 +248,11 @@ mlx5e_add_skb_frag_fragmented_mpwqe(struct mlx5e_rq *rq,
>  }
>  
>  static inline void
> -mlx5e_copy_skb_header_linear_mpwqe(struct device *pdev,
> -				   struct sk_buff *skb,
> -				   struct mlx5e_mpw_info *wi,
> -				   u32 page_idx, u32 offset,
> -				   u32 headlen)
> -{
> -	struct page *page = &wi->dma_info.page[page_idx];
> -
> -	skb_copy_to_linear_data(skb, page_address(page) + offset,
> -				ALIGN(headlen, sizeof(long)));
> -}
> -
> -static inline void
> -mlx5e_copy_skb_header_fragmented_mpwqe(struct device *pdev,
> -				       struct sk_buff *skb,
> -				       struct mlx5e_mpw_info *wi,
> -				       u32 page_idx, u32 offset,
> -				       u32 headlen)
> +mlx5e_copy_skb_header_mpwqe(struct device *pdev,
> +			    struct sk_buff *skb,
> +			    struct mlx5e_mpw_info *wi,
> +			    u32 page_idx, u32 offset,
> +			    u32 headlen)
>  {
>  	u16 headlen_pg = min_t(u32, headlen, PAGE_SIZE - offset);
>  	struct mlx5e_dma_info *dma_info = &wi->umr.dma_info[page_idx];
> @@ -324,46 +277,9 @@ mlx5e_copy_skb_header_fragmented_mpwqe(struct device *pdev,
>  	}
>  }
>  
> -static u32 mlx5e_get_wqe_mtt_offset(struct mlx5e_rq *rq, u16 wqe_ix)
> -{
> -	return rq->mpwqe_mtt_offset +
> -		wqe_ix * ALIGN(MLX5_MPWRQ_PAGES_PER_WQE, 8);
> -}
> -
> -static void mlx5e_build_umr_wqe(struct mlx5e_rq *rq,
> -				struct mlx5e_sq *sq,
> -				struct mlx5e_umr_wqe *wqe,
> -				u16 ix)
> +static inline void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
>  {
> -	struct mlx5_wqe_ctrl_seg      *cseg = &wqe->ctrl;
> -	struct mlx5_wqe_umr_ctrl_seg *ucseg = &wqe->uctrl;
> -	struct mlx5_wqe_data_seg      *dseg = &wqe->data;
>  	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
> -	u8 ds_cnt = DIV_ROUND_UP(sizeof(*wqe), MLX5_SEND_WQE_DS);
> -	u32 umr_wqe_mtt_offset = mlx5e_get_wqe_mtt_offset(rq, ix);
> -
> -	memset(wqe, 0, sizeof(*wqe));
> -	cseg->opmod_idx_opcode =
> -		cpu_to_be32((sq->pc << MLX5_WQE_CTRL_WQE_INDEX_SHIFT) |
> -			    MLX5_OPCODE_UMR);
> -	cseg->qpn_ds    = cpu_to_be32((sq->sqn << MLX5_WQE_CTRL_QPN_SHIFT) |
> -				      ds_cnt);
> -	cseg->fm_ce_se  = MLX5_WQE_CTRL_CQ_UPDATE;
> -	cseg->imm       = rq->umr_mkey_be;
> -
> -	ucseg->flags = MLX5_UMR_TRANSLATION_OFFSET_EN;
> -	ucseg->klm_octowords =
> -		cpu_to_be16(MLX5_MTT_OCTW(MLX5_MPWRQ_PAGES_PER_WQE));
> -	ucseg->bsf_octowords =
> -		cpu_to_be16(MLX5_MTT_OCTW(umr_wqe_mtt_offset));
> -	ucseg->mkey_mask     = cpu_to_be64(MLX5_MKEY_MASK_FREE);
> -
> -	dseg->lkey = sq->mkey_be;
> -	dseg->addr = cpu_to_be64(wi->umr.mtt_addr);
> -}
> -
> -static void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
> -{
>  	struct mlx5e_sq *sq = &rq->channel->icosq;
>  	struct mlx5_wq_cyc *wq = &sq->wq;
>  	struct mlx5e_umr_wqe *wqe;
> @@ -378,30 +294,22 @@ static void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
>  	}
>  
>  	wqe = mlx5_wq_cyc_get_wqe(wq, pi);
> -	mlx5e_build_umr_wqe(rq, sq, wqe, ix);
> +	memcpy(wqe, &wi->umr.wqe, sizeof(*wqe));
> +	wqe->ctrl.opmod_idx_opcode =
> +		cpu_to_be32((sq->pc << MLX5_WQE_CTRL_WQE_INDEX_SHIFT) |
> +			    MLX5_OPCODE_UMR);
> +
>  	sq->ico_wqe_info[pi].opcode = MLX5_OPCODE_UMR;
>  	sq->ico_wqe_info[pi].num_wqebbs = num_wqebbs;
>  	sq->pc += num_wqebbs;
>  	mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
>  }
>  
> -static inline int mlx5e_get_wqe_mtt_sz(void)
> -{
> -	/* UMR copies MTTs in units of MLX5_UMR_MTT_ALIGNMENT bytes.
> -	 * To avoid copying garbage after the mtt array, we allocate
> -	 * a little more.
> -	 */
> -	return ALIGN(MLX5_MPWRQ_PAGES_PER_WQE * sizeof(__be64),
> -		     MLX5_UMR_MTT_ALIGNMENT);
> -}
> -
> -static int mlx5e_alloc_and_map_page(struct mlx5e_rq *rq,
> -				    struct mlx5e_mpw_info *wi,
> -				    int i)
> +static inline int mlx5e_alloc_and_map_page(struct mlx5e_rq *rq,
> +					   struct mlx5e_mpw_info *wi,
> +					   int i)
>  {
> -	struct page *page;
> -
> -	page = dev_alloc_page();
> +	struct page *page = dev_alloc_page();
>  	if (unlikely(!page))
>  		return -ENOMEM;
>  
> @@ -417,47 +325,25 @@ static int mlx5e_alloc_and_map_page(struct mlx5e_rq *rq,
>  	return 0;
>  }
>  
> -static int mlx5e_alloc_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
> -					   struct mlx5e_rx_wqe *wqe,
> -					   u16 ix)
> +static int mlx5e_alloc_rx_umr_mpwqe(struct mlx5e_rq *rq,
> +				    struct mlx5e_rx_wqe *wqe,
> +				    u16 ix)
>  {
>  	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
> -	int mtt_sz = mlx5e_get_wqe_mtt_sz();
>  	u64 dma_offset = (u64)mlx5e_get_wqe_mtt_offset(rq, ix) << PAGE_SHIFT;
> +	int pg_strides = mlx5e_mpwqe_strides_per_page(rq);
> +	int err;
>  	int i;
>  
> -	wi->umr.dma_info = kmalloc(sizeof(*wi->umr.dma_info) *
> -				   MLX5_MPWRQ_PAGES_PER_WQE,
> -				   GFP_ATOMIC);
> -	if (unlikely(!wi->umr.dma_info))
> -		goto err_out;
> -
> -	/* We allocate more than mtt_sz as we will align the pointer */
> -	wi->umr.mtt_no_align = kzalloc(mtt_sz + MLX5_UMR_ALIGN - 1,
> -				       GFP_ATOMIC);
> -	if (unlikely(!wi->umr.mtt_no_align))
> -		goto err_free_umr;
> -
> -	wi->umr.mtt = PTR_ALIGN(wi->umr.mtt_no_align, MLX5_UMR_ALIGN);
> -	wi->umr.mtt_addr = dma_map_single(rq->pdev, wi->umr.mtt, mtt_sz,
> -					  PCI_DMA_TODEVICE);
> -	if (unlikely(dma_mapping_error(rq->pdev, wi->umr.mtt_addr)))
> -		goto err_free_mtt;
> -
>  	for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
> -		if (unlikely(mlx5e_alloc_and_map_page(rq, wi, i)))
> +		err = mlx5e_alloc_and_map_page(rq, wi, i);
> +		if (unlikely(err))
>  			goto err_unmap;
> -		page_ref_add(wi->umr.dma_info[i].page,
> -			     mlx5e_mpwqe_strides_per_page(rq));
> +		page_ref_add(wi->umr.dma_info[i].page, pg_strides);
>  		wi->skbs_frags[i] = 0;
>  	}
>  
>  	wi->consumed_strides = 0;
> -	wi->dma_pre_sync = mlx5e_dma_pre_sync_fragmented_mpwqe;
> -	wi->add_skb_frag = mlx5e_add_skb_frag_fragmented_mpwqe;
> -	wi->copy_skb_header = mlx5e_copy_skb_header_fragmented_mpwqe;
> -	wi->free_wqe     = mlx5e_free_rx_fragmented_mpwqe;
> -	wqe->data.lkey = rq->umr_mkey_be;
>  	wqe->data.addr = cpu_to_be64(dma_offset);
>  
>  	return 0;
> @@ -466,41 +352,28 @@ err_unmap:
>  	while (--i >= 0) {
>  		dma_unmap_page(rq->pdev, wi->umr.dma_info[i].addr, PAGE_SIZE,
>  			       PCI_DMA_FROMDEVICE);
> -		page_ref_sub(wi->umr.dma_info[i].page,
> -			     mlx5e_mpwqe_strides_per_page(rq));
> +		page_ref_sub(wi->umr.dma_info[i].page, pg_strides);
>  		put_page(wi->umr.dma_info[i].page);
>  	}
> -	dma_unmap_single(rq->pdev, wi->umr.mtt_addr, mtt_sz, PCI_DMA_TODEVICE);
> -
> -err_free_mtt:
> -	kfree(wi->umr.mtt_no_align);
> -
> -err_free_umr:
> -	kfree(wi->umr.dma_info);
>  
> -err_out:
> -	return -ENOMEM;
> +	return err;
>  }
>  
> -void mlx5e_free_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
> -				    struct mlx5e_mpw_info *wi)
> +void mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi)
>  {
> -	int mtt_sz = mlx5e_get_wqe_mtt_sz();
> +	int pg_strides = mlx5e_mpwqe_strides_per_page(rq);
>  	int i;
>  
>  	for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
>  		dma_unmap_page(rq->pdev, wi->umr.dma_info[i].addr, PAGE_SIZE,
>  			       PCI_DMA_FROMDEVICE);
>  		page_ref_sub(wi->umr.dma_info[i].page,
> -			mlx5e_mpwqe_strides_per_page(rq) - wi->skbs_frags[i]);
> +			     pg_strides - wi->skbs_frags[i]);
>  		put_page(wi->umr.dma_info[i].page);
>  	}
> -	dma_unmap_single(rq->pdev, wi->umr.mtt_addr, mtt_sz, PCI_DMA_TODEVICE);
> -	kfree(wi->umr.mtt_no_align);
> -	kfree(wi->umr.dma_info);
>  }
>  
> -void mlx5e_post_rx_fragmented_mpwqe(struct mlx5e_rq *rq)
> +void mlx5e_post_rx_mpwqe(struct mlx5e_rq *rq)
>  {
>  	struct mlx5_wq_ll *wq = &rq->wq;
>  	struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(wq, wq->head);
> @@ -508,12 +381,11 @@ void mlx5e_post_rx_fragmented_mpwqe(struct mlx5e_rq *rq)
>  	clear_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state);
>  
>  	if (unlikely(test_bit(MLX5E_RQ_STATE_FLUSH, &rq->state))) {
> -		mlx5e_free_rx_fragmented_mpwqe(rq, &rq->wqe_info[wq->head]);
> +		mlx5e_free_rx_mpwqe(rq, &rq->wqe_info[wq->head]);
>  		return;
>  	}
>  
>  	mlx5_wq_ll_push(wq, be16_to_cpu(wqe->next.next_wqe_index));
> -	rq->stats.mpwqe_frag++;
>  
>  	/* ensure wqes are visible to device before updating doorbell record */
>  	dma_wmb();
> @@ -521,84 +393,23 @@ void mlx5e_post_rx_fragmented_mpwqe(struct mlx5e_rq *rq)
>  	mlx5_wq_ll_update_db_record(wq);
>  }
>  
> -static int mlx5e_alloc_rx_linear_mpwqe(struct mlx5e_rq *rq,
> -				       struct mlx5e_rx_wqe *wqe,
> -				       u16 ix)
> -{
> -	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
> -	gfp_t gfp_mask;
> -	int i;
> -
> -	gfp_mask = GFP_ATOMIC | __GFP_COLD | __GFP_MEMALLOC;
> -	wi->dma_info.page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
> -					     MLX5_MPWRQ_WQE_PAGE_ORDER);
> -	if (unlikely(!wi->dma_info.page))
> -		return -ENOMEM;
> -
> -	wi->dma_info.addr = dma_map_page(rq->pdev, wi->dma_info.page, 0,
> -					 rq->wqe_sz, PCI_DMA_FROMDEVICE);
> -	if (unlikely(dma_mapping_error(rq->pdev, wi->dma_info.addr))) {
> -		put_page(wi->dma_info.page);
> -		return -ENOMEM;
> -	}
> -
> -	/* We split the high-order page into order-0 ones and manage their
> -	 * reference counter to minimize the memory held by small skb fragments
> -	 */
> -	split_page(wi->dma_info.page, MLX5_MPWRQ_WQE_PAGE_ORDER);
> -	for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
> -		page_ref_add(&wi->dma_info.page[i],
> -			     mlx5e_mpwqe_strides_per_page(rq));
> -		wi->skbs_frags[i] = 0;
> -	}
> -
> -	wi->consumed_strides = 0;
> -	wi->dma_pre_sync = mlx5e_dma_pre_sync_linear_mpwqe;
> -	wi->add_skb_frag = mlx5e_add_skb_frag_linear_mpwqe;
> -	wi->copy_skb_header = mlx5e_copy_skb_header_linear_mpwqe;
> -	wi->free_wqe     = mlx5e_free_rx_linear_mpwqe;
> -	wqe->data.lkey = rq->mkey_be;
> -	wqe->data.addr = cpu_to_be64(wi->dma_info.addr);
> -
> -	return 0;
> -}
> -
> -void mlx5e_free_rx_linear_mpwqe(struct mlx5e_rq *rq,
> -				struct mlx5e_mpw_info *wi)
> -{
> -	int i;
> -
> -	dma_unmap_page(rq->pdev, wi->dma_info.addr, rq->wqe_sz,
> -		       PCI_DMA_FROMDEVICE);
> -	for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
> -		page_ref_sub(&wi->dma_info.page[i],
> -			mlx5e_mpwqe_strides_per_page(rq) - wi->skbs_frags[i]);
> -		put_page(&wi->dma_info.page[i]);
> -	}
> -}
> -
> -int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
> +int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe,	u16 ix)
>  {
>  	int err;
>  
> -	err = mlx5e_alloc_rx_linear_mpwqe(rq, wqe, ix);
> -	if (unlikely(err)) {
> -		err = mlx5e_alloc_rx_fragmented_mpwqe(rq, wqe, ix);
> -		if (unlikely(err))
> -			return err;
> -		set_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state);
> -		mlx5e_post_umr_wqe(rq, ix);
> -		return -EBUSY;
> -	}
> -
> -	return 0;
> +	err = mlx5e_alloc_rx_umr_mpwqe(rq, wqe, ix);
> +	if (unlikely(err))
> +		return err;
> +	set_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state);
> +	mlx5e_post_umr_wqe(rq, ix);
> +	return -EBUSY;
>  }
>  
>  void mlx5e_dealloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>  {
>  	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
>  
> -	wi->free_wqe(rq, wi);
> +	mlx5e_free_rx_mpwqe(rq, wi);
>  }
>  
>  #define RQ_CANNOT_POST(rq) \
> @@ -617,9 +428,10 @@ bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq)
>  		int err;
>  
>  		err = rq->alloc_wqe(rq, wqe, wq->head);
> +		if (err == -EBUSY)
> +			return true;
>  		if (unlikely(err)) {
> -			if (err != -EBUSY)
> -				rq->stats.buff_alloc_err++;
> +			rq->stats.buff_alloc_err++;
>  			break;
>  		}
>  
> @@ -823,7 +635,6 @@ static inline void mlx5e_mpwqe_fill_rx_skb(struct mlx5e_rq *rq,
>  					   u32 cqe_bcnt,
>  					   struct sk_buff *skb)
>  {
> -	u32 consumed_bytes = ALIGN(cqe_bcnt, rq->mpwqe_stride_sz);
>  	u16 stride_ix      = mpwrq_get_cqe_stride_index(cqe);
>  	u32 wqe_offset     = stride_ix * rq->mpwqe_stride_sz;
>  	u32 head_offset    = wqe_offset & (PAGE_SIZE - 1);
> @@ -837,21 +648,20 @@ static inline void mlx5e_mpwqe_fill_rx_skb(struct mlx5e_rq *rq,
>  		page_idx++;
>  		frag_offset -= PAGE_SIZE;
>  	}
> -	wi->dma_pre_sync(rq->pdev, wi, wqe_offset, consumed_bytes);
>  
>  	while (byte_cnt) {
>  		u32 pg_consumed_bytes =
>  			min_t(u32, PAGE_SIZE - frag_offset, byte_cnt);
>  
> -		wi->add_skb_frag(rq, skb, wi, page_idx, frag_offset,
> -				 pg_consumed_bytes);
> +		mlx5e_add_skb_frag_mpwqe(rq, skb, wi, page_idx, frag_offset,
> +					 pg_consumed_bytes);
>  		byte_cnt -= pg_consumed_bytes;
>  		frag_offset = 0;
>  		page_idx++;
>  	}
>  	/* copy header */
> -	wi->copy_skb_header(rq->pdev, skb, wi, head_page_idx, head_offset,
> -			    headlen);
> +	mlx5e_copy_skb_header_mpwqe(rq->pdev, skb, wi, head_page_idx,
> +				    head_offset, headlen);
>  	/* skb linear part was allocated with headlen and aligned to long */
>  	skb->tail += headlen;
>  	skb->len  += headlen;
> @@ -896,7 +706,7 @@ mpwrq_cqe_out:
>  	if (likely(wi->consumed_strides < rq->mpwqe_num_strides))
>  		return;
>  
> -	wi->free_wqe(rq, wi);
> +	mlx5e_free_rx_mpwqe(rq, wi);
>  	mlx5_wq_ll_pop(&rq->wq, cqe->wqe_id, &wqe->next.next_wqe_index);
>  }
>  
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
> index 499487c..1f56543 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
> @@ -73,7 +73,6 @@ struct mlx5e_sw_stats {
>  	u64 tx_xmit_more;
>  	u64 rx_wqe_err;
>  	u64 rx_mpwqe_filler;
> -	u64 rx_mpwqe_frag;
>  	u64 rx_buff_alloc_err;
>  	u64 rx_cqe_compress_blks;
>  	u64 rx_cqe_compress_pkts;
> @@ -105,7 +104,6 @@ static const struct counter_desc sw_stats_desc[] = {
>  	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_xmit_more) },
>  	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_wqe_err) },
>  	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_mpwqe_filler) },
> -	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_mpwqe_frag) },
>  	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_buff_alloc_err) },
>  	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_cqe_compress_blks) },
>  	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_cqe_compress_pkts) },
> @@ -274,7 +272,6 @@ struct mlx5e_rq_stats {
>  	u64 lro_bytes;
>  	u64 wqe_err;
>  	u64 mpwqe_filler;
> -	u64 mpwqe_frag;
>  	u64 buff_alloc_err;
>  	u64 cqe_compress_blks;
>  	u64 cqe_compress_pkts;
> @@ -290,7 +287,6 @@ static const struct counter_desc rq_stats_desc[] = {
>  	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, lro_bytes) },
>  	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, wqe_err) },
>  	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, mpwqe_filler) },
> -	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, mpwqe_frag) },
>  	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, buff_alloc_err) },
>  	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, cqe_compress_blks) },
>  	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, cqe_compress_pkts) },
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
> index 9bf33bb..08d8b0c 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
> @@ -87,7 +87,7 @@ static void mlx5e_poll_ico_cq(struct mlx5e_cq *cq)
>  		case MLX5_OPCODE_NOP:
>  			break;
>  		case MLX5_OPCODE_UMR:
> -			mlx5e_post_rx_fragmented_mpwqe(&sq->channel->rq);
> +			mlx5e_post_rx_mpwqe(&sq->channel->rq);
>  			break;
>  		default:
>  			WARN_ONCE(true,



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 04/11] net/mlx5e: Build RX SKB on demand
  2016-09-07 12:42 ` [PATCH RFC 04/11] net/mlx5e: Build RX SKB on demand Saeed Mahameed
@ 2016-09-07 19:32       ` Jesper Dangaard Brouer
       [not found]   ` <1473252152-11379-5-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  1 sibling, 0 replies; 72+ messages in thread
From: Jesper Dangaard Brouer via iovisor-dev @ 2016-09-07 19:32 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, iovisor-dev, Jamal Hadi Salim,
	linux-mm, Eric Dumazet, Tom Herbert


On Wed,  7 Sep 2016 15:42:25 +0300 Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:

> For non-striding RQ configuration before this patch we had a ring
> with pre-allocated SKBs and mapped the SKB->data buffers for
> device.
> 
> For robustness and better RX data buffers management, we allocate a
> page per packet and build_skb around it.
> 
> This patch (which is a prerequisite for XDP) will actually reduce
> performance for normal stack usage, because we are now hitting a bottleneck
> in the page allocator. A later patch of page reuse mechanism will be
> needed to restore or even improve performance in comparison to the old
> RX scheme.

Yes, it is true that there is a performance reduction (for normal
stack, not XDP) caused by hitting a bottleneck in the page allocator.

I actually have a PoC implementation of my page_pool that shows we
regain the performance and then some, based on an earlier version of
this patch, where I hooked it into the mlx5 driver (50Gbit/s version).


Your description might be a bit outdated, as this patch and the patch
before it do contain your own driver-local page-cache recycle facility.
And you also show that you regain quite a lot of the lost performance.

Your driver-local page_cache does have its limitations (see comments on
the other patch), as it depends on a timely refcnt decrease by the users
of the page.  If they hold onto pages (like TCP does), then your
page-cache will not be efficient.
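
To make that limitation concrete, the reuse condition from the
mlx5e_rx_cache_get() hunk quoted below boils down to the check sketched
here (the helper name cache_head_reusable() is made up for illustration;
the fields and stats come from the patch):

	static bool cache_head_reusable(struct mlx5e_rq *rq)
	{
		struct mlx5e_page_cache *cache = &rq->page_cache;

		if (unlikely(cache->head == cache->tail)) {
			rq->stats.cache_empty++;	/* nothing to recycle */
			return false;
		}
		/* A page the stack still references (refcnt != 1), e.g. one
		 * held by TCP, blocks recycling and forces the RX path to
		 * fall back to a fresh page allocation instead.
		 */
		if (page_ref_count(cache->page_cache[cache->head].page) != 1) {
			rq->stats.cache_busy++;
			return false;
		}
		return true;
	}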

 
> Packet rate performance testing was done with pktgen 64B packets on
> xmit side and TC drop action on RX side.

I assume this is TC _ingress_ dropping, like [1]

[1] https://github.com/netoptimizer/network-testing/blob/master/bin/tc_ingress_drop.sh

> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
> 
> Comparison is done between:
>  1.Baseline, before 'net/mlx5e: Build RX SKB on demand'
>  2.Build SKB with RX page cache (This patch)
> 
> Streams    Baseline    Build SKB+page-cache    Improvement
> -----------------------------------------------------------
> 1          4.33Mpps      5.51Mpps                27%
> 2          7.35Mpps      11.5Mpps                52%
> 4          14.0Mpps      16.3Mpps                16%
> 8          22.2Mpps      29.6Mpps                20%
> 16         24.8Mpps      34.0Mpps                17%

The improvement gained from using your page-cache is impressively high.

Thanks for working on this,
 --Jesper
 
> Signed-off-by: Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en.h      |  10 +-
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c |  31 +++-
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   | 215 +++++++++++-----------
>  3 files changed, 133 insertions(+), 123 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index afbdf70..a346112 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> @@ -65,6 +65,8 @@
>  #define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW            0x3
>  #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW            0x6
>  
> +#define MLX5_RX_HEADROOM NET_SKB_PAD
> +
>  #define MLX5_MPWRQ_LOG_STRIDE_SIZE		6  /* >= 6, HW restriction */
>  #define MLX5_MPWRQ_LOG_STRIDE_SIZE_CQE_COMPRESS	8  /* >= 6, HW restriction */
>  #define MLX5_MPWRQ_LOG_WQE_SZ			18
> @@ -302,10 +304,14 @@ struct mlx5e_page_cache {
>  struct mlx5e_rq {
>  	/* data path */
>  	struct mlx5_wq_ll      wq;
> -	u32                    wqe_sz;
> -	struct sk_buff       **skb;
> +
> +	struct mlx5e_dma_info *dma_info;
>  	struct mlx5e_mpw_info *wqe_info;
>  	void                  *mtt_no_align;
> +	struct {
> +		u8             page_order;
> +		u32            wqe_sz;    /* wqe data buffer size */
> +	} buff;
>  	__be32                 mkey_be;
>  
>  	struct device         *pdev;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index c84702c..c9f1dea 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -411,6 +411,8 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
>  	void *rqc = param->rqc;
>  	void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq);
>  	u32 byte_count;
> +	u32 frag_sz;
> +	int npages;
>  	int wq_sz;
>  	int err;
>  	int i;
> @@ -445,29 +447,40 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
>  
>  		rq->mpwqe_stride_sz = BIT(priv->params.mpwqe_log_stride_sz);
>  		rq->mpwqe_num_strides = BIT(priv->params.mpwqe_log_num_strides);
> -		rq->wqe_sz = rq->mpwqe_stride_sz * rq->mpwqe_num_strides;
> -		byte_count = rq->wqe_sz;
> +
> +		rq->buff.wqe_sz = rq->mpwqe_stride_sz * rq->mpwqe_num_strides;
> +		byte_count = rq->buff.wqe_sz;
>  		rq->mkey_be = cpu_to_be32(c->priv->umr_mkey.key);
>  		err = mlx5e_rq_alloc_mpwqe_info(rq, c);
>  		if (err)
>  			goto err_rq_wq_destroy;
>  		break;
>  	default: /* MLX5_WQ_TYPE_LINKED_LIST */
> -		rq->skb = kzalloc_node(wq_sz * sizeof(*rq->skb), GFP_KERNEL,
> -				       cpu_to_node(c->cpu));
> -		if (!rq->skb) {
> +		rq->dma_info = kzalloc_node(wq_sz * sizeof(*rq->dma_info), GFP_KERNEL,
> +					    cpu_to_node(c->cpu));
> +		if (!rq->dma_info) {
>  			err = -ENOMEM;
>  			goto err_rq_wq_destroy;
>  		}
> +
>  		rq->handle_rx_cqe = mlx5e_handle_rx_cqe;
>  		rq->alloc_wqe = mlx5e_alloc_rx_wqe;
>  		rq->dealloc_wqe = mlx5e_dealloc_rx_wqe;
>  
> -		rq->wqe_sz = (priv->params.lro_en) ?
> +		rq->buff.wqe_sz = (priv->params.lro_en) ?
>  				priv->params.lro_wqe_sz :
>  				MLX5E_SW2HW_MTU(priv->netdev->mtu);
> -		rq->wqe_sz = SKB_DATA_ALIGN(rq->wqe_sz);
> -		byte_count = rq->wqe_sz;
> +		byte_count = rq->buff.wqe_sz;
> +
> +		/* calc the required page order */
> +		frag_sz = MLX5_RX_HEADROOM +
> +			  byte_count /* packet data */ +
> +			  SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +		frag_sz = SKB_DATA_ALIGN(frag_sz);
> +
> +		npages = DIV_ROUND_UP(frag_sz, PAGE_SIZE);
> +		rq->buff.page_order = order_base_2(npages);
> +
>  		byte_count |= MLX5_HW_START_PADDING;
>  		rq->mkey_be = c->mkey_be;
>  	}
> @@ -502,7 +515,7 @@ static void mlx5e_destroy_rq(struct mlx5e_rq *rq)
>  		mlx5e_rq_free_mpwqe_info(rq);
>  		break;
>  	default: /* MLX5_WQ_TYPE_LINKED_LIST */
> -		kfree(rq->skb);
> +		kfree(rq->dma_info);
>  	}
>  
>  	for (i = rq->page_cache.head; i != rq->page_cache.tail;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index 8e02af3..2f5bc6f 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -179,50 +179,99 @@ unlock:
>  	mutex_unlock(&priv->state_lock);
>  }
>  
> -int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
> +#define RQ_PAGE_SIZE(rq) ((1 << rq->buff.page_order) << PAGE_SHIFT)
> +
> +static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
> +				      struct mlx5e_dma_info *dma_info)
>  {
> -	struct sk_buff *skb;
> -	dma_addr_t dma_addr;
> +	struct mlx5e_page_cache *cache = &rq->page_cache;
> +	u32 tail_next = (cache->tail + 1) & (MLX5E_CACHE_SIZE - 1);
>  
> -	skb = napi_alloc_skb(rq->cq.napi, rq->wqe_sz);
> -	if (unlikely(!skb))
> -		return -ENOMEM;
> +	if (tail_next == cache->head) {
> +		rq->stats.cache_full++;
> +		return false;
> +	}
> +
> +	cache->page_cache[cache->tail] = *dma_info;
> +	cache->tail = tail_next;
> +	return true;
> +}
> +
> +static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq,
> +				      struct mlx5e_dma_info *dma_info)
> +{
> +	struct mlx5e_page_cache *cache = &rq->page_cache;
> +
> +	if (unlikely(cache->head == cache->tail)) {
> +		rq->stats.cache_empty++;
> +		return false;
> +	}
> +
> +	if (page_ref_count(cache->page_cache[cache->head].page) != 1) {
> +		rq->stats.cache_busy++;
> +		return false;
> +	}
> +
> +	*dma_info = cache->page_cache[cache->head];
> +	cache->head = (cache->head + 1) & (MLX5E_CACHE_SIZE - 1);
> +	rq->stats.cache_reuse++;
> +
> +	dma_sync_single_for_device(rq->pdev, dma_info->addr,
> +				   RQ_PAGE_SIZE(rq),
> +				   DMA_FROM_DEVICE);
> +	return true;
> +}
>  
> -	dma_addr = dma_map_single(rq->pdev,
> -				  /* hw start padding */
> -				  skb->data,
> -				  /* hw end padding */
> -				  rq->wqe_sz,
> -				  DMA_FROM_DEVICE);
> +static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
> +					  struct mlx5e_dma_info *dma_info)
> +{
> +	struct page *page;
>  
> -	if (unlikely(dma_mapping_error(rq->pdev, dma_addr)))
> -		goto err_free_skb;
> +	if (mlx5e_rx_cache_get(rq, dma_info))
> +		return 0;
>  
> -	*((dma_addr_t *)skb->cb) = dma_addr;
> -	wqe->data.addr = cpu_to_be64(dma_addr);
> +	page = dev_alloc_pages(rq->buff.page_order);
> +	if (unlikely(!page))
> +		return -ENOMEM;
>  
> -	rq->skb[ix] = skb;
> +	dma_info->page = page;
> +	dma_info->addr = dma_map_page(rq->pdev, page, 0,
> +				      RQ_PAGE_SIZE(rq), DMA_FROM_DEVICE);
> +	if (unlikely(dma_mapping_error(rq->pdev, dma_info->addr))) {
> +		put_page(page);
> +		return -ENOMEM;
> +	}
>  
>  	return 0;
> +}
>  
> -err_free_skb:
> -	dev_kfree_skb(skb);
> +void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
> +			bool recycle)
> +{
> +	if (likely(recycle) && mlx5e_rx_cache_put(rq, dma_info))
> +		return;
> +
> +	dma_unmap_page(rq->pdev, dma_info->addr, RQ_PAGE_SIZE(rq),
> +		       DMA_FROM_DEVICE);
> +	put_page(dma_info->page);
> +}
> +
> +int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
> +{
> +	struct mlx5e_dma_info *di = &rq->dma_info[ix];
>  
> -	return -ENOMEM;
> +	if (unlikely(mlx5e_page_alloc_mapped(rq, di)))
> +		return -ENOMEM;
> +
> +	wqe->data.addr = cpu_to_be64(di->addr + MLX5_RX_HEADROOM);
> +	return 0;
>  }
>  
>  void mlx5e_dealloc_rx_wqe(struct mlx5e_rq *rq, u16 ix)
>  {
> -	struct sk_buff *skb = rq->skb[ix];
> +	struct mlx5e_dma_info *di = &rq->dma_info[ix];
>  
> -	if (skb) {
> -		rq->skb[ix] = NULL;
> -		dma_unmap_single(rq->pdev,
> -				 *((dma_addr_t *)skb->cb),
> -				 rq->wqe_sz,
> -				 DMA_FROM_DEVICE);
> -		dev_kfree_skb(skb);
> -	}
> +	mlx5e_page_release(rq, di, true);
>  }
>  
>  static inline int mlx5e_mpwqe_strides_per_page(struct mlx5e_rq *rq)
> @@ -305,79 +354,6 @@ static inline void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
>  	mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
>  }
>  
> -static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
> -				      struct mlx5e_dma_info *dma_info)
> -{
> -	struct mlx5e_page_cache *cache = &rq->page_cache;
> -	u32 tail_next = (cache->tail + 1) & (MLX5E_CACHE_SIZE - 1);
> -
> -	if (tail_next == cache->head) {
> -		rq->stats.cache_full++;
> -		return false;
> -	}
> -
> -	cache->page_cache[cache->tail] = *dma_info;
> -	cache->tail = tail_next;
> -	return true;
> -}
> -
> -static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq,
> -				      struct mlx5e_dma_info *dma_info)
> -{
> -	struct mlx5e_page_cache *cache = &rq->page_cache;
> -
> -	if (unlikely(cache->head == cache->tail)) {
> -		rq->stats.cache_empty++;
> -		return false;
> -	}
> -
> -	if (page_ref_count(cache->page_cache[cache->head].page) != 1) {
> -		rq->stats.cache_busy++;
> -		return false;
> -	}
> -
> -	*dma_info = cache->page_cache[cache->head];
> -	cache->head = (cache->head + 1) & (MLX5E_CACHE_SIZE - 1);
> -	rq->stats.cache_reuse++;
> -
> -	dma_sync_single_for_device(rq->pdev, dma_info->addr, PAGE_SIZE,
> -				   DMA_FROM_DEVICE);
> -	return true;
> -}
> -
> -static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
> -					  struct mlx5e_dma_info *dma_info)
> -{
> -	struct page *page;
> -
> -	if (mlx5e_rx_cache_get(rq, dma_info))
> -		return 0;
> -
> -	page = dev_alloc_page();
> -	if (unlikely(!page))
> -		return -ENOMEM;
> -
> -	dma_info->page = page;
> -	dma_info->addr = dma_map_page(rq->pdev, page, 0, PAGE_SIZE,
> -				      DMA_FROM_DEVICE);
> -	if (unlikely(dma_mapping_error(rq->pdev, dma_info->addr))) {
> -		put_page(page);
> -		return -ENOMEM;
> -	}
> -
> -	return 0;
> -}
> -
> -void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
> -			bool recycle)
> -{
> -	if (likely(recycle) && mlx5e_rx_cache_put(rq, dma_info))
> -		return;
> -
> -	dma_unmap_page(rq->pdev, dma_info->addr, PAGE_SIZE, DMA_FROM_DEVICE);
> -	put_page(dma_info->page);
> -}
> -
>  static int mlx5e_alloc_rx_umr_mpwqe(struct mlx5e_rq *rq,
>  				    struct mlx5e_rx_wqe *wqe,
>  				    u16 ix)
> @@ -448,7 +424,7 @@ void mlx5e_post_rx_mpwqe(struct mlx5e_rq *rq)
>  	mlx5_wq_ll_update_db_record(wq);
>  }
>  
> -int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe,	u16 ix)
> +int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
>  {
>  	int err;
>  
> @@ -650,31 +626,46 @@ static inline void mlx5e_complete_rx_cqe(struct mlx5e_rq *rq,
>  
>  void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
>  {
> +	struct mlx5e_dma_info *di;
>  	struct mlx5e_rx_wqe *wqe;
> -	struct sk_buff *skb;
>  	__be16 wqe_counter_be;
> +	struct sk_buff *skb;
>  	u16 wqe_counter;
>  	u32 cqe_bcnt;
> +	void *va;
>  
>  	wqe_counter_be = cqe->wqe_counter;
>  	wqe_counter    = be16_to_cpu(wqe_counter_be);
>  	wqe            = mlx5_wq_ll_get_wqe(&rq->wq, wqe_counter);
> -	skb            = rq->skb[wqe_counter];
> -	prefetch(skb->data);
> -	rq->skb[wqe_counter] = NULL;
> +	di             = &rq->dma_info[wqe_counter];
> +	va             = page_address(di->page);
>  
> -	dma_unmap_single(rq->pdev,
> -			 *((dma_addr_t *)skb->cb),
> -			 rq->wqe_sz,
> -			 DMA_FROM_DEVICE);
> +	dma_sync_single_range_for_cpu(rq->pdev,
> +				      di->addr,
> +				      MLX5_RX_HEADROOM,
> +				      rq->buff.wqe_sz,
> +				      DMA_FROM_DEVICE);
> +	prefetch(va + MLX5_RX_HEADROOM);
>  
>  	if (unlikely((cqe->op_own >> 4) != MLX5_CQE_RESP_SEND)) {
>  		rq->stats.wqe_err++;
> -		dev_kfree_skb(skb);
> +		mlx5e_page_release(rq, di, true);
>  		goto wq_ll_pop;
>  	}
>  
> +	skb = build_skb(va, RQ_PAGE_SIZE(rq));
> +	if (unlikely(!skb)) {
> +		rq->stats.buff_alloc_err++;
> +		mlx5e_page_release(rq, di, true);
> +		goto wq_ll_pop;
> +	}
> +
> +	/* queue up for recycling ..*/
> +	page_ref_inc(di->page);
> +	mlx5e_page_release(rq, di, true);
> +
>  	cqe_bcnt = be32_to_cpu(cqe->byte_cnt);
> +	skb_reserve(skb, MLX5_RX_HEADROOM);
>  	skb_put(skb, cqe_bcnt);
>  
>  	mlx5e_complete_rx_cqe(rq, cqe, cqe_bcnt, skb);



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
       [not found]                 ` <1473272346.10725.73.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
@ 2016-09-07 20:09                   ` Saeed Mahameed via iovisor-dev
  0 siblings, 0 replies; 72+ messages in thread
From: Saeed Mahameed via iovisor-dev @ 2016-09-07 20:09 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linux Netdev List, iovisor-dev, Jamal Hadi Salim, Saeed Mahameed,
	Eric Dumazet, Tom Herbert

On Wed, Sep 7, 2016 at 9:19 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Wed, 2016-09-07 at 19:57 +0300, Saeed Mahameed wrote:
>
>> Jesper has a similar Idea to make the qdisc think it is under
>> pressure, when the device
>> TX ring is idle most of the time, i think his idea can come in handy here.
>> I am not fully involved in the details, maybe he can elaborate more.
>>
>> But if it works, it will be transparent to napi, and xmit more will
>> happen by design.
>
> I do not think qdisc is relevant here.
>
> Right now, skb->xmit_more is set only by qdisc layer (and pktgen tool),
> because only this layer can know if more packets are to come.
>
>
> What I am saying is that regardless of skb->xmit_more being set or not,
> (for example if no qdisc is even used)
> a NAPI driver can arm a bit asking the doorbell being sent at the end of
> NAPI.
>
> I am not saying this must be done, only that the idea could be extended
> to non XDP world, if we care enough.
>

Yes, and I am just trying to suggest ideas that do not require
communication between RX (NAPI) and TX.

The problem here is the synchronization (ringing the TX doorbell from
the RX path), which is not as simple as an atomic operation for some
drivers.
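
(For illustration only, a rough sketch of the "arm a flag, ring the
doorbell once at the end of NAPI" idea discussed above; every type and
helper name below is made up, this is not mlx5 code:)

	struct my_xdp_sq {
		bool db_pending;	/* set by the XDP_TX path instead of ringing per packet */
		/* ... ring state ... */
	};

	struct my_channel {
		struct napi_struct napi;
		struct my_xdp_sq   xdp_sq;
	};

	static int my_napi_poll(struct napi_struct *napi, int budget)
	{
		struct my_channel *c = container_of(napi, struct my_channel, napi);
		int work;

		/* RX processing may queue XDP_TX frames and only arm db_pending */
		work = my_process_rx_cq(c, budget);

		if (c->xdp_sq.db_pending) {
			my_ring_tx_doorbell(&c->xdp_sq);	/* one doorbell per poll */
			c->xdp_sq.db_pending = false;
		}

		if (work < budget)
			napi_complete_done(napi, work);
		return work;
	}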

How about RX bulking?  It can also help here: in the forwarding case,
the forwarding path will be able to process a bulk of RX SKBs and
bulk-xmit the portion of SKBs that will be forwarded.

As Jesper suggested, let's talk at NetDev 1.2 in Jesper's session (if
you are joining, of course).

Thanks
Saeed.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support
       [not found]   ` <1473252152-11379-9-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-09-07 20:55     ` Or Gerlitz via iovisor-dev
       [not found]       ` <CAJ3xEMgsGHqQ7x8wky6Sfs34Ry67PnZEhYmnK=g8XnnXbgWagg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 72+ messages in thread
From: Or Gerlitz via iovisor-dev @ 2016-09-07 20:55 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Linux Netdev List, iovisor-dev, Jamal Hadi Salim, Eric Dumazet,
	Tom Herbert, Rana Shahout

On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> From: Rana Shahout <ranas-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>
> Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx5e driver.
>
> When XDP is on we make sure to change channels RQs type to
> MLX5_WQ_TYPE_LINKED_LIST rather than "striding RQ" type to
> ensure "page per packet".
>
> On XDP set, we fail if HW LRO is set and request from user to turn it
> off.  Since on ConnectX4-LX HW LRO is always on by default, this will be
> annoying, but we prefer not to enforce LRO off from XDP set function.
>
> Full channels reset (close/open) is required only when setting XDP
> on/off.
>
> When XDP set is called just to exchange programs, we will update
> each RQ xdp program on the fly and for synchronization with current
> data path RX activity of that RQ, we temporally disable that RQ and
> ensure RX path is not running, quickly update and re-enable that RQ,
> for that we do:
>         - rq.state = disabled
>         - napi_synnchronize
>         - xchg(rq->xdp_prg)
>         - rq.state = enabled
>         - napi_schedule // Just in case we've missed an IRQ
>
> Packet rate performance testing was done with pktgen 64B packets and on
> TX side and, TC drop action on RX side compared to XDP fast drop.
>
> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
>
> Comparison is done between:
>         1. Baseline, Before this patch with TC drop action
>         2. This patch with TC drop action
>         3. This patch with XDP RX fast drop
>
> Streams    Baseline(TC drop)    TC drop    XDP fast Drop
> --------------------------------------------------------------
> 1           5.51Mpps            5.14Mpps     13.5Mpps

This (13.5 Mpps) is less than 50% of the result we presented at the
XDP summit, which was obtained by Rana. Please see if/how much this
grows if you use more sender threads, but have all of them xmit the
same stream/flows, so we're on one ring. That (XDP with a single RX
ring getting packets from N remote TX rings) would be your canonical
baseline for any further numbers.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support
       [not found]       ` <CAJ3xEMgsGHqQ7x8wky6Sfs34Ry67PnZEhYmnK=g8XnnXbgWagg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-09-07 21:53         ` Saeed Mahameed via iovisor-dev
       [not found]           ` <CALzJLG9C0PgJWFi9hc7LrhZJejOHmWOjn0Lu-jiPekoyTGq1Ng-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2016-09-08  7:38         ` Jesper Dangaard Brouer via iovisor-dev
  1 sibling, 1 reply; 72+ messages in thread
From: Saeed Mahameed via iovisor-dev @ 2016-09-07 21:53 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Tom Herbert, iovisor-dev, Jamal Hadi Salim, Saeed Mahameed,
	Eric Dumazet, Linux Netdev List, Rana Shahout

On Wed, Sep 7, 2016 at 11:55 PM, Or Gerlitz via iovisor-dev
<iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy@public.gmane.org> wrote:
> On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>> From: Rana Shahout <ranas-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>
>> Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx5e driver.
>>
>> When XDP is on we make sure to change channels RQs type to
>> MLX5_WQ_TYPE_LINKED_LIST rather than "striding RQ" type to
>> ensure "page per packet".
>>
>> On XDP set, we fail if HW LRO is set and request from user to turn it
>> off.  Since on ConnectX4-LX HW LRO is always on by default, this will be
>> annoying, but we prefer not to enforce LRO off from XDP set function.
>>
>> Full channels reset (close/open) is required only when setting XDP
>> on/off.
>>
>> When XDP set is called just to exchange programs, we will update
>> each RQ xdp program on the fly and for synchronization with current
>> data path RX activity of that RQ, we temporally disable that RQ and
>> ensure RX path is not running, quickly update and re-enable that RQ,
>> for that we do:
>>         - rq.state = disabled
>>         - napi_synnchronize
>>         - xchg(rq->xdp_prg)
>>         - rq.state = enabled
>>         - napi_schedule // Just in case we've missed an IRQ
>>
>> Packet rate performance testing was done with pktgen 64B packets and on
>> TX side and, TC drop action on RX side compared to XDP fast drop.
>>
>> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
>>
>> Comparison is done between:
>>         1. Baseline, Before this patch with TC drop action
>>         2. This patch with TC drop action
>>         3. This patch with XDP RX fast drop
>>
>> Streams    Baseline(TC drop)    TC drop    XDP fast Drop
>> --------------------------------------------------------------
>> 1           5.51Mpps            5.14Mpps     13.5Mpps
>
> This (13.5 M PPS) is less than 50% of the result we presented @ the
> XDP summit which was obtained by Rana. Please see if/how much does
> this grows if you use more sender threads, but all of them to xmit the
> same stream/flows, so we're on one ring. That (XDP with single RX ring
> getting packets from N remote TX rings) would be your canonical
> base-line for any further numbers.
>

I used N TX senders sending 48Mpps to a single RX core.
The single RX core could handle only 13.5Mpps.

The implementation here is different from the one we presented at the
summit: before, it was with striding RQ; now it is a regular linked-list
RQ (a striding RQ ring can handle 32K 64B packets, while a regular RQ
ring handles only 1K).

In striding RQ we register only 16 HW descriptors for every 32K packets,
i.e. for every 32K packets we access the HW only 16 times.  On the other
hand, a regular RQ accesses the HW (registers a descriptor) once per
packet, i.e. we write to the HW 1K times for 1K packets, which is
roughly three orders of magnitude more descriptor writes for the same
amount of traffic.  I think this explains the difference.

The catch here is that we can't use striding RQ for XDP, bummer!

As I said, we will have the full and final performance results in V1.
This is just an RFC with only quick-and-dirty testing.


> _______________________________________________
> iovisor-dev mailing list
> iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy@public.gmane.org
> https://lists.iovisor.org/mailman/listinfo/iovisor-dev

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
       [not found]                 ` <20160907202234.55e18ef3-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2016-09-08  2:58                   ` John Fastabend via iovisor-dev
       [not found]                     ` <57D0D3EA.1090004-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 72+ messages in thread
From: John Fastabend via iovisor-dev @ 2016-09-08  2:58 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Saeed Mahameed
  Cc: Eric Dumazet, Linux Netdev List, iovisor-dev, Jamal Hadi Salim,
	Saeed Mahameed, Eric Dumazet, Tom Herbert

On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:
> 
> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed <saeedm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>> On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:  
>>>> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:  
>>>>> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:  
> [...]
>>>
>>> Only if a qdisc is present and pressure is high enough.
>>>
>>> But in a forwarding setup, we likely receive at a lower rate than the
>>> NIC can transmit.
> 
> Yes, I can confirm this happens in my experiments.
> 
>>>  
>>
>> Jesper has a similar Idea to make the qdisc think it is under
>> pressure, when the device TX ring is idle most of the time, i think
>> his idea can come in handy here. I am not fully involved in the
>> details, maybe he can elaborate more.
>>
>> But if it works, it will be transparent to napi, and xmit more will
>> happen by design.
> 
> Yes. I have some ideas around getting more bulking going from the qdisc
> layer, by having the drivers provide some feedback to the qdisc layer
> indicating xmit_more should be possible.  This will be a topic at the
> Network Performance Workshop[1] at NetDev 1.2, I have will hopefully
> challenge people to come up with a good solution ;-)
> 

One thing I've noticed, but haven't yet actually analyzed much, is that
if I shrink the NIC descriptor ring size to be only slightly larger than
the qdisc-layer bulking size, I get more bulking and better perf
numbers, at least on microbenchmarks. The reason is that the NIC pushes
back more on the qdisc. So maybe there is a case for making the ring
size in the NIC some factor of the expected number of queues feeding the
descriptor ring.

.John

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
       [not found]                     ` <57D0D3EA.1090004-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2016-09-08  3:21                       ` Tom Herbert via iovisor-dev
  2016-09-08  5:11                         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 72+ messages in thread
From: Tom Herbert via iovisor-dev @ 2016-09-08  3:21 UTC (permalink / raw)
  To: John Fastabend
  Cc: Eric Dumazet, Linux Netdev List, iovisor-dev, Jamal Hadi Salim,
	Saeed Mahameed, Eric Dumazet

On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend <john.fastabend-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:
>>
>> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed <saeedm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
>>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>> On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
>>>>> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>>>> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
>> [...]
>>>>
>>>> Only if a qdisc is present and pressure is high enough.
>>>>
>>>> But in a forwarding setup, we likely receive at a lower rate than the
>>>> NIC can transmit.
>>
>> Yes, I can confirm this happens in my experiments.
>>
>>>>
>>>
>>> Jesper has a similar Idea to make the qdisc think it is under
>>> pressure, when the device TX ring is idle most of the time, i think
>>> his idea can come in handy here. I am not fully involved in the
>>> details, maybe he can elaborate more.
>>>
>>> But if it works, it will be transparent to napi, and xmit more will
>>> happen by design.
>>
>> Yes. I have some ideas around getting more bulking going from the qdisc
>> layer, by having the drivers provide some feedback to the qdisc layer
>> indicating xmit_more should be possible.  This will be a topic at the
>> Network Performance Workshop[1] at NetDev 1.2, I have will hopefully
>> challenge people to come up with a good solution ;-)
>>
>
> One thing I've noticed but haven't yet actually analyzed much is if
> I shrink the nic descriptor ring size to only be slightly larger than
> the qdisc layer bulking size I get more bulking and better perf numbers.
> At least on microbenchmarks. The reason being the nic pushes back more
> on the qdisc. So maybe a case for making the ring size in the NIC some
> factor of the expected number of queues feeding the descriptor ring.
>

BQL is not helping with that?

Tom

> .John

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
  2016-09-08  3:21                       ` Tom Herbert via iovisor-dev
@ 2016-09-08  5:11                         ` Jesper Dangaard Brouer
       [not found]                           ` <20160908071119.776cce56-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 72+ messages in thread
From: Jesper Dangaard Brouer @ 2016-09-08  5:11 UTC (permalink / raw)
  To: Tom Herbert
  Cc: John Fastabend, Saeed Mahameed, Eric Dumazet, Saeed Mahameed,
	iovisor-dev, Linux Netdev List, Tariq Toukan, Brenden Blanco,
	Alexei Starovoitov, Martin KaFai Lau, Daniel Borkmann,
	Eric Dumazet, Jamal Hadi Salim, brouer


On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbert <tom@herbertland.com> wrote:

> On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend <john.fastabend@gmail.com> wrote:
> > On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:  
> >>
> >> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:  
> >>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:  
> >>>> On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:  
> >>>>> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:  
> >>>>>> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:  
> >> [...]  
> >>>>
> >>>> Only if a qdisc is present and pressure is high enough.
> >>>>
> >>>> But in a forwarding setup, we likely receive at a lower rate than the
> >>>> NIC can transmit.  
> >>
> >> Yes, I can confirm this happens in my experiments.
> >>  
> >>>>  
> >>>
> >>> Jesper has a similar Idea to make the qdisc think it is under
> >>> pressure, when the device TX ring is idle most of the time, i think
> >>> his idea can come in handy here. I am not fully involved in the
> >>> details, maybe he can elaborate more.
> >>>
> >>> But if it works, it will be transparent to napi, and xmit more will
> >>> happen by design.  
> >>
> >> Yes. I have some ideas around getting more bulking going from the qdisc
> >> layer, by having the drivers provide some feedback to the qdisc layer
> >> indicating xmit_more should be possible.  This will be a topic at the
> >> Network Performance Workshop[1] at NetDev 1.2, I have will hopefully
> >> challenge people to come up with a good solution ;-)
> >>  
> >
> > One thing I've noticed but haven't yet actually analyzed much is if
> > I shrink the nic descriptor ring size to only be slightly larger than
> > the qdisc layer bulking size I get more bulking and better perf numbers.
> > At least on microbenchmarks. The reason being the nic pushes back more
> > on the qdisc. So maybe a case for making the ring size in the NIC some
> > factor of the expected number of queues feeding the descriptor ring.
> >  

I've also played with shrinking the NIC descriptor ring size; it works,
but it is an ugly hack to get the NIC to push back, and I foresee it will
hurt normal use-cases. (There are other reasons for shrinking the ring
size, like cache usage, but that is unrelated to this).

 
> BQL is not helping with that?

Exactly. But the BQL _byte_ limit is not what is needed; what we need
to know is the number of _packets_ currently "in-flight".  Which Tom
already has a patch for :-)  Once we have that, the algorithm is simple.

Qdisc dequeue looks at the BQL pkts-in-flight count; if the driver has
"enough" packets in-flight, the qdisc starts its bulk-dequeue building
phase before calling the driver. The allowed max qdisc bulk size should
likely be related to pkts-in-flight.
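
To make the algorithm concrete, here is a minimal sketch of the dequeue
decision.  Note that dql_pkts_in_flight() stands in for Tom's
packet-counting patch and the thresholds are made up; none of this is an
existing kernel API:

#include <linux/netdevice.h>

/* Hypothetical names for illustration only. */
#define QDISC_BULK_THRESHOLD	4
#define QDISC_BULK_MAX		16

static unsigned int qdisc_bulk_budget(struct netdev_queue *txq)
{
	unsigned int inflight = dql_pkts_in_flight(&txq->dql);

	if (inflight < QDISC_BULK_THRESHOLD)
		return 1;	/* device close to idle: hand over one packet now */

	/* Device has enough packets in-flight that it will not run dry
	 * while the qdisc spends time building a bulk, so amortize the
	 * doorbell over a larger dequeue.
	 */
	return inflight < QDISC_BULK_MAX ? inflight : QDISC_BULK_MAX;
}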

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support
       [not found]           ` <CALzJLG9C0PgJWFi9hc7LrhZJejOHmWOjn0Lu-jiPekoyTGq1Ng-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-09-08  7:10             ` Or Gerlitz via iovisor-dev
  0 siblings, 0 replies; 72+ messages in thread
From: Or Gerlitz via iovisor-dev @ 2016-09-08  7:10 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Tom Herbert, iovisor-dev, Jamal Hadi Salim, Saeed Mahameed,
	Eric Dumazet, Linux Netdev List, Rana Shahout

On Thu, Sep 8, 2016 at 12:53 AM, Saeed Mahameed
<saeedm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
> On Wed, Sep 7, 2016 at 11:55 PM, Or Gerlitz via iovisor-dev
> <iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy@public.gmane.org> wrote:
>> On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>>> From: Rana Shahout <ranas-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>>
>>> Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx5e driver.
>>>
>>> When XDP is on we make sure to change channels RQs type to
>>> MLX5_WQ_TYPE_LINKED_LIST rather than "striding RQ" type to
>>> ensure "page per packet".
>>>
>>> On XDP set, we fail if HW LRO is set and request from user to turn it
>>> off.  Since on ConnectX4-LX HW LRO is always on by default, this will be
>>> annoying, but we prefer not to enforce LRO off from XDP set function.
>>>
>>> Full channels reset (close/open) is required only when setting XDP
>>> on/off.
>>>
>>> When XDP set is called just to exchange programs, we will update
>>> each RQ xdp program on the fly and for synchronization with current
>>> data path RX activity of that RQ, we temporally disable that RQ and
>>> ensure RX path is not running, quickly update and re-enable that RQ,
>>> for that we do:
>>>         - rq.state = disabled
>>>         - napi_synnchronize
>>>         - xchg(rq->xdp_prg)
>>>         - rq.state = enabled
>>>         - napi_schedule // Just in case we've missed an IRQ
>>>
>>> Packet rate performance testing was done with pktgen 64B packets and on
>>> TX side and, TC drop action on RX side compared to XDP fast drop.
>>>
>>> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
>>>
>>> Comparison is done between:
>>>         1. Baseline, Before this patch with TC drop action
>>>         2. This patch with TC drop action
>>>         3. This patch with XDP RX fast drop
>>>
>>> Streams    Baseline(TC drop)    TC drop    XDP fast Drop
>>> --------------------------------------------------------------
>>> 1           5.51Mpps            5.14Mpps     13.5Mpps
>>
>> This (13.5 M PPS) is less than 50% of the result we presented @ the
>> XDP summit which was obtained by Rana. Please see if/how much does
>> this grows if you use more sender threads, but all of them to xmit the
>> same stream/flows, so we're on one ring. That (XDP with single RX ring
>> getting packets from N remote TX rings) would be your canonical
>> base-line for any further numbers.
>>
>
> I used N TX senders sending 48Mpps to a single RX core.
> The single RX core could handle only 13.5Mpps.
>
> The implementation here is different from the one we presented at the
> summit, before, it was with striding RQ, now it is regular linked list
> RQ, (Striding RQ ring can handle 32K 64B packets and regular RQ rings
> handles only 1K)

> In striding RQ we register only 16 HW descriptors for every 32K
> packets. I.e for
> every 32K packets we access the HW only 16 times.  on the other hand,
> regular RQ will access the HW (register descriptors) once per packet,
> i.e we write to HW 1K time for 1K packets. i think this explains the
> difference.

> the catch here is that we can't use striding RQ for XDP, bummer!

yep, sounds like a bum bum bum (we went from >30M PPS to 13.5M PPS).

We used striding RQ for XDP with the previous implementation, and I don't
see a really deep reason not to do so also now that striding RQ doesn't
use compound pages any more.  I guess there are more details I need to
catch up with here, but the bottom-line result is not good and we need to
re-think.

> As i said, we will have the full and final performance results on V1.
> This is just a RFC with barely quick and dirty testing

Yep, understood. But in parallel, you need to reconsider how to get along
without that bumming down of numbers.

Or.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support
       [not found]                 ` <CALzJLG9bu3-=Ybq+Lk1fvAe5AohVHAaPpa9RQqd1QVe-7XPyhw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-09-08  7:19                   ` Jesper Dangaard Brouer via iovisor-dev
  0 siblings, 0 replies; 72+ messages in thread
From: Jesper Dangaard Brouer via iovisor-dev @ 2016-09-08  7:19 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Tom Herbert, iovisor-dev, Jamal Hadi Salim, Saeed Mahameed,
	Eric Dumazet, Linux Netdev List, Rana Shahout, Or Gerlitz

On Wed, 7 Sep 2016 20:07:01 +0300
Saeed Mahameed <saeedm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:

> On Wed, Sep 7, 2016 at 7:54 PM, Tom Herbert <tom-BjP2VixgY4xUbtYUoyoikg@public.gmane.org> wrote:
> > On Wed, Sep 7, 2016 at 7:48 AM, Saeed Mahameed
> > <saeedm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:  
> >> On Wed, Sep 7, 2016 at 4:32 PM, Or Gerlitz <gerlitz.or-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:  
> >>> On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> >>>  
> >>>> Packet rate performance testing was done with pktgen 64B packets and on
> >>>> TX side and, TC drop action on RX side compared to XDP fast drop.
> >>>>
> >>>> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
> >>>>
> >>>> Comparison is done between:
> >>>>         1. Baseline, Before this patch with TC drop action
> >>>>         2. This patch with TC drop action
> >>>>         3. This patch with XDP RX fast drop
> >>>>
> >>>> Streams    Baseline(TC drop)    TC drop    XDP fast Drop
> >>>> --------------------------------------------------------------
> >>>> 1           5.51Mpps            5.14Mpps     13.5Mpps
> >>>> 2           11.5Mpps            10.0Mpps     25.1Mpps
> >>>> 4           16.3Mpps            17.2Mpps     35.4Mpps
> >>>> 8           29.6Mpps            28.2Mpps     45.8Mpps*
> >>>> 16          34.0Mpps            30.1Mpps     45.8Mpps*  
> >>>
> >>> Rana, Guys, congrat!!
> >>>
> >>> When you say X streams, does each stream mapped by RSS to different RX ring?
> >>> or we're on the same RX ring for all rows of the above table?  
> >>
> >> Yes, I will make this more clear in the actual submission,
> >> Here we are talking about different RSS core rings.
> >>  
> >>>
> >>> In the CX3 work, we had X sender "streams" that all mapped to the same RX ring,
> >>> I don't think we went beyond one RX ring.  
> >>
> >> Here we did, the first row is what you are describing the other rows
> >> are the same test
> >> with increasing the number of the RSS receiving cores, The xmit side is sending
> >> as many streams as possible to be as much uniformly spread as possible
> >> across the
> >> different RSS cores on the receiver.
> >>  
> > Hi Saeed,
> >
> > Please report CPU utilization also. The expectation is that
> > performance should scale linearly with increasing number of CPUs (i.e.
> > pps/CPU_utilization should be constant).
> >  
> 
> That was my expectation too.

Be careful with such expectations at these extreme speeds, because we
are starting to hit PCI-express limitations and CPU cache-coherency
limitations (if any atomic/RMW operations still exist per packet).

Consider that in the small-packet (64 byte) case, the driver's PCI
bandwidth need/overhead is actually quite large, as every descriptor is
also a 64 byte transfer.
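
A rough back-of-envelope (my numbers, just to illustrate the point): at
23Mpps with 64 byte packets and a 64 byte descriptor per packet, the
descriptor traffic alone roughly equals the packet data:

 23*10^6 * 64 bytes = ~1.47 GBytes/s = ~11.8 Gbit/s for descriptors
 23*10^6 * 64 bytes = ~1.47 GBytes/s = ~11.8 Gbit/s for packet data

So the PCIe link moves roughly twice the wire data volume, before even
counting completion writes and doorbells.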

 
> Anyway we will share more accurate results when we have them, with CPU
> utilization statistics as well.

It is interesting to monitor the CPU utilization, because (if C-states
are enabled) you will likely see the CPU frequency drop, or the CPU even
enter idle states, in case your software (XDP) gets faster than the HW
(PCI or NIC).  I've seen that happen with mlx4/CX3-pro.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support
       [not found]       ` <CAJ3xEMgsGHqQ7x8wky6Sfs34Ry67PnZEhYmnK=g8XnnXbgWagg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2016-09-07 21:53         ` Saeed Mahameed via iovisor-dev
@ 2016-09-08  7:38         ` Jesper Dangaard Brouer via iovisor-dev
  2016-09-08  9:31           ` Or Gerlitz
  1 sibling, 1 reply; 72+ messages in thread
From: Jesper Dangaard Brouer via iovisor-dev @ 2016-09-08  7:38 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Linux Netdev List, iovisor-dev, Jamal Hadi Salim, Saeed Mahameed,
	Eric Dumazet, Tom Herbert, Rana Shahout

On Wed, 7 Sep 2016 23:55:42 +0300
Or Gerlitz <gerlitz.or-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> > From: Rana Shahout <ranas-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >
> > Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx5e driver.
> >
> > When XDP is on we make sure to change channels RQs type to
> > MLX5_WQ_TYPE_LINKED_LIST rather than "striding RQ" type to
> > ensure "page per packet".
> >
> > On XDP set, we fail if HW LRO is set and request from user to turn it
> > off.  Since on ConnectX4-LX HW LRO is always on by default, this will be
> > annoying, but we prefer not to enforce LRO off from XDP set function.
> >
> > Full channels reset (close/open) is required only when setting XDP
> > on/off.
> >
> > When XDP set is called just to exchange programs, we will update
> > each RQ xdp program on the fly and for synchronization with current
> > data path RX activity of that RQ, we temporally disable that RQ and
> > ensure RX path is not running, quickly update and re-enable that RQ,
> > for that we do:
> >         - rq.state = disabled
> >         - napi_synnchronize
> >         - xchg(rq->xdp_prg)
> >         - rq.state = enabled
> >         - napi_schedule // Just in case we've missed an IRQ
> >
> > Packet rate performance testing was done with pktgen 64B packets and on
> > TX side and, TC drop action on RX side compared to XDP fast drop.
> >
> > CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
> >
> > Comparison is done between:
> >         1. Baseline, Before this patch with TC drop action
> >         2. This patch with TC drop action
> >         3. This patch with XDP RX fast drop
> >
> > Streams    Baseline(TC drop)    TC drop    XDP fast Drop
> > --------------------------------------------------------------
> > 1           5.51Mpps            5.14Mpps     13.5Mpps  
> 
> This (13.5 M PPS) is less than 50% of the result we presented @ the
> XDP summit which was obtained by Rana. Please see if/how much does
> this grows if you use more sender threads, but all of them to xmit the
> same stream/flows, so we're on one ring. That (XDP with single RX ring
> getting packets from N remote TX rings) would be your canonical
> base-line for any further numbers.

Well, my experiments with this hardware (mlx5/CX4 at 50Gbit/s) show
that you should be able to reach 23Mpps on a single CPU.  This is
an XDP-drop simulation with order-0 pages being recycled through my
page_pool code, plus avoiding the cache-misses (notice you are using a
CPU E5-2680 with DDIO, thus you should only see an L3 cache miss).

The 23Mpps number looks like some HW limitation, as the increase is
not proportional to the page-allocator overhead I removed (and the CPU
freq starts to decrease).  I also did scaling tests to more CPUs, which
showed it scaled up to 40Mpps (you reported 45M).  And at the Phy RX
level I see 60Mpps (50G max is 74Mpps).

Notice this is a significant improvement over the mlx4/CX3-pro HW,
which only scales up to 20Mpps, but can also do 20Mpps XDP-drop on a
single core.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 72+ messages in thread

* README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
       [not found]   ` <1473252152-11379-12-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-09-08  8:11     ` Jesper Dangaard Brouer via iovisor-dev
       [not found]       ` <20160908101147.1b351432-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 72+ messages in thread
From: Jesper Dangaard Brouer via iovisor-dev @ 2016-09-08  8:11 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, iovisor-dev, Jamal Hadi Salim,
	Eric Dumazet, Tom Herbert


I'm sorry but I have a problem with this patch!

Looking at this patch, I want to bring up a fundamental architectural
concern with the development direction of XDP transmit.


What you are trying to implement, with delaying the doorbell, is
basically TX bulking for XDP_TX.

 Why not implement a TX bulking interface directly instead?!?

Yes, the tailptr/doorbell is the most costly operation, but why not
also take advantage of the benefits of bulking for other parts of the
code? (The benefit is smaller, but every cycle counts in this area.)

This whole XDP exercise is about avoiding a per-packet transaction cost;
read: "bulking" or "bundling" of packets, where possible.

 Let's do bundling/bulking from the start!

The reason behind the xmit_more API is that we could not change the
API of all the drivers.  And we found that calling an explicit NDO
flush came at a cost (only approx 7 ns IIRC), but it is still a cost
that would hit the common single-packet use-case.

It should be really easy to build a bundle of packets that need XDP_TX
action, especially given you only have a single destination "port".
And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record.
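
Something along these lines would do it (a sketch only; struct
xdp_tx_bundle and the helpers below are made-up names, not a proposed
API; only mlx5e_xmit_xdp_frame(), mlx5e_xmit_xdp_doorbell() and
MLX5_RX_HEADROOM come from the patch itself):

#define XDP_TX_BUNDLE_MAX	64	/* <= NAPI budget */

struct xdp_tx_bundle {
	unsigned int		count;
	struct mlx5e_dma_info	*di[XDP_TX_BUNDLE_MAX];
	unsigned int		len[XDP_TX_BUNDLE_MAX];
};

/* In the RX poll loop: collect packets instead of xmit'ing one by one */
static inline void xdp_tx_bundle_add(struct xdp_tx_bundle *b,
				     struct mlx5e_dma_info *di,
				     unsigned int len)
{
	b->di[b->count]  = di;
	b->len[b->count] = len;
	b->count++;
}

/* After the poll loop: post all WQEs, then ring the doorbell once.
 * SQ-full handling is omitted for brevity.
 */
static inline void xdp_tx_bundle_flush(struct mlx5e_sq *sq,
				       struct xdp_tx_bundle *b)
{
	unsigned int i;

	for (i = 0; i < b->count; i++)
		mlx5e_xmit_xdp_frame(sq, b->di[i], MLX5_RX_HEADROOM,
				     b->len[i]);
	if (b->count)
		mlx5e_xmit_xdp_doorbell(sq);
	b->count = 0;
}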

In the future, XDP needs to support XDP_FWD forwarding of packets/pages
out other interfaces.  I also want bulk transmit from day-1 here.  It
is slightly more tricky to sort packets for multiple outgoing
interfaces efficiently in the poll loop.

But the mSwitch[1] article has actually already solved this destination
sorting.  Please read [1] section 3.3 "Switch Fabric Algorithm" to
understand the next steps toward a smarter data structure once we
start to have more TX "ports".  And perhaps align your single
XDP_TX destination data structure with this future development.

[1] http://info.iet.unipi.it/~luigi/papers/20150617-mswitch-paper.pdf

--Jesper
(top post)


On Wed,  7 Sep 2016 15:42:32 +0300 Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:

> Previously we rang XDP SQ doorbell on every forwarded XDP packet.
> 
> Here we introduce a xmit more like mechanism that will queue up more
> than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
> 
> Once RX napi budget is consumed and we exit napi RX loop, we will
> flush (doorbell) all XDP looped packets in case there are such.
> 
> XDP forward packet rate:
> 
> Comparing XDP with and w/o xmit more (bulk transmit):
> 
> Streams     XDP TX       XDP TX (xmit more)
> ---------------------------------------------------
> 1           4.90Mpps      7.50Mpps
> 2           9.50Mpps      14.8Mpps
> 4           16.5Mpps      25.1Mpps
> 8           21.5Mpps      27.5Mpps*
> 16          24.1Mpps      27.5Mpps*
> 
> *It seems we hit a wall of 27.5Mpps, for 8 and 16 streams,
> we will be working on the analysis and will publish the conclusions
> later.
> 
> Signed-off-by: Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en.h    |  9 ++--
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 57 +++++++++++++++++++------
>  2 files changed, 49 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index df2c9e0..6846208 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> @@ -265,7 +265,8 @@ struct mlx5e_cq {
>  
>  struct mlx5e_rq;
>  typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq *rq,
> -				       struct mlx5_cqe64 *cqe);
> +				       struct mlx5_cqe64 *cqe,
> +				       bool *xdp_doorbell);
>  typedef int (*mlx5e_fp_alloc_wqe)(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe,
>  				  u16 ix);
>  
> @@ -742,8 +743,10 @@ void mlx5e_free_sq_descs(struct mlx5e_sq *sq);
>  
>  void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
>  			bool recycle);
> -void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
> -void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
> +void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
> +			 bool *xdp_doorbell);
> +void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
> +			       bool *xdp_doorbell);
>  bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq);
>  int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
>  int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe,	u16 ix);
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index 912a0e2..ed93251 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -117,7 +117,8 @@ static inline void mlx5e_decompress_cqe_no_hash(struct mlx5e_rq *rq,
>  static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
>  					     struct mlx5e_cq *cq,
>  					     int update_owner_only,
> -					     int budget_rem)
> +					     int budget_rem,
> +					     bool *xdp_doorbell)
>  {
>  	u32 cqcc = cq->wq.cc + update_owner_only;
>  	u32 cqe_count;
> @@ -131,7 +132,7 @@ static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
>  			mlx5e_read_mini_arr_slot(cq, cqcc);
>  
>  		mlx5e_decompress_cqe_no_hash(rq, cq, cqcc);
> -		rq->handle_rx_cqe(rq, &cq->title);
> +		rq->handle_rx_cqe(rq, &cq->title, xdp_doorbell);
>  	}
>  	mlx5e_cqes_update_owner(cq, cq->wq.cc, cqcc - cq->wq.cc);
>  	cq->wq.cc = cqcc;
> @@ -143,15 +144,16 @@ static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
>  
>  static inline u32 mlx5e_decompress_cqes_start(struct mlx5e_rq *rq,
>  					      struct mlx5e_cq *cq,
> -					      int budget_rem)
> +					      int budget_rem,
> +					      bool *xdp_doorbell)
>  {
>  	mlx5e_read_title_slot(rq, cq, cq->wq.cc);
>  	mlx5e_read_mini_arr_slot(cq, cq->wq.cc + 1);
>  	mlx5e_decompress_cqe(rq, cq, cq->wq.cc);
> -	rq->handle_rx_cqe(rq, &cq->title);
> +	rq->handle_rx_cqe(rq, &cq->title, xdp_doorbell);
>  	cq->mini_arr_idx++;
>  
> -	return mlx5e_decompress_cqes_cont(rq, cq, 1, budget_rem) - 1;
> +	return mlx5e_decompress_cqes_cont(rq, cq, 1, budget_rem, xdp_doorbell) - 1;
>  }
>  
>  void mlx5e_modify_rx_cqe_compression(struct mlx5e_priv *priv, bool val)
> @@ -670,23 +672,36 @@ static inline bool mlx5e_xmit_xdp_frame(struct mlx5e_sq *sq,
>  	wi->num_wqebbs = MLX5E_XDP_TX_WQEBBS;
>  	sq->pc += MLX5E_XDP_TX_WQEBBS;
>  
> -	/* TODO: xmit more */
> +	/* mlx5e_sq_xmit_doorbel will be called after RX napi loop */
> +	return true;
> +}
> +
> +static inline void mlx5e_xmit_xdp_doorbell(struct mlx5e_sq *sq)
> +{
> +	struct mlx5_wq_cyc *wq = &sq->wq;
> +	struct mlx5e_tx_wqe *wqe;
> +	u16 pi = (sq->pc - MLX5E_XDP_TX_WQEBBS) & wq->sz_m1; /* last pi */
> +
> +	wqe  = mlx5_wq_cyc_get_wqe(wq, pi);
> +
>  	wqe->ctrl.fm_ce_se = MLX5_WQE_CTRL_CQ_UPDATE;
>  	mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
>  
> +#if 0 /* enable this code only if MLX5E_XDP_TX_WQEBBS > 1 */
>  	/* fill sq edge with nops to avoid wqe wrap around */
>  	while ((pi = (sq->pc & wq->sz_m1)) > sq->edge) {
>  		sq->db.xdp.wqe_info[pi].opcode = MLX5_OPCODE_NOP;
>  		mlx5e_send_nop(sq, false);
>  	}
> -	return true;
> +#endif
>  }
>  
>  /* returns true if packet was consumed by xdp */
>  static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
>  				    const struct bpf_prog *prog,
>  				    struct mlx5e_dma_info *di,
> -				    void *data, u16 len)
> +				    void *data, u16 len,
> +				    bool *xdp_doorbell)
>  {
>  	bool consumed = false;
>  	struct xdp_buff xdp;
> @@ -705,7 +720,13 @@ static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
>  		consumed = mlx5e_xmit_xdp_frame(&rq->channel->xdp_sq, di,
>  						MLX5_RX_HEADROOM,
>  						len);
> +		if (unlikely(!consumed) && (*xdp_doorbell)) {
> +			/* SQ is full, ring doorbell */
> +			mlx5e_xmit_xdp_doorbell(&rq->channel->xdp_sq);
> +			*xdp_doorbell = false;
> +		}
>  		rq->stats.xdp_tx += consumed;
> +		*xdp_doorbell |= consumed;
>  		return consumed;
>  	default:
>  		bpf_warn_invalid_xdp_action(act);
> @@ -720,7 +741,8 @@ static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
>  	return false;
>  }
>  
> -void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
> +void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
> +			 bool *xdp_doorbell)
>  {
>  	struct bpf_prog *xdp_prog = READ_ONCE(rq->xdp_prog);
>  	struct mlx5e_dma_info *di;
> @@ -752,7 +774,7 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
>  		goto wq_ll_pop;
>  	}
>  
> -	if (mlx5e_xdp_handle(rq, xdp_prog, di, data, cqe_bcnt))
> +	if (mlx5e_xdp_handle(rq, xdp_prog, di, data, cqe_bcnt, xdp_doorbell))
>  		goto wq_ll_pop; /* page/packet was consumed by XDP */
>  
>  	skb = build_skb(va, RQ_PAGE_SIZE(rq));
> @@ -814,7 +836,8 @@ static inline void mlx5e_mpwqe_fill_rx_skb(struct mlx5e_rq *rq,
>  	skb->len  += headlen;
>  }
>  
> -void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
> +void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
> +			       bool *xdp_doorbell)
>  {
>  	u16 cstrides       = mpwrq_get_cqe_consumed_strides(cqe);
>  	u16 wqe_id         = be16_to_cpu(cqe->wqe_id);
> @@ -860,13 +883,15 @@ mpwrq_cqe_out:
>  int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
>  {
>  	struct mlx5e_rq *rq = container_of(cq, struct mlx5e_rq, cq);
> +	bool xdp_doorbell = false;
>  	int work_done = 0;
>  
>  	if (unlikely(test_bit(MLX5E_RQ_STATE_FLUSH, &rq->state)))
>  		return 0;
>  
>  	if (cq->decmprs_left)
> -		work_done += mlx5e_decompress_cqes_cont(rq, cq, 0, budget);
> +		work_done += mlx5e_decompress_cqes_cont(rq, cq, 0, budget,
> +							&xdp_doorbell);
>  
>  	for (; work_done < budget; work_done++) {
>  		struct mlx5_cqe64 *cqe = mlx5e_get_cqe(cq);
> @@ -877,15 +902,19 @@ int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
>  		if (mlx5_get_cqe_format(cqe) == MLX5_COMPRESSED) {
>  			work_done +=
>  				mlx5e_decompress_cqes_start(rq, cq,
> -							    budget - work_done);
> +							    budget - work_done,
> +							    &xdp_doorbell);
>  			continue;
>  		}
>  
>  		mlx5_cqwq_pop(&cq->wq);
>  
> -		rq->handle_rx_cqe(rq, cqe);
> +		rq->handle_rx_cqe(rq, cqe, &xdp_doorbell);
>  	}
>  
> +	if (xdp_doorbell)
> +		mlx5e_xmit_xdp_doorbell(&rq->channel->xdp_sq);
> +
>  	mlx5_cqwq_update_db_record(&cq->wq);
>  
>  	/* ensure cq space is freed before enabling more cqes */



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support
  2016-09-08  7:38         ` Jesper Dangaard Brouer via iovisor-dev
@ 2016-09-08  9:31           ` Or Gerlitz
       [not found]             ` <CAJ3xEMiDBZ2-FdE7wniW0Y_S6k8NKfKEdy3w+1vs83oPuMAG5Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 72+ messages in thread
From: Or Gerlitz @ 2016-09-08  9:31 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Saeed Mahameed, iovisor-dev, Linux Netdev List, Tariq Toukan,
	Brenden Blanco, Alexei Starovoitov, Tom Herbert,
	Martin KaFai Lau, Daniel Borkmann, Eric Dumazet,
	Jamal Hadi Salim, Rana Shahout

On Thu, Sep 8, 2016 at 10:38 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Wed, 7 Sep 2016 23:55:42 +0300
> Or Gerlitz <gerlitz.or@gmail.com> wrote:
>
>> On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed <saeedm@mellanox.com> wrote:
>> > From: Rana Shahout <ranas@mellanox.com>
>> >
>> > Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx5e driver.
>> >
>> > When XDP is on we make sure to change channels RQs type to
>> > MLX5_WQ_TYPE_LINKED_LIST rather than "striding RQ" type to
>> > ensure "page per packet".
>> >
>> > On XDP set, we fail if HW LRO is set and request from user to turn it
>> > off.  Since on ConnectX4-LX HW LRO is always on by default, this will be
>> > annoying, but we prefer not to enforce LRO off from XDP set function.
>> >
>> > Full channels reset (close/open) is required only when setting XDP
>> > on/off.
>> >
>> > When XDP set is called just to exchange programs, we will update
>> > each RQ xdp program on the fly and for synchronization with current
>> > data path RX activity of that RQ, we temporally disable that RQ and
>> > ensure RX path is not running, quickly update and re-enable that RQ,
>> > for that we do:
>> >         - rq.state = disabled
>> >         - napi_synnchronize
>> >         - xchg(rq->xdp_prg)
>> >         - rq.state = enabled
>> >         - napi_schedule // Just in case we've missed an IRQ
>> >
>> > Packet rate performance testing was done with pktgen 64B packets and on
>> > TX side and, TC drop action on RX side compared to XDP fast drop.
>> >
>> > CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
>> >
>> > Comparison is done between:
>> >         1. Baseline, Before this patch with TC drop action
>> >         2. This patch with TC drop action
>> >         3. This patch with XDP RX fast drop
>> >
>> > Streams    Baseline(TC drop)    TC drop    XDP fast Drop
>> > --------------------------------------------------------------
>> > 1           5.51Mpps            5.14Mpps     13.5Mpps
>>
>> This (13.5 M PPS) is less than 50% of the result we presented @ the
>> XDP summit which was obtained by Rana. Please see if/how much does
>> this grows if you use more sender threads, but all of them to xmit the
>> same stream/flows, so we're on one ring. That (XDP with single RX ring
>> getting packets from N remote TX rings) would be your canonical
>> base-line for any further numbers.
>
> Well, my experiments with this hardware (mlx5/CX4 at 50Gbit/s) show
> that you should be able to reach 23Mpps on a single CPU.  This is
> a XDP-drop-simulation with order-0 pages being recycled through my
> page_pool code, plus avoiding the cache-misses (notice you are using a
> CPU E5-2680 with DDIO, thus you should only see a L3 cache miss).

so this takes us up from 13M to 23M, good.

Could you explain why the move from order-3 to order-0 hurts the
performance so much (a drop from 32M to 23M)? Is there any way we can
overcome that?

> The 23Mpps number looks like some HW limitation, as the increase was

not HW, I think. As I said, Rana got 32M with striding RQ when she was
using order-3
(or did we use order-5?)

> is not proportional to page-allocator overhead I removed (and CPU freq
> starts to decrease).  I also did scaling tests to more CPUs, which
> showed it scaled up to 40Mpps (you reported 45M).  And at the Phy RX
> level I see 60Mpps (50G max is 74Mpps).

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support
       [not found]             ` <CAJ3xEMiDBZ2-FdE7wniW0Y_S6k8NKfKEdy3w+1vs83oPuMAG5Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-09-08  9:52               ` Jesper Dangaard Brouer via iovisor-dev
  2016-09-14  9:24               ` Tariq Toukan via iovisor-dev
  1 sibling, 0 replies; 72+ messages in thread
From: Jesper Dangaard Brouer via iovisor-dev @ 2016-09-08  9:52 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Linux Netdev List, iovisor-dev, Jamal Hadi Salim, Saeed Mahameed,
	Eric Dumazet, Tom Herbert, Rana Shahout

On Thu, 8 Sep 2016 12:31:47 +0300
Or Gerlitz <gerlitz.or-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> On Thu, Sep 8, 2016 at 10:38 AM, Jesper Dangaard Brouer
> <brouer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > On Wed, 7 Sep 2016 23:55:42 +0300
> > Or Gerlitz <gerlitz.or-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >  
> >> On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:  
> >> > From: Rana Shahout <ranas-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >> >
> >> > Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx5e driver.
> >> >
> >> > When XDP is on we make sure to change channels RQs type to
> >> > MLX5_WQ_TYPE_LINKED_LIST rather than "striding RQ" type to
> >> > ensure "page per packet".
> >> >
> >> > On XDP set, we fail if HW LRO is set and request from user to turn it
> >> > off.  Since on ConnectX4-LX HW LRO is always on by default, this will be
> >> > annoying, but we prefer not to enforce LRO off from XDP set function.
> >> >
> >> > Full channels reset (close/open) is required only when setting XDP
> >> > on/off.
> >> >
> >> > When XDP set is called just to exchange programs, we will update
> >> > each RQ xdp program on the fly and for synchronization with current
> >> > data path RX activity of that RQ, we temporally disable that RQ and
> >> > ensure RX path is not running, quickly update and re-enable that RQ,
> >> > for that we do:
> >> >         - rq.state = disabled
> >> >         - napi_synnchronize
> >> >         - xchg(rq->xdp_prg)
> >> >         - rq.state = enabled
> >> >         - napi_schedule // Just in case we've missed an IRQ
> >> >
> >> > Packet rate performance testing was done with pktgen 64B packets and on
> >> > TX side and, TC drop action on RX side compared to XDP fast drop.
> >> >
> >> > CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
> >> >
> >> > Comparison is done between:
> >> >         1. Baseline, Before this patch with TC drop action
> >> >         2. This patch with TC drop action
> >> >         3. This patch with XDP RX fast drop
> >> >
> >> > Streams    Baseline(TC drop)    TC drop    XDP fast Drop
> >> > --------------------------------------------------------------
> >> > 1           5.51Mpps            5.14Mpps     13.5Mpps  
> >>
> >> This (13.5 M PPS) is less than 50% of the result we presented @ the
> >> XDP summit which was obtained by Rana. Please see if/how much does
> >> this grows if you use more sender threads, but all of them to xmit the
> >> same stream/flows, so we're on one ring. That (XDP with single RX ring
> >> getting packets from N remote TX rings) would be your canonical
> >> base-line for any further numbers.  
> >
> > Well, my experiments with this hardware (mlx5/CX4 at 50Gbit/s) show
> > that you should be able to reach 23Mpps on a single CPU.  This is
> > a XDP-drop-simulation with order-0 pages being recycled through my
> > page_pool code, plus avoiding the cache-misses (notice you are using a
> > CPU E5-2680 with DDIO, thus you should only see a L3 cache miss).  
> 
> so this takes up from 13M to 23M, good.

Notice the 23Mpps was a crude hack test to determine the maximum
achievable performance.  This is our performance target; once we get
_close_ to that, we are happy and stop optimizing.

> Could you explain why the move from order-3 to order-0 is hurting the
> performance so much (drop from 32M to 23M), any way we can overcome that?

It is all going to be in the details.

When dealing with these numbers, be careful: 23M to 32M sounds like a
huge deal, but the performance difference in nanoseconds is actually not
that large; it is only around 12ns more that we have to save:

(1/(23*10^6)-1/(32*10^6))*10^9 = 12.22

> > The 23Mpps number looks like some HW limitation, as the increase was  
> 
> not HW, I think. As I said, Rana got 32M with striding RQ when she was
> using order-3 (or did we use order-5?)

It was order-5.

We likely need some HW tuning parameter (like with mlx4) if you want to
go past the 23Mpps mark.

 
> > is not proportional to page-allocator overhead I removed (and CPU freq
> > starts to decrease).  I also did scaling tests to more CPUs, which
> > showed it scaled up to 40Mpps (you reported 45M).  And at the Phy RX
> > level I see 60Mpps (50G max is 74Mpps).  

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support
  2016-09-07 12:42 ` [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support Saeed Mahameed
  2016-09-07 13:32   ` Or Gerlitz
       [not found]   ` <1473252152-11379-9-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-09-08 10:58   ` Jamal Hadi Salim
  2 siblings, 0 replies; 72+ messages in thread
From: Jamal Hadi Salim @ 2016-09-08 10:58 UTC (permalink / raw)
  To: Saeed Mahameed, iovisor-dev
  Cc: netdev, Tariq Toukan, Brenden Blanco, Alexei Starovoitov,
	Tom Herbert, Martin KaFai Lau, Jesper Dangaard Brouer,
	Daniel Borkmann, Eric Dumazet, Rana Shahout


On 16-09-07 08:42 AM, Saeed Mahameed wrote:

> Comparison is done between:
> 	1. Baseline, Before this patch with TC drop action
> 	2. This patch with TC drop action
> 	3. This patch with XDP RX fast drop
>
> Streams    Baseline(TC drop)    TC drop    XDP fast Drop
> --------------------------------------------------------------
> 1           5.51Mpps            5.14Mpps     13.5Mpps
> 2           11.5Mpps            10.0Mpps     25.1Mpps
> 4           16.3Mpps            17.2Mpps     35.4Mpps
> 8           29.6Mpps            28.2Mpps     45.8Mpps*
> 16          34.0Mpps            30.1Mpps     45.8Mpps*
>
> It seems that there is around ~5% degradation between Baseline
> and this patch with single stream when comparing packet rate with TC drop,
> it might be related to XDP code overhead or new cache misses added by
> XDP code.


I would suspect this degradation would affect every other packet that
has no interest in XDP.
If you were trying to test forwarding, adding a tc action to
accept and count packets would be sufficient. Since you are not:

Try to baseline by sending the wrong destination MAC address (i.e. one
not understood by the host). The kernel will eventually drop it
somewhere before IP processing (and you can see the difference with
XDP compiled in).

A slightly tangential question: would it be fair to assume that this
hardware can drop at wire rate if you instead used an offloaded
tc rule?

cheers,
jamal

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
       [not found]                           ` <20160908071119.776cce56-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2016-09-08 16:26                             ` Tom Herbert via iovisor-dev
  2016-09-08 17:19                               ` Jesper Dangaard Brouer via iovisor-dev
  0 siblings, 1 reply; 72+ messages in thread
From: Tom Herbert via iovisor-dev @ 2016-09-08 16:26 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, Linux Netdev List, iovisor-dev, John Fastabend,
	Jamal Hadi Salim, Saeed Mahameed, Eric Dumazet

On Wed, Sep 7, 2016 at 10:11 PM, Jesper Dangaard Brouer
<brouer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>
> On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbert <tom-BjP2VixgY4xUbtYUoyoikg@public.gmane.org> wrote:
>
>> On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend <john.fastabend-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> > On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:
>> >>
>> >> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed <saeedm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
>> >>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> >>>> On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
>> >>>>> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> >>>>>> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
>> >> [...]
>> >>>>
>> >>>> Only if a qdisc is present and pressure is high enough.
>> >>>>
>> >>>> But in a forwarding setup, we likely receive at a lower rate than the
>> >>>> NIC can transmit.
>> >>
>> >> Yes, I can confirm this happens in my experiments.
>> >>
>> >>>>
>> >>>
>> >>> Jesper has a similar Idea to make the qdisc think it is under
>> >>> pressure, when the device TX ring is idle most of the time, i think
>> >>> his idea can come in handy here. I am not fully involved in the
>> >>> details, maybe he can elaborate more.
>> >>>
>> >>> But if it works, it will be transparent to napi, and xmit more will
>> >>> happen by design.
>> >>
>> >> Yes. I have some ideas around getting more bulking going from the qdisc
>> >> layer, by having the drivers provide some feedback to the qdisc layer
>> >> indicating xmit_more should be possible.  This will be a topic at the
>> >> Network Performance Workshop[1] at NetDev 1.2, I have will hopefully
>> >> challenge people to come up with a good solution ;-)
>> >>
>> >
>> > One thing I've noticed but haven't yet actually analyzed much is if
>> > I shrink the nic descriptor ring size to only be slightly larger than
>> > the qdisc layer bulking size I get more bulking and better perf numbers.
>> > At least on microbenchmarks. The reason being the nic pushes back more
>> > on the qdisc. So maybe a case for making the ring size in the NIC some
>> > factor of the expected number of queues feeding the descriptor ring.
>> >
>
> I've also played with shrink the NIC descriptor ring size, it works,
> but it is an ugly hack to get NIC pushes backs, and I foresee it will
> hurt normal use-cases. (There are other reasons for shrinking the ring
> size like cache usage, but that is unrelated to this).
>
>
>> BQL is not helping with that?
>
> Exactly. But the BQL _byte_ limit is not what is needed, what we need
> to know is the _packets_ currently "in-flight".  Which Tom already have
> a patch for :-)  Once we have that the algorithm is simple.
>
> Qdisc dequeue look at BQL pkts-in-flight, if driver have "enough"
> packets in-flight, the qdisc start it's bulk dequeue building phase,
> before calling the driver. The allowed max qdisc bulk size should
> likely be related to pkts-in-flight.
>
Sorry, I'm still missing it. The point of BQL is that we minimize the
amount of data (and hence number of packets) that needs to be queued
in the device in order to prevent the link from going idle while there
are outstanding packets to be sent. The algorithm is based on counting
bytes, not packets, because bytes are roughly an equal-cost unit of
work. So if we've queued 100K bytes on the queue we know how long
that takes, around 80 usecs @10G, but if we count packets then we
really don't know much about that. 100 packets enqueued could
represent 6400 bytes or 6400K worth of data, so the time to transmit is
anywhere from 5usecs to 5msecs....

Shouldn't qdisc bulk size be based on the BQL limit? What is the
simple algorithm to apply to in-flight packets?

Tom

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
  2016-09-08 16:26                             ` Tom Herbert via iovisor-dev
@ 2016-09-08 17:19                               ` Jesper Dangaard Brouer via iovisor-dev
       [not found]                                 ` <20160908191914.197ce7ec-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 72+ messages in thread
From: Jesper Dangaard Brouer via iovisor-dev @ 2016-09-08 17:19 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Eric Dumazet, Linux Netdev List, iovisor-dev, John Fastabend,
	Jamal Hadi Salim, Achiad Shochat, Saeed Mahameed, Eric Dumazet

On Thu, 8 Sep 2016 09:26:03 -0700
Tom Herbert <tom-BjP2VixgY4xUbtYUoyoikg@public.gmane.org> wrote:

> On Wed, Sep 7, 2016 at 10:11 PM, Jesper Dangaard Brouer
> <brouer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> >
> > On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbert <tom-BjP2VixgY4xUbtYUoyoikg@public.gmane.org> wrote:
> >  
> >> On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend <john.fastabend-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:  
> >> > On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:  
> >> >>
> >> >> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed <saeedm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:  
> >> >>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:  
> >> >>>> On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:  
> >> >>>>> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:  
> >> >>>>>> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:  
> >> >> [...]  
> >> >>>>
> >> >>>> Only if a qdisc is present and pressure is high enough.
> >> >>>>
> >> >>>> But in a forwarding setup, we likely receive at a lower rate than the
> >> >>>> NIC can transmit.  
> >> >>
> >> >> Yes, I can confirm this happens in my experiments.
> >> >>  
> >> >>>>  
> >> >>>
> >> >>> Jesper has a similar Idea to make the qdisc think it is under
> >> >>> pressure, when the device TX ring is idle most of the time, i think
> >> >>> his idea can come in handy here. I am not fully involved in the
> >> >>> details, maybe he can elaborate more.
> >> >>>
> >> >>> But if it works, it will be transparent to napi, and xmit more will
> >> >>> happen by design.  
> >> >>
> >> >> Yes. I have some ideas around getting more bulking going from the qdisc
> >> >> layer, by having the drivers provide some feedback to the qdisc layer
> >> >> indicating xmit_more should be possible.  This will be a topic at the
> >> >> Network Performance Workshop[1] at NetDev 1.2, I have will hopefully
> >> >> challenge people to come up with a good solution ;-)
> >> >>  
> >> >
> >> > One thing I've noticed but haven't yet actually analyzed much is if
> >> > I shrink the nic descriptor ring size to only be slightly larger than
> >> > the qdisc layer bulking size I get more bulking and better perf numbers.
> >> > At least on microbenchmarks. The reason being the nic pushes back more
> >> > on the qdisc. So maybe a case for making the ring size in the NIC some
> >> > factor of the expected number of queues feeding the descriptor ring.
> >> >  
> >
> > I've also played with shrink the NIC descriptor ring size, it works,
> > but it is an ugly hack to get NIC pushes backs, and I foresee it will
> > hurt normal use-cases. (There are other reasons for shrinking the ring
> > size like cache usage, but that is unrelated to this).
> >
> >  
> >> BQL is not helping with that?  
> >
> > Exactly. But the BQL _byte_ limit is not what is needed, what we need
> > to know is the _packets_ currently "in-flight".  Which Tom already have
> > a patch for :-)  Once we have that the algorithm is simple.
> >
> > Qdisc dequeue look at BQL pkts-in-flight, if driver have "enough"
> > packets in-flight, the qdisc start it's bulk dequeue building phase,
> > before calling the driver. The allowed max qdisc bulk size should
> > likely be related to pkts-in-flight.
> >  
> Sorry, I'm still missing it. The point of BQL is that we minimize the
> amount of data (and hence number of packets) that needs to be queued
> in the device in order to prevent the link from going idle while there
> are outstanding packets to be sent. The algorithm is based on counting
> bytes not packets because bytes are roughly an equal cost unit of
> work. So if we've queued 100K of bytes on the queue we know how long
> that takes around 80 usecs @10G, but if we count packets then we
> really don't know much about that. 100 packets enqueued could
> represent 6400 bytes or 6400K worth of data so time to transmit is
> anywhere from 5usecs to 5msecs....
> 
> Shouldn't qdisc bulk size be based on the BQL limit? What is the
> simple algorithm to apply to in-flight packets?

Maybe the algorithm is not so simple, and we likely also have to take
BQL bytes into account.

The reason for wanting packets-in-flight is that we are attacking a
transaction cost.  The tailptr/doorbell costs around 70ns.  (Based on
data in this patch description, 4.9Mpps -> 7.5Mpps: (1/4.90-1/7.5)*1000 =
70.74). The 10G wirespeed small-packet budget is 67.2ns, so with a
fixed per-packet overhead of 70ns we can never reach 10G wirespeed.

The idea/algo is trying to predict the future.  If we see a given/high
packet rate, which equals a high transaction cost, then let's try not
calling the driver, and instead backlog the packet in the qdisc,
speculatively hoping the current rate continues.  This will in effect
allow bulking and amortize the 70ns transaction cost over N packets.

Instead of tracking a rate of packets or doorbells per sec, I will let
BQL's packets-in-flight tell me when the driver sees a rate high enough
that the driver's DMA-TX completion considers several packets to be
in-flight.
When that happens, I will bet that I can stop sending packets to the
device, and instead queue them in the qdisc layer.  If I'm unlucky and
the flow stops, then I'm hoping that the last packet stuck in the qdisc
will be picked up by the next napi-schedule, before the device driver
runs "dry".

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
       [not found]                                 ` <20160908191914.197ce7ec-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2016-09-08 18:16                                   ` Tom Herbert via iovisor-dev
  2016-09-08 18:48                                     ` Rick Jones
  0 siblings, 1 reply; 72+ messages in thread
From: Tom Herbert via iovisor-dev @ 2016-09-08 18:16 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, Linux Netdev List, iovisor-dev, John Fastabend,
	Jamal Hadi Salim, Achiad Shochat, Saeed Mahameed, Eric Dumazet

On Thu, Sep 8, 2016 at 10:19 AM, Jesper Dangaard Brouer
<brouer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Thu, 8 Sep 2016 09:26:03 -0700
> Tom Herbert <tom-BjP2VixgY4xUbtYUoyoikg@public.gmane.org> wrote:
>
>> On Wed, Sep 7, 2016 at 10:11 PM, Jesper Dangaard Brouer
>> <brouer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> >
>> > On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbert <tom-BjP2VixgY4xUbtYUoyoikg@public.gmane.org> wrote:
>> >
>> >> On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend <john.fastabend-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> >> > On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:
>> >> >>
>> >> >> On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed <saeedm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
>> >> >>> On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> >> >>>> On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
>> >> >>>>> On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> >> >>>>>> On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
>> >> >> [...]
>> >> >>>>
>> >> >>>> Only if a qdisc is present and pressure is high enough.
>> >> >>>>
>> >> >>>> But in a forwarding setup, we likely receive at a lower rate than the
>> >> >>>> NIC can transmit.
>> >> >>
>> >> >> Yes, I can confirm this happens in my experiments.
>> >> >>
>> >> >>>>
>> >> >>>
>> >> >>> Jesper has a similar Idea to make the qdisc think it is under
>> >> >>> pressure, when the device TX ring is idle most of the time, i think
>> >> >>> his idea can come in handy here. I am not fully involved in the
>> >> >>> details, maybe he can elaborate more.
>> >> >>>
>> >> >>> But if it works, it will be transparent to napi, and xmit more will
>> >> >>> happen by design.
>> >> >>
>> >> >> Yes. I have some ideas around getting more bulking going from the qdisc
>> >> >> layer, by having the drivers provide some feedback to the qdisc layer
>> >> >> indicating xmit_more should be possible.  This will be a topic at the
>> >> >> Network Performance Workshop[1] at NetDev 1.2, I have will hopefully
>> >> >> challenge people to come up with a good solution ;-)
>> >> >>
>> >> >
>> >> > One thing I've noticed but haven't yet actually analyzed much is if
>> >> > I shrink the nic descriptor ring size to only be slightly larger than
>> >> > the qdisc layer bulking size I get more bulking and better perf numbers.
>> >> > At least on microbenchmarks. The reason being the nic pushes back more
>> >> > on the qdisc. So maybe a case for making the ring size in the NIC some
>> >> > factor of the expected number of queues feeding the descriptor ring.
>> >> >
>> >
>> > I've also played with shrink the NIC descriptor ring size, it works,
>> > but it is an ugly hack to get NIC pushes backs, and I foresee it will
>> > hurt normal use-cases. (There are other reasons for shrinking the ring
>> > size like cache usage, but that is unrelated to this).
>> >
>> >
>> >> BQL is not helping with that?
>> >
>> > Exactly. But the BQL _byte_ limit is not what is needed, what we need
>> > to know is the _packets_ currently "in-flight".  Which Tom already have
>> > a patch for :-)  Once we have that the algorithm is simple.
>> >
>> > Qdisc dequeue look at BQL pkts-in-flight, if driver have "enough"
>> > packets in-flight, the qdisc start it's bulk dequeue building phase,
>> > before calling the driver. The allowed max qdisc bulk size should
>> > likely be related to pkts-in-flight.
>> >
>> Sorry, I'm still missing it. The point of BQL is that we minimize the
>> amount of data (and hence number of packets) that needs to be queued
>> in the device in order to prevent the link from going idle while there
>> are outstanding packets to be sent. The algorithm is based on counting
>> bytes not packets because bytes are roughly an equal cost unit of
>> work. So if we've queued 100K of bytes on the queue we know how long
>> that takes around 80 usecs @10G, but if we count packets then we
>> really don't know much about that. 100 packets enqueued could
>> represent 6400 bytes or 6400K worth of data so time to transmit is
>> anywhere from 5usecs to 5msecs....
>>
>> Shouldn't qdisc bulk size be based on the BQL limit? What is the
>> simple algorithm to apply to in-flight packets?
>
> Maybe the algorithm is not so simple, and we likely also have to take
> BQL bytes into account.
>
> The reason for wanting packets-in-flight is because we are attacking a
> transaction cost.  The tailptr/doorbell cost around 70ns.  (Based on
> data in this patch desc, 4.9Mpps -> 7.5Mpps (1/4.90-1/7.5)*1000 =
> 70.74). The 10G wirespeed small packets budget is 67.2ns, this with
> fixed overhead per packet of 70ns we can never reach 10G wirespeed.
>
But you should be able to do this with BQL, and it is more accurate.
BQL tells you how many bytes need to be sent, and that can be used to
create a bulk of packets to send with one doorbell.

> The idea/algo is trying to predict the future.  If we see a given/high
> packet rate, which equals a high transaction cost, then lets try not
> calling the driver, and instead backlog the packet in the qdisc,
> speculatively hoping the current rate continues.  This will in effect
> allow bulking and amortize the 70ns transaction cost over N packets.
>
> Instead of tracking a rate of packets or doorbells per sec, I will let
> BQLs packet-in-flight tell me when the driver sees a rate high enough
> that the drivers (DMA-TX completion) consider several packets are
> in-flight.
> When that happens, I will bet on, I can stop sending packets to the
> device, and instead queue them in the qdisc layer.  If I'm unlucky and
> the flow stops, then I'm hoping that the last packet stuck in the qdisc,
> will be picked by the next napi-schedule, before the device driver runs
> "dry".
>
This is exactly what BQL already does (except the queue limit is on
bytes). Once the byte limit is reached the queue is stopped. At TX
completion time some number of bytes are freed up so that a bulk of
packets can be sent to the queue limit.
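For illustration, a standalone sketch of the packets-in-flight heuristic being debated might look like the following; the struct, the counters and the thresholds are all invented, since upstream BQL (struct dql) tracks bytes and a per-packet counter is exactly what Tom's pending patch would have to add:

/* Standalone sketch of the packets-in-flight heuristic discussed above.
 * The struct, counters and thresholds are invented for illustration;
 * upstream BQL tracks bytes, not packets. */
struct txq_state {
	unsigned int pkts_queued;	/* handed to the driver so far */
	unsigned int pkts_completed;	/* reaped by TX completion so far */
};

static unsigned int qdisc_bulk_budget(const struct txq_state *txq)
{
	unsigned int in_flight = txq->pkts_queued - txq->pkts_completed;

	/* Device nearly empty: dequeue a single packet immediately so the
	 * link never goes idle while a bulk is being built. */
	if (in_flight < 4)
		return 1;

	/* Device has enough work in flight to cover the time spent building
	 * a bulk; dequeue a burst and flush it with a single doorbell. */
	return in_flight < 16 ? in_flight : 16;
}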

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
  2016-09-08 18:16                                   ` Tom Herbert via iovisor-dev
@ 2016-09-08 18:48                                     ` Rick Jones
  2016-09-08 18:52                                       ` Eric Dumazet
  0 siblings, 1 reply; 72+ messages in thread
From: Rick Jones @ 2016-09-08 18:48 UTC (permalink / raw)
  To: Tom Herbert, Jesper Dangaard Brouer
  Cc: John Fastabend, Saeed Mahameed, Eric Dumazet, Saeed Mahameed,
	iovisor-dev, Linux Netdev List, Tariq Toukan, Brenden Blanco,
	Alexei Starovoitov, Martin KaFai Lau, Daniel Borkmann,
	Eric Dumazet, Jamal Hadi Salim, Achiad Shochat

On 09/08/2016 11:16 AM, Tom Herbert wrote:
> On Thu, Sep 8, 2016 at 10:19 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
>> On Thu, 8 Sep 2016 09:26:03 -0700
>> Tom Herbert <tom@herbertland.com> wrote:
>>> Shouldn't qdisc bulk size be based on the BQL limit? What is the
>>> simple algorithm to apply to in-flight packets?
>>
>> Maybe the algorithm is not so simple, and we likely also have to take
>> BQL bytes into account.
>>
>> The reason for wanting packets-in-flight is because we are attacking a
>> transaction cost.  The tailptr/doorbell cost around 70ns.  (Based on
>> data in this patch desc, 4.9Mpps -> 7.5Mpps (1/4.90-1/7.5)*1000 =
>> 70.74). The 10G wirespeed small packets budget is 67.2ns, this with
>> fixed overhead per packet of 70ns we can never reach 10G wirespeed.
>>
> But you should be able to do this with BQL and it is more accurate.
> BQL tells how many bytes need to be sent and that can be used to
> create a bulk of packets to send with one doorbell.

With small packets and the "default" ring size for this NIC/driver 
combination, is the BQL large enough that the ring fills before one hits 
the BQL?

rick jones

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
  2016-09-08 18:48                                     ` Rick Jones
@ 2016-09-08 18:52                                       ` Eric Dumazet
  0 siblings, 0 replies; 72+ messages in thread
From: Eric Dumazet @ 2016-09-08 18:52 UTC (permalink / raw)
  To: Rick Jones
  Cc: Tom Herbert, Jesper Dangaard Brouer, John Fastabend,
	Saeed Mahameed, Saeed Mahameed, iovisor-dev, Linux Netdev List,
	Tariq Toukan, Brenden Blanco, Alexei Starovoitov,
	Martin KaFai Lau, Daniel Borkmann, Eric Dumazet,
	Jamal Hadi Salim, Achiad Shochat

On Thu, 2016-09-08 at 11:48 -0700, Rick Jones wrote:

> With small packets and the "default" ring size for this NIC/driver 
> combination, is the BQL large enough that the ring fills before one hits 
> the BQL?

It depends on how TX completion (NAPI handler) is implemented in the
driver.

That is, how many packets can be dequeued by each invocation.

Drivers have a lot of variations there.
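As a rough illustration of where that variation lives, here is a generic sketch (not mlx5 code) of a TX-completion NAPI pass feeding BQL; my_ring, my_desc_done() and my_reap_desc() are invented placeholders for driver-private helpers, while netdev_tx_completed_queue() is the real BQL hook:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Generic sketch of a TX-completion pass: how many descriptors one NAPI
 * invocation reaps decides how many bytes BQL releases back to the stack
 * (and thus how big the next bulk can be).  my_* names are placeholders. */
static int my_poll_tx(struct my_ring *ring, int budget)
{
	unsigned int bytes = 0;
	int pkts = 0;

	while (pkts < budget && my_desc_done(ring)) {
		struct sk_buff *skb = my_reap_desc(ring);

		bytes += skb->len;
		pkts++;
		dev_consume_skb_any(skb);
	}

	/* One BQL update per pass: this re-opens a stopped queue and
	 * bounds how much the stack may push before the next completion. */
	netdev_tx_completed_queue(ring->txq, pkts, bytes);

	return pkts;
}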

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
       [not found]       ` <20160908101147.1b351432-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2016-09-09  3:22         ` Alexei Starovoitov via iovisor-dev
       [not found]           ` <20160909032202.GA62966-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
  2016-09-09 15:03           ` [iovisor-dev] " Saeed Mahameed
  0 siblings, 2 replies; 72+ messages in thread
From: Alexei Starovoitov via iovisor-dev @ 2016-09-09  3:22 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Tom Herbert, iovisor-dev, Jamal Hadi Salim, Saeed Mahameed,
	Eric Dumazet, netdev-u79uwXL29TY76Z2rM5mHXA

On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:
> 
> I'm sorry but I have a problem with this patch!

is it because the variable is called 'xdp_doorbell'?
Frankly I see nothing scary in this patch.
It extends existing code by adding a flag to ring doorbell or not.
The end of rx napi is used as an obvious heuristic to flush the pipe.
Looks pretty generic to me.
The same code can be used for non-xdp as well once we figure out
good algorithm for xmit_more in the stack.
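The pattern being described is roughly the following sketch; every identifier here is invented (the real patch operates on mlx5e-internal structures), so treat it as the shape of the mechanism, not the actual driver code:

/* Sketch of the deferred-doorbell pattern: XDP_TX posts a descriptor and
 * only sets a flag; the doorbell MMIO happens once, when the RX NAPI
 * budget is exhausted.  All identifiers are invented for illustration. */
static void my_xdp_tx(struct my_sq *sq, struct my_rx_desc *rxd)
{
	my_post_tx_wqe(sq, rxd);	/* write the TX descriptor, no MMIO */
	sq->db_pending = true;		/* remember that we owe a doorbell */
}

static int my_napi_poll_rx(struct my_rq *rq, int budget)
{
	int work = 0;

	while (work < budget && my_rx_desc_ready(rq)) {
		/* ... run the XDP program; XDP_TX lands in my_xdp_tx() ... */
		work++;
	}

	if (rq->xdp_sq.db_pending) {
		my_ring_doorbell(&rq->xdp_sq);	/* one MMIO for the whole burst */
		rq->xdp_sq.db_pending = false;
	}
	return work;
}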

> Looking at this patch, I want to bring up a fundamental architectural
> concern with the development direction of XDP transmit.
> 
> 
> What you are trying to implement, with delaying the doorbell, is
> basically TX bulking for TX_XDP.
> 
>  Why not implement a TX bulking interface directly instead?!?
> 
> Yes, the tailptr/doorbell is the most costly operation, but why not
> also take advantage of the benefits of bulking for other parts of the
> code? (benefit is smaller, by every cycles counts in this area)
> 
> This hole XDP exercise is about avoiding having a transaction cost per
> packet, that reads "bulking" or "bundling" of packets, where possible.
> 
>  Lets do bundling/bulking from the start!

mlx4 already does bulking and this proposed mlx5 set of patches
does bulking as well.
See nothing wrong about it. RX side processes the packets and
when it's done it tells TX to xmit whatever it collected.

> The reason behind the xmit_more API is that we could not change the
> API of all the drivers.  And we found that calling an explicit NDO
> flush came at a cost (only approx 7 ns IIRC), but it still a cost that
> would hit the common single packet use-case.
> 
> It should be really easy to build a bundle of packets that need XDP_TX
> action, especially given you only have a single destination "port".
> And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record.

Not sure what you are proposing here.
Sounds like you want to extend it to multi port in the future?
Sure. The proposed code is easily extendable.

Or do you want to see something like a linked list of packets,
or an array of packets, that the RX side is preparing and then
sends as a whole array/list to the TX port?
I don't think that would be efficient, since it would mean
unnecessary copying of pointers.

> In the future, XDP need to support XDP_FWD forwarding of packets/pages
> out other interfaces.  I also want bulk transmit from day-1 here.  It
> is slightly more tricky to sort packets for multiple outgoing
> interfaces efficiently in the pool loop.

I don't think so. Multi port is natural extension to this set of patches.
With multi port the end of RX will tell multiple ports (that were
used to tx) to ring the bell. Pretty trivial and doesn't involve any
extra arrays or link lists.

> But the mSwitch[1] article actually already solved this destination
> sorting.  Please read[1] section 3.3 "Switch Fabric Algorithm" for
> understanding the next steps, for a smarter data structure, when
> starting to have more TX "ports".  And perhaps align your single
> XDP_TX destination data structure to this future development.
> 
> [1] http://info.iet.unipi.it/~luigi/papers/20150617-mswitch-paper.pdf

I don't see how this particular paper applies to the existing kernel code.
It's great to take ideas from research papers, but real code is different.

> --Jesper
> (top post)

since when it's ok to top post?

> On Wed,  7 Sep 2016 15:42:32 +0300 Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> 
> > Previously we rang XDP SQ doorbell on every forwarded XDP packet.
> > 
> > Here we introduce a xmit more like mechanism that will queue up more
> > than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
> > 
> > Once RX napi budget is consumed and we exit napi RX loop, we will
> > flush (doorbell) all XDP looped packets in case there are such.
> > 
> > XDP forward packet rate:
> > 
> > Comparing XDP with and w/o xmit more (bulk transmit):
> > 
> > Streams     XDP TX       XDP TX (xmit more)
> > ---------------------------------------------------
> > 1           4.90Mpps      7.50Mpps
> > 2           9.50Mpps      14.8Mpps
> > 4           16.5Mpps      25.1Mpps
> > 8           21.5Mpps      27.5Mpps*
> > 16          24.1Mpps      27.5Mpps*
> > 
> > *It seems we hit a wall of 27.5Mpps, for 8 and 16 streams,
> > we will be working on the analysis and will publish the conclusions
> > later.
> > 
> > Signed-off-by: Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > ---
> >  drivers/net/ethernet/mellanox/mlx5/core/en.h    |  9 ++--
> >  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 57 +++++++++++++++++++------
> >  2 files changed, 49 insertions(+), 17 deletions(-)
...
> > @@ -131,7 +132,7 @@ static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
> >  			mlx5e_read_mini_arr_slot(cq, cqcc);
> >  
> >  	mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
> >  
> > +#if 0 /* enable this code only if MLX5E_XDP_TX_WQEBBS > 1 */

Saeed,
please make sure to remove such debug bits.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
       [not found]           ` <20160909032202.GA62966-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
@ 2016-09-09  5:36             ` Jesper Dangaard Brouer via iovisor-dev
       [not found]               ` <20160909073652.351d76d7-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 72+ messages in thread
From: Jesper Dangaard Brouer via iovisor-dev @ 2016-09-09  5:36 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Tom Herbert, iovisor-dev, Jamal Hadi Salim, Saeed Mahameed,
	Eric Dumazet, netdev-u79uwXL29TY76Z2rM5mHXA

On Thu, 8 Sep 2016 20:22:04 -0700
Alexei Starovoitov <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:
> > 
> > I'm sorry but I have a problem with this patch!  
> 
> is it because the variable is called 'xdp_doorbell'?
> Frankly I see nothing scary in this patch.
> It extends existing code by adding a flag to ring doorbell or not.
> The end of rx napi is used as an obvious heuristic to flush the pipe.
> Looks pretty generic to me.
> The same code can be used for non-xdp as well once we figure out
> good algorithm for xmit_more in the stack.

What I'm proposing can also be used by the normal stack.
 
> > Looking at this patch, I want to bring up a fundamental architectural
> > concern with the development direction of XDP transmit.
> > 
> > 
> > What you are trying to implement, with delaying the doorbell, is
> > basically TX bulking for TX_XDP.
> > 
> >  Why not implement a TX bulking interface directly instead?!?
> > 
> > Yes, the tailptr/doorbell is the most costly operation, but why not
> > also take advantage of the benefits of bulking for other parts of the
> > code? (benefit is smaller, by every cycles counts in this area)
> > 
> > This hole XDP exercise is about avoiding having a transaction cost per
> > packet, that reads "bulking" or "bundling" of packets, where possible.
> > 
> >  Lets do bundling/bulking from the start!  
> 
> mlx4 already does bulking and this proposed mlx5 set of patches
> does bulking as well.
> See nothing wrong about it. RX side processes the packets and
> when it's done it tells TX to xmit whatever it collected.

This is doing "hidden" bulking and not really taking advantage of using
the icache more efficiently.  

Let me explain the problem I see a little more clearly, so you
hopefully see where I'm going.

Imagine you have packets intermixed towards the stack and XDP_TX. 
Every time you call the stack code, then you flush your icache.  When
returning to the driver code, you will have to reload all the icache
associated with the XDP_TX, this is a costly operation.

 
> > The reason behind the xmit_more API is that we could not change the
> > API of all the drivers.  And we found that calling an explicit NDO
> > flush came at a cost (only approx 7 ns IIRC), but it still a cost that
> > would hit the common single packet use-case.
> > 
> > It should be really easy to build a bundle of packets that need XDP_TX
> > action, especially given you only have a single destination "port".
> > And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record.  
> 
> not sure what are you proposing here?
> Sounds like you want to extend it to multi port in the future?
> Sure. The proposed code is easily extendable.
> 
> Or you want to see something like a link list of packets
> or an array of packets that RX side is preparing and then
> send the whole array/list to TX port?
> I don't think that would be efficient, since it would mean
> unnecessary copy of pointers.

I just explained that it will be more efficient due to better use of the icache.

 
> > In the future, XDP need to support XDP_FWD forwarding of packets/pages
> > out other interfaces.  I also want bulk transmit from day-1 here.  It
> > is slightly more tricky to sort packets for multiple outgoing
> > interfaces efficiently in the pool loop.  
> 
> I don't think so. Multi port is natural extension to this set of patches.
> With multi port the end of RX will tell multiple ports (that were
> used to tx) to ring the bell. Pretty trivial and doesn't involve any
> extra arrays or link lists.

So, have you solved the problem of exclusive access to a TX ring of a
remote/different net_device when sending?

In your design you assume there exist many TX rings available for other
devices to access.  In my design I also want to support devices that
don't have this HW capability and e.g. only have one TX queue.


> > But the mSwitch[1] article actually already solved this destination
> > sorting.  Please read[1] section 3.3 "Switch Fabric Algorithm" for
> > understanding the next steps, for a smarter data structure, when
> > starting to have more TX "ports".  And perhaps align your single
> > XDP_TX destination data structure to this future development.
> > 
> > [1] http://info.iet.unipi.it/~luigi/papers/20150617-mswitch-paper.pdf  
> 
> I don't see how this particular paper applies to the existing kernel code.
> It's great to take ideas from research papers, but real code is different.
> 
> > --Jesper
> > (top post)  
> 
> since when it's ok to top post?

What a kneejerk reaction.  When writing something general we often
reply at the top of the email, and then often delete the rest (which
makes it hard for latecomers to follow).  I was bcc'ing some people
who needed the context, so it was a service note to you that I
didn't write anything below.

 
> > On Wed,  7 Sep 2016 15:42:32 +0300 Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> >   
> > > Previously we rang XDP SQ doorbell on every forwarded XDP packet.
> > > 
> > > Here we introduce a xmit more like mechanism that will queue up more
> > > than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
> > > 
> > > Once RX napi budget is consumed and we exit napi RX loop, we will
> > > flush (doorbell) all XDP looped packets in case there are such.
> > > 
> > > XDP forward packet rate:
> > > 
> > > Comparing XDP with and w/o xmit more (bulk transmit):
> > > 
> > > Streams     XDP TX       XDP TX (xmit more)
> > > ---------------------------------------------------
> > > 1           4.90Mpps      7.50Mpps
> > > 2           9.50Mpps      14.8Mpps
> > > 4           16.5Mpps      25.1Mpps
> > > 8           21.5Mpps      27.5Mpps*
> > > 16          24.1Mpps      27.5Mpps*
> > > 
> > > *It seems we hit a wall of 27.5Mpps, for 8 and 16 streams,
> > > we will be working on the analysis and will publish the conclusions
> > > later.
> > > 
> > > Signed-off-by: Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > > ---
> > >  drivers/net/ethernet/mellanox/mlx5/core/en.h    |  9 ++--
> > >  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 57 +++++++++++++++++++------
> > >  2 files changed, 49 insertions(+), 17 deletions(-)  
> ...
> > > @@ -131,7 +132,7 @@ static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
> > >  			mlx5e_read_mini_arr_slot(cq, cqcc);
> > >  
> > >  	mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
> > >  
> > > +#if 0 /* enable this code only if MLX5E_XDP_TX_WQEBBS > 1 */  
> 
> Saeed,
> please make sure to remove such debug bits.
> 



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
       [not found]               ` <20160909073652.351d76d7-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2016-09-09  6:30                 ` Alexei Starovoitov via iovisor-dev
       [not found]                   ` <20160909063048.GA67375-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
  2016-09-09 19:02                 ` Tom Herbert via iovisor-dev
  1 sibling, 1 reply; 72+ messages in thread
From: Alexei Starovoitov via iovisor-dev @ 2016-09-09  6:30 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Tom Herbert, iovisor-dev, Jamal Hadi Salim, Saeed Mahameed,
	Eric Dumazet, netdev-u79uwXL29TY76Z2rM5mHXA

On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:
> > >  Lets do bundling/bulking from the start!  
> > 
> > mlx4 already does bulking and this proposed mlx5 set of patches
> > does bulking as well.
> > See nothing wrong about it. RX side processes the packets and
> > when it's done it tells TX to xmit whatever it collected.
> 
> This is doing "hidden" bulking and not really taking advantage of using
> the icache more effeciently.  
> 
> Let me explain the problem I see, little more clear then, so you
> hopefully see where I'm going.
> 
> Imagine you have packets intermixed towards the stack and XDP_TX. 
> Every time you call the stack code, then you flush your icache.  When
> returning to the driver code, you will have to reload all the icache
> associated with the XDP_TX, this is a costly operation.

correct. And why is that a problem?
As we discussed numerous times before XDP is deliberately not trying
to work with 10% of the traffic. If most of the traffic is going into
the stack there is no reason to use XDP. We have tc and netfilter
to deal with it. The cases where most of the traffic needs
skb should not use XDP. If we try to add such use cases to XDP we
will only hurt XDP performance, increase complexity and gain nothing back.

Let's say a user wants to send 50% into the stack->tcp->socket->user and
another 50% via XDP_TX. The performance is going to be dominated by the stack.
So everything that XDP does to receive and/or transmit is irrelevant.
If we try to optimize XDP for that, we gain nothing in performance.
The user could have used netfilter just as well in such scenario.
The performance would have been the same.

XDP only makes sense when it's servicing most of the traffic,
like L4 load balancer, ILA router or DoS prevention use cases.
Sorry for the broken record. XDP is not a solution for every
networking use case. It only makes sense for packet in and out.
When packet goes to the host it has to go through skb and
optimizing that path is a task that is orthogonal to the XDP patches.

To make further progress in this discussion can we talk about
the use case you have in mind instead? Then the solution will
be much clearer, I hope.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
  2016-09-09  3:22         ` Alexei Starovoitov via iovisor-dev
       [not found]           ` <20160909032202.GA62966-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
@ 2016-09-09 15:03           ` Saeed Mahameed
       [not found]             ` <CALzJLG_r0pDJgxqqak5=NatT8tF7UP2NkGS1wjeWcS5C=Zvv2A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 72+ messages in thread
From: Saeed Mahameed @ 2016-09-09 15:03 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jesper Dangaard Brouer, Tom Herbert, iovisor-dev,
	Jamal Hadi Salim, Saeed Mahameed, Eric Dumazet,
	Linux Netdev List

On Fri, Sep 9, 2016 at 6:22 AM, Alexei Starovoitov via iovisor-dev
<iovisor-dev@lists.iovisor.org> wrote:
> On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:
>>
>> I'm sorry but I have a problem with this patch!
>
> is it because the variable is called 'xdp_doorbell'?
> Frankly I see nothing scary in this patch.
> It extends existing code by adding a flag to ring doorbell or not.
> The end of rx napi is used as an obvious heuristic to flush the pipe.
> Looks pretty generic to me.
> The same code can be used for non-xdp as well once we figure out
> good algorithm for xmit_more in the stack.
>
>> Looking at this patch, I want to bring up a fundamental architectural
>> concern with the development direction of XDP transmit.
>>
>>
>> What you are trying to implement, with delaying the doorbell, is
>> basically TX bulking for TX_XDP.
>>
>>  Why not implement a TX bulking interface directly instead?!?
>>
>> Yes, the tailptr/doorbell is the most costly operation, but why not
>> also take advantage of the benefits of bulking for other parts of the
>> code? (benefit is smaller, by every cycles counts in this area)
>>
>> This hole XDP exercise is about avoiding having a transaction cost per
>> packet, that reads "bulking" or "bundling" of packets, where possible.
>>
>>  Lets do bundling/bulking from the start!

Jesper, what we did here is also bulking: instead of bulking in a
temporary list in the driver,
we list the packets in the HW and once done we transmit all at once via the
xdp_doorbell indication.

I agree with you that we can take advantage of this and improve the icache by
bulking first in software, then queueing all at once in the HW and
ringing one doorbell.

but I also agree with Alexei that this will introduce extra
pointer/list handling
in the driver, and we need to do the comparison between both approaches
before we decide which is better.

this should be marked as future work rather than required from the start.

>
> mlx4 already does bulking and this proposed mlx5 set of patches
> does bulking as well.
> See nothing wrong about it. RX side processes the packets and
> when it's done it tells TX to xmit whatever it collected.
>
>> The reason behind the xmit_more API is that we could not change the
>> API of all the drivers.  And we found that calling an explicit NDO
>> flush came at a cost (only approx 7 ns IIRC), but it still a cost that
>> would hit the common single packet use-case.
>>
>> It should be really easy to build a bundle of packets that need XDP_TX
>> action, especially given you only have a single destination "port".
>> And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record.
>
> not sure what are you proposing here?
> Sounds like you want to extend it to multi port in the future?
> Sure. The proposed code is easily extendable.
>
> Or you want to see something like a link list of packets
> or an array of packets that RX side is preparing and then
> send the whole array/list to TX port?
> I don't think that would be efficient, since it would mean
> unnecessary copy of pointers.
>
>> In the future, XDP need to support XDP_FWD forwarding of packets/pages
>> out other interfaces.  I also want bulk transmit from day-1 here.  It
>> is slightly more tricky to sort packets for multiple outgoing
>> interfaces efficiently in the pool loop.
>
> I don't think so. Multi port is natural extension to this set of patches.
> With multi port the end of RX will tell multiple ports (that were
> used to tx) to ring the bell. Pretty trivial and doesn't involve any
> extra arrays or link lists.
>
>> But the mSwitch[1] article actually already solved this destination
>> sorting.  Please read[1] section 3.3 "Switch Fabric Algorithm" for
>> understanding the next steps, for a smarter data structure, when
>> starting to have more TX "ports".  And perhaps align your single
>> XDP_TX destination data structure to this future development.
>>
>> [1] http://info.iet.unipi.it/~luigi/papers/20150617-mswitch-paper.pdf
>
> I don't see how this particular paper applies to the existing kernel code.
> It's great to take ideas from research papers, but real code is different.
>
>> --Jesper
>> (top post)
>
> since when it's ok to top post?
>
>> On Wed,  7 Sep 2016 15:42:32 +0300 Saeed Mahameed <saeedm@mellanox.com> wrote:
>>
>> > Previously we rang XDP SQ doorbell on every forwarded XDP packet.
>> >
>> > Here we introduce a xmit more like mechanism that will queue up more
>> > than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
>> >
>> > Once RX napi budget is consumed and we exit napi RX loop, we will
>> > flush (doorbell) all XDP looped packets in case there are such.
>> >
>> > XDP forward packet rate:
>> >
>> > Comparing XDP with and w/o xmit more (bulk transmit):
>> >
>> > Streams     XDP TX       XDP TX (xmit more)
>> > ---------------------------------------------------
>> > 1           4.90Mpps      7.50Mpps
>> > 2           9.50Mpps      14.8Mpps
>> > 4           16.5Mpps      25.1Mpps
>> > 8           21.5Mpps      27.5Mpps*
>> > 16          24.1Mpps      27.5Mpps*
>> >
>> > *It seems we hit a wall of 27.5Mpps, for 8 and 16 streams,
>> > we will be working on the analysis and will publish the conclusions
>> > later.
>> >
>> > Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
>> > ---
>> >  drivers/net/ethernet/mellanox/mlx5/core/en.h    |  9 ++--
>> >  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 57 +++++++++++++++++++------
>> >  2 files changed, 49 insertions(+), 17 deletions(-)
> ...
>> > @@ -131,7 +132,7 @@ static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
>> >                     mlx5e_read_mini_arr_slot(cq, cqcc);
>> >
>> >     mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
>> >
>> > +#if 0 /* enable this code only if MLX5E_XDP_TX_WQEBBS > 1 */
>
> Saeed,
> please make sure to remove such debug bits.
>

Sure, will fix this.

Thanks,
Saeed.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 00/11] mlx5 RX refactoring and XDP support
       [not found] ` <1473252152-11379-1-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-09-09 15:10   ` Saeed Mahameed via iovisor-dev
  0 siblings, 0 replies; 72+ messages in thread
From: Saeed Mahameed via iovisor-dev @ 2016-09-09 15:10 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Linux Netdev List, iovisor-dev, Jamal Hadi Salim, Eric Dumazet,
	Tom Herbert

On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> Hi All,
>
> This patch set introduces some important data path RX refactoring
> addressing mlx5e memory allocation/management improvements and XDP support.
>
> Submitting as RFC since we would like to get an early feedback, while we
> continue reviewing testing and complete the performance analysis in house.
>

Hi,

I am going to be out of the office for all of next week with only
sporadic mail access.
I will do my best to be as active as possible, but in the meantime
Tariq and Or will handle any questions
regarding this series, or mlx5 in general, while I am away.

Thanks,
Saeed.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
       [not found]               ` <20160909073652.351d76d7-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2016-09-09  6:30                 ` Alexei Starovoitov via iovisor-dev
@ 2016-09-09 19:02                 ` Tom Herbert via iovisor-dev
  1 sibling, 0 replies; 72+ messages in thread
From: Tom Herbert via iovisor-dev @ 2016-09-09 19:02 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Linux Kernel Network Developers, iovisor-dev, Jamal Hadi Salim,
	Saeed Mahameed, Eric Dumazet

On Thu, Sep 8, 2016 at 10:36 PM, Jesper Dangaard Brouer
<brouer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Thu, 8 Sep 2016 20:22:04 -0700
> Alexei Starovoitov <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>
>> On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:
>> >
>> > I'm sorry but I have a problem with this patch!
>>
>> is it because the variable is called 'xdp_doorbell'?
>> Frankly I see nothing scary in this patch.
>> It extends existing code by adding a flag to ring doorbell or not.
>> The end of rx napi is used as an obvious heuristic to flush the pipe.
>> Looks pretty generic to me.
>> The same code can be used for non-xdp as well once we figure out
>> good algorithm for xmit_more in the stack.
>
> What I'm proposing can also be used by the normal stack.
>
>> > Looking at this patch, I want to bring up a fundamental architectural
>> > concern with the development direction of XDP transmit.
>> >
>> >
>> > What you are trying to implement, with delaying the doorbell, is
>> > basically TX bulking for TX_XDP.
>> >
>> >  Why not implement a TX bulking interface directly instead?!?
>> >
>> > Yes, the tailptr/doorbell is the most costly operation, but why not
>> > also take advantage of the benefits of bulking for other parts of the
>> > code? (benefit is smaller, by every cycles counts in this area)
>> >
>> > This hole XDP exercise is about avoiding having a transaction cost per
>> > packet, that reads "bulking" or "bundling" of packets, where possible.
>> >
>> >  Lets do bundling/bulking from the start!
>>
>> mlx4 already does bulking and this proposed mlx5 set of patches
>> does bulking as well.
>> See nothing wrong about it. RX side processes the packets and
>> when it's done it tells TX to xmit whatever it collected.
>
> This is doing "hidden" bulking and not really taking advantage of using
> the icache more effeciently.
>
> Let me explain the problem I see, little more clear then, so you
> hopefully see where I'm going.
>
> Imagine you have packets intermixed towards the stack and XDP_TX.
> Every time you call the stack code, then you flush your icache.  When
> returning to the driver code, you will have to reload all the icache
> associated with the XDP_TX, this is a costly operation.
>
>
>> > The reason behind the xmit_more API is that we could not change the
>> > API of all the drivers.  And we found that calling an explicit NDO
>> > flush came at a cost (only approx 7 ns IIRC), but it still a cost that
>> > would hit the common single packet use-case.
>> >
>> > It should be really easy to build a bundle of packets that need XDP_TX
>> > action, especially given you only have a single destination "port".
>> > And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record.
>>
>> not sure what are you proposing here?
>> Sounds like you want to extend it to multi port in the future?
>> Sure. The proposed code is easily extendable.
>>
>> Or you want to see something like a link list of packets
>> or an array of packets that RX side is preparing and then
>> send the whole array/list to TX port?
>> I don't think that would be efficient, since it would mean
>> unnecessary copy of pointers.
>
> I just explain it will be more efficient due to better use of icache.
>
>
>> > In the future, XDP need to support XDP_FWD forwarding of packets/pages
>> > out other interfaces.  I also want bulk transmit from day-1 here.  It
>> > is slightly more tricky to sort packets for multiple outgoing
>> > interfaces efficiently in the pool loop.
>>
>> I don't think so. Multi port is natural extension to this set of patches.
>> With multi port the end of RX will tell multiple ports (that were
>> used to tx) to ring the bell. Pretty trivial and doesn't involve any
>> extra arrays or link lists.
>
> So, have you solved the problem exclusive access to a TX ring of a
> remote/different net_device when sending?
>
> In you design you assume there exist many TX ring available for other
> devices to access.  In my design I also want to support devices that
> doesn't have this HW capability, and e.g. only have one TX queue.
>
Right, but segregating the TX queues used by the stack from those used
by XDP is pretty fundamental to the design. If we start mixing them,
then we need to pull in several features (such as BQL, which seems like
what you're proposing) into the XDP path. If this starts to slow
things down, or we need to reinvent a bunch of existing features to not
use skbuffs, that seems to run contrary to the "as simple as possible"
model for XDP -- we may as well use the regular stack at that point
maybe...

Tom

>
>> > But the mSwitch[1] article actually already solved this destination
>> > sorting.  Please read[1] section 3.3 "Switch Fabric Algorithm" for
>> > understanding the next steps, for a smarter data structure, when
>> > starting to have more TX "ports".  And perhaps align your single
>> > XDP_TX destination data structure to this future development.
>> >
>> > [1] http://info.iet.unipi.it/~luigi/papers/20150617-mswitch-paper.pdf
>>
>> I don't see how this particular paper applies to the existing kernel code.
>> It's great to take ideas from research papers, but real code is different.
>>
>> > --Jesper
>> > (top post)
>>
>> since when it's ok to top post?
>
> What a kneejerk reaction.  When writing something general we often
> reply to the top of the email, and then often delete the rest (which
> makes it hard for later comers to follow).  I was bcc'ing some people,
> which needed the context, so it was a service note to you, that I
> didn't write anything below.
>
>
>> > On Wed,  7 Sep 2016 15:42:32 +0300 Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>> >
>> > > Previously we rang XDP SQ doorbell on every forwarded XDP packet.
>> > >
>> > > Here we introduce a xmit more like mechanism that will queue up more
>> > > than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
>> > >
>> > > Once RX napi budget is consumed and we exit napi RX loop, we will
>> > > flush (doorbell) all XDP looped packets in case there are such.
>> > >
>> > > XDP forward packet rate:
>> > >
>> > > Comparing XDP with and w/o xmit more (bulk transmit):
>> > >
>> > > Streams     XDP TX       XDP TX (xmit more)
>> > > ---------------------------------------------------
>> > > 1           4.90Mpps      7.50Mpps
>> > > 2           9.50Mpps      14.8Mpps
>> > > 4           16.5Mpps      25.1Mpps
>> > > 8           21.5Mpps      27.5Mpps*
>> > > 16          24.1Mpps      27.5Mpps*
>> > >
>> > > *It seems we hit a wall of 27.5Mpps, for 8 and 16 streams,
>> > > we will be working on the analysis and will publish the conclusions
>> > > later.
>> > >
>> > > Signed-off-by: Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>> > > ---
>> > >  drivers/net/ethernet/mellanox/mlx5/core/en.h    |  9 ++--
>> > >  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 57 +++++++++++++++++++------
>> > >  2 files changed, 49 insertions(+), 17 deletions(-)
>> ...
>> > > @@ -131,7 +132,7 @@ static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
>> > >                   mlx5e_read_mini_arr_slot(cq, cqcc);
>> > >
>> > >   mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
>> > >
>> > > +#if 0 /* enable this code only if MLX5E_XDP_TX_WQEBBS > 1 */
>>
>> Saeed,
>> please make sure to remove such debug bits.
>>
>
>
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
       [not found]                   ` <20160909063048.GA67375-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
@ 2016-09-12  8:56                     ` Jesper Dangaard Brouer via iovisor-dev
       [not found]                       ` <20160912105655.0cb5607e-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2016-09-12 11:30                     ` Jesper Dangaard Brouer via iovisor-dev
  1 sibling, 1 reply; 72+ messages in thread
From: Jesper Dangaard Brouer via iovisor-dev @ 2016-09-12  8:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Tom Herbert, iovisor-dev, Jamal Hadi Salim, Saeed Mahameed,
	Eric Dumazet, netdev-u79uwXL29TY76Z2rM5mHXA, Edward Cree

On Thu, 8 Sep 2016 23:30:50 -0700
Alexei Starovoitov <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:
> > > >  Lets do bundling/bulking from the start!    
> > > 
> > > mlx4 already does bulking and this proposed mlx5 set of patches
> > > does bulking as well.
> > > See nothing wrong about it. RX side processes the packets and
> > > when it's done it tells TX to xmit whatever it collected.  
> > 
> > This is doing "hidden" bulking and not really taking advantage of using
> > the icache more effeciently.  
> > 
> > Let me explain the problem I see, little more clear then, so you
> > hopefully see where I'm going.
> > 
> > Imagine you have packets intermixed towards the stack and XDP_TX. 
> > Every time you call the stack code, then you flush your icache.  When
> > returning to the driver code, you will have to reload all the icache
> > associated with the XDP_TX, this is a costly operation.  
> 
> correct. And why is that a problem?

It is good that you can see and acknowledge the I-cache problem.

XDP is all about performance.  What I hear is that you are arguing
against a model that will yield better performance; that does not make
sense to me.  Let me explain this again, in another way.

This is a mental model switch.  Stop seeing the lowest driver RX as
something that works on a per-packet basis.  Maybe it is easier to
understand if we instead see this as vector processing?  This is about
having a vector of packets, where we apply some action/operation.

This is about using the CPU more efficiently, getting it to do more
instructions per cycle (directly measurable with perf, while I-cache
is not directly measurable).


Lets assume everything fits into the I-cache (XDP+driver code). The
CPU-frontend still have to decode the instructions from the I-cache
into micro-ops.  The next level of optimizations is to reuse the
decoded I-cache by running it on all elements in the packet-vector.

The Intel "64 and IA-32 Architectures Optimization Reference Manual"
(section 3.4.2.6 "Optimization for Decoded ICache"[1][2]), states make
sure each hot code block is less than about 500 instructions.  Thus,
the different "stages" working on the packet-vector, need to be rather
small and compact.

[1] http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
[2] http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf



Notice: The same mental model switch applies to delivering packets to
the regular netstack.  I've brought this up before[3].  Instead of
flushing the driver's I-cache for every packet by calling the stack,
let us instead bundle up N packets in the driver before calling the
stack.  I showed a 10% speedup with a naive implementation of this
approach.  Edward Cree also showed[4] a 10% performance boost, and
went further into the stack, showing a 25% increase.

A goal is also to make optimizing netstack code size independent of
the driver code size, by separating the netstack's I-cache usage from
the driver's.

[3] http://lists.openwall.net/netdev/2016/01/15/51
[4] http://lists.openwall.net/netdev/2016/04/19/89
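
A sketch of what this "vector of packets" model could look like in a driver RX poll loop follows; every identifier is invented, and the stage split is only meant to show how each stage's code can stay hot in the (decoded) I-cache while it walks a whole bundle:

/* Sketch of staged, vectorized RX processing (invented names): pull a
 * small bundle off the ring, then run each stage over the whole bundle
 * so the stage's instructions are reused from the (decoded) I-cache. */
#define RX_BUNDLE 16

static int my_napi_poll(struct my_rq *rq, int budget)
{
	struct my_pkt *vec[RX_BUNDLE];
	int n, done = 0;

	while (done < budget) {
		/* Stage 1: harvest up to RX_BUNDLE ready descriptors. */
		n = my_rx_fill_vector(rq, vec, RX_BUNDLE);
		if (!n)
			break;

		/* Stage 2: run XDP over the whole bundle, sorting frames
		 * into per-action lists (drop / XDP_TX / to-stack). */
		my_xdp_run_vector(rq, vec, n);

		/* Stage 3: flush all XDP_TX frames with one doorbell, then
		 * hand the remaining frames to the stack in one call. */
		my_xdp_tx_flush(rq);
		my_netstack_deliver_list(rq, vec, n);

		done += n;
	}
	return done;
}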
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
       [not found]             ` <CALzJLG_r0pDJgxqqak5=NatT8tF7UP2NkGS1wjeWcS5C=Zvv2A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-09-12 10:15               ` Jesper Dangaard Brouer via iovisor-dev
       [not found]                 ` <20160912121530.4b4f0ad7-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2016-09-13 15:20                 ` [iovisor-dev] " Edward Cree
  0 siblings, 2 replies; 72+ messages in thread
From: Jesper Dangaard Brouer via iovisor-dev @ 2016-09-12 10:15 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Tom Herbert, iovisor-dev, Jamal Hadi Salim, Saeed Mahameed,
	Eric Dumazet, Linux Netdev List

On Fri, 9 Sep 2016 18:03:09 +0300
Saeed Mahameed <saeedm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:

> On Fri, Sep 9, 2016 at 6:22 AM, Alexei Starovoitov via iovisor-dev
> <iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy@public.gmane.org> wrote:
> > On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:  
> >>
> >> I'm sorry but I have a problem with this patch!  
> >> Looking at this patch, I want to bring up a fundamental architectural
> >> concern with the development direction of XDP transmit.
> >>
> >>
> >> What you are trying to implement, with delaying the doorbell, is
> >> basically TX bulking for TX_XDP.
> >>
> >>  Why not implement a TX bulking interface directly instead?!?
> >>
> >> Yes, the tailptr/doorbell is the most costly operation, but why not
> >> also take advantage of the benefits of bulking for other parts of the
> >> code? (benefit is smaller, by every cycles counts in this area)
> >>
> >> This hole XDP exercise is about avoiding having a transaction cost per
> >> packet, that reads "bulking" or "bundling" of packets, where possible.
> >>
> >>  Lets do bundling/bulking from the start!  
> 
> Jesper, what we did here is also bulking, instead of bulking in a
> temporary list in the driver we list the packets in the HW and once
> done we transmit all at once via the xdp_doorbell indication.
> 
> I agree with you that we can take advantage and improve the icache by
> bulking first in software and then queue all at once in the hw then
> ring one doorbell.
> 
> but I also agree with Alexei that this will introduce an extra
> pointer/list handling in the driver and we need to do the comparison
> between both approaches before we decide which is better.

I welcome implementing both approaches and benchmarking them against
each-other, I'll gladly dedicate time for this!

I'm reacting so loudly because this is a mental model switch that
needs to be applied to the full driver RX path, and also to normal stack
delivery of SKBs. As both Edward Cree[1] and I[2] have demonstrated,
there is a 10%-25% perf gain here.

The key point is to stop seeing the lowest driver RX as something that
works on a per-packet basis.  It might be easier to view this as a kind
of vector processing.  This is about having a vector of packets, where
we apply some action/operation.

This is about using the CPU more efficiently, getting it to do more
instructions per cycle.  The next level of optimization (for >= Sandy
Bridge CPUs) is to make these vector processing stages small enough to fit
into the CPU's decoded-I-cache section.


It might also be important to mention that for netstack delivery I
don't imagine bulking 64 packets.  Instead, I imagine doing 8-16
packets.  Why?  Because the NIC HW runs independently and has the
opportunity to deliver more frames into the RX ring queue while the
stack slowly processes packets.  You can view this as "bulking" from
the RX ring queue, with a "look-back" before exiting the NAPI poll loop.


> this must be marked as future work and not have this from the start.

We both know that statement is BS, and the other approach will never be
implemented once this patch is accepted upstream.


> > mlx4 already does bulking and this proposed mlx5 set of patches
> > does bulking as well.

I'm reacting exactly because mlx4 is also doing "bulking" in the wrong
way IMHO.  And now mlx5 is building on the same principle. That is why
I'm yelling STOP.


> >> The reason behind the xmit_more API is that we could not change the
> >> API of all the drivers.  And we found that calling an explicit NDO
> >> flush came at a cost (only approx 7 ns IIRC), but it still a cost that
> >> would hit the common single packet use-case.
> >>
> >> It should be really easy to build a bundle of packets that need XDP_TX
> >> action, especially given you only have a single destination "port".
> >> And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record.  


[1] http://lists.openwall.net/netdev/2016/04/19/89
[2] http://lists.openwall.net/netdev/2016/01/15/51

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
       [not found]                   ` <20160909063048.GA67375-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
  2016-09-12  8:56                     ` Jesper Dangaard Brouer via iovisor-dev
@ 2016-09-12 11:30                     ` Jesper Dangaard Brouer via iovisor-dev
  2016-09-12 19:56                       ` Alexei Starovoitov
  1 sibling, 1 reply; 72+ messages in thread
From: Jesper Dangaard Brouer via iovisor-dev @ 2016-09-12 11:30 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Tom Herbert, iovisor-dev, Jamal Hadi Salim, Saeed Mahameed,
	Eric Dumazet, netdev-u79uwXL29TY76Z2rM5mHXA

On Thu, 8 Sep 2016 23:30:50 -0700
Alexei Starovoitov <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:
[...]
> > Imagine you have packets intermixed towards the stack and XDP_TX. 
> > Every time you call the stack code, then you flush your icache.  When
> > returning to the driver code, you will have to reload all the icache
> > associated with the XDP_TX, this is a costly operation.  
> 
[...]
> To make further progress in this discussion can we talk about
> the use case you have in mind instead? Then solution will
> be much clear, I hope.

The DDoS use-case _is_ affected by this "hidden" bulking design.

Let's say I want to implement a DDoS facility. Instead of just
dropping the malicious packets, I want to see the bad packets.  I
implement this by rewriting the destination MAC to be my monitor
machine's and then XDP_TX'ing the packet.
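
As a concrete illustration, a minimal XDP program for this monitor idea might look like the sketch below; the classifier, the monitor MAC address and the SEC() plumbing are assumptions, and only the XDP_PASS/XDP_TX mechanics are the real XDP interface:

/* Minimal sketch of the DDoS-monitor idea: suspected-bad frames get their
 * destination MAC rewritten to the monitor machine and are bounced back
 * out with XDP_TX; everything else goes to the stack with XDP_PASS.
 * is_suspect() and monitor_mac are placeholders, not a real classifier. */
#include <linux/bpf.h>
#include <linux/if_ether.h>

#define SEC(name) __attribute__((section(name), used))

static const unsigned char monitor_mac[ETH_ALEN] = { 0x02, 0x00, 0x00, 0x00, 0x00, 0x01 };

static __attribute__((always_inline)) int is_suspect(struct ethhdr *eth, void *data_end)
{
	/* Placeholder classifier: real logic would inspect headers/state. */
	return 0;
}

SEC("xdp")
int xdp_ddos_monitor(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct ethhdr *eth = data;

	if ((void *)(eth + 1) > data_end)
		return XDP_PASS;

	if (!is_suspect(eth, data_end))
		return XDP_PASS;		/* clean traffic: normal stack */

	__builtin_memcpy(eth->h_dest, monitor_mac, ETH_ALEN);
	return XDP_TX;				/* bounce to the monitor host */
}

char _license[] SEC("license") = "GPL";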

In the DDoS use-case, you have loaded your XDP/eBPF program, and 100%
of the traffic is delivered to the stack. (See note 1)

Once the DDoS attack starts, the traffic pattern changes, and XDP
should (hopefully) catch only the malicious traffic (the monitor machine can
help diagnose false positives).  Now, due to interleaving the DDoS
traffic with the clean traffic, the efficiency of XDP_TX is reduced by
more icache misses...



Note(1): Notice I have already demonstrated that loading an XDP/eBPF
program with 100% delivery to the stack actually slows down the
normal stack.  This is due to hitting a bottleneck in the page
allocator.  I'm working on removing that bottleneck with page_pool, and
that solution is orthogonal to this problem.
 It is actually an excellent argument for why you would want to run a
DDoS XDP filter only on a restricted number of RX queues.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
       [not found]                       ` <20160912105655.0cb5607e-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2016-09-12 17:53                         ` Alexei Starovoitov via iovisor-dev
  0 siblings, 0 replies; 72+ messages in thread
From: Alexei Starovoitov via iovisor-dev @ 2016-09-12 17:53 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Tom Herbert, iovisor-dev, Jamal Hadi Salim, Saeed Mahameed,
	Eric Dumazet, netdev-u79uwXL29TY76Z2rM5mHXA, Edward Cree

On Mon, Sep 12, 2016 at 10:56:55AM +0200, Jesper Dangaard Brouer wrote:
> On Thu, 8 Sep 2016 23:30:50 -0700
> Alexei Starovoitov <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> 
> > On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:
> > > > >  Lets do bundling/bulking from the start!    
> > > > 
> > > > mlx4 already does bulking and this proposed mlx5 set of patches
> > > > does bulking as well.
> > > > See nothing wrong about it. RX side processes the packets and
> > > > when it's done it tells TX to xmit whatever it collected.  
> > > 
> > > This is doing "hidden" bulking and not really taking advantage of using
> > > the icache more effeciently.  
> > > 
> > > Let me explain the problem I see, little more clear then, so you
> > > hopefully see where I'm going.
> > > 
> > > Imagine you have packets intermixed towards the stack and XDP_TX. 
> > > Every time you call the stack code, then you flush your icache.  When
> > > returning to the driver code, you will have to reload all the icache
> > > associated with the XDP_TX, this is a costly operation.  
> > 
> > correct. And why is that a problem?
> 
> It is good that you can see and acknowledge the I-cache problem.
> 
> XDP is all about performance.  What I hear is, that you are arguing
> against a model that will yield better performance, that does not make
> sense to me.  Let me explain this again, in another way.

I'm arguing against your proposal because I think it will be more complex and
lower performing than what Saeed and the team already implemented.
Therefore I don't think it's fair to block the patch and ask them to
reimplement it just to test an idea that may or may not improve performance.

Getting maximum performance is tricky. Good is better than perfect.
It's important to argue about user space visible bits upfront, but
on the kernel performance side we should build/test incrementally.
This particular patch 11/11 is simple, easy to review and provides
good performance. What's not to like?

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
  2016-09-12 11:30                     ` Jesper Dangaard Brouer via iovisor-dev
@ 2016-09-12 19:56                       ` Alexei Starovoitov
       [not found]                         ` <20160912195626.GA18146-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
  0 siblings, 1 reply; 72+ messages in thread
From: Alexei Starovoitov @ 2016-09-12 19:56 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Saeed Mahameed, iovisor-dev, netdev, Tariq Toukan,
	Brenden Blanco, Tom Herbert, Martin KaFai Lau, Daniel Borkmann,
	Eric Dumazet, Jamal Hadi Salim

On Mon, Sep 12, 2016 at 01:30:25PM +0200, Jesper Dangaard Brouer wrote:
> On Thu, 8 Sep 2016 23:30:50 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:
> [...]
> > > Imagine you have packets intermixed towards the stack and XDP_TX. 
> > > Every time you call the stack code, then you flush your icache.  When
> > > returning to the driver code, you will have to reload all the icache
> > > associated with the XDP_TX, this is a costly operation.  
> > 
> [...]
> > To make further progress in this discussion can we talk about
> > the use case you have in mind instead? Then solution will
> > be much clear, I hope.
> 
> The DDoS use-case _is_ affected by this "hidden" bulking design.
> 
> Lets say, I want to implement a DDoS facility. Instead of just
> dropping the malicious packets, I want to see the bad packets.  I
> implement this by rewriting the destination-MAC to be my monitor
> machine and then XDP_TX the packet.

Not following the use case. You want to implement a DDoS generator?
Or just forward all bad packets from the affected host to another host
in the same rack? Then two servers will be spammed with traffic and
there will be even more load on the ToR. I really don't see how this is useful
for anything but stress testing.

> In the DDoS use-case, you have loaded your XDP/eBPF program, and 100%
> of the traffic is delivered to the stack. (See note 1)

Hmm. The DoS prevention use case is when 99% of the traffic is dropped.

> Once the DDoS attack starts, then the traffic pattern changes, and XDP
> should (hopefully only) catch the malicious traffic (monitor machine can
> help diagnose false positive).  Now, due to interleaving the DDoS
> traffic with the clean traffic, then efficiency of XDP_TX is reduced due to
> more icache misses...
> 
> 
> 
> Note(1): Notice I have already demonstrated that loading a XDP/eBPF
> program with 100% delivery to the stack, actually slows down the
> normal stack.  This is due to hitting a bottleneck in the page
> allocator.  I'm working removing that bottleneck with page_pool, and
> that solution is orthogonal to this problem.

Sure. No one is arguing against improving the page allocator.

>  It is actually an excellent argument, for why you would want to run a
> DDoS XDP filter only on a restricted number of RX queues.

No, it's the opposite. If the host is under DoS there is no way
the host can tell in advance which RX queue will be seeing bad packets.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
       [not found]                         ` <20160912195626.GA18146-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
@ 2016-09-12 20:48                           ` Jesper Dangaard Brouer via iovisor-dev
  0 siblings, 0 replies; 72+ messages in thread
From: Jesper Dangaard Brouer via iovisor-dev @ 2016-09-12 20:48 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Tom Herbert, iovisor-dev, Jamal Hadi Salim, Saeed Mahameed,
	Eric Dumazet, netdev-u79uwXL29TY76Z2rM5mHXA

On Mon, 12 Sep 2016 12:56:28 -0700
Alexei Starovoitov <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> On Mon, Sep 12, 2016 at 01:30:25PM +0200, Jesper Dangaard Brouer wrote:
> > On Thu, 8 Sep 2016 23:30:50 -0700
> > Alexei Starovoitov <alexei.starovoitov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >   
> > > On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:  
> > [...]  
> > > > Imagine you have packets intermixed towards the stack and XDP_TX. 
> > > > Every time you call the stack code, then you flush your icache.  When
> > > > returning to the driver code, you will have to reload all the icache
> > > > associated with the XDP_TX, this is a costly operation.    
> > >   
> > [...]  
> > > To make further progress in this discussion can we talk about
> > > the use case you have in mind instead? Then solution will
> > > be much clear, I hope.  
> > 
> > The DDoS use-case _is_ affected by this "hidden" bulking design.
> > 
> > Lets say, I want to implement a DDoS facility. Instead of just
> > dropping the malicious packets, I want to see the bad packets.  I
> > implement this by rewriting the destination-MAC to be my monitor
> > machine and then XDP_TX the packet.  
> 
> not following the use case. you want to implement a DDoS generator?
> Or just forward all bad packets from affected host to another host
> in the same rack? so two servers will be spammed with traffic and
> even more load on the tor? I really don't see how this is useful
> for anything but stress testing.

As I wrote below, the purpose of the monitor machine is to diagnose
false positives.  If you worry about the added load, I would either
forward out another interface (which is not supported yet) or simply do
sampling of packets being forwarded to the monitor host.

> > In the DDoS use-case, you have loaded your XDP/eBPF program, and 100%
> > of the traffic is delivered to the stack. (See note 1)  
> 
> hmm. DoS prevention use case is when 99% of the traffic is dropped.

As I wrote below, until the DDoS attack starts, all packets are
delivered to the stack.

> > Once the DDoS attack starts, then the traffic pattern changes, and XDP
> > should (hopefully only) catch the malicious traffic (monitor machine can
> > help diagnose false positive).  Now, due to interleaving the DDoS
> > traffic with the clean traffic, then efficiency of XDP_TX is reduced due to
> > more icache misses...
> > 
> > 
> > 
> > Note(1): Notice I have already demonstrated that loading a XDP/eBPF
> > program with 100% delivery to the stack, actually slows down the
> > normal stack.  This is due to hitting a bottleneck in the page
> > allocator.  I'm working removing that bottleneck with page_pool, and
> > that solution is orthogonal to this problem.  
> 
> sure. no one arguing against improving page allocator.
> 
> >  It is actually an excellent argument, for why you would want to run a
> > DDoS XDP filter only on a restricted number of RX queues.  
> 
> no. it's the opposite. If the host is under DoS there is no way
> the host can tell in advance which rx queue will be seeing bad
> packets.

Sorry, this note was not related to the DoS use-case.  You
misunderstood it.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
       [not found]                 ` <20160912121530.4b4f0ad7-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2016-09-12 21:45                   ` Tom Herbert via iovisor-dev
  0 siblings, 0 replies; 72+ messages in thread
From: Tom Herbert via iovisor-dev @ 2016-09-12 21:45 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Linux Netdev List, iovisor-dev, Jamal Hadi Salim, Saeed Mahameed,
	Eric Dumazet

On Mon, Sep 12, 2016 at 3:15 AM, Jesper Dangaard Brouer
<brouer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Fri, 9 Sep 2016 18:03:09 +0300
> Saeed Mahameed <saeedm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
>
>> On Fri, Sep 9, 2016 at 6:22 AM, Alexei Starovoitov via iovisor-dev
>> <iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy@public.gmane.org> wrote:
>> > On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:
>> >>
>> >> I'm sorry but I have a problem with this patch!
>> >> Looking at this patch, I want to bring up a fundamental architectural
>> >> concern with the development direction of XDP transmit.
>> >>
>> >>
>> >> What you are trying to implement, with delaying the doorbell, is
>> >> basically TX bulking for TX_XDP.
>> >>
>> >>  Why not implement a TX bulking interface directly instead?!?
>> >>
>> >> Yes, the tailptr/doorbell is the most costly operation, but why not
>> >> also take advantage of the benefits of bulking for other parts of the
>> >> code? (benefit is smaller, by every cycles counts in this area)
>> >>
> > > > This whole XDP exercise is about avoiding a per-packet transaction cost;
> > > > that reads "bulking" or "bundling" of packets, where possible.
>> >>
>> >>  Lets do bundling/bulking from the start!
>>
>> Jesper, what we did here is also bulking: instead of bulking in a
>> temporary list in the driver, we list the packets in the HW and, once
>> done, transmit them all at once via the xdp_doorbell indication.
>>
>> I agree with you that we can take advantage and improve the icache by
>> bulking first in software and then queue all at once in the hw then
>> ring one doorbell.
>>
>> but I also agree with Alexei that this will introduce extra
>> pointer/list handling in the driver, and we need to compare
>> both approaches before we decide which is better.
>
> I welcome implementing both approaches and benchmarking them against
> each-other, I'll gladly dedicate time for this!
>
Yes, please implement this so we can have something clear to evaluate
and compare. There is far too much spewing of "expert opinions"
happening here :-(
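
For reference, the shape of the alternative Jesper is describing is
roughly the following (made-up names and types, nothing to do with the
actual mlx5 code): collect the XDP_TX frames in a small software array
while walking the RX completions, then post the whole bundle and ring
the doorbell once before leaving the poll.

#define XDP_TX_BUNDLE 16

struct xdp_tx_frame {
	void *data;
	unsigned int len;
};

struct xdp_tx_bundle {
	unsigned int count;
	struct xdp_tx_frame frames[XDP_TX_BUNDLE];
};

/* driver-specific: post 'count' descriptors back-to-back, then do a
 * single tailptr/doorbell write (placeholder body) */
static void xdp_tx_bundle_flush(struct xdp_tx_bundle *b)
{
	/* ... post b->frames[0..count-1] to the SQ, ring doorbell once ... */
	b->count = 0;
}

/* called from the RX loop for every packet the program returned XDP_TX on */
static void xdp_tx_bundle_add(struct xdp_tx_bundle *b, void *data,
			      unsigned int len)
{
	b->frames[b->count].data = data;
	b->frames[b->count].len = len;
	if (++b->count == XDP_TX_BUNDLE)
		xdp_tx_bundle_flush(b);
}

The patch set as posted instead posts each frame to the SQ immediately
and only delays the doorbell; that is the comparison we need numbers
for.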

> I'm reacting so loudly because this is a mental-model switch that
> needs to be applied to the full driver's RX path. Also for normal stack
> delivery of SKBs. As both Edward Cree[1] and I[2] have demonstrated,
> there is between 10%-25% perf gain here.
>
> The key point is stop seeing the lowest driver RX as something that
> works on a per packet basis.  It might be easier to view this as a kind
> of vector processing.  This is about having a vector of packets, where
> we apply some action/operation.
>
> This is about using the CPU more efficiently, getting it to do more
> instructions per cycle.  The next level of optimization (for >= Sandy
> Bridge CPUs) is to make these vector processing stages small enough to fit
> into the CPU's decoded-I-cache section.
>
>
> It might also be important to mention that, for netstack delivery, I
> don't imagine bulking 64 packets.  Instead, I imagine doing 8-16
> packets.  Why? Because the NIC HW runs independently and has the
> opportunity to deliver more frames into the RX ring queue while the
> stack "slowly" processes packets.  You can view this as "bulking" from
> the RX ring queue, with a "look-back" before exiting the NAPI poll loop.
>
>
>> this must be marked as future work and not have this from the start.
>
> We both know that statement is BS, and the other approach will never be
> implemented once this patch is accepted upstream.
>
>
>> > mlx4 already does bulking and this proposed mlx5 set of patches
>> > does bulking as well.
>
> I'm reacting exactly because mlx4 is also doing "bulking" in the wrong
> way IMHO.  And now mlx5 is building on the same principle. That is why
> I'm yelling STOP.
>
>
>> >> The reason behind the xmit_more API is that we could not change the
>> >> API of all the drivers.  And we found that calling an explicit NDO
>> >> flush came at a cost (only approx 7 ns IIRC), but it is still a cost that
>> >> would hit the common single packet use-case.
>> >>
>> >> It should be really easy to build a bundle of packets that need XDP_TX
>> >> action, especially given you only have a single destination "port".
>> >> And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record.
>
>
> [1] http://lists.openwall.net/netdev/2016/04/19/89
> [2] http://lists.openwall.net/netdev/2016/01/15/51
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 03/11] net/mlx5e: Implement RX mapped page cache for page recycle
       [not found]       ` <20160907204501.08cc4ede-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2016-09-13 10:16         ` Tariq Toukan via iovisor-dev
       [not found]           ` <549ee0e2-b76b-ec62-4287-e63c4320e7c6-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 72+ messages in thread
From: Tariq Toukan via iovisor-dev @ 2016-09-13 10:16 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Saeed Mahameed
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, iovisor-dev, Jamal Hadi Salim,
	Eric Dumazet, Tom Herbert


On 07/09/2016 9:45 PM, Jesper Dangaard Brouer wrote:
> On Wed,  7 Sep 2016 15:42:24 +0300 Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>
>> From: Tariq Toukan <tariqt-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>
>> Instead of reallocating and mapping pages for RX data-path,
>> recycle already used pages in a per ring cache.
>>
>> We ran pktgen single-stream benchmarks, with iptables-raw-drop:
>>
>> Single stride, 64 bytes:
>> * 4,739,057 - baseline
>> * 4,749,550 - order0 no cache
>> * 4,786,899 - order0 with cache
>> 1% gain
>>
>> Larger packets, no page cross, 1024 bytes:
>> * 3,982,361 - baseline
>> * 3,845,682 - order0 no cache
>> * 4,127,852 - order0 with cache
>> 3.7% gain
>>
>> Larger packets, every 3rd packet crosses a page, 1500 bytes:
>> * 3,731,189 - baseline
>> * 3,579,414 - order0 no cache
>> * 3,931,708 - order0 with cache
>> 5.4% gain
>>
>> Signed-off-by: Tariq Toukan <tariqt-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>> Signed-off-by: Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>> ---
>>   drivers/net/ethernet/mellanox/mlx5/core/en.h       | 16 ++++++
>>   drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 15 ++++++
>>   drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    | 57 ++++++++++++++++++++--
>>   drivers/net/ethernet/mellanox/mlx5/core/en_stats.h | 16 ++++++
>>   4 files changed, 99 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>> index 075cdfc..afbdf70 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>> @@ -287,6 +287,18 @@ struct mlx5e_rx_am { /* Adaptive Moderation */
>>   	u8					tired;
>>   };
>>   
>> +/* a single cache unit is capable to serve one napi call (for non-striding rq)
>> + * or a MPWQE (for striding rq).
>> + */
>> +#define MLX5E_CACHE_UNIT	(MLX5_MPWRQ_PAGES_PER_WQE > NAPI_POLL_WEIGHT ? \
>> +				 MLX5_MPWRQ_PAGES_PER_WQE : NAPI_POLL_WEIGHT)
>> +#define MLX5E_CACHE_SIZE	(2 * roundup_pow_of_two(MLX5E_CACHE_UNIT))
>> +struct mlx5e_page_cache {
>> +	u32 head;
>> +	u32 tail;
>> +	struct mlx5e_dma_info page_cache[MLX5E_CACHE_SIZE];
>> +};
>> +
> [...]
>>   
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
>> index c1cb510..8e02af3 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
>> @@ -305,11 +305,55 @@ static inline void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
>>   	mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
>>   }
>>   
>> +static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
>> +				      struct mlx5e_dma_info *dma_info)
>> +{
>> +	struct mlx5e_page_cache *cache = &rq->page_cache;
>> +	u32 tail_next = (cache->tail + 1) & (MLX5E_CACHE_SIZE - 1);
>> +
>> +	if (tail_next == cache->head) {
>> +		rq->stats.cache_full++;
>> +		return false;
>> +	}
>> +
>> +	cache->page_cache[cache->tail] = *dma_info;
>> +	cache->tail = tail_next;
>> +	return true;
>> +}
>> +
>> +static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq,
>> +				      struct mlx5e_dma_info *dma_info)
>> +{
>> +	struct mlx5e_page_cache *cache = &rq->page_cache;
>> +
>> +	if (unlikely(cache->head == cache->tail)) {
>> +		rq->stats.cache_empty++;
>> +		return false;
>> +	}
>> +
>> +	if (page_ref_count(cache->page_cache[cache->head].page) != 1) {
>> +		rq->stats.cache_busy++;
>> +		return false;
>> +	}
> Hmmm... doesn't this cause "blocking" of the page_cache recycle
> facility until the page at the head of the queue gets (page) refcnt
> decremented?  Real use-case could fairly easily block/cause this...
Hi Jesper,

That's right. We are aware of this issue.
We considered ways of solving this, but decided to keep the current
implementation for now.
One way of solving this is to look deeper into the cache.
Cons:
- this will consume time, and the chance of finding an available page is
not that high: if the page at the head of the queue is busy, then there's
a good chance that all the others are too (because of FIFO).
In other words, you have already checked all the pages and you're going
to allocate a new one anyway (a higher penalty for the same decision).
- this will make holes in the array, causing complex accounting when
looking for an available page (this can easily be fixed by swapping the
page at the head with the available one; see the sketch below).
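
To illustrate the swap idea, a rough sketch on top of the
mlx5e_rx_cache_get() from this patch (hypothetical, including the scan
depth; not something we plan to submit as-is) would be:

/* Hypothetical variant: scan a few entries past the head; if a free page
 * is found deeper in the FIFO, swap it with the head entry so the ring
 * stays hole-free, then consume the head as usual.
 */
static inline bool mlx5e_rx_cache_get_deep(struct mlx5e_rq *rq,
					   struct mlx5e_dma_info *dma_info,
					   u32 max_scan)
{
	struct mlx5e_page_cache *cache = &rq->page_cache;
	u32 i, idx;

	if (unlikely(cache->head == cache->tail)) {
		rq->stats.cache_empty++;
		return false;
	}

	for (i = 0; i < max_scan; i++) {
		idx = (cache->head + i) & (MLX5E_CACHE_SIZE - 1);
		if (idx == cache->tail)
			break;
		if (page_ref_count(cache->page_cache[idx].page) == 1) {
			/* keep the ring compact: move the free entry to head */
			swap(cache->page_cache[cache->head],
			     cache->page_cache[idx]);
			*dma_info = cache->page_cache[cache->head];
			cache->head = (cache->head + 1) & (MLX5E_CACHE_SIZE - 1);
			rq->stats.cache_reuse++;
			dma_sync_single_for_device(rq->pdev, dma_info->addr,
						   PAGE_SIZE, DMA_FROM_DEVICE);
			return true;
		}
	}

	rq->stats.cache_busy++;
	return false;
}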

Another way is sharing pages between different RQs.
- For now we're not doing this, for simplicity and to avoid the need for
synchronization.

What do you think?

Anyway, we're looking forward to using your page-pool API, which solves
these issues.

Regards,
Tariq
>
>> +
>> +	*dma_info = cache->page_cache[cache->head];
>> +	cache->head = (cache->head + 1) & (MLX5E_CACHE_SIZE - 1);
>> +	rq->stats.cache_reuse++;
>> +
>> +	dma_sync_single_for_device(rq->pdev, dma_info->addr, PAGE_SIZE,
>> +				   DMA_FROM_DEVICE);
>> +	return true;
>> +}
>> +
>>   static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
>>   					  struct mlx5e_dma_info *dma_info)
>>   {
>> -	struct page *page = dev_alloc_page();
>> +	struct page *page;
>> +
>> +	if (mlx5e_rx_cache_get(rq, dma_info))
>> +		return 0;
>>   
>> +	page = dev_alloc_page();
>>   	if (unlikely(!page))
>>   		return -ENOMEM;

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [iovisor-dev] README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
  2016-09-12 10:15               ` Jesper Dangaard Brouer via iovisor-dev
       [not found]                 ` <20160912121530.4b4f0ad7-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2016-09-13 15:20                 ` Edward Cree
       [not found]                   ` <d8a477c6-5394-ab33-443f-59d75a58f430-s/n/eUQHGBpZroRs9YW3xA@public.gmane.org>
  1 sibling, 1 reply; 72+ messages in thread
From: Edward Cree @ 2016-09-13 15:20 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Saeed Mahameed
  Cc: Alexei Starovoitov, Tom Herbert, iovisor-dev, Jamal Hadi Salim,
	Saeed Mahameed, Eric Dumazet, Linux Netdev List, Rana Shahout,
	Tariq Toukan

On 12/09/16 11:15, Jesper Dangaard Brouer wrote:
> I'm reacting so loudly because this is a mental-model switch that
> needs to be applied to the full driver's RX path. Also for normal stack
> delivery of SKBs. As both Edward Cree[1] and I[2] have demonstrated,
> there is between 10%-25% perf gain here.
>
> [1] http://lists.openwall.net/netdev/2016/04/19/89
> [2] http://lists.openwall.net/netdev/2016/01/15/51
BTW, I'd also still rather like to see that happen, I never really
understood the objections people had to those patches when I posted them.  I
still believe that dealing in skb-lists instead of skbs, and thus
'automatically' bulking similar packets, is better than trying to categorise
packets into flows early on based on some set of keys.  The problem with the
latter approach is that there are now two definitions of "similar":
1) the set of fields used to index the flow
2) what will actually cause the stack's behaviour to differ if not using the
cached values.
Quite apart from the possibility of bugs if one changes but not the other,
this forces (1) to be conservative, only considering things "similar" if the
entire stack will.  Whereas with bundling, the stack can keep packets
together until they reach a layer at which they are no longer "similar"
enough.  Thus, for instance, packets with the same IP 3-tuple but different
port numbers can be grouped together for IP layer processing, then split
apart for L4.
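
To make the shape of this concrete, here is a toy userspace illustration
(plain C, nothing to do with the actual patches in [1]; 'daddr' stands in
for the whole 3-tuple):

/* Toy illustration of bundling: packets travel in lists; each layer does
 * its shared work once per bundle and only splits the bundle when its own
 * notion of "similar" no longer holds.
 */
#include <stdio.h>

struct pkt {
	unsigned int daddr;   /* stand-in for the IP 3-tuple */
	unsigned int dport;   /* L4 key */
	struct pkt *next;
};

/* L4: every packet in this bundle shares daddr AND dport */
static void l4_deliver_bundle(struct pkt *head)
{
	int n = 0;
	for (struct pkt *p = head; p; p = p->next)
		n++;
	printf("  L4: socket lookup once for dport %u, %d packets\n",
	       head->dport, n);
}

/* IP layer: one route lookup per bundle, then split per L4 key */
static void ip_deliver_bundle(struct pkt *head)
{
	printf("IP: route lookup once for daddr %u\n", head->daddr);

	while (head) {
		/* peel off the run of packets sharing the same dport */
		struct pkt *sub = head, *tail = head;
		while (tail->next && tail->next->dport == sub->dport)
			tail = tail->next;
		head = tail->next;
		tail->next = NULL;
		l4_deliver_bundle(sub);
	}
}

int main(void)
{
	struct pkt p[4] = {
		{ .daddr = 1, .dport = 80 },
		{ .daddr = 1, .dport = 80 },
		{ .daddr = 1, .dport = 443 },
		{ .daddr = 1, .dport = 443 },
	};
	for (int i = 0; i < 3; i++)
		p[i].next = &p[i + 1];
	ip_deliver_bundle(&p[0]);
	return 0;
}

The point is only the structure: the shared work happens once per bundle,
and the split into sub-bundles is decided by the layer that actually
cares about the difference.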

-Ed

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
       [not found]                   ` <d8a477c6-5394-ab33-443f-59d75a58f430-s/n/eUQHGBpZroRs9YW3xA@public.gmane.org>
@ 2016-09-13 15:58                     ` Eric Dumazet via iovisor-dev
       [not found]                       ` <1473782310.18970.138.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
  0 siblings, 1 reply; 72+ messages in thread
From: Eric Dumazet via iovisor-dev @ 2016-09-13 15:58 UTC (permalink / raw)
  To: Edward Cree
  Cc: Tom Herbert, iovisor-dev, Jamal Hadi Salim, Saeed Mahameed,
	Eric Dumazet, Linux Netdev List

On Tue, 2016-09-13 at 16:20 +0100, Edward Cree wrote:
> On 12/09/16 11:15, Jesper Dangaard Brouer wrote:
> > I'm reacting so loudly because this is a mental-model switch that
> > needs to be applied to the full driver's RX path. Also for normal stack
> > delivery of SKBs. As both Edward Cree[1] and I[2] have demonstrated,
> > there is between 10%-25% perf gain here.
> >
> > [1] http://lists.openwall.net/netdev/2016/04/19/89
> > [2] http://lists.openwall.net/netdev/2016/01/15/51
> BTW, I'd also still rather like to see that happen, I never really
> understood the objections people had to those patches when I posted them.  I
> still believe that dealing in skb-lists instead of skbs, and thus
> 'automatically' bulking similar packets, is better than trying to categorise
> packets into flows early on based on some set of keys.  The problem with the
> latter approach is that there are now two definitions of "similar":
> 1) the set of fields used to index the flow
> 2) what will actually cause the stack's behaviour to differ if not using the
> cached values.
> Quite apart from the possibility of bugs if one changes but not the other,
> this forces (1) to be conservative, only considering things "similar" if the
> entire stack will.  Whereas with bundling, the stack can keep packets
> together until they reach a layer at which they are no longer "similar"
> enough.  Thus, for instance, packets with the same IP 3-tuple but different
> port numbers can be grouped together for IP layer processing, then split
> apart for L4.

To be fair, you never showed us the numbers for DDoS traffic, and you did
not show us how typical TCP + netfilter-module traffic would be
handled.

Show us real numbers, not synthetic ones, say when receiving traffic on
100,000 or more TCP sockets.

We also care about icache pressure, and GRO/TSO already provides
bundling where it is applicable, without adding insane complexity in the
stacks.

Just look at how complex the software fallbacks for GSO/checksumming
are, how many bugs we had to fix... And this is only at the edge of our
stack.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 03/11] net/mlx5e: Implement RX mapped page cache for page recycle
       [not found]           ` <549ee0e2-b76b-ec62-4287-e63c4320e7c6-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-09-13 16:28             ` Jesper Dangaard Brouer via iovisor-dev
  0 siblings, 0 replies; 72+ messages in thread
From: Jesper Dangaard Brouer via iovisor-dev @ 2016-09-13 16:28 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Tom Herbert, iovisor-dev, Jamal Hadi Salim, Saeed Mahameed,
	Eric Dumazet, netdev-u79uwXL29TY76Z2rM5mHXA

On Tue, 13 Sep 2016 13:16:29 +0300
Tariq Toukan <tariqt-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:

> On 07/09/2016 9:45 PM, Jesper Dangaard Brouer wrote:
> > On Wed,  7 Sep 2016 15:42:24 +0300 Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> >  
> >> From: Tariq Toukan <tariqt-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >>
> >> Instead of reallocating and mapping pages for RX data-path,
> >> recycle already used pages in a per ring cache.
> >>
> >> We ran pktgen single-stream benchmarks, with iptables-raw-drop:
> >>
> >> Single stride, 64 bytes:
> >> * 4,739,057 - baseline
> >> * 4,749,550 - order0 no cache
> >> * 4,786,899 - order0 with cache
> >> 1% gain
> >>
> >> Larger packets, no page cross, 1024 bytes:
> >> * 3,982,361 - baseline
> >> * 3,845,682 - order0 no cache
> >> * 4,127,852 - order0 with cache
> >> 3.7% gain
> >>
> >> Larger packets, every 3rd packet crosses a page, 1500 bytes:
> >> * 3,731,189 - baseline
> >> * 3,579,414 - order0 no cache
> >> * 3,931,708 - order0 with cache
> >> 5.4% gain
> >>
> >> Signed-off-by: Tariq Toukan <tariqt-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >> Signed-off-by: Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >> ---
> >>   drivers/net/ethernet/mellanox/mlx5/core/en.h       | 16 ++++++
> >>   drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 15 ++++++
> >>   drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    | 57 ++++++++++++++++++++--
> >>   drivers/net/ethernet/mellanox/mlx5/core/en_stats.h | 16 ++++++
> >>   4 files changed, 99 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> >> index 075cdfc..afbdf70 100644
> >> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> >> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> >> @@ -287,6 +287,18 @@ struct mlx5e_rx_am { /* Adaptive Moderation */
> >>   	u8					tired;
> >>   };
> >>   
> >> +/* a single cache unit is capable to serve one napi call (for non-striding rq)
> >> + * or a MPWQE (for striding rq).
> >> + */
> >> +#define MLX5E_CACHE_UNIT	(MLX5_MPWRQ_PAGES_PER_WQE > NAPI_POLL_WEIGHT ? \
> >> +				 MLX5_MPWRQ_PAGES_PER_WQE : NAPI_POLL_WEIGHT)
> >> +#define MLX5E_CACHE_SIZE	(2 * roundup_pow_of_two(MLX5E_CACHE_UNIT))
> >> +struct mlx5e_page_cache {
> >> +	u32 head;
> >> +	u32 tail;
> >> +	struct mlx5e_dma_info page_cache[MLX5E_CACHE_SIZE];
> >> +};
> >> +  
> > [...]  
> >>   
> >> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> >> index c1cb510..8e02af3 100644
> >> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> >> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> >> @@ -305,11 +305,55 @@ static inline void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
> >>   	mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
> >>   }
> >>   
> >> +static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
> >> +				      struct mlx5e_dma_info *dma_info)
> >> +{
> >> +	struct mlx5e_page_cache *cache = &rq->page_cache;
> >> +	u32 tail_next = (cache->tail + 1) & (MLX5E_CACHE_SIZE - 1);
> >> +
> >> +	if (tail_next == cache->head) {
> >> +		rq->stats.cache_full++;
> >> +		return false;
> >> +	}
> >> +
> >> +	cache->page_cache[cache->tail] = *dma_info;
> >> +	cache->tail = tail_next;
> >> +	return true;
> >> +}
> >> +
> >> +static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq,
> >> +				      struct mlx5e_dma_info *dma_info)
> >> +{
> >> +	struct mlx5e_page_cache *cache = &rq->page_cache;
> >> +
> >> +	if (unlikely(cache->head == cache->tail)) {
> >> +		rq->stats.cache_empty++;
> >> +		return false;
> >> +	}
> >> +
> >> +	if (page_ref_count(cache->page_cache[cache->head].page) != 1) {
> >> +		rq->stats.cache_busy++;
> >> +		return false;
> >> +	}  
> > Hmmm... doesn't this cause "blocking" of the page_cache recycle
> > facility until the page at the head of the queue gets (page) refcnt
> > decremented?  Real use-case could fairly easily block/cause this...  
> Hi Jesper,
> 
> That's right. We are aware of this issue.
> We considered ways of solving this, but decided to keep current 
> implementation for now.
> One way of solving this is to look deeper in the cache.
> Cons:
> - this will consume time, and the chance of finding an available page is 
> not that high: if the page in head of queue is busy then there's a good 
> chance that all the others are too (because of FIFO).
> in other words, you already checked all pages and anyway you're going to 
> allocate a new one (higher penalty for same decision).
> - this will make holes in the array causing complex accounting when 
> looking for an available page (this can easily be fixed by swapping 
> between the page in head and the available one).
> 
> Another way is sharing pages between different RQs.
> - For now we're not doing this for simplicity and to keep 
> synchronization away.
> 
> What do you think?
> 
> Anyway, we're looking forward to use your page-pool API which solves 
> these issues.

Yes, as you mention yourself, the page-pool API solves this problem.
Thus, I'm not sure it is worth investing more time in optimizing this
driver-local page-cache mechanism.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
       [not found]                       ` <1473782310.18970.138.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
@ 2016-09-13 16:47                         ` Jesper Dangaard Brouer via iovisor-dev
  0 siblings, 0 replies; 72+ messages in thread
From: Jesper Dangaard Brouer via iovisor-dev @ 2016-09-13 16:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tom Herbert, iovisor-dev, Jamal Hadi Salim, Saeed Mahameed,
	Eric Dumazet, Linux Netdev List, Edward Cree

On Tue, 13 Sep 2016 08:58:30 -0700
Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> We also care about icache pressure, and GRO/TSO already provides
> bundling where it is applicable, without adding insane complexity in
> the stacks.

Sorry, I cannot resist. The GRO code is really bad regarding icache
pressure/usage, due to how everything is function pointers calling
function pointers, even when the common case is calling the function
defined right next to it in the same C file (which would normally get
inlined).  I can easily get 10% more performance for UDP use-cases by
simply disabling the GRO code, and I measure a significant drop in
icache-misses.

Edward's solution should lower icache pressure.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support
       [not found]             ` <CAJ3xEMiDBZ2-FdE7wniW0Y_S6k8NKfKEdy3w+1vs83oPuMAG5Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2016-09-08  9:52               ` Jesper Dangaard Brouer via iovisor-dev
@ 2016-09-14  9:24               ` Tariq Toukan via iovisor-dev
  1 sibling, 0 replies; 72+ messages in thread
From: Tariq Toukan via iovisor-dev @ 2016-09-14  9:24 UTC (permalink / raw)
  To: Or Gerlitz, Jesper Dangaard Brouer
  Cc: Tom Herbert, iovisor-dev, Jamal Hadi Salim, Saeed Mahameed,
	Eric Dumazet, Linux Netdev List, Rana Shahout



On 08/09/2016 12:31 PM, Or Gerlitz wrote:
> On Thu, Sep 8, 2016 at 10:38 AM, Jesper Dangaard Brouer
> <brouer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> On Wed, 7 Sep 2016 23:55:42 +0300
>> Or Gerlitz <gerlitz.or-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>
>>> On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>>>> From: Rana Shahout <ranas-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>>>
>>>> Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx5e driver.
>>>>
>>>> When XDP is on we make sure to change channels RQs type to
>>>> MLX5_WQ_TYPE_LINKED_LIST rather than "striding RQ" type to
>>>> ensure "page per packet".
>>>>
>>>> On XDP set, we fail if HW LRO is set and request from user to turn it
>>>> off.  Since on ConnectX4-LX HW LRO is always on by default, this will be
>>>> annoying, but we prefer not to enforce LRO off from XDP set function.
>>>>
>>>> Full channels reset (close/open) is required only when setting XDP
>>>> on/off.
>>>>
>>>> When XDP set is called just to exchange programs, we will update
>>>> each RQ xdp program on the fly and for synchronization with current
>>>> data path RX activity of that RQ, we temporally disable that RQ and
>>>> ensure RX path is not running, quickly update and re-enable that RQ,
>>>> for that we do:
>>>>          - rq.state = disabled
>>>>          - napi_synchronize
>>>>          - xchg(rq->xdp_prg)
>>>>          - rq.state = enabled
>>>>          - napi_schedule // Just in case we've missed an IRQ
>>>>
>>>> Packet rate performance testing was done with pktgen 64B packets and on
>>>> TX side and, TC drop action on RX side compared to XDP fast drop.
>>>>
>>>> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
>>>>
>>>> Comparison is done between:
>>>>          1. Baseline, Before this patch with TC drop action
>>>>          2. This patch with TC drop action
>>>>          3. This patch with XDP RX fast drop
>>>>
>>>> Streams    Baseline(TC drop)    TC drop    XDP fast Drop
>>>> --------------------------------------------------------------
>>>> 1           5.51Mpps            5.14Mpps     13.5Mpps
>>> This (13.5 M PPS) is less than 50% of the result we presented @ the
>>> XDP summit, which was obtained by Rana. Please see if/how much this
>>> grows if you use more sender threads, but have all of them xmit the
>>> same stream/flows, so we're on one ring. That (XDP with a single RX ring
>>> getting packets from N remote TX rings) would be your canonical
>>> base-line for any further numbers.
>> Well, my experiments with this hardware (mlx5/CX4 at 50Gbit/s) show
>> that you should be able to reach 23Mpps on a single CPU.  This is
>> an XDP-drop simulation with order-0 pages being recycled through my
>> page_pool code, plus avoiding the cache-misses (notice you are using a
>> CPU E5-2680 with DDIO, thus you should only see an L3 cache miss).
> so this takes up from 13M to 23M, good.
>
> Could you explain why the move from order-3 to order-0 is hurting the
> performance so much (drop from 32M to 23M), any way we can overcome that?
The issue is not moving from high-order to order-0.
It's moving from Striding RQ to non-Striding RQ without a page-reuse
mechanism (which is not the same as the page cache).
In the current memory scheme, each 64B packet consumes a whole 4K page,
including the allocate/release (from the cache in this case, but still...).
I believe that once we implement page reuse for non-Striding RQ we'll
hit 32M PPS again.
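Just to put rough numbers on why this matters (back-of-the-envelope,
using the 23Mpps Jesper quotes and one 4K page per 64B packet):

   64B payload / 4096B page  ~= 1.5% of each page carries data
   23 Mpps * 64B             ~= 1.5 GB/s of actual packet data,
                                yet ~23M pages/sec are cycled through
                                allocate/release to carry it

So nearly all of the per-packet cost is page handling, which is exactly
what page reuse is meant to cut down.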
>> The 23Mpps number looks like some HW limitation, as the increase was
> not HW, I think. As I said, Rana got 32M with striding RQ when she was
> using order-3
> (or did we use order-5?)
order-5.
>> is not proportional to page-allocator overhead I removed (and CPU freq
>> starts to decrease).  I also did scaling tests to more CPUs, which
>> showed it scaled up to 40Mpps (you reported 45M).  And at the Phy RX
>> level I see 60Mpps (50G max is 74Mpps).

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 01/11] net/mlx5e: Single flow order-0 pages for Striding RQ
  2016-09-07 19:18       ` Jesper Dangaard Brouer
@ 2016-09-15 14:28           ` Tariq Toukan
  -1 siblings, 0 replies; 72+ messages in thread
From: Tariq Toukan via iovisor-dev @ 2016-09-15 14:28 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Saeed Mahameed
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, iovisor-dev, Jamal Hadi Salim,
	linux-mm, Eric Dumazet, Tom Herbert

Hi Jesper,


On 07/09/2016 10:18 PM, Jesper Dangaard Brouer wrote:
> On Wed,  7 Sep 2016 15:42:22 +0300 Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>
>> From: Tariq Toukan <tariqt-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>
>> To improve the memory consumption scheme, we omit the flow that
>> demands and splits high-order pages in Striding RQ, and stay
>> with a single Striding RQ flow that uses order-0 pages.
> Thanks you for doing this! MM-list people thanks you!
Thanks. I've just submitted it to net-next.
> For others to understand what this means:  This driver was doing
> split_page() on high-order pages (for Striding RQ).  This was really bad
> because it will cause fragmenting the page-allocator, and depleting the
> high-order pages available quickly.
>
> (I've left rest of patch intact below, if some MM people should be
> interested in looking at the changes).
>
> There is even a funny comment in split_page() relevant to this:
>
> /* [...]
>   * Note: this is probably too low level an operation for use in drivers.
>   * Please consult with lkml before using this in your driver.
>   */
>
>
>> Moving to fragmented memory allows the use of larger MPWQEs,
>> which reduces the number of UMR posts and filler CQEs.
>>
>> Moving to a single flow allows several optimizations that improve
>> performance, especially in production servers where we would
>> anyway fallback to order-0 allocations:
>> - inline functions that were called via function pointers.
>> - improve the UMR post process.
>>
>> This patch alone is expected to give a slight performance reduction.
>> However, the new memory scheme gives the possibility to use a page-cache
>> of a fair size, that doesn't inflate the memory footprint, which will
>> dramatically fix the reduction and even give a huge gain.
>>
>> We ran pktgen single-stream benchmarks, with iptables-raw-drop:
>>
>> Single stride, 64 bytes:
>> * 4,739,057 - baseline
>> * 4,749,550 - this patch
>> no reduction
>>
>> Larger packets, no page cross, 1024 bytes:
>> * 3,982,361 - baseline
>> * 3,845,682 - this patch
>> 3.5% reduction
>>
>> Larger packets, every 3rd packet crosses a page, 1500 bytes:
>> * 3,731,189 - baseline
>> * 3,579,414 - this patch
>> 4% reduction
>>
> Well, the reduction does not really matter that much, because your
> baseline benchmarks are from a freshly booted system, where you have
> not fragmented and depleted the high-order pages yet... ;-)
Indeed. On fragmented systems we'll get a gain, even w/o the page-cache 
mechanism, as no time is wasted looking for high-order-pages.
>
>
>> Fixes: 461017cb006a ("net/mlx5e: Support RX multi-packet WQE (Striding RQ)")
>> Fixes: bc77b240b3c5 ("net/mlx5e: Add fragmented memory support for RX multi packet WQE")
>> Signed-off-by: Tariq Toukan <tariqt-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>> Signed-off-by: Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>> ---
>>   drivers/net/ethernet/mellanox/mlx5/core/en.h       |  54 ++--
>>   drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 136 ++++++++--
>>   drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    | 292 ++++-----------------
>>   drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   4 -
>>   drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  |   2 +-
>>   5 files changed, 184 insertions(+), 304 deletions(-)
>>
Regards,
Tariq

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 01/11] net/mlx5e: Single flow order-0 pages for Striding RQ
@ 2016-09-15 14:28           ` Tariq Toukan
  0 siblings, 0 replies; 72+ messages in thread
From: Tariq Toukan @ 2016-09-15 14:28 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Saeed Mahameed
  Cc: iovisor-dev, netdev, Brenden Blanco, Alexei Starovoitov,
	Tom Herbert, Martin KaFai Lau, Daniel Borkmann, Eric Dumazet,
	Jamal Hadi Salim, linux-mm

Hi Jesper,


On 07/09/2016 10:18 PM, Jesper Dangaard Brouer wrote:
> On Wed,  7 Sep 2016 15:42:22 +0300 Saeed Mahameed <saeedm@mellanox.com> wrote:
>
>> From: Tariq Toukan <tariqt@mellanox.com>
>>
>> To improve the memory consumption scheme, we omit the flow that
>> demands and splits high-order pages in Striding RQ, and stay
>> with a single Striding RQ flow that uses order-0 pages.
> Thanks you for doing this! MM-list people thanks you!
Thanks. I've just submitted it to net-next.
> For others to understand what this means:  This driver was doing
> split_page() on high-order pages (for Striding RQ).  This was really bad
> because it will cause fragmenting the page-allocator, and depleting the
> high-order pages available quickly.
>
> (I've left rest of patch intact below, if some MM people should be
> interested in looking at the changes).
>
> There is even a funny comment in split_page() relevant to this:
>
> /* [...]
>   * Note: this is probably too low level an operation for use in drivers.
>   * Please consult with lkml before using this in your driver.
>   */
>
>
>> Moving to fragmented memory allows the use of larger MPWQEs,
>> which reduces the number of UMR posts and filler CQEs.
>>
>> Moving to a single flow allows several optimizations that improve
>> performance, especially in production servers where we would
>> anyway fallback to order-0 allocations:
>> - inline functions that were called via function pointers.
>> - improve the UMR post process.
>>
>> This patch alone is expected to give a slight performance reduction.
>> However, the new memory scheme gives the possibility to use a page-cache
>> of a fair size, that doesn't inflate the memory footprint, which will
>> dramatically fix the reduction and even give a huge gain.
>>
>> We ran pktgen single-stream benchmarks, with iptables-raw-drop:
>>
>> Single stride, 64 bytes:
>> * 4,739,057 - baseline
>> * 4,749,550 - this patch
>> no reduction
>>
>> Larger packets, no page cross, 1024 bytes:
>> * 3,982,361 - baseline
>> * 3,845,682 - this patch
>> 3.5% reduction
>>
>> Larger packets, every 3rd packet crosses a page, 1500 bytes:
>> * 3,731,189 - baseline
>> * 3,579,414 - this patch
>> 4% reduction
>>
> Well, the reduction does not really matter that much, because your
> baseline benchmarks are from a freshly booted system, where you have
> not fragmented and depleted the high-order pages yet... ;-)
Indeed. On fragmented systems we'll get a gain, even w/o the page-cache 
mechanism, as no time is wasted looking for high-order-pages.
>
>
>> Fixes: 461017cb006a ("net/mlx5e: Support RX multi-packet WQE (Striding RQ)")
>> Fixes: bc77b240b3c5 ("net/mlx5e: Add fragmented memory support for RX multi packet WQE")
>> Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
>> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
>> ---
>>   drivers/net/ethernet/mellanox/mlx5/core/en.h       |  54 ++--
>>   drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 136 ++++++++--
>>   drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    | 292 ++++-----------------
>>   drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   4 -
>>   drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  |   2 +-
>>   5 files changed, 184 insertions(+), 304 deletions(-)
>>
Regards,
Tariq


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 01/11] net/mlx5e: Single flow order-0 pages for Striding RQ
       [not found]       ` <20160907173131.GA64688-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
@ 2016-09-15 14:34         ` Tariq Toukan via iovisor-dev
  0 siblings, 0 replies; 72+ messages in thread
From: Tariq Toukan via iovisor-dev @ 2016-09-15 14:34 UTC (permalink / raw)
  To: Alexei Starovoitov, Saeed Mahameed
  Cc: Tom Herbert, netdev-u79uwXL29TY76Z2rM5mHXA, iovisor-dev,
	Jamal Hadi Salim, Eric Dumazet

Hi Alexei,

On 07/09/2016 8:31 PM, Alexei Starovoitov wrote:
> On Wed, Sep 07, 2016 at 03:42:22PM +0300, Saeed Mahameed wrote:
>> From: Tariq Toukan <tariqt-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>
>> To improve the memory consumption scheme, we omit the flow that
>> demands and splits high-order pages in Striding RQ, and stay
>> with a single Striding RQ flow that uses order-0 pages.
>>
>> Moving to fragmented memory allows the use of larger MPWQEs,
>> which reduces the number of UMR posts and filler CQEs.
>>
>> Moving to a single flow allows several optimizations that improve
>> performance, especially in production servers where we would
>> anyway fallback to order-0 allocations:
>> - inline functions that were called via function pointers.
>> - improve the UMR post process.
>>
>> This patch alone is expected to give a slight performance reduction.
>> However, the new memory scheme gives the possibility to use a page-cache
>> of a fair size, that doesn't inflate the memory footprint, which will
>> dramatically fix the reduction and even give a huge gain.
>>
>> We ran pktgen single-stream benchmarks, with iptables-raw-drop:
>>
>> Single stride, 64 bytes:
>> * 4,739,057 - baseline
>> * 4,749,550 - this patch
>> no reduction
>>
>> Larger packets, no page cross, 1024 bytes:
>> * 3,982,361 - baseline
>> * 3,845,682 - this patch
>> 3.5% reduction
>>
>> Larger packets, every 3rd packet crosses a page, 1500 bytes:
>> * 3,731,189 - baseline
>> * 3,579,414 - this patch
>> 4% reduction
> imo it's not a realistic use case, but would be good to mention that
> patch 3 brings performance back for this use case anyway.
Exactly, that's what I meant in the previous paragraph (".. will 
dramatically fix the reduction and even give a huge gain.")
Regards,
Tariq

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH RFC 04/11] net/mlx5e: Build RX SKB on demand
       [not found]     ` <20160907173449.GB64688-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
@ 2016-09-18 15:46       ` Tariq Toukan via iovisor-dev
  0 siblings, 0 replies; 72+ messages in thread
From: Tariq Toukan via iovisor-dev @ 2016-09-18 15:46 UTC (permalink / raw)
  To: Alexei Starovoitov, Saeed Mahameed
  Cc: Tom Herbert, netdev-u79uwXL29TY76Z2rM5mHXA, iovisor-dev,
	Jamal Hadi Salim, Eric Dumazet

Hi Alexei,

On 07/09/2016 8:34 PM, Alexei Starovoitov wrote:
> On Wed, Sep 07, 2016 at 03:42:25PM +0300, Saeed Mahameed wrote:
>> For non-striding RQ configuration before this patch we had a ring
>> with pre-allocated SKBs and mapped the SKB->data buffers for
>> device.
>>
>> For robustness and better RX data buffers management, we allocate a
>> page per packet and build_skb around it.
>>
>> This patch (which is a prerequisite for XDP) will actually reduce
>> performance for normal stack usage, because we are now hitting a bottleneck
>> in the page allocator. A later patch of page reuse mechanism will be
>> needed to restore or even improve performance in comparison to the old
>> RX scheme.
>>
>> Packet rate performance testing was done with pktgen 64B packets on xmit
>> side and TC drop action on RX side.
>>
>> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
>>
>> Comparison is done between:
>>   1.Baseline, before 'net/mlx5e: Build RX SKB on demand'
>>   2.Build SKB with RX page cache (This patch)
>>
>> Streams    Baseline    Build SKB+page-cache    Improvement
>> -----------------------------------------------------------
>> 1          4.33Mpps      5.51Mpps                27%
>> 2          7.35Mpps      11.5Mpps                52%
>> 4          14.0Mpps      16.3Mpps                16%
>> 8          22.2Mpps      29.6Mpps                20%
>> 16         24.8Mpps      34.0Mpps                17%
> Impressive gains for build_skb. I think it should help ip forwarding too
> and likely tcp_rr. tcp_stream shouldn't see any difference.
> If you can benchmark that along with pktgen+tc_drop it would
> help to better understand the impact of the changes.
Why do you expect an improvement in tcp_rr?
I don't see one in my tests.
>

^ permalink raw reply	[flat|nested] 72+ messages in thread

end of thread, other threads:[~2016-09-18 15:46 UTC | newest]

Thread overview: 72+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-07 12:42 [PATCH RFC 00/11] mlx5 RX refactoring and XDP support Saeed Mahameed
2016-09-07 12:42 ` [PATCH RFC 01/11] net/mlx5e: Single flow order-0 pages for Striding RQ Saeed Mahameed
     [not found]   ` <1473252152-11379-2-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-09-07 17:31     ` Alexei Starovoitov via iovisor-dev
     [not found]       ` <20160907173131.GA64688-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
2016-09-15 14:34         ` Tariq Toukan via iovisor-dev
2016-09-07 19:18     ` Jesper Dangaard Brouer via iovisor-dev
2016-09-07 19:18       ` Jesper Dangaard Brouer
     [not found]       ` <20160907211840.36c37ea0-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-09-15 14:28         ` Tariq Toukan via iovisor-dev
2016-09-15 14:28           ` Tariq Toukan
2016-09-07 12:42 ` [PATCH RFC 02/11] net/mlx5e: Introduce API for RX mapped pages Saeed Mahameed
2016-09-07 12:42 ` [PATCH RFC 03/11] net/mlx5e: Implement RX mapped page cache for page recycle Saeed Mahameed
     [not found]   ` <1473252152-11379-4-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-09-07 18:45     ` Jesper Dangaard Brouer via iovisor-dev
     [not found]       ` <20160907204501.08cc4ede-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-09-13 10:16         ` Tariq Toukan via iovisor-dev
     [not found]           ` <549ee0e2-b76b-ec62-4287-e63c4320e7c6-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-09-13 16:28             ` Jesper Dangaard Brouer via iovisor-dev
2016-09-07 12:42 ` [PATCH RFC 04/11] net/mlx5e: Build RX SKB on demand Saeed Mahameed
2016-09-07 17:34   ` Alexei Starovoitov
     [not found]     ` <20160907173449.GB64688-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
2016-09-18 15:46       ` Tariq Toukan via iovisor-dev
     [not found]   ` <1473252152-11379-5-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-09-07 19:32     ` Jesper Dangaard Brouer via iovisor-dev
2016-09-07 19:32       ` Jesper Dangaard Brouer
2016-09-07 12:42 ` [PATCH RFC 05/11] net/mlx5e: Union RQ RX info per RQ type Saeed Mahameed
2016-09-07 12:42 ` [PATCH RFC 06/11] net/mlx5e: Slightly reduce hardware LRO size Saeed Mahameed
2016-09-07 12:42 ` [PATCH RFC 07/11] net/mlx5e: Dynamic RQ type infrastructure Saeed Mahameed
2016-09-07 12:42 ` [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support Saeed Mahameed
2016-09-07 13:32   ` Or Gerlitz
     [not found]     ` <CAJ3xEMhh=fu+mrCGAjv1PDdGn9GPLJv9MssMzwzvppoqZUY01A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-09-07 14:48       ` Saeed Mahameed via iovisor-dev
     [not found]         ` <CALzJLG8_F28kQOPqTTLJRMsf9BOQvm3K2hAraCzabnXV4yKUgg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-09-07 16:54           ` Tom Herbert via iovisor-dev
     [not found]             ` <CALx6S35b_MZXiGR-b1SB+VNifPHDfQNDZdz-6vk0t3bKNwen+w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-09-07 17:07               ` Saeed Mahameed via iovisor-dev
     [not found]                 ` <CALzJLG9bu3-=Ybq+Lk1fvAe5AohVHAaPpa9RQqd1QVe-7XPyhw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-09-08  7:19                   ` Jesper Dangaard Brouer via iovisor-dev
     [not found]   ` <1473252152-11379-9-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-09-07 20:55     ` Or Gerlitz via iovisor-dev
     [not found]       ` <CAJ3xEMgsGHqQ7x8wky6Sfs34Ry67PnZEhYmnK=g8XnnXbgWagg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-09-07 21:53         ` Saeed Mahameed via iovisor-dev
     [not found]           ` <CALzJLG9C0PgJWFi9hc7LrhZJejOHmWOjn0Lu-jiPekoyTGq1Ng-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-09-08  7:10             ` Or Gerlitz via iovisor-dev
2016-09-08  7:38         ` Jesper Dangaard Brouer via iovisor-dev
2016-09-08  9:31           ` Or Gerlitz
     [not found]             ` <CAJ3xEMiDBZ2-FdE7wniW0Y_S6k8NKfKEdy3w+1vs83oPuMAG5Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-09-08  9:52               ` Jesper Dangaard Brouer via iovisor-dev
2016-09-14  9:24               ` Tariq Toukan via iovisor-dev
2016-09-08 10:58   ` Jamal Hadi Salim
2016-09-07 12:42 ` [PATCH RFC 09/11] net/mlx5e: Have a clear separation between different SQ types Saeed Mahameed
2016-09-07 12:42 ` [PATCH RFC 10/11] net/mlx5e: XDP TX forwarding support Saeed Mahameed
2016-09-07 12:42 ` [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more Saeed Mahameed
2016-09-07 13:44   ` John Fastabend
     [not found]     ` <57D019B2.7070007-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2016-09-07 14:40       ` Saeed Mahameed via iovisor-dev
2016-09-07 14:41   ` Eric Dumazet
     [not found]     ` <1473259302.10725.31.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
2016-09-07 15:08       ` Saeed Mahameed via iovisor-dev
2016-09-07 15:32         ` Eric Dumazet
2016-09-07 16:57           ` Saeed Mahameed
     [not found]             ` <CALzJLG9iVpS2qH5Ryc_DtEjrQMhcKD+qrLrGn=vet=_9N8eXPw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-09-07 18:19               ` Eric Dumazet via iovisor-dev
     [not found]                 ` <1473272346.10725.73.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
2016-09-07 20:09                   ` Saeed Mahameed via iovisor-dev
2016-09-07 18:22               ` Jesper Dangaard Brouer via iovisor-dev
     [not found]                 ` <20160907202234.55e18ef3-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-09-08  2:58                   ` John Fastabend via iovisor-dev
     [not found]                     ` <57D0D3EA.1090004-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2016-09-08  3:21                       ` Tom Herbert via iovisor-dev
2016-09-08  5:11                         ` Jesper Dangaard Brouer
     [not found]                           ` <20160908071119.776cce56-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-09-08 16:26                             ` Tom Herbert via iovisor-dev
2016-09-08 17:19                               ` Jesper Dangaard Brouer via iovisor-dev
     [not found]                                 ` <20160908191914.197ce7ec-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-09-08 18:16                                   ` Tom Herbert via iovisor-dev
2016-09-08 18:48                                     ` Rick Jones
2016-09-08 18:52                                       ` Eric Dumazet
     [not found]   ` <1473252152-11379-12-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-09-08  8:11     ` README: " Jesper Dangaard Brouer via iovisor-dev
     [not found]       ` <20160908101147.1b351432-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-09-09  3:22         ` Alexei Starovoitov via iovisor-dev
     [not found]           ` <20160909032202.GA62966-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
2016-09-09  5:36             ` Jesper Dangaard Brouer via iovisor-dev
     [not found]               ` <20160909073652.351d76d7-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-09-09  6:30                 ` Alexei Starovoitov via iovisor-dev
     [not found]                   ` <20160909063048.GA67375-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
2016-09-12  8:56                     ` Jesper Dangaard Brouer via iovisor-dev
     [not found]                       ` <20160912105655.0cb5607e-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-09-12 17:53                         ` Alexei Starovoitov via iovisor-dev
2016-09-12 11:30                     ` Jesper Dangaard Brouer via iovisor-dev
2016-09-12 19:56                       ` Alexei Starovoitov
     [not found]                         ` <20160912195626.GA18146-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
2016-09-12 20:48                           ` Jesper Dangaard Brouer via iovisor-dev
2016-09-09 19:02                 ` Tom Herbert via iovisor-dev
2016-09-09 15:03           ` [iovisor-dev] " Saeed Mahameed
     [not found]             ` <CALzJLG_r0pDJgxqqak5=NatT8tF7UP2NkGS1wjeWcS5C=Zvv2A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-09-12 10:15               ` Jesper Dangaard Brouer via iovisor-dev
     [not found]                 ` <20160912121530.4b4f0ad7-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-09-12 21:45                   ` Tom Herbert via iovisor-dev
2016-09-13 15:20                 ` [iovisor-dev] " Edward Cree
     [not found]                   ` <d8a477c6-5394-ab33-443f-59d75a58f430-s/n/eUQHGBpZroRs9YW3xA@public.gmane.org>
2016-09-13 15:58                     ` Eric Dumazet via iovisor-dev
     [not found]                       ` <1473782310.18970.138.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
2016-09-13 16:47                         ` Jesper Dangaard Brouer via iovisor-dev
     [not found] ` <1473252152-11379-1-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-09-09 15:10   ` [PATCH RFC 00/11] mlx5 RX refactoring and XDP support Saeed Mahameed via iovisor-dev
