* [PATCH net-next 00/13] Mellanox 100G mlx5 driver receive path optimizations
@ 2016-03-11 13:39 Saeed Mahameed
  2016-03-11 13:39 ` [PATCH net-next 01/13] net/mlx5: Refactor mlx5_core_mr to mkey Saeed Mahameed
                   ` (12 more replies)
  0 siblings, 13 replies; 26+ messages in thread
From: Saeed Mahameed @ 2016-03-11 13:39 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Or Gerlitz, Eran Ben Elisha, Tal Alon, Tariq Toukan,
	Jesper Dangaard Brouer, Saeed Mahameed

Hello Dave,

This series includes some RX modifications and optimizations for
the mlx5 Ethernet driver.

From Rana, we have one patch that adds support for ConnectX-4
queue counters.

From Tariq, we have several patches centered on improving RX path
message rate and CPU and memory utilization. Each patch's commit
message includes the performance improvement numbers related to that
specific patch.

In the 3rd patch we use a queue counter to report the "out of buffer"
dropped packet count, i.e. packets dropped due to lack of software
resources.

The 4th patch changes the driver's default RSS configuration to spread
traffic only across the cores of the close NUMA node, for a better
out-of-the-box experience.

In the 5th and 6th patches we introduce RX multi-packet WQE
(Striding RQ) for better memory utilization, especially when hardware
LRO is enabled, and for a better message rate for small packets.

In the 7th and 8th patches we add a fallback mechanism that uses
fragmented memory when allocating large WQE strides fails, based on UMR
(User Memory Registration) and ICO (Internal Control Operations) SQs.

In the 9th patch, to reduce the interrupt count, we change the RX
moderation period to be based on the last generated CQE rather than the
last generated interrupt.

In the 10th to 13th patches we make some small modifications that show
some small extra improvements.

Note: the patch from Matan, "net/mlx5: Refactor mlx5_core_mr to mkey",
included in this series has already been submitted and applied to Doug
Ledford's rdma tree as a606b0f6691d ("net/mlx5: Refactor mlx5_core_mr to mkey").

This series is generated against net-next commit e8ab563f4b2e ("Merge branch 'flower-offload'")

Thanks,
Saeed

Matan Barak (1):
  net/mlx5: Refactor mlx5_core_mr to mkey

Rana Shahout (1):
  net/mlx5e: Allocate set of queue counters per netdev

Tariq Toukan (11):
  net/mlx5: Introduce device queue counters
  net/mlx5e: Use only close NUMA node for default RSS
  net/mlx5e: Use function pointers for RX data path handling
  net/mlx5e: Support RX multi-packet WQE (Striding RQ)
  net/mlx5e: Added ICO SQs
  net/mlx5e: Add fragmented memory support for RX multi packet WQE
  net/mlx5e: Change RX moderation period to be based on CQE
  net/mlx5e: Use napi_alloc_skb for RX SKB allocations
  net/mlx5e: Prefetch next RX CQE
  net/mlx5e: Remove redundant barrier
  net/mlx5e: Add ethtool counter for RX SKB allocation failures

 drivers/infiniband/hw/mlx5/cq.c                    |   16 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h               |    6 +-
 drivers/infiniband/hw/mlx5/mr.c                    |   50 +-
 drivers/infiniband/hw/mlx5/odp.c                   |   10 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  196 +++++++-
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |   28 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  386 +++++++++++++---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    |  498 ++++++++++++++++++--
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c    |    6 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  |   59 +++-
 drivers/net/ethernet/mellanox/mlx5/core/main.c     |    6 +-
 drivers/net/ethernet/mellanox/mlx5/core/mr.c       |   54 ++-
 drivers/net/ethernet/mellanox/mlx5/core/qp.c       |   68 +++
 include/linux/mlx5/device.h                        |   39 ++-
 include/linux/mlx5/driver.h                        |   24 +-
 include/linux/mlx5/mlx5_ifc.h                      |   32 +-
 include/linux/mlx5/qp.h                            |   10 +-
 17 files changed, 1263 insertions(+), 225 deletions(-)


* [PATCH net-next 01/13] net/mlx5: Refactor mlx5_core_mr to mkey
  2016-03-11 13:39 [PATCH net-next 00/13] Mellanox 100G mlx5 driver receive path optimizations Saeed Mahameed
@ 2016-03-11 13:39 ` Saeed Mahameed
  2016-03-11 13:39 ` [PATCH net-next 02/13] net/mlx5: Introduce device queue counters Saeed Mahameed
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Saeed Mahameed @ 2016-03-11 13:39 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Or Gerlitz, Eran Ben Elisha, Tal Alon, Tariq Toukan,
	Jesper Dangaard Brouer, Matan Barak, Saeed Mahameed

From: Matan Barak <matanb@mellanox.com>

The mlx5 mkey mechanism is also used for memory windows.
The current code base uses MR (memory region) naming, which is
inaccurate. Change MR to mkey in order to represent its different
usages more accurately.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/infiniband/hw/mlx5/cq.c                   |   16 +++---
 drivers/infiniband/hw/mlx5/mlx5_ib.h              |    6 +-
 drivers/infiniband/hw/mlx5/mr.c                   |   50 ++++++++++----------
 drivers/infiniband/hw/mlx5/odp.c                  |   10 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |    2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |   12 ++--
 drivers/net/ethernet/mellanox/mlx5/core/main.c    |    6 +-
 drivers/net/ethernet/mellanox/mlx5/core/mr.c      |   54 +++++++++++----------
 include/linux/mlx5/driver.h                       |   24 +++++----
 include/linux/mlx5/qp.h                           |    4 +-
 10 files changed, 94 insertions(+), 90 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index fd1de31..28b69b4 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -431,7 +431,7 @@ static int mlx5_poll_one(struct mlx5_ib_cq *cq,
 	struct mlx5_core_qp *mqp;
 	struct mlx5_ib_wq *wq;
 	struct mlx5_sig_err_cqe *sig_err_cqe;
-	struct mlx5_core_mr *mmr;
+	struct mlx5_core_mkey *mmkey;
 	struct mlx5_ib_mr *mr;
 	uint8_t opcode;
 	uint32_t qpn;
@@ -536,17 +536,17 @@ repoll:
 	case MLX5_CQE_SIG_ERR:
 		sig_err_cqe = (struct mlx5_sig_err_cqe *)cqe64;
 
-		read_lock(&dev->mdev->priv.mr_table.lock);
-		mmr = __mlx5_mr_lookup(dev->mdev,
-				       mlx5_base_mkey(be32_to_cpu(sig_err_cqe->mkey)));
-		if (unlikely(!mmr)) {
-			read_unlock(&dev->mdev->priv.mr_table.lock);
+		read_lock(&dev->mdev->priv.mkey_table.lock);
+		mmkey = __mlx5_mr_lookup(dev->mdev,
+					 mlx5_base_mkey(be32_to_cpu(sig_err_cqe->mkey)));
+		if (unlikely(!mmkey)) {
+			read_unlock(&dev->mdev->priv.mkey_table.lock);
 			mlx5_ib_warn(dev, "CQE@CQ %06x for unknown MR %6x\n",
 				     cq->mcq.cqn, be32_to_cpu(sig_err_cqe->mkey));
 			return -EINVAL;
 		}
 
-		mr = to_mibmr(mmr);
+		mr = to_mibmr(mmkey);
 		get_sig_err_item(sig_err_cqe, &mr->sig->err_item);
 		mr->sig->sig_err_exists = true;
 		mr->sig->sigerr_count++;
@@ -558,7 +558,7 @@ repoll:
 			     mr->sig->err_item.expected,
 			     mr->sig->err_item.actual);
 
-		read_unlock(&dev->mdev->priv.mr_table.lock);
+		read_unlock(&dev->mdev->priv.mkey_table.lock);
 		goto repoll;
 	}
 
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index d2b9737..b1cad93 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -413,7 +413,7 @@ struct mlx5_ib_mr {
 	int			ndescs;
 	int			max_descs;
 	int			desc_size;
-	struct mlx5_core_mr	mmr;
+	struct mlx5_core_mkey	mmkey;
 	struct ib_umem	       *umem;
 	struct mlx5_shared_mr_info	*smr_info;
 	struct list_head	list;
@@ -558,9 +558,9 @@ static inline struct mlx5_ib_qp *to_mibqp(struct mlx5_core_qp *mqp)
 	return container_of(mqp, struct mlx5_ib_qp_base, mqp)->container_mibqp;
 }
 
-static inline struct mlx5_ib_mr *to_mibmr(struct mlx5_core_mr *mmr)
+static inline struct mlx5_ib_mr *to_mibmr(struct mlx5_core_mkey *mmkey)
 {
-	return container_of(mmr, struct mlx5_ib_mr, mmr);
+	return container_of(mmkey, struct mlx5_ib_mr, mmkey);
 }
 
 static inline struct mlx5_ib_pd *to_mpd(struct ib_pd *ibpd)
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 6000f7a..fbcccc8 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -57,7 +57,7 @@ static int clean_mr(struct mlx5_ib_mr *mr);
 
 static int destroy_mkey(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr)
 {
-	int err = mlx5_core_destroy_mkey(dev->mdev, &mr->mmr);
+	int err = mlx5_core_destroy_mkey(dev->mdev, &mr->mmkey);
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
 	/* Wait until all page fault handlers using the mr complete. */
@@ -86,7 +86,7 @@ static void reg_mr_callback(int status, void *context)
 	struct mlx5_cache_ent *ent = &cache->ent[c];
 	u8 key;
 	unsigned long flags;
-	struct mlx5_mr_table *table = &dev->mdev->priv.mr_table;
+	struct mlx5_mkey_table *table = &dev->mdev->priv.mkey_table;
 	int err;
 
 	spin_lock_irqsave(&ent->lock, flags);
@@ -113,7 +113,7 @@ static void reg_mr_callback(int status, void *context)
 	spin_lock_irqsave(&dev->mdev->priv.mkey_lock, flags);
 	key = dev->mdev->priv.mkey_key++;
 	spin_unlock_irqrestore(&dev->mdev->priv.mkey_lock, flags);
-	mr->mmr.key = mlx5_idx_to_mkey(be32_to_cpu(mr->out.mkey) & 0xffffff) | key;
+	mr->mmkey.key = mlx5_idx_to_mkey(be32_to_cpu(mr->out.mkey) & 0xffffff) | key;
 
 	cache->last_add = jiffies;
 
@@ -124,10 +124,10 @@ static void reg_mr_callback(int status, void *context)
 	spin_unlock_irqrestore(&ent->lock, flags);
 
 	write_lock_irqsave(&table->lock, flags);
-	err = radix_tree_insert(&table->tree, mlx5_base_mkey(mr->mmr.key),
-				&mr->mmr);
+	err = radix_tree_insert(&table->tree, mlx5_base_mkey(mr->mmkey.key),
+				&mr->mmkey);
 	if (err)
-		pr_err("Error inserting to mr tree. 0x%x\n", -err);
+		pr_err("Error inserting to mkey tree. 0x%x\n", -err);
 	write_unlock_irqrestore(&table->lock, flags);
 }
 
@@ -168,7 +168,7 @@ static int add_keys(struct mlx5_ib_dev *dev, int c, int num)
 		spin_lock_irq(&ent->lock);
 		ent->pending++;
 		spin_unlock_irq(&ent->lock);
-		err = mlx5_core_create_mkey(dev->mdev, &mr->mmr, in,
+		err = mlx5_core_create_mkey(dev->mdev, &mr->mmkey, in,
 					    sizeof(*in), reg_mr_callback,
 					    mr, &mr->out);
 		if (err) {
@@ -657,14 +657,14 @@ struct ib_mr *mlx5_ib_get_dma_mr(struct ib_pd *pd, int acc)
 	seg->qpn_mkey7_0 = cpu_to_be32(0xffffff << 8);
 	seg->start_addr = 0;
 
-	err = mlx5_core_create_mkey(mdev, &mr->mmr, in, sizeof(*in), NULL, NULL,
+	err = mlx5_core_create_mkey(mdev, &mr->mmkey, in, sizeof(*in), NULL, NULL,
 				    NULL);
 	if (err)
 		goto err_in;
 
 	kfree(in);
-	mr->ibmr.lkey = mr->mmr.key;
-	mr->ibmr.rkey = mr->mmr.key;
+	mr->ibmr.lkey = mr->mmkey.key;
+	mr->ibmr.rkey = mr->mmkey.key;
 	mr->umem = NULL;
 
 	return &mr->ibmr;
@@ -813,7 +813,7 @@ static struct mlx5_ib_mr *reg_umr(struct ib_pd *pd, struct ib_umem *umem,
 
 	memset(&umrwr, 0, sizeof(umrwr));
 	umrwr.wr.wr_id = (u64)(unsigned long)&umr_context;
-	prep_umr_reg_wqe(pd, &umrwr.wr, &sg, dma, npages, mr->mmr.key,
+	prep_umr_reg_wqe(pd, &umrwr.wr, &sg, dma, npages, mr->mmkey.key,
 			 page_shift, virt_addr, len, access_flags);
 
 	mlx5_ib_init_umr_context(&umr_context);
@@ -830,9 +830,9 @@ static struct mlx5_ib_mr *reg_umr(struct ib_pd *pd, struct ib_umem *umem,
 		}
 	}
 
-	mr->mmr.iova = virt_addr;
-	mr->mmr.size = len;
-	mr->mmr.pd = to_mpd(pd)->pdn;
+	mr->mmkey.iova = virt_addr;
+	mr->mmkey.size = len;
+	mr->mmkey.pd = to_mpd(pd)->pdn;
 
 	mr->live = 1;
 
@@ -944,7 +944,7 @@ int mlx5_ib_update_mtt(struct mlx5_ib_mr *mr, u64 start_page_index, int npages,
 		wr.wr.opcode = MLX5_IB_WR_UMR;
 		wr.npages = sg.length / sizeof(u64);
 		wr.page_shift = PAGE_SHIFT;
-		wr.mkey = mr->mmr.key;
+		wr.mkey = mr->mmkey.key;
 		wr.target.offset = start_page_index;
 
 		mlx5_ib_init_umr_context(&umr_context);
@@ -1013,7 +1013,7 @@ static struct mlx5_ib_mr *reg_create(struct ib_pd *pd, u64 virt_addr,
 	in->seg.qpn_mkey7_0 = cpu_to_be32(0xffffff << 8);
 	in->xlat_oct_act_size = cpu_to_be32(get_octo_len(virt_addr, length,
 							 1 << page_shift));
-	err = mlx5_core_create_mkey(dev->mdev, &mr->mmr, in, inlen, NULL,
+	err = mlx5_core_create_mkey(dev->mdev, &mr->mmkey, in, inlen, NULL,
 				    NULL, NULL);
 	if (err) {
 		mlx5_ib_warn(dev, "create mkey failed\n");
@@ -1024,7 +1024,7 @@ static struct mlx5_ib_mr *reg_create(struct ib_pd *pd, u64 virt_addr,
 	mr->live = 1;
 	kvfree(in);
 
-	mlx5_ib_dbg(dev, "mkey = 0x%x\n", mr->mmr.key);
+	mlx5_ib_dbg(dev, "mkey = 0x%x\n", mr->mmkey.key);
 
 	return mr;
 
@@ -1091,13 +1091,13 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 		goto error;
 	}
 
-	mlx5_ib_dbg(dev, "mkey 0x%x\n", mr->mmr.key);
+	mlx5_ib_dbg(dev, "mkey 0x%x\n", mr->mmkey.key);
 
 	mr->umem = umem;
 	mr->npages = npages;
 	atomic_add(npages, &dev->mdev->priv.reg_pages);
-	mr->ibmr.lkey = mr->mmr.key;
-	mr->ibmr.rkey = mr->mmr.key;
+	mr->ibmr.lkey = mr->mmkey.key;
+	mr->ibmr.rkey = mr->mmkey.key;
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
 	if (umem->odp_data) {
@@ -1141,7 +1141,7 @@ static int unreg_umr(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr)
 
 	memset(&umrwr.wr, 0, sizeof(umrwr));
 	umrwr.wr.wr_id = (u64)(unsigned long)&umr_context;
-	prep_umr_unreg_wqe(dev, &umrwr.wr, mr->mmr.key);
+	prep_umr_unreg_wqe(dev, &umrwr.wr, mr->mmkey.key);
 
 	mlx5_ib_init_umr_context(&umr_context);
 	down(&umrc->sem);
@@ -1236,7 +1236,7 @@ static int clean_mr(struct mlx5_ib_mr *mr)
 		err = destroy_mkey(dev, mr);
 		if (err) {
 			mlx5_ib_warn(dev, "failed to destroy mkey 0x%x (%d)\n",
-				     mr->mmr.key, err);
+				     mr->mmkey.key, err);
 			return err;
 		}
 	} else {
@@ -1362,13 +1362,13 @@ struct ib_mr *mlx5_ib_alloc_mr(struct ib_pd *pd,
 	}
 
 	in->seg.flags = MLX5_PERM_UMR_EN | access_mode;
-	err = mlx5_core_create_mkey(dev->mdev, &mr->mmr, in, sizeof(*in),
+	err = mlx5_core_create_mkey(dev->mdev, &mr->mmkey, in, sizeof(*in),
 				    NULL, NULL, NULL);
 	if (err)
 		goto err_destroy_psv;
 
-	mr->ibmr.lkey = mr->mmr.key;
-	mr->ibmr.rkey = mr->mmr.key;
+	mr->ibmr.lkey = mr->mmkey.key;
+	mr->ibmr.rkey = mr->mmkey.key;
 	mr->umem = NULL;
 	kfree(in);
 
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index b8d7636..34e79e7 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -142,13 +142,13 @@ static struct mlx5_ib_mr *mlx5_ib_odp_find_mr_lkey(struct mlx5_ib_dev *dev,
 						   u32 key)
 {
 	u32 base_key = mlx5_base_mkey(key);
-	struct mlx5_core_mr *mmr = __mlx5_mr_lookup(dev->mdev, base_key);
-	struct mlx5_ib_mr *mr = container_of(mmr, struct mlx5_ib_mr, mmr);
+	struct mlx5_core_mkey *mmkey = __mlx5_mr_lookup(dev->mdev, base_key);
+	struct mlx5_ib_mr *mr = container_of(mmkey, struct mlx5_ib_mr, mmkey);
 
-	if (!mmr || mmr->key != key || !mr->live)
+	if (!mmkey || mmkey->key != key || !mr->live)
 		return NULL;
 
-	return container_of(mmr, struct mlx5_ib_mr, mmr);
+	return container_of(mmkey, struct mlx5_ib_mr, mmkey);
 }
 
 static void mlx5_ib_page_fault_resume(struct mlx5_ib_qp *qp,
@@ -232,7 +232,7 @@ static int pagefault_single_data_segment(struct mlx5_ib_qp *qp,
 	io_virt += pfault->mpfault.bytes_committed;
 	bcnt -= pfault->mpfault.bytes_committed;
 
-	start_idx = (io_virt - (mr->mmr.iova & PAGE_MASK)) >> PAGE_SHIFT;
+	start_idx = (io_virt - (mr->mmkey.iova & PAGE_MASK)) >> PAGE_SHIFT;
 
 	if (mr->umem->writable)
 		access_mask |= ODP_WRITE_ALLOWED_BIT;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 0f76d32..1f5cc1b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -553,7 +553,7 @@ struct mlx5e_priv {
 	struct mlx5_uar            cq_uar;
 	u32                        pdn;
 	u32                        tdn;
-	struct mlx5_core_mr        mr;
+	struct mlx5_core_mkey      mkey;
 	struct mlx5e_rq            drop_rq;
 
 	struct mlx5e_channel     **channel;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index ac58078..e0adb60 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -979,7 +979,7 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 	c->cpu      = cpu;
 	c->pdev     = &priv->mdev->pdev->dev;
 	c->netdev   = priv->netdev;
-	c->mkey_be  = cpu_to_be32(priv->mr.key);
+	c->mkey_be  = cpu_to_be32(priv->mkey.key);
 	c->num_tc   = priv->params.num_tc;
 
 	mlx5e_build_channeltc_to_txq_map(priv, ix);
@@ -2418,7 +2418,7 @@ static void mlx5e_build_netdev(struct net_device *netdev)
 }
 
 static int mlx5e_create_mkey(struct mlx5e_priv *priv, u32 pdn,
-			     struct mlx5_core_mr *mr)
+			     struct mlx5_core_mkey *mkey)
 {
 	struct mlx5_core_dev *mdev = priv->mdev;
 	struct mlx5_create_mkey_mbox_in *in;
@@ -2434,7 +2434,7 @@ static int mlx5e_create_mkey(struct mlx5e_priv *priv, u32 pdn,
 	in->seg.flags_pd = cpu_to_be32(pdn | MLX5_MKEY_LEN64);
 	in->seg.qpn_mkey7_0 = cpu_to_be32(0xffffff << 8);
 
-	err = mlx5_core_create_mkey(mdev, mr, in, sizeof(*in), NULL, NULL,
+	err = mlx5_core_create_mkey(mdev, mkey, in, sizeof(*in), NULL, NULL,
 				    NULL);
 
 	kvfree(in);
@@ -2485,7 +2485,7 @@ static void *mlx5e_create_netdev(struct mlx5_core_dev *mdev)
 		goto err_dealloc_pd;
 	}
 
-	err = mlx5e_create_mkey(priv, priv->pdn, &priv->mr);
+	err = mlx5e_create_mkey(priv, priv->pdn, &priv->mkey);
 	if (err) {
 		mlx5_core_err(mdev, "create mkey failed, %d\n", err);
 		goto err_dealloc_transport_domain;
@@ -2575,7 +2575,7 @@ err_destroy_tises:
 	mlx5e_destroy_tises(priv);
 
 err_destroy_mkey:
-	mlx5_core_destroy_mkey(mdev, &priv->mr);
+	mlx5_core_destroy_mkey(mdev, &priv->mkey);
 
 err_dealloc_transport_domain:
 	mlx5_core_dealloc_transport_domain(mdev, priv->tdn);
@@ -2611,7 +2611,7 @@ static void mlx5e_destroy_netdev(struct mlx5_core_dev *mdev, void *vpriv)
 	mlx5e_destroy_rqt(priv, MLX5E_INDIRECTION_RQT);
 	mlx5e_close_drop_rq(priv);
 	mlx5e_destroy_tises(priv);
-	mlx5_core_destroy_mkey(priv->mdev, &priv->mr);
+	mlx5_core_destroy_mkey(priv->mdev, &priv->mkey);
 	mlx5_core_dealloc_transport_domain(priv->mdev, priv->tdn);
 	mlx5_core_dealloc_pd(priv->mdev, priv->pdn);
 	mlx5_unmap_free_uar(priv->mdev, &priv->cq_uar);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 8b7133d..72a94e7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1096,7 +1096,7 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv)
 	mlx5_init_cq_table(dev);
 	mlx5_init_qp_table(dev);
 	mlx5_init_srq_table(dev);
-	mlx5_init_mr_table(dev);
+	mlx5_init_mkey_table(dev);
 
 	err = mlx5_init_fs(dev);
 	if (err) {
@@ -1143,7 +1143,7 @@ err_sriov:
 err_reg_dev:
 	mlx5_cleanup_fs(dev);
 err_fs:
-	mlx5_cleanup_mr_table(dev);
+	mlx5_cleanup_mkey_table(dev);
 	mlx5_cleanup_srq_table(dev);
 	mlx5_cleanup_qp_table(dev);
 	mlx5_cleanup_cq_table(dev);
@@ -1212,7 +1212,7 @@ static int mlx5_unload_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv)
 #endif
 
 	mlx5_cleanup_fs(dev);
-	mlx5_cleanup_mr_table(dev);
+	mlx5_cleanup_mkey_table(dev);
 	mlx5_cleanup_srq_table(dev);
 	mlx5_cleanup_qp_table(dev);
 	mlx5_cleanup_cq_table(dev);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mr.c b/drivers/net/ethernet/mellanox/mlx5/core/mr.c
index 6fa22b5..77a7293 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mr.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mr.c
@@ -36,25 +36,26 @@
 #include <linux/mlx5/cmd.h>
 #include "mlx5_core.h"
 
-void mlx5_init_mr_table(struct mlx5_core_dev *dev)
+void mlx5_init_mkey_table(struct mlx5_core_dev *dev)
 {
-	struct mlx5_mr_table *table = &dev->priv.mr_table;
+	struct mlx5_mkey_table *table = &dev->priv.mkey_table;
 
 	memset(table, 0, sizeof(*table));
 	rwlock_init(&table->lock);
 	INIT_RADIX_TREE(&table->tree, GFP_ATOMIC);
 }
 
-void mlx5_cleanup_mr_table(struct mlx5_core_dev *dev)
+void mlx5_cleanup_mkey_table(struct mlx5_core_dev *dev)
 {
 }
 
-int mlx5_core_create_mkey(struct mlx5_core_dev *dev, struct mlx5_core_mr *mr,
+int mlx5_core_create_mkey(struct mlx5_core_dev *dev,
+			  struct mlx5_core_mkey *mkey,
 			  struct mlx5_create_mkey_mbox_in *in, int inlen,
 			  mlx5_cmd_cbk_t callback, void *context,
 			  struct mlx5_create_mkey_mbox_out *out)
 {
-	struct mlx5_mr_table *table = &dev->priv.mr_table;
+	struct mlx5_mkey_table *table = &dev->priv.mkey_table;
 	struct mlx5_create_mkey_mbox_out lout;
 	int err;
 	u8 key;
@@ -83,34 +84,35 @@ int mlx5_core_create_mkey(struct mlx5_core_dev *dev, struct mlx5_core_mr *mr,
 		return mlx5_cmd_status_to_err(&lout.hdr);
 	}
 
-	mr->iova = be64_to_cpu(in->seg.start_addr);
-	mr->size = be64_to_cpu(in->seg.len);
-	mr->key = mlx5_idx_to_mkey(be32_to_cpu(lout.mkey) & 0xffffff) | key;
-	mr->pd = be32_to_cpu(in->seg.flags_pd) & 0xffffff;
+	mkey->iova = be64_to_cpu(in->seg.start_addr);
+	mkey->size = be64_to_cpu(in->seg.len);
+	mkey->key = mlx5_idx_to_mkey(be32_to_cpu(lout.mkey) & 0xffffff) | key;
+	mkey->pd = be32_to_cpu(in->seg.flags_pd) & 0xffffff;
 
 	mlx5_core_dbg(dev, "out 0x%x, key 0x%x, mkey 0x%x\n",
-		      be32_to_cpu(lout.mkey), key, mr->key);
+		      be32_to_cpu(lout.mkey), key, mkey->key);
 
-	/* connect to MR tree */
+	/* connect to mkey tree */
 	write_lock_irq(&table->lock);
-	err = radix_tree_insert(&table->tree, mlx5_base_mkey(mr->key), mr);
+	err = radix_tree_insert(&table->tree, mlx5_base_mkey(mkey->key), mkey);
 	write_unlock_irq(&table->lock);
 	if (err) {
-		mlx5_core_warn(dev, "failed radix tree insert of mr 0x%x, %d\n",
-			       mlx5_base_mkey(mr->key), err);
-		mlx5_core_destroy_mkey(dev, mr);
+		mlx5_core_warn(dev, "failed radix tree insert of mkey 0x%x, %d\n",
+			       mlx5_base_mkey(mkey->key), err);
+		mlx5_core_destroy_mkey(dev, mkey);
 	}
 
 	return err;
 }
 EXPORT_SYMBOL(mlx5_core_create_mkey);
 
-int mlx5_core_destroy_mkey(struct mlx5_core_dev *dev, struct mlx5_core_mr *mr)
+int mlx5_core_destroy_mkey(struct mlx5_core_dev *dev,
+			   struct mlx5_core_mkey *mkey)
 {
-	struct mlx5_mr_table *table = &dev->priv.mr_table;
+	struct mlx5_mkey_table *table = &dev->priv.mkey_table;
 	struct mlx5_destroy_mkey_mbox_in in;
 	struct mlx5_destroy_mkey_mbox_out out;
-	struct mlx5_core_mr *deleted_mr;
+	struct mlx5_core_mkey *deleted_mkey;
 	unsigned long flags;
 	int err;
 
@@ -118,16 +120,16 @@ int mlx5_core_destroy_mkey(struct mlx5_core_dev *dev, struct mlx5_core_mr *mr)
 	memset(&out, 0, sizeof(out));
 
 	write_lock_irqsave(&table->lock, flags);
-	deleted_mr = radix_tree_delete(&table->tree, mlx5_base_mkey(mr->key));
+	deleted_mkey = radix_tree_delete(&table->tree, mlx5_base_mkey(mkey->key));
 	write_unlock_irqrestore(&table->lock, flags);
-	if (!deleted_mr) {
-		mlx5_core_warn(dev, "failed radix tree delete of mr 0x%x\n",
-			       mlx5_base_mkey(mr->key));
+	if (!deleted_mkey) {
+		mlx5_core_warn(dev, "failed radix tree delete of mkey 0x%x\n",
+			       mlx5_base_mkey(mkey->key));
 		return -ENOENT;
 	}
 
 	in.hdr.opcode = cpu_to_be16(MLX5_CMD_OP_DESTROY_MKEY);
-	in.mkey = cpu_to_be32(mlx5_mkey_to_idx(mr->key));
+	in.mkey = cpu_to_be32(mlx5_mkey_to_idx(mkey->key));
 	err = mlx5_cmd_exec(dev, &in, sizeof(in), &out, sizeof(out));
 	if (err)
 		return err;
@@ -139,7 +141,7 @@ int mlx5_core_destroy_mkey(struct mlx5_core_dev *dev, struct mlx5_core_mr *mr)
 }
 EXPORT_SYMBOL(mlx5_core_destroy_mkey);
 
-int mlx5_core_query_mkey(struct mlx5_core_dev *dev, struct mlx5_core_mr *mr,
+int mlx5_core_query_mkey(struct mlx5_core_dev *dev, struct mlx5_core_mkey *mkey,
 			 struct mlx5_query_mkey_mbox_out *out, int outlen)
 {
 	struct mlx5_query_mkey_mbox_in in;
@@ -149,7 +151,7 @@ int mlx5_core_query_mkey(struct mlx5_core_dev *dev, struct mlx5_core_mr *mr,
 	memset(out, 0, outlen);
 
 	in.hdr.opcode = cpu_to_be16(MLX5_CMD_OP_QUERY_MKEY);
-	in.mkey = cpu_to_be32(mlx5_mkey_to_idx(mr->key));
+	in.mkey = cpu_to_be32(mlx5_mkey_to_idx(mkey->key));
 	err = mlx5_cmd_exec(dev, &in, sizeof(in), out, outlen);
 	if (err)
 		return err;
@@ -161,7 +163,7 @@ int mlx5_core_query_mkey(struct mlx5_core_dev *dev, struct mlx5_core_mr *mr,
 }
 EXPORT_SYMBOL(mlx5_core_query_mkey);
 
-int mlx5_core_dump_fill_mkey(struct mlx5_core_dev *dev, struct mlx5_core_mr *mr,
+int mlx5_core_dump_fill_mkey(struct mlx5_core_dev *dev, struct mlx5_core_mkey *_mkey,
 			     u32 *mkey)
 {
 	struct mlx5_query_special_ctxs_mbox_in in;
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index bb1a880..7e39338 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -340,7 +340,7 @@ struct mlx5_core_sig_ctx {
 	u32			sigerr_count;
 };
 
-struct mlx5_core_mr {
+struct mlx5_core_mkey {
 	u64			iova;
 	u64			size;
 	u32			key;
@@ -428,7 +428,7 @@ struct mlx5_srq_table {
 	struct radix_tree_root	tree;
 };
 
-struct mlx5_mr_table {
+struct mlx5_mkey_table {
 	/* protect radix tree
 	 */
 	rwlock_t		lock;
@@ -484,9 +484,9 @@ struct mlx5_priv {
 	struct mlx5_cq_table	cq_table;
 	/* end: cq staff */
 
-	/* start: mr staff */
-	struct mlx5_mr_table	mr_table;
-	/* end: mr staff */
+	/* start: mkey staff */
+	struct mlx5_mkey_table	mkey_table;
+	/* end: mkey staff */
 
 	/* start: alloc staff */
 	/* protect buffer alocation according to numa node */
@@ -740,16 +740,18 @@ int mlx5_core_query_srq(struct mlx5_core_dev *dev, struct mlx5_core_srq *srq,
 			struct mlx5_query_srq_mbox_out *out);
 int mlx5_core_arm_srq(struct mlx5_core_dev *dev, struct mlx5_core_srq *srq,
 		      u16 lwm, int is_srq);
-void mlx5_init_mr_table(struct mlx5_core_dev *dev);
-void mlx5_cleanup_mr_table(struct mlx5_core_dev *dev);
-int mlx5_core_create_mkey(struct mlx5_core_dev *dev, struct mlx5_core_mr *mr,
+void mlx5_init_mkey_table(struct mlx5_core_dev *dev);
+void mlx5_cleanup_mkey_table(struct mlx5_core_dev *dev);
+int mlx5_core_create_mkey(struct mlx5_core_dev *dev,
+			  struct mlx5_core_mkey *mkey,
 			  struct mlx5_create_mkey_mbox_in *in, int inlen,
 			  mlx5_cmd_cbk_t callback, void *context,
 			  struct mlx5_create_mkey_mbox_out *out);
-int mlx5_core_destroy_mkey(struct mlx5_core_dev *dev, struct mlx5_core_mr *mr);
-int mlx5_core_query_mkey(struct mlx5_core_dev *dev, struct mlx5_core_mr *mr,
+int mlx5_core_destroy_mkey(struct mlx5_core_dev *dev,
+			   struct mlx5_core_mkey *mkey);
+int mlx5_core_query_mkey(struct mlx5_core_dev *dev, struct mlx5_core_mkey *mkey,
 			 struct mlx5_query_mkey_mbox_out *out, int outlen);
-int mlx5_core_dump_fill_mkey(struct mlx5_core_dev *dev, struct mlx5_core_mr *mr,
+int mlx5_core_dump_fill_mkey(struct mlx5_core_dev *dev, struct mlx5_core_mkey *_mkey,
 			     u32 *mkey);
 int mlx5_core_alloc_pd(struct mlx5_core_dev *dev, u32 *pdn);
 int mlx5_core_dealloc_pd(struct mlx5_core_dev *dev, u32 pdn);
diff --git a/include/linux/mlx5/qp.h b/include/linux/mlx5/qp.h
index 5b8c89f..f46324e 100644
--- a/include/linux/mlx5/qp.h
+++ b/include/linux/mlx5/qp.h
@@ -621,9 +621,9 @@ static inline struct mlx5_core_qp *__mlx5_qp_lookup(struct mlx5_core_dev *dev, u
 	return radix_tree_lookup(&dev->priv.qp_table.tree, qpn);
 }
 
-static inline struct mlx5_core_mr *__mlx5_mr_lookup(struct mlx5_core_dev *dev, u32 key)
+static inline struct mlx5_core_mkey *__mlx5_mr_lookup(struct mlx5_core_dev *dev, u32 key)
 {
-	return radix_tree_lookup(&dev->priv.mr_table.tree, key);
+	return radix_tree_lookup(&dev->priv.mkey_table.tree, key);
 }
 
 struct mlx5_page_fault_resume_mbox_in {
-- 
1.7.1


* [PATCH net-next 02/13] net/mlx5: Introduce device queue counters
  2016-03-11 13:39 [PATCH net-next 00/13] Mellanox 100G mlx5 driver receive path optimizations Saeed Mahameed
  2016-03-11 13:39 ` [PATCH net-next 01/13] net/mlx5: Refactor mlx5_core_mr to mkey Saeed Mahameed
@ 2016-03-11 13:39 ` Saeed Mahameed
  2016-03-11 13:39 ` [PATCH net-next 03/13] net/mlx5e: Allocate set of queue counters per netdev Saeed Mahameed
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Saeed Mahameed @ 2016-03-11 13:39 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Or Gerlitz, Eran Ben Elisha, Tal Alon, Tariq Toukan,
	Jesper Dangaard Brouer, Rana Shahout, Saeed Mahameed

From: Tariq Toukan <tariqt@mellanox.com>

A queue counter can collect several statistics for one or more
hardware queues (QPs, RQs, etc.) that the counter is attached to.

For Ethernet, it will provide an "out of buffer" counter, which
counts all packets that are dropped due to lack of software
buffers.

Here we add device commands to alloc/query/dealloc queue counters.
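
For illustration, a minimal sketch of the intended call flow, using
only the wrappers added by this patch and assuming a valid mdev
(struct mlx5_core_dev *); error handling is trimmed:

	u16 counter_id;
	u32 out_of_buffer;
	int err;

	/* allocate a counter set on the device */
	err = mlx5_core_alloc_q_counter(mdev, &counter_id);
	if (err)
		return err;

	/* ... attach counter_id to one or more hardware queues ... */

	/* read the out_of_buffer field (without clearing the counter) */
	err = mlx5_core_query_out_of_buffer(mdev, counter_id,
					    &out_of_buffer);

	/* release the counter set */
	mlx5_core_dealloc_q_counter(mdev, counter_id);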

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Rana Shahout <ranas@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/qp.c |   68 ++++++++++++++++++++++++++
 include/linux/mlx5/qp.h                      |    6 ++
 2 files changed, 74 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/qp.c b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
index def2893..b720a27 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/qp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
@@ -538,3 +538,71 @@ void mlx5_core_destroy_sq_tracked(struct mlx5_core_dev *dev,
 	mlx5_core_destroy_sq(dev, sq->qpn);
 }
 EXPORT_SYMBOL(mlx5_core_destroy_sq_tracked);
+
+int mlx5_core_alloc_q_counter(struct mlx5_core_dev *dev, u16 *counter_id)
+{
+	u32 in[MLX5_ST_SZ_DW(alloc_q_counter_in)];
+	u32 out[MLX5_ST_SZ_DW(alloc_q_counter_out)];
+	int err;
+
+	memset(in, 0, sizeof(in));
+	memset(out, 0, sizeof(out));
+
+	MLX5_SET(alloc_q_counter_in, in, opcode, MLX5_CMD_OP_ALLOC_Q_COUNTER);
+	err = mlx5_cmd_exec_check_status(dev, in, sizeof(in), out, sizeof(out));
+	if (!err)
+		*counter_id = MLX5_GET(alloc_q_counter_out, out,
+				       counter_set_id);
+	return err;
+}
+EXPORT_SYMBOL_GPL(mlx5_core_alloc_q_counter);
+
+int mlx5_core_dealloc_q_counter(struct mlx5_core_dev *dev, u16 counter_id)
+{
+	u32 in[MLX5_ST_SZ_DW(dealloc_q_counter_in)];
+	u32 out[MLX5_ST_SZ_DW(dealloc_q_counter_out)];
+
+	memset(in, 0, sizeof(in));
+	memset(out, 0, sizeof(out));
+
+	MLX5_SET(dealloc_q_counter_in, in, opcode,
+		 MLX5_CMD_OP_DEALLOC_Q_COUNTER);
+	MLX5_SET(dealloc_q_counter_in, in, counter_set_id, counter_id);
+	return mlx5_cmd_exec_check_status(dev, in, sizeof(in), out,
+					  sizeof(out));
+}
+EXPORT_SYMBOL_GPL(mlx5_core_dealloc_q_counter);
+
+int mlx5_core_query_q_counter(struct mlx5_core_dev *dev, u16 counter_id,
+			      int reset, void *out, int out_size)
+{
+	u32 in[MLX5_ST_SZ_DW(query_q_counter_in)];
+
+	memset(in, 0, sizeof(in));
+
+	MLX5_SET(query_q_counter_in, in, opcode, MLX5_CMD_OP_QUERY_Q_COUNTER);
+	MLX5_SET(query_q_counter_in, in, clear, reset);
+	MLX5_SET(query_q_counter_in, in, counter_set_id, counter_id);
+	return mlx5_cmd_exec_check_status(dev, in, sizeof(in), out, out_size);
+}
+EXPORT_SYMBOL_GPL(mlx5_core_query_q_counter);
+
+int mlx5_core_query_out_of_buffer(struct mlx5_core_dev *dev, u16 counter_id,
+				  u32 *out_of_buffer)
+{
+	int outlen = MLX5_ST_SZ_BYTES(query_q_counter_out);
+	void *out;
+	int err;
+
+	out = mlx5_vzalloc(outlen);
+	if (!out)
+		return -ENOMEM;
+
+	err = mlx5_core_query_q_counter(dev, counter_id, 0, out, outlen);
+	if (!err)
+		*out_of_buffer = MLX5_GET(query_q_counter_out, out,
+					  out_of_buffer);
+
+	kfree(out);
+	return err;
+}
diff --git a/include/linux/mlx5/qp.h b/include/linux/mlx5/qp.h
index f46324e..296424b 100644
--- a/include/linux/mlx5/qp.h
+++ b/include/linux/mlx5/qp.h
@@ -667,6 +667,12 @@ int mlx5_core_create_sq_tracked(struct mlx5_core_dev *dev, u32 *in, int inlen,
 				struct mlx5_core_qp *sq);
 void mlx5_core_destroy_sq_tracked(struct mlx5_core_dev *dev,
 				  struct mlx5_core_qp *sq);
+int mlx5_core_alloc_q_counter(struct mlx5_core_dev *dev, u16 *counter_id);
+int mlx5_core_dealloc_q_counter(struct mlx5_core_dev *dev, u16 counter_id);
+int mlx5_core_query_q_counter(struct mlx5_core_dev *dev, u16 counter_id,
+			      int reset, void *out, int out_size);
+int mlx5_core_query_out_of_buffer(struct mlx5_core_dev *dev, u16 counter_id,
+				  u32 *out_of_buffer);
 
 static inline const char *mlx5_qp_type_str(int type)
 {
-- 
1.7.1


* [PATCH net-next 03/13] net/mlx5e: Allocate set of queue counters per netdev
  2016-03-11 13:39 [PATCH net-next 00/13] Mellanox 100G mlx5 driver receive path optimizations Saeed Mahameed
  2016-03-11 13:39 ` [PATCH net-next 01/13] net/mlx5: Refactor mlx5_core_mr to mkey Saeed Mahameed
  2016-03-11 13:39 ` [PATCH net-next 02/13] net/mlx5: Introduce device queue counters Saeed Mahameed
@ 2016-03-11 13:39 ` Saeed Mahameed
  2016-03-11 13:39 ` [PATCH net-next 04/13] net/mlx5e: Use only close NUMA node for default RSS Saeed Mahameed
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Saeed Mahameed @ 2016-03-11 13:39 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Or Gerlitz, Eran Ben Elisha, Tal Alon, Tariq Toukan,
	Jesper Dangaard Brouer, Rana Shahout, Saeed Mahameed

From: Rana Shahout <ranas@mellanox.com>

Connect all netdev RQs to this set of queue counters.
Also, add an "rx_out_of_buffer" counter to ethtool,
which indicates RX packet drops due to lack of receive
buffers.
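
Once applied, the counter is visible via the standard ethtool
statistics interface, e.g. (the interface name here is just an
example):

	$ ethtool -S eth2 | grep rx_out_of_buffer
	     rx_out_of_buffer: 0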

Signed-off-by: Rana Shahout <ranas@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |   11 +++++
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |   11 +++++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   42 +++++++++++++++++++-
 3 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 1f5cc1b..5ef91a9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -236,6 +236,15 @@ struct mlx5e_pport_stats {
 	__be64 RFC_2819_counters[NUM_RFC_2819_COUNTERS];
 };
 
+static const char qcounter_stats_strings[][ETH_GSTRING_LEN] = {
+	"rx_out_of_buffer",
+};
+
+struct mlx5e_qcounter_stats {
+	u32 rx_out_of_buffer;
+#define NUM_Q_COUNTERS 1
+};
+
 static const char rq_stats_strings[][ETH_GSTRING_LEN] = {
 	"packets",
 	"bytes",
@@ -293,6 +302,7 @@ struct mlx5e_sq_stats {
 struct mlx5e_stats {
 	struct mlx5e_vport_stats   vport;
 	struct mlx5e_pport_stats   pport;
+	struct mlx5e_qcounter_stats qcnt;
 };
 
 struct mlx5e_params {
@@ -575,6 +585,7 @@ struct mlx5e_priv {
 	struct net_device         *netdev;
 	struct mlx5e_stats         stats;
 	struct mlx5e_tstamp        tstamp;
+	u16 q_counter;
 };
 
 #define MLX5E_NET_IP_ALIGN 2
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 68834b7..39c1902 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -165,6 +165,8 @@ static const struct {
 	},
 };
 
+#define MLX5E_NUM_Q_CNTRS(priv) (NUM_Q_COUNTERS * (!!priv->q_counter))
+
 static int mlx5e_get_sset_count(struct net_device *dev, int sset)
 {
 	struct mlx5e_priv *priv = netdev_priv(dev);
@@ -172,6 +174,7 @@ static int mlx5e_get_sset_count(struct net_device *dev, int sset)
 	switch (sset) {
 	case ETH_SS_STATS:
 		return NUM_VPORT_COUNTERS + NUM_PPORT_COUNTERS +
+		       MLX5E_NUM_Q_CNTRS(priv) +
 		       priv->params.num_channels * NUM_RQ_STATS +
 		       priv->params.num_channels * priv->params.num_tc *
 						   NUM_SQ_STATS;
@@ -200,6 +203,11 @@ static void mlx5e_get_strings(struct net_device *dev,
 			strcpy(data + (idx++) * ETH_GSTRING_LEN,
 			       vport_strings[i]);
 
+		/* Q counters */
+		for (i = 0; i < MLX5E_NUM_Q_CNTRS(priv); i++)
+			strcpy(data + (idx++) * ETH_GSTRING_LEN,
+			       qcounter_stats_strings[i]);
+
 		/* PPORT counters */
 		for (i = 0; i < NUM_PPORT_COUNTERS; i++)
 			strcpy(data + (idx++) * ETH_GSTRING_LEN,
@@ -240,6 +248,9 @@ static void mlx5e_get_ethtool_stats(struct net_device *dev,
 	for (i = 0; i < NUM_VPORT_COUNTERS; i++)
 		data[idx++] = ((u64 *)&priv->stats.vport)[i];
 
+	for (i = 0; i < MLX5E_NUM_Q_CNTRS(priv); i++)
+		data[idx++] = ((u32 *)&priv->stats.qcnt)[i];
+
 	for (i = 0; i < NUM_PPORT_COUNTERS; i++)
 		data[idx++] = be64_to_cpu(((__be64 *)&priv->stats.pport)[i]);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index e0adb60..7fbe1ba 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -129,6 +129,17 @@ free_out:
 	kvfree(out);
 }
 
+static void mlx5e_update_q_counter(struct mlx5e_priv *priv)
+{
+	struct mlx5e_qcounter_stats *qcnt = &priv->stats.qcnt;
+
+	if (!priv->q_counter)
+		return;
+
+	mlx5_core_query_out_of_buffer(priv->mdev, priv->q_counter,
+				      &qcnt->rx_out_of_buffer);
+}
+
 void mlx5e_update_stats(struct mlx5e_priv *priv)
 {
 	struct mlx5_core_dev *mdev = priv->mdev;
@@ -250,6 +261,8 @@ void mlx5e_update_stats(struct mlx5e_priv *priv)
 			       s->rx_csum_sw;
 
 	mlx5e_update_pport_counters(priv);
+	mlx5e_update_q_counter(priv);
+
 free_out:
 	kvfree(out);
 }
@@ -1055,6 +1068,7 @@ static void mlx5e_build_rq_param(struct mlx5e_priv *priv,
 	MLX5_SET(wq, wq, log_wq_stride,    ilog2(sizeof(struct mlx5e_rx_wqe)));
 	MLX5_SET(wq, wq, log_wq_sz,        priv->params.log_rq_size);
 	MLX5_SET(wq, wq, pd,               priv->pdn);
+	MLX5_SET(rqc, rqc, counter_set_id, priv->q_counter);
 
 	param->wq.buf_numa_node = dev_to_node(&priv->mdev->pdev->dev);
 	param->wq.linear = 1;
@@ -2442,6 +2456,26 @@ static int mlx5e_create_mkey(struct mlx5e_priv *priv, u32 pdn,
 	return err;
 }
 
+static void mlx5e_create_q_counter(struct mlx5e_priv *priv)
+{
+	struct mlx5_core_dev *mdev = priv->mdev;
+	int err;
+
+	err = mlx5_core_alloc_q_counter(mdev, &priv->q_counter);
+	if (err) {
+		mlx5_core_warn(mdev, "alloc queue counter failed, %d\n", err);
+		priv->q_counter = 0;
+	}
+}
+
+static void mlx5e_destroy_q_counter(struct mlx5e_priv *priv)
+{
+	if (!priv->q_counter)
+		return;
+
+	mlx5_core_dealloc_q_counter(priv->mdev, priv->q_counter);
+}
+
 static void *mlx5e_create_netdev(struct mlx5_core_dev *mdev)
 {
 	struct net_device *netdev;
@@ -2527,13 +2561,15 @@ static void *mlx5e_create_netdev(struct mlx5_core_dev *mdev)
 		goto err_destroy_tirs;
 	}
 
+	mlx5e_create_q_counter(priv);
+
 	mlx5e_init_eth_addr(priv);
 
 	mlx5e_vxlan_init(priv);
 
 	err = mlx5e_tc_init(priv);
 	if (err)
-		goto err_destroy_flow_tables;
+		goto err_dealloc_q_counters;
 
 #ifdef CONFIG_MLX5_CORE_EN_DCB
 	mlx5e_dcbnl_ieee_setets_core(priv, &priv->params.ets);
@@ -2556,7 +2592,8 @@ static void *mlx5e_create_netdev(struct mlx5_core_dev *mdev)
 err_tc_cleanup:
 	mlx5e_tc_cleanup(priv);
 
-err_destroy_flow_tables:
+err_dealloc_q_counters:
+	mlx5e_destroy_q_counter(priv);
 	mlx5e_destroy_flow_tables(priv);
 
 err_destroy_tirs:
@@ -2605,6 +2642,7 @@ static void mlx5e_destroy_netdev(struct mlx5_core_dev *mdev, void *vpriv)
 	unregister_netdev(netdev);
 	mlx5e_tc_cleanup(priv);
 	mlx5e_vxlan_cleanup(priv);
+	mlx5e_destroy_q_counter(priv);
 	mlx5e_destroy_flow_tables(priv);
 	mlx5e_destroy_tirs(priv);
 	mlx5e_destroy_rqt(priv, MLX5E_SINGLE_RQ_RQT);
-- 
1.7.1


* [PATCH net-next 04/13] net/mlx5e: Use only close NUMA node for default RSS
  2016-03-11 13:39 [PATCH net-next 00/13] Mellanox 100G mlx5 driver receive path optimizations Saeed Mahameed
                   ` (2 preceding siblings ...)
  2016-03-11 13:39 ` [PATCH net-next 03/13] net/mlx5e: Allocate set of queue counters per netdev Saeed Mahameed
@ 2016-03-11 13:39 ` Saeed Mahameed
  2016-03-11 14:08   ` Sergei Shtylyov
  2016-03-11 13:39 ` [PATCH net-next 05/13] net/mlx5e: Use function pointers for RX data path handling Saeed Mahameed
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 26+ messages in thread
From: Saeed Mahameed @ 2016-03-11 13:39 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Or Gerlitz, Eran Ben Elisha, Tal Alon, Tariq Toukan,
	Jesper Dangaard Brouer, Saeed Mahameed

From: Tariq Toukan <tariqt@mellanox.com>

Distribute the default RSS table uniformly over the rings of the
close NUMA node, instead of over all available channels.
This way we enforce the preference of close rings over far ones.
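
As a worked example of the new logic (condensed from
mlx5e_build_default_indir_rqt() below): with 16 channels on a machine
whose close NUMA node has 8 cores, the indirection table cycles over
rings 0..7 only:

	num_channels = min_t(int, num_channels, node_num_of_cores); /* 8 */
	for (i = 0; i < len; i++)
		indirection_rqt[i] = i % num_channels; /* 0,1,..,7,0,1,.. */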

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |    3 ++-
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |    2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   15 +++++++++++++--
 3 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 5ef91a9..f4bf470 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -671,7 +671,8 @@ void mlx5e_build_tir_ctx_hash(void *tirc, struct mlx5e_priv *priv);
 
 int mlx5e_open_locked(struct net_device *netdev);
 int mlx5e_close_locked(struct net_device *netdev);
-void mlx5e_build_default_indir_rqt(u32 *indirection_rqt, int len,
+void mlx5e_build_default_indir_rqt(struct mlx5_core_dev *mdev,
+				   u32 *indirection_rqt, int len,
 				   int num_channels);
 
 static inline void mlx5e_tx_notify_hw(struct mlx5e_sq *sq,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 39c1902..6f40ba4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -397,7 +397,7 @@ static int mlx5e_set_channels(struct net_device *dev,
 		mlx5e_close_locked(dev);
 
 	priv->params.num_channels = count;
-	mlx5e_build_default_indir_rqt(priv->params.indirection_rqt,
+	mlx5e_build_default_indir_rqt(priv->mdev, priv->params.indirection_rqt,
 				      MLX5E_INDIR_RQT_SIZE, count);
 
 	if (was_opened)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 7fbe1ba..9b58ef6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2297,11 +2297,22 @@ static void mlx5e_ets_init(struct mlx5e_priv *priv)
 }
 #endif
 
-void mlx5e_build_default_indir_rqt(u32 *indirection_rqt, int len,
+void mlx5e_build_default_indir_rqt(struct mlx5_core_dev *mdev,
+				   u32 *indirection_rqt, int len,
 				   int num_channels)
 {
+	int node = mdev->priv.numa_node;
+	int node_num_of_cores;
 	int i;
 
+	if (node == -1)
+		node = first_online_node;
+
+	node_num_of_cores = cpumask_weight(cpumask_of_node(node));
+
+	if (node_num_of_cores)
+		num_channels = min_t(int, num_channels, node_num_of_cores);
+
 	for (i = 0; i < len; i++)
 		indirection_rqt[i] = i % num_channels;
 }
@@ -2333,7 +2344,7 @@ static void mlx5e_build_netdev_priv(struct mlx5_core_dev *mdev,
 	netdev_rss_key_fill(priv->params.toeplitz_hash_key,
 			    sizeof(priv->params.toeplitz_hash_key));
 
-	mlx5e_build_default_indir_rqt(priv->params.indirection_rqt,
+	mlx5e_build_default_indir_rqt(mdev, priv->params.indirection_rqt,
 				      MLX5E_INDIR_RQT_SIZE, num_channels);
 
 	priv->params.lro_wqe_sz            =
-- 
1.7.1


* [PATCH net-next 05/13] net/mlx5e: Use function pointers for RX data path handling
  2016-03-11 13:39 [PATCH net-next 00/13] Mellanox 100G mlx5 driver receive path optimizations Saeed Mahameed
                   ` (3 preceding siblings ...)
  2016-03-11 13:39 ` [PATCH net-next 04/13] net/mlx5e: Use only close NUMA node for default RSS Saeed Mahameed
@ 2016-03-11 13:39 ` Saeed Mahameed
  2016-03-11 13:39 ` [PATCH net-next 06/13] net/mlx5e: Support RX multi-packet WQE (Striding RQ) Saeed Mahameed
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Saeed Mahameed @ 2016-03-11 13:39 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Or Gerlitz, Eran Ben Elisha, Tal Alon, Tariq Toukan,
	Jesper Dangaard Brouer, Achiad Shochat, Saeed Mahameed

From: Tariq Toukan <tariqt@mellanox.com>

In preparation for the Striding RQ feature, which will need its own
RX handlers.
This patch does not change any functionality.
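
In short, the RQ now carries function pointers that are assigned once
at RQ creation time and invoked from the generic poll loop (a
condensed view of the changes below):

	typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq *rq,
					       struct mlx5_cqe64 *cqe);

	/* at RQ creation time */
	rq->handle_rx_cqe = mlx5e_handle_rx_cqe;
	rq->alloc_wqe     = mlx5e_alloc_rx_wqe;

	/* in the RX polling loop */
	rq->handle_rx_cqe(rq, cqe);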

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |   33 ++++++----
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |    2 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |   74 +++++++++++----------
 3 files changed, 62 insertions(+), 47 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index f4bf470..c70ec84 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -72,6 +72,17 @@
 #define MLX5E_SQ_BF_BUDGET             16
 
 #define MLX5E_NUM_MAIN_GROUPS 9
+#define MLX5E_NET_IP_ALIGN 2
+
+struct mlx5e_tx_wqe {
+	struct mlx5_wqe_ctrl_seg ctrl;
+	struct mlx5_wqe_eth_seg  eth;
+};
+
+struct mlx5e_rx_wqe {
+	struct mlx5_wqe_srq_next_seg  next;
+	struct mlx5_wqe_data_seg      data;
+};
 
 #ifdef CONFIG_MLX5_CORE_EN_DCB
 #define MLX5E_MAX_BW_ALLOC 100 /* Max percentage of BW allocation */
@@ -357,6 +368,12 @@ struct mlx5e_cq {
 	struct mlx5_wq_ctrl        wq_ctrl;
 } ____cacheline_aligned_in_smp;
 
+struct mlx5e_rq;
+typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq *rq,
+				       struct mlx5_cqe64 *cqe);
+typedef int (*mlx5e_fp_alloc_wqe)(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe,
+				  u16 ix);
+
 struct mlx5e_rq {
 	/* data path */
 	struct mlx5_wq_ll      wq;
@@ -368,6 +385,8 @@ struct mlx5e_rq {
 	struct mlx5e_tstamp   *tstamp;
 	struct mlx5e_rq_stats  stats;
 	struct mlx5e_cq        cq;
+	mlx5e_fp_handle_rx_cqe handle_rx_cqe;
+	mlx5e_fp_alloc_wqe     alloc_wqe;
 
 	unsigned long          state;
 	int                    ix;
@@ -588,18 +607,6 @@ struct mlx5e_priv {
 	u16 q_counter;
 };
 
-#define MLX5E_NET_IP_ALIGN 2
-
-struct mlx5e_tx_wqe {
-	struct mlx5_wqe_ctrl_seg ctrl;
-	struct mlx5_wqe_eth_seg  eth;
-};
-
-struct mlx5e_rx_wqe {
-	struct mlx5_wqe_srq_next_seg  next;
-	struct mlx5_wqe_data_seg      data;
-};
-
 enum mlx5e_link_mode {
 	MLX5E_1000BASE_CX_SGMII	 = 0,
 	MLX5E_1000BASE_KX	 = 1,
@@ -642,7 +649,9 @@ void mlx5e_cq_error_event(struct mlx5_core_cq *mcq, enum mlx5_event event);
 int mlx5e_napi_poll(struct napi_struct *napi, int budget);
 bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq);
 int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget);
+void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
 bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq);
+int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
 struct mlx5_cqe64 *mlx5e_get_cqe(struct mlx5e_cq *cq);
 
 void mlx5e_update_stats(struct mlx5e_priv *priv);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 9b58ef6..23ba12c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -357,6 +357,8 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 			cpu_to_be32(byte_count | MLX5_HW_START_PADDING);
 	}
 
+	rq->handle_rx_cqe = mlx5e_handle_rx_cqe;
+	rq->alloc_wqe = mlx5e_alloc_rx_wqe;
 	rq->pdev    = c->pdev;
 	rq->netdev  = c->netdev;
 	rq->tstamp  = &priv->tstamp;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 58d4e2f..d7ccced 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -42,8 +42,7 @@ static inline bool mlx5e_rx_hw_stamp(struct mlx5e_tstamp *tstamp)
 	return tstamp->hwtstamp_config.rx_filter == HWTSTAMP_FILTER_ALL;
 }
 
-static inline int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq,
-				     struct mlx5e_rx_wqe *wqe, u16 ix)
+int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
 {
 	struct sk_buff *skb;
 	dma_addr_t dma_addr;
@@ -87,7 +86,7 @@ bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq)
 	while (!mlx5_wq_ll_is_full(wq)) {
 		struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(wq, wq->head);
 
-		if (unlikely(mlx5e_alloc_rx_wqe(rq, wqe, wq->head)))
+		if (unlikely(rq->alloc_wqe(rq, wqe, wq->head)))
 			break;
 
 		mlx5_wq_ll_push(wq, be16_to_cpu(wqe->next.next_wqe_index));
@@ -229,50 +228,55 @@ static inline void mlx5e_build_rx_skb(struct mlx5_cqe64 *cqe,
 	skb->mark = be32_to_cpu(cqe->sop_drop_qpn) & MLX5E_TC_FLOW_ID_MASK;
 }
 
+void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
+{
+	struct mlx5e_rx_wqe *wqe;
+	struct sk_buff *skb;
+	__be16 wqe_counter_be;
+	u16 wqe_counter;
+
+	wqe_counter_be = cqe->wqe_counter;
+	wqe_counter    = be16_to_cpu(wqe_counter_be);
+	wqe            = mlx5_wq_ll_get_wqe(&rq->wq, wqe_counter);
+	skb            = rq->skb[wqe_counter];
+	prefetch(skb->data);
+	rq->skb[wqe_counter] = NULL;
+
+	dma_unmap_single(rq->pdev,
+			 *((dma_addr_t *)skb->cb),
+			 rq->wqe_sz,
+			 DMA_FROM_DEVICE);
+
+	if (unlikely((cqe->op_own >> 4) != MLX5_CQE_RESP_SEND)) {
+		rq->stats.wqe_err++;
+		dev_kfree_skb(skb);
+		goto wq_ll_pop;
+	}
+
+	mlx5e_build_rx_skb(cqe, rq, skb);
+	rq->stats.packets++;
+	rq->stats.bytes += be32_to_cpu(cqe->byte_cnt);
+	napi_gro_receive(rq->cq.napi, skb);
+
+wq_ll_pop:
+	mlx5_wq_ll_pop(&rq->wq, wqe_counter_be,
+		       &wqe->next.next_wqe_index);
+}
+
 int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
 {
 	struct mlx5e_rq *rq = container_of(cq, struct mlx5e_rq, cq);
 	int work_done;
 
 	for (work_done = 0; work_done < budget; work_done++) {
-		struct mlx5e_rx_wqe *wqe;
-		struct mlx5_cqe64 *cqe;
-		struct sk_buff *skb;
-		__be16 wqe_counter_be;
-		u16 wqe_counter;
+		struct mlx5_cqe64 *cqe = mlx5e_get_cqe(cq);
 
-		cqe = mlx5e_get_cqe(cq);
 		if (!cqe)
 			break;
 
 		mlx5_cqwq_pop(&cq->wq);
 
-		wqe_counter_be = cqe->wqe_counter;
-		wqe_counter    = be16_to_cpu(wqe_counter_be);
-		wqe            = mlx5_wq_ll_get_wqe(&rq->wq, wqe_counter);
-		skb            = rq->skb[wqe_counter];
-		prefetch(skb->data);
-		rq->skb[wqe_counter] = NULL;
-
-		dma_unmap_single(rq->pdev,
-				 *((dma_addr_t *)skb->cb),
-				 rq->wqe_sz,
-				 DMA_FROM_DEVICE);
-
-		if (unlikely((cqe->op_own >> 4) != MLX5_CQE_RESP_SEND)) {
-			rq->stats.wqe_err++;
-			dev_kfree_skb(skb);
-			goto wq_ll_pop;
-		}
-
-		mlx5e_build_rx_skb(cqe, rq, skb);
-		rq->stats.packets++;
-		rq->stats.bytes += be32_to_cpu(cqe->byte_cnt);
-		napi_gro_receive(cq->napi, skb);
-
-wq_ll_pop:
-		mlx5_wq_ll_pop(&rq->wq, wqe_counter_be,
-			       &wqe->next.next_wqe_index);
+		rq->handle_rx_cqe(rq, cqe);
 	}
 
 	mlx5_cqwq_update_db_record(&cq->wq);
-- 
1.7.1


* [PATCH net-next 06/13] net/mlx5e: Support RX multi-packet WQE (Striding RQ)
  2016-03-11 13:39 [PATCH net-next 00/13] Mellanox 100G mlx5 driver receive path optimizations Saeed Mahameed
                   ` (4 preceding siblings ...)
  2016-03-11 13:39 ` [PATCH net-next 05/13] net/mlx5e: Use function pointers for RX data path handling Saeed Mahameed
@ 2016-03-11 13:39 ` Saeed Mahameed
  2016-03-14 21:33   ` Jesper Dangaard Brouer
  2016-03-11 13:39 ` [PATCH net-next 07/13] net/mlx5e: Added ICO SQs Saeed Mahameed
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 26+ messages in thread
From: Saeed Mahameed @ 2016-03-11 13:39 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Or Gerlitz, Eran Ben Elisha, Tal Alon, Tariq Toukan,
	Jesper Dangaard Brouer, Achiad Shochat, Saeed Mahameed

From: Tariq Toukan <tariqt@mellanox.com>

Introduce the multi-packet WQE (RX Work Queue Element) feature,
referred to as MPWQE or Striding RQ, in which WQEs are larger and
each serves multiple packets.

Every WQE consists of many strides of the same size; every received
packet is aligned to the beginning of a stride and is written to
consecutive strides within a WQE.

In the regular approach, each WQE is big enough to serve one received
packet of any size, up to MTU, or up to 64K when device LRO is
enabled, which is very wasteful when dealing with small packets or
when device LRO is enabled.

Thanks to its flexibility, MPWQE allows better memory utilization
(implying improvements in CPU utilization and packet rate), as
packets consume strides according to their size, leaving the rest of
the WQE available for other packets.

MPWQE default configuration:
	NUM WQEs = 16
	Strides Per WQE = 1024
	Stride Size = 128
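
With these defaults, the sizing works out as follows (a worked
computation based on the macros added below, assuming 4K pages,
i.e. PAGE_SHIFT = 12):

	WQE size    = 1024 strides * 128 bytes = 2^(10+7) = 128KB
	page order  = max(0, 17 - PAGE_SHIFT) = 5  (32 contiguous pages)
	ring memory = 16 WQEs * 128KB = 2MB
	capacity    = 16 * 1024 = 16K strides, i.e. up to 16K packets
	              of 128 bytes or less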

Performance was tested on ConnectX-4 Lx 50G.

* Netperf single TCP stream:
- message size = 1024,  bw raised from ~12300 mbps to 14900 mbps (+20%)
- message size = 65536, bw raised from ~21800 mbps to 33500 mbps (+50%)
- with other message sizes we saw some gain or no degradation.

* Netperf multi TCP stream:
- No degradation, line rate reached.

* Pktgen: packet loss in bursts of N small messages (64 byte),
single stream
  | num packets | packet loss before | packet loss after |
  |      2K     |        ~1K         |         0         |
  |     16K     |       ~13K         |         0         |
  |     32K     |       ~29K         |        14K        |

This is expected, as the driver can now receive as many small packets
(<= 128 bytes) as the total number of strides in the ring
(default = 1024 * 16), vs. 1024 (the default ring size, regardless of
packet size) before this feature.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |   71 +++++++++++-
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |   15 ++-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  109 +++++++++++++----
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    |  126 ++++++++++++++++++--
 include/linux/mlx5/device.h                        |   39 ++++++-
 include/linux/mlx5/mlx5_ifc.h                      |   13 ++-
 6 files changed, 327 insertions(+), 46 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index c70ec84..cd8805d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -57,12 +57,27 @@
 #define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE                0xa
 #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE                0xd
 
+#define MLX5E_PARAMS_MINIMUM_LOG_RQ_SIZE_MPW            0x1
+#define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW            0x4
+#define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW            0x6
+
+#define MLX5_MPWRQ_LOG_NUM_STRIDES		10 /* >= 9, HW restriction */
+#define MLX5_MPWRQ_LOG_STRIDE_SIZE		7  /* >= 6, HW restriction */
+#define MLX5_MPWRQ_NUM_STRIDES			BIT(MLX5_MPWRQ_LOG_NUM_STRIDES)
+#define MLX5_MPWRQ_STRIDE_SIZE			BIT(MLX5_MPWRQ_LOG_STRIDE_SIZE)
+#define MLX5_MPWRQ_LOG_WQE_SZ			(MLX5_MPWRQ_LOG_NUM_STRIDES +\
+						 MLX5_MPWRQ_LOG_STRIDE_SIZE)
+#define MLX5_MPWRQ_WQE_PAGE_ORDER  (max_t(int, 0, \
+					  MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT))
+#define MLX5_MPWRQ_SMALL_PACKET_THRESHOLD	(128)
+
 #define MLX5E_PARAMS_DEFAULT_LRO_WQE_SZ                 (64 * 1024)
 #define MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_USEC      0x10
 #define MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_PKTS      0x20
 #define MLX5E_PARAMS_DEFAULT_TX_CQ_MODERATION_USEC      0x10
 #define MLX5E_PARAMS_DEFAULT_TX_CQ_MODERATION_PKTS      0x20
 #define MLX5E_PARAMS_DEFAULT_MIN_RX_WQES                0x80
+#define MLX5E_PARAMS_DEFAULT_MIN_RX_WQES_MPW            0x2
 
 #define MLX5E_LOG_INDIR_RQT_SIZE       0x7
 #define MLX5E_INDIR_RQT_SIZE           BIT(MLX5E_LOG_INDIR_RQT_SIZE)
@@ -74,6 +89,38 @@
 #define MLX5E_NUM_MAIN_GROUPS 9
 #define MLX5E_NET_IP_ALIGN 2
 
+static inline u16 mlx5_min_rx_wqes(int wq_type, u32 wq_size)
+{
+	switch (wq_type) {
+	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
+		return min_t(u16, MLX5E_PARAMS_DEFAULT_MIN_RX_WQES_MPW,
+			     wq_size / 2);
+	default:
+		return min_t(u16, MLX5E_PARAMS_DEFAULT_MIN_RX_WQES,
+			     wq_size / 2);
+	}
+}
+
+static inline int mlx5_min_log_rq_size(int wq_type)
+{
+	switch (wq_type) {
+	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
+		return MLX5E_PARAMS_MINIMUM_LOG_RQ_SIZE_MPW;
+	default:
+		return MLX5E_PARAMS_MINIMUM_LOG_RQ_SIZE;
+	}
+}
+
+static inline int mlx5_max_log_rq_size(int wq_type)
+{
+	switch (wq_type) {
+	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
+		return MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW;
+	default:
+		return MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE;
+	}
+}
+
 struct mlx5e_tx_wqe {
 	struct mlx5_wqe_ctrl_seg ctrl;
 	struct mlx5_wqe_eth_seg  eth;
@@ -128,6 +175,7 @@ static const char vport_strings[][ETH_GSTRING_LEN] = {
 	"tx_queue_wake",
 	"tx_queue_dropped",
 	"rx_wqe_err",
+	"rx_mpwqe_filler",
 };
 
 struct mlx5e_vport_stats {
@@ -169,8 +217,9 @@ struct mlx5e_vport_stats {
 	u64 tx_queue_wake;
 	u64 tx_queue_dropped;
 	u64 rx_wqe_err;
+	u64 rx_mpwqe_filler;
 
-#define NUM_VPORT_COUNTERS     35
+#define NUM_VPORT_COUNTERS     36
 };
 
 static const char pport_strings[][ETH_GSTRING_LEN] = {
@@ -263,7 +312,8 @@ static const char rq_stats_strings[][ETH_GSTRING_LEN] = {
 	"csum_sw",
 	"lro_packets",
 	"lro_bytes",
-	"wqe_err"
+	"wqe_err",
+	"mpwqe_filler",
 };
 
 struct mlx5e_rq_stats {
@@ -274,6 +324,7 @@ struct mlx5e_rq_stats {
 	u64 lro_packets;
 	u64 lro_bytes;
 	u64 wqe_err;
+	u64 mpwqe_filler;
 #define NUM_RQ_STATS 7
 };
 
@@ -318,6 +369,7 @@ struct mlx5e_stats {
 
 struct mlx5e_params {
 	u8  log_sq_size;
+	u8  rq_wq_type;
 	u8  log_rq_size;
 	u16 num_channels;
 	u8  num_tc;
@@ -374,11 +426,22 @@ typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq *rq,
 typedef int (*mlx5e_fp_alloc_wqe)(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe,
 				  u16 ix);
 
+struct mlx5e_dma_info {
+	struct page	*page;
+	dma_addr_t	addr;
+};
+
+struct mlx5e_mpw_info {
+	struct mlx5e_dma_info dma_info;
+	u16		     consumed_strides;
+};
+
 struct mlx5e_rq {
 	/* data path */
 	struct mlx5_wq_ll      wq;
 	u32                    wqe_sz;
 	struct sk_buff       **skb;
+	struct mlx5e_mpw_info *wqe_info;
 
 	struct device         *pdev;
 	struct net_device     *netdev;
@@ -393,6 +456,7 @@ struct mlx5e_rq {
 
 	/* control */
 	struct mlx5_wq_ctrl    wq_ctrl;
+	u8                     wq_type;
 	u32                    rqn;
 	struct mlx5e_channel  *channel;
 	struct mlx5e_priv     *priv;
@@ -649,9 +713,12 @@ void mlx5e_cq_error_event(struct mlx5_core_cq *mcq, enum mlx5_event event);
 int mlx5e_napi_poll(struct napi_struct *napi, int budget);
 bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq);
 int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget);
+
 void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
+void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
 bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq);
 int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
+int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
 struct mlx5_cqe64 *mlx5e_get_cqe(struct mlx5e_cq *cq);
 
 void mlx5e_update_stats(struct mlx5e_priv *priv);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 6f40ba4..4077856 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -273,8 +273,9 @@ static void mlx5e_get_ringparam(struct net_device *dev,
 				struct ethtool_ringparam *param)
 {
 	struct mlx5e_priv *priv = netdev_priv(dev);
+	int rq_wq_type = priv->params.rq_wq_type;
 
-	param->rx_max_pending = 1 << MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE;
+	param->rx_max_pending = 1 << mlx5_max_log_rq_size(rq_wq_type);
 	param->tx_max_pending = 1 << MLX5E_PARAMS_MAXIMUM_LOG_SQ_SIZE;
 	param->rx_pending     = 1 << priv->params.log_rq_size;
 	param->tx_pending     = 1 << priv->params.log_sq_size;
@@ -285,6 +286,7 @@ static int mlx5e_set_ringparam(struct net_device *dev,
 {
 	struct mlx5e_priv *priv = netdev_priv(dev);
 	bool was_opened;
+	int rq_wq_type = priv->params.rq_wq_type;
 	u16 min_rx_wqes;
 	u8 log_rq_size;
 	u8 log_sq_size;
@@ -300,16 +302,16 @@ static int mlx5e_set_ringparam(struct net_device *dev,
 			    __func__);
 		return -EINVAL;
 	}
-	if (param->rx_pending < (1 << MLX5E_PARAMS_MINIMUM_LOG_RQ_SIZE)) {
+	if (param->rx_pending < (1 << mlx5_min_log_rq_size(rq_wq_type))) {
 		netdev_info(dev, "%s: rx_pending (%d) < min (%d)\n",
 			    __func__, param->rx_pending,
-			    1 << MLX5E_PARAMS_MINIMUM_LOG_RQ_SIZE);
+			    1 << mlx5_min_log_rq_size(rq_wq_type));
 		return -EINVAL;
 	}
-	if (param->rx_pending > (1 << MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE)) {
+	if (param->rx_pending > (1 << mlx5_max_log_rq_size(rq_wq_type))) {
 		netdev_info(dev, "%s: rx_pending (%d) > max (%d)\n",
 			    __func__, param->rx_pending,
-			    1 << MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE);
+			    1 << mlx5_max_log_rq_size(rq_wq_type));
 		return -EINVAL;
 	}
 	if (param->tx_pending < (1 << MLX5E_PARAMS_MINIMUM_LOG_SQ_SIZE)) {
@@ -327,8 +329,7 @@ static int mlx5e_set_ringparam(struct net_device *dev,
 
 	log_rq_size = order_base_2(param->rx_pending);
 	log_sq_size = order_base_2(param->tx_pending);
-	min_rx_wqes = min_t(u16, param->rx_pending - 1,
-			    MLX5E_PARAMS_DEFAULT_MIN_RX_WQES);
+	min_rx_wqes = mlx5_min_rx_wqes(rq_wq_type, param->rx_pending);
 
 	if (log_rq_size == priv->params.log_rq_size &&
 	    log_sq_size == priv->params.log_sq_size &&
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 23ba12c..871f3af 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -175,6 +175,7 @@ void mlx5e_update_stats(struct mlx5e_priv *priv)
 	s->rx_csum_none		= 0;
 	s->rx_csum_sw		= 0;
 	s->rx_wqe_err		= 0;
+	s->rx_mpwqe_filler	= 0;
 	for (i = 0; i < priv->params.num_channels; i++) {
 		rq_stats = &priv->channel[i]->rq.stats;
 
@@ -185,6 +186,7 @@ void mlx5e_update_stats(struct mlx5e_priv *priv)
 		s->rx_csum_none	+= rq_stats->csum_none;
 		s->rx_csum_sw	+= rq_stats->csum_sw;
 		s->rx_wqe_err   += rq_stats->wqe_err;
+		s->rx_mpwqe_filler += rq_stats->mpwqe_filler;
 
 		for (j = 0; j < priv->params.num_tc; j++) {
 			sq_stats = &priv->channel[i]->sq[j].stats;
@@ -323,6 +325,7 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 	struct mlx5_core_dev *mdev = priv->mdev;
 	void *rqc = param->rqc;
 	void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq);
+	u32 byte_count;
 	int wq_sz;
 	int err;
 	int i;
@@ -337,28 +340,47 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 	rq->wq.db = &rq->wq.db[MLX5_RCV_DBR];
 
 	wq_sz = mlx5_wq_ll_get_size(&rq->wq);
-	rq->skb = kzalloc_node(wq_sz * sizeof(*rq->skb), GFP_KERNEL,
-			       cpu_to_node(c->cpu));
-	if (!rq->skb) {
-		err = -ENOMEM;
-		goto err_rq_wq_destroy;
-	}
 
-	rq->wqe_sz = (priv->params.lro_en) ? priv->params.lro_wqe_sz :
-					     MLX5E_SW2HW_MTU(priv->netdev->mtu);
-	rq->wqe_sz = SKB_DATA_ALIGN(rq->wqe_sz + MLX5E_NET_IP_ALIGN);
+	switch (priv->params.rq_wq_type) {
+	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
+		rq->wqe_info = kzalloc_node(wq_sz * sizeof(*rq->wqe_info),
+					    GFP_KERNEL, cpu_to_node(c->cpu));
+		if (!rq->wqe_info) {
+			err = -ENOMEM;
+			goto err_rq_wq_destroy;
+		}
+		rq->handle_rx_cqe = mlx5e_handle_rx_cqe_mpwrq;
+		rq->alloc_wqe = mlx5e_alloc_rx_mpwqe;
+
+		rq->wqe_sz = MLX5_MPWRQ_NUM_STRIDES * MLX5_MPWRQ_STRIDE_SIZE;
+		byte_count = rq->wqe_sz;
+		break;
+	default: /* MLX5_WQ_TYPE_LINKED_LIST */
+		rq->skb = kzalloc_node(wq_sz * sizeof(*rq->skb), GFP_KERNEL,
+				       cpu_to_node(c->cpu));
+		if (!rq->skb) {
+			err = -ENOMEM;
+			goto err_rq_wq_destroy;
+		}
+		rq->handle_rx_cqe = mlx5e_handle_rx_cqe;
+		rq->alloc_wqe = mlx5e_alloc_rx_wqe;
+
+		rq->wqe_sz = (priv->params.lro_en) ?
+				priv->params.lro_wqe_sz :
+				MLX5E_SW2HW_MTU(priv->netdev->mtu);
+		rq->wqe_sz = SKB_DATA_ALIGN(rq->wqe_sz + MLX5E_NET_IP_ALIGN);
+		byte_count = rq->wqe_sz - MLX5E_NET_IP_ALIGN;
+		byte_count |= MLX5_HW_START_PADDING;
+	}
 
 	for (i = 0; i < wq_sz; i++) {
 		struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(&rq->wq, i);
-		u32 byte_count = rq->wqe_sz - MLX5E_NET_IP_ALIGN;
 
 		wqe->data.lkey       = c->mkey_be;
-		wqe->data.byte_count =
-			cpu_to_be32(byte_count | MLX5_HW_START_PADDING);
+		wqe->data.byte_count = cpu_to_be32(byte_count);
 	}
 
-	rq->handle_rx_cqe = mlx5e_handle_rx_cqe;
-	rq->alloc_wqe = mlx5e_alloc_rx_wqe;
+	rq->wq_type = priv->params.rq_wq_type;
 	rq->pdev    = c->pdev;
 	rq->netdev  = c->netdev;
 	rq->tstamp  = &priv->tstamp;
@@ -376,7 +398,14 @@ err_rq_wq_destroy:
 
 static void mlx5e_destroy_rq(struct mlx5e_rq *rq)
 {
-	kfree(rq->skb);
+	switch (rq->wq_type) {
+	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
+		kfree(rq->wqe_info);
+		break;
+	default: /* MLX5_WQ_TYPE_LINKED_LIST */
+		kfree(rq->skb);
+	}
+
 	mlx5_wq_destroy(&rq->wq_ctrl);
 }
 
@@ -1065,7 +1094,18 @@ static void mlx5e_build_rq_param(struct mlx5e_priv *priv,
 	void *rqc = param->rqc;
 	void *wq = MLX5_ADDR_OF(rqc, rqc, wq);
 
-	MLX5_SET(wq, wq, wq_type,          MLX5_WQ_TYPE_LINKED_LIST);
+	switch (priv->params.rq_wq_type) {
+	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
+		MLX5_SET(wq, wq, log_wqe_num_of_strides,
+			 MLX5_MPWRQ_LOG_NUM_STRIDES - 9);
+		MLX5_SET(wq, wq, log_wqe_stride_size,
+			 MLX5_MPWRQ_LOG_STRIDE_SIZE - 6);
+		MLX5_SET(wq, wq, wq_type, MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ);
+		break;
+	default: /* MLX5_WQ_TYPE_LINKED_LIST */
+		MLX5_SET(wq, wq, wq_type, MLX5_WQ_TYPE_LINKED_LIST);
+	}
+
 	MLX5_SET(wq, wq, end_padding_mode, MLX5_WQ_END_PAD_MODE_ALIGN);
 	MLX5_SET(wq, wq, log_wq_stride,    ilog2(sizeof(struct mlx5e_rx_wqe)));
 	MLX5_SET(wq, wq, log_wq_sz,        priv->params.log_rq_size);
@@ -1111,8 +1151,18 @@ static void mlx5e_build_rx_cq_param(struct mlx5e_priv *priv,
 				    struct mlx5e_cq_param *param)
 {
 	void *cqc = param->cqc;
+	u8 log_cq_size;
 
-	MLX5_SET(cqc, cqc, log_cq_size,  priv->params.log_rq_size);
+	switch (priv->params.rq_wq_type) {
+	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
+		log_cq_size = priv->params.log_rq_size +
+			MLX5_MPWRQ_LOG_NUM_STRIDES;
+		break;
+	default: /* MLX5_WQ_TYPE_LINKED_LIST */
+		log_cq_size = priv->params.log_rq_size;
+	}
+
+	MLX5_SET(cqc, cqc, log_cq_size, log_cq_size);
 
 	mlx5e_build_common_cq_param(priv, param);
 }
@@ -1983,7 +2033,8 @@ static int mlx5e_set_features(struct net_device *netdev,
 	if (changes & NETIF_F_LRO) {
 		bool was_opened = test_bit(MLX5E_STATE_OPENED, &priv->state);
 
-		if (was_opened)
+		if (was_opened && (priv->params.rq_wq_type ==
+				   MLX5_WQ_TYPE_LINKED_LIST))
 			mlx5e_close_locked(priv->netdev);
 
 		priv->params.lro_en = !!(features & NETIF_F_LRO);
@@ -1992,7 +2043,8 @@ static int mlx5e_set_features(struct net_device *netdev,
 			mlx5_core_warn(priv->mdev, "lro modify failed, %d\n",
 				       err);
 
-		if (was_opened)
+		if (was_opened && (priv->params.rq_wq_type ==
+				   MLX5_WQ_TYPE_LINKED_LIST))
 			err = mlx5e_open_locked(priv->netdev);
 	}
 
@@ -2327,8 +2379,21 @@ static void mlx5e_build_netdev_priv(struct mlx5_core_dev *mdev,
 
 	priv->params.log_sq_size           =
 		MLX5E_PARAMS_DEFAULT_LOG_SQ_SIZE;
-	priv->params.log_rq_size           =
-		MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE;
+	priv->params.rq_wq_type = MLX5_CAP_GEN(mdev, striding_rq) ?
+		MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ :
+		MLX5_WQ_TYPE_LINKED_LIST;
+
+	switch (priv->params.rq_wq_type) {
+	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
+		priv->params.log_rq_size = MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW;
+		priv->params.lro_en = true;
+		break;
+	default: /* MLX5_WQ_TYPE_LINKED_LIST */
+		priv->params.log_rq_size = MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE;
+	}
+
+	priv->params.min_rx_wqes = mlx5_min_rx_wqes(priv->params.rq_wq_type,
+					    BIT(priv->params.log_rq_size));
 	priv->params.rx_cq_moderation_usec =
 		MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_USEC;
 	priv->params.rx_cq_moderation_pkts =
@@ -2338,8 +2403,6 @@ static void mlx5e_build_netdev_priv(struct mlx5_core_dev *mdev,
 	priv->params.tx_cq_moderation_pkts =
 		MLX5E_PARAMS_DEFAULT_TX_CQ_MODERATION_PKTS;
 	priv->params.tx_max_inline         = mlx5e_get_max_inline_cap(mdev);
-	priv->params.min_rx_wqes           =
-		MLX5E_PARAMS_DEFAULT_MIN_RX_WQES;
 	priv->params.num_tc                = 1;
 	priv->params.rss_hfunc             = ETH_RSS_HASH_XOR;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index d7ccced..18105c1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -76,6 +76,33 @@ err_free_skb:
 	return -ENOMEM;
 }
 
+int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
+{
+	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
+	int ret = 0;
+
+	wi->dma_info.page = alloc_pages(GFP_ATOMIC | __GFP_COMP | __GFP_COLD,
+					MLX5_MPWRQ_WQE_PAGE_ORDER);
+	if (unlikely(!wi->dma_info.page))
+		return -ENOMEM;
+
+	wi->dma_info.addr = dma_map_page(rq->pdev, wi->dma_info.page, 0,
+					 rq->wqe_sz, PCI_DMA_FROMDEVICE);
+	if (dma_mapping_error(rq->pdev, wi->dma_info.addr)) {
+		ret = -ENOMEM;
+		goto err_put_page;
+	}
+
+	wi->consumed_strides = 0;
+	wqe->data.addr = cpu_to_be64(wi->dma_info.addr);
+
+	return 0;
+
+err_put_page:
+	put_page(wi->dma_info.page);
+	return ret;
+}
+
 bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq)
 {
 	struct mlx5_wq_ll *wq = &rq->wq;
@@ -100,7 +127,8 @@ bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq)
 	return !mlx5_wq_ll_is_full(wq);
 }
 
-static void mlx5e_lro_update_hdr(struct sk_buff *skb, struct mlx5_cqe64 *cqe)
+static void mlx5e_lro_update_hdr(struct sk_buff *skb, struct mlx5_cqe64 *cqe,
+				 u32 cqe_bcnt)
 {
 	struct ethhdr	*eth	= (struct ethhdr *)(skb->data);
 	struct iphdr	*ipv4	= (struct iphdr *)(skb->data + ETH_HLEN);
@@ -111,7 +139,7 @@ static void mlx5e_lro_update_hdr(struct sk_buff *skb, struct mlx5_cqe64 *cqe)
 	int tcp_ack = ((CQE_L4_HDR_TYPE_TCP_ACK_NO_DATA  == l4_hdr_type) ||
 		       (CQE_L4_HDR_TYPE_TCP_ACK_AND_DATA == l4_hdr_type));
 
-	u16 tot_len = be32_to_cpu(cqe->byte_cnt) - ETH_HLEN;
+	u16 tot_len = cqe_bcnt - ETH_HLEN;
 
 	if (eth->h_proto == htons(ETH_P_IP)) {
 		tcp = (struct tcphdr *)(skb->data + ETH_HLEN +
@@ -191,19 +219,17 @@ csum_none:
 }
 
 static inline void mlx5e_build_rx_skb(struct mlx5_cqe64 *cqe,
+				      u32 cqe_bcnt,
 				      struct mlx5e_rq *rq,
 				      struct sk_buff *skb)
 {
 	struct net_device *netdev = rq->netdev;
-	u32 cqe_bcnt = be32_to_cpu(cqe->byte_cnt);
 	struct mlx5e_tstamp *tstamp = rq->tstamp;
 	int lro_num_seg;
 
-	skb_put(skb, cqe_bcnt);
-
 	lro_num_seg = be32_to_cpu(cqe->srqn) >> 24;
 	if (lro_num_seg > 1) {
-		mlx5e_lro_update_hdr(skb, cqe);
+		mlx5e_lro_update_hdr(skb, cqe, cqe_bcnt);
 		skb_shinfo(skb)->gso_size = DIV_ROUND_UP(cqe_bcnt, lro_num_seg);
 		rq->stats.lro_packets++;
 		rq->stats.lro_bytes += cqe_bcnt;
@@ -228,12 +254,24 @@ static inline void mlx5e_build_rx_skb(struct mlx5_cqe64 *cqe,
 	skb->mark = be32_to_cpu(cqe->sop_drop_qpn) & MLX5E_TC_FLOW_ID_MASK;
 }
 
+static inline void mlx5e_complete_rx_cqe(struct mlx5e_rq *rq,
+					 struct mlx5_cqe64 *cqe,
+					 u32 cqe_bcnt,
+					 struct sk_buff *skb)
+{
+	mlx5e_build_rx_skb(cqe, cqe_bcnt, rq, skb);
+	rq->stats.packets++;
+	rq->stats.bytes += cqe_bcnt;
+	napi_gro_receive(rq->cq.napi, skb);
+}
+
 void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 {
 	struct mlx5e_rx_wqe *wqe;
 	struct sk_buff *skb;
 	__be16 wqe_counter_be;
 	u16 wqe_counter;
+	u32 cqe_bcnt;
 
 	wqe_counter_be = cqe->wqe_counter;
 	wqe_counter    = be16_to_cpu(wqe_counter_be);
@@ -253,16 +291,84 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 		goto wq_ll_pop;
 	}
 
-	mlx5e_build_rx_skb(cqe, rq, skb);
-	rq->stats.packets++;
-	rq->stats.bytes += be32_to_cpu(cqe->byte_cnt);
-	napi_gro_receive(rq->cq.napi, skb);
+	cqe_bcnt = be32_to_cpu(cqe->byte_cnt);
+	skb_put(skb, cqe_bcnt);
+
+	mlx5e_complete_rx_cqe(rq, cqe, cqe_bcnt, skb);
 
 wq_ll_pop:
 	mlx5_wq_ll_pop(&rq->wq, wqe_counter_be,
 		       &wqe->next.next_wqe_index);
 }
 
+void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
+{
+	u16 cstrides       = mpwrq_get_cqe_consumed_strides(cqe);
+	u16 stride_ix      = mpwrq_get_cqe_stride_index(cqe);
+	u32 consumed_bytes = cstrides  * MLX5_MPWRQ_STRIDE_SIZE;
+	u32 stride_offset  = stride_ix * MLX5_MPWRQ_STRIDE_SIZE;
+	u16 wqe_id         = be16_to_cpu(cqe->wqe_id);
+	struct mlx5e_mpw_info *wi = &rq->wqe_info[wqe_id];
+	struct mlx5e_rx_wqe  *wqe = mlx5_wq_ll_get_wqe(&rq->wq, wqe_id);
+	struct sk_buff *skb;
+	u16 byte_cnt;
+	u16 cqe_bcnt;
+	u16 headlen;
+
+	wi->consumed_strides += cstrides;
+
+	if (unlikely((cqe->op_own >> 4) != MLX5_CQE_RESP_SEND)) {
+		rq->stats.wqe_err++;
+		goto mpwrq_cqe_out;
+	}
+
+	if (mpwrq_is_filler_cqe(cqe)) {
+		rq->stats.mpwqe_filler++;
+		goto mpwrq_cqe_out;
+	}
+
+	skb = netdev_alloc_skb(rq->netdev, MLX5_MPWRQ_SMALL_PACKET_THRESHOLD);
+	if (unlikely(!skb))
+		goto mpwrq_cqe_out;
+
+	dma_sync_single_for_cpu(rq->pdev, wi->dma_info.addr + stride_offset,
+				consumed_bytes, DMA_FROM_DEVICE);
+
+	cqe_bcnt = mpwrq_get_cqe_byte_cnt(cqe);
+	headlen = min_t(u16, MLX5_MPWRQ_SMALL_PACKET_THRESHOLD, cqe_bcnt);
+	skb_copy_to_linear_data(skb,
+				page_address(wi->dma_info.page) + stride_offset,
+				headlen);
+	skb_put(skb, headlen);
+
+	byte_cnt = cqe_bcnt - headlen;
+	if (byte_cnt) {
+		skb_frag_t *f0 = &skb_shinfo(skb)->frags[0];
+
+		skb_shinfo(skb)->nr_frags = 1;
+
+		skb->data_len  = byte_cnt;
+		skb->len      += byte_cnt;
+		skb->truesize  = SKB_TRUESIZE(skb->len);
+
+		get_page(wi->dma_info.page);
+		skb_frag_set_page(skb, 0, wi->dma_info.page);
+		skb_frag_size_set(f0, skb->data_len);
+		f0->page_offset = stride_offset + headlen;
+	}
+
+	mlx5e_complete_rx_cqe(rq, cqe, cqe_bcnt, skb);
+
+mpwrq_cqe_out:
+	if (likely(wi->consumed_strides < MLX5_MPWRQ_NUM_STRIDES))
+		return;
+
+	dma_unmap_page(rq->pdev, wi->dma_info.addr, rq->wqe_sz,
+		       PCI_DMA_FROMDEVICE);
+	put_page(wi->dma_info.page);
+	mlx5_wq_ll_pop(&rq->wq, cqe->wqe_id, &wqe->next.next_wqe_index);
+}
+
 int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
 {
 	struct mlx5e_rq *rq = container_of(cq, struct mlx5e_rq, cq);
diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index 68a56bc..9b60f45 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -621,7 +621,8 @@ struct mlx5_err_cqe {
 };
 
 struct mlx5_cqe64 {
-	u8		rsvd0[4];
+	u8              rsvd0[2];
+	__be16          wqe_id;
 	u8		lro_tcppsh_abort_dupack;
 	u8		lro_min_ttl;
 	__be16		lro_tcp_win;
@@ -673,6 +674,42 @@ static inline u64 get_cqe_ts(struct mlx5_cqe64 *cqe)
 	return (u64)lo | ((u64)hi << 32);
 }
 
+struct mpwrq_cqe_bc {
+	__be16	filler_consumed_strides;
+	__be16	byte_cnt;
+};
+
+static inline u16 mpwrq_get_cqe_byte_cnt(struct mlx5_cqe64 *cqe)
+{
+	struct mpwrq_cqe_bc *bc = (struct mpwrq_cqe_bc *)&cqe->byte_cnt;
+
+	return be16_to_cpu(bc->byte_cnt);
+}
+
+static inline u16 mpwrq_get_cqe_bc_consumed_strides(struct mpwrq_cqe_bc *bc)
+{
+	return 0x7fff & be16_to_cpu(bc->filler_consumed_strides);
+}
+
+static inline u16 mpwrq_get_cqe_consumed_strides(struct mlx5_cqe64 *cqe)
+{
+	struct mpwrq_cqe_bc *bc = (struct mpwrq_cqe_bc *)&cqe->byte_cnt;
+
+	return mpwrq_get_cqe_bc_consumed_strides(bc);
+}
+
+static inline bool mpwrq_is_filler_cqe(struct mlx5_cqe64 *cqe)
+{
+	struct mpwrq_cqe_bc *bc = (struct mpwrq_cqe_bc *)&cqe->byte_cnt;
+
+	return 0x8000 & be16_to_cpu(bc->filler_consumed_strides);
+}
+
+static inline u16 mpwrq_get_cqe_stride_index(struct mlx5_cqe64 *cqe)
+{
+	return be16_to_cpu(cqe->wqe_counter);
+}
+
 enum {
 	CQE_L4_HDR_TYPE_NONE			= 0x0,
 	CQE_L4_HDR_TYPE_TCP_NO_ACK		= 0x1,
diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 9d91ce3..6060ca3 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -620,7 +620,7 @@ struct mlx5_ifc_odp_cap_bits {
 enum {
 	MLX5_WQ_TYPE_LINKED_LIST  = 0x0,
 	MLX5_WQ_TYPE_CYCLIC       = 0x1,
-	MLX5_WQ_TYPE_STRQ         = 0x2,
+	MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ = 0x2,
 };
 
 enum {
@@ -750,7 +750,8 @@ struct mlx5_ifc_cmd_hca_cap_bits {
 	u8         cqe_version[0x4];
 
 	u8         compact_address_vector[0x1];
-	u8         reserved_at_200[0xe];
+	u8         striding_rq[0x1];
+	u8         reserved_at_201[0xd];
 	u8         drain_sigerr[0x1];
 	u8         cmdif_checksum[0x2];
 	u8         sigerr_cqe[0x1];
@@ -963,7 +964,13 @@ struct mlx5_ifc_wq_bits {
 	u8         reserved_at_118[0x3];
 	u8         log_wq_sz[0x5];
 
-	u8         reserved_at_120[0x4e0];
+	u8         reserved_at_120[0x15];
+	u8         log_wqe_num_of_strides[0x3];
+	u8         two_byte_shift_en[0x1];
+	u8         reserved_at_139[0x4];
+	u8         log_wqe_stride_size[0x3];
+
+	u8         reserved_at_140[0x4c0];
 
 	struct mlx5_ifc_cmd_pas_bits pas[0];
 };
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH net-next 07/13] net/mlx5e: Added ICO SQs
  2016-03-11 13:39 [PATCH net-next 00/13] Mellanox 100G mlx5 driver receive path optimizations Saeed Mahameed
                   ` (5 preceding siblings ...)
  2016-03-11 13:39 ` [PATCH net-next 06/13] net/mlx5e: Support RX multi-packet WQE (Striding RQ) Saeed Mahameed
@ 2016-03-11 13:39 ` Saeed Mahameed
  2016-03-11 13:39 ` [PATCH net-next 08/13] net/mlx5e: Add fragmented memory support for RX multi packet WQE Saeed Mahameed
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Saeed Mahameed @ 2016-03-11 13:39 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Or Gerlitz, Eran Ben Elisha, Tal Alon, Tariq Toukan,
	Jesper Dangaard Brouer, Saeed Mahameed

From: Tariq Toukan <tariqt@mellanox.com>

Added an ICO (Internal Control Operations) SQ per channel, to be used
for driver-internal operations such as memory registration for
fragmented memory and NOP requests upon ifconfig up.
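
In outline (a hand-written sketch of the bookkeeping idea, not driver
code): each WQE posted to the ICO SQ records its opcode and size, so
that on completion the SQ consumer counter can be advanced by the
right number of WQE basic blocks:

#include <stdint.h>

#define MLX5_OPCODE_NOP	0x00	/* opcode value from the mlx5 headers */

struct ico_wqe_info {
	uint8_t opcode;		/* e.g. NOP, or UMR in a later patch */
	uint8_t num_wqebbs;	/* WQE size in 64-byte basic blocks */
};

/* Record a NOP at the producer index; the ICO CQ handler later adds
 * num_wqebbs to the consumer counter when the CQE arrives.
 */
static inline void ico_record_nop(struct ico_wqe_info *info,
				  uint16_t pc, uint16_t sz_m1)
{
	uint16_t pi = pc & sz_m1;	/* ring size is a power of two */

	info[pi].opcode = MLX5_OPCODE_NOP;
	info[pi].num_wqebbs = 1;	/* a NOP occupies one WQEBB */
}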

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |    7 +
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |  135 +++++++++++++++++----
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c   |    2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c |   55 +++++++++
 4 files changed, 174 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index cd8805d..8e011af 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -484,6 +484,11 @@ enum {
 	MLX5E_SQ_STATE_BF_ENABLE,
 };
 
+struct mlx5e_ico_wqe_info {
+	u8  opcode;
+	u8  num_wqebbs;
+};
+
 struct mlx5e_sq {
 	/* data path */
 
@@ -525,6 +530,7 @@ struct mlx5e_sq {
 	struct mlx5_uar            uar;
 	struct mlx5e_channel      *channel;
 	int                        tc;
+	struct mlx5e_ico_wqe_info *ico_wqe_info;
 } ____cacheline_aligned_in_smp;
 
 static inline bool mlx5e_sq_has_room_for(struct mlx5e_sq *sq, u16 n)
@@ -541,6 +547,7 @@ struct mlx5e_channel {
 	/* data path */
 	struct mlx5e_rq            rq;
 	struct mlx5e_sq            sq[MLX5E_MAX_NUM_TC];
+	struct mlx5e_sq            icosq;   /* internal control operations */
 	struct napi_struct         napi;
 	struct device             *pdev;
 	struct net_device         *netdev;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 871f3af..0507e46 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -48,6 +48,7 @@ struct mlx5e_sq_param {
 	u32                        sqc[MLX5_ST_SZ_DW(sqc)];
 	struct mlx5_wq_param       wq;
 	u16                        max_inline;
+	bool                       icosq;
 };
 
 struct mlx5e_cq_param {
@@ -59,8 +60,10 @@ struct mlx5e_cq_param {
 struct mlx5e_channel_param {
 	struct mlx5e_rq_param      rq;
 	struct mlx5e_sq_param      sq;
+	struct mlx5e_sq_param      icosq;
 	struct mlx5e_cq_param      rx_cq;
 	struct mlx5e_cq_param      tx_cq;
+	struct mlx5e_cq_param      icosq_cq;
 };
 
 static void mlx5e_update_carrier(struct mlx5e_priv *priv)
@@ -502,6 +505,8 @@ static int mlx5e_open_rq(struct mlx5e_channel *c,
 			 struct mlx5e_rq_param *param,
 			 struct mlx5e_rq *rq)
 {
+	struct mlx5e_sq *sq = &c->icosq;
+	u16 pi = sq->pc & sq->wq.sz_m1;
 	int err;
 
 	err = mlx5e_create_rq(c, param, rq);
@@ -517,7 +522,10 @@ static int mlx5e_open_rq(struct mlx5e_channel *c,
 		goto err_disable_rq;
 
 	set_bit(MLX5E_RQ_STATE_POST_WQES_ENABLE, &rq->state);
-	mlx5e_send_nop(&c->sq[0], true); /* trigger mlx5e_post_rx_wqes() */
+
+	sq->ico_wqe_info[pi].opcode = MLX5_OPCODE_NOP;
+	sq->ico_wqe_info[pi].num_wqebbs = 1;
+	mlx5e_send_nop(sq, true); /* trigger mlx5e_post_rx_wqes() */
 
 	return 0;
 
@@ -583,7 +591,6 @@ static int mlx5e_create_sq(struct mlx5e_channel *c,
 
 	void *sqc = param->sqc;
 	void *sqc_wq = MLX5_ADDR_OF(sqc, sqc, wq);
-	int txq_ix;
 	int err;
 
 	err = mlx5_alloc_map_uar(mdev, &sq->uar, true);
@@ -611,8 +618,24 @@ static int mlx5e_create_sq(struct mlx5e_channel *c,
 	if (err)
 		goto err_sq_wq_destroy;
 
-	txq_ix = c->ix + tc * priv->params.num_channels;
-	sq->txq = netdev_get_tx_queue(priv->netdev, txq_ix);
+	if (param->icosq) {
+		u8 wq_sz = mlx5_wq_cyc_get_size(&sq->wq);
+
+		sq->ico_wqe_info = kzalloc_node(sizeof(*sq->ico_wqe_info) *
+						wq_sz,
+						GFP_KERNEL,
+						cpu_to_node(c->cpu));
+		if (!sq->ico_wqe_info) {
+			err = -ENOMEM;
+			goto err_free_sq_db;
+		}
+	} else {
+		int txq_ix;
+
+		txq_ix = c->ix + tc * priv->params.num_channels;
+		sq->txq = netdev_get_tx_queue(priv->netdev, txq_ix);
+		priv->txq_to_sq_map[txq_ix] = sq;
+	}
 
 	sq->pdev      = c->pdev;
 	sq->tstamp    = &priv->tstamp;
@@ -621,10 +644,12 @@ static int mlx5e_create_sq(struct mlx5e_channel *c,
 	sq->tc        = tc;
 	sq->edge      = (sq->wq.sz_m1 + 1) - MLX5_SEND_WQE_MAX_WQEBBS;
 	sq->bf_budget = MLX5E_SQ_BF_BUDGET;
-	priv->txq_to_sq_map[txq_ix] = sq;
 
 	return 0;
 
+err_free_sq_db:
+	mlx5e_free_sq_db(sq);
+
 err_sq_wq_destroy:
 	mlx5_wq_destroy(&sq->wq_ctrl);
 
@@ -639,6 +664,7 @@ static void mlx5e_destroy_sq(struct mlx5e_sq *sq)
 	struct mlx5e_channel *c = sq->channel;
 	struct mlx5e_priv *priv = c->priv;
 
+	kfree(sq->ico_wqe_info);
 	mlx5e_free_sq_db(sq);
 	mlx5_wq_destroy(&sq->wq_ctrl);
 	mlx5_unmap_free_uar(priv->mdev, &sq->uar);
@@ -667,10 +693,10 @@ static int mlx5e_enable_sq(struct mlx5e_sq *sq, struct mlx5e_sq_param *param)
 
 	memcpy(sqc, param->sqc, sizeof(param->sqc));
 
-	MLX5_SET(sqc,  sqc, tis_num_0,		priv->tisn[sq->tc]);
-	MLX5_SET(sqc,  sqc, cqn,		c->sq[sq->tc].cq.mcq.cqn);
+	MLX5_SET(sqc,  sqc, tis_num_0, param->icosq ? 0 : priv->tisn[sq->tc]);
+	MLX5_SET(sqc,  sqc, cqn,		sq->cq.mcq.cqn);
 	MLX5_SET(sqc,  sqc, state,		MLX5_SQC_STATE_RST);
-	MLX5_SET(sqc,  sqc, tis_lst_sz,		1);
+	MLX5_SET(sqc,  sqc, tis_lst_sz,		param->icosq ? 0 : 1);
 	MLX5_SET(sqc,  sqc, flush_in_error_en,	1);
 
 	MLX5_SET(wq,   wq, wq_type,       MLX5_WQ_TYPE_CYCLIC);
@@ -745,9 +771,11 @@ static int mlx5e_open_sq(struct mlx5e_channel *c,
 	if (err)
 		goto err_disable_sq;
 
-	set_bit(MLX5E_SQ_STATE_WAKE_TXQ_ENABLE, &sq->state);
-	netdev_tx_reset_queue(sq->txq);
-	netif_tx_start_queue(sq->txq);
+	if (sq->txq) {
+		set_bit(MLX5E_SQ_STATE_WAKE_TXQ_ENABLE, &sq->state);
+		netdev_tx_reset_queue(sq->txq);
+		netif_tx_start_queue(sq->txq);
+	}
 
 	return 0;
 
@@ -768,15 +796,19 @@ static inline void netif_tx_disable_queue(struct netdev_queue *txq)
 
 static void mlx5e_close_sq(struct mlx5e_sq *sq)
 {
-	clear_bit(MLX5E_SQ_STATE_WAKE_TXQ_ENABLE, &sq->state);
-	napi_synchronize(&sq->channel->napi); /* prevent netif_tx_wake_queue */
-	netif_tx_disable_queue(sq->txq);
+	if (sq->txq) {
+		clear_bit(MLX5E_SQ_STATE_WAKE_TXQ_ENABLE, &sq->state);
+		/* prevent netif_tx_wake_queue */
+		napi_synchronize(&sq->channel->napi);
+		netif_tx_disable_queue(sq->txq);
 
-	/* ensure hw is notified of all pending wqes */
-	if (mlx5e_sq_has_room_for(sq, 1))
-		mlx5e_send_nop(sq, true);
+		/* ensure hw is notified of all pending wqes */
+		if (mlx5e_sq_has_room_for(sq, 1))
+			mlx5e_send_nop(sq, true);
+
+		mlx5e_modify_sq(sq, MLX5_SQC_STATE_RDY, MLX5_SQC_STATE_ERR);
+	}
 
-	mlx5e_modify_sq(sq, MLX5_SQC_STATE_RDY, MLX5_SQC_STATE_ERR);
 	while (sq->cc != sq->pc) /* wait till sq is empty */
 		msleep(20);
 
@@ -1030,10 +1062,14 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 
 	netif_napi_add(netdev, &c->napi, mlx5e_napi_poll, 64);
 
-	err = mlx5e_open_tx_cqs(c, cparam);
+	err = mlx5e_open_cq(c, &cparam->icosq_cq, &c->icosq.cq, 0, 0);
 	if (err)
 		goto err_napi_del;
 
+	err = mlx5e_open_tx_cqs(c, cparam);
+	if (err)
+		goto err_close_icosq_cq;
+
 	err = mlx5e_open_cq(c, &cparam->rx_cq, &c->rq.cq,
 			    priv->params.rx_cq_moderation_usec,
 			    priv->params.rx_cq_moderation_pkts);
@@ -1042,10 +1078,14 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 
 	napi_enable(&c->napi);
 
-	err = mlx5e_open_sqs(c, cparam);
+	err = mlx5e_open_sq(c, 0, &cparam->icosq, &c->icosq);
 	if (err)
 		goto err_disable_napi;
 
+	err = mlx5e_open_sqs(c, cparam);
+	if (err)
+		goto err_close_icosq;
+
 	err = mlx5e_open_rq(c, &cparam->rq, &c->rq);
 	if (err)
 		goto err_close_sqs;
@@ -1058,6 +1098,9 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 err_close_sqs:
 	mlx5e_close_sqs(c);
 
+err_close_icosq:
+	mlx5e_close_sq(&c->icosq);
+
 err_disable_napi:
 	napi_disable(&c->napi);
 	mlx5e_close_cq(&c->rq.cq);
@@ -1065,6 +1108,9 @@ err_disable_napi:
 err_close_tx_cqs:
 	mlx5e_close_tx_cqs(c);
 
+err_close_icosq_cq:
+	mlx5e_close_cq(&c->icosq.cq);
+
 err_napi_del:
 	netif_napi_del(&c->napi);
 	napi_hash_del(&c->napi);
@@ -1077,9 +1123,11 @@ static void mlx5e_close_channel(struct mlx5e_channel *c)
 {
 	mlx5e_close_rq(&c->rq);
 	mlx5e_close_sqs(c);
+	mlx5e_close_sq(&c->icosq);
 	napi_disable(&c->napi);
 	mlx5e_close_cq(&c->rq.cq);
 	mlx5e_close_tx_cqs(c);
+	mlx5e_close_cq(&c->icosq.cq);
 	netif_napi_del(&c->napi);
 
 	napi_hash_del(&c->napi);
@@ -1125,17 +1173,27 @@ static void mlx5e_build_drop_rq_param(struct mlx5e_rq_param *param)
 	MLX5_SET(wq, wq, log_wq_stride,    ilog2(sizeof(struct mlx5e_rx_wqe)));
 }
 
-static void mlx5e_build_sq_param(struct mlx5e_priv *priv,
-				 struct mlx5e_sq_param *param)
+static void mlx5e_build_sq_param_common(struct mlx5e_priv *priv,
+					struct mlx5e_sq_param *param)
 {
 	void *sqc = param->sqc;
 	void *wq = MLX5_ADDR_OF(sqc, sqc, wq);
 
-	MLX5_SET(wq, wq, log_wq_sz,     priv->params.log_sq_size);
 	MLX5_SET(wq, wq, log_wq_stride, ilog2(MLX5_SEND_WQE_BB));
 	MLX5_SET(wq, wq, pd,            priv->pdn);
 
 	param->wq.buf_numa_node = dev_to_node(&priv->mdev->pdev->dev);
+}
+
+static void mlx5e_build_sq_param(struct mlx5e_priv *priv,
+				 struct mlx5e_sq_param *param)
+{
+	void *sqc = param->sqc;
+	void *wq = MLX5_ADDR_OF(sqc, sqc, wq);
+
+	mlx5e_build_sq_param_common(priv, param);
+	MLX5_SET(wq, wq, log_wq_sz,     priv->params.log_sq_size);
+
 	param->max_inline = priv->params.tx_max_inline;
 }
 
@@ -1172,20 +1230,49 @@ static void mlx5e_build_tx_cq_param(struct mlx5e_priv *priv,
 {
 	void *cqc = param->cqc;
 
-	MLX5_SET(cqc, cqc, log_cq_size,  priv->params.log_sq_size);
+	MLX5_SET(cqc, cqc, log_cq_size, priv->params.log_sq_size);
 
 	mlx5e_build_common_cq_param(priv, param);
 }
 
+static void mlx5e_build_ico_cq_param(struct mlx5e_priv *priv,
+				     struct mlx5e_cq_param *param,
+				     u8 log_wq_size)
+{
+	void *cqc = param->cqc;
+
+	MLX5_SET(cqc, cqc, log_cq_size, log_wq_size);
+
+	mlx5e_build_common_cq_param(priv, param);
+}
+
+static void mlx5e_build_icosq_param(struct mlx5e_priv *priv,
+				    struct mlx5e_sq_param *param,
+				    u8 log_wq_size)
+{
+	void *sqc = param->sqc;
+	void *wq = MLX5_ADDR_OF(sqc, sqc, wq);
+
+	mlx5e_build_sq_param_common(priv, param);
+
+	MLX5_SET(wq, wq, log_wq_sz, log_wq_size);
+
+	param->icosq = true;
+}
+
 static void mlx5e_build_channel_param(struct mlx5e_priv *priv,
 				      struct mlx5e_channel_param *cparam)
 {
+	u8 icosq_log_wq_sz = 0;
+
 	memset(cparam, 0, sizeof(*cparam));
 
 	mlx5e_build_rq_param(priv, &cparam->rq);
 	mlx5e_build_sq_param(priv, &cparam->sq);
+	mlx5e_build_icosq_param(priv, &cparam->icosq, icosq_log_wq_sz);
 	mlx5e_build_rx_cq_param(priv, &cparam->rx_cq);
 	mlx5e_build_tx_cq_param(priv, &cparam->tx_cq);
+	mlx5e_build_ico_cq_param(priv, &cparam->icosq_cq, icosq_log_wq_sz);
 }
 
 static int mlx5e_open_channels(struct mlx5e_priv *priv)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 94a14f8..7c94a9b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -54,6 +54,7 @@ void mlx5e_send_nop(struct mlx5e_sq *sq, bool notify_hw)
 
 	sq->skb[pi] = NULL;
 	sq->pc++;
+	sq->stats.nop++;
 
 	if (notify_hw) {
 		cseg->fm_ce_se = MLX5_WQE_CTRL_CQ_UPDATE;
@@ -387,7 +388,6 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq)
 			wi = &sq->wqe_info[ci];
 
 			if (unlikely(!skb)) { /* nop */
-				sq->stats.nop++;
 				sqcc++;
 				continue;
 			}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
index 66d51a7..500dcd4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
@@ -49,6 +49,57 @@ struct mlx5_cqe64 *mlx5e_get_cqe(struct mlx5e_cq *cq)
 	return cqe;
 }
 
+static void mlx5e_poll_ico_cq(struct mlx5e_cq *cq)
+{
+	struct mlx5_wq_cyc *wq;
+	struct mlx5_cqe64 *cqe;
+	struct mlx5e_sq *sq;
+	u16 sqcc;
+
+	cqe = mlx5e_get_cqe(cq);
+	if (likely(!cqe))
+		return;
+
+	sq = container_of(cq, struct mlx5e_sq, cq);
+	wq = &sq->wq;
+
+	/* sq->cc must be updated only after mlx5_cqwq_update_db_record(),
+	 * otherwise a cq overrun may occur
+	 */
+	sqcc = sq->cc;
+
+	do {
+		u16 ci = be16_to_cpu(cqe->wqe_counter) & wq->sz_m1;
+		struct mlx5e_ico_wqe_info *icowi = &sq->ico_wqe_info[ci];
+
+		mlx5_cqwq_pop(&cq->wq);
+		sqcc += icowi->num_wqebbs;
+
+		if (unlikely((cqe->op_own >> 4) != MLX5_CQE_REQ)) {
+			WARN_ONCE(true, "mlx5e: Bad OP in ICOSQ CQE: 0x%x\n",
+				  cqe->op_own);
+			break;
+		}
+
+		switch (icowi->opcode) {
+		case MLX5_OPCODE_NOP:
+			break;
+		default:
+			WARN_ONCE(true,
+				  "mlx5e: Bad OPCODE in ICOSQ WQE info: 0x%x\n",
+				  icowi->opcode);
+		}
+
+	} while ((cqe = mlx5e_get_cqe(cq)));
+
+	mlx5_cqwq_update_db_record(&cq->wq);
+
+	/* ensure cq space is freed before enabling more cqes */
+	wmb();
+
+	sq->cc = sqcc;
+}
+
 int mlx5e_napi_poll(struct napi_struct *napi, int budget)
 {
 	struct mlx5e_channel *c = container_of(napi, struct mlx5e_channel,
@@ -64,6 +115,9 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
 
 	work_done = mlx5e_poll_rx_cq(&c->rq.cq, budget);
 	busy |= work_done == budget;
+
+	mlx5e_poll_ico_cq(&c->icosq.cq);
+
 	busy |= mlx5e_post_rx_wqes(&c->rq);
 
 	if (busy)
@@ -80,6 +134,7 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
 	for (i = 0; i < c->num_tc; i++)
 		mlx5e_cq_arm(&c->sq[i].cq);
 	mlx5e_cq_arm(&c->rq.cq);
+	mlx5e_cq_arm(&c->icosq.cq);
 
 	return work_done;
 }
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH net-next 08/13] net/mlx5e: Add fragmented memory support for RX multi packet WQE
  2016-03-11 13:39 [PATCH net-next 00/13] Mellanox 100G mlx5 driver receive path optimizations Saeed Mahameed
                   ` (6 preceding siblings ...)
  2016-03-11 13:39 ` [PATCH net-next 07/13] net/mlx5e: Added ICO SQs Saeed Mahameed
@ 2016-03-11 13:39 ` Saeed Mahameed
  2016-03-11 14:32   ` Eric Dumazet
  2016-03-11 13:39 ` [PATCH net-next 09/13] net/mlx5e: Change RX moderation period to be based on CQE Saeed Mahameed
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 26+ messages in thread
From: Saeed Mahameed @ 2016-03-11 13:39 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Or Gerlitz, Eran Ben Elisha, Tal Alon, Tariq Toukan,
	Jesper Dangaard Brouer, Saeed Mahameed

From: Tariq Toukan <tariqt@mellanox.com>

If the allocation of a linear (physically contiguous) MPWQE fails,
we allocate a fragmented MPWQE.

This is implemented via the device's UMR (User Memory Registration)
feature, which allows the driver to register multiple memory fragments
with the ConnectX hardware as a single contiguous buffer.
UMR registration is an asynchronous operation and is done via
ICO SQs.
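
The fallback control flow, in a compact stand-alone sketch (the
helpers are hypothetical stand-ins for mlx5e_alloc_rx_linear_mpwqe()
and friends; error paths are simplified):

#include <errno.h>
#include <stdint.h>

struct rq;		/* stand-ins for mlx5e_rq / mlx5e_rx_wqe */
struct rx_wqe;

int alloc_linear(struct rq *rq, struct rx_wqe *wqe, uint16_t ix);
int alloc_fragmented(struct rq *rq, struct rx_wqe *wqe, uint16_t ix);
void post_umr_wqe(struct rq *rq, uint16_t ix);

/* Try one physically contiguous buffer first; on failure, fall back
 * to individual pages stitched together through a UMR mkey.
 */
int alloc_rx_mpwqe(struct rq *rq, struct rx_wqe *wqe, uint16_t ix)
{
	if (!alloc_linear(rq, wqe, ix))
		return 0;	/* fast path: one high-order allocation */

	if (alloc_fragmented(rq, wqe, ix))
		return -ENOMEM;

	/* UMR registration is asynchronous: post the UMR WQE on the
	 * ICO SQ and stop posting RX WQEs until it completes.
	 */
	post_umr_wqe(rq, ix);
	return -EBUSY;
}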

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |   75 ++++-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |   63 ++++-
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |  356 +++++++++++++++++++--
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c   |    4 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c |    3 +
 include/linux/mlx5/mlx5_ifc.h                     |   10 +-
 6 files changed, 460 insertions(+), 51 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 8e011af..930d52a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -69,6 +69,10 @@
 						 MLX5_MPWRQ_LOG_STRIDE_SIZE)
 #define MLX5_MPWRQ_WQE_PAGE_ORDER  (max_t(int, 0, \
 					  MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT))
+#define MLX5_MPWRQ_WQE_NUM_PAGES		BIT(MLX5_MPWRQ_WQE_PAGE_ORDER)
+#define MLX5_CHANNEL_MAX_NUM_PAGES (MLX5_MPWRQ_WQE_NUM_PAGES * \
+				    BIT(MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW))
+#define MLX5_UMR_ALIGN				(2048)
 #define MLX5_MPWRQ_SMALL_PACKET_THRESHOLD	(128)
 
 #define MLX5E_PARAMS_DEFAULT_LRO_WQE_SZ                 (64 * 1024)
@@ -131,6 +135,13 @@ struct mlx5e_rx_wqe {
 	struct mlx5_wqe_data_seg      data;
 };
 
+struct mlx5e_umr_wqe {
+	struct mlx5_wqe_ctrl_seg       ctrl;
+	struct mlx5_wqe_umr_ctrl_seg   uctrl;
+	struct mlx5_mkey_seg           mkc;
+	struct mlx5_wqe_data_seg       data;
+};
+
 #ifdef CONFIG_MLX5_CORE_EN_DCB
 #define MLX5E_MAX_BW_ALLOC 100 /* Max percentage of BW allocation */
 #define MLX5E_MIN_BW_ALLOC 1   /* Min percentage of BW allocation */
@@ -176,6 +187,7 @@ static const char vport_strings[][ETH_GSTRING_LEN] = {
 	"tx_queue_dropped",
 	"rx_wqe_err",
 	"rx_mpwqe_filler",
+	"rx_mpwqe_frag",
 };
 
 struct mlx5e_vport_stats {
@@ -218,8 +230,9 @@ struct mlx5e_vport_stats {
 	u64 tx_queue_dropped;
 	u64 rx_wqe_err;
 	u64 rx_mpwqe_filler;
+	u64 rx_mpwqe_frag;
 
-#define NUM_VPORT_COUNTERS     36
+#define NUM_VPORT_COUNTERS     37
 };
 
 static const char pport_strings[][ETH_GSTRING_LEN] = {
@@ -314,6 +327,7 @@ static const char rq_stats_strings[][ETH_GSTRING_LEN] = {
 	"lro_bytes",
 	"wqe_err",
 	"mpwqe_filler",
+	"mpwqe_frag",
 };
 
 struct mlx5e_rq_stats {
@@ -325,7 +339,8 @@ struct mlx5e_rq_stats {
 	u64 lro_bytes;
 	u64 wqe_err;
 	u64 mpwqe_filler;
-#define NUM_RQ_STATS 7
+	u64 mpwqe_frag;
+#define NUM_RQ_STATS 8
 };
 
 static const char sq_stats_strings[][ETH_GSTRING_LEN] = {
@@ -404,6 +419,7 @@ struct mlx5e_tstamp {
 
 enum {
 	MLX5E_RQ_STATE_POST_WQES_ENABLE,
+	MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS,
 };
 
 struct mlx5e_cq {
@@ -431,17 +447,14 @@ struct mlx5e_dma_info {
 	dma_addr_t	addr;
 };
 
-struct mlx5e_mpw_info {
-	struct mlx5e_dma_info dma_info;
-	u16		     consumed_strides;
-};
-
 struct mlx5e_rq {
 	/* data path */
 	struct mlx5_wq_ll      wq;
 	u32                    wqe_sz;
 	struct sk_buff       **skb;
 	struct mlx5e_mpw_info *wqe_info;
+	__be32                 mkey_be;
+	__be32                 umr_mkey_be;
 
 	struct device         *pdev;
 	struct net_device     *netdev;
@@ -462,6 +475,27 @@ struct mlx5e_rq {
 	struct mlx5e_priv     *priv;
 } ____cacheline_aligned_in_smp;
 
+struct mlx5e_umr_dma_info {
+	__be64                *mtt;
+	dma_addr_t             mtt_addr;
+	struct mlx5e_dma_info *dma_info;
+};
+
+struct mlx5e_mpw_info {
+	union {
+		struct mlx5e_dma_info     dma_info;
+		struct mlx5e_umr_dma_info umr;
+	};
+	u16 consumed_strides;
+
+	void (*complete_wqe)(struct mlx5e_rq *rq,
+			     struct mlx5_cqe64 *cqe,
+			     u16 byte_cnt,
+			     struct mlx5e_mpw_info *wi,
+			     struct sk_buff *skb);
+	void (*free_wqe)(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi);
+};
+
 struct mlx5e_tx_wqe_info {
 	u32 num_bytes;
 	u8  num_wqebbs;
@@ -654,6 +688,7 @@ struct mlx5e_priv {
 	u32                        pdn;
 	u32                        tdn;
 	struct mlx5_core_mkey      mkey;
+	struct mlx5_core_mkey      umr_mkey;
 	struct mlx5e_rq            drop_rq;
 
 	struct mlx5e_channel     **channel;
@@ -726,6 +761,21 @@ void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
 bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq);
 int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
 int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
+void mlx5e_post_rx_fragmented_mpwqe(struct mlx5e_rq *rq);
+void mlx5e_complete_rx_linear_mpwqe(struct mlx5e_rq *rq,
+				    struct mlx5_cqe64 *cqe,
+				    u16 byte_cnt,
+				    struct mlx5e_mpw_info *wi,
+				    struct sk_buff *skb);
+void mlx5e_complete_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
+					struct mlx5_cqe64 *cqe,
+					u16 byte_cnt,
+					struct mlx5e_mpw_info *wi,
+					struct sk_buff *skb);
+void mlx5e_free_rx_linear_mpwqe(struct mlx5e_rq *rq,
+				struct mlx5e_mpw_info *wi);
+void mlx5e_free_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
+				    struct mlx5e_mpw_info *wi);
 struct mlx5_cqe64 *mlx5e_get_cqe(struct mlx5e_cq *cq);
 
 void mlx5e_update_stats(struct mlx5e_priv *priv);
@@ -759,7 +809,7 @@ void mlx5e_build_default_indir_rqt(struct mlx5_core_dev *mdev,
 				   int num_channels);
 
 static inline void mlx5e_tx_notify_hw(struct mlx5e_sq *sq,
-				      struct mlx5e_tx_wqe *wqe, int bf_sz)
+				      struct mlx5_wqe_ctrl_seg *ctrl, int bf_sz)
 {
 	u16 ofst = MLX5_BF_OFFSET + sq->bf_offset;
 
@@ -773,9 +823,9 @@ static inline void mlx5e_tx_notify_hw(struct mlx5e_sq *sq,
 	 */
 	wmb();
 	if (bf_sz)
-		__iowrite64_copy(sq->uar_map + ofst, &wqe->ctrl, bf_sz);
+		__iowrite64_copy(sq->uar_map + ofst, ctrl, bf_sz);
 	else
-		mlx5_write64((__be32 *)&wqe->ctrl, sq->uar_map + ofst, NULL);
+		mlx5_write64((__be32 *)ctrl, sq->uar_map + ofst, NULL);
 	/* flush the write-combining mapped buffer */
 	wmb();
 
@@ -796,6 +846,11 @@ static inline int mlx5e_get_max_num_channels(struct mlx5_core_dev *mdev)
 		     MLX5E_MAX_NUM_CHANNELS);
 }
 
+static inline int mlx5e_get_mtt_octw(int npages)
+{
+	return ALIGN(npages, 8) / 2;
+}
+
 extern const struct ethtool_ops mlx5e_ethtool_ops;
 #ifdef CONFIG_MLX5_CORE_EN_DCB
 extern const struct dcbnl_rtnl_ops mlx5e_dcbnl_ops;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 0507e46..aa1bd54 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -179,6 +179,7 @@ void mlx5e_update_stats(struct mlx5e_priv *priv)
 	s->rx_csum_sw		= 0;
 	s->rx_wqe_err		= 0;
 	s->rx_mpwqe_filler	= 0;
+	s->rx_mpwqe_frag	= 0;
 	for (i = 0; i < priv->params.num_channels; i++) {
 		rq_stats = &priv->channel[i]->rq.stats;
 
@@ -190,6 +191,7 @@ void mlx5e_update_stats(struct mlx5e_priv *priv)
 		s->rx_csum_sw	+= rq_stats->csum_sw;
 		s->rx_wqe_err   += rq_stats->wqe_err;
 		s->rx_mpwqe_filler += rq_stats->mpwqe_filler;
+		s->rx_mpwqe_frag   += rq_stats->mpwqe_frag;
 
 		for (j = 0; j < priv->params.num_tc; j++) {
 			sq_stats = &priv->channel[i]->sq[j].stats;
@@ -379,7 +381,6 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 	for (i = 0; i < wq_sz; i++) {
 		struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(&rq->wq, i);
 
-		wqe->data.lkey       = c->mkey_be;
 		wqe->data.byte_count = cpu_to_be32(byte_count);
 	}
 
@@ -390,6 +391,8 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 	rq->channel = c;
 	rq->ix      = c->ix;
 	rq->priv    = c->priv;
+	rq->mkey_be = c->mkey_be;
+	rq->umr_mkey_be = cpu_to_be32(c->priv->umr_mkey.key);
 
 	return 0;
 
@@ -1256,6 +1259,7 @@ static void mlx5e_build_icosq_param(struct mlx5e_priv *priv,
 	mlx5e_build_sq_param_common(priv, param);
 
 	MLX5_SET(wq, wq, log_wq_sz, log_wq_size);
+	MLX5_SET(sqc, sqc, reg_umr, MLX5_CAP_ETH(priv->mdev, reg_umr_sq));
 
 	param->icosq = true;
 }
@@ -1263,7 +1267,7 @@ static void mlx5e_build_icosq_param(struct mlx5e_priv *priv,
 static void mlx5e_build_channel_param(struct mlx5e_priv *priv,
 				      struct mlx5e_channel_param *cparam)
 {
-	u8 icosq_log_wq_sz = 0;
+	u8 icosq_log_wq_sz = MLX5E_PARAMS_MINIMUM_LOG_SQ_SIZE;
 
 	memset(cparam, 0, sizeof(*cparam));
 
@@ -2458,6 +2462,13 @@ void mlx5e_build_default_indir_rqt(struct mlx5_core_dev *mdev,
 		indirection_rqt[i] = i % num_channels;
 }
 
+static bool mlx5e_check_fragmented_striding_rq_cap(struct mlx5_core_dev *mdev)
+{
+	return MLX5_CAP_GEN(mdev, striding_rq) &&
+		MLX5_CAP_GEN(mdev, umr_ptr_rlky) &&
+		MLX5_CAP_ETH(mdev, reg_umr_sq);
+}
+
 static void mlx5e_build_netdev_priv(struct mlx5_core_dev *mdev,
 				    struct net_device *netdev,
 				    int num_channels)
@@ -2466,7 +2477,7 @@ static void mlx5e_build_netdev_priv(struct mlx5_core_dev *mdev,
 
 	priv->params.log_sq_size           =
 		MLX5E_PARAMS_DEFAULT_LOG_SQ_SIZE;
-	priv->params.rq_wq_type = MLX5_CAP_GEN(mdev, striding_rq) ?
+	priv->params.rq_wq_type = mlx5e_check_fragmented_striding_rq_cap(mdev) ?
 		MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ :
 		MLX5_WQ_TYPE_LINKED_LIST;
 
@@ -2639,6 +2650,40 @@ static void mlx5e_destroy_q_counter(struct mlx5e_priv *priv)
 	mlx5_core_dealloc_q_counter(priv->mdev, priv->q_counter);
 }
 
+static int mlx5e_create_umr_mkey(struct mlx5e_priv *priv)
+{
+	struct mlx5_core_dev *mdev = priv->mdev;
+	struct mlx5_create_mkey_mbox_in *in;
+	struct mlx5_mkey_seg *mkc;
+	int inlen = sizeof(*in);
+	int npages = MLX5E_MAX_NUM_CHANNELS * MLX5_CHANNEL_MAX_NUM_PAGES;
+	int err;
+
+	in = mlx5_vzalloc(inlen);
+	if (!in)
+		return -ENOMEM;
+
+	mkc = &in->seg;
+	mkc->status = MLX5_MKEY_STATUS_FREE;
+	mkc->flags = MLX5_PERM_UMR_EN |
+		     MLX5_PERM_LOCAL_READ |
+		     MLX5_PERM_LOCAL_WRITE |
+		     MLX5_ACCESS_MODE_MTT;
+
+	mkc->qpn_mkey7_0 = cpu_to_be32(0xffffff << 8);
+	mkc->flags_pd = cpu_to_be32(priv->pdn);
+	mkc->len = cpu_to_be64(npages << PAGE_SHIFT);
+	mkc->xlt_oct_size = cpu_to_be32(mlx5e_get_mtt_octw(npages));
+	mkc->log2_page_size = PAGE_SHIFT;
+
+	err = mlx5_core_create_mkey(mdev, &priv->umr_mkey, in, inlen, NULL,
+				    NULL, NULL);
+
+	kvfree(in);
+
+	return err;
+}
+
 static void *mlx5e_create_netdev(struct mlx5_core_dev *mdev)
 {
 	struct net_device *netdev;
@@ -2688,10 +2733,16 @@ static void *mlx5e_create_netdev(struct mlx5_core_dev *mdev)
 		goto err_dealloc_transport_domain;
 	}
 
+	err = mlx5e_create_umr_mkey(priv);
+	if (err) {
+		mlx5_core_err(mdev, "create umr mkey failed, %d\n", err);
+		goto err_destroy_mkey;
+	}
+
 	err = mlx5e_create_tises(priv);
 	if (err) {
 		mlx5_core_warn(mdev, "create tises failed, %d\n", err);
-		goto err_destroy_mkey;
+		goto err_destroy_umr_mkey;
 	}
 
 	err = mlx5e_open_drop_rq(priv);
@@ -2774,6 +2825,9 @@ err_close_drop_rq:
 err_destroy_tises:
 	mlx5e_destroy_tises(priv);
 
+err_destroy_umr_mkey:
+	mlx5_core_destroy_mkey(mdev, &priv->umr_mkey);
+
 err_destroy_mkey:
 	mlx5_core_destroy_mkey(mdev, &priv->mkey);
 
@@ -2812,6 +2866,7 @@ static void mlx5e_destroy_netdev(struct mlx5_core_dev *mdev, void *vpriv)
 	mlx5e_destroy_rqt(priv, MLX5E_INDIRECTION_RQT);
 	mlx5e_close_drop_rq(priv);
 	mlx5e_destroy_tises(priv);
+	mlx5_core_destroy_mkey(priv->mdev, &priv->umr_mkey);
 	mlx5_core_destroy_mkey(priv->mdev, &priv->mkey);
 	mlx5_core_dealloc_transport_domain(priv->mdev, priv->tdn);
 	mlx5_core_dealloc_pd(priv->mdev, priv->pdn);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 18105c1..dd3a6e1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -65,6 +65,7 @@ int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
 
 	*((dma_addr_t *)skb->cb) = dma_addr;
 	wqe->data.addr = cpu_to_be64(dma_addr + MLX5E_NET_IP_ALIGN);
+	wqe->data.lkey = rq->mkey_be;
 
 	rq->skb[ix] = skb;
 
@@ -76,7 +77,185 @@ err_free_skb:
 	return -ENOMEM;
 }
 
-int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
+static void mlx5e_build_umr_wqe(struct mlx5e_rq *rq,
+				struct mlx5e_sq *sq,
+				struct mlx5e_umr_wqe *wqe,
+				u16 ix)
+{
+	struct mlx5_wqe_ctrl_seg      *cseg = &wqe->ctrl;
+	struct mlx5_wqe_umr_ctrl_seg *ucseg = &wqe->uctrl;
+	struct mlx5_wqe_data_seg      *dseg = &wqe->data;
+	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
+	u8 ds_cnt = DIV_ROUND_UP(sizeof(*wqe), MLX5_SEND_WQE_DS);
+	u16 umr_wqe_mtt_offset = rq->ix * MLX5_CHANNEL_MAX_NUM_PAGES +
+					ix * MLX5_MPWRQ_WQE_NUM_PAGES;
+
+	memset(wqe, 0, sizeof(*wqe));
+	cseg->opmod_idx_opcode =
+		cpu_to_be32((sq->pc << MLX5_WQE_CTRL_WQE_INDEX_SHIFT) |
+			    MLX5_OPCODE_UMR);
+	cseg->qpn_ds    = cpu_to_be32((sq->sqn << MLX5_WQE_CTRL_QPN_SHIFT) |
+				      ds_cnt);
+	cseg->fm_ce_se  = MLX5_WQE_CTRL_CQ_UPDATE;
+	cseg->imm       = rq->umr_mkey_be;
+
+	ucseg->flags = MLX5_UMR_TRANSLATION_OFFSET_EN;
+	ucseg->klm_octowords =
+		cpu_to_be16(mlx5e_get_mtt_octw(MLX5_MPWRQ_WQE_NUM_PAGES));
+	ucseg->bsf_octowords =
+		cpu_to_be16(mlx5e_get_mtt_octw(umr_wqe_mtt_offset));
+	ucseg->mkey_mask     = cpu_to_be64(MLX5_MKEY_MASK_FREE);
+
+	dseg->lkey = sq->mkey_be;
+	dseg->addr = cpu_to_be64(wi->umr.mtt_addr);
+}
+
+static void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
+{
+	struct mlx5e_sq *sq = &rq->channel->icosq;
+	struct mlx5_wq_cyc *wq = &sq->wq;
+	struct mlx5e_umr_wqe *wqe;
+	u8 num_wqebbs = DIV_ROUND_UP(sizeof(*wqe), MLX5_SEND_WQE_BB);
+	u16 pi;
+
+	/* fill sq edge with nops to avoid wqe wrap around */
+	while ((pi = (sq->pc & wq->sz_m1)) > sq->edge) {
+		sq->ico_wqe_info[pi].opcode = MLX5_OPCODE_NOP;
+		sq->ico_wqe_info[pi].num_wqebbs = 1;
+		mlx5e_send_nop(sq, true);
+	}
+
+	wqe = mlx5_wq_cyc_get_wqe(wq, pi);
+	mlx5e_build_umr_wqe(rq, sq, wqe, ix);
+	sq->ico_wqe_info[pi].opcode = MLX5_OPCODE_UMR;
+	sq->ico_wqe_info[pi].num_wqebbs = num_wqebbs;
+	sq->pc += num_wqebbs;
+	mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
+}
+
+static inline int mlx5e_get_wqe_mtt_sz(void)
+{
+	/* UMR copies MTTs in units of MLX5_UMR_MTT_ALIGNMENT bytes. */
+	return ALIGN(MLX5_MPWRQ_WQE_NUM_PAGES * sizeof(__be64),
+		     MLX5_UMR_MTT_ALIGNMENT);
+}
+
+static int mlx5e_alloc_and_map_page(struct mlx5e_rq *rq,
+				    struct mlx5e_mpw_info *wi,
+				    int i)
+{
+	struct page *page;
+
+	page = alloc_page(GFP_ATOMIC | __GFP_COMP | __GFP_COLD);
+	if (!page)
+		return -ENOMEM;
+
+	wi->umr.dma_info[i].page = page;
+	wi->umr.dma_info[i].addr = dma_map_page(rq->pdev, page, 0, PAGE_SIZE,
+						PCI_DMA_FROMDEVICE);
+	if (dma_mapping_error(rq->pdev, wi->umr.dma_info[i].addr)) {
+		put_page(page);
+		return -ENOMEM;
+	}
+	wi->umr.mtt[i] = cpu_to_be64(wi->umr.dma_info[i].addr | MLX5_EN_WR);
+
+	return 0;
+}
+
+static int mlx5e_alloc_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
+					   struct mlx5e_rx_wqe *wqe,
+					   u16 ix)
+{
+	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
+	int mtt_sz = mlx5e_get_wqe_mtt_sz();
+	u32 dma_offset = rq->ix * MLX5_CHANNEL_MAX_NUM_PAGES * PAGE_SIZE +
+		ix * rq->wqe_sz;
+	int i;
+
+	wi->umr.dma_info = kmalloc(sizeof(*wi->umr.dma_info) *
+				   MLX5_MPWRQ_WQE_NUM_PAGES,
+				   GFP_ATOMIC | __GFP_COMP | __GFP_COLD);
+	if (!wi->umr.dma_info)
+		goto err_out;
+
+	 /* To avoid copying garbage after the mtt array, we allocate
+	  * a little more.
+	  */
+	wi->umr.mtt = kzalloc(mtt_sz + MLX5_UMR_ALIGN - 1,
+			  GFP_ATOMIC | __GFP_COMP | __GFP_COLD);
+	if (!wi->umr.mtt)
+		goto err_free_umr;
+
+	wi->umr.mtt = PTR_ALIGN(wi->umr.mtt, MLX5_UMR_ALIGN);
+	wi->umr.mtt_addr = dma_map_single(rq->pdev, wi->umr.mtt, mtt_sz,
+				      PCI_DMA_TODEVICE);
+	if (dma_mapping_error(rq->pdev, wi->umr.mtt_addr))
+		goto err_free_mtt;
+
+	for (i = 0; i < MLX5_MPWRQ_WQE_NUM_PAGES; i++)
+		if (mlx5e_alloc_and_map_page(rq, wi, i))
+			goto err_unmap;
+
+	wi->consumed_strides = 0;
+	wi->complete_wqe = mlx5e_complete_rx_fragmented_mpwqe;
+	wi->free_wqe     = mlx5e_free_rx_fragmented_mpwqe;
+	wqe->data.lkey = rq->umr_mkey_be;
+	wqe->data.addr = cpu_to_be64(dma_offset);
+
+	return 0;
+
+err_unmap:
+	while (--i >= 0) {
+		dma_unmap_page(rq->pdev, wi->umr.dma_info[i].addr, PAGE_SIZE,
+			       PCI_DMA_FROMDEVICE);
+		put_page(wi->umr.dma_info[i].page);
+	}
+	dma_unmap_single(rq->pdev, wi->umr.mtt_addr, mtt_sz, PCI_DMA_TODEVICE);
+
+err_free_mtt:
+	kfree(wi->umr.mtt);
+
+err_free_umr:
+	kfree(wi->umr.dma_info);
+
+err_out:
+	return -ENOMEM;
+}
+
+void mlx5e_free_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
+				    struct mlx5e_mpw_info *wi)
+{
+	int mtt_sz = mlx5e_get_wqe_mtt_sz();
+	int i;
+
+	for (i = 0; i < MLX5_MPWRQ_WQE_NUM_PAGES; i++) {
+		dma_unmap_page(rq->pdev, wi->umr.dma_info[i].addr, PAGE_SIZE,
+			       PCI_DMA_FROMDEVICE);
+		put_page(wi->umr.dma_info[i].page);
+	}
+	dma_unmap_single(rq->pdev, wi->umr.mtt_addr, mtt_sz, PCI_DMA_TODEVICE);
+	kfree(wi->umr.mtt);
+	kfree(wi->umr.dma_info);
+}
+
+void mlx5e_post_rx_fragmented_mpwqe(struct mlx5e_rq *rq)
+{
+	struct mlx5_wq_ll *wq = &rq->wq;
+	struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(wq, wq->head);
+
+	clear_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state);
+	mlx5_wq_ll_push(wq, be16_to_cpu(wqe->next.next_wqe_index));
+	rq->stats.mpwqe_frag++;
+
+	/* ensure wqes are visible to device before updating doorbell record */
+	dma_wmb();
+
+	mlx5_wq_ll_update_db_record(wq);
+}
+
+static int mlx5e_alloc_rx_linear_mpwqe(struct mlx5e_rq *rq,
+				       struct mlx5e_rx_wqe *wqe,
+				       u16 ix)
 {
 	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
 	int ret = 0;
@@ -94,6 +273,9 @@ int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
 	}
 
 	wi->consumed_strides = 0;
+	wi->complete_wqe = mlx5e_complete_rx_linear_mpwqe;
+	wi->free_wqe     = mlx5e_free_rx_linear_mpwqe;
+	wqe->data.lkey = rq->mkey_be;
 	wqe->data.addr = cpu_to_be64(wi->dma_info.addr);
 
 	return 0;
@@ -103,11 +285,40 @@ err_put_page:
 	return ret;
 }
 
+void mlx5e_free_rx_linear_mpwqe(struct mlx5e_rq *rq,
+				struct mlx5e_mpw_info *wi)
+{
+	dma_unmap_page(rq->pdev, wi->dma_info.addr, rq->wqe_sz,
+		       PCI_DMA_FROMDEVICE);
+	put_page(wi->dma_info.page);
+}
+
+int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
+{
+	int err;
+
+	err = mlx5e_alloc_rx_linear_mpwqe(rq, wqe, ix);
+	if (unlikely(err)) {
+		err = mlx5e_alloc_rx_fragmented_mpwqe(rq, wqe, ix);
+		if (err)
+			return err;
+		set_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state);
+		mlx5e_post_umr_wqe(rq, ix);
+		return -EBUSY;
+	}
+
+	return 0;
+}
+
+#define RQ_CANNOT_POST(rq) \
+		(!test_bit(MLX5E_RQ_STATE_POST_WQES_ENABLE, &rq->state) || \
+		 test_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state))
+
 bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq)
 {
 	struct mlx5_wq_ll *wq = &rq->wq;
 
-	if (unlikely(!test_bit(MLX5E_RQ_STATE_POST_WQES_ENABLE, &rq->state)))
+	if (unlikely(RQ_CANNOT_POST(rq)))
 		return false;
 
 	while (!mlx5_wq_ll_is_full(wq)) {
@@ -301,19 +512,125 @@ wq_ll_pop:
 		       &wqe->next.next_wqe_index);
 }
 
-void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
+static void mlx5e_add_skb_frag(struct sk_buff *skb, int len, struct page *page,
+			       int page_offset)
+{
+	int f = skb_shinfo(skb)->nr_frags++;
+	skb_frag_t *fr = &skb_shinfo(skb)->frags[f];
+
+	skb->len += len;
+	skb->data_len += len;
+	get_page(page);
+	skb_frag_set_page(skb, f, page);
+	skb_frag_size_set(fr, len);
+	fr->page_offset = page_offset;
+	skb->truesize  = SKB_TRUESIZE(skb->len);
+}
+
+#define MLX5_MPWRQ_MULTI_STRIDE_PACKET_THRESHOLD	\
+			(MLX5_MPWRQ_SMALL_PACKET_THRESHOLD > \
+			 BIT(MLX5_MPWRQ_LOG_STRIDE_SIZE))
+
+void mlx5e_complete_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
+					struct mlx5_cqe64 *cqe,
+					u16 byte_cnt,
+					struct mlx5e_mpw_info *wi,
+					struct sk_buff *skb)
+{
+	u16 cstrides       = mpwrq_get_cqe_consumed_strides(cqe);
+	u16 stride_ix      = mpwrq_get_cqe_stride_index(cqe);
+	u32 consumed_bytes = cstrides  * MLX5_MPWRQ_STRIDE_SIZE;
+	u32 wqe_offset     = stride_ix * MLX5_MPWRQ_STRIDE_SIZE;
+	u32 page_offset    = wqe_offset & (PAGE_SIZE - 1);
+	u32 page_idx       = wqe_offset >> PAGE_SHIFT;
+	u32 pg_consumed_bytes = min_t(u32, PAGE_SIZE - page_offset,
+				      consumed_bytes);
+	struct mlx5e_dma_info *dma_info;
+	u16 headlen;
+
+	dma_info = &wi->umr.dma_info[page_idx];
+	dma_sync_single_for_cpu(rq->pdev, dma_info->addr + page_offset,
+				pg_consumed_bytes, DMA_FROM_DEVICE);
+
+	headlen = min_t(u16, MLX5_MPWRQ_SMALL_PACKET_THRESHOLD, byte_cnt);
+#if (MLX5_MPWRQ_MULTI_STRIDE_PACKET_THRESHOLD)
+	if (headlen >= pg_consumed_bytes) {
+		u16 headlen_rem = headlen - pg_consumed_bytes;
+
+		skb_copy_to_linear_data(skb, page_address(dma_info->page) +
+					page_offset, pg_consumed_bytes);
+		dma_info = &wi->umr.dma_info[++page_idx];
+		dma_sync_single_for_cpu(rq->pdev, dma_info->addr, headlen_rem,
+					DMA_FROM_DEVICE);
+		skb_copy_to_linear_data_offset(skb, pg_consumed_bytes,
+					       page_address(dma_info->page),
+					       headlen_rem);
+		page_offset = headlen_rem;
+	} else
+#endif
+	{
+		skb_copy_to_linear_data(skb, page_address(dma_info->page) +
+					page_offset, headlen);
+		page_offset += headlen;
+	}
+
+	skb_put(skb, headlen);
+
+	byte_cnt -= headlen;
+
+	while (byte_cnt) {
+		dma_info = &wi->umr.dma_info[page_idx++];
+		pg_consumed_bytes = min_t(u32, PAGE_SIZE - page_offset,
+					  byte_cnt);
+		dma_sync_single_for_cpu(rq->pdev,
+					dma_info->addr + page_offset,
+					pg_consumed_bytes,
+					DMA_FROM_DEVICE);
+		mlx5e_add_skb_frag(skb, pg_consumed_bytes, dma_info->page,
+				   page_offset);
+		byte_cnt -= pg_consumed_bytes;
+		page_offset = 0;
+	}
+}
+
+void mlx5e_complete_rx_linear_mpwqe(struct mlx5e_rq *rq,
+				    struct mlx5_cqe64 *cqe,
+				    u16 byte_cnt,
+				    struct mlx5e_mpw_info *wi,
+				    struct sk_buff *skb)
 {
 	u16 cstrides       = mpwrq_get_cqe_consumed_strides(cqe);
 	u16 stride_ix      = mpwrq_get_cqe_stride_index(cqe);
 	u32 consumed_bytes = cstrides  * MLX5_MPWRQ_STRIDE_SIZE;
 	u32 stride_offset  = stride_ix * MLX5_MPWRQ_STRIDE_SIZE;
+	struct mlx5e_dma_info *dma_info;
+	u32 data_offset;
+	u16 headlen;
+
+	dma_info = &wi->dma_info;
+	dma_sync_single_for_cpu(rq->pdev, dma_info->addr + stride_offset,
+				consumed_bytes, DMA_FROM_DEVICE);
+
+	data_offset = stride_offset;
+	headlen = min_t(u16, MLX5_MPWRQ_SMALL_PACKET_THRESHOLD, byte_cnt);
+	skb_copy_to_linear_data(skb, page_address(dma_info->page) + data_offset,
+				headlen);
+	skb_put(skb, headlen);
+
+	byte_cnt -= headlen;
+	if (byte_cnt)
+		mlx5e_add_skb_frag(skb, byte_cnt, dma_info->page,
+				   data_offset + headlen);
+}
+
+void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
+{
+	u16 cstrides       = mpwrq_get_cqe_consumed_strides(cqe);
 	u16 wqe_id         = be16_to_cpu(cqe->wqe_id);
 	struct mlx5e_mpw_info *wi = &rq->wqe_info[wqe_id];
 	struct mlx5e_rx_wqe  *wqe = mlx5_wq_ll_get_wqe(&rq->wq, wqe_id);
 	struct sk_buff *skb;
-	u16 byte_cnt;
 	u16 cqe_bcnt;
-	u16 headlen;
 
 	wi->consumed_strides += cstrides;
 
@@ -331,31 +648,8 @@ void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 	if (unlikely(!skb))
 		goto mpwrq_cqe_out;
 
-	dma_sync_single_for_cpu(rq->pdev, wi->dma_info.addr + stride_offset,
-				consumed_bytes, DMA_FROM_DEVICE);
-
 	cqe_bcnt = mpwrq_get_cqe_byte_cnt(cqe);
-	headlen = min_t(u16, MLX5_MPWRQ_SMALL_PACKET_THRESHOLD, cqe_bcnt);
-	skb_copy_to_linear_data(skb,
-				page_address(wi->dma_info.page) + stride_offset,
-				headlen);
-	skb_put(skb, headlen);
-
-	byte_cnt = cqe_bcnt - headlen;
-	if (byte_cnt) {
-		skb_frag_t *f0 = &skb_shinfo(skb)->frags[0];
-
-		skb_shinfo(skb)->nr_frags = 1;
-
-		skb->data_len  = byte_cnt;
-		skb->len      += byte_cnt;
-		skb->truesize  = SKB_TRUESIZE(skb->len);
-
-		get_page(wi->dma_info.page);
-		skb_frag_set_page(skb, 0, wi->dma_info.page);
-		skb_frag_size_set(f0, skb->data_len);
-		f0->page_offset = stride_offset + headlen;
-	}
+	wi->complete_wqe(rq, cqe, cqe_bcnt, wi, skb);
 
 	mlx5e_complete_rx_cqe(rq, cqe, cqe_bcnt, skb);
 
@@ -363,9 +657,7 @@ mpwrq_cqe_out:
 	if (likely(wi->consumed_strides < MLX5_MPWRQ_NUM_STRIDES))
 		return;
 
-	dma_unmap_page(rq->pdev, wi->dma_info.addr, rq->wqe_sz,
-		       PCI_DMA_FROMDEVICE);
-	put_page(wi->dma_info.page);
+	wi->free_wqe(rq, wi);
 	mlx5_wq_ll_pop(&rq->wq, cqe->wqe_id, &wqe->next.next_wqe_index);
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 7c94a9b..d030974 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -58,7 +58,7 @@ void mlx5e_send_nop(struct mlx5e_sq *sq, bool notify_hw)
 
 	if (notify_hw) {
 		cseg->fm_ce_se = MLX5_WQE_CTRL_CQ_UPDATE;
-		mlx5e_tx_notify_hw(sq, wqe, 0);
+		mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
 	}
 }
 
@@ -310,7 +310,7 @@ static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_sq *sq, struct sk_buff *skb)
 			bf_sz = wi->num_wqebbs << 3;
 
 		cseg->fm_ce_se = MLX5_WQE_CTRL_CQ_UPDATE;
-		mlx5e_tx_notify_hw(sq, wqe, bf_sz);
+		mlx5e_tx_notify_hw(sq, &wqe->ctrl, bf_sz);
 	}
 
 	/* fill sq edge with nops to avoid wqe wrap around */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
index 500dcd4..882e42e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
@@ -84,6 +84,9 @@ static void mlx5e_poll_ico_cq(struct mlx5e_cq *cq)
 		switch (icowi->opcode) {
 		case MLX5_OPCODE_NOP:
 			break;
+		case MLX5_OPCODE_UMR:
+			mlx5e_post_rx_fragmented_mpwqe(&sq->channel->rq);
+			break;
 		default:
 			WARN_ONCE(true,
 				  "mlx5e: Bad OPCODE in ICOSQ WQE info: 0x%x\n",
diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 6060ca3..d39dd31 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -512,7 +512,8 @@ struct mlx5_ifc_per_protocol_networking_offload_caps_bits {
 	u8         max_lso_cap[0x5];
 	u8         reserved_at_10[0x4];
 	u8         rss_ind_tbl_cap[0x4];
-	u8         reserved_at_18[0x3];
+	u8         reg_umr_sq[0x1];
+	u8         reserved_at_19[0x2];
 	u8         tunnel_lso_const_out_ip_id[0x1];
 	u8         reserved_at_1c[0x2];
 	u8         tunnel_statless_gre[0x1];
@@ -782,7 +783,9 @@ struct mlx5_ifc_cmd_hca_cap_bits {
 	u8         cd[0x1];
 	u8         reserved_at_22c[0x1];
 	u8         apm[0x1];
-	u8         reserved_at_22e[0x7];
+	u8         reserved_at_22e[0x1];
+	u8         umr_ptr_rlky[0x1];
+	u8         reserved_at_230[0x5];
 	u8         qkv[0x1];
 	u8         pkv[0x1];
 	u8         reserved_at_237[0x4];
@@ -2138,7 +2141,8 @@ struct mlx5_ifc_sqc_bits {
 	u8         flush_in_error_en[0x1];
 	u8         reserved_at_4[0x4];
 	u8         state[0x4];
-	u8         reserved_at_c[0x14];
+	u8         reg_umr[0x1];
+	u8         reserved_at_d[0x13];
 
 	u8         reserved_at_20[0x8];
 	u8         user_index[0x18];
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH net-next 09/13] net/mlx5e: Change RX moderation period to be based on CQE
  2016-03-11 13:39 [PATCH net-next 00/13] Mellanox 100G mlx5 driver receive path optimizations Saeed Mahameed
                   ` (7 preceding siblings ...)
  2016-03-11 13:39 ` [PATCH net-next 08/13] net/mlx5e: Add fragmented memory support for RX multi packet WQE Saeed Mahameed
@ 2016-03-11 13:39 ` Saeed Mahameed
  2016-03-11 13:39 ` [PATCH net-next 10/13] net/mlx5e: Use napi_alloc_skb for RX SKB allocations Saeed Mahameed
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Saeed Mahameed @ 2016-03-11 13:39 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Or Gerlitz, Eran Ben Elisha, Tal Alon, Tariq Toukan,
	Jesper Dangaard Brouer, Achiad Shochat, Saeed Mahameed

From: Tariq Toukan <tariqt@mellanox.com>

In this mode the moderation timer restarts upon completion
generation rather than upon interrupt generation.
The outcome is that for bursty traffic the period timer never
expires, so only the moderation frames counter dictates
interrupt generation, and the interrupt rate becomes relative
to the incoming packet size.
If the burst ceases for "moderation period" time, an interrupt
is issued immediately.
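
A minimal model of the difference between the two modes (illustrative
kernel-style C, not driver code; "deadline" and "frames" are
hypothetical state):

	/* An interrupt fires when either the frame counter or the
	 * period timer expires. */
	static bool cq_should_fire(u64 now, u64 deadline,
				   u32 frames, u32 max_frames)
	{
		return frames >= max_frames || now >= deadline;
	}

	/* EQE mode: the deadline is armed once, when the interrupt
	 * fires. CQE mode: the deadline is pushed forward on every
	 * completion, so a steady burst never lets the timer expire
	 * and only max_frames dictates interrupt generation. */
	static u64 cq_next_deadline(u64 now, u64 deadline,
				    u32 period_usec, bool cqe_mode)
	{
		return cqe_mode ? now + period_usec : deadline;
	}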

Performance tested on ConnectX4-Lx 50G.

Less packet loss in netperf and pktgen tests, with no bw degradation.

For example:
pktgen single flow, 16 sender threads, bursts of 8K (64B) packets each:
we see an improvement from 30% packet loss to 17%.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Achiad Shochat <achiad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |    2 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |   14 ++++++++++++++
 include/linux/mlx5/mlx5_ifc.h                     |    9 +++++++--
 3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 930d52a..77bf54c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -77,6 +77,7 @@
 
 #define MLX5E_PARAMS_DEFAULT_LRO_WQE_SZ                 (64 * 1024)
 #define MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_USEC      0x10
+#define MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_USEC_FROM_CQE 0x3
 #define MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_PKTS      0x20
 #define MLX5E_PARAMS_DEFAULT_TX_CQ_MODERATION_USEC      0x10
 #define MLX5E_PARAMS_DEFAULT_TX_CQ_MODERATION_PKTS      0x20
@@ -388,6 +389,7 @@ struct mlx5e_params {
 	u8  log_rq_size;
 	u16 num_channels;
 	u8  num_tc;
+	u8  rx_cq_period_mode;
 	u16 rx_cq_moderation_usec;
 	u16 rx_cq_moderation_pkts;
 	u16 tx_cq_moderation_usec;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index aa1bd54..784962c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -55,6 +55,7 @@ struct mlx5e_cq_param {
 	u32                        cqc[MLX5_ST_SZ_DW(cqc)];
 	struct mlx5_wq_param       wq;
 	u16                        eq_ix;
+	u8                         cq_period_mode;
 };
 
 struct mlx5e_channel_param {
@@ -903,6 +904,7 @@ static int mlx5e_enable_cq(struct mlx5e_cq *cq, struct mlx5e_cq_param *param)
 
 	mlx5_vector2eqn(mdev, param->eq_ix, &eqn, &irqn_not_used);
 
+	MLX5_SET(cqc,   cqc, cq_period_mode, param->cq_period_mode);
 	MLX5_SET(cqc,   cqc, c_eqn,         eqn);
 	MLX5_SET(cqc,   cqc, uar_page,      mcq->uar->index);
 	MLX5_SET(cqc,   cqc, log_page_size, cq->wq_ctrl.buf.page_shift -
@@ -1226,6 +1228,8 @@ static void mlx5e_build_rx_cq_param(struct mlx5e_priv *priv,
 	MLX5_SET(cqc, cqc, log_cq_size, log_cq_size);
 
 	mlx5e_build_common_cq_param(priv, param);
+
+	param->cq_period_mode = priv->params.rx_cq_period_mode;
 }
 
 static void mlx5e_build_tx_cq_param(struct mlx5e_priv *priv,
@@ -1236,6 +1240,8 @@ static void mlx5e_build_tx_cq_param(struct mlx5e_priv *priv,
 	MLX5_SET(cqc, cqc, log_cq_size, priv->params.log_sq_size);
 
 	mlx5e_build_common_cq_param(priv, param);
+
+	param->cq_period_mode = MLX5_CQ_PERIOD_MODE_START_FROM_EQE;
 }
 
 static void mlx5e_build_ico_cq_param(struct mlx5e_priv *priv,
@@ -1247,6 +1253,8 @@ static void mlx5e_build_ico_cq_param(struct mlx5e_priv *priv,
 	MLX5_SET(cqc, cqc, log_cq_size, log_wq_size);
 
 	mlx5e_build_common_cq_param(priv, param);
+
+	param->cq_period_mode = MLX5_CQ_PERIOD_MODE_START_FROM_EQE;
 }
 
 static void mlx5e_build_icosq_param(struct mlx5e_priv *priv,
@@ -2492,7 +2500,13 @@ static void mlx5e_build_netdev_priv(struct mlx5_core_dev *mdev,
 
 	priv->params.min_rx_wqes = mlx5_min_rx_wqes(priv->params.rq_wq_type,
 					    BIT(priv->params.log_rq_size));
+	priv->params.rx_cq_period_mode =
+		MLX5_CAP_GEN(mdev, cq_period_start_from_cqe) ?
+		MLX5_CQ_PERIOD_MODE_START_FROM_CQE :
+		MLX5_CQ_PERIOD_MODE_START_FROM_EQE;
 	priv->params.rx_cq_moderation_usec =
+		MLX5_CAP_GEN(mdev, cq_period_start_from_cqe) ?
+		MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_USEC_FROM_CQE :
 		MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_USEC;
 	priv->params.rx_cq_moderation_pkts =
 		MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_PKTS;
diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index d39dd31..37cc13a 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -779,7 +779,7 @@ struct mlx5_ifc_cmd_hca_cap_bits {
 	u8         block_lb_mc[0x1];
 	u8         reserved_at_228[0x1];
 	u8         scqe_break_moderation[0x1];
-	u8         reserved_at_22a[0x1];
+	u8         cq_period_start_from_cqe[0x1];
 	u8         cd[0x1];
 	u8         reserved_at_22c[0x1];
 	u8         apm[0x1];
@@ -2547,6 +2547,11 @@ enum {
 	MLX5_CQC_ST_FIRED                                 = 0xa,
 };
 
+enum {
+	MLX5_CQ_PERIOD_MODE_START_FROM_EQE = 0x0,
+	MLX5_CQ_PERIOD_MODE_START_FROM_CQE = 0x1,
+};
+
 struct mlx5_ifc_cqc_bits {
 	u8         status[0x4];
 	u8         reserved_at_4[0x4];
@@ -2555,7 +2560,7 @@ struct mlx5_ifc_cqc_bits {
 	u8         reserved_at_c[0x1];
 	u8         scqe_break_moderation_en[0x1];
 	u8         oi[0x1];
-	u8         reserved_at_f[0x2];
+	u8         cq_period_mode[0x2];
 	u8         cqe_zip_en[0x1];
 	u8         mini_cqe_res_format[0x2];
 	u8         st[0x4];
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH net-next 10/13] net/mlx5e: Use napi_alloc_skb for RX SKB allocations
  2016-03-11 13:39 [PATCH net-next 00/13] Mellanox 100G mlx5 driver receive path optimizations Saeed Mahameed
                   ` (8 preceding siblings ...)
  2016-03-11 13:39 ` [PATCH net-next 09/13] net/mlx5e: Change RX moderation period to be based on CQE Saeed Mahameed
@ 2016-03-11 13:39 ` Saeed Mahameed
  2016-03-11 13:39 ` [PATCH net-next 11/13] net/mlx5e: Prefetch next RX CQE Saeed Mahameed
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 26+ messages in thread
From: Saeed Mahameed @ 2016-03-11 13:39 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Or Gerlitz, Eran Ben Elisha, Tal Alon, Tariq Toukan,
	Jesper Dangaard Brouer, Saeed Mahameed

From: Tariq Toukan <tariqt@mellanox.com>

Instead of netdev_alloc_skb, we use the napi_alloc_skb function,
which is designed to allocate skbuffs for RX in a
channel-specific NAPI instance.
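
A hedged sketch of why the NAPI variant is preferable here (assuming
mainline behavior at the time: napi_alloc_skb() draws from a per-CPU
page-fragment cache and may only be called from softirq context, so
it avoids the interrupt disabling that netdev_alloc_skb() needs):

	/* RX handlers run in NAPI poll (softirq) context, so: */
	skb = napi_alloc_skb(rq->cq.napi, rq->wqe_sz);  /* softirq only */
	/* instead of: */
	skb = netdev_alloc_skb(rq->netdev, rq->wqe_sz); /* any context */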

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index dd3a6e1..aa7f90c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -47,7 +47,7 @@ int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
 	struct sk_buff *skb;
 	dma_addr_t dma_addr;
 
-	skb = netdev_alloc_skb(rq->netdev, rq->wqe_sz);
+	skb = napi_alloc_skb(rq->cq.napi, rq->wqe_sz);
 	if (unlikely(!skb))
 		return -ENOMEM;
 
@@ -644,7 +644,7 @@ void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 		goto mpwrq_cqe_out;
 	}
 
-	skb = netdev_alloc_skb(rq->netdev, MLX5_MPWRQ_SMALL_PACKET_THRESHOLD);
+	skb = napi_alloc_skb(rq->cq.napi, MLX5_MPWRQ_SMALL_PACKET_THRESHOLD);
 	if (unlikely(!skb))
 		goto mpwrq_cqe_out;
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH net-next 11/13] net/mlx5e: Prefetch next RX CQE
  2016-03-11 13:39 [PATCH net-next 00/13] Mellanox 100G mlx5 driver receive path optimizations Saeed Mahameed
                   ` (9 preceding siblings ...)
  2016-03-11 13:39 ` [PATCH net-next 10/13] net/mlx5e: Use napi_alloc_skb for RX SKB allocations Saeed Mahameed
@ 2016-03-11 13:39 ` Saeed Mahameed
  2016-03-11 13:39 ` [PATCH net-next 12/13] net/mlx5e: Remove redundant barrier Saeed Mahameed
  2016-03-11 13:39 ` [PATCH net-next 13/13] net/mlx5e: Add ethtool counter for RX SKB allocation failures Saeed Mahameed
  12 siblings, 0 replies; 26+ messages in thread
From: Saeed Mahameed @ 2016-03-11 13:39 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Or Gerlitz, Eran Ben Elisha, Tal Alon, Tariq Toukan,
	Jesper Dangaard Brouer, Saeed Mahameed

From: Tariq Toukan <tariqt@mellanox.com>

Performance optimization that prefetches the next RX CQE while
handling the current one.
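
One detail worth noting (an observation, not from the patch text):
prefetch() is only a hint built on __builtin_prefetch(), which does
not fault on mainstream architectures, so prefetching the
possibly-NULL return value of mlx5e_get_cqe() before the NULL check
in the next loop iteration is safe:

	next_cqe = mlx5e_get_cqe(cq);
	prefetch(next_cqe);	/* harmless even if next_cqe is NULL */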

Performance tested on ConnectX4-Lx 50G.
* Netperf single TCP stream:
- bw gain of 3-10% for various representative message sizes.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index aa7f90c..b53e9bd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -664,15 +664,19 @@ mpwrq_cqe_out:
 int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
 {
 	struct mlx5e_rq *rq = container_of(cq, struct mlx5e_rq, cq);
+	struct mlx5_cqe64 *next_cqe = mlx5e_get_cqe(cq);
+	struct mlx5_cqe64 *cqe;
 	int work_done;
 
 	for (work_done = 0; work_done < budget; work_done++) {
-		struct mlx5_cqe64 *cqe = mlx5e_get_cqe(cq);
+		cqe = next_cqe;
 
 		if (!cqe)
 			break;
 
 		mlx5_cqwq_pop(&cq->wq);
+		next_cqe = mlx5e_get_cqe(cq);
+		prefetch(next_cqe);
 
 		rq->handle_rx_cqe(rq, cqe);
 	}
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH net-next 12/13] net/mlx5e: Remove redundant barrier
  2016-03-11 13:39 [PATCH net-next 00/13] Mellanox 100G mlx5 driver receive path optimizations Saeed Mahameed
                   ` (10 preceding siblings ...)
  2016-03-11 13:39 ` [PATCH net-next 11/13] net/mlx5e: Prefetch next RX CQE Saeed Mahameed
@ 2016-03-11 13:39 ` Saeed Mahameed
  2016-03-11 13:39 ` [PATCH net-next 13/13] net/mlx5e: Add ethtool counter for RX SKB allocation failures Saeed Mahameed
  12 siblings, 0 replies; 26+ messages in thread
From: Saeed Mahameed @ 2016-03-11 13:39 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Or Gerlitz, Eran Ben Elisha, Tal Alon, Tariq Toukan,
	Jesper Dangaard Brouer, Saeed Mahameed

From: Tariq Toukan <tariqt@mellanox.com>

The bit operation one line before is an explicit barrier
by itself.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
index 882e42e..10f8ca5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
@@ -147,7 +147,6 @@ void mlx5e_completion_event(struct mlx5_core_cq *mcq)
 	struct mlx5e_cq *cq = container_of(mcq, struct mlx5e_cq, mcq);
 
 	set_bit(MLX5E_CHANNEL_NAPI_SCHED, &cq->channel->flags);
-	barrier();
 	napi_schedule(cq->napi);
 }
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH net-next 13/13] net/mlx5e: Add ethtool counter for RX SKB allocation failures
  2016-03-11 13:39 [PATCH net-next 00/13] Mellanox 100G mlx5 driver receive path optimizations Saeed Mahameed
                   ` (11 preceding siblings ...)
  2016-03-11 13:39 ` [PATCH net-next 12/13] net/mlx5e: Remove redundant barrier Saeed Mahameed
@ 2016-03-11 13:39 ` Saeed Mahameed
  12 siblings, 0 replies; 26+ messages in thread
From: Saeed Mahameed @ 2016-03-11 13:39 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Or Gerlitz, Eran Ben Elisha, Tal Alon, Tariq Toukan,
	Jesper Dangaard Brouer, Saeed Mahameed

From: Tariq Toukan <tariqt@mellanox.com>

Counts the number of RX SKB allocation failures and shows it
in ethtool statistics.
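
A usage sketch (hypothetical output; the global counter name comes
from this patch's vport_strings addition, while the per-ring prefix
is an assumption about the driver's string formatting):

	$ ethtool -S eth0 | grep buff_alloc_err
	     rx_buff_alloc_err: 0
	     rx0_buff_alloc_err: 0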

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |    8 ++++++--
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |    2 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |    8 ++++++--
 3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 77bf54c..bc391c9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -189,6 +189,7 @@ static const char vport_strings[][ETH_GSTRING_LEN] = {
 	"rx_wqe_err",
 	"rx_mpwqe_filler",
 	"rx_mpwqe_frag",
+	"rx_buff_alloc_err",
 };
 
 struct mlx5e_vport_stats {
@@ -232,8 +233,9 @@ struct mlx5e_vport_stats {
 	u64 rx_wqe_err;
 	u64 rx_mpwqe_filler;
 	u64 rx_mpwqe_frag;
+	u64 rx_buff_alloc_err;
 
-#define NUM_VPORT_COUNTERS     37
+#define NUM_VPORT_COUNTERS     38
 };
 
 static const char pport_strings[][ETH_GSTRING_LEN] = {
@@ -329,6 +331,7 @@ static const char rq_stats_strings[][ETH_GSTRING_LEN] = {
 	"wqe_err",
 	"mpwqe_filler",
 	"mpwqe_frag",
+	"buff_alloc_err",
 };
 
 struct mlx5e_rq_stats {
@@ -341,7 +344,8 @@ struct mlx5e_rq_stats {
 	u64 wqe_err;
 	u64 mpwqe_filler;
 	u64 mpwqe_frag;
-#define NUM_RQ_STATS 8
+	u64 buff_alloc_err;
+#define NUM_RQ_STATS 9
 };
 
 static const char sq_stats_strings[][ETH_GSTRING_LEN] = {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 784962c..56d7888 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -181,6 +181,7 @@ void mlx5e_update_stats(struct mlx5e_priv *priv)
 	s->rx_wqe_err		= 0;
 	s->rx_mpwqe_filler	= 0;
 	s->rx_mpwqe_frag	= 0;
+	s->rx_buff_alloc_err	= 0;
 	for (i = 0; i < priv->params.num_channels; i++) {
 		rq_stats = &priv->channel[i]->rq.stats;
 
@@ -193,6 +194,7 @@ void mlx5e_update_stats(struct mlx5e_priv *priv)
 		s->rx_wqe_err   += rq_stats->wqe_err;
 		s->rx_mpwqe_filler += rq_stats->mpwqe_filler;
 		s->rx_mpwqe_frag   += rq_stats->mpwqe_frag;
+		s->rx_buff_alloc_err += rq_stats->buff_alloc_err;
 
 		for (j = 0; j < priv->params.num_tc; j++) {
 			sq_stats = &priv->channel[i]->sq[j].stats;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index b53e9bd..89b8ace 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -48,8 +48,10 @@ int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
 	dma_addr_t dma_addr;
 
 	skb = napi_alloc_skb(rq->cq.napi, rq->wqe_sz);
-	if (unlikely(!skb))
+	if (unlikely(!skb)) {
+		rq->stats.buff_alloc_err++;
 		return -ENOMEM;
+	}
 
 	dma_addr = dma_map_single(rq->pdev,
 				  /* hw start padding */
@@ -645,8 +647,10 @@ void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 	}
 
 	skb = napi_alloc_skb(rq->cq.napi, MLX5_MPWRQ_SMALL_PACKET_THRESHOLD);
-	if (unlikely(!skb))
+	if (unlikely(!skb)) {
+		rq->stats.buff_alloc_err++;
 		goto mpwrq_cqe_out;
+	}
 
 	cqe_bcnt = mpwrq_get_cqe_byte_cnt(cqe);
 	wi->complete_wqe(rq, cqe, cqe_bcnt, wi, skb);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next 04/13] net/mlx5e: Use only close NUMA node for default RSS
  2016-03-11 13:39 ` [PATCH net-next 04/13] net/mlx5e: Use only close NUMA node for default RSS Saeed Mahameed
@ 2016-03-11 14:08   ` Sergei Shtylyov
  2016-03-11 19:29     ` Saeed Mahameed
  0 siblings, 1 reply; 26+ messages in thread
From: Sergei Shtylyov @ 2016-03-11 14:08 UTC (permalink / raw)
  To: Saeed Mahameed, David S. Miller
  Cc: netdev, Or Gerlitz, Eran Ben Elisha, Tal Alon, Tariq Toukan,
	Jesper Dangaard Brouer

Hello.

On 3/11/2016 4:39 PM, Saeed Mahameed wrote:

> From: Tariq Toukan <tariqt@mellanox.com>
>
> Distribute default RSS table uniformely over the rings of the

    Uniformly.

> close NUMA node, instead of all available channels.
> This way we enforce the preference of close rings over far ones.
>
> Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
[...]

MBR, Sergei

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next 08/13] net/mlx5e: Add fragmented memory support for RX multi packet WQE
  2016-03-11 13:39 ` [PATCH net-next 08/13] net/mlx5e: Add fragmented memory support for RX multi packet WQE Saeed Mahameed
@ 2016-03-11 14:32   ` Eric Dumazet
  2016-03-11 19:25     ` Saeed Mahameed
  0 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2016-03-11 14:32 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, netdev, Or Gerlitz, Eran Ben Elisha, Tal Alon,
	Tariq Toukan, Jesper Dangaard Brouer

On Fri., 2016-03-11 at 15:39 +0200, Saeed Mahameed wrote:
> From: Tariq Toukan <tariqt@mellanox.com>
> 
> If the allocation of a linear (physically continuous) MPWQE fails,
> we allocate a fragmented MPWQE.
> 
> This is implemented via device's UMR (User Memory Registration)
> which allows to register multiple memory fragments into ConnectX
> hardware as a continuous buffer.
> UMR registration is an asynchronous operation and is done via
> ICO SQs.
> 
...

> +static int mlx5e_alloc_and_map_page(struct mlx5e_rq *rq,
> +				    struct mlx5e_mpw_info *wi,
> +				    int i)
> +{
> +	struct page *page;
> +
> +	page = alloc_page(GFP_ATOMIC | __GFP_COMP | __GFP_COLD);
> +	if (!page)
> +		return -ENOMEM;
> +
> +	wi->umr.dma_info[i].page = page;
> +	wi->umr.dma_info[i].addr = dma_map_page(rq->pdev, page, 0, PAGE_SIZE,
> +						PCI_DMA_FROMDEVICE);
> +	if (dma_mapping_error(rq->pdev, wi->umr.dma_info[i].addr)) {
> +		put_page(page);
> +		return -ENOMEM;
> +	}
> +	wi->umr.mtt[i] = cpu_to_be64(wi->umr.dma_info[i].addr | MLX5_EN_WR);
> +
> +	return 0;
> +}
> +
> +static int mlx5e_alloc_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
> +					   struct mlx5e_rx_wqe *wqe,
> +					   u16 ix)
> +{
> +	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
> +	int mtt_sz = mlx5e_get_wqe_mtt_sz();
> +	u32 dma_offset = rq->ix * MLX5_CHANNEL_MAX_NUM_PAGES * PAGE_SIZE +
> +		ix * rq->wqe_sz;
> +	int i;
> +
> +	wi->umr.dma_info = kmalloc(sizeof(*wi->umr.dma_info) *
> +				   MLX5_MPWRQ_WQE_NUM_PAGES,
> +				   GFP_ATOMIC | __GFP_COMP | __GFP_COLD);
> +	if (!wi->umr.dma_info)
> +		goto err_out;
> +
> +	 /* To avoid copying garbage after the mtt array, we allocate
> +	  * a little more.
> +	  */
> +	wi->umr.mtt = kzalloc(mtt_sz + MLX5_UMR_ALIGN - 1,
> +			  GFP_ATOMIC | __GFP_COMP | __GFP_COLD);

__GFP_COLD right before a memset(0) (kzalloc) makes little sense.


> +	if (!wi->umr.mtt)
> +		goto err_free_umr;
> +
> +	wi->umr.mtt = PTR_ALIGN(wi->umr.mtt, MLX5_UMR_ALIGN);
> +	wi->umr.mtt_addr = dma_map_single(rq->pdev, wi->umr.mtt, mtt_sz,
> +				      PCI_DMA_TODEVICE);
> +	if (dma_mapping_error(rq->pdev, wi->umr.mtt_addr))
> +		goto err_free_mtt;
> +
...

>  
> -void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
> +static void mlx5e_add_skb_frag(struct sk_buff *skb, int len, struct page *page,
> +			       int page_offset)
> +{
> +	int f = skb_shinfo(skb)->nr_frags++;
> +	skb_frag_t *fr = &skb_shinfo(skb)->frags[f];
> +
> +	skb->len += len;
> +	skb->data_len += len;
> +	get_page(page);
> +	skb_frag_set_page(skb, f, page);
> +	skb_frag_size_set(fr, len);
> +	fr->page_offset = page_offset;
> +	skb->truesize  = SKB_TRUESIZE(skb->len);
> +}

Really I am speechless.

It is hard to believe how much effort some driver authors spend trying
to fool the linux stack and risk OOMing a host under stress.

SKB_TRUESIZE() is absolutely not something a driver is allowed to use.

Here you want instead :

skb->truesize += PAGE_SIZE;

Assuming you allocate and use an order-0 page per fragment. The fact
that you receive, say, a 100-byte datagram is irrelevant to truesize.

truesize is the real memory usage of one skb. Not the minimal size of an
optimally allocated skb for a given payload.


Better RX speed should not come at the risk of system stability.

Now if for some reason you need to increase max TCP RWIN, that would be
a TCP stack change, not some obscure lie in a driver trying to be faster
than competitors.

Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next 08/13] net/mlx5e: Add fragmented memory support for RX multi packet WQE
  2016-03-11 14:32   ` Eric Dumazet
@ 2016-03-11 19:25     ` Saeed Mahameed
  2016-03-11 19:58       ` Eric Dumazet
  0 siblings, 1 reply; 26+ messages in thread
From: Saeed Mahameed @ 2016-03-11 19:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Saeed Mahameed, David S. Miller, Linux Netdev List, Or Gerlitz,
	Eran Ben Elisha, Tal Alon, Tariq Toukan, Jesper Dangaard Brouer

>> -void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
>> +static void mlx5e_add_skb_frag(struct sk_buff *skb, int len, struct page *page,
>> +                            int page_offset)
>> +{
>> +     int f = skb_shinfo(skb)->nr_frags++;
>> +     skb_frag_t *fr = &skb_shinfo(skb)->frags[f];
>> +
>> +     skb->len += len;
>> +     skb->data_len += len;
>> +     get_page(page);
>> +     skb_frag_set_page(skb, f, page);
>> +     skb_frag_size_set(fr, len);
>> +     fr->page_offset = page_offset;
>> +     skb->truesize  = SKB_TRUESIZE(skb->len);
>> +}
>
> Really I am speechless.
>
> It is hard to believe how much effort some drivers authors spend trying
> to fool linux stack and risk OOM a host under stress.

Eric, you got it all wrong my friend, no one is trying to fool anybody here.
I will explain it to you below.

>
> SKB_TRUESIZE() is absolutely not something a driver is allowed to use.
>
> Here you want instead :
>
> skb->truesize += PAGE_SIZE;
>
> Assuming you allocate and use an order-0 page per fragment. Fact that
> you receive say 100 bytes datagram is irrelevant to truesize.

Your assumption is wrong: we allocate as many pages as a WQE needs,
and a WQE can describe/handle up to 1024 packets which share the same
page/pages, so the skb should really have a truesize of the strides
it used from that page, and not the WHOLE page as you think.

You can already see this in the previous patch.

Each WQE (Receive Work Queue Element) contains 1024 strides, each of
size 128B; i.e., a packet of size 128B or less will consume only one
stride of that WQE page, and the next packets on that WQE will use
the following strides in that same page.

So, in opposition to what you think, this new scheme is better than
our old one in terms of memory utilization. Before, we wasted an
MTU-sized buffer per SKB/packet regardless of the real packet size;
now each SKB will consume only as many 128B strides as it needs, no
more, no less.

BTW there will be only 16 WQEs per ring :), so this new approach
doesn't drastically consume more memory than the previous one.
But it sure can handle more small-packet bursts.
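
To make the arithmetic concrete (using the defaults from this series:
16 WQEs, 1024 strides per WQE, 128B per stride):

	ring memory    = 16 * 1024 * 128B = 2MB
	ring capacity  = 16 * 1024        = 16384 small packets
	a 64B packet   consumes 1 stride   (128B)
	a 1500B packet consumes 12 strides (1536B)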

>
> truesize is the real memory usage of one skb. Not the minimal size of an
> optimally allocated skb for a given payload.

I totally agree with this; we should have reported skb->truesize +=
(consumed strides) * (stride size).
But again, this is not as critical as you think: in the worst case
skb->truesize will be off by 127B at most.

I will discuss this with Tariq and fix it.

>
>
> Better RX speed should not be done at the risk of system stability.

The whole idea of this patch is not improving RX speed! No! Not at
all! It just improves the driver's resiliency when the system is
under stress, at the expense of performance!

So I really think we should get a "thumbs up" from you.

>
> Now if for some reason you need to increase max TCP RWIN, that would be
> a TCP stack change, not some obscure lie in a driver trying to be faster
> than competitors.

No, we are not trying to max out TCP RWIN here. Sorry you see it
that way; I hope my explanation above changes your mind.

Thanks,
Saeed

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next 04/13] net/mlx5e: Use only close NUMA node for default RSS
  2016-03-11 14:08   ` Sergei Shtylyov
@ 2016-03-11 19:29     ` Saeed Mahameed
  0 siblings, 0 replies; 26+ messages in thread
From: Saeed Mahameed @ 2016-03-11 19:29 UTC (permalink / raw)
  To: Sergei Shtylyov
  Cc: Saeed Mahameed, David S. Miller, Linux Netdev List, Or Gerlitz,
	Eran Ben Elisha, Tal Alon, Tariq Toukan, Jesper Dangaard Brouer

On Fri, Mar 11, 2016 at 4:08 PM, Sergei Shtylyov
<sergei.shtylyov@cogentembedded.com> wrote:
> Hello.
>
> On 3/11/2016 4:39 PM, Saeed Mahameed wrote:
>
>> From: Tariq Toukan <tariqt@mellanox.com>
>>
>> Distribute default RSS table uniformely over the rings of the
>
>
>    Uniformly.

Indeed :), will fix this

Thank you.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next 08/13] net/mlx5e: Add fragmented memory support for RX multi packet WQE
  2016-03-11 19:25     ` Saeed Mahameed
@ 2016-03-11 19:58       ` Eric Dumazet
  2016-03-13 10:29         ` achiad shochat
  2016-03-14 18:16         ` Saeed Mahameed
  0 siblings, 2 replies; 26+ messages in thread
From: Eric Dumazet @ 2016-03-11 19:58 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Saeed Mahameed, David S. Miller, Linux Netdev List, Or Gerlitz,
	Eran Ben Elisha, Tal Alon, Tariq Toukan, Jesper Dangaard Brouer

On Fri., 2016-03-11 at 21:25 +0200, Saeed Mahameed wrote:
> >> -void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
> >> +static void mlx5e_add_skb_frag(struct sk_buff *skb, int len, struct page *page,
> >> +                            int page_offset)
> >> +{
> >> +     int f = skb_shinfo(skb)->nr_frags++;
> >> +     skb_frag_t *fr = &skb_shinfo(skb)->frags[f];
> >> +
> >> +     skb->len += len;
> >> +     skb->data_len += len;
> >> +     get_page(page);
> >> +     skb_frag_set_page(skb, f, page);
> >> +     skb_frag_size_set(fr, len);
> >> +     fr->page_offset = page_offset;
> >> +     skb->truesize  = SKB_TRUESIZE(skb->len);
> >> +}
> >
> > Really I am speechless.
> >
> > It is hard to believe how much effort some drivers authors spend trying
> > to fool linux stack and risk OOM a host under stress.
> 
> Eric, you got it all wrong my friend, no one is trying to fool nobody here.
> I will explain it to you below.
> 
> >
> > SKB_TRUESIZE() is absolutely not something a driver is allowed to use.
> >
> > Here you want instead :
> >
> > skb->truesize += PAGE_SIZE;
> >
> > Assuming you allocate and use an order-0 page per fragment. Fact that
> > you receive say 100 bytes datagram is irrelevant to truesize.
> 
> Your assumption is wrong, we allocate as many pages as a WQE needs,
> and a WQE can describe/handle
> up to 1024 packets which share the same page/pages, so the skb should
> really have a true size of the strides
> of that page it used and not the WHOLE page as you think.
> 
> you should already learn this from the previous patch.
> 
> each WQE (Receive Work Queue Element) contains 1024 strides each of
> the size 128B,
> i.e, a packet of the size 128B or less will consume only one stride of
> that WQE page, next packets on that WQE
> will use the following strides in that same page.
> 
> So in opposite of what you think this new scheme is better than our
> old one in terms of memory utilization.
> before, we wasted MTU size per SKB/Packet regardless of the real
> packet size, now each SKB will consume only
> as much as 128B strides it will need, no more no less.
> 
> BTW there will be only 16 WQEs per ring :), so this new approach
> doesn't drastically consume more memory than the previous one.
> But it sure can handle more small packets bursts.
> 
> >
> > truesize is the real memory usage of one skb. Not the minimal size of an
> > optimally allocated skb for a given payload.
> 
> I totally agree with this, we should have reported  skb->truesize +=
> (consumed strides)*(stride size).
> but again this is not as critical as you think, in the worst case
> skb->truesize will be off by 127B at most.

Ouch. Really, you are completely wrong.

If one skb has a fragment of a page, and sits in a queue for a long
time, it really uses a full page, because the remaining part of the
page is not reusable. Only kmalloc(128) can deal with that idea of
allowing other parts of the page to be 'freed and reusable'.

It is trivial for an attacker to make sure the host will consume one
page + sk_buff + skb->head = 4096 + 256 + 512 bytes, by deliberately
sending out-of-order packets on TCP flows.

It is very tempting to have special memory allocators you know, but you
have to understand attackers are smart. Smarter than us.

If now you are telling me you plan to allocate 131072-byte pages
(1024 strides of 128 bytes), then a smart attacker can actually bump
skb truesize to 128KB.

Really, your RX allocation scheme is easily DOSable.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next 08/13] net/mlx5e: Add fragmented memory support for RX multi packet WQE
  2016-03-11 19:58       ` Eric Dumazet
@ 2016-03-13 10:29         ` achiad shochat
  2016-03-14 18:16         ` Saeed Mahameed
  1 sibling, 0 replies; 26+ messages in thread
From: achiad shochat @ 2016-03-13 10:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Saeed Mahameed, Saeed Mahameed, David S. Miller,
	Linux Netdev List, Or Gerlitz, Eran Ben Elisha, Tal Alon,
	Tariq Toukan, Jesper Dangaard Brouer

On 11 March 2016 at 21:58, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On ven., 2016-03-11 at 21:25 +0200, Saeed Mahameed wrote:
>> >> -void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
>> >> +static void mlx5e_add_skb_frag(struct sk_buff *skb, int len, struct page *page,
>> >> +                            int page_offset)
>> >> +{
>> >> +     int f = skb_shinfo(skb)->nr_frags++;
>> >> +     skb_frag_t *fr = &skb_shinfo(skb)->frags[f];
>> >> +
>> >> +     skb->len += len;
>> >> +     skb->data_len += len;
>> >> +     get_page(page);
>> >> +     skb_frag_set_page(skb, f, page);
>> >> +     skb_frag_size_set(fr, len);
>> >> +     fr->page_offset = page_offset;
>> >> +     skb->truesize  = SKB_TRUESIZE(skb->len);
>> >> +}
>> >
>> > Really I am speechless.
>> >
>> > It is hard to believe how much effort some drivers authors spend trying
>> > to fool linux stack and risk OOM a host under stress.
>>
>> Eric, you got it all wrong my friend, no one is trying to fool nobody here.
>> I will explain it to you below.
>>
>> >
>> > SKB_TRUESIZE() is absolutely not something a driver is allowed to use.
>> >
>> > Here you want instead :
>> >
>> > skb->truesize += PAGE_SIZE;
>> >
>> > Assuming you allocate and use an order-0 page per fragment. Fact that
>> > you receive say 100 bytes datagram is irrelevant to truesize.
>>
>> Your assumption is wrong, we allocate as many pages as a WQE needs,
>> and a WQE can describe/handle
>> up to 1024 packets which share the same page/pages, so the skb should
>> really have a true size of the strides
>> of that page it used and not the WHOLE page as you think.
>>
>> you should already learn this from the previous patch.
>>
>> each WQE (Receive Work Queue Element) contains 1024 strides each of
>> the size 128B,
>> i.e, a packet of the size 128B or less will consume only one stride of
>> that WQE page, next packets on that WQE
>> will use the following strides in that same page.
>>
>> So in opposite of what you think this new scheme is better than our
>> old one in terms of memory utilization.
>> before, we wasted MTU size per SKB/Packet regardless of the real
>> packet size, now each SKB will consume only
>> as much as 128B strides it will need, no more no less.
>>
>> BTW there will be only 16 WQEs per ring :), so this new approach
>> doesn't drastically consume more memory than the previous one.
>> But it sure can handle more small packets bursts.
>>
>> >
>> > truesize is the real memory usage of one skb. Not the minimal size of an
>> > optimally allocated skb for a given payload.
>>
>> I totally agree with this, we should have reported  skb->truesize +=
>> (consumed strides)*(stride size).
>> but again this is not as critical as you think, in the worst case
>> skb->truesize will be off by 127B at most.
>
> Ouch. really you are completely wrong.
>
> If one skb has a fragment of a page, and sits in a queue for a long
> time, it really uses a full page, because the remaining part of the page
> is not reusable. Only kmalloc(128) can deal with that idea of allowing
> other parts of the page being 'freed and reusable'
>
> It is trivial for an attacker to make sure the host will consume one
> page + sk_buff + skb->head = 4096 + 256 + 512, by specially sending out
> of order packets on TCP flows.
>
> It is very tempting to have special memory allocators you know, but you
> have to understand attackers are smart. Smarter than us.
>
> If now you are telling me you plan to allocate 131072 bytes pages (1024
> strides of 128 bytes), then a smart attacker can actually bump skb
> truesize to 128KB
>
> Really your RX allocation schem is easily DOSable.
>
>

It seems you did not understand the new scheme at all.
With the new scheme each incoming packet uses only the fraction of the
page that it needs.
This is optimal memory utilization and locality.

So if one skb has a fragment of a page, and sits in a queue for a long
time, it really does _not_ use a full page, because the remaining part
of the page _is_ usable for other incoming packets.

BTW, this new scheme was actually introduced in the previous patch
"[PATCH net-next 06/13] net/mlx5e: Support RX multi-packet WQE
(Striding RQ)" rather than in this one.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next 08/13] net/mlx5e: Add fragmented memory support for RX multi packet WQE
  2016-03-11 19:58       ` Eric Dumazet
  2016-03-13 10:29         ` achiad shochat
@ 2016-03-14 18:16         ` Saeed Mahameed
  2016-03-14 19:16           ` achiad shochat
  2016-03-14 20:23           ` Eric Dumazet
  1 sibling, 2 replies; 26+ messages in thread
From: Saeed Mahameed @ 2016-03-14 18:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Saeed Mahameed, David S. Miller, Linux Netdev List, Or Gerlitz,
	Eran Ben Elisha, Tal Alon, Tariq Toukan, Jesper Dangaard Brouer

On Fri, Mar 11, 2016 at 9:58 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:

>> I totally agree with this, we should have reported  skb->truesize +=
>> (consumed strides)*(stride size).
>> but again this is not as critical as you think, in the worst case
>> skb->truesize will be off by 127B at most.
>
> Ouch. really you are completely wrong.

It is just a matter of perspective, quoting:
http://vger.kernel.org/~davem/skb_sk.html
"This is the total of how large a data buffer we allocated for the
packet, plus the size of 'struct sk_buff' itself."

As explained more than once, a page used in the ConnectX4 MPWQE
approach can serve more than one packet. Per the above documentation
and many other examples in the kernel, each packet should report as
much data buffer as it used from that page, and for that packet we
allocated #strides * stride_size from that page (common sense).

It is really uncalled-for to report, for each SKB, skb->truesize +=
PAGE_SIZE for the same shared reusable page, as we did in here and as
other drivers already do.

It is just ridiculous to report PAGE_SIZE for an SKB that used only
128B while the other parts of that page are either reused by HW or
reported back to the stack, where we already did the truesize
accounting for them.

It seems to me that reporting PAGE_SIZE * (#SKBs pointing to that
page) for all of those SKBs is just a big lie, and it is an abuse of
skb->truesize to protect against special/rare cases like the OOO
issue, for which I can suggest a handful of solutions (out of this
thread's scope) without the need to lie in device drivers about the
actual truesize.
Think about it: if SKBs share the same page, then SUM(SKBs->truesize)
= PAGE_SIZE.
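
In code, that accounting would be along these lines (a hypothetical
sketch reusing the cstrides variable from mlx5e_handle_rx_cqe_mpwrq,
not the submitted patch):

	/* charge whole strides instead of SKB_TRUESIZE(skb->len) */
	skb->truesize += cstrides * MLX5_MPWRQ_STRIDE_SIZE;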

And suppose you are right: why not just remove the truesize param
from skb_add_rx_frag, and just explicitly do skb->truesize +=
PAGE_SIZE, hardcoded inside that function? Or rename the truesize
param to pageorder?

>
> If one skb has a fragment of a page, and sits in a queue for a long
> time, it really uses a full page, because the remaining part of the page
> is not reusable. Only kmalloc(128) can deal with that idea of allowing
> other parts of the page being 'freed and reusable'
This concern was also true before this series for other drivers in
the kernel, which use pages for fragmented SKBs, and none of them
report PAGE_SIZE as skb->truesize, as their pages are reusable.

>
> It is trivial for an attacker to make sure the host will consume one
> page + sk_buff + skb->head = 4096 + 256 + 512, by specially sending out
> of order packets on TCP flows.
We can do special accounting for OOO-like issues in the stack (maybe
count page references and sum up page sizes as you suggest); device
drivers shouldn't need special handling/accounting to protect against
such cases.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next 08/13] net/mlx5e: Add fragmented memory support for RX multi packet WQE
  2016-03-14 18:16         ` Saeed Mahameed
@ 2016-03-14 19:16           ` achiad shochat
  2016-03-14 20:26             ` Eric Dumazet
  2016-03-14 20:29             ` Eric Dumazet
  2016-03-14 20:23           ` Eric Dumazet
  1 sibling, 2 replies; 26+ messages in thread
From: achiad shochat @ 2016-03-14 19:16 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Eric Dumazet, Saeed Mahameed, David S. Miller, Linux Netdev List,
	Or Gerlitz, Eran Ben Elisha, Tal Alon, Tariq Toukan,
	Jesper Dangaard Brouer

On 14 March 2016 at 20:16, Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> On Fri, Mar 11, 2016 at 9:58 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>>> I totally agree with this, we should have reported  skb->truesize +=
>>> (consumed strides)*(stride size).
>>> but again this is not as critical as you think, in the worst case
>>> skb->truesize will be off by 127B at most.
>>
>> Ouch. really you are completely wrong.
>
> It it is just a matter of perspective, quoting:
> http://vger.kernel.org/~davem/skb_sk.html
> "This is the total of how large a data buffer we allocated for the
> packet, plus the size of 'struct sk_buff' itself."
>
> as explained more than once, a page used in ConnectX4 MPWQE approach
> can be used for more than one packet, according to the above
> documentation and many other examples in the kernel, each packet will
> report as much data buffer as it used from that page, and we allocated
> for that packet: #strides * stridesize from that page, (common sense).
>
> it is really uncalled-for to report for each SKB, skb->truesize +=
> PAGE_SIZE for the same shared reuseable page, as we did in here and as
> other drivers already do.
>
> It is just ridiculous to report PAGE_SIZE for SKB that used only 128B
> and the others parts of that page are being either reused by HW or
> reported back to the stack and we already did the truesize accounting
> on their parts.
>
> It seems to me that reporting PAGE_SIZE* (#SKBs pointing to that page)
> for all of those SKBs is just a big lie and it is just an abuse to the
> skb->truesize to protect against special/rare cases like OOO issue
> that I can suggest a handful of solutions (out of this thread scope)
> for them without the need of lying in device drivers of the actual
> truesize.
> Think about it, if SKBs share the same page then SUM(SKBs->truesize) =
> PAGE_SIZE.
>
> and suppose you are right, why just not  remove the truesize param
> from skb_add_rx_frag, and just explicitly do skb->true_szie +=
> PAGE_SIZE, hardcoded inside that function? or rename the truesize
> param to pageorder ?
>
>>
>> If one skb has a fragment of a page, and sits in a queue for a long
>> time, it really uses a full page, because the remaining part of the page
>> is not reusable. Only kmalloc(128) can deal with that idea of allowing
>> other parts of the page being 'freed and reusable'
> This concern was also true before this series for other drivers in the
> kernel, who use pages for fragmented SKBs and non of them report
> PAGE_SIZE as SKB->truesize, as their pages are reuseable.
>
>>
>> It is trivial for an attacker to make sure the host will consume one
>> page + sk_buff + skb->head = 4096 + 256 + 512, by specially sending out
>> of order packets on TCP flows.
> we can do special accounting for ooo like issues in the stack (maybe
> count page references and sum up page sizes as you suggest), device
> drivers shouldn't have special handling/accounting to protect against
> such cases.

I really do not see why the new scheme is more DOSable than the
common scheme of pre-allocating SKBs using napi_alloc_skb().
In both cases each RX packet "eats" a chunk from a (likely(compound))
page and holds the page as long as it sits in a queue.
Only in the new scheme, the size of the eaten chunk equals the actual
incoming packet size rather than being pre-defined according to the
maximum packet size (MTU), which yields optimal memory usage and
locality.
It does not make sense that an incoming packet of 64 bytes will eat
1500 bytes of memory.
In the new scheme the RX buffer "credits" are in bytes, not in
packets, which makes more sense: given a link speed, the size of the
buffer in bytes determines how long the SW response time to an
incoming burst of traffic can be before the buffer overruns.
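
Back-of-envelope (hedged, illustrative numbers): at 50Gb/s, a 2MB
byte-based buffer (16 WQEs * 1024 strides * 128B) absorbs roughly
2MB / 6.25GB/s ~= 335us of line-rate burst regardless of packet size,
while a 1024-entry packet-based ring absorbs only 1024 minimum-size
frames, about 14us at line rate.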

Eric, am I missing something here or the new scheme was not clear to
you previously?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next 08/13] net/mlx5e: Add fragmented memory support for RX multi packet WQE
  2016-03-14 18:16         ` Saeed Mahameed
  2016-03-14 19:16           ` achiad shochat
@ 2016-03-14 20:23           ` Eric Dumazet
  1 sibling, 0 replies; 26+ messages in thread
From: Eric Dumazet @ 2016-03-14 20:23 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Saeed Mahameed, David S. Miller, Linux Netdev List, Or Gerlitz,
	Eran Ben Elisha, Tal Alon, Tariq Toukan, Jesper Dangaard Brouer

On Mon, 2016-03-14 at 20:16 +0200, Saeed Mahameed wrote:

> we can do special accounting for ooo like issues in the stack (maybe
> count page references and sum up page sizes as you suggest), device
> drivers shouldn't have special handling/accounting to protect against
> such cases.

The existing skb->truesize is doing this already.

The fact that some drivers use PAGE_SIZE/2 instead of PAGE_SIZE is a
heuristic that is mostly okay, and we accept the risk:

Even if a smart attack is happening, the host will consume 200 XB
instead of 100 XB.

But pretending to use 128 bytes is simply a dangerous weapon over your
head, since you end up consuming 1600 XB.

With tcp_mem[2] being 18% of physical memory, you end up consuming
all physical memory and crashing.

I can tell you that these kinds of attacks are very real. I've seen
them in action.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next 08/13] net/mlx5e: Add fragmented memory support for RX multi packet WQE
  2016-03-14 19:16           ` achiad shochat
@ 2016-03-14 20:26             ` Eric Dumazet
  2016-03-14 20:29             ` Eric Dumazet
  1 sibling, 0 replies; 26+ messages in thread
From: Eric Dumazet @ 2016-03-14 20:26 UTC (permalink / raw)
  To: achiad shochat
  Cc: Saeed Mahameed, Saeed Mahameed, David S. Miller,
	Linux Netdev List, Or Gerlitz, Eran Ben Elisha, Tal Alon,
	Tariq Toukan, Jesper Dangaard Brouer

On Mon, 2016-03-14 at 21:16 +0200, achiad shochat wrote:

> Eric, am I missing something here or the new scheme was not clear to
> you previously?

I simply do not want to see drivers using 

1) SKB_TRUESIZE()

or 

2)

   skb->truesize =  some_expression


Drivers should not assume they know better than the core networking stack.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next 08/13] net/mlx5e: Add fragmented memory support for RX multi packet WQE
  2016-03-14 19:16           ` achiad shochat
  2016-03-14 20:26             ` Eric Dumazet
@ 2016-03-14 20:29             ` Eric Dumazet
  1 sibling, 0 replies; 26+ messages in thread
From: Eric Dumazet @ 2016-03-14 20:29 UTC (permalink / raw)
  To: achiad shochat
  Cc: Saeed Mahameed, Saeed Mahameed, David S. Miller,
	Linux Netdev List, Or Gerlitz, Eran Ben Elisha, Tal Alon,
	Tariq Toukan, Jesper Dangaard Brouer

On Mon, 2016-03-14 at 21:16 +0200, achiad shochat wrote:

> I really do not see why the new scheme is more DOSable than the common
> scheme of pre-allocating SKB using napi_alloc_skb().

Because sizeof(skb_shared_info) is big enough that if you allocate 128
bytes, you end up using 512 bytes or even more.

In practice it is good enough.
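
(Rough arithmetic, assuming an x86-64 build of that era where
sizeof(struct skb_shared_info) is on the order of 320 bytes:
SKB_DATA_ALIGN(128) + SKB_DATA_ALIGN(320) ~= 448 bytes, which the
allocator rounds up to a 512-byte object.)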

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH net-next 06/13] net/mlx5e: Support RX multi-packet WQE (Striding RQ)
  2016-03-11 13:39 ` [PATCH net-next 06/13] net/mlx5e: Support RX multi-packet WQE (Striding RQ) Saeed Mahameed
@ 2016-03-14 21:33   ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 26+ messages in thread
From: Jesper Dangaard Brouer @ 2016-03-14 21:33 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, netdev, Or Gerlitz, Eran Ben Elisha, Tal Alon,
	Tariq Toukan, Achiad Shochat, brouer


On Fri, 11 Mar 2016 15:39:47 +0200 Saeed Mahameed <saeedm@mellanox.com> wrote:

> From: Tariq Toukan <tariqt@mellanox.com>
> 
> Introduce the feature of multi-packet WQE (RX Work Queue Element)
> referred to as (MPWQE or Striding RQ), in which WQEs are larger
> and serve multiple packets each.
> 
> Every WQE consists of many strides of the same size, every received
> packet is aligned to a beginning of a stride and is written to
> consecutive strides within a WQE.

I really like this HW support! :-)

I noticed the "Multi-Packet WQE" send format, but I could not find the
receive part in the programmers ref doc, until I started looking after
"stride".


> In the regular approach, each regular WQE is big enough to be capable
> of serving one received packet of any size up to MTU or 64K in case of
> device LRO is enabeled, making it very wasteful when dealing with
> small packets or device LRO is enabeled.
> 
> For its flexibility, MPWQE allows a better memory utilization (implying
> improvements in CPU utilization and packet rate) as packets consume
> strides according to their size, preserving the rest of the WQE to be
> available for other packets.

It does allow significantly better memory utilization (even if Eric
cannot see it, I can).

One issue with this approach is that we can no longer use the
packet data as the skb->data pointer.  (AFAIK because we cannot use
dma_unmap any longer, and instead we need to use dma_sync.)

Thus, for every single packet you are now allocating a new memory area
for skb->data.


> MPWQE default configuration:
> 	NUM WQEs = 16
> 	Strides Per WQE = 1024
> 	Stride Size = 128

> Performance tested on ConnectX4-Lx 50G.
> 
> * Netperf single TCP stream:
> - message size = 1024,  bw raised from ~12300 mbps to 14900 mbps (+20%)
> - message size = 65536, bw raised from ~21800 mbps to 33500 mbps (+50%)
> - with other message sized we saw some gain or no degradation.
> 
> * Netperf multi TCP stream:
> - No degradation, line rate reached.
> 
> * Pktgen: packet loss in bursts of N small messages (64byte), single
> stream
> - | num packets | packets loss before	| packets loss after
>   |	2K	|       ~ 1K		|	0
>   |	16K	|       ~13K 		|	0
>   |	32K	|	~29K		|      14K
> 
> As expected as the driver can recive as many small packets (<=128) as
> the number of total strides in the ring (default = 1024 * 16) vs. 1024
> (default ring size regardless of packets size) before this feautre.
> 
> Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
> Signed-off-by: Achiad Shochat <achiad@mellanox.com>
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en.h       |   71 +++++++++++-
>  .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |   15 ++-
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  109 +++++++++++++----
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    |  126 ++++++++++++++++++--
>  include/linux/mlx5/device.h                        |   39 ++++++-
>  include/linux/mlx5/mlx5_ifc.h                      |   13 ++-
>  6 files changed, 327 insertions(+), 46 deletions(-)
> 
[...]
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -76,6 +76,33 @@ err_free_skb:
>  	return -ENOMEM;
>  }
>  
> +int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
> +{
> +	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
> +	int ret = 0;
> +
> +	wi->dma_info.page = alloc_pages(GFP_ATOMIC | __GFP_COMP | __GFP_COLD,
> +					MLX5_MPWRQ_WQE_PAGE_ORDER);

An order-5 page = 131072 bytes, but we only alloc 16 of them.

> +	if (unlikely(!wi->dma_info.page))
> +		return -ENOMEM;
> +
> +	wi->dma_info.addr = dma_map_page(rq->pdev, wi->dma_info.page, 0,
> +					 rq->wqe_sz, PCI_DMA_FROMDEVICE);

Mapping the entire page is going to make PowerPC owners happy.

> +	if (dma_mapping_error(rq->pdev, wi->dma_info.addr)) {
> +		ret = -ENOMEM;
> +		goto err_put_page;
> +	}
> +
> +	wi->consumed_strides = 0;
> +	wqe->data.addr = cpu_to_be64(wi->dma_info.addr);
> +
> +	return 0;
> +
> +err_put_page:
> +	put_page(wi->dma_info.page);
> +	return ret;
> +}
> +
[...]
> +void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
> +{
> +	u16 cstrides       = mpwrq_get_cqe_consumed_strides(cqe);
> +	u16 stride_ix      = mpwrq_get_cqe_stride_index(cqe);
> +	u32 consumed_bytes = cstrides  * MLX5_MPWRQ_STRIDE_SIZE;
> +	u32 stride_offset  = stride_ix * MLX5_MPWRQ_STRIDE_SIZE;
> +	u16 wqe_id         = be16_to_cpu(cqe->wqe_id);
> +	struct mlx5e_mpw_info *wi = &rq->wqe_info[wqe_id];
> +	struct mlx5e_rx_wqe  *wqe = mlx5_wq_ll_get_wqe(&rq->wq, wqe_id);
> +	struct sk_buff *skb;
> +	u16 byte_cnt;
> +	u16 cqe_bcnt;
> +	u16 headlen;
> +
> +	wi->consumed_strides += cstrides;

Ok, accounting the N consumed strides, so we know when the whole WQE
has been used up.

> +
> +	if (unlikely((cqe->op_own >> 4) != MLX5_CQE_RESP_SEND)) {
> +		rq->stats.wqe_err++;
> +		goto mpwrq_cqe_out;
> +	}
> +
> +	if (mpwrq_is_filler_cqe(cqe)) {
> +		rq->stats.mpwqe_filler++;
> +		goto mpwrq_cqe_out;
> +	}
> +
> +	skb = netdev_alloc_skb(rq->netdev, MLX5_MPWRQ_SMALL_PACKET_THRESHOLD);
> +	if (unlikely(!skb))
> +		goto mpwrq_cqe_out;
> +
> +	dma_sync_single_for_cpu(rq->pdev, wi->dma_info.addr + stride_offset,
> +				consumed_bytes, DMA_FROM_DEVICE);
> +
> +	cqe_bcnt = mpwrq_get_cqe_byte_cnt(cqe);
> +	headlen = min_t(u16, MLX5_MPWRQ_SMALL_PACKET_THRESHOLD, cqe_bcnt);
> +	skb_copy_to_linear_data(skb,
> +				page_address(wi->dma_info.page) + stride_offset,
> +				headlen);
> +	skb_put(skb, headlen);
> +
> +	byte_cnt = cqe_bcnt - headlen;
> +	if (byte_cnt) {
> +		skb_frag_t *f0 = &skb_shinfo(skb)->frags[0];
> +
> +		skb_shinfo(skb)->nr_frags = 1;
> +
> +		skb->data_len  = byte_cnt;
> +		skb->len      += byte_cnt;
> +		skb->truesize  = SKB_TRUESIZE(skb->len);
> +
> +		get_page(wi->dma_info.page);
> +		skb_frag_set_page(skb, 0, wi->dma_info.page);
> +		skb_frag_size_set(f0, skb->data_len);
> +		f0->page_offset = stride_offset + headlen;
> +	}
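
Nice trick here, BTW: a copy-break scheme, where up to
MLX5_MPWRQ_SMALL_PACKET_THRESHOLD bytes are copied into the freshly
allocated linear part and any remainder stays in the big page as a
frag. Condensed (my summary of the hunk above, using skb_add_rx_frag
as an equivalent to the open-coded frag setup):

	headlen = min_t(u16, MLX5_MPWRQ_SMALL_PACKET_THRESHOLD, cqe_bcnt);
	skb_copy_to_linear_data(skb, page_address(page) + stride_offset,
				headlen);
	skb_put(skb, headlen);
	if (cqe_bcnt > headlen) {
		get_page(page);	/* the frag takes its own page reference */
		skb_add_rx_frag(skb, 0, page, stride_offset + headlen,
				cqe_bcnt - headlen, cqe_bcnt - headlen);
	}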
> +
> +	mlx5e_complete_rx_cqe(rq, cqe, cqe_bcnt, skb);
> +
> +mpwrq_cqe_out:
> +	if (likely(wi->consumed_strides < MLX5_MPWRQ_NUM_STRIDES))
> +		return;

Due to the return statement, we keep working on the same big page,
only dma_sync'ing what we need.

> +
> +	dma_unmap_page(rq->pdev, wi->dma_info.addr, rq->wqe_sz,
> +		       PCI_DMA_FROMDEVICE);

The page is only fully dma_unmap'ed after all stride entries have been
processed/consumed.
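
So the page lifetime rests purely on refcounting (as I read it):

	alloc_pages()            refcount = 1  (the driver's reference)
	get_page() per frag      +1 for every in-flight skb using the page
	put_page() at WQE done   driver drops its reference
	skb freed by consumer    page freed together with the last ref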

> +	put_page(wi->dma_info.page);
> +	mlx5_wq_ll_pop(&rq->wq, cqe->wqe_id, &wqe->next.next_wqe_index);
> +}
> +
>  int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
>  {
>  	struct mlx5e_rq *rq = container_of(cq, struct mlx5e_rq, cq);



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Thread overview: 26+ messages
2016-03-11 13:39 [PATCH net-next 00/13] Mellanox 100G mlx5 driver receive path optimizations Saeed Mahameed
2016-03-11 13:39 ` [PATCH net-next 01/13] net/mlx5: Refactor mlx5_core_mr to mkey Saeed Mahameed
2016-03-11 13:39 ` [PATCH net-next 02/13] net/mlx5: Introduce device queue counters Saeed Mahameed
2016-03-11 13:39 ` [PATCH net-next 03/13] net/mlx5e: Allocate set of queue counters per netdev Saeed Mahameed
2016-03-11 13:39 ` [PATCH net-next 04/13] net/mlx5e: Use only close NUMA node for default RSS Saeed Mahameed
2016-03-11 14:08   ` Sergei Shtylyov
2016-03-11 19:29     ` Saeed Mahameed
2016-03-11 13:39 ` [PATCH net-next 05/13] net/mlx5e: Use function pointers for RX data path handling Saeed Mahameed
2016-03-11 13:39 ` [PATCH net-next 06/13] net/mlx5e: Support RX multi-packet WQE (Striding RQ) Saeed Mahameed
2016-03-14 21:33   ` Jesper Dangaard Brouer
2016-03-11 13:39 ` [PATCH net-next 07/13] net/mlx5e: Added ICO SQs Saeed Mahameed
2016-03-11 13:39 ` [PATCH net-next 08/13] net/mlx5e: Add fragmented memory support for RX multi packet WQE Saeed Mahameed
2016-03-11 14:32   ` Eric Dumazet
2016-03-11 19:25     ` Saeed Mahameed
2016-03-11 19:58       ` Eric Dumazet
2016-03-13 10:29         ` achiad shochat
2016-03-14 18:16         ` Saeed Mahameed
2016-03-14 19:16           ` achiad shochat
2016-03-14 20:26             ` Eric Dumazet
2016-03-14 20:29             ` Eric Dumazet
2016-03-14 20:23           ` Eric Dumazet
2016-03-11 13:39 ` [PATCH net-next 09/13] net/mlx5e: Change RX moderation period to be based on CQE Saeed Mahameed
2016-03-11 13:39 ` [PATCH net-next 10/13] net/mlx5e: Use napi_alloc_skb for RX SKB allocations Saeed Mahameed
2016-03-11 13:39 ` [PATCH net-next 11/13] net/mlx5e: Prefetch next RX CQE Saeed Mahameed
2016-03-11 13:39 ` [PATCH net-next 12/13] net/mlx5e: Remove redundant barrier Saeed Mahameed
2016-03-11 13:39 ` [PATCH net-next 13/13] net/mlx5e: Add ethtool counter for RX SKB allocation failures Saeed Mahameed
