* [PATCH net-next 0/8] mlx5e XDP support
@ 2016-09-19 13:58 Tariq Toukan
  2016-09-19 13:58 ` [PATCH net-next 1/8] net/mlx5e: Build RX SKB on demand Tariq Toukan
                   ` (7 more replies)
  0 siblings, 8 replies; 32+ messages in thread
From: Tariq Toukan @ 2016-09-19 13:58 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eran Ben Elisha, Saeed Mahameed, Rana Shahout, Tariq Toukan

This series adds XDP support in mlx5e driver.
This includes the use cases: XDP_DROP, XDP_PASS, and XDP_TX.

Single stream performance tests show 16.5 Mpps for XDP_DROP,
and 12.4 Mpps for XDP_TX, with nice scalability for multiple streams/rings.

This XDP_DROP rate is lower than the 32 Mpps we got in the previous
implementation, when Striding RQ was used.

We moved to a non-Striding RQ, as some XDP_TX requirements (like
headroom and packet-per-page) cannot be satisfied with the current
Striding RQ HW, and we decided to fully support both XDP_DROP and XDP_TX.

A few directions are being considered to restore the faster XDP_DROP
rate, e.g. a possibility for users to explicitly enable Striding RQ,
choosing optimized XDP_DROP at the price of partial XDP_TX
functionality, or some HW changes.

Series generated against net-next commit:
22da73492541 'net: r6040: add in missing white space in error message text'

Thanks,
Tariq

Rana Shahout (1):
  net/mlx5e: XDP fast RX drop bpf programs support

Saeed Mahameed (7):
  net/mlx5e: Build RX SKB on demand
  net/mlx5e: Union RQ RX info per RQ type
  net/mlx5e: Slightly reduce hardware LRO size
  net/mlx5e: Dynamic RQ type infrastructure
  net/mlx5e: Have a clear separation between different SQ types
  net/mlx5e: XDP TX forwarding support
  net/mlx5e: XDP TX xmit more

 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  77 +++-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 462 ++++++++++++++++-----
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    | 381 +++++++++++------
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |  12 +
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c    |  63 ++-
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  |  65 ++-
 6 files changed, 784 insertions(+), 276 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH net-next 1/8] net/mlx5e: Build RX SKB on demand
  2016-09-19 13:58 [PATCH net-next 0/8] mlx5e XDP support Tariq Toukan
@ 2016-09-19 13:58 ` Tariq Toukan
  2016-09-19 13:58 ` [PATCH net-next 2/8] net/mlx5e: Union RQ RX info per RQ type Tariq Toukan
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Tariq Toukan @ 2016-09-19 13:58 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eran Ben Elisha, Saeed Mahameed, Rana Shahout, Tariq Toukan

From: Saeed Mahameed <saeedm@mellanox.com>

For the non-striding RQ configuration, before this patch we had a ring
of pre-allocated SKBs and mapped the SKB->data buffers for the device.

For robustness and better RX data buffer management, we now allocate a
page per packet and build_skb around it.

This patch (which is a prerequisite for XDP) would, on its own, reduce
performance for normal stack usage, because we now hit a bottleneck in
the page allocator. We use a page-cache to restore, or even improve,
performance in comparison to the old RX scheme.
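
In essence, the per-packet RX flow becomes the following (a minimal
sketch of the scheme, not the exact driver code; pdev, order, headroom
and byte_cnt stand in for the driver's fields, and error handling and
the page-cache lookup are omitted):

	page = dev_alloc_pages(order);	/* one (compound) page per packet */
	addr = dma_map_page(pdev, page, 0, PAGE_SIZE << order,
			    DMA_FROM_DEVICE);
	/* ... HW DMAs the packet to addr + headroom ... */
	skb = build_skb(page_address(page), PAGE_SIZE << order);
	skb_reserve(skb, headroom);	/* skip the reserved headroom */
	skb_put(skb, byte_cnt);		/* expose the received bytes */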

Packet rate performance testing was done with pktgen 64B packets on the
xmit side and a TC ingress dropping action on the RX side.

CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Comparison is done between:
 1. Baseline, before 'net/mlx5e: Build RX SKB on demand'
 2. Build SKB with RX page cache (this patch)

RX Cores  Baseline    Build SKB+page-cache    Improvement
-----------------------------------------------------------
1          4.16Mpps       5.33Mpps                28%
2          7.16Mpps      10.24Mpps                43%
4         13.61Mpps      20.51Mpps                51%
8         25.32Mpps      32.00Mpps                26%

All respective cores were 100% utilized.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |  10 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |  31 +++-
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   | 215 +++++++++++-----------
 3 files changed, 133 insertions(+), 123 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 7dd4763e726e..4d06c1b9348b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -65,6 +65,8 @@
 #define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW            0x3
 #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW            0x6
 
+#define MLX5_RX_HEADROOM NET_SKB_PAD
+
 #define MLX5_MPWRQ_LOG_STRIDE_SIZE		6  /* >= 6, HW restriction */
 #define MLX5_MPWRQ_LOG_STRIDE_SIZE_CQE_COMPRESS	8  /* >= 6, HW restriction */
 #define MLX5_MPWRQ_LOG_WQE_SZ			18
@@ -302,10 +304,14 @@ struct mlx5e_page_cache {
 struct mlx5e_rq {
 	/* data path */
 	struct mlx5_wq_ll      wq;
-	u32                    wqe_sz;
-	struct sk_buff       **skb;
+
+	struct mlx5e_dma_info *dma_info;
 	struct mlx5e_mpw_info *wqe_info;
 	void                  *mtt_no_align;
+	struct {
+		u8             page_order;
+		u32            wqe_sz;    /* wqe data buffer size */
+	} buff;
 	__be32                 mkey_be;
 
 	struct device         *pdev;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 8595b507e200..d09588b54c38 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -408,6 +408,8 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 	void *rqc = param->rqc;
 	void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq);
 	u32 byte_count;
+	u32 frag_sz;
+	int npages;
 	int wq_sz;
 	int err;
 	int i;
@@ -442,29 +444,40 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 
 		rq->mpwqe_stride_sz = BIT(priv->params.mpwqe_log_stride_sz);
 		rq->mpwqe_num_strides = BIT(priv->params.mpwqe_log_num_strides);
-		rq->wqe_sz = rq->mpwqe_stride_sz * rq->mpwqe_num_strides;
-		byte_count = rq->wqe_sz;
+
+		rq->buff.wqe_sz = rq->mpwqe_stride_sz * rq->mpwqe_num_strides;
+		byte_count = rq->buff.wqe_sz;
 		rq->mkey_be = cpu_to_be32(c->priv->umr_mkey.key);
 		err = mlx5e_rq_alloc_mpwqe_info(rq, c);
 		if (err)
 			goto err_rq_wq_destroy;
 		break;
 	default: /* MLX5_WQ_TYPE_LINKED_LIST */
-		rq->skb = kzalloc_node(wq_sz * sizeof(*rq->skb), GFP_KERNEL,
-				       cpu_to_node(c->cpu));
-		if (!rq->skb) {
+		rq->dma_info = kzalloc_node(wq_sz * sizeof(*rq->dma_info),
+					    GFP_KERNEL, cpu_to_node(c->cpu));
+		if (!rq->dma_info) {
 			err = -ENOMEM;
 			goto err_rq_wq_destroy;
 		}
+
 		rq->handle_rx_cqe = mlx5e_handle_rx_cqe;
 		rq->alloc_wqe = mlx5e_alloc_rx_wqe;
 		rq->dealloc_wqe = mlx5e_dealloc_rx_wqe;
 
-		rq->wqe_sz = (priv->params.lro_en) ?
+		rq->buff.wqe_sz = (priv->params.lro_en) ?
 				priv->params.lro_wqe_sz :
 				MLX5E_SW2HW_MTU(priv->netdev->mtu);
-		rq->wqe_sz = SKB_DATA_ALIGN(rq->wqe_sz);
-		byte_count = rq->wqe_sz;
+		byte_count = rq->buff.wqe_sz;
+
+		/* calc the required page order */
+		frag_sz = MLX5_RX_HEADROOM +
+			  byte_count /* packet data */ +
+			  SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+		frag_sz = SKB_DATA_ALIGN(frag_sz);
+
+		npages = DIV_ROUND_UP(frag_sz, PAGE_SIZE);
+		rq->buff.page_order = order_base_2(npages);
+
 		byte_count |= MLX5_HW_START_PADDING;
 		rq->mkey_be = c->mkey_be;
 	}
@@ -499,7 +512,7 @@ static void mlx5e_destroy_rq(struct mlx5e_rq *rq)
 		mlx5e_rq_free_mpwqe_info(rq);
 		break;
 	default: /* MLX5_WQ_TYPE_LINKED_LIST */
-		kfree(rq->skb);
+		kfree(rq->dma_info);
 	}
 
 	for (i = rq->page_cache.head; i != rq->page_cache.tail;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index dc8677933f76..d017829b77eb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -179,50 +179,99 @@ unlock:
 	mutex_unlock(&priv->state_lock);
 }
 
-int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
+#define RQ_PAGE_SIZE(rq) ((1 << rq->buff.page_order) << PAGE_SHIFT)
+
+static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
+				      struct mlx5e_dma_info *dma_info)
 {
-	struct sk_buff *skb;
-	dma_addr_t dma_addr;
+	struct mlx5e_page_cache *cache = &rq->page_cache;
+	u32 tail_next = (cache->tail + 1) & (MLX5E_CACHE_SIZE - 1);
 
-	skb = napi_alloc_skb(rq->cq.napi, rq->wqe_sz);
-	if (unlikely(!skb))
-		return -ENOMEM;
+	if (tail_next == cache->head) {
+		rq->stats.cache_full++;
+		return false;
+	}
+
+	cache->page_cache[cache->tail] = *dma_info;
+	cache->tail = tail_next;
+	return true;
+}
+
+static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq,
+				      struct mlx5e_dma_info *dma_info)
+{
+	struct mlx5e_page_cache *cache = &rq->page_cache;
+
+	if (unlikely(cache->head == cache->tail)) {
+		rq->stats.cache_empty++;
+		return false;
+	}
+
+	if (page_ref_count(cache->page_cache[cache->head].page) != 1) {
+		rq->stats.cache_busy++;
+		return false;
+	}
+
+	*dma_info = cache->page_cache[cache->head];
+	cache->head = (cache->head + 1) & (MLX5E_CACHE_SIZE - 1);
+	rq->stats.cache_reuse++;
+
+	dma_sync_single_for_device(rq->pdev, dma_info->addr,
+				   RQ_PAGE_SIZE(rq),
+				   DMA_FROM_DEVICE);
+	return true;
+}
 
-	dma_addr = dma_map_single(rq->pdev,
-				  /* hw start padding */
-				  skb->data,
-				  /* hw end padding */
-				  rq->wqe_sz,
-				  DMA_FROM_DEVICE);
+static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
+					  struct mlx5e_dma_info *dma_info)
+{
+	struct page *page;
 
-	if (unlikely(dma_mapping_error(rq->pdev, dma_addr)))
-		goto err_free_skb;
+	if (mlx5e_rx_cache_get(rq, dma_info))
+		return 0;
 
-	*((dma_addr_t *)skb->cb) = dma_addr;
-	wqe->data.addr = cpu_to_be64(dma_addr);
+	page = dev_alloc_pages(rq->buff.page_order);
+	if (unlikely(!page))
+		return -ENOMEM;
 
-	rq->skb[ix] = skb;
+	dma_info->page = page;
+	dma_info->addr = dma_map_page(rq->pdev, page, 0,
+				      RQ_PAGE_SIZE(rq), DMA_FROM_DEVICE);
+	if (unlikely(dma_mapping_error(rq->pdev, dma_info->addr))) {
+		put_page(page);
+		return -ENOMEM;
+	}
 
 	return 0;
+}
 
-err_free_skb:
-	dev_kfree_skb(skb);
+void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
+			bool recycle)
+{
+	if (likely(recycle) && mlx5e_rx_cache_put(rq, dma_info))
+		return;
+
+	dma_unmap_page(rq->pdev, dma_info->addr, RQ_PAGE_SIZE(rq),
+		       DMA_FROM_DEVICE);
+	put_page(dma_info->page);
+}
+
+int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
+{
+	struct mlx5e_dma_info *di = &rq->dma_info[ix];
 
-	return -ENOMEM;
+	if (unlikely(mlx5e_page_alloc_mapped(rq, di)))
+		return -ENOMEM;
+
+	wqe->data.addr = cpu_to_be64(di->addr + MLX5_RX_HEADROOM);
+	return 0;
 }
 
 void mlx5e_dealloc_rx_wqe(struct mlx5e_rq *rq, u16 ix)
 {
-	struct sk_buff *skb = rq->skb[ix];
+	struct mlx5e_dma_info *di = &rq->dma_info[ix];
 
-	if (skb) {
-		rq->skb[ix] = NULL;
-		dma_unmap_single(rq->pdev,
-				 *((dma_addr_t *)skb->cb),
-				 rq->wqe_sz,
-				 DMA_FROM_DEVICE);
-		dev_kfree_skb(skb);
-	}
+	mlx5e_page_release(rq, di, true);
 }
 
 static inline int mlx5e_mpwqe_strides_per_page(struct mlx5e_rq *rq)
@@ -305,79 +354,6 @@ static inline void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
 	mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
 }
 
-static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
-				      struct mlx5e_dma_info *dma_info)
-{
-	struct mlx5e_page_cache *cache = &rq->page_cache;
-	u32 tail_next = (cache->tail + 1) & (MLX5E_CACHE_SIZE - 1);
-
-	if (tail_next == cache->head) {
-		rq->stats.cache_full++;
-		return false;
-	}
-
-	cache->page_cache[cache->tail] = *dma_info;
-	cache->tail = tail_next;
-	return true;
-}
-
-static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq,
-				      struct mlx5e_dma_info *dma_info)
-{
-	struct mlx5e_page_cache *cache = &rq->page_cache;
-
-	if (unlikely(cache->head == cache->tail)) {
-		rq->stats.cache_empty++;
-		return false;
-	}
-
-	if (page_ref_count(cache->page_cache[cache->head].page) != 1) {
-		rq->stats.cache_busy++;
-		return false;
-	}
-
-	*dma_info = cache->page_cache[cache->head];
-	cache->head = (cache->head + 1) & (MLX5E_CACHE_SIZE - 1);
-	rq->stats.cache_reuse++;
-
-	dma_sync_single_for_device(rq->pdev, dma_info->addr, PAGE_SIZE,
-				   DMA_FROM_DEVICE);
-	return true;
-}
-
-static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
-					  struct mlx5e_dma_info *dma_info)
-{
-	struct page *page;
-
-	if (mlx5e_rx_cache_get(rq, dma_info))
-		return 0;
-
-	page = dev_alloc_page();
-	if (unlikely(!page))
-		return -ENOMEM;
-
-	dma_info->page = page;
-	dma_info->addr = dma_map_page(rq->pdev, page, 0, PAGE_SIZE,
-				      DMA_FROM_DEVICE);
-	if (unlikely(dma_mapping_error(rq->pdev, dma_info->addr))) {
-		put_page(page);
-		return -ENOMEM;
-	}
-
-	return 0;
-}
-
-void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
-			bool recycle)
-{
-	if (likely(recycle) && mlx5e_rx_cache_put(rq, dma_info))
-		return;
-
-	dma_unmap_page(rq->pdev, dma_info->addr, PAGE_SIZE, DMA_FROM_DEVICE);
-	put_page(dma_info->page);
-}
-
 static int mlx5e_alloc_rx_umr_mpwqe(struct mlx5e_rq *rq,
 				    struct mlx5e_rx_wqe *wqe,
 				    u16 ix)
@@ -448,7 +424,7 @@ void mlx5e_post_rx_mpwqe(struct mlx5e_rq *rq)
 	mlx5_wq_ll_update_db_record(wq);
 }
 
-int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe,	u16 ix)
+int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
 {
 	int err;
 
@@ -658,31 +634,46 @@ static inline void mlx5e_complete_rx_cqe(struct mlx5e_rq *rq,
 
 void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 {
+	struct mlx5e_dma_info *di;
 	struct mlx5e_rx_wqe *wqe;
-	struct sk_buff *skb;
 	__be16 wqe_counter_be;
+	struct sk_buff *skb;
 	u16 wqe_counter;
 	u32 cqe_bcnt;
+	void *va;
 
 	wqe_counter_be = cqe->wqe_counter;
 	wqe_counter    = be16_to_cpu(wqe_counter_be);
 	wqe            = mlx5_wq_ll_get_wqe(&rq->wq, wqe_counter);
-	skb            = rq->skb[wqe_counter];
-	prefetch(skb->data);
-	rq->skb[wqe_counter] = NULL;
+	di             = &rq->dma_info[wqe_counter];
+	va             = page_address(di->page);
 
-	dma_unmap_single(rq->pdev,
-			 *((dma_addr_t *)skb->cb),
-			 rq->wqe_sz,
-			 DMA_FROM_DEVICE);
+	dma_sync_single_range_for_cpu(rq->pdev,
+				      di->addr,
+				      MLX5_RX_HEADROOM,
+				      rq->buff.wqe_sz,
+				      DMA_FROM_DEVICE);
+	prefetch(va + MLX5_RX_HEADROOM);
 
 	if (unlikely((cqe->op_own >> 4) != MLX5_CQE_RESP_SEND)) {
 		rq->stats.wqe_err++;
-		dev_kfree_skb(skb);
+		mlx5e_page_release(rq, di, true);
 		goto wq_ll_pop;
 	}
 
+	skb = build_skb(va, RQ_PAGE_SIZE(rq));
+	if (unlikely(!skb)) {
+		rq->stats.buff_alloc_err++;
+		mlx5e_page_release(rq, di, true);
+		goto wq_ll_pop;
+	}
+
+	/* queue up for recycling ..*/
+	page_ref_inc(di->page);
+	mlx5e_page_release(rq, di, true);
+
 	cqe_bcnt = be32_to_cpu(cqe->byte_cnt);
+	skb_reserve(skb, MLX5_RX_HEADROOM);
 	skb_put(skb, cqe_bcnt);
 
 	mlx5e_complete_rx_cqe(rq, cqe, cqe_bcnt, skb);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH net-next 2/8] net/mlx5e: Union RQ RX info per RQ type
  2016-09-19 13:58 [PATCH net-next 0/8] mlx5e XDP support Tariq Toukan
  2016-09-19 13:58 ` [PATCH net-next 1/8] net/mlx5e: Build RX SKB on demand Tariq Toukan
@ 2016-09-19 13:58 ` Tariq Toukan
  2016-09-19 13:58 ` [PATCH net-next 3/8] net/mlx5e: Slightly reduce hardware LRO size Tariq Toukan
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Tariq Toukan @ 2016-09-19 13:58 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eran Ben Elisha, Saeed Mahameed, Rana Shahout, Tariq Toukan

From: Saeed Mahameed <saeedm@mellanox.com>

We have two types of RX RQs, and they use two separate sets of info
arrays and structures in the RX data path functions. Those structures
are mutually exclusive per RQ type, hence only one kind is allocated
on RQ creation, according to the RQ type.

For better cache locality and to minimize the
sizeof(struct mlx5e_rq), in this patch we define them as a union.
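
For illustration, every access to the overlaid fields is gated on the
RQ type, so only the active union member is ever touched (a sketch
using names from this series):

	switch (rq->wq_type) {
	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
		wi = &rq->mpwqe.info[ix];	/* striding RQ bookkeeping */
		break;
	default: /* MLX5_WQ_TYPE_LINKED_LIST */
		di = &rq->dma_info[ix];		/* page-per-packet bookkeeping */
	}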

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h      | 14 ++++++----
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 32 +++++++++++------------
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   | 10 +++----
 3 files changed, 30 insertions(+), 26 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 4d06c1b9348b..e3331237ce0e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -305,9 +305,14 @@ struct mlx5e_rq {
 	/* data path */
 	struct mlx5_wq_ll      wq;
 
-	struct mlx5e_dma_info *dma_info;
-	struct mlx5e_mpw_info *wqe_info;
-	void                  *mtt_no_align;
+	union {
+		struct mlx5e_dma_info *dma_info;
+		struct {
+			struct mlx5e_mpw_info *info;
+			void                  *mtt_no_align;
+			u32                    mtt_offset;
+		} mpwqe;
+	};
 	struct {
 		u8             page_order;
 		u32            wqe_sz;    /* wqe data buffer size */
@@ -327,7 +332,6 @@ struct mlx5e_rq {
 
 	unsigned long          state;
 	int                    ix;
-	u32                    mpwqe_mtt_offset;
 
 	struct mlx5e_rx_am     am; /* Adaptive Moderation */
 
@@ -770,7 +774,7 @@ static inline void mlx5e_cq_arm(struct mlx5e_cq *cq)
 
 static inline u32 mlx5e_get_wqe_mtt_offset(struct mlx5e_rq *rq, u16 wqe_ix)
 {
-	return rq->mpwqe_mtt_offset +
+	return rq->mpwqe.mtt_offset +
 		wqe_ix * ALIGN(MLX5_MPWRQ_PAGES_PER_WQE, 8);
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index d09588b54c38..f2efa532f453 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -314,7 +314,7 @@ static inline void mlx5e_build_umr_wqe(struct mlx5e_rq *rq, struct mlx5e_sq *sq,
 	struct mlx5_wqe_ctrl_seg      *cseg = &wqe->ctrl;
 	struct mlx5_wqe_umr_ctrl_seg *ucseg = &wqe->uctrl;
 	struct mlx5_wqe_data_seg      *dseg = &wqe->data;
-	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
+	struct mlx5e_mpw_info *wi = &rq->mpwqe.info[ix];
 	u8 ds_cnt = DIV_ROUND_UP(sizeof(*wqe), MLX5_SEND_WQE_DS);
 	u32 umr_wqe_mtt_offset = mlx5e_get_wqe_mtt_offset(rq, ix);
 
@@ -342,21 +342,21 @@ static int mlx5e_rq_alloc_mpwqe_info(struct mlx5e_rq *rq,
 	int mtt_alloc = mtt_sz + MLX5_UMR_ALIGN - 1;
 	int i;
 
-	rq->wqe_info = kzalloc_node(wq_sz * sizeof(*rq->wqe_info),
-				    GFP_KERNEL, cpu_to_node(c->cpu));
-	if (!rq->wqe_info)
+	rq->mpwqe.info = kzalloc_node(wq_sz * sizeof(*rq->mpwqe.info),
+				      GFP_KERNEL, cpu_to_node(c->cpu));
+	if (!rq->mpwqe.info)
 		goto err_out;
 
 	/* We allocate more than mtt_sz as we will align the pointer */
-	rq->mtt_no_align = kzalloc_node(mtt_alloc * wq_sz, GFP_KERNEL,
+	rq->mpwqe.mtt_no_align = kzalloc_node(mtt_alloc * wq_sz, GFP_KERNEL,
 					cpu_to_node(c->cpu));
-	if (unlikely(!rq->mtt_no_align))
+	if (unlikely(!rq->mpwqe.mtt_no_align))
 		goto err_free_wqe_info;
 
 	for (i = 0; i < wq_sz; i++) {
-		struct mlx5e_mpw_info *wi = &rq->wqe_info[i];
+		struct mlx5e_mpw_info *wi = &rq->mpwqe.info[i];
 
-		wi->umr.mtt = PTR_ALIGN(rq->mtt_no_align + i * mtt_alloc,
+		wi->umr.mtt = PTR_ALIGN(rq->mpwqe.mtt_no_align + i * mtt_alloc,
 					MLX5_UMR_ALIGN);
 		wi->umr.mtt_addr = dma_map_single(c->pdev, wi->umr.mtt, mtt_sz,
 						  PCI_DMA_TODEVICE);
@@ -370,14 +370,14 @@ static int mlx5e_rq_alloc_mpwqe_info(struct mlx5e_rq *rq,
 
 err_unmap_mtts:
 	while (--i >= 0) {
-		struct mlx5e_mpw_info *wi = &rq->wqe_info[i];
+		struct mlx5e_mpw_info *wi = &rq->mpwqe.info[i];
 
 		dma_unmap_single(c->pdev, wi->umr.mtt_addr, mtt_sz,
 				 PCI_DMA_TODEVICE);
 	}
-	kfree(rq->mtt_no_align);
+	kfree(rq->mpwqe.mtt_no_align);
 err_free_wqe_info:
-	kfree(rq->wqe_info);
+	kfree(rq->mpwqe.info);
 
 err_out:
 	return -ENOMEM;
@@ -390,13 +390,13 @@ static void mlx5e_rq_free_mpwqe_info(struct mlx5e_rq *rq)
 	int i;
 
 	for (i = 0; i < wq_sz; i++) {
-		struct mlx5e_mpw_info *wi = &rq->wqe_info[i];
+		struct mlx5e_mpw_info *wi = &rq->mpwqe.info[i];
 
 		dma_unmap_single(rq->pdev, wi->umr.mtt_addr, mtt_sz,
 				 PCI_DMA_TODEVICE);
 	}
-	kfree(rq->mtt_no_align);
-	kfree(rq->wqe_info);
+	kfree(rq->mpwqe.mtt_no_align);
+	kfree(rq->mpwqe.info);
 }
 
 static int mlx5e_create_rq(struct mlx5e_channel *c,
@@ -439,7 +439,7 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 		rq->alloc_wqe = mlx5e_alloc_rx_mpwqe;
 		rq->dealloc_wqe = mlx5e_dealloc_rx_mpwqe;
 
-		rq->mpwqe_mtt_offset = c->ix *
+		rq->mpwqe.mtt_offset = c->ix *
 			MLX5E_REQUIRED_MTTS(1, BIT(priv->params.log_rq_size));
 
 		rq->mpwqe_stride_sz = BIT(priv->params.mpwqe_log_stride_sz);
@@ -654,7 +654,7 @@ static void mlx5e_free_rx_descs(struct mlx5e_rq *rq)
 
 	/* UMR WQE (if in progress) is always at wq->head */
 	if (test_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state))
-		mlx5e_free_rx_mpwqe(rq, &rq->wqe_info[wq->head]);
+		mlx5e_free_rx_mpwqe(rq, &rq->mpwqe.info[wq->head]);
 
 	while (!mlx5_wq_ll_is_empty(wq)) {
 		wqe_ix_be = *wq->tail_next;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index d017829b77eb..a403a797ebef 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -328,7 +328,7 @@ mlx5e_copy_skb_header_mpwqe(struct device *pdev,
 
 static inline void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
 {
-	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
+	struct mlx5e_mpw_info *wi = &rq->mpwqe.info[ix];
 	struct mlx5e_sq *sq = &rq->channel->icosq;
 	struct mlx5_wq_cyc *wq = &sq->wq;
 	struct mlx5e_umr_wqe *wqe;
@@ -358,7 +358,7 @@ static int mlx5e_alloc_rx_umr_mpwqe(struct mlx5e_rq *rq,
 				    struct mlx5e_rx_wqe *wqe,
 				    u16 ix)
 {
-	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
+	struct mlx5e_mpw_info *wi = &rq->mpwqe.info[ix];
 	u64 dma_offset = (u64)mlx5e_get_wqe_mtt_offset(rq, ix) << PAGE_SHIFT;
 	int pg_strides = mlx5e_mpwqe_strides_per_page(rq);
 	int err;
@@ -412,7 +412,7 @@ void mlx5e_post_rx_mpwqe(struct mlx5e_rq *rq)
 	clear_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state);
 
 	if (unlikely(test_bit(MLX5E_RQ_STATE_FLUSH, &rq->state))) {
-		mlx5e_free_rx_mpwqe(rq, &rq->wqe_info[wq->head]);
+		mlx5e_free_rx_mpwqe(rq, &rq->mpwqe.info[wq->head]);
 		return;
 	}
 
@@ -438,7 +438,7 @@ int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
 
 void mlx5e_dealloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 {
-	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
+	struct mlx5e_mpw_info *wi = &rq->mpwqe.info[ix];
 
 	mlx5e_free_rx_mpwqe(rq, wi);
 }
@@ -725,7 +725,7 @@ void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 {
 	u16 cstrides       = mpwrq_get_cqe_consumed_strides(cqe);
 	u16 wqe_id         = be16_to_cpu(cqe->wqe_id);
-	struct mlx5e_mpw_info *wi = &rq->wqe_info[wqe_id];
+	struct mlx5e_mpw_info *wi = &rq->mpwqe.info[wqe_id];
 	struct mlx5e_rx_wqe  *wqe = mlx5_wq_ll_get_wqe(&rq->wq, wqe_id);
 	struct sk_buff *skb;
 	u16 cqe_bcnt;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH net-next 3/8] net/mlx5e: Slightly reduce hardware LRO size
  2016-09-19 13:58 [PATCH net-next 0/8] mlx5e XDP support Tariq Toukan
  2016-09-19 13:58 ` [PATCH net-next 1/8] net/mlx5e: Build RX SKB on demand Tariq Toukan
  2016-09-19 13:58 ` [PATCH net-next 2/8] net/mlx5e: Union RQ RX info per RQ type Tariq Toukan
@ 2016-09-19 13:58 ` Tariq Toukan
  2016-09-19 13:58 ` [PATCH net-next 4/8] net/mlx5e: Dynamic RQ type infrastructure Tariq Toukan
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Tariq Toukan @ 2016-09-19 13:58 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eran Ben Elisha, Saeed Mahameed, Rana Shahout, Tariq Toukan

From: Saeed Mahameed <saeedm@mellanox.com>

Before this patch the LRO size was 64K. Now that build_skb requires
extra room, the headroom + sizeof(struct skb_shared_info) added to the
data buffer makes the wqe size (page_frag_size) slightly larger than
64K, which would demand an order-5 page instead of order-4 on 4K-page
systems.

We take those extra bytes off the hardware LRO data size so as not to
increase the required page order when hardware LRO is enabled.
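
A back-of-envelope calculation, following the frag_sz formula from
patch 1 (4K pages; MLX5_RX_HEADROOM = 64 and a ~320B skb_shared_info
are x86_64 assumptions and vary with kernel config):

	frag_sz = MLX5_RX_HEADROOM				  /*    64 */
		+ 65536						  /* 64K LRO */
		+ SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); /* ~ 320 */
	/* 65920 bytes -> DIV_ROUND_UP(65920, 4096) = 17 pages
	 * -> order_base_2(17) = 5, i.e. a 128K allocation.
	 * Trimming lro_wqe_sz by those ~384 extra bytes keeps frag_sz
	 * at or below 64K -> 16 pages -> order 4, as before.
	 */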

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index f2efa532f453..8734240865e1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3187,8 +3187,11 @@ static void mlx5e_build_nic_netdev_priv(struct mlx5_core_dev *mdev,
 	mlx5e_build_default_indir_rqt(mdev, priv->params.indirection_rqt,
 				      MLX5E_INDIR_RQT_SIZE, profile->max_nch(mdev));
 
-	priv->params.lro_wqe_sz            =
-		MLX5E_PARAMS_DEFAULT_LRO_WQE_SZ;
+	priv->params.lro_wqe_sz =
+		MLX5E_PARAMS_DEFAULT_LRO_WQE_SZ -
+		/* Extra room needed for build_skb */
+		MLX5_RX_HEADROOM -
+		SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 
 	/* Initialize pflags */
 	MLX5E_SET_PRIV_FLAG(priv, MLX5E_PFLAG_RX_CQE_BASED_MODER,
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH net-next 4/8] net/mlx5e: Dynamic RQ type infrastructure
  2016-09-19 13:58 [PATCH net-next 0/8] mlx5e XDP support Tariq Toukan
                   ` (2 preceding siblings ...)
  2016-09-19 13:58 ` [PATCH net-next 3/8] net/mlx5e: Slightly reduce hardware LRO size Tariq Toukan
@ 2016-09-19 13:58 ` Tariq Toukan
  2016-09-19 13:58 ` [PATCH net-next 5/8] net/mlx5e: XDP fast RX drop bpf programs support Tariq Toukan
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Tariq Toukan @ 2016-09-19 13:58 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eran Ben Elisha, Saeed Mahameed, Rana Shahout, Tariq Toukan

From: Saeed Mahameed <saeedm@mellanox.com>

Add two helper functions to allow dynamic changes of RQ type.

mlx5e_set_rq_priv_params and mlx5e_set_rq_type_params will be
used on netdev creation to determine the default RQ type.

This will be needed later by the downstream XDP support patches: when
enabling XDP we will dynamically move from the striding RQ type to the
linked-list RQ type.
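
As a sketch of the intended switch (the xdp_attached condition here is
hypothetical; the actual XDP check lands in a later patch of this
series):

	/* striding RQ when the HW supports it and no XDP program is
	 * attached, linked-list RQ otherwise
	 */
	u8 rq_type = mlx5e_check_fragmented_striding_rq_cap(priv->mdev) &&
		     !xdp_attached ?
		     MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ :
		     MLX5_WQ_TYPE_LINKED_LIST;

	mlx5e_set_rq_type_params(priv, rq_type);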

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 92 ++++++++++++-----------
 1 file changed, 50 insertions(+), 42 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 8734240865e1..ff520ad27f5e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -69,6 +69,47 @@ struct mlx5e_channel_param {
 	struct mlx5e_cq_param      icosq_cq;
 };
 
+static bool mlx5e_check_fragmented_striding_rq_cap(struct mlx5_core_dev *mdev)
+{
+	return MLX5_CAP_GEN(mdev, striding_rq) &&
+		MLX5_CAP_GEN(mdev, umr_ptr_rlky) &&
+		MLX5_CAP_ETH(mdev, reg_umr_sq);
+}
+
+static void mlx5e_set_rq_type_params(struct mlx5e_priv *priv, u8 rq_type)
+{
+	priv->params.rq_wq_type = rq_type;
+	switch (priv->params.rq_wq_type) {
+	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
+		priv->params.log_rq_size = MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW;
+		priv->params.mpwqe_log_stride_sz = priv->params.rx_cqe_compress ?
+			MLX5_MPWRQ_LOG_STRIDE_SIZE_CQE_COMPRESS :
+			MLX5_MPWRQ_LOG_STRIDE_SIZE;
+		priv->params.mpwqe_log_num_strides = MLX5_MPWRQ_LOG_WQE_SZ -
+			priv->params.mpwqe_log_stride_sz;
+		break;
+	default: /* MLX5_WQ_TYPE_LINKED_LIST */
+		priv->params.log_rq_size = MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE;
+	}
+	priv->params.min_rx_wqes = mlx5_min_rx_wqes(priv->params.rq_wq_type,
+					       BIT(priv->params.log_rq_size));
+
+	mlx5_core_info(priv->mdev,
+		       "MLX5E: StrdRq(%d) RqSz(%ld) StrdSz(%ld) RxCqeCmprss(%d)\n",
+		       priv->params.rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ,
+		       BIT(priv->params.log_rq_size),
+		       BIT(priv->params.mpwqe_log_stride_sz),
+		       priv->params.rx_cqe_compress_admin);
+}
+
+static void mlx5e_set_rq_priv_params(struct mlx5e_priv *priv)
+{
+	u8 rq_type = mlx5e_check_fragmented_striding_rq_cap(priv->mdev) ?
+		    MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ :
+		    MLX5_WQ_TYPE_LINKED_LIST;
+	mlx5e_set_rq_type_params(priv, rq_type);
+}
+
 static void mlx5e_update_carrier(struct mlx5e_priv *priv)
 {
 	struct mlx5_core_dev *mdev = priv->mdev;
@@ -3038,13 +3079,6 @@ void mlx5e_build_default_indir_rqt(struct mlx5_core_dev *mdev,
 		indirection_rqt[i] = i % num_channels;
 }
 
-static bool mlx5e_check_fragmented_striding_rq_cap(struct mlx5_core_dev *mdev)
-{
-	return MLX5_CAP_GEN(mdev, striding_rq) &&
-		MLX5_CAP_GEN(mdev, umr_ptr_rlky) &&
-		MLX5_CAP_ETH(mdev, reg_umr_sq);
-}
-
 static int mlx5e_get_pci_bw(struct mlx5_core_dev *mdev, u32 *pci_bw)
 {
 	enum pcie_link_width width;
@@ -3124,11 +3158,13 @@ static void mlx5e_build_nic_netdev_priv(struct mlx5_core_dev *mdev,
 					 MLX5_CQ_PERIOD_MODE_START_FROM_CQE :
 					 MLX5_CQ_PERIOD_MODE_START_FROM_EQE;
 
-	priv->params.log_sq_size           =
-		MLX5E_PARAMS_DEFAULT_LOG_SQ_SIZE;
-	priv->params.rq_wq_type = mlx5e_check_fragmented_striding_rq_cap(mdev) ?
-		MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ :
-		MLX5_WQ_TYPE_LINKED_LIST;
+	priv->mdev                         = mdev;
+	priv->netdev                       = netdev;
+	priv->params.num_channels          = profile->max_nch(mdev);
+	priv->profile                      = profile;
+	priv->ppriv                        = ppriv;
+
+	priv->params.log_sq_size = MLX5E_PARAMS_DEFAULT_LOG_SQ_SIZE;
 
 	/* set CQE compression */
 	priv->params.rx_cqe_compress_admin = false;
@@ -3141,33 +3177,11 @@ static void mlx5e_build_nic_netdev_priv(struct mlx5_core_dev *mdev,
 		priv->params.rx_cqe_compress_admin =
 			cqe_compress_heuristic(link_speed, pci_bw);
 	}
-
 	priv->params.rx_cqe_compress = priv->params.rx_cqe_compress_admin;
 
-	switch (priv->params.rq_wq_type) {
-	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
-		priv->params.log_rq_size = MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW;
-		priv->params.mpwqe_log_stride_sz =
-			priv->params.rx_cqe_compress ?
-			MLX5_MPWRQ_LOG_STRIDE_SIZE_CQE_COMPRESS :
-			MLX5_MPWRQ_LOG_STRIDE_SIZE;
-		priv->params.mpwqe_log_num_strides = MLX5_MPWRQ_LOG_WQE_SZ -
-			priv->params.mpwqe_log_stride_sz;
+	mlx5e_set_rq_priv_params(priv);
+	if (priv->params.rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ)
 		priv->params.lro_en = true;
-		break;
-	default: /* MLX5_WQ_TYPE_LINKED_LIST */
-		priv->params.log_rq_size = MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE;
-	}
-
-	mlx5_core_info(mdev,
-		       "MLX5E: StrdRq(%d) RqSz(%ld) StrdSz(%ld) RxCqeCmprss(%d)\n",
-		       priv->params.rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ,
-		       BIT(priv->params.log_rq_size),
-		       BIT(priv->params.mpwqe_log_stride_sz),
-		       priv->params.rx_cqe_compress_admin);
-
-	priv->params.min_rx_wqes = mlx5_min_rx_wqes(priv->params.rq_wq_type,
-					    BIT(priv->params.log_rq_size));
 
 	priv->params.rx_am_enabled = MLX5_CAP_GEN(mdev, cq_moderation);
 	mlx5e_set_rx_cq_mode_params(&priv->params, cq_period_mode);
@@ -3197,12 +3211,6 @@ static void mlx5e_build_nic_netdev_priv(struct mlx5_core_dev *mdev,
 	MLX5E_SET_PRIV_FLAG(priv, MLX5E_PFLAG_RX_CQE_BASED_MODER,
 			    priv->params.rx_cq_period_mode == MLX5_CQ_PERIOD_MODE_START_FROM_CQE);
 
-	priv->mdev                         = mdev;
-	priv->netdev                       = netdev;
-	priv->params.num_channels          = profile->max_nch(mdev);
-	priv->profile                      = profile;
-	priv->ppriv                        = ppriv;
-
 #ifdef CONFIG_MLX5_CORE_EN_DCB
 	mlx5e_ets_init(priv);
 #endif
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH net-next 5/8] net/mlx5e: XDP fast RX drop bpf programs support
  2016-09-19 13:58 [PATCH net-next 0/8] mlx5e XDP support Tariq Toukan
                   ` (3 preceding siblings ...)
  2016-09-19 13:58 ` [PATCH net-next 4/8] net/mlx5e: Dynamic RQ type infrastructure Tariq Toukan
@ 2016-09-19 13:58 ` Tariq Toukan
  2016-09-19 13:58 ` [PATCH net-next 6/8] net/mlx5e: Have a clear separation between different SQ types Tariq Toukan
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Tariq Toukan @ 2016-09-19 13:58 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eran Ben Elisha, Saeed Mahameed, Rana Shahout, Tariq Toukan

From: Rana Shahout <ranas@mellanox.com>

Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx5e driver.
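
For reference, the kind of program exercised by the fast-drop test can
be as small as the following (a minimal sketch; the ELF section name
and the way the program is loaded depend on the userspace loader):

	#include <linux/bpf.h>

	__attribute__((section("xdp"), used))
	int xdp_drop(struct xdp_md *ctx)
	{
		return XDP_DROP;	/* drop every packet at driver RX */
	}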

When XDP is on, we make sure to change the channels' RQ type to
MLX5_WQ_TYPE_LINKED_LIST rather than the "striding RQ" type, to
ensure "page per packet".

On XDP set, we fail if HW LRO is enabled and ask the user to turn it
off first. Since on ConnectX4-LX HW LRO is always on by default, this
will be annoying, but we prefer not to enforce LRO off from the XDP
set function.

A full channels reset (close/open) is required only when setting XDP
on/off.

When XDP set is called just to exchange programs, we update each RQ's
xdp program on the fly. To synchronize with the current RX data path
activity of that RQ, we temporarily disable the RQ, ensure the RX path
is not running, then quickly update and re-enable it:
	- rq.state = disabled
	- napi_synchronize
	- xchg(rq->xdp_prog)
	- rq.state = enabled
	- napi_schedule // Just in case we've missed an IRQ

Packet rate performance testing was done with pktgen 64B packets on the
TX side, and a TC drop action on the RX side compared to XDP fast drop.

CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Comparison is done between:
	1. Baseline, before this patch, with TC drop action
	2. This patch with TC drop action
	3. This patch with XDP RX fast drop

RX Cores  Baseline(TC drop)    TC drop    XDP fast Drop
--------------------------------------------------------------
1            5.3Mpps           5.3Mpps     16.5Mpps
2           10.2Mpps          10.2Mpps     31.3Mpps
4           20.5Mpps          19.9Mpps     36.3Mpps*

*My transmitter was limited to 36.3Mpps, so it is the bottleneck.
It seems that the receive side can handle more.

Signed-off-by: Rana Shahout <ranas@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 100 ++++++++++++++++++++-
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    |  26 +++++-
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   4 +
 4 files changed, 130 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index e3331237ce0e..5e8e669f69c0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -334,6 +334,7 @@ struct mlx5e_rq {
 	int                    ix;
 
 	struct mlx5e_rx_am     am; /* Adaptive Moderation */
+	struct bpf_prog       *xdp_prog;
 
 	/* control */
 	struct mlx5_wq_ctrl    wq_ctrl;
@@ -627,6 +628,7 @@ struct mlx5e_priv {
 	/* priv data path fields - start */
 	struct mlx5e_sq            **txq_to_sq_map;
 	int channeltc_to_txq_map[MLX5E_MAX_NUM_CHANNELS][MLX5E_MAX_NUM_TC];
+	struct bpf_prog *xdp_prog;
 	/* priv data path fields - end */
 
 	unsigned long              state;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index ff520ad27f5e..6e95a16226f8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -34,6 +34,7 @@
 #include <net/pkt_cls.h>
 #include <linux/mlx5/fs.h>
 #include <net/vxlan.h>
+#include <linux/bpf.h>
 #include "en.h"
 #include "en_tc.h"
 #include "eswitch.h"
@@ -104,7 +105,8 @@ static void mlx5e_set_rq_type_params(struct mlx5e_priv *priv, u8 rq_type)
 
 static void mlx5e_set_rq_priv_params(struct mlx5e_priv *priv)
 {
-	u8 rq_type = mlx5e_check_fragmented_striding_rq_cap(priv->mdev) ?
+	u8 rq_type = mlx5e_check_fragmented_striding_rq_cap(priv->mdev) &&
+		    !priv->xdp_prog ?
 		    MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ :
 		    MLX5_WQ_TYPE_LINKED_LIST;
 	mlx5e_set_rq_type_params(priv, rq_type);
@@ -177,6 +179,7 @@ static void mlx5e_update_sw_counters(struct mlx5e_priv *priv)
 		s->rx_csum_none	+= rq_stats->csum_none;
 		s->rx_csum_complete += rq_stats->csum_complete;
 		s->rx_csum_unnecessary_inner += rq_stats->csum_unnecessary_inner;
+		s->rx_xdp_drop += rq_stats->xdp_drop;
 		s->rx_wqe_err   += rq_stats->wqe_err;
 		s->rx_mpwqe_filler += rq_stats->mpwqe_filler;
 		s->rx_buff_alloc_err += rq_stats->buff_alloc_err;
@@ -473,6 +476,7 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 	rq->channel = c;
 	rq->ix      = c->ix;
 	rq->priv    = c->priv;
+	rq->xdp_prog = priv->xdp_prog;
 
 	switch (priv->params.rq_wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
@@ -536,6 +540,9 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 	rq->page_cache.head = 0;
 	rq->page_cache.tail = 0;
 
+	if (rq->xdp_prog)
+		bpf_prog_add(rq->xdp_prog, 1);
+
 	return 0;
 
 err_rq_wq_destroy:
@@ -548,6 +555,9 @@ static void mlx5e_destroy_rq(struct mlx5e_rq *rq)
 {
 	int i;
 
+	if (rq->xdp_prog)
+		bpf_prog_put(rq->xdp_prog);
+
 	switch (rq->wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
 		mlx5e_rq_free_mpwqe_info(rq);
@@ -2955,6 +2965,92 @@ static void mlx5e_tx_timeout(struct net_device *dev)
 		schedule_work(&priv->tx_timeout_work);
 }
 
+static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
+{
+	struct mlx5e_priv *priv = netdev_priv(netdev);
+	struct bpf_prog *old_prog;
+	int err = 0;
+	bool reset, was_opened;
+	int i;
+
+	mutex_lock(&priv->state_lock);
+
+	if ((netdev->features & NETIF_F_LRO) && prog) {
+		netdev_warn(netdev, "can't set XDP while LRO is on, disable LRO first\n");
+		err = -EINVAL;
+		goto unlock;
+	}
+
+	was_opened = test_bit(MLX5E_STATE_OPENED, &priv->state);
+	/* no need for full reset when exchanging programs */
+	reset = (!priv->xdp_prog || !prog);
+
+	if (was_opened && reset)
+		mlx5e_close_locked(netdev);
+
+	/* exchange programs */
+	old_prog = xchg(&priv->xdp_prog, prog);
+	if (prog)
+		bpf_prog_add(prog, 1);
+	if (old_prog)
+		bpf_prog_put(old_prog);
+
+	if (reset) /* change RQ type according to priv->xdp_prog */
+		mlx5e_set_rq_priv_params(priv);
+
+	if (was_opened && reset)
+		mlx5e_open_locked(netdev);
+
+	if (!test_bit(MLX5E_STATE_OPENED, &priv->state) || reset)
+		goto unlock;
+
+	/* exchanging programs w/o reset, we update ref counts on behalf
+	 * of the channels RQs here.
+	 */
+	bpf_prog_add(prog, priv->params.num_channels);
+	for (i = 0; i < priv->params.num_channels; i++) {
+		struct mlx5e_channel *c = priv->channel[i];
+
+		set_bit(MLX5E_RQ_STATE_FLUSH, &c->rq.state);
+		napi_synchronize(&c->napi);
+		/* prevent mlx5e_poll_rx_cq from accessing rq->xdp_prog */
+
+		old_prog = xchg(&c->rq.xdp_prog, prog);
+
+		clear_bit(MLX5E_RQ_STATE_FLUSH, &c->rq.state);
+		/* napi_schedule in case we have missed anything */
+		set_bit(MLX5E_CHANNEL_NAPI_SCHED, &c->flags);
+		napi_schedule(&c->napi);
+
+		if (old_prog)
+			bpf_prog_put(old_prog);
+	}
+
+unlock:
+	mutex_unlock(&priv->state_lock);
+	return err;
+}
+
+static bool mlx5e_xdp_attached(struct net_device *dev)
+{
+	struct mlx5e_priv *priv = netdev_priv(dev);
+
+	return !!priv->xdp_prog;
+}
+
+static int mlx5e_xdp(struct net_device *dev, struct netdev_xdp *xdp)
+{
+	switch (xdp->command) {
+	case XDP_SETUP_PROG:
+		return mlx5e_xdp_set(dev, xdp->prog);
+	case XDP_QUERY_PROG:
+		xdp->prog_attached = mlx5e_xdp_attached(dev);
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
 static const struct net_device_ops mlx5e_netdev_ops_basic = {
 	.ndo_open                = mlx5e_open,
 	.ndo_stop                = mlx5e_close,
@@ -2974,6 +3070,7 @@ static const struct net_device_ops mlx5e_netdev_ops_basic = {
 	.ndo_rx_flow_steer	 = mlx5e_rx_flow_steer,
 #endif
 	.ndo_tx_timeout          = mlx5e_tx_timeout,
+	.ndo_xdp		 = mlx5e_xdp,
 };
 
 static const struct net_device_ops mlx5e_netdev_ops_sriov = {
@@ -3005,6 +3102,7 @@ static const struct net_device_ops mlx5e_netdev_ops_sriov = {
 	.ndo_set_vf_link_state   = mlx5e_set_vf_link_state,
 	.ndo_get_vf_stats        = mlx5e_get_vf_stats,
 	.ndo_tx_timeout          = mlx5e_tx_timeout,
+	.ndo_xdp		 = mlx5e_xdp,
 };
 
 static int mlx5e_check_required_hca_cap(struct mlx5_core_dev *mdev)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index a403a797ebef..96f6317a95ea 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -632,8 +632,20 @@ static inline void mlx5e_complete_rx_cqe(struct mlx5e_rq *rq,
 	napi_gro_receive(rq->cq.napi, skb);
 }
 
+static inline enum xdp_action mlx5e_xdp_handle(struct mlx5e_rq *rq,
+					       const struct bpf_prog *prog,
+					       void *data, u32 len)
+{
+	struct xdp_buff xdp;
+
+	xdp.data = data;
+	xdp.data_end = xdp.data + len;
+	return bpf_prog_run_xdp(prog, &xdp);
+}
+
 void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 {
+	struct bpf_prog *xdp_prog = READ_ONCE(rq->xdp_prog);
 	struct mlx5e_dma_info *di;
 	struct mlx5e_rx_wqe *wqe;
 	__be16 wqe_counter_be;
@@ -654,6 +666,7 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 				      rq->buff.wqe_sz,
 				      DMA_FROM_DEVICE);
 	prefetch(va + MLX5_RX_HEADROOM);
+	cqe_bcnt = be32_to_cpu(cqe->byte_cnt);
 
 	if (unlikely((cqe->op_own >> 4) != MLX5_CQE_RESP_SEND)) {
 		rq->stats.wqe_err++;
@@ -661,6 +674,18 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 		goto wq_ll_pop;
 	}
 
+	if (xdp_prog) {
+		enum xdp_action act =
+			mlx5e_xdp_handle(rq, xdp_prog, va + MLX5_RX_HEADROOM,
+					 cqe_bcnt);
+
+		if (act != XDP_PASS) {
+			rq->stats.xdp_drop++;
+			mlx5e_page_release(rq, di, true);
+			goto wq_ll_pop;
+		}
+	}
+
 	skb = build_skb(va, RQ_PAGE_SIZE(rq));
 	if (unlikely(!skb)) {
 		rq->stats.buff_alloc_err++;
@@ -672,7 +697,6 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 	page_ref_inc(di->page);
 	mlx5e_page_release(rq, di, true);
 
-	cqe_bcnt = be32_to_cpu(cqe->byte_cnt);
 	skb_reserve(skb, MLX5_RX_HEADROOM);
 	skb_put(skb, cqe_bcnt);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
index 6af8d79e8c2a..084d6c893a6e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
@@ -65,6 +65,7 @@ struct mlx5e_sw_stats {
 	u64 rx_csum_none;
 	u64 rx_csum_complete;
 	u64 rx_csum_unnecessary_inner;
+	u64 rx_xdp_drop;
 	u64 tx_csum_partial;
 	u64 tx_csum_partial_inner;
 	u64 tx_queue_stopped;
@@ -100,6 +101,7 @@ static const struct counter_desc sw_stats_desc[] = {
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_csum_none) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_csum_complete) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_csum_unnecessary_inner) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xdp_drop) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_csum_partial) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_csum_partial_inner) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_queue_stopped) },
@@ -278,6 +280,7 @@ struct mlx5e_rq_stats {
 	u64 csum_none;
 	u64 lro_packets;
 	u64 lro_bytes;
+	u64 xdp_drop;
 	u64 wqe_err;
 	u64 mpwqe_filler;
 	u64 buff_alloc_err;
@@ -295,6 +298,7 @@ static const struct counter_desc rq_stats_desc[] = {
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, csum_complete) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, csum_unnecessary_inner) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, csum_none) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, xdp_drop) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, lro_packets) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, lro_bytes) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, wqe_err) },
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH net-next 6/8] net/mlx5e: Have a clear separation between different SQ types
  2016-09-19 13:58 [PATCH net-next 0/8] mlx5e XDP support Tariq Toukan
                   ` (4 preceding siblings ...)
  2016-09-19 13:58 ` [PATCH net-next 5/8] net/mlx5e: XDP fast RX drop bpf programs support Tariq Toukan
@ 2016-09-19 13:58 ` Tariq Toukan
  2016-09-19 13:58 ` [PATCH net-next 7/8] net/mlx5e: XDP TX forwarding support Tariq Toukan
  2016-09-19 13:58 ` [PATCH net-next 8/8] net/mlx5e: XDP TX xmit more Tariq Toukan
  7 siblings, 0 replies; 32+ messages in thread
From: Tariq Toukan @ 2016-09-19 13:58 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eran Ben Elisha, Saeed Mahameed, Rana Shahout, Tariq Toukan

From: Saeed Mahameed <saeedm@mellanox.com>

Make a clear separation between regular SQ (TXQ) and ICO SQ creation
and destruction, and union their mutual information structures.

Don't allocate redundant TXQ skb/wqe_info/dma_fifo arrays for the ICO
SQ. And use a different SQ edge for the ICO SQ than for the TXQ SQ, to
be more accurate.

In preparation for XDP TX support.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |  23 +++-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 121 ++++++++++++++--------
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |   8 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c   |  30 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c |   2 +-
 5 files changed, 120 insertions(+), 64 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 5e8e669f69c0..5917f5e609ae 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -101,6 +101,9 @@
 #define MLX5E_UPDATE_STATS_INTERVAL    200 /* msecs */
 #define MLX5E_SQ_BF_BUDGET             16
 
+#define MLX5E_ICOSQ_MAX_WQEBBS \
+	(DIV_ROUND_UP(sizeof(struct mlx5e_umr_wqe), MLX5_SEND_WQE_BB))
+
 #define MLX5E_NUM_MAIN_GROUPS 9
 
 static inline u16 mlx5_min_rx_wqes(int wq_type, u32 wq_size)
@@ -386,6 +389,11 @@ struct mlx5e_ico_wqe_info {
 	u8  num_wqebbs;
 };
 
+enum mlx5e_sq_type {
+	MLX5E_SQ_TXQ,
+	MLX5E_SQ_ICO
+};
+
 struct mlx5e_sq {
 	/* data path */
 
@@ -403,10 +411,15 @@ struct mlx5e_sq {
 
 	struct mlx5e_cq            cq;
 
-	/* pointers to per packet info: write@xmit, read@completion */
-	struct sk_buff           **skb;
-	struct mlx5e_sq_dma       *dma_fifo;
-	struct mlx5e_tx_wqe_info  *wqe_info;
+	/* pointers to per tx element info: write@xmit, read@completion */
+	union {
+		struct {
+			struct sk_buff           **skb;
+			struct mlx5e_sq_dma       *dma_fifo;
+			struct mlx5e_tx_wqe_info  *wqe_info;
+		} txq;
+		struct mlx5e_ico_wqe_info *ico_wqe;
+	} db;
 
 	/* read only */
 	struct mlx5_wq_cyc         wq;
@@ -428,8 +441,8 @@ struct mlx5e_sq {
 	struct mlx5_uar            uar;
 	struct mlx5e_channel      *channel;
 	int                        tc;
-	struct mlx5e_ico_wqe_info *ico_wqe_info;
 	u32                        rate_limit;
+	u8                         type;
 } ____cacheline_aligned_in_smp;
 
 static inline bool mlx5e_sq_has_room_for(struct mlx5e_sq *sq, u16 n)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 6e95a16226f8..632de09e69cf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -51,7 +51,7 @@ struct mlx5e_sq_param {
 	struct mlx5_wq_param       wq;
 	u16                        max_inline;
 	u8                         min_inline_mode;
-	bool                       icosq;
+	enum mlx5e_sq_type         type;
 };
 
 struct mlx5e_cq_param {
@@ -740,8 +740,8 @@ static int mlx5e_open_rq(struct mlx5e_channel *c,
 	if (param->am_enabled)
 		set_bit(MLX5E_RQ_STATE_AM, &c->rq.state);
 
-	sq->ico_wqe_info[pi].opcode     = MLX5_OPCODE_NOP;
-	sq->ico_wqe_info[pi].num_wqebbs = 1;
+	sq->db.ico_wqe[pi].opcode     = MLX5_OPCODE_NOP;
+	sq->db.ico_wqe[pi].num_wqebbs = 1;
 	mlx5e_send_nop(sq, true); /* trigger mlx5e_post_rx_wqes() */
 
 	return 0;
@@ -765,26 +765,43 @@ static void mlx5e_close_rq(struct mlx5e_rq *rq)
 	mlx5e_destroy_rq(rq);
 }
 
-static void mlx5e_free_sq_db(struct mlx5e_sq *sq)
+static void mlx5e_free_sq_ico_db(struct mlx5e_sq *sq)
 {
-	kfree(sq->wqe_info);
-	kfree(sq->dma_fifo);
-	kfree(sq->skb);
+	kfree(sq->db.ico_wqe);
 }
 
-static int mlx5e_alloc_sq_db(struct mlx5e_sq *sq, int numa)
+static int mlx5e_alloc_sq_ico_db(struct mlx5e_sq *sq, int numa)
+{
+	u8 wq_sz = mlx5_wq_cyc_get_size(&sq->wq);
+
+	sq->db.ico_wqe = kzalloc_node(sizeof(*sq->db.ico_wqe) * wq_sz,
+				      GFP_KERNEL, numa);
+	if (!sq->db.ico_wqe)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void mlx5e_free_sq_txq_db(struct mlx5e_sq *sq)
+{
+	kfree(sq->db.txq.wqe_info);
+	kfree(sq->db.txq.dma_fifo);
+	kfree(sq->db.txq.skb);
+}
+
+static int mlx5e_alloc_sq_txq_db(struct mlx5e_sq *sq, int numa)
 {
 	int wq_sz = mlx5_wq_cyc_get_size(&sq->wq);
 	int df_sz = wq_sz * MLX5_SEND_WQEBB_NUM_DS;
 
-	sq->skb = kzalloc_node(wq_sz * sizeof(*sq->skb), GFP_KERNEL, numa);
-	sq->dma_fifo = kzalloc_node(df_sz * sizeof(*sq->dma_fifo), GFP_KERNEL,
-				    numa);
-	sq->wqe_info = kzalloc_node(wq_sz * sizeof(*sq->wqe_info), GFP_KERNEL,
-				    numa);
-
-	if (!sq->skb || !sq->dma_fifo || !sq->wqe_info) {
-		mlx5e_free_sq_db(sq);
+	sq->db.txq.skb = kzalloc_node(wq_sz * sizeof(*sq->db.txq.skb),
+				      GFP_KERNEL, numa);
+	sq->db.txq.dma_fifo = kzalloc_node(df_sz * sizeof(*sq->db.txq.dma_fifo),
+					   GFP_KERNEL, numa);
+	sq->db.txq.wqe_info = kzalloc_node(wq_sz * sizeof(*sq->db.txq.wqe_info),
+					   GFP_KERNEL, numa);
+	if (!sq->db.txq.skb || !sq->db.txq.dma_fifo || !sq->db.txq.wqe_info) {
+		mlx5e_free_sq_txq_db(sq);
 		return -ENOMEM;
 	}
 
@@ -793,6 +810,30 @@ static int mlx5e_alloc_sq_db(struct mlx5e_sq *sq, int numa)
 	return 0;
 }
 
+static void mlx5e_free_sq_db(struct mlx5e_sq *sq)
+{
+	switch (sq->type) {
+	case MLX5E_SQ_TXQ:
+		mlx5e_free_sq_txq_db(sq);
+		break;
+	case MLX5E_SQ_ICO:
+		mlx5e_free_sq_ico_db(sq);
+		break;
+	}
+}
+
+static int mlx5e_alloc_sq_db(struct mlx5e_sq *sq, int numa)
+{
+	switch (sq->type) {
+	case MLX5E_SQ_TXQ:
+		return mlx5e_alloc_sq_txq_db(sq, numa);
+	case MLX5E_SQ_ICO:
+		return mlx5e_alloc_sq_ico_db(sq, numa);
+	}
+
+	return 0;
+}
+
 static int mlx5e_create_sq(struct mlx5e_channel *c,
 			   int tc,
 			   struct mlx5e_sq_param *param,
@@ -803,8 +844,16 @@ static int mlx5e_create_sq(struct mlx5e_channel *c,
 
 	void *sqc = param->sqc;
 	void *sqc_wq = MLX5_ADDR_OF(sqc, sqc, wq);
+	u16 sq_max_wqebbs;
 	int err;
 
+	sq->type      = param->type;
+	sq->pdev      = c->pdev;
+	sq->tstamp    = &priv->tstamp;
+	sq->mkey_be   = c->mkey_be;
+	sq->channel   = c;
+	sq->tc        = tc;
+
 	err = mlx5_alloc_map_uar(mdev, &sq->uar, !!MLX5_CAP_GEN(mdev, bf));
 	if (err)
 		return err;
@@ -833,18 +882,8 @@ static int mlx5e_create_sq(struct mlx5e_channel *c,
 	if (err)
 		goto err_sq_wq_destroy;
 
-	if (param->icosq) {
-		u8 wq_sz = mlx5_wq_cyc_get_size(&sq->wq);
-
-		sq->ico_wqe_info = kzalloc_node(sizeof(*sq->ico_wqe_info) *
-						wq_sz,
-						GFP_KERNEL,
-						cpu_to_node(c->cpu));
-		if (!sq->ico_wqe_info) {
-			err = -ENOMEM;
-			goto err_free_sq_db;
-		}
-	} else {
+	sq_max_wqebbs = MLX5_SEND_WQE_MAX_WQEBBS;
+	if (sq->type == MLX5E_SQ_TXQ) {
 		int txq_ix;
 
 		txq_ix = c->ix + tc * priv->params.num_channels;
@@ -852,19 +891,14 @@ static int mlx5e_create_sq(struct mlx5e_channel *c,
 		priv->txq_to_sq_map[txq_ix] = sq;
 	}
 
-	sq->pdev      = c->pdev;
-	sq->tstamp    = &priv->tstamp;
-	sq->mkey_be   = c->mkey_be;
-	sq->channel   = c;
-	sq->tc        = tc;
-	sq->edge      = (sq->wq.sz_m1 + 1) - MLX5_SEND_WQE_MAX_WQEBBS;
+	if (sq->type == MLX5E_SQ_ICO)
+		sq_max_wqebbs = MLX5E_ICOSQ_MAX_WQEBBS;
+
+	sq->edge      = (sq->wq.sz_m1 + 1) - sq_max_wqebbs;
 	sq->bf_budget = MLX5E_SQ_BF_BUDGET;
 
 	return 0;
 
-err_free_sq_db:
-	mlx5e_free_sq_db(sq);
-
 err_sq_wq_destroy:
 	mlx5_wq_destroy(&sq->wq_ctrl);
 
@@ -879,7 +913,6 @@ static void mlx5e_destroy_sq(struct mlx5e_sq *sq)
 	struct mlx5e_channel *c = sq->channel;
 	struct mlx5e_priv *priv = c->priv;
 
-	kfree(sq->ico_wqe_info);
 	mlx5e_free_sq_db(sq);
 	mlx5_wq_destroy(&sq->wq_ctrl);
 	mlx5_unmap_free_uar(priv->mdev, &sq->uar);
@@ -908,11 +941,12 @@ static int mlx5e_enable_sq(struct mlx5e_sq *sq, struct mlx5e_sq_param *param)
 
 	memcpy(sqc, param->sqc, sizeof(param->sqc));
 
-	MLX5_SET(sqc,  sqc, tis_num_0, param->icosq ? 0 : priv->tisn[sq->tc]);
+	MLX5_SET(sqc,  sqc, tis_num_0, param->type == MLX5E_SQ_ICO ?
+				       0 : priv->tisn[sq->tc]);
 	MLX5_SET(sqc,  sqc, cqn,		sq->cq.mcq.cqn);
 	MLX5_SET(sqc,  sqc, min_wqe_inline_mode, sq->min_inline_mode);
 	MLX5_SET(sqc,  sqc, state,		MLX5_SQC_STATE_RST);
-	MLX5_SET(sqc,  sqc, tis_lst_sz,		param->icosq ? 0 : 1);
+	MLX5_SET(sqc,  sqc, tis_lst_sz, param->type == MLX5E_SQ_ICO ? 0 : 1);
 	MLX5_SET(sqc,  sqc, flush_in_error_en,	1);
 
 	MLX5_SET(wq,   wq, wq_type,       MLX5_WQ_TYPE_CYCLIC);
@@ -1027,8 +1061,10 @@ static void mlx5e_close_sq(struct mlx5e_sq *sq)
 		netif_tx_disable_queue(sq->txq);
 
 		/* last doorbell out, godspeed .. */
-		if (mlx5e_sq_has_room_for(sq, 1))
+		if (mlx5e_sq_has_room_for(sq, 1)) {
+			sq->db.txq.skb[(sq->pc & sq->wq.sz_m1)] = NULL;
 			mlx5e_send_nop(sq, true);
+		}
 	}
 
 	mlx5e_disable_sq(sq);
@@ -1505,6 +1541,7 @@ static void mlx5e_build_sq_param(struct mlx5e_priv *priv,
 
 	param->max_inline = priv->params.tx_max_inline;
 	param->min_inline_mode = priv->params.tx_min_inline_mode;
+	param->type = MLX5E_SQ_TXQ;
 }
 
 static void mlx5e_build_common_cq_param(struct mlx5e_priv *priv,
@@ -1578,7 +1615,7 @@ static void mlx5e_build_icosq_param(struct mlx5e_priv *priv,
 	MLX5_SET(wq, wq, log_wq_sz, log_wq_size);
 	MLX5_SET(sqc, sqc, reg_umr, MLX5_CAP_ETH(priv->mdev, reg_umr_sq));
 
-	param->icosq = true;
+	param->type = MLX5E_SQ_ICO;
 }
 
 static void mlx5e_build_channel_param(struct mlx5e_priv *priv, struct mlx5e_channel_param *cparam)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 96f6317a95ea..941e53169838 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -337,8 +337,8 @@ static inline void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
 
 	/* fill sq edge with nops to avoid wqe wrap around */
 	while ((pi = (sq->pc & wq->sz_m1)) > sq->edge) {
-		sq->ico_wqe_info[pi].opcode = MLX5_OPCODE_NOP;
-		sq->ico_wqe_info[pi].num_wqebbs = 1;
+		sq->db.ico_wqe[pi].opcode = MLX5_OPCODE_NOP;
+		sq->db.ico_wqe[pi].num_wqebbs = 1;
 		mlx5e_send_nop(sq, true);
 	}
 
@@ -348,8 +348,8 @@ static inline void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
 		cpu_to_be32((sq->pc << MLX5_WQE_CTRL_WQE_INDEX_SHIFT) |
 			    MLX5_OPCODE_UMR);
 
-	sq->ico_wqe_info[pi].opcode = MLX5_OPCODE_UMR;
-	sq->ico_wqe_info[pi].num_wqebbs = num_wqebbs;
+	sq->db.ico_wqe[pi].opcode = MLX5_OPCODE_UMR;
+	sq->db.ico_wqe[pi].num_wqebbs = num_wqebbs;
 	sq->pc += num_wqebbs;
 	mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index eb0e72537f10..f02f24cfcb7a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -52,7 +52,6 @@ void mlx5e_send_nop(struct mlx5e_sq *sq, bool notify_hw)
 	cseg->opmod_idx_opcode = cpu_to_be32((sq->pc << 8) | MLX5_OPCODE_NOP);
 	cseg->qpn_ds           = cpu_to_be32((sq->sqn << 8) | 0x01);
 
-	sq->skb[pi] = NULL;
 	sq->pc++;
 	sq->stats.nop++;
 
@@ -82,15 +81,17 @@ static inline void mlx5e_dma_push(struct mlx5e_sq *sq,
 				  u32 size,
 				  enum mlx5e_dma_map_type map_type)
 {
-	sq->dma_fifo[sq->dma_fifo_pc & sq->dma_fifo_mask].addr = addr;
-	sq->dma_fifo[sq->dma_fifo_pc & sq->dma_fifo_mask].size = size;
-	sq->dma_fifo[sq->dma_fifo_pc & sq->dma_fifo_mask].type = map_type;
+	u32 i = sq->dma_fifo_pc & sq->dma_fifo_mask;
+
+	sq->db.txq.dma_fifo[i].addr = addr;
+	sq->db.txq.dma_fifo[i].size = size;
+	sq->db.txq.dma_fifo[i].type = map_type;
 	sq->dma_fifo_pc++;
 }
 
 static inline struct mlx5e_sq_dma *mlx5e_dma_get(struct mlx5e_sq *sq, u32 i)
 {
-	return &sq->dma_fifo[i & sq->dma_fifo_mask];
+	return &sq->db.txq.dma_fifo[i & sq->dma_fifo_mask];
 }
 
 static void mlx5e_dma_unmap_wqe_err(struct mlx5e_sq *sq, u8 num_dma)
@@ -221,7 +222,7 @@ static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_sq *sq, struct sk_buff *skb)
 
 	u16 pi = sq->pc & wq->sz_m1;
 	struct mlx5e_tx_wqe      *wqe  = mlx5_wq_cyc_get_wqe(wq, pi);
-	struct mlx5e_tx_wqe_info *wi   = &sq->wqe_info[pi];
+	struct mlx5e_tx_wqe_info *wi   = &sq->db.txq.wqe_info[pi];
 
 	struct mlx5_wqe_ctrl_seg *cseg = &wqe->ctrl;
 	struct mlx5_wqe_eth_seg  *eseg = &wqe->eth;
@@ -341,7 +342,7 @@ static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_sq *sq, struct sk_buff *skb)
 	cseg->opmod_idx_opcode = cpu_to_be32((sq->pc << 8) | opcode);
 	cseg->qpn_ds           = cpu_to_be32((sq->sqn << 8) | ds_cnt);
 
-	sq->skb[pi] = skb;
+	sq->db.txq.skb[pi] = skb;
 
 	wi->num_wqebbs = DIV_ROUND_UP(ds_cnt, MLX5_SEND_WQEBB_NUM_DS);
 	sq->pc += wi->num_wqebbs;
@@ -368,8 +369,10 @@ static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_sq *sq, struct sk_buff *skb)
 	}
 
 	/* fill sq edge with nops to avoid wqe wrap around */
-	while ((sq->pc & wq->sz_m1) > sq->edge)
+	while ((pi = (sq->pc & wq->sz_m1)) > sq->edge) {
+		sq->db.txq.skb[pi] = NULL;
 		mlx5e_send_nop(sq, false);
+	}
 
 	if (bf)
 		sq->bf_budget--;
@@ -442,8 +445,8 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 			last_wqe = (sqcc == wqe_counter);
 
 			ci = sqcc & sq->wq.sz_m1;
-			skb = sq->skb[ci];
-			wi = &sq->wqe_info[ci];
+			skb = sq->db.txq.skb[ci];
+			wi = &sq->db.txq.wqe_info[ci];
 
 			if (unlikely(!skb)) { /* nop */
 				sqcc++;
@@ -499,10 +502,13 @@ void mlx5e_free_tx_descs(struct mlx5e_sq *sq)
 	u16 ci;
 	int i;
 
+	if (sq->type != MLX5E_SQ_TXQ)
+		return;
+
 	while (sq->cc != sq->pc) {
 		ci = sq->cc & sq->wq.sz_m1;
-		skb = sq->skb[ci];
-		wi = &sq->wqe_info[ci];
+		skb = sq->db.txq.skb[ci];
+		wi = &sq->db.txq.wqe_info[ci];
 
 		if (!skb) { /* nop */
 			sq->cc++;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
index 08d8b0c91f07..47cd5619ae02 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
@@ -72,7 +72,7 @@ static void mlx5e_poll_ico_cq(struct mlx5e_cq *cq)
 
 	do {
 		u16 ci = be16_to_cpu(cqe->wqe_counter) & wq->sz_m1;
-		struct mlx5e_ico_wqe_info *icowi = &sq->ico_wqe_info[ci];
+		struct mlx5e_ico_wqe_info *icowi = &sq->db.ico_wqe[ci];
 
 		mlx5_cqwq_pop(&cq->wq);
 		sqcc += icowi->num_wqebbs;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH net-next 7/8] net/mlx5e: XDP TX forwarding support
  2016-09-19 13:58 [PATCH net-next 0/8] mlx5e XDP support Tariq Toukan
                   ` (5 preceding siblings ...)
  2016-09-19 13:58 ` [PATCH net-next 6/8] net/mlx5e: Have a clear separation between different SQ types Tariq Toukan
@ 2016-09-19 13:58 ` Tariq Toukan
  2016-09-20  8:29   ` Jesper Dangaard Brouer
  2016-09-19 13:58 ` [PATCH net-next 8/8] net/mlx5e: XDP TX xmit more Tariq Toukan
  7 siblings, 1 reply; 32+ messages in thread
From: Tariq Toukan @ 2016-09-19 13:58 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eran Ben Elisha, Saeed Mahameed, Rana Shahout, Tariq Toukan

From: Saeed Mahameed <saeedm@mellanox.com>

Add support for XDP_TX forwarding from an XDP program.
With XDP, a user can now loop packets back out of the same port.

We create a dedicated TX SQ for each channel that will serve
XDP programs that return the XDP_TX action, looping packets back to
the wire directly from the channel RQ RX path.

For that, RX pages now need to be mapped bi-directionally,
and on an XDP_TX action we sync the page back to the device, then
queue it into the SQ for transmission.  The XDP xmit frame function
reports back to the RX path whether the page was consumed (transmitted);
if so, the RX path forgets about that page as if it were released to
the stack.  Later, on XDP TX completion, the page is released back to
the page cache.

For simplicity, this patch hits the doorbell on every XDP TX packet.

The next patch will introduce an xmit-more-like mechanism that queues
up more than one packet into the SQ w/o notifying the hardware; once
the RX napi loop is done, we hit the doorbell once for all the XDP TX
packets from the previous loop.  This should drastically improve
XDP TX performance.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
---
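A condensed sketch of the RX-path contract described above, trimmed
from the en_rx.c hunk below.  The handler returns true only when XDP
consumed the page (forwarded or dropped); on XDP_TX with a full SQ it
returns false and the packet continues up the normal skb path:

	if (mlx5e_xdp_handle(rq, xdp_prog, di, data, cqe_bcnt))
		goto wq_ll_pop; /* page now owned by the XDP SQ, or dropped */

	/* XDP_PASS, or XDP_TX that found the SQ full: build an skb as usual */
	skb = build_skb(va, RQ_PAGE_SIZE(rq));

As a worked example of the new MLX5E_XDP_* macros, assuming 16-byte
send data segments and 64-byte WQE basic blocks: MLX5E_XDP_MIN_INLINE
= ETH_HLEN + VLAN_HLEN = 14 + 4 = 18 bytes, so MLX5E_XDP_IHS_DS_COUNT
= DIV_ROUND_UP(18 - 2, 16) = 1 extra inline data segment per XDP TX
WQE (the '- 2' because the first two inline bytes are carried in the
WQE's eth segment itself).
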
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  25 ++++-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  93 +++++++++++++++--
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    | 116 +++++++++++++++++----
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |   8 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c    |  39 ++++++-
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c  |  65 +++++++++++-
 6 files changed, 310 insertions(+), 36 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 5917f5e609ae..82eededfc92a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -104,6 +104,15 @@
 #define MLX5E_ICOSQ_MAX_WQEBBS \
 	(DIV_ROUND_UP(sizeof(struct mlx5e_umr_wqe), MLX5_SEND_WQE_BB))
 
+#define MLX5E_XDP_MIN_INLINE (ETH_HLEN + VLAN_HLEN)
+#define MLX5E_XDP_IHS_DS_COUNT \
+	DIV_ROUND_UP(MLX5E_XDP_MIN_INLINE - 2, MLX5_SEND_WQE_DS)
+#define MLX5E_XDP_TX_DS_COUNT \
+	(MLX5E_XDP_IHS_DS_COUNT + \
+	 (sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS) + 1 /* SG DS */)
+#define MLX5E_XDP_TX_WQEBBS \
+	DIV_ROUND_UP(MLX5E_XDP_TX_DS_COUNT, MLX5_SEND_WQEBB_NUM_DS)
+
 #define MLX5E_NUM_MAIN_GROUPS 9
 
 static inline u16 mlx5_min_rx_wqes(int wq_type, u32 wq_size)
@@ -319,6 +328,7 @@ struct mlx5e_rq {
 	struct {
 		u8             page_order;
 		u32            wqe_sz;    /* wqe data buffer size */
+		u8             map_dir;   /* dma map direction */
 	} buff;
 	__be32                 mkey_be;
 
@@ -384,14 +394,15 @@ enum {
 	MLX5E_SQ_STATE_BF_ENABLE,
 };
 
-struct mlx5e_ico_wqe_info {
+struct mlx5e_sq_wqe_info {
 	u8  opcode;
 	u8  num_wqebbs;
 };
 
 enum mlx5e_sq_type {
 	MLX5E_SQ_TXQ,
-	MLX5E_SQ_ICO
+	MLX5E_SQ_ICO,
+	MLX5E_SQ_XDP
 };
 
 struct mlx5e_sq {
@@ -418,7 +429,11 @@ struct mlx5e_sq {
 			struct mlx5e_sq_dma       *dma_fifo;
 			struct mlx5e_tx_wqe_info  *wqe_info;
 		} txq;
-		struct mlx5e_ico_wqe_info *ico_wqe;
+		struct mlx5e_sq_wqe_info *ico_wqe;
+		struct {
+			struct mlx5e_sq_wqe_info  *wqe_info;
+			struct mlx5e_dma_info     *di;
+		} xdp;
 	} db;
 
 	/* read only */
@@ -458,8 +473,10 @@ enum channel_flags {
 struct mlx5e_channel {
 	/* data path */
 	struct mlx5e_rq            rq;
+	struct mlx5e_sq            xdp_sq;
 	struct mlx5e_sq            sq[MLX5E_MAX_NUM_TC];
 	struct mlx5e_sq            icosq;   /* internal control operations */
+	bool                       xdp;
 	struct napi_struct         napi;
 	struct device             *pdev;
 	struct net_device         *netdev;
@@ -688,7 +705,7 @@ void mlx5e_cq_error_event(struct mlx5_core_cq *mcq, enum mlx5_event event);
 int mlx5e_napi_poll(struct napi_struct *napi, int budget);
 bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget);
 int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget);
-void mlx5e_free_tx_descs(struct mlx5e_sq *sq);
+void mlx5e_free_sq_descs(struct mlx5e_sq *sq);
 
 void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
 			bool recycle);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 632de09e69cf..a9fc9d4c6af4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -64,6 +64,7 @@ struct mlx5e_cq_param {
 struct mlx5e_channel_param {
 	struct mlx5e_rq_param      rq;
 	struct mlx5e_sq_param      sq;
+	struct mlx5e_sq_param      xdp_sq;
 	struct mlx5e_sq_param      icosq;
 	struct mlx5e_cq_param      rx_cq;
 	struct mlx5e_cq_param      tx_cq;
@@ -180,6 +181,8 @@ static void mlx5e_update_sw_counters(struct mlx5e_priv *priv)
 		s->rx_csum_complete += rq_stats->csum_complete;
 		s->rx_csum_unnecessary_inner += rq_stats->csum_unnecessary_inner;
 		s->rx_xdp_drop += rq_stats->xdp_drop;
+		s->rx_xdp_tx += rq_stats->xdp_tx;
+		s->rx_xdp_tx_full += rq_stats->xdp_tx_full;
 		s->rx_wqe_err   += rq_stats->wqe_err;
 		s->rx_mpwqe_filler += rq_stats->mpwqe_filler;
 		s->rx_buff_alloc_err += rq_stats->buff_alloc_err;
@@ -478,6 +481,10 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 	rq->priv    = c->priv;
 	rq->xdp_prog = priv->xdp_prog;
 
+	rq->buff.map_dir = DMA_FROM_DEVICE;
+	if (rq->xdp_prog)
+		rq->buff.map_dir = DMA_BIDIRECTIONAL;
+
 	switch (priv->params.rq_wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
 		rq->handle_rx_cqe = mlx5e_handle_rx_cqe_mpwrq;
@@ -765,6 +772,28 @@ static void mlx5e_close_rq(struct mlx5e_rq *rq)
 	mlx5e_destroy_rq(rq);
 }
 
+static void mlx5e_free_sq_xdp_db(struct mlx5e_sq *sq)
+{
+	kfree(sq->db.xdp.di);
+	kfree(sq->db.xdp.wqe_info);
+}
+
+static int mlx5e_alloc_sq_xdp_db(struct mlx5e_sq *sq, int numa)
+{
+	int wq_sz = mlx5_wq_cyc_get_size(&sq->wq);
+
+	sq->db.xdp.di = kzalloc_node(sizeof(*sq->db.xdp.di) * wq_sz,
+				     GFP_KERNEL, numa);
+	sq->db.xdp.wqe_info = kzalloc_node(sizeof(*sq->db.xdp.wqe_info) * wq_sz,
+					   GFP_KERNEL, numa);
+	if (!sq->db.xdp.di || !sq->db.xdp.wqe_info) {
+		mlx5e_free_sq_xdp_db(sq);
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
 static void mlx5e_free_sq_ico_db(struct mlx5e_sq *sq)
 {
 	kfree(sq->db.ico_wqe);
@@ -819,6 +848,9 @@ static void mlx5e_free_sq_db(struct mlx5e_sq *sq)
 	case MLX5E_SQ_ICO:
 		mlx5e_free_sq_ico_db(sq);
 		break;
+	case MLX5E_SQ_XDP:
+		mlx5e_free_sq_xdp_db(sq);
+		break;
 	}
 }
 
@@ -829,11 +861,24 @@ static int mlx5e_alloc_sq_db(struct mlx5e_sq *sq, int numa)
 		return mlx5e_alloc_sq_txq_db(sq, numa);
 	case MLX5E_SQ_ICO:
 		return mlx5e_alloc_sq_ico_db(sq, numa);
+	case MLX5E_SQ_XDP:
+		return mlx5e_alloc_sq_xdp_db(sq, numa);
 	}
 
 	return 0;
 }
 
+static int mlx5e_sq_get_max_wqebbs(u8 sq_type)
+{
+	switch (sq_type) {
+	case MLX5E_SQ_ICO:
+		return MLX5E_ICOSQ_MAX_WQEBBS;
+	case MLX5E_SQ_XDP:
+		return MLX5E_XDP_TX_WQEBBS;
+	}
+	return MLX5_SEND_WQE_MAX_WQEBBS;
+}
+
 static int mlx5e_create_sq(struct mlx5e_channel *c,
 			   int tc,
 			   struct mlx5e_sq_param *param,
@@ -844,7 +889,6 @@ static int mlx5e_create_sq(struct mlx5e_channel *c,
 
 	void *sqc = param->sqc;
 	void *sqc_wq = MLX5_ADDR_OF(sqc, sqc, wq);
-	u16 sq_max_wqebbs;
 	int err;
 
 	sq->type      = param->type;
@@ -882,7 +926,6 @@ static int mlx5e_create_sq(struct mlx5e_channel *c,
 	if (err)
 		goto err_sq_wq_destroy;
 
-	sq_max_wqebbs = MLX5_SEND_WQE_MAX_WQEBBS;
 	if (sq->type == MLX5E_SQ_TXQ) {
 		int txq_ix;
 
@@ -891,10 +934,7 @@ static int mlx5e_create_sq(struct mlx5e_channel *c,
 		priv->txq_to_sq_map[txq_ix] = sq;
 	}
 
-	if (sq->type == MLX5E_SQ_ICO)
-		sq_max_wqebbs = MLX5E_ICOSQ_MAX_WQEBBS;
-
-	sq->edge      = (sq->wq.sz_m1 + 1) - sq_max_wqebbs;
+	sq->edge = (sq->wq.sz_m1 + 1) - mlx5e_sq_get_max_wqebbs(sq->type);
 	sq->bf_budget = MLX5E_SQ_BF_BUDGET;
 
 	return 0;
@@ -1068,7 +1108,7 @@ static void mlx5e_close_sq(struct mlx5e_sq *sq)
 	}
 
 	mlx5e_disable_sq(sq);
-	mlx5e_free_tx_descs(sq);
+	mlx5e_free_sq_descs(sq);
 	mlx5e_destroy_sq(sq);
 }
 
@@ -1429,14 +1469,31 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 		}
 	}
 
+	if (priv->xdp_prog) {
+		/* XDP SQ CQ params are the same as normal TXQ SQ CQ params */
+		err = mlx5e_open_cq(c, &cparam->tx_cq, &c->xdp_sq.cq,
+				    priv->params.tx_cq_moderation);
+		if (err)
+			goto err_close_sqs;
+
+		err = mlx5e_open_sq(c, 0, &cparam->xdp_sq, &c->xdp_sq);
+		if (err) {
+			mlx5e_close_cq(&c->xdp_sq.cq);
+			goto err_close_sqs;
+		}
+	}
+
+	c->xdp = !!priv->xdp_prog;
 	err = mlx5e_open_rq(c, &cparam->rq, &c->rq);
 	if (err)
-		goto err_close_sqs;
+		goto err_close_xdp_sq;
 
 	netif_set_xps_queue(netdev, get_cpu_mask(c->cpu), ix);
 	*cp = c;
 
 	return 0;
+err_close_xdp_sq:
+	mlx5e_close_sq(&c->xdp_sq);
 
 err_close_sqs:
 	mlx5e_close_sqs(c);
@@ -1465,9 +1522,13 @@ err_napi_del:
 static void mlx5e_close_channel(struct mlx5e_channel *c)
 {
 	mlx5e_close_rq(&c->rq);
+	if (c->xdp)
+		mlx5e_close_sq(&c->xdp_sq);
 	mlx5e_close_sqs(c);
 	mlx5e_close_sq(&c->icosq);
 	napi_disable(&c->napi);
+	if (c->xdp)
+		mlx5e_close_cq(&c->xdp_sq.cq);
 	mlx5e_close_cq(&c->rq.cq);
 	mlx5e_close_tx_cqs(c);
 	mlx5e_close_cq(&c->icosq.cq);
@@ -1618,12 +1679,28 @@ static void mlx5e_build_icosq_param(struct mlx5e_priv *priv,
 	param->type = MLX5E_SQ_ICO;
 }
 
+static void mlx5e_build_xdpsq_param(struct mlx5e_priv *priv,
+				    struct mlx5e_sq_param *param)
+{
+	void *sqc = param->sqc;
+	void *wq = MLX5_ADDR_OF(sqc, sqc, wq);
+
+	mlx5e_build_sq_param_common(priv, param);
+	MLX5_SET(wq, wq, log_wq_sz,     priv->params.log_sq_size);
+
+	param->max_inline = priv->params.tx_max_inline;
+	/* For XDP SQs we support only L2 inline mode */
+	param->min_inline_mode = MLX5_INLINE_MODE_NONE;
+	param->type = MLX5E_SQ_XDP;
+}
+
 static void mlx5e_build_channel_param(struct mlx5e_priv *priv, struct mlx5e_channel_param *cparam)
 {
 	u8 icosq_log_wq_sz = MLX5E_PARAMS_MINIMUM_LOG_SQ_SIZE;
 
 	mlx5e_build_rq_param(priv, &cparam->rq);
 	mlx5e_build_sq_param(priv, &cparam->sq);
+	mlx5e_build_xdpsq_param(priv, &cparam->xdp_sq);
 	mlx5e_build_icosq_param(priv, &cparam->icosq, icosq_log_wq_sz);
 	mlx5e_build_rx_cq_param(priv, &cparam->rx_cq);
 	mlx5e_build_tx_cq_param(priv, &cparam->tx_cq);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 941e53169838..fd8011bf25f8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -236,7 +236,7 @@ static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
 
 	dma_info->page = page;
 	dma_info->addr = dma_map_page(rq->pdev, page, 0,
-				      RQ_PAGE_SIZE(rq), DMA_FROM_DEVICE);
+				      RQ_PAGE_SIZE(rq), rq->buff.map_dir);
 	if (unlikely(dma_mapping_error(rq->pdev, dma_info->addr))) {
 		put_page(page);
 		return -ENOMEM;
@@ -252,7 +252,7 @@ void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
 		return;
 
 	dma_unmap_page(rq->pdev, dma_info->addr, RQ_PAGE_SIZE(rq),
-		       DMA_FROM_DEVICE);
+		       rq->buff.map_dir);
 	put_page(dma_info->page);
 }
 
@@ -632,15 +632,101 @@ static inline void mlx5e_complete_rx_cqe(struct mlx5e_rq *rq,
 	napi_gro_receive(rq->cq.napi, skb);
 }
 
-static inline enum xdp_action mlx5e_xdp_handle(struct mlx5e_rq *rq,
-					       const struct bpf_prog *prog,
-					       void *data, u32 len)
+static inline bool mlx5e_xmit_xdp_frame(struct mlx5e_sq *sq,
+					struct mlx5e_dma_info *di,
+					unsigned int data_offset,
+					int len)
 {
+	struct mlx5_wq_cyc       *wq   = &sq->wq;
+	u16                      pi    = sq->pc & wq->sz_m1;
+	struct mlx5e_tx_wqe      *wqe  = mlx5_wq_cyc_get_wqe(wq, pi);
+	struct mlx5e_sq_wqe_info *wi   = &sq->db.xdp.wqe_info[pi];
+
+	struct mlx5_wqe_ctrl_seg *cseg = &wqe->ctrl;
+	struct mlx5_wqe_eth_seg  *eseg = &wqe->eth;
+	struct mlx5_wqe_data_seg *dseg;
+
+	dma_addr_t dma_addr  = di->addr + data_offset + MLX5E_XDP_MIN_INLINE;
+	unsigned int dma_len = len - MLX5E_XDP_MIN_INLINE;
+	void *data           = page_address(di->page) + data_offset;
+
+	if (unlikely(!mlx5e_sq_has_room_for(sq, MLX5E_XDP_TX_WQEBBS))) {
+		sq->channel->rq.stats.xdp_tx_full++;
+		return false;
+	}
+
+	dma_sync_single_for_device(sq->pdev, dma_addr, dma_len,
+				   PCI_DMA_TODEVICE);
+
+	memset(wqe, 0, sizeof(*wqe));
+
+	/* copy the inline part */
+	memcpy(eseg->inline_hdr_start, data, MLX5E_XDP_MIN_INLINE);
+	eseg->inline_hdr_sz = cpu_to_be16(MLX5E_XDP_MIN_INLINE);
+
+	dseg = (struct mlx5_wqe_data_seg *)cseg + (MLX5E_XDP_TX_DS_COUNT - 1);
+
+	/* write the dma part */
+	dseg->addr       = cpu_to_be64(dma_addr);
+	dseg->byte_count = cpu_to_be32(dma_len);
+	dseg->lkey       = sq->mkey_be;
+
+	cseg->opmod_idx_opcode = cpu_to_be32((sq->pc << 8) | MLX5_OPCODE_SEND);
+	cseg->qpn_ds = cpu_to_be32((sq->sqn << 8) | MLX5E_XDP_TX_DS_COUNT);
+
+	sq->db.xdp.di[pi] = *di;
+	wi->opcode     = MLX5_OPCODE_SEND;
+	wi->num_wqebbs = MLX5E_XDP_TX_WQEBBS;
+	sq->pc += MLX5E_XDP_TX_WQEBBS;
+
+	/* TODO: xmit more */
+	wqe->ctrl.fm_ce_se = MLX5_WQE_CTRL_CQ_UPDATE;
+	mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
+
+	/* fill sq edge with nops to avoid wqe wrap around */
+	while ((pi = (sq->pc & wq->sz_m1)) > sq->edge) {
+		sq->db.xdp.wqe_info[pi].opcode = MLX5_OPCODE_NOP;
+		mlx5e_send_nop(sq, false);
+	}
+	return true;
+}
+
+/* returns true if packet was consumed by xdp */
+static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
+				    const struct bpf_prog *prog,
+				    struct mlx5e_dma_info *di,
+				    void *data, u16 len)
+{
+	bool consumed = false;
 	struct xdp_buff xdp;
+	u32 act;
+
+	if (!prog)
+		return false;
 
 	xdp.data = data;
 	xdp.data_end = xdp.data + len;
-	return bpf_prog_run_xdp(prog, &xdp);
+	act = bpf_prog_run_xdp(prog, &xdp);
+	switch (act) {
+	case XDP_PASS:
+		return false;
+	case XDP_TX:
+		consumed = mlx5e_xmit_xdp_frame(&rq->channel->xdp_sq, di,
+						MLX5_RX_HEADROOM,
+						len);
+		rq->stats.xdp_tx += consumed;
+		return consumed;
+	default:
+		bpf_warn_invalid_xdp_action(act);
+		return false;
+	case XDP_ABORTED:
+	case XDP_DROP:
+		rq->stats.xdp_drop++;
+		mlx5e_page_release(rq, di, true);
+		return true;
+	}
+
+	return false;
 }
 
 void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
@@ -651,21 +737,22 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 	__be16 wqe_counter_be;
 	struct sk_buff *skb;
 	u16 wqe_counter;
+	void *va, *data;
 	u32 cqe_bcnt;
-	void *va;
 
 	wqe_counter_be = cqe->wqe_counter;
 	wqe_counter    = be16_to_cpu(wqe_counter_be);
 	wqe            = mlx5_wq_ll_get_wqe(&rq->wq, wqe_counter);
 	di             = &rq->dma_info[wqe_counter];
 	va             = page_address(di->page);
+	data           = va + MLX5_RX_HEADROOM;
 
 	dma_sync_single_range_for_cpu(rq->pdev,
 				      di->addr,
 				      MLX5_RX_HEADROOM,
 				      rq->buff.wqe_sz,
 				      DMA_FROM_DEVICE);
-	prefetch(va + MLX5_RX_HEADROOM);
+	prefetch(data);
 	cqe_bcnt = be32_to_cpu(cqe->byte_cnt);
 
 	if (unlikely((cqe->op_own >> 4) != MLX5_CQE_RESP_SEND)) {
@@ -674,17 +761,8 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 		goto wq_ll_pop;
 	}
 
-	if (xdp_prog) {
-		enum xdp_action act =
-			mlx5e_xdp_handle(rq, xdp_prog, va + MLX5_RX_HEADROOM,
-					 cqe_bcnt);
-
-		if (act != XDP_PASS) {
-			rq->stats.xdp_drop++;
-			mlx5e_page_release(rq, di, true);
-			goto wq_ll_pop;
-		}
-	}
+	if (mlx5e_xdp_handle(rq, xdp_prog, di, data, cqe_bcnt))
+		goto wq_ll_pop; /* page/packet was consumed by XDP */
 
 	skb = build_skb(va, RQ_PAGE_SIZE(rq));
 	if (unlikely(!skb)) {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
index 084d6c893a6e..57452fdc5154 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
@@ -66,6 +66,8 @@ struct mlx5e_sw_stats {
 	u64 rx_csum_complete;
 	u64 rx_csum_unnecessary_inner;
 	u64 rx_xdp_drop;
+	u64 rx_xdp_tx;
+	u64 rx_xdp_tx_full;
 	u64 tx_csum_partial;
 	u64 tx_csum_partial_inner;
 	u64 tx_queue_stopped;
@@ -102,6 +104,8 @@ static const struct counter_desc sw_stats_desc[] = {
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_csum_complete) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_csum_unnecessary_inner) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xdp_drop) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xdp_tx) },
+	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xdp_tx_full) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_csum_partial) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_csum_partial_inner) },
 	{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_queue_stopped) },
@@ -281,6 +285,8 @@ struct mlx5e_rq_stats {
 	u64 lro_packets;
 	u64 lro_bytes;
 	u64 xdp_drop;
+	u64 xdp_tx;
+	u64 xdp_tx_full;
 	u64 wqe_err;
 	u64 mpwqe_filler;
 	u64 buff_alloc_err;
@@ -299,6 +305,8 @@ static const struct counter_desc rq_stats_desc[] = {
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, csum_unnecessary_inner) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, csum_none) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, xdp_drop) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, xdp_tx) },
+	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, xdp_tx_full) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, lro_packets) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, lro_bytes) },
 	{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, wqe_err) },
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index f02f24cfcb7a..70a717382357 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -495,16 +495,13 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 	return (i == MLX5E_TX_CQ_POLL_BUDGET);
 }
 
-void mlx5e_free_tx_descs(struct mlx5e_sq *sq)
+static void mlx5e_free_txq_sq_descs(struct mlx5e_sq *sq)
 {
 	struct mlx5e_tx_wqe_info *wi;
 	struct sk_buff *skb;
 	u16 ci;
 	int i;
 
-	if (sq->type != MLX5E_SQ_TXQ)
-		return;
-
 	while (sq->cc != sq->pc) {
 		ci = sq->cc & sq->wq.sz_m1;
 		skb = sq->db.txq.skb[ci];
@@ -526,3 +523,37 @@ void mlx5e_free_tx_descs(struct mlx5e_sq *sq)
 		sq->cc += wi->num_wqebbs;
 	}
 }
+
+static void mlx5e_free_xdp_sq_descs(struct mlx5e_sq *sq)
+{
+	struct mlx5e_sq_wqe_info *wi;
+	struct mlx5e_dma_info *di;
+	u16 ci;
+
+	while (sq->cc != sq->pc) {
+		ci = sq->cc & sq->wq.sz_m1;
+		di = &sq->db.xdp.di[ci];
+		wi = &sq->db.xdp.wqe_info[ci];
+
+		if (wi->opcode == MLX5_OPCODE_NOP) {
+			sq->cc++;
+			continue;
+		}
+
+		sq->cc += wi->num_wqebbs;
+
+		mlx5e_page_release(&sq->channel->rq, di, false);
+	}
+}
+
+void mlx5e_free_sq_descs(struct mlx5e_sq *sq)
+{
+	switch (sq->type) {
+	case MLX5E_SQ_TXQ:
+		mlx5e_free_txq_sq_descs(sq);
+		break;
+	case MLX5E_SQ_XDP:
+		mlx5e_free_xdp_sq_descs(sq);
+		break;
+	}
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
index 47cd5619ae02..397285dcdc8c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
@@ -72,7 +72,7 @@ static void mlx5e_poll_ico_cq(struct mlx5e_cq *cq)
 
 	do {
 		u16 ci = be16_to_cpu(cqe->wqe_counter) & wq->sz_m1;
-		struct mlx5e_ico_wqe_info *icowi = &sq->db.ico_wqe[ci];
+		struct mlx5e_sq_wqe_info *icowi = &sq->db.ico_wqe[ci];
 
 		mlx5_cqwq_pop(&cq->wq);
 		sqcc += icowi->num_wqebbs;
@@ -105,6 +105,66 @@ static void mlx5e_poll_ico_cq(struct mlx5e_cq *cq)
 	sq->cc = sqcc;
 }
 
+static inline bool mlx5e_poll_xdp_tx_cq(struct mlx5e_cq *cq)
+{
+	struct mlx5e_sq *sq;
+	u16 sqcc;
+	int i;
+
+	sq = container_of(cq, struct mlx5e_sq, cq);
+
+	if (unlikely(test_bit(MLX5E_SQ_STATE_FLUSH, &sq->state)))
+		return false;
+
+	/* sq->cc must be updated only after mlx5_cqwq_update_db_record(),
+	 * otherwise a cq overrun may occur
+	 */
+	sqcc = sq->cc;
+
+	for (i = 0; i < MLX5E_TX_CQ_POLL_BUDGET; i++) {
+		struct mlx5_cqe64 *cqe;
+		u16 wqe_counter;
+		bool last_wqe;
+
+		cqe = mlx5e_get_cqe(cq);
+		if (!cqe)
+			break;
+
+		mlx5_cqwq_pop(&cq->wq);
+
+		wqe_counter = be16_to_cpu(cqe->wqe_counter);
+
+		do {
+			struct mlx5e_sq_wqe_info *wi;
+			struct mlx5e_dma_info *di;
+			u16 ci;
+
+			last_wqe = (sqcc == wqe_counter);
+
+			ci = sqcc & sq->wq.sz_m1;
+			di = &sq->db.xdp.di[ci];
+			wi = &sq->db.xdp.wqe_info[ci];
+
+			if (unlikely(wi->opcode == MLX5_OPCODE_NOP)) {
+				sqcc++;
+				continue;
+			}
+
+			sqcc += wi->num_wqebbs;
+			/* Recycle RX page */
+			mlx5e_page_release(&cq->channel->rq, di, true);
+		} while (!last_wqe);
+	}
+
+	mlx5_cqwq_update_db_record(&cq->wq);
+
+	/* ensure cq space is freed before enabling more cqes */
+	wmb();
+
+	sq->cc = sqcc;
+	return (i == MLX5E_TX_CQ_POLL_BUDGET);
+}
+
 int mlx5e_napi_poll(struct napi_struct *napi, int budget)
 {
 	struct mlx5e_channel *c = container_of(napi, struct mlx5e_channel,
@@ -121,6 +181,9 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
 	work_done = mlx5e_poll_rx_cq(&c->rq.cq, budget);
 	busy |= work_done == budget;
 
+	if (c->xdp)
+		busy |= mlx5e_poll_xdp_tx_cq(&c->xdp_sq.cq);
+
 	mlx5e_poll_ico_cq(&c->icosq.cq);
 
 	busy |= mlx5e_post_rx_wqes(&c->rq);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH net-next 8/8] net/mlx5e: XDP TX xmit more
  2016-09-19 13:58 [PATCH net-next 0/8] mlx5e XDP support Tariq Toukan
                   ` (6 preceding siblings ...)
  2016-09-19 13:58 ` [PATCH net-next 7/8] net/mlx5e: XDP TX forwarding support Tariq Toukan
@ 2016-09-19 13:58 ` Tariq Toukan
  2016-09-20  7:46   ` Jesper Dangaard Brouer
  7 siblings, 1 reply; 32+ messages in thread
From: Tariq Toukan @ 2016-09-19 13:58 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eran Ben Elisha, Saeed Mahameed, Rana Shahout, Tariq Toukan

From: Saeed Mahameed <saeedm@mellanox.com>

Previously we rang the XDP SQ doorbell on every forwarded XDP packet.

Here we introduce an xmit-more-like mechanism that queues up more
than one packet into the SQ (up to the RX napi budget) w/o notifying
the hardware.

Once the RX napi budget is consumed and we exit the napi RX loop, we
flush (doorbell) all XDP looped packets, in case there are any.

XDP forward packet rate:

Comparing XDP with and w/o xmit more (bulk transmit):

RX Cores    XDP TX       XDP TX (xmit more)
---------------------------------------------------
1           6.5Mpps      12.4Mpps
2          13.2Mpps      24.2Mpps
4          25.2Mpps      36.3Mpps*
8          36.3Mpps*     36.3Mpps*

*My xmitter was limited to 36.3Mpps, so it is the bottleneck.
It seems that the receive side can handle more.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
---
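A rough back-of-envelope on the savings, assuming the default NAPI
budget of 64 packets per poll: at 12.4 Mpps the previous per-packet
scheme issued ~12.4M doorbell MMIO writes per second, while flushing
once per RX poll loop caps that at roughly 12.4M / 64 ~= 194K writes
per second, amortizing the expensive uncached MMIO write over the
whole burst.
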
 drivers/net/ethernet/mellanox/mlx5/core/en.h    |  9 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 62 +++++++++++++++++--------
 2 files changed, 48 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 82eededfc92a..b490b5b14529 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -266,7 +266,8 @@ struct mlx5e_cq {
 
 struct mlx5e_rq;
 typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq *rq,
-				       struct mlx5_cqe64 *cqe);
+				       struct mlx5_cqe64 *cqe,
+				       bool *xdp_doorbell);
 typedef int (*mlx5e_fp_alloc_wqe)(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe,
 				  u16 ix);
 
@@ -709,8 +710,10 @@ void mlx5e_free_sq_descs(struct mlx5e_sq *sq);
 
 void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
 			bool recycle);
-void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
-void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
+void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
+			 bool *xdp_doorbell);
+void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
+			       bool *xdp_doorbell);
 bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq);
 int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
 int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe,	u16 ix);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index fd8011bf25f8..4e7290823212 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -117,7 +117,8 @@ static inline void mlx5e_decompress_cqe_no_hash(struct mlx5e_rq *rq,
 static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
 					     struct mlx5e_cq *cq,
 					     int update_owner_only,
-					     int budget_rem)
+					     int budget_rem,
+					     bool *xdp_doorbell)
 {
 	u32 cqcc = cq->wq.cc + update_owner_only;
 	u32 cqe_count;
@@ -131,7 +132,7 @@ static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
 			mlx5e_read_mini_arr_slot(cq, cqcc);
 
 		mlx5e_decompress_cqe_no_hash(rq, cq, cqcc);
-		rq->handle_rx_cqe(rq, &cq->title);
+		rq->handle_rx_cqe(rq, &cq->title, xdp_doorbell);
 	}
 	mlx5e_cqes_update_owner(cq, cq->wq.cc, cqcc - cq->wq.cc);
 	cq->wq.cc = cqcc;
@@ -143,15 +144,17 @@ static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
 
 static inline u32 mlx5e_decompress_cqes_start(struct mlx5e_rq *rq,
 					      struct mlx5e_cq *cq,
-					      int budget_rem)
+					      int budget_rem,
+					      bool *xdp_doorbell)
 {
 	mlx5e_read_title_slot(rq, cq, cq->wq.cc);
 	mlx5e_read_mini_arr_slot(cq, cq->wq.cc + 1);
 	mlx5e_decompress_cqe(rq, cq, cq->wq.cc);
-	rq->handle_rx_cqe(rq, &cq->title);
+	rq->handle_rx_cqe(rq, &cq->title, xdp_doorbell);
 	cq->mini_arr_idx++;
 
-	return mlx5e_decompress_cqes_cont(rq, cq, 1, budget_rem) - 1;
+	return mlx5e_decompress_cqes_cont(rq, cq, 1, budget_rem,
+					  xdp_doorbell) - 1;
 }
 
 void mlx5e_modify_rx_cqe_compression(struct mlx5e_priv *priv, bool val)
@@ -679,23 +682,28 @@ static inline bool mlx5e_xmit_xdp_frame(struct mlx5e_sq *sq,
 	wi->num_wqebbs = MLX5E_XDP_TX_WQEBBS;
 	sq->pc += MLX5E_XDP_TX_WQEBBS;
 
-	/* TODO: xmit more */
+	/* mlx5e_xmit_xdp_doorbell will be called after the RX napi loop */
+	return true;
+}
+
+static inline void mlx5e_xmit_xdp_doorbell(struct mlx5e_sq *sq)
+{
+	struct mlx5_wq_cyc *wq = &sq->wq;
+	struct mlx5e_tx_wqe *wqe;
+	u16 pi = (sq->pc - MLX5E_XDP_TX_WQEBBS) & wq->sz_m1; /* last pi */
+
+	wqe  = mlx5_wq_cyc_get_wqe(wq, pi);
+
 	wqe->ctrl.fm_ce_se = MLX5_WQE_CTRL_CQ_UPDATE;
 	mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
-
-	/* fill sq edge with nops to avoid wqe wrap around */
-	while ((pi = (sq->pc & wq->sz_m1)) > sq->edge) {
-		sq->db.xdp.wqe_info[pi].opcode = MLX5_OPCODE_NOP;
-		mlx5e_send_nop(sq, false);
-	}
-	return true;
 }
 
 /* returns true if packet was consumed by xdp */
 static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 				    const struct bpf_prog *prog,
 				    struct mlx5e_dma_info *di,
-				    void *data, u16 len)
+				    void *data, u16 len,
+				    bool *xdp_doorbell)
 {
 	bool consumed = false;
 	struct xdp_buff xdp;
@@ -714,7 +722,13 @@ static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 		consumed = mlx5e_xmit_xdp_frame(&rq->channel->xdp_sq, di,
 						MLX5_RX_HEADROOM,
 						len);
+		if (unlikely(!consumed) && (*xdp_doorbell)) {
+			/* SQ is full, ring doorbell */
+			mlx5e_xmit_xdp_doorbell(&rq->channel->xdp_sq);
+			*xdp_doorbell = false;
+		}
 		rq->stats.xdp_tx += consumed;
+		*xdp_doorbell |= consumed;
 		return consumed;
 	default:
 		bpf_warn_invalid_xdp_action(act);
@@ -729,7 +743,8 @@ static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 	return false;
 }
 
-void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
+void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
+			 bool *xdp_doorbell)
 {
 	struct bpf_prog *xdp_prog = READ_ONCE(rq->xdp_prog);
 	struct mlx5e_dma_info *di;
@@ -761,7 +776,7 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 		goto wq_ll_pop;
 	}
 
-	if (mlx5e_xdp_handle(rq, xdp_prog, di, data, cqe_bcnt))
+	if (mlx5e_xdp_handle(rq, xdp_prog, di, data, cqe_bcnt, xdp_doorbell))
 		goto wq_ll_pop; /* page/packet was consumed by XDP */
 
 	skb = build_skb(va, RQ_PAGE_SIZE(rq));
@@ -823,7 +838,8 @@ static inline void mlx5e_mpwqe_fill_rx_skb(struct mlx5e_rq *rq,
 	skb->len  += headlen;
 }
 
-void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
+void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
+			       bool *xdp_doorbell)
 {
 	u16 cstrides       = mpwrq_get_cqe_consumed_strides(cqe);
 	u16 wqe_id         = be16_to_cpu(cqe->wqe_id);
@@ -869,13 +885,15 @@ mpwrq_cqe_out:
 int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
 {
 	struct mlx5e_rq *rq = container_of(cq, struct mlx5e_rq, cq);
+	bool xdp_doorbell = false;
 	int work_done = 0;
 
 	if (unlikely(test_bit(MLX5E_RQ_STATE_FLUSH, &rq->state)))
 		return 0;
 
 	if (cq->decmprs_left)
-		work_done += mlx5e_decompress_cqes_cont(rq, cq, 0, budget);
+		work_done += mlx5e_decompress_cqes_cont(rq, cq, 0, budget,
+							&xdp_doorbell);
 
 	for (; work_done < budget; work_done++) {
 		struct mlx5_cqe64 *cqe = mlx5e_get_cqe(cq);
@@ -886,15 +904,19 @@ int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
 		if (mlx5_get_cqe_format(cqe) == MLX5_COMPRESSED) {
 			work_done +=
 				mlx5e_decompress_cqes_start(rq, cq,
-							    budget - work_done);
+							    budget - work_done,
+							    &xdp_doorbell);
 			continue;
 		}
 
 		mlx5_cqwq_pop(&cq->wq);
 
-		rq->handle_rx_cqe(rq, cqe);
+		rq->handle_rx_cqe(rq, cqe, &xdp_doorbell);
 	}
 
+	if (xdp_doorbell)
+		mlx5e_xmit_xdp_doorbell(&rq->channel->xdp_sq);
+
 	mlx5_cqwq_update_db_record(&cq->wq);
 
 	/* ensure cq space is freed before enabling more cqes */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 8/8] net/mlx5e: XDP TX xmit more
  2016-09-19 13:58 ` [PATCH net-next 8/8] net/mlx5e: XDP TX xmit more Tariq Toukan
@ 2016-09-20  7:46   ` Jesper Dangaard Brouer
  2016-09-20  8:19     ` Tariq Toukan
  0 siblings, 1 reply; 32+ messages in thread
From: Jesper Dangaard Brouer @ 2016-09-20  7:46 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: brouer, David S. Miller, netdev, Eran Ben Elisha, Saeed Mahameed,
	Rana Shahout

On Mon, 19 Sep 2016 16:58:59 +0300
Tariq Toukan <tariqt@mellanox.com> wrote:

> From: Saeed Mahameed <saeedm@mellanox.com>
> 
> Previously we rang the XDP SQ doorbell on every forwarded XDP packet.
> 
> Here we introduce an xmit-more-like mechanism that queues up more
> than one packet into the SQ (up to the RX napi budget) w/o notifying
> the hardware.
>
> Once the RX napi budget is consumed and we exit the napi RX loop, we
> flush (doorbell) all XDP looped packets, in case there are any.

I've already raised strong concerns with this approach on the RFC
patchset: it does not really take advantage of RX bulking.
Please do not ignore this!

If you can promise that we/you will also try the other approach I'm
suggesting, then I'm fine with this patch.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 8/8] net/mlx5e: XDP TX xmit more
  2016-09-20  7:46   ` Jesper Dangaard Brouer
@ 2016-09-20  8:19     ` Tariq Toukan
  2016-09-20  9:26       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 32+ messages in thread
From: Tariq Toukan @ 2016-09-20  8:19 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Tariq Toukan
  Cc: David S. Miller, netdev, Eran Ben Elisha, Saeed Mahameed, Rana Shahout

Hi Jesper,

On 20/09/2016 10:46 AM, Jesper Dangaard Brouer wrote:
> On Mon, 19 Sep 2016 16:58:59 +0300
> Tariq Toukan <tariqt@mellanox.com> wrote:
>
>> From: Saeed Mahameed <saeedm@mellanox.com>
>>
>> Previously we rang the XDP SQ doorbell on every forwarded XDP packet.
>>
>> Here we introduce an xmit-more-like mechanism that queues up more
>> than one packet into the SQ (up to the RX napi budget) w/o notifying
>> the hardware.
>>
>> Once the RX napi budget is consumed and we exit the napi RX loop, we
>> flush (doorbell) all XDP looped packets, in case there are any.
> I've already raised strong concerns with this approach on the RFC
> patchset: it does not really take advantage of RX bulking.
> Please do not ignore this!
Sure. Your approach can fit with our plans to split the RX completion 
poll loop into several stages.
I will try it when we get there.
>
> If you can promise that we/you will also try the other approach I'm
> suggesting, then I'm fine with this patch.
>
Thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 7/8] net/mlx5e: XDP TX forwarding support
  2016-09-19 13:58 ` [PATCH net-next 7/8] net/mlx5e: XDP TX forwarding support Tariq Toukan
@ 2016-09-20  8:29   ` Jesper Dangaard Brouer
  2016-09-20 11:33     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 32+ messages in thread
From: Jesper Dangaard Brouer @ 2016-09-20  8:29 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: David S. Miller, netdev, Eran Ben Elisha, Saeed Mahameed, Rana Shahout


On Mon, 19 Sep 2016 16:58:58 +0300 Tariq Toukan <tariqt@mellanox.com> wrote:

> +	act = bpf_prog_run_xdp(prog, &xdp);
> +	switch (act) {
> +	case XDP_PASS:
> +		return false;
> +	case XDP_TX:
> +		consumed = mlx5e_xmit_xdp_frame(&rq->channel->xdp_sq, di,
> +						MLX5_RX_HEADROOM,
> +						len);
> +		rq->stats.xdp_tx += consumed;
> +		return consumed;
> +	default:
> +		bpf_warn_invalid_xdp_action(act);
> +		return false;

This looks wrong.  According to the specification[1] and comment in the
code /include/uapi/linux/bpf.h: "Unknown return codes will result in
packet drop"

> +	case XDP_ABORTED:

It is not clearly defined, but I believe XDP_ABORTED should also result
in a warning (calling bpf_warn_invalid_xdp_action(act)).


> +	case XDP_DROP:
> +		rq->stats.xdp_drop++;
> +		mlx5e_page_release(rq, di, true);
> +		return true;
> +	}


I've started documenting XDP[2], as this patch clearly shows there is a
need for a specification, given that already the second driver to
support XDP gets these details wrong.

[1] https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/implementation/xdp_actions.html#xdp-aborted

[2] https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/index.html
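
For reference, a minimal sketch of the dispatch order the
specification[1] implies, reusing this patch's own helpers (whether
XDP_ABORTED should additionally warn is the open question above):
unknown return codes warn and then fall through into the drop path,
instead of being passed up the stack:

	act = bpf_prog_run_xdp(prog, &xdp);
	switch (act) {
	case XDP_PASS:
		return false;	/* hand the packet to the stack */
	case XDP_TX:
		consumed = mlx5e_xmit_xdp_frame(&rq->channel->xdp_sq, di,
						MLX5_RX_HEADROOM, len);
		rq->stats.xdp_tx += consumed;
		return consumed;
	default:
		bpf_warn_invalid_xdp_action(act);
		/* fall through: unknown return codes result in packet drop */
	case XDP_ABORTED:
	case XDP_DROP:
		rq->stats.xdp_drop++;
		mlx5e_page_release(rq, di, true);
		return true;	/* page consumed (dropped) */
	}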

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 8/8] net/mlx5e: XDP TX xmit more
  2016-09-20  8:19     ` Tariq Toukan
@ 2016-09-20  9:26       ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 32+ messages in thread
From: Jesper Dangaard Brouer @ 2016-09-20  9:26 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: brouer, Tariq Toukan, David S. Miller, netdev, Eran Ben Elisha,
	Saeed Mahameed, Rana Shahout

On Tue, 20 Sep 2016 11:19:46 +0300
Tariq Toukan <ttoukan.linux@gmail.com> wrote:

> Hi Jesper,
> 
> On 20/09/2016 10:46 AM, Jesper Dangaard Brouer wrote:
> > On Mon, 19 Sep 2016 16:58:59 +0300
> > Tariq Toukan <tariqt@mellanox.com> wrote:
> >  
> >> From: Saeed Mahameed <saeedm@mellanox.com>
> >>
> >> Previously we rang the XDP SQ doorbell on every forwarded XDP packet.
> >>
> >> Here we introduce an xmit-more-like mechanism that queues up more
> >> than one packet into the SQ (up to the RX napi budget) w/o notifying
> >> the hardware.
> >>
> >> Once the RX napi budget is consumed and we exit the napi RX loop, we
> >> flush (doorbell) all XDP looped packets, in case there are any.  
> > I've already raised strong concerns with this approach on the RFC
> > patchset: it does not really take advantage of RX bulking.
> > Please do not ignore this!
>
> Sure. Your approach can fit with our plans to split the RX completion 
> poll loop into several stages.
> I will try it when we get there.
> 
> > If you can promise that we/you will also try the other approach I'm
> > suggesting, then I'm fine with this patch.

I'll take the above as a promise that there are plans to work in the
direction I'm requesting. Thanks!

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 7/8] net/mlx5e: XDP TX forwarding support
  2016-09-20  8:29   ` Jesper Dangaard Brouer
@ 2016-09-20 11:33     ` Jesper Dangaard Brouer
  2016-09-20 12:53       ` Tariq Toukan
  0 siblings, 1 reply; 32+ messages in thread
From: Jesper Dangaard Brouer @ 2016-09-20 11:33 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: brouer, David S. Miller, netdev, Eran Ben Elisha, Saeed Mahameed,
	Rana Shahout


Resending, email didn't reach the list (used wrong sender email)

On Tue, 20 Sep 2016 10:29:43 +0200 Jesper Dangaard Brouer <netdev@brouer.com> wrote:

> On Mon, 19 Sep 2016 16:58:58 +0300 Tariq Toukan <tariqt@mellanox.com> wrote:
> 
> > +	act = bpf_prog_run_xdp(prog, &xdp);
> > +	switch (act) {
> > +	case XDP_PASS:
> > +		return false;
> > +	case XDP_TX:
> > +		consumed = mlx5e_xmit_xdp_frame(&rq->channel->xdp_sq, di,
> > +						MLX5_RX_HEADROOM,
> > +						len);
> > +		rq->stats.xdp_tx += consumed;
> > +		return consumed;
> > +	default:
> > +		bpf_warn_invalid_xdp_action(act);
> > +		return false;  
> 
> This looks wrong.  According to the specification[1] and comment in the
> code /include/uapi/linux/bpf.h: "Unknown return codes will result in
> packet drop"
> 
> > +	case XDP_ABORTED:  
> 
> It is not clearly defined, but I believe XDP_ABORTED should also result
> in a warning (calling bpf_warn_invalid_xdp_action(act)).
> 
> 
> > +	case XDP_DROP:
> > +		rq->stats.xdp_drop++;
> > +		mlx5e_page_release(rq, di, true);
> > +		return true;
> > +	}  
> 
> 
> I've started documenting XDP[2], as this patch clearly shows there is a
> need for a specification, given that already the second driver to
> support XDP gets these details wrong.
> 
> [1] https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/implementation/xdp_actions.html#xdp-aborted
> 
> [2] https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/index.html
> 

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 7/8] net/mlx5e: XDP TX forwarding support
  2016-09-20 11:33     ` Jesper Dangaard Brouer
@ 2016-09-20 12:53       ` Tariq Toukan
  2016-09-20 15:40         ` Alexei Starovoitov
  0 siblings, 1 reply; 32+ messages in thread
From: Tariq Toukan @ 2016-09-20 12:53 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Tariq Toukan
  Cc: David S. Miller, netdev, Eran Ben Elisha, Saeed Mahameed, Rana Shahout


On 20/09/2016 2:33 PM, Jesper Dangaard Brouer wrote:
> Resending, email didn't reach the list (used wrong sender email)
>
> On Tue, 20 Sep 2016 10:29:43 +0200 Jesper Dangaard Brouer <netdev@brouer.com> wrote:
>
>> On Mon, 19 Sep 2016 16:58:58 +0300 Tariq Toukan <tariqt@mellanox.com> wrote:
>>
>>> +	act = bpf_prog_run_xdp(prog, &xdp);
>>> +	switch (act) {
>>> +	case XDP_PASS:
>>> +		return false;
>>> +	case XDP_TX:
>>> +		consumed = mlx5e_xmit_xdp_frame(&rq->channel->xdp_sq, di,
>>> +						MLX5_RX_HEADROOM,
>>> +						len);
>>> +		rq->stats.xdp_tx += consumed;
>>> +		return consumed;
>>> +	default:
>>> +		bpf_warn_invalid_xdp_action(act);
>>> +		return false;
>> This looks wrong.  According to the specification[1] and comment in the
>> code /include/uapi/linux/bpf.h: "Unknown return codes will result in
>> packet drop"
I'll fix this.
>>> +	case XDP_ABORTED:
>> It is not clearly defined, but I believe XDP_ABORTED should also result
>> in a warning (calling bpf_warn_invalid_xdp_action(act)).
I'll add this.
>>
>>
>>> +	case XDP_DROP:
>>> +		rq->stats.xdp_drop++;
>>> +		mlx5e_page_release(rq, di, true);
>>> +		return true;
>>> +	}
>>
>> I've started documenting XDP[2], as this patch clearly shows there is a
>> need for a specification, given that already the second driver to
>> support XDP gets these details wrong.
Indeed. Thanks for doing this.
>> [1] https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/implementation/xdp_actions.html#xdp-aborted
>>
>> [2] https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/index.html
>>
Regards,
Tariq

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 7/8] net/mlx5e: XDP TX forwarding support
  2016-09-20 12:53       ` Tariq Toukan
@ 2016-09-20 15:40         ` Alexei Starovoitov
  2016-09-20 15:51           ` Tom Herbert
                             ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2016-09-20 15:40 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Jesper Dangaard Brouer, Tariq Toukan, David S. Miller, netdev,
	Eran Ben Elisha, Saeed Mahameed, Rana Shahout

On Tue, Sep 20, 2016 at 03:53:10PM +0300, Tariq Toukan wrote:
> >>>+	case XDP_ABORTED:
> >>It is not clearly defined, but I believe XDP_ABORTED should also result
> >>in a warning (calling bpf_warn_invalid_xdp_action(act)).
> I'll add this.

Certainly NOT.
XDP_ABORTED is an exception case when program does divide by zero.
It should NOT do bpf_warn. It must drop the packet.
We discussed it several months ago.
See mlx4/en_rx.c for canonical implementation.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 7/8] net/mlx5e: XDP TX forwarding support
  2016-09-20 15:40         ` Alexei Starovoitov
@ 2016-09-20 15:51           ` Tom Herbert
  2016-09-20 15:58             ` Eric Dumazet
  2016-09-20 16:00             ` Alexei Starovoitov
  2016-09-20 15:57           ` Jesper Dangaard Brouer
  2016-09-21  7:57           ` Tariq Toukan
  2 siblings, 2 replies; 32+ messages in thread
From: Tom Herbert @ 2016-09-20 15:51 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Tariq Toukan, Jesper Dangaard Brouer, Tariq Toukan,
	David S. Miller, Linux Kernel Network Developers,
	Eran Ben Elisha, Saeed Mahameed, Rana Shahout

On Tue, Sep 20, 2016 at 8:40 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Tue, Sep 20, 2016 at 03:53:10PM +0300, Tariq Toukan wrote:
>> >>>+  case XDP_ABORTED:
>> >>It is not clearly defined, but I believe XDP_ABORTED should also result
>> >>in a warning (calling bpf_warn_invalid_xdp_action(act)).
>> I'll add this.
>
> Certainly NOT.
> XDP_ABORTED is an exception case when program does divide by zero.
> It should NOT do bpf_warn. It must drop the packet.
> We discussed it several months ago.
> See mlx4/en_rx.c for canonical implementation.
>
It should at least bump a counter so that the user knows that aborts
are happening.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 7/8] net/mlx5e: XDP TX forwarding support
  2016-09-20 15:40         ` Alexei Starovoitov
  2016-09-20 15:51           ` Tom Herbert
@ 2016-09-20 15:57           ` Jesper Dangaard Brouer
  2016-09-21  7:57           ` Tariq Toukan
  2 siblings, 0 replies; 32+ messages in thread
From: Jesper Dangaard Brouer @ 2016-09-20 15:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Tariq Toukan, Tariq Toukan, David S. Miller, netdev,
	Eran Ben Elisha, Saeed Mahameed, Rana Shahout, brouer

On Tue, 20 Sep 2016 08:40:37 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Tue, Sep 20, 2016 at 03:53:10PM +0300, Tariq Toukan wrote:
> > >>>+	case XDP_ABORTED:  
> > >>It is not clearly defined, but I believe XDP_ABORTED should also result
> > >>in a warning (calling bpf_warn_invalid_xdp_action(act)).  
> > I'll add this.  
> 
> Certainly NOT.
> XDP_ABORTED is an exception case when program does divide by zero.
> It should NOT do bpf_warn. It must drop the packet.
> We discussed it several months ago.
> See mlx4/en_rx.c for canonical implementation.

It was certainly not documented, and my memory fails me.

Please explain why an eBPF program error (div by zero) must be a silent drop?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 7/8] net/mlx5e: XDP TX forwarding support
  2016-09-20 15:51           ` Tom Herbert
@ 2016-09-20 15:58             ` Eric Dumazet
  2016-09-20 16:06               ` Jesper Dangaard Brouer
  2016-09-20 16:00             ` Alexei Starovoitov
  1 sibling, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2016-09-20 15:58 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Alexei Starovoitov, Tariq Toukan, Jesper Dangaard Brouer,
	Tariq Toukan, David S. Miller, Linux Kernel Network Developers,
	Eran Ben Elisha, Saeed Mahameed, Rana Shahout

On Tue, 2016-09-20 at 08:51 -0700, Tom Herbert wrote:
> On Tue, Sep 20, 2016 at 8:40 AM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Tue, Sep 20, 2016 at 03:53:10PM +0300, Tariq Toukan wrote:
> >> >>>+  case XDP_ABORTED:
> >> >>It is not clearly defined, but I believe XDP_ABORTED should also result
> >> >>in a warning (calling bpf_warn_invalid_xdp_action(act)).
> >> I'll add this.
> >
> > Certainly NOT.
> > XDP_ABORTED is an exception case when program does divide by zero.
> > It should NOT do bpf_warn. It must drop the packet.
> > We discussed it several months ago.
> > See mlx4/en_rx.c for canonical implementation.
> >
> It should at least bump a counter so that the user knows that aborts
> are happening.

Same for XDP_TX if/when packet is dropped because output ring is full.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 7/8] net/mlx5e: XDP TX forwarding support
  2016-09-20 15:51           ` Tom Herbert
  2016-09-20 15:58             ` Eric Dumazet
@ 2016-09-20 16:00             ` Alexei Starovoitov
  1 sibling, 0 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2016-09-20 16:00 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Tariq Toukan, Jesper Dangaard Brouer, Tariq Toukan,
	David S. Miller, Linux Kernel Network Developers,
	Eran Ben Elisha, Saeed Mahameed, Rana Shahout

On Tue, Sep 20, 2016 at 08:51:05AM -0700, Tom Herbert wrote:
> On Tue, Sep 20, 2016 at 8:40 AM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Tue, Sep 20, 2016 at 03:53:10PM +0300, Tariq Toukan wrote:
> >> >>>+  case XDP_ABORTED:
> >> >>It is not clearly defined, but I believe XDP_ABORTED should also result
> >> >>in a warning (calling bpf_warn_invalid_xdp_action(act)).
> >> I'll add this.
> >
> > Certainly NOT.
> > XDP_ABORTED is an exception case when program does divide by zero.
> > It should NOT do bpf_warn. It must drop the packet.
> > We discussed it several months ago.
> > See mlx4/en_rx.c for canonical implementation.
> >
> It should at least bump a counter so that the user knows that aborts
> are happening.

yes the driver can add another counter, but it's not mandated by the xdp side.
Meaning it's up to the driver to have counters and its own way of
reporting them (so far all drivers do it via ethtool).
We don't define any counters via the XDP api, since they cannot be defined
in a common way across different nics and hw, and we already have ethtool,
which is good enough. Especially in this case the 'aborted' counter is
only for debugging. The program should not have a 'divide by zero' when
it's written correctly.
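
For example, the counters this very series adds are already visible
that way (output illustrative):

	# ethtool -S eth0 | grep xdp
	     rx_xdp_drop: 0
	     rx_xdp_tx: 0
	     rx_xdp_tx_full: 0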

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 7/8] net/mlx5e: XDP TX forwarding support
  2016-09-20 15:58             ` Eric Dumazet
@ 2016-09-20 16:06               ` Jesper Dangaard Brouer
  2016-09-20 16:13                 ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Jesper Dangaard Brouer @ 2016-09-20 16:06 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tom Herbert, Alexei Starovoitov, Tariq Toukan, Tariq Toukan,
	David S. Miller, Linux Kernel Network Developers,
	Eran Ben Elisha, Saeed Mahameed, Rana Shahout, brouer

On Tue, 20 Sep 2016 08:58:30 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Tue, 2016-09-20 at 08:51 -0700, Tom Herbert wrote:
> > On Tue, Sep 20, 2016 at 8:40 AM, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:  
> > > On Tue, Sep 20, 2016 at 03:53:10PM +0300, Tariq Toukan wrote:  
> > >> >>>+  case XDP_ABORTED:  
> > >> >>It is not clearly defined, but I believe XDP_ABORTED should also result
> > >> >>in a warning (calling bpf_warn_invalid_xdp_action(act)).  
> > >> I'll add this.  
> > >
> > > Certainly NOT.
> > > XDP_ABORTED is an exception case when program does divide by zero.
> > > It should NOT do bpf_warn. It must drop the packet.
> > > We discussed it several months ago.
> > > See mlx4/en_rx.c for canonical implementation.
> > >  
> > It should at least bump a counter so that the user knows that aborts
> > are happening.  

I agree.

> Same for XDP_TX if/when packet is dropped because output ring is full.

For the XDP_TX case a counter is already incremented[1] but it is a
local ring counter (ring->tx_dropped++).

Do you think we should maintain separate counters for XDP? (to have a
more consistent interface across drivers...)


[1] https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/mellanox/mlx4/en_tx.c#L1181

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 7/8] net/mlx5e: XDP TX forwarding support
  2016-09-20 16:06               ` Jesper Dangaard Brouer
@ 2016-09-20 16:13                 ` Eric Dumazet
  2016-09-20 16:27                   ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2016-09-20 16:13 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Tom Herbert, Alexei Starovoitov, Tariq Toukan, Tariq Toukan,
	David S. Miller, Linux Kernel Network Developers,
	Eran Ben Elisha, Saeed Mahameed, Rana Shahout

On Tue, 2016-09-20 at 18:06 +0200, Jesper Dangaard Brouer wrote:
> On Tue, 20 Sep 2016 08:58:30 -0700
> Eric Dumazet <eric.dumazet@gmail.com> wrote:

> > Same for XDP_TX if/when packet is dropped because output ring is full.
> 
> For the XDP_TX case a counter is already incremented[1] but it is a
> local ring counter (ring->tx_dropped++).
> 
> Do you think we should maintain separate counters for XDP? (to have a
> more consistent interface across drivers...)

No, as long as the admin can learn drops are occurring.

"perf ... -e skb:kfree_skb ..." or drop monitor won't work,
unfortunately (XDP drops never allocate an skb, so there is no
kfree_skb to hook).

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 7/8] net/mlx5e: XDP TX forwarding support
  2016-09-20 16:13                 ` Eric Dumazet
@ 2016-09-20 16:27                   ` Jesper Dangaard Brouer
  2016-09-20 16:45                     ` Alexei Starovoitov
  0 siblings, 1 reply; 32+ messages in thread
From: Jesper Dangaard Brouer @ 2016-09-20 16:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tom Herbert, Alexei Starovoitov, Tariq Toukan, Tariq Toukan,
	David S. Miller, Linux Kernel Network Developers,
	Eran Ben Elisha, Saeed Mahameed, Rana Shahout, brouer

On Tue, 20 Sep 2016 09:13:56 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Tue, 2016-09-20 at 18:06 +0200, Jesper Dangaard Brouer wrote:
> > On Tue, 20 Sep 2016 08:58:30 -0700
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:  
> 
> > > Same for XDP_TX if/when packet is dropped because output ring is full.  
> > 
> > For the XDP_TX case a counter is already incremented[1] but it is a
> > local ring counter (ring->tx_dropped++).
> > 
> > Do you think we should maintain separate counters for XDP? (to have a
> > more consistent interface across drivers...)  
> 
> No, as long as the admin can learn drops are occurring.

Okay, so recording these drops is important for an admin, agreed.  Now
that we have the chance to define the API, wouldn't it be nice if the
admin knew, across drivers, what counter to look for???

> "perf ... -e skb:kfree_skb ..." or drop monitor wont work,
> unfortunately.

That is actually a good idea... why not add a tracepoint for these
rare drop cases, which are a hassle for an admin to debug?
Let's actually take advantage of all the nice infrastructure the kernel
provides(?)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 7/8] net/mlx5e: XDP TX forwarding support
  2016-09-20 16:27                   ` Jesper Dangaard Brouer
@ 2016-09-20 16:45                     ` Alexei Starovoitov
  2016-09-20 16:58                       ` Thomas Graf
                                         ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Alexei Starovoitov @ 2016-09-20 16:45 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, Tom Herbert, Tariq Toukan, Tariq Toukan,
	David S. Miller, Linux Kernel Network Developers,
	Eran Ben Elisha, Saeed Mahameed, Rana Shahout

On Tue, Sep 20, 2016 at 06:27:17PM +0200, Jesper Dangaard Brouer wrote:
> On Tue, 20 Sep 2016 09:13:56 -0700
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > On Tue, 2016-09-20 at 18:06 +0200, Jesper Dangaard Brouer wrote:
> > > On Tue, 20 Sep 2016 08:58:30 -0700
> > > Eric Dumazet <eric.dumazet@gmail.com> wrote:  
> > 
> > > > Same for XDP_TX if/when packet is dropped because output ring is full.  
> > > 
> > > For the XDP_TX case a counter is already incremented[1] but it is a
> > > local ring counter (ring->tx_dropped++).
> > > 
> > > Do you think we should maintain separate counters for XDP? (to have a
> > > more consistent interface across drivers...)  
> > 
> > No, as long as the admin can learn drops are occurring.
> 
> Okay, so recording these drops is important for an admin, agreed.  Now
> that we have the chance to define the API, wouldn't it be nice if the
> admin knew what counter to look for across drivers???
> 
> > "perf ... -e skb:kfree_skb ..." or drop monitor wont work,
> > unfortunately.
> 
> That is actually a good idea... why not add a tracepoint for these
> rare drop cases, which are a hassle to debug for an admin?
> Let's actually take advantage of all the nice infrastructure the kernel
> provides(?)

that's a great idea. Instead of adding aborted, ring_full, blahblah counters
everywhere that cost performance, let's add tracepoints that are nops
when not in use.
And if all drivers call the same tracepoints it will be even better.
Monitoring a trace_xdp_aborted tracepoint would be a generic way to debug
'divide by zero'.
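
For illustration, a minimal sketch of what such a shared tracepoint
could look like (purely hypothetical: no trace_xdp_aborted exists in
the tree today, the event fields are made up, and the usual
trace-header boilerplate is elided):

  TRACE_EVENT(xdp_aborted,

          TP_PROTO(const struct net_device *dev, u32 act),

          TP_ARGS(dev, act),

          TP_STRUCT__entry(
                  __string(name, dev->name)
                  __field(u32, act)
          ),

          TP_fast_assign(
                  __assign_str(name, dev->name);
                  __entry->act = act;
          ),

          TP_printk("dev=%s act=%u", __get_str(name), __entry->act)
  );

A driver would call trace_xdp_aborted(netdev, act) in its XDP_ABORTED
branch; while the tracepoint is disabled it is patched to a nop, so
the fast path pays (almost) nothing.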

To your other question:
> Please explain why an eBPF program error (div by zero) must be a silent drop?

because 'div by zero' is an abnormal situation that shouldn't be exploitable.
Meaning if an xdp program is doing DoS prevention and has a bug, an
attacker can exploit it by sending a crafted packet that causes a
'div by zero'; if the kernel then warns, the attack has succeeded.
Therefore it has to be a silent drop.
A tracepoint in such a case is great, since the user can do debugging
with it and even monitoring 24/7; if the control plane suddenly sees a
lot of trace_xdp_aborted events, it can disable that tracepoint to
avoid spam and adjust the program, or act on the attack some other way.
Hardcoded warnings and counters are not generic enough for all
the use cases people want to throw at XDP.
The tracepoints idea is awesome, in the sense that it's optional.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 7/8] net/mlx5e: XDP TX forwarding support
  2016-09-20 16:45                     ` Alexei Starovoitov
@ 2016-09-20 16:58                       ` Thomas Graf
  2016-09-20 17:39                       ` Eric Dumazet
  2016-09-20 21:04                       ` Jesper Dangaard Brouer
  2 siblings, 0 replies; 32+ messages in thread
From: Thomas Graf @ 2016-09-20 16:58 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jesper Dangaard Brouer, Eric Dumazet, Tom Herbert, Tariq Toukan,
	Tariq Toukan, David S. Miller, Linux Kernel Network Developers,
	Eran Ben Elisha, Saeed Mahameed, Rana Shahout

On 09/20/16 at 09:45am, Alexei Starovoitov wrote:
> that's a great idea. Instead of adding aborted, ring_full, blahblah counters
> everywhere that cost performance, let's add tracepoints that are nops
> when not in use.
> And if all drivers call the same tracepoints it will be even better.
> Monitoring a trace_xdp_aborted tracepoint would be a generic way to debug
> 'divide by zero'.

Not opposed but I still need an indication to start tracing ;-) A
single counter would be enough though.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 7/8] net/mlx5e: XDP TX forwarding support
  2016-09-20 16:45                     ` Alexei Starovoitov
  2016-09-20 16:58                       ` Thomas Graf
@ 2016-09-20 17:39                       ` Eric Dumazet
  2016-09-20 18:59                         ` Jesper Dangaard Brouer
  2016-09-20 21:04                       ` Jesper Dangaard Brouer
  2 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2016-09-20 17:39 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jesper Dangaard Brouer, Tom Herbert, Tariq Toukan, Tariq Toukan,
	David S. Miller, Linux Kernel Network Developers,
	Eran Ben Elisha, Saeed Mahameed, Rana Shahout

On Tue, 2016-09-20 at 09:45 -0700, Alexei Starovoitov wrote:

> because 'div by zero' is an abnormal situation that shouldn't be exploitable.
> Meaning if an xdp program is doing DoS prevention and has a bug, an
> attacker can exploit it by sending a crafted packet that causes a
> 'div by zero'; if the kernel then warns, the attack has succeeded.
> Therefore it has to be a silent drop.

A silent drop means a genuine error in a BPF program might never be
caught, since a tracepoint might never be enabled.

> A tracepoint in such a case is great, since the user can do debugging
> with it and even monitoring 24/7; if the control plane suddenly sees a
> lot of trace_xdp_aborted events, it can disable that tracepoint to
> avoid spam and adjust the program, or act on the attack some other way.
> Hardcoded warnings and counters are not generic enough for all
> the use cases people want to throw at XDP.
> The tracepoints idea is awesome, in the sense that it's optional.


Note that tracepoints are optional in a kernel.

Many existing supervision infrastructures collect device snmp counters,
and run as unprivileged programs. 

tracepoints might not fit the need here, compared to a mere
tx_ring->tx_drops++
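
To make the cost argument concrete, here is a toy model of the
per-ring counter approach in plain C (all names hypothetical, not the
actual mlx5e code):

  #include <stdbool.h>
  #include <stdint.h>

  struct tx_ring {
          uint32_t head, tail, size;      /* size is the ring capacity */
          uint64_t tx_drops;              /* bumped when the ring is full */
  };

  static bool ring_has_room(const struct tx_ring *r)
  {
          return (uint32_t)(r->head - r->tail) < r->size;
  }

  static bool xdp_tx_enqueue(struct tx_ring *r)
  {
          if (!ring_has_room(r)) {
                  r->tx_drops++;  /* cache line the TX path owns anyway */
                  return false;   /* caller recycles the RX page */
          }
          r->head++;              /* descriptor posting elided */
          return true;
  }

The increment touches memory the TX path already dirties, so it is
nearly free, and unlike a tracepoint it stays visible to unprivileged
collectors through the usual stats interfaces.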

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 7/8] net/mlx5e: XDP TX forwarding support
  2016-09-20 17:39                       ` Eric Dumazet
@ 2016-09-20 18:59                         ` Jesper Dangaard Brouer
  2016-09-20 19:21                           ` Tom Herbert
  2016-09-20 20:47                           ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 32+ messages in thread
From: Jesper Dangaard Brouer @ 2016-09-20 18:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexei Starovoitov, Tom Herbert, Tariq Toukan, Tariq Toukan,
	David S. Miller, Linux Kernel Network Developers,
	Eran Ben Elisha, Saeed Mahameed, Rana Shahout, brouer

On Tue, 20 Sep 2016 10:39:20 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Tue, 2016-09-20 at 09:45 -0700, Alexei Starovoitov wrote:
> 
> > because 'div by zero' is an abnormal situation that shouldn't be exploitable.
> > Meaning if an xdp program is doing DoS prevention and has a bug, an
> > attacker can exploit it by sending a crafted packet that causes a
> > 'div by zero'; if the kernel then warns, the attack has succeeded.
> > Therefore it has to be a silent drop.
> 
> A silent drop means a genuine error in a BPF program might never be
> caught, since a tracepoint might never be enabled.

I do see your point. But we can document our way out of it.
 
> > A tracepoint in such a case is great, since the user can do debugging
> > with it and even monitoring 24/7; if the control plane suddenly sees a
> > lot of trace_xdp_aborted events, it can disable that tracepoint to
> > avoid spam and adjust the program, or act on the attack some other way.
> > Hardcoded warnings and counters are not generic enough for all
> > the use cases people want to throw at XDP.
> > The tracepoints idea is awesome, in the sense that it's optional.
> 
> 
> Note that tracepoints are optional in a kernel.

Well, that is a good thing, as it can be compiled out (which provides
an option for zero cost).


> Many existing supervision infrastructures collect device snmp
> counters, and run as unprivileged programs. 

A supervision infrastructure is a valid use-case. It again indicates
that such XDP stats need to be structured, not just a random
driver-specific ethtool counter, to make it easy for such collection daemons.


> tracepoints might not fit the need here, compared to a mere
> tx_ring->tx_drops++

I do see your point.  I really liked the tracepoint idea, but now I'm
uncertain again...


I do have a use-case where I want to use the NIC HW-RX-ingress-overflow
and TX-overflow drop indicators, but I don't want to tie it into this
discussion.  The abort and error indicators are not relevant for that
use-case.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 7/8] net/mlx5e: XDP TX forwarding support
  2016-09-20 18:59                         ` Jesper Dangaard Brouer
@ 2016-09-20 19:21                           ` Tom Herbert
  2016-09-20 20:47                           ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 32+ messages in thread
From: Tom Herbert @ 2016-09-20 19:21 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, Alexei Starovoitov, Tariq Toukan, Tariq Toukan,
	David S. Miller, Linux Kernel Network Developers,
	Eran Ben Elisha, Saeed Mahameed, Rana Shahout

On Tue, Sep 20, 2016 at 11:59 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Tue, 20 Sep 2016 10:39:20 -0700
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>> On Tue, 2016-09-20 at 09:45 -0700, Alexei Starovoitov wrote:
>>
>> > because 'div by zero' is an abnormal situation that shouldn't be exploitable.
>> > Meaning if an xdp program is doing DoS prevention and has a bug, an
>> > attacker can exploit it by sending a crafted packet that causes a
>> > 'div by zero'; if the kernel then warns, the attack has succeeded.
>> > Therefore it has to be a silent drop.
>>
>> A silent drop means a genuine error in a BPF program might never be
>> caught, since a tracepoint might never be enabled.
>
> I do see your point. But we can document our way out of it.
>
>> > A tracepoint in such a case is great, since the user can do debugging
>> > with it and even monitoring 24/7; if the control plane suddenly sees a
>> > lot of trace_xdp_aborted events, it can disable that tracepoint to
>> > avoid spam and adjust the program, or act on the attack some other way.
>> > Hardcoded warnings and counters are not generic enough for all
>> > the use cases people want to throw at XDP.
>> > The tracepoints idea is awesome, in the sense that it's optional.
>>
>>
>> Note that tracepoints are optional in a kernel.
>
> Well, that is a good thing, as it can be compiled out (which provides
> an option for zero cost).
>
>
>> Many existing supervision infrastructures collect device snmp
>> counters, and run as unprivileged programs.
>
> A supervision infrastructure is a valid use-case. It again indicates
> that such XDP stats need to be structured, not just a random
> driver-specific ethtool counter, to make it easy for such collection daemons.
>
I am currently adding a structure to define an XDP hook (plan to post
patches shortly). Counters can be added to that in a uniform fashion
that doesn't need code in every driver.
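
A purely speculative sketch of what such a uniform hook-plus-counters
structure might look like (the actual patches may differ; all names
here are made up):

  struct xdp_hook_stats {
          u64     pass;
          u64     drop;
          u64     tx;
          u64     aborted;
  };

  struct xdp_hook {
          struct bpf_prog __rcu           *prog;
          struct xdp_hook_stats __percpu  *stats; /* per-cpu, lock-free */
  };

The core could bump the per-cpu counter matching the program's return
code before handing the verdict back to the driver, so no driver would
need its own bookkeeping.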

Tom

>
>> tracepoints might not fit the need here, compared to a mere
>> tx_ring->tx_drops++
>
> I do see your point.  I really liked the tracepoint idea, but now I'm
> uncertain again...
>
>
> I do have a use-case where I want to use the NIC HW-RX-ingress-overflow
> and TX-overflow drop indicators, but I don't want to tie it into this
> discussion.  The abort and error indicators are not relevant for that
> use-case.
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 7/8] net/mlx5e: XDP TX forwarding support
  2016-09-20 18:59                         ` Jesper Dangaard Brouer
  2016-09-20 19:21                           ` Tom Herbert
@ 2016-09-20 20:47                           ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 32+ messages in thread
From: Jesper Dangaard Brouer @ 2016-09-20 20:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexei Starovoitov, Tom Herbert, Tariq Toukan, Tariq Toukan,
	David S. Miller, Linux Kernel Network Developers,
	Eran Ben Elisha, Saeed Mahameed, Rana Shahout, brouer


On Tue, 20 Sep 2016 20:59:39 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> On Tue, 20 Sep 2016 10:39:20 -0700  Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
[...]
> 
> > Many existing supervision infrastructures collect device snmp
> > counters, and run as unprivileged programs.   
> 
> A supervision infrastructure is a valid use-case. It again indicates
> that such XDP stats need to be structured, not just a random
> driver-specific ethtool counter, to make it easy for such collection daemons.
> 
> 
> > tracepoints might not fit the need here, compared to a mere
> > tx_ring->tx_drops++  
> 
> I do see your point.  I really liked the tracepoint idea, but now I'm
> uncertain again...

I've documented the need for Troubleshooting and Monitoring, so we don't
forget about it. See:

Commit:
 https://github.com/netoptimizer/prototype-kernel/commit/3925249089ae4

Online doc:
 https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/implementation/userspace_api.html#troubleshooting-and-monitoring


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 7/8] net/mlx5e: XDP TX forwarding support
  2016-09-20 16:45                     ` Alexei Starovoitov
  2016-09-20 16:58                       ` Thomas Graf
  2016-09-20 17:39                       ` Eric Dumazet
@ 2016-09-20 21:04                       ` Jesper Dangaard Brouer
  2 siblings, 0 replies; 32+ messages in thread
From: Jesper Dangaard Brouer @ 2016-09-20 21:04 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, Tom Herbert, Tariq Toukan, Tariq Toukan,
	David S. Miller, Linux Kernel Network Developers,
	Eran Ben Elisha, Saeed Mahameed, Rana Shahout, brouer

On Tue, 20 Sep 2016 09:45:28 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> To your other question:
> > Please explain why an eBPF program error (div by zero) must be a silent drop?
> 
> because 'div by zero' is an abnormal situation that shouldn't be exploitable.
> Meaning if an xdp program is doing DoS prevention and has a bug, an
> attacker can exploit it by sending a crafted packet that causes a
> 'div by zero'; if the kernel then warns, the attack has succeeded.
> Therefore it has to be a silent drop.

Understood and documented:
 https://github.com/netoptimizer/prototype-kernel/commit/a4e60e2d7a894

Our current solution is not optimal: it only results in a one-time
WARN_ONCE(), see bpf_warn_invalid_xdp_action().  But it should not be
affected by the DoS attack scenario you described.
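
For reference, the action handling we converged on looks roughly like
this (a simplified sketch in the mlx4/en_rx.c style, not verbatim
driver code):

  switch (act) {
  case XDP_PASS:
          break;          /* continue: build SKB, hand to the stack */
  case XDP_TX:
          /* transmit the page out the same port (elided) */
          goto recycle;
  default:
          bpf_warn_invalid_xdp_action(act); /* one-time WARN, unknown code */
          /* fall through */
  case XDP_ABORTED:       /* program fault, e.g. divide by zero */
  case XDP_DROP:
          goto recycle;   /* silent drop: no SKB, no WARN */
  }

Note the WARN_ONCE only fires for return codes the driver does not
recognize at all; XDP_ABORTED itself stays silent, matching the DoS
argument above.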

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 7/8] net/mlx5e: XDP TX forwarding support
  2016-09-20 15:40         ` Alexei Starovoitov
  2016-09-20 15:51           ` Tom Herbert
  2016-09-20 15:57           ` Jesper Dangaard Brouer
@ 2016-09-21  7:57           ` Tariq Toukan
  2016-09-21  8:16             ` Jesper Dangaard Brouer
  2 siblings, 1 reply; 32+ messages in thread
From: Tariq Toukan @ 2016-09-21  7:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jesper Dangaard Brouer, Tariq Toukan, David S. Miller, netdev,
	Eran Ben Elisha, Saeed Mahameed, Rana Shahout


On 20/09/2016 6:40 PM, Alexei Starovoitov wrote:
> On Tue, Sep 20, 2016 at 03:53:10PM +0300, Tariq Toukan wrote:
>>>>> +	case XDP_ABORTED:
>>>> It is not clearly defined, but I believe XDP_ABORTED should also result
>>>> in a warning (calling bpf_warn_invalid_xdp_action(act)).
>> I'll add this.
> Certainly NOT.
> XDP_ABORTED is an exception case when a program divides by zero.
> It should NOT do bpf_warn. It must drop the packet.
> We discussed it several months ago.
> See mlx4/en_rx.c for canonical implementation.
>
This is also the example given here:
https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/implementation/xdp_actions.html#code-example

I prefer to align with the documentation (and with current mlx4 driver 
code), which means keeping the XDP_ABORTED w/o a warning.
Anyway, I don't think this should block the coming V2. If you decide to 
change documentation/specification, we will simply adjust our drivers 
accordingly.

Thanks,
Tariq.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next 7/8] net/mlx5e: XDP TX forwarding support
  2016-09-21  7:57           ` Tariq Toukan
@ 2016-09-21  8:16             ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 32+ messages in thread
From: Jesper Dangaard Brouer @ 2016-09-21  8:16 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Alexei Starovoitov, Tariq Toukan, David S. Miller, netdev,
	Eran Ben Elisha, Saeed Mahameed, Rana Shahout, brouer

On Wed, 21 Sep 2016 10:57:34 +0300
Tariq Toukan <ttoukan.linux@gmail.com> wrote:

> On 20/09/2016 6:40 PM, Alexei Starovoitov wrote:
> > On Tue, Sep 20, 2016 at 03:53:10PM +0300, Tariq Toukan wrote:  
> >>>>> +	case XDP_ABORTED:  
> >>>> It is not clearly defined, but I believe XDP_ABORTED should also result
> >>>> in a warning (calling bpf_warn_invalid_xdp_action(act)).  
> >> I'll add this.  
> > Certainly NOT.
> > XDP_ABORTED is an exception case when a program divides by zero.
> > It should NOT do bpf_warn. It must drop the packet.
> > We discussed it several months ago.
> > See mlx4/en_rx.c for canonical implementation.
> >  
> This is also the example given here:
> https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/implementation/xdp_actions.html#code-example
> 

Nice, the documentation is already useful :-)))

I actually adjusted the code example after this feedback, as I also
mentioned in this email thread about updating the documentation.  I
already pointed to this commit, but here is a more specific link:

https://github.com/netoptimizer/prototype-kernel/commit/3925249089ae4#diff-c9b8ab4ded3eed2e92f8d898111d8e9bL74


> I prefer to align with the documentation (and with current mlx4 driver 
> code), which means keeping the XDP_ABORTED w/o a warning.
> Anyway, I don't think this should block the coming V2. If you decide to 
> change documentation/specification, we will simply adjust our drivers 
> accordingly.

This is not blocking your V2; please resend so we can all start working
on a common code base for mlx5 (I'm currently running mlx5 with my own
one-page-per-packet patch... and my page_pool on top).

As discussed in this thread, the outcome will likely be a new interface
for Troubleshooting and Monitoring, as documented and summarized here
so we don't forget:

https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/implementation/userspace_api.html#troubleshooting-and-monitoring

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2016-09-21  8:16 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-19 13:58 [PATCH net-next 0/8] mlx5e XDP support Tariq Toukan
2016-09-19 13:58 ` [PATCH net-next 1/8] net/mlx5e: Build RX SKB on demand Tariq Toukan
2016-09-19 13:58 ` [PATCH net-next 2/8] net/mlx5e: Union RQ RX info per RQ type Tariq Toukan
2016-09-19 13:58 ` [PATCH net-next 3/8] net/mlx5e: Slightly reduce hardware LRO size Tariq Toukan
2016-09-19 13:58 ` [PATCH net-next 4/8] net/mlx5e: Dynamic RQ type infrastructure Tariq Toukan
2016-09-19 13:58 ` [PATCH net-next 5/8] net/mlx5e: XDP fast RX drop bpf programs support Tariq Toukan
2016-09-19 13:58 ` [PATCH net-next 6/8] net/mlx5e: Have a clear separation between different SQ types Tariq Toukan
2016-09-19 13:58 ` [PATCH net-next 7/8] net/mlx5e: XDP TX forwarding support Tariq Toukan
2016-09-20  8:29   ` Jesper Dangaard Brouer
2016-09-20 11:33     ` Jesper Dangaard Brouer
2016-09-20 12:53       ` Tariq Toukan
2016-09-20 15:40         ` Alexei Starovoitov
2016-09-20 15:51           ` Tom Herbert
2016-09-20 15:58             ` Eric Dumazet
2016-09-20 16:06               ` Jesper Dangaard Brouer
2016-09-20 16:13                 ` Eric Dumazet
2016-09-20 16:27                   ` Jesper Dangaard Brouer
2016-09-20 16:45                     ` Alexei Starovoitov
2016-09-20 16:58                       ` Thomas Graf
2016-09-20 17:39                       ` Eric Dumazet
2016-09-20 18:59                         ` Jesper Dangaard Brouer
2016-09-20 19:21                           ` Tom Herbert
2016-09-20 20:47                           ` Jesper Dangaard Brouer
2016-09-20 21:04                       ` Jesper Dangaard Brouer
2016-09-20 16:00             ` Alexei Starovoitov
2016-09-20 15:57           ` Jesper Dangaard Brouer
2016-09-21  7:57           ` Tariq Toukan
2016-09-21  8:16             ` Jesper Dangaard Brouer
2016-09-19 13:58 ` [PATCH net-next 8/8] net/mlx5e: XDP TX xmit more Tariq Toukan
2016-09-20  7:46   ` Jesper Dangaard Brouer
2016-09-20  8:19     ` Tariq Toukan
2016-09-20  9:26       ` Jesper Dangaard Brouer
