* [PATCH net-next] mlx4: Better use of order-0 pages in RX path
@ 2017-03-13  0:58 Eric Dumazet
  2017-03-13 12:01 ` Tariq Toukan
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Eric Dumazet @ 2017-03-13  0:58 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Tariq Toukan, Saeed Mahameed, Willem de Bruijn,
	Alexei Starovoitov, Eric Dumazet, Eric Dumazet, Alexander Duyck

When I added order-0 page allocations and page recycling in the receive
path, I introduced issues on PowerPC, or more generally on arches with
large pages.

A GRO packet, aggregating 45 segments, ended up using 45 page frags
on 45 different pages. Before my changes we were very likely packing
up to 42 Ethernet frames per 64KB page.
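
(For rough context, assuming the ~1536-byte frag stride used for MTU=1500
frames, the packing arithmetic is:

	65536 / 1536 ~= 42 frames per 64KB PowerPC page
	 4096 / 1536 ~=  2 frames per 4KB x86 page

which is where the "up to 42" figure above comes from.)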

1) At skb freeing time, the put_page() calls on the skb frags now touch
   45 different 'struct page' instances, adding cache line misses.
   Too bad that the standard Ethernet MTU is so small :/

2) Using one order-0 page per ring slot consumes ~42 times more memory
   on PowerPC.

3) Allocating order-0 pages is very likely to use pages from very
   different locations, increasing TLB pressure on hosts with more
   than 256 GB of memory after days of uptime.

This patch uses a refined strategy, addressing these points.

We still use order-0 pages, but the page recycling technique is modified
so that we have better chances of lowering the number of pages containing
the frags of a given GRO skb (a factor of 2 on x86, and 21 on PowerPC).

Page allocations are split in two halves:
- One half is currently visible to the NIC for DMA operations.
- The other half contains pages that were already attached to old skbs,
  put in a quarantine.

When we receive a frame, we look at the oldest entry in the pool and
check if the page count is back to one, meaning old skbs/frags were
consumed and the page can be recycled.
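
As a simplified sketch, the reusability test is essentially the
mlx4_page_is_reusable() helper added by this patch:

	static bool mlx4_page_is_reusable(struct page *page)
	{
		/* All old skb frags dropped their references ... */
		return page_count(page) == 1 &&
		       /* ... and the page is safe and local enough to reuse. */
		       !page_is_pfmemalloc(page) &&
		       page_to_nid(page) == numa_mem_id();
	}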

Page allocations are attempted with high orders first, trying
to lower TLB pressure.
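
As a sketch of the fallback logic in mlx4_alloc_page() (GFP details
trimmed), the allocator does roughly:

	page = __alloc_pages_node(node, gfp | __GFP_NOWARN | __GFP_NORETRY,
				  order);
	if (page) {
		split_page(page, order);	/* hand out order-0 chunks */
		ring->pre_allocated_count = 1U << order;
	} else {
		if (order > 1)
			ring->rx_alloc_order--;	/* back off for next time */
		page = __alloc_pages_node(node, gfp, 0); /* order-0 fallback */
	}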

On x86, memory allocations stay the same. (One page per RX slot for MTU=1500)
But on PowerPC, this patch considerably reduces the allocated memory.

Performance gain on PowerPC is about 50% for a single TCP flow.

On x86, I could not measure the difference, my test machine being
limited by the sender (33 Gbit per TCP flow).
22 fewer cache line misses per 64 KB GRO packet is probably on the order
of 2 % or so.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   | 462 +++++++++++++++------------
 drivers/net/ethernet/mellanox/mlx4/en_tx.c   |  15 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  54 +++-
 3 files changed, 310 insertions(+), 221 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index aa074e57ce06fb2842fa1faabd156c3cd2fe10f5..de455c8a2dec389cfeca6b6d474a6184d6acf618 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -31,7 +31,6 @@
  *
  */
 
-#include <net/busy_poll.h>
 #include <linux/bpf.h>
 #include <linux/bpf_trace.h>
 #include <linux/mlx4/cq.h>
@@ -50,59 +49,41 @@
 
 #include "mlx4_en.h"
 
-static int mlx4_alloc_page(struct mlx4_en_priv *priv,
-			   struct mlx4_en_rx_alloc *frag,
-			   gfp_t gfp)
+static struct page *mlx4_alloc_page(struct mlx4_en_priv *priv,
+				    struct mlx4_en_rx_ring *ring,
+				    dma_addr_t *dma,
+				    unsigned int node, gfp_t gfp)
 {
 	struct page *page;
-	dma_addr_t dma;
-
-	page = alloc_page(gfp);
-	if (unlikely(!page))
-		return -ENOMEM;
-	dma = dma_map_page(priv->ddev, page, 0, PAGE_SIZE, priv->dma_dir);
-	if (unlikely(dma_mapping_error(priv->ddev, dma))) {
-		__free_page(page);
-		return -ENOMEM;
-	}
-	frag->page = page;
-	frag->dma = dma;
-	frag->page_offset = priv->rx_headroom;
-	return 0;
-}
 
-static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
-			       struct mlx4_en_rx_ring *ring,
-			       struct mlx4_en_rx_desc *rx_desc,
-			       struct mlx4_en_rx_alloc *frags,
-			       gfp_t gfp)
-{
-	int i;
+	if (unlikely(!ring->pre_allocated_count)) {
+		unsigned int order = READ_ONCE(ring->rx_alloc_order);
 
-	for (i = 0; i < priv->num_frags; i++, frags++) {
-		if (!frags->page) {
-			if (mlx4_alloc_page(priv, frags, gfp))
-				return -ENOMEM;
-			ring->rx_alloc_pages++;
+		page = __alloc_pages_node(node, gfp | __GFP_NOMEMALLOC |
+						__GFP_NOWARN | __GFP_NORETRY,
+					  order);
+		if (page) {
+			split_page(page, order);
+			ring->pre_allocated_count = 1U << order;
+		} else {
+			if (order > 1)
+				ring->rx_alloc_order--;
+			page = __alloc_pages_node(node, gfp, 0);
+			if (unlikely(!page))
+				return NULL;
+			ring->pre_allocated_count = 1U;
 		}
-		rx_desc->data[i].addr = cpu_to_be64(frags->dma +
-						    frags->page_offset);
+		ring->pre_allocated = page;
+		ring->rx_alloc_pages += ring->pre_allocated_count;
 	}
-	return 0;
-}
-
-static void mlx4_en_free_frag(const struct mlx4_en_priv *priv,
-			      struct mlx4_en_rx_alloc *frag)
-{
-	if (frag->page) {
-		dma_unmap_page(priv->ddev, frag->dma,
-			       PAGE_SIZE, priv->dma_dir);
-		__free_page(frag->page);
+	page = ring->pre_allocated++;
+	ring->pre_allocated_count--;
+	*dma = dma_map_page(priv->ddev, page, 0, PAGE_SIZE, priv->dma_dir);
+	if (unlikely(dma_mapping_error(priv->ddev, *dma))) {
+		__free_page(page);
+		return NULL;
 	}
-	/* We need to clear all fields, otherwise a change of priv->log_rx_info
-	 * could lead to see garbage later in frag->page.
-	 */
-	memset(frag, 0, sizeof(*frag));
+	return page;
 }
 
 static void mlx4_en_init_rx_desc(const struct mlx4_en_priv *priv,
@@ -130,32 +111,18 @@ static void mlx4_en_init_rx_desc(const struct mlx4_en_priv *priv,
 	}
 }
 
-static int mlx4_en_prepare_rx_desc(struct mlx4_en_priv *priv,
-				   struct mlx4_en_rx_ring *ring, int index,
-				   gfp_t gfp)
+static void mlx4_en_prepare_rx_desc(struct mlx4_en_priv *priv,
+				    struct mlx4_en_rx_ring *ring,
+				    unsigned int index)
 {
 	struct mlx4_en_rx_desc *rx_desc = ring->buf + (index * ring->stride);
-	struct mlx4_en_rx_alloc *frags = ring->rx_info +
-					(index << priv->log_rx_info);
-	if (ring->page_cache.index > 0) {
-		/* XDP uses a single page per frame */
-		if (!frags->page) {
-			ring->page_cache.index--;
-			frags->page = ring->page_cache.buf[ring->page_cache.index].page;
-			frags->dma  = ring->page_cache.buf[ring->page_cache.index].dma;
-		}
-		frags->page_offset = XDP_PACKET_HEADROOM;
-		rx_desc->data[0].addr = cpu_to_be64(frags->dma +
-						    XDP_PACKET_HEADROOM);
-		return 0;
-	}
-
-	return mlx4_en_alloc_frags(priv, ring, rx_desc, frags, gfp);
-}
+	const struct mlx4_en_rx_alloc *frags = ring->rx_info +
+						(index << priv->log_rx_info);
+	int i;
 
-static bool mlx4_en_is_ring_empty(const struct mlx4_en_rx_ring *ring)
-{
-	return ring->prod == ring->cons;
+	for (i = 0; i < priv->num_frags; i++, frags++)
+		rx_desc->data[i].addr = cpu_to_be64(frags->dma +
+						    frags->page_offset);
 }
 
 static inline void mlx4_en_update_rx_prod_db(struct mlx4_en_rx_ring *ring)
@@ -163,79 +130,135 @@ static inline void mlx4_en_update_rx_prod_db(struct mlx4_en_rx_ring *ring)
 	*ring->wqres.db.db = cpu_to_be32(ring->prod & 0xffff);
 }
 
-/* slow path */
-static void mlx4_en_free_rx_desc(const struct mlx4_en_priv *priv,
-				 struct mlx4_en_rx_ring *ring,
-				 int index)
+static void mlx4_en_free_rx_buf(struct mlx4_en_priv *priv,
+				struct mlx4_en_rx_ring *ring)
 {
-	struct mlx4_en_rx_alloc *frags;
-	int nr;
+	int index;
+
+	if (ring->pool.array) {
+		const struct mlx4_en_page *en_page = ring->pool.array;
 
-	frags = ring->rx_info + (index << priv->log_rx_info);
-	for (nr = 0; nr < priv->num_frags; nr++) {
-		en_dbg(DRV, priv, "Freeing fragment:%d\n", nr);
-		mlx4_en_free_frag(priv, frags + nr);
+		for (index = 0; index < ring->pool.pool_size; index++) {
+			dma_unmap_page(priv->ddev, en_page->dma,
+				       PAGE_SIZE, priv->dma_dir);
+			__free_page(en_page->page);
+			en_page++;
+		}
+		kfree(ring->pool.array);
+		ring->pool.array = NULL;
+		while (ring->pre_allocated_count) {
+			__free_page(ring->pre_allocated++);
+			ring->pre_allocated_count--;
+		}
 	}
+	ring->cons = 0;
+	ring->prod = 0;
 }
 
 static int mlx4_en_fill_rx_buffers(struct mlx4_en_priv *priv)
 {
+	int ring_ind, i, new_size = priv->prof->rx_ring_size;
+	unsigned int stride_bytes = 0;
 	struct mlx4_en_rx_ring *ring;
-	int ring_ind;
-	int buf_ind;
-	int new_size;
+	unsigned int pages_per_ring;
+	unsigned int page_ind;
+	unsigned int total;
+	unsigned int order;
+
+	for (i = 0; i < priv->num_frags; i++)
+		stride_bytes += priv->frag_info[i].frag_stride;
+
+	/* Page recycling works best if we have enough pages in the pool.
+	 * Apply a factor of two on the minimal allocations required to
+	 * populate RX rings.
+	 */
+retry:
+	total = 0;
+	pages_per_ring = new_size * stride_bytes * 2 / PAGE_SIZE;
+	pages_per_ring = roundup_pow_of_two(pages_per_ring);
+
+	order = min_t(u32, ilog2(pages_per_ring), MAX_ORDER - 1);
+
+	for (ring_ind = 0; ring_ind < priv->rx_ring_num; ring_ind++) {
+		ring = priv->rx_ring[ring_ind];
+		mlx4_en_free_rx_buf(priv, ring);
 
-	for (buf_ind = 0; buf_ind < priv->prof->rx_ring_size; buf_ind++) {
+		/* Note: use kvalloc() when available, no hurry */
+		ring->pool.array = kmalloc_node(sizeof(*ring->pool.array) *
+						pages_per_ring,
+						GFP_KERNEL, ring->node);
+		if (!ring->pool.array)
+			return -ENOMEM;
+		ring->pool.pool_idx = 0;
+		ring->pool.pool_size = 0;
+		ring->rx_alloc_order = ring->rx_pref_alloc_order = order;
+	}
+
+	for (page_ind = 0; page_ind < pages_per_ring; page_ind++) {
 		for (ring_ind = 0; ring_ind < priv->rx_ring_num; ring_ind++) {
-			ring = priv->rx_ring[ring_ind];
+			struct page *page;
+			dma_addr_t dma;
 
-			if (mlx4_en_prepare_rx_desc(priv, ring,
-						    ring->actual_size,
-						    GFP_KERNEL | __GFP_COLD)) {
-				if (ring->actual_size < MLX4_EN_MIN_RX_SIZE) {
-					en_err(priv, "Failed to allocate enough rx buffers\n");
-					return -ENOMEM;
-				} else {
-					new_size = rounddown_pow_of_two(ring->actual_size);
-					en_warn(priv, "Only %d buffers allocated reducing ring size to %d\n",
-						ring->actual_size, new_size);
-					goto reduce_rings;
-				}
-			}
-			ring->actual_size++;
-			ring->prod++;
+			ring = priv->rx_ring[ring_ind];
+			page = mlx4_alloc_page(priv, ring, &dma,
+					       ring->node, GFP_KERNEL);
+			if (!page)
+				goto fail;
+			ring->pool.array[page_ind].page = page;
+			ring->pool.array[page_ind].dma = dma;
+			ring->pool.pool_size = page_ind + 1;
+			total++;
+			cond_resched();
 		}
 	}
-	return 0;
 
-reduce_rings:
+	order = min_t(u32, ilog2(pages_per_ring >> 1), MAX_ORDER - 1);
+
 	for (ring_ind = 0; ring_ind < priv->rx_ring_num; ring_ind++) {
 		ring = priv->rx_ring[ring_ind];
-		while (ring->actual_size > new_size) {
-			ring->actual_size--;
-			ring->prod--;
-			mlx4_en_free_rx_desc(priv, ring, ring->actual_size);
-		}
-	}
+		ring->rx_alloc_order = ring->rx_pref_alloc_order = order;
 
-	return 0;
-}
+		memcpy(ring->frag_info, priv->frag_info,
+		       sizeof(priv->frag_info));
 
-static void mlx4_en_free_rx_buf(struct mlx4_en_priv *priv,
-				struct mlx4_en_rx_ring *ring)
-{
-	int index;
+		while (ring->actual_size < new_size) {
+			struct mlx4_en_frag_info *frag_info = ring->frag_info;
+			struct mlx4_en_rx_alloc *frags = ring->rx_info +
+					(ring->actual_size << priv->log_rx_info);
+
+			for (i = 0; i < priv->num_frags; i++, frag_info++, frags++) {
+				if (frag_info->frag_stride + frag_info->page_offset > PAGE_SIZE) {
+					struct mlx4_en_page *en_page;
+
+					en_page = &ring->pool.array[ring->pool.pool_idx];
+					frag_info->page_offset = priv->rx_headroom;
+					frag_info->dma = en_page->dma;
+					frag_info->page = en_page->page;
+					++ring->pool.pool_idx;
 
-	en_dbg(DRV, priv, "Freeing Rx buf - cons:%d prod:%d\n",
-	       ring->cons, ring->prod);
+					WARN_ON_ONCE(ring->pool.pool_idx >=
+						     ring->pool.pool_size);
+				}
+				frags->page = frag_info->page;
+				frags->dma = frag_info->dma;
+				frags->page_offset = frag_info->page_offset;
 
-	/* Unmap and free Rx buffers */
-	for (index = 0; index < ring->size; index++) {
-		en_dbg(DRV, priv, "Processing descriptor:%d\n", index);
-		mlx4_en_free_rx_desc(priv, ring, index);
+				frag_info->page_offset += frag_info->frag_stride;
+			}
+			mlx4_en_prepare_rx_desc(priv, ring, ring->actual_size);
+			ring->actual_size++;
+			ring->prod++;
+		}
 	}
-	ring->cons = 0;
-	ring->prod = 0;
+	return 0;
+fail:
+	new_size >>= 1;
+	if (new_size < MLX4_EN_MIN_RX_SIZE) {
+		en_err(priv, "Failed to allocate enough rx pages, got %u of them\n",
+		       total);
+		return -ENOMEM;
+	}
+	goto retry;
 }
 
 void mlx4_en_set_num_rx_rings(struct mlx4_en_dev *mdev)
@@ -277,6 +300,7 @@ int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
 		}
 	}
 
+	ring->node = node;
 	ring->prod = 0;
 	ring->cons = 0;
 	ring->size = size;
@@ -386,23 +410,21 @@ int mlx4_en_activate_rx_rings(struct mlx4_en_priv *priv)
 	return err;
 }
 
-/* We recover from out of memory by scheduling our napi poll
- * function (mlx4_en_process_cq), which tries to allocate
- * all missing RX buffers (call to mlx4_en_refill_rx_buffers).
+/* Under memory pressure, each ring->rx_alloc_order might be lowered
+ * to very small values. Periodically reset it to initial value for
+ * optimal allocations, in case stress is over.
  */
 void mlx4_en_recover_from_oom(struct mlx4_en_priv *priv)
 {
-	int ring;
+	struct mlx4_en_rx_ring *ring;
+	int ring_ind;
 
 	if (!priv->port_up)
 		return;
 
-	for (ring = 0; ring < priv->rx_ring_num; ring++) {
-		if (mlx4_en_is_ring_empty(priv->rx_ring[ring])) {
-			local_bh_disable();
-			napi_reschedule(&priv->rx_cq[ring]->napi);
-			local_bh_enable();
-		}
+	for (ring_ind = 0; ring_ind < priv->rx_ring_num; ring_ind++) {
+		ring = priv->rx_ring[ring_ind];
+		WRITE_ONCE(ring->rx_alloc_order, ring->rx_pref_alloc_order);
 	}
 }
 
@@ -413,15 +435,15 @@ void mlx4_en_recover_from_oom(struct mlx4_en_priv *priv)
  * this cache when it is sized to be a multiple of the napi budget.
  */
 bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
-			struct mlx4_en_rx_alloc *frame)
+			struct page *page, dma_addr_t dma)
 {
 	struct mlx4_en_page_cache *cache = &ring->page_cache;
 
 	if (cache->index >= MLX4_EN_CACHE_SIZE)
 		return false;
 
-	cache->buf[cache->index].page = frame->page;
-	cache->buf[cache->index].dma = frame->dma;
+	cache->buf[cache->index].page = page;
+	cache->buf[cache->index].dma = dma;
 	cache->index++;
 	return true;
 }
@@ -454,7 +476,7 @@ void mlx4_en_deactivate_rx_ring(struct mlx4_en_priv *priv,
 	for (i = 0; i < ring->page_cache.index; i++) {
 		dma_unmap_page(priv->ddev, ring->page_cache.buf[i].dma,
 			       PAGE_SIZE, priv->dma_dir);
-		put_page(ring->page_cache.buf[i].page);
+		__free_page(ring->page_cache.buf[i].page);
 	}
 	ring->page_cache.index = 0;
 	mlx4_en_free_rx_buf(priv, ring);
@@ -462,67 +484,95 @@ void mlx4_en_deactivate_rx_ring(struct mlx4_en_priv *priv,
 		ring->buf -= TXBB_SIZE;
 }
 
+static bool mlx4_page_is_reusable(struct page *page)
+{
+	return likely(page_count(page) == 1) &&
+	       likely(!page_is_pfmemalloc(page)) &&
+	       likely(page_to_nid(page) == numa_mem_id());
+}
 
-static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
-				    struct mlx4_en_rx_alloc *frags,
-				    struct sk_buff *skb,
-				    int length)
+static bool mlx4_replenish(struct mlx4_en_priv *priv,
+			   struct mlx4_en_rx_ring *ring,
+			   struct mlx4_en_frag_info *frag_info)
 {
-	const struct mlx4_en_frag_info *frag_info = priv->frag_info;
-	unsigned int truesize = 0;
-	int nr, frag_size;
+	struct mlx4_en_page *en_page = &ring->pool.array[ring->pool.pool_idx];
 	struct page *page;
 	dma_addr_t dma;
-	bool release;
 
-	/* Collect used fragments while replacing them in the HW descriptors */
-	for (nr = 0;; frags++) {
-		frag_size = min_t(int, length, frag_info->frag_size);
+	if (!mlx4_page_is_reusable(en_page->page)) {
+		page = mlx4_alloc_page(priv, ring, &dma, numa_mem_id(),
+				       GFP_ATOMIC | __GFP_MEMALLOC);
+		if (unlikely(!page)) {
+			/* Only drop incoming packet if previous page
+			 * can not be reused at all. NUMA placement is a hint,
+			 * pfmemalloc skbs will eventually be dropped if
+			 * necessary.
+			 */
+			if (page_count(en_page->page) != 1)
+				return false;
+		} else {
+			dma_unmap_page(priv->ddev, en_page->dma, PAGE_SIZE,
+				       priv->dma_dir);
+			__free_page(en_page->page);
+			en_page->page = page;
+			en_page->dma = dma;
+		}
+	}
+	frag_info->page_offset = priv->rx_headroom;
+	frag_info->page = en_page->page;
+	frag_info->dma = en_page->dma;
 
-		page = frags->page;
-		if (unlikely(!page))
-			goto fail;
+	if (unlikely(++ring->pool.pool_idx == ring->pool.pool_size))
+		ring->pool.pool_idx = 0;
 
-		dma = frags->dma;
-		dma_sync_single_range_for_cpu(priv->ddev, dma, frags->page_offset,
-					      frag_size, priv->dma_dir);
+	return true;
+}
 
-		__skb_fill_page_desc(skb, nr, page, frags->page_offset,
-				     frag_size);
+static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
+				    struct mlx4_en_rx_ring *ring,
+				    struct mlx4_en_rx_alloc *frags,
+				    struct sk_buff *skb,
+				    int length)
+{
+	struct mlx4_en_frag_info *frag_info = ring->frag_info;
+	int nr, frag_size;
 
-		truesize += frag_info->frag_stride;
-		if (frag_info->frag_stride == PAGE_SIZE / 2) {
-			frags->page_offset ^= PAGE_SIZE / 2;
-			release = page_count(page) != 1 ||
-				  page_is_pfmemalloc(page) ||
-				  page_to_nid(page) != numa_mem_id();
-		} else {
-			u32 sz_align = ALIGN(frag_size, SMP_CACHE_BYTES);
+	/* Make sure we can replenish RX ring with new page frags,
+	 * otherwise we drop this packet. Very sad but true.
+	 */
+	for (nr = 0; nr < priv->num_frags; nr++, frag_info++) {
+		if (frag_info->frag_stride + frag_info->page_offset <= PAGE_SIZE)
+			continue;
+		if (!mlx4_replenish(priv, ring, frag_info))
+			return -1;
+	}
+	frag_info = ring->frag_info;
 
-			frags->page_offset += sz_align;
-			release = frags->page_offset + frag_info->frag_size > PAGE_SIZE;
-		}
-		if (release) {
-			dma_unmap_page(priv->ddev, dma, PAGE_SIZE, priv->dma_dir);
-			frags->page = NULL;
-		} else {
-			page_ref_inc(page);
-		}
+	for (nr = 0;; frag_info++, frags++) {
+		frag_size = min_t(int, length, frag_info->frag_size);
 
-		nr++;
-		length -= frag_size;
-		if (!length)
+		if (frag_size) {
+			dma_sync_single_range_for_cpu(priv->ddev, frags->dma,
+						      frags->page_offset,
+						      frag_size,
+						      priv->dma_dir);
+
+			skb_fill_page_desc(skb, nr, frags->page,
+					   frags->page_offset,
+					   frag_size);
+			page_ref_inc(frags->page);
+			skb->truesize += frag_info->frag_stride;
+			length -= frag_size;
+		}
+		/* prepare what is needed for the next frame */
+		frags->page = frag_info->page;
+		frags->dma = frag_info->dma;
+		frags->page_offset = frag_info->page_offset;
+		frag_info->page_offset += frag_info->frag_stride;
+		if (++nr == priv->num_frags)
 			break;
-		frag_info++;
 	}
-	skb->truesize += truesize;
-	return nr;
 
-fail:
-	while (nr > 0) {
-		nr--;
-		__skb_frag_unref(skb_shinfo(skb)->frags + nr);
-	}
 	return 0;
 }
 
@@ -539,20 +589,16 @@ static void validate_loopback(struct mlx4_en_priv *priv, void *va)
 	priv->loopback_ok = 1;
 }
 
-static bool mlx4_en_refill_rx_buffers(struct mlx4_en_priv *priv,
-				      struct mlx4_en_rx_ring *ring)
+static bool mlx4_en_remap_rx_buffers(struct mlx4_en_priv *priv,
+				     struct mlx4_en_rx_ring *ring)
 {
 	u32 missing = ring->actual_size - (ring->prod - ring->cons);
 
-	/* Try to batch allocations, but not too much. */
 	if (missing < 8)
 		return false;
 	do {
-		if (mlx4_en_prepare_rx_desc(priv, ring,
-					    ring->prod & ring->size_mask,
-					    GFP_ATOMIC | __GFP_COLD |
-					    __GFP_MEMALLOC))
-			break;
+		mlx4_en_prepare_rx_desc(priv, ring,
+					ring->prod & ring->size_mask);
 		ring->prod++;
 	} while (--missing);
 
@@ -740,7 +786,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		 */
 		if (xdp_prog) {
 			struct xdp_buff xdp;
-			dma_addr_t dma;
+			struct page *npage;
+			dma_addr_t ndma, dma;
 			void *orig_data;
 			u32 act;
 
@@ -767,10 +814,30 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 			case XDP_PASS:
 				break;
 			case XDP_TX:
+				/* Make sure we have one page ready to replace this one */
+				npage = NULL;
+				if (!ring->page_cache.index) {
+					npage = mlx4_alloc_page(priv, ring,
+								&ndma, numa_mem_id(),
+								GFP_ATOMIC | __GFP_MEMALLOC);
+					if (!npage) {
+						ring->xdp_drop++;
+						goto next;
+					}
+				}
 				if (likely(!mlx4_en_xmit_frame(ring, frags, dev,
 							length, cq->ring,
 							&doorbell_pending))) {
-					frags[0].page = NULL;
+					if (ring->page_cache.index) {
+						u32 idx = --ring->page_cache.index;
+
+						frags[0].page = ring->page_cache.buf[idx].page;
+						frags[0].dma = ring->page_cache.buf[idx].dma;
+					} else {
+						frags[0].page = npage;
+						frags[0].dma = ndma;
+					}
+					frags[0].page_offset = XDP_PACKET_HEADROOM;
 					goto next;
 				}
 				trace_xdp_exception(dev, xdp_prog, act);
@@ -853,9 +920,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 			__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021AD),
 					       be16_to_cpu(cqe->sl_vid));
 
-		nr = mlx4_en_complete_rx_desc(priv, frags, skb, length);
-		if (likely(nr)) {
-			skb_shinfo(skb)->nr_frags = nr;
+		nr = mlx4_en_complete_rx_desc(priv, ring, frags, skb, length);
+		if (likely(nr >= 0)) {
 			skb->len = length;
 			skb->data_len = length;
 			napi_gro_frags(&cq->napi);
@@ -883,7 +949,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	}
 	AVG_PERF_COUNTER(priv->pstats.rx_coal_avg, polled);
 
-	if (mlx4_en_refill_rx_buffers(priv, ring))
+	if (mlx4_en_remap_rx_buffers(priv, ring))
 		mlx4_en_update_rx_prod_db(ring);
 
 	return polled;
@@ -956,6 +1022,7 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
 		 * expense of more costly truesize accounting
 		 */
 		priv->frag_info[0].frag_stride = PAGE_SIZE;
+		priv->frag_info[0].page_offset = PAGE_SIZE;
 		priv->dma_dir = PCI_DMA_BIDIRECTIONAL;
 		priv->rx_headroom = XDP_PACKET_HEADROOM;
 		i = 1;
@@ -974,6 +1041,7 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
 				frag_size = min(frag_size, frag_size_max);
 
 			priv->frag_info[i].frag_size = frag_size;
+			priv->frag_info[i].page_offset = PAGE_SIZE;
 			frag_stride = ALIGN(frag_size, SMP_CACHE_BYTES);
 			/* We can only pack 2 1536-bytes frames in on 4K page
 			 * Therefore, each frame would consume more bytes (truesize)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index e0c5ffb3e3a6607456e1f191b0b8c8becfc71219..76da4c33ae431dbf351699ee9010ea0452a41084 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -351,15 +351,12 @@ u32 mlx4_en_recycle_tx_desc(struct mlx4_en_priv *priv,
 			    int napi_mode)
 {
 	struct mlx4_en_tx_info *tx_info = &ring->tx_info[index];
-	struct mlx4_en_rx_alloc frame = {
-		.page = tx_info->page,
-		.dma = tx_info->map0_dma,
-	};
-
-	if (!mlx4_en_rx_recycle(ring->recycle_ring, &frame)) {
-		dma_unmap_page(priv->ddev, tx_info->map0_dma,
-			       PAGE_SIZE, priv->dma_dir);
-		put_page(tx_info->page);
+	struct page *page = tx_info->page;
+	dma_addr_t dma = tx_info->map0_dma;
+
+	if (!mlx4_en_rx_recycle(ring->recycle_ring, page, dma)) {
+		dma_unmap_page(priv->ddev, dma, PAGE_SIZE, priv->dma_dir);
+		__free_page(page);
 	}
 
 	return tx_info->nr_txbb;
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 39f401aa30474e61c0b0029463b23a829ec35fa3..23e66b1eb6ddd4665339b238618dead537d2155e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -99,17 +99,13 @@
 
 #define MLX4_EN_WATCHDOG_TIMEOUT	(15 * HZ)
 
-/* Use the maximum between 16384 and a single page */
-#define MLX4_EN_ALLOC_SIZE	PAGE_ALIGN(16384)
-
 #define MLX4_EN_MAX_RX_FRAGS	4
 
 /* Maximum ring sizes */
 #define MLX4_EN_MAX_TX_SIZE	8192
 #define MLX4_EN_MAX_RX_SIZE	8192
 
-/* Minimum ring size for our page-allocation scheme to work */
-#define MLX4_EN_MIN_RX_SIZE	(MLX4_EN_ALLOC_SIZE / SMP_CACHE_BYTES)
+#define MLX4_EN_MIN_RX_SIZE	256
 #define MLX4_EN_MIN_TX_SIZE	(4096 / TXBB_SIZE)
 
 #define MLX4_EN_SMALL_PKT_SIZE		64
@@ -325,6 +321,30 @@ struct mlx4_en_rx_desc {
 	struct mlx4_wqe_data_seg data[0];
 };
 
+/* Each rx_ring has a pool of pages, with associated dma mapping.
+ * We try to recycle the pages, by keeping a reference on them.
+ */
+struct mlx4_en_page {
+	struct page	*page;
+	dma_addr_t	dma; /* might be kept in page_private() ? */
+};
+
+/* A page pool contains a fixed number of pages, and a current index.
+ */
+struct mlx4_en_page_pool {
+	unsigned int		pool_size;
+	unsigned int		pool_idx;
+	struct mlx4_en_page	*array;
+};
+
+struct mlx4_en_frag_info {
+	u16		frag_size;
+	u32		frag_stride;
+	struct page	*page;
+	dma_addr_t	dma;
+	u32		page_offset;
+};
+
 struct mlx4_en_rx_ring {
 	struct mlx4_hwq_resources wqres;
 	u32 size ;	/* number of Rx descs*/
@@ -337,22 +357,31 @@ struct mlx4_en_rx_ring {
 	u32 cons;
 	u32 buf_size;
 	u8  fcs_del;
+	u8  hwtstamp_rx_filter;
+	u16 node;
 	void *buf;
 	void *rx_info;
-	struct bpf_prog __rcu *xdp_prog;
-	struct mlx4_en_page_cache page_cache;
+	struct bpf_prog __rcu		*xdp_prog;
+	struct mlx4_en_page_pool	pool;
+	unsigned long			rx_alloc_pages;
+
+	struct page			*pre_allocated;
+	unsigned int			pre_allocated_count;
+	unsigned int			rx_alloc_order;
+	struct mlx4_en_frag_info	frag_info[MLX4_EN_MAX_RX_FRAGS];
+	struct mlx4_en_page_cache	page_cache;
+
 	unsigned long bytes;
 	unsigned long packets;
 	unsigned long csum_ok;
 	unsigned long csum_none;
 	unsigned long csum_complete;
-	unsigned long rx_alloc_pages;
 	unsigned long xdp_drop;
 	unsigned long xdp_tx;
 	unsigned long xdp_tx_full;
 	unsigned long dropped;
-	int hwtstamp_rx_filter;
 	cpumask_var_t affinity_mask;
+	unsigned int			rx_pref_alloc_order;
 };
 
 struct mlx4_en_cq {
@@ -462,11 +491,6 @@ struct mlx4_en_mc_list {
 	u64			tunnel_reg_id;
 };
 
-struct mlx4_en_frag_info {
-	u16 frag_size;
-	u32 frag_stride;
-};
-
 #ifdef CONFIG_MLX4_EN_DCB
 /* Minimal TC BW - setting to 0 will block traffic */
 #define MLX4_EN_BW_MIN 1
@@ -693,7 +717,7 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
 			       int tx_ind, int *doorbell_pending);
 void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring);
 bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
-			struct mlx4_en_rx_alloc *frame);
+			struct page *page, dma_addr_t dma);
 
 int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
 			   struct mlx4_en_tx_ring **pring,
-- 
2.12.0.246.ga2ecc84866-goog

* Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-13  0:58 [PATCH net-next] mlx4: Better use of order-0 pages in RX path Eric Dumazet
@ 2017-03-13 12:01 ` Tariq Toukan
  2017-03-13 12:54   ` Eric Dumazet
  2017-03-13 12:15 ` kbuild test robot
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 23+ messages in thread
From: Tariq Toukan @ 2017-03-13 12:01 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller
  Cc: netdev, Tariq Toukan, Saeed Mahameed, Willem de Bruijn,
	Alexei Starovoitov, Eric Dumazet, Alexander Duyck,
	Jesper Dangaard Brouer

Hi Eric, thanks for your patch.

On 13/03/2017 2:58 AM, Eric Dumazet wrote:
> When I added order-0 page allocations and page recycling in the receive
> path, I introduced issues on PowerPC, or more generally on arches with
> large pages.
>
> A GRO packet, aggregating 45 segments, ended up using 45 page frags
> on 45 different pages. Before my changes we were very likely packing
> up to 42 Ethernet frames per 64KB page.
>
> 1) At skb freeing time, the put_page() calls on the skb frags now touch
>    45 different 'struct page' instances, adding cache line misses.
>    Too bad that the standard Ethernet MTU is so small :/
>
> 2) Using one order-0 page per ring slot consumes ~42 times more memory
>    on PowerPC.
>
> 3) Allocating order-0 pages is very likely to use pages from very
>    different locations, increasing TLB pressure on hosts with more
>    than 256 GB of memory after days of uptime.
>
> This patch uses a refined strategy, addressing these points.
>
> We still use order-0 pages, but the page recycling technique is modified
> so that we have better chances of lowering the number of pages containing
> the frags of a given GRO skb (a factor of 2 on x86, and 21 on PowerPC).
>
> Page allocations are split in two halves:
> - One half is currently visible to the NIC for DMA operations.
> - The other half contains pages that were already attached to old skbs,
>   put in a quarantine.
>
> When we receive a frame, we look at the oldest entry in the pool and
> check if the page count is back to one, meaning old skbs/frags were
> consumed and the page can be recycled.
>
> Page allocations are attempted with high orders first, trying
> to lower TLB pressure.

I think MM-list people won't be happy with this.
We were doing a similar thing with order-5 pages in mlx5 Striding RQ:
Allocate and split high-order pages, to have:
- Physically contiguous memory,
- Fewer page allocations,
- Yet, keep fine-grained refcounts/truesize.
In case no high-order page was available, fall back to using order-0 pages.

However, we changed this behavior, as it was fragmenting memory and
quickly depleting the available high-order pages.

>
> On x86, memory allocations stay the same. (One page per RX slot for MTU=1500)
> But on PowerPC, this patch considerably reduces the allocated memory.
>
> Performance gain on PowerPC is about 50% for a single TCP flow.

Nice!

>
> On x86, I could not measure the difference, my test machine being
> limited by the sender (33 Gbit per TCP flow).
> 22 fewer cache line misses per 64 KB GRO packet is probably on the order
> of 2 % or so.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Tariq Toukan <tariqt@mellanox.com>
> Cc: Saeed Mahameed <saeedm@mellanox.com>
> Cc: Alexander Duyck <alexander.duyck@gmail.com>
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c   | 462 +++++++++++++++------------
>  drivers/net/ethernet/mellanox/mlx4/en_tx.c   |  15 +-
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  54 +++-
>  3 files changed, 310 insertions(+), 221 deletions(-)
>

Thanks,
Tariq

* Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-13  0:58 [PATCH net-next] mlx4: Better use of order-0 pages in RX path Eric Dumazet
  2017-03-13 12:01 ` Tariq Toukan
@ 2017-03-13 12:15 ` kbuild test robot
  2017-03-13 12:50 ` kbuild test robot
  2017-03-13 17:34 ` Alexei Starovoitov
  3 siblings, 0 replies; 23+ messages in thread
From: kbuild test robot @ 2017-03-13 12:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: kbuild-all, David S . Miller, netdev, Tariq Toukan,
	Saeed Mahameed, Willem de Bruijn, Alexei Starovoitov,
	Eric Dumazet, Eric Dumazet, Alexander Duyck

[-- Attachment #1: Type: text/plain, Size: 3961 bytes --]

Hi Eric,

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Eric-Dumazet/mlx4-Better-use-of-order-0-pages-in-RX-path/20170313-191100
config: x86_64-acpi-redef (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All error/warnings (new ones prefixed by >>):

>> drivers/net/ethernet/mellanox/mlx4/en_rx.c:622:12: warning: 'struct iphdr' declared inside parameter list will not be visible outside of this definition or declaration
        struct iphdr *iph)
               ^~~~~
   In file included from include/linux/swab.h:4:0,
                    from include/uapi/linux/byteorder/little_endian.h:12,
                    from include/linux/byteorder/little_endian.h:4,
                    from arch/x86/include/uapi/asm/byteorder.h:4,
                    from include/asm-generic/bitops/le.h:5,
                    from arch/x86/include/asm/bitops.h:517,
                    from include/linux/bitops.h:36,
                    from include/linux/kernel.h:10,
                    from include/linux/list.h:8,
                    from include/linux/timer.h:4,
                    from include/linux/workqueue.h:8,
                    from include/linux/bpf.h:12,
                    from drivers/net/ethernet/mellanox/mlx4/en_rx.c:34:
   drivers/net/ethernet/mellanox/mlx4/en_rx.c: In function 'get_fixed_ipv4_csum':
>> drivers/net/ethernet/mellanox/mlx4/en_rx.c:627:36: error: dereferencing pointer to incomplete type 'struct iphdr'
     length_for_csum = (be16_to_cpu(iph->tot_len) - (iph->ihl << 2));
                                       ^
   include/uapi/linux/swab.h:100:54: note: in definition of macro '__swab16'
    #define __swab16(x) (__u16)__builtin_bswap16((__u16)(x))
                                                         ^
   include/linux/byteorder/generic.h:96:21: note: in expansion of macro '__be16_to_cpu'
    #define be16_to_cpu __be16_to_cpu
                        ^~~~~~~~~~~~~
   drivers/net/ethernet/mellanox/mlx4/en_rx.c:627:21: note: in expansion of macro 'be16_to_cpu'
     length_for_csum = (be16_to_cpu(iph->tot_len) - (iph->ihl << 2));
                        ^~~~~~~~~~~

vim +627 drivers/net/ethernet/mellanox/mlx4/en_rx.c

f8c6455b Shani Michaeli 2014-11-09  616  
f8c6455b Shani Michaeli 2014-11-09  617  /* Although the stack expects checksum which doesn't include the pseudo
f8c6455b Shani Michaeli 2014-11-09  618   * header, the HW adds it. To address that, we are subtracting the pseudo
f8c6455b Shani Michaeli 2014-11-09  619   * header checksum from the checksum value provided by the HW.
f8c6455b Shani Michaeli 2014-11-09  620   */
f8c6455b Shani Michaeli 2014-11-09  621  static void get_fixed_ipv4_csum(__wsum hw_checksum, struct sk_buff *skb,
f8c6455b Shani Michaeli 2014-11-09 @622  				struct iphdr *iph)
f8c6455b Shani Michaeli 2014-11-09  623  {
f8c6455b Shani Michaeli 2014-11-09  624  	__u16 length_for_csum = 0;
f8c6455b Shani Michaeli 2014-11-09  625  	__wsum csum_pseudo_header = 0;
f8c6455b Shani Michaeli 2014-11-09  626  
f8c6455b Shani Michaeli 2014-11-09 @627  	length_for_csum = (be16_to_cpu(iph->tot_len) - (iph->ihl << 2));
f8c6455b Shani Michaeli 2014-11-09  628  	csum_pseudo_header = csum_tcpudp_nofold(iph->saddr, iph->daddr,
f8c6455b Shani Michaeli 2014-11-09  629  						length_for_csum, iph->protocol, 0);
f8c6455b Shani Michaeli 2014-11-09  630  	skb->csum = csum_sub(hw_checksum, csum_pseudo_header);

:::::: The code at line 627 was first introduced by commit
:::::: f8c6455bb04b944edb69e9b074e28efee2c56bdd net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE

:::::: TO: Shani Michaeli <shanim@mellanox.com>
:::::: CC: David S. Miller <davem@davemloft.net>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 29508 bytes --]

* Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-13  0:58 [PATCH net-next] mlx4: Better use of order-0 pages in RX path Eric Dumazet
  2017-03-13 12:01 ` Tariq Toukan
  2017-03-13 12:15 ` kbuild test robot
@ 2017-03-13 12:50 ` kbuild test robot
  2017-03-13 13:07   ` Eric Dumazet
  2017-03-13 17:34 ` Alexei Starovoitov
  3 siblings, 1 reply; 23+ messages in thread
From: kbuild test robot @ 2017-03-13 12:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: kbuild-all, David S . Miller, netdev, Tariq Toukan,
	Saeed Mahameed, Willem de Bruijn, Alexei Starovoitov,
	Eric Dumazet, Eric Dumazet, Alexander Duyck

[-- Attachment #1: Type: text/plain, Size: 4688 bytes --]

Hi Eric,

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Eric-Dumazet/mlx4-Better-use-of-order-0-pages-in-RX-path/20170313-191100
config: x86_64-randconfig-s5-03131942 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   drivers/net/ethernet/mellanox/mlx4/en_rx.c:622:12: warning: 'struct iphdr' declared inside parameter list will not be visible outside of this definition or declaration
        struct iphdr *iph)
               ^~~~~
   In file included from include/linux/swab.h:4:0,
                    from include/uapi/linux/byteorder/little_endian.h:12,
                    from include/linux/byteorder/little_endian.h:4,
                    from arch/x86/include/uapi/asm/byteorder.h:4,
                    from include/asm-generic/bitops/le.h:5,
                    from arch/x86/include/asm/bitops.h:517,
                    from include/linux/bitops.h:36,
                    from include/linux/kernel.h:10,
                    from include/linux/list.h:8,
                    from include/linux/timer.h:4,
                    from include/linux/workqueue.h:8,
                    from include/linux/bpf.h:12,
                    from drivers/net/ethernet/mellanox/mlx4/en_rx.c:34:
   drivers/net/ethernet/mellanox/mlx4/en_rx.c: In function 'get_fixed_ipv4_csum':
   drivers/net/ethernet/mellanox/mlx4/en_rx.c:627:36: error: dereferencing pointer to incomplete type 'struct iphdr'
     length_for_csum = (be16_to_cpu(iph->tot_len) - (iph->ihl << 2));
                                       ^
   include/uapi/linux/swab.h:100:54: note: in definition of macro '__swab16'
    #define __swab16(x) (__u16)__builtin_bswap16((__u16)(x))
                                                         ^
>> include/linux/byteorder/generic.h:96:21: note: in expansion of macro '__be16_to_cpu'
    #define be16_to_cpu __be16_to_cpu
                        ^~~~~~~~~~~~~
>> drivers/net/ethernet/mellanox/mlx4/en_rx.c:627:21: note: in expansion of macro 'be16_to_cpu'
     length_for_csum = (be16_to_cpu(iph->tot_len) - (iph->ihl << 2));
                        ^~~~~~~~~~~

vim +/be16_to_cpu +627 drivers/net/ethernet/mellanox/mlx4/en_rx.c

f8c6455b Shani Michaeli 2014-11-09  611  static inline __wsum get_fixed_vlan_csum(__wsum hw_checksum,
f8c6455b Shani Michaeli 2014-11-09  612  					 struct vlan_hdr *vlanh)
f8c6455b Shani Michaeli 2014-11-09  613  {
f8c6455b Shani Michaeli 2014-11-09  614  	return csum_add(hw_checksum, *(__wsum *)vlanh);
f8c6455b Shani Michaeli 2014-11-09  615  }
f8c6455b Shani Michaeli 2014-11-09  616  
f8c6455b Shani Michaeli 2014-11-09  617  /* Although the stack expects checksum which doesn't include the pseudo
f8c6455b Shani Michaeli 2014-11-09  618   * header, the HW adds it. To address that, we are subtracting the pseudo
f8c6455b Shani Michaeli 2014-11-09  619   * header checksum from the checksum value provided by the HW.
f8c6455b Shani Michaeli 2014-11-09  620   */
f8c6455b Shani Michaeli 2014-11-09  621  static void get_fixed_ipv4_csum(__wsum hw_checksum, struct sk_buff *skb,
f8c6455b Shani Michaeli 2014-11-09  622  				struct iphdr *iph)
f8c6455b Shani Michaeli 2014-11-09  623  {
f8c6455b Shani Michaeli 2014-11-09  624  	__u16 length_for_csum = 0;
f8c6455b Shani Michaeli 2014-11-09  625  	__wsum csum_pseudo_header = 0;
f8c6455b Shani Michaeli 2014-11-09  626  
f8c6455b Shani Michaeli 2014-11-09 @627  	length_for_csum = (be16_to_cpu(iph->tot_len) - (iph->ihl << 2));
f8c6455b Shani Michaeli 2014-11-09  628  	csum_pseudo_header = csum_tcpudp_nofold(iph->saddr, iph->daddr,
f8c6455b Shani Michaeli 2014-11-09  629  						length_for_csum, iph->protocol, 0);
f8c6455b Shani Michaeli 2014-11-09  630  	skb->csum = csum_sub(hw_checksum, csum_pseudo_header);
f8c6455b Shani Michaeli 2014-11-09  631  }
f8c6455b Shani Michaeli 2014-11-09  632  
f8c6455b Shani Michaeli 2014-11-09  633  #if IS_ENABLED(CONFIG_IPV6)
f8c6455b Shani Michaeli 2014-11-09  634  /* In IPv6 packets, besides subtracting the pseudo header checksum,
f8c6455b Shani Michaeli 2014-11-09  635   * we also compute/add the IP header checksum which

:::::: The code at line 627 was first introduced by commit
:::::: f8c6455bb04b944edb69e9b074e28efee2c56bdd net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE

:::::: TO: Shani Michaeli <shanim@mellanox.com>
:::::: CC: David S. Miller <davem@davemloft.net>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 34812 bytes --]

* Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-13 12:01 ` Tariq Toukan
@ 2017-03-13 12:54   ` Eric Dumazet
  0 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2017-03-13 12:54 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Eric Dumazet, David S . Miller, netdev, Tariq Toukan,
	Saeed Mahameed, Willem de Bruijn, Alexei Starovoitov,
	Alexander Duyck, Jesper Dangaard Brouer

On Mon, 2017-03-13 at 14:01 +0200, Tariq Toukan wrote:

> I think MM-list people won't be happy with this.
> We were doing a similar thing with order-5 pages in mlx5 Striding RQ:
> Allocate and split high-order pages, to have:
> - Physically contiguous memory,
> - Less page allocations,
> - Yet, keep a fine grained refcounts/truesize.
> In case no high-order page available, fallback to using order-0 pages.
> 
> However, we changed this behavior, as it was fragmenting the memory, and 
> depleting the high-order pages available quickly.


Sure, I was not happy with this scheme either.

I was the first to complain and suggest split_page() one year ago.

mlx5 was using __GFP_MEMALLOC for its MLX5_MPWRQ_WQE_PAGE_ORDER
allocations, and failure had no fallback.

mlx5e_alloc_rx_mpwqe() was simply giving up immediately.

Very different behavior there, since:

1) we normally recycle 99% [1] of the pages, and rx_alloc_order quickly
decreases under memory pressure.


2) My high order allocations use __GFP_NOMEMALLOC to cancel the
__GFP_MEMALLOC

3) Also note that I chose to periodically reset rx_alloc_order from
mlx4_en_recover_from_oom() to the initial value.
   We could later change this to a slow recovery if really needed, but
   my tests were fine with this.


[1] This driver might need to change the default RX ring sizes.
    1024 slots is a bit short for a 40Gbit NIC these days.

    (We tune this to 4096)
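
    (Something along the lines of "ethtool -G ethX rx 4096", using the
     standard ethtool ring-resize knob; exact numbers depend on the setup.)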

Thanks !

* Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-13 12:50 ` kbuild test robot
@ 2017-03-13 13:07   ` Eric Dumazet
  2017-03-13 13:16     ` Eric Dumazet
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2017-03-13 13:07 UTC (permalink / raw)
  To: kbuild test robot
  Cc: Eric Dumazet, kbuild-all, David S . Miller, netdev, Tariq Toukan,
	Saeed Mahameed, Willem de Bruijn, Alexei Starovoitov,
	Alexander Duyck

On Mon, 2017-03-13 at 20:50 +0800, kbuild test robot wrote:
> Hi Eric,
> 
> [auto build test WARNING on net-next/master]
> 
> url:    https://github.com/0day-ci/linux/commits/Eric-Dumazet/mlx4-Better-use-of-order-0-pages-in-RX-path/20170313-191100
> config: x86_64-randconfig-s5-03131942 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=x86_64 
> 
> All warnings (new ones prefixed by >>):
> 
>    drivers/net/ethernet/mellanox/mlx4/en_rx.c:622:12: warning: 'struct iphdr' declared inside parameter list will not be visible outside of this definition or declaration
>         struct iphdr *iph)
>                ^~~~~
>    In file included from include/linux/swab.h:4:0,
>                     from include/uapi/linux/byteorder/little_endian.h:12,
>                     from include/linux/byteorder/little_endian.h:4,
>                     from arch/x86/include/uapi/asm/byteorder.h:4,
>                     from include/asm-generic/bitops/le.h:5,
>                     from arch/x86/include/asm/bitops.h:517,
>                     from include/linux/bitops.h:36,
>                     from include/linux/kernel.h:10,
>                     from include/linux/list.h:8,
>                     from include/linux/timer.h:4,
>                     from include/linux/workqueue.h:8,
>                     from include/linux/bpf.h:12,
>                     from drivers/net/ethernet/mellanox/mlx4/en_rx.c:34:
>    drivers/net/ethernet/mellanox/mlx4/en_rx.c: In function 'get_fixed_ipv4_csum':
>    drivers/net/ethernet/mellanox/mlx4/en_rx.c:627:36: error: dereferencing pointer to incomplete type 'struct iphdr'
>      length_for_csum = (be16_to_cpu(iph->tot_len) - (iph->ihl << 2));
>                                        ^
>    include/uapi/linux/swab.h:100:54: note: in definition of macro '__swab16'
>     #define __swab16(x) (__u16)__builtin_bswap16((__u16)(x))
>                                                          ^
> >> include/linux/byteorder/generic.h:96:21: note: in expansion of macro '__be16_to_cpu'
>     #define be16_to_cpu __be16_to_cpu
>                         ^~~~~~~~~~~~~
> >> drivers/net/ethernet/mellanox/mlx4/en_rx.c:627:21: note: in expansion of macro 'be16_to_cpu'
>      length_for_csum = (be16_to_cpu(iph->tot_len) - (iph->ihl << 2));
>                         ^~~~~~~~~~~
> 
> vim +/be16_to_cpu +627 drivers/net/ethernet/mellanox/mlx4/en_rx.c

I removed the include <net/busy_poll.h>

It seems we need <net/ip.h>.

* Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-13 13:07   ` Eric Dumazet
@ 2017-03-13 13:16     ` Eric Dumazet
  0 siblings, 0 replies; 23+ messages in thread
From: Eric Dumazet @ 2017-03-13 13:16 UTC (permalink / raw)
  To: kbuild test robot
  Cc: Eric Dumazet, kbuild-all, David S . Miller, netdev, Tariq Toukan,
	Saeed Mahameed, Willem de Bruijn, Alexei Starovoitov,
	Alexander Duyck

On Mon, 2017-03-13 at 06:07 -0700, Eric Dumazet wrote:

> I removed the include <net/busy_poll.h>
> 
> It seems we need <net/ip.h>.

Yes, I will add this to v2:

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index de455c8a2dec389cfeca6b6d474a6184d6acf618..a71554649c25383bb765fa8220bc9cd490247aee 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -43,6 +43,7 @@
 #include <linux/vmalloc.h>
 #include <linux/irq.h>
 
+#include <net/ip.h>
 #if IS_ENABLED(CONFIG_IPV6)
 #include <net/ip6_checksum.h>
 #endif

* Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-13  0:58 [PATCH net-next] mlx4: Better use of order-0 pages in RX path Eric Dumazet
                   ` (2 preceding siblings ...)
  2017-03-13 12:50 ` kbuild test robot
@ 2017-03-13 17:34 ` Alexei Starovoitov
  2017-03-13 17:50   ` Eric Dumazet
  3 siblings, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2017-03-13 17:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Tariq Toukan, Saeed Mahameed,
	Willem de Bruijn, Alexei Starovoitov, Eric Dumazet,
	Alexander Duyck

On Sun, Mar 12, 2017 at 05:58:47PM -0700, Eric Dumazet wrote:
> @@ -767,10 +814,30 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>  			case XDP_PASS:
>  				break;
>  			case XDP_TX:
> +				/* Make sure we have one page ready to replace this one */
> +				npage = NULL;
> +				if (!ring->page_cache.index) {
> +					npage = mlx4_alloc_page(priv, ring,
> +								&ndma, numa_mem_id(),
> +								GFP_ATOMIC | __GFP_MEMALLOC);

did you test this with the xdp2 test?
under what conditions does it allocate?
It looks dangerous from a security point of view to do allocations here.
Can it be exploited by an attacker?
we use xdp for ddos and lb and this is the fast path.
If 1 out of 100s of XDP_TX packets hits this allocation we will have a serious
perf regression.
In general I don't think it's a good idea to penalize x86 in favor of powerpc.
Can you #ifdef this new code somehow? so we won't have these concerns on x86?

* Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-13 17:34 ` Alexei Starovoitov
@ 2017-03-13 17:50   ` Eric Dumazet
  2017-03-13 18:31     ` Alexei Starovoitov
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2017-03-13 17:50 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S . Miller, netdev, Tariq Toukan, Saeed Mahameed,
	Willem de Bruijn, Alexei Starovoitov, Eric Dumazet,
	Alexander Duyck

On Mon, Mar 13, 2017 at 10:34 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Sun, Mar 12, 2017 at 05:58:47PM -0700, Eric Dumazet wrote:
>> @@ -767,10 +814,30 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>>                       case XDP_PASS:
>>                               break;
>>                       case XDP_TX:
>> +                             /* Make sure we have one page ready to replace this one */
>> +                             npage = NULL;
>> +                             if (!ring->page_cache.index) {
>> +                                     npage = mlx4_alloc_page(priv, ring,
>> +                                                             &ndma, numa_mem_id(),
>> +                                                             GFP_ATOMIC | __GFP_MEMALLOC);
>
> did you test this with xdp2 test ?
> under what conditions it allocates ?
> It looks dangerous from security point of view to do allocations here.
> Can it be exploited by an attacker?
> we use xdp for ddos and lb and this is fast path.
> If 1 out of 100s XDP_TX packets hit this allocation we will have serious
> perf regression.
> In general I dont think it's a good idea to penalize x86 in favor of powerpc.
> Can you #ifdef this new code somehow? so we won't have these concerns on x86?

Normal paths would never hit this point really. I wanted to be extra
safe, because who knows, some guys could be tempted to set
ethtool -G ethX  rx 512 tx 8192

Before this patch, if you were able to push enough frames into the TX ring,
you would also eventually be forced to allocate memory, or drop frames...

This patch does not penalize x86, quite the contrary.
It brings a (small) improvement on x86, and a huge improvement on powerpc.

Thanks.

* Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-13 17:50   ` Eric Dumazet
@ 2017-03-13 18:31     ` Alexei Starovoitov
  2017-03-13 18:58       ` Eric Dumazet
  0 siblings, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2017-03-13 18:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Tariq Toukan, Saeed Mahameed,
	Willem de Bruijn, Alexei Starovoitov, Eric Dumazet,
	Alexander Duyck

On Mon, Mar 13, 2017 at 10:50:28AM -0700, Eric Dumazet wrote:
> On Mon, Mar 13, 2017 at 10:34 AM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Sun, Mar 12, 2017 at 05:58:47PM -0700, Eric Dumazet wrote:
> >> @@ -767,10 +814,30 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
> >>                       case XDP_PASS:
> >>                               break;
> >>                       case XDP_TX:
> >> +                             /* Make sure we have one page ready to replace this one */
> >> +                             npage = NULL;
> >> +                             if (!ring->page_cache.index) {
> >> +                                     npage = mlx4_alloc_page(priv, ring,
> >> +                                                             &ndma, numa_mem_id(),
> >> +                                                             GFP_ATOMIC | __GFP_MEMALLOC);
> >
> > did you test this with xdp2 test ?
> > under what conditions it allocates ?
> > It looks dangerous from security point of view to do allocations here.
> > Can it be exploited by an attacker?
> > we use xdp for ddos and lb and this is fast path.
> > If 1 out of 100s XDP_TX packets hit this allocation we will have serious
> > perf regression.
> > In general I dont think it's a good idea to penalize x86 in favor of powerpc.
> > Can you #ifdef this new code somehow? so we won't have these concerns on x86?
> 
> Normal paths would never hit this point really. I wanted to be extra
> safe, because who knows, some guys could be tempted to set
> ethtool -G ethX  rx 512 tx 8192
> 
> Before this patch, if you were able to push enough frames in TX ring,
> you would also eventually be forced to allocate memory, or drop frames...

hmm. not following.
Packets don't come into xdp tx queues from the stack. They can only arrive via xdp_tx.
So this rx page belongs to the driver, not shared with anyone, and it only needs to
be put onto the tx ring, so I don't understand why the driver needs to allocate
anything here. To refill the rx ring? but why here?
rx 512 tx 8192 is meaningless from an xdp pov, since most of the tx entries
will be unused.
why are you saying it will cause this if (!ring->page_cache.index) to trigger?

> This patch does not penalize x86, quite the contrary.
> It brings a (small) improvement on x86, and a huge improvement on powerpc.

for normal tcp stack. sure. I'm worried about xdp fast path that needs to be tested.

* Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-13 18:31     ` Alexei Starovoitov
@ 2017-03-13 18:58       ` Eric Dumazet
  2017-03-13 20:23         ` Alexei Starovoitov
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2017-03-13 18:58 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S . Miller, netdev, Tariq Toukan, Saeed Mahameed,
	Willem de Bruijn, Alexei Starovoitov, Eric Dumazet,
	Alexander Duyck

On Mon, Mar 13, 2017 at 11:31 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Mon, Mar 13, 2017 at 10:50:28AM -0700, Eric Dumazet wrote:
>> On Mon, Mar 13, 2017 at 10:34 AM, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>> > On Sun, Mar 12, 2017 at 05:58:47PM -0700, Eric Dumazet wrote:
>> >> @@ -767,10 +814,30 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>> >>                       case XDP_PASS:
>> >>                               break;
>> >>                       case XDP_TX:
>> >> +                             /* Make sure we have one page ready to replace this one */
>> >> +                             npage = NULL;
>> >> +                             if (!ring->page_cache.index) {
>> >> +                                     npage = mlx4_alloc_page(priv, ring,
>> >> +                                                             &ndma, numa_mem_id(),
>> >> +                                                             GFP_ATOMIC | __GFP_MEMALLOC);
>> >
>> > did you test this with xdp2 test ?
>> > under what conditions it allocates ?
>> > It looks dangerous from security point of view to do allocations here.
>> > Can it be exploited by an attacker?
>> > we use xdp for ddos and lb and this is fast path.
>> > If 1 out of 100s XDP_TX packets hit this allocation we will have serious
>> > perf regression.
>> > In general I dont think it's a good idea to penalize x86 in favor of powerpc.
>> > Can you #ifdef this new code somehow? so we won't have these concerns on x86?
>>
>> Normal paths would never hit this point really. I wanted to be extra
>> safe, because who knows, some guys could be tempted to set
>> ethtool -G ethX  rx 512 tx 8192
>>
>> Before this patch, if you were able to push enough frames in TX ring,
>> you would also eventually be forced to allocate memory, or drop frames...
>
> hmm. not following.
> Into xdp tx queues packets don't come from stack. It can only be via xdp_tx.
> So this rx page belongs to driver, not shared with anyone and it only needs to
> be put onto tx ring, so I don't understand why driver needs to allocating
> anything here. To refill the rx ring? but why here?

Because the RX ring can be shared, by packets going to the upper stacks (say TCP).

So there is no guarantee that the pages in the quarantine pool have
a page count of 1.

The normal TX_XDP path will recycle pages in ring->cache_page .

This is exactly where I pick up a replacement.

Pages in ring->cache_page have the guarantee to have no other users
than ourself (mlx4 driver)

You might have not noticed that current mlx4 driver has a lazy refill
of RX ring buffers, that eventually
removes all the pages from RX ring, and we have to recover with this
lazy mlx4_en_recover_from_oom() thing
that will attempt to restart the allocations.

After my patch, we have the guarantee that the RX ring buffer is
always fully populated.

When we receive a frame (XDP or not), we drop it if we can not
replenish the RX slot,
in case the oldest page in quarantine is not a recycling candidate and
we can not allocate a new page.
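
To make that rule concrete, here is a minimal user-space C sketch of the
replenish decision; the struct names, the pool size and the plain int
refcount are illustrative assumptions, not the driver's actual types
(the real code works on DMA-mapped struct page entries and
page_ref_count()):

#include <stdlib.h>

struct fake_page {
        int refcount;                           /* stand-in for page_ref_count() */
};

struct rx_quarantine {
        struct fake_page *slot[256];            /* FIFO of pages handed to old skbs */
        unsigned int head;                      /* oldest entry, next recycle candidate */
};

static struct fake_page *alloc_fresh_page(void)
{
        struct fake_page *p = malloc(sizeof(*p));

        if (p)
                p->refcount = 1;                /* freshly allocated, owned by us */
        return p;                               /* NULL under memory pressure */
}

/* Return a page to repopulate the RX slot, or NULL, meaning: drop the
 * frame and leave the ring entry untouched (the ring stays populated).
 */
static struct fake_page *replenish_rx_slot(struct rx_quarantine *q)
{
        struct fake_page *oldest = q->slot[q->head];

        if (oldest && oldest->refcount == 1) {  /* old skbs/frags consumed */
                q->slot[q->head] = NULL;
                q->head = (q->head + 1) % 256;
                return oldest;                  /* recycle the oldest page */
        }
        return alloc_fresh_page();              /* else allocate, or drop on NULL */
}

int main(void)
{
        struct rx_quarantine q = { { NULL }, 0 };
        struct fake_page *p = replenish_rx_slot(&q);    /* empty pool => allocate */

        free(p);
        return 0;
}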



> rx 512 tx 8192 is meaningless from xdp pov, since most of the tx entries
> will be unused.

Yes, but as an author of a kernel patch, I do not want to be
responsible for a crash
_if_ someone tries something dumb like that.

> why are you saying it will cause this if (!ring->page_cache.index) to trigger?
>
>> This patch does not penalize x86, quite the contrary.
>> It brings a (small) improvement on x86, and a huge improvement on powerpc.
>
> for normal tcp stack. sure. I'm worried about xdp fast path that needs to be tested.

Note that TX_XDP in mlx4 is completely broken as I mentioned earlier.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-13 18:58       ` Eric Dumazet
@ 2017-03-13 20:23         ` Alexei Starovoitov
  2017-03-13 21:09           ` Eric Dumazet
  0 siblings, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2017-03-13 20:23 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Tariq Toukan, Saeed Mahameed,
	Willem de Bruijn, Alexei Starovoitov, Eric Dumazet,
	Alexander Duyck

On Mon, Mar 13, 2017 at 11:58:05AM -0700, Eric Dumazet wrote:
> On Mon, Mar 13, 2017 at 11:31 AM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Mon, Mar 13, 2017 at 10:50:28AM -0700, Eric Dumazet wrote:
> >> On Mon, Mar 13, 2017 at 10:34 AM, Alexei Starovoitov
> >> <alexei.starovoitov@gmail.com> wrote:
> >> > On Sun, Mar 12, 2017 at 05:58:47PM -0700, Eric Dumazet wrote:
> >> >> @@ -767,10 +814,30 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
> >> >>                       case XDP_PASS:
> >> >>                               break;
> >> >>                       case XDP_TX:
> >> >> +                             /* Make sure we have one page ready to replace this one */
> >> >> +                             npage = NULL;
> >> >> +                             if (!ring->page_cache.index) {
> >> >> +                                     npage = mlx4_alloc_page(priv, ring,
> >> >> +                                                             &ndma, numa_mem_id(),
> >> >> +                                                             GFP_ATOMIC | __GFP_MEMALLOC);
> >> >
> >> > did you test this with xdp2 test ?
> >> > under what conditions it allocates ?
> >> > It looks dangerous from security point of view to do allocations here.
> >> > Can it be exploited by an attacker?
> >> > we use xdp for ddos and lb and this is fast path.
> >> > If 1 out of 100s XDP_TX packets hit this allocation we will have serious
> >> > perf regression.
> >> > In general I dont think it's a good idea to penalize x86 in favor of powerpc.
> >> > Can you #ifdef this new code somehow? so we won't have these concerns on x86?
> >>
> >> Normal paths would never hit this point really. I wanted to be extra
> >> safe, because who knows, some guys could be tempted to set
> >> ethtool -G ethX  rx 512 tx 8192
> >>
> >> Before this patch, if you were able to push enough frames in TX ring,
> >> you would also eventually be forced to allocate memory, or drop frames...
> >
> > hmm. not following.
> > Into xdp tx queues packets don't come from stack. It can only be via xdp_tx.
> > So this rx page belongs to driver, not shared with anyone and it only needs to
> > be put onto tx ring, so I don't understand why driver needs to allocating
> > anything here. To refill the rx ring? but why here?
> 
> Because RX ring can be shared, by packets goind to the upper stacks (say TCP)
> 
> So there is no guarantee that the pages in the quarantine pool have
> their page count to 1.
> 
> The normal TX_XDP path will recycle pages in ring->cache_page .
> 
> This is exactly where I pick up a replacement.
> 
> Pages in ring->cache_page have the guarantee to have no other users
> than ourself (mlx4 driver)
> 
> You might have not noticed that current mlx4 driver has a lazy refill
> of RX ring buffers, that eventually
> removes all the pages from RX ring, and we have to recover with this
> lazy mlx4_en_recover_from_oom() thing
> that will attempt to restart the allocations.
> 
> After my patch, we have the guarantee that the RX ring buffer is
> always fully populated.
> 
> When we receive a frame (XDP or not), we drop it if we can not
> replenish the RX slot,
> in case the oldest page in quarantine is not a recycling candidate and
> we can not allocate a new page.

Got it. Could you please add the above explanation to the commit log,
since it's a huge difference vs other drivers.
I don't think any other driver in the tree follows this strategy.
I think that's the right call, but it shouldn't be hidden in the details.
If that's the right approach (it probably is) we should do it
in the other drivers too.
Though I don't see you dropping the "to the stack" packet in this patch.
The idea is to drop the packet (whether XDP or not) if the RX
buffer cannot be replenished _regardless_ of whether the driver
implements recycling, right?

> > rx 512 tx 8192 is meaningless from xdp pov, since most of the tx entries
> > will be unused.
> 
> Yes, but as an author of a kernel patch, I do not want to be
> responsible for a crash
> _if_ someone tries something dumb like that.
> 
> > why are you saying it will cause this if (!ring->page_cache.index) to trigger?
> >
> >> This patch does not penalize x86, quite the contrary.
> >> It brings a (small) improvement on x86, and a huge improvement on powerpc.
> >
> > for normal tcp stack. sure. I'm worried about xdp fast path that needs to be tested.
> 
> Note that TX_XDP in mlx4 is completely broken as I mentioned earlier.

Theory vs practice. We don't see this issue running XDP.
My understanding is that, since RX and TX NAPI are scheduled on the same CPU,
there is a guarantee that during normal invocation they won't race,
and the only problem is due to mlx4_en_recover_from_oom(), which can
run RX NAPI on some random CPU.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-13 20:23         ` Alexei Starovoitov
@ 2017-03-13 21:09           ` Eric Dumazet
  2017-03-13 23:21             ` Alexei Starovoitov
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2017-03-13 21:09 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S . Miller, netdev, Tariq Toukan, Saeed Mahameed,
	Willem de Bruijn, Alexei Starovoitov, Eric Dumazet,
	Alexander Duyck

On Mon, Mar 13, 2017 at 1:23 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Mon, Mar 13, 2017 at 11:58:05AM -0700, Eric Dumazet wrote:
>> On Mon, Mar 13, 2017 at 11:31 AM, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>> > On Mon, Mar 13, 2017 at 10:50:28AM -0700, Eric Dumazet wrote:
>> >> On Mon, Mar 13, 2017 at 10:34 AM, Alexei Starovoitov
>> >> <alexei.starovoitov@gmail.com> wrote:
>> >> > On Sun, Mar 12, 2017 at 05:58:47PM -0700, Eric Dumazet wrote:
>> >> >> @@ -767,10 +814,30 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>> >> >>                       case XDP_PASS:
>> >> >>                               break;
>> >> >>                       case XDP_TX:
>> >> >> +                             /* Make sure we have one page ready to replace this one */
>> >> >> +                             npage = NULL;
>> >> >> +                             if (!ring->page_cache.index) {
>> >> >> +                                     npage = mlx4_alloc_page(priv, ring,
>> >> >> +                                                             &ndma, numa_mem_id(),
>> >> >> +                                                             GFP_ATOMIC | __GFP_MEMALLOC);
>> >> >
>> >> > did you test this with xdp2 test ?
>> >> > under what conditions it allocates ?
>> >> > It looks dangerous from security point of view to do allocations here.
>> >> > Can it be exploited by an attacker?
>> >> > we use xdp for ddos and lb and this is fast path.
>> >> > If 1 out of 100s XDP_TX packets hit this allocation we will have serious
>> >> > perf regression.
>> >> > In general I dont think it's a good idea to penalize x86 in favor of powerpc.
>> >> > Can you #ifdef this new code somehow? so we won't have these concerns on x86?
>> >>
>> >> Normal paths would never hit this point really. I wanted to be extra
>> >> safe, because who knows, some guys could be tempted to set
>> >> ethtool -G ethX  rx 512 tx 8192
>> >>
>> >> Before this patch, if you were able to push enough frames in TX ring,
>> >> you would also eventually be forced to allocate memory, or drop frames...
>> >
>> > hmm. not following.
>> > Into xdp tx queues packets don't come from stack. It can only be via xdp_tx.
>> > So this rx page belongs to driver, not shared with anyone and it only needs to
>> > be put onto tx ring, so I don't understand why driver needs to allocating
>> > anything here. To refill the rx ring? but why here?
>>
>> Because RX ring can be shared, by packets goind to the upper stacks (say TCP)
>>
>> So there is no guarantee that the pages in the quarantine pool have
>> their page count to 1.
>>
>> The normal TX_XDP path will recycle pages in ring->cache_page .
>>
>> This is exactly where I pick up a replacement.
>>
>> Pages in ring->cache_page have the guarantee to have no other users
>> than ourself (mlx4 driver)
>>
>> You might have not noticed that current mlx4 driver has a lazy refill
>> of RX ring buffers, that eventually
>> removes all the pages from RX ring, and we have to recover with this
>> lazy mlx4_en_recover_from_oom() thing
>> that will attempt to restart the allocations.
>>
>> After my patch, we have the guarantee that the RX ring buffer is
>> always fully populated.
>>
>> When we receive a frame (XDP or not), we drop it if we can not
>> replenish the RX slot,
>> in case the oldest page in quarantine is not a recycling candidate and
>> we can not allocate a new page.
>
> Got it. Could you please add above explanation to the commit log,
> since it's a huge difference vs other drivers.
> I don't think any other driver in the tree follows this strategy.

An sk_buff allocation can fail before we even consider adding a frag to
the skb anyway...

Look, all of this is completely irrelevant for XDP_TX, since there is no
allocation at all once ~128 pages have been put into the page_cache.

Do you want to prefill this cache when XDP is loaded?

> I think that's the right call, but it shouldn't be hidden in details.

ring->page_cache contains the pages used by XDP_TX; they are recycled
through mlx4_en_rx_recycle().

There is a nice comment there. I did not change this part.

These pages _used_ to be taken directly from mlx4_en_prepare_rx_desc(),
because there was no page recycling for the normal path.

All my patch does is provide a generic implementation of page
recycling, leaving the XDP_TX pages in their own pool, because we do
not even have to check their page count, which would add a cache line
miss.

So I do have 2 different pools, simply to keep the XDP_TX path ultra
fast, as before.

That does not look like a detail to me.
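
Roughly, a user-space C sketch of the XDP_TX pool described above (the
names and cache size here are assumptions for illustration; the
driver's own cache is ring->page_cache, holding the "~128 pages"
mentioned earlier):

#include <stddef.h>

#define XDP_CACHE_SIZE 128              /* assumed, matching the "~128 pages" above */

/* Pool 1: driver-private pages used only by XDP_TX.  Recycling them never
 * reads a page refcount, so it never touches the struct page cache line.
 */
struct xdp_page_cache {
        void *buf[XDP_CACHE_SIZE];
        int index;
};

static int xdp_cache_put(struct xdp_page_cache *c, void *page)
{
        if (c->index >= XDP_CACHE_SIZE)
                return 0;               /* full: caller releases the page */
        c->buf[c->index++] = page;
        return 1;
}

static void *xdp_cache_get(struct xdp_page_cache *c)
{
        /* NULL means: fall back to pool 2 (the quarantine) or the allocator */
        return c->index ? c->buf[--c->index] : NULL;
}

Pool 2 is the quarantine described earlier in the thread, which does
have to look at the page count before reuse.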


> If that's the right approach (probably is) we should do it
> in the other drivers too.
> Though I don't see you dropping "to the stack" packet in this patch.
> The idea is to drop the packet (whether xdp or not) if rx
> buffer cannot be replinshed _regardless_ whether driver
> implements recycling, right?



>
> Theory vs practice. We don't see this issue running xdp.

Sure, so let's ignore the syzkaller guys and stop doing code reviews, if
all this is working for you.

> My understanding, since rx and tx napi are scheduled for the same cpu
> there is a guarantee that during normal invocation they won't race
> and the only problem is due to mlx4_en_recover_from_oom() that can
> run rx napi on some random cpu.

Exactly. So under pressure, the host will panic :/

The same goes if you let one application use busy polling, or if the
IRQ is moved to another CPU (/proc/irq/*/smp_affinity).

Looks fragile, but whatever, maybe you control everything on your hosts.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-13 21:09           ` Eric Dumazet
@ 2017-03-13 23:21             ` Alexei Starovoitov
  2017-03-13 23:28               ` Eric Dumazet
  0 siblings, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2017-03-13 23:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Tariq Toukan, Saeed Mahameed,
	Willem de Bruijn, Alexei Starovoitov, Eric Dumazet,
	Alexander Duyck

On Mon, Mar 13, 2017 at 02:09:23PM -0700, Eric Dumazet wrote:
> On Mon, Mar 13, 2017 at 1:23 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Mon, Mar 13, 2017 at 11:58:05AM -0700, Eric Dumazet wrote:
> >> On Mon, Mar 13, 2017 at 11:31 AM, Alexei Starovoitov
> >> <alexei.starovoitov@gmail.com> wrote:
> >> > On Mon, Mar 13, 2017 at 10:50:28AM -0700, Eric Dumazet wrote:
> >> >> On Mon, Mar 13, 2017 at 10:34 AM, Alexei Starovoitov
> >> >> <alexei.starovoitov@gmail.com> wrote:
> >> >> > On Sun, Mar 12, 2017 at 05:58:47PM -0700, Eric Dumazet wrote:
> >> >> >> @@ -767,10 +814,30 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
> >> >> >>                       case XDP_PASS:
> >> >> >>                               break;
> >> >> >>                       case XDP_TX:
> >> >> >> +                             /* Make sure we have one page ready to replace this one */
> >> >> >> +                             npage = NULL;
> >> >> >> +                             if (!ring->page_cache.index) {
> >> >> >> +                                     npage = mlx4_alloc_page(priv, ring,
> >> >> >> +                                                             &ndma, numa_mem_id(),
> >> >> >> +                                                             GFP_ATOMIC | __GFP_MEMALLOC);
> >> >> >
> >> >> > did you test this with xdp2 test ?
> >> >> > under what conditions it allocates ?
> >> >> > It looks dangerous from security point of view to do allocations here.
> >> >> > Can it be exploited by an attacker?
> >> >> > we use xdp for ddos and lb and this is fast path.
> >> >> > If 1 out of 100s XDP_TX packets hit this allocation we will have serious
> >> >> > perf regression.
> >> >> > In general I dont think it's a good idea to penalize x86 in favor of powerpc.
> >> >> > Can you #ifdef this new code somehow? so we won't have these concerns on x86?
> >> >>
> >> >> Normal paths would never hit this point really. I wanted to be extra
> >> >> safe, because who knows, some guys could be tempted to set
> >> >> ethtool -G ethX  rx 512 tx 8192
> >> >>
> >> >> Before this patch, if you were able to push enough frames in TX ring,
> >> >> you would also eventually be forced to allocate memory, or drop frames...
> >> >
> >> > hmm. not following.
> >> > Into xdp tx queues packets don't come from stack. It can only be via xdp_tx.
> >> > So this rx page belongs to driver, not shared with anyone and it only needs to
> >> > be put onto tx ring, so I don't understand why driver needs to allocating
> >> > anything here. To refill the rx ring? but why here?
> >>
> >> Because RX ring can be shared, by packets goind to the upper stacks (say TCP)
> >>
> >> So there is no guarantee that the pages in the quarantine pool have
> >> their page count to 1.
> >>
> >> The normal TX_XDP path will recycle pages in ring->cache_page .
> >>
> >> This is exactly where I pick up a replacement.
> >>
> >> Pages in ring->cache_page have the guarantee to have no other users
> >> than ourself (mlx4 driver)
> >>
> >> You might have not noticed that current mlx4 driver has a lazy refill
> >> of RX ring buffers, that eventually
> >> removes all the pages from RX ring, and we have to recover with this
> >> lazy mlx4_en_recover_from_oom() thing
> >> that will attempt to restart the allocations.
> >>
> >> After my patch, we have the guarantee that the RX ring buffer is
> >> always fully populated.
> >>
> >> When we receive a frame (XDP or not), we drop it if we can not
> >> replenish the RX slot,
> >> in case the oldest page in quarantine is not a recycling candidate and
> >> we can not allocate a new page.
> >
> > Got it. Could you please add above explanation to the commit log,
> > since it's a huge difference vs other drivers.
> > I don't think any other driver in the tree follows this strategy.
> 
> sk_buff allocation can fail before considering adding a frag on the
> skb anyway...
> 
> Look, all this is completely irrelevant for XDP_TX, since there is no
> allocation at all,
> once ~128 pages have been put into page_cache.

Is it once at the beginning only? If so, then why is that
'if (!ring->page_cache.index)' check done for every packet?
If it is every 128 packets then it will cause a performance drop.
Since I don't completely understand how your new recycling
logic works, I'm asking you to test this patch with samples/bpf/xdp2
to make sure perf is still good, since it doesn't sound like
you tested with XDP, and I don't understand the patch enough to see
the impact, and that makes me worried.

> Do you want to prefill this cache when XDP is loaded ?
> 
> > I think that's the right call, but it shouldn't be hidden in details.
> 
> ring->page_cache contains the pages used by XDP_TX are recycled
> through mlx4_en_rx_recycle()
> 
> There is a nice comment there. I did not change this part.
> 
> These pages _used_ to be directly taken from mlx4_en_prepare_rx_desc(),
> because there was no page recycling for the normal path.
> 
> All my patch does is a generic implementation
> for page recycling, leaving the XDP_TX pages in their own pool,
> because we do not even have to check their page count, adding a cache line miss.
> 
> So I do have 2 different pools, simply to let XDP_TX path being ultra
> fast, as before.

I agree, but I want to see test results.

> Does not look details to me.

That's exactly why I'm asking you to add all that info to the commit log.

> > My understanding, since rx and tx napi are scheduled for the same cpu
> > there is a guarantee that during normal invocation they won't race
> > and the only problem is due to mlx4_en_recover_from_oom() that can
> > run rx napi on some random cpu.
> 
> Exactly. So under pressure, host will panic :/

The host under pressure will race and maybe it will panic. Sure.
I will take such a panic any day vs adding spin_locks to the fast path.
If Saeed can fix it without adding spin_locks, great, but slowing
down the fast path because of a one-in-a-million race is no good.
You're using the same arguments when dealing with bugs in the TCP stack,
so I can only hold you to the same standard here :)

Frankly, if you were not proposing to slow down XDP with spin_locks
in the other thread, I probably wouldn't be worried about this patch.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-13 23:21             ` Alexei Starovoitov
@ 2017-03-13 23:28               ` Eric Dumazet
  2017-03-13 23:40                 ` Alexei Starovoitov
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2017-03-13 23:28 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S . Miller, netdev, Tariq Toukan, Saeed Mahameed,
	Willem de Bruijn, Alexei Starovoitov, Eric Dumazet,
	Alexander Duyck

On Mon, Mar 13, 2017 at 4:21 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> is it once in the beginning only? If so then why that
> 'if (!ring->page_cache.index)' check is done for every packet?



You did not really read the patch, otherwise you would not ask these questions.

Test it, and if you find a regression, shout loudly.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-13 23:28               ` Eric Dumazet
@ 2017-03-13 23:40                 ` Alexei Starovoitov
  2017-03-13 23:44                   ` Eric Dumazet
  2017-03-14  1:02                   ` Eric Dumazet
  0 siblings, 2 replies; 23+ messages in thread
From: Alexei Starovoitov @ 2017-03-13 23:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Tariq Toukan, Saeed Mahameed,
	Willem de Bruijn, Alexei Starovoitov, Eric Dumazet,
	Alexander Duyck

On Mon, Mar 13, 2017 at 04:28:04PM -0700, Eric Dumazet wrote:
> On Mon, Mar 13, 2017 at 4:21 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > is it once in the beginning only? If so then why that
> > 'if (!ring->page_cache.index)' check is done for every packet?
> 
> 
> 
> You did not really read the patch, otherwise you would not ask these questions.

please explain. I see
+  if (!ring->page_cache.index) {
+          npage = mlx4_alloc_page(priv, ring,
which is done for every packet that goes via XDP_TX.

> Test it, and if you find a regression, shout loudly.

That's not how it works. It's the job of the submitter to prove
that additional code doesn't cause regressions, especially
when there are legitimate concerns.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-13 23:40                 ` Alexei Starovoitov
@ 2017-03-13 23:44                   ` Eric Dumazet
  2017-03-13 23:51                     ` Alexei Starovoitov
  2017-03-14  1:02                   ` Eric Dumazet
  1 sibling, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2017-03-13 23:44 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S . Miller, netdev, Tariq Toukan, Saeed Mahameed,
	Willem de Bruijn, Alexei Starovoitov, Eric Dumazet,
	Alexander Duyck

On Mon, Mar 13, 2017 at 4:40 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Mon, Mar 13, 2017 at 04:28:04PM -0700, Eric Dumazet wrote:
>> On Mon, Mar 13, 2017 at 4:21 PM, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>> >
>> > is it once in the beginning only? If so then why that
>> > 'if (!ring->page_cache.index)' check is done for every packet?
>>
>>
>>
>> You did not really read the patch, otherwise you would not ask these questions.
>
> please explain. I see
> +  if (!ring->page_cache.index) {
> +          npage = mlx4_alloc_page(priv, ring,
> which is done for every packet that goes via XDP_TX.
>

Well, we do this for all packets, even on hosts not running XDP:

if (xdp_prog) { ...

...

Then:

if (doorbell_pending)
     mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);

And nobody complained about these few additional instructions.

Should I have, very loudly?


>> Test it, and if you find a regression, shout loudly.
>
> that's not how it works. It's a job of submitter to prove
> that additional code doesn't cause regressions especially
> when there are legitimate concerns.

I have no easy way to test XDP. I have never used it and am not
planning to use it any time soon.

Does that mean I can no longer participate in Linux dev?

Nice to hear, Alexei.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-13 23:44                   ` Eric Dumazet
@ 2017-03-13 23:51                     ` Alexei Starovoitov
  0 siblings, 0 replies; 23+ messages in thread
From: Alexei Starovoitov @ 2017-03-13 23:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Tariq Toukan, Saeed Mahameed,
	Willem de Bruijn, Alexei Starovoitov, Eric Dumazet,
	Alexander Duyck

On Mon, Mar 13, 2017 at 04:44:19PM -0700, Eric Dumazet wrote:
> On Mon, Mar 13, 2017 at 4:40 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Mon, Mar 13, 2017 at 04:28:04PM -0700, Eric Dumazet wrote:
> >> On Mon, Mar 13, 2017 at 4:21 PM, Alexei Starovoitov
> >> <alexei.starovoitov@gmail.com> wrote:
> >> >
> >> > is it once in the beginning only? If so then why that
> >> > 'if (!ring->page_cache.index)' check is done for every packet?
> >>
> >>
> >>
> >> You did not really read the patch, otherwise you would not ask these questions.
> >
> > please explain. I see
> > +  if (!ring->page_cache.index) {
> > +          npage = mlx4_alloc_page(priv, ring,
> > which is done for every packet that goes via XDP_TX.
> >
> 
> Well, we do for all packets, even on hosts not running XDP:
> 
> if (xdp_prog) { ...
> 
> ...
> 
> Then :
> 
> if (doorbell_pending))
>      mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);
> 
> And nobody complained of few additional instructions.

It's not the additional 'if' I'm concerned about, it's the allocation
after the 'if', since you didn't explain clearly when it's executed.

> >> Test it, and if you find a regression, shout loudly.
> >
> > that's not how it works. It's a job of submitter to prove
> > that additional code doesn't cause regressions especially
> > when there are legitimate concerns.
> 
> I have no easy way to test XDP. I  have never used it and am not
> planning to use it any time soon.
> 
> Does it mean I no longer can participate to linux dev ?

When you're touching the most performance-sensitive piece of XDP in
the driver, then yes, you have to show that XDP doesn't regress.
Especially since it's trivial to test.
Two mlx4 servers, pktgen and xdp2 are enough.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-13 23:40                 ` Alexei Starovoitov
  2017-03-13 23:44                   ` Eric Dumazet
@ 2017-03-14  1:02                   ` Eric Dumazet
  2017-03-14  4:57                     ` Alexei Starovoitov
  1 sibling, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2017-03-14  1:02 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, David S . Miller, netdev, Tariq Toukan,
	Saeed Mahameed, Willem de Bruijn, Alexei Starovoitov,
	Alexander Duyck

On Mon, 2017-03-13 at 16:40 -0700, Alexei Starovoitov wrote:

> that's not how it works. It's a job of submitter to prove
> that additional code doesn't cause regressions especially
> when there are legitimate concerns.

This test was moved out of the mlx4_en_prepare_rx_desc() section into
the XDP_TX code path.


        if (ring->page_cache.index > 0) {
                /* XDP uses a single page per frame */
                if (!frags->page) {
                        ring->page_cache.index--;
                        frags->page = ring->page_cache.buf[ring->page_cache.index].page;
                        frags->dma  = ring->page_cache.buf[ring->page_cache.index].dma;
                }
                frags->page_offset = XDP_PACKET_HEADROOM;
                rx_desc->data[0].addr = cpu_to_be64(frags->dma +
                                                    XDP_PACKET_HEADROOM);
                return 0;
        }

Can you check your claim again, because I see no additional cost
for XDP_TX.

In fact, I removed a useless test from the other paths (which are
equally important, I believe), so everybody should be happy.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-14  1:02                   ` Eric Dumazet
@ 2017-03-14  4:57                     ` Alexei Starovoitov
  2017-03-14 13:34                       ` Eric Dumazet
  0 siblings, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2017-03-14  4:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, netdev, Tariq Toukan,
	Saeed Mahameed, Willem de Bruijn, Alexei Starovoitov,
	Alexander Duyck

On Mon, Mar 13, 2017 at 06:02:11PM -0700, Eric Dumazet wrote:
> On Mon, 2017-03-13 at 16:40 -0700, Alexei Starovoitov wrote:
> 
> > that's not how it works. It's a job of submitter to prove
> > that additional code doesn't cause regressions especially
> > when there are legitimate concerns.
> 
> This test was moved out of the mlx4_en_prepare_rx_desc() section into
> the XDP_TX code path.
> 
> 
>         if (ring->page_cache.index > 0) {
>                 /* XDP uses a single page per frame */
>                 if (!frags->page) {
>                         ring->page_cache.index--;
>                         frags->page = ring->page_cache.buf[ring->page_cache.index].page;
>                         frags->dma  = ring->page_cache.buf[ring->page_cache.index].dma;
>                 }
>                 frags->page_offset = XDP_PACKET_HEADROOM;
>                 rx_desc->data[0].addr = cpu_to_be64(frags->dma +
>                                                     XDP_PACKET_HEADROOM);
>                 return 0;
>         }
> 
> Can you check again your claim, because I see no additional cost
> for XDP_TX.

Let's look at what it was:
- xdp_tx xmits the page regardless of whether the driver can replenish
- at the end of the NAPI run mlx4_en_refill_rx_buffers() will replenish
  the RX ring in bulk, either from the page_cache or by allocating one
  page at a time

After the changes:
- xdp_tx will check the page_cache; if it's empty it will try to do an
  order-10 (ten!!) alloc, will fail, will try to alloc a single page,
  will xmit the packet, and will place the just-allocated page into the
  RX ring. On the next packet in the same NAPI loop, it will try to
  allocate order 9 (since the cache is still empty), will fail, will try
  a single page, succeed... the next packet will try order 8 and so on.
And those spiky order-10 allocations will be happening 4 times a second
due to the new logic in mlx4_en_recover_from_oom().
We may get lucky and an order-2 alloc will succeed, but before that
we will waste tons of cycles.
If an attacker somehow makes the existing page recycling logic
ineffective, XDP performance will be limited by the order-0 page
allocator. Not great, but still acceptable.
After this patch it will just tank due to this crazy scheme.
Yet you're not talking about this new behavior in the commit log.
You didn't test XDP at all and are still claiming that everything is fine?!
NACK
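
To make the concern concrete, here is a tiny user-space C model of the
allocation pattern described above, assuming one high-order attempt per
packet while the cache is empty and that every high-order attempt
fails; the names are illustrative, the real driver uses
__alloc_pages_node() and ring->rx_alloc_order:

#include <stdbool.h>
#include <stdio.h>

static unsigned int rx_alloc_order = 10;        /* starts at the preferred order */

static bool try_alloc(unsigned int order)
{
        return order == 0;      /* pretend memory is too fragmented for order > 0 */
}

/* One call per XDP_TX packet that finds the page cache empty */
static bool refill_one_slot(void)
{
        if (rx_alloc_order > 0) {
                if (try_alloc(rx_alloc_order))
                        return true;    /* would be split into order-0 pages */
                rx_alloc_order--;       /* lower the target for the next packet */
        }
        return try_alloc(0);            /* fall back to a single page */
}

int main(void)
{
        for (int pkt = 0; pkt < 12; pkt++)
                printf("packet %d: refill %s, next order %u\n",
                       pkt, refill_one_slot() ? "ok" : "failed", rx_alloc_order);
        return 0;
}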

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-14  4:57                     ` Alexei Starovoitov
@ 2017-03-14 13:34                       ` Eric Dumazet
  2017-03-14 14:21                         ` Eric Dumazet
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2017-03-14 13:34 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, David S . Miller, netdev, Tariq Toukan,
	Saeed Mahameed, Willem de Bruijn, Alexei Starovoitov,
	Alexander Duyck

On Mon, Mar 13, 2017 at 9:57 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Mon, Mar 13, 2017 at 06:02:11PM -0700, Eric Dumazet wrote:
>> On Mon, 2017-03-13 at 16:40 -0700, Alexei Starovoitov wrote:
>>
>> > that's not how it works. It's a job of submitter to prove
>> > that additional code doesn't cause regressions especially
>> > when there are legitimate concerns.
>>
>> This test was moved out of the mlx4_en_prepare_rx_desc() section into
>> the XDP_TX code path.
>>
>>
>>         if (ring->page_cache.index > 0) {
>>                 /* XDP uses a single page per frame */
>>                 if (!frags->page) {
>>                         ring->page_cache.index--;
>>                         frags->page = ring->page_cache.buf[ring->page_cache.index].page;
>>                         frags->dma  = ring->page_cache.buf[ring->page_cache.index].dma;
>>                 }
>>                 frags->page_offset = XDP_PACKET_HEADROOM;
>>                 rx_desc->data[0].addr = cpu_to_be64(frags->dma +
>>                                                     XDP_PACKET_HEADROOM);
>>                 return 0;
>>         }
>>
>> Can you check again your claim, because I see no additional cost
>> for XDP_TX.
>
> Let's look what it was:
> - xdp_tx xmits the page regardless whether driver can replenish
> - at the end of the napi mlx4_en_refill_rx_buffers() will replenish
> rx in bulk either from page_cache or by allocating one page at a time
>
> after the changes:
> - xdp_tx will check page_cache if it's empty it will try to do
> order 10 (ten!!) alloc, will fail, will try to alloc single page,
> will xmit the packet, and will place just allocated page into rx ring.
> on the next packet in the same napi loop, it will try to allocate
> order 9 (since the cache is still empty), will fail, will try single
> page, succeed... next packet will try order 8 and so on.
> And that spiky order 10 allocations will be happening 4 times a second
> due to new logic in mlx4_en_recover_from_oom().
> We may get lucky and order 2 alloc will succeed, but before that
> we will waste tons of cycles.
> If an attacker somehow makes existing page recycling logic not effective,
> the xdp performance will be limited by order0 page allocator.
> Not great, but still acceptable.
> After this patch it will just tank due to this crazy scheme.
> Yet you're not talking about this new behavior in the commit log.
> You didn't test XDP at all and still claiming that everything is fine ?!
> NACK


Ouch. You NACK a patch based on fears from your side, just because I
said I would not spend time on XDP at this very moment.

We hardly allocate a page in our workloads, and a failed attempt to
get an order-10 page with GFP_ATOMIC has exactly the same cost as a
failed attempt at order-0 or order-1 or order-X; the buddy page
allocator gives that for free.

So I will leave this to Mellanox for the XDP tests and for upstreaming
this, and will stop arguing with you, this is going nowhere.

I suggested some changes, but you blocked everything just because I
publicly said that I would not use XDP, which added a serious mess to
this driver.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-14 13:34                       ` Eric Dumazet
@ 2017-03-14 14:21                         ` Eric Dumazet
  2017-03-15 15:33                           ` Tariq Toukan
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2017-03-14 14:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexei Starovoitov, David S . Miller, netdev, Tariq Toukan,
	Saeed Mahameed, Willem de Bruijn, Alexei Starovoitov,
	Alexander Duyck

On Tue, 2017-03-14 at 06:34 -0700, Eric Dumazet wrote:

> So I will leave this to Mellanox for XDP tests and upstreaming this,
> and will stop arguing with you, this is going nowhere.

Tariq, I will send a v2, including these changes (plus the missing
include from yesterday):

One is to make sure high-order allocations drop __GFP_DIRECT_RECLAIM.

The other is changing mlx4_en_recover_from_oom() to increase
rx_alloc_order by one instead of plainly resetting it to rx_pref_alloc_order.

Please test XDP and tell me if you find any issues.
Thanks!

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index a71554649c25383bb765fa8220bc9cd490247aee..cc41f2f145541b469b52e7014659d5fdbb7dac68 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -60,8 +60,10 @@ static struct page *mlx4_alloc_page(struct mlx4_en_priv *priv,
        if (unlikely(!ring->pre_allocated_count)) {
                unsigned int order = READ_ONCE(ring->rx_alloc_order);
 
-               page = __alloc_pages_node(node, gfp | __GFP_NOMEMALLOC |
-                                               __GFP_NOWARN | __GFP_NORETRY,
+               page = __alloc_pages_node(node, (gfp & ~__GFP_DIRECT_RECLAIM) |
+                                               __GFP_NOMEMALLOC |
+                                               __GFP_NOWARN |
+                                               __GFP_NORETRY,
                                          order);
                if (page) {
                        split_page(page, order);
@@ -412,12 +414,13 @@ int mlx4_en_activate_rx_rings(struct mlx4_en_priv *priv)
 }
 
 /* Under memory pressure, each ring->rx_alloc_order might be lowered
- * to very small values. Periodically reset it to initial value for
+ * to very small values. Periodically increase it to initial value for
  * optimal allocations, in case stress is over.
  */
 void mlx4_en_recover_from_oom(struct mlx4_en_priv *priv)
 {
        struct mlx4_en_rx_ring *ring;
+       unsigned int order;
        int ring_ind;
 
        if (!priv->port_up)
@@ -425,7 +428,9 @@ void mlx4_en_recover_from_oom(struct mlx4_en_priv *priv)
 
        for (ring_ind = 0; ring_ind < priv->rx_ring_num; ring_ind++) {
                ring = priv->rx_ring[ring_ind];
-               WRITE_ONCE(ring->rx_alloc_order, ring->rx_pref_alloc_order);
+               order = min_t(unsigned int, ring->rx_alloc_order + 1,
+                             ring->rx_pref_alloc_order);
+               WRITE_ONCE(ring->rx_alloc_order, order);
        }
 }
 

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-14 14:21                         ` Eric Dumazet
@ 2017-03-15 15:33                           ` Tariq Toukan
  0 siblings, 0 replies; 23+ messages in thread
From: Tariq Toukan @ 2017-03-15 15:33 UTC (permalink / raw)
  To: Eric Dumazet, Eric Dumazet
  Cc: Alexei Starovoitov, David S . Miller, netdev, Tariq Toukan,
	Saeed Mahameed, Willem de Bruijn, Alexei Starovoitov,
	Alexander Duyck



On 14/03/2017 4:21 PM, Eric Dumazet wrote:
> On Tue, 2017-03-14 at 06:34 -0700, Eric Dumazet wrote:
>
>> So I will leave this to Mellanox for XDP tests and upstreaming this,
>> and will stop arguing with you, this is going nowhere.
>
> Tariq, I will send a v2, including these changes (plus the missing
> include of yesterday)
>
> One is to make sure high order allocations remove __GFP_DIRECT_RECLAIM
>
> The other is changing mlx4_en_recover_from_oom() to increase by
> one rx_alloc_order instead of plain reset to rx_pref_alloc_order
>
> Please test XDP and tell me if you find any issues ?
> Thanks !

I can do the XDP check tomorrow, and complete the patch review.
I will keep you updated.

(I am replying to this on V2 as well)

Regards,
Tariq

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2017-03-15 15:33 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-13  0:58 [PATCH net-next] mlx4: Better use of order-0 pages in RX path Eric Dumazet
2017-03-13 12:01 ` Tariq Toukan
2017-03-13 12:54   ` Eric Dumazet
2017-03-13 12:15 ` kbuild test robot
2017-03-13 12:50 ` kbuild test robot
2017-03-13 13:07   ` Eric Dumazet
2017-03-13 13:16     ` Eric Dumazet
2017-03-13 17:34 ` Alexei Starovoitov
2017-03-13 17:50   ` Eric Dumazet
2017-03-13 18:31     ` Alexei Starovoitov
2017-03-13 18:58       ` Eric Dumazet
2017-03-13 20:23         ` Alexei Starovoitov
2017-03-13 21:09           ` Eric Dumazet
2017-03-13 23:21             ` Alexei Starovoitov
2017-03-13 23:28               ` Eric Dumazet
2017-03-13 23:40                 ` Alexei Starovoitov
2017-03-13 23:44                   ` Eric Dumazet
2017-03-13 23:51                     ` Alexei Starovoitov
2017-03-14  1:02                   ` Eric Dumazet
2017-03-14  4:57                     ` Alexei Starovoitov
2017-03-14 13:34                       ` Eric Dumazet
2017-03-14 14:21                         ` Eric Dumazet
2017-03-15 15:33                           ` Tariq Toukan
