* [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path
@ 2017-03-14 15:11 Eric Dumazet
  2017-03-15  4:06 ` Alexei Starovoitov
  2017-03-15 15:36 ` Tariq Toukan
  0 siblings, 2 replies; 16+ messages in thread
From: Eric Dumazet @ 2017-03-14 15:11 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Tariq Toukan, Saeed Mahameed, Willem de Bruijn,
	Alexei Starovoitov, Eric Dumazet, Eric Dumazet, Alexander Duyck

When I added order-0 page allocations and page recycling in the receive
path, I introduced issues on PowerPC, or more generally on arches with
large pages.

A GRO packet, aggregating 45 segments, ended up using 45 page frags
on 45 different pages. Before my changes we were very likely packing
up to 42 Ethernet frames per 64KB page.

1) At skb freeing time, the put_page() calls on the skb frags now touch 45
   different 'struct page', which adds more cache line misses.
   Too bad that standard Ethernet MTU is so small :/

2) Using one order-0 page per ring slot consumes ~42 times more memory
   on PowerPC.

3) Allocating order-0 pages is very likely to use pages from very
   different locations, increasing TLB pressure on hosts with more
   than 256 GB of memory after days of uptime.

This patch uses a refined strategy, addressing these points.

We still use order-0 pages, but the page recycling technique is modified
so that we have better chances to lower the number of pages containing the
frags of a given GRO skb (factor of 2 on x86, and 21 on PowerPC).

Page allocations are split in two halves:
- One half is currently visible to the NIC for DMA operations.
- The other half contains pages that were already attached to old skbs and
  are kept in a quarantine.

When we receive a frame, we look at the oldest entry in the pool and
check if the page count is back to one, meaning old skbs/frags were
consumed and the page can be recycled.
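
In simplified form, the recycling test boils down to the following (the real
code is in mlx4_page_is_reusable()/mlx4_replenish() below, which also handles
allocating a replacement page):

	en_page = &ring->pool.array[ring->pool.pool_idx];
	if (page_count(en_page->page) == 1 &&
	    !page_is_pfmemalloc(en_page->page) &&
	    page_to_nid(en_page->page) == numa_mem_id()) {
		/* We are the sole owner again: reuse the page and
		 * its dma mapping as is.
		 */
	} else {
		/* Old skbs still hold references: allocate a fresh
		 * page for this pool slot.
		 */
	}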

Page allocations are attempted using high orders, trying to lower TLB
pressure. We remember in ring->rx_alloc_order the last attempted order
and quickly decrement it in case of failure.
Then mlx4_en_recover_from_oom(), called every 250 msec, will attempt
to gradually restore rx_alloc_order to its optimal value.
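
In simplified form (see mlx4_alloc_page() and mlx4_en_recover_from_oom()
below for the full logic):

	/* data path, in mlx4_alloc_page() */
	page = __alloc_pages_node(node, gfp | __GFP_NOWARN | __GFP_NORETRY,
				  order);
	if (!page && order > 1)
		ring->rx_alloc_order--;	/* quick backoff */

	/* every 250 msec, in mlx4_en_recover_from_oom() */
	WRITE_ONCE(ring->rx_alloc_order,
		   min_t(unsigned int, ring->rx_alloc_order + 1,
			 ring->rx_pref_alloc_order));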

On x86, memory allocations stay the same (one page per RX slot for
MTU=1500), but on PowerPC this patch considerably reduces the allocated
memory.

Performance gain on PowerPC is about 50% for a single TCP flow.

On x86, I could not measure the difference, my test machine being
limited by the sender (33 Gbit per TCP flow).
22 fewer cache line misses per 64 KB GRO packet is probably worth
something on the order of 2 % or so.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   | 470 ++++++++++++++++-----------
 drivers/net/ethernet/mellanox/mlx4/en_tx.c   |  15 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  54 ++-
 3 files changed, 317 insertions(+), 222 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index aa074e57ce06fb2842fa1faabd156c3cd2fe10f5..cc41f2f145541b469b52e7014659d5fdbb7dac68 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -31,7 +31,6 @@
  *
  */
 
-#include <net/busy_poll.h>
 #include <linux/bpf.h>
 #include <linux/bpf_trace.h>
 #include <linux/mlx4/cq.h>
@@ -44,65 +43,50 @@
 #include <linux/vmalloc.h>
 #include <linux/irq.h>
 
+#include <net/ip.h>
 #if IS_ENABLED(CONFIG_IPV6)
 #include <net/ip6_checksum.h>
 #endif
 
 #include "mlx4_en.h"
 
-static int mlx4_alloc_page(struct mlx4_en_priv *priv,
-			   struct mlx4_en_rx_alloc *frag,
-			   gfp_t gfp)
+static struct page *mlx4_alloc_page(struct mlx4_en_priv *priv,
+				    struct mlx4_en_rx_ring *ring,
+				    dma_addr_t *dma,
+				    unsigned int node, gfp_t gfp)
 {
 	struct page *page;
-	dma_addr_t dma;
-
-	page = alloc_page(gfp);
-	if (unlikely(!page))
-		return -ENOMEM;
-	dma = dma_map_page(priv->ddev, page, 0, PAGE_SIZE, priv->dma_dir);
-	if (unlikely(dma_mapping_error(priv->ddev, dma))) {
-		__free_page(page);
-		return -ENOMEM;
-	}
-	frag->page = page;
-	frag->dma = dma;
-	frag->page_offset = priv->rx_headroom;
-	return 0;
-}
 
-static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
-			       struct mlx4_en_rx_ring *ring,
-			       struct mlx4_en_rx_desc *rx_desc,
-			       struct mlx4_en_rx_alloc *frags,
-			       gfp_t gfp)
-{
-	int i;
-
-	for (i = 0; i < priv->num_frags; i++, frags++) {
-		if (!frags->page) {
-			if (mlx4_alloc_page(priv, frags, gfp))
-				return -ENOMEM;
-			ring->rx_alloc_pages++;
+	if (unlikely(!ring->pre_allocated_count)) {
+		unsigned int order = READ_ONCE(ring->rx_alloc_order);
+
+		page = __alloc_pages_node(node, (gfp & ~__GFP_DIRECT_RECLAIM) |
+						__GFP_NOMEMALLOC |
+						__GFP_NOWARN |
+						__GFP_NORETRY,
+					  order);
+		if (page) {
+			split_page(page, order);
+			ring->pre_allocated_count = 1U << order;
+		} else {
+			if (order > 1)
+				ring->rx_alloc_order--;
+			page = __alloc_pages_node(node, gfp, 0);
+			if (unlikely(!page))
+				return NULL;
+			ring->pre_allocated_count = 1U;
 		}
-		rx_desc->data[i].addr = cpu_to_be64(frags->dma +
-						    frags->page_offset);
+		ring->pre_allocated = page;
+		ring->rx_alloc_pages += ring->pre_allocated_count;
 	}
-	return 0;
-}
-
-static void mlx4_en_free_frag(const struct mlx4_en_priv *priv,
-			      struct mlx4_en_rx_alloc *frag)
-{
-	if (frag->page) {
-		dma_unmap_page(priv->ddev, frag->dma,
-			       PAGE_SIZE, priv->dma_dir);
-		__free_page(frag->page);
+	page = ring->pre_allocated++;
+	ring->pre_allocated_count--;
+	*dma = dma_map_page(priv->ddev, page, 0, PAGE_SIZE, priv->dma_dir);
+	if (unlikely(dma_mapping_error(priv->ddev, *dma))) {
+		__free_page(page);
+		return NULL;
 	}
-	/* We need to clear all fields, otherwise a change of priv->log_rx_info
-	 * could lead to see garbage later in frag->page.
-	 */
-	memset(frag, 0, sizeof(*frag));
+	return page;
 }
 
 static void mlx4_en_init_rx_desc(const struct mlx4_en_priv *priv,
@@ -130,32 +114,18 @@ static void mlx4_en_init_rx_desc(const struct mlx4_en_priv *priv,
 	}
 }
 
-static int mlx4_en_prepare_rx_desc(struct mlx4_en_priv *priv,
-				   struct mlx4_en_rx_ring *ring, int index,
-				   gfp_t gfp)
+static void mlx4_en_prepare_rx_desc(struct mlx4_en_priv *priv,
+				    struct mlx4_en_rx_ring *ring,
+				    unsigned int index)
 {
 	struct mlx4_en_rx_desc *rx_desc = ring->buf + (index * ring->stride);
-	struct mlx4_en_rx_alloc *frags = ring->rx_info +
-					(index << priv->log_rx_info);
-	if (ring->page_cache.index > 0) {
-		/* XDP uses a single page per frame */
-		if (!frags->page) {
-			ring->page_cache.index--;
-			frags->page = ring->page_cache.buf[ring->page_cache.index].page;
-			frags->dma  = ring->page_cache.buf[ring->page_cache.index].dma;
-		}
-		frags->page_offset = XDP_PACKET_HEADROOM;
-		rx_desc->data[0].addr = cpu_to_be64(frags->dma +
-						    XDP_PACKET_HEADROOM);
-		return 0;
-	}
-
-	return mlx4_en_alloc_frags(priv, ring, rx_desc, frags, gfp);
-}
+	const struct mlx4_en_rx_alloc *frags = ring->rx_info +
+						(index << priv->log_rx_info);
+	int i;
 
-static bool mlx4_en_is_ring_empty(const struct mlx4_en_rx_ring *ring)
-{
-	return ring->prod == ring->cons;
+	for (i = 0; i < priv->num_frags; i++, frags++)
+		rx_desc->data[i].addr = cpu_to_be64(frags->dma +
+						    frags->page_offset);
 }
 
 static inline void mlx4_en_update_rx_prod_db(struct mlx4_en_rx_ring *ring)
@@ -163,79 +133,135 @@ static inline void mlx4_en_update_rx_prod_db(struct mlx4_en_rx_ring *ring)
 	*ring->wqres.db.db = cpu_to_be32(ring->prod & 0xffff);
 }
 
-/* slow path */
-static void mlx4_en_free_rx_desc(const struct mlx4_en_priv *priv,
-				 struct mlx4_en_rx_ring *ring,
-				 int index)
+static void mlx4_en_free_rx_buf(struct mlx4_en_priv *priv,
+				struct mlx4_en_rx_ring *ring)
 {
-	struct mlx4_en_rx_alloc *frags;
-	int nr;
+	int index;
 
-	frags = ring->rx_info + (index << priv->log_rx_info);
-	for (nr = 0; nr < priv->num_frags; nr++) {
-		en_dbg(DRV, priv, "Freeing fragment:%d\n", nr);
-		mlx4_en_free_frag(priv, frags + nr);
+	if (ring->pool.array) {
+		const struct mlx4_en_page *en_page = ring->pool.array;
+
+		for (index = 0; index < ring->pool.pool_size; index++) {
+			dma_unmap_page(priv->ddev, en_page->dma,
+				       PAGE_SIZE, priv->dma_dir);
+			__free_page(en_page->page);
+			en_page++;
+		}
+		kfree(ring->pool.array);
+		ring->pool.array = NULL;
+		while (ring->pre_allocated_count) {
+			__free_page(ring->pre_allocated++);
+			ring->pre_allocated_count--;
+		}
 	}
+	ring->cons = 0;
+	ring->prod = 0;
 }
 
 static int mlx4_en_fill_rx_buffers(struct mlx4_en_priv *priv)
 {
+	int ring_ind, i, new_size = priv->prof->rx_ring_size;
+	unsigned int stride_bytes = 0;
 	struct mlx4_en_rx_ring *ring;
-	int ring_ind;
-	int buf_ind;
-	int new_size;
+	unsigned int pages_per_ring;
+	unsigned int page_ind;
+	unsigned int total;
+	unsigned int order;
+
+	for (i = 0; i < priv->num_frags; i++)
+		stride_bytes += priv->frag_info[i].frag_stride;
+
+	/* Page recycling works best if we have enough pages in the pool.
+	 * Apply a factor of two on the minimal allocations required to
+	 * populate RX rings.
+	 */
+retry:
+	total = 0;
+	pages_per_ring = new_size * stride_bytes * 2 / PAGE_SIZE;
+	pages_per_ring = roundup_pow_of_two(pages_per_ring);
+
+	order = min_t(u32, ilog2(pages_per_ring), MAX_ORDER - 1);
+
+	for (ring_ind = 0; ring_ind < priv->rx_ring_num; ring_ind++) {
+		ring = priv->rx_ring[ring_ind];
+		mlx4_en_free_rx_buf(priv, ring);
+
+		/* Note: use kvalloc() when available, no hurry */
+		ring->pool.array = kmalloc_node(sizeof(*ring->pool.array) *
+						pages_per_ring,
+						GFP_KERNEL, ring->node);
+		if (!ring->pool.array)
+			return -ENOMEM;
+		ring->pool.pool_idx = 0;
+		ring->pool.pool_size = 0;
+		ring->rx_alloc_order = ring->rx_pref_alloc_order = order;
+	}
 
-	for (buf_ind = 0; buf_ind < priv->prof->rx_ring_size; buf_ind++) {
+	for (page_ind = 0; page_ind < pages_per_ring; page_ind++) {
 		for (ring_ind = 0; ring_ind < priv->rx_ring_num; ring_ind++) {
-			ring = priv->rx_ring[ring_ind];
+			struct page *page;
+			dma_addr_t dma;
 
-			if (mlx4_en_prepare_rx_desc(priv, ring,
-						    ring->actual_size,
-						    GFP_KERNEL | __GFP_COLD)) {
-				if (ring->actual_size < MLX4_EN_MIN_RX_SIZE) {
-					en_err(priv, "Failed to allocate enough rx buffers\n");
-					return -ENOMEM;
-				} else {
-					new_size = rounddown_pow_of_two(ring->actual_size);
-					en_warn(priv, "Only %d buffers allocated reducing ring size to %d\n",
-						ring->actual_size, new_size);
-					goto reduce_rings;
-				}
-			}
-			ring->actual_size++;
-			ring->prod++;
+			ring = priv->rx_ring[ring_ind];
+			page = mlx4_alloc_page(priv, ring, &dma,
+					       ring->node, GFP_KERNEL);
+			if (!page)
+				goto fail;
+			ring->pool.array[page_ind].page = page;
+			ring->pool.array[page_ind].dma = dma;
+			ring->pool.pool_size = page_ind + 1;
+			total++;
+			cond_resched();
 		}
 	}
-	return 0;
 
-reduce_rings:
+	order = min_t(u32, ilog2(pages_per_ring >> 1), MAX_ORDER - 1);
+
 	for (ring_ind = 0; ring_ind < priv->rx_ring_num; ring_ind++) {
 		ring = priv->rx_ring[ring_ind];
-		while (ring->actual_size > new_size) {
-			ring->actual_size--;
-			ring->prod--;
-			mlx4_en_free_rx_desc(priv, ring, ring->actual_size);
-		}
-	}
+		ring->rx_alloc_order = ring->rx_pref_alloc_order = order;
 
-	return 0;
-}
+		memcpy(ring->frag_info, priv->frag_info,
+		       sizeof(priv->frag_info));
 
-static void mlx4_en_free_rx_buf(struct mlx4_en_priv *priv,
-				struct mlx4_en_rx_ring *ring)
-{
-	int index;
+		while (ring->actual_size < new_size) {
+			struct mlx4_en_frag_info *frag_info = ring->frag_info;
+			struct mlx4_en_rx_alloc *frags = ring->rx_info +
+					(ring->actual_size << priv->log_rx_info);
+
+			for (i = 0; i < priv->num_frags; i++, frag_info++, frags++) {
+				if (frag_info->frag_stride + frag_info->page_offset > PAGE_SIZE) {
+					struct mlx4_en_page *en_page;
+
+					en_page = &ring->pool.array[ring->pool.pool_idx];
+					frag_info->page_offset = priv->rx_headroom;
+					frag_info->dma = en_page->dma;
+					frag_info->page = en_page->page;
+					++ring->pool.pool_idx;
 
-	en_dbg(DRV, priv, "Freeing Rx buf - cons:%d prod:%d\n",
-	       ring->cons, ring->prod);
+					WARN_ON_ONCE(ring->pool.pool_idx >=
+						     ring->pool.pool_size);
+				}
+				frags->page = frag_info->page;
+				frags->dma = frag_info->dma;
+				frags->page_offset = frag_info->page_offset;
 
-	/* Unmap and free Rx buffers */
-	for (index = 0; index < ring->size; index++) {
-		en_dbg(DRV, priv, "Processing descriptor:%d\n", index);
-		mlx4_en_free_rx_desc(priv, ring, index);
+				frag_info->page_offset += frag_info->frag_stride;
+			}
+			mlx4_en_prepare_rx_desc(priv, ring, ring->actual_size);
+			ring->actual_size++;
+			ring->prod++;
+		}
 	}
-	ring->cons = 0;
-	ring->prod = 0;
+	return 0;
+fail:
+	new_size >>= 1;
+	if (new_size < MLX4_EN_MIN_RX_SIZE) {
+		en_err(priv, "Failed to allocate enough rx pages, got %u of them\n",
+		       total);
+		return -ENOMEM;
+	}
+	goto retry;
 }
 
 void mlx4_en_set_num_rx_rings(struct mlx4_en_dev *mdev)
@@ -277,6 +303,7 @@ int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
 		}
 	}
 
+	ring->node = node;
 	ring->prod = 0;
 	ring->cons = 0;
 	ring->size = size;
@@ -386,23 +413,24 @@ int mlx4_en_activate_rx_rings(struct mlx4_en_priv *priv)
 	return err;
 }
 
-/* We recover from out of memory by scheduling our napi poll
- * function (mlx4_en_process_cq), which tries to allocate
- * all missing RX buffers (call to mlx4_en_refill_rx_buffers).
+/* Under memory pressure, each ring->rx_alloc_order might be lowered
+ * to very small values. Periodically increase it to its initial value for
+ * optimal allocations, in case stress is over.
  */
 void mlx4_en_recover_from_oom(struct mlx4_en_priv *priv)
 {
-	int ring;
+	struct mlx4_en_rx_ring *ring;
+	unsigned int order;
+	int ring_ind;
 
 	if (!priv->port_up)
 		return;
 
-	for (ring = 0; ring < priv->rx_ring_num; ring++) {
-		if (mlx4_en_is_ring_empty(priv->rx_ring[ring])) {
-			local_bh_disable();
-			napi_reschedule(&priv->rx_cq[ring]->napi);
-			local_bh_enable();
-		}
+	for (ring_ind = 0; ring_ind < priv->rx_ring_num; ring_ind++) {
+		ring = priv->rx_ring[ring_ind];
+		order = min_t(unsigned int, ring->rx_alloc_order + 1,
+			      ring->rx_pref_alloc_order);
+		WRITE_ONCE(ring->rx_alloc_order, order);
 	}
 }
 
@@ -413,15 +441,15 @@ void mlx4_en_recover_from_oom(struct mlx4_en_priv *priv)
  * this cache when it is sized to be a multiple of the napi budget.
  */
 bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
-			struct mlx4_en_rx_alloc *frame)
+			struct page *page, dma_addr_t dma)
 {
 	struct mlx4_en_page_cache *cache = &ring->page_cache;
 
 	if (cache->index >= MLX4_EN_CACHE_SIZE)
 		return false;
 
-	cache->buf[cache->index].page = frame->page;
-	cache->buf[cache->index].dma = frame->dma;
+	cache->buf[cache->index].page = page;
+	cache->buf[cache->index].dma = dma;
 	cache->index++;
 	return true;
 }
@@ -454,7 +482,7 @@ void mlx4_en_deactivate_rx_ring(struct mlx4_en_priv *priv,
 	for (i = 0; i < ring->page_cache.index; i++) {
 		dma_unmap_page(priv->ddev, ring->page_cache.buf[i].dma,
 			       PAGE_SIZE, priv->dma_dir);
-		put_page(ring->page_cache.buf[i].page);
+		__free_page(ring->page_cache.buf[i].page);
 	}
 	ring->page_cache.index = 0;
 	mlx4_en_free_rx_buf(priv, ring);
@@ -462,67 +490,95 @@ void mlx4_en_deactivate_rx_ring(struct mlx4_en_priv *priv,
 		ring->buf -= TXBB_SIZE;
 }
 
+static bool mlx4_page_is_reusable(struct page *page)
+{
+	return likely(page_count(page) == 1) &&
+	       likely(!page_is_pfmemalloc(page)) &&
+	       likely(page_to_nid(page) == numa_mem_id());
+}
 
-static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
-				    struct mlx4_en_rx_alloc *frags,
-				    struct sk_buff *skb,
-				    int length)
+static bool mlx4_replenish(struct mlx4_en_priv *priv,
+			   struct mlx4_en_rx_ring *ring,
+			   struct mlx4_en_frag_info *frag_info)
 {
-	const struct mlx4_en_frag_info *frag_info = priv->frag_info;
-	unsigned int truesize = 0;
-	int nr, frag_size;
+	struct mlx4_en_page *en_page = &ring->pool.array[ring->pool.pool_idx];
 	struct page *page;
 	dma_addr_t dma;
-	bool release;
 
-	/* Collect used fragments while replacing them in the HW descriptors */
-	for (nr = 0;; frags++) {
-		frag_size = min_t(int, length, frag_info->frag_size);
+	if (!mlx4_page_is_reusable(en_page->page)) {
+		page = mlx4_alloc_page(priv, ring, &dma, numa_mem_id(),
+				       GFP_ATOMIC | __GFP_MEMALLOC);
+		if (unlikely(!page)) {
+			/* Only drop incoming packet if previous page
+			 * can not be reused at all. NUMA placement is a hint,
+			 * pfmemalloc skbs will eventually be dropped if
+			 * necessary.
+			 */
+			if (page_count(en_page->page) != 1)
+				return false;
+		} else {
+			dma_unmap_page(priv->ddev, en_page->dma, PAGE_SIZE,
+				       priv->dma_dir);
+			__free_page(en_page->page);
+			en_page->page = page;
+			en_page->dma = dma;
+		}
+	}
+	frag_info->page_offset = priv->rx_headroom;
+	frag_info->page = en_page->page;
+	frag_info->dma = en_page->dma;
 
-		page = frags->page;
-		if (unlikely(!page))
-			goto fail;
+	if (unlikely(++ring->pool.pool_idx == ring->pool.pool_size))
+		ring->pool.pool_idx = 0;
 
-		dma = frags->dma;
-		dma_sync_single_range_for_cpu(priv->ddev, dma, frags->page_offset,
-					      frag_size, priv->dma_dir);
+	return true;
+}
 
-		__skb_fill_page_desc(skb, nr, page, frags->page_offset,
-				     frag_size);
+static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
+				    struct mlx4_en_rx_ring *ring,
+				    struct mlx4_en_rx_alloc *frags,
+				    struct sk_buff *skb,
+				    int length)
+{
+	struct mlx4_en_frag_info *frag_info = ring->frag_info;
+	int nr, frag_size;
 
-		truesize += frag_info->frag_stride;
-		if (frag_info->frag_stride == PAGE_SIZE / 2) {
-			frags->page_offset ^= PAGE_SIZE / 2;
-			release = page_count(page) != 1 ||
-				  page_is_pfmemalloc(page) ||
-				  page_to_nid(page) != numa_mem_id();
-		} else {
-			u32 sz_align = ALIGN(frag_size, SMP_CACHE_BYTES);
+	/* Make sure we can replenish RX ring with new page frags,
+	 * otherwise we drop this packet. Very sad but true.
+	 */
+	for (nr = 0; nr < priv->num_frags; nr++, frag_info++) {
+		if (frag_info->frag_stride + frag_info->page_offset <= PAGE_SIZE)
+			continue;
+		if (!mlx4_replenish(priv, ring, frag_info))
+			return -1;
+	}
+	frag_info = ring->frag_info;
 
-			frags->page_offset += sz_align;
-			release = frags->page_offset + frag_info->frag_size > PAGE_SIZE;
-		}
-		if (release) {
-			dma_unmap_page(priv->ddev, dma, PAGE_SIZE, priv->dma_dir);
-			frags->page = NULL;
-		} else {
-			page_ref_inc(page);
-		}
+	for (nr = 0;; frag_info++, frags++) {
+		frag_size = min_t(int, length, frag_info->frag_size);
 
-		nr++;
-		length -= frag_size;
-		if (!length)
+		if (frag_size) {
+			dma_sync_single_range_for_cpu(priv->ddev, frags->dma,
+						      frags->page_offset,
+						      frag_size,
+						      priv->dma_dir);
+
+			skb_fill_page_desc(skb, nr, frags->page,
+					   frags->page_offset,
+					   frag_size);
+			page_ref_inc(frags->page);
+			skb->truesize += frag_info->frag_stride;
+			length -= frag_size;
+		}
+		/* prepare what is needed for the next frame */
+		frags->page = frag_info->page;
+		frags->dma = frag_info->dma;
+		frags->page_offset = frag_info->page_offset;
+		frag_info->page_offset += frag_info->frag_stride;
+		if (++nr == priv->num_frags)
 			break;
-		frag_info++;
 	}
-	skb->truesize += truesize;
-	return nr;
 
-fail:
-	while (nr > 0) {
-		nr--;
-		__skb_frag_unref(skb_shinfo(skb)->frags + nr);
-	}
 	return 0;
 }
 
@@ -539,20 +595,16 @@ static void validate_loopback(struct mlx4_en_priv *priv, void *va)
 	priv->loopback_ok = 1;
 }
 
-static bool mlx4_en_refill_rx_buffers(struct mlx4_en_priv *priv,
-				      struct mlx4_en_rx_ring *ring)
+static bool mlx4_en_remap_rx_buffers(struct mlx4_en_priv *priv,
+				     struct mlx4_en_rx_ring *ring)
 {
 	u32 missing = ring->actual_size - (ring->prod - ring->cons);
 
-	/* Try to batch allocations, but not too much. */
 	if (missing < 8)
 		return false;
 	do {
-		if (mlx4_en_prepare_rx_desc(priv, ring,
-					    ring->prod & ring->size_mask,
-					    GFP_ATOMIC | __GFP_COLD |
-					    __GFP_MEMALLOC))
-			break;
+		mlx4_en_prepare_rx_desc(priv, ring,
+					ring->prod & ring->size_mask);
 		ring->prod++;
 	} while (--missing);
 
@@ -740,7 +792,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		 */
 		if (xdp_prog) {
 			struct xdp_buff xdp;
-			dma_addr_t dma;
+			struct page *npage;
+			dma_addr_t ndma, dma;
 			void *orig_data;
 			u32 act;
 
@@ -767,10 +820,30 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 			case XDP_PASS:
 				break;
 			case XDP_TX:
+				/* Make sure we have one page ready to replace this one */
+				npage = NULL;
+				if (!ring->page_cache.index) {
+					npage = mlx4_alloc_page(priv, ring,
+								&ndma, numa_mem_id(),
+								GFP_ATOMIC | __GFP_MEMALLOC);
+					if (!npage) {
+						ring->xdp_drop++;
+						goto next;
+					}
+				}
 				if (likely(!mlx4_en_xmit_frame(ring, frags, dev,
 							length, cq->ring,
 							&doorbell_pending))) {
-					frags[0].page = NULL;
+					if (ring->page_cache.index) {
+						u32 idx = --ring->page_cache.index;
+
+						frags[0].page = ring->page_cache.buf[idx].page;
+						frags[0].dma = ring->page_cache.buf[idx].dma;
+					} else {
+						frags[0].page = npage;
+						frags[0].dma = ndma;
+					}
+					frags[0].page_offset = XDP_PACKET_HEADROOM;
 					goto next;
 				}
 				trace_xdp_exception(dev, xdp_prog, act);
@@ -853,9 +926,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 			__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021AD),
 					       be16_to_cpu(cqe->sl_vid));
 
-		nr = mlx4_en_complete_rx_desc(priv, frags, skb, length);
-		if (likely(nr)) {
-			skb_shinfo(skb)->nr_frags = nr;
+		nr = mlx4_en_complete_rx_desc(priv, ring, frags, skb, length);
+		if (likely(nr >= 0)) {
 			skb->len = length;
 			skb->data_len = length;
 			napi_gro_frags(&cq->napi);
@@ -883,7 +955,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	}
 	AVG_PERF_COUNTER(priv->pstats.rx_coal_avg, polled);
 
-	if (mlx4_en_refill_rx_buffers(priv, ring))
+	if (mlx4_en_remap_rx_buffers(priv, ring))
 		mlx4_en_update_rx_prod_db(ring);
 
 	return polled;
@@ -956,6 +1028,7 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
 		 * expense of more costly truesize accounting
 		 */
 		priv->frag_info[0].frag_stride = PAGE_SIZE;
+		priv->frag_info[0].page_offset = PAGE_SIZE;
 		priv->dma_dir = PCI_DMA_BIDIRECTIONAL;
 		priv->rx_headroom = XDP_PACKET_HEADROOM;
 		i = 1;
@@ -974,6 +1047,7 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
 				frag_size = min(frag_size, frag_size_max);
 
 			priv->frag_info[i].frag_size = frag_size;
+			priv->frag_info[i].page_offset = PAGE_SIZE;
 			frag_stride = ALIGN(frag_size, SMP_CACHE_BYTES);
 			/* We can only pack 2 1536-bytes frames in on 4K page
 			 * Therefore, each frame would consume more bytes (truesize)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index e0c5ffb3e3a6607456e1f191b0b8c8becfc71219..76da4c33ae431dbf351699ee9010ea0452a41084 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -351,15 +351,12 @@ u32 mlx4_en_recycle_tx_desc(struct mlx4_en_priv *priv,
 			    int napi_mode)
 {
 	struct mlx4_en_tx_info *tx_info = &ring->tx_info[index];
-	struct mlx4_en_rx_alloc frame = {
-		.page = tx_info->page,
-		.dma = tx_info->map0_dma,
-	};
-
-	if (!mlx4_en_rx_recycle(ring->recycle_ring, &frame)) {
-		dma_unmap_page(priv->ddev, tx_info->map0_dma,
-			       PAGE_SIZE, priv->dma_dir);
-		put_page(tx_info->page);
+	struct page *page = tx_info->page;
+	dma_addr_t dma = tx_info->map0_dma;
+
+	if (!mlx4_en_rx_recycle(ring->recycle_ring, page, dma)) {
+		dma_unmap_page(priv->ddev, dma, PAGE_SIZE, priv->dma_dir);
+		__free_page(page);
 	}
 
 	return tx_info->nr_txbb;
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 39f401aa30474e61c0b0029463b23a829ec35fa3..23e66b1eb6ddd4665339b238618dead537d2155e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -99,17 +99,13 @@
 
 #define MLX4_EN_WATCHDOG_TIMEOUT	(15 * HZ)
 
-/* Use the maximum between 16384 and a single page */
-#define MLX4_EN_ALLOC_SIZE	PAGE_ALIGN(16384)
-
 #define MLX4_EN_MAX_RX_FRAGS	4
 
 /* Maximum ring sizes */
 #define MLX4_EN_MAX_TX_SIZE	8192
 #define MLX4_EN_MAX_RX_SIZE	8192
 
-/* Minimum ring size for our page-allocation scheme to work */
-#define MLX4_EN_MIN_RX_SIZE	(MLX4_EN_ALLOC_SIZE / SMP_CACHE_BYTES)
+#define MLX4_EN_MIN_RX_SIZE	256
 #define MLX4_EN_MIN_TX_SIZE	(4096 / TXBB_SIZE)
 
 #define MLX4_EN_SMALL_PKT_SIZE		64
@@ -325,6 +321,30 @@ struct mlx4_en_rx_desc {
 	struct mlx4_wqe_data_seg data[0];
 };
 
+/* Each rx_ring has a pool of pages, with associated dma mapping.
+ * We try to recycle the pages, by keeping a reference on them.
+ */
+struct mlx4_en_page {
+	struct page	*page;
+	dma_addr_t	dma; /* might be kept in page_private() ? */
+};
+
+/* A page pool contains a fixed number of pages, and a current index.
+ */
+struct mlx4_en_page_pool {
+	unsigned int		pool_size;
+	unsigned int		pool_idx;
+	struct mlx4_en_page	*array;
+};
+
+struct mlx4_en_frag_info {
+	u16		frag_size;
+	u32		frag_stride;
+	struct page	*page;
+	dma_addr_t	dma;
+	u32		page_offset;
+};
+
 struct mlx4_en_rx_ring {
 	struct mlx4_hwq_resources wqres;
 	u32 size ;	/* number of Rx descs*/
@@ -337,22 +357,31 @@ struct mlx4_en_rx_ring {
 	u32 cons;
 	u32 buf_size;
 	u8  fcs_del;
+	u8  hwtstamp_rx_filter;
+	u16 node;
 	void *buf;
 	void *rx_info;
-	struct bpf_prog __rcu *xdp_prog;
-	struct mlx4_en_page_cache page_cache;
+	struct bpf_prog __rcu		*xdp_prog;
+	struct mlx4_en_page_pool	pool;
+	unsigned long			rx_alloc_pages;
+
+	struct page			*pre_allocated;
+	unsigned int			pre_allocated_count;
+	unsigned int			rx_alloc_order;
+	struct mlx4_en_frag_info	frag_info[MLX4_EN_MAX_RX_FRAGS];
+	struct mlx4_en_page_cache	page_cache;
+
 	unsigned long bytes;
 	unsigned long packets;
 	unsigned long csum_ok;
 	unsigned long csum_none;
 	unsigned long csum_complete;
-	unsigned long rx_alloc_pages;
 	unsigned long xdp_drop;
 	unsigned long xdp_tx;
 	unsigned long xdp_tx_full;
 	unsigned long dropped;
-	int hwtstamp_rx_filter;
 	cpumask_var_t affinity_mask;
+	unsigned int			rx_pref_alloc_order;
 };
 
 struct mlx4_en_cq {
@@ -462,11 +491,6 @@ struct mlx4_en_mc_list {
 	u64			tunnel_reg_id;
 };
 
-struct mlx4_en_frag_info {
-	u16 frag_size;
-	u32 frag_stride;
-};
-
 #ifdef CONFIG_MLX4_EN_DCB
 /* Minimal TC BW - setting to 0 will block traffic */
 #define MLX4_EN_BW_MIN 1
@@ -693,7 +717,7 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
 			       int tx_ind, int *doorbell_pending);
 void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring);
 bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
-			struct mlx4_en_rx_alloc *frame);
+			struct page *page, dma_addr_t dma);
 
 int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
 			   struct mlx4_en_tx_ring **pring,
-- 
2.12.0.246.ga2ecc84866-goog


* Re: [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-14 15:11 [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path Eric Dumazet
@ 2017-03-15  4:06 ` Alexei Starovoitov
  2017-03-15 13:21   ` Eric Dumazet
  2017-03-15 15:36 ` Tariq Toukan
  1 sibling, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2017-03-15  4:06 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Tariq Toukan, Saeed Mahameed,
	Willem de Bruijn, Alexei Starovoitov, Eric Dumazet,
	Alexander Duyck

On Tue, Mar 14, 2017 at 08:11:43AM -0700, Eric Dumazet wrote:
> +static struct page *mlx4_alloc_page(struct mlx4_en_priv *priv,
> +				    struct mlx4_en_rx_ring *ring,
> +				    dma_addr_t *dma,
> +				    unsigned int node, gfp_t gfp)
>  {
> +	if (unlikely(!ring->pre_allocated_count)) {
> +		unsigned int order = READ_ONCE(ring->rx_alloc_order);
> +
> +		page = __alloc_pages_node(node, (gfp & ~__GFP_DIRECT_RECLAIM) |
> +						__GFP_NOMEMALLOC |
> +						__GFP_NOWARN |
> +						__GFP_NORETRY,
> +					  order);
> +		if (page) {
> +			split_page(page, order);
> +			ring->pre_allocated_count = 1U << order;
> +		} else {
> +			if (order > 1)
> +				ring->rx_alloc_order--;
> +			page = __alloc_pages_node(node, gfp, 0);
> +			if (unlikely(!page))
> +				return NULL;
> +			ring->pre_allocated_count = 1U;
>  		}
> +		ring->pre_allocated = page;
> +		ring->rx_alloc_pages += ring->pre_allocated_count;
>  	}
> +	page = ring->pre_allocated++;
> +	ring->pre_allocated_count--;

do you think this style of allocation can be moved into net common?
If it's a good thing then other drivers should be using it too, right?

> +	ring->cons = 0;
> +	ring->prod = 0;
> +
> +	/* Page recycling works best if we have enough pages in the pool.
> +	 * Apply a factor of two on the minimal allocations required to
> +	 * populate RX rings.
> +	 */

i'm not sure how the above comment matches the code below.
If my math is correct, a ring of 1k elements will ask for 1024
contiguous pages.

> +retry:
> +	total = 0;
> +	pages_per_ring = new_size * stride_bytes * 2 / PAGE_SIZE;
> +	pages_per_ring = roundup_pow_of_two(pages_per_ring);
> +
> +	order = min_t(u32, ilog2(pages_per_ring), MAX_ORDER - 1);

if you're sure it doesn't hurt the rest of the system,
why use MAX_ORDER - 1? why not MAX_ORDER?

>  
> -/* We recover from out of memory by scheduling our napi poll
> - * function (mlx4_en_process_cq), which tries to allocate
> - * all missing RX buffers (call to mlx4_en_refill_rx_buffers).
> +/* Under memory pressure, each ring->rx_alloc_order might be lowered
> + * to very small values. Periodically increase t to initial value for
> + * optimal allocations, in case stress is over.
>   */
> +	for (ring_ind = 0; ring_ind < priv->rx_ring_num; ring_ind++) {
> +		ring = priv->rx_ring[ring_ind];
> +		order = min_t(unsigned int, ring->rx_alloc_order + 1,
> +			      ring->rx_pref_alloc_order);
> +		WRITE_ONCE(ring->rx_alloc_order, order);

when recycling is effective, in a matter of a few seconds it will
increase the order back to 10, and the first time the driver needs
to allocate it will start that tedious failure loop all over again.
How about removing this periodic mlx4_en_recover_from_oom() completely
and switching to increasing the order inside mlx4_alloc_page() instead?
Like N successful __alloc_pages_node() calls with order X would bump it
to order X+1. If it fails, next time it will do only one failed attempt.
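
Something like this (just a sketch, the streak counter is made up):

	/* in mlx4_alloc_page(), after a successful high order allocation */
	if (++ring->rx_alloc_streak >= N &&
	    ring->rx_alloc_order < ring->rx_pref_alloc_order) {
		ring->rx_alloc_order++;
		ring->rx_alloc_streak = 0;
	}

	/* and on failure */
	ring->rx_alloc_streak = 0;
	if (ring->rx_alloc_order > 1)
		ring->rx_alloc_order--;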

> +static bool mlx4_replenish(struct mlx4_en_priv *priv,
> +			   struct mlx4_en_rx_ring *ring,
> +			   struct mlx4_en_frag_info *frag_info)
>  {
> +	struct mlx4_en_page *en_page = &ring->pool.array[ring->pool.pool_idx];
> +	if (!mlx4_page_is_reusable(en_page->page)) {
> +		page = mlx4_alloc_page(priv, ring, &dma, numa_mem_id(),
> +				       GFP_ATOMIC | __GFP_MEMALLOC);

I don't understand why page_is_reusable is doing the !page_is_pfmemalloc(page)
check, but here you're asking for MEMALLOC pages too, so
under memory pressure the hw will dma the packet into this page,
but then the stack will still drop it, so under pressure
we'll keep allocating/freeing pages from the reserve. Isn't it better
to let the hw drop (since we cannot alloc and the rx ring is empty) ?
What am I missing?

> @@ -767,10 +820,30 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>  			case XDP_PASS:
>  				break;
>  			case XDP_TX:
> +				/* Make sure we have one page ready to replace this one */
> +				npage = NULL;
> +				if (!ring->page_cache.index) {
> +					npage = mlx4_alloc_page(priv, ring,
> +								&ndma, numa_mem_id(),
> +								GFP_ATOMIC | __GFP_MEMALLOC);
> +					if (!npage) {
> +						ring->xdp_drop++;
> +						goto next;
> +					}
> +				}

I think the approach of dropping packets after the xdp prog has executed
is incorrect. In such a case the xdp program has already parsed the packet,
performed all necessary changes and is now ready to xmit it out. Dropping
the packet so late, just because the page_cache is empty, doesn't make any
sense. Therefore please change this part back to how it was before:
allocate the buffer at the end of the loop.
I think you're trying too hard to avoid the _oom() callback doing allocs.
Say all rx packets went via xdp_tx and we don't have any pages to put into
the rx ring at the end of this rx loop. That's still not a problem. The hw
will drop a few packets, and when tx completion fires on the same cpu for
the xdp rings we can populate the corresponding rx ring without locks.
So XDP will be able to operate without allocations even when there is
no free memory in the rest of the system.

Also I was thinking of limiting xdp rings to order-0 allocs only, but doing
large orders can indeed decrease tlb misses. That has to be tested more
thoroughly, and if large orders turn out to hurt we can add a small patch
on top of this one to do order-0 for xdp rings.

Overall I think there are several good ideas in this patch,
but the drop after xdp_tx is a showstopper and has to be addressed
before this patch can be merged.


* Re: [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-15  4:06 ` Alexei Starovoitov
@ 2017-03-15 13:21   ` Eric Dumazet
  2017-03-15 23:06     ` Alexei Starovoitov
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2017-03-15 13:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, David S . Miller, netdev, Tariq Toukan,
	Saeed Mahameed, Willem de Bruijn, Alexei Starovoitov,
	Alexander Duyck

On Tue, 2017-03-14 at 21:06 -0700, Alexei Starovoitov wrote:
> On Tue, Mar 14, 2017 at 08:11:43AM -0700, Eric Dumazet wrote:
> > +static struct page *mlx4_alloc_page(struct mlx4_en_priv *priv,
> > +				    struct mlx4_en_rx_ring *ring,
> > +				    dma_addr_t *dma,
> > +				    unsigned int node, gfp_t gfp)
> >  {
> > +	if (unlikely(!ring->pre_allocated_count)) {
> > +		unsigned int order = READ_ONCE(ring->rx_alloc_order);
> > +
> > +		page = __alloc_pages_node(node, (gfp & ~__GFP_DIRECT_RECLAIM) |
> > +						__GFP_NOMEMALLOC |
> > +						__GFP_NOWARN |
> > +						__GFP_NORETRY,
> > +					  order);
> > +		if (page) {
> > +			split_page(page, order);
> > +			ring->pre_allocated_count = 1U << order;
> > +		} else {
> > +			if (order > 1)
> > +				ring->rx_alloc_order--;
> > +			page = __alloc_pages_node(node, gfp, 0);
> > +			if (unlikely(!page))
> > +				return NULL;
> > +			ring->pre_allocated_count = 1U;
> >  		}
> > +		ring->pre_allocated = page;
> > +		ring->rx_alloc_pages += ring->pre_allocated_count;
> >  	}
> > +	page = ring->pre_allocated++;
> > +	ring->pre_allocated_count--;
> 
> do you think this style of allocation can be moved into net common?
> If it's a good thing then other drivers should be using it too, right?

Yes, we might do this once this proves to work well.

> 
> > +	ring->cons = 0;
> > +	ring->prod = 0;
> > +
> > +	/* Page recycling works best if we have enough pages in the pool.
> > +	 * Apply a factor of two on the minimal allocations required to
> > +	 * populate RX rings.
> > +	 */
> 
> i'm not sure how above comments matches the code below.
> If my math is correct a ring of 1k elements will ask for 1024
> contiguous pages.

On x86 it might be the case, unless you use MTU=900 ?

On PowerPC, PAGE_SIZE=65536

65536/1536 = 42  

So for 1024 elements, we need 1024/42 = ~24 pages.

Thus allocating 48 pages is the goal.
Rounded up to the next power of two (although nothing in my code really
needs this additional constraint; a page pool does not have to have a
power-of-two number of entries).

Later, we might choose a different factor, maybe an elastic one.
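
To make the numbers concrete (rough figures, assuming a 1024-slot ring and
a ~1536 byte stride):

	pages_per_ring = roundup_pow_of_two(size * stride_bytes * 2 / PAGE_SIZE);

	/* x86, PAGE_SIZE = 4096 :
	 *   1024 * 1536 * 2 / 4096  = 768 -> rounded up to 1024 pages
	 *   first attempted order   = min(ilog2(1024), MAX_ORDER - 1) = 10
	 * PowerPC, PAGE_SIZE = 65536 :
	 *   1024 * 1536 * 2 / 65536 = 48  -> rounded up to 64 pages
	 *   first attempted order   = min(ilog2(64), MAX_ORDER - 1)   = 6
	 */

The big initial order on x86 only sets the starting point of the probe; if
it fails, the order quickly decrements toward something available.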

> 
> > +retry:
> > +	total = 0;
> > +	pages_per_ring = new_size * stride_bytes * 2 / PAGE_SIZE;
> > +	pages_per_ring = roundup_pow_of_two(pages_per_ring);
> > +
> > +	order = min_t(u32, ilog2(pages_per_ring), MAX_ORDER - 1);
> 
> if you're sure it doesn't hurt the rest of the system,
> why use MAX_ORDER - 1? why not MAX_ORDER?

alloc_page(GFP_...,   MAX_ORDER)  never succeeds ;)

Because of the __GFP_NOWARN you would not see the error I guess, but we
would get one pesky order-0 page in the ring buffer ;)

> 
> >  
> > -/* We recover from out of memory by scheduling our napi poll
> > - * function (mlx4_en_process_cq), which tries to allocate
> > - * all missing RX buffers (call to mlx4_en_refill_rx_buffers).
> > +/* Under memory pressure, each ring->rx_alloc_order might be lowered
> > + * to very small values. Periodically increase t to initial value for
> > + * optimal allocations, in case stress is over.
> >   */
> > +	for (ring_ind = 0; ring_ind < priv->rx_ring_num; ring_ind++) {
> > +		ring = priv->rx_ring[ring_ind];
> > +		order = min_t(unsigned int, ring->rx_alloc_order + 1,
> > +			      ring->rx_pref_alloc_order);
> > +		WRITE_ONCE(ring->rx_alloc_order, order);
> 
> when recycling is effective in a matter of few seconds it will
> increase ther order back to 10 and the first time the driver needs
> to allocate, it will start that tedious failure loop all over again.
> How about removing this periodic mlx4_en_recover_from_oom() completely
> and switch to increase the order inside mlx4_alloc_page().
> Like N successful __alloc_pages_node() with order X will bump it
> into order X+1. If it fails next time it will do only one failed attempt.

I wanted to do the increase out of line. (not in the data path)

We probably could increase it only if ring->rx_alloc_pages has increased
significantly since the last mlx4_en_recover_from_oom() call.

(That would require a new ring->prior_rx_alloc_pages out of hot cache
lines)

Or maybe attempt the allocation from process context (from
mlx4_en_recover_from_oom())

I have not really thought very hard about this, since the data path, if it
really needs new pages, will very quickly probe rx_alloc_order back down to
orders with available pages in the zone.
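
Something like this maybe (ring->prior_rx_alloc_pages being the new field,
and the threshold value completely made up / untested):

	/* in mlx4_en_recover_from_oom() */
	delta = ring->rx_alloc_pages - ring->prior_rx_alloc_pages;
	ring->prior_rx_alloc_pages = ring->rx_alloc_pages;
	if (delta > 64) {
		order = min_t(unsigned int, ring->rx_alloc_order + 1,
			      ring->rx_pref_alloc_order);
		WRITE_ONCE(ring->rx_alloc_order, order);
	}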

> 
> > +static bool mlx4_replenish(struct mlx4_en_priv *priv,
> > +			   struct mlx4_en_rx_ring *ring,
> > +			   struct mlx4_en_frag_info *frag_info)
> >  {
> > +	struct mlx4_en_page *en_page = &ring->pool.array[ring->pool.pool_idx];
> > +	if (!mlx4_page_is_reusable(en_page->page)) {
> > +		page = mlx4_alloc_page(priv, ring, &dma, numa_mem_id(),
> > +				       GFP_ATOMIC | __GFP_MEMALLOC);
> 
> I don't understand why page_is_reusable is doing !page_is_pfmemalloc(page))
> check, but here you're asking for MEMALLOC pages too, so
> under memory pressure the hw will dma the packet into this page,
> but then the stack will still drop it, so under pressure
> we'll keep alloc/free the pages from reserve. Isn't it better
> to let the hw drop (since we cannot alloc and rx ring is empty) ?
> What am I missing?

We do not want to drop packets that might help SWAP over NFS free memory
and help us recover from pressure. This is the whole point of the
__GFP_MEMALLOC protocol: the hw won't know that a particular TCP
packet must not be dropped.

> 
> > @@ -767,10 +820,30 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
> >  			case XDP_PASS:
> >  				break;
> >  			case XDP_TX:
> > +				/* Make sure we have one page ready to replace this one */
> > +				npage = NULL;
> > +				if (!ring->page_cache.index) {
> > +					npage = mlx4_alloc_page(priv, ring,
> > +								&ndma, numa_mem_id(),
> > +								GFP_ATOMIC | __GFP_MEMALLOC);
> > +					if (!npage) {
> > +						ring->xdp_drop++;
> > +						goto next;
> > +					}
> > +				}
> 
> I think the approach to drop packets after xdp prog executed is incorrect.
> In such case xdp program already parsed the packet, performed all necessary
> changes and now ready to xmit it out. Dropping packet so late, just
> because page_cache is empty, doesn't make any sense. Therefore please change
> this part back to how it was before: allocate buffer at the end of the loop.
> I think you're trying too hard to avoid _oom() callback doing allocs. Say
> all rx packets went via xdp_tx and we don't have any pages to put into rx ring
> at the end of this rx loop. That's still not a problem. The hw will drop
> few packets and when tx completion fires on the same cpu for xdp rings we
> can without locks populate corresponding rx ring.
> So XDP will be able to operate without allocations even when there is
> no free memory in the rest of the system.
> 
> Also I was thinking to limit xdp rings to order 0 allocs only, but doing
> large order can indeed decrease tlb misses. That has to be tested more
> thoroughly and if large order will turn out to hurt we can add small patch
> on top of this one to do order0 for xdp rings.


> 
> Overall I think there are several good ideas in this patch,
> but drop after xdp_tx is the showstopper and has to be addressed
> before this patch can be merged.
> 

The xmit itself can drop the packet if the TX ring is full.

I can do the change, but you are focusing on an allocation that will
never happen except for the first packets going through XDP_TX.

I am fine adding overhead for the non-XDP_TX case, since you really want it.


* Re: [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-14 15:11 [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path Eric Dumazet
  2017-03-15  4:06 ` Alexei Starovoitov
@ 2017-03-15 15:36 ` Tariq Toukan
  2017-03-15 16:27   ` Eric Dumazet
  2017-03-20 12:59   ` Tariq Toukan
  1 sibling, 2 replies; 16+ messages in thread
From: Tariq Toukan @ 2017-03-15 15:36 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller
  Cc: netdev, Tariq Toukan, Saeed Mahameed, Willem de Bruijn,
	Alexei Starovoitov, Eric Dumazet, Alexander Duyck



On 14/03/2017 5:11 PM, Eric Dumazet wrote:
> When adding order-0 pages allocations and page recycling in receive path,
> I added issues on PowerPC, or more generally on arches with large pages.
>
> A GRO packet, aggregating 45 segments, ended up using 45 page frags
> on 45 different pages. Before my changes we were very likely packing
> up to 42 Ethernet frames per 64KB page.
>
> 1) At skb freeing time, all put_page() on the skb frags now touch 45
>    different 'struct page' and this adds more cache line misses.
>    Too bad that standard Ethernet MTU is so small :/
>
> 2) Using one order-0 page per ring slot consumes ~42 times more memory
>    on PowerPC.
>
> 3) Allocating order-0 pages is very likely to use pages from very
>    different locations, increasing TLB pressure on hosts with more
>    than 256 GB of memory after days of uptime.
>
> This patch uses a refined strategy, addressing these points.
>
> We still use order-0 pages, but the page recyling technique is modified
> so that we have better chances to lower number of pages containing the
> frags for a given GRO skb (factor of 2 on x86, and 21 on PowerPC)
>
> Page allocations are split in two halves :
> - One currently visible by the NIC for DMA operations.
> - The other contains pages that already added to old skbs, put in
>   a quarantine.
>
> When we receive a frame, we look at the oldest entry in the pool and
> check if the page count is back to one, meaning old skbs/frags were
> consumed and the page can be recycled.
>
> Page allocations are attempted using high order ones, trying
> to lower TLB pressure. We remember in ring->rx_alloc_order the last attempted
> order and quickly decrement it in case of failures.
> Then mlx4_en_recover_from_oom() called every 250 msec will attempt
> to gradually restore rx_alloc_order to its optimal value.
>
> On x86, memory allocations stay the same. (One page per RX slot for MTU=1500)
> But on PowerPC, this patch considerably reduces the allocated memory.
>
> Performance gain on PowerPC is about 50% for a single TCP flow.
>
> On x86, I could not measure the difference, my test machine being
> limited by the sender (33 Gbit per TCP flow).
> 22 less cache line misses per 64 KB GRO packet is probably in the order
> of 2 % or so.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Tariq Toukan <tariqt@mellanox.com>
> Cc: Saeed Mahameed <saeedm@mellanox.com>
> Cc: Alexander Duyck <alexander.duyck@gmail.com>
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c   | 470 ++++++++++++++++-----------
>  drivers/net/ethernet/mellanox/mlx4/en_tx.c   |  15 +-
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  54 ++-
>  3 files changed, 317 insertions(+), 222 deletions(-)
>

Hi Eric,

Thanks for your patch.

I will do the XDP tests and complete the review by tomorrow.

Regards,
Tariq


* Re: [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-15 15:36 ` Tariq Toukan
@ 2017-03-15 16:27   ` Eric Dumazet
  2017-03-20 12:59   ` Tariq Toukan
  1 sibling, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2017-03-15 16:27 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Eric Dumazet, David S . Miller, netdev, Tariq Toukan,
	Saeed Mahameed, Willem de Bruijn, Alexei Starovoitov,
	Alexander Duyck

On Wed, 2017-03-15 at 17:36 +0200, Tariq Toukan wrote:

> 
> Hi Eric,
> 
> Thanks for your patch.
> 
> I will do the XDP tests and complete the review, by tomorrow.
> 

Thanks a lot Tariq !


* Re: [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-15 13:21   ` Eric Dumazet
@ 2017-03-15 23:06     ` Alexei Starovoitov
  2017-03-15 23:34       ` Eric Dumazet
  2017-03-16  1:07       ` Eric Dumazet
  0 siblings, 2 replies; 16+ messages in thread
From: Alexei Starovoitov @ 2017-03-15 23:06 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, netdev, Tariq Toukan,
	Saeed Mahameed, Willem de Bruijn, Alexei Starovoitov,
	Alexander Duyck

On Wed, Mar 15, 2017 at 06:21:29AM -0700, Eric Dumazet wrote:
> On Tue, 2017-03-14 at 21:06 -0700, Alexei Starovoitov wrote:
> > On Tue, Mar 14, 2017 at 08:11:43AM -0700, Eric Dumazet wrote:
> > > +static struct page *mlx4_alloc_page(struct mlx4_en_priv *priv,
> > > +				    struct mlx4_en_rx_ring *ring,
> > > +				    dma_addr_t *dma,
> > > +				    unsigned int node, gfp_t gfp)
> > >  {
> > > +	if (unlikely(!ring->pre_allocated_count)) {
> > > +		unsigned int order = READ_ONCE(ring->rx_alloc_order);
> > > +
> > > +		page = __alloc_pages_node(node, (gfp & ~__GFP_DIRECT_RECLAIM) |
> > > +						__GFP_NOMEMALLOC |
> > > +						__GFP_NOWARN |
> > > +						__GFP_NORETRY,
> > > +					  order);
> > > +		if (page) {
> > > +			split_page(page, order);
> > > +			ring->pre_allocated_count = 1U << order;
> > > +		} else {
> > > +			if (order > 1)
> > > +				ring->rx_alloc_order--;
> > > +			page = __alloc_pages_node(node, gfp, 0);
> > > +			if (unlikely(!page))
> > > +				return NULL;
> > > +			ring->pre_allocated_count = 1U;
> > >  		}
> > > +		ring->pre_allocated = page;
> > > +		ring->rx_alloc_pages += ring->pre_allocated_count;
> > >  	}
> > > +	page = ring->pre_allocated++;
> > > +	ring->pre_allocated_count--;
> > 
> > do you think this style of allocation can be moved into net common?
> > If it's a good thing then other drivers should be using it too, right?
> 
> Yes, we might do this once this proves to work well.

In theory it looks quite promising :)

> 
> > 
> > > +	ring->cons = 0;
> > > +	ring->prod = 0;
> > > +
> > > +	/* Page recycling works best if we have enough pages in the pool.
> > > +	 * Apply a factor of two on the minimal allocations required to
> > > +	 * populate RX rings.
> > > +	 */
> > 
> > i'm not sure how above comments matches the code below.
> > If my math is correct a ring of 1k elements will ask for 1024
> > contiguous pages.
> 
> On x86 it might be the case, unless you use MTU=900 ?
> 
> On PowerPC, PAGE_SIZE=65536
> 
> 65536/1536 = 42  
> 
> So for 1024 elements, we need 1024/42 = ~24 pages.
> 
> Thus allocating 48 pages is the goal.
> Rounded to next power of two (although nothing in my code really needs
> this additional constraint, a page pool does not have to have a power of
> two entries)

on powerpc asking for roundup(48) = 64 contiguous pages sounds reasonable.
on x86 it's roundup(1024 * 1500 * 2 / 4096) = 1024 contiguous pages.
I thought it had zero chance of succeeding unless it's a fresh boot ?
Should it be something like this instead ?
order = min_t(u32, ilog2(pages_per_ring), 5); 

> Later, we might chose a different factor, maybe an elastic one.
> 
> > 
> > > +retry:
> > > +	total = 0;
> > > +	pages_per_ring = new_size * stride_bytes * 2 / PAGE_SIZE;
> > > +	pages_per_ring = roundup_pow_of_two(pages_per_ring);
> > > +
> > > +	order = min_t(u32, ilog2(pages_per_ring), MAX_ORDER - 1);
> > 
> > if you're sure it doesn't hurt the rest of the system,
> > why use MAX_ORDER - 1? why not MAX_ORDER?
> 
> alloc_page(GFP_...,   MAX_ORDER)  never succeeds ;)

but MAX_ORDER - 1 also almost never succeeds ?

> Because of the __GFP_NOWARN you would not see the error I guess, but we
> would get one pesky order-0 page in the ring buffer ;)
> 
> > 
> > >  
> > > -/* We recover from out of memory by scheduling our napi poll
> > > - * function (mlx4_en_process_cq), which tries to allocate
> > > - * all missing RX buffers (call to mlx4_en_refill_rx_buffers).
> > > +/* Under memory pressure, each ring->rx_alloc_order might be lowered
> > > + * to very small values. Periodically increase t to initial value for
> > > + * optimal allocations, in case stress is over.
> > >   */
> > > +	for (ring_ind = 0; ring_ind < priv->rx_ring_num; ring_ind++) {
> > > +		ring = priv->rx_ring[ring_ind];
> > > +		order = min_t(unsigned int, ring->rx_alloc_order + 1,
> > > +			      ring->rx_pref_alloc_order);
> > > +		WRITE_ONCE(ring->rx_alloc_order, order);
> > 
> > when recycling is effective in a matter of few seconds it will
> > increase ther order back to 10 and the first time the driver needs
> > to allocate, it will start that tedious failure loop all over again.
> > How about removing this periodic mlx4_en_recover_from_oom() completely
> > and switch to increase the order inside mlx4_alloc_page().
> > Like N successful __alloc_pages_node() with order X will bump it
> > into order X+1. If it fails next time it will do only one failed attempt.
> 
> I wanted to do the increase out of line. (not in the data path)
> 
> We probably could increase only if ring->rx_alloc_pages got a
> significant increase since the last mlx4_en_recover_from_oom() call.
> 
> (That would require a new ring->prior_rx_alloc_pages out of hot cache
> lines)

right. rx_alloc_pages can also be reduced to 16 bits, and this
new one, prior_rx_alloc_pages, to 16 bits too, no?

> Or maybe attempt the allocation from process context (from
> mlx4_en_recover_from_oom())

yeah. that won't be great. since this patch is removing that issue,
it's best not to introduce it again.

> I have not really thought very hard about this, since data path, if
> really needing new pages, will very fast probe rx_alloc_order back to
> available pages in the zone.

I'm nit picking on it mainly because, if proven successful,
this strategy will be adopted by other drivers and likely moved
to net/common, so imo it's important to talk about these details.

> > 
> > > +static bool mlx4_replenish(struct mlx4_en_priv *priv,
> > > +			   struct mlx4_en_rx_ring *ring,
> > > +			   struct mlx4_en_frag_info *frag_info)
> > >  {
> > > +	struct mlx4_en_page *en_page = &ring->pool.array[ring->pool.pool_idx];
> > > +	if (!mlx4_page_is_reusable(en_page->page)) {
> > > +		page = mlx4_alloc_page(priv, ring, &dma, numa_mem_id(),
> > > +				       GFP_ATOMIC | __GFP_MEMALLOC);
> > 
> > I don't understand why page_is_reusable is doing !page_is_pfmemalloc(page))
> > check, but here you're asking for MEMALLOC pages too, so
> > under memory pressure the hw will dma the packet into this page,
> > but then the stack will still drop it, so under pressure
> > we'll keep alloc/free the pages from reserve. Isn't it better
> > to let the hw drop (since we cannot alloc and rx ring is empty) ?
> > What am I missing?
> 
> We do not want to drop packets that might help SWAP over NFS from
> freeing memory and help us recovering from pressure. This is the whole
> point of __GFP_MEMALLOC protocol : hw wont know that a particular TCP
> packet must not be dropped.

interesting. fancy. so things like nbd will also be using them, right?

> > 
> > > @@ -767,10 +820,30 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
> > >  			case XDP_PASS:
> > >  				break;
> > >  			case XDP_TX:
> > > +				/* Make sure we have one page ready to replace this one */
> > > +				npage = NULL;
> > > +				if (!ring->page_cache.index) {
> > > +					npage = mlx4_alloc_page(priv, ring,
> > > +								&ndma, numa_mem_id(),
> > > +								GFP_ATOMIC | __GFP_MEMALLOC);
> > > +					if (!npage) {
> > > +						ring->xdp_drop++;
> > > +						goto next;
> > > +					}
> > > +				}
> > 
> > I think the approach to drop packets after xdp prog executed is incorrect.
> > In such case xdp program already parsed the packet, performed all necessary
> > changes and now ready to xmit it out. Dropping packet so late, just
> > because page_cache is empty, doesn't make any sense. Therefore please change
> > this part back to how it was before: allocate buffer at the end of the loop.
> > I think you're trying too hard to avoid _oom() callback doing allocs. Say
> > all rx packets went via xdp_tx and we don't have any pages to put into rx ring
> > at the end of this rx loop. That's still not a problem. The hw will drop
> > few packets and when tx completion fires on the same cpu for xdp rings we
> > can without locks populate corresponding rx ring.
> > So XDP will be able to operate without allocations even when there is
> > no free memory in the rest of the system.
> > 
> > Also I was thinking to limit xdp rings to order 0 allocs only, but doing
> > large order can indeed decrease tlb misses. That has to be tested more
> > thoroughly and if large order will turn out to hurt we can add small patch
> > on top of this one to do order0 for xdp rings.
> 
> 
> > 
> > Overall I think there are several good ideas in this patch,
> > but drop after xdp_tx is the showstopper and has to be addressed
> > before this patch can be merged.
> > 
> 
> The xmit itself can drop the packet if TX ring is full.

yes. and we have an 'xdp_tx_full' counter for it that we monitor.
When the tx ring and mtu are sized properly, we don't expect to see it
incrementing at all. This is something in our control. 'Our' means the
humans that set up the environment.
The 'cache empty' condition is something ephemeral. Packets will be dropped
and we won't have any tools to address it. These packets are requests from
real people. Any drop needs to be categorized.
Like there is the 'rx_fifo_errors' counter that mlx4 reports when the hw
is dropping packets before they reach the driver. We see it incrementing
depending on the traffic pattern, though the overall Mpps through the nic
is not too high, and this is something we are actively investigating too.

> I can do the change, but you are focusing on this allocation that will
> never happen but for the first packets going through XDP_TX

if the 'page cache empty' condition happens once every 100 million packets
and we drop one due to a failed alloc, it's a problem that I'd like to avoid.
Today xdp is mostly about ddos and lb. I'd like to get to the point
where all drop decisions are done by the program only.
xdp program should decide which traffic is bad and should pass
good traffic through. If there are drops due to surrounding
infrastructure (like driver logic or nic hw fifo full or cpu too slow)
we'll be fixing them.
And yes it means there will be more changes to the drivers due to xdp.

> I am fine adding overhead for the non XDP_TX, since you really want it.

Thank you. I don't understand the 'overhead' part.
I'll test the next version of this patch.


* Re: [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-15 23:06     ` Alexei Starovoitov
@ 2017-03-15 23:34       ` Eric Dumazet
  2017-03-16  0:44         ` Alexei Starovoitov
  2017-03-16  1:07       ` Eric Dumazet
  1 sibling, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2017-03-15 23:34 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, David S . Miller, netdev, Tariq Toukan,
	Saeed Mahameed, Willem de Bruijn, Alexei Starovoitov,
	Alexander Duyck

On Wed, 2017-03-15 at 16:06 -0700, Alexei Starovoitov wrote:
> On Wed, Mar 15, 2017 at 06:21:29AM -0700, Eric Dumazet wrote:
> > On Tue, 2017-03-14 at 21:06 -0700, Alexei Starovoitov wrote:
> > > On Tue, Mar 14, 2017 at 08:11:43AM -0700, Eric Dumazet wrote:
> > > > +static struct page *mlx4_alloc_page(struct mlx4_en_priv *priv,
> > > > +				    struct mlx4_en_rx_ring *ring,
> > > > +				    dma_addr_t *dma,
> > > > +				    unsigned int node, gfp_t gfp)
> > > >  {
> > > > +	if (unlikely(!ring->pre_allocated_count)) {
> > > > +		unsigned int order = READ_ONCE(ring->rx_alloc_order);
> > > > +
> > > > +		page = __alloc_pages_node(node, (gfp & ~__GFP_DIRECT_RECLAIM) |
> > > > +						__GFP_NOMEMALLOC |
> > > > +						__GFP_NOWARN |
> > > > +						__GFP_NORETRY,
> > > > +					  order);
> > > > +		if (page) {
> > > > +			split_page(page, order);
> > > > +			ring->pre_allocated_count = 1U << order;
> > > > +		} else {
> > > > +			if (order > 1)
> > > > +				ring->rx_alloc_order--;
> > > > +			page = __alloc_pages_node(node, gfp, 0);
> > > > +			if (unlikely(!page))
> > > > +				return NULL;
> > > > +			ring->pre_allocated_count = 1U;
> > > >  		}
> > > > +		ring->pre_allocated = page;
> > > > +		ring->rx_alloc_pages += ring->pre_allocated_count;
> > > >  	}
> > > > +	page = ring->pre_allocated++;
> > > > +	ring->pre_allocated_count--;
> > > 
> > > do you think this style of allocation can be moved into net common?
> > > If it's a good thing then other drivers should be using it too, right?
> > 
> > Yes, we might do this once this proves to work well.
> 
> In theory it looks quite promising :)
> 
> > 
> > > 
> > > > +	ring->cons = 0;
> > > > +	ring->prod = 0;
> > > > +
> > > > +	/* Page recycling works best if we have enough pages in the pool.
> > > > +	 * Apply a factor of two on the minimal allocations required to
> > > > +	 * populate RX rings.
> > > > +	 */
> > > 
> > > i'm not sure how above comments matches the code below.
> > > If my math is correct a ring of 1k elements will ask for 1024
> > > contiguous pages.
> > 
> > On x86 it might be the case, unless you use MTU=900 ?
> > 
> > On PowerPC, PAGE_SIZE=65536
> > 
> > 65536/1536 = 42  
> > 
> > So for 1024 elements, we need 1024/42 = ~24 pages.
> > 
> > Thus allocating 48 pages is the goal.
> > Rounded to next power of two (although nothing in my code really needs
> > this additional constraint, a page pool does not have to have a power of
> > two entries)
> 
> on powerpc asking for roundup(48) = 64 contiguous pages sounds reasonable.
> on x86 it's roundup(1024 * 1500 * 2 / 4096) = 1024 contiguous pages.



> I thought it has zero chance of succeeding unless it's fresh boot ?

Don't you load your NIC at boot time ?

Anyway, not a big deal: order-0 pages will then be allocated, but each
order-0 fallback also decrements the order to something that might be
available, like order-5 after 5 failed allocations.

> Should it be something like ?
> order = min_t(u32, ilog2(pages_per_ring), 5); 
> 
> > Later, we might chose a different factor, maybe an elastic one.
> > 
> > > 
> > > > +retry:
> > > > +	total = 0;
> > > > +	pages_per_ring = new_size * stride_bytes * 2 / PAGE_SIZE;
> > > > +	pages_per_ring = roundup_pow_of_two(pages_per_ring);
> > > > +
> > > > +	order = min_t(u32, ilog2(pages_per_ring), MAX_ORDER - 1);
> > > 
> > > if you're sure it doesn't hurt the rest of the system,
> > > why use MAX_ORDER - 1? why not MAX_ORDER?
> > 
> > alloc_page(GFP_...,   MAX_ORDER)  never succeeds ;)
> 
> but MAX_ORDER - 1 also almost never succeeds ?

It does on 100% of our hosts at boot time.

And if not, we will automatically converge to whatever order-X pages are
readily available in the buddy allocator.
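
For what it's worth, the sizing math above can be checked with a tiny
stand-alone program; the constants (ring size 1024, stride 1536 bytes,
MAX_ORDER 11) are only the examples used in this thread, not values read
from the driver, and the helpers just mimic the kernel ones:

#include <stdio.h>

#define MAX_ORDER 11    /* assumption: typical kernel value, not taken from the tree */

static unsigned int pow2_roundup(unsigned int x)
{
        unsigned int r = 1;

        while (r < x)
                r <<= 1;
        return r;
}

static unsigned int log2_floor(unsigned int x)
{
        unsigned int r = 0;

        while (x >>= 1)
                r++;
        return r;
}

int main(void)
{
        const unsigned int ring_size = 1024, stride_bytes = 1536;
        const unsigned int page_sizes[] = { 4096, 65536 };     /* x86, PowerPC */

        for (int i = 0; i < 2; i++) {
                unsigned int pages = ring_size * stride_bytes * 2 / page_sizes[i];
                unsigned int order;

                pages = pow2_roundup(pages);
                order = log2_floor(pages);
                if (order > MAX_ORDER - 1)
                        order = MAX_ORDER - 1;
                printf("PAGE_SIZE=%u: pool of %u pages, first try order-%u\n",
                       page_sizes[i], pages, order);
        }
        return 0;
}

On the PowerPC numbers this prints a 64-page pool and a first attempt at
order-6, matching the roundup(48) above; on x86 it prints 1024 pages and
order-10.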

> 
> > Because of the __GRP_NOWARN you would not see the error I guess, but we
> > would get one pesky order-0 page in the ring buffer ;)
> > 
> > > 
> > > >  
> > > > -/* We recover from out of memory by scheduling our napi poll
> > > > - * function (mlx4_en_process_cq), which tries to allocate
> > > > - * all missing RX buffers (call to mlx4_en_refill_rx_buffers).
> > > > +/* Under memory pressure, each ring->rx_alloc_order might be lowered
> > > > + * to very small values. Periodically increase t to initial value for
> > > > + * optimal allocations, in case stress is over.
> > > >   */
> > > > +	for (ring_ind = 0; ring_ind < priv->rx_ring_num; ring_ind++) {
> > > > +		ring = priv->rx_ring[ring_ind];
> > > > +		order = min_t(unsigned int, ring->rx_alloc_order + 1,
> > > > +			      ring->rx_pref_alloc_order);
> > > > +		WRITE_ONCE(ring->rx_alloc_order, order);
> > > 
> > > when recycling is effective in a matter of few seconds it will
> > > increase ther order back to 10 and the first time the driver needs
> > > to allocate, it will start that tedious failure loop all over again.
> > > How about removing this periodic mlx4_en_recover_from_oom() completely
> > > and switch to increase the order inside mlx4_alloc_page().
> > > Like N successful __alloc_pages_node() with order X will bump it
> > > into order X+1. If it fails next time it will do only one failed attempt.
> > 
> > I wanted to do the increase out of line. (not in the data path)
> > 
> > We probably could increase only if ring->rx_alloc_pages got a
> > significant increase since the last mlx4_en_recover_from_oom() call.
> > 
> > (That would require a new ring->prior_rx_alloc_pages out of hot cache
> > lines)
> 
> right. rx_alloc_pages can also be reduce to 16-bit and this
> new one prior_rx_alloc_pages to 16-bit too, no?

Think about arches not having atomic 8-bit or 16-bit reads or writes.

READ_ONCE()/WRITE_ONCE() will not be usable.


> 
> > Or maybe attempt the allocation from process context (from
> > mlx4_en_recover_from_oom())
> 
> yeah. that won't be great. since this patch is removing that issue,
> the best is not to introduce it again.
> 
> > I have not really thought very hard about this, since data path, if
> > really needing new pages, will very fast probe rx_alloc_order back to
> > available pages in the zone.
> 
> I'm nit picking on it mainly because, if proven successful,
> this strategy will be adopted by other drivers and likely moved
> to net/common, so imo it's important to talk about these details.

Sure.

> interesting. fancy. so things like nbd will also be using them too, right?

Yes, I believe so.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-15 23:34       ` Eric Dumazet
@ 2017-03-16  0:44         ` Alexei Starovoitov
  0 siblings, 0 replies; 16+ messages in thread
From: Alexei Starovoitov @ 2017-03-16  0:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, netdev, Tariq Toukan,
	Saeed Mahameed, Willem de Bruijn, Alexei Starovoitov,
	Alexander Duyck

On Wed, Mar 15, 2017 at 04:34:51PM -0700, Eric Dumazet wrote:
> > > > > -/* We recover from out of memory by scheduling our napi poll
> > > > > - * function (mlx4_en_process_cq), which tries to allocate
> > > > > - * all missing RX buffers (call to mlx4_en_refill_rx_buffers).
> > > > > +/* Under memory pressure, each ring->rx_alloc_order might be lowered
> > > > > + * to very small values. Periodically increase t to initial value for
> > > > > + * optimal allocations, in case stress is over.
> > > > >   */
> > > > > +	for (ring_ind = 0; ring_ind < priv->rx_ring_num; ring_ind++) {
> > > > > +		ring = priv->rx_ring[ring_ind];
> > > > > +		order = min_t(unsigned int, ring->rx_alloc_order + 1,
> > > > > +			      ring->rx_pref_alloc_order);
> > > > > +		WRITE_ONCE(ring->rx_alloc_order, order);
> > > > 
> > > > when recycling is effective in a matter of few seconds it will
> > > > increase ther order back to 10 and the first time the driver needs
> > > > to allocate, it will start that tedious failure loop all over again.
> > > > How about removing this periodic mlx4_en_recover_from_oom() completely
> > > > and switch to increase the order inside mlx4_alloc_page().
> > > > Like N successful __alloc_pages_node() with order X will bump it
> > > > into order X+1. If it fails next time it will do only one failed attempt.
> > > 
> > > I wanted to do the increase out of line. (not in the data path)
> > > 
> > > We probably could increase only if ring->rx_alloc_pages got a
> > > significant increase since the last mlx4_en_recover_from_oom() call.
> > > 
> > > (That would require a new ring->prior_rx_alloc_pages out of hot cache
> > > lines)
> > 
> > right. rx_alloc_pages can also be reduce to 16-bit and this
> > new one prior_rx_alloc_pages to 16-bit too, no?
> 
> Think about arches not having atomic 8-bit or 16-bit reads or writes.
> 
> READ_ONCE()/WRITE_ONCE() will not be usable.

I mean if you really want to squeeze space these two:
+       unsigned int                    pre_allocated_count;
+       unsigned int                    rx_alloc_order;

can become 16-bit, leaving room for a 'rx_alloc_pages_without_fail' counter
that counts up to a small N and then bumps 'rx_alloc_order' by 1.
And since _oom() will be gone, there is no need for READ_ONCE()/WRITE_ONCE().
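
A rough sketch of that feedback idea, with made-up field and threshold
names (this only illustrates the bump-after-N-successes loop, it is not
driver code):

#include <stdint.h>
#include <stdbool.h>

#define RX_ALLOC_BUMP_AFTER 8   /* arbitrary N, purely illustrative */

struct rx_order_state {
        uint16_t rx_alloc_order;  /* order tried first on the next refill */
        uint16_t alloc_ok_streak; /* consecutive successful high-order allocs */
};

static void rx_alloc_feedback(struct rx_order_state *s, bool success,
                              uint16_t max_order)
{
        if (!success) {
                /* one failed attempt: back off by one order, reset the streak */
                if (s->rx_alloc_order > 0)
                        s->rx_alloc_order--;
                s->alloc_ok_streak = 0;
                return;
        }
        /* after N successes in a row, try one order higher next time */
        if (++s->alloc_ok_streak >= RX_ALLOC_BUMP_AFTER &&
            s->rx_alloc_order < max_order) {
                s->rx_alloc_order++;
                s->alloc_ok_streak = 0;
        }
}

With something like that, a failure costs a single extra attempt, and the
order only climbs back after sustained success.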

anyway, looking forward to your next version.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-15 23:06     ` Alexei Starovoitov
  2017-03-15 23:34       ` Eric Dumazet
@ 2017-03-16  1:07       ` Eric Dumazet
  2017-03-16  1:10         ` Eric Dumazet
  2017-03-16  1:56         ` Alexei Starovoitov
  1 sibling, 2 replies; 16+ messages in thread
From: Eric Dumazet @ 2017-03-16  1:07 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, David S . Miller, netdev, Tariq Toukan,
	Saeed Mahameed, Willem de Bruijn, Alexei Starovoitov,
	Alexander Duyck

On Wed, 2017-03-15 at 16:06 -0700, Alexei Starovoitov wrote:

> yes. and we have 'xdp_tx_full' counter for it that we monitor.
> When tx ring and mtu are sized properly, we don't expect to see it
> incrementing at all. This is something in our control. 'Our' means
> humans that setup the environment.
> 'cache empty' condition is something ephemeral. Packets will be dropped
> and we won't have any tools to address it. These packets are real
> people requests. Any drop needs to be categorized.
> Like there is 'rx_fifo_errors' counter that mlx4 reports when
> hw is dropping packets before they reach the driver. We see it
> incrementing depending on the traffic pattern though overall Mpps
> through the nic is not too high and this is something we
> actively investigating too.


This all looks nice, except that current mlx4 driver does not have a
counter of failed allocations.

You are asking me to keep some inexistent functionality.

Look in David net tree :

mlx4_en_refill_rx_buffers()
...
   if (mlx4_en_prepare_rx_desc(...))
      break;


So in case of memory pressure, mlx4 stops working and not a single
counter is incremented/reported.

So I guess your supervision was not really tested.

Just to show you what you are asking, here is a diff against latest
version.

You want to make sure a fresh page is there before calling XDP program.

Is it really what you want ?

 drivers/net/ethernet/mellanox/mlx4/en_rx.c |   38 +++++++++++++----------------
 1 file changed, 18 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index cc41f2f145541b469b52e7014659d5fdbb7dac68..e5ef8999087b52705faf083c94cde439aab9e2b7 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -793,10 +793,24 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
                if (xdp_prog) {
                        struct xdp_buff xdp;
                        struct page *npage;
-                       dma_addr_t ndma, dma;
+                       dma_addr_t dma;
                        void *orig_data;
                        u32 act;
 
+                       /* Make sure we have one page ready to replace this one, per Alexei request */
+                       if (unlikely(!ring->page_cache.index)) {
+                               npage = mlx4_alloc_page(priv, ring,
+                                                       &ring->page_cache.buf[0].dma,
+                                                       numa_mem_id(),
+                                                       GFP_ATOMIC | __GFP_MEMALLOC);
+                               if (!npage) {
+                                       /* replace this by a new ring->rx_alloc_failed++
+                                        */
+                                       ring->xdp_drop++;
+                                       goto next;
+                               }
+                               ring->page_cache.buf[0].page = npage;
+                       }
                        dma = frags[0].dma + frags[0].page_offset;
                        dma_sync_single_for_cpu(priv->ddev, dma,
                                                priv->frag_info[0].frag_size,
@@ -820,29 +834,13 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
                        case XDP_PASS:
                                break;
                        case XDP_TX:
-                               /* Make sure we have one page ready to replace this one */
-                               npage = NULL;
-                               if (!ring->page_cache.index) {
-                                       npage = mlx4_alloc_page(priv, ring,
-                                                               &ndma, numa_mem_id(),
-                                                               GFP_ATOMIC | __GFP_MEMALLOC);
-                                       if (!npage) {
-                                               ring->xdp_drop++;
-                                               goto next;
-                                       }
-                               }
                                if (likely(!mlx4_en_xmit_frame(ring, frags, dev,
                                                        length, cq->ring,
                                                        &doorbell_pending))) {
-                                       if (ring->page_cache.index) {
-                                               u32 idx = --ring->page_cache.index;
+                                       u32 idx = --ring->page_cache.index;
 
-                                               frags[0].page = ring->page_cache.buf[idx].page;
-                                               frags[0].dma = ring->page_cache.buf[idx].dma;
-                                       } else {
-                                               frags[0].page = npage;
-                                               frags[0].dma = ndma;
-                                       }
+                                       frags[0].page = ring->page_cache.buf[idx].page;
+                                       frags[0].dma = ring->page_cache.buf[idx].dma;
                                        frags[0].page_offset = XDP_PACKET_HEADROOM;
                                        goto next;
                                }

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-16  1:07       ` Eric Dumazet
@ 2017-03-16  1:10         ` Eric Dumazet
  2017-03-16  1:56         ` Alexei Starovoitov
  1 sibling, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2017-03-16  1:10 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, David S . Miller, netdev, Tariq Toukan,
	Saeed Mahameed, Willem de Bruijn, Alexei Starovoitov,
	Alexander Duyck

On Wed, 2017-03-15 at 18:07 -0700, Eric Dumazet wrote:
> On Wed, 2017-03-15 at 16:06 -0700, Alexei Starovoitov wrote:
> 
> > yes. and we have 'xdp_tx_full' counter for it that we monitor.
> > When tx ring and mtu are sized properly, we don't expect to see it
> > incrementing at all. This is something in our control. 'Our' means
> > humans that setup the environment.
> > 'cache empty' condition is something ephemeral. Packets will be dropped
> > and we won't have any tools to address it. These packets are real
> > people requests. Any drop needs to be categorized.
> > Like there is 'rx_fifo_errors' counter that mlx4 reports when
> > hw is dropping packets before they reach the driver. We see it
> > incrementing depending on the traffic pattern though overall Mpps
> > through the nic is not too high and this is something we
> > actively investigating too.
> 
> 
> This all looks nice, except that current mlx4 driver does not have a
> counter of failed allocations.
> 
> You are asking me to keep some inexistent functionality.
> 
> Look in David net tree :
> 
> mlx4_en_refill_rx_buffers()
> ...
>    if (mlx4_en_prepare_rx_desc(...))
>       break;
> 
> 
> So in case of memory pressure, mlx4 stops working and not a single
> counter is incremented/reported.
> 
> So I guess your supervision was not really tested.
> 
> Just to show you what you are asking, here is a diff against latest
> version.
> 
> You want to make sure a fresh page is there before calling XDP program.
> 
> Is it really what you want ?
> 
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c |   38 +++++++++++++----------------
>  1 file changed, 18 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index cc41f2f145541b469b52e7014659d5fdbb7dac68..e5ef8999087b52705faf083c94cde439aab9e2b7 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -793,10 +793,24 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>                 if (xdp_prog) {
>                         struct xdp_buff xdp;
>                         struct page *npage;
> -                       dma_addr_t ndma, dma;
> +                       dma_addr_t dma;
>                         void *orig_data;
>                         u32 act;
>  
> +                       /* Make sure we have one page ready to replace this one, per Alexei request */
> +                       if (unlikely(!ring->page_cache.index)) {
> +                               npage = mlx4_alloc_page(priv, ring,
> +                                                       &ring->page_cache.buf[0].dma,
> +                                                       numa_mem_id(),
> +                                                       GFP_ATOMIC | __GFP_MEMALLOC);
> +                               if (!npage) {
> +                                       /* replace this by a new ring->rx_alloc_failed++
> +                                        */
> +                                       ring->xdp_drop++;
> +                                       goto next;
> +                               }
> +                               ring->page_cache.buf[0].page = npage;

Plus a missing :
				ring->page_cache.index = 1;

> +                       }
>                         dma = frags[0].dma + frags[0].page_offset;
>                         dma_sync_single_for_cpu(priv->ddev, dma,
>                                                 priv->frag_info[0].frag_size,
> @@ -820,29 +834,13 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>                         case XDP_PASS:
>                                 break;
>                         case XDP_TX:
> -                               /* Make sure we have one page ready to replace this one */
> -                               npage = NULL;
> -                               if (!ring->page_cache.index) {
> -                                       npage = mlx4_alloc_page(priv, ring,
> -                                                               &ndma, numa_mem_id(),
> -                                                               GFP_ATOMIC | __GFP_MEMALLOC);
> -                                       if (!npage) {
> -                                               ring->xdp_drop++;
> -                                               goto next;
> -                                       }
> -                               }
>                                 if (likely(!mlx4_en_xmit_frame(ring, frags, dev,
>                                                         length, cq->ring,
>                                                         &doorbell_pending))) {
> -                                       if (ring->page_cache.index) {
> -                                               u32 idx = --ring->page_cache.index;
> +                                       u32 idx = --ring->page_cache.index;
>  
> -                                               frags[0].page = ring->page_cache.buf[idx].page;
> -                                               frags[0].dma = ring->page_cache.buf[idx].dma;
> -                                       } else {
> -                                               frags[0].page = npage;
> -                                               frags[0].dma = ndma;
> -                                       }
> +                                       frags[0].page = ring->page_cache.buf[idx].page;
> +                                       frags[0].dma = ring->page_cache.buf[idx].dma;
>                                         frags[0].page_offset = XDP_PACKET_HEADROOM;
>                                         goto next;
>                                 }
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-16  1:07       ` Eric Dumazet
  2017-03-16  1:10         ` Eric Dumazet
@ 2017-03-16  1:56         ` Alexei Starovoitov
  2017-03-16  2:48           ` Eric Dumazet
  1 sibling, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2017-03-16  1:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, netdev, Tariq Toukan,
	Saeed Mahameed, Willem de Bruijn, Alexei Starovoitov,
	Alexander Duyck

On Wed, Mar 15, 2017 at 06:07:16PM -0700, Eric Dumazet wrote:
> On Wed, 2017-03-15 at 16:06 -0700, Alexei Starovoitov wrote:
> 
> > yes. and we have 'xdp_tx_full' counter for it that we monitor.
> > When tx ring and mtu are sized properly, we don't expect to see it
> > incrementing at all. This is something in our control. 'Our' means
> > humans that setup the environment.
> > 'cache empty' condition is something ephemeral. Packets will be dropped
> > and we won't have any tools to address it. These packets are real
> > people requests. Any drop needs to be categorized.
> > Like there is 'rx_fifo_errors' counter that mlx4 reports when
> > hw is dropping packets before they reach the driver. We see it
> > incrementing depending on the traffic pattern though overall Mpps
> > through the nic is not too high and this is something we
> > actively investigating too.
> 
> 
> This all looks nice, except that current mlx4 driver does not have a
> counter of failed allocations.
> 
> You are asking me to keep some inexistent functionality.
> 
> Look in David net tree :
> 
> mlx4_en_refill_rx_buffers()
> ...
>    if (mlx4_en_prepare_rx_desc(...))
>       break;
> 
> 
> So in case of memory pressure, mlx4 stops working and not a single
> counter is incremented/reported.

Not quite. That is exactly the case i'm asking to keep.
In case of memory pressure (like in the above case rx fifo
won't be replenished) and in case of hw receiving
faster than the driver can drain the rx ring,
the hw will increment 'rx_fifo_errors' counter.
And that's what we monitor already and what I described in previous email.

> Is it really what you want ?

almost. see below.

>  drivers/net/ethernet/mellanox/mlx4/en_rx.c |   38 +++++++++++++----------------
>  1 file changed, 18 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index cc41f2f145541b469b52e7014659d5fdbb7dac68..e5ef8999087b52705faf083c94cde439aab9e2b7 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -793,10 +793,24 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>                 if (xdp_prog) {
>                         struct xdp_buff xdp;
>                         struct page *npage;
> -                       dma_addr_t ndma, dma;
> +                       dma_addr_t dma;
>                         void *orig_data;
>                         u32 act;
>  
> +                       /* Make sure we have one page ready to replace this one, per Alexei request */

do you have to add snarky comments?

> +                       if (unlikely(!ring->page_cache.index)) {
> +                               npage = mlx4_alloc_page(priv, ring,
> +                                                       &ring->page_cache.buf[0].dma,
> +                                                       numa_mem_id(),
> +                                                       GFP_ATOMIC | __GFP_MEMALLOC);
> +                               if (!npage) {
> +                                       /* replace this by a new ring->rx_alloc_failed++
> +                                        */
> +                                       ring->xdp_drop++;

counting it as 'xdp_drop' is incorrect.
'xdp_drop' should be incremented only when the program actually does it,
otherwise that's confusing to the user.
If you add a new counter 'rx_alloc_failed' (as you imply above)
then it's similar to the existing state.
Before: for both hw receiving too much and oom with the rx fifo empty we
will see the 'rx_fifo_errors' counter.
After: most rx_fifo_errors would mean hw receive issues, and oom will
be covered by this new counter.

Another thought... if we do this 'replenish rx ring immediately'
why do it for xdp rings only? Let's do it for all rings?
the above 'if (unlikely(!ring->page_cache.index)) ..alloc_page'
can move before 'if (xdp_prog)' and simplify the rest?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-16  1:56         ` Alexei Starovoitov
@ 2017-03-16  2:48           ` Eric Dumazet
  2017-03-16  5:39             ` Alexei Starovoitov
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2017-03-16  2:48 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, David S . Miller, netdev, Tariq Toukan,
	Saeed Mahameed, Willem de Bruijn, Alexei Starovoitov,
	Alexander Duyck

On Wed, Mar 15, 2017 at 6:56 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Wed, Mar 15, 2017 at 06:07:16PM -0700, Eric Dumazet wrote:
>> On Wed, 2017-03-15 at 16:06 -0700, Alexei Starovoitov wrote:
>>
>> > yes. and we have 'xdp_tx_full' counter for it that we monitor.
>> > When tx ring and mtu are sized properly, we don't expect to see it
>> > incrementing at all. This is something in our control. 'Our' means
>> > humans that setup the environment.
>> > 'cache empty' condition is something ephemeral. Packets will be dropped
>> > and we won't have any tools to address it. These packets are real
>> > people requests. Any drop needs to be categorized.
>> > Like there is 'rx_fifo_errors' counter that mlx4 reports when
>> > hw is dropping packets before they reach the driver. We see it
>> > incrementing depending on the traffic pattern though overall Mpps
>> > through the nic is not too high and this is something we
>> > actively investigating too.
>>
>>
>> This all looks nice, except that current mlx4 driver does not have a
>> counter of failed allocations.
>>
>> You are asking me to keep some inexistent functionality.
>>
>> Look in David net tree :
>>
>> mlx4_en_refill_rx_buffers()
>> ...
>>    if (mlx4_en_prepare_rx_desc(...))
>>       break;
>>
>>
>> So in case of memory pressure, mlx4 stops working and not a single
>> counter is incremented/reported.
>
> Not quite. That is exactly the case i'm asking to keep.
> In case of memory pressure (like in the above case rx fifo
> won't be replenished) and in case of hw receiving
> faster than the driver can drain the rx ring,
> the hw will increment 'rx_fifo_errors' counter.

In current mlx4 driver, if napi_get_frags() fails, no counter is incremented.

So you are describing quite a different behavior, where _cpu_ can not
keep up and rx_fifo_errors is incremented.

But in case of _memory_ pressure (and normal traffic), rx_fifo_errors
wont be incremented.

And even if xdp_prog 'decides' to return XDP_PASS, the fine packet
will be dropped anyway.


> And that's what we monitor already and what I described in previous email.
>
>> Is it really what you want ?
>
> almost. see below.
>
>>  drivers/net/ethernet/mellanox/mlx4/en_rx.c |   38 +++++++++++++----------------
>>  1 file changed, 18 insertions(+), 20 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>> index cc41f2f145541b469b52e7014659d5fdbb7dac68..e5ef8999087b52705faf083c94cde439aab9e2b7 100644
>> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>> @@ -793,10 +793,24 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>>                 if (xdp_prog) {
>>                         struct xdp_buff xdp;
>>                         struct page *npage;
>> -                       dma_addr_t ndma, dma;
>> +                       dma_addr_t dma;
>>                         void *orig_data;
>>                         u32 act;
>>
>> +                       /* Make sure we have one page ready to replace this one, per Alexei request */
>
> do you have to add snarky comments?

Is request a bad or offensive word ?

What would be the best way to say that you asked to move this code
here, while in my opinion it was better where it was  ?

>
>> +                       if (unlikely(!ring->page_cache.index)) {
>> +                               npage = mlx4_alloc_page(priv, ring,
>> +                                                       &ring->page_cache.buf[0].dma,
>> +                                                       numa_mem_id(),
>> +                                                       GFP_ATOMIC | __GFP_MEMALLOC);
>> +                               if (!npage) {
>> +                                       /* replace this by a new ring->rx_alloc_failed++
>> +                                        */
>> +                                       ring->xdp_drop++;
>
> counting it as 'xdp_drop' is incorrect.

I added a comment to make that very clear .

If you do not read the comment, what can I say ?

So the comment is :

 replace this by a new ring->rx_alloc_failed++

This of course will require other changes in other files (folding
stats at ethtool -S)
that are irrelevant for the discussion we have right now.

I wont provide full patch without knowing exactly what you are requesting.


> 'xdp_drop' should be incremented only when the program actually does it,
> otherwise that's confusing to the user.
> If you add a new counter 'rx_alloc_failed' (as you imply above)
> then it's similar to the existing state.
> Before: for both hw receiving too much and oom with the rx fifo empty we
> will see the 'rx_fifo_errors' counter.
> After: most rx_fifo_errors would mean hw receive issues, and oom will
> be covered by this new counter.
>
> Another thought... if we do this 'replenish rx ring immediately'
> why do it for xdp rings only? Let's do it for all rings?
> the above 'if (unlikely(!ring->page_cache.index)) ..alloc_page'
> can move before 'if (xdp_prog)' and simplify the rest?
>

Because non XDP paths attempt to use the page pool first.

_if_ the oldest page in page pool can not be recycled, then we
allocate a fresh page,
from a special pool (order-X preallocations) that does not fit the
page_cache order-0 model

Non-XDP paths do not need to populate page_cache with one order-0
page; that would add extra useless code.
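
In pseudo-C, that "recycle the oldest pool entry if its count is back to
one, otherwise take from the pre-allocated batch" order looks roughly like
this; the types and helpers below are made up for illustration and are not
the driver's:

#include <stddef.h>

struct fake_page {
        int refcount;                   /* stands in for the struct page count */
};

struct rx_pool_model {
        struct fake_page *entries[256]; /* quarantined pages, oldest first */
        unsigned int head, count;
        struct fake_page *prealloc;     /* pre-split order-X batch */
        unsigned int prealloc_left;
};

static struct fake_page *rx_get_page(struct rx_pool_model *pool)
{
        /* 1) try to recycle the oldest quarantined page */
        if (pool->count) {
                struct fake_page *oldest = pool->entries[pool->head];

                if (oldest->refcount == 1) {  /* old skbs/frags released it */
                        pool->head = (pool->head + 1) % 256;
                        pool->count--;
                        return oldest;
                }
        }
        /* 2) otherwise take a fresh page from the pre-allocated batch */
        if (!pool->prealloc_left)
                return NULL;            /* caller would refill the batch here */
        pool->prealloc_left--;
        return pool->prealloc++;
}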

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-16  2:48           ` Eric Dumazet
@ 2017-03-16  5:39             ` Alexei Starovoitov
  2017-03-16 12:00               ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2017-03-16  5:39 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, netdev, Tariq Toukan,
	Saeed Mahameed, Willem de Bruijn, Alexei Starovoitov,
	Alexander Duyck

On Wed, Mar 15, 2017 at 07:48:04PM -0700, Eric Dumazet wrote:
> On Wed, Mar 15, 2017 at 6:56 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Wed, Mar 15, 2017 at 06:07:16PM -0700, Eric Dumazet wrote:
> >> On Wed, 2017-03-15 at 16:06 -0700, Alexei Starovoitov wrote:
> >>
> >> > yes. and we have 'xdp_tx_full' counter for it that we monitor.
> >> > When tx ring and mtu are sized properly, we don't expect to see it
> >> > incrementing at all. This is something in our control. 'Our' means
> >> > humans that setup the environment.
> >> > 'cache empty' condition is something ephemeral. Packets will be dropped
> >> > and we won't have any tools to address it. These packets are real
> >> > people requests. Any drop needs to be categorized.
> >> > Like there is 'rx_fifo_errors' counter that mlx4 reports when
> >> > hw is dropping packets before they reach the driver. We see it
> >> > incrementing depending on the traffic pattern though overall Mpps
> >> > through the nic is not too high and this is something we
> >> > actively investigating too.
> >>
> >>
> >> This all looks nice, except that current mlx4 driver does not have a
> >> counter of failed allocations.
> >>
> >> You are asking me to keep some inexistent functionality.
> >>
> >> Look in David net tree :
> >>
> >> mlx4_en_refill_rx_buffers()
> >> ...
> >>    if (mlx4_en_prepare_rx_desc(...))
> >>       break;
> >>
> >>
> >> So in case of memory pressure, mlx4 stops working and not a single
> >> counter is incremented/reported.
> >
> > Not quite. That is exactly the case i'm asking to keep.
> > In case of memory pressure (like in the above case rx fifo
> > won't be replenished) and in case of hw receiving
> > faster than the driver can drain the rx ring,
> > the hw will increment 'rx_fifo_errors' counter.
> 
> In current mlx4 driver, if napi_get_frags() fails, no counter is incremented.
> 
> So you are describing quite a different behavior, where _cpu_ can not
> keep up and rx_fifo_errors is incremented.
> 
> But in case of _memory_ pressure (and normal traffic), rx_fifo_errors
> wont be incremented.

when there is no room in the rx fifo the hw will increment the counter.
That's the same as oom causing alloc fails and rx ring not being replenished.
When there is nothing free in rx ring to dma the packet to, the hw will
increment that counter.

> And even if xdp_prog 'decides' to return XDP_PASS, the fine packet
> will be dropped anyway.

Yes. And no counter will be incremented when the packet is dropped
further up the stack. So?
This discussion is specifically about changing the xdp rings' behavior
for the xdp_tx case.

> > And that's what we monitor already and what I described in previous email.
> >
> >> Is it really what you want ?
> >
> > almost. see below.
> >
> >>  drivers/net/ethernet/mellanox/mlx4/en_rx.c |   38 +++++++++++++----------------
> >>  1 file changed, 18 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> >> index cc41f2f145541b469b52e7014659d5fdbb7dac68..e5ef8999087b52705faf083c94cde439aab9e2b7 100644
> >> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> >> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> >> @@ -793,10 +793,24 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
> >>                 if (xdp_prog) {
> >>                         struct xdp_buff xdp;
> >>                         struct page *npage;
> >> -                       dma_addr_t ndma, dma;
> >> +                       dma_addr_t dma;
> >>                         void *orig_data;
> >>                         u32 act;
> >>
> >> +                       /* Make sure we have one page ready to replace this one, per Alexei request */
> >
> > do you have to add snarky comments?
> 
> Is request a bad or offensive word ?
> 
> What would be the best way to say that you asked to move this code
> here, while in my opinion it was better where it was  ?

So in other words you're saying that you disagree with
all of my previous explanations of why the drop after xdp_tx is wrong?
And you think it's fine to execute the program, do map lookups and
rewrites, and at the last step drop the result because the driver
couldn't allocate a page to repopulate the rx ring, right?
I can try to see why that would be reasonable if you provide the arguments.

> >
> >> +                       if (unlikely(!ring->page_cache.index)) {
> >> +                               npage = mlx4_alloc_page(priv, ring,
> >> +                                                       &ring->page_cache.buf[0].dma,
> >> +                                                       numa_mem_id(),
> >> +                                                       GFP_ATOMIC | __GFP_MEMALLOC);
> >> +                               if (!npage) {
> >> +                                       /* replace this by a new ring->rx_alloc_failed++
> >> +                                        */
> >> +                                       ring->xdp_drop++;
> >
> > counting it as 'xdp_drop' is incorrect.
> 
> I added a comment to make that very clear .
> 
> If you do not read the comment, what can I say ?

enough insults, please?
you're making it difficult to keep the discussion technical.

> So the comment is :
> 
>  replace this by a new ring->rx_alloc_failed++
> 
> This of course will require other changes in other files (folding
> stats at ethtool -S)
> that are irrelevant for the discussion we have right now.

doing ring->xdp_drop++ here is not correct. Hence it's a blocker
for the rest of the patch. Just because 99% of it is good,
it doesn't mean that we can regress this 1%.
The solution of adding a new counter is trivial, so it's not
clear to me what you're objecting to.

> I wont provide full patch without knowing exactly what you are requesting.
> 
> 
> > 'xdp_drop' should be incremented only when the program actually does it,
> > otherwise that's confusing to the user.
> > If you add a new counter 'rx_alloc_failed' (as you imply above)
> > then it's similar to the existing state.
> > Before: for both hw receiving too much and oom with the rx fifo empty we
> > will see the 'rx_fifo_errors' counter.
> > After: most rx_fifo_errors would mean hw receive issues, and oom will
> > be covered by this new counter.
> >
> > Another thought... if we do this 'replenish rx ring immediately'
> > why do it for xdp rings only? Let's do it for all rings?
> > the above 'if (unlikely(!ring->page_cache.index)) ..alloc_page'
> > can move before 'if (xdp_prog)' and simplify the rest?
> >
> 
> Because non XDP paths attempt to use the page pool first.
> 
> _if_ the oldest page in page pool can not be recycled, then we
> allocate a fresh page,
> from a special pool (order-X preallocations) that does not fit the
> page_cache order-0 model

I see, yeah: the mlx4_en_complete_rx_desc->mlx4_replenish logic
cannot be hoisted before napi_get_frags().
I was hoping to avoid doing napi_get_frags() and populating the
skb when we drop the packet due to !mlx4_replenish.
Never mind.

Concrete steps from this state of patch:
1. either do 'if (unlikely(!ring->page_cache.index)) ..alloc_page'
before the program runs and increment a new 'rx_alloc_failed' counter.
Without the new counter it's no good.

2. or keep xdp rings as they are today: process all packets; some
packets go into xdp_tx, some into xdp_pass.
The xdp_pass will do your mlx4_en_complete_rx_desc->mlx4_replenish
with potential drop. At the end of the loop call into
mlx4_en_remap_rx_buffers() which will do
if (!frags[0].page)
  mlx4_alloc_page()
mlx4_en_prepare_rx_desc()
alloc will be happening only for pages that went into xdp_tx.
If the system is under memory pressure and the refill fails, the hw will
see that the rx fifo has no free descriptors and will increment
rx_fifo_errors, and we have exactly the same situation as today.
And no new counter is necessary.
I think the concern here is that the rx fifo can become completely
empty and nothing will ever replenish it, hence I'm suggesting
to populate it while processing tx completions on the xdp rings,
since that runs on the same cpu and doesn't race.
This way xdp will still function even under severe memory
pressure and when memory is available again the rx fifo ring
will be filled in mlx4_en_remap_rx_buffers() without
any extra _oom callbacks.
Also doing allocs in mlx4_en_remap_rx_buffers() after the loop
gives opportunity to do batch allocations, so long term
I'd like to try 2 anyway, but 1 is fine too.
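
To make option 2 a bit more concrete, here is a rough sketch of such an
end-of-loop refill; the struct layout and helper names are placeholders
based only on the pseudo-code above, not the actual driver:

#include <stddef.h>

struct frag_model { void *page; unsigned long dma; };
struct slot_model { struct frag_model frags[1]; };
struct ring_model { struct slot_model slot[1024]; };

/* placeholders standing in for the driver's real helpers */
static void *alloc_rx_page(struct ring_model *ring) { (void)ring; return NULL; }
static void prepare_rx_desc(struct ring_model *ring, int i) { (void)ring; (void)i; }

static void refill_after_rx_loop(struct ring_model *ring, int processed)
{
        for (int i = 0; i < processed; i++) {
                struct frag_model *frags = ring->slot[i].frags;

                /* only slots whose page left via XDP_TX are missing a page */
                if (!frags[0].page) {
                        frags[0].page = alloc_rx_page(ring); /* may fail under OOM */
                        if (!frags[0].page)
                                continue; /* hw then hits rx_fifo_errors, as today */
                }
                prepare_rx_desc(ring, i); /* hand the slot back to the hw */
        }
}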

In 2 the new counter is not necessary, because rx_fifo_errors
will cover it (it would still be nice to add in the future for visibility).
For 1 the new counter is needed immediately, otherwise drops
that are accounted for today would go unreported.

I think it's important to agree on the problem first.
While you think your current patch is perfect and are not seeing
my arguments, it will be impossible to discuss the solution
without agreeing on the problem.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-16  5:39             ` Alexei Starovoitov
@ 2017-03-16 12:00               ` Eric Dumazet
  0 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2017-03-16 12:00 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, David S . Miller, netdev, Tariq Toukan,
	Saeed Mahameed, Willem de Bruijn, Alexei Starovoitov,
	Alexander Duyck

On Wed, 2017-03-15 at 22:39 -0700, Alexei Starovoitov wrote:

> when there is no room in the rx fifo the hw will increment the counter.
> That's the same as oom causing alloc fails and rx ring not being replenished.
> When there is nothing free in rx ring to dma the packet to, the hw will
> increment that counter.

After 4096 frames were lost by the linux mlx4 driver, the hw will finally
catch up and start increasing its own hw counter.


Meanwhile you pretended that you received millions of frames and wanted
to be sure that not a single one could be dropped without a report.

This is simply not true.

I do not believe I will continue the discussion.

We cannot really agree, it seems, so why keep annoying all netdev readers.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-15 15:36 ` Tariq Toukan
  2017-03-15 16:27   ` Eric Dumazet
@ 2017-03-20 12:59   ` Tariq Toukan
  2017-03-20 13:04     ` Eric Dumazet
  1 sibling, 1 reply; 16+ messages in thread
From: Tariq Toukan @ 2017-03-20 12:59 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller
  Cc: netdev, Tariq Toukan, Saeed Mahameed, Willem de Bruijn,
	Alexei Starovoitov, Eric Dumazet, Alexander Duyck



On 15/03/2017 5:36 PM, Tariq Toukan wrote:
>
>
> On 14/03/2017 5:11 PM, Eric Dumazet wrote:
>> When adding order-0 pages allocations and page recycling in receive path,
>> I added issues on PowerPC, or more generally on arches with large pages.
>>
>> A GRO packet, aggregating 45 segments, ended up using 45 page frags
>> on 45 different pages. Before my changes we were very likely packing
>> up to 42 Ethernet frames per 64KB page.
>>
>> 1) At skb freeing time, all put_page() on the skb frags now touch 45
>>    different 'struct page' and this adds more cache line misses.
>>    Too bad that standard Ethernet MTU is so small :/
>>
>> 2) Using one order-0 page per ring slot consumes ~42 times more memory
>>    on PowerPC.
>>
>> 3) Allocating order-0 pages is very likely to use pages from very
>>    different locations, increasing TLB pressure on hosts with more
>>    than 256 GB of memory after days of uptime.
>>
>> This patch uses a refined strategy, addressing these points.
>>
>> We still use order-0 pages, but the page recyling technique is modified
>> so that we have better chances to lower number of pages containing the
>> frags for a given GRO skb (factor of 2 on x86, and 21 on PowerPC)
>>
>> Page allocations are split in two halves :
>> - One currently visible by the NIC for DMA operations.
>> - The other contains pages that already added to old skbs, put in
>>   a quarantine.
>>
>> When we receive a frame, we look at the oldest entry in the pool and
>> check if the page count is back to one, meaning old skbs/frags were
>> consumed and the page can be recycled.
>>
>> Page allocations are attempted using high order ones, trying
>> to lower TLB pressure. We remember in ring->rx_alloc_order the last
>> attempted
>> order and quickly decrement it in case of failures.
>> Then mlx4_en_recover_from_oom() called every 250 msec will attempt
>> to gradually restore rx_alloc_order to its optimal value.
>>
>> On x86, memory allocations stay the same. (One page per RX slot for
>> MTU=1500)
>> But on PowerPC, this patch considerably reduces the allocated memory.
>>
>> Performance gain on PowerPC is about 50% for a single TCP flow.
>>
>> On x86, I could not measure the difference, my test machine being
>> limited by the sender (33 Gbit per TCP flow).
>> 22 less cache line misses per 64 KB GRO packet is probably in the order
>> of 2 % or so.
>>
>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>> Cc: Tariq Toukan <tariqt@mellanox.com>
>> Cc: Saeed Mahameed <saeedm@mellanox.com>
>> Cc: Alexander Duyck <alexander.duyck@gmail.com>
>> ---
>>  drivers/net/ethernet/mellanox/mlx4/en_rx.c   | 470
>> ++++++++++++++++-----------
>>  drivers/net/ethernet/mellanox/mlx4/en_tx.c   |  15 +-
>>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  54 ++-
>>  3 files changed, 317 insertions(+), 222 deletions(-)
>>
>
> Hi Eric,
>
> Thanks for your patch.
>
> I will do the XDP tests and complete the review, by tomorrow.

Hi Eric,

While testing XDP scenarios, I noticed a small degradation.
However, more importantly, I hit a kernel panic, see trace below.

I'll need time to debug this.
I will update about progress in debug and XDP testing.

If you want, I can do the re-submission myself once both issues are solved.

Thanks,
Tariq

Trace:
[  379.069292] BUG: Bad page state in process xdp2  pfn:fd8c04
[  379.075840] page:ffffea003f630100 count:-1 mapcount:0 mapping: 
   (null) index:0x0
[  379.085413] flags: 0x2fffff80000000()
[  379.089816] raw: 002fffff80000000 0000000000000000 0000000000000000 
ffffffffffffffff
[  379.098994] raw: dead000000000100 dead000000000200 0000000000000000 
0000000000000000
[  379.108154] page dumped because: nonzero _refcount
[  379.113793] Modules linked in: mlx4_en(OE) mlx4_ib ib_core 
mlx4_core(OE) netconsole nfsv3 nfs fscache dm_mirror dm_region_hash 
dm_log dm_mod sb_edac edac_core x86_pkg_temp_thermal coretemp 
i2c_diolan_u2c kvm iTCO_wdt iTCO_vendor_support lpc_ich ipmi_si 
irqbypass dcdbas ipmi_devintf crc32_pclmul mfd_core ghash_clmulni_intel 
pcspkr ipmi_msghandler sg wmi acpi_power_meter shpchp nfsd auth_rpcgss 
nfs_acl lockd grace sunrpc binfmt_misc ip_tables sr_mod cdrom sd_mod 
mlx5_core i2c_algo_bit drm_kms_helper tg3 syscopyarea sysfillrect 
sysimgblt fb_sys_fops ttm drm ahci libahci ptp megaraid_sas libata 
crc32c_intel i2c_core pps_core [last unloaded: mlx4_en]
[  379.179886] CPU: 38 PID: 6243 Comm: xdp2 Tainted: G           OE 
4.11.0-rc2+ #25
[  379.188846] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 
1.5.4 10/002/2015
[  379.197814] Call Trace:
[  379.200838]  dump_stack+0x63/0x8c
[  379.204833]  bad_page+0xfe/0x11a
[  379.208728]  free_pages_check_bad+0x76/0x78
[  379.213688]  free_pcppages_bulk+0x4d5/0x510
[  379.218647]  free_hot_cold_page+0x258/0x280
[  379.228911]  __free_pages+0x25/0x30
[  379.233099]  mlx4_en_free_rx_buf.isra.23+0x79/0x110 [mlx4_en]
[  379.239811]  mlx4_en_deactivate_rx_ring+0xb2/0xd0 [mlx4_en]
[  379.246332]  mlx4_en_stop_port+0x4fc/0x7d0 [mlx4_en]
[  379.252166]  mlx4_xdp+0x373/0x3b0 [mlx4_en]
[  379.257126]  dev_change_xdp_fd+0x102/0x140
[  379.261993]  ? nla_parse+0xa3/0x100
[  379.266176]  do_setlink+0xc9c/0xcc0
[  379.270363]  ? nla_parse+0xa3/0x100
[  379.274547]  rtnl_setlink+0xbc/0x100
[  379.278828]  ? __enqueue_entity+0x60/0x70
[  379.283595]  rtnetlink_rcv_msg+0x95/0x220
[  379.288365]  ? __kmalloc_node_track_caller+0x214/0x280
[  379.294397]  ? __alloc_skb+0x7e/0x260
[  379.298774]  ? rtnl_newlink+0x830/0x830
[  379.303349]  netlink_rcv_skb+0xa7/0xc0
[  379.307825]  rtnetlink_rcv+0x28/0x30
[  379.312102]  netlink_unicast+0x15f/0x230
[  379.316771]  netlink_sendmsg+0x319/0x390
[  379.321441]  sock_sendmsg+0x38/0x50
[  379.325624]  SYSC_sendto+0xef/0x170
[  379.329808]  ? SYSC_bind+0xb0/0xe0
[  379.333895]  ? alloc_file+0x1b/0xc0
[  379.338077]  ? __fd_install+0x22/0xb0
[  379.342456]  ? sock_alloc_file+0x91/0x120
[  379.347314]  ? fd_install+0x25/0x30
[  379.351518]  SyS_sendto+0xe/0x10
[  379.355432]  entry_SYSCALL_64_fastpath+0x1a/0xa9
[  379.360901] RIP: 0033:0x7f824e6d0cad
[  379.365201] RSP: 002b:00007ffc75259a08 EFLAGS: 00000246 ORIG_RAX: 
000000000000002c
[  379.374198] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 
00007f824e6d0cad
[  379.382481] RDX: 000000000000002c RSI: 00007ffc75259a20 RDI: 
0000000000000003
[  379.390746] RBP: 00000000ffffffff R08: 0000000000000000 R09: 
0000000000000000
[  379.399010] R10: 0000000000000000 R11: 0000000000000246 R12: 
0000000000000019
[  379.407273] R13: 0000000000000030 R14: 00007ffc7525b260 R15: 
00007ffc75273270
[  379.415539] Disabling lock debugging due to kernel taint


>
> Regards,
> Tariq

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path
  2017-03-20 12:59   ` Tariq Toukan
@ 2017-03-20 13:04     ` Eric Dumazet
  0 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2017-03-20 13:04 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: David S . Miller, netdev, Tariq Toukan, Saeed Mahameed,
	Willem de Bruijn, Alexei Starovoitov, Eric Dumazet,
	Alexander Duyck

On Mon, Mar 20, 2017 at 5:59 AM, Tariq Toukan <ttoukan.linux@gmail.com> wrote:

>
> Hi Eric,
>
> While testing XDP scenarios, I noticed a small degradation.
> However, more importantly, I hit a kernel panic, see trace below.
>
> I'll need time to debug this.
> I will update about progress in debug and XDP testing.
>
> If you want, I can do the re-submission myself once both issues are solved.
>
> Thanks,
> Tariq
>
> Trace:
> [  379.069292] BUG: Bad page state in process xdp2  pfn:fd8c04
> [  379.075840] page:ffffea003f630100 count:-1 mapcount:0 mapping:   (null)
> index:0x0
> [  379.085413] flags: 0x2fffff80000000()
> [  379.089816] raw: 002fffff80000000 0000000000000000 0000000000000000
> ffffffffffffffff
> [  379.098994] raw: dead000000000100 dead000000000200 0000000000000000
> 0000000000000000
> [  379.108154] page dumped because: nonzero _refcount
> [  379.113793] Modules linked in: mlx4_en(OE) mlx4_ib ib_core mlx4_core(OE)
> netconsole nfsv3 nfs fscache dm_mirror dm_region_hash dm_log dm_mod sb_edac
> edac_core x86_pkg_temp_thermal coretemp i2c_diolan_u2c kvm iTCO_wdt
> iTCO_vendor_support lpc_ich ipmi_si irqbypass dcdbas ipmi_devintf
> crc32_pclmul mfd_core ghash_clmulni_intel pcspkr ipmi_msghandler sg wmi
> acpi_power_meter shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc
> binfmt_misc ip_tables sr_mod cdrom sd_mod mlx5_core i2c_algo_bit
> drm_kms_helper tg3 syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm
> ahci libahci ptp megaraid_sas libata crc32c_intel i2c_core pps_core [last
> unloaded: mlx4_en]
> [  379.179886] CPU: 38 PID: 6243 Comm: xdp2 Tainted: G           OE
> 4.11.0-rc2+ #25
> [  379.188846] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4
> 10/002/2015
> [  379.197814] Call Trace:
> [  379.200838]  dump_stack+0x63/0x8c
> [  379.204833]  bad_page+0xfe/0x11a
> [  379.208728]  free_pages_check_bad+0x76/0x78
> [  379.213688]  free_pcppages_bulk+0x4d5/0x510
> [  379.218647]  free_hot_cold_page+0x258/0x280
> [  379.228911]  __free_pages+0x25/0x30
> [  379.233099]  mlx4_en_free_rx_buf.isra.23+0x79/0x110 [mlx4_en]
> [  379.239811]  mlx4_en_deactivate_rx_ring+0xb2/0xd0 [mlx4_en]
> [  379.246332]  mlx4_en_stop_port+0x4fc/0x7d0 [mlx4_en]
> [  379.252166]  mlx4_xdp+0x373/0x3b0 [mlx4_en]
> [  379.257126]  dev_change_xdp_fd+0x102/0x140
> [  379.261993]  ? nla_parse+0xa3/0x100
> [  379.266176]  do_setlink+0xc9c/0xcc0
> [  379.270363]  ? nla_parse+0xa3/0x100
> [  379.274547]  rtnl_setlink+0xbc/0x100
> [  379.278828]  ? __enqueue_entity+0x60/0x70
> [  379.283595]  rtnetlink_rcv_msg+0x95/0x220
> [  379.288365]  ? __kmalloc_node_track_caller+0x214/0x280
> [  379.294397]  ? __alloc_skb+0x7e/0x260
> [  379.298774]  ? rtnl_newlink+0x830/0x830
> [  379.303349]  netlink_rcv_skb+0xa7/0xc0
> [  379.307825]  rtnetlink_rcv+0x28/0x30
> [  379.312102]  netlink_unicast+0x15f/0x230
> [  379.316771]  netlink_sendmsg+0x319/0x390
> [  379.321441]  sock_sendmsg+0x38/0x50
> [  379.325624]  SYSC_sendto+0xef/0x170
> [  379.329808]  ? SYSC_bind+0xb0/0xe0
> [  379.333895]  ? alloc_file+0x1b/0xc0
> [  379.338077]  ? __fd_install+0x22/0xb0
> [  379.342456]  ? sock_alloc_file+0x91/0x120
> [  379.347314]  ? fd_install+0x25/0x30
> [  379.351518]  SyS_sendto+0xe/0x10
> [  379.355432]  entry_SYSCALL_64_fastpath+0x1a/0xa9
> [  379.360901] RIP: 0033:0x7f824e6d0cad
> [  379.365201] RSP: 002b:00007ffc75259a08 EFLAGS: 00000246 ORIG_RAX:
> 000000000000002c
> [  379.374198] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX:
> 00007f824e6d0cad
> [  379.382481] RDX: 000000000000002c RSI: 00007ffc75259a20 RDI:
> 0000000000000003
> [  379.390746] RBP: 00000000ffffffff R08: 0000000000000000 R09:
> 0000000000000000
> [  379.399010] R10: 0000000000000000 R11: 0000000000000246 R12:
> 0000000000000019
> [  379.407273] R13: 0000000000000030 R14: 00007ffc7525b260 R15:
> 00007ffc75273270
> [  379.415539] Disabling lock debugging due to kernel taint
>
>

Hi Tariq

Thanks for testing.

I won't have time to work on this for a month or so, so I will
appreciate your findings ;)

Thanks !

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2017-03-20 13:12 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-14 15:11 [PATCH v2 net-next] mlx4: Better use of order-0 pages in RX path Eric Dumazet
2017-03-15  4:06 ` Alexei Starovoitov
2017-03-15 13:21   ` Eric Dumazet
2017-03-15 23:06     ` Alexei Starovoitov
2017-03-15 23:34       ` Eric Dumazet
2017-03-16  0:44         ` Alexei Starovoitov
2017-03-16  1:07       ` Eric Dumazet
2017-03-16  1:10         ` Eric Dumazet
2017-03-16  1:56         ` Alexei Starovoitov
2017-03-16  2:48           ` Eric Dumazet
2017-03-16  5:39             ` Alexei Starovoitov
2017-03-16 12:00               ` Eric Dumazet
2017-03-15 15:36 ` Tariq Toukan
2017-03-15 16:27   ` Eric Dumazet
2017-03-20 12:59   ` Tariq Toukan
2017-03-20 13:04     ` Eric Dumazet
