* [PATCH v3 net-next 00/14] mlx4: order-0 allocations and page recycling
From: Eric Dumazet @ 2017-02-13 19:58 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Tariq Toukan, Martin KaFai Lau, Saeed Mahameed,
	Willem de Bruijn, Jesper Dangaard Brouer, Brenden Blanco,
	Alexei Starovoitov, Eric Dumazet, Eric Dumazet

As mentioned half a year ago, we had better switch the mlx4 driver to
order-0 allocations and page recycling.

This reduces the vulnerability surface thanks to better skb->truesize
tracking and provides better performance in most cases.
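
The per-slot allocation then boils down to something like this (a rough
sketch; rx_slot and dev stand in for mlx4_en_rx_alloc and priv->ddev, the
real helper being the mlx4_alloc_page() added later in the series):

	struct rx_slot {			/* stand-in for struct mlx4_en_rx_alloc */
		struct page	*page;
		dma_addr_t	dma;
		u32		page_offset;
	};

	static int rx_alloc_order0_page(struct device *dev, struct rx_slot *slot,
					gfp_t gfp)
	{
		/* One order-0 page per RX slot, DMA-mapped as a whole. */
		struct page *page = alloc_page(gfp);

		if (unlikely(!page))
			return -ENOMEM;
		slot->dma = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
		if (unlikely(dma_mapping_error(dev, slot->dma))) {
			__free_page(page);
			return -ENOMEM;
		}
		slot->page = page;
		slot->page_offset = 0;
		return 0;
	}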

v2 provides a new ethtool -S counter (rx_alloc_pages) and some
code factorization, plus a fix from Tariq.

v3 includes various fixes based on Tariq's tests and on feedback
from Saeed and Tariq.

Worth noting this patch series deletes ~250 lines of code ;)

Eric Dumazet (14):
  mlx4: use __skb_fill_page_desc()
  mlx4: dma_dir is a mlx4_en_priv attribute
  mlx4: remove order field from mlx4_en_frag_info
  mlx4: get rid of frag_prefix_size
  mlx4: rx_headroom is a per port attribute
  mlx4: reduce rx ring page_cache size
  mlx4: removal of frag_sizes[]
  mlx4: use order-0 pages for RX
  mlx4: add page recycling in receive path
  mlx4: add rx_alloc_pages counter in ethtool -S
  mlx4: do not access rx_desc from mlx4_en_process_rx_cq()
  mlx4: factorize page_address() calls
  mlx4: make validate_loopback() more generic
  mlx4: remove duplicate code in mlx4_en_process_rx_cq()

 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c  |   2 +-
 drivers/net/ethernet/mellanox/mlx4/en_port.c     |   2 +
 drivers/net/ethernet/mellanox/mlx4/en_rx.c       | 609 +++++++----------------
 drivers/net/ethernet/mellanox/mlx4/en_selftest.c |   6 -
 drivers/net/ethernet/mellanox/mlx4/en_tx.c       |   4 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h     |  29 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4_stats.h  |   2 +-
 7 files changed, 202 insertions(+), 452 deletions(-)

-- 
2.11.0.483.g087da7b7c-goog


* [PATCH v3 net-next 01/14] mlx4: use __skb_fill_page_desc()
From: Eric Dumazet @ 2017-02-13 19:58 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Tariq Toukan, Martin KaFai Lau, Saeed Mahameed,
	Willem de Bruijn, Jesper Dangaard Brouer, Brenden Blanco,
	Alexei Starovoitov, Eric Dumazet, Eric Dumazet

Use __skb_fill_page_desc(); otherwise we might miss the fact that a page was
allocated from memory reserves, since the helper propagates the pfmemalloc
flag to the skb.
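
For reference, the helper does roughly the following in kernels of this era
(paraphrased from include/linux/skbuff.h), which is where the pfmemalloc
propagation happens:

	static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
						struct page *page, int off, int size)
	{
		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];

		frag->page.p	  = page;
		frag->page_offset = off;
		skb_frag_size_set(frag, size);

		/* The open-coded variant missed this part: */
		page = compound_head(page);
		if (page_is_pfmemalloc(page))
			skb->pfmemalloc = true;
	}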

Fixes: dceeab0e5258 ("mlx4: support __GFP_MEMALLOC for rx")
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index d85e6446f9d99e38c75b97d7fba29bd057e0a16f..867292880c07a15124a0cf099d1fcda09926548e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -604,10 +604,10 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
 		dma_sync_single_for_cpu(priv->ddev, dma, frag_info->frag_size,
 					DMA_FROM_DEVICE);
 
-		/* Save page reference in skb */
-		__skb_frag_set_page(&skb_frags_rx[nr], frags[nr].page);
-		skb_frag_size_set(&skb_frags_rx[nr], frag_info->frag_size);
-		skb_frags_rx[nr].page_offset = frags[nr].page_offset;
+		__skb_fill_page_desc(skb, nr, frags[nr].page,
+				     frags[nr].page_offset,
+				     frag_info->frag_size);
+
 		skb->truesize += frag_info->frag_stride;
 		frags[nr].page = NULL;
 	}
-- 
2.11.0.483.g087da7b7c-goog


* [PATCH v3 net-next 02/14] mlx4: dma_dir is a mlx4_en_priv attribute
From: Eric Dumazet @ 2017-02-13 19:58 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Tariq Toukan, Martin KaFai Lau, Saeed Mahameed,
	Willem de Bruijn, Jesper Dangaard Brouer, Brenden Blanco,
	Alexei Starovoitov, Eric Dumazet, Eric Dumazet

No need to duplicate it for all queues and frags.

num_frags & log_rx_info become u8 to save space.
u8 accesses are a bit faster than u16 anyway.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   | 16 ++++++++--------
 drivers/net/ethernet/mellanox/mlx4/en_tx.c   |  2 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  6 +++---
 3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 867292880c07a15124a0cf099d1fcda09926548e..6183128b2d3d0519b46d14152b15c95ebbf62db7 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -72,7 +72,7 @@ static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
 			return -ENOMEM;
 	}
 	dma = dma_map_page(priv->ddev, page, 0, PAGE_SIZE << order,
-			   frag_info->dma_dir);
+			   priv->dma_dir);
 	if (unlikely(dma_mapping_error(priv->ddev, dma))) {
 		put_page(page);
 		return -ENOMEM;
@@ -128,7 +128,7 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
 		if (page_alloc[i].page != ring_alloc[i].page) {
 			dma_unmap_page(priv->ddev, page_alloc[i].dma,
 				page_alloc[i].page_size,
-				priv->frag_info[i].dma_dir);
+				priv->dma_dir);
 			page = page_alloc[i].page;
 			/* Revert changes done by mlx4_alloc_pages */
 			page_ref_sub(page, page_alloc[i].page_size /
@@ -149,7 +149,7 @@ static void mlx4_en_free_frag(struct mlx4_en_priv *priv,
 
 	if (next_frag_end > frags[i].page_size)
 		dma_unmap_page(priv->ddev, frags[i].dma, frags[i].page_size,
-			       frag_info->dma_dir);
+			       priv->dma_dir);
 
 	if (frags[i].page)
 		put_page(frags[i].page);
@@ -181,7 +181,7 @@ static int mlx4_en_init_allocator(struct mlx4_en_priv *priv,
 		page_alloc = &ring->page_alloc[i];
 		dma_unmap_page(priv->ddev, page_alloc->dma,
 			       page_alloc->page_size,
-			       priv->frag_info[i].dma_dir);
+			       priv->dma_dir);
 		page = page_alloc->page;
 		/* Revert changes done by mlx4_alloc_pages */
 		page_ref_sub(page, page_alloc->page_size /
@@ -206,7 +206,7 @@ static void mlx4_en_destroy_allocator(struct mlx4_en_priv *priv,
 		       i, page_count(page_alloc->page));
 
 		dma_unmap_page(priv->ddev, page_alloc->dma,
-				page_alloc->page_size, frag_info->dma_dir);
+				page_alloc->page_size, priv->dma_dir);
 		while (page_alloc->page_offset + frag_info->frag_stride <
 		       page_alloc->page_size) {
 			put_page(page_alloc->page);
@@ -570,7 +570,7 @@ void mlx4_en_deactivate_rx_ring(struct mlx4_en_priv *priv,
 		struct mlx4_en_rx_alloc *frame = &ring->page_cache.buf[i];
 
 		dma_unmap_page(priv->ddev, frame->dma, frame->page_size,
-			       priv->frag_info[0].dma_dir);
+			       priv->dma_dir);
 		put_page(frame->page);
 	}
 	ring->page_cache.index = 0;
@@ -1202,7 +1202,7 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
 		 * expense of more costly truesize accounting
 		 */
 		priv->frag_info[0].frag_stride = PAGE_SIZE;
-		priv->frag_info[0].dma_dir = PCI_DMA_BIDIRECTIONAL;
+		priv->dma_dir = PCI_DMA_BIDIRECTIONAL;
 		priv->frag_info[0].rx_headroom = XDP_PACKET_HEADROOM;
 		i = 1;
 	} else {
@@ -1217,11 +1217,11 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
 			priv->frag_info[i].frag_stride =
 				ALIGN(priv->frag_info[i].frag_size,
 				      SMP_CACHE_BYTES);
-			priv->frag_info[i].dma_dir = PCI_DMA_FROMDEVICE;
 			priv->frag_info[i].rx_headroom = 0;
 			buf_size += priv->frag_info[i].frag_size;
 			i++;
 		}
+		priv->dma_dir = PCI_DMA_FROMDEVICE;
 	}
 
 	priv->num_frags = i;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 3ed42199d3f1275f77560e92a430c0dde181e95a..98bc67a7249b14f8857fe1fd6baa40ae3ec5a880 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -360,7 +360,7 @@ u32 mlx4_en_recycle_tx_desc(struct mlx4_en_priv *priv,
 
 	if (!mlx4_en_rx_recycle(ring->recycle_ring, &frame)) {
 		dma_unmap_page(priv->ddev, tx_info->map0_dma,
-			       PAGE_SIZE, priv->frag_info[0].dma_dir);
+			       PAGE_SIZE, priv->dma_dir);
 		put_page(tx_info->page);
 	}
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index cec59bc264c9ac197048fd7c98bcd5cf25de0efd..c914915075b06fa60758cad44119507b1c55a57d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -474,7 +474,6 @@ struct mlx4_en_frag_info {
 	u16 frag_size;
 	u16 frag_prefix_size;
 	u32 frag_stride;
-	enum dma_data_direction dma_dir;
 	u16 order;
 	u16 rx_headroom;
 };
@@ -584,8 +583,9 @@ struct mlx4_en_priv {
 	u32 rx_ring_num;
 	u32 rx_skb_size;
 	struct mlx4_en_frag_info frag_info[MLX4_EN_MAX_RX_FRAGS];
-	u16 num_frags;
-	u16 log_rx_info;
+	u8 num_frags;
+	u8 log_rx_info;
+	u8 dma_dir;
 
 	struct mlx4_en_tx_ring **tx_ring[MLX4_EN_NUM_TX_TYPES];
 	struct mlx4_en_rx_ring *rx_ring[MAX_RX_RINGS];
-- 
2.11.0.483.g087da7b7c-goog


* [PATCH v3 net-next 03/14] mlx4: remove order field from mlx4_en_frag_info
From: Eric Dumazet @ 2017-02-13 19:58 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Tariq Toukan, Martin KaFai Lau, Saeed Mahameed,
	Willem de Bruijn, Jesper Dangaard Brouer, Brenden Blanco,
	Alexei Starovoitov, Eric Dumazet, Eric Dumazet

This is really a port attribute; there is no need to duplicate it per
RX queue and per frag.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   | 6 +++---
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 6183128b2d3d0519b46d14152b15c95ebbf62db7..b78d6762e03fc9f0c9e8bfa710efc2e7c86ad306 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -59,7 +59,7 @@ static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
 	struct page *page;
 	dma_addr_t dma;
 
-	for (order = frag_info->order; ;) {
+	for (order = priv->rx_page_order; ;) {
 		gfp_t gfp = _gfp;
 
 		if (order)
@@ -1195,7 +1195,7 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
 	 * This only works when num_frags == 1.
 	 */
 	if (priv->tx_ring_num[TX_XDP]) {
-		priv->frag_info[0].order = 0;
+		priv->rx_page_order = 0;
 		priv->frag_info[0].frag_size = eff_mtu;
 		priv->frag_info[0].frag_prefix_size = 0;
 		/* This will gain efficient xdp frame recycling at the
@@ -1209,7 +1209,6 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
 		int buf_size = 0;
 
 		while (buf_size < eff_mtu) {
-			priv->frag_info[i].order = MLX4_EN_ALLOC_PREFER_ORDER;
 			priv->frag_info[i].frag_size =
 				(eff_mtu > buf_size + frag_sizes[i]) ?
 					frag_sizes[i] : eff_mtu - buf_size;
@@ -1221,6 +1220,7 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
 			buf_size += priv->frag_info[i].frag_size;
 			i++;
 		}
+		priv->rx_page_order = MLX4_EN_ALLOC_PREFER_ORDER;
 		priv->dma_dir = PCI_DMA_FROMDEVICE;
 	}
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index c914915075b06fa60758cad44119507b1c55a57d..670eed48cf3c3f6f76e86e077c997d6f8e655d69 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -474,7 +474,6 @@ struct mlx4_en_frag_info {
 	u16 frag_size;
 	u16 frag_prefix_size;
 	u32 frag_stride;
-	u16 order;
 	u16 rx_headroom;
 };
 
@@ -586,6 +585,7 @@ struct mlx4_en_priv {
 	u8 num_frags;
 	u8 log_rx_info;
 	u8 dma_dir;
+	u8 rx_page_order;
 
 	struct mlx4_en_tx_ring **tx_ring[MLX4_EN_NUM_TX_TYPES];
 	struct mlx4_en_rx_ring *rx_ring[MAX_RX_RINGS];
-- 
2.11.0.483.g087da7b7c-goog


* [PATCH v3 net-next 04/14] mlx4: get rid of frag_prefix_size
From: Eric Dumazet @ 2017-02-13 19:58 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Tariq Toukan, Martin KaFai Lau, Saeed Mahameed,
	Willem de Bruijn, Jesper Dangaard Brouer, Brenden Blanco,
	Alexei Starovoitov, Eric Dumazet, Eric Dumazet

Using per-frag storage for frag_prefix_size is really silly.

mlx4_en_complete_rx_desc() already has all the information it needs.
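
The split across fragments is then driven purely by the remaining length,
e.g. (a sketch with illustrative sizes, using min_t() the way the new loop
does):

	int frag_size, length = 3000, nr = 0;
	int sizes[] = { 1536, 4096, 4096 };	/* frag_info[i].frag_size */

	while (length) {
		frag_size = min_t(int, length, sizes[nr]);	/* 1536, then 1464 */
		length -= frag_size;
		nr++;						/* loop ends with nr == 2 */
	}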

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   | 27 ++++++++++++---------------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  3 +--
 2 files changed, 13 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index b78d6762e03fc9f0c9e8bfa710efc2e7c86ad306..118ea83cff089f614a11f99f6623358f200638a8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -588,15 +588,14 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
 				    int length)
 {
 	struct skb_frag_struct *skb_frags_rx = skb_shinfo(skb)->frags;
-	struct mlx4_en_frag_info *frag_info;
-	int nr;
+	struct mlx4_en_frag_info *frag_info = priv->frag_info;
+	int nr, frag_size;
 	dma_addr_t dma;
 
 	/* Collect used fragments while replacing them in the HW descriptors */
-	for (nr = 0; nr < priv->num_frags; nr++) {
-		frag_info = &priv->frag_info[nr];
-		if (length <= frag_info->frag_prefix_size)
-			break;
+	for (nr = 0;;) {
+		frag_size = min_t(int, length, frag_info->frag_size);
+
 		if (unlikely(!frags[nr].page))
 			goto fail;
 
@@ -606,15 +605,16 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
 
 		__skb_fill_page_desc(skb, nr, frags[nr].page,
 				     frags[nr].page_offset,
-				     frag_info->frag_size);
+				     frag_size);
 
 		skb->truesize += frag_info->frag_stride;
 		frags[nr].page = NULL;
+		nr++;
+		length -= frag_size;
+		if (!length)
+			break;
+		frag_info++;
 	}
-	/* Adjust size of last fragment to match actual length */
-	if (nr > 0)
-		skb_frag_size_set(&skb_frags_rx[nr - 1],
-			length - priv->frag_info[nr - 1].frag_prefix_size);
 	return nr;
 
 fail:
@@ -1197,7 +1197,6 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
 	if (priv->tx_ring_num[TX_XDP]) {
 		priv->rx_page_order = 0;
 		priv->frag_info[0].frag_size = eff_mtu;
-		priv->frag_info[0].frag_prefix_size = 0;
 		/* This will gain efficient xdp frame recycling at the
 		 * expense of more costly truesize accounting
 		 */
@@ -1212,7 +1211,6 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
 			priv->frag_info[i].frag_size =
 				(eff_mtu > buf_size + frag_sizes[i]) ?
 					frag_sizes[i] : eff_mtu - buf_size;
-			priv->frag_info[i].frag_prefix_size = buf_size;
 			priv->frag_info[i].frag_stride =
 				ALIGN(priv->frag_info[i].frag_size,
 				      SMP_CACHE_BYTES);
@@ -1232,10 +1230,9 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
 	       eff_mtu, priv->num_frags);
 	for (i = 0; i < priv->num_frags; i++) {
 		en_err(priv,
-		       "  frag:%d - size:%d prefix:%d stride:%d\n",
+		       "  frag:%d - size:%d stride:%d\n",
 		       i,
 		       priv->frag_info[i].frag_size,
-		       priv->frag_info[i].frag_prefix_size,
 		       priv->frag_info[i].frag_stride);
 	}
 }
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 670eed48cf3c3f6f76e86e077c997d6f8e655d69..edd0f89fc1a1fdb711b8a8aefb4872fec23a5611 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -472,9 +472,8 @@ struct mlx4_en_mc_list {
 
 struct mlx4_en_frag_info {
 	u16 frag_size;
-	u16 frag_prefix_size;
-	u32 frag_stride;
 	u16 rx_headroom;
+	u32 frag_stride;
 };
 
 #ifdef CONFIG_MLX4_EN_DCB
-- 
2.11.0.483.g087da7b7c-goog


* [PATCH v3 net-next 05/14] mlx4: rx_headroom is a per port attribute
From: Eric Dumazet @ 2017-02-13 19:58 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Tariq Toukan, Martin KaFai Lau, Saeed Mahameed,
	Willem de Bruijn, Jesper Dangaard Brouer, Brenden Blanco,
	Alexei Starovoitov, Eric Dumazet, Eric Dumazet

No need to duplicate it per RX queue / frag.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   | 6 +++---
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 118ea83cff089f614a11f99f6623358f200638a8..bb33032a280f00ee62cc39d4261e72543ed0434e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -115,7 +115,7 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
 
 	for (i = 0; i < priv->num_frags; i++) {
 		frags[i] = ring_alloc[i];
-		frags[i].page_offset += priv->frag_info[i].rx_headroom;
+		frags[i].page_offset += priv->rx_headroom;
 		rx_desc->data[i].addr = cpu_to_be64(frags[i].dma +
 						    frags[i].page_offset);
 		ring_alloc[i] = page_alloc[i];
@@ -1202,7 +1202,7 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
 		 */
 		priv->frag_info[0].frag_stride = PAGE_SIZE;
 		priv->dma_dir = PCI_DMA_BIDIRECTIONAL;
-		priv->frag_info[0].rx_headroom = XDP_PACKET_HEADROOM;
+		priv->rx_headroom = XDP_PACKET_HEADROOM;
 		i = 1;
 	} else {
 		int buf_size = 0;
@@ -1214,12 +1214,12 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
 			priv->frag_info[i].frag_stride =
 				ALIGN(priv->frag_info[i].frag_size,
 				      SMP_CACHE_BYTES);
-			priv->frag_info[i].rx_headroom = 0;
 			buf_size += priv->frag_info[i].frag_size;
 			i++;
 		}
 		priv->rx_page_order = MLX4_EN_ALLOC_PREFER_ORDER;
 		priv->dma_dir = PCI_DMA_FROMDEVICE;
+		priv->rx_headroom = 0;
 	}
 
 	priv->num_frags = i;
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index edd0f89fc1a1fdb711b8a8aefb4872fec23a5611..535bf2478d5fe7433b83687e04dedccf127838c7 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -472,7 +472,6 @@ struct mlx4_en_mc_list {
 
 struct mlx4_en_frag_info {
 	u16 frag_size;
-	u16 rx_headroom;
 	u32 frag_stride;
 };
 
@@ -585,6 +584,7 @@ struct mlx4_en_priv {
 	u8 log_rx_info;
 	u8 dma_dir;
 	u8 rx_page_order;
+	u16 rx_headroom;
 
 	struct mlx4_en_tx_ring **tx_ring[MLX4_EN_NUM_TX_TYPES];
 	struct mlx4_en_rx_ring *rx_ring[MAX_RX_RINGS];
-- 
2.11.0.483.g087da7b7c-goog


* [PATCH v3 net-next 06/14] mlx4: reduce rx ring page_cache size
From: Eric Dumazet @ 2017-02-13 19:58 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Tariq Toukan, Martin KaFai Lau, Saeed Mahameed,
	Willem de Bruijn, Jesper Dangaard Brouer, Brenden Blanco,
	Alexei Starovoitov, Eric Dumazet, Eric Dumazet

We only need to store the page and dma address.
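
Back-of-envelope, assuming a 64-bit build and NAPI_POLL_WEIGHT == 64:

	old entry: page (8) + dma (8) + page_offset (4) + page_size (4) = 24 bytes
	new entry: page (8) + dma (8)                                   = 16 bytes
	cache    : 2 * NAPI_POLL_WEIGHT = 128 entries -> 3072 vs 2048 bytes per ring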

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   | 17 ++++++++++-------
 drivers/net/ethernet/mellanox/mlx4/en_tx.c   |  2 --
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  6 +++++-
 3 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index bb33032a280f00ee62cc39d4261e72543ed0434e..453313d404e3698b0d41a8220005b3834c9d68a1 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -250,7 +250,10 @@ static int mlx4_en_prepare_rx_desc(struct mlx4_en_priv *priv,
 					(index << priv->log_rx_info);
 
 	if (ring->page_cache.index > 0) {
-		frags[0] = ring->page_cache.buf[--ring->page_cache.index];
+		ring->page_cache.index--;
+		frags[0].page = ring->page_cache.buf[ring->page_cache.index].page;
+		frags[0].dma  = ring->page_cache.buf[ring->page_cache.index].dma;
+		frags[0].page_offset = XDP_PACKET_HEADROOM;
 		rx_desc->data[0].addr = cpu_to_be64(frags[0].dma +
 						    frags[0].page_offset);
 		return 0;
@@ -537,7 +540,9 @@ bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
 	if (cache->index >= MLX4_EN_CACHE_SIZE)
 		return false;
 
-	cache->buf[cache->index++] = *frame;
+	cache->buf[cache->index].page = frame->page;
+	cache->buf[cache->index].dma = frame->dma;
+	cache->index++;
 	return true;
 }
 
@@ -567,11 +572,9 @@ void mlx4_en_deactivate_rx_ring(struct mlx4_en_priv *priv,
 	int i;
 
 	for (i = 0; i < ring->page_cache.index; i++) {
-		struct mlx4_en_rx_alloc *frame = &ring->page_cache.buf[i];
-
-		dma_unmap_page(priv->ddev, frame->dma, frame->page_size,
-			       priv->dma_dir);
-		put_page(frame->page);
+		dma_unmap_page(priv->ddev, ring->page_cache.buf[i].dma,
+			       PAGE_SIZE, priv->dma_dir);
+		put_page(ring->page_cache.buf[i].page);
 	}
 	ring->page_cache.index = 0;
 	mlx4_en_free_rx_buf(priv, ring);
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 98bc67a7249b14f8857fe1fd6baa40ae3ec5a880..e0c5ffb3e3a6607456e1f191b0b8c8becfc71219 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -354,8 +354,6 @@ u32 mlx4_en_recycle_tx_desc(struct mlx4_en_priv *priv,
 	struct mlx4_en_rx_alloc frame = {
 		.page = tx_info->page,
 		.dma = tx_info->map0_dma,
-		.page_offset = XDP_PACKET_HEADROOM,
-		.page_size = PAGE_SIZE,
 	};
 
 	if (!mlx4_en_rx_recycle(ring->recycle_ring, &frame)) {
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 535bf2478d5fe7433b83687e04dedccf127838c7..d92e5228d4248f28151cba117a9485c71ef5b593 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -267,9 +267,13 @@ struct mlx4_en_rx_alloc {
 };
 
 #define MLX4_EN_CACHE_SIZE (2 * NAPI_POLL_WEIGHT)
+
 struct mlx4_en_page_cache {
 	u32 index;
-	struct mlx4_en_rx_alloc buf[MLX4_EN_CACHE_SIZE];
+	struct {
+		struct page	*page;
+		dma_addr_t	dma;
+	} buf[MLX4_EN_CACHE_SIZE];
 };
 
 struct mlx4_en_priv;
-- 
2.11.0.483.g087da7b7c-goog


* [PATCH v3 net-next 07/14] mlx4: removal of frag_sizes[]
From: Eric Dumazet @ 2017-02-13 19:58 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Tariq Toukan, Martin KaFai Lau, Saeed Mahameed,
	Willem de Bruijn, Jesper Dangaard Brouer, Brenden Blanco,
	Alexei Starovoitov, Eric Dumazet, Eric Dumazet

We will soon use order-0 pages, and frag truesize will more precisely
match real sizes.

In the new model, we prefer to use <= 2048-byte fragments, so that
we can use the page-recycle technique on PAGE_SIZE=4096 arches.

We will still pack as many frames as possible on arches with big
pages, like PowerPC.
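
For example, with 4K pages and an effective MTU of about 9000 bytes, the new
loop would lay a frame out roughly as follows (only frags with
i < MLX4_EN_MAX_RX_FRAGS - 1 are capped at 2048, the last one takes the rest):

	frag 0: 2048 bytes
	frag 1: 2048 bytes
	frag 2: 2048 bytes
	frag 3: 2856 bytes	(the remainder, 9000 - 3 * 2048)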

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   | 24 ++++++++++--------------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  8 --------
 2 files changed, 10 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 453313d404e3698b0d41a8220005b3834c9d68a1..0c61c1200f2aec4c74d7403d9ec3ed06821c1bac 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -1181,13 +1181,6 @@ int mlx4_en_poll_rx_cq(struct napi_struct *napi, int budget)
 	return done;
 }
 
-static const int frag_sizes[] = {
-	FRAG_SZ0,
-	FRAG_SZ1,
-	FRAG_SZ2,
-	FRAG_SZ3
-};
-
 void mlx4_en_calc_rx_buf(struct net_device *dev)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
@@ -1211,13 +1204,16 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
 		int buf_size = 0;
 
 		while (buf_size < eff_mtu) {
-			priv->frag_info[i].frag_size =
-				(eff_mtu > buf_size + frag_sizes[i]) ?
-					frag_sizes[i] : eff_mtu - buf_size;
-			priv->frag_info[i].frag_stride =
-				ALIGN(priv->frag_info[i].frag_size,
-				      SMP_CACHE_BYTES);
-			buf_size += priv->frag_info[i].frag_size;
+			int frag_size = eff_mtu - buf_size;
+
+			if (i < MLX4_EN_MAX_RX_FRAGS - 1)
+				frag_size = min(frag_size, 2048);
+
+			priv->frag_info[i].frag_size = frag_size;
+
+			priv->frag_info[i].frag_stride = ALIGN(frag_size,
+							       SMP_CACHE_BYTES);
+			buf_size += frag_size;
 			i++;
 		}
 		priv->rx_page_order = MLX4_EN_ALLOC_PREFER_ORDER;
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index d92e5228d4248f28151cba117a9485c71ef5b593..92c6cf6de9452d5d1b2092a17a8dad5348db8900 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -104,14 +104,6 @@
 
 #define MLX4_EN_ALLOC_PREFER_ORDER	PAGE_ALLOC_COSTLY_ORDER
 
-/* Receive fragment sizes; we use at most 3 fragments (for 9600 byte MTU
- * and 4K allocations) */
-enum {
-	FRAG_SZ0 = 1536 - NET_IP_ALIGN,
-	FRAG_SZ1 = 4096,
-	FRAG_SZ2 = 4096,
-	FRAG_SZ3 = MLX4_EN_ALLOC_SIZE
-};
 #define MLX4_EN_MAX_RX_FRAGS	4
 
 /* Maximum ring sizes */
-- 
2.11.0.483.g087da7b7c-goog


* [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
From: Eric Dumazet @ 2017-02-13 19:58 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Tariq Toukan, Martin KaFai Lau, Saeed Mahameed,
	Willem de Bruijn, Jesper Dangaard Brouer, Brenden Blanco,
	Alexei Starovoitov, Eric Dumazet, Eric Dumazet

Use of order-3 pages is problematic in some cases.

This patch might add three kinds of regression:

1) a CPU performance regression, but we will add page recycling later
and performance should come back.

2) A TCP receiver could grow its receive window slightly more slowly,
   because the skb->len/skb->truesize ratio will decrease.
   This is mostly OK; we prefer being conservative so as not to risk OOM,
   and we can eventually tune TCP better in the future.
   This is consistent with other drivers using 2048 bytes per Ethernet
   frame (see the truesize sketch after this list).

3) Because we allocate one page per RX slot, we consume more
   memory for the ring buffers. XDP already had this constraint anyway.
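
To illustrate 2), here is the new stride computation for a 1536-byte first
fragment on a 4K page, assuming SMP_CACHE_BYTES == 64 (it follows the hunk
below):

	frag_stride = ALIGN(1536, 64)        = 1536
	nb          = PAGE_SIZE / 1536       = 2	/* two frames per page */
	pad         = (4096 - 2 * 1536) / 2  = 512
	pad        &= ~(64 - 1)              = 512
	stride used (per-frame truesize)     = 1536 + 512 = 2048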

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   | 72 +++++++++++++---------------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  4 --
 2 files changed, 33 insertions(+), 43 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 0c61c1200f2aec4c74d7403d9ec3ed06821c1bac..069ea09185fb0669d5c1f9b1b88f517b2d69c5ed 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -53,38 +53,26 @@
 static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
 			    struct mlx4_en_rx_alloc *page_alloc,
 			    const struct mlx4_en_frag_info *frag_info,
-			    gfp_t _gfp)
+			    gfp_t gfp)
 {
-	int order;
 	struct page *page;
 	dma_addr_t dma;
 
-	for (order = priv->rx_page_order; ;) {
-		gfp_t gfp = _gfp;
-
-		if (order)
-			gfp |= __GFP_COMP | __GFP_NOWARN | __GFP_NOMEMALLOC;
-		page = alloc_pages(gfp, order);
-		if (likely(page))
-			break;
-		if (--order < 0 ||
-		    ((PAGE_SIZE << order) < frag_info->frag_size))
-			return -ENOMEM;
-	}
-	dma = dma_map_page(priv->ddev, page, 0, PAGE_SIZE << order,
-			   priv->dma_dir);
+	page = alloc_page(gfp);
+	if (unlikely(!page))
+		return -ENOMEM;
+	dma = dma_map_page(priv->ddev, page, 0, PAGE_SIZE, priv->dma_dir);
 	if (unlikely(dma_mapping_error(priv->ddev, dma))) {
 		put_page(page);
 		return -ENOMEM;
 	}
-	page_alloc->page_size = PAGE_SIZE << order;
 	page_alloc->page = page;
 	page_alloc->dma = dma;
 	page_alloc->page_offset = 0;
 	/* Not doing get_page() for each frag is a big win
 	 * on asymetric workloads. Note we can not use atomic_set().
 	 */
-	page_ref_add(page, page_alloc->page_size / frag_info->frag_stride - 1);
+	page_ref_add(page, PAGE_SIZE / frag_info->frag_stride - 1);
 	return 0;
 }
 
@@ -105,7 +93,7 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
 		page_alloc[i].page_offset += frag_info->frag_stride;
 
 		if (page_alloc[i].page_offset + frag_info->frag_stride <=
-		    ring_alloc[i].page_size)
+		    PAGE_SIZE)
 			continue;
 
 		if (unlikely(mlx4_alloc_pages(priv, &page_alloc[i],
@@ -127,11 +115,10 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
 	while (i--) {
 		if (page_alloc[i].page != ring_alloc[i].page) {
 			dma_unmap_page(priv->ddev, page_alloc[i].dma,
-				page_alloc[i].page_size,
-				priv->dma_dir);
+				       PAGE_SIZE, priv->dma_dir);
 			page = page_alloc[i].page;
 			/* Revert changes done by mlx4_alloc_pages */
-			page_ref_sub(page, page_alloc[i].page_size /
+			page_ref_sub(page, PAGE_SIZE /
 					   priv->frag_info[i].frag_stride - 1);
 			put_page(page);
 		}
@@ -147,8 +134,8 @@ static void mlx4_en_free_frag(struct mlx4_en_priv *priv,
 	u32 next_frag_end = frags[i].page_offset + 2 * frag_info->frag_stride;
 
 
-	if (next_frag_end > frags[i].page_size)
-		dma_unmap_page(priv->ddev, frags[i].dma, frags[i].page_size,
+	if (next_frag_end > PAGE_SIZE)
+		dma_unmap_page(priv->ddev, frags[i].dma, PAGE_SIZE,
 			       priv->dma_dir);
 
 	if (frags[i].page)
@@ -168,9 +155,8 @@ static int mlx4_en_init_allocator(struct mlx4_en_priv *priv,
 				     frag_info, GFP_KERNEL | __GFP_COLD))
 			goto out;
 
-		en_dbg(DRV, priv, "  frag %d allocator: - size:%d frags:%d\n",
-		       i, ring->page_alloc[i].page_size,
-		       page_ref_count(ring->page_alloc[i].page));
+		en_dbg(DRV, priv, "  frag %d allocator: - frags:%d\n",
+		       i, page_ref_count(ring->page_alloc[i].page));
 	}
 	return 0;
 
@@ -180,11 +166,10 @@ static int mlx4_en_init_allocator(struct mlx4_en_priv *priv,
 
 		page_alloc = &ring->page_alloc[i];
 		dma_unmap_page(priv->ddev, page_alloc->dma,
-			       page_alloc->page_size,
-			       priv->dma_dir);
+			       PAGE_SIZE, priv->dma_dir);
 		page = page_alloc->page;
 		/* Revert changes done by mlx4_alloc_pages */
-		page_ref_sub(page, page_alloc->page_size /
+		page_ref_sub(page, PAGE_SIZE /
 				   priv->frag_info[i].frag_stride - 1);
 		put_page(page);
 		page_alloc->page = NULL;
@@ -206,9 +191,9 @@ static void mlx4_en_destroy_allocator(struct mlx4_en_priv *priv,
 		       i, page_count(page_alloc->page));
 
 		dma_unmap_page(priv->ddev, page_alloc->dma,
-				page_alloc->page_size, priv->dma_dir);
+			       PAGE_SIZE, priv->dma_dir);
 		while (page_alloc->page_offset + frag_info->frag_stride <
-		       page_alloc->page_size) {
+		       PAGE_SIZE) {
 			put_page(page_alloc->page);
 			page_alloc->page_offset += frag_info->frag_stride;
 		}
@@ -1191,7 +1176,6 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
 	 * This only works when num_frags == 1.
 	 */
 	if (priv->tx_ring_num[TX_XDP]) {
-		priv->rx_page_order = 0;
 		priv->frag_info[0].frag_size = eff_mtu;
 		/* This will gain efficient xdp frame recycling at the
 		 * expense of more costly truesize accounting
@@ -1201,22 +1185,32 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
 		priv->rx_headroom = XDP_PACKET_HEADROOM;
 		i = 1;
 	} else {
-		int buf_size = 0;
+		int frag_size_max = 2048, buf_size = 0;
+
+		/* should not happen, right ? */
+		if (eff_mtu > PAGE_SIZE + (MLX4_EN_MAX_RX_FRAGS - 1) * 2048)
+			frag_size_max = PAGE_SIZE;
 
 		while (buf_size < eff_mtu) {
-			int frag_size = eff_mtu - buf_size;
+			int frag_stride, frag_size = eff_mtu - buf_size;
+			int pad, nb;
 
 			if (i < MLX4_EN_MAX_RX_FRAGS - 1)
-				frag_size = min(frag_size, 2048);
+				frag_size = min(frag_size, frag_size_max);
 
 			priv->frag_info[i].frag_size = frag_size;
+			frag_stride = ALIGN(frag_size, SMP_CACHE_BYTES);
+			/* We can only pack 2 1536-bytes frames in on 4K page
+			 * Therefore, each frame would consume more bytes (truesize)
+			 */
+			nb = PAGE_SIZE / frag_stride;
+			pad = (PAGE_SIZE - nb * frag_stride) / nb;
+			pad &= ~(SMP_CACHE_BYTES - 1);
+			priv->frag_info[i].frag_stride = frag_stride + pad;
 
-			priv->frag_info[i].frag_stride = ALIGN(frag_size,
-							       SMP_CACHE_BYTES);
 			buf_size += frag_size;
 			i++;
 		}
-		priv->rx_page_order = MLX4_EN_ALLOC_PREFER_ORDER;
 		priv->dma_dir = PCI_DMA_FROMDEVICE;
 		priv->rx_headroom = 0;
 	}
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 92c6cf6de9452d5d1b2092a17a8dad5348db8900..4f7d1503277d2e030854deabeb22b96fa9315c5d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -102,8 +102,6 @@
 /* Use the maximum between 16384 and a single page */
 #define MLX4_EN_ALLOC_SIZE	PAGE_ALIGN(16384)
 
-#define MLX4_EN_ALLOC_PREFER_ORDER	PAGE_ALLOC_COSTLY_ORDER
-
 #define MLX4_EN_MAX_RX_FRAGS	4
 
 /* Maximum ring sizes */
@@ -255,7 +253,6 @@ struct mlx4_en_rx_alloc {
 	struct page	*page;
 	dma_addr_t	dma;
 	u32		page_offset;
-	u32		page_size;
 };
 
 #define MLX4_EN_CACHE_SIZE (2 * NAPI_POLL_WEIGHT)
@@ -579,7 +576,6 @@ struct mlx4_en_priv {
 	u8 num_frags;
 	u8 log_rx_info;
 	u8 dma_dir;
-	u8 rx_page_order;
 	u16 rx_headroom;
 
 	struct mlx4_en_tx_ring **tx_ring[MLX4_EN_NUM_TX_TYPES];
-- 
2.11.0.483.g087da7b7c-goog


* [PATCH v3 net-next 09/14] mlx4: add page recycling in receive path
From: Eric Dumazet @ 2017-02-13 19:58 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Tariq Toukan, Martin KaFai Lau, Saeed Mahameed,
	Willem de Bruijn, Jesper Dangaard Brouer, Brenden Blanco,
	Alexei Starovoitov, Eric Dumazet, Eric Dumazet

Same technique as some Intel drivers, for arches where PAGE_SIZE = 4096.

In most cases, pages are reused because they were consumed
before we could loop around the RX ring.

This brings back performance, and is even better:
a single TCP flow reaches 30Gbit on my hosts.

v2: added a full memset() in mlx4_en_free_frag(), as Tariq found it is needed
if we switch to a large MTU, since priv->log_rx_info can change dynamically.
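
For the common 2048-byte stride on a 4K page, the recycling decision boils
down to this (a condensed sketch of the mlx4_en_complete_rx_desc() hunk
below):

	frags->page_offset ^= PAGE_SIZE / 2;	/* flip between the two halves */

	release = page_count(page) != 1 ||	/* someone else still holds it */
		  page_is_pfmemalloc(page) ||	/* came from memory reserves   */
		  page_to_nid(page) != numa_mem_id();
	if (release) {
		dma_unmap_page(priv->ddev, dma, PAGE_SIZE, priv->dma_dir);
		frags->page = NULL;
	} else {
		page_ref_inc(page);		/* keep the page for the next frame */
	}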

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   | 258 +++++++++------------------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |   1 -
 2 files changed, 82 insertions(+), 177 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 069ea09185fb0669d5c1f9b1b88f517b2d69c5ed..5edd0cf2999cbde37b3404aafd700584696af940 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -50,10 +50,9 @@
 
 #include "mlx4_en.h"
 
-static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
-			    struct mlx4_en_rx_alloc *page_alloc,
-			    const struct mlx4_en_frag_info *frag_info,
-			    gfp_t gfp)
+static int mlx4_alloc_page(struct mlx4_en_priv *priv,
+			   struct mlx4_en_rx_alloc *frag,
+			   gfp_t gfp)
 {
 	struct page *page;
 	dma_addr_t dma;
@@ -63,145 +62,46 @@ static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
 		return -ENOMEM;
 	dma = dma_map_page(priv->ddev, page, 0, PAGE_SIZE, priv->dma_dir);
 	if (unlikely(dma_mapping_error(priv->ddev, dma))) {
-		put_page(page);
+		__free_page(page);
 		return -ENOMEM;
 	}
-	page_alloc->page = page;
-	page_alloc->dma = dma;
-	page_alloc->page_offset = 0;
-	/* Not doing get_page() for each frag is a big win
-	 * on asymetric workloads. Note we can not use atomic_set().
-	 */
-	page_ref_add(page, PAGE_SIZE / frag_info->frag_stride - 1);
+	frag->page = page;
+	frag->dma = dma;
+	frag->page_offset = priv->rx_headroom;
 	return 0;
 }
 
 static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
 			       struct mlx4_en_rx_desc *rx_desc,
 			       struct mlx4_en_rx_alloc *frags,
-			       struct mlx4_en_rx_alloc *ring_alloc,
 			       gfp_t gfp)
 {
-	struct mlx4_en_rx_alloc page_alloc[MLX4_EN_MAX_RX_FRAGS];
-	const struct mlx4_en_frag_info *frag_info;
-	struct page *page;
 	int i;
 
-	for (i = 0; i < priv->num_frags; i++) {
-		frag_info = &priv->frag_info[i];
-		page_alloc[i] = ring_alloc[i];
-		page_alloc[i].page_offset += frag_info->frag_stride;
-
-		if (page_alloc[i].page_offset + frag_info->frag_stride <=
-		    PAGE_SIZE)
-			continue;
-
-		if (unlikely(mlx4_alloc_pages(priv, &page_alloc[i],
-					      frag_info, gfp)))
-			goto out;
-	}
-
-	for (i = 0; i < priv->num_frags; i++) {
-		frags[i] = ring_alloc[i];
-		frags[i].page_offset += priv->rx_headroom;
-		rx_desc->data[i].addr = cpu_to_be64(frags[i].dma +
-						    frags[i].page_offset);
-		ring_alloc[i] = page_alloc[i];
-	}
-
-	return 0;
-
-out:
-	while (i--) {
-		if (page_alloc[i].page != ring_alloc[i].page) {
-			dma_unmap_page(priv->ddev, page_alloc[i].dma,
-				       PAGE_SIZE, priv->dma_dir);
-			page = page_alloc[i].page;
-			/* Revert changes done by mlx4_alloc_pages */
-			page_ref_sub(page, PAGE_SIZE /
-					   priv->frag_info[i].frag_stride - 1);
-			put_page(page);
-		}
-	}
-	return -ENOMEM;
-}
-
-static void mlx4_en_free_frag(struct mlx4_en_priv *priv,
-			      struct mlx4_en_rx_alloc *frags,
-			      int i)
-{
-	const struct mlx4_en_frag_info *frag_info = &priv->frag_info[i];
-	u32 next_frag_end = frags[i].page_offset + 2 * frag_info->frag_stride;
-
-
-	if (next_frag_end > PAGE_SIZE)
-		dma_unmap_page(priv->ddev, frags[i].dma, PAGE_SIZE,
-			       priv->dma_dir);
-
-	if (frags[i].page)
-		put_page(frags[i].page);
-}
-
-static int mlx4_en_init_allocator(struct mlx4_en_priv *priv,
-				  struct mlx4_en_rx_ring *ring)
-{
-	int i;
-	struct mlx4_en_rx_alloc *page_alloc;
-
-	for (i = 0; i < priv->num_frags; i++) {
-		const struct mlx4_en_frag_info *frag_info = &priv->frag_info[i];
-
-		if (mlx4_alloc_pages(priv, &ring->page_alloc[i],
-				     frag_info, GFP_KERNEL | __GFP_COLD))
-			goto out;
-
-		en_dbg(DRV, priv, "  frag %d allocator: - frags:%d\n",
-		       i, page_ref_count(ring->page_alloc[i].page));
+	for (i = 0; i < priv->num_frags; i++, frags++) {
+		if (!frags->page && mlx4_alloc_page(priv, frags, gfp))
+			return -ENOMEM;
+		rx_desc->data[i].addr = cpu_to_be64(frags->dma +
+						    frags->page_offset);
 	}
 	return 0;
-
-out:
-	while (i--) {
-		struct page *page;
-
-		page_alloc = &ring->page_alloc[i];
-		dma_unmap_page(priv->ddev, page_alloc->dma,
-			       PAGE_SIZE, priv->dma_dir);
-		page = page_alloc->page;
-		/* Revert changes done by mlx4_alloc_pages */
-		page_ref_sub(page, PAGE_SIZE /
-				   priv->frag_info[i].frag_stride - 1);
-		put_page(page);
-		page_alloc->page = NULL;
-	}
-	return -ENOMEM;
 }
 
-static void mlx4_en_destroy_allocator(struct mlx4_en_priv *priv,
-				      struct mlx4_en_rx_ring *ring)
+static void mlx4_en_free_frag(const struct mlx4_en_priv *priv,
+			      struct mlx4_en_rx_alloc *frag)
 {
-	struct mlx4_en_rx_alloc *page_alloc;
-	int i;
-
-	for (i = 0; i < priv->num_frags; i++) {
-		const struct mlx4_en_frag_info *frag_info = &priv->frag_info[i];
-
-		page_alloc = &ring->page_alloc[i];
-		en_dbg(DRV, priv, "Freeing allocator:%d count:%d\n",
-		       i, page_count(page_alloc->page));
-
-		dma_unmap_page(priv->ddev, page_alloc->dma,
+	if (frag->page) {
+		dma_unmap_page(priv->ddev, frag->dma,
 			       PAGE_SIZE, priv->dma_dir);
-		while (page_alloc->page_offset + frag_info->frag_stride <
-		       PAGE_SIZE) {
-			put_page(page_alloc->page);
-			page_alloc->page_offset += frag_info->frag_stride;
-		}
-		page_alloc->page = NULL;
+		__free_page(frag->page);
 	}
+	/* We need to clear all fields, otherwise a change of priv->log_rx_info
+	 * could lead to see garbage later in frag->page.
+	 */
+	memset(frag, 0, sizeof(*frag));
 }
 
-static void mlx4_en_init_rx_desc(struct mlx4_en_priv *priv,
+static void mlx4_en_init_rx_desc(const struct mlx4_en_priv *priv,
 				 struct mlx4_en_rx_ring *ring, int index)
 {
 	struct mlx4_en_rx_desc *rx_desc = ring->buf + ring->stride * index;
@@ -235,19 +135,22 @@ static int mlx4_en_prepare_rx_desc(struct mlx4_en_priv *priv,
 					(index << priv->log_rx_info);
 
 	if (ring->page_cache.index > 0) {
-		ring->page_cache.index--;
-		frags[0].page = ring->page_cache.buf[ring->page_cache.index].page;
-		frags[0].dma  = ring->page_cache.buf[ring->page_cache.index].dma;
-		frags[0].page_offset = XDP_PACKET_HEADROOM;
-		rx_desc->data[0].addr = cpu_to_be64(frags[0].dma +
-						    frags[0].page_offset);
+		/* XDP uses a single page per frame */
+		if (!frags->page) {
+			ring->page_cache.index--;
+			frags->page = ring->page_cache.buf[ring->page_cache.index].page;
+			frags->dma  = ring->page_cache.buf[ring->page_cache.index].dma;
+		}
+		frags->page_offset = XDP_PACKET_HEADROOM;
+		rx_desc->data[0].addr = cpu_to_be64(frags->dma +
+						    XDP_PACKET_HEADROOM);
 		return 0;
 	}
 
-	return mlx4_en_alloc_frags(priv, rx_desc, frags, ring->page_alloc, gfp);
+	return mlx4_en_alloc_frags(priv, rx_desc, frags, gfp);
 }
 
-static inline bool mlx4_en_is_ring_empty(struct mlx4_en_rx_ring *ring)
+static bool mlx4_en_is_ring_empty(const struct mlx4_en_rx_ring *ring)
 {
 	return ring->prod == ring->cons;
 }
@@ -257,7 +160,8 @@ static inline void mlx4_en_update_rx_prod_db(struct mlx4_en_rx_ring *ring)
 	*ring->wqres.db.db = cpu_to_be32(ring->prod & 0xffff);
 }
 
-static void mlx4_en_free_rx_desc(struct mlx4_en_priv *priv,
+/* slow path */
+static void mlx4_en_free_rx_desc(const struct mlx4_en_priv *priv,
 				 struct mlx4_en_rx_ring *ring,
 				 int index)
 {
@@ -267,7 +171,7 @@ static void mlx4_en_free_rx_desc(struct mlx4_en_priv *priv,
 	frags = ring->rx_info + (index << priv->log_rx_info);
 	for (nr = 0; nr < priv->num_frags; nr++) {
 		en_dbg(DRV, priv, "Freeing fragment:%d\n", nr);
-		mlx4_en_free_frag(priv, frags, nr);
+		mlx4_en_free_frag(priv, frags + nr);
 	}
 }
 
@@ -323,12 +227,12 @@ static void mlx4_en_free_rx_buf(struct mlx4_en_priv *priv,
 	       ring->cons, ring->prod);
 
 	/* Unmap and free Rx buffers */
-	while (!mlx4_en_is_ring_empty(ring)) {
-		index = ring->cons & ring->size_mask;
+	for (index = 0; index < ring->size; index++) {
 		en_dbg(DRV, priv, "Processing descriptor:%d\n", index);
 		mlx4_en_free_rx_desc(priv, ring, index);
-		++ring->cons;
 	}
+	ring->cons = 0;
+	ring->prod = 0;
 }
 
 void mlx4_en_set_num_rx_rings(struct mlx4_en_dev *mdev)
@@ -380,9 +284,9 @@ int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
 
 	tmp = size * roundup_pow_of_two(MLX4_EN_MAX_RX_FRAGS *
 					sizeof(struct mlx4_en_rx_alloc));
-	ring->rx_info = vmalloc_node(tmp, node);
+	ring->rx_info = vzalloc_node(tmp, node);
 	if (!ring->rx_info) {
-		ring->rx_info = vmalloc(tmp);
+		ring->rx_info = vzalloc(tmp);
 		if (!ring->rx_info) {
 			err = -ENOMEM;
 			goto err_ring;
@@ -452,16 +356,6 @@ int mlx4_en_activate_rx_rings(struct mlx4_en_priv *priv)
 		/* Initialize all descriptors */
 		for (i = 0; i < ring->size; i++)
 			mlx4_en_init_rx_desc(priv, ring, i);
-
-		/* Initialize page allocators */
-		err = mlx4_en_init_allocator(priv, ring);
-		if (err) {
-			en_err(priv, "Failed initializing ring allocator\n");
-			if (ring->stride <= TXBB_SIZE)
-				ring->buf -= TXBB_SIZE;
-			ring_ind--;
-			goto err_allocator;
-		}
 	}
 	err = mlx4_en_fill_rx_buffers(priv);
 	if (err)
@@ -481,11 +375,9 @@ int mlx4_en_activate_rx_rings(struct mlx4_en_priv *priv)
 		mlx4_en_free_rx_buf(priv, priv->rx_ring[ring_ind]);
 
 	ring_ind = priv->rx_ring_num - 1;
-err_allocator:
 	while (ring_ind >= 0) {
 		if (priv->rx_ring[ring_ind]->stride <= TXBB_SIZE)
 			priv->rx_ring[ring_ind]->buf -= TXBB_SIZE;
-		mlx4_en_destroy_allocator(priv, priv->rx_ring[ring_ind]);
 		ring_ind--;
 	}
 	return err;
@@ -565,50 +457,68 @@ void mlx4_en_deactivate_rx_ring(struct mlx4_en_priv *priv,
 	mlx4_en_free_rx_buf(priv, ring);
 	if (ring->stride <= TXBB_SIZE)
 		ring->buf -= TXBB_SIZE;
-	mlx4_en_destroy_allocator(priv, ring);
 }
 
 
 static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
-				    struct mlx4_en_rx_desc *rx_desc,
 				    struct mlx4_en_rx_alloc *frags,
 				    struct sk_buff *skb,
 				    int length)
 {
-	struct skb_frag_struct *skb_frags_rx = skb_shinfo(skb)->frags;
-	struct mlx4_en_frag_info *frag_info = priv->frag_info;
+	const struct mlx4_en_frag_info *frag_info = priv->frag_info;
+	unsigned int truesize = 0;
 	int nr, frag_size;
+	struct page *page;
 	dma_addr_t dma;
+	bool release;
 
 	/* Collect used fragments while replacing them in the HW descriptors */
-	for (nr = 0;;) {
+	for (nr = 0;; frags++) {
 		frag_size = min_t(int, length, frag_info->frag_size);
 
-		if (unlikely(!frags[nr].page))
+		page = frags->page;
+		if (unlikely(!page))
 			goto fail;
 
-		dma = be64_to_cpu(rx_desc->data[nr].addr);
-		dma_sync_single_for_cpu(priv->ddev, dma, frag_info->frag_size,
-					DMA_FROM_DEVICE);
+		dma = frags->dma;
+		dma_sync_single_range_for_cpu(priv->ddev, dma, frags->page_offset,
+					      frag_size, priv->dma_dir);
 
-		__skb_fill_page_desc(skb, nr, frags[nr].page,
-				     frags[nr].page_offset,
+		__skb_fill_page_desc(skb, nr, page, frags->page_offset,
 				     frag_size);
 
-		skb->truesize += frag_info->frag_stride;
-		frags[nr].page = NULL;
+		truesize += frag_info->frag_stride;
+		if (frag_info->frag_stride == PAGE_SIZE / 2) {
+			frags->page_offset ^= PAGE_SIZE / 2;
+			release = page_count(page) != 1 ||
+				  page_is_pfmemalloc(page) ||
+				  page_to_nid(page) != numa_mem_id();
+		} else {
+			u32 sz_align = ALIGN(frag_size, SMP_CACHE_BYTES);
+
+			frags->page_offset += sz_align;
+			release = frags->page_offset + frag_info->frag_size > PAGE_SIZE;
+		}
+		if (release) {
+			dma_unmap_page(priv->ddev, dma, PAGE_SIZE, priv->dma_dir);
+			frags->page = NULL;
+		} else {
+			page_ref_inc(page);
+		}
+
 		nr++;
 		length -= frag_size;
 		if (!length)
 			break;
 		frag_info++;
 	}
+	skb->truesize += truesize;
 	return nr;
 
 fail:
 	while (nr > 0) {
 		nr--;
-		__skb_frag_unref(&skb_frags_rx[nr]);
+		__skb_frag_unref(skb_shinfo(skb)->frags + nr);
 	}
 	return 0;
 }
@@ -639,7 +549,8 @@ static struct sk_buff *mlx4_en_rx_skb(struct mlx4_en_priv *priv,
 	if (length <= SMALL_PACKET_SIZE) {
 		/* We are copying all relevant data to the skb - temporarily
 		 * sync buffers for the copy */
-		dma = be64_to_cpu(rx_desc->data[0].addr);
+
+		dma = frags[0].dma + frags[0].page_offset;
 		dma_sync_single_for_cpu(priv->ddev, dma, length,
 					DMA_FROM_DEVICE);
 		skb_copy_to_linear_data(skb, va, length);
@@ -648,7 +559,7 @@ static struct sk_buff *mlx4_en_rx_skb(struct mlx4_en_priv *priv,
 		unsigned int pull_len;
 
 		/* Move relevant fragments to skb */
-		used_frags = mlx4_en_complete_rx_desc(priv, rx_desc, frags,
+		used_frags = mlx4_en_complete_rx_desc(priv, frags,
 							skb, length);
 		if (unlikely(!used_frags)) {
 			kfree_skb(skb);
@@ -916,8 +827,10 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 			case XDP_TX:
 				if (likely(!mlx4_en_xmit_frame(ring, frags, dev,
 							length, cq->ring,
-							&doorbell_pending)))
-					goto consumed;
+							&doorbell_pending))) {
+					frags[0].page = NULL;
+					goto next;
+				}
 				trace_xdp_exception(dev, xdp_prog, act);
 				goto xdp_drop_no_cnt; /* Drop on xmit failure */
 			default:
@@ -927,8 +840,6 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 			case XDP_DROP:
 				ring->xdp_drop++;
 xdp_drop_no_cnt:
-				if (likely(mlx4_en_rx_recycle(ring, frags)))
-					goto consumed;
 				goto next;
 			}
 		}
@@ -974,9 +885,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 			if (!gro_skb)
 				goto next;
 
-			nr = mlx4_en_complete_rx_desc(priv,
-				rx_desc, frags, gro_skb,
-				length);
+			nr = mlx4_en_complete_rx_desc(priv, frags, gro_skb,
+						      length);
 			if (!nr)
 				goto next;
 
@@ -1084,10 +994,6 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 
 		napi_gro_receive(&cq->napi, skb);
 next:
-		for (nr = 0; nr < priv->num_frags; nr++)
-			mlx4_en_free_frag(priv, frags, nr);
-
-consumed:
 		++cq->mcq.cons_index;
 		index = (cq->mcq.cons_index) & ring->size_mask;
 		cqe = mlx4_en_get_cqe(cq->buf, index, priv->cqe_size) + factor;
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 4f7d1503277d2e030854deabeb22b96fa9315c5d..3c93b9614e0674e904d33b83e041f348d4f2ffe9 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -327,7 +327,6 @@ struct mlx4_en_rx_desc {
 
 struct mlx4_en_rx_ring {
 	struct mlx4_hwq_resources wqres;
-	struct mlx4_en_rx_alloc page_alloc[MLX4_EN_MAX_RX_FRAGS];
 	u32 size ;	/* number of Rx descs*/
 	u32 actual_size;
 	u32 size_mask;
-- 
2.11.0.483.g087da7b7c-goog


* [PATCH v3 net-next 10/14] mlx4: add rx_alloc_pages counter in ethtool -S
From: Eric Dumazet @ 2017-02-13 19:58 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Tariq Toukan, Martin KaFai Lau, Saeed Mahameed,
	Willem de Bruijn, Jesper Dangaard Brouer, Brenden Blanco,
	Alexei Starovoitov, Eric Dumazet, Eric Dumazet

This new counter tracks the number of pages that we allocated for one port.

lpaa24:~# ethtool -S eth0 | egrep 'rx_alloc_pages|rx_packets'
     rx_packets: 306755183
     rx_alloc_pages: 932897
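
In this sample, that is roughly 306755183 / 932897 ~= 329 packets per
allocated page, i.e. the vast majority of frames were served from recycled
pages.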

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_port.c    |  2 ++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c      | 11 +++++++----
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h    |  1 +
 drivers/net/ethernet/mellanox/mlx4/mlx4_stats.h |  2 +-
 5 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index c4d714fcc7dae759998a49a1f90f9ab1ee9bdda3..ffbcb27c05e55f43630a812249bab21609886dd9 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -117,7 +117,7 @@ static const char main_strings[][ETH_GSTRING_LEN] = {
 	/* port statistics */
 	"tso_packets",
 	"xmit_more",
-	"queue_stopped", "wake_queue", "tx_timeout", "rx_alloc_failed",
+	"queue_stopped", "wake_queue", "tx_timeout", "rx_alloc_pages",
 	"rx_csum_good", "rx_csum_none", "rx_csum_complete", "tx_chksum_offload",
 
 	/* pf statistics */
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_port.c b/drivers/net/ethernet/mellanox/mlx4/en_port.c
index 9166d90e732858610b1407fe85cbf6cbe27f5e0b..e0eb695318e64ebcaf58d6edb5f9a57be6f9ddf6 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_port.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_port.c
@@ -213,6 +213,7 @@ int mlx4_en_DUMP_ETH_STATS(struct mlx4_en_dev *mdev, u8 port, u8 reset)
 	priv->port_stats.rx_chksum_good = 0;
 	priv->port_stats.rx_chksum_none = 0;
 	priv->port_stats.rx_chksum_complete = 0;
+	priv->port_stats.rx_alloc_pages = 0;
 	priv->xdp_stats.rx_xdp_drop    = 0;
 	priv->xdp_stats.rx_xdp_tx      = 0;
 	priv->xdp_stats.rx_xdp_tx_full = 0;
@@ -223,6 +224,7 @@ int mlx4_en_DUMP_ETH_STATS(struct mlx4_en_dev *mdev, u8 port, u8 reset)
 		priv->port_stats.rx_chksum_good += READ_ONCE(ring->csum_ok);
 		priv->port_stats.rx_chksum_none += READ_ONCE(ring->csum_none);
 		priv->port_stats.rx_chksum_complete += READ_ONCE(ring->csum_complete);
+		priv->port_stats.rx_alloc_pages += READ_ONCE(ring->rx_alloc_pages);
 		priv->xdp_stats.rx_xdp_drop	+= READ_ONCE(ring->xdp_drop);
 		priv->xdp_stats.rx_xdp_tx	+= READ_ONCE(ring->xdp_tx);
 		priv->xdp_stats.rx_xdp_tx_full	+= READ_ONCE(ring->xdp_tx_full);
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 5edd0cf2999cbde37b3404aafd700584696af940..d3a425fa46b3924411e42a865b3a57aa063f1635 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -72,6 +72,7 @@ static int mlx4_alloc_page(struct mlx4_en_priv *priv,
 }
 
 static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
+			       struct mlx4_en_rx_ring *ring,
 			       struct mlx4_en_rx_desc *rx_desc,
 			       struct mlx4_en_rx_alloc *frags,
 			       gfp_t gfp)
@@ -79,8 +80,11 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
 	int i;
 
 	for (i = 0; i < priv->num_frags; i++, frags++) {
-		if (!frags->page && mlx4_alloc_page(priv, frags, gfp))
-			return -ENOMEM;
+		if (!frags->page) {
+			if (mlx4_alloc_page(priv, frags, gfp))
+				return -ENOMEM;
+			ring->rx_alloc_pages++;
+		}
 		rx_desc->data[i].addr = cpu_to_be64(frags->dma +
 						    frags->page_offset);
 	}
@@ -133,7 +137,6 @@ static int mlx4_en_prepare_rx_desc(struct mlx4_en_priv *priv,
 	struct mlx4_en_rx_desc *rx_desc = ring->buf + (index * ring->stride);
 	struct mlx4_en_rx_alloc *frags = ring->rx_info +
 					(index << priv->log_rx_info);
-
 	if (ring->page_cache.index > 0) {
 		/* XDP uses a single page per frame */
 		if (!frags->page) {
@@ -147,7 +150,7 @@ static int mlx4_en_prepare_rx_desc(struct mlx4_en_priv *priv,
 		return 0;
 	}
 
-	return mlx4_en_alloc_frags(priv, rx_desc, frags, gfp);
+	return mlx4_en_alloc_frags(priv, ring, rx_desc, frags, gfp);
 }
 
 static bool mlx4_en_is_ring_empty(const struct mlx4_en_rx_ring *ring)
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 3c93b9614e0674e904d33b83e041f348d4f2ffe9..52f157cea7765ad6907c59aead357a158848b38d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -346,6 +346,7 @@ struct mlx4_en_rx_ring {
 	unsigned long csum_ok;
 	unsigned long csum_none;
 	unsigned long csum_complete;
+	unsigned long rx_alloc_pages;
 	unsigned long xdp_drop;
 	unsigned long xdp_tx;
 	unsigned long xdp_tx_full;
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_stats.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_stats.h
index 48641cb0367f251a07537b82d0a16bf50d8479ef..926f3c3f3665c5d28fe5d35c41afaa0e5917c007 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_stats.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_stats.h
@@ -37,7 +37,7 @@ struct mlx4_en_port_stats {
 	unsigned long queue_stopped;
 	unsigned long wake_queue;
 	unsigned long tx_timeout;
-	unsigned long rx_alloc_failed;
+	unsigned long rx_alloc_pages;
 	unsigned long rx_chksum_good;
 	unsigned long rx_chksum_none;
 	unsigned long rx_chksum_complete;
-- 
2.11.0.483.g087da7b7c-goog

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v3 net-next 11/14] mlx4: do not access rx_desc from mlx4_en_process_rx_cq()
  2017-02-13 19:58 [PATCH v3 net-next 00/14] mlx4: order-0 allocations and page recycling Eric Dumazet
                   ` (9 preceding siblings ...)
  2017-02-13 19:58 ` [PATCH v3 net-next 10/14] mlx4: add rx_alloc_pages counter in ethtool -S Eric Dumazet
@ 2017-02-13 19:58 ` Eric Dumazet
  2017-02-13 19:58 ` [PATCH v3 net-next 12/14] mlx4: factorize page_address() calls Eric Dumazet
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 78+ messages in thread
From: Eric Dumazet @ 2017-02-13 19:58 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Tariq Toukan, Martin KaFai Lau, Saeed Mahameed,
	Willem de Bruijn, Jesper Dangaard Brouer, Brenden Blanco,
	Alexei Starovoitov, Eric Dumazet, Eric Dumazet

Instead of fetching the DMA address from rx_desc->data[0].addr,
prefer frags[0].dma + frags[0].page_offset to avoid a potential
cache line miss.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index d3a425fa46b3924411e42a865b3a57aa063f1635..b62fa265890edd0eff50256bdc50fc907dbe8aa2 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -528,7 +528,6 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
 
 
 static struct sk_buff *mlx4_en_rx_skb(struct mlx4_en_priv *priv,
-				      struct mlx4_en_rx_desc *rx_desc,
 				      struct mlx4_en_rx_alloc *frags,
 				      unsigned int length)
 {
@@ -703,7 +702,6 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	struct mlx4_cqe *cqe;
 	struct mlx4_en_rx_ring *ring = priv->rx_ring[cq->ring];
 	struct mlx4_en_rx_alloc *frags;
-	struct mlx4_en_rx_desc *rx_desc;
 	struct bpf_prog *xdp_prog;
 	int doorbell_pending;
 	struct sk_buff *skb;
@@ -738,7 +736,6 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		    cq->mcq.cons_index & cq->size)) {
 
 		frags = ring->rx_info + (index << priv->log_rx_info);
-		rx_desc = ring->buf + (index << ring->log_stride);
 
 		/*
 		 * make sure we read the CQE after we read the ownership bit
@@ -767,7 +764,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 			/* Get pointer to first fragment since we haven't
 			 * skb yet and cast it to ethhdr struct
 			 */
-			dma = be64_to_cpu(rx_desc->data[0].addr);
+			dma = frags[0].dma + frags[0].page_offset;
 			dma_sync_single_for_cpu(priv->ddev, dma, sizeof(*ethh),
 						DMA_FROM_DEVICE);
 			ethh = (struct ethhdr *)(page_address(frags[0].page) +
@@ -806,7 +803,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 			void *orig_data;
 			u32 act;
 
-			dma = be64_to_cpu(rx_desc->data[0].addr);
+			dma = frags[0].dma + frags[0].page_offset;
 			dma_sync_single_for_cpu(priv->ddev, dma,
 						priv->frag_info[0].frag_size,
 						DMA_FROM_DEVICE);
@@ -946,7 +943,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		}
 
 		/* GRO not possible, complete processing here */
-		skb = mlx4_en_rx_skb(priv, rx_desc, frags, length);
+		skb = mlx4_en_rx_skb(priv, frags, length);
 		if (unlikely(!skb)) {
 			ring->dropped++;
 			goto next;
-- 
2.11.0.483.g087da7b7c-goog

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v3 net-next 12/14] mlx4: factorize page_address() calls
  2017-02-13 19:58 [PATCH v3 net-next 00/14] mlx4: order-0 allocations and page recycling Eric Dumazet
                   ` (10 preceding siblings ...)
  2017-02-13 19:58 ` [PATCH v3 net-next 11/14] mlx4: do not access rx_desc from mlx4_en_process_rx_cq() Eric Dumazet
@ 2017-02-13 19:58 ` Eric Dumazet
  2017-02-13 19:58 ` [PATCH v3 net-next 13/14] mlx4: make validate_loopback() more generic Eric Dumazet
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 78+ messages in thread
From: Eric Dumazet @ 2017-02-13 19:58 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Tariq Toukan, Martin KaFai Lau, Saeed Mahameed,
	Willem de Bruijn, Jesper Dangaard Brouer, Brenden Blanco,
	Alexei Starovoitov, Eric Dumazet, Eric Dumazet

We need to compute the frame virtual address at different points.
Do it once.

The following patch will use the new va address for validate_loopback().

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index b62fa265890edd0eff50256bdc50fc907dbe8aa2..b5aa3f986508f0a5a89a40e18f8db99767068fef 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -734,9 +734,10 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	/* Process all completed CQEs */
 	while (XNOR(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK,
 		    cq->mcq.cons_index & cq->size)) {
+		void *va;
 
 		frags = ring->rx_info + (index << priv->log_rx_info);
-
+		va = page_address(frags[0].page) + frags[0].page_offset;
 		/*
 		 * make sure we read the CQE after we read the ownership bit
 		 */
@@ -759,7 +760,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		 * and not performing the selftest or flb disabled
 		 */
 		if (priv->flags & MLX4_EN_FLAG_RX_FILTER_NEEDED) {
-			struct ethhdr *ethh;
+			const struct ethhdr *ethh = va;
 			dma_addr_t dma;
 			/* Get pointer to first fragment since we haven't
 			 * skb yet and cast it to ethhdr struct
@@ -767,8 +768,6 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 			dma = frags[0].dma + frags[0].page_offset;
 			dma_sync_single_for_cpu(priv->ddev, dma, sizeof(*ethh),
 						DMA_FROM_DEVICE);
-			ethh = (struct ethhdr *)(page_address(frags[0].page) +
-						 frags[0].page_offset);
 
 			if (is_multicast_ether_addr(ethh->h_dest)) {
 				struct mlx4_mac_entry *entry;
@@ -808,8 +807,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 						priv->frag_info[0].frag_size,
 						DMA_FROM_DEVICE);
 
-			xdp.data_hard_start = page_address(frags[0].page);
-			xdp.data = xdp.data_hard_start + frags[0].page_offset;
+			xdp.data_hard_start = va - frags[0].page_offset;
+			xdp.data = va;
 			xdp.data_end = xdp.data + length;
 			orig_data = xdp.data;
 
@@ -819,6 +818,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 				length = xdp.data_end - xdp.data;
 				frags[0].page_offset = xdp.data -
 					xdp.data_hard_start;
+				va = xdp.data;
 			}
 
 			switch (act) {
@@ -891,7 +891,6 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 				goto next;
 
 			if (ip_summed == CHECKSUM_COMPLETE) {
-				void *va = skb_frag_address(skb_shinfo(gro_skb)->frags);
 				if (check_csum(cqe, gro_skb, va,
 					       dev->features)) {
 					ip_summed = CHECKSUM_NONE;
@@ -955,7 +954,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		}
 
 		if (ip_summed == CHECKSUM_COMPLETE) {
-			if (check_csum(cqe, skb, skb->data, dev->features)) {
+			if (check_csum(cqe, skb, va, dev->features)) {
 				ip_summed = CHECKSUM_NONE;
 				ring->csum_complete--;
 				ring->csum_none++;
-- 
2.11.0.483.g087da7b7c-goog

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v3 net-next 13/14] mlx4: make validate_loopback() more generic
  2017-02-13 19:58 [PATCH v3 net-next 00/14] mlx4: order-0 allocations and page recycling Eric Dumazet
                   ` (11 preceding siblings ...)
  2017-02-13 19:58 ` [PATCH v3 net-next 12/14] mlx4: factorize page_address() calls Eric Dumazet
@ 2017-02-13 19:58 ` Eric Dumazet
  2017-02-13 19:58 ` [PATCH v3 net-next 14/14] mlx4: remove duplicate code in mlx4_en_process_rx_cq() Eric Dumazet
  2017-02-17 16:00 ` [PATCH v3 net-next 00/14] mlx4: order-0 allocations and page recycling David Miller
  14 siblings, 0 replies; 78+ messages in thread
From: Eric Dumazet @ 2017-02-13 19:58 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Tariq Toukan, Martin KaFai Lau, Saeed Mahameed,
	Willem de Bruijn, Jesper Dangaard Brouer, Brenden Blanco,
	Alexei Starovoitov, Eric Dumazet, Eric Dumazet

Testing a boolean in the fast path is not worth duplicating the
packet allocation code for the GRO on and GRO off cases.

If this proves to be a problem, we might later use a jump label.
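
As an illustration of that alternative (not part of this series; the
key and helper names below are made up), a jump label based version
could look roughly like:

#include <linux/jump_label.h>

static DEFINE_STATIC_KEY_FALSE(mlx4_loopback_test_key);

/* Selftest entry/exit would flip the key (names are hypothetical)... */
static void mlx4_loopback_test_begin(void)
{
	static_branch_enable(&mlx4_loopback_test_key);
}

static void mlx4_loopback_test_end(void)
{
	static_branch_disable(&mlx4_loopback_test_key);
}

/* ...and the RX fast path would test a patched branch instead of
 * loading priv->validate_loopback for every packet.
 */
static inline bool mlx4_loopback_test_running(void)
{
	return static_branch_unlikely(&mlx4_loopback_test_key);
}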

The next patch will remove this duplicated code and ease code review.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c       | 23 ++++++++++-------------
 drivers/net/ethernet/mellanox/mlx4/en_selftest.c |  6 ------
 2 files changed, 10 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index b5aa3f986508f0a5a89a40e18f8db99767068fef..33b28278c8270197c5a8f35521fd2af5e1e6b1be 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -584,20 +584,17 @@ static struct sk_buff *mlx4_en_rx_skb(struct mlx4_en_priv *priv,
 	return skb;
 }
 
-static void validate_loopback(struct mlx4_en_priv *priv, struct sk_buff *skb)
+static void validate_loopback(struct mlx4_en_priv *priv, void *va)
 {
+	const unsigned char *data = va + ETH_HLEN;
 	int i;
-	int offset = ETH_HLEN;
 
-	for (i = 0; i < MLX4_LOOPBACK_TEST_PAYLOAD; i++, offset++) {
-		if (*(skb->data + offset) != (unsigned char) (i & 0xff))
-			goto out_loopback;
+	for (i = 0; i < MLX4_LOOPBACK_TEST_PAYLOAD; i++) {
+		if (data[i] != (unsigned char)i)
+			return;
 	}
 	/* Loopback found */
 	priv->loopback_ok = 1;
-
-out_loopback:
-	dev_kfree_skb_any(skb);
 }
 
 static bool mlx4_en_refill_rx_buffers(struct mlx4_en_priv *priv,
@@ -785,6 +782,11 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 			}
 		}
 
+		if (unlikely(priv->validate_loopback)) {
+			validate_loopback(priv, va);
+			goto next;
+		}
+
 		/*
 		 * Packet is OK - process it.
 		 */
@@ -948,11 +950,6 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 			goto next;
 		}
 
-		if (unlikely(priv->validate_loopback)) {
-			validate_loopback(priv, skb);
-			goto next;
-		}
-
 		if (ip_summed == CHECKSUM_COMPLETE) {
 			if (check_csum(cqe, skb, va, dev->features)) {
 				ip_summed = CHECKSUM_NONE;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_selftest.c b/drivers/net/ethernet/mellanox/mlx4/en_selftest.c
index 95290e1fc9fe7600b2e3bcca334f3fad7d733c09..17112faafbccc5f7a75ee82a287be7952859ae9e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_selftest.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_selftest.c
@@ -81,14 +81,11 @@ static int mlx4_en_test_loopback(struct mlx4_en_priv *priv)
 {
 	u32 loopback_ok = 0;
 	int i;
-	bool gro_enabled;
 
         priv->loopback_ok = 0;
 	priv->validate_loopback = 1;
-	gro_enabled = priv->dev->features & NETIF_F_GRO;
 
 	mlx4_en_update_loopback_state(priv->dev, priv->dev->features);
-	priv->dev->features &= ~NETIF_F_GRO;
 
 	/* xmit */
 	if (mlx4_en_test_loopback_xmit(priv)) {
@@ -111,9 +108,6 @@ static int mlx4_en_test_loopback(struct mlx4_en_priv *priv)
 
 	priv->validate_loopback = 0;
 
-	if (gro_enabled)
-		priv->dev->features |= NETIF_F_GRO;
-
 	mlx4_en_update_loopback_state(priv->dev, priv->dev->features);
 	return !loopback_ok;
 }
-- 
2.11.0.483.g087da7b7c-goog

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v3 net-next 14/14] mlx4: remove duplicate code in mlx4_en_process_rx_cq()
  2017-02-13 19:58 [PATCH v3 net-next 00/14] mlx4: order-0 allocations and page recycling Eric Dumazet
                   ` (12 preceding siblings ...)
  2017-02-13 19:58 ` [PATCH v3 net-next 13/14] mlx4: make validate_loopback() more generic Eric Dumazet
@ 2017-02-13 19:58 ` Eric Dumazet
  2017-02-17 16:00 ` [PATCH v3 net-next 00/14] mlx4: order-0 allocations and page recycling David Miller
  14 siblings, 0 replies; 78+ messages in thread
From: Eric Dumazet @ 2017-02-13 19:58 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Tariq Toukan, Martin KaFai Lau, Saeed Mahameed,
	Willem de Bruijn, Jesper Dangaard Brouer, Brenden Blanco,
	Alexei Starovoitov, Eric Dumazet, Eric Dumazet

We should keep one way to build skbs, regardless of GRO being on or off.

Note that I made sure to defer as much as possible the point where we
need to pull data from the frame, so that any future prefetch() we
might add is more effective.

These skb attributes derive from the CQE or ring :
 ip_summed, csum
 hash
 vlan offload
 hwtstamps
 queue_mapping

As a bonus, this patch removes the mlx4 dependency on eth_get_headlen(),
which is broken often enough to give us headaches.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 209 ++++++-----------------------
 1 file changed, 41 insertions(+), 168 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 33b28278c8270197c5a8f35521fd2af5e1e6b1be..aa074e57ce06fb2842fa1faabd156c3cd2fe10f5 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -526,64 +526,6 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
 	return 0;
 }
 
-
-static struct sk_buff *mlx4_en_rx_skb(struct mlx4_en_priv *priv,
-				      struct mlx4_en_rx_alloc *frags,
-				      unsigned int length)
-{
-	struct sk_buff *skb;
-	void *va;
-	int used_frags;
-	dma_addr_t dma;
-
-	skb = netdev_alloc_skb(priv->dev, SMALL_PACKET_SIZE + NET_IP_ALIGN);
-	if (unlikely(!skb)) {
-		en_dbg(RX_ERR, priv, "Failed allocating skb\n");
-		return NULL;
-	}
-	skb_reserve(skb, NET_IP_ALIGN);
-	skb->len = length;
-
-	/* Get pointer to first fragment so we could copy the headers into the
-	 * (linear part of the) skb */
-	va = page_address(frags[0].page) + frags[0].page_offset;
-
-	if (length <= SMALL_PACKET_SIZE) {
-		/* We are copying all relevant data to the skb - temporarily
-		 * sync buffers for the copy */
-
-		dma = frags[0].dma + frags[0].page_offset;
-		dma_sync_single_for_cpu(priv->ddev, dma, length,
-					DMA_FROM_DEVICE);
-		skb_copy_to_linear_data(skb, va, length);
-		skb->tail += length;
-	} else {
-		unsigned int pull_len;
-
-		/* Move relevant fragments to skb */
-		used_frags = mlx4_en_complete_rx_desc(priv, frags,
-							skb, length);
-		if (unlikely(!used_frags)) {
-			kfree_skb(skb);
-			return NULL;
-		}
-		skb_shinfo(skb)->nr_frags = used_frags;
-
-		pull_len = eth_get_headlen(va, SMALL_PACKET_SIZE);
-		/* Copy headers into the skb linear buffer */
-		memcpy(skb->data, va, pull_len);
-		skb->tail += pull_len;
-
-		/* Skip headers in first fragment */
-		skb_shinfo(skb)->frags[0].page_offset += pull_len;
-
-		/* Adjust size of first fragment */
-		skb_frag_size_sub(&skb_shinfo(skb)->frags[0], pull_len);
-		skb->data_len = length - pull_len;
-	}
-	return skb;
-}
-
 static void validate_loopback(struct mlx4_en_priv *priv, void *va)
 {
 	const unsigned char *data = va + ETH_HLEN;
@@ -792,8 +734,6 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		 */
 		length = be32_to_cpu(cqe->byte_cnt);
 		length -= ring->fcs_del;
-		l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
-			(cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
 
 		/* A bpf program gets first chance to drop the packet. It may
 		 * read bytes but not past the end of the frag.
@@ -849,122 +789,51 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		ring->bytes += length;
 		ring->packets++;
 
+		skb = napi_get_frags(&cq->napi);
+		if (!skb)
+			goto next;
+
+		if (unlikely(ring->hwtstamp_rx_filter == HWTSTAMP_FILTER_ALL)) {
+			timestamp = mlx4_en_get_cqe_ts(cqe);
+			mlx4_en_fill_hwtstamps(mdev, skb_hwtstamps(skb),
+					       timestamp);
+		}
+		skb_record_rx_queue(skb, cq->ring);
+
 		if (likely(dev->features & NETIF_F_RXCSUM)) {
 			if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
 						      MLX4_CQE_STATUS_UDP)) {
 				if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
 				    cqe->checksum == cpu_to_be16(0xffff)) {
 					ip_summed = CHECKSUM_UNNECESSARY;
+					l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
+						(cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
+					if (l2_tunnel)
+						skb->csum_level = 1;
 					ring->csum_ok++;
 				} else {
-					ip_summed = CHECKSUM_NONE;
-					ring->csum_none++;
+					goto csum_none;
 				}
 			} else {
 				if (priv->flags & MLX4_EN_FLAG_RX_CSUM_NON_TCP_UDP &&
 				    (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPV4 |
 							       MLX4_CQE_STATUS_IPV6))) {
-					ip_summed = CHECKSUM_COMPLETE;
-					ring->csum_complete++;
+					if (check_csum(cqe, skb, va, dev->features)) {
+						goto csum_none;
+					} else {
+						ip_summed = CHECKSUM_COMPLETE;
+						ring->csum_complete++;
+					}
 				} else {
-					ip_summed = CHECKSUM_NONE;
-					ring->csum_none++;
+					goto csum_none;
 				}
 			}
 		} else {
+csum_none:
 			ip_summed = CHECKSUM_NONE;
 			ring->csum_none++;
 		}
-
-		/* This packet is eligible for GRO if it is:
-		 * - DIX Ethernet (type interpretation)
-		 * - TCP/IP (v4)
-		 * - without IP options
-		 * - not an IP fragment
-		 */
-		if (dev->features & NETIF_F_GRO) {
-			struct sk_buff *gro_skb = napi_get_frags(&cq->napi);
-			if (!gro_skb)
-				goto next;
-
-			nr = mlx4_en_complete_rx_desc(priv, frags, gro_skb,
-						      length);
-			if (!nr)
-				goto next;
-
-			if (ip_summed == CHECKSUM_COMPLETE) {
-				if (check_csum(cqe, gro_skb, va,
-					       dev->features)) {
-					ip_summed = CHECKSUM_NONE;
-					ring->csum_none++;
-					ring->csum_complete--;
-				}
-			}
-
-			skb_shinfo(gro_skb)->nr_frags = nr;
-			gro_skb->len = length;
-			gro_skb->data_len = length;
-			gro_skb->ip_summed = ip_summed;
-
-			if (l2_tunnel && ip_summed == CHECKSUM_UNNECESSARY)
-				gro_skb->csum_level = 1;
-
-			if ((cqe->vlan_my_qpn &
-			    cpu_to_be32(MLX4_CQE_CVLAN_PRESENT_MASK)) &&
-			    (dev->features & NETIF_F_HW_VLAN_CTAG_RX)) {
-				u16 vid = be16_to_cpu(cqe->sl_vid);
-
-				__vlan_hwaccel_put_tag(gro_skb, htons(ETH_P_8021Q), vid);
-			} else if ((be32_to_cpu(cqe->vlan_my_qpn) &
-				  MLX4_CQE_SVLAN_PRESENT_MASK) &&
-				 (dev->features & NETIF_F_HW_VLAN_STAG_RX)) {
-				__vlan_hwaccel_put_tag(gro_skb,
-						       htons(ETH_P_8021AD),
-						       be16_to_cpu(cqe->sl_vid));
-			}
-
-			if (dev->features & NETIF_F_RXHASH)
-				skb_set_hash(gro_skb,
-					     be32_to_cpu(cqe->immed_rss_invalid),
-					     (ip_summed == CHECKSUM_UNNECESSARY) ?
-						PKT_HASH_TYPE_L4 :
-						PKT_HASH_TYPE_L3);
-
-			skb_record_rx_queue(gro_skb, cq->ring);
-
-			if (ring->hwtstamp_rx_filter == HWTSTAMP_FILTER_ALL) {
-				timestamp = mlx4_en_get_cqe_ts(cqe);
-				mlx4_en_fill_hwtstamps(mdev,
-						       skb_hwtstamps(gro_skb),
-						       timestamp);
-			}
-
-			napi_gro_frags(&cq->napi);
-			goto next;
-		}
-
-		/* GRO not possible, complete processing here */
-		skb = mlx4_en_rx_skb(priv, frags, length);
-		if (unlikely(!skb)) {
-			ring->dropped++;
-			goto next;
-		}
-
-		if (ip_summed == CHECKSUM_COMPLETE) {
-			if (check_csum(cqe, skb, va, dev->features)) {
-				ip_summed = CHECKSUM_NONE;
-				ring->csum_complete--;
-				ring->csum_none++;
-			}
-		}
-
 		skb->ip_summed = ip_summed;
-		skb->protocol = eth_type_trans(skb, dev);
-		skb_record_rx_queue(skb, cq->ring);
-
-		if (l2_tunnel && ip_summed == CHECKSUM_UNNECESSARY)
-			skb->csum_level = 1;
-
 		if (dev->features & NETIF_F_RXHASH)
 			skb_set_hash(skb,
 				     be32_to_cpu(cqe->immed_rss_invalid),
@@ -972,32 +841,36 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 					PKT_HASH_TYPE_L4 :
 					PKT_HASH_TYPE_L3);
 
-		if ((be32_to_cpu(cqe->vlan_my_qpn) &
-		    MLX4_CQE_CVLAN_PRESENT_MASK) &&
+
+		if ((cqe->vlan_my_qpn &
+		     cpu_to_be32(MLX4_CQE_CVLAN_PRESENT_MASK)) &&
 		    (dev->features & NETIF_F_HW_VLAN_CTAG_RX))
-			__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), be16_to_cpu(cqe->sl_vid));
-		else if ((be32_to_cpu(cqe->vlan_my_qpn) &
-			  MLX4_CQE_SVLAN_PRESENT_MASK) &&
+			__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q),
+					       be16_to_cpu(cqe->sl_vid));
+		else if ((cqe->vlan_my_qpn &
+			  cpu_to_be32(MLX4_CQE_SVLAN_PRESENT_MASK)) &&
 			 (dev->features & NETIF_F_HW_VLAN_STAG_RX))
 			__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021AD),
 					       be16_to_cpu(cqe->sl_vid));
 
-		if (ring->hwtstamp_rx_filter == HWTSTAMP_FILTER_ALL) {
-			timestamp = mlx4_en_get_cqe_ts(cqe);
-			mlx4_en_fill_hwtstamps(mdev, skb_hwtstamps(skb),
-					       timestamp);
+		nr = mlx4_en_complete_rx_desc(priv, frags, skb, length);
+		if (likely(nr)) {
+			skb_shinfo(skb)->nr_frags = nr;
+			skb->len = length;
+			skb->data_len = length;
+			napi_gro_frags(&cq->napi);
+		} else {
+			skb->vlan_tci = 0;
+			skb_clear_hash(skb);
 		}
-
-		napi_gro_receive(&cq->napi, skb);
 next:
 		++cq->mcq.cons_index;
 		index = (cq->mcq.cons_index) & ring->size_mask;
 		cqe = mlx4_en_get_cqe(cq->buf, index, priv->cqe_size) + factor;
 		if (++polled == budget)
-			goto out;
+			break;
 	}
 
-out:
 	rcu_read_unlock();
 
 	if (polled) {
-- 
2.11.0.483.g087da7b7c-goog

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-13 19:58 ` [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX Eric Dumazet
@ 2017-02-13 20:51   ` Alexander Duyck
  2017-02-13 21:09     ` Eric Dumazet
  2017-02-22 16:22   ` Eric Dumazet
  1 sibling, 1 reply; 78+ messages in thread
From: Alexander Duyck @ 2017-02-13 20:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Tariq Toukan, Martin KaFai Lau,
	Saeed Mahameed, Willem de Bruijn, Jesper Dangaard Brouer,
	Brenden Blanco, Alexei Starovoitov, Eric Dumazet

On Mon, Feb 13, 2017 at 11:58 AM, Eric Dumazet <edumazet@google.com> wrote:
> Use of order-3 pages is problematic in some cases.
>
> This patch might add three kinds of regression :
>
> 1) a CPU performance regression, but we will add later page
> recycling and performance should be back.
>
> 2) TCP receiver could grow its receive window slightly slower,
>    because skb->len/skb->truesize ratio will decrease.
>    This is mostly ok, we prefer being conservative to not risk OOM,
>    and eventually tune TCP better in the future.
>    This is consistent with other drivers using 2048 per ethernet frame.
>
> 3) Because we allocate one page per RX slot, we consume more
>    memory for the ring buffers. XDP already had this constraint anyway.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c   | 72 +++++++++++++---------------
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  4 --
>  2 files changed, 33 insertions(+), 43 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 0c61c1200f2aec4c74d7403d9ec3ed06821c1bac..069ea09185fb0669d5c1f9b1b88f517b2d69c5ed 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -53,38 +53,26 @@
>  static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
>                             struct mlx4_en_rx_alloc *page_alloc,
>                             const struct mlx4_en_frag_info *frag_info,
> -                           gfp_t _gfp)
> +                           gfp_t gfp)
>  {
> -       int order;
>         struct page *page;
>         dma_addr_t dma;
>
> -       for (order = priv->rx_page_order; ;) {
> -               gfp_t gfp = _gfp;
> -
> -               if (order)
> -                       gfp |= __GFP_COMP | __GFP_NOWARN | __GFP_NOMEMALLOC;
> -               page = alloc_pages(gfp, order);
> -               if (likely(page))
> -                       break;
> -               if (--order < 0 ||
> -                   ((PAGE_SIZE << order) < frag_info->frag_size))
> -                       return -ENOMEM;
> -       }
> -       dma = dma_map_page(priv->ddev, page, 0, PAGE_SIZE << order,
> -                          priv->dma_dir);
> +       page = alloc_page(gfp);
> +       if (unlikely(!page))
> +               return -ENOMEM;
> +       dma = dma_map_page(priv->ddev, page, 0, PAGE_SIZE, priv->dma_dir);
>         if (unlikely(dma_mapping_error(priv->ddev, dma))) {
>                 put_page(page);
>                 return -ENOMEM;
>         }
> -       page_alloc->page_size = PAGE_SIZE << order;
>         page_alloc->page = page;
>         page_alloc->dma = dma;
>         page_alloc->page_offset = 0;
>         /* Not doing get_page() for each frag is a big win
>          * on asymetric workloads. Note we can not use atomic_set().
>          */
> -       page_ref_add(page, page_alloc->page_size / frag_info->frag_stride - 1);
> +       page_ref_add(page, PAGE_SIZE / frag_info->frag_stride - 1);
>         return 0;
>  }
>
> @@ -105,7 +93,7 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
>                 page_alloc[i].page_offset += frag_info->frag_stride;
>
>                 if (page_alloc[i].page_offset + frag_info->frag_stride <=
> -                   ring_alloc[i].page_size)
> +                   PAGE_SIZE)
>                         continue;
>
>                 if (unlikely(mlx4_alloc_pages(priv, &page_alloc[i],
> @@ -127,11 +115,10 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
>         while (i--) {
>                 if (page_alloc[i].page != ring_alloc[i].page) {
>                         dma_unmap_page(priv->ddev, page_alloc[i].dma,
> -                               page_alloc[i].page_size,
> -                               priv->dma_dir);
> +                                      PAGE_SIZE, priv->dma_dir);
>                         page = page_alloc[i].page;
>                         /* Revert changes done by mlx4_alloc_pages */
> -                       page_ref_sub(page, page_alloc[i].page_size /
> +                       page_ref_sub(page, PAGE_SIZE /
>                                            priv->frag_info[i].frag_stride - 1);
>                         put_page(page);
>                 }
> @@ -147,8 +134,8 @@ static void mlx4_en_free_frag(struct mlx4_en_priv *priv,
>         u32 next_frag_end = frags[i].page_offset + 2 * frag_info->frag_stride;
>
>
> -       if (next_frag_end > frags[i].page_size)
> -               dma_unmap_page(priv->ddev, frags[i].dma, frags[i].page_size,
> +       if (next_frag_end > PAGE_SIZE)
> +               dma_unmap_page(priv->ddev, frags[i].dma, PAGE_SIZE,
>                                priv->dma_dir);
>
>         if (frags[i].page)
> @@ -168,9 +155,8 @@ static int mlx4_en_init_allocator(struct mlx4_en_priv *priv,
>                                      frag_info, GFP_KERNEL | __GFP_COLD))
>                         goto out;
>
> -               en_dbg(DRV, priv, "  frag %d allocator: - size:%d frags:%d\n",
> -                      i, ring->page_alloc[i].page_size,
> -                      page_ref_count(ring->page_alloc[i].page));
> +               en_dbg(DRV, priv, "  frag %d allocator: - frags:%d\n",
> +                      i, page_ref_count(ring->page_alloc[i].page));
>         }
>         return 0;
>
> @@ -180,11 +166,10 @@ static int mlx4_en_init_allocator(struct mlx4_en_priv *priv,
>
>                 page_alloc = &ring->page_alloc[i];
>                 dma_unmap_page(priv->ddev, page_alloc->dma,
> -                              page_alloc->page_size,
> -                              priv->dma_dir);
> +                              PAGE_SIZE, priv->dma_dir);
>                 page = page_alloc->page;
>                 /* Revert changes done by mlx4_alloc_pages */
> -               page_ref_sub(page, page_alloc->page_size /
> +               page_ref_sub(page, PAGE_SIZE /
>                                    priv->frag_info[i].frag_stride - 1);
>                 put_page(page);
>                 page_alloc->page = NULL;

You can probably simplify this by using __page_frag_cache_drain since
that way you can just do the sub and test for the whole set instead
of doing N - 1 and 1.
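
A minimal sketch of that simplification, reusing the fields from the
hunk quoted above (the helper name is made up and this is untested,
just an illustration):

/* Drop all PAGE_SIZE / frag_stride references taken at allocation time
 * in one call, instead of page_ref_sub(n - 1) followed by put_page().
 * (Hypothetical helper; assumes the mlx4_en.h definitions above.)
 */
static void mlx4_en_destroy_page(struct mlx4_en_priv *priv,
				 struct mlx4_en_rx_alloc *page_alloc,
				 const struct mlx4_en_frag_info *frag_info)
{
	dma_unmap_page(priv->ddev, page_alloc->dma, PAGE_SIZE,
		       priv->dma_dir);
	__page_frag_cache_drain(page_alloc->page,
				PAGE_SIZE / frag_info->frag_stride);
	page_alloc->page = NULL;
}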

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-13 20:51   ` Alexander Duyck
@ 2017-02-13 21:09     ` Eric Dumazet
  2017-02-13 23:16       ` Alexander Duyck
  0 siblings, 1 reply; 78+ messages in thread
From: Eric Dumazet @ 2017-02-13 21:09 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David S . Miller, netdev, Tariq Toukan, Martin KaFai Lau,
	Saeed Mahameed, Willem de Bruijn, Jesper Dangaard Brouer,
	Brenden Blanco, Alexei Starovoitov, Eric Dumazet

On Mon, Feb 13, 2017 at 12:51 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Mon, Feb 13, 2017 at 11:58 AM, Eric Dumazet <edumazet@google.com> wrote:

>> +                              PAGE_SIZE, priv->dma_dir);
>>                 page = page_alloc->page;
>>                 /* Revert changes done by mlx4_alloc_pages */
>> -               page_ref_sub(page, page_alloc->page_size /
>> +               page_ref_sub(page, PAGE_SIZE /
>>                                    priv->frag_info[i].frag_stride - 1);
>>                 put_page(page);
>>                 page_alloc->page = NULL;
>
> You can probably simplify this by using __page_frag_cache_drain since
> that way you can just doe the sub and test for the whole set instead
> of doing N - 1 and 1.

Well, we remove this anyway in the following patch ;)

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-13 21:09     ` Eric Dumazet
@ 2017-02-13 23:16       ` Alexander Duyck
  2017-02-13 23:22         ` Eric Dumazet
  2017-02-14 12:12         ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 78+ messages in thread
From: Alexander Duyck @ 2017-02-13 23:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Tariq Toukan, Martin KaFai Lau,
	Saeed Mahameed, Willem de Bruijn, Jesper Dangaard Brouer,
	Brenden Blanco, Alexei Starovoitov, Eric Dumazet

On Mon, Feb 13, 2017 at 1:09 PM, Eric Dumazet <edumazet@google.com> wrote:
> On Mon, Feb 13, 2017 at 12:51 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> On Mon, Feb 13, 2017 at 11:58 AM, Eric Dumazet <edumazet@google.com> wrote:
>
>>> +                              PAGE_SIZE, priv->dma_dir);
>>>                 page = page_alloc->page;
>>>                 /* Revert changes done by mlx4_alloc_pages */
>>> -               page_ref_sub(page, page_alloc->page_size /
>>> +               page_ref_sub(page, PAGE_SIZE /
>>>                                    priv->frag_info[i].frag_stride - 1);
>>>                 put_page(page);
>>>                 page_alloc->page = NULL;
>>
>> You can probably simplify this by using __page_frag_cache_drain since
>> that way you can just doe the sub and test for the whole set instead
>> of doing N - 1 and 1.
>
> Well, we remove this anyway in following patch ;)

Any plans to add the bulk page count updates back at some point?  I
just got around to adding it for igb in commit bd4171a5d4c2 ("igb:
update code to better handle incrementing page count").  I should have
patches for ixgbe, i40e, and i40evf here pretty soon.  That was one of
the reasons why I implemented __page_frag_cache_drain in the first
place.  As I'm sure Jesper will probably point out the atomic op for
get_page/page_ref_inc can be pretty expensive if I recall correctly.

- Alex

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-13 23:16       ` Alexander Duyck
@ 2017-02-13 23:22         ` Eric Dumazet
  2017-02-13 23:26           ` Alexander Duyck
  2017-02-14 12:12         ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 78+ messages in thread
From: Eric Dumazet @ 2017-02-13 23:22 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David S . Miller, netdev, Tariq Toukan, Martin KaFai Lau,
	Saeed Mahameed, Willem de Bruijn, Jesper Dangaard Brouer,
	Brenden Blanco, Alexei Starovoitov, Eric Dumazet

On Mon, Feb 13, 2017 at 3:16 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:

> Any plans to add the bulk page count updates back at some point?  I
> just got around to adding it for igb in commit bd4171a5d4c2 ("igb:
> update code to better handle incrementing page count").  I should have
> patches for ixgbe, i40e, and i40evf here pretty soon.  That was one of
> the reasons why I implemented __page_frag_cache_drain in the first
> place.  As I'm sure Jesper will probably point out the atomic op for
> get_page/page_ref_inc can be pretty expensive if I recall correctly.

As a matter of fact I did this last week, but could not show any
improvement on a single TCP flow, so I decided not to include it in
this patch series.

Maybe the issue was more on the sender side; I might revisit this later.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-13 23:22         ` Eric Dumazet
@ 2017-02-13 23:26           ` Alexander Duyck
  2017-02-13 23:29             ` Eric Dumazet
  0 siblings, 1 reply; 78+ messages in thread
From: Alexander Duyck @ 2017-02-13 23:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Tariq Toukan, Martin KaFai Lau,
	Saeed Mahameed, Willem de Bruijn, Jesper Dangaard Brouer,
	Brenden Blanco, Alexei Starovoitov, Eric Dumazet

On Mon, Feb 13, 2017 at 3:22 PM, Eric Dumazet <edumazet@google.com> wrote:
> On Mon, Feb 13, 2017 at 3:16 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>
>> Any plans to add the bulk page count updates back at some point?  I
>> just got around to adding it for igb in commit bd4171a5d4c2 ("igb:
>> update code to better handle incrementing page count").  I should have
>> patches for ixgbe, i40e, and i40evf here pretty soon.  That was one of
>> the reasons why I implemented __page_frag_cache_drain in the first
>> place.  As I'm sure Jesper will probably point out the atomic op for
>> get_page/page_ref_inc can be pretty expensive if I recall correctly.
>
> As a matter of fact I did this last week, but could not show any
> improvement on one TCP flow,
> so decided to not include this in this patch series.
>
> Maybe the issue was more a sender issue, I might revisit this later.

Odds are for a single TCP flow you won't notice.  This tends to be
more of a small-packets type performance issue.  If you hammer on the
Rx using pktgen you would be more likely to see it.

Anyway patch looks fine from a functional perspective so I am okay
with it.  The page count bulking is something that can be addressed
later.

Thanks.

- Alex

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-13 23:26           ` Alexander Duyck
@ 2017-02-13 23:29             ` Eric Dumazet
  2017-02-13 23:47               ` Alexander Duyck
  0 siblings, 1 reply; 78+ messages in thread
From: Eric Dumazet @ 2017-02-13 23:29 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David S . Miller, netdev, Tariq Toukan, Martin KaFai Lau,
	Saeed Mahameed, Willem de Bruijn, Jesper Dangaard Brouer,
	Brenden Blanco, Alexei Starovoitov, Eric Dumazet

On Mon, Feb 13, 2017 at 3:26 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:

>
> Odds are for a single TCP flow you won't notice.  This tends to be
> more of a small packets type performance issue.  If you hammer on he
> Rx using pktgen you would be more likely to see it.
>
> Anyway patch looks fine from a functional perspective so I am okay
> with it.  The page count bulking is something that can be addressed
> later.

Well, if we hammer enough, pages are not recycled because we cycle
through the RX ring buffer too fast.

This might be different for PowerPC though, because of its huge PAGE_SIZE.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-13 23:29             ` Eric Dumazet
@ 2017-02-13 23:47               ` Alexander Duyck
  2017-02-14  0:22                 ` Eric Dumazet
  0 siblings, 1 reply; 78+ messages in thread
From: Alexander Duyck @ 2017-02-13 23:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Tariq Toukan, Martin KaFai Lau,
	Saeed Mahameed, Willem de Bruijn, Jesper Dangaard Brouer,
	Brenden Blanco, Alexei Starovoitov, Eric Dumazet

On Mon, Feb 13, 2017 at 3:29 PM, Eric Dumazet <edumazet@google.com> wrote:
> On Mon, Feb 13, 2017 at 3:26 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>
>>
>> Odds are for a single TCP flow you won't notice.  This tends to be
>> more of a small packets type performance issue.  If you hammer on he
>> Rx using pktgen you would be more likely to see it.
>>
>> Anyway patch looks fine from a functional perspective so I am okay
>> with it.  The page count bulking is something that can be addressed
>> later.
>
> Well, if we hammer enough, pages are not recycled because we cycle
> through RX ring buffer too fast.
>
> This might be different for PowerPC though, because of its huge PAGE_SIZE.

Actually it depends on the use case.  In the case of pktgen packets
they are usually dropped pretty early in the receive path.  Think
something more along the lines of a TCP syn flood versus something
that would be loading up a socket.

- Alex

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-13 23:47               ` Alexander Duyck
@ 2017-02-14  0:22                 ` Eric Dumazet
  2017-02-14  0:34                   ` Alexander Duyck
  0 siblings, 1 reply; 78+ messages in thread
From: Eric Dumazet @ 2017-02-14  0:22 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David S . Miller, netdev, Tariq Toukan, Martin KaFai Lau,
	Saeed Mahameed, Willem de Bruijn, Jesper Dangaard Brouer,
	Brenden Blanco, Alexei Starovoitov, Eric Dumazet

On Mon, Feb 13, 2017 at 3:47 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:

> Actually it depends on the use case.  In the case of pktgen packets
> they are usually dropped pretty early in the receive path.  Think
> something more along the lines of a TCP syn flood versus something
> that would be loading up a socket.

So I gave it another spin, reducing the MTU on the sender to 500
instead of 1500 to triple the load (in terms of packets per second),
since the sender seems to hit some kind of limit around 30Gbit.

Number of packets we process on one RX queue, and one TCP flow.

Current patch series : 6.3 Mpps

Which is not too bad ;)

This number does not change putting your __page_frag_cache_drain() trick.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-14  0:22                 ` Eric Dumazet
@ 2017-02-14  0:34                   ` Alexander Duyck
  2017-02-14  0:46                     ` Eric Dumazet
  0 siblings, 1 reply; 78+ messages in thread
From: Alexander Duyck @ 2017-02-14  0:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Tariq Toukan, Martin KaFai Lau,
	Saeed Mahameed, Willem de Bruijn, Jesper Dangaard Brouer,
	Brenden Blanco, Alexei Starovoitov, Eric Dumazet

On Mon, Feb 13, 2017 at 4:22 PM, Eric Dumazet <edumazet@google.com> wrote:
> On Mon, Feb 13, 2017 at 3:47 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>
>> Actually it depends on the use case.  In the case of pktgen packets
>> they are usually dropped pretty early in the receive path.  Think
>> something more along the lines of a TCP syn flood versus something
>> that would be loading up a socket.
>
> So I gave another spin an it, reducing the MTU on the sender to 500
> instead of 1500 to triple the load (in term of packets per second)
> since the sender seems to hit some kind of limit around 30Gbit.
>
> Number of packets we process on one RX queue , and one TCP flow.
>
> Current patch series : 6.3 Mpps
>
> Which is not too bad ;)
>
> This number does not change putting your __page_frag_cache_drain() trick.

Well, the __page_frag_cache_drain by itself isn't going to add much of
anything.  You really aren't going to be using it except in the slow
path.  I was talking about the bulk page count update by itself.  All
__page_frag_cache_drain gets you is that it cleans up the
page_ref_sub(n-1); put_page(1); calls.

The approach I took for the Intel drivers isn't too different than
what we did for the skbuff page frag cache.  Basically I update the
page count once every 65535 frames setting it to 64K, and holding no
more than 65535 references in the pagecnt_bias.  Also in my code I
don't update the count until after the first recycle call since the
highest likelihood of us discarding the frame is after the first
allocation anyway so we wait until the first recycle to start
repopulating the count.

- Alex

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-14  0:34                   ` Alexander Duyck
@ 2017-02-14  0:46                     ` Eric Dumazet
  2017-02-14  0:47                       ` Eric Dumazet
  2017-02-14  0:57                       ` Eric Dumazet
  0 siblings, 2 replies; 78+ messages in thread
From: Eric Dumazet @ 2017-02-14  0:46 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Eric Dumazet, David S . Miller, netdev, Tariq Toukan,
	Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
	Jesper Dangaard Brouer, Brenden Blanco, Alexei Starovoitov

On Mon, 2017-02-13 at 16:34 -0800, Alexander Duyck wrote:
> On Mon, Feb 13, 2017 at 4:22 PM, Eric Dumazet <edumazet@google.com> wrote:
> > On Mon, Feb 13, 2017 at 3:47 PM, Alexander Duyck
> > <alexander.duyck@gmail.com> wrote:
> >
> >> Actually it depends on the use case.  In the case of pktgen packets
> >> they are usually dropped pretty early in the receive path.  Think
> >> something more along the lines of a TCP syn flood versus something
> >> that would be loading up a socket.
> >
> > So I gave another spin an it, reducing the MTU on the sender to 500
> > instead of 1500 to triple the load (in term of packets per second)
> > since the sender seems to hit some kind of limit around 30Gbit.
> >
> > Number of packets we process on one RX queue , and one TCP flow.
> >
> > Current patch series : 6.3 Mpps
> >
> > Which is not too bad ;)
> >
> > This number does not change putting your __page_frag_cache_drain() trick.
> 
> Well the __page_frag_cache_drain by itself isn't going to add much of
> anything.  You really aren't going to be using it except for in the
> slow path.  I was talking about the bulk page count update by itself.
> All __page_frag_cache_drain gets you is it cleans up the
> page_frag_sub(n-1); put_page(1); code calls.
> 
> The approach I took for the Intel drivers isn't too different than
> what we did for the skbuff page frag cache.  Basically I update the
> page count once every 65535 frames setting it to 64K, and holding no
> more than 65535 references in the pagecnt_bias.  Also in my code I
> don't update the count until after the first recycle call since the
> highest likelihood of us discarding the frame is after the first
> allocation anyway so we wait until the first recycle to start
> repopulating the count.

Alex, be assured that I implemented the full thing, of course.

( The pagecnt_bias field, .. refilled every 16K rounds )

And I repeat : I got _no_ convincing results to justify this patch being
sent in this series.

This series is not about adding 1% or 2% performance changes, but about
adding robustness to RX allocations.


 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-14  0:46                     ` Eric Dumazet
@ 2017-02-14  0:47                       ` Eric Dumazet
  2017-02-14  0:57                       ` Eric Dumazet
  1 sibling, 0 replies; 78+ messages in thread
From: Eric Dumazet @ 2017-02-14  0:47 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Eric Dumazet, David S . Miller, netdev, Tariq Toukan,
	Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
	Jesper Dangaard Brouer, Brenden Blanco, Alexei Starovoitov

On Mon, 2017-02-13 at 16:46 -0800, Eric Dumazet wrote:

> Alex, be assured that I implemented the full thing, of course.
> 
> ( The pagecnt_bias field, .. refilled every 16K rounds )

Correction, USHRT_MAX is  ~64K, not 16K

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-14  0:46                     ` Eric Dumazet
  2017-02-14  0:47                       ` Eric Dumazet
@ 2017-02-14  0:57                       ` Eric Dumazet
  2017-02-14  1:32                         ` Alexander Duyck
  1 sibling, 1 reply; 78+ messages in thread
From: Eric Dumazet @ 2017-02-14  0:57 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Eric Dumazet, David S . Miller, netdev, Tariq Toukan,
	Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn


> Alex, be assured that I implemented the full thing, of course.

Patch was :

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index aa074e57ce06fb2842fa1faabd156c3cd2fe10f5..0ae1b544668d26c24044dbdefdd9b12253596ff9 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -68,6 +68,7 @@ static int mlx4_alloc_page(struct mlx4_en_priv *priv,
        frag->page = page;
        frag->dma = dma;
        frag->page_offset = priv->rx_headroom;
+       frag->pagecnt_bias = 1;
        return 0;
 }
 
@@ -97,7 +98,7 @@ static void mlx4_en_free_frag(const struct mlx4_en_priv *priv,
        if (frag->page) {
                dma_unmap_page(priv->ddev, frag->dma,
                               PAGE_SIZE, priv->dma_dir);
-               __free_page(frag->page);
+               __page_frag_cache_drain(frag->page, frag->pagecnt_bias);
        }
        /* We need to clear all fields, otherwise a change of priv->log_rx_info
         * could lead to see garbage later in frag->page.
@@ -470,6 +471,7 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
 {
        const struct mlx4_en_frag_info *frag_info = priv->frag_info;
        unsigned int truesize = 0;
+       unsigned int pagecnt_bias;
        int nr, frag_size;
        struct page *page;
        dma_addr_t dma;
@@ -491,9 +493,10 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
                                     frag_size);
 
                truesize += frag_info->frag_stride;
+               pagecnt_bias = frags->pagecnt_bias--;
                if (frag_info->frag_stride == PAGE_SIZE / 2) {
                        frags->page_offset ^= PAGE_SIZE / 2;
-                       release = page_count(page) != 1 ||
+                       release = page_count(page) != pagecnt_bias ||
                                  page_is_pfmemalloc(page) ||
                                  page_to_nid(page) != numa_mem_id();
                } else {
@@ -504,9 +507,13 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
                }
                if (release) {
                        dma_unmap_page(priv->ddev, dma, PAGE_SIZE, priv->dma_dir);
+                       __page_frag_cache_drain(page, --pagecnt_bias);
                        frags->page = NULL;
                } else {
-                       page_ref_inc(page);
+                       if (pagecnt_bias == 1) {
+                               page_ref_add(page, USHRT_MAX);
+                               frags->pagecnt_bias = USHRT_MAX;
+                       }
                }
 
                nr++;
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 52f157cea7765ad6907c59aead357a158848b38d..0de5933fd56bda3db15b4aa33d2366eff98f4154 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -252,7 +252,12 @@ struct mlx4_en_tx_desc {
 struct mlx4_en_rx_alloc {
        struct page     *page;
        dma_addr_t      dma;
+#if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
        u32             page_offset;
+#else
+       u16             page_offset;
+#endif
+       u16 pagecnt_bias;
 };
 
 #define MLX4_EN_CACHE_SIZE (2 * NAPI_POLL_WEIGHT)

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-14  0:57                       ` Eric Dumazet
@ 2017-02-14  1:32                         ` Alexander Duyck
  0 siblings, 0 replies; 78+ messages in thread
From: Alexander Duyck @ 2017-02-14  1:32 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, David S . Miller, netdev, Tariq Toukan,
	Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn

On Mon, Feb 13, 2017 at 4:57 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>> Alex, be assured that I implemented the full thing, of course.
>
> Patch was :
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index aa074e57ce06fb2842fa1faabd156c3cd2fe10f5..0ae1b544668d26c24044dbdefdd9b12253596ff9 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -68,6 +68,7 @@ static int mlx4_alloc_page(struct mlx4_en_priv *priv,
>         frag->page = page;
>         frag->dma = dma;
>         frag->page_offset = priv->rx_headroom;
> +       frag->pagecnt_bias = 1;
>         return 0;
>  }
>
> @@ -97,7 +98,7 @@ static void mlx4_en_free_frag(const struct mlx4_en_priv *priv,
>         if (frag->page) {
>                 dma_unmap_page(priv->ddev, frag->dma,
>                                PAGE_SIZE, priv->dma_dir);
> -               __free_page(frag->page);
> +               __page_frag_cache_drain(frag->page, frag->pagecnt_bias);
>         }
>         /* We need to clear all fields, otherwise a change of priv->log_rx_info
>          * could lead to see garbage later in frag->page.
> @@ -470,6 +471,7 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
>  {
>         const struct mlx4_en_frag_info *frag_info = priv->frag_info;
>         unsigned int truesize = 0;
> +       unsigned int pagecnt_bias;
>         int nr, frag_size;
>         struct page *page;
>         dma_addr_t dma;
> @@ -491,9 +493,10 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
>                                      frag_size);
>
>                 truesize += frag_info->frag_stride;
> +               pagecnt_bias = frags->pagecnt_bias--;
>                 if (frag_info->frag_stride == PAGE_SIZE / 2) {
>                         frags->page_offset ^= PAGE_SIZE / 2;
> -                       release = page_count(page) != 1 ||
> +                       release = page_count(page) != pagecnt_bias ||
>                                   page_is_pfmemalloc(page) ||
>                                   page_to_nid(page) != numa_mem_id();
>                 } else {
> @@ -504,9 +507,13 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
>                 }
>                 if (release) {
>                         dma_unmap_page(priv->ddev, dma, PAGE_SIZE, priv->dma_dir);
> +                       __page_frag_cache_drain(page, --pagecnt_bias);
>                         frags->page = NULL;
>                 } else {
> -                       page_ref_inc(page);
> +                       if (pagecnt_bias == 1) {
> +                               page_ref_add(page, USHRT_MAX);
> +                               frags->pagecnt_bias = USHRT_MAX;
> +                       }
>                 }
>
>                 nr++;

You might want to examine the code while running perf.  What you
should see is the page_ref_inc here go from eating a significant
amount of time prior to the patch to something negligible after the
patch.  If the page_ref_inc isn't adding much pressure, then maybe that
is why it didn't provide any significant gain on mlx4.  It is also
possible that the mlx4 code is different enough that it is simply
running in a different environment; for example there might not be any
MMIO pressure putting serious pressure on the atomic op, so it is
processed more quickly.

Also, back when I was hammering on this I was mostly focused on routing
and doing micro-benchmarks.  Odds are it is one of those things that
won't show up unless you are really looking for it, so no need to worry
about addressing it now.

- Alex

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-13 23:16       ` Alexander Duyck
  2017-02-13 23:22         ` Eric Dumazet
@ 2017-02-14 12:12         ` Jesper Dangaard Brouer
  2017-02-14 13:45           ` Eric Dumazet
  1 sibling, 1 reply; 78+ messages in thread
From: Jesper Dangaard Brouer @ 2017-02-14 12:12 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Eric Dumazet, David S . Miller, netdev, Tariq Toukan,
	Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
	Brenden Blanco, Alexei Starovoitov, Eric Dumazet, brouer,
	linux-mm


On Mon, 13 Feb 2017 15:16:35 -0800
Alexander Duyck <alexander.duyck@gmail.com> wrote:

[...]
> ... As I'm sure Jesper will probably point out the atomic op for
> get_page/page_ref_inc can be pretty expensive if I recall correctly.

It is important to understand that there are two cases for the cost of
an atomic op, which depend on the cache-coherency state of the
cacheline.

Measured on Skylake CPU i7-6700K CPU @ 4.00GHz

(1) Local CPU atomic op :  27 cycles(tsc)  6.776 ns
(2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns

Notice the huge difference.  And in case 2, it is enough that a remote
CPU has merely read the cacheline and brought it into "Shared" (MESI)
state before the local CPU does the atomic op.

One key idea behind the page_pool is that remote CPUs read/detect
refcnt==1 (Shared state) and store the page in a small per-CPU array.
When the array is full, it gets bulk-returned to the shared ptr-ring
pool.  When the "local" CPU needs new pages, it bulk-refills from the
shared ptr-ring and prefetchw()s the pages, to latency-hide the MESI
transitions needed.
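
To illustrate, here is a rough sketch of that idea with made-up names
(pp_pool, pp_recycle, pp_alloc) -- it is not the actual page_pool API,
and it ignores locking/preemption details by assuming NAPI context:

/* Rough sketch of the recycling idea only -- not the page_pool API. */
#include <linux/mm.h>
#include <linux/gfp.h>
#include <linux/percpu.h>
#include <linux/prefetch.h>
#include <linux/ptr_ring.h>

#define PCPU_CACHE      64      /* small per-CPU stash */
#define BULK            16      /* refill / return batch size */

struct pp_cache {
        unsigned int    count;
        struct page     *pages[PCPU_CACHE];
};

struct pp_pool {
        struct ptr_ring           ring;   /* shared pool */
        struct pp_cache __percpu *cache;
};

/* Remote CPU: refcnt==1 was observed, stash the page instead of freeing. */
static void pp_recycle(struct pp_pool *pool, struct page *page)
{
        struct pp_cache *c = this_cpu_ptr(pool->cache);

        if (c->count == PCPU_CACHE) {
                /* Per-CPU array full: bulk-return a batch to the ring. */
                while (c->count > PCPU_CACHE - BULK) {
                        struct page *p = c->pages[--c->count];

                        if (ptr_ring_produce(&pool->ring, p))
                                put_page(p);    /* ring full, free it */
                }
        }
        c->pages[c->count++] = page;
}

/* "Local" (RX) CPU: bulk refill, prefetchw to hide the MESI transitions. */
static struct page *pp_alloc(struct pp_pool *pool, gfp_t gfp)
{
        struct pp_cache *c = this_cpu_ptr(pool->cache);
        int i;

        if (!c->count) {
                for (i = 0; i < BULK; i++) {
                        struct page *p = ptr_ring_consume(&pool->ring);

                        if (!p)
                                break;
                        prefetchw(p);   /* write-prefetch struct page */
                        c->pages[c->count++] = p;
                }
        }
        if (c->count)
                return c->pages[--c->count];
        return alloc_page(gfp);         /* fall back to the page allocator */
}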

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-14 12:12         ` Jesper Dangaard Brouer
@ 2017-02-14 13:45           ` Eric Dumazet
  2017-02-14 14:12             ` Eric Dumazet
  2017-02-14 14:56             ` Tariq Toukan
  0 siblings, 2 replies; 78+ messages in thread
From: Eric Dumazet @ 2017-02-14 13:45 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexander Duyck, David S . Miller, netdev, Tariq Toukan,
	Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
	Brenden Blanco, Alexei Starovoitov, Eric Dumazet, linux-mm

On Tue, Feb 14, 2017 at 4:12 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:

> It is important to understand that there are two cases for the cost of
> an atomic op, which depend on the cache-coherency state of the
> cacheline.
>
> Measured on Skylake CPU i7-6700K CPU @ 4.00GHz
>
> (1) Local CPU atomic op :  27 cycles(tsc)  6.776 ns
> (2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns
>

Okay, it seems you guys really want a patch that I said was not giving
good results.

Let me publish the numbers I get, with and without that last (and not
official) patch.

If I _force_ the user space process to run on the other node,
then the results are not the ones Alex or you are expecting.

With this patch I get about 2.7 Mpps on this silly single TCP flow,
and 3.5 Mpps without it.

lpaa24:~# sar -n DEV 1 10 | grep eth0 | grep Ave
Average:         eth0 2699243.20  16663.70 1354783.36   1079.95
0.00      0.00      4.50

Profile of the cpu on NUMA node 1 ( netserver consuming data ) :

    54.73%  [kernel]      [k] copy_user_enhanced_fast_string
    31.07%  [kernel]      [k] skb_release_data
     4.24%  [kernel]      [k] skb_copy_datagram_iter
     1.35%  [kernel]      [k] copy_page_to_iter
     0.98%  [kernel]      [k] _raw_spin_lock
     0.90%  [kernel]      [k] skb_release_head_state
     0.60%  [kernel]      [k] tcp_transmit_skb
     0.51%  [kernel]      [k] mlx4_en_xmit
     0.33%  [kernel]      [k] ___cache_free
     0.28%  [kernel]      [k] tcp_rcv_established

Profile of cpu handling mlx4 softirqs (NUMA node 0)


    48.00%  [kernel]          [k] mlx4_en_process_rx_cq
    12.92%  [kernel]          [k] napi_gro_frags
     7.28%  [kernel]          [k] inet_gro_receive
     7.17%  [kernel]          [k] tcp_gro_receive
     5.10%  [kernel]          [k] dev_gro_receive
     4.87%  [kernel]          [k] skb_gro_receive
     2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
     2.04%  [kernel]          [k] __build_skb
     1.02%  [kernel]          [k] napi_reuse_skb.isra.95
     1.01%  [kernel]          [k] tcp4_gro_receive
     0.65%  [kernel]          [k] kmem_cache_alloc
     0.45%  [kernel]          [k] _raw_spin_lock

Without the latest  patch (the exact patch series v3 I submitted),
thus with this atomic_inc() in mlx4_en_process_rx_cq  instead of only reads.

lpaa24:~# sar -n DEV 1 10|grep eth0|grep Ave
Average:         eth0 3566768.50  25638.60 1790345.69   1663.51
0.00      0.00      4.50

Profiles of the two cpus :

    74.85%  [kernel]      [k] copy_user_enhanced_fast_string
     6.42%  [kernel]      [k] skb_release_data
     5.65%  [kernel]      [k] skb_copy_datagram_iter
     1.83%  [kernel]      [k] copy_page_to_iter
     1.59%  [kernel]      [k] _raw_spin_lock
     1.48%  [kernel]      [k] skb_release_head_state
     0.72%  [kernel]      [k] tcp_transmit_skb
     0.68%  [kernel]      [k] mlx4_en_xmit
     0.43%  [kernel]      [k] page_frag_free
     0.38%  [kernel]      [k] ___cache_free
     0.37%  [kernel]      [k] tcp_established_options
     0.37%  [kernel]      [k] __ip_local_out


   37.98%  [kernel]          [k] mlx4_en_process_rx_cq
    26.47%  [kernel]          [k] napi_gro_frags
     7.02%  [kernel]          [k] inet_gro_receive
     5.89%  [kernel]          [k] tcp_gro_receive
     5.17%  [kernel]          [k] dev_gro_receive
     4.80%  [kernel]          [k] skb_gro_receive
     2.61%  [kernel]          [k] __build_skb
     2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
     1.59%  [kernel]          [k] napi_reuse_skb.isra.95
     0.95%  [kernel]          [k] tcp4_gro_receive
     0.51%  [kernel]          [k] kmem_cache_alloc
     0.42%  [kernel]          [k] __inet_lookup_established
     0.34%  [kernel]          [k] swiotlb_sync_single_for_cpu


So probably this will need further analysis, outside of the scope of
this patch series.

Could we now please Ack this v3 and merge it ?

Thanks.



> Notice the huge difference. And in case 2, it is enough that the remote
> CPU reads the cacheline and brings it into "Shared" (MESI) state, and
> the local CPU then does the atomic op.
>
> One key ideas behind the page_pool, is that remote CPUs read/detect
> refcnt==1 (Shared-state), and store the page in a small per-CPU array.
> When array is full, it gets bulk returned to the shared-ptr-ring pool.
> When "local" CPU need new pages, from the shared-ptr-ring it prefetchw
> during it's bulk refill, to latency-hide the MESI transitions needed.
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-14 13:45           ` Eric Dumazet
@ 2017-02-14 14:12             ` Eric Dumazet
  2017-02-14 14:56             ` Tariq Toukan
  1 sibling, 0 replies; 78+ messages in thread
From: Eric Dumazet @ 2017-02-14 14:12 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexander Duyck, David S . Miller, netdev, Tariq Toukan,
	Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
	Brenden Blanco, Alexei Starovoitov, Eric Dumazet, linux-mm

On Tue, Feb 14, 2017 at 5:45 AM, Eric Dumazet <edumazet@google.com> wrote:
>
> Could we now please Ack this v3 and merge it ?
>

BTW I found the limitation on the sender side.

After doing :

lpaa23:~# ethtool -c eth0
Coalesce parameters for eth0:
Adaptive RX: on  TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 400000
pkt-rate-high: 450000

rx-usecs: 16
rx-frames: 44
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 16
tx-frames: 16
tx-usecs-irq: 0
tx-frames-irq: 256

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 128
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

lpaa23:~# ethtool -C eth0 tx-usecs 8 tx-frames 8

-> 36 Gbit on a single TCP flow, which is the highest number I ever got.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-14 13:45           ` Eric Dumazet
  2017-02-14 14:12             ` Eric Dumazet
@ 2017-02-14 14:56             ` Tariq Toukan
  2017-02-14 15:51                 ` Eric Dumazet
                                 ` (2 more replies)
  1 sibling, 3 replies; 78+ messages in thread
From: Tariq Toukan @ 2017-02-14 14:56 UTC (permalink / raw)
  To: Eric Dumazet, Jesper Dangaard Brouer
  Cc: Alexander Duyck, David S . Miller, netdev, Tariq Toukan,
	Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
	Brenden Blanco, Alexei Starovoitov, Eric Dumazet, linux-mm



On 14/02/2017 3:45 PM, Eric Dumazet wrote:
> On Tue, Feb 14, 2017 at 4:12 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
>
>> It is important to understand that there are two cases for the cost of
>> an atomic op, which depend on the cache-coherency state of the
>> cacheline.
>>
>> Measured on Skylake CPU i7-6700K CPU @ 4.00GHz
>>
>> (1) Local CPU atomic op :  27 cycles(tsc)  6.776 ns
>> (2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns
>>
> Okay, it seems you guys really want a patch that I said was not giving
> good results
>
> Let me publish the numbers I get , adding or not the last (and not
> official) patch.
>
> If I _force_ the user space process to run on the other node,
> then the results are not the ones Alex or you are expecting.
>
> I have with this patch about 2.7 Mpps of this silly single TCP flow,
> and 3.5 Mpps without it.
>
> lpaa24:~# sar -n DEV 1 10 | grep eth0 | grep Ave
> Average:         eth0 2699243.20  16663.70 1354783.36   1079.95
> 0.00      0.00      4.50
>
> Profile of the cpu on NUMA node 1 ( netserver consuming data ) :
>
>      54.73%  [kernel]      [k] copy_user_enhanced_fast_string
>      31.07%  [kernel]      [k] skb_release_data
>       4.24%  [kernel]      [k] skb_copy_datagram_iter
>       1.35%  [kernel]      [k] copy_page_to_iter
>       0.98%  [kernel]      [k] _raw_spin_lock
>       0.90%  [kernel]      [k] skb_release_head_state
>       0.60%  [kernel]      [k] tcp_transmit_skb
>       0.51%  [kernel]      [k] mlx4_en_xmit
>       0.33%  [kernel]      [k] ___cache_free
>       0.28%  [kernel]      [k] tcp_rcv_established
>
> Profile of cpu handling mlx4 softirqs (NUMA node 0)
>
>
>      48.00%  [kernel]          [k] mlx4_en_process_rx_cq
>      12.92%  [kernel]          [k] napi_gro_frags
>       7.28%  [kernel]          [k] inet_gro_receive
>       7.17%  [kernel]          [k] tcp_gro_receive
>       5.10%  [kernel]          [k] dev_gro_receive
>       4.87%  [kernel]          [k] skb_gro_receive
>       2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
>       2.04%  [kernel]          [k] __build_skb
>       1.02%  [kernel]          [k] napi_reuse_skb.isra.95
>       1.01%  [kernel]          [k] tcp4_gro_receive
>       0.65%  [kernel]          [k] kmem_cache_alloc
>       0.45%  [kernel]          [k] _raw_spin_lock
>
> Without the latest  patch (the exact patch series v3 I submitted),
> thus with this atomic_inc() in mlx4_en_process_rx_cq  instead of only reads.
>
> lpaa24:~# sar -n DEV 1 10|grep eth0|grep Ave
> Average:         eth0 3566768.50  25638.60 1790345.69   1663.51
> 0.00      0.00      4.50
>
> Profiles of the two cpus :
>
>      74.85%  [kernel]      [k] copy_user_enhanced_fast_string
>       6.42%  [kernel]      [k] skb_release_data
>       5.65%  [kernel]      [k] skb_copy_datagram_iter
>       1.83%  [kernel]      [k] copy_page_to_iter
>       1.59%  [kernel]      [k] _raw_spin_lock
>       1.48%  [kernel]      [k] skb_release_head_state
>       0.72%  [kernel]      [k] tcp_transmit_skb
>       0.68%  [kernel]      [k] mlx4_en_xmit
>       0.43%  [kernel]      [k] page_frag_free
>       0.38%  [kernel]      [k] ___cache_free
>       0.37%  [kernel]      [k] tcp_established_options
>       0.37%  [kernel]      [k] __ip_local_out
>
>
>     37.98%  [kernel]          [k] mlx4_en_process_rx_cq
>      26.47%  [kernel]          [k] napi_gro_frags
>       7.02%  [kernel]          [k] inet_gro_receive
>       5.89%  [kernel]          [k] tcp_gro_receive
>       5.17%  [kernel]          [k] dev_gro_receive
>       4.80%  [kernel]          [k] skb_gro_receive
>       2.61%  [kernel]          [k] __build_skb
>       2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
>       1.59%  [kernel]          [k] napi_reuse_skb.isra.95
>       0.95%  [kernel]          [k] tcp4_gro_receive
>       0.51%  [kernel]          [k] kmem_cache_alloc
>       0.42%  [kernel]          [k] __inet_lookup_established
>       0.34%  [kernel]          [k] swiotlb_sync_single_for_cpu
>
>
> So probably this will need further analysis, outside of the scope of
> this patch series.
>
> Could we now please Ack this v3 and merge it ?
>
> Thanks.
Thanks Eric.

As the previous series caused hangs, we must run functional regression 
tests over this series as well.
Run has already started, and results will be available tomorrow morning.

In general, I really like this series. The re-factorization looks more 
elegant and more correct, functionally.

However, performance wise: we fear that the numbers will be drastically 
lower with this transition to order-0 pages,
because of the (becoming critical) page allocator and dma operations 
bottlenecks, especially on systems with costly
dma operations, such as ARM, iommu=on, etc...

We already have this exact issue in mlx5, where we moved to order-0 
allocations with a fixed size cache, but that was not enough.
Customers of mlx5 have already complained about the performance 
degradation, and currently this is hurting our business.
We get a clear nack from our performance regression team regarding doing 
the same in mlx4.
So, the question is, can we live with this degradation until those 
bottleneck challenges are addressed?
Following our perf experts feedback, I cannot just simply Ack. We need 
to have a clear plan to close the perf gap or reduce the impact.

Internally, I already implemented "dynamic page-cache" and "page-reuse" 
mechanisms in the driver,
and together they totally bridge the performance gap.
That's why I would like to hear from Jesper what is the status of his 
page_pool API, it is promising and could totally solve these issues.

Regards,
Tariq

>
>
>
>> Notice the huge difference. And in case 2, it is enough that the remote
>> CPU reads the cacheline and brings it into "Shared" (MESI) state, and
>> the local CPU then does the atomic op.
>>
>> One key ideas behind the page_pool, is that remote CPUs read/detect
>> refcnt==1 (Shared-state), and store the page in a small per-CPU array.
>> When array is full, it gets bulk returned to the shared-ptr-ring pool.
>> When "local" CPU need new pages, from the shared-ptr-ring it prefetchw
>> during it's bulk refill, to latency-hide the MESI transitions needed.
>>
>> --
>> Best regards,
>>    Jesper Dangaard Brouer
>>    MSc.CS, Principal Kernel Engineer at Red Hat
>>    LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-14 14:56             ` Tariq Toukan
@ 2017-02-14 15:51                 ` Eric Dumazet
  2017-02-14 17:04               ` David Miller
  2017-02-14 17:29               ` Alexander Duyck
  2 siblings, 0 replies; 78+ messages in thread
From: Eric Dumazet @ 2017-02-14 15:51 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Eric Dumazet, Jesper Dangaard Brouer, Alexander Duyck,
	David S . Miller, netdev, Tariq Toukan, Martin KaFai Lau,
	Saeed Mahameed, Willem de Bruijn, Brenden Blanco,
	Alexei Starovoitov, linux-mm

On Tue, 2017-02-14 at 16:56 +0200, Tariq Toukan wrote:

> As the previous series caused hangs, we must run functional regression 
> tests over this series as well.
> Run has already started, and results will be available tomorrow morning.
> 
> In general, I really like this series. The re-factorization looks more 
> elegant and more correct, functionally.
> 
> However, performance wise: we fear that the numbers will be drastically 
> lower with this transition to order-0 pages,
> because of the (becoming critical) page allocator and dma operations 
> bottlenecks, especially on systems with costly
> dma operations, such as ARM, iommu=on, etc...
> 

So, again, performance after this patch series is higher,
once you have sensible RX queue parameters for the expected workload.

Only in pathological cases, you might have some regression.

The old scheme was _maybe_ better _when_ memory is not fragmented.

When you run hosts for months, memory _is_ fragmented.

You never see that on benchmarks, unless you force memory being
fragmented.



> We already have this exact issue in mlx5, where we moved to order-0 
> allocations with a fixed size cache, but that was not enough.
> Customers of mlx5 have already complained about the performance 
> degradation, and currently this is hurting our business.
> We get a clear nack from our performance regression team regarding doing 
> the same in mlx4.
> So, the question is, can we live with this degradation until those 
> bottleneck challenges are addressed?

Again, there is no degradation.

We have been using order-0 pages for years at Google.

Only when we made the mistake of rebasing onto the upstream driver and
its order-3 pages did we get horrible regressions, causing production
outages.

I was silly to believe that the mm layer had gotten better.

> Following our perf experts feedback, I cannot just simply Ack. We need 
> to have a clear plan to close the perf gap or reduce the impact.

Your perf experts need to talk to me, or any experts at Google and
Facebook, really.

Anything _relying_ on order-3 pages being available to impress
friends/customers is a lie.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-14 15:51                 ` Eric Dumazet
@ 2017-02-14 16:03                   ` Eric Dumazet
  -1 siblings, 0 replies; 78+ messages in thread
From: Eric Dumazet @ 2017-02-14 16:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tariq Toukan, Jesper Dangaard Brouer, Alexander Duyck,
	David S . Miller, netdev, Tariq Toukan, Martin KaFai Lau,
	Saeed Mahameed, Willem de Bruijn, Brenden Blanco,
	Alexei Starovoitov, linux-mm

> Anything _relying_ on order-3 pages being available to impress
> friends/customers is a lie.
>

BTW, you do understand that on PowerPC right now, an Ethernet frame
holds 65536*8 = half a MByte, right?

So any PowerPC host using a mlx4 NIC can easily be brought down by
using a few TCP flows and sending out-of-order packets.

That is in itself quite scary.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-14 14:56             ` Tariq Toukan
  2017-02-14 15:51                 ` Eric Dumazet
@ 2017-02-14 17:04               ` David Miller
  2017-02-14 17:17                 ` David Laight
  2017-02-14 19:38                 ` Jesper Dangaard Brouer
  2017-02-14 17:29               ` Alexander Duyck
  2 siblings, 2 replies; 78+ messages in thread
From: David Miller @ 2017-02-14 17:04 UTC (permalink / raw)
  To: ttoukan.linux
  Cc: edumazet, brouer, alexander.duyck, netdev, tariqt, kafai, saeedm,
	willemb, bblanco, ast, eric.dumazet, linux-mm

From: Tariq Toukan <ttoukan.linux@gmail.com>
Date: Tue, 14 Feb 2017 16:56:49 +0200

> Internally, I already implemented "dynamic page-cache" and
> "page-reuse" mechanisms in the driver, and together they totally
> bridge the performance gap.

I worry about a dynamically growing page cache inside of drivers
because it is invisible to the rest of the kernel.

It responds only to local needs.

Part of the price of the real page allocator comes from the fact that
it can respond to global needs.

If a driver consumes some unreasonable percentage of system memory, it
is keeping that memory from being used from other parts of the system
even if it would be better for networking to be slightly slower with
less cache because that other thing that needs memory is more
important.

I think this is one of the primary reasons that the MM guys severely
chastise us when we build special purpose local caches into networking
facilities.

And the more I think about it the more I think they are right.

One path I see around all of this is full integration.  Meaning that
we can free pages into the page allocator which are still DMA mapped.
And future allocations from that device are prioritized to take still
DMA mapped objects.

Yes, we still need to make the page allocator faster, but this kind of
work helps everyone not just 100GB ethernet NICs.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-14 17:04               ` David Miller
@ 2017-02-14 17:17                 ` David Laight
  2017-02-14 17:22                   ` David Miller
  2017-02-14 19:38                 ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 78+ messages in thread
From: David Laight @ 2017-02-14 17:17 UTC (permalink / raw)
  To: 'David Miller', ttoukan.linux
  Cc: edumazet, brouer, alexander.duyck, netdev, tariqt, kafai, saeedm,
	willemb, bblanco, ast, eric.dumazet, linux-mm

From: David Miller
> Sent: 14 February 2017 17:04
...
> One path I see around all of this is full integration.  Meaning that
> we can free pages into the page allocator which are still DMA mapped.
> And future allocations from that device are prioritized to take still
> DMA mapped objects.
...

For systems with an 'expensive' iommu, has anyone tried separating the
allocation of iommu resources (eg page table slots) from their
assignment to physical pages?

Provided the page sizes all match, setting up a receive buffer might
be as simple as writing the physical address into the iommu slot
that matches the ring entry.

Or am I thinking about hardware that is much simpler than real life?

	David

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-14 17:17                 ` David Laight
@ 2017-02-14 17:22                   ` David Miller
  0 siblings, 0 replies; 78+ messages in thread
From: David Miller @ 2017-02-14 17:22 UTC (permalink / raw)
  To: David.Laight
  Cc: ttoukan.linux, edumazet, brouer, alexander.duyck, netdev, tariqt,
	kafai, saeedm, willemb, bblanco, ast, eric.dumazet, linux-mm

From: David Laight <David.Laight@ACULAB.COM>
Date: Tue, 14 Feb 2017 17:17:22 +0000

> From: David Miller
>> Sent: 14 February 2017 17:04
> ...
>> One path I see around all of this is full integration.  Meaning that
>> we can free pages into the page allocator which are still DMA mapped.
>> And future allocations from that device are prioritized to take still
>> DMA mapped objects.
> ...
> 
> For systems with 'expensive' iommu has anyone tried separating the
> allocation of iommu resource (eg page table slots) from their
> assignment to physical pages?
> 
> Provided the page sizes all match, setting up a receive buffer might
> be as simple as writing the physical address into the iommu slot
> that matches the ring entry.
> 
> Or am I thinking about hardware that is much simpler than real life?

You will still eat an expensive MMIO or hypervisor call to set up the
mapping.

IOMMU is expensive because of two operations, the slot allocation
(which takes locks) and the modification of the IOMMU PTE to setup
or teardown the mapping.

This is why attempts to preallocate slots (which people have looked
into) never really take off.  You really have to eliminate the
entire operation to get worthwhile gains.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-14 15:51                 ` Eric Dumazet
  (?)
  (?)
@ 2017-02-14 17:29                 ` Tom Herbert
  2017-02-15 16:42                   ` Tariq Toukan
  -1 siblings, 1 reply; 78+ messages in thread
From: Tom Herbert @ 2017-02-14 17:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tariq Toukan, Eric Dumazet, Jesper Dangaard Brouer,
	Alexander Duyck, David S . Miller, netdev, Tariq Toukan,
	Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
	Brenden Blanco, Alexei Starovoitov, linux-mm

On Tue, Feb 14, 2017 at 7:51 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2017-02-14 at 16:56 +0200, Tariq Toukan wrote:
>
>> As the previous series caused hangs, we must run functional regression
>> tests over this series as well.
>> Run has already started, and results will be available tomorrow morning.
>>
>> In general, I really like this series. The re-factorization looks more
>> elegant and more correct, functionally.
>>
>> However, performance wise: we fear that the numbers will be drastically
>> lower with this transition to order-0 pages,
>> because of the (becoming critical) page allocator and dma operations
>> bottlenecks, especially on systems with costly
>> dma operations, such as ARM, iommu=on, etc...
>>
>
> So, again, performance after this patch series is higher,
> once you have sensible RX queue parameters for the expected workload.
>
> Only in pathological cases, you might have some regression.
>
> The old scheme was _maybe_ better _when_ memory is not fragmented.
>
> When you run hosts for months, memory _is_ fragmented.
>
> You never see that on benchmarks, unless you force memory being
> fragmented.
>
>
>
>> We already have this exact issue in mlx5, where we moved to order-0
>> allocations with a fixed size cache, but that was not enough.
>> Customers of mlx5 have already complained about the performance
>> degradation, and currently this is hurting our business.
>> We get a clear nack from our performance regression team regarding doing
>> the same in mlx4.
>> So, the question is, can we live with this degradation until those
>> bottleneck challenges are addressed?
>
> Again, there is no degradation.
>
> We have been using order-0 pages for years at Google.
>
> Only when we made the mistake of rebasing onto the upstream driver and
> its order-3 pages did we get horrible regressions, causing production outages.
>
> I was silly to believe that the mm layer had gotten better.
>
>> Following our perf experts feedback, I cannot just simply Ack. We need
>> to have a clear plan to close the perf gap or reduce the impact.
>
> Your perf experts need to talk to me, or any experts at Google and
> Facebook, really.
>

I agree with this 100%! To be blunt, power users like this are testing
your drivers far beyond what Mellanox is doing, and they understand how
performance gains in benchmarks translate to possible gains in real
production far better than your perf experts can.  Listen to Eric!

Tom


> Anything _relying_ on order-3 pages being available to impress
> friends/customers is a lie.
>
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-14 14:56             ` Tariq Toukan
  2017-02-14 15:51                 ` Eric Dumazet
  2017-02-14 17:04               ` David Miller
@ 2017-02-14 17:29               ` Alexander Duyck
  2017-02-14 18:46                 ` Jesper Dangaard Brouer
  2 siblings, 1 reply; 78+ messages in thread
From: Alexander Duyck @ 2017-02-14 17:29 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Eric Dumazet, Jesper Dangaard Brouer, David S . Miller, netdev,
	Tariq Toukan, Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
	Brenden Blanco, Alexei Starovoitov, Eric Dumazet, linux-mm

On Tue, Feb 14, 2017 at 6:56 AM, Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>
>
> On 14/02/2017 3:45 PM, Eric Dumazet wrote:
>>
>> On Tue, Feb 14, 2017 at 4:12 AM, Jesper Dangaard Brouer
>> <brouer@redhat.com> wrote:
>>
>>> It is important to understand that there are two cases for the cost of
>>> an atomic op, which depend on the cache-coherency state of the
>>> cacheline.
>>>
>>> Measured on Skylake CPU i7-6700K CPU @ 4.00GHz
>>>
>>> (1) Local CPU atomic op :  27 cycles(tsc)  6.776 ns
>>> (2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns
>>>
>> Okay, it seems you guys really want a patch that I said was not giving
>> good results
>>
>> Let me publish the numbers I get , adding or not the last (and not
>> official) patch.
>>
>> If I _force_ the user space process to run on the other node,
>> then the results are not the ones Alex or you are expecting.
>>
>> I have with this patch about 2.7 Mpps of this silly single TCP flow,
>> and 3.5 Mpps without it.
>>
>> lpaa24:~# sar -n DEV 1 10 | grep eth0 | grep Ave
>> Average:         eth0 2699243.20  16663.70 1354783.36   1079.95
>> 0.00      0.00      4.50
>>
>> Profile of the cpu on NUMA node 1 ( netserver consuming data ) :
>>
>>      54.73%  [kernel]      [k] copy_user_enhanced_fast_string
>>      31.07%  [kernel]      [k] skb_release_data
>>       4.24%  [kernel]      [k] skb_copy_datagram_iter
>>       1.35%  [kernel]      [k] copy_page_to_iter
>>       0.98%  [kernel]      [k] _raw_spin_lock
>>       0.90%  [kernel]      [k] skb_release_head_state
>>       0.60%  [kernel]      [k] tcp_transmit_skb
>>       0.51%  [kernel]      [k] mlx4_en_xmit
>>       0.33%  [kernel]      [k] ___cache_free
>>       0.28%  [kernel]      [k] tcp_rcv_established
>>
>> Profile of cpu handling mlx4 softirqs (NUMA node 0)
>>
>>
>>      48.00%  [kernel]          [k] mlx4_en_process_rx_cq
>>      12.92%  [kernel]          [k] napi_gro_frags
>>       7.28%  [kernel]          [k] inet_gro_receive
>>       7.17%  [kernel]          [k] tcp_gro_receive
>>       5.10%  [kernel]          [k] dev_gro_receive
>>       4.87%  [kernel]          [k] skb_gro_receive
>>       2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
>>       2.04%  [kernel]          [k] __build_skb
>>       1.02%  [kernel]          [k] napi_reuse_skb.isra.95
>>       1.01%  [kernel]          [k] tcp4_gro_receive
>>       0.65%  [kernel]          [k] kmem_cache_alloc
>>       0.45%  [kernel]          [k] _raw_spin_lock
>>
>> Without the latest  patch (the exact patch series v3 I submitted),
>> thus with this atomic_inc() in mlx4_en_process_rx_cq  instead of only
>> reads.
>>
>> lpaa24:~# sar -n DEV 1 10|grep eth0|grep Ave
>> Average:         eth0 3566768.50  25638.60 1790345.69   1663.51
>> 0.00      0.00      4.50
>>
>> Profiles of the two cpus :
>>
>>      74.85%  [kernel]      [k] copy_user_enhanced_fast_string
>>       6.42%  [kernel]      [k] skb_release_data
>>       5.65%  [kernel]      [k] skb_copy_datagram_iter
>>       1.83%  [kernel]      [k] copy_page_to_iter
>>       1.59%  [kernel]      [k] _raw_spin_lock
>>       1.48%  [kernel]      [k] skb_release_head_state
>>       0.72%  [kernel]      [k] tcp_transmit_skb
>>       0.68%  [kernel]      [k] mlx4_en_xmit
>>       0.43%  [kernel]      [k] page_frag_free
>>       0.38%  [kernel]      [k] ___cache_free
>>       0.37%  [kernel]      [k] tcp_established_options
>>       0.37%  [kernel]      [k] __ip_local_out
>>
>>
>>     37.98%  [kernel]          [k] mlx4_en_process_rx_cq
>>      26.47%  [kernel]          [k] napi_gro_frags
>>       7.02%  [kernel]          [k] inet_gro_receive
>>       5.89%  [kernel]          [k] tcp_gro_receive
>>       5.17%  [kernel]          [k] dev_gro_receive
>>       4.80%  [kernel]          [k] skb_gro_receive
>>       2.61%  [kernel]          [k] __build_skb
>>       2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
>>       1.59%  [kernel]          [k] napi_reuse_skb.isra.95
>>       0.95%  [kernel]          [k] tcp4_gro_receive
>>       0.51%  [kernel]          [k] kmem_cache_alloc
>>       0.42%  [kernel]          [k] __inet_lookup_established
>>       0.34%  [kernel]          [k] swiotlb_sync_single_for_cpu
>>
>>
>> So probably this will need further analysis, outside of the scope of
>> this patch series.
>>
>> Could we now please Ack this v3 and merge it ?
>>
>> Thanks.
>
> Thanks Eric.
>
> As the previous series caused hangs, we must run functional regression tests
> over this series as well.
> Run has already started, and results will be available tomorrow morning.
>
> In general, I really like this series. The re-factorization looks more
> elegant and more correct, functionally.
>
> However, performance wise: we fear that the numbers will be drastically
> lower with this transition to order-0 pages,
> because of the (becoming critical) page allocator and dma operations
> bottlenecks, especially on systems with costly
> dma operations, such as ARM, iommu=on, etc...

So to give you an idea I originally came up with the approach used in
the Intel drivers when I was dealing with a PowerPC system where the
IOMMU was a requirement.  With this set up correctly, the map/unmap
calls should be almost non-existent.  Basically the only time we
should have to allocate or free a page is if something is sitting on
the pages for an excessively long time or if the interrupt is bouncing
between memory nodes which forces us to free the memory since it is
sitting on the wrong node.

> We already have this exact issue in mlx5, where we moved to order-0
> allocations with a fixed size cache, but that was not enough.
> Customers of mlx5 have already complained about the performance degradation,
> and currently this is hurting our business.
> We get a clear nack from our performance regression team regarding doing the
> same in mlx4.

What form of recycling were you doing?  If you were doing the
offset-based setup then that obviously only gets you so much.  The
advantage of using the page count is that you get almost a Möbius strip
effect for your buffers and descriptor rings, where you just keep
flipping the page offset back and forth via an XOR, and the only cost
is dma_sync_for_cpu, get_page, dma_sync_for_device instead of having to
do the full allocation, mapping, and unmapping.
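
Roughly, what I mean is something like this (a simplified sketch with
made-up names just to illustrate the flow, not the actual ixgbe code):

/* Simplified sketch of page-count based half-page recycling. */
#include <linux/mm.h>
#include <linux/dma-mapping.h>

struct rx_buf {
        struct page     *page;
        dma_addr_t      dma;            /* page mapped once, at ring fill */
        unsigned int    page_offset;    /* 0 or PAGE_SIZE/2 */
};

static bool rx_buf_can_reuse(const struct rx_buf *buf)
{
        /* Only the driver may still hold the page, and it must be local. */
        return page_count(buf->page) == 1 &&
               !page_is_pfmemalloc(buf->page) &&
               page_to_nid(buf->page) == numa_mem_id();
}

/* Called per received frame; returns the half-page handed to the stack. */
static void *rx_buf_get_frag(struct device *dev, struct rx_buf *buf,
                             unsigned int len)
{
        void *frag = page_address(buf->page) + buf->page_offset;

        dma_sync_single_range_for_cpu(dev, buf->dma, buf->page_offset,
                                      len, DMA_FROM_DEVICE);
        if (rx_buf_can_reuse(buf)) {
                get_page(buf->page);                    /* ref for the skb */
                buf->page_offset ^= PAGE_SIZE / 2;      /* flip halves */
                dma_sync_single_range_for_device(dev, buf->dma,
                                                 buf->page_offset,
                                                 PAGE_SIZE / 2,
                                                 DMA_FROM_DEVICE);
        } else {
                /* Our last ref goes to the stack; a new page is needed. */
                dma_unmap_page(dev, buf->dma, PAGE_SIZE, DMA_FROM_DEVICE);
                buf->page = NULL;
        }
        return frag;
}

The page is mapped once when the ring is filled; after that, as long as
the stack frees its half by the time the buffer comes around again, the
same page just keeps flipping between its two halves and the
alloc/map/unmap calls drop out of the hot path entirely.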

> So, the question is, can we live with this degradation until those
> bottleneck challenges are addressed?
> Following our perf experts feedback, I cannot just simply Ack. We need to
> have a clear plan to close the perf gap or reduce the impact.

I think we need to define what degradation you are expecting to see.
Using the page-count-based approach should actually be better in some
cases than the order-3 page plus walking-frags approach, since the
theoretical reuse is infinite instead of fixed.

> Internally, I already implemented "dynamic page-cache" and "page-reuse"
> mechanisms in the driver,
> and together they totally bridge the performance gap.
> That's why I would like to hear from Jesper what is the status of his
> page_pool API, it is promising and could totally solve these issues.
>
> Regards,
> Tariq

The page pool may provide gains but we have to actually see it before
we can guarantee it.  If worse comes to worse I might just resort to
standardizing the logic used for the Intel driver page count based
approach.  Then if nothing else we would have a standard path for all
the drivers to use if we start going down this route.

- Alex

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-14 17:29               ` Alexander Duyck
@ 2017-02-14 18:46                 ` Jesper Dangaard Brouer
  2017-02-14 19:02                   ` Eric Dumazet
  2017-02-14 19:06                     ` Alexander Duyck
  0 siblings, 2 replies; 78+ messages in thread
From: Jesper Dangaard Brouer @ 2017-02-14 18:46 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Tariq Toukan, Eric Dumazet, David S . Miller, netdev,
	Tariq Toukan, Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
	Brenden Blanco, Alexei Starovoitov, Eric Dumazet, linux-mm,
	brouer

On Tue, 14 Feb 2017 09:29:54 -0800
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> On Tue, Feb 14, 2017 at 6:56 AM, Tariq Toukan <ttoukan.linux@gmail.com> wrote:
> >
> >
> > On 14/02/2017 3:45 PM, Eric Dumazet wrote:  
> >>
> >> On Tue, Feb 14, 2017 at 4:12 AM, Jesper Dangaard Brouer
> >> <brouer@redhat.com> wrote:
> >>  
> >>> It is important to understand that there are two cases for the cost of
> >>> an atomic op, which depend on the cache-coherency state of the
> >>> cacheline.
> >>>
> >>> Measured on Skylake CPU i7-6700K CPU @ 4.00GHz
> >>>
> >>> (1) Local CPU atomic op :  27 cycles(tsc)  6.776 ns
> >>> (2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns
> >>>  
> >> Okay, it seems you guys really want a patch that I said was not giving
> >> good results
> >>
> >> Let me publish the numbers I get , adding or not the last (and not
> >> official) patch.
> >>
> >> If I _force_ the user space process to run on the other node,
> >> then the results are not the ones Alex or you are expecting.
> >>
> >> I have with this patch about 2.7 Mpps of this silly single TCP flow,
> >> and 3.5 Mpps without it.
> >>
> >> lpaa24:~# sar -n DEV 1 10 | grep eth0 | grep Ave
> >> Average:         eth0 2699243.20  16663.70 1354783.36   1079.95
> >> 0.00      0.00      4.50
> >>
> >> Profile of the cpu on NUMA node 1 ( netserver consuming data ) :
> >>
> >>      54.73%  [kernel]      [k] copy_user_enhanced_fast_string
> >>      31.07%  [kernel]      [k] skb_release_data
> >>       4.24%  [kernel]      [k] skb_copy_datagram_iter
> >>       1.35%  [kernel]      [k] copy_page_to_iter
> >>       0.98%  [kernel]      [k] _raw_spin_lock
> >>       0.90%  [kernel]      [k] skb_release_head_state
> >>       0.60%  [kernel]      [k] tcp_transmit_skb
> >>       0.51%  [kernel]      [k] mlx4_en_xmit
> >>       0.33%  [kernel]      [k] ___cache_free
> >>       0.28%  [kernel]      [k] tcp_rcv_established
> >>
> >> Profile of cpu handling mlx4 softirqs (NUMA node 0)
> >>
> >>
> >>      48.00%  [kernel]          [k] mlx4_en_process_rx_cq
> >>      12.92%  [kernel]          [k] napi_gro_frags
> >>       7.28%  [kernel]          [k] inet_gro_receive
> >>       7.17%  [kernel]          [k] tcp_gro_receive
> >>       5.10%  [kernel]          [k] dev_gro_receive
> >>       4.87%  [kernel]          [k] skb_gro_receive
> >>       2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
> >>       2.04%  [kernel]          [k] __build_skb
> >>       1.02%  [kernel]          [k] napi_reuse_skb.isra.95
> >>       1.01%  [kernel]          [k] tcp4_gro_receive
> >>       0.65%  [kernel]          [k] kmem_cache_alloc
> >>       0.45%  [kernel]          [k] _raw_spin_lock
> >>
> >> Without the latest  patch (the exact patch series v3 I submitted),
> >> thus with this atomic_inc() in mlx4_en_process_rx_cq  instead of only
> >> reads.
> >>
> >> lpaa24:~# sar -n DEV 1 10|grep eth0|grep Ave
> >> Average:         eth0 3566768.50  25638.60 1790345.69   1663.51
> >> 0.00      0.00      4.50
> >>
> >> Profiles of the two cpus :
> >>
> >>      74.85%  [kernel]      [k] copy_user_enhanced_fast_string
> >>       6.42%  [kernel]      [k] skb_release_data
> >>       5.65%  [kernel]      [k] skb_copy_datagram_iter
> >>       1.83%  [kernel]      [k] copy_page_to_iter
> >>       1.59%  [kernel]      [k] _raw_spin_lock
> >>       1.48%  [kernel]      [k] skb_release_head_state
> >>       0.72%  [kernel]      [k] tcp_transmit_skb
> >>       0.68%  [kernel]      [k] mlx4_en_xmit
> >>       0.43%  [kernel]      [k] page_frag_free
> >>       0.38%  [kernel]      [k] ___cache_free
> >>       0.37%  [kernel]      [k] tcp_established_options
> >>       0.37%  [kernel]      [k] __ip_local_out
> >>
> >>
> >>     37.98%  [kernel]          [k] mlx4_en_process_rx_cq
> >>      26.47%  [kernel]          [k] napi_gro_frags
> >>       7.02%  [kernel]          [k] inet_gro_receive
> >>       5.89%  [kernel]          [k] tcp_gro_receive
> >>       5.17%  [kernel]          [k] dev_gro_receive
> >>       4.80%  [kernel]          [k] skb_gro_receive
> >>       2.61%  [kernel]          [k] __build_skb
> >>       2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
> >>       1.59%  [kernel]          [k] napi_reuse_skb.isra.95
> >>       0.95%  [kernel]          [k] tcp4_gro_receive
> >>       0.51%  [kernel]          [k] kmem_cache_alloc
> >>       0.42%  [kernel]          [k] __inet_lookup_established
> >>       0.34%  [kernel]          [k] swiotlb_sync_single_for_cpu
> >>
> >>
> >> So probably this will need further analysis, outside of the scope of
> >> this patch series.
> >>
> >> Could we now please Ack this v3 and merge it ?
> >>
> >> Thanks.  
> >
> > Thanks Eric.
> >
> > As the previous series caused hangs, we must run functional regression tests
> > over this series as well.
> > Run has already started, and results will be available tomorrow morning.
> >
> > In general, I really like this series. The re-factorization looks more
> > elegant and more correct, functionally.
> >
> > However, performance wise: we fear that the numbers will be drastically
> > lower with this transition to order-0 pages,
> > because of the (becoming critical) page allocator and dma operations
> > bottlenecks, especially on systems with costly
> > dma operations, such as ARM, iommu=on, etc...  
> 
> So to give you an idea I originally came up with the approach used in
> the Intel drivers when I was dealing with a PowerPC system where the
> IOMMU was a requirement.  With this setup correctly the map/unmap
> calls should be almost non-existent.  Basically the only time we
> should have to allocate or free a page is if something is sitting on
> the pages for an excessively long time or if the interrupt is bouncing
> between memory nodes which forces us to free the memory since it is
> sitting on the wrong node.
>
> > We already have this exact issue in mlx5, where we moved to order-0
> > allocations with a fixed size cache, but that was not enough.
> > Customers of mlx5 have already complained about the performance degradation,
> > and currently this is hurting our business.
> > We get a clear nack from our performance regression team regarding doing the
> > same in mlx4.  
> 
> What form of recycling were you doing?  If you were doing the offset
> based setup then that obviously only gets you so much.  The advantage
> to using the page count is that you get almost a mobius strip effect
> for your buffers and descriptor rings where you just keep flipping the
> page offset back and forth via an XOR and the only cost is
> dma_sync_for_cpu, get_page, dma_sync_for_device instead of having to
> do the full allocation, mapping, and unmapping.
> 
> > So, the question is, can we live with this degradation until those
> > bottleneck challenges are addressed?
> > Following our perf experts feedback, I cannot just simply Ack. We need to
> > have a clear plan to close the perf gap or reduce the impact.  
> 
> I think we need to define what is the degradation you are expecting to
> see.  Using the page count based approach should actually be better in
> some cases than the order 3 page and just walking frags approach since
> the theoretical reuse is infinite instead of fixed.
> 
> > Internally, I already implemented "dynamic page-cache" and "page-reuse"
> > mechanisms in the driver,
> > and together they totally bridge the performance gap.
> > That's why I would like to hear from Jesper what is the status of his
> > page_pool API, it is promising and could totally solve these issues.
> >
> > Regards,
> > Tariq  
> 
> The page pool may provide gains but we have to actually see it before
> we can guarantee it.  If worse comes to worse I might just resort to
> standardizing the logic used for the Intel driver page count based
> approach.  Then if nothing else we would have a standard path for all
> the drivers to use if we start going down this route.

With this Intel-driver page-count-based recycle approach, the recycle
size is tied to the size of the RX ring, as Eric and Tariq discovered.
And for other performance reasons (memory footprint of walking RX ring
data-structures), we don't want to increase the RX ring sizes.  Thus,
it creates two opposing performance needs.  That is why I think a more
explicit approach with a pool is more attractive.

How is this approach going to work for XDP?
(XDP doesn't "share" the page, and in general we don't want the extra
atomic.)

We absolutely need recycling with XDP when transmitting out another
device, and the other device's DMA-TX completion needs some way of
returning this page.
What is basically needed is a standardized callback to allow the remote
driver to return the page to the originating driver.  As we don't have
an NDO for XDP-forward/transmit yet, we could pass this callback as a
parameter along with the packet-page to send?
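
Something along these lines is what I have in mind -- purely a
hypothetical sketch (nothing like this exists today), just to show the
shape of the callback:

/* Hypothetical page-return hook, passed along with an XDP packet-page. */
#include <linux/mm.h>

struct xdp_page_return {
        void (*recycle)(void *pool, struct page *page);
        void *pool;             /* originating driver's recycle pool */
};

/*
 * The forwarding (TX) driver would call this from its DMA-TX completion
 * instead of put_page(), giving the page back to the RX driver's pool.
 */
static inline void xdp_return_page(const struct xdp_page_return *ret,
                                   struct page *page)
{
        if (ret && ret->recycle)
                ret->recycle(ret->pool, page);
        else
                put_page(page);
}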

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-14 18:46                 ` Jesper Dangaard Brouer
@ 2017-02-14 19:02                   ` Eric Dumazet
  2017-02-14 20:02                       ` Jesper Dangaard Brouer
  2017-02-14 19:06                     ` Alexander Duyck
  1 sibling, 1 reply; 78+ messages in thread
From: Eric Dumazet @ 2017-02-14 19:02 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexander Duyck, Tariq Toukan, David S . Miller, netdev,
	Tariq Toukan, Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
	Brenden Blanco, Alexei Starovoitov, Eric Dumazet, linux-mm

On Tue, Feb 14, 2017 at 10:46 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>

>
> With this Intel driver page count based recycle approach, the recycle
> size is tied to the size of the RX ring.  As Eric and Tariq discovered.
> And for other performance reasons (memory footprint of walking RX ring
> data-structures), don't want to increase the RX ring sizes.  Thus, it
> create two opposite performance needs.  That is why I think a more
> explicit approach with a pool is more attractive.
>
> How is this approach doing to work for XDP?
> (XDP doesn't "share" the page, and in-general we don't want the extra
> atomic.)
>
> We absolutely need recycling with XDP, when transmitting out another
> device, and the other devices DMA-TX completion need some way of
> returning this page.
> What is basically needed is a standardized callback to allow the remote
> driver to return the page to the originating driver.  As we don't have
> a NDP for XDP-forward/transmit yet, we could pass this callback as a
> parameter along with the packet-page to send?
>
>


mlx4 already has a cache for XDP.
I believe I did not change this part; it should still work.

commit d576acf0a22890cf3f8f7a9b035f1558077f6770
Author: Brenden Blanco <bblanco@plumgrid.com>
Date:   Tue Jul 19 12:16:52 2016 -0700

    net/mlx4_en: add page recycle to prepare rx ring for tx support

I have not checked whether Tom's recent work added core infra for this cache.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-14 18:46                 ` Jesper Dangaard Brouer
@ 2017-02-14 19:06                     ` Alexander Duyck
  2017-02-14 19:06                     ` Alexander Duyck
  1 sibling, 0 replies; 78+ messages in thread
From: Alexander Duyck @ 2017-02-14 19:06 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Tariq Toukan, Eric Dumazet, David S . Miller, netdev,
	Tariq Toukan, Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
	Brenden Blanco, Alexei Starovoitov, Eric Dumazet, linux-mm,
	John Fastabend

On Tue, Feb 14, 2017 at 10:46 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Tue, 14 Feb 2017 09:29:54 -0800
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> On Tue, Feb 14, 2017 at 6:56 AM, Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>> >
>> >
>> > On 14/02/2017 3:45 PM, Eric Dumazet wrote:
>> >>
>> >> On Tue, Feb 14, 2017 at 4:12 AM, Jesper Dangaard Brouer
>> >> <brouer@redhat.com> wrote:
>> >>
>> >>> It is important to understand that there are two cases for the cost of
>> >>> an atomic op, which depend on the cache-coherency state of the
>> >>> cacheline.
>> >>>
>> >>> Measured on Skylake CPU i7-6700K CPU @ 4.00GHz
>> >>>
>> >>> (1) Local CPU atomic op :  27 cycles(tsc)  6.776 ns
>> >>> (2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns
>> >>>
>> >> Okay, it seems you guys really want a patch that I said was not giving
>> >> good results
>> >>
>> >> Let me publish the numbers I get , adding or not the last (and not
>> >> official) patch.
>> >>
>> >> If I _force_ the user space process to run on the other node,
>> >> then the results are not the ones Alex or you are expecting.
>> >>
>> >> I have with this patch about 2.7 Mpps of this silly single TCP flow,
>> >> and 3.5 Mpps without it.
>> >>
>> >> lpaa24:~# sar -n DEV 1 10 | grep eth0 | grep Ave
>> >> Average:         eth0 2699243.20  16663.70 1354783.36   1079.95
>> >> 0.00      0.00      4.50
>> >>
>> >> Profile of the cpu on NUMA node 1 ( netserver consuming data ) :
>> >>
>> >>      54.73%  [kernel]      [k] copy_user_enhanced_fast_string
>> >>      31.07%  [kernel]      [k] skb_release_data
>> >>       4.24%  [kernel]      [k] skb_copy_datagram_iter
>> >>       1.35%  [kernel]      [k] copy_page_to_iter
>> >>       0.98%  [kernel]      [k] _raw_spin_lock
>> >>       0.90%  [kernel]      [k] skb_release_head_state
>> >>       0.60%  [kernel]      [k] tcp_transmit_skb
>> >>       0.51%  [kernel]      [k] mlx4_en_xmit
>> >>       0.33%  [kernel]      [k] ___cache_free
>> >>       0.28%  [kernel]      [k] tcp_rcv_established
>> >>
>> >> Profile of cpu handling mlx4 softirqs (NUMA node 0)
>> >>
>> >>
>> >>      48.00%  [kernel]          [k] mlx4_en_process_rx_cq
>> >>      12.92%  [kernel]          [k] napi_gro_frags
>> >>       7.28%  [kernel]          [k] inet_gro_receive
>> >>       7.17%  [kernel]          [k] tcp_gro_receive
>> >>       5.10%  [kernel]          [k] dev_gro_receive
>> >>       4.87%  [kernel]          [k] skb_gro_receive
>> >>       2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
>> >>       2.04%  [kernel]          [k] __build_skb
>> >>       1.02%  [kernel]          [k] napi_reuse_skb.isra.95
>> >>       1.01%  [kernel]          [k] tcp4_gro_receive
>> >>       0.65%  [kernel]          [k] kmem_cache_alloc
>> >>       0.45%  [kernel]          [k] _raw_spin_lock
>> >>
>> >> Without the latest  patch (the exact patch series v3 I submitted),
>> >> thus with this atomic_inc() in mlx4_en_process_rx_cq  instead of only
>> >> reads.
>> >>
>> >> lpaa24:~# sar -n DEV 1 10|grep eth0|grep Ave
>> >> Average:         eth0 3566768.50  25638.60 1790345.69   1663.51
>> >> 0.00      0.00      4.50
>> >>
>> >> Profiles of the two cpus :
>> >>
>> >>      74.85%  [kernel]      [k] copy_user_enhanced_fast_string
>> >>       6.42%  [kernel]      [k] skb_release_data
>> >>       5.65%  [kernel]      [k] skb_copy_datagram_iter
>> >>       1.83%  [kernel]      [k] copy_page_to_iter
>> >>       1.59%  [kernel]      [k] _raw_spin_lock
>> >>       1.48%  [kernel]      [k] skb_release_head_state
>> >>       0.72%  [kernel]      [k] tcp_transmit_skb
>> >>       0.68%  [kernel]      [k] mlx4_en_xmit
>> >>       0.43%  [kernel]      [k] page_frag_free
>> >>       0.38%  [kernel]      [k] ___cache_free
>> >>       0.37%  [kernel]      [k] tcp_established_options
>> >>       0.37%  [kernel]      [k] __ip_local_out
>> >>
>> >>
>> >>     37.98%  [kernel]          [k] mlx4_en_process_rx_cq
>> >>      26.47%  [kernel]          [k] napi_gro_frags
>> >>       7.02%  [kernel]          [k] inet_gro_receive
>> >>       5.89%  [kernel]          [k] tcp_gro_receive
>> >>       5.17%  [kernel]          [k] dev_gro_receive
>> >>       4.80%  [kernel]          [k] skb_gro_receive
>> >>       2.61%  [kernel]          [k] __build_skb
>> >>       2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
>> >>       1.59%  [kernel]          [k] napi_reuse_skb.isra.95
>> >>       0.95%  [kernel]          [k] tcp4_gro_receive
>> >>       0.51%  [kernel]          [k] kmem_cache_alloc
>> >>       0.42%  [kernel]          [k] __inet_lookup_established
>> >>       0.34%  [kernel]          [k] swiotlb_sync_single_for_cpu
>> >>
>> >>
>> >> So probably this will need further analysis, outside of the scope of
>> >> this patch series.
>> >>
>> >> Could we now please Ack this v3 and merge it ?
>> >>
>> >> Thanks.
>> >
>> > Thanks Eric.
>> >
>> > As the previous series caused hangs, we must run functional regression tests
>> > over this series as well.
>> > Run has already started, and results will be available tomorrow morning.
>> >
>> > In general, I really like this series. The re-factorization looks more
>> > elegant and more correct, functionally.
>> >
>> > However, performance wise: we fear that the numbers will be drastically
>> > lower with this transition to order-0 pages,
>> > because of the (becoming critical) page allocator and dma operations
>> > bottlenecks, especially on systems with costly
>> > dma operations, such as ARM, iommu=on, etc...
>>
>> So to give you an idea I originally came up with the approach used in
>> the Intel drivers when I was dealing with a PowerPC system where the
>> IOMMU was a requirement.  With this setup correctly the map/unmap
>> calls should be almost non-existent.  Basically the only time we
>> should have to allocate or free a page is if something is sitting on
>> the pages for an excessively long time or if the interrupt is bouncing
>> between memory nodes which forces us to free the memory since it is
>> sitting on the wrong node.
>>
>> > We already have this exact issue in mlx5, where we moved to order-0
>> > allocations with a fixed size cache, but that was not enough.
>> > Customers of mlx5 have already complained about the performance degradation,
>> > and currently this is hurting our business.
>> > We get a clear nack from our performance regression team regarding doing the
>> > same in mlx4.
>>
>> What form of recycling were you doing?  If you were doing the offset
>> based setup then that obviously only gets you so much.  The advantage
>> to using the page count is that you get almost a mobius strip effect
>> for your buffers and descriptor rings where you just keep flipping the
>> page offset back and forth via an XOR and the only cost is
>> dma_sync_for_cpu, get_page, dma_sync_for_device instead of having to
>> do the full allocation, mapping, and unmapping.
>>
>> > So, the question is, can we live with this degradation until those
>> > bottleneck challenges are addressed?
>> > Following our perf experts feedback, I cannot just simply Ack. We need to
>> > have a clear plan to close the perf gap or reduce the impact.
>>
>> I think we need to define what is the degradation you are expecting to
>> see.  Using the page count based approach should actually be better in
>> some cases than the order 3 page and just walking frags approach since
>> the theoretical reuse is infinite instead of fixed.
>>
>> > Internally, I already implemented "dynamic page-cache" and "page-reuse"
>> > mechanisms in the driver,
>> > and together they totally bridge the performance gap.
>> > That's why I would like to hear from Jesper what is the status of his
>> > page_pool API, it is promising and could totally solve these issues.
>> >
>> > Regards,
>> > Tariq
>>
>> The page pool may provide gains but we have to actually see it before
>> we can guarantee it.  If worse comes to worse I might just resort to
>> standardizing the logic used for the Intel driver page count based
>> approach.  Then if nothing else we would have a standard path for all
>> the drivers to use if we start going down this route.
>
> With this Intel driver page count based recycle approach, the recycle
> size is tied to the size of the RX ring.  As Eric and Tariq discovered.
> And for other performance reasons (memory footprint of walking RX ring
> data-structures), don't want to increase the RX ring sizes.  Thus, it
> create two opposite performance needs.  That is why I think a more
> explicit approach with a pool is more attractive.
>
> How is this approach doing to work for XDP?
> (XDP doesn't "share" the page, and in-general we don't want the extra
> atomic.)

The extra atomic is moot, since we can get rid of it by doing bulk
page count updates.

Why can't XDP share a page?  You have brought up the user space aspect
of things repeatedly, but the AF_PACKET patches John did more or less
demonstrated that this isn't the case.  What you end up with is that
you either provide pure user space pages or pure kernel pages.  You
can't have pages being swapped between the two without introducing
security issues.  So even with your pool approach it doesn't matter
which way things are run: your pool is either device to user space or
device to kernel space, and it can't be both without creating sharing
concerns.

> We absolutely need recycling with XDP, when transmitting out another
> device, and the other devices DMA-TX completion need some way of
> returning this page.

I fully agree here.  However, the easiest way of handling this is via
the page count, in my opinion.
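
As a rough illustration of the page-count recycle scheme described
above (the rx_buffer layout and RX_BUF_SIZE below are made up for the
example, not any driver's actual fields; real drivers also batch the
refcount updates with a bias trick instead of one atomic per frame):

#include <linux/dma-mapping.h>
#include <linux/mm.h>
#include <linux/page_ref.h>
#include <linux/topology.h>

#define RX_BUF_SIZE	(PAGE_SIZE / 2)	/* two buffers per order-0 page */

struct rx_buffer {			/* hypothetical per-descriptor state */
	struct page	*page;
	dma_addr_t	dma;		/* mapping of the whole page */
	unsigned int	page_offset;	/* 0 or RX_BUF_SIZE */
};

/* At RX completion: hand the received half to the stack, then try to
 * recycle the page for the next descriptor instead of unmap + realloc. */
static bool rx_try_recycle(struct device *dev, struct rx_buffer *buf)
{
	struct page *page = buf->page;

	/* make the received half visible to the CPU */
	dma_sync_single_range_for_cpu(dev, buf->dma, buf->page_offset,
				      RX_BUF_SIZE, DMA_FROM_DEVICE);

	/* give up if the page sits on the wrong node, or the stack still
	 * holds the other half (only our own reference must remain) */
	if (page_to_nid(page) != numa_mem_id() || page_ref_count(page) != 1)
		return false;		/* caller unmaps and allocates anew */

	/* one reference goes with the frag handed to the stack */
	page_ref_inc(page);

	/* "mobius strip": flip to the other half for the next descriptor */
	buf->page_offset ^= RX_BUF_SIZE;
	dma_sync_single_range_for_device(dev, buf->dma, buf->page_offset,
					 RX_BUF_SIZE, DMA_FROM_DEVICE);
	return true;
}

In this shape the recycle rate is bounded by how quickly the stack
drops its reference on the other half, which is exactly the ring-size
coupling discussed below.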

> What is basically needed is a standardized callback to allow the remote
> driver to return the page to the originating driver.  As we don't have
> a NDP for XDP-forward/transmit yet, we could pass this callback as a
> parameter along with the packet-page to send?

No.  This assumes way too much.  Most packets aren't going to be
doing device to device routing.  We have seen this type of thing and
rejected it multiple times.  Don't think driver to driver.  This is
driver to network stack, socket, device, virtual machine, storage,
etc.  The fact is there are many spots where a frame might get
terminated.  This is why the bulk alloc/free of skbs never went
anywhere: we never really addressed all the spots where a frame can be
terminated.

This is why I said before that what we need is a page destructor to
handle this sort of thing.  The idea is that you want this to work
everywhere.  Having just drivers do this would make it a nice toy but
completely useless, since not too many people are doing
routing/bridging between interfaces.  Using the page destructor it is
easy to create a pool of "free" pages that you can then pull your DMA
pages out of, or release back into the page allocator.
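
To make the idea concrete, one hypothetical shape this could take
(none of this exists in the kernel; the names and the hook point are
invented for the example): the originating driver attaches a release
callback to the buffer, and whoever terminates the frame calls it
instead of a bare put_page(), funnelling the page back into the
driver's DMA-mapped free pool.

#include <linux/mm.h>

/* Hypothetical sketch only -- no such API exists. */
struct dma_page_pool;			/* the originating driver's pool */

struct recyclable_page {
	struct page *page;
	struct dma_page_pool *pool;
	/* set by the driver that allocated and mapped the page */
	void (*destructor)(struct dma_page_pool *pool, struct page *page);
};

/* Called wherever the frame terminates (stack, socket, another
 * device's TX completion, storage, ...) instead of a plain put_page(). */
static inline void recyclable_page_release(struct recyclable_page *rp)
{
	if (rp->destructor)
		rp->destructor(rp->pool, rp->page);	/* back to DMA pool */
	else
		put_page(rp->page);			/* normal page free */
}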

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-14 17:04               ` David Miller
  2017-02-14 17:17                 ` David Laight
@ 2017-02-14 19:38                 ` Jesper Dangaard Brouer
  2017-02-14 19:59                   ` David Miller
  1 sibling, 1 reply; 78+ messages in thread
From: Jesper Dangaard Brouer @ 2017-02-14 19:38 UTC (permalink / raw)
  To: David Miller
  Cc: ttoukan.linux, edumazet, alexander.duyck, netdev, tariqt, kafai,
	saeedm, willemb, bblanco, ast, eric.dumazet, linux-mm, brouer

On Tue, 14 Feb 2017 12:04:26 -0500 (EST)
David Miller <davem@davemloft.net> wrote:

> From: Tariq Toukan <ttoukan.linux@gmail.com>
> Date: Tue, 14 Feb 2017 16:56:49 +0200
> 
> > Internally, I already implemented "dynamic page-cache" and
> > "page-reuse" mechanisms in the driver, and together they totally
> > bridge the performance gap.  

It sounds like you basically implemented a page_pool scheme...

> I worry about a dynamically growing page cache inside of drivers
> because it is invisible to the rest of the kernel.

Exactly, that is why I wanted a separate standardized thing, which I
call the page_pool, that lives in the MM tree and interacts with the
page allocator.  E.g. it must implement/support a way for the page
allocator to reclaim pages from it (admittedly, I didn't implement
this in the RFC patches).
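
Roughly what such a standardized pool could look like (illustrative
only, this is not the RFC's actual API; the names are made up): pages
come back to a per-device list with their DMA mapping intact, and a
reclaim hook lets the MM side drain the pool under memory pressure.

#include <linux/dma-mapping.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct rx_page_pool {			/* illustrative, not the RFC API */
	struct device		*dev;
	spinlock_t		lock;
	struct list_head	free_pages;	/* pages kept DMA mapped */
	unsigned int		count;
};

struct pooled_page {
	struct list_head	list;
	struct page		*page;
	dma_addr_t		dma;
};

/* Recycle path: keep the DMA mapping, just park the page in the pool. */
static void pool_put(struct rx_page_pool *pool, struct pooled_page *pp)
{
	spin_lock(&pool->lock);
	list_add(&pp->list, &pool->free_pages);
	pool->count++;
	spin_unlock(&pool->lock);
}

/* Reclaim hook the page allocator (e.g. via a shrinker) could call
 * under memory pressure: unmap and hand pages back to the buddy. */
static unsigned int pool_reclaim(struct rx_page_pool *pool, unsigned int nr)
{
	struct pooled_page *pp, *tmp;
	unsigned int freed = 0;

	spin_lock(&pool->lock);
	list_for_each_entry_safe(pp, tmp, &pool->free_pages, list) {
		if (freed == nr)
			break;
		list_del(&pp->list);
		pool->count--;
		dma_unmap_page(pool->dev, pp->dma, PAGE_SIZE,
			       DMA_FROM_DEVICE);
		put_page(pp->page);
		kfree(pp);
		freed++;
	}
	spin_unlock(&pool->lock);
	return freed;
}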


> It responds only to local needs.

Generally true, but a side effect of recycling these pages is less
fragmentation of the page allocator/buddy system.


> The price of the real page allocator comes partly because it can
> respond to global needs.
> 
> If a driver consumes some unreasonable percentage of system memory, it
> is keeping that memory from being used from other parts of the system
> even if it would be better for networking to be slightly slower with
> less cache because that other thing that needs memory is more
> important.

(That is why I want OOM protection at the device level: with the
recycle feedback from the page pool we have this knowledge, and I
further wanted to allow blocking a specific RX queue [1].)
[1] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html#userspace-delivery-and-oom


> I think this is one of the primary reasons that the MM guys severely
> chastise us when we build special purpose local caches into networking
> facilities.
> 
> And the more I think about it the more I think they are right.

+1
 
> One path I see around all of this is full integration.  Meaning that
> we can free pages into the page allocator which are still DMA mapped.
> And future allocations from that device are prioritized to take still
> DMA mapped objects.

I like this idea.  Are you saying that this should be done per DMA
engine or per device?

If this is per device, it is almost the page_pool idea.  

 
> Yes, we still need to make the page allocator faster, but this kind of
> work helps everyone not just 100GB ethernet NICs.

True.  And Mel already has some generic improvements to the page
allocator queued for the next merge.  And I have the responsibility to
get the bulking API into shape.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-14 19:06                     ` Alexander Duyck
  (?)
@ 2017-02-14 19:50                     ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 78+ messages in thread
From: Jesper Dangaard Brouer @ 2017-02-14 19:50 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Tariq Toukan, Eric Dumazet, David S . Miller, netdev,
	Tariq Toukan, Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
	Brenden Blanco, Alexei Starovoitov, Eric Dumazet, linux-mm,
	John Fastabend, brouer

On Tue, 14 Feb 2017 11:06:25 -0800
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> On Tue, Feb 14, 2017 at 10:46 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> > On Tue, 14 Feb 2017 09:29:54 -0800
> > Alexander Duyck <alexander.duyck@gmail.com> wrote:
> >  
> >> On Tue, Feb 14, 2017 at 6:56 AM, Tariq Toukan <ttoukan.linux@gmail.com> wrote:  
> >> >
> >> >
> >> > On 14/02/2017 3:45 PM, Eric Dumazet wrote:  
> >> >>
> >> >> On Tue, Feb 14, 2017 at 4:12 AM, Jesper Dangaard Brouer
> >> >> <brouer@redhat.com> wrote:
> >> >>  
> >> >>> It is important to understand that there are two cases for the cost of
> >> >>> an atomic op, which depend on the cache-coherency state of the
> >> >>> cacheline.
> >> >>>
> >> >>> Measured on Skylake CPU i7-6700K CPU @ 4.00GHz
> >> >>>
> >> >>> (1) Local CPU atomic op :  27 cycles(tsc)  6.776 ns
> >> >>> (2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns
> >> >>>  
> >> >> Okay, it seems you guys really want a patch that I said was not giving
> >> >> good results
> >> >>
> >> >> Let me publish the numbers I get , adding or not the last (and not
> >> >> official) patch.
> >> >>
> >> >> If I _force_ the user space process to run on the other node,
> >> >> then the results are not the ones Alex or you are expecting.
> >> >>
> >> >> I have with this patch about 2.7 Mpps of this silly single TCP flow,
> >> >> and 3.5 Mpps without it.
> >> >>
> >> >> lpaa24:~# sar -n DEV 1 10 | grep eth0 | grep Ave
> >> >> Average:         eth0 2699243.20  16663.70 1354783.36   1079.95
> >> >> 0.00      0.00      4.50
> >> >>
> >> >> Profile of the cpu on NUMA node 1 ( netserver consuming data ) :
> >> >>
> >> >>      54.73%  [kernel]      [k] copy_user_enhanced_fast_string
> >> >>      31.07%  [kernel]      [k] skb_release_data
> >> >>       4.24%  [kernel]      [k] skb_copy_datagram_iter
> >> >>       1.35%  [kernel]      [k] copy_page_to_iter
> >> >>       0.98%  [kernel]      [k] _raw_spin_lock
> >> >>       0.90%  [kernel]      [k] skb_release_head_state
> >> >>       0.60%  [kernel]      [k] tcp_transmit_skb
> >> >>       0.51%  [kernel]      [k] mlx4_en_xmit
> >> >>       0.33%  [kernel]      [k] ___cache_free
> >> >>       0.28%  [kernel]      [k] tcp_rcv_established
> >> >>
> >> >> Profile of cpu handling mlx4 softirqs (NUMA node 0)
> >> >>
> >> >>
> >> >>      48.00%  [kernel]          [k] mlx4_en_process_rx_cq
> >> >>      12.92%  [kernel]          [k] napi_gro_frags
> >> >>       7.28%  [kernel]          [k] inet_gro_receive
> >> >>       7.17%  [kernel]          [k] tcp_gro_receive
> >> >>       5.10%  [kernel]          [k] dev_gro_receive
> >> >>       4.87%  [kernel]          [k] skb_gro_receive
> >> >>       2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
> >> >>       2.04%  [kernel]          [k] __build_skb
> >> >>       1.02%  [kernel]          [k] napi_reuse_skb.isra.95
> >> >>       1.01%  [kernel]          [k] tcp4_gro_receive
> >> >>       0.65%  [kernel]          [k] kmem_cache_alloc
> >> >>       0.45%  [kernel]          [k] _raw_spin_lock
> >> >>
> >> >> Without the latest  patch (the exact patch series v3 I submitted),
> >> >> thus with this atomic_inc() in mlx4_en_process_rx_cq  instead of only
> >> >> reads.
> >> >>
> >> >> lpaa24:~# sar -n DEV 1 10|grep eth0|grep Ave
> >> >> Average:         eth0 3566768.50  25638.60 1790345.69   1663.51
> >> >> 0.00      0.00      4.50
> >> >>
> >> >> Profiles of the two cpus :
> >> >>
> >> >>      74.85%  [kernel]      [k] copy_user_enhanced_fast_string
> >> >>       6.42%  [kernel]      [k] skb_release_data
> >> >>       5.65%  [kernel]      [k] skb_copy_datagram_iter
> >> >>       1.83%  [kernel]      [k] copy_page_to_iter
> >> >>       1.59%  [kernel]      [k] _raw_spin_lock
> >> >>       1.48%  [kernel]      [k] skb_release_head_state
> >> >>       0.72%  [kernel]      [k] tcp_transmit_skb
> >> >>       0.68%  [kernel]      [k] mlx4_en_xmit
> >> >>       0.43%  [kernel]      [k] page_frag_free
> >> >>       0.38%  [kernel]      [k] ___cache_free
> >> >>       0.37%  [kernel]      [k] tcp_established_options
> >> >>       0.37%  [kernel]      [k] __ip_local_out
> >> >>
> >> >>
> >> >>     37.98%  [kernel]          [k] mlx4_en_process_rx_cq
> >> >>      26.47%  [kernel]          [k] napi_gro_frags
> >> >>       7.02%  [kernel]          [k] inet_gro_receive
> >> >>       5.89%  [kernel]          [k] tcp_gro_receive
> >> >>       5.17%  [kernel]          [k] dev_gro_receive
> >> >>       4.80%  [kernel]          [k] skb_gro_receive
> >> >>       2.61%  [kernel]          [k] __build_skb
> >> >>       2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
> >> >>       1.59%  [kernel]          [k] napi_reuse_skb.isra.95
> >> >>       0.95%  [kernel]          [k] tcp4_gro_receive
> >> >>       0.51%  [kernel]          [k] kmem_cache_alloc
> >> >>       0.42%  [kernel]          [k] __inet_lookup_established
> >> >>       0.34%  [kernel]          [k] swiotlb_sync_single_for_cpu
> >> >>
> >> >>
> >> >> So probably this will need further analysis, outside of the scope of
> >> >> this patch series.
> >> >>
> >> >> Could we now please Ack this v3 and merge it ?
> >> >>
> >> >> Thanks.  
> >> >
> >> > Thanks Eric.
> >> >
> >> > As the previous series caused hangs, we must run functional regression tests
> >> > over this series as well.
> >> > Run has already started, and results will be available tomorrow morning.
> >> >
> >> > In general, I really like this series. The re-factorization looks more
> >> > elegant and more correct, functionally.
> >> >
> >> > However, performance wise: we fear that the numbers will be drastically
> >> > lower with this transition to order-0 pages,
> >> > because of the (becoming critical) page allocator and dma operations
> >> > bottlenecks, especially on systems with costly
> >> > dma operations, such as ARM, iommu=on, etc...  
> >>
> >> So to give you an idea I originally came up with the approach used in
> >> the Intel drivers when I was dealing with a PowerPC system where the
> >> IOMMU was a requirement.  With this setup correctly the map/unmap
> >> calls should be almost non-existent.  Basically the only time we
> >> should have to allocate or free a page is if something is sitting on
> >> the pages for an excessively long time or if the interrupt is bouncing
> >> between memory nodes which forces us to free the memory since it is
> >> sitting on the wrong node.
> >>  
> >> > We already have this exact issue in mlx5, where we moved to order-0
> >> > allocations with a fixed size cache, but that was not enough.
> >> > Customers of mlx5 have already complained about the performance degradation,
> >> > and currently this is hurting our business.
> >> > We get a clear nack from our performance regression team regarding doing the
> >> > same in mlx4.  
> >>
> >> What form of recycling were you doing?  If you were doing the offset
> >> based setup then that obviously only gets you so much.  The advantage
> >> to using the page count is that you get almost a mobius strip effect
> >> for your buffers and descriptor rings where you just keep flipping the
> >> page offset back and forth via an XOR and the only cost is
> >> dma_sync_for_cpu, get_page, dma_sync_for_device instead of having to
> >> do the full allocation, mapping, and unmapping.
> >>  
> >> > So, the question is, can we live with this degradation until those
> >> > bottleneck challenges are addressed?
> >> > Following our perf experts feedback, I cannot just simply Ack. We need to
> >> > have a clear plan to close the perf gap or reduce the impact.  
> >>
> >> I think we need to define what is the degradation you are expecting to
> >> see.  Using the page count based approach should actually be better in
> >> some cases than the order 3 page and just walking frags approach since
> >> the theoretical reuse is infinite instead of fixed.
> >>  
> >> > Internally, I already implemented "dynamic page-cache" and "page-reuse"
> >> > mechanisms in the driver,
> >> > and together they totally bridge the performance gap.
> >> > That's why I would like to hear from Jesper what is the status of his
> >> > page_pool API, it is promising and could totally solve these issues.
> >> >
> >> > Regards,
> >> > Tariq  
> >>
> >> The page pool may provide gains but we have to actually see it before
> >> we can guarantee it.  If worse comes to worse I might just resort to
> >> standardizing the logic used for the Intel driver page count based
> >> approach.  Then if nothing else we would have a standard path for all
> >> the drivers to use if we start going down this route.  
> >
> > With this Intel driver page count based recycle approach, the recycle
> > size is tied to the size of the RX ring.  As Eric and Tariq discovered.
> > And for other performance reasons (memory footprint of walking RX ring
> > data-structures), don't want to increase the RX ring sizes.  Thus, it
> > create two opposite performance needs.  That is why I think a more
> > explicit approach with a pool is more attractive.
> >
> > How is this approach doing to work for XDP?
> > (XDP doesn't "share" the page, and in-general we don't want the extra
> > atomic.)  
> 
> The extra atomic is moot since it is we can get rid of that by doing
> bulk page count updates.
> 
> Why can't XDP share a page?  You have brought up the user space aspect
> of things repeatedly but the AF_PACKET patches John did more or less
> demonstrated that wasn't the case.  What you end up with is you have
> to either be providing pure user space pages or pure kernel pages.
> You can't have pages being swapped between the two without introducing
> security issues.  So even with your pool approach it doesn't matter
> which way things are run.  Your pool is either device to user space,
> or device to kernel space and it can't be both without creating
> sharing concerns.

(I think we are talking past each other here.)

 
> > We absolutely need recycling with XDP, when transmitting out another
> > device, and the other devices DMA-TX completion need some way of
> > returning this page.  
> 
> I fully agree here.  However the easiest way of handling this is via
> the page count in my opinion.
> 
> > What is basically needed is a standardized callback to allow the remote
> > driver to return the page to the originating driver.  As we don't have
> > a NDP for XDP-forward/transmit yet, we could pass this callback as a
> > parameter along with the packet-page to send?  
> 
> No.  This assumes way too much.  Most packets aren't going to be
> device to device routing.  We have seen this type of thing and
> rejected it multiple times.  Don't think driver to driver.  This is
> driver to network stack, socket, device, virtual machine, storage,
> etc.  The fact is there are many spots where a frame might get
> terminated. [...]

I fully agree that XDP needs to think further than driver to driver.

 
> This is why I said before what we need to do is have a page destructor
> to handle this sort of thing.  The idea is you want to have this work
> everywhere.  Having just drivers do this would make it a nice toy but
> completely useless since not too many people are doing
> routing/bridging between interfaces.  Using the page destructor it is
> easy to create a pool of "free" pages that you can then pull your DMA
> pages out of or that you can release back into the page allocator.

How is this page destructor different from my page_pool RFC
implementation?  (It basically functions as a page destructor...)

Help me understand what I'm missing?

Pointers to page_pool discussions
* page_pool RFC patchset:
 - http://lkml.kernel.org/r/20161220132444.18788.50875.stgit@firesoul
 - http://lkml.kernel.org/r/20161220132812.18788.20431.stgit@firesoul
 - http://lkml.kernel.org/r/20161220132817.18788.64726.stgit@firesoul
 - http://lkml.kernel.org/r/20161220132822.18788.19768.stgit@firesoul
 - http://lkml.kernel.org/r/20161220132827.18788.8658.stgit@firesoul

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-14 19:38                 ` Jesper Dangaard Brouer
@ 2017-02-14 19:59                   ` David Miller
  0 siblings, 0 replies; 78+ messages in thread
From: David Miller @ 2017-02-14 19:59 UTC (permalink / raw)
  To: brouer
  Cc: ttoukan.linux, edumazet, alexander.duyck, netdev, tariqt, kafai,
	saeedm, willemb, bblanco, ast, eric.dumazet, linux-mm

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Tue, 14 Feb 2017 20:38:22 +0100

> On Tue, 14 Feb 2017 12:04:26 -0500 (EST)
> David Miller <davem@davemloft.net> wrote:
> 
>> One path I see around all of this is full integration.  Meaning that
>> we can free pages into the page allocator which are still DMA mapped.
>> And future allocations from that device are prioritized to take still
>> DMA mapped objects.
> 
> I like this idea.  Are you saying that this should be done per DMA
> engine or per device?
> 
> If this is per device, it is almost the page_pool idea.  

Per-device is simplest, at least at first.

Maybe later down the road we can try to pool by "mapping entity", be
that a parent IOMMU or something else like a hypervisor-managed page
table.
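
A small sketch of what the per-device variant could look like (names
and structures are made up; this illustrates the idea, not an actual
or proposed API): allocation first tries a per-device list of pages
that were "freed" with their DMA mapping intact, and only falls back
to the page allocator plus dma_map_page() when that list is empty.
The lock-free list assumes a single consumer, e.g. one RX queue's
NAPI context.

#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/llist.h>
#include <linux/mm.h>
#include <linux/slab.h>

struct mapped_page {			/* hypothetical */
	struct llist_node node;
	struct page *page;
	dma_addr_t dma;
};

struct dev_page_cache {
	struct device *dev;
	struct llist_head mapped;	/* pages "freed" but still mapped */
};

static struct mapped_page *dev_page_get(struct dev_page_cache *c)
{
	struct llist_node *n = llist_del_first(&c->mapped);
	struct mapped_page *mp;

	if (n)	/* fast path: reuse a still-DMA-mapped page */
		return llist_entry(n, struct mapped_page, node);

	mp = kmalloc(sizeof(*mp), GFP_ATOMIC);
	if (!mp)
		return NULL;
	mp->page = alloc_page(GFP_ATOMIC | __GFP_NOWARN);
	if (!mp->page)
		goto err_free;
	mp->dma = dma_map_page(c->dev, mp->page, 0, PAGE_SIZE,
			       DMA_FROM_DEVICE);
	if (dma_mapping_error(c->dev, mp->dma))
		goto err_page;
	return mp;

err_page:
	put_page(mp->page);
err_free:
	kfree(mp);
	return NULL;
}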


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-14 19:02                   ` Eric Dumazet
@ 2017-02-14 20:02                       ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 78+ messages in thread
From: Jesper Dangaard Brouer @ 2017-02-14 20:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, Tariq Toukan, David S . Miller, netdev,
	Tariq Toukan, Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
	Brenden Blanco, Alexei Starovoitov, Eric Dumazet, linux-mm,
	brouer

On Tue, 14 Feb 2017 11:02:01 -0800
Eric Dumazet <edumazet@google.com> wrote:

> On Tue, Feb 14, 2017 at 10:46 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> >  
> 
> >
> > With this Intel driver page count based recycle approach, the recycle
> > size is tied to the size of the RX ring.  As Eric and Tariq discovered.
> > And for other performance reasons (memory footprint of walking RX ring
> > data-structures), don't want to increase the RX ring sizes.  Thus, it
> > create two opposite performance needs.  That is why I think a more
> > explicit approach with a pool is more attractive.
> >
> > How is this approach doing to work for XDP?
> > (XDP doesn't "share" the page, and in-general we don't want the extra
> > atomic.)
> >
> > We absolutely need recycling with XDP, when transmitting out another
> > device, and the other devices DMA-TX completion need some way of
> > returning this page.
> > What is basically needed is a standardized callback to allow the remote
> > driver to return the page to the originating driver.  As we don't have
> > a NDP for XDP-forward/transmit yet, we could pass this callback as a
> > parameter along with the packet-page to send?
> >
> >  
> 
> 
> mlx4 already has a cache for XDP.
> I believe I did not change this part, it still should work.
> 
> commit d576acf0a22890cf3f8f7a9b035f1558077f6770
> Author: Brenden Blanco <bblanco@plumgrid.com>
> Date:   Tue Jul 19 12:16:52 2016 -0700
> 
>     net/mlx4_en: add page recycle to prepare rx ring for tx support

This obviously does not work for the case I'm talking about
(transmitting out another device with XDP).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-14 20:02                       ` Jesper Dangaard Brouer
@ 2017-02-14 21:56                         ` Eric Dumazet
  -1 siblings, 0 replies; 78+ messages in thread
From: Eric Dumazet @ 2017-02-14 21:56 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexander Duyck, Tariq Toukan, David S . Miller, netdev,
	Tariq Toukan, Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
	Brenden Blanco, Alexei Starovoitov, Eric Dumazet, linux-mm

>
> This obviously does not work for the case I'm talking about
> (transmitting out another device with XDP).
>

XDP_TX does not handle this yet.

When XDP_TX was added, it was very clear that the transmit _had_ to be
done on the same port.

Since all this discussion happened in this thread (mlx4: use order-0
pages for RX), I was kind of assuming all the comments were relevant
to the current code or patch, not to future developments ;)

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-14 17:29                 ` Tom Herbert
@ 2017-02-15 16:42                   ` Tariq Toukan
  2017-02-15 16:57                     ` Eric Dumazet
  0 siblings, 1 reply; 78+ messages in thread
From: Tariq Toukan @ 2017-02-15 16:42 UTC (permalink / raw)
  To: Tom Herbert, Eric Dumazet
  Cc: Eric Dumazet, Jesper Dangaard Brouer, Alexander Duyck,
	David S . Miller, netdev, Tariq Toukan, Martin KaFai Lau,
	Saeed Mahameed, Willem de Bruijn, Brenden Blanco,
	Alexei Starovoitov, linux-mm



On 14/02/2017 7:29 PM, Tom Herbert wrote:
> On Tue, Feb 14, 2017 at 7:51 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Tue, 2017-02-14 at 16:56 +0200, Tariq Toukan wrote:
>>
>>> As the previous series caused hangs, we must run functional regression
>>> tests over this series as well.
>>> Run has already started, and results will be available tomorrow morning.
>>>
>>> In general, I really like this series. The re-factorization looks more
>>> elegant and more correct, functionally.
>>>
>>> However, performance wise: we fear that the numbers will be drastically
>>> lower with this transition to order-0 pages,
>>> because of the (becoming critical) page allocator and dma operations
>>> bottlenecks, especially on systems with costly
>>> dma operations, such as ARM, iommu=on, etc...
>>>
>> So, again, performance after this patch series his higher,
>> once you have sensible RX queues parameters, for the expected workload.
>>
>> Only in pathological cases, you might have some regression.
>>
>> The old scheme was _maybe_ better _when_ memory is not fragmented.
>>
>> When you run hosts for months, memory _is_ fragmented.
>>
>> You never see that on benchmarks, unless you force memory being
>> fragmented.
>>
>>
>>
>>> We already have this exact issue in mlx5, where we moved to order-0
>>> allocations with a fixed size cache, but that was not enough.
>>> Customers of mlx5 have already complained about the performance
>>> degradation, and currently this is hurting our business.
>>> We get a clear nack from our performance regression team regarding doing
>>> the same in mlx4.
>>> So, the question is, can we live with this degradation until those
>>> bottleneck challenges are addressed?
>> Again, there is no degradation.
>>
>> We have been using order-0 pages for years at Google.
>>
>> Only when we made the mistake to rebase from the upstream driver and
>> order-3 pages we got horrible regressions, causing production outages.
>>
>> I was silly to believe that mm layer got better.
>>
>>> Following our perf experts feedback, I cannot just simply Ack. We need
>>> to have a clear plan to close the perf gap or reduce the impact.
>> Your perf experts need to talk to me, or any experts at Google and
>> Facebook, really.
>>
> I agree with this 100%! To be blunt, power users like this are testing
> your drivers far beyond what Mellanox is doing and understand how
> performance gains in benchmarks translate to possible gains in real
> production way more than your perf experts can. Listen to Eric!
>
> Tom
>
>
>> Anything _relying_ on order-3 pages being available to impress
>> friends/customers is a lie.

Isn't it the same principle in page_frag_alloc()?
It is called from __netdev_alloc_skb()/__napi_alloc_skb().

Why is it ok to have order-3 pages (PAGE_FRAG_CACHE_MAX_ORDER) there?
By using netdev/napi_alloc_skb, you get an SKB whose linear data is a
frag of a huge page, and that frag is not going to be freed before the
other non-linear frags.
Can't this cause the same threats (memory pinning and so on)?

Currently, mlx4 doesn't use this generic API, while most other drivers do.

Similar claims are true for TX:
https://github.com/torvalds/linux/commit/5640f7685831e088fe6c2e1f863a6805962f8e81

>>


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-15 16:42                   ` Tariq Toukan
@ 2017-02-15 16:57                     ` Eric Dumazet
  2017-02-16 13:08                       ` Tariq Toukan
  0 siblings, 1 reply; 78+ messages in thread
From: Eric Dumazet @ 2017-02-15 16:57 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Tom Herbert, Eric Dumazet, Jesper Dangaard Brouer,
	Alexander Duyck, David S . Miller, netdev, Martin KaFai Lau,
	Saeed Mahameed, Willem de Bruijn, Brenden Blanco,
	Alexei Starovoitov, linux-mm

On Wed, Feb 15, 2017 at 8:42 AM, Tariq Toukan <tariqt@mellanox.com> wrote:
>

>
> Isn't it the same principle in page_frag_alloc() ?
> It is called form __netdev_alloc_skb()/__napi_alloc_skb().
>
> Why is it ok to have order-3 pages (PAGE_FRAG_CACHE_MAX_ORDER) there?

This is not ok.

This is a very well known problem; we already mentioned it here in the
past, but at least the core networking stack uses order-0 pages on
PowerPC.

The mlx4 driver suffers from this problem 100% more than other drivers ;)

One problem at a time, Tariq.  Right now, only mlx4 has this big
problem compared to other NICs.

Then, if we _still_ hit major issues, we might also need to force
napi_get_frags()
to allocate skb->head using kmalloc() instead of a page frag.

That is a very simple fix.

Remember that skb->truesize is an approximation; it will never be
completely accurate, but we need to make it better.

The mlx4 driver pretends to have a frag truesize of 1536 bytes, but
this is obviously wrong when the host is under memory pressure
(2 frags per page -> truesize should be 2048).
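
In code terms, a tiny illustration (rx_add_frag and frag_stride are
stand-ins for the driver's own helper and per-frag stride, not actual
mlx4 code):

#include <linux/skbuff.h>

/* Attach one RX frag, charging the full share of the page it pins:
 * with 2 frags per 4K page that is 2048 bytes, even if only ~1536
 * bytes of data were received. */
static void rx_add_frag(struct sk_buff *skb, struct page *page,
			unsigned int offset, unsigned int frame_len,
			unsigned int frag_stride)
{
	unsigned int frags_per_page = PAGE_SIZE / frag_stride; /* e.g. 2 */
	unsigned int truesize = PAGE_SIZE / frags_per_page;    /* 2048  */

	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page, offset,
			frame_len, truesize);
}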


> By using netdev/napi_alloc_skb, you'll get that the SKB's linear data is a
> frag of a huge page,
> and it is not going to be freed before the other non-linear frags.
> Cannot this cause the same threats (memory pinning and so...)?
>
> Currently, mlx4 doesn't use this generic API, while most other drivers do.
>
> Similar claims are true for TX:
> https://github.com/torvalds/linux/commit/5640f7685831e088fe6c2e1f863a6805962f8e81

We do not have such a problem on TX.  GFP_KERNEL allocations do not
have the same issues.

Tasks are usually not malicious in our DC, and most serious
applications use memcg or similar memory control.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-15 16:57                     ` Eric Dumazet
@ 2017-02-16 13:08                       ` Tariq Toukan
  2017-02-16 15:47                         ` Eric Dumazet
  2017-02-16 17:05                         ` Tom Herbert
  0 siblings, 2 replies; 78+ messages in thread
From: Tariq Toukan @ 2017-02-16 13:08 UTC (permalink / raw)
  To: Eric Dumazet, Tariq Toukan, Jesper Dangaard Brouer
  Cc: Tom Herbert, Eric Dumazet, Alexander Duyck, David S . Miller,
	netdev, Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
	Brenden Blanco, Alexei Starovoitov, linux-mm


On 15/02/2017 6:57 PM, Eric Dumazet wrote:
> On Wed, Feb 15, 2017 at 8:42 AM, Tariq Toukan <tariqt@mellanox.com> wrote:
>> Isn't it the same principle in page_frag_alloc() ?
>> It is called form __netdev_alloc_skb()/__napi_alloc_skb().
>>
>> Why is it ok to have order-3 pages (PAGE_FRAG_CACHE_MAX_ORDER) there?
> This is not ok.
>
> This is a very well known problem, we already mentioned that here in the past,
> but at least core networking stack uses  order-0 pages on PowerPC.
You're right, we should have done this as well in mlx4 on PPC.
> mlx4 driver suffers from this problem 100% more than other drivers ;)
>
> One problem at a time Tariq. Right now, only mlx4 has this big problem
> compared to other NIC.
We _do_ agree that the series improves the driver's quality, stability,
and performance in a fragmented system.

But due to the late rc we're in, and the fact that we know what
benchmarks our customers are going to run, we cannot Ack the series
and get it as is into kernel 4.11.

We are interested in getting your series merged along with another perf
improvement we are preparing for the next rc1.  This way we will gain
the desired stability without breaking existing benchmarks.
I think this is the right thing to do at this point in time.


The idea behind the perf improvement, suggested by Jesper, is to split
the mlx4_en_process_rx_cq() loop, called from napi_poll, into two.
The first loop extracts the completed CQEs and starts prefetching their
data and RX descriptors.  The second loop processes the actual packets.
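
Schematically, with placeholder types and helpers (this is not the
actual mlx4 code, just the shape of the idea; cq_next_completed_cqe(),
rx_desc_data() and rx_process_one() are invented names):

#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/prefetch.h>

struct rx_ring;				/* placeholders for the example */
struct cqe;
struct cqe *cq_next_completed_cqe(struct rx_ring *ring);
void *rx_desc_data(struct rx_ring *ring, struct cqe *cqe);
void rx_process_one(struct rx_ring *ring, struct cqe *cqe);

/* Pass 1 collects completed CQEs and kicks off prefetches on the packet
 * data they reference; pass 2 does the real work, by which time the
 * prefetched cache lines should have arrived. */
static int rx_cq_poll_two_pass(struct rx_ring *ring, int budget)
{
	struct cqe *cqes[NAPI_POLL_WEIGHT];
	int todo = min(budget, NAPI_POLL_WEIGHT);
	int i, n = 0;

	while (n < todo && (cqes[n] = cq_next_completed_cqe(ring)))
		prefetch(rx_desc_data(ring, cqes[n++]));

	for (i = 0; i < n; i++)
		rx_process_one(ring, cqes[i]);	/* build skb / run XDP */

	return n;	/* number of packets processed */
}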

>
> Then, if we _still_ hit major issues, we might also need to force
> napi_get_frags()
> to allocate skb->head using kmalloc() instead of a page frag.
>
> That is a very simple fix.
>
> Remember that we have skb->truesize that is an approximation, it will
> never be completely accurate,
> but we need to make it better.
>
> mlx4 driver pretends to have a frag truesize of 1536 bytes, but this
> is obviously wrong when host is under memory pressure
> (2 frags per page -> truesize should be 2048)
>
>
>> By using netdev/napi_alloc_skb, you'll get that the SKB's linear data is a
>> frag of a huge page,
>> and it is not going to be freed before the other non-linear frags.
>> Cannot this cause the same threats (memory pinning and so...)?
>>
>> Currently, mlx4 doesn't use this generic API, while most other drivers do.
>>
>> Similar claims are true for TX:
>> https://github.com/torvalds/linux/commit/5640f7685831e088fe6c2e1f863a6805962f8e81
> We do not have such problem on TX. GFP_KERNEL allocations do not have
> the same issues.
>
> Tasks are usually not malicious in our DC, and most serious
> applications use memcg or such memory control.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-16 13:08                       ` Tariq Toukan
@ 2017-02-16 15:47                         ` Eric Dumazet
  2017-02-16 17:05                         ` Tom Herbert
  1 sibling, 0 replies; 78+ messages in thread
From: Eric Dumazet @ 2017-02-16 15:47 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Eric Dumazet, Jesper Dangaard Brouer, Tom Herbert,
	Alexander Duyck, David S . Miller, netdev, Martin KaFai Lau,
	Saeed Mahameed, Willem de Bruijn, Brenden Blanco,
	Alexei Starovoitov, linux-mm

On Thu, 2017-02-16 at 15:08 +0200, Tariq Toukan wrote:
> On 15/02/2017 6:57 PM, Eric Dumazet wrote:
> > On Wed, Feb 15, 2017 at 8:42 AM, Tariq Toukan <tariqt@mellanox.com> wrote:
> >> Isn't it the same principle in page_frag_alloc() ?
> >> It is called form __netdev_alloc_skb()/__napi_alloc_skb().
> >>
> >> Why is it ok to have order-3 pages (PAGE_FRAG_CACHE_MAX_ORDER) there?
> > This is not ok.
> >
> > This is a very well known problem, we already mentioned that here in the past,
> > but at least core networking stack uses  order-0 pages on PowerPC.
> You're right, we should have done this as well in mlx4 on PPC.
> > mlx4 driver suffers from this problem 100% more than other drivers ;)
> >
> > One problem at a time Tariq. Right now, only mlx4 has this big problem
> > compared to other NIC.
> We _do_ agree that the series improves the driver's quality, stability,
> and performance in a fragmented system.
> 
> But due to the late rc we're in, and the fact that we know what benchmarks
> our customers are going to run, we cannot Ack the series and get it
> as is inside kernel 4.11.
> 
> We are interested to get your series merged along another perf improvement
> we are preparing for next rc1. This way we will earn the desired stability
> without breaking existing benchmarks.
> I think this is the right thing to do at this point of time.
> 
> 
> The idea behind the perf improvement, suggested by Jesper, is to split
> the napi_poll call mlx4_en_process_rx_cq() loop into two.
> The first loop extracts completed CQEs and starts prefetching on data
> and RX descriptors. The second loop process the real packets.

Make sure to resubmit my patches before anything new.

We need to backport them to stable versions, without XDP, without
anything fancy.

And submit what is needed for 4.11, since the current mlx4 driver in
net-next is broken, in case you missed it.

Thanks.



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-16 13:08                       ` Tariq Toukan
  2017-02-16 15:47                         ` Eric Dumazet
@ 2017-02-16 17:05                         ` Tom Herbert
  2017-02-16 17:11                             ` Eric Dumazet
  2017-02-16 19:03                             ` David Miller
  1 sibling, 2 replies; 78+ messages in thread
From: Tom Herbert @ 2017-02-16 17:05 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Eric Dumazet, Jesper Dangaard Brouer, Eric Dumazet,
	Alexander Duyck, David S . Miller, netdev, Martin KaFai Lau,
	Saeed Mahameed, Willem de Bruijn, Brenden Blanco,
	Alexei Starovoitov, linux-mm

On Thu, Feb 16, 2017 at 5:08 AM, Tariq Toukan <tariqt@mellanox.com> wrote:
>
> On 15/02/2017 6:57 PM, Eric Dumazet wrote:
>>
>> On Wed, Feb 15, 2017 at 8:42 AM, Tariq Toukan <tariqt@mellanox.com> wrote:
>>>
>>> Isn't it the same principle in page_frag_alloc() ?
>>> It is called from __netdev_alloc_skb()/__napi_alloc_skb().
>>>
>>> Why is it ok to have order-3 pages (PAGE_FRAG_CACHE_MAX_ORDER) there?
>>
>> This is not ok.
>>
>> This is a very well known problem, we already mentioned that here in the
>> past,
>> but at least core networking stack uses  order-0 pages on PowerPC.
>
> You're right, we should have done this as well in mlx4 on PPC.
>>
>> mlx4 driver suffers from this problem 100% more than other drivers ;)
>>
>> One problem at a time Tariq. Right now, only mlx4 has this big problem
>> compared to other NIC.
>
> We _do_ agree that the series improves the driver's quality, stability,
> and performance in a fragmented system.
>
> But due to the late rc we're in, and the fact that we know what benchmarks
> our customers are going to run, we cannot Ack the series and get it
> as is inside kernel 4.11.
>
You're admitting that Eric's patches improve driver quality,
stability, and performance but you're not allowing this in the kernel
because "we know what benchmarks our customers are going to run".
Sorry, but that is a weak explanation.

> We are interested in getting your series merged along with another perf
> improvement we are preparing for the next rc1. This way we will gain the
> desired stability without breaking existing benchmarks.
> I think this is the right thing to do at this point in time.
>
>
> The idea behind the perf improvement, suggested by Jesper, is to split
> the mlx4_en_process_rx_cq() loop (called from napi_poll) into two.
> The first loop extracts completed CQEs and starts prefetching on data
> and RX descriptors. The second loop processes the real packets.
>
>
>>
>> Then, if we _still_ hit major issues, we might also need to force
>> napi_get_frags()
>> to allocate skb->head using kmalloc() instead of a page frag.
>>
>> That is a very simple fix.
>>
>> Remember that we have skb->truesize that is an approximation, it will
>> never be completely accurate,
>> but we need to make it better.
>>
>> mlx4 driver pretends to have a frag truesize of 1536 bytes, but this
>> is obviously wrong when host is under memory pressure
>> (2 frags per page -> truesize should be 2048)
>>
>>
>>> By using netdev/napi_alloc_skb, you'll get that the SKB's linear data is
>>> a
>>> frag of a huge page,
>>> and it is not going to be freed before the other non-linear frags.
>>> Can't this cause the same threats (memory pinning and so on)?
>>>
>>> Currently, mlx4 doesn't use this generic API, while most other drivers
>>> do.
>>>
>>> Similar claims are true for TX:
>>>
>>> https://github.com/torvalds/linux/commit/5640f7685831e088fe6c2e1f863a6805962f8e81
>>
>> We do not have such problem on TX. GFP_KERNEL allocations do not have
>> the same issues.
>>
>> Tasks are usually not malicious in our DC, and most serious
>> applications use memcg or such memory control.
>
>


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-16 17:05                         ` Tom Herbert
@ 2017-02-16 17:11                             ` Eric Dumazet
  2017-02-16 19:03                             ` David Miller
  1 sibling, 0 replies; 78+ messages in thread
From: Eric Dumazet @ 2017-02-16 17:11 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Tariq Toukan, Jesper Dangaard Brouer, Eric Dumazet,
	Alexander Duyck, David S . Miller, netdev, Martin KaFai Lau,
	Saeed Mahameed, Willem de Bruijn, Brenden Blanco,
	Alexei Starovoitov, linux-mm

> You're admitting that Eric's patches improve driver quality,
> stability, and performance but you're not allowing this in the kernel
> because "we know what benchmarks our customers are going to run".

Note that I do not particularly care if these patches go in 4.11 or 4.12 really.

I already backported them into our 4.3 based kernel.

I guess that we could at least propose the trivial patch for stable releases,
since PowerPC arches really need it.

diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index cec59bc264c9ac197048fd7c98bcd5cf25de0efd..0f6d2f3b7d54f51de359d4ccde21f4585e6b7852 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -102,7 +102,8 @@
 /* Use the maximum between 16384 and a single page */
 #define MLX4_EN_ALLOC_SIZE     PAGE_ALIGN(16384)

-#define MLX4_EN_ALLOC_PREFER_ORDER     PAGE_ALLOC_COSTLY_ORDER
+#define MLX4_EN_ALLOC_PREFER_ORDER min_t(int, get_order(32768), \
+                                        PAGE_ALLOC_COSTLY_ORDER)

 /* Receive fragment sizes; we use at most 3 fragments (for 9600 byte MTU
  * and 4K allocations) */

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-16 17:05                         ` Tom Herbert
@ 2017-02-16 19:03                             ` David Miller
  2017-02-16 19:03                             ` David Miller
  1 sibling, 0 replies; 78+ messages in thread
From: David Miller @ 2017-02-16 19:03 UTC (permalink / raw)
  To: tom
  Cc: tariqt, edumazet, brouer, eric.dumazet, alexander.duyck, netdev,
	kafai, saeedm, willemb, bblanco, ast, linux-mm

From: Tom Herbert <tom@herbertland.com>
Date: Thu, 16 Feb 2017 09:05:26 -0800

> On Thu, Feb 16, 2017 at 5:08 AM, Tariq Toukan <tariqt@mellanox.com> wrote:
>>
>> On 15/02/2017 6:57 PM, Eric Dumazet wrote:
>>>
>>> On Wed, Feb 15, 2017 at 8:42 AM, Tariq Toukan <tariqt@mellanox.com> wrote:
>>>>
>>>> Isn't it the same principle in page_frag_alloc() ?
>>>> It is called from __netdev_alloc_skb()/__napi_alloc_skb().
>>>>
>>>> Why is it ok to have order-3 pages (PAGE_FRAG_CACHE_MAX_ORDER) there?
>>>
>>> This is not ok.
>>>
>>> This is a very well known problem, we already mentioned that here in the
>>> past,
>>> but at least core networking stack uses  order-0 pages on PowerPC.
>>
>> You're right, we should have done this as well in mlx4 on PPC.
>>>
>>> mlx4 driver suffers from this problem 100% more than other drivers ;)
>>>
>>> One problem at a time Tariq. Right now, only mlx4 has this big problem
>>> compared to other NIC.
>>
>> We _do_ agree that the series improves the driver's quality, stability,
>> and performance in a fragmented system.
>>
>> But due to the late rc we're in, and the fact that we know what benchmarks
>> our customers are going to run, we cannot Ack the series and get it
>> as is inside kernel 4.11.
>>
> You're admitting that Eric's patches improve driver quality,
> stability, and performance but you're not allowing this in the kernel
> because "we know what benchmarks our customers are going to run".
> Sorry, but that is a weak explanation.

I have to agree with Tom and Eric.

If your customers have gotten into the habit of using metrics which
actually do not represent real life performance, that is a completely
inappropriate reason to not include Eric's changes as-is.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-16 17:11                             ` Eric Dumazet
  (?)
@ 2017-02-16 20:49                             ` Saeed Mahameed
  -1 siblings, 0 replies; 78+ messages in thread
From: Saeed Mahameed @ 2017-02-16 20:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tom Herbert, Tariq Toukan, Jesper Dangaard Brouer, Eric Dumazet,
	Alexander Duyck, David S . Miller, netdev, Martin KaFai Lau,
	Saeed Mahameed, Willem de Bruijn, Brenden Blanco,
	Alexei Starovoitov, linux-mm

On Thu, Feb 16, 2017 at 7:11 PM, Eric Dumazet <edumazet@google.com> wrote:
>> You're admitting that Eric's patches improve driver quality,
>> stability, and performance but you're not allowing this in the kernel
>> because "we know what benchmarks our customers are going to run".
>
> Note that I do not particularly care if these patches go in 4.11 or 4.12 really.
>
> I already backported them into our 4.3 based kernel.
>
> I guess that we could at least propose the trivial patch for stable releases,
> since PowerPC arches really need it.
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index cec59bc264c9ac197048fd7c98bcd5cf25de0efd..0f6d2f3b7d54f51de359d4ccde21f4585e6b7852 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -102,7 +102,8 @@
>  /* Use the maximum between 16384 and a single page */
>  #define MLX4_EN_ALLOC_SIZE     PAGE_ALIGN(16384)
>
> -#define MLX4_EN_ALLOC_PREFER_ORDER     PAGE_ALLOC_COSTLY_ORDER
> +#define MLX4_EN_ALLOC_PREFER_ORDER min_t(int, get_order(32768), \
> +                                        PAGE_ALLOC_COSTLY_ORDER)
>
>  /* Receive fragment sizes; we use at most 3 fragments (for 9600 byte MTU
>   * and 4K allocations) */

+1


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-16 19:03                             ` David Miller
  (?)
@ 2017-02-16 21:06                             ` Saeed Mahameed
  -1 siblings, 0 replies; 78+ messages in thread
From: Saeed Mahameed @ 2017-02-16 21:06 UTC (permalink / raw)
  To: David Miller
  Cc: Tom Herbert, Tariq Toukan, Eric Dumazet, Jesper Dangaard Brouer,
	Eric Dumazet, Alexander Duyck, Linux Netdev List,
	Martin KaFai Lau, Saeed Mahameed, Willem de Bruijn,
	Brenden Blanco, Alexei Starovoitov, linux-mm

On Thu, Feb 16, 2017 at 9:03 PM, David Miller <davem@davemloft.net> wrote:
> From: Tom Herbert <tom@herbertland.com>
> Date: Thu, 16 Feb 2017 09:05:26 -0800
>
>> On Thu, Feb 16, 2017 at 5:08 AM, Tariq Toukan <tariqt@mellanox.com> wrote:
>>>
>>> On 15/02/2017 6:57 PM, Eric Dumazet wrote:
>>>>
>>>> On Wed, Feb 15, 2017 at 8:42 AM, Tariq Toukan <tariqt@mellanox.com> wrote:
>>>>>
>>>>> Isn't it the same principle in page_frag_alloc() ?
>>>>> It is called from __netdev_alloc_skb()/__napi_alloc_skb().
>>>>>
>>>>> Why is it ok to have order-3 pages (PAGE_FRAG_CACHE_MAX_ORDER) there?
>>>>
>>>> This is not ok.
>>>>
>>>> This is a very well known problem, we already mentioned that here in the
>>>> past,
>>>> but at least core networking stack uses  order-0 pages on PowerPC.
>>>
>>> You're right, we should have done this as well in mlx4 on PPC.
>>>>
>>>> mlx4 driver suffers from this problem 100% more than other drivers ;)
>>>>
>>>> One problem at a time Tariq. Right now, only mlx4 has this big problem
>>>> compared to other NIC.
>>>
>>> We _do_ agree that the series improves the driver's quality, stability,
>>> and performance in a fragmented system.
>>>
>>> But due to the late rc we're in, and the fact that we know what benchmarks
>>> our customers are going to run, we cannot Ack the series and get it
>>> as is inside kernel 4.11.
>>>
>> You're admitting that Eric's patches improve driver quality,
>> stability, and performance but you're not allowing this in the kernel
>> because "we know what benchmarks our customers are going to run".
>> Sorry, but that is a weak explanation.
>
> I have to agree with Tom and Eric.
>
> If your customers have gotten into the habit of using metrics which
> actually do not represent real life performance, that is a completely
> inappropriate reason to not include Eric's changes as-is.

Guys, it is not about customers, benchmarks or great performance numbers.
We agree with you that those patches are good and needed for the long term,
but as we are already at rc8, and although this change is correct, I think
it is a little bit too late to have such a huge change in the core RX engine
of the driver. (The code has already been like this for more kernel releases
than one could count; it will hurt no one to keep it like this for two weeks
more.)

All we ask is a little bit more time - one or two weeks - to test the patches
and evaluate the impact.
As Eric stated, we don't care whether they make it into 4.11 or 4.12; the idea
is to have them in ASAP. So why not wait for 4.11 (just two more weeks)?
Tariq already said that he will accept the series as is, and by then we will
be smarter and have a clear plan of the gaps and how to close them.

For the PowerPC page order issue, Eric already has a simpler suggestion that
I support, and it can easily be sent to net and -stable.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 00/14] mlx4: order-0 allocations and page recycling
  2017-02-13 19:58 [PATCH v3 net-next 00/14] mlx4: order-0 allocations and page recycling Eric Dumazet
                   ` (13 preceding siblings ...)
  2017-02-13 19:58 ` [PATCH v3 net-next 14/14] mlx4: remove duplicate code in mlx4_en_process_rx_cq() Eric Dumazet
@ 2017-02-17 16:00 ` David Miller
  14 siblings, 0 replies; 78+ messages in thread
From: David Miller @ 2017-02-17 16:00 UTC (permalink / raw)
  To: edumazet
  Cc: netdev, tariqt, kafai, saeedm, willemb, brouer, bblanco, ast,
	eric.dumazet

From: Eric Dumazet <edumazet@google.com>
Date: Mon, 13 Feb 2017 11:58:44 -0800

> As mentioned half a year ago, we better switch mlx4 driver to order-0
> allocations and page recycling.
> 
> This reduces vulnerability surface thanks to better skb->truesize
> tracking and provides better performance in most cases.
> 
> v2 provides an ethtool -S new counter (rx_alloc_pages) and
> code factorization, plus Tariq fix.
> 
> v3 includes various fixes based on Tariq tests and feedback
> from Saeed and Tariq.
> 
> Worth noting this patch series deletes ~250 lines of code ;)

I think this should get queued up for net-next soon, but I marked the
series as "deferred" in patchwork while the discussion continues.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-13 19:58 ` [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX Eric Dumazet
  2017-02-13 20:51   ` Alexander Duyck
@ 2017-02-22 16:22   ` Eric Dumazet
  2017-02-22 17:23     ` Alexander Duyck
  1 sibling, 1 reply; 78+ messages in thread
From: Eric Dumazet @ 2017-02-22 16:22 UTC (permalink / raw)
  To: Eric Dumazet, Alexander Duyck
  Cc: David S . Miller, netdev, Tariq Toukan, Saeed Mahameed,
	Willem de Bruijn, Jesper Dangaard Brouer, Brenden Blanco,
	Alexei Starovoitov

On Mon, 2017-02-13 at 11:58 -0800, Eric Dumazet wrote:
> Use of order-3 pages is problematic in some cases.
> 
> This patch might add three kinds of regression :
> 
> 1) a CPU performance regression, but we will add later page
> recycling and performance should be back.
> 
> 2) TCP receiver could grow its receive window slightly slower,
>    because skb->len/skb->truesize ratio will decrease.
>    This is mostly ok, we prefer being conservative to not risk OOM,
>    and eventually tune TCP better in the future.
>    This is consistent with other drivers using 2048 per ethernet frame.
> 
> 3) Because we allocate one page per RX slot, we consume more
>    memory for the ring buffers. XDP already had this constraint anyway.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---

Note that we also could use a different strategy.

Assume RX rings of 4096 entries/slots.

With this patch, mlx4 gets the strategy used by Alexander in Intel
drivers : 

Each RX slot has an allocated page, and uses half of it, flipping to the
other half every time the slot is used.

So a ring buffer of 4096 slots allocates 4096 pages.

When we receive a packet train for the same flow, GRO builds an skb with
~45 page frags, all from different pages.

The put_page() done from skb_release_data() touches ~45 different struct
page cache lines, and shows a high cost. (compared to the order-3 used
today by mlx4, this adds extra cache line misses and stalls for the
consumer)

If we instead try to use the two halves of one page on consecutive RX
slots, we might instead cook skb with the same number of MSS (45), but
half the number of cache lines for put_page(), so we should speed up the
consumer.

This means the number of active pages would be minimal, especially on
PowerPC. Pages that have been used by X=2 received frags would be put in
a quarantine (size to be determined).
On PowerPC, X would be PAGE_SIZE/frag_size


This strategy would consume less memory on PowerPC : 
65535/1536 = 42, so a 4096 RX ring would need 98 active pages instead of
4096.

The quarantine would be sized to increase chances of reusing an old
page, without consuming too much memory.

Probably roundup_pow_of_two(rx_ring_size / (PAGE_SIZE/frag_size))

x86 would still use 4096 pages, but PowerPC would use 98+128 pages
instead of 4096 (14 MBytes instead of 256 MBytes).
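
To make the sizing concrete, here is a minimal kernel-style sketch of the
arithmetic above (helper names are invented for illustration; this is not
code from the series):

/* Back-of-the-envelope sizing for the scheme above (illustrative only):
 * pages actively posted to the device, plus a power-of-two quarantine.
 */
static unsigned int rx_active_pages(unsigned int ring_size,
				    unsigned int frag_size)
{
	unsigned int frags_per_page = PAGE_SIZE / frag_size;

	return DIV_ROUND_UP(ring_size, frags_per_page);
}

static unsigned int rx_quarantine_size(unsigned int ring_size,
				       unsigned int frag_size)
{
	return roundup_pow_of_two(ring_size / (PAGE_SIZE / frag_size));
}

/* On a 64KB-page PowerPC with frag_size = 1536 and a 4096-slot ring:
 * rx_active_pages()    -> 98, rx_quarantine_size() -> 128,
 * i.e. ~226 pages (~14 MBytes) instead of 4096 pages (256 MBytes).
 */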

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-22 16:22   ` Eric Dumazet
@ 2017-02-22 17:23     ` Alexander Duyck
  2017-02-22 17:58       ` David Laight
  2017-02-22 18:21       ` Eric Dumazet
  0 siblings, 2 replies; 78+ messages in thread
From: Alexander Duyck @ 2017-02-22 17:23 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, Alexander Duyck, David S . Miller, netdev,
	Tariq Toukan, Saeed Mahameed, Willem de Bruijn,
	Jesper Dangaard Brouer, Brenden Blanco, Alexei Starovoitov

On Wed, Feb 22, 2017 at 8:22 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2017-02-13 at 11:58 -0800, Eric Dumazet wrote:
>> Use of order-3 pages is problematic in some cases.
>>
>> This patch might add three kinds of regression :
>>
>> 1) a CPU performance regression, but we will add later page
>> recycling and performance should be back.
>>
>> 2) TCP receiver could grow its receive window slightly slower,
>>    because skb->len/skb->truesize ratio will decrease.
>>    This is mostly ok, we prefer being conservative to not risk OOM,
>>    and eventually tune TCP better in the future.
>>    This is consistent with other drivers using 2048 per ethernet frame.
>>
>> 3) Because we allocate one page per RX slot, we consume more
>>    memory for the ring buffers. XDP already had this constraint anyway.
>>
>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>> ---
>
> Note that we also could use a different strategy.
>
> Assume RX rings of 4096 entries/slots.
>
> With this patch, mlx4 gets the strategy used by Alexander in Intel
> drivers :
>
> Each RX slot has an allocated page, and uses half of it, flipping to the
> other half every time the slot is used.
>
> So a ring buffer of 4096 slots allocates 4096 pages.
>
> When we receive a packet train for the same flow, GRO builds an skb with
> ~45 page frags, all from different pages.
>
> The put_page() done from skb_release_data() touches ~45 different struct
> page cache lines, and show a high cost. (compared to the order-3 used
> today by mlx4, this adds extra cache line misses and stalls for the
> consumer)
>
> If we instead try to use the two halves of one page on consecutive RX
> slots, we might instead cook skb with the same number of MSS (45), but
> half the number of cache lines for put_page(), so we should speed up the
> consumer.

So there is a problem that is being overlooked here.  That is the cost
of the DMA map/unmap calls.  The problem is many PowerPC systems have
an IOMMU that you have to work around, and that IOMMU comes at a heavy
cost for every map/unmap call.  So unless you are saying you want to
set up a hybrid between the mlx5 and this approach, where we have a page
cache that these all fall back into, you will take a heavy cost for
having to map and unmap pages.

The whole reason why I implemented the Intel page reuse approach the
way I did is to try and mitigate the IOMMU issue; it wasn't so much to
resolve allocator/freeing expense.  Basically the allocator scales,
the IOMMU does not.  So any solution would require making certain that
we can leave the pages pinned for DMA to avoid having to take the
global locks involved in accessing the IOMMU.

> This means the number of active pages would be minimal, especially on
> PowerPC. Pages that have been used by X=2 received frags would be put in
> a quarantine (size to be determined).
> On PowerPC, X would be PAGE_SIZE/frag_size
>
>
> This strategy would consume less memory on PowerPC :
> 65535/1536 = 42, so a 4096 RX ring would need 98 active pages instead of
> 4096.
>
> The quarantine would be sized to increase chances of reusing an old
> page, without consuming too much memory.
>
> Probably roundup_pow_of_two(rx_ring_size / (PAGE_SIZE/frag_size))
>
> x86 would still use 4096 pages, but PowerPC would use 98+128 pages
> instead of 4096) (14 MBytes instead of 256 MBytes)

So any solution will need to work with an IOMMU enabled on the
platform.  I assume you have some x86 test systems you could run with
an IOMMU enabled.  My advice would be to try running in that
environment and see where the overhead lies.

- Alex

^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-22 17:23     ` Alexander Duyck
@ 2017-02-22 17:58       ` David Laight
  2017-02-22 18:21       ` Eric Dumazet
  1 sibling, 0 replies; 78+ messages in thread
From: David Laight @ 2017-02-22 17:58 UTC (permalink / raw)
  To: 'Alexander Duyck', Eric Dumazet
  Cc: Eric Dumazet, Alexander Duyck, David S . Miller, netdev,
	Tariq Toukan, Saeed Mahameed, Willem de Bruijn,
	Jesper Dangaard Brouer, Brenden Blanco, Alexei Starovoitov

From: Alexander Duyck
> Sent: 22 February 2017 17:24
...
> So there is a problem that is being overlooked here.  That is the cost
> of the DMA map/unmap calls.  The problem is many PowerPC systems have
> an IOMMU that you have to work around, and that IOMMU comes at a heavy
> cost for every map/unmap call.  So unless you are saying you wan to
> setup a hybrid between the mlx5 and this approach where we have a page
> cache that these all fall back into you will take a heavy cost for
> having to map and unmap pages.
..

I can't help feeling that you need to look at how to get the iommu
code to reuse pages, rather the ethernet driver.
Maybe something like:

1) The driver requests a mapped receive buffer from the iommu.
This might give it memory that is already mapped but not in use.

2) When the receive completes the driver tells the iommu the mapping
is no longer needed. The iommu is not (yet) changed.

3) When the skb is freed the iommu is told that the buffer can be freed.

4) At (1), if the driver is using too much iommu resource then the mapping
for completed receives can be removed to free up iommu space.

Probably not as simple as it looks :-)

	David
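
For illustration only, the four steps above could map onto an API shaped
roughly like this (no such API exists today; every name below is invented):

struct iommu_rx_buf {
	void		*vaddr;
	dma_addr_t	dma;
	size_t		len;
};

/* (1) hand out a mapped RX buffer, preferably one whose mapping is still live */
struct iommu_rx_buf *iommu_rx_buf_get(struct device *dev, size_t len);

/* (2) receive completed: driver no longer needs the buffer, mapping kept as-is */
void iommu_rx_buf_done(struct device *dev, struct iommu_rx_buf *buf);

/* (3) skb freed: the buffer (and eventually its mapping) may be reclaimed */
void iommu_rx_buf_put(struct device *dev, struct iommu_rx_buf *buf);

/* (4) called from (1) under pressure: tear down mappings of completed buffers */
void iommu_rx_buf_shrink(struct device *dev);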


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-22 17:23     ` Alexander Duyck
  2017-02-22 17:58       ` David Laight
@ 2017-02-22 18:21       ` Eric Dumazet
  2017-02-23  1:08         ` Alexander Duyck
  1 sibling, 1 reply; 78+ messages in thread
From: Eric Dumazet @ 2017-02-22 18:21 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Eric Dumazet, Alexander Duyck, David S . Miller, netdev,
	Tariq Toukan, Saeed Mahameed, Willem de Bruijn,
	Jesper Dangaard Brouer, Brenden Blanco, Alexei Starovoitov

On Wed, 2017-02-22 at 09:23 -0800, Alexander Duyck wrote:
> On Wed, Feb 22, 2017 at 8:22 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Mon, 2017-02-13 at 11:58 -0800, Eric Dumazet wrote:
> >> Use of order-3 pages is problematic in some cases.
> >>
> >> This patch might add three kinds of regression :
> >>
> >> 1) a CPU performance regression, but we will add later page
> >> recycling and performance should be back.
> >>
> >> 2) TCP receiver could grow its receive window slightly slower,
> >>    because skb->len/skb->truesize ratio will decrease.
> >>    This is mostly ok, we prefer being conservative to not risk OOM,
> >>    and eventually tune TCP better in the future.
> >>    This is consistent with other drivers using 2048 per ethernet frame.
> >>
> >> 3) Because we allocate one page per RX slot, we consume more
> >>    memory for the ring buffers. XDP already had this constraint anyway.
> >>
> >> Signed-off-by: Eric Dumazet <edumazet@google.com>
> >> ---
> >
> > Note that we also could use a different strategy.
> >
> > Assume RX rings of 4096 entries/slots.
> >
> > With this patch, mlx4 gets the strategy used by Alexander in Intel
> > drivers :
> >
> > Each RX slot has an allocated page, and uses half of it, flipping to the
> > other half every time the slot is used.
> >
> > So a ring buffer of 4096 slots allocates 4096 pages.
> >
> > When we receive a packet train for the same flow, GRO builds an skb with
> > ~45 page frags, all from different pages.
> >
> > The put_page() done from skb_release_data() touches ~45 different struct
> > page cache lines, and show a high cost. (compared to the order-3 used
> > today by mlx4, this adds extra cache line misses and stalls for the
> > consumer)
> >
> > If we instead try to use the two halves of one page on consecutive RX
> > slots, we might instead cook skb with the same number of MSS (45), but
> > half the number of cache lines for put_page(), so we should speed up the
> > consumer.
> 
> So there is a problem that is being overlooked here.  That is the cost
> of the DMA map/unmap calls.  The problem is many PowerPC systems have
> an IOMMU that you have to work around, and that IOMMU comes at a heavy
> cost for every map/unmap call.  So unless you are saying you want to
> setup a hybrid between the mlx5 and this approach where we have a page
> cache that these all fall back into you will take a heavy cost for
> having to map and unmap pages.
> 
> The whole reason why I implemented the Intel page reuse approach the
> way I did is to try and mitigate the IOMMU issue, it wasn't so much to
> resolve allocator/freeing expense.  Basically the allocator scales,
> the IOMMU does not.  So any solution would require making certain that
> we can leave the pages pinned in the DMA to avoid having to take the
> global locks involved in accessing the IOMMU.


I do not see any difference for the fact that we keep pages mapped the
same way.

mlx4_en_complete_rx_desc() will still use the :

dma_sync_single_range_for_cpu(priv->ddev, dma, frags->page_offset,
                              frag_size, priv->dma_dir);

for every single MSS we receive.

This wont change.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-22 18:21       ` Eric Dumazet
@ 2017-02-23  1:08         ` Alexander Duyck
  2017-02-23  2:06           ` Eric Dumazet
  0 siblings, 1 reply; 78+ messages in thread
From: Alexander Duyck @ 2017-02-23  1:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, Alexander Duyck, David S . Miller, netdev,
	Tariq Toukan, Saeed Mahameed, Willem de Bruijn,
	Jesper Dangaard Brouer, Brenden Blanco, Alexei Starovoitov

On Wed, Feb 22, 2017 at 10:21 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2017-02-22 at 09:23 -0800, Alexander Duyck wrote:
>> On Wed, Feb 22, 2017 at 8:22 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Mon, 2017-02-13 at 11:58 -0800, Eric Dumazet wrote:
>> >> Use of order-3 pages is problematic in some cases.
>> >>
>> >> This patch might add three kinds of regression :
>> >>
>> >> 1) a CPU performance regression, but we will add later page
>> >> recycling and performance should be back.
>> >>
>> >> 2) TCP receiver could grow its receive window slightly slower,
>> >>    because skb->len/skb->truesize ratio will decrease.
>> >>    This is mostly ok, we prefer being conservative to not risk OOM,
>> >>    and eventually tune TCP better in the future.
>> >>    This is consistent with other drivers using 2048 per ethernet frame.
>> >>
>> >> 3) Because we allocate one page per RX slot, we consume more
>> >>    memory for the ring buffers. XDP already had this constraint anyway.
>> >>
>> >> Signed-off-by: Eric Dumazet <edumazet@google.com>
>> >> ---
>> >
>> > Note that we also could use a different strategy.
>> >
>> > Assume RX rings of 4096 entries/slots.
>> >
>> > With this patch, mlx4 gets the strategy used by Alexander in Intel
>> > drivers :
>> >
>> > Each RX slot has an allocated page, and uses half of it, flipping to the
>> > other half every time the slot is used.
>> >
>> > So a ring buffer of 4096 slots allocates 4096 pages.
>> >
>> > When we receive a packet train for the same flow, GRO builds an skb with
>> > ~45 page frags, all from different pages.
>> >
>> > The put_page() done from skb_release_data() touches ~45 different struct
>> > page cache lines, and show a high cost. (compared to the order-3 used
>> > today by mlx4, this adds extra cache line misses and stalls for the
>> > consumer)
>> >
>> > If we instead try to use the two halves of one page on consecutive RX
>> > slots, we might instead cook skb with the same number of MSS (45), but
>> > half the number of cache lines for put_page(), so we should speed up the
>> > consumer.
>>
>> So there is a problem that is being overlooked here.  That is the cost
>> of the DMA map/unmap calls.  The problem is many PowerPC systems have
>> an IOMMU that you have to work around, and that IOMMU comes at a heavy
>> cost for every map/unmap call.  So unless you are saying you want to
>> setup a hybrid between the mlx5 and this approach where we have a page
>> cache that these all fall back into you will take a heavy cost for
>> having to map and unmap pages.
>>
>> The whole reason why I implemented the Intel page reuse approach the
>> way I did is to try and mitigate the IOMMU issue, it wasn't so much to
>> resolve allocator/freeing expense.  Basically the allocator scales,
>> the IOMMU does not.  So any solution would require making certain that
>> we can leave the pages pinned in the DMA to avoid having to take the
>> global locks involved in accessing the IOMMU.
>
>
> I do not see any difference for the fact that we keep pages mapped the
> same way.
>
> mlx4_en_complete_rx_desc() will still use the :
>
> dma_sync_single_range_for_cpu(priv->ddev, dma, frags->page_offset,
>                               frag_size, priv->dma_dir);
>
> for every single MSS we receive.
>
> This wont change.

Right but you were talking about using both halves one after the
other.  If that occurs you have nothing left that you can reuse.  That
was what I was getting at.  If you use up both halves you end up
having to unmap the page.

The whole idea behind using only half the page per descriptor is to
allow us to loop through the ring before we end up reusing it again.
That buys us enough time that usually the stack has consumed the frame
before we need it again.

- Alex
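
To picture the scheme being discussed, here is a simplified sketch of the
half-page flip check, called after the current half has been attached to an
skb (invented names, not copied from any driver):

struct rx_buf {
	struct page	*page;
	unsigned int	page_offset;
};

static bool rx_buf_flip_and_reuse(struct rx_buf *buf)
{
	/* only reuse if we are the sole owner, i.e. the stack has already
	 * freed every half previously handed to it
	 */
	if (page_count(buf->page) != 1)
		return false;

	/* flip to the other half for the next refill of this slot */
	buf->page_offset ^= PAGE_SIZE / 2;

	/* take a reference on behalf of the half just given to the skb */
	get_page(buf->page);
	return true;
}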

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-23  1:08         ` Alexander Duyck
@ 2017-02-23  2:06           ` Eric Dumazet
  2017-02-23  2:18             ` Alexander Duyck
                               ` (2 more replies)
  0 siblings, 3 replies; 78+ messages in thread
From: Eric Dumazet @ 2017-02-23  2:06 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Eric Dumazet, Alexander Duyck, David S . Miller, netdev,
	Tariq Toukan, Saeed Mahameed, Willem de Bruijn,
	Jesper Dangaard Brouer, Brenden Blanco, Alexei Starovoitov

On Wed, 2017-02-22 at 17:08 -0800, Alexander Duyck wrote:

> 
> Right but you were talking about using both halves one after the
> other.  If that occurs you have nothing left that you can reuse.  That
> was what I was getting at.  If you use up both halves you end up
> having to unmap the page.
> 

You must have misunderstood me.

Once we use both halves of a page, we _keep_ the page, we do not unmap
it.

We save the page pointer in a ring buffer of pages.
Call it the 'quarantine'

When we _need_ to replenish the RX desc, we take a look at the oldest
entry in the quarantine ring.

If page count is 1 (or pagecnt_bias if needed) -> we immediately reuse
this saved page.

If not, _then_ we unmap and release the page.

Note that we would have received 4096 frames before looking at the page
count, so there is high chance both halves were consumed.

To recap on x86 :

2048 active pages would be visible by the device, because 4096 RX desc
would contain dma addresses pointing to the 4096 halves.

And 2048 pages would be in the reserve.


> The whole idea behind using only half the page per descriptor is to
> allow us to loop through the ring before we end up reusing it again.
> That buys us enough time that usually the stack has consumed the frame
> before we need it again.


The same will happen really.

Best maybe is for me to send the patch ;)
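
A rough sketch of that replenish path, assuming a simple array-based
quarantine (invented names, kernel-style C, not the actual patch):

/* Reuse the oldest quarantined page if the stack has dropped all references
 * to it; otherwise unmap and free it, then fall back to a fresh allocation.
 */
static struct page *rx_replenish_page(struct device *dev,
				      struct page **quarantine,
				      dma_addr_t *quarantine_dma,
				      unsigned int oldest,
				      dma_addr_t *dma)
{
	struct page *page = quarantine[oldest];

	if (page) {
		quarantine[oldest] = NULL;
		if (page_count(page) == 1) {
			/* both halves were consumed and freed: reuse the
			 * page, its DMA mapping is still valid
			 */
			*dma = quarantine_dma[oldest];
			return page;
		}
		dma_unmap_page(dev, quarantine_dma[oldest], PAGE_SIZE,
			       DMA_FROM_DEVICE);
		put_page(page);
	}

	page = alloc_page(GFP_ATOMIC);
	if (!page)
		return NULL;
	*dma = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
	if (dma_mapping_error(dev, *dma)) {
		put_page(page);
		return NULL;
	}
	return page;
}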

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-23  2:06           ` Eric Dumazet
@ 2017-02-23  2:18             ` Alexander Duyck
  2017-02-23 14:02               ` Tariq Toukan
  2017-02-24  9:42             ` Jesper Dangaard Brouer
  2017-03-12 14:57             ` Eric Dumazet
  2 siblings, 1 reply; 78+ messages in thread
From: Alexander Duyck @ 2017-02-23  2:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, Alexander Duyck, David S . Miller, netdev,
	Tariq Toukan, Saeed Mahameed, Willem de Bruijn,
	Jesper Dangaard Brouer, Brenden Blanco, Alexei Starovoitov

On Wed, Feb 22, 2017 at 6:06 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2017-02-22 at 17:08 -0800, Alexander Duyck wrote:
>
>>
>> Right but you were talking about using both halves one after the
>> other.  If that occurs you have nothing left that you can reuse.  That
>> was what I was getting at.  If you use up both halves you end up
>> having to unmap the page.
>>
>
> You must have misunderstood me.
>
> Once we use both halves of a page, we _keep_ the page, we do not unmap
> it.
>
> We save the page pointer in a ring buffer of pages.
> Call it the 'quarantine'
>
> When we _need_ to replenish the RX desc, we take a look at the oldest
> entry in the quarantine ring.
>
> If page count is 1 (or pagecnt_bias if needed) -> we immediately reuse
> this saved page.
>
> If not, _then_ we unmap and release the page.

Okay, that was what I was referring to when I mentioned a "hybrid
between the mlx5 and the Intel approach".  Makes sense.

> Note that we would have received 4096 frames before looking at the page
> count, so there is high chance both halves were consumed.
>
> To recap on x86 :
>
> 2048 active pages would be visible by the device, because 4096 RX desc
> would contain dma addresses pointing to the 4096 halves.
>
> And 2048 pages would be in the reserve.

The buffer info layout for something like that would probably be
pretty interesting.  Basically you would be doubling up the ring so
that you handle 2 Rx descriptors per a single buffer info since you
would automatically know that it would be an even/odd setup in terms
of the buffer offsets.

If you get a chance to do something like that I would love to know the
result.  Otherwise if I get a chance I can try messing with i40e or
ixgbe some time and see what kind of impact it has.

>> The whole idea behind using only half the page per descriptor is to
>> allow us to loop through the ring before we end up reusing it again.
>> That buys us enough time that usually the stack has consumed the frame
>> before we need it again.
>
>
> The same will happen really.
>
> Best maybe is for me to send the patch ;)

I think I have the idea now.  However patches are always welcome..  :-)

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-23  2:18             ` Alexander Duyck
@ 2017-02-23 14:02               ` Tariq Toukan
  0 siblings, 0 replies; 78+ messages in thread
From: Tariq Toukan @ 2017-02-23 14:02 UTC (permalink / raw)
  To: Alexander Duyck, Eric Dumazet
  Cc: Eric Dumazet, Alexander Duyck, David S . Miller, netdev,
	Tariq Toukan, Saeed Mahameed, Willem de Bruijn,
	Jesper Dangaard Brouer, Brenden Blanco, Alexei Starovoitov



On 23/02/2017 4:18 AM, Alexander Duyck wrote:
> On Wed, Feb 22, 2017 at 6:06 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Wed, 2017-02-22 at 17:08 -0800, Alexander Duyck wrote:
>>
>>>
>>> Right but you were talking about using both halves one after the
>>> other.  If that occurs you have nothing left that you can reuse.  That
>>> was what I was getting at.  If you use up both halves you end up
>>> having to unmap the page.
>>>
>>
>> You must have misunderstood me.
>>
>> Once we use both halves of a page, we _keep_ the page, we do not unmap
>> it.
>>
>> We save the page pointer in a ring buffer of pages.
>> Call it the 'quarantine'
>>
>> When we _need_ to replenish the RX desc, we take a look at the oldest
>> entry in the quarantine ring.
>>
>> If page count is 1 (or pagecnt_bias if needed) -> we immediately reuse
>> this saved page.
>>
>> If not, _then_ we unmap and release the page.
>
> Okay, that was what I was referring to when I mentioned a "hybrid
> between the mlx5 and the Intel approach".  Makes sense.
>

Indeed, in mlx5 Striding RQ (mpwqe) we do something similar.
Our NIC (ConnectX-4 LX and newer) can write multiple _consecutive_
packets into the same RX buffer (page).
AFAIU, this is what Eric suggests doing in SW in mlx4.

Here are the main characteristics of our page-cache in mlx5:
1) FIFO (for higher chances of an available page).
2) If the page-cache head is busy, it is not freed. This has its pros 
and cons. We might reconsider.
3) Pages in the cache have no page-to-WQE assignment (WQE is for Work Queue
Element, == RX descriptor). They are shared by all WQEs of an RQ and
might be used by different WQEs in different rounds.
4) The cache size is smaller than suggested; we would happily increase it to
reflect a whole ring.

Still, performance tests over mlx5 show that under high load we quickly end
up allocating pages, as the stack does not release its page references in time.
Increasing the cache size helps of course.
As there's no _fixed_ fair size that guarantees the availability of 
pages every ring cycle, reflecting a ring size can help, and would give 
the opportunity for users to tune their performance by setting their 
ring size according to how powerful their CPUs are, and what traffic 
type/load they're running.
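
A loose sketch of the FIFO behaviour in points 1 and 2 above (simplified and
with invented names, not the mlx5 code):

struct page_cache_fifo {
	struct page	*pages[256];	/* arbitrary size for the sketch */
	unsigned int	head;		/* oldest entry, next to consume */
	unsigned int	tail;		/* next free slot */
};

static struct page *page_cache_fifo_get(struct page_cache_fifo *c)
{
	struct page *page;

	if (c->head == c->tail)		/* cache empty */
		return NULL;

	page = c->pages[c->head % ARRAY_SIZE(c->pages)];
	if (page_count(page) != 1)	/* head still busy: keep it, caller allocates */
		return NULL;

	c->head++;
	return page;
}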

>> Note that we would have received 4096 frames before looking at the page
>> count, so there is high chance both halves were consumed.
>>
>> To recap on x86 :
>>
>> 2048 active pages would be visible by the device, because 4096 RX desc
>> would contain dma addresses pointing to the 4096 halves.
>>
>> And 2048 pages would be in the reserve.
>
> The buffer info layout for something like that would probably be
> pretty interesting.  Basically you would be doubling up the ring so
> that you handle 2 Rx descriptors per a single buffer info since you
> would automatically know that it would be an even/odd setup in terms
> of the buffer offsets.
>
> If you get a chance to do something like that I would love to know the
> result.  Otherwise if I get a chance I can try messing with i40e or
> ixgbe some time and see what kind of impact it has.
>
>>> The whole idea behind using only half the page per descriptor is to
>>> allow us to loop through the ring before we end up reusing it again.
>>> That buys us enough time that usually the stack has consumed the frame
>>> before we need it again.
>>
>>
>> The same will happen really.
>>
>> Best maybe is for me to send the patch ;)
>
> I think I have the idea now.  However patches are always welcome..  :-)
>

Same here :-)

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-23  2:06           ` Eric Dumazet
  2017-02-23  2:18             ` Alexander Duyck
@ 2017-02-24  9:42             ` Jesper Dangaard Brouer
  2017-03-12 14:57             ` Eric Dumazet
  2 siblings, 0 replies; 78+ messages in thread
From: Jesper Dangaard Brouer @ 2017-02-24  9:42 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, Eric Dumazet, Alexander Duyck, David S . Miller,
	netdev, Tariq Toukan, Saeed Mahameed, Willem de Bruijn,
	Brenden Blanco, Alexei Starovoitov, brouer

On Wed, 22 Feb 2017 18:06:58 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wed, 2017-02-22 at 17:08 -0800, Alexander Duyck wrote:
> 
> > 
> > Right but you were talking about using both halves one after the
> > other.  If that occurs you have nothing left that you can reuse.  That
> > was what I was getting at.  If you use up both halves you end up
> > having to unmap the page.
> >   
> 
> You must have misunderstood me.

FYI, I also misunderstood you (Eric) to start with ;-)
 
> Once we use both halves of a page, we _keep_ the page, we do not unmap
> it.
> 
> We save the page pointer in a ring buffer of pages.
> Call it the 'quarantine'
> 
> When we _need_ to replenish the RX desc, we take a look at the oldest
> entry in the quarantine ring.
> 
> If page count is 1 (or pagecnt_bias if needed) -> we immediately reuse
> this saved page.
> 
> If not, _then_ we unmap and release the page.
> 
> Note that we would have received 4096 frames before looking at the page
> count, so there is high chance both halves were consumed.
> 
> To recap on x86 :
> 
> 2048 active pages would be visible by the device, because 4096 RX desc
> would contain dma addresses pointing to the 4096 halves.
> 
> And 2048 pages would be in the reserve.

I do like it, and it should work.  I like it because it solves my
concern regarding being able to adjust the number of
outstanding frames independently of the RX ring size.


Do notice: driver developers have to use Alex's new DMA API in order
to get writable pages, else this will violate the DMA API.  And XDP
requires writable pages.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-02-23  2:06           ` Eric Dumazet
  2017-02-23  2:18             ` Alexander Duyck
  2017-02-24  9:42             ` Jesper Dangaard Brouer
@ 2017-03-12 14:57             ` Eric Dumazet
  2017-03-12 15:29               ` Eric Dumazet
  2 siblings, 1 reply; 78+ messages in thread
From: Eric Dumazet @ 2017-03-12 14:57 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Eric Dumazet, Alexander Duyck, David S . Miller, netdev,
	Tariq Toukan, Saeed Mahameed, Willem de Bruijn,
	Jesper Dangaard Brouer, Brenden Blanco, Alexei Starovoitov

On Wed, 2017-02-22 at 18:06 -0800, Eric Dumazet wrote:
> On Wed, 2017-02-22 at 17:08 -0800, Alexander Duyck wrote:
> 
> > 
> > Right but you were talking about using both halves one after the
> > other.  If that occurs you have nothing left that you can reuse.  That
> > was what I was getting at.  If you use up both halves you end up
> > having to unmap the page.
> > 
> 
> You must have misunderstood me.
> 
> Once we use both halves of a page, we _keep_ the page, we do not unmap
> it.
> 
> We save the page pointer in a ring buffer of pages.
> Call it the 'quarantine'
> 
> When we _need_ to replenish the RX desc, we take a look at the oldest
> entry in the quarantine ring.
> 
> If page count is 1 (or pagecnt_bias if needed) -> we immediately reuse
> this saved page.
> 
> If not, _then_ we unmap and release the page.
> 
> Note that we would have received 4096 frames before looking at the page
> count, so there is high chance both halves were consumed.
> 
> To recap on x86 :
> 
> 2048 active pages would be visible by the device, because 4096 RX desc
> would contain dma addresses pointing to the 4096 halves.
> 
> And 2048 pages would be in the reserve.
> 
> 
> > The whole idea behind using only half the page per descriptor is to
> > allow us to loop through the ring before we end up reusing it again.
> > That buys us enough time that usually the stack has consumed the frame
> > before we need it again.
> 
> 
> The same will happen really.
> 
> Best maybe is for me to send the patch ;)

Excellent results so far: performance on PowerPC is back, and x86 gets a
gain as well.

Problem is XDP TX :

I do not see any guarantee mlx4_en_recycle_tx_desc() runs while the NAPI
RX is owned by current cpu.

Since TX completion is using a different NAPI, I really do not believe
we can avoid an atomic operation, like a spinlock, to protect the list
of pages ( ring->page_cache )

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-03-12 14:57             ` Eric Dumazet
@ 2017-03-12 15:29               ` Eric Dumazet
  2017-03-12 15:49                 ` Saeed Mahameed
  0 siblings, 1 reply; 78+ messages in thread
From: Eric Dumazet @ 2017-03-12 15:29 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Eric Dumazet, Alexander Duyck, David S . Miller, netdev,
	Tariq Toukan, Saeed Mahameed, Willem de Bruijn,
	Jesper Dangaard Brouer, Brenden Blanco, Alexei Starovoitov

On Sun, 2017-03-12 at 07:57 -0700, Eric Dumazet wrote:

> Problem is XDP TX :
> 
> I do not see any guarantee mlx4_en_recycle_tx_desc() runs while the NAPI
> RX is owned by current cpu.
> 
> Since TX completion is using a different NAPI, I really do not believe
> we can avoid an atomic operation, like a spinlock, to protect the list
> of pages ( ring->page_cache )

A quick fix for net-next would be :

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index aa074e57ce06fb2842fa1faabd156c3cd2fe10f5..e0b2ea8cefd6beef093c41bade199e3ec4f0291c 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -137,13 +137,17 @@ static int mlx4_en_prepare_rx_desc(struct mlx4_en_priv *priv,
 	struct mlx4_en_rx_desc *rx_desc = ring->buf + (index * ring->stride);
 	struct mlx4_en_rx_alloc *frags = ring->rx_info +
 					(index << priv->log_rx_info);
+
 	if (ring->page_cache.index > 0) {
+		spin_lock(&ring->page_cache.lock);
+
 		/* XDP uses a single page per frame */
 		if (!frags->page) {
 			ring->page_cache.index--;
 			frags->page = ring->page_cache.buf[ring->page_cache.index].page;
 			frags->dma  = ring->page_cache.buf[ring->page_cache.index].dma;
 		}
+		spin_unlock(&ring->page_cache.lock);
 		frags->page_offset = XDP_PACKET_HEADROOM;
 		rx_desc->data[0].addr = cpu_to_be64(frags->dma +
 						    XDP_PACKET_HEADROOM);
@@ -277,6 +281,7 @@ int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
 		}
 	}
 
+	spin_lock_init(&ring->page_cache.lock);
 	ring->prod = 0;
 	ring->cons = 0;
 	ring->size = size;
@@ -419,10 +424,13 @@ bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
 
 	if (cache->index >= MLX4_EN_CACHE_SIZE)
 		return false;
-
-	cache->buf[cache->index].page = frame->page;
-	cache->buf[cache->index].dma = frame->dma;
-	cache->index++;
+	spin_lock(&cache->lock);
+	if (cache->index < MLX4_EN_CACHE_SIZE) {
+		cache->buf[cache->index].page = frame->page;
+		cache->buf[cache->index].dma = frame->dma;
+		cache->index++;
+	}
+	spin_unlock(&cache->lock);
 	return true;
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 39f401aa30474e61c0b0029463b23a829ec35fa3..090a08020d13d8e11cc163ac9fc6ac6affccc463 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -258,7 +258,8 @@ struct mlx4_en_rx_alloc {
 #define MLX4_EN_CACHE_SIZE (2 * NAPI_POLL_WEIGHT)
 
 struct mlx4_en_page_cache {
-	u32 index;
+	u32			index;
+	spinlock_t		lock;
 	struct {
 		struct page	*page;
 		dma_addr_t	dma;

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-03-12 15:29               ` Eric Dumazet
@ 2017-03-12 15:49                 ` Saeed Mahameed
  2017-03-12 16:49                   ` Eric Dumazet
  0 siblings, 1 reply; 78+ messages in thread
From: Saeed Mahameed @ 2017-03-12 15:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, Eric Dumazet, Alexander Duyck, David S . Miller,
	netdev, Tariq Toukan, Saeed Mahameed, Willem de Bruijn,
	Jesper Dangaard Brouer, Brenden Blanco, Alexei Starovoitov

On Sun, Mar 12, 2017 at 5:29 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sun, 2017-03-12 at 07:57 -0700, Eric Dumazet wrote:
>
>> Problem is XDP TX :
>>
>> I do not see any guarantee mlx4_en_recycle_tx_desc() runs while the NAPI
>> RX is owned by current cpu.
>>
>> Since TX completion is using a different NAPI, I really do not believe
>> we can avoid an atomic operation, like a spinlock, to protect the list
>> of pages ( ring->page_cache )
>
> A quick fix for net-next would be :
>

Hi Eric, good catch.

I don't think we need to complicate things with an expensive spinlock.
We can simply fix this by not enabling interrupts on the XDP TX CQ (not
arming this CQ at all) and handling XDP TX CQ completions from the RX
NAPI context, in a serial (atomic) manner, before handling the RX
completions themselves.
This way locking is not required, since all page cache handling is done
from the same context (RX NAPI).

This is how we do it in mlx5, and it is the best approach
(performance wise), since we delay XDP TX CQ completion handling
until we really need the space it holds (on new RX packets).
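
A rough sketch of that NAPI ordering (invented names, not the mlx4/mlx5 code):

struct my_cq;
void xdp_tx_cq_poll(struct my_cq *cq);		/* reclaims XDP TX descriptors/pages */
int rx_cq_poll(struct my_cq *cq, int budget);
void rx_cq_arm(struct my_cq *cq);

struct rx_ring {
	struct napi_struct	napi;
	struct my_cq		*rx_cq;		/* armed for interrupts */
	struct my_cq		*xdp_tx_cq;	/* never armed */
};

static int rx_napi_poll(struct napi_struct *napi, int budget)
{
	struct rx_ring *ring = container_of(napi, struct rx_ring, napi);
	int done;

	/* 1) drain XDP TX completions first, so recycled pages are
	 *    available to the RX refill below, all in one context
	 */
	xdp_tx_cq_poll(ring->xdp_tx_cq);

	/* 2) now process RX completions and refill the ring */
	done = rx_cq_poll(ring->rx_cq, budget);

	if (done < budget && napi_complete_done(napi, done))
		rx_cq_arm(ring->rx_cq);		/* only the RX CQ is ever armed */

	return done;
}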

> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index aa074e57ce06fb2842fa1faabd156c3cd2fe10f5..e0b2ea8cefd6beef093c41bade199e3ec4f0291c 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -137,13 +137,17 @@ static int mlx4_en_prepare_rx_desc(struct mlx4_en_priv *priv,
>         struct mlx4_en_rx_desc *rx_desc = ring->buf + (index * ring->stride);
>         struct mlx4_en_rx_alloc *frags = ring->rx_info +
>                                         (index << priv->log_rx_info);
> +
>         if (ring->page_cache.index > 0) {
> +               spin_lock(&ring->page_cache.lock);
> +
>                 /* XDP uses a single page per frame */
>                 if (!frags->page) {
>                         ring->page_cache.index--;
>                         frags->page = ring->page_cache.buf[ring->page_cache.index].page;
>                         frags->dma  = ring->page_cache.buf[ring->page_cache.index].dma;
>                 }
> +               spin_unlock(&ring->page_cache.lock);
>                 frags->page_offset = XDP_PACKET_HEADROOM;
>                 rx_desc->data[0].addr = cpu_to_be64(frags->dma +
>                                                     XDP_PACKET_HEADROOM);
> @@ -277,6 +281,7 @@ int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
>                 }
>         }
>
> +       spin_lock_init(&ring->page_cache.lock);
>         ring->prod = 0;
>         ring->cons = 0;
>         ring->size = size;
> @@ -419,10 +424,13 @@ bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
>
>         if (cache->index >= MLX4_EN_CACHE_SIZE)
>                 return false;
> -
> -       cache->buf[cache->index].page = frame->page;
> -       cache->buf[cache->index].dma = frame->dma;
> -       cache->index++;
> +       spin_lock(&cache->lock);
> +       if (cache->index < MLX4_EN_CACHE_SIZE) {
> +               cache->buf[cache->index].page = frame->page;
> +               cache->buf[cache->index].dma = frame->dma;
> +               cache->index++;
> +       }
> +       spin_unlock(&cache->lock);
>         return true;
>  }
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index 39f401aa30474e61c0b0029463b23a829ec35fa3..090a08020d13d8e11cc163ac9fc6ac6affccc463 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -258,7 +258,8 @@ struct mlx4_en_rx_alloc {
>  #define MLX4_EN_CACHE_SIZE (2 * NAPI_POLL_WEIGHT)
>
>  struct mlx4_en_page_cache {
> -       u32 index;
> +       u32                     index;
> +       spinlock_t              lock;
>         struct {
>                 struct page     *page;
>                 dma_addr_t      dma;
>
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-03-12 15:49                 ` Saeed Mahameed
@ 2017-03-12 16:49                   ` Eric Dumazet
  2017-03-13  9:20                     ` Saeed Mahameed
  0 siblings, 1 reply; 78+ messages in thread
From: Eric Dumazet @ 2017-03-12 16:49 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Alexander Duyck, Eric Dumazet, Alexander Duyck, David S . Miller,
	netdev, Tariq Toukan, Saeed Mahameed, Willem de Bruijn,
	Jesper Dangaard Brouer, Brenden Blanco, Alexei Starovoitov

On Sun, 2017-03-12 at 17:49 +0200, Saeed Mahameed wrote:
> On Sun, Mar 12, 2017 at 5:29 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Sun, 2017-03-12 at 07:57 -0700, Eric Dumazet wrote:
> >
> >> Problem is XDP TX :
> >>
> >> I do not see any guarantee mlx4_en_recycle_tx_desc() runs while the NAPI
> >> RX is owned by current cpu.
> >>
> >> Since TX completion is using a different NAPI, I really do not believe
> >> we can avoid an atomic operation, like a spinlock, to protect the list
> >> of pages ( ring->page_cache )
> >
> > A quick fix for net-next would be :
> >
> 
> Hi Eric, Good catch.
> 
> I don't think we need to complicate this with an expensive spinlock;
> we can simply fix it by not enabling interrupts on the XDP TX CQ
> (never arming this CQ at all), and handling XDP TX CQ completions
> from the RX NAPI context, in a serial (atomic) manner, before
> handling the RX completions themselves.
> This way no locking is required, since all page cache handling is
> done from the same context (RX NAPI).
> 
> This is how we do it in mlx5, and it is the best approach
> (performance wise), since we delay XDP TX CQ completion handling
> until we actually need the pages they hold (for new RX packets).

SGTM, can you provide the patch for mlx4 ?

Thanks !

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
  2017-03-12 16:49                   ` Eric Dumazet
@ 2017-03-13  9:20                     ` Saeed Mahameed
  0 siblings, 0 replies; 78+ messages in thread
From: Saeed Mahameed @ 2017-03-13  9:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, Eric Dumazet, Alexander Duyck, David S . Miller,
	netdev, Tariq Toukan, Saeed Mahameed, Willem de Bruijn,
	Jesper Dangaard Brouer, Brenden Blanco, Alexei Starovoitov

On Sun, Mar 12, 2017 at 6:49 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sun, 2017-03-12 at 17:49 +0200, Saeed Mahameed wrote:
>> On Sun, Mar 12, 2017 at 5:29 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Sun, 2017-03-12 at 07:57 -0700, Eric Dumazet wrote:
>> >
>> >> Problem is XDP TX :
>> >>
>> >> I do not see any guarantee mlx4_en_recycle_tx_desc() runs while the NAPI
>> >> RX is owned by current cpu.
>> >>
>> >> Since TX completion is using a different NAPI, I really do not believe
>> >> we can avoid an atomic operation, like a spinlock, to protect the list
>> >> of pages ( ring->page_cache )
>> >
>> > A quick fix for net-next would be :
>> >
>>
>> Hi Eric, Good catch.
>>
>> I don't think we need to complicate this with an expensive spinlock;
>> we can simply fix it by not enabling interrupts on the XDP TX CQ
>> (never arming this CQ at all), and handling XDP TX CQ completions
>> from the RX NAPI context, in a serial (atomic) manner, before
>> handling the RX completions themselves.
>> This way no locking is required, since all page cache handling is
>> done from the same context (RX NAPI).
>>
>> This is how we do it in mlx5, and it is the best approach
>> (performance wise), since we delay XDP TX CQ completion handling
>> until we actually need the pages they hold (for new RX packets).
>
> SGTM, can you provide the patch for mlx4 ?
>

Of course, we will send it soon.

> Thanks !
>
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

end of thread, other threads:[~2017-03-13  9:20 UTC | newest]

Thread overview: 78+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-13 19:58 [PATCH v3 net-next 00/14] mlx4: order-0 allocations and page recycling Eric Dumazet
2017-02-13 19:58 ` [PATCH v3 net-next 01/14] mlx4: use __skb_fill_page_desc() Eric Dumazet
2017-02-13 19:58 ` [PATCH v3 net-next 02/14] mlx4: dma_dir is a mlx4_en_priv attribute Eric Dumazet
2017-02-13 19:58 ` [PATCH v3 net-next 03/14] mlx4: remove order field from mlx4_en_frag_info Eric Dumazet
2017-02-13 19:58 ` [PATCH v3 net-next 04/14] mlx4: get rid of frag_prefix_size Eric Dumazet
2017-02-13 19:58 ` [PATCH v3 net-next 05/14] mlx4: rx_headroom is a per port attribute Eric Dumazet
2017-02-13 19:58 ` [PATCH v3 net-next 06/14] mlx4: reduce rx ring page_cache size Eric Dumazet
2017-02-13 19:58 ` [PATCH v3 net-next 07/14] mlx4: removal of frag_sizes[] Eric Dumazet
2017-02-13 19:58 ` [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX Eric Dumazet
2017-02-13 20:51   ` Alexander Duyck
2017-02-13 21:09     ` Eric Dumazet
2017-02-13 23:16       ` Alexander Duyck
2017-02-13 23:22         ` Eric Dumazet
2017-02-13 23:26           ` Alexander Duyck
2017-02-13 23:29             ` Eric Dumazet
2017-02-13 23:47               ` Alexander Duyck
2017-02-14  0:22                 ` Eric Dumazet
2017-02-14  0:34                   ` Alexander Duyck
2017-02-14  0:46                     ` Eric Dumazet
2017-02-14  0:47                       ` Eric Dumazet
2017-02-14  0:57                       ` Eric Dumazet
2017-02-14  1:32                         ` Alexander Duyck
2017-02-14 12:12         ` Jesper Dangaard Brouer
2017-02-14 13:45           ` Eric Dumazet
2017-02-14 14:12             ` Eric Dumazet
2017-02-14 14:56             ` Tariq Toukan
2017-02-14 15:51               ` Eric Dumazet
2017-02-14 15:51                 ` Eric Dumazet
2017-02-14 16:03                 ` Eric Dumazet
2017-02-14 16:03                   ` Eric Dumazet
2017-02-14 17:29                 ` Tom Herbert
2017-02-15 16:42                   ` Tariq Toukan
2017-02-15 16:57                     ` Eric Dumazet
2017-02-16 13:08                       ` Tariq Toukan
2017-02-16 15:47                         ` Eric Dumazet
2017-02-16 17:05                         ` Tom Herbert
2017-02-16 17:11                           ` Eric Dumazet
2017-02-16 17:11                             ` Eric Dumazet
2017-02-16 20:49                             ` Saeed Mahameed
2017-02-16 19:03                           ` David Miller
2017-02-16 19:03                             ` David Miller
2017-02-16 21:06                             ` Saeed Mahameed
2017-02-14 17:04               ` David Miller
2017-02-14 17:17                 ` David Laight
2017-02-14 17:22                   ` David Miller
2017-02-14 19:38                 ` Jesper Dangaard Brouer
2017-02-14 19:59                   ` David Miller
2017-02-14 17:29               ` Alexander Duyck
2017-02-14 18:46                 ` Jesper Dangaard Brouer
2017-02-14 19:02                   ` Eric Dumazet
2017-02-14 20:02                     ` Jesper Dangaard Brouer
2017-02-14 20:02                       ` Jesper Dangaard Brouer
2017-02-14 21:56                       ` Eric Dumazet
2017-02-14 21:56                         ` Eric Dumazet
2017-02-14 19:06                   ` Alexander Duyck
2017-02-14 19:06                     ` Alexander Duyck
2017-02-14 19:50                     ` Jesper Dangaard Brouer
2017-02-22 16:22   ` Eric Dumazet
2017-02-22 17:23     ` Alexander Duyck
2017-02-22 17:58       ` David Laight
2017-02-22 18:21       ` Eric Dumazet
2017-02-23  1:08         ` Alexander Duyck
2017-02-23  2:06           ` Eric Dumazet
2017-02-23  2:18             ` Alexander Duyck
2017-02-23 14:02               ` Tariq Toukan
2017-02-24  9:42             ` Jesper Dangaard Brouer
2017-03-12 14:57             ` Eric Dumazet
2017-03-12 15:29               ` Eric Dumazet
2017-03-12 15:49                 ` Saeed Mahameed
2017-03-12 16:49                   ` Eric Dumazet
2017-03-13  9:20                     ` Saeed Mahameed
2017-02-13 19:58 ` [PATCH v3 net-next 09/14] mlx4: add page recycling in receive path Eric Dumazet
2017-02-13 19:58 ` [PATCH v3 net-next 10/14] mlx4: add rx_alloc_pages counter in ethtool -S Eric Dumazet
2017-02-13 19:58 ` [PATCH v3 net-next 11/14] mlx4: do not access rx_desc from mlx4_en_process_rx_cq() Eric Dumazet
2017-02-13 19:58 ` [PATCH v3 net-next 12/14] mlx4: factorize page_address() calls Eric Dumazet
2017-02-13 19:58 ` [PATCH v3 net-next 13/14] mlx4: make validate_loopback() more generic Eric Dumazet
2017-02-13 19:58 ` [PATCH v3 net-next 14/14] mlx4: remove duplicate code in mlx4_en_process_rx_cq() Eric Dumazet
2017-02-17 16:00 ` [PATCH v3 net-next 00/14] mlx4: order-0 allocations and page recycling David Miller
