* [PATCH 1/2] mm/page_alloc: free order-0 pages through PCP in page_frag_free()
@ 2018-11-05  8:58 ` Aaron Lu
  0 siblings, 0 replies; 34+ messages in thread
From: Aaron Lu @ 2018-11-05  8:58 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Andrew Morton, Paweł Staszewski, Jesper Dangaard Brouer,
	Eric Dumazet, Tariq Toukan, Ilias Apalodimas, Yoel Caspersen,
	Mel Gorman, Saeed Mahameed, Michal Hocko, Vlastimil Babka,
	Dave Hansen

page_frag_free() calls __free_pages_ok() to free the page back to
Buddy. This is OK for high order pages, but for order-0 pages it
misses the optimization opportunity of using Per-Cpu-Pages and can
cause zone lock contention when called frequently.

Paweł Staszewski recently shared his results on 'how the Linux kernel
handles normal traffic'[1] and, from the perf data, Jesper Dangaard
Brouer found that the lock contention comes from the page allocator:

  mlx5e_poll_tx_cq
  |
   --16.34%--napi_consume_skb
             |
             |--12.65%--__free_pages_ok
             |          |
             |           --11.86%--free_one_page
             |                     |
             |                     |--10.10%--queued_spin_lock_slowpath
             |                     |
             |                      --0.65%--_raw_spin_lock
             |
             |--1.55%--page_frag_free
             |
              --1.44%--skb_release_data

Jesper explained how it happened: the mlx5 driver's RX-page recycle
mechanism is not effective in this workload and pages have to go
through the page allocator. The lock contention happens during the
mlx5 DMA TX completion cycle, and the page allocator cannot keep
up at these speeds.[2]

I thought that __free_pages_ok() was mostly freeing high order
pages and that this was a lock contention problem for high order
pages, but Jesper explained in detail that __free_pages_ok() here
is actually freeing order-0 pages because mlx5 is using order-0
pages to satisfy its page pool allocation requests.[3]

The free path as pointed out by Jesper is:
skb_free_head()
  -> skb_free_frag()
    -> skb_free_frag()
      -> page_frag_free()
And the pages being freed on this path are order-0 pages.

Fix this by doing similar things as in __page_frag_cache_drain() -
send the page being freed to the PCP if it's an order-0 page, or
directly to Buddy if it is a high order page.

With this change, Paweł hasn't noticed lock contention in his
workload so far, and Jesper has noticed a 7% performance improvement
with a micro benchmark, where the lock contention is also gone.

[1]: https://www.spinics.net/lists/netdev/msg531362.html
[2]: https://www.spinics.net/lists/netdev/msg531421.html
[3]: https://www.spinics.net/lists/netdev/msg531556.html
Reported-by: Paweł Staszewski <pstaszewski@itcare.pl>
Analysed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 mm/page_alloc.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ae31839874b8..91a9a6af41a2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4555,8 +4555,14 @@ void page_frag_free(void *addr)
 {
 	struct page *page = virt_to_head_page(addr);
 
-	if (unlikely(put_page_testzero(page)))
-		__free_pages_ok(page, compound_order(page));
+	if (unlikely(put_page_testzero(page))) {
+		unsigned int order = compound_order(page);
+
+		if (order == 0)
+			free_unref_page(page);
+		else
+			__free_pages_ok(page, order);
+	}
 }
 EXPORT_SYMBOL(page_frag_free);
 
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 2/2] mm/page_alloc: use a single function to free page
  2018-11-05  8:58 ` Aaron Lu
  (?)
@ 2018-11-05  8:58 ` Aaron Lu
  2018-11-05 16:39   ` Dave Hansen
  2018-11-06  5:30   ` [PATCH v2 " Aaron Lu
  -1 siblings, 2 replies; 34+ messages in thread
From: Aaron Lu @ 2018-11-05  8:58 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Andrew Morton, Paweł Staszewski, Jesper Dangaard Brouer,
	Eric Dumazet, Tariq Toukan, Ilias Apalodimas, Yoel Caspersen,
	Mel Gorman, Saeed Mahameed, Michal Hocko, Vlastimil Babka,
	Dave Hansen

We have multiple places that free a page, most of them doing similar
things, and a common function can be used to reduce code duplication.

It also avoids a bug being fixed in one function but left in another.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 mm/page_alloc.c | 37 +++++++++++++++++--------------------
 1 file changed, 17 insertions(+), 20 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 91a9a6af41a2..2b330296e92a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4425,9 +4425,17 @@ unsigned long get_zeroed_page(gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(get_zeroed_page);
 
-void __free_pages(struct page *page, unsigned int order)
+/*
+ * Free a page by reducing its ref count by @nr.
+ * If its refcount reaches 0, then according to its order:
+ * order0: send to PCP;
+ * high order: directly send to Buddy.
+ */
+static inline void free_the_page(struct page *page, unsigned int order, int nr)
 {
-	if (put_page_testzero(page)) {
+	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
+
+	if (page_ref_sub_and_test(page, nr)) {
 		if (order == 0)
 			free_unref_page(page);
 		else
@@ -4435,6 +4443,11 @@ void __free_pages(struct page *page, unsigned int order)
 	}
 }
 
+void __free_pages(struct page *page, unsigned int order)
+{
+	free_the_page(page, order, 1);
+}
+
 EXPORT_SYMBOL(__free_pages);
 
 void free_pages(unsigned long addr, unsigned int order)
@@ -4481,16 +4494,7 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
 
 void __page_frag_cache_drain(struct page *page, unsigned int count)
 {
-	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
-
-	if (page_ref_sub_and_test(page, count)) {
-		unsigned int order = compound_order(page);
-
-		if (order == 0)
-			free_unref_page(page);
-		else
-			__free_pages_ok(page, order);
-	}
+	free_the_page(page, compound_order(page), count);
 }
 EXPORT_SYMBOL(__page_frag_cache_drain);
 
@@ -4555,14 +4559,7 @@ void page_frag_free(void *addr)
 {
 	struct page *page = virt_to_head_page(addr);
 
-	if (unlikely(put_page_testzero(page))) {
-		unsigned int order = compound_order(page);
-
-		if (order == 0)
-			free_unref_page(page);
-		else
-			__free_pages_ok(page, order);
-	}
+	free_the_page(page, compound_order(page), 1);
 }
 EXPORT_SYMBOL(page_frag_free);
 
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm/page_alloc: free order-0 pages through PCP in page_frag_free()
  2018-11-05  8:58 ` Aaron Lu
@ 2018-11-05  9:26   ` Vlastimil Babka
  -1 siblings, 0 replies; 34+ messages in thread
From: Vlastimil Babka @ 2018-11-05  9:26 UTC (permalink / raw)
  To: Aaron Lu, linux-mm, linux-kernel, netdev
  Cc: Andrew Morton, Paweł Staszewski, Jesper Dangaard Brouer,
	Eric Dumazet, Tariq Toukan, Ilias Apalodimas, Yoel Caspersen,
	Mel Gorman, Saeed Mahameed, Michal Hocko, Dave Hansen

On 11/5/18 9:58 AM, Aaron Lu wrote:
> page_frag_free() calls __free_pages_ok() to free the page back to
> Buddy. This is OK for high order page, but for order-0 pages, it
> misses the optimization opportunity of using Per-Cpu-Pages and can
> cause zone lock contention when called frequently.
> 
> Paweł Staszewski recently shared his result of 'how Linux kernel
> handles normal traffic'[1] and from perf data, Jesper Dangaard Brouer
> found the lock contention comes from page allocator:
> 
>   mlx5e_poll_tx_cq
>   |
>    --16.34%--napi_consume_skb
>              |
>              |--12.65%--__free_pages_ok
>              |          |
>              |           --11.86%--free_one_page
>              |                     |
>              |                     |--10.10%--queued_spin_lock_slowpath
>              |                     |
>              |                      --0.65%--_raw_spin_lock
>              |
>              |--1.55%--page_frag_free
>              |
>               --1.44%--skb_release_data
> 
> Jesper explained how it happened: mlx5 driver RX-page recycle
> mechanism is not effective in this workload and pages have to go
> through the page allocator. The lock contention happens during
> mlx5 DMA TX completion cycle. And the page allocator cannot keep
> up at these speeds.[2]
> 
> I thought that __free_pages_ok() are mostly freeing high order
> pages and thought this is an lock contention for high order pages
> but Jesper explained in detail that __free_pages_ok() here are
> actually freeing order-0 pages because mlx5 is using order-0 pages
> to satisfy its page pool allocation request.[3]
> 
> The free path as pointed out by Jesper is:
> skb_free_head()
>   -> skb_free_frag()
>     -> skb_free_frag()
>       -> page_frag_free()
> And the pages being freed on this path are order-0 pages.
> 
> Fix this by doing similar things as in __page_frag_cache_drain() -
> send the being freed page to PCP if it's an order-0 page, or
> directly to Buddy if it is a high order page.
> 
> With this change, Paweł hasn't noticed lock contention yet in
> his workload and Jesper has noticed a 7% performance improvement
> using a micro benchmark and lock contention is gone.
> 
> [1]: https://www.spinics.net/lists/netdev/msg531362.html
> [2]: https://www.spinics.net/lists/netdev/msg531421.html
> [3]: https://www.spinics.net/lists/netdev/msg531556.html
> Reported-by: Paweł Staszewski <pstaszewski@itcare.pl>
> Analysed-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>

Yeah looks like an obvious thing to do.

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/page_alloc.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ae31839874b8..91a9a6af41a2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4555,8 +4555,14 @@ void page_frag_free(void *addr)
>  {
>  	struct page *page = virt_to_head_page(addr);
>  
> -	if (unlikely(put_page_testzero(page)))
> -		__free_pages_ok(page, compound_order(page));
> +	if (unlikely(put_page_testzero(page))) {
> +		unsigned int order = compound_order(page);
> +
> +		if (order == 0)
> +			free_unref_page(page);
> +		else
> +			__free_pages_ok(page, order);
> +	}
>  }
>  EXPORT_SYMBOL(page_frag_free);
>  
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm/page_alloc: free order-0 pages through PCP in page_frag_free()
  2018-11-05  8:58 ` Aaron Lu
@ 2018-11-05  9:26   ` Mel Gorman
  -1 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2018-11-05  9:26 UTC (permalink / raw)
  To: Aaron Lu
  Cc: linux-mm, linux-kernel, netdev, Andrew Morton,
	Paweł Staszewski, Jesper Dangaard Brouer,
	Eric Dumazet, Tariq Toukan, Ilias Apalodimas, Yoel Caspersen,
	Saeed Mahameed, Michal Hocko, Vlastimil Babka, Dave Hansen

On Mon, Nov 05, 2018 at 04:58:19PM +0800, Aaron Lu wrote:
> page_frag_free() calls __free_pages_ok() to free the page back to
> Buddy. This is OK for high order page, but for order-0 pages, it
> misses the optimization opportunity of using Per-Cpu-Pages and can
> cause zone lock contention when called frequently.
> 
> [1]: https://www.spinics.net/lists/netdev/msg531362.html
> [2]: https://www.spinics.net/lists/netdev/msg531421.html
> [3]: https://www.spinics.net/lists/netdev/msg531556.html
> Reported-by: Paweł Staszewski <pstaszewski@itcare.pl>
> Analysed-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>

Well spotted,

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm/page_alloc: free order-0 pages through PCP in page_frag_free()
  2018-11-05  8:58 ` Aaron Lu
                   ` (3 preceding siblings ...)
  (?)
@ 2018-11-05  9:55 ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 34+ messages in thread
From: Jesper Dangaard Brouer @ 2018-11-05  9:55 UTC (permalink / raw)
  To: Aaron Lu
  Cc: linux-mm, linux-kernel, netdev, Andrew Morton,
	Paweł Staszewski, Eric Dumazet, Tariq Toukan,
	Ilias Apalodimas, Yoel Caspersen, Mel Gorman, Saeed Mahameed,
	Michal Hocko, Vlastimil Babka, Dave Hansen, brouer

On Mon,  5 Nov 2018 16:58:19 +0800
Aaron Lu <aaron.lu@intel.com> wrote:

> page_frag_free() calls __free_pages_ok() to free the page back to
> Buddy. This is OK for high order page, but for order-0 pages, it
> misses the optimization opportunity of using Per-Cpu-Pages and can
> cause zone lock contention when called frequently.
> 
> Paweł Staszewski recently shared his result of 'how Linux kernel
> handles normal traffic'[1] and from perf data, Jesper Dangaard Brouer
> found the lock contention comes from page allocator:
> 
>   mlx5e_poll_tx_cq
>   |
>    --16.34%--napi_consume_skb
>              |
>              |--12.65%--__free_pages_ok
>              |          |
>              |           --11.86%--free_one_page
>              |                     |
>              |                     |--10.10%--queued_spin_lock_slowpath
>              |                     |
>              |                      --0.65%--_raw_spin_lock
>              |
>              |--1.55%--page_frag_free
>              |
>               --1.44%--skb_release_data
> 
> Jesper explained how it happened: mlx5 driver RX-page recycle
> mechanism is not effective in this workload and pages have to go
> through the page allocator. The lock contention happens during
> mlx5 DMA TX completion cycle. And the page allocator cannot keep
> up at these speeds.[2]
> 
> I thought that __free_pages_ok() are mostly freeing high order
> pages and thought this is an lock contention for high order pages
> but Jesper explained in detail that __free_pages_ok() here are
> actually freeing order-0 pages because mlx5 is using order-0 pages
> to satisfy its page pool allocation request.[3]
> 
> The free path as pointed out by Jesper is:
> skb_free_head()
>   -> skb_free_frag()
>     -> skb_free_frag()

Nitpick: you added skb_free_frag() two times, else correct.
(All this stuff gets inlined by the compiler, which makes it hard to
spot with perf report).
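
For context, the inlined chain looks roughly like this (reconstructed
from memory of net/core/skbuff.c and include/linux/skbuff.h of that
time, so treat it as an approximation rather than the exact code):

static void skb_free_head(struct sk_buff *skb)
{
	unsigned char *head = skb->head;

	if (skb->head_frag)
		skb_free_frag(head);	/* page-frag backed head */
	else
		kfree(head);		/* kmalloc'ed head */
}

static inline void skb_free_frag(void *addr)
{
	page_frag_free(addr);		/* the function patched here */
}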

>       -> page_frag_free()  
> And the pages being freed on this path are order-0 pages.
> 
> Fix this by doing similar things as in __page_frag_cache_drain() -
> send the being freed page to PCP if it's an order-0 page, or
> directly to Buddy if it is a high order page.
> 
> With this change, Paweł hasn't noticed lock contention yet in
> his workload and Jesper has noticed a 7% performance improvement
> using a micro benchmark and lock contention is gone.
> 
> [1]: https://www.spinics.net/lists/netdev/msg531362.html
> [2]: https://www.spinics.net/lists/netdev/msg531421.html
> [3]: https://www.spinics.net/lists/netdev/msg531556.html
> Reported-by: Paweł Staszewski <pstaszewski@itcare.pl>
> Analysed-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> ---

It is REALLY great that Aaron spotted this! (based on my analysis).
This has likely been causing scalability issues with real-life network
traffic, but it has been hiding behind the driver-level recycle tricks
used for micro-benchmarking.

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

>  mm/page_alloc.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ae31839874b8..91a9a6af41a2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4555,8 +4555,14 @@ void page_frag_free(void *addr)
>  {
>  	struct page *page = virt_to_head_page(addr);
>  
> -	if (unlikely(put_page_testzero(page)))
> -		__free_pages_ok(page, compound_order(page));
> +	if (unlikely(put_page_testzero(page))) {
> +		unsigned int order = compound_order(page);
> +
> +		if (order == 0)
> +			free_unref_page(page);
> +		else
> +			__free_pages_ok(page, order);
> +	}
>  }
>  EXPORT_SYMBOL(page_frag_free);
>  

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm/page_alloc: free order-0 pages through PCP in page_frag_free()
  2018-11-05  8:58 ` Aaron Lu
@ 2018-11-05 10:46   ` Ilias Apalodimas
  -1 siblings, 0 replies; 34+ messages in thread
From: Ilias Apalodimas @ 2018-11-05 10:46 UTC (permalink / raw)
  To: Aaron Lu
  Cc: linux-mm, linux-kernel, netdev, Andrew Morton,
	Paweł Staszewski, Jesper Dangaard Brouer, Eric Dumazet,
	Tariq Toukan, Yoel Caspersen, Mel Gorman, Saeed Mahameed,
	Michal Hocko, Vlastimil Babka, Dave Hansen

Hi Aaron,
> page_frag_free() calls __free_pages_ok() to free the page back to
> Buddy. This is OK for high order page, but for order-0 pages, it
> misses the optimization opportunity of using Per-Cpu-Pages and can
> cause zone lock contention when called frequently.
> 
> Paweł Staszewski recently shared his result of 'how Linux kernel
> handles normal traffic'[1] and from perf data, Jesper Dangaard Brouer
> found the lock contention comes from page allocator:
> 
>   mlx5e_poll_tx_cq
>   |
>    --16.34%--napi_consume_skb
>              |
>              |--12.65%--__free_pages_ok
>              |          |
>              |           --11.86%--free_one_page
>              |                     |
>              |                     |--10.10%--queued_spin_lock_slowpath
>              |                     |
>              |                      --0.65%--_raw_spin_lock
>              |
>              |--1.55%--page_frag_free
>              |
>               --1.44%--skb_release_data
> 
> Jesper explained how it happened: mlx5 driver RX-page recycle
> mechanism is not effective in this workload and pages have to go
> through the page allocator. The lock contention happens during
> mlx5 DMA TX completion cycle. And the page allocator cannot keep
> up at these speeds.[2]
> 
> I thought that __free_pages_ok() are mostly freeing high order
> pages and thought this is an lock contention for high order pages
> but Jesper explained in detail that __free_pages_ok() here are
> actually freeing order-0 pages because mlx5 is using order-0 pages
> to satisfy its page pool allocation request.[3]
> 
> The free path as pointed out by Jesper is:
> skb_free_head()
>   -> skb_free_frag()
>     -> skb_free_frag()
>       -> page_frag_free()
> And the pages being freed on this path are order-0 pages.
> 
> Fix this by doing similar things as in __page_frag_cache_drain() -
> send the being freed page to PCP if it's an order-0 page, or
> directly to Buddy if it is a high order page.
> 
> With this change, Paweł hasn't noticed lock contention yet in
> his workload and Jesper has noticed a 7% performance improvement
> using a micro benchmark and lock contention is gone.
I did the same tests on a 'low' speed 1Gbit interface on a Cortex-A53.
I used Socionext's netsec driver and switched buffer allocation from the
current scheme to the page_pool API (which by default allocates order-0
pages).

Running 'perf top' pre and post patch got me the same results:
__free_pages_ok() disappeared from perf top and I got a ~11%
performance boost testing with 64-byte packets.

Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Tested-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm/page_alloc: free order-0 pages through PCP in page_frag_free()
  2018-11-05  8:58 ` Aaron Lu
                   ` (5 preceding siblings ...)
  (?)
@ 2018-11-05 15:44 ` Alexander Duyck
  2018-11-10 23:54     ` Paweł Staszewski
  -1 siblings, 1 reply; 34+ messages in thread
From: Alexander Duyck @ 2018-11-05 15:44 UTC (permalink / raw)
  To: aaron.lu
  Cc: linux-mm, LKML, Netdev, Andrew Morton, Paweł Staszewski,
	Jesper Dangaard Brouer, Eric Dumazet, Tariq Toukan,
	ilias.apalodimas, yoel, Mel Gorman, Saeed Mahameed, Michal Hocko,
	Vlastimil Babka, dave.hansen

On Mon, Nov 5, 2018 at 12:58 AM Aaron Lu <aaron.lu@intel.com> wrote:
>
> page_frag_free() calls __free_pages_ok() to free the page back to
> Buddy. This is OK for high order page, but for order-0 pages, it
> misses the optimization opportunity of using Per-Cpu-Pages and can
> cause zone lock contention when called frequently.
>
> Paweł Staszewski recently shared his result of 'how Linux kernel
> handles normal traffic'[1] and from perf data, Jesper Dangaard Brouer
> found the lock contention comes from page allocator:
>
>   mlx5e_poll_tx_cq
>   |
>    --16.34%--napi_consume_skb
>              |
>              |--12.65%--__free_pages_ok
>              |          |
>              |           --11.86%--free_one_page
>              |                     |
>              |                     |--10.10%--queued_spin_lock_slowpath
>              |                     |
>              |                      --0.65%--_raw_spin_lock
>              |
>              |--1.55%--page_frag_free
>              |
>               --1.44%--skb_release_data
>
> Jesper explained how it happened: mlx5 driver RX-page recycle
> mechanism is not effective in this workload and pages have to go
> through the page allocator. The lock contention happens during
> mlx5 DMA TX completion cycle. And the page allocator cannot keep
> up at these speeds.[2]
>
> I thought that __free_pages_ok() are mostly freeing high order
> pages and thought this is an lock contention for high order pages
> but Jesper explained in detail that __free_pages_ok() here are
> actually freeing order-0 pages because mlx5 is using order-0 pages
> to satisfy its page pool allocation request.[3]
>
> The free path as pointed out by Jesper is:
> skb_free_head()
>   -> skb_free_frag()
>     -> skb_free_frag()
>       -> page_frag_free()
> And the pages being freed on this path are order-0 pages.
>
> Fix this by doing similar things as in __page_frag_cache_drain() -
> send the being freed page to PCP if it's an order-0 page, or
> directly to Buddy if it is a high order page.
>
> With this change, Paweł hasn't noticed lock contention yet in
> his workload and Jesper has noticed a 7% performance improvement
> using a micro benchmark and lock contention is gone.
>
> [1]: https://www.spinics.net/lists/netdev/msg531362.html
> [2]: https://www.spinics.net/lists/netdev/msg531421.html
> [3]: https://www.spinics.net/lists/netdev/msg531556.html
> Reported-by: Paweł Staszewski <pstaszewski@itcare.pl>
> Analysed-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> ---
>  mm/page_alloc.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ae31839874b8..91a9a6af41a2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4555,8 +4555,14 @@ void page_frag_free(void *addr)
>  {
>         struct page *page = virt_to_head_page(addr);
>
> -       if (unlikely(put_page_testzero(page)))
> -               __free_pages_ok(page, compound_order(page));
> +       if (unlikely(put_page_testzero(page))) {
> +               unsigned int order = compound_order(page);
> +
> +               if (order == 0)
> +                       free_unref_page(page);
> +               else
> +                       __free_pages_ok(page, order);
> +       }
>  }
>  EXPORT_SYMBOL(page_frag_free);
>

One thing I would suggest for Paweł to try would be to reduce the Tx
qdisc size on his transmitting interfaces, reduce the Tx ring size,
and possibly increase the Tx interrupt rate. Ideally we shouldn't have
too many packets in flight, and I suspect that is the issue Paweł is
seeing that is leading to the page pool allocator freeing up the
memory. I know we like to try to batch things, but the issue is that
processing too many Tx buffers in one batch leads to us eating up too
much memory and causing evictions from the cache. Ideally the Rx and
Tx rings and queues should be sized as small as possible while still
allowing us to process up to our NAPI budget. Usually I run things
with a 128 Rx / 128 Tx setup and then reduce the Tx queue length so we
don't have more buffers stored there than we can place in the Tx ring.
Then we can avoid the extra thrash of having to pull/push memory into
and out of the freelists. Essentially the issue here ends up being
another form of buffer bloat.

With that said, this change should be mostly harmless and does address
the fact that we can have both regular order-0 pages and page frags
used for skb->head.

Acked-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm/page_alloc: free order-0 pages through PCP in page_frag_free()
  2018-11-05  8:58 ` Aaron Lu
                   ` (6 preceding siblings ...)
  (?)
@ 2018-11-05 16:37 ` Dave Hansen
  -1 siblings, 0 replies; 34+ messages in thread
From: Dave Hansen @ 2018-11-05 16:37 UTC (permalink / raw)
  To: Aaron Lu, linux-mm, linux-kernel, netdev
  Cc: Andrew Morton, Paweł Staszewski, Jesper Dangaard Brouer,
	Eric Dumazet, Tariq Toukan, Ilias Apalodimas, Yoel Caspersen,
	Mel Gorman, Saeed Mahameed, Michal Hocko, Vlastimil Babka,
	Dave Hansen

On 11/5/18 12:58 AM, Aaron Lu wrote:
> -	if (unlikely(put_page_testzero(page)))
> -		__free_pages_ok(page, compound_order(page));
> +	if (unlikely(put_page_testzero(page))) {
> +		unsigned int order = compound_order(page);
> +
> +		if (order == 0)
> +			free_unref_page(page);
> +		else
> +			__free_pages_ok(page, order);
> +	}
>  }

This little hunk seems repeated in __free_pages() and
__page_frag_cache_drain().  Do we need a common helper?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] mm/page_alloc: use a single function to free page
  2018-11-05  8:58 ` [PATCH 2/2] mm/page_alloc: use a single function to free page Aaron Lu
@ 2018-11-05 16:39   ` Dave Hansen
  2018-11-06  5:30   ` [PATCH v2 " Aaron Lu
  1 sibling, 0 replies; 34+ messages in thread
From: Dave Hansen @ 2018-11-05 16:39 UTC (permalink / raw)
  To: Aaron Lu, linux-mm, linux-kernel, netdev
  Cc: Andrew Morton, Paweł Staszewski, Jesper Dangaard Brouer,
	Eric Dumazet, Tariq Toukan, Ilias Apalodimas, Yoel Caspersen,
	Mel Gorman, Saeed Mahameed, Michal Hocko, Vlastimil Babka,
	Dave Hansen

On 11/5/18 12:58 AM, Aaron Lu wrote:
> We have multiple places of freeing a page, most of them doing similar
> things and a common function can be used to reduce code duplicate.
> 
> It also avoids bug fixed in one function and left in another.

Haha, should have read the next patch. :)

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 91a9a6af41a2..2b330296e92a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4425,9 +4425,17 @@ unsigned long get_zeroed_page(gfp_t gfp_mask)
>  }
>  EXPORT_SYMBOL(get_zeroed_page);
>  
> -void __free_pages(struct page *page, unsigned int order)
> +/*
> + * Free a page by reducing its ref count by @nr.
> + * If its refcount reaches 0, then according to its order:
> + * order0: send to PCP;
> + * high order: directly send to Buddy.
> + */

FWIW, I'm not a fan of comments on the function like this.  Please just
comment the *code* that's doing what you describe.  It's easier to read
and less likely to diverge from the code.

The rest of the patch looks great, though.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 1/2] mm/page_alloc: free order-0 pages through PCP in page_frag_free()
  2018-11-05  8:58 ` Aaron Lu
@ 2018-11-06  5:28   ` Aaron Lu
  -1 siblings, 0 replies; 34+ messages in thread
From: Aaron Lu @ 2018-11-06  5:28 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Andrew Morton, Paweł Staszewski, Jesper Dangaard Brouer,
	Eric Dumazet, Tariq Toukan, Ilias Apalodimas, Yoel Caspersen,
	Mel Gorman, Saeed Mahameed, Michal Hocko, Vlastimil Babka,
	Dave Hansen, Alexander Duyck

page_frag_free() calls __free_pages_ok() to free the page back to
Buddy. This is OK for high order pages, but for order-0 pages it
misses the optimization opportunity of using Per-Cpu-Pages and can
cause zone lock contention when called frequently.

Paweł Staszewski recently shared his results on 'how the Linux kernel
handles normal traffic'[1] and, from the perf data, Jesper Dangaard
Brouer found that the lock contention comes from the page allocator:

  mlx5e_poll_tx_cq
  |
   --16.34%--napi_consume_skb
             |
             |--12.65%--__free_pages_ok
             |          |
             |           --11.86%--free_one_page
             |                     |
             |                     |--10.10%--queued_spin_lock_slowpath
             |                     |
             |                      --0.65%--_raw_spin_lock
             |
             |--1.55%--page_frag_free
             |
              --1.44%--skb_release_data

Jesper explained how it happened: the mlx5 driver's RX-page recycle
mechanism is not effective in this workload and pages have to go
through the page allocator. The lock contention happens during the
mlx5 DMA TX completion cycle, and the page allocator cannot keep
up at these speeds.[2]

I thought that __free_pages_ok() was mostly freeing high order
pages and that this was a lock contention problem for high order
pages, but Jesper explained in detail that __free_pages_ok() here
is actually freeing order-0 pages because mlx5 is using order-0
pages to satisfy its page pool allocation requests.[3]

The free path as pointed out by Jesper is:
skb_free_head()
  -> skb_free_frag()
    -> page_frag_free()
And the pages being freed on this path are order-0 pages.

Fix this by doing similar things as in __page_frag_cache_drain() -
send the page being freed to the PCP if it's an order-0 page, or
directly to Buddy if it is a high order page.

With this change, Paweł hasn't noticed lock contention in his
workload so far, and Jesper has noticed a 7% performance improvement
with a micro benchmark, where the lock contention is also gone.
Ilias' test on a 'low' speed 1Gbit interface on a Cortex-A53 shows a
~11% performance boost testing with 64-byte packets, and
__free_pages_ok() disappeared from perf top.

[1]: https://www.spinics.net/lists/netdev/msg531362.html
[2]: https://www.spinics.net/lists/netdev/msg531421.html
[3]: https://www.spinics.net/lists/netdev/msg531556.html
Reported-by: Paweł Staszewski <pstaszewski@itcare.pl>
Analysed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Tested-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
v2: only changelog changes:
    - remove the duplicated skb_free_frag() as pointed out by Jesper;
    - add Ilias' test result;
    - add people's ack/test tag.

 mm/page_alloc.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ae31839874b8..91a9a6af41a2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4555,8 +4555,14 @@ void page_frag_free(void *addr)
 {
 	struct page *page = virt_to_head_page(addr);
 
-	if (unlikely(put_page_testzero(page)))
-		__free_pages_ok(page, compound_order(page));
+	if (unlikely(put_page_testzero(page))) {
+		unsigned int order = compound_order(page);
+
+		if (order == 0)
+			free_unref_page(page);
+		else
+			__free_pages_ok(page, order);
+	}
 }
 EXPORT_SYMBOL(page_frag_free);
 
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v2 2/2] mm/page_alloc: use a single function to free page
  2018-11-05  8:58 ` [PATCH 2/2] mm/page_alloc: use a single function to free page Aaron Lu
  2018-11-05 16:39   ` Dave Hansen
@ 2018-11-06  5:30   ` Aaron Lu
  2018-11-06  8:16     ` Vlastimil Babka
  2018-11-06 11:31     ` [PATCH v3 " Aaron Lu
  1 sibling, 2 replies; 34+ messages in thread
From: Aaron Lu @ 2018-11-06  5:30 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Andrew Morton, Paweł Staszewski, Jesper Dangaard Brouer,
	Eric Dumazet, Tariq Toukan, Ilias Apalodimas, Yoel Caspersen,
	Mel Gorman, Saeed Mahameed, Michal Hocko, Vlastimil Babka,
	Dave Hansen, Alexander Duyck

We have multiple places that free a page, most of them doing similar
things, and a common function can be used to reduce code duplication.

It also avoids a bug being fixed in one function but left in another.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
v2: move comments close to code as suggested by Dave.

 mm/page_alloc.c | 36 ++++++++++++++++--------------------
 1 file changed, 16 insertions(+), 20 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 91a9a6af41a2..4faf6b7bf225 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4425,9 +4425,17 @@ unsigned long get_zeroed_page(gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(get_zeroed_page);
 
-void __free_pages(struct page *page, unsigned int order)
+static inline void free_the_page(struct page *page, unsigned int order, int nr)
 {
-	if (put_page_testzero(page)) {
+	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
+
+	/*
+	 * Free a page by reducing its ref count by @nr.
+	 * If its refcount reaches 0, then according to its order:
+	 * order0: send to PCP;
+	 * high order: directly send to Buddy.
+	 */
+	if (page_ref_sub_and_test(page, nr)) {
 		if (order == 0)
 			free_unref_page(page);
 		else
@@ -4435,6 +4443,10 @@ void __free_pages(struct page *page, unsigned int order)
 	}
 }
 
+void __free_pages(struct page *page, unsigned int order)
+{
+	free_the_page(page, order, 1);
+}
 EXPORT_SYMBOL(__free_pages);
 
 void free_pages(unsigned long addr, unsigned int order)
@@ -4481,16 +4493,7 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
 
 void __page_frag_cache_drain(struct page *page, unsigned int count)
 {
-	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
-
-	if (page_ref_sub_and_test(page, count)) {
-		unsigned int order = compound_order(page);
-
-		if (order == 0)
-			free_unref_page(page);
-		else
-			__free_pages_ok(page, order);
-	}
+	free_the_page(page, compound_order(page), count);
 }
 EXPORT_SYMBOL(__page_frag_cache_drain);
 
@@ -4555,14 +4558,7 @@ void page_frag_free(void *addr)
 {
 	struct page *page = virt_to_head_page(addr);
 
-	if (unlikely(put_page_testzero(page))) {
-		unsigned int order = compound_order(page);
-
-		if (order == 0)
-			free_unref_page(page);
-		else
-			__free_pages_ok(page, order);
-	}
+	free_the_page(page, compound_order(page), 1);
 }
 EXPORT_SYMBOL(page_frag_free);
 
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 2/2] mm/page_alloc: use a single function to free page
  2018-11-06  5:30   ` [PATCH v2 " Aaron Lu
@ 2018-11-06  8:16     ` Vlastimil Babka
  2018-11-06  8:47       ` Aaron Lu
  2018-11-06 11:31     ` [PATCH v3 " Aaron Lu
  1 sibling, 1 reply; 34+ messages in thread
From: Vlastimil Babka @ 2018-11-06  8:16 UTC (permalink / raw)
  To: Aaron Lu, linux-mm, linux-kernel, netdev
  Cc: Andrew Morton, Paweł Staszewski, Jesper Dangaard Brouer,
	Eric Dumazet, Tariq Toukan, Ilias Apalodimas, Yoel Caspersen,
	Mel Gorman, Saeed Mahameed, Michal Hocko, Dave Hansen,
	Alexander Duyck

On 11/6/18 6:30 AM, Aaron Lu wrote:
> We have multiple places of freeing a page, most of them doing similar
> things and a common function can be used to reduce code duplicate.
> 
> It also avoids bug fixed in one function but left in another.
> 
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

I assume there's no arch that would run page_ref_sub_and_test(1) slower
than put_page_testzero(), for the critical __free_pages() case?

> ---
> v2: move comments close to code as suggested by Dave.
> 
>  mm/page_alloc.c | 36 ++++++++++++++++--------------------
>  1 file changed, 16 insertions(+), 20 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 91a9a6af41a2..4faf6b7bf225 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4425,9 +4425,17 @@ unsigned long get_zeroed_page(gfp_t gfp_mask)
>  }
>  EXPORT_SYMBOL(get_zeroed_page);
>  
> -void __free_pages(struct page *page, unsigned int order)
> +static inline void free_the_page(struct page *page, unsigned int order, int nr)
>  {
> -	if (put_page_testzero(page)) {
> +	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
> +
> +	/*
> +	 * Free a page by reducing its ref count by @nr.
> +	 * If its refcount reaches 0, then according to its order:
> +	 * order0: send to PCP;
> +	 * high order: directly send to Buddy.
> +	 */
> +	if (page_ref_sub_and_test(page, nr)) {
>  		if (order == 0)
>  			free_unref_page(page);
>  		else
> @@ -4435,6 +4443,10 @@ void __free_pages(struct page *page, unsigned int order)
>  	}
>  }
>  
> +void __free_pages(struct page *page, unsigned int order)
> +{
> +	free_the_page(page, order, 1);
> +}
>  EXPORT_SYMBOL(__free_pages);
>  
>  void free_pages(unsigned long addr, unsigned int order)
> @@ -4481,16 +4493,7 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
>  
>  void __page_frag_cache_drain(struct page *page, unsigned int count)
>  {
> -	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
> -
> -	if (page_ref_sub_and_test(page, count)) {
> -		unsigned int order = compound_order(page);
> -
> -		if (order == 0)
> -			free_unref_page(page);
> -		else
> -			__free_pages_ok(page, order);
> -	}
> +	free_the_page(page, compound_order(page), count);
>  }
>  EXPORT_SYMBOL(__page_frag_cache_drain);
>  
> @@ -4555,14 +4558,7 @@ void page_frag_free(void *addr)
>  {
>  	struct page *page = virt_to_head_page(addr);
>  
> -	if (unlikely(put_page_testzero(page))) {
> -		unsigned int order = compound_order(page);
> -
> -		if (order == 0)
> -			free_unref_page(page);
> -		else
> -			__free_pages_ok(page, order);
> -	}
> +	free_the_page(page, compound_order(page), 1);
>  }
>  EXPORT_SYMBOL(page_frag_free);
>  
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 2/2] mm/page_alloc: use a single function to free page
  2018-11-06  8:16     ` Vlastimil Babka
@ 2018-11-06  8:47       ` Aaron Lu
  2018-11-06  9:32         ` Vlastimil Babka
  0 siblings, 1 reply; 34+ messages in thread
From: Aaron Lu @ 2018-11-06  8:47 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, netdev, Andrew Morton,
	Paweł Staszewski, Jesper Dangaard Brouer, Eric Dumazet,
	Tariq Toukan, Ilias Apalodimas, Yoel Caspersen, Mel Gorman,
	Saeed Mahameed, Michal Hocko, Dave Hansen, Alexander Duyck

On Tue, Nov 06, 2018 at 09:16:20AM +0100, Vlastimil Babka wrote:
> On 11/6/18 6:30 AM, Aaron Lu wrote:
> > We have multiple places of freeing a page, most of them doing similar
> > things and a common function can be used to reduce code duplicate.
> > 
> > It also avoids bug fixed in one function but left in another.
> > 
> > Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks.

> I assume there's no arch that would run page_ref_sub_and_test(1) slower
> than put_page_testzero(), for the critical __free_pages() case?

Good question.

I followed the non-arch specific calls and found that:
page_ref_sub_and_test() ends up calling atomic_sub_return(i, v) while
put_page_testzero() ends up calling atomic_sub_return(1, v). So they
should be the same for archs that do not have their own
implementations.
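
For reference, a simplified sketch of the relevant helpers, written
from memory of include/linux/mm.h and include/linux/page_ref.h around
this kernel version (tracepoint hooks omitted), so take it as an
approximation only:

static inline int put_page_testzero(struct page *page)
{
	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
	return page_ref_dec_and_test(page);
}

static inline int page_ref_dec_and_test(struct page *page)
{
	return atomic_dec_and_test(&page->_refcount);
}

static inline int page_ref_sub_and_test(struct page *page, int nr)
{
	return atomic_sub_and_test(nr, &page->_refcount);
}

On archs without their own dec_and_test, both *_and_test() variants
fall back to checking atomic_sub_return(nr, v) == 0, with nr == 1 in
the put_page_testzero() case.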

Back to your question: I don't know either.
If this is deemed unsafe, we can probably keep the ref-modifying part
in the original functions and only move the freeing part into a common
function.

Regards,
Aaron

> > ---
> > v2: move comments close to code as suggested by Dave.
> > 
> >  mm/page_alloc.c | 36 ++++++++++++++++--------------------
> >  1 file changed, 16 insertions(+), 20 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 91a9a6af41a2..4faf6b7bf225 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4425,9 +4425,17 @@ unsigned long get_zeroed_page(gfp_t gfp_mask)
> >  }
> >  EXPORT_SYMBOL(get_zeroed_page);
> >  
> > -void __free_pages(struct page *page, unsigned int order)
> > +static inline void free_the_page(struct page *page, unsigned int order, int nr)
> >  {
> > -	if (put_page_testzero(page)) {
> > +	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
> > +
> > +	/*
> > +	 * Free a page by reducing its ref count by @nr.
> > +	 * If its refcount reaches 0, then according to its order:
> > +	 * order0: send to PCP;
> > +	 * high order: directly send to Buddy.
> > +	 */
> > +	if (page_ref_sub_and_test(page, nr)) {
> >  		if (order == 0)
> >  			free_unref_page(page);
> >  		else
> > @@ -4435,6 +4443,10 @@ void __free_pages(struct page *page, unsigned int order)
> >  	}
> >  }
> >  
> > +void __free_pages(struct page *page, unsigned int order)
> > +{
> > +	free_the_page(page, order, 1);
> > +}
> >  EXPORT_SYMBOL(__free_pages);
> >  
> >  void free_pages(unsigned long addr, unsigned int order)
> > @@ -4481,16 +4493,7 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
> >  
> >  void __page_frag_cache_drain(struct page *page, unsigned int count)
> >  {
> > -	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
> > -
> > -	if (page_ref_sub_and_test(page, count)) {
> > -		unsigned int order = compound_order(page);
> > -
> > -		if (order == 0)
> > -			free_unref_page(page);
> > -		else
> > -			__free_pages_ok(page, order);
> > -	}
> > +	free_the_page(page, compound_order(page), count);
> >  }
> >  EXPORT_SYMBOL(__page_frag_cache_drain);
> >  
> > @@ -4555,14 +4558,7 @@ void page_frag_free(void *addr)
> >  {
> >  	struct page *page = virt_to_head_page(addr);
> >  
> > -	if (unlikely(put_page_testzero(page))) {
> > -		unsigned int order = compound_order(page);
> > -
> > -		if (order == 0)
> > -			free_unref_page(page);
> > -		else
> > -			__free_pages_ok(page, order);
> > -	}
> > +	free_the_page(page, compound_order(page), 1);
> >  }
> >  EXPORT_SYMBOL(page_frag_free);
> >  
> > 
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 2/2] mm/page_alloc: use a single function to free page
  2018-11-06  8:47       ` Aaron Lu
@ 2018-11-06  9:32         ` Vlastimil Babka
  2018-11-06 11:20           ` Aaron Lu
  0 siblings, 1 reply; 34+ messages in thread
From: Vlastimil Babka @ 2018-11-06  9:32 UTC (permalink / raw)
  To: Aaron Lu
  Cc: linux-mm, linux-kernel, netdev, Andrew Morton,
	Paweł Staszewski, Jesper Dangaard Brouer, Eric Dumazet,
	Tariq Toukan, Ilias Apalodimas, Yoel Caspersen, Mel Gorman,
	Saeed Mahameed, Michal Hocko, Dave Hansen, Alexander Duyck

On 11/6/18 9:47 AM, Aaron Lu wrote:
> On Tue, Nov 06, 2018 at 09:16:20AM +0100, Vlastimil Babka wrote:
>> On 11/6/18 6:30 AM, Aaron Lu wrote:
>>> We have multiple places of freeing a page, most of them doing similar
>>> things and a common function can be used to reduce code duplicate.
>>>
>>> It also avoids bug fixed in one function but left in another.
>>>
>>> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
>>
>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Thanks.
> 
>> I assume there's no arch that would run page_ref_sub_and_test(1) slower
>> than put_page_testzero(), for the critical __free_pages() case?
> 
> Good question.
> 
> I followed the non-arch specific calls and found that:
> page_ref_sub_and_test() ends up calling atomic_sub_return(i, v) while
> put_page_testzero() ends up calling atomic_sub_return(1, v). So they
> should be same for archs that do not have their own implementations.

x86 seems to distinguish between DECL and SUBL, see
arch/x86/include/asm/atomic.h, although I could not figure out where
e.g. arch_atomic_dec_and_test becomes atomic_dec_and_test to override the
generic implementation.
I don't know if the CPU e.g. executes DECL faster, but objectively it
takes one operand less. Maybe it doesn't matter?
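
For illustration only (user-space GCC inline asm, not the kernel's macros;
just the two instruction shapes being compared, and it needs a GCC with
flag output operand support):

#include <stdbool.h>
#include <stdio.h>

/* one memory operand only: "lock decl" */
static bool dec_and_test(int *v)
{
	bool zero;

	asm volatile("lock decl %0"
		     : "+m" (*v), "=@ccz" (zero) : : "memory");
	return zero;
}

/* needs the extra source operand: "lock subl" */
static bool sub_and_test(int i, int *v)
{
	bool zero;

	asm volatile("lock subl %2, %0"
		     : "+m" (*v), "=@ccz" (zero) : "ir" (i) : "memory");
	return zero;
}

int main(void)
{
	int a = 1, b = 2;

	/* both counters hit zero here, so this prints "1 1" */
	printf("%d %d\n", dec_and_test(&a), sub_and_test(2, &b));
	return 0;
}

Whether the one-operand form is actually measurably faster is exactly the
open question here.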

> Back to your question: I don't know either.
> If this is deemed unsafe, we can probably keep the ref modify part in
> their original functions and only take the free part into a common
> function.

I guess you could also employ if (__builtin_constant_p(nr)) in
free_the_page(), but the result will probably be ugly, and maybe not
worth it :)
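
Just to make that concrete, a rough sketch of the idea (hypothetical, not
part of any posted version of the patch) could look like:

static inline void free_the_page(struct page *page, unsigned int order, int nr)
{
	bool last;

	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);

	/* keep the one-operand decrement for the constant nr == 1 callers */
	if (__builtin_constant_p(nr) && nr == 1)
		last = put_page_testzero(page);
	else
		last = page_ref_sub_and_test(page, nr);

	if (last) {
		if (order == 0)
			free_unref_page(page);
		else
			__free_pages_ok(page, order);
	}
}

which keeps both ref-count flavours inside the helper and is arguably no
prettier than just leaving the callers as they are.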

> Regards,
> Aaron
> 
>>> ---
>>> v2: move comments close to code as suggested by Dave.
>>>
>>>  mm/page_alloc.c | 36 ++++++++++++++++--------------------
>>>  1 file changed, 16 insertions(+), 20 deletions(-)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 91a9a6af41a2..4faf6b7bf225 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -4425,9 +4425,17 @@ unsigned long get_zeroed_page(gfp_t gfp_mask)
>>>  }
>>>  EXPORT_SYMBOL(get_zeroed_page);
>>>  
>>> -void __free_pages(struct page *page, unsigned int order)
>>> +static inline void free_the_page(struct page *page, unsigned int order, int nr)
>>>  {
>>> -	if (put_page_testzero(page)) {
>>> +	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
>>> +
>>> +	/*
>>> +	 * Free a page by reducing its ref count by @nr.
>>> +	 * If its refcount reaches 0, then according to its order:
>>> +	 * order0: send to PCP;
>>> +	 * high order: directly send to Buddy.
>>> +	 */
>>> +	if (page_ref_sub_and_test(page, nr)) {
>>>  		if (order == 0)
>>>  			free_unref_page(page);
>>>  		else
>>> @@ -4435,6 +4443,10 @@ void __free_pages(struct page *page, unsigned int order)
>>>  	}
>>>  }
>>>  
>>> +void __free_pages(struct page *page, unsigned int order)
>>> +{
>>> +	free_the_page(page, order, 1);
>>> +}
>>>  EXPORT_SYMBOL(__free_pages);
>>>  
>>>  void free_pages(unsigned long addr, unsigned int order)
>>> @@ -4481,16 +4493,7 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
>>>  
>>>  void __page_frag_cache_drain(struct page *page, unsigned int count)
>>>  {
>>> -	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
>>> -
>>> -	if (page_ref_sub_and_test(page, count)) {
>>> -		unsigned int order = compound_order(page);
>>> -
>>> -		if (order == 0)
>>> -			free_unref_page(page);
>>> -		else
>>> -			__free_pages_ok(page, order);
>>> -	}
>>> +	free_the_page(page, compound_order(page), count);
>>>  }
>>>  EXPORT_SYMBOL(__page_frag_cache_drain);
>>>  
>>> @@ -4555,14 +4558,7 @@ void page_frag_free(void *addr)
>>>  {
>>>  	struct page *page = virt_to_head_page(addr);
>>>  
>>> -	if (unlikely(put_page_testzero(page))) {
>>> -		unsigned int order = compound_order(page);
>>> -
>>> -		if (order == 0)
>>> -			free_unref_page(page);
>>> -		else
>>> -			__free_pages_ok(page, order);
>>> -	}
>>> +	free_the_page(page, compound_order(page), 1);
>>>  }
>>>  EXPORT_SYMBOL(page_frag_free);
>>>  
>>>
>>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 2/2] mm/page_alloc: use a single function to free page
  2018-11-06  9:32         ` Vlastimil Babka
@ 2018-11-06 11:20           ` Aaron Lu
  0 siblings, 0 replies; 34+ messages in thread
From: Aaron Lu @ 2018-11-06 11:20 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, netdev, Andrew Morton,
	Paweł Staszewski, Jesper Dangaard Brouer, Eric Dumazet,
	Tariq Toukan, Ilias Apalodimas, Yoel Caspersen, Mel Gorman,
	Saeed Mahameed, Michal Hocko, Dave Hansen, Alexander Duyck

On Tue, Nov 06, 2018 at 10:32:00AM +0100, Vlastimil Babka wrote:
> On 11/6/18 9:47 AM, Aaron Lu wrote:
> > On Tue, Nov 06, 2018 at 09:16:20AM +0100, Vlastimil Babka wrote:
> >> On 11/6/18 6:30 AM, Aaron Lu wrote:
> >>> We have multiple places of freeing a page, most of them doing similar
> >>> things and a common function can be used to reduce code duplicate.
> >>>
> >>> It also avoids bug fixed in one function but left in another.
> >>>
> >>> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> >>
> >> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> > 
> > Thanks.
> > 
> >> I assume there's no arch that would run page_ref_sub_and_test(1) slower
> >> than put_page_testzero(), for the critical __free_pages() case?
> > 
> > Good question.
> > 
> > I followed the non-arch specific calls and found that:
> > page_ref_sub_and_test() ends up calling atomic_sub_return(i, v) while
> > put_page_testzero() ends up calling atomic_sub_return(1, v). So they
> > should be same for archs that do not have their own implementations.
> 
> x86 seems to distinguish between DECL and SUBL, see

Ah right.

> arch/x86/include/asm/atomic.h although I could not figure out where does
> e.g. arch_atomic_dec_and_test become atomic_dec_and_test to override the
> generic implementation.

I didn't check that either but I think it will :-)

> I don't know if the CPU e.g. executes DECL faster, but objectively it
> has one parameter less. Maybe it doesn't matter?

No immediate idea.

> > Back to your question: I don't know either.
> > If this is deemed unsafe, we can probably keep the ref modify part in
> > their original functions and only take the free part into a common
> > function.
> 
> I guess you could also employ  if (__builtin_constant_p(nr)) in
> free_the_page(), but the result will be ugly I guess, and maybe not
> worth it :)

Right, I can't make it clean.
I think I'll just move the free part to a common function and leave the
ref decreasing part as is to be safe.

Regards,
Aaron
 
> >>> ---
> >>> v2: move comments close to code as suggested by Dave.
> >>>
> >>>  mm/page_alloc.c | 36 ++++++++++++++++--------------------
> >>>  1 file changed, 16 insertions(+), 20 deletions(-)
> >>>
> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>> index 91a9a6af41a2..4faf6b7bf225 100644
> >>> --- a/mm/page_alloc.c
> >>> +++ b/mm/page_alloc.c
> >>> @@ -4425,9 +4425,17 @@ unsigned long get_zeroed_page(gfp_t gfp_mask)
> >>>  }
> >>>  EXPORT_SYMBOL(get_zeroed_page);
> >>>  
> >>> -void __free_pages(struct page *page, unsigned int order)
> >>> +static inline void free_the_page(struct page *page, unsigned int order, int nr)
> >>>  {
> >>> -	if (put_page_testzero(page)) {
> >>> +	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
> >>> +
> >>> +	/*
> >>> +	 * Free a page by reducing its ref count by @nr.
> >>> +	 * If its refcount reaches 0, then according to its order:
> >>> +	 * order0: send to PCP;
> >>> +	 * high order: directly send to Buddy.
> >>> +	 */
> >>> +	if (page_ref_sub_and_test(page, nr)) {
> >>>  		if (order == 0)
> >>>  			free_unref_page(page);
> >>>  		else
> >>> @@ -4435,6 +4443,10 @@ void __free_pages(struct page *page, unsigned int order)
> >>>  	}
> >>>  }
> >>>  
> >>> +void __free_pages(struct page *page, unsigned int order)
> >>> +{
> >>> +	free_the_page(page, order, 1);
> >>> +}
> >>>  EXPORT_SYMBOL(__free_pages);
> >>>  
> >>>  void free_pages(unsigned long addr, unsigned int order)
> >>> @@ -4481,16 +4493,7 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
> >>>  
> >>>  void __page_frag_cache_drain(struct page *page, unsigned int count)
> >>>  {
> >>> -	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
> >>> -
> >>> -	if (page_ref_sub_and_test(page, count)) {
> >>> -		unsigned int order = compound_order(page);
> >>> -
> >>> -		if (order == 0)
> >>> -			free_unref_page(page);
> >>> -		else
> >>> -			__free_pages_ok(page, order);
> >>> -	}
> >>> +	free_the_page(page, compound_order(page), count);
> >>>  }
> >>>  EXPORT_SYMBOL(__page_frag_cache_drain);
> >>>  
> >>> @@ -4555,14 +4558,7 @@ void page_frag_free(void *addr)
> >>>  {
> >>>  	struct page *page = virt_to_head_page(addr);
> >>>  
> >>> -	if (unlikely(put_page_testzero(page))) {
> >>> -		unsigned int order = compound_order(page);
> >>> -
> >>> -		if (order == 0)
> >>> -			free_unref_page(page);
> >>> -		else
> >>> -			__free_pages_ok(page, order);
> >>> -	}
> >>> +	free_the_page(page, compound_order(page), 1);
> >>>  }
> >>>  EXPORT_SYMBOL(page_frag_free);
> >>>  
> >>>
> >>
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v3 2/2] mm/page_alloc: use a single function to free page
  2018-11-06  5:30   ` [PATCH v2 " Aaron Lu
  2018-11-06  8:16     ` Vlastimil Babka
@ 2018-11-06 11:31     ` Aaron Lu
  2018-11-06 12:06       ` Vlastimil Babka
  1 sibling, 1 reply; 34+ messages in thread
From: Aaron Lu @ 2018-11-06 11:31 UTC (permalink / raw)
  To: linux-mm, linux-kernel, netdev
  Cc: Andrew Morton, Paweł Staszewski, Jesper Dangaard Brouer,
	Eric Dumazet, Tariq Toukan, Ilias Apalodimas, Yoel Caspersen,
	Mel Gorman, Saeed Mahameed, Michal Hocko, Vlastimil Babka,
	Dave Hansen, Alexander Duyck

We have multiple places that free a page, most of them doing similar
things, so a common function can be used to reduce code duplication.

It also avoids a bug being fixed in one function but left in another.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
v3: Vlastimil mentioned a possible performance loss from using
    page_ref_sub_and_test(page, 1) in place of put_page_testzero(page);
    since we aren't sure, be safe by keeping the page ref decreasing
    code as is and only move the page freeing part to a common function.

 mm/page_alloc.c | 37 ++++++++++++++-----------------------
 1 file changed, 14 insertions(+), 23 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 91a9a6af41a2..431a03aa96f8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4425,16 +4425,19 @@ unsigned long get_zeroed_page(gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(get_zeroed_page);
 
-void __free_pages(struct page *page, unsigned int order)
+static inline void free_the_page(struct page *page, unsigned int order)
 {
-	if (put_page_testzero(page)) {
-		if (order == 0)
-			free_unref_page(page);
-		else
-			__free_pages_ok(page, order);
-	}
+	if (order == 0)
+		free_unref_page(page);
+	else
+		__free_pages_ok(page, order);
 }
 
+void __free_pages(struct page *page, unsigned int order)
+{
+	if (put_page_testzero(page))
+		free_the_page(page, order);
+}
 EXPORT_SYMBOL(__free_pages);
 
 void free_pages(unsigned long addr, unsigned int order)
@@ -4483,14 +4486,8 @@ void __page_frag_cache_drain(struct page *page, unsigned int count)
 {
 	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
 
-	if (page_ref_sub_and_test(page, count)) {
-		unsigned int order = compound_order(page);
-
-		if (order == 0)
-			free_unref_page(page);
-		else
-			__free_pages_ok(page, order);
-	}
+	if (page_ref_sub_and_test(page, count))
+		free_the_page(page, compound_order(page));
 }
 EXPORT_SYMBOL(__page_frag_cache_drain);
 
@@ -4555,14 +4552,8 @@ void page_frag_free(void *addr)
 {
 	struct page *page = virt_to_head_page(addr);
 
-	if (unlikely(put_page_testzero(page))) {
-		unsigned int order = compound_order(page);
-
-		if (order == 0)
-			free_unref_page(page);
-		else
-			__free_pages_ok(page, order);
-	}
+	if (unlikely(put_page_testzero(page)))
+		free_the_page(page, compound_order(page));
 }
 EXPORT_SYMBOL(page_frag_free);
 
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 2/2] mm/page_alloc: use a single function to free page
  2018-11-06 11:31     ` [PATCH v3 " Aaron Lu
@ 2018-11-06 12:06       ` Vlastimil Babka
  0 siblings, 0 replies; 34+ messages in thread
From: Vlastimil Babka @ 2018-11-06 12:06 UTC (permalink / raw)
  To: Aaron Lu, linux-mm, linux-kernel, netdev
  Cc: Andrew Morton, Paweł Staszewski, Jesper Dangaard Brouer,
	Eric Dumazet, Tariq Toukan, Ilias Apalodimas, Yoel Caspersen,
	Mel Gorman, Saeed Mahameed, Michal Hocko, Dave Hansen,
	Alexander Duyck

On 11/6/18 12:31 PM, Aaron Lu wrote:
> We have multiple places of freeing a page, most of them doing similar
> things and a common function can be used to reduce code duplicate.
> 
> It also avoids bug fixed in one function but left in another.
> 
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!

> ---
> v3: Vlastimil mentioned the possible performance loss by using
>     page_ref_sub_and_test(page, 1) for put_page_testzero(page), since
>     we aren't sure so be safe by keeping page ref decreasing code as
>     is, only move freeing page part to a common function.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 1/2] mm/page_alloc: free order-0 pages through PCP in page_frag_free()
  2018-11-06  5:28   ` Aaron Lu
  (?)
@ 2018-11-07  9:59   ` Tariq Toukan
  -1 siblings, 0 replies; 34+ messages in thread
From: Tariq Toukan @ 2018-11-07  9:59 UTC (permalink / raw)
  To: Aaron Lu, linux-mm, linux-kernel, netdev
  Cc: Andrew Morton, Paweł Staszewski, Jesper Dangaard Brouer,
	Eric Dumazet, Tariq Toukan, Ilias Apalodimas, Yoel Caspersen,
	Mel Gorman, Saeed Mahameed, Michal Hocko, Vlastimil Babka,
	Dave Hansen, Alexander Duyck



On 06/11/2018 7:28 AM, Aaron Lu wrote:
> page_frag_free() calls __free_pages_ok() to free the page back to
> Buddy. This is OK for high order page, but for order-0 pages, it
> misses the optimization opportunity of using Per-Cpu-Pages and can
> cause zone lock contention when called frequently.
> 
> Paweł Staszewski recently shared his result of 'how Linux kernel
> handles normal traffic'[1] and from perf data, Jesper Dangaard Brouer
> found the lock contention comes from page allocator:
> 
>    mlx5e_poll_tx_cq
>    |
>     --16.34%--napi_consume_skb
>               |
>               |--12.65%--__free_pages_ok
>               |          |
>               |           --11.86%--free_one_page
>               |                     |
>               |                     |--10.10%--queued_spin_lock_slowpath
>               |                     |
>               |                      --0.65%--_raw_spin_lock
>               |
>               |--1.55%--page_frag_free
>               |
>                --1.44%--skb_release_data
> 
> Jesper explained how it happened: mlx5 driver RX-page recycle
> mechanism is not effective in this workload and pages have to go
> through the page allocator. The lock contention happens during
> mlx5 DMA TX completion cycle. And the page allocator cannot keep
> up at these speeds.[2]
> 
> I thought that __free_pages_ok() are mostly freeing high order
> pages and thought this is an lock contention for high order pages
> but Jesper explained in detail that __free_pages_ok() here are
> actually freeing order-0 pages because mlx5 is using order-0 pages
> to satisfy its page pool allocation request.[3]
> 

Thanks for your patch!
Acked-by: Tariq Toukan <tariqt@mellanox.com>

> The free path as pointed out by Jesper is:
> skb_free_head()
>    -> skb_free_frag()
>      -> page_frag_free()
> And the pages being freed on this path are order-0 pages.
> 
> Fix this by doing similar things as in __page_frag_cache_drain() -
> send the being freed page to PCP if it's an order-0 page, or
> directly to Buddy if it is a high order page.
> 
> With this change, Paweł hasn't noticed lock contention yet in
> his workload and Jesper has noticed a 7% performance improvement
> using a micro benchmark and lock contention is gone. Ilias' test
> on a 'low' speed 1Gbit interface on an cortex-a53 shows ~11%
> performance boost testing with 64byte packets and __free_pages_ok()
> disappeared from perf top.
> 
> [1]: https://www.spinics.net/lists/netdev/msg531362.html
> [2]: https://www.spinics.net/lists/netdev/msg531421.html
> [3]: https://www.spinics.net/lists/netdev/msg531556.html
> Reported-by: Paweł Staszewski <pstaszewski@itcare.pl>
> Analysed-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
> Tested-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
> Acked-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> ---
> v2: only changelog changes:
>      - remove the duplicated skb_free_frag() as pointed by Jesper;
>      - add Ilias' test result;
>      - add people's ack/test tag.
> 
>   mm/page_alloc.c | 10 ++++++++--
>   1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ae31839874b8..91a9a6af41a2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4555,8 +4555,14 @@ void page_frag_free(void *addr)
>   {
>   	struct page *page = virt_to_head_page(addr);
>   
> -	if (unlikely(put_page_testzero(page)))
> -		__free_pages_ok(page, compound_order(page));
> +	if (unlikely(put_page_testzero(page))) {
> +		unsigned int order = compound_order(page);
> +
> +		if (order == 0)
> +			free_unref_page(page);
> +		else
> +			__free_pages_ok(page, order);
> +	}
>   }
>   EXPORT_SYMBOL(page_frag_free);
>   
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm/page_alloc: free order-0 pages through PCP in page_frag_free()
  2018-11-05 15:44 ` Alexander Duyck
@ 2018-11-10 23:54     ` Paweł Staszewski
  0 siblings, 0 replies; 34+ messages in thread
From: Paweł Staszewski @ 2018-11-10 23:54 UTC (permalink / raw)
  To: Alexander Duyck, aaron.lu
  Cc: linux-mm, LKML, Netdev, Andrew Morton, Jesper Dangaard Brouer,
	Eric Dumazet, Tariq Toukan, ilias.apalodimas, yoel, Mel Gorman,
	Saeed Mahameed, Michal Hocko, Vlastimil Babka, dave.hansen



On 05.11.2018 at 16:44, Alexander Duyck wrote:
> On Mon, Nov 5, 2018 at 12:58 AM Aaron Lu <aaron.lu@intel.com> wrote:
>> page_frag_free() calls __free_pages_ok() to free the page back to
>> Buddy. This is OK for high order page, but for order-0 pages, it
>> misses the optimization opportunity of using Per-Cpu-Pages and can
>> cause zone lock contention when called frequently.
>>
>> Paweł Staszewski recently shared his result of 'how Linux kernel
>> handles normal traffic'[1] and from perf data, Jesper Dangaard Brouer
>> found the lock contention comes from page allocator:
>>
>>    mlx5e_poll_tx_cq
>>    |
>>     --16.34%--napi_consume_skb
>>               |
>>               |--12.65%--__free_pages_ok
>>               |          |
>>               |           --11.86%--free_one_page
>>               |                     |
>>               |                     |--10.10%--queued_spin_lock_slowpath
>>               |                     |
>>               |                      --0.65%--_raw_spin_lock
>>               |
>>               |--1.55%--page_frag_free
>>               |
>>                --1.44%--skb_release_data
>>
>> Jesper explained how it happened: mlx5 driver RX-page recycle
>> mechanism is not effective in this workload and pages have to go
>> through the page allocator. The lock contention happens during
>> mlx5 DMA TX completion cycle. And the page allocator cannot keep
>> up at these speeds.[2]
>>
>> I thought that __free_pages_ok() are mostly freeing high order
>> pages and thought this is an lock contention for high order pages
>> but Jesper explained in detail that __free_pages_ok() here are
>> actually freeing order-0 pages because mlx5 is using order-0 pages
>> to satisfy its page pool allocation request.[3]
>>
>> The free path as pointed out by Jesper is:
>> skb_free_head()
>>    -> skb_free_frag()
>>      -> skb_free_frag()
>>        -> page_frag_free()
>> And the pages being freed on this path are order-0 pages.
>>
>> Fix this by doing similar things as in __page_frag_cache_drain() -
>> send the being freed page to PCP if it's an order-0 page, or
>> directly to Buddy if it is a high order page.
>>
>> With this change, Paweł hasn't noticed lock contention yet in
>> his workload and Jesper has noticed a 7% performance improvement
>> using a micro benchmark and lock contention is gone.
>>
>> [1]: https://www.spinics.net/lists/netdev/msg531362.html
>> [2]: https://www.spinics.net/lists/netdev/msg531421.html
>> [3]: https://www.spinics.net/lists/netdev/msg531556.html
>> Reported-by: Paweł Staszewski <pstaszewski@itcare.pl>
>> Analysed-by: Jesper Dangaard Brouer <brouer@redhat.com>
>> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
>> ---
>>   mm/page_alloc.c | 10 ++++++++--
>>   1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index ae31839874b8..91a9a6af41a2 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4555,8 +4555,14 @@ void page_frag_free(void *addr)
>>   {
>>          struct page *page = virt_to_head_page(addr);
>>
>> -       if (unlikely(put_page_testzero(page)))
>> -               __free_pages_ok(page, compound_order(page));
>> +       if (unlikely(put_page_testzero(page))) {
>> +               unsigned int order = compound_order(page);
>> +
>> +               if (order == 0)
>> +                       free_unref_page(page);
>> +               else
>> +                       __free_pages_ok(page, order);
>> +       }
>>   }
>>   EXPORT_SYMBOL(page_frag_free);
>>
> One thing I would suggest for Pawel to try would be to reduce the Tx
> qdisc size on his transmitting interfaces, Reduce the Tx ring size,
> and possibly increase the Tx interrupt rate. Ideally we shouldn't have
> too many packets in-flight and I suspect that is the issue that Pawel
> is seeing that is leading to the page pool allocator freeing up the
> memory. I know we like to try to batch things but the issue is
> processing too many Tx buffers in one batch leads to us eating up too
> much memory and causing evictions from the cache. Ideally the Rx and
> Tx rings and queues should be sized as small as possible while still
> allowing us to process up to our NAPI budget. Usually I run things
> with a 128 Rx / 128 Tx setup and then reduce the Tx queue length so we
> don't have more buffers stored there than we can place in the Tx ring.
> Then we can avoid the extra thrash of having to pull/push memory into
> and out of the freelists. Essentially the issue here ends up being
> another form of buffer bloat.
Thanks Alexander - yes it can be - but in my scenario setting the RX buffer
<4096 produces more interface rx drops - and no_rx_buffer on the network
controller that is receiving more packets.
So I need to stick with 3000-4000 on RX - and yes, I was trying to lower
the TX buff on connectx4 - but that changed nothing before Aaron's patch.

After Aaron's patch - decreasing the TX buffer influences the total
bandwidth that can be handled by the router/server.
Don't know why before this patch there was no difference no matter what
I set there - there was always page_alloc/slowpath on top in perf.


Currently testing RX4096/TX256 - this helps with bandwidth, like +10%
more bandwidth with fewer interrupts...


>
> With that said this change should be mostly harmless and does address
> the fact that we can have both regular order 0 pages and page frags
> used for skb->head.
>
> Acked-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>


^ permalink raw reply	[flat|nested] 34+ messages in thread


* Re: [PATCH 1/2] mm/page_alloc: free order-0 pages through PCP in page_frag_free()
  2018-11-10 23:54     ` Paweł Staszewski
  (?)
@ 2018-11-11 23:05     ` Alexander Duyck
  2018-11-12  0:39         ` Paweł Staszewski
  -1 siblings, 1 reply; 34+ messages in thread
From: Alexander Duyck @ 2018-11-11 23:05 UTC (permalink / raw)
  To: Paweł Staszewski
  Cc: aaron.lu, linux-mm, LKML, Netdev, Andrew Morton,
	Jesper Dangaard Brouer, Eric Dumazet, Tariq Toukan,
	ilias.apalodimas, yoel, Mel Gorman, Saeed Mahameed, Michal Hocko,
	Vlastimil Babka, dave.hansen

On Sat, Nov 10, 2018 at 3:54 PM Paweł Staszewski <pstaszewski@itcare.pl> wrote:
>
>
>
> On 05.11.2018 at 16:44, Alexander Duyck wrote:
> > On Mon, Nov 5, 2018 at 12:58 AM Aaron Lu <aaron.lu@intel.com> wrote:
> >> page_frag_free() calls __free_pages_ok() to free the page back to
> >> Buddy. This is OK for high order page, but for order-0 pages, it
> >> misses the optimization opportunity of using Per-Cpu-Pages and can
> >> cause zone lock contention when called frequently.
> >>
> >> Paweł Staszewski recently shared his result of 'how Linux kernel
> >> handles normal traffic'[1] and from perf data, Jesper Dangaard Brouer
> >> found the lock contention comes from page allocator:
> >>
> >>    mlx5e_poll_tx_cq
> >>    |
> >>     --16.34%--napi_consume_skb
> >>               |
> >>               |--12.65%--__free_pages_ok
> >>               |          |
> >>               |           --11.86%--free_one_page
> >>               |                     |
> >>               |                     |--10.10%--queued_spin_lock_slowpath
> >>               |                     |
> >>               |                      --0.65%--_raw_spin_lock
> >>               |
> >>               |--1.55%--page_frag_free
> >>               |
> >>                --1.44%--skb_release_data
> >>
> >> Jesper explained how it happened: mlx5 driver RX-page recycle
> >> mechanism is not effective in this workload and pages have to go
> >> through the page allocator. The lock contention happens during
> >> mlx5 DMA TX completion cycle. And the page allocator cannot keep
> >> up at these speeds.[2]
> >>
> >> I thought that __free_pages_ok() are mostly freeing high order
> >> pages and thought this is an lock contention for high order pages
> >> but Jesper explained in detail that __free_pages_ok() here are
> >> actually freeing order-0 pages because mlx5 is using order-0 pages
> >> to satisfy its page pool allocation request.[3]
> >>
> >> The free path as pointed out by Jesper is:
> >> skb_free_head()
> >>    -> skb_free_frag()
> >>      -> skb_free_frag()
> >>        -> page_frag_free()
> >> And the pages being freed on this path are order-0 pages.
> >>
> >> Fix this by doing similar things as in __page_frag_cache_drain() -
> >> send the being freed page to PCP if it's an order-0 page, or
> >> directly to Buddy if it is a high order page.
> >>
> >> With this change, Paweł hasn't noticed lock contention yet in
> >> his workload and Jesper has noticed a 7% performance improvement
> >> using a micro benchmark and lock contention is gone.
> >>
> >> [1]: https://www.spinics.net/lists/netdev/msg531362.html
> >> [2]: https://www.spinics.net/lists/netdev/msg531421.html
> >> [3]: https://www.spinics.net/lists/netdev/msg531556.html
> >> Reported-by: Paweł Staszewski <pstaszewski@itcare.pl>
> >> Analysed-by: Jesper Dangaard Brouer <brouer@redhat.com>
> >> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> >> ---
> >>   mm/page_alloc.c | 10 ++++++++--
> >>   1 file changed, 8 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index ae31839874b8..91a9a6af41a2 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -4555,8 +4555,14 @@ void page_frag_free(void *addr)
> >>   {
> >>          struct page *page = virt_to_head_page(addr);
> >>
> >> -       if (unlikely(put_page_testzero(page)))
> >> -               __free_pages_ok(page, compound_order(page));
> >> +       if (unlikely(put_page_testzero(page))) {
> >> +               unsigned int order = compound_order(page);
> >> +
> >> +               if (order == 0)
> >> +                       free_unref_page(page);
> >> +               else
> >> +                       __free_pages_ok(page, order);
> >> +       }
> >>   }
> >>   EXPORT_SYMBOL(page_frag_free);
> >>
> > One thing I would suggest for Pawel to try would be to reduce the Tx
> > qdisc size on his transmitting interfaces, Reduce the Tx ring size,
> > and possibly increase the Tx interrupt rate. Ideally we shouldn't have
> > too many packets in-flight and I suspect that is the issue that Pawel
> > is seeing that is leading to the page pool allocator freeing up the
> > memory. I know we like to try to batch things but the issue is
> > processing too many Tx buffers in one batch leads to us eating up too
> > much memory and causing evictions from the cache. Ideally the Rx and
> > Tx rings and queues should be sized as small as possible while still
> > allowing us to process up to our NAPI budget. Usually I run things
> > with a 128 Rx / 128 Tx setup and then reduce the Tx queue length so we
> > don't have more buffers stored there than we can place in the Tx ring.
> > Then we can avoid the extra thrash of having to pull/push memory into
> > and out of the freelists. Essentially the issue here ends up being
> > another form of buffer bloat.
> Thanks Aleksandar - yes it can be - but in my scenario setting RX buffer
> <4096 producing more interface rx drops - and no_rx_buffer on network
> controller that is receiving more packets
> So i need to stick with 3000-4000 on RX - and yes i was trying to lower
> the TX buff on connectx4 - but that changed nothing before Aaron patch
>
> After Aaron patch - decreasing TX buffer influencing total bandwidth
> that can be handled by the router/server
> Dono why before this patch there was no difference there no matter what
> i set there there was always page_alloc/slowpath on top in perf
>
>
> Currently testing RX4096/TX256 - this helps with bandwidth like +10%
> more bandwidth with less interrupts...

The problem is that if you are going for fewer interrupts you are setting
yourself up for buffer bloat. Basically you are going to use much more
cache and much more memory than you actually need, and if things are
properly configured NAPI should take care of the interrupts anyway,
since under maximum load you shouldn't normally stop polling.

One issue I have seen is people delaying interrupts for as long as
possible, which isn't really a good thing, since most network
controllers will use NAPI, which disables the interrupts and leaves
them disabled whenever the system is under heavy stress, so you should
be able to get the maximum performance by configuring an adapter with
small ring sizes and high interrupt rates.
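
The usual NAPI poll shape shows why (a sketch only - the mydrv_* names are
made up, the napi_* calls are the real API):

static int mydrv_napi_poll(struct napi_struct *napi, int budget)
{
	struct mydrv_ring *ring = container_of(napi, struct mydrv_ring, napi);
	int work_done = mydrv_clean_rx(ring, budget);

	if (work_done < budget) {
		/* traffic slowed down: re-arm the hardware interrupt */
		if (napi_complete_done(napi, work_done))
			mydrv_enable_irq(ring);
	}

	/* work_done == budget: stay in polling mode, IRQ stays masked */
	return work_done;
}

So under sustained load the poll loop keeps running and an aggressive
interrupt rate setting mostly costs nothing, because the interrupt stays
masked for as long as the queue is busy.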

It is easiest to think of it this way. Your total packet rate is equal
to your interrupt rate times the number of buffers you will store in
the ring. So if you have some fixed rate "X" for packets and an
interrupt rate of "i" then your optimal ring size should be "X/i". So
if you lower the interrupt rate you end up hurting the throughput
unless you increase the buffer size. However at a certain point the
buffer size starts becoming an issue. For example with UDP flows I
often see massive packet drops if you tune the interrupt rate too low
and then put the system under heavy stress.
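
As a worked example with made-up numbers: at X = 2,000,000 packets/s on a
queue and i = 20,000 interrupts/s, X/i says the ring only needs about 100
descriptors, which is right around the 128 Rx / 128 Tx setup above; cutting
the interrupt rate to 5,000/s would already call for a ~400 entry ring to
sustain the same rate.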

- Alex

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm/page_alloc: free order-0 pages through PCP in page_frag_free()
  2018-11-11 23:05     ` Alexander Duyck
@ 2018-11-12  0:39         ` Paweł Staszewski
  0 siblings, 0 replies; 34+ messages in thread
From: Paweł Staszewski @ 2018-11-12  0:39 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: aaron.lu, linux-mm, LKML, Netdev, Andrew Morton,
	Jesper Dangaard Brouer, Eric Dumazet, Tariq Toukan,
	ilias.apalodimas, yoel, Mel Gorman, Saeed Mahameed, Michal Hocko,
	Vlastimil Babka, dave.hansen


On 12.11.2018 at 00:05, Alexander Duyck wrote:
> On Sat, Nov 10, 2018 at 3:54 PM Paweł Staszewski <pstaszewski@itcare.pl> wrote:
>>
>>
>> On 05.11.2018 at 16:44, Alexander Duyck wrote:
>>> On Mon, Nov 5, 2018 at 12:58 AM Aaron Lu <aaron.lu@intel.com> wrote:
>>>> page_frag_free() calls __free_pages_ok() to free the page back to
>>>> Buddy. This is OK for high order page, but for order-0 pages, it
>>>> misses the optimization opportunity of using Per-Cpu-Pages and can
>>>> cause zone lock contention when called frequently.
>>>>
>>>> Paweł Staszewski recently shared his result of 'how Linux kernel
>>>> handles normal traffic'[1] and from perf data, Jesper Dangaard Brouer
>>>> found the lock contention comes from page allocator:
>>>>
>>>>     mlx5e_poll_tx_cq
>>>>     |
>>>>      --16.34%--napi_consume_skb
>>>>                |
>>>>                |--12.65%--__free_pages_ok
>>>>                |          |
>>>>                |           --11.86%--free_one_page
>>>>                |                     |
>>>>                |                     |--10.10%--queued_spin_lock_slowpath
>>>>                |                     |
>>>>                |                      --0.65%--_raw_spin_lock
>>>>                |
>>>>                |--1.55%--page_frag_free
>>>>                |
>>>>                 --1.44%--skb_release_data
>>>>
>>>> Jesper explained how it happened: mlx5 driver RX-page recycle
>>>> mechanism is not effective in this workload and pages have to go
>>>> through the page allocator. The lock contention happens during
>>>> mlx5 DMA TX completion cycle. And the page allocator cannot keep
>>>> up at these speeds.[2]
>>>>
>>>> I thought that __free_pages_ok() are mostly freeing high order
>>>> pages and thought this is an lock contention for high order pages
>>>> but Jesper explained in detail that __free_pages_ok() here are
>>>> actually freeing order-0 pages because mlx5 is using order-0 pages
>>>> to satisfy its page pool allocation request.[3]
>>>>
>>>> The free path as pointed out by Jesper is:
>>>> skb_free_head()
>>>>     -> skb_free_frag()
>>>>       -> skb_free_frag()
>>>>         -> page_frag_free()
>>>> And the pages being freed on this path are order-0 pages.
>>>>
>>>> Fix this by doing similar things as in __page_frag_cache_drain() -
>>>> send the being freed page to PCP if it's an order-0 page, or
>>>> directly to Buddy if it is a high order page.
>>>>
>>>> With this change, Paweł hasn't noticed lock contention yet in
>>>> his workload and Jesper has noticed a 7% performance improvement
>>>> using a micro benchmark and lock contention is gone.
>>>>
>>>> [1]: https://www.spinics.net/lists/netdev/msg531362.html
>>>> [2]: https://www.spinics.net/lists/netdev/msg531421.html
>>>> [3]: https://www.spinics.net/lists/netdev/msg531556.html
>>>> Reported-by: Paweł Staszewski <pstaszewski@itcare.pl>
>>>> Analysed-by: Jesper Dangaard Brouer <brouer@redhat.com>
>>>> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
>>>> ---
>>>>    mm/page_alloc.c | 10 ++++++++--
>>>>    1 file changed, 8 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index ae31839874b8..91a9a6af41a2 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -4555,8 +4555,14 @@ void page_frag_free(void *addr)
>>>>    {
>>>>           struct page *page = virt_to_head_page(addr);
>>>>
>>>> -       if (unlikely(put_page_testzero(page)))
>>>> -               __free_pages_ok(page, compound_order(page));
>>>> +       if (unlikely(put_page_testzero(page))) {
>>>> +               unsigned int order = compound_order(page);
>>>> +
>>>> +               if (order == 0)
>>>> +                       free_unref_page(page);
>>>> +               else
>>>> +                       __free_pages_ok(page, order);
>>>> +       }
>>>>    }
>>>>    EXPORT_SYMBOL(page_frag_free);
>>>>
>>> One thing I would suggest for Pawel to try would be to reduce the Tx
>>> qdisc size on his transmitting interfaces, Reduce the Tx ring size,
>>> and possibly increase the Tx interrupt rate. Ideally we shouldn't have
>>> too many packets in-flight and I suspect that is the issue that Pawel
>>> is seeing that is leading to the page pool allocator freeing up the
>>> memory. I know we like to try to batch things but the issue is
>>> processing too many Tx buffers in one batch leads to us eating up too
>>> much memory and causing evictions from the cache. Ideally the Rx and
>>> Tx rings and queues should be sized as small as possible while still
>>> allowing us to process up to our NAPI budget. Usually I run things
>>> with a 128 Rx / 128 Tx setup and then reduce the Tx queue length so we
>>> don't have more buffers stored there than we can place in the Tx ring.
>>> Then we can avoid the extra thrash of having to pull/push memory into
>>> and out of the freelists. Essentially the issue here ends up being
>>> another form of buffer bloat.
>> Thanks Aleksandar - yes it can be - but in my scenario setting RX buffer
>> <4096 producing more interface rx drops - and no_rx_buffer on network
>> controller that is receiving more packets
>> So i need to stick with 3000-4000 on RX - and yes i was trying to lower
>> the TX buff on connectx4 - but that changed nothing before Aaron patch
>>
>> After Aaron patch - decreasing TX buffer influencing total bandwidth
>> that can be handled by the router/server
>> Dono why before this patch there was no difference there no matter what
>> i set there there was always page_alloc/slowpath on top in perf
>>
>>
>> Currently testing RX4096/TX256 - this helps with bandwidth like +10%
>> more bandwidth with less interrupts...
> The problem is if you are going for less interrupts you are setting
> yourself up for buffer bloat. Basically you are going to use much more
> cache and much more memory then you actually need and if things are
> properly configured NAPI should take care of the interrupts anyway
> since under maximum load you shouldn't stop polling normally.

I'm trying to balance here - there is a problem because the server is
forwarding all kinds of protocols, packets of different sizes etc.

The problem is I'm trying to go for a high interrupt rate - but

setting coalescence to adaptive for rx is killing cpus at 22Gbit/s RX
and 22Gbit/s TX, with a really high interrupt rate.

So, accepting a little more latency, I can turn off adaptive rx and set
rx-usecs in the range 16-64 - and this gives me more or fewer interrupts -
but the problem is - it is always the same maximum bandwidth.


>
> One issue I have seen is people delay interrupts for as long as
> possible which isn't really a good thing since most network
> controllers will use NAPI which will disable the interrupts and leave
> them disabled whenever the system is under heavy stress so you should
> be able to get the maximum performance by configuring an adapter with
> small ring sizes and for high interrupt rates.

Sure, it is bad to set rx-usecs to high values - because at some point
this will add high latency for packets traversing both sides - and start
to hurt the buffers.

But my problem is a little different: right now I have no problems on the
RX side - because I can set up anything like:

coalescence from 16 to 64

rx ring from 3000 to max 8192

And it does not change my max bw - it only produces fewer or more interrupts.

So I started to change params on the TX side - and for now I know that the
best for me is

coalescence adaptive on

TX buffer 128

This helps with max BW, which for now is close to 70Gbit/s RX and 70Gbit/s
TX, but after this change I have increasing DROPS on the TX side for vlan
interfaces.

And only 50% cpu (the max was 50% for 70Gbit/s)


> It is easiest to think of it this way. Your total packet rate is equal
> to your interrupt rate times the number of buffers you will store in
> the ring. So if you have some fixed rate "X" for packets and an
> interrupt rate of "i" then your optimal ring size should be "X/i". So
> if you lower the interrupt rate you end up hurting the throughput
> unless you increase the buffer size. However at a certain point the
> buffer size starts becoming an issue. For example with UDP flows I
> often see massive packet drops if you tune the interrupt rate too low
> and then put the system under heavy stress.

Yes - in real-life traffic - most DDoSes are like this: many pps
with small frames.



> - Alex
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm/page_alloc: free order-0 pages through PCP in page_frag_free()
@ 2018-11-12  0:39         ` Paweł Staszewski
  0 siblings, 0 replies; 34+ messages in thread
From: Paweł Staszewski @ 2018-11-12  0:39 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: aaron.lu, linux-mm, LKML, Netdev, Andrew Morton,
	Jesper Dangaard Brouer, Eric Dumazet, Tariq Toukan,
	ilias.apalodimas, yoel, Mel Gorman, Saeed Mahameed, Michal Hocko,
	Vlastimil Babka, dave.hansen


W dniu 12.11.2018 oA 00:05, Alexander Duyck pisze:
> On Sat, Nov 10, 2018 at 3:54 PM PaweA? Staszewski <pstaszewski@itcare.pl> wrote:
>>
>>
>> W dniu 05.11.2018 o 16:44, Alexander Duyck pisze:
>>> On Mon, Nov 5, 2018 at 12:58 AM Aaron Lu <aaron.lu@intel.com> wrote:
>>>> page_frag_free() calls __free_pages_ok() to free the page back to
>>>> Buddy. This is OK for high order page, but for order-0 pages, it
>>>> misses the optimization opportunity of using Per-Cpu-Pages and can
>>>> cause zone lock contention when called frequently.
>>>>
>>>> PaweA? Staszewski recently shared his result of 'how Linux kernel
>>>> handles normal traffic'[1] and from perf data, Jesper Dangaard Brouer
>>>> found the lock contention comes from page allocator:
>>>>
>>>>     mlx5e_poll_tx_cq
>>>>     |
>>>>      --16.34%--napi_consume_skb
>>>>                |
>>>>                |--12.65%--__free_pages_ok
>>>>                |          |
>>>>                |           --11.86%--free_one_page
>>>>                |                     |
>>>>                |                     |--10.10%--queued_spin_lock_slowpath
>>>>                |                     |
>>>>                |                      --0.65%--_raw_spin_lock
>>>>                |
>>>>                |--1.55%--page_frag_free
>>>>                |
>>>>                 --1.44%--skb_release_data
>>>>
>>>> Jesper explained how it happened: mlx5 driver RX-page recycle
>>>> mechanism is not effective in this workload and pages have to go
>>>> through the page allocator. The lock contention happens during
>>>> mlx5 DMA TX completion cycle. And the page allocator cannot keep
>>>> up at these speeds.[2]
>>>>
>>>> I thought that __free_pages_ok() are mostly freeing high order
>>>> pages and thought this is an lock contention for high order pages
>>>> but Jesper explained in detail that __free_pages_ok() here are
>>>> actually freeing order-0 pages because mlx5 is using order-0 pages
>>>> to satisfy its page pool allocation request.[3]
>>>>
>>>> The free path as pointed out by Jesper is:
>>>> skb_free_head()
>>>>     -> skb_free_frag()
>>>>       -> skb_free_frag()
>>>>         -> page_frag_free()
>>>> And the pages being freed on this path are order-0 pages.
>>>>
>>>> Fix this by doing similar things as in __page_frag_cache_drain() -
>>>> send the being freed page to PCP if it's an order-0 page, or
>>>> directly to Buddy if it is a high order page.
>>>>
>>>> With this change, PaweA? hasn't noticed lock contention yet in
>>>> his workload and Jesper has noticed a 7% performance improvement
>>>> using a micro benchmark and lock contention is gone.
>>>>
>>>> [1]: https://www.spinics.net/lists/netdev/msg531362.html
>>>> [2]: https://www.spinics.net/lists/netdev/msg531421.html
>>>> [3]: https://www.spinics.net/lists/netdev/msg531556.html
>>>> Reported-by: PaweA? Staszewski <pstaszewski@itcare.pl>
>>>> Analysed-by: Jesper Dangaard Brouer <brouer@redhat.com>
>>>> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
>>>> ---
>>>>    mm/page_alloc.c | 10 ++++++++--
>>>>    1 file changed, 8 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index ae31839874b8..91a9a6af41a2 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -4555,8 +4555,14 @@ void page_frag_free(void *addr)
>>>>    {
>>>>           struct page *page = virt_to_head_page(addr);
>>>>
>>>> -       if (unlikely(put_page_testzero(page)))
>>>> -               __free_pages_ok(page, compound_order(page));
>>>> +       if (unlikely(put_page_testzero(page))) {
>>>> +               unsigned int order = compound_order(page);
>>>> +
>>>> +               if (order == 0)
>>>> +                       free_unref_page(page);
>>>> +               else
>>>> +                       __free_pages_ok(page, order);
>>>> +       }
>>>>    }
>>>>    EXPORT_SYMBOL(page_frag_free);
>>>>
>>> One thing I would suggest for Pawel to try would be to reduce the Tx
>>> qdisc size on his transmitting interfaces, Reduce the Tx ring size,
>>> and possibly increase the Tx interrupt rate. Ideally we shouldn't have
>>> too many packets in-flight and I suspect that is the issue that Pawel
>>> is seeing that is leading to the page pool allocator freeing up the
>>> memory. I know we like to try to batch things but the issue is
>>> processing too many Tx buffers in one batch leads to us eating up too
>>> much memory and causing evictions from the cache. Ideally the Rx and
>>> Tx rings and queues should be sized as small as possible while still
>>> allowing us to process up to our NAPI budget. Usually I run things
>>> with a 128 Rx / 128 Tx setup and then reduce the Tx queue length so we
>>> don't have more buffers stored there than we can place in the Tx ring.
>>> Then we can avoid the extra thrash of having to pull/push memory into
>>> and out of the freelists. Essentially the issue here ends up being
>>> another form of buffer bloat.
>> Thanks Aleksandar - yes it can be - but in my scenario setting RX buffer
>> <4096 producing more interface rx drops - and no_rx_buffer on network
>> controller that is receiving more packets
>> So i need to stick with 3000-4000 on RX - and yes i was trying to lower
>> the TX buff on connectx4 - but that changed nothing before Aaron patch
>>
>> After Aaron patch - decreasing TX buffer influencing total bandwidth
>> that can be handled by the router/server
>> Dono why before this patch there was no difference there no matter what
>> i set there there was always page_alloc/slowpath on top in perf
>>
>>
>> Currently testing RX4096/TX256 - this helps with bandwidth like +10%
>> more bandwidth with less interrupts...
> The problem is if you are going for less interrupts you are setting
> yourself up for buffer bloat. Basically you are going to use much more
> cache and much more memory then you actually need and if things are
> properly configured NAPI should take care of the interrupts anyway
> since under maximum load you shouldn't stop polling normally.

Im trying to balance here - there is problem cause server is forwarding 
all kingd of protocols packets/different size etc

The problem is im trying to go in high interrupt rate - but

Setting coalescence to adaptative for rx killing cpu's at 22Gbit/s RX 
and 22Gbit with rly high interrupt rate

So adding a little more latency i can turn off adaptative rx and setup 
rx-usecs from range 16-64 - and this gives me more or less interrupts - 
but the problem is - always same bandwidth as maximum


>
> One issue I have seen is people delay interrupts for as long as
> possible which isn't really a good thing since most network
> controllers will use NAPI which will disable the interrupts and leave
> them disabled whenever the system is under heavy stress so you should
> be able to get the maximum performance by configuring an adapter with
> small ring sizes and for high interrupt rates.

Sure this is bad to setup rx-usec for high values - cause at some point 
this will add high latency for packet traversing both sides - and start 
to hurt buffers

But my problem is a little different now i have no problems with RX side 
- cause i can setup anything like:

coalescence from 16 to 64

rx ring from 3000 to max 8192

And it does not change my max bw - only produces less or more interrupts.

So I start to change params for TX side - and for now i know that the 
best for me is

coalescence adaptative on

TX buffer 128

This helps with max BW that for now is close to 70Gbit/s RX and 70Gbit 
TX but after this change i have increasing DROPS on TX side for vlan 
interfaces.

And only 50% cpu (max was 50% for 70Gbit/s)


> It is easiest to think of it this way. Your total packet rate is equal
> to your interrupt rate times the number of buffers you will store in
> the ring. So if you have some fixed rate "X" for packets and an
> interrupt rate of "i" then your optimal ring size should be "X/i". So
> if you lower the interrupt rate you end up hurting the throughput
> unless you increase the buffer size. However at a certain point the
> buffer size starts becoming an issue. For example with UDP flows I
> often see massive packet drops if you tune the interrupt rate too low
> and then put the system under heavy stress.

Yes - in real-life traffic, most ddos'es are like this: many pps 
with small frames.
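
(A quick worked example of the ring-size relationship quoted above, with 
made-up numbers purely for illustration: at X = 5 Mpps per queue and an 
interrupt rate of i = 20k/s per queue, X/i = 5,000,000 / 20,000 = 250 
descriptors consumed per interrupt, so a ring much larger than ~256-512 
mostly adds latency and cache pressure rather than throughput.)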



> - Alex
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm/page_alloc: free order-0 pages through PCP in page_frag_free()
  2018-11-12  0:39         ` Paweł Staszewski
  (?)
@ 2018-11-12 15:30         ` Alexander Duyck
  2018-11-12 15:44           ` Eric Dumazet
  2018-11-12 17:01             ` Paweł Staszewski
  -1 siblings, 2 replies; 34+ messages in thread
From: Alexander Duyck @ 2018-11-12 15:30 UTC (permalink / raw)
  To: Paweł Staszewski
  Cc: aaron.lu, linux-mm, LKML, Netdev, Andrew Morton,
	Jesper Dangaard Brouer, Eric Dumazet, Tariq Toukan,
	ilias.apalodimas, yoel, Mel Gorman, Saeed Mahameed, Michal Hocko,
	Vlastimil Babka, dave.hansen

On Sun, Nov 11, 2018 at 4:39 PM Paweł Staszewski <pstaszewski@itcare.pl> wrote:
>
>
> W dniu 12.11.2018 o 00:05, Alexander Duyck pisze:
> > On Sat, Nov 10, 2018 at 3:54 PM Paweł Staszewski <pstaszewski@itcare.pl> wrote:
> >>
> >>
> >> W dniu 05.11.2018 o 16:44, Alexander Duyck pisze:
> >>> On Mon, Nov 5, 2018 at 12:58 AM Aaron Lu <aaron.lu@intel.com> wrote:
> >>>> page_frag_free() calls __free_pages_ok() to free the page back to
> >>>> Buddy. This is OK for high order page, but for order-0 pages, it
> >>>> misses the optimization opportunity of using Per-Cpu-Pages and can
> >>>> cause zone lock contention when called frequently.
> >>>>
> >>>> Paweł Staszewski recently shared his result of 'how Linux kernel
> >>>> handles normal traffic'[1] and from perf data, Jesper Dangaard Brouer
> >>>> found the lock contention comes from page allocator:
> >>>>
> >>>>     mlx5e_poll_tx_cq
> >>>>     |
> >>>>      --16.34%--napi_consume_skb
> >>>>                |
> >>>>                |--12.65%--__free_pages_ok
> >>>>                |          |
> >>>>                |           --11.86%--free_one_page
> >>>>                |                     |
> >>>>                |                     |--10.10%--queued_spin_lock_slowpath
> >>>>                |                     |
> >>>>                |                      --0.65%--_raw_spin_lock
> >>>>                |
> >>>>                |--1.55%--page_frag_free
> >>>>                |
> >>>>                 --1.44%--skb_release_data
> >>>>
> >>>> Jesper explained how it happened: mlx5 driver RX-page recycle
> >>>> mechanism is not effective in this workload and pages have to go
> >>>> through the page allocator. The lock contention happens during
> >>>> mlx5 DMA TX completion cycle. And the page allocator cannot keep
> >>>> up at these speeds.[2]
> >>>>
> >>>> I thought that __free_pages_ok() are mostly freeing high order
> >>>> pages and thought this is an lock contention for high order pages
> >>>> but Jesper explained in detail that __free_pages_ok() here are
> >>>> actually freeing order-0 pages because mlx5 is using order-0 pages
> >>>> to satisfy its page pool allocation request.[3]
> >>>>
> >>>> The free path as pointed out by Jesper is:
> >>>> skb_free_head()
> >>>>     -> skb_free_frag()
> >>>>       -> skb_free_frag()
> >>>>         -> page_frag_free()
> >>>> And the pages being freed on this path are order-0 pages.
> >>>>
> >>>> Fix this by doing similar things as in __page_frag_cache_drain() -
> >>>> send the being freed page to PCP if it's an order-0 page, or
> >>>> directly to Buddy if it is a high order page.
> >>>>
> >>>> With this change, Paweł hasn't noticed lock contention yet in
> >>>> his workload and Jesper has noticed a 7% performance improvement
> >>>> using a micro benchmark and lock contention is gone.
> >>>>
> >>>> [1]: https://www.spinics.net/lists/netdev/msg531362.html
> >>>> [2]: https://www.spinics.net/lists/netdev/msg531421.html
> >>>> [3]: https://www.spinics.net/lists/netdev/msg531556.html
> >>>> Reported-by: Paweł Staszewski <pstaszewski@itcare.pl>
> >>>> Analysed-by: Jesper Dangaard Brouer <brouer@redhat.com>
> >>>> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> >>>> ---
> >>>>    mm/page_alloc.c | 10 ++++++++--
> >>>>    1 file changed, 8 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>>> index ae31839874b8..91a9a6af41a2 100644
> >>>> --- a/mm/page_alloc.c
> >>>> +++ b/mm/page_alloc.c
> >>>> @@ -4555,8 +4555,14 @@ void page_frag_free(void *addr)
> >>>>    {
> >>>>           struct page *page = virt_to_head_page(addr);
> >>>>
> >>>> -       if (unlikely(put_page_testzero(page)))
> >>>> -               __free_pages_ok(page, compound_order(page));
> >>>> +       if (unlikely(put_page_testzero(page))) {
> >>>> +               unsigned int order = compound_order(page);
> >>>> +
> >>>> +               if (order == 0)
> >>>> +                       free_unref_page(page);
> >>>> +               else
> >>>> +                       __free_pages_ok(page, order);
> >>>> +       }
> >>>>    }
> >>>>    EXPORT_SYMBOL(page_frag_free);
> >>>>
> >>> One thing I would suggest for Pawel to try would be to reduce the Tx
> >>> qdisc size on his transmitting interfaces, Reduce the Tx ring size,
> >>> and possibly increase the Tx interrupt rate. Ideally we shouldn't have
> >>> too many packets in-flight and I suspect that is the issue that Pawel
> >>> is seeing that is leading to the page pool allocator freeing up the
> >>> memory. I know we like to try to batch things but the issue is
> >>> processing too many Tx buffers in one batch leads to us eating up too
> >>> much memory and causing evictions from the cache. Ideally the Rx and
> >>> Tx rings and queues should be sized as small as possible while still
> >>> allowing us to process up to our NAPI budget. Usually I run things
> >>> with a 128 Rx / 128 Tx setup and then reduce the Tx queue length so we
> >>> don't have more buffers stored there than we can place in the Tx ring.
> >>> Then we can avoid the extra thrash of having to pull/push memory into
> >>> and out of the freelists. Essentially the issue here ends up being
> >>> another form of buffer bloat.
> >> Thanks Aleksandar - yes it can be - but in my scenario setting RX buffer
> >> <4096 producing more interface rx drops - and no_rx_buffer on network
> >> controller that is receiving more packets
> >> So i need to stick with 3000-4000 on RX - and yes i was trying to lower
> >> the TX buff on connectx4 - but that changed nothing before Aaron patch
> >>
> >> After Aaron patch - decreasing TX buffer influencing total bandwidth
> >> that can be handled by the router/server
> >> Dono why before this patch there was no difference there no matter what
> >> i set there there was always page_alloc/slowpath on top in perf
> >>
> >>
> >> Currently testing RX4096/TX256 - this helps with bandwidth like +10%
> >> more bandwidth with less interrupts...
> > The problem is if you are going for less interrupts you are setting
> > yourself up for buffer bloat. Basically you are going to use much more
> > cache and much more memory then you actually need and if things are
> > properly configured NAPI should take care of the interrupts anyway
> > since under maximum load you shouldn't stop polling normally.
>
> Im trying to balance here - there is problem cause server is forwarding
> all kingd of protocols packets/different size etc
>
> The problem is im trying to go in high interrupt rate - but
>
> Setting coalescence to adaptative for rx killing cpu's at 22Gbit/s RX
> and 22Gbit with rly high interrupt rate

I wouldn't recommend adaptive just because the behavior would be hard
to predict.

> So adding a little more latency i can turn off adaptative rx and setup
> rx-usecs from range 16-64 - and this gives me more or less interrupts -
> but the problem is - always same bandwidth as maximum

What about the tx-usecs, is that a functional thing for the adapter
you are using?

The Rx side logic should be pretty easy to figure out. Essentially you
want to keep the Rx ring size as small as possible while at the same
time avoiding storming the system with interrupts. I know for 10Gb/s I
have used a value of 25us in the past. What you want to watch for is
if you are dropping packets on the Rx side or not. Ideally you want
enough buffers that you can capture any burst while you wait for the
interrupt routine to catch up.
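
(As a concrete sketch of the above - the interface name and the numbers are 
placeholders, not a recommendation for this particular box:

  ethtool -C <iface> adaptive-rx off rx-usecs 25
  ethtool -G <iface> rx 512
  ethtool -S <iface> | grep -i -e drop -e discard -e buff

The first two pin the coalescing delay and shrink the Rx ring; the last is 
one way to watch whether the smaller ring starts dropping under bursts.)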

> >
> > One issue I have seen is people delay interrupts for as long as
> > possible which isn't really a good thing since most network
> > controllers will use NAPI which will disable the interrupts and leave
> > them disabled whenever the system is under heavy stress so you should
> > be able to get the maximum performance by configuring an adapter with
> > small ring sizes and for high interrupt rates.
>
> Sure this is bad to setup rx-usec for high values - cause at some point
> this will add high latency for packet traversing both sides - and start
> to hurt buffers
>
> But my problem is a little different now i have no problems with RX side
> - cause i can setup anything like:
>
> coalescence from 16 to 64
>
> rx ring from 3000 to max 8192
>
> And it does not change my max bw - only produces less or more interrupts.

Right, so the issue itself isn't Rx, you aren't throttled there. We are
probably looking at an issue of PCIe bandwidth or Tx slowing things
down. The fact that you are still firing interrupts is a bit
surprising though. Are the Tx and Rx interrupts linked for the device
you are using or are they firing separately? Normally Rx traffic
won't generate many interrupts under a stress test as NAPI will leave
the interrupts disabled unless it can keep up. Anyway, my suggestion
would be to look at tuning things for as small a ring size as
possible.
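
(One way to check how the vectors are wired up - treat this as a sketch, the 
exact IRQ naming depends on the driver:

  ethtool -l <iface>            # "Combined" channels mean Tx and Rx share a vector
  grep mlx5 /proc/interrupts    # per-vector interrupt counts

If the channels are combined, Tx completion work and Rx polling run in the 
same NAPI context, which changes how the coalescing settings interact.)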

> So I start to change params for TX side - and for now i know that the
> best for me is
>
> coalescence adaptative on
>
> TX buffer 128
>
> This helps with max BW that for now is close to 70Gbit/s RX and 70Gbit
> TX but after this change i have increasing DROPS on TX side for vlan
> interfaces.

So this sounds like you are likely bottlenecked due to either PCIe
bandwidth or latency. When you start putting back-pressure on the Tx
like you have described it starts pushing packets onto the Qdisc
layer. One thing that happens when packets are on the qdisc layer is
that they can start to perform a bulk dequeue. The side effect of this
is that you write multiple packets to the descriptor ring and then
update the hardware doorbell only once for the entire group of packets
instead of once per packet.
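
(If it helps to see that back-pressure from userspace: the qdisc backlog, 
drops and requeues are visible with "tc -s qdisc show dev <iface>" - device 
name is a placeholder - and they grow when the Tx ring pushes back.)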

> And only 50% cpu (max was 50% for 70Gbit/s)
>
>
> > It is easiest to think of it this way. Your total packet rate is equal
> > to your interrupt rate times the number of buffers you will store in
> > the ring. So if you have some fixed rate "X" for packets and an
> > interrupt rate of "i" then your optimal ring size should be "X/i". So
> > if you lower the interrupt rate you end up hurting the throughput
> > unless you increase the buffer size. However at a certain point the
> > buffer size starts becoming an issue. For example with UDP flows I
> > often see massive packet drops if you tune the interrupt rate too low
> > and then put the system under heavy stress.
>
> Yes - in normal life traffic - most of ddos'es are like this many pps
> with small frames.

It sounds to me like XDP would probably be your best bet. With that
you could probably get away with smaller ring sizes, higher interrupt
rates, and get the advantage of it batching the Tx without having to
drop packets.
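
For reference, the skeleton of an XDP program looks roughly like the sketch
below - this is not the samples/bpf/xdp_fwd program, just a minimal
illustration of the hook point; a real forwarder would do a bpf_fib_lookup()
and a bpf_redirect_map() instead of returning XDP_PASS:

  /* minimal sketch, built against libbpf headers and loaded with
   * "ip link set dev <iface> xdp obj prog.o" (interface name is a placeholder) */
  #include <linux/bpf.h>
  #include <linux/if_ether.h>
  #include <bpf/bpf_helpers.h>

  SEC("xdp")
  int xdp_min_prog(struct xdp_md *ctx)
  {
          void *data     = (void *)(long)ctx->data;
          void *data_end = (void *)(long)ctx->data_end;

          /* bounds check the verifier requires before touching the frame */
          if (data + sizeof(struct ethhdr) > data_end)
                  return XDP_DROP;

          /* hand everything to the regular stack */
          return XDP_PASS;
  }

  char _license[] SEC("license") = "GPL";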

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm/page_alloc: free order-0 pages through PCP in page_frag_free()
  2018-11-12 15:30         ` Alexander Duyck
@ 2018-11-12 15:44           ` Eric Dumazet
  2018-11-12 17:06               ` Paweł Staszewski
  2018-11-12 17:01             ` Paweł Staszewski
  1 sibling, 1 reply; 34+ messages in thread
From: Eric Dumazet @ 2018-11-12 15:44 UTC (permalink / raw)
  To: Alexander Duyck, Paweł Staszewski
  Cc: aaron.lu, linux-mm, LKML, Netdev, Andrew Morton,
	Jesper Dangaard Brouer, Eric Dumazet, Tariq Toukan,
	ilias.apalodimas, yoel, Mel Gorman, Saeed Mahameed, Michal Hocko,
	Vlastimil Babka, dave.hansen



On 11/12/2018 07:30 AM, Alexander Duyck wrote:

> It sounds to me like XDP would probably be your best bet. With that
> you could probably get away with smaller ring sizes, higher interrupt
> rates, and get the advantage of it batching the Tx without having to
> drop packets.

Add to this that with XDP (or anything lowering per packet processing costs)
you can reduce number of cpus/queues, get better latencies, and bigger TX batches.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm/page_alloc: free order-0 pages through PCP in page_frag_free()
  2018-11-12 15:30         ` Alexander Duyck
@ 2018-11-12 17:01             ` Paweł Staszewski
  2018-11-12 17:01             ` Paweł Staszewski
  1 sibling, 0 replies; 34+ messages in thread
From: Paweł Staszewski @ 2018-11-12 17:01 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: aaron.lu, linux-mm, LKML, Netdev, Andrew Morton,
	Jesper Dangaard Brouer, Eric Dumazet, Tariq Toukan,
	ilias.apalodimas, yoel, Mel Gorman, Saeed Mahameed, Michal Hocko,
	Vlastimil Babka, dave.hansen


On 12.11.2018 at 16:30, Alexander Duyck wrote:
> On Sun, Nov 11, 2018 at 4:39 PM Paweł Staszewski <pstaszewski@itcare.pl> wrote:
>>
>> W dniu 12.11.2018 o 00:05, Alexander Duyck pisze:
>>> On Sat, Nov 10, 2018 at 3:54 PM Paweł Staszewski <pstaszewski@itcare.pl> wrote:
>>>>
>>>> W dniu 05.11.2018 o 16:44, Alexander Duyck pisze:
>>>>> On Mon, Nov 5, 2018 at 12:58 AM Aaron Lu <aaron.lu@intel.com> wrote:
>>>>>> page_frag_free() calls __free_pages_ok() to free the page back to
>>>>>> Buddy. This is OK for high order page, but for order-0 pages, it
>>>>>> misses the optimization opportunity of using Per-Cpu-Pages and can
>>>>>> cause zone lock contention when called frequently.
>>>>>>
>>>>>> Paweł Staszewski recently shared his result of 'how Linux kernel
>>>>>> handles normal traffic'[1] and from perf data, Jesper Dangaard Brouer
>>>>>> found the lock contention comes from page allocator:
>>>>>>
>>>>>>      mlx5e_poll_tx_cq
>>>>>>      |
>>>>>>       --16.34%--napi_consume_skb
>>>>>>                 |
>>>>>>                 |--12.65%--__free_pages_ok
>>>>>>                 |          |
>>>>>>                 |           --11.86%--free_one_page
>>>>>>                 |                     |
>>>>>>                 |                     |--10.10%--queued_spin_lock_slowpath
>>>>>>                 |                     |
>>>>>>                 |                      --0.65%--_raw_spin_lock
>>>>>>                 |
>>>>>>                 |--1.55%--page_frag_free
>>>>>>                 |
>>>>>>                  --1.44%--skb_release_data
>>>>>>
>>>>>> Jesper explained how it happened: mlx5 driver RX-page recycle
>>>>>> mechanism is not effective in this workload and pages have to go
>>>>>> through the page allocator. The lock contention happens during
>>>>>> mlx5 DMA TX completion cycle. And the page allocator cannot keep
>>>>>> up at these speeds.[2]
>>>>>>
>>>>>> I thought that __free_pages_ok() are mostly freeing high order
>>>>>> pages and thought this is an lock contention for high order pages
>>>>>> but Jesper explained in detail that __free_pages_ok() here are
>>>>>> actually freeing order-0 pages because mlx5 is using order-0 pages
>>>>>> to satisfy its page pool allocation request.[3]
>>>>>>
>>>>>> The free path as pointed out by Jesper is:
>>>>>> skb_free_head()
>>>>>>      -> skb_free_frag()
>>>>>>        -> skb_free_frag()
>>>>>>          -> page_frag_free()
>>>>>> And the pages being freed on this path are order-0 pages.
>>>>>>
>>>>>> Fix this by doing similar things as in __page_frag_cache_drain() -
>>>>>> send the being freed page to PCP if it's an order-0 page, or
>>>>>> directly to Buddy if it is a high order page.
>>>>>>
>>>>>> With this change, Paweł hasn't noticed lock contention yet in
>>>>>> his workload and Jesper has noticed a 7% performance improvement
>>>>>> using a micro benchmark and lock contention is gone.
>>>>>>
>>>>>> [1]: https://www.spinics.net/lists/netdev/msg531362.html
>>>>>> [2]: https://www.spinics.net/lists/netdev/msg531421.html
>>>>>> [3]: https://www.spinics.net/lists/netdev/msg531556.html
>>>>>> Reported-by: Paweł Staszewski <pstaszewski@itcare.pl>
>>>>>> Analysed-by: Jesper Dangaard Brouer <brouer@redhat.com>
>>>>>> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
>>>>>> ---
>>>>>>     mm/page_alloc.c | 10 ++++++++--
>>>>>>     1 file changed, 8 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>> index ae31839874b8..91a9a6af41a2 100644
>>>>>> --- a/mm/page_alloc.c
>>>>>> +++ b/mm/page_alloc.c
>>>>>> @@ -4555,8 +4555,14 @@ void page_frag_free(void *addr)
>>>>>>     {
>>>>>>            struct page *page = virt_to_head_page(addr);
>>>>>>
>>>>>> -       if (unlikely(put_page_testzero(page)))
>>>>>> -               __free_pages_ok(page, compound_order(page));
>>>>>> +       if (unlikely(put_page_testzero(page))) {
>>>>>> +               unsigned int order = compound_order(page);
>>>>>> +
>>>>>> +               if (order == 0)
>>>>>> +                       free_unref_page(page);
>>>>>> +               else
>>>>>> +                       __free_pages_ok(page, order);
>>>>>> +       }
>>>>>>     }
>>>>>>     EXPORT_SYMBOL(page_frag_free);
>>>>>>
>>>>> One thing I would suggest for Pawel to try would be to reduce the Tx
>>>>> qdisc size on his transmitting interfaces, Reduce the Tx ring size,
>>>>> and possibly increase the Tx interrupt rate. Ideally we shouldn't have
>>>>> too many packets in-flight and I suspect that is the issue that Pawel
>>>>> is seeing that is leading to the page pool allocator freeing up the
>>>>> memory. I know we like to try to batch things but the issue is
>>>>> processing too many Tx buffers in one batch leads to us eating up too
>>>>> much memory and causing evictions from the cache. Ideally the Rx and
>>>>> Tx rings and queues should be sized as small as possible while still
>>>>> allowing us to process up to our NAPI budget. Usually I run things
>>>>> with a 128 Rx / 128 Tx setup and then reduce the Tx queue length so we
>>>>> don't have more buffers stored there than we can place in the Tx ring.
>>>>> Then we can avoid the extra thrash of having to pull/push memory into
>>>>> and out of the freelists. Essentially the issue here ends up being
>>>>> another form of buffer bloat.
>>>> Thanks Aleksandar - yes it can be - but in my scenario setting RX buffer
>>>> <4096 producing more interface rx drops - and no_rx_buffer on network
>>>> controller that is receiving more packets
>>>> So i need to stick with 3000-4000 on RX - and yes i was trying to lower
>>>> the TX buff on connectx4 - but that changed nothing before Aaron patch
>>>>
>>>> After Aaron patch - decreasing TX buffer influencing total bandwidth
>>>> that can be handled by the router/server
>>>> Dono why before this patch there was no difference there no matter what
>>>> i set there there was always page_alloc/slowpath on top in perf
>>>>
>>>>
>>>> Currently testing RX4096/TX256 - this helps with bandwidth like +10%
>>>> more bandwidth with less interrupts...
>>> The problem is if you are going for less interrupts you are setting
>>> yourself up for buffer bloat. Basically you are going to use much more
>>> cache and much more memory then you actually need and if things are
>>> properly configured NAPI should take care of the interrupts anyway
>>> since under maximum load you shouldn't stop polling normally.
>> Im trying to balance here - there is problem cause server is forwarding
>> all kingd of protocols packets/different size etc
>>
>> The problem is im trying to go in high interrupt rate - but
>>
>> Setting coalescence to adaptative for rx killing cpu's at 22Gbit/s RX
>> and 22Gbit with rly high interrupt rate
> I wouldn't recommend adaptive just because the behavior would be hard
> to predict.
>
>> So adding a little more latency i can turn off adaptative rx and setup
>> rx-usecs from range 16-64 - and this gives me more or less interrupts -
>> but the problem is - always same bandwidth as maximum
> What about the tx-usecs, is that a functional thing for the adapter
> you are using?

Yes, tx-usecs is not used now because of adaptive mode on the tx side:

ethtool -c enp175s0
Coalesce parameters for enp175s0:
Adaptive RX: off  TX: on
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
dmac: 32551

rx-usecs: 64
rx-frames: 128
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 8
tx-frames: 64
tx-usecs-irq: 0
tx-frames-irq: 0

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

>
> The Rx side logic should be pretty easy to figure out. Essentially you
> want to keep the Rx ring size as small as possible while at the same
> time avoiding storming the system with interrupts. I know for 10Gb/s I
> have used a value of 25us in the past. What you want to watch for is
> if you are dropping packets on the Rx side or not. Ideally you want
> enough buffers that you can capture any burst while you wait for the
> interrupt routine to catch up.
>
>>> One issue I have seen is people delay interrupts for as long as
>>> possible which isn't really a good thing since most network
>>> controllers will use NAPI which will disable the interrupts and leave
>>> them disabled whenever the system is under heavy stress so you should
>>> be able to get the maximum performance by configuring an adapter with
>>> small ring sizes and for high interrupt rates.
>> Sure this is bad to setup rx-usec for high values - cause at some point
>> this will add high latency for packet traversing both sides - and start
>> to hurt buffers
>>
>> But my problem is a little different now i have no problems with RX side
>> - cause i can setup anything like:
>>
>> coalescence from 16 to 64
>>
>> rx ring from 3000 to max 8192
>>
>> And it does not change my max bw - only produces less or more interrupts.
> Right so the issue itself isn't Rx, you aren't throttled there. We are
> probably looking at an issue of PCIe bandwidth or Tx slowing things
> down. The fact that you are still filing interrupts is a bit
> surprising though. Are the Tx and Rx interrupts linked for the device
> you are using or are they firing them seperately? Normally Rx traffic
> won't generate many interrupts under a stress test as NAPI will leave
> the interrupts disabled unless it can keep up. Anyway, my suggestion
> would be to look at tuning things for as small a ring size as
> possible.

PCIe bw was eliminated - previously there was one 2-port 100G card 
installed in one pcie x16 slot (max bw for pcie x16 gen3 is 32GB/s, i.e. 
16/16GB/s bidirectional).

Currently there are two separate nic's installed in two separate x16 
slots - so it can't be a problem with pcie bandwidth.

But I think I am reaching the memory bandwidth limit now for 70Gbit/70Gbit :)

But I am wondering if there is any counter that can help me to diagnose 
problems with memory bandwidth?
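
(For reference, on a recent Intel server the memory controller traffic can be 
read from the uncore IMC PMU, assuming perf exposes it on this platform - the 
event names below are only an example and differ between CPU generations:

  perf stat -a -e uncore_imc/cas_count_read/,uncore_imc/cas_count_write/ sleep 1

Intel's pcm-memory tool reports the same data as per-channel bandwidth.)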

The stream app tests give me results like:

./stream_c.exe
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
  The *best* time for each kernel (excluding the first iteration)
  will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 56
Number of Threads counted = 56
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 4081 microseconds.
    (= 4081 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           29907.2     0.005382     0.005350     0.005405
Scale:          28787.3     0.005611     0.005558     0.005650
Add:            34153.3     0.007037     0.007027     0.007055
Triad:          34944.0     0.006880     0.006868     0.006887
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays

But this is for nodes 0+1.

When limiting the test to one node and the cores used by the network controllers:

-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
  The *best* time for each kernel (excluding the first iteration)
  will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 28
Number of Threads counted = 28
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 6107 microseconds.
    (= 6107 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           20156.4     0.007946     0.007938     0.007958
Scale:          19436.1     0.008237     0.008232     0.008243
Add:            20184.7     0.011896     0.011890     0.011904
Triad:          20687.9     0.011607     0.011601     0.011613
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

Close to the limit, but there is still some headroom - there can be some doubled 
operations, like for the RX/TX side, and the network controllers can use more 
bandwidth, or just can't do this more optimally - because of 
bulking/buffers etc.


So currently only four of the six channels are used - I will also upgrade 
the memory and populate all six channels, left/right side, for the two memory 
controllers that the cpu has.


>> So I start to change params for TX side - and for now i know that the
>> best for me is
>>
>> coalescence adaptative on
>>
>> TX buffer 128
>>
>> This helps with max BW that for now is close to 70Gbit/s RX and 70Gbit
>> TX but after this change i have increasing DROPS on TX side for vlan
>> interfaces.
> So this sounds like you are likely bottlenecked due to either PCIe
> bandwidth or latency. When you start putting back-pressure on the Tx
> like you have described it starts pushing packets onto the Qdisc
> layer. One thing that happens when packets are on the qdisc layer is
> that they can start to perform a bulk dequeue. The side effect of this
> is that you write multiple packets to the descriptor ring and then
> update the hardware doorbell only once for the entire group of packets
> instead of once per packet.

Yes, the problem is I just can't find any place where counters will show 
me why the nic's start to drop packets.

It does not show up in cpu load or in any counter besides the rx_phy 
drops and the tx_vlan dropped packets.


>> And only 50% cpu (max was 50% for 70Gbit/s)
>>
>>
>>> It is easiest to think of it this way. Your total packet rate is equal
>>> to your interrupt rate times the number of buffers you will store in
>>> the ring. So if you have some fixed rate "X" for packets and an
>>> interrupt rate of "i" then your optimal ring size should be "X/i". So
>>> if you lower the interrupt rate you end up hurting the throughput
>>> unless you increase the buffer size. However at a certain point the
>>> buffer size starts becoming an issue. For example with UDP flows I
>>> often see massive packet drops if you tune the interrupt rate too low
>>> and then put the system under heavy stress.
>> Yes - in normal life traffic - most of ddos'es are like this many pps
>> with small frames.
> It sounds to me like XDP would probably be your best bet. With that
> you could probably get away with smaller ring sizes, higher interrupt
> rates, and get the advantage of it batching the Tx without having to
> drop packets.

Yes, I'm currently testing xdp_fwd in the lab - but I have some problems with 
random drops, where the server forwards only 1 in 10 
packets and after some time it starts to work normally.

Currently trying to eliminate the nic offloads that could cause this - so 
turning them off one by one and running tests.
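
(Illustratively, that kind of elimination run is something like the following - 
the interface name is the one above and the exact feature names vary per 
driver:

  ethtool -K enp175s0 lro off gro off tso off gso off
  ethtool -K enp175s0 rxvlan off txvlan off

re-testing xdp_fwd after each step.)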




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm/page_alloc: free order-0 pages through PCP in page_frag_free()
  2018-11-12 15:44           ` Eric Dumazet
@ 2018-11-12 17:06               ` Paweł Staszewski
  0 siblings, 0 replies; 34+ messages in thread
From: Paweł Staszewski @ 2018-11-12 17:06 UTC (permalink / raw)
  To: Eric Dumazet, Alexander Duyck
  Cc: aaron.lu, linux-mm, LKML, Netdev, Andrew Morton,
	Jesper Dangaard Brouer, Tariq Toukan, ilias.apalodimas, yoel,
	Mel Gorman, Saeed Mahameed, Michal Hocko, Vlastimil Babka,
	dave.hansen


On 12.11.2018 at 16:44, Eric Dumazet wrote:
>
> On 11/12/2018 07:30 AM, Alexander Duyck wrote:
>
>> It sounds to me like XDP would probably be your best bet. With that
>> you could probably get away with smaller ring sizes, higher interrupt
>> rates, and get the advantage of it batching the Tx without having to
>> drop packets.
> Add to this that with XDP (or anything lowering per packet processing costs)
> you can reduce number of cpus/queues, get better latencies, and bigger TX batches.

Yes for sure - the best for my use case will be to implement XDP :)

But for real-life use, not a test lab, programs like xdp_fwd need to be 
extended with the minimal information needed from an IP router - like counters 
and some additional debugging for traffic, like sniffing / sampling for ddos 
detection.

And that is really the minimum needed for routing IP traffic with XDP.
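
Just to sketch what such a counter could look like - an illustrative fragment 
assuming a current libbpf, not code taken from xdp_fwd:

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  /* one packets/bytes slot per cpu; userspace reads and sums all cpus */
  struct datarec {
          __u64 packets;
          __u64 bytes;
  };

  struct {
          __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
          __uint(max_entries, 1);
          __type(key, __u32);
          __type(value, struct datarec);
  } stats_map SEC(".maps");

  SEC("xdp")
  int xdp_count(struct xdp_md *ctx)
  {
          __u32 key = 0;
          struct datarec *rec = bpf_map_lookup_elem(&stats_map, &key);

          if (rec) {
                  /* per-cpu map, so plain increments are safe here */
                  rec->packets++;
                  rec->bytes += (long)ctx->data_end - (long)ctx->data;
          }
          return XDP_PASS;
  }

  char _license[] SEC("license") = "GPL";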



^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2018-11-12 17:06 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-05  8:58 [PATCH 1/2] mm/page_alloc: free order-0 pages through PCP in page_frag_free() Aaron Lu
2018-11-05  8:58 ` Aaron Lu
2018-11-05  8:58 ` [PATCH 2/2] mm/page_alloc: use a single function to free page Aaron Lu
2018-11-05 16:39   ` Dave Hansen
2018-11-06  5:30   ` [PATCH v2 " Aaron Lu
2018-11-06  8:16     ` Vlastimil Babka
2018-11-06  8:47       ` Aaron Lu
2018-11-06  9:32         ` Vlastimil Babka
2018-11-06 11:20           ` Aaron Lu
2018-11-06 11:31     ` [PATCH v3 " Aaron Lu
2018-11-06 12:06       ` Vlastimil Babka
2018-11-05  9:26 ` [PATCH 1/2] mm/page_alloc: free order-0 pages through PCP in page_frag_free() Vlastimil Babka
2018-11-05  9:26   ` Vlastimil Babka
2018-11-05  9:26 ` Mel Gorman
2018-11-05  9:26   ` Mel Gorman
2018-11-05  9:55 ` Jesper Dangaard Brouer
2018-11-05 10:46 ` Ilias Apalodimas
2018-11-05 10:46   ` Ilias Apalodimas
2018-11-05 15:44 ` Alexander Duyck
2018-11-10 23:54   ` Paweł Staszewski
2018-11-10 23:54     ` Paweł Staszewski
2018-11-11 23:05     ` Alexander Duyck
2018-11-12  0:39       ` Paweł Staszewski
2018-11-12  0:39         ` Paweł Staszewski
2018-11-12 15:30         ` Alexander Duyck
2018-11-12 15:44           ` Eric Dumazet
2018-11-12 17:06             ` Paweł Staszewski
2018-11-12 17:06               ` Paweł Staszewski
2018-11-12 17:01           ` Paweł Staszewski
2018-11-12 17:01             ` Paweł Staszewski
2018-11-05 16:37 ` Dave Hansen
2018-11-06  5:28 ` [PATCH v2 " Aaron Lu
2018-11-06  5:28   ` Aaron Lu
2018-11-07  9:59   ` Tariq Toukan
