netdev.vger.kernel.org archive mirror
* [net-next v3 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition
@ 2019-12-17 23:17 Jesper Dangaard Brouer
  2019-12-18  7:44 ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 24+ messages in thread
From: Jesper Dangaard Brouer @ 2019-12-17 23:17 UTC (permalink / raw)
  To: netdev
  Cc: Jesper Dangaard Brouer, lirongqing, linyunsheng,
	Ilias Apalodimas, Saeed Mahameed, mhocko, peterz, linux-kernel

The check in pool_page_reusable (page_to_nid(page) == pool->p.nid) is
not valid if page_pool was configured with pool->p.nid = NUMA_NO_NODE.

The goal of the NUMA changes in commit d5394610b1ba ("page_pool: Don't
recycle non-reusable pages") was to have RX-pages that belong to the
same NUMA node as the CPU processing RX-packets during softirq/NAPI,
as illustrated by the performance measurements.

This patch moves the NAPI checks out of fast-path, and at the same time
solves the NUMA_NO_NODE issue.

First realize that alloc_pages_node() with pool->p.nid = NUMA_NO_NODE
will look up the current CPU nid (NUMA id) via numa_mem_id(), which is
used as the preferred nid.  It is only in rare situations, e.g. when a
NUMA zone runs dry, that a page doesn't get allocated from the
preferred nid.  The page_pool API allows drivers to control the nid
themselves via controlling pool->p.nid.

This patch moves the NAPI check to when the alloc cache is refilled,
via dequeuing/consuming pages from the ptr_ring. Thus, we can allow
placing pages from remote NUMA nodes into the ptr_ring, as the
dequeue/consume step will check the NUMA node. All current drivers
using page_pool alloc/refill the RX-ring from the same CPU that runs
the softirq/NAPI process.

Drivers that control the nid explicitly must also use
page_pool_update_nid() when changing the nid at runtime.  To speed up
the transition to a new nid, the alloc cache is now flushed on nid
changes.  This forces pages to come from the ptr_ring, which does the
appropriate nid check.

For the NUMA_NO_NODE case, when a NIC IRQ is moved to another NUMA
node, the ptr_ring will be emptied in chunks of 65
(PP_ALLOC_CACHE_REFILL+1) pages per allocation, and allocations fall
through to the real page-allocator with the new nid derived from
numa_mem_id(). We accept that transitioning the alloc cache doesn't
happen immediately.

Fixes: d5394610b1ba ("page_pool: Don't recycle non-reusable pages")
Reported-by: Li RongQing <lirongqing@baidu.com>
Reported-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/page_pool.c |   64 ++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 49 insertions(+), 15 deletions(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index a6aefe989043..37316ea66937 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -96,19 +96,22 @@ struct page_pool *page_pool_create(const struct page_pool_params *params)
 }
 EXPORT_SYMBOL(page_pool_create);
 
+static void __page_pool_return_page(struct page_pool *pool, struct page *page);
+
 /* fast path */
 static struct page *__page_pool_get_cached(struct page_pool *pool)
 {
 	struct ptr_ring *r = &pool->ring;
+	struct page *first_page, *page;
 	bool refill = false;
-	struct page *page;
+	int i, curr_nid;
 
 	/* Test for safe-context, caller should provide this guarantee */
 	if (likely(in_serving_softirq())) {
 		if (likely(pool->alloc.count)) {
 			/* Fast-path */
-			page = pool->alloc.cache[--pool->alloc.count];
-			return page;
+			first_page = pool->alloc.cache[--pool->alloc.count];
+			return first_page;
 		}
 		refill = true;
 	}
@@ -117,17 +120,42 @@ static struct page *__page_pool_get_cached(struct page_pool *pool)
 	if (__ptr_ring_empty(r))
 		return NULL;
 
-	/* Slow-path: Get page from locked ring queue,
-	 * refill alloc array if requested.
+	/* Softirq guarantees the CPU, and thus the NUMA node, is stable.
+	 * This assumes the CPU refilling the driver RX-ring also runs RX-NAPI.
 	 */
+	curr_nid = numa_mem_id();
+
+	/* Slower-path: Get pages from locked ring queue */
 	spin_lock(&r->consumer_lock);
-	page = __ptr_ring_consume(r);
-	if (refill)
-		pool->alloc.count = __ptr_ring_consume_batched(r,
-							pool->alloc.cache,
-							PP_ALLOC_CACHE_REFILL);
+	first_page = __ptr_ring_consume(r);
+
+	/* Fallback to page-allocator if NUMA node doesn't match */
+	if (first_page && unlikely(!(page_to_nid(first_page) == curr_nid))) {
+		__page_pool_return_page(pool, first_page);
+		first_page = NULL;
+	}
+
+	if (unlikely(!refill))
+		goto out;
+
+	/* Refill alloc array, but only if NUMA node match */
+	for (i = 0; i < PP_ALLOC_CACHE_REFILL; i++) {
+		page = __ptr_ring_consume(r);
+		if (unlikely(!page))
+			break;
+
+		if (likely(page_to_nid(page) == curr_nid)) {
+			pool->alloc.cache[pool->alloc.count++] = page;
+		} else {
+			/* Release page to page-allocator, assume
+			 * refcnt == 1 invariant of cached pages
+			 */
+			__page_pool_return_page(pool, page);
+		}
+	}
+out:
 	spin_unlock(&r->consumer_lock);
-	return page;
+	return first_page;
 }
 
 static void page_pool_dma_sync_for_device(struct page_pool *pool,
@@ -311,13 +339,10 @@ static bool __page_pool_recycle_direct(struct page *page,
 
 /* page is NOT reusable when:
  * 1) allocated when system is under some pressure. (page_is_pfmemalloc)
- * 2) belongs to a different NUMA node than pool->p.nid.
- *
- * To update pool->p.nid users must call page_pool_update_nid.
  */
 static bool pool_page_reusable(struct page_pool *pool, struct page *page)
 {
-	return !page_is_pfmemalloc(page) && page_to_nid(page) == pool->p.nid;
+	return !page_is_pfmemalloc(page);
 }
 
 void __page_pool_put_page(struct page_pool *pool, struct page *page,
@@ -484,7 +509,16 @@ EXPORT_SYMBOL(page_pool_destroy);
 /* Caller must provide appropriate safe context, e.g. NAPI. */
 void page_pool_update_nid(struct page_pool *pool, int new_nid)
 {
+	struct page *page;
+
+	WARN_ON(!in_serving_softirq());
 	trace_page_pool_update_nid(pool, new_nid);
 	pool->p.nid = new_nid;
+
+	/* Flush pool alloc cache, as refill will check NUMA node */
+	while (pool->alloc.count) {
+		page = pool->alloc.cache[--pool->alloc.count];
+		__page_pool_return_page(pool, page);
+	}
 }
 EXPORT_SYMBOL(page_pool_update_nid);



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [net-next v3 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-17 23:17 [net-next v3 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition Jesper Dangaard Brouer
@ 2019-12-18  7:44 ` Jesper Dangaard Brouer
  2019-12-18  8:01   ` [net-next v4 " Jesper Dangaard Brouer
  2019-12-19 14:20   ` [net-next v5 " Jesper Dangaard Brouer
  0 siblings, 2 replies; 24+ messages in thread
From: Jesper Dangaard Brouer @ 2019-12-18  7:44 UTC (permalink / raw)
  To: netdev
  Cc: lirongqing, linyunsheng, Ilias Apalodimas, Saeed Mahameed,
	mhocko, peterz, linux-kernel, brouer

On Wed, 18 Dec 2019 00:17:36 +0100
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> The check in pool_page_reusable (page_to_nid(page) == pool->p.nid) is
> not valid if page_pool was configured with pool->p.nid = NUMA_NO_NODE.
> 
> The goal of the NUMA changes in commit d5394610b1ba ("page_pool: Don't
> recycle non-reusable pages") was to have RX-pages that belong to the
> same NUMA node as the CPU processing RX-packets during softirq/NAPI,
> as illustrated by the performance measurements.
> 
> This patch moves the NAPI checks out of fast-path, and at the same time
> solves the NUMA_NO_NODE issue.
> 
> First realize that alloc_pages_node() with pool->p.nid = NUMA_NO_NODE
> will look up the current CPU nid (NUMA id) via numa_mem_id(), which is
> used as the preferred nid.  It is only in rare situations, e.g. when a
> NUMA zone runs dry, that a page doesn't get allocated from the
> preferred nid.  The page_pool API allows drivers to control the nid
> themselves via controlling pool->p.nid.
> 
> This patch moves the NAPI check to when the alloc cache is refilled,
> via dequeuing/consuming pages from the ptr_ring. Thus, we can allow
> placing pages from remote NUMA nodes into the ptr_ring, as the
> dequeue/consume step will check the NUMA node. All current drivers
> using page_pool alloc/refill the RX-ring from the same CPU that runs
> the softirq/NAPI process.
> 
> Drivers that control the nid explicitly must also use
> page_pool_update_nid() when changing the nid at runtime.  To speed up
> the transition to a new nid, the alloc cache is now flushed on nid
> changes.  This forces pages to come from the ptr_ring, which does the
> appropriate nid check.
> 
> For the NUMA_NO_NODE case, when a NIC IRQ is moved to another NUMA
> node, the ptr_ring will be emptied in chunks of 65
> (PP_ALLOC_CACHE_REFILL+1) pages per allocation, and allocations fall
> through to the real page-allocator with the new nid derived from
> numa_mem_id(). We accept that transitioning the alloc cache doesn't
> happen immediately.
> 
> Fixes: d5394610b1ba ("page_pool: Don't recycle non-reusable pages")
> Reported-by: Li RongQing <lirongqing@baidu.com>
> Reported-by: Yunsheng Lin <linyunsheng@huawei.com>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  net/core/page_pool.c |   64 ++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 49 insertions(+), 15 deletions(-)

I'm going to send a V4, because GCC doesn't generate the optimal ASM
code for the fast-path (details below).

> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index a6aefe989043..37316ea66937 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -96,19 +96,22 @@ struct page_pool *page_pool_create(const struct page_pool_params *params)
>  }
>  EXPORT_SYMBOL(page_pool_create);
>  
> +static void __page_pool_return_page(struct page_pool *pool, struct page *page);
> +
>  /* fast path */
>  static struct page *__page_pool_get_cached(struct page_pool *pool)
>  {
>  	struct ptr_ring *r = &pool->ring;
> +	struct page *first_page, *page;
>  	bool refill = false;
> -	struct page *page;
> +	int i, curr_nid;
>  
>  	/* Test for safe-context, caller should provide this guarantee */
>  	if (likely(in_serving_softirq())) {
>  		if (likely(pool->alloc.count)) {
>  			/* Fast-path */
> -			page = pool->alloc.cache[--pool->alloc.count];
> -			return page;
> +			first_page = pool->alloc.cache[--pool->alloc.count];
> +			return first_page;
>  		}
>  		refill = true;
>  	}

The compiler (gcc (GCC) 9.2.1 20190827 (Red Hat 9.2.1-1)) doesn't
generate optimal ASM code for the likely fast-path of pool->alloc.cache
containing an element.  It does "isolate" it and return (retq) early in
this likely case, BUT it chooses to use %r15 (a call-preserved
register, aka callee-saved register).  Later other call-preserved
registers are used, which leads to pushing all the registers (%rbx,
%rbp, and %r12-%r15) onto the stack.

0000000000000af0 <page_pool_alloc_pages>:
{
 af0:   e8 00 00 00 00          callq  af5 <page_pool_alloc_pages+0x5>
                        af1: R_X86_64_PLT32     __fentry__-0x4
 af5:   41 57                   push   %r15
 af7:   41 56                   push   %r14
 af9:   41 89 f6                mov    %esi,%r14d
 afc:   41 55                   push   %r13
 afe:   41 54                   push   %r12
 b00:   55                      push   %rbp
 b01:   48 89 fd                mov    %rdi,%rbp
 b04:   53                      push   %rbx
 b05:   48 83 ec 08             sub    $0x8,%rsp
 b09:   65 8b 05 00 00 00 00    mov    %gs:0x0(%rip),%eax        # b10 <page_pool_alloc_pages+0x20>
                        b0c: R_X86_64_PC32      __preempt_count-0x4
        if (likely(in_serving_softirq())) {
 b10:   f6 c4 01                test   $0x1,%ah
 b13:   74 60                   je     b75 <page_pool_alloc_pages+0x85>
                if (likely(pool->alloc.count)) {
 b15:   8b 87 c0 00 00 00       mov    0xc0(%rdi),%eax
 b1b:   85 c0                   test   %eax,%eax
 b1d:   0f 84 94 01 00 00       je     cb7 <page_pool_alloc_pages+0x1c7>
                        first_page = pool->alloc.cache[--pool->alloc.count];
 b23:   83 e8 01                sub    $0x1,%eax
 b26:   89 87 c0 00 00 00       mov    %eax,0xc0(%rdi)
 b2c:   4c 8b bc c7 c8 00 00    mov    0xc8(%rdi,%rax,8),%r15
 b33:   00 
        if (page)
 b34:   4d 85 ff                test   %r15,%r15
 b37:   74 23                   je     b5c <page_pool_alloc_pages+0x6c>
}
 b39:   48 83 c4 08             add    $0x8,%rsp
 b3d:   4c 89 f8                mov    %r15,%rax
 b40:   5b                      pop    %rbx
 b41:   5d                      pop    %rbp
 b42:   41 5c                   pop    %r12
 b44:   41 5d                   pop    %r13
 b46:   41 5e                   pop    %r14
 b48:   41 5f                   pop    %r15
 b4a:   c3                      retq   
 [...]


> @@ -117,17 +120,42 @@ static struct page *__page_pool_get_cached(struct page_pool *pool)
>  	if (__ptr_ring_empty(r))
>  		return NULL;
>  
> -	/* Slow-path: Get page from locked ring queue,
> -	 * refill alloc array if requested.
> +	/* Softirq guarantees the CPU, and thus the NUMA node, is stable.
> +	 * This assumes the CPU refilling the driver RX-ring also runs RX-NAPI.
>  	 */
> +	curr_nid = numa_mem_id();
> +
> +	/* Slower-path: Get pages from locked ring queue */
>  	spin_lock(&r->consumer_lock);
> -	page = __ptr_ring_consume(r);
> -	if (refill)
> -		pool->alloc.count = __ptr_ring_consume_batched(r,
> -							pool->alloc.cache,
> -							PP_ALLOC_CACHE_REFILL);
> +	first_page = __ptr_ring_consume(r);
> +
> +	/* Fallback to page-allocator if NUMA node doesn't match */
> +	if (first_page && unlikely(!(page_to_nid(first_page) == curr_nid))) {
> +		__page_pool_return_page(pool, first_page);
> +		first_page = NULL;
> +	}
> +
> +	if (unlikely(!refill))
> +		goto out;
> +
> +	/* Refill alloc array, but only if NUMA node match */
> +	for (i = 0; i < PP_ALLOC_CACHE_REFILL; i++) {
> +		page = __ptr_ring_consume(r);
> +		if (unlikely(!page))
> +			break;
> +
> +		if (likely(page_to_nid(page) == curr_nid)) {
> +			pool->alloc.cache[pool->alloc.count++] = page;
> +		} else {
> +			/* Release page to page-allocator, assume
> +			 * refcnt == 1 invariant of cached pages
> +			 */
> +			__page_pool_return_page(pool, page);
> +		}
> +	}
> +out:
>  	spin_unlock(&r->consumer_lock);
> -	return page;
> +	return first_page;
>  }

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



* [net-next v4 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-18  7:44 ` Jesper Dangaard Brouer
@ 2019-12-18  8:01   ` Jesper Dangaard Brouer
  2019-12-18 14:27     ` Re: " Li,Rongqing
                       ` (2 more replies)
  2019-12-19 14:20   ` [net-next v5 " Jesper Dangaard Brouer
  1 sibling, 3 replies; 24+ messages in thread
From: Jesper Dangaard Brouer @ 2019-12-18  8:01 UTC (permalink / raw)
  To: netdev
  Cc: Jesper Dangaard Brouer, lirongqing, linyunsheng,
	Ilias Apalodimas, Saeed Mahameed, mhocko, peterz, linux-kernel

The check in pool_page_reusable (page_to_nid(page) == pool->p.nid) is
not valid if page_pool was configured with pool->p.nid = NUMA_NO_NODE.

The goal of the NUMA changes in commit d5394610b1ba ("page_pool: Don't
recycle non-reusable pages") was to have RX-pages that belong to the
same NUMA node as the CPU processing RX-packets during softirq/NAPI,
as illustrated by the performance measurements.

This patch moves the NAPI checks out of fast-path, and at the same time
solves the NUMA_NO_NODE issue.

First realize that alloc_pages_node() with pool->p.nid = NUMA_NO_NODE
will look up the current CPU nid (NUMA id) via numa_mem_id(), which is
used as the preferred nid.  It is only in rare situations, e.g. when a
NUMA zone runs dry, that a page doesn't get allocated from the
preferred nid.  The page_pool API allows drivers to control the nid
themselves via controlling pool->p.nid.

This patch moves the NAPI check to when the alloc cache is refilled,
via dequeuing/consuming pages from the ptr_ring. Thus, we can allow
placing pages from remote NUMA nodes into the ptr_ring, as the
dequeue/consume step will check the NUMA node. All current drivers
using page_pool alloc/refill the RX-ring from the same CPU that runs
the softirq/NAPI process.

Drivers that control the nid explicitly must also use
page_pool_update_nid() when changing the nid at runtime.  To speed up
the transition to a new nid, the alloc cache is now flushed on nid
changes.  This forces pages to come from the ptr_ring, which does the
appropriate nid check.

For the NUMA_NO_NODE case, when a NIC IRQ is moved to another NUMA
node, the ptr_ring will be emptied in chunks of 65
(PP_ALLOC_CACHE_REFILL+1) pages per allocation, and allocations fall
through to the real page-allocator with the new nid derived from
numa_mem_id(). We accept that transitioning the alloc cache doesn't
happen immediately.

Fixes: d5394610b1ba ("page_pool: Don't recycle non-reusable pages")
Reported-by: Li RongQing <lirongqing@baidu.com>
Reported-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/page_pool.c |   82 ++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 63 insertions(+), 19 deletions(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index a6aefe989043..bd4f8b2c46b6 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -96,10 +96,61 @@ struct page_pool *page_pool_create(const struct page_pool_params *params)
 }
 EXPORT_SYMBOL(page_pool_create);
 
+static void __page_pool_return_page(struct page_pool *pool, struct page *page);
+
+noinline
+static struct page *page_pool_refill_alloc_cache(struct page_pool *pool,
+						 bool refill)
+{
+	struct ptr_ring *r = &pool->ring;
+	struct page *first_page, *page;
+	int i, curr_nid;
+
+	/* Quicker fallback, avoid locks when ring is empty */
+	if (__ptr_ring_empty(r))
+		return NULL;
+
+	/* Softirq guarantees the CPU, and thus the NUMA node, is stable.
+	 * This assumes the CPU refilling the driver RX-ring also runs RX-NAPI.
+	 */
+	curr_nid = numa_mem_id();
+
+	/* Slower-path: Get pages from locked ring queue */
+	spin_lock(&r->consumer_lock);
+	first_page = __ptr_ring_consume(r);
+
+	/* Fallback to page-allocator if NUMA node doesn't match */
+	if (first_page && unlikely(!(page_to_nid(first_page) == curr_nid))) {
+		__page_pool_return_page(pool, first_page);
+		first_page = NULL;
+	}
+
+	if (unlikely(!refill))
+		goto out;
+
+	/* Refill alloc array, but only if NUMA node match */
+	for (i = 0; i < PP_ALLOC_CACHE_REFILL; i++) {
+		page = __ptr_ring_consume(r);
+		if (unlikely(!page))
+			break;
+
+		if (likely(page_to_nid(page) == curr_nid)) {
+			pool->alloc.cache[pool->alloc.count++] = page;
+		} else {
+			/* Release page to page-allocator, assume
+			 * refcnt == 1 invariant of cached pages
+			 */
+			__page_pool_return_page(pool, page);
+		}
+	}
+out:
+	spin_unlock(&r->consumer_lock);
+	return first_page;
+}
+
 /* fast path */
 static struct page *__page_pool_get_cached(struct page_pool *pool)
 {
-	struct ptr_ring *r = &pool->ring;
 	bool refill = false;
 	struct page *page;
 
@@ -113,20 +164,7 @@ static struct page *__page_pool_get_cached(struct page_pool *pool)
 		refill = true;
 	}
 
-	/* Quicker fallback, avoid locks when ring is empty */
-	if (__ptr_ring_empty(r))
-		return NULL;
-
-	/* Slow-path: Get page from locked ring queue,
-	 * refill alloc array if requested.
-	 */
-	spin_lock(&r->consumer_lock);
-	page = __ptr_ring_consume(r);
-	if (refill)
-		pool->alloc.count = __ptr_ring_consume_batched(r,
-							pool->alloc.cache,
-							PP_ALLOC_CACHE_REFILL);
-	spin_unlock(&r->consumer_lock);
+	page = page_pool_refill_alloc_cache(pool, refill);
 	return page;
 }
 
@@ -311,13 +349,10 @@ static bool __page_pool_recycle_direct(struct page *page,
 
 /* page is NOT reusable when:
  * 1) allocated when system is under some pressure. (page_is_pfmemalloc)
- * 2) belongs to a different NUMA node than pool->p.nid.
- *
- * To update pool->p.nid users must call page_pool_update_nid.
  */
 static bool pool_page_reusable(struct page_pool *pool, struct page *page)
 {
-	return !page_is_pfmemalloc(page) && page_to_nid(page) == pool->p.nid;
+	return !page_is_pfmemalloc(page);
 }
 
 void __page_pool_put_page(struct page_pool *pool, struct page *page,
@@ -484,7 +519,16 @@ EXPORT_SYMBOL(page_pool_destroy);
 /* Caller must provide appropriate safe context, e.g. NAPI. */
 void page_pool_update_nid(struct page_pool *pool, int new_nid)
 {
+	struct page *page;
+
+	WARN_ON(!in_serving_softirq());
 	trace_page_pool_update_nid(pool, new_nid);
 	pool->p.nid = new_nid;
+
+	/* Flush pool alloc cache, as refill will check NUMA node */
+	while (pool->alloc.count) {
+		page = pool->alloc.cache[--pool->alloc.count];
+		__page_pool_return_page(pool, page);
+	}
 }
 EXPORT_SYMBOL(page_pool_update_nid);




* Re: [net-next v4 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-18  8:01   ` [net-next v4 " Jesper Dangaard Brouer
@ 2019-12-18 14:27     ` Li,Rongqing
  2019-12-19 12:00       ` Jesper Dangaard Brouer
  2019-12-19  1:52     ` Yunsheng Lin
  2019-12-19 12:09     ` Michal Hocko
  2 siblings, 1 reply; 24+ messages in thread
From: Li,Rongqing @ 2019-12-18 14:27 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev
  Cc: linyunsheng, Ilias Apalodimas, Saeed Mahameed, mhocko, peterz,
	linux-kernel


It seems it cannot handle the in-flight pages if we remove the node id check from pool_page_reusable?


-Li


* Re: [net-next v4 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-18  8:01   ` [net-next v4 " Jesper Dangaard Brouer
  2019-12-18 14:27     ` Re: " Li,Rongqing
@ 2019-12-19  1:52     ` Yunsheng Lin
  2019-12-19 12:15       ` Jesper Dangaard Brouer
  2019-12-19 12:09     ` Michal Hocko
  2 siblings, 1 reply; 24+ messages in thread
From: Yunsheng Lin @ 2019-12-19  1:52 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev
  Cc: lirongqing, Ilias Apalodimas, Saeed Mahameed, mhocko, peterz,
	linux-kernel

On 2019/12/18 16:01, Jesper Dangaard Brouer wrote:
> The check in pool_page_reusable (page_to_nid(page) == pool->p.nid) is
> not valid if page_pool was configured with pool->p.nid = NUMA_NO_NODE.
> 
> The goal of the NUMA changes in commit d5394610b1ba ("page_pool: Don't
> recycle non-reusable pages") was to have RX-pages that belong to the
> same NUMA node as the CPU processing RX-packets during softirq/NAPI,
> as illustrated by the performance measurements.
> 
> This patch moves the NAPI checks out of fast-path, and at the same time
> solves the NUMA_NO_NODE issue.
> 
> First realize that alloc_pages_node() with pool->p.nid = NUMA_NO_NODE
> will look up the current CPU nid (NUMA id) via numa_mem_id(), which is
> used as the preferred nid.  It is only in rare situations, e.g. when a
> NUMA zone runs dry, that a page doesn't get allocated from the
> preferred nid.  The page_pool API allows drivers to control the nid
> themselves via controlling pool->p.nid.
> 
> This patch moves the NAPI check to when the alloc cache is refilled,
> via dequeuing/consuming pages from the ptr_ring. Thus, we can allow
> placing pages from remote NUMA nodes into the ptr_ring, as the
> dequeue/consume step will check the NUMA node. All current drivers
> using page_pool alloc/refill the RX-ring from the same CPU that runs
> the softirq/NAPI process.
> 
> Drivers that control the nid explicitly must also use
> page_pool_update_nid() when changing the nid at runtime.  To speed up
> the transition to a new nid, the alloc cache is now flushed on nid
> changes.  This forces pages to come from the ptr_ring, which does the
> appropriate nid check.
> 
> For the NUMA_NO_NODE case, when a NIC IRQ is moved to another NUMA
> node, the ptr_ring will be emptied in chunks of 65
> (PP_ALLOC_CACHE_REFILL+1) pages per allocation, and allocations fall
> through to the real page-allocator with the new nid derived from
> numa_mem_id(). We accept that transitioning the alloc cache doesn't
> happen immediately.
> 
> Fixes: d5394610b1ba ("page_pool: Don't recycle non-reusable pages")
> Reported-by: Li RongQing <lirongqing@baidu.com>
> Reported-by: Yunsheng Lin <linyunsheng@huawei.com>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  net/core/page_pool.c |   82 ++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 63 insertions(+), 19 deletions(-)
> 
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index a6aefe989043..bd4f8b2c46b6 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -96,10 +96,61 @@ struct page_pool *page_pool_create(const struct page_pool_params *params)
>  }
>  EXPORT_SYMBOL(page_pool_create);
>  
> +static void __page_pool_return_page(struct page_pool *pool, struct page *page);

Is it possible to avoid the forward declaration by moving __page_pool_return_page()?
Maybe it is ok since this patch is targeting net-next?

> +
> +noinline
> +static struct page *page_pool_refill_alloc_cache(struct page_pool *pool,
> +						 bool refill)
> +{
> +	struct ptr_ring *r = &pool->ring;
> +	struct page *first_page, *page;
> +	int i, curr_nid;
> +
> +	/* Quicker fallback, avoid locks when ring is empty */
> +	if (__ptr_ring_empty(r))
> +		return NULL;
> +
> +	/* Softirq guarantees the CPU, and thus the NUMA node, is stable.
> +	 * This assumes the CPU refilling the driver RX-ring also runs RX-NAPI.
> +	 */
> +	curr_nid = numa_mem_id();
> +
> +	/* Slower-path: Get pages from locked ring queue */
> +	spin_lock(&r->consumer_lock);
> +	first_page = __ptr_ring_consume(r);
> +
> +	/* Fallback to page-allocator if NUMA node doesn't match */
> +	if (first_page && unlikely(!(page_to_nid(first_page) == curr_nid))) {
> +		__page_pool_return_page(pool, first_page);
> +		first_page = NULL;
> +	}
> +
> +	if (unlikely(!refill))
> +		goto out;
> +
> +	/* Refill alloc array, but only if NUMA node match */
> +	for (i = 0; i < PP_ALLOC_CACHE_REFILL; i++) {
> +		page = __ptr_ring_consume(r);
> +		if (unlikely(!page))
> +			break;
> +
> +		if (likely(page_to_nid(page) == curr_nid)) {
> +			pool->alloc.cache[pool->alloc.count++] = page;
> +		} else {
> +			/* Release page to page-allocator, assume
> +			 * refcnt == 1 invariant of cached pages
> +			 */
> +			__page_pool_return_page(pool, page);
> +		}
> +	}

The above code seems to not clear all the pages in the ptr_ring that
are not on the local node in some cases?

I am not so familiar with asm, but does the below code make sense and
generate better asm code?

	struct page *page = NULL;

	while (pool->alloc.count < PP_ALLOC_CACHE_REFILL || !refill) {
		page = __ptr_ring_consume(r);

		if (unlikely(!page || !refill))
			break;

		if (likely(page_to_nid(page) == curr_nid)) {
			pool->alloc.cache[pool->alloc.count++] = page;
		} else {
			/* Release page to page-allocator, assume
			 * refcnt == 1 invariant of cached pages
			 */
			__page_pool_return_page(pool, page);
		}
	}

out:
	if (likely(refill && pool->alloc.count > 0))
		page = pool->alloc.cache[--pool->alloc.count];

	spin_unlock(&r->consumer_lock);
	
	return page;


(The above code is not compiled or tested yet.)

The above will clear all the pages in the ptr_ring that are not on the
local node, and treats the refill and !refill cases consistently.

But for the refill case, pool->alloc.count may be PP_ALLOC_CACHE_REFILL - 1
after page_pool_refill_alloc_cache() returns.


> +out:
> +	spin_unlock(&r->consumer_lock);
> +	return first_page;
> +}
> +
>  /* fast path */
>  static struct page *__page_pool_get_cached(struct page_pool *pool)
>  {
> -	struct ptr_ring *r = &pool->ring;
>  	bool refill = false;
>  	struct page *page;
>  
> @@ -113,20 +164,7 @@ static struct page *__page_pool_get_cached(struct page_pool *pool)
>  		refill = true;
>  	}
>  
> -	/* Quicker fallback, avoid locks when ring is empty */
> -	if (__ptr_ring_empty(r))
> -		return NULL;
> -
> -	/* Slow-path: Get page from locked ring queue,
> -	 * refill alloc array if requested.
> -	 */
> -	spin_lock(&r->consumer_lock);
> -	page = __ptr_ring_consume(r);
> -	if (refill)
> -		pool->alloc.count = __ptr_ring_consume_batched(r,
> -							pool->alloc.cache,
> -							PP_ALLOC_CACHE_REFILL);
> -	spin_unlock(&r->consumer_lock);
> +	page = page_pool_refill_alloc_cache(pool, refill);
>  	return page;
>  }
>  
> @@ -311,13 +349,10 @@ static bool __page_pool_recycle_direct(struct page *page,
>  
>  /* page is NOT reusable when:
>   * 1) allocated when system is under some pressure. (page_is_pfmemalloc)
> - * 2) belongs to a different NUMA node than pool->p.nid.
> - *
> - * To update pool->p.nid users must call page_pool_update_nid.
>   */
>  static bool pool_page_reusable(struct page_pool *pool, struct page *page)
>  {
> -	return !page_is_pfmemalloc(page) && page_to_nid(page) == pool->p.nid;
> +	return !page_is_pfmemalloc(page);
>  }
>  
>  void __page_pool_put_page(struct page_pool *pool, struct page *page,
> @@ -484,7 +519,16 @@ EXPORT_SYMBOL(page_pool_destroy);
>  /* Caller must provide appropriate safe context, e.g. NAPI. */
>  void page_pool_update_nid(struct page_pool *pool, int new_nid)
>  {
> +	struct page *page;
> +
> +	WARN_ON(!in_serving_softirq());
>  	trace_page_pool_update_nid(pool, new_nid);
>  	pool->p.nid = new_nid;
> +
> +	/* Flush pool alloc cache, as refill will check NUMA node */
> +	while (pool->alloc.count) {
> +		page = pool->alloc.cache[--pool->alloc.count];
> +		__page_pool_return_page(pool, page);
> +	}
>  }
>  EXPORT_SYMBOL(page_pool_update_nid);
> 
> 
> 
> .
> 



* Re: [net-next v4 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-18 14:27     ` Re: " Li,Rongqing
@ 2019-12-19 12:00       ` Jesper Dangaard Brouer
  2019-12-19 12:47         ` Re: " Li,Rongqing
  0 siblings, 1 reply; 24+ messages in thread
From: Jesper Dangaard Brouer @ 2019-12-19 12:00 UTC (permalink / raw)
  To: Li,Rongqing
  Cc: netdev, linyunsheng, Ilias Apalodimas, Saeed Mahameed, mhocko,
	peterz, linux-kernel, brouer

On Wed, 18 Dec 2019 14:27:13 +0000
"Li,Rongqing" <lirongqing@baidu.com> wrote:

> It seems it cannot handle the in-flight pages if we remove the node
> id check from pool_page_reusable?

I don't follow.
I do think this patch handles in-flight pages, as they have to travel
back through the ptr_ring.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



* Re: [net-next v4 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-18  8:01   ` [net-next v4 " Jesper Dangaard Brouer
  2019-12-18 14:27     ` Re: " Li,Rongqing
  2019-12-19  1:52     ` Yunsheng Lin
@ 2019-12-19 12:09     ` Michal Hocko
  2019-12-19 13:35       ` Jesper Dangaard Brouer
  2 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2019-12-19 12:09 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, lirongqing, linyunsheng, Ilias Apalodimas,
	Saeed Mahameed, peterz, linux-kernel

On Wed 18-12-19 09:01:35, Jesper Dangaard Brouer wrote:
[...]
> For the NUMA_NO_NODE case, when a NIC IRQ is moved to another NUMA
> node, then ptr_ring will be emptied in 65 (PP_ALLOC_CACHE_REFILL+1)
> chunks per allocation and allocation fall-through to the real
> page-allocator with the new nid derived from numa_mem_id(). We accept
> that transitioning the alloc cache doesn't happen immediately.

Could you explain what the expected semantics of NUMA_NO_NODE are in this
case? Does it always imply the preferred locality? See my other email[1] on
this matter.

[1] http://lkml.kernel.org/r/20191219115338.GC26945@dhcp22.suse.cz
-- 
Michal Hocko
SUSE Labs


* Re: [net-next v4 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-19  1:52     ` Yunsheng Lin
@ 2019-12-19 12:15       ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 24+ messages in thread
From: Jesper Dangaard Brouer @ 2019-12-19 12:15 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: netdev, lirongqing, Ilias Apalodimas, Saeed Mahameed, mhocko,
	peterz, linux-kernel, brouer

On Thu, 19 Dec 2019 09:52:14 +0800
Yunsheng Lin <linyunsheng@huawei.com> wrote:

> On 2019/12/18 16:01, Jesper Dangaard Brouer wrote:
> > The check in pool_page_reusable (page_to_nid(page) == pool->p.nid) is
> > not valid if page_pool was configured with pool->p.nid = NUMA_NO_NODE.
> > 
> > The goal of the NUMA changes in commit d5394610b1ba ("page_pool: Don't
> > recycle non-reusable pages") was to have RX-pages that belong to the
> > same NUMA node as the CPU processing RX-packets during softirq/NAPI,
> > as illustrated by the performance measurements.
> > 
> > This patch moves the NAPI checks out of fast-path, and at the same time
> > solves the NUMA_NO_NODE issue.
> > 
> > First realize that alloc_pages_node() with pool->p.nid = NUMA_NO_NODE
> > will look up the current CPU's nid (NUMA id) via numa_mem_id(), which
> > is used as the preferred nid.  It is only in rare situations, e.g.
> > when a NUMA zone runs dry, that the page doesn't get allocated from
> > the preferred nid.  The page_pool API allows drivers to control the
> > nid themselves via pool->p.nid.
> > 
> > This patch moves the NAPI check to when alloc cache is refilled, via
> > dequeuing/consuming pages from the ptr_ring. Thus, we can allow placing
> > pages from remote NUMA into the ptr_ring, as the dequeue/consume step
> > will check the NUMA node. All current drivers using page_pool will
> > alloc/refill RX-ring from same CPU running softirq/NAPI process.
> > 
> > Drivers that control the nid explicitly also use page_pool_update_nid
> > when changing the nid at runtime.  To speed up the transition to the
> > new nid, the alloc cache is now flushed on nid changes.  This forces
> > pages to come from the ptr_ring, which does the appropriate nid check.
> > 
> > For the NUMA_NO_NODE case, when a NIC IRQ is moved to another NUMA
> > node, the ptr_ring will be emptied in 65 (PP_ALLOC_CACHE_REFILL+1)
> > chunks per allocation, and allocations fall through to the real
> > page-allocator with the new nid derived from numa_mem_id(). We accept
> > that transitioning the alloc cache doesn't happen immediately.
> > 
> > Fixes: d5394610b1ba ("page_pool: Don't recycle non-reusable pages")
> > Reported-by: Li RongQing <lirongqing@baidu.com>
> > Reported-by: Yunsheng Lin <linyunsheng@huawei.com>
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> >  net/core/page_pool.c |   82 ++++++++++++++++++++++++++++++++++++++------------
> >  1 file changed, 63 insertions(+), 19 deletions(-)
> > 
> > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > index a6aefe989043..bd4f8b2c46b6 100644
> > --- a/net/core/page_pool.c
> > +++ b/net/core/page_pool.c
> > @@ -96,10 +96,61 @@ struct page_pool *page_pool_create(const struct page_pool_params *params)
> >  }
> >  EXPORT_SYMBOL(page_pool_create);
> >  
> > +static void __page_pool_return_page(struct page_pool *pool, struct page *page);  
> 
> Is it possible to avoid the forward declaration by moving __page_pool_return_page()?
> Maybe it is ok to leave it, since this patch is targeting net-next?
> 
> > +
> > +noinline
> > +static struct page *page_pool_refill_alloc_cache(struct page_pool *pool,
> > +						 bool refill)
> > +{
> > +	struct ptr_ring *r = &pool->ring;
> > +	struct page *first_page, *page;
> > +	int i, curr_nid;
> > +
> > +	/* Quicker fallback, avoid locks when ring is empty */
> > +	if (__ptr_ring_empty(r))
> > +		return NULL;
> > +
> > +	/* Softirq guarantee CPU and thus NUMA node is stable. This,
> > +	 * assumes CPU refilling driver RX-ring will also run RX-NAPI.
> > +	 */
> > +	curr_nid = numa_mem_id();
> > +
> > +	/* Slower-path: Get pages from locked ring queue */
> > +	spin_lock(&r->consumer_lock);
> > +	first_page = __ptr_ring_consume(r);
> > +
> > +	/* Fallback to page-allocator if NUMA node doesn't match */
> > +	if (first_page && unlikely(!(page_to_nid(first_page) == curr_nid))) {
> > +		__page_pool_return_page(pool, first_page);
> > +		first_page = NULL;
> > +	}
> > +
> > +	if (unlikely(!refill))
> > +		goto out;
> > +
> > +	/* Refill alloc array, but only if NUMA node match */
> > +	for (i = 0; i < PP_ALLOC_CACHE_REFILL; i++) {
> > +		page = __ptr_ring_consume(r);
> > +		if (unlikely(!page))
> > +			break;
> > +
> > +		if (likely(page_to_nid(page) == curr_nid)) {
> > +			pool->alloc.cache[pool->alloc.count++] = page;
> > +		} else {
> > +			/* Release page to page-allocator, assume
> > +			 * refcnt == 1 invariant of cached pages
> > +			 */
> > +			__page_pool_return_page(pool, page);
> > +		}
> > +	}  
> 
> The above code seems to not clear all the pages in the ptr_ring that
> are not in the local node in some cases?
> 
> I am not so familiar with asm, but does the below code make sense and
> generate better asm code?

I'm not too concerned with ASM-level optimization for this function
call, as it only happens once every 64 packets.


> 	struct page *page = NULL;
> 
> 	while (pool->alloc.count < PP_ALLOC_CACHE_REFILL || !refill) {
> 		page = __ptr_ring_consume(r);
> 
> 		if (unlikely(!page || !refill))
> 			break;
> 
> 		if (likely(page_to_nid(page) == curr_nid)) {
> 			pool->alloc.cache[pool->alloc.count++] = page;
> 		} else {
> 			/* Release page to page-allocator, assume
> 			 * refcnt == 1 invariant of cached pages
> 			 */
> 			__page_pool_return_page(pool, page);
> 		}
> 	}
> 
> out:
> 	if (likely(refill && pool->alloc.count > 0))
> 		page = pool->alloc.cache[--pool->alloc.count];
> 
> 	spin_unlock(&r->consumer_lock);
> 	
> 	return page;
>
> 
> (The above code has not been compiled or tested yet.)
> 
> The above will clear all the pages in the ptr_ring that are not in the
> local node, and treats the refill and !refill cases consistently.

I don't want to empty the entire ptr_ring in one go.  That is
problematic, because we are running in softirq with BH + preemption
disabled.  Returning 1024 pages will undoubtedly trigger some page
buddy coalescing work.  That is why I chose to return at most 65 pages
(I felt this detail was important enough to mention in the description
above).

I do acknowledge that the code can be improved.  What I don't like
about my own code is that I handle 'first_page' as a special case.  Your
code solved that case, so I'll try to improve my code and send V5.


> 
> But for the refill case, the pool->alloc.count may be PP_ALLOC_CACHE_REFILL - 1
> after page_pool_refill_alloc_cache() returns.
> 
> 
> > +out:
> > +	spin_unlock(&r->consumer_lock);
> > +	return first_page;
> > +}
> > +
> >  /* fast path */
> >  static struct page *__page_pool_get_cached(struct page_pool *pool)
> >  {
> > -	struct ptr_ring *r = &pool->ring;
> >  	bool refill = false;
> >  	struct page *page;
> >  
> > @@ -113,20 +164,7 @@ static struct page *__page_pool_get_cached(struct page_pool *pool)
> >  		refill = true;
> >  	}
> >  
> > -	/* Quicker fallback, avoid locks when ring is empty */
> > -	if (__ptr_ring_empty(r))
> > -		return NULL;
> > -
> > -	/* Slow-path: Get page from locked ring queue,
> > -	 * refill alloc array if requested.
> > -	 */
> > -	spin_lock(&r->consumer_lock);
> > -	page = __ptr_ring_consume(r);
> > -	if (refill)
> > -		pool->alloc.count = __ptr_ring_consume_batched(r,
> > -							pool->alloc.cache,
> > -							PP_ALLOC_CACHE_REFILL);
> > -	spin_unlock(&r->consumer_lock);
> > +	page = page_pool_refill_alloc_cache(pool, refill);
> >  	return page;
> >  }
> >  
> > @@ -311,13 +349,10 @@ static bool __page_pool_recycle_direct(struct page *page,
> >  
> >  /* page is NOT reusable when:
> >   * 1) allocated when system is under some pressure. (page_is_pfmemalloc)
> > - * 2) belongs to a different NUMA node than pool->p.nid.
> > - *
> > - * To update pool->p.nid users must call page_pool_update_nid.
> >   */
> >  static bool pool_page_reusable(struct page_pool *pool, struct page *page)
> >  {
> > -	return !page_is_pfmemalloc(page) && page_to_nid(page) == pool->p.nid;
> > +	return !page_is_pfmemalloc(page);
> >  }
> >  
> >  void __page_pool_put_page(struct page_pool *pool, struct page *page,
> > @@ -484,7 +519,16 @@ EXPORT_SYMBOL(page_pool_destroy);
> >  /* Caller must provide appropriate safe context, e.g. NAPI. */
> >  void page_pool_update_nid(struct page_pool *pool, int new_nid)
> >  {
> > +	struct page *page;
> > +
> > +	WARN_ON(!in_serving_softirq());
> >  	trace_page_pool_update_nid(pool, new_nid);
> >  	pool->p.nid = new_nid;
> > +
> > +	/* Flush pool alloc cache, as refill will check NUMA node */
> > +	while (pool->alloc.count) {
> > +		page = pool->alloc.cache[--pool->alloc.count];
> > +		__page_pool_return_page(pool, page);
> > +	}
> >  }
> >  EXPORT_SYMBOL(page_pool_update_nid);
> > 


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 24+ messages in thread

* 答复: [net-next v4 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-19 12:00       ` Jesper Dangaard Brouer
@ 2019-12-19 12:47         ` Li,Rongqing
  0 siblings, 0 replies; 24+ messages in thread
From: Li,Rongqing @ 2019-12-19 12:47 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, linyunsheng, Ilias Apalodimas, Saeed Mahameed, mhocko,
	peterz, linux-kernel



On Wed, 18 Dec 2019 14:27:13 +0000
"Li,Rongqing" <lirongqing@baidu.com> wrote:

>> It seems it cannot handle the in-flight pages if we remove the node
>> id check from pool_page_reusable?

>I don't follow.
>I do think this patch handles in-flight pages, as they have to travel
>back-through the ptr_ring.


In-flight pages may loop between pool->alloc.cache and the in-flight state, and never travel through the ptr_ring.


-Li

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [net-next v4 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-19 12:09     ` Michal Hocko
@ 2019-12-19 13:35       ` Jesper Dangaard Brouer
  2019-12-19 14:52         ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: Jesper Dangaard Brouer @ 2019-12-19 13:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: netdev, lirongqing, linyunsheng, Ilias Apalodimas,
	Saeed Mahameed, peterz, linux-kernel, brouer

On Thu, 19 Dec 2019 13:09:25 +0100
Michal Hocko <mhocko@kernel.org> wrote:

> On Wed 18-12-19 09:01:35, Jesper Dangaard Brouer wrote:
> [...]
> > For the NUMA_NO_NODE case, when a NIC IRQ is moved to another NUMA
> > node, then ptr_ring will be emptied in 65 (PP_ALLOC_CACHE_REFILL+1)
> > chunks per allocation and allocation fall-through to the real
> > page-allocator with the new nid derived from numa_mem_id(). We accept
> > that transitioning the alloc cache doesn't happen immediately.  

Oh, I just realized that the drivers usually refill several RX
packet-pages at once, this means that this is called N times, meaning
during a NUMA change this will result in N * 65 pages returned.


> Could you explain what is the expected semantic of NUMA_NO_NODE in this
> case? Does it imply always the preferred locality? See my other email[1] to
> this matter.

I do think we want NUMA_NO_NODE to mean preferred locality. My code
allows the page to come from a remote NUMA node, but once it travels
back via the ptr_ring, we release pages not belonging to the local NUMA
node (determined by the CPU processing RX packets from the driver's
RX-ring).


> [1] http://lkml.kernel.org/r/20191219115338.GC26945@dhcp22.suse.cz

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [net-next v5 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-18  7:44 ` Jesper Dangaard Brouer
  2019-12-18  8:01   ` [net-next v4 " Jesper Dangaard Brouer
@ 2019-12-19 14:20   ` Jesper Dangaard Brouer
  2019-12-20 10:23     ` Ilias Apalodimas
  1 sibling, 1 reply; 24+ messages in thread
From: Jesper Dangaard Brouer @ 2019-12-19 14:20 UTC (permalink / raw)
  To: netdev
  Cc: Jesper Dangaard Brouer, lirongqing, linyunsheng,
	Ilias Apalodimas, Saeed Mahameed, mhocko, peterz, linux-kernel

The check in pool_page_reusable (page_to_nid(page) == pool->p.nid) is
not valid if page_pool was configured with pool->p.nid = NUMA_NO_NODE.

The goal of the NUMA changes in commit d5394610b1ba ("page_pool: Don't
recycle non-reusable pages") was to have RX-pages that belong to the
same NUMA node as the CPU processing RX-packets during softirq/NAPI,
as illustrated by the performance measurements.

This patch moves the NAPI checks out of fast-path, and at the same time
solves the NUMA_NO_NODE issue.

First realize that alloc_pages_node() with pool->p.nid = NUMA_NO_NODE
will look up the current CPU's nid (NUMA id) via numa_mem_id(), which
is used as the preferred nid.  It is only in rare situations, e.g.
when a NUMA zone runs dry, that the page doesn't get allocated from
the preferred nid.  The page_pool API allows drivers to control the
nid themselves via pool->p.nid.

This patch moves the NAPI check to when alloc cache is refilled, via
dequeuing/consuming pages from the ptr_ring. Thus, we can allow placing
pages from remote NUMA into the ptr_ring, as the dequeue/consume step
will check the NUMA node. All current drivers using page_pool will
alloc/refill RX-ring from same CPU running softirq/NAPI process.

Drivers that control the nid explicitly also use page_pool_update_nid
when changing the nid at runtime.  To speed up the transition to the
new nid, the alloc cache is now flushed on nid changes.  This forces
pages to come from the ptr_ring, which does the appropriate nid check.

For the NUMA_NO_NODE case, when a NIC IRQ is moved to another NUMA
node, we accept that transitioning the alloc cache doesn't happen
immediately. The preferred nid is changed at runtime by consulting
numa_mem_id(), based on the CPU processing RX-packets.

Notice, to avoid stressing the page buddy allocator and avoid doing too
much work under softirq with preempt disabled, the NUMA check at
ptr_ring dequeue will break the refill cycle when detecting a NUMA
mismatch. This will cause a slower transition, but it's done on purpose.

Fixes: d5394610b1ba ("page_pool: Don't recycle non-reusable pages")
Reported-by: Li RongQing <lirongqing@baidu.com>
Reported-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/page_pool.c |   80 ++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 61 insertions(+), 19 deletions(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index a6aefe989043..748f0d36f4be 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -96,10 +96,60 @@ struct page_pool *page_pool_create(const struct page_pool_params *params)
 }
 EXPORT_SYMBOL(page_pool_create);
 
+static void __page_pool_return_page(struct page_pool *pool, struct page *page);
+
+noinline
+static struct page *page_pool_refill_alloc_cache(struct page_pool *pool,
+						 bool refill)
+{
+	struct ptr_ring *r = &pool->ring;
+	struct page *page;
+	int pref_nid; /* preferred NUMA node */
+
+	/* Quicker fallback, avoid locks when ring is empty */
+	if (__ptr_ring_empty(r))
+		return NULL;
+
+	/* Softirq guarantees the CPU, and thus the NUMA node, is stable. This
+	 * assumes the CPU refilling the driver RX-ring will also run RX-NAPI.
+	 */
+	pref_nid = (pool->p.nid == NUMA_NO_NODE) ? numa_mem_id() : pool->p.nid;
+
+	/* Slower-path: Get pages from locked ring queue */
+	spin_lock(&r->consumer_lock);
+
+	/* Refill alloc array, but only if NUMA match */
+	do {
+		page = __ptr_ring_consume(r);
+		if (unlikely(!page))
+			break;
+
+		if (likely(page_to_nid(page) == pref_nid)) {
+			pool->alloc.cache[pool->alloc.count++] = page;
+		} else {
+			/* NUMA mismatch;
+			 * (1) release 1 page to page-allocator and
+			 * (2) break out to fall through to alloc_pages_node().
+			 * This limits stress on the page buddy allocator.
+			 */
+			__page_pool_return_page(pool, page);
+			page = NULL;
+			break;
+		}
+	} while (pool->alloc.count < PP_ALLOC_CACHE_REFILL &&
+		 refill);
+
+	/* Return last page */
+	if (likely(pool->alloc.count > 0))
+		page = pool->alloc.cache[--pool->alloc.count];
+
+	spin_unlock(&r->consumer_lock);
+	return page;
+}
+
 /* fast path */
 static struct page *__page_pool_get_cached(struct page_pool *pool)
 {
-	struct ptr_ring *r = &pool->ring;
 	bool refill = false;
 	struct page *page;
 
@@ -113,20 +163,7 @@ static struct page *__page_pool_get_cached(struct page_pool *pool)
 		refill = true;
 	}
 
-	/* Quicker fallback, avoid locks when ring is empty */
-	if (__ptr_ring_empty(r))
-		return NULL;
-
-	/* Slow-path: Get page from locked ring queue,
-	 * refill alloc array if requested.
-	 */
-	spin_lock(&r->consumer_lock);
-	page = __ptr_ring_consume(r);
-	if (refill)
-		pool->alloc.count = __ptr_ring_consume_batched(r,
-							pool->alloc.cache,
-							PP_ALLOC_CACHE_REFILL);
-	spin_unlock(&r->consumer_lock);
+	page = page_pool_refill_alloc_cache(pool, refill);
 	return page;
 }
 
@@ -311,13 +348,10 @@ static bool __page_pool_recycle_direct(struct page *page,
 
 /* page is NOT reusable when:
  * 1) allocated when system is under some pressure. (page_is_pfmemalloc)
- * 2) belongs to a different NUMA node than pool->p.nid.
- *
- * To update pool->p.nid users must call page_pool_update_nid.
  */
 static bool pool_page_reusable(struct page_pool *pool, struct page *page)
 {
-	return !page_is_pfmemalloc(page) && page_to_nid(page) == pool->p.nid;
+	return !page_is_pfmemalloc(page);
 }
 
 void __page_pool_put_page(struct page_pool *pool, struct page *page,
@@ -484,7 +518,15 @@ EXPORT_SYMBOL(page_pool_destroy);
 /* Caller must provide appropriate safe context, e.g. NAPI. */
 void page_pool_update_nid(struct page_pool *pool, int new_nid)
 {
+	struct page *page;
+
 	trace_page_pool_update_nid(pool, new_nid);
 	pool->p.nid = new_nid;
+
+	/* Flush pool alloc cache, as refill will check NUMA node */
+	while (pool->alloc.count) {
+		page = pool->alloc.cache[--pool->alloc.count];
+		__page_pool_return_page(pool, page);
+	}
 }
 EXPORT_SYMBOL(page_pool_update_nid);



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [net-next v4 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-19 13:35       ` Jesper Dangaard Brouer
@ 2019-12-19 14:52         ` Michal Hocko
  2019-12-19 15:28           ` Ilias Apalodimas
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2019-12-19 14:52 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, lirongqing, linyunsheng, Ilias Apalodimas,
	Saeed Mahameed, peterz, linux-kernel

On Thu 19-12-19 14:35:35, Jesper Dangaard Brouer wrote:
> On Thu, 19 Dec 2019 13:09:25 +0100
> Michal Hocko <mhocko@kernel.org> wrote:
> 
> > On Wed 18-12-19 09:01:35, Jesper Dangaard Brouer wrote:
> > [...]
> > > For the NUMA_NO_NODE case, when a NIC IRQ is moved to another NUMA
> > > node, then ptr_ring will be emptied in 65 (PP_ALLOC_CACHE_REFILL+1)
> > > chunks per allocation and allocation fall-through to the real
> > > page-allocator with the new nid derived from numa_mem_id(). We accept
> > > that transitioning the alloc cache doesn't happen immediately.  
> 
> Oh, I just realized that the drivers usually refill several RX
> packet-pages at once, this means that this is called N times, meaning
> during a NUMA change this will result in N * 65 pages returned.
> 
> 
> > Could you explain what is the expected semantic of NUMA_NO_NODE in this
> > case? Does it imply always the preferred locality? See my other email[1] to
> > this matter.
> 
> I do think we want NUMA_NO_NODE to mean preferred locality.

I obviously have no saying here because I am not really familiar with
the users of this API but I would note that if there is such an implicit
assumption then you make it impossible to use the numa agnostic page
pool allocator (aka fast reallocation). This might not be important here
but future extension would be harder (you can still hack it around aka
NUMA_REALLY_NO_NODE). My experience tells me that people are quite
creative and usually require (or worse assume) semantics that you
thought were not useful.

That being said, if the NUMA_NO_NODE really should have a special
locality meaning then document it explicitly at least.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [net-next v4 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-19 14:52         ` Michal Hocko
@ 2019-12-19 15:28           ` Ilias Apalodimas
  0 siblings, 0 replies; 24+ messages in thread
From: Ilias Apalodimas @ 2019-12-19 15:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jesper Dangaard Brouer, netdev, lirongqing, linyunsheng,
	Saeed Mahameed, peterz, linux-kernel

On Thu, Dec 19, 2019 at 03:52:06PM +0100, Michal Hocko wrote:
> On Thu 19-12-19 14:35:35, Jesper Dangaard Brouer wrote:
> > On Thu, 19 Dec 2019 13:09:25 +0100
> > Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > > On Wed 18-12-19 09:01:35, Jesper Dangaard Brouer wrote:
> > > [...]
> > > > For the NUMA_NO_NODE case, when a NIC IRQ is moved to another NUMA
> > > > node, then ptr_ring will be emptied in 65 (PP_ALLOC_CACHE_REFILL+1)
> > > > chunks per allocation and allocation fall-through to the real
> > > > page-allocator with the new nid derived from numa_mem_id(). We accept
> > > > that transitioning the alloc cache doesn't happen immediately.  
> > 
> > Oh, I just realized that the drivers usually refill several RX
> > packet-pages at once, this means that this is called N times, meaning
> > during a NUMA change this will result in N * 65 pages returned.
> > 
> > 
> > > Could you explain what is the expected semantic of NUMA_NO_NODE in this
> > > case? Does it imply always the preferred locality? See my other email[1] to
> > > this matter.
> > 
> > I do think we want NUMA_NO_NODE to mean preferred locality.
> 

Why? Wouldn't it be clearer if it meant "this is not NUMA aware"?
The way I see it, drivers that sit on specific SoCs, like the TI one
or the netsec one, can declare 'NUMA_NO_NODE' since they know
beforehand what hardware they'll be sitting on.
On PCI/USB pluggable interfaces mlx5 example should be followed.

> I obviously have no saying here because I am not really familiar with
> the users of this API but I would note that if there is such an implicit
> assumption then you make it impossible to use the numa agnostic page
> pool allocator (aka fast reallocation). This might be not important here
> but future extension would be harder (you can still hack it around aka
> NUMA_REALLY_NO_NODE). My experience tells me that people are quite
> creative and usually require (or worse assume) semantics that you
> thought were not useful.
> 
> That being said, if the NUMA_NO_NODE really should have a special
> locality meaning then document it explicitly at least.

Agree, if we treat it like this we have to document it somehow

> -- 
> Michal Hocko
> SUSE Labs

Thanks
/Ilias

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [net-next v5 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-19 14:20   ` [net-next v5 " Jesper Dangaard Brouer
@ 2019-12-20 10:23     ` Ilias Apalodimas
  2019-12-20 10:41       ` Jesper Dangaard Brouer
  2019-12-20 21:27       ` Saeed Mahameed
  0 siblings, 2 replies; 24+ messages in thread
From: Ilias Apalodimas @ 2019-12-20 10:23 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, lirongqing, linyunsheng, Saeed Mahameed, mhocko, peterz,
	linux-kernel

Hi Jesper, 

I like the overall approach since this moves the check out of the hot path.
@Saeed, since I have no hardware to test this on, would it be possible to check
that it still works fine for mlx5?

[...]
> +	struct ptr_ring *r = &pool->ring;
> +	struct page *page;
> +	int pref_nid; /* preferred NUMA node */
> +
> +	/* Quicker fallback, avoid locks when ring is empty */
> +	if (__ptr_ring_empty(r))
> +		return NULL;
> +
> +	/* Softirq guarantee CPU and thus NUMA node is stable. This,
> +	 * assumes CPU refilling driver RX-ring will also run RX-NAPI.
> +	 */
> +	pref_nid = (pool->p.nid == NUMA_NO_NODE) ? numa_mem_id() : pool->p.nid;

One of the use cases for this is that during the allocation we are not
guaranteed to pick up the correct NUMA node. 
This will get automatically fixed once the driver starts recycling packets. 

I don't feel strongly about this, since I don't usually like hiding value
changes from the user, but would it make sense to move this into
__page_pool_alloc_pages_slow() and change pool->p.nid?

Since alloc_pages_node() will replace NUMA_NO_NODE with numa_mem_id()
regardless, why not store the actual node in our page pool information?
You can then skip this and check pool->p.nid == numa_mem_id(), regardless of
what's configured. 

Thanks
/Ilias

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [net-next v5 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-20 10:23     ` Ilias Apalodimas
@ 2019-12-20 10:41       ` Jesper Dangaard Brouer
  2019-12-20 10:49         ` Ilias Apalodimas
  2019-12-20 21:27       ` Saeed Mahameed
  1 sibling, 1 reply; 24+ messages in thread
From: Jesper Dangaard Brouer @ 2019-12-20 10:41 UTC (permalink / raw)
  To: Ilias Apalodimas
  Cc: netdev, lirongqing, linyunsheng, Saeed Mahameed, mhocko, peterz,
	linux-kernel, brouer

On Fri, 20 Dec 2019 12:23:14 +0200
Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:

> Hi Jesper, 
> 
> I like the overall approach since this moves the check out of  the hotpath. 
> @Saeed, since i got no hardware to test this on, would it be possible to check
> that it still works fine for mlx5?
> 
> [...]
> > +	struct ptr_ring *r = &pool->ring;
> > +	struct page *page;
> > +	int pref_nid; /* preferred NUMA node */
> > +
> > +	/* Quicker fallback, avoid locks when ring is empty */
> > +	if (__ptr_ring_empty(r))
> > +		return NULL;
> > +
> > +	/* Softirq guarantee CPU and thus NUMA node is stable. This,
> > +	 * assumes CPU refilling driver RX-ring will also run RX-NAPI.
> > +	 */
> > +	pref_nid = (pool->p.nid == NUMA_NO_NODE) ? numa_mem_id() : pool->p.nid;  
> 
> One of the use cases for this is that during the allocation we are not
> guaranteed to pick up the correct NUMA node. 
> This will get automatically fixed once the driver starts recycling packets. 
> 
> I don't feel strongly about this, since i don't usually like hiding value
> changes from the user but, would it make sense to move this into 
> __page_pool_alloc_pages_slow() and change the pool->p.nid?
> 
> Since alloc_pages_node() will replace NUMA_NO_NODE with numa_mem_id()
> regardless, why not store the actual node in our page pool information?
> You can then skip this and check pool->p.nid == numa_mem_id(), regardless of
> what's configured. 

This single code line helps support that drivers can control the nid
themselves.  This is a feature that is only used by mlx5 AFAIK.

I do think it is useful to allow the driver to "control" the nid, as
pinning/preferring pages to come from the NUMA node where the PCIe
controller hardware is installed does have benefits.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [net-next v5 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-20 10:41       ` Jesper Dangaard Brouer
@ 2019-12-20 10:49         ` Ilias Apalodimas
  2019-12-20 15:22           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 24+ messages in thread
From: Ilias Apalodimas @ 2019-12-20 10:49 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, lirongqing, linyunsheng, Saeed Mahameed, mhocko, peterz,
	linux-kernel

On Fri, Dec 20, 2019 at 11:41:16AM +0100, Jesper Dangaard Brouer wrote:
> On Fri, 20 Dec 2019 12:23:14 +0200
> Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:
> 
> > Hi Jesper, 
> > 
> > I like the overall approach since this moves the check out of  the hotpath. 
> > @Saeed, since i got no hardware to test this on, would it be possible to check
> > that it still works fine for mlx5?
> > 
> > [...]
> > > +	struct ptr_ring *r = &pool->ring;
> > > +	struct page *page;
> > > +	int pref_nid; /* preferred NUMA node */
> > > +
> > > +	/* Quicker fallback, avoid locks when ring is empty */
> > > +	if (__ptr_ring_empty(r))
> > > +		return NULL;
> > > +
> > > +	/* Softirq guarantee CPU and thus NUMA node is stable. This,
> > > +	 * assumes CPU refilling driver RX-ring will also run RX-NAPI.
> > > +	 */
> > > +	pref_nid = (pool->p.nid == NUMA_NO_NODE) ? numa_mem_id() : pool->p.nid;  
> > 
> > One of the use cases for this is that during the allocation we are not
> > guaranteed to pick up the correct NUMA node. 
> > This will get automatically fixed once the driver starts recycling packets. 
> > 
> > I don't feel strongly about this, since i don't usually like hiding value
> > changes from the user but, would it make sense to move this into 
> > __page_pool_alloc_pages_slow() and change the pool->p.nid?
> > 
> > Since alloc_pages_node() will replace NUMA_NO_NODE with numa_mem_id()
> > regardless, why not store the actual node in our page pool information?
> > You can then skip this and check pool->p.nid == numa_mem_id(), regardless of
> > what's configured. 
> 
> This single code line helps support that drivers can control the nid
> themselves.  This is a feature that is only used by mlx5 AFAIK.
> 
> I do think that is useful to allow the driver to "control" the nid, as
> pinning/preferring the pages to come from the NUMA node that matches
> the PCI-e controller hardware is installed in does have benefits.

Sure you can keep the if statement as-is, it won't break anything. 
Would we want to store the actual numa id in pool->p.nid if the user selects 
'NUMA_NO_NODE'?


Thanks
/Ilias
> 
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [net-next v5 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-20 10:49         ` Ilias Apalodimas
@ 2019-12-20 15:22           ` Jesper Dangaard Brouer
  2019-12-20 16:06             ` Ilias Apalodimas
  0 siblings, 1 reply; 24+ messages in thread
From: Jesper Dangaard Brouer @ 2019-12-20 15:22 UTC (permalink / raw)
  To: Ilias Apalodimas
  Cc: netdev, lirongqing, linyunsheng, Saeed Mahameed, mhocko, peterz,
	linux-kernel, brouer

On Fri, 20 Dec 2019 12:49:37 +0200
Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:

> On Fri, Dec 20, 2019 at 11:41:16AM +0100, Jesper Dangaard Brouer wrote:
> > On Fri, 20 Dec 2019 12:23:14 +0200
> > Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:
> >   
> > > Hi Jesper, 
> > > 
> > > I like the overall approach since this moves the check out of  the hotpath. 
> > > @Saeed, since i got no hardware to test this on, would it be possible to check
> > > that it still works fine for mlx5?
> > > 
> > > [...]  
> > > > +	struct ptr_ring *r = &pool->ring;
> > > > +	struct page *page;
> > > > +	int pref_nid; /* preferred NUMA node */
> > > > +
> > > > +	/* Quicker fallback, avoid locks when ring is empty */
> > > > +	if (__ptr_ring_empty(r))
> > > > +		return NULL;
> > > > +
> > > > +	/* Softirq guarantee CPU and thus NUMA node is stable. This,
> > > > +	 * assumes CPU refilling driver RX-ring will also run RX-NAPI.
> > > > +	 */
> > > > +	pref_nid = (pool->p.nid == NUMA_NO_NODE) ? numa_mem_id() : pool->p.nid;    
> > > 
> > > One of the use cases for this is that during the allocation we are not
> > > guaranteed to pick up the correct NUMA node. 
> > > This will get automatically fixed once the driver starts recycling packets. 
> > > 
> > > I don't feel strongly about this, since i don't usually like hiding value
> > > changes from the user but, would it make sense to move this into 
> > > __page_pool_alloc_pages_slow() and change the pool->p.nid?
> > > 
> > > Since alloc_pages_node() will replace NUMA_NO_NODE with numa_mem_id()
> > > regardless, why not store the actual node in our page pool information?
> > > You can then skip this and check pool->p.nid == numa_mem_id(), regardless of
> > > what's configured.   
> > 
> > This single code line helps support that drivers can control the nid
> > themselves.  This is a feature that is only used my mlx5 AFAIK.
> > 
> > I do think that is useful to allow the driver to "control" the nid, as
> > pinning/preferring the pages to come from the NUMA node that matches
> > the PCI-e controller hardware is installed in does have benefits.  
> 
> Sure you can keep the if statement as-is, it won't break anything. 
> Would we want to store the actual numa id in pool->p.nid if the user
> selects 'NUMA_NO_NODE'?
 
No. pool->p.nid should stay as NUMA_NO_NODE, because that keeps it
dynamic.  If someone moves an RX IRQ to another CPU on another NUMA
node, then the 'NUMA_NO_NODE' setting lets the pages transition
automatically.
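
For illustration, a minimal user-space sketch of that dynamic behavior (the stub and parameter names are mine; only the NUMA_NO_NODE value and the ternary mirror the actual patch):

```c
#include <assert.h>

#define NUMA_NO_NODE (-1)	/* same value the kernel uses */

/* Stand-in for numa_mem_id(): in softirq context the kernel returns
 * the NUMA node of the CPU currently running RX-NAPI, so we model it
 * as a plain parameter here. */
static int numa_mem_id_stub(int napi_cpu_node)
{
	return napi_cpu_node;
}

/* The pref_nid one-liner from the patch: with NUMA_NO_NODE the
 * preferred node simply follows wherever RX-NAPI runs, so moving the
 * RX IRQ to a CPU on another node changes pref_nid automatically. */
static int pick_pref_nid(int pool_nid, int napi_cpu_node)
{
	return (pool_nid == NUMA_NO_NODE) ?
		numa_mem_id_stub(napi_cpu_node) : pool_nid;
}
```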

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [net-next v5 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-20 15:22           ` Jesper Dangaard Brouer
@ 2019-12-20 16:06             ` Ilias Apalodimas
  2019-12-23  7:57               ` Ilias Apalodimas
  0 siblings, 1 reply; 24+ messages in thread
From: Ilias Apalodimas @ 2019-12-20 16:06 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, lirongqing, linyunsheng, Saeed Mahameed, mhocko, peterz,
	linux-kernel

On Fri, Dec 20, 2019 at 04:22:54PM +0100, Jesper Dangaard Brouer wrote:
> On Fri, 20 Dec 2019 12:49:37 +0200
> Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:
> 
> > On Fri, Dec 20, 2019 at 11:41:16AM +0100, Jesper Dangaard Brouer wrote:
> > > On Fri, 20 Dec 2019 12:23:14 +0200
> > > Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:
> > >   
> > > > Hi Jesper, 
> > > > 
> > > > I like the overall approach since this moves the check out of  the hotpath. 
> > > > @Saeed, since i got no hardware to test this on, would it be possible to check
> > > > that it still works fine for mlx5?
> > > > 
> > > > [...]  
> > > > > +	struct ptr_ring *r = &pool->ring;
> > > > > +	struct page *page;
> > > > > +	int pref_nid; /* preferred NUMA node */
> > > > > +
> > > > > +	/* Quicker fallback, avoid locks when ring is empty */
> > > > > +	if (__ptr_ring_empty(r))
> > > > > +		return NULL;
> > > > > +
> > > > > +	/* Softirq guarantee CPU and thus NUMA node is stable. This,
> > > > > +	 * assumes CPU refilling driver RX-ring will also run RX-NAPI.
> > > > > +	 */
> > > > > +	pref_nid = (pool->p.nid == NUMA_NO_NODE) ? numa_mem_id() : pool->p.nid;    
> > > > 
> > > > One of the use cases for this is that during the allocation we are not
> > > > guaranteed to pick up the correct NUMA node. 
> > > > This will get automatically fixed once the driver starts recycling packets. 
> > > > 
> > > > I don't feel strongly about this, since i don't usually like hiding value
> > > > changes from the user but, would it make sense to move this into 
> > > > __page_pool_alloc_pages_slow() and change the pool->p.nid?
> > > > 
> > > > Since alloc_pages_node() will replace NUMA_NO_NODE with numa_mem_id()
> > > > regardless, why not store the actual node in our page pool information?
> > > > You can then skip this and check pool->p.nid == numa_mem_id(), regardless of
> > > > what's configured.   
> > > 
> > > This single code line helps support that drivers can control the nid
> > > themselves.  This is a feature that is only used my mlx5 AFAIK.
> > > 
> > > I do think that is useful to allow the driver to "control" the nid, as
> > > pinning/preferring the pages to come from the NUMA node that matches
> > > the PCI-e controller hardware is installed in does have benefits.  
> > 
> > Sure you can keep the if statement as-is, it won't break anything. 
> > Would we want to store the actual numa id in pool->p.nid if the user
> > selects 'NUMA_NO_NODE'?
>  
> No. pool->p.nid should stay as NUMA_NO_NODE, because that makes it
> dynamic.  If someone moves an RX IRQ to another CPU on another NUMA
> node, then this 'NUMA_NO_NODE' setting makes pages transitioned
> automatically.
Ok, this assumed that drivers were going to use page_pool_nid_changed(), but with
the current code we don't have to force them to do that. Let's keep this as-is.

I'll run a few more tests and wait in case Saeed gets a chance to test it, and
then send my Reviewed-by.

Cheers
/Ilias
> 
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [net-next v5 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-20 10:23     ` Ilias Apalodimas
  2019-12-20 10:41       ` Jesper Dangaard Brouer
@ 2019-12-20 21:27       ` Saeed Mahameed
  1 sibling, 0 replies; 24+ messages in thread
From: Saeed Mahameed @ 2019-12-20 21:27 UTC (permalink / raw)
  To: ilias.apalodimas, brouer
  Cc: peterz, linyunsheng, netdev, Li Rongqing, mhocko, linux-kernel

On Fri, 2019-12-20 at 12:23 +0200, Ilias Apalodimas wrote:
> Hi Jesper, 
> 
> I like the overall approach since this moves the check out of  the
> hotpath. 
> @Saeed, since i got no hardware to test this on, would it be possible
> to check
> that it still works fine for mlx5?

The idea seems reasonable;
I will need a day or two to test and review this.

The only thing we need to be careful about is how heavy the flush
operation is on NUMA changes: holding a spin lock and releasing all
pages at once.
Prior to this patch, page releasing was done per packet, so bulk page
flush should be an improvement here, but on the other hand we will be
holding a spin lock. I am not worried about spin lock contention,
though, just about the potential CPU spikes.

Thanks,
Saeed.
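
To make the concern concrete, a rough user-space sketch (simulated ring, the lock left as comments; the counter is only there to make the cost visible, none of this is the actual page_pool code):

```c
#include <assert.h>

struct fake_ring {
	int count;	/* pages currently cached in the ptr_ring */
};

/* Sketch of the nid-change flush: one locked section that consumes
 * and releases every cached page at once, instead of the per-packet
 * release we had before.  ops_under_lock makes the "CPU spike"
 * concern measurable in this toy model. */
static int flush_ring(struct fake_ring *r, int *ops_under_lock)
{
	int released = 0;

	/* spin_lock(&ring->lock); */
	while (r->count > 0) {
		r->count--;		/* __ptr_ring_consume() */
		released++;		/* put_page() */
		(*ops_under_lock)++;
	}
	/* spin_unlock(&ring->lock); */
	return released;
}
```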

> 
> [...]
> > +	struct ptr_ring *r = &pool->ring;
> > +	struct page *page;
> > +	int pref_nid; /* preferred NUMA node */
> > +
> > +	/* Quicker fallback, avoid locks when ring is empty */
> > +	if (__ptr_ring_empty(r))
> > +		return NULL;
> > +
> > +	/* Softirq guarantee CPU and thus NUMA node is stable. This,
> > +	 * assumes CPU refilling driver RX-ring will also run RX-NAPI.
> > +	 */
> > +	pref_nid = (pool->p.nid == NUMA_NO_NODE) ? numa_mem_id() :
> > pool->p.nid;
> 
> One of the use cases for this is that during the allocation we are
> not
> guaranteed to pick up the correct NUMA node. 
> This will get automatically fixed once the driver starts recycling
> packets. 
> 
> I don't feel strongly about this, since i don't usually like hiding
> value
> changes from the user but, would it make sense to move this into 
> __page_pool_alloc_pages_slow() and change the pool->p.nid?
> 
> Since alloc_pages_node() will replace NUMA_NO_NODE with numa_mem_id()
> regardless, why not store the actual node in our page pool
> information?
> You can then skip this and check pool->p.nid == numa_mem_id(),
> regardless of
> what's configured. 
> 
> Thanks
> /Ilias

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [net-next v5 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-20 16:06             ` Ilias Apalodimas
@ 2019-12-23  7:57               ` Ilias Apalodimas
  2019-12-23 16:52                 ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 24+ messages in thread
From: Ilias Apalodimas @ 2019-12-23  7:57 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, lirongqing, linyunsheng, Saeed Mahameed, mhocko, peterz,
	linux-kernel

Hi Jesper,

Looking at the overall path again, I still think we need to reconsider
pool->p.nid semantics.

As I said, I like the patch and the whole functionality, and the code seems fine,
but here's the current situation.
If a user sets pool->p.nid == NUMA_NO_NODE and wants to use
page_pool_update_nid(), the whole behavior feels a little odd.
The first check in page_pool_update_nid() will always be true (since .nid ==
NUMA_NO_NODE). Then we'll update this to a real nid, so we end up overwriting
what the user initially coded in.
This is close to what I proposed in the previous mails on this thread: always
store a real nid, even if the user explicitly requests NUMA_NO_NODE.

So the semantics are still a problem. I'll stick to what we initially suggested:
1. We either *always* store a real nid,
or
2. If NUMA_NO_NODE is present, ignore every other check and recycle the memory
blindly.
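
A hypothetical sketch of option 2, just to pin down the semantics (the function name and modeling of nids as plain ints are mine; only the NUMA_NO_NODE value mirrors the kernel):

```c
#include <assert.h>

#define NUMA_NO_NODE (-1)	/* mirrors the kernel's definition */

/* Option 2 spelled out: a stored NUMA_NO_NODE disables the per-page
 * node check entirely, i.e. every page is considered reusable. */
static int page_reusable_opt2(int page_nid, int pool_nid)
{
	return pool_nid == NUMA_NO_NODE || page_nid == pool_nid;
}
```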

Regards
/Ilias

On Fri, Dec 20, 2019 at 06:06:49PM +0200, Ilias Apalodimas wrote:
> On Fri, Dec 20, 2019 at 04:22:54PM +0100, Jesper Dangaard Brouer wrote:
> > On Fri, 20 Dec 2019 12:49:37 +0200
> > Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:
> > 
> > > On Fri, Dec 20, 2019 at 11:41:16AM +0100, Jesper Dangaard Brouer wrote:
> > > > On Fri, 20 Dec 2019 12:23:14 +0200
> > > > Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:
> > > >   
> > > > > Hi Jesper, 
> > > > > 
> > > > > I like the overall approach since this moves the check out of  the hotpath. 
> > > > > @Saeed, since i got no hardware to test this on, would it be possible to check
> > > > > that it still works fine for mlx5?
> > > > > 
> > > > > [...]  
> > > > > > +	struct ptr_ring *r = &pool->ring;
> > > > > > +	struct page *page;
> > > > > > +	int pref_nid; /* preferred NUMA node */
> > > > > > +
> > > > > > +	/* Quicker fallback, avoid locks when ring is empty */
> > > > > > +	if (__ptr_ring_empty(r))
> > > > > > +		return NULL;
> > > > > > +
> > > > > > +	/* Softirq guarantee CPU and thus NUMA node is stable. This,
> > > > > > +	 * assumes CPU refilling driver RX-ring will also run RX-NAPI.
> > > > > > +	 */
> > > > > > +	pref_nid = (pool->p.nid == NUMA_NO_NODE) ? numa_mem_id() : pool->p.nid;    
> > > > > 
> > > > > One of the use cases for this is that during the allocation we are not
> > > > > guaranteed to pick up the correct NUMA node. 
> > > > > This will get automatically fixed once the driver starts recycling packets. 
> > > > > 
> > > > > I don't feel strongly about this, since i don't usually like hiding value
> > > > > changes from the user but, would it make sense to move this into 
> > > > > __page_pool_alloc_pages_slow() and change the pool->p.nid?
> > > > > 
> > > > > Since alloc_pages_node() will replace NUMA_NO_NODE with numa_mem_id()
> > > > > regardless, why not store the actual node in our page pool information?
> > > > > You can then skip this and check pool->p.nid == numa_mem_id(), regardless of
> > > > > what's configured.   
> > > > 
> > > > This single code line helps support that drivers can control the nid
> > > > themselves.  This is a feature that is only used my mlx5 AFAIK.
> > > > 
> > > > I do think that is useful to allow the driver to "control" the nid, as
> > > > pinning/preferring the pages to come from the NUMA node that matches
> > > > the PCI-e controller hardware is installed in does have benefits.  
> > > 
> > > Sure you can keep the if statement as-is, it won't break anything. 
> > > Would we want to store the actual numa id in pool->p.nid if the user
> > > selects 'NUMA_NO_NODE'?
> >  
> > No. pool->p.nid should stay as NUMA_NO_NODE, because that makes it
> > dynamic.  If someone moves an RX IRQ to another CPU on another NUMA
> > node, then this 'NUMA_NO_NODE' setting makes pages transitioned
> > automatically.
> Ok this assumed that drivers were going to use page_pool_nid_changed(), but with
> the current code we don't have to force them to do that. Let's keep this as-is.
> 
> I'll be running a few more tests  and wait in case Saeed gets a chance to test
> it and send my reviewed-by
> 
> Cheers
> /Ilias
> > 
> > -- 
> > Best regards,
> >   Jesper Dangaard Brouer
> >   MSc.CS, Principal Kernel Engineer at Red Hat
> >   LinkedIn: http://www.linkedin.com/in/brouer
> > 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [net-next v5 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-23  7:57               ` Ilias Apalodimas
@ 2019-12-23 16:52                 ` Jesper Dangaard Brouer
  2019-12-23 22:10                   ` Saeed Mahameed
  2019-12-24  7:41                   ` Ilias Apalodimas
  0 siblings, 2 replies; 24+ messages in thread
From: Jesper Dangaard Brouer @ 2019-12-23 16:52 UTC (permalink / raw)
  To: Ilias Apalodimas
  Cc: netdev, lirongqing, linyunsheng, Saeed Mahameed, mhocko, peterz,
	linux-kernel, brouer

On Mon, 23 Dec 2019 09:57:00 +0200
Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:

> Hi Jesper,
> 
> Looking at the overall path again, i still need we need to reconsider 
> pool->p.nid semantics.
> 
> As i said i like the patch and the whole functionality and code seems fine,
> but here's the current situation.

> If a user sets pool->p.nid == NUMA_NO_NODE and wants to use
> page_pool_update_nid() the whole behavior feels a liitle odd.

As soon as a driver uses page_pool_update_nid(), that means it wants to
control the NUMA placement explicitly.  As soon as that happens, it is
the driver's responsibility and choice, and the page_pool API must
respect that (and not change it behind the driver's back).


> page_pool_update_nid() first check will always be true since .nid =
> NUMA_NO_NODE). Then we'll update this to a real nid. So we end up
> overwriting what the user initially coded in.
>
> This is close to what i proposed in the previous mails on this
> thread. Always store a real nid even if the user explicitly requests
> NUMA_NO_NODE.
> 
> So  semantics is still a problem. I'll stick to what we initially
> suggested.
>  1. We either *always* store a real nid
> or 
>  2. If NUMA_NO_NODE is present ignore every other check and recycle
>  the memory blindly. 
> 

Hmm... I actually disagree with both 1 and 2.

My semantics proposal:
If a driver configures page_pool with NUMA_NO_NODE, then page_pool tries
to help get the best default performance. (Which, according to
performance measurements, is to have RX-pages belong to the NUMA node
RX-processing runs on.)

The reason I want this behavior is that during driver init/boot, it can
easily happen that a driver allocates RX-pages from the wrong NUMA node.
This will cause a performance slowdown that normally doesn't happen,
because without a cache (like page_pool) RX-pages would fairly quickly
transition over to the RX NUMA node (instead we keep recycling these,
in your case #2, where you suggest recycling blindly in case of
NUMA_NO_NODE). IMHO page_pool should hide this border-line case from
driver developers.

--Jesper


> On Fri, Dec 20, 2019 at 06:06:49PM +0200, Ilias Apalodimas wrote:
> > On Fri, Dec 20, 2019 at 04:22:54PM +0100, Jesper Dangaard Brouer
> > wrote:  
> > > On Fri, 20 Dec 2019 12:49:37 +0200
> > > Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:
> > >   
> > > > On Fri, Dec 20, 2019 at 11:41:16AM +0100, Jesper Dangaard
> > > > Brouer wrote:  
> > > > > On Fri, 20 Dec 2019 12:23:14 +0200
> > > > > Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:
> > > > >     
> > > > > > Hi Jesper, 
> > > > > > 
> > > > > > I like the overall approach since this moves the check out
> > > > > > of  the hotpath. @Saeed, since i got no hardware to test
> > > > > > this on, would it be possible to check that it still works
> > > > > > fine for mlx5?
> > > > > > 
> > > > > > [...]    
> > > > > > > +	struct ptr_ring *r = &pool->ring;
> > > > > > > +	struct page *page;
> > > > > > > +	int pref_nid; /* preferred NUMA node */
> > > > > > > +
> > > > > > > +	/* Quicker fallback, avoid locks when ring is
> > > > > > > empty */
> > > > > > > +	if (__ptr_ring_empty(r))
> > > > > > > +		return NULL;
> > > > > > > +
> > > > > > > +	/* Softirq guarantee CPU and thus NUMA node is
> > > > > > > stable. This,
> > > > > > > +	 * assumes CPU refilling driver RX-ring will
> > > > > > > also run RX-NAPI.
> > > > > > > +	 */
> > > > > > > +	pref_nid = (pool->p.nid == NUMA_NO_NODE) ?
> > > > > > > numa_mem_id() : pool->p.nid;      
> > > > > > 
> > > > > > One of the use cases for this is that during the allocation
> > > > > > we are not guaranteed to pick up the correct NUMA node. 
> > > > > > This will get automatically fixed once the driver starts
> > > > > > recycling packets. 
> > > > > > 
> > > > > > I don't feel strongly about this, since i don't usually
> > > > > > like hiding value changes from the user but, would it make
> > > > > > sense to move this into __page_pool_alloc_pages_slow() and
> > > > > > change the pool->p.nid?
> > > > > > 
> > > > > > Since alloc_pages_node() will replace NUMA_NO_NODE with
> > > > > > numa_mem_id() regardless, why not store the actual node in
> > > > > > our page pool information? You can then skip this and check
> > > > > > pool->p.nid == numa_mem_id(), regardless of what's
> > > > > > configured.     
> > > > > 
> > > > > This single code line helps support that drivers can control
> > > > > the nid themselves.  This is a feature that is only used my
> > > > > mlx5 AFAIK.
> > > > > 
> > > > > I do think that is useful to allow the driver to "control"
> > > > > the nid, as pinning/preferring the pages to come from the
> > > > > NUMA node that matches the PCI-e controller hardware is
> > > > > installed in does have benefits.    
> > > > 
> > > > Sure you can keep the if statement as-is, it won't break
> > > > anything. Would we want to store the actual numa id in
> > > > pool->p.nid if the user selects 'NUMA_NO_NODE'?  
> > >  
> > > No. pool->p.nid should stay as NUMA_NO_NODE, because that makes it
> > > dynamic.  If someone moves an RX IRQ to another CPU on another
> > > NUMA node, then this 'NUMA_NO_NODE' setting makes pages
> > > transitioned automatically.  
> > Ok this assumed that drivers were going to use
> > page_pool_nid_changed(), but with the current code we don't have to
> > force them to do that. Let's keep this as-is.
> > 
> > I'll be running a few more tests  and wait in case Saeed gets a
> > chance to test it and send my reviewed-by


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [net-next v5 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-23 16:52                 ` Jesper Dangaard Brouer
@ 2019-12-23 22:10                   ` Saeed Mahameed
  2019-12-24  9:34                     ` Ilias Apalodimas
  2019-12-24  7:41                   ` Ilias Apalodimas
  1 sibling, 1 reply; 24+ messages in thread
From: Saeed Mahameed @ 2019-12-23 22:10 UTC (permalink / raw)
  To: ilias.apalodimas, brouer
  Cc: peterz, linyunsheng, netdev, Li Rongqing, mhocko, linux-kernel

On Mon, 2019-12-23 at 17:52 +0100, Jesper Dangaard Brouer wrote:
> On Mon, 23 Dec 2019 09:57:00 +0200
> Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:
> 
> > Hi Jesper,
> > 
> > Looking at the overall path again, i still need we need to
> > reconsider 
> > pool->p.nid semantics.
> > 
> > As i said i like the patch and the whole functionality and code
> > seems fine,
> > but here's the current situation.
> > If a user sets pool->p.nid == NUMA_NO_NODE and wants to use
> > page_pool_update_nid() the whole behavior feels a liitle odd.
> 
> As soon as driver uses page_pool_update_nid() than means they want to
> control the NUMA placement explicitly.  As soon as that happens, it
> is
> the drivers responsibility and choice, and page_pool API must respect
> that (and not automatically change that behind drivers back).
> 
> 
> > page_pool_update_nid() first check will always be true since .nid =
> > NUMA_NO_NODE). Then we'll update this to a real nid. So we end up
> > overwriting what the user initially coded in.
> > 
> > This is close to what i proposed in the previous mails on this
> > thread. Always store a real nid even if the user explicitly
> > requests
> > NUMA_NO_NODE.
> > 
> > So  semantics is still a problem. I'll stick to what we initially
> > suggested.
> >  1. We either *always* store a real nid
> > or 
> >  2. If NUMA_NO_NODE is present ignore every other check and recycle
> >  the memory blindly. 
> > 
> 
> Hmm... I actually disagree with both 1 and 2.
> 
> My semantics proposal:
> If driver configures page_pool with NUMA_NO_NODE, then page_pool
> tried
> to help get the best default performance. (Which according to
> performance measurements is to have RX-pages belong to the NUMA node
> RX-processing runs on).
> 

Which is the MSI-X affinity; the pool has no knowledge of that on
initialization.

> The reason I want this behavior is that during driver init/boot, it
> can
> easily happen that a driver allocates RX-pages from wrong NUMA node.
> This will cause a performance slowdown, that normally doesn't happen,
> because without a cache (like page_pool) RX-pages would fairly
> quickly
> transition over to the RX NUMA node (instead we keep recycling these,
> in your case #2, where you suggest recycle blindly in case of
> NUMA_NO_NODE). IMHO page_pool should hide this border-line case from
> driver developers.
> 

So, Ilias's #1 suggestion makes sense: always store a valid nid
value.
The question is which value to store on initialization if the user
provided NUMA_NO_NODE? I don't think the pool is capable of choosing
the right value, so let's just use NUMA node 0.

If the developer cares, he would have picked the right affinity on
initialization, or he can just call pool_update_nid() when the affinity
is determined, and everything will be fine after that point.

My 2 cents is that you just can't provide perfect performance when
the user uses NUMA_NO_NODE, so just pick any default concrete node id
and avoid dealing with NUMA_NO_NODE in the pool fast or even slow
path.
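
In sketch form (my naming; the node-0 fallback is the suggestion above, not existing page_pool behavior):

```c
#include <assert.h>

#define NUMA_NO_NODE (-1)	/* mirrors the kernel's definition */

/* Never keep NUMA_NO_NODE in pool->p.nid: fall back to a concrete
 * node (0 here) until the driver calls page_pool_update_nid() with
 * the node it actually wants. */
static int init_pool_nid(int requested_nid)
{
	return (requested_nid == NUMA_NO_NODE) ? 0 : requested_nid;
}

/* After init, an update just overwrites the stored concrete nid;
 * nothing in the pool ever needs to special-case NUMA_NO_NODE. */
static int update_pool_nid(int stored_nid, int new_nid)
{
	(void)stored_nid;	/* stored value is no longer special */
	return new_nid;
}
```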

> --Jesper
> 
> 
> > On Fri, Dec 20, 2019 at 06:06:49PM +0200, Ilias Apalodimas wrote:
> > > On Fri, Dec 20, 2019 at 04:22:54PM +0100, Jesper Dangaard Brouer
> > > wrote:  
> > > > On Fri, 20 Dec 2019 12:49:37 +0200
> > > > Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:
> > > >   
> > > > > On Fri, Dec 20, 2019 at 11:41:16AM +0100, Jesper Dangaard
> > > > > Brouer wrote:  
> > > > > > On Fri, 20 Dec 2019 12:23:14 +0200
> > > > > > Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:
> > > > > >     
> > > > > > > Hi Jesper, 
> > > > > > > 
> > > > > > > I like the overall approach since this moves the check
> > > > > > > out
> > > > > > > of  the hotpath. @Saeed, since i got no hardware to test
> > > > > > > this on, would it be possible to check that it still
> > > > > > > works
> > > > > > > fine for mlx5?
> > > > > > > 
> > > > > > > [...]    
> > > > > > > > +	struct ptr_ring *r = &pool->ring;
> > > > > > > > +	struct page *page;
> > > > > > > > +	int pref_nid; /* preferred NUMA node */
> > > > > > > > +
> > > > > > > > +	/* Quicker fallback, avoid locks when ring is
> > > > > > > > empty */
> > > > > > > > +	if (__ptr_ring_empty(r))
> > > > > > > > +		return NULL;
> > > > > > > > +
> > > > > > > > +	/* Softirq guarantee CPU and thus NUMA node is
> > > > > > > > stable. This,
> > > > > > > > +	 * assumes CPU refilling driver RX-ring will
> > > > > > > > also run RX-NAPI.
> > > > > > > > +	 */
> > > > > > > > +	pref_nid = (pool->p.nid == NUMA_NO_NODE) ?
> > > > > > > > numa_mem_id() : pool->p.nid;      
> > > > > > > 
> > > > > > > One of the use cases for this is that during the
> > > > > > > allocation
> > > > > > > we are not guaranteed to pick up the correct NUMA node. 
> > > > > > > This will get automatically fixed once the driver starts
> > > > > > > recycling packets. 
> > > > > > > 
> > > > > > > I don't feel strongly about this, since i don't usually
> > > > > > > like hiding value changes from the user but, would it
> > > > > > > make
> > > > > > > sense to move this into __page_pool_alloc_pages_slow()
> > > > > > > and
> > > > > > > change the pool->p.nid?
> > > > > > > 
> > > > > > > Since alloc_pages_node() will replace NUMA_NO_NODE with
> > > > > > > numa_mem_id() regardless, why not store the actual node
> > > > > > > in
> > > > > > > our page pool information? You can then skip this and
> > > > > > > check
> > > > > > > pool->p.nid == numa_mem_id(), regardless of what's
> > > > > > > configured.     
> > > > > > 
> > > > > > This single code line helps support that drivers can
> > > > > > control
> > > > > > the nid themselves.  This is a feature that is only used my
> > > > > > mlx5 AFAIK.
> > > > > > 
> > > > > > I do think that is useful to allow the driver to "control"
> > > > > > the nid, as pinning/preferring the pages to come from the
> > > > > > NUMA node that matches the PCI-e controller hardware is
> > > > > > installed in does have benefits.    
> > > > > 
> > > > > Sure you can keep the if statement as-is, it won't break
> > > > > anything. Would we want to store the actual numa id in
> > > > > pool->p.nid if the user selects 'NUMA_NO_NODE'?  
> > > >  
> > > > No. pool->p.nid should stay as NUMA_NO_NODE, because that makes
> > > > it
> > > > dynamic.  If someone moves an RX IRQ to another CPU on another
> > > > NUMA node, then this 'NUMA_NO_NODE' setting makes pages
> > > > transitioned automatically.  
> > > Ok this assumed that drivers were going to use
> > > page_pool_nid_changed(), but with the current code we don't have
> > > to
> > > force them to do that. Let's keep this as-is.
> > > 
> > > I'll be running a few more tests  and wait in case Saeed gets a
> > > chance to test it and send my reviewed-by
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [net-next v5 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-23 16:52                 ` Jesper Dangaard Brouer
  2019-12-23 22:10                   ` Saeed Mahameed
@ 2019-12-24  7:41                   ` Ilias Apalodimas
  1 sibling, 0 replies; 24+ messages in thread
From: Ilias Apalodimas @ 2019-12-24  7:41 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, lirongqing, linyunsheng, Saeed Mahameed, mhocko, peterz,
	linux-kernel

Hi Jesper,

On Mon, Dec 23, 2019 at 05:52:57PM +0100, Jesper Dangaard Brouer wrote:
> On Mon, 23 Dec 2019 09:57:00 +0200
> Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:
> 
> > Hi Jesper,
> > 
> > Looking at the overall path again, i still need we need to reconsider 
> > pool->p.nid semantics.
> > 
> > As i said i like the patch and the whole functionality and code seems fine,
> > but here's the current situation.
> 
> > If a user sets pool->p.nid == NUMA_NO_NODE and wants to use
> > page_pool_update_nid() the whole behavior feels a liitle odd.
> 
> As soon as driver uses page_pool_update_nid() than means they want to
> control the NUMA placement explicitly.  As soon as that happens, it is
> the drivers responsibility and choice, and page_pool API must respect
> that (and not automatically change that behind drivers back).
> 
> 
> > page_pool_update_nid() first check will always be true since .nid =
> > NUMA_NO_NODE). Then we'll update this to a real nid. So we end up
> > overwriting what the user initially coded in.
> >
> > This is close to what i proposed in the previous mails on this
> > thread. Always store a real nid even if the user explicitly requests
> > NUMA_NO_NODE.
> > 
> > So  semantics is still a problem. I'll stick to what we initially
> > suggested.
> >  1. We either *always* store a real nid
> > or 
> >  2. If NUMA_NO_NODE is present ignore every other check and recycle
> >  the memory blindly. 
> > 
> 
> Hmm... I actually disagree with both 1 and 2.
> 
> My semantics proposal:
> If driver configures page_pool with NUMA_NO_NODE, then page_pool tried
> to help get the best default performance. (Which according to
> performance measurements is to have RX-pages belong to the NUMA node
> RX-processing runs on).
> 
> The reason I want this behavior is that during driver init/boot, it can
> easily happen that a driver allocates RX-pages from wrong NUMA node.
> This will cause a performance slowdown, that normally doesn't happen,
> because without a cache (like page_pool) RX-pages would fairly quickly
> transition over to the RX NUMA node (instead we keep recycling these,
> in your case #2, where you suggest recycle blindly in case of
> NUMA_NO_NODE). IMHO page_pool should hide this border-line case from
> driver developers.

Yeah, #2 has different semantics than the one you propose. If someone chooses
NUMA_NO_NODE, I'd expect the machine(s) the driver sits on are not NUMA-aware.
Think specific SoCs; I'd never expect PCI cards to use that.
As I said, I don't feel strongly about this anyway; it's just another case I had
under consideration, but I like what you propose more. I'll try to add
documentation on the page_pool API and describe the semantics you have in mind.


Thanks
/Ilias

> 
> --Jesper
> 
> 
> > On Fri, Dec 20, 2019 at 06:06:49PM +0200, Ilias Apalodimas wrote:
> > > On Fri, Dec 20, 2019 at 04:22:54PM +0100, Jesper Dangaard Brouer
> > > wrote:  
> > > > On Fri, 20 Dec 2019 12:49:37 +0200
> > > > Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:
> > > >   
> > > > > On Fri, Dec 20, 2019 at 11:41:16AM +0100, Jesper Dangaard
> > > > > Brouer wrote:  
> > > > > > On Fri, 20 Dec 2019 12:23:14 +0200
> > > > > > Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:
> > > > > >     
> > > > > > > Hi Jesper, 
> > > > > > > 
> > > > > > > I like the overall approach since this moves the check out
> > > > > > > of  the hotpath. @Saeed, since i got no hardware to test
> > > > > > > this on, would it be possible to check that it still works
> > > > > > > fine for mlx5?
> > > > > > > 
> > > > > > > [...]    
> > > > > > > > +	struct ptr_ring *r = &pool->ring;
> > > > > > > > +	struct page *page;
> > > > > > > > +	int pref_nid; /* preferred NUMA node */
> > > > > > > > +
> > > > > > > > +	/* Quicker fallback, avoid locks when ring is
> > > > > > > > empty */
> > > > > > > > +	if (__ptr_ring_empty(r))
> > > > > > > > +		return NULL;
> > > > > > > > +
> > > > > > > > +	/* Softirq guarantee CPU and thus NUMA node is
> > > > > > > > stable. This,
> > > > > > > > +	 * assumes CPU refilling driver RX-ring will
> > > > > > > > also run RX-NAPI.
> > > > > > > > +	 */
> > > > > > > > +	pref_nid = (pool->p.nid == NUMA_NO_NODE) ?
> > > > > > > > numa_mem_id() : pool->p.nid;      
> > > > > > > 
> > > > > > > One of the use cases for this is that during the allocation
> > > > > > > we are not guaranteed to pick up the correct NUMA node. 
> > > > > > > This will get automatically fixed once the driver starts
> > > > > > > recycling packets. 
> > > > > > > 
> > > > > > > I don't feel strongly about this, since i don't usually
> > > > > > > like hiding value changes from the user but, would it make
> > > > > > > sense to move this into __page_pool_alloc_pages_slow() and
> > > > > > > change the pool->p.nid?
> > > > > > > 
> > > > > > > Since alloc_pages_node() will replace NUMA_NO_NODE with
> > > > > > > numa_mem_id() regardless, why not store the actual node in
> > > > > > > our page pool information? You can then skip this and check
> > > > > > > pool->p.nid == numa_mem_id(), regardless of what's
> > > > > > > configured.     
> > > > > > 
> > > > > > This single code line helps support that drivers can control
> > > > > > the nid themselves.  This is a feature that is only used my
> > > > > > mlx5 AFAIK.
> > > > > > 
> > > > > > I do think that is useful to allow the driver to "control"
> > > > > > the nid, as pinning/preferring the pages to come from the
> > > > > > NUMA node that matches the PCI-e controller hardware is
> > > > > > installed in does have benefits.    
> > > > > 
> > > > > Sure you can keep the if statement as-is, it won't break
> > > > > anything. Would we want to store the actual numa id in
> > > > > pool->p.nid if the user selects 'NUMA_NO_NODE'?  
> > > >  
> > > > No. pool->p.nid should stay as NUMA_NO_NODE, because that makes it
> > > > dynamic.  If someone moves an RX IRQ to another CPU on another
> > > > NUMA node, then this 'NUMA_NO_NODE' setting makes pages
> > > > transition automatically.  
> > > Ok this assumed that drivers were going to use
> > > page_pool_nid_changed(), but with the current code we don't have to
> > > force them to do that. Let's keep this as-is.
> > > 
> > > I'll be running a few more tests  and wait in case Saeed gets a
> > > chance to test it and send my reviewed-by
> 
> 
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [net-next v5 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-23 22:10                   ` Saeed Mahameed
@ 2019-12-24  9:34                     ` Ilias Apalodimas
  0 siblings, 0 replies; 24+ messages in thread
From: Ilias Apalodimas @ 2019-12-24  9:34 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: brouer, peterz, linyunsheng, netdev, Li Rongqing, mhocko, linux-kernel

Hi Saeed, 
> which is the msix affinity.. the pool has no knowledge of that on
> initialization.
> 
> > The reason I want this behavior is that during driver init/boot, it
> > can
> > easily happen that a driver allocates RX-pages from wrong NUMA node.
> > This will cause a performance slowdown, that normally doesn't happen,
> > because without a cache (like page_pool) RX-pages would fairly
> > quickly
> > transition over to the RX NUMA node (instead we keep recycling these,
> > in your case #2, where you suggest recycle blindly in case of
> > NUMA_NO_NODE). IMHO page_pool should hide this border-line case from
> > driver developers.
> > 
> 
> So, Ilias's #1 suggestion makes sense, to always store a valid nid
> value. 
> The question is which value to store on initialization if the user
> provided NUMA_NO_NODE ? I don't think the pool is capable of choosing
> the right value, so let's just use numa node 0. 

Again i don't mind using the current solution. We could use 0 or whatever
numa node is chosen by alloc_pages_node()

> 
> If the developer cares, he would have picked the right affinity on
> initialization, or he can just call pool_update_nid() when the affinity
> is determined and everything will be fine after that point.
> 
> My 2 cents is that you just can't provide the perfect performance when
> the user uses NUMA_NO_NODE, so just pick any default concrete node id
> and avoid dealing with NUMA_NO_NODE in the pool fast or even slow
> path..

I don't have a strong preference on any of those. I just prefer the
homogeneous approach of always storing a normal usable memory id. Either
way, the rest of the code seems fine, so i'll approve this once you manage
to test it on your setup. 

I did test it on my netsec card using NUMA_NO_NODE. On that machine though it
doesn't make any difference, since page_to_nid(page) and numa_mem_id() always
return 0 there. So the allocation is already 'correct'; the only thing that
changes once i call page_pool_update_nid() is pool->p.nid

Thanks
/Ilias
> 
> > --Jesper
> > 
> > 
> > > On Fri, Dec 20, 2019 at 06:06:49PM +0200, Ilias Apalodimas wrote:
> > > > On Fri, Dec 20, 2019 at 04:22:54PM +0100, Jesper Dangaard Brouer
> > > > wrote:  
> > > > > On Fri, 20 Dec 2019 12:49:37 +0200
> > > > > Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:
> > > > >   
> > > > > > On Fri, Dec 20, 2019 at 11:41:16AM +0100, Jesper Dangaard
> > > > > > Brouer wrote:  
> > > > > > > On Fri, 20 Dec 2019 12:23:14 +0200
> > > > > > > Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote:
> > > > > > >     
> > > > > > > > Hi Jesper, 
> > > > > > > > 
> > > > > > > > I like the overall approach since this moves the check
> > > > > > > > out of the hotpath. @Saeed, since i got no hardware to
> > > > > > > > test this on, would it be possible to check that it
> > > > > > > > still works fine for mlx5?
> > > > > > > > 
> > > > > > > > [...]    
> > > > > > > > > +	struct ptr_ring *r = &pool->ring;
> > > > > > > > > +	struct page *page;
> > > > > > > > > +	int pref_nid; /* preferred NUMA node */
> > > > > > > > > +
> > > > > > > > > +	/* Quicker fallback, avoid locks when ring is empty */
> > > > > > > > > +	if (__ptr_ring_empty(r))
> > > > > > > > > +		return NULL;
> > > > > > > > > +
> > > > > > > > > +	/* Softirq guarantee CPU and thus NUMA node is stable. This,
> > > > > > > > > +	 * assumes CPU refilling driver RX-ring will also run RX-NAPI.
> > > > > > > > > +	 */
> > > > > > > > > +	pref_nid = (pool->p.nid == NUMA_NO_NODE) ? numa_mem_id() : pool->p.nid;
> > > > > > > > 
> > > > > > > > One of the use cases for this is that during the
> > > > > > > > allocation we are not guaranteed to pick up the correct
> > > > > > > > NUMA node.  This will get automatically fixed once the
> > > > > > > > driver starts recycling packets. 
> > > > > > > > 
> > > > > > > > I don't feel strongly about this, since i don't usually
> > > > > > > > like hiding value changes from the user but, would it
> > > > > > > > make sense to move this into
> > > > > > > > __page_pool_alloc_pages_slow() and change the pool->p.nid?
> > > > > > > > 
> > > > > > > > Since alloc_pages_node() will replace NUMA_NO_NODE with
> > > > > > > > numa_mem_id() regardless, why not store the actual node
> > > > > > > > in our page pool information? You can then skip this and
> > > > > > > > check pool->p.nid == numa_mem_id(), regardless of what's
> > > > > > > > configured.     
> > > > > > > 
> > > > > > > This single code line helps support that drivers can
> > > > > > > control the nid themselves.  This is a feature that is
> > > > > > > only used by mlx5 AFAIK.
> > > > > > > 
> > > > > > > I do think that is useful to allow the driver to "control"
> > > > > > > the nid, as pinning/preferring the pages to come from the
> > > > > > > NUMA node that matches the PCI-e controller hardware is
> > > > > > > installed in does have benefits.    
> > > > > > 
> > > > > > Sure you can keep the if statement as-is, it won't break
> > > > > > anything. Would we want to store the actual numa id in
> > > > > > pool->p.nid if the user selects 'NUMA_NO_NODE'?  
> > > > >  
> > > > > No. pool->p.nid should stay as NUMA_NO_NODE, because that
> > > > > makes it dynamic.  If someone moves an RX IRQ to another CPU
> > > > > on another NUMA node, then this 'NUMA_NO_NODE' setting makes
> > > > > pages transition automatically.  
> > > > Ok this assumed that drivers were going to use
> > > > page_pool_nid_changed(), but with the current code we don't
> > > > have to force them to do that. Let's keep this as-is.
> > > > 
> > > > I'll be running a few more tests  and wait in case Saeed gets a
> > > > chance to test it and send my reviewed-by
> > 
> > 

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2019-12-24  9:34 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-17 23:17 [net-next v3 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition Jesper Dangaard Brouer
2019-12-18  7:44 ` Jesper Dangaard Brouer
2019-12-18  8:01   ` [net-next v4 " Jesper Dangaard Brouer
2019-12-18 14:27     ` 答复: " Li,Rongqing
2019-12-19 12:00       ` Jesper Dangaard Brouer
2019-12-19 12:47         ` 答复: " Li,Rongqing
2019-12-19  1:52     ` Yunsheng Lin
2019-12-19 12:15       ` Jesper Dangaard Brouer
2019-12-19 12:09     ` Michal Hocko
2019-12-19 13:35       ` Jesper Dangaard Brouer
2019-12-19 14:52         ` Michal Hocko
2019-12-19 15:28           ` Ilias Apalodimas
2019-12-19 14:20   ` [net-next v5 " Jesper Dangaard Brouer
2019-12-20 10:23     ` Ilias Apalodimas
2019-12-20 10:41       ` Jesper Dangaard Brouer
2019-12-20 10:49         ` Ilias Apalodimas
2019-12-20 15:22           ` Jesper Dangaard Brouer
2019-12-20 16:06             ` Ilias Apalodimas
2019-12-23  7:57               ` Ilias Apalodimas
2019-12-23 16:52                 ` Jesper Dangaard Brouer
2019-12-23 22:10                   ` Saeed Mahameed
2019-12-24  9:34                     ` Ilias Apalodimas
2019-12-24  7:41                   ` Ilias Apalodimas
2019-12-20 21:27       ` Saeed Mahameed
