* Re: [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE condition
       [not found]     ` <831ed886842c894f7b2ffe83fe34705180a86b3b.camel@mellanox.com>
@ 2019-12-12  1:34       ` Yunsheng Lin
  2019-12-12 10:18         ` Jesper Dangaard Brouer
  2019-12-16 12:15         ` Michal Hocko
  0 siblings, 2 replies; 22+ messages in thread
From: Yunsheng Lin @ 2019-12-12  1:34 UTC (permalink / raw)
  To: Saeed Mahameed, brouer
  Cc: ilias.apalodimas, jonathan.lemon, Li Rongqing, netdev, mhocko,
	peterz, Greg Kroah-Hartman, bhelgaas, linux-kernel

+CC Michal, Peter, Greg and Bjorn
Because there has been discussion about where and how the NUMA_NO_NODE
should be handled before.

On 2019/12/12 5:24, Saeed Mahameed wrote:
> On Wed, 2019-12-11 at 19:49 +0100, Jesper Dangaard Brouer wrote:
>> On Sat, 7 Dec 2019 03:52:41 +0000
>> Saeed Mahameed <saeedm@mellanox.com> wrote:
>>
>>> I don't think it is correct to check that the page nid is same as
>>> numa_mem_id() if pool is NUMA_NO_NODE. In such case we should allow
>>> all
>>> pages to recycle, because you can't assume where pages are
>>> allocated
>>> from and where they are being handled.
>>
>> I agree, using numa_mem_id() is not valid, because it takes the numa
>> node id from the executing CPU and the call to __page_pool_put_page()
>> can happen on a remote CPU (e.g. cpumap redirect, and in future
>> SKBs).
>>
>>
>>> I suggest the following:
>>>
>>> return !page_pfmemalloc() && 
>>> ( page_to_nid(page) == pool->p.nid || pool->p.nid == NUMA_NO_NODE
>>> );
>>
>> Above code doesn't generate optimal ASM code, I suggest:
>>
>>  static bool pool_page_reusable(struct page_pool *pool, struct page
>> *page)
>>  {
>> 	return !page_is_pfmemalloc(page) &&
>> 		pool->p.nid != NUMA_NO_NODE &&
>> 		page_to_nid(page) == pool->p.nid;
>>  }
>>
> 
> this is not equivalent to the above. Here in case pool->p.nid is
> NUMA_NO_NODE, pool_page_reusable() will always be false.
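
For reference, a sketch of the semantics Saeed describes (this is just his
earlier suggestion written out, using the real helper name
page_is_pfmemalloc()):

 static bool pool_page_reusable(struct page_pool *pool, struct page *page)
 {
	return !page_is_pfmemalloc(page) &&
	       (pool->p.nid == NUMA_NO_NODE ||
		page_to_nid(page) == pool->p.nid);
 }
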
> 
> We can avoid the extra check in data path.
> How about avoiding NUMA_NO_NODE in page_pool altogether, and force
> numa_mem_id() as pool->p.nid when user requests NUMA_NO_NODE at page
> pool init, as already done in alloc_pages_node(). 

That means we will not support page reuse migration for NUMA_NO_NODE,
which is not the same semantic as alloc_pages_node()'s handling of NUMA_NO_NODE,
because alloc_pages_node() will allocate the page based on the node
of the currently running cpu.
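
For reference, alloc_pages_node() resolves NUMA_NO_NODE at each call, so
locality follows whichever CPU happens to be running at allocation time
(paraphrased from include/linux/gfp.h of this era):

 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
					     unsigned int order)
 {
	if (nid == NUMA_NO_NODE)
		nid = numa_mem_id();

	return __alloc_pages_node(nid, gfp_mask, order);
 }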

Also, there seems to be some wild guessing of the node id here, which has
been discussed before and has not reached an agreement yet.

> 
> which will imply recycling without adding any extra condition to the
> data path.
> 
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index a6aefe989043..00c99282a306 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -28,6 +28,9 @@ static int page_pool_init(struct page_pool *pool,
>  
>         memcpy(&pool->p, params, sizeof(pool->p));
>  
> +	/* overwrite to allow recycling.. */
> +       if (pool->p.nid == NUMA_NO_NODE) 
> +               pool->p.nid = numa_mem_id(); 
> +
> 
> After a quick look, i don't see any reason why to keep NUMA_NO_NODE in
> pool->p.nid.. 
> 
> 
>> I have compiled different variants and looked at the ASM code generated
>> by GCC.  This seems to give the best result.
>>
>>
>>> 1) never recycle emergency pages, regardless of pool nid.
>>> 2) always recycle if pool is NUMA_NO_NODE.
>>
>> Yes, this defines the semantics, that a page_pool configured with
>> NUMA_NO_NODE means skip NUMA checks.  I think that sounds okay...
>>
>>
>>> the above change should not add any overhead, a modest branch
>>> predictor will handle this with no effort.
>>
>> It still annoys me that we keep adding instructions to this code
>> hot-path (I counted 34 bytes and 11 instructions in my proposed
>> function).
>>
>> I think that it might be possible to move these NUMA checks to
>> alloc-side (instead of return/recycles side as today), and perhaps
>> only
>> on slow-path when dequeuing from ptr_ring (as recycles that call
>> __page_pool_recycle_direct() will be pinned during NAPI).  But lets
>> focus on a smaller fix for the immediate issue...
>>
> 
> I know. It annoys me too, but we need recycling to work in production :
> where rings/napi can migrate and numa nodes can be NUMA_NO_NODE :-(.
> 
> 



* Re: [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-12  1:34       ` [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE condition Yunsheng Lin
@ 2019-12-12 10:18         ` Jesper Dangaard Brouer
  2019-12-13  3:40           ` Yunsheng Lin
  2019-12-17 19:27           ` Saeed Mahameed
  2019-12-16 12:15         ` Michal Hocko
  1 sibling, 2 replies; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2019-12-12 10:18 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: Saeed Mahameed, ilias.apalodimas, jonathan.lemon, Li Rongqing,
	netdev, mhocko, peterz, Greg Kroah-Hartman, bhelgaas,
	linux-kernel, brouer, Björn Töpel

On Thu, 12 Dec 2019 09:34:14 +0800
Yunsheng Lin <linyunsheng@huawei.com> wrote:

> +CC Michal, Peter, Greg and Bjorn
> Because there has been discussion about where and how the NUMA_NO_NODE
> should be handled before.
> 
> On 2019/12/12 5:24, Saeed Mahameed wrote:
> > On Wed, 2019-12-11 at 19:49 +0100, Jesper Dangaard Brouer wrote:  
> >> On Sat, 7 Dec 2019 03:52:41 +0000
> >> Saeed Mahameed <saeedm@mellanox.com> wrote:
> >>  
> >>> I don't think it is correct to check that the page nid is same as
> >>> numa_mem_id() if pool is NUMA_NO_NODE. In such case we should allow
> >>> all  pages to recycle, because you can't assume where pages are
> >>> allocated from and where they are being handled.  
> >>
> >> I agree, using numa_mem_id() is not valid, because it takes the numa
> >> node id from the executing CPU and the call to __page_pool_put_page()
> >> can happen on a remote CPU (e.g. cpumap redirect, and in future
> >> SKBs).
> >>
> >>  
> >>> I suggest the following:
> >>>
> >>> return !page_pfmemalloc() && 
> >>> ( page_to_nid(page) == pool->p.nid || pool->p.nid == NUMA_NO_NODE );  
> >>
> >> Above code doesn't generate optimal ASM code, I suggest:
> >>
> >>  static bool pool_page_reusable(struct page_pool *pool, struct page *page)
> >>  {
> >> 	return !page_is_pfmemalloc(page) &&
> >> 		pool->p.nid != NUMA_NO_NODE &&
> >> 		page_to_nid(page) == pool->p.nid;
> >>  }
> >>  
> > 
> > this is not equivalent to the above. Here in case pool->p.nid is
> > NUMA_NO_NODE, pool_page_reusable() will always be false.
> > 
> > We can avoid the extra check in data path.
> > How about avoiding NUMA_NO_NODE in page_pool altogether, and force
> > numa_mem_id() as pool->p.nid when user requests NUMA_NO_NODE at page
> > pool init, as already done in alloc_pages_node().   
> 
> That means we will not support page reuse migration for NUMA_NO_NODE,
> which is not the same semantic as alloc_pages_node()'s handling of NUMA_NO_NODE,
> because alloc_pages_node() will allocate the page based on the node
> of the currently running cpu.

True, as I wrote (below), my code defines the semantics as: a page_pool
configured with NUMA_NO_NODE means skip NUMA checks, and allow recycling
regardless of which NUMA node the page belongs to.  It seems that you want
different semantics.

I'm open to other semantics. My main concern is performance.  The
page_pool fast-path for the driver-recycling use-case of XDP_DROP has
extreme performance requirements, as it needs to compete with drivers'
local recycle tricks (else we cannot use page_pool to simplify drivers).
The extreme performance target is 100Gbit/s = 148Mpps = 6.72ns per packet;
in practice I'm measuring 25Mpps = 40ns with the mlx5 driver (single q),
and Bjørn is showing 30Mpps = 33.3ns with i40e.  At this level every
cycle/instruction counts.
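
(For reference on the arithmetic: at 100Gbit/s a minimum-sized frame occupies
64 bytes + 20 bytes of Ethernet overhead = 84 bytes = 672 bits on the wire,
so 100e9 / 672 ≈ 148.8Mpps, i.e. 1 / 148.8e6 ≈ 6.72ns per packet.)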

 
> Also, there seems to be some wild guessing of the node id here, which has
> been discussed before and has not reached an agreement yet.
> 
> > 
> > which will imply recycling without adding any extra condition to the
> > data path.

I love code that moves thing out of our fast-path.

> > 
> > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > index a6aefe989043..00c99282a306 100644
> > --- a/net/core/page_pool.c
> > +++ b/net/core/page_pool.c
> > @@ -28,6 +28,9 @@ static int page_pool_init(struct page_pool *pool,
> >  
> >         memcpy(&pool->p, params, sizeof(pool->p));
> >  
> > +	/* overwrite to allow recycling.. */
> > +       if (pool->p.nid == NUMA_NO_NODE) 
> > +               pool->p.nid = numa_mem_id(); 
> > +

The problem is that page_pool_init() can be initiated from a random
CPU, first at driver setup/bringup, and later at other queue changes
that can be started via ethtool or XDP attach. (numa_mem_id() picks
the node from the running CPU.)

As Yunsheng mentioned elsewhere, there is also a dev_to_node() function.
Isn't that what we want in a place like this?


One issue with dev_to_node() is that in case of !CONFIG_NUMA it returns
NUMA_NO_NODE (-1).  (And device_initialize() also sets it to -1.)  Thus,
in that case we set pool->p.nid = 0, as page_to_nid() will also return
zero in that case (as far as I follow the code).
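
A sketch of that init-time fallback (hypothetical, not committed code; it
assumes pool->p.dev is set, as it is for the DMA-mapping users):

	if (pool->p.nid == NUMA_NO_NODE) {
		int nid = dev_to_node(pool->p.dev);

		/* !CONFIG_NUMA: dev_to_node() returns -1, fall back to node 0 */
		pool->p.nid = (nid == NUMA_NO_NODE) ? 0 : nid;
	}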


> > After a quick look, i don't see any reason why to keep NUMA_NO_NODE in
> > pool->p.nid.. 
> > 
> >   
> >> I have compiled different variants and looked at the ASM code
> >> generated by GCC.  This seems to give the best result.
> >>
> >>  
> >>> 1) never recycle emergency pages, regardless of pool nid.
> >>> 2) always recycle if pool is NUMA_NO_NODE.  
> >>
> >> Yes, this defines the semantics, that a page_pool configured with
> >> NUMA_NO_NODE means skip NUMA checks.  I think that sounds okay...
> >>
> >>  
> >>> the above change should not add any overhead, a modest branch
> >>> predictor will handle this with no effort.  
> >>
> >> It still annoys me that we keep adding instructions to this code
> >> hot-path (I counted 34 bytes and 11 instructions in my proposed
> >> function).
> >>
> >> I think that it might be possible to move these NUMA checks to
> >> alloc-side (instead of return/recycles side as today), and perhaps
> >> only on slow-path when dequeuing from ptr_ring (as recycles that
> >> call __page_pool_recycle_direct() will be pinned during NAPI).
> >> But lets focus on a smaller fix for the immediate issue...
> >>  
> > 
> > I know. It annoys me too, but we need recycling to work in
> > production : where rings/napi can migrate and numa nodes can be
> > NUMA_NO_NODE :-(.
> > 
> >   

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



* Re: [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-12 10:18         ` Jesper Dangaard Brouer
@ 2019-12-13  3:40           ` Yunsheng Lin
  2019-12-13  6:27             ` RE: " Li,Rongqing
  2019-12-17 19:35             ` Saeed Mahameed
  2019-12-17 19:27           ` Saeed Mahameed
  1 sibling, 2 replies; 22+ messages in thread
From: Yunsheng Lin @ 2019-12-13  3:40 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Saeed Mahameed, ilias.apalodimas, jonathan.lemon, Li Rongqing,
	netdev, mhocko, peterz, Greg Kroah-Hartman, bhelgaas,
	linux-kernel, Björn Töpel

On 2019/12/12 18:18, Jesper Dangaard Brouer wrote:
> On Thu, 12 Dec 2019 09:34:14 +0800
> Yunsheng Lin <linyunsheng@huawei.com> wrote:
> 
>> +CC Michal, Peter, Greg and Bjorn
>> Because there has been discussion about where and how the NUMA_NO_NODE
>> should be handled before.
>>
>> On 2019/12/12 5:24, Saeed Mahameed wrote:
>>> On Wed, 2019-12-11 at 19:49 +0100, Jesper Dangaard Brouer wrote:  
>>>> On Sat, 7 Dec 2019 03:52:41 +0000
>>>> Saeed Mahameed <saeedm@mellanox.com> wrote:
>>>>  
>>>>> I don't think it is correct to check that the page nid is same as
>>>>> numa_mem_id() if pool is NUMA_NO_NODE. In such case we should allow
>>>>> all  pages to recycle, because you can't assume where pages are
>>>>> allocated from and where they are being handled.  
>>>>
>>>> I agree, using numa_mem_id() is not valid, because it takes the numa
>>>> node id from the executing CPU and the call to __page_pool_put_page()
>>>> can happen on a remote CPU (e.g. cpumap redirect, and in future
>>>> SKBs).
>>>>
>>>>  
>>>>> I suggest the following:
>>>>>
>>>>> return !page_pfmemalloc() && 
>>>>> ( page_to_nid(page) == pool->p.nid || pool->p.nid == NUMA_NO_NODE );  
>>>>
>>>> Above code doesn't generate optimal ASM code, I suggest:
>>>>
>>>>  static bool pool_page_reusable(struct page_pool *pool, struct page *page)
>>>>  {
>>>> 	return !page_is_pfmemalloc(page) &&
>>>> 		pool->p.nid != NUMA_NO_NODE &&
>>>> 		page_to_nid(page) == pool->p.nid;
>>>>  }
>>>>  
>>>
>>> this is not equivalent to the above. Here in case pool->p.nid is
>>> NUMA_NO_NODE, pool_page_reusable() will always be false.
>>>
>>> We can avoid the extra check in data path.
>>> How about avoiding NUMA_NO_NODE in page_pool altogether, and force
>>> numa_mem_id() as pool->p.nid when user requests NUMA_NO_NODE at page
>>> pool init, as already done in alloc_pages_node().   
>>
>> That means we will not support page reuse migration for NUMA_NO_NODE,
>> which is not the same semantic as alloc_pages_node()'s handling of NUMA_NO_NODE,
>> because alloc_pages_node() will allocate the page based on the node
>> of the currently running cpu.
> 
> True, as I wrote (below), my code defines the semantics as: a page_pool
> configured with NUMA_NO_NODE means skip NUMA checks, and allow recycling
> regardless of which NUMA node the page belongs to.  It seems that you want
> different semantics.

For drivers that do not have page pool support yet, the semantics seem
to be: always allocate and recycle local pages (except pfmemalloc ones). Page
reuse moves when the rx interrupt affinity moves (the NAPI polling
context moves accordingly), regardless of dev_to_node().

It would be good to maintain the above semantics. And the rx data page seems
to be close to the cpu doing the rx cleaning, which means the cpu
can process the buffer more quickly?

> 
> I'm open to other semantics. My main concern is performance.  The
> page_pool fast-path for the driver-recycling use-case of XDP_DROP has
> extreme performance requirements, as it needs to compete with drivers'
> local recycle tricks (else we cannot use page_pool to simplify drivers).
> The extreme performance target is 100Gbit/s = 148Mpps = 6.72ns per packet;
> in practice I'm measuring 25Mpps = 40ns with the mlx5 driver (single q),
> and Bjørn is showing 30Mpps = 33.3ns with i40e.  At this level every
> cycle/instruction counts.

Yes, the performance is a concern too.
But if the rx page is closer to the cpu, maybe the time taken to process
the buffer can be reduced?

It is good to allocate the rx page close to both cpu and device, but if
both goals cannot be reached, maybe we should choose to allocate the page
close to the cpu?

> 
>  
>> Also, there seems to be some wild guessing of the node id here, which has
>> been discussed before and has not reached an agreement yet.
>>
>>>
>>> which will imply recycling without adding any extra condition to the
>>> data path.
> 
> I love code that moves thing out of our fast-path.
> 
>>>
>>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>>> index a6aefe989043..00c99282a306 100644
>>> --- a/net/core/page_pool.c
>>> +++ b/net/core/page_pool.c
>>> @@ -28,6 +28,9 @@ static int page_pool_init(struct page_pool *pool,
>>>  
>>>         memcpy(&pool->p, params, sizeof(pool->p));
>>>  
>>> +	/* overwrite to allow recycling.. */
>>> +       if (pool->p.nid == NUMA_NO_NODE) 
>>> +               pool->p.nid = numa_mem_id(); 
>>> +
> 
> The problem is that page_pool_init() can be initiated from a random
> CPU, first at driver setup/bringup, and later at other queue changes
> that can be started via ethtool or XDP attach. (numa_mem_id() picks
> the node from the running CPU.)

Yes, changing the ring number or ring depth releases and allocates rx data
pages, so using NUMA_NO_NODE to allocate pages and always recycling those
pages may have a performance difference noticeable to the user.

> 
> As Yunsheng mentioned elsewhere, there is also a dev_to_node() function.
> Isn't that what we want in a place like this?
> 
> 
> One issue with dev_to_node() is that in case of !CONFIG_NUMA it returns
> NUMA_NO_NODE (-1).  (And device_initialize() also sets it to -1.)  Thus,
> in that case we set pool->p.nid = 0, as page_to_nid() will also return
> zero in that case (as far as I follow the code).
> 
> 
>>> After a quick look, i don't see any reason why to keep NUMA_NO_NODE in
>>> pool->p.nid.. 
>>>
>>>   
>>>> I have compiled different variants and looked at the ASM code
>>>> generated by GCC.  This seems to give the best result.
>>>>
>>>>  
>>>>> 1) never recycle emergency pages, regardless of pool nid.
>>>>> 2) always recycle if pool is NUMA_NO_NODE.  
>>>>
>>>> Yes, this defines the semantics, that a page_pool configured with
>>>> NUMA_NO_NODE means skip NUMA checks.  I think that sounds okay...
>>>>
>>>>  
>>>>> the above change should not add any overhead, a modest branch
>>>>> predictor will handle this with no effort.  
>>>>
>>>> It still annoys me that we keep adding instructions to this code
>>>> hot-path (I counted 34 bytes and 11 instructions in my proposed
>>>> function).
>>>>
>>>> I think that it might be possible to move these NUMA checks to
>>>> alloc-side (instead of return/recycles side as today), and perhaps
>>>> only on slow-path when dequeuing from ptr_ring (as recycles that
>>>> call __page_pool_recycle_direct() will be pinned during NAPI).
>>>> But lets focus on a smaller fix for the immediate issue...
>>>>  
>>>
>>> I know. It annoys me too, but we need recycling to work in
>>> production : where rings/napi can migrate and numa nodes can be
>>> NUMA_NO_NODE :-(.
>>>
>>>   
> 



* RE: [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-13  3:40           ` Yunsheng Lin
@ 2019-12-13  6:27             ` Li,Rongqing
  2019-12-13  6:53               ` Yunsheng Lin
  2019-12-17 19:35             ` Saeed Mahameed
  1 sibling, 1 reply; 22+ messages in thread
From: Li,Rongqing @ 2019-12-13  6:27 UTC (permalink / raw)
  To: Yunsheng Lin, Jesper Dangaard Brouer
  Cc: Saeed Mahameed, ilias.apalodimas, jonathan.lemon, netdev, mhocko,
	peterz, Greg Kroah-Hartman, bhelgaas, linux-kernel,
	Björn Töpel

> 
> It is good to allocate the rx page close to both cpu and device, but if
> both goals cannot be reached, maybe we should choose to allocate the page
> close to the cpu?
> 
I think that is true.

If it is true, we can remove pool->p.nid, replace alloc_pages_node with
alloc_pages in __page_pool_alloc_pages_slow, and change pool_page_reusable so
that page_to_nid(page) is checked against numa_mem_id(),

since alloc_pages hints to use a current-node page, and __page_pool_alloc_pages_slow
will be called often in NAPI polling if recycling fails; after some cycles, the pages
will come from the local memory node.
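
A sketch of that suggestion (hypothetical; note Jesper's objection in a later
reply that this check can run on a CPU remote from the allocating one):

 static bool pool_page_reusable(struct page_pool *pool, struct page *page)
 {
	/* recycle only pages local to the CPU running this check */
	return !page_is_pfmemalloc(page) &&
	       page_to_nid(page) == numa_mem_id();
 }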

-Li



* Re: RE: [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-13  6:27             ` RE: " Li,Rongqing
@ 2019-12-13  6:53               ` Yunsheng Lin
  2019-12-13  8:48                 ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 22+ messages in thread
From: Yunsheng Lin @ 2019-12-13  6:53 UTC (permalink / raw)
  To: Li,Rongqing, Jesper Dangaard Brouer
  Cc: Saeed Mahameed, ilias.apalodimas, jonathan.lemon, netdev, mhocko,
	peterz, Greg Kroah-Hartman, bhelgaas, linux-kernel,
	Björn Töpel

On 2019/12/13 14:27, Li,Rongqing wrote:
>>
>> It is good to allocate the rx page close to both cpu and device, but if
>> both goals cannot be reached, maybe we should choose to allocate the page
>> close to the cpu?
>>
> I think that is true.
> 
> If it is true, we can remove pool->p.nid, replace alloc_pages_node with
> alloc_pages in __page_pool_alloc_pages_slow, and change pool_page_reusable so
> that page_to_nid(page) is checked against numa_mem_id(),
> 
> since alloc_pages hints to use a current-node page, and __page_pool_alloc_pages_slow
> will be called often in NAPI polling if recycling fails; after some cycles, the pages
> will come from the local memory node.

Yes, if allocation and recycling are in the same NAPI polling context.

As pointed out by Saeed and Ilias, the allocation and recycling may not be
happening in the same NAPI polling context, see:

"In the current code base if they are only called under NAPI this might be true.
On the page_pool skb recycling patches though (yes we'll eventually send those
:)) this is called from kfree_skb()."

So this may need some additional attention.


> 
> -Li
> 



* Re: [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-13  6:53               ` Yunsheng Lin
@ 2019-12-13  8:48                 ` Jesper Dangaard Brouer
  2019-12-16  1:51                   ` Yunsheng Lin
  0 siblings, 1 reply; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2019-12-13  8:48 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: Li,Rongqing, Saeed Mahameed, ilias.apalodimas, jonathan.lemon,
	netdev, mhocko, peterz, Greg Kroah-Hartman, bhelgaas,
	linux-kernel, Björn Töpel, brouer

On Fri, 13 Dec 2019 14:53:37 +0800
Yunsheng Lin <linyunsheng@huawei.com> wrote:

> On 2019/12/13 14:27, Li,Rongqing wrote:
> >>
> >> It is good to allocate the rx page close to both cpu and device,
> >> but if both goals cannot be reached, maybe we should choose to
> >> allocate the page close to the cpu?
> >>
> > I think that is true.
> > 
> > If it is true, we can remove pool->p.nid, and replace
> > alloc_pages_node with alloc_pages in __page_pool_alloc_pages_slow,
> > and change pool_page_reusable so that page_to_nid(page) is checked
> > against numa_mem_id()

No, as I explained before, you cannot use numa_mem_id() in pool_page_reusable,
because the recycle call can happen from/on a remote CPU (and numa_mem_id()
uses the local CPU to determine the NUMA id).

Today we already have XDP cpumap redirect.  And we are working on
extending this to SKBs, which will increase the likelihood even more.

I don't think we want to not-recycle if a remote NUMA node CPU happened
to touch the packet?

(Based on the optimizations done for Facebook (the reason we added this))
What seems to matter is the NUMA node of CPU that runs RX NAPI-polling,
this is the first CPU that touch the packet memory and struct-page
memory.  For these drivers, it is also the same "RX-CPU" that does the
allocation of new pages (to refill the RX-ring), and these "new" pages
can come from the page_pool.

With this in mind, it does seem strange that we are doing the check on
the "free"/recycles code path, that can run on remote CPUs...


> > since alloc_pages hints to use a current-node page, and
> > __page_pool_alloc_pages_slow will be called often in NAPI polling
> > if recycling fails; after some cycles, the pages will come from the
> > local memory node.  

You are basically saying that the NUMA check should be moved to
allocation time, as it is running on the RX-CPU (NAPI).  And eventually,
after some time, the pages will come from the correct NUMA node.

I think we can do that, and only affect the semi-fast-path.
We just need to handle that pages in the ptr_ring that are recycled can
be from the wrong NUMA node.  In __page_pool_get_cached() when
consuming pages from the ptr_ring (__ptr_ring_consume_batched), then we
can evict pages from wrong NUMA node.

For the pool->alloc.cache we either accept that it will eventually
be emptied after some time (it is only in a 100% XDP_DROP workload that
it will continue to reuse the same pages), or we simply clear the
pool->alloc.cache when calling page_pool_update_nid().
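
A rough sketch of that refill-time eviction idea (hypothetical; helper names
are illustrative, not the committed code):

 static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
 {
	int pref_nid = (pool->p.nid == NUMA_NO_NODE) ?
		       numa_mem_id() : pool->p.nid;
	struct page *page;

	/* refill alloc.cache from the ptr_ring, keeping only local pages */
	while (pool->alloc.count < PP_ALLOC_CACHE_REFILL) {
		page = __ptr_ring_consume(&pool->ring);
		if (!page)
			break;

		if (likely(page_to_nid(page) == pref_nid))
			pool->alloc.cache[pool->alloc.count++] = page;
		else
			__page_pool_return_page(pool, page); /* wrong node */
	}

	return pool->alloc.count ?
	       pool->alloc.cache[--pool->alloc.count] : NULL;
 }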


> Yes, if allocation and recycling are in the same NAPI polling context.

Which is a false assumption.

> As pointed out by Saeed and Ilias, the allocation and recycling may
> not be happening in the same NAPI polling context, see:
> 
> "In the current code base if they are only called under NAPI this
> might be true. On the page_pool skb recycling patches though (yes
> we'll eventually send those :)) this is called from kfree_skb()."
> 
> So this may need some additional attention.

Yes, as explained above.



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



* Re: [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-13  8:48                 ` Jesper Dangaard Brouer
@ 2019-12-16  1:51                   ` Yunsheng Lin
  2019-12-16  4:02                     ` RE: " Li,Rongqing
  0 siblings, 1 reply; 22+ messages in thread
From: Yunsheng Lin @ 2019-12-16  1:51 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Li,Rongqing, Saeed Mahameed, ilias.apalodimas, jonathan.lemon,
	netdev, mhocko, peterz, Greg Kroah-Hartman, bhelgaas,
	linux-kernel, Björn Töpel

On 2019/12/13 16:48, Jesper Dangaard Brouer wrote:
> You are basically saying that the NUMA check should be moved to
> allocation time, as it is running on the RX-CPU (NAPI).  And eventually,
> after some time, the pages will come from the correct NUMA node.
> 
> I think we can do that, and only affect the semi-fast-path.
> We just need to handle that pages in the ptr_ring that are recycled can
> be from the wrong NUMA node.  In __page_pool_get_cached() when
> consuming pages from the ptr_ring (__ptr_ring_consume_batched), then we
> can evict pages from wrong NUMA node.

Yes, that's workable.

> 
> For the pool->alloc.cache we either accept that it will eventually
> be emptied after some time (it is only in a 100% XDP_DROP workload that
> it will continue to reuse the same pages), or we simply clear the
> pool->alloc.cache when calling page_pool_update_nid().

Simply clearing the pool->alloc.cache when calling page_pool_update_nid()
seems better.
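
A sketch of that (hypothetical, not the committed code; it assumes the caller
is serialized with the NAPI context that owns alloc.cache, and the page-return
helper name is illustrative):

 void page_pool_update_nid(struct page_pool *pool, int new_nid)
 {
	if (READ_ONCE(pool->p.nid) == new_nid)
		return;

	WRITE_ONCE(pool->p.nid, new_nid);

	/* drop cached pages, they may belong to the old node */
	while (pool->alloc.count) {
		struct page *page = pool->alloc.cache[--pool->alloc.count];

		__page_pool_return_page(pool, page);
	}
 }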

> 



* RE: [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-16  1:51                   ` Yunsheng Lin
@ 2019-12-16  4:02                     ` Li,Rongqing
  2019-12-16 10:13                       ` Ilias Apalodimas
  0 siblings, 1 reply; 22+ messages in thread
From: Li,Rongqing @ 2019-12-16  4:02 UTC (permalink / raw)
  To: Yunsheng Lin, Jesper Dangaard Brouer
  Cc: Saeed Mahameed, ilias.apalodimas, jonathan.lemon, netdev, mhocko,
	peterz, Greg Kroah-Hartman, bhelgaas, linux-kernel,
	Björn Töpel



> -----Original Message-----
> From: Yunsheng Lin [mailto:linyunsheng@huawei.com]
> Sent: December 16, 2019, 9:51
> To: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Li,Rongqing <lirongqing@baidu.com>; Saeed Mahameed
> <saeedm@mellanox.com>; ilias.apalodimas@linaro.org;
> jonathan.lemon@gmail.com; netdev@vger.kernel.org; mhocko@kernel.org;
> peterz@infradead.org; Greg Kroah-Hartman <gregkh@linuxfoundation.org>;
> bhelgaas@google.com; linux-kernel@vger.kernel.org; Björn Töpel
> <bjorn.topel@intel.com>
> Subject: Re: [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE
> condition
> 
> On 2019/12/13 16:48, Jesper Dangaard Brouer wrote:
> > You are basically saying that the NUMA check should be moved to
> > allocation time, as it is running on the RX-CPU (NAPI).  And eventually,
> > after some time, the pages will come from the correct NUMA node.
> >
> > I think we can do that, and only affect the semi-fast-path.
> > We just need to handle that pages in the ptr_ring that are recycled
> > can be from the wrong NUMA node.  In __page_pool_get_cached() when
> > consuming pages from the ptr_ring (__ptr_ring_consume_batched), then
> > we can evict pages from wrong NUMA node.
> 
> Yes, that's workable.
> 
> >
> > For the pool->alloc.cache we either accept that it will eventually
> > be emptied after some time (it is only in a 100% XDP_DROP workload that
> > it will continue to reuse the same pages), or we simply clear the
> > pool->alloc.cache when calling page_pool_update_nid().
> 
> Simply clearing the pool->alloc.cache when calling page_pool_update_nid()
> seems better.
> 

How about the code below? The driver can configure p.nid to anything; it will be adjusted in NAPI polling, so irq migration will not be a problem, but it will add a check to the hot path.

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index a6aefe989043..4374a6239d17 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -108,6 +108,10 @@ static struct page *__page_pool_get_cached(struct page_pool *pool)
                if (likely(pool->alloc.count)) {
                        /* Fast-path */
                        page = pool->alloc.cache[--pool->alloc.count];
+
+                       if (unlikely(READ_ONCE(pool->p.nid) != numa_mem_id()))
+                               WRITE_ONCE(pool->p.nid, numa_mem_id());
+
                        return page;
                }
                refill = true;
@@ -155,6 +159,10 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
        if (pool->p.order)
                gfp |= __GFP_COMP;
 
+
+       if (unlikely(READ_ONCE(pool->p.nid) != numa_mem_id()))
+               WRITE_ONCE(pool->p.nid, numa_mem_id());
+
        /* FUTURE development:
         *
         * Current slow-path essentially falls back to single page
Thanks

-Li
> >



* Re: RE: [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-16  4:02                     ` RE: " Li,Rongqing
@ 2019-12-16 10:13                       ` Ilias Apalodimas
  2019-12-16 10:16                         ` Ilias Apalodimas
  2019-12-17 19:38                         ` Saeed Mahameed
  0 siblings, 2 replies; 22+ messages in thread
From: Ilias Apalodimas @ 2019-12-16 10:13 UTC (permalink / raw)
  To: Li,Rongqing
  Cc: Yunsheng Lin, Jesper Dangaard Brouer, Saeed Mahameed,
	jonathan.lemon, netdev, mhocko, peterz, Greg Kroah-Hartman,
	bhelgaas, linux-kernel, Björn Töpel

On Mon, Dec 16, 2019 at 04:02:04AM +0000, Li,Rongqing wrote:
> 
> 
> > -----Original Message-----
> > From: Yunsheng Lin [mailto:linyunsheng@huawei.com]
> > Sent: December 16, 2019, 9:51
> > To: Jesper Dangaard Brouer <brouer@redhat.com>
> > Cc: Li,Rongqing <lirongqing@baidu.com>; Saeed Mahameed
> > <saeedm@mellanox.com>; ilias.apalodimas@linaro.org;
> > jonathan.lemon@gmail.com; netdev@vger.kernel.org; mhocko@kernel.org;
> > peterz@infradead.org; Greg Kroah-Hartman <gregkh@linuxfoundation.org>;
> > bhelgaas@google.com; linux-kernel@vger.kernel.org; Björn Töpel
> > <bjorn.topel@intel.com>
> > Subject: Re: [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE
> > condition
> > 
> > On 2019/12/13 16:48, Jesper Dangaard Brouer wrote:
> > > You are basically saying that the NUMA check should be moved to
> > > allocation time, as it is running on the RX-CPU (NAPI).  And eventually,
> > > after some time, the pages will come from the correct NUMA node.
> > >
> > > I think we can do that, and only affect the semi-fast-path.
> > > We just need to handle that pages in the ptr_ring that are recycled
> > > can be from the wrong NUMA node.  In __page_pool_get_cached() when
> > > consuming pages from the ptr_ring (__ptr_ring_consume_batched), then
> > > we can evict pages from wrong NUMA node.
> > 
> > Yes, that's workable.
> > 
> > >
> > > For the pool->alloc.cache we either accept that it will eventually
> > > be emptied after some time (it is only in a 100% XDP_DROP workload that
> > > it will continue to reuse the same pages), or we simply clear the
> > > pool->alloc.cache when calling page_pool_update_nid().
> > 
> > Simply clearing the pool->alloc.cache when calling page_pool_update_nid()
> > seems better.
> > 
> 
> How about the code below? The driver can configure p.nid to anything; it will be adjusted in NAPI polling, so irq migration will not be a problem, but it will add a check to the hot path.

We'll have to check the impact on some high speed (i.e. 100gbit) interfaces
before doing anything like that. Saeed's current patch runs once per NAPI. This
runs once per packet. The load might be measurable.
The READ_ONCE is needed in case all producers/consumers run on the same CPU
right?


Thanks
/Ilias
> 
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index a6aefe989043..4374a6239d17 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -108,6 +108,10 @@ static struct page *__page_pool_get_cached(struct page_pool *pool)
>                 if (likely(pool->alloc.count)) {
>                         /* Fast-path */
>                         page = pool->alloc.cache[--pool->alloc.count];
> +
> +                       if (unlikely(READ_ONCE(pool->p.nid) != numa_mem_id()))
> +                               WRITE_ONCE(pool->p.nid, numa_mem_id());
> +
>                         return page;
>                 }
>                 refill = true;
> @@ -155,6 +159,10 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
>         if (pool->p.order)
>                 gfp |= __GFP_COMP;
>  
> +
> +       if (unlikely(READ_ONCE(pool->p.nid) != numa_mem_id()))
> +               WRITE_ONCE(pool->p.nid, numa_mem_id());
> +
>         /* FUTURE development:
>          *
>          * Current slow-path essentially falls back to single page
> Thanks
> 
> -Li
> > >
> 


* Re: RE: [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-16 10:13                       ` Ilias Apalodimas
@ 2019-12-16 10:16                         ` Ilias Apalodimas
  2019-12-16 10:57                           ` RE: " Li,Rongqing
  2019-12-17 19:38                         ` Saeed Mahameed
  1 sibling, 1 reply; 22+ messages in thread
From: Ilias Apalodimas @ 2019-12-16 10:16 UTC (permalink / raw)
  To: Li,Rongqing
  Cc: Yunsheng Lin, Jesper Dangaard Brouer, Saeed Mahameed,
	jonathan.lemon, netdev, mhocko, peterz, Greg Kroah-Hartman,
	bhelgaas, linux-kernel, Björn Töpel

> > > 
> > > Simply clearing the pool->alloc.cache when calling page_pool_update_nid()
> > > seems better.
> > > 
> > 
> > How about the code below? The driver can configure p.nid to anything; it will be adjusted in NAPI polling, so irq migration will not be a problem, but it will add a check to the hot path.
> 
> We'll have to check the impact on some high speed (i.e. 100gbit) interfaces
> before doing anything like that. Saeed's current patch runs once per NAPI. This
> runs once per packet. The load might be measurable. 
> The READ_ONCE is needed in case all producers/consumers run on the same CPU

I meant different cpus!

> right?
> 
> 
> Thanks
> /Ilias
> > 
> > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > index a6aefe989043..4374a6239d17 100644
> > --- a/net/core/page_pool.c
> > +++ b/net/core/page_pool.c
> > @@ -108,6 +108,10 @@ static struct page *__page_pool_get_cached(struct page_pool *pool)
> >                 if (likely(pool->alloc.count)) {
> >                         /* Fast-path */
> >                         page = pool->alloc.cache[--pool->alloc.count];
> > +
> > +                       if (unlikely(READ_ONCE(pool->p.nid) != numa_mem_id()))
> > +                               WRITE_ONCE(pool->p.nid, numa_mem_id());
> > +
> >                         return page;
> >                 }
> >                 refill = true;
> > @@ -155,6 +159,10 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
> >         if (pool->p.order)
> >                 gfp |= __GFP_COMP;
> >  
> > +
> > +       if (unlikely(READ_ONCE(pool->p.nid) != numa_mem_id()))
> > +               WRITE_ONCE(pool->p.nid, numa_mem_id());
> > +
> >         /* FUTURE development:
> >          *
> >          * Current slow-path essentially falls back to single page
> > Thanks
> > 
> > -Li
> > > >
> > 


* RE: RE: [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-16 10:16                         ` Ilias Apalodimas
@ 2019-12-16 10:57                           ` Li,Rongqing
  0 siblings, 0 replies; 22+ messages in thread
From: Li,Rongqing @ 2019-12-16 10:57 UTC (permalink / raw)
  To: Ilias Apalodimas
  Cc: Yunsheng Lin, Jesper Dangaard Brouer, Saeed Mahameed,
	jonathan.lemon, netdev, mhocko, peterz, Greg Kroah-Hartman,
	bhelgaas, linux-kernel, Björn Töpel



> -----Original Message-----
> From: Ilias Apalodimas [mailto:ilias.apalodimas@linaro.org]
> Sent: December 16, 2019, 18:17
> To: Li,Rongqing <lirongqing@baidu.com>
> Cc: Yunsheng Lin <linyunsheng@huawei.com>; Jesper Dangaard Brouer
> <brouer@redhat.com>; Saeed Mahameed <saeedm@mellanox.com>;
> jonathan.lemon@gmail.com; netdev@vger.kernel.org; mhocko@kernel.org;
> peterz@infradead.org; Greg Kroah-Hartman <gregkh@linuxfoundation.org>;
> bhelgaas@google.com; linux-kernel@vger.kernel.org; Björn Töpel
> <bjorn.topel@intel.com>
> Subject: Re: RE: [PATCH][v2] page_pool: handle page recycle for
> NUMA_NO_NODE condition
> 
> > > >
> > > > Simply clearing the pool->alloc.cache when calling
> > > > page_pool_update_nid() seems better.
> > > >
> > >
> > > How about the code below? The driver can configure p.nid to anything; it
> > > will be adjusted in NAPI polling, so irq migration will not be a problem,
> > > but it will add a check to the hot path.
> >
> > We'll have to check the impact on some high speed (i.e. 100gbit)
> > interfaces before doing anything like that. Saeed's current patch runs
> > once per NAPI. This runs once per packet. The load might be measurable.
> > The READ_ONCE is needed in case all producers/consumers run on the
> > same CPU
> 
> I meant different cpus!
> 

If there were no READ_ONCE check, pool->p.nid would always be written and its cache line would become dirty, even though it is not shared by multiple cpus

See Eric's patch:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=503978aca46124cd714703e180b9c8292ba50ba7

-Li 
> > right?
> >
> >
> > Thanks
> > /Ilias
> > >
> > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c index
> > > a6aefe989043..4374a6239d17 100644
> > > --- a/net/core/page_pool.c
> > > +++ b/net/core/page_pool.c
> > > @@ -108,6 +108,10 @@ static struct page
> *__page_pool_get_cached(struct page_pool *pool)
> > >                 if (likely(pool->alloc.count)) {
> > >                         /* Fast-path */
> > >                         page =
> > > pool->alloc.cache[--pool->alloc.count];
> > > +
> > > +                       if (unlikely(READ_ONCE(pool->p.nid) !=
> numa_mem_id()))
> > > +                               WRITE_ONCE(pool->p.nid,
> > > + numa_mem_id());
> > > +
> > >                         return page;
> > >                 }
> > >                 refill = true;
> > > @@ -155,6 +159,10 @@ static struct page
> *__page_pool_alloc_pages_slow(struct page_pool *pool,
> > >         if (pool->p.order)
> > >                 gfp |= __GFP_COMP;
> > >
> > > +
> > > +       if (unlikely(READ_ONCE(pool->p.nid) != numa_mem_id()))
> > > +               WRITE_ONCE(pool->p.nid, numa_mem_id());
> > > +
> > >         /* FUTURE development:
> > >          *
> > >          * Current slow-path essentially falls back to single page
> > > Thanks
> > >
> > > -Li
> > > > >
> > >


* Re: [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-12  1:34       ` [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE condition Yunsheng Lin
  2019-12-12 10:18         ` Jesper Dangaard Brouer
@ 2019-12-16 12:15         ` Michal Hocko
  2019-12-16 12:34           ` Ilias Apalodimas
  1 sibling, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2019-12-16 12:15 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: Saeed Mahameed, brouer, ilias.apalodimas, jonathan.lemon,
	Li Rongqing, netdev, peterz, Greg Kroah-Hartman, bhelgaas,
	linux-kernel

On Thu 12-12-19 09:34:14, Yunsheng Lin wrote:
> +CC Michal, Peter, Greg and Bjorn
> Because there has been discussion about where and how the NUMA_NO_NODE
> should be handled before.

I do not have a full context. What is the question here?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-16 12:15         ` Michal Hocko
@ 2019-12-16 12:34           ` Ilias Apalodimas
  2019-12-16 13:08             ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Ilias Apalodimas @ 2019-12-16 12:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yunsheng Lin, Saeed Mahameed, brouer, jonathan.lemon,
	Li Rongqing, netdev, peterz, Greg Kroah-Hartman, bhelgaas,
	linux-kernel

Hi Michal, 
On Mon, Dec 16, 2019 at 01:15:57PM +0100, Michal Hocko wrote:
> On Thu 12-12-19 09:34:14, Yunsheng Lin wrote:
> > +CC Michal, Peter, Greg and Bjorn
> > Because there has been discussion about where and how the NUMA_NO_NODE
> > should be handled before.
> 
> I do not have a full context. What is the question here?

When we allocate pages for the page_pool API, during init, the driver writer
decides which NUMA node to use. The API can, in some cases, recycle the memory
instead of freeing it and re-allocating it. If the NUMA node has changed (irq
affinity for example), we forbid recycling and free the memory, since recycling
and using memory on far NUMA nodes is more expensive (more expensive than
recycling, at least on the architectures we tried anyway).
Since this would be expensive to do per packet, the burden falls on the
driver writer. Drivers *have* to call page_pool_update_nid() or
page_pool_nid_changed() if they want to check for that; the check runs once
per NAPI cycle.
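
For reference, the once-per-NAPI helper is tiny, roughly (from
include/net/page_pool.h of this era):

 static inline void page_pool_nid_changed(struct page_pool *pool, int new_nid)
 {
	if (unlikely(pool->p.nid != new_nid))
		page_pool_update_nid(pool, new_nid);
 }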

The current code in the API, though, does not account for NUMA_NO_NODE. That's
what this is trying to fix.
If the page_pool params are initialized with that, we *never* recycle
the memory. This is happening because the API is allocating memory with
'nid = numa_mem_id()' if NUMA_NO_NODE is configured, so the current if statement
'page_to_nid(page) == pool->p.nid' will never trigger.

The initial proposal was to check:
pool->p.nid == NUMA_NO_NODE && page_to_nid(page) == numa_mem_id()

After that the thread spun out of control :)
My question is: do we *really* have to check for
page_to_nid(page) == numa_mem_id()? If the architecture is not NUMA-aware,
wouldn't pool->p.nid == NUMA_NO_NODE be enough?

Thanks
/Ilias
> -- 
> Michal Hocko
> SUSE Labs


* Re: [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-16 12:34           ` Ilias Apalodimas
@ 2019-12-16 13:08             ` Michal Hocko
  2019-12-16 13:21               ` Ilias Apalodimas
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2019-12-16 13:08 UTC (permalink / raw)
  To: Ilias Apalodimas
  Cc: Yunsheng Lin, Saeed Mahameed, brouer, jonathan.lemon,
	Li Rongqing, netdev, peterz, Greg Kroah-Hartman, bhelgaas,
	linux-kernel

On Mon 16-12-19 14:34:26, Ilias Apalodimas wrote:
> Hi Michal, 
> On Mon, Dec 16, 2019 at 01:15:57PM +0100, Michal Hocko wrote:
> > On Thu 12-12-19 09:34:14, Yunsheng Lin wrote:
> > > +CC Michal, Peter, Greg and Bjorn
> > > Because there has been discussion about where and how the NUMA_NO_NODE
> > > should be handled before.
> > 
> > I do not have a full context. What is the question here?
> 
> When we allocate pages for the page_pool API, during init, the driver writer
> decides which NUMA node to use. The API can, in some cases, recycle the memory
> instead of freeing it and re-allocating it. If the NUMA node has changed (irq
> affinity for example), we forbid recycling and free the memory, since recycling
> and using memory on far NUMA nodes is more expensive (more expensive than
> recycling, at least on the architectures we tried anyway).
> Since this would be expensive to do per packet, the burden falls on the
> driver writer. Drivers *have* to call page_pool_update_nid() or
> page_pool_nid_changed() if they want to check for that; the check runs once
> per NAPI cycle.

Thanks for the clarification.

> The current code in the API, though, does not account for NUMA_NO_NODE. That's
> what this is trying to fix.
> If the page_pool params are initialized with that, we *never* recycle
> the memory. This is happening because the API is allocating memory with
> 'nid = numa_mem_id()' if NUMA_NO_NODE is configured, so the current if statement
> 'page_to_nid(page) == pool->p.nid' will never trigger.

OK. There is no explicit mention of the expected behavior for
NUMA_NO_NODE. The semantic is usually that there is no NUMA placement
requirement and the MM code simply starts the allocation from the local node
in that case. But the memory might come from any node, so there is no
"local node" guarantee.

So the main question is what is the expected semantic? Do people expect
that NUMA_NO_NODE implies locality? Why don't you simply always reuse
when there was no explicit numa requirement?

> The initial proposal was to check:
> pool->p.nid == NUMA_NO_NODE && page_to_nid(page) == numa_mem_id()

> After that the thread spun out of control :)
> My question is: do we *really* have to check for
> page_to_nid(page) == numa_mem_id()? If the architecture is not NUMA-aware,
> wouldn't pool->p.nid == NUMA_NO_NODE be enough?

If the architecture is !NUMA then numa_mem_id() and page_to_nid() should
always be equal, both zero.
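
(A paraphrase of the !CONFIG_NUMA stubs, not verbatim kernel code:

 #ifndef CONFIG_NUMA
 /* single node: every CPU and every page is on node 0 */
 #define numa_node_id()	0
 #define numa_mem_id()	numa_node_id()
 /* page_to_nid() also yields 0: page->flags stores no node bits */
 #endif
)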

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-16 13:08             ` Michal Hocko
@ 2019-12-16 13:21               ` Ilias Apalodimas
  2019-12-17  2:11                 ` Yunsheng Lin
  0 siblings, 1 reply; 22+ messages in thread
From: Ilias Apalodimas @ 2019-12-16 13:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yunsheng Lin, Saeed Mahameed, brouer, jonathan.lemon,
	Li Rongqing, netdev, peterz, Greg Kroah-Hartman, bhelgaas,
	linux-kernel

On Mon, Dec 16, 2019 at 02:08:45PM +0100, Michal Hocko wrote:
> On Mon 16-12-19 14:34:26, Ilias Apalodimas wrote:
> > Hi Michal, 
> > On Mon, Dec 16, 2019 at 01:15:57PM +0100, Michal Hocko wrote:
> > > On Thu 12-12-19 09:34:14, Yunsheng Lin wrote:
> > > > +CC Michal, Peter, Greg and Bjorn
> > > > Because there has been discussion about where and how the NUMA_NO_NODE
> > > > should be handled before.
> > > 
> > > I do not have a full context. What is the question here?
> > 
> > When we allocate pages for the page_pool API, during init, the driver writer
> > decides which NUMA node to use. The API can, in some cases, recycle the memory
> > instead of freeing it and re-allocating it. If the NUMA node has changed (irq
> > affinity for example), we forbid recycling and free the memory, since recycling
> > and using memory on far NUMA nodes is more expensive (more expensive than
> > recycling, at least on the architectures we tried anyway).
> > Since this would be expensive to do per packet, the burden falls on the
> > driver writer. Drivers *have* to call page_pool_update_nid() or
> > page_pool_nid_changed() if they want to check for that; the check runs once
> > per NAPI cycle.
> 
> Thanks for the clarification.
> 
> > The current code in the API, though, does not account for NUMA_NO_NODE. That's
> > what this is trying to fix.
> > If the page_pool params are initialized with that, we *never* recycle
> > the memory. This is happening because the API is allocating memory with
> > 'nid = numa_mem_id()' if NUMA_NO_NODE is configured, so the current if statement
> > 'page_to_nid(page) == pool->p.nid' will never trigger.
> 
> OK. There is no explicit mention of the expected behavior for
> NUMA_NO_NODE. The semantic is usually that there is no NUMA placement
> requirement and the MM code simply starts the allocation from the local node
> in that case. But the memory might come from any node, so there is no
> "local node" guarantee.
> 
> So the main question is what is the expected semantic? Do people expect
> that NUMA_NO_NODE implies locality? Why don't you simply always reuse
> when there was no explicit numa requirement?
> 

Well they shouldn't. Hence my next proposal. I think we are pretty much saying
the same thing here. 
If the driver defines NUMA_NO_NODE, just blindly recycle memory.

> > The initial proposal was to check:
> > pool->p.nid == NUMA_NO_NODE && page_to_nid(page) == numa_mem_id()
> 
> > After that the thread spun out of control :)
> > My question is: do we *really* have to check for
> > page_to_nid(page) == numa_mem_id()? If the architecture is not NUMA-aware,
> > wouldn't pool->p.nid == NUMA_NO_NODE be enough?
> 
> If the architecture is !NUMA then numa_mem_id() and page_to_nid() should
> always be equal, both zero.
> 

Ditto

> -- 
> Michal Hocko
> SUSE Labs


Thanks
/Ilias


* Re: [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-16 13:21               ` Ilias Apalodimas
@ 2019-12-17  2:11                 ` Yunsheng Lin
  2019-12-17  9:11                   ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Yunsheng Lin @ 2019-12-17  2:11 UTC (permalink / raw)
  To: Ilias Apalodimas, Michal Hocko
  Cc: Saeed Mahameed, brouer, jonathan.lemon, Li Rongqing, netdev,
	peterz, Greg Kroah-Hartman, bhelgaas, linux-kernel

On 2019/12/16 21:21, Ilias Apalodimas wrote:
> On Mon, Dec 16, 2019 at 02:08:45PM +0100, Michal Hocko wrote:
>> On Mon 16-12-19 14:34:26, Ilias Apalodimas wrote:
>>> Hi Michal, 
>>> On Mon, Dec 16, 2019 at 01:15:57PM +0100, Michal Hocko wrote:
>>>> On Thu 12-12-19 09:34:14, Yunsheng Lin wrote:
>>>>> +CC Michal, Peter, Greg and Bjorn
>>>>> Because there has been discussion about where and how the NUMA_NO_NODE
>>>>> should be handled before.
>>>>
>>>> I do not have a full context. What is the question here?
>>>
>>> When we allocate pages for the page_pool API, during init, the driver writer
>>> decides which NUMA node to use. The API can, in some cases, recycle the memory
>>> instead of freeing it and re-allocating it. If the NUMA node has changed (irq
>>> affinity for example), we forbid recycling and free the memory, since recycling
>>> and using memory on far NUMA nodes is more expensive (more expensive than
>>> recycling, at least on the architectures we tried anyway).
>>> Since this would be expensive to do per packet, the burden falls on the
>>> driver writer. Drivers *have* to call page_pool_update_nid() or
>>> page_pool_nid_changed() if they want to check for that; the check runs once
>>> per NAPI cycle.
>>
>> Thanks for the clarification.
>>
>>> The current code in the API, though, does not account for NUMA_NO_NODE. That's
>>> what this is trying to fix.
>>> If the page_pool params are initialized with that, we *never* recycle
>>> the memory. This is happening because the API is allocating memory with
>>> 'nid = numa_mem_id()' if NUMA_NO_NODE is configured, so the current if statement
>>> 'page_to_nid(page) == pool->p.nid' will never trigger.
>>
>> OK. There is no explicit mention of the expected behavior for
>> NUMA_NO_NODE. The semantic is usually that there is no NUMA placement
>> requirement and the MM code simply starts the allocation from the local node
>> in that case. But the memory might come from any node, so there is no
>> "local node" guarantee.
>>
>> So the main question is what is the expected semantic? Do people expect
>> that NUMA_NO_NODE implies locality? Why don't you simply always reuse
>> when there was no explicit numa requirement?

For drivers that do not support page pool yet, NUMA_NO_NODE seems to
imply locality, see [1].

And for those drivers, locality is decided by the rx interrupt affinity, not
dev_to_node(). So when the rx interrupt affinity changes, the old pages from the
old node will not be recycled (by checking page_to_nid(page) == numa_mem_id()),
new pages will be allocated to replace the old pages, and the new pages will
be recycled because allocation and recycling happen in the same context,
which means numa_mem_id() returns the same node as that of the newly allocated
pages, see [2].
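
The pattern in such drivers, e.g. i40e's reuse check ([2] above), judges
locality against the CPU doing the rx cleaning, not against the device:

 static inline bool i40e_page_is_reusable(struct page *page)
 {
	return (page_to_nid(page) == numa_mem_id()) &&
		!page_is_pfmemalloc(page);
 }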

As for why not allocate and recycle pages based on dev_to_node(), Jesper has
explained it better than me, see [3]:

"(Based on the optimizations done for Facebook (the reason we added this))
What seems to matter is the NUMA node of CPU that runs RX NAPI-polling,
this is the first CPU that touch the packet memory and struct-page
memory.  For these drivers, it is also the same "RX-CPU" that does the
allocation of new pages (to refill the RX-ring), and these "new" pages
can come from the page_pool."

So maybe the tricky part for page pool is to find the same context to
allocate and recycle pages, so that NUMA_NO_NODE can be used to allocate
pages and numa_mem_id() can be used to decide recycling. Or update the
node dynamically as proposed by Rongqing, see [4].


[1] https://elixir.bootlin.com/linux/v5.5-rc1/ident/dev_alloc_pages
[2] https://elixir.bootlin.com/linux/v5.5-rc1/source/drivers/net/ethernet/intel/i40e/i40e_txrx.c#L1856
[3] https://lkml.org/lkml/2019/12/13/132
[4] https://lkml.org/lkml/2019/12/15/353
>>
> 
> Well they shouldn't. Hence my next proposal. I think we are pretty much saying
> the same thing here. 
> If the driver defines NUMA_NO_NODE, just blindly recycle memory.

Not really, see above.

> 
>>> The initial proposal was to check:
>>> pool->p.nid == NUMA_NO_NODE && page_to_nid(page) == numa_mem_id()
>>
>>> After that the thread spun out of control :)
>>> My question is: do we *really* have to check for
>>> page_to_nid(page) == numa_mem_id()? If the architecture is not NUMA-aware,
>>> wouldn't pool->p.nid == NUMA_NO_NODE be enough?
>>
>> If the architecture is !NUMA then numa_mem_id() and page_to_nid() should
>> always be equal, both zero.

The tricky one is that dev_to_node() always returns -1 when CONFIG_NUMA is
not defined, and some drivers may use that to allocate pages, and the node of
a page may be checked against dev_to_node() to decide whether to recycle.

>>
> 
> Ditto
> 
>> -- 
>> Michal Hocko
>> SUSE Labs
> 
> 
> Thanks
> /Ilias
> 
> .
> 



* Re: [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-17  2:11                 ` Yunsheng Lin
@ 2019-12-17  9:11                   ` Michal Hocko
  2019-12-19  2:09                     ` Yunsheng Lin
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2019-12-17  9:11 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: Ilias Apalodimas, Saeed Mahameed, brouer, jonathan.lemon,
	Li Rongqing, netdev, peterz, Greg Kroah-Hartman, bhelgaas,
	linux-kernel

On Tue 17-12-19 10:11:12, Yunsheng Lin wrote:
> On 2019/12/16 21:21, Ilias Apalodimas wrote:
> > On Mon, Dec 16, 2019 at 02:08:45PM +0100, Michal Hocko wrote:
> >> On Mon 16-12-19 14:34:26, Ilias Apalodimas wrote:
> >>> Hi Michal, 
> >>> On Mon, Dec 16, 2019 at 01:15:57PM +0100, Michal Hocko wrote:
> >>>> On Thu 12-12-19 09:34:14, Yunsheng Lin wrote:
> >>>>> +CC Michal, Peter, Greg and Bjorn
> >>>>> Because there has been discussion about where and how the NUMA_NO_NODE
> >>>>> should be handled before.
> >>>>
> >>>> I do not have a full context. What is the question here?
> >>>
> >>> When we allocate pages for the page_pool API, during init, the driver writer
> >>> decides which NUMA node to use. The API can, in some cases, recycle the memory
> >>> instead of freeing it and re-allocating it. If the NUMA node has changed (irq
> >>> affinity for example), we forbid recycling and free the memory, since recycling
> >>> and using memory on far NUMA nodes is more expensive (more expensive than
> >>> recycling, at least on the architectures we tried anyway).
> >>> Since this would be expensive to do per packet, the burden falls on the
> >>> driver writer. Drivers *have* to call page_pool_update_nid() or
> >>> page_pool_nid_changed() if they want to check for that; the check runs once
> >>> per NAPI cycle.
> >>
> >> Thanks for the clarification.
> >>
> >>> The current code in the API, though, does not account for NUMA_NO_NODE. That's
> >>> what this is trying to fix.
> >>> If the page_pool params are initialized with that, we *never* recycle
> >>> the memory. This is happening because the API is allocating memory with
> >>> 'nid = numa_mem_id()' if NUMA_NO_NODE is configured, so the current if statement
> >>> 'page_to_nid(page) == pool->p.nid' will never trigger.
> >>
> >> OK. There is no explicit mention of the expected behavior for
> >> NUMA_NO_NODE. The semantic is usually that there is no NUMA placement
> >> requirement and the MM code simply starts the allocation from the local node
> >> in that case. But the memory might come from any node, so there is no
> >> "local node" guarantee.
> >>
> >> So the main question is what is the expected semantic? Do people expect
> >> that NUMA_NO_NODE implies locality? Why don't you simply always reuse
> >> when there was no explicit numa requirement?
> 
> For drivers that do not support page pool yet, NUMA_NO_NODE seems to
> imply locality, see [1].

Which is kinda awkward, no? Is there any reason for __dev_alloc_pages to
not use numa_mem_id explicitly when the local node affinity is required?
There is no real guarantee that NUMA_NO_NODE is going to imply the local
node, and we do not want to grow any subtle dependency on that behavior.
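
For reference, __dev_alloc_pages() currently looks roughly like this
(a sketch of include/linux/skbuff.h, comments elided):

static inline struct page *__dev_alloc_pages(gfp_t gfp_mask,
					     unsigned int order)
{
	gfp_mask |= __GFP_COMP | __GFP_MEMALLOC;

	/* passing numa_mem_id() here instead would make the
	 * local-node intent explicit */
	return alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
}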

> And for those drivers, locality is decided by rx interrupt affinity, not
> dev_to_node(). So when the rx interrupt affinity changes, the old page from
> the old node will not be recycled (the page_to_nid(page) == numa_mem_id()
> check fails), new pages will be allocated to replace the old ones, and the
> new pages will be recycled because allocation and recycling happen in the
> same context, which means numa_mem_id() returns the same node as the newly
> allocated page; see [2].

Well, but my understanding is that the generic page pool implementation
has a clear means to change the affinity (namely page_pool_update_nid()).
So my primary question is: why should NUMA_NO_NODE be used as a
bypass for that?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-12 10:18         ` Jesper Dangaard Brouer
  2019-12-13  3:40           ` Yunsheng Lin
@ 2019-12-17 19:27           ` Saeed Mahameed
  1 sibling, 0 replies; 22+ messages in thread
From: Saeed Mahameed @ 2019-12-17 19:27 UTC (permalink / raw)
  To: linyunsheng, brouer
  Cc: mhocko, gregkh, peterz, Li Rongqing, ilias.apalodimas,
	jonathan.lemon, netdev, linux-kernel, bhelgaas, bjorn.topel

On Thu, 2019-12-12 at 11:18 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 12 Dec 2019 09:34:14 +0800
> Yunsheng Lin <linyunsheng@huawei.com> wrote:
> 
> > +CC Michal, Peter, Greg and Bjorn
> > Because there has been discussion before about where and how
> > NUMA_NO_NODE should be handled.
> > 
> > On 2019/12/12 5:24, Saeed Mahameed wrote:
> > > On Wed, 2019-12-11 at 19:49 +0100, Jesper Dangaard Brouer
> > > wrote:  
> > > > On Sat, 7 Dec 2019 03:52:41 +0000
> > > > Saeed Mahameed <saeedm@mellanox.com> wrote:
> > > >  
> > > > > I don't think it is correct to check that the page nid is the same
> > > > > as numa_mem_id() if pool is NUMA_NO_NODE. In such a case we should
> > > > > allow all pages to be recycled, because you can't assume where
> > > > > pages are allocated from and where they are being handled.
> > > > 
> > > > I agree, using numa_mem_id() is not valid, because it takes the
> > > > numa
> > > > node id from the executing CPU and the call to
> > > > __page_pool_put_page()
> > > > can happen on a remote CPU (e.g. cpumap redirect, and in future
> > > > SKBs).
> > > > 
> > > >  
> > > > > I suggest the following:
> > > > > 
> > > > > return !page_pfmemalloc() && 
> > > > > ( page_to_nid(page) == pool->p.nid || pool->p.nid ==
> > > > > NUMA_NO_NODE );  
> > > > 
> > > > Above code doesn't generate optimal ASM code, I suggest:
> > > > 
> > > >  static bool pool_page_reusable(struct page_pool *pool, struct
> > > > page *page)
> > > >  {
> > > > 	return !page_is_pfmemalloc(page) &&
> > > > 		pool->p.nid != NUMA_NO_NODE &&
> > > > 		page_to_nid(page) == pool->p.nid;
> > > >  }
> > > >  
> > > 
> > > this is not equivalent to the above. Here in case pool->p.nid is
> > > NUMA_NO_NODE, pool_page_reusable() will always be false.
> > > 
> > > We can avoid the extra check in data path.
> > > How about avoiding NUMA_NO_NODE in page_pool altogether, and
> > > force
> > > numa_mem_id() as pool->p.nid when user requests NUMA_NO_NODE at
> > > page
> > > pool init, as already done in alloc_pages_node().   
> > 
> > That means we will not support page reuse migration for NUMA_NO_NODE,
> > which is not the same semantics as how alloc_pages_node() handles
> > NUMA_NO_NODE, because alloc_pages_node() will allocate the page based
> > on the node of the currently running cpu.
> 
> True, as I wrote (below) my code defines the semantics as: a page_pool
> configured with NUMA_NO_NODE means skip NUMA checks, and allow
> recycling

Your code will NOT allow recycling when NUMA_NO_NODE is configured,
so I am not sure what semantics you are referring to :)

> regardless of the NUMA node the page belongs to.  It seems that you
> want other semantics.
> 

I think that the semantics we want are to allow recycling when
NUMA_NO_NODE is selected, to solve the reported issue, no?
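
Something like the check below (an untested sketch restating my earlier
suggestion) would express those semantics:

static bool pool_page_reusable(struct page_pool *pool, struct page *page)
{
	/* NUMA_NO_NODE means no placement requirement: recycle freely */
	return !page_is_pfmemalloc(page) &&
	       (pool->p.nid == NUMA_NO_NODE ||
		page_to_nid(page) == pool->p.nid);
}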

> I'm open to other semantics. My main concern is performance.  The
> page_pool fast-path for the driver-recycling use-case of XDP_DROP has
> extreme performance requirements, as it needs to compete with driver
> local recycle tricks (else we cannot use page_pool to simplify
> drivers).
> The extreme performance target is 100Gbit/s = 148Mpps = 6.72ns, and
> in practice I'm measuring 25Mpps = 40ns with Mlx5 driver (single q),
> and Bjørn is showing 30 Mpps = 33.3ns with i40e.  At this level every
> cycle/instruction counts.
> 

I agree.

>  
> > Also, there seems to be wild guessing of the node id here, which has
> > been discussed before without reaching agreement.
> > 
> > > which will imply recycling without adding any extra condition to
> > > the
> > > data path.
> 
> I love code that moves thing out of our fast-path.
> 
> > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > > index a6aefe989043..00c99282a306 100644
> > > --- a/net/core/page_pool.c
> > > +++ b/net/core/page_pool.c
> > > @@ -28,6 +28,9 @@ static int page_pool_init(struct page_pool
> > > *pool,
> > >  
> > >         memcpy(&pool->p, params, sizeof(pool->p));
> > >  
> > > +       /* overwrite to allow recycling.. */
> > > +       if (pool->p.nid == NUMA_NO_NODE)
> > > +               pool->p.nid = numa_mem_id();
> 
> The problem is that page_pool_init() can be initiated from a random
> CPU, first at driver setup/bringup, and later at other queue changes
> that can be started via ethtool or XDP attach. (numa_mem_id() picks
> from the running CPU.)
> 

Well, if the user selected NUMA_NO_NODE, then it is their choice.
Also, the user always has the ability to change pool->p.nid using the
API I introduced, so I don't see any issue with this.
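
E.g. a driver can do something like this once per NAPI cycle (a
hypothetical fragment, struct and function names made up except for the
page_pool helpers; mlx5 does the equivalent):

static int my_napi_poll(struct napi_struct *napi, int budget)
{
	struct my_rq *rq = container_of(napi, struct my_rq, napi);

	/* move the pool to the node of the cpu doing the RX processing */
	page_pool_nid_changed(rq->page_pool, numa_mem_id());

	/* ... normal RX processing up to budget ... */
	return 0;
}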

> As Yunsheng mentioned elsewhere, there is also a dev_to_node()
> function.
> Isn't that what we want in a place like this?
> 

We should keep this outside page_pool. If the user wants dev_to_node(),
they can set it as a parameter to the page pool from outside. When
NUMA_NO_NODE is selected, it means the user doesn't really care.

> 
> One issue with dev_to_node() is that in the case of !CONFIG_NUMA it
> returns NUMA_NO_NODE (-1).  (And device_initialize() also sets it to
> -1.)  Thus, in that case we set pool->p.nid = 0, as page_to_nid() will
> also return zero in that case (as far as I follow the code).
> 

Yes, this is the idea: since alloc_pages() will also select node 0 if you
keep NUMA_NO_NODE, recycling will happen inherently by design
without any change to the data path.
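
That also matches what alloc_pages_node() does today (roughly, from
include/linux/gfp.h):

static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
					    unsigned int order)
{
	if (nid == NUMA_NO_NODE)
		nid = numa_mem_id();	/* fall back to the local node */

	return __alloc_pages_node(nid, gfp_mask, order);
}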

> > > After a quick look, I don't see any reason to keep NUMA_NO_NODE in
> > > pool->p.nid.
> > > 
> > >   
> > > > I have compiled different variants and looked at the ASM code
> > > > generated by GCC.  This seems to give the best result.
> > > > 
> > > >  
> > > > > 1) never recycle emergency pages, regardless of pool nid.
> > > > > 2) always recycle if pool is NUMA_NO_NODE.  
> > > > 
> > > > Yes, this defines the semantics, that a page_pool configured
> > > > with
> > > > NUMA_NO_NODE means skip NUMA checks.  I think that sounds
> > > > okay...
> > > > 
> > > >  
> > > > > the above change should not add any overhead, a modest branch
> > > > > predictor will handle this with no effort.  
> > > > 
> > > > It still annoys me that we keep adding instructions to this
> > > > code
> > > > hot-path (I counted 34 bytes and 11 instructions in my proposed
> > > > function).
> > > > 
> > > > I think that it might be possible to move these NUMA checks to
> > > > alloc-side (instead of return/recycles side as today), and
> > > > perhaps
> > > > only on slow-path when dequeuing from ptr_ring (as recycles
> > > > that
> > > > call __page_pool_recycle_direct() will be pinned during NAPI).
> > > > But let's focus on a smaller fix for the immediate issue...
> > > >  
> > > 
> > > I know. It annoys me too, but we need recycling to work in
> > > production, where rings/napi can migrate and numa nodes can be
> > > NUMA_NO_NODE :-(.
> > > 
> > >   

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-13  3:40           ` Yunsheng Lin
  2019-12-13  6:27             ` 答复: " Li,Rongqing
@ 2019-12-17 19:35             ` Saeed Mahameed
  1 sibling, 0 replies; 22+ messages in thread
From: Saeed Mahameed @ 2019-12-17 19:35 UTC (permalink / raw)
  To: linyunsheng, brouer
  Cc: mhocko, gregkh, peterz, Li Rongqing, ilias.apalodimas,
	jonathan.lemon, netdev, linux-kernel, bhelgaas, bjorn.topel

On Fri, 2019-12-13 at 11:40 +0800, Yunsheng Lin wrote:
> On 2019/12/12 18:18, Jesper Dangaard Brouer wrote:
> > On Thu, 12 Dec 2019 09:34:14 +0800
> > Yunsheng Lin <linyunsheng@huawei.com> wrote:
> > 
> > > +CC Michal, Peter, Greg and Bjorn
> > > Because there has been discussion before about where and how
> > > NUMA_NO_NODE should be handled.
> > > 
> > > On 2019/12/12 5:24, Saeed Mahameed wrote:
> > > > On Wed, 2019-12-11 at 19:49 +0100, Jesper Dangaard Brouer
> > > > wrote:  
> > > > > On Sat, 7 Dec 2019 03:52:41 +0000
> > > > > Saeed Mahameed <saeedm@mellanox.com> wrote:
> > > > >  
> > > > > > I don't think it is correct to check that the page nid is the
> > > > > > same as numa_mem_id() if pool is NUMA_NO_NODE. In such a case we
> > > > > > should allow all pages to be recycled, because you can't assume
> > > > > > where pages are allocated from and where they are being handled.
> > > > > 
> > > > > I agree, using numa_mem_id() is not valid, because it takes
> > > > > the numa
> > > > > node id from the executing CPU and the call to
> > > > > __page_pool_put_page()
> > > > > can happen on a remote CPU (e.g. cpumap redirect, and in
> > > > > future
> > > > > SKBs).
> > > > > 
> > > > >  
> > > > > > I suggest the following:
> > > > > > 
> > > > > > return !page_pfmemalloc() && 
> > > > > > ( page_to_nid(page) == pool->p.nid || pool->p.nid ==
> > > > > > NUMA_NO_NODE );  
> > > > > 
> > > > > Above code doesn't generate optimal ASM code, I suggest:
> > > > > 
> > > > >  static bool pool_page_reusable(struct page_pool *pool,
> > > > > struct page *page)
> > > > >  {
> > > > > 	return !page_is_pfmemalloc(page) &&
> > > > > 		pool->p.nid != NUMA_NO_NODE &&
> > > > > 		page_to_nid(page) == pool->p.nid;
> > > > >  }
> > > > >  
> > > > 
> > > > this is not equivalent to the above. Here in case pool->p.nid
> > > > is
> > > > NUMA_NO_NODE, pool_page_reusable() will always be false.
> > > > 
> > > > We can avoid the extra check in data path.
> > > > How about avoiding NUMA_NO_NODE in page_pool altogether, and
> > > > force
> > > > numa_mem_id() as pool->p.nid when user requests NUMA_NO_NODE at
> > > > page
> > > > pool init, as already done in alloc_pages_node().   
> > > 
> > > That means we will not support page reuse migration for NUMA_NO_NODE,
> > > which is not the same semantics as how alloc_pages_node() handles
> > > NUMA_NO_NODE, because alloc_pages_node() will allocate the page based
> > > on the node of the currently running cpu.
> > 
> > True, as I wrote (below) my code defines the semantics as: a page_pool
> > configured with NUMA_NO_NODE means skip NUMA checks, and allow recycling
> > regardless of the NUMA node the page belongs to.  It seems that you want
> > other semantics.
> 
> For drivers that do not have page pool support yet, the semantics seem
> to be: always allocate and recycle local pages (except pfmemalloc ones).
> Page reuse migrates when the rx interrupt affinity moves (the NAPI
> polling context moves accordingly), regardless of dev_to_node().
> 
> It would be good to maintain the above semantics. And the rx data page
> seems to be close to the cpu doing the rx cleaning, which means the cpu
> can process the buffer more quickly?
> 
> > I'm open to other semantics. My main concern is performance.  The
> > page_pool fast-path for the driver-recycling use-case of XDP_DROP has
> > extreme performance requirements, as it needs to compete with
> > driver
> > local recycle tricks (else we cannot use page_pool to simplify
> > drivers).
> > The extreme performance target is 100Gbit/s = 148Mpps = 6.72ns, and
> > in practice I'm measuring 25Mpps = 40ns with Mlx5 driver (single
> > q),
> > and Bjørn is showing 30 Mpps = 33.3ns with i40e.  At this level
> > every
> > cycle/instruction counts.
> 
> Yes, the performance is a concern too.
> But if the rx page is closer to the cpu, maybe the time taken to
> process the buffer can be reduced?
> 
> It is good to allocate the rx page close to both cpu and device, but if
> both goals cannot be reached, maybe we should choose to allocate pages
> close to the cpu?
> 

This should be up to the user; page pool shouldn't care much.
At init, the user picks whatever node they want.
On the data path/napi, recycling will initially occur only on the
initial pool->p.nid; dynamically, the user can change it via
page_pool_nid_changed().

Using dev_to_node() or numa_mem_id() are totally different scenarios
for different use cases.

Normally you should use numa_mem_id() (close to the cpu) for RX data
buffers, and dev_to_node() (close to the device) for HW ring descriptors.

If we make the pool biased to one use case, then we can't support the
other; thus keep this outside of the pool.
The user must set the nid on initialization and change it on the fly if
necessary.
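
E.g. at init (a hypothetical snippet; the field names follow struct
page_pool_params, everything else is made up):

struct page_pool_params pp_params = {
	.flags		= 0,
	.order		= 0,
	.pool_size	= 1024,			/* illustrative */
	.nid		= numa_mem_id(),	/* RX buffers: close to the cpu */
	.dev		= rq->dev,
	.dma_dir	= DMA_FROM_DEVICE,
};

/* or .nid = dev_to_node(rq->dev) to stay close to the device instead */
rq->page_pool = page_pool_create(&pp_params);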

If the user chooses NUMA_NO_NODE, then the pool might decide to go with
numa_mem_id(), which is my latest proposed solution for this and will at
least solve the regression.


> >  
> > > Also, There seems to be a wild guessing of the node id here,
> > > which has
> > > been disscussed before and has not reached a agreement yet.
> > > 
> > > > which will imply recycling without adding any extra condition
> > > > to the
> > > > data path.
> > 
> > I love code that moves thing out of our fast-path.
> > 
> > > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > > > index a6aefe989043..00c99282a306 100644
> > > > --- a/net/core/page_pool.c
> > > > +++ b/net/core/page_pool.c
> > > > @@ -28,6 +28,9 @@ static int page_pool_init(struct page_pool
> > > > *pool,
> > > >  
> > > >         memcpy(&pool->p, params, sizeof(pool->p));
> > > >  
> > > > +       /* overwrite to allow recycling.. */
> > > > +       if (pool->p.nid == NUMA_NO_NODE)
> > > > +               pool->p.nid = numa_mem_id();
> > 
> > The problem is that page_pool_init() can be initiated from a random
> > CPU, first at driver setup/bringup, and later at other queue changes
> > that can be started via ethtool or XDP attach. (numa_mem_id() picks
> > from the running CPU.)
> 
> Yes, changing the ring number or ring depth releases and allocates rx
> data pages, so using NUMA_NO_NODE to allocate pages and always recycling
> those pages may have a performance difference noticeable to the user.
> 
> > As Yunsheng mentioned elsewhere, there is also a dev_to_node()
> > function.
> > Isn't that what we want in a place like this?
> > 
> > 
> > One issue with dev_to_node() is that in the case of !CONFIG_NUMA it
> > returns NUMA_NO_NODE (-1).  (And device_initialize() also sets it to
> > -1.)  Thus, in that case we set pool->p.nid = 0, as page_to_nid() will
> > also return zero in that case (as far as I follow the code).
> > 
> > 
> > > > After a quick look, I don't see any reason to keep NUMA_NO_NODE in
> > > > pool->p.nid.
> > > > 
> > > >   
> > > > > I have compiled different variants and looked at the ASM code
> > > > > generated by GCC.  This seems to give the best result.
> > > > > 
> > > > >  
> > > > > > 1) never recycle emergency pages, regardless of pool nid.
> > > > > > 2) always recycle if pool is NUMA_NO_NODE.  
> > > > > 
> > > > > Yes, this defines the semantics, that a page_pool configured
> > > > > with
> > > > > NUMA_NO_NODE means skip NUMA checks.  I think that sounds
> > > > > okay...
> > > > > 
> > > > >  
> > > > > > the above change should not add any overhead, a modest
> > > > > > branch
> > > > > > predictor will handle this with no effort.  
> > > > > 
> > > > > It still annoys me that we keep adding instructions to this
> > > > > code
> > > > > hot-path (I counted 34 bytes and 11 instructions in my
> > > > > proposed
> > > > > function).
> > > > > 
> > > > > I think that it might be possible to move these NUMA checks
> > > > > to
> > > > > alloc-side (instead of return/recycles side as today), and
> > > > > perhaps
> > > > > only on slow-path when dequeuing from ptr_ring (as recycles
> > > > > that
> > > > > call __page_pool_recycle_direct() will be pinned during
> > > > > NAPI).
> > > > > But let's focus on a smaller fix for the immediate issue...
> > > > >  
> > > > 
> > > > I know. It annoys me too, but we need recycling to work in
> > > > production, where rings/napi can migrate and numa nodes can be
> > > > NUMA_NO_NODE :-(.
> > > > 
> > > >   

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: 答复: [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-16 10:13                       ` Ilias Apalodimas
  2019-12-16 10:16                         ` Ilias Apalodimas
@ 2019-12-17 19:38                         ` Saeed Mahameed
  1 sibling, 0 replies; 22+ messages in thread
From: Saeed Mahameed @ 2019-12-17 19:38 UTC (permalink / raw)
  To: ilias.apalodimas, Li Rongqing
  Cc: mhocko, gregkh, peterz, linyunsheng, jonathan.lemon, netdev,
	linux-kernel, brouer, bhelgaas, bjorn.topel

On Mon, 2019-12-16 at 12:13 +0200, Ilias Apalodimas wrote:
> On Mon, Dec 16, 2019 at 04:02:04AM +0000, Li,Rongqing wrote:
> > 
> > > -----Original Message-----
> > > From: Yunsheng Lin [mailto:linyunsheng@huawei.com]
> > > Sent: 2019-12-16 9:51
> > > To: Jesper Dangaard Brouer <brouer@redhat.com>
> > > Cc: Li,Rongqing <lirongqing@baidu.com>; Saeed Mahameed
> > > <saeedm@mellanox.com>; ilias.apalodimas@linaro.org;
> > > jonathan.lemon@gmail.com; netdev@vger.kernel.org; mhocko@kernel.org;
> > > peterz@infradead.org; Greg Kroah-Hartman <gregkh@linuxfoundation.org>;
> > > bhelgaas@google.com; linux-kernel@vger.kernel.org; Björn Töpel
> > > <bjorn.topel@intel.com>
> > > Subject: Re: [PATCH][v2] page_pool: handle page recycle for
> > > NUMA_NO_NODE condition
> > > 
> > > On 2019/12/13 16:48, Jesper Dangaard Brouer wrote:
> > > > You are basically saying that the NUMA check should be moved to
> > > > allocation time, as it is running on the RX CPU (NAPI).  And
> > > > eventually, after some time, the pages will come from the correct
> > > > NUMA node.
> > > > 
> > > > I think we can do that, and only affect the semi-fast-path.
> > > > We just need to handle that pages in the ptr_ring that are
> > > > recycled
> > > > can be from the wrong NUMA node.  In __page_pool_get_cached()
> > > > when
> > > > consuming pages from the ptr_ring (__ptr_ring_consume_batched),
> > > > then
> > > > we can evict pages from wrong NUMA node.
> > > 
> > > Yes, that's workable.
> > > 
> > > > For the pool->alloc.cache we either accept that it will eventually,
> > > > after some time, be emptied (it is only in a 100% XDP_DROP workload
> > > > that it will continue to reuse the same pages), or we simply clear
> > > > the pool->alloc.cache when calling page_pool_update_nid().
> > > 
> > > Simply clearing the pool->alloc.cache when calling
> > > page_pool_update_nid()
> > > seems better.
> > > 
> > 
> > How about the code below? The driver can configure p.nid to anything;
> > it will be adjusted in NAPI polling, so irq migration will not be a
> > problem, but it will add a check to the hot path.
> 
> We'll have to check the impact on some high speed (i.e. 100gbit)
> interfaces before doing anything like that. Saeed's current patch runs
> once per NAPI; this runs once per packet. The load might be measurable.
> The READ_ONCE is needed in case all producers/consumers run on the
> same CPU, right?
> 

I agree with Ilias, and as I explained, this will make the pool biased
toward cpu-local pages only, and we want to avoid this.

Li, can you please check if this fixes your issue:

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index a6aefe989043..00c99282a306 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -28,6 +28,9 @@ static int page_pool_init(struct page_pool *pool,
 
        memcpy(&pool->p, params, sizeof(pool->p));
 
+       /* overwrite to allow recycling.. */
+       if (pool->p.nid == NUMA_NO_NODE)
+               pool->p.nid = numa_mem_id();

If the user wants dev_to_node(), they can use dev_to_node() at pool
initialization rather than NUMA_NO_NODE.


> Thanks
> /Ilias
> > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > index a6aefe989043..4374a6239d17 100644
> > --- a/net/core/page_pool.c
> > +++ b/net/core/page_pool.c
> > @@ -108,6 +108,10 @@ static struct page
> > *__page_pool_get_cached(struct page_pool *pool)
> >                 if (likely(pool->alloc.count)) {
> >                         /* Fast-path */
> >                         page = pool->alloc.cache[--pool->alloc.count];
> > +
> > +                       if (unlikely(READ_ONCE(pool->p.nid) != numa_mem_id()))
> > +                               WRITE_ONCE(pool->p.nid, numa_mem_id());
> > +
> >                         return page;
> >                 }
> >                 refill = true;
> > @@ -155,6 +159,10 @@ static struct page
> > *__page_pool_alloc_pages_slow(struct page_pool *pool,
> >         if (pool->p.order)
> >                 gfp |= __GFP_COMP;
> >  
> > +
> > +       if (unlikely(READ_ONCE(pool->p.nid) != numa_mem_id()))
> > +               WRITE_ONCE(pool->p.nid, numa_mem_id());
> > +
> >         /* FUTURE development:
> >          *
> >          * Current slow-path essentially falls back to single page
> > Thanks
> > 
> > -Li

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-17  9:11                   ` Michal Hocko
@ 2019-12-19  2:09                     ` Yunsheng Lin
  2019-12-19 11:53                       ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Yunsheng Lin @ 2019-12-19  2:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Ilias Apalodimas, Saeed Mahameed, brouer, jonathan.lemon,
	Li Rongqing, netdev, peterz, Greg Kroah-Hartman, bhelgaas,
	linux-kernel

On 2019/12/17 17:11, Michal Hocko wrote:
> On Tue 17-12-19 10:11:12, Yunsheng Lin wrote:
>> On 2019/12/16 21:21, Ilias Apalodimas wrote:
>>> On Mon, Dec 16, 2019 at 02:08:45PM +0100, Michal Hocko wrote:
>>>> On Mon 16-12-19 14:34:26, Ilias Apalodimas wrote:
>>>>> Hi Michal, 
>>>>> On Mon, Dec 16, 2019 at 01:15:57PM +0100, Michal Hocko wrote:
>>>>>> On Thu 12-12-19 09:34:14, Yunsheng Lin wrote:
>>>>>>> +CC Michal, Peter, Greg and Bjorn
>>>>>>> Because there has been discussion before about where and how
>>>>>>> NUMA_NO_NODE should be handled.
>>>>>>
>>>>>> I do not have a full context. What is the question here?
>>>>>
>>>>> When we allocate pages for the page_pool API, during init, the driver writer
>>>>> decides which NUMA node to use. The API can, in some cases, recycle the memory
>>>>> instead of freeing and re-allocating it. If the NUMA node has changed (irq
>>>>> affinity for example), we forbid recycling and free the memory, since recycling
>>>>> and using memory on far NUMA nodes is more expensive (more expensive than
>>>>> recycling, at least on the architectures we tried anyway).
>>>>> Since this would be expensive to do per packet, the burden falls on the
>>>>> driver writer. Drivers *have* to call page_pool_update_nid() or
>>>>> page_pool_nid_changed(), which runs once per NAPI cycle, if they want to
>>>>> check for that.
>>>>
>>>> Thanks for the clarification.
>>>>
>>>>> The current code in the API though does not account for NUMA_NO_NODE. That's
>>>>> what this is trying to fix.
>>>>> If the page_pool params are initialized with that, we *never* recycle
>>>>> the memory. This is happening because the API is allocating memory with 
>>>>> 'nid = numa_mem_id()' if NUMA_NO_NODE is configured, so the current check
>>>>> 'page_to_nid(page) == pool->p.nid' never passes.
>>>>
>>>> OK. There is no explicit mention of the expected behavior for
>>>> NUMA_NO_NODE. The semantic is usually that there is no NUMA placement
>>>> requirement and the MM code simply starts the allocation from the local node
>>>> in that case. But the memory might come from any node so there is no
>>>> "local node" guarantee.
>>>>
>>>> So the main question is what is the expected semantic? Do people expect
>>>> that NUMA_NO_NODE implies locality? Why don't you simply always reuse
>>>> when there was no explicit numa requirement?
>>
>> For drivers that do not support page pool yet, NUMA_NO_NODE seems to
>> imply locality, see [1].
> 
> Which is kinda awkward, no? Is there any reason for __dev_alloc_pages to
> not use numa_mem_id explicitly when the local node affinity is required?

Not sure why.

And it seems other places in the net code use NUMA_NO_NODE too:

grep -rn "NUMA_NO_NODE" net
net/openvswitch/flow_table.c:85:                                      node_online(0) ? 0 : NUMA_NO_NODE);
net/core/skbuff.c:436:          skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
net/core/skbuff.c:507:          skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
net/core/skbuff.c:1514:                                 skb_alloc_rx_flag(skb), NUMA_NO_NODE);
net/core/skbuff.c:1553: struct sk_buff *n = __alloc_skb(size, gfp_mask, flags, NUMA_NO_NODE);
net/core/skbuff.c:1629:                        gfp_mask, NUMA_NO_NODE, NULL);
net/core/skbuff.c:1747:                                 NUMA_NO_NODE);
net/core/skbuff.c:3811:                                    NUMA_NO_NODE);
net/core/skbuff.c:5720:                        gfp_mask, NUMA_NO_NODE, NULL);
net/core/skbuff.c:5844:                        gfp_mask, NUMA_NO_NODE, NULL);
net/core/dev.c:2378:                            NUMA_NO_NODE);
net/core/dev.c:2617:                                         numa_node_id : NUMA_NO_NODE);
net/core/dev.c:9117:    netdev_queue_numa_node_write(queue, NUMA_NO_NODE);
net/core/pktgen.c:3632: pkt_dev->node = NUMA_NO_NODE;
net/sunrpc/svc.c:296:   return NUMA_NO_NODE;
net/qrtr/qrtr.c:97:static unsigned int qrtr_local_nid = NUMA_NO_NODE;


> There is no real guarantee that NUMA_NO_NODE is going to imply the local
> node, and we do not want to grow any subtle dependency on that behavior.

Strictly speaking, using numa_mem_id() also does not really guarantee
that it will allocate local memory when local memory is used up, right?
Because alloc_pages_node() basically turns the node into numa_mem_id()
when it is NUMA_NO_NODE.

Unless we disallow passing NUMA_NO_NODE to alloc_pages_node(),
I cannot see the difference between NUMA_NO_NODE and numa_mem_id().

> 
>> And for those drivers, locality is decided by rx interrupt affinity, not
>> dev_to_node(). So when the rx interrupt affinity changes, the old page from
>> the old node will not be recycled (the page_to_nid(page) == numa_mem_id()
>> check fails), new pages will be allocated to replace the old ones, and the
>> new pages will be recycled because allocation and recycling happen in the
>> same context, which means numa_mem_id() returns the same node as the newly
>> allocated page; see [2].
> 
> Well, but my understanding is that the generic page pool implementation
> has a clear means to change the affinity (namely page_pool_update_nid()).
> So my primary question is: why should NUMA_NO_NODE be used as a
> bypass for that?

In that case, page_pool_update_nid() needs to be called explicitly, which
may not happen in reality, because among drivers using page pool now, mlx5
seems to be the only one to call page_pool_update_nid(), which may lead to
copy & paste problems when one is not careful enough.
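
For reference, the helper itself is trivial (roughly, from
include/net/page_pool.h); the problem is that every driver has to
remember to call it:

static inline void page_pool_nid_changed(struct page_pool *pool, int new_nid)
{
	if (unlikely(pool->p.nid != new_nid))
		page_pool_update_nid(pool, new_nid);
}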


> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE condition
  2019-12-19  2:09                     ` Yunsheng Lin
@ 2019-12-19 11:53                       ` Michal Hocko
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2019-12-19 11:53 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: Ilias Apalodimas, Saeed Mahameed, brouer, jonathan.lemon,
	Li Rongqing, netdev, peterz, Greg Kroah-Hartman, bhelgaas,
	linux-kernel

On Thu 19-12-19 10:09:33, Yunsheng Lin wrote:
[...]
> > There is no real guarantee that NUMA_NO_NODE is going to imply the local
> > node, and we do not want to grow any subtle dependency on that behavior.
> 
> Strictly speaking, using numa_mem_id() also does not really guarantee
> that it will allocate local memory when local memory is used up, right?
> Because alloc_pages_node() basically turns the node into numa_mem_id()
> when it is NUMA_NO_NODE.

yes, both allocations are allowed to fallback to other nodes unless
there is an explicit nodemask specified.

> Unless we disallow passing NUMA_NO_NODE to alloc_pages_node(),
> I cannot see the difference between NUMA_NO_NODE and numa_mem_id().

The difference is in the presented intention. NUMA_NO_NODE means no node
preference. We turn it into an implicit local node preference because
this is the best assumption we can make in general. If you provide numa_mem_id
then you explicitly ask for the local node preference because you know
that this is the best for your specific code. See the difference?

The NUMA_NO_NODE -> local node is a heuristic which might change
(albeit unlikely).
 
> >> And for those drivers, locality is decided by rx interrupt affinity, not
> >> dev_to_node(). So when rx interrupt affinity changes, the old page from old
> >> node will not be recycled(by checking page_to_nid(page) == numa_mem_id()),
> >> new pages will be allocated to replace the old pages and the new pages will
> >> be recycled because allocation and recycling happens in the same context,
> >> which means numa_mem_id() returns the same node of new page allocated, see
> >> [2].
> > 
> > Well, but my understanding is that the generic page pool implementation
> > has a clear means to change the affinity (namely page_pool_update_nid()).
> > So my primary question is, why does NUMA_NO_NODE should be use as a
> > bypass for that?
> 
> In that case, page_pool_update_nid() need to be called explicitly, which
> may not be the reality, because for drivers using page pool now, mlx5 seems
> to be the only one to call page_pool_update_nid(), which may lead to
> copy & paste problem when not careful enough.

The API is quite new AFAIU and I think it would be better to use it in
the intended way. Relying on implicit and undocumented behavior is just
going to bend that API in the future and it will impose an additional
burden on any future changes.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2019-12-19 11:53 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1575624767-3343-1-git-send-email-lirongqing@baidu.com>
     [not found] ` <9fecbff3518d311ec7c3aee9ae0315a73682a4af.camel@mellanox.com>
     [not found]   ` <20191211194933.15b53c11@carbon>
     [not found]     ` <831ed886842c894f7b2ffe83fe34705180a86b3b.camel@mellanox.com>
2019-12-12  1:34       ` [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE condition Yunsheng Lin
2019-12-12 10:18         ` Jesper Dangaard Brouer
2019-12-13  3:40           ` Yunsheng Lin
2019-12-13  6:27             ` 答复: " Li,Rongqing
2019-12-13  6:53               ` Yunsheng Lin
2019-12-13  8:48                 ` Jesper Dangaard Brouer
2019-12-16  1:51                   ` Yunsheng Lin
2019-12-16  4:02                     ` 答复: " Li,Rongqing
2019-12-16 10:13                       ` Ilias Apalodimas
2019-12-16 10:16                         ` Ilias Apalodimas
2019-12-16 10:57                           ` 答复: " Li,Rongqing
2019-12-17 19:38                         ` Saeed Mahameed
2019-12-17 19:35             ` Saeed Mahameed
2019-12-17 19:27           ` Saeed Mahameed
2019-12-16 12:15         ` Michal Hocko
2019-12-16 12:34           ` Ilias Apalodimas
2019-12-16 13:08             ` Michal Hocko
2019-12-16 13:21               ` Ilias Apalodimas
2019-12-17  2:11                 ` Yunsheng Lin
2019-12-17  9:11                   ` Michal Hocko
2019-12-19  2:09                     ` Yunsheng Lin
2019-12-19 11:53                       ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).