From: Saeed Mahameed <saeedm@mellanox.com>
To: "linyunsheng@huawei.com" <linyunsheng@huawei.com>,
	"brouer@redhat.com" <brouer@redhat.com>
Cc: "mhocko@kernel.org" <mhocko@kernel.org>,
	"gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>,
	"peterz@infradead.org" <peterz@infradead.org>,
	Li Rongqing <lirongqing@baidu.com>,
	"ilias.apalodimas@linaro.org" <ilias.apalodimas@linaro.org>,
	"jonathan.lemon@gmail.com" <jonathan.lemon@gmail.com>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"bhelgaas@google.com" <bhelgaas@google.com>,
	"bjorn.topel@intel.com" <bjorn.topel@intel.com>
Subject: Re: [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE condition
Date: Tue, 17 Dec 2019 19:27:19 +0000	[thread overview]
Message-ID: <7bd5765b3f447ff7e42ae37b9c380418cd965849.camel@mellanox.com> (raw)
In-Reply-To: <20191212111831.2a9f05d3@carbon>

On Thu, 2019-12-12 at 11:18 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 12 Dec 2019 09:34:14 +0800
> Yunsheng Lin <linyunsheng@huawei.com> wrote:
> 
> > +CC Michal, Peter, Greg and Bjorn
> > Because there has been discussion before about where and how
> > NUMA_NO_NODE should be handled.
> > 
> > On 2019/12/12 5:24, Saeed Mahameed wrote:
> > > On Wed, 2019-12-11 at 19:49 +0100, Jesper Dangaard Brouer
> > > wrote:  
> > > > On Sat, 7 Dec 2019 03:52:41 +0000
> > > > Saeed Mahameed <saeedm@mellanox.com> wrote:
> > > >  
> > > > > I don't think it is correct to check that the page nid is
> > > > > the same as numa_mem_id() if the pool is NUMA_NO_NODE. In
> > > > > such a case we should allow all pages to recycle, because
> > > > > you can't assume where pages are allocated from and where
> > > > > they are being handled.
> > > > 
> > > > I agree, using numa_mem_id() is not valid, because it takes
> > > > the numa node id from the executing CPU and the call to
> > > > __page_pool_put_page() can happen on a remote CPU (e.g. cpumap
> > > > redirect, and in future SKBs).
> > > > 
> > > >  
> > > > > I suggest the following:
> > > > > 
> > > > > return !page_is_pfmemalloc(page) &&
> > > > > 	(page_to_nid(page) == pool->p.nid ||
> > > > > 	 pool->p.nid == NUMA_NO_NODE);
> > > > 
> > > > Above code doesn't generate optimal ASM code, I suggest:
> > > > 
> > > >  static bool pool_page_reusable(struct page_pool *pool,
> > > > 				 struct page *page)
> > > >  {
> > > > 	return !page_is_pfmemalloc(page) &&
> > > > 		pool->p.nid != NUMA_NO_NODE &&
> > > > 		page_to_nid(page) == pool->p.nid;
> > > >  }
> > > >  
> > > 
> > > This is not equivalent to the above: in case pool->p.nid is
> > > NUMA_NO_NODE, pool_page_reusable() will always be false.
> > > 
> > > We can avoid the extra check in the data path.
> > > How about avoiding NUMA_NO_NODE in page_pool altogether, and
> > > forcing numa_mem_id() as pool->p.nid when the user requests
> > > NUMA_NO_NODE at page pool init, as already done in
> > > alloc_pages_node()?
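
For reference, a minimal sketch of a single-expression form that keeps
the "NUMA_NO_NODE means always recycle" semantics suggested above,
using page_is_pfmemalloc() (the actual kernel helper); this is an
illustration, not the patch under discussion:

 static bool pool_page_reusable(struct page_pool *pool,
				struct page *page)
 {
	/* never reuse emergency (pfmemalloc) pages; otherwise recycle
	 * when the pool skips NUMA checks (NUMA_NO_NODE) or the page
	 * sits on the pool's configured node
	 */
	return !page_is_pfmemalloc(page) &&
	       (pool->p.nid == NUMA_NO_NODE ||
		page_to_nid(page) == pool->p.nid);
 }
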
> > 
> > That means we will not support page reuse mitigation for
> > NUMA_NO_NODE, which is not the same semantic as how
> > alloc_pages_node() handles NUMA_NO_NODE, because
> > alloc_pages_node() allocates the page based on the node of the
> > currently running cpu.
> 
> True, as I wrote (below), my code defines the semantics as: a
> page_pool configured with NUMA_NO_NODE means skip NUMA checks, and
> allow recycle

Your code will NOT allow recycling when NUMA_NO_NODE is configured,
so I am not sure which semantics you are referring to :)

> regardless of which NUMA node the page belongs to.  It seems that
> you want different semantics.
> 

I think the semantics we want are to allow recycling when
NUMA_NO_NODE is selected, to solve the reported issue, no?

> I'm open to other semantics. My main concern is performance.  The
> page_pool fast-path for the driver-recycling use-case of XDP_DROP
> has extreme performance requirements, as it needs to compete with
> driver-local recycle tricks (else we cannot use page_pool to
> simplify drivers).
> The extreme performance target is 100Gbit/s = 148Mpps = 6.72ns, and
> in practice I'm measuring 25Mpps = 40ns with the mlx5 driver
> (single q), and Bjørn is showing 30 Mpps = 33.3ns with i40e.  At
> this level every cycle/instruction counts.
> 

I agree.
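
For reference, the 6.72 ns budget is the standard line-rate figure
for minimum-size Ethernet frames, counting 84 B per frame on the wire
(64 B frame + 20 B preamble/IFG):

  100 Gbit/s / (84 B * 8 bit/B) ~= 148.8 Mpps
  1 / 148.8 Mpps                ~= 6.72 ns per packet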

>  
> > Also, there seems to be wild guessing of the node id here, which
> > has been discussed before and has not reached agreement yet.
> > 
> > > which will imply recycling without adding any extra condition
> > > to the data path.
> 
> I love code that moves things out of our fast-path.
> 
> > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > > index a6aefe989043..00c99282a306 100644
> > > --- a/net/core/page_pool.c
> > > +++ b/net/core/page_pool.c
> > > @@ -28,6 +28,10 @@ static int page_pool_init(struct page_pool *pool,
> > >  
> > >         memcpy(&pool->p, params, sizeof(pool->p));
> > >  
> > > +       /* overwrite to allow recycling.. */
> > > +       if (pool->p.nid == NUMA_NO_NODE)
> > > +               pool->p.nid = numa_mem_id();
> > > +
> 
> The problem is that page_pool_init() can be initiated from a random
> CPU, first at driver setup/bringup, and later at other queue changes
> that can be started via ethtool or XDP attach. (numa_mem_id() picks
> the node from the running CPU.)
> 

Well, if the user selected NUMA_NO_NODE, then it is their choice.
Also, the user always has the ability to change pool->p.nid using the
API I introduced, so I don't see any issue with this.
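
For context, a minimal sketch of how a driver could use that update
API (assuming the page_pool_nid_changed()/page_pool_update_nid()
helpers from the patches referenced in this thread), called from NAPI
poll so the pool's nid tracks the CPU actually servicing the ring:

 #include <net/page_pool.h>

 /* hedged sketch: keep pool->p.nid in sync with the node of the CPU
  * running NAPI; page_pool_nid_changed() is a no-op when they match
  */
 static void rq_sync_pool_nid(struct page_pool *pool)
 {
	page_pool_nid_changed(pool, numa_mem_id());
 }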

> As Yunsheng mentioned elsewhere, there is also a dev_to_node()
> function.
> Isn't that what we want in a place like this?
> 

We should keep this outside page_pool: if the user wants
dev_to_node(), they can set it as a parameter to the page pool from
outside, as in the sketch below. When NUMA_NO_NODE is selected, it
means the user doesn't really care.
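
A sketch of what "set it from outside" could look like at pool
creation time (field names follow struct page_pool_params of this
era; ring_size and dev are assumed driver variables):

 struct page_pool_params pp_params = {
	.order		= 0,
	.flags		= 0,
	.pool_size	= ring_size,		/* assumed driver value */
	.nid		= dev_to_node(dev),	/* instead of NUMA_NO_NODE */
	.dev		= dev,
	.dma_dir	= DMA_FROM_DEVICE,
 };
 struct page_pool *pool = page_pool_create(&pp_params);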

> 
> One issue with dev_to_node() is that in case of !CONFIG_NUMA it
> returns NUMA_NO_NODE (-1).  (And device_initialize() also sets it
> to -1.)  Thus, in that case we set pool->p.nid = 0, as
> page_to_nid() will also return zero in that case (as far as I
> follow the code).
> 

Yes, this is the idea: since alloc_pages() will also select node 0 if
you keep NUMA_NO_NODE, recycling will happen inherently by design
without any change to the data path.

> > > After a quick look, I don't see any reason to keep NUMA_NO_NODE
> > > in pool->p.nid.
> > > 
> > >   
> > > > I have compiled different variants and looked at the ASM code
> > > > generated by GCC.  This seems to give the best result.
> > > > 
> > > >  
> > > > > 1) never recycle emergency pages, regardless of pool nid.
> > > > > 2) always recycle if pool is NUMA_NO_NODE.  
> > > > 
> > > > Yes, this defines the semantics: a page_pool configured with
> > > > NUMA_NO_NODE means skip NUMA checks.  I think that sounds
> > > > okay...
> > > > 
> > > >  
> > > > > The above change should not add any overhead; a modest
> > > > > branch predictor will handle this with no effort.
> > > > 
> > > > It still annoys me that we keep adding instructions to this
> > > > hot-path code (I counted 34 bytes and 11 instructions in my
> > > > proposed function).
> > > > 
> > > > I think that it might be possible to move these NUMA checks
> > > > to the alloc side (instead of the return/recycle side as
> > > > today), and perhaps only on the slow path when dequeuing from
> > > > ptr_ring (as recycles that call __page_pool_recycle_direct()
> > > > will be pinned during NAPI).  But let's focus on a smaller fix
> > > > for the immediate issue...
> > > >  
> > > 
> > > I know. It annoys me too, but we need recycling to work in
> > > production, where rings/NAPI can migrate and numa nodes can be
> > > NUMA_NO_NODE :-(.
> > > 
> > >   
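
As an aside, a minimal sketch of the alloc-side idea Jesper mentions
above (an illustration only: the refill entry point and the use of
__page_pool_return_page() are assumptions about page_pool internals,
not an agreed design):

 /* hedged sketch: do the NUMA check only on the slow path, when
  * refilling from ptr_ring, so the recycle fast path stays
  * branch-free
  */
 static struct page *page_pool_refill_numa_checked(struct page_pool *pool)
 {
	struct page *page;

	while ((page = ptr_ring_consume(&pool->ring))) {
		if (likely(page_to_nid(page) == numa_mem_id()))
			return page;
		/* remote-node page: release it instead of reusing */
		__page_pool_return_page(pool, page);
	}
	return NULL;
 }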
