From: Saeed Mahameed <saeedm@mellanox.com>
To: "linyunsheng@huawei.com" <linyunsheng@huawei.com>,
	"brouer@redhat.com" <brouer@redhat.com>
Cc: "mhocko@kernel.org" <mhocko@kernel.org>,
	"gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>,
	"peterz@infradead.org" <peterz@infradead.org>,
	Li Rongqing <lirongqing@baidu.com>,
	"ilias.apalodimas@linaro.org" <ilias.apalodimas@linaro.org>,
	"jonathan.lemon@gmail.com" <jonathan.lemon@gmail.com>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"bhelgaas@google.com" <bhelgaas@google.com>,
	"bjorn.topel@intel.com" <bjorn.topel@intel.com>
Subject: Re: [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE condition
Date: Tue, 17 Dec 2019 19:27:19 +0000	[thread overview]
Message-ID: <7bd5765b3f447ff7e42ae37b9c380418cd965849.camel@mellanox.com> (raw)
In-Reply-To: <20191212111831.2a9f05d3@carbon>

On Thu, 2019-12-12 at 11:18 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 12 Dec 2019 09:34:14 +0800
> Yunsheng Lin <linyunsheng@huawei.com> wrote:
> 
> > +CC Michal, Peter, Greg and Bjorn
> > Because there has been discussion before about where and how
> > NUMA_NO_NODE should be handled.
> > 
> > On 2019/12/12 5:24, Saeed Mahameed wrote:
> > > On Wed, 2019-12-11 at 19:49 +0100, Jesper Dangaard Brouer
> > > wrote:  
> > > > On Sat, 7 Dec 2019 03:52:41 +0000
> > > > Saeed Mahameed <saeedm@mellanox.com> wrote:
> > > >  
> > > > > I don't think it is correct to check that the page nid is
> > > > > the same as numa_mem_id() if the pool is NUMA_NO_NODE. In
> > > > > such a case we should allow all pages to recycle, because
> > > > > you can't assume where pages are allocated from and where
> > > > > they are being handled.
> > > > 
> > > > I agree, using numa_mem_id() is not valid, because it takes
> > > > the numa node id from the executing CPU and the call to
> > > > __page_pool_put_page() can happen on a remote CPU (e.g. cpumap
> > > > redirect, and in future SKBs).
> > > > 
> > > >  
> > > > > I suggest the following:
> > > > > 
> > > > > return !page_is_pfmemalloc(page) &&
> > > > > 	(page_to_nid(page) == pool->p.nid ||
> > > > > 	 pool->p.nid == NUMA_NO_NODE);
> > > > 
> > > > Above code doesn't generate optimal ASM code, I suggest:
> > > > 
> > > >  static bool pool_page_reusable(struct page_pool *pool,
> > > > 				 struct page *page)
> > > >  {
> > > > 	return !page_is_pfmemalloc(page) &&
> > > > 		pool->p.nid != NUMA_NO_NODE &&
> > > > 		page_to_nid(page) == pool->p.nid;
> > > >  }
> > > >  
> > > 
> > > This is not equivalent to the above: in case pool->p.nid is
> > > NUMA_NO_NODE, pool_page_reusable() will always be false.
> > > 
> > > We can avoid the extra check in the data path.
> > > How about avoiding NUMA_NO_NODE in page_pool altogether, and
> > > forcing numa_mem_id() as pool->p.nid when the user requests
> > > NUMA_NO_NODE at page pool init, as already done in
> > > alloc_pages_node()?
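
For reference, a minimal sketch of a single-expression form that keeps
the "NUMA_NO_NODE means always recycle" semantics suggested above,
using page_is_pfmemalloc() (the actual kernel helper); this is an
illustration, not the patch under discussion:

 static bool pool_page_reusable(struct page_pool *pool,
				struct page *page)
 {
	/* never reuse emergency (pfmemalloc) pages; otherwise recycle
	 * when the pool skips NUMA checks (NUMA_NO_NODE) or the page
	 * sits on the pool's configured node
	 */
	return !page_is_pfmemalloc(page) &&
	       (pool->p.nid == NUMA_NO_NODE ||
		page_to_nid(page) == pool->p.nid);
 }
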
> > 
> > That means we will not support page reuse mitigation for
> > NUMA_NO_NODE, which is not the same semantic as how
> > alloc_pages_node() handles NUMA_NO_NODE, because
> > alloc_pages_node() allocates the page based on the node of the
> > currently running cpu.
> 
> True, as I wrote (below), my code defines the semantics as: a
> page_pool configured with NUMA_NO_NODE means skip NUMA checks, and
> allow recycle

Your code will NOT allow recycling when NUMA_NO_NODE is configured,
so I am not sure which semantics you are referring to :)

> regardless of which NUMA node the page belongs to.  It seems that
> you want different semantics.
> 

I think the semantics we want are to allow recycling when
NUMA_NO_NODE is selected, to solve the reported issue, no?

> I'm open to other semantics. My main concern is performance.  The
> page_pool fast-path for the driver-recycling use-case of XDP_DROP
> has extreme performance requirements, as it needs to compete with
> driver-local recycle tricks (else we cannot use page_pool to
> simplify drivers).
> The extreme performance target is 100Gbit/s = 148Mpps = 6.72ns, and
> in practice I'm measuring 25Mpps = 40ns with the mlx5 driver
> (single q), and Bjørn is showing 30 Mpps = 33.3ns with i40e.  At
> this level every cycle/instruction counts.
> 

I agree.
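
For reference, the 6.72 ns budget is the standard line-rate figure
for minimum-size Ethernet frames, counting 84 B per frame on the wire
(64 B frame + 20 B preamble/IFG):

  100 Gbit/s / (84 B * 8 bit/B) ~= 148.8 Mpps
  1 / 148.8 Mpps                ~= 6.72 ns per packet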

>  
> > Also, there seems to be wild guessing of the node id here, which
> > has been discussed before and has not reached agreement yet.
> > 
> > > which will imply recycling without adding any extra condition
> > > to the data path.
> 
> I love code that moves things out of our fast-path.
> 
> > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > > index a6aefe989043..00c99282a306 100644
> > > --- a/net/core/page_pool.c
> > > +++ b/net/core/page_pool.c
> > > @@ -28,6 +28,10 @@ static int page_pool_init(struct page_pool *pool,
> > >  
> > >         memcpy(&pool->p, params, sizeof(pool->p));
> > >  
> > > +       /* overwrite to allow recycling.. */
> > > +       if (pool->p.nid == NUMA_NO_NODE)
> > > +               pool->p.nid = numa_mem_id();
> > > +
> 
> The problem is that page_pool_init() can be initiated from a random
> CPU, first at driver setup/bringup, and later at other queue changes
> that can be started via ethtool or XDP attach. (numa_mem_id() picks
> the node from the running CPU.)
> 

Well, if the user selected NUMA_NO_NODE, then it is their choice.
Also, the user always has the ability to change pool->p.nid using the
API I introduced, so I don't see any issue with this.
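
For context, a minimal sketch of how a driver could use that update
API (assuming the page_pool_nid_changed()/page_pool_update_nid()
helpers from the patches referenced in this thread), called from NAPI
poll so the pool's nid tracks the CPU actually servicing the ring:

 #include <net/page_pool.h>

 /* hedged sketch: keep pool->p.nid in sync with the node of the CPU
  * running NAPI; page_pool_nid_changed() is a no-op when they match
  */
 static void rq_sync_pool_nid(struct page_pool *pool)
 {
	page_pool_nid_changed(pool, numa_mem_id());
 }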

> As Yunsheng mentioned elsewhere, there is also a dev_to_node()
> function.
> Isn't that what we want in a place like this?
> 

We should keep this outside page_pool: if the user wants
dev_to_node(), they can set it as a parameter to the page pool from
outside, as in the sketch below. When NUMA_NO_NODE is selected, it
means the user doesn't really care.
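
A sketch of what "set it from outside" could look like at pool
creation time (field names follow struct page_pool_params of this
era; ring_size and dev are assumed driver variables):

 struct page_pool_params pp_params = {
	.order		= 0,
	.flags		= 0,
	.pool_size	= ring_size,		/* assumed driver value */
	.nid		= dev_to_node(dev),	/* instead of NUMA_NO_NODE */
	.dev		= dev,
	.dma_dir	= DMA_FROM_DEVICE,
 };
 struct page_pool *pool = page_pool_create(&pp_params);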

> 
> One issue with dev_to_node() is that in case of !CONFIG_NUMA it
> returns NUMA_NO_NODE (-1).  (And device_initialize() also sets it
> to -1.)  Thus, in that case we set pool->p.nid = 0, as
> page_to_nid() will also return zero in that case (as far as I
> follow the code).
> 

Yes, this is the idea: since alloc_pages() will also select node 0 if
you keep NUMA_NO_NODE, recycling will happen inherently by design
without any change to the data path.

> > > After a quick look, I don't see any reason to keep NUMA_NO_NODE
> > > in pool->p.nid.
> > > 
> > >   
> > > > I have compiled different variants and looked at the ASM code
> > > > generated by GCC.  This seems to give the best result.
> > > > 
> > > >  
> > > > > 1) never recycle emergency pages, regardless of pool nid.
> > > > > 2) always recycle if pool is NUMA_NO_NODE.  
> > > > 
> > > > Yes, this defines the semantics: a page_pool configured with
> > > > NUMA_NO_NODE means skip NUMA checks.  I think that sounds
> > > > okay...
> > > > 
> > > >  
> > > > > The above change should not add any overhead; a modest
> > > > > branch predictor will handle this with no effort.
> > > > 
> > > > It still annoys me that we keep adding instructions to this
> > > > hot-path code (I counted 34 bytes and 11 instructions in my
> > > > proposed function).
> > > > 
> > > > I think that it might be possible to move these NUMA checks
> > > > to the alloc side (instead of the return/recycle side as
> > > > today), and perhaps only on the slow path when dequeuing from
> > > > ptr_ring (as recycles that call __page_pool_recycle_direct()
> > > > will be pinned during NAPI).  But let's focus on a smaller fix
> > > > for the immediate issue...
> > > >  
> > > 
> > > I know. It annoys me too, but we need recycling to work in
> > > production, where rings/NAPI can migrate and numa nodes can be
> > > NUMA_NO_NODE :-(.
> > > 
> > >   
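
As an aside, a minimal sketch of the alloc-side idea Jesper mentions
above (an illustration only: the refill entry point and the use of
__page_pool_return_page() are assumptions about page_pool internals,
not an agreed design):

 /* hedged sketch: do the NUMA check only on the slow path, when
  * refilling from ptr_ring, so the recycle fast path stays
  * branch-free
  */
 static struct page *page_pool_refill_numa_checked(struct page_pool *pool)
 {
	struct page *page;

	while ((page = ptr_ring_consume(&pool->ring))) {
		if (likely(page_to_nid(page) == numa_mem_id()))
			return page;
		/* remote-node page: release it instead of reusing */
		__page_pool_return_page(pool, page);
	}
	return NULL;
 }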
