linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Johannes Weiner <hannes@cmpxchg.org>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Mel Gorman <mgorman@suse.de>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Feng Tang <feng.tang@intel.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Michal Hocko <mhocko@suse.com>, Rik van Riel <riel@surriel.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Yang Shi <shy828301@gmail.com>, Zi Yan <ziy@nvidia.com>,
	Wei Xu <weixugc@google.com>, osalvador <osalvador@suse.de>,
	Shakeel Butt <shakeelb@google.com>,
	zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Subject: Re: [PATCH -V11 2/3] NUMA balancing: optimize page placement for memory tiering system
Date: Thu, 17 Feb 2022 11:26:04 -0500	[thread overview]
Message-ID: <Yg53HIx8/Knw6TZ7@cmpxchg.org> (raw)
In-Reply-To: <87ee4cliia.fsf@yhuang6-desk2.ccr.corp.intel.com>

Hi Huang,

Sorry, I didn't see this reply until you sent out the new version
already :( Apologies.

On Wed, Feb 09, 2022 at 01:24:29PM +0800, Huang, Ying wrote:
> > On Fri, Jan 28, 2022 at 04:27:50PM +0800, Huang Ying wrote:
> >> @@ -615,6 +622,10 @@ faults may be controlled by the `numa_balancing_scan_period_min_ms,
> >>  numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
> >>  numa_balancing_scan_size_mb`_, and numa_balancing_settle_count sysctls.
> >>  
> >> +Or NUMA_BALANCING_MEMORY_TIERING to optimize page placement among
> >> +different types of memory (represented as different NUMA nodes) to
> >> +place the hot pages in the fast memory.  This is implemented based on
> >> +unmapping and page fault too.
> >
> > NORMAL | TIERING appears to be a non-sensical combination.
> >
> > Would it be better to have a tristate (disabled, normal, tiering)
> > rather than a mask?
> 
> NORMAL is for balancing cross-socket memory accessing among DRAM nodes.
> TIERING is for optimizing page placement between DRAM and PMEM in one
> socket.  We think it's possible to do both.
> 
> For example, with [3/3] of the patchset,
> 
> - TIERING: because DRAM pages aren't made PROT_NONE, it's disabled to
>   balance among DRAM nodes.
> 
> - NORMAL | TIERING: both cross-socket balancing among DRAM nodes and
>   page placement optimizing between DRAM and PMEM are enabled.

Ok, I get it. So NORMAL would enable PROT_NONE sampling on all nodes,
and TIERING would additionally raise the watermarks on DRAM nodes.

Thanks!

> >> @@ -2034,16 +2035,30 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
> >>  {
> >>  	int page_lru;
> >>  	int nr_pages = thp_nr_pages(page);
> >> +	int order = compound_order(page);
> >>  
> >> -	VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page), page);
> >> +	VM_BUG_ON_PAGE(order && !PageTransHuge(page), page);
> >>  
> >>  	/* Do not migrate THP mapped by multiple processes */
> >>  	if (PageTransHuge(page) && total_mapcount(page) > 1)
> >>  		return 0;
> >>  
> >>  	/* Avoid migrating to a node that is nearly full */
> >> -	if (!migrate_balanced_pgdat(pgdat, nr_pages))
> >> +	if (!migrate_balanced_pgdat(pgdat, nr_pages)) {
> >> +		int z;
> >> +
> >> +		if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
> >> +		    !numa_demotion_enabled)
> >> +			return 0;
> >> +		if (next_demotion_node(pgdat->node_id) == NUMA_NO_NODE)
> >> +			return 0;
> >
> > The encoded behavior doesn't seem very user-friendly: Unless the user
> > enables numa demotion in a separate flag, enabling numa balancing in
> > tiered mode will silently do nothing.
> 
> In theory, TIERING still does something even with numa_demotion_enabled
> == false.  Where it works more like the original NUMA balancing.  If
> there's some free space in DRAM node (for example, some programs exit),
> some PMEM pages will be promoted to DRAM.  But as in the change log,
> this isn't good enough for page placement optimizing.

Right, so it's a behavior that likely isn't going to be useful.

> > Would it make more sense to have a central flag for the operation of
> > tiered memory systems that will enable both promotion and demotion?
> 
> IMHO, it may be possible for people to enable demotion alone.  For
> example, if some people want to use a user space page placement
> optimizing solution based on PMU counters, they may disable TIERING, but
> still use demotion as a way to avoid swapping in some situation.  Do you
> think this makes sense?

Yes, it does.

> > Alternatively, it could also ignore the state of demotion and promote
> > anyway if asked to, resulting in regular reclaim to make room. It
> > might not be the most popular combination, but would be in line with
> > the zone_reclaim_mode policy of preferring reclaim over remote
> > accesses.  It would make the knobs behave more as expected and it's
> > less convoluted than having flags select other user-visible flags.
> 
> Sorry, I don't get your idea here.  Do you suggest to add another knob
> like zone_relcaim_mode?  Then we can define some bit to control demotion
> and promotion there?  If so, I still don't know how to fit this into the
> existing NUMA balancing framework.

No, I'm just suggesting to remove the !numa_demotion_disabled check
from the promotion path on unbalanced nodes. Keep the switches
independent from each other.

Like you said, demotion without promotion can be a valid config with a
userspace promoter.

And I'm saying promotion without demotion can be a valid config in a
zone_reclaim_mode type of setup.

We also seem to agree degraded promotion when demotion enabled likely
isn't very useful to anybody. So maybe it should be removed?

It just comes down to user expectations. There is no masterswitch that
says "do the right thing on tiered systems", so absent of that I think
it would be best to keep the semantics of each of the two knobs simple
and predictable, without tricky interdependencies - like quietly
degrading promotion behavior when demotion is disabled.

Does that make sense?

Thanks!
Johannes


  reply	other threads:[~2022-02-17 16:26 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-01-28  8:27 [PATCH -V11 0/3] NUMA balancing: optimize memory placement for memory tiering system Huang Ying
2022-01-28  8:27 ` [PATCH -V11 1/3] NUMA Balancing: add page promotion counter Huang Ying
2022-02-07 16:18   ` Johannes Weiner
2022-01-28  8:27 ` [PATCH -V11 2/3] NUMA balancing: optimize page placement for memory tiering system Huang Ying
2022-02-07 17:47   ` Johannes Weiner
2022-02-09  5:24     ` Huang, Ying
2022-02-17 16:26       ` Johannes Weiner [this message]
2022-02-18  2:15         ` Huang, Ying
2022-01-28  8:27 ` [PATCH -V11 3/3] memory tiering: skip to scan fast memory Huang Ying

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Yg53HIx8/Knw6TZ7@cmpxchg.org \
    --to=hannes@cmpxchg.org \
    --cc=akpm@linux-foundation.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=feng.tang@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=mgorman@techsingularity.net \
    --cc=mhocko@suse.com \
    --cc=osalvador@suse.de \
    --cc=peterz@infradead.org \
    --cc=riel@surriel.com \
    --cc=shakeelb@google.com \
    --cc=shy828301@gmail.com \
    --cc=weixugc@google.com \
    --cc=ying.huang@intel.com \
    --cc=zhongjiang-ali@linux.alibaba.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).