Linux-mm Archive on lore.kernel.org
From: David Rientjes <rientjes@google.com>
To: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	kbusch@kernel.org, yang.shi@linux.alibaba.com,
	ying.huang@intel.com, dan.j.williams@intel.com
Subject: Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard
Date: Wed, 1 Jul 2020 12:25:08 -0700 (PDT)
Message-ID: <alpine.DEB.2.23.453.2007011203500.1908531@chino.kir.corp.google.com>
In-Reply-To: <c06b4453-c533-a9ba-939a-8877fb301ad6@intel.com>

On Wed, 1 Jul 2020, Dave Hansen wrote:

> > Could this cause us to break a user's mbind() or allow a user to 
> > circumvent their cpuset.mems?
> 
> In its current form, yes.
> 
> My current rationale for this is that while it's not as deferential as
> it can be to the user/kernel ABI contract, it's good *overall* behavior.
>  The auto-migration only kicks in when the data is about to go away.  So
> while the user's data might be slower than they like, it is *WAY* faster
> than they deserve because it should be off on the disk.
> 

It's outside the scope of this patchset, but eventually there will be a 
promotion path that I think requires a strict 1:1 relationship between 
DRAM and PMEM nodes because otherwise mbind(), set_mempolicy(), and 
cpuset.mems become ineffective for nodes facing memory pressure.

For the purposes of this patchset, agreed that DRAM -> PMEM -> swap makes 
perfect sense.  Theoretically, I think you could have DRAM N0 and N1 and 
then a single PMEM N2 and this N2 can be the terminal node for both N0 and 
N1.  On promotion, I think we need to rely on something stronger than 
autonuma to decide which DRAM node to promote to: specifically any user 
policy put into effect (memory tiering or autonuma shouldn't be allowed to 
subvert these user policies).

As others have mentioned, we lose the allocation and process context by the 
time of demotion or promotion, and any workaround for that requires hacks, 
such as mapping the page back to its cpuset (what is the right answer for 
shared pages?) or adding NUMA locality handling to memcg.

I think a 1:1 relationship between DRAM and PMEM nodes is required if we 
consider the eventual promotion of this memory so that user memory can't 
eventually reappear on a DRAM node that is not allowed by mbind(), 
set_mempolicy(), or cpuset.mems.  I think it also makes this patchset much 
simpler.

> > Because we don't have a mapping of the page back to its allocation 
> > context (or the process context in which it was allocated), it seems like 
> > both are possible.
> > 
> > So let's assume that migration nodes cannot be other DRAM nodes.  
> > Otherwise, memory pressure could be intentionally or unintentionally 
> > induced to migrate these pages to another node.  Do we have such a 
> > restriction on migration nodes?
> 
> There's nothing explicit.  On a normal, balanced system where there's a
> 1:1:1 relationship between CPU sockets, DRAM nodes and PMEM nodes, it's
> implicit since the migration path is one deep and goes from DRAM->PMEM.
> 
> If there were some oddball system where there was a memory only DRAM
> node, it might very well end up being a migration target.
> 

Shouldn't DRAM -> DRAM demotion be banned?  It's all DRAM, and it's within 
the control of mempolicies and cpusets today, so I had assumed it was 
outside the scope of memory tiering support; after all, memory tiering is 
about separate tiers :)

> >> +static struct page *alloc_demote_node_page(struct page *page, unsigned long node)
> >> +{
> >> +	/*
> >> +	 * 'mask' targets allocation only to the desired node in the
> >> +	 * migration path, and fails fast if the allocation can not be
> >> +	 * immediately satisfied.  Reclaim is already active and heroic
> >> +	 * allocation efforts are unwanted.
> >> +	 */
> >> +	gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
> >> +			__GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM |
> >> +			__GFP_MOVABLE;
> > 
> > GFP_NOWAIT has the side-effect that it does __GFP_KSWAPD_RECLAIM: do we 
> > actually want to kick kswapd on the pmem node?
> 
> In my mental model, cold data flows from:
> 
> 	DRAM -> PMEM -> swap
> 
> Kicking kswapd here ensures that while we're doing DRAM->PMEM migrations
> for kinda cold data, kswapd can be working on doing the PMEM->swap part
> on really cold data.
> 

Makes sense.

