Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
From: David Rientjes <rientjes@google.com>
To: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>,
	linux-kernel@vger.kernel.org,  linux-mm@kvack.org,
	kbusch@kernel.org, ying.huang@intel.com,
	 dan.j.williams@intel.com
Subject: Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard
Date: Tue, 30 Jun 2020 22:41:47 -0700 (PDT)
Message-ID: <alpine.DEB.2.22.394.2006302208460.1685201@chino.kir.corp.google.com> (raw)
In-Reply-To: <039a5704-4468-f662-d660-668071842ca3@linux.alibaba.com>

On Tue, 30 Jun 2020, Yang Shi wrote:

> > > From: Dave Hansen <dave.hansen@linux.intel.com>
> > > 
> > > If a memory node has a preferred migration path to demote cold pages,
> > > attempt to move those inactive pages to that migration node before
> > > reclaiming. This will better utilize available memory, provide a faster
> > > tier than swapping or discarding, and allow such pages to be reused
> > > immediately without IO to retrieve the data.
> > > 
> > > When handling anonymous pages, this will be considered before swap if
> > > enabled. Should the demotion fail for any reason, the page reclaim
> > > will proceed as if the demotion feature was not enabled.
> > > 
> > Thanks for sharing these patches and kick-starting the conversation, Dave.
> > 
> > Could this cause us to break a user's mbind() or allow a user to
> > circumvent their cpuset.mems?
> > 
> > Because we don't have a mapping of the page back to its allocation
> > context (or the process context in which it was allocated), it seems like
> > both are possible.
> 
> Yes, this could break the memory placement policy enforced by mbind and
> cpuset. I discussed this with Michal on mailing list and tried to find a way
> to solve it, but unfortunately it seems not easy as what you mentioned above.
> The memory policy and cpuset is stored in task_struct rather than mm_struct.
> It is not easy to trace back to task_struct from page (owner field of
> mm_struct might be helpful, but it depends on CONFIG_MEMCG and is not
> preferred way).
> 

Yeah, and Ying made a similar response to this message.

We can do this if we consider pmem not to be a separate memory tier from 
the system perspective, however, but rather the socket perspective.  In 
other words, a node can only demote to a series of exclusive pmem ranges 
and promote to the same series of ranges in reverse order.  So DRAM node 0 
can only demote to PMEM node 2 while DRAM node 1 can only demote to PMEM 
node 3 -- a pmem range cannot be demoted to, or promoted from, more than 
one DRAM node.

This naturally takes care of mbind() and cpuset.mems if we consider pmem 
just to be slower volatile memory and we don't need to deal with the 
latency concerns of cross socket migration.  A user page will never be 
demoted to a pmem range across the socket and will never be promoted to a 
different DRAM node that it doesn't have access to.

That can work with the NUMA abstraction for pmem, but it could also 
theoretically be a new memory zone instead.  If all memory living on pmem 
is migratable (the natural way that memory hotplug is done, so we can 
offline), this zone would live above ZONE_MOVABLE.  Zonelist ordering 
would determine whether we can allocate directly from this memory based on 
system config or a new gfp flag that could be set for users of a mempolicy 
that allows allocations directly from pmem.  If abstracted as a NUMA node 
instead, interleave over nodes {0,2,3} or a cpuset.mems of {0,2,3} doesn't 
make much sense.

Kswapd would need to be enlightened for proper pgdat and pmem balancing 
but in theory it should be simpler because it only has its own node to 
manage.  Existing per-zone watermarks might be easy to use to fine tune 
the policy from userspace: the scale factor determines how much memory we 
try to keep free on DRAM for migration from pmem, for example.  We also 
wouldn't have to deal with node hotplug or updating of demotion/promotion 
node chains.

Maybe the strongest advantage of the node abstraction is the ability to 
use autonuma and migrate_pages()/move_pages() API for moving pages 
explicitly?  Mempolicies could be used for migration to "top-tier" memory, 
i.e. ZONE_NORMAL or ZONE_MOVABLE, instead.


  reply index

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-06-29 23:45 [RFC][PATCH 0/8] Migrate Pages " Dave Hansen
2020-06-29 23:45 ` [RFC][PATCH 1/8] mm/numa: node demotion data structure and lookup Dave Hansen
2020-06-29 23:45 ` [RFC][PATCH 2/8] mm/migrate: Defer allocating new page until needed Dave Hansen
2020-07-01  8:47   ` Greg Thelen
2020-07-01 14:46     ` Dave Hansen
2020-07-01 18:32       ` Yang Shi
2020-06-29 23:45 ` [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard Dave Hansen
2020-07-01  0:47   ` David Rientjes
2020-07-01  1:29     ` Yang Shi
2020-07-01  5:41       ` David Rientjes [this message]
2020-07-01  8:54         ` Huang, Ying
2020-07-01 18:20           ` Dave Hansen
2020-07-01 19:50             ` David Rientjes
2020-07-02  1:50               ` Huang, Ying
2020-07-01 15:15         ` Dave Hansen
2020-07-01 17:21         ` Yang Shi
2020-07-01 19:45           ` David Rientjes
2020-07-02 10:02             ` Jonathan Cameron
2020-07-01  1:40     ` Huang, Ying
2020-07-01 16:48     ` Dave Hansen
2020-07-01 19:25       ` David Rientjes
2020-07-02  5:02         ` Huang, Ying
2020-06-29 23:45 ` [RFC][PATCH 4/8] mm/vmscan: add page demotion counter Dave Hansen
2020-06-29 23:45 ` [RFC][PATCH 5/8] mm/numa: automatically generate node migration order Dave Hansen
2020-06-30  8:22   ` Huang, Ying
2020-07-01 18:23     ` Dave Hansen
2020-07-02  1:20       ` Huang, Ying
2020-06-29 23:45 ` [RFC][PATCH 6/8] mm/vmscan: Consider anonymous pages without swap Dave Hansen
2020-06-29 23:45 ` [RFC][PATCH 7/8] mm/vmscan: never demote for memcg reclaim Dave Hansen
2020-06-29 23:45 ` [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration Dave Hansen
2020-06-30  7:23   ` Huang, Ying
2020-06-30 17:50     ` Yang Shi
2020-07-01  0:48       ` Huang, Ying
2020-07-01  1:12         ` Yang Shi
2020-07-01  1:28           ` Huang, Ying
2020-07-01 16:02       ` Dave Hansen
2020-07-03  9:30   ` Huang, Ying
2020-06-30 18:36 ` [RFC][PATCH 0/8] Migrate Pages in lieu of discard Shakeel Butt
2020-06-30 18:51   ` Dave Hansen
2020-06-30 19:25     ` Shakeel Butt
2020-06-30 19:31       ` Dave Hansen
2020-07-01 14:24         ` [RFC] [PATCH " Zi Yan
2020-07-01 14:32           ` Dave Hansen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=alpine.DEB.2.22.394.2006302208460.1685201@chino.kir.corp.google.com \
    --to=rientjes@google.com \
    --cc=dan.j.williams@intel.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=kbusch@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=yang.shi@linux.alibaba.com \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git