Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard

From: Yang Shi <yang.shi@linux.alibaba.com>
To: David Rientjes <rientjes@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	kbusch@kernel.org, ying.huang@intel.com,
	dan.j.williams@intel.com
Subject: Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard
Date: Wed, 1 Jul 2020 10:21:24 -0700	[thread overview]
Message-ID: <33028a57-24fd-e618-7d89-5f35a35a6314@linux.alibaba.com> (raw)
In-Reply-To: <alpine.DEB.2.22.394.2006302208460.1685201@chino.kir.corp.google.com>

On 6/30/20 10:41 PM, David Rientjes wrote:
> On Tue, 30 Jun 2020, Yang Shi wrote:
>
>>>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>>>
>>>> If a memory node has a preferred migration path to demote cold pages,
>>>> attempt to move those inactive pages to that migration node before
>>>> reclaiming. This will better utilize available memory, provide a faster
>>>> tier than swapping or discarding, and allow such pages to be reused
>>>> immediately without IO to retrieve the data.
>>>>
>>>> When handling anonymous pages, this will be considered before swap if
>>>> enabled. Should the demotion fail for any reason, the page reclaim
>>>> will proceed as if the demotion feature was not enabled.
>>>>
>>> Thanks for sharing these patches and kick-starting the conversation, Dave.
>>>
>>> Could this cause us to break a user's mbind() or allow a user to
>>> circumvent their cpuset.mems?
>>>
>>> Because we don't have a mapping of the page back to its allocation
>>> context (or the process context in which it was allocated), it seems like
>>> both are possible.
>> Yes, this could break the memory placement policy enforced by mbind and
>> cpuset. I discussed this with Michal on mailing list and tried to find a way
>> to solve it, but unfortunately it seems not easy as what you mentioned above.
>> The memory policy and cpuset is stored in task_struct rather than mm_struct.
>> It is not easy to trace back to task_struct from page (owner field of
>> mm_struct might be helpful, but it depends on CONFIG_MEMCG and is not
>> preferred way).
>>
> Yeah, and Ying made a similar response to this message.
>
> We can do this if we consider pmem not to be a separate memory tier from
> the system perspective, however, but rather the socket perspective.  In
> other words, a node can only demote to a series of exclusive pmem ranges
> and promote to the same series of ranges in reverse order.  So DRAM node 0
> can only demote to PMEM node 2 while DRAM node 1 can only demote to PMEM
> node 3 -- a pmem range cannot be demoted to, or promoted from, more than
> one DRAM node.
>
> This naturally takes care of mbind() and cpuset.mems if we consider pmem
> just to be slower volatile memory and we don't need to deal with the
> latency concerns of cross socket migration.  A user page will never be
> demoted to a pmem range across the socket and will never be promoted to a
> different DRAM node that it doesn't have access to.

But I don't see too much benefit to limit the migration target to the 
so-called *paired* pmem node. IMHO it is fine to migrate to a remote (on 
a different socket) pmem node since even the cross socket access should 
be much faster then refault or swap from disk.

>
> That can work with the NUMA abstraction for pmem, but it could also
> theoretically be a new memory zone instead.  If all memory living on pmem
> is migratable (the natural way that memory hotplug is done, so we can
> offline), this zone would live above ZONE_MOVABLE.  Zonelist ordering
> would determine whether we can allocate directly from this memory based on
> system config or a new gfp flag that could be set for users of a mempolicy
> that allows allocations directly from pmem.  If abstracted as a NUMA node
> instead, interleave over nodes {0,2,3} or a cpuset.mems of {0,2,3} doesn't
> make much sense.
>
> Kswapd would need to be enlightened for proper pgdat and pmem balancing
> but in theory it should be simpler because it only has its own node to
> manage.  Existing per-zone watermarks might be easy to use to fine tune
> the policy from userspace: the scale factor determines how much memory we
> try to keep free on DRAM for migration from pmem, for example.  We also
> wouldn't have to deal with node hotplug or updating of demotion/promotion
> node chains.
>
> Maybe the strongest advantage of the node abstraction is the ability to
> use autonuma and migrate_pages()/move_pages() API for moving pages
> explicitly?  Mempolicies could be used for migration to "top-tier" memory,
> i.e. ZONE_NORMAL or ZONE_MOVABLE, instead.

I think using pmem as a node is more natural than zone and less 
intrusive since we can just reuse all the numa APIs. If we treat pmem as 
a new zone I think the implementation may be more intrusive and 
complicated (i.e. need a new gfp flag) and user can't control the memory 
placement.

Actually there had been such proposal before, please see 
https://www.spinics.net/lists/linux-mm/msg151788.html