From: "Huang, Ying" <ying.huang@intel.com>
To: David Rientjes <rientjes@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>,
Dave Hansen <dave.hansen@linux.intel.com>,
<linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>,
<kbusch@kernel.org>, <yang.shi@linux.alibaba.com>,
<dan.j.williams@intel.com>
Subject: Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard
Date: Thu, 02 Jul 2020 13:02:03 +0800
Message-ID: <87mu4ijyr8.fsf@yhuang-dev.intel.com>
In-Reply-To: <alpine.DEB.2.23.453.2007011203500.1908531@chino.kir.corp.google.com> (David Rientjes's message of "Wed, 1 Jul 2020 12:25:08 -0700")
David Rientjes <rientjes@google.com> writes:
> On Wed, 1 Jul 2020, Dave Hansen wrote:
>
>> > Could this cause us to break a user's mbind() or allow a user to
>> > circumvent their cpuset.mems?
>>
>> In its current form, yes.
>>
>> My current rationale for this is that while it's not as deferential as
>> it can be to the user/kernel ABI contract, it's good *overall* behavior.
>> The auto-migration only kicks in when the data is about to go away. So
>> while the user's data might be slower than they like, it is *WAY* faster
>> than they deserve because it should be off on the disk.
>>
>
> It's outside the scope of this patchset, but eventually there will be a
> promotion path that I think requires a strict 1:1 relationship between
> DRAM and PMEM nodes because otherwise mbind(), set_mempolicy(), and
> cpuset.mems become ineffective for nodes facing memory pressure.
I have posted a patchset for AutoNUMA-based promotion support,

https://lore.kernel.org/lkml/20200218082634.1596727-1-ying.huang@intel.com/

There, a page is promoted upon a NUMA hint page fault, so the process
context is available and all memory policies (mbind(), set_mempolicy(),
and cpuset.mems) can be honored.  We can refuse to promote a page to
any DRAM node that is not allowed by the applicable memory policy.  So
a 1:1 relationship isn't necessary for promotion.
> For the purposes of this patchset, agreed that DRAM -> PMEM -> swap makes
> perfect sense. Theoretically, I think you could have DRAM N0 and N1 and
> then a single PMEM N2 and this N2 can be the terminal node for both N0 and
> N1. On promotion, I think we need to rely on something stronger than
> autonuma to decide which DRAM node to promote to: specifically any user
> policy put into effect (memory tiering or autonuma shouldn't be allowed to
> subvert these user policies).
>
> As others have mentioned, we lose the allocation or process context at the
> time of demotion or promotion
As above, we do have the process context at the time of promotion.
> and any workaround for that requires some
> hacks, such as mapping the page to cpuset (what is the right solution for
> shared pages?) or adding NUMA locality handling to memcg.
It sounds natural to me to add a NUMA node restriction to memcg.
Best Regards,
Huang, Ying