Re: RFC: Memory Tiering Kernel Interfaces

From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: Wei Xu <weixugc@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Huang Ying <ying.huang@intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Yang Shi <shy828301@gmail.com>, Linux MM <linux-mm@kvack.org>,
	Greg Thelen <gthelen@google.com>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	Jagdish Gediya <jvgediya@linux.ibm.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Alistair Popple <apopple@nvidia.com>,
	Michal Hocko <mhocko@kernel.org>,
	Brice Goglin <brice.goglin@gmail.com>,
	Feng Tang <feng.tang@intel.com>,
	Jonathan.Cameron@huawei.com
Subject: Re: RFC: Memory Tiering Kernel Interfaces
Date: Tue, 3 May 2022 10:07:08 +0800
Message-ID: <87f8d4d0-6d06-7254-b2a6-3ccf6a555733@linux.alibaba.com>
In-Reply-To: <20220501175813.tvytoosygtqlh3nn@offworld>

On 5/2/2022 1:58 AM, Davidlohr Bueso wrote:
> Nice summary, thanks. I don't know who of the interested parties will be
> at lsfmm, but fyi we have a couple of sessions on memory tiering Tuesday
> at 14:00 and 15:00.
> 
> On Fri, 29 Apr 2022, Wei Xu wrote:
> 
>> The current kernel has the basic memory tiering support: Inactive
>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
>> tier NUMA node to make room for new allocations on the higher tier
>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
>> migrated (promoted) to a higher tier NUMA node to improve the
>> performance.
> 
> Regardless of the promotion algorithm, at some point I see the NUMA hinting
> fault mechanism being in the way of performance. It would be nice if
> hardware began giving us page "heatmaps" instead of having to rely on
> faulting or sampling based ways to identify hot memory.
> 
>> A tiering relationship between NUMA nodes in the form of demotion path
>> is created during the kernel initialization and updated when a NUMA
>> node is hot-added or hot-removed.  The current implementation puts all
>> nodes with CPU into the top tier, and then builds the tiering hierarchy
>> tier-by-tier by establishing the per-node demotion targets based on
>> the distances between nodes.
>>
>> The current memory tiering interface needs to be improved to address
>> several important use cases:
>>
>> * The current tiering initialization code always initializes
>>  each memory-only NUMA node into a lower tier.  But a memory-only
>>  NUMA node may have a high performance memory device (e.g. a DRAM
>>  device attached via CXL.mem or a DRAM-backed memory-only node on
>>  a virtual machine) and should be put into the top tier.
> 
> At least the CXL memory (volatile or not) will still be slower than
> regular DRAM, so I think that we'd not want this to be top-tier. But
> in general, yes I agree that defining top tier as whether or not the
> node has a CPU is a bit limiting, as you've detailed here.
> 
>> Tiering Hierarchy Initialization
>> ================================
>>
>> By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
>>
>> A device driver can remove its memory nodes from the top tier, e.g.
>> a dax driver can remove PMEM nodes from the top tier.
>>
>> The kernel builds the memory tiering hierarchy and per-node demotion
>> order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
>> best distance nodes in the next lower tier are assigned to
>> node_demotion[N].preferred and all the nodes in the next lower tier
>> are assigned to node_demotion[N].allowed.
>>
>> node_demotion[N].preferred can be empty if no preferred demotion node
>> is available for node N.
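
To make the preferred/allowed split above concrete, here is a minimal
userspace sketch, not kernel code. It assumes a small fixed node count, a
per-node tier index where 0 is the top tier, and a plain distance matrix.
The names MAX_NODES, struct node_demotion_sketch and
build_demotion_targets() are hypothetical and only illustrate the rule as
described in the RFC text.

```c
#include <limits.h>

#define MAX_NODES 8	/* hypothetical limit for this sketch */

/* Hypothetical stand-in for the per-node demotion data in the RFC. */
struct node_demotion_sketch {
	unsigned int preferred;	/* bitmask: nearest nodes in next lower tier */
	unsigned int allowed;	/* bitmask: all nodes in next lower tier */
};

/*
 * For every node n, collect all nodes one tier below it into "allowed"
 * and the subset with the smallest distance from n into "preferred".
 * Both masks stay empty when no node exists in the next lower tier.
 */
static void build_demotion_targets(int nr_nodes,
				   const int tier[],
				   int dist[][MAX_NODES],
				   struct node_demotion_sketch out[])
{
	for (int n = 0; n < nr_nodes; n++) {
		int best = INT_MAX;

		out[n].preferred = 0;
		out[n].allowed = 0;
		for (int m = 0; m < nr_nodes; m++) {
			if (tier[m] != tier[n] + 1)
				continue;	/* not in the next lower tier */
			out[n].allowed |= 1u << m;
			if (dist[n][m] < best) {
				best = dist[n][m];
				out[n].preferred = 1u << m;	/* new best */
			} else if (dist[n][m] == best) {
				out[n].preferred |= 1u << m;	/* tie */
			}
		}
	}
}
```

With two top-tier nodes (0, 1) and two lower-tier nodes (2, 3), each
top-tier node ends up with both lower-tier nodes in "allowed" and only its
nearest one in "preferred". Equal distances leave more than one bit set in
"preferred", which is the case the selection-policy question below is about.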
> 
> In cases where there is more than one possible demotion node (with equal
> cost), I'm wondering if we want to do something better than choosing
> randomly, like we do now - perhaps round robin? Of course anything
> like this will require actual performance data, something I have seen
> very little of.

I tried round robin [1] to select the target demotion node when there are
multiple candidate nodes, but I did not see any obvious performance gain
in MySQL testing. Maybe other test suites would show a difference?
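
For reference, the idea of rotating among equal-cost targets instead of
picking one at random can be sketched as below. This is a simplified
userspace illustration, not the code from the patch: struct
demotion_targets and next_demotion_node_rr() are hypothetical names, the
array size of 4 is arbitrary, and a real kernel version would need an
atomic or per-CPU cursor rather than a plain counter.

```c
#include <stddef.h>

/* Hypothetical per-source-node state; none of these names are kernel API. */
struct demotion_targets {
	int preferred[4];	/* equal-cost candidate node ids */
	size_t nr;		/* number of valid entries in preferred[] */
	size_t next;		/* round-robin cursor into preferred[] */
};

/* Return the next target node in rotation, or -1 if there is none. */
static int next_demotion_node_rr(struct demotion_targets *t)
{
	int node;

	if (t->nr == 0)
		return -1;
	node = t->preferred[t->next];
	t->next = (t->next + 1) % t->nr;	/* wrap around the candidates */
	return node;
}
```

With candidates {2, 3}, successive calls hand out 2, 3, 2, ... so demoted
pages are spread evenly across both nodes, where random selection only
balances them on average.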

[1] https://lore.kernel.org/all/c02bcbc04faa7a2c852534e9cd58a91c44494657.1636016609.git.baolin.wang@linux.alibaba.com/