From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Sun, 1 May 2022 18:04:05 -0700 (PDT)
From: David Rientjes <rientjes@google.com>
To: Davidlohr Bueso <dave@stgolabs.net>, Yuanchu Xie <yuanchu@google.com>
cc: Wei Xu <weixugc@google.com>, Andrew Morton <akpm@linux-foundation.org>, 
    Dave Hansen <dave.hansen@linux.intel.com>, 
    Huang Ying <ying.huang@intel.com>, Dan Williams <dan.j.williams@intel.com>, 
    Yang Shi <shy828301@gmail.com>, Linux MM <linux-mm@kvack.org>, 
    Greg Thelen <gthelen@google.com>, 
    "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>, 
    Jagdish Gediya <jvgediya@linux.ibm.com>, 
    Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, 
    Alistair Popple <apopple@nvidia.com>, Michal Hocko <mhocko@kernel.org>, 
    Baolin Wang <baolin.wang@linux.alibaba.com>, 
    Brice Goglin <brice.goglin@gmail.com>, Feng Tang <feng.tang@intel.com>, 
    Jonathan.Cameron@huawei.com
Subject: Re: RFC: Memory Tiering Kernel Interfaces
In-Reply-To: <20220501175813.tvytoosygtqlh3nn@offworld>
Message-ID: <69d7a550-737-9324-b092-97d72487e7dc@google.com>
References: <CAAPL-u9sVx94ACSuCVN8V0tKp+AMxiY89cro0japtyB=xNfNBw@mail.gmail.com> <20220501175813.tvytoosygtqlh3nn@offworld>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
List-ID: <linux-mm.kvack.org>

On Sun, 1 May 2022, Davidlohr Bueso wrote:

> Nice summary, thanks. I don't know which of the interested parties will be
> at lsfmm, but fyi we have a couple of sessions on memory tiering Tuesday
> at 14:00 and 15:00.
> 
> On Fri, 29 Apr 2022, Wei Xu wrote:
> 
> > The current kernel has basic memory tiering support: inactive
> > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > tier NUMA node to make room for new allocations on the higher tier
> > NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> > migrated (promoted) to a higher tier NUMA node to improve
> > performance.
> 
> Regardless of the promotion algorithm, at some point I see the NUMA hinting
> fault mechanism getting in the way of performance. It would be nice if hardware
> began giving us page "heatmaps" instead of having to rely on fault-based or
> sampling-based ways to identify hot memory.
> 

Hi Davidlohr,

I tend to agree with this and we've been discussing potential hardware 
assistance for page heatmaps as well, but not as an extension of sampling 
techniques that rely on the page table Accessed bit.

Have you thought about what hardware could give us here that would allow 
us to identify the set of hottest (or coldest) pages over a range so that 
we don't need to iterate through it?

Adding Yuanchu Xie <yuanchu@google.com> who has been looking into this 
recently.

> > A tiering relationship between NUMA nodes in the form of demotion paths
> > is created during kernel initialization and updated when a NUMA
> > node is hot-added or hot-removed.  The current implementation puts all
> > nodes with CPUs into the top tier, and then builds the tiering hierarchy
> > tier-by-tier by establishing the per-node demotion targets based on
> > the distances between nodes.
> > 
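
For anyone less familiar with how the demotion paths are built today,
the description above can be modeled roughly as follows.  This is a
simplified userspace sketch in Python, not the kernel code: the function
name, the set/dict representation and the example distance matrix are
all made up for illustration.

```python
def build_demotion_targets(distance, cpu_nodes):
    """Model tier-by-tier demotion target selection.

    distance[a][b] is the NUMA distance from node a to node b.
    Returns (tiers, targets): tiers is a list of node sets, top tier
    first; targets maps each node to the set of nodes it demotes to.
    """
    all_nodes = set(range(len(distance)))
    this_pass = set(cpu_nodes)      # top tier: every node with a CPU
    used = set(this_pass)           # nodes already placed in some tier
    tiers = [set(this_pass)]
    targets = {n: set() for n in all_nodes}

    while this_pass:
        next_pass = set()
        for node in sorted(this_pass):
            candidates = all_nodes - used
            if not candidates:
                continue
            best = min(distance[node][c] for c in candidates)
            # Every not-yet-placed node at the best distance is a target.
            targets[node] = {c for c in candidates
                             if distance[node][c] == best}
            next_pass |= targets[node]
        used |= next_pass
        if next_pass:
            tiers.append(next_pass)
        this_pass = next_pass
    return tiers, targets


# Hypothetical 4-node box: nodes 0 and 1 have CPUs, 2 and 3 are memory-only.
DISTANCE = [
    [10, 20, 17, 28],
    [20, 10, 28, 17],
    [17, 28, 10, 28],
    [28, 17, 28, 10],
]
tiers, targets = build_demotion_targets(DISTANCE, {0, 1})
# tiers == [{0, 1}, {2, 3}]; node 0 demotes to node 2, node 1 to node 3
```

With these distances, nodes 0 and 1 form the top tier and the
memory-only nodes 2 and 3 land one tier down no matter how fast their
memory is, which is the limitation the first use case below describes.
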
> > The current memory tiering interface needs to be improved to address
> > several important use cases:
> > 
> > * The current tiering initialization code always initializes
> >  each memory-only NUMA node into a lower tier.  But a memory-only
> >  NUMA node may have a high performance memory device (e.g. a DRAM
> >  device attached via CXL.mem or a DRAM-backed memory-only node on
> >  a virtual machine) and should be put into the top tier.
> 
> At least the CXL memory (volatile or not) will still be slower than
> regular DRAM, so I think that we'd not want this to be top-tier. But
> in general, yes, I agree that defining top tier as whether or not the
> node has a CPU is a bit limiting, as you've detailed here.
> 
> > Tiering Hierarchy Initialization
> > ================================
> > 
> > By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
> > 
> > A device driver can remove its memory nodes from the top tier, e.g.
> > a dax driver can remove PMEM nodes from the top tier.
> > 
> > The kernel builds the memory tiering hierarchy and per-node demotion
> > order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
> > best distance nodes in the next lower tier are assigned to
> > node_demotion[N].preferred and all the nodes in the next lower tier
> > are assigned to node_demotion[N].allowed.
> > 
> > node_demotion[N].preferred can be empty if no preferred demotion node
> > is available for node N.
> 
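
The preferred/allowed split proposed above could look something like
the sketch below.  It is a userspace Python model only, and it assumes
the tiers themselves are already known (the proposal text quoted here
does not spell out how nodes below the top tier are grouped); the
function name and the dict layout are hypothetical.

```python
def build_node_demotion(tiers, distance):
    """Compute per-node demotion sets from an ordered list of tiers.

    tiers: list of iterables of node ids, top tier first (assumed input).
    distance[a][b]: NUMA distance from node a to node b.
    Returns {node: {"preferred": set, "allowed": set}}.
    """
    node_demotion = {}
    for level, tier in enumerate(tiers):
        # Nodes in the next lower tier, or nothing for the bottom tier.
        lower = set(tiers[level + 1]) if level + 1 < len(tiers) else set()
        for node in tier:
            if lower:
                best = min(distance[node][t] for t in lower)
                preferred = {t for t in lower
                             if distance[node][t] == best}
            else:
                preferred = set()   # no demotion target available
            node_demotion[node] = {
                "preferred": preferred,
                "allowed": set(lower),
            }
    return node_demotion


# Hypothetical layout: nodes 0-1 in the top tier, nodes 2-3 one tier below.
EXAMPLE_DISTANCE = [
    [10, 20, 17, 28],
    [20, 10, 28, 17],
    [17, 28, 10, 28],
    [28, 17, 28, 10],
]
demotion = build_node_demotion([[0, 1], [2, 3]], EXAMPLE_DISTANCE)
```

Here node 0 prefers node 2 but is allowed to demote to either node 2 or
node 3, while the bottom-tier nodes end up with empty sets.
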
> In cases where there is more than one possible demotion node (with equal
> cost), I'm wondering if we want to do something better than choosing
> randomly, like we do now - perhaps round robin? Of course anything
> like this will require actual performance data, something I have seen
> very little of.
> 
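
To make the two selection policies mentioned here concrete, this is a
small userspace sketch in Python.  The class and method names are
invented for illustration and say nothing about how the kernel would
implement either policy, or about which one performs better.

```python
import itertools
import random


class DemotionPicker:
    """Choose a demotion target among equal-cost preferred nodes."""

    def __init__(self, preferred):
        self._targets = sorted(preferred)
        # cycle() walks the targets in order and wraps around forever.
        self._cycle = (itertools.cycle(self._targets)
                       if self._targets else None)

    def pick_random(self, rng=random):
        """Random choice among the equal-cost targets."""
        return rng.choice(self._targets) if self._targets else None

    def pick_round_robin(self):
        """Rotate through the targets so each one is used in turn."""
        return next(self._cycle) if self._cycle else None


picker = DemotionPicker({3, 2})
sequence = [picker.pick_round_robin() for _ in range(4)]
# sequence == [2, 3, 2, 3]
```

Round robin spreads demotions evenly over the candidates by
construction, whereas the random policy only does so on average.
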
> > Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
> > memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
> > node.
> 
> I think this makes sense.
> 
> Thanks,
> Davidlohr
> 
>