linux-kernel.vger.kernel.org archive mirror
From: Wei Xu <weixugc@google.com>
To: Wei Xu <weixugc@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Huang Ying <ying.huang@intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Yang Shi <shy828301@gmail.com>, Linux MM <linux-mm@kvack.org>,
	Greg Thelen <gthelen@google.com>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	Jagdish Gediya <jvgediya@linux.ibm.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Alistair Popple <apopple@nvidia.com>,
	Michal Hocko <mhocko@kernel.org>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Brice Goglin <brice.goglin@gmail.com>,
	Feng Tang <feng.tang@intel.com>,
	Jonathan.Cameron@huawei.com
Subject: Re: RFC: Memory Tiering Kernel Interfaces
Date: Mon, 2 May 2022 23:06:25 -0700
Message-ID: <CAAPL-u8i_wc15iJzU9s9V1YuuS+FQL2zdw3o7MqNnSFao3u4KA@mail.gmail.com>
In-Reply-To: <20220501175813.tvytoosygtqlh3nn@offworld>

On Sun, May 1, 2022 at 11:09 AM Davidlohr Bueso <dave@stgolabs.net> wrote:
>
> Nice summary, thanks. I don't know who of the interested parties will be
> at lsfmm, but fyi we have a couple of sessions on memory tiering Tuesday
> at 14:00 and 15:00.
>
> On Fri, 29 Apr 2022, Wei Xu wrote:
>
> >The current kernel has basic memory tiering support: Inactive
> >pages on a higher tier NUMA node can be migrated (demoted) to a lower
> >tier NUMA node to make room for new allocations on the higher tier
> >NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> >migrated (promoted) to a higher tier NUMA node to improve the
> >performance.
>
> Regardless of the promotion algorithm, at some point I see the NUMA hinting
> fault mechanism being in the way of performance. It would be nice if hardware
> began giving us page "heatmaps" instead of having to rely on faulting or
> sampling based ways to identify hot memory.

I agree with your comments on both NUMA hinting faults and
hardware-assisted "heatmaps".


> >A tiering relationship between NUMA nodes in the form of demotion path
> >is created during the kernel initialization and updated when a NUMA
> >node is hot-added or hot-removed.  The current implementation puts all
> >nodes with CPUs into the top tier, and then builds the tiering hierarchy
> >tier-by-tier by establishing the per-node demotion targets based on
> >the distances between nodes.
> >
> >The current memory tiering interface needs to be improved to address
> >several important use cases:
> >
> >* The current tiering initialization code always initializes
> >  each memory-only NUMA node into a lower tier.  But a memory-only
> >  NUMA node may have a high performance memory device (e.g. a DRAM
> >  device attached via CXL.mem or a DRAM-backed memory-only node on
> >  a virtual machine) and should be put into the top tier.
>
> At least the CXL memory (volatile or not) will still be slower than
> regular DRAM, so I think that we'd not want this to be top-tier. But
> in general, yes I agree that defining top tier as whether or not the
> node has a CPU is a bit limiting, as you've detailed here.
>
> >Tiering Hierarchy Initialization
> >================================
> >
> >By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
> >
> >A device driver can remove its memory nodes from the top tier, e.g.
> >a dax driver can remove PMEM nodes from the top tier.
> >
> >The kernel builds the memory tiering hierarchy and per-node demotion
> >order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
> >best distance nodes in the next lower tier are assigned to
> >node_demotion[N].preferred and all the nodes in the next lower tier
> >are assigned to node_demotion[N].allowed.
> >
> >node_demotion[N].preferred can be empty if no preferred demotion node
> >is available for node N.
>
> In cases where there is more than one possible demotion node (with equal
> cost), I'm wondering if we want to do something better than choosing
> randomly, like we do now - perhaps round robin? Of course anything
> like this will require actual performance data, something I have seen
> very little of.

I'd prefer that demotion node selection follow the way the kernel
selects the node/zone for normal allocations.  If we want to group
several demotion nodes with equal cost together (e.g. to better
utilize the bandwidth from these nodes), we had better implement such
an optimization in __alloc_pages_nodemask() so that it benefits normal
allocations as well.
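
To make the quoted proposal and the fallback idea concrete, here is a
minimal user-space sketch.  It is not kernel code.  Only the
node_demotion[N].preferred and node_demotion[N].allowed names come from
the RFC text; tier[], distance[], free_pages, build_demotion_targets(),
nearest_with_space() and pick_demotion_node() are hypothetical names
made up for illustration, and the "nearest node with free space" rule
is only a rough stand-in for the real allocator fallback order.

```c
#include <limits.h>
#include <stdbool.h>

#define NR_NODES 4

/* Hypothetical topology: tier 0 is the top tier. */
static const int tier[NR_NODES] = { 0, 0, 1, 1 };

/* Hypothetical SLIT-style distances between nodes. */
static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 30, 40 },
	{ 20, 10, 40, 30 },
	{ 30, 40, 10, 50 },
	{ 40, 30, 50, 10 },
};

struct demotion_targets {
	bool preferred[NR_NODES];	/* best-distance nodes in the next lower tier */
	bool allowed[NR_NODES];		/* all nodes in the next lower tier */
};

static struct demotion_targets node_demotion[NR_NODES];

/* Fill node_demotion[] for every node, tier by tier. */
void build_demotion_targets(void)
{
	for (int n = 0; n < NR_NODES; n++) {
		int best = INT_MAX;

		/* Every node in the next lower tier is an allowed target. */
		for (int t = 0; t < NR_NODES; t++) {
			bool lower = (tier[t] == tier[n] + 1);

			node_demotion[n].allowed[t] = lower;
			if (lower && distance[n][t] < best)
				best = distance[n][t];
		}

		/* Allowed targets at the best distance are preferred. */
		for (int t = 0; t < NR_NODES; t++)
			node_demotion[n].preferred[t] =
				node_demotion[n].allowed[t] &&
				distance[n][t] == best;
	}
}

/* Nearest node in mask that still has free pages, or -1 if none. */
static int nearest_with_space(int n, const bool *mask, const long *free_pages)
{
	int pick = -1;

	for (int t = 0; t < NR_NODES; t++) {
		if (!mask[t] || free_pages[t] <= 0)
			continue;
		if (pick < 0 || distance[n][t] < distance[n][pick])
			pick = t;
	}
	return pick;
}

/* Try the preferred targets first, then fall back to any allowed target. */
int pick_demotion_node(int n, const long *free_pages)
{
	int pick = nearest_with_space(n, node_demotion[n].preferred, free_pages);

	if (pick < 0)
		pick = nearest_with_space(n, node_demotion[n].allowed, free_pages);
	return pick;
}
```

With this example topology, node 0 ends up with node 2 as its only
preferred target and nodes 2 and 3 as allowed targets.  If node 2 has
no free pages, pick_demotion_node(0, ...) falls back to node 3.  Nodes
2 and 3 are in the lowest tier, so their preferred sets are empty and
the function returns -1 for them.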

> >Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
> >memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
> >node.
>
> I think this makes sense.
>
> Thanks,
> Davidlohr
