From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
To: Wei Xu <weixugc@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Huang Ying <ying.huang@intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Yang Shi <shy828301@gmail.com>, Linux MM <linux-mm@kvack.org>,
	Greg Thelen <gthelen@google.com>,
	Jagdish Gediya <jvgediya@linux.ibm.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Alistair Popple <apopple@nvidia.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Michal Hocko <mhocko@kernel.org>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Brice Goglin <brice.goglin@gmail.com>,
	Feng Tang <feng.tang@intel.com>,
	Jonathan.Cameron@huawei.com
Subject: Re: RFC: Memory Tiering Kernel Interfaces
Date: Mon, 02 May 2022 11:55:43 +0530
Message-ID: <87ilqov4bc.fsf@linux.ibm.com>
In-Reply-To: <CAAPL-u9sVx94ACSuCVN8V0tKp+AMxiY89cro0japtyB=xNfNBw@mail.gmail.com>

Wei Xu <weixugc@google.com> writes:

....

>
> Tiering Hierarchy Initialization
> ================================
>
> By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
>
> A device driver can remove its memory nodes from the top tier, e.g.
> a dax driver can remove PMEM nodes from the top tier.

Should the tier in which to place the memory be an option that device
drivers like the dax driver can select? Or should the dax driver just
express the desire to mark a specific memory-only NUMA node as a
demotion target, without explicitly specifying the tier in which it
should be placed? I would like to go for the latter and choose the tier
details based on the current memory tiers and the NUMA distance value
(even HMAT at some point in the future). The challenge with NUMA
distance, though, is which distance value we will pick. For example, in
your Example 1:

 node   0    1    2    3
    0  10   20   30   40
    1  20   10   40   30
    2  30   40   10   40
    3  40   30   40   10

When Node3 is registered, how do we decide whether to create a Tier2 or
add it to Tier1? We could say that devices that wish to be placed in the
same tier will have the same distance as the existing tier device, i.e.,
for the above case,

node_distance[2][2] == node_distance[2][3]? Can we expect the firmware
to provide distance values like that?
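
To make the question concrete, here is a rough sketch of that check run
against the Example 1 distance table. The function name and the "same
distance as to itself" rule are only my illustration of the idea above,
not an existing kernel interface.

```python
# Distance matrix from Example 1 (rows and columns are node IDs 0..3).
NODE_DISTANCE = [
    [10, 20, 30, 40],
    [20, 10, 40, 30],
    [30, 40, 10, 40],
    [40, 30, 40, 10],
]


def joins_existing_tier(distance, tier_nodes, new_node):
    """Hypothetical rule: new_node joins the tier only if every node
    already in the tier reports the same distance to new_node as it
    reports to itself, e.g. distance[2][2] == distance[2][3]."""
    return all(distance[n][new_node] == distance[n][n] for n in tier_nodes)


# Node 2 is the only member of the PMEM tier when Node3 is registered.
node3_joins_tier1 = joins_existing_tier(NODE_DISTANCE, [2], 3)
```

With the values above, node_distance[2][3] is 40 while
node_distance[2][2] is 10, so this rule would put Node3 in a new tier
unless the firmware reports a local-like distance between the two PMEM
nodes.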

>
> The kernel builds the memory tiering hierarchy and per-node demotion
> order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
> best distance nodes in the next lower tier are assigned to
> node_demotion[N].preferred and all the nodes in the next lower tier
> are assigned to node_demotion[N].allowed.
>
> node_demotion[N].preferred can be empty if no preferred demotion node
> is available for node N.
>
> If the userspace overrides the tiers via the memory_tiers sysfs
> interface, the kernel then only rebuilds the per-node demotion order
> accordingly.
>
> Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
> memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
> node.
>
>
> Memory Allocation for Demotion
> ==============================
>
> When allocating a new demotion target page, both a preferred node
> and the allowed nodemask are provided to the allocation function.
> The default kernel allocation fallback order is used to allocate the
> page from the specified node and nodemask.
>
> The mempolicy of cpuset, vma and owner task of the source page can
> be set to refine the demotion nodemask, e.g. to prevent demotion or
> select a particular allowed node as the demotion target.
>
>
> Examples
> ========
>
> * Example 1:
>   Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
>
>   Node 0 has node 2 as the preferred demotion target and can also
>   fallback demotion to node 3.
>
>   Node 1 has node 3 as the preferred demotion target and can also
>   fallback demotion to node 2.
>
>   Set mempolicy to prevent cross-socket demotion and memory access,
>   e.g. cpuset.mems=0,2
>
> node distances:
> node   0    1    2    3
>    0  10   20   30   40
>    1  20   10   40   30
>    2  30   40   10   40
>    3  40   30   40   10
>
> /sys/devices/system/node/memory_tiers
> 0-1
> 2-3

How can I make Node3 the demotion target for Node2 in this case? Can
we have one file for each tier? I.e., we start with
/sys/devices/system/node/memory_tier0. Removing a node with memory from
the above file/list results in the creation of new tiers.

/sys/devices/system/node/memory_tier0
0-1
/sys/devices/system/node/memory_tier1
2-3

echo 2 > /sys/devices/system/node/memory_tier1
/sys/devices/system/node/memory_tier1
2
/sys/devices/system/node/memory_tier2
3
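
The semantics I have in mind can be modelled in a few lines. This is
only a sketch: write_tier() and allowed_targets() are hypothetical
helpers, and placing the removed nodes in a new tier directly below the
one being written is an assumption that matches the example above. The
"allowed" rule is the one from your proposal (all nodes in the next
lower tier).

```python
def write_tier(tiers, index, keep):
    """Model `echo <keep> > memory_tier<index>`.

    tiers is a list of sets of node IDs, highest tier first. Nodes
    dropped from the tier are placed in a new tier inserted right
    below it (assumed placement)."""
    kept = tiers[index] & set(keep)
    removed = tiers[index] - kept
    result = tiers[:index] + [kept]
    if removed:
        result.append(removed)
    return result + tiers[index + 1:]


def allowed_targets(tiers):
    """Map each node to its allowed demotion targets: every node in
    the next lower tier, or none for the lowest tier."""
    targets = {}
    for i, tier in enumerate(tiers):
        lower = tiers[i + 1] if i + 1 < len(tiers) else set()
        for node in tier:
            targets[node] = sorted(lower)
    return targets


# memory_tier0 = 0-1, memory_tier1 = 2-3, then: echo 2 > memory_tier1
tiers = write_tier([{0, 1}, {2, 3}], 1, [2])
demotion = allowed_targets(tiers)
```

Here tiers ends up as [{0, 1}, {2}, {3}], and Node2 gets Node3 as its
only allowed demotion target. Note that Node1 can then demote only to
Node2, since Node3 is no longer in the tier directly below it.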

>
> N_TOPTIER_MEMORY: 0-1
>
> node_demotion[]:
>   0: [2], [2-3]
>   1: [3], [2-3]
>   2: [],  []
>   3: [],  []
>
> * Example 2:
>   Node 0 & 1 are DRAM nodes.
>   Node 2 is a PMEM node and closer to node 0.
>
>   Node 0 has node 2 as the preferred and only demotion target.
>
>   Node 1 has no preferred demotion target, but can still demote
>   to node 2.
>
>   Set mempolicy to prevent cross-socket demotion and memory access,
>   e.g. cpuset.mems=0,2
>
> node distances:
> node   0    1    2
>    0  10   20   30
>    1  20   10   40
>    2  30   40   10
>
> /sys/devices/system/node/memory_tiers
> 0-1
> 2
>
> N_TOPTIER_MEMORY: 0-1
>
> node_demotion[]:
>   0: [2], [2]
>   1: [],  [2]
>   2: [],  []
>
>
> * Example 3:
>   Node 0 & 1 are DRAM nodes.
>   Node 2 is a PMEM node and has the same distance to node 0 & 1.
>
>   Node 0 has node 2 as the preferred and only demotion target.
>
>   Node 1 has node 2 as the preferred and only demotion target.
>
> node distances:
> node   0    1    2
>    0  10   20   30
>    1  20   10   30
>    2  30   30   10
>
> /sys/devices/system/node/memory_tiers
> 0-1
> 2
>
> N_TOPTIER_MEMORY: 0-1
>
> node_demotion[]:
>   0: [2], [2]
>   1: [2], [2]
>   2: [],  []
>
>
> * Example 4:
>   Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
>
>   All nodes are top-tier.
>
> node distances:
> node   0    1    2
>    0  10   20   30
>    1  20   10   30
>    2  30   30   10
>
> /sys/devices/system/node/memory_tiers
> 0-2
>
> N_TOPTIER_MEMORY: 0-2
>
> node_demotion[]:
>   0: [],  []
>   1: [],  []
>   2: [],  []
>
>
> * Example 5:
>   Node 0 is a DRAM node with CPU.
>   Node 1 is a HBM node.
>   Node 2 is a PMEM node.
>
>   With userspace override, node 1 is the top tier and has node 0 as
>   the preferred and only demotion target.
>
>   Node 0 is in the second tier, tier 1, and has node 2 as the
>   preferred and only demotion target.
>
>   Node 2 is in the lowest tier, tier 2, and has no demotion targets.
>
> node distances:
> node   0    1    2
>    0  10   21   30
>    1  21   10   40
>    2  30   40   10
>
> /sys/devices/system/node/memory_tiers (userspace override)
> 1
> 0
> 2
>
> N_TOPTIER_MEMORY: 1
>
> node_demotion[]:
>   0: [2], [2]
>   1: [0], [0]
>   2: [],  []
>
> -- Wei
