* RFC: Memory Tiering Kernel Interfaces (v2)
@ 2022-05-12  6:22 Wei Xu
  2022-05-12  7:03 ` ying.huang
                   ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Wei Xu @ 2022-05-12  6:22 UTC (permalink / raw)
  To: Huang Ying, Andrew Morton, Greg Thelen, Aneesh Kumar K.V,
	Yang Shi, Linux Kernel Mailing List, Jagdish Gediya,
	Michal Hocko, Tim C Chen, Dave Hansen, Alistair Popple,
	Baolin Wang, Feng Tang, Jonathan Cameron, Davidlohr Bueso,
	Dan Williams, David Rientjes, Linux MM, Brice Goglin,
	Hesham Almatary

The current kernel has basic memory tiering support: Inactive
pages on a higher tier NUMA node can be migrated (demoted) to a lower
tier NUMA node to make room for new allocations on the higher tier
NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
migrated (promoted) to a higher tier NUMA node to improve
performance.

In the current kernel, memory tiers are defined implicitly via a
demotion path relationship between NUMA nodes, which is created during
the kernel initialization and updated when a NUMA node is hot-added or
hot-removed.  The current implementation puts all nodes with CPU into
the top tier, and builds the tier hierarchy tier-by-tier by establishing
the per-node demotion targets based on the distances between nodes.

The current memory tier kernel interface needs to be improved for
several important use cases:

* The current tier initialization code always initializes
  each memory-only NUMA node into a lower tier.  But a memory-only
  NUMA node may have a high performance memory device (e.g. a DRAM
  device attached via CXL.mem or a DRAM-backed memory-only node on
  a virtual machine) and should be put into a higher tier.

* The current tier hierarchy always puts CPU nodes into the top
  tier. But on a system with HBM (e.g. GPU memory) devices, these
  memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
  with CPUs are better placed in the next lower tier.

* Also, because the current tier hierarchy always puts CPU nodes
  into the top tier, when a CPU is hot-added (or hot-removed) and
  turns a memory node from a CPU-less node into a node with CPU (or
  vice versa), the memory tier hierarchy gets changed, even though no
  memory node is added or removed.  This can make the tier
  hierarchy unstable and make it difficult to support tier-based
  memory accounting.

* A higher tier node can only be demoted to selected nodes on the
  next lower tier as defined by the demotion path, not any other
  node from any lower tier.  This strict, hard-coded demotion order
  does not work in all use cases (e.g. some use cases may want to
  allow cross-socket demotion to another node in the same demotion
  tier as a fallback when the preferred demotion node is out of
  space), and has resulted in the feature request for an interface to
  override the system-wide, per-node demotion order from the
  userspace.  This demotion order is also inconsistent with the page
  allocation fallback order when all the nodes in a higher tier are
  out of space: The page allocation can fall back to any node from
  any lower tier, whereas the demotion order doesn't allow that.

* There are no interfaces for the userspace to learn about the memory
  tier hierarchy in order to optimize its memory allocations.

I'd like to propose revised memory tier kernel interfaces based on
the discussions in the threads:

- https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
- https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
- https://lore.kernel.org/linux-mm/867bc216386eb6cbf54648f23e5825830f5b922e.camel@intel.com/T/


High-level Design Ideas
=======================

* Define memory tiers explicitly, not implicitly.

* Memory tiers are defined based on the hardware capabilities of
  memory nodes, not on the relative distances between nodes.

* The tier assignment of each node is independent of the other nodes.
  Moving a node from one tier to another doesn't affect the tier
  assignment of any other node.

* The node-tier association is stable. A node can be reassigned to a
  different tier only under the specific conditions that don't block
  future tier-based memory cgroup accounting.

* A node can demote its pages to any nodes of any lower tiers. The
  demotion target node selection follows the allocation fallback order
  of the source node, which is built based on node distances.  The
  demotion targets are also restricted to only the nodes from the tiers
  lower than the source node.  We no longer need to maintain a separate
  per-node demotion order (node_demotion[]).


Sysfs Interfaces
================

* /sys/devices/system/memtier/memtierN/nodelist

  where N = 0, 1, 2 (the kernel supports only 3 tiers for now).

  Format: node_list

  Read-only.  When read, list the memory nodes in the specified tier.

  Tier 0 is the highest tier, while tier 2 is the lowest tier.

  The absolute value of a tier id number has no specific meaning.
  What matters is the relative order of the tier id numbers.

  When a memory tier has no nodes, the kernel can hide its memtier
  sysfs files.

* /sys/devices/system/node/nodeN/memtier

  where N = 0, 1, ...

  Format: int or empty

  When read, list the memory tier that the node belongs to.  Its value
  is empty for a CPU-only NUMA node.

  When written, the kernel moves the node into the specified memory
  tier if the move is allowed.  The tier assignment of all other
  nodes is not affected.

  Initially, we can make this interface read-only.
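
For illustration, a small userspace reader of the proposed files
might look like the sketch below.  This assumes the interfaces exist
exactly as described above (they are only proposed here), and the
node id bound of 1024 is an arbitrary choice for the example:

    /* memtier_dump.c: print the proposed memory tier sysfs files. */
    #include <stdio.h>

    int main(void)
    {
            char path[128], buf[256];
            FILE *f;
            int i;

            /* Tier -> nodes.  Empty tiers may be hidden by the kernel. */
            for (i = 0; i < 3; i++) {
                    snprintf(path, sizeof(path),
                             "/sys/devices/system/memtier/memtier%d/nodelist", i);
                    f = fopen(path, "r");
                    if (f) {
                            if (fgets(buf, sizeof(buf), f))
                                    printf("memtier%d: %s", i, buf);
                            fclose(f);
                    }
            }

            /* Node -> tier.  The value is empty for CPU-only nodes. */
            for (i = 0; i < 1024; i++) {
                    snprintf(path, sizeof(path),
                             "/sys/devices/system/node/node%d/memtier", i);
                    f = fopen(path, "r");
                    if (f) {
                            printf("node%d: %s", i,
                                   fgets(buf, sizeof(buf), f) ? buf : "\n");
                            fclose(f);
                    }
            }
            return 0;
    }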


Kernel Representation
=====================

* All memory tiering code is guarded by CONFIG_TIERED_MEMORY.

* #define MAX_MEMORY_TIERS 3

  Support 3 memory tiers for now.

* #define MEMORY_DEFAULT_TIER 1

  The default tier that a memory node is assigned to.

* nodemask_t memory_tiers[MAX_MEMORY_TIERS]

  Store memory nodes by tiers.

* int node_tier_map[MAX_NUMNODES]

  Map a node to its tier.

  For each CPU-only node c, node_tier_map[c] = -1.
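
Put together, a minimal sketch of these definitions, plus one
illustrative helper whose name is an assumption and not part of the
proposal, might look like:

    #include <linux/nodemask.h>

    #ifdef CONFIG_TIERED_MEMORY

    #define MAX_MEMORY_TIERS        3
    #define MEMORY_DEFAULT_TIER     1

    /* Nodes in each tier; tier 0 is the highest tier. */
    static nodemask_t memory_tiers[MAX_MEMORY_TIERS];

    /* Tier of each node; -1 for CPU-only (and not-yet-assigned) nodes. */
    static int node_tier_map[MAX_NUMNODES] = {
            [0 ... MAX_NUMNODES - 1] = -1,
    };

    /* Illustrative helper: is @dst_nid in a strictly lower tier than @src_nid?
     * (Lower tier means a larger tier id.) */
    static inline bool node_is_lower_tier(int src_nid, int dst_nid)
    {
            return node_tier_map[dst_nid] > node_tier_map[src_nid];
    }

    #endif /* CONFIG_TIERED_MEMORY */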


Memory Tier Initialization
==========================

By default, all memory nodes are assigned to the default tier
(MEMORY_DEFAULT_TIER).

A device driver can move its memory nodes up or down from the default
tier.  For example, a PMEM driver can move its memory nodes below the
default tier, whereas a GPU driver can move its memory nodes above
the default tier.

The kernel initialization code makes the decision on which exact tier
a memory node should be assigned to based on the requests from the
device drivers as well as the memory device hardware information
provided by the firmware.
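
As a sketch of the intended split between the boot-time default and
driver requests (continuing the declarations above), something like
the following could work; the helper name node_set_memory_tier() is
hypothetical, not an interface this proposal defines:

    /* Hypothetical helper: move @nid into @tier, if the move is allowed. */
    int node_set_memory_tier(int nid, int tier);

    /* Boot-time default: every memory node starts in the default tier. */
    static void __init memory_tier_init(void)
    {
            int nid;

            for_each_node_state(nid, N_MEMORY) {
                    node_tier_map[nid] = MEMORY_DEFAULT_TIER;
                    node_set(nid, memory_tiers[MEMORY_DEFAULT_TIER]);
            }
    }

    /* From a PMEM driver's probe path (illustrative), request a lower
     * tier; a GPU/HBM driver would pass MEMORY_DEFAULT_TIER - 1 instead. */
    static int example_pmem_assign_tier(int pmem_nid)
    {
            return node_set_memory_tier(pmem_nid, MEMORY_DEFAULT_TIER + 1);
    }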


Memory Tier Reassignment
========================

After a memory node is hot-removed, it can be hot-added back to a
different memory tier.  This is useful for supporting dynamically
provisioned CXL.mem NUMA nodes, which may connect to different
memory devices across hot-plug events.  Such tier changes should
be compatible with tier-based memory accounting.

The userspace may also reassign an existing online memory node to a
different tier.  However, this should only be allowed when no pages
are allocated from the memory node or when there are no non-root
memory cgroups (e.g. during the system boot).  This restriction is
important for keeping memory tier hierarchy stable enough for
tier-based memory cgroup accounting.

Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
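
A rough sketch of what the nodeN/memtier write path could look like
once writes are enabled is below.  memtier_can_reassign() is only a
placeholder for the "no allocated pages / no non-root memcg" check
described above (it is not a real kernel function), and locking is
omitted:

    static ssize_t memtier_store(struct device *dev,
                                 struct device_attribute *attr,
                                 const char *buf, size_t count)
    {
            int nid = dev->id;      /* node devices carry the node id here */
            int tier, err;

            err = kstrtoint(buf, 10, &tier);
            if (err)
                    return err;
            if (tier < 0 || tier >= MAX_MEMORY_TIERS)
                    return -EINVAL;
            if (node_tier_map[nid] < 0)
                    return -EINVAL; /* CPU-only node: no tier to change */

            /* Placeholder for the restriction described above. */
            if (!memtier_can_reassign(nid))
                    return -EBUSY;

            node_clear(nid, memory_tiers[node_tier_map[nid]]);
            node_tier_map[nid] = tier;
            node_set(nid, memory_tiers[tier]);
            return count;
    }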


Memory Allocation for Demotion
==============================

To allocate a new page as the demotion target for a page, the kernel
calls the allocation function (__alloc_pages_nodemask) with the
source page node as the preferred node and the union of all lower
tier nodes as the allowed nodemask.  The actual target node selection
then follows the allocation fallback order that the kernel has
already defined.

The pseudo code looks like:

    targets = NODE_MASK_NONE;
    src_nid = page_to_nid(page);
    src_tier = node_tier_map[src_nid];
    for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
            nodes_or(targets, targets, memory_tiers[i]);
    new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);

The mempolicy of the cpuset, VMA and owner task of the source page
can be used to refine the demotion target nodemask, e.g. to prevent
demotion or to select a particular allowed node as the demotion
target.
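
Fleshing the pseudo code out into a function shape, the demotion
allocation might look like the sketch below.  The function name and
the optional 'allowed' mask argument are illustrative only;
__alloc_pages_nodemask is used simply because it is the entry point
named above:

    static struct page *alloc_demote_target_page(struct page *page, gfp_t gfp,
                                                 unsigned int order,
                                                 nodemask_t *allowed)
    {
            nodemask_t targets = NODE_MASK_NONE;
            int src_nid = page_to_nid(page);
            int src_tier = node_tier_map[src_nid];
            int i;

            /* Union of all tiers strictly below the source node's tier. */
            for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
                    nodes_or(targets, targets, memory_tiers[i]);

            /* Optionally restrict to a cpuset/mempolicy-derived mask. */
            if (allowed)
                    nodes_and(targets, targets, *allowed);

            if (nodes_empty(targets))
                    return NULL;

            /* Preferred node is the source node; from there, selection
             * follows the kernel's existing allocation fallback order. */
            return __alloc_pages_nodemask(gfp, order, src_nid, &targets);
    }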


Memory Allocation for Promotion
===============================

The page allocation for promotion is similar to demotion, except that
(1) the target nodemask uses the promotion tiers (i.e. the tiers above
the source node), and (2) the preferred node can be the accessing CPU
node rather than the source page node.
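
A matching sketch for promotion; again illustrative, with
numa_node_id() standing in for the node of the accessing CPU:

    static struct page *alloc_promote_target_page(struct page *page, gfp_t gfp,
                                                  unsigned int order)
    {
            nodemask_t targets = NODE_MASK_NONE;
            int src_tier = node_tier_map[page_to_nid(page)];
            int i;

            /* Union of all tiers strictly above the source node's tier. */
            for (i = 0; i < src_tier; i++)
                    nodes_or(targets, targets, memory_tiers[i]);

            if (nodes_empty(targets))
                    return NULL;

            /* Prefer the node of the CPU that is accessing the page. */
            return __alloc_pages_nodemask(gfp, order, numa_node_id(), &targets);
    }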


Examples
========

* Example 1:

Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.

                  20
  Node 0 (DRAM)  ----  Node 1 (DRAM)
       |        \   /       |
       | 30    40 X 40      | 30
       |        /   \       |
  Node 2 (PMEM)  ----  Node 3 (PMEM)
                  40

node distances:
node   0    1    2    3
   0  10   20   30   40
   1  20   10   40   30
   2  30   40   10   40
   3  40   30   40   10

$ cat /sys/devices/system/memtier/memtier*/nodelist
<empty>
0-1
2-3

$ cat /sys/devices/system/node/node*/memtier
1
1
2
2

Demotion fallback order:
node 0: 2, 3
node 1: 3, 2
node 2: empty
node 3: empty

To prevent cross-socket demotion and memory access, the user can set
mempolicy, e.g. cpuset.mems=0,2.


* Example 2:

Node 0 & 1 are DRAM nodes.
Node 2 is a PMEM node and closer to node 0.

                  20
  Node 0 (DRAM)  ----  Node 1 (DRAM)
       |            /
       | 30       / 40
       |        /
  Node 2 (PMEM)

node distances:
node   0    1    2
   0  10   20   30
   1  20   10   40
   2  30   40   10

$ cat /sys/devices/system/memtier/memtier*/nodelist
<empty>
0-1
2

$ cat /sys/devices/system/node/node*/memtier
1
1
2

Demotion fallback order:
node 0: 2
node 1: 2
node 2: empty


* Example 3:

Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.

All nodes are in the same tier.

                  20
  Node 0 (DRAM)  ----  Node 1 (DRAM)
         \                 /
          \ 30            / 30
           \             /
             Node 2 (PMEM)

node distances:
node   0    1    2
   0  10   20   30
   1  20   10   30
   2  30   30   10

$ cat /sys/devices/system/memtier/memtier*/nodelist
<empty>
0-2
<empty>

$ cat /sys/devices/system/node/node*/memtier
1
1
1

Demotion fallback order:
node 0: empty
node 1: empty
node 2: empty


* Example 4:

Node 0 is a DRAM node with CPU.
Node 1 is a PMEM node.
Node 2 is a GPU node.

                  50
  Node 0 (DRAM)  ----  Node 2 (GPU)
         \                 /
          \ 30            / 60
           \             /
             Node 1 (PMEM)

node distances:
node   0    1    2
   0  10   30   50
   1  30   10   60
   2  50   60   10

$ cat /sys/devices/system/memtier/memtier*/nodelist
2
0
1

$ cat /sys/devices/system/node/node*/memtier
1
2
0

Demotion fallback order:
node 0: 1
node 1: empty
node 2: 0, 1


* Example 5:

Node 0 is a DRAM node with CPU.
Node 1 is a GPU node.
Node 2 is a PMEM node.
Node 3 is a large, slow DRAM node without CPU.


     Node 2 (PMEM)  ----
   /      |              \
  /       | 30            \ 120
 |        |         100    \
 |   Node 0 (DRAM)  ----  Node 1 (GPU)
  \         \                 /
    \        \ 40            / 110
  80  \       \             /
        ---  Node 3 (Slow DRAM)

node distances:
node    0    1    2    3
   0   10  100   30   40
   1  100   10  120  110
   2   30  120   10   80
   3   40  110   80   10

$ cat /sys/devices/system/memtier/memtier*/nodelist
1
0,3
2

$ cat /sys/devices/system/node/node*/memtier
1
0
2
1

Demotion fallback order:
node 0: 2
node 1: 0, 3, 2
node 2: empty
node 3: 2


* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-12  6:22 RFC: Memory Tiering Kernel Interfaces (v2) Wei Xu
@ 2022-05-12  7:03 ` ying.huang
  2022-05-12  7:12   ` Aneesh Kumar K V
  2022-05-12 15:00 ` Jonathan Cameron
  2022-05-13  3:25 ` ying.huang
  2 siblings, 1 reply; 47+ messages in thread
From: ying.huang @ 2022-05-12  7:03 UTC (permalink / raw)
  To: Wei Xu, Andrew Morton, Greg Thelen, Aneesh Kumar K.V, Yang Shi,
	Linux Kernel Mailing List, Jagdish Gediya, Michal Hocko,
	Tim C Chen, Dave Hansen, Alistair Popple, Baolin Wang, Feng Tang,
	Jonathan Cameron, Davidlohr Bueso, Dan Williams, David Rientjes,
	Linux MM, Brice Goglin, Hesham Almatary

On Wed, 2022-05-11 at 23:22 -0700, Wei Xu wrote:
> Sysfs Interfaces
> ================
> 
> * /sys/devices/system/memtier/memtierN/nodelist
> 
>   where N = 0, 1, 2 (the kernel supports only 3 tiers for now).
> 
>   Format: node_list
> 
>   Read-only.  When read, list the memory nodes in the specified tier.
> 
>   Tier 0 is the highest tier, while tier 2 is the lowest tier.
> 
>   The absolute value of a tier id number has no specific meaning.
>   What matters is the relative order of the tier id numbers.
> 
>   When a memory tier has no nodes, the kernel can hide its memtier
>   sysfs files.
> 
> * /sys/devices/system/node/nodeN/memtier
> 
>   where N = 0, 1, ...
> 
>   Format: int or empty
> 
>   When read, list the memory tier that the node belongs to.  Its value
>   is empty for a CPU-only NUMA node.
> 
>   When written, the kernel moves the node into the specified memory
>   tier if the move is allowed.  The tier assignment of all other nodes
>   are not affected.
> 
>   Initially, we can make this interface read-only.

It seems that "/sys/devices/system/node/nodeN/memtier" has all
information we needed.  Do we really need
"/sys/devices/system/memtier/memtierN/nodelist"?

That can be gotten via a simple shell command line,

$ grep . /sys/devices/system/node/nodeN/memtier | sort -n -k 2 -t ':'

Best Regards,
Huang, Ying



* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-12  7:03 ` ying.huang
@ 2022-05-12  7:12   ` Aneesh Kumar K V
  2022-05-12  7:18     ` ying.huang
  2022-05-12  7:22     ` Wei Xu
  0 siblings, 2 replies; 47+ messages in thread
From: Aneesh Kumar K V @ 2022-05-12  7:12 UTC (permalink / raw)
  To: ying.huang, Wei Xu, Andrew Morton, Greg Thelen, Yang Shi,
	Linux Kernel Mailing List, Jagdish Gediya, Michal Hocko,
	Tim C Chen, Dave Hansen, Alistair Popple, Baolin Wang, Feng Tang,
	Jonathan Cameron, Davidlohr Bueso, Dan Williams, David Rientjes,
	Linux MM, Brice Goglin, Hesham Almatary

On 5/12/22 12:33 PM, ying.huang@intel.com wrote:
> On Wed, 2022-05-11 at 23:22 -0700, Wei Xu wrote:
>> Sysfs Interfaces
>> ================
>>
>> * /sys/devices/system/memtier/memtierN/nodelist
>>
>>    where N = 0, 1, 2 (the kernel supports only 3 tiers for now).
>>
>>    Format: node_list
>>
>>    Read-only.  When read, list the memory nodes in the specified tier.
>>
>>    Tier 0 is the highest tier, while tier 2 is the lowest tier.
>>
>>    The absolute value of a tier id number has no specific meaning.
>>    What matters is the relative order of the tier id numbers.
>>
>>    When a memory tier has no nodes, the kernel can hide its memtier
>>    sysfs files.
>>
>> * /sys/devices/system/node/nodeN/memtier
>>
>>    where N = 0, 1, ...
>>
>>    Format: int or empty
>>
>>    When read, list the memory tier that the node belongs to.  Its value
>>    is empty for a CPU-only NUMA node.
>>
>>    When written, the kernel moves the node into the specified memory
>>    tier if the move is allowed.  The tier assignment of all other nodes
>>    are not affected.
>>
>>    Initially, we can make this interface read-only.
> 
> It seems that "/sys/devices/system/node/nodeN/memtier" has all
> information we needed.  Do we really need
> "/sys/devices/system/memtier/memtierN/nodelist"?
> 
> That can be gotten via a simple shell command line,
> 
> $ grep . /sys/devices/system/node/nodeN/memtier | sort -n -k 2 -t ':'
> 

It will be really useful to fetch the memory tier node list in an easy 
fashion rather than reading multiple sysfs directories. If we don't have 
other attributes for memorytier, we could keep
"/sys/devices/system/memtier/memtierN" a NUMA node list there by 
avoiding /sys/devices/system/memtier/memtierN/nodelist

-aneesh


* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-12  7:12   ` Aneesh Kumar K V
@ 2022-05-12  7:18     ` ying.huang
  2022-05-12  7:22     ` Wei Xu
  1 sibling, 0 replies; 47+ messages in thread
From: ying.huang @ 2022-05-12  7:18 UTC (permalink / raw)
  To: Aneesh Kumar K V, Wei Xu, Andrew Morton, Greg Thelen, Yang Shi,
	Linux Kernel Mailing List, Jagdish Gediya, Michal Hocko,
	Tim C Chen, Dave Hansen, Alistair Popple, Baolin Wang, Feng Tang,
	Jonathan Cameron, Davidlohr Bueso, Dan Williams, David Rientjes,
	Linux MM, Brice Goglin, Hesham Almatary

On Thu, 2022-05-12 at 12:42 +0530, Aneesh Kumar K V wrote:
> On 5/12/22 12:33 PM, ying.huang@intel.com wrote:
> > On Wed, 2022-05-11 at 23:22 -0700, Wei Xu wrote:
> > > Sysfs Interfaces
> > > ================
> > > 
> > > * /sys/devices/system/memtier/memtierN/nodelist
> > > 
> > >    where N = 0, 1, 2 (the kernel supports only 3 tiers for now).
> > > 
> > >    Format: node_list
> > > 
> > >    Read-only.  When read, list the memory nodes in the specified tier.
> > > 
> > >    Tier 0 is the highest tier, while tier 2 is the lowest tier.
> > > 
> > >    The absolute value of a tier id number has no specific meaning.
> > >    What matters is the relative order of the tier id numbers.
> > > 
> > >    When a memory tier has no nodes, the kernel can hide its memtier
> > >    sysfs files.
> > > 
> > > * /sys/devices/system/node/nodeN/memtier
> > > 
> > >    where N = 0, 1, ...
> > > 
> > >    Format: int or empty
> > > 
> > >    When read, list the memory tier that the node belongs to.  Its value
> > >    is empty for a CPU-only NUMA node.
> > > 
> > >    When written, the kernel moves the node into the specified memory
> > >    tier if the move is allowed.  The tier assignment of all other nodes
> > >    are not affected.
> > > 
> > >    Initially, we can make this interface read-only.
> > 
> > It seems that "/sys/devices/system/node/nodeN/memtier" has all
> > information we needed.  Do we really need
> > "/sys/devices/system/memtier/memtierN/nodelist"?
> > 
> > That can be gotten via a simple shell command line,
> > 
> > $ grep . /sys/devices/system/node/nodeN/memtier | sort -n -k 2 -t ':'
> > 
> 
> It will be really useful to fetch the memory tier node list in an easy 
> fashion rather than reading multiple sysfs directories. If we don't have 
> other attributes for memorytier, we could keep
> "/sys/devices/system/memtier/memtierN" a NUMA node list there by 
> avoiding /sys/devices/system/memtier/memtierN/nodelist

This will make the interface not extensible.  Even a single file
"/sys/devices/system/node/memtiers" is better.  As a read-only file, it
should be OK to put multiple values in it.

I remember that one rule for sysfs is that it is accessed more via
libsysfs.  Does that make life easier?

Best Regards,
Huang, Ying




* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-12  7:12   ` Aneesh Kumar K V
  2022-05-12  7:18     ` ying.huang
@ 2022-05-12  7:22     ` Wei Xu
  2022-05-12  7:36       ` Aneesh Kumar K.V
  1 sibling, 1 reply; 47+ messages in thread
From: Wei Xu @ 2022-05-12  7:22 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: ying.huang, Andrew Morton, Greg Thelen, Yang Shi,
	Linux Kernel Mailing List, Jagdish Gediya, Michal Hocko,
	Tim C Chen, Dave Hansen, Alistair Popple, Baolin Wang, Feng Tang,
	Jonathan Cameron, Davidlohr Bueso, Dan Williams, David Rientjes,
	Linux MM, Brice Goglin, Hesham Almatary

On Thu, May 12, 2022 at 12:12 AM Aneesh Kumar K V
<aneesh.kumar@linux.ibm.com> wrote:
>
> On 5/12/22 12:33 PM, ying.huang@intel.com wrote:
> > On Wed, 2022-05-11 at 23:22 -0700, Wei Xu wrote:
> >> Sysfs Interfaces
> >> ================
> >>
> >> * /sys/devices/system/memtier/memtierN/nodelist
> >>
> >>    where N = 0, 1, 2 (the kernel supports only 3 tiers for now).
> >>
> >>    Format: node_list
> >>
> >>    Read-only.  When read, list the memory nodes in the specified tier.
> >>
> >>    Tier 0 is the highest tier, while tier 2 is the lowest tier.
> >>
> >>    The absolute value of a tier id number has no specific meaning.
> >>    What matters is the relative order of the tier id numbers.
> >>
> >>    When a memory tier has no nodes, the kernel can hide its memtier
> >>    sysfs files.
> >>
> >> * /sys/devices/system/node/nodeN/memtier
> >>
> >>    where N = 0, 1, ...
> >>
> >>    Format: int or empty
> >>
> >>    When read, list the memory tier that the node belongs to.  Its value
> >>    is empty for a CPU-only NUMA node.
> >>
> >>    When written, the kernel moves the node into the specified memory
> >>    tier if the move is allowed.  The tier assignment of all other nodes
> >>    are not affected.
> >>
> >>    Initially, we can make this interface read-only.
> >
> > It seems that "/sys/devices/system/node/nodeN/memtier" has all
> > information we needed.  Do we really need
> > "/sys/devices/system/memtier/memtierN/nodelist"?
> >
> > That can be gotten via a simple shell command line,
> >
> > $ grep . /sys/devices/system/node/nodeN/memtier | sort -n -k 2 -t ':'
> >
>
> It will be really useful to fetch the memory tier node list in an easy
> fashion rather than reading multiple sysfs directories. If we don't have
> other attributes for memorytier, we could keep
> "/sys/devices/system/memtier/memtierN" a NUMA node list there by
> avoiding /sys/devices/system/memtier/memtierN/nodelist
>
> -aneesh

It is harder to implement memtierN as just a file and doesn't follow
the existing sysfs pattern, either.  Besides, it is extensible to have
memtierN as a directory.


* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-12  7:22     ` Wei Xu
@ 2022-05-12  7:36       ` Aneesh Kumar K.V
  2022-05-12  8:15         ` Wei Xu
  0 siblings, 1 reply; 47+ messages in thread
From: Aneesh Kumar K.V @ 2022-05-12  7:36 UTC (permalink / raw)
  To: Wei Xu
  Cc: ying.huang, Andrew Morton, Greg Thelen, Yang Shi,
	Linux Kernel Mailing List, Jagdish Gediya, Michal Hocko,
	Tim C Chen, Dave Hansen, Alistair Popple, Baolin Wang, Feng Tang,
	Jonathan Cameron, Davidlohr Bueso, Dan Williams, David Rientjes,
	Linux MM, Brice Goglin, Hesham Almatary

Wei Xu <weixugc@google.com> writes:

> On Thu, May 12, 2022 at 12:12 AM Aneesh Kumar K V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> On 5/12/22 12:33 PM, ying.huang@intel.com wrote:
>> > On Wed, 2022-05-11 at 23:22 -0700, Wei Xu wrote:
>> >> Sysfs Interfaces
>> >> ================
>> >>
>> >> * /sys/devices/system/memtier/memtierN/nodelist
>> >>
>> >>    where N = 0, 1, 2 (the kernel supports only 3 tiers for now).
>> >>
>> >>    Format: node_list
>> >>
>> >>    Read-only.  When read, list the memory nodes in the specified tier.
>> >>
>> >>    Tier 0 is the highest tier, while tier 2 is the lowest tier.
>> >>
>> >>    The absolute value of a tier id number has no specific meaning.
>> >>    What matters is the relative order of the tier id numbers.
>> >>
>> >>    When a memory tier has no nodes, the kernel can hide its memtier
>> >>    sysfs files.
>> >>
>> >> * /sys/devices/system/node/nodeN/memtier
>> >>
>> >>    where N = 0, 1, ...
>> >>
>> >>    Format: int or empty
>> >>
>> >>    When read, list the memory tier that the node belongs to.  Its value
>> >>    is empty for a CPU-only NUMA node.
>> >>
>> >>    When written, the kernel moves the node into the specified memory
>> >>    tier if the move is allowed.  The tier assignment of all other nodes
>> >>    are not affected.
>> >>
>> >>    Initially, we can make this interface read-only.
>> >
>> > It seems that "/sys/devices/system/node/nodeN/memtier" has all
>> > information we needed.  Do we really need
>> > "/sys/devices/system/memtier/memtierN/nodelist"?
>> >
>> > That can be gotten via a simple shell command line,
>> >
>> > $ grep . /sys/devices/system/node/nodeN/memtier | sort -n -k 2 -t ':'
>> >
>>
>> It will be really useful to fetch the memory tier node list in an easy
>> fashion rather than reading multiple sysfs directories. If we don't have
>> other attributes for memorytier, we could keep
>> "/sys/devices/system/memtier/memtierN" a NUMA node list there by
>> avoiding /sys/devices/system/memtier/memtierN/nodelist
>>
>> -aneesh
>
> It is harder to implement memtierN as just a file and doesn't follow
> the existing sysfs pattern, either.  Besides, it is extensible to have
> memtierN as a directory. 

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 6248326f944d..251f38ec3816 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -1097,12 +1097,49 @@ static struct attribute *node_state_attrs[] = {
 	NULL
 };
 
+#define MAX_TIER 3
+nodemask_t memory_tier[MAX_TIER];
+
+#define _TIER_ATTR_RO(name, tier_index)					\
+	{ __ATTR(name, 0444, show_tier, NULL), tier_index, NULL }
+
+struct memory_tier_attr {
+	struct device_attribute attr;
+	int tier_index;
+	int (*write)(nodemask_t nodes);
+};
+
+static ssize_t show_tier(struct device *dev,
+			 struct device_attribute *attr, char *buf)
+{
+	struct memory_tier_attr *mt = container_of(attr, struct memory_tier_attr, attr);
+
+	return sysfs_emit(buf, "%*pbl\n",
+			  nodemask_pr_args(&memory_tier[mt->tier_index]));
+}
+
 static const struct attribute_group memory_root_attr_group = {
 	.attrs = node_state_attrs,
 };
 
+
+#define TOP_TIER 0
+static struct memory_tier_attr memory_tiers[] = {
+	[0] = _TIER_ATTR_RO(memory_top_tier, TOP_TIER),
+};
+
+static struct attribute *memory_tier_attrs[] = {
+	&memory_tiers[0].attr.attr,
+	NULL
+};
+
+static const struct attribute_group memory_tier_attr_group = {
+	.attrs = memory_tier_attrs,
+};
+
 static const struct attribute_group *cpu_root_attr_groups[] = {
 	&memory_root_attr_group,
+	&memory_tier_attr_group,
 	NULL,
 };
 

As long as we have the ability to see the nodelist, I am good with the
proposal.

-aneesh


* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-12  7:36       ` Aneesh Kumar K.V
@ 2022-05-12  8:15         ` Wei Xu
  2022-05-12  8:37           ` ying.huang
  2022-05-12 21:12           ` Tim Chen
  0 siblings, 2 replies; 47+ messages in thread
From: Wei Xu @ 2022-05-12  8:15 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: ying.huang, Andrew Morton, Greg Thelen, Yang Shi,
	Linux Kernel Mailing List, Jagdish Gediya, Michal Hocko,
	Tim C Chen, Dave Hansen, Alistair Popple, Baolin Wang, Feng Tang,
	Jonathan Cameron, Davidlohr Bueso, Dan Williams, David Rientjes,
	Linux MM, Brice Goglin, Hesham Almatary

On Thu, May 12, 2022 at 12:36 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> Wei Xu <weixugc@google.com> writes:
>
> > On Thu, May 12, 2022 at 12:12 AM Aneesh Kumar K V
> > <aneesh.kumar@linux.ibm.com> wrote:
> >>
> >> On 5/12/22 12:33 PM, ying.huang@intel.com wrote:
> >> > On Wed, 2022-05-11 at 23:22 -0700, Wei Xu wrote:
> >> >> Sysfs Interfaces
> >> >> ================
> >> >>
> >> >> * /sys/devices/system/memtier/memtierN/nodelist
> >> >>
> >> >>    where N = 0, 1, 2 (the kernel supports only 3 tiers for now).
> >> >>
> >> >>    Format: node_list
> >> >>
> >> >>    Read-only.  When read, list the memory nodes in the specified tier.
> >> >>
> >> >>    Tier 0 is the highest tier, while tier 2 is the lowest tier.
> >> >>
> >> >>    The absolute value of a tier id number has no specific meaning.
> >> >>    What matters is the relative order of the tier id numbers.
> >> >>
> >> >>    When a memory tier has no nodes, the kernel can hide its memtier
> >> >>    sysfs files.
> >> >>
> >> >> * /sys/devices/system/node/nodeN/memtier
> >> >>
> >> >>    where N = 0, 1, ...
> >> >>
> >> >>    Format: int or empty
> >> >>
> >> >>    When read, list the memory tier that the node belongs to.  Its value
> >> >>    is empty for a CPU-only NUMA node.
> >> >>
> >> >>    When written, the kernel moves the node into the specified memory
> >> >>    tier if the move is allowed.  The tier assignment of all other nodes
> >> >>    are not affected.
> >> >>
> >> >>    Initially, we can make this interface read-only.
> >> >
> >> > It seems that "/sys/devices/system/node/nodeN/memtier" has all
> >> > information we needed.  Do we really need
> >> > "/sys/devices/system/memtier/memtierN/nodelist"?
> >> >
> >> > That can be gotten via a simple shell command line,
> >> >
> >> > $ grep . /sys/devices/system/node/nodeN/memtier | sort -n -k 2 -t ':'
> >> >
> >>
> >> It will be really useful to fetch the memory tier node list in an easy
> >> fashion rather than reading multiple sysfs directories. If we don't have
> >> other attributes for memorytier, we could keep
> >> "/sys/devices/system/memtier/memtierN" a NUMA node list there by
> >> avoiding /sys/devices/system/memtier/memtierN/nodelist
> >>
> >> -aneesh
> >
> > It is harder to implement memtierN as just a file and doesn't follow
> > the existing sysfs pattern, either.  Besides, it is extensible to have
> > memtierN as a directory.
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 6248326f944d..251f38ec3816 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -1097,12 +1097,49 @@ static struct attribute *node_state_attrs[] = {
>         NULL
>  };
>
> +#define MAX_TIER 3
> +nodemask_t memory_tier[MAX_TIER];
> +
> +#define _TIER_ATTR_RO(name, tier_index)                                        \
> +       { __ATTR(name, 0444, show_tier, NULL), tier_index, NULL }
> +
> +struct memory_tier_attr {
> +       struct device_attribute attr;
> +       int tier_index;
> +       int (*write)(nodemask_t nodes);
> +};
> +
> +static ssize_t show_tier(struct device *dev,
> +                        struct device_attribute *attr, char *buf)
> +{
> +       struct memory_tier_attr *mt = container_of(attr, struct memory_tier_attr, attr);
> +
> +       return sysfs_emit(buf, "%*pbl\n",
> +                         nodemask_pr_args(&memory_tier[mt->tier_index]));
> +}
> +
>  static const struct attribute_group memory_root_attr_group = {
>         .attrs = node_state_attrs,
>  };
>
> +
> +#define TOP_TIER 0
> +static struct memory_tier_attr memory_tiers[] = {
> +       [0] = _TIER_ATTR_RO(memory_top_tier, TOP_TIER),
> +};
> +
> +static struct attribute *memory_tier_attrs[] = {
> +       &memory_tiers[0].attr.attr,
> +       NULL
> +};
> +
> +static const struct attribute_group memory_tier_attr_group = {
> +       .attrs = memory_tier_attrs,
> +};
> +
>  static const struct attribute_group *cpu_root_attr_groups[] = {
>         &memory_root_attr_group,
> +       &memory_tier_attr_group,
>         NULL,
>  };
>
>
> As long as we have the ability to see the nodelist, I am good with the
> proposal.
>
> -aneesh

I am OK with moving back the memory tier nodelist into node/.  When
there are more memory tier attributes needed, we can then create the
memory tier subtree and replace the tier nodelist in node/ with
symlinks.

So the revised sysfs interfaces are:

* /sys/devices/system/node/memory_tierN (read-only)

  where N = 0, 1, 2

  Format: node_list

* /sys/devices/system/node/nodeN/memory_tier (read/write)

  where N = 0, 1, ...

  Format: int or empty


* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-12  8:15         ` Wei Xu
@ 2022-05-12  8:37           ` ying.huang
  2022-05-13  2:52             ` ying.huang
  2022-05-12 21:12           ` Tim Chen
  1 sibling, 1 reply; 47+ messages in thread
From: ying.huang @ 2022-05-12  8:37 UTC (permalink / raw)
  To: Wei Xu, Aneesh Kumar K.V
  Cc: Andrew Morton, Greg Thelen, Yang Shi, Linux Kernel Mailing List,
	Jagdish Gediya, Michal Hocko, Tim C Chen, Dave Hansen,
	Alistair Popple, Baolin Wang, Feng Tang, Jonathan Cameron,
	Davidlohr Bueso, Dan Williams, David Rientjes, Linux MM,
	Brice Goglin, Hesham Almatary

On Thu, 2022-05-12 at 01:15 -0700, Wei Xu wrote:
> On Thu, May 12, 2022 at 12:36 AM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
> > 
> > Wei Xu <weixugc@google.com> writes:
> > 
> > > On Thu, May 12, 2022 at 12:12 AM Aneesh Kumar K V
> > > <aneesh.kumar@linux.ibm.com> wrote:
> > > > 
> > > > On 5/12/22 12:33 PM, ying.huang@intel.com wrote:
> > > > > On Wed, 2022-05-11 at 23:22 -0700, Wei Xu wrote:
> > > > > > Sysfs Interfaces
> > > > > > ================
> > > > > > 
> > > > > > * /sys/devices/system/memtier/memtierN/nodelist
> > > > > > 
> > > > > >    where N = 0, 1, 2 (the kernel supports only 3 tiers for now).
> > > > > > 
> > > > > >    Format: node_list
> > > > > > 
> > > > > >    Read-only.  When read, list the memory nodes in the specified tier.
> > > > > > 
> > > > > >    Tier 0 is the highest tier, while tier 2 is the lowest tier.
> > > > > > 
> > > > > >    The absolute value of a tier id number has no specific meaning.
> > > > > >    What matters is the relative order of the tier id numbers.
> > > > > > 
> > > > > >    When a memory tier has no nodes, the kernel can hide its memtier
> > > > > >    sysfs files.
> > > > > > 
> > > > > > * /sys/devices/system/node/nodeN/memtier
> > > > > > 
> > > > > >    where N = 0, 1, ...
> > > > > > 
> > > > > >    Format: int or empty
> > > > > > 
> > > > > >    When read, list the memory tier that the node belongs to.  Its value
> > > > > >    is empty for a CPU-only NUMA node.
> > > > > > 
> > > > > >    When written, the kernel moves the node into the specified memory
> > > > > >    tier if the move is allowed.  The tier assignment of all other nodes
> > > > > >    are not affected.
> > > > > > 
> > > > > >    Initially, we can make this interface read-only.
> > > > > 
> > > > > It seems that "/sys/devices/system/node/nodeN/memtier" has all
> > > > > information we needed.  Do we really need
> > > > > "/sys/devices/system/memtier/memtierN/nodelist"?
> > > > > 
> > > > > That can be gotten via a simple shell command line,
> > > > > 
> > > > > $ grep . /sys/devices/system/node/nodeN/memtier | sort -n -k 2 -t ':'
> > > > > 
> > > > 
> > > > It will be really useful to fetch the memory tier node list in an easy
> > > > fashion rather than reading multiple sysfs directories. If we don't have
> > > > other attributes for memorytier, we could keep
> > > > "/sys/devices/system/memtier/memtierN" a NUMA node list there by
> > > > avoiding /sys/devices/system/memtier/memtierN/nodelist
> > > > 
> > > > -aneesh
> > > 
> > > It is harder to implement memtierN as just a file and doesn't follow
> > > the existing sysfs pattern, either.  Besides, it is extensible to have
> > > memtierN as a directory.
> > 
> > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > index 6248326f944d..251f38ec3816 100644
> > --- a/drivers/base/node.c
> > +++ b/drivers/base/node.c
> > @@ -1097,12 +1097,49 @@ static struct attribute *node_state_attrs[] = {
> >         NULL
> >  };
> > 
> > +#define MAX_TIER 3
> > +nodemask_t memory_tier[MAX_TIER];
> > +
> > +#define _TIER_ATTR_RO(name, tier_index)                                        \
> > +       { __ATTR(name, 0444, show_tier, NULL), tier_index, NULL }
> > +
> > +struct memory_tier_attr {
> > +       struct device_attribute attr;
> > +       int tier_index;
> > +       int (*write)(nodemask_t nodes);
> > +};
> > +
> > +static ssize_t show_tier(struct device *dev,
> > +                        struct device_attribute *attr, char *buf)
> > +{
> > +       struct memory_tier_attr *mt = container_of(attr, struct memory_tier_attr, attr);
> > +
> > +       return sysfs_emit(buf, "%*pbl\n",
> > +                         nodemask_pr_args(&memory_tier[mt->tier_index]));
> > +}
> > +
> >  static const struct attribute_group memory_root_attr_group = {
> >         .attrs = node_state_attrs,
> >  };
> > 
> > +
> > +#define TOP_TIER 0
> > +static struct memory_tier_attr memory_tiers[] = {
> > +       [0] = _TIER_ATTR_RO(memory_top_tier, TOP_TIER),
> > +};
> > +
> > +static struct attribute *memory_tier_attrs[] = {
> > +       &memory_tiers[0].attr.attr,
> > +       NULL
> > +};
> > +
> > +static const struct attribute_group memory_tier_attr_group = {
> > +       .attrs = memory_tier_attrs,
> > +};
> > +
> >  static const struct attribute_group *cpu_root_attr_groups[] = {
> >         &memory_root_attr_group,
> > +       &memory_tier_attr_group,
> >         NULL,
> >  };
> > 
> > 
> > As long as we have the ability to see the nodelist, I am good with the
> > proposal.
> > 
> > -aneesh
> 
> I am OK with moving back the memory tier nodelist into node/.  When
> there are more memory tier attributes needed, we can then create the
> memory tier subtree and replace the tier nodelist in node/ with
> symlinks.

What attributes do you imagine that we may put in the memory_tierX/
sysfs directory?  If we have good candidates in mind, we may just do
that.  What I can imagine now is "demote", like the "memory_reclaim"
file in the nodeX/ or node/ directory you proposed before.  Is it
necessary to show something like "meminfo" or "vmstat" there?

Best Regards,
Huang, Ying

> 
> So the revised sysfs interfaces are:
> 
> * /sys/devices/system/node/memory_tierN (read-only)
> 
>   where N = 0, 1, 2
> 
>   Format: node_list
> 
> * /sys/devices/system/node/nodeN/memory_tier (read/write)
> 
>   where N = 0, 1, ...
> 
>   Format: int or empty




* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-12  6:22 RFC: Memory Tiering Kernel Interfaces (v2) Wei Xu
  2022-05-12  7:03 ` ying.huang
@ 2022-05-12 15:00 ` Jonathan Cameron
  2022-05-18  7:09   ` Wei Xu
  2022-05-13  3:25 ` ying.huang
  2 siblings, 1 reply; 47+ messages in thread
From: Jonathan Cameron @ 2022-05-12 15:00 UTC (permalink / raw)
  To: Wei Xu
  Cc: Huang Ying, Andrew Morton, Greg Thelen, Aneesh Kumar K.V,
	Yang Shi, Linux Kernel Mailing List, Jagdish Gediya,
	Michal Hocko, Tim C Chen, Dave Hansen, Alistair Popple,
	Baolin Wang, Feng Tang, Davidlohr Bueso, Dan Williams,
	David Rientjes, Linux MM, Brice Goglin, Hesham Almatary

On Wed, 11 May 2022 23:22:11 -0700
Wei Xu <weixugc@google.com> wrote:
> The current kernel has the basic memory tiering support: Inactive
> pages on a higher tier NUMA node can be migrated (demoted) to a lower
> tier NUMA node to make room for new allocations on the higher tier
> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> migrated (promoted) to a higher tier NUMA node to improve the
> performance.
> 
> In the current kernel, memory tiers are defined implicitly via a
> demotion path relationship between NUMA nodes, which is created during
> the kernel initialization and updated when a NUMA node is hot-added or
> hot-removed.  The current implementation puts all nodes with CPU into
> the top tier, and builds the tier hierarchy tier-by-tier by establishing
> the per-node demotion targets based on the distances between nodes.
> 
> This current memory tier kernel interface needs to be improved for
> several important use cases:
> 
> * The current tier initialization code always initializes
>   each memory-only NUMA node into a lower tier.  But a memory-only
>   NUMA node may have a high performance memory device (e.g. a DRAM
>   device attached via CXL.mem or a DRAM-backed memory-only node on
>   a virtual machine) and should be put into a higher tier.
> 
> * The current tier hierarchy always puts CPU nodes into the top
>   tier. But on a system with HBM (e.g. GPU memory) devices, these
>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>   with CPUs are better to be placed into the next lower tier.
> 
> * Also because the current tier hierarchy always puts CPU nodes
>   into the top tier, when a CPU is hot-added (or hot-removed) and
>   triggers a memory node from CPU-less into a CPU node (or vice
>   versa), the memory tier hierarchy gets changed, even though no
>   memory node is added or removed.  This can make the tier
>   hierarchy unstable and make it difficult to support tier-based
>   memory accounting.
> 
> * A higher tier node can only be demoted to selected nodes on the
>   next lower tier as defined by the demotion path, not any other
>   node from any lower tier.  This strict, hard-coded demotion order
>   does not work in all use cases (e.g. some use cases may want to
>   allow cross-socket demotion to another node in the same demotion
>   tier as a fallback when the preferred demotion node is out of
>   space), and has resulted in the feature request for an interface to
>   override the system-wide, per-node demotion order from the
>   userspace.  This demotion order is also inconsistent with the page
>   allocation fallback order when all the nodes in a higher tier are
>   out of space: The page allocation can fall back to any node from
>   any lower tier, whereas the demotion order doesn't allow that.
> 
> * There are no interfaces for the userspace to learn about the memory
>   tier hierarchy in order to optimize its memory allocations.
> 
> I'd like to propose revised memory tier kernel interfaces based on
> the discussions in the threads:
> 
> - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
> - https://lore.kernel.org/linux-mm/867bc216386eb6cbf54648f23e5825830f5b922e.camel@intel.com/T/
> 
> 
> High-level Design Ideas
> =======================
> 
> * Define memory tiers explicitly, not implicitly.
> 
> * Memory tiers are defined based on hardware capabilities of memory
>   nodes, not their relative node distances between each other.
> 
> * The tier assignment of each node is independent from each other.
>   Moving a node from one tier to another tier doesn't affect the tier
>   assignment of any other node.
> 
> * The node-tier association is stable. A node can be reassigned to a
>   different tier only under the specific conditions that don't block
>   future tier-based memory cgroup accounting.
> 
> * A node can demote its pages to any nodes of any lower tiers. The
>   demotion target node selection follows the allocation fallback order
>   of the source node, which is built based on node distances.  The
>   demotion targets are also restricted to only the nodes from the tiers
>   lower than the source node.  We no longer need to maintain a separate
>   per-node demotion order (node_demotion[]).
> 

Hi Wei,

This proposal looks good to me, though we'll be having fun
white boarding topologies from our roadmaps for the next few days :)

A few comments inline. It also seems likely to me that there is little
benefit in starting with 3 tiers as the maximum.  Seems unlikely the
code will be substantially simpler for 3 than it would be for 4 or 5.
I've drawn out one simple case that needs 4 to do sensible things.

> 
> Sysfs Interfaces
> ================
> 
> * /sys/devices/system/memtier/memtierN/nodelist
> 
>   where N = 0, 1, 2 (the kernel supports only 3 tiers for now).
> 
>   Format: node_list
> 
>   Read-only.  When read, list the memory nodes in the specified tier.
> 
>   Tier 0 is the highest tier, while tier 2 is the lowest tier.
> 
>   The absolute value of a tier id number has no specific meaning.
>   What matters is the relative order of the tier id numbers.
> 
>   When a memory tier has no nodes, the kernel can hide its memtier
>   sysfs files.
> 
> * /sys/devices/system/node/nodeN/memtier
> 
>   where N = 0, 1, ...
> 
>   Format: int or empty
> 
>   When read, list the memory tier that the node belongs to.  Its value
>   is empty for a CPU-only NUMA node.
> 
>   When written, the kernel moves the node into the specified memory
>   tier if the move is allowed.  The tier assignment of all other nodes
>   are not affected.
> 
>   Initially, we can make this interface read-only.
> 
> 
> Kernel Representation
> =====================
> 
> * All memory tiering code is guarded by CONFIG_TIERED_MEMORY.
> 
> * #define MAX_MEMORY_TIERS 3
> 
>   Support 3 memory tiers for now.
> 
> * #define MEMORY_DEFAULT_TIER 1
> 
>   The default tier that a memory node is assigned to.
> 
> * nodemask_t memory_tiers[MAX_MEMORY_TIERS]
> 
>   Store memory nodes by tiers.
> 
> * int node_tier_map[MAX_NUMNODES]
> 
>   Map a node to its tier.
> 
>   For each CPU-only node c, node_tier_map[c] = -1.
> 
> 
> Memory Tier Initialization
> ==========================
> 
> By default, all memory nodes are assigned to the default tier
> (MEMORY_DEFAULT_TIER).

This is tighter than it needs to be.  In many cases we can easily
establish if there is any possibility of CPU being hotplugged into
a memory node.  If it's CXL attached no way CPUs are going to be
turning up their later :)  If CPU HP into a given node can't happen
we can be more flexible and I think that often results in better decisions.
See example below, though obviously I could just use the userspace
interface to fix that up anyway or have a CXL driver move it around
if that's relevant.  In some other cases I'm fairly sure we know in
advance where CPUs can be added but I'd need to check all the
relevant specs to be sure there aren't any corner cases.  I 'think'
for ARM for example we know where all possible CPUs can be hotplugged
(constraint coming from the interrupt controller + the fact that only
virtual CPU HP is defined).

> 
> A device driver can move up or down its memory nodes from the default
> tier.  For example, PMEM can move down its memory nodes below the
> default tier, whereas GPU can move up its memory nodes above the
> default tier.
> 
> The kernel initialization code makes the decision on which exact tier
> a memory node should be assigned to based on the requests from the
> device drivers as well as the memory device hardware information
> provided by the firmware.
> 
> 
> Memory Tier Reassignment
> ========================
> 
> After a memory node is hot-removed, it can be hot-added back to a
> different memory tier.  This is useful for supporting dynamically
> provisioned CXL.mem NUMA nodes, which may connect to different
> memory devices across hot-plug events.  Such tier changes should
> be compatible with tier-based memory accounting.
> 
> The userspace may also reassign an existing online memory node to a
> different tier.  However, this should only be allowed when no pages
> are allocated from the memory node or when there are no non-root
> memory cgroups (e.g. during the system boot).  This restriction is
> important for keeping memory tier hierarchy stable enough for
> tier-based memory cgroup accounting.
> 
> Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
> 
> 
> Memory Allocation for Demotion
> ==============================
> 
> To allocate a new page as the demotion target for a page, the kernel
> calls the allocation function (__alloc_pages_nodemask) with the
> source page node as the preferred node and the union of all lower
> tier nodes as the allowed nodemask.  The actual target node selection
> then follows the allocation fallback order that the kernel has
> already defined.
> 
> The pseudo code looks like:
> 
>     targets = NODE_MASK_NONE;
>     src_nid = page_to_nid(page);
>     src_tier = node_tier_map[src_nid];
>     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
>             nodes_or(targets, targets, memory_tiers[i]);
>     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
> 
> The memopolicy of cpuset, vma and owner task of the source page can
> be set to refine the demotion target nodemask, e.g. to prevent
> demotion or select a particular allowed node as the demotion target.
> 
> 
> Memory Allocation for Promotion
> ===============================
> 
> The page allocation for promotion is similar to demotion, except that (1)
> the target nodemask uses the promotion tiers, (2) the preferred node can
> be the accessing CPU node, not the source page node.
> 
> 
> Examples
> ========
> 

...

> * Example 3:
> 
> Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.

Node2 is drawn as pmem.

> 
> All nodes are in the same tier.
> 
>                   20
>   Node 0 (DRAM)  ----  Node 1 (DRAM)
>          \                 /
>           \ 30            / 30
>            \             /
>              Node 2 (PMEM)
> 
> node distances:
> node   0    1    2
>    0  10   20   30
>    1  20   10   30
>    2  30   30   10
> 
> $ cat /sys/devices/system/memtier/memtier*/nodelist
> <empty>
> 0-2
> <empty>
> 
> $ cat /sys/devices/system/node/node*/memtier
> 1
> 1
> 1
> 
> Demotion fallback order:
> node 0: empty
> node 1: empty
> node 2: empty
> 
> 
> * Example 4:
> 
> Node 0 is a DRAM node with CPU.
> Node 1 is a PMEM node.
> Node 2 is a GPU node.
> 
>                   50
>   Node 0 (DRAM)  ----  Node 2 (GPU)
>          \                 /
>           \ 30            / 60
>            \             /
>              Node 1 (PMEM)
> 
> node distances:
> node   0    1    2
>    0  10   30   50
>    1  30   10   60
>    2  50   60   10
> 
> $ cat /sys/devices/system/memtier/memtier*/nodelist
> 2
> 0
> 1
> 
> $ cat /sys/devices/system/node/node*/memtier
> 1
> 2
> 0
> 
> Demotion fallback order:
> node 0: 1
> node 1: empty
> node 2: 0, 1
> 
> 
> * Example 5:
> 
> Node 0 is a DRAM node with CPU.
> Node 1 is a GPU node.
> Node 2 is a PMEM node.
> Node 3 is a large, slow DRAM node without CPU.
> 
> 
>      Node 2 (PMEM)  ----
>    /      |              \
>   /       | 30            \ 120
>  |        |         100    \
>  |   Node 0 (DRAM)  ----  Node 1 (GPU)
>   \         \                 /
>     \        \ 40            / 110
>   80  \       \             /
>         ---  Node 3 (Slow DRAM)

This is close but not quite what was intended for Hesham's
example... (note we just checked that Hesham's original node0-1
timing didn't make any sense).


> 
> node distances:
> node    0    1    2    3
>    0   10  100   30   40
>    1  100   10  120  110
>    2   30  120   10   80
>    3   40  110   80   10
> 
> $ cat /sys/devices/system/memtier/memtier*/nodelist
> 1
> 0,3
> 2
> 
> $ cat /sys/devices/system/node/node*/memtier
> 1
> 0
> 2
> 1
> 
> Demotion fallback order:
> node 0: 2
> node 1: 0, 3, 2
> node 2: empty
> node 3: 2

This is close but not quite the same as the example
Hesham gave (note the node timing 1 to 0 in the table
with that example didn't make sense).  I added another
level of switching to make the numbers more obviously
different and show how critical it might be.

* Example 6:
 
Node 0 is a DRAM node with CPU.
Node 1 is a GPU node.
Node 2 is a PMEM node.
Node 3 is an extremely large DRAM node without CPU.
  (Key point here being that it probably never makes sense
   to demote to anywhere else from this memory).


I've redone the timings wrt example 5.
Basis for this is that 0 and 2 are directly connected
via controllers in an SoC.  1 and 3 are connected
via a common switch one level further down
(each hop via this is 100).
All DRAMs cost 10 once you've reached the correct node
and PMEM costs 30 from the SoC.
Numbers get too large as a result but meh, I'm making
a point, not providing real numbers :)

         PMEM Node 2
            |(30)
        CPU + DRAM Node0
            |(100)
         Switch 1
            |(100)
          Switch 2
    (100)  |      |(100) 
Node 1 GPU     Node3 Large memory.


With this extra level of switching in place, the distances work out
as below (e.g. node 0 to node 1 is 100 + 100 + 100 to reach the GPU
plus 10 for the access itself = 310):

     Node 2 (PMEM)  ----
    /      |              \
   /       | 30            \ 330
  |        |         310    \
  |   Node 0 (DRAM)  ----  Node 1 (GPU)
   \         \                 /
     \        \ 310           / 210
   330 \       \             /
         ---  Node 3 (Extremely large DRAM)

To my mind, we should potentially also take into account
the fact that Node3 can be known to never contain CPUs
(in at least some architectures we know where the CPUs
 might be added later; they can't just magically turn up
 anywhere in the topology).
 
node distances:
node    0    1    2    3
    0   10   310  30   310
    1   310  10   330  210
    2   30   330  10   330
    3   310  210  330   10

So, my ideal would treat node 3 differently from other DRAM nodes
as we know it can't have CPUs. Trying to come up with an
always correct order for nodes 3 and 2 is tricky as to a certain
extent it depends on capacity. If node 2 were big enough to take
any demotion from node 0 and still have lots of room then demoting
there from node 3 would make sense and vice versa.

 
 $ cat /sys/devices/system/memtier/memtier*/nodelist
 1
 0
 2
 3


 $ cat /sys/devices/system/node/node*/memtier
  1
  0
  2
  3
 
 Demotion fallback order:
 node 0: 2, 3
 node 1: 3, 0, 2 (key being we will almost always have less pressure on node 3)
 node 2: 3
 node 3: empty 

or as Hesham just pointed out this can be done with 3 tiers
because we can put the GPU and CPU in the same tier, as there
is little reason to demote from one to the other.

We are also a bit worried about ABI backwards compatibility because
of the potential need to make more space in tiers lower in number than
the CPU-attached DDR.  I rather liked the negative-tier proposal with
default as 0 that Huang, Ying made.

Jonathan






* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-12  8:15         ` Wei Xu
  2022-05-12  8:37           ` ying.huang
@ 2022-05-12 21:12           ` Tim Chen
  2022-05-12 21:31             ` Wei Xu
  1 sibling, 1 reply; 47+ messages in thread
From: Tim Chen @ 2022-05-12 21:12 UTC (permalink / raw)
  To: Wei Xu, Aneesh Kumar K.V
  Cc: ying.huang, Andrew Morton, Greg Thelen, Yang Shi,
	Linux Kernel Mailing List, Jagdish Gediya, Michal Hocko,
	Tim C Chen, Dave Hansen, Alistair Popple, Baolin Wang, Feng Tang,
	Jonathan Cameron, Davidlohr Bueso, Dan Williams, David Rientjes,
	Linux MM, Brice Goglin, Hesham Almatary

On Thu, 2022-05-12 at 01:15 -0700, Wei Xu wrote:
> 
> I am OK with moving back the memory tier nodelist into node/.  When
> there are more memory tier attributes needed, we can then create the
> memory tier subtree and replace the tier nodelist in node/ with
> symlinks.
> 
> So the revised sysfs interfaces are:
> 
> * /sys/devices/system/node/memory_tierN (read-only)
> 
>   where N = 0, 1, 2
> 
>   Format: node_list
> 
> * /sys/devices/system/node/nodeN/memory_tier (read/write)
> 
>   where N = 0, 1, ...
> 
>   Format: int or empty

This looks good to me.  Just wonder if having just 1 tier
lower than DRAM is sufficient.  We could have a wide performance
range for such secondary memories; is one tier sufficient for them?

Tim



* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-12 21:12           ` Tim Chen
@ 2022-05-12 21:31             ` Wei Xu
  0 siblings, 0 replies; 47+ messages in thread
From: Wei Xu @ 2022-05-12 21:31 UTC (permalink / raw)
  To: Tim Chen
  Cc: Aneesh Kumar K.V, ying.huang, Andrew Morton, Greg Thelen,
	Yang Shi, Linux Kernel Mailing List, Jagdish Gediya,
	Michal Hocko, Tim C Chen, Dave Hansen, Alistair Popple,
	Baolin Wang, Feng Tang, Jonathan Cameron, Davidlohr Bueso,
	Dan Williams, David Rientjes, Linux MM, Brice Goglin,
	Hesham Almatary

On Thu, May 12, 2022 at 2:13 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> On Thu, 2022-05-12 at 01:15 -0700, Wei Xu wrote:
> >
> > I am OK with moving back the memory tier nodelist into node/.  When
> > there are more memory tier attributes needed, we can then create the
> > memory tier subtree and replace the tier nodelist in node/ with
> > symlinks.
> >
> > So the revised sysfs interfaces are:
> >
> > * /sys/devices/system/node/memory_tierN (read-only)
> >
> >   where N = 0, 1, 2
> >
> >   Format: node_list
> >
> > * /sys/devices/system/node/nodeN/memory_tier (read/write)
> >
> >   where N = 0, 1, ...
> >
> >   Format: int or empty
>
> This looks good to me.  Just wonder if having just 1 tier
> lower than DRAM is sufficient. We could have wide performance
> range for such secondary memories and is one tier sufficient for them?
>
> Tim

The tier design can be extended to more than 3 tiers (e.g. via
CONFIG_MAX_MEMORY_TIERS).  MAX_MEMORY_TIERS is set to 3 for now
because without enough memory device performance information provided
by the firmware, it is difficult for the kernel to properly initialize
the memory tier hierarchy beyond 3 tiers (GPU, DRAM, PMEM).  We will
have to resort to the userspace override to set up such many-tier
systems.


* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-12  8:37           ` ying.huang
@ 2022-05-13  2:52             ` ying.huang
  2022-05-13  7:00               ` Wei Xu
  0 siblings, 1 reply; 47+ messages in thread
From: ying.huang @ 2022-05-13  2:52 UTC (permalink / raw)
  To: Wei Xu, Aneesh Kumar K.V
  Cc: Andrew Morton, Greg Thelen, Yang Shi, Linux Kernel Mailing List,
	Jagdish Gediya, Michal Hocko, Tim C Chen, Dave Hansen,
	Alistair Popple, Baolin Wang, Feng Tang, Jonathan Cameron,
	Davidlohr Bueso, Dan Williams, David Rientjes, Linux MM,
	Brice Goglin, Hesham Almatary

On Thu, 2022-05-12 at 16:37 +0800, ying.huang@intel.com wrote:
> On Thu, 2022-05-12 at 01:15 -0700, Wei Xu wrote:
> > On Thu, May 12, 2022 at 12:36 AM Aneesh Kumar K.V
> > <aneesh.kumar@linux.ibm.com> wrote:
> > > 
> > > Wei Xu <weixugc@google.com> writes:
> > > 
> > > > On Thu, May 12, 2022 at 12:12 AM Aneesh Kumar K V
> > > > <aneesh.kumar@linux.ibm.com> wrote:
> > > > > 
> > > > > On 5/12/22 12:33 PM, ying.huang@intel.com wrote:
> > > > > > On Wed, 2022-05-11 at 23:22 -0700, Wei Xu wrote:
> > > > > > > Sysfs Interfaces
> > > > > > > ================
> > > > > > > 
> > > > > > > * /sys/devices/system/memtier/memtierN/nodelist
> > > > > > > 
> > > > > > >    where N = 0, 1, 2 (the kernel supports only 3 tiers for now).
> > > > > > > 
> > > > > > >    Format: node_list
> > > > > > > 
> > > > > > >    Read-only.  When read, list the memory nodes in the specified tier.
> > > > > > > 
> > > > > > >    Tier 0 is the highest tier, while tier 2 is the lowest tier.
> > > > > > > 
> > > > > > >    The absolute value of a tier id number has no specific meaning.
> > > > > > >    What matters is the relative order of the tier id numbers.
> > > > > > > 
> > > > > > >    When a memory tier has no nodes, the kernel can hide its memtier
> > > > > > >    sysfs files.
> > > > > > > 
> > > > > > > * /sys/devices/system/node/nodeN/memtier
> > > > > > > 
> > > > > > >    where N = 0, 1, ...
> > > > > > > 
> > > > > > >    Format: int or empty
> > > > > > > 
> > > > > > >    When read, list the memory tier that the node belongs to.  Its value
> > > > > > >    is empty for a CPU-only NUMA node.
> > > > > > > 
> > > > > > >    When written, the kernel moves the node into the specified memory
> > > > > > >    tier if the move is allowed.  The tier assignment of all other nodes
> > > > > > >    are not affected.
> > > > > > > 
> > > > > > >    Initially, we can make this interface read-only.
> > > > > > 
> > > > > > It seems that "/sys/devices/system/node/nodeN/memtier" has all
> > > > > > information we needed.  Do we really need
> > > > > > "/sys/devices/system/memtier/memtierN/nodelist"?
> > > > > > 
> > > > > > That can be gotten via a simple shell command line,
> > > > > > 
> > > > > > $ grep . /sys/devices/system/node/nodeN/memtier | sort -n -k 2 -t ':'
> > > > > > 
> > > > > 
> > > > > It will be really useful to fetch the memory tier node list in an easy
> > > > > fashion rather than reading multiple sysfs directories. If we don't have
> > > > > other attributes for memorytier, we could keep
> > > > > "/sys/devices/system/memtier/memtierN" a NUMA node list there by
> > > > > avoiding /sys/devices/system/memtier/memtierN/nodelist
> > > > > 
> > > > > -aneesh
> > > > 
> > > > It is harder to implement memtierN as just a file and doesn't follow
> > > > the existing sysfs pattern, either.  Besides, it is extensible to have
> > > > memtierN as a directory.
> > > 
> > > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > > index 6248326f944d..251f38ec3816 100644
> > > --- a/drivers/base/node.c
> > > +++ b/drivers/base/node.c
> > > @@ -1097,12 +1097,49 @@ static struct attribute *node_state_attrs[] = {
> > >         NULL
> > >  };
> > > 
> > > +#define MAX_TIER 3
> > > +nodemask_t memory_tier[MAX_TIER];
> > > +
> > > +#define _TIER_ATTR_RO(name, tier_index)                                        \
> > > +       { __ATTR(name, 0444, show_tier, NULL), tier_index, NULL }
> > > +
> > > +struct memory_tier_attr {
> > > +       struct device_attribute attr;
> > > +       int tier_index;
> > > +       int (*write)(nodemask_t nodes);
> > > +};
> > > +
> > > +static ssize_t show_tier(struct device *dev,
> > > +                        struct device_attribute *attr, char *buf)
> > > +{
> > > +       struct memory_tier_attr *mt = container_of(attr, struct memory_tier_attr, attr);
> > > +
> > > +       return sysfs_emit(buf, "%*pbl\n",
> > > +                         nodemask_pr_args(&memory_tier[mt->tier_index]));
> > > +}
> > > +
> > >  static const struct attribute_group memory_root_attr_group = {
> > >         .attrs = node_state_attrs,
> > >  };
> > > 
> > > +
> > > +#define TOP_TIER 0
> > > +static struct memory_tier_attr memory_tiers[] = {
> > > +       [0] = _TIER_ATTR_RO(memory_top_tier, TOP_TIER),
> > > +};
> > > +
> > > +static struct attribute *memory_tier_attrs[] = {
> > > +       &memory_tiers[0].attr.attr,
> > > +       NULL
> > > +};
> > > +
> > > +static const struct attribute_group memory_tier_attr_group = {
> > > +       .attrs = memory_tier_attrs,
> > > +};
> > > +
> > >  static const struct attribute_group *cpu_root_attr_groups[] = {
> > >         &memory_root_attr_group,
> > > +       &memory_tier_attr_group,
> > >         NULL,
> > >  };
> > > 
> > > 
> > > As long as we have the ability to see the nodelist, I am good with the
> > > proposal.
> > > 
> > > -aneesh
> > 
> > I am OK with moving back the memory tier nodelist into node/.  When
> > there are more memory tier attributes needed, we can then create the
> > memory tier subtree and replace the tier nodelist in node/ with
> > symlinks.
> 
> What attributes do you imagine that we may put in memory_tierX/ sysfs
> directory?  If we have good candidates in mind, we may just do that. 
> What I can imagine now is "demote", like "memory_reclaim" in nodeX/ or
> node/ directory you proposed before.  Is it necessary to show something
> like "meminfo", "vmstat" there?

My words may be confusing, so let me say it in another way.

Just to brainstorm: if we have

  /sys/devices/system/memtier/memtierN/

What can we put in it in addition to "nodelist" or links to the nodes? 
For example,

  /sys/devices/system/memtier/memtierN/demote

When a page count is written to it, the specified number of pages will
be demoted from memtierN to memtierN+1, like the
/sys/devices/system/node/memory_reclaim interface you proposed before.
Or, is it necessary to add

  /sys/devices/system/memtier/memtierN/meminfo
  /sys/devices/system/memtier/memtierN/vmstat

I don't mean to propose these.  I just want to know whether there is a
requirement for this kind of interface, and what else may be required.
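
(Purely as an illustration of the "demote" idea, not a proposal: a
memtierN/demote store handler could look roughly like the sketch below.
demote_tier_pages() is a made-up helper that would demote up to
nr_pages from this tier to the next lower one.)

    static ssize_t demote_store(struct device *dev,
                                struct device_attribute *attr,
                                const char *buf, size_t count)
    {
            unsigned long nr_pages;
            int tier = dev->id;     /* assuming one device per memtier */
            int err;

            err = kstrtoul(buf, 10, &nr_pages);
            if (err)
                    return err;

            /* hypothetical helper, not existing kernel API */
            demote_tier_pages(tier, nr_pages);
            return count;
    }
    static DEVICE_ATTR_WO(demote);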

Best Regards,
Huang, Ying

> > 
> > So the revised sysfs interfaces are:
> > 
> > * /sys/devices/system/node/memory_tierN (read-only)
> > 
> >   where N = 0, 1, 2
> > 
> >   Format: node_list
> > 
> > * /sys/devices/system/node/nodeN/memory_tier (read/write)
> > 
> >   where N = 0, 1, ...
> > 
> >   Format: int or empty
> 



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-12  6:22 RFC: Memory Tiering Kernel Interfaces (v2) Wei Xu
  2022-05-12  7:03 ` ying.huang
  2022-05-12 15:00 ` Jonathan Cameron
@ 2022-05-13  3:25 ` ying.huang
  2022-05-13  6:36   ` Wei Xu
  2 siblings, 1 reply; 47+ messages in thread
From: ying.huang @ 2022-05-13  3:25 UTC (permalink / raw)
  To: Wei Xu, Andrew Morton, Greg Thelen, Aneesh Kumar K.V, Yang Shi,
	Linux Kernel Mailing List, Jagdish Gediya, Michal Hocko,
	Tim C Chen, Dave Hansen, Alistair Popple, Baolin Wang, Feng Tang,
	Jonathan Cameron, Davidlohr Bueso, Dan Williams, David Rientjes,
	Linux MM, Brice Goglin, Hesham Almatary

On Wed, 2022-05-11 at 23:22 -0700, Wei Xu wrote:
> 
> Memory Allocation for Demotion
> ==============================
> 
> To allocate a new page as the demotion target for a page, the kernel
> calls the allocation function (__alloc_pages_nodemask) with the
> source page node as the preferred node and the union of all lower
> tier nodes as the allowed nodemask.  The actual target node selection
> then follows the allocation fallback order that the kernel has
> already defined.
> 
> The pseudo code looks like:
> 
>     targets = NODE_MASK_NONE;
>     src_nid = page_to_nid(page);
>     src_tier = node_tier_map[src_nid];
>     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
>             nodes_or(targets, targets, memory_tiers[i]);
>     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
> 
> The memopolicy of cpuset, vma and owner task of the source page can
> be set to refine the demotion target nodemask, e.g. to prevent
> demotion or select a particular allowed node as the demotion target.

Consider a system with 3 tiers.  If we want to demote some pages from
tier 0, the desired behavior is:

- Allocate pages from tier 1.
- If there are not enough free pages in tier 1, wake up the kswapd of
tier 1 so that it demotes some pages from tier 1 to tier 2.
- If there are still not enough free pages in tier 1, allocate pages
from tier 2.

In this way, tier 0 will have the hottest pages, while tier 1 will have
the coldest pages.

With your proposed method, the behavior when demoting from tier 0 is:

- Allocate pages from tier 1.
- If there are not enough free pages in tier 1, allocate pages in tier 2.

The kswapd of tier 1 will not be woken up until there are not enough
free pages in tier 2.  For quite a long time, there will be little
hot/cold differentiation between tier 1 and tier 2.

This isn't hard to fix: just call __alloc_pages_nodemask() for each
tier one by one, following the page allocation fallback order.
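
A rough, untested sketch of that per-tier loop, reusing the
memory_tiers[] and node_tier_map[] arrays from the proposal's kernel
representation:

    static struct page *alloc_demote_page(struct page *page, gfp_t gfp,
                                          unsigned int order)
    {
            int src_nid = page_to_nid(page);
            int tier;

            /* Try the lower tiers one by one, nearest tier first. */
            for (tier = node_tier_map[src_nid] + 1;
                 tier < MAX_MEMORY_TIERS; tier++) {
                    struct page *new_page;

                    if (nodes_empty(memory_tiers[tier]))
                            continue;
                    new_page = __alloc_pages_nodemask(gfp, order, src_nid,
                                                      &memory_tiers[tier]);
                    if (new_page)
                            return new_page;
            }
            return NULL;
    }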

Best Regards,
Huang, Ying



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-13  3:25 ` ying.huang
@ 2022-05-13  6:36   ` Wei Xu
  2022-05-13  7:04     ` ying.huang
  0 siblings, 1 reply; 47+ messages in thread
From: Wei Xu @ 2022-05-13  6:36 UTC (permalink / raw)
  To: ying.huang
  Cc: Andrew Morton, Greg Thelen, Aneesh Kumar K.V, Yang Shi,
	Linux Kernel Mailing List, Jagdish Gediya, Michal Hocko,
	Tim C Chen, Dave Hansen, Alistair Popple, Baolin Wang, Feng Tang,
	Jonathan Cameron, Davidlohr Bueso, Dan Williams, David Rientjes,
	Linux MM, Brice Goglin, Hesham Almatary

On Thu, May 12, 2022 at 8:25 PM ying.huang@intel.com
<ying.huang@intel.com> wrote:
>
> On Wed, 2022-05-11 at 23:22 -0700, Wei Xu wrote:
> >
> > Memory Allocation for Demotion
> > ==============================
> >
> > To allocate a new page as the demotion target for a page, the kernel
> > calls the allocation function (__alloc_pages_nodemask) with the
> > source page node as the preferred node and the union of all lower
> > tier nodes as the allowed nodemask.  The actual target node selection
> > then follows the allocation fallback order that the kernel has
> > already defined.
> >
> > The pseudo code looks like:
> >
> >     targets = NODE_MASK_NONE;
> >     src_nid = page_to_nid(page);
> >     src_tier = node_tier_map[src_nid];
> >     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
> >             nodes_or(targets, targets, memory_tiers[i]);
> >     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
> >
> > The memopolicy of cpuset, vma and owner task of the source page can
> > be set to refine the demotion target nodemask, e.g. to prevent
> > demotion or select a particular allowed node as the demotion target.
>
> Consider a system with 3 tiers, if we want to demote some pages from
> tier 0, the desired behavior is,
>
> - Allocate pages from tier 1
> - If there's no enough free pages in tier 1, wakeup kswapd of tier 1 so
> demote some pages from tier 1 to tier 2
> - If there's still no enough free pages in tier 1, allocate pages from
> tier 2.
>
> In this way, tier 0 will have the hottest pages, while tier 1 will have
> the coldest pages.

When we are already in the allocation path for the demotion of a page
from tier 0, I think we'd better not block this allocation to wait for
kswapd to demote pages from tier 1 to tier 2. Instead, we should
directly allocate from tier 2.  Meanwhile, this demotion can wake up
kswapd to demote from tier 1 to tier 2 in the background.

> With your proposed method, the demoting from tier 0 behavior is,
>
> - Allocate pages from tier 1
> - If there's no enough free pages in tier 1, allocate pages in tier 2
>
> The kswapd of tier 1 will not be waken up until there's no enough free
> pages in tier 2.  In quite long time, there's no much hot/cold
> differentiation between tier 1 and tier 2.

This is true with the current allocation code. But I think we can make
some changes for demotion allocations. For example, we can add a
GFP_DEMOTE flag and update the allocation function to wake up kswapd
when this flag is set and we need to fall back to another node.
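
To sketch what that could mean (the flag, the helper and the zone
choice below are all made up for illustration, not existing kernel
code):

    #define __GFP_DEMOTE    ((__force gfp_t)0x8000000u)    /* made-up bit */

    /* Called once a demotion allocation has picked a node: if we had to
     * fall back past the preferred node, wake kswapd on the preferred
     * node so it demotes colder pages in the background. */
    static inline void demote_fallback_wake_kswapd(gfp_t gfp,
                                                   int preferred_nid,
                                                   struct page *new_page)
    {
            struct zone *zone;

            if (!(gfp & __GFP_DEMOTE) || !new_page)
                    return;
            if (page_to_nid(new_page) == preferred_nid)
                    return;
            /* simplified: assumes the preferred node has ZONE_NORMAL */
            zone = &NODE_DATA(preferred_nid)->node_zones[ZONE_NORMAL];
            wakeup_kswapd(zone, gfp, 0, ZONE_MOVABLE);
    }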

> This isn't hard to be fixed, just call __alloc_pages_nodemask() for each
> tier one by one considering page allocation fallback order.

That would have worked, except that there is an example earlier in
which some nodes actually prefer to demote to a node two tiers below
(tier + 2) rather than one tier below (tier + 1).

More specifically, the example is:

                 20
   Node 0 (DRAM) -- Node 1 (DRAM)
    |   |           |    |
    |   | 30    120 |    |
    |   v           v    | 100
100 |  Node 2 (PMEM)     |
    |    |               |
    |    | 100           |
     \   v               v
      -> Node 3 (Large Mem)

Node distances:
node   0    1    2    3
   0  10   20   30  100
   1  20   10  120  100
   2  30  120   10  100
   3 100  100  100   10

3 memory tiers are defined:
tier 0: 0-1
tier 1: 2
tier 2: 3

The demotion fallback order is:
node 0: 2, 3
node 1: 3, 2
node 2: 3
node 3: empty

Note that even though node 3 is in tier 2 and node 2 is in tier 1,
node 1 (tier 0) still prefers node 3 as its first demotion target, not
node 2, because the allocation fallback order is sorted by node
distance (100 from node 1 to node 3 vs. 120 to node 2).

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-13  2:52             ` ying.huang
@ 2022-05-13  7:00               ` Wei Xu
  2022-05-16  1:57                 ` ying.huang
  0 siblings, 1 reply; 47+ messages in thread
From: Wei Xu @ 2022-05-13  7:00 UTC (permalink / raw)
  To: ying.huang
  Cc: Aneesh Kumar K.V, Andrew Morton, Greg Thelen, Yang Shi,
	Linux Kernel Mailing List, Jagdish Gediya, Michal Hocko,
	Tim C Chen, Dave Hansen, Alistair Popple, Baolin Wang, Feng Tang,
	Jonathan Cameron, Davidlohr Bueso, Dan Williams, David Rientjes,
	Linux MM, Brice Goglin, Hesham Almatary

On Thu, May 12, 2022 at 7:53 PM ying.huang@intel.com
<ying.huang@intel.com> wrote:
>
> On Thu, 2022-05-12 at 16:37 +0800, ying.huang@intel.com wrote:
> > On Thu, 2022-05-12 at 01:15 -0700, Wei Xu wrote:
> > > On Thu, May 12, 2022 at 12:36 AM Aneesh Kumar K.V
> > > <aneesh.kumar@linux.ibm.com> wrote:
> > > >
> > > > Wei Xu <weixugc@google.com> writes:
> > > >
> > > > > On Thu, May 12, 2022 at 12:12 AM Aneesh Kumar K V
> > > > > <aneesh.kumar@linux.ibm.com> wrote:
> > > > > >
> > > > > > On 5/12/22 12:33 PM, ying.huang@intel.com wrote:
> > > > > > > On Wed, 2022-05-11 at 23:22 -0700, Wei Xu wrote:
> > > > > > > > Sysfs Interfaces
> > > > > > > > ================
> > > > > > > >
> > > > > > > > * /sys/devices/system/memtier/memtierN/nodelist
> > > > > > > >
> > > > > > > >    where N = 0, 1, 2 (the kernel supports only 3 tiers for now).
> > > > > > > >
> > > > > > > >    Format: node_list
> > > > > > > >
> > > > > > > >    Read-only.  When read, list the memory nodes in the specified tier.
> > > > > > > >
> > > > > > > >    Tier 0 is the highest tier, while tier 2 is the lowest tier.
> > > > > > > >
> > > > > > > >    The absolute value of a tier id number has no specific meaning.
> > > > > > > >    What matters is the relative order of the tier id numbers.
> > > > > > > >
> > > > > > > >    When a memory tier has no nodes, the kernel can hide its memtier
> > > > > > > >    sysfs files.
> > > > > > > >
> > > > > > > > * /sys/devices/system/node/nodeN/memtier
> > > > > > > >
> > > > > > > >    where N = 0, 1, ...
> > > > > > > >
> > > > > > > >    Format: int or empty
> > > > > > > >
> > > > > > > >    When read, list the memory tier that the node belongs to.  Its value
> > > > > > > >    is empty for a CPU-only NUMA node.
> > > > > > > >
> > > > > > > >    When written, the kernel moves the node into the specified memory
> > > > > > > >    tier if the move is allowed.  The tier assignment of all other nodes
> > > > > > > >    are not affected.
> > > > > > > >
> > > > > > > >    Initially, we can make this interface read-only.
> > > > > > >
> > > > > > > It seems that "/sys/devices/system/node/nodeN/memtier" has all
> > > > > > > information we needed.  Do we really need
> > > > > > > "/sys/devices/system/memtier/memtierN/nodelist"?
> > > > > > >
> > > > > > > That can be gotten via a simple shell command line,
> > > > > > >
> > > > > > > $ grep . /sys/devices/system/node/nodeN/memtier | sort -n -k 2 -t ':'
> > > > > > >
> > > > > >
> > > > > > It will be really useful to fetch the memory tier node list in an easy
> > > > > > fashion rather than reading multiple sysfs directories. If we don't have
> > > > > > other attributes for memorytier, we could keep
> > > > > > "/sys/devices/system/memtier/memtierN" a NUMA node list there by
> > > > > > avoiding /sys/devices/system/memtier/memtierN/nodelist
> > > > > >
> > > > > > -aneesh
> > > > >
> > > > > It is harder to implement memtierN as just a file and doesn't follow
> > > > > the existing sysfs pattern, either.  Besides, it is extensible to have
> > > > > memtierN as a directory.
> > > >
> > > > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > > > index 6248326f944d..251f38ec3816 100644
> > > > --- a/drivers/base/node.c
> > > > +++ b/drivers/base/node.c
> > > > @@ -1097,12 +1097,49 @@ static struct attribute *node_state_attrs[] = {
> > > >         NULL
> > > >  };
> > > >
> > > > +#define MAX_TIER 3
> > > > +nodemask_t memory_tier[MAX_TIER];
> > > > +
> > > > +#define _TIER_ATTR_RO(name, tier_index)                                        \
> > > > +       { __ATTR(name, 0444, show_tier, NULL), tier_index, NULL }
> > > > +
> > > > +struct memory_tier_attr {
> > > > +       struct device_attribute attr;
> > > > +       int tier_index;
> > > > +       int (*write)(nodemask_t nodes);
> > > > +};
> > > > +
> > > > +static ssize_t show_tier(struct device *dev,
> > > > +                        struct device_attribute *attr, char *buf)
> > > > +{
> > > > +       struct memory_tier_attr *mt = container_of(attr, struct memory_tier_attr, attr);
> > > > +
> > > > +       return sysfs_emit(buf, "%*pbl\n",
> > > > +                         nodemask_pr_args(&memory_tier[mt->tier_index]));
> > > > +}
> > > > +
> > > >  static const struct attribute_group memory_root_attr_group = {
> > > >         .attrs = node_state_attrs,
> > > >  };
> > > >
> > > > +
> > > > +#define TOP_TIER 0
> > > > +static struct memory_tier_attr memory_tiers[] = {
> > > > +       [0] = _TIER_ATTR_RO(memory_top_tier, TOP_TIER),
> > > > +};
> > > > +
> > > > +static struct attribute *memory_tier_attrs[] = {
> > > > +       &memory_tiers[0].attr.attr,
> > > > +       NULL
> > > > +};
> > > > +
> > > > +static const struct attribute_group memory_tier_attr_group = {
> > > > +       .attrs = memory_tier_attrs,
> > > > +};
> > > > +
> > > >  static const struct attribute_group *cpu_root_attr_groups[] = {
> > > >         &memory_root_attr_group,
> > > > +       &memory_tier_attr_group,
> > > >         NULL,
> > > >  };
> > > >
> > > >
> > > > As long as we have the ability to see the nodelist, I am good with the
> > > > proposal.
> > > >
> > > > -aneesh
> > >
> > > I am OK with moving back the memory tier nodelist into node/.  When
> > > there are more memory tier attributes needed, we can then create the
> > > memory tier subtree and replace the tier nodelist in node/ with
> > > symlinks.
> >
> > What attributes do you imagine that we may put in memory_tierX/ sysfs
> > directory?  If we have good candidates in mind, we may just do that.
> > What I can imagine now is "demote", like "memory_reclaim" in nodeX/ or
> > node/ directory you proposed before.  Is it necessary to show something
> > like "meminfo", "vmstat" there?
>
> My words may be confusing, so let me say it in another way.

I can understand. :)

> Just for brainstorm, if we have
>
>   /sys/devices/system/memtier/memtierN/
>
> What can we put in it in addition to "nodelist" or links to the nodes?
> For example,
>
>   /sys/devices/system/memtier/memtierN/demote
>
> When write a page number to it, the specified number of pages will be
> demoted from memtierN to memtierN+1, like the
> /sys/devices/system/node/memory_reclaim interface you proposed before.

"demote" might be fine to add there.  Just to clarify, we (Google)
currently don't yet have the need for an interface to do system-wide
demotion from one tier to another.  What we need is memory.demote
(similar to memory.reclaim) for memory cgroup based demotions.

Other things that might be added include tier-specific properties
(e.g. expected latency and bandwidth when available) and tier-specific
stats.

Under /sys/devices/system/memtier/, we may add global properties about
memory tiers, e.g. max number of tiers, min/max tier ids (which might
be useful if we hide unpopulated memory tiers).
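
For instance, a root-level attribute for the max tier id could be as
simple as the sketch below (illustrative only; registering it on a
memtier root device is omitted):

    static ssize_t max_tier_show(struct device *dev,
                                 struct device_attribute *attr, char *buf)
    {
            /* largest possible tier id under the current config */
            return sysfs_emit(buf, "%d\n", MAX_MEMORY_TIERS - 1);
    }
    static DEVICE_ATTR_RO(max_tier);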

> Or, is it necessary to add
>
>   /sys/devices/system/memtier/memtierN/meminfo
>   /sys/devices/system/memtier/memtierN/vmstat

The userspace can aggregate such data from node/nodeN/{meminfo,
vmstat} based on the memory tier nodelist. But I am not against adding
these files to memtierN/ for user convenience.
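
As a throwaway example of such userspace aggregation, the sketch below
sums MemFree across the nodes listed in a tier's nodelist file (pass
the nodelist path, e.g. memtierN/nodelist or node/memory_tierN, as
argv[1]):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static long node_memfree_kb(int nid)
    {
            char path[128], line[256];
            long kb = 0;
            int n;
            FILE *f;

            snprintf(path, sizeof(path),
                     "/sys/devices/system/node/node%d/meminfo", nid);
            f = fopen(path, "r");
            if (!f)
                    return 0;
            /* meminfo lines look like: "Node 0 MemFree:  12345 kB" */
            while (fgets(line, sizeof(line), f))
                    if (sscanf(line, "Node %d MemFree: %ld kB", &n, &kb) == 2)
                            break;
            fclose(f);
            return kb;
    }

    int main(int argc, char **argv)
    {
            char buf[256], *tok, *save;
            long total = 0;
            FILE *f;

            if (argc < 2 || !(f = fopen(argv[1], "r")) ||
                !fgets(buf, sizeof(buf), f))
                    return 1;
            /* nodelist format is comma-separated ranges, e.g. "0,2-3" */
            for (tok = strtok_r(buf, ",\n", &save); tok;
                 tok = strtok_r(NULL, ",\n", &save)) {
                    int lo, hi, nid;

                    if (sscanf(tok, "%d-%d", &lo, &hi) != 2)
                            lo = hi = atoi(tok);
                    for (nid = lo; nid <= hi; nid++)
                            total += node_memfree_kb(nid);
            }
            printf("MemFree: %ld kB\n", total);
            return 0;
    }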

> I don't mean to propose these.  Just want to know whether there's
> requirement for these kind of stuff?  And what else may be required.

This sounds good.  I think a memtier directory may eventually become a
necessity, though I don't feel too strongly about adding it right now.

> Best Regards,
> Huang, Ying
>
> > >
> > > So the revised sysfs interfaces are:
> > >
> > > * /sys/devices/system/node/memory_tierN (read-only)
> > >
> > >   where N = 0, 1, 2
> > >
> > >   Format: node_list
> > >
> > > * /sys/devices/system/node/nodeN/memory_tier (read/write)
> > >
> > >   where N = 0, 1, ...
> > >
> > >   Format: int or empty
> >
>
>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-13  6:36   ` Wei Xu
@ 2022-05-13  7:04     ` ying.huang
  2022-05-13  7:21       ` Wei Xu
  0 siblings, 1 reply; 47+ messages in thread
From: ying.huang @ 2022-05-13  7:04 UTC (permalink / raw)
  To: Wei Xu
  Cc: Andrew Morton, Greg Thelen, Aneesh Kumar K.V, Yang Shi,
	Linux Kernel Mailing List, Jagdish Gediya, Michal Hocko,
	Tim C Chen, Dave Hansen, Alistair Popple, Baolin Wang, Feng Tang,
	Jonathan Cameron, Davidlohr Bueso, Dan Williams, David Rientjes,
	Linux MM, Brice Goglin, Hesham Almatary

On Thu, 2022-05-12 at 23:36 -0700, Wei Xu wrote:
> On Thu, May 12, 2022 at 8:25 PM ying.huang@intel.com
> <ying.huang@intel.com> wrote:
> > 
> > On Wed, 2022-05-11 at 23:22 -0700, Wei Xu wrote:
> > > 
> > > Memory Allocation for Demotion
> > > ==============================
> > > 
> > > To allocate a new page as the demotion target for a page, the kernel
> > > calls the allocation function (__alloc_pages_nodemask) with the
> > > source page node as the preferred node and the union of all lower
> > > tier nodes as the allowed nodemask.  The actual target node selection
> > > then follows the allocation fallback order that the kernel has
> > > already defined.
> > > 
> > > The pseudo code looks like:
> > > 
> > >     targets = NODE_MASK_NONE;
> > >     src_nid = page_to_nid(page);
> > >     src_tier = node_tier_map[src_nid];
> > >     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
> > >             nodes_or(targets, targets, memory_tiers[i]);
> > >     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
> > > 
> > > The memopolicy of cpuset, vma and owner task of the source page can
> > > be set to refine the demotion target nodemask, e.g. to prevent
> > > demotion or select a particular allowed node as the demotion target.
> > 
> > Consider a system with 3 tiers, if we want to demote some pages from
> > tier 0, the desired behavior is,
> > 
> > - Allocate pages from tier 1
> > - If there's no enough free pages in tier 1, wakeup kswapd of tier 1 so
> > demote some pages from tier 1 to tier 2
> > - If there's still no enough free pages in tier 1, allocate pages from
> > tier 2.
> > 
> > In this way, tier 0 will have the hottest pages, while tier 1 will have
> > the coldest pages.
> 
> When we are already in the allocation path for the demotion of a page
> from tier 0, I think we'd better not block this allocation to wait for
> kswapd to demote pages from tier 1 to tier 2. Instead, we should
> directly allocate from tier 2.  Meanwhile, this demotion can wakeup
> kswapd to demote from tier 1 to tier 2 in the background.

Yes.  That's what I want too.  My original words may be misleading.

> > With your proposed method, the demoting from tier 0 behavior is,
> > 
> > - Allocate pages from tier 1
> > - If there's no enough free pages in tier 1, allocate pages in tier 2
> > 
> > The kswapd of tier 1 will not be waken up until there's no enough free
> > pages in tier 2.  In quite long time, there's no much hot/cold
> > differentiation between tier 1 and tier 2.
> 
> This is true with the current allocation code. But I think we can make
> some changes for demotion allocations. For example, we can add a
> GFP_DEMOTE flag and update the allocation function to wake up kswapd
> when this flag is set and we need to fall back to another node.
> 
> > This isn't hard to be fixed, just call __alloc_pages_nodemask() for each
> > tier one by one considering page allocation fallback order.
> 
> That would have worked, except that there is an example earlier, in
> which it is actually preferred for some nodes to demote to their tier
> + 2, not tier +1.
> 
> More specifically, the example is:
> 
>                  20
>    Node 0 (DRAM) -- Node 1 (DRAM)
>     |   |           |    |
>     |   | 30    120 |    |
>     |   v           v    | 100
> 100 |  Node 2 (PMEM)     |
>     |    |               |
>     |    | 100           |
>      \   v               v
>       -> Node 3 (Large Mem)
> 
> Node distances:
> node   0    1    2    3
>    0  10   20   30  100
>    1  20   10  120  100
>    2  30  120   10  100
>    3 100  100  100   10
> 
> 3 memory tiers are defined:
> tier 0: 0-1
> tier 1: 2
> tier 2: 3
> 
> The demotion fallback order is:
> node 0: 2, 3
> node 1: 3, 2
> node 2: 3
> node 3: empty
> 
> Note that even though node 3 is in tier 2 and node 2 is in tier 1,
> node 1 (tier 0) still prefers node 3 as its first demotion target, not
> node 2.

Yes.  I understand that we need to support this use case.  We can use
the tier order from the allocation fallback list instead of going from
the smallest tier number to the largest.  That is, for node 1, the tier
order for demotion is tier 2, then tier 1.
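
A sketch (untested; memory_tiers[]/node_tier_map[] are from the
proposal's kernel representation) of deriving that per-node tier order
from the allocation fallback list: record each lower tier the first
time one of its nodes shows up in the source node's zonelist.  For node
1 in the example, node 3 (tier 2) appears before node 2 (tier 1), so
the resulting order is tier 2, then tier 1.

    static int build_demotion_tier_order(int src_nid, int *tier_order)
    {
            struct zonelist *zonelist = node_zonelist(src_nid, GFP_KERNEL);
            int src_tier = node_tier_map[src_nid];
            bool seen[MAX_MEMORY_TIERS] = { };
            struct zoneref *z;
            struct zone *zone;
            int n = 0;

            for_each_zone_zonelist(zone, z, zonelist, MAX_NR_ZONES - 1) {
                    int tier = node_tier_map[zone_to_nid(zone)];

                    /* skip CPU-only nodes (-1) and tiers at or above src */
                    if (tier <= src_tier || seen[tier])
                            continue;
                    seen[tier] = true;
                    tier_order[n++] = tier;
            }
            return n;
    }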

Best Regards,
Huang, Ying



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-13  7:04     ` ying.huang
@ 2022-05-13  7:21       ` Wei Xu
  0 siblings, 0 replies; 47+ messages in thread
From: Wei Xu @ 2022-05-13  7:21 UTC (permalink / raw)
  To: ying.huang
  Cc: Andrew Morton, Greg Thelen, Aneesh Kumar K.V, Yang Shi,
	Linux Kernel Mailing List, Jagdish Gediya, Michal Hocko,
	Tim C Chen, Dave Hansen, Alistair Popple, Baolin Wang, Feng Tang,
	Jonathan Cameron, Davidlohr Bueso, Dan Williams, David Rientjes,
	Linux MM, Brice Goglin, Hesham Almatary

On Fri, May 13, 2022 at 12:04 AM ying.huang@intel.com
<ying.huang@intel.com> wrote:
>
> On Thu, 2022-05-12 at 23:36 -0700, Wei Xu wrote:
> > On Thu, May 12, 2022 at 8:25 PM ying.huang@intel.com
> > <ying.huang@intel.com> wrote:
> > >
> > > On Wed, 2022-05-11 at 23:22 -0700, Wei Xu wrote:
> > > >
> > > > Memory Allocation for Demotion
> > > > ==============================
> > > >
> > > > To allocate a new page as the demotion target for a page, the kernel
> > > > calls the allocation function (__alloc_pages_nodemask) with the
> > > > source page node as the preferred node and the union of all lower
> > > > tier nodes as the allowed nodemask.  The actual target node selection
> > > > then follows the allocation fallback order that the kernel has
> > > > already defined.
> > > >
> > > > The pseudo code looks like:
> > > >
> > > >     targets = NODE_MASK_NONE;
> > > >     src_nid = page_to_nid(page);
> > > >     src_tier = node_tier_map[src_nid];
> > > >     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
> > > >             nodes_or(targets, targets, memory_tiers[i]);
> > > >     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
> > > >
> > > > The memopolicy of cpuset, vma and owner task of the source page can
> > > > be set to refine the demotion target nodemask, e.g. to prevent
> > > > demotion or select a particular allowed node as the demotion target.
> > >
> > > Consider a system with 3 tiers, if we want to demote some pages from
> > > tier 0, the desired behavior is,
> > >
> > > - Allocate pages from tier 1
> > > - If there's no enough free pages in tier 1, wakeup kswapd of tier 1 so
> > > demote some pages from tier 1 to tier 2
> > > - If there's still no enough free pages in tier 1, allocate pages from
> > > tier 2.
> > >
> > > In this way, tier 0 will have the hottest pages, while tier 1 will have
> > > the coldest pages.
> >
> > When we are already in the allocation path for the demotion of a page
> > from tier 0, I think we'd better not block this allocation to wait for
> > kswapd to demote pages from tier 1 to tier 2. Instead, we should
> > directly allocate from tier 2.  Meanwhile, this demotion can wakeup
> > kswapd to demote from tier 1 to tier 2 in the background.
>
> Yes.  That's what I want too.  My original words may be misleading.
>
> > > With your proposed method, the demoting from tier 0 behavior is,
> > >
> > > - Allocate pages from tier 1
> > > - If there's no enough free pages in tier 1, allocate pages in tier 2
> > >
> > > The kswapd of tier 1 will not be waken up until there's no enough free
> > > pages in tier 2.  In quite long time, there's no much hot/cold
> > > differentiation between tier 1 and tier 2.
> >
> > This is true with the current allocation code. But I think we can make
> > some changes for demotion allocations. For example, we can add a
> > GFP_DEMOTE flag and update the allocation function to wake up kswapd
> > when this flag is set and we need to fall back to another node.
> >
> > > This isn't hard to be fixed, just call __alloc_pages_nodemask() for each
> > > tier one by one considering page allocation fallback order.
> >
> > That would have worked, except that there is an example earlier, in
> > which it is actually preferred for some nodes to demote to their tier
> > + 2, not tier +1.
> >
> > More specifically, the example is:
> >
> >                  20
> >    Node 0 (DRAM) -- Node 1 (DRAM)
> >     |   |           |    |
> >     |   | 30    120 |    |
> >     |   v           v    | 100
> > 100 |  Node 2 (PMEM)     |
> >     |    |               |
> >     |    | 100           |
> >      \   v               v
> >       -> Node 3 (Large Mem)
> >
> > Node distances:
> > node   0    1    2    3
> >    0  10   20   30  100
> >    1  20   10  120  100
> >    2  30  120   10  100
> >    3 100  100  100   10
> >
> > 3 memory tiers are defined:
> > tier 0: 0-1
> > tier 1: 2
> > tier 2: 3
> >
> > The demotion fallback order is:
> > node 0: 2, 3
> > node 1: 3, 2
> > node 2: 3
> > node 3: empty
> >
> > Note that even though node 3 is in tier 2 and node 2 is in tier 1,
> > node 1 (tier 0) still prefers node 3 as its first demotion target, not
> > node 2.
>
> Yes.  I understand that we need to support this use case.  We can use
> the tier order in allocation fallback list instead of from small to
> large.  That is, for node 1, the tier order for demotion is tier 2, tier
> 1.

That could work, too, though I feel it might be simpler and more
efficient (no repeated calls to __alloc_pages for the same allocation)
to modify __alloc_pages() itself.

Anyway, we can discuss this further when it comes to the
implementation of this demotion allocation function.  I believe it
should not affect the general memory tiering interfaces proposed here.

> Best Regards,
> Huang, Ying
>
>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-13  7:00               ` Wei Xu
@ 2022-05-16  1:57                 ` ying.huang
  0 siblings, 0 replies; 47+ messages in thread
From: ying.huang @ 2022-05-16  1:57 UTC (permalink / raw)
  To: Wei Xu
  Cc: Aneesh Kumar K.V, Andrew Morton, Greg Thelen, Yang Shi,
	Linux Kernel Mailing List, Jagdish Gediya, Michal Hocko,
	Tim C Chen, Dave Hansen, Alistair Popple, Baolin Wang, Feng Tang,
	Jonathan Cameron, Davidlohr Bueso, Dan Williams, David Rientjes,
	Linux MM, Brice Goglin, Hesham Almatary

On Fri, 2022-05-13 at 00:00 -0700, Wei Xu wrote:
> On Thu, May 12, 2022 at 7:53 PM ying.huang@intel.com
> <ying.huang@intel.com> wrote:
> > 
> > On Thu, 2022-05-12 at 16:37 +0800, ying.huang@intel.com wrote:
> > > On Thu, 2022-05-12 at 01:15 -0700, Wei Xu wrote:
> > > > On Thu, May 12, 2022 at 12:36 AM Aneesh Kumar K.V
> > > > <aneesh.kumar@linux.ibm.com> wrote:
> > > > > 
> > > > > Wei Xu <weixugc@google.com> writes:
> > > > > 
> > > > > > On Thu, May 12, 2022 at 12:12 AM Aneesh Kumar K V
> > > > > > <aneesh.kumar@linux.ibm.com> wrote:
> > > > > > > 
> > > > > > > On 5/12/22 12:33 PM, ying.huang@intel.com wrote:
> > > > > > > > On Wed, 2022-05-11 at 23:22 -0700, Wei Xu wrote:
> > > > > > > > > Sysfs Interfaces
> > > > > > > > > ================
> > > > > > > > > 
> > > > > > > > > * /sys/devices/system/memtier/memtierN/nodelist
> > > > > > > > > 
> > > > > > > > >    where N = 0, 1, 2 (the kernel supports only 3 tiers for now).
> > > > > > > > > 
> > > > > > > > >    Format: node_list
> > > > > > > > > 
> > > > > > > > >    Read-only.  When read, list the memory nodes in the specified tier.
> > > > > > > > > 
> > > > > > > > >    Tier 0 is the highest tier, while tier 2 is the lowest tier.
> > > > > > > > > 
> > > > > > > > >    The absolute value of a tier id number has no specific meaning.
> > > > > > > > >    What matters is the relative order of the tier id numbers.
> > > > > > > > > 
> > > > > > > > >    When a memory tier has no nodes, the kernel can hide its memtier
> > > > > > > > >    sysfs files.
> > > > > > > > > 
> > > > > > > > > * /sys/devices/system/node/nodeN/memtier
> > > > > > > > > 
> > > > > > > > >    where N = 0, 1, ...
> > > > > > > > > 
> > > > > > > > >    Format: int or empty
> > > > > > > > > 
> > > > > > > > >    When read, list the memory tier that the node belongs to.  Its value
> > > > > > > > >    is empty for a CPU-only NUMA node.
> > > > > > > > > 
> > > > > > > > >    When written, the kernel moves the node into the specified memory
> > > > > > > > >    tier if the move is allowed.  The tier assignment of all other nodes
> > > > > > > > >    are not affected.
> > > > > > > > > 
> > > > > > > > >    Initially, we can make this interface read-only.
> > > > > > > > 
> > > > > > > > It seems that "/sys/devices/system/node/nodeN/memtier" has all
> > > > > > > > information we needed.  Do we really need
> > > > > > > > "/sys/devices/system/memtier/memtierN/nodelist"?
> > > > > > > > 
> > > > > > > > That can be gotten via a simple shell command line,
> > > > > > > > 
> > > > > > > > $ grep . /sys/devices/system/node/nodeN/memtier | sort -n -k 2 -t ':'
> > > > > > > > 
> > > > > > > 
> > > > > > > It will be really useful to fetch the memory tier node list in an easy
> > > > > > > fashion rather than reading multiple sysfs directories. If we don't have
> > > > > > > other attributes for memorytier, we could keep
> > > > > > > "/sys/devices/system/memtier/memtierN" a NUMA node list there by
> > > > > > > avoiding /sys/devices/system/memtier/memtierN/nodelist
> > > > > > > 
> > > > > > > -aneesh
> > > > > > 
> > > > > > It is harder to implement memtierN as just a file and doesn't follow
> > > > > > the existing sysfs pattern, either.  Besides, it is extensible to have
> > > > > > memtierN as a directory.
> > > > > 
> > > > > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > > > > index 6248326f944d..251f38ec3816 100644
> > > > > --- a/drivers/base/node.c
> > > > > +++ b/drivers/base/node.c
> > > > > @@ -1097,12 +1097,49 @@ static struct attribute *node_state_attrs[] = {
> > > > >         NULL
> > > > >  };
> > > > > 
> > > > > +#define MAX_TIER 3
> > > > > +nodemask_t memory_tier[MAX_TIER];
> > > > > +
> > > > > +#define _TIER_ATTR_RO(name, tier_index)                                        \
> > > > > +       { __ATTR(name, 0444, show_tier, NULL), tier_index, NULL }
> > > > > +
> > > > > +struct memory_tier_attr {
> > > > > +       struct device_attribute attr;
> > > > > +       int tier_index;
> > > > > +       int (*write)(nodemask_t nodes);
> > > > > +};
> > > > > +
> > > > > +static ssize_t show_tier(struct device *dev,
> > > > > +                        struct device_attribute *attr, char *buf)
> > > > > +{
> > > > > +       struct memory_tier_attr *mt = container_of(attr, struct memory_tier_attr, attr);
> > > > > +
> > > > > +       return sysfs_emit(buf, "%*pbl\n",
> > > > > +                         nodemask_pr_args(&memory_tier[mt->tier_index]));
> > > > > +}
> > > > > +
> > > > >  static const struct attribute_group memory_root_attr_group = {
> > > > >         .attrs = node_state_attrs,
> > > > >  };
> > > > > 
> > > > > +
> > > > > +#define TOP_TIER 0
> > > > > +static struct memory_tier_attr memory_tiers[] = {
> > > > > +       [0] = _TIER_ATTR_RO(memory_top_tier, TOP_TIER),
> > > > > +};
> > > > > +
> > > > > +static struct attribute *memory_tier_attrs[] = {
> > > > > +       &memory_tiers[0].attr.attr,
> > > > > +       NULL
> > > > > +};
> > > > > +
> > > > > +static const struct attribute_group memory_tier_attr_group = {
> > > > > +       .attrs = memory_tier_attrs,
> > > > > +};
> > > > > +
> > > > >  static const struct attribute_group *cpu_root_attr_groups[] = {
> > > > >         &memory_root_attr_group,
> > > > > +       &memory_tier_attr_group,
> > > > >         NULL,
> > > > >  };
> > > > > 
> > > > > 
> > > > > As long as we have the ability to see the nodelist, I am good with the
> > > > > proposal.
> > > > > 
> > > > > -aneesh
> > > > 
> > > > I am OK with moving back the memory tier nodelist into node/.  When
> > > > there are more memory tier attributes needed, we can then create the
> > > > memory tier subtree and replace the tier nodelist in node/ with
> > > > symlinks.
> > > 
> > > What attributes do you imagine that we may put in memory_tierX/ sysfs
> > > directory?  If we have good candidates in mind, we may just do that.
> > > What I can imagine now is "demote", like "memory_reclaim" in nodeX/ or
> > > node/ directory you proposed before.  Is it necessary to show something
> > > like "meminfo", "vmstat" there?
> > 
> > My words may be confusing, so let me say it in another way.
> 
> I can understand. :)
> 
> > Just for brainstorm, if we have
> > 
> >   /sys/devices/system/memtier/memtierN/
> > 
> > What can we put in it in addition to "nodelist" or links to the nodes?
> > For example,
> > 
> >   /sys/devices/system/memtier/memtierN/demote
> > 
> > When write a page number to it, the specified number of pages will be
> > demoted from memtierN to memtierN+1, like the
> > /sys/devices/system/node/memory_reclaim interface you proposed before.
> 
> "demote" might be fine to add there.  Just to clarify, we (Google)
> currently don't yet have the need for an interface to do system-wide
> demotion from one tier to another.  What we need is memory.demote
> (similar to memory.reclaim) for memory cgroup based demotions.
> 
> Other things that might be added include tier-specific properties
> (e.g. expected latency and bandwidth when available) and tier-specific
> stats.
> 
> Under /sys/devices/system/memtier/, we may add global properties about
> memory tiers, e.g. max number of tiers, min/max tier ids (which might
> be useful if we hide unpopulated memory tiers).
> 
> > Or, is it necessary to add
> > 
> >   /sys/devices/system/memtier/memtierN/meminfo
> >   /sys/devices/system/memtier/memtierN/vmstat
> 
> The userspace can aggregate such data from node/nodeN/{meminfo,
> vmstat} based on the memory tier nodelist. But I am not against adding
> these files to memtierN/ for user convenience.
> 
> > I don't mean to propose these.  Just want to know whether there's
> > requirement for these kind of stuff?  And what else may be required.
> 
> This sounds good.  I think a memtier directory may eventually become a
> necessity, though I don't feel too strongly about adding it right now.

If a memtier directory may eventually become a necessity and we really
want a convenient nodelist somewhere, I'm OK with adding the memtier
directory now.

Best Regards,
Huang, Ying

> > Best Regards,
> > Huang, Ying
> > 
> > > > 
> > > > So the revised sysfs interfaces are:
> > > > 
> > > > * /sys/devices/system/node/memory_tierN (read-only)
> > > > 
> > > >   where N = 0, 1, 2
> > > > 
> > > >   Format: node_list
> > > > 
> > > > * /sys/devices/system/node/nodeN/memory_tier (read/write)
> > > > 
> > > >   where N = 0, 1, ...
> > > > 
> > > >   Format: int or empty
> > > 
> > 
> > 
> > 



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-12 15:00 ` Jonathan Cameron
@ 2022-05-18  7:09   ` Wei Xu
  2022-05-18 12:00     ` Jonathan Cameron
  2022-05-20  3:06     ` Ying Huang
  0 siblings, 2 replies; 47+ messages in thread
From: Wei Xu @ 2022-05-18  7:09 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Huang Ying, Andrew Morton, Greg Thelen, Aneesh Kumar K.V,
	Yang Shi, Linux Kernel Mailing List, Jagdish Gediya,
	Michal Hocko, Tim C Chen, Dave Hansen, Alistair Popple,
	Baolin Wang, Feng Tang, Davidlohr Bueso, Dan Williams,
	David Rientjes, Linux MM, Brice Goglin, Hesham Almatary

On Thu, May 12, 2022 at 8:00 AM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Wed, 11 May 2022 23:22:11 -0700
> Wei Xu <weixugc@google.com> wrote:
> > The current kernel has the basic memory tiering support: Inactive
> > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > tier NUMA node to make room for new allocations on the higher tier
> > NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> > migrated (promoted) to a higher tier NUMA node to improve the
> > performance.
> >
> > In the current kernel, memory tiers are defined implicitly via a
> > demotion path relationship between NUMA nodes, which is created during
> > the kernel initialization and updated when a NUMA node is hot-added or
> > hot-removed.  The current implementation puts all nodes with CPU into
> > the top tier, and builds the tier hierarchy tier-by-tier by establishing
> > the per-node demotion targets based on the distances between nodes.
> >
> > This current memory tier kernel interface needs to be improved for
> > several important use cases:
> >
> > * The current tier initialization code always initializes
> >   each memory-only NUMA node into a lower tier.  But a memory-only
> >   NUMA node may have a high performance memory device (e.g. a DRAM
> >   device attached via CXL.mem or a DRAM-backed memory-only node on
> >   a virtual machine) and should be put into a higher tier.
> >
> > * The current tier hierarchy always puts CPU nodes into the top
> >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> >   with CPUs are better to be placed into the next lower tier.
> >
> > * Also because the current tier hierarchy always puts CPU nodes
> >   into the top tier, when a CPU is hot-added (or hot-removed) and
> >   triggers a memory node from CPU-less into a CPU node (or vice
> >   versa), the memory tier hierarchy gets changed, even though no
> >   memory node is added or removed.  This can make the tier
> >   hierarchy unstable and make it difficult to support tier-based
> >   memory accounting.
> >
> > * A higher tier node can only be demoted to selected nodes on the
> >   next lower tier as defined by the demotion path, not any other
> >   node from any lower tier.  This strict, hard-coded demotion order
> >   does not work in all use cases (e.g. some use cases may want to
> >   allow cross-socket demotion to another node in the same demotion
> >   tier as a fallback when the preferred demotion node is out of
> >   space), and has resulted in the feature request for an interface to
> >   override the system-wide, per-node demotion order from the
> >   userspace.  This demotion order is also inconsistent with the page
> >   allocation fallback order when all the nodes in a higher tier are
> >   out of space: The page allocation can fall back to any node from
> >   any lower tier, whereas the demotion order doesn't allow that.
> >
> > * There are no interfaces for the userspace to learn about the memory
> >   tier hierarchy in order to optimize its memory allocations.
> >
> > I'd like to propose revised memory tier kernel interfaces based on
> > the discussions in the threads:
> >
> > - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> > - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
> > - https://lore.kernel.org/linux-mm/867bc216386eb6cbf54648f23e5825830f5b922e.camel@intel.com/T/
> >
> >
> > High-level Design Ideas
> > =======================
> >
> > * Define memory tiers explicitly, not implicitly.
> >
> > * Memory tiers are defined based on hardware capabilities of memory
> >   nodes, not their relative node distances between each other.
> >
> > * The tier assignment of each node is independent from each other.
> >   Moving a node from one tier to another tier doesn't affect the tier
> >   assignment of any other node.
> >
> > * The node-tier association is stable. A node can be reassigned to a
> >   different tier only under the specific conditions that don't block
> >   future tier-based memory cgroup accounting.
> >
> > * A node can demote its pages to any nodes of any lower tiers. The
> >   demotion target node selection follows the allocation fallback order
> >   of the source node, which is built based on node distances.  The
> >   demotion targets are also restricted to only the nodes from the tiers
> >   lower than the source node.  We no longer need to maintain a separate
> >   per-node demotion order (node_demotion[]).
> >
>
> Hi Wei,
>
> This proposal looks good to me, though we'll be having fun
> white boarding topologies from our roadmaps for the next few days :)

That's good to hear.

> A few comments inline. It also seems likely to me that there is little
> benefit in starting with 3 tiers as the maximum.  Seems unlikely the
> code will be substantially simpler for 3 than it would be for 4 or 5.
> I've drawn out one simple case that needs 4 to do sensible things.

We can make the number of tiers a config option. 3 tiers are just what
the kernel can reasonably initialize when there isn't enough hardware
performance information from the firmware.

> >
> > Sysfs Interfaces
> > ================
> >
> > * /sys/devices/system/memtier/memtierN/nodelist
> >
> >   where N = 0, 1, 2 (the kernel supports only 3 tiers for now).
> >
> >   Format: node_list
> >
> >   Read-only.  When read, list the memory nodes in the specified tier.
> >
> >   Tier 0 is the highest tier, while tier 2 is the lowest tier.
> >
> >   The absolute value of a tier id number has no specific meaning.
> >   What matters is the relative order of the tier id numbers.
> >
> >   When a memory tier has no nodes, the kernel can hide its memtier
> >   sysfs files.
> >
> > * /sys/devices/system/node/nodeN/memtier
> >
> >   where N = 0, 1, ...
> >
> >   Format: int or empty
> >
> >   When read, list the memory tier that the node belongs to.  Its value
> >   is empty for a CPU-only NUMA node.
> >
> >   When written, the kernel moves the node into the specified memory
> >   tier if the move is allowed.  The tier assignment of all other nodes
> >   are not affected.
> >
> >   Initially, we can make this interface read-only.
> >
> >
> > Kernel Representation
> > =====================
> >
> > * All memory tiering code is guarded by CONFIG_TIERED_MEMORY.
> >
> > * #define MAX_MEMORY_TIERS 3
> >
> >   Support 3 memory tiers for now.
> >
> > * #define MEMORY_DEFAULT_TIER 1
> >
> >   The default tier that a memory node is assigned to.
> >
> > * nodemask_t memory_tiers[MAX_MEMORY_TIERS]
> >
> >   Store memory nodes by tiers.
> >
> > * int node_tier_map[MAX_NUMNODES]
> >
> >   Map a node to its tier.
> >
> >   For each CPU-only node c, node_tier_map[c] = -1.
> >
> >
> > Memory Tier Initialization
> > ==========================
> >
> > By default, all memory nodes are assigned to the default tier
> > (MEMORY_DEFAULT_TIER).
>
> This is tighter than it needs to be.  In many cases we can easily
> establish if there is any possibility of CPU being hotplugged into
> a memory node.  If it's CXL attached no way CPUs are going to be
> turning up there later :)  If CPU HP into a given node can't happen
> we can be more flexible and I think that often results in better decisions.
> See example below, though obviously I could just use the userspace
> interface to fix that up anyway or have a CXL driver move it around
> if that's relevant.  In some other cases I'm fairly sure we know in
> advance where CPUs can be added but I'd need to check all the
> relevant specs to be sure there aren't any corner cases.  I 'think'
> for ARM for example we know where all possible CPUs can be hotplugged
> (constraint coming from the interrupt controller + the fact that only
> virtual CPU HP is defined).

We may not always want to put a CXL-attached memory device into a
slower tier: even though CXL does add some latency, the memory device
behind it can still be very capable and may not be much slower (if at
all) than the on-board DRAM (e.g. DRAM on a remote CPU socket).

Also, the default tier here is just the initial tier assignment of
each node, which behaves as if there were no tiering.  A tiering
kernel init function can certainly reassign the tier for each node if
it knows enough about the hardware performance for these nodes from
the firmware.
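
For concreteness, a sketch of the kind of reassignment hook meant here
(the function name is made up; locking and hotplug interactions are
omitted):

    static void memtier_reassign_node(int nid, int new_tier)
    {
            int old_tier = node_tier_map[nid];

            if (new_tier == old_tier)
                    return;
            if (old_tier >= 0)
                    node_clear(nid, memory_tiers[old_tier]);
            node_set(nid, memory_tiers[new_tier]);
            node_tier_map[nid] = new_tier;
    }

A driver that knows its memory-only node is fast (e.g. CXL-attached
DRAM) could then call memtier_reassign_node(nid, 0) instead of leaving
the node in the default tier.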

> >
> > A device driver can move up or down its memory nodes from the default
> > tier.  For example, PMEM can move down its memory nodes below the
> > default tier, whereas GPU can move up its memory nodes above the
> > default tier.
> >
> > The kernel initialization code makes the decision on which exact tier
> > a memory node should be assigned to based on the requests from the
> > device drivers as well as the memory device hardware information
> > provided by the firmware.
> >
> >
> > Memory Tier Reassignment
> > ========================
> >
> > After a memory node is hot-removed, it can be hot-added back to a
> > different memory tier.  This is useful for supporting dynamically
> > provisioned CXL.mem NUMA nodes, which may connect to different
> > memory devices across hot-plug events.  Such tier changes should
> > be compatible with tier-based memory accounting.
> >
> > The userspace may also reassign an existing online memory node to a
> > different tier.  However, this should only be allowed when no pages
> > are allocated from the memory node or when there are no non-root
> > memory cgroups (e.g. during the system boot).  This restriction is
> > important for keeping memory tier hierarchy stable enough for
> > tier-based memory cgroup accounting.
> >
> > Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
> >
> >
> > Memory Allocation for Demotion
> > ==============================
> >
> > To allocate a new page as the demotion target for a page, the kernel
> > calls the allocation function (__alloc_pages_nodemask) with the
> > source page node as the preferred node and the union of all lower
> > tier nodes as the allowed nodemask.  The actual target node selection
> > then follows the allocation fallback order that the kernel has
> > already defined.
> >
> > The pseudo code looks like:
> >
> >     targets = NODE_MASK_NONE;
> >     src_nid = page_to_nid(page);
> >     src_tier = node_tier_map[src_nid];
> >     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
> >             nodes_or(targets, targets, memory_tiers[i]);
> >     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
> >
> > The memopolicy of cpuset, vma and owner task of the source page can
> > be set to refine the demotion target nodemask, e.g. to prevent
> > demotion or select a particular allowed node as the demotion target.
> >
> >
> > Memory Allocation for Promotion
> > ===============================
> >
> > The page allocation for promotion is similar to demotion, except that (1)
> > the target nodemask uses the promotion tiers, (2) the preferred node can
> > be the accessing CPU node, not the source page node.
> >
> >
> > Examples
> > ========
> >
>
> ...
>
> > * Example 3:
> >
> > Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
>
> Node2 is drawn as pmem.

Typo. Good catch.

> >
> > All nodes are in the same tier.
> >
> >                   20
> >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> >          \                 /
> >           \ 30            / 30
> >            \             /
> >              Node 2 (PMEM)
> >
> > node distances:
> > node   0    1    2
> >    0  10   20   30
> >    1  20   10   30
> >    2  30   30   10
> >
> > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > <empty>
> > 0-2
> > <empty>
> >
> > $ cat /sys/devices/system/node/node*/memtier
> > 1
> > 1
> > 1
> >
> > Demotion fallback order:
> > node 0: empty
> > node 1: empty
> > node 2: empty
> >
> >
> > * Example 4:
> >
> > Node 0 is a DRAM node with CPU.
> > Node 1 is a PMEM node.
> > Node 2 is a GPU node.
> >
> >                   50
> >   Node 0 (DRAM)  ----  Node 2 (GPU)
> >          \                 /
> >           \ 30            / 60
> >            \             /
> >              Node 1 (PMEM)
> >
> > node distances:
> > node   0    1    2
> >    0  10   30   50
> >    1  30   10   60
> >    2  50   60   10
> >
> > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > 2
> > 0
> > 1
> >
> > $ cat /sys/devices/system/node/node*/memtier
> > 1
> > 2
> > 0
> >
> > Demotion fallback order:
> > node 0: 1
> > node 1: empty
> > node 2: 0, 1
> >
> >
> > * Example 5:
> >
> > Node 0 is a DRAM node with CPU.
> > Node 1 is a GPU node.
> > Node 2 is a PMEM node.
> > Node 3 is a large, slow DRAM node without CPU.
> >
> >
> >      Node 2 (PMEM)  ----
> >    /      |              \
> >   /       | 30            \ 120
> >  |        |         100    \
> >  |   Node 0 (DRAM)  ----  Node 1 (GPU)
> >   \         \                 /
> >     \        \ 40            / 110
> >   80  \       \             /
> >         ---  Node 3 (Slow DRAM)
>
> This is close but not quite what was intended for Hesham's
> example... (note we just checked that Hesham's original node0-1
> timing didn't make any sense.).
>

This was inspired by Hesham's example. But I should have also included
the version that illustrates the need to skip a tier when demoting
from certain nodes.

> >
> > node distances:
> > node    0    1    2    3
> >    0   10  100   30   40
> >    1  100   10  120  110
> >    2   30  120   10   80
> >    3   40  110   80   10
> >
> > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > 1
> > 0,3
> > 2
> >
> > $ cat /sys/devices/system/node/node*/memtier
> > 1
> > 0
> > 2
> > 1
> >
> > Demotion fallback order:
> > node 0: 2
> > node 1: 0, 3, 2
> > node 2: empty
> > node 3: 2
>
> This is close but not quite the same as the example
> Hesham gave (note the node 1 to 0 timing in the table
> with that example didn't make sense).  I added another
> level of switching to make the numbers more obviously
> different and show how critical it might be.
>
> * Example 6:
>
> Node 0 is a DRAM node with CPU.
> Node 1 is a GPU node.
> Node 2 is a PMEM node.
> Node 3 is an extremely large, DRAM node without CPU.
>   (Key point here being that it probably never makes sense
>    to demote to anywhere else from this memory).
>
>
> I've redone the timings wrt to example 5.
> Basis for this is 0 and 2 are directly connected
> via controllers in an SoC. 1 and 3 are connected
> via a common switch one switch further down
> (each hop via this is 100)
> All drams cost 10 once you've reached correct node
> and pmem costs 30 from SoC.
> Numbers get too large as a result but meh, I'm making
> a point not providing real numbers :)
>
>          PMEM Node 2
>             |(30)
>         CPU + DRAM Node0
>             |(100)
>          Switch 1
>             |(100)
>           Switch 2
>     (100)  |      |(100)
> Node 1 GPU     Node3 Large memory.
>
>
> With one level of switching:
>
>      Node 2 (PMEM)  ----
>     /      |              \
>    /       | 30            \ 330
>   |        |         310    \
>   |   Node 0 (DRAM)  ----  Node 1 (GPU)
>    \         \                 /
>      \        \ 310           / 210
>    330 \       \             /
>          ---  Node 3 (Extremely large DRAM)
>
> To my mind, we should potentially also take into account
> the fact that Node3 can be known to never contain CPUs
> (in at least some architectures we know where the CPUs
>  might be added later, they can't just magically turn up
>  anywhere in the topology).
>
> node distances:
> node    0    1    2    3
>     0   10   310  30   310
>     1   310  10   330  210
>     2   30   330  10   330
>     3   310  210  330   10
>
> So, my ideal would treat node 3 differently from other DRAM nodes
> as we know it can't have CPUs. Trying to come up with an
> always-correct order for nodes 3 and 2 is tricky as it to a certain
> extent depends on capacity. If node 2 was big enough to take
> any demotion from node 0 and still have lots of room, then demoting
> there from node 3 would make sense, and vice versa.
>
>
>  $ cat /sys/devices/system/memtier/memtier*/nodelist
>  1
>  0
>  2
>  3
>
>
>  $ cat /sys/devices/system/node/node*/memtier
>   1
>   0
>   2
>   3
>
>  Demotion fallback order:
>  node 0: 2, 3
>  node 1: 3, 0, 2 (key being we will almost always have less pressure on node 3)
>  node 2: 3
>  node 3: empty
>
> Or, as Hesham just pointed out, this can be done with 3 tiers:
> we can put the GPU and CPU in the same tier because
> there is little reason to demote from one to the other.

Thank you for the example.  It makes sense to me to have node 3 on its
own tier.  We can have either 3 tiers or 4 tiers in total (assuming
that the max number of tiers is a config option).

> We are also a bit worried about ABI backwards compatibility because
> of potential need to make more space in tiers lower in number than
> CPU attached DDR. I rather liked the negative proposal with
> default as 0 that Huang, Ying made.

It is hard to have negative values as the device IDs.

The current proposal equates the tier device ID with the tier hierarchy
level, which makes the interface simpler, but less flexible.  How
about the following proposal (which decouples the tier device ID from
the tier level)?

/sys/devices/system/memtier/memtierN/nodelist
/sys/devices/system/memtier/memtierN/rank

Each memory tier N has two sysfs files:
- nodelist: the nodes that are in this tier
- rank: an opaque value that helps decide the level at which this tier
is in the tier hierarchy (smaller value means faster tier)

The tier hierarchy is determined by "rank", not by the device id
number N from "memtierN".

The absolute value of "rank" of a memtier doesn't necessarily carry
any meaning. Its value relative to other memtiers decides the level of
this memtier in the tier hierarchy.

The CPU-attached DRAM nodes are always in memtier0 (the device ID),
but memtier0 may not always be the top-tier, e.g. its level can be 3
in a 5-tier system.
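
To make this concrete, here is a rough kernel-side sketch (the struct
layout and names below are purely for illustration and are not an
existing implementation):

    struct memory_tier {
            struct device dev;      /* the memtierN device; N is a stable, opaque ID */
            int rank;               /* relative position; smaller rank = faster tier */
            nodemask_t nodelist;    /* memory nodes currently assigned to this tier */
            struct list_head list;  /* tiers kept sorted by rank, not by device ID */
    };

    /* Order two tiers: only rank matters, the device ID plays no role. */
    static int memtier_cmp(struct memory_tier *a, struct memory_tier *b)
    {
            return a->rank - b->rank;
    }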

For the above example (example 6), we can have:

$ ls /sys/devices/system/memtier
memtier0
memtier1
memtier2
memtier128

$ cat /sys/devices/system/memtier/memtier*/rank
50
60
70
10

The tier order: memtier128 -> memtier0 -> memtier1 -> memtier2

$ cat /sys/devices/system/memtier/memtier*/nodelist
0
2
3
1

$ ls -l /sys/devices/system/node/node*/memtier
/sys/devices/system/node/node0/memtier -> /sys/devices/system/memtier/memtier0
/sys/devices/system/node/node1/memtier -> /sys/devices/system/memtier/memtier128
/sys/devices/system/node/node2/memtier -> /sys/devices/system/memtier/memtier1
/sys/devices/system/node/node3/memtier -> /sys/devices/system/memtier/memtier2
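
As a purely illustrative usage sketch (assuming the proposed sysfs
layout above; none of these files exist today), userspace could
resolve a node's tier and its rank roughly like this:

    /* Print the memtier and rank of a given NUMA node (illustration only). */
    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
            char link[PATH_MAX], tier[PATH_MAX], rank_path[PATH_MAX];
            int node = argc > 1 ? atoi(argv[1]) : 0;
            int rank;
            FILE *f;

            /* nodeN/memtier is a symlink into the memtier device directory. */
            snprintf(link, sizeof(link),
                     "/sys/devices/system/node/node%d/memtier", node);
            if (!realpath(link, tier)) {
                    perror("realpath");
                    return 1;
            }

            /* The rank file sits inside the memtierM directory. */
            snprintf(rank_path, sizeof(rank_path), "%s/rank", tier);
            f = fopen(rank_path, "r");
            if (!f || fscanf(f, "%d", &rank) != 1) {
                    perror("rank");
                    return 1;
            }
            fclose(f);

            printf("node%d -> %s (rank %d)\n",
                   node, strrchr(tier, '/') + 1, rank);
            return 0;
    }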

To override the memory tier of a node, we can use a new, write-only,
per-node interface file:

/sys/devices/system/node/nodeN/set_memtier

e.g.

$ echo "memtier128" > sys/devices/system/node/node1/set_memtier

Any comments?

> Jonathan
>
>
>
>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-18  7:09   ` Wei Xu
@ 2022-05-18 12:00     ` Jonathan Cameron
  2022-05-24  7:36       ` Wei Xu
  2022-05-20  3:06     ` Ying Huang
  1 sibling, 1 reply; 47+ messages in thread
From: Jonathan Cameron @ 2022-05-18 12:00 UTC (permalink / raw)
  To: Wei Xu, Dave Hansen, Alistair Popple
  Cc: Huang Ying, Andrew Morton, Greg Thelen, Aneesh Kumar K.V,
	Yang Shi, Linux Kernel Mailing List, Jagdish Gediya,
	Michal Hocko, Tim C Chen, Baolin Wang, Feng Tang,
	Davidlohr Bueso, Dan Williams, David Rientjes, Linux MM,
	Brice Goglin, Hesham Almatary

On Wed, 18 May 2022 00:09:48 -0700
Wei Xu <weixugc@google.com> wrote:

> On Thu, May 12, 2022 at 8:00 AM Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
> >
> > On Wed, 11 May 2022 23:22:11 -0700
> > Wei Xu <weixugc@google.com> wrote:  
> > > The current kernel has the basic memory tiering support: Inactive
> > > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > > tier NUMA node to make room for new allocations on the higher tier
> > > NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> > > migrated (promoted) to a higher tier NUMA node to improve the
> > > performance.
> > >
> > > In the current kernel, memory tiers are defined implicitly via a
> > > demotion path relationship between NUMA nodes, which is created during
> > > the kernel initialization and updated when a NUMA node is hot-added or
> > > hot-removed.  The current implementation puts all nodes with CPU into
> > > the top tier, and builds the tier hierarchy tier-by-tier by establishing
> > > the per-node demotion targets based on the distances between nodes.
> > >
> > > This current memory tier kernel interface needs to be improved for
> > > several important use cases:
> > >
> > > * The current tier initialization code always initializes
> > >   each memory-only NUMA node into a lower tier.  But a memory-only
> > >   NUMA node may have a high performance memory device (e.g. a DRAM
> > >   device attached via CXL.mem or a DRAM-backed memory-only node on
> > >   a virtual machine) and should be put into a higher tier.
> > >
> > > * The current tier hierarchy always puts CPU nodes into the top
> > >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> > >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> > >   with CPUs are better to be placed into the next lower tier.
> > >
> > > * Also because the current tier hierarchy always puts CPU nodes
> > >   into the top tier, when a CPU is hot-added (or hot-removed) and
> > >   triggers a memory node from CPU-less into a CPU node (or vice
> > >   versa), the memory tier hierarchy gets changed, even though no
> > >   memory node is added or removed.  This can make the tier
> > >   hierarchy unstable and make it difficult to support tier-based
> > >   memory accounting.
> > >
> > > * A higher tier node can only be demoted to selected nodes on the
> > >   next lower tier as defined by the demotion path, not any other
> > >   node from any lower tier.  This strict, hard-coded demotion order
> > >   does not work in all use cases (e.g. some use cases may want to
> > >   allow cross-socket demotion to another node in the same demotion
> > >   tier as a fallback when the preferred demotion node is out of
> > >   space), and has resulted in the feature request for an interface to
> > >   override the system-wide, per-node demotion order from the
> > >   userspace.  This demotion order is also inconsistent with the page
> > >   allocation fallback order when all the nodes in a higher tier are
> > >   out of space: The page allocation can fall back to any node from
> > >   any lower tier, whereas the demotion order doesn't allow that.
> > >
> > > * There are no interfaces for the userspace to learn about the memory
> > >   tier hierarchy in order to optimize its memory allocations.
> > >
> > > I'd like to propose revised memory tier kernel interfaces based on
> > > the discussions in the threads:
> > >
> > > - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> > > - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
> > > - https://lore.kernel.org/linux-mm/867bc216386eb6cbf54648f23e5825830f5b922e.camel@intel.com/T/
> > >
> > >
> > > High-level Design Ideas
> > > =======================
> > >
> > > * Define memory tiers explicitly, not implicitly.
> > >
> > > * Memory tiers are defined based on hardware capabilities of memory
> > >   nodes, not their relative node distances between each other.
> > >
> > > * The tier assignment of each node is independent from each other.
> > >   Moving a node from one tier to another tier doesn't affect the tier
> > >   assignment of any other node.
> > >
> > > * The node-tier association is stable. A node can be reassigned to a
> > >   different tier only under the specific conditions that don't block
> > >   future tier-based memory cgroup accounting.
> > >
> > > * A node can demote its pages to any nodes of any lower tiers. The
> > >   demotion target node selection follows the allocation fallback order
> > >   of the source node, which is built based on node distances.  The
> > >   demotion targets are also restricted to only the nodes from the tiers
> > >   lower than the source node.  We no longer need to maintain a separate
> > >   per-node demotion order (node_demotion[]).
> > >  
> >
> > Hi Wei,
> >
> > This proposal looks good to me, though we'll be having fun
> > white boarding topologies from our roadmaps for the next few days :)  
> 
> That's good to hear.
> 
> > A few comments inline. It also seems likely to me that there is little
> > benefit in starting with 3 tiers as the maximum.  Seems unlikely the
> > code will be substantially simpler for 3 than it would be for 4 or 5.
> > I've drawn out one simple case that needs 4 to do sensible things.  
> 
> We can make the number of tiers a config option. 3 tiers are just what
> the kernel can reasonably initialize when there isn't enough hardware
> performance information from the firmware. 
Now I think your rank solution below solves the following (but I wrote
it before reading that part properly :) ...

One issue with a config option is not breaking the ABI if some distro
changes that option or we change a default value in the future.
It may take some care.

Imagine that today we think 3 tiers is fine and default to tier 1 for DDR.
Someone writes a script to say their special device-attached memory must
be in tier 1 as well, on the assumption that it is the same tier as DDR
(a policy decision).
Later we decide to move the default DDR to tier 2 because we have
lots of hardware platforms where it makes sense to have multiple
faster tiers. 

Their policy script now puts some memory in a tier that doesn't have
the same relationship to the default DDR tier.

If we define a 'default_node' or similar sysfs file in memtier
as a read-only report of what the kernel is defaulting to, we can
at least argue they should have read it (no way of actually making
them do so, though :(


> 
> > >
> > > Sysfs Interfaces
> > > ================
> > >
> > > * /sys/devices/system/memtier/memtierN/nodelist
> > >
> > >   where N = 0, 1, 2 (the kernel supports only 3 tiers for now).
> > >
> > >   Format: node_list
> > >
> > >   Read-only.  When read, list the memory nodes in the specified tier.
> > >
> > >   Tier 0 is the highest tier, while tier 2 is the lowest tier.
> > >
> > >   The absolute value of a tier id number has no specific meaning.
> > >   What matters is the relative order of the tier id numbers.
> > >
> > >   When a memory tier has no nodes, the kernel can hide its memtier
> > >   sysfs files.
> > >
> > > * /sys/devices/system/node/nodeN/memtier
> > >
> > >   where N = 0, 1, ...
> > >
> > >   Format: int or empty
> > >
> > >   When read, list the memory tier that the node belongs to.  Its value
> > >   is empty for a CPU-only NUMA node.
> > >
> > >   When written, the kernel moves the node into the specified memory
> > >   tier if the move is allowed.  The tier assignment of all other nodes
> > >   are not affected.
> > >
> > >   Initially, we can make this interface read-only.
> > >
> > >
> > > Kernel Representation
> > > =====================
> > >
> > > * All memory tiering code is guarded by CONFIG_TIERED_MEMORY.
> > >
> > > * #define MAX_MEMORY_TIERS 3
> > >
> > >   Support 3 memory tiers for now.
> > >
> > > * #define MEMORY_DEFAULT_TIER 1
> > >
> > >   The default tier that a memory node is assigned to.
> > >
> > > * nodemask_t memory_tiers[MAX_MEMORY_TIERS]
> > >
> > >   Store memory nodes by tiers.
> > >
> > > * int node_tier_map[MAX_NUMNODES]
> > >
> > >   Map a node to its tier.
> > >
> > >   For each CPU-only node c, node_tier_map[c] = -1.
> > >
> > >
> > > Memory Tier Initialization
> > > ==========================
> > >
> > > By default, all memory nodes are assigned to the default tier
> > > (MEMORY_DEFAULT_TIER).  
> >
> > This is tighter than it needs to be.  In many cases we can easily
> > establish if there is any possibility of CPU being hotplugged into
> > a memory node.  If it's CXL attached no way CPUs are going to be
> > turning up there later :)  If CPU HP into a given node can't happen
> > we can be more flexible and I think that often results in better decisions.
> > See example below, though obviously I could just use the userspace
> > interface to fix that up anyway or have a CXL driver move it around
> > if that's relevant.  In some other cases I'm fairly sure we know in
> > advance where CPUs can be added but I'd need to check all the
> > relevant specs to be sure there aren't any corner cases.  I 'think'
> > for ARM for example we know where all possible CPUs can be hotplugged
> > (constraint coming from the interrupt controller + the fact that only
> > virtual CPU HP is defined).  
> 
> We may not always want to put a CXL-attached memory device into a
> slower tier because even though CXL does add some additional latency,
> both the memory device and CXL can still be very capable in
> performance and may not be much slower (if any) than the on-board DRAM
> (e.g. DRAM from a remote CPU socket).

Absolutely - though it should also report its performance via
CDAT etc, so the information available should be rich.

> 
> Also, the default tier here is just the initial tier assignment of
> each node, which behaves as if there were no tiering.  A tiering
> kernel init function can certainly reassign the tier for each node if
> it knows enough about the hardware performance for these nodes from
> the firmware.

Understood. In some ways I'd be happier if we didn't provide an in-kernel
interface to set the tier assignments at all and made it a userspace
policy decision.  That way we'd pretty much oblige distros to put
in place sensible scripts on day one. Probably too late for that though :(


> > >
> > > node distances:
> > > node    0    1    2    3
> > >    0   10  100   30   40
> > >    1  100   10  120  110
> > >    2   30  120   10   80
> > >    3   40  110   80   10
> > >
> > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > 1
> > > 0,3
> > > 2
> > >
> > > $ cat /sys/devices/system/node/node*/memtier
> > > 1
> > > 0
> > > 2
> > > 1
> > >
> > > Demotion fallback order:
> > > node 0: 2
> > > node 1: 0, 3, 2
> > > node 2: empty
> > > node 3: 2  
> >
> > This is close but not quite the same as the example
> > Hesham gave (note the node 1 to 0 timing in the table
> > with that example didn't make sense).  I added another
> > level of switching to make the numbers more obviously
> > different and show how critical it might be.
> >
> > * Example 6:
> >
> > Node 0 is a DRAM node with CPU.
> > Node 1 is a GPU node.
> > Node 2 is a PMEM node.
> > Node 3 is an extremely large, DRAM node without CPU.
> >   (Key point here being that it probably never makes sense
> >    to demote to anywhere else from this memory).
> >
> >
> > I've redone the timings with respect to example 5.
> > Basis for this is that 0 and 2 are directly connected
> > via controllers in an SoC, while 1 and 3 are connected
> > via a common switch that is one switch further down
> > (each hop via a switch is 100).
> > All DRAMs cost 10 once you've reached the correct node
> > and PMEM costs 30 from the SoC.
> > Numbers get too large as a result but meh, I'm making
> > a point, not providing real numbers :)
> >
> >          PMEM Node 2
> >             |(30)
> >         CPU + DRAM Node0
> >             |(100)
> >          Switch 1
> >             |(100)
> >           Switch 2
> >     (100)  |      |(100)
> > Node 1 GPU     Node3 Large memory.
> >
> >
> > With one level of switching:
> >
> >      Node 2 (PMEM)  ----
> >     /      |              \
> >    /       | 30            \ 330
> >   |        |         310    \
> >   |   Node 0 (DRAM)  ----  Node 1 (GPU)
> >    \         \                 /
> >      \        \ 310           / 210
> >    330 \       \             /
> >          ---  Node 3 (Extremely large DRAM)
> >
> > To my mind, we should potentially also take into account
> > the fact that Node3 can be known to never contain CPUs
> > (in at least some architectures we know where the CPUs
> >  might be added later, they can't just magically turn up
> >  anywhere in the topology).
> >
> > node distances:
> > node    0    1    2    3
> >     0   10   310  30   310
> >     1   310  10   330  210
> >     2   30   330  10   330
> >     3   310  210  330   10
> >
> > So, my ideal would treat node 3 differently from other DRAM nodes
> > as we know it can't have CPUs. Trying to come up with an
> > always-correct order for nodes 3 and 2 is tricky as it to a certain
> > extent depends on capacity. If node 2 was big enough to take
> > any demotion from node 0 and still have lots of room, then demoting
> > there from node 3 would make sense, and vice versa.
> >
> >
> >  $ cat /sys/devices/system/memtier/memtier*/nodelist
> >  1
> >  0
> >  2
> >  3
> >
> >
> >  $ cat /sys/devices/system/node/node*/memtier
> >   1
> >   0
> >   2
> >   3
> >
> >  Demotion fallback order:
> >  node 0: 2, 3
> >  node 1: 3, 0, 2 (key being we will almost always have less pressure on node 3)
> >  node 2: 3
> >  node 3: empty
> >
> > Or, as Hesham just pointed out, this can be done with 3 tiers:
> > we can put the GPU and CPU in the same tier because
> > there is little reason to demote from one to the other.
> 
> Thank you for the example.  It makes sense to me to have node 3 on its
> own tier.  We can have either 3 tiers or 4 tiers in total (assuming
> that the max number of tiers is a config option).
> 
> > We are also a bit worried about ABI backwards compatibility because
> > of potential need to make more space in tiers lower in number than
> > CPU attached DDR. I rather liked the negative proposal with
> > default as 0 that Huang, Ying made.  
> 
> It is hard to have negative values as the device IDs.

Doh.  Obvious, but I missed that issue ;)

> 
> The current proposal equates the tier device ID with the tier hierarchy
> level, which makes the interface simpler, but less flexible.  How
> about the following proposal (which decouples the tier device ID from
> the tier level)?
> 
> /sys/devices/system/memtier/memtierN/nodelist
> /sys/devices/system/memtier/memtierN/rank
> 
> Each memory tier N has two sysfs files:
> - nodelist: the nodes that are in this tier
> - rank: an opaque value that helps decide the level at which this tier
> is in the tier hierarchy (smaller value means faster tier)

This we could do with negatives for faster-than-normal RAM. 0 is a nice
default value.  I'm assuming rank is userspace writeable?

> 
> The tier hierarchy is determined by "rank", not by the device id
> number N from "memtierN".
> 
> The absolute value of "rank" of a memtier doesn't necessarily carry
> any meaning. Its value relative to other memtiers decides the level of
> this memtier in the tier hierarchy.
> 
> The CPU-attached DRAM nodes are always in memtier0 (the device ID),
> but memtier0 may not always be the top-tier, e.g. its level can be 3
> in a 5-tier system.
> 
> For the above example (example 6), we can have:
> 
> $ ls /sys/devices/system/memtier
> memtier0
> memtier1
> memtier2
> memtier128
> 
> $ cat /sys/devices/system/memtier/memtier*/rank
> 50
> 60
> 70
> 10
> 
> The tier order: memtier128 -> memtier0 -> memtier1 -> memtier2
> 
> $ cat /sys/devices/system/memtier/memtier*/nodelist
> 0
> 2
> 3
> 1
> 
> $ ls -l /sys/devices/system/node/node*/memtier
> /sys/devices/system/node/node0/memtier -> /sys/devices/system/memtier/memtier0
> /sys/devices/system/node/node1/memtier -> /sys/devices/system/memtier/memtier128
> /sys/devices/system/node/node2/memtier -> /sys/devices/system/memtier/memtier1
> /sys/devices/system/node/node3/memtier -> /sys/devices/system/memtier/memtier2
> 
> To override the memory tier of a node, we can use a new, write-only,

Why write-only?
Why not just a number?

> per-node interface file:
> 
> /sys/devices/system/node/nodeN/set_memtier
> 
> e.g.
> 
> $ echo "memtier128" > sys/devices/system/node/node1/set_memtier
> 
> Any comments?

Nice :)

Initially I thought this was overcomplicated compared to just leaving space, but
after a chat with Hesham just now, you have us both convinced that this is an elegant solution.

A few corners probably need fleshing out:
*  Use of an allocator for new tiers: a flat number at startup, or perhaps a new tier
   on write of a unique value to set_memtier?  Also whether to allow drivers to
   allocate (I think we should).
*  Multiple tiers with the same rank.  My assumption is that, from the demotion-path
   point of view, you fuse them (treat them as if they were a single tier), but keep
   them expressed separately in the sysfs interface so that the rank can be changed
   independently (see the sketch just after this list).
*  Some guidance on what values make sense for a given rank default that might be set
   by a driver. If we have multiple GPU vendors and someone mixes them in a system, we
   probably don't want the default values they use to result in demotion between them.
   This might well be a guidance doc or an appropriate set of #defines.
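
For the same-rank case, a minimal sketch of what I mean by "fusing" when
building the demotion nodemask (the list, array and field names below are
made up for illustration; they are not existing kernel symbols):

    /*
     * Union the nodes of every tier whose rank is strictly greater
     * (i.e. slower) than the rank of the source node's tier.  Two tiers
     * that share a rank never become demotion targets of each other, so
     * they act as one fused tier for demotion while remaining separate
     * memtierN devices (with independently writable ranks) in sysfs.
     */
    static nodemask_t demotion_targets(int src_nid)
    {
            struct memory_tier *src = node_memtier[src_nid], *tier;
            nodemask_t targets = NODE_MASK_NONE;

            list_for_each_entry(tier, &memory_tiers, list)
                    if (tier->rank > src->rank)
                            nodes_or(targets, targets, tier->nodelist);

            return targets;
    }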

Sounds like a good direction to explore to me.
Fairly low cost to implement and very flexible.

Thanks,

Jonathan


> 
> > Jonathan
> >
> >
> >
> >
> >  


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-18  7:09   ` Wei Xu
  2022-05-18 12:00     ` Jonathan Cameron
@ 2022-05-20  3:06     ` Ying Huang
  2022-05-24  7:04       ` Wei Xu
  1 sibling, 1 reply; 47+ messages in thread
From: Ying Huang @ 2022-05-20  3:06 UTC (permalink / raw)
  To: Wei Xu, Jonathan Cameron
  Cc: Andrew Morton, Greg Thelen, Aneesh Kumar K.V, Yang Shi,
	Linux Kernel Mailing List, Jagdish Gediya, Michal Hocko,
	Tim C Chen, Dave Hansen, Alistair Popple, Baolin Wang, Feng Tang,
	Davidlohr Bueso, Dan Williams, David Rientjes, Linux MM,
	Brice Goglin, Hesham Almatary

On Wed, 2022-05-18 at 00:09 -0700, Wei Xu wrote:
> On Thu, May 12, 2022 at 8:00 AM Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
> > 
> > On Wed, 11 May 2022 23:22:11 -0700
> > Wei Xu <weixugc@google.com> wrote:
> > > The current kernel has the basic memory tiering support: Inactive
> > > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > > tier NUMA node to make room for new allocations on the higher tier
> > > NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> > > migrated (promoted) to a higher tier NUMA node to improve the
> > > performance.
> > > 
> > > In the current kernel, memory tiers are defined implicitly via a
> > > demotion path relationship between NUMA nodes, which is created during
> > > the kernel initialization and updated when a NUMA node is hot-added or
> > > hot-removed.  The current implementation puts all nodes with CPU into
> > > the top tier, and builds the tier hierarchy tier-by-tier by establishing
> > > the per-node demotion targets based on the distances between nodes.
> > > 
> > > This current memory tier kernel interface needs to be improved for
> > > several important use cases:
> > > 
> > > * The current tier initialization code always initializes
> > >   each memory-only NUMA node into a lower tier.  But a memory-only
> > >   NUMA node may have a high performance memory device (e.g. a DRAM
> > >   device attached via CXL.mem or a DRAM-backed memory-only node on
> > >   a virtual machine) and should be put into a higher tier.
> > > 
> > > * The current tier hierarchy always puts CPU nodes into the top
> > >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> > >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> > >   with CPUs are better to be placed into the next lower tier.
> > > 
> > > * Also because the current tier hierarchy always puts CPU nodes
> > >   into the top tier, when a CPU is hot-added (or hot-removed) and
> > >   triggers a memory node from CPU-less into a CPU node (or vice
> > >   versa), the memory tier hierarchy gets changed, even though no
> > >   memory node is added or removed.  This can make the tier
> > >   hierarchy unstable and make it difficult to support tier-based
> > >   memory accounting.
> > > 
> > > * A higher tier node can only be demoted to selected nodes on the
> > >   next lower tier as defined by the demotion path, not any other
> > >   node from any lower tier.  This strict, hard-coded demotion order
> > >   does not work in all use cases (e.g. some use cases may want to
> > >   allow cross-socket demotion to another node in the same demotion
> > >   tier as a fallback when the preferred demotion node is out of
> > >   space), and has resulted in the feature request for an interface to
> > >   override the system-wide, per-node demotion order from the
> > >   userspace.  This demotion order is also inconsistent with the page
> > >   allocation fallback order when all the nodes in a higher tier are
> > >   out of space: The page allocation can fall back to any node from
> > >   any lower tier, whereas the demotion order doesn't allow that.
> > > 
> > > * There are no interfaces for the userspace to learn about the memory
> > >   tier hierarchy in order to optimize its memory allocations.
> > > 
> > > I'd like to propose revised memory tier kernel interfaces based on
> > > the discussions in the threads:
> > > 
> > > - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> > > - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
> > > - https://lore.kernel.org/linux-mm/867bc216386eb6cbf54648f23e5825830f5b922e.camel@intel.com/T/
> > > 
> > > 
> > > High-level Design Ideas
> > > =======================
> > > 
> > > * Define memory tiers explicitly, not implicitly.
> > > 
> > > * Memory tiers are defined based on hardware capabilities of memory
> > >   nodes, not their relative node distances between each other.
> > > 
> > > * The tier assignment of each node is independent from each other.
> > >   Moving a node from one tier to another tier doesn't affect the tier
> > >   assignment of any other node.
> > > 
> > > * The node-tier association is stable. A node can be reassigned to a
> > >   different tier only under the specific conditions that don't block
> > >   future tier-based memory cgroup accounting.
> > > 
> > > * A node can demote its pages to any nodes of any lower tiers. The
> > >   demotion target node selection follows the allocation fallback order
> > >   of the source node, which is built based on node distances.  The
> > >   demotion targets are also restricted to only the nodes from the tiers
> > >   lower than the source node.  We no longer need to maintain a separate
> > >   per-node demotion order (node_demotion[]).
> > > 
> > 
> > Hi Wei,
> > 
> > This proposal looks good to me, though we'll be having fun
> > white boarding topologies from our roadmaps for the next few days :)
> 
> That's good to hear.
> 
> > A few comments inline. It also seems likely to me that there is little
> > benefit in starting with 3 tiers as the maximum.  Seems unlikely the
> > code will be substantially simpler for 3 than it would be for 4 or 5.
> > I've drawn out one simple case that needs 4 to do sensible things.
> 
> We can make the number of tiers a config option. 3 tiers are just what
> the kernel can reasonably initialize when there isn't enough hardware
> performance information from the firmware.
> 
> > > 
> > > Sysfs Interfaces
> > > ================
> > > 
> > > * /sys/devices/system/memtier/memtierN/nodelist
> > > 
> > >   where N = 0, 1, 2 (the kernel supports only 3 tiers for now).
> > > 
> > >   Format: node_list
> > > 
> > >   Read-only.  When read, list the memory nodes in the specified tier.
> > > 
> > >   Tier 0 is the highest tier, while tier 2 is the lowest tier.
> > > 
> > >   The absolute value of a tier id number has no specific meaning.
> > >   What matters is the relative order of the tier id numbers.
> > > 
> > >   When a memory tier has no nodes, the kernel can hide its memtier
> > >   sysfs files.
> > > 
> > > * /sys/devices/system/node/nodeN/memtier
> > > 
> > >   where N = 0, 1, ...
> > > 
> > >   Format: int or empty
> > > 
> > >   When read, list the memory tier that the node belongs to.  Its value
> > >   is empty for a CPU-only NUMA node.
> > > 
> > >   When written, the kernel moves the node into the specified memory
> > >   tier if the move is allowed.  The tier assignment of all other nodes
> > >   are not affected.
> > > 
> > >   Initially, we can make this interface read-only.
> > > 
> > > 
> > > Kernel Representation
> > > =====================
> > > 
> > > * All memory tiering code is guarded by CONFIG_TIERED_MEMORY.
> > > 
> > > * #define MAX_MEMORY_TIERS 3
> > > 
> > >   Support 3 memory tiers for now.
> > > 
> > > * #define MEMORY_DEFAULT_TIER 1
> > > 
> > >   The default tier that a memory node is assigned to.
> > > 
> > > * nodemask_t memory_tiers[MAX_MEMORY_TIERS]
> > > 
> > >   Store memory nodes by tiers.
> > > 
> > > * int node_tier_map[MAX_NUMNODES]
> > > 
> > >   Map a node to its tier.
> > > 
> > >   For each CPU-only node c, node_tier_map[c] = -1.
> > > 
> > > 
> > > Memory Tier Initialization
> > > ==========================
> > > 
> > > By default, all memory nodes are assigned to the default tier
> > > (MEMORY_DEFAULT_TIER).
> > 
> > This is tighter than it needs to be.  In many cases we can easily
> > establish if there is any possibility of CPU being hotplugged into
> > a memory node.  If it's CXL attached no way CPUs are going to be
> > turning up there later :)  If CPU HP into a given node can't happen
> > we can be more flexible and I think that often results in better decisions.
> > See example below, though obviously I could just use the userspace
> > interface to fix that up anyway or have a CXL driver move it around
> > if that's relevant.  In some other cases I'm fairly sure we know in
> > advance where CPUs can be added but I'd need to check all the
> > relevant specs to be sure there aren't any corner cases.  I 'think'
> > for ARM for example we know where all possible CPUs can be hotplugged
> > (constraint coming from the interrupt controller + the fact that only
> > virtual CPU HP is defined).
> 
> We may not always want to put a CXL-attached memory device into a
> slower tier because even though CXL does add some additional latency,
> both the memory device and CXL can still be very capable in
> performance and may not be much slower (if any) than the on-board DRAM
> (e.g. DRAM from a remote CPU socket).
> 
> Also, the default tier here is just the initial tier assignment of
> each node, which behaves as if there were no tiering.  A tiering
> kernel init function can certainly reassign the tier for each node if
> it knows enough about the hardware performance for these nodes from
> the firmware.
> 
> > > 
> > > A device driver can move up or down its memory nodes from the default
> > > tier.  For example, PMEM can move down its memory nodes below the
> > > default tier, whereas GPU can move up its memory nodes above the
> > > default tier.
> > > 
> > > The kernel initialization code makes the decision on which exact tier
> > > a memory node should be assigned to based on the requests from the
> > > device drivers as well as the memory device hardware information
> > > provided by the firmware.
> > > 
> > > 
> > > Memory Tier Reassignment
> > > ========================
> > > 
> > > After a memory node is hot-removed, it can be hot-added back to a
> > > different memory tier.  This is useful for supporting dynamically
> > > provisioned CXL.mem NUMA nodes, which may connect to different
> > > memory devices across hot-plug events.  Such tier changes should
> > > be compatible with tier-based memory accounting.
> > > 
> > > The userspace may also reassign an existing online memory node to a
> > > different tier.  However, this should only be allowed when no pages
> > > are allocated from the memory node or when there are no non-root
> > > memory cgroups (e.g. during the system boot).  This restriction is
> > > important for keeping memory tier hierarchy stable enough for
> > > tier-based memory cgroup accounting.
> > > 
> > > Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
> > > 
> > > 
> > > Memory Allocation for Demotion
> > > ==============================
> > > 
> > > To allocate a new page as the demotion target for a page, the kernel
> > > calls the allocation function (__alloc_pages_nodemask) with the
> > > source page node as the preferred node and the union of all lower
> > > tier nodes as the allowed nodemask.  The actual target node selection
> > > then follows the allocation fallback order that the kernel has
> > > already defined.
> > > 
> > > The pseudo code looks like:
> > > 
> > >     targets = NODE_MASK_NONE;
> > >     src_nid = page_to_nid(page);
> > >     src_tier = node_tier_map[src_nid];
> > >     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
> > >             nodes_or(targets, targets, memory_tiers[i]);
> > >     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
> > > 
> > > The memopolicy of cpuset, vma and owner task of the source page can
> > > be set to refine the demotion target nodemask, e.g. to prevent
> > > demotion or select a particular allowed node as the demotion target.
> > > 
> > > 
> > > Memory Allocation for Promotion
> > > ===============================
> > > 
> > > The page allocation for promotion is similar to demotion, except that (1)
> > > the target nodemask uses the promotion tiers, (2) the preferred node can
> > > be the accessing CPU node, not the source page node.
> > > 
> > > 
> > > Examples
> > > ========
> > > 
> > 
> > ...
> > 
> > > * Example 3:
> > > 
> > > Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
> > 
> > Node2 is drawn as pmem.
> 
> Typo. Good catch.
> 
> > > 
> > > All nodes are in the same tier.
> > > 
> > >                   20
> > >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> > >          \                 /
> > >           \ 30            / 30
> > >            \             /
> > >              Node 2 (PMEM)
> > > 
> > > node distances:
> > > node   0    1    2
> > >    0  10   20   30
> > >    1  20   10   30
> > >    2  30   30   10
> > > 
> > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > <empty>
> > > 0-2
> > > <empty>
> > > 
> > > $ cat /sys/devices/system/node/node*/memtier
> > > 1
> > > 1
> > > 1
> > > 
> > > Demotion fallback order:
> > > node 0: empty
> > > node 1: empty
> > > node 2: empty
> > > 
> > > 
> > > * Example 4:
> > > 
> > > Node 0 is a DRAM node with CPU.
> > > Node 1 is a PMEM node.
> > > Node 2 is a GPU node.
> > > 
> > >                   50
> > >   Node 0 (DRAM)  ----  Node 2 (GPU)
> > >          \                 /
> > >           \ 30            / 60
> > >            \             /
> > >              Node 1 (PMEM)
> > > 
> > > node distances:
> > > node   0    1    2
> > >    0  10   30   50
> > >    1  30   10   60
> > >    2  50   60   10
> > > 
> > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > 2
> > > 0
> > > 1
> > > 
> > > $ cat /sys/devices/system/node/node*/memtier
> > > 1
> > > 2
> > > 0
> > > 
> > > Demotion fallback order:
> > > node 0: 1
> > > node 1: empty
> > > node 2: 0, 1
> > > 
> > > 
> > > * Example 5:
> > > 
> > > Node 0 is a DRAM node with CPU.
> > > Node 1 is a GPU node.
> > > Node 2 is a PMEM node.
> > > Node 3 is a large, slow DRAM node without CPU.
> > > 
> > > 
> > >      Node 2 (PMEM)  ----
> > >    /      |              \
> > >   /       | 30            \ 120
> > >  |        |         100    \
> > >  |   Node 0 (DRAM)  ----  Node 1 (GPU)
> > >   \         \                 /
> > >     \        \ 40            / 110
> > >   80  \       \             /
> > >         ---  Node 3 (Slow DRAM)
> > 
> > This is close but not quite what was intended for Hesham's
> > example... (note we just checked that Hesham's original node0-1
> > > timing didn't make any sense).
> > 
> 
> This was inspired by Hesham's example. But I should have also included
> the version that illustrates the need to skip a tier when demoting
> from certain nodes.
> 
> > > 
> > > node distances:
> > > node    0    1    2    3
> > >    0   10  100   30   40
> > >    1  100   10  120  110
> > >    2   30  120   10   80
> > >    3   40  110   80   10
> > > 
> > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > 1
> > > 0,3
> > > 2
> > > 
> > > $ cat /sys/devices/system/node/node*/memtier
> > > 1
> > > 0
> > > 2
> > > 1
> > > 
> > > Demotion fallback order:
> > > node 0: 2
> > > node 1: 0, 3, 2
> > > node 2: empty
> > > node 3: 2
> > 
> > This is close but not quite the same as the example
> > Hesham gave (note the node 1 to 0 timing in the table
> > with that example didn't make sense).  I added another
> > level of switching to make the numbers more obviously
> > different and show how critical it might be.
> > 
> > * Example 6:
> > 
> > Node 0 is a DRAM node with CPU.
> > Node 1 is a GPU node.
> > Node 2 is a PMEM node.
> > Node 3 is an extremely large, DRAM node without CPU.
> >   (Key point here being that it probably never makes sense
> >    to demote to anywhere else from this memory).
> > 
> > 
> > I've redone the timings with respect to example 5.
> > Basis for this is that 0 and 2 are directly connected
> > via controllers in an SoC, while 1 and 3 are connected
> > via a common switch that is one switch further down
> > (each hop via a switch is 100).
> > All DRAMs cost 10 once you've reached the correct node
> > and PMEM costs 30 from the SoC.
> > Numbers get too large as a result but meh, I'm making
> > a point, not providing real numbers :)
> > 
> >          PMEM Node 2
> >             |(30)
> >         CPU + DRAM Node0
> >             |(100)
> >          Switch 1
> >             |(100)
> >           Switch 2
> >     (100)  |      |(100)
> > Node 1 GPU     Node3 Large memory.
> > 
> > 
> > With one level of switching:
> > 
> >      Node 2 (PMEM)  ----
> >     /      |              \
> >    /       | 30            \ 330
> >   |        |         310    \
> >   |   Node 0 (DRAM)  ----  Node 1 (GPU)
> >    \         \                 /
> >      \        \ 310           / 210
> >    330 \       \             /
> >          ---  Node 3 (Extremely large DRAM)
> > 
> > To my mind, we should potentially also take into account
> > the fact that Node3 can be known to never contain CPUs
> > (in at least some architectures we know where the CPUs
> >  might be added later, they can't just magically turn up
> >  anywhere in the topology).
> > 
> > node distances:
> > node    0    1    2    3
> >     0   10   310  30   310
> >     1   310  10   330  210
> >     2   30   330  10   330
> >     3   310  210  330   10
> > 
> > So, my ideal would treat node 3 differently from other DRAM nodes
> > as we know it can't have CPUs. Trying to come up with an
> > always-correct order for nodes 3 and 2 is tricky as it to a certain
> > extent depends on capacity. If node 2 was big enough to take
> > any demotion from node 0 and still have lots of room, then demoting
> > there from node 3 would make sense, and vice versa.
> > 
> > 
> >  $ cat /sys/devices/system/memtier/memtier*/nodelist
> >  1
> >  0
> >  2
> >  3
> > 
> > 
> >  $ cat /sys/devices/system/node/node*/memtier
> >   1
> >   0
> >   2
> >   3
> > 
> >  Demotion fallback order:
> >  node 0: 2, 3
> >  node 1: 3, 0, 2 (key being we will almost always have less pressure on node 3)
> >  node 2: 3
> >  node 3: empty
> > 
> > Or, as Hesham just pointed out, this can be done with 3 tiers:
> > we can put the GPU and CPU in the same tier because
> > there is little reason to demote from one to the other.
> 
> Thank you for the example.  It makes sense to me to have node 3 on its
> own tier.  We can have either 3 tiers or 4 tiers in total (assuming
> that the max number of tiers is a config option).
> 
> > We are also a bit worried about ABI backwards compatibility because
> > of potential need to make more space in tiers lower in number than
> > CPU attached DDR. I rather liked the negative proposal with
> > default as 0 that Huang, Ying made.
> 
> It is hard to have negative values as the device IDs.
> 
> The current proposal equates the tier device ID with the tier hierarchy
> level, which makes the interface simpler, but less flexible.  How
> about the following proposal (which decouples the tier device ID from
> the tier level)?
> 
> /sys/devices/system/memtier/memtierN/nodelist
> /sys/devices/system/memtier/memtierN/rank
> 
> Each memory tier N has two sysfs files:
> - nodelist: the nodes that are in this tier
> - rank: an opaque value that helps decide the level at which this tier
> is in the tier hierarchy (smaller value means faster tier)
> 
> The tier hierarchy is determined by "rank", not by the device id
> number N from "memtierN".
> 
> The absolute value of "rank" of a memtier doesn't necessarily carry
> any meaning. Its value relative to other memtiers decides the level of
> this memtier in the tier hierarchy.
> 
> The CPU-attached DRAM nodes are always in memtier0 (the device ID),
> but memtier0 may not always be the top-tier, e.g. its level can be 3
> in a 5-tier system.
> 
> For the above example (example 6), we can have:
> 
> $ ls /sys/devices/system/memtier
> memtier0
> memtier1
> memtier2
> memtier128
> 
> $ cat /sys/devices/system/memtier/memtier*/rank
> 50
> 60
> 70
> 10

I understand that the device ID cannot be negative.  So we have to use
rank.  Can we make it possible to allow "rank" to be negative?

Another choice is to play some trick with the device ID.  For example, the
CPU-attached DRAM nodes are always memtier100 (the device ID).  Then we can
have memtier99, memtier100, memtier101, memtier102, ....  That's not
perfect either.

> The tier order: memtier128 -> memtier0 -> memtier1 -> memtier2
> 
> $ cat /sys/devices/system/memtier/memtier*/nodelist
> 0
> 2
> 3
> 1
> 
> $ ls -l /sys/devices/system/node/node*/memtier
> /sys/devices/system/node/node0/memtier -> /sys/devices/system/memtier/memtier0
> /sys/devices/system/node/node1/memtier -> /sys/devices/system/memtier/memtier128
> /sys/devices/system/node/node2/memtier -> /sys/devices/system/memtier/memtier1
> /sys/devices/system/node/node3/memtier -> /sys/devices/system/memtier/memtier2
> 
> To override the memory tier of a node, we can use a new, write-only,
> per-node interface file:
> 
> /sys/devices/system/node/nodeN/set_memtier
> 
> e.g.
> 
> $ echo "memtier128" > sys/devices/system/node/node1/set_memtier

I prefer the original proposal to make nodeX/memtier a normal file that
holds the memtier device ID instead of a link.

Best Regards,
Huang, Ying

> Any comments?
> 
> > Jonathan
> > 
> > 
> > 
> > 
> > 



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-20  3:06     ` Ying Huang
@ 2022-05-24  7:04       ` Wei Xu
  2022-05-24  8:24         ` Ying Huang
  0 siblings, 1 reply; 47+ messages in thread
From: Wei Xu @ 2022-05-24  7:04 UTC (permalink / raw)
  To: Ying Huang
  Cc: Jonathan Cameron, Andrew Morton, Greg Thelen, Aneesh Kumar K.V,
	Yang Shi, Linux Kernel Mailing List, Jagdish Gediya,
	Michal Hocko, Tim C Chen, Dave Hansen, Alistair Popple,
	Baolin Wang, Feng Tang, Davidlohr Bueso, Dan Williams,
	David Rientjes, Linux MM, Brice Goglin, Hesham Almatary

On Thu, May 19, 2022 at 8:06 PM Ying Huang <ying.huang@intel.com> wrote:
>
> On Wed, 2022-05-18 at 00:09 -0700, Wei Xu wrote:
> > On Thu, May 12, 2022 at 8:00 AM Jonathan Cameron
> > <Jonathan.Cameron@huawei.com> wrote:
> > >
> > > On Wed, 11 May 2022 23:22:11 -0700
> > > Wei Xu <weixugc@google.com> wrote:
> > > > The current kernel has the basic memory tiering support: Inactive
> > > > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > > > tier NUMA node to make room for new allocations on the higher tier
> > > > NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> > > > migrated (promoted) to a higher tier NUMA node to improve the
> > > > performance.
> > > >
> > > > In the current kernel, memory tiers are defined implicitly via a
> > > > demotion path relationship between NUMA nodes, which is created during
> > > > the kernel initialization and updated when a NUMA node is hot-added or
> > > > hot-removed.  The current implementation puts all nodes with CPU into
> > > > the top tier, and builds the tier hierarchy tier-by-tier by establishing
> > > > the per-node demotion targets based on the distances between nodes.
> > > >
> > > > This current memory tier kernel interface needs to be improved for
> > > > several important use cases:
> > > >
> > > > * The current tier initialization code always initializes
> > > >   each memory-only NUMA node into a lower tier.  But a memory-only
> > > >   NUMA node may have a high performance memory device (e.g. a DRAM
> > > >   device attached via CXL.mem or a DRAM-backed memory-only node on
> > > >   a virtual machine) and should be put into a higher tier.
> > > >
> > > > * The current tier hierarchy always puts CPU nodes into the top
> > > >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> > > >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> > > >   with CPUs are better to be placed into the next lower tier.
> > > >
> > > > * Also because the current tier hierarchy always puts CPU nodes
> > > >   into the top tier, when a CPU is hot-added (or hot-removed) and
> > > >   triggers a memory node from CPU-less into a CPU node (or vice
> > > >   versa), the memory tier hierarchy gets changed, even though no
> > > >   memory node is added or removed.  This can make the tier
> > > >   hierarchy unstable and make it difficult to support tier-based
> > > >   memory accounting.
> > > >
> > > > * A higher tier node can only be demoted to selected nodes on the
> > > >   next lower tier as defined by the demotion path, not any other
> > > >   node from any lower tier.  This strict, hard-coded demotion order
> > > >   does not work in all use cases (e.g. some use cases may want to
> > > >   allow cross-socket demotion to another node in the same demotion
> > > >   tier as a fallback when the preferred demotion node is out of
> > > >   space), and has resulted in the feature request for an interface to
> > > >   override the system-wide, per-node demotion order from the
> > > >   userspace.  This demotion order is also inconsistent with the page
> > > >   allocation fallback order when all the nodes in a higher tier are
> > > >   out of space: The page allocation can fall back to any node from
> > > >   any lower tier, whereas the demotion order doesn't allow that.
> > > >
> > > > * There are no interfaces for the userspace to learn about the memory
> > > >   tier hierarchy in order to optimize its memory allocations.
> > > >
> > > > I'd like to propose revised memory tier kernel interfaces based on
> > > > the discussions in the threads:
> > > >
> > > > - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> > > > - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
> > > > - https://lore.kernel.org/linux-mm/867bc216386eb6cbf54648f23e5825830f5b922e.camel@intel.com/T/
> > > >
> > > >
> > > > High-level Design Ideas
> > > > =======================
> > > >
> > > > * Define memory tiers explicitly, not implicitly.
> > > >
> > > > * Memory tiers are defined based on hardware capabilities of memory
> > > >   nodes, not their relative node distances between each other.
> > > >
> > > > * The tier assignment of each node is independent from each other.
> > > >   Moving a node from one tier to another tier doesn't affect the tier
> > > >   assignment of any other node.
> > > >
> > > > * The node-tier association is stable. A node can be reassigned to a
> > > >   different tier only under the specific conditions that don't block
> > > >   future tier-based memory cgroup accounting.
> > > >
> > > > * A node can demote its pages to any nodes of any lower tiers. The
> > > >   demotion target node selection follows the allocation fallback order
> > > >   of the source node, which is built based on node distances.  The
> > > >   demotion targets are also restricted to only the nodes from the tiers
> > > >   lower than the source node.  We no longer need to maintain a separate
> > > >   per-node demotion order (node_demotion[]).
> > > >
> > >
> > > Hi Wei,
> > >
> > > This proposal looks good to me, though we'll be having fun
> > > white boarding topologies from our roadmaps for the next few days :)
> >
> > That's good to hear.
> >
> > > A few comments inline. It also seems likely to me that there is little
> > > benefit in starting with 3 tiers as the maximum.  Seems unlikely the
> > > code will be substantially simpler for 3 than it would be for 4 or 5.
> > > I've drawn out one simple case that needs 4 to do sensible things.
> >
> > We can make the number of tiers a config option. 3 tiers are just what
> > the kernel can reasonably initialize when there isn't enough hardware
> > performance information from the firmware.
> >
> > > >
> > > > Sysfs Interfaces
> > > > ================
> > > >
> > > > * /sys/devices/system/memtier/memtierN/nodelist
> > > >
> > > >   where N = 0, 1, 2 (the kernel supports only 3 tiers for now).
> > > >
> > > >   Format: node_list
> > > >
> > > >   Read-only.  When read, list the memory nodes in the specified tier.
> > > >
> > > >   Tier 0 is the highest tier, while tier 2 is the lowest tier.
> > > >
> > > >   The absolute value of a tier id number has no specific meaning.
> > > >   What matters is the relative order of the tier id numbers.
> > > >
> > > >   When a memory tier has no nodes, the kernel can hide its memtier
> > > >   sysfs files.
> > > >
> > > > * /sys/devices/system/node/nodeN/memtier
> > > >
> > > >   where N = 0, 1, ...
> > > >
> > > >   Format: int or empty
> > > >
> > > >   When read, list the memory tier that the node belongs to.  Its value
> > > >   is empty for a CPU-only NUMA node.
> > > >
> > > >   When written, the kernel moves the node into the specified memory
> > > >   tier if the move is allowed.  The tier assignment of all other nodes
> > > >   are not affected.
> > > >
> > > >   Initially, we can make this interface read-only.
> > > >
> > > >
> > > > Kernel Representation
> > > > =====================
> > > >
> > > > * All memory tiering code is guarded by CONFIG_TIERED_MEMORY.
> > > >
> > > > * #define MAX_MEMORY_TIERS 3
> > > >
> > > >   Support 3 memory tiers for now.
> > > >
> > > > * #define MEMORY_DEFAULT_TIER 1
> > > >
> > > >   The default tier that a memory node is assigned to.
> > > >
> > > > * nodemask_t memory_tiers[MAX_MEMORY_TIERS]
> > > >
> > > >   Store memory nodes by tiers.
> > > >
> > > > * int node_tier_map[MAX_NUMNODES]
> > > >
> > > >   Map a node to its tier.
> > > >
> > > >   For each CPU-only node c, node_tier_map[c] = -1.
> > > >
> > > >
> > > > Memory Tier Initialization
> > > > ==========================
> > > >
> > > > By default, all memory nodes are assigned to the default tier
> > > > (MEMORY_DEFAULT_TIER).
> > >
> > > This is tighter than it needs to be.  In many cases we can easily
> > > establish if there is any possibility of CPU being hotplugged into
> > > a memory node.  If it's CXL attached no way CPUs are going to be
> > > turning up there later :)  If CPU HP into a given node can't happen
> > > we can be more flexible and I think that often results in better decisions.
> > > See example below, though obviously I could just use the userspace
> > > interface to fix that up anyway or have a CXL driver move it around
> > > if that's relevant.  In some other cases I'm fairly sure we know in
> > > advance where CPUs can be added but I'd need to check all the
> > > relevant specs to be sure there aren't any corner cases.  I 'think'
> > > for ARM for example we know where all possible CPUs can be hotplugged
> > > (constraint coming from the interrupt controller + the fact that only
> > > virtual CPU HP is defined).
> >
> > We may not always want to put a CXL-attached memory device into a
> > slower tier because even though CXL does add some additional latency,
> > both the memory device and CXL can still be very capable in
> > performance and may not be much slower (if any) than the on-board DRAM
> > (e.g. DRAM from a remote CPU socket).
> >
> > Also, the default tier here is just the initial tier assignment of
> > each node, which behaves as if there were no tiering.  A tiering
> > kernel init function can certainly reassign the tier for each node if
> > it knows enough about the hardware performance for these nodes from
> > the firmware.
> >
> > > >
> > > > A device driver can move up or down its memory nodes from the default
> > > > tier.  For example, PMEM can move down its memory nodes below the
> > > > default tier, whereas GPU can move up its memory nodes above the
> > > > default tier.
> > > >
> > > > The kernel initialization code makes the decision on which exact tier
> > > > a memory node should be assigned to based on the requests from the
> > > > device drivers as well as the memory device hardware information
> > > > provided by the firmware.
> > > >
> > > >
> > > > Memory Tier Reassignment
> > > > ========================
> > > >
> > > > After a memory node is hot-removed, it can be hot-added back to a
> > > > different memory tier.  This is useful for supporting dynamically
> > > > provisioned CXL.mem NUMA nodes, which may connect to different
> > > > memory devices across hot-plug events.  Such tier changes should
> > > > be compatible with tier-based memory accounting.
> > > >
> > > > The userspace may also reassign an existing online memory node to a
> > > > different tier.  However, this should only be allowed when no pages
> > > > are allocated from the memory node or when there are no non-root
> > > > memory cgroups (e.g. during the system boot).  This restriction is
> > > > important for keeping memory tier hierarchy stable enough for
> > > > tier-based memory cgroup accounting.
> > > >
> > > > Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
> > > >
> > > >
> > > > Memory Allocation for Demotion
> > > > ==============================
> > > >
> > > > To allocate a new page as the demotion target for a page, the kernel
> > > > calls the allocation function (__alloc_pages_nodemask) with the
> > > > source page node as the preferred node and the union of all lower
> > > > tier nodes as the allowed nodemask.  The actual target node selection
> > > > then follows the allocation fallback order that the kernel has
> > > > already defined.
> > > >
> > > > The pseudo code looks like:
> > > >
> > > >     targets = NODE_MASK_NONE;
> > > >     src_nid = page_to_nid(page);
> > > >     src_tier = node_tier_map[src_nid];
> > > >     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
> > > >             nodes_or(targets, targets, memory_tiers[i]);
> > > >     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
> > > >
> > > > The memopolicy of cpuset, vma and owner task of the source page can
> > > > be set to refine the demotion target nodemask, e.g. to prevent
> > > > demotion or select a particular allowed node as the demotion target.
> > > >
> > > >
> > > > Memory Allocation for Promotion
> > > > ===============================
> > > >
> > > > The page allocation for promotion is similar to demotion, except that (1)
> > > > the target nodemask uses the promotion tiers, (2) the preferred node can
> > > > be the accessing CPU node, not the source page node.
> > > >
> > > >
> > > > Examples
> > > > ========
> > > >
> > >
> > > ...
> > >
> > > > * Example 3:
> > > >
> > > > Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
> > >
> > > Node2 is drawn as pmem.
> >
> > Typo. Good catch.
> >
> > > >
> > > > All nodes are in the same tier.
> > > >
> > > >                   20
> > > >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> > > >          \                 /
> > > >           \ 30            / 30
> > > >            \             /
> > > >              Node 2 (PMEM)
> > > >
> > > > node distances:
> > > > node   0    1    2
> > > >    0  10   20   30
> > > >    1  20   10   30
> > > >    2  30   30   10
> > > >
> > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > <empty>
> > > > 0-2
> > > > <empty>
> > > >
> > > > $ cat /sys/devices/system/node/node*/memtier
> > > > 1
> > > > 1
> > > > 1
> > > >
> > > > Demotion fallback order:
> > > > node 0: empty
> > > > node 1: empty
> > > > node 2: empty
> > > >
> > > >
> > > > * Example 4:
> > > >
> > > > Node 0 is a DRAM node with CPU.
> > > > Node 1 is a PMEM node.
> > > > Node 2 is a GPU node.
> > > >
> > > >                   50
> > > >   Node 0 (DRAM)  ----  Node 2 (GPU)
> > > >          \                 /
> > > >           \ 30            / 60
> > > >            \             /
> > > >              Node 1 (PMEM)
> > > >
> > > > node distances:
> > > > node   0    1    2
> > > >    0  10   30   50
> > > >    1  30   10   60
> > > >    2  50   60   10
> > > >
> > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > 2
> > > > 0
> > > > 1
> > > >
> > > > $ cat /sys/devices/system/node/node*/memtier
> > > > 1
> > > > 2
> > > > 0
> > > >
> > > > Demotion fallback order:
> > > > node 0: 1
> > > > node 1: empty
> > > > node 2: 0, 1
> > > >
> > > >
> > > > * Example 5:
> > > >
> > > > Node 0 is a DRAM node with CPU.
> > > > Node 1 is a GPU node.
> > > > Node 2 is a PMEM node.
> > > > Node 3 is a large, slow DRAM node without CPU.
> > > >
> > > >
> > > >      Node 2 (PMEM)  ----
> > > >    /      |              \
> > > >   /       | 30            \ 120
> > > >  |        |         100    \
> > > >  |   Node 0 (DRAM)  ----  Node 1 (GPU)
> > > >   \         \                 /
> > > >     \        \ 40            / 110
> > > >   80  \       \             /
> > > >         ---  Node 3 (Slow DRAM)
> > >
> > > This is close but not quite what was intended for Hesham's
> > > example... (note we just checked that Hesham's original node0-1
> > > timing didn't make any sense.).
> > >
> >
> > This was inspired by Hesham's example. But I should have also included
> > the version that illustrates the need to skip a tier when demoting
> > from certain nodes.
> >
> > > >
> > > > node distances:
> > > > node    0    1    2    3
> > > >    0   10  100   30   40
> > > >    1  100   10  120  110
> > > >    2   30  120   10   80
> > > >    3   40  110   80   10
> > > >
> > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > 1
> > > > 0,3
> > > > 2
> > > >
> > > > $ cat /sys/devices/system/node/node*/memtier
> > > > 1
> > > > 0
> > > > 2
> > > > 1
> > > >
> > > > Demotion fallback order:
> > > > node 0: 2
> > > > node 1: 0, 3, 2
> > > > node 2: empty
> > > > node 3: 2
> > >
> > > This is close but not quite the same as the example
> > > Hesham gave (note the node 1 to 0 timing in the table
> > > with that example didn't make sense).  I added another
> > > level of switching to make the numbers more obviously
> > > different and show how critical it might be.
> > >
> > > * Example 6:
> > >
> > > Node 0 is a DRAM node with CPU.
> > > Node 1 is a GPU node.
> > > Node 2 is a PMEM node.
> > > Node 3 is an extremely large, DRAM node without CPU.
> > >   (Key point here being that it probably never makes sense
> > >    to demote to anywhere else from this memory).
> > >
> > >
> > > I've redone the timings wrt to example 5.
> > > Basis for this is 0 and 2 are directly connected
> > > via controllers in an SoC. 1 and 3 are connected
> > > via a common switch one level further down
> > > (each hop via a switch costs 100).
> > > All DRAMs cost 10 once you've reached the correct node,
> > > and PMEM costs 30 from the SoC.
> > > Numbers get too large as a result but meh, I'm making
> > > a point not providing real numbers :)
> > >
> > >          PMEM Node 2
> > >             |(30)
> > >         CPU + DRAM Node0
> > >             |(100)
> > >          Switch 1
> > >             |(100)
> > >           Switch 2
> > >     (100)  |      |(100)
> > > Node 1 GPU     Node3 Large memory.
> > >
> > >
> > > With one level of switching, this becomes:
> > >
> > >      Node 2 (PMEM)  ----
> > >     /      |              \
> > >    /       | 30            \ 330
> > >   |        |         310    \
> > >   |   Node 0 (DRAM)  ----  Node 1 (GPU)
> > >    \         \                 /
> > >      \        \ 310           / 210
> > >    330 \       \             /
> > >          ---  Node 3 (Extremely large DRAM)
> > >
> > > To my mind, we should potentially also take into account
> > > the fact that Node3 can be known to never contain CPUs
> > > (in at least some architectures we know where the CPUs
> > >  might be added later, they can't just magically turn up
> > >  anywhere in the topology).
> > >
> > > node distances:
> > > node    0    1    2    3
> > >     0   10   310  30   310
> > >     1   310  10   330  210
> > >     2   30   330  10   330
> > >     3   310  210  330   10
> > >
> > > So, my ideal would treat node 3 differently from other DRAM nodes
> > > as we know it can't have CPUs. Trying to come up with an
> > > always correct order for nodes 3 and 2 is tricky as it to a certain
> > > extent depends on capacity. If node 2 were big enough to take
> > > any demotion from node 0 and still have lots of room, then demoting
> > > there from node 3 would make sense, and vice versa.
> > >
> > >
> > >  $ cat /sys/devices/system/memtier/memtier*/nodelist
> > >  1
> > >  0
> > >  2
> > >  3
> > >
> > >
> > >  $ cat /sys/devices/system/node/node*/memtier
> > >   1
> > >   0
> > >   2
> > >   3
> > >
> > >  Demotion fallback order:
> > >  node 0: 2, 3
> > >  node 1: 3, 0, 2 (key being we will almost always have less pressure on node 3)
> > >  node 2: 3
> > >  node 3: empty
> > >
> > > or, as Hesham just pointed out, this can be done with 3 tiers
> > > because we can put the GPU and CPU in the same tier, as
> > > there is little reason to demote from one to the other.
> >
> > Thank you for the example.  It makes sense to me to have node 3 on its
> > own tier.  We can have either 3 tiers or 4 tiers in total (assuming
> > that the max number of tiers is a config option).
> >
> > > We are also a bit worried about ABI backwards compatibility because
> > > of potential need to make more space in tiers lower in number than
> > > CPU attached DDR. I rather liked the negative proposal with
> > > default as 0 that Huang, Ying made.
> >
> > It is hard to have negative values as the device IDs.
> >
> > The current proposal equals the tier device ID to the tier hierarchy
> > level, which makes the interface simpler, but less flexible.  How
> > about the following proposal (which decouples the tier device ID from
> > the tier level)?
> >
> > /sys/devices/system/memtier/memtierN/nodelist
> > /sys/devices/system/memtier/memtierN/rank
> >
> > Each memory tier N has two sysfs files:
> > - nodelist: the nodes that are in this tier
> > - rank: an opaque value that helps decide the level at which this tier
> > is in the tier hierarchy (smaller value means faster tier)
> >
> > The tier hierarchy is determined by "rank", not by the device id
> > number N from "memtierN".
> >
> > The absolute value of "rank" of a memtier doesn't necessarily carry
> > any meaning. Its value relative to other memtiers decides the level of
> > this memtier in the tier hierarchy.
> >
> > The CPU-attached DRAM nodes are always in memtier0 (the device ID),
> > but memtier0 may not always be the top-tier, e.g. its level can be 3
> > in a 5-tier system.
> >
> > For the above example (example 6), we can have:
> >
> > $ ls /sys/devices/system/memtier
> > memtier0
> > memtier1
> > memtier2
> > memtier128
> >
> > $ cat /sys/devices/system/memtier/memtier*/rank
> > 50
> > 60
> > 70
> > 10
>
> I understand that the device ID cannot be negative.  So we have to use
> rank.  Can we make it possible to allow "rank" to be negative?

It is possible to allow "rank" to be negative, though I think all
positive values should work equally well.

> Another choice is to do some trick on the device ID.  For example, the CPU-
> attached DRAM nodes are always memtier100 (the device ID).  Then we can
> have memtier99, memtier100, memtier101, memtier102, ....  That's not
> perfect too.

If we go with the device ID tricks, one approach is to use sub-device IDs:

- There are 3 major tiers: tier0 (e.g. GPU), tier1 (e.g. DRAM) and
tier2 (e.g. PMEM).

- Each major tier can have minor tiers, e.g. tier0.0, tier1.0,
tier1.1, tier2.0, tier2.1.

The earlier 4-tier example can be represented as:

memtier0.0 -> memtier1.0 -> memtier2.0 -> memtier2.1

We can also omit .0 so that the tiers are:

memtier0 -> memtier1 -> memtier2 -> memtier2.1

This should be flexible enough to support multiple tiers while keeping
the tier IDs relatively stable.

It is not as flexible as the rank approach. For example, to insert a
new tier between 2.0 and 2.1, we need to add a tier 2.2 and reassign
existing nodes to these 3 tiers.  Using "rank", we can insert a new
tier and only move desired nodes into the new tier.
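
To make the major.minor idea a bit more concrete, here is a rough
sketch (hypothetical struct and helper names, not an actual kernel
interface) of how such tier IDs could be encoded and ordered:

    /* Illustrative only: a tier identified by a major.minor pair. */
    struct memtier_id {
            int major;      /* e.g. 0 = GPU-class, 1 = DRAM-class, 2 = PMEM-class */
            int minor;      /* minor 0 can be omitted in the sysfs name */
    };

    /* Order tiers lexicographically: 0.0 < 1.0 < 2.0 < 2.1 ... */
    static int memtier_id_cmp(const struct memtier_id *a,
                              const struct memtier_id *b)
    {
            if (a->major != b->major)
                    return a->major - b->major;
            return a->minor - b->minor;
    }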

What do you think?

> > The tier order: memtier128 -> memtier0 -> memtier1 -> memtier2
> >
> > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > 0
> > 2
> > 3
> > 1
> >
> > $ ls -l /sys/devices/system/node/node*/memtier
> > /sys/devices/system/node/node0/memtier -> /sys/devices/system/memtier/memtier0
> > /sys/devices/system/node/node1/memtier -> /sys/devices/system/memtier/memtier128
> > /sys/devices/system/node/node2/memtier -> /sys/devices/system/memtier/memtier1
> > /sys/devices/system/node/node3/memtier -> /sys/devices/system/memtier/memtier2
> >
> > To override the memory tier of a node, we can use a new, write-only,
> > per-node interface file:
> >
> > /sys/devices/system/node/nodeN/set_memtier
> >
> > e.g.
> >
> > $ echo "memtier128" > sys/devices/system/node/node1/set_memtier
>
> I prefer the original proposal to make nodeX/memtier a normal file to
> hold memtier devicde ID instead of a link.

OK. We don't have to use a symlink.

> Best Regards,
> Huang, Ying
>
> > Any comments?
> >
> > > Jonathan
> > >
> > >
> > >
> > >
> > >
>
>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-18 12:00     ` Jonathan Cameron
@ 2022-05-24  7:36       ` Wei Xu
  2022-05-24 13:26         ` Aneesh Kumar K.V
  0 siblings, 1 reply; 47+ messages in thread
From: Wei Xu @ 2022-05-24  7:36 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Dave Hansen, Alistair Popple, Huang Ying, Andrew Morton,
	Greg Thelen, Aneesh Kumar K.V, Yang Shi,
	Linux Kernel Mailing List, Jagdish Gediya, Michal Hocko,
	Tim C Chen, Baolin Wang, Feng Tang, Davidlohr Bueso,
	Dan Williams, David Rientjes, Linux MM, Brice Goglin,
	Hesham Almatary

On Wed, May 18, 2022 at 5:00 AM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Wed, 18 May 2022 00:09:48 -0700
> Wei Xu <weixugc@google.com> wrote:
>
> > On Thu, May 12, 2022 at 8:00 AM Jonathan Cameron
> > <Jonathan.Cameron@huawei.com> wrote:
> > >
> > > On Wed, 11 May 2022 23:22:11 -0700
> > > Wei Xu <weixugc@google.com> wrote:
> > > > The current kernel has the basic memory tiering support: Inactive
> > > > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > > > tier NUMA node to make room for new allocations on the higher tier
> > > > NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> > > > migrated (promoted) to a higher tier NUMA node to improve the
> > > > performance.
> > > >
> > > > In the current kernel, memory tiers are defined implicitly via a
> > > > demotion path relationship between NUMA nodes, which is created during
> > > > the kernel initialization and updated when a NUMA node is hot-added or
> > > > hot-removed.  The current implementation puts all nodes with CPU into
> > > > the top tier, and builds the tier hierarchy tier-by-tier by establishing
> > > > the per-node demotion targets based on the distances between nodes.
> > > >
> > > > This current memory tier kernel interface needs to be improved for
> > > > several important use cases:
> > > >
> > > > * The current tier initialization code always initializes
> > > >   each memory-only NUMA node into a lower tier.  But a memory-only
> > > >   NUMA node may have a high performance memory device (e.g. a DRAM
> > > >   device attached via CXL.mem or a DRAM-backed memory-only node on
> > > >   a virtual machine) and should be put into a higher tier.
> > > >
> > > > * The current tier hierarchy always puts CPU nodes into the top
> > > >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> > > >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> > > >   with CPUs are better to be placed into the next lower tier.
> > > >
> > > > * Also because the current tier hierarchy always puts CPU nodes
> > > >   into the top tier, when a CPU is hot-added (or hot-removed) and
> > > >   triggers a memory node from CPU-less into a CPU node (or vice
> > > >   versa), the memory tier hierarchy gets changed, even though no
> > > >   memory node is added or removed.  This can make the tier
> > > >   hierarchy unstable and make it difficult to support tier-based
> > > >   memory accounting.
> > > >
> > > > * A higher tier node can only be demoted to selected nodes on the
> > > >   next lower tier as defined by the demotion path, not any other
> > > >   node from any lower tier.  This strict, hard-coded demotion order
> > > >   does not work in all use cases (e.g. some use cases may want to
> > > >   allow cross-socket demotion to another node in the same demotion
> > > >   tier as a fallback when the preferred demotion node is out of
> > > >   space), and has resulted in the feature request for an interface to
> > > >   override the system-wide, per-node demotion order from the
> > > >   userspace.  This demotion order is also inconsistent with the page
> > > >   allocation fallback order when all the nodes in a higher tier are
> > > >   out of space: The page allocation can fall back to any node from
> > > >   any lower tier, whereas the demotion order doesn't allow that.
> > > >
> > > > * There are no interfaces for the userspace to learn about the memory
> > > >   tier hierarchy in order to optimize its memory allocations.
> > > >
> > > > I'd like to propose revised memory tier kernel interfaces based on
> > > > the discussions in the threads:
> > > >
> > > > - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> > > > - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
> > > > - https://lore.kernel.org/linux-mm/867bc216386eb6cbf54648f23e5825830f5b922e.camel@intel.com/T/
> > > >
> > > >
> > > > High-level Design Ideas
> > > > =======================
> > > >
> > > > * Define memory tiers explicitly, not implicitly.
> > > >
> > > > * Memory tiers are defined based on hardware capabilities of memory
> > > >   nodes, not their relative node distances between each other.
> > > >
> > > > * The tier assignment of each node is independent from each other.
> > > >   Moving a node from one tier to another tier doesn't affect the tier
> > > >   assignment of any other node.
> > > >
> > > > * The node-tier association is stable. A node can be reassigned to a
> > > >   different tier only under the specific conditions that don't block
> > > >   future tier-based memory cgroup accounting.
> > > >
> > > > * A node can demote its pages to any nodes of any lower tiers. The
> > > >   demotion target node selection follows the allocation fallback order
> > > >   of the source node, which is built based on node distances.  The
> > > >   demotion targets are also restricted to only the nodes from the tiers
> > > >   lower than the source node.  We no longer need to maintain a separate
> > > >   per-node demotion order (node_demotion[]).
> > > >
> > >
> > > Hi Wei,
> > >
> > > This proposal looks good to me, though we'll be having fun
> > > white boarding topologies from our roadmaps for the next few days :)
> >
> > That's good to hear.
> >
> > > A few comments inline. It also seems likely to me that there is little
> > > benefit in starting with 3 tiers as the maximum.  Seems unlikely the
> > > code will be substantially simpler for 3 than it would be for 4 or 5.
> > > I've drawn out one simple case that needs 4 to do sensible things.
> >
> > We can make the number of tiers a config option. 3 tiers are just what
> > the kernel can reasonably initialize when there isn't enough hardware
> > performance information from the firmware.
> Now I think your rank solution below solves the following (but I wrote
> it before reading that part properly :) ...
>
> One issue with a config option is not breaking ABI if some distro
> changes that option or we change a default value in future.
> It may take some care.
>
> Imagine that today we think 3 tiers is fine and default to tier 1 for DDR.
> Someone writes a script to say their special device attached memory must
> be in tier 1 as well, on the assumption that it is the same tier as DDR
> (a policy decision).
> Later we decide to move the default DDR to tier 2 because we have
> lots of hardware platforms where it makes sense to have multiple
> faster tiers.
>
> Their policy script now puts some memory in a tier that doesn't have
> the same relationship to the default tier.
>
> If we define a 'default_node' or similar sysfs file in memtier
> as a read-only report of what the kernel is defaulting to, we can
> at least argue they should have read it (no way of actually making
> them do so though :(
>
>
> >
> > > >
> > > > Sysfs Interfaces
> > > > ================
> > > >
> > > > * /sys/devices/system/memtier/memtierN/nodelist
> > > >
> > > >   where N = 0, 1, 2 (the kernel supports only 3 tiers for now).
> > > >
> > > >   Format: node_list
> > > >
> > > >   Read-only.  When read, list the memory nodes in the specified tier.
> > > >
> > > >   Tier 0 is the highest tier, while tier 2 is the lowest tier.
> > > >
> > > >   The absolute value of a tier id number has no specific meaning.
> > > >   What matters is the relative order of the tier id numbers.
> > > >
> > > >   When a memory tier has no nodes, the kernel can hide its memtier
> > > >   sysfs files.
> > > >
> > > > * /sys/devices/system/node/nodeN/memtier
> > > >
> > > >   where N = 0, 1, ...
> > > >
> > > >   Format: int or empty
> > > >
> > > >   When read, list the memory tier that the node belongs to.  Its value
> > > >   is empty for a CPU-only NUMA node.
> > > >
> > > >   When written, the kernel moves the node into the specified memory
> > > >   tier if the move is allowed.  The tier assignment of all other nodes
> > > >   are not affected.
> > > >
> > > >   Initially, we can make this interface read-only.
> > > >
> > > >
> > > > Kernel Representation
> > > > =====================
> > > >
> > > > * All memory tiering code is guarded by CONFIG_TIERED_MEMORY.
> > > >
> > > > * #define MAX_MEMORY_TIERS 3
> > > >
> > > >   Support 3 memory tiers for now.
> > > >
> > > > * #define MEMORY_DEFAULT_TIER 1
> > > >
> > > >   The default tier that a memory node is assigned to.
> > > >
> > > > * nodemask_t memory_tiers[MAX_MEMORY_TIERS]
> > > >
> > > >   Store memory nodes by tiers.
> > > >
> > > > * int node_tier_map[MAX_NUMNODES]
> > > >
> > > >   Map a node to its tier.
> > > >
> > > >   For each CPU-only node c, node_tier_map[c] = -1.
> > > >
> > > >
> > > > Memory Tier Initialization
> > > > ==========================
> > > >
> > > > By default, all memory nodes are assigned to the default tier
> > > > (MEMORY_DEFAULT_TIER).
> > >
> > > This is tighter than it needs to be.  In many cases we can easily
> > > establish if there is any possibility of CPU being hotplugged into
> > > a memory node.  If it's CXL attached no way CPUs are going to be
> > > turning up their later :)  If CPU HP into a given node can't happen
> > > we can be more flexible and I think that often results in better decisions.
> > > See example below, though obviously I could just use the userspace
> > > interface to fix that up anyway or have a CXL driver move it around
> > > if that's relevant.  In some other cases I'm fairly sure we know in
> > > advance where CPUs can be added but I'd need to check all the
> > > relevant specs to be sure there aren't any corner cases.  I 'think'
> > > for ARM for example we know where all possible CPUs can be hotplugged
> > > (constraint coming from the interrupt controller + the fact that only
> > > virtual CPU HP is defined).
> >
> > We may not always want to put a CXL-attached memory device into a
> > slower tier because even though CXL does add some additional latency,
> > both the memory device and CXL can still be very capable in
> > performance and may not be much slower (if any) than the on-board DRAM
> > (e.g. DRAM from a remote CPU socket).
>
> Absolutely - though it should also report its performance via
> CDAT etc., so the information available should be rich.
>
> >
> > Also, the default tier here is just the initial tier assignment of
> > each node, which behaves as if there were no tiering.  A tiering
> > kernel init function can certainly reassign the tier for each node if
> > it knows enough about the hardware performance for these nodes from
> > the firmware.
>
> Understood. In some ways I'd be happier if we didn't provide an in-kernel
> interface to set the tier assignments at all and made it a userspace
> policy decision.  That way we'd pretty much oblige distros to put
> in place sensible scripts on day one. Probably too late for that though :(
>
>
> > > >
> > > > node distances:
> > > > node    0    1    2    3
> > > >    0   10  100   30   40
> > > >    1  100   10  120  110
> > > >    2   30  120   10   80
> > > >    3   40  110   80   10
> > > >
> > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > 1
> > > > 0,3
> > > > 2
> > > >
> > > > $ cat /sys/devices/system/node/node*/memtier
> > > > 1
> > > > 0
> > > > 2
> > > > 1
> > > >
> > > > Demotion fallback order:
> > > > node 0: 2
> > > > node 1: 0, 3, 2
> > > > node 2: empty
> > > > node 3: 2
> > >
> > > This is close but not quite the same as the example
> > > Hesham gave (note the node 1 to 0 timing in the table
> > > with that example didn't make sense).  I added another
> > > level of switching to make the numbers more obviously
> > > different and show how critical it might be.
> > >
> > > * Example 6:
> > >
> > > Node 0 is a DRAM node with CPU.
> > > Node 1 is a GPU node.
> > > Node 2 is a PMEM node.
> > > Node 3 is an extremely large, DRAM node without CPU.
> > >   (Key point here being that it probably never makes sense
> > >    to demote to anywhere else from this memory).
> > >
> > >
> > > I've redone the timings wrt to example 5.
> > > Basis for this is 0 and 2 are directly connected
> > > via controllers in an SoC. 1 and 3 are connected
> > > via a common switch one level further down
> > > (each hop via a switch costs 100).
> > > All DRAMs cost 10 once you've reached the correct node,
> > > and PMEM costs 30 from the SoC.
> > > Numbers get too large as a result but meh, I'm making
> > > a point not providing real numbers :)
> > >
> > >          PMEM Node 2
> > >             |(30)
> > >         CPU + DRAM Node0
> > >             |(100)
> > >          Switch 1
> > >             |(100)
> > >           Switch 2
> > >     (100)  |      |(100)
> > > Node 1 GPU     Node3 Large memory.
> > >
> > >
> > > With one level of switching, this becomes:
> > >
> > >      Node 2 (PMEM)  ----
> > >     /      |              \
> > >    /       | 30            \ 330
> > >   |        |         310    \
> > >   |   Node 0 (DRAM)  ----  Node 1 (GPU)
> > >    \         \                 /
> > >      \        \ 310           / 210
> > >    330 \       \             /
> > >          ---  Node 3 (Extremely large DRAM)
> > >
> > > To my mind, we should potentially also take into account
> > > the fact that Node3 can be known to never contain CPUs
> > > (in at least some architectures we know where the CPUs
> > >  might be added later, they can't just magically turn up
> > >  anywhere in the topology).
> > >
> > > node distances:
> > > node    0    1    2    3
> > >     0   10   310  30   310
> > >     1   310  10   330  210
> > >     2   30   330  10   330
> > >     3   310  210  330   10
> > >
> > > So, my ideal would treat node 3 differently from other DRAM nodes
> > > as we know it can't have CPUs. Trying to come up with an
> > > always correct order for nodes 3 and 2 is tricky as it to a certain
> > > extent depends on capacity. If node 2 were big enough to take
> > > any demotion from node 0 and still have lots of room, then demoting
> > > there from node 3 would make sense, and vice versa.
> > >
> > >
> > >  $ cat /sys/devices/system/memtier/memtier*/nodelist
> > >  1
> > >  0
> > >  2
> > >  3
> > >
> > >
> > >  $ cat /sys/devices/system/node/node*/memtier
> > >   1
> > >   0
> > >   2
> > >   3
> > >
> > >  Demotion fallback order:
> > >  node 0: 2, 3
> > >  node 1: 3, 0, 2 (key being we will almost always have less pressure on node 3)
> > >  node 2: 3
> > >  node 3: empty
> > >
> > > or, as Hesham just pointed out, this can be done with 3 tiers
> > > because we can put the GPU and CPU in the same tier, as
> > > there is little reason to demote from one to the other.
> >
> > Thank you for the example.  It makes sense to me to have node 3 on its
> > own tier.  We can have either 3 tiers or 4 tiers in total (assuming
> > that the max number of tiers is a config option).
> >
> > > We are also a bit worried about ABI backwards compatibility because
> > > of potential need to make more space in tiers lower in number than
> > > CPU attached DDR. I rather liked the negative proposal with
> > > default as 0 that Huang, Ying made.
> >
> > It is hard to have negative values as the device IDs.
>
> Doh.  Obvious, but I missed that issue ;)
>
> >
> > The current proposal equals the tier device ID to the tier hierarchy
> > level, which makes the interface simpler, but less flexible.  How
> > about the following proposal (which decouples the tier device ID from
> > the tier level)?
> >
> > /sys/devices/system/memtier/memtierN/nodelist
> > /sys/devices/system/memtier/memtierN/rank
> >
> > Each memory tier N has two sysfs files:
> > - nodelist: the nodes that are in this tier
> > - rank: an opaque value that helps decide the level at which this tier
> > is in the tier hierarchy (smaller value means faster tier)
>
> This we could do with negatives for faster than normal RAM. 0 is a nice
> default value.  I'm assuming rank is userspace writeable?

Maybe.  It is simpler if rank doesn't change, though.

> >
> > The tier hierarchy is determined by "rank", not by the device id
> > number N from "memtierN".
> >
> > The absolute value of "rank" of a memtier doesn't necessarily carry
> > any meaning. Its value relative to other memtiers decides the level of
> > this memtier in the tier hierarchy.
> >
> > The CPU-attached DRAM nodes are always in memtier0 (the device ID),
> > but memtier0 may not always be the top-tier, e.g. its level can be 3
> > in a 5-tier system.
> >
> > For the above example (example 6), we can have:
> >
> > $ ls /sys/devices/system/memtier
> > memtier0
> > memtier1
> > memtier2
> > memtier128
> >
> > $ cat /sys/devices/system/memtier/memtier*/rank
> > 50
> > 60
> > 70
> > 10
> >
> > The tier order: memtier128 -> memtier0 -> memtier1 -> memtier2
> >
> > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > 0
> > 2
> > 3
> > 1
> >
> > $ ls -l /sys/devices/system/node/node*/memtier
> > /sys/devices/system/node/node0/memtier -> /sys/devices/system/memtier/memtier0
> > /sys/devices/system/node/node1/memtier -> /sys/devices/system/memtier/memtier128
> > /sys/devices/system/node/node2/memtier -> /sys/devices/system/memtier/memtier1
> > /sys/devices/system/node/node3/memtier -> /sys/devices/system/memtier/memtier2
> >
> > To override the memory tier of a node, we can use a new, write-only,
>
> Why write-only?
> Why not just a number?

Sure.  We can merge set_memtier and memtier into a single file (no
symlink) for read/write.
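
As a sketch of what the merged read/write file could look like
(reusing the example 6 assignment above, where node 1 maps to
memtier128):

$ cat /sys/devices/system/node/node1/memtier
128

$ echo 128 > /sys/devices/system/node/node1/memtier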

> > per-node interface file:
> >
> > /sys/devices/system/node/nodeN/set_memtier
> >
> > e.g.
> >
> > $ echo "memtier128" > sys/devices/system/node/node1/set_memtier
> >
> > Any comments?
>
> Nice :)
>
> Initially I thought this was over complicated when compared to just leaving space, but
> after a chat with Hesham just now you have us both convinced that this is an elegant solution.
>
> Few corners probably need fleshing out:
> *  Use of an allocator for new tiers. Flat number at startup, or new one on write of unique
>    value to set_memtier perhaps?  Also whether to allow drivers to allocate (I think
>    we should).
> *  Multiple tiers with same rank.  My assumption is from demotion path point of view you
>    fuse them (treat them as if they were a single tier), but keep them expressed
>    separately in the sysfs interface so that the rank can be changed independently.
> *  Some guidance on what values make sense for given rank default that might be set by
>    a driver. If we have multiple GPU vendors, and someone mixes them in a system we
>    probably don't want the default values they use to result in demotion between them.
>    This might well be a guidance DOC or appropriate set of #define

All of these are good ideas, though I am afraid that these can make
tier management too complex for what it's worth.

How about an alternative tier numbering scheme that uses major.minor
device IDs?  For simplicity, we can just start with 3 major tiers.
New tiers can be inserted in-between using minor tier IDs.

> Sounds like a good direction to explore to me.
> Fairly low cost to implement and very flexible.
>
> Thanks,
>
> Jonathan
>
>
> >
> > > Jonathan
> > >
> > >
> > >
> > >
> > >
>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-24  7:04       ` Wei Xu
@ 2022-05-24  8:24         ` Ying Huang
  2022-05-25  5:32           ` Wei Xu
  0 siblings, 1 reply; 47+ messages in thread
From: Ying Huang @ 2022-05-24  8:24 UTC (permalink / raw)
  To: Wei Xu
  Cc: Jonathan Cameron, Andrew Morton, Greg Thelen, Aneesh Kumar K.V,
	Yang Shi, Linux Kernel Mailing List, Jagdish Gediya,
	Michal Hocko, Tim C Chen, Dave Hansen, Alistair Popple,
	Baolin Wang, Feng Tang, Davidlohr Bueso, Dan Williams,
	David Rientjes, Linux MM, Brice Goglin, Hesham Almatary

On Tue, 2022-05-24 at 00:04 -0700, Wei Xu wrote:
> On Thu, May 19, 2022 at 8:06 PM Ying Huang <ying.huang@intel.com> wrote:
> > 
> > On Wed, 2022-05-18 at 00:09 -0700, Wei Xu wrote:
> > > On Thu, May 12, 2022 at 8:00 AM Jonathan Cameron
> > > <Jonathan.Cameron@huawei.com> wrote:
> > > > 
> > > > On Wed, 11 May 2022 23:22:11 -0700
> > > > Wei Xu <weixugc@google.com> wrote:
> > > > > The current kernel has the basic memory tiering support: Inactive
> > > > > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > > > > tier NUMA node to make room for new allocations on the higher tier
> > > > > NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> > > > > migrated (promoted) to a higher tier NUMA node to improve the
> > > > > performance.
> > > > > 
> > > > > In the current kernel, memory tiers are defined implicitly via a
> > > > > demotion path relationship between NUMA nodes, which is created during
> > > > > the kernel initialization and updated when a NUMA node is hot-added or
> > > > > hot-removed.  The current implementation puts all nodes with CPU into
> > > > > the top tier, and builds the tier hierarchy tier-by-tier by establishing
> > > > > the per-node demotion targets based on the distances between nodes.
> > > > > 
> > > > > This current memory tier kernel interface needs to be improved for
> > > > > several important use cases:
> > > > > 
> > > > > * The current tier initialization code always initializes
> > > > >   each memory-only NUMA node into a lower tier.  But a memory-only
> > > > >   NUMA node may have a high performance memory device (e.g. a DRAM
> > > > >   device attached via CXL.mem or a DRAM-backed memory-only node on
> > > > >   a virtual machine) and should be put into a higher tier.
> > > > > 
> > > > > * The current tier hierarchy always puts CPU nodes into the top
> > > > >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> > > > >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> > > > >   with CPUs are better to be placed into the next lower tier.
> > > > > 
> > > > > * Also because the current tier hierarchy always puts CPU nodes
> > > > >   into the top tier, when a CPU is hot-added (or hot-removed) and
> > > > >   triggers a memory node from CPU-less into a CPU node (or vice
> > > > >   versa), the memory tier hierarchy gets changed, even though no
> > > > >   memory node is added or removed.  This can make the tier
> > > > >   hierarchy unstable and make it difficult to support tier-based
> > > > >   memory accounting.
> > > > > 
> > > > > * A higher tier node can only be demoted to selected nodes on the
> > > > >   next lower tier as defined by the demotion path, not any other
> > > > >   node from any lower tier.  This strict, hard-coded demotion order
> > > > >   does not work in all use cases (e.g. some use cases may want to
> > > > >   allow cross-socket demotion to another node in the same demotion
> > > > >   tier as a fallback when the preferred demotion node is out of
> > > > >   space), and has resulted in the feature request for an interface to
> > > > >   override the system-wide, per-node demotion order from the
> > > > >   userspace.  This demotion order is also inconsistent with the page
> > > > >   allocation fallback order when all the nodes in a higher tier are
> > > > >   out of space: The page allocation can fall back to any node from
> > > > >   any lower tier, whereas the demotion order doesn't allow that.
> > > > > 
> > > > > * There are no interfaces for the userspace to learn about the memory
> > > > >   tier hierarchy in order to optimize its memory allocations.
> > > > > 
> > > > > I'd like to propose revised memory tier kernel interfaces based on
> > > > > the discussions in the threads:
> > > > > 
> > > > > - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> > > > > - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
> > > > > - https://lore.kernel.org/linux-mm/867bc216386eb6cbf54648f23e5825830f5b922e.camel@intel.com/T/
> > > > > 
> > > > > 
> > > > > High-level Design Ideas
> > > > > =======================
> > > > > 
> > > > > * Define memory tiers explicitly, not implicitly.
> > > > > 
> > > > > * Memory tiers are defined based on hardware capabilities of memory
> > > > >   nodes, not their relative node distances between each other.
> > > > > 
> > > > > * The tier assignment of each node is independent from each other.
> > > > >   Moving a node from one tier to another tier doesn't affect the tier
> > > > >   assignment of any other node.
> > > > > 
> > > > > * The node-tier association is stable. A node can be reassigned to a
> > > > >   different tier only under the specific conditions that don't block
> > > > >   future tier-based memory cgroup accounting.
> > > > > 
> > > > > * A node can demote its pages to any nodes of any lower tiers. The
> > > > >   demotion target node selection follows the allocation fallback order
> > > > >   of the source node, which is built based on node distances.  The
> > > > >   demotion targets are also restricted to only the nodes from the tiers
> > > > >   lower than the source node.  We no longer need to maintain a separate
> > > > >   per-node demotion order (node_demotion[]).
> > > > > 
> > > > 
> > > > Hi Wei,
> > > > 
> > > > This proposal looks good to me, though we'll be having fun
> > > > white boarding topologies from our roadmaps for the next few days :)
> > > 
> > > That's good to hear.
> > > 
> > > > A few comments inline. It also seems likely to me that there is little
> > > > benefit in starting with 3 tiers as the maximum.  Seems unlikely the
> > > > code will be substantially simpler for 3 than it would be for 4 or 5.
> > > > I've drawn out one simple case that needs 4 to do sensible things.
> > > 
> > > We can make the number of tiers a config option. 3 tiers are just what
> > > the kernel can reasonably initialize when there isn't enough hardware
> > > performance information from the firmware.
> > > 
> > > > > 
> > > > > Sysfs Interfaces
> > > > > ================
> > > > > 
> > > > > * /sys/devices/system/memtier/memtierN/nodelist
> > > > > 
> > > > >   where N = 0, 1, 2 (the kernel supports only 3 tiers for now).
> > > > > 
> > > > >   Format: node_list
> > > > > 
> > > > >   Read-only.  When read, list the memory nodes in the specified tier.
> > > > > 
> > > > >   Tier 0 is the highest tier, while tier 2 is the lowest tier.
> > > > > 
> > > > >   The absolute value of a tier id number has no specific meaning.
> > > > >   What matters is the relative order of the tier id numbers.
> > > > > 
> > > > >   When a memory tier has no nodes, the kernel can hide its memtier
> > > > >   sysfs files.
> > > > > 
> > > > > * /sys/devices/system/node/nodeN/memtier
> > > > > 
> > > > >   where N = 0, 1, ...
> > > > > 
> > > > >   Format: int or empty
> > > > > 
> > > > >   When read, list the memory tier that the node belongs to.  Its value
> > > > >   is empty for a CPU-only NUMA node.
> > > > > 
> > > > >   When written, the kernel moves the node into the specified memory
> > > > >   tier if the move is allowed.  The tier assignment of all other nodes
> > > > >   are not affected.
> > > > > 
> > > > >   Initially, we can make this interface read-only.
> > > > > 
> > > > > 
> > > > > Kernel Representation
> > > > > =====================
> > > > > 
> > > > > * All memory tiering code is guarded by CONFIG_TIERED_MEMORY.
> > > > > 
> > > > > * #define MAX_MEMORY_TIERS 3
> > > > > 
> > > > >   Support 3 memory tiers for now.
> > > > > 
> > > > > * #define MEMORY_DEFAULT_TIER 1
> > > > > 
> > > > >   The default tier that a memory node is assigned to.
> > > > > 
> > > > > * nodemask_t memory_tiers[MAX_MEMORY_TIERS]
> > > > > 
> > > > >   Store memory nodes by tiers.
> > > > > 
> > > > > * int node_tier_map[MAX_NUMNODES]
> > > > > 
> > > > >   Map a node to its tier.
> > > > > 
> > > > >   For each CPU-only node c, node_tier_map[c] = -1.
> > > > > 
> > > > > 
> > > > > Memory Tier Initialization
> > > > > ==========================
> > > > > 
> > > > > By default, all memory nodes are assigned to the default tier
> > > > > (MEMORY_DEFAULT_TIER).
> > > > 
> > > > This is tighter than it needs to be.  In many cases we can easily
> > > > establish if there is any possibility of CPU being hotplugged into
> > > > a memory node.  If it's CXL attached no way CPUs are going to be
> > > > turning up their later :)  If CPU HP into a given node can't happen
> > > > we can be more flexible and I think that often results in better decisions.
> > > > See example below, though obviously I could just use the userspace
> > > > interface to fix that up anyway or have a CXL driver move it around
> > > > if that's relevant.  In some other cases I'm fairly sure we know in
> > > > advance where CPUs can be added but I'd need to check all the
> > > > relevant specs to be sure there aren't any corner cases.  I 'think'
> > > > for ARM for example we know where all possible CPUs can be hotplugged
> > > > (constraint coming from the interrupt controller + the fact that only
> > > > virtual CPU HP is defined).
> > > 
> > > We may not always want to put a CXL-attached memory device into a
> > > slower tier because even though CXL does add some additional latency,
> > > both the memory device and CXL can still be very capable in
> > > performance and may not be much slower (if any) than the on-board DRAM
> > > (e.g. DRAM from a remote CPU socket).
> > > 
> > > Also, the default tier here is just the initial tier assignment of
> > > each node, which behaves as if there were no tiering.  A tiering
> > > kernel init function can certainly reassign the tier for each node if
> > > it knows enough about the hardware performance for these nodes from
> > > the firmware.
> > > 
> > > > > 
> > > > > A device driver can move up or down its memory nodes from the default
> > > > > tier.  For example, PMEM can move down its memory nodes below the
> > > > > default tier, whereas GPU can move up its memory nodes above the
> > > > > default tier.
> > > > > 
> > > > > The kernel initialization code makes the decision on which exact tier
> > > > > a memory node should be assigned to based on the requests from the
> > > > > device drivers as well as the memory device hardware information
> > > > > provided by the firmware.
> > > > > 
> > > > > 
> > > > > Memory Tier Reassignment
> > > > > ========================
> > > > > 
> > > > > After a memory node is hot-removed, it can be hot-added back to a
> > > > > different memory tier.  This is useful for supporting dynamically
> > > > > provisioned CXL.mem NUMA nodes, which may connect to different
> > > > > memory devices across hot-plug events.  Such tier changes should
> > > > > be compatible with tier-based memory accounting.
> > > > > 
> > > > > The userspace may also reassign an existing online memory node to a
> > > > > different tier.  However, this should only be allowed when no pages
> > > > > are allocated from the memory node or when there are no non-root
> > > > > memory cgroups (e.g. during the system boot).  This restriction is
> > > > > important for keeping memory tier hierarchy stable enough for
> > > > > tier-based memory cgroup accounting.
> > > > > 
> > > > > Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
> > > > > 
> > > > > 
> > > > > Memory Allocation for Demotion
> > > > > ==============================
> > > > > 
> > > > > To allocate a new page as the demotion target for a page, the kernel
> > > > > calls the allocation function (__alloc_pages_nodemask) with the
> > > > > source page node as the preferred node and the union of all lower
> > > > > tier nodes as the allowed nodemask.  The actual target node selection
> > > > > then follows the allocation fallback order that the kernel has
> > > > > already defined.
> > > > > 
> > > > > The pseudo code looks like:
> > > > > 
> > > > >     targets = NODE_MASK_NONE;
> > > > >     src_nid = page_to_nid(page);
> > > > >     src_tier = node_tier_map[src_nid];
> > > > >     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
> > > > >             nodes_or(targets, targets, memory_tiers[i]);
> > > > >     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
> > > > > 
> > > > > The memopolicy of cpuset, vma and owner task of the source page can
> > > > > be set to refine the demotion target nodemask, e.g. to prevent
> > > > > demotion or select a particular allowed node as the demotion target.
> > > > > 
> > > > > 
> > > > > Memory Allocation for Promotion
> > > > > ===============================
> > > > > 
> > > > > The page allocation for promotion is similar to demotion, except that (1)
> > > > > the target nodemask uses the promotion tiers, (2) the preferred node can
> > > > > be the accessing CPU node, not the source page node.
> > > > > 
> > > > > 
> > > > > Examples
> > > > > ========
> > > > > 
> > > > 
> > > > ...
> > > > 
> > > > > * Example 3:
> > > > > 
> > > > > Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
> > > > 
> > > > Node2 is drawn as pmem.
> > > 
> > > Typo. Good catch.
> > > 
> > > > > 
> > > > > All nodes are in the same tier.
> > > > > 
> > > > >                   20
> > > > >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> > > > >          \                 /
> > > > >           \ 30            / 30
> > > > >            \             /
> > > > >              Node 2 (PMEM)
> > > > > 
> > > > > node distances:
> > > > > node   0    1    2
> > > > >    0  10   20   30
> > > > >    1  20   10   30
> > > > >    2  30   30   10
> > > > > 
> > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > <empty>
> > > > > 0-2
> > > > > <empty>
> > > > > 
> > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > 1
> > > > > 1
> > > > > 1
> > > > > 
> > > > > Demotion fallback order:
> > > > > node 0: empty
> > > > > node 1: empty
> > > > > node 2: empty
> > > > > 
> > > > > 
> > > > > * Example 4:
> > > > > 
> > > > > Node 0 is a DRAM node with CPU.
> > > > > Node 1 is a PMEM node.
> > > > > Node 2 is a GPU node.
> > > > > 
> > > > >                   50
> > > > >   Node 0 (DRAM)  ----  Node 2 (GPU)
> > > > >          \                 /
> > > > >           \ 30            / 60
> > > > >            \             /
> > > > >              Node 1 (PMEM)
> > > > > 
> > > > > node distances:
> > > > > node   0    1    2
> > > > >    0  10   30   50
> > > > >    1  30   10   60
> > > > >    2  50   60   10
> > > > > 
> > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > 2
> > > > > 0
> > > > > 1
> > > > > 
> > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > 1
> > > > > 2
> > > > > 0
> > > > > 
> > > > > Demotion fallback order:
> > > > > node 0: 1
> > > > > node 1: empty
> > > > > node 2: 0, 1
> > > > > 
> > > > > 
> > > > > * Example 5:
> > > > > 
> > > > > Node 0 is a DRAM node with CPU.
> > > > > Node 1 is a GPU node.
> > > > > Node 2 is a PMEM node.
> > > > > Node 3 is a large, slow DRAM node without CPU.
> > > > > 
> > > > > 
> > > > >      Node 2 (PMEM)  ----
> > > > >    /      |              \
> > > > >   /       | 30            \ 120
> > > > >  |        |         100    \
> > > > >  |   Node 0 (DRAM)  ----  Node 1 (GPU)
> > > > >   \         \                 /
> > > > >     \        \ 40            / 110
> > > > >   80  \       \             /
> > > > >         ---  Node 3 (Slow DRAM)
> > > > 
> > > > This is close but not quite what was intended for Hesham's
> > > > example... (note we just checked that Hesham's original node0-1
> > > > timing didn't make any sense.).
> > > > 
> > > 
> > > This was inspired by Hesham's example. But I should have also included
> > > the version that illustrates the need to skip a tier when demoting
> > > from certain nodes.
> > > 
> > > > > 
> > > > > node distances:
> > > > > node    0    1    2    3
> > > > >    0   10  100   30   40
> > > > >    1  100   10  120  110
> > > > >    2   30  120   10   80
> > > > >    3   40  110   80   10
> > > > > 
> > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > 1
> > > > > 0,3
> > > > > 2
> > > > > 
> > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > 1
> > > > > 0
> > > > > 2
> > > > > 1
> > > > > 
> > > > > Demotion fallback order:
> > > > > node 0: 2
> > > > > node 1: 0, 3, 2
> > > > > node 2: empty
> > > > > node 3: 2
> > > > 
> > > > This is close but not quite the same as the example
> > > > Hesham gave (note the node 1 to 0 timing in the table
> > > > with that example didn't make sense).  I added another
> > > > level of switching to make the numbers more obviously
> > > > different and show how critical it might be.
> > > > 
> > > > * Example 6:
> > > > 
> > > > Node 0 is a DRAM node with CPU.
> > > > Node 1 is a GPU node.
> > > > Node 2 is a PMEM node.
> > > > Node 3 is an extremely large, DRAM node without CPU.
> > > >   (Key point here being that it probably never makes sense
> > > >    to demote to anywhere else from this memory).
> > > > 
> > > > 
> > > > I've redone the timings wrt to example 5.
> > > > Basis for this is 0 and 2 are directly connected
> > > > via controllers in an SoC. 1 and 3 are connected
> > > > via a common switch one level further down
> > > > (each hop via a switch costs 100).
> > > > All DRAMs cost 10 once you've reached the correct node,
> > > > and PMEM costs 30 from the SoC.
> > > > Numbers get too large as a result but meh, I'm making
> > > > a point not providing real numbers :)
> > > > 
> > > >          PMEM Node 2
> > > >             |(30)
> > > >         CPU + DRAM Node0
> > > >             |(100)
> > > >          Switch 1
> > > >             |(100)
> > > >           Switch 2
> > > >     (100)  |      |(100)
> > > > Node 1 GPU     Node3 Large memory.
> > > > 
> > > > 
> > > > With one level of switching, this becomes:
> > > > 
> > > >      Node 2 (PMEM)  ----
> > > >     /      |              \
> > > >    /       | 30            \ 330
> > > >   |        |         310    \
> > > >   |   Node 0 (DRAM)  ----  Node 1 (GPU)
> > > >    \         \                 /
> > > >      \        \ 310           / 210
> > > >    330 \       \             /
> > > >          ---  Node 3 (Extremely large DRAM)
> > > > 
> > > > To my mind, we should potentially also take into account
> > > > the fact that Node3 can be known to never contain CPUs
> > > > (in at least some architectures we know where the CPUs
> > > >  might be added later, they can't just magically turn up
> > > >  anywhere in the topology).
> > > > 
> > > > node distances:
> > > > node    0    1    2    3
> > > >     0   10   310  30   310
> > > >     1   310  10   330  210
> > > >     2   30   330  10   330
> > > >     3   310  210  330   10
> > > > 
> > > > So, my ideal would treat node 3 differently from other DRAM nodes
> > > > as we know it can't have CPUs. Trying to come up with an
> > > > always correct order for nodes 3 and 2 is tricky as it to a certain
> > > > extent depends on capacity. If node 2 were big enough to take
> > > > any demotion from node 0 and still have lots of room, then demoting
> > > > there from node 3 would make sense, and vice versa.
> > > > 
> > > > 
> > > >  $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > >  1
> > > >  0
> > > >  2
> > > >  3
> > > > 
> > > > 
> > > >  $ cat /sys/devices/system/node/node*/memtier
> > > >   1
> > > >   0
> > > >   2
> > > >   3
> > > > 
> > > >  Demotion fallback order:
> > > >  node 0: 2, 3
> > > >  node 1: 3, 0, 2 (key being we will almost always have less pressure on node 3)
> > > >  node 2: 3
> > > >  node 3: empty
> > > > 
> > > > or, as Hesham just pointed out, this can be done with 3 tiers
> > > > because we can put the GPU and CPU in the same tier, as
> > > > there is little reason to demote from one to the other.
> > > 
> > > Thank you for the example.  It makes sense to me to have node 3 on its
> > > own tier.  We can have either 3 tiers or 4 tiers in total (assuming
> > > that the max number of tiers is a config option).
> > > 
> > > > We are also a bit worried about ABI backwards compatibility because
> > > > of potential need to make more space in tiers lower in number than
> > > > CPU attached DDR. I rather liked the negative proposal with
> > > > default as 0 that Huang, Ying made.
> > > 
> > > It is hard to have negative values as the device IDs.
> > > 
> > > The current proposal equals the tier device ID to the tier hierarchy
> > > level, which makes the interface simpler, but less flexible.  How
> > > about the following proposal (which decouples the tier device ID from
> > > the tier level)?
> > > 
> > > /sys/devices/system/memtier/memtierN/nodelist
> > > /sys/devices/system/memtier/memtierN/rank
> > > 
> > > Each memory tier N has two sysfs files:
> > > - nodelist: the nodes that are in this tier
> > > - rank: an opaque value that helps decide the level at which this tier
> > > is in the tier hierarchy (smaller value means faster tier)
> > > 
> > > The tier hierarchy is determined by "rank", not by the device id
> > > number N from "memtierN".
> > > 
> > > The absolute value of "rank" of a memtier doesn't necessarily carry
> > > any meaning. Its value relative to other memtiers decides the level of
> > > this memtier in the tier hierarchy.
> > > 
> > > The CPU-attached DRAM nodes are always in memtier0 (the device ID),
> > > but memtier0 may not always be the top-tier, e.g. its level can be 3
> > > in a 5-tier system.
> > > 
> > > For the above example (example 6), we can have:
> > > 
> > > $ ls /sys/devices/system/memtier
> > > memtier0
> > > memtier1
> > > memtier2
> > > memtier128
> > > 
> > > $ cat /sys/devices/system/memtier/memtier*/rank
> > > 50
> > > 60
> > > 70
> > > 10
> > 
> > I understand that the device ID cannot be negative.  So we have to use
> > rank.  Can we make it possible to allow "rank" to be negative?
> 
> It is possible to allow "rank" to be negative, though I think all
> positive values should work equally well.
> 
> > Another choice is to do some trick on the device ID.  For example, the CPU-
> > attached DRAM nodes are always memtier100 (the device ID).  Then we can
> > have memtier99, memtier100, memtier101, memtier102, ....  That's not
> > perfect too.
> 
> If we go with the device ID tricks, one approach is to use sub-device IDs:
> 
> - There are 3 major tiers: tier0 (e.g. GPU), tier1 (e.g. DRAM) and
> tier2 (e.g. PMEM).
> 
> - Each major tier can have minor tiers, e.g. tier0.0, tier1.0,
> tier1.1, tier2.0, tier2.1.
> 
> The earlier 4-tier example can be represented as:
> 
> memtier0.0 -> memtier1.0 -> memtier2.0 -> memtier2.1
> 
> We can also omit .0 so that the tiers are:
> 
> memtier0 -> memtier1 -> memtier2 -> memtier2.1
> 
> This should be flexible enough to support multiple tiers while keeping
> the tier IDs relatively stable.
> 
> It is not as flexible as the rank approach. For example, to insert a
> new tier between 2.0 and 2.1, we need to add a tier 2.2 and reassign
> existing nodes to these 3 tiers.  Using "rank", we can insert a new
> tier and only move desired nodes into the new tier.
> 
> What do you think?

The rank approach looks better.  And if we stick with the device ID
rule as follows,

...
255	GPU
0	DRAM
1	PMEM
2
...

255 is -1 for "s8".

The device ID trick should cover most cases for now.  The rank can provide
more flexibility in the future.  We can even go without rank in the
first version, and introduce it when it becomes necessary.
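
For illustration (not actual kernel code), the "s8" trick works because
255 wraps to -1 in a signed 8-bit value, so a plain signed comparison of
the device IDs already gives the intended order:

    #include <linux/types.h>        /* for s8 */

    /* Illustrative only: tier device IDs stored as s8. */
    s8 gpu_tier  = (s8)255;         /* 255 wraps to -1 as an s8 */
    s8 dram_tier = 0;               /* CPU-attached DRAM        */
    s8 pmem_tier = 1;               /* PMEM                     */

    /*
     * gpu_tier (-1) < dram_tier (0) < pmem_tier (1), i.e. GPU is
     * treated as the fastest tier without needing a separate rank.
     */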

Best Regards,
Huang, Ying

> > > The tier order: memtier128 -> memtier0 -> memtier1 -> memtier2
> > > 
> > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > 0
> > > 2
> > > 3
> > > 1
> > > 
> > > $ ls -l /sys/devices/system/node/node*/memtier
> > > /sys/devices/system/node/node0/memtier -> /sys/devices/system/memtier/memtier0
> > > /sys/devices/system/node/node1/memtier -> /sys/devices/system/memtier/memtier128
> > > /sys/devices/system/node/node2/memtier -> /sys/devices/system/memtier/memtier1
> > > /sys/devices/system/node/node3/memtier -> /sys/devices/system/memtier/memtier2
> > > 
> > > To override the memory tier of a node, we can use a new, write-only,
> > > per-node interface file:
> > > 
> > > /sys/devices/system/node/nodeN/set_memtier
> > > 
> > > e.g.
> > > 
> > > $ echo "memtier128" > sys/devices/system/node/node1/set_memtier
> > 
> > I prefer the original proposal to make nodeX/memtier a normal file to
> > hold memtier devicde ID instead of a link.
> 
> OK. We don't have to use a symlink.
> 
> > Best Regards,
> > Huang, Ying
> > 
> > > Any comments?
> > > 
> > > > Jonathan
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > 
> > 
> > 



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-24  7:36       ` Wei Xu
@ 2022-05-24 13:26         ` Aneesh Kumar K.V
  2022-05-25  5:27           ` Wei Xu
  0 siblings, 1 reply; 47+ messages in thread
From: Aneesh Kumar K.V @ 2022-05-24 13:26 UTC (permalink / raw)
  To: Wei Xu, Jonathan Cameron
  Cc: Dave Hansen, Alistair Popple, Huang Ying, Andrew Morton,
	Greg Thelen, Yang Shi, Linux Kernel Mailing List, Jagdish Gediya,
	Michal Hocko, Tim C Chen, Baolin Wang, Feng Tang,
	Davidlohr Bueso, Dan Williams, David Rientjes, Linux MM,
	Brice Goglin, Hesham Almatary

Wei Xu <weixugc@google.com> writes:

> On Wed, May 18, 2022 at 5:00 AM Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
>>
>> On Wed, 18 May 2022 00:09:48 -0700
>> Wei Xu <weixugc@google.com> wrote:

...

> Nice :)
>>
>> Initially I thought this was over complicated when compared to just leaving space, but
>> after a chat with Hesham just now you have us both convinced that this is an elegant solution.
>>
>> Few corners probably need fleshing out:
>> *  Use of an allocator for new tiers. Flat number at startup, or new one on write of unique
>>    value to set_memtier perhaps?  Also whether to allow drivers to allocate (I think
>>    we should).
>> *  Multiple tiers with same rank.  My assumption is from demotion path point of view you
>>    fuse them (treat them as if they were a single tier), but keep them expressed
>>    separately in the sysfs interface so that the rank can be changed independently.
>> *  Some guidance on what values make sense for given rank default that might be set by
>>    a driver. If we have multiple GPU vendors, and someone mixes them in a system we
>>    probably don't want the default values they use to result in demotion between them.
>>    This might well be a guidance DOC or appropriate set of #define
>
> All of these are good ideas, though I am afraid that these can make
> tier management too complex for what it's worth.
>
> How about an alternative tier numbering scheme that uses major.minor
> device IDs?  For simplicity, we can just start with 3 major tiers.
> New tiers can be inserted in-between using minor tier IDs.


What drives the creation of a new memory tier here?  Jonathan was
suggesting we could do something like writing to set_memtier to create
a new memory tier, e.g.:

$ echo "memtier128" > /sys/devices/system/node/node1/set_memtier

But I am wondering whether we should implement that now.  If we keep
the "rank" concept and keep the tier index (memtier0 is the memory tier
with index 0) separate from the rank, I assume we have enough
flexibility for a future extension that will allow us to create a
memory tier from userspace and assign it a rank value that places the
device before or after DRAM in the demotion order.

i.e., for now we will only have memtier0, memtier1 and memtier2.  We
won't add dynamic creation of memory tiers, and the above memory tiers
will have rank values 0, 1 and 2 respectively, matching the demotion
order 0 -> 1 -> 2.
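
For concreteness, here is a minimal userspace C sketch of such a fixed
three-tier layout, keeping the tier index separate from the rank; the
struct and field names are hypothetical, not from any posted patch:

#include <stdio.h>

#define MAX_MEMORY_TIERS 3

struct memory_tier {
	int index;	/* the N in memtierN */
	int rank;	/* smaller rank == faster tier */
};

/* memtier0/1/2 with ranks 0/1/2, i.e. demotion order 0 -> 1 -> 2 */
static const struct memory_tier memory_tiers[MAX_MEMORY_TIERS] = {
	{ .index = 0, .rank = 0 },
	{ .index = 1, .rank = 1 },
	{ .index = 2, .rank = 2 },
};

int main(void)
{
	for (int i = 0; i < MAX_MEMORY_TIERS; i++)
		printf("memtier%d: rank %d\n",
		       memory_tiers[i].index, memory_tiers[i].rank);
	return 0;
}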

-aneesh

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-24 13:26         ` Aneesh Kumar K.V
@ 2022-05-25  5:27           ` Wei Xu
  2022-05-25  7:47             ` Alistair Popple
  0 siblings, 1 reply; 47+ messages in thread
From: Wei Xu @ 2022-05-25  5:27 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Jonathan Cameron, Dave Hansen, Alistair Popple, Huang Ying,
	Andrew Morton, Greg Thelen, Yang Shi, Linux Kernel Mailing List,
	Jagdish Gediya, Michal Hocko, Tim C Chen, Baolin Wang, Feng Tang,
	Davidlohr Bueso, Dan Williams, David Rientjes, Linux MM,
	Brice Goglin, Hesham Almatary

On Tue, May 24, 2022 at 6:27 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> Wei Xu <weixugc@google.com> writes:
>
> > On Wed, May 18, 2022 at 5:00 AM Jonathan Cameron
> > <Jonathan.Cameron@huawei.com> wrote:
> >>
> >> On Wed, 18 May 2022 00:09:48 -0700
> >> Wei Xu <weixugc@google.com> wrote:
>
> ...
>
> > Nice :)
> >>
> >> Initially I thought this was over complicated when compared to just leaving space, but
> >> after a chat with Hesham just now you have us both convinced that this is an elegant solution.
> >>
> >> Few corners probably need fleshing out:
> >> *  Use of an allocator for new tiers. Flat number at startup, or new one on write of unique
> >>    value to set_memtier perhaps?  Also whether to allow drivers to allocate (I think
> >>    we should).
> >> *  Multiple tiers with same rank.  My assumption is from demotion path point of view you
> >>    fuse them (treat them as if they were a single tier), but keep them expressed
> >>    separately in the sysfs interface so that the rank can be changed independently.
> >> *  Some guidance on what values make sense for given rank default that might be set by
> >>    a driver. If we have multiple GPU vendors, and someone mixes them in a system we
> >>    probably don't want the default values they use to result in demotion between them.
> >>    This might well be a guidance DOC or appropriate set of #define
> >
> > All of these are good ideas, though I am afraid that these can make
> > tier management too complex for what it's worth.
> >
> > How about an alternative tier numbering scheme that uses major.minor
> > device IDs?  For simplicity, we can just start with 3 major tiers.
> > New tiers can be inserted in-between using minor tier IDs.
>
>
> What drives the creation of a new memory tier here?  Jonathan was
> suggesting we could do something similar to writing to set_memtier for
> creating a new memory tier.
>
> $ echo "memtier128" > sys/devices/system/node/node1/set_memtier
>
> But I am wondering whether we should implement that now. If we keep
> "rank" concept and detach tier index (memtier0 is the memory tier with
> index 0) separate from rank, I assume we have enough flexibility for a
> future extension that will allow us to create a memory tier from userspace
> and assigning it a rank value that helps the device to be placed before or
> after DRAM in demotion order.
>
> ie, For now we will only have memtier0, memtier1, memtier2. We won't add
> dynamic creation of memory tiers and the above memory tiers will have
> rank value 0, 1, 2 according with demotion order 0 -> 1 -> 2.

Great. So the consensus is to go with the "rank" approach.  The above
sounds good to me as a starting point.

> -aneesh

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-24  8:24         ` Ying Huang
@ 2022-05-25  5:32           ` Wei Xu
  2022-05-25  9:03             ` Ying Huang
  0 siblings, 1 reply; 47+ messages in thread
From: Wei Xu @ 2022-05-25  5:32 UTC (permalink / raw)
  To: Ying Huang
  Cc: Jonathan Cameron, Andrew Morton, Greg Thelen, Aneesh Kumar K.V,
	Yang Shi, Linux Kernel Mailing List, Jagdish Gediya,
	Michal Hocko, Tim C Chen, Dave Hansen, Alistair Popple,
	Baolin Wang, Feng Tang, Davidlohr Bueso, Dan Williams,
	David Rientjes, Linux MM, Brice Goglin, Hesham Almatary

On Tue, May 24, 2022 at 1:24 AM Ying Huang <ying.huang@intel.com> wrote:
>
> On Tue, 2022-05-24 at 00:04 -0700, Wei Xu wrote:
> > On Thu, May 19, 2022 at 8:06 PM Ying Huang <ying.huang@intel.com> wrote:
> > >
> > > On Wed, 2022-05-18 at 00:09 -0700, Wei Xu wrote:
> > > > On Thu, May 12, 2022 at 8:00 AM Jonathan Cameron
> > > > <Jonathan.Cameron@huawei.com> wrote:
> > > > >
> > > > > On Wed, 11 May 2022 23:22:11 -0700
> > > > > Wei Xu <weixugc@google.com> wrote:
> > > > > > The current kernel has the basic memory tiering support: Inactive
> > > > > > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > > > > > tier NUMA node to make room for new allocations on the higher tier
> > > > > > NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> > > > > > migrated (promoted) to a higher tier NUMA node to improve the
> > > > > > performance.
> > > > > >
> > > > > > In the current kernel, memory tiers are defined implicitly via a
> > > > > > demotion path relationship between NUMA nodes, which is created during
> > > > > > the kernel initialization and updated when a NUMA node is hot-added or
> > > > > > hot-removed.  The current implementation puts all nodes with CPU into
> > > > > > the top tier, and builds the tier hierarchy tier-by-tier by establishing
> > > > > > the per-node demotion targets based on the distances between nodes.
> > > > > >
> > > > > > This current memory tier kernel interface needs to be improved for
> > > > > > several important use cases:
> > > > > >
> > > > > > * The current tier initialization code always initializes
> > > > > >   each memory-only NUMA node into a lower tier.  But a memory-only
> > > > > >   NUMA node may have a high performance memory device (e.g. a DRAM
> > > > > >   device attached via CXL.mem or a DRAM-backed memory-only node on
> > > > > >   a virtual machine) and should be put into a higher tier.
> > > > > >
> > > > > > * The current tier hierarchy always puts CPU nodes into the top
> > > > > >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> > > > > >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> > > > > >   with CPUs are better to be placed into the next lower tier.
> > > > > >
> > > > > > * Also because the current tier hierarchy always puts CPU nodes
> > > > > >   into the top tier, when a CPU is hot-added (or hot-removed) and
> > > > > >   triggers a memory node from CPU-less into a CPU node (or vice
> > > > > >   versa), the memory tier hierarchy gets changed, even though no
> > > > > >   memory node is added or removed.  This can make the tier
> > > > > >   hierarchy unstable and make it difficult to support tier-based
> > > > > >   memory accounting.
> > > > > >
> > > > > > * A higher tier node can only be demoted to selected nodes on the
> > > > > >   next lower tier as defined by the demotion path, not any other
> > > > > >   node from any lower tier.  This strict, hard-coded demotion order
> > > > > >   does not work in all use cases (e.g. some use cases may want to
> > > > > >   allow cross-socket demotion to another node in the same demotion
> > > > > >   tier as a fallback when the preferred demotion node is out of
> > > > > >   space), and has resulted in the feature request for an interface to
> > > > > >   override the system-wide, per-node demotion order from the
> > > > > >   userspace.  This demotion order is also inconsistent with the page
> > > > > >   allocation fallback order when all the nodes in a higher tier are
> > > > > >   out of space: The page allocation can fall back to any node from
> > > > > >   any lower tier, whereas the demotion order doesn't allow that.
> > > > > >
> > > > > > * There are no interfaces for the userspace to learn about the memory
> > > > > >   tier hierarchy in order to optimize its memory allocations.
> > > > > >
> > > > > > I'd like to propose revised memory tier kernel interfaces based on
> > > > > > the discussions in the threads:
> > > > > >
> > > > > > - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> > > > > > - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
> > > > > > - https://lore.kernel.org/linux-mm/867bc216386eb6cbf54648f23e5825830f5b922e.camel@intel.com/T/
> > > > > >
> > > > > >
> > > > > > High-level Design Ideas
> > > > > > =======================
> > > > > >
> > > > > > * Define memory tiers explicitly, not implicitly.
> > > > > >
> > > > > > * Memory tiers are defined based on hardware capabilities of memory
> > > > > >   nodes, not their relative node distances between each other.
> > > > > >
> > > > > > * The tier assignment of each node is independent from each other.
> > > > > >   Moving a node from one tier to another tier doesn't affect the tier
> > > > > >   assignment of any other node.
> > > > > >
> > > > > > * The node-tier association is stable. A node can be reassigned to a
> > > > > >   different tier only under the specific conditions that don't block
> > > > > >   future tier-based memory cgroup accounting.
> > > > > >
> > > > > > * A node can demote its pages to any nodes of any lower tiers. The
> > > > > >   demotion target node selection follows the allocation fallback order
> > > > > >   of the source node, which is built based on node distances.  The
> > > > > >   demotion targets are also restricted to only the nodes from the tiers
> > > > > >   lower than the source node.  We no longer need to maintain a separate
> > > > > >   per-node demotion order (node_demotion[]).
> > > > > >
> > > > >
> > > > > Hi Wei,
> > > > >
> > > > > This proposal looks good to me, though we'll be having fun
> > > > > white boarding topologies from our roadmaps for the next few days :)
> > > >
> > > > That's good to hear.
> > > >
> > > > > A few comments inline. It also seems likely to me that there is little
> > > > > benefit in starting with 3 tiers as the maximum.  Seems unlikely the
> > > > > code will be substantially simpler for 3 than it would be for 4 or 5.
> > > > > I've drawn out one simple case that needs 4 to do sensible things.
> > > >
> > > > We can make the number of tiers a config option. 3 tiers are just what
> > > > the kernel can reasonably initialize when there isn't enough hardware
> > > > performance information from the firmware.
> > > >
> > > > > >
> > > > > > Sysfs Interfaces
> > > > > > ================
> > > > > >
> > > > > > * /sys/devices/system/memtier/memtierN/nodelist
> > > > > >
> > > > > >   where N = 0, 1, 2 (the kernel supports only 3 tiers for now).
> > > > > >
> > > > > >   Format: node_list
> > > > > >
> > > > > >   Read-only.  When read, list the memory nodes in the specified tier.
> > > > > >
> > > > > >   Tier 0 is the highest tier, while tier 2 is the lowest tier.
> > > > > >
> > > > > >   The absolute value of a tier id number has no specific meaning.
> > > > > >   What matters is the relative order of the tier id numbers.
> > > > > >
> > > > > >   When a memory tier has no nodes, the kernel can hide its memtier
> > > > > >   sysfs files.
> > > > > >
> > > > > > * /sys/devices/system/node/nodeN/memtier
> > > > > >
> > > > > >   where N = 0, 1, ...
> > > > > >
> > > > > >   Format: int or empty
> > > > > >
> > > > > >   When read, list the memory tier that the node belongs to.  Its value
> > > > > >   is empty for a CPU-only NUMA node.
> > > > > >
> > > > > >   When written, the kernel moves the node into the specified memory
> > > > > >   tier if the move is allowed.  The tier assignment of all other nodes
> > > > > >   are not affected.
> > > > > >
> > > > > >   Initially, we can make this interface read-only.
> > > > > >
> > > > > >
> > > > > > Kernel Representation
> > > > > > =====================
> > > > > >
> > > > > > * All memory tiering code is guarded by CONFIG_TIERED_MEMORY.
> > > > > >
> > > > > > * #define MAX_MEMORY_TIERS 3
> > > > > >
> > > > > >   Support 3 memory tiers for now.
> > > > > >
> > > > > > * #define MEMORY_DEFAULT_TIER 1
> > > > > >
> > > > > >   The default tier that a memory node is assigned to.
> > > > > >
> > > > > > * nodemask_t memory_tiers[MAX_MEMORY_TIERS]
> > > > > >
> > > > > >   Store memory nodes by tiers.
> > > > > >
> > > > > > * int node_tier_map[MAX_NUMNODES]
> > > > > >
> > > > > >   Map a node to its tier.
> > > > > >
> > > > > >   For each CPU-only node c, node_tier_map[c] = -1.
> > > > > >
> > > > > >
> > > > > > Memory Tier Initialization
> > > > > > ==========================
> > > > > >
> > > > > > By default, all memory nodes are assigned to the default tier
> > > > > > (MEMORY_DEFAULT_TIER).
> > > > >
> > > > > This is tighter than it needs to be.  In many cases we can easily
> > > > > establish if there is any possibility of CPU being hotplugged into
> > > > > a memory node.  If it's CXL attached no way CPUs are going to be
> > > > > turning up their later :)  If CPU HP into a given node can't happen
> > > > > we can be more flexible and I think that often results in better decisions.
> > > > > See example below, though obviously I could just use the userspace
> > > > > interface to fix that up anyway or have a CXL driver move it around
> > > > > if that's relevant.  In some other cases I'm fairly sure we know in
> > > > > advance where CPUs can be added but I'd need to check all the
> > > > > relevant specs to be sure there aren't any corner cases.  I 'think'
> > > > > for ARM for example we know where all possible CPUs can be hotplugged
> > > > > (constraint coming from the interrupt controller + the fact that only
> > > > > virtual CPU HP is defined).
> > > >
> > > > We may not always want to put a CXL-attached memory device into a
> > > > slower tier because even though CXL does add some additional latency,
> > > > both the memory device and CXL can still be very capable in
> > > > performance and may not be much slower (if any) than the on-board DRAM
> > > > (e.g. DRAM from a remote CPU socket).
> > > >
> > > > Also, the default tier here is just the initial tier assignment of
> > > > each node, which behaves as if there were no tiering.  A tiering
> > > > kernel init function can certainly reassign the tier for each node if
> > > > it knows enough about the hardware performance for these nodes from
> > > > the firmware.
> > > >
> > > > > >
> > > > > > A device driver can move up or down its memory nodes from the default
> > > > > > tier.  For example, PMEM can move down its memory nodes below the
> > > > > > default tier, whereas GPU can move up its memory nodes above the
> > > > > > default tier.
> > > > > >
> > > > > > The kernel initialization code makes the decision on which exact tier
> > > > > > a memory node should be assigned to based on the requests from the
> > > > > > device drivers as well as the memory device hardware information
> > > > > > provided by the firmware.
> > > > > >
> > > > > >
> > > > > > Memory Tier Reassignment
> > > > > > ========================
> > > > > >
> > > > > > After a memory node is hot-removed, it can be hot-added back to a
> > > > > > different memory tier.  This is useful for supporting dynamically
> > > > > > provisioned CXL.mem NUMA nodes, which may connect to different
> > > > > > memory devices across hot-plug events.  Such tier changes should
> > > > > > be compatible with tier-based memory accounting.
> > > > > >
> > > > > > The userspace may also reassign an existing online memory node to a
> > > > > > different tier.  However, this should only be allowed when no pages
> > > > > > are allocated from the memory node or when there are no non-root
> > > > > > memory cgroups (e.g. during the system boot).  This restriction is
> > > > > > important for keeping memory tier hierarchy stable enough for
> > > > > > tier-based memory cgroup accounting.
> > > > > >
> > > > > > Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
> > > > > >
> > > > > >
> > > > > > Memory Allocation for Demotion
> > > > > > ==============================
> > > > > >
> > > > > > To allocate a new page as the demotion target for a page, the kernel
> > > > > > calls the allocation function (__alloc_pages_nodemask) with the
> > > > > > source page node as the preferred node and the union of all lower
> > > > > > tier nodes as the allowed nodemask.  The actual target node selection
> > > > > > then follows the allocation fallback order that the kernel has
> > > > > > already defined.
> > > > > >
> > > > > > The pseudo code looks like:
> > > > > >
> > > > > >     targets = NODE_MASK_NONE;
> > > > > >     src_nid = page_to_nid(page);
> > > > > >     src_tier = node_tier_map[src_nid];
> > > > > >     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
> > > > > >             nodes_or(targets, targets, memory_tiers[i]);
> > > > > >     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
> > > > > >
> > > > > > The memopolicy of cpuset, vma and owner task of the source page can
> > > > > > be set to refine the demotion target nodemask, e.g. to prevent
> > > > > > demotion or select a particular allowed node as the demotion target.
> > > > > >
> > > > > >
> > > > > > Memory Allocation for Promotion
> > > > > > ===============================
> > > > > >
> > > > > > The page allocation for promotion is similar to demotion, except that (1)
> > > > > > the target nodemask uses the promotion tiers, (2) the preferred node can
> > > > > > be the accessing CPU node, not the source page node.
> > > > > >
> > > > > >
> > > > > > Examples
> > > > > > ========
> > > > > >
> > > > >
> > > > > ...
> > > > >
> > > > > > * Example 3:
> > > > > >
> > > > > > Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
> > > > >
> > > > > Node2 is drawn as pmem.
> > > >
> > > > Typo. Good catch.
> > > >
> > > > > >
> > > > > > All nodes are in the same tier.
> > > > > >
> > > > > >                   20
> > > > > >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> > > > > >          \                 /
> > > > > >           \ 30            / 30
> > > > > >            \             /
> > > > > >              Node 2 (PMEM)
> > > > > >
> > > > > > node distances:
> > > > > > node   0    1    2
> > > > > >    0  10   20   30
> > > > > >    1  20   10   30
> > > > > >    2  30   30   10
> > > > > >
> > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > <empty>
> > > > > > 0-2
> > > > > > <empty>
> > > > > >
> > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > 1
> > > > > > 1
> > > > > > 1
> > > > > >
> > > > > > Demotion fallback order:
> > > > > > node 0: empty
> > > > > > node 1: empty
> > > > > > node 2: empty
> > > > > >
> > > > > >
> > > > > > * Example 4:
> > > > > >
> > > > > > Node 0 is a DRAM node with CPU.
> > > > > > Node 1 is a PMEM node.
> > > > > > Node 2 is a GPU node.
> > > > > >
> > > > > >                   50
> > > > > >   Node 0 (DRAM)  ----  Node 2 (GPU)
> > > > > >          \                 /
> > > > > >           \ 30            / 60
> > > > > >            \             /
> > > > > >              Node 1 (PMEM)
> > > > > >
> > > > > > node distances:
> > > > > > node   0    1    2
> > > > > >    0  10   30   50
> > > > > >    1  30   10   60
> > > > > >    2  50   60   10
> > > > > >
> > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > 2
> > > > > > 0
> > > > > > 1
> > > > > >
> > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > 1
> > > > > > 2
> > > > > > 0
> > > > > >
> > > > > > Demotion fallback order:
> > > > > > node 0: 1
> > > > > > node 1: empty
> > > > > > node 2: 0, 1
> > > > > >
> > > > > >
> > > > > > * Example 5:
> > > > > >
> > > > > > Node 0 is a DRAM node with CPU.
> > > > > > Node 1 is a GPU node.
> > > > > > Node 2 is a PMEM node.
> > > > > > Node 3 is a large, slow DRAM node without CPU.
> > > > > >
> > > > > >
> > > > > >      Node 2 (PMEM)  ----
> > > > > >    /      |              \
> > > > > >   /       | 30            \ 120
> > > > > >  |        |         100    \
> > > > > >  |   Node 0 (DRAM)  ----  Node 1 (GPU)
> > > > > >   \         \                 /
> > > > > >     \        \ 40            / 110
> > > > > >   80  \       \             /
> > > > > >         ---  Node 3 (Slow DRAM)
> > > > >
> > > > > This is close but not quite what was intended for Hesham's
> > > > > example... (note we just checked that Hesham's original node0-1
> > > > > timing didn't make any sense.).
> > > > >
> > > >
> > > > This was inspired by Hesham's example. But I should have also included
> > > > the version that illustrates the need to skip a tier when demoting
> > > > from certain nodes.
> > > >
> > > > > >
> > > > > > node distances:
> > > > > > node    0    1    2    3
> > > > > >    0   10  100   30   40
> > > > > >    1  100   10  120  110
> > > > > >    2   30  120   10   80
> > > > > >    3   40  110   80   10
> > > > > >
> > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > 1
> > > > > > 0,3
> > > > > > 2
> > > > > >
> > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > 1
> > > > > > 0
> > > > > > 2
> > > > > > 1
> > > > > >
> > > > > > Demotion fallback order:
> > > > > > node 0: 2
> > > > > > node 1: 0, 3, 2
> > > > > > node 2: empty
> > > > > > node 3: 2
> > > > >
> > > > > This is close but not quite the same as the example
> > > > > Hesham gave (note the node timing 1 to 0 on in the table
> > > > > with that example didn't make sense).  I added another
> > > > > level of switching to make the numbers more obviously
> > > > > different and show how critical it might be.
> > > > >
> > > > > * Example 6:
> > > > >
> > > > > Node 0 is a DRAM node with CPU.
> > > > > Node 1 is a GPU node.
> > > > > Node 2 is a PMEM node.
> > > > > Node 3 is an extremely large, DRAM node without CPU.
> > > > >   (Key point here being that it probably never makes sense
> > > > >    to demote to anywhere else from this memory).
> > > > >
> > > > >
> > > > > I've redone the timings wrt to example 5.
> > > > > Basis for this is 0 and 2 are directly connected
> > > > > via controllers in an SoC. 1 and 3 are connected
> > > > > via a a common switch one switch down switch
> > > > > (each hop via this is 100)
> > > > > All drams cost 10 once you've reached correct node
> > > > > and pmem costs 30 from SoC.
> > > > > Numbers get too large as a result but meh, I'm making
> > > > > a point not providing real numbers :)
> > > > >
> > > > >          PMEM Node 2
> > > > >             |(30)
> > > > >         CPU + DRAM Node0
> > > > >             |(100)
> > > > >          Switch 1
> > > > >             |(100)
> > > > >           Switch 2
> > > > >     (100)  |      |(100)
> > > > > Node 1 GPU     Node3 Large memory.
> > > > >
> > > > >
> > > > > With one level of s
> > > > >
> > > > >      Node 2 (PMEM)  ----
> > > > >     /      |              \
> > > > >    /       | 30            \ 330
> > > > >   |        |         310    \
> > > > >   |   Node 0 (DRAM)  ----  Node 1 (GPU)
> > > > >    \         \                 /
> > > > >      \        \ 310           / 210
> > > > >    330 \       \             /
> > > > >          ---  Node 3 (Extremely large DRAM)
> > > > >
> > > > > To my mind, we should potentially also take into account
> > > > > the fact that Node3 can be known to never contain CPUs
> > > > > (in at least some architectures we know where the CPUs
> > > > >  might be added later, they can't just magically turn up
> > > > >  anywhere in the topology).
> > > > >
> > > > > node distances:
> > > > > node    0    1    2    3
> > > > >     0   10   310  30   310
> > > > >     1   310  10   330  210
> > > > >     2   30   330  10   330
> > > > >     3   310  210  330   10
> > > > >
> > > > > So, my ideal would treat node 3 different from other dram nodes
> > > > > as we know it can't have CPUs. Trying to come up with an
> > > > > always correct order for nodes 3 and 2 is tricky as to a certain
> > > > > extent depends on capacity. If node 2 was  big enough to take
> > > > > any demotion from node 0 and still have lots of room then demoting
> > > > > there form node 3 would make sense and visa versa.
> > > > >
> > > > >
> > > > >  $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > >  1
> > > > >  0
> > > > >  2
> > > > >  3
> > > > >
> > > > >
> > > > >  $ cat /sys/devices/system/node/node*/memtier
> > > > >   1
> > > > >   0
> > > > >   2
> > > > >   3
> > > > >
> > > > >  Demotion fallback order:
> > > > >  node 0: 2, 3
> > > > >  node 1: 3, 0, 2 (key being we will almost always have less pressure on node 3)
> > > > >  node 2: 3
> > > > >  node 3: empty
> > > > >
> > > > > or as Hesham just pointed out this can be done with 3 tiers
> > > > > because we can put the GPU and CPU in the same tier because
> > > > > their is little reason to demote from one to the other.
> > > >
> > > > Thank you for the example.  It makes sense to me to have node 3 on its
> > > > own tier.  We can have either 3 tiers or 4 tiers in total (assuming
> > > > that the max number of tiers is a config option).
> > > >
> > > > > We are also a bit worried about ABI backwards compatibility because
> > > > > of potential need to make more space in tiers lower in number than
> > > > > CPU attached DDR. I rather liked the negative proposal with
> > > > > default as 0 that Huang, Ying made.
> > > >
> > > > It is hard to have negative values as the device IDs.
> > > >
> > > > The current proposal equals the tier device ID to the tier hierarchy
> > > > level, which makes the interface simpler, but less flexible.  How
> > > > about the following proposal (which decouples the tier device ID from
> > > > the tier level)?
> > > >
> > > > /sys/devices/system/memtier/memtierN/nodelist
> > > > /sys/devices/system/memtier/memtierN/rank
> > > >
> > > > Each memory tier N has two sysfs files:
> > > > - nodelist: the nodes that are in this tier
> > > > - rank: an opaque value that helps decide the level at which this tier
> > > > is in the tier hierarchy (smaller value means faster tier)
> > > >
> > > > The tier hierarchy is determined by "rank", not by the device id
> > > > number N from "memtierN".
> > > >
> > > > The absolute value of "rank" of a memtier doesn't necessarily carry
> > > > any meaning. Its value relative to other memtiers decides the level of
> > > > this memtier in the tier hierarchy.
> > > >
> > > > The CPU-attached DRAM nodes are always in memtier0 (the device ID),
> > > > but memtier0 may not always be the top-tier, e.g. its level can be 3
> > > > in a 5-tier system.
> > > >
> > > > For the above example (example 6), we can have:
> > > >
> > > > $ ls /sys/devices/system/memtier
> > > > memtier0
> > > > memtier1
> > > > memtier2
> > > > memtier128
> > > >
> > > > $ cat /sys/devices/system/memtier/memtier*/rank
> > > > 50
> > > > 60
> > > > 70
> > > > 10
> > >
> > > I understand that the device ID cannot be negtive.  So we have to use
> > > rank.  Can we make it possible to allow "rank" to be negtive?
> >
> > It is possible to allow "rank" to be negative, though I think all
> > positive values should work equally well.
> >
> > > Another choice is to do some trick on device ID.  For example, the CPU-
> > > attached DRAM node are always memtier100 (the device ID).  Then we can
> > > have memtier99, memtier100, memtier101, memteri102, ....  That's not
> > > perfect too.
> >
> > If we go with the device ID tricks, one approach is to use sub-device IDs:
> >
> > - There are 3 major tiers: tier0 (e.g. GPU), tier1 (e.g.DRAM) and
> > tier2 (e.g. PMEM).
> >
> > - Each major tier can have minor tiers, e.g. tier0.0, tier1.0,
> > tier1.1, tier2.0, tier2.1.
> >
> > The earlier 4-tier example can be represented as:
> >
> > memtier0.0 -> memtier1.0 -> memtier2.0 -> memtier2.1
> >
> > We can also omit .0 so that the tiers are:
> >
> > memtier0 -> memtier1 -> memtier2 -> memtier2.1
> >
> > This should be flexible enough to support multiple tiers while keeping
> > the tier IDs relatively stable.
> >
> > It is not as flexible as the rank approach. For example, to insert a
> > new tier between 2.0 and 2.1, we need to add a tier 2.2 and reassign
> > existing nodes to these 3 tiers.  Using "rank", we can insert a new
> > tier and only move desired nodes into the new tier.
> >
> > What do you think?
>
> The rank approach looks better for.  And if we stick with the device ID
> rule as follows,
>
> ...
> 255     GPU
> 0       DRAM
> 1       PMEM
> 2
> ...
>
> 255 is -1 for "s8".
>
> The device ID should do most tricks at least now.  The rank can provide
> more flexibility in the future.  We can even go without rank in the
> first version, and introduce it when it's necessary.

Given that the "rank" approach is generally favored, let's go with
that to avoid compatibility issues that may come from the switch of
device ID tricks to ranks.
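
As a hedged sketch of why this sidesteps the compatibility concern:
with ranks, the demotion direction depends only on the rank values,
never on the tier device IDs, so new tiers can take any ID later.  The
helper below is hypothetical; the IDs and ranks are taken from the
example 6 proposal quoted above (memtier128/0/1/2 at ranks 10/50/60/70):

#include <stdbool.h>
#include <stdio.h>

struct memtier {
	int id;		/* the N in memtierN */
	int rank;	/* smaller rank == faster tier */
};

static const struct memtier tiers[] = {
	{ .id = 128, .rank = 10 },	/* GPU (node 1) */
	{ .id = 0,   .rank = 50 },	/* CPU-attached DRAM (node 0) */
	{ .id = 1,   .rank = 60 },	/* PMEM (node 2) */
	{ .id = 2,   .rank = 70 },	/* large CPU-less DRAM (node 3) */
};

/* A tier may demote only to tiers with a larger (slower) rank. */
static bool can_demote(const struct memtier *from, const struct memtier *to)
{
	return to->rank > from->rank;
}

int main(void)
{
	size_t n = sizeof(tiers) / sizeof(tiers[0]);

	for (size_t i = 0; i < n; i++) {
		printf("memtier%d demotes to:", tiers[i].id);
		for (size_t j = 0; j < n; j++)
			if (can_demote(&tiers[i], &tiers[j]))
				printf(" memtier%d", tiers[j].id);
		printf("\n");
	}
	return 0;
}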

> Best Regards,
> Huang, Ying
>
> > > > The tier order: memtier128 -> memtier0 -> memtier1 -> memtier2
> > > >
> > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > 0
> > > > 2
> > > > 3
> > > > 1
> > > >
> > > > $ ls -l /sys/devices/system/node/node*/memtier
> > > > /sys/devices/system/node/node0/memtier -> /sys/devices/system/memtier/memtier0
> > > > /sys/devices/system/node/node1/memtier -> /sys/devices/system/memtier/memtier128
> > > > /sys/devices/system/node/node2/memtier -> /sys/devices/system/memtier/memtier1
> > > > /sys/devices/system/node/node3/memtier -> /sys/devices/system/memtier/memtier2
> > > >
> > > > To override the memory tier of a node, we can use a new, write-only,
> > > > per-node interface file:
> > > >
> > > > /sys/devices/system/node/nodeN/set_memtier
> > > >
> > > > e.g.
> > > >
> > > > $ echo "memtier128" > sys/devices/system/node/node1/set_memtier
> > >
> > > I prefer the original proposal to make nodeX/memtier a normal file to
> > > hold memtier devicde ID instead of a link.
> >
> > OK. We don't have to use a symlink.
> >
> > > Best Regards,
> > > Huang, Ying
> > >
> > > > Any comments?
> > > >
> > > > > Jonathan
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > >
> > >
> > >
>
>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-25  5:27           ` Wei Xu
@ 2022-05-25  7:47             ` Alistair Popple
  2022-05-25 11:48               ` Jonathan Cameron
  0 siblings, 1 reply; 47+ messages in thread
From: Alistair Popple @ 2022-05-25  7:47 UTC (permalink / raw)
  To: Wei Xu
  Cc: Aneesh Kumar K.V, Jonathan Cameron, Dave Hansen, Huang Ying,
	Andrew Morton, Greg Thelen, Yang Shi, Linux Kernel Mailing List,
	Jagdish Gediya, Michal Hocko, Tim C Chen, Baolin Wang, Feng Tang,
	Davidlohr Bueso, Dan Williams, David Rientjes, Linux MM,
	Brice Goglin, Hesham Almatary


Wei Xu <weixugc@google.com> writes:

> On Tue, May 24, 2022 at 6:27 AM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> Wei Xu <weixugc@google.com> writes:
>>
>> > On Wed, May 18, 2022 at 5:00 AM Jonathan Cameron
>> > <Jonathan.Cameron@huawei.com> wrote:
>> >>
>> >> On Wed, 18 May 2022 00:09:48 -0700
>> >> Wei Xu <weixugc@google.com> wrote:
>>
>> ...
>>
>> > Nice :)
>> >>
>> >> Initially I thought this was over complicated when compared to just leaving space, but
>> >> after a chat with Hesham just now you have us both convinced that this is an elegant solution.
>> >>
>> >> Few corners probably need fleshing out:
>> >> *  Use of an allocator for new tiers. Flat number at startup, or new one on write of unique
>> >>    value to set_memtier perhaps?  Also whether to allow drivers to allocate (I think
>> >>    we should).
>> >> *  Multiple tiers with same rank.  My assumption is from demotion path point of view you
>> >>    fuse them (treat them as if they were a single tier), but keep them expressed
>> >>    separately in the sysfs interface so that the rank can be changed independently.
>> >> *  Some guidance on what values make sense for given rank default that might be set by
>> >>    a driver. If we have multiple GPU vendors, and someone mixes them in a system we
>> >>    probably don't want the default values they use to result in demotion between them.
>> >>    This might well be a guidance DOC or appropriate set of #define
>> >
>> > All of these are good ideas, though I am afraid that these can make
>> > tier management too complex for what it's worth.
>> >
>> > How about an alternative tier numbering scheme that uses major.minor
>> > device IDs?  For simplicity, we can just start with 3 major tiers.
>> > New tiers can be inserted in-between using minor tier IDs.
>>
>>
>> What drives the creation of a new memory tier here?  Jonathan was
>> suggesting we could do something similar to writing to set_memtier for
>> creating a new memory tier.
>>
>> $ echo "memtier128" > sys/devices/system/node/node1/set_memtier
>>
>> But I am wondering whether we should implement that now. If we keep
>> "rank" concept and detach tier index (memtier0 is the memory tier with
>> index 0) separate from rank, I assume we have enough flexibility for a
>> future extension that will allow us to create a memory tier from userspace
>> and assigning it a rank value that helps the device to be placed before or
>> after DRAM in demotion order.
>>
>> ie, For now we will only have memtier0, memtier1, memtier2. We won't add
>> dynamic creation of memory tiers and the above memory tiers will have
>> rank value 0, 1, 2 according with demotion order 0 -> 1 -> 2.
>
> Great. So the consensus is to go with the "rank" approach.  The above
> sounds good to me as a starting point.

The rank approach seems good to me too.

 - Alistair

>> -aneesh

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-25  5:32           ` Wei Xu
@ 2022-05-25  9:03             ` Ying Huang
  2022-05-25 10:01               ` Aneesh Kumar K V
  2022-05-25 15:36               ` Wei Xu
  0 siblings, 2 replies; 47+ messages in thread
From: Ying Huang @ 2022-05-25  9:03 UTC (permalink / raw)
  To: Wei Xu
  Cc: Jonathan Cameron, Andrew Morton, Greg Thelen, Aneesh Kumar K.V,
	Yang Shi, Linux Kernel Mailing List, Jagdish Gediya,
	Michal Hocko, Tim C Chen, Dave Hansen, Alistair Popple,
	Baolin Wang, Feng Tang, Davidlohr Bueso, Dan Williams,
	David Rientjes, Linux MM, Brice Goglin, Hesham Almatary

On Tue, 2022-05-24 at 22:32 -0700, Wei Xu wrote:
> On Tue, May 24, 2022 at 1:24 AM Ying Huang <ying.huang@intel.com> wrote:
> > 
> > On Tue, 2022-05-24 at 00:04 -0700, Wei Xu wrote:
> > > On Thu, May 19, 2022 at 8:06 PM Ying Huang <ying.huang@intel.com> wrote:
> > > > 
> > > > On Wed, 2022-05-18 at 00:09 -0700, Wei Xu wrote:
> > > > > On Thu, May 12, 2022 at 8:00 AM Jonathan Cameron
> > > > > <Jonathan.Cameron@huawei.com> wrote:
> > > > > > 
> > > > > > On Wed, 11 May 2022 23:22:11 -0700
> > > > > > Wei Xu <weixugc@google.com> wrote:
> > > > > > > The current kernel has the basic memory tiering support: Inactive
> > > > > > > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > > > > > > tier NUMA node to make room for new allocations on the higher tier
> > > > > > > NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> > > > > > > migrated (promoted) to a higher tier NUMA node to improve the
> > > > > > > performance.
> > > > > > > 
> > > > > > > In the current kernel, memory tiers are defined implicitly via a
> > > > > > > demotion path relationship between NUMA nodes, which is created during
> > > > > > > the kernel initialization and updated when a NUMA node is hot-added or
> > > > > > > hot-removed.  The current implementation puts all nodes with CPU into
> > > > > > > the top tier, and builds the tier hierarchy tier-by-tier by establishing
> > > > > > > the per-node demotion targets based on the distances between nodes.
> > > > > > > 
> > > > > > > This current memory tier kernel interface needs to be improved for
> > > > > > > several important use cases:
> > > > > > > 
> > > > > > > * The current tier initialization code always initializes
> > > > > > >   each memory-only NUMA node into a lower tier.  But a memory-only
> > > > > > >   NUMA node may have a high performance memory device (e.g. a DRAM
> > > > > > >   device attached via CXL.mem or a DRAM-backed memory-only node on
> > > > > > >   a virtual machine) and should be put into a higher tier.
> > > > > > > 
> > > > > > > * The current tier hierarchy always puts CPU nodes into the top
> > > > > > >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> > > > > > >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> > > > > > >   with CPUs are better to be placed into the next lower tier.
> > > > > > > 
> > > > > > > * Also because the current tier hierarchy always puts CPU nodes
> > > > > > >   into the top tier, when a CPU is hot-added (or hot-removed) and
> > > > > > >   triggers a memory node from CPU-less into a CPU node (or vice
> > > > > > >   versa), the memory tier hierarchy gets changed, even though no
> > > > > > >   memory node is added or removed.  This can make the tier
> > > > > > >   hierarchy unstable and make it difficult to support tier-based
> > > > > > >   memory accounting.
> > > > > > > 
> > > > > > > * A higher tier node can only be demoted to selected nodes on the
> > > > > > >   next lower tier as defined by the demotion path, not any other
> > > > > > >   node from any lower tier.  This strict, hard-coded demotion order
> > > > > > >   does not work in all use cases (e.g. some use cases may want to
> > > > > > >   allow cross-socket demotion to another node in the same demotion
> > > > > > >   tier as a fallback when the preferred demotion node is out of
> > > > > > >   space), and has resulted in the feature request for an interface to
> > > > > > >   override the system-wide, per-node demotion order from the
> > > > > > >   userspace.  This demotion order is also inconsistent with the page
> > > > > > >   allocation fallback order when all the nodes in a higher tier are
> > > > > > >   out of space: The page allocation can fall back to any node from
> > > > > > >   any lower tier, whereas the demotion order doesn't allow that.
> > > > > > > 
> > > > > > > * There are no interfaces for the userspace to learn about the memory
> > > > > > >   tier hierarchy in order to optimize its memory allocations.
> > > > > > > 
> > > > > > > I'd like to propose revised memory tier kernel interfaces based on
> > > > > > > the discussions in the threads:
> > > > > > > 
> > > > > > > - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> > > > > > > - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
> > > > > > > - https://lore.kernel.org/linux-mm/867bc216386eb6cbf54648f23e5825830f5b922e.camel@intel.com/T/
> > > > > > > 
> > > > > > > 
> > > > > > > High-level Design Ideas
> > > > > > > =======================
> > > > > > > 
> > > > > > > * Define memory tiers explicitly, not implicitly.
> > > > > > > 
> > > > > > > * Memory tiers are defined based on hardware capabilities of memory
> > > > > > >   nodes, not their relative node distances between each other.
> > > > > > > 
> > > > > > > * The tier assignment of each node is independent from each other.
> > > > > > >   Moving a node from one tier to another tier doesn't affect the tier
> > > > > > >   assignment of any other node.
> > > > > > > 
> > > > > > > * The node-tier association is stable. A node can be reassigned to a
> > > > > > >   different tier only under the specific conditions that don't block
> > > > > > >   future tier-based memory cgroup accounting.
> > > > > > > 
> > > > > > > * A node can demote its pages to any nodes of any lower tiers. The
> > > > > > >   demotion target node selection follows the allocation fallback order
> > > > > > >   of the source node, which is built based on node distances.  The
> > > > > > >   demotion targets are also restricted to only the nodes from the tiers
> > > > > > >   lower than the source node.  We no longer need to maintain a separate
> > > > > > >   per-node demotion order (node_demotion[]).
> > > > > > > 
> > > > > > 
> > > > > > Hi Wei,
> > > > > > 
> > > > > > This proposal looks good to me, though we'll be having fun
> > > > > > white boarding topologies from our roadmaps for the next few days :)
> > > > > 
> > > > > That's good to hear.
> > > > > 
> > > > > > A few comments inline. It also seems likely to me that there is little
> > > > > > benefit in starting with 3 tiers as the maximum.  Seems unlikely the
> > > > > > code will be substantially simpler for 3 than it would be for 4 or 5.
> > > > > > I've drawn out one simple case that needs 4 to do sensible things.
> > > > > 
> > > > > We can make the number of tiers a config option. 3 tiers are just what
> > > > > the kernel can reasonably initialize when there isn't enough hardware
> > > > > performance information from the firmware.
> > > > > 
> > > > > > > 
> > > > > > > Sysfs Interfaces
> > > > > > > ================
> > > > > > > 
> > > > > > > * /sys/devices/system/memtier/memtierN/nodelist
> > > > > > > 
> > > > > > >   where N = 0, 1, 2 (the kernel supports only 3 tiers for now).
> > > > > > > 
> > > > > > >   Format: node_list
> > > > > > > 
> > > > > > >   Read-only.  When read, list the memory nodes in the specified tier.
> > > > > > > 
> > > > > > >   Tier 0 is the highest tier, while tier 2 is the lowest tier.
> > > > > > > 
> > > > > > >   The absolute value of a tier id number has no specific meaning.
> > > > > > >   What matters is the relative order of the tier id numbers.
> > > > > > > 
> > > > > > >   When a memory tier has no nodes, the kernel can hide its memtier
> > > > > > >   sysfs files.
> > > > > > > 
> > > > > > > * /sys/devices/system/node/nodeN/memtier
> > > > > > > 
> > > > > > >   where N = 0, 1, ...
> > > > > > > 
> > > > > > >   Format: int or empty
> > > > > > > 
> > > > > > >   When read, list the memory tier that the node belongs to.  Its value
> > > > > > >   is empty for a CPU-only NUMA node.
> > > > > > > 
> > > > > > >   When written, the kernel moves the node into the specified memory
> > > > > > >   tier if the move is allowed.  The tier assignment of all other nodes
> > > > > > >   are not affected.
> > > > > > > 
> > > > > > >   Initially, we can make this interface read-only.
> > > > > > > 
> > > > > > > 
> > > > > > > Kernel Representation
> > > > > > > =====================
> > > > > > > 
> > > > > > > * All memory tiering code is guarded by CONFIG_TIERED_MEMORY.
> > > > > > > 
> > > > > > > * #define MAX_MEMORY_TIERS 3
> > > > > > > 
> > > > > > >   Support 3 memory tiers for now.
> > > > > > > 
> > > > > > > * #define MEMORY_DEFAULT_TIER 1
> > > > > > > 
> > > > > > >   The default tier that a memory node is assigned to.
> > > > > > > 
> > > > > > > * nodemask_t memory_tiers[MAX_MEMORY_TIERS]
> > > > > > > 
> > > > > > >   Store memory nodes by tiers.
> > > > > > > 
> > > > > > > * int node_tier_map[MAX_NUMNODES]
> > > > > > > 
> > > > > > >   Map a node to its tier.
> > > > > > > 
> > > > > > >   For each CPU-only node c, node_tier_map[c] = -1.
> > > > > > > 
> > > > > > > 
> > > > > > > Memory Tier Initialization
> > > > > > > ==========================
> > > > > > > 
> > > > > > > By default, all memory nodes are assigned to the default tier
> > > > > > > (MEMORY_DEFAULT_TIER).
> > > > > > 
> > > > > > This is tighter than it needs to be.  In many cases we can easily
> > > > > > establish if there is any possibility of CPU being hotplugged into
> > > > > > a memory node.  If it's CXL attached no way CPUs are going to be
> > > > > > turning up their later :)  If CPU HP into a given node can't happen
> > > > > > we can be more flexible and I think that often results in better decisions.
> > > > > > See example below, though obviously I could just use the userspace
> > > > > > interface to fix that up anyway or have a CXL driver move it around
> > > > > > if that's relevant.  In some other cases I'm fairly sure we know in
> > > > > > advance where CPUs can be added but I'd need to check all the
> > > > > > relevant specs to be sure there aren't any corner cases.  I 'think'
> > > > > > for ARM for example we know where all possible CPUs can be hotplugged
> > > > > > (constraint coming from the interrupt controller + the fact that only
> > > > > > virtual CPU HP is defined).
> > > > > 
> > > > > We may not always want to put a CXL-attached memory device into a
> > > > > slower tier because even though CXL does add some additional latency,
> > > > > both the memory device and CXL can still be very capable in
> > > > > performance and may not be much slower (if any) than the on-board DRAM
> > > > > (e.g. DRAM from a remote CPU socket).
> > > > > 
> > > > > Also, the default tier here is just the initial tier assignment of
> > > > > each node, which behaves as if there were no tiering.  A tiering
> > > > > kernel init function can certainly reassign the tier for each node if
> > > > > it knows enough about the hardware performance for these nodes from
> > > > > the firmware.
> > > > > 
> > > > > > > 
> > > > > > > A device driver can move up or down its memory nodes from the default
> > > > > > > tier.  For example, PMEM can move down its memory nodes below the
> > > > > > > default tier, whereas GPU can move up its memory nodes above the
> > > > > > > default tier.
> > > > > > > 
> > > > > > > The kernel initialization code makes the decision on which exact tier
> > > > > > > a memory node should be assigned to based on the requests from the
> > > > > > > device drivers as well as the memory device hardware information
> > > > > > > provided by the firmware.
> > > > > > > 
> > > > > > > 
> > > > > > > Memory Tier Reassignment
> > > > > > > ========================
> > > > > > > 
> > > > > > > After a memory node is hot-removed, it can be hot-added back to a
> > > > > > > different memory tier.  This is useful for supporting dynamically
> > > > > > > provisioned CXL.mem NUMA nodes, which may connect to different
> > > > > > > memory devices across hot-plug events.  Such tier changes should
> > > > > > > be compatible with tier-based memory accounting.
> > > > > > > 
> > > > > > > The userspace may also reassign an existing online memory node to a
> > > > > > > different tier.  However, this should only be allowed when no pages
> > > > > > > are allocated from the memory node or when there are no non-root
> > > > > > > memory cgroups (e.g. during the system boot).  This restriction is
> > > > > > > important for keeping memory tier hierarchy stable enough for
> > > > > > > tier-based memory cgroup accounting.
> > > > > > > 
> > > > > > > Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
> > > > > > > 
> > > > > > > 
> > > > > > > Memory Allocation for Demotion
> > > > > > > ==============================
> > > > > > > 
> > > > > > > To allocate a new page as the demotion target for a page, the kernel
> > > > > > > calls the allocation function (__alloc_pages_nodemask) with the
> > > > > > > source page node as the preferred node and the union of all lower
> > > > > > > tier nodes as the allowed nodemask.  The actual target node selection
> > > > > > > then follows the allocation fallback order that the kernel has
> > > > > > > already defined.
> > > > > > > 
> > > > > > > The pseudo code looks like:
> > > > > > > 
> > > > > > >     targets = NODE_MASK_NONE;
> > > > > > >     src_nid = page_to_nid(page);
> > > > > > >     src_tier = node_tier_map[src_nid];
> > > > > > >     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
> > > > > > >             nodes_or(targets, targets, memory_tiers[i]);
> > > > > > >     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
> > > > > > > 
> > > > > > > The memopolicy of cpuset, vma and owner task of the source page can
> > > > > > > be set to refine the demotion target nodemask, e.g. to prevent
> > > > > > > demotion or select a particular allowed node as the demotion target.
> > > > > > > 
> > > > > > > 
> > > > > > > Memory Allocation for Promotion
> > > > > > > ===============================
> > > > > > > 
> > > > > > > The page allocation for promotion is similar to demotion, except that (1)
> > > > > > > the target nodemask uses the promotion tiers, (2) the preferred node can
> > > > > > > be the accessing CPU node, not the source page node.
> > > > > > > 
> > > > > > > 
> > > > > > > Examples
> > > > > > > ========
> > > > > > > 
> > > > > > 
> > > > > > ...
> > > > > > 
> > > > > > > * Example 3:
> > > > > > > 
> > > > > > > Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
> > > > > > 
> > > > > > Node2 is drawn as pmem.
> > > > > 
> > > > > Typo. Good catch.
> > > > > 
> > > > > > > 
> > > > > > > All nodes are in the same tier.
> > > > > > > 
> > > > > > >                   20
> > > > > > >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> > > > > > >          \                 /
> > > > > > >           \ 30            / 30
> > > > > > >            \             /
> > > > > > >              Node 2 (PMEM)
> > > > > > > 
> > > > > > > node distances:
> > > > > > > node   0    1    2
> > > > > > >    0  10   20   30
> > > > > > >    1  20   10   30
> > > > > > >    2  30   30   10
> > > > > > > 
> > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > <empty>
> > > > > > > 0-2
> > > > > > > <empty>
> > > > > > > 
> > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > 1
> > > > > > > 1
> > > > > > > 1
> > > > > > > 
> > > > > > > Demotion fallback order:
> > > > > > > node 0: empty
> > > > > > > node 1: empty
> > > > > > > node 2: empty
> > > > > > > 
> > > > > > > 
> > > > > > > * Example 4:
> > > > > > > 
> > > > > > > Node 0 is a DRAM node with CPU.
> > > > > > > Node 1 is a PMEM node.
> > > > > > > Node 2 is a GPU node.
> > > > > > > 
> > > > > > >                   50
> > > > > > >   Node 0 (DRAM)  ----  Node 2 (GPU)
> > > > > > >          \                 /
> > > > > > >           \ 30            / 60
> > > > > > >            \             /
> > > > > > >              Node 1 (PMEM)
> > > > > > > 
> > > > > > > node distances:
> > > > > > > node   0    1    2
> > > > > > >    0  10   30   50
> > > > > > >    1  30   10   60
> > > > > > >    2  50   60   10
> > > > > > > 
> > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > 2
> > > > > > > 0
> > > > > > > 1
> > > > > > > 
> > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > 1
> > > > > > > 2
> > > > > > > 0
> > > > > > > 
> > > > > > > Demotion fallback order:
> > > > > > > node 0: 1
> > > > > > > node 1: empty
> > > > > > > node 2: 0, 1
> > > > > > > 
> > > > > > > 
> > > > > > > * Example 5:
> > > > > > > 
> > > > > > > Node 0 is a DRAM node with CPU.
> > > > > > > Node 1 is a GPU node.
> > > > > > > Node 2 is a PMEM node.
> > > > > > > Node 3 is a large, slow DRAM node without CPU.
> > > > > > > 
> > > > > > > 
> > > > > > >      Node 2 (PMEM)  ----
> > > > > > >    /      |              \
> > > > > > >   /       | 30            \ 120
> > > > > > >  |        |         100    \
> > > > > > >  |   Node 0 (DRAM)  ----  Node 1 (GPU)
> > > > > > >   \         \                 /
> > > > > > >     \        \ 40            / 110
> > > > > > >   80  \       \             /
> > > > > > >         ---  Node 3 (Slow DRAM)
> > > > > > 
> > > > > > This is close but not quite what was intended for Hesham's
> > > > > > example... (note we just checked that Hesham's original node0-1
> > > > > > timing didn't make any sense.).
> > > > > > 
> > > > > 
> > > > > This was inspired by Hesham's example. But I should have also included
> > > > > the version that illustrates the need to skip a tier when demoting
> > > > > from certain nodes.
> > > > > 
> > > > > > > 
> > > > > > > node distances:
> > > > > > > node    0    1    2    3
> > > > > > >    0   10  100   30   40
> > > > > > >    1  100   10  120  110
> > > > > > >    2   30  120   10   80
> > > > > > >    3   40  110   80   10
> > > > > > > 
> > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > 1
> > > > > > > 0,3
> > > > > > > 2
> > > > > > > 
> > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > 1
> > > > > > > 0
> > > > > > > 2
> > > > > > > 1
> > > > > > > 
> > > > > > > Demotion fallback order:
> > > > > > > node 0: 2
> > > > > > > node 1: 0, 3, 2
> > > > > > > node 2: empty
> > > > > > > node 3: 2
> > > > > > 
> > > > > > This is close but not quite the same as the example
> > > > > > Hesham gave (note the node timing 1 to 0 on in the table
> > > > > > with that example didn't make sense).  I added another
> > > > > > level of switching to make the numbers more obviously
> > > > > > different and show how critical it might be.
> > > > > > 
> > > > > > * Example 6:
> > > > > > 
> > > > > > Node 0 is a DRAM node with CPU.
> > > > > > Node 1 is a GPU node.
> > > > > > Node 2 is a PMEM node.
> > > > > > Node 3 is an extremely large, DRAM node without CPU.
> > > > > >   (Key point here being that it probably never makes sense
> > > > > >    to demote to anywhere else from this memory).
> > > > > > 
> > > > > > 
> > > > > > I've redone the timings wrt to example 5.
> > > > > > Basis for this is 0 and 2 are directly connected
> > > > > > via controllers in an SoC. 1 and 3 are connected
> > > > > > via a a common switch one switch down switch
> > > > > > (each hop via this is 100)
> > > > > > All drams cost 10 once you've reached correct node
> > > > > > and pmem costs 30 from SoC.
> > > > > > Numbers get too large as a result but meh, I'm making
> > > > > > a point not providing real numbers :)
> > > > > > 
> > > > > >          PMEM Node 2
> > > > > >             |(30)
> > > > > >         CPU + DRAM Node0
> > > > > >             |(100)
> > > > > >          Switch 1
> > > > > >             |(100)
> > > > > >           Switch 2
> > > > > >     (100)  |      |(100)
> > > > > > Node 1 GPU     Node3 Large memory.
> > > > > > 
> > > > > > 
> > > > > > With one level of s
> > > > > > 
> > > > > >      Node 2 (PMEM)  ----
> > > > > >     /      |              \
> > > > > >    /       | 30            \ 330
> > > > > >   |        |         310    \
> > > > > >   |   Node 0 (DRAM)  ----  Node 1 (GPU)
> > > > > >    \         \                 /
> > > > > >      \        \ 310           / 210
> > > > > >    330 \       \             /
> > > > > >          ---  Node 3 (Extremely large DRAM)
> > > > > > 
> > > > > > To my mind, we should potentially also take into account
> > > > > > the fact that Node3 can be known to never contain CPUs
> > > > > > (in at least some architectures we know where the CPUs
> > > > > >  might be added later, they can't just magically turn up
> > > > > >  anywhere in the topology).
> > > > > > 
> > > > > > node distances:
> > > > > > node    0    1    2    3
> > > > > >     0   10   310  30   310
> > > > > >     1   310  10   330  210
> > > > > >     2   30   330  10   330
> > > > > >     3   310  210  330   10
> > > > > > 
> > > > > > So, my ideal would treat node 3 different from other dram nodes
> > > > > > as we know it can't have CPUs. Trying to come up with an
> > > > > > always correct order for nodes 3 and 2 is tricky as to a certain
> > > > > > extent depends on capacity. If node 2 was  big enough to take
> > > > > > any demotion from node 0 and still have lots of room then demoting
> > > > > > there form node 3 would make sense and visa versa.
> > > > > > 
> > > > > > 
> > > > > >  $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > >  1
> > > > > >  0
> > > > > >  2
> > > > > >  3
> > > > > > 
> > > > > > 
> > > > > >  $ cat /sys/devices/system/node/node*/memtier
> > > > > >   1
> > > > > >   0
> > > > > >   2
> > > > > >   3
> > > > > > 
> > > > > >  Demotion fallback order:
> > > > > >  node 0: 2, 3
> > > > > >  node 1: 3, 0, 2 (key being we will almost always have less pressure on node 3)
> > > > > >  node 2: 3
> > > > > >  node 3: empty
> > > > > > 
> > > > > > or as Hesham just pointed out this can be done with 3 tiers,
> > > > > > because we can put the GPU and CPU in the same tier as
> > > > > > there is little reason to demote from one to the other.
> > > > > 
> > > > > Thank you for the example.  It makes sense to me to have node 3 on its
> > > > > own tier.  We can have either 3 tiers or 4 tiers in total (assuming
> > > > > that the max number of tiers is a config option).
> > > > > 
> > > > > > We are also a bit worried about ABI backwards compatibility because
> > > > > > of the potential need to make more space in tiers lower in number than
> > > > > > CPU-attached DDR. I rather liked the negative-number proposal with
> > > > > > default as 0 that Huang, Ying made.
> > > > > 
> > > > > It is hard to have negative values as the device IDs.
> > > > > 
> > > > > The current proposal equals the tier device ID to the tier hierarchy
> > > > > level, which makes the interface simpler, but less flexible.  How
> > > > > about the following proposal (which decouples the tier device ID from
> > > > > the tier level)?
> > > > > 
> > > > > /sys/devices/system/memtier/memtierN/nodelist
> > > > > /sys/devices/system/memtier/memtierN/rank
> > > > > 
> > > > > Each memory tier N has two sysfs files:
> > > > > - nodelist: the nodes that are in this tier
> > > > > - rank: an opaque value that helps decide the level at which this tier
> > > > > is in the tier hierarchy (smaller value means faster tier)
> > > > > 
> > > > > The tier hierarchy is determined by "rank", not by the device id
> > > > > number N from "memtierN".
> > > > > 
> > > > > The absolute value of "rank" of a memtier doesn't necessarily carry
> > > > > any meaning. Its value relative to other memtiers decides the level of
> > > > > this memtier in the tier hierarchy.
> > > > > 
> > > > > The CPU-attached DRAM nodes are always in memtier0 (the device ID),
> > > > > but memtier0 may not always be the top-tier, e.g. its level can be 3
> > > > > in a 5-tier system.
> > > > > 
> > > > > For the above example (example 6), we can have:
> > > > > 
> > > > > $ ls /sys/devices/system/memtier
> > > > > memtier0
> > > > > memtier1
> > > > > memtier2
> > > > > memtier128
> > > > > 
> > > > > $ cat /sys/devices/system/memtier/memtier*/rank
> > > > > 50
> > > > > 60
> > > > > 70
> > > > > 10
> > > > 
> > > > I understand that the device ID cannot be negative.  So we have to use
> > > > rank.  Can we make it possible to allow "rank" to be negative?
> > > 
> > > It is possible to allow "rank" to be negative, though I think all
> > > positive values should work equally well.
> > > 
> > > > Another choice is to do some trick on the device ID.  For example, the CPU-
> > > > attached DRAM nodes are always memtier100 (the device ID).  Then we can
> > > > have memtier99, memtier100, memtier101, memtier102, ....  That's not
> > > > perfect either.
> > > 
> > > If we go with the device ID tricks, one approach is to use sub-device IDs:
> > > 
> > > - There are 3 major tiers: tier0 (e.g. GPU), tier1 (e.g. DRAM) and
> > > tier2 (e.g. PMEM).
> > > 
> > > - Each major tier can have minor tiers, e.g. tier0.0, tier1.0,
> > > tier1.1, tier2.0, tier2.1.
> > > 
> > > The earlier 4-tier example can be represented as:
> > > 
> > > memtier0.0 -> memtier1.0 -> memtier2.0 -> memtier2.1
> > > 
> > > We can also omit .0 so that the tiers are:
> > > 
> > > memtier0 -> memtier1 -> memtier2 -> memtier2.1
> > > 
> > > This should be flexible enough to support multiple tiers while keeping
> > > the tier IDs relatively stable.
> > > 
> > > It is not as flexible as the rank approach. For example, to insert a
> > > new tier between 2.0 and 2.1, we need to add a tier 2.2 and reassign
> > > existing nodes to these 3 tiers.  Using "rank", we can insert a new
> > > tier and only move desired nodes into the new tier.
> > > 
> > > What do you think?
> > 
> > The rank approach looks better to me.  And if we stick with the device ID
> > rule as follows,
> > 
> > ...
> > 255     GPU
> > 0       DRAM
> > 1       PMEM
> > 2
> > ...
> > 
> > 255 is -1 for "s8".
> > 
> > The device ID should do most tricks at least now.  The rank can provide
> > more flexibility in the future.  We can even go without rank in the
> > first version, and introduce it when it's necessary.
> 
> Given that the "rank" approach is generally favored, let's go with
> that to avoid compatibility issues that may come from the switch of
> device ID tricks to ranks.

OK.  Just to confirm.  Does this mean that we will have fixed device ID,
for example,

GPU			memtier255
DRAM (with CPU)		memtier0
PMEM			memtier1

When we add a new memtier, it can be memtier254, or memtier2?  The rank
value will determine the real demotion order.
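
For illustration, a minimal sketch of how userspace could derive the
tier order purely from "rank" (this assumes the memtierN/rank and
memtierN/nodelist files proposed earlier in this thread; they do not
exist in any released kernel):

    # List memory tiers from fastest to slowest.  Per the proposal,
    # a smaller rank means a faster tier, and the memtierN device ID
    # numbering itself carries no ordering information.
    for d in /sys/devices/system/memtier/memtier*; do
        printf '%s %s nodes=%s\n' "$(cat "$d/rank")" "${d##*/}" "$(cat "$d/nodelist")"
    done | sort -n

With the example 6 values above (ranks 50, 60, 70, 10 for memtier0,
memtier1, memtier2, memtier128), this would print memtier128 first
even though it has the largest device ID.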

I think you may need to send a v3 to make sure everyone is on the same
page.

Best Regards,
Huang, Ying

> > Best Regards,
> > Huang, Ying
> > 
> > > > > The tier order: memtier128 -> memtier0 -> memtier1 -> memtier2
> > > > > 
> > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > 0
> > > > > 2
> > > > > 3
> > > > > 1
> > > > > 
> > > > > $ ls -l /sys/devices/system/node/node*/memtier
> > > > > /sys/devices/system/node/node0/memtier -> /sys/devices/system/memtier/memtier0
> > > > > /sys/devices/system/node/node1/memtier -> /sys/devices/system/memtier/memtier128
> > > > > /sys/devices/system/node/node2/memtier -> /sys/devices/system/memtier/memtier1
> > > > > /sys/devices/system/node/node3/memtier -> /sys/devices/system/memtier/memtier2
> > > > > 
> > > > > To override the memory tier of a node, we can use a new, write-only,
> > > > > per-node interface file:
> > > > > 
> > > > > /sys/devices/system/node/nodeN/set_memtier
> > > > > 
> > > > > e.g.
> > > > > 
> > > > > $ echo "memtier128" > sys/devices/system/node/node1/set_memtier
> > > > 
> > > > I prefer the original proposal to make nodeX/memtier a normal file to
> > > > hold the memtier device ID instead of a link.
> > > 
> > > OK. We don't have to use a symlink.
> > > 
> > > > Best Regards,
> > > > Huang, Ying
> > > > 
> > > > > Any comments?
> > > > > 
> > > > > > Jonathan
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > 
> > > > 
> > > > 
> > 
> > 
> > 



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-25  9:03             ` Ying Huang
@ 2022-05-25 10:01               ` Aneesh Kumar K V
  2022-05-25 11:36                 ` Mika Penttilä
  2022-05-25 17:27                 ` Wei Xu
  2022-05-25 15:36               ` Wei Xu
  1 sibling, 2 replies; 47+ messages in thread
From: Aneesh Kumar K V @ 2022-05-25 10:01 UTC (permalink / raw)
  To: Ying Huang, Wei Xu
  Cc: Jonathan Cameron, Andrew Morton, Greg Thelen, Yang Shi,
	Linux Kernel Mailing List, Jagdish Gediya, Michal Hocko,
	Tim C Chen, Dave Hansen, Alistair Popple, Baolin Wang, Feng Tang,
	Davidlohr Bueso, Dan Williams, David Rientjes, Linux MM,
	Brice Goglin, Hesham Almatary

On 5/25/22 2:33 PM, Ying Huang wrote:
> On Tue, 2022-05-24 at 22:32 -0700, Wei Xu wrote:
>> On Tue, May 24, 2022 at 1:24 AM Ying Huang <ying.huang@intel.com> wrote:
>>>
>>> On Tue, 2022-05-24 at 00:04 -0700, Wei Xu wrote:
>>>> On Thu, May 19, 2022 at 8:06 PM Ying Huang <ying.huang@intel.com> wrote:
>>>>>

...

> 
> OK.  Just to confirm.  Does this mean that we will have fixed device ID,
> for example,
> 
> GPU			memtier255
> DRAM (with CPU)		memtier0
> PMEM			memtier1
> 
> When we add a new memtier, it can be memtier254, or memtier2?  The rank
> value will determine the real demotion order.
> 
> I think you may need to send v3 to make sure everyone is at the same
> page.
> 

What we have implemented which we will send as RFC shortly is below.

kvaneesh@ubuntu-guest:~$ cd /sys/devices/system/
kvaneesh@ubuntu-guest:/sys/devices/system$ pwd
/sys/devices/system
kvaneesh@ubuntu-guest:/sys/devices/system$ ls
clockevents  clocksource  container  cpu  edac  memory  memtier  mpic 
node  power
kvaneesh@ubuntu-guest:/sys/devices/system$ cd memtier/
kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ pwd
/sys/devices/system/memtier
kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ ls
default_rank  max_rank  memtier1  power  uevent
kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ cat default_rank
1
kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ cat max_rank
3
kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ cd memtier1/
kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ ls
nodelist  power  rank  subsystem  uevent
kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ cat nodelist
0-3
kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ cat rank
1
kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ cd 
../../node/node1/
kvaneesh@ubuntu-guest:/sys/devices/system/node/node1$ cat memtier
1
kvaneesh@ubuntu-guest:/sys/devices/system/node/node1$
root@ubuntu-guest:/sys/devices/system/node/node1# echo 0 > memtier
root@ubuntu-guest:/sys/devices/system/node/node1# cat memtier
0
root@ubuntu-guest:/sys/devices/system/node/node1# cd ../../memtier/
root@ubuntu-guest:/sys/devices/system/memtier# ls
default_rank  max_rank  memtier0  memtier1  power  uevent
root@ubuntu-guest:/sys/devices/system/memtier# cd memtier0/
root@ubuntu-guest:/sys/devices/system/memtier/memtier0# cat nodelist
1
root@ubuntu-guest:/sys/devices/system/memtier/memtier0# cat rank
0
root@ubuntu-guest:/sys/devices/system/memtier/memtier0# echo 4 > rank
bash: rank: Permission denied
root@ubuntu-guest:/sys/devices/system/memtier/memtier0#
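
The same walk-through, condensed into a script (a sketch only; it
assumes the prototype sysfs layout shown above, whose paths and
semantics may still change before the RFC is merged):

    #!/bin/sh
    # Dump every memory tier with its rank and member nodes.
    cd /sys/devices/system/memtier || exit 1
    echo "default_rank=$(cat default_rank) max_rank=$(cat max_rank)"
    for t in memtier*; do
        echo "$t: rank=$(cat "$t/rank") nodes=$(cat "$t/nodelist")"
    done
    # Move node1 into tier 0 (the prototype creates memtier0 on demand),
    # then confirm the node reports its new tier.  Requires root.
    echo 0 > /sys/devices/system/node/node1/memtier
    cat /sys/devices/system/node/node1/memtier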

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-25 10:01               ` Aneesh Kumar K V
@ 2022-05-25 11:36                 ` Mika Penttilä
  2022-05-25 15:33                   ` Wei Xu
  2022-05-25 17:27                 ` Wei Xu
  1 sibling, 1 reply; 47+ messages in thread
From: Mika Penttilä @ 2022-05-25 11:36 UTC (permalink / raw)
  To: Aneesh Kumar K V, Ying Huang, Wei Xu
  Cc: Jonathan Cameron, Andrew Morton, Greg Thelen, Yang Shi,
	Linux Kernel Mailing List, Jagdish Gediya, Michal Hocko,
	Tim C Chen, Dave Hansen, Alistair Popple, Baolin Wang, Feng Tang,
	Davidlohr Bueso, Dan Williams, David Rientjes, Linux MM,
	Brice Goglin, Hesham Almatary



On 25.5.2022 13.01, Aneesh Kumar K V wrote:
> On 5/25/22 2:33 PM, Ying Huang wrote:
>> On Tue, 2022-05-24 at 22:32 -0700, Wei Xu wrote:
>>> On Tue, May 24, 2022 at 1:24 AM Ying Huang <ying.huang@intel.com> wrote:
>>>>
>>>> On Tue, 2022-05-24 at 00:04 -0700, Wei Xu wrote:
>>>>> On Thu, May 19, 2022 at 8:06 PM Ying Huang <ying.huang@intel.com> 
>>>>> wrote:
>>>>>>
> 
> ...
> 
>>
>> OK.  Just to confirm.  Does this mean that we will have fixed device ID,
>> for example,
>>
>> GPU            memtier255
>> DRAM (with CPU)        memtier0
>> PMEM            memtier1
>>
>> When we add a new memtier, it can be memtier254, or memtier2?  The rank
>> value will determine the real demotion order.
>>
>> I think you may need to send v3 to make sure everyone is at the same
>> page.
>>
> 
> What we have implemented which we will send as RFC shortly is below.
> 
> kvaneesh@ubuntu-guest:~$ cd /sys/devices/system/
> kvaneesh@ubuntu-guest:/sys/devices/system$ pwd
> /sys/devices/system
> kvaneesh@ubuntu-guest:/sys/devices/system$ ls
> clockevents  clocksource  container  cpu  edac  memory  memtier  mpic 
> node  power
> kvaneesh@ubuntu-guest:/sys/devices/system$ cd memtier/
> kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ pwd
> /sys/devices/system/memtier
> kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ ls
> default_rank  max_rank  memtier1  power  uevent
> kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ cat default_rank
> 1
> kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ cat max_rank
> 3
> kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ cd memtier1/
> kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ ls
> nodelist  power  rank  subsystem  uevent
> kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ cat nodelist
> 0-3
> kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ cat rank
> 1
> kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ cd 
> ../../node/node1/
> kvaneesh@ubuntu-guest:/sys/devices/system/node/node1$ cat memtier
> 1
> kvaneesh@ubuntu-guest:/sys/devices/system/node/node1$
> root@ubuntu-guest:/sys/devices/system/node/node1# echo 0 > memtier
> root@ubuntu-guest:/sys/devices/system/node/node1# cat memtier
> 0
> root@ubuntu-guest:/sys/devices/system/node/node1# cd ../../memtier/
> root@ubuntu-guest:/sys/devices/system/memtier# ls
> default_rank  max_rank  memtier0  memtier1  power  uevent
> root@ubuntu-guest:/sys/devices/system/memtier# cd memtier0/
> root@ubuntu-guest:/sys/devices/system/memtier/memtier0# cat nodelist
> 1
> root@ubuntu-guest:/sys/devices/system/memtier/memtier0# cat rank
> 0
> root@ubuntu-guest:/sys/devices/system/memtier/memtier0# echo 4 > rank
> bash: rank: Permission denied
> root@ubuntu-guest:/sys/devices/system/memtier/memtier0#
> 

Just to confirm, unlike today's demotion code, the demotion target 
allocation is planned to honor mempolicies?


--Mika



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-25  7:47             ` Alistair Popple
@ 2022-05-25 11:48               ` Jonathan Cameron
  2022-05-25 15:32                 ` Wei Xu
  0 siblings, 1 reply; 47+ messages in thread
From: Jonathan Cameron @ 2022-05-25 11:48 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Wei Xu, Aneesh Kumar K.V, Dave Hansen, Huang Ying, Andrew Morton,
	Greg Thelen, Yang Shi, Linux Kernel Mailing List, Jagdish Gediya,
	Michal Hocko, Tim C Chen, Baolin Wang, Feng Tang,
	Davidlohr Bueso, Dan Williams, David Rientjes, Linux MM,
	Brice Goglin, Hesham Almatary

On Wed, 25 May 2022 17:47:33 +1000
Alistair Popple <apopple@nvidia.com> wrote:

> Wei Xu <weixugc@google.com> writes:
> 
> > On Tue, May 24, 2022 at 6:27 AM Aneesh Kumar K.V
> > <aneesh.kumar@linux.ibm.com> wrote:  
> >>
> >> Wei Xu <weixugc@google.com> writes:
> >>  
> >> > On Wed, May 18, 2022 at 5:00 AM Jonathan Cameron
> >> > <Jonathan.Cameron@huawei.com> wrote:  
> >> >>
> >> >> On Wed, 18 May 2022 00:09:48 -0700
> >> >> Wei Xu <weixugc@google.com> wrote:  
> >>
> >> ...
> >>  
> >> > Nice :)  
> >> >>
> >> >> Initially I thought this was over complicated when compared to just leaving space, but
> >> >> after a chat with Hesham just now you have us both convinced that this is an elegant solution.
> >> >>
> >> >> Few corners probably need fleshing out:
> >> >> *  Use of an allocator for new tiers. Flat number at startup, or new one on write of unique
> >> >>    value to set_memtier perhaps?  Also whether to allow drivers to allocate (I think
> >> >>    we should).
> >> >> *  Multiple tiers with same rank.  My assumption is from demotion path point of view you
> >> >>    fuse them (treat them as if they were a single tier), but keep them expressed
> >> >>    separately in the sysfs interface so that the rank can be changed independently.
> >> >> *  Some guidance on what values make sense for given rank default that might be set by
> >> >>    a driver. If we have multiple GPU vendors, and someone mixes them in a system we
> >> >>    probably don't want the default values they use to result in demotion between them.
> >> >>    This might well be a guidance DOC or appropriate set of #define  
> >> >
> >> > All of these are good ideas, though I am afraid that these can make
> >> > tier management too complex for what it's worth.
> >> >
> >> > How about an alternative tier numbering scheme that uses major.minor
> >> > device IDs?  For simplicity, we can just start with 3 major tiers.
> >> > New tiers can be inserted in-between using minor tier IDs.  
> >>
> >>
> >> What drives the creation of a new memory tier here?  Jonathan was
> >> suggesting we could do something similar to writing to set_memtier for
> >> creating a new memory tier.
> >>
> >> $ echo "memtier128" > sys/devices/system/node/node1/set_memtier
> >>
> >> But I am wondering whether we should implement that now. If we keep
> >> "rank" concept and detach tier index (memtier0 is the memory tier with
> >> index 0) separate from rank, I assume we have enough flexibility for a
> >> future extension that will allow us to create a memory tier from userspace
> >> and assigning it a rank value that helps the device to be placed before or
> >> after DRAM in demotion order.
> >>
> >> ie, For now we will only have memtier0, memtier1, memtier2. We won't add
> >> dynamic creation of memory tiers and the above memory tiers will have
> >> rank value 0, 1, 2 according with demotion order 0 -> 1 -> 2.  
> >
> > Great. So the consensus is to go with the "rank" approach.  The above
> > sounds good to me as a starting point.  
> 
> The rank approach seems good to me too.

Rank is good, but I do slightly worry about accidentally defining ABI
that people care about with the particular numbers used for the initial ranks.

Maybe just x100 on all of them to allow things in between with no change to
this initial set of 3?  So 0, 100, 200
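
For instance (purely hypothetical numbers, only the relative order is
meant to matter; smaller rank = faster tier as proposed earlier):

    # Initial ranks spaced by 100, e.g. GPU=0, DRAM=100, PMEM=200 for
    # the initial set of 3 tiers.  A tier that later needs to sit
    # between DRAM and PMEM can pick any unused value in between
    # without renumbering the existing tiers:
    new_rank=$(( (100 + 200) / 2 ))   # 150
    echo "$new_rank"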

Jonathan

> 
>  - Alistair
> 
> >> -aneesh  



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-25 11:48               ` Jonathan Cameron
@ 2022-05-25 15:32                 ` Wei Xu
  0 siblings, 0 replies; 47+ messages in thread
From: Wei Xu @ 2022-05-25 15:32 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Alistair Popple, Aneesh Kumar K.V, Dave Hansen, Huang Ying,
	Andrew Morton, Greg Thelen, Yang Shi, Linux Kernel Mailing List,
	Jagdish Gediya, Michal Hocko, Tim C Chen, Baolin Wang, Feng Tang,
	Davidlohr Bueso, Dan Williams, David Rientjes, Linux MM,
	Brice Goglin, Hesham Almatary

On Wed, May 25, 2022 at 4:48 AM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Wed, 25 May 2022 17:47:33 +1000
> Alistair Popple <apopple@nvidia.com> wrote:
>
> > Wei Xu <weixugc@google.com> writes:
> >
> > > On Tue, May 24, 2022 at 6:27 AM Aneesh Kumar K.V
> > > <aneesh.kumar@linux.ibm.com> wrote:
> > >>
> > >> Wei Xu <weixugc@google.com> writes:
> > >>
> > >> > On Wed, May 18, 2022 at 5:00 AM Jonathan Cameron
> > >> > <Jonathan.Cameron@huawei.com> wrote:
> > >> >>
> > >> >> On Wed, 18 May 2022 00:09:48 -0700
> > >> >> Wei Xu <weixugc@google.com> wrote:
> > >>
> > >> ...
> > >>
> > >> > Nice :)
> > >> >>
> > >> >> Initially I thought this was over complicated when compared to just leaving space, but
> > >> >> after a chat with Hesham just now you have us both convinced that this is an elegant solution.
> > >> >>
> > >> >> Few corners probably need fleshing out:
> > >> >> *  Use of an allocator for new tiers. Flat number at startup, or new one on write of unique
> > >> >>    value to set_memtier perhaps?  Also whether to allow drivers to allocate (I think
> > >> >>    we should).
> > >> >> *  Multiple tiers with same rank.  My assumption is from demotion path point of view you
> > >> >>    fuse them (treat them as if they were a single tier), but keep them expressed
> > >> >>    separately in the sysfs interface so that the rank can be changed independently.
> > >> >> *  Some guidance on what values make sense for given rank default that might be set by
> > >> >>    a driver. If we have multiple GPU vendors, and someone mixes them in a system we
> > >> >>    probably don't want the default values they use to result in demotion between them.
> > >> >>    This might well be a guidance DOC or appropriate set of #define
> > >> >
> > >> > All of these are good ideas, though I am afraid that these can make
> > >> > tier management too complex for what it's worth.
> > >> >
> > >> > How about an alternative tier numbering scheme that uses major.minor
> > >> > device IDs?  For simplicity, we can just start with 3 major tiers.
> > >> > New tiers can be inserted in-between using minor tier IDs.
> > >>
> > >>
> > >> What drives the creation of a new memory tier here?  Jonathan was
> > >> suggesting we could do something similar to writing to set_memtier for
> > >> creating a new memory tier.
> > >>
> > >> $ echo "memtier128" > sys/devices/system/node/node1/set_memtier
> > >>
> > >> But I am wondering whether we should implement that now. If we keep
> > >> "rank" concept and detach tier index (memtier0 is the memory tier with
> > >> index 0) separate from rank, I assume we have enough flexibility for a
> > >> future extension that will allow us to create a memory tier from userspace
> > >> and assigning it a rank value that helps the device to be placed before or
> > >> after DRAM in demotion order.
> > >>
> > >> ie, For now we will only have memtier0, memtier1, memtier2. We won't add
> > >> dynamic creation of memory tiers and the above memory tiers will have
> > >> rank value 0, 1, 2 according with demotion order 0 -> 1 -> 2.
> > >
> > > Great. So the consensus is to go with the "rank" approach.  The above
> > > sounds good to me as a starting point.
> >
> > The rank approach seems good to me too.
>
> Rank is good, but I do slightly worry about accidentally defining ABI
> that people care about with the particular numbers used for the initial ranks.
>
> Maybe just x100 on all of them to allow things in between with no change to
> this initial set of 3?  So 0, 100, 200

I strongly support this, which is also my original intention for rank
values. I'd even suggest removing 0, to avoid it becoming a special
value that userspace depends on.

> Jonathan
>
> >
> >  - Alistair
> >
> > >> -aneesh
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-25 11:36                 ` Mika Penttilä
@ 2022-05-25 15:33                   ` Wei Xu
  0 siblings, 0 replies; 47+ messages in thread
From: Wei Xu @ 2022-05-25 15:33 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: Aneesh Kumar K V, Ying Huang, Jonathan Cameron, Andrew Morton,
	Greg Thelen, Yang Shi, Linux Kernel Mailing List, Jagdish Gediya,
	Michal Hocko, Tim C Chen, Dave Hansen, Alistair Popple,
	Baolin Wang, Feng Tang, Davidlohr Bueso, Dan Williams,
	David Rientjes, Linux MM, Brice Goglin, Hesham Almatary

On Wed, May 25, 2022 at 4:37 AM Mika Penttilä <mpenttil@redhat.com> wrote:
>
>
>
> On 25.5.2022 13.01, Aneesh Kumar K V wrote:
> > On 5/25/22 2:33 PM, Ying Huang wrote:
> >> On Tue, 2022-05-24 at 22:32 -0700, Wei Xu wrote:
> >>> On Tue, May 24, 2022 at 1:24 AM Ying Huang <ying.huang@intel.com> wrote:
> >>>>
> >>>> On Tue, 2022-05-24 at 00:04 -0700, Wei Xu wrote:
> >>>>> On Thu, May 19, 2022 at 8:06 PM Ying Huang <ying.huang@intel.com>
> >>>>> wrote:
> >>>>>>
> >
> > ...
> >
> >>
> >> OK.  Just to confirm.  Does this mean that we will have fixed device ID,
> >> for example,
> >>
> >> GPU            memtier255
> >> DRAM (with CPU)        memtier0
> >> PMEM            memtier1
> >>
> >> When we add a new memtier, it can be memtier254, or memtier2?  The rank
> >> value will determine the real demotion order.
> >>
> >> I think you may need to send v3 to make sure everyone is at the same
> >> page.
> >>
> >
> > What we have implemented which we will send as RFC shortly is below.
> >
> > kvaneesh@ubuntu-guest:~$ cd /sys/devices/system/
> > kvaneesh@ubuntu-guest:/sys/devices/system$ pwd
> > /sys/devices/system
> > kvaneesh@ubuntu-guest:/sys/devices/system$ ls
> > clockevents  clocksource  container  cpu  edac  memory  memtier  mpic
> > node  power
> > kvaneesh@ubuntu-guest:/sys/devices/system$ cd memtier/
> > kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ pwd
> > /sys/devices/system/memtier
> > kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ ls
> > default_rank  max_rank  memtier1  power  uevent
> > kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ cat default_rank
> > 1
> > kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ cat max_rank
> > 3
> > kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ cd memtier1/
> > kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ ls
> > nodelist  power  rank  subsystem  uevent
> > kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ cat nodelist
> > 0-3
> > kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ cat rank
> > 1
> > kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ cd
> > ../../node/node1/
> > kvaneesh@ubuntu-guest:/sys/devices/system/node/node1$ cat memtier
> > 1
> > kvaneesh@ubuntu-guest:/sys/devices/system/node/node1$
> > root@ubuntu-guest:/sys/devices/system/node/node1# echo 0 > memtier
> > root@ubuntu-guest:/sys/devices/system/node/node1# cat memtier
> > 0
> > root@ubuntu-guest:/sys/devices/system/node/node1# cd ../../memtier/
> > root@ubuntu-guest:/sys/devices/system/memtier# ls
> > default_rank  max_rank  memtier0  memtier1  power  uevent
> > root@ubuntu-guest:/sys/devices/system/memtier# cd memtier0/
> > root@ubuntu-guest:/sys/devices/system/memtier/memtier0# cat nodelist
> > 1
> > root@ubuntu-guest:/sys/devices/system/memtier/memtier0# cat rank
> > 0
> > root@ubuntu-guest:/sys/devices/system/memtier/memtier0# echo 4 > rank
> > bash: rank: Permission denied
> > root@ubuntu-guest:/sys/devices/system/memtier/memtier0#
> >
>
> Just to confirm, unlike today's demotion code, the demotion target
> allocation is planned to honor mempolicies?

Yes, though there will be some limitations in the beginning,
specifically for per-thread mempolicy.

>
> --Mika
>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-25  9:03             ` Ying Huang
  2022-05-25 10:01               ` Aneesh Kumar K V
@ 2022-05-25 15:36               ` Wei Xu
  2022-05-26  1:09                 ` Ying Huang
  1 sibling, 1 reply; 47+ messages in thread
From: Wei Xu @ 2022-05-25 15:36 UTC (permalink / raw)
  To: Ying Huang
  Cc: Jonathan Cameron, Andrew Morton, Greg Thelen, Aneesh Kumar K.V,
	Yang Shi, Linux Kernel Mailing List, Jagdish Gediya,
	Michal Hocko, Tim C Chen, Dave Hansen, Alistair Popple,
	Baolin Wang, Feng Tang, Davidlohr Bueso, Dan Williams,
	David Rientjes, Linux MM, Brice Goglin, Hesham Almatary

On Wed, May 25, 2022 at 2:03 AM Ying Huang <ying.huang@intel.com> wrote:
>
> On Tue, 2022-05-24 at 22:32 -0700, Wei Xu wrote:
> > On Tue, May 24, 2022 at 1:24 AM Ying Huang <ying.huang@intel.com> wrote:
> > >
> > > On Tue, 2022-05-24 at 00:04 -0700, Wei Xu wrote:
> > > > On Thu, May 19, 2022 at 8:06 PM Ying Huang <ying.huang@intel.com> wrote:
> > > > >
> > > > > On Wed, 2022-05-18 at 00:09 -0700, Wei Xu wrote:
> > > > > > On Thu, May 12, 2022 at 8:00 AM Jonathan Cameron
> > > > > > <Jonathan.Cameron@huawei.com> wrote:
> > > > > > >
> > > > > > > On Wed, 11 May 2022 23:22:11 -0700
> > > > > > > Wei Xu <weixugc@google.com> wrote:
> > > > > > > > The current kernel has the basic memory tiering support: Inactive
> > > > > > > > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > > > > > > > tier NUMA node to make room for new allocations on the higher tier
> > > > > > > > NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> > > > > > > > migrated (promoted) to a higher tier NUMA node to improve the
> > > > > > > > performance.
> > > > > > > >
> > > > > > > > In the current kernel, memory tiers are defined implicitly via a
> > > > > > > > demotion path relationship between NUMA nodes, which is created during
> > > > > > > > the kernel initialization and updated when a NUMA node is hot-added or
> > > > > > > > hot-removed.  The current implementation puts all nodes with CPU into
> > > > > > > > the top tier, and builds the tier hierarchy tier-by-tier by establishing
> > > > > > > > the per-node demotion targets based on the distances between nodes.
> > > > > > > >
> > > > > > > > This current memory tier kernel interface needs to be improved for
> > > > > > > > several important use cases:
> > > > > > > >
> > > > > > > > * The current tier initialization code always initializes
> > > > > > > >   each memory-only NUMA node into a lower tier.  But a memory-only
> > > > > > > >   NUMA node may have a high performance memory device (e.g. a DRAM
> > > > > > > >   device attached via CXL.mem or a DRAM-backed memory-only node on
> > > > > > > >   a virtual machine) and should be put into a higher tier.
> > > > > > > >
> > > > > > > > * The current tier hierarchy always puts CPU nodes into the top
> > > > > > > >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> > > > > > > >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> > > > > > > >   with CPUs are better to be placed into the next lower tier.
> > > > > > > >
> > > > > > > > * Also because the current tier hierarchy always puts CPU nodes
> > > > > > > >   into the top tier, when a CPU is hot-added (or hot-removed) and
> > > > > > > >   triggers a memory node from CPU-less into a CPU node (or vice
> > > > > > > >   versa), the memory tier hierarchy gets changed, even though no
> > > > > > > >   memory node is added or removed.  This can make the tier
> > > > > > > >   hierarchy unstable and make it difficult to support tier-based
> > > > > > > >   memory accounting.
> > > > > > > >
> > > > > > > > * A higher tier node can only be demoted to selected nodes on the
> > > > > > > >   next lower tier as defined by the demotion path, not any other
> > > > > > > >   node from any lower tier.  This strict, hard-coded demotion order
> > > > > > > >   does not work in all use cases (e.g. some use cases may want to
> > > > > > > >   allow cross-socket demotion to another node in the same demotion
> > > > > > > >   tier as a fallback when the preferred demotion node is out of
> > > > > > > >   space), and has resulted in the feature request for an interface to
> > > > > > > >   override the system-wide, per-node demotion order from the
> > > > > > > >   userspace.  This demotion order is also inconsistent with the page
> > > > > > > >   allocation fallback order when all the nodes in a higher tier are
> > > > > > > >   out of space: The page allocation can fall back to any node from
> > > > > > > >   any lower tier, whereas the demotion order doesn't allow that.
> > > > > > > >
> > > > > > > > * There are no interfaces for the userspace to learn about the memory
> > > > > > > >   tier hierarchy in order to optimize its memory allocations.
> > > > > > > >
> > > > > > > > I'd like to propose revised memory tier kernel interfaces based on
> > > > > > > > the discussions in the threads:
> > > > > > > >
> > > > > > > > - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> > > > > > > > - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
> > > > > > > > - https://lore.kernel.org/linux-mm/867bc216386eb6cbf54648f23e5825830f5b922e.camel@intel.com/T/
> > > > > > > >
> > > > > > > >
> > > > > > > > High-level Design Ideas
> > > > > > > > =======================
> > > > > > > >
> > > > > > > > * Define memory tiers explicitly, not implicitly.
> > > > > > > >
> > > > > > > > * Memory tiers are defined based on hardware capabilities of memory
> > > > > > > >   nodes, not their relative node distances between each other.
> > > > > > > >
> > > > > > > > * The tier assignment of each node is independent from each other.
> > > > > > > >   Moving a node from one tier to another tier doesn't affect the tier
> > > > > > > >   assignment of any other node.
> > > > > > > >
> > > > > > > > * The node-tier association is stable. A node can be reassigned to a
> > > > > > > >   different tier only under the specific conditions that don't block
> > > > > > > >   future tier-based memory cgroup accounting.
> > > > > > > >
> > > > > > > > * A node can demote its pages to any nodes of any lower tiers. The
> > > > > > > >   demotion target node selection follows the allocation fallback order
> > > > > > > >   of the source node, which is built based on node distances.  The
> > > > > > > >   demotion targets are also restricted to only the nodes from the tiers
> > > > > > > >   lower than the source node.  We no longer need to maintain a separate
> > > > > > > >   per-node demotion order (node_demotion[]).
> > > > > > > >
> > > > > > >
> > > > > > > Hi Wei,
> > > > > > >
> > > > > > > This proposal looks good to me, though we'll be having fun
> > > > > > > white boarding topologies from our roadmaps for the next few days :)
> > > > > >
> > > > > > That's good to hear.
> > > > > >
> > > > > > > A few comments inline. It also seems likely to me that there is little
> > > > > > > benefit in starting with 3 tiers as the maximum.  Seems unlikely the
> > > > > > > code will be substantially simpler for 3 than it would be for 4 or 5.
> > > > > > > I've drawn out one simple case that needs 4 to do sensible things.
> > > > > >
> > > > > > We can make the number of tiers a config option. 3 tiers are just what
> > > > > > the kernel can reasonably initialize when there isn't enough hardware
> > > > > > performance information from the firmware.
> > > > > >
> > > > > > > >
> > > > > > > > Sysfs Interfaces
> > > > > > > > ================
> > > > > > > >
> > > > > > > > * /sys/devices/system/memtier/memtierN/nodelist
> > > > > > > >
> > > > > > > >   where N = 0, 1, 2 (the kernel supports only 3 tiers for now).
> > > > > > > >
> > > > > > > >   Format: node_list
> > > > > > > >
> > > > > > > >   Read-only.  When read, list the memory nodes in the specified tier.
> > > > > > > >
> > > > > > > >   Tier 0 is the highest tier, while tier 2 is the lowest tier.
> > > > > > > >
> > > > > > > >   The absolute value of a tier id number has no specific meaning.
> > > > > > > >   What matters is the relative order of the tier id numbers.
> > > > > > > >
> > > > > > > >   When a memory tier has no nodes, the kernel can hide its memtier
> > > > > > > >   sysfs files.
> > > > > > > >
> > > > > > > > * /sys/devices/system/node/nodeN/memtier
> > > > > > > >
> > > > > > > >   where N = 0, 1, ...
> > > > > > > >
> > > > > > > >   Format: int or empty
> > > > > > > >
> > > > > > > >   When read, list the memory tier that the node belongs to.  Its value
> > > > > > > >   is empty for a CPU-only NUMA node.
> > > > > > > >
> > > > > > > >   When written, the kernel moves the node into the specified memory
> > > > > > > >   tier if the move is allowed.  The tier assignment of all other nodes
> > > > > > > >   are not affected.
> > > > > > > >
> > > > > > > >   Initially, we can make this interface read-only.
> > > > > > > >
> > > > > > > >
> > > > > > > > Kernel Representation
> > > > > > > > =====================
> > > > > > > >
> > > > > > > > * All memory tiering code is guarded by CONFIG_TIERED_MEMORY.
> > > > > > > >
> > > > > > > > * #define MAX_MEMORY_TIERS 3
> > > > > > > >
> > > > > > > >   Support 3 memory tiers for now.
> > > > > > > >
> > > > > > > > * #define MEMORY_DEFAULT_TIER 1
> > > > > > > >
> > > > > > > >   The default tier that a memory node is assigned to.
> > > > > > > >
> > > > > > > > * nodemask_t memory_tiers[MAX_MEMORY_TIERS]
> > > > > > > >
> > > > > > > >   Store memory nodes by tiers.
> > > > > > > >
> > > > > > > > * int node_tier_map[MAX_NUMNODES]
> > > > > > > >
> > > > > > > >   Map a node to its tier.
> > > > > > > >
> > > > > > > >   For each CPU-only node c, node_tier_map[c] = -1.
> > > > > > > >
> > > > > > > >
> > > > > > > > Memory Tier Initialization
> > > > > > > > ==========================
> > > > > > > >
> > > > > > > > By default, all memory nodes are assigned to the default tier
> > > > > > > > (MEMORY_DEFAULT_TIER).
> > > > > > >
> > > > > > > This is tighter than it needs to be.  In many cases we can easily
> > > > > > > establish if there is any possibility of CPU being hotplugged into
> > > > > > > a memory node.  If it's CXL attached no way CPUs are going to be
> > > > > > > turning up there later :)  If CPU HP into a given node can't happen
> > > > > > > we can be more flexible and I think that often results in better decisions.
> > > > > > > See example below, though obviously I could just use the userspace
> > > > > > > interface to fix that up anyway or have a CXL driver move it around
> > > > > > > if that's relevant.  In some other cases I'm fairly sure we know in
> > > > > > > advance where CPUs can be added but I'd need to check all the
> > > > > > > relevant specs to be sure there aren't any corner cases.  I 'think'
> > > > > > > for ARM for example we know where all possible CPUs can be hotplugged
> > > > > > > (constraint coming from the interrupt controller + the fact that only
> > > > > > > virtual CPU HP is defined).
> > > > > >
> > > > > > We may not always want to put a CXL-attached memory device into a
> > > > > > slower tier because even though CXL does add some additional latency,
> > > > > > both the memory device and CXL can still be very capable in
> > > > > > performance and may not be much slower (if any) than the on-board DRAM
> > > > > > (e.g. DRAM from a remote CPU socket).
> > > > > >
> > > > > > Also, the default tier here is just the initial tier assignment of
> > > > > > each node, which behaves as if there were no tiering.  A tiering
> > > > > > kernel init function can certainly reassign the tier for each node if
> > > > > > it knows enough about the hardware performance for these nodes from
> > > > > > the firmware.
> > > > > >
> > > > > > > >
> > > > > > > > A device driver can move up or down its memory nodes from the default
> > > > > > > > tier.  For example, PMEM can move down its memory nodes below the
> > > > > > > > default tier, whereas GPU can move up its memory nodes above the
> > > > > > > > default tier.
> > > > > > > >
> > > > > > > > The kernel initialization code makes the decision on which exact tier
> > > > > > > > a memory node should be assigned to based on the requests from the
> > > > > > > > device drivers as well as the memory device hardware information
> > > > > > > > provided by the firmware.
> > > > > > > >
> > > > > > > >
> > > > > > > > Memory Tier Reassignment
> > > > > > > > ========================
> > > > > > > >
> > > > > > > > After a memory node is hot-removed, it can be hot-added back to a
> > > > > > > > different memory tier.  This is useful for supporting dynamically
> > > > > > > > provisioned CXL.mem NUMA nodes, which may connect to different
> > > > > > > > memory devices across hot-plug events.  Such tier changes should
> > > > > > > > be compatible with tier-based memory accounting.
> > > > > > > >
> > > > > > > > The userspace may also reassign an existing online memory node to a
> > > > > > > > different tier.  However, this should only be allowed when no pages
> > > > > > > > are allocated from the memory node or when there are no non-root
> > > > > > > > memory cgroups (e.g. during the system boot).  This restriction is
> > > > > > > > important for keeping memory tier hierarchy stable enough for
> > > > > > > > tier-based memory cgroup accounting.
> > > > > > > >
> > > > > > > > Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
> > > > > > > >
> > > > > > > >
> > > > > > > > Memory Allocation for Demotion
> > > > > > > > ==============================
> > > > > > > >
> > > > > > > > To allocate a new page as the demotion target for a page, the kernel
> > > > > > > > calls the allocation function (__alloc_pages_nodemask) with the
> > > > > > > > source page node as the preferred node and the union of all lower
> > > > > > > > tier nodes as the allowed nodemask.  The actual target node selection
> > > > > > > > then follows the allocation fallback order that the kernel has
> > > > > > > > already defined.
> > > > > > > >
> > > > > > > > The pseudo code looks like:
> > > > > > > >
> > > > > > > >     targets = NODE_MASK_NONE;
> > > > > > > >     src_nid = page_to_nid(page);
> > > > > > > >     src_tier = node_tier_map[src_nid];
> > > > > > > >     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
> > > > > > > >             nodes_or(targets, targets, memory_tiers[i]);
> > > > > > > >     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
> > > > > > > >
> > > > > > > > The mempolicy of cpuset, vma and owner task of the source page can
> > > > > > > > be set to refine the demotion target nodemask, e.g. to prevent
> > > > > > > > demotion or select a particular allowed node as the demotion target.
> > > > > > > >
> > > > > > > >
> > > > > > > > Memory Allocation for Promotion
> > > > > > > > ===============================
> > > > > > > >
> > > > > > > > The page allocation for promotion is similar to demotion, except that (1)
> > > > > > > > the target nodemask uses the promotion tiers, (2) the preferred node can
> > > > > > > > be the accessing CPU node, not the source page node.
> > > > > > > >
> > > > > > > >
> > > > > > > > Examples
> > > > > > > > ========
> > > > > > > >
> > > > > > >
> > > > > > > ...
> > > > > > >
> > > > > > > > * Example 3:
> > > > > > > >
> > > > > > > > Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
> > > > > > >
> > > > > > > Node2 is drawn as pmem.
> > > > > >
> > > > > > Typo. Good catch.
> > > > > >
> > > > > > > >
> > > > > > > > All nodes are in the same tier.
> > > > > > > >
> > > > > > > >                   20
> > > > > > > >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> > > > > > > >          \                 /
> > > > > > > >           \ 30            / 30
> > > > > > > >            \             /
> > > > > > > >              Node 2 (PMEM)
> > > > > > > >
> > > > > > > > node distances:
> > > > > > > > node   0    1    2
> > > > > > > >    0  10   20   30
> > > > > > > >    1  20   10   30
> > > > > > > >    2  30   30   10
> > > > > > > >
> > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > <empty>
> > > > > > > > 0-2
> > > > > > > > <empty>
> > > > > > > >
> > > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > 1
> > > > > > > > 1
> > > > > > > > 1
> > > > > > > >
> > > > > > > > Demotion fallback order:
> > > > > > > > node 0: empty
> > > > > > > > node 1: empty
> > > > > > > > node 2: empty
> > > > > > > >
> > > > > > > >
> > > > > > > > * Example 4:
> > > > > > > >
> > > > > > > > Node 0 is a DRAM node with CPU.
> > > > > > > > Node 1 is a PMEM node.
> > > > > > > > Node 2 is a GPU node.
> > > > > > > >
> > > > > > > >                   50
> > > > > > > >   Node 0 (DRAM)  ----  Node 2 (GPU)
> > > > > > > >          \                 /
> > > > > > > >           \ 30            / 60
> > > > > > > >            \             /
> > > > > > > >              Node 1 (PMEM)
> > > > > > > >
> > > > > > > > node distances:
> > > > > > > > node   0    1    2
> > > > > > > >    0  10   30   50
> > > > > > > >    1  30   10   60
> > > > > > > >    2  50   60   10
> > > > > > > >
> > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > 2
> > > > > > > > 0
> > > > > > > > 1
> > > > > > > >
> > > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > 1
> > > > > > > > 2
> > > > > > > > 0
> > > > > > > >
> > > > > > > > Demotion fallback order:
> > > > > > > > node 0: 1
> > > > > > > > node 1: empty
> > > > > > > > node 2: 0, 1
> > > > > > > >
> > > > > > > >
> > > > > > > > * Example 5:
> > > > > > > >
> > > > > > > > Node 0 is a DRAM node with CPU.
> > > > > > > > Node 1 is a GPU node.
> > > > > > > > Node 2 is a PMEM node.
> > > > > > > > Node 3 is a large, slow DRAM node without CPU.
> > > > > > > >
> > > > > > > >
> > > > > > > >      Node 2 (PMEM)  ----
> > > > > > > >    /      |              \
> > > > > > > >   /       | 30            \ 120
> > > > > > > >  |        |         100    \
> > > > > > > >  |   Node 0 (DRAM)  ----  Node 1 (GPU)
> > > > > > > >   \         \                 /
> > > > > > > >     \        \ 40            / 110
> > > > > > > >   80  \       \             /
> > > > > > > >         ---  Node 3 (Slow DRAM)
> > > > > > >
> > > > > > > This is close but not quite what was intended for Hesham's
> > > > > > > example... (note we just checked that Hesham's original node0-1
> > > > > > > timing didn't make any sense.).
> > > > > > >
> > > > > >
> > > > > > This was inspired by Hesham's example. But I should have also included
> > > > > > the version that illustrates the need to skip a tier when demoting
> > > > > > from certain nodes.
> > > > > >
> > > > > > > >
> > > > > > > > node distances:
> > > > > > > > node    0    1    2    3
> > > > > > > >    0   10  100   30   40
> > > > > > > >    1  100   10  120  110
> > > > > > > >    2   30  120   10   80
> > > > > > > >    3   40  110   80   10
> > > > > > > >
> > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > 1
> > > > > > > > 0,3
> > > > > > > > 2
> > > > > > > >
> > > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > 1
> > > > > > > > 0
> > > > > > > > 2
> > > > > > > > 1
> > > > > > > >
> > > > > > > > Demotion fallback order:
> > > > > > > > node 0: 2
> > > > > > > > node 1: 0, 3, 2
> > > > > > > > node 2: empty
> > > > > > > > node 3: 2
> > > > > > >
> > > > > > > This is close but not quite the same as the example
> > > > > > > Hesham gave (note the node 1 to 0 timing in the table
> > > > > > > with that example didn't make sense).  I added another
> > > > > > > level of switching to make the numbers more obviously
> > > > > > > different and show how critical it might be.
> > > > > > >
> > > > > > > * Example 6:
> > > > > > >
> > > > > > > Node 0 is a DRAM node with CPU.
> > > > > > > Node 1 is a GPU node.
> > > > > > > Node 2 is a PMEM node.
> > > > > > > Node 3 is an extremely large, DRAM node without CPU.
> > > > > > >   (Key point here being that it probably never makes sense
> > > > > > >    to demote to anywhere else from this memory).
> > > > > > >
> > > > > > >
> > > > > > > I've redone the timings wrt example 5.
> > > > > > > Basis for this is 0 and 2 are directly connected
> > > > > > > via controllers in an SoC. 1 and 3 are connected
> > > > > > > via a common switch that is one switch further down
> > > > > > > (each hop via this is 100).
> > > > > > > All DRAMs cost 10 once you've reached the correct node
> > > > > > > and PMEM costs 30 from the SoC.
> > > > > > > Numbers get too large as a result but meh, I'm making
> > > > > > > a point, not providing real numbers :)
> > > > > > >
> > > > > > >          PMEM Node 2
> > > > > > >             |(30)
> > > > > > >         CPU + DRAM Node0
> > > > > > >             |(100)
> > > > > > >          Switch 1
> > > > > > >             |(100)
> > > > > > >           Switch 2
> > > > > > >     (100)  |      |(100)
> > > > > > > Node 1 GPU     Node3 Large memory.
> > > > > > >
> > > > > > >
> > > > > > > With one level of switching:
> > > > > > >
> > > > > > >      Node 2 (PMEM)  ----
> > > > > > >     /      |              \
> > > > > > >    /       | 30            \ 330
> > > > > > >   |        |         310    \
> > > > > > >   |   Node 0 (DRAM)  ----  Node 1 (GPU)
> > > > > > >    \         \                 /
> > > > > > >      \        \ 310           / 210
> > > > > > >    330 \       \             /
> > > > > > >          ---  Node 3 (Extremely large DRAM)
> > > > > > >
> > > > > > > To my mind, we should potentially also take into account
> > > > > > > the fact that Node3 can be known to never contain CPUs
> > > > > > > (in at least some architectures we know where the CPUs
> > > > > > >  might be added later, they can't just magically turn up
> > > > > > >  anywhere in the topology).
> > > > > > >
> > > > > > > node distances:
> > > > > > > node    0    1    2    3
> > > > > > >     0   10   310  30   310
> > > > > > >     1   310  10   330  210
> > > > > > >     2   30   330  10   330
> > > > > > >     3   310  210  330   10
> > > > > > >
> > > > > > > So, my ideal would treat node 3 differently from other DRAM nodes
> > > > > > > as we know it can't have CPUs. Trying to come up with an
> > > > > > > always correct order for nodes 3 and 2 is tricky as it depends
> > > > > > > to a certain extent on capacity. If node 2 was big enough to take
> > > > > > > any demotion from node 0 and still have lots of room then demoting
> > > > > > > there from node 3 would make sense and vice versa.
> > > > > > >
> > > > > > >
> > > > > > >  $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > >  1
> > > > > > >  0
> > > > > > >  2
> > > > > > >  3
> > > > > > >
> > > > > > >
> > > > > > >  $ cat /sys/devices/system/node/node*/memtier
> > > > > > >   1
> > > > > > >   0
> > > > > > >   2
> > > > > > >   3
> > > > > > >
> > > > > > >  Demotion fallback order:
> > > > > > >  node 0: 2, 3
> > > > > > >  node 1: 3, 0, 2 (key being we will almost always have less pressure on node 3)
> > > > > > >  node 2: 3
> > > > > > >  node 3: empty
> > > > > > >
> > > > > > > or as Hesham just pointed out this can be done with 3 tiers,
> > > > > > > because we can put the GPU and CPU in the same tier as
> > > > > > > there is little reason to demote from one to the other.
> > > > > >
> > > > > > Thank you for the example.  It makes sense to me to have node 3 on its
> > > > > > own tier.  We can have either 3 tiers or 4 tiers in total (assuming
> > > > > > that the max number of tiers is a config option).
> > > > > >
> > > > > > > We are also a bit worried about ABI backwards compatibility because
> > > > > > > of potential need to make more space in tiers lower in number than
> > > > > > > CPU attached DDR. I rather liked the negative proposal with
> > > > > > > default as 0 that Huang, Ying made.
> > > > > >
> > > > > > It is hard to have negative values as the device IDs.
> > > > > >
> > > > > > The current proposal equals the tier device ID to the tier hierarchy
> > > > > > level, which makes the interface simpler, but less flexible.  How
> > > > > > about the following proposal (which decouples the tier device ID from
> > > > > > the tier level)?
> > > > > >
> > > > > > /sys/devices/system/memtier/memtierN/nodelist
> > > > > > /sys/devices/system/memtier/memtierN/rank
> > > > > >
> > > > > > Each memory tier N has two sysfs files:
> > > > > > - nodelist: the nodes that are in this tier
> > > > > > - rank: an opaque value that helps decide the level at which this tier
> > > > > > is in the tier hierarchy (smaller value means faster tier)
> > > > > >
> > > > > > The tier hierarchy is determined by "rank", not by the device id
> > > > > > number N from "memtierN".
> > > > > >
> > > > > > The absolute value of "rank" of a memtier doesn't necessarily carry
> > > > > > any meaning. Its value relative to other memtiers decides the level of
> > > > > > this memtier in the tier hierarchy.
> > > > > >
> > > > > > The CPU-attached DRAM nodes are always in memtier0 (the device ID),
> > > > > > but memtier0 may not always be the top-tier, e.g. its level can be 3
> > > > > > in a 5-tier system.
> > > > > >
> > > > > > For the above example (example 6), we can have:
> > > > > >
> > > > > > $ ls /sys/devices/system/memtier
> > > > > > memtier0
> > > > > > memtier1
> > > > > > memtier2
> > > > > > memtier128
> > > > > >
> > > > > > $ cat /sys/devices/system/memtier/memtier*/rank
> > > > > > 50
> > > > > > 60
> > > > > > 70
> > > > > > 10
> > > > >
> > > > > I understand that the device ID cannot be negative.  So we have to use
> > > > > rank.  Can we make it possible to allow "rank" to be negative?
> > > >
> > > > It is possible to allow "rank" to be negative, though I think all
> > > > positive values should work equally well.
> > > >
> > > > > Another choice is to do some trick on the device ID.  For example, the CPU-
> > > > > attached DRAM nodes are always memtier100 (the device ID).  Then we can
> > > > > have memtier99, memtier100, memtier101, memtier102, ....  That's not
> > > > > perfect either.
> > > >
> > > > If we go with the device ID tricks, one approach is to use sub-device IDs:
> > > >
> > > > - There are 3 major tiers: tier0 (e.g. GPU), tier1 (e.g. DRAM) and
> > > > tier2 (e.g. PMEM).
> > > >
> > > > - Each major tier can have minor tiers, e.g. tier0.0, tier1.0,
> > > > tier1.1, tier2.0, tier2.1.
> > > >
> > > > The earlier 4-tier example can be represented as:
> > > >
> > > > memtier0.0 -> memtier1.0 -> memtier2.0 -> memtier2.1
> > > >
> > > > We can also omit .0 so that the tiers are:
> > > >
> > > > memtier0 -> memtier1 -> memtier2 -> memtier2.1
> > > >
> > > > This should be flexible enough to support multiple tiers while keeping
> > > > the tier IDs relatively stable.
> > > >
> > > > It is not as flexible as the rank approach. For example, to insert a
> > > > new tier between 2.0 and 2.1, we need to add a tier 2.2 and reassign
> > > > existing nodes to these 3 tiers.  Using "rank", we can insert a new
> > > > tier and only move desired nodes into the new tier.
> > > >
> > > > What do you think?
> > >
> > > The rank approach looks better to me.  And if we stick with the device ID
> > > rule as follows,
> > >
> > > ...
> > > 255     GPU
> > > 0       DRAM
> > > 1       PMEM
> > > 2
> > > ...
> > >
> > > 255 is -1 for "s8".
> > >
> > > The device ID should do most tricks at least now.  The rank can provide
> > > more flexibility in the future.  We can even go without rank in the
> > > first version, and introduce it when it's necessary.
> >
> > Given that the "rank" approach is generally favored, let's go with
> > that to avoid compatibility issues that may come from the switch of
> > device ID tricks to ranks.
>
> OK.  Just to confirm.  Does this mean that we will have fixed device ID,
> for example,
>
> GPU                     memtier255
> DRAM (with CPU)         memtier0
> PMEM                    memtier1
>
> When we add a new memtier, it can be memtier254, or memtier2?  The rank
> value will determine the real demotion order.

With the rank approach, the device ID numbering should be flexible and
not mandated by the proposal.

> I think you may need to send v3 to make sure everyone is at the same
> page.

Will do it shortly.

> Best Regards,
> Huang, Ying
>
> > > Best Regards,
> > > Huang, Ying
> > >
> > > > > > The tier order: memtier128 -> memtier0 -> memtier1 -> memtier2
> > > > > >
> > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > 0
> > > > > > 2
> > > > > > 3
> > > > > > 1
> > > > > >
> > > > > > $ ls -l /sys/devices/system/node/node*/memtier
> > > > > > /sys/devices/system/node/node0/memtier -> /sys/devices/system/memtier/memtier0
> > > > > > /sys/devices/system/node/node1/memtier -> /sys/devices/system/memtier/memtier128
> > > > > > /sys/devices/system/node/node2/memtier -> /sys/devices/system/memtier/memtier1
> > > > > > /sys/devices/system/node/node3/memtier -> /sys/devices/system/memtier/memtier2
> > > > > >
> > > > > > To override the memory tier of a node, we can use a new, write-only,
> > > > > > per-node interface file:
> > > > > >
> > > > > > /sys/devices/system/node/nodeN/set_memtier
> > > > > >
> > > > > > e.g.
> > > > > >
> > > > > > $ echo "memtier128" > sys/devices/system/node/node1/set_memtier
> > > > >
> > > > > I prefer the original proposal to make nodeX/memtier a normal file to
> > > > > hold the memtier device ID instead of a link.
> > > >
> > > > OK. We don't have to use a symlink.
> > > >
> > > > > Best Regards,
> > > > > Huang, Ying
> > > > >
> > > > > > Any comments?
> > > > > >
> > > > > > > Jonathan
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > >
> > > > >
> > > > >
> > >
> > >
> > >
>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-25 10:01               ` Aneesh Kumar K V
  2022-05-25 11:36                 ` Mika Penttilä
@ 2022-05-25 17:27                 ` Wei Xu
  2022-05-26  9:32                   ` Jonathan Cameron
  2022-05-27  9:26                   ` Aneesh Kumar K V
  1 sibling, 2 replies; 47+ messages in thread
From: Wei Xu @ 2022-05-25 17:27 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: Ying Huang, Jonathan Cameron, Andrew Morton, Greg Thelen,
	Yang Shi, Linux Kernel Mailing List, Jagdish Gediya,
	Michal Hocko, Tim C Chen, Dave Hansen, Alistair Popple,
	Baolin Wang, Feng Tang, Davidlohr Bueso, Dan Williams,
	David Rientjes, Linux MM, Brice Goglin, Hesham Almatary

On Wed, May 25, 2022 at 3:01 AM Aneesh Kumar K V
<aneesh.kumar@linux.ibm.com> wrote:
>
> On 5/25/22 2:33 PM, Ying Huang wrote:
> > On Tue, 2022-05-24 at 22:32 -0700, Wei Xu wrote:
> >> On Tue, May 24, 2022 at 1:24 AM Ying Huang <ying.huang@intel.com> wrote:
> >>>
> >>> On Tue, 2022-05-24 at 00:04 -0700, Wei Xu wrote:
> >>>> On Thu, May 19, 2022 at 8:06 PM Ying Huang <ying.huang@intel.com> wrote:
> >>>>>
>
> ...
>
> >
> > OK.  Just to confirm.  Does this mean that we will have fixed device ID,
> > for example,
> >
> > GPU                   memtier255
> > DRAM (with CPU)               memtier0
> > PMEM                  memtier1
> >
> > When we add a new memtier, it can be memtier254, or memtier2?  The rank
> > value will determine the real demotion order.
> >
> > I think you may need to send v3 to make sure everyone is at the same
> > page.
> >
>
> What we have implemented, which we will send as an RFC shortly, is below.
>
> kvaneesh@ubuntu-guest:~$ cd /sys/devices/system/
> kvaneesh@ubuntu-guest:/sys/devices/system$ pwd
> /sys/devices/system
> kvaneesh@ubuntu-guest:/sys/devices/system$ ls
> clockevents  clocksource  container  cpu  edac  memory  memtier  mpic
> node  power
> kvaneesh@ubuntu-guest:/sys/devices/system$ cd memtier/
> kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ pwd
> /sys/devices/system/memtier
> kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ ls
> default_rank  max_rank  memtier1  power  uevent
> kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ cat default_rank
> 1
> kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ cat max_rank
> 3

For flexibility, we don't want max_rank to be interpreted as the
number of memory tiers.  Also, we want to leave gaps in the rank values
to allow new memtiers to be inserted when needed.  So I'd suggest
making max_rank a much larger value (e.g. 255).
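
As a hypothetical illustration of the spacing idea (these rank values
are made up, not part of any posted patch), the initial tiers could be
created as:

$ cat /sys/devices/system/memtier/memtier*/rank
100
200
300

Then a device that falls between DRAM and PMEM can later be given a
new tier with e.g. rank 250, without renumbering any existing tier or
moving any other node.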

> kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ cd memtier1/
> kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ ls
> nodelist  power  rank  subsystem  uevent
> kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ cat nodelist
> 0-3
> kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ cat rank
> 1
> kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ cd
> ../../node/node1/
> kvaneesh@ubuntu-guest:/sys/devices/system/node/node1$ cat memtier
> 1
> kvaneesh@ubuntu-guest:/sys/devices/system/node/node1$
> root@ubuntu-guest:/sys/devices/system/node/node1# echo 0 > memtier
> root@ubuntu-guest:/sys/devices/system/node/node1# cat memtier
> 0
> root@ubuntu-guest:/sys/devices/system/node/node1# cd ../../memtier/
> root@ubuntu-guest:/sys/devices/system/memtier# ls
> default_rank  max_rank  memtier0  memtier1  power  uevent
> root@ubuntu-guest:/sys/devices/system/memtier# cd memtier0/
> root@ubuntu-guest:/sys/devices/system/memtier/memtier0# cat nodelist
> 1
> root@ubuntu-guest:/sys/devices/system/memtier/memtier0# cat rank
> 0

It looks like the example here demonstrates the dynamic creation of
memtier0.  If so, how is the rank of memtier0 determined?  If we want
to support creating new memtiers at runtime, I think an explicit
interface that specifies both device ID and rank is preferred to avoid
implicit dependencies between device IDs and ranks.
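
One possible shape for such an explicit interface (the file name and
argument format below are only a sketch to show the idea, not a
settled proposal):

$ echo "2 250" > /sys/devices/system/memtier/create_tier
$ cat /sys/devices/system/memtier/memtier2/rank
250

That is, userspace names both the device ID and the rank explicitly,
so the kernel never has to derive one from the other.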

> root@ubuntu-guest:/sys/devices/system/memtier/memtier0# echo 4 > rank
> bash: rank: Permission denied
> root@ubuntu-guest:/sys/devices/system/memtier/memtier0#

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-25 15:36               ` Wei Xu
@ 2022-05-26  1:09                 ` Ying Huang
  2022-05-26  3:53                   ` Wei Xu
  0 siblings, 1 reply; 47+ messages in thread
From: Ying Huang @ 2022-05-26  1:09 UTC (permalink / raw)
  To: Wei Xu
  Cc: Jonathan Cameron, Andrew Morton, Greg Thelen, Aneesh Kumar K.V,
	Yang Shi, Linux Kernel Mailing List, Jagdish Gediya,
	Michal Hocko, Tim C Chen, Dave Hansen, Alistair Popple,
	Baolin Wang, Feng Tang, Davidlohr Bueso, Dan Williams,
	David Rientjes, Linux MM, Brice Goglin, Hesham Almatary

On Wed, 2022-05-25 at 08:36 -0700, Wei Xu wrote:
> On Wed, May 25, 2022 at 2:03 AM Ying Huang <ying.huang@intel.com> wrote:
> > 
> > On Tue, 2022-05-24 at 22:32 -0700, Wei Xu wrote:
> > > On Tue, May 24, 2022 at 1:24 AM Ying Huang <ying.huang@intel.com> wrote:
> > > > 
> > > > On Tue, 2022-05-24 at 00:04 -0700, Wei Xu wrote:
> > > > > On Thu, May 19, 2022 at 8:06 PM Ying Huang <ying.huang@intel.com> wrote:
> > > > > > 
> > > > > > On Wed, 2022-05-18 at 00:09 -0700, Wei Xu wrote:
> > > > > > > On Thu, May 12, 2022 at 8:00 AM Jonathan Cameron
> > > > > > > <Jonathan.Cameron@huawei.com> wrote:
> > > > > > > > 
> > > > > > > > On Wed, 11 May 2022 23:22:11 -0700
> > > > > > > > Wei Xu <weixugc@google.com> wrote:
> > > > > > > > > The current kernel has the basic memory tiering support: Inactive
> > > > > > > > > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > > > > > > > > tier NUMA node to make room for new allocations on the higher tier
> > > > > > > > > NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> > > > > > > > > migrated (promoted) to a higher tier NUMA node to improve the
> > > > > > > > > performance.
> > > > > > > > > 
> > > > > > > > > In the current kernel, memory tiers are defined implicitly via a
> > > > > > > > > demotion path relationship between NUMA nodes, which is created during
> > > > > > > > > the kernel initialization and updated when a NUMA node is hot-added or
> > > > > > > > > hot-removed.  The current implementation puts all nodes with CPU into
> > > > > > > > > the top tier, and builds the tier hierarchy tier-by-tier by establishing
> > > > > > > > > the per-node demotion targets based on the distances between nodes.
> > > > > > > > > 
> > > > > > > > > This current memory tier kernel interface needs to be improved for
> > > > > > > > > several important use cases:
> > > > > > > > > 
> > > > > > > > > * The current tier initialization code always initializes
> > > > > > > > >   each memory-only NUMA node into a lower tier.  But a memory-only
> > > > > > > > >   NUMA node may have a high performance memory device (e.g. a DRAM
> > > > > > > > >   device attached via CXL.mem or a DRAM-backed memory-only node on
> > > > > > > > >   a virtual machine) and should be put into a higher tier.
> > > > > > > > > 
> > > > > > > > > * The current tier hierarchy always puts CPU nodes into the top
> > > > > > > > >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> > > > > > > > >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> > > > > > > > >   with CPUs are better to be placed into the next lower tier.
> > > > > > > > > 
> > > > > > > > > * Also because the current tier hierarchy always puts CPU nodes
> > > > > > > > >   into the top tier, when a CPU is hot-added (or hot-removed) and
> > > > > > > > >   triggers a memory node from CPU-less into a CPU node (or vice
> > > > > > > > >   versa), the memory tier hierarchy gets changed, even though no
> > > > > > > > >   memory node is added or removed.  This can make the tier
> > > > > > > > >   hierarchy unstable and make it difficult to support tier-based
> > > > > > > > >   memory accounting.
> > > > > > > > > 
> > > > > > > > > * A higher tier node can only be demoted to selected nodes on the
> > > > > > > > >   next lower tier as defined by the demotion path, not any other
> > > > > > > > >   node from any lower tier.  This strict, hard-coded demotion order
> > > > > > > > >   does not work in all use cases (e.g. some use cases may want to
> > > > > > > > >   allow cross-socket demotion to another node in the same demotion
> > > > > > > > >   tier as a fallback when the preferred demotion node is out of
> > > > > > > > >   space), and has resulted in the feature request for an interface to
> > > > > > > > >   override the system-wide, per-node demotion order from the
> > > > > > > > >   userspace.  This demotion order is also inconsistent with the page
> > > > > > > > >   allocation fallback order when all the nodes in a higher tier are
> > > > > > > > >   out of space: The page allocation can fall back to any node from
> > > > > > > > >   any lower tier, whereas the demotion order doesn't allow that.
> > > > > > > > > 
> > > > > > > > > * There are no interfaces for the userspace to learn about the memory
> > > > > > > > >   tier hierarchy in order to optimize its memory allocations.
> > > > > > > > > 
> > > > > > > > > I'd like to propose revised memory tier kernel interfaces based on
> > > > > > > > > the discussions in the threads:
> > > > > > > > > 
> > > > > > > > > - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> > > > > > > > > - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
> > > > > > > > > - https://lore.kernel.org/linux-mm/867bc216386eb6cbf54648f23e5825830f5b922e.camel@intel.com/T/
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > High-level Design Ideas
> > > > > > > > > =======================
> > > > > > > > > 
> > > > > > > > > * Define memory tiers explicitly, not implicitly.
> > > > > > > > > 
> > > > > > > > > * Memory tiers are defined based on hardware capabilities of memory
> > > > > > > > >   nodes, not their relative node distances between each other.
> > > > > > > > > 
> > > > > > > > > * The tier assignment of each node is independent from each other.
> > > > > > > > >   Moving a node from one tier to another tier doesn't affect the tier
> > > > > > > > >   assignment of any other node.
> > > > > > > > > 
> > > > > > > > > * The node-tier association is stable. A node can be reassigned to a
> > > > > > > > >   different tier only under the specific conditions that don't block
> > > > > > > > >   future tier-based memory cgroup accounting.
> > > > > > > > > 
> > > > > > > > > * A node can demote its pages to any nodes of any lower tiers. The
> > > > > > > > >   demotion target node selection follows the allocation fallback order
> > > > > > > > >   of the source node, which is built based on node distances.  The
> > > > > > > > >   demotion targets are also restricted to only the nodes from the tiers
> > > > > > > > >   lower than the source node.  We no longer need to maintain a separate
> > > > > > > > >   per-node demotion order (node_demotion[]).
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Hi Wei,
> > > > > > > > 
> > > > > > > > This proposal looks good to me, though we'll be having fun
> > > > > > > > white boarding topologies from our roadmaps for the next few days :)
> > > > > > > 
> > > > > > > That's good to hear.
> > > > > > > 
> > > > > > > > A few comments inline. It also seems likely to me that there is little
> > > > > > > > benefit in starting with 3 tiers as the maximum.  Seems unlikely the
> > > > > > > > code will be substantially simpler for 3 than it would be for 4 or 5.
> > > > > > > > I've drawn out one simple case that needs 4 to do sensible things.
> > > > > > > 
> > > > > > > We can make the number of tiers a config option. 3 tiers are just what
> > > > > > > the kernel can reasonably initialize when there isn't enough hardware
> > > > > > > performance information from the firmware.
> > > > > > > 
> > > > > > > > > 
> > > > > > > > > Sysfs Interfaces
> > > > > > > > > ================
> > > > > > > > > 
> > > > > > > > > * /sys/devices/system/memtier/memtierN/nodelist
> > > > > > > > > 
> > > > > > > > >   where N = 0, 1, 2 (the kernel supports only 3 tiers for now).
> > > > > > > > > 
> > > > > > > > >   Format: node_list
> > > > > > > > > 
> > > > > > > > >   Read-only.  When read, list the memory nodes in the specified tier.
> > > > > > > > > 
> > > > > > > > >   Tier 0 is the highest tier, while tier 2 is the lowest tier.
> > > > > > > > > 
> > > > > > > > >   The absolute value of a tier id number has no specific meaning.
> > > > > > > > >   What matters is the relative order of the tier id numbers.
> > > > > > > > > 
> > > > > > > > >   When a memory tier has no nodes, the kernel can hide its memtier
> > > > > > > > >   sysfs files.
> > > > > > > > > 
> > > > > > > > > * /sys/devices/system/node/nodeN/memtier
> > > > > > > > > 
> > > > > > > > >   where N = 0, 1, ...
> > > > > > > > > 
> > > > > > > > >   Format: int or empty
> > > > > > > > > 
> > > > > > > > >   When read, list the memory tier that the node belongs to.  Its value
> > > > > > > > >   is empty for a CPU-only NUMA node.
> > > > > > > > > 
> > > > > > > > >   When written, the kernel moves the node into the specified memory
> > > > > > > > >   tier if the move is allowed.  The tier assignment of all other nodes
> > > > > > > > >   are not affected.
> > > > > > > > > 
> > > > > > > > >   Initially, we can make this interface read-only.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Kernel Representation
> > > > > > > > > =====================
> > > > > > > > > 
> > > > > > > > > * All memory tiering code is guarded by CONFIG_TIERED_MEMORY.
> > > > > > > > > 
> > > > > > > > > * #define MAX_MEMORY_TIERS 3
> > > > > > > > > 
> > > > > > > > >   Support 3 memory tiers for now.
> > > > > > > > > 
> > > > > > > > > * #define MEMORY_DEFAULT_TIER 1
> > > > > > > > > 
> > > > > > > > >   The default tier that a memory node is assigned to.
> > > > > > > > > 
> > > > > > > > > * nodemask_t memory_tiers[MAX_MEMORY_TIERS]
> > > > > > > > > 
> > > > > > > > >   Store memory nodes by tiers.
> > > > > > > > > 
> > > > > > > > > * int node_tier_map[MAX_NUMNODES]
> > > > > > > > > 
> > > > > > > > >   Map a node to its tier.
> > > > > > > > > 
> > > > > > > > >   For each CPU-only node c, node_tier_map[c] = -1.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Memory Tier Initialization
> > > > > > > > > ==========================
> > > > > > > > > 
> > > > > > > > > By default, all memory nodes are assigned to the default tier
> > > > > > > > > (MEMORY_DEFAULT_TIER).
> > > > > > > > 
> > > > > > > > This is tighter than it needs to be.  In many cases we can easily
> > > > > > > > establish if there is any possibility of CPU being hotplugged into
> > > > > > > > a memory node.  If it's CXL attached no way CPUs are going to be
> > > > > > > > turning up there later :)  If CPU HP into a given node can't happen
> > > > > > > > we can be more flexible and I think that often results in better decisions.
> > > > > > > > See example below, though obviously I could just use the userspace
> > > > > > > > interface to fix that up anyway or have a CXL driver move it around
> > > > > > > > if that's relevant.  In some other cases I'm fairly sure we know in
> > > > > > > > advance where CPUs can be added but I'd need to check all the
> > > > > > > > relevant specs to be sure there aren't any corner cases.  I 'think'
> > > > > > > > for ARM for example we know where all possible CPUs can be hotplugged
> > > > > > > > (constraint coming from the interrupt controller + the fact that only
> > > > > > > > virtual CPU HP is defined).
> > > > > > > 
> > > > > > > We may not always want to put a CXL-attached memory device into a
> > > > > > > slower tier because even though CXL does add some additional latency,
> > > > > > > both the memory device and CXL can still be very capable in
> > > > > > > performance and may not be much slower (if any) than the on-board DRAM
> > > > > > > (e.g. DRAM from a remote CPU socket).
> > > > > > > 
> > > > > > > Also, the default tier here is just the initial tier assignment of
> > > > > > > each node, which behaves as if there were no tiering.  A tiering
> > > > > > > kernel init function can certainly reassign the tier for each node if
> > > > > > > it knows enough about the hardware performance for these nodes from
> > > > > > > the firmware.
> > > > > > > 
> > > > > > > > > 
> > > > > > > > > A device driver can move up or down its memory nodes from the default
> > > > > > > > > tier.  For example, PMEM can move down its memory nodes below the
> > > > > > > > > default tier, whereas GPU can move up its memory nodes above the
> > > > > > > > > default tier.
> > > > > > > > > 
> > > > > > > > > The kernel initialization code makes the decision on which exact tier
> > > > > > > > > a memory node should be assigned to based on the requests from the
> > > > > > > > > device drivers as well as the memory device hardware information
> > > > > > > > > provided by the firmware.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Memory Tier Reassignment
> > > > > > > > > ========================
> > > > > > > > > 
> > > > > > > > > After a memory node is hot-removed, it can be hot-added back to a
> > > > > > > > > different memory tier.  This is useful for supporting dynamically
> > > > > > > > > provisioned CXL.mem NUMA nodes, which may connect to different
> > > > > > > > > memory devices across hot-plug events.  Such tier changes should
> > > > > > > > > be compatible with tier-based memory accounting.
> > > > > > > > > 
> > > > > > > > > The userspace may also reassign an existing online memory node to a
> > > > > > > > > different tier.  However, this should only be allowed when no pages
> > > > > > > > > are allocated from the memory node or when there are no non-root
> > > > > > > > > memory cgroups (e.g. during the system boot).  This restriction is
> > > > > > > > > important for keeping memory tier hierarchy stable enough for
> > > > > > > > > tier-based memory cgroup accounting.
> > > > > > > > > 
> > > > > > > > > Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Memory Allocation for Demotion
> > > > > > > > > ==============================
> > > > > > > > > 
> > > > > > > > > To allocate a new page as the demotion target for a page, the kernel
> > > > > > > > > calls the allocation function (__alloc_pages_nodemask) with the
> > > > > > > > > source page node as the preferred node and the union of all lower
> > > > > > > > > tier nodes as the allowed nodemask.  The actual target node selection
> > > > > > > > > then follows the allocation fallback order that the kernel has
> > > > > > > > > already defined.
> > > > > > > > > 
> > > > > > > > > The pseudo code looks like:
> > > > > > > > > 
> > > > > > > > >     targets = NODE_MASK_NONE;
> > > > > > > > >     src_nid = page_to_nid(page);
> > > > > > > > >     src_tier = node_tier_map[src_nid];
> > > > > > > > >     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
> > > > > > > > >             nodes_or(targets, targets, memory_tiers[i]);
> > > > > > > > >     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
> > > > > > > > > 
> > > > > > > > > The mempolicy of cpuset, vma and owner task of the source page can
> > > > > > > > > be set to refine the demotion target nodemask, e.g. to prevent
> > > > > > > > > demotion or select a particular allowed node as the demotion target.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Memory Allocation for Promotion
> > > > > > > > > ===============================
> > > > > > > > > 
> > > > > > > > > The page allocation for promotion is similar to demotion, except that (1)
> > > > > > > > > the target nodemask uses the promotion tiers, (2) the preferred node can
> > > > > > > > > be the accessing CPU node, not the source page node.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Examples
> > > > > > > > > ========
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > ...
> > > > > > > > 
> > > > > > > > > * Example 3:
> > > > > > > > > 
> > > > > > > > > Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
> > > > > > > > 
> > > > > > > > Node2 is drawn as pmem.
> > > > > > > 
> > > > > > > Typo. Good catch.
> > > > > > > 
> > > > > > > > > 
> > > > > > > > > All nodes are in the same tier.
> > > > > > > > > 
> > > > > > > > >                   20
> > > > > > > > >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> > > > > > > > >          \                 /
> > > > > > > > >           \ 30            / 30
> > > > > > > > >            \             /
> > > > > > > > >              Node 2 (PMEM)
> > > > > > > > > 
> > > > > > > > > node distances:
> > > > > > > > > node   0    1    2
> > > > > > > > >    0  10   20   30
> > > > > > > > >    1  20   10   30
> > > > > > > > >    2  30   30   10
> > > > > > > > > 
> > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > <empty>
> > > > > > > > > 0-2
> > > > > > > > > <empty>
> > > > > > > > > 
> > > > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > 1
> > > > > > > > > 1
> > > > > > > > > 1
> > > > > > > > > 
> > > > > > > > > Demotion fallback order:
> > > > > > > > > node 0: empty
> > > > > > > > > node 1: empty
> > > > > > > > > node 2: empty
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > * Example 4:
> > > > > > > > > 
> > > > > > > > > Node 0 is a DRAM node with CPU.
> > > > > > > > > Node 1 is a PMEM node.
> > > > > > > > > Node 2 is a GPU node.
> > > > > > > > > 
> > > > > > > > >                   50
> > > > > > > > >   Node 0 (DRAM)  ----  Node 2 (GPU)
> > > > > > > > >          \                 /
> > > > > > > > >           \ 30            / 60
> > > > > > > > >            \             /
> > > > > > > > >              Node 1 (PMEM)
> > > > > > > > > 
> > > > > > > > > node distances:
> > > > > > > > > node   0    1    2
> > > > > > > > >    0  10   30   50
> > > > > > > > >    1  30   10   60
> > > > > > > > >    2  50   60   10
> > > > > > > > > 
> > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > 2
> > > > > > > > > 0
> > > > > > > > > 1
> > > > > > > > > 
> > > > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > 1
> > > > > > > > > 2
> > > > > > > > > 0
> > > > > > > > > 
> > > > > > > > > Demotion fallback order:
> > > > > > > > > node 0: 1
> > > > > > > > > node 1: empty
> > > > > > > > > node 2: 0, 1
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > * Example 5:
> > > > > > > > > 
> > > > > > > > > Node 0 is a DRAM node with CPU.
> > > > > > > > > Node 1 is a GPU node.
> > > > > > > > > Node 2 is a PMEM node.
> > > > > > > > > Node 3 is a large, slow DRAM node without CPU.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > >      Node 2 (PMEM)  ----
> > > > > > > > >    /      |              \
> > > > > > > > >   /       | 30            \ 120
> > > > > > > > >  |        |         100    \
> > > > > > > > >  |   Node 0 (DRAM)  ----  Node 1 (GPU)
> > > > > > > > >   \         \                 /
> > > > > > > > >     \        \ 40            / 110
> > > > > > > > >   80  \       \             /
> > > > > > > > >         ---  Node 3 (Slow DRAM)
> > > > > > > > 
> > > > > > > > This is close but not quite what was intended for Hesham's
> > > > > > > > example... (note we just checked that Hesham's original node0-1
> > > > > > > > timing didn't make any sense.).
> > > > > > > > 
> > > > > > > 
> > > > > > > This was inspired by Hesham's example. But I should have also included
> > > > > > > the version that illustrates the need to skip a tier when demoting
> > > > > > > from certain nodes.
> > > > > > > 
> > > > > > > > > 
> > > > > > > > > node distances:
> > > > > > > > > node    0    1    2    3
> > > > > > > > >    0   10  100   30   40
> > > > > > > > >    1  100   10  120  110
> > > > > > > > >    2   30  120   10   80
> > > > > > > > >    3   40  110   80   10
> > > > > > > > > 
> > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > 1
> > > > > > > > > 0,3
> > > > > > > > > 2
> > > > > > > > > 
> > > > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > 1
> > > > > > > > > 0
> > > > > > > > > 2
> > > > > > > > > 1
> > > > > > > > > 
> > > > > > > > > Demotion fallback order:
> > > > > > > > > node 0: 2
> > > > > > > > > node 1: 0, 3, 2
> > > > > > > > > node 2: empty
> > > > > > > > > node 3: 2
> > > > > > > > 
> > > > > > > > This is close but not quite the same as the example
> > > > > > > > Hesham gave (note the node 1 to 0 timing in the table
> > > > > > > > with that example didn't make sense).  I added another
> > > > > > > > level of switching to make the numbers more obviously
> > > > > > > > different and show how critical it might be.
> > > > > > > > 
> > > > > > > > * Example 6:
> > > > > > > > 
> > > > > > > > Node 0 is a DRAM node with CPU.
> > > > > > > > Node 1 is a GPU node.
> > > > > > > > Node 2 is a PMEM node.
> > > > > > > > Node 3 is an extremely large, DRAM node without CPU.
> > > > > > > >   (Key point here being that it probably never makes sense
> > > > > > > >    to demote to anywhere else from this memory).
> > > > > > > > 
> > > > > > > > 
> > > > > > > > I've redone the timings wrt example 5.
> > > > > > > > Basis for this is 0 and 2 are directly connected
> > > > > > > > via controllers in an SoC. 1 and 3 are connected
> > > > > > > > via a common switch, one switch further down
> > > > > > > > (each hop via this is 100).
> > > > > > > > All drams cost 10 once you've reached the correct node
> > > > > > > > and pmem costs 30 from SoC.
> > > > > > > > Numbers get too large as a result but meh, I'm making
> > > > > > > > a point not providing real numbers :)
> > > > > > > > 
> > > > > > > >          PMEM Node 2
> > > > > > > >             |(30)
> > > > > > > >         CPU + DRAM Node0
> > > > > > > >             |(100)
> > > > > > > >          Switch 1
> > > > > > > >             |(100)
> > > > > > > >           Switch 2
> > > > > > > >     (100)  |      |(100)
> > > > > > > > Node 1 GPU     Node3 Large memory.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > With one level of switching:
> > > > > > > > 
> > > > > > > >      Node 2 (PMEM)  ----
> > > > > > > >     /      |              \
> > > > > > > >    /       | 30            \ 330
> > > > > > > >   |        |         310    \
> > > > > > > >   |   Node 0 (DRAM)  ----  Node 1 (GPU)
> > > > > > > >    \         \                 /
> > > > > > > >      \        \ 310           / 210
> > > > > > > >    330 \       \             /
> > > > > > > >          ---  Node 3 (Extremely large DRAM)
> > > > > > > > 
> > > > > > > > To my mind, we should potentially also take into account
> > > > > > > > the fact that Node3 can be known to never contain CPUs
> > > > > > > > (in at least some architectures we know where the CPUs
> > > > > > > >  might be added later, they can't just magically turn up
> > > > > > > >  anywhere in the topology).
> > > > > > > > 
> > > > > > > > node distances:
> > > > > > > > node    0    1    2    3
> > > > > > > >     0   10   310  30   310
> > > > > > > >     1   310  10   330  210
> > > > > > > >     2   30   330  10   330
> > > > > > > >     3   310  210  330   10
> > > > > > > > 
> > > > > > > > So, my ideal would treat node 3 differently from other dram nodes
> > > > > > > > as we know it can't have CPUs. Trying to come up with an
> > > > > > > > always correct order for nodes 3 and 2 is tricky as it depends
> > > > > > > > to a certain extent on capacity. If node 2 was big enough to take
> > > > > > > > any demotion from node 0 and still have lots of room then demoting
> > > > > > > > there from node 3 would make sense and vice versa.
> > > > > > > > 
> > > > > > > > 
> > > > > > > >  $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > >  1
> > > > > > > >  0
> > > > > > > >  2
> > > > > > > >  3
> > > > > > > > 
> > > > > > > > 
> > > > > > > >  $ cat /sys/devices/system/node/node*/memtier
> > > > > > > >   1
> > > > > > > >   0
> > > > > > > >   2
> > > > > > > >   3
> > > > > > > > 
> > > > > > > >  Demotion fallback order:
> > > > > > > >  node 0: 2, 3
> > > > > > > >  node 1: 3, 0, 2 (key being we will almost always have less pressure on node 3)
> > > > > > > >  node 2: 3
> > > > > > > >  node 3: empty
> > > > > > > > 
> > > > > > > > or, as Hesham just pointed out, this can be done with 3 tiers
> > > > > > > > because we can put the GPU and CPU in the same tier, as
> > > > > > > > there is little reason to demote from one to the other.
> > > > > > > 
> > > > > > > Thank you for the example.  It makes sense to me to have node 3 on its
> > > > > > > own tier.  We can have either 3 tiers or 4 tiers in total (assuming
> > > > > > > that the max number of tiers is a config option).
> > > > > > > 
> > > > > > > > We are also a bit worried about ABI backwards compatibility because
> > > > > > > > of potential need to make more space in tiers lower in number than
> > > > > > > > CPU attached DDR. I rather liked the negative proposal with
> > > > > > > > default as 0 that Huang, Ying made.
> > > > > > > 
> > > > > > > It is hard to have negative values as the device IDs.
> > > > > > > 
> > > > > > > The current proposal equates the tier device ID with the tier hierarchy
> > > > > > > level, which makes the interface simpler, but less flexible.  How
> > > > > > > about the following proposal (which decouples the tier device ID from
> > > > > > > the tier level)?
> > > > > > > 
> > > > > > > /sys/devices/system/memtier/memtierN/nodelist
> > > > > > > /sys/devices/system/memtier/memtierN/rank
> > > > > > > 
> > > > > > > Each memory tier N has two sysfs files:
> > > > > > > - nodelist: the nodes that are in this tier
> > > > > > > - rank: an opaque value that helps decide the level at which this tier
> > > > > > > is in the tier hierarchy (smaller value means faster tier)
> > > > > > > 
> > > > > > > The tier hierarchy is determined by "rank", not by the device id
> > > > > > > number N from "memtierN".
> > > > > > > 
> > > > > > > The absolute value of "rank" of a memtier doesn't necessarily carry
> > > > > > > any meaning. Its value relative to other memtiers decides the level of
> > > > > > > this memtier in the tier hierarchy.
> > > > > > > 
> > > > > > > The CPU-attached DRAM nodes are always in memtier0 (the device ID),
> > > > > > > but memtier0 may not always be the top-tier, e.g. its level can be 3
> > > > > > > in a 5-tier system.
> > > > > > > 
> > > > > > > For the above example (example 6), we can have:
> > > > > > > 
> > > > > > > $ ls /sys/devices/system/memtier
> > > > > > > memtier0
> > > > > > > memtier1
> > > > > > > memtier2
> > > > > > > memtier128
> > > > > > > 
> > > > > > > $ cat /sys/devices/system/memtier/memtier*/rank
> > > > > > > 50
> > > > > > > 60
> > > > > > > 70
> > > > > > > 10
> > > > > > 
> > > > > > I understand that the device ID cannot be negative.  So we have to use
> > > > > > rank.  Can we make it possible to allow "rank" to be negative?
> > > > > 
> > > > > It is possible to allow "rank" to be negative, though I think all
> > > > > positive values should work equally well.
> > > > > 
> > > > > > Another choice is to do some trick on device ID.  For example, the CPU-
> > > > > > attached DRAM nodes are always memtier100 (the device ID).  Then we can
> > > > > > have memtier99, memtier100, memtier101, memtier102, ....  That's not
> > > > > > perfect either.
> > > > > 
> > > > > If we go with the device ID tricks, one approach is to use sub-device IDs:
> > > > > 
> > > > > - There are 3 major tiers: tier0 (e.g. GPU), tier1 (e.g. DRAM) and
> > > > > tier2 (e.g. PMEM).
> > > > > 
> > > > > - Each major tier can have minor tiers, e.g. tier0.0, tier1.0,
> > > > > tier1.1, tier2.0, tier2.1.
> > > > > 
> > > > > The earlier 4-tier example can be represented as:
> > > > > 
> > > > > memtier0.0 -> memtier1.0 -> memtier2.0 -> memtier2.1
> > > > > 
> > > > > We can also omit .0 so that the tiers are:
> > > > > 
> > > > > memtier0 -> memtier1 -> memtier2 -> memtier2.1
> > > > > 
> > > > > This should be flexible enough to support multiple tiers while keeping
> > > > > the tier IDs relatively stable.
> > > > > 
> > > > > It is not as flexible as the rank approach. For example, to insert a
> > > > > new tier between 2.0 and 2.1, we need to add a tier 2.2 and reassign
> > > > > existing nodes to these 3 tiers.  Using "rank", we can insert a new
> > > > > tier and only move desired nodes into the new tier.
> > > > > 
> > > > > What do you think?
> > > > 
> > > > The rank approach looks better for me.  And if we stick with the device ID
> > > > rule as follows,
> > > > 
> > > > ...
> > > > 255     GPU
> > > > 0       DRAM
> > > > 1       PMEM
> > > > 2
> > > > ...
> > > > 
> > > > 255 is -1 for "s8".
> > > > 
> > > > The device ID should do most tricks at least now.  The rank can provide
> > > > more flexibility in the future.  We can even go without rank in the
> > > > first version, and introduce it when it's necessary.
> > > 
> > > Given that the "rank" approach is generally favored, let's go with
> > > that to avoid compatibility issues that may come from the switch of
> > > device ID tricks to ranks.
> > 
> > OK.  Just to confirm.  Does this mean that we will have fixed device ID,
> > for example,
> > 
> > GPU                     memtier255
> > DRAM (with CPU)         memtier0
> > PMEM                    memtier1
> > 
> > When we add a new memtier, it can be memtier254, or memtier2?  The rank
> > value will determine the real demotion order.
> 
> With the rank approach, the device ID numbering should be flexible and
> not mandated by the proposal.

If so, will the rank numbers be fixed?  For example,

GPU			100
DRAM (with CPU)		200
PMEM			300

When we add a new memtier, its rank can be 50, 150, 250, or 400?

If so, this makes me wonder why we don't just make this kind of rank the
device ID.  Or did I miss something?

Or are both device IDs and rank values not fixed?  Why do we need that
kind of flexibility?  Sorry, I may not understand all the requirements.

Best Regards,
Huang, Ying

> > I think you may need to send v3 to make sure everyone is at the same
> > page.
> 
> Will do it shortly.

Good!  Thanks!

Best Regards,
Huang, Ying

> > Best Regards,
> > Huang, Ying
> > 
> > > > Best Regards,
> > > > Huang, Ying
> > > > 
> > > > > > > The tier order: memtier128 -> memtier0 -> memtier1 -> memtier2
> > > > > > > 
> > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > 0
> > > > > > > 2
> > > > > > > 3
> > > > > > > 1
> > > > > > > 
> > > > > > > $ ls -l /sys/devices/system/node/node*/memtier
> > > > > > > /sys/devices/system/node/node0/memtier -> /sys/devices/system/memtier/memtier0
> > > > > > > /sys/devices/system/node/node1/memtier -> /sys/devices/system/memtier/memtier128
> > > > > > > /sys/devices/system/node/node2/memtier -> /sys/devices/system/memtier/memtier1
> > > > > > > /sys/devices/system/node/node3/memtier -> /sys/devices/system/memtier/memtier2
> > > > > > > 
> > > > > > > To override the memory tier of a node, we can use a new, write-only,
> > > > > > > per-node interface file:
> > > > > > > 
> > > > > > > /sys/devices/system/node/nodeN/set_memtier
> > > > > > > 
> > > > > > > e.g.
> > > > > > > 
> > > > > > > $ echo "memtier128" > /sys/devices/system/node/node1/set_memtier
> > > > > > 
> > > > > > I prefer the original proposal to make nodeX/memtier a normal file to
> > > > > > hold memtier device ID instead of a link.
> > > > > 
> > > > > OK. We don't have to use a symlink.
> > > > > 
> > > > > > Best Regards,
> > > > > > Huang, Ying
> > > > > > 
> > > > > > > Any comments?
> > > > > > > 
> > > > > > > > Jonathan
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > 
> > > > 
> > > > 
> > 
> > 



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-26  1:09                 ` Ying Huang
@ 2022-05-26  3:53                   ` Wei Xu
  2022-05-26  6:54                     ` Ying Huang
  0 siblings, 1 reply; 47+ messages in thread
From: Wei Xu @ 2022-05-26  3:53 UTC (permalink / raw)
  To: Ying Huang
  Cc: Jonathan Cameron, Andrew Morton, Greg Thelen, Aneesh Kumar K.V,
	Yang Shi, Linux Kernel Mailing List, Jagdish Gediya,
	Michal Hocko, Tim C Chen, Dave Hansen, Alistair Popple,
	Baolin Wang, Feng Tang, Davidlohr Bueso, Dan Williams,
	David Rientjes, Linux MM, Brice Goglin, Hesham Almatary

On Wed, May 25, 2022 at 6:10 PM Ying Huang <ying.huang@intel.com> wrote:
>
> On Wed, 2022-05-25 at 08:36 -0700, Wei Xu wrote:
> > On Wed, May 25, 2022 at 2:03 AM Ying Huang <ying.huang@intel.com> wrote:
> > >
> > > On Tue, 2022-05-24 at 22:32 -0700, Wei Xu wrote:
> > > > On Tue, May 24, 2022 at 1:24 AM Ying Huang <ying.huang@intel.com> wrote:
> > > > >
> > > > > On Tue, 2022-05-24 at 00:04 -0700, Wei Xu wrote:
> > > > > > On Thu, May 19, 2022 at 8:06 PM Ying Huang <ying.huang@intel.com> wrote:
> > > > > > >
> > > > > > > On Wed, 2022-05-18 at 00:09 -0700, Wei Xu wrote:
> > > > > > > > On Thu, May 12, 2022 at 8:00 AM Jonathan Cameron
> > > > > > > > <Jonathan.Cameron@huawei.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, 11 May 2022 23:22:11 -0700
> > > > > > > > > Wei Xu <weixugc@google.com> wrote:
> > > > > > > > > > The current kernel has the basic memory tiering support: Inactive
> > > > > > > > > > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > > > > > > > > > tier NUMA node to make room for new allocations on the higher tier
> > > > > > > > > > NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> > > > > > > > > > migrated (promoted) to a higher tier NUMA node to improve the
> > > > > > > > > > performance.
> > > > > > > > > >
> > > > > > > > > > In the current kernel, memory tiers are defined implicitly via a
> > > > > > > > > > demotion path relationship between NUMA nodes, which is created during
> > > > > > > > > > the kernel initialization and updated when a NUMA node is hot-added or
> > > > > > > > > > hot-removed.  The current implementation puts all nodes with CPU into
> > > > > > > > > > the top tier, and builds the tier hierarchy tier-by-tier by establishing
> > > > > > > > > > the per-node demotion targets based on the distances between nodes.
> > > > > > > > > >
> > > > > > > > > > This current memory tier kernel interface needs to be improved for
> > > > > > > > > > several important use cases:
> > > > > > > > > >
> > > > > > > > > > * The current tier initialization code always initializes
> > > > > > > > > >   each memory-only NUMA node into a lower tier.  But a memory-only
> > > > > > > > > >   NUMA node may have a high performance memory device (e.g. a DRAM
> > > > > > > > > >   device attached via CXL.mem or a DRAM-backed memory-only node on
> > > > > > > > > >   a virtual machine) and should be put into a higher tier.
> > > > > > > > > >
> > > > > > > > > > * The current tier hierarchy always puts CPU nodes into the top
> > > > > > > > > >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> > > > > > > > > >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> > > > > > > > > >   with CPUs are better to be placed into the next lower tier.
> > > > > > > > > >
> > > > > > > > > > * Also because the current tier hierarchy always puts CPU nodes
> > > > > > > > > >   into the top tier, when a CPU is hot-added (or hot-removed) and
> > > > > > > > > >   triggers a memory node from CPU-less into a CPU node (or vice
> > > > > > > > > >   versa), the memory tier hierarchy gets changed, even though no
> > > > > > > > > >   memory node is added or removed.  This can make the tier
> > > > > > > > > >   hierarchy unstable and make it difficult to support tier-based
> > > > > > > > > >   memory accounting.
> > > > > > > > > >
> > > > > > > > > > * A higher tier node can only be demoted to selected nodes on the
> > > > > > > > > >   next lower tier as defined by the demotion path, not any other
> > > > > > > > > >   node from any lower tier.  This strict, hard-coded demotion order
> > > > > > > > > >   does not work in all use cases (e.g. some use cases may want to
> > > > > > > > > >   allow cross-socket demotion to another node in the same demotion
> > > > > > > > > >   tier as a fallback when the preferred demotion node is out of
> > > > > > > > > >   space), and has resulted in the feature request for an interface to
> > > > > > > > > >   override the system-wide, per-node demotion order from the
> > > > > > > > > >   userspace.  This demotion order is also inconsistent with the page
> > > > > > > > > >   allocation fallback order when all the nodes in a higher tier are
> > > > > > > > > >   out of space: The page allocation can fall back to any node from
> > > > > > > > > >   any lower tier, whereas the demotion order doesn't allow that.
> > > > > > > > > >
> > > > > > > > > > * There are no interfaces for the userspace to learn about the memory
> > > > > > > > > >   tier hierarchy in order to optimize its memory allocations.
> > > > > > > > > >
> > > > > > > > > > I'd like to propose revised memory tier kernel interfaces based on
> > > > > > > > > > the discussions in the threads:
> > > > > > > > > >
> > > > > > > > > > - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> > > > > > > > > > - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
> > > > > > > > > > - https://lore.kernel.org/linux-mm/867bc216386eb6cbf54648f23e5825830f5b922e.camel@intel.com/T/
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > High-level Design Ideas
> > > > > > > > > > =======================
> > > > > > > > > >
> > > > > > > > > > * Define memory tiers explicitly, not implicitly.
> > > > > > > > > >
> > > > > > > > > > * Memory tiers are defined based on hardware capabilities of memory
> > > > > > > > > >   nodes, not their relative node distances between each other.
> > > > > > > > > >
> > > > > > > > > > * The tier assignment of each node is independent from each other.
> > > > > > > > > >   Moving a node from one tier to another tier doesn't affect the tier
> > > > > > > > > >   assignment of any other node.
> > > > > > > > > >
> > > > > > > > > > * The node-tier association is stable. A node can be reassigned to a
> > > > > > > > > >   different tier only under the specific conditions that don't block
> > > > > > > > > >   future tier-based memory cgroup accounting.
> > > > > > > > > >
> > > > > > > > > > * A node can demote its pages to any nodes of any lower tiers. The
> > > > > > > > > >   demotion target node selection follows the allocation fallback order
> > > > > > > > > >   of the source node, which is built based on node distances.  The
> > > > > > > > > >   demotion targets are also restricted to only the nodes from the tiers
> > > > > > > > > >   lower than the source node.  We no longer need to maintain a separate
> > > > > > > > > >   per-node demotion order (node_demotion[]).
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Hi Wei,
> > > > > > > > >
> > > > > > > > > This proposal looks good to me, though we'll be having fun
> > > > > > > > > white boarding topologies from our roadmaps for the next few days :)
> > > > > > > >
> > > > > > > > That's good to hear.
> > > > > > > >
> > > > > > > > > A few comments inline. It also seems likely to me that there is little
> > > > > > > > > benefit in starting with 3 tiers as the maximum.  Seems unlikely the
> > > > > > > > > code will be substantially simpler for 3 than it would be for 4 or 5.
> > > > > > > > > I've drawn out one simple case that needs 4 to do sensible things.
> > > > > > > >
> > > > > > > > We can make the number of tiers a config option. 3 tiers are just what
> > > > > > > > the kernel can reasonably initialize when there isn't enough hardware
> > > > > > > > performance information from the firmware.
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Sysfs Interfaces
> > > > > > > > > > ================
> > > > > > > > > >
> > > > > > > > > > * /sys/devices/system/memtier/memtierN/nodelist
> > > > > > > > > >
> > > > > > > > > >   where N = 0, 1, 2 (the kernel supports only 3 tiers for now).
> > > > > > > > > >
> > > > > > > > > >   Format: node_list
> > > > > > > > > >
> > > > > > > > > >   Read-only.  When read, list the memory nodes in the specified tier.
> > > > > > > > > >
> > > > > > > > > >   Tier 0 is the highest tier, while tier 2 is the lowest tier.
> > > > > > > > > >
> > > > > > > > > >   The absolute value of a tier id number has no specific meaning.
> > > > > > > > > >   What matters is the relative order of the tier id numbers.
> > > > > > > > > >
> > > > > > > > > >   When a memory tier has no nodes, the kernel can hide its memtier
> > > > > > > > > >   sysfs files.
> > > > > > > > > >
> > > > > > > > > > * /sys/devices/system/node/nodeN/memtier
> > > > > > > > > >
> > > > > > > > > >   where N = 0, 1, ...
> > > > > > > > > >
> > > > > > > > > >   Format: int or empty
> > > > > > > > > >
> > > > > > > > > >   When read, list the memory tier that the node belongs to.  Its value
> > > > > > > > > >   is empty for a CPU-only NUMA node.
> > > > > > > > > >
> > > > > > > > > >   When written, the kernel moves the node into the specified memory
> > > > > > > > > >   tier if the move is allowed.  The tier assignment of all other nodes
> > > > > > > > > >   are not affected.
> > > > > > > > > >
> > > > > > > > > >   Initially, we can make this interface read-only.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Kernel Representation
> > > > > > > > > > =====================
> > > > > > > > > >
> > > > > > > > > > * All memory tiering code is guarded by CONFIG_TIERED_MEMORY.
> > > > > > > > > >
> > > > > > > > > > * #define MAX_MEMORY_TIERS 3
> > > > > > > > > >
> > > > > > > > > >   Support 3 memory tiers for now.
> > > > > > > > > >
> > > > > > > > > > * #define MEMORY_DEFAULT_TIER 1
> > > > > > > > > >
> > > > > > > > > >   The default tier that a memory node is assigned to.
> > > > > > > > > >
> > > > > > > > > > * nodemask_t memory_tiers[MAX_MEMORY_TIERS]
> > > > > > > > > >
> > > > > > > > > >   Store memory nodes by tiers.
> > > > > > > > > >
> > > > > > > > > > * int node_tier_map[MAX_NUMNODES]
> > > > > > > > > >
> > > > > > > > > >   Map a node to its tier.
> > > > > > > > > >
> > > > > > > > > >   For each CPU-only node c, node_tier_map[c] = -1.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Memory Tier Initialization
> > > > > > > > > > ==========================
> > > > > > > > > >
> > > > > > > > > > By default, all memory nodes are assigned to the default tier
> > > > > > > > > > (MEMORY_DEFAULT_TIER).
> > > > > > > > >
> > > > > > > > > This is tighter than it needs to be.  In many cases we can easily
> > > > > > > > > establish if there is any possibility of CPU being hotplugged into
> > > > > > > > > a memory node.  If it's CXL attached no way CPUs are going to be
> > > > > > > > > turning up there later :)  If CPU HP into a given node can't happen
> > > > > > > > > we can be more flexible and I think that often results in better decisions.
> > > > > > > > > See example below, though obviously I could just use the userspace
> > > > > > > > > interface to fix that up anyway or have a CXL driver move it around
> > > > > > > > > if that's relevant.  In some other cases I'm fairly sure we know in
> > > > > > > > > advance where CPUs can be added but I'd need to check all the
> > > > > > > > > relevant specs to be sure there aren't any corner cases.  I 'think'
> > > > > > > > > for ARM for example we know where all possible CPUs can be hotplugged
> > > > > > > > > (constraint coming from the interrupt controller + the fact that only
> > > > > > > > > virtual CPU HP is defined).
> > > > > > > >
> > > > > > > > We may not always want to put a CXL-attached memory device into a
> > > > > > > > slower tier because even though CXL does add some additional latency,
> > > > > > > > both the memory device and CXL can still be very capable in
> > > > > > > > performance and may not be much slower (if any) than the on-board DRAM
> > > > > > > > (e.g. DRAM from a remote CPU socket).
> > > > > > > >
> > > > > > > > Also, the default tier here is just the initial tier assignment of
> > > > > > > > each node, which behaves as if there were no tiering.  A tiering
> > > > > > > > kernel init function can certainly reassign the tier for each node if
> > > > > > > > it knows enough about the hardware performance for these nodes from
> > > > > > > > the firmware.
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > A device driver can move up or down its memory nodes from the default
> > > > > > > > > > tier.  For example, PMEM can move down its memory nodes below the
> > > > > > > > > > default tier, whereas GPU can move up its memory nodes above the
> > > > > > > > > > default tier.
> > > > > > > > > >
> > > > > > > > > > The kernel initialization code makes the decision on which exact tier
> > > > > > > > > > a memory node should be assigned to based on the requests from the
> > > > > > > > > > device drivers as well as the memory device hardware information
> > > > > > > > > > provided by the firmware.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Memory Tier Reassignment
> > > > > > > > > > ========================
> > > > > > > > > >
> > > > > > > > > > After a memory node is hot-removed, it can be hot-added back to a
> > > > > > > > > > different memory tier.  This is useful for supporting dynamically
> > > > > > > > > > provisioned CXL.mem NUMA nodes, which may connect to different
> > > > > > > > > > memory devices across hot-plug events.  Such tier changes should
> > > > > > > > > > be compatible with tier-based memory accounting.
> > > > > > > > > >
> > > > > > > > > > The userspace may also reassign an existing online memory node to a
> > > > > > > > > > different tier.  However, this should only be allowed when no pages
> > > > > > > > > > are allocated from the memory node or when there are no non-root
> > > > > > > > > > memory cgroups (e.g. during the system boot).  This restriction is
> > > > > > > > > > important for keeping memory tier hierarchy stable enough for
> > > > > > > > > > tier-based memory cgroup accounting.
> > > > > > > > > >
> > > > > > > > > > Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Memory Allocation for Demotion
> > > > > > > > > > ==============================
> > > > > > > > > >
> > > > > > > > > > To allocate a new page as the demotion target for a page, the kernel
> > > > > > > > > > calls the allocation function (__alloc_pages_nodemask) with the
> > > > > > > > > > source page node as the preferred node and the union of all lower
> > > > > > > > > > tier nodes as the allowed nodemask.  The actual target node selection
> > > > > > > > > > then follows the allocation fallback order that the kernel has
> > > > > > > > > > already defined.
> > > > > > > > > >
> > > > > > > > > > The pseudo code looks like:
> > > > > > > > > >
> > > > > > > > > >     targets = NODE_MASK_NONE;
> > > > > > > > > >     src_nid = page_to_nid(page);
> > > > > > > > > >     src_tier = node_tier_map[src_nid];
> > > > > > > > > >     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
> > > > > > > > > >             nodes_or(targets, targets, memory_tiers[i]);
> > > > > > > > > >     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
> > > > > > > > > >
> > > > > > > > > > The mempolicy of cpuset, vma and owner task of the source page can
> > > > > > > > > > be set to refine the demotion target nodemask, e.g. to prevent
> > > > > > > > > > demotion or select a particular allowed node as the demotion target.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Memory Allocation for Promotion
> > > > > > > > > > ===============================
> > > > > > > > > >
> > > > > > > > > > The page allocation for promotion is similar to demotion, except that (1)
> > > > > > > > > > the target nodemask uses the promotion tiers, (2) the preferred node can
> > > > > > > > > > be the accessing CPU node, not the source page node.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Examples
> > > > > > > > > > ========
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > ...
> > > > > > > > >
> > > > > > > > > > * Example 3:
> > > > > > > > > >
> > > > > > > > > > Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
> > > > > > > > >
> > > > > > > > > Node2 is drawn as pmem.
> > > > > > > >
> > > > > > > > Typo. Good catch.
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > All nodes are in the same tier.
> > > > > > > > > >
> > > > > > > > > >                   20
> > > > > > > > > >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> > > > > > > > > >          \                 /
> > > > > > > > > >           \ 30            / 30
> > > > > > > > > >            \             /
> > > > > > > > > >              Node 2 (PMEM)
> > > > > > > > > >
> > > > > > > > > > node distances:
> > > > > > > > > > node   0    1    2
> > > > > > > > > >    0  10   20   30
> > > > > > > > > >    1  20   10   30
> > > > > > > > > >    2  30   30   10
> > > > > > > > > >
> > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > <empty>
> > > > > > > > > > 0-2
> > > > > > > > > > <empty>
> > > > > > > > > >
> > > > > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > > 1
> > > > > > > > > > 1
> > > > > > > > > > 1
> > > > > > > > > >
> > > > > > > > > > Demotion fallback order:
> > > > > > > > > > node 0: empty
> > > > > > > > > > node 1: empty
> > > > > > > > > > node 2: empty
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > * Example 4:
> > > > > > > > > >
> > > > > > > > > > Node 0 is a DRAM node with CPU.
> > > > > > > > > > Node 1 is a PMEM node.
> > > > > > > > > > Node 2 is a GPU node.
> > > > > > > > > >
> > > > > > > > > >                   50
> > > > > > > > > >   Node 0 (DRAM)  ----  Node 2 (GPU)
> > > > > > > > > >          \                 /
> > > > > > > > > >           \ 30            / 60
> > > > > > > > > >            \             /
> > > > > > > > > >              Node 1 (PMEM)
> > > > > > > > > >
> > > > > > > > > > node distances:
> > > > > > > > > > node   0    1    2
> > > > > > > > > >    0  10   30   50
> > > > > > > > > >    1  30   10   60
> > > > > > > > > >    2  50   60   10
> > > > > > > > > >
> > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > 2
> > > > > > > > > > 0
> > > > > > > > > > 1
> > > > > > > > > >
> > > > > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > > 1
> > > > > > > > > > 2
> > > > > > > > > > 0
> > > > > > > > > >
> > > > > > > > > > Demotion fallback order:
> > > > > > > > > > node 0: 1
> > > > > > > > > > node 1: empty
> > > > > > > > > > node 2: 0, 1
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > * Example 5:
> > > > > > > > > >
> > > > > > > > > > Node 0 is a DRAM node with CPU.
> > > > > > > > > > Node 1 is a GPU node.
> > > > > > > > > > Node 2 is a PMEM node.
> > > > > > > > > > Node 3 is a large, slow DRAM node without CPU.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >      Node 2 (PMEM)  ----
> > > > > > > > > >    /      |              \
> > > > > > > > > >   /       | 30            \ 120
> > > > > > > > > >  |        |         100    \
> > > > > > > > > >  |   Node 0 (DRAM)  ----  Node 1 (GPU)
> > > > > > > > > >   \         \                 /
> > > > > > > > > >     \        \ 40            / 110
> > > > > > > > > >   80  \       \             /
> > > > > > > > > >         ---  Node 3 (Slow DRAM)
> > > > > > > > >
> > > > > > > > > This is close but not quite what was intended for Hesham's
> > > > > > > > > example... (note we just checked that Hesham's original node0-1
> > > > > > > > > timing didn't make any sense.).
> > > > > > > > >
> > > > > > > >
> > > > > > > > This was inspired by Hesham's example. But I should have also included
> > > > > > > > the version that illustrates the need to skip a tier when demoting
> > > > > > > > from certain nodes.
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > node distances:
> > > > > > > > > > node    0    1    2    3
> > > > > > > > > >    0   10  100   30   40
> > > > > > > > > >    1  100   10  120  110
> > > > > > > > > >    2   30  120   10   80
> > > > > > > > > >    3   40  110   80   10
> > > > > > > > > >
> > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > 1
> > > > > > > > > > 0,3
> > > > > > > > > > 2
> > > > > > > > > >
> > > > > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > > 1
> > > > > > > > > > 0
> > > > > > > > > > 2
> > > > > > > > > > 1
> > > > > > > > > >
> > > > > > > > > > Demotion fallback order:
> > > > > > > > > > node 0: 2
> > > > > > > > > > node 1: 0, 3, 2
> > > > > > > > > > node 2: empty
> > > > > > > > > > node 3: 2
> > > > > > > > >
> > > > > > > > > This is close but not quite the same as the example
> > > > > > > > > Hesham gave (note the node 1 to 0 timing in the table
> > > > > > > > > with that example didn't make sense).  I added another
> > > > > > > > > level of switching to make the numbers more obviously
> > > > > > > > > different and show how critical it might be.
> > > > > > > > >
> > > > > > > > > * Example 6:
> > > > > > > > >
> > > > > > > > > Node 0 is a DRAM node with CPU.
> > > > > > > > > Node 1 is a GPU node.
> > > > > > > > > Node 2 is a PMEM node.
> > > > > > > > > Node 3 is an extremely large DRAM node without CPU.
> > > > > > > > >   (Key point here being that it probably never makes sense
> > > > > > > > >    to demote to anywhere else from this memory).
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I've redone the timings wrt example 5.
> > > > > > > > > The basis for this is that 0 and 2 are directly connected
> > > > > > > > > via controllers in an SoC.  1 and 3 are connected
> > > > > > > > > via a common switch, one switch further down
> > > > > > > > > (each hop via this is 100).
> > > > > > > > > All DRAMs cost 10 once you've reached the correct node
> > > > > > > > > and PMEM costs 30 from the SoC.
> > > > > > > > > The numbers get too large as a result, but meh, I'm making
> > > > > > > > > a point, not providing real numbers :)
> > > > > > > > >
> > > > > > > > >          PMEM Node 2
> > > > > > > > >             |(30)
> > > > > > > > >         CPU + DRAM Node0
> > > > > > > > >             |(100)
> > > > > > > > >          Switch 1
> > > > > > > > >             |(100)
> > > > > > > > >           Switch 2
> > > > > > > > >     (100)  |      |(100)
> > > > > > > > > Node 1 GPU     Node3 Large memory.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > With one level of switching:
> > > > > > > > >
> > > > > > > > >      Node 2 (PMEM)  ----
> > > > > > > > >     /      |              \
> > > > > > > > >    /       | 30            \ 330
> > > > > > > > >   |        |         310    \
> > > > > > > > >   |   Node 0 (DRAM)  ----  Node 1 (GPU)
> > > > > > > > >    \         \                 /
> > > > > > > > >      \        \ 310           / 210
> > > > > > > > >    330 \       \             /
> > > > > > > > >          ---  Node 3 (Extremely large DRAM)
> > > > > > > > >
> > > > > > > > > To my mind, we should potentially also take into account
> > > > > > > > > the fact that Node3 can be known to never contain CPUs
> > > > > > > > > (in at least some architectures we know where the CPUs
> > > > > > > > >  might be added later, they can't just magically turn up
> > > > > > > > >  anywhere in the topology).
> > > > > > > > >
> > > > > > > > > node distances:
> > > > > > > > > node    0    1    2    3
> > > > > > > > >     0   10   310  30   310
> > > > > > > > >     1   310  10   330  210
> > > > > > > > >     2   30   330  10   330
> > > > > > > > >     3   310  210  330   10
> > > > > > > > >
> > > > > > > > > So, my ideal would treat node 3 differently from other DRAM nodes
> > > > > > > > > as we know it can't have CPUs.  Trying to come up with an
> > > > > > > > > always-correct order for nodes 3 and 2 is tricky, as it depends to a
> > > > > > > > > certain extent on capacity.  If node 2 were big enough to take
> > > > > > > > > any demotion from node 0 and still have lots of room, then demoting
> > > > > > > > > there from node 3 would make sense, and vice versa.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >  $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > >  1
> > > > > > > > >  0
> > > > > > > > >  2
> > > > > > > > >  3
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >  $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > >   1
> > > > > > > > >   0
> > > > > > > > >   2
> > > > > > > > >   3
> > > > > > > > >
> > > > > > > > >  Demotion fallback order:
> > > > > > > > >  node 0: 2, 3
> > > > > > > > >  node 1: 3, 0, 2 (key being we will almost always have less pressure on node 3)
> > > > > > > > >  node 2: 3
> > > > > > > > >  node 3: empty
> > > > > > > > >
> > > > > > > > > Or, as Hesham just pointed out, this can be done with 3 tiers,
> > > > > > > > > because we can put the GPU and CPU in the same tier, as
> > > > > > > > > there is little reason to demote from one to the other.
> > > > > > > >
> > > > > > > > Thank you for the example.  It makes sense to me to have node 3 on its
> > > > > > > > own tier.  We can have either 3 tiers or 4 tiers in total (assuming
> > > > > > > > that the max number of tiers is a config option).
> > > > > > > >
> > > > > > > > > We are also a bit worried about ABI backwards compatibility because
> > > > > > > > > of the potential need to make more space in tiers lower in number than
> > > > > > > > > CPU-attached DDR.  I rather liked the negative-number proposal with
> > > > > > > > > the default as 0 that Huang, Ying made.
> > > > > > > >
> > > > > > > > It is hard to have negative values as the device IDs.
> > > > > > > >
> > > > > > > > The current proposal equates the tier device ID with the tier hierarchy
> > > > > > > > level, which makes the interface simpler, but less flexible.  How
> > > > > > > > about the following proposal (which decouples the tier device ID from
> > > > > > > > the tier level)?
> > > > > > > >
> > > > > > > > /sys/devices/system/memtier/memtierN/nodelist
> > > > > > > > /sys/devices/system/memtier/memtierN/rank
> > > > > > > >
> > > > > > > > Each memory tier N has two sysfs files:
> > > > > > > > - nodelist: the nodes that are in this tier
> > > > > > > > - rank: an opaque value that helps decide the level at which this tier
> > > > > > > > is in the tier hierarchy (smaller value means faster tier)
> > > > > > > >
> > > > > > > > The tier hierarchy is determined by "rank", not by the device id
> > > > > > > > number N from "memtierN".
> > > > > > > >
> > > > > > > > The absolute value of "rank" of a memtier doesn't necessarily carry
> > > > > > > > any meaning. Its value relative to other memtiers decides the level of
> > > > > > > > this memtier in the tier hierarchy.
> > > > > > > >
> > > > > > > > The CPU-attached DRAM nodes are always in memtier0 (the device ID),
> > > > > > > > but memtier0 may not always be the top-tier, e.g. its level can be 3
> > > > > > > > in a 5-tier system.
> > > > > > > >
> > > > > > > > For the above example (example 6), we can have:
> > > > > > > >
> > > > > > > > $ ls /sys/devices/system/memtier
> > > > > > > > memtier0
> > > > > > > > memtier1
> > > > > > > > memtier2
> > > > > > > > memtier128
> > > > > > > >
> > > > > > > > $ cat /sys/devices/system/memtier/memtier*/rank
> > > > > > > > 50
> > > > > > > > 60
> > > > > > > > 70
> > > > > > > > 10
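
As a minimal sketch of the ordering rule (userspace C only, not the kernel
implementation; the struct and the values are just illustrative, taken from
the listing above), sorting the memtiers by rank recovers the tier order
regardless of the device ID numbers:

    #include <stdio.h>
    #include <stdlib.h>

    struct memtier {
            int id;         /* device ID, e.g. 128 for memtier128 */
            int rank;       /* opaque ordering value; smaller means faster */
    };

    static int by_rank(const void *a, const void *b)
    {
            return ((const struct memtier *)a)->rank -
                   ((const struct memtier *)b)->rank;
    }

    int main(void)
    {
            /* device IDs and ranks from the example above */
            struct memtier tiers[] = {
                    { 0, 50 }, { 1, 60 }, { 2, 70 }, { 128, 10 },
            };
            int i;

            qsort(tiers, 4, sizeof(tiers[0]), by_rank);

            /* prints memtier128, memtier0, memtier1, memtier2 in order */
            for (i = 0; i < 4; i++)
                    printf("memtier%d\n", tiers[i].id);
            return 0;
    }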
> > > > > > >
> > > > > > > I understand that the device ID cannot be negative.  So we have to use
> > > > > > > rank.  Can we make it possible to allow "rank" to be negative?
> > > > > >
> > > > > > It is possible to allow "rank" to be negative, though I think all
> > > > > > positive values should work equally well.
> > > > > >
> > > > > > > Another choice is to do some trick on the device ID.  For example, the CPU-
> > > > > > > attached DRAM nodes are always memtier100 (the device ID).  Then we can
> > > > > > > have memtier99, memtier100, memtier101, memtier102, ....  That's not
> > > > > > > perfect either.
> > > > > >
> > > > > > If we go with the device ID tricks, one approach is to use sub-device IDs:
> > > > > >
> > > > > > - There are 3 major tiers: tier0 (e.g. GPU), tier1 (e.g. DRAM) and
> > > > > > tier2 (e.g. PMEM).
> > > > > >
> > > > > > - Each major tier can have minor tiers, e.g. tier0.0, tier1.0,
> > > > > > tier1.1, tier2.0, tier2.1.
> > > > > >
> > > > > > The earlier 4-tier example can be represented as:
> > > > > >
> > > > > > memtier0.0 -> memtier1.0 -> memtier2.0 -> memtier2.1
> > > > > >
> > > > > > We can also omit .0 so that the tiers are:
> > > > > >
> > > > > > memtier0 -> memtier1 -> memtier2 -> memtier2.1
> > > > > >
> > > > > > This should be flexible enough to support multiple tiers while keeping
> > > > > > the tier IDs relatively stable.
> > > > > >
> > > > > > It is not as flexible as the rank approach. For example, to insert a
> > > > > > new tier between 2.0 and 2.1, we need to add a tier 2.2 and reassign
> > > > > > existing nodes to these 3 tiers.  Using "rank", we can insert a new
> > > > > > tier and only move desired nodes into the new tier.
> > > > > >
> > > > > > What do you think?
> > > > >
> > > > > The rank approach looks better to me.  And if we stick with the device ID
> > > > > rule as follows,
> > > > >
> > > > > ...
> > > > > 255     GPU
> > > > > 0       DRAM
> > > > > 1       PMEM
> > > > > 2
> > > > > ...
> > > > >
> > > > > 255 is -1 for "s8".
> > > > >
> > > > > The device ID should do the trick in most cases, at least for now.  The rank can provide
> > > > > more flexibility in the future.  We can even go without rank in the
> > > > > first version, and introduce it when it's necessary.
> > > >
> > > > Given that the "rank" approach is generally favored, let's go with
> > > > that to avoid compatibility issues that may come from the switch of
> > > > device ID tricks to ranks.
> > >
> > > OK.  Just to confirm.  Does this mean that we will have fixed device ID,
> > > for example,
> > >
> > > GPU                     memtier255
> > > DRAM (with CPU)         memtier0
> > > PMEM                    memtier1
> > >
> > > When we add a new memtier, it can be memtier254, or memtier2?  The rank
> > > value will determine the real demotion order.
> >
> > With the rank approach, the device ID numbering should be flexible and
> > not mandated by the proposal.
>
> If so, the rank number will be fixed?  For example,
>
> GPU                     100
> DRAM (with CPU)         200
> PMEM                    300
>
> When we add a new memtier, its rank can be 50, 150, 250, or 400?
>
> If so, this makes me wonder why we don't just make this kind of rank the
> device ID.  Or did I miss something?
> 
> Or, are both device IDs and rank values not fixed?  Why do we need that
> kind of flexibility?  Sorry, I may not understand all the requirements.

Even though the proposal doesn't mandate a particular device ID
numbering, I expect that the device IDs will be relatively stable once
a kernel implementation is chosen. For example, it is likely that DRAM
nodes with CPUs will always be on memtier1, no matter how many tiers
are higher or lower than these nodes.

We don't need to mandate a particular way to assign the rank values,
either.  What matters is the relative order and some reasonable gap
between these values.

The rank approach allows us to keep memtier device IDs relatively
stable even though we may change the tier ordering among them.  Its
flexibility can have many other uses as well.  For example, we can
insert a new memtier into the tier hierarchy for a new set of nodes
without affecting the node assignment of any existing memtier,
provided that there is enough gap in the rank values for the new
memtier.
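
Concretely (a hypothetical sketch only; the new tier's device ID, rank and
nodes below are made up), starting from the example 6 assignment, a new
memtier for a new set of nodes can be slotted between memtier0 (rank 50)
and memtier1 (rank 60) just by giving it a rank inside that gap:

    struct memtier {
            int id;
            int rank;
            const char *nodelist;
    };

    /* Existing tiers from example 6; none of these entries change. */
    struct memtier tiers[] = {
            { 128, 10, "1"   },     /* GPU                 */
            {   0, 50, "0"   },     /* DRAM with CPU       */
            {   1, 60, "2"   },     /* PMEM                */
            {   2, 70, "3"   },     /* large CPU-less DRAM */
            /* hypothetical new tier for new nodes 4-5: rank 55 places it
               between memtier0 and memtier1 in the hierarchy */
            {   3, 55, "4-5" },
    };

Re-deriving the order by rank then gives memtier128 -> memtier0 ->
memtier3 -> memtier1 -> memtier2, while every existing memtier keeps its
device ID and nodelist.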

Using the rank value directly as the device ID has some disadvantages:
- It is kind of unconventional to number devices in this way.
- We cannot assign DRAM nodes with CPUs with a specific memtier device
ID (even though this is not mandated by the "rank" proposal, I expect
the device will likely always be memtier1 in practice).
- It is possible that we may eventually allow the rank value to be
modified as a way to adjust the tier ordering.  We cannot do that
easily for device IDs.
> Best Regards,
> Huang, Ying
>
> > > I think you may need to send v3 to make sure everyone is at the same
> > > page.
> >
> > Will do it shortly.
>
> Good!  Thanks!
>
> Best Regards,
> Huang, Ying
>
> > > Best Regards,
> > > Huang, Ying
> > >
> > > > > Best Regards,
> > > > > Huang, Ying
> > > > >
> > > > > > > > The tier order: memtier128 -> memtier0 -> memtier1 -> memtier2
> > > > > > > >
> > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > 0
> > > > > > > > 2
> > > > > > > > 3
> > > > > > > > 1
> > > > > > > >
> > > > > > > > $ ls -l /sys/devices/system/node/node*/memtier
> > > > > > > > /sys/devices/system/node/node0/memtier -> /sys/devices/system/memtier/memtier0
> > > > > > > > /sys/devices/system/node/node1/memtier -> /sys/devices/system/memtier/memtier128
> > > > > > > > /sys/devices/system/node/node2/memtier -> /sys/devices/system/memtier/memtier1
> > > > > > > > /sys/devices/system/node/node3/memtier -> /sys/devices/system/memtier/memtier2
> > > > > > > >
> > > > > > > > To override the memory tier of a node, we can use a new, write-only,
> > > > > > > > per-node interface file:
> > > > > > > >
> > > > > > > > /sys/devices/system/node/nodeN/set_memtier
> > > > > > > >
> > > > > > > > e.g.
> > > > > > > >
> > > > > > > > $ echo "memtier128" > sys/devices/system/node/node1/set_memtier
> > > > > > >
> > > > > > > I prefer the original proposal to make nodeX/memtier a normal file to
> > > > > > > hold the memtier device ID instead of a link.
> > > > > >
> > > > > > OK. We don't have to use a symlink.
> > > > > >
> > > > > > > Best Regards,
> > > > > > > Huang, Ying
> > > > > > >
> > > > > > > > Any comments?
> > > > > > > >
> > > > > > > > > Jonathan
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > >
> > > > >
> > > > >
> > >
> > >
>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-26  3:53                   ` Wei Xu
@ 2022-05-26  6:54                     ` Ying Huang
  2022-05-26  7:08                       ` Wei Xu
  0 siblings, 1 reply; 47+ messages in thread
From: Ying Huang @ 2022-05-26  6:54 UTC (permalink / raw)
  To: Wei Xu
  Cc: Jonathan Cameron, Andrew Morton, Greg Thelen, Aneesh Kumar K.V,
	Yang Shi, Linux Kernel Mailing List, Jagdish Gediya,
	Michal Hocko, Tim C Chen, Dave Hansen, Alistair Popple,
	Baolin Wang, Feng Tang, Davidlohr Bueso, Dan Williams,
	David Rientjes, Linux MM, Brice Goglin, Hesham Almatary

On Wed, 2022-05-25 at 20:53 -0700, Wei Xu wrote:
> On Wed, May 25, 2022 at 6:10 PM Ying Huang <ying.huang@intel.com> wrote:
> > 
> > On Wed, 2022-05-25 at 08:36 -0700, Wei Xu wrote:
> > > On Wed, May 25, 2022 at 2:03 AM Ying Huang <ying.huang@intel.com> wrote:
> > > > 
> > > > On Tue, 2022-05-24 at 22:32 -0700, Wei Xu wrote:
> > > > > On Tue, May 24, 2022 at 1:24 AM Ying Huang <ying.huang@intel.com> wrote:
> > > > > > 
> > > > > > On Tue, 2022-05-24 at 00:04 -0700, Wei Xu wrote:
> > > > > > > On Thu, May 19, 2022 at 8:06 PM Ying Huang <ying.huang@intel.com> wrote:
> > > > > > > > 
> > > > > > > > On Wed, 2022-05-18 at 00:09 -0700, Wei Xu wrote:
> > > > > > > > > On Thu, May 12, 2022 at 8:00 AM Jonathan Cameron
> > > > > > > > > <Jonathan.Cameron@huawei.com> wrote:
> > > > > > > > > > 
> > > > > > > > > > On Wed, 11 May 2022 23:22:11 -0700
> > > > > > > > > > Wei Xu <weixugc@google.com> wrote:
> > > > > > > > > > > The current kernel has the basic memory tiering support: Inactive
> > > > > > > > > > > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > > > > > > > > > > tier NUMA node to make room for new allocations on the higher tier
> > > > > > > > > > > NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> > > > > > > > > > > migrated (promoted) to a higher tier NUMA node to improve the
> > > > > > > > > > > performance.
> > > > > > > > > > > 
> > > > > > > > > > > In the current kernel, memory tiers are defined implicitly via a
> > > > > > > > > > > demotion path relationship between NUMA nodes, which is created during
> > > > > > > > > > > the kernel initialization and updated when a NUMA node is hot-added or
> > > > > > > > > > > hot-removed.  The current implementation puts all nodes with CPU into
> > > > > > > > > > > the top tier, and builds the tier hierarchy tier-by-tier by establishing
> > > > > > > > > > > the per-node demotion targets based on the distances between nodes.
> > > > > > > > > > > 
> > > > > > > > > > > This current memory tier kernel interface needs to be improved for
> > > > > > > > > > > several important use cases:
> > > > > > > > > > > 
> > > > > > > > > > > * The current tier initialization code always initializes
> > > > > > > > > > >   each memory-only NUMA node into a lower tier.  But a memory-only
> > > > > > > > > > >   NUMA node may have a high performance memory device (e.g. a DRAM
> > > > > > > > > > >   device attached via CXL.mem or a DRAM-backed memory-only node on
> > > > > > > > > > >   a virtual machine) and should be put into a higher tier.
> > > > > > > > > > > 
> > > > > > > > > > > * The current tier hierarchy always puts CPU nodes into the top
> > > > > > > > > > >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> > > > > > > > > > >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> > > > > > > > > > >   with CPUs are better to be placed into the next lower tier.
> > > > > > > > > > > 
> > > > > > > > > > > * Also because the current tier hierarchy always puts CPU nodes
> > > > > > > > > > >   into the top tier, when a CPU is hot-added (or hot-removed) and
> > > > > > > > > > >   triggers a memory node from CPU-less into a CPU node (or vice
> > > > > > > > > > >   versa), the memory tier hierarchy gets changed, even though no
> > > > > > > > > > >   memory node is added or removed.  This can make the tier
> > > > > > > > > > >   hierarchy unstable and make it difficult to support tier-based
> > > > > > > > > > >   memory accounting.
> > > > > > > > > > > 
> > > > > > > > > > > * A higher tier node can only be demoted to selected nodes on the
> > > > > > > > > > >   next lower tier as defined by the demotion path, not any other
> > > > > > > > > > >   node from any lower tier.  This strict, hard-coded demotion order
> > > > > > > > > > >   does not work in all use cases (e.g. some use cases may want to
> > > > > > > > > > >   allow cross-socket demotion to another node in the same demotion
> > > > > > > > > > >   tier as a fallback when the preferred demotion node is out of
> > > > > > > > > > >   space), and has resulted in the feature request for an interface to
> > > > > > > > > > >   override the system-wide, per-node demotion order from the
> > > > > > > > > > >   userspace.  This demotion order is also inconsistent with the page
> > > > > > > > > > >   allocation fallback order when all the nodes in a higher tier are
> > > > > > > > > > >   out of space: The page allocation can fall back to any node from
> > > > > > > > > > >   any lower tier, whereas the demotion order doesn't allow that.
> > > > > > > > > > > 
> > > > > > > > > > > * There are no interfaces for the userspace to learn about the memory
> > > > > > > > > > >   tier hierarchy in order to optimize its memory allocations.
> > > > > > > > > > > 
> > > > > > > > > > > I'd like to propose revised memory tier kernel interfaces based on
> > > > > > > > > > > the discussions in the threads:
> > > > > > > > > > > 
> > > > > > > > > > > - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> > > > > > > > > > > - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
> > > > > > > > > > > - https://lore.kernel.org/linux-mm/867bc216386eb6cbf54648f23e5825830f5b922e.camel@intel.com/T/
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > High-level Design Ideas
> > > > > > > > > > > =======================
> > > > > > > > > > > 
> > > > > > > > > > > * Define memory tiers explicitly, not implicitly.
> > > > > > > > > > > 
> > > > > > > > > > > * Memory tiers are defined based on hardware capabilities of memory
> > > > > > > > > > >   nodes, not their relative node distances between each other.
> > > > > > > > > > > 
> > > > > > > > > > > * The tier assignment of each node is independent from each other.
> > > > > > > > > > >   Moving a node from one tier to another tier doesn't affect the tier
> > > > > > > > > > >   assignment of any other node.
> > > > > > > > > > > 
> > > > > > > > > > > * The node-tier association is stable. A node can be reassigned to a
> > > > > > > > > > >   different tier only under the specific conditions that don't block
> > > > > > > > > > >   future tier-based memory cgroup accounting.
> > > > > > > > > > > 
> > > > > > > > > > > * A node can demote its pages to any nodes of any lower tiers. The
> > > > > > > > > > >   demotion target node selection follows the allocation fallback order
> > > > > > > > > > >   of the source node, which is built based on node distances.  The
> > > > > > > > > > >   demotion targets are also restricted to only the nodes from the tiers
> > > > > > > > > > >   lower than the source node.  We no longer need to maintain a separate
> > > > > > > > > > >   per-node demotion order (node_demotion[]).
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Hi Wei,
> > > > > > > > > > 
> > > > > > > > > > This proposal looks good to me, though we'll be having fun
> > > > > > > > > > white boarding topologies from our roadmaps for the next few days :)
> > > > > > > > > 
> > > > > > > > > That's good to hear.
> > > > > > > > > 
> > > > > > > > > > A few comments inline. It also seems likely to me that there is little
> > > > > > > > > > benefit in starting with 3 tiers as the maximum.  Seems unlikely the
> > > > > > > > > > code will be substantially simpler for 3 than it would be for 4 or 5.
> > > > > > > > > > I've drawn out one simple case that needs 4 to do sensible things.
> > > > > > > > > 
> > > > > > > > > We can make the number of tiers a config option. 3 tiers are just what
> > > > > > > > > the kernel can reasonably initialize when there isn't enough hardware
> > > > > > > > > performance information from the firmware.
> > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Sysfs Interfaces
> > > > > > > > > > > ================
> > > > > > > > > > > 
> > > > > > > > > > > * /sys/devices/system/memtier/memtierN/nodelist
> > > > > > > > > > > 
> > > > > > > > > > >   where N = 0, 1, 2 (the kernel supports only 3 tiers for now).
> > > > > > > > > > > 
> > > > > > > > > > >   Format: node_list
> > > > > > > > > > > 
> > > > > > > > > > >   Read-only.  When read, list the memory nodes in the specified tier.
> > > > > > > > > > > 
> > > > > > > > > > >   Tier 0 is the highest tier, while tier 2 is the lowest tier.
> > > > > > > > > > > 
> > > > > > > > > > >   The absolute value of a tier id number has no specific meaning.
> > > > > > > > > > >   What matters is the relative order of the tier id numbers.
> > > > > > > > > > > 
> > > > > > > > > > >   When a memory tier has no nodes, the kernel can hide its memtier
> > > > > > > > > > >   sysfs files.
> > > > > > > > > > > 
> > > > > > > > > > > * /sys/devices/system/node/nodeN/memtier
> > > > > > > > > > > 
> > > > > > > > > > >   where N = 0, 1, ...
> > > > > > > > > > > 
> > > > > > > > > > >   Format: int or empty
> > > > > > > > > > > 
> > > > > > > > > > >   When read, list the memory tier that the node belongs to.  Its value
> > > > > > > > > > >   is empty for a CPU-only NUMA node.
> > > > > > > > > > > 
> > > > > > > > > > >   When written, the kernel moves the node into the specified memory
> > > > > > > > > > >   tier if the move is allowed.  The tier assignment of all other nodes
> > > > > > > > > > >   are not affected.
> > > > > > > > > > > 
> > > > > > > > > > >   Initially, we can make this interface read-only.
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Kernel Representation
> > > > > > > > > > > =====================
> > > > > > > > > > > 
> > > > > > > > > > > * All memory tiering code is guarded by CONFIG_TIERED_MEMORY.
> > > > > > > > > > > 
> > > > > > > > > > > * #define MAX_MEMORY_TIERS 3
> > > > > > > > > > > 
> > > > > > > > > > >   Support 3 memory tiers for now.
> > > > > > > > > > > 
> > > > > > > > > > > * #define MEMORY_DEFAULT_TIER 1
> > > > > > > > > > > 
> > > > > > > > > > >   The default tier that a memory node is assigned to.
> > > > > > > > > > > 
> > > > > > > > > > > * nodemask_t memory_tiers[MAX_MEMORY_TIERS]
> > > > > > > > > > > 
> > > > > > > > > > >   Store memory nodes by tiers.
> > > > > > > > > > > 
> > > > > > > > > > > * int node_tier_map[MAX_NUMNODES]
> > > > > > > > > > > 
> > > > > > > > > > >   Map a node to its tier.
> > > > > > > > > > > 
> > > > > > > > > > >   For each CPU-only node c, node_tier_map[c] = -1.
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Memory Tier Initialization
> > > > > > > > > > > ==========================
> > > > > > > > > > > 
> > > > > > > > > > > By default, all memory nodes are assigned to the default tier
> > > > > > > > > > > (MEMORY_DEFAULT_TIER).
> > > > > > > > > > 
> > > > > > > > > > This is tighter than it needs to be.  In many cases we can easily
> > > > > > > > > > establish if there is any possibility of CPU being hotplugged into
> > > > > > > > > > a memory node.  If it's CXL attached, no way CPUs are going to be
> > > > > > > > > > turning up there later :)  If CPU HP into a given node can't happen
> > > > > > > > > > we can be more flexible and I think that often results in better decisions.
> > > > > > > > > > See example below, though obviously I could just use the userspace
> > > > > > > > > > interface to fix that up anyway or have a CXL driver move it around
> > > > > > > > > > if that's relevant.  In some other cases I'm fairly sure we know in
> > > > > > > > > > advance where CPUs can be added but I'd need to check all the
> > > > > > > > > > relevant specs to be sure there aren't any corner cases.  I 'think'
> > > > > > > > > > for ARM for example we know where all possible CPUs can be hotplugged
> > > > > > > > > > (constraint coming from the interrupt controller + the fact that only
> > > > > > > > > > virtual CPU HP is defined).
> > > > > > > > > 
> > > > > > > > > We may not always want to put a CXL-attached memory device into a
> > > > > > > > > slower tier because even though CXL does add some additional latency,
> > > > > > > > > both the memory device and CXL can still be very capable in
> > > > > > > > > performance and may not be much slower (if any) than the on-board DRAM
> > > > > > > > > (e.g. DRAM from a remote CPU socket).
> > > > > > > > > 
> > > > > > > > > Also, the default tier here is just the initial tier assignment of
> > > > > > > > > each node, which behaves as if there were no tiering.  A tiering
> > > > > > > > > kernel init function can certainly reassign the tier for each node if
> > > > > > > > > it knows enough about the hardware performance for these nodes from
> > > > > > > > > the firmware.
> > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > A device driver can move up or down its memory nodes from the default
> > > > > > > > > > > tier.  For example, PMEM can move down its memory nodes below the
> > > > > > > > > > > default tier, whereas GPU can move up its memory nodes above the
> > > > > > > > > > > default tier.
> > > > > > > > > > > 
> > > > > > > > > > > The kernel initialization code makes the decision on which exact tier
> > > > > > > > > > > a memory node should be assigned to based on the requests from the
> > > > > > > > > > > device drivers as well as the memory device hardware information
> > > > > > > > > > > provided by the firmware.
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Memory Tier Reassignment
> > > > > > > > > > > ========================
> > > > > > > > > > > 
> > > > > > > > > > > After a memory node is hot-removed, it can be hot-added back to a
> > > > > > > > > > > different memory tier.  This is useful for supporting dynamically
> > > > > > > > > > > provisioned CXL.mem NUMA nodes, which may connect to different
> > > > > > > > > > > memory devices across hot-plug events.  Such tier changes should
> > > > > > > > > > > be compatible with tier-based memory accounting.
> > > > > > > > > > > 
> > > > > > > > > > > The userspace may also reassign an existing online memory node to a
> > > > > > > > > > > different tier.  However, this should only be allowed when no pages
> > > > > > > > > > > are allocated from the memory node or when there are no non-root
> > > > > > > > > > > memory cgroups (e.g. during the system boot).  This restriction is
> > > > > > > > > > > important for keeping memory tier hierarchy stable enough for
> > > > > > > > > > > tier-based memory cgroup accounting.
> > > > > > > > > > > 
> > > > > > > > > > > Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Memory Allocation for Demotion
> > > > > > > > > > > ==============================
> > > > > > > > > > > 
> > > > > > > > > > > To allocate a new page as the demotion target for a page, the kernel
> > > > > > > > > > > calls the allocation function (__alloc_pages_nodemask) with the
> > > > > > > > > > > source page node as the preferred node and the union of all lower
> > > > > > > > > > > tier nodes as the allowed nodemask.  The actual target node selection
> > > > > > > > > > > then follows the allocation fallback order that the kernel has
> > > > > > > > > > > already defined.
> > > > > > > > > > > 
> > > > > > > > > > > The pseudo code looks like:
> > > > > > > > > > > 
> > > > > > > > > > >     targets = NODE_MASK_NONE;
> > > > > > > > > > >     src_nid = page_to_nid(page);
> > > > > > > > > > >     src_tier = node_tier_map[src_nid];
> > > > > > > > > > >     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
> > > > > > > > > > >             nodes_or(targets, targets, memory_tiers[i]);
> > > > > > > > > > >     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
> > > > > > > > > > > 
> > > > > > > > > > > The mempolicy of the cpuset, vma and owner task of the source page can
> > > > > > > > > > > be set to refine the demotion target nodemask, e.g. to prevent
> > > > > > > > > > > demotion or select a particular allowed node as the demotion target.
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Memory Allocation for Promotion
> > > > > > > > > > > ===============================
> > > > > > > > > > > 
> > > > > > > > > > > The page allocation for promotion is similar to demotion, except that (1)
> > > > > > > > > > > the target nodemask uses the promotion tiers, (2) the preferred node can
> > > > > > > > > > > be the accessing CPU node, not the source page node.
> > > > > > > > > > > 
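
A sketch of the promotion case, mirroring the demotion pseudo code above
(the accessing_nid variable, i.e. the node of the CPU that accessed the
page, is only illustrative):

    targets = NODE_MASK_NONE;
    src_nid = page_to_nid(page);
    src_tier = node_tier_map[src_nid];
    /* promotion targets: all tiers higher than the source tier */
    for (i = 0; i < src_tier; i++)
            nodes_or(targets, targets, memory_tiers[i]);
    /* prefer the node of the CPU that accessed the page */
    new_page = __alloc_pages_nodemask(gfp, order, accessing_nid, targets);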
> > > > > > > > > > > 
> > > > > > > > > > > Examples
> > > > > > > > > > > ========
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > ...
> > > > > > > > > > 
> > > > > > > > > > > * Example 3:
> > > > > > > > > > > 
> > > > > > > > > > > Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
> > > > > > > > > > 
> > > > > > > > > > Node2 is drawn as pmem.
> > > > > > > > > 
> > > > > > > > > Typo. Good catch.
> > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > All nodes are in the same tier.
> > > > > > > > > > > 
> > > > > > > > > > >                   20
> > > > > > > > > > >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> > > > > > > > > > >          \                 /
> > > > > > > > > > >           \ 30            / 30
> > > > > > > > > > >            \             /
> > > > > > > > > > >              Node 2 (PMEM)
> > > > > > > > > > > 
> > > > > > > > > > > node distances:
> > > > > > > > > > > node   0    1    2
> > > > > > > > > > >    0  10   20   30
> > > > > > > > > > >    1  20   10   30
> > > > > > > > > > >    2  30   30   10
> > > > > > > > > > > 
> > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > > <empty>
> > > > > > > > > > > 0-2
> > > > > > > > > > > <empty>
> > > > > > > > > > > 
> > > > > > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > > > 1
> > > > > > > > > > > 1
> > > > > > > > > > > 1
> > > > > > > > > > > 
> > > > > > > > > > > Demotion fallback order:
> > > > > > > > > > > node 0: empty
> > > > > > > > > > > node 1: empty
> > > > > > > > > > > node 2: empty
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > * Example 4:
> > > > > > > > > > > 
> > > > > > > > > > > Node 0 is a DRAM node with CPU.
> > > > > > > > > > > Node 1 is a PMEM node.
> > > > > > > > > > > Node 2 is a GPU node.
> > > > > > > > > > > 
> > > > > > > > > > >                   50
> > > > > > > > > > >   Node 0 (DRAM)  ----  Node 2 (GPU)
> > > > > > > > > > >          \                 /
> > > > > > > > > > >           \ 30            / 60
> > > > > > > > > > >            \             /
> > > > > > > > > > >              Node 1 (PMEM)
> > > > > > > > > > > 
> > > > > > > > > > > node distances:
> > > > > > > > > > > node   0    1    2
> > > > > > > > > > >    0  10   30   50
> > > > > > > > > > >    1  30   10   60
> > > > > > > > > > >    2  50   60   10
> > > > > > > > > > > 
> > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > > 2
> > > > > > > > > > > 0
> > > > > > > > > > > 1
> > > > > > > > > > > 
> > > > > > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > > > 1
> > > > > > > > > > > 2
> > > > > > > > > > > 0
> > > > > > > > > > > 
> > > > > > > > > > > Demotion fallback order:
> > > > > > > > > > > node 0: 1
> > > > > > > > > > > node 1: empty
> > > > > > > > > > > node 2: 0, 1
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > * Example 5:
> > > > > > > > > > > 
> > > > > > > > > > > Node 0 is a DRAM node with CPU.
> > > > > > > > > > > Node 1 is a GPU node.
> > > > > > > > > > > Node 2 is a PMEM node.
> > > > > > > > > > > Node 3 is a large, slow DRAM node without CPU.
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > >      Node 2 (PMEM)  ----
> > > > > > > > > > >    /      |              \
> > > > > > > > > > >   /       | 30            \ 120
> > > > > > > > > > >  |        |         100    \
> > > > > > > > > > >  |   Node 0 (DRAM)  ----  Node 1 (GPU)
> > > > > > > > > > >   \         \                 /
> > > > > > > > > > >     \        \ 40            / 110
> > > > > > > > > > >   80  \       \             /
> > > > > > > > > > >         ---  Node 3 (Slow DRAM)
> > > > > > > > > > 
> > > > > > > > > > This is close but not quite what was intended for Hesham's
> > > > > > > > > > example... (note we just checked that Hesham's original node0-1
> > > > > > > > > > timing didn't make any sense).
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > This was inspired by Hesham's example. But I should have also included
> > > > > > > > > the version that illustrates the need to skip a tier when demoting
> > > > > > > > > from certain nodes.
> > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > node distances:
> > > > > > > > > > > node    0    1    2    3
> > > > > > > > > > >    0   10  100   30   40
> > > > > > > > > > >    1  100   10  120  110
> > > > > > > > > > >    2   30  120   10   80
> > > > > > > > > > >    3   40  110   80   10
> > > > > > > > > > > 
> > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > > 1
> > > > > > > > > > > 0,3
> > > > > > > > > > > 2
> > > > > > > > > > > 
> > > > > > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > > > 1
> > > > > > > > > > > 0
> > > > > > > > > > > 2
> > > > > > > > > > > 1
> > > > > > > > > > > 
> > > > > > > > > > > Demotion fallback order:
> > > > > > > > > > > node 0: 2
> > > > > > > > > > > node 1: 0, 3, 2
> > > > > > > > > > > node 2: empty
> > > > > > > > > > > node 3: 2
> > > > > > > > > > 
> > > > > > > > > > This is close but not quite the same as the example
> > > > > > > > > > Hesham gave (note the node 1 to 0 timing in the table
> > > > > > > > > > for that example didn't make sense).  I added another
> > > > > > > > > > level of switching to make the numbers more obviously
> > > > > > > > > > different and show how critical it might be.
> > > > > > > > > > 
> > > > > > > > > > * Example 6:
> > > > > > > > > > 
> > > > > > > > > > Node 0 is a DRAM node with CPU.
> > > > > > > > > > Node 1 is a GPU node.
> > > > > > > > > > Node 2 is a PMEM node.
> > > > > > > > > > Node 3 is an extremely large DRAM node without CPU.
> > > > > > > > > >   (Key point here being that it probably never makes sense
> > > > > > > > > >    to demote to anywhere else from this memory).
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > I've redone the timings wrt example 5.
> > > > > > > > > > The basis for this is that 0 and 2 are directly connected
> > > > > > > > > > via controllers in an SoC.  1 and 3 are connected
> > > > > > > > > > via a common switch, one switch further down
> > > > > > > > > > (each hop via this is 100).
> > > > > > > > > > All DRAMs cost 10 once you've reached the correct node
> > > > > > > > > > and PMEM costs 30 from the SoC.
> > > > > > > > > > The numbers get too large as a result, but meh, I'm making
> > > > > > > > > > a point, not providing real numbers :)
> > > > > > > > > > 
> > > > > > > > > >          PMEM Node 2
> > > > > > > > > >             |(30)
> > > > > > > > > >         CPU + DRAM Node0
> > > > > > > > > >             |(100)
> > > > > > > > > >          Switch 1
> > > > > > > > > >             |(100)
> > > > > > > > > >           Switch 2
> > > > > > > > > >     (100)  |      |(100)
> > > > > > > > > > Node 1 GPU     Node3 Large memory.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > With one level of switching:
> > > > > > > > > > 
> > > > > > > > > >      Node 2 (PMEM)  ----
> > > > > > > > > >     /      |              \
> > > > > > > > > >    /       | 30            \ 330
> > > > > > > > > >   |        |         310    \
> > > > > > > > > >   |   Node 0 (DRAM)  ----  Node 1 (GPU)
> > > > > > > > > >    \         \                 /
> > > > > > > > > >      \        \ 310           / 210
> > > > > > > > > >    330 \       \             /
> > > > > > > > > >          ---  Node 3 (Extremely large DRAM)
> > > > > > > > > > 
> > > > > > > > > > To my mind, we should potentially also take into account
> > > > > > > > > > the fact that Node3 can be known to never contain CPUs
> > > > > > > > > > (in at least some architectures we know where the CPUs
> > > > > > > > > >  might be added later, they can't just magically turn up
> > > > > > > > > >  anywhere in the topology).
> > > > > > > > > > 
> > > > > > > > > > node distances:
> > > > > > > > > > node    0    1    2    3
> > > > > > > > > >     0   10   310  30   310
> > > > > > > > > >     1   310  10   330  210
> > > > > > > > > >     2   30   330  10   330
> > > > > > > > > >     3   310  210  330   10
> > > > > > > > > > 
> > > > > > > > > > So, my ideal would treat node 3 differently from other DRAM nodes
> > > > > > > > > > as we know it can't have CPUs.  Trying to come up with an
> > > > > > > > > > always-correct order for nodes 3 and 2 is tricky, as it depends to a
> > > > > > > > > > certain extent depends on capacity.  If node 2 were big enough to take
> > > > > > > > > > any demotion from node 0 and still have lots of room, then demoting
> > > > > > > > > > there from node 3 would make sense, and vice versa.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > >  $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > >  1
> > > > > > > > > >  0
> > > > > > > > > >  2
> > > > > > > > > >  3
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > >  $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > >   1
> > > > > > > > > >   0
> > > > > > > > > >   2
> > > > > > > > > >   3
> > > > > > > > > > 
> > > > > > > > > >  Demotion fallback order:
> > > > > > > > > >  node 0: 2, 3
> > > > > > > > > >  node 1: 3, 0, 2 (key being we will almost always have less pressure on node 3)
> > > > > > > > > >  node 2: 3
> > > > > > > > > >  node 3: empty
> > > > > > > > > > 
> > > > > > > > > > Or, as Hesham just pointed out, this can be done with 3 tiers,
> > > > > > > > > > because we can put the GPU and CPU in the same tier, as
> > > > > > > > > > there is little reason to demote from one to the other.
> > > > > > > > > 
> > > > > > > > > Thank you for the example.  It makes sense to me to have node 3 on its
> > > > > > > > > own tier.  We can have either 3 tiers or 4 tiers in total (assuming
> > > > > > > > > that the max number of tiers is a config option).
> > > > > > > > > 
> > > > > > > > > > We are also a bit worried about ABI backwards compatibility because
> > > > > > > > > > of the potential need to make more space in tiers lower in number than
> > > > > > > > > > CPU-attached DDR.  I rather liked the negative-number proposal with
> > > > > > > > > > the default as 0 that Huang, Ying made.
> > > > > > > > > 
> > > > > > > > > It is hard to have negative values as the device IDs.
> > > > > > > > > 
> > > > > > > > > The current proposal equates the tier device ID with the tier hierarchy
> > > > > > > > > level, which makes the interface simpler, but less flexible.  How
> > > > > > > > > about the following proposal (which decouples the tier device ID from
> > > > > > > > > the tier level)?
> > > > > > > > > 
> > > > > > > > > /sys/devices/system/memtier/memtierN/nodelist
> > > > > > > > > /sys/devices/system/memtier/memtierN/rank
> > > > > > > > > 
> > > > > > > > > Each memory tier N has two sysfs files:
> > > > > > > > > - nodelist: the nodes that are in this tier
> > > > > > > > > - rank: an opaque value that helps decide the level at which this tier
> > > > > > > > > is in the tier hierarchy (smaller value means faster tier)
> > > > > > > > > 
> > > > > > > > > The tier hierarchy is determined by "rank", not by the device id
> > > > > > > > > number N from "memtierN".
> > > > > > > > > 
> > > > > > > > > The absolute value of "rank" of a memtier doesn't necessarily carry
> > > > > > > > > any meaning. Its value relative to other memtiers decides the level of
> > > > > > > > > this memtier in the tier hierarchy.
> > > > > > > > > 
> > > > > > > > > The CPU-attached DRAM nodes are always in memtier0 (the device ID),
> > > > > > > > > but memtier0 may not always be the top-tier, e.g. its level can be 3
> > > > > > > > > in a 5-tier system.
> > > > > > > > > 
> > > > > > > > > For the above example (example 6), we can have:
> > > > > > > > > 
> > > > > > > > > $ ls /sys/devices/system/memtier
> > > > > > > > > memtier0
> > > > > > > > > memtier1
> > > > > > > > > memtier2
> > > > > > > > > memtier128
> > > > > > > > > 
> > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/rank
> > > > > > > > > 50
> > > > > > > > > 60
> > > > > > > > > 70
> > > > > > > > > 10
> > > > > > > > 
> > > > > > > > I understand that the device ID cannot be negative.  So we have to use
> > > > > > > > rank.  Can we make it possible to allow "rank" to be negative?
> > > > > > > 
> > > > > > > It is possible to allow "rank" to be negative, though I think all
> > > > > > > positive values should work equally well.
> > > > > > > 
> > > > > > > > Another choice is to do some trick on the device ID.  For example, the CPU-
> > > > > > > > attached DRAM nodes are always memtier100 (the device ID).  Then we can
> > > > > > > > have memtier99, memtier100, memtier101, memtier102, ....  That's not
> > > > > > > > perfect either.
> > > > > > > 
> > > > > > > If we go with the device ID tricks, one approach is to use sub-device IDs:
> > > > > > > 
> > > > > > > - There are 3 major tiers: tier0 (e.g. GPU), tier1 (e.g. DRAM) and
> > > > > > > tier2 (e.g. PMEM).
> > > > > > > 
> > > > > > > - Each major tier can have minor tiers, e.g. tier0.0, tier1.0,
> > > > > > > tier1.1, tier2.0, tier2.1.
> > > > > > > 
> > > > > > > The earlier 4-tier example can be represented as:
> > > > > > > 
> > > > > > > memtier0.0 -> memtier1.0 -> memtier2.0 -> memtier2.1
> > > > > > > 
> > > > > > > We can also omit .0 so that the tiers are:
> > > > > > > 
> > > > > > > memtier0 -> memtier1 -> memtier2 -> memtier2.1
> > > > > > > 
> > > > > > > This should be flexible enough to support multiple tiers while keeping
> > > > > > > the tier IDs relatively stable.
> > > > > > > 
> > > > > > > It is not as flexible as the rank approach. For example, to insert a
> > > > > > > new tier between 2.0 and 2.1, we need to add a tier 2.2 and reassign
> > > > > > > existing nodes to these 3 tiers.  Using "rank", we can insert a new
> > > > > > > tier and only move desired nodes into the new tier.
> > > > > > > 
> > > > > > > What do you think?
> > > > > > 
> > > > > > The rank approach looks better to me.  And if we stick with the device ID
> > > > > > rule as follows,
> > > > > > 
> > > > > > ...
> > > > > > 255     GPU
> > > > > > 0       DRAM
> > > > > > 1       PMEM
> > > > > > 2
> > > > > > ...
> > > > > > 
> > > > > > 255 is -1 for "s8".
> > > > > > 
> > > > > > The device ID should do the trick in most cases, at least for now.  The rank can provide
> > > > > > more flexibility in the future.  We can even go without rank in the
> > > > > > first version, and introduce it when it's necessary.
> > > > > 
> > > > > Given that the "rank" approach is generally favored, let's go with
> > > > > that to avoid compatibility issues that may come from the switch of
> > > > > device ID tricks to ranks.
> > > > 
> > > > OK.  Just to confirm.  Does this mean that we will have fixed device ID,
> > > > for example,
> > > > 
> > > > GPU                     memtier255
> > > > DRAM (with CPU)         memtier0
> > > > PMEM                    memtier1
> > > > 
> > > > When we add a new memtier, it can be memtier254, or memtier2?  The rank
> > > > value will determine the real demotion order.
> > > 
> > > With the rank approach, the device ID numbering should be flexible and
> > > not mandated by the proposal.
> > 
> > If so, the rank number will be fixed?  For example,
> > 
> > GPU                     100
> > DRAM (with CPU)         200
> > PMEM                    300
> > 
> > When we add a new memtier, its rank can be 50, 150, 250, or 400?
> > 
> > If so, this makes me wonder why we don't just make this kind of rank the
> > device ID.  Or did I miss something?
> > 
> > Or, are both device IDs and rank values not fixed?  Why do we need that
> > kind of flexibility?  Sorry, I may not understand all the requirements.
> 
> Even though the proposal doesn't mandate a particular device ID
> numbering, I expect that the device IDs will be relatively stable once
> a kernel implementation is chosen. For example, it is likely that DRAM
> nodes with CPUs will always be on memtier1, no matter how many tiers
> are higher or lower than these nodes.
> 
> We don't need to mandate a particular way to assign the rank values,
> either.  What matters is the relative order and some reasonable gap
> between these values.
> 
> The rank approach allows us to keep memtier device IDs relatively
> stable even though we may change the tier ordering among them.  Its
> flexibility can have many other uses as well.  For example, we can
> insert a new memtier into the tier hierarchy for a new set of nodes
> without affecting the node assignment of any existing memtier,
> provided that there is enough gap in the rank values for the new
> memtier.
> 
> Using the rank value directly as the device ID has some disadvantages:
> - It is kind of unconventional to number devices in this way.
> - We cannot assign DRAM nodes with CPUs with a specific memtier device
> ID (even though this is not mandated by the "rank" proposal, I expect
> the device will likely always be memtier1 in practice).
> - It is possible that we may eventually allow the rank value to be
> modified as a way to adjust the tier ordering.  We cannot do that
> easily for device IDs.

OK.  I can understand that sometimes it's more natural to change the
order of a set of nodes with the same memory type (and data plane path)
together instead of changing it one by one for each node.

It appears that the memtierX device becomes a kind of memory type (with
the data plane path considered for latency/throughput too).  We can
assign a memory type to a node, and change the order between memory
types.  If so, we need to allow multiple memtiers to have the same rank
value.
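
As a rough sketch of one possible semantic (illustrative C only, nothing
here is an agreed interface): if two memtiers share a rank they sit at the
same level, and the device ID is used only to keep the listing stable, not
to order demotion between them.

    struct memtier {
            int id;
            int rank;
    };

    /* Possible comparison when equal rank values are allowed. */
    static int memtier_cmp(const struct memtier *a, const struct memtier *b)
    {
            if (a->rank != b->rank)
                    return a->rank < b->rank ? -1 : 1;  /* rank decides the level */
            return a->id - b->id;   /* tie-break only for a stable listing */
    }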

Best Regards,
Huang, Ying

> > 
> > > > I think you may need to send v3 to make sure everyone is at the same
> > > > page.
> > > 
> > > Will do it shortly.
> > 
> > Good!  Thanks!
> > 
> > Best Regards,
> > Huang, Ying
> > 
> > > > Best Regards,
> > > > Huang, Ying
> > > > 
> > > > > > Best Regards,
> > > > > > Huang, Ying
> > > > > > 
> > > > > > > > > The tier order: memtier128 -> memtier0 -> memtier1 -> memtier2
> > > > > > > > > 
> > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > 0
> > > > > > > > > 2
> > > > > > > > > 3
> > > > > > > > > 1
> > > > > > > > > 
> > > > > > > > > $ ls -l /sys/devices/system/node/node*/memtier
> > > > > > > > > /sys/devices/system/node/node0/memtier -> /sys/devices/system/memtier/memtier0
> > > > > > > > > /sys/devices/system/node/node1/memtier -> /sys/devices/system/memtier/memtier128
> > > > > > > > > /sys/devices/system/node/node2/memtier -> /sys/devices/system/memtier/memtier1
> > > > > > > > > /sys/devices/system/node/node3/memtier -> /sys/devices/system/memtier/memtier2
> > > > > > > > > 
> > > > > > > > > To override the memory tier of a node, we can use a new, write-only,
> > > > > > > > > per-node interface file:
> > > > > > > > > 
> > > > > > > > > /sys/devices/system/node/nodeN/set_memtier
> > > > > > > > > 
> > > > > > > > > e.g.
> > > > > > > > > 
> > > > > > > > > $ echo "memtier128" > sys/devices/system/node/node1/set_memtier
> > > > > > > > 
> > > > > > > > I prefer the original proposal to make nodeX/memtier a normal file to
> > > > > > > > hold the memtier device ID instead of a link.
> > > > > > > 
> > > > > > > OK. We don't have to use a symlink.
> > > > > > > 
> > > > > > > > Best Regards,
> > > > > > > > Huang, Ying
> > > > > > > > 
> > > > > > > > > Any comments?
> > > > > > > > > 
> > > > > > > > > > Jonathan
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > 
> > > > 
> > 
> > 



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-26  6:54                     ` Ying Huang
@ 2022-05-26  7:08                       ` Wei Xu
  2022-05-26  7:39                         ` Ying Huang
  0 siblings, 1 reply; 47+ messages in thread
From: Wei Xu @ 2022-05-26  7:08 UTC (permalink / raw)
  To: Ying Huang
  Cc: Jonathan Cameron, Andrew Morton, Greg Thelen, Aneesh Kumar K.V,
	Yang Shi, Linux Kernel Mailing List, Jagdish Gediya,
	Michal Hocko, Tim C Chen, Dave Hansen, Alistair Popple,
	Baolin Wang, Feng Tang, Davidlohr Bueso, Dan Williams,
	David Rientjes, Linux MM, Brice Goglin, Hesham Almatary

On Wed, May 25, 2022 at 11:55 PM Ying Huang <ying.huang@intel.com> wrote:
>
> On Wed, 2022-05-25 at 20:53 -0700, Wei Xu wrote:
> > On Wed, May 25, 2022 at 6:10 PM Ying Huang <ying.huang@intel.com> wrote:
> > >
> > > On Wed, 2022-05-25 at 08:36 -0700, Wei Xu wrote:
> > > > On Wed, May 25, 2022 at 2:03 AM Ying Huang <ying.huang@intel.com> wrote:
> > > > >
> > > > > On Tue, 2022-05-24 at 22:32 -0700, Wei Xu wrote:
> > > > > > On Tue, May 24, 2022 at 1:24 AM Ying Huang <ying.huang@intel.com> wrote:
> > > > > > >
> > > > > > > On Tue, 2022-05-24 at 00:04 -0700, Wei Xu wrote:
> > > > > > > > On Thu, May 19, 2022 at 8:06 PM Ying Huang <ying.huang@intel.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, 2022-05-18 at 00:09 -0700, Wei Xu wrote:
> > > > > > > > > > On Thu, May 12, 2022 at 8:00 AM Jonathan Cameron
> > > > > > > > > > <Jonathan.Cameron@huawei.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Wed, 11 May 2022 23:22:11 -0700
> > > > > > > > > > > Wei Xu <weixugc@google.com> wrote:
> > > > > > > > > > > > The current kernel has the basic memory tiering support: Inactive
> > > > > > > > > > > > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > > > > > > > > > > > tier NUMA node to make room for new allocations on the higher tier
> > > > > > > > > > > > NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> > > > > > > > > > > > migrated (promoted) to a higher tier NUMA node to improve the
> > > > > > > > > > > > performance.
> > > > > > > > > > > >
> > > > > > > > > > > > In the current kernel, memory tiers are defined implicitly via a
> > > > > > > > > > > > demotion path relationship between NUMA nodes, which is created during
> > > > > > > > > > > > the kernel initialization and updated when a NUMA node is hot-added or
> > > > > > > > > > > > hot-removed.  The current implementation puts all nodes with CPU into
> > > > > > > > > > > > the top tier, and builds the tier hierarchy tier-by-tier by establishing
> > > > > > > > > > > > the per-node demotion targets based on the distances between nodes.
> > > > > > > > > > > >
> > > > > > > > > > > > This current memory tier kernel interface needs to be improved for
> > > > > > > > > > > > several important use cases:
> > > > > > > > > > > >
> > > > > > > > > > > > * The current tier initialization code always initializes
> > > > > > > > > > > >   each memory-only NUMA node into a lower tier.  But a memory-only
> > > > > > > > > > > >   NUMA node may have a high performance memory device (e.g. a DRAM
> > > > > > > > > > > >   device attached via CXL.mem or a DRAM-backed memory-only node on
> > > > > > > > > > > >   a virtual machine) and should be put into a higher tier.
> > > > > > > > > > > >
> > > > > > > > > > > > * The current tier hierarchy always puts CPU nodes into the top
> > > > > > > > > > > >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> > > > > > > > > > > >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> > > > > > > > > > > >   with CPUs are better to be placed into the next lower tier.
> > > > > > > > > > > >
> > > > > > > > > > > > * Also because the current tier hierarchy always puts CPU nodes
> > > > > > > > > > > >   into the top tier, when a CPU is hot-added (or hot-removed) and
> > > > > > > > > > > >   triggers a memory node from CPU-less into a CPU node (or vice
> > > > > > > > > > > >   versa), the memory tier hierarchy gets changed, even though no
> > > > > > > > > > > >   memory node is added or removed.  This can make the tier
> > > > > > > > > > > >   hierarchy unstable and make it difficult to support tier-based
> > > > > > > > > > > >   memory accounting.
> > > > > > > > > > > >
> > > > > > > > > > > > * A higher tier node can only be demoted to selected nodes on the
> > > > > > > > > > > >   next lower tier as defined by the demotion path, not any other
> > > > > > > > > > > >   node from any lower tier.  This strict, hard-coded demotion order
> > > > > > > > > > > >   does not work in all use cases (e.g. some use cases may want to
> > > > > > > > > > > >   allow cross-socket demotion to another node in the same demotion
> > > > > > > > > > > >   tier as a fallback when the preferred demotion node is out of
> > > > > > > > > > > >   space), and has resulted in the feature request for an interface to
> > > > > > > > > > > >   override the system-wide, per-node demotion order from the
> > > > > > > > > > > >   userspace.  This demotion order is also inconsistent with the page
> > > > > > > > > > > >   allocation fallback order when all the nodes in a higher tier are
> > > > > > > > > > > >   out of space: The page allocation can fall back to any node from
> > > > > > > > > > > >   any lower tier, whereas the demotion order doesn't allow that.
> > > > > > > > > > > >
> > > > > > > > > > > > * There are no interfaces for the userspace to learn about the memory
> > > > > > > > > > > >   tier hierarchy in order to optimize its memory allocations.
> > > > > > > > > > > >
> > > > > > > > > > > > I'd like to propose revised memory tier kernel interfaces based on
> > > > > > > > > > > > the discussions in the threads:
> > > > > > > > > > > >
> > > > > > > > > > > > - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> > > > > > > > > > > > - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
> > > > > > > > > > > > - https://lore.kernel.org/linux-mm/867bc216386eb6cbf54648f23e5825830f5b922e.camel@intel.com/T/
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > High-level Design Ideas
> > > > > > > > > > > > =======================
> > > > > > > > > > > >
> > > > > > > > > > > > * Define memory tiers explicitly, not implicitly.
> > > > > > > > > > > >
> > > > > > > > > > > > * Memory tiers are defined based on hardware capabilities of memory
> > > > > > > > > > > >   nodes, not their relative node distances between each other.
> > > > > > > > > > > >
> > > > > > > > > > > > * The tier assignment of each node is independent from each other.
> > > > > > > > > > > >   Moving a node from one tier to another tier doesn't affect the tier
> > > > > > > > > > > >   assignment of any other node.
> > > > > > > > > > > >
> > > > > > > > > > > > * The node-tier association is stable. A node can be reassigned to a
> > > > > > > > > > > >   different tier only under the specific conditions that don't block
> > > > > > > > > > > >   future tier-based memory cgroup accounting.
> > > > > > > > > > > >
> > > > > > > > > > > > * A node can demote its pages to any nodes of any lower tiers. The
> > > > > > > > > > > >   demotion target node selection follows the allocation fallback order
> > > > > > > > > > > >   of the source node, which is built based on node distances.  The
> > > > > > > > > > > >   demotion targets are also restricted to only the nodes from the tiers
> > > > > > > > > > > >   lower than the source node.  We no longer need to maintain a separate
> > > > > > > > > > > >   per-node demotion order (node_demotion[]).
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Hi Wei,
> > > > > > > > > > >
> > > > > > > > > > > This proposal looks good to me, though we'll be having fun
> > > > > > > > > > > white boarding topologies from our roadmaps for the next few days :)
> > > > > > > > > >
> > > > > > > > > > That's good to hear.
> > > > > > > > > >
> > > > > > > > > > > A few comments inline. It also seems likely to me that there is little
> > > > > > > > > > > benefit in starting with 3 tiers as the maximum.  Seems unlikely the
> > > > > > > > > > > code will be substantially simpler for 3 than it would be for 4 or 5.
> > > > > > > > > > > I've drawn out one simple case that needs 4 to do sensible things.
> > > > > > > > > >
> > > > > > > > > > We can make the number of tiers a config option. 3 tiers are just what
> > > > > > > > > > the kernel can reasonably initialize when there isn't enough hardware
> > > > > > > > > > performance information from the firmware.
> > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Sysfs Interfaces
> > > > > > > > > > > > ================
> > > > > > > > > > > >
> > > > > > > > > > > > * /sys/devices/system/memtier/memtierN/nodelist
> > > > > > > > > > > >
> > > > > > > > > > > >   where N = 0, 1, 2 (the kernel supports only 3 tiers for now).
> > > > > > > > > > > >
> > > > > > > > > > > >   Format: node_list
> > > > > > > > > > > >
> > > > > > > > > > > >   Read-only.  When read, list the memory nodes in the specified tier.
> > > > > > > > > > > >
> > > > > > > > > > > >   Tier 0 is the highest tier, while tier 2 is the lowest tier.
> > > > > > > > > > > >
> > > > > > > > > > > >   The absolute value of a tier id number has no specific meaning.
> > > > > > > > > > > >   What matters is the relative order of the tier id numbers.
> > > > > > > > > > > >
> > > > > > > > > > > >   When a memory tier has no nodes, the kernel can hide its memtier
> > > > > > > > > > > >   sysfs files.
> > > > > > > > > > > >
> > > > > > > > > > > > * /sys/devices/system/node/nodeN/memtier
> > > > > > > > > > > >
> > > > > > > > > > > >   where N = 0, 1, ...
> > > > > > > > > > > >
> > > > > > > > > > > >   Format: int or empty
> > > > > > > > > > > >
> > > > > > > > > > > >   When read, list the memory tier that the node belongs to.  Its value
> > > > > > > > > > > >   is empty for a CPU-only NUMA node.
> > > > > > > > > > > >
> > > > > > > > > > > >   When written, the kernel moves the node into the specified memory
> > > > > > > > > > > >   tier if the move is allowed.  The tier assignment of all other nodes
> > > > > > > > > > > >   is not affected.
> > > > > > > > > > > >
> > > > > > > > > > > >   Initially, we can make this interface read-only.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Kernel Representation
> > > > > > > > > > > > =====================
> > > > > > > > > > > >
> > > > > > > > > > > > * All memory tiering code is guarded by CONFIG_TIERED_MEMORY.
> > > > > > > > > > > >
> > > > > > > > > > > > * #define MAX_MEMORY_TIERS 3
> > > > > > > > > > > >
> > > > > > > > > > > >   Support 3 memory tiers for now.
> > > > > > > > > > > >
> > > > > > > > > > > > * #define MEMORY_DEFAULT_TIER 1
> > > > > > > > > > > >
> > > > > > > > > > > >   The default tier that a memory node is assigned to.
> > > > > > > > > > > >
> > > > > > > > > > > > * nodemask_t memory_tiers[MAX_MEMORY_TIERS]
> > > > > > > > > > > >
> > > > > > > > > > > >   Store memory nodes by tiers.
> > > > > > > > > > > >
> > > > > > > > > > > > * int node_tier_map[MAX_NUMNODES]
> > > > > > > > > > > >
> > > > > > > > > > > >   Map a node to its tier.
> > > > > > > > > > > >
> > > > > > > > > > > >   For each CPU-only node c, node_tier_map[c] = -1.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Memory Tier Initialization
> > > > > > > > > > > > ==========================
> > > > > > > > > > > >
> > > > > > > > > > > > By default, all memory nodes are assigned to the default tier
> > > > > > > > > > > > (MEMORY_DEFAULT_TIER).
> > > > > > > > > > >
> > > > > > > > > > > This is tighter than it needs to be.  In many cases we can easily
> > > > > > > > > > > establish if there is any possibility of CPU being hotplugged into
> > > > > > > > > > > a memory node.  If it's CXL attached no way CPUs are going to be
> > > > > > > > > > > turning up there later :)  If CPU HP into a given node can't happen
> > > > > > > > > > > we can be more flexible and I think that often results in better decisions.
> > > > > > > > > > > See example below, though obviously I could just use the userspace
> > > > > > > > > > > interface to fix that up anyway or have a CXL driver move it around
> > > > > > > > > > > if that's relevant.  In some other cases I'm fairly sure we know in
> > > > > > > > > > > advance where CPUs can be added but I'd need to check all the
> > > > > > > > > > > relevant specs to be sure there aren't any corner cases.  I 'think'
> > > > > > > > > > > for ARM for example we know where all possible CPUs can be hotplugged
> > > > > > > > > > > (constraint coming from the interrupt controller + the fact that only
> > > > > > > > > > > virtual CPU HP is defined).
> > > > > > > > > >
> > > > > > > > > > We may not always want to put a CXL-attached memory device into a
> > > > > > > > > > slower tier because even though CXL does add some additional latency,
> > > > > > > > > > both the memory device and CXL can still be very capable in
> > > > > > > > > > performance and may not be much slower (if any) than the on-board DRAM
> > > > > > > > > > (e.g. DRAM from a remote CPU socket).
> > > > > > > > > >
> > > > > > > > > > Also, the default tier here is just the initial tier assignment of
> > > > > > > > > > each node, which behaves as if there were no tiering.  A tiering
> > > > > > > > > > kernel init function can certainly reassign the tier for each node if
> > > > > > > > > > it knows enough about the hardware performance for these nodes from
> > > > > > > > > > the firmware.
> > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > A device driver can move up or down its memory nodes from the default
> > > > > > > > > > > > tier.  For example, PMEM can move down its memory nodes below the
> > > > > > > > > > > > default tier, whereas GPU can move up its memory nodes above the
> > > > > > > > > > > > default tier.
> > > > > > > > > > > >
> > > > > > > > > > > > The kernel initialization code makes the decision on which exact tier
> > > > > > > > > > > > a memory node should be assigned to based on the requests from the
> > > > > > > > > > > > device drivers as well as the memory device hardware information
> > > > > > > > > > > > provided by the firmware.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Memory Tier Reassignment
> > > > > > > > > > > > ========================
> > > > > > > > > > > >
> > > > > > > > > > > > After a memory node is hot-removed, it can be hot-added back to a
> > > > > > > > > > > > different memory tier.  This is useful for supporting dynamically
> > > > > > > > > > > > provisioned CXL.mem NUMA nodes, which may connect to different
> > > > > > > > > > > > memory devices across hot-plug events.  Such tier changes should
> > > > > > > > > > > > be compatible with tier-based memory accounting.
> > > > > > > > > > > >
> > > > > > > > > > > > The userspace may also reassign an existing online memory node to a
> > > > > > > > > > > > different tier.  However, this should only be allowed when no pages
> > > > > > > > > > > > are allocated from the memory node or when there are no non-root
> > > > > > > > > > > > memory cgroups (e.g. during the system boot).  This restriction is
> > > > > > > > > > > > important for keeping memory tier hierarchy stable enough for
> > > > > > > > > > > > tier-based memory cgroup accounting.
> > > > > > > > > > > >
> > > > > > > > > > > > Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Memory Allocation for Demotion
> > > > > > > > > > > > ==============================
> > > > > > > > > > > >
> > > > > > > > > > > > To allocate a new page as the demotion target for a page, the kernel
> > > > > > > > > > > > calls the allocation function (__alloc_pages_nodemask) with the
> > > > > > > > > > > > source page node as the preferred node and the union of all lower
> > > > > > > > > > > > tier nodes as the allowed nodemask.  The actual target node selection
> > > > > > > > > > > > then follows the allocation fallback order that the kernel has
> > > > > > > > > > > > already defined.
> > > > > > > > > > > >
> > > > > > > > > > > > The pseudo code looks like:
> > > > > > > > > > > >
> > > > > > > > > > > >     targets = NODE_MASK_NONE;
> > > > > > > > > > > >     src_nid = page_to_nid(page);
> > > > > > > > > > > >     src_tier = node_tier_map[src_nid];
> > > > > > > > > > > >     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
> > > > > > > > > > > >             nodes_or(targets, targets, memory_tiers[i]);
> > > > > > > > > > > >     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
> > > > > > > > > > > >
> > > > > > > > > > > > The mempolicy of the cpuset, vma and owner task of the source page can
> > > > > > > > > > > > be set to refine the demotion target nodemask, e.g. to prevent
> > > > > > > > > > > > demotion or select a particular allowed node as the demotion target.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Memory Allocation for Promotion
> > > > > > > > > > > > ===============================
> > > > > > > > > > > >
> > > > > > > > > > > > The page allocation for promotion is similar to demotion, except that (1)
> > > > > > > > > > > > the target nodemask uses the promotion tiers, (2) the preferred node can
> > > > > > > > > > > > be the accessing CPU node, not the source page node.
> > > > > > > > > > > >
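For illustration, a rough sketch of the promotion counterpart of the
demotion pseudo code above (this is not in the proposal itself; it
assumes the accessing CPU's node is used as the preferred node) could
look like:

    targets = NODE_MASK_NONE;
    src_nid = page_to_nid(page);
    src_tier = node_tier_map[src_nid];
    /* union of all tiers higher (faster) than the source tier */
    for (i = 0; i < src_tier; i++)
            nodes_or(targets, targets, memory_tiers[i]);
    /* prefer allocating near the CPU that accessed the page */
    new_page = __alloc_pages_nodemask(gfp, order, numa_node_id(), targets);
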
> > > > > > > > > > > >
> > > > > > > > > > > > Examples
> > > > > > > > > > > > ========
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > ...
> > > > > > > > > > >
> > > > > > > > > > > > * Example 3:
> > > > > > > > > > > >
> > > > > > > > > > > > Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
> > > > > > > > > > >
> > > > > > > > > > > Node2 is drawn as pmem.
> > > > > > > > > >
> > > > > > > > > > Typo. Good catch.
> > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > All nodes are in the same tier.
> > > > > > > > > > > >
> > > > > > > > > > > >                   20
> > > > > > > > > > > >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> > > > > > > > > > > >          \                 /
> > > > > > > > > > > >           \ 30            / 30
> > > > > > > > > > > >            \             /
> > > > > > > > > > > >              Node 2 (PMEM)
> > > > > > > > > > > >
> > > > > > > > > > > > node distances:
> > > > > > > > > > > > node   0    1    2
> > > > > > > > > > > >    0  10   20   30
> > > > > > > > > > > >    1  20   10   30
> > > > > > > > > > > >    2  30   30   10
> > > > > > > > > > > >
> > > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > > > <empty>
> > > > > > > > > > > > 0-2
> > > > > > > > > > > > <empty>
> > > > > > > > > > > >
> > > > > > > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > > > > 1
> > > > > > > > > > > > 1
> > > > > > > > > > > > 1
> > > > > > > > > > > >
> > > > > > > > > > > > Demotion fallback order:
> > > > > > > > > > > > node 0: empty
> > > > > > > > > > > > node 1: empty
> > > > > > > > > > > > node 2: empty
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > * Example 4:
> > > > > > > > > > > >
> > > > > > > > > > > > Node 0 is a DRAM node with CPU.
> > > > > > > > > > > > Node 1 is a PMEM node.
> > > > > > > > > > > > Node 2 is a GPU node.
> > > > > > > > > > > >
> > > > > > > > > > > >                   50
> > > > > > > > > > > >   Node 0 (DRAM)  ----  Node 2 (GPU)
> > > > > > > > > > > >          \                 /
> > > > > > > > > > > >           \ 30            / 60
> > > > > > > > > > > >            \             /
> > > > > > > > > > > >              Node 1 (PMEM)
> > > > > > > > > > > >
> > > > > > > > > > > > node distances:
> > > > > > > > > > > > node   0    1    2
> > > > > > > > > > > >    0  10   30   50
> > > > > > > > > > > >    1  30   10   60
> > > > > > > > > > > >    2  50   60   10
> > > > > > > > > > > >
> > > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > > > 2
> > > > > > > > > > > > 0
> > > > > > > > > > > > 1
> > > > > > > > > > > >
> > > > > > > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > > > > 1
> > > > > > > > > > > > 2
> > > > > > > > > > > > 0
> > > > > > > > > > > >
> > > > > > > > > > > > Demotion fallback order:
> > > > > > > > > > > > node 0: 1
> > > > > > > > > > > > node 1: empty
> > > > > > > > > > > > node 2: 0, 1
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > * Example 5:
> > > > > > > > > > > >
> > > > > > > > > > > > Node 0 is a DRAM node with CPU.
> > > > > > > > > > > > Node 1 is a GPU node.
> > > > > > > > > > > > Node 2 is a PMEM node.
> > > > > > > > > > > > Node 3 is a large, slow DRAM node without CPU.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >      Node 2 (PMEM)  ----
> > > > > > > > > > > >    /      |              \
> > > > > > > > > > > >   /       | 30            \ 120
> > > > > > > > > > > >  |        |         100    \
> > > > > > > > > > > >  |   Node 0 (DRAM)  ----  Node 1 (GPU)
> > > > > > > > > > > >   \         \                 /
> > > > > > > > > > > >     \        \ 40            / 110
> > > > > > > > > > > >   80  \       \             /
> > > > > > > > > > > >         ---  Node 3 (Slow DRAM)
> > > > > > > > > > >
> > > > > > > > > > > This is close but not quite what was intended for Hesham's
> > > > > > > > > > > example... (note we just checked that Hesham's original node0-1
> > > > > > > > > > > timing didn't make any sense.).
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > This was inspired by Hesham's example. But I should have also included
> > > > > > > > > > the version that illustrates the need to skip a tier when demoting
> > > > > > > > > > from certain nodes.
> > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > node distances:
> > > > > > > > > > > > node    0    1    2    3
> > > > > > > > > > > >    0   10  100   30   40
> > > > > > > > > > > >    1  100   10  120  110
> > > > > > > > > > > >    2   30  120   10   80
> > > > > > > > > > > >    3   40  110   80   10
> > > > > > > > > > > >
> > > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > > > 1
> > > > > > > > > > > > 0,3
> > > > > > > > > > > > 2
> > > > > > > > > > > >
> > > > > > > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > > > > 1
> > > > > > > > > > > > 0
> > > > > > > > > > > > 2
> > > > > > > > > > > > 1
> > > > > > > > > > > >
> > > > > > > > > > > > Demotion fallback order:
> > > > > > > > > > > > node 0: 2
> > > > > > > > > > > > node 1: 0, 3, 2
> > > > > > > > > > > > node 2: empty
> > > > > > > > > > > > node 3: 2
> > > > > > > > > > >
> > > > > > > > > > > This is close but not quite the same as the example
> > > > > > > > > > > Hesham gave (note the node timing 1 to 0 in the table
> > > > > > > > > > > with that example didn't make sense).  I added another
> > > > > > > > > > > level of switching to make the numbers more obviously
> > > > > > > > > > > different and show how critical it might be.
> > > > > > > > > > >
> > > > > > > > > > > * Example 6:
> > > > > > > > > > >
> > > > > > > > > > > Node 0 is a DRAM node with CPU.
> > > > > > > > > > > Node 1 is a GPU node.
> > > > > > > > > > > Node 2 is a PMEM node.
> > > > > > > > > > > Node 3 is an extremely large DRAM node without CPU.
> > > > > > > > > > >   (Key point here being that it probably never makes sense
> > > > > > > > > > >    to demote to anywhere else from this memory).
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I've redone the timings wrt example 5.
> > > > > > > > > > > Basis for this is 0 and 2 are directly connected
> > > > > > > > > > > via controllers in an SoC. 1 and 3 are connected
> > > > > > > > > > > via a common switch that is one more switch down
> > > > > > > > > > > (each hop via a switch is 100)
> > > > > > > > > > > All DRAMs cost 10 once you've reached the correct node,
> > > > > > > > > > > and PMEM costs 30 from the SoC.
> > > > > > > > > > > The numbers get too large as a result but meh, I'm making
> > > > > > > > > > > a point, not providing real numbers :)
> > > > > > > > > > >
> > > > > > > > > > >          PMEM Node 2
> > > > > > > > > > >             |(30)
> > > > > > > > > > >         CPU + DRAM Node0
> > > > > > > > > > >             |(100)
> > > > > > > > > > >          Switch 1
> > > > > > > > > > >             |(100)
> > > > > > > > > > >           Switch 2
> > > > > > > > > > >     (100)  |      |(100)
> > > > > > > > > > > Node 1 GPU     Node3 Large memory.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > With one level of switching folded into the node distances:
> > > > > > > > > > >
> > > > > > > > > > >      Node 2 (PMEM)  ----
> > > > > > > > > > >     /      |              \
> > > > > > > > > > >    /       | 30            \ 330
> > > > > > > > > > >   |        |         310    \
> > > > > > > > > > >   |   Node 0 (DRAM)  ----  Node 1 (GPU)
> > > > > > > > > > >    \         \                 /
> > > > > > > > > > >      \        \ 310           / 210
> > > > > > > > > > >    330 \       \             /
> > > > > > > > > > >          ---  Node 3 (Extremely large DRAM)
> > > > > > > > > > >
> > > > > > > > > > > To my mind, we should potentially also take into account
> > > > > > > > > > > the fact that Node3 can be known to never contain CPUs
> > > > > > > > > > > (in at least some architectures we know where the CPUs
> > > > > > > > > > >  might be added later, they can't just magically turn up
> > > > > > > > > > >  anywhere in the topology).
> > > > > > > > > > >
> > > > > > > > > > > node distances:
> > > > > > > > > > > node    0    1    2    3
> > > > > > > > > > >     0   10   310  30   310
> > > > > > > > > > >     1   310  10   330  210
> > > > > > > > > > >     2   30   330  10   330
> > > > > > > > > > >     3   310  210  330   10
> > > > > > > > > > >
> > > > > > > > > > > So, my ideal would treat node 3 differently from other DRAM nodes
> > > > > > > > > > > as we know it can't have CPUs.  Trying to come up with an
> > > > > > > > > > > always-correct order for nodes 3 and 2 is tricky as it depends to a
> > > > > > > > > > > certain extent on capacity.  If node 2 was big enough to take
> > > > > > > > > > > any demotion from node 0 and still have lots of room, then demoting
> > > > > > > > > > > there from node 3 would make sense, and vice versa.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >  $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > >  1
> > > > > > > > > > >  0
> > > > > > > > > > >  2
> > > > > > > > > > >  3
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >  $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > > >   1
> > > > > > > > > > >   0
> > > > > > > > > > >   2
> > > > > > > > > > >   3
> > > > > > > > > > >
> > > > > > > > > > >  Demotion fallback order:
> > > > > > > > > > >  node 0: 2, 3
> > > > > > > > > > >  node 1: 3, 0, 2 (key being we will almost always have less pressure on node 3)
> > > > > > > > > > >  node 2: 3
> > > > > > > > > > >  node 3: empty
> > > > > > > > > > >
> > > > > > > > > > > Or, as Hesham just pointed out, this can be done with 3 tiers,
> > > > > > > > > > > because we can put the GPU and CPU in the same tier since
> > > > > > > > > > > there is little reason to demote from one to the other.
> > > > > > > > > >
> > > > > > > > > > Thank you for the example.  It makes sense to me to have node 3 on its
> > > > > > > > > > own tier.  We can have either 3 tiers or 4 tiers in total (assuming
> > > > > > > > > > that the max number of tiers is a config option).
> > > > > > > > > >
> > > > > > > > > > > We are also a bit worried about ABI backwards compatibility because
> > > > > > > > > > > of the potential need to make more space in tiers lower in number than
> > > > > > > > > > > the CPU-attached DDR tier.  I rather liked the negative-tier proposal,
> > > > > > > > > > > with the default as 0, that Huang, Ying made.
> > > > > > > > > >
> > > > > > > > > > It is hard to have negative values as the device IDs.
> > > > > > > > > >
> > > > > > > > > > The current proposal equates the tier device ID with the tier hierarchy
> > > > > > > > > > level, which makes the interface simpler, but less flexible.  How
> > > > > > > > > > about the following proposal (which decouples the tier device ID from
> > > > > > > > > > the tier level)?
> > > > > > > > > >
> > > > > > > > > > /sys/devices/system/memtier/memtierN/nodelist
> > > > > > > > > > /sys/devices/system/memtier/memtierN/rank
> > > > > > > > > >
> > > > > > > > > > Each memory tier N has two sysfs files:
> > > > > > > > > > - nodelist: the nodes that are in this tier
> > > > > > > > > > - rank: an opaque value that helps decide the level at which this tier
> > > > > > > > > > is in the tier hierarchy (smaller value means faster tier)
> > > > > > > > > >
> > > > > > > > > > The tier hierarchy is determined by "rank", not by the device id
> > > > > > > > > > number N from "memtierN".
> > > > > > > > > >
> > > > > > > > > > The absolute value of "rank" of a memtier doesn't necessarily carry
> > > > > > > > > > any meaning. Its value relative to other memtiers decides the level of
> > > > > > > > > > this memtier in the tier hierarchy.
> > > > > > > > > >
> > > > > > > > > > The CPU-attached DRAM nodes are always in memtier0 (the device ID),
> > > > > > > > > > but memtier0 may not always be the top-tier, e.g. its level can be 3
> > > > > > > > > > in a 5-tier system.
> > > > > > > > > >
> > > > > > > > > > For the above example (example 6), we can have:
> > > > > > > > > >
> > > > > > > > > > $ ls /sys/devices/system/memtier
> > > > > > > > > > memtier0
> > > > > > > > > > memtier1
> > > > > > > > > > memtier2
> > > > > > > > > > memtier128
> > > > > > > > > >
> > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/rank
> > > > > > > > > > 50
> > > > > > > > > > 60
> > > > > > > > > > 70
> > > > > > > > > > 10
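A rough kernel-side sketch of how this could look (the struct layout
and the helpers node_to_memtier() and for_each_memtier() are invented
here just for illustration): the demotion targets are derived by
comparing ranks, not device IDs.

    struct memtier {
            int id;                 /* sysfs device id, e.g. 0, 1, 2, 128 */
            int rank;               /* ordering key; smaller means faster */
            nodemask_t nodes;       /* nodes assigned to this tier */
    };

    /* Union of the nodes in all tiers slower than @nid's tier. */
    static void build_demotion_targets(int nid, nodemask_t *targets)
    {
            struct memtier *t, *src = node_to_memtier(nid);

            nodes_clear(*targets);
            for_each_memtier(t)
                    if (t->rank > src->rank)
                            nodes_or(*targets, *targets, t->nodes);
    }
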
> > > > > > > > >
> > > > > > > > > I understand that the device ID cannot be negative.  So we have to use
> > > > > > > > > rank.  Can we make it possible to allow "rank" to be negative?
> > > > > > > >
> > > > > > > > It is possible to allow "rank" to be negative, though I think all
> > > > > > > > positive values should work equally well.
> > > > > > > >
> > > > > > > > > Another choice is to do some trick with the device ID.  For example, the CPU-
> > > > > > > > > attached DRAM nodes are always memtier100 (the device ID).  Then we can
> > > > > > > > > have memtier99, memtier100, memtier101, memtier102, ....  That's not
> > > > > > > > > perfect either.
> > > > > > > >
> > > > > > > > If we go with the device ID tricks, one approach is to use sub-device IDs:
> > > > > > > >
> > > > > > > > - There are 3 major tiers: tier0 (e.g. GPU), tier1 (e.g. DRAM) and
> > > > > > > > tier2 (e.g. PMEM).
> > > > > > > >
> > > > > > > > - Each major tier can have minor tiers, e.g. tier0.0, tier1.0,
> > > > > > > > tier1.1, tier2.0, tier2.1.
> > > > > > > >
> > > > > > > > The earlier 4-tier example can be represented as:
> > > > > > > >
> > > > > > > > memtier0.0 -> memtier1.0 -> memtier2.0 -> memtier2.1
> > > > > > > >
> > > > > > > > We can also omit .0 so that the tiers are:
> > > > > > > >
> > > > > > > > memtier0 -> memtier1 -> memtier2 -> memtier2.1
> > > > > > > >
> > > > > > > > This should be flexible enough to support multiple tiers while keeping
> > > > > > > > the tier IDs relatively stable.
> > > > > > > >
> > > > > > > > It is not as flexible as the rank approach. For example, to insert a
> > > > > > > > new tier between 2.0 and 2.1, we need to add a tier 2.2 and reassign
> > > > > > > > existing nodes to these 3 tiers.  Using "rank", we can insert a new
> > > > > > > > tier and only move desired nodes into the new tier.
> > > > > > > >
> > > > > > > > What do you think?
> > > > > > >
> > > > > > > The rank approach looks better to me.  And if we stick with the device ID
> > > > > > > rule as follows,
> > > > > > >
> > > > > > > ...
> > > > > > > 255     GPU
> > > > > > > 0       DRAM
> > > > > > > 1       PMEM
> > > > > > > 2
> > > > > > > ...
> > > > > > >
> > > > > > > 255 is -1 for "s8".
> > > > > > >
> > > > > > > The device ID should do most of the tricks, at least for now.  The rank can provide
> > > > > > > more flexibility in the future.  We can even go without rank in the
> > > > > > > first version, and introduce it when it's necessary.
> > > > > >
> > > > > > Given that the "rank" approach is generally favored, let's go with
> > > > > > that to avoid compatibility issues that may come from the switch of
> > > > > > device ID tricks to ranks.
> > > > >
> > > > > OK.  Just to confirm.  Does this mean that we will have fixed device IDs,
> > > > > for example,
> > > > >
> > > > > GPU                     memtier255
> > > > > DRAM (with CPU)         memtier0
> > > > > PMEM                    memtier1
> > > > >
> > > > > When we add a new memtier, it can be memtier254, or memtier2?  The rank
> > > > > value will determine the real demotion order.
> > > >
> > > > With the rank approach, the device ID numbering should be flexible and
> > > > not mandated by the proposal.
> > >
> > > If so, the rank number will be fixed?  For example,
> > >
> > > GPU                     100
> > > DRAM (with CPU)         200
> > > PMEM                    300
> > >
> > > When we add a new memtier, its rank can be 50, 150, 250, or 400?
> > >
> > > If so, this makes me wonder why we don't just make this kind of rank the
> > > device ID.  Or did I miss something?
> > >
> > > Or, both device IDs and rank values are not fixed?  Why do we need that
> > > kind of flexibility?  Sorry, I may not understand all requirements.
> >
> > Even though the proposal doesn't mandate a particular device ID
> > numbering, I expect that the device IDs will be relatively stable once
> > a kernel implementation is chosen. For example, it is likely that DRAM
> > nodes with CPUs will always be on memtier1, no matter how many tiers
> > are higher or lower than these nodes.
> >
> > We don't need to mandate a particular way to assign the rank values,
> > either.  What matters is the relative order and some reasonable gap
> > between these values.
> >
> > The rank approach allows us to keep memtier device IDs relatively
> > stable even though we may change the tier ordering among them.  Its
> > flexibility can have many other uses as well.  For example, we can
> > insert a new memtier into the tier hierarchy for a new set of nodes
> > without affecting the node assignment of any existing memtier,
> > provided that there is enough gap in the rank values for the new
> > memtier.
> >
> > Using the rank value directly as the device ID has some disadvantages:
> > - It is kind of unconventional to number devices in this way.
> > - We cannot pin DRAM nodes with CPUs to a specific memtier device
> > ID (even though this is not mandated by the "rank" proposal, I expect
> > the device will likely always be memtier1 in practice).
> > - It is possible that we may eventually allow the rank value to be
> > modified as a way to adjust the tier ordering.  We cannot do that
> > easily for device IDs.
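As a concrete (hypothetical) illustration of that last point, with
invented helper names: inserting a tier between the rank-50 and
rank-60 tiers only needs an unused rank value in the gap, and only the
new nodes move.

    /* memtier_create() is a hypothetical helper; rank 55 slots between 50 and 60. */
    struct memtier *mt = memtier_create(3 /* new device id */, 55 /* rank */);

    node_set(new_nid, mt->nodes);   /* existing memtiers are untouched */
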
>
> OK.  I can understand that sometimes it's more natural to change the
> order of a set of nodes with the same memory type (and data plane path)
> together, instead of changing them one by one for each node.
>
> It appears that the memtierX device becomes a kind of memory type (with
> the data plane path considered for latency/throughput too).  We can assign
> a memory type to a node, and change the order between memory types.  If
> so, we need to allow multiple memtiers to have the same rank value.

Jonathan mentioned this feature of multiple memtiers sharing the same
rank as well.  It can be a convenient feature to have.  For
simplicity, it should be fine to leave this feature out initially.
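If it is added later, a strict rank comparison (as in the sketch
earlier, assuming the same struct memtier fields) would already treat
equal-rank memtiers as peers: neither is a demotion target of the
other, so they behave as one logical tier.

    static bool memtier_is_below(const struct memtier *a, const struct memtier *b)
    {
            return a->rank > b->rank;       /* '>' not '>=': equal ranks are peers */
    }
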

> Best Regards,
> Huang, Ying
>
> > >
> > > > > I think you may need to send v3 to make sure everyone is on the same
> > > > > page.
> > > >
> > > > Will do it shortly.
> > >
> > > Good!  Thanks!
> > >
> > > Best Regards,
> > > Huang, Ying
> > >
> > > > > Best Regards,
> > > > > Huang, Ying
> > > > >
> > > > > > > Best Regards,
> > > > > > > Huang, Ying
> > > > > > >
> > > > > > > > > > The tier order: memtier128 -> memtier0 -> memtier1 -> memtier2
> > > > > > > > > >
> > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > 0
> > > > > > > > > > 2
> > > > > > > > > > 3
> > > > > > > > > > 1
> > > > > > > > > >
> > > > > > > > > > $ ls -l /sys/devices/system/node/node*/memtier
> > > > > > > > > > /sys/devices/system/node/node0/memtier -> /sys/devices/system/memtier/memtier0
> > > > > > > > > > /sys/devices/system/node/node1/memtier -> /sys/devices/system/memtier/memtier128
> > > > > > > > > > /sys/devices/system/node/node2/memtier -> /sys/devices/system/memtier/memtier1
> > > > > > > > > > /sys/devices/system/node/node3/memtier -> /sys/devices/system/memtier/memtier2
> > > > > > > > > >
> > > > > > > > > > To override the memory tier of a node, we can use a new, write-only,
> > > > > > > > > > per-node interface file:
> > > > > > > > > >
> > > > > > > > > > /sys/devices/system/node/nodeN/set_memtier
> > > > > > > > > >
> > > > > > > > > > e.g.
> > > > > > > > > >
> > > > > > > > > > $ echo "memtier128" > /sys/devices/system/node/node1/set_memtier
> > > > > > > > >
> > > > > > > > > I prefer the original proposal to make nodeX/memtier a normal file that
> > > > > > > > > holds the memtier device ID instead of a link.
> > > > > > > >
> > > > > > > > OK. We don't have to use a symlink.
> > > > > > > >
> > > > > > > > > Best Regards,
> > > > > > > > > Huang, Ying
> > > > > > > > >
> > > > > > > > > > Any comments?
> > > > > > > > > >
> > > > > > > > > > > Jonathan
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > >
> > > > >
> > >
> > >
>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-26  7:08                       ` Wei Xu
@ 2022-05-26  7:39                         ` Ying Huang
  2022-05-26 20:55                           ` Wei Xu
  0 siblings, 1 reply; 47+ messages in thread
From: Ying Huang @ 2022-05-26  7:39 UTC (permalink / raw)
  To: Wei Xu
  Cc: Jonathan Cameron, Andrew Morton, Greg Thelen, Aneesh Kumar K.V,
	Yang Shi, Linux Kernel Mailing List, Jagdish Gediya,
	Michal Hocko, Tim C Chen, Dave Hansen, Alistair Popple,
	Baolin Wang, Feng Tang, Davidlohr Bueso, Dan Williams,
	David Rientjes, Linux MM, Brice Goglin, Hesham Almatary

On Thu, 2022-05-26 at 00:08 -0700, Wei Xu wrote:
> On Wed, May 25, 2022 at 11:55 PM Ying Huang <ying.huang@intel.com> wrote:
> > 
> > On Wed, 2022-05-25 at 20:53 -0700, Wei Xu wrote:
> > > On Wed, May 25, 2022 at 6:10 PM Ying Huang <ying.huang@intel.com> wrote:
> > > > 
> > > > On Wed, 2022-05-25 at 08:36 -0700, Wei Xu wrote:
> > > > > On Wed, May 25, 2022 at 2:03 AM Ying Huang <ying.huang@intel.com> wrote:
> > > > > > 
> > > > > > On Tue, 2022-05-24 at 22:32 -0700, Wei Xu wrote:
> > > > > > > On Tue, May 24, 2022 at 1:24 AM Ying Huang <ying.huang@intel.com> wrote:
> > > > > > > > 
> > > > > > > > On Tue, 2022-05-24 at 00:04 -0700, Wei Xu wrote:
> > > > > > > > > On Thu, May 19, 2022 at 8:06 PM Ying Huang <ying.huang@intel.com> wrote:
> > > > > > > > > > 
> > > > > > > > > > On Wed, 2022-05-18 at 00:09 -0700, Wei Xu wrote:
> > > > > > > > > > > On Thu, May 12, 2022 at 8:00 AM Jonathan Cameron
> > > > > > > > > > > <Jonathan.Cameron@huawei.com> wrote:
> > > > > > > > > > > > 
> > > > > > > > > > > > On Wed, 11 May 2022 23:22:11 -0700
> > > > > > > > > > > > Wei Xu <weixugc@google.com> wrote:
> > > > > > > > > > > > > The current kernel has the basic memory tiering support: Inactive
> > > > > > > > > > > > > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > > > > > > > > > > > > tier NUMA node to make room for new allocations on the higher tier
> > > > > > > > > > > > > NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> > > > > > > > > > > > > migrated (promoted) to a higher tier NUMA node to improve the
> > > > > > > > > > > > > performance.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > In the current kernel, memory tiers are defined implicitly via a
> > > > > > > > > > > > > demotion path relationship between NUMA nodes, which is created during
> > > > > > > > > > > > > the kernel initialization and updated when a NUMA node is hot-added or
> > > > > > > > > > > > > hot-removed.  The current implementation puts all nodes with CPU into
> > > > > > > > > > > > > the top tier, and builds the tier hierarchy tier-by-tier by establishing
> > > > > > > > > > > > > the per-node demotion targets based on the distances between nodes.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > This current memory tier kernel interface needs to be improved for
> > > > > > > > > > > > > several important use cases:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > * The current tier initialization code always initializes
> > > > > > > > > > > > >   each memory-only NUMA node into a lower tier.  But a memory-only
> > > > > > > > > > > > >   NUMA node may have a high performance memory device (e.g. a DRAM
> > > > > > > > > > > > >   device attached via CXL.mem or a DRAM-backed memory-only node on
> > > > > > > > > > > > >   a virtual machine) and should be put into a higher tier.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > * The current tier hierarchy always puts CPU nodes into the top
> > > > > > > > > > > > >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> > > > > > > > > > > > >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> > > > > > > > > > > > >   with CPUs are better to be placed into the next lower tier.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > * Also because the current tier hierarchy always puts CPU nodes
> > > > > > > > > > > > >   into the top tier, when a CPU is hot-added (or hot-removed) and
> > > > > > > > > > > > >   triggers a memory node from CPU-less into a CPU node (or vice
> > > > > > > > > > > > >   versa), the memory tier hierarchy gets changed, even though no
> > > > > > > > > > > > >   memory node is added or removed.  This can make the tier
> > > > > > > > > > > > >   hierarchy unstable and make it difficult to support tier-based
> > > > > > > > > > > > >   memory accounting.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > * A higher tier node can only be demoted to selected nodes on the
> > > > > > > > > > > > >   next lower tier as defined by the demotion path, not any other
> > > > > > > > > > > > >   node from any lower tier.  This strict, hard-coded demotion order
> > > > > > > > > > > > >   does not work in all use cases (e.g. some use cases may want to
> > > > > > > > > > > > >   allow cross-socket demotion to another node in the same demotion
> > > > > > > > > > > > >   tier as a fallback when the preferred demotion node is out of
> > > > > > > > > > > > >   space), and has resulted in the feature request for an interface to
> > > > > > > > > > > > >   override the system-wide, per-node demotion order from the
> > > > > > > > > > > > >   userspace.  This demotion order is also inconsistent with the page
> > > > > > > > > > > > >   allocation fallback order when all the nodes in a higher tier are
> > > > > > > > > > > > >   out of space: The page allocation can fall back to any node from
> > > > > > > > > > > > >   any lower tier, whereas the demotion order doesn't allow that.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > * There are no interfaces for the userspace to learn about the memory
> > > > > > > > > > > > >   tier hierarchy in order to optimize its memory allocations.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I'd like to propose revised memory tier kernel interfaces based on
> > > > > > > > > > > > > the discussions in the threads:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> > > > > > > > > > > > > - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
> > > > > > > > > > > > > - https://lore.kernel.org/linux-mm/867bc216386eb6cbf54648f23e5825830f5b922e.camel@intel.com/T/
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > High-level Design Ideas
> > > > > > > > > > > > > =======================
> > > > > > > > > > > > > 
> > > > > > > > > > > > > * Define memory tiers explicitly, not implicitly.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > * Memory tiers are defined based on hardware capabilities of memory
> > > > > > > > > > > > >   nodes, not their relative node distances between each other.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > * The tier assignment of each node is independent from each other.
> > > > > > > > > > > > >   Moving a node from one tier to another tier doesn't affect the tier
> > > > > > > > > > > > >   assignment of any other node.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > * The node-tier association is stable. A node can be reassigned to a
> > > > > > > > > > > > >   different tier only under the specific conditions that don't block
> > > > > > > > > > > > >   future tier-based memory cgroup accounting.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > * A node can demote its pages to any nodes of any lower tiers. The
> > > > > > > > > > > > >   demotion target node selection follows the allocation fallback order
> > > > > > > > > > > > >   of the source node, which is built based on node distances.  The
> > > > > > > > > > > > >   demotion targets are also restricted to only the nodes from the tiers
> > > > > > > > > > > > >   lower than the source node.  We no longer need to maintain a separate
> > > > > > > > > > > > >   per-node demotion order (node_demotion[]).
> > > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > Hi Wei,
> > > > > > > > > > > > 
> > > > > > > > > > > > This proposal looks good to me, though we'll be having fun
> > > > > > > > > > > > white boarding topologies from our roadmaps for the next few days :)
> > > > > > > > > > > 
> > > > > > > > > > > That's good to hear.
> > > > > > > > > > > 
> > > > > > > > > > > > A few comments inline. It also seems likely to me that there is little
> > > > > > > > > > > > benefit in starting with 3 tiers as the maximum.  Seems unlikely the
> > > > > > > > > > > > code will be substantially simpler for 3 than it would be for 4 or 5.
> > > > > > > > > > > > I've drawn out one simple case that needs 4 to do sensible things.
> > > > > > > > > > > 
> > > > > > > > > > > We can make the number of tiers a config option. 3 tiers are just what
> > > > > > > > > > > the kernel can reasonably initialize when there isn't enough hardware
> > > > > > > > > > > performance information from the firmware.
> > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Sysfs Interfaces
> > > > > > > > > > > > > ================
> > > > > > > > > > > > > 
> > > > > > > > > > > > > * /sys/devices/system/memtier/memtierN/nodelist
> > > > > > > > > > > > > 
> > > > > > > > > > > > >   where N = 0, 1, 2 (the kernel supports only 3 tiers for now).
> > > > > > > > > > > > > 
> > > > > > > > > > > > >   Format: node_list
> > > > > > > > > > > > > 
> > > > > > > > > > > > >   Read-only.  When read, list the memory nodes in the specified tier.
> > > > > > > > > > > > > 
> > > > > > > > > > > > >   Tier 0 is the highest tier, while tier 2 is the lowest tier.
> > > > > > > > > > > > > 
> > > > > > > > > > > > >   The absolute value of a tier id number has no specific meaning.
> > > > > > > > > > > > >   What matters is the relative order of the tier id numbers.
> > > > > > > > > > > > > 
> > > > > > > > > > > > >   When a memory tier has no nodes, the kernel can hide its memtier
> > > > > > > > > > > > >   sysfs files.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > * /sys/devices/system/node/nodeN/memtier
> > > > > > > > > > > > > 
> > > > > > > > > > > > >   where N = 0, 1, ...
> > > > > > > > > > > > > 
> > > > > > > > > > > > >   Format: int or empty
> > > > > > > > > > > > > 
> > > > > > > > > > > > >   When read, list the memory tier that the node belongs to.  Its value
> > > > > > > > > > > > >   is empty for a CPU-only NUMA node.
> > > > > > > > > > > > > 
> > > > > > > > > > > > >   When written, the kernel moves the node into the specified memory
> > > > > > > > > > > > >   tier if the move is allowed.  The tier assignment of all other nodes
> > > > > > > > > > > > >   is not affected.
> > > > > > > > > > > > > 
> > > > > > > > > > > > >   Initially, we can make this interface read-only.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Kernel Representation
> > > > > > > > > > > > > =====================
> > > > > > > > > > > > > 
> > > > > > > > > > > > > * All memory tiering code is guarded by CONFIG_TIERED_MEMORY.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > * #define MAX_MEMORY_TIERS 3
> > > > > > > > > > > > > 
> > > > > > > > > > > > >   Support 3 memory tiers for now.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > * #define MEMORY_DEFAULT_TIER 1
> > > > > > > > > > > > > 
> > > > > > > > > > > > >   The default tier that a memory node is assigned to.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > * nodemask_t memory_tiers[MAX_MEMORY_TIERS]
> > > > > > > > > > > > > 
> > > > > > > > > > > > >   Store memory nodes by tiers.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > * int node_tier_map[MAX_NUMNODES]
> > > > > > > > > > > > > 
> > > > > > > > > > > > >   Map a node to its tier.
> > > > > > > > > > > > > 
> > > > > > > > > > > > >   For each CPU-only node c, node_tier_map[c] = -1.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Memory Tier Initialization
> > > > > > > > > > > > > ==========================
> > > > > > > > > > > > > 
> > > > > > > > > > > > > By default, all memory nodes are assigned to the default tier
> > > > > > > > > > > > > (MEMORY_DEFAULT_TIER).
> > > > > > > > > > > > 
> > > > > > > > > > > > This is tighter than it needs to be.  In many cases we can easily
> > > > > > > > > > > > establish if there is any possibility of CPU being hotplugged into
> > > > > > > > > > > > a memory node.  If it's CXL attached no way CPUs are going to be
> > > > > > > > > > > > turning up there later :)  If CPU HP into a given node can't happen
> > > > > > > > > > > > we can be more flexible and I think that often results in better decisions.
> > > > > > > > > > > > See example below, though obviously I could just use the userspace
> > > > > > > > > > > > interface to fix that up anyway or have a CXL driver move it around
> > > > > > > > > > > > if that's relevant.  In some other cases I'm fairly sure we know in
> > > > > > > > > > > > advance where CPUs can be added but I'd need to check all the
> > > > > > > > > > > > relevant specs to be sure there aren't any corner cases.  I 'think'
> > > > > > > > > > > > for ARM for example we know where all possible CPUs can be hotplugged
> > > > > > > > > > > > (constraint coming from the interrupt controller + the fact that only
> > > > > > > > > > > > virtual CPU HP is defined).
> > > > > > > > > > > 
> > > > > > > > > > > We may not always want to put a CXL-attached memory device into a
> > > > > > > > > > > slower tier because even though CXL does add some additional latency,
> > > > > > > > > > > both the memory device and CXL can still be very capable in
> > > > > > > > > > > performance and may not be much slower (if any) than the on-board DRAM
> > > > > > > > > > > (e.g. DRAM from a remote CPU socket).
> > > > > > > > > > > 
> > > > > > > > > > > Also, the default tier here is just the initial tier assignment of
> > > > > > > > > > > each node, which behaves as if there were no tiering.  A tiering
> > > > > > > > > > > kernel init function can certainly reassign the tier for each node if
> > > > > > > > > > > it knows enough about the hardware performance for these nodes from
> > > > > > > > > > > the firmware.
> > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > A device driver can move up or down its memory nodes from the default
> > > > > > > > > > > > > tier.  For example, PMEM can move down its memory nodes below the
> > > > > > > > > > > > > default tier, whereas GPU can move up its memory nodes above the
> > > > > > > > > > > > > default tier.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The kernel initialization code makes the decision on which exact tier
> > > > > > > > > > > > > a memory node should be assigned to based on the requests from the
> > > > > > > > > > > > > device drivers as well as the memory device hardware information
> > > > > > > > > > > > > provided by the firmware.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Memory Tier Reassignment
> > > > > > > > > > > > > ========================
> > > > > > > > > > > > > 
> > > > > > > > > > > > > After a memory node is hot-removed, it can be hot-added back to a
> > > > > > > > > > > > > different memory tier.  This is useful for supporting dynamically
> > > > > > > > > > > > > provisioned CXL.mem NUMA nodes, which may connect to different
> > > > > > > > > > > > > memory devices across hot-plug events.  Such tier changes should
> > > > > > > > > > > > > be compatible with tier-based memory accounting.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The userspace may also reassign an existing online memory node to a
> > > > > > > > > > > > > different tier.  However, this should only be allowed when no pages
> > > > > > > > > > > > > are allocated from the memory node or when there are no non-root
> > > > > > > > > > > > > memory cgroups (e.g. during the system boot).  This restriction is
> > > > > > > > > > > > > important for keeping memory tier hierarchy stable enough for
> > > > > > > > > > > > > tier-based memory cgroup accounting.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Memory Allocation for Demotion
> > > > > > > > > > > > > ==============================
> > > > > > > > > > > > > 
> > > > > > > > > > > > > To allocate a new page as the demotion target for a page, the kernel
> > > > > > > > > > > > > calls the allocation function (__alloc_pages_nodemask) with the
> > > > > > > > > > > > > source page node as the preferred node and the union of all lower
> > > > > > > > > > > > > tier nodes as the allowed nodemask.  The actual target node selection
> > > > > > > > > > > > > then follows the allocation fallback order that the kernel has
> > > > > > > > > > > > > already defined.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The pseudo code looks like:
> > > > > > > > > > > > > 
> > > > > > > > > > > > >     targets = NODE_MASK_NONE;
> > > > > > > > > > > > >     src_nid = page_to_nid(page);
> > > > > > > > > > > > >     src_tier = node_tier_map[src_nid];
> > > > > > > > > > > > >     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
> > > > > > > > > > > > >             nodes_or(targets, targets, memory_tiers[i]);
> > > > > > > > > > > > >     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The mempolicy of the cpuset, vma and owner task of the source page can
> > > > > > > > > > > > > be set to refine the demotion target nodemask, e.g. to prevent
> > > > > > > > > > > > > demotion or select a particular allowed node as the demotion target.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Memory Allocation for Promotion
> > > > > > > > > > > > > ===============================
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The page allocation for promotion is similar to demotion, except that (1)
> > > > > > > > > > > > > the target nodemask uses the promotion tiers, (2) the preferred node can
> > > > > > > > > > > > > be the accessing CPU node, not the source page node.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Examples
> > > > > > > > > > > > > ========
> > > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > ...
> > > > > > > > > > > > 
> > > > > > > > > > > > > * Example 3:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
> > > > > > > > > > > > 
> > > > > > > > > > > > Node2 is drawn as pmem.
> > > > > > > > > > > 
> > > > > > > > > > > Typo. Good catch.
> > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > All nodes are in the same tier.
> > > > > > > > > > > > > 
> > > > > > > > > > > > >                   20
> > > > > > > > > > > > >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> > > > > > > > > > > > >          \                 /
> > > > > > > > > > > > >           \ 30            / 30
> > > > > > > > > > > > >            \             /
> > > > > > > > > > > > >              Node 2 (PMEM)
> > > > > > > > > > > > > 
> > > > > > > > > > > > > node distances:
> > > > > > > > > > > > > node   0    1    2
> > > > > > > > > > > > >    0  10   20   30
> > > > > > > > > > > > >    1  20   10   30
> > > > > > > > > > > > >    2  30   30   10
> > > > > > > > > > > > > 
> > > > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > > > > <empty>
> > > > > > > > > > > > > 0-2
> > > > > > > > > > > > > <empty>
> > > > > > > > > > > > > 
> > > > > > > > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > > > > > 1
> > > > > > > > > > > > > 1
> > > > > > > > > > > > > 1
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Demotion fallback order:
> > > > > > > > > > > > > node 0: empty
> > > > > > > > > > > > > node 1: empty
> > > > > > > > > > > > > node 2: empty
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > * Example 4:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Node 0 is a DRAM node with CPU.
> > > > > > > > > > > > > Node 1 is a PMEM node.
> > > > > > > > > > > > > Node 2 is a GPU node.
> > > > > > > > > > > > > 
> > > > > > > > > > > > >                   50
> > > > > > > > > > > > >   Node 0 (DRAM)  ----  Node 2 (GPU)
> > > > > > > > > > > > >          \                 /
> > > > > > > > > > > > >           \ 30            / 60
> > > > > > > > > > > > >            \             /
> > > > > > > > > > > > >              Node 1 (PMEM)
> > > > > > > > > > > > > 
> > > > > > > > > > > > > node distances:
> > > > > > > > > > > > > node   0    1    2
> > > > > > > > > > > > >    0  10   30   50
> > > > > > > > > > > > >    1  30   10   60
> > > > > > > > > > > > >    2  50   60   10
> > > > > > > > > > > > > 
> > > > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > > > > 2
> > > > > > > > > > > > > 0
> > > > > > > > > > > > > 1
> > > > > > > > > > > > > 
> > > > > > > > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > > > > > 1
> > > > > > > > > > > > > 2
> > > > > > > > > > > > > 0
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Demotion fallback order:
> > > > > > > > > > > > > node 0: 1
> > > > > > > > > > > > > node 1: empty
> > > > > > > > > > > > > node 2: 0, 1
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > * Example 5:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Node 0 is a DRAM node with CPU.
> > > > > > > > > > > > > Node 1 is a GPU node.
> > > > > > > > > > > > > Node 2 is a PMEM node.
> > > > > > > > > > > > > Node 3 is a large, slow DRAM node without CPU.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > >      Node 2 (PMEM)  ----
> > > > > > > > > > > > >    /      |              \
> > > > > > > > > > > > >   /       | 30            \ 120
> > > > > > > > > > > > >  |        |         100    \
> > > > > > > > > > > > >  |   Node 0 (DRAM)  ----  Node 1 (GPU)
> > > > > > > > > > > > >   \         \                 /
> > > > > > > > > > > > >     \        \ 40            / 110
> > > > > > > > > > > > >   80  \       \             /
> > > > > > > > > > > > >         ---  Node 3 (Slow DRAM)
> > > > > > > > > > > > 
> > > > > > > > > > > > This is close but not quite what was intended for Hesham's
> > > > > > > > > > > > example... (note we just checked that Hesham's original node0-1
> > > > > > > > > > > > timing didn't make any sense.).
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > This was inspired by Hesham's example. But I should have also included
> > > > > > > > > > > the version that illustrates the need to skip a tier when demoting
> > > > > > > > > > > from certain nodes.
> > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > node distances:
> > > > > > > > > > > > > node    0    1    2    3
> > > > > > > > > > > > >    0   10  100   30   40
> > > > > > > > > > > > >    1  100   10  120  110
> > > > > > > > > > > > >    2   30  120   10   80
> > > > > > > > > > > > >    3   40  110   80   10
> > > > > > > > > > > > > 
> > > > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > > > > 1
> > > > > > > > > > > > > 0,3
> > > > > > > > > > > > > 2
> > > > > > > > > > > > > 
> > > > > > > > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > > > > > 1
> > > > > > > > > > > > > 0
> > > > > > > > > > > > > 2
> > > > > > > > > > > > > 1
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Demotion fallback order:
> > > > > > > > > > > > > node 0: 2
> > > > > > > > > > > > > node 1: 0, 3, 2
> > > > > > > > > > > > > node 2: empty
> > > > > > > > > > > > > node 3: 2
> > > > > > > > > > > > 
> > > > > > > > > > > > This is close but not quite the same as the example
> > > > > > > > > > > > Hesham gave (note the node timing 1 to 0 in the table
> > > > > > > > > > > > with that example didn't make sense).  I added another
> > > > > > > > > > > > level of switching to make the numbers more obviously
> > > > > > > > > > > > different and show how critical it might be.
> > > > > > > > > > > > 
> > > > > > > > > > > > * Example 6:
> > > > > > > > > > > > 
> > > > > > > > > > > > Node 0 is a DRAM node with CPU.
> > > > > > > > > > > > Node 1 is a GPU node.
> > > > > > > > > > > > Node 2 is a PMEM node.
> > > > > > > > > > > > Node 3 is an extremely large DRAM node without CPU.
> > > > > > > > > > > >   (Key point here being that it probably never makes sense
> > > > > > > > > > > >    to demote to anywhere else from this memory).
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > I've redone the timings wrt example 5.
> > > > > > > > > > > > Basis for this is 0 and 2 are directly connected
> > > > > > > > > > > > via controllers in an SoC. 1 and 3 are connected
> > > > > > > > > > > > via a common switch that is one more switch down
> > > > > > > > > > > > (each hop via a switch is 100)
> > > > > > > > > > > > All DRAMs cost 10 once you've reached the correct node,
> > > > > > > > > > > > and PMEM costs 30 from the SoC.
> > > > > > > > > > > > The numbers get too large as a result but meh, I'm making
> > > > > > > > > > > > a point, not providing real numbers :)
> > > > > > > > > > > > 
> > > > > > > > > > > >          PMEM Node 2
> > > > > > > > > > > >             |(30)
> > > > > > > > > > > >         CPU + DRAM Node0
> > > > > > > > > > > >             |(100)
> > > > > > > > > > > >          Switch 1
> > > > > > > > > > > >             |(100)
> > > > > > > > > > > >           Switch 2
> > > > > > > > > > > >     (100)  |      |(100)
> > > > > > > > > > > > Node 1 GPU     Node3 Large memory.
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > With one level of s
> > > > > > > > > > > > 
> > > > > > > > > > > >      Node 2 (PMEM)  ----
> > > > > > > > > > > >     /      |              \
> > > > > > > > > > > >    /       | 30            \ 330
> > > > > > > > > > > >   |        |         310    \
> > > > > > > > > > > >   |   Node 0 (DRAM)  ----  Node 1 (GPU)
> > > > > > > > > > > >    \         \                 /
> > > > > > > > > > > >      \        \ 310           / 210
> > > > > > > > > > > >    330 \       \             /
> > > > > > > > > > > >          ---  Node 3 (Extremely large DRAM)
> > > > > > > > > > > > 
> > > > > > > > > > > > To my mind, we should potentially also take into account
> > > > > > > > > > > > the fact that Node3 can be known to never contain CPUs
> > > > > > > > > > > > (in at least some architectures we know where the CPUs
> > > > > > > > > > > >  might be added later, they can't just magically turn up
> > > > > > > > > > > >  anywhere in the topology).
> > > > > > > > > > > > 
> > > > > > > > > > > > node distances:
> > > > > > > > > > > > node    0    1    2    3
> > > > > > > > > > > >     0   10   310  30   310
> > > > > > > > > > > >     1   310  10   330  210
> > > > > > > > > > > >     2   30   330  10   330
> > > > > > > > > > > >     3   310  210  330   10
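> > > > > > > > > > > > 
> > > > > > > > > > > > Working a few of the entries out from that basis:
> > > > > > > > > > > > 0 -> 1: 3 hops x 100 + 10 (DRAM)     = 310
> > > > > > > > > > > > 1 -> 3: 2 hops x 100 + 10 (DRAM)     = 210
> > > > > > > > > > > > 2 -> 1: 30 (PMEM from SoC) + 3 x 100 = 330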
> > > > > > > > > > > > 
> > > > > > > > > > > > So, my ideal would treat node 3 differently from other DRAM nodes
> > > > > > > > > > > > as we know it can't have CPUs. Trying to come up with an
> > > > > > > > > > > > always correct order for nodes 3 and 2 is tricky as it to a certain
> > > > > > > > > > > > extent depends on capacity. If node 2 was big enough to take
> > > > > > > > > > > > any demotion from node 0 and still have lots of room, then demoting
> > > > > > > > > > > > there from node 3 would make sense and vice versa.
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > >  $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > > >  1
> > > > > > > > > > > >  0
> > > > > > > > > > > >  2
> > > > > > > > > > > >  3
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > >  $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > > > >   1
> > > > > > > > > > > >   0
> > > > > > > > > > > >   2
> > > > > > > > > > > >   3
> > > > > > > > > > > > 
> > > > > > > > > > > >  Demotion fallback order:
> > > > > > > > > > > >  node 0: 2, 3
> > > > > > > > > > > >  node 1: 3, 0, 2 (key being we will almost always have less pressure on node 3)
> > > > > > > > > > > >  node 2: 3
> > > > > > > > > > > >  node 3: empty
> > > > > > > > > > > > 
> > > > > > > > > > > > or as Hesham just pointed out this can be done with 3 tiers,
> > > > > > > > > > > > because we can put the GPU and CPU in the same tier, as
> > > > > > > > > > > > there is little reason to demote from one to the other.
> > > > > > > > > > > 
> > > > > > > > > > > Thank you for the example.  It makes sense to me to have node 3 on its
> > > > > > > > > > > own tier.  We can have either 3 tiers or 4 tiers in total (assuming
> > > > > > > > > > > that the max number of tiers is a config option).
> > > > > > > > > > > 
> > > > > > > > > > > > We are also a bit worried about ABI backwards compatibility because
> > > > > > > > > > > > of potential need to make more space in tiers lower in number than
> > > > > > > > > > > > CPU attached DDR. I rather liked the negative proposal with
> > > > > > > > > > > > default as 0 that Huang, Ying made.
> > > > > > > > > > > 
> > > > > > > > > > > It is hard to have negative values as the device IDs.
> > > > > > > > > > > 
> > > > > > > > > > > The current proposal equals the tier device ID to the tier hierarchy
> > > > > > > > > > > level, which makes the interface simpler, but less flexible.  How
> > > > > > > > > > > about the following proposal (which decouples the tier device ID from
> > > > > > > > > > > the tier level)?
> > > > > > > > > > > 
> > > > > > > > > > > /sys/devices/system/memtier/memtierN/nodelist
> > > > > > > > > > > /sys/devices/system/memtier/memtierN/rank
> > > > > > > > > > > 
> > > > > > > > > > > Each memory tier N has two sysfs files:
> > > > > > > > > > > - nodelist: the nodes that are in this tier
> > > > > > > > > > > - rank: an opaque value that helps decide the level at which this tier
> > > > > > > > > > > is in the tier hierarchy (smaller value means faster tier)
> > > > > > > > > > > 
> > > > > > > > > > > The tier hierarchy is determined by "rank", not by the device id
> > > > > > > > > > > number N from "memtierN".
> > > > > > > > > > > 
> > > > > > > > > > > The absolute value of "rank" of a memtier doesn't necessarily carry
> > > > > > > > > > > any meaning. Its value relative to other memtiers decides the level of
> > > > > > > > > > > this memtier in the tier hierarchy.
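> > > > > > > > > > > 
> > > > > > > > > > > To make this concrete, a minimal kernel-side sketch (illustrative
> > > > > > > > > > > only, all names made up) could keep the memtier devices on a list
> > > > > > > > > > > sorted by rank and derive the tier order by walking that list:
> > > > > > > > > > > 
> > > > > > > > > > > #include <linux/device.h>
> > > > > > > > > > > #include <linux/list.h>
> > > > > > > > > > > #include <linux/nodemask.h>
> > > > > > > > > > > 
> > > > > > > > > > > struct memory_tier {
> > > > > > > > > > >         struct device dev;      /* memtierN */
> > > > > > > > > > >         int rank;               /* opaque; smaller == faster tier */
> > > > > > > > > > >         nodemask_t nodes;       /* the "nodelist" of this tier */
> > > > > > > > > > >         struct list_head list;
> > > > > > > > > > > };
> > > > > > > > > > > 
> > > > > > > > > > > /* Sorted by rank, ascending: the first entry is the fastest tier. */
> > > > > > > > > > > static LIST_HEAD(memory_tiers);
> > > > > > > > > > > 
> > > > > > > > > > > static void insert_memory_tier(struct memory_tier *tier)
> > > > > > > > > > > {
> > > > > > > > > > >         struct memory_tier *t;
> > > > > > > > > > > 
> > > > > > > > > > >         list_for_each_entry(t, &memory_tiers, list) {
> > > > > > > > > > >                 if (tier->rank < t->rank) {
> > > > > > > > > > >                         /* insert just before the first tier
> > > > > > > > > > >                          * with a larger rank */
> > > > > > > > > > >                         list_add_tail(&tier->list, &t->list);
> > > > > > > > > > >                         return;
> > > > > > > > > > >                 }
> > > > > > > > > > >         }
> > > > > > > > > > >         list_add_tail(&tier->list, &memory_tiers);
> > > > > > > > > > > }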
> > > > > > > > > > > 
> > > > > > > > > > > The CPU-attached DRAM nodes are always in memtier0 (the device ID),
> > > > > > > > > > > but memtier0 may not always be the top-tier, e.g. its level can be 3
> > > > > > > > > > > in a 5-tier system.
> > > > > > > > > > > 
> > > > > > > > > > > For the above example (example 6), we can have:
> > > > > > > > > > > 
> > > > > > > > > > > $ ls /sys/devices/system/memtier
> > > > > > > > > > > memtier0
> > > > > > > > > > > memtier1
> > > > > > > > > > > memtier2
> > > > > > > > > > > memtier128
> > > > > > > > > > > 
> > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/rank
> > > > > > > > > > > 50
> > > > > > > > > > > 60
> > > > > > > > > > > 70
> > > > > > > > > > > 10
> > > > > > > > > > 
> > > > > > > > > > I understand that the device ID cannot be negative.  So we have to use
> > > > > > > > > > rank.  Can we make it possible to allow "rank" to be negative?
> > > > > > > > > 
> > > > > > > > > It is possible to allow "rank" to be negative, though I think all
> > > > > > > > > positive values should work equally well.
> > > > > > > > > 
> > > > > > > > > > Another choice is to do some trick on the device ID.  For example, the CPU-
> > > > > > > > > > attached DRAM nodes are always memtier100 (the device ID).  Then we can
> > > > > > > > > > have memtier99, memtier100, memtier101, memtier102, ....  That's not
> > > > > > > > > > perfect either.
> > > > > > > > > 
> > > > > > > > > If we go with the device ID tricks, one approach is to use sub-device IDs:
> > > > > > > > > 
> > > > > > > > > - There are 3 major tiers: tier0 (e.g. GPU), tier1 (e.g.DRAM) and
> > > > > > > > > tier2 (e.g. PMEM).
> > > > > > > > > 
> > > > > > > > > - Each major tier can have minor tiers, e.g. tier0.0, tier1.0,
> > > > > > > > > tier1.1, tier2.0, tier2.1.
> > > > > > > > > 
> > > > > > > > > The earlier 4-tier example can be represented as:
> > > > > > > > > 
> > > > > > > > > memtier0.0 -> memtier1.0 -> memtier2.0 -> memtier2.1
> > > > > > > > > 
> > > > > > > > > We can also omit .0 so that the tiers are:
> > > > > > > > > 
> > > > > > > > > memtier0 -> memtier1 -> memtier2 -> memtier2.1
> > > > > > > > > 
> > > > > > > > > This should be flexible enough to support multiple tiers while keeping
> > > > > > > > > the tier IDs relatively stable.
> > > > > > > > > 
> > > > > > > > > It is not as flexible as the rank approach. For example, to insert a
> > > > > > > > > new tier between 2.0 and 2.1, we need to add a tier 2.2 and reassign
> > > > > > > > > existing nodes to these 3 tiers.  Using "rank", we can insert a new
> > > > > > > > > tier and only move desired nodes into the new tier.
> > > > > > > > > 
> > > > > > > > > What do you think?
> > > > > > > > 
> > > > > > > > The rank approach looks better.  And if we stick with the device ID
> > > > > > > > rule as follows,
> > > > > > > > 
> > > > > > > > ...
> > > > > > > > 255     GPU
> > > > > > > > 0       DRAM
> > > > > > > > 1       PMEM
> > > > > > > > 2
> > > > > > > > ...
> > > > > > > > 
> > > > > > > > 255 is -1 for "s8".
> > > > > > > > 
> > > > > > > > The device ID should do most tricks at least now.  The rank can provide
> > > > > > > > more flexibility in the future.  We can even go without rank in the
> > > > > > > > first version, and introduce it when it's necessary.
> > > > > > > 
> > > > > > > Given that the "rank" approach is generally favored, let's go with
> > > > > > > that to avoid compatibility issues that may come from the switch of
> > > > > > > device ID tricks to ranks.
> > > > > > 
> > > > > > OK.  Just to confirm.  Does this mean that we will have fixed device ID,
> > > > > > for example,
> > > > > > 
> > > > > > GPU                     memtier255
> > > > > > DRAM (with CPU)         memtier0
> > > > > > PMEM                    memtier1
> > > > > > 
> > > > > > When we add a new memtier, it can be memtier254, or memtier2?  The rank
> > > > > > value will determine the real demotion order.
> > > > > 
> > > > > With the rank approach, the device ID numbering should be flexible and
> > > > > not mandated by the proposal.
> > > > 
> > > > If so, the rank number will be fixed?  For example,
> > > > 
> > > > GPU                     100
> > > > DRAM (with CPU)         200
> > > > PMEM                    300
> > > > 
> > > > When we add a new memtier, its rank can be 50, 150, 250, or 400?
> > > > 
> > > > If so, this makes me wonder why we don't just make this kind of rank the
> > > > device ID?  Or have I missed something?
> > > > 
> > > > Or, are both device IDs and rank values not fixed?  Why do we need that
> > > > kind of flexibility?  Sorry, I may not understand all the requirements.
> > > 
> > > Even though the proposal doesn't mandate a particular device ID
> > > numbering, I expect that the device IDs will be relatively stable once
> > > a kernel implementation is chosen. For example, it is likely that DRAM
> > > nodes with CPUs will always be on memtier1, no matter how many tiers
> > > are higher or lower than these nodes.
> > > 
> > > We don't need to mandate a particular way to assign the rank values,
> > > either.  What matters is the relative order and some reasonable gap
> > > between these values.
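> > > 
> > > For instance (purely illustrative, not values the proposal mandates),
> > > the default ranks could be spaced out along these lines:
> > > 
> > > /* Illustrative defaults only; the gaps leave room for future tiers. */
> > > #define MEMTIER_RANK_HBM_GPU    100
> > > #define MEMTIER_RANK_DRAM       200     /* CPU-attached DRAM nodes */
> > > #define MEMTIER_RANK_PMEM       300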
> > > 
> > > The rank approach allows us to keep memtier device IDs relatively
> > > stable even though we may change the tier ordering among them.  Its
> > > flexibility can have many other uses as well.  For example, we can
> > > insert a new memtier into the tier hierarchy for a new set of nodes
> > > without affecting the node assignment of any existing memtier,
> > > provided that there is enough gap in the rank values for the new
> > > memtier.
> > > 
> > > Using the rank value directly as the device ID has some disadvantages:
> > > - It is kind of unconventional to number devices in this way.
> > > - We cannot assign DRAM nodes with CPUs a specific memtier device
> > > ID (even though this is not mandated by the "rank" proposal, I expect
> > > the device will likely always be memtier1 in practice).
> > > - It is possible that we may eventually allow the rank value to be
> > > modified as a way to adjust the tier ordering.  We cannot do that
> > > easily for device IDs.
> > 
> > OK.  I can understand that sometimes it's more natural to change the
> > order of a set of nodes with the same memory type (and data plane path)
> > together instead of changing them one by one for each node.
> > 
> > It appears that the memtierX device becomes a kind of memory type (with
> > the data plane path considered for latency/throughput too).  We can assign
> > a memory type to a node, and change the order between memory types.  If
> > so, we need to allow multiple memtiers to have the same rank value.
> 
> Jonathan mentioned this feature that multiple memtiers share the same
> rank as well.  It can be a convenient feature to have.  For
> simplicity, it should be fine to leave out this feature initially.

OK.  What do you think about the concept of memory types?  You have
mentioned that in the memtierX directory, we can put latency/throughput,
etc.  IMHO, these only make sense for one type of memory.  And it's
natural for all memory nodes onlined by a driver to be of the same memory
type.  That is, drivers (including firmware drivers) will register memory
types and put nodes into them.  Based on the memory types, "rank" (related
to, for example, latency) determines the real memory tiers.
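
To make the idea concrete, a rough sketch of what such a driver-facing
interface could look like (all names below are invented for illustration;
nothing like this exists today):

#include <linux/nodemask.h>

/* Hypothetical interface, for illustration only. */
struct memory_type {
        const char *name;       /* e.g. "hbm", "dram", "cxl-dram", "pmem" */
        int rank;               /* smaller == faster; decides the tier */
        nodemask_t nodes;       /* all nodes onlined with this type */
};

/* A driver (or firmware code) registers the kind of memory it onlines. */
struct memory_type *register_memory_type(const char *name, int rank);

/* Called for each NUMA node that the driver onlines. */
int memory_type_add_node(struct memory_type *type, int nid);

/*
 * e.g. in a PMEM driver's probe path:
 *
 *      type = register_memory_type("pmem", 300);
 *      memory_type_add_node(type, dev_to_node(dev));
 */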

If you think it's a good idea, we can rename memtierX to memory_typeX.
But memory type may not be a good name: DRAM in a local memory controller
and DRAM attached via remote CXL may have quite different performance
metrics.  Or memory_class, to avoid the possible confusion?

Best Regards,
Huang, Ying

> > Best Regards,
> > Huang, Ying
> > 
> > > > 
> > > > > > I think you may need to send v3 to make sure everyone is on the same
> > > > > > page.
> > > > > 
> > > > > Will do it shortly.
> > > > 
> > > > Good!  Thanks!
> > > > 
> > > > Best Regards,
> > > > Huang, Ying
> > > > 
> > > > > > Best Regards,
> > > > > > Huang, Ying
> > > > > > 
> > > > > > > > Best Regards,
> > > > > > > > Huang, Ying
> > > > > > > > 
> > > > > > > > > > > The tier order: memtier128 -> memtier0 -> memtier1 -> memtier2
> > > > > > > > > > > 
> > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > > 0
> > > > > > > > > > > 2
> > > > > > > > > > > 3
> > > > > > > > > > > 1
> > > > > > > > > > > 
> > > > > > > > > > > $ ls -l /sys/devices/system/node/node*/memtier
> > > > > > > > > > > /sys/devices/system/node/node0/memtier -> /sys/devices/system/memtier/memtier0
> > > > > > > > > > > /sys/devices/system/node/node1/memtier -> /sys/devices/system/memtier/memtier128
> > > > > > > > > > > /sys/devices/system/node/node2/memtier -> /sys/devices/system/memtier/memtier1
> > > > > > > > > > > /sys/devices/system/node/node3/memtier -> /sys/devices/system/memtier/memtier2
> > > > > > > > > > > 
> > > > > > > > > > > To override the memory tier of a node, we can use a new, write-only,
> > > > > > > > > > > per-node interface file:
> > > > > > > > > > > 
> > > > > > > > > > > /sys/devices/system/node/nodeN/set_memtier
> > > > > > > > > > > 
> > > > > > > > > > > e.g.
> > > > > > > > > > > 
> > > > > > > > > > > $ echo "memtier128" > /sys/devices/system/node/node1/set_memtier
> > > > > > > > > > 
> > > > > > > > > > I prefer the original proposal to make nodeX/memtier a normal file to
> > > > > > > > > > hold the memtier device ID instead of a link.
> > > > > > > > > 
> > > > > > > > > OK. We don't have to use a symlink.
> > > > > > > > > 
> > > > > > > > > > Best Regards,
> > > > > > > > > > Huang, Ying
> > > > > > > > > > 
> > > > > > > > > > > Any comments?
> > > > > > > > > > > 
> > > > > > > > > > > > Jonathan
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > 
> > > > > > 
> > > > 
> > > > 
> > 
> > 



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-25 17:27                 ` Wei Xu
@ 2022-05-26  9:32                   ` Jonathan Cameron
  2022-05-26 20:30                     ` Wei Xu
  2022-05-27  9:26                   ` Aneesh Kumar K V
  1 sibling, 1 reply; 47+ messages in thread
From: Jonathan Cameron @ 2022-05-26  9:32 UTC (permalink / raw)
  To: Wei Xu
  Cc: Aneesh Kumar K V, Ying Huang, Andrew Morton, Greg Thelen,
	Yang Shi, Linux Kernel Mailing List, Jagdish Gediya,
	Michal Hocko, Tim C Chen, Dave Hansen, Alistair Popple,
	Baolin Wang, Feng Tang, Davidlohr Bueso, Dan Williams,
	David Rientjes, Linux MM, Brice Goglin, Hesham Almatary

On Wed, 25 May 2022 10:27:42 -0700
Wei Xu <weixugc@google.com> wrote:

> On Wed, May 25, 2022 at 3:01 AM Aneesh Kumar K V
> <aneesh.kumar@linux.ibm.com> wrote:
> >
> > On 5/25/22 2:33 PM, Ying Huang wrote:  
> > > On Tue, 2022-05-24 at 22:32 -0700, Wei Xu wrote:  
> > >> On Tue, May 24, 2022 at 1:24 AM Ying Huang <ying.huang@intel.com> wrote:  
> > >>>
> > >>> On Tue, 2022-05-24 at 00:04 -0700, Wei Xu wrote:  
> > >>>> On Thu, May 19, 2022 at 8:06 PM Ying Huang <ying.huang@intel.com> wrote:  
> > >>>>>  
> >
> > ...
> >  
> > >
> > > OK.  Just to confirm.  Does this mean that we will have fixed device ID,
> > > for example,
> > >
> > > GPU                   memtier255
> > > DRAM (with CPU)               memtier0
> > > PMEM                  memtier1
> > >
> > > When we add a new memtier, it can be memtier254, or memtier2?  The rank
> > > value will determine the real demotion order.
> > >
> > > I think you may need to send v3 to make sure everyone is on the same
> > > page.
> > >  
> >
> > What we have implemented which we will send as RFC shortly is below.
> >
> > kvaneesh@ubuntu-guest:~$ cd /sys/devices/system/
> > kvaneesh@ubuntu-guest:/sys/devices/system$ pwd
> > /sys/devices/system
> > kvaneesh@ubuntu-guest:/sys/devices/system$ ls
> > clockevents  clocksource  container  cpu  edac  memory  memtier  mpic
> > node  power
> > kvaneesh@ubuntu-guest:/sys/devices/system$ cd memtier/
> > kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ pwd
> > /sys/devices/system/memtier
> > kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ ls
> > default_rank  max_rank  memtier1  power  uevent
> > kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ cat default_rank
> > 1
> > kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ cat max_rank
> > 3  
> 
> For flexibility, we don't want max_rank to be interpreted as the
> number of memory tiers.  Also, we want to leave spaces in rank values
> to allow new memtiers to be inserted when needed.  So I'd suggest to
> make max_rank a much larger value (e.g. 255).
> 
> > kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ cd memtier1/
> > kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ ls
> > nodelist  power  rank  subsystem  uevent
> > kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ cat nodelist
> > 0-3
> > kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ cat rank
> > 1
> > kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ cd
> > ../../node/node1/
> > kvaneesh@ubuntu-guest:/sys/devices/system/node/node1$ cat memtier
> > 1
> > kvaneesh@ubuntu-guest:/sys/devices/system/node/node1$
> > root@ubuntu-guest:/sys/devices/system/node/node1# echo 0 > memtier
> > root@ubuntu-guest:/sys/devices/system/node/node1# cat memtier
> > 0
> > root@ubuntu-guest:/sys/devices/system/node/node1# cd ../../memtier/
> > root@ubuntu-guest:/sys/devices/system/memtier# ls
> > default_rank  max_rank  memtier0  memtier1  power  uevent
> > root@ubuntu-guest:/sys/devices/system/memtier# cd memtier0/
> > root@ubuntu-guest:/sys/devices/system/memtier/memtier0# cat nodelist
> > 1
> > root@ubuntu-guest:/sys/devices/system/memtier/memtier0# cat rank
> > 0  
> 
> It looks like the example here demonstrates the dynamic creation of
> memtier0.  If so, how is the rank of memtier0 determined?  If we want
> to support creating new memtiers at runtime, I think an explicit
> interface that specifies both device ID and rank is preferred to avoid
> implicit dependencies between device IDs and ranks.

Why make device ID explicit - it's meaningless I think?
How about a creation interface that is simply writing the rank value
to create a new one?  The only race I can see would be to get
two parallel attempts to create a new tier with the same rank.
That seems unlikely to matter unless we support changing rank later.
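
As a rough sketch of what that could look like (the attribute and helper
names here are invented, not real code), a lock would also close the
duplicate-rank race just mentioned:

#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/mutex.h>

/* Sketch: a write-only "create_tier" attribute that takes a rank value. */
static DEFINE_MUTEX(memtier_lock);

static ssize_t create_tier_store(struct device *dev,
                                 struct device_attribute *attr,
                                 const char *buf, size_t count)
{
        int rank, ret;

        ret = kstrtoint(buf, 10, &rank);
        if (ret)
                return ret;

        mutex_lock(&memtier_lock);
        if (find_memtier_by_rank(rank))         /* hypothetical helper */
                ret = -EEXIST;
        else
                ret = memtier_create(rank);     /* hypothetical helper,
                                                 * returns 0 or -errno */
        mutex_unlock(&memtier_lock);

        return ret ? ret : count;
}
static DEVICE_ATTR_WO(create_tier);

Userspace would then only ever write a rank value (the kernel picks the
device ID itself), which is all the information a new tier really needs.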

Two attempts to create the same device ID tier seems more likely to
cause fiddly races.

Jonathan


> 
> > root@ubuntu-guest:/sys/devices/system/memtier/memtier0# echo 4 > rank
> > bash: rank: Permission denied
> > root@ubuntu-guest:/sys/devices/system/memtier/memtier0#  


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-26  9:32                   ` Jonathan Cameron
@ 2022-05-26 20:30                     ` Wei Xu
  0 siblings, 0 replies; 47+ messages in thread
From: Wei Xu @ 2022-05-26 20:30 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Aneesh Kumar K V, Ying Huang, Andrew Morton, Greg Thelen,
	Yang Shi, Linux Kernel Mailing List, Jagdish Gediya,
	Michal Hocko, Tim C Chen, Dave Hansen, Alistair Popple,
	Baolin Wang, Feng Tang, Davidlohr Bueso, Dan Williams,
	David Rientjes, Linux MM, Brice Goglin, Hesham Almatary

On Thu, May 26, 2022 at 2:32 AM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Wed, 25 May 2022 10:27:42 -0700
> Wei Xu <weixugc@google.com> wrote:
>
> > On Wed, May 25, 2022 at 3:01 AM Aneesh Kumar K V
> > <aneesh.kumar@linux.ibm.com> wrote:
> > >
> > > On 5/25/22 2:33 PM, Ying Huang wrote:
> > > > On Tue, 2022-05-24 at 22:32 -0700, Wei Xu wrote:
> > > >> On Tue, May 24, 2022 at 1:24 AM Ying Huang <ying.huang@intel.com> wrote:
> > > >>>
> > > >>> On Tue, 2022-05-24 at 00:04 -0700, Wei Xu wrote:
> > > >>>> On Thu, May 19, 2022 at 8:06 PM Ying Huang <ying.huang@intel.com> wrote:
> > > >>>>>
> > >
> > > ...
> > >
> > > >
> > > > OK.  Just to confirm.  Does this mean that we will have fixed device ID,
> > > > for example,
> > > >
> > > > GPU                   memtier255
> > > > DRAM (with CPU)               memtier0
> > > > PMEM                  memtier1
> > > >
> > > > When we add a new memtier, it can be memtier254, or memtier2?  The rank
> > > > value will determine the real demotion order.
> > > >
> > > > I think you may need to send v3 to make sure everyone is on the same
> > > > page.
> > > >
> > >
> > > What we have implemented which we will send as RFC shortly is below.
> > >
> > > kvaneesh@ubuntu-guest:~$ cd /sys/devices/system/
> > > kvaneesh@ubuntu-guest:/sys/devices/system$ pwd
> > > /sys/devices/system
> > > kvaneesh@ubuntu-guest:/sys/devices/system$ ls
> > > clockevents  clocksource  container  cpu  edac  memory  memtier  mpic
> > > node  power
> > > kvaneesh@ubuntu-guest:/sys/devices/system$ cd memtier/
> > > kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ pwd
> > > /sys/devices/system/memtier
> > > kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ ls
> > > default_rank  max_rank  memtier1  power  uevent
> > > kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ cat default_rank
> > > 1
> > > kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ cat max_rank
> > > 3
> >
> > For flexibility, we don't want max_rank to be interpreted as the
> > number of memory tiers.  Also, we want to leave spaces in rank values
> > to allow new memtiers to be inserted when needed.  So I'd suggest to
> > make max_rank a much larger value (e.g. 255).
> >
> > > kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ cd memtier1/
> > > kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ ls
> > > nodelist  power  rank  subsystem  uevent
> > > kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ cat nodelist
> > > 0-3
> > > kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ cat rank
> > > 1
> > > kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ cd
> > > ../../node/node1/
> > > kvaneesh@ubuntu-guest:/sys/devices/system/node/node1$ cat memtier
> > > 1
> > > kvaneesh@ubuntu-guest:/sys/devices/system/node/node1$
> > > root@ubuntu-guest:/sys/devices/system/node/node1# echo 0 > memtier
> > > root@ubuntu-guest:/sys/devices/system/node/node1# cat memtier
> > > 0
> > > root@ubuntu-guest:/sys/devices/system/node/node1# cd ../../memtier/
> > > root@ubuntu-guest:/sys/devices/system/memtier# ls
> > > default_rank  max_rank  memtier0  memtier1  power  uevent
> > > root@ubuntu-guest:/sys/devices/system/memtier# cd memtier0/
> > > root@ubuntu-guest:/sys/devices/system/memtier/memtier0# cat nodelist
> > > 1
> > > root@ubuntu-guest:/sys/devices/system/memtier/memtier0# cat rank
> > > 0
> >
> > It looks like the example here demonstrates the dynamic creation of
> > memtier0.  If so, how is the rank of memtier0 determined?  If we want
> > to support creating new memtiers at runtime, I think an explicit
> > interface that specifies both device ID and rank is preferred to avoid
> > implicit dependencies between device IDs and ranks.
>
> Why make device ID explicit - it's meaningless I think?
> How about a creation interface that is simply writing the rank value
> to create a new one?  The only race I can see would be to get
> two parallel attempts to create a new tier with the same rank.
> That seems unlikely to matter unless we support changing rank later.
>
> Two attempts to create the same device ID tier seems more likely to
> cause fiddly races.

That's right: Device ID is not needed when creating a new memtier. It
should be enough to provide only a rank value.

> Jonathan
>
>
> >
> > > root@ubuntu-guest:/sys/devices/system/memtier/memtier0# echo 4 > rank
> > > bash: rank: Permission denied
> > > root@ubuntu-guest:/sys/devices/system/memtier/memtier0#
>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-26  7:39                         ` Ying Huang
@ 2022-05-26 20:55                           ` Wei Xu
  2022-05-27  9:10                             ` Jonathan Cameron
  0 siblings, 1 reply; 47+ messages in thread
From: Wei Xu @ 2022-05-26 20:55 UTC (permalink / raw)
  To: Ying Huang
  Cc: Jonathan Cameron, Andrew Morton, Greg Thelen, Aneesh Kumar K.V,
	Yang Shi, Linux Kernel Mailing List, Jagdish Gediya,
	Michal Hocko, Tim C Chen, Dave Hansen, Alistair Popple,
	Baolin Wang, Feng Tang, Davidlohr Bueso, Dan Williams,
	David Rientjes, Linux MM, Brice Goglin, Hesham Almatary

On Thu, May 26, 2022 at 12:39 AM Ying Huang <ying.huang@intel.com> wrote:
>
> On Thu, 2022-05-26 at 00:08 -0700, Wei Xu wrote:
> > On Wed, May 25, 2022 at 11:55 PM Ying Huang <ying.huang@intel.com> wrote:
> > >
> > > On Wed, 2022-05-25 at 20:53 -0700, Wei Xu wrote:
> > > > On Wed, May 25, 2022 at 6:10 PM Ying Huang <ying.huang@intel.com> wrote:
> > > > >
> > > > > On Wed, 2022-05-25 at 08:36 -0700, Wei Xu wrote:
> > > > > > On Wed, May 25, 2022 at 2:03 AM Ying Huang <ying.huang@intel.com> wrote:
> > > > > > >
> > > > > > > On Tue, 2022-05-24 at 22:32 -0700, Wei Xu wrote:
> > > > > > > > On Tue, May 24, 2022 at 1:24 AM Ying Huang <ying.huang@intel.com> wrote:
> > > > > > > > >
> > > > > > > > > On Tue, 2022-05-24 at 00:04 -0700, Wei Xu wrote:
> > > > > > > > > > On Thu, May 19, 2022 at 8:06 PM Ying Huang <ying.huang@intel.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Wed, 2022-05-18 at 00:09 -0700, Wei Xu wrote:
> > > > > > > > > > > > On Thu, May 12, 2022 at 8:00 AM Jonathan Cameron
> > > > > > > > > > > > <Jonathan.Cameron@huawei.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, 11 May 2022 23:22:11 -0700
> > > > > > > > > > > > > Wei Xu <weixugc@google.com> wrote:
> > > > > > > > > > > > > > The current kernel has the basic memory tiering support: Inactive
> > > > > > > > > > > > > > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > > > > > > > > > > > > > tier NUMA node to make room for new allocations on the higher tier
> > > > > > > > > > > > > > NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> > > > > > > > > > > > > > migrated (promoted) to a higher tier NUMA node to improve the
> > > > > > > > > > > > > > performance.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > In the current kernel, memory tiers are defined implicitly via a
> > > > > > > > > > > > > > demotion path relationship between NUMA nodes, which is created during
> > > > > > > > > > > > > > the kernel initialization and updated when a NUMA node is hot-added or
> > > > > > > > > > > > > > hot-removed.  The current implementation puts all nodes with CPU into
> > > > > > > > > > > > > > the top tier, and builds the tier hierarchy tier-by-tier by establishing
> > > > > > > > > > > > > > the per-node demotion targets based on the distances between nodes.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This current memory tier kernel interface needs to be improved for
> > > > > > > > > > > > > > several important use cases:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > * The current tier initialization code always initializes
> > > > > > > > > > > > > >   each memory-only NUMA node into a lower tier.  But a memory-only
> > > > > > > > > > > > > >   NUMA node may have a high performance memory device (e.g. a DRAM
> > > > > > > > > > > > > >   device attached via CXL.mem or a DRAM-backed memory-only node on
> > > > > > > > > > > > > >   a virtual machine) and should be put into a higher tier.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > * The current tier hierarchy always puts CPU nodes into the top
> > > > > > > > > > > > > >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> > > > > > > > > > > > > >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> > > > > > > > > > > > > >   with CPUs are better to be placed into the next lower tier.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > * Also because the current tier hierarchy always puts CPU nodes
> > > > > > > > > > > > > >   into the top tier, when a CPU is hot-added (or hot-removed) and
> > > > > > > > > > > > > >   triggers a memory node from CPU-less into a CPU node (or vice
> > > > > > > > > > > > > >   versa), the memory tier hierarchy gets changed, even though no
> > > > > > > > > > > > > >   memory node is added or removed.  This can make the tier
> > > > > > > > > > > > > >   hierarchy unstable and make it difficult to support tier-based
> > > > > > > > > > > > > >   memory accounting.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > * A higher tier node can only be demoted to selected nodes on the
> > > > > > > > > > > > > >   next lower tier as defined by the demotion path, not any other
> > > > > > > > > > > > > >   node from any lower tier.  This strict, hard-coded demotion order
> > > > > > > > > > > > > >   does not work in all use cases (e.g. some use cases may want to
> > > > > > > > > > > > > >   allow cross-socket demotion to another node in the same demotion
> > > > > > > > > > > > > >   tier as a fallback when the preferred demotion node is out of
> > > > > > > > > > > > > >   space), and has resulted in the feature request for an interface to
> > > > > > > > > > > > > >   override the system-wide, per-node demotion order from the
> > > > > > > > > > > > > >   userspace.  This demotion order is also inconsistent with the page
> > > > > > > > > > > > > >   allocation fallback order when all the nodes in a higher tier are
> > > > > > > > > > > > > >   out of space: The page allocation can fall back to any node from
> > > > > > > > > > > > > >   any lower tier, whereas the demotion order doesn't allow that.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > * There are no interfaces for the userspace to learn about the memory
> > > > > > > > > > > > > >   tier hierarchy in order to optimize its memory allocations.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I'd like to propose revised memory tier kernel interfaces based on
> > > > > > > > > > > > > > the discussions in the threads:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> > > > > > > > > > > > > > - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
> > > > > > > > > > > > > > - https://lore.kernel.org/linux-mm/867bc216386eb6cbf54648f23e5825830f5b922e.camel@intel.com/T/
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > High-level Design Ideas
> > > > > > > > > > > > > > =======================
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > * Define memory tiers explicitly, not implicitly.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > * Memory tiers are defined based on hardware capabilities of memory
> > > > > > > > > > > > > >   nodes, not their relative node distances between each other.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > * The tier assignment of each node is independent from each other.
> > > > > > > > > > > > > >   Moving a node from one tier to another tier doesn't affect the tier
> > > > > > > > > > > > > >   assignment of any other node.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > * The node-tier association is stable. A node can be reassigned to a
> > > > > > > > > > > > > >   different tier only under the specific conditions that don't block
> > > > > > > > > > > > > >   future tier-based memory cgroup accounting.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > * A node can demote its pages to any nodes of any lower tiers. The
> > > > > > > > > > > > > >   demotion target node selection follows the allocation fallback order
> > > > > > > > > > > > > >   of the source node, which is built based on node distances.  The
> > > > > > > > > > > > > >   demotion targets are also restricted to only the nodes from the tiers
> > > > > > > > > > > > > >   lower than the source node.  We no longer need to maintain a separate
> > > > > > > > > > > > > >   per-node demotion order (node_demotion[]).
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi Wei,
> > > > > > > > > > > > >
> > > > > > > > > > > > > This proposal looks good to me, though we'll be having fun
> > > > > > > > > > > > > white boarding topologies from our roadmaps for the next few days :)
> > > > > > > > > > > >
> > > > > > > > > > > > That's good to hear.
> > > > > > > > > > > >
> > > > > > > > > > > > > A few comments inline. It also seems likely to me that there is little
> > > > > > > > > > > > > benefit in starting with 3 tiers as the maximum.  Seems unlikely the
> > > > > > > > > > > > > code will be substantially simpler for 3 than it would be for 4 or 5.
> > > > > > > > > > > > > I've drawn out one simple case that needs 4 to do sensible things.
> > > > > > > > > > > >
> > > > > > > > > > > > We can make the number of tiers a config option. 3 tiers are just what
> > > > > > > > > > > > the kernel can reasonably initialize when there isn't enough hardware
> > > > > > > > > > > > performance information from the firmware.
> > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Sysfs Interfaces
> > > > > > > > > > > > > > ================
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > * /sys/devices/system/memtier/memtierN/nodelist
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   where N = 0, 1, 2 (the kernel supports only 3 tiers for now).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   Format: node_list
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   Read-only.  When read, list the memory nodes in the specified tier.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   Tier 0 is the highest tier, while tier 2 is the lowest tier.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   The absolute value of a tier id number has no specific meaning.
> > > > > > > > > > > > > >   What matters is the relative order of the tier id numbers.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   When a memory tier has no nodes, the kernel can hide its memtier
> > > > > > > > > > > > > >   sysfs files.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > * /sys/devices/system/node/nodeN/memtier
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   where N = 0, 1, ...
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   Format: int or empty
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   When read, list the memory tier that the node belongs to.  Its value
> > > > > > > > > > > > > >   is empty for a CPU-only NUMA node.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   When written, the kernel moves the node into the specified memory
> > > > > > > > > > > > > >   tier if the move is allowed.  The tier assignment of all other nodes
> > > > > > > > > > > > > >   are not affected.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   Initially, we can make this interface read-only.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Kernel Representation
> > > > > > > > > > > > > > =====================
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > * All memory tiering code is guarded by CONFIG_TIERED_MEMORY.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > * #define MAX_MEMORY_TIERS 3
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   Support 3 memory tiers for now.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > * #define MEMORY_DEFAULT_TIER 1
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   The default tier that a memory node is assigned to.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > * nodemask_t memory_tiers[MAX_MEMORY_TIERS]
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   Store memory nodes by tiers.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > * int node_tier_map[MAX_NUMNODES]
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   Map a node to its tier.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   For each CPU-only node c, node_tier_map[c] = -1.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Memory Tier Initialization
> > > > > > > > > > > > > > ==========================
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > By default, all memory nodes are assigned to the default tier
> > > > > > > > > > > > > > (MEMORY_DEFAULT_TIER).
> > > > > > > > > > > > >
> > > > > > > > > > > > > This is tighter than it needs to be.  In many cases we can easily
> > > > > > > > > > > > > establish if there is any possibility of CPU being hotplugged into
> > > > > > > > > > > > > a memory node.  If it's CXL attached no way CPUs are going to be
> > > > > > > > > > > > > > turning up there later :)  If CPU HP into a given node can't happen
> > > > > > > > > > > > > we can be more flexible and I think that often results in better decisions.
> > > > > > > > > > > > > See example below, though obviously I could just use the userspace
> > > > > > > > > > > > > interface to fix that up anyway or have a CXL driver move it around
> > > > > > > > > > > > > if that's relevant.  In some other cases I'm fairly sure we know in
> > > > > > > > > > > > > advance where CPUs can be added but I'd need to check all the
> > > > > > > > > > > > > relevant specs to be sure there aren't any corner cases.  I 'think'
> > > > > > > > > > > > > for ARM for example we know where all possible CPUs can be hotplugged
> > > > > > > > > > > > > (constraint coming from the interrupt controller + the fact that only
> > > > > > > > > > > > > virtual CPU HP is defined).
> > > > > > > > > > > >
> > > > > > > > > > > > We may not always want to put a CXL-attached memory device into a
> > > > > > > > > > > > slower tier because even though CXL does add some additional latency,
> > > > > > > > > > > > both the memory device and CXL can still be very capable in
> > > > > > > > > > > > performance and may not be much slower (if any) than the on-board DRAM
> > > > > > > > > > > > (e.g. DRAM from a remote CPU socket).
> > > > > > > > > > > >
> > > > > > > > > > > > Also, the default tier here is just the initial tier assignment of
> > > > > > > > > > > > each node, which behaves as if there were no tiering.  A tiering
> > > > > > > > > > > > kernel init function can certainly reassign the tier for each node if
> > > > > > > > > > > > it knows enough about the hardware performance for these nodes from
> > > > > > > > > > > > the firmware.
> > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > A device driver can move up or down its memory nodes from the default
> > > > > > > > > > > > > > tier.  For example, PMEM can move down its memory nodes below the
> > > > > > > > > > > > > > default tier, whereas GPU can move up its memory nodes above the
> > > > > > > > > > > > > > default tier.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The kernel initialization code makes the decision on which exact tier
> > > > > > > > > > > > > > a memory node should be assigned to based on the requests from the
> > > > > > > > > > > > > > device drivers as well as the memory device hardware information
> > > > > > > > > > > > > > provided by the firmware.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Memory Tier Reassignment
> > > > > > > > > > > > > > ========================
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > After a memory node is hot-removed, it can be hot-added back to a
> > > > > > > > > > > > > > different memory tier.  This is useful for supporting dynamically
> > > > > > > > > > > > > > provisioned CXL.mem NUMA nodes, which may connect to different
> > > > > > > > > > > > > > memory devices across hot-plug events.  Such tier changes should
> > > > > > > > > > > > > > be compatible with tier-based memory accounting.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The userspace may also reassign an existing online memory node to a
> > > > > > > > > > > > > > different tier.  However, this should only be allowed when no pages
> > > > > > > > > > > > > > are allocated from the memory node or when there are no non-root
> > > > > > > > > > > > > > memory cgroups (e.g. during the system boot).  This restriction is
> > > > > > > > > > > > > > important for keeping memory tier hierarchy stable enough for
> > > > > > > > > > > > > > tier-based memory cgroup accounting.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Memory Allocation for Demotion
> > > > > > > > > > > > > > ==============================
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > To allocate a new page as the demotion target for a page, the kernel
> > > > > > > > > > > > > > calls the allocation function (__alloc_pages_nodemask) with the
> > > > > > > > > > > > > > source page node as the preferred node and the union of all lower
> > > > > > > > > > > > > > tier nodes as the allowed nodemask.  The actual target node selection
> > > > > > > > > > > > > > then follows the allocation fallback order that the kernel has
> > > > > > > > > > > > > > already defined.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The pseudo code looks like:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >     targets = NODE_MASK_NONE;
> > > > > > > > > > > > > >     src_nid = page_to_nid(page);
> > > > > > > > > > > > > >     src_tier = node_tier_map[src_nid];
> > > > > > > > > > > > > >     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
> > > > > > > > > > > > > >             nodes_or(targets, targets, memory_tiers[i]);
> > > > > > > > > > > > > >     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The memopolicy of cpuset, vma and owner task of the source page can
> > > > > > > > > > > > > > be set to refine the demotion target nodemask, e.g. to prevent
> > > > > > > > > > > > > > demotion or select a particular allowed node as the demotion target.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Memory Allocation for Promotion
> > > > > > > > > > > > > > ===============================
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The page allocation for promotion is similar to demotion, except that (1)
> > > > > > > > > > > > > > the target nodemask uses the promotion tiers, (2) the preferred node can
> > > > > > > > > > > > > > be the accessing CPU node, not the source page node.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Examples
> > > > > > > > > > > > > > ========
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > ...
> > > > > > > > > > > > >
> > > > > > > > > > > > > > * Example 3:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Node2 is drawn as pmem.
> > > > > > > > > > > >
> > > > > > > > > > > > Typo. Good catch.
> > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > All nodes are in the same tier.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >                   20
> > > > > > > > > > > > > >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> > > > > > > > > > > > > >          \                 /
> > > > > > > > > > > > > >           \ 30            / 30
> > > > > > > > > > > > > >            \             /
> > > > > > > > > > > > > >              Node 2 (PMEM)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > node distances:
> > > > > > > > > > > > > > node   0    1    2
> > > > > > > > > > > > > >    0  10   20   30
> > > > > > > > > > > > > >    1  20   10   30
> > > > > > > > > > > > > >    2  30   30   10
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > > > > > <empty>
> > > > > > > > > > > > > > 0-2
> > > > > > > > > > > > > > <empty>
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > > > > > > 1
> > > > > > > > > > > > > > 1
> > > > > > > > > > > > > > 1
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Demotion fallback order:
> > > > > > > > > > > > > > node 0: empty
> > > > > > > > > > > > > > node 1: empty
> > > > > > > > > > > > > > node 2: empty
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > * Example 4:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Node 0 is a DRAM node with CPU.
> > > > > > > > > > > > > > Node 1 is a PMEM node.
> > > > > > > > > > > > > > Node 2 is a GPU node.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >                   50
> > > > > > > > > > > > > >   Node 0 (DRAM)  ----  Node 2 (GPU)
> > > > > > > > > > > > > >          \                 /
> > > > > > > > > > > > > >           \ 30            / 60
> > > > > > > > > > > > > >            \             /
> > > > > > > > > > > > > >              Node 1 (PMEM)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > node distances:
> > > > > > > > > > > > > > node   0    1    2
> > > > > > > > > > > > > >    0  10   30   50
> > > > > > > > > > > > > >    1  30   10   60
> > > > > > > > > > > > > >    2  50   60   10
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > > > > > 2
> > > > > > > > > > > > > > 0
> > > > > > > > > > > > > > 1
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > > > > > > 1
> > > > > > > > > > > > > > 2
> > > > > > > > > > > > > > 0
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Demotion fallback order:
> > > > > > > > > > > > > > node 0: 1
> > > > > > > > > > > > > > node 1: empty
> > > > > > > > > > > > > > node 2: 0, 1
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > * Example 5:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Node 0 is a DRAM node with CPU.
> > > > > > > > > > > > > > Node 1 is a GPU node.
> > > > > > > > > > > > > > Node 2 is a PMEM node.
> > > > > > > > > > > > > > Node 3 is a large, slow DRAM node without CPU.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >      Node 2 (PMEM)  ----
> > > > > > > > > > > > > >    /      |              \
> > > > > > > > > > > > > >   /       | 30            \ 120
> > > > > > > > > > > > > >  |        |         100    \
> > > > > > > > > > > > > >  |   Node 0 (DRAM)  ----  Node 1 (GPU)
> > > > > > > > > > > > > >   \         \                 /
> > > > > > > > > > > > > >     \        \ 40            / 110
> > > > > > > > > > > > > >   80  \       \             /
> > > > > > > > > > > > > >         ---  Node 3 (Slow DRAM)
> > > > > > > > > > > > >
> > > > > > > > > > > > > This is close but not quite what was intended for Hesham's
> > > > > > > > > > > > > example... (note we just checked that Hesham's original node0-1
> > > > > > > > > > > > > timing didn't make any sense.).
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > This was inspired by Hesham's example. But I should have also included
> > > > > > > > > > > > the version that illustrates the need to skip a tier when demoting
> > > > > > > > > > > > from certain nodes.
> > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > node distances:
> > > > > > > > > > > > > > node    0    1    2    3
> > > > > > > > > > > > > >    0   10  100   30   40
> > > > > > > > > > > > > >    1  100   10  120  110
> > > > > > > > > > > > > >    2   30  120   10   80
> > > > > > > > > > > > > >    3   40  110   80   10
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > > > > > 1
> > > > > > > > > > > > > > 0,3
> > > > > > > > > > > > > > 2
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > > > > > > 1
> > > > > > > > > > > > > > 0
> > > > > > > > > > > > > > 2
> > > > > > > > > > > > > > 1
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Demotion fallback order:
> > > > > > > > > > > > > > node 0: 2
> > > > > > > > > > > > > > node 1: 0, 3, 2
> > > > > > > > > > > > > > node 2: empty
> > > > > > > > > > > > > > node 3: 2
> > > > > > > > > > > > >
> > > > > > > > > > > > > This is close but not quite the same as the example
> > > > > > > > > > > > > Hesham gave (note the node timing 1 to 0 in the table
> > > > > > > > > > > > > with that example didn't make sense).  I added another
> > > > > > > > > > > > > level of switching to make the numbers more obviously
> > > > > > > > > > > > > different and show how critical it might be.
> > > > > > > > > > > > >
> > > > > > > > > > > > > * Example 6:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Node 0 is a DRAM node with CPU.
> > > > > > > > > > > > > Node 1 is a GPU node.
> > > > > > > > > > > > > Node 2 is a PMEM node.
> > > > > > > > > > > > > Node 3 is an extremely large, DRAM node without CPU.
> > > > > > > > > > > > >   (Key point here being that it probably never makes sense
> > > > > > > > > > > > >    to demote to anywhere else from this memory).
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > I've redone the timings wrt to example 5.
> > > > > > > > > > > > > Basis for this is 0 and 2 are directly connected
> > > > > > > > > > > > > via controllers in an SoC. 1 and 3 are connected
> > > > > > > > > > > > > via a common switch, one further switch down
> > > > > > > > > > > > > (each hop via this is 100)
> > > > > > > > > > > > > All drams cost 10 once you've reached correct node
> > > > > > > > > > > > > and pmem costs 30 from SoC.
> > > > > > > > > > > > > Numbers get too large as a result but meh, I'm making
> > > > > > > > > > > > > a point not providing real numbers :)
> > > > > > > > > > > > >
> > > > > > > > > > > > >          PMEM Node 2
> > > > > > > > > > > > >             |(30)
> > > > > > > > > > > > >         CPU + DRAM Node0
> > > > > > > > > > > > >             |(100)
> > > > > > > > > > > > >          Switch 1
> > > > > > > > > > > > >             |(100)
> > > > > > > > > > > > >           Switch 2
> > > > > > > > > > > > >     (100)  |      |(100)
> > > > > > > > > > > > > Node 1 GPU     Node3 Large memory.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > With one level of s
> > > > > > > > > > > > >
> > > > > > > > > > > > >      Node 2 (PMEM)  ----
> > > > > > > > > > > > >     /      |              \
> > > > > > > > > > > > >    /       | 30            \ 330
> > > > > > > > > > > > >   |        |         310    \
> > > > > > > > > > > > >   |   Node 0 (DRAM)  ----  Node 1 (GPU)
> > > > > > > > > > > > >    \         \                 /
> > > > > > > > > > > > >      \        \ 310           / 210
> > > > > > > > > > > > >    330 \       \             /
> > > > > > > > > > > > >          ---  Node 3 (Extremely large DRAM)
> > > > > > > > > > > > >
> > > > > > > > > > > > > To my mind, we should potentially also take into account
> > > > > > > > > > > > > the fact that Node3 can be known to never contain CPUs
> > > > > > > > > > > > > (in at least some architectures we know where the CPUs
> > > > > > > > > > > > >  might be added later, they can't just magically turn up
> > > > > > > > > > > > >  anywhere in the topology).
> > > > > > > > > > > > >
> > > > > > > > > > > > > node distances:
> > > > > > > > > > > > > node    0    1    2    3
> > > > > > > > > > > > >     0   10   310  30   310
> > > > > > > > > > > > >     1   310  10   330  210
> > > > > > > > > > > > >     2   30   330  10   330
> > > > > > > > > > > > >     3   310  210  330   10
> > > > > > > > > > > > >
> > > > > > > > > > > > > So, my ideal would treat node 3 differently from other DRAM nodes
> > > > > > > > > > > > > as we know it can't have CPUs. Trying to come up with an
> > > > > > > > > > > > > always correct order for nodes 3 and 2 is tricky as it to a certain
> > > > > > > > > > > > > extent depends on capacity. If node 2 was big enough to take
> > > > > > > > > > > > > any demotion from node 0 and still have lots of room, then demoting
> > > > > > > > > > > > > there from node 3 would make sense and vice versa.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >  $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > > > >  1
> > > > > > > > > > > > >  0
> > > > > > > > > > > > >  2
> > > > > > > > > > > > >  3
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >  $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > > > > >   1
> > > > > > > > > > > > >   0
> > > > > > > > > > > > >   2
> > > > > > > > > > > > >   3
> > > > > > > > > > > > >
> > > > > > > > > > > > >  Demotion fallback order:
> > > > > > > > > > > > >  node 0: 2, 3
> > > > > > > > > > > > >  node 1: 3, 0, 2 (key being we will almost always have less pressure on node 3)
> > > > > > > > > > > > >  node 2: 3
> > > > > > > > > > > > >  node 3: empty
> > > > > > > > > > > > >
> > > > > > > > > > > > > or as Hesham just pointed out this can be done with 3 tiers,
> > > > > > > > > > > > > because we can put the GPU and CPU in the same tier, as
> > > > > > > > > > > > > there is little reason to demote from one to the other.
> > > > > > > > > > > >
> > > > > > > > > > > > Thank you for the example.  It makes sense to me to have node 3 on its
> > > > > > > > > > > > own tier.  We can have either 3 tiers or 4 tiers in total (assuming
> > > > > > > > > > > > that the max number of tiers is a config option).
> > > > > > > > > > > >
> > > > > > > > > > > > > We are also a bit worried about ABI backwards compatibility because
> > > > > > > > > > > > > of potential need to make more space in tiers lower in number than
> > > > > > > > > > > > > CPU attached DDR. I rather liked the negative proposal with
> > > > > > > > > > > > > default as 0 that Huang, Ying made.
> > > > > > > > > > > >
> > > > > > > > > > > > It is hard to have negative values as the device IDs.
> > > > > > > > > > > >
> > > > > > > > > > > > The current proposal equals the tier device ID to the tier hierarchy
> > > > > > > > > > > > level, which makes the interface simpler, but less flexible.  How
> > > > > > > > > > > > about the following proposal (which decouples the tier device ID from
> > > > > > > > > > > > the tier level)?
> > > > > > > > > > > >
> > > > > > > > > > > > /sys/devices/system/memtier/memtierN/nodelist
> > > > > > > > > > > > /sys/devices/system/memtier/memtierN/rank
> > > > > > > > > > > >
> > > > > > > > > > > > Each memory tier N has two sysfs files:
> > > > > > > > > > > > - nodelist: the nodes that are in this tier
> > > > > > > > > > > > - rank: an opaque value that helps decide the level at which this tier
> > > > > > > > > > > > is in the tier hierarchy (smaller value means faster tier)
> > > > > > > > > > > >
> > > > > > > > > > > > The tier hierarchy is determined by "rank", not by the device id
> > > > > > > > > > > > number N from "memtierN".
> > > > > > > > > > > >
> > > > > > > > > > > > The absolute value of "rank" of a memtier doesn't necessarily carry
> > > > > > > > > > > > any meaning. Its value relative to other memtiers decides the level of
> > > > > > > > > > > > this memtier in the tier hierarchy.
> > > > > > > > > > > >
> > > > > > > > > > > > The CPU-attached DRAM nodes are always in memtier0 (the device ID),
> > > > > > > > > > > > but memtier0 may not always be the top-tier, e.g. its level can be 3
> > > > > > > > > > > > in a 5-tier system.
> > > > > > > > > > > >
> > > > > > > > > > > > For the above example (example 6), we can have:
> > > > > > > > > > > >
> > > > > > > > > > > > $ ls /sys/devices/system/memtier
> > > > > > > > > > > > memtier0
> > > > > > > > > > > > memtier1
> > > > > > > > > > > > memtier2
> > > > > > > > > > > > memtier128
> > > > > > > > > > > >
> > > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/rank
> > > > > > > > > > > > 50
> > > > > > > > > > > > 60
> > > > > > > > > > > > 70
> > > > > > > > > > > > 10
> > > > > > > > > > >
> > > > > > > > > > > I understand that the device ID cannot be negative.  So we have to use
> > > > > > > > > > > rank.  Can we make it possible to allow "rank" to be negative?
> > > > > > > > > >
> > > > > > > > > > It is possible to allow "rank" to be negative, though I think all
> > > > > > > > > > positive values should work equally well.
> > > > > > > > > >
> > > > > > > > > > > Another choice is to do some trick on device ID.  For example, the CPU-
> > > > > > > > > > > attached DRAM nodes are always memtier100 (the device ID).  Then we can
> > > > > > > > > > > have memtier99, memtier100, memtier101, memtier102, ....  That's not
> > > > > > > > > > > perfect either.
> > > > > > > > > >
> > > > > > > > > > If we go with the device ID tricks, one approach is to use sub-device IDs:
> > > > > > > > > >
> > > > > > > > > > - There are 3 major tiers: tier0 (e.g. GPU), tier1 (e.g. DRAM) and
> > > > > > > > > > tier2 (e.g. PMEM).
> > > > > > > > > >
> > > > > > > > > > - Each major tier can have minor tiers, e.g. tier0.0, tier1.0,
> > > > > > > > > > tier1.1, tier2.0, tier2.1.
> > > > > > > > > >
> > > > > > > > > > The earlier 4-tier example can be represented as:
> > > > > > > > > >
> > > > > > > > > > memtier0.0 -> memtier1.0 -> memtier2.0 -> memtier2.1
> > > > > > > > > >
> > > > > > > > > > We can also omit .0 so that the tiers are:
> > > > > > > > > >
> > > > > > > > > > memtier0 -> memtier1 -> memtier2 -> memtier2.1
> > > > > > > > > >
> > > > > > > > > > This should be flexible enough to support multiple tiers while keeping
> > > > > > > > > > the tier IDs relatively stable.
> > > > > > > > > >
> > > > > > > > > > It is not as flexible as the rank approach. For example, to insert a
> > > > > > > > > > new tier between 2.0 and 2.1, we need to add a tier 2.2 and reassign
> > > > > > > > > > existing nodes to these 3 tiers.  Using "rank", we can insert a new
> > > > > > > > > > tier and only move desired nodes into the new tier.
> > > > > > > > > >
> > > > > > > > > > What do you think?
> > > > > > > > >
> > > > > > > > > The rank approach looks better for me.  And if we stick with the device ID
> > > > > > > > > rule as follows,
> > > > > > > > >
> > > > > > > > > ...
> > > > > > > > > 255     GPU
> > > > > > > > > 0       DRAM
> > > > > > > > > 1       PMEM
> > > > > > > > > 2
> > > > > > > > > ...
> > > > > > > > >
> > > > > > > > > 255 is -1 for "s8".
> > > > > > > > >
> > > > > > > > > The device ID should do most tricks at least now.  The rank can provide
> > > > > > > > > more flexibility in the future.  We can even go without rank in the
> > > > > > > > > first version, and introduce it when it's necessary.
> > > > > > > >
> > > > > > > > Given that the "rank" approach is generally favored, let's go with
> > > > > > > > that to avoid compatibility issues that may come from the switch of
> > > > > > > > device ID tricks to ranks.
> > > > > > >
> > > > > > > OK.  Just to confirm.  Does this mean that we will have fixed device ID,
> > > > > > > for example,
> > > > > > >
> > > > > > > GPU                     memtier255
> > > > > > > DRAM (with CPU)         memtier0
> > > > > > > PMEM                    memtier1
> > > > > > >
> > > > > > > When we add a new memtier, it can be memtier254, or memtier2?  The rank
> > > > > > > value will determine the real demotion order.
> > > > > >
> > > > > > With the rank approach, the device ID numbering should be flexible and
> > > > > > not mandated by the proposal.
> > > > >
> > > > > If so, the rank number will be fixed?  For example,
> > > > >
> > > > > GPU                     100
> > > > > DRAM (with CPU)         200
> > > > > PMEM                    300
> > > > >
> > > > > When we add a new memtier, its rank can be 50, 150, 250, or 400?
> > > > >
> > > > > If so, this makes me think why we don't just make this kind of rank the
> > > > > device ID?  Or I missed something?
> > > > >
> > > > > Or, both device IDs and rank values are not fixed?  Why do we need that
> > > > > kind of flexibility?  Sorry, I may not understand all requirements.
> > > >
> > > > Even though the proposal doesn't mandate a particular device ID
> > > > numbering, I expect that the device IDs will be relatively stable once
> > > > a kernel implementation is chosen. For example, it is likely that DRAM
> > > > nodes with CPUs will always be on memtier1, no matter how many tiers
> > > > are higher or lower than these nodes.
> > > >
> > > > We don't need to mandate a particular way to assign the rank values,
> > > > either.  What matters is the relative order and some reasonable gap
> > > > between these values.
> > > >
> > > > The rank approach allows us to keep memtier device IDs relatively
> > > > stable even though we may change the tier ordering among them.  Its
> > > > flexibility can have many other uses as well.  For example, we can
> > > > insert a new memtier into the tier hierarchy for a new set of nodes
> > > > without affecting the node assignment of any existing memtier,
> > > > provided that there is enough gap in the rank values for the new
> > > > memtier.
> > > >
> > > > Using the rank value directly as the device ID has some disadvantages:
> > > > - It is kind of unconventional to number devices in this way.
> > > > - We cannot assign a specific memtier device ID to DRAM nodes with
> > > > CPUs (even though this is not mandated by the "rank" proposal, I expect
> > > > the device will likely always be memtier1 in practice).
> > > > - It is possible that we may eventually allow the rank value to be
> > > > modified as a way to adjust the tier ordering.  We cannot do that
> > > > easily for device IDs.
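
(To make the rank/device-ID split concrete, here is a rough kernel-side
sketch in the same pseudo-code spirit as the proposal; the struct and
helper names below are purely illustrative and not from an actual patch.)

#include <linux/list.h>
#include <linux/nodemask.h>

struct memtier {
        int dev_id;             /* N in "memtierN"; stable device ID       */
        int rank;               /* opaque key; smaller rank == faster tier */
        nodemask_t nodes;       /* what the "nodelist" file would show     */
        struct list_head list;  /* global tier list, kept sorted by rank   */
};

/*
 * Keep the tier list sorted by rank so that walking it yields the tier
 * order (e.g. memtier128 -> memtier0 -> memtier1 -> memtier2 for ranks
 * 10 -> 50 -> 60 -> 70), independent of the device ID numbering.
 */
static void memtier_insert_sorted(struct list_head *tiers, struct memtier *tier)
{
        struct memtier *t;

        list_for_each_entry(t, tiers, list)
                if (tier->rank < t->rank)
                        break;
        list_add_tail(&tier->list, &t->list);
}

Inserting a new tier or adjusting a rank only reorders this list; the
memtierN device IDs and the nodes' tier assignments stay untouched.
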
> > >
> > > OK.  I can understand that sometimes it's more natural to change the
> > > order of a set of nodes with the same memory type (and data plane path)
> > > together instead of changing them one by one for each node.
> > >
> > > It appears that the memtierX device becomes a kind of memory type (with
> > > the data plane path considered for latency/throughput too).  We can
> > > assign a memory type to a node, and change the order between memory
> > > types.  If so, we need to allow multiple memtiers to have the same rank
> > > value.
> >
> > Jonathan mentioned this feature that multiple memtiers share the same
> > rank as well.  It can be a convenient feature to have.  For
> > simplicity, it should be fine to leave out this feature initially.
>
> OK.  What do you think about the concept of memory types?  You have
> mentioned that in the memtierX directory, we can put latency/throughput,
> etc.  IMHO, these only make sense for one type of memory.  And it's
> natural for all memory nodes onlined by a driver to be of the same memory
> type.

I think this is not always true. For example, a dax kmem driver can
online both pmem and non-pmem dax devices as system memory.

> That is, drivers (including firmware drivers) will register memory types
> and put nodes into them.  Based on memory types, "rank" (related to, for
> example, latency) determines the real memory tiers.
>
> If you think it's a good idea, we can rename memtierX to memory_typeX.
> But memory type may not be a good name; DRAM behind a local memory
> controller and DRAM behind remote CXL may have quite different performance
> metrics.  Or memory_class to avoid the possible confusion?

Memory types (e.g. GPU, DRAM, PMEM, etc) can be useful information to
help initialize the memory tiers of NUMA nodes. But I think memory
type is not a substitute for memory tier.  We still need to define
memory tiers on top of NUMA node groups based on memory types (for
example, some may want to group GPU and DRAM into the same tier,
others may want separate tiers for GPU/DRAM).  It is simpler to keep
the sysfs interface to just memory tiers and implement memory types as
internal device attributes if needed.

To avoid confusion, we can require that the rank value is unique for
each memtier device.  This should make it clear that each memtier
device represents a distinct memory tier.  We can still put
latency/throughput values into each memtierN directory.  Such values
need to be specified as a range to better accommodate possibly varied
performance of the devices within the same memory tier.
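
As a quick illustration of how userspace could consume the proposed
files (note that /sys/devices/system/memtier/ does not exist in any
released kernel yet; this is only a sketch against the interface as
proposed here), a small tool can order the tiers by rank instead of by
device ID:

#include <glob.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct tier {
        char name[64];
        long rank;
        char nodelist[256];
};

/* Read the first line of dir/file into buf; return 0 on success. */
static int read_line(const char *dir, const char *file, char *buf, size_t len)
{
        char path[512];
        FILE *f;

        snprintf(path, sizeof(path), "%s/%s", dir, file);
        f = fopen(path, "r");
        if (!f)
                return -1;
        if (!fgets(buf, len, f))
                buf[0] = '\0';
        fclose(f);
        buf[strcspn(buf, "\n")] = '\0';
        return 0;
}

static int cmp_rank(const void *a, const void *b)
{
        const struct tier *x = a, *y = b;

        return (x->rank > y->rank) - (x->rank < y->rank);
}

int main(void)
{
        struct tier *tiers;
        char buf[256];
        glob_t g;
        size_t i;

        if (glob("/sys/devices/system/memtier/memtier*", 0, NULL, &g))
                return 1;
        tiers = calloc(g.gl_pathc, sizeof(*tiers));
        for (i = 0; i < g.gl_pathc; i++) {
                snprintf(tiers[i].name, sizeof(tiers[i].name), "%s",
                         strrchr(g.gl_pathv[i], '/') + 1);
                if (!read_line(g.gl_pathv[i], "rank", buf, sizeof(buf)))
                        tiers[i].rank = strtol(buf, NULL, 10);
                read_line(g.gl_pathv[i], "nodelist", tiers[i].nodelist,
                          sizeof(tiers[i].nodelist));
        }
        qsort(tiers, g.gl_pathc, sizeof(*tiers), cmp_rank);
        for (i = 0; i < g.gl_pathc; i++)
                printf("rank %-4ld %-12s nodes %s\n", tiers[i].rank,
                       tiers[i].name, tiers[i].nodelist);
        free(tiers);
        globfree(&g);
        return 0;
}

On the example 6 layout above this would print memtier128, memtier0,
memtier1 and memtier2 in that order (ranks 10, 50, 60, 70).
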

> Best Regards,
> Huang, Ying
>
> > > Best Regards,
> > > Huang, Ying
> > >
> > > > >
> > > > > > > I think you may need to send v3 to make sure everyone is at the same
> > > > > > > page.
> > > > > >
> > > > > > Will do it shortly.
> > > > >
> > > > > Good!  Thanks!
> > > > >
> > > > > Best Regards,
> > > > > Huang, Ying
> > > > >
> > > > > > > Best Regards,
> > > > > > > Huang, Ying
> > > > > > >
> > > > > > > > > Best Regards,
> > > > > > > > > Huang, Ying
> > > > > > > > >
> > > > > > > > > > > > The tier order: memtier128 -> memtier0 -> memtier1 -> memtier2
> > > > > > > > > > > >
> > > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > > > 0
> > > > > > > > > > > > 2
> > > > > > > > > > > > 3
> > > > > > > > > > > > 1
> > > > > > > > > > > >
> > > > > > > > > > > > $ ls -l /sys/devices/system/node/node*/memtier
> > > > > > > > > > > > /sys/devices/system/node/node0/memtier -> /sys/devices/system/memtier/memtier0
> > > > > > > > > > > > /sys/devices/system/node/node1/memtier -> /sys/devices/system/memtier/memtier128
> > > > > > > > > > > > /sys/devices/system/node/node2/memtier -> /sys/devices/system/memtier/memtier1
> > > > > > > > > > > > /sys/devices/system/node/node3/memtier -> /sys/devices/system/memtier/memtier2
> > > > > > > > > > > >
> > > > > > > > > > > > To override the memory tier of a node, we can use a new, write-only,
> > > > > > > > > > > > per-node interface file:
> > > > > > > > > > > >
> > > > > > > > > > > > /sys/devices/system/node/nodeN/set_memtier
> > > > > > > > > > > >
> > > > > > > > > > > > e.g.
> > > > > > > > > > > >
> > > > > > > > > > > > $ echo "memtier128" > sys/devices/system/node/node1/set_memtier
> > > > > > > > > > >
> > > > > > > > > > > I prefer the original proposal to make nodeX/memtier a normal file to
> > > > > > > > > > > hold memtier devicde ID instead of a link.
> > > > > > > > > >
> > > > > > > > > > OK. We don't have to use a symlink.
> > > > > > > > > >
> > > > > > > > > > > Best Regards,
> > > > > > > > > > > Huang, Ying
> > > > > > > > > > >
> > > > > > > > > > > > Any comments?
> > > > > > > > > > > >
> > > > > > > > > > > > > Jonathan
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > > >
> > > > >
> > > > >
> > >
> > >
>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-26 20:55                           ` Wei Xu
@ 2022-05-27  9:10                             ` Jonathan Cameron
  2022-05-30  6:54                               ` Ying Huang
  0 siblings, 1 reply; 47+ messages in thread
From: Jonathan Cameron @ 2022-05-27  9:10 UTC (permalink / raw)
  To: Wei Xu
  Cc: Ying Huang, Andrew Morton, Greg Thelen, Aneesh Kumar K.V,
	Yang Shi, Linux Kernel Mailing List, Jagdish Gediya,
	Michal Hocko, Tim C Chen, Dave Hansen, Alistair Popple,
	Baolin Wang, Feng Tang, Davidlohr Bueso, Dan Williams,
	David Rientjes, Linux MM, Brice Goglin, Hesham Almatary

On Thu, 26 May 2022 13:55:39 -0700
Wei Xu <weixugc@google.com> wrote:

> On Thu, May 26, 2022 at 12:39 AM Ying Huang <ying.huang@intel.com> wrote:
> >
> > On Thu, 2022-05-26 at 00:08 -0700, Wei Xu wrote:  
> > > On Wed, May 25, 2022 at 11:55 PM Ying Huang <ying.huang@intel.com> wrote:  
> > > >
> > > > On Wed, 2022-05-25 at 20:53 -0700, Wei Xu wrote:  
> > > > > On Wed, May 25, 2022 at 6:10 PM Ying Huang <ying.huang@intel.com> wrote:  
> > > > > >
> > > > > > On Wed, 2022-05-25 at 08:36 -0700, Wei Xu wrote:  
> > > > > > > On Wed, May 25, 2022 at 2:03 AM Ying Huang <ying.huang@intel.com> wrote:  
> > > > > > > >
> > > > > > > > On Tue, 2022-05-24 at 22:32 -0700, Wei Xu wrote:  
> > > > > > > > > On Tue, May 24, 2022 at 1:24 AM Ying Huang <ying.huang@intel.com> wrote:  
> > > > > > > > > >
> > > > > > > > > > On Tue, 2022-05-24 at 00:04 -0700, Wei Xu wrote:  
> > > > > > > > > > > On Thu, May 19, 2022 at 8:06 PM Ying Huang <ying.huang@intel.com> wrote:  
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, 2022-05-18 at 00:09 -0700, Wei Xu wrote:  
> > > > > > > > > > > > > On Thu, May 12, 2022 at 8:00 AM Jonathan Cameron
> > > > > > > > > > > > > <Jonathan.Cameron@huawei.com> wrote:  
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, 11 May 2022 23:22:11 -0700
> > > > > > > > > > > > > > Wei Xu <weixugc@google.com> wrote:  
> > > > > > > > > > > > > > > The current kernel has the basic memory tiering support: Inactive
> > > > > > > > > > > > > > > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > > > > > > > > > > > > > > tier NUMA node to make room for new allocations on the higher tier
> > > > > > > > > > > > > > > NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> > > > > > > > > > > > > > > migrated (promoted) to a higher tier NUMA node to improve the
> > > > > > > > > > > > > > > performance.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > In the current kernel, memory tiers are defined implicitly via a
> > > > > > > > > > > > > > > demotion path relationship between NUMA nodes, which is created during
> > > > > > > > > > > > > > > the kernel initialization and updated when a NUMA node is hot-added or
> > > > > > > > > > > > > > > hot-removed.  The current implementation puts all nodes with CPU into
> > > > > > > > > > > > > > > the top tier, and builds the tier hierarchy tier-by-tier by establishing
> > > > > > > > > > > > > > > the per-node demotion targets based on the distances between nodes.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This current memory tier kernel interface needs to be improved for
> > > > > > > > > > > > > > > several important use cases:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > * The current tier initialization code always initializes
> > > > > > > > > > > > > > >   each memory-only NUMA node into a lower tier.  But a memory-only
> > > > > > > > > > > > > > >   NUMA node may have a high performance memory device (e.g. a DRAM
> > > > > > > > > > > > > > >   device attached via CXL.mem or a DRAM-backed memory-only node on
> > > > > > > > > > > > > > >   a virtual machine) and should be put into a higher tier.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > * The current tier hierarchy always puts CPU nodes into the top
> > > > > > > > > > > > > > >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> > > > > > > > > > > > > > >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> > > > > > > > > > > > > > >   with CPUs are better to be placed into the next lower tier.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > * Also because the current tier hierarchy always puts CPU nodes
> > > > > > > > > > > > > > >   into the top tier, when a CPU is hot-added (or hot-removed) and
> > > > > > > > > > > > > > >   triggers a memory node from CPU-less into a CPU node (or vice
> > > > > > > > > > > > > > >   versa), the memory tier hierarchy gets changed, even though no
> > > > > > > > > > > > > > >   memory node is added or removed.  This can make the tier
> > > > > > > > > > > > > > >   hierarchy unstable and make it difficult to support tier-based
> > > > > > > > > > > > > > >   memory accounting.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > * A higher tier node can only be demoted to selected nodes on the
> > > > > > > > > > > > > > >   next lower tier as defined by the demotion path, not any other
> > > > > > > > > > > > > > >   node from any lower tier.  This strict, hard-coded demotion order
> > > > > > > > > > > > > > >   does not work in all use cases (e.g. some use cases may want to
> > > > > > > > > > > > > > >   allow cross-socket demotion to another node in the same demotion
> > > > > > > > > > > > > > >   tier as a fallback when the preferred demotion node is out of
> > > > > > > > > > > > > > >   space), and has resulted in the feature request for an interface to
> > > > > > > > > > > > > > >   override the system-wide, per-node demotion order from the
> > > > > > > > > > > > > > >   userspace.  This demotion order is also inconsistent with the page
> > > > > > > > > > > > > > >   allocation fallback order when all the nodes in a higher tier are
> > > > > > > > > > > > > > >   out of space: The page allocation can fall back to any node from
> > > > > > > > > > > > > > >   any lower tier, whereas the demotion order doesn't allow that.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > * There are no interfaces for the userspace to learn about the memory
> > > > > > > > > > > > > > >   tier hierarchy in order to optimize its memory allocations.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I'd like to propose revised memory tier kernel interfaces based on
> > > > > > > > > > > > > > > the discussions in the threads:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> > > > > > > > > > > > > > > - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
> > > > > > > > > > > > > > > - https://lore.kernel.org/linux-mm/867bc216386eb6cbf54648f23e5825830f5b922e.camel@intel.com/T/
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > High-level Design Ideas
> > > > > > > > > > > > > > > =======================
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > * Define memory tiers explicitly, not implicitly.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > * Memory tiers are defined based on hardware capabilities of memory
> > > > > > > > > > > > > > >   nodes, not their relative node distances between each other.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > * The tier assignment of each node is independent from each other.
> > > > > > > > > > > > > > >   Moving a node from one tier to another tier doesn't affect the tier
> > > > > > > > > > > > > > >   assignment of any other node.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > * The node-tier association is stable. A node can be reassigned to a
> > > > > > > > > > > > > > >   different tier only under the specific conditions that don't block
> > > > > > > > > > > > > > >   future tier-based memory cgroup accounting.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > * A node can demote its pages to any nodes of any lower tiers. The
> > > > > > > > > > > > > > >   demotion target node selection follows the allocation fallback order
> > > > > > > > > > > > > > >   of the source node, which is built based on node distances.  The
> > > > > > > > > > > > > > >   demotion targets are also restricted to only the nodes from the tiers
> > > > > > > > > > > > > > >   lower than the source node.  We no longer need to maintain a separate
> > > > > > > > > > > > > > >   per-node demotion order (node_demotion[]).
> > > > > > > > > > > > > > >  
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Wei,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This proposal looks good to me, though we'll be having fun
> > > > > > > > > > > > > > white boarding topologies from our roadmaps for the next few days :)  
> > > > > > > > > > > > >
> > > > > > > > > > > > > That's good to hear.
> > > > > > > > > > > > >  
> > > > > > > > > > > > > > A few comments inline. It also seems likely to me that there is little
> > > > > > > > > > > > > > benefit in starting with 3 tiers as the maximum.  Seems unlikely the
> > > > > > > > > > > > > > code will be substantially simpler for 3 than it would be for 4 or 5.
> > > > > > > > > > > > > > I've drawn out one simple case that needs 4 to do sensible things.  
> > > > > > > > > > > > >
> > > > > > > > > > > > > We can make the number of tiers a config option. 3 tiers are just what
> > > > > > > > > > > > > the kernel can reasonably initialize when there isn't enough hardware
> > > > > > > > > > > > > performance information from the firmware.
> > > > > > > > > > > > >  
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Sysfs Interfaces
> > > > > > > > > > > > > > > ================
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > * /sys/devices/system/memtier/memtierN/nodelist
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   where N = 0, 1, 2 (the kernel supports only 3 tiers for now).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   Format: node_list
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   Read-only.  When read, list the memory nodes in the specified tier.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   Tier 0 is the highest tier, while tier 2 is the lowest tier.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   The absolute value of a tier id number has no specific meaning.
> > > > > > > > > > > > > > >   What matters is the relative order of the tier id numbers.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   When a memory tier has no nodes, the kernel can hide its memtier
> > > > > > > > > > > > > > >   sysfs files.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > * /sys/devices/system/node/nodeN/memtier
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   where N = 0, 1, ...
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   Format: int or empty
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   When read, list the memory tier that the node belongs to.  Its value
> > > > > > > > > > > > > > >   is empty for a CPU-only NUMA node.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   When written, the kernel moves the node into the specified memory
> > > > > > > > > > > > > > >   tier if the move is allowed.  The tier assignment of all other nodes
> > > > > > > > > > > > > > >   are not affected.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   Initially, we can make this interface read-only.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Kernel Representation
> > > > > > > > > > > > > > > =====================
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > * All memory tiering code is guarded by CONFIG_TIERED_MEMORY.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > * #define MAX_MEMORY_TIERS 3
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   Support 3 memory tiers for now.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > * #define MEMORY_DEFAULT_TIER 1
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   The default tier that a memory node is assigned to.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > * nodemask_t memory_tiers[MAX_MEMORY_TIERS]
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   Store memory nodes by tiers.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > * int node_tier_map[MAX_NUMNODES]
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   Map a node to its tier.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   For each CPU-only node c, node_tier_map[c] = -1.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Memory Tier Initialization
> > > > > > > > > > > > > > > ==========================
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > By default, all memory nodes are assigned to the default tier
> > > > > > > > > > > > > > > (MEMORY_DEFAULT_TIER).  
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This is tighter than it needs to be.  In many cases we can easily
> > > > > > > > > > > > > > establish if there is any possibility of CPU being hotplugged into
> > > > > > > > > > > > > > a memory node.  If it's CXL attached no way CPUs are going to be
> > > > > > > > > > > > > > turning up there later :)  If CPU HP into a given node can't happen
> > > > > > > > > > > > > > we can be more flexible and I think that often results in better decisions.
> > > > > > > > > > > > > > See example below, though obviously I could just use the userspace
> > > > > > > > > > > > > > interface to fix that up anyway or have a CXL driver move it around
> > > > > > > > > > > > > > if that's relevant.  In some other cases I'm fairly sure we know in
> > > > > > > > > > > > > > advance where CPUs can be added but I'd need to check all the
> > > > > > > > > > > > > > relevant specs to be sure there aren't any corner cases.  I 'think'
> > > > > > > > > > > > > > for ARM for example we know where all possible CPUs can be hotplugged
> > > > > > > > > > > > > > (constraint coming from the interrupt controller + the fact that only
> > > > > > > > > > > > > > virtual CPU HP is defined).  
> > > > > > > > > > > > >
> > > > > > > > > > > > > We may not always want to put a CXL-attached memory device into a
> > > > > > > > > > > > > slower tier because even though CXL does add some additional latency,
> > > > > > > > > > > > > both the memory device and CXL can still be very capable in
> > > > > > > > > > > > > performance and may not be much slower (if any) than the on-board DRAM
> > > > > > > > > > > > > (e.g. DRAM from a remote CPU socket).
> > > > > > > > > > > > >
> > > > > > > > > > > > > Also, the default tier here is just the initial tier assignment of
> > > > > > > > > > > > > each node, which behaves as if there were no tiering.  A tiering
> > > > > > > > > > > > > kernel init function can certainly reassign the tier for each node if
> > > > > > > > > > > > > it knows enough about the hardware performance for these nodes from
> > > > > > > > > > > > > the firmware.
> > > > > > > > > > > > >  
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > A device driver can move up or down its memory nodes from the default
> > > > > > > > > > > > > > > tier.  For example, PMEM can move down its memory nodes below the
> > > > > > > > > > > > > > > default tier, whereas GPU can move up its memory nodes above the
> > > > > > > > > > > > > > > default tier.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The kernel initialization code makes the decision on which exact tier
> > > > > > > > > > > > > > > a memory node should be assigned to based on the requests from the
> > > > > > > > > > > > > > > device drivers as well as the memory device hardware information
> > > > > > > > > > > > > > > provided by the firmware.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Memory Tier Reassignment
> > > > > > > > > > > > > > > ========================
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > After a memory node is hot-removed, it can be hot-added back to a
> > > > > > > > > > > > > > > different memory tier.  This is useful for supporting dynamically
> > > > > > > > > > > > > > > provisioned CXL.mem NUMA nodes, which may connect to different
> > > > > > > > > > > > > > > memory devices across hot-plug events.  Such tier changes should
> > > > > > > > > > > > > > > be compatible with tier-based memory accounting.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The userspace may also reassign an existing online memory node to a
> > > > > > > > > > > > > > > different tier.  However, this should only be allowed when no pages
> > > > > > > > > > > > > > > are allocated from the memory node or when there are no non-root
> > > > > > > > > > > > > > > memory cgroups (e.g. during the system boot).  This restriction is
> > > > > > > > > > > > > > > important for keeping memory tier hierarchy stable enough for
> > > > > > > > > > > > > > > tier-based memory cgroup accounting.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Memory Allocation for Demotion
> > > > > > > > > > > > > > > ==============================
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > To allocate a new page as the demotion target for a page, the kernel
> > > > > > > > > > > > > > > calls the allocation function (__alloc_pages_nodemask) with the
> > > > > > > > > > > > > > > source page node as the preferred node and the union of all lower
> > > > > > > > > > > > > > > tier nodes as the allowed nodemask.  The actual target node selection
> > > > > > > > > > > > > > > then follows the allocation fallback order that the kernel has
> > > > > > > > > > > > > > > already defined.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The pseudo code looks like:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >     targets = NODE_MASK_NONE;
> > > > > > > > > > > > > > >     src_nid = page_to_nid(page);
> > > > > > > > > > > > > > >     src_tier = node_tier_map[src_nid];
> > > > > > > > > > > > > > >     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
> > > > > > > > > > > > > > >             nodes_or(targets, targets, memory_tiers[i]);
> > > > > > > > > > > > > > >     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The memopolicy of cpuset, vma and owner task of the source page can
> > > > > > > > > > > > > > > be set to refine the demotion target nodemask, e.g. to prevent
> > > > > > > > > > > > > > > demotion or select a particular allowed node as the demotion target.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Memory Allocation for Promotion
> > > > > > > > > > > > > > > ===============================
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The page allocation for promotion is similar to demotion, except that (1)
> > > > > > > > > > > > > > > the target nodemask uses the promotion tiers, (2) the preferred node can
> > > > > > > > > > > > > > > be the accessing CPU node, not the source page node.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Examples
> > > > > > > > > > > > > > > ========
> > > > > > > > > > > > > > >  
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > ...
> > > > > > > > > > > > > >  
> > > > > > > > > > > > > > > * Example 3:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.  
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Node2 is drawn as pmem.  
> > > > > > > > > > > > >
> > > > > > > > > > > > > Typo. Good catch.
> > > > > > > > > > > > >  
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > All nodes are in the same tier.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >                   20
> > > > > > > > > > > > > > >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> > > > > > > > > > > > > > >          \                 /
> > > > > > > > > > > > > > >           \ 30            / 30
> > > > > > > > > > > > > > >            \             /
> > > > > > > > > > > > > > >              Node 2 (PMEM)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > node distances:
> > > > > > > > > > > > > > > node   0    1    2
> > > > > > > > > > > > > > >    0  10   20   30
> > > > > > > > > > > > > > >    1  20   10   30
> > > > > > > > > > > > > > >    2  30   30   10
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > > > > > > <empty>
> > > > > > > > > > > > > > > 0-2
> > > > > > > > > > > > > > > <empty>
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > > > > > > > 1
> > > > > > > > > > > > > > > 1
> > > > > > > > > > > > > > > 1
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Demotion fallback order:
> > > > > > > > > > > > > > > node 0: empty
> > > > > > > > > > > > > > > node 1: empty
> > > > > > > > > > > > > > > node 2: empty
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > * Example 4:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Node 0 is a DRAM node with CPU.
> > > > > > > > > > > > > > > Node 1 is a PMEM node.
> > > > > > > > > > > > > > > Node 2 is a GPU node.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >                   50
> > > > > > > > > > > > > > >   Node 0 (DRAM)  ----  Node 2 (GPU)
> > > > > > > > > > > > > > >          \                 /
> > > > > > > > > > > > > > >           \ 30            / 60
> > > > > > > > > > > > > > >            \             /
> > > > > > > > > > > > > > >              Node 1 (PMEM)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > node distances:
> > > > > > > > > > > > > > > node   0    1    2
> > > > > > > > > > > > > > >    0  10   30   50
> > > > > > > > > > > > > > >    1  30   10   60
> > > > > > > > > > > > > > >    2  50   60   10
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > > > > > > 2
> > > > > > > > > > > > > > > 0
> > > > > > > > > > > > > > > 1
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > > > > > > > 1
> > > > > > > > > > > > > > > 2
> > > > > > > > > > > > > > > 0
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Demotion fallback order:
> > > > > > > > > > > > > > > node 0: 1
> > > > > > > > > > > > > > > node 1: empty
> > > > > > > > > > > > > > > node 2: 0, 1
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > * Example 5:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Node 0 is a DRAM node with CPU.
> > > > > > > > > > > > > > > Node 1 is a GPU node.
> > > > > > > > > > > > > > > Node 2 is a PMEM node.
> > > > > > > > > > > > > > > Node 3 is a large, slow DRAM node without CPU.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >      Node 2 (PMEM)  ----
> > > > > > > > > > > > > > >    /      |              \
> > > > > > > > > > > > > > >   /       | 30            \ 120
> > > > > > > > > > > > > > >  |        |         100    \
> > > > > > > > > > > > > > >  |   Node 0 (DRAM)  ----  Node 1 (GPU)
> > > > > > > > > > > > > > >   \         \                 /
> > > > > > > > > > > > > > >     \        \ 40            / 110
> > > > > > > > > > > > > > >   80  \       \             /
> > > > > > > > > > > > > > >         ---  Node 3 (Slow DRAM)  
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This is close but not quite what was intended for Hesham's
> > > > > > > > > > > > > > example... (note we just checked that Hesham's original node0-1
> > > > > > > > > > > > > > timing didn't make any sense.).
> > > > > > > > > > > > > >  
> > > > > > > > > > > > >
> > > > > > > > > > > > > This was inspired by Hesham's example. But I should have also included
> > > > > > > > > > > > > the version that illustrates the need to skip a tier when demoting
> > > > > > > > > > > > > from certain nodes.
> > > > > > > > > > > > >  
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > node distances:
> > > > > > > > > > > > > > > node    0    1    2    3
> > > > > > > > > > > > > > >    0   10  100   30   40
> > > > > > > > > > > > > > >    1  100   10  120  110
> > > > > > > > > > > > > > >    2   30  120   10   80
> > > > > > > > > > > > > > >    3   40  110   80   10
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > > > > > > 1
> > > > > > > > > > > > > > > 0,3
> > > > > > > > > > > > > > > 2
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > > > > > > > 1
> > > > > > > > > > > > > > > 0
> > > > > > > > > > > > > > > 2
> > > > > > > > > > > > > > > 1
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Demotion fallback order:
> > > > > > > > > > > > > > > node 0: 2
> > > > > > > > > > > > > > > node 1: 0, 3, 2
> > > > > > > > > > > > > > > node 2: empty
> > > > > > > > > > > > > > > node 3: 2  
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This is close but not quite the same as the example
> > > > > > > > > > > > > > Hesham gave (note the node timing 1 to 0 in the table
> > > > > > > > > > > > > > with that example didn't make sense).  I added another
> > > > > > > > > > > > > > level of switching to make the numbers more obviously
> > > > > > > > > > > > > > different and show how critical it might be.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > * Example 6:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Node 0 is a DRAM node with CPU.
> > > > > > > > > > > > > > Node 1 is a GPU node.
> > > > > > > > > > > > > > Node 2 is a PMEM node.
> > > > > > > > > > > > > > Node 3 is an extremely large, DRAM node without CPU.
> > > > > > > > > > > > > >   (Key point here being that it probably never makes sense
> > > > > > > > > > > > > >    to demote to anywhere else from this memory).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I've redone the timings wrt to example 5.
> > > > > > > > > > > > > > The basis for this is that 0 and 2 are directly connected
> > > > > > > > > > > > > > via controllers in an SoC. 1 and 3 are connected
> > > > > > > > > > > > > > via a common switch one switch further down
> > > > > > > > > > > > > > (each hop via this is 100).
> > > > > > > > > > > > > > All DRAMs cost 10 once you've reached the correct node
> > > > > > > > > > > > > > and PMEM costs 30 from the SoC.
> > > > > > > > > > > > > > Numbers get too large as a result but meh, I'm making
> > > > > > > > > > > > > > a point not providing real numbers :)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >          PMEM Node 2
> > > > > > > > > > > > > >             |(30)
> > > > > > > > > > > > > >         CPU + DRAM Node0
> > > > > > > > > > > > > >             |(100)
> > > > > > > > > > > > > >          Switch 1
> > > > > > > > > > > > > >             |(100)
> > > > > > > > > > > > > >           Switch 2
> > > > > > > > > > > > > >     (100)  |      |(100)
> > > > > > > > > > > > > > Node 1 GPU     Node3 Large memory.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > With one level of switching this becomes:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >      Node 2 (PMEM)  ----
> > > > > > > > > > > > > >     /      |              \
> > > > > > > > > > > > > >    /       | 30            \ 330
> > > > > > > > > > > > > >   |        |         310    \
> > > > > > > > > > > > > >   |   Node 0 (DRAM)  ----  Node 1 (GPU)
> > > > > > > > > > > > > >    \         \                 /
> > > > > > > > > > > > > >      \        \ 310           / 210
> > > > > > > > > > > > > >    330 \       \             /
> > > > > > > > > > > > > >          ---  Node 3 (Extremely large DRAM)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > To my mind, we should potentially also take into account
> > > > > > > > > > > > > > the fact that Node3 can be known to never contain CPUs
> > > > > > > > > > > > > > (in at least some architectures we know where the CPUs
> > > > > > > > > > > > > >  might be added later, they can't just magically turn up
> > > > > > > > > > > > > >  anywhere in the topology).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > node distances:
> > > > > > > > > > > > > > node    0    1    2    3
> > > > > > > > > > > > > >     0   10   310  30   310
> > > > > > > > > > > > > >     1   310  10   330  210
> > > > > > > > > > > > > >     2   30   330  10   330
> > > > > > > > > > > > > >     3   310  210  330   10
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > So, my ideal would treat node 3 different from other dram nodes
> > > > > > > > > > > > > > as we know it can't have CPUs. Trying to come up with an
> > > > > > > > > > > > > > always correct order for nodes 3 and 2 is tricky as to a certain
> > > > > > > > > > > > > > extent it depends on capacity. If node 2 was big enough to take
> > > > > > > > > > > > > > any demotion from node 0 and still have lots of room then demoting
> > > > > > > > > > > > > > there from node 3 would make sense and vice versa.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >  $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > > > > >  1
> > > > > > > > > > > > > >  0
> > > > > > > > > > > > > >  2
> > > > > > > > > > > > > >  3
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >  $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > > > > > >   1
> > > > > > > > > > > > > >   0
> > > > > > > > > > > > > >   2
> > > > > > > > > > > > > >   3
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >  Demotion fallback order:
> > > > > > > > > > > > > >  node 0: 2, 3
> > > > > > > > > > > > > >  node 1: 3, 0, 2 (key being we will almost always have less pressure on node 3)
> > > > > > > > > > > > > >  node 2: 3
> > > > > > > > > > > > > >  node 3: empty
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > or as Hesham just pointed out this can be done with 3 tiers
> > > > > > > > > > > > > > because we can put the GPU and CPU in the same tier, as
> > > > > > > > > > > > > > there is little reason to demote from one to the other.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thank you for the example.  It makes sense to me to have node 3 on its
> > > > > > > > > > > > > own tier.  We can have either 3 tiers or 4 tiers in total (assuming
> > > > > > > > > > > > > that the max number of tiers is a config option).
> > > > > > > > > > > > >  
> > > > > > > > > > > > > > We are also a bit worried about ABI backwards compatibility because
> > > > > > > > > > > > > > of potential need to make more space in tiers lower in number than
> > > > > > > > > > > > > > CPU attached DDR. I rather liked the negative proposal with
> > > > > > > > > > > > > > default as 0 that Huang, Ying made.  
> > > > > > > > > > > > >
> > > > > > > > > > > > > It is hard to have negative values as the device IDs.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The current proposal equals the tier device ID to the tier hierarchy
> > > > > > > > > > > > > level, which makes the interface simpler, but less flexible.  How
> > > > > > > > > > > > > about the following proposal (which decouples the tier device ID from
> > > > > > > > > > > > > the tier level)?
> > > > > > > > > > > > >
> > > > > > > > > > > > > /sys/devices/system/memtier/memtierN/nodelist
> > > > > > > > > > > > > /sys/devices/system/memtier/memtierN/rank
> > > > > > > > > > > > >
> > > > > > > > > > > > > Each memory tier N has two sysfs files:
> > > > > > > > > > > > > - nodelist: the nodes that are in this tier
> > > > > > > > > > > > > - rank: an opaque value that helps decide the level at which this tier
> > > > > > > > > > > > > is in the tier hierarchy (smaller value means faster tier)
> > > > > > > > > > > > >
> > > > > > > > > > > > > The tier hierarchy is determined by "rank", not by the device id
> > > > > > > > > > > > > number N from "memtierN".
> > > > > > > > > > > > >
> > > > > > > > > > > > > The absolute value of "rank" of a memtier doesn't necessarily carry
> > > > > > > > > > > > > any meaning. Its value relative to other memtiers decides the level of
> > > > > > > > > > > > > this memtier in the tier hierarchy.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The CPU-attached DRAM nodes are always in memtier0 (the device ID),
> > > > > > > > > > > > > but memtier0 may not always be the top-tier, e.g. its level can be 3
> > > > > > > > > > > > > in a 5-tier system.
> > > > > > > > > > > > >
> > > > > > > > > > > > > For the above example (example 6), we can have:
> > > > > > > > > > > > >
> > > > > > > > > > > > > $ ls /sys/devices/system/memtier
> > > > > > > > > > > > > memtier0
> > > > > > > > > > > > > memtier1
> > > > > > > > > > > > > memtier2
> > > > > > > > > > > > > memtier128
> > > > > > > > > > > > >
> > > > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/rank
> > > > > > > > > > > > > 50
> > > > > > > > > > > > > 60
> > > > > > > > > > > > > 70
> > > > > > > > > > > > > 10  
> > > > > > > > > > > >
> > > > > > > > > > > > I understand that the device ID cannot be negative.  So we have to use
> > > > > > > > > > > > rank.  Can we make it possible to allow "rank" to be negative?
> > > > > > > > > > >
> > > > > > > > > > > It is possible to allow "rank" to be negative, though I think all
> > > > > > > > > > > positive values should work equally well.
> > > > > > > > > > >  
> > > > > > > > > > > > Another choice is to do some trick on device ID.  For example, the CPU-
> > > > > > > > > > > > attached DRAM nodes are always memtier100 (the device ID).  Then we can
> > > > > > > > > > > > have memtier99, memtier100, memtier101, memtier102, ....  That's not
> > > > > > > > > > > > perfect either.
> > > > > > > > > > >
> > > > > > > > > > > If we go with the device ID tricks, one approach is to use sub-device IDs:
> > > > > > > > > > >
> > > > > > > > > > > - There are 3 major tiers: tier0 (e.g. GPU), tier1 (e.g. DRAM) and
> > > > > > > > > > > tier2 (e.g. PMEM).
> > > > > > > > > > >
> > > > > > > > > > > - Each major tier can have minor tiers, e.g. tier0.0, tier1.0,
> > > > > > > > > > > tier1.1, tier2.0, tier2.1.
> > > > > > > > > > >
> > > > > > > > > > > The earlier 4-tier example can be represented as:
> > > > > > > > > > >
> > > > > > > > > > > memtier0.0 -> memtier1.0 -> memtier2.0 -> memtier2.1
> > > > > > > > > > >
> > > > > > > > > > > We can also omit .0 so that the tiers are:
> > > > > > > > > > >
> > > > > > > > > > > memtier0 -> memtier1 -> memtier2 -> memtier2.1
> > > > > > > > > > >
> > > > > > > > > > > This should be flexible enough to support multiple tiers while keeping
> > > > > > > > > > > the tier IDs relatively stable.
> > > > > > > > > > >
> > > > > > > > > > > It is not as flexible as the rank approach. For example, to insert a
> > > > > > > > > > > new tier between 2.0 and 2.1, we need to add a tier 2.2 and reassign
> > > > > > > > > > > existing nodes to these 3 tiers.  Using "rank", we can insert a new
> > > > > > > > > > > tier and only move desired nodes into the new tier.
> > > > > > > > > > >
> > > > > > > > > > > What do you think?  
> > > > > > > > > >
> > > > > > > > > > The rank approach looks better for me.  And if we stick with the device ID
> > > > > > > > > > rule as follows,
> > > > > > > > > >
> > > > > > > > > > ...
> > > > > > > > > > 255     GPU
> > > > > > > > > > 0       DRAM
> > > > > > > > > > 1       PMEM
> > > > > > > > > > 2
> > > > > > > > > > ...
> > > > > > > > > >
> > > > > > > > > > 255 is -1 for "s8".
> > > > > > > > > >
> > > > > > > > > > The device ID should do most tricks at least now.  The rank can provide
> > > > > > > > > > more flexibility in the future.  We can even go without rank in the
> > > > > > > > > > first version, and introduce it when it's necessary.  
> > > > > > > > >
> > > > > > > > > Given that the "rank" approach is generally favored, let's go with
> > > > > > > > > that to avoid compatibility issues that may come from the switch of
> > > > > > > > > device ID tricks to ranks.  
> > > > > > > >
> > > > > > > > OK.  Just to confirm.  Does this mean that we will have fixed device ID,
> > > > > > > > for example,
> > > > > > > >
> > > > > > > > GPU                     memtier255
> > > > > > > > DRAM (with CPU)         memtier0
> > > > > > > > PMEM                    memtier1
> > > > > > > >
> > > > > > > > When we add a new memtier, it can be memtier254, or memtier2?  The rank
> > > > > > > > value will determine the real demotion order.  
> > > > > > >
> > > > > > > With the rank approach, the device ID numbering should be flexible and
> > > > > > > not mandated by the proposal.  
> > > > > >
> > > > > > If so, the rank number will be fixed?  For example,
> > > > > >
> > > > > > GPU                     100
> > > > > > DRAM (with CPU)         200
> > > > > > PMEM                    300
> > > > > >
> > > > > > When we add a new memtier, its rank can be 50, 150, 250, or 400?
> > > > > >
> > > > > > If so, this makes me think why we don't just make this kind of rank the
> > > > > > device ID?  Or I missed something?
> > > > > >
> > > > > > Or, both device IDs and rank values are not fixed?  Why do we need that
> > > > > > kind of flexibility?  Sorry, I may not understand all requirements.
> > > > >
> > > > > Even though the proposal doesn't mandate a particular device ID
> > > > > numbering, I expect that the device IDs will be relatively stable once
> > > > > a kernel implementation is chosen. For example, it is likely that DRAM
> > > > > nodes with CPUs will always be on memtier1, no matter how many tiers
> > > > > are higher or lower than these nodes.
> > > > >
> > > > > We don't need to mandate a particular way to assign the rank values,
> > > > > either.  What matters is the relative order and some reasonable gap
> > > > > between these values.
> > > > >
> > > > > The rank approach allows us to keep memtier device IDs relatively
> > > > > stable even though we may change the tier ordering among them.  Its
> > > > > flexibility can have many other uses as well.  For example, we can
> > > > > insert a new memtier into the tier hierarchy for a new set of nodes
> > > > > without affecting the node assignment of any existing memtier,
> > > > > provided that there is enough gap in the rank values for the new
> > > > > memtier.
> > > > >
> > > > > Using the rank value directly as the device ID has some disadvantages:
> > > > > - It is kind of unconventional to number devices in this way.
> > > > > - We cannot assign a specific memtier device ID to DRAM nodes with
> > > > > CPUs (even though this is not mandated by the "rank" proposal, I expect
> > > > > the device will likely always be memtier1 in practice).
> > > > > - It is possible that we may eventually allow the rank value to be
> > > > > modified as a way to adjust the tier ordering.  We cannot do that
> > > > > easily for device IDs.  
> > > >
> > > > OK.  I can understand that sometimes it's more natural to change the
> > > > order of a set of nodes with the same memory type (and data plane path)
> > > > together instead of changing them one by one for each node.
> > > >
> > > > It appears that the memtierX device becomes a kind of memory type (with
> > > > the data plane path considered for latency/throughput too).  We can
> > > > assign a memory type to a node, and change the order between memory
> > > > types.  If so, we need to allow multiple memtiers to have the same rank
> > > > value.
> > >
> > > Jonathan mentioned this feature that multiple memtiers share the same
> > > rank as well.  It can be a convenient feature to have.  For
> > > simplicity, it should be fine to leave out this feature initially.  
> >
> > OK.  What do you think about the concept of memory types?  You have
> > mentioned that in the memtierX directory, we can put latency/throughput,
> > etc.  IMHO, these only make sense for one type of memory.  And it's
> > natural for all memory nodes onlined by a driver to be of the same memory
> > type.
> 
> I think this is not always true. For example, a dax kmem driver can
> online both pmem and non-pmem dax devices as system memory.

The CXL Type 3 memory driver is also responsible for memories of different
types with very different characteristics.  It would need to assign memory
into at least a few different tiers - potentially many different ones.
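
To sketch what that might look like (purely hypothetical: the
memtier_assign_node() helper and the RANK_* values below are invented
placeholders for whatever registration interface the eventual patches
actually provide):

#include <linux/types.h>        /* bool */

/* All names here are made up for illustration, not a real in-kernel API. */
enum {
        RANK_DEFAULT_DRAM = 200,        /* comparable to CPU-attached DRAM */
        RANK_REMOTE_DRAM  = 250,        /* volatile, but several hops away */
        RANK_PMEM         = 300,        /* persistent, slower media        */
};

int memtier_assign_node(int nid, int rank);     /* hypothetical helper */

static int cxl_region_set_tier(int nid, bool persistent, int switch_hops)
{
        int rank;

        if (persistent)
                rank = RANK_PMEM;
        else if (switch_hops > 1)
                rank = RANK_REMOTE_DRAM;
        else
                rank = RANK_DEFAULT_DRAM;

        return memtier_assign_node(nid, rank);
}
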

> 
> > That is, drivers (including firmware drivers) will register memory types
> > and put nodes into them.  Based on memory types, "rank" (related to, for
> > example, latency) determines the real memory tiers.
> >
> > If you think it's a good idea, we can rename memtierX to memory_typeX.
> > But memory type may not be a good name; DRAM behind a local memory
> > controller and DRAM behind remote CXL may have quite different performance
> > metrics.  Or memory_class to avoid the possible confusion?
> 
> Memory types (e.g. GPU, DRAM, PMEM, etc) can be useful information to
> help initialize the memory tiers of NUMA nodes. But I think memory
> type is not a substitute for memory tier.  We still need to define
> memory tiers on top of NUMA node groups based on memory types (for
> example, some may want to group GPU and DRAM into the same tier,
> others may want separate tiers for GPU/DRAM).  It is simpler to keep
> the sysfs interface to just memory tiers and implement memory types as
> internal device attributes if needed.
> 
> To avoid confusion, we can require that the rank value is unique for
> each memtier device.  This should make it clear that each memtier
> device represents a distinct memory tier. 

I don't mind that for a first implementation, but I can see an advantage
in the flexibility of being able to have multiple tiers fuse by
giving them the same rank value, if we ever make rank writeable after
creation.  Given that no userspace is going to rely on 'failure' to create
ranks with the same value, the flexibility to make this change later
without ABI compatibility problems is there.
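
(Reusing the illustrative struct sketched earlier in the thread, fusing
would only mean the ordering comparison treats equal ranks as equal - a
sketch of the idea, not something for the first version:)

static int memtier_cmp(const struct memtier *a, const struct memtier *b)
{
        if (a->rank != b->rank)
                return a->rank < b->rank ? -1 : 1;
        return 0;       /* same rank: same effective tier level */
}
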

> We can still put
> latency/throughput values into each memtierN directory.  Such values
> need to be specified as a range to better accommodate possibly varied
> performance of the devices within the same memory tier.

I'd postpone adding this sort of information to the tiers
until we need it.  Most of the info can be established by userspace anyway,
so why complicate this interface? If there are strong use cases for the info
we can add it later.
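
For example, where the platform publishes ACPI HMAT data, userspace can
already read per-node performance from the node "access" class in sysfs
(see Documentation/admin-guide/mm/numaperf.rst).  A minimal sketch,
assuming those files are present on the system:

#include <stdio.h>

static long read_perf(int nid, const char *attr)
{
        char path[128];
        long val = -1;
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/node/node%d/access0/initiators/%s",
                 nid, attr);
        f = fopen(path, "r");
        if (!f)
                return -1;      /* not provided by this platform/firmware */
        if (fscanf(f, "%ld", &val) != 1)
                val = -1;
        fclose(f);
        return val;
}

int main(void)
{
        int nid = 0;    /* example node; units as defined in numaperf.rst */

        printf("node%d: read_latency=%ld read_bandwidth=%ld\n", nid,
               read_perf(nid, "read_latency"),
               read_perf(nid, "read_bandwidth"));
        return 0;
}
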

Thanks,

Jonathan

> 
> > Best Regards,
> > Huang, Ying
> >  
> > > > Best Regards,
> > > > Huang, Ying
> > > >  
> > > > > >  
> > > > > > > > I think you may need to send v3 to make sure everyone is at the same
> > > > > > > > page.  
> > > > > > >
> > > > > > > Will do it shortly.  
> > > > > >
> > > > > > Good!  Thanks!
> > > > > >
> > > > > > Best Regards,
> > > > > > Huang, Ying
> > > > > >  
> > > > > > > > Best Regards,
> > > > > > > > Huang, Ying
> > > > > > > >  
> > > > > > > > > > Best Regards,
> > > > > > > > > > Huang, Ying
> > > > > > > > > >  
> > > > > > > > > > > > > The tier order: memtier128 -> memtier0 -> memtier1 -> memtier2
> > > > > > > > > > > > >
> > > > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > > > > 0
> > > > > > > > > > > > > 2
> > > > > > > > > > > > > 3
> > > > > > > > > > > > > 1
> > > > > > > > > > > > >
> > > > > > > > > > > > > $ ls -l /sys/devices/system/node/node*/memtier
> > > > > > > > > > > > > /sys/devices/system/node/node0/memtier -> /sys/devices/system/memtier/memtier0
> > > > > > > > > > > > > /sys/devices/system/node/node1/memtier -> /sys/devices/system/memtier/memtier128
> > > > > > > > > > > > > /sys/devices/system/node/node2/memtier -> /sys/devices/system/memtier/memtier1
> > > > > > > > > > > > > /sys/devices/system/node/node3/memtier -> /sys/devices/system/memtier/memtier2
> > > > > > > > > > > > >
> > > > > > > > > > > > > To override the memory tier of a node, we can use a new, write-only,
> > > > > > > > > > > > > per-node interface file:
> > > > > > > > > > > > >
> > > > > > > > > > > > > /sys/devices/system/node/nodeN/set_memtier
> > > > > > > > > > > > >
> > > > > > > > > > > > > e.g.
> > > > > > > > > > > > >
> > > > > > > > > > > > > $ echo "memtier128" > sys/devices/system/node/node1/set_memtier  
> > > > > > > > > > > >
> > > > > > > > > > > > I prefer the original proposal to make nodeX/memtier a normal file to
> > > > > > > > > > > > hold memtier devicde ID instead of a link.  
> > > > > > > > > > >
> > > > > > > > > > > OK. We don't have to use a symlink.
> > > > > > > > > > >  
> > > > > > > > > > > > Best Regards,
> > > > > > > > > > > > Huang, Ying
> > > > > > > > > > > >  
> > > > > > > > > > > > > Any comments?
> > > > > > > > > > > > >  
> > > > > > > > > > > > > > Jonathan


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-25 17:27                 ` Wei Xu
  2022-05-26  9:32                   ` Jonathan Cameron
@ 2022-05-27  9:26                   ` Aneesh Kumar K V
  1 sibling, 0 replies; 47+ messages in thread
From: Aneesh Kumar K V @ 2022-05-27  9:26 UTC (permalink / raw)
  To: Wei Xu
  Cc: Ying Huang, Jonathan Cameron, Andrew Morton, Greg Thelen,
	Yang Shi, Linux Kernel Mailing List, Jagdish Gediya,
	Michal Hocko, Tim C Chen, Dave Hansen, Alistair Popple,
	Baolin Wang, Feng Tang, Davidlohr Bueso, Dan Williams,
	David Rientjes, Linux MM, Brice Goglin, Hesham Almatary

On 5/25/22 10:57 PM, Wei Xu wrote:
> On Wed, May 25, 2022 at 3:01 AM Aneesh Kumar K V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> On 5/25/22 2:33 PM, Ying Huang wrote:
>>> On Tue, 2022-05-24 at 22:32 -0700, Wei Xu wrote:
>>>> On Tue, May 24, 2022 at 1:24 AM Ying Huang <ying.huang@intel.com> wrote:
>>>>>
>>>>> On Tue, 2022-05-24 at 00:04 -0700, Wei Xu wrote:
>>>>>> On Thu, May 19, 2022 at 8:06 PM Ying Huang <ying.huang@intel.com> wrote:
>>>>>>>
>>
>> ...
>>
>>>
>>> OK.  Just to confirm.  Does this mean that we will have fixed device ID,
>>> for example,
>>>
>>> GPU                   memtier255
>>> DRAM (with CPU)               memtier0
>>> PMEM                  memtier1
>>>
>>> When we add a new memtier, it can be memtier254, or memter2?  The rank
>>> value will determine the real demotion order.
>>>
>>> I think you may need to send v3 to make sure everyone is at the same
>>> page.
>>>
>>
>> What we have implemented which we will send as RFC shortly is below.
>>
>> kvaneesh@ubuntu-guest:~$ cd /sys/devices/system/
>> kvaneesh@ubuntu-guest:/sys/devices/system$ pwd
>> /sys/devices/system
>> kvaneesh@ubuntu-guest:/sys/devices/system$ ls
>> clockevents  clocksource  container  cpu  edac  memory  memtier  mpic
>> node  power
>> kvaneesh@ubuntu-guest:/sys/devices/system$ cd memtier/
>> kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ pwd
>> /sys/devices/system/memtier
>> kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ ls
>> default_rank  max_rank  memtier1  power  uevent
>> kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ cat default_rank
>> 1
>> kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ cat max_rank
>> 3
> 
> For flexibility, we don't want max_rank to be interpreted as the
> number of memory tiers.  Also, we want to leave spaces in rank values
> to allow new memtiers to be inserted when needed.  So I'd suggest to
> make max_rank a much larger value (e.g. 255).
> 
>> kvaneesh@ubuntu-guest:/sys/devices/system/memtier$ cd memtier1/
>> kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ ls
>> nodelist  power  rank  subsystem  uevent
>> kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ cat nodelist
>> 0-3
>> kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ cat rank
>> 1
>> kvaneesh@ubuntu-guest:/sys/devices/system/memtier/memtier1$ cd
>> ../../node/node1/
>> kvaneesh@ubuntu-guest:/sys/devices/system/node/node1$ cat memtier
>> 1
>> kvaneesh@ubuntu-guest:/sys/devices/system/node/node1$
>> root@ubuntu-guest:/sys/devices/system/node/node1# echo 0 > memtier
>> root@ubuntu-guest:/sys/devices/system/node/node1# cat memtier
>> 0
>> root@ubuntu-guest:/sys/devices/system/node/node1# cd ../../memtier/
>> root@ubuntu-guest:/sys/devices/system/memtier# ls
>> default_rank  max_rank  memtier0  memtier1  power  uevent
>> root@ubuntu-guest:/sys/devices/system/memtier# cd memtier0/
>> root@ubuntu-guest:/sys/devices/system/memtier/memtier0# cat nodelist
>> 1
>> root@ubuntu-guest:/sys/devices/system/memtier/memtier0# cat rank
>> 0
> 
> It looks like the example here demonstrates the dynamic creation of
> memtier0.  If so, how is the rank of memtier0 determined?  If we want
> to support creating new memtiers at runtime, I think an explicit
> interface that specifies both device ID and rank is preferred to avoid
> implicit dependencies between device IDs and ranks.
> 

Right now, to keep it all simple, there is a 1:1 relationship between
memory tier and rank value, i.e.

memory tier  rank
memtier0     100
memtier1     200
memtier2     300

Currently we are limiting this to a maximum of 3 tiers, hence the above is
very easy. Once we really get dynamic tier creation, we should be looking at
creating a new memory tier with the highest possible rank value. Once we
establish the memory tier, we then modify the rank value to the desired
value. There will be a kernel interface to add a node to a memory tier
with a specific rank value so that drivers can do that if required.
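
A toy userspace sketch of that flow (this is not the patch series; the
helper names, rank values and the "+100" step below are invented purely
for illustration):

#include <stdio.h>

#define MAX_TIERS       8

struct memtier {
        int rank;               /* -1 == unused slot */
        unsigned long nodes;    /* bitmask of NUMA nodes in this tier */
};

static struct memtier tiers[MAX_TIERS] = {
        [0] = { .rank = 100 },                  /* memtier0 */
        [1] = { .rank = 200, .nodes = 0xf },    /* memtier1: nodes 0-3 (default tier) */
        [2] = { .rank = 300 },                  /* memtier2 */
        [3 ... MAX_TIERS - 1] = { .rank = -1 },
};

/* hypothetical: create a tier whose rank is higher than any existing one */
static int memtier_create(void)
{
        int i, max = 0;

        for (i = 0; i < MAX_TIERS; i++)
                if (tiers[i].rank > max)
                        max = tiers[i].rank;
        for (i = 0; i < MAX_TIERS; i++)
                if (tiers[i].rank < 0) {
                        tiers[i].rank = max + 100;
                        return i;
                }
        return -1;
}

/* hypothetical: move a node into a tier, removing it from all others */
static void memtier_add_node(int tier, int node)
{
        int i;

        for (i = 0; i < MAX_TIERS; i++)
                tiers[i].nodes &= ~(1UL << node);
        tiers[tier].nodes |= 1UL << node;
}

int main(void)
{
        int new = memtier_create();     /* starts out with the highest rank */

        if (new < 0)
                return 1;
        tiers[new].rank = 150;          /* then adjust the rank to the desired value */
        memtier_add_node(new, 2);       /* and move node 2 into it */
        printf("memtier%d: rank=%d nodes=0x%lx\n",
               new, tiers[new].rank, tiers[new].nodes);
        return 0;
}

The real interfaces would of course live in the kernel and sysfs; the
point here is only the ordering of the steps.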

I haven't gone to that implementation because I was hoping we could get
to it later, when we really start requiring dynamic tier support.

I will share the patch series we have been working on. I have yet to get
the documentation added, but I will not wait for it to be complete, so
that we can get some early testing/feedback.

-aneesh

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v2)
  2022-05-27  9:10                             ` Jonathan Cameron
@ 2022-05-30  6:54                               ` Ying Huang
  0 siblings, 0 replies; 47+ messages in thread
From: Ying Huang @ 2022-05-30  6:54 UTC (permalink / raw)
  To: Jonathan Cameron, Wei Xu
  Cc: Andrew Morton, Greg Thelen, Aneesh Kumar K.V, Yang Shi,
	Linux Kernel Mailing List, Jagdish Gediya, Michal Hocko,
	Tim C Chen, Dave Hansen, Alistair Popple, Baolin Wang, Feng Tang,
	Davidlohr Bueso, Dan Williams, David Rientjes, Linux MM,
	Brice Goglin, Hesham Almatary

On Fri, 2022-05-27 at 10:10 +0100, Jonathan Cameron wrote:
> On Thu, 26 May 2022 13:55:39 -0700
> Wei Xu <weixugc@google.com> wrote:
> 
> > On Thu, May 26, 2022 at 12:39 AM Ying Huang <ying.huang@intel.com> wrote:
> > > 
> > > On Thu, 2022-05-26 at 00:08 -0700, Wei Xu wrote:  
> > > > On Wed, May 25, 2022 at 11:55 PM Ying Huang <ying.huang@intel.com> wrote:  
> > > > > 
> > > > > On Wed, 2022-05-25 at 20:53 -0700, Wei Xu wrote:  
> > > > > > On Wed, May 25, 2022 at 6:10 PM Ying Huang <ying.huang@intel.com> wrote:  
> > > > > > > 
> > > > > > > On Wed, 2022-05-25 at 08:36 -0700, Wei Xu wrote:  
> > > > > > > > On Wed, May 25, 2022 at 2:03 AM Ying Huang <ying.huang@intel.com> wrote:  
> > > > > > > > > 
> > > > > > > > > On Tue, 2022-05-24 at 22:32 -0700, Wei Xu wrote:  
> > > > > > > > > > On Tue, May 24, 2022 at 1:24 AM Ying Huang <ying.huang@intel.com> wrote:  
> > > > > > > > > > > 
> > > > > > > > > > > On Tue, 2022-05-24 at 00:04 -0700, Wei Xu wrote:  
> > > > > > > > > > > > On Thu, May 19, 2022 at 8:06 PM Ying Huang <ying.huang@intel.com> wrote:  
> > > > > > > > > > > > > 
> > > > > > > > > > > > > On Wed, 2022-05-18 at 00:09 -0700, Wei Xu wrote:  
> > > > > > > > > > > > > > On Thu, May 12, 2022 at 8:00 AM Jonathan Cameron
> > > > > > > > > > > > > > <Jonathan.Cameron@huawei.com> wrote:  
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > On Wed, 11 May 2022 23:22:11 -0700
> > > > > > > > > > > > > > > Wei Xu <weixugc@google.com> wrote:  
> > > > > > > > > > > > > > > > The current kernel has the basic memory tiering support: Inactive
> > > > > > > > > > > > > > > > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > > > > > > > > > > > > > > > tier NUMA node to make room for new allocations on the higher tier
> > > > > > > > > > > > > > > > NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> > > > > > > > > > > > > > > > migrated (promoted) to a higher tier NUMA node to improve the
> > > > > > > > > > > > > > > > performance.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > In the current kernel, memory tiers are defined implicitly via a
> > > > > > > > > > > > > > > > demotion path relationship between NUMA nodes, which is created during
> > > > > > > > > > > > > > > > the kernel initialization and updated when a NUMA node is hot-added or
> > > > > > > > > > > > > > > > hot-removed.  The current implementation puts all nodes with CPU into
> > > > > > > > > > > > > > > > the top tier, and builds the tier hierarchy tier-by-tier by establishing
> > > > > > > > > > > > > > > > the per-node demotion targets based on the distances between nodes.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > This current memory tier kernel interface needs to be improved for
> > > > > > > > > > > > > > > > several important use cases:
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > * The current tier initialization code always initializes
> > > > > > > > > > > > > > > >   each memory-only NUMA node into a lower tier.  But a memory-only
> > > > > > > > > > > > > > > >   NUMA node may have a high performance memory device (e.g. a DRAM
> > > > > > > > > > > > > > > >   device attached via CXL.mem or a DRAM-backed memory-only node on
> > > > > > > > > > > > > > > >   a virtual machine) and should be put into a higher tier.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > * The current tier hierarchy always puts CPU nodes into the top
> > > > > > > > > > > > > > > >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> > > > > > > > > > > > > > > >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> > > > > > > > > > > > > > > >   with CPUs are better to be placed into the next lower tier.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > * Also because the current tier hierarchy always puts CPU nodes
> > > > > > > > > > > > > > > >   into the top tier, when a CPU is hot-added (or hot-removed) and
> > > > > > > > > > > > > > > >   triggers a memory node from CPU-less into a CPU node (or vice
> > > > > > > > > > > > > > > >   versa), the memory tier hierarchy gets changed, even though no
> > > > > > > > > > > > > > > >   memory node is added or removed.  This can make the tier
> > > > > > > > > > > > > > > >   hierarchy unstable and make it difficult to support tier-based
> > > > > > > > > > > > > > > >   memory accounting.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > * A higher tier node can only be demoted to selected nodes on the
> > > > > > > > > > > > > > > >   next lower tier as defined by the demotion path, not any other
> > > > > > > > > > > > > > > >   node from any lower tier.  This strict, hard-coded demotion order
> > > > > > > > > > > > > > > >   does not work in all use cases (e.g. some use cases may want to
> > > > > > > > > > > > > > > >   allow cross-socket demotion to another node in the same demotion
> > > > > > > > > > > > > > > >   tier as a fallback when the preferred demotion node is out of
> > > > > > > > > > > > > > > >   space), and has resulted in the feature request for an interface to
> > > > > > > > > > > > > > > >   override the system-wide, per-node demotion order from the
> > > > > > > > > > > > > > > >   userspace.  This demotion order is also inconsistent with the page
> > > > > > > > > > > > > > > >   allocation fallback order when all the nodes in a higher tier are
> > > > > > > > > > > > > > > >   out of space: The page allocation can fall back to any node from
> > > > > > > > > > > > > > > >   any lower tier, whereas the demotion order doesn't allow that.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > * There are no interfaces for the userspace to learn about the memory
> > > > > > > > > > > > > > > >   tier hierarchy in order to optimize its memory allocations.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > I'd like to propose revised memory tier kernel interfaces based on
> > > > > > > > > > > > > > > > the discussions in the threads:
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> > > > > > > > > > > > > > > > - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
> > > > > > > > > > > > > > > > - https://lore.kernel.org/linux-mm/867bc216386eb6cbf54648f23e5825830f5b922e.camel@intel.com/T/
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > High-level Design Ideas
> > > > > > > > > > > > > > > > =======================
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > * Define memory tiers explicitly, not implicitly.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > * Memory tiers are defined based on hardware capabilities of memory
> > > > > > > > > > > > > > > >   nodes, not their relative node distances between each other.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > * The tier assignment of each node is independent from each other.
> > > > > > > > > > > > > > > >   Moving a node from one tier to another tier doesn't affect the tier
> > > > > > > > > > > > > > > >   assignment of any other node.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > * The node-tier association is stable. A node can be reassigned to a
> > > > > > > > > > > > > > > >   different tier only under the specific conditions that don't block
> > > > > > > > > > > > > > > >   future tier-based memory cgroup accounting.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > * A node can demote its pages to any nodes of any lower tiers. The
> > > > > > > > > > > > > > > >   demotion target node selection follows the allocation fallback order
> > > > > > > > > > > > > > > >   of the source node, which is built based on node distances.  The
> > > > > > > > > > > > > > > >   demotion targets are also restricted to only the nodes from the tiers
> > > > > > > > > > > > > > > >   lower than the source node.  We no longer need to maintain a separate
> > > > > > > > > > > > > > > >   per-node demotion order (node_demotion[]).
> > > > > > > > > > > > > > > >  
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Hi Wei,
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > This proposal looks good to me, though we'll be having fun
> > > > > > > > > > > > > > > white boarding topologies from our roadmaps for the next few days :)  
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > That's good to hear.
> > > > > > > > > > > > > >  
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > A few comments inline. It also seems likely to me that there is little
> > > > > > > > > > > > > > > benefit in starting with 3 tiers as the maximum.  Seems unlikely the
> > > > > > > > > > > > > > > code will be substantially simpler for 3 than it would be for 4 or 5.
> > > > > > > > > > > > > > > I've drawn out one simple case that needs 4 to do sensible things.  
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > We can make the number of tiers a config option. 3 tiers are just what
> > > > > > > > > > > > > > the kernel can reasonably initialize when there isn't enough hardware
> > > > > > > > > > > > > > performance information from the firmware.
> > > > > > > > > > > > > >  
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Sysfs Interfaces
> > > > > > > > > > > > > > > > ================
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > * /sys/devices/system/memtier/memtierN/nodelist
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > >   where N = 0, 1, 2 (the kernel supports only 3 tiers for now).
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > >   Format: node_list
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > >   Read-only.  When read, list the memory nodes in the specified tier.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > >   Tier 0 is the highest tier, while tier 2 is the lowest tier.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > >   The absolute value of a tier id number has no specific meaning.
> > > > > > > > > > > > > > > >   What matters is the relative order of the tier id numbers.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > >   When a memory tier has no nodes, the kernel can hide its memtier
> > > > > > > > > > > > > > > >   sysfs files.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > * /sys/devices/system/node/nodeN/memtier
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > >   where N = 0, 1, ...
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > >   Format: int or empty
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > >   When read, list the memory tier that the node belongs to.  Its value
> > > > > > > > > > > > > > > >   is empty for a CPU-only NUMA node.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > >   When written, the kernel moves the node into the specified memory
> > > > > > > > > > > > > > > >   tier if the move is allowed.  The tier assignment of all other nodes
> > > > > > > > > > > > > > > >   are not affected.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > >   Initially, we can make this interface read-only.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Kernel Representation
> > > > > > > > > > > > > > > > =====================
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > * All memory tiering code is guarded by CONFIG_TIERED_MEMORY.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > * #define MAX_MEMORY_TIERS 3
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > >   Support 3 memory tiers for now.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > * #define MEMORY_DEFAULT_TIER 1
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > >   The default tier that a memory node is assigned to.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > * nodemask_t memory_tiers[MAX_MEMORY_TIERS]
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > >   Store memory nodes by tiers.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > * int node_tier_map[MAX_NUMNODES]
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > >   Map a node to its tier.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > >   For each CPU-only node c, node_tier_map[c] = -1.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Memory Tier Initialization
> > > > > > > > > > > > > > > > ==========================
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > By default, all memory nodes are assigned to the default tier
> > > > > > > > > > > > > > > > (MEMORY_DEFAULT_TIER).  
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > This is tighter than it needs to be.  In many cases we can easily
> > > > > > > > > > > > > > > establish if there is any possibility of CPU being hotplugged into
> > > > > > > > > > > > > > > a memory node.  If it's CXL attached no way CPUs are going to be
> > > > > > > > > > > > > > > turning up their later :)  If CPU HP into a given node can't happen
> > > > > > > > > > > > > > > we can be more flexible and I think that often results in better decisions.
> > > > > > > > > > > > > > > See example below, though obviously I could just use the userspace
> > > > > > > > > > > > > > > interface to fix that up anyway or have a CXL driver move it around
> > > > > > > > > > > > > > > if that's relevant.  In some other cases I'm fairly sure we know in
> > > > > > > > > > > > > > > advance where CPUs can be added but I'd need to check all the
> > > > > > > > > > > > > > > relevant specs to be sure there aren't any corner cases.  I 'think'
> > > > > > > > > > > > > > > for ARM for example we know where all possible CPUs can be hotplugged
> > > > > > > > > > > > > > > (constraint coming from the interrupt controller + the fact that only
> > > > > > > > > > > > > > > virtual CPU HP is defined).  
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > We may not always want to put a CXL-attached memory device into a
> > > > > > > > > > > > > > slower tier because even though CXL does add some additional latency,
> > > > > > > > > > > > > > both the memory device and CXL can still be very capable in
> > > > > > > > > > > > > > performance and may not be much slower (if any) than the on-board DRAM
> > > > > > > > > > > > > > (e.g. DRAM from a remote CPU socket).
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Also, the default tier here is just the initial tier assignment of
> > > > > > > > > > > > > > each node, which behaves as if there were no tiering.  A tiering
> > > > > > > > > > > > > > kernel init function can certainly reassign the tier for each node if
> > > > > > > > > > > > > > it knows enough about the hardware performance for these nodes from
> > > > > > > > > > > > > > the firmware.
> > > > > > > > > > > > > >  
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > A device driver can move up or down its memory nodes from the default
> > > > > > > > > > > > > > > > tier.  For example, PMEM can move down its memory nodes below the
> > > > > > > > > > > > > > > > default tier, whereas GPU can move up its memory nodes above the
> > > > > > > > > > > > > > > > default tier.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > The kernel initialization code makes the decision on which exact tier
> > > > > > > > > > > > > > > > a memory node should be assigned to based on the requests from the
> > > > > > > > > > > > > > > > device drivers as well as the memory device hardware information
> > > > > > > > > > > > > > > > provided by the firmware.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Memory Tier Reassignment
> > > > > > > > > > > > > > > > ========================
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > After a memory node is hot-removed, it can be hot-added back to a
> > > > > > > > > > > > > > > > different memory tier.  This is useful for supporting dynamically
> > > > > > > > > > > > > > > > provisioned CXL.mem NUMA nodes, which may connect to different
> > > > > > > > > > > > > > > > memory devices across hot-plug events.  Such tier changes should
> > > > > > > > > > > > > > > > be compatible with tier-based memory accounting.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > The userspace may also reassign an existing online memory node to a
> > > > > > > > > > > > > > > > different tier.  However, this should only be allowed when no pages
> > > > > > > > > > > > > > > > are allocated from the memory node or when there are no non-root
> > > > > > > > > > > > > > > > memory cgroups (e.g. during the system boot).  This restriction is
> > > > > > > > > > > > > > > > important for keeping memory tier hierarchy stable enough for
> > > > > > > > > > > > > > > > tier-based memory cgroup accounting.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Memory Allocation for Demotion
> > > > > > > > > > > > > > > > ==============================
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > To allocate a new page as the demotion target for a page, the kernel
> > > > > > > > > > > > > > > > calls the allocation function (__alloc_pages_nodemask) with the
> > > > > > > > > > > > > > > > source page node as the preferred node and the union of all lower
> > > > > > > > > > > > > > > > tier nodes as the allowed nodemask.  The actual target node selection
> > > > > > > > > > > > > > > > then follows the allocation fallback order that the kernel has
> > > > > > > > > > > > > > > > already defined.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > The pseudo code looks like:
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > >     targets = NODE_MASK_NONE;
> > > > > > > > > > > > > > > >     src_nid = page_to_nid(page);
> > > > > > > > > > > > > > > >     src_tier = node_tier_map[src_nid];
> > > > > > > > > > > > > > > >     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
> > > > > > > > > > > > > > > >             nodes_or(targets, targets, memory_tiers[i]);
> > > > > > > > > > > > > > > >     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > The memopolicy of cpuset, vma and owner task of the source page can
> > > > > > > > > > > > > > > > be set to refine the demotion target nodemask, e.g. to prevent
> > > > > > > > > > > > > > > > demotion or select a particular allowed node as the demotion target.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Memory Allocation for Promotion
> > > > > > > > > > > > > > > > ===============================
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > The page allocation for promotion is similar to demotion, except that (1)
> > > > > > > > > > > > > > > > the target nodemask uses the promotion tiers, (2) the preferred node can
> > > > > > > > > > > > > > > > be the accessing CPU node, not the source page node.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Examples
> > > > > > > > > > > > > > > > ========
> > > > > > > > > > > > > > > >  
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > >  
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > * Example 3:
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.  
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Node2 is drawn as pmem.  
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Typo. Good catch.
> > > > > > > > > > > > > >  
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > All nodes are in the same tier.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > >                   20
> > > > > > > > > > > > > > > >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> > > > > > > > > > > > > > > >          \                 /
> > > > > > > > > > > > > > > >           \ 30            / 30
> > > > > > > > > > > > > > > >            \             /
> > > > > > > > > > > > > > > >              Node 2 (PMEM)
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > node distances:
> > > > > > > > > > > > > > > > node   0    1    2
> > > > > > > > > > > > > > > >    0  10   20   30
> > > > > > > > > > > > > > > >    1  20   10   30
> > > > > > > > > > > > > > > >    2  30   30   10
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > > > > > > > <empty>
> > > > > > > > > > > > > > > > 0-2
> > > > > > > > > > > > > > > > <empty>
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > > > > > > > > 1
> > > > > > > > > > > > > > > > 1
> > > > > > > > > > > > > > > > 1
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Demotion fallback order:
> > > > > > > > > > > > > > > > node 0: empty
> > > > > > > > > > > > > > > > node 1: empty
> > > > > > > > > > > > > > > > node 2: empty
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > * Example 4:
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Node 0 is a DRAM node with CPU.
> > > > > > > > > > > > > > > > Node 1 is a PMEM node.
> > > > > > > > > > > > > > > > Node 2 is a GPU node.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > >                   50
> > > > > > > > > > > > > > > >   Node 0 (DRAM)  ----  Node 2 (GPU)
> > > > > > > > > > > > > > > >          \                 /
> > > > > > > > > > > > > > > >           \ 30            / 60
> > > > > > > > > > > > > > > >            \             /
> > > > > > > > > > > > > > > >              Node 1 (PMEM)
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > node distances:
> > > > > > > > > > > > > > > > node   0    1    2
> > > > > > > > > > > > > > > >    0  10   30   50
> > > > > > > > > > > > > > > >    1  30   10   60
> > > > > > > > > > > > > > > >    2  50   60   10
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > > > > > > > 2
> > > > > > > > > > > > > > > > 0
> > > > > > > > > > > > > > > > 1
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > > > > > > > > 1
> > > > > > > > > > > > > > > > 2
> > > > > > > > > > > > > > > > 0
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Demotion fallback order:
> > > > > > > > > > > > > > > > node 0: 1
> > > > > > > > > > > > > > > > node 1: empty
> > > > > > > > > > > > > > > > node 2: 0, 1
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > * Example 5:
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Node 0 is a DRAM node with CPU.
> > > > > > > > > > > > > > > > Node 1 is a GPU node.
> > > > > > > > > > > > > > > > Node 2 is a PMEM node.
> > > > > > > > > > > > > > > > Node 3 is a large, slow DRAM node without CPU.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > >      Node 2 (PMEM)  ----
> > > > > > > > > > > > > > > >    /      |              \
> > > > > > > > > > > > > > > >   /       | 30            \ 120
> > > > > > > > > > > > > > > >  |        |         100    \
> > > > > > > > > > > > > > > >  |   Node 0 (DRAM)  ----  Node 1 (GPU)
> > > > > > > > > > > > > > > >   \         \                 /
> > > > > > > > > > > > > > > >     \        \ 40            / 110
> > > > > > > > > > > > > > > >   80  \       \             /
> > > > > > > > > > > > > > > >         ---  Node 3 (Slow DRAM)  
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > This is close but not quite what was intended for Hesham's
> > > > > > > > > > > > > > > example... (note we just checked that Hesham's original node0-1
> > > > > > > > > > > > > > > timing didn't make any sense.).
> > > > > > > > > > > > > > >  
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > This was inspired by Hesham's example. But I should have also included
> > > > > > > > > > > > > > the version that illustrates the need to skip a tier when demoting
> > > > > > > > > > > > > > from certain nodes.
> > > > > > > > > > > > > >  
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > node distances:
> > > > > > > > > > > > > > > > node    0    1    2    3
> > > > > > > > > > > > > > > >    0   10  100   30   40
> > > > > > > > > > > > > > > >    1  100   10  120  110
> > > > > > > > > > > > > > > >    2   30  120   10   80
> > > > > > > > > > > > > > > >    3   40  110   80   10
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > > > > > > > 1
> > > > > > > > > > > > > > > > 0,3
> > > > > > > > > > > > > > > > 2
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > > > > > > > > 1
> > > > > > > > > > > > > > > > 0
> > > > > > > > > > > > > > > > 2
> > > > > > > > > > > > > > > > 1
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Demotion fallback order:
> > > > > > > > > > > > > > > > node 0: 2
> > > > > > > > > > > > > > > > node 1: 0, 3, 2
> > > > > > > > > > > > > > > > node 2: empty
> > > > > > > > > > > > > > > > node 3: 2  
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > This is close but not quite the same as the example
> > > > > > > > > > > > > > > Hesham gave (note the node timing 1 to 0 on in the table
> > > > > > > > > > > > > > > with that example didn't make sense).  I added another
> > > > > > > > > > > > > > > level of switching to make the numbers more obviously
> > > > > > > > > > > > > > > different and show how critical it might be.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > * Example 6:
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Node 0 is a DRAM node with CPU.
> > > > > > > > > > > > > > > Node 1 is a GPU node.
> > > > > > > > > > > > > > > Node 2 is a PMEM node.
> > > > > > > > > > > > > > > Node 3 is an extremely large, DRAM node without CPU.
> > > > > > > > > > > > > > >   (Key point here being that it probably never makes sense
> > > > > > > > > > > > > > >    to demote to anywhere else from this memory).
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > I've redone the timings wrt to example 5.
> > > > > > > > > > > > > > > Basis for this is 0 and 2 are directly connected
> > > > > > > > > > > > > > > via controllers in an SoC. 1 and 3 are connected
> > > > > > > > > > > > > > > via a a common switch one switch down switch
> > > > > > > > > > > > > > > (each hop via this is 100)
> > > > > > > > > > > > > > > All drams cost 10 once you've reached correct node
> > > > > > > > > > > > > > > and pmem costs 30 from SoC.
> > > > > > > > > > > > > > > Numbers get too large as a result but meh, I'm making
> > > > > > > > > > > > > > > a point not providing real numbers :)
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >          PMEM Node 2
> > > > > > > > > > > > > > >             |(30)
> > > > > > > > > > > > > > >         CPU + DRAM Node0
> > > > > > > > > > > > > > >             |(100)
> > > > > > > > > > > > > > >          Switch 1
> > > > > > > > > > > > > > >             |(100)
> > > > > > > > > > > > > > >           Switch 2
> > > > > > > > > > > > > > >     (100)  |      |(100)
> > > > > > > > > > > > > > > Node 1 GPU     Node3 Large memory.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > With one level of s
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >      Node 2 (PMEM)  ----
> > > > > > > > > > > > > > >     /      |              \
> > > > > > > > > > > > > > >    /       | 30            \ 330
> > > > > > > > > > > > > > >   |        |         310    \
> > > > > > > > > > > > > > >   |   Node 0 (DRAM)  ----  Node 1 (GPU)
> > > > > > > > > > > > > > >    \         \                 /
> > > > > > > > > > > > > > >      \        \ 310           / 210
> > > > > > > > > > > > > > >    330 \       \             /
> > > > > > > > > > > > > > >          ---  Node 3 (Extremely large DRAM)
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > To my mind, we should potentially also take into account
> > > > > > > > > > > > > > > the fact that Node3 can be known to never contain CPUs
> > > > > > > > > > > > > > > (in at least some architectures we know where the CPUs
> > > > > > > > > > > > > > >  might be added later, they can't just magically turn up
> > > > > > > > > > > > > > >  anywhere in the topology).
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > node distances:
> > > > > > > > > > > > > > > node    0    1    2    3
> > > > > > > > > > > > > > >     0   10   310  30   310
> > > > > > > > > > > > > > >     1   310  10   330  210
> > > > > > > > > > > > > > >     2   30   330  10   330
> > > > > > > > > > > > > > >     3   310  210  330   10
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > So, my ideal would treat node 3 different from other dram nodes
> > > > > > > > > > > > > > > as we know it can't have CPUs. Trying to come up with an
> > > > > > > > > > > > > > > always correct order for nodes 3 and 2 is tricky as to a certain
> > > > > > > > > > > > > > > extent depends on capacity. If node 2 was  big enough to take
> > > > > > > > > > > > > > > any demotion from node 0 and still have lots of room then demoting
> > > > > > > > > > > > > > > there form node 3 would make sense and visa versa.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >  $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > > > > > >  1
> > > > > > > > > > > > > > >  0
> > > > > > > > > > > > > > >  2
> > > > > > > > > > > > > > >  3
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >  $ cat /sys/devices/system/node/node*/memtier
> > > > > > > > > > > > > > >   1
> > > > > > > > > > > > > > >   0
> > > > > > > > > > > > > > >   2
> > > > > > > > > > > > > > >   3
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >  Demotion fallback order:
> > > > > > > > > > > > > > >  node 0: 2, 3
> > > > > > > > > > > > > > >  node 1: 3, 0, 2 (key being we will almost always have less pressure on node 3)
> > > > > > > > > > > > > > >  node 2: 3
> > > > > > > > > > > > > > >  node 3: empty
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > or as Hesham just pointed out this can be done with 3 tiers
> > > > > > > > > > > > > > > because we can put the GPU and CPU in the same tier because
> > > > > > > > > > > > > > > their is little reason to demote from one to the other.  
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Thank you for the example.  It makes sense to me to have node 3 on its
> > > > > > > > > > > > > > own tier.  We can have either 3 tiers or 4 tiers in total (assuming
> > > > > > > > > > > > > > that the max number of tiers is a config option).
> > > > > > > > > > > > > >  
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > We are also a bit worried about ABI backwards compatibility because
> > > > > > > > > > > > > > > of potential need to make more space in tiers lower in number than
> > > > > > > > > > > > > > > CPU attached DDR. I rather liked the negative proposal with
> > > > > > > > > > > > > > > default as 0 that Huang, Ying made.  
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > It is hard to have negative values as the device IDs.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > The current proposal equals the tier device ID to the tier hierarchy
> > > > > > > > > > > > > > level, which makes the interface simpler, but less flexible.  How
> > > > > > > > > > > > > > about the following proposal (which decouples the tier device ID from
> > > > > > > > > > > > > > the tier level)?
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > /sys/devices/system/memtier/memtierN/nodelist
> > > > > > > > > > > > > > /sys/devices/system/memtier/memtierN/rank
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Each memory tier N has two sysfs files:
> > > > > > > > > > > > > > - nodelist: the nodes that are in this tier
> > > > > > > > > > > > > > - rank: an opaque value that helps decide the level at which this tier
> > > > > > > > > > > > > > is in the tier hierarchy (smaller value means faster tier)
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > The tier hierarchy is determined by "rank", not by the device id
> > > > > > > > > > > > > > number N from "memtierN".
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > The absolute value of "rank" of a memtier doesn't necessarily carry
> > > > > > > > > > > > > > any meaning. Its value relative to other memtiers decides the level of
> > > > > > > > > > > > > > this memtier in the tier hierarchy.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > The CPU-attached DRAM nodes are always in memtier0 (the device ID),
> > > > > > > > > > > > > > but memtier0 may not always be the top-tier, e.g. its level can be 3
> > > > > > > > > > > > > > in a 5-tier system.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > For the above example (example 6), we can have:
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > $ ls /sys/devices/system/memtier
> > > > > > > > > > > > > > memtier0
> > > > > > > > > > > > > > memtier1
> > > > > > > > > > > > > > memtier2
> > > > > > > > > > > > > > memtier128
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/rank
> > > > > > > > > > > > > > 50
> > > > > > > > > > > > > > 60
> > > > > > > > > > > > > > 70
> > > > > > > > > > > > > > 10  
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I understand that the device ID cannot be negtive.  So we have to use
> > > > > > > > > > > > > rank.  Can we make it possible to allow "rank" to be negtive?  
> > > > > > > > > > > > 
> > > > > > > > > > > > It is possible to allow "rank" to be negative, though I think all
> > > > > > > > > > > > positive values should work equally well.
> > > > > > > > > > > >  
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > > Another choice is to do some trick on device ID.  For example, the CPU-
> > > > > > > > > > > > > attached DRAM node are always memtier100 (the device ID).  Then we can
> > > > > > > > > > > > > have memtier99, memtier100, memtier101, memteri102, ....  That's not
> > > > > > > > > > > > > perfect too.  
> > > > > > > > > > > > 
> > > > > > > > > > > > If we go with the device ID tricks, one approach is to use sub-device IDs:
> > > > > > > > > > > > 
> > > > > > > > > > > > - There are 3 major tiers: tier0 (e.g. GPU), tier1 (e.g.DRAM) and
> > > > > > > > > > > > tier2 (e.g. PMEM).
> > > > > > > > > > > > 
> > > > > > > > > > > > - Each major tier can have minor tiers, e.g. tier0.0, tier1.0,
> > > > > > > > > > > > tier1.1, tier2.0, tier2.1.
> > > > > > > > > > > > 
> > > > > > > > > > > > The earlier 4-tier example can be represented as:
> > > > > > > > > > > > 
> > > > > > > > > > > > memtier0.0 -> memtier1.0 -> memtier2.0 -> memtier2.1
> > > > > > > > > > > > 
> > > > > > > > > > > > We can also omit .0 so that the tiers are:
> > > > > > > > > > > > 
> > > > > > > > > > > > memtier0 -> memtier1 -> memtier2 -> memtier2.1
> > > > > > > > > > > > 
> > > > > > > > > > > > This should be flexible enough to support multiple tiers while keeping
> > > > > > > > > > > > the tier IDs relatively stable.
> > > > > > > > > > > > 
> > > > > > > > > > > > It is not as flexible as the rank approach. For example, to insert a
> > > > > > > > > > > > new tier between 2.0 and 2.1, we need to add a tier 2.2 and reassign
> > > > > > > > > > > > existing nodes to these 3 tiers.  Using "rank", we can insert a new
> > > > > > > > > > > > tier and only move desired nodes into the new tier.
> > > > > > > > > > > > 
> > > > > > > > > > > > What do you think?  
> > > > > > > > > > > 
> > > > > > > > > > > The rank approach looks better for.  And if we stick with the device ID
> > > > > > > > > > > rule as follows,
> > > > > > > > > > > 
> > > > > > > > > > > ...
> > > > > > > > > > > 255     GPU
> > > > > > > > > > > 0       DRAM
> > > > > > > > > > > 1       PMEM
> > > > > > > > > > > 2
> > > > > > > > > > > ...
> > > > > > > > > > > 
> > > > > > > > > > > 255 is -1 for "s8".
> > > > > > > > > > > 
> > > > > > > > > > > The device ID should do most tricks at least now.  The rank can provide
> > > > > > > > > > > more flexibility in the future.  We can even go without rank in the
> > > > > > > > > > > first version, and introduce it when it's necessary.  
> > > > > > > > > > 
> > > > > > > > > > Given that the "rank" approach is generally favored, let's go with
> > > > > > > > > > that to avoid compatibility issues that may come from the switch of
> > > > > > > > > > device ID tricks to ranks.  
> > > > > > > > > 
> > > > > > > > > OK.  Just to confirm.  Does this mean that we will have fixed device ID,
> > > > > > > > > for example,
> > > > > > > > > 
> > > > > > > > > GPU                     memtier255
> > > > > > > > > DRAM (with CPU)         memtier0
> > > > > > > > > PMEM                    memtier1
> > > > > > > > > 
> > > > > > > > > When we add a new memtier, it can be memtier254, or memter2?  The rank
> > > > > > > > > value will determine the real demotion order.  
> > > > > > > > 
> > > > > > > > With the rank approach, the device ID numbering should be flexible and
> > > > > > > > not mandated by the proposal.  
> > > > > > > 
> > > > > > > If so, the rank number will be fixed?  For example,
> > > > > > > 
> > > > > > > GPU                     100
> > > > > > > DRAM (with CPU)         200
> > > > > > > PMEM                    300
> > > > > > > 
> > > > > > > When we add a new memtier, its rank can be 50, 150, 250, or 400?
> > > > > > > 
> > > > > > > If so, this makes me think why we don't just make this kind of rank the
> > > > > > > device ID?  Or I missed something?
> > > > > > > 
> > > > > > > Or, both device IDs and rank values are not fixed?  Why do we need that
> > > > > > > kind of flexibility?  Sorry, I may not undersand all requirements.  
> > > > > > 
> > > > > > Even though the proposal doesn't mandate a particular device ID
> > > > > > numbering, I expect that the device IDs will be relatively stable once
> > > > > > a kernel implementation is chosen. For example, it is likely that DRAM
> > > > > > nodes with CPUs will always be on memtier1, no matter how many tiers
> > > > > > are higher or lower than these nodes.
> > > > > > 
> > > > > > We don't need to mandate a particular way to assign the rank values,
> > > > > > either.  What matters is the relative order and some reasonable gap
> > > > > > between these values.
> > > > > > 
> > > > > > The rank approach allows us to keep memtier device IDs relatively
> > > > > > stable even though we may change the tier ordering among them.  Its
> > > > > > flexibility can have many other uses as well.  For example, we can
> > > > > > insert a new memtier into the tier hierarchy for a new set of nodes
> > > > > > without affecting the node assignment of any existing memtier,
> > > > > > provided that there is enough gap in the rank values for the new
> > > > > > memtier.
> > > > > > 
> > > > > > Using the rank value directly as the device ID has some disadvantages:
> > > > > > - It is kind of unconventional to number devices in this way.
> > > > > > - We cannot assign DRAM nodes with CPUs with a specific memtier device
> > > > > > ID (even though this is not mandated by the "rank" proposal, I expect
> > > > > > the device will likely always be memtier1 in practice).
> > > > > > - It is possible that we may eventually allow the rank value to be
> > > > > > modified as a way to adjust the tier ordering.  We cannot do that
> > > > > > easily for device IDs.  
> > > > > 
> > > > > OK.  I can understand that sometimes it's more natural to change the
> > > > > order of a set of nodes with same memory types (and data plane path)
> > > > > together instead of change that one by one for each node.
> > > > > 
> > > > > It appears that the memtierX device becomes kind of memory types (with
> > > > > data plane path considered for latency/throughput too).  We can assign a
> > > > > memory type for a node, and change the order between memory types.  If
> > > > > so, we need to allow multiple memtiers have same rank value.  
> > > > 
> > > > Jonathan mentioned this feature that multiple memtiers share the same
> > > > rank as well.  It can be a convenient feature to have.  For
> > > > simplicity, it should be fine to leave out this feature initially.  
> > > 
> > > OK.  What do you think about the concept of memory types?  You have
> > > mentioned that in memtierX directory, we can put latency/throughput,
> > > etc.  IMHO, these only make sense for one type of memory.  And it's
> > > natural for all memory nodes onlined by a driver to be same memory type.  
> > 
> > I think this is not always true. For example, a dax kmem driver can
> > online both pmem and non-pmem dax devices as system memory.
> 
> A CXL Type 3 memory driver is also responsible for memories of different
> types with very different characteristics.  It would need to assign memory
> into at least a few different tiers - potentially many different ones.

OK.  My original words weren't correct.  I should have said that "the
memory types should be determined by the drivers that online them".
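
A purely illustrative userspace sketch of that idea (the types, rank
values and helper below are invented, not kernel interfaces): the
driver that onlines the memory classifies each device and requests a
tier rank accordingly.

#include <stdio.h>

enum mem_type { MEM_TYPE_DRAM, MEM_TYPE_CXL_DRAM, MEM_TYPE_PMEM };

struct mem_device {
        int node;
        enum mem_type type;
};

/* hypothetical per-type default ranks chosen by the driver, not an ABI */
static int default_rank_for(enum mem_type type)
{
        switch (type) {
        case MEM_TYPE_CXL_DRAM:
                return 250;     /* may be close to on-board DRAM (rank 200) */
        case MEM_TYPE_PMEM:
                return 300;
        case MEM_TYPE_DRAM:
        default:
                return 200;
        }
}

int main(void)
{
        /* one driver (think dax kmem / CXL type 3) onlining dissimilar media */
        struct mem_device devs[] = {
                { .node = 2, .type = MEM_TYPE_CXL_DRAM },
                { .node = 3, .type = MEM_TYPE_PMEM },
        };
        unsigned long i;

        for (i = 0; i < sizeof(devs) / sizeof(devs[0]); i++)
                printf("node%d -> request a memtier with rank %d\n",
                       devs[i].node, default_rank_for(devs[i].type));
        return 0;
}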

> > 
> > > That is, drivers (including firmware drivers) will register memory types
> > > and put nodes into it.  Base on memory types, "rank" (related to for
> > > example latency) determined the real memory tiers.
> > > 
> > > If you think it's a good idea, we can rename memtierX to memory_typeX.
> > > But memory type may be not a good name, DRAM in local memory controler
> > > and DRAM in remote CXL may have quite different performance metric.  Or
> > > memory_class to avoid the possible confusion?  
> > 
> > Memory types (e.g. GPU, DRAM, PMEM, etc) can be useful information to
> > help initialize the memory tiers of NUMA nodes. But I think memory
> > type is not a substitute for memory tier.  We still need to define
> > memory tiers on top of NUMA node groups based on memory types (for
> > example, some may want to group GPU and DRAM into the same tier,
> > others may want separate tiers for GPU/DRAM).  It is simpler to keep
> > the sysfs interface to just memory tiers and implement memory types as
> > internal device attributes if needed.
> > 
> > To avoid confusion, we can require that the rank value is unique for
> > each memtier device.  This should make it clear that each memtier
> > device represents a distinct memory tier. 
> 
> I don't mind that for a first implementation, but I can see an advantage
> in the flexibility of being able to have multiple tiers fuse by
> giving them the same rank value, if we ever make rank writeable after
> creation.  Given that no userspace is going to rely on 'failure' to create
> ranks with the same value, the flexibility to make this change later
> without ABI compatibility problems is there.

IMHO, I don't think it's a good idea to have 2 memory tiers with the same
rank value.  That makes the concept of a tier confusing.

Best Regards,
Huang, Ying

> > We can still put
> > latency/throughput values into each memtierN directory.  Such values
> > need to be specified as a range to better accommodate possibly varied
> > performance of the devices within the same memory tier.
> 
> I'd postpone adding this sort of information to the tiers
> until we need it.  Most of the info can be established by userspace anyway,
> so why complicate this interface? If there are strong use cases for the info,
> we can add it later.
>
> Thanks,
> 
> Jonathan
> 
> > 
> > > Best Regards,
> > > Huang, Ying
> > >  
> > > > > Best Regards,
> > > > > Huang, Ying
> > > > >  
> > > > > > >  
> > > > > > > > > I think you may need to send v3 to make sure everyone is at the same
> > > > > > > > > page.  
> > > > > > > > 
> > > > > > > > Will do it shortly.  
> > > > > > > 
> > > > > > > Good!  Thanks!
> > > > > > > 
> > > > > > > Best Regards,
> > > > > > > Huang, Ying
> > > > > > >  
> > > > > > > > > Best Regards,
> > > > > > > > > Huang, Ying
> > > > > > > > >  
> > > > > > > > > > > Best Regards,
> > > > > > > > > > > Huang, Ying
> > > > > > > > > > >  
> > > > > > > > > > > > > > The tier order: memtier128 -> memtier0 -> memtier1 -> memtier2
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > $ cat /sys/devices/system/memtier/memtier*/nodelist
> > > > > > > > > > > > > > 0
> > > > > > > > > > > > > > 2
> > > > > > > > > > > > > > 3
> > > > > > > > > > > > > > 1
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > $ ls -l /sys/devices/system/node/node*/memtier
> > > > > > > > > > > > > > /sys/devices/system/node/node0/memtier -> /sys/devices/system/memtier/memtier0
> > > > > > > > > > > > > > /sys/devices/system/node/node1/memtier -> /sys/devices/system/memtier/memtier128
> > > > > > > > > > > > > > /sys/devices/system/node/node2/memtier -> /sys/devices/system/memtier/memtier1
> > > > > > > > > > > > > > /sys/devices/system/node/node3/memtier -> /sys/devices/system/memtier/memtier2
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > To override the memory tier of a node, we can use a new, write-only,
> > > > > > > > > > > > > > per-node interface file:
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > /sys/devices/system/node/nodeN/set_memtier
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > e.g.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > $ echo "memtier128" > sys/devices/system/node/node1/set_memtier  
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I prefer the original proposal to make nodeX/memtier a normal file to
> > > > > > > > > > > > > hold memtier devicde ID instead of a link.  
> > > > > > > > > > > > 
> > > > > > > > > > > > OK. We don't have to use a symlink.
> > > > > > > > > > > >  
> > > > > > > > > > > > > Best Regards,
> > > > > > > > > > > > > Huang, Ying
> > > > > > > > > > > > >  
> > > > > > > > > > > > > > Any comments?
> > > > > > > > > > > > > >  
> > > > > > > > > > > > > > > Jonathan


Thread overview: 47+ messages
2022-05-12  6:22 RFC: Memory Tiering Kernel Interfaces (v2) Wei Xu
2022-05-12  7:03 ` ying.huang
2022-05-12  7:12   ` Aneesh Kumar K V
2022-05-12  7:18     ` ying.huang
2022-05-12  7:22     ` Wei Xu
2022-05-12  7:36       ` Aneesh Kumar K.V
2022-05-12  8:15         ` Wei Xu
2022-05-12  8:37           ` ying.huang
2022-05-13  2:52             ` ying.huang
2022-05-13  7:00               ` Wei Xu
2022-05-16  1:57                 ` ying.huang
2022-05-12 21:12           ` Tim Chen
2022-05-12 21:31             ` Wei Xu
2022-05-12 15:00 ` Jonathan Cameron
2022-05-18  7:09   ` Wei Xu
2022-05-18 12:00     ` Jonathan Cameron
2022-05-24  7:36       ` Wei Xu
2022-05-24 13:26         ` Aneesh Kumar K.V
2022-05-25  5:27           ` Wei Xu
2022-05-25  7:47             ` Alistair Popple
2022-05-25 11:48               ` Jonathan Cameron
2022-05-25 15:32                 ` Wei Xu
2022-05-20  3:06     ` Ying Huang
2022-05-24  7:04       ` Wei Xu
2022-05-24  8:24         ` Ying Huang
2022-05-25  5:32           ` Wei Xu
2022-05-25  9:03             ` Ying Huang
2022-05-25 10:01               ` Aneesh Kumar K V
2022-05-25 11:36                 ` Mika Penttilä
2022-05-25 15:33                   ` Wei Xu
2022-05-25 17:27                 ` Wei Xu
2022-05-26  9:32                   ` Jonathan Cameron
2022-05-26 20:30                     ` Wei Xu
2022-05-27  9:26                   ` Aneesh Kumar K V
2022-05-25 15:36               ` Wei Xu
2022-05-26  1:09                 ` Ying Huang
2022-05-26  3:53                   ` Wei Xu
2022-05-26  6:54                     ` Ying Huang
2022-05-26  7:08                       ` Wei Xu
2022-05-26  7:39                         ` Ying Huang
2022-05-26 20:55                           ` Wei Xu
2022-05-27  9:10                             ` Jonathan Cameron
2022-05-30  6:54                               ` Ying Huang
2022-05-13  3:25 ` ying.huang
2022-05-13  6:36   ` Wei Xu
2022-05-13  7:04     ` ying.huang
2022-05-13  7:21       ` Wei Xu
