* RFC: Memory Tiering Kernel Interfaces
@ 2022-04-30  2:10 Wei Xu
From: Wei Xu @ 2022-04-30  2:10 UTC (permalink / raw)
  To: Andrew Morton, Dave Hansen, Huang Ying, Dan Williams, Yang Shi,
	Linux MM, Greg Thelen, Aneesh Kumar K.V, Jagdish Gediya,
	Linux Kernel Mailing List, Alistair Popple, Davidlohr Bueso,
	Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang,
	Jonathan.Cameron

The current kernel has basic memory tiering support: inactive
pages on a higher tier NUMA node can be migrated (demoted) to a lower
tier NUMA node to make room for new allocations on the higher tier
NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
migrated (promoted) to a higher tier NUMA node to improve
performance.

A tiering relationship between NUMA nodes, in the form of demotion
paths, is created during kernel initialization and updated when a NUMA
node is hot-added or hot-removed.  The current implementation puts all
nodes with CPUs into the top tier, and then builds the tiering
hierarchy tier-by-tier by establishing the per-node demotion targets
based on the distances between nodes.

The current memory tiering interface needs to be improved to address
several important use cases:

* The current tiering initialization code always places each
  memory-only NUMA node in a lower tier.  But a memory-only NUMA
  node may have a high-performance memory device (e.g. a DRAM
  device attached via CXL.mem or a DRAM-backed memory-only node on
  a virtual machine) and should be put into the top tier.

* The current tiering hierarchy always puts CPU nodes into the top
  tier.  But on a system with HBM (e.g. GPU memory) devices, these
  memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
  with CPUs are better placed in the next lower tier.

* Also, because the current tiering hierarchy always puts CPU nodes
  into the top tier, when a CPU is hot-added (or hot-removed) and
  turns a memory node from a CPU-less node into a CPU node (or vice
  versa), the memory tiering hierarchy changes, even though no
  memory node is added or removed.  This can make the tiering
  hierarchy much less stable.

* Pages on a higher tier node can only be demoted to selected nodes
  on the next lower tier, not to any other node in that tier.  This
  strict, hard-coded demotion order does not work in all use cases
  (e.g. some use cases may want to allow cross-socket demotion to
  another node in the same demotion tier as a fallback when the
  preferred demotion node is out of space), and has resulted in a
  feature request for an interface to override the system-wide,
  per-node demotion order from userspace.

* There are no interfaces for the userspace to learn about the memory
  tiering hierarchy in order to optimize its memory allocations.

I'd like to propose revised memory tiering kernel interfaces based on
the discussions in the threads:

- https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
- https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/


Sysfs Interfaces
================

* /sys/devices/system/node/memory_tiers

  Format: node list (one tier per line, in the tier order)

  When read, list memory nodes by tiers.

  When written (one tier per line), take the user-provided node-tier
  assignment as the new tiering hierarchy and rebuild the per-node
  demotion order.  It is allowed to override only the top tiers, in
  which case the kernel will establish the lower tiers automatically.
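
  As an illustration of the file format only, here is a minimal
  userspace sketch in Python that parses and formats the proposed
  "one tier per line" text.  It works on strings and does not touch
  sysfs.  All function names are hypothetical, and the comma-separated
  range syntax is an assumption borrowed from the usual kernel node
  list format; the RFC itself only shows forms like "0-1" and "2".

```python
def parse_node_list(text):
    """Parse a node list such as "0-1,4" into a sorted list of ints."""
    nodes = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            nodes.update(range(int(lo), int(hi) + 1))
        else:
            nodes.add(int(part))
    return sorted(nodes)


def parse_memory_tiers(text):
    """Parse one tier per line, top tier first; blank lines are skipped."""
    return [parse_node_list(line) for line in text.splitlines() if line.strip()]


def format_node_list(nodes):
    """Inverse of parse_node_list: collapse consecutive nodes into ranges."""
    nodes = sorted(set(nodes))
    parts = []
    i = 0
    while i < len(nodes):
        j = i
        while j + 1 < len(nodes) and nodes[j + 1] == nodes[j] + 1:
            j += 1
        parts.append(str(nodes[i]) if i == j else f"{nodes[i]}-{nodes[j]}")
        i = j + 1
    return ",".join(parts)


def format_memory_tiers(tiers):
    """Format a list of tiers (top tier first) as one node list per line."""
    return "\n".join(format_node_list(t) for t in tiers) + "\n"
```

  For instance, the override in Example 5 would be the three-line text
  produced by format_memory_tiers([[1], [0], [2]]).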


Kernel Representation
=====================

* nodemask_t node_states[N_TOPTIER_MEMORY]

  Store all top-tier memory nodes.

* nodemask_t memory_tiers[MAX_TIERS]

  Store memory nodes by tiers.

* struct demotion_nodes node_demotion[]

  where: struct demotion_nodes { nodemask_t preferred; nodemask_t allowed; }

  For a node N:

  node_demotion[N].preferred lists all preferred demotion targets;

  node_demotion[N].allowed lists all allowed demotion targets
  (initialized to be all the nodes in the same demotion tier).
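
  To make the three structures concrete, the following Python sketch
  models them for a system with DRAM nodes 0-1 and PMEM nodes 2-3.
  It is a userspace model, not kernel code: a nodemask_t is
  approximated by an integer bitmask, and the helper names nodemask()
  and nodes_in() are hypothetical.

```python
from dataclasses import dataclass


def nodemask(*nodes):
    """Build a bitmask with one bit set per node id."""
    mask = 0
    for n in nodes:
        mask |= 1 << n
    return mask


def nodes_in(mask):
    """List the node ids whose bits are set in the mask."""
    return [n for n in range(mask.bit_length()) if (mask >> n) & 1]


@dataclass
class DemotionNodes:
    preferred: int = 0  # stands in for nodemask_t preferred
    allowed: int = 0    # stands in for nodemask_t allowed


# memory_tiers[]: one mask per tier, top tier first.
memory_tiers = [nodemask(0, 1), nodemask(2, 3)]

# node_states[N_TOPTIER_MEMORY] holds the same nodes as the first tier.
n_toptier_memory = memory_tiers[0]

# node_demotion[]: per-node preferred and allowed demotion targets.
node_demotion = {
    0: DemotionNodes(preferred=nodemask(2), allowed=nodemask(2, 3)),
    1: DemotionNodes(preferred=nodemask(3), allowed=nodemask(2, 3)),
    2: DemotionNodes(),
    3: DemotionNodes(),
}
```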


Tiering Hierarchy Initialization
================================

By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).

A device driver can remove its memory nodes from the top tier, e.g.
a dax driver can remove PMEM nodes from the top tier.

The kernel builds the memory tiering hierarchy and per-node demotion
order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
nodes with the best distance in the next lower tier are assigned to
node_demotion[N].preferred, and all the nodes in the next lower tier
are assigned to node_demotion[N].allowed.

node_demotion[N].preferred can be empty if no preferred demotion node
is available for node N.
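
The tier-by-tier construction can be sketched in Python as below.
This is a userspace model under a stated assumption, not the proposed
kernel implementation: the text above does not spell out exactly when
a node ends up with an empty preferred set.  The sketch assumes a
"mutual nearest" rule (a lower-tier node T is preferred for N only if
T is at the best distance from N and N is also at the best distance
from T among the nodes of N's tier).  That rule was chosen only
because it reproduces the node_demotion[] tables in the Examples
section; the function name is hypothetical.

```python
def build_demotion(tiers, distance):
    """Return {node: (preferred, allowed)} for an ordered list of tiers.

    tiers:    list of non-empty node lists, top tier first
    distance: distance[a][b] is the NUMA distance from node a to node b
    """
    # Nodes in the lowest tier keep empty preferred and allowed lists.
    demotion = {n: ([], []) for tier in tiers for n in tier}
    for upper, lower in zip(tiers, tiers[1:]):
        for n in upper:
            # Every node of the next lower tier is an allowed target.
            allowed = sorted(lower)
            best = min(distance[n][t] for t in lower)
            preferred = []
            for t in lower:
                if distance[n][t] != best:
                    continue
                # Assumed rule: t must also see n as one of its nearest
                # nodes in the upper tier.
                if distance[t][n] == min(distance[t][u] for u in upper):
                    preferred.append(t)
            demotion[n] = (sorted(preferred), allowed)
    return demotion
```

With the distance table of Example 2 and tiers [[0, 1], [2]], this
model gives node 0 the lists [2], [2] and node 1 the lists [], [2].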

If the userspace overrides the tiers via the memory_tiers sysfs
interface, the kernel then only rebuilds the per-node demotion order
accordingly.

Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
node.


Memory Allocation for Demotion
==============================

When allocating a new demotion target page, both a preferred node
and the allowed nodemask are provided to the allocation function.
The default kernel allocation fallback order is used to allocate the
page from the specified node and nodemask.

The mempolicy of the cpuset, VMA and owner task of the source page can
be set to refine the demotion nodemask, e.g. to prevent demotion or
to select a particular allowed node as the demotion target.
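
A rough Python model of this selection step follows.  It is a sketch
under assumptions, not the kernel allocator: the "default kernel
allocation fallback order" is approximated as preferred nodes first,
then the remaining allowed nodes by increasing distance, and
free_pages is a hypothetical per-node counter standing in for whether
an allocation on that node would succeed.  The function and parameter
names are made up for illustration.

```python
def pick_demotion_node(src, demotion, distance, free_pages, policy_nodes=None):
    """Return the node to allocate a demotion target page on, or None.

    demotion:     {node: (preferred, allowed)} as lists of node ids
    policy_nodes: optional set of nodes permitted by mempolicy/cpuset
    """
    preferred, allowed = demotion[src]
    candidates = set(allowed)
    if policy_nodes is not None:
        # The policy can only narrow the allowed nodemask.
        candidates &= set(policy_nodes)
    # Preferred nodes first, then the other candidates by distance.
    order = sorted(
        candidates,
        key=lambda t: (t not in preferred, distance[src][t], t),
    )
    for t in order:
        if free_pages.get(t, 0) > 0:
            return t
    return None  # no eligible node has space, so the page is not demoted
```

With the setup of Example 1 below, a policy equivalent to
cpuset.mems=0,2 leaves node 2 as the only candidate for pages on
node 0, so demotion to node 3 is never attempted.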


Examples
========

* Example 1:
  Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.

  Node 0 has node 2 as the preferred demotion target and can also
  fall back to node 3 for demotion.

  Node 1 has node 3 as the preferred demotion target and can also
  fall back to node 2 for demotion.

  Set mempolicy to prevent cross-socket demotion and memory access,
  e.g. cpuset.mems=0,2

node distances:
node   0    1    2    3
   0  10   20   30   40
   1  20   10   40   30
   2  30   40   10   40
   3  40   30   40   10

/sys/devices/system/node/memory_tiers
0-1
2-3

N_TOPTIER_MEMORY: 0-1

node_demotion[]:
  0: [2], [2-3]
  1: [3], [2-3]
  2: [],  []
  3: [],  []

* Example 2:
  Node 0 & 1 are DRAM nodes.
  Node 2 is a PMEM node and closer to node 0.

  Node 0 has node 2 as the preferred and only demotion target.

  Node 1 has no preferred demotion target, but can still demote
  to node 2.

  Set mempolicy to prevent cross-socket demotion and memory access,
  e.g. cpuset.mems=0,2

node distances:
node   0    1    2
   0  10   20   30
   1  20   10   40
   2  30   40   10

/sys/devices/system/node/memory_tiers
0-1
2

N_TOPTIER_MEMORY: 0-1

node_demotion[]:
  0: [2], [2]
  1: [],  [2]
  2: [],  []


* Example 3:
  Node 0 & 1 are DRAM nodes.
  Node 2 is a PMEM node and has the same distance to node 0 & 1.

  Node 0 has node 2 as the preferred and only demotion target.

  Node 1 has node 2 as the preferred and only demotion target.

node distances:
node   0    1    2
   0  10   20   30
   1  20   10   30
   2  30   30   10

/sys/devices/system/node/memory_tiers
0-1
2

N_TOPTIER_MEMORY: 0-1

node_demotion[]:
  0: [2], [2]
  1: [2], [2]
  2: [],  []


* Example 4:
  Node 0 & 1 are DRAM nodes, node 2 is a memory-only DRAM node.

  All nodes are top-tier.

node distances:
node   0    1    2
   0  10   20   30
   1  20   10   30
   2  30   30   10

/sys/devices/system/node/memory_tiers
0-2

N_TOPTIER_MEMORY: 0-2

node_demotion[]:
  0: [],  []
  1: [],  []
  2: [],  []


* Example 5:
  Node 0 is a DRAM node with CPU.
  Node 1 is an HBM node.
  Node 2 is a PMEM node.

  With userspace override, node 1 is the top tier and has node 0 as
  the preferred and only demotion target.

  Node 0 is in the second tier, tier 1, and has node 2 as the
  preferred and only demotion target.

  Node 2 is in the lowest tier, tier 2, and has no demotion targets.

node distances:
node   0    1    2
   0  10   21   30
   1  21   10   40
   2  30   40   10

/sys/devices/system/node/memory_tiers (userspace override)
1
0
2

N_TOPTIER_MEMORY: 1

node_demotion[]:
  0: [2], [2]
  1: [0], [0]
  2: [],  []

-- Wei



