* RFC: Memory Tiering Kernel Interfaces
@ 2022-04-30  2:10 Wei Xu
  2022-04-30  3:59 ` Yang Shi
                   ` (6 more replies)
  0 siblings, 7 replies; 57+ messages in thread
From: Wei Xu @ 2022-04-30  2:10 UTC (permalink / raw)
  To: Andrew Morton, Dave Hansen, Huang Ying, Dan Williams, Yang Shi,
	Linux MM, Greg Thelen, Aneesh Kumar K.V, Jagdish Gediya,
	Linux Kernel Mailing List, Alistair Popple, Davidlohr Bueso,
	Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang,
	Jonathan.Cameron

The current kernel has basic memory tiering support: Inactive
pages on a higher tier NUMA node can be migrated (demoted) to a lower
tier NUMA node to make room for new allocations on the higher tier
NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
migrated (promoted) to a higher tier NUMA node to improve
performance.

A tiering relationship between NUMA nodes in the form of a demotion
path is created during kernel initialization and updated when a NUMA
node is hot-added or hot-removed.  The current implementation puts all
nodes with CPUs into the top tier, and then builds the tiering
hierarchy tier-by-tier by establishing the per-node demotion targets
based on the distances between nodes.

The current memory tiering interface needs to be improved to address
several important use cases:

* The current tiering initialization code always initializes
  each memory-only NUMA node into a lower tier.  But a memory-only
  NUMA node may have a high performance memory device (e.g. a DRAM
  device attached via CXL.mem or a DRAM-backed memory-only node on
  a virtual machine) and should be put into the top tier.

* The current tiering hierarchy always puts CPU nodes into the top
  tier. But on a system with HBM (e.g. GPU memory) devices, these
  memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
  with CPUs are better placed in the next lower tier.

* Also because the current tiering hierarchy always puts CPU nodes
  into the top tier, when a CPU is hot-added (or hot-removed) and
  changes a memory node from CPU-less to a CPU node (or vice versa),
  the memory tiering hierarchy gets changed, even though no memory
  node is added or removed.  This can make the tiering hierarchy much
  less stable.

* A higher tier node can only be demoted to selected nodes on the
  next lower tier, not to any other node in the next lower tier.  This
  strict, hard-coded demotion order does not work in all use cases
  (e.g. some use cases may want to allow cross-socket demotion to
  another node in the same demotion tier as a fallback when the
  preferred demotion node is out of space), and has resulted in a
  feature request for an interface to override the system-wide,
  per-node demotion order from userspace.

* There are no interfaces for userspace to learn about the memory
  tiering hierarchy in order to optimize its memory allocations.

I'd like to propose revised memory tiering kernel interfaces based on
the discussions in the threads:

- https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
- https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/


Sysfs Interfaces
================

* /sys/devices/system/node/memory_tiers

  Format: node list (one tier per line, in the tier order)

  When read, list memory nodes by tiers.

  When written (one tier per line), take the user-provided node-tier
  assignment as the new tiering hierarchy and rebuild the per-node
  demotion order.  It is allowed to only override the top tiers, in
  which case the kernel will establish the lower tiers automatically.
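
  For illustration only, here is a rough (untested) sketch of what the
  read side of this file could look like, using the existing nodemask
  printf helpers.  memory_tiers[] and nr_memory_tiers refer to the
  kernel representation proposed below; none of this is existing code.

  static ssize_t memory_tiers_show(struct device *dev,
                                   struct device_attribute *attr, char *buf)
  {
          int tier, len = 0;

          /* One tier per line, in tier order (tier 0 first). */
          for (tier = 0; tier < nr_memory_tiers; tier++)
                  len += sysfs_emit_at(buf, len, "%*pbl\n",
                                       nodemask_pr_args(&memory_tiers[tier]));
          return len;
  }
  /* The write side, which parses one node list per line, is omitted here. */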


Kernel Representation
=====================

* nodemask_t node_states[N_TOPTIER_MEMORY]

  Store all top-tier memory nodes.

* nodemask_t memory_tiers[MAX_TIERS]

  Store memory nodes by tiers.

* struct demotion_nodes node_demotion[]

  where: struct demotion_nodes { nodemask_t preferred; nodemask_t allowed; }

  For a node N:

  node_demotion[N].preferred lists all preferred demotion targets;

  node_demotion[N].allowed lists all allowed demotion targets
  (initialized to be all the nodes in the same demotion tier).
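
A sketch of the above in C (the declarations simply restate the proposal;
the MAX_TIERS value and the use of MAX_NUMNODES for sizing are assumptions):

#define MAX_TIERS 3                     /* assumed bound, not finalized */

/* Memory nodes in each tier; tier 0 is the top tier. */
nodemask_t memory_tiers[MAX_TIERS];

struct demotion_nodes {
        nodemask_t preferred;   /* best-distance targets in the next lower tier */
        nodemask_t allowed;     /* all nodes in the next lower tier */
};

/* Per-node demotion targets, indexed by source node id. */
struct demotion_nodes node_demotion[MAX_NUMNODES];

/*
 * Top-tier nodes are also tracked via the existing node_states[] array,
 * extended with a new N_TOPTIER_MEMORY state.
 */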


Tiering Hierarchy Initialization
================================

By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).

A device driver can remove its memory nodes from the top tier, e.g.
a dax driver can remove PMEM nodes from the top tier.

The kernel builds the memory tiering hierarchy and per-node demotion
order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
best distance nodes in the next lower tier are assigned to
node_demotion[N].preferred and all the nodes in the next lower tier
are assigned to node_demotion[N].allowed.

node_demotion[N].preferred can be empty if no preferred demotion node
is available for node N.
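
In pseudo-C, the per-node demotion order could be built roughly as
follows (a sketch only; node_distance() is the existing kernel helper,
the rest uses the representation above):

static void build_demotion_order(void)
{
        int tier, node, target, best;

        for (node = 0; node < MAX_NUMNODES; node++) {
                nodes_clear(node_demotion[node].preferred);
                nodes_clear(node_demotion[node].allowed);
        }

        for (tier = 0; tier < MAX_TIERS - 1; tier++) {
                for_each_node_mask(node, memory_tiers[tier]) {
                        /* All nodes in the next lower tier are allowed. */
                        node_demotion[node].allowed = memory_tiers[tier + 1];

                        /* The best-distance nodes are preferred. */
                        best = INT_MAX;
                        for_each_node_mask(target, memory_tiers[tier + 1])
                                best = min(best, node_distance(node, target));
                        for_each_node_mask(target, memory_tiers[tier + 1])
                                if (node_distance(node, target) == best)
                                        node_set(target,
                                                 node_demotion[node].preferred);
                }
        }
}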

If the userspace overrides the tiers via the memory_tiers sysfs
interface, the kernel then only rebuilds the per-node demotion order
accordingly.

Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
node.


Memory Allocation for Demotion
==============================

When allocating a new demotion target page, both a preferred node
and the allowed nodemask are provided to the allocation function.
The default kernel allocation fallback order is used to allocate the
page from the specified node and nodemask.

The mempolicy of the cpuset, VMA and owner task of the source page can
be set to refine the demotion nodemask, e.g. to prevent demotion or
select a particular allowed node as the demotion target.
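
A sketch of the allocation step (the helper name and GFP flags are
chosen for illustration; __alloc_pages() and first_node() are the
existing kernel APIs):

static struct page *alloc_demote_target(int node, unsigned int order)
{
        gfp_t gfp = GFP_HIGHUSER_MOVABLE | __GFP_NOWARN;  /* illustrative flags */
        int preferred = first_node(node_demotion[node].preferred);

        /* With no preferred target, let the allowed nodemask decide. */
        if (preferred >= MAX_NUMNODES)
                preferred = node;

        return __alloc_pages(gfp, order, preferred,
                             &node_demotion[node].allowed);
}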


Examples
========

* Example 1:
  Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.

  Node 0 has node 2 as the preferred demotion target and can also
  fallback demotion to node 3.

  Node 1 has node 3 as the preferred demotion target and can also
  fallback demotion to node 2.

  Set mempolicy to prevent cross-socket demotion and memory access,
  e.g. cpuset.mems=0,2

node distances:
node   0    1    2    3
   0  10   20   30   40
   1  20   10   40   30
   2  30   40   10   40
   3  40   30   40   10

/sys/devices/system/node/memory_tiers
0-1
2-3

N_TOPTIER_MEMORY: 0-1

node_demotion[]:
  0: [2], [2-3]
  1: [3], [2-3]
  2: [],  []
  3: [],  []

* Example 2:
  Node 0 & 1 are DRAM nodes.
  Node 2 is a PMEM node and closer to node 0.

  Node 0 has node 2 as the preferred and only demotion target.

  Node 1 has no preferred demotion target, but can still demote
  to node 2.

  Set mempolicy to prevent cross-socket demotion and memory access,
  e.g. cpuset.mems=0,2

node distances:
node   0    1    2
   0  10   20   30
   1  20   10   40
   2  30   40   10

/sys/devices/system/node/memory_tiers
0-1
2

N_TOPTIER_MEMORY: 0-1

node_demotion[]:
  0: [2], [2]
  1: [],  [2]
  2: [],  []


* Example 3:
  Node 0 & 1 are DRAM nodes.
  Node 2 is a PMEM node and has the same distance to node 0 & 1.

  Node 0 has node 2 as the preferred and only demotion target.

  Node 1 has node 2 as the preferred and only demotion target.

node distances:
node   0    1    2
   0  10   20   30
   1  20   10   30
   2  30   30   10

/sys/devices/system/node/memory_tiers
0-1
2

N_TOPTIER_MEMORY: 0-1

node_demotion[]:
  0: [2], [2]
  1: [2], [2]
  2: [],  []


* Example 4:
  Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.

  All nodes are top-tier.

node distances:
node   0    1    2
   0  10   20   30
   1  20   10   30
   2  30   30   10

/sys/devices/system/node/memory_tiers
0-2

N_TOPTIER_MEMORY: 0-2

node_demotion[]:
  0: [],  []
  1: [],  []
  2: [],  []


* Example 5:
  Node 0 is a DRAM node with CPU.
  Node 1 is a HBM node.
  Node 2 is a PMEM node.

  With userspace override, node 1 is the top tier and has node 0 as
  the preferred and only demotion target.

  Node 0 is in the second tier, tier 1, and has node 2 as the
  preferred and only demotion target.

  Node 2 is in the lowest tier, tier 2, and has no demotion targets.

node distances:
node   0    1    2
   0  10   21   30
   1  21   10   40
   2  30   40   10

/sys/devices/system/node/memory_tiers (userspace override)
1
0
2

N_TOPTIER_MEMORY: 1

node_demotion[]:
  0: [2], [2]
  1: [0], [0]
  2: [],  []

-- Wei



* Re: RFC: Memory Tiering Kernel Interfaces
  2022-04-30  2:10 RFC: Memory Tiering Kernel Interfaces Wei Xu
@ 2022-04-30  3:59 ` Yang Shi
  2022-04-30  6:37   ` Wei Xu
  2022-05-01 18:35   ` Dan Williams
  2022-05-01 17:58 ` Davidlohr Bueso
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 57+ messages in thread
From: Yang Shi @ 2022-04-30  3:59 UTC (permalink / raw)
  To: Wei Xu
  Cc: Andrew Morton, Dave Hansen, Huang Ying, Dan Williams, Linux MM,
	Greg Thelen, Aneesh Kumar K.V, Jagdish Gediya,
	Linux Kernel Mailing List, Alistair Popple, Davidlohr Bueso,
	Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang,
	Jonathan.Cameron

Hi Wei,

Thanks for the nice writing. Please see the below inline comments.

On Fri, Apr 29, 2022 at 7:10 PM Wei Xu <weixugc@google.com> wrote:
>
> The current kernel has the basic memory tiering support: Inactive
> pages on a higher tier NUMA node can be migrated (demoted) to a lower
> tier NUMA node to make room for new allocations on the higher tier
> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> migrated (promoted) to a higher tier NUMA node to improve the
> performance.
>
> A tiering relationship between NUMA nodes in the form of demotion path
> is created during the kernel initialization and updated when a NUMA
> node is hot-added or hot-removed.  The current implementation puts all
> nodes with CPU into the top tier, and then builds the tiering hierarchy
> tier-by-tier by establishing the per-node demotion targets based on
> the distances between nodes.
>
> The current memory tiering interface needs to be improved to address
> several important use cases:
>
> * The current tiering initialization code always initializes
>   each memory-only NUMA node into a lower tier.  But a memory-only
>   NUMA node may have a high performance memory device (e.g. a DRAM
>   device attached via CXL.mem or a DRAM-backed memory-only node on
>   a virtual machine) and should be put into the top tier.
>
> * The current tiering hierarchy always puts CPU nodes into the top
>   tier. But on a system with HBM (e.g. GPU memory) devices, these
>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>   with CPUs are better to be placed into the next lower tier.
>
> * Also because the current tiering hierarchy always puts CPU nodes
>   into the top tier, when a CPU is hot-added (or hot-removed) and
>   triggers a memory node from CPU-less into a CPU node (or vice
>   versa), the memory tiering hierarchy gets changed, even though no
>   memory node is added or removed.  This can make the tiering
>   hierarchy much less stable.

I'd prefer that the firmware build up the tier topology and then pass
it to the kernel, so that the kernel knows which nodes are in which
tiers. No matter which nodes are hot-removed/hot-added, they always
stay in the tiers defined by the firmware. I think this is important
information, like NUMA distances. NUMA distance alone can't satisfy
all the use cases IMHO.

>
> * A higher tier node can only be demoted to selected nodes on the
>   next lower tier, not any other node from the next lower tier.  This
>   strict, hard-coded demotion order does not work in all use cases
>   (e.g. some use cases may want to allow cross-socket demotion to
>   another node in the same demotion tier as a fallback when the
>   preferred demotion node is out of space), and has resulted in the
>   feature request for an interface to override the system-wide,
>   per-node demotion order from the userspace.
>
> * There are no interfaces for the userspace to learn about the memory
>   tiering hierarchy in order to optimize its memory allocations.
>
> I'd like to propose revised memory tiering kernel interfaces based on
> the discussions in the threads:
>
> - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
>
>
> Sysfs Interfaces
> ================
>
> * /sys/devices/system/node/memory_tiers
>
>   Format: node list (one tier per line, in the tier order)
>
>   When read, list memory nodes by tiers.
>
>   When written (one tier per line), take the user-provided node-tier
>   assignment as the new tiering hierarchy and rebuild the per-node
>   demotion order.  It is allowed to only override the top tiers, in
>   which cases, the kernel will establish the lower tiers automatically.

TBH I still think it is too soon to define proper user-visible
interfaces for now, particularly for overrides.

>
>
> Kernel Representation
> =====================
>
> * nodemask_t node_states[N_TOPTIER_MEMORY]
>
>   Store all top-tier memory nodes.
>
> * nodemask_t memory_tiers[MAX_TIERS]
>
>   Store memory nodes by tiers.

I'd prefer nodemask_t node_states[MAX_TIERS][]. Tier 0 is always the
top tier. The kernel could build this from the topology provided by
the firmware.

>
> * struct demotion_nodes node_demotion[]
>
>   where: struct demotion_nodes { nodemask_t preferred; nodemask_t allowed; }
>
>   For a node N:
>
>   node_demotion[N].preferred lists all preferred demotion targets;
>
>   node_demotion[N].allowed lists all allowed demotion targets
>   (initialized to be all the nodes in the same demotion tier).

It seems unnecessary to define preferred and allowed IMHO. Why not
just use something like the allocation fallback list? The first node
in the list is the preferred one. When allocating memory for demotion,
convert the list to a nodemask, then call __alloc_pages(gfp, order,
first_node, nodemask). So the allocation could fall back to the
allowed nodes automatically.
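
Roughly something like the below (just a sketch of the idea, not
tested; demotion_list[] and DEMOTION_LIST_LEN are made up for the
example):

static struct page *alloc_from_demotion_list(int node, unsigned int order,
                                             gfp_t gfp)
{
        nodemask_t allowed = NODE_MASK_NONE;
        int i, target, first = NUMA_NO_NODE;

        for (i = 0; i < DEMOTION_LIST_LEN; i++) {
                target = demotion_list[node][i];
                if (target == NUMA_NO_NODE)
                        break;
                if (first == NUMA_NO_NODE)
                        first = target;     /* first entry = preferred node */
                node_set(target, allowed);
        }

        if (first == NUMA_NO_NODE)
                return NULL;                /* no demotion target */

        /* Fallback within the list happens automatically via the nodemask. */
        return __alloc_pages(gfp, order, first, &allowed);
}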

>
>
> Tiering Hierarchy Initialization
> ================================
>
> By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
>
> A device driver can remove its memory nodes from the top tier, e.g.
> a dax driver can remove PMEM nodes from the top tier.

With the topology built by firmware we should not need this.

>
> The kernel builds the memory tiering hierarchy and per-node demotion
> order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
> best distance nodes in the next lower tier are assigned to
> node_demotion[N].preferred and all the nodes in the next lower tier
> are assigned to node_demotion[N].allowed.

I'm not sure whether it should be allowed to demote to multiple lower
tiers. But it is totally fine to *NOT* allow it at the moment. Once we
figure out a good way to define demotion targets, it could be extended
to support this easily.

>
> node_demotion[N].preferred can be empty if no preferred demotion node
> is available for node N.
>
> If the userspace overrides the tiers via the memory_tiers sysfs
> interface, the kernel then only rebuilds the per-node demotion order
> accordingly.
>
> Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
> memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
> node.
>
>
> Memory Allocation for Demotion
> ==============================
>
> When allocating a new demotion target page, both a preferred node
> and the allowed nodemask are provided to the allocation function.
> The default kernel allocation fallback order is used to allocate the
> page from the specified node and nodemask.
>
> The memopolicy of cpuset, vma and owner task of the source page can
> be set to refine the demotion nodemask, e.g. to prevent demotion or
> select a particular allowed node as the demotion target.
>
>
> Examples
> ========
>
> * Example 1:
>   Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
>
>   Node 0 has node 2 as the preferred demotion target and can also
>   fallback demotion to node 3.
>
>   Node 1 has node 3 as the preferred demotion target and can also
>   fallback demotion to node 2.
>
>   Set mempolicy to prevent cross-socket demotion and memory access,
>   e.g. cpuset.mems=0,2
>
> node distances:
> node   0    1    2    3
>    0  10   20   30   40
>    1  20   10   40   30
>    2  30   40   10   40
>    3  40   30   40   10
>
> /sys/devices/system/node/memory_tiers
> 0-1
> 2-3
>
> N_TOPTIER_MEMORY: 0-1
>
> node_demotion[]:
>   0: [2], [2-3]
>   1: [3], [2-3]
>   2: [],  []
>   3: [],  []
>
> * Example 2:
>   Node 0 & 1 are DRAM nodes.
>   Node 2 is a PMEM node and closer to node 0.
>
>   Node 0 has node 2 as the preferred and only demotion target.
>
>   Node 1 has no preferred demotion target, but can still demote
>   to node 2.
>
>   Set mempolicy to prevent cross-socket demotion and memory access,
>   e.g. cpuset.mems=0,2
>
> node distances:
> node   0    1    2
>    0  10   20   30
>    1  20   10   40
>    2  30   40   10
>
> /sys/devices/system/node/memory_tiers
> 0-1
> 2
>
> N_TOPTIER_MEMORY: 0-1
>
> node_demotion[]:
>   0: [2], [2]
>   1: [],  [2]
>   2: [],  []
>
>
> * Example 3:
>   Node 0 & 1 are DRAM nodes.
>   Node 2 is a PMEM node and has the same distance to node 0 & 1.
>
>   Node 0 has node 2 as the preferred and only demotion target.
>
>   Node 1 has node 2 as the preferred and only demotion target.
>
> node distances:
> node   0    1    2
>    0  10   20   30
>    1  20   10   30
>    2  30   30   10
>
> /sys/devices/system/node/memory_tiers
> 0-1
> 2
>
> N_TOPTIER_MEMORY: 0-1
>
> node_demotion[]:
>   0: [2], [2]
>   1: [2], [2]
>   2: [],  []
>
>
> * Example 4:
>   Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
>
>   All nodes are top-tier.
>
> node distances:
> node   0    1    2
>    0  10   20   30
>    1  20   10   30
>    2  30   30   10
>
> /sys/devices/system/node/memory_tiers
> 0-2
>
> N_TOPTIER_MEMORY: 0-2
>
> node_demotion[]:
>   0: [],  []
>   1: [],  []
>   2: [],  []
>
>
> * Example 5:
>   Node 0 is a DRAM node with CPU.
>   Node 1 is a HBM node.
>   Node 2 is a PMEM node.
>
>   With userspace override, node 1 is the top tier and has node 0 as
>   the preferred and only demotion target.
>
>   Node 0 is in the second tier, tier 1, and has node 2 as the
>   preferred and only demotion target.
>
>   Node 2 is in the lowest tier, tier 2, and has no demotion targets.
>
> node distances:
> node   0    1    2
>    0  10   21   30
>    1  21   10   40
>    2  30   40   10
>
> /sys/devices/system/node/memory_tiers (userspace override)
> 1
> 0
> 2
>
> N_TOPTIER_MEMORY: 1
>
> node_demotion[]:
>   0: [2], [2]
>   1: [0], [0]
>   2: [],  []
>
> -- Wei



* Re: RFC: Memory Tiering Kernel Interfaces
  2022-04-30  3:59 ` Yang Shi
@ 2022-04-30  6:37   ` Wei Xu
  2022-05-06  0:01     ` Alistair Popple
  2022-05-06 18:56     ` Yang Shi
  2022-05-01 18:35   ` Dan Williams
  1 sibling, 2 replies; 57+ messages in thread
From: Wei Xu @ 2022-04-30  6:37 UTC (permalink / raw)
  To: Yang Shi
  Cc: Andrew Morton, Dave Hansen, Huang Ying, Dan Williams, Linux MM,
	Greg Thelen, Aneesh Kumar K.V, Jagdish Gediya,
	Linux Kernel Mailing List, Alistair Popple, Davidlohr Bueso,
	Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang,
	Jonathan.Cameron, Tim Chen

On Fri, Apr 29, 2022 at 8:59 PM Yang Shi <shy828301@gmail.com> wrote:
>
> Hi Wei,
>
> Thanks for the nice writing. Please see the below inline comments.

Thanks for the quick response and comments.

> On Fri, Apr 29, 2022 at 7:10 PM Wei Xu <weixugc@google.com> wrote:
> >
> > The current kernel has the basic memory tiering support: Inactive
> > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > tier NUMA node to make room for new allocations on the higher tier
> > NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> > migrated (promoted) to a higher tier NUMA node to improve the
> > performance.
> >
> > A tiering relationship between NUMA nodes in the form of demotion path
> > is created during the kernel initialization and updated when a NUMA
> > node is hot-added or hot-removed.  The current implementation puts all
> > nodes with CPU into the top tier, and then builds the tiering hierarchy
> > tier-by-tier by establishing the per-node demotion targets based on
> > the distances between nodes.
> >
> > The current memory tiering interface needs to be improved to address
> > several important use cases:
> >
> > * The current tiering initialization code always initializes
> >   each memory-only NUMA node into a lower tier.  But a memory-only
> >   NUMA node may have a high performance memory device (e.g. a DRAM
> >   device attached via CXL.mem or a DRAM-backed memory-only node on
> >   a virtual machine) and should be put into the top tier.
> >
> > * The current tiering hierarchy always puts CPU nodes into the top
> >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> >   with CPUs are better to be placed into the next lower tier.
> >
> > * Also because the current tiering hierarchy always puts CPU nodes
> >   into the top tier, when a CPU is hot-added (or hot-removed) and
> >   triggers a memory node from CPU-less into a CPU node (or vice
> >   versa), the memory tiering hierarchy gets changed, even though no
> >   memory node is added or removed.  This can make the tiering
> >   hierarchy much less stable.
>
> I'd prefer the firmware builds up tiers topology then passes it to
> kernel so that kernel knows what nodes are in what tiers. No matter
> what nodes are hot-removed/hot-added they always stay in their tiers
> defined by the firmware. I think this is important information like
> numa distances. NUMA distance alone can't satisfy all the usecases
> IMHO.

I agree that the firmware needs to play a bigger role in tier
topology, though it is not clear to me yet that we should require that
the tier topology be fully defined by the firmware.  If so, a standard
needs to be established.  Alternatively, with additional hardware
information provided by the firmware (e.g. HMAT), the kernel can be in
a much better position to initialize the proper tier topology by
itself.

It is important to keep tier topology stable, especially if we want to
account and limit memory usage based on tiers.  So I agree that the
nodes should not change their tiers no matter what nodes are
hot-added/hot-removed.

Given that the additional tier-related information is not yet
available from the firmware and NUMA distance alone is not sufficient
for all the tiering use cases, and also that we want to keep tier
topology stable after the kernel boots, I suggest that we add a kernel
boot parameter to override the default tier topology (which nodes are
in which tiers). An example is: tier=2:0-1;2-3, which defines two
tiers: tier 0 includes node 0 & 1, and tier 1 includes node 2 & 3.
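
Something along these lines could parse it (rough, untested sketch;
memory_tiers[] is the proposed tier array):

/* tier=<count>:<nodelist>;<nodelist>;...  e.g. tier=2:0-1;2-3 */
static int __init setup_memory_tiers(char *str)
{
        char *tier_str;
        int ntiers, tier = 0;

        if (sscanf(str, "%d", &ntiers) != 1 || ntiers < 1 || ntiers > MAX_TIERS)
                return -EINVAL;

        str = strchr(str, ':');
        if (!str)
                return -EINVAL;
        str++;

        while ((tier_str = strsep(&str, ";")) && tier < ntiers) {
                if (nodelist_parse(tier_str, memory_tiers[tier]))
                        return -EINVAL;
                tier++;
        }
        return 0;
}
early_param("tier", setup_memory_tiers);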

> >
> > * A higher tier node can only be demoted to selected nodes on the
> >   next lower tier, not any other node from the next lower tier.  This
> >   strict, hard-coded demotion order does not work in all use cases
> >   (e.g. some use cases may want to allow cross-socket demotion to
> >   another node in the same demotion tier as a fallback when the
> >   preferred demotion node is out of space), and has resulted in the
> >   feature request for an interface to override the system-wide,
> >   per-node demotion order from the userspace.
> >
> > * There are no interfaces for the userspace to learn about the memory
> >   tiering hierarchy in order to optimize its memory allocations.
> >
> > I'd like to propose revised memory tiering kernel interfaces based on
> > the discussions in the threads:
> >
> > - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> > - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
> >
> >
> > Sysfs Interfaces
> > ================
> >
> > * /sys/devices/system/node/memory_tiers
> >
> >   Format: node list (one tier per line, in the tier order)
> >
> >   When read, list memory nodes by tiers.
> >
> >   When written (one tier per line), take the user-provided node-tier
> >   assignment as the new tiering hierarchy and rebuild the per-node
> >   demotion order.  It is allowed to only override the top tiers, in
> >   which cases, the kernel will establish the lower tiers automatically.
>
> TBH I still think it is too soon to define proper user visible
> interfaces for now, particularly for override.

I agree, but there is also a need to make use of tiering even as it
evolves.  This is why only a minimal sysfs interface is proposed.  We
can make it read-only and resort to a kernel boot parameter to
override tiers.

> >
> >
> > Kernel Representation
> > =====================
> >
> > * nodemask_t node_states[N_TOPTIER_MEMORY]
> >
> >   Store all top-tier memory nodes.
> >
> > * nodemask_t memory_tiers[MAX_TIERS]
> >
> >   Store memory nodes by tiers.
>
> I'd prefer nodemask_t node_states[MAX_TIERS][]. Tier 0 is always the
> top tier. The kernel could build this with the topology built by
> firmware.

node_states[N_TOPTIER_MEMORY] is for convenience and can be removed.

node_states is already an existing kernel array (defined as nodemask_t
node_states[NR_NODE_STATES]).  We need an array for memory tiers, too,
which is why a new array, memory_tiers, is proposed.

Are you proposing that we make node_states a 2-dimensional array?
That can duplicate the information already in node_states, which is
not ideal.

> >
> > * struct demotion_nodes node_demotion[]
> >
> >   where: struct demotion_nodes { nodemask_t preferred; nodemask_t allowed; }
> >
> >   For a node N:
> >
> >   node_demotion[N].preferred lists all preferred demotion targets;
> >
> >   node_demotion[N].allowed lists all allowed demotion targets
> >   (initialized to be all the nodes in the same demotion tier).
>
> It seems unnecessary to define preferred and allowed IMHO. Why not
> just use something like the allocation fallback list? The first node
> in the list is the preferred one. When allocating memory for demotion,
> convert the list to a nodemask, then call __alloc_pages(gfp, order,
> first_node, nodemask). So the allocation could fallback to the allowed
> nodes automatically.

The nodemask "preferred" is an attempt to preserve a current feature
in node_demotion[]: load balancing among multiple equally-close target
nodes via random selection.  We can remove it to keep things simple.

The idea of defining "preferred" and "allowed" is exactly to use
__alloc_pages(gfp, order, preferred_node, allowed_nodemask).  Given
that the page allocator already computes the allocation fallback list,
it should be unnecessary to maintain an ordered demotion node list for
each node and convert such a list to a nodemask for demotion
allocation.  This is why allowed is stored as a nodemask.

When demoting a page from node N, I think we can just call
__alloc_pages(gfp, order, N, memory_tiers[node_to_tier(N) + 1]).  If
so, we can remove node_demotion[] entirely and add a tier field to
NODE_DATA for node_to_tier().
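
i.e. roughly (sketch only; node_to_tier() would be the new helper
backed by the tier field in NODE_DATA):

static struct page *demote_alloc(struct page *page, unsigned int order,
                                 gfp_t gfp)
{
        int node = page_to_nid(page);
        int tier = node_to_tier(node);

        if (tier < 0 || tier >= MAX_TIERS - 1)
                return NULL;    /* already in the lowest tier */

        /* Let the regular fallback order pick a node in the next tier. */
        return __alloc_pages(gfp, order, node, &memory_tiers[tier + 1]);
}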

> >
> >
> > Tiering Hierarchy Initialization
> > ================================
> >
> > By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
> >
> > A device driver can remove its memory nodes from the top tier, e.g.
> > a dax driver can remove PMEM nodes from the top tier.
>
> With the topology built by firmware we should not need this.

I agree. But before we have such a firmware, the kernel needs to do
its best to initialize memory tiers.

Given that we know PMEM is slower than DRAM, but a dax device might
not be PMEM, a better place to set the tier for PMEM nodes can be the
ACPI code, e.g. acpi_numa_memory_affinity_init() where we can examine
the ACPI_SRAT_MEM_NON_VOLATILE bit.
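
e.g. something along these lines (sketch only; set_node_memory_tier()
is a made-up helper for this illustration):

/* Called from acpi_numa_memory_affinity_init() for each SRAT memory entry. */
static void __init srat_assign_tier(struct acpi_srat_mem_affinity *ma, int node)
{
        if (ma->flags & ACPI_SRAT_MEM_NON_VOLATILE)
                set_node_memory_tier(node, 1);  /* below the default top tier */
        else
                set_node_memory_tier(node, 0);  /* top tier */
}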

> >
> > The kernel builds the memory tiering hierarchy and per-node demotion
> > order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
> > best distance nodes in the next lower tier are assigned to
> > node_demotion[N].preferred and all the nodes in the next lower tier
> > are assigned to node_demotion[N].allowed.
>
> I'm not sure whether it should be allowed to demote to multiple lower
> tiers. But it is totally fine to *NOT* allow it at the moment. Once we
> figure out a good way to define demotion targets, it could be extended
> to support this easily.

You mean to only support MAX_TIERS=2 for now.  I am fine with that.
There can be systems with 3 tiers, e.g. GPU -> DRAM -> PMEM, but it is
not clear yet whether we want to enable transparent memory tiering to
all the 3 tiers on such systems.

> >
> > node_demotion[N].preferred can be empty if no preferred demotion node
> > is available for node N.
> >
> > If the userspace overrides the tiers via the memory_tiers sysfs
> > interface, the kernel then only rebuilds the per-node demotion order
> > accordingly.
> >
> > Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
> > memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
> > node.
> >
> >
> > Memory Allocation for Demotion
> > ==============================
> >
> > When allocating a new demotion target page, both a preferred node
> > and the allowed nodemask are provided to the allocation function.
> > The default kernel allocation fallback order is used to allocate the
> > page from the specified node and nodemask.
> >
> > The memopolicy of cpuset, vma and owner task of the source page can
> > be set to refine the demotion nodemask, e.g. to prevent demotion or
> > select a particular allowed node as the demotion target.
> >
> >
> > Examples
> > ========
> >
> > * Example 1:
> >   Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
> >
> >   Node 0 has node 2 as the preferred demotion target and can also
> >   fallback demotion to node 3.
> >
> >   Node 1 has node 3 as the preferred demotion target and can also
> >   fallback demotion to node 2.
> >
> >   Set mempolicy to prevent cross-socket demotion and memory access,
> >   e.g. cpuset.mems=0,2
> >
> > node distances:
> > node   0    1    2    3
> >    0  10   20   30   40
> >    1  20   10   40   30
> >    2  30   40   10   40
> >    3  40   30   40   10
> >
> > /sys/devices/system/node/memory_tiers
> > 0-1
> > 2-3
> >
> > N_TOPTIER_MEMORY: 0-1
> >
> > node_demotion[]:
> >   0: [2], [2-3]
> >   1: [3], [2-3]
> >   2: [],  []
> >   3: [],  []
> >
> > * Example 2:
> >   Node 0 & 1 are DRAM nodes.
> >   Node 2 is a PMEM node and closer to node 0.
> >
> >   Node 0 has node 2 as the preferred and only demotion target.
> >
> >   Node 1 has no preferred demotion target, but can still demote
> >   to node 2.
> >
> >   Set mempolicy to prevent cross-socket demotion and memory access,
> >   e.g. cpuset.mems=0,2
> >
> > node distances:
> > node   0    1    2
> >    0  10   20   30
> >    1  20   10   40
> >    2  30   40   10
> >
> > /sys/devices/system/node/memory_tiers
> > 0-1
> > 2
> >
> > N_TOPTIER_MEMORY: 0-1
> >
> > node_demotion[]:
> >   0: [2], [2]
> >   1: [],  [2]
> >   2: [],  []
> >
> >
> > * Example 3:
> >   Node 0 & 1 are DRAM nodes.
> >   Node 2 is a PMEM node and has the same distance to node 0 & 1.
> >
> >   Node 0 has node 2 as the preferred and only demotion target.
> >
> >   Node 1 has node 2 as the preferred and only demotion target.
> >
> > node distances:
> > node   0    1    2
> >    0  10   20   30
> >    1  20   10   30
> >    2  30   30   10
> >
> > /sys/devices/system/node/memory_tiers
> > 0-1
> > 2
> >
> > N_TOPTIER_MEMORY: 0-1
> >
> > node_demotion[]:
> >   0: [2], [2]
> >   1: [2], [2]
> >   2: [],  []
> >
> >
> > * Example 4:
> >   Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
> >
> >   All nodes are top-tier.
> >
> > node distances:
> > node   0    1    2
> >    0  10   20   30
> >    1  20   10   30
> >    2  30   30   10
> >
> > /sys/devices/system/node/memory_tiers
> > 0-2
> >
> > N_TOPTIER_MEMORY: 0-2
> >
> > node_demotion[]:
> >   0: [],  []
> >   1: [],  []
> >   2: [],  []
> >
> >
> > * Example 5:
> >   Node 0 is a DRAM node with CPU.
> >   Node 1 is a HBM node.
> >   Node 2 is a PMEM node.
> >
> >   With userspace override, node 1 is the top tier and has node 0 as
> >   the preferred and only demotion target.
> >
> >   Node 0 is in the second tier, tier 1, and has node 2 as the
> >   preferred and only demotion target.
> >
> >   Node 2 is in the lowest tier, tier 2, and has no demotion targets.
> >
> > node distances:
> > node   0    1    2
> >    0  10   21   30
> >    1  21   10   40
> >    2  30   40   10
> >
> > /sys/devices/system/node/memory_tiers (userspace override)
> > 1
> > 0
> > 2
> >
> > N_TOPTIER_MEMORY: 1
> >
> > node_demotion[]:
> >   0: [2], [2]
> >   1: [0], [0]
> >   2: [],  []
> >
> > -- Wei



* Re: RFC: Memory Tiering Kernel Interfaces
  2022-04-30  2:10 RFC: Memory Tiering Kernel Interfaces Wei Xu
  2022-04-30  3:59 ` Yang Shi
@ 2022-05-01 17:58 ` Davidlohr Bueso
  2022-05-02  1:04   ` David Rientjes
                     ` (4 more replies)
  2022-05-02  6:25 ` Aneesh Kumar K.V
                   ` (4 subsequent siblings)
  6 siblings, 5 replies; 57+ messages in thread
From: Davidlohr Bueso @ 2022-05-01 17:58 UTC (permalink / raw)
  To: Wei Xu
  Cc: Andrew Morton, Dave Hansen, Huang Ying, Dan Williams, Yang Shi,
	Linux MM, Greg Thelen, Aneesh Kumar K.V, Jagdish Gediya,
	Linux Kernel Mailing List, Alistair Popple, Michal Hocko,
	Baolin Wang, Brice Goglin, Feng Tang, Jonathan.Cameron

Nice summary, thanks. I don't know which of the interested parties will
be at LSF/MM, but FYI we have a couple of sessions on memory tiering on
Tuesday at 14:00 and 15:00.

On Fri, 29 Apr 2022, Wei Xu wrote:

>The current kernel has the basic memory tiering support: Inactive
>pages on a higher tier NUMA node can be migrated (demoted) to a lower
>tier NUMA node to make room for new allocations on the higher tier
>NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
>migrated (promoted) to a higher tier NUMA node to improve the
>performance.

Regardless of the promotion algorithm, at some point I see the NUMA hinting
fault mechanism being in the way of performance. It would be nice if hardware
began giving us page "heatmaps" instead of having to rely on faulting or
sampling based ways to identify hot memory.

>A tiering relationship between NUMA nodes in the form of demotion path
>is created during the kernel initialization and updated when a NUMA
>node is hot-added or hot-removed.  The current implementation puts all
>nodes with CPU into the top tier, and then builds the tiering hierarchy
>tier-by-tier by establishing the per-node demotion targets based on
>the distances between nodes.
>
>The current memory tiering interface needs to be improved to address
>several important use cases:
>
>* The current tiering initialization code always initializes
>  each memory-only NUMA node into a lower tier.  But a memory-only
>  NUMA node may have a high performance memory device (e.g. a DRAM
>  device attached via CXL.mem or a DRAM-backed memory-only node on
>  a virtual machine) and should be put into the top tier.

At least the CXL memory (volatile or not) will still be slower than
regular DRAM, so I think that we'd not want this to be top-tier. But
in general, yes, I agree that defining the top tier by whether or not
the node has a CPU is a bit limiting, as you've detailed here.

>Tiering Hierarchy Initialization
>================================
>
>By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
>
>A device driver can remove its memory nodes from the top tier, e.g.
>a dax driver can remove PMEM nodes from the top tier.
>
>The kernel builds the memory tiering hierarchy and per-node demotion
>order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
>best distance nodes in the next lower tier are assigned to
>node_demotion[N].preferred and all the nodes in the next lower tier
>are assigned to node_demotion[N].allowed.
>
>node_demotion[N].preferred can be empty if no preferred demotion node
>is available for node N.

In cases where there is more than one possible demotion node (with equal
cost), I'm wondering if we want to do something better than choosing
randomly, like we do now - perhaps round robin? Of course anything
like this will require actual performance data, something I have seen
very little of.

>Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
>memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
>node.

I think this makes sense.

Thanks,
Davidlohr



* Re: RFC: Memory Tiering Kernel Interfaces
  2022-04-30  3:59 ` Yang Shi
  2022-04-30  6:37   ` Wei Xu
@ 2022-05-01 18:35   ` Dan Williams
  2022-05-03  6:36     ` Wei Xu
                       ` (2 more replies)
  1 sibling, 3 replies; 57+ messages in thread
From: Dan Williams @ 2022-05-01 18:35 UTC (permalink / raw)
  To: Yang Shi
  Cc: Wei Xu, Andrew Morton, Dave Hansen, Huang Ying, Linux MM,
	Greg Thelen, Aneesh Kumar K.V, Jagdish Gediya,
	Linux Kernel Mailing List, Alistair Popple, Davidlohr Bueso,
	Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang,
	Jonathan Cameron

On Fri, Apr 29, 2022 at 8:59 PM Yang Shi <shy828301@gmail.com> wrote:
>
> Hi Wei,
>
> Thanks for the nice writing. Please see the below inline comments.
>
> On Fri, Apr 29, 2022 at 7:10 PM Wei Xu <weixugc@google.com> wrote:
> >
> > The current kernel has the basic memory tiering support: Inactive
> > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > tier NUMA node to make room for new allocations on the higher tier
> > NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> > migrated (promoted) to a higher tier NUMA node to improve the
> > performance.
> >
> > A tiering relationship between NUMA nodes in the form of demotion path
> > is created during the kernel initialization and updated when a NUMA
> > node is hot-added or hot-removed.  The current implementation puts all
> > nodes with CPU into the top tier, and then builds the tiering hierarchy
> > tier-by-tier by establishing the per-node demotion targets based on
> > the distances between nodes.
> >
> > The current memory tiering interface needs to be improved to address
> > several important use cases:
> >
> > * The current tiering initialization code always initializes
> >   each memory-only NUMA node into a lower tier.  But a memory-only
> >   NUMA node may have a high performance memory device (e.g. a DRAM
> >   device attached via CXL.mem or a DRAM-backed memory-only node on
> >   a virtual machine) and should be put into the top tier.
> >
> > * The current tiering hierarchy always puts CPU nodes into the top
> >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> >   with CPUs are better to be placed into the next lower tier.
> >
> > * Also because the current tiering hierarchy always puts CPU nodes
> >   into the top tier, when a CPU is hot-added (or hot-removed) and
> >   triggers a memory node from CPU-less into a CPU node (or vice
> >   versa), the memory tiering hierarchy gets changed, even though no
> >   memory node is added or removed.  This can make the tiering
> >   hierarchy much less stable.
>
> I'd prefer the firmware builds up tiers topology then passes it to
> kernel so that kernel knows what nodes are in what tiers. No matter
> what nodes are hot-removed/hot-added they always stay in their tiers
> defined by the firmware. I think this is important information like
> numa distances. NUMA distance alone can't satisfy all the usecases
> IMHO.

Just want to note here that the platform firmware can only describe
the tiers of static memory present at boot. CXL hotplug breaks this
model and the kernel is left to dynamically determine the device's
performance characteristics and the performance of the topology to
reach that device. Now, the platform firmware does set expectations
for the performance class of different memory ranges, but there is no
way to know in advance the performance of devices that will be asked
to be physically or logically added to the memory configuration. That
said, it's probably still too early to define ABI for those
exceptional cases where the kernel needs to make a policy decision
about a device that does not fit into the firmware's performance
expectations, but just note that there are limits to the description
that platform firmware can provide.

I agree that NUMA distance alone is inadequate and the kernel needs to
make better use of data like ACPI HMAT to determine the default
tiering order.



* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-01 17:58 ` Davidlohr Bueso
@ 2022-05-02  1:04   ` David Rientjes
  2022-05-02  7:23   ` Aneesh Kumar K.V
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 57+ messages in thread
From: David Rientjes @ 2022-05-02  1:04 UTC (permalink / raw)
  To: Davidlohr Bueso, Yuanchu Xie
  Cc: Wei Xu, Andrew Morton, Dave Hansen, Huang Ying, Dan Williams,
	Yang Shi, Linux MM, Greg Thelen, Aneesh Kumar K.V,
	Jagdish Gediya, Linux Kernel Mailing List, Alistair Popple,
	Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang,
	Jonathan.Cameron

On Sun, 1 May 2022, Davidlohr Bueso wrote:

> Nice summary, thanks. I don't know who of the interested parties will be
> at lsfmm, but fyi we have a couple of sessions on memory tiering Tuesday
> at 14:00 and 15:00.
> 
> On Fri, 29 Apr 2022, Wei Xu wrote:
> 
> > The current kernel has the basic memory tiering support: Inactive
> > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > tier NUMA node to make room for new allocations on the higher tier
> > NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> > migrated (promoted) to a higher tier NUMA node to improve the
> > performance.
> 
> Regardless of the promotion algorithm, at some point I see the NUMA hinting
> fault mechanism being in the way of performance. It would be nice if hardware
> began giving us page "heatmaps" instead of having to rely on faulting or
> sampling based ways to identify hot memory.
> 

Hi Davidlohr,

I tend to agree with this and we've been discussing potential hardware 
assistance for page heatmaps as well, but not as an extension of sampling 
techniques that rely on the page table Accessed bit.

Have you thought about what hardware could give us here that would allow 
us to identify the set of hottest (or coldest) pages over a range so that 
we don't need to iterate through it?

Adding Yuanchu Xie <yuanchu@google.com> who has been looking into this 
recently.

> > A tiering relationship between NUMA nodes in the form of demotion path
> > is created during the kernel initialization and updated when a NUMA
> > node is hot-added or hot-removed.  The current implementation puts all
> > nodes with CPU into the top tier, and then builds the tiering hierarchy
> > tier-by-tier by establishing the per-node demotion targets based on
> > the distances between nodes.
> > 
> > The current memory tiering interface needs to be improved to address
> > several important use cases:
> > 
> > * The current tiering initialization code always initializes
> >  each memory-only NUMA node into a lower tier.  But a memory-only
> >  NUMA node may have a high performance memory device (e.g. a DRAM
> >  device attached via CXL.mem or a DRAM-backed memory-only node on
> >  a virtual machine) and should be put into the top tier.
> 
> At least the CXL memory (volatile or not) will still be slower than
> regular DRAM, so I think that we'd not want this to be top-tier. But
> in general, yes I agree that defining top tier as whether or not the
> node has a CPU a bit limiting, as you've detailed here.
> 
> > Tiering Hierarchy Initialization
> > ================================
> > 
> > By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
> > 
> > A device driver can remove its memory nodes from the top tier, e.g.
> > a dax driver can remove PMEM nodes from the top tier.
> > 
> > The kernel builds the memory tiering hierarchy and per-node demotion
> > order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
> > best distance nodes in the next lower tier are assigned to
> > node_demotion[N].preferred and all the nodes in the next lower tier
> > are assigned to node_demotion[N].allowed.
> > 
> > node_demotion[N].preferred can be empty if no preferred demotion node
> > is available for node N.
> 
> Upon cases where there more than one possible demotion node (with equal
> cost), I'm wondering if we want to do something better than choosing
> randomly, like we do now - perhaps round robin? Of course anything
> like this will require actual performance data, something I have seen
> very little of.
> 
> > Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
> > memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
> > node.
> 
> I think this makes sense.
> 
> Thanks,
> Davidlohr
> 
> 



* Re: RFC: Memory Tiering Kernel Interfaces
  2022-04-30  2:10 RFC: Memory Tiering Kernel Interfaces Wei Xu
  2022-04-30  3:59 ` Yang Shi
  2022-05-01 17:58 ` Davidlohr Bueso
@ 2022-05-02  6:25 ` Aneesh Kumar K.V
  2022-05-03  7:02   ` Wei Xu
  2022-05-02 15:20 ` Dave Hansen
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 57+ messages in thread
From: Aneesh Kumar K.V @ 2022-05-02  6:25 UTC (permalink / raw)
  To: Wei Xu, Andrew Morton, Dave Hansen, Huang Ying, Dan Williams,
	Yang Shi, Linux MM, Greg Thelen, Jagdish Gediya,
	Linux Kernel Mailing List, Alistair Popple, Davidlohr Bueso,
	Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang,
	Jonathan.Cameron

Wei Xu <weixugc@google.com> writes:

....

>
> Tiering Hierarchy Initialization
> ================================
>
> By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
>
> A device driver can remove its memory nodes from the top tier, e.g.
> a dax driver can remove PMEM nodes from the top tier.

Should we make the tier in which to place the memory an option that
device drivers like the dax driver can select? Or should the dax driver
just express the desire to mark a specific memory-only NUMA node as a
demotion target, without explicitly specifying the tier in which it
should be placed? I would like to go for the latter and choose the tier
details based on the current memory tiers and the NUMA distance values
(even HMAT at some point in the future). The challenge with NUMA
distance, though, is which distance value we will pick. For example, in
your example 1:

 node   0    1    2    3
    0  10   20   30   40
    1  20   10   40   30
    2  30   40   10   40
    3  40   30   40   10

When node 3 is registered, how do we decide whether to create a tier 2
or add it to tier 1?  We could say that devices that wish to be placed
in the same tier will have the same distance as the existing tier
device, i.e. for the above case,

node_distance[2][2] == node_distance[2][3]?  Can we expect the firmware
to have distance values like that?

>
> The kernel builds the memory tiering hierarchy and per-node demotion
> order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
> best distance nodes in the next lower tier are assigned to
> node_demotion[N].preferred and all the nodes in the next lower tier
> are assigned to node_demotion[N].allowed.
>
> node_demotion[N].preferred can be empty if no preferred demotion node
> is available for node N.
>
> If the userspace overrides the tiers via the memory_tiers sysfs
> interface, the kernel then only rebuilds the per-node demotion order
> accordingly.
>
> Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
> memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
> node.
>
>
> Memory Allocation for Demotion
> ==============================
>
> When allocating a new demotion target page, both a preferred node
> and the allowed nodemask are provided to the allocation function.
> The default kernel allocation fallback order is used to allocate the
> page from the specified node and nodemask.
>
> The memopolicy of cpuset, vma and owner task of the source page can
> be set to refine the demotion nodemask, e.g. to prevent demotion or
> select a particular allowed node as the demotion target.
>
>
> Examples
> ========
>
> * Example 1:
>   Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
>
>   Node 0 has node 2 as the preferred demotion target and can also
>   fallback demotion to node 3.
>
>   Node 1 has node 3 as the preferred demotion target and can also
>   fallback demotion to node 2.
>
>   Set mempolicy to prevent cross-socket demotion and memory access,
>   e.g. cpuset.mems=0,2
>
> node distances:
> node   0    1    2    3
>    0  10   20   30   40
>    1  20   10   40   30
>    2  30   40   10   40
>    3  40   30   40   10
>
> /sys/devices/system/node/memory_tiers
> 0-1
> 2-3

How can I make node 3 the demotion target for node 2 in this case? Can
we have one file for each tier, i.e. start with
/sys/devices/system/node/memory_tier0?  Removing a node with memory from
the above file/list results in the creation of new tiers.

/sys/devices/system/node/memory_tier0
0-1
/sys/devices/system/node/memory_tier1
2-3

echo 2 > /sys/devices/system/node/memory_tier1
/sys/devices/system/node/memory_tier1
2
/sys/devices/system/node/memory_tier2
3

>
> N_TOPTIER_MEMORY: 0-1
>
> node_demotion[]:
>   0: [2], [2-3]
>   1: [3], [2-3]
>   2: [],  []
>   3: [],  []
>
> * Example 2:
>   Node 0 & 1 are DRAM nodes.
>   Node 2 is a PMEM node and closer to node 0.
>
>   Node 0 has node 2 as the preferred and only demotion target.
>
>   Node 1 has no preferred demotion target, but can still demote
>   to node 2.
>
>   Set mempolicy to prevent cross-socket demotion and memory access,
>   e.g. cpuset.mems=0,2
>
> node distances:
> node   0    1    2
>    0  10   20   30
>    1  20   10   40
>    2  30   40   10
>
> /sys/devices/system/node/memory_tiers
> 0-1
> 2
>
> N_TOPTIER_MEMORY: 0-1
>
> node_demotion[]:
>   0: [2], [2]
>   1: [],  [2]
>   2: [],  []
>
>
> * Example 3:
>   Node 0 & 1 are DRAM nodes.
>   Node 2 is a PMEM node and has the same distance to node 0 & 1.
>
>   Node 0 has node 2 as the preferred and only demotion target.
>
>   Node 1 has node 2 as the preferred and only demotion target.
>
> node distances:
> node   0    1    2
>    0  10   20   30
>    1  20   10   30
>    2  30   30   10
>
> /sys/devices/system/node/memory_tiers
> 0-1
> 2
>
> N_TOPTIER_MEMORY: 0-1
>
> node_demotion[]:
>   0: [2], [2]
>   1: [2], [2]
>   2: [],  []
>
>
> * Example 4:
>   Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
>
>   All nodes are top-tier.
>
> node distances:
> node   0    1    2
>    0  10   20   30
>    1  20   10   30
>    2  30   30   10
>
> /sys/devices/system/node/memory_tiers
> 0-2
>
> N_TOPTIER_MEMORY: 0-2
>
> node_demotion[]:
>   0: [],  []
>   1: [],  []
>   2: [],  []
>
>
> * Example 5:
>   Node 0 is a DRAM node with CPU.
>   Node 1 is a HBM node.
>   Node 2 is a PMEM node.
>
>   With userspace override, node 1 is the top tier and has node 0 as
>   the preferred and only demotion target.
>
>   Node 0 is in the second tier, tier 1, and has node 2 as the
>   preferred and only demotion target.
>
>   Node 2 is in the lowest tier, tier 2, and has no demotion targets.
>
> node distances:
> node   0    1    2
>    0  10   21   30
>    1  21   10   40
>    2  30   40   10
>
> /sys/devices/system/node/memory_tiers (userspace override)
> 1
> 0
> 2
>
> N_TOPTIER_MEMORY: 1
>
> node_demotion[]:
>   0: [2], [2]
>   1: [0], [0]
>   2: [],  []
>
> -- Wei



* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-01 17:58 ` Davidlohr Bueso
  2022-05-02  1:04   ` David Rientjes
@ 2022-05-02  7:23   ` Aneesh Kumar K.V
  2022-05-03  2:07   ` Baolin Wang
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 57+ messages in thread
From: Aneesh Kumar K.V @ 2022-05-02  7:23 UTC (permalink / raw)
  To: Davidlohr Bueso, Wei Xu
  Cc: Andrew Morton, Dave Hansen, Huang Ying, Dan Williams, Yang Shi,
	Linux MM, Greg Thelen, Jagdish Gediya, Linux Kernel Mailing List,
	Alistair Popple, Michal Hocko, Baolin Wang, Brice Goglin,
	Feng Tang, Jonathan.Cameron

Davidlohr Bueso <dave@stgolabs.net> writes:

> Nice summary, thanks. I don't know who of the interested parties will be
> at lsfmm, but fyi we have a couple of sessions on memory tiering Tuesday
> at 14:00 and 15:00.

Will there be an online option this time? If so, I would like to
participate in this discussion. I have not closely followed the LSF/MM
details this year, so I am not sure how to get the online attendance
request out.

>
> On Fri, 29 Apr 2022, Wei Xu wrote:
>
>>The current kernel has the basic memory tiering support: Inactive
>>pages on a higher tier NUMA node can be migrated (demoted) to a lower
>>tier NUMA node to make room for new allocations on the higher tier
>>NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
>>migrated (promoted) to a higher tier NUMA node to improve the
>>performance.
>
> Regardless of the promotion algorithm, at some point I see the NUMA hinting
> fault mechanism being in the way of performance. It would be nice if hardware
> began giving us page "heatmaps" instead of having to rely on faulting or
> sampling based ways to identify hot memory.


Power10 hardware can do this. Right now we are looking at integrating
this into MultiGen LRU, but we haven't gotten it to work yet. One of the
challenges is how to estimate the relative hotness of a page compared to
the rest of the pages in the system. I am looking at random sampling of
the oldest-generation pages (the page list passed to shrink_page_list())
and using the hot and cold pages in that random sample to determine the
hotness of a specific page and whether to reclaim it or not.

-aneesh



* Re: RFC: Memory Tiering Kernel Interfaces
  2022-04-30  2:10 RFC: Memory Tiering Kernel Interfaces Wei Xu
                   ` (2 preceding siblings ...)
  2022-05-02  6:25 ` Aneesh Kumar K.V
@ 2022-05-02 15:20 ` Dave Hansen
  2022-05-03  7:19   ` Wei Xu
  2022-05-03 19:12 ` Tim Chen
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 57+ messages in thread
From: Dave Hansen @ 2022-05-02 15:20 UTC (permalink / raw)
  To: Wei Xu, Andrew Morton, Dave Hansen, Huang Ying, Dan Williams,
	Yang Shi, Linux MM, Greg Thelen, Aneesh Kumar K.V,
	Jagdish Gediya, Linux Kernel Mailing List, Alistair Popple,
	Davidlohr Bueso, Michal Hocko, Baolin Wang, Brice Goglin,
	Feng Tang, Jonathan.Cameron

> The current memory tiering interface needs to be improved to address
> several important use cases:

FWIW, I totally agree.  We knew when that code went in that the default
ordering was feeble.  There were patches to export the demotion order
and allow it to be modified from userspace, but they were jettisoned at
some point.

> Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
> memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
> node.

Yeah, this would be a welcome improvement if we can get there.

> * /sys/devices/system/node/memory_tiers
> 
>   Format: node list (one tier per line, in the tier order)
> 
>   When read, list memory nodes by tiers.

Nit: this would seem to violate the one-value-per-file sysfs guideline.
It can be fixed by making tiers actual objects, which would have some
other nice benefits too.
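
For instance, something like one device per tier with a single-value
attribute (a rough sketch of what tier objects might look like, not a
concrete proposal):

struct memory_tier {
        struct device dev;
        nodemask_t nodes;
};

static ssize_t nodes_show(struct device *dev,
                          struct device_attribute *attr, char *buf)
{
        struct memory_tier *tier = container_of(dev, struct memory_tier, dev);

        return sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&tier->nodes));
}
static DEVICE_ATTR_RO(nodes);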



* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-01 17:58 ` Davidlohr Bueso
  2022-05-02  1:04   ` David Rientjes
  2022-05-02  7:23   ` Aneesh Kumar K.V
@ 2022-05-03  2:07   ` Baolin Wang
  2022-05-03  6:06   ` Wei Xu
  2022-05-03 17:14   ` Alistair Popple
  4 siblings, 0 replies; 57+ messages in thread
From: Baolin Wang @ 2022-05-03  2:07 UTC (permalink / raw)
  To: Wei Xu, Andrew Morton, Dave Hansen, Huang Ying, Dan Williams,
	Yang Shi, Linux MM, Greg Thelen, Aneesh Kumar K.V,
	Jagdish Gediya, Linux Kernel Mailing List, Alistair Popple,
	Michal Hocko, Brice Goglin, Feng Tang, Jonathan.Cameron



On 5/2/2022 1:58 AM, Davidlohr Bueso wrote:
> Nice summary, thanks. I don't know who of the interested parties will be
> at lsfmm, but fyi we have a couple of sessions on memory tiering Tuesday
> at 14:00 and 15:00.
> 
> On Fri, 29 Apr 2022, Wei Xu wrote:
> 
>> The current kernel has the basic memory tiering support: Inactive
>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
>> tier NUMA node to make room for new allocations on the higher tier
>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
>> migrated (promoted) to a higher tier NUMA node to improve the
>> performance.
> 
> Regardless of the promotion algorithm, at some point I see the NUMA hinting
> fault mechanism being in the way of performance. It would be nice if 
> hardware
> began giving us page "heatmaps" instead of having to rely on faulting or
> sampling based ways to identify hot memory.
> 
>> A tiering relationship between NUMA nodes in the form of demotion path
>> is created during the kernel initialization and updated when a NUMA
>> node is hot-added or hot-removed.  The current implementation puts all
>> nodes with CPU into the top tier, and then builds the tiering hierarchy
>> tier-by-tier by establishing the per-node demotion targets based on
>> the distances between nodes.
>>
>> The current memory tiering interface needs to be improved to address
>> several important use cases:
>>
>> * The current tiering initialization code always initializes
>>  each memory-only NUMA node into a lower tier.  But a memory-only
>>  NUMA node may have a high performance memory device (e.g. a DRAM
>>  device attached via CXL.mem or a DRAM-backed memory-only node on
>>  a virtual machine) and should be put into the top tier.
> 
> At least the CXL memory (volatile or not) will still be slower than
> regular DRAM, so I think that we'd not want this to be top-tier. But
> in general, yes I agree that defining top tier as whether or not the
> node has a CPU a bit limiting, as you've detailed here.
> 
>> Tiering Hierarchy Initialization
>> ================================
>>
>> By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
>>
>> A device driver can remove its memory nodes from the top tier, e.g.
>> a dax driver can remove PMEM nodes from the top tier.
>>
>> The kernel builds the memory tiering hierarchy and per-node demotion
>> order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
>> best distance nodes in the next lower tier are assigned to
>> node_demotion[N].preferred and all the nodes in the next lower tier
>> are assigned to node_demotion[N].allowed.
>>
>> node_demotion[N].preferred can be empty if no preferred demotion node
>> is available for node N.
> 
> Upon cases where there more than one possible demotion node (with equal
> cost), I'm wondering if we want to do something better than choosing
> randomly, like we do now - perhaps round robin? Of course anything
> like this will require actual performance data, something I have seen
> very little of.

I've tried using round robin[1] to select a target demotion node when
there are multiple demotion nodes, but I did not see any obvious
performance gain in MySQL testing. Maybe other test suites would show a
difference?

https://lore.kernel.org/all/c02bcbc04faa7a2c852534e9cd58a91c44494657.1636016609.git.baolin.wang@linux.alibaba.com/
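
For reference, a minimal sketch of what round-robin selection among the
allowed demotion targets could look like (next_demotion_idx[] is a
hypothetical per-source-node cursor, not the mechanism used in the patch
above):

#include <linux/nodemask.h>

/* Hypothetical per-source-node cursor into the allowed demotion targets. */
static int next_demotion_idx[MAX_NUMNODES];

static int next_demotion_node_rr(int src, const nodemask_t *allowed)
{
	int nr = nodes_weight(*allowed);
	int idx, node, i = 0;

	if (!nr)
		return NUMA_NO_NODE;

	idx = next_demotion_idx[src]++ % nr;
	for_each_node_mask(node, *allowed)
		if (i++ == idx)
			return node;

	return NUMA_NO_NODE;
}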


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-01 17:58 ` Davidlohr Bueso
                     ` (2 preceding siblings ...)
  2022-05-03  2:07   ` Baolin Wang
@ 2022-05-03  6:06   ` Wei Xu
  2022-05-03 17:14   ` Alistair Popple
  4 siblings, 0 replies; 57+ messages in thread
From: Wei Xu @ 2022-05-03  6:06 UTC (permalink / raw)
  To: Wei Xu, Andrew Morton, Dave Hansen, Huang Ying, Dan Williams,
	Yang Shi, Linux MM, Greg Thelen, Aneesh Kumar K.V,
	Jagdish Gediya, Linux Kernel Mailing List, Alistair Popple,
	Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang,
	Jonathan.Cameron

On Sun, May 1, 2022 at 11:09 AM Davidlohr Bueso <dave@stgolabs.net> wrote:
>
> Nice summary, thanks. I don't know who of the interested parties will be
> at lsfmm, but fyi we have a couple of sessions on memory tiering Tuesday
> at 14:00 and 15:00.
>
> On Fri, 29 Apr 2022, Wei Xu wrote:
>
> >The current kernel has the basic memory tiering support: Inactive
> >pages on a higher tier NUMA node can be migrated (demoted) to a lower
> >tier NUMA node to make room for new allocations on the higher tier
> >NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> >migrated (promoted) to a higher tier NUMA node to improve the
> >performance.
>
> Regardless of the promotion algorithm, at some point I see the NUMA hinting
> fault mechanism being in the way of performance. It would be nice if hardware
> began giving us page "heatmaps" instead of having to rely on faulting or
> sampling based ways to identify hot memory.

I agree with your comments on both NUMA hinting faults and
hardware-assisted "heatmaps".


> >A tiering relationship between NUMA nodes in the form of demotion path
> >is created during the kernel initialization and updated when a NUMA
> >node is hot-added or hot-removed.  The current implementation puts all
> >nodes with CPU into the top tier, and then builds the tiering hierarchy
> >tier-by-tier by establishing the per-node demotion targets based on
> >the distances between nodes.
> >
> >The current memory tiering interface needs to be improved to address
> >several important use cases:
> >
> >* The current tiering initialization code always initializes
> >  each memory-only NUMA node into a lower tier.  But a memory-only
> >  NUMA node may have a high performance memory device (e.g. a DRAM
> >  device attached via CXL.mem or a DRAM-backed memory-only node on
> >  a virtual machine) and should be put into the top tier.
>
> At least the CXL memory (volatile or not) will still be slower than
> regular DRAM, so I think that we'd not want this to be top-tier. But
> in general, yes I agree that defining top tier as whether or not the
> node has a CPU a bit limiting, as you've detailed here.
>
> >Tiering Hierarchy Initialization
> >================================
> >
> >By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
> >
> >A device driver can remove its memory nodes from the top tier, e.g.
> >a dax driver can remove PMEM nodes from the top tier.
> >
> >The kernel builds the memory tiering hierarchy and per-node demotion
> >order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
> >best distance nodes in the next lower tier are assigned to
> >node_demotion[N].preferred and all the nodes in the next lower tier
> >are assigned to node_demotion[N].allowed.
> >
> >node_demotion[N].preferred can be empty if no preferred demotion node
> >is available for node N.
>
> Upon cases where there more than one possible demotion node (with equal
> cost), I'm wondering if we want to do something better than choosing
> randomly, like we do now - perhaps round robin? Of course anything
> like this will require actual performance data, something I have seen
> very little of.

I'd prefer that the demotion node selection follow the way the kernel
selects the node/zone for normal allocations.  If we want to group
several demotion nodes with equal cost together (e.g. to better utilize
the bandwidth of these nodes), we'd better implement such an
optimization in __alloc_pages_nodemask() so that normal allocations
benefit as well.
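
As a rough sketch of what that means for the demotion path (the GFP
flags here are illustrative only; __alloc_pages_nodemask() has since
been renamed to __alloc_pages() in recent kernels):

#include <linux/gfp.h>
#include <linux/nodemask.h>

/*
 * Allocate a demotion target page, preferring @preferred_nid but
 * allowing fallback to any node in @allowed via the regular
 * allocation fallback order.
 */
static struct page *alloc_demotion_page(int preferred_nid, nodemask_t *allowed)
{
	gfp_t gfp = GFP_NOWAIT | __GFP_NOWARN | __GFP_NOMEMALLOC;

	return __alloc_pages_nodemask(gfp, 0, preferred_nid, allowed);
}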

> >Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
> >memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
> >node.
>
> I think this makes sense.
>
> Thanks,
> Davidlohr


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-01 18:35   ` Dan Williams
@ 2022-05-03  6:36     ` Wei Xu
  2022-05-06 19:05     ` Yang Shi
  2022-05-07  7:56     ` ying.huang
  2 siblings, 0 replies; 57+ messages in thread
From: Wei Xu @ 2022-05-03  6:36 UTC (permalink / raw)
  To: Dan Williams
  Cc: Yang Shi, Andrew Morton, Dave Hansen, Huang Ying, Linux MM,
	Greg Thelen, Aneesh Kumar K.V, Jagdish Gediya,
	Linux Kernel Mailing List, Alistair Popple, Davidlohr Bueso,
	Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang,
	Jonathan Cameron

On Sun, May 1, 2022 at 11:35 AM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Fri, Apr 29, 2022 at 8:59 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > Hi Wei,
> >
> > Thanks for the nice writing. Please see the below inline comments.
> >
> > On Fri, Apr 29, 2022 at 7:10 PM Wei Xu <weixugc@google.com> wrote:
> > >
> > > The current kernel has the basic memory tiering support: Inactive
> > > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > > tier NUMA node to make room for new allocations on the higher tier
> > > NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> > > migrated (promoted) to a higher tier NUMA node to improve the
> > > performance.
> > >
> > > A tiering relationship between NUMA nodes in the form of demotion path
> > > is created during the kernel initialization and updated when a NUMA
> > > node is hot-added or hot-removed.  The current implementation puts all
> > > nodes with CPU into the top tier, and then builds the tiering hierarchy
> > > tier-by-tier by establishing the per-node demotion targets based on
> > > the distances between nodes.
> > >
> > > The current memory tiering interface needs to be improved to address
> > > several important use cases:
> > >
> > > * The current tiering initialization code always initializes
> > >   each memory-only NUMA node into a lower tier.  But a memory-only
> > >   NUMA node may have a high performance memory device (e.g. a DRAM
> > >   device attached via CXL.mem or a DRAM-backed memory-only node on
> > >   a virtual machine) and should be put into the top tier.
> > >
> > > * The current tiering hierarchy always puts CPU nodes into the top
> > >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> > >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> > >   with CPUs are better to be placed into the next lower tier.
> > >
> > > * Also because the current tiering hierarchy always puts CPU nodes
> > >   into the top tier, when a CPU is hot-added (or hot-removed) and
> > >   triggers a memory node from CPU-less into a CPU node (or vice
> > >   versa), the memory tiering hierarchy gets changed, even though no
> > >   memory node is added or removed.  This can make the tiering
> > >   hierarchy much less stable.
> >
> > I'd prefer the firmware builds up tiers topology then passes it to
> > kernel so that kernel knows what nodes are in what tiers. No matter
> > what nodes are hot-removed/hot-added they always stay in their tiers
> > defined by the firmware. I think this is important information like
> > numa distances. NUMA distance alone can't satisfy all the usecases
> > IMHO.
>
> Just want to note here that the platform firmware can only describe
> the tiers of static memory present at boot. CXL hotplug breaks this
> model and the kernel is left to dynamically determine the device's
> performance characteristics and the performance of the topology to
> reach that device. Now, the platform firmware does set expectations
> for the perfomance class of different memory ranges, but there is no
> way to know in advance the performance of devices that will be asked
> to be physically or logically added to the memory configuration. That
> said, it's probably still too early to define ABI for those
> exceptional cases where the kernel needs to make a policy decision
> about a device that does not fit into the firmware's performance
> expectations, but just note that there are limits to the description
> that platform firmware can provide.
>
> I agree that NUMA distance alone is inadequate and the kernel needs to
> make better use of data like ACPI HMAT to determine the default
> tiering order.

Very useful clarification. It should be fine for the kernel to
dynamically determine the memory tier of each node.  I expect that it
can also be fine even if a node gets attached to a different memory
device and needs to be assigned to a different tier after another
round of hot-remove/hot-add.

What can be problematic is that a hot-added node not only changes its
own tier, but also causes other existing nodes to change their tiers.
This can mess up any tier-based memory accounting.

One approach to address this is to:

- have tiers be well-defined and stable, e.g. HBM is always in
tier-0, direct-attached DRAM and high-performance CXL.mem devices are
always in tier-1, slower CXL.mem devices are always in tier-2, and
PMEM is always in tier-3.  The tier definition is based on device
performance, similar to the class rating of storage devices
(e.g. SD cards).

- allow tiers to be absent in the system, e.g. a machine may have only
tier-1 and tier-3, but have neither tier-0 nor tier-2.

- allow demotion not only to the immediate next lower tier, but to all
lower tiers.  The actual selection of demotion order follows the
allocation fallback order.  This allows tier-1 to demote directly to
tier-3 without requiring the presence of tier-2.

This approach ensures that the tiers of existing nodes are stable,
while still permitting the tier of a hot-plugged node to be determined
dynamically; the allowed-demotion computation is sketched below.
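
A minimal sketch of that allowed-demotion computation, assuming fixed
tier IDs and a memory_tiers[] nodemask array as in the proposal (the
helper itself is hypothetical):

#include <linux/nodemask.h>

#define MAX_TIERS 4	/* 0: HBM, 1: DRAM/fast CXL, 2: slow CXL, 3: PMEM */

static nodemask_t memory_tiers[MAX_TIERS];	/* tiers may be empty */

/*
 * Allowed demotion targets for a node in @tier: the union of all
 * lower tiers, so e.g. tier-1 can demote straight to tier-3 when
 * tier-2 is absent.  Preferred targets would still be picked by
 * distance within this mask.
 */
static nodemask_t allowed_demotion_targets(int tier)
{
	nodemask_t allowed = NODE_MASK_NONE;
	int t;

	for (t = tier + 1; t < MAX_TIERS; t++)
		nodes_or(allowed, allowed, memory_tiers[t]);

	return allowed;
}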


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-02  6:25 ` Aneesh Kumar K.V
@ 2022-05-03  7:02   ` Wei Xu
  0 siblings, 0 replies; 57+ messages in thread
From: Wei Xu @ 2022-05-03  7:02 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Andrew Morton, Dave Hansen, Huang Ying, Dan Williams, Yang Shi,
	Linux MM, Greg Thelen, Jagdish Gediya, Linux Kernel Mailing List,
	Alistair Popple, Davidlohr Bueso, Michal Hocko, Baolin Wang,
	Brice Goglin, Feng Tang, Jonathan Cameron

On Sun, May 1, 2022 at 11:25 PM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> Wei Xu <weixugc@google.com> writes:
>
> ....
>
> >
> > Tiering Hierarchy Initialization
> > ================================
> >
> > By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
> >
> > A device driver can remove its memory nodes from the top tier, e.g.
> > a dax driver can remove PMEM nodes from the top tier.
>
> Should we look at the tier in which to place the memory an option that
> device drivers like dax driver can select? Or dax driver just selects
> the desire to mark a specific memory only numa node as demotion target
> and won't explicity specify the tier in which it should be placed. I
> would like to go for the later and choose the tier details based on the
> current memory tiers and the NUMA distance value (even HMAT at some
> point in the future).

This is what has been proposed here.  The driver doesn't determine
which particular tier the node should be placed in.  It just removes
the node from the top tier (i.e. makes the node a demotion target).
The actual tier of the node is determined based on all the nodes and
their NUMA distance values.

> The challenge with NUMA distance though is which
> distance value we will pick. For example, in your example1.
>
>  node   0    1    2    3
>     0  10   20   30   40
>     1  20   10   40   30
>     2  30   40   10   40
>     3  40   30   40   10
>
> When Node3 is registered, how do we decide to create a Tier2 or add it
> to Tier1? .

This proposal assumes a breadth-first search in tier construction,
which is also how the current implementation works.  In this example,
the top-tier nodes are [0,1].  We then find the best demotion node for
each of [0,1] and get [0->2, 1->3].  Now we have two tiers, [0,1] and
[2,3], and the search terminates.

But this algorithm doesn't work if there is no node 1 and we still
want nodes 2 & 3 in the same tier.  Without additional hardware
information such as HMAT, we will need a way to override the default
tier definition.
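
In pseudo-C, the breadth-first construction described above looks
roughly like this (best_demotion_node() stands in for the existing
distance-based selection, and N_TOPTIER_MEMORY is the node state from
the proposal; neither is an existing kernel symbol):

#include <linux/nodemask.h>

extern int best_demotion_node(int node, const nodemask_t *used); /* placeholder */

/* Rough sketch of breadth-first tier construction. */
static void build_tiers(void)
{
	nodemask_t used = NODE_MASK_NONE;
	nodemask_t tier = node_states[N_TOPTIER_MEMORY];	/* e.g. [0,1] */
	int node;

	while (!nodes_empty(tier)) {
		nodemask_t next = NODE_MASK_NONE;

		nodes_or(used, used, tier);
		for_each_node_mask(node, tier) {
			/* placeholder for the distance-based choice */
			int target = best_demotion_node(node, &used);

			if (target != NUMA_NO_NODE)
				node_set(target, next);	/* e.g. 0->2, 1->3 */
		}
		tier = next;	/* next lower tier, e.g. [2,3]; empty => done */
	}
}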

> We could say devices that wish to be placed in the same tier
> will have same distance as the existing tier device ie, for the above
> case,
>
> node_distance[2][2] == node_distance[2][3] ? Can we expect the firmware
> to have distance value like that?

node_distance[2][2] is local, which should be smaller than
node_distance[2][3].  I expect this to be the case with normal
firmware.

> >
> > The kernel builds the memory tiering hierarchy and per-node demotion
> > order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
> > best distance nodes in the next lower tier are assigned to
> > node_demotion[N].preferred and all the nodes in the next lower tier
> > are assigned to node_demotion[N].allowed.
> >
> > node_demotion[N].preferred can be empty if no preferred demotion node
> > is available for node N.
> >
> > If the userspace overrides the tiers via the memory_tiers sysfs
> > interface, the kernel then only rebuilds the per-node demotion order
> > accordingly.
> >
> > Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
> > memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
> > node.
> >
> >
> > Memory Allocation for Demotion
> > ==============================
> >
> > When allocating a new demotion target page, both a preferred node
> > and the allowed nodemask are provided to the allocation function.
> > The default kernel allocation fallback order is used to allocate the
> > page from the specified node and nodemask.
> >
> > The memopolicy of cpuset, vma and owner task of the source page can
> > be set to refine the demotion nodemask, e.g. to prevent demotion or
> > select a particular allowed node as the demotion target.
> >
> >
> > Examples
> > ========
> >
> > * Example 1:
> >   Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
> >
> >   Node 0 has node 2 as the preferred demotion target and can also
> >   fallback demotion to node 3.
> >
> >   Node 1 has node 3 as the preferred demotion target and can also
> >   fallback demotion to node 2.
> >
> >   Set mempolicy to prevent cross-socket demotion and memory access,
> >   e.g. cpuset.mems=0,2
> >
> > node distances:
> > node   0    1    2    3
> >    0  10   20   30   40
> >    1  20   10   40   30
> >    2  30   40   10   40
> >    3  40   30   40   10
> >
> > /sys/devices/system/node/memory_tiers
> > 0-1
> > 2-3
>
> How can I make Node3 the demotion target for Node2 in this case? Can
> we have one file for each tier? ie, we start with
> /sys/devices/system/node/memory_tier0. Removing a node with memory from
> the above file/list results in the creation of new tiers.
>
> /sys/devices/system/node/memory_tier0
> 0-1
> /sys/devices/system/node/memory_tier1
> 2-3
>
> echo 2 > /sys/devices/system/node/memory_tier1
> /sys/devices/system/node/memory_tier1
> 2
> /sys/devices/system/node/memory_tier2
> 3

The proposal does something similar, except that it uses a single file: memory_tiers.

Another idea is to pass the tier override via a kernel boot argument,
though that makes it harder to deal with hot-plugged nodes.

> >
> > N_TOPTIER_MEMORY: 0-1
> >
> > node_demotion[]:
> >   0: [2], [2-3]
> >   1: [3], [2-3]
> >   2: [],  []
> >   3: [],  []
> >
> > * Example 2:
> >   Node 0 & 1 are DRAM nodes.
> >   Node 2 is a PMEM node and closer to node 0.
> >
> >   Node 0 has node 2 as the preferred and only demotion target.
> >
> >   Node 1 has no preferred demotion target, but can still demote
> >   to node 2.
> >
> >   Set mempolicy to prevent cross-socket demotion and memory access,
> >   e.g. cpuset.mems=0,2
> >
> > node distances:
> > node   0    1    2
> >    0  10   20   30
> >    1  20   10   40
> >    2  30   40   10
> >
> > /sys/devices/system/node/memory_tiers
> > 0-1
> > 2
> >
> > N_TOPTIER_MEMORY: 0-1
> >
> > node_demotion[]:
> >   0: [2], [2]
> >   1: [],  [2]
> >   2: [],  []
> >
> >
> > * Example 3:
> >   Node 0 & 1 are DRAM nodes.
> >   Node 2 is a PMEM node and has the same distance to node 0 & 1.
> >
> >   Node 0 has node 2 as the preferred and only demotion target.
> >
> >   Node 1 has node 2 as the preferred and only demotion target.
> >
> > node distances:
> > node   0    1    2
> >    0  10   20   30
> >    1  20   10   30
> >    2  30   30   10
> >
> > /sys/devices/system/node/memory_tiers
> > 0-1
> > 2
> >
> > N_TOPTIER_MEMORY: 0-1
> >
> > node_demotion[]:
> >   0: [2], [2]
> >   1: [2], [2]
> >   2: [],  []
> >
> >
> > * Example 4:
> >   Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
> >
> >   All nodes are top-tier.
> >
> > node distances:
> > node   0    1    2
> >    0  10   20   30
> >    1  20   10   30
> >    2  30   30   10
> >
> > /sys/devices/system/node/memory_tiers
> > 0-2
> >
> > N_TOPTIER_MEMORY: 0-2
> >
> > node_demotion[]:
> >   0: [],  []
> >   1: [],  []
> >   2: [],  []
> >
> >
> > * Example 5:
> >   Node 0 is a DRAM node with CPU.
> >   Node 1 is a HBM node.
> >   Node 2 is a PMEM node.
> >
> >   With userspace override, node 1 is the top tier and has node 0 as
> >   the preferred and only demotion target.
> >
> >   Node 0 is in the second tier, tier 1, and has node 2 as the
> >   preferred and only demotion target.
> >
> >   Node 2 is in the lowest tier, tier 2, and has no demotion targets.
> >
> > node distances:
> > node   0    1    2
> >    0  10   21   30
> >    1  21   10   40
> >    2  30   40   10
> >
> > /sys/devices/system/node/memory_tiers (userspace override)
> > 1
> > 0
> > 2
> >
> > N_TOPTIER_MEMORY: 1
> >
> > node_demotion[]:
> >   0: [2], [2]
> >   1: [0], [0]
> >   2: [],  []
> >
> > -- Wei


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-02 15:20 ` Dave Hansen
@ 2022-05-03  7:19   ` Wei Xu
  0 siblings, 0 replies; 57+ messages in thread
From: Wei Xu @ 2022-05-03  7:19 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, Dave Hansen, Huang Ying, Dan Williams, Yang Shi,
	Linux MM, Greg Thelen, Aneesh Kumar K.V, Jagdish Gediya,
	Linux Kernel Mailing List, Alistair Popple, Davidlohr Bueso,
	Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang,
	Jonathan Cameron

On Mon, May 2, 2022 at 8:20 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> > The current memory tiering interface needs to be improved to address
> > several important use cases:
>
> FWIW, I totally agree.  We knew when that code went in that the default
> ordering was feeble.  There were patches to export the demotion order
> and allow it to be modified from userspace, but they were jettisoned at
> some point.
>
> > Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
> > memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
> > node.
>
> Yeah, this would be a welcome improvement if we can get there.
>
> > * /sys/devices/system/node/memory_tiers
> >
> >   Format: node list (one tier per line, in the tier order)
> >
> >   When read, list memory nodes by tiers.
>
> Nit: this would seems to violate the one-value-per-file sysfs guideline.
>  It can be fixed by making tiers actual objects, which would have some
> other nice benefits too.
>

Good point.  One tier per file should work as well.  It might be even
better to have a separate tier sub-tree.
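
For example, a minimal sketch of a per-tier "nodes" attribute in such a
sub-tree (struct memory_tier and its registration are hypothetical;
only the sysfs helpers are existing kernel APIs):

#include <linux/device.h>
#include <linux/nodemask.h>
#include <linux/sysfs.h>

/* Hypothetical per-tier object registered under a tier sub-tree. */
struct memory_tier {
	struct device dev;
	nodemask_t nodes;
};

#define to_memory_tier(d) container_of(d, struct memory_tier, dev)

/* <tier sub-tree>/memory_tierN/nodes: one node list per file. */
static ssize_t nodes_show(struct device *dev,
			  struct device_attribute *attr, char *buf)
{
	return sysfs_emit(buf, "%*pbl\n",
			  nodemask_pr_args(&to_memory_tier(dev)->nodes));
}
static DEVICE_ATTR_RO(nodes);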


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-01 17:58 ` Davidlohr Bueso
                     ` (3 preceding siblings ...)
  2022-05-03  6:06   ` Wei Xu
@ 2022-05-03 17:14   ` Alistair Popple
  2022-05-03 17:47     ` Dave Hansen
  4 siblings, 1 reply; 57+ messages in thread
From: Alistair Popple @ 2022-05-03 17:14 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Wei Xu, Andrew Morton, Dave Hansen, Huang Ying, Dan Williams,
	Yang Shi, Linux MM, Greg Thelen, Aneesh Kumar K.V,
	Jagdish Gediya, Linux Kernel Mailing List, Michal Hocko,
	Baolin Wang, Brice Goglin, Feng Tang, Jonathan.Cameron

Davidlohr Bueso <dave@stgolabs.net> writes:

> Nice summary, thanks. I don't know who of the interested parties will be
> at lsfmm, but fyi we have a couple of sessions on memory tiering Tuesday
> at 14:00 and 15:00.
>
> On Fri, 29 Apr 2022, Wei Xu wrote:
>
>>The current kernel has the basic memory tiering support: Inactive
>>pages on a higher tier NUMA node can be migrated (demoted) to a lower
>>tier NUMA node to make room for new allocations on the higher tier
>>NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
>>migrated (promoted) to a higher tier NUMA node to improve the
>>performance.
>
> Regardless of the promotion algorithm, at some point I see the NUMA hinting
> fault mechanism being in the way of performance. It would be nice if hardware
> began giving us page "heatmaps" instead of having to rely on faulting or
> sampling based ways to identify hot memory.

Agreed. The existing NUMA faulting mechanism is already in the way of
performance on something like POWER9 + coherent GPUs. In that case enabling
NUMA faulting decreases performance by multiple orders of magnitude, to the
point that the only reasonable configuration for that system was to disable
NUMA balancing for anything using the GPU.

I would certainly be interested in figuring out how HW could provide some sort
of heatmap to identify which pages are hot and which processing unit is using
them. Currently, for these systems, users have to manually assign a memory
policy to get any reasonable performance, both to disable NUMA balancing and
to make sure memory is allocated on the right node.
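
As an illustration of that manual policy assignment, a small userspace
sketch that binds an anonymous mapping to a single node with mbind(2)
(node 0 is just an example; link with -lnuma):

#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1UL << 20;
	unsigned long nodemask = 1UL << 0;	/* node 0 only (example) */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0)) {
		perror("mbind");
		return 1;
	}

	memset(p, 0, len);	/* faulted pages now come from node 0 */
	return 0;
}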

- Alistair

>>A tiering relationship between NUMA nodes in the form of demotion path
>>is created during the kernel initialization and updated when a NUMA
>>node is hot-added or hot-removed.  The current implementation puts all
>>nodes with CPU into the top tier, and then builds the tiering hierarchy
>>tier-by-tier by establishing the per-node demotion targets based on
>>the distances between nodes.
>>
>>The current memory tiering interface needs to be improved to address
>>several important use cases:
>>
>>* The current tiering initialization code always initializes
>>  each memory-only NUMA node into a lower tier.  But a memory-only
>>  NUMA node may have a high performance memory device (e.g. a DRAM
>>  device attached via CXL.mem or a DRAM-backed memory-only node on
>>  a virtual machine) and should be put into the top tier.
>
> At least the CXL memory (volatile or not) will still be slower than
> regular DRAM, so I think that we'd not want this to be top-tier. But
> in general, yes I agree that defining top tier as whether or not the
> node has a CPU a bit limiting, as you've detailed here.
>
>>Tiering Hierarchy Initialization
>>================================
>>
>>By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
>>
>>A device driver can remove its memory nodes from the top tier, e.g.
>>a dax driver can remove PMEM nodes from the top tier.
>>
>>The kernel builds the memory tiering hierarchy and per-node demotion
>>order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
>>best distance nodes in the next lower tier are assigned to
>>node_demotion[N].preferred and all the nodes in the next lower tier
>>are assigned to node_demotion[N].allowed.
>>
>>node_demotion[N].preferred can be empty if no preferred demotion node
>>is available for node N.
>
> Upon cases where there more than one possible demotion node (with equal
> cost), I'm wondering if we want to do something better than choosing
> randomly, like we do now - perhaps round robin? Of course anything
> like this will require actual performance data, something I have seen
> very little of.
>
>>Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
>>memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
>>node.
>
> I think this makes sense.
>
> Thanks,
> Davidlohr

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-03 17:14   ` Alistair Popple
@ 2022-05-03 17:47     ` Dave Hansen
  2022-05-03 22:35       ` Alistair Popple
  0 siblings, 1 reply; 57+ messages in thread
From: Dave Hansen @ 2022-05-03 17:47 UTC (permalink / raw)
  To: Alistair Popple, Davidlohr Bueso
  Cc: Wei Xu, Andrew Morton, Dave Hansen, Huang Ying, Dan Williams,
	Yang Shi, Linux MM, Greg Thelen, Aneesh Kumar K.V,
	Jagdish Gediya, Linux Kernel Mailing List, Michal Hocko,
	Baolin Wang, Brice Goglin, Feng Tang, Jonathan.Cameron

On 5/3/22 10:14, Alistair Popple wrote:
> I would certainly be interested in figuring out how HW could provide some sort
> of heatmap to identify which pages are hot and which processing unit is using
> them. Currently for these systems users have to manually assign memory policy to
> get any reasonable performance, both to disable NUMA balancing and make sure
> memory is allocated on the right node.

Autonuma-induced page faults are a total non-starter for lots of
workloads, even ignoring GPUs.  Basically anyone who is latency
sensitive stays far, far away from autonuma.

As for improving on page faults for data collection...

*Can* hardware provide this information?  Definitely.

Have hardware vendors been motivated enough to add hardware to do this?
 Nope, not yet.

Do you know anyone that works for any hardware companies? ;)

Seriously, though.  Folks at Intel _are_ thinking about this problem.
I'm hoping we have hardware some day to help lend a hand.  The more
hardware vendors that do this, the more likely it is that we'll have
good kernel code to consume data from the hardware.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-04-30  2:10 RFC: Memory Tiering Kernel Interfaces Wei Xu
                   ` (3 preceding siblings ...)
  2022-05-02 15:20 ` Dave Hansen
@ 2022-05-03 19:12 ` Tim Chen
  2022-05-05  7:02   ` Wei Xu
  2022-05-05  8:57 ` ying.huang
  2022-05-05 23:57 ` Alistair Popple
  6 siblings, 1 reply; 57+ messages in thread
From: Tim Chen @ 2022-05-03 19:12 UTC (permalink / raw)
  To: Wei Xu, Andrew Morton, Dave Hansen, Huang Ying, Dan Williams,
	Yang Shi, Linux MM, Greg Thelen, Aneesh Kumar K.V,
	Jagdish Gediya, Linux Kernel Mailing List, Alistair Popple,
	Davidlohr Bueso, Michal Hocko, Baolin Wang, Brice Goglin,
	Feng Tang, Jonathan.Cameron

On Fri, 2022-04-29 at 19:10 -0700, Wei Xu wrote:
> The current kernel has the basic memory tiering support: Inactive
> pages on a higher tier NUMA node can be migrated (demoted) to a lower
> tier NUMA node to make room for new allocations on the higher tier
> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> migrated (promoted) to a higher tier NUMA node to improve the
> performance.
> 
> A tiering relationship between NUMA nodes in the form of demotion path
> is created during the kernel initialization and updated when a NUMA
> node is hot-added or hot-removed.  The current implementation puts all
> nodes with CPU into the top tier, and then builds the tiering hierarchy
> tier-by-tier by establishing the per-node demotion targets based on
> the distances between nodes.

Thanks for making this proposal.  It has many of the elements needed
for the tiering support.

> 
> The current memory tiering interface needs to be improved to address
> several important use cases:
> 
> * The current tiering initialization code always initializes
>   each memory-only NUMA node into a lower tier.  But a memory-only
>   NUMA node may have a high performance memory device (e.g. a DRAM
>   device attached via CXL.mem or a DRAM-backed memory-only node on
>   a virtual machine) and should be put into the top tier.
> 
> * The current tiering hierarchy always puts CPU nodes into the top
>   tier. But on a system with HBM (e.g. GPU memory) devices, these
>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>   with CPUs are better to be placed into the next lower tier.
> 
> * Also because the current tiering hierarchy always puts CPU nodes
>   into the top tier, when a CPU is hot-added (or hot-removed) and
>   triggers a memory node from CPU-less into a CPU node (or vice
>   versa), the memory tiering hierarchy gets changed, even though no
>   memory node is added or removed.  This can make the tiering
>   hierarchy much less stable.
> 
> * A higher tier node can only be demoted to selected nodes on the
>   next lower tier, not any other node from the next lower tier.  This
>   strict, hard-coded demotion order does not work in all use cases
>   (e.g. some use cases may want to allow cross-socket demotion to
>   another node in the same demotion tier as a fallback when the
>   preferred demotion node is out of space), and has resulted in the
>   feature request for an interface to override the system-wide,
>   per-node demotion order from the userspace.
> 
> * There are no interfaces for the userspace to learn about the memory
>   tiering hierarchy in order to optimize its memory allocations.
> 
> I'd like to propose revised memory tiering kernel interfaces based on
> the discussions in the threads:
> 
> - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
> 
> 
> Sysfs Interfaces
> ================
> 
> * /sys/devices/system/node/memory_tiers
> 
>   Format: node list (one tier per line, in the tier order)
> 
>   When read, list memory nodes by tiers.
> 
>   When written (one tier per line), take the user-provided node-tier
>   assignment as the new tiering hierarchy and rebuild the per-node
>   demotion order.  It is allowed to only override the top tiers, in
>   which cases, the kernel will establish the lower tiers automatically.
> 
> 
> Kernel Representation
> =====================
> 
> * nodemask_t node_states[N_TOPTIER_MEMORY]
> 
>   Store all top-tier memory nodes.
> 
> * nodemask_t memory_tiers[MAX_TIERS]
> 
>   Store memory nodes by tiers.
> 
> * struct demotion_nodes node_demotion[]
> 
>   where: struct demotion_nodes { nodemask_t preferred; nodemask_t allowed; }
> 
>   For a node N:
> 
>   node_demotion[N].preferred lists all preferred demotion targets;
> 
>   node_demotion[N].allowed lists all allowed demotion targets
>   (initialized to be all the nodes in the same demotion tier).
> 

I assume that the preferred list is auto-configured/initialized based on
NUMA distances.  I'm not sure why the "allowed" list is restricted to the
same demotion tier, though.  For example, I think the default should be
that tier 0 is allowed to demote to tier 1 and tier 2, not just to tier 1.
So if we fail to demote to tier 1, we can demote to tier 2.

Do you also expose the preferred demotion node and the allowed list via
/sys/devices/system/node/memory_tiers, as you have done in the examples?

> Examples
> ========
> 
> * Example 2:
>   Node 0 & 1 are DRAM nodes.
>   Node 2 is a PMEM node and closer to node 0.
> 
>   Node 0 has node 2 as the preferred and only demotion target.
> 
>   Node 1 has no preferred demotion target, but can still demote
>   to node 2.
> 
>   Set mempolicy to prevent cross-socket demotion and memory access,
>   e.g. cpuset.mems=0,2

Do we expect to later allow configuration of the demotion list explicitly?
Something like:

echo "demotion 0 1 1-3" > /sys/devices/system/node/memory_tiers

to set the demotion list for node 0, where the preferred demotion node is 1
and the allowed demotion node list is 1-3.

Thanks.

Tim




^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-03 17:47     ` Dave Hansen
@ 2022-05-03 22:35       ` Alistair Popple
  2022-05-03 23:54         ` Dave Hansen
  0 siblings, 1 reply; 57+ messages in thread
From: Alistair Popple @ 2022-05-03 22:35 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Davidlohr Bueso, Wei Xu, Andrew Morton, Dave Hansen, Huang Ying,
	Dan Williams, Yang Shi, Linux MM, Greg Thelen, Aneesh Kumar K.V,
	Jagdish Gediya, Linux Kernel Mailing List, Michal Hocko,
	Baolin Wang, Brice Goglin, Feng Tang, Jonathan.Cameron

Dave Hansen <dave.hansen@intel.com> writes:

> On 5/3/22 10:14, Alistair Popple wrote:
>> I would certainly be interested in figuring out how HW could provide some sort
>> of heatmap to identify which pages are hot and which processing unit is using
>> them. Currently for these systems users have to manually assign memory policy to
>> get any reasonable performance, both to disable NUMA balancing and make sure
>> memory is allocated on the right node.
>
> Autonuma-induced page faults are a total non-starter for lots of
> workloads, even ignoring GPUs.  Basically anyone who is latency
> sensitive stays far, far away from autonuma.
>
> As for improving on page faults for data collection...
>
> *Can* hardware provide this information?  Definitely.
>
> Have hardware vendors been motivated enough to add hardware to do this?
>  Nope, not yet.

Not entirely true. The GPUs on POWER9 have performance counters capable of
collecting this kind of information for memory accessed from the GPU. I will
admit though that sadly most people probably don't have a P9 sitting under their
desk :)

For various reasons these counters weren't exposed to the kernel but that's
something I would like to work on fixing.

> Do you know anyone that works for any hardware companies? ;)

Maybe ;)

> Seriously, though.  Folks at Intel _are_ thinking about this problem.
> I'm hoping we have hardware some day to help lend a hand.  The more
> hardware vendors that do this, the more likely it is that we'll have
> good kernel code to consume data from the hardware.

Agreed.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-03 22:35       ` Alistair Popple
@ 2022-05-03 23:54         ` Dave Hansen
  2022-05-04  1:31           ` Wei Xu
  0 siblings, 1 reply; 57+ messages in thread
From: Dave Hansen @ 2022-05-03 23:54 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Davidlohr Bueso, Wei Xu, Andrew Morton, Dave Hansen, Huang Ying,
	Dan Williams, Yang Shi, Linux MM, Greg Thelen, Aneesh Kumar K.V,
	Jagdish Gediya, Linux Kernel Mailing List, Michal Hocko,
	Baolin Wang, Brice Goglin, Feng Tang, Jonathan.Cameron

On 5/3/22 15:35, Alistair Popple wrote:
> Not entirely true. The GPUs on POWER9 have performance counters capable of
> collecting this kind of information for memory accessed from the GPU. I will
> admit though that sadly most people probably don't have a P9 sitting under their
> desk :)

Well, x86 CPUs have performance monitoring hardware that can
theoretically collect physical access information too.  But this
performance monitoring hardware wasn't designed with this specific use
case in mind.  So, in practice, these events (PEBS) weren't very useful
for driving memory tiering.

Are you saying that the GPUs on POWER9 have performance counters that
can drive memory tiering in practice?  I'd be curious if there's working
code to show how they get used.  Maybe the hardware is better than the
x86 PMU or the software consuming it is more clever than what we did.
But, I'd love to see it either way.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-03 23:54         ` Dave Hansen
@ 2022-05-04  1:31           ` Wei Xu
  2022-05-04 17:02             ` Dave Hansen
  0 siblings, 1 reply; 57+ messages in thread
From: Wei Xu @ 2022-05-04  1:31 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Alistair Popple, Davidlohr Bueso, Andrew Morton, Dave Hansen,
	Huang Ying, Dan Williams, Yang Shi, Linux MM, Greg Thelen,
	Aneesh Kumar K.V, Jagdish Gediya, Linux Kernel Mailing List,
	Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang,
	Jonathan Cameron

On Tue, May 3, 2022 at 4:54 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/3/22 15:35, Alistair Popple wrote:
> > Not entirely true. The GPUs on POWER9 have performance counters capable of
> > collecting this kind of information for memory accessed from the GPU. I will
> > admit though that sadly most people probably don't have a P9 sitting under their
> > desk :)
>
> Well, x86 CPUs have performance monitoring hardware that can
> theoretically collect physical access information too.  But, this
> performance monitoring hardware wasn't designed for this specific use
> case in mind.  So, in practice, these events (PEBS) weren't very useful
> for driving memory tiering.

PEBS events without any filtering might not be useful for memory
tiering, but PEBS events with hardware-based data source filtering
can be useful for driving promotions in memory tiering. Certainly,
because these events were not designed with this specific use case in
mind, there are inefficiencies in using them for memory tiering, e.g.
instead of just getting a heat counter for each hot page, we get
repeated events for the same hot pages.
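
Even with hardware doing the data-source classification, the consumer
still inspects the PERF_SAMPLE_DATA_SRC field of each sample.  A minimal
sketch of that decode (treating PMEM as the only "slow tier" source is
an assumption for illustration):

#include <linux/perf_event.h>
#include <stdbool.h>
#include <stdint.h>

/* True if a PEBS sample hit a slow-tier source, i.e. a promotion candidate. */
static bool sample_is_promotion_candidate(uint64_t data_src)
{
	union perf_mem_data_src src = { .val = data_src };

	return src.mem_lvl_num == PERF_MEM_LVLNUM_PMEM;
}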

> Are you saying that the GPUs on POWER9 have performance counters that
> can drive memory tiering in practice?  I'd be curious if there's working
> code to show how they get used.  Maybe the hardware is better than the
> x86 PMU or the software consuming it is more clever than what we did.
> But, I'd love to see it either way.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-04  1:31           ` Wei Xu
@ 2022-05-04 17:02             ` Dave Hansen
  2022-05-05  6:35               ` Wei Xu
  0 siblings, 1 reply; 57+ messages in thread
From: Dave Hansen @ 2022-05-04 17:02 UTC (permalink / raw)
  To: Wei Xu
  Cc: Alistair Popple, Davidlohr Bueso, Andrew Morton, Dave Hansen,
	Huang Ying, Dan Williams, Yang Shi, Linux MM, Greg Thelen,
	Aneesh Kumar K.V, Jagdish Gediya, Linux Kernel Mailing List,
	Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang,
	Jonathan Cameron

On 5/3/22 18:31, Wei Xu wrote:
>> Well, x86 CPUs have performance monitoring hardware that can
>> theoretically collect physical access information too.  But, this
>> performance monitoring hardware wasn't designed for this specific use
>> case in mind.  So, in practice, these events (PEBS) weren't very useful
>> for driving memory tiering.
> The PEBS events without any filtering might not be useful for memory
> tiering, but the PEBS events with hardware-based data source filtering
> can be useful in driving promotions in memory tiering. Certainly,
> because these events are not designed for this specific use case in
> mind, there are inefficiencies using them for memory tiering, e.g.
> instead of just getting a heat counter for each hot page, we can get
> events repeatedly on the hot pages.

Also, I believe the addresses that come out of the PEBS events are
virtual addresses (Data Linear Addresses according to the SDM).  If the
events are written from a KVM guest, you get guest linear addresses.

That means a lot of page table and EPT walks to map those linear
addresses back to physical.  That adds to the inefficiency.

In the end, you get big PEBS buffers with lots of irrelevant data that
needs significant post-processing to make sense of it.  The folks at
Intel who tried this really struggled to take this mess and turn it
into successful hot-page tracking.

Maybe someone else will find a better way to do it, but we tried and
gave up.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-04 17:02             ` Dave Hansen
@ 2022-05-05  6:35               ` Wei Xu
  2022-05-05 14:24                 ` Dave Hansen
  0 siblings, 1 reply; 57+ messages in thread
From: Wei Xu @ 2022-05-05  6:35 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Alistair Popple, Davidlohr Bueso, Andrew Morton, Dave Hansen,
	Huang Ying, Dan Williams, Yang Shi, Linux MM, Greg Thelen,
	Aneesh Kumar K.V, Jagdish Gediya, Linux Kernel Mailing List,
	Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang,
	Jonathan Cameron

On Wed, May 4, 2022 at 10:02 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/3/22 18:31, Wei Xu wrote:
> >> Well, x86 CPUs have performance monitoring hardware that can
> >> theoretically collect physical access information too.  But, this
> >> performance monitoring hardware wasn't designed for this specific use
> >> case in mind.  So, in practice, these events (PEBS) weren't very useful
> >> for driving memory tiering.
> > The PEBS events without any filtering might not be useful for memory
> > tiering, but the PEBS events with hardware-based data source filtering
> > can be useful in driving promotions in memory tiering. Certainly,
> > because these events are not designed for this specific use case in
> > mind, there are inefficiencies using them for memory tiering, e.g.
> > instead of just getting a heat counter for each hot page, we can get
> > events repeatedly on the hot pages.
>
> Also, I believe the addresses that come out of the PEBS events are
> virtual addresses (Data Linear Addresses according to the SDM).  If the
> events are written from a KVM guest, you get guest linear addresses.
>
> That means a lot of page table and EPT walks to map those linear
> addresses back to physical.  That adds to the inefficiency.

That's true if the tracking is purely based on physical pages.  For
hot page tracking from PEBS, we can consider tracking by
virtual/linear address.  We don't need to maintain history for all
linear page addresses, nor for an indefinite amount of time.  After
all, we just need to identify recently and frequently accessed pages
and promote them.
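
A toy sketch of such bounded, recency-biased tracking by virtual
address (table size, hashing, the aging policy and the promotion
threshold are all arbitrary choices for illustration):

#include <stdbool.h>
#include <stdint.h>

#define HOT_TABLE_SIZE		4096
#define PROMOTE_THRESHOLD	8

struct hot_entry {
	uint64_t vpage;		/* virtual address >> PAGE_SHIFT */
	uint32_t count;
};

static struct hot_entry hot_table[HOT_TABLE_SIZE];

/* Record one sampled access; true when the page looks hot enough to promote. */
static bool record_access(uint64_t vaddr)
{
	uint64_t vpage = vaddr >> 12;
	struct hot_entry *e = &hot_table[vpage % HOT_TABLE_SIZE];

	if (e->vpage != vpage) {	/* evict whatever was tracked here */
		e->vpage = vpage;
		e->count = 0;
	}
	return ++e->count >= PROMOTE_THRESHOLD;
}

/* Call periodically so that stale history fades out. */
static void age_hot_table(void)
{
	for (int i = 0; i < HOT_TABLE_SIZE; i++)
		hot_table[i].count /= 2;
}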

> In the end, you get big PEBS buffers with lots of irrelevant data that
> needs significant post-processing to make sense of it.

I am curious what the "lots of irrelevant data" is if PEBS data is
filtered by data source (e.g. DRAM vs PMEM) in hardware.  If we need
to apply different policies to pages from the same data source, then I
agree that the software has to do a lot of filtering work.

> The folks at Intel that tried this really struggled to take this mess and turn it into a successful hot-page tracking.
>
> Maybe someone else will find a better way to do it, but we tried and
> gave up.

It might be challenging to use PEBS as the only, universal hardware
mechanism for hot page tracking. For example, there are challenges in
using PEBS to sample KVM guest accesses from the host.  On the other
hand, PEBS with hardware-based data source filtering can be a useful
mechanism to improve hot page tracking in conjunction with other
techniques.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-03 19:12 ` Tim Chen
@ 2022-05-05  7:02   ` Wei Xu
  0 siblings, 0 replies; 57+ messages in thread
From: Wei Xu @ 2022-05-05  7:02 UTC (permalink / raw)
  To: Tim Chen
  Cc: Andrew Morton, Dave Hansen, Huang Ying, Dan Williams, Yang Shi,
	Linux MM, Greg Thelen, Aneesh Kumar K.V, Jagdish Gediya,
	Linux Kernel Mailing List, Alistair Popple, Davidlohr Bueso,
	Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang,
	Jonathan Cameron

On Tue, May 3, 2022 at 12:12 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> On Fri, 2022-04-29 at 19:10 -0700, Wei Xu wrote:
> > The current kernel has the basic memory tiering support: Inactive
> > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > tier NUMA node to make room for new allocations on the higher tier
> > NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> > migrated (promoted) to a higher tier NUMA node to improve the
> > performance.
> >
> > A tiering relationship between NUMA nodes in the form of demotion path
> > is created during the kernel initialization and updated when a NUMA
> > node is hot-added or hot-removed.  The current implementation puts all
> > nodes with CPU into the top tier, and then builds the tiering hierarchy
> > tier-by-tier by establishing the per-node demotion targets based on
> > the distances between nodes.
>
> Thanks for making this proposal.  It has many of the elements needed
> for the tiering support.
>
> >
> > The current memory tiering interface needs to be improved to address
> > several important use cases:
> >
> > * The current tiering initialization code always initializes
> >   each memory-only NUMA node into a lower tier.  But a memory-only
> >   NUMA node may have a high performance memory device (e.g. a DRAM
> >   device attached via CXL.mem or a DRAM-backed memory-only node on
> >   a virtual machine) and should be put into the top tier.
> >
> > * The current tiering hierarchy always puts CPU nodes into the top
> >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> >   with CPUs are better to be placed into the next lower tier.
> >
> > * Also because the current tiering hierarchy always puts CPU nodes
> >   into the top tier, when a CPU is hot-added (or hot-removed) and
> >   triggers a memory node from CPU-less into a CPU node (or vice
> >   versa), the memory tiering hierarchy gets changed, even though no
> >   memory node is added or removed.  This can make the tiering
> >   hierarchy much less stable.
> >
> > * A higher tier node can only be demoted to selected nodes on the
> >   next lower tier, not any other node from the next lower tier.  This
> >   strict, hard-coded demotion order does not work in all use cases
> >   (e.g. some use cases may want to allow cross-socket demotion to
> >   another node in the same demotion tier as a fallback when the
> >   preferred demotion node is out of space), and has resulted in the
> >   feature request for an interface to override the system-wide,
> >   per-node demotion order from the userspace.
> >
> > * There are no interfaces for the userspace to learn about the memory
> >   tiering hierarchy in order to optimize its memory allocations.
> >
> > I'd like to propose revised memory tiering kernel interfaces based on
> > the discussions in the threads:
> >
> > - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> > - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
> >
> >
> > Sysfs Interfaces
> > ================
> >
> > * /sys/devices/system/node/memory_tiers
> >
> >   Format: node list (one tier per line, in the tier order)
> >
> >   When read, list memory nodes by tiers.
> >
> >   When written (one tier per line), take the user-provided node-tier
> >   assignment as the new tiering hierarchy and rebuild the per-node
> >   demotion order.  It is allowed to only override the top tiers, in
> >   which cases, the kernel will establish the lower tiers automatically.
> >
> >
> > Kernel Representation
> > =====================
> >
> > * nodemask_t node_states[N_TOPTIER_MEMORY]
> >
> >   Store all top-tier memory nodes.
> >
> > * nodemask_t memory_tiers[MAX_TIERS]
> >
> >   Store memory nodes by tiers.
> >
> > * struct demotion_nodes node_demotion[]
> >
> >   where: struct demotion_nodes { nodemask_t preferred; nodemask_t allowed; }
> >
> >   For a node N:
> >
> >   node_demotion[N].preferred lists all preferred demotion targets;
> >
> >   node_demotion[N].allowed lists all allowed demotion targets
> >   (initialized to be all the nodes in the same demotion tier).
> >
>
> I assume that the preferred list is auto-configured/initialized based on
> NUMA distances.  Not sure why "allowed" list is only to the same demotion
> tier?  For example, I think the default should be tier 0 should
> is allowed to demote to tier 1 and tier 2, not just to tier 1.  So if we
> fail to demote to tier 1, we can demote to tier 2.

I agree that we can allow demotion to go to all the lower tiers, not
just the immediate next tier.  I mentioned the same idea when replying
to Dan's comments.

> Do you also expose the demotion preferred node and allowed
> list via /sys/devices/system/node/memory_tiers, as you have done in the examples?

To keep the memory tier sysfs minimal for now, I didn't propose
exposing the preferred/allowed demotion lists in
/sys/devices/system/node/memory_tiers.  But I can now see that, in the
way the examples were presented, N_TOPTIER_MEMORY and node_demotion[]
can be mistaken for part of the memory_tiers output, which is not the
intention.

> > Examples
> > ========
> >
> > * Example 2:
> >   Node 0 & 1 are DRAM nodes.
> >   Node 2 is a PMEM node and closer to node 0.
> >
> >   Node 0 has node 2 as the preferred and only demotion target.
> >
> >   Node 1 has no preferred demotion target, but can still demote
> >   to node 2.
> >
> >   Set mempolicy to prevent cross-socket demotion and memory access,
> >   e.g. cpuset.mems=0,2
>
> Do we expect to later allow configuration of the demotion list explicitly?
> Something like:
>
> echo "demotion 0 1 1-3" > /sys/devices/system/node/memory_tiers
>
> to set demotion list for node 0, where preferred demote node is 1,
> allowed demote node list is 1-3.

IMHO, we'd better follow the allocation fallback order for the
demotion node order within each tier and avoid userspace overrides of
the per-node demotion lists.

In general, I think we'd better keep the tier assignment of each node
stable.  If adding/changing one node can redefine the tiers of other
nodes, it can make tier-based memory accounting very difficult.
Overriding the per-node demotion lists can have such undesirable side
effects (if the per-node demotion lists are used to redefine tiers).

> Thanks.
>
> Tim
>
>


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-04-30  2:10 RFC: Memory Tiering Kernel Interfaces Wei Xu
                   ` (4 preceding siblings ...)
  2022-05-03 19:12 ` Tim Chen
@ 2022-05-05  8:57 ` ying.huang
  2022-05-05 23:57 ` Alistair Popple
  6 siblings, 0 replies; 57+ messages in thread
From: ying.huang @ 2022-05-05  8:57 UTC (permalink / raw)
  To: Wei Xu, Andrew Morton, Dave Hansen, Dan Williams, Yang Shi,
	Linux MM, Greg Thelen, Aneesh Kumar K.V, Jagdish Gediya,
	Linux Kernel Mailing List, Alistair Popple, Davidlohr Bueso,
	Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang,
	Jonathan.Cameron

On Fri, 2022-04-29 at 19:10 -0700, Wei Xu wrote:
> The current kernel has the basic memory tiering support: Inactive
> pages on a higher tier NUMA node can be migrated (demoted) to a lower
> tier NUMA node to make room for new allocations on the higher tier
> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> migrated (promoted) to a higher tier NUMA node to improve the
> performance.
> 
> A tiering relationship between NUMA nodes in the form of demotion path
> is created during the kernel initialization and updated when a NUMA
> node is hot-added or hot-removed.  The current implementation puts all
> nodes with CPU into the top tier, and then builds the tiering hierarchy
> tier-by-tier by establishing the per-node demotion targets based on
> the distances between nodes.
> 
> The current memory tiering interface needs to be improved to address
> several important use cases:
> 
> * The current tiering initialization code always initializes
>   each memory-only NUMA node into a lower tier.  But a memory-only
>   NUMA node may have a high performance memory device (e.g. a DRAM
>   device attached via CXL.mem or a DRAM-backed memory-only node on
>   a virtual machine) and should be put into the top tier.
> 
> * The current tiering hierarchy always puts CPU nodes into the top
>   tier. But on a system with HBM (e.g. GPU memory) devices, these
>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>   with CPUs are better to be placed into the next lower tier.
> 
> * Also because the current tiering hierarchy always puts CPU nodes
>   into the top tier, when a CPU is hot-added (or hot-removed) and
>   triggers a memory node from CPU-less into a CPU node (or vice
>   versa), the memory tiering hierarchy gets changed, even though no
>   memory node is added or removed.  This can make the tiering
>   hierarchy much less stable.
> 
> * A higher tier node can only be demoted to selected nodes on the
>   next lower tier, not any other node from the next lower tier.  This
>   strict, hard-coded demotion order does not work in all use cases
>   (e.g. some use cases may want to allow cross-socket demotion to
>   another node in the same demotion tier as a fallback when the
>   preferred demotion node is out of space), and has resulted in the
>   feature request for an interface to override the system-wide,
>   per-node demotion order from the userspace.
> 
> * There are no interfaces for the userspace to learn about the memory
>   tiering hierarchy in order to optimize its memory allocations.
> 
> I'd like to propose revised memory tiering kernel interfaces based on
> the discussions in the threads:
> 
> - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
> 
> 
> Sysfs Interfaces
> ================
> 
> * /sys/devices/system/node/memory_tiers
> 
>   Format: node list (one tier per line, in the tier order)
> 
>   When read, list memory nodes by tiers.
> 
>   When written (one tier per line), take the user-provided node-tier
>   assignment as the new tiering hierarchy and rebuild the per-node
>   demotion order.  It is allowed to only override the top tiers, in
>   which cases, the kernel will establish the lower tiers automatically.
> 
> 
> Kernel Representation
> =====================
> 
> * nodemask_t node_states[N_TOPTIER_MEMORY]
> 
>   Store all top-tier memory nodes.
> 
> * nodemask_t memory_tiers[MAX_TIERS]
> 
>   Store memory nodes by tiers.
> 
> * struct demotion_nodes node_demotion[]
> 
>   where: struct demotion_nodes { nodemask_t preferred; nodemask_t allowed; }
> 
>   For a node N:
> 
>   node_demotion[N].preferred lists all preferred demotion targets;
> 
>   node_demotion[N].allowed lists all allowed demotion targets
>   (initialized to be all the nodes in the same demotion tier).
> 
> 
> Tiering Hierarchy Initialization
> ================================
> 
> By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
> 
> A device driver can remove its memory nodes from the top tier, e.g.
> a dax driver can remove PMEM nodes from the top tier.
> 
> The kernel builds the memory tiering hierarchy and per-node demotion
> order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
> best distance nodes in the next lower tier are assigned to
> node_demotion[N].preferred and all the nodes in the next lower tier
> are assigned to node_demotion[N].allowed.
> 
> node_demotion[N].preferred can be empty if no preferred demotion node
> is available for node N.
> 
> If the userspace overrides the tiers via the memory_tiers sysfs
> interface, the kernel then only rebuilds the per-node demotion order
> accordingly.
> 
> Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
> memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
> node.
> 
> 
> Memory Allocation for Demotion
> ==============================
> 
> When allocating a new demotion target page, both a preferred node
> and the allowed nodemask are provided to the allocation function.
> The default kernel allocation fallback order is used to allocate the
> page from the specified node and nodemask.
> 
> The mempolicy of cpuset, vma and owner task of the source page can
> be set to refine the demotion nodemask, e.g. to prevent demotion or
> select a particular allowed node as the demotion target.
> 
> 
> Examples
> ========
> 
> * Example 1:
>   Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
> 
>   Node 0 has node 2 as the preferred demotion target and can also
>   fallback demotion to node 3.
> 
>   Node 1 has node 3 as the preferred demotion target and can also
>   fallback demotion to node 2.
> 
>   Set mempolicy to prevent cross-socket demotion and memory access,
>   e.g. cpuset.mems=0,2
> 
> node distances:
> node   0    1    2    3
>    0  10   20   30   40
>    1  20   10   40   30
>    2  30   40   10   40
>    3  40   30   40   10
> 
> /sys/devices/system/node/memory_tiers
> 0-1
> 2-3
> 
> N_TOPTIER_MEMORY: 0-1
> 
> node_demotion[]:
>   0: [2], [2-3]
>   1: [3], [2-3]
>   2: [],  []
>   3: [],  []
> 
> * Example 2:
>   Node 0 & 1 are DRAM nodes.
>   Node 2 is a PMEM node and closer to node 0.
> 
>   Node 0 has node 2 as the preferred and only demotion target.
> 
>   Node 1 has no preferred demotion target, but can still demote
>   to node 2.
> 
>   Set mempolicy to prevent cross-socket demotion and memory access,
>   e.g. cpuset.mems=0,2
> 
> node distances:
> node   0    1    2
>    0  10   20   30
>    1  20   10   40
>    2  30   40   10
> 
> /sys/devices/system/node/memory_tiers
> 0-1
> 2
> 
> N_TOPTIER_MEMORY: 0-1
> 
> node_demotion[]:
>   0: [2], [2]
>   1: [],  [2]
>   2: [],  []
> 
> 
> * Example 3:
>   Node 0 & 1 are DRAM nodes.
>   Node 2 is a PMEM node and has the same distance to node 0 & 1.
> 
>   Node 0 has node 2 as the preferred and only demotion target.
> 
>   Node 1 has node 2 as the preferred and only demotion target.
> 
> node distances:
> node   0    1    2
>    0  10   20   30
>    1  20   10   30
>    2  30   30   10
> 
> /sys/devices/system/node/memory_tiers
> 0-1
> 2
> 
> N_TOPTIER_MEMORY: 0-1
> 
> node_demotion[]:
>   0: [2], [2]
>   1: [2], [2]
>   2: [],  []
> 
> 
> * Example 4:
>   Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
> 
>   All nodes are top-tier.
> 
> node distances:
> node   0    1    2
>    0  10   20   30
>    1  20   10   30
>    2  30   30   10
> 
> /sys/devices/system/node/memory_tiers
> 0-2
> 
> N_TOPTIER_MEMORY: 0-2
> 
> node_demotion[]:
>   0: [],  []
>   1: [],  []
>   2: [],  []
> 
> 
> * Example 5:
>   Node 0 is a DRAM node with CPU.
>   Node 1 is a HBM node.
>   Node 2 is a PMEM node.
> 
>   With userspace override, node 1 is the top tier and has node 0 as
>   the preferred and only demotion target.
> 
>   Node 0 is in the second tier, tier 1, and has node 2 as the
>   preferred and only demotion target.
> 
>   Node 2 is in the lowest tier, tier 2, and has no demotion targets.
> 
> node distances:
> node   0    1    2
>    0  10   21   30
>    1  21   10   40
>    2  30   40   10
> 
> /sys/devices/system/node/memory_tiers (userspace override)
> 1
> 0
> 2
> 
> N_TOPTIER_MEMORY: 1
> 
> node_demotion[]:
>   0: [2], [2]
>   1: [0], [0]
>   2: [],  []

Sorry for the late reply.

I think the proposed interfaces above and the more "tiered" organization
are a good idea in general.  As in your later email, we should use one
file for each tier, as suggested by Dave Hansen.  And we can start with
2 tiers for now.  That is, all nodes start in tier0, and the nodes
onlined via the kmem dax driver are in tierN (N >= 1), as suggested by
Aneesh Kumar and Jagdish Gediya.  When we have more information and
clearer requirements in the future, we can improve our implementation
and extend our user space interface.

We can even start with just one file, "tier0", because all nodes except
those in tier0 will be in tier1.
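
For illustration only, here is a minimal sketch in C of what the read
side of such a "tier0" file could look like.  It assumes the
memory_tiers[] nodemask array proposed in this RFC; none of these
symbols exist in the current kernel.

#include <linux/device.h>
#include <linux/nodemask.h>
#include <linux/sysfs.h>

/* Proposed in this RFC; not an existing kernel symbol. */
extern nodemask_t memory_tiers[];

/* Read /sys/devices/system/node/tier0: print the tier-0 nodes as a
 * node list, e.g. "0-1". */
static ssize_t tier0_show(struct device *dev,
                          struct device_attribute *attr, char *buf)
{
        return sysfs_emit(buf, "%*pbl\n",
                          nodemask_pr_args(&memory_tiers[0]));
}
static DEVICE_ATTR_RO(tier0);

Registering the attribute under /sys/devices/system/node/ would reuse
the existing node sysfs code; only the read side is sketched here.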

Best Regards,
Huang, Ying




^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-05  6:35               ` Wei Xu
@ 2022-05-05 14:24                 ` Dave Hansen
  2022-05-10  4:43                   ` Wei Xu
  0 siblings, 1 reply; 57+ messages in thread
From: Dave Hansen @ 2022-05-05 14:24 UTC (permalink / raw)
  To: Wei Xu
  Cc: Alistair Popple, Davidlohr Bueso, Andrew Morton, Dave Hansen,
	Huang Ying, Dan Williams, Yang Shi, Linux MM, Greg Thelen,
	Aneesh Kumar K.V, Jagdish Gediya, Linux Kernel Mailing List,
	Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang,
	Jonathan Cameron

On 5/4/22 23:35, Wei Xu wrote:
> On Wed, May 4, 2022 at 10:02 AM Dave Hansen <dave.hansen@intel.com> wrote:
>> That means a lot of page table and EPT walks to map those linear
>> addresses back to physical.  That adds to the inefficiency.
> 
> That's true if the tracking is purely based on physical pages.  For
> hot page tracking from PEBS, we can consider tracking in
> virtual/linear addresses.  We don't need to maintain the history for
> all linear page addresses nor for an indefinite amount of time.  After
> all, we just need to identify pages accessed frequently recently and
> promote them.

Except that you don't want to promote on *every* access.  That might
lead to too much churn.

You're also assuming that all accesses to a physical page are via a
single linear address, which ignores shared memory mapped at different
linear addresses.  Our (maybe wrong) assumption has been that shared
memory is important enough to manage that it can't be ignored.
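
To make that concrete, here is a trivial userspace sketch (illustrative
only; error handling omitted, file name made up): the same physical page
ends up mapped at two different linear addresses, so per-linear-address
counts each see only part of that page's heat.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        /* Two MAP_SHARED mappings of the same file: two distinct linear
         * addresses backed by the same physical page. */
        int fd = open("/tmp/shared-page-demo", O_RDWR | O_CREAT, 0600);
        ftruncate(fd, 4096);

        char *a = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        char *b = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        strcpy(a, "written via mapping a");
        /* Reads back the same data through a different linear address. */
        printf("via b (%p vs %p): %s\n", (void *)b, (void *)a, b);

        munmap(a, 4096);
        munmap(b, 4096);
        close(fd);
        return 0;
}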

>> In the end, you get big PEBS buffers with lots of irrelevant data that
>> needs significant post-processing to make sense of it.
> 
> I am curious about what are "lots of irrelevant data" if PEBS data is
> filtered on data sources (e.g. DRAM vs PMEM) by hardware.  If we need
> to have different policies for the pages from the same data source,
> then I agree that the software has to do a lot of filtering work.

Perhaps "irrelevant" was a bad term to use.  I meant that you can't just
take the PEBS data and act directly on it.  It has to be post-processed
and you will see things in there like lots of adjacent accesses to a
page.  Those additional accesses can be interesting but at some point
you have all the weight you need to promote the page and the _rest_ are
irrelevant.
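
As a purely illustrative sketch of that post-processing (not based on
any real PEBS decoding code; the names and threshold below are made up),
the per-page weight only matters until the promotion threshold is
crossed:

#include <stdbool.h>
#include <stdint.h>

#define PROMOTE_THRESHOLD 4     /* hypothetical weight needed to promote */

struct page_stat {
        uint64_t pfn;           /* page being tracked */
        unsigned int weight;    /* sampled accesses seen so far */
        bool promoted;
};

/* Account one PEBS-style sample for a page.  Returns true exactly once,
 * when the page crosses the threshold; later samples for the same page
 * add no information and are dropped. */
static bool account_sample(struct page_stat *ps)
{
        if (ps->promoted)
                return false;
        if (++ps->weight >= PROMOTE_THRESHOLD) {
                ps->promoted = true;
                return true;
        }
        return false;
}

A real consumer would also have to look the page_stat up by pfn (or by
linear address) and age the weights over time, which is where much of
the post-processing work sits.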

>> The folks at Intel that tried this really struggled to take this mess and turn it into a successful hot-page tracking.
>>
>> Maybe someone else will find a better way to do it, but we tried and
>> gave up.
> 
> It might be challenging to use PEBS as the only and universal hot page
> tracking hardware mechanism. For example, there are challenges to use
> PEBS to sample KVM guest accesses from the host.

Yep, agreed.  This aspect of the hardware is very painful at the moment.

> On the other hand, PEBS with hardware-based data source filtering can
> be a useful mechanism to improve hot page tracking in conjunction
> with other techniques.

Rather than "can", I'd say: "might".  Backing up to what I said originally:

> So, in practice, these events (PEBS) weren't very useful
> for driving memory tiering.

By "driving" I really meant solely driving.  Like, can PEBS be used as
the one and only mechanism?  We couldn't make it work.  But, the
hardware _is_ sitting there mostly unused.  It might be great to augment
what is there, and nobody should be discouraged from looking at it again.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-04-30  2:10 RFC: Memory Tiering Kernel Interfaces Wei Xu
                   ` (5 preceding siblings ...)
  2022-05-05  8:57 ` ying.huang
@ 2022-05-05 23:57 ` Alistair Popple
  2022-05-06  0:25   ` Alistair Popple
  6 siblings, 1 reply; 57+ messages in thread
From: Alistair Popple @ 2022-05-05 23:57 UTC (permalink / raw)
  To: Wei Xu
  Cc: Andrew Morton, Dave Hansen, Huang Ying, Dan Williams, Yang Shi,
	Linux MM, Greg Thelen, Aneesh Kumar K.V, Jagdish Gediya,
	Linux Kernel Mailing List, Davidlohr Bueso, Michal Hocko,
	Baolin Wang, Brice Goglin, Feng Tang, Jonathan.Cameron


Wei Xu <weixugc@google.com> writes:

> The current kernel has the basic memory tiering support: Inactive
> pages on a higher tier NUMA node can be migrated (demoted) to a lower
> tier NUMA node to make room for new allocations on the higher tier
> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> migrated (promoted) to a higher tier NUMA node to improve the
> performance.
>
> A tiering relationship between NUMA nodes in the form of demotion path
> is created during the kernel initialization and updated when a NUMA
> node is hot-added or hot-removed.  The current implementation puts all
> nodes with CPU into the top tier, and then builds the tiering hierarchy
> tier-by-tier by establishing the per-node demotion targets based on
> the distances between nodes.
>
> The current memory tiering interface needs to be improved to address
> several important use cases:
>
> * The current tiering initialization code always initializes
>   each memory-only NUMA node into a lower tier.  But a memory-only
>   NUMA node may have a high performance memory device (e.g. a DRAM
>   device attached via CXL.mem or a DRAM-backed memory-only node on
>   a virtual machine) and should be put into the top tier.
>
> * The current tiering hierarchy always puts CPU nodes into the top
>   tier. But on a system with HBM (e.g. GPU memory) devices, these
>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>   with CPUs are better to be placed into the next lower tier.
>
> * Also because the current tiering hierarchy always puts CPU nodes
>   into the top tier, when a CPU is hot-added (or hot-removed) and
>   triggers a memory node from CPU-less into a CPU node (or vice
>   versa), the memory tiering hierarchy gets changed, even though no
>   memory node is added or removed.  This can make the tiering
>   hierarchy much less stable.
>
> * A higher tier node can only be demoted to selected nodes on the
>   next lower tier, not any other node from the next lower tier.  This
>   strict, hard-coded demotion order does not work in all use cases
>   (e.g. some use cases may want to allow cross-socket demotion to
>   another node in the same demotion tier as a fallback when the
>   preferred demotion node is out of space), and has resulted in the
>   feature request for an interface to override the system-wide,
>   per-node demotion order from the userspace.
>
> * There are no interfaces for the userspace to learn about the memory
>   tiering hierarchy in order to optimize its memory allocations.
>
> I'd like to propose revised memory tiering kernel interfaces based on
> the discussions in the threads:
>
> - <https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/>
> - <https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/>
>
>
> Sysfs Interfaces
> `=============='
>
> * /sys/devices/system/node/memory_tiers
>
>   Format: node list (one tier per line, in the tier order)
>
>   When read, list memory nodes by tiers.
>
>   When written (one tier per line), take the user-provided node-tier
>   assignment as the new tiering hierarchy and rebuild the per-node
>   demotion order.  It is allowed to only override the top tiers, in
>   which cases, the kernel will establish the lower tiers automatically.
>
>
> Kernel Representation
> `==================='
>
> * nodemask_t node_states[N_TOPTIER_MEMORY]
>
>   Store all top-tier memory nodes.
>
> * nodemask_t memory_tiers[MAX_TIERS]
>
>   Store memory nodes by tiers.
>
> * struct demotion_nodes node_demotion[]
>
>   where: struct demotion_nodes { nodemask_t preferred; nodemask_t allowed; }
>
>   For a node N:
>
>   node_demotion[N].preferred lists all preferred demotion targets;
>
>   node_demotion[N].allowed lists all allowed demotion targets
>   (initialized to be all the nodes in the same demotion tier).
>
>
> Tiering Hierarchy Initialization
> `=============================='
>
> By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
>
> A device driver can remove its memory nodes from the top tier, e.g.
> a dax driver can remove PMEM nodes from the top tier.
>
> The kernel builds the memory tiering hierarchy and per-node demotion
> order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
> best distance nodes in the next lower tier are assigned to
> node_demotion[N].preferred and all the nodes in the next lower tier
> are assigned to node_demotion[N].allowed.
>
> node_demotion[N].preferred can be empty if no preferred demotion node
> is available for node N.
>
> If the userspace overrides the tiers via the memory_tiers sysfs
> interface, the kernel then only rebuilds the per-node demotion order
> accordingly.
>
> Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
> memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
> node.
>
>
> Memory Allocation for Demotion
> `============================'
>
> When allocating a new demotion target page, both a preferred node
> and the allowed nodemask are provided to the allocation function.
> The default kernel allocation fallback order is used to allocate the
> page from the specified node and nodemask.
>
> The mempolicy of cpuset, vma and owner task of the source page can
> be set to refine the demotion nodemask, e.g. to prevent demotion or
> select a particular allowed node as the demotion target.
>
>
> Examples
> `======'
>
> * Example 1:
>   Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
>
>   Node 0 has node 2 as the preferred demotion target and can also
>   fallback demotion to node 3.
>
>   Node 1 has node 3 as the preferred demotion target and can also
>   fallback demotion to node 2.
>
>   Set mempolicy to prevent cross-socket demotion and memory access,
>   e.g. cpuset.mems=0,2
>
> node distances:
> node   0    1    2    3
>    0  10   20   30   40
>    1  20   10   40   30
>    2  30   40   10   40
>    3  40   30   40   10
>
> /sys/devices/system/node/memory_tiers
> 0-1
> 2-3
>
> N_TOPTIER_MEMORY: 0-1
>
> node_demotion[]:
>   0: [2], [2-3]
>   1: [3], [2-3]
>   2: [],  []
>   3: [],  []
>
> * Example 2:
>   Node 0 & 1 are DRAM nodes.
>   Node 2 is a PMEM node and closer to node 0.
>
>   Node 0 has node 2 as the preferred and only demotion target.
>
>   Node 1 has no preferred demotion target, but can still demote
>   to node 2.
>
>   Set mempolicy to prevent cross-socket demotion and memory access,
>   e.g. cpuset.mems=0,2
>
> node distances:
> node   0    1    2
>    0  10   20   30
>    1  20   10   40
>    2  30   40   10
>
> /sys/devices/system/node/memory_tiers
> 0-1
> 2
>
> N_TOPTIER_MEMORY: 0-1
>
> node_demotion[]:
>   0: [2], [2]
>   1: [],  [2]
>   2: [],  []
>
>
> * Example 3:
>   Node 0 & 1 are DRAM nodes.
>   Node 2 is a PMEM node and has the same distance to node 0 & 1.
>
>   Node 0 has node 2 as the preferred and only demotion target.
>
>   Node 1 has node 2 as the preferred and only demotion target.
>
> node distances:
> node   0    1    2
>    0  10   20   30
>    1  20   10   30
>    2  30   30   10
>
> /sys/devices/system/node/memory_tiers
> 0-1
> 2
>
> N_TOPTIER_MEMORY: 0-1
>
> node_demotion[]:
>   0: [2], [2]
>   1: [2], [2]
>   2: [],  []
>
>
> * Example 4:
>   Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
>
>   All nodes are top-tier.
>
> node distances:
> node   0    1    2
>    0  10   20   30
>    1  20   10   30
>    2  30   30   10
>
> /sys/devices/system/node/memory_tiers
> 0-2
>
> N_TOPTIER_MEMORY: 0-2
>
> node_demotion[]:
>   0: [],  []
>   1: [],  []
>   2: [],  []
>
>
> * Example 5:
>   Node 0 is a DRAM node with CPU.
>   Node 1 is a HBM node.
>   Node 2 is a PMEM node.
>
>   With userspace override, node 1 is the top tier and has node 0 as
>   the preferred and only demotion target.
>
>   Node 0 is in the second tier, tier 1, and has node 2 as the
>   preferred and only demotion target.
>
>   Node 2 is in the lowest tier, tier 2, and has no demotion targets.
>
> node distances:
> node   0    1    2
>    0  10   21   30
>    1  21   10   40
>    2  30   40   10
>
> /sys/devices/system/node/memory_tiers (userspace override)
> 1
> 0
> 2
>
> N_TOPTIER_MEMORY: 1
>
> node_demotion[]:
>   0: [2], [2]
>   1: [0], [0]
>   2: [],  []
>
> -- Wei

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-04-30  6:37   ` Wei Xu
@ 2022-05-06  0:01     ` Alistair Popple
  2022-05-10  4:32       ` Wei Xu
  2022-05-06 18:56     ` Yang Shi
  1 sibling, 1 reply; 57+ messages in thread
From: Alistair Popple @ 2022-05-06  0:01 UTC (permalink / raw)
  To: Wei Xu
  Cc: Yang Shi, Andrew Morton, Dave Hansen, Huang Ying, Dan Williams,
	Linux MM, Greg Thelen, Aneesh Kumar K.V, Jagdish Gediya,
	Linux Kernel Mailing List, Davidlohr Bueso, Michal Hocko,
	Baolin Wang, Brice Goglin, Feng Tang, Jonathan.Cameron, Tim Chen


Wei Xu <weixugc@google.com> writes:

[...]

>> >
>> >
>> > Tiering Hierarchy Initialization
>> > `=============================='
>> >
>> > By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
>> >
>> > A device driver can remove its memory nodes from the top tier, e.g.
>> > a dax driver can remove PMEM nodes from the top tier.
>>
>> With the topology built by firmware we should not need this.

I agree that in an ideal world the hierarchy should be built by firmware
based on something like the HMAT. But I also think being able to override
this will be useful in getting there. Therefore a way of overriding the
generated hierarchy would be good, either via sysfs or a kernel boot
parameter, if we don't want to commit to a particular user interface now.

However, I'm less sure letting device drivers override this is a good idea.
How, for example, would a GPU driver make sure its node is in the top tier?
By moving every node that the driver does not know about out of
N_TOPTIER_MEMORY? That could get messy if, say, there were two drivers,
both of which wanted their nodes to be in the top tier.

> I agree. But before we have such a firmware, the kernel needs to do
> its best to initialize memory tiers.
>
> Given that we know PMEM is slower than DRAM, but a dax device might
> not be PMEM, a better place to set the tier for PMEM nodes can be the
> ACPI code, e.g. acpi_numa_memory_affinity_init() where we can examine
> the ACPI_SRAT_MEM_NON_VOLATILE bit.
>
>> >
>> > The kernel builds the memory tiering hierarchy and per-node demotion
>> > order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
>> > best distance nodes in the next lower tier are assigned to
>> > node_demotion[N].preferred and all the nodes in the next lower tier
>> > are assigned to node_demotion[N].allowed.
>>
>> I'm not sure whether it should be allowed to demote to multiple lower
>> tiers. But it is totally fine to *NOT* allow it at the moment. Once we
>> figure out a good way to define demotion targets, it could be extended
>> to support this easily.
>
> You mean to only support MAX_TIERS=2 for now.  I am fine with that.
> There can be systems with 3 tiers, e.g. GPU -> DRAM -> PMEM, but it is
> not clear yet whether we want to enable transparent memory tiering to
> all the 3 tiers on such systems.

At some point I think we will need to deal with 3 tiers but I'd be ok with
limiting it to 2 for now if it makes things simpler.

- Alistair

>> >
>> > node_demotion[N].preferred can be empty if no preferred demotion node
>> > is available for node N.
>> >
>> > If the userspace overrides the tiers via the memory_tiers sysfs
>> > interface, the kernel then only rebuilds the per-node demotion order
>> > accordingly.
>> >
>> > Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
>> > memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
>> > node.
>> >
>> >
>> > Memory Allocation for Demotion
>> > `============================'
>> >
>> > When allocating a new demotion target page, both a preferred node
>> > and the allowed nodemask are provided to the allocation function.
>> > The default kernel allocation fallback order is used to allocate the
>> > page from the specified node and nodemask.
>> >
>> > The mempolicy of cpuset, vma and owner task of the source page can
>> > be set to refine the demotion nodemask, e.g. to prevent demotion or
>> > select a particular allowed node as the demotion target.
>> >
>> >
>> > Examples
>> > `======'
>> >
>> > * Example 1:
>> >   Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
>> >
>> >   Node 0 has node 2 as the preferred demotion target and can also
>> >   fallback demotion to node 3.
>> >
>> >   Node 1 has node 3 as the preferred demotion target and can also
>> >   fallback demotion to node 2.
>> >
>> >   Set mempolicy to prevent cross-socket demotion and memory access,
>> >   e.g. cpuset.mems=0,2
>> >
>> > node distances:
>> > node   0    1    2    3
>> >    0  10   20   30   40
>> >    1  20   10   40   30
>> >    2  30   40   10   40
>> >    3  40   30   40   10
>> >
>> > /sys/devices/system/node/memory_tiers
>> > 0-1
>> > 2-3
>> >
>> > N_TOPTIER_MEMORY: 0-1
>> >
>> > node_demotion[]:
>> >   0: [2], [2-3]
>> >   1: [3], [2-3]
>> >   2: [],  []
>> >   3: [],  []
>> >
>> > * Example 2:
>> >   Node 0 & 1 are DRAM nodes.
>> >   Node 2 is a PMEM node and closer to node 0.
>> >
>> >   Node 0 has node 2 as the preferred and only demotion target.
>> >
>> >   Node 1 has no preferred demotion target, but can still demote
>> >   to node 2.
>> >
>> >   Set mempolicy to prevent cross-socket demotion and memory access,
>> >   e.g. cpuset.mems=0,2
>> >
>> > node distances:
>> > node   0    1    2
>> >    0  10   20   30
>> >    1  20   10   40
>> >    2  30   40   10
>> >
>> > /sys/devices/system/node/memory_tiers
>> > 0-1
>> > 2
>> >
>> > N_TOPTIER_MEMORY: 0-1
>> >
>> > node_demotion[]:
>> >   0: [2], [2]
>> >   1: [],  [2]
>> >   2: [],  []
>> >
>> >
>> > * Example 3:
>> >   Node 0 & 1 are DRAM nodes.
>> >   Node 2 is a PMEM node and has the same distance to node 0 & 1.
>> >
>> >   Node 0 has node 2 as the preferred and only demotion target.
>> >
>> >   Node 1 has node 2 as the preferred and only demotion target.
>> >
>> > node distances:
>> > node   0    1    2
>> >    0  10   20   30
>> >    1  20   10   30
>> >    2  30   30   10
>> >
>> > /sys/devices/system/node/memory_tiers
>> > 0-1
>> > 2
>> >
>> > N_TOPTIER_MEMORY: 0-1
>> >
>> > node_demotion[]:
>> >   0: [2], [2]
>> >   1: [2], [2]
>> >   2: [],  []
>> >
>> >
>> > * Example 4:
>> >   Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
>> >
>> >   All nodes are top-tier.
>> >
>> > node distances:
>> > node   0    1    2
>> >    0  10   20   30
>> >    1  20   10   30
>> >    2  30   30   10
>> >
>> > /sys/devices/system/node/memory_tiers
>> > 0-2
>> >
>> > N_TOPTIER_MEMORY: 0-2
>> >
>> > node_demotion[]:
>> >   0: [],  []
>> >   1: [],  []
>> >   2: [],  []
>> >
>> >
>> > * Example 5:
>> >   Node 0 is a DRAM node with CPU.
>> >   Node 1 is a HBM node.
>> >   Node 2 is a PMEM node.
>> >
>> >   With userspace override, node 1 is the top tier and has node 0 as
>> >   the preferred and only demotion target.
>> >
>> >   Node 0 is in the second tier, tier 1, and has node 2 as the
>> >   preferred and only demotion target.
>> >
>> >   Node 2 is in the lowest tier, tier 2, and has no demotion targets.
>> >
>> > node distances:
>> > node   0    1    2
>> >    0  10   21   30
>> >    1  21   10   40
>> >    2  30   40   10
>> >
>> > /sys/devices/system/node/memory_tiers (userspace override)
>> > 1
>> > 0
>> > 2
>> >
>> > N_TOPTIER_MEMORY: 1
>> >
>> > node_demotion[]:
>> >   0: [2], [2]
>> >   1: [0], [0]
>> >   2: [],  []
>> >
>> > -- Wei

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-05 23:57 ` Alistair Popple
@ 2022-05-06  0:25   ` Alistair Popple
  0 siblings, 0 replies; 57+ messages in thread
From: Alistair Popple @ 2022-05-06  0:25 UTC (permalink / raw)
  To: Wei Xu
  Cc: Andrew Morton, Dave Hansen, Huang Ying, Dan Williams, Yang Shi,
	Linux MM, Greg Thelen, Aneesh Kumar K.V, Jagdish Gediya,
	Linux Kernel Mailing List, Davidlohr Bueso, Michal Hocko,
	Baolin Wang, Brice Goglin, Feng Tang, Jonathan.Cameron

Please ignore this one; apologies for the noise.

On Friday, 6 May 2022 9:57:54 AM AEST Alistair Popple wrote:
> Wei Xu <weixugc@google.com> writes:
> 
> > The current kernel has the basic memory tiering support: Inactive
> > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > tier NUMA node to make room for new allocations on the higher tier
> > NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> > migrated (promoted) to a higher tier NUMA node to improve the
> > performance.
> >
> > A tiering relationship between NUMA nodes in the form of demotion path
> > is created during the kernel initialization and updated when a NUMA
> > node is hot-added or hot-removed.  The current implementation puts all
> > nodes with CPU into the top tier, and then builds the tiering hierarchy
> > tier-by-tier by establishing the per-node demotion targets based on
> > the distances between nodes.
> >
> > The current memory tiering interface needs to be improved to address
> > several important use cases:
> >
> > * The current tiering initialization code always initializes
> >   each memory-only NUMA node into a lower tier.  But a memory-only
> >   NUMA node may have a high performance memory device (e.g. a DRAM
> >   device attached via CXL.mem or a DRAM-backed memory-only node on
> >   a virtual machine) and should be put into the top tier.
> >
> > * The current tiering hierarchy always puts CPU nodes into the top
> >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> >   with CPUs are better to be placed into the next lower tier.
> >
> > * Also because the current tiering hierarchy always puts CPU nodes
> >   into the top tier, when a CPU is hot-added (or hot-removed) and
> >   triggers a memory node from CPU-less into a CPU node (or vice
> >   versa), the memory tiering hierarchy gets changed, even though no
> >   memory node is added or removed.  This can make the tiering
> >   hierarchy much less stable.
> >
> > * A higher tier node can only be demoted to selected nodes on the
> >   next lower tier, not any other node from the next lower tier.  This
> >   strict, hard-coded demotion order does not work in all use cases
> >   (e.g. some use cases may want to allow cross-socket demotion to
> >   another node in the same demotion tier as a fallback when the
> >   preferred demotion node is out of space), and has resulted in the
> >   feature request for an interface to override the system-wide,
> >   per-node demotion order from the userspace.
> >
> > * There are no interfaces for the userspace to learn about the memory
> >   tiering hierarchy in order to optimize its memory allocations.
> >
> > I'd like to propose revised memory tiering kernel interfaces based on
> > the discussions in the threads:
> >
> > - <https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/>
> > - <https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/>
> >
> >
> > Sysfs Interfaces
> > `=============='
> >
> > * /sys/devices/system/node/memory_tiers
> >
> >   Format: node list (one tier per line, in the tier order)
> >
> >   When read, list memory nodes by tiers.
> >
> >   When written (one tier per line), take the user-provided node-tier
> >   assignment as the new tiering hierarchy and rebuild the per-node
> >   demotion order.  It is allowed to only override the top tiers, in
> >   which cases, the kernel will establish the lower tiers automatically.
> >
> >
> > Kernel Representation
> > `==================='
> >
> > * nodemask_t node_states[N_TOPTIER_MEMORY]
> >
> >   Store all top-tier memory nodes.
> >
> > * nodemask_t memory_tiers[MAX_TIERS]
> >
> >   Store memory nodes by tiers.
> >
> > * struct demotion_nodes node_demotion[]
> >
> >   where: struct demotion_nodes { nodemask_t preferred; nodemask_t allowed; }
> >
> >   For a node N:
> >
> >   node_demotion[N].preferred lists all preferred demotion targets;
> >
> >   node_demotion[N].allowed lists all allowed demotion targets
> >   (initialized to be all the nodes in the same demotion tier).
> >
> >
> > Tiering Hierarchy Initialization
> > `=============================='
> >
> > By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
> >
> > A device driver can remove its memory nodes from the top tier, e.g.
> > a dax driver can remove PMEM nodes from the top tier.
> >
> > The kernel builds the memory tiering hierarchy and per-node demotion
> > order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
> > best distance nodes in the next lower tier are assigned to
> > node_demotion[N].preferred and all the nodes in the next lower tier
> > are assigned to node_demotion[N].allowed.
> >
> > node_demotion[N].preferred can be empty if no preferred demotion node
> > is available for node N.
> >
> > If the userspace overrides the tiers via the memory_tiers sysfs
> > interface, the kernel then only rebuilds the per-node demotion order
> > accordingly.
> >
> > Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
> > memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
> > node.
> >
> >
> > Memory Allocation for Demotion
> > `============================'
> >
> > When allocating a new demotion target page, both a preferred node
> > and the allowed nodemask are provided to the allocation function.
> > The default kernel allocation fallback order is used to allocate the
> > page from the specified node and nodemask.
> >
> > The mempolicy of cpuset, vma and owner task of the source page can
> > be set to refine the demotion nodemask, e.g. to prevent demotion or
> > select a particular allowed node as the demotion target.
> >
> >
> > Examples
> > `======'
> >
> > * Example 1:
> >   Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
> >
> >   Node 0 has node 2 as the preferred demotion target and can also
> >   fallback demotion to node 3.
> >
> >   Node 1 has node 3 as the preferred demotion target and can also
> >   fallback demotion to node 2.
> >
> >   Set mempolicy to prevent cross-socket demotion and memory access,
> >   e.g. cpuset.mems=0,2
> >
> > node distances:
> > node   0    1    2    3
> >    0  10   20   30   40
> >    1  20   10   40   30
> >    2  30   40   10   40
> >    3  40   30   40   10
> >
> > /sys/devices/system/node/memory_tiers
> > 0-1
> > 2-3
> >
> > N_TOPTIER_MEMORY: 0-1
> >
> > node_demotion[]:
> >   0: [2], [2-3]
> >   1: [3], [2-3]
> >   2: [],  []
> >   3: [],  []
> >
> > * Example 2:
> >   Node 0 & 1 are DRAM nodes.
> >   Node 2 is a PMEM node and closer to node 0.
> >
> >   Node 0 has node 2 as the preferred and only demotion target.
> >
> >   Node 1 has no preferred demotion target, but can still demote
> >   to node 2.
> >
> >   Set mempolicy to prevent cross-socket demotion and memory access,
> >   e.g. cpuset.mems=0,2
> >
> > node distances:
> > node   0    1    2
> >    0  10   20   30
> >    1  20   10   40
> >    2  30   40   10
> >
> > /sys/devices/system/node/memory_tiers
> > 0-1
> > 2
> >
> > N_TOPTIER_MEMORY: 0-1
> >
> > node_demotion[]:
> >   0: [2], [2]
> >   1: [],  [2]
> >   2: [],  []
> >
> >
> > * Example 3:
> >   Node 0 & 1 are DRAM nodes.
> >   Node 2 is a PMEM node and has the same distance to node 0 & 1.
> >
> >   Node 0 has node 2 as the preferred and only demotion target.
> >
> >   Node 1 has node 2 as the preferred and only demotion target.
> >
> > node distances:
> > node   0    1    2
> >    0  10   20   30
> >    1  20   10   30
> >    2  30   30   10
> >
> > /sys/devices/system/node/memory_tiers
> > 0-1
> > 2
> >
> > N_TOPTIER_MEMORY: 0-1
> >
> > node_demotion[]:
> >   0: [2], [2]
> >   1: [2], [2]
> >   2: [],  []
> >
> >
> > * Example 4:
> >   Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
> >
> >   All nodes are top-tier.
> >
> > node distances:
> > node   0    1    2
> >    0  10   20   30
> >    1  20   10   30
> >    2  30   30   10
> >
> > /sys/devices/system/node/memory_tiers
> > 0-2
> >
> > N_TOPTIER_MEMORY: 0-2
> >
> > node_demotion[]:
> >   0: [],  []
> >   1: [],  []
> >   2: [],  []
> >
> >
> > * Example 5:
> >   Node 0 is a DRAM node with CPU.
> >   Node 1 is a HBM node.
> >   Node 2 is a PMEM node.
> >
> >   With userspace override, node 1 is the top tier and has node 0 as
> >   the preferred and only demotion target.
> >
> >   Node 0 is in the second tier, tier 1, and has node 2 as the
> >   preferred and only demotion target.
> >
> >   Node 2 is in the lowest tier, tier 2, and has no demotion targets.
> >
> > node distances:
> > node   0    1    2
> >    0  10   21   30
> >    1  21   10   40
> >    2  30   40   10
> >
> > /sys/devices/system/node/memory_tiers (userspace override)
> > 1
> > 0
> > 2
> >
> > N_TOPTIER_MEMORY: 1
> >
> > node_demotion[]:
> >   0: [2], [2]
> >   1: [0], [0]
> >   2: [],  []
> >
> > -- Wei
> 






^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-04-30  6:37   ` Wei Xu
  2022-05-06  0:01     ` Alistair Popple
@ 2022-05-06 18:56     ` Yang Shi
  2022-05-09 14:32       ` Hesham Almatary
  1 sibling, 1 reply; 57+ messages in thread
From: Yang Shi @ 2022-05-06 18:56 UTC (permalink / raw)
  To: Wei Xu
  Cc: Andrew Morton, Dave Hansen, Huang Ying, Dan Williams, Linux MM,
	Greg Thelen, Aneesh Kumar K.V, Jagdish Gediya,
	Linux Kernel Mailing List, Alistair Popple, Davidlohr Bueso,
	Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang,
	Jonathan.Cameron, Tim Chen

On Fri, Apr 29, 2022 at 11:37 PM Wei Xu <weixugc@google.com> wrote:
>
> On Fri, Apr 29, 2022 at 8:59 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > Hi Wei,
> >
> > Thanks for the nice writing. Please see the below inline comments.
>
> Thanks for the quick response and comments.
>
> > On Fri, Apr 29, 2022 at 7:10 PM Wei Xu <weixugc@google.com> wrote:
> > >
> > > The current kernel has the basic memory tiering support: Inactive
> > > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > > tier NUMA node to make room for new allocations on the higher tier
> > > NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> > > migrated (promoted) to a higher tier NUMA node to improve the
> > > performance.
> > >
> > > A tiering relationship between NUMA nodes in the form of demotion path
> > > is created during the kernel initialization and updated when a NUMA
> > > node is hot-added or hot-removed.  The current implementation puts all
> > > nodes with CPU into the top tier, and then builds the tiering hierarchy
> > > tier-by-tier by establishing the per-node demotion targets based on
> > > the distances between nodes.
> > >
> > > The current memory tiering interface needs to be improved to address
> > > several important use cases:
> > >
> > > * The current tiering initialization code always initializes
> > >   each memory-only NUMA node into a lower tier.  But a memory-only
> > >   NUMA node may have a high performance memory device (e.g. a DRAM
> > >   device attached via CXL.mem or a DRAM-backed memory-only node on
> > >   a virtual machine) and should be put into the top tier.
> > >
> > > * The current tiering hierarchy always puts CPU nodes into the top
> > >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> > >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> > >   with CPUs are better to be placed into the next lower tier.
> > >
> > > * Also because the current tiering hierarchy always puts CPU nodes
> > >   into the top tier, when a CPU is hot-added (or hot-removed) and
> > >   triggers a memory node from CPU-less into a CPU node (or vice
> > >   versa), the memory tiering hierarchy gets changed, even though no
> > >   memory node is added or removed.  This can make the tiering
> > >   hierarchy much less stable.
> >
> > I'd prefer the firmware builds up tiers topology then passes it to
> > kernel so that kernel knows what nodes are in what tiers. No matter
> > what nodes are hot-removed/hot-added they always stay in their tiers
> > defined by the firmware. I think this is important information like
> > numa distances. NUMA distance alone can't satisfy all the usecases
> > IMHO.
>
> I agree that the firmware needs to play a bigger role in tier
> topology, though it is not clear to me yet that we should require the
> tier topology be fully defined by the firmware.  If yes, a standard
> needs to be established. Alternatively, with additional hardware
> information provided by the firmware (e.g. HMAT), the kernel can be in
> a much better position to initialize the proper tier topology by
> itself.
>
> It is important to keep tier topology stable, especially if we want to
> account and limit memory usage based on tiers.  So I agree that the
> nodes should not change their tiers no matter what nodes are
> hot-added/hot-removed.
>
> Given that the additional tier-related information is not yet
> available from the firmware and NUMA distance alone is not sufficient
> for all the tiering use cases, and also that we want to keep tier
> topology stable after the kernel boots, I suggest that we add a kernel
> boot parameter to override the default tier topology (which nodes are
> in which tiers). An example is: tier=2:0-1;2-3, which defines two
> tiers: tier 0 includes node 0 & 1, and tier 1 includes node 2 & 3.
>
> > >
> > > * A higher tier node can only be demoted to selected nodes on the
> > >   next lower tier, not any other node from the next lower tier.  This
> > >   strict, hard-coded demotion order does not work in all use cases
> > >   (e.g. some use cases may want to allow cross-socket demotion to
> > >   another node in the same demotion tier as a fallback when the
> > >   preferred demotion node is out of space), and has resulted in the
> > >   feature request for an interface to override the system-wide,
> > >   per-node demotion order from the userspace.
> > >
> > > * There are no interfaces for the userspace to learn about the memory
> > >   tiering hierarchy in order to optimize its memory allocations.
> > >
> > > I'd like to propose revised memory tiering kernel interfaces based on
> > > the discussions in the threads:
> > >
> > > - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> > > - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
> > >
> > >
> > > Sysfs Interfaces
> > > ================
> > >
> > > * /sys/devices/system/node/memory_tiers
> > >
> > >   Format: node list (one tier per line, in the tier order)
> > >
> > >   When read, list memory nodes by tiers.
> > >
> > >   When written (one tier per line), take the user-provided node-tier
> > >   assignment as the new tiering hierarchy and rebuild the per-node
> > >   demotion order.  It is allowed to only override the top tiers, in
> > >   which cases, the kernel will establish the lower tiers automatically.
> >
> > TBH I still think it is too soon to define proper user visible
> > interfaces for now, particularly for override.
>
> I agree, but there are also needs to make use of tiering even as it
> evolves.  This is why only a minimal sysfs interface is proposed.  We
> can make it read-only and resort to a kernel boot parameter to
> override tiers.
>
> > >
> > >
> > > Kernel Representation
> > > =====================
> > >
> > > * nodemask_t node_states[N_TOPTIER_MEMORY]
> > >
> > >   Store all top-tier memory nodes.
> > >
> > > * nodemask_t memory_tiers[MAX_TIERS]
> > >
> > >   Store memory nodes by tiers.
> >
> > I'd prefer nodemask_t node_states[MAX_TIERS][]. Tier 0 is always the
> > top tier. The kernel could build this with the topology built by
> > firmware.
>
> node_states[N_TOPTIER_MEMORY] is for convenience and can be removed.
>
> node_states is already an existing kernel array (defined as nodemask_t
> node_states[NR_NODE_STATES]).  We need an array for memory tiers, too,
> which is why a new array, memory_tiers, is proposed.
>
> Are you proposing that we make node_states a 2-dimensional array?
> That can duplicate the information already in node_states, which is
> not ideal.

Sorry for the late reply.

Yes, a 2-dimensional array. With it we could know which nodes are in which tiers.

>
> > >
> > > * struct demotion_nodes node_demotion[]
> > >
> > >   where: struct demotion_nodes { nodemask_t preferred; nodemask_t allowed; }
> > >
> > >   For a node N:
> > >
> > >   node_demotion[N].preferred lists all preferred demotion targets;
> > >
> > >   node_demotion[N].allowed lists all allowed demotion targets
> > >   (initialized to be all the nodes in the same demotion tier).
> >
> > It seems unnecessary to define preferred and allowed IMHO. Why not
> > just use something like the allocation fallback list? The first node
> > in the list is the preferred one. When allocating memory for demotion,
> > convert the list to a nodemask, then call __alloc_pages(gfp, order,
> > first_node, nodemask). So the allocation could fallback to the allowed
> > nodes automatically.
>
> The nodemask "preferred" is an attempt to preserve a current feature
> in node_demotion[]: load balancing among multiple equally-close target
> nodes via random selection.  We can remove it to keep things simple.
>
> The idea of defining "preferred" and "allowed" is exactly to use
> __alloc_pages(gfp, order, preferred_node, allowed_nodemask).  Given
> that the page allocator already computes the allocation fallback list,
> it should be unnecessary to maintain an ordered demotion node list for
> each node and convert such a list to a nodemask for demotion
> allocation.  This is why allowed is stored as a nodemask.

Yeah, it doesn't have to be ordered.
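
For illustration, the allocation call being discussed could look roughly
like the sketch below, assuming the proposed memory_tiers[] array and a
node_to_tier() helper (neither exists in the current kernel, and the
bounds check against the lowest tier is omitted):

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/nodemask.h>

/* Proposed in this thread; not existing kernel symbols. */
extern nodemask_t memory_tiers[];
extern int node_to_tier(int node);

static struct page *alloc_demote_target(int nid, gfp_t gfp,
                                        unsigned int order)
{
        /* The source node is passed as the preferred node, so the
         * allocator walks nid's normal fallback list, while the nodemask
         * restricts the result to the next lower tier; the nearest
         * allowed node is therefore tried first. */
        return __alloc_pages(gfp, order, nid,
                             &memory_tiers[node_to_tier(nid) + 1]);
}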

>
> When demoting a page from node N, I think we can just call
> __alloc_pages(gfp, order, N, memory_tiers[node_to_tier(N) + 1]).  If
> so, we can remove node_demotion[] entirely and add a tier field to
> NODE_DATA for node_to_tier().
>
> > >
> > >
> > > Tiering Hierarchy Initialization
> > > ================================
> > >
> > > By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
> > >
> > > A device driver can remove its memory nodes from the top tier, e.g.
> > > a dax driver can remove PMEM nodes from the top tier.
> >
> > With the topology built by firmware we should not need this.
>
> I agree. But before we have such a firmware, the kernel needs to do
> its best to initialize memory tiers.
>
> Given that we know PMEM is slower than DRAM, but a dax device might
> not be PMEM, a better place to set the tier for PMEM nodes can be the
> ACPI code, e.g. acpi_numa_memory_affinity_init() where we can examine
> the ACPI_SRAT_MEM_NON_VOLATILE bit.

This is why I hope firmware could chime in. For example, we may have a
new field, called "Tier", in HMAT. Then the kernel just reads the field
and puts the node into the proper tier. But of course an override by the
kernel could be allowed.
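
A rough sketch of that flow, where every helper below is hypothetical
(today there is no such HMAT field and no such kernel API):

/* Hypothetical helpers, for illustration only. */
extern int firmware_node_tier(int nid);  /* from a future HMAT field; < 0 if absent */
extern int default_node_tier(int nid);   /* kernel fallback, e.g. from SRAT/distance */
extern void node_set_tier(int nid, int tier);

static void init_node_tier(int nid)
{
        int tier = firmware_node_tier(nid);

        if (tier < 0)
                tier = default_node_tier(nid);

        /* A later kernel or userspace override could still call
         * node_set_tier() to move the node. */
        node_set_tier(nid, tier);
}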

>
> > >
> > > The kernel builds the memory tiering hierarchy and per-node demotion
> > > order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
> > > best distance nodes in the next lower tier are assigned to
> > > node_demotion[N].preferred and all the nodes in the next lower tier
> > > are assigned to node_demotion[N].allowed.
> >
> > I'm not sure whether it should be allowed to demote to multiple lower
> > tiers. But it is totally fine to *NOT* allow it at the moment. Once we
> > figure out a good way to define demotion targets, it could be extended
> > to support this easily.
>
> You mean to only support MAX_TIERS=2 for now.  I am fine with that.
> There can be systems with 3 tiers, e.g. GPU -> DRAM -> PMEM, but it is
> not clear yet whether we want to enable transparent memory tiering to
> all the 3 tiers on such systems.

Just start from something simple. And we should fully utilize the
nearest lower tier before demoting to even lower tiers.

>
> > >
> > > node_demotion[N].preferred can be empty if no preferred demotion node
> > > is available for node N.
> > >
> > > If the userspace overrides the tiers via the memory_tiers sysfs
> > > interface, the kernel then only rebuilds the per-node demotion order
> > > accordingly.
> > >
> > > Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
> > > memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
> > > node.
> > >
> > >
> > > Memory Allocation for Demotion
> > > ==============================
> > >
> > > When allocating a new demotion target page, both a preferred node
> > > and the allowed nodemask are provided to the allocation function.
> > > The default kernel allocation fallback order is used to allocate the
> > > page from the specified node and nodemask.
> > >
> > > The mempolicy of cpuset, vma and owner task of the source page can
> > > be set to refine the demotion nodemask, e.g. to prevent demotion or
> > > select a particular allowed node as the demotion target.
> > >
> > >
> > > Examples
> > > ========
> > >
> > > * Example 1:
> > >   Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
> > >
> > >   Node 0 has node 2 as the preferred demotion target and can also
> > >   fallback demotion to node 3.
> > >
> > >   Node 1 has node 3 as the preferred demotion target and can also
> > >   fallback demotion to node 2.
> > >
> > >   Set mempolicy to prevent cross-socket demotion and memory access,
> > >   e.g. cpuset.mems=0,2
> > >
> > > node distances:
> > > node   0    1    2    3
> > >    0  10   20   30   40
> > >    1  20   10   40   30
> > >    2  30   40   10   40
> > >    3  40   30   40   10
> > >
> > > /sys/devices/system/node/memory_tiers
> > > 0-1
> > > 2-3
> > >
> > > N_TOPTIER_MEMORY: 0-1
> > >
> > > node_demotion[]:
> > >   0: [2], [2-3]
> > >   1: [3], [2-3]
> > >   2: [],  []
> > >   3: [],  []
> > >
> > > * Example 2:
> > >   Node 0 & 1 are DRAM nodes.
> > >   Node 2 is a PMEM node and closer to node 0.
> > >
> > >   Node 0 has node 2 as the preferred and only demotion target.
> > >
> > >   Node 1 has no preferred demotion target, but can still demote
> > >   to node 2.
> > >
> > >   Set mempolicy to prevent cross-socket demotion and memory access,
> > >   e.g. cpuset.mems=0,2
> > >
> > > node distances:
> > > node   0    1    2
> > >    0  10   20   30
> > >    1  20   10   40
> > >    2  30   40   10
> > >
> > > /sys/devices/system/node/memory_tiers
> > > 0-1
> > > 2
> > >
> > > N_TOPTIER_MEMORY: 0-1
> > >
> > > node_demotion[]:
> > >   0: [2], [2]
> > >   1: [],  [2]
> > >   2: [],  []
> > >
> > >
> > > * Example 3:
> > >   Node 0 & 1 are DRAM nodes.
> > >   Node 2 is a PMEM node and has the same distance to node 0 & 1.
> > >
> > >   Node 0 has node 2 as the preferred and only demotion target.
> > >
> > >   Node 1 has node 2 as the preferred and only demotion target.
> > >
> > > node distances:
> > > node   0    1    2
> > >    0  10   20   30
> > >    1  20   10   30
> > >    2  30   30   10
> > >
> > > /sys/devices/system/node/memory_tiers
> > > 0-1
> > > 2
> > >
> > > N_TOPTIER_MEMORY: 0-1
> > >
> > > node_demotion[]:
> > >   0: [2], [2]
> > >   1: [2], [2]
> > >   2: [],  []
> > >
> > >
> > > * Example 4:
> > >   Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
> > >
> > >   All nodes are top-tier.
> > >
> > > node distances:
> > > node   0    1    2
> > >    0  10   20   30
> > >    1  20   10   30
> > >    2  30   30   10
> > >
> > > /sys/devices/system/node/memory_tiers
> > > 0-2
> > >
> > > N_TOPTIER_MEMORY: 0-2
> > >
> > > node_demotion[]:
> > >   0: [],  []
> > >   1: [],  []
> > >   2: [],  []
> > >
> > >
> > > * Example 5:
> > >   Node 0 is a DRAM node with CPU.
> > >   Node 1 is a HBM node.
> > >   Node 2 is a PMEM node.
> > >
> > >   With userspace override, node 1 is the top tier and has node 0 as
> > >   the preferred and only demotion target.
> > >
> > >   Node 0 is in the second tier, tier 1, and has node 2 as the
> > >   preferred and only demotion target.
> > >
> > >   Node 2 is in the lowest tier, tier 2, and has no demotion targets.
> > >
> > > node distances:
> > > node   0    1    2
> > >    0  10   21   30
> > >    1  21   10   40
> > >    2  30   40   10
> > >
> > > /sys/devices/system/node/memory_tiers (userspace override)
> > > 1
> > > 0
> > > 2
> > >
> > > N_TOPTIER_MEMORY: 1
> > >
> > > node_demotion[]:
> > >   0: [2], [2]
> > >   1: [0], [0]
> > >   2: [],  []
> > >
> > > -- Wei


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-01 18:35   ` Dan Williams
  2022-05-03  6:36     ` Wei Xu
@ 2022-05-06 19:05     ` Yang Shi
  2022-05-07  7:56     ` ying.huang
  2 siblings, 0 replies; 57+ messages in thread
From: Yang Shi @ 2022-05-06 19:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: Wei Xu, Andrew Morton, Dave Hansen, Huang Ying, Linux MM,
	Greg Thelen, Aneesh Kumar K.V, Jagdish Gediya,
	Linux Kernel Mailing List, Alistair Popple, Davidlohr Bueso,
	Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang,
	Jonathan Cameron

On Sun, May 1, 2022 at 11:35 AM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Fri, Apr 29, 2022 at 8:59 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > Hi Wei,
> >
> > Thanks for the nice writing. Please see the below inline comments.
> >
> > On Fri, Apr 29, 2022 at 7:10 PM Wei Xu <weixugc@google.com> wrote:
> > >
> > > The current kernel has the basic memory tiering support: Inactive
> > > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > > tier NUMA node to make room for new allocations on the higher tier
> > > NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> > > migrated (promoted) to a higher tier NUMA node to improve the
> > > performance.
> > >
> > > A tiering relationship between NUMA nodes in the form of demotion path
> > > is created during the kernel initialization and updated when a NUMA
> > > node is hot-added or hot-removed.  The current implementation puts all
> > > nodes with CPU into the top tier, and then builds the tiering hierarchy
> > > tier-by-tier by establishing the per-node demotion targets based on
> > > the distances between nodes.
> > >
> > > The current memory tiering interface needs to be improved to address
> > > several important use cases:
> > >
> > > * The current tiering initialization code always initializes
> > >   each memory-only NUMA node into a lower tier.  But a memory-only
> > >   NUMA node may have a high performance memory device (e.g. a DRAM
> > >   device attached via CXL.mem or a DRAM-backed memory-only node on
> > >   a virtual machine) and should be put into the top tier.
> > >
> > > * The current tiering hierarchy always puts CPU nodes into the top
> > >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> > >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> > >   with CPUs are better to be placed into the next lower tier.
> > >
> > > * Also because the current tiering hierarchy always puts CPU nodes
> > >   into the top tier, when a CPU is hot-added (or hot-removed) and
> > >   triggers a memory node from CPU-less into a CPU node (or vice
> > >   versa), the memory tiering hierarchy gets changed, even though no
> > >   memory node is added or removed.  This can make the tiering
> > >   hierarchy much less stable.
> >
> > I'd prefer the firmware builds up tiers topology then passes it to
> > kernel so that kernel knows what nodes are in what tiers. No matter
> > what nodes are hot-removed/hot-added they always stay in their tiers
> > defined by the firmware. I think this is important information like
> > numa distances. NUMA distance alone can't satisfy all the usecases
> > IMHO.
>
> Just want to note here that the platform firmware can only describe
> the tiers of static memory present at boot. CXL hotplug breaks this
> model and the kernel is left to dynamically determine the device's
> performance characteristics and the performance of the topology to
> reach that device. Now, the platform firmware does set expectations
> for the performance class of different memory ranges, but there is no
> way to know in advance the performance of devices that will be asked
> to be physically or logically added to the memory configuration. That
> said, it's probably still too early to define ABI for those
> exceptional cases where the kernel needs to make a policy decision
> about a device that does not fit into the firmware's performance
> expectations, but just note that there are limits to the description
> that platform firmware can provide.

Thanks, Dan. I don't know too much about CXL. Is it possible to make
it static? For example, put it into a default tier (for example, the
lowest tier) as long as CXL is available, regardless of whether any
device is connected or not? Then the kernel driver could probe some
information and move it to the proper tier once the device is
hot-plugged. Anyway, this is just off the top of my head.

>
> I agree that NUMA distance alone is inadequate and the kernel needs to
> make better use of data like ACPI HMAT to determine the default
> tiering order.

Yeah, we are on the same page.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-01 18:35   ` Dan Williams
  2022-05-03  6:36     ` Wei Xu
  2022-05-06 19:05     ` Yang Shi
@ 2022-05-07  7:56     ` ying.huang
  2 siblings, 0 replies; 57+ messages in thread
From: ying.huang @ 2022-05-07  7:56 UTC (permalink / raw)
  To: Dan Williams, Yang Shi
  Cc: Wei Xu, Andrew Morton, Dave Hansen, Linux MM, Greg Thelen,
	Aneesh Kumar K.V, Jagdish Gediya, Linux Kernel Mailing List,
	Alistair Popple, Davidlohr Bueso, Michal Hocko, Baolin Wang,
	Brice Goglin, Feng Tang, Jonathan Cameron

Hi, Dan,

On Sun, 2022-05-01 at 11:35 -0700, Dan Williams wrote:
> On Fri, Apr 29, 2022 at 8:59 PM Yang Shi <shy828301@gmail.com> wrote:
> > 
> > Hi Wei,
> > 
> > Thanks for the nice writing. Please see the below inline comments.
> > 
> > On Fri, Apr 29, 2022 at 7:10 PM Wei Xu <weixugc@google.com> wrote:
> > > 
> > > The current kernel has the basic memory tiering support: Inactive
> > > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > > tier NUMA node to make room for new allocations on the higher tier
> > > NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> > > migrated (promoted) to a higher tier NUMA node to improve the
> > > performance.
> > > 
> > > A tiering relationship between NUMA nodes in the form of demotion path
> > > is created during the kernel initialization and updated when a NUMA
> > > node is hot-added or hot-removed.  The current implementation puts all
> > > nodes with CPU into the top tier, and then builds the tiering hierarchy
> > > tier-by-tier by establishing the per-node demotion targets based on
> > > the distances between nodes.
> > > 
> > > The current memory tiering interface needs to be improved to address
> > > several important use cases:
> > > 
> > > * The current tiering initialization code always initializes
> > >   each memory-only NUMA node into a lower tier.  But a memory-only
> > >   NUMA node may have a high performance memory device (e.g. a DRAM
> > >   device attached via CXL.mem or a DRAM-backed memory-only node on
> > >   a virtual machine) and should be put into the top tier.
> > > 
> > > * The current tiering hierarchy always puts CPU nodes into the top
> > >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> > >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> > >   with CPUs are better to be placed into the next lower tier.
> > > 
> > > * Also because the current tiering hierarchy always puts CPU nodes
> > >   into the top tier, when a CPU is hot-added (or hot-removed) and
> > >   triggers a memory node from CPU-less into a CPU node (or vice
> > >   versa), the memory tiering hierarchy gets changed, even though no
> > >   memory node is added or removed.  This can make the tiering
> > >   hierarchy much less stable.
> > 
> > I'd prefer the firmware builds up tiers topology then passes it to
> > kernel so that kernel knows what nodes are in what tiers. No matter
> > what nodes are hot-removed/hot-added they always stay in their tiers
> > defined by the firmware. I think this is important information like
> > numa distances. NUMA distance alone can't satisfy all the usecases
> > IMHO.
> 
> Just want to note here that the platform firmware can only describe
> the tiers of static memory present at boot. CXL hotplug breaks this
> model and the kernel is left to dynamically determine the device's
> performance characteristics and the performance of the topology to
> reach that device. Now, the platform firmware does set expectations
> for the performance class of different memory ranges, but there is no
> way to know in advance the performance of devices that will be asked
> to be physically or logically added to the memory configuration. That
> said, it's probably still too early to define ABI for those
> exceptional cases where the kernel needs to make a policy decision
> about a device that does not fit into the firmware's performance
> expectations, but just note that there are limits to the description
> that platform firmware can provide.
> 

Does this mean that we will eventually need some kind of in-kernel
memory latency measurement mechanism to determine the tier of a memory
device?
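
For reference, a userspace analogue of such a probe could be a simple
libnuma-based pointer chase like the rough sketch below (an in-kernel
mechanism would certainly look different, and the numbers such a loop
produces are only approximate; build with gcc -O2 probe.c -lnuma):

/* Rough per-node latency probe: chase a pseudo-random pointer chain
 * in a buffer bound to each NUMA node and report ns per load. */
#include <numa.h>
#include <stdio.h>
#include <time.h>

static double probe_node_ns(int node, size_t nr, int iters)
{
    size_t *buf = numa_alloc_onnode(nr * sizeof(*buf), node);
    size_t i, idx = 0;
    struct timespec t0, t1;

    if (!buf)
        return -1.0;

    /* pseudo-random cyclic chain to defeat hardware prefetching */
    for (i = 0; i < nr; i++)
        buf[i] = (i * 9973 + 1) % nr;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < (size_t)iters * nr; i++)
        idx = buf[idx];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    numa_free(buf, nr * sizeof(*buf));
    if (idx == (size_t)-1)          /* keep the chase from being */
        puts("unreachable");        /* optimized away            */

    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) /
           ((double)iters * nr);
}

int main(void)
{
    if (numa_available() < 0)
        return 1;
    for (int node = 0; node <= numa_max_node(); node++)
        printf("node %d: ~%.1f ns per dependent load\n",
               node, probe_node_ns(node, 1 << 23, 2));
    return 0;
}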

Best Regards,
Huang, Ying

> I agree that NUMA distance alone is inadequate and the kernel needs to
> make better use of data like ACPI HMAT to determine the default
> tiering order.




^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-06 18:56     ` Yang Shi
@ 2022-05-09 14:32       ` Hesham Almatary
  2022-05-10  3:24         ` Yang Shi
  2022-05-10  4:22         ` Wei Xu
  0 siblings, 2 replies; 57+ messages in thread
From: Hesham Almatary @ 2022-05-09 14:32 UTC (permalink / raw)
  To: Yang Shi
  Cc: Andrew Morton, Dave Hansen, Huang Ying, Dan Williams, Linux MM,
	Greg Thelen, Aneesh Kumar K.V, Jagdish Gediya,
	Linux Kernel Mailing List, Alistair Popple, Davidlohr Bueso,
	Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang, Tim Chen,
	Wei Xu

Hello Yang,

On 5/6/2022 7:56 PM, Yang Shi wrote:
> On Fri, Apr 29, 2022 at 11:37 PM Wei Xu <weixugc@google.com> wrote:
>> On Fri, Apr 29, 2022 at 8:59 PM Yang Shi <shy828301@gmail.com> wrote:
>>> Hi Wei,
>>>
>>> Thanks for the nice writing. Please see the below inline comments.
>> Thanks for the quick response and comments.
>>
>>> On Fri, Apr 29, 2022 at 7:10 PM Wei Xu <weixugc@google.com> wrote:
>>>> The current kernel has the basic memory tiering support: Inactive
>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
>>>> tier NUMA node to make room for new allocations on the higher tier
>>>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
>>>> migrated (promoted) to a higher tier NUMA node to improve the
>>>> performance.
>>>>
>>>> A tiering relationship between NUMA nodes in the form of demotion path
>>>> is created during the kernel initialization and updated when a NUMA
>>>> node is hot-added or hot-removed.  The current implementation puts all
>>>> nodes with CPU into the top tier, and then builds the tiering hierarchy
>>>> tier-by-tier by establishing the per-node demotion targets based on
>>>> the distances between nodes.
>>>>
>>>> The current memory tiering interface needs to be improved to address
>>>> several important use cases:
>>>>
>>>> * The current tiering initialization code always initializes
>>>>    each memory-only NUMA node into a lower tier.  But a memory-only
>>>>    NUMA node may have a high performance memory device (e.g. a DRAM
>>>>    device attached via CXL.mem or a DRAM-backed memory-only node on
>>>>    a virtual machine) and should be put into the top tier.
>>>>
>>>> * The current tiering hierarchy always puts CPU nodes into the top
>>>>    tier. But on a system with HBM (e.g. GPU memory) devices, these
>>>>    memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>>>>    with CPUs are better to be placed into the next lower tier.
>>>>
>>>> * Also because the current tiering hierarchy always puts CPU nodes
>>>>    into the top tier, when a CPU is hot-added (or hot-removed) and
>>>>    triggers a memory node from CPU-less into a CPU node (or vice
>>>>    versa), the memory tiering hierarchy gets changed, even though no
>>>>    memory node is added or removed.  This can make the tiering
>>>>    hierarchy much less stable.
>>> I'd prefer the firmware builds up tiers topology then passes it to
>>> kernel so that kernel knows what nodes are in what tiers. No matter
>>> what nodes are hot-removed/hot-added they always stay in their tiers
>>> defined by the firmware. I think this is important information like
>>> numa distances. NUMA distance alone can't satisfy all the usecases
>>> IMHO.
>> I agree that the firmware needs to play a bigger role in tier
>> topology, though it is not clear to me yet that we should require the
>> tier topology be fully defined by the firmware.  If yes, a standard
>> needs to be established. Alternatively, with additional hardware
>> information provided by the firmware (e.g. HMAT), the kernel can be in
>> a much better position to initialize the proper tier topology by
>> itself.
>>
>> It is important to keep tier topology stable, especially if we want to
>> account and limit memory usage based on tiers.  So I agree that the
>> nodes should not change their tiers no matter what nodes are
>> hot-added/hot-removed.
>>
>> Given that the additional tier-related information is not yet
>> available from the firmware and NUMA distance alone is not sufficient
>> for all the tiering use cases, and also that we want to keep tier
>> topology stable after the kernel boots, I suggest that we add a kernel
>> boot parameter to override the default tier topology (which nodes are
>> in which tiers). An example is: tier=2:0-1;2-3, which defines two
>> tiers: tier 0 includes node 0 & 1, and tier 1 includes node 2 & 3.
>>
>>>> * A higher tier node can only be demoted to selected nodes on the
>>>>    next lower tier, not any other node from the next lower tier.  This
>>>>    strict, hard-coded demotion order does not work in all use cases
>>>>    (e.g. some use cases may want to allow cross-socket demotion to
>>>>    another node in the same demotion tier as a fallback when the
>>>>    preferred demotion node is out of space), and has resulted in the
>>>>    feature request for an interface to override the system-wide,
>>>>    per-node demotion order from the userspace.
>>>>
>>>> * There are no interfaces for the userspace to learn about the memory
>>>>    tiering hierarchy in order to optimize its memory allocations.
>>>>
>>>> I'd like to propose revised memory tiering kernel interfaces based on
>>>> the discussions in the threads:
>>>>
>>>> - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
>>>> - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
>>>>
>>>>
>>>> Sysfs Interfaces
>>>> ================
>>>>
>>>> * /sys/devices/system/node/memory_tiers
>>>>
>>>>    Format: node list (one tier per line, in the tier order)
>>>>
>>>>    When read, list memory nodes by tiers.
>>>>
>>>>    When written (one tier per line), take the user-provided node-tier
>>>>    assignment as the new tiering hierarchy and rebuild the per-node
>>>>    demotion order.  It is allowed to only override the top tiers, in
>>>>    which cases, the kernel will establish the lower tiers automatically.
>>> TBH I still think it is too soon to define proper user visible
>>> interfaces for now, particularly for override.
>> I agree, but there are also needs to make use of tiering even as it
>> evolves.  This is why only a minimal sysfs interface is proposed.  We
>> can make it read-only and resort to a kernel boot parameter to
>> override tiers.
>>
>>>>
>>>> Kernel Representation
>>>> =====================
>>>>
>>>> * nodemask_t node_states[N_TOPTIER_MEMORY]
>>>>
>>>>    Store all top-tier memory nodes.
>>>>
>>>> * nodemask_t memory_tiers[MAX_TIERS]
>>>>
>>>>    Store memory nodes by tiers.
>>> I'd prefer nodemask_t node_states[MAX_TIERS][]. Tier 0 is always the
>>> top tier. The kernel could build this with the topology built by
>>> firmware.
>> node_states[N_TOPTIER_MEMORY] is for convenience and can be removed.
>>
>> node_states is already an existing kernel array (defined as nodemask_t
>> node_states[NR_NODE_STATES]).  We need an array for memory tiers, too,
>> which is why a new array, memory_tiers, is proposed.
>>
>> Are you proposing that we make node_states a 2-dimensional array?
>> That can duplicate the information already in node_states, which is
>> not ideal.
> Sorry for the late reply.
>
> Yes, 2-dimensional array. With it we could know what nodes are in what tiers.
>
>>>> * struct demotion_nodes node_demotion[]
>>>>
>>>>    where: struct demotion_nodes { nodemask_t preferred; nodemask_t allowed; }
>>>>
>>>>    For a node N:
>>>>
>>>>    node_demotion[N].preferred lists all preferred demotion targets;
>>>>
>>>>    node_demotion[N].allowed lists all allowed demotion targets
>>>>    (initialized to be all the nodes in the same demotion tier).
>>> It seems unnecessary to define preferred and allowed IMHO. Why not
>>> just use something like the allocation fallback list? The first node
>>> in the list is the preferred one. When allocating memory for demotion,
>>> convert the list to a nodemask, then call __alloc_pages(gfp, order,
>>> first_node, nodemask). So the allocation could fallback to the allowed
>>> nodes automatically.
>> The nodemask "preferred" is an attempt to preserve a current feature
>> in node_demotion[]: load balancing among multiple equally-close target
>> nodes via random selection.  We can remove it to keep things simple.
>>
>> The idea of defining "preferred" and "allowed" is exactly to use
>> __alloc_pages(gfp, order, preferred_node, allowed_nodemask).  Given
>> that the page allocator already computes the allocation fallback list,
>> it should be unnecessary to maintain an ordered demotion node list for
>> each node and convert such a list to a nodemask for demotion
>> allocation.  This is why allowed is stored as a nodemask.
> Yeah, it doesn't have to be ordered.
>
>> When demoting a page from node N, I think we can just call
>> __alloc_pages(gfp, order, N, memory_tiers[node_to_tier(N) + 1]).  If
>> so, we can remove node_demotion[] entirely and add a tier field to
>> NODE_DATA for node_to_tier().
>>
>>>>
>>>> Tiering Hierarchy Initialization
>>>> ================================
>>>>
>>>> By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
>>>>
>>>> A device driver can remove its memory nodes from the top tier, e.g.
>>>> a dax driver can remove PMEM nodes from the top tier.
>>> With the topology built by firmware we should not need this.
>> I agree. But before we have such a firmware, the kernel needs to do
>> its best to initialize memory tiers.
>>
>> Given that we know PMEM is slower than DRAM, but a dax device might
>> not be PMEM, a better place to set the tier for PMEM nodes can be the
>> ACPI code, e.g. acpi_numa_memory_affinity_init() where we can examine
>> the ACPI_SRAT_MEM_NON_VOLATILE bit.
> This is why I hope firmware could chime in, for example, we may have a
> new field, called "Tier", in HMAT. Then kernel just reads the field
> and put the node into proper tier. But of course override by kernel
> could be allowed.
>
>>>> The kernel builds the memory tiering hierarchy and per-node demotion
>>>> order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
>>>> best distance nodes in the next lower tier are assigned to
>>>> node_demotion[N].preferred and all the nodes in the next lower tier
>>>> are assigned to node_demotion[N].allowed.
>>> I'm not sure whether it should be allowed to demote to multiple lower
>>> tiers. But it is totally fine to *NOT* allow it at the moment. Once we
>>> figure out a good way to define demotion targets, it could be extended
>>> to support this easily.
>> You mean to only support MAX_TIERS=2 for now.  I am fine with that.
>> There can be systems with 3 tiers, e.g. GPU -> DRAM -> PMEM, but it is
>> not clear yet whether we want to enable transparent memory tiering to
>> all the 3 tiers on such systems.
> Just start from something simple. And we should fully utilize the
> nearest lower tier before demoting to lower lower tiers.
There might still be simple cases/topologies where we might want to
"skip" the very next lower tier. For example, assume we have a 3-tiered
memory system as follows:

node 0 has a CPU and DDR memory in tier 0, node 1 has a GPU and DDR
memory in tier 0, node 2 has NVMM memory in tier 1, and node 3 has some
sort of bigger memory (could be a bigger DDR or something) in tier 2.
The distances are as follows:

--------------          --------------
|   Node 0   |          |   Node 1   |
|  -------   |          |  -------   |
| |  DDR  |  |          | |  DDR  |  |
|  -------   |          |  -------   |
|            |          |            |
--------------          --------------
        | 20               | 120    |
        v                  v        |
----------------------------       |
| Node 2     PMEM          |       | 100
----------------------------       |
        | 100                       |
        v                           v
--------------------------------------
| Node 3    Large mem                |
--------------------------------------

node distances:
node   0    1    2    3
    0  10   20   20  120
    1  20   10  120  100
    2  20  120   10  100
    3  120 100  100   10

/sys/devices/system/node/memory_tiers
0-1
2
3

N_TOPTIER_MEMORY: 0-1


In this case, we want to be able to "skip" the demotion path from
Node 1 to Node 2, and make demotion go directly to Node 3 as it is
closer, distance-wise. How can we accommodate this scenario (or at
least not rule it out as future work) with the current RFC?
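
As a sanity check, here is a small standalone program (plain C, not
kernel code) that applies a "nearest node in any lower tier" rule to
the distance matrix above; it shows that node 1's nearest lower-tier
node is node 3 (distance 100), not node 2 (distance 120):

/* Pick, for each node, the closest node in any strictly lower tier,
 * using the distance matrix and tier assignment from this example. */
#include <stdio.h>

#define NR_NODES 4

static const int distance[NR_NODES][NR_NODES] = {
    {  10,  20,  20, 120 },
    {  20,  10, 120, 100 },
    {  20, 120,  10, 100 },
    { 120, 100, 100,  10 },
};

/* tier of each node, as listed in memory_tiers above */
static const int tier[NR_NODES] = { 0, 0, 1, 2 };

int main(void)
{
    for (int n = 0; n < NR_NODES; n++) {
        int best = -1;

        for (int t = 0; t < NR_NODES; t++) {
            if (tier[t] <= tier[n])
                continue;   /* only strictly lower tiers qualify */
            if (best < 0 || distance[n][t] < distance[n][best])
                best = t;
        }
        if (best >= 0)
            printf("node %d -> node %d (distance %d)\n",
                   n, best, distance[n][best]);
        else
            printf("node %d -> no demotion target\n", n);
    }
    return 0;
}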

>>>> node_demotion[N].preferred can be empty if no preferred demotion node
>>>> is available for node N.
>>>>
>>>> If the userspace overrides the tiers via the memory_tiers sysfs
>>>> interface, the kernel then only rebuilds the per-node demotion order
>>>> accordingly.
>>>>
>>>> Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
>>>> memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
>>>> node.
>>>>
>>>>
>>>> Memory Allocation for Demotion
>>>> ==============================
>>>>
>>>> When allocating a new demotion target page, both a preferred node
>>>> and the allowed nodemask are provided to the allocation function.
>>>> The default kernel allocation fallback order is used to allocate the
>>>> page from the specified node and nodemask.
>>>>
>>>> The mempolicy of cpuset, vma and owner task of the source page can
>>>> be set to refine the demotion nodemask, e.g. to prevent demotion or
>>>> select a particular allowed node as the demotion target.
>>>>
>>>>
>>>> Examples
>>>> ========
>>>>
>>>> * Example 1:
>>>>    Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
>>>>
>>>>    Node 0 has node 2 as the preferred demotion target and can also
>>>>    fallback demotion to node 3.
>>>>
>>>>    Node 1 has node 3 as the preferred demotion target and can also
>>>>    fallback demotion to node 2.
>>>>
>>>>    Set mempolicy to prevent cross-socket demotion and memory access,
>>>>    e.g. cpuset.mems=0,2
>>>>
>>>> node distances:
>>>> node   0    1    2    3
>>>>     0  10   20   30   40
>>>>     1  20   10   40   30
>>>>     2  30   40   10   40
>>>>     3  40   30   40   10
>>>>
>>>> /sys/devices/system/node/memory_tiers
>>>> 0-1
>>>> 2-3
>>>>
>>>> N_TOPTIER_MEMORY: 0-1
>>>>
>>>> node_demotion[]:
>>>>    0: [2], [2-3]
>>>>    1: [3], [2-3]
>>>>    2: [],  []
>>>>    3: [],  []
>>>>
>>>> * Example 2:
>>>>    Node 0 & 1 are DRAM nodes.
>>>>    Node 2 is a PMEM node and closer to node 0.
>>>>
>>>>    Node 0 has node 2 as the preferred and only demotion target.
>>>>
>>>>    Node 1 has no preferred demotion target, but can still demote
>>>>    to node 2.
>>>>
>>>>    Set mempolicy to prevent cross-socket demotion and memory access,
>>>>    e.g. cpuset.mems=0,2
>>>>
>>>> node distances:
>>>> node   0    1    2
>>>>     0  10   20   30
>>>>     1  20   10   40
>>>>     2  30   40   10
>>>>
>>>> /sys/devices/system/node/memory_tiers
>>>> 0-1
>>>> 2
>>>>
>>>> N_TOPTIER_MEMORY: 0-1
>>>>
>>>> node_demotion[]:
>>>>    0: [2], [2]
>>>>    1: [],  [2]
>>>>    2: [],  []
>>>>
>>>>
>>>> * Example 3:
>>>>    Node 0 & 1 are DRAM nodes.
>>>>    Node 2 is a PMEM node and has the same distance to node 0 & 1.
>>>>
>>>>    Node 0 has node 2 as the preferred and only demotion target.
>>>>
>>>>    Node 1 has node 2 as the preferred and only demotion target.
>>>>
>>>> node distances:
>>>> node   0    1    2
>>>>     0  10   20   30
>>>>     1  20   10   30
>>>>     2  30   30   10
>>>>
>>>> /sys/devices/system/node/memory_tiers
>>>> 0-1
>>>> 2
>>>>
>>>> N_TOPTIER_MEMORY: 0-1
>>>>
>>>> node_demotion[]:
>>>>    0: [2], [2]
>>>>    1: [2], [2]
>>>>    2: [],  []
>>>>
>>>>
>>>> * Example 4:
>>>>    Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
>>>>
>>>>    All nodes are top-tier.
>>>>
>>>> node distances:
>>>> node   0    1    2
>>>>     0  10   20   30
>>>>     1  20   10   30
>>>>     2  30   30   10
>>>>
>>>> /sys/devices/system/node/memory_tiers
>>>> 0-2
>>>>
>>>> N_TOPTIER_MEMORY: 0-2
>>>>
>>>> node_demotion[]:
>>>>    0: [],  []
>>>>    1: [],  []
>>>>    2: [],  []
>>>>
>>>>
>>>> * Example 5:
>>>>    Node 0 is a DRAM node with CPU.
>>>>    Node 1 is a HBM node.
>>>>    Node 2 is a PMEM node.
>>>>
>>>>    With userspace override, node 1 is the top tier and has node 0 as
>>>>    the preferred and only demotion target.
>>>>
>>>>    Node 0 is in the second tier, tier 1, and has node 2 as the
>>>>    preferred and only demotion target.
>>>>
>>>>    Node 2 is in the lowest tier, tier 2, and has no demotion targets.
>>>>
>>>> node distances:
>>>> node   0    1    2
>>>>     0  10   21   30
>>>>     1  21   10   40
>>>>     2  30   40   10
>>>>
>>>> /sys/devices/system/node/memory_tiers (userspace override)
>>>> 1
>>>> 0
>>>> 2
>>>>
>>>> N_TOPTIER_MEMORY: 1
>>>>
>>>> node_demotion[]:
>>>>    0: [2], [2]
>>>>    1: [0], [0]
>>>>    2: [],  []
>>>>
>>>> -- Wei
-- Hesham


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-09 14:32       ` Hesham Almatary
@ 2022-05-10  3:24         ` Yang Shi
  2022-05-10  9:59           ` Hesham Almatary
  2022-05-10  4:22         ` Wei Xu
  1 sibling, 1 reply; 57+ messages in thread
From: Yang Shi @ 2022-05-10  3:24 UTC (permalink / raw)
  To: Hesham Almatary
  Cc: Andrew Morton, Dave Hansen, Huang Ying, Dan Williams, Linux MM,
	Greg Thelen, Aneesh Kumar K.V, Jagdish Gediya,
	Linux Kernel Mailing List, Alistair Popple, Davidlohr Bueso,
	Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang, Tim Chen,
	Wei Xu

On Mon, May 9, 2022 at 7:32 AM Hesham Almatary
<hesham.almatary@huawei.com> wrote:
>
> Hello Yang,
>
> On 5/6/2022 7:56 PM, Yang Shi wrote:
> > On Fri, Apr 29, 2022 at 11:37 PM Wei Xu <weixugc@google.com> wrote:
> >> On Fri, Apr 29, 2022 at 8:59 PM Yang Shi <shy828301@gmail.com> wrote:
> >>> Hi Wei,
> >>>
> >>> Thanks for the nice writing. Please see the below inline comments.
> >> Thanks for the quick response and comments.
> >>
> >>> On Fri, Apr 29, 2022 at 7:10 PM Wei Xu <weixugc@google.com> wrote:
> >>>> The current kernel has the basic memory tiering support: Inactive
> >>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
> >>>> tier NUMA node to make room for new allocations on the higher tier
> >>>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> >>>> migrated (promoted) to a higher tier NUMA node to improve the
> >>>> performance.
> >>>>
> >>>> A tiering relationship between NUMA nodes in the form of demotion path
> >>>> is created during the kernel initialization and updated when a NUMA
> >>>> node is hot-added or hot-removed.  The current implementation puts all
> >>>> nodes with CPU into the top tier, and then builds the tiering hierarchy
> >>>> tier-by-tier by establishing the per-node demotion targets based on
> >>>> the distances between nodes.
> >>>>
> >>>> The current memory tiering interface needs to be improved to address
> >>>> several important use cases:
> >>>>
> >>>> * The current tiering initialization code always initializes
> >>>>    each memory-only NUMA node into a lower tier.  But a memory-only
> >>>>    NUMA node may have a high performance memory device (e.g. a DRAM
> >>>>    device attached via CXL.mem or a DRAM-backed memory-only node on
> >>>>    a virtual machine) and should be put into the top tier.
> >>>>
> >>>> * The current tiering hierarchy always puts CPU nodes into the top
> >>>>    tier. But on a system with HBM (e.g. GPU memory) devices, these
> >>>>    memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> >>>>    with CPUs are better to be placed into the next lower tier.
> >>>>
> >>>> * Also because the current tiering hierarchy always puts CPU nodes
> >>>>    into the top tier, when a CPU is hot-added (or hot-removed) and
> >>>>    triggers a memory node from CPU-less into a CPU node (or vice
> >>>>    versa), the memory tiering hierarchy gets changed, even though no
> >>>>    memory node is added or removed.  This can make the tiering
> >>>>    hierarchy much less stable.
> >>> I'd prefer the firmware builds up tiers topology then passes it to
> >>> kernel so that kernel knows what nodes are in what tiers. No matter
> >>> what nodes are hot-removed/hot-added they always stay in their tiers
> >>> defined by the firmware. I think this is important information like
> >>> numa distances. NUMA distance alone can't satisfy all the usecases
> >>> IMHO.
> >> I agree that the firmware needs to play a bigger role in tier
> >> topology, though it is not clear to me yet that we should require the
> >> tier topology be fully defined by the firmware.  If yes, a standard
> >> needs to be established. Alternatively, with additional hardware
> >> information provided by the firmware (e.g. HMAT), the kernel can be in
> >> a much better position to initialize the proper tier topology by
> >> itself.
> >>
> >> It is important to keep tier topology stable, especially if we want to
> >> account and limit memory usage based on tiers.  So I agree that the
> >> nodes should not change their tiers no matter what nodes are
> >> hot-added/hot-removed.
> >>
> >> Given that the additional tier-related information is not yet
> >> available from the firmware and NUMA distance alone is not sufficient
> >> for all the tiering use cases, and also that we want to keep tier
> >> topology stable after the kernel boots, I suggest that we add a kernel
> >> boot parameter to override the default tier topology (which nodes are
> >> in which tiers). An example is: tier=2:0-1;2-3, which defines two
> >> tiers: tier 0 includes node 0 & 1, and tier 1 includes node 2 & 3.
> >>
> >>>> * A higher tier node can only be demoted to selected nodes on the
> >>>>    next lower tier, not any other node from the next lower tier.  This
> >>>>    strict, hard-coded demotion order does not work in all use cases
> >>>>    (e.g. some use cases may want to allow cross-socket demotion to
> >>>>    another node in the same demotion tier as a fallback when the
> >>>>    preferred demotion node is out of space), and has resulted in the
> >>>>    feature request for an interface to override the system-wide,
> >>>>    per-node demotion order from the userspace.
> >>>>
> >>>> * There are no interfaces for the userspace to learn about the memory
> >>>>    tiering hierarchy in order to optimize its memory allocations.
> >>>>
> >>>> I'd like to propose revised memory tiering kernel interfaces based on
> >>>> the discussions in the threads:
> >>>>
> >>>> - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> >>>> - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
> >>>>
> >>>>
> >>>> Sysfs Interfaces
> >>>> ================
> >>>>
> >>>> * /sys/devices/system/node/memory_tiers
> >>>>
> >>>>    Format: node list (one tier per line, in the tier order)
> >>>>
> >>>>    When read, list memory nodes by tiers.
> >>>>
> >>>>    When written (one tier per line), take the user-provided node-tier
> >>>>    assignment as the new tiering hierarchy and rebuild the per-node
> >>>>    demotion order.  It is allowed to only override the top tiers, in
> >>>>    which cases, the kernel will establish the lower tiers automatically.
> >>> TBH I still think it is too soon to define proper user visible
> >>> interfaces for now, particularly for override.
> >> I agree, but there are also needs to make use of tiering even as it
> >> evolves.  This is why only a minimal sysfs interface is proposed.  We
> >> can make it read-only and resort to a kernel boot parameter to
> >> override tiers.
> >>
> >>>>
> >>>> Kernel Representation
> >>>> =====================
> >>>>
> >>>> * nodemask_t node_states[N_TOPTIER_MEMORY]
> >>>>
> >>>>    Store all top-tier memory nodes.
> >>>>
> >>>> * nodemask_t memory_tiers[MAX_TIERS]
> >>>>
> >>>>    Store memory nodes by tiers.
> >>> I'd prefer nodemask_t node_states[MAX_TIERS][]. Tier 0 is always the
> >>> top tier. The kernel could build this with the topology built by
> >>> firmware.
> >> node_states[N_TOPTIER_MEMORY] is for convenience and can be removed.
> >>
> >> node_states is already an existing kernel array (defined as nodemask_t
> >> node_states[NR_NODE_STATES]).  We need an array for memory tiers, too,
> >> which is why a new array, memory_tiers, is proposed.
> >>
> >> Are you proposing that we make node_states a 2-dimensional array?
> >> That can duplicate the information already in node_states, which is
> >> not ideal.
> > Sorry for the late reply.
> >
> > Yes, 2-dimensional array. With it we could know what nodes are in what tiers.
> >
> >>>> * struct demotion_nodes node_demotion[]
> >>>>
> >>>>    where: struct demotion_nodes { nodemask_t preferred; nodemask_t allowed; }
> >>>>
> >>>>    For a node N:
> >>>>
> >>>>    node_demotion[N].preferred lists all preferred demotion targets;
> >>>>
> >>>>    node_demotion[N].allowed lists all allowed demotion targets
> >>>>    (initialized to be all the nodes in the same demotion tier).
> >>> It seems unnecessary to define preferred and allowed IMHO. Why not
> >>> just use something like the allocation fallback list? The first node
> >>> in the list is the preferred one. When allocating memory for demotion,
> >>> convert the list to a nodemask, then call __alloc_pages(gfp, order,
> >>> first_node, nodemask). So the allocation could fallback to the allowed
> >>> nodes automatically.
> >> The nodemask "preferred" is an attempt to preserve a current feature
> >> in node_demotion[]: load balancing among multiple equally-close target
> >> nodes via random selection.  We can remove it to keep things simple.
> >>
> >> The idea of defining "preferred" and "allowed" is exactly to use
> >> __alloc_pages(gfp, order, preferred_node, allowed_nodemask).  Given
> >> that the page allocator already computes the allocation fallback list,
> >> it should be unnecessary to maintain an ordered demotion node list for
> >> each node and convert such a list to a nodemask for demotion
> >> allocation.  This is why allowed is stored as a nodemask.
> > Yeah, it doesn't have to be ordered.
> >
> >> When demoting a page from node N, I think we can just call
> >> __alloc_pages(gfp, order, N, memory_tiers[node_to_tier(N) + 1]).  If
> >> so, we can remove node_demotion[] entirely and add a tier field to
> >> NODE_DATA for node_to_tier().
> >>
> >>>>
> >>>> Tiering Hierarchy Initialization
> >>>> ================================
> >>>>
> >>>> By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
> >>>>
> >>>> A device driver can remove its memory nodes from the top tier, e.g.
> >>>> a dax driver can remove PMEM nodes from the top tier.
> >>> With the topology built by firmware we should not need this.
> >> I agree. But before we have such a firmware, the kernel needs to do
> >> its best to initialize memory tiers.
> >>
> >> Given that we know PMEM is slower than DRAM, but a dax device might
> >> not be PMEM, a better place to set the tier for PMEM nodes can be the
> >> ACPI code, e.g. acpi_numa_memory_affinity_init() where we can examine
> >> the ACPI_SRAT_MEM_NON_VOLATILE bit.
> > This is why I hope firmware could chime in, for example, we may have a
> > new field, called "Tier", in HMAT. Then kernel just reads the field
> > and put the node into proper tier. But of course override by kernel
> > could be allowed.
> >
> >>>> The kernel builds the memory tiering hierarchy and per-node demotion
> >>>> order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
> >>>> best distance nodes in the next lower tier are assigned to
> >>>> node_demotion[N].preferred and all the nodes in the next lower tier
> >>>> are assigned to node_demotion[N].allowed.
> >>> I'm not sure whether it should be allowed to demote to multiple lower
> >>> tiers. But it is totally fine to *NOT* allow it at the moment. Once we
> >>> figure out a good way to define demotion targets, it could be extended
> >>> to support this easily.
> >> You mean to only support MAX_TIERS=2 for now.  I am fine with that.
> >> There can be systems with 3 tiers, e.g. GPU -> DRAM -> PMEM, but it is
> >> not clear yet whether we want to enable transparent memory tiering to
> >> all the 3 tiers on such systems.
> > Just start from something simple. And we should fully utilize the
> > nearest lower tier before demoting to lower lower tiers.
> There might still be simple cases/topologies where we might want to "skip"
> the very next lower tier. For example, assume we have a 3 tiered memory
> system as follows:
>
> node 0 has a CPU and DDR memory in tier 0, node 1 has GPU and DDR memory
> in tier 0,
> node 2 has NVMM memory in tier 1, node 3 has some sort of bigger memory
> (could be a bigger DDR or something) in tier 2. The distances are as
> follows:
>
> --------------          --------------
> |   Node 0   |          |   Node 1   |
> |  -------   |          |  -------   |
> | |  DDR  |  |          | |  DDR  |  |
> |  -------   |          |  -------   |
> |            |          |            |
> --------------          --------------
>         | 20               | 120    |
>         v                  v        |
> ----------------------------       |
> | Node 2     PMEM          |       | 100
> ----------------------------       |
>         | 100                       |
>         v                           v
> --------------------------------------
> | Node 3    Large mem                |
> --------------------------------------
>
> node distances:
> node   0    1    2    3
>     0  10   20   20  120
>     1  20   10  120  100
>     2  20  120   10  100
>     3  120 100  100   10
>
> /sys/devices/system/node/memory_tiers
> 0-1
> 2
> 3
>
> N_TOPTIER_MEMORY: 0-1
>
>
> In this case, we want to be able to "skip" the demotion path from
> Node 1 to Node 2, and make demotion go directly to Node 3 as it is
> closer, distance-wise. How can we accommodate this scenario (or at
> least not rule it out as future work) with the current RFC?

If I remember correctly, NUMA distance is hardcoded in SLIT by the
firmware and is supposed to reflect latency, so I suppose it is the
firmware's responsibility to provide correct information. The RFC also
assumes that higher tier memory has better performance than lower tier
memory (latency, bandwidth, throughput, etc.), so firmware that reports
a shorter distance to lower tier memory than to higher tier memory
sounds buggy IMHO.

Anyway, I'm not an expert on hardware or firmware; I just wish the
hardware and firmware would make our life easier :-)

>
> >>>> node_demotion[N].preferred can be empty if no preferred demotion node
> >>>> is available for node N.
> >>>>
> >>>> If the userspace overrides the tiers via the memory_tiers sysfs
> >>>> interface, the kernel then only rebuilds the per-node demotion order
> >>>> accordingly.
> >>>>
> >>>> Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
> >>>> memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
> >>>> node.
> >>>>
> >>>>
> >>>> Memory Allocation for Demotion
> >>>> ==============================
> >>>>
> >>>> When allocating a new demotion target page, both a preferred node
> >>>> and the allowed nodemask are provided to the allocation function.
> >>>> The default kernel allocation fallback order is used to allocate the
> >>>> page from the specified node and nodemask.
> >>>>
> >>>> The mempolicy of cpuset, vma and owner task of the source page can
> >>>> be set to refine the demotion nodemask, e.g. to prevent demotion or
> >>>> select a particular allowed node as the demotion target.
> >>>>
> >>>>
> >>>> Examples
> >>>> ========
> >>>>
> >>>> * Example 1:
> >>>>    Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
> >>>>
> >>>>    Node 0 has node 2 as the preferred demotion target and can also
> >>>>    fallback demotion to node 3.
> >>>>
> >>>>    Node 1 has node 3 as the preferred demotion target and can also
> >>>>    fallback demotion to node 2.
> >>>>
> >>>>    Set mempolicy to prevent cross-socket demotion and memory access,
> >>>>    e.g. cpuset.mems=0,2
> >>>>
> >>>> node distances:
> >>>> node   0    1    2    3
> >>>>     0  10   20   30   40
> >>>>     1  20   10   40   30
> >>>>     2  30   40   10   40
> >>>>     3  40   30   40   10
> >>>>
> >>>> /sys/devices/system/node/memory_tiers
> >>>> 0-1
> >>>> 2-3
> >>>>
> >>>> N_TOPTIER_MEMORY: 0-1
> >>>>
> >>>> node_demotion[]:
> >>>>    0: [2], [2-3]
> >>>>    1: [3], [2-3]
> >>>>    2: [],  []
> >>>>    3: [],  []
> >>>>
> >>>> * Example 2:
> >>>>    Node 0 & 1 are DRAM nodes.
> >>>>    Node 2 is a PMEM node and closer to node 0.
> >>>>
> >>>>    Node 0 has node 2 as the preferred and only demotion target.
> >>>>
> >>>>    Node 1 has no preferred demotion target, but can still demote
> >>>>    to node 2.
> >>>>
> >>>>    Set mempolicy to prevent cross-socket demotion and memory access,
> >>>>    e.g. cpuset.mems=0,2
> >>>>
> >>>> node distances:
> >>>> node   0    1    2
> >>>>     0  10   20   30
> >>>>     1  20   10   40
> >>>>     2  30   40   10
> >>>>
> >>>> /sys/devices/system/node/memory_tiers
> >>>> 0-1
> >>>> 2
> >>>>
> >>>> N_TOPTIER_MEMORY: 0-1
> >>>>
> >>>> node_demotion[]:
> >>>>    0: [2], [2]
> >>>>    1: [],  [2]
> >>>>    2: [],  []
> >>>>
> >>>>
> >>>> * Example 3:
> >>>>    Node 0 & 1 are DRAM nodes.
> >>>>    Node 2 is a PMEM node and has the same distance to node 0 & 1.
> >>>>
> >>>>    Node 0 has node 2 as the preferred and only demotion target.
> >>>>
> >>>>    Node 1 has node 2 as the preferred and only demotion target.
> >>>>
> >>>> node distances:
> >>>> node   0    1    2
> >>>>     0  10   20   30
> >>>>     1  20   10   30
> >>>>     2  30   30   10
> >>>>
> >>>> /sys/devices/system/node/memory_tiers
> >>>> 0-1
> >>>> 2
> >>>>
> >>>> N_TOPTIER_MEMORY: 0-1
> >>>>
> >>>> node_demotion[]:
> >>>>    0: [2], [2]
> >>>>    1: [2], [2]
> >>>>    2: [],  []
> >>>>
> >>>>
> >>>> * Example 4:
> >>>>    Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
> >>>>
> >>>>    All nodes are top-tier.
> >>>>
> >>>> node distances:
> >>>> node   0    1    2
> >>>>     0  10   20   30
> >>>>     1  20   10   30
> >>>>     2  30   30   10
> >>>>
> >>>> /sys/devices/system/node/memory_tiers
> >>>> 0-2
> >>>>
> >>>> N_TOPTIER_MEMORY: 0-2
> >>>>
> >>>> node_demotion[]:
> >>>>    0: [],  []
> >>>>    1: [],  []
> >>>>    2: [],  []
> >>>>
> >>>>
> >>>> * Example 5:
> >>>>    Node 0 is a DRAM node with CPU.
> >>>>    Node 1 is a HBM node.
> >>>>    Node 2 is a PMEM node.
> >>>>
> >>>>    With userspace override, node 1 is the top tier and has node 0 as
> >>>>    the preferred and only demotion target.
> >>>>
> >>>>    Node 0 is in the second tier, tier 1, and has node 2 as the
> >>>>    preferred and only demotion target.
> >>>>
> >>>>    Node 2 is in the lowest tier, tier 2, and has no demotion targets.
> >>>>
> >>>> node distances:
> >>>> node   0    1    2
> >>>>     0  10   21   30
> >>>>     1  21   10   40
> >>>>     2  30   40   10
> >>>>
> >>>> /sys/devices/system/node/memory_tiers (userspace override)
> >>>> 1
> >>>> 0
> >>>> 2
> >>>>
> >>>> N_TOPTIER_MEMORY: 1
> >>>>
> >>>> node_demotion[]:
> >>>>    0: [2], [2]
> >>>>    1: [0], [0]
> >>>>    2: [],  []
> >>>>
> >>>> -- Wei
> -- Hesham


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-09 14:32       ` Hesham Almatary
  2022-05-10  3:24         ` Yang Shi
@ 2022-05-10  4:22         ` Wei Xu
  2022-05-10 10:01           ` Hesham Almatary
  2022-05-10 11:44           ` Aneesh Kumar K.V
  1 sibling, 2 replies; 57+ messages in thread
From: Wei Xu @ 2022-05-10  4:22 UTC (permalink / raw)
  To: Hesham Almatary
  Cc: Yang Shi, Andrew Morton, Dave Hansen, Huang Ying, Dan Williams,
	Linux MM, Greg Thelen, Aneesh Kumar K.V, Jagdish Gediya,
	Linux Kernel Mailing List, Alistair Popple, Davidlohr Bueso,
	Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang, Tim Chen

On Mon, May 9, 2022 at 7:32 AM Hesham Almatary
<hesham.almatary@huawei.com> wrote:
>
> Hello Yang,
>
> On 5/6/2022 7:56 PM, Yang Shi wrote:
> > On Fri, Apr 29, 2022 at 11:37 PM Wei Xu <weixugc@google.com> wrote:
> >> On Fri, Apr 29, 2022 at 8:59 PM Yang Shi <shy828301@gmail.com> wrote:
> >>> Hi Wei,
> >>>
> >>> Thanks for the nice writing. Please see the below inline comments.
> >> Thanks for the quick response and comments.
> >>
> >>> On Fri, Apr 29, 2022 at 7:10 PM Wei Xu <weixugc@google.com> wrote:
> >>>> The current kernel has the basic memory tiering support: Inactive
> >>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
> >>>> tier NUMA node to make room for new allocations on the higher tier
> >>>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> >>>> migrated (promoted) to a higher tier NUMA node to improve the
> >>>> performance.
> >>>>
> >>>> A tiering relationship between NUMA nodes in the form of demotion path
> >>>> is created during the kernel initialization and updated when a NUMA
> >>>> node is hot-added or hot-removed.  The current implementation puts all
> >>>> nodes with CPU into the top tier, and then builds the tiering hierarchy
> >>>> tier-by-tier by establishing the per-node demotion targets based on
> >>>> the distances between nodes.
> >>>>
> >>>> The current memory tiering interface needs to be improved to address
> >>>> several important use cases:
> >>>>
> >>>> * The current tiering initialization code always initializes
> >>>>    each memory-only NUMA node into a lower tier.  But a memory-only
> >>>>    NUMA node may have a high performance memory device (e.g. a DRAM
> >>>>    device attached via CXL.mem or a DRAM-backed memory-only node on
> >>>>    a virtual machine) and should be put into the top tier.
> >>>>
> >>>> * The current tiering hierarchy always puts CPU nodes into the top
> >>>>    tier. But on a system with HBM (e.g. GPU memory) devices, these
> >>>>    memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> >>>>    with CPUs are better to be placed into the next lower tier.
> >>>>
> >>>> * Also because the current tiering hierarchy always puts CPU nodes
> >>>>    into the top tier, when a CPU is hot-added (or hot-removed) and
> >>>>    triggers a memory node from CPU-less into a CPU node (or vice
> >>>>    versa), the memory tiering hierarchy gets changed, even though no
> >>>>    memory node is added or removed.  This can make the tiering
> >>>>    hierarchy much less stable.
> >>> I'd prefer the firmware builds up tiers topology then passes it to
> >>> kernel so that kernel knows what nodes are in what tiers. No matter
> >>> what nodes are hot-removed/hot-added they always stay in their tiers
> >>> defined by the firmware. I think this is important information like
> >>> numa distances. NUMA distance alone can't satisfy all the usecases
> >>> IMHO.
> >> I agree that the firmware needs to play a bigger role in tier
> >> topology, though it is not clear to me yet that we should require the
> >> tier topology be fully defined by the firmware.  If yes, a standard
> >> needs to be established. Alternatively, with additional hardware
> >> information provided by the firmware (e.g. HMAT), the kernel can be in
> >> a much better position to initialize the proper tier topology by
> >> itself.
> >>
> >> It is important to keep tier topology stable, especially if we want to
> >> account and limit memory usage based on tiers.  So I agree that the
> >> nodes should not change their tiers no matter what nodes are
> >> hot-added/hot-removed.
> >>
> >> Given that the additional tier-related information is not yet
> >> available from the firmware and NUMA distance alone is not sufficient
> >> for all the tiering use cases, and also that we want to keep tier
> >> topology stable after the kernel boots, I suggest that we add a kernel
> >> boot parameter to override the default tier topology (which nodes are
> >> in which tiers). An example is: tier=2:0-1;2-3, which defines two
> >> tiers: tier 0 includes node 0 & 1, and tier 1 includes node 2 & 3.
> >>
> >>>> * A higher tier node can only be demoted to selected nodes on the
> >>>>    next lower tier, not any other node from the next lower tier.  This
> >>>>    strict, hard-coded demotion order does not work in all use cases
> >>>>    (e.g. some use cases may want to allow cross-socket demotion to
> >>>>    another node in the same demotion tier as a fallback when the
> >>>>    preferred demotion node is out of space), and has resulted in the
> >>>>    feature request for an interface to override the system-wide,
> >>>>    per-node demotion order from the userspace.
> >>>>
> >>>> * There are no interfaces for the userspace to learn about the memory
> >>>>    tiering hierarchy in order to optimize its memory allocations.
> >>>>
> >>>> I'd like to propose revised memory tiering kernel interfaces based on
> >>>> the discussions in the threads:
> >>>>
> >>>> - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> >>>> - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
> >>>>
> >>>>
> >>>> Sysfs Interfaces
> >>>> ================
> >>>>
> >>>> * /sys/devices/system/node/memory_tiers
> >>>>
> >>>>    Format: node list (one tier per line, in the tier order)
> >>>>
> >>>>    When read, list memory nodes by tiers.
> >>>>
> >>>>    When written (one tier per line), take the user-provided node-tier
> >>>>    assignment as the new tiering hierarchy and rebuild the per-node
> >>>>    demotion order.  It is allowed to only override the top tiers, in
> >>>>    which cases, the kernel will establish the lower tiers automatically.
> >>> TBH I still think it is too soon to define proper user visible
> >>> interfaces for now, particularly for override.
> >> I agree, but there are also needs to make use of tiering even as it
> >> evolves.  This is why only a minimal sysfs interface is proposed.  We
> >> can make it read-only and resort to a kernel boot parameter to
> >> override tiers.
> >>
> >>>>
> >>>> Kernel Representation
> >>>> =====================
> >>>>
> >>>> * nodemask_t node_states[N_TOPTIER_MEMORY]
> >>>>
> >>>>    Store all top-tier memory nodes.
> >>>>
> >>>> * nodemask_t memory_tiers[MAX_TIERS]
> >>>>
> >>>>    Store memory nodes by tiers.
> >>> I'd prefer nodemask_t node_states[MAX_TIERS][]. Tier 0 is always the
> >>> top tier. The kernel could build this with the topology built by
> >>> firmware.
> >> node_states[N_TOPTIER_MEMORY] is for convenience and can be removed.
> >>
> >> node_states is already an existing kernel array (defined as nodemask_t
> >> node_states[NR_NODE_STATES]).  We need an array for memory tiers, too,
> >> which is why a new array, memory_tiers, is proposed.
> >>
> >> Are you proposing that we make node_states a 2-dimensional array?
> >> That can duplicate the information already in node_states, which is
> >> not ideal.
> > Sorry for the late reply.
> >
> > Yes, 2-dimensional array. With it we could know what nodes are in what tiers.
> >
> >>>> * struct demotion_nodes node_demotion[]
> >>>>
> >>>>    where: struct demotion_nodes { nodemask_t preferred; nodemask_t allowed; }
> >>>>
> >>>>    For a node N:
> >>>>
> >>>>    node_demotion[N].preferred lists all preferred demotion targets;
> >>>>
> >>>>    node_demotion[N].allowed lists all allowed demotion targets
> >>>>    (initialized to be all the nodes in the same demotion tier).
> >>> It seems unnecessary to define preferred and allowed IMHO. Why not
> >>> just use something like the allocation fallback list? The first node
> >>> in the list is the preferred one. When allocating memory for demotion,
> >>> convert the list to a nodemask, then call __alloc_pages(gfp, order,
> >>> first_node, nodemask). So the allocation could fallback to the allowed
> >>> nodes automatically.
> >> The nodemask "preferred" is an attempt to preserve a current feature
> >> in node_demotion[]: load balancing among multiple equally-close target
> >> nodes via random selection.  We can remove it to keep things simple.
> >>
> >> The idea of defining "preferred" and "allowed" is exactly to use
> >> __alloc_pages(gfp, order, preferred_node, allowed_nodemask).  Given
> >> that the page allocator already computes the allocation fallback list,
> >> it should be unnecessary to maintain an ordered demotion node list for
> >> each node and convert such a list to a nodemask for demotion
> >> allocation.  This is why allowed is stored as a nodemask.
> > Yeah, it doesn't have to be ordered.
> >
> >> When demoting a page from node N, I think we can just call
> >> __alloc_pages(gfp, order, N, memory_tiers[node_to_tier(N) + 1]).  If
> >> so, we can remove node_demotion[] entirely and add a tier field to
> >> NODE_DATA for node_to_tier().
> >>
> >>>>
> >>>> Tiering Hierarchy Initialization
> >>>> ================================
> >>>>
> >>>> By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
> >>>>
> >>>> A device driver can remove its memory nodes from the top tier, e.g.
> >>>> a dax driver can remove PMEM nodes from the top tier.
> >>> With the topology built by firmware we should not need this.
> >> I agree. But before we have such a firmware, the kernel needs to do
> >> its best to initialize memory tiers.
> >>
> >> Given that we know PMEM is slower than DRAM, but a dax device might
> >> not be PMEM, a better place to set the tier for PMEM nodes can be the
> >> ACPI code, e.g. acpi_numa_memory_affinity_init() where we can examine
> >> the ACPI_SRAT_MEM_NON_VOLATILE bit.
> > This is why I hope firmware could chime in, for example, we may have a
> > new field, called "Tier", in HMAT. Then kernel just reads the field
> > and put the node into proper tier. But of course override by kernel
> > could be allowed.
> >
> >>>> The kernel builds the memory tiering hierarchy and per-node demotion
> >>>> order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
> >>>> best distance nodes in the next lower tier are assigned to
> >>>> node_demotion[N].preferred and all the nodes in the next lower tier
> >>>> are assigned to node_demotion[N].allowed.
> >>> I'm not sure whether it should be allowed to demote to multiple lower
> >>> tiers. But it is totally fine to *NOT* allow it at the moment. Once we
> >>> figure out a good way to define demotion targets, it could be extended
> >>> to support this easily.
> >> You mean to only support MAX_TIERS=2 for now.  I am fine with that.
> >> There can be systems with 3 tiers, e.g. GPU -> DRAM -> PMEM, but it is
> >> not clear yet whether we want to enable transparent memory tiering to
> >> all the 3 tiers on such systems.
> > Just start from something simple. And we should fully utilize the
> > nearest lower tier before demoting to lower lower tiers.
> There might still be simple cases/topologies where we might want to "skip"
> the very next lower tier. For example, assume we have a 3 tiered memory
> system as follows:
>
> node 0 has a CPU and DDR memory in tier 0, node 1 has GPU and DDR memory
> in tier 0,
> node 2 has NVMM memory in tier 1, node 3 has some sort of bigger memory
> (could be a bigger DDR or something) in tier 2. The distances are as
> follows:
>
> --------------          --------------
> |   Node 0   |          |   Node 1   |
> |  -------   |          |  -------   |
> | |  DDR  |  |          | |  DDR  |  |
> |  -------   |          |  -------   |
> |            |          |            |
> --------------          --------------
>         | 20               | 120    |
>         v                  v        |
> ----------------------------       |
> | Node 2     PMEM          |       | 100
> ----------------------------       |
>         | 100                       |
>         v                           v
> --------------------------------------
> | Node 3    Large mem                |
> --------------------------------------
>
> node distances:
> node   0    1    2    3
>     0  10   20   20  120
>     1  20   10  120  100
>     2  20  120   10  100
>     3  120 100  100   10
>
> /sys/devices/system/node/memory_tiers
> 0-1
> 2
> 3
>
> N_TOPTIER_MEMORY: 0-1
>
>
> In this case, we want to be able to "skip" the demotion path from
> Node 1 to Node 2, and make demotion go directly to Node 3 as it is
> closer, distance-wise. How can we accommodate this scenario (or at
> least not rule it out as future work) with the current RFC?

This is an interesting example.  I think one way to support this is to
allow all the lower tier nodes to be the demotion targets of a node in
a higher tier.  We can then use the allocation fallback order to
select the best demotion target.

For this example, the demotion targets of each node would be:

node 0: allowed = 2-3, order (based on allocation fallback order): 2, 3
node 1: allowed = 2-3, order (based on allocation fallback order): 3, 2
node 2: allowed = 3,   order (based on allocation fallback order): 3
node 3: allowed = empty
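
The following standalone sketch (plain C, not kernel code) derives
exactly these targets from the distance matrix in this example, using
node distance as a stand-in for the allocation fallback order:

/* "allowed" is every node in any lower tier; the order within
 * "allowed" is by increasing distance from the demoting node. */
#include <stdio.h>

#define NR_NODES 4

static const int distance[NR_NODES][NR_NODES] = {
    {  10,  20,  20, 120 },
    {  20,  10, 120, 100 },
    {  20, 120,  10, 100 },
    { 120, 100, 100,  10 },
};
static const int tier[NR_NODES] = { 0, 0, 1, 2 };

int main(void)
{
    for (int n = 0; n < NR_NODES; n++) {
        int order[NR_NODES], cnt = 0;

        for (int t = 0; t < NR_NODES; t++)
            if (tier[t] > tier[n])
                order[cnt++] = t;

        /* insertion sort of the allowed targets by distance from n */
        for (int i = 1; i < cnt; i++)
            for (int j = i; j > 0 &&
                 distance[n][order[j]] < distance[n][order[j - 1]]; j--) {
                int tmp = order[j];

                order[j] = order[j - 1];
                order[j - 1] = tmp;
            }

        printf("node %d: allowed/order =", n);
        for (int i = 0; i < cnt; i++)
            printf(" %d", order[i]);
        if (!cnt)
            printf(" (empty)");
        printf("\n");
    }
    return 0;
}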

What do you think?

> >>>> node_demotion[N].preferred can be empty if no preferred demotion node
> >>>> is available for node N.
> >>>>
> >>>> If the userspace overrides the tiers via the memory_tiers sysfs
> >>>> interface, the kernel then only rebuilds the per-node demotion order
> >>>> accordingly.
> >>>>
> >>>> Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
> >>>> memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
> >>>> node.
> >>>>
> >>>>
> >>>> Memory Allocation for Demotion
> >>>> ==============================
> >>>>
> >>>> When allocating a new demotion target page, both a preferred node
> >>>> and the allowed nodemask are provided to the allocation function.
> >>>> The default kernel allocation fallback order is used to allocate the
> >>>> page from the specified node and nodemask.
> >>>>
> >>>> The mempolicy of cpuset, vma and owner task of the source page can
> >>>> be set to refine the demotion nodemask, e.g. to prevent demotion or
> >>>> select a particular allowed node as the demotion target.
> >>>>
> >>>>
> >>>> Examples
> >>>> ========
> >>>>
> >>>> * Example 1:
> >>>>    Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
> >>>>
> >>>>    Node 0 has node 2 as the preferred demotion target and can also
> >>>>    fallback demotion to node 3.
> >>>>
> >>>>    Node 1 has node 3 as the preferred demotion target and can also
> >>>>    fallback demotion to node 2.
> >>>>
> >>>>    Set mempolicy to prevent cross-socket demotion and memory access,
> >>>>    e.g. cpuset.mems=0,2
> >>>>
> >>>> node distances:
> >>>> node   0    1    2    3
> >>>>     0  10   20   30   40
> >>>>     1  20   10   40   30
> >>>>     2  30   40   10   40
> >>>>     3  40   30   40   10
> >>>>
> >>>> /sys/devices/system/node/memory_tiers
> >>>> 0-1
> >>>> 2-3
> >>>>
> >>>> N_TOPTIER_MEMORY: 0-1
> >>>>
> >>>> node_demotion[]:
> >>>>    0: [2], [2-3]
> >>>>    1: [3], [2-3]
> >>>>    2: [],  []
> >>>>    3: [],  []
> >>>>
> >>>> * Example 2:
> >>>>    Node 0 & 1 are DRAM nodes.
> >>>>    Node 2 is a PMEM node and closer to node 0.
> >>>>
> >>>>    Node 0 has node 2 as the preferred and only demotion target.
> >>>>
> >>>>    Node 1 has no preferred demotion target, but can still demote
> >>>>    to node 2.
> >>>>
> >>>>    Set mempolicy to prevent cross-socket demotion and memory access,
> >>>>    e.g. cpuset.mems=0,2
> >>>>
> >>>> node distances:
> >>>> node   0    1    2
> >>>>     0  10   20   30
> >>>>     1  20   10   40
> >>>>     2  30   40   10
> >>>>
> >>>> /sys/devices/system/node/memory_tiers
> >>>> 0-1
> >>>> 2
> >>>>
> >>>> N_TOPTIER_MEMORY: 0-1
> >>>>
> >>>> node_demotion[]:
> >>>>    0: [2], [2]
> >>>>    1: [],  [2]
> >>>>    2: [],  []
> >>>>
> >>>>
> >>>> * Example 3:
> >>>>    Node 0 & 1 are DRAM nodes.
> >>>>    Node 2 is a PMEM node and has the same distance to node 0 & 1.
> >>>>
> >>>>    Node 0 has node 2 as the preferred and only demotion target.
> >>>>
> >>>>    Node 1 has node 2 as the preferred and only demotion target.
> >>>>
> >>>> node distances:
> >>>> node   0    1    2
> >>>>     0  10   20   30
> >>>>     1  20   10   30
> >>>>     2  30   30   10
> >>>>
> >>>> /sys/devices/system/node/memory_tiers
> >>>> 0-1
> >>>> 2
> >>>>
> >>>> N_TOPTIER_MEMORY: 0-1
> >>>>
> >>>> node_demotion[]:
> >>>>    0: [2], [2]
> >>>>    1: [2], [2]
> >>>>    2: [],  []
> >>>>
> >>>>
> >>>> * Example 4:
> >>>>    Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
> >>>>
> >>>>    All nodes are top-tier.
> >>>>
> >>>> node distances:
> >>>> node   0    1    2
> >>>>     0  10   20   30
> >>>>     1  20   10   30
> >>>>     2  30   30   10
> >>>>
> >>>> /sys/devices/system/node/memory_tiers
> >>>> 0-2
> >>>>
> >>>> N_TOPTIER_MEMORY: 0-2
> >>>>
> >>>> node_demotion[]:
> >>>>    0: [],  []
> >>>>    1: [],  []
> >>>>    2: [],  []
> >>>>
> >>>>
> >>>> * Example 5:
> >>>>    Node 0 is a DRAM node with CPU.
> >>>>    Node 1 is a HBM node.
> >>>>    Node 2 is a PMEM node.
> >>>>
> >>>>    With userspace override, node 1 is the top tier and has node 0 as
> >>>>    the preferred and only demotion target.
> >>>>
> >>>>    Node 0 is in the second tier, tier 1, and has node 2 as the
> >>>>    preferred and only demotion target.
> >>>>
> >>>>    Node 2 is in the lowest tier, tier 2, and has no demotion targets.
> >>>>
> >>>> node distances:
> >>>> node   0    1    2
> >>>>     0  10   21   30
> >>>>     1  21   10   40
> >>>>     2  30   40   10
> >>>>
> >>>> /sys/devices/system/node/memory_tiers (userspace override)
> >>>> 1
> >>>> 0
> >>>> 2
> >>>>
> >>>> N_TOPTIER_MEMORY: 1
> >>>>
> >>>> node_demotion[]:
> >>>>    0: [2], [2]
> >>>>    1: [0], [0]
> >>>>    2: [],  []
> >>>>
> >>>> -- Wei
> -- Hesham
>


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-06  0:01     ` Alistair Popple
@ 2022-05-10  4:32       ` Wei Xu
  2022-05-10  5:37         ` Alistair Popple
  0 siblings, 1 reply; 57+ messages in thread
From: Wei Xu @ 2022-05-10  4:32 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Yang Shi, Andrew Morton, Dave Hansen, Huang Ying, Dan Williams,
	Linux MM, Greg Thelen, Aneesh Kumar K.V, Jagdish Gediya,
	Linux Kernel Mailing List, Davidlohr Bueso, Michal Hocko,
	Baolin Wang, Brice Goglin, Feng Tang, Jonathan Cameron, Tim Chen

On Thu, May 5, 2022 at 5:19 PM Alistair Popple <apopple@nvidia.com> wrote:
>
> Wei Xu <weixugc@google.com> writes:
>
> [...]
>
> >> >
> >> >
> >> > Tiering Hierarchy Initialization
> >> > `=============================='
> >> >
> >> > By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
> >> >
> >> > A device driver can remove its memory nodes from the top tier, e.g.
> >> > a dax driver can remove PMEM nodes from the top tier.
> >>
> >> With the topology built by firmware we should not need this.
>
> I agree that in an ideal world the hierarchy should be built by firmware based
> on something like the HMAT. But I also think being able to override this will be
> useful in getting there. Therefore a way of overriding the generated hierarchy
> would be good, either via sysfs or kernel boot parameter if we don't want to
> commit to a particular user interface now.
>
> However, I'm less sure letting device drivers override this is a good idea.
> How, for example, would a GPU driver make sure its node is in the top tier?
> By moving every node that the driver does not know about out of
> N_TOPTIER_MEMORY? That could get messy if, say, there were two drivers both
> of which wanted their node to be in the top tier.

The suggestion is to allow a device driver to opt its memory devices
out of the top tier, not the other way around.

I agree that the kernel should still be responsible for the final
node-tier assignment by taking into account all factors: the firmware
tables, device driver requests, and user-overrides (kernel argument or
sysfs).
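
For illustration only, such an opt-out could be as small as the sketch
below.  It assumes the N_TOPTIER_MEMORY node state proposed in this
RFC, and the hook name dax_kmem_opt_out_toptier() is made up rather
than an existing driver entry point:

#include <linux/nodemask.h>

/*
 * Hypothetical sketch: a dax/kmem driver clears the proposed
 * N_TOPTIER_MEMORY state for its PMEM node before onlining the
 * memory, so that the node starts out in a lower tier.
 */
static void dax_kmem_opt_out_toptier(int nid)
{
	if (node_state(nid, N_TOPTIER_MEMORY))
		node_clear_state(nid, N_TOPTIER_MEMORY);
}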

> > I agree. But before we have such a firmware, the kernel needs to do
> > its best to initialize memory tiers.
> >
> > Given that we know PMEM is slower than DRAM, but a dax device might
> > not be PMEM, a better place to set the tier for PMEM nodes can be the
> > ACPI code, e.g. acpi_numa_memory_affinity_init() where we can examine
> > the ACPI_SRAT_MEM_NON_VOLATILE bit.
> >
> >> >
> >> > The kernel builds the memory tiering hierarchy and per-node demotion
> >> > order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
> >> > best distance nodes in the next lower tier are assigned to
> >> > node_demotion[N].preferred and all the nodes in the next lower tier
> >> > are assigned to node_demotion[N].allowed.
> >>
> >> I'm not sure whether it should be allowed to demote to multiple lower
> >> tiers. But it is totally fine to *NOT* allow it at the moment. Once we
> >> figure out a good way to define demotion targets, it could be extended
> >> to support this easily.
> >
> > You mean to only support MAX_TIERS=2 for now.  I am fine with that.
> > There can be systems with 3 tiers, e.g. GPU -> DRAM -> PMEM, but it is
> > not clear yet whether we want to enable transparent memory tiering to
> > all the 3 tiers on such systems.
>
> At some point I think we will need to deal with 3 tiers but I'd be ok with
> limiting it to 2 for now if it makes things simpler.
>
> - Alistair
>
> >> >
> >> > node_demotion[N].preferred can be empty if no preferred demotion node
> >> > is available for node N.
> >> >
> >> > If the userspace overrides the tiers via the memory_tiers sysfs
> >> > interface, the kernel then only rebuilds the per-node demotion order
> >> > accordingly.
> >> >
> >> > Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
> >> > memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
> >> > node.
> >> >
> >> >
> >> > Memory Allocation for Demotion
> >> > `============================'
> >> >
> >> > When allocating a new demotion target page, both a preferred node
> >> > and the allowed nodemask are provided to the allocation function.
> >> > The default kernel allocation fallback order is used to allocate the
> >> > page from the specified node and nodemask.
> >> >
> >> > The mempolicy of cpuset, vma and owner task of the source page can
> >> > be set to refine the demotion nodemask, e.g. to prevent demotion or
> >> > select a particular allowed node as the demotion target.
> >> >
> >> >
> >> > Examples
> >> > `======'
> >> >
> >> > * Example 1:
> >> >   Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
> >> >
> >> >   Node 0 has node 2 as the preferred demotion target and can also
> >> >   fallback demotion to node 3.
> >> >
> >> >   Node 1 has node 3 as the preferred demotion target and can also
> >> >   fallback demotion to node 2.
> >> >
> >> >   Set mempolicy to prevent cross-socket demotion and memory access,
> >> >   e.g. cpuset.mems=0,2
> >> >
> >> > node distances:
> >> > node   0    1    2    3
> >> >    0  10   20   30   40
> >> >    1  20   10   40   30
> >> >    2  30   40   10   40
> >> >    3  40   30   40   10
> >> >
> >> > /sys/devices/system/node/memory_tiers
> >> > 0-1
> >> > 2-3
> >> >
> >> > N_TOPTIER_MEMORY: 0-1
> >> >
> >> > node_demotion[]:
> >> >   0: [2], [2-3]
> >> >   1: [3], [2-3]
> >> >   2: [],  []
> >> >   3: [],  []
> >> >
> >> > * Example 2:
> >> >   Node 0 & 1 are DRAM nodes.
> >> >   Node 2 is a PMEM node and closer to node 0.
> >> >
> >> >   Node 0 has node 2 as the preferred and only demotion target.
> >> >
> >> >   Node 1 has no preferred demotion target, but can still demote
> >> >   to node 2.
> >> >
> >> >   Set mempolicy to prevent cross-socket demotion and memory access,
> >> >   e.g. cpuset.mems=0,2
> >> >
> >> > node distances:
> >> > node   0    1    2
> >> >    0  10   20   30
> >> >    1  20   10   40
> >> >    2  30   40   10
> >> >
> >> > /sys/devices/system/node/memory_tiers
> >> > 0-1
> >> > 2
> >> >
> >> > N_TOPTIER_MEMORY: 0-1
> >> >
> >> > node_demotion[]:
> >> >   0: [2], [2]
> >> >   1: [],  [2]
> >> >   2: [],  []
> >> >
> >> >
> >> > * Example 3:
> >> >   Node 0 & 1 are DRAM nodes.
> >> >   Node 2 is a PMEM node and has the same distance to node 0 & 1.
> >> >
> >> >   Node 0 has node 2 as the preferred and only demotion target.
> >> >
> >> >   Node 1 has node 2 as the preferred and only demotion target.
> >> >
> >> > node distances:
> >> > node   0    1    2
> >> >    0  10   20   30
> >> >    1  20   10   30
> >> >    2  30   30   10
> >> >
> >> > /sys/devices/system/node/memory_tiers
> >> > 0-1
> >> > 2
> >> >
> >> > N_TOPTIER_MEMORY: 0-1
> >> >
> >> > node_demotion[]:
> >> >   0: [2], [2]
> >> >   1: [2], [2]
> >> >   2: [],  []
> >> >
> >> >
> >> > * Example 4:
> >> >   Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
> >> >
> >> >   All nodes are top-tier.
> >> >
> >> > node distances:
> >> > node   0    1    2
> >> >    0  10   20   30
> >> >    1  20   10   30
> >> >    2  30   30   10
> >> >
> >> > /sys/devices/system/node/memory_tiers
> >> > 0-2
> >> >
> >> > N_TOPTIER_MEMORY: 0-2
> >> >
> >> > node_demotion[]:
> >> >   0: [],  []
> >> >   1: [],  []
> >> >   2: [],  []
> >> >
> >> >
> >> > * Example 5:
> >> >   Node 0 is a DRAM node with CPU.
> >> >   Node 1 is a HBM node.
> >> >   Node 2 is a PMEM node.
> >> >
> >> >   With userspace override, node 1 is the top tier and has node 0 as
> >> >   the preferred and only demotion target.
> >> >
> >> >   Node 0 is in the second tier, tier 1, and has node 2 as the
> >> >   preferred and only demotion target.
> >> >
> >> >   Node 2 is in the lowest tier, tier 2, and has no demotion targets.
> >> >
> >> > node distances:
> >> > node   0    1    2
> >> >    0  10   21   30
> >> >    1  21   10   40
> >> >    2  30   40   10
> >> >
> >> > /sys/devices/system/node/memory_tiers (userspace override)
> >> > 1
> >> > 0
> >> > 2
> >> >
> >> > N_TOPTIER_MEMORY: 1
> >> >
> >> > node_demotion[]:
> >> >   0: [2], [2]
> >> >   1: [0], [0]
> >> >   2: [],  []
> >> >
> >> > -- Wei



* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-05 14:24                 ` Dave Hansen
@ 2022-05-10  4:43                   ` Wei Xu
  0 siblings, 0 replies; 57+ messages in thread
From: Wei Xu @ 2022-05-10  4:43 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Alistair Popple, Davidlohr Bueso, Andrew Morton, Dave Hansen,
	Huang Ying, Dan Williams, Yang Shi, Linux MM, Greg Thelen,
	Aneesh Kumar K.V, Jagdish Gediya, Linux Kernel Mailing List,
	Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang,
	Jonathan Cameron

On Thu, May 5, 2022 at 7:24 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/4/22 23:35, Wei Xu wrote:
> > On Wed, May 4, 2022 at 10:02 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >> That means a lot of page table and EPT walks to map those linear
> >> addresses back to physical.  That adds to the inefficiency.
> >
> > That's true if the tracking is purely based on physical pages.  For
> > hot page tracking from PEBS, we can consider tracking in
> > virtual/linear addresses.  We don't need to maintain the history for
> > all linear page addresses nor for an indefinite amount of time.  After
> > all, we just need to identify recently and frequently accessed pages
> > and promote them.
>
> Except that you don't want to promote on *every* access.  That might
> lead to too much churn.

Certainly.  We should use the PMU events to help build the page
heatmap in software and select the hottest pages to promote
accordingly.

> You're also assuming that all accesses to a physical page are via a
> single linear address, which ignores shared memory mapped at different
> linear addresses.  Our (maybe wrong) assumption has been that shared
> memory is important enough to manage that it can't be ignored.

Shared memory is important.  Special handling will be needed to better
support such pages for linear-address-based hot page tracking.

> >> In the end, you get big PEBS buffers with lots of irrelevant data that
> >> needs significant post-processing to make sense of it.
> >
> > I am curious about what are "lots of irrelevant data" if PEBS data is
> > filtered on data sources (e.g. DRAM vs PMEM) by hardware.  If we need
> > to have different policies for the pages from the same data source,
> > then I agree that the software has to do a lot of filtering work.
>
> Perhaps "irrelevant" was a bad term to use.  I meant that you can't just
> take the PEBS data and act directly on it.  It has to be post-processed
> and you will see things in there like lots of adjacent accesses to a
> page.  Those additional accesses can be interesting but at some point
> you have all the weight you need to promote the page and the _rest_ are
> irrelevant.

That's right. The software has to do the post-processing work to build
the page heatmap with what the existing hardware can provide.
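
As a toy sketch of that post-processing step (all names, sizes, and
thresholds below are made up, not taken from any existing
implementation): fold the raw PEBS samples into per-page counts, cap
the per-page weight so that bursts of adjacent hits stop adding signal
once a page already carries enough weight, and pick promotion
candidates above a threshold.

#include <linux/hashtable.h>
#include <linux/slab.h>

#define HOT_WEIGHT_CAP		16	/* extra hits beyond this add nothing */
#define HOT_PROMOTE_THRESHOLD	8	/* weight needed to consider promotion */

struct page_heat {
	unsigned long pfn;
	unsigned int weight;
	struct hlist_node node;
};

static DEFINE_HASHTABLE(page_heatmap, 10);

/* Called for each decoded PEBS sample that hit a lower-tier node. */
static void record_hot_sample(unsigned long pfn)
{
	struct page_heat *ph;

	hash_for_each_possible(page_heatmap, ph, node, pfn) {
		if (ph->pfn == pfn) {
			if (ph->weight < HOT_WEIGHT_CAP)
				ph->weight++;
			return;
		}
	}

	ph = kzalloc(sizeof(*ph), GFP_ATOMIC);
	if (!ph)
		return;
	ph->pfn = pfn;
	ph->weight = 1;
	hash_add(page_heatmap, &ph->node, pfn);
}

/* A page whose capped weight crosses the threshold is a candidate. */
static bool should_promote(const struct page_heat *ph)
{
	return ph->weight >= HOT_PROMOTE_THRESHOLD;
}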

> >> The folks at Intel who tried this really struggled to take this mess
> >> and turn it into successful hot-page tracking.
> >>
> >> Maybe someone else will find a better way to do it, but we tried and
> >> gave up.
> >
> > It might be challenging to use PEBS as the only and universal hot page
> > tracking hardware mechanism. For example, there are challenges to use
> > PEBS to sample KVM guest accesses from the host.
>
> Yep, agreed.  This aspect of the hardware is very painful at the moment.
>
> > On the other hand, PEBS with hardware-based data source filtering can
> > be a useful mechanism to improve hot page tracking in conjunction
> > with other techniques.
>
> Rather than "can", I'd say: "might".  Backing up to what I said originally:
>
> > So, in practice, these events (PEBS) weren't very useful
> > for driving memory tiering.
>
> By "driving" I really meant solely driving.  Like, can PEBS be used as
> the one and only mechanism?  We couldn't make it work.  But, the
> hardware _is_ sitting there mostly unused.  It might be great to augment
> what is there, and nobody should be discouraged from looking at it again.

I think we are on the same page.



* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-10  4:32       ` Wei Xu
@ 2022-05-10  5:37         ` Alistair Popple
  2022-05-10 11:38           ` Aneesh Kumar K.V
  0 siblings, 1 reply; 57+ messages in thread
From: Alistair Popple @ 2022-05-10  5:37 UTC (permalink / raw)
  To: Wei Xu
  Cc: Yang Shi, Andrew Morton, Dave Hansen, Huang Ying, Dan Williams,
	Linux MM, Greg Thelen, Aneesh Kumar K.V, Jagdish Gediya,
	Linux Kernel Mailing List, Davidlohr Bueso, Michal Hocko,
	Baolin Wang, Brice Goglin, Feng Tang, Jonathan Cameron, Tim Chen


Wei Xu <weixugc@google.com> writes:

> On Thu, May 5, 2022 at 5:19 PM Alistair Popple <apopple@nvidia.com> wrote:
>>
>> Wei Xu <weixugc@google.com> writes:
>>
>> [...]
>>
>> >> >
>> >> >
>> >> > Tiering Hierarchy Initialization
>> >> > `=============================='
>> >> >
>> >> > By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
>> >> >
>> >> > A device driver can remove its memory nodes from the top tier, e.g.
>> >> > a dax driver can remove PMEM nodes from the top tier.
>> >>
>> >> With the topology built by firmware we should not need this.
>>
>> I agree that in an ideal world the hierarchy should be built by firmware based
>> on something like the HMAT. But I also think being able to override this will be
>> useful in getting there. Therefore a way of overriding the generated hierarchy
>> would be good, either via sysfs or kernel boot parameter if we don't want to
>> commit to a particular user interface now.
>>
>> However, I'm less sure letting device drivers override this is a good idea.
>> How, for example, would a GPU driver make sure its node is in the top tier?
>> By moving every node that the driver does not know about out of
>> N_TOPTIER_MEMORY? That could get messy if, say, there were two drivers both
>> of which wanted their node to be in the top tier.
>
> The suggestion is to allow a device driver to opt its memory devices
> out of the top tier, not the other way around.

So how would demotion work in the case of accelerators, then? In that
case we would want GPU memory to demote to DRAM, but that won't happen
if both DRAM and GPU memory are in N_TOPTIER_MEMORY. It also seems the
only override available with this proposal would move GPU memory into a
lower tier, which is the opposite of what's needed there.

>
> I agree that the kernel should still be responsible for the final
> node-tier assignment by taking into account all factors: the firmware
> tables, device driver requests, and user-overrides (kernel argument or
> sysfs).
>
>> > I agree. But before we have such a firmware, the kernel needs to do
>> > its best to initialize memory tiers.
>> >
>> > Given that we know PMEM is slower than DRAM, but a dax device might
>> > not be PMEM, a better place to set the tier for PMEM nodes can be the
>> > ACPI code, e.g. acpi_numa_memory_affinity_init() where we can examine
>> > the ACPI_SRAT_MEM_NON_VOLATILE bit.
>> >
>> >> >
>> >> > The kernel builds the memory tiering hierarchy and per-node demotion
>> >> > order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
>> >> > best distance nodes in the next lower tier are assigned to
>> >> > node_demotion[N].preferred and all the nodes in the next lower tier
>> >> > are assigned to node_demotion[N].allowed.
>> >>
>> >> I'm not sure whether it should be allowed to demote to multiple lower
>> >> tiers. But it is totally fine to *NOT* allow it at the moment. Once we
>> >> figure out a good way to define demotion targets, it could be extended
>> >> to support this easily.
>> >
>> > You mean to only support MAX_TIERS=2 for now.  I am fine with that.
>> > There can be systems with 3 tiers, e.g. GPU -> DRAM -> PMEM, but it is
>> > not clear yet whether we want to enable transparent memory tiering to
>> > all the 3 tiers on such systems.
>>
>> At some point I think we will need to deal with 3 tiers but I'd be ok with
>> limiting it to 2 for now if it makes things simpler.
>>
>> - Alistair
>>
>> >> >
>> >> > node_demotion[N].preferred can be empty if no preferred demotion node
>> >> > is available for node N.
>> >> >
>> >> > If the userspace overrides the tiers via the memory_tiers sysfs
>> >> > interface, the kernel then only rebuilds the per-node demotion order
>> >> > accordingly.
>> >> >
>> >> > Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
>> >> > memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
>> >> > node.
>> >> >
>> >> >
>> >> > Memory Allocation for Demotion
>> >> > `============================'
>> >> >
>> >> > When allocating a new demotion target page, both a preferred node
>> >> > and the allowed nodemask are provided to the allocation function.
>> >> > The default kernel allocation fallback order is used to allocate the
>> >> > page from the specified node and nodemask.
>> >> >
>> >> > The mempolicy of cpuset, vma and owner task of the source page can
>> >> > be set to refine the demotion nodemask, e.g. to prevent demotion or
>> >> > select a particular allowed node as the demotion target.
>> >> >
>> >> >
>> >> > Examples
>> >> > `======'
>> >> >
>> >> > * Example 1:
>> >> >   Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
>> >> >
>> >> >   Node 0 has node 2 as the preferred demotion target and can also
>> >> >   fallback demotion to node 3.
>> >> >
>> >> >   Node 1 has node 3 as the preferred demotion target and can also
>> >> >   fallback demotion to node 2.
>> >> >
>> >> >   Set mempolicy to prevent cross-socket demotion and memory access,
>> >> >   e.g. cpuset.mems=0,2
>> >> >
>> >> > node distances:
>> >> > node   0    1    2    3
>> >> >    0  10   20   30   40
>> >> >    1  20   10   40   30
>> >> >    2  30   40   10   40
>> >> >    3  40   30   40   10
>> >> >
>> >> > /sys/devices/system/node/memory_tiers
>> >> > 0-1
>> >> > 2-3
>> >> >
>> >> > N_TOPTIER_MEMORY: 0-1
>> >> >
>> >> > node_demotion[]:
>> >> >   0: [2], [2-3]
>> >> >   1: [3], [2-3]
>> >> >   2: [],  []
>> >> >   3: [],  []
>> >> >
>> >> > * Example 2:
>> >> >   Node 0 & 1 are DRAM nodes.
>> >> >   Node 2 is a PMEM node and closer to node 0.
>> >> >
>> >> >   Node 0 has node 2 as the preferred and only demotion target.
>> >> >
>> >> >   Node 1 has no preferred demotion target, but can still demote
>> >> >   to node 2.
>> >> >
>> >> >   Set mempolicy to prevent cross-socket demotion and memory access,
>> >> >   e.g. cpuset.mems=0,2
>> >> >
>> >> > node distances:
>> >> > node   0    1    2
>> >> >    0  10   20   30
>> >> >    1  20   10   40
>> >> >    2  30   40   10
>> >> >
>> >> > /sys/devices/system/node/memory_tiers
>> >> > 0-1
>> >> > 2
>> >> >
>> >> > N_TOPTIER_MEMORY: 0-1
>> >> >
>> >> > node_demotion[]:
>> >> >   0: [2], [2]
>> >> >   1: [],  [2]
>> >> >   2: [],  []
>> >> >
>> >> >
>> >> > * Example 3:
>> >> >   Node 0 & 1 are DRAM nodes.
>> >> >   Node 2 is a PMEM node and has the same distance to node 0 & 1.
>> >> >
>> >> >   Node 0 has node 2 as the preferred and only demotion target.
>> >> >
>> >> >   Node 1 has node 2 as the preferred and only demotion target.
>> >> >
>> >> > node distances:
>> >> > node   0    1    2
>> >> >    0  10   20   30
>> >> >    1  20   10   30
>> >> >    2  30   30   10
>> >> >
>> >> > /sys/devices/system/node/memory_tiers
>> >> > 0-1
>> >> > 2
>> >> >
>> >> > N_TOPTIER_MEMORY: 0-1
>> >> >
>> >> > node_demotion[]:
>> >> >   0: [2], [2]
>> >> >   1: [2], [2]
>> >> >   2: [],  []
>> >> >
>> >> >
>> >> > * Example 4:
>> >> >   Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
>> >> >
>> >> >   All nodes are top-tier.
>> >> >
>> >> > node distances:
>> >> > node   0    1    2
>> >> >    0  10   20   30
>> >> >    1  20   10   30
>> >> >    2  30   30   10
>> >> >
>> >> > /sys/devices/system/node/memory_tiers
>> >> > 0-2
>> >> >
>> >> > N_TOPTIER_MEMORY: 0-2
>> >> >
>> >> > node_demotion[]:
>> >> >   0: [],  []
>> >> >   1: [],  []
>> >> >   2: [],  []
>> >> >
>> >> >
>> >> > * Example 5:
>> >> >   Node 0 is a DRAM node with CPU.
>> >> >   Node 1 is a HBM node.
>> >> >   Node 2 is a PMEM node.
>> >> >
>> >> >   With userspace override, node 1 is the top tier and has node 0 as
>> >> >   the preferred and only demotion target.
>> >> >
>> >> >   Node 0 is in the second tier, tier 1, and has node 2 as the
>> >> >   preferred and only demotion target.
>> >> >
>> >> >   Node 2 is in the lowest tier, tier 2, and has no demotion targets.
>> >> >
>> >> > node distances:
>> >> > node   0    1    2
>> >> >    0  10   21   30
>> >> >    1  21   10   40
>> >> >    2  30   40   10
>> >> >
>> >> > /sys/devices/system/node/memory_tiers (userspace override)
>> >> > 1
>> >> > 0
>> >> > 2
>> >> >
>> >> > N_TOPTIER_MEMORY: 1
>> >> >
>> >> > node_demotion[]:
>> >> >   0: [2], [2]
>> >> >   1: [0], [0]
>> >> >   2: [],  []
>> >> >
>> >> > -- Wei



* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-10  3:24         ` Yang Shi
@ 2022-05-10  9:59           ` Hesham Almatary
  2022-05-10 12:10             ` Aneesh Kumar K V
  0 siblings, 1 reply; 57+ messages in thread
From: Hesham Almatary @ 2022-05-10  9:59 UTC (permalink / raw)
  To: Yang Shi
  Cc: Andrew Morton, Dave Hansen, Huang Ying, Dan Williams, Linux MM,
	Greg Thelen, Aneesh Kumar K.V, Jagdish Gediya,
	Linux Kernel Mailing List, Alistair Popple, Davidlohr Bueso,
	Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang, Tim Chen,
	Wei Xu

Hello Yang,

On 5/10/2022 4:24 AM, Yang Shi wrote:
> On Mon, May 9, 2022 at 7:32 AM Hesham Almatary
> <hesham.almatary@huawei.com> wrote:
>> Hello Yang,
>>
>> On 5/6/2022 7:56 PM, Yang Shi wrote:
>>> On Fri, Apr 29, 2022 at 11:37 PM Wei Xu <weixugc@google.com> wrote:
>>>> On Fri, Apr 29, 2022 at 8:59 PM Yang Shi <shy828301@gmail.com> wrote:
>>>>> Hi Wei,
>>>>>
>>>>> Thanks for the nice writing. Please see the below inline comments.
>>>> Thanks for the quick response and comments.
>>>>
>>>>> On Fri, Apr 29, 2022 at 7:10 PM Wei Xu <weixugc@google.com> wrote:
>>>>>> The current kernel has the basic memory tiering support: Inactive
>>>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
>>>>>> tier NUMA node to make room for new allocations on the higher tier
>>>>>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
>>>>>> migrated (promoted) to a higher tier NUMA node to improve the
>>>>>> performance.
>>>>>>
>>>>>> A tiering relationship between NUMA nodes in the form of demotion path
>>>>>> is created during the kernel initialization and updated when a NUMA
>>>>>> node is hot-added or hot-removed.  The current implementation puts all
>>>>>> nodes with CPU into the top tier, and then builds the tiering hierarchy
>>>>>> tier-by-tier by establishing the per-node demotion targets based on
>>>>>> the distances between nodes.
>>>>>>
>>>>>> The current memory tiering interface needs to be improved to address
>>>>>> several important use cases:
>>>>>>
>>>>>> * The current tiering initialization code always initializes
>>>>>>     each memory-only NUMA node into a lower tier.  But a memory-only
>>>>>>     NUMA node may have a high performance memory device (e.g. a DRAM
>>>>>>     device attached via CXL.mem or a DRAM-backed memory-only node on
>>>>>>     a virtual machine) and should be put into the top tier.
>>>>>>
>>>>>> * The current tiering hierarchy always puts CPU nodes into the top
>>>>>>     tier. But on a system with HBM (e.g. GPU memory) devices, these
>>>>>>     memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>>>>>>     with CPUs are better to be placed into the next lower tier.
>>>>>>
>>>>>> * Also because the current tiering hierarchy always puts CPU nodes
>>>>>>     into the top tier, when a CPU is hot-added (or hot-removed) and
>>>>>>     triggers a memory node from CPU-less into a CPU node (or vice
>>>>>>     versa), the memory tiering hierarchy gets changed, even though no
>>>>>>     memory node is added or removed.  This can make the tiering
>>>>>>     hierarchy much less stable.
>>>>> I'd prefer the firmware builds up tiers topology then passes it to
>>>>> kernel so that kernel knows what nodes are in what tiers. No matter
>>>>> what nodes are hot-removed/hot-added they always stay in their tiers
>>>>> defined by the firmware. I think this is important information like
>>>>> numa distances. NUMA distance alone can't satisfy all the usecases
>>>>> IMHO.
>>>> I agree that the firmware needs to play a bigger role in tier
>>>> topology, though it is not clear to me yet that we should require the
>>>> tier topology be fully defined by the firmware.  If yes, a standard
>>>> needs to be established. Alternatively, with additional hardware
>>>> information provided by the firmware (e.g. HMAT), the kernel can be in
>>>> a much better position to initialize the proper tier topology by
>>>> itself.
>>>>
>>>> It is important to keep tier topology stable, especially if we want to
>>>> account and limit memory usage based on tiers.  So I agree that the
>>>> nodes should not change their tiers no matter what nodes are
>>>> hot-added/hot-removed.
>>>>
>>>> Given that the additional tier-related information is not yet
>>>> available from the firmware and NUMA distance alone is not sufficient
>>>> for all the tiering use cases, and also that we want to keep tier
>>>> topology stable after the kernel boots, I suggest that we add a kernel
>>>> boot parameter to override the default tier topology (which nodes are
>>>> in which tiers). An example is: tier=2:0-1;2-3, which defines two
>>>> tiers: tier 0 includes node 0 & 1, and tier 1 includes node 2 & 3.
>>>>
>>>>>> * A higher tier node can only be demoted to selected nodes on the
>>>>>>     next lower tier, not any other node from the next lower tier.  This
>>>>>>     strict, hard-coded demotion order does not work in all use cases
>>>>>>     (e.g. some use cases may want to allow cross-socket demotion to
>>>>>>     another node in the same demotion tier as a fallback when the
>>>>>>     preferred demotion node is out of space), and has resulted in the
>>>>>>     feature request for an interface to override the system-wide,
>>>>>>     per-node demotion order from the userspace.
>>>>>>
>>>>>> * There are no interfaces for the userspace to learn about the memory
>>>>>>     tiering hierarchy in order to optimize its memory allocations.
>>>>>>
>>>>>> I'd like to propose revised memory tiering kernel interfaces based on
>>>>>> the discussions in the threads:
>>>>>>
>>>>>> - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
>>>>>> - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
>>>>>>
>>>>>>
>>>>>> Sysfs Interfaces
>>>>>> ================
>>>>>>
>>>>>> * /sys/devices/system/node/memory_tiers
>>>>>>
>>>>>>     Format: node list (one tier per line, in the tier order)
>>>>>>
>>>>>>     When read, list memory nodes by tiers.
>>>>>>
>>>>>>     When written (one tier per line), take the user-provided node-tier
>>>>>>     assignment as the new tiering hierarchy and rebuild the per-node
>>>>>>     demotion order.  It is allowed to only override the top tiers, in
>>>>>>     which cases, the kernel will establish the lower tiers automatically.
>>>>> TBH I still think it is too soon to define proper user visible
>>>>> interfaces for now, particularly for override.
>>>> I agree, but there are also needs to make use of tiering even as it
>>>> evolves.  This is why only a minimal sysfs interface is proposed.  We
>>>> can make it read-only and resort to a kernel boot parameter to
>>>> override tiers.
>>>>
>>>>>> Kernel Representation
>>>>>> =====================
>>>>>>
>>>>>> * nodemask_t node_states[N_TOPTIER_MEMORY]
>>>>>>
>>>>>>     Store all top-tier memory nodes.
>>>>>>
>>>>>> * nodemask_t memory_tiers[MAX_TIERS]
>>>>>>
>>>>>>     Store memory nodes by tiers.
>>>>> I'd prefer nodemask_t node_states[MAX_TIERS][]. Tier 0 is always the
>>>>> top tier. The kernel could build this with the topology built by
>>>>> firmware.
>>>> node_states[N_TOPTIER_MEMORY] is for convenience and can be removed.
>>>>
>>>> node_states is already an existing kernel array (defined as nodemask_t
>>>> node_states[NR_NODE_STATES]).  We need an array for memory tiers, too,
>>>> which is why a new array, memory_tiers, is proposed.
>>>>
>>>> Are you proposing that we make node_states a 2-dimensional array?
>>>> That can duplicate the information already in node_states, which is
>>>> not ideal.
>>> Sorry for the late reply.
>>>
>>> Yes, 2-dimensional array. With it we could know what nodes in what tiers.
>>>
>>>>>> * struct demotion_nodes node_demotion[]
>>>>>>
>>>>>>     where: struct demotion_nodes { nodemask_t preferred; nodemask_t allowed; }
>>>>>>
>>>>>>     For a node N:
>>>>>>
>>>>>>     node_demotion[N].preferred lists all preferred demotion targets;
>>>>>>
>>>>>>     node_demotion[N].allowed lists all allowed demotion targets
>>>>>>     (initialized to be all the nodes in the same demotion tier).
>>>>> It seems unnecessary to define preferred and allowed IMHO. Why not
>>>>> just use something like the allocation fallback list? The first node
>>>>> in the list is the preferred one. When allocating memory for demotion,
>>>>> convert the list to a nodemask, then call __alloc_pages(gfp, order,
>>>>> first_node, nodemask). So the allocation could fallback to the allowed
>>>>> nodes automatically.
>>>> The nodemask "preferred" is an attempt to preserve a current feature
>>>> in node_demotion[]: load balancing among multiple equally-close target
>>>> nodes via random selection.  We can remove it to keep things simple.
>>>>
>>>> The idea of defining "preferred" and "allowed" is exactly to use
>>>> __alloc_pages(gfp, order, preferred_node, allowed_nodemask).  Given
>>>> that the page allocator already computes the allocation fallback list,
>>>> it should be unnecessary to maintain an ordered demotion node list for
>>>> each node and convert such a list to a nodemask for demotion
>>>> allocation.  This is why allowed is stored as a nodemask.
>>> Yeah, it doesn't have to be ordered.
>>>
>>>> When demoting a page from node N, I think we can just call
>>>> __alloc_pages(gfp, order, N, memory_tiers[node_to_tier(N) + 1]).  If
>>>> so, we can remove node_demotion[] entirely and add a tier field to
>>>> NODE_DATA for node_to_tier().
>>>>
>>>>>> Tiering Hierarchy Initialization
>>>>>> ================================
>>>>>>
>>>>>> By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
>>>>>>
>>>>>> A device driver can remove its memory nodes from the top tier, e.g.
>>>>>> a dax driver can remove PMEM nodes from the top tier.
>>>>> With the topology built by firmware we should not need this.
>>>> I agree. But before we have such a firmware, the kernel needs to do
>>>> its best to initialize memory tiers.
>>>>
>>>> Given that we know PMEM is slower than DRAM, but a dax device might
>>>> not be PMEM, a better place to set the tier for PMEM nodes can be the
>>>> ACPI code, e.g. acpi_numa_memory_affinity_init() where we can examine
>>>> the ACPI_SRAT_MEM_NON_VOLATILE bit.
>>> This is why I hope firmware could chime in, for example, we may have a
>>> new field, called "Tier", in HMAT. Then kernel just reads the field
>>> and put the node into proper tier. But of course override by kernel
>>> could be allowed.
>>>
>>>>>> The kernel builds the memory tiering hierarchy and per-node demotion
>>>>>> order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
>>>>>> best distance nodes in the next lower tier are assigned to
>>>>>> node_demotion[N].preferred and all the nodes in the next lower tier
>>>>>> are assigned to node_demotion[N].allowed.
>>>>> I'm not sure whether it should be allowed to demote to multiple lower
>>>>> tiers. But it is totally fine to *NOT* allow it at the moment. Once we
>>>>> figure out a good way to define demotion targets, it could be extended
>>>>> to support this easily.
>>>> You mean to only support MAX_TIERS=2 for now.  I am fine with that.
>>>> There can be systems with 3 tiers, e.g. GPU -> DRAM -> PMEM, but it is
>>>> not clear yet whether we want to enable transparent memory tiering to
>>>> all the 3 tiers on such systems.
>>> Just start from something simple. And we should fully utilize the
>>> nearest lower tier before demoting to even lower tiers.
>> There might still be simple cases/topologies where we might want to "skip"
>> the very next lower tier. For example, assume we have a 3-tiered memory
>> system as follows:
>>
>> node 0 has a CPU and DDR memory in tier 0, node 1 has GPU and DDR memory
>> in tier 0, node 2 has NVMM memory in tier 1, and node 3 has some sort of
>> bigger memory (could be a bigger DDR or something) in tier 2. The
>> distances are as follows:
>>
>> --------------          --------------
>> |   Node 0   |          |   Node 1   |
>> |  -------   |          |  -------   |
>> | |  DDR  |  |          | |  DDR  |  |
>> |  -------   |          |  -------   |
>> |            |          |            |
>> --------------          --------------
>>          | 20               | 120    |
>>          v                  v        |
>> ----------------------------       |
>> | Node 2     PMEM          |       | 100
>> ----------------------------       |
>>          | 100                       |
>>          v                           v
>> --------------------------------------
>> | Node 3    Large mem                |
>> --------------------------------------
>>
>> node distances:
>> node   0    1    2    3
>>      0  10   20   20  120
>>      1  20   10  120  100
>>      2  20  120   10  100
>>      3  120 100  100   10
>>
>> /sys/devices/system/node/memory_tiers
>> 0-1
>> 2
>> 3
>>
>> N_TOPTIER_MEMORY: 0-1
>>
>>
>> In this case, we want to be able to "skip" the demotion path from Node 1
>> to Node 2,
>>
>> and make demotion go directely to Node 3 as it is closer, distance wise.
>> How can
>>
>> we accommodate this scenario (or at least not rule it out as future
>> work) with the
>>
>> current RFC?
> If I remember correctly, NUMA distance is hardcoded in SLIT by the
> firmware, and it is supposed to reflect the latency. So I suppose it is
> the firmware's responsibility to provide correct information. And the
> RFC assumes higher tier memory has better performance than lower tier
> memory (latency, bandwidth, throughput, etc.), so it sounds like buggy
> firmware to have lower tier memory with a shorter distance than higher
> tier memory IMHO.

You are correct if you're assuming the topology is all hierarchically
symmetric, but unfortunately, in real hardware (e.g., my example above)
it is not. The distance/latency between two nodes in the same tier and
a third node is different. The firmware still provides the correct
latency, but putting a node in a tier is up to the kernel/user, and is
relative: e.g., Node 3 could belong to tier 1 from Node 1's
perspective, but to tier 2 from Node 0's.


A more detailed example (building on my previous one) is when the GPU
is connected to a switch:

----------------------------
| Node 2     PMEM          |
----------------------------
       ^
       |
--------------          --------------
|   Node 0   |          |   Node 1   |
|  -------   |          |  -------   |
| |  DDR  |  |          | |  DDR  |  |
|  -------   |          |  -------   |
|    CPU     |          |    GPU     |
--------------          --------------
        |                  |
        v                  v
----------------------------
|         Switch           |
----------------------------
        |
        v
--------------------------------------
| Node 3    Large mem                |
--------------------------------------

Here, demoting from Node 1 to Node 3 directly would be faster as it
only has to go through one hub, compared to demoting from Node 1 to
Node 2, where it goes through two hubs. I hope that example clarifies
things a little bit.

> Anyway, I'm not an expert on hardware or firmware; I just wish the
> hardware and firmware would make our life easier :-)
>
>>>>>> node_demotion[N].preferred can be empty if no preferred demotion node
>>>>>> is available for node N.
>>>>>>
>>>>>> If the userspace overrides the tiers via the memory_tiers sysfs
>>>>>> interface, the kernel then only rebuilds the per-node demotion order
>>>>>> accordingly.
>>>>>>
>>>>>> Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
>>>>>> memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
>>>>>> node.
>>>>>>
>>>>>>
>>>>>> Memory Allocation for Demotion
>>>>>> ==============================
>>>>>>
>>>>>> When allocating a new demotion target page, both a preferred node
>>>>>> and the allowed nodemask are provided to the allocation function.
>>>>>> The default kernel allocation fallback order is used to allocate the
>>>>>> page from the specified node and nodemask.
>>>>>>
>>>>>> The mempolicy of cpuset, vma and owner task of the source page can
>>>>>> be set to refine the demotion nodemask, e.g. to prevent demotion or
>>>>>> select a particular allowed node as the demotion target.
>>>>>>
>>>>>>
>>>>>> Examples
>>>>>> ========
>>>>>>
>>>>>> * Example 1:
>>>>>>     Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
>>>>>>
>>>>>>     Node 0 has node 2 as the preferred demotion target and can also
>>>>>>     fallback demotion to node 3.
>>>>>>
>>>>>>     Node 1 has node 3 as the preferred demotion target and can also
>>>>>>     fallback demotion to node 2.
>>>>>>
>>>>>>     Set mempolicy to prevent cross-socket demotion and memory access,
>>>>>>     e.g. cpuset.mems=0,2
>>>>>>
>>>>>> node distances:
>>>>>> node   0    1    2    3
>>>>>>      0  10   20   30   40
>>>>>>      1  20   10   40   30
>>>>>>      2  30   40   10   40
>>>>>>      3  40   30   40   10
>>>>>>
>>>>>> /sys/devices/system/node/memory_tiers
>>>>>> 0-1
>>>>>> 2-3
>>>>>>
>>>>>> N_TOPTIER_MEMORY: 0-1
>>>>>>
>>>>>> node_demotion[]:
>>>>>>     0: [2], [2-3]
>>>>>>     1: [3], [2-3]
>>>>>>     2: [],  []
>>>>>>     3: [],  []
>>>>>>
>>>>>> * Example 2:
>>>>>>     Node 0 & 1 are DRAM nodes.
>>>>>>     Node 2 is a PMEM node and closer to node 0.
>>>>>>
>>>>>>     Node 0 has node 2 as the preferred and only demotion target.
>>>>>>
>>>>>>     Node 1 has no preferred demotion target, but can still demote
>>>>>>     to node 2.
>>>>>>
>>>>>>     Set mempolicy to prevent cross-socket demotion and memory access,
>>>>>>     e.g. cpuset.mems=0,2
>>>>>>
>>>>>> node distances:
>>>>>> node   0    1    2
>>>>>>      0  10   20   30
>>>>>>      1  20   10   40
>>>>>>      2  30   40   10
>>>>>>
>>>>>> /sys/devices/system/node/memory_tiers
>>>>>> 0-1
>>>>>> 2
>>>>>>
>>>>>> N_TOPTIER_MEMORY: 0-1
>>>>>>
>>>>>> node_demotion[]:
>>>>>>     0: [2], [2]
>>>>>>     1: [],  [2]
>>>>>>     2: [],  []
>>>>>>
>>>>>>
>>>>>> * Example 3:
>>>>>>     Node 0 & 1 are DRAM nodes.
>>>>>>     Node 2 is a PMEM node and has the same distance to node 0 & 1.
>>>>>>
>>>>>>     Node 0 has node 2 as the preferred and only demotion target.
>>>>>>
>>>>>>     Node 1 has node 2 as the preferred and only demotion target.
>>>>>>
>>>>>> node distances:
>>>>>> node   0    1    2
>>>>>>      0  10   20   30
>>>>>>      1  20   10   30
>>>>>>      2  30   30   10
>>>>>>
>>>>>> /sys/devices/system/node/memory_tiers
>>>>>> 0-1
>>>>>> 2
>>>>>>
>>>>>> N_TOPTIER_MEMORY: 0-1
>>>>>>
>>>>>> node_demotion[]:
>>>>>>     0: [2], [2]
>>>>>>     1: [2], [2]
>>>>>>     2: [],  []
>>>>>>
>>>>>>
>>>>>> * Example 4:
>>>>>>     Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
>>>>>>
>>>>>>     All nodes are top-tier.
>>>>>>
>>>>>> node distances:
>>>>>> node   0    1    2
>>>>>>      0  10   20   30
>>>>>>      1  20   10   30
>>>>>>      2  30   30   10
>>>>>>
>>>>>> /sys/devices/system/node/memory_tiers
>>>>>> 0-2
>>>>>>
>>>>>> N_TOPTIER_MEMORY: 0-2
>>>>>>
>>>>>> node_demotion[]:
>>>>>>     0: [],  []
>>>>>>     1: [],  []
>>>>>>     2: [],  []
>>>>>>
>>>>>>
>>>>>> * Example 5:
>>>>>>     Node 0 is a DRAM node with CPU.
>>>>>>     Node 1 is a HBM node.
>>>>>>     Node 2 is a PMEM node.
>>>>>>
>>>>>>     With userspace override, node 1 is the top tier and has node 0 as
>>>>>>     the preferred and only demotion target.
>>>>>>
>>>>>>     Node 0 is in the second tier, tier 1, and has node 2 as the
>>>>>>     preferred and only demotion target.
>>>>>>
>>>>>>     Node 2 is in the lowest tier, tier 2, and has no demotion targets.
>>>>>>
>>>>>> node distances:
>>>>>> node   0    1    2
>>>>>>      0  10   21   30
>>>>>>      1  21   10   40
>>>>>>      2  30   40   10
>>>>>>
>>>>>> /sys/devices/system/node/memory_tiers (userspace override)
>>>>>> 1
>>>>>> 0
>>>>>> 2
>>>>>>
>>>>>> N_TOPTIER_MEMORY: 1
>>>>>>
>>>>>> node_demotion[]:
>>>>>>     0: [2], [2]
>>>>>>     1: [0], [0]
>>>>>>     2: [],  []
>>>>>>
>>>>>> -- Wei
>> -- Hesham



* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-10  4:22         ` Wei Xu
@ 2022-05-10 10:01           ` Hesham Almatary
  2022-05-10 11:44           ` Aneesh Kumar K.V
  1 sibling, 0 replies; 57+ messages in thread
From: Hesham Almatary @ 2022-05-10 10:01 UTC (permalink / raw)
  To: Wei Xu
  Cc: Yang Shi, Andrew Morton, Dave Hansen, Huang Ying, Dan Williams,
	Linux MM, Greg Thelen, Aneesh Kumar K.V, Jagdish Gediya,
	Linux Kernel Mailing List, Alistair Popple, Davidlohr Bueso,
	Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang, Tim Chen

Hello Wei,

On 5/10/2022 5:22 AM, Wei Xu wrote:
> On Mon, May 9, 2022 at 7:32 AM Hesham Almatary
> <hesham.almatary@huawei.com> wrote:
>> Hello Yang,
>>
>> On 5/6/2022 7:56 PM, Yang Shi wrote:
>>> On Fri, Apr 29, 2022 at 11:37 PM Wei Xu <weixugc@google.com> wrote:
>>>> On Fri, Apr 29, 2022 at 8:59 PM Yang Shi <shy828301@gmail.com> wrote:
>>>>> Hi Wei,
>>>>>
>>>>> Thanks for the nice writing. Please see the below inline comments.
>>>> Thanks for the quick response and comments.
>>>>
>>>>> On Fri, Apr 29, 2022 at 7:10 PM Wei Xu <weixugc@google.com> wrote:
>>>>>> The current kernel has the basic memory tiering support: Inactive
>>>>>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
>>>>>> tier NUMA node to make room for new allocations on the higher tier
>>>>>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
>>>>>> migrated (promoted) to a higher tier NUMA node to improve the
>>>>>> performance.
>>>>>>
>>>>>> A tiering relationship between NUMA nodes in the form of demotion path
>>>>>> is created during the kernel initialization and updated when a NUMA
>>>>>> node is hot-added or hot-removed.  The current implementation puts all
>>>>>> nodes with CPU into the top tier, and then builds the tiering hierarchy
>>>>>> tier-by-tier by establishing the per-node demotion targets based on
>>>>>> the distances between nodes.
>>>>>>
>>>>>> The current memory tiering interface needs to be improved to address
>>>>>> several important use cases:
>>>>>>
>>>>>> * The current tiering initialization code always initializes
>>>>>>     each memory-only NUMA node into a lower tier.  But a memory-only
>>>>>>     NUMA node may have a high performance memory device (e.g. a DRAM
>>>>>>     device attached via CXL.mem or a DRAM-backed memory-only node on
>>>>>>     a virtual machine) and should be put into the top tier.
>>>>>>
>>>>>> * The current tiering hierarchy always puts CPU nodes into the top
>>>>>>     tier. But on a system with HBM (e.g. GPU memory) devices, these
>>>>>>     memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>>>>>>     with CPUs are better to be placed into the next lower tier.
>>>>>>
>>>>>> * Also because the current tiering hierarchy always puts CPU nodes
>>>>>>     into the top tier, when a CPU is hot-added (or hot-removed) and
>>>>>>     triggers a memory node from CPU-less into a CPU node (or vice
>>>>>>     versa), the memory tiering hierarchy gets changed, even though no
>>>>>>     memory node is added or removed.  This can make the tiering
>>>>>>     hierarchy much less stable.
>>>>> I'd prefer the firmware builds up tiers topology then passes it to
>>>>> kernel so that kernel knows what nodes are in what tiers. No matter
>>>>> what nodes are hot-removed/hot-added they always stay in their tiers
>>>>> defined by the firmware. I think this is important information like
>>>>> numa distances. NUMA distance alone can't satisfy all the usecases
>>>>> IMHO.
>>>> I agree that the firmware needs to play a bigger role in tier
>>>> topology, though it is not clear to me yet that we should require the
>>>> tier topology be fully defined by the firmware.  If yes, a standard
>>>> needs to be established. Alternatively, with additional hardware
>>>> information provided by the firmware (e.g. HMAT), the kernel can be in
>>>> a much better position to initialize the proper tier topology by
>>>> itself.
>>>>
>>>> It is important to keep tier topology stable, especially if we want to
>>>> account and limit memory usage based on tiers.  So I agree that the
>>>> nodes should not change their tiers no matter what nodes are
>>>> hot-added/hot-removed.
>>>>
>>>> Given that the additional tier-related information is not yet
>>>> available from the firmware and NUMA distance alone is not sufficient
>>>> for all the tiering use cases, and also that we want to keep tier
>>>> topology stable after the kernel boots, I suggest that we add a kernel
>>>> boot parameter to override the default tier topology (which nodes are
>>>> in which tiers). An example is: tier=2:0-1;2-3, which defines two
>>>> tiers: tier 0 includes node 0 & 1, and tier 1 includes node 2 & 3.
>>>>
>>>>>> * A higher tier node can only be demoted to selected nodes on the
>>>>>>     next lower tier, not any other node from the next lower tier.  This
>>>>>>     strict, hard-coded demotion order does not work in all use cases
>>>>>>     (e.g. some use cases may want to allow cross-socket demotion to
>>>>>>     another node in the same demotion tier as a fallback when the
>>>>>>     preferred demotion node is out of space), and has resulted in the
>>>>>>     feature request for an interface to override the system-wide,
>>>>>>     per-node demotion order from the userspace.
>>>>>>
>>>>>> * There are no interfaces for the userspace to learn about the memory
>>>>>>     tiering hierarchy in order to optimize its memory allocations.
>>>>>>
>>>>>> I'd like to propose revised memory tiering kernel interfaces based on
>>>>>> the discussions in the threads:
>>>>>>
>>>>>> - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
>>>>>> - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
>>>>>>
>>>>>>
>>>>>> Sysfs Interfaces
>>>>>> ================
>>>>>>
>>>>>> * /sys/devices/system/node/memory_tiers
>>>>>>
>>>>>>     Format: node list (one tier per line, in the tier order)
>>>>>>
>>>>>>     When read, list memory nodes by tiers.
>>>>>>
>>>>>>     When written (one tier per line), take the user-provided node-tier
>>>>>>     assignment as the new tiering hierarchy and rebuild the per-node
>>>>>>     demotion order.  It is allowed to only override the top tiers, in
>>>>>>     which cases, the kernel will establish the lower tiers automatically.
>>>>> TBH I still think it is too soon to define proper user visible
>>>>> interfaces for now, particularly for override.
>>>> I agree, but there are also needs to make use of tiering even as it
>>>> evolves.  This is why only a minimal sysfs interface is proposed.  We
>>>> can make it read-only and resort to a kernel boot parameter to
>>>> override tiers.
>>>>
>>>>>> Kernel Representation
>>>>>> =====================
>>>>>>
>>>>>> * nodemask_t node_states[N_TOPTIER_MEMORY]
>>>>>>
>>>>>>     Store all top-tier memory nodes.
>>>>>>
>>>>>> * nodemask_t memory_tiers[MAX_TIERS]
>>>>>>
>>>>>>     Store memory nodes by tiers.
>>>>> I'd prefer nodemask_t node_states[MAX_TIERS][]. Tier 0 is always the
>>>>> top tier. The kernel could build this with the topology built by
>>>>> firmware.
>>>> node_states[N_TOPTIER_MEMORY] is for convenience and can be removed.
>>>>
>>>> node_states is already an existing kernel array (defined as nodemask_t
>>>> node_states[NR_NODE_STATES]).  We need an array for memory tiers, too,
>>>> which is why a new array, memory_tiers, is proposed.
>>>>
>>>> Are you proposing that we make node_states a 2-dimensional array?
>>>> That can duplicate the information already in node_states, which is
>>>> not ideal.
>>> Sorry for the late reply.
>>>
>>> Yes, 2-dimensional array. With it we could know what nodes in what tiers.
>>>
>>>>>> * struct demotion_nodes node_demotion[]
>>>>>>
>>>>>>     where: struct demotion_nodes { nodemask_t preferred; nodemask_t allowed; }
>>>>>>
>>>>>>     For a node N:
>>>>>>
>>>>>>     node_demotion[N].preferred lists all preferred demotion targets;
>>>>>>
>>>>>>     node_demotion[N].allowed lists all allowed demotion targets
>>>>>>     (initialized to be all the nodes in the same demotion tier).
>>>>> It seems unnecessary to define preferred and allowed IMHO. Why not
>>>>> just use something like the allocation fallback list? The first node
>>>>> in the list is the preferred one. When allocating memory for demotion,
>>>>> convert the list to a nodemask, then call __alloc_pages(gfp, order,
>>>>> first_node, nodemask). So the allocation could fallback to the allowed
>>>>> nodes automatically.
>>>> The nodemask "preferred" is an attempt to preserve a current feature
>>>> in node_demotion[]: load balancing among multiple equally-close target
>>>> nodes via random selection.  We can remove it to keep things simple.
>>>>
>>>> The idea of defining "preferred" and "allowed" is exactly to use
>>>> __alloc_pages(gfp, order, preferred_node, allowed_nodemask).  Given
>>>> that the page allocator already computes the allocation fallback list,
>>>> it should be unnecessary to maintain an ordered demotion node list for
>>>> each node and convert such a list to a nodemask for demotion
>>>> allocation.  This is why allowed is stored as a nodemask.
>>> Yeah, it doesn't have to be ordered.
>>>
>>>> When demoting a page from node N, I think we can just call
>>>> __alloc_pages(gfp, order, N, memory_tiers[node_to_tier(N) + 1]).  If
>>>> so, we can remove node_demotion[] entirely and add a tier field to
>>>> NODE_DATA for node_to_tier().
>>>>
>>>>>> Tiering Hierarchy Initialization
>>>>>> ================================
>>>>>>
>>>>>> By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
>>>>>>
>>>>>> A device driver can remove its memory nodes from the top tier, e.g.
>>>>>> a dax driver can remove PMEM nodes from the top tier.
>>>>> With the topology built by firmware we should not need this.
>>>> I agree. But before we have such a firmware, the kernel needs to do
>>>> its best to initialize memory tiers.
>>>>
>>>> Given that we know PMEM is slower than DRAM, but a dax device might
>>>> not be PMEM, a better place to set the tier for PMEM nodes can be the
>>>> ACPI code, e.g. acpi_numa_memory_affinity_init() where we can examine
>>>> the ACPI_SRAT_MEM_NON_VOLATILE bit.
>>> This is why I hope firmware could chime in, for example, we may have a
>>> new field, called "Tier", in HMAT. Then kernel just reads the field
>>> and put the node into proper tier. But of course override by kernel
>>> could be allowed.
>>>
>>>>>> The kernel builds the memory tiering hierarchy and per-node demotion
>>>>>> order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
>>>>>> best distance nodes in the next lower tier are assigned to
>>>>>> node_demotion[N].preferred and all the nodes in the next lower tier
>>>>>> are assigned to node_demotion[N].allowed.
>>>>> I'm not sure whether it should be allowed to demote to multiple lower
>>>>> tiers. But it is totally fine to *NOT* allow it at the moment. Once we
>>>>> figure out a good way to define demotion targets, it could be extended
>>>>> to support this easily.
>>>> You mean to only support MAX_TIERS=2 for now.  I am fine with that.
>>>> There can be systems with 3 tiers, e.g. GPU -> DRAM -> PMEM, but it is
>>>> not clear yet whether we want to enable transparent memory tiering to
>>>> all the 3 tiers on such systems.
>>> Just start from something simple. And we should fully utilize the
>>> nearest lower tier before demoting to even lower tiers.
>> There might still be simple cases/topologies where we might want to "skip"
>> the very next lower tier. For example, assume we have a 3-tiered memory
>> system as follows:
>>
>> node 0 has a CPU and DDR memory in tier 0, node 1 has GPU and DDR memory
>> in tier 0, node 2 has NVMM memory in tier 1, and node 3 has some sort of
>> bigger memory (could be a bigger DDR or something) in tier 2. The
>> distances are as follows:
>>
>> --------------          --------------
>> |   Node 0   |          |   Node 1   |
>> |  -------   |          |  -------   |
>> | |  DDR  |  |          | |  DDR  |  |
>> |  -------   |          |  -------   |
>> |            |          |            |
>> --------------          --------------
>>          | 20               | 120    |
>>          v                  v        |
>> ----------------------------       |
>> | Node 2     PMEM          |       | 100
>> ----------------------------       |
>>          | 100                       |
>>          v                           v
>> --------------------------------------
>> | Node 3    Large mem                |
>> --------------------------------------
>>
>> node distances:
>> node   0    1    2    3
>>      0  10   20   20  120
>>      1  20   10  120  100
>>      2  20  120   10  100
>>      3  120 100  100   10
>>
>> /sys/devices/system/node/memory_tiers
>> 0-1
>> 2
>> 3
>>
>> N_TOPTIER_MEMORY: 0-1
>>
>>
>> In this case, we want to be able to "skip" the demotion path from Node 1
>> to Node 2,
>>
>> and make demotion go directely to Node 3 as it is closer, distance wise.
>> How can
>>
>> we accommodate this scenario (or at least not rule it out as future
>> work) with the current RFC?
> This is an interesting example.  I think one way to support this is to
> allow all the lower tier nodes to be the demotion targets of a node in
> the higher tier.  We can then use the allocation fallback order to
> select the best demotion target.
>
> For this example, we will have the demotion targets of each node as:
>
> node 0: allowed=2-3, order (based on allocation fallback order): 2, 3
> node 1: allowed=2-3, order (based on allocation fallback order): 3, 2
> node 2: allowed = 3, order (based on allocation fallback order): 3
> node 3: allowed = empty
>
> What do you think?

I think that makes sense, and it aligns with what we thought of
initially. Good to know we agree on the same approach for that.
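
A minimal sketch of what that fallback-based allocation could look
like, assuming the node_demotion[].allowed nodemask from this RFC
spans all allowed lower-tier nodes (the function name and the exact
gfp flags are illustrative, not an actual patch):

#include <linux/gfp.h>
#include <linux/migrate.h>

/*
 * Allocate a demotion target page for a page on node @nid: prefer the
 * nearest allowed node, but let the normal zonelist fallback order
 * pick any other allowed lower-tier node if the preferred one is out
 * of space.
 */
static struct page *alloc_demotion_page(int nid)
{
	nodemask_t allowed = node_demotion[nid].allowed;   /* per this RFC */
	int preferred = next_demotion_node(nid);           /* existing helper */
	gfp_t gfp = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
		    __GFP_NOWARN | __GFP_NOMEMALLOC | GFP_NOWAIT;

	return __alloc_pages(gfp, 0, preferred, &allowed);
}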

>>>>>> node_demotion[N].preferred can be empty if no preferred demotion node
>>>>>> is available for node N.
>>>>>>
>>>>>> If the userspace overrides the tiers via the memory_tiers sysfs
>>>>>> interface, the kernel then only rebuilds the per-node demotion order
>>>>>> accordingly.
>>>>>>
>>>>>> Memory tiering hierarchy is rebuilt upon hot-add or hot-remove of a
>>>>>> memory node, but is NOT rebuilt upon hot-add or hot-remove of a CPU
>>>>>> node.
>>>>>>
>>>>>>
>>>>>> Memory Allocation for Demotion
>>>>>> ==============================
>>>>>>
>>>>>> When allocating a new demotion target page, both a preferred node
>>>>>> and the allowed nodemask are provided to the allocation function.
>>>>>> The default kernel allocation fallback order is used to allocate the
>>>>>> page from the specified node and nodemask.
>>>>>>
>>>>>> The mempolicy of cpuset, vma and owner task of the source page can
>>>>>> be set to refine the demotion nodemask, e.g. to prevent demotion or
>>>>>> select a particular allowed node as the demotion target.
>>>>>>
>>>>>>
>>>>>> Examples
>>>>>> ========
>>>>>>
>>>>>> * Example 1:
>>>>>>     Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
>>>>>>
>>>>>>     Node 0 has node 2 as the preferred demotion target and can also
>>>>>>     fallback demotion to node 3.
>>>>>>
>>>>>>     Node 1 has node 3 as the preferred demotion target and can also
>>>>>>     fallback demotion to node 2.
>>>>>>
>>>>>>     Set mempolicy to prevent cross-socket demotion and memory access,
>>>>>>     e.g. cpuset.mems=0,2
>>>>>>
>>>>>> node distances:
>>>>>> node   0    1    2    3
>>>>>>      0  10   20   30   40
>>>>>>      1  20   10   40   30
>>>>>>      2  30   40   10   40
>>>>>>      3  40   30   40   10
>>>>>>
>>>>>> /sys/devices/system/node/memory_tiers
>>>>>> 0-1
>>>>>> 2-3
>>>>>>
>>>>>> N_TOPTIER_MEMORY: 0-1
>>>>>>
>>>>>> node_demotion[]:
>>>>>>     0: [2], [2-3]
>>>>>>     1: [3], [2-3]
>>>>>>     2: [],  []
>>>>>>     3: [],  []
>>>>>>
>>>>>> * Example 2:
>>>>>>     Node 0 & 1 are DRAM nodes.
>>>>>>     Node 2 is a PMEM node and closer to node 0.
>>>>>>
>>>>>>     Node 0 has node 2 as the preferred and only demotion target.
>>>>>>
>>>>>>     Node 1 has no preferred demotion target, but can still demote
>>>>>>     to node 2.
>>>>>>
>>>>>>     Set mempolicy to prevent cross-socket demotion and memory access,
>>>>>>     e.g. cpuset.mems=0,2
>>>>>>
>>>>>> node distances:
>>>>>> node   0    1    2
>>>>>>      0  10   20   30
>>>>>>      1  20   10   40
>>>>>>      2  30   40   10
>>>>>>
>>>>>> /sys/devices/system/node/memory_tiers
>>>>>> 0-1
>>>>>> 2
>>>>>>
>>>>>> N_TOPTIER_MEMORY: 0-1
>>>>>>
>>>>>> node_demotion[]:
>>>>>>     0: [2], [2]
>>>>>>     1: [],  [2]
>>>>>>     2: [],  []
>>>>>>
>>>>>>
>>>>>> * Example 3:
>>>>>>     Node 0 & 1 are DRAM nodes.
>>>>>>     Node 2 is a PMEM node and has the same distance to node 0 & 1.
>>>>>>
>>>>>>     Node 0 has node 2 as the preferred and only demotion target.
>>>>>>
>>>>>>     Node 1 has node 2 as the preferred and only demotion target.
>>>>>>
>>>>>> node distances:
>>>>>> node   0    1    2
>>>>>>      0  10   20   30
>>>>>>      1  20   10   30
>>>>>>      2  30   30   10
>>>>>>
>>>>>> /sys/devices/system/node/memory_tiers
>>>>>> 0-1
>>>>>> 2
>>>>>>
>>>>>> N_TOPTIER_MEMORY: 0-1
>>>>>>
>>>>>> node_demotion[]:
>>>>>>     0: [2], [2]
>>>>>>     1: [2], [2]
>>>>>>     2: [],  []
>>>>>>
>>>>>>
>>>>>> * Example 4:
>>>>>>     Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
>>>>>>
>>>>>>     All nodes are top-tier.
>>>>>>
>>>>>> node distances:
>>>>>> node   0    1    2
>>>>>>      0  10   20   30
>>>>>>      1  20   10   30
>>>>>>      2  30   30   10
>>>>>>
>>>>>> /sys/devices/system/node/memory_tiers
>>>>>> 0-2
>>>>>>
>>>>>> N_TOPTIER_MEMORY: 0-2
>>>>>>
>>>>>> node_demotion[]:
>>>>>>     0: [],  []
>>>>>>     1: [],  []
>>>>>>     2: [],  []
>>>>>>
>>>>>>
>>>>>> * Example 5:
>>>>>>     Node 0 is a DRAM node with CPU.
>>>>>>     Node 1 is a HBM node.
>>>>>>     Node 2 is a PMEM node.
>>>>>>
>>>>>>     With userspace override, node 1 is the top tier and has node 0 as
>>>>>>     the preferred and only demotion target.
>>>>>>
>>>>>>     Node 0 is in the second tier, tier 1, and has node 2 as the
>>>>>>     preferred and only demotion target.
>>>>>>
>>>>>>     Node 2 is in the lowest tier, tier 2, and has no demotion targets.
>>>>>>
>>>>>> node distances:
>>>>>> node   0    1    2
>>>>>>      0  10   21   30
>>>>>>      1  21   10   40
>>>>>>      2  30   40   10
>>>>>>
>>>>>> /sys/devices/system/node/memory_tiers (userspace override)
>>>>>> 1
>>>>>> 0
>>>>>> 2
>>>>>>
>>>>>> N_TOPTIER_MEMORY: 1
>>>>>>
>>>>>> node_demotion[]:
>>>>>>     0: [2], [2]
>>>>>>     1: [0], [0]
>>>>>>     2: [],  []
>>>>>>
>>>>>> -- Wei
>> -- Hesham
>>


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-10  5:37         ` Alistair Popple
@ 2022-05-10 11:38           ` Aneesh Kumar K.V
  2022-05-11  5:30             ` Wei Xu
  0 siblings, 1 reply; 57+ messages in thread
From: Aneesh Kumar K.V @ 2022-05-10 11:38 UTC (permalink / raw)
  To: Alistair Popple, Wei Xu
  Cc: Yang Shi, Andrew Morton, Dave Hansen, Huang Ying, Dan Williams,
	Linux MM, Greg Thelen, Jagdish Gediya, Linux Kernel Mailing List,
	Davidlohr Bueso, Michal Hocko, Baolin Wang, Brice Goglin,
	Feng Tang, Jonathan Cameron, Tim Chen

Alistair Popple <apopple@nvidia.com> writes:

> Wei Xu <weixugc@google.com> writes:
>
>> On Thu, May 5, 2022 at 5:19 PM Alistair Popple <apopple@nvidia.com> wrote:
>>>
>>> Wei Xu <weixugc@google.com> writes:
>>>
>>> [...]
>>>
>>> >> >
>>> >> >
>>> >> > Tiering Hierarchy Initialization
>>> >> > `=============================='
>>> >> >
>>> >> > By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
>>> >> >
>>> >> > A device driver can remove its memory nodes from the top tier, e.g.
>>> >> > a dax driver can remove PMEM nodes from the top tier.
>>> >>
>>> >> With the topology built by firmware we should not need this.
>>>
>>> I agree that in an ideal world the hierarchy should be built by firmware based
>>> on something like the HMAT. But I also think being able to override this will be
>>> useful in getting there. Therefore a way of overriding the generated hierarchy
>>> would be good, either via sysfs or kernel boot parameter if we don't want to
>>> commit to a particular user interface now.
>>>
>>> However I'm less sure letting device-drivers override this is a good idea. How
>>> for example would a GPU driver make sure its node is in the top tier? By moving
>>> every node that the driver does not know about out of N_TOPTIER_MEMORY? That
>>> could get messy if say there were two drivers both of which wanted their node to
>>> be in the top tier.
>>
>> The suggestion is to allow a device driver to opt out its memory
>> devices from the top-tier, not the other way around.
>
> So how would demotion work in the case of accelerators then? In that
> case we would want GPU memory to demote to DRAM, but that won't happen
> if both DRAM and GPU memory are in N_TOPTIER_MEMORY and it seems the
> only override available with this proposal would move GPU memory into a
> lower tier, which is the opposite of what's needed there.

How about we do 3 tiers now?  dax kmem devices can be registered to
tier 3.  By default all NUMA nodes can be registered at tier 2, and HBM
or GPU nodes can be enabled to register at tier 1.
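
As a rough illustration of that default policy (assumed semantics only;
the enum, helper name and tier numbers below are hypothetical, not an
existing kernel API), the assignment rule could be modelled like this:

#include <stdio.h>

/*
 * Toy model of the proposed defaults: tier 1 = HBM/GPU nodes,
 * tier 2 = ordinary NUMA nodes, tier 3 = dax/kmem (e.g. PMEM) nodes.
 * Purely illustrative; not kernel code.
 */
enum node_kind { NODE_DRAM, NODE_HBM_GPU, NODE_DAX_KMEM };

static int default_tier(enum node_kind kind)
{
        switch (kind) {
        case NODE_HBM_GPU:
                return 1;
        case NODE_DAX_KMEM:
                return 3;
        case NODE_DRAM:
        default:
                return 2;
        }
}

int main(void)
{
        /* example: node 0 = CPU+DRAM, node 1 = GPU/HBM, node 2 = PMEM */
        enum node_kind nodes[] = { NODE_DRAM, NODE_HBM_GPU, NODE_DAX_KMEM };

        for (int tier = 1; tier <= 3; tier++) {
                printf("tier %d:", tier);
                for (int n = 0; n < 3; n++)
                        if (default_tier(nodes[n]) == tier)
                                printf(" node%d", n);
                printf("\n");
        }
        return 0;
}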

-aneesh


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-10  4:22         ` Wei Xu
  2022-05-10 10:01           ` Hesham Almatary
@ 2022-05-10 11:44           ` Aneesh Kumar K.V
  1 sibling, 0 replies; 57+ messages in thread
From: Aneesh Kumar K.V @ 2022-05-10 11:44 UTC (permalink / raw)
  To: Wei Xu, Hesham Almatary
  Cc: Yang Shi, Andrew Morton, Dave Hansen, Huang Ying, Dan Williams,
	Linux MM, Greg Thelen, Jagdish Gediya, Linux Kernel Mailing List,
	Alistair Popple, Davidlohr Bueso, Michal Hocko, Baolin Wang,
	Brice Goglin, Feng Tang, Tim Chen

Wei Xu <weixugc@google.com> writes:

> On Mon, May 9, 2022 at 7:32 AM Hesham Almatary
> <hesham.almatary@huawei.com> wrote:
>>

....

> > nearest lower tier before demoting to lower lower tiers.
>> There might still be simple cases/topologies where we might want to "skip"
>> the very next lower tier. For example, assume we have a 3 tiered memory
>> system as follows:
>>
>> node 0 has a CPU and DDR memory in tier 0, node 1 has GPU and DDR memory
>> in tier 0,
>> node 2 has NVMM memory in tier 1, node 3 has some sort of bigger memory
>> (could be a bigger DDR or something) in tier 2. The distances are as
>> follows:
>>
>> --------------          --------------
>> |   Node 0   |          |   Node 1   |
>> |  -------   |          |  -------   |
>> | |  DDR  |  |          | |  DDR  |  |
>> |  -------   |          |  -------   |
>> |            |          |            |
>> --------------          --------------
>>         | 20               | 120    |
>>         v                  v        |
>> ----------------------------       |
>> | Node 2     PMEM          |       | 100
>> ----------------------------       |
>>         | 100                       |
>>         v                           v
>> --------------------------------------
>> | Node 3    Large mem                |
>> --------------------------------------
>>
>> node distances:
>> node   0    1    2    3
>>     0  10   20   20  120
>>     1  20   10  120  100
>>     2  20  120   10  100
>>     3  120 100  100   10
>>
>> /sys/devices/system/node/memory_tiers
>> 0-1
>> 2
>> 3
>>
>> N_TOPTIER_MEMORY: 0-1
>>
>>
>> In this case, we want to be able to "skip" the demotion path from Node 1
>> to Node 2, and make demotion go directly to Node 3 as it is closer,
>> distance wise.  How can we accommodate this scenario (or at least not
>> rule it out as future work) with the current RFC?
>
> This is an interesting example.  I think one way to support this is to
> allow all the lower tier nodes to be the demotion targets of a node in
> the higher tier.  We can then use the allocation fallback order to
> select the best demotion target.
>
> For this example, we will have the demotion targets of each node as:
>
> node 0: allowed=2-3, order (based on allocation fallback order): 2, 3
> node 1: allowed=2-3, order (based on allocation fallback order): 3, 2
> node 2: allowed = 3, order (based on allocation fallback order): 3
> node 3: allowed = empty
>
> What do you think?
>

Can we simplify this further with

tier 0 ->  empty (no HBM/GPU)
tier 1 ->  Node0, Node1
tier 2 ->  Node2, Node3

Hence

 node 0: allowed=2-3, order (based on allocation fallback order): 2, 3
 node 1: allowed=2-3, order (based on allocation fallback order): 3, 2
 node 2: allowed = empty
 node 3: allowed = empty

-aneesh


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-10  9:59           ` Hesham Almatary
@ 2022-05-10 12:10             ` Aneesh Kumar K V
  2022-05-11  5:42               ` Wei Xu
  0 siblings, 1 reply; 57+ messages in thread
From: Aneesh Kumar K V @ 2022-05-10 12:10 UTC (permalink / raw)
  To: Hesham Almatary, Yang Shi
  Cc: Andrew Morton, Dave Hansen, Huang Ying, Dan Williams, Linux MM,
	Greg Thelen, Jagdish Gediya, Linux Kernel Mailing List,
	Alistair Popple, Davidlohr Bueso, Michal Hocko, Baolin Wang,
	Brice Goglin, Feng Tang, Tim Chen, Wei Xu

On 5/10/22 3:29 PM, Hesham Almatary wrote:
> Hello Yang,
> 
> On 5/10/2022 4:24 AM, Yang Shi wrote:
>> On Mon, May 9, 2022 at 7:32 AM Hesham Almatary
>> <hesham.almatary@huawei.com> wrote:


...

>>>
>>> node 0 has a CPU and DDR memory in tier 0, node 1 has GPU and DDR memory
>>> in tier 0,
>>> node 2 has NVMM memory in tier 1, node 3 has some sort of bigger memory
>>> (could be a bigger DDR or something) in tier 2. The distances are as
>>> follows:
>>>
>>> --------------          --------------
>>> |   Node 0   |          |   Node 1   |
>>> |  -------   |          |  -------   |
>>> | |  DDR  |  |          | |  DDR  |  |
>>> |  -------   |          |  -------   |
>>> |            |          |            |
>>> --------------          --------------
>>>          | 20               | 120    |
>>>          v                  v        |
>>> ----------------------------       |
>>> | Node 2     PMEM          |       | 100
>>> ----------------------------       |
>>>          | 100                       |
>>>          v                           v
>>> --------------------------------------
>>> | Node 3    Large mem                |
>>> --------------------------------------
>>>
>>> node distances:
>>> node   0    1    2    3
>>>      0  10   20   20  120
>>>      1  20   10  120  100
>>>      2  20  120   10  100
>>>      3  120 100  100   10
>>>
>>> /sys/devices/system/node/memory_tiers
>>> 0-1
>>> 2
>>> 3
>>>
>>> N_TOPTIER_MEMORY: 0-1
>>>
>>>
>>> In this case, we want to be able to "skip" the demotion path from
>>> Node 1 to Node 2, and make demotion go directly to Node 3 as it is
>>> closer, distance wise.  How can we accommodate this scenario (or at
>>> least not rule it out as future work) with the current RFC?
>> If I remember correctly NUMA distance is hardcoded in SLIT by the
>> firmware, it is supposed to reflect the latency. So I suppose it is
>> the firmware's responsibility to have correct information. And the RFC
>> assumes higher tier memory has better performance than lower tier
>> memory (latency, bandwidth, throughput, etc), so it sounds like a
>> buggy firmware to have lower tier memory with shorter distance than
>> higher tier memory IMHO.
> 
> You are correct if you're assuming the topology is all hierarchically
> symmetric, but unfortunately, in real hardware (e.g., my example above)
> it is not.  The distance/latency between two nodes in the same tier and
> a third node is different.  The firmware still provides the correct
> latency, but putting a node in a tier is up to the kernel/user, and is
> relative: e.g., Node 3 could belong to tier 1 from Node 1's perspective,
> but to tier 2 from Node 0's.
> 
> 
> A more detailed example (building on my previous one) is when having
> the GPU connected to a switch:
> 
> ----------------------------
> | Node 2     PMEM          |
> ----------------------------
>        ^
>        |
> --------------          --------------
> |   Node 0   |          |   Node 1   |
> |  -------   |          |  -------   |
> | |  DDR  |  |          | |  DDR  |  |
> |  -------   |          |  -------   |
> |    CPU     |          |    GPU     |
> --------------          --------------
>         |                  |
>         v                  v
> ----------------------------
> |         Switch           |
> ----------------------------
>         |
>         v
> --------------------------------------
> | Node 3    Large mem                |
> --------------------------------------
> 
> Here, demoting from Node 1 to Node 3 directly would be faster as it only
> has to go through one hub, compared to demoting from Node 1 to Node 2,
> where it goes through two hubs.  I hope that example clarifies things a
> little bit.
> 

Alistair mentioned that we want to consider GPU memory to be expensive
and want to demote from GPU to regular DRAM. In that case, for the above
example, we should end up with


tier 0 ->  Node3
tier 1 ->  Node0, Node1
tier 2 ->  Node2

Hence

  node 0: allowed=2
  node 1: allowed=2
  node 2: allowed = empty
  node 3: allowed = 0-1 , based on fallback order 1, 0

-aneesh




^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-10 11:38           ` Aneesh Kumar K.V
@ 2022-05-11  5:30             ` Wei Xu
  2022-05-11  7:34               ` Alistair Popple
  2022-05-11  7:49               ` ying.huang
  0 siblings, 2 replies; 57+ messages in thread
From: Wei Xu @ 2022-05-11  5:30 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Alistair Popple, Yang Shi, Andrew Morton, Dave Hansen,
	Huang Ying, Dan Williams, Linux MM, Greg Thelen, Jagdish Gediya,
	Linux Kernel Mailing List, Davidlohr Bueso, Michal Hocko,
	Baolin Wang, Brice Goglin, Feng Tang, Jonathan Cameron, Tim Chen

On Tue, May 10, 2022 at 4:38 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> Alistair Popple <apopple@nvidia.com> writes:
>
> > Wei Xu <weixugc@google.com> writes:
> >
> >> On Thu, May 5, 2022 at 5:19 PM Alistair Popple <apopple@nvidia.com> wrote:
> >>>
> >>> Wei Xu <weixugc@google.com> writes:
> >>>
> >>> [...]
> >>>
> >>> >> >
> >>> >> >
> >>> >> > Tiering Hierarchy Initialization
> >>> >> > `=============================='
> >>> >> >
> >>> >> > By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
> >>> >> >
> >>> >> > A device driver can remove its memory nodes from the top tier, e.g.
> >>> >> > a dax driver can remove PMEM nodes from the top tier.
> >>> >>
> >>> >> With the topology built by firmware we should not need this.
> >>>
> >>> I agree that in an ideal world the hierarchy should be built by firmware based
> >>> on something like the HMAT. But I also think being able to override this will be
> >>> useful in getting there. Therefore a way of overriding the generated hierarchy
> >>> would be good, either via sysfs or kernel boot parameter if we don't want to
> >>> commit to a particular user interface now.
> >>>
> >>> However I'm less sure letting device-drivers override this is a good idea. How
> >>> for example would a GPU driver make sure its node is in the top tier? By moving
> >>> every node that the driver does not know about out of N_TOPTIER_MEMORY? That
> >>> could get messy if say there were two drivers both of which wanted their node to
> >>> be in the top tier.
> >>
> >> The suggestion is to allow a device driver to opt out its memory
> >> devices from the top-tier, not the other way around.
> >
> > So how would demotion work in the case of accelerators then? In that
> > case we would want GPU memory to demote to DRAM, but that won't happen
> > if both DRAM and GPU memory are in N_TOPTIER_MEMORY and it seems the
> > only override available with this proposal would move GPU memory into a
> > lower tier, which is the opposite of what's needed there.
>
> How about we do 3 tiers now. dax kmem devices can be registered to
> tier 3. By default all numa nodes can be registered at tier 2 and HBM or
> GPU can be enabled to register at tier 1. ?

This makes sense.  I will send an updated RFC based on the discussions so far.

> -aneesh


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-10 12:10             ` Aneesh Kumar K V
@ 2022-05-11  5:42               ` Wei Xu
  2022-05-11  7:12                 ` Alistair Popple
  0 siblings, 1 reply; 57+ messages in thread
From: Wei Xu @ 2022-05-11  5:42 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: Hesham Almatary, Yang Shi, Andrew Morton, Dave Hansen,
	Huang Ying, Dan Williams, Linux MM, Greg Thelen, Jagdish Gediya,
	Linux Kernel Mailing List, Alistair Popple, Davidlohr Bueso,
	Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang, Tim Chen

On Tue, May 10, 2022 at 5:10 AM Aneesh Kumar K V
<aneesh.kumar@linux.ibm.com> wrote:
>
> On 5/10/22 3:29 PM, Hesham Almatary wrote:
> > Hello Yang,
> >
> > On 5/10/2022 4:24 AM, Yang Shi wrote:
> >> On Mon, May 9, 2022 at 7:32 AM Hesham Almatary
> >> <hesham.almatary@huawei.com> wrote:
>
>
> ...
>
> >>>
> >>> node 0 has a CPU and DDR memory in tier 0, node 1 has GPU and DDR memory
> >>> in tier 0,
> >>> node 2 has NVMM memory in tier 1, node 3 has some sort of bigger memory
> >>> (could be a bigger DDR or something) in tier 2. The distances are as
> >>> follows:
> >>>
> >>> --------------          --------------
> >>> |   Node 0   |          |   Node 1   |
> >>> |  -------   |          |  -------   |
> >>> | |  DDR  |  |          | |  DDR  |  |
> >>> |  -------   |          |  -------   |
> >>> |            |          |            |
> >>> --------------          --------------
> >>>          | 20               | 120    |
> >>>          v                  v        |
> >>> ----------------------------       |
> >>> | Node 2     PMEM          |       | 100
> >>> ----------------------------       |
> >>>          | 100                       |
> >>>          v                           v
> >>> --------------------------------------
> >>> | Node 3    Large mem                |
> >>> --------------------------------------
> >>>
> >>> node distances:
> >>> node   0    1    2    3
> >>>      0  10   20   20  120
> >>>      1  20   10  120  100
> >>>      2  20  120   10  100
> >>>      3  120 100  100   10
> >>>
> >>> /sys/devices/system/node/memory_tiers
> >>> 0-1
> >>> 2
> >>> 3
> >>>
> >>> N_TOPTIER_MEMORY: 0-1
> >>>
> >>>
> >>> In this case, we want to be able to "skip" the demotion path from
> >>> Node 1 to Node 2, and make demotion go directly to Node 3 as it is
> >>> closer, distance wise.  How can we accommodate this scenario (or at
> >>> least not rule it out as future work) with the current RFC?
> >> If I remember correctly NUMA distance is hardcoded in SLIT by the
> >> firmware, it is supposed to reflect the latency. So I suppose it is
> >> the firmware's responsibility to have correct information. And the RFC
> >> assumes higher tier memory has better performance than lower tier
> >> memory (latency, bandwidth, throughput, etc), so it sounds like a
> >> buggy firmware to have lower tier memory with shorter distance than
> >> higher tier memory IMHO.
> >
> > You are correct if you're assuming the topology is all hierarchically
> > symmetric, but unfortunately, in real hardware (e.g., my example
> > above) it is not.  The distance/latency between two nodes in the same
> > tier and a third node is different.  The firmware still provides the
> > correct latency, but putting a node in a tier is up to the kernel/user,
> > and is relative: e.g., Node 3 could belong to tier 1 from Node 1's
> > perspective, but to tier 2 from Node 0's.
> >
> >
> > A more detailed example (building on my previous one) is when having
> > the GPU connected to a switch:
> >
> > ----------------------------
> > | Node 2     PMEM          |
> > ----------------------------
> >        ^
> >        |
> > --------------          --------------
> > |   Node 0   |          |   Node 1   |
> > |  -------   |          |  -------   |
> > | |  DDR  |  |          | |  DDR  |  |
> > |  -------   |          |  -------   |
> > |    CPU     |          |    GPU     |
> > --------------          --------------
> >         |                  |
> >         v                  v
> > ----------------------------
> > |         Switch           |
> > ----------------------------
> >         |
> >         v
> > --------------------------------------
> > | Node 3    Large mem                |
> > --------------------------------------
> >
> > Here, demoting from Node 1 to Node 3 directly would be faster as it
> > only has to go through one hub, compared to demoting from Node 1 to
> > Node 2, where it goes through two hubs.  I hope that example clarifies
> > things a little bit.
> >
>
> Alistair mentioned that we want to consider GPU memory to be expensive
> and want to demote from GPU to regular DRAM. In that case for the above
> case we should end up with
>
>
> tier 0 - > Node3
> tier 1 ->  Node0, Node1
> tier 2 ->  Node2
>
> Hence
>
>   node 0: allowed=2
>   node 1: allowed=2
>   node 2: allowed = empty
>   node 3: allowed = 0-1 , based on fallback order 1, 0

If we have 3 tiers as defined above, then we'd better have:

node 0: allowed = 2
node 1: allowed = 2
node 2: allowed = empty
node 3: allowed = 0-2, based on fallback order: 1,0,2

The firmware should provide the node distance values to reflect that
PMEM is slowest and should have the largest distance away from node 3.

> -aneesh
>
>


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-11  5:42               ` Wei Xu
@ 2022-05-11  7:12                 ` Alistair Popple
  2022-05-11  9:05                   ` Hesham Almatary
  2022-05-12  4:40                   ` Aneesh Kumar K V
  0 siblings, 2 replies; 57+ messages in thread
From: Alistair Popple @ 2022-05-11  7:12 UTC (permalink / raw)
  To: Wei Xu
  Cc: Aneesh Kumar K V, Hesham Almatary, Yang Shi, Andrew Morton,
	Dave Hansen, Huang Ying, Dan Williams, Linux MM, Greg Thelen,
	Jagdish Gediya, Linux Kernel Mailing List, Davidlohr Bueso,
	Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang, Tim Chen


Wei Xu <weixugc@google.com> writes:

> On Tue, May 10, 2022 at 5:10 AM Aneesh Kumar K V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> On 5/10/22 3:29 PM, Hesham Almatary wrote:
>> > Hello Yang,
>> >
>> > On 5/10/2022 4:24 AM, Yang Shi wrote:
>> >> On Mon, May 9, 2022 at 7:32 AM Hesham Almatary
>> >> <hesham.almatary@huawei.com> wrote:
>>
>>
>> ...
>>
>> >>>
>> >>> node 0 has a CPU and DDR memory in tier 0, node 1 has GPU and DDR memory
>> >>> in tier 0,
>> >>> node 2 has NVMM memory in tier 1, node 3 has some sort of bigger memory
>> >>> (could be a bigger DDR or something) in tier 2. The distances are as
>> >>> follows:
>> >>>
>> >>> --------------          --------------
>> >>> |   Node 0   |          |   Node 1   |
>> >>> |  -------   |          |  -------   |
>> >>> | |  DDR  |  |          | |  DDR  |  |
>> >>> |  -------   |          |  -------   |
>> >>> |            |          |            |
>> >>> --------------          --------------
>> >>>          | 20               | 120    |
>> >>>          v                  v        |
>> >>> ----------------------------       |
>> >>> | Node 2     PMEM          |       | 100
>> >>> ----------------------------       |
>> >>>          | 100                       |
>> >>>          v                           v
>> >>> --------------------------------------
>> >>> | Node 3    Large mem                |
>> >>> --------------------------------------
>> >>>
>> >>> node distances:
>> >>> node   0    1    2    3
>> >>>      0  10   20   20  120
>> >>>      1  20   10  120  100
>> >>>      2  20  120   10  100
>> >>>      3  120 100  100   10
>> >>>
>> >>> /sys/devices/system/node/memory_tiers
>> >>> 0-1
>> >>> 2
>> >>> 3
>> >>>
>> >>> N_TOPTIER_MEMORY: 0-1
>> >>>
>> >>>
>> >>> In this case, we want to be able to "skip" the demotion path from
>> >>> Node 1 to Node 2, and make demotion go directly to Node 3 as it is
>> >>> closer, distance wise.  How can we accommodate this scenario (or at
>> >>> least not rule it out as future work) with the current RFC?
>> >> If I remember correctly NUMA distance is hardcoded in SLIT by the
>> >> firmware, it is supposed to reflect the latency. So I suppose it is
>> >> the firmware's responsibility to have correct information. And the RFC
>> >> assumes higher tier memory has better performance than lower tier
>> >> memory (latency, bandwidth, throughput, etc), so it sounds like a
>> >> buggy firmware to have lower tier memory with shorter distance than
>> >> higher tier memory IMHO.
>> >
>> > You are correct if you're assuming the topology is all hierarchically
>> > symmetric, but unfortunately, in real hardware (e.g., my example
>> > above) it is not.  The distance/latency between two nodes in the same
>> > tier and a third node is different.  The firmware still provides the
>> > correct latency, but putting a node in a tier is up to the
>> > kernel/user, and is relative: e.g., Node 3 could belong to tier 1
>> > from Node 1's perspective, but to tier 2 from Node 0's.
>> >
>> >
>> > A more detailed example (building on my previous one) is when having
>> > the GPU connected to a switch:
>> >
>> > ----------------------------
>> > | Node 2     PMEM          |
>> > ----------------------------
>> >        ^
>> >        |
>> > --------------          --------------
>> > |   Node 0   |          |   Node 1   |
>> > |  -------   |          |  -------   |
>> > | |  DDR  |  |          | |  DDR  |  |
>> > |  -------   |          |  -------   |
>> > |    CPU     |          |    GPU     |
>> > --------------          --------------
>> >         |                  |
>> >         v                  v
>> > ----------------------------
>> > |         Switch           |
>> > ----------------------------
>> >         |
>> >         v
>> > --------------------------------------
>> > | Node 3    Large mem                |
>> > --------------------------------------
>> >
>> > Here, demoting from Node 1 to Node 3 directly would be faster as it
>> > only has to go through one hub, compared to demoting from Node 1 to
>> > Node 2, where it goes through two hubs.  I hope that example
>> > clarifies things a little bit.
>> >
>>
>> Alistair mentioned that we want to consider GPU memory to be expensive
>> and want to demote from GPU to regular DRAM. In that case for the above
>> case we should end up with
>>
>>
>> tier 0 - > Node3
>> tier 1 ->  Node0, Node1
>> tier 2 ->  Node2

I'm a little bit confused by the tiering here as I don't think it's
quite what we want. As pointed out GPU memory is expensive and therefore
we don't want anything demoting to it. That implies it should be in the
top tier:

tier 0 -> Node1
tier 1 -> Node0, Node3
tier 2 -> Node2

Hence:

node 0: allowed=2
node 1: allowed=0,3,2
node 2: allowed=empty
node 3: allowed=2

Alternatively Node3 could be put in tier 2 which would prevent demotion
to PMEM via the switch/CPU:

tier 0 -> Node1
tier 1 -> Node0
tier 2 -> Node2, Node3

node 0: allowed=2,3
node 1: allowed=0,3,2
node 2: allowed=empty
node 3: allowed=empty

Both of these would be an improvement over the current situation
upstream, which demotes everything to GPU memory and doesn't support
demoting from the GPU (meaning reclaim on GPU memory pages everything to
disk).

>>
>> Hence
>>
>>   node 0: allowed=2
>>   node 1: allowed=2
>>   node 2: allowed = empty
>>   node 3: allowed = 0-1 , based on fallback order 1, 0
>
> If we have 3 tiers as defined above, then we'd better have:
>
> node 0: allowed = 2
> node 1: allowed = 2
> node 2: allowed = empty
> node 3: allowed = 0-2, based on fallback order: 1,0,2
>
> The firmware should provide the node distance values to reflect that
> PMEM is slowest and should have the largest distance away from node 3.

Right. In my above example firmware would have to provide reasonable
distance values to ensure optimal fallback order.

>> -aneesh
>>
>>


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-11  5:30             ` Wei Xu
@ 2022-05-11  7:34               ` Alistair Popple
  2022-05-11  7:49               ` ying.huang
  1 sibling, 0 replies; 57+ messages in thread
From: Alistair Popple @ 2022-05-11  7:34 UTC (permalink / raw)
  To: Wei Xu
  Cc: Aneesh Kumar K.V, Yang Shi, Andrew Morton, Dave Hansen,
	Huang Ying, Dan Williams, Linux MM, Greg Thelen, Jagdish Gediya,
	Linux Kernel Mailing List, Davidlohr Bueso, Michal Hocko,
	Baolin Wang, Brice Goglin, Feng Tang, Jonathan Cameron, Tim Chen


Wei Xu <weixugc@google.com> writes:

> On Tue, May 10, 2022 at 4:38 AM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> Alistair Popple <apopple@nvidia.com> writes:
>>
>> > Wei Xu <weixugc@google.com> writes:
>> >
>> >> On Thu, May 5, 2022 at 5:19 PM Alistair Popple <apopple@nvidia.com> wrote:
>> >>>
>> >>> Wei Xu <weixugc@google.com> writes:
>> >>>
>> >>> [...]
>> >>>
>> >>> >> >
>> >>> >> >
>> >>> >> > Tiering Hierarchy Initialization
>> >>> >> > `=============================='
>> >>> >> >
>> >>> >> > By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
>> >>> >> >
>> >>> >> > A device driver can remove its memory nodes from the top tier, e.g.
>> >>> >> > a dax driver can remove PMEM nodes from the top tier.
>> >>> >>
>> >>> >> With the topology built by firmware we should not need this.
>> >>>
>> >>> I agree that in an ideal world the hierarchy should be built by firmware based
>> >>> on something like the HMAT. But I also think being able to override this will be
>> >>> useful in getting there. Therefore a way of overriding the generated hierarchy
>> >>> would be good, either via sysfs or kernel boot parameter if we don't want to
>> >>> commit to a particular user interface now.
>> >>>
>> >>> However I'm less sure letting device-drivers override this is a good idea. How
>> >>> for example would a GPU driver make sure its node is in the top tier? By moving
>> >>> every node that the driver does not know about out of N_TOPTIER_MEMORY? That
>> >>> could get messy if say there were two drivers both of which wanted their node to
>> >>> be in the top tier.
>> >>
>> >> The suggestion is to allow a device driver to opt out its memory
>> >> devices from the top-tier, not the other way around.
>> >
>> > So how would demotion work in the case of accelerators then? In that
>> > case we would want GPU memory to demote to DRAM, but that won't happen
>> > if both DRAM and GPU memory are in N_TOPTIER_MEMORY and it seems the
>> > only override available with this proposal would move GPU memory into a
>> > lower tier, which is the opposite of what's needed there.
>>
>> How about we do 3 tiers now. dax kmem devices can be registered to
>> tier 3. By default all numa nodes can be registered at tier 2 and HBM or
>> GPU can be enabled to register at tier 1. ?
>
> This makes sense.  I will send an updated RFC based on the discussions so far.

Thanks! The sense I got from LSF/MM was that we should initially try to
keep things simple by limiting it to two tiers. However I don't think
there was strong opposition to adding a third tier to support
GPU+CPU+PMEM, and it does seem like it might even be simpler to just
have three tiers and assign devices as suggested.

>> -aneesh


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-11  5:30             ` Wei Xu
  2022-05-11  7:34               ` Alistair Popple
@ 2022-05-11  7:49               ` ying.huang
  2022-05-11 17:07                 ` Wei Xu
  1 sibling, 1 reply; 57+ messages in thread
From: ying.huang @ 2022-05-11  7:49 UTC (permalink / raw)
  To: Wei Xu, Aneesh Kumar K.V
  Cc: Alistair Popple, Yang Shi, Andrew Morton, Dave Hansen,
	Dan Williams, Linux MM, Greg Thelen, Jagdish Gediya,
	Linux Kernel Mailing List, Davidlohr Bueso, Michal Hocko,
	Baolin Wang, Brice Goglin, Feng Tang, Jonathan Cameron, Tim Chen

On Tue, 2022-05-10 at 22:30 -0700, Wei Xu wrote:
> On Tue, May 10, 2022 at 4:38 AM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
> > 
> > Alistair Popple <apopple@nvidia.com> writes:
> > 
> > > Wei Xu <weixugc@google.com> writes:
> > > 
> > > > On Thu, May 5, 2022 at 5:19 PM Alistair Popple <apopple@nvidia.com> wrote:
> > > > > 
> > > > > Wei Xu <weixugc@google.com> writes:
> > > > > 
> > > > > [...]
> > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Tiering Hierarchy Initialization
> > > > > > > > `=============================='
> > > > > > > > 
> > > > > > > > By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
> > > > > > > > 
> > > > > > > > A device driver can remove its memory nodes from the top tier, e.g.
> > > > > > > > a dax driver can remove PMEM nodes from the top tier.
> > > > > > > 
> > > > > > > With the topology built by firmware we should not need this.
> > > > > 
> > > > > I agree that in an ideal world the hierarchy should be built by firmware based
> > > > > on something like the HMAT. But I also think being able to override this will be
> > > > > useful in getting there. Therefore a way of overriding the generated hierarchy
> > > > > would be good, either via sysfs or kernel boot parameter if we don't want to
> > > > > commit to a particular user interface now.
> > > > > 
> > > > > However I'm less sure letting device-drivers override this is a good idea. How
> > > > > for example would a GPU driver make sure its node is in the top tier? By moving
> > > > > every node that the driver does not know about out of N_TOPTIER_MEMORY? That
> > > > > could get messy if say there were two drivers both of which wanted their node to
> > > > > be in the top tier.
> > > > 
> > > > The suggestion is to allow a device driver to opt out its memory
> > > > devices from the top-tier, not the other way around.
> > > 
> > > So how would demotion work in the case of accelerators then? In that
> > > case we would want GPU memory to demote to DRAM, but that won't happen
> > > if both DRAM and GPU memory are in N_TOPTIER_MEMORY and it seems the
> > > only override available with this proposal would move GPU memory into a
> > > lower tier, which is the opposite of what's needed there.
> > 
> > How about we do 3 tiers now. dax kmem devices can be registered to
> > tier 3. By default all numa nodes can be registered at tier 2 and HBM or
> > GPU can be enabled to register at tier 1. ?
> 
> This makes sense.  I will send an updated RFC based on the discussions so far.

Are these tier numbers fixed?  If so, it appears strange that the
smallest tier number is 0 on some machines, but 1 on some other
machines.

Best Regards,
Huang, Ying




^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-11  7:12                 ` Alistair Popple
@ 2022-05-11  9:05                   ` Hesham Almatary
  2022-05-12  3:02                     ` ying.huang
  2022-05-12  4:40                   ` Aneesh Kumar K V
  1 sibling, 1 reply; 57+ messages in thread
From: Hesham Almatary @ 2022-05-11  9:05 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Wei Xu, Aneesh Kumar K V, Yang Shi, Andrew Morton, Dave Hansen,
	Huang Ying, Dan Williams, Linux MM, Greg Thelen, Jagdish Gediya,
	Linux Kernel Mailing List, Davidlohr Bueso, Michal Hocko,
	Baolin Wang, Brice Goglin, Feng Tang, Tim Chen

On Wed, 11 May 2022 17:12:34 +1000
Alistair Popple <apopple@nvidia.com> wrote:

> 
> Wei Xu <weixugc@google.com> writes:
> 
> > On Tue, May 10, 2022 at 5:10 AM Aneesh Kumar K V
> > <aneesh.kumar@linux.ibm.com> wrote:
> >>
> >> On 5/10/22 3:29 PM, Hesham Almatary wrote:
> >> > Hello Yang,
> >> >
> >> > On 5/10/2022 4:24 AM, Yang Shi wrote:
> >> >> On Mon, May 9, 2022 at 7:32 AM Hesham Almatary
> >> >> <hesham.almatary@huawei.com> wrote:
> >>
> >>
> >> ...
> >>
> >> >>>
> >> >>> node 0 has a CPU and DDR memory in tier 0, node 1 has GPU and
> >> >>> DDR memory in tier 0,
> >> >>> node 2 has NVMM memory in tier 1, node 3 has some sort of
> >> >>> bigger memory (could be a bigger DDR or something) in tier 2.
> >> >>> The distances are as follows:
> >> >>>
> >> >>> --------------          --------------
> >> >>> |   Node 0   |          |   Node 1   |
> >> >>> |  -------   |          |  -------   |
> >> >>> | |  DDR  |  |          | |  DDR  |  |
> >> >>> |  -------   |          |  -------   |
> >> >>> |            |          |            |
> >> >>> --------------          --------------
> >> >>>          | 20               | 120    |
> >> >>>          v                  v        |
> >> >>> ----------------------------       |
> >> >>> | Node 2     PMEM          |       | 100
> >> >>> ----------------------------       |
> >> >>>          | 100                       |
> >> >>>          v                           v
> >> >>> --------------------------------------
> >> >>> | Node 3    Large mem                |
> >> >>> --------------------------------------
> >> >>>
> >> >>> node distances:
> >> >>> node   0    1    2    3
> >> >>>      0  10   20   20  120
> >> >>>      1  20   10  120  100
> >> >>>      2  20  120   10  100
> >> >>>      3  120 100  100   10
> >> >>>
> >> >>> /sys/devices/system/node/memory_tiers
> >> >>> 0-1
> >> >>> 2
> >> >>> 3
> >> >>>
> >> >>> N_TOPTIER_MEMORY: 0-1
> >> >>>
> >> >>>
> >> >>> In this case, we want to be able to "skip" the demotion path
> >> >>> from Node 1 to Node 2, and make demotion go directly to Node 3
> >> >>> as it is closer, distance wise.  How can we accommodate this
> >> >>> scenario (or at least not rule it out as future work) with the
> >> >>> current RFC?
> >> >> If I remember correctly NUMA distance is hardcoded in SLIT by
> >> >> the firmware, it is supposed to reflect the latency. So I
> >> >> suppose it is the firmware's responsibility to have correct
> >> >> information. And the RFC assumes higher tier memory has better
> >> >> performance than lower tier memory (latency, bandwidth,
> >> >> throughput, etc), so it sounds like a buggy firmware to have
> >> >> lower tier memory with shorter distance than higher tier memory
> >> >> IMHO.
> >> >
> >> > You are correct if you're assuming the topology is all
> >> > hierarchically symmetric, but unfortunately, in real hardware
> >> > (e.g., my example above) it is not.  The distance/latency between
> >> > two nodes in the same tier and a third node is different.  The
> >> > firmware still provides the correct latency, but putting a node in
> >> > a tier is up to the kernel/user, and is relative: e.g., Node 3
> >> > could belong to tier 1 from Node 1's perspective, but to tier 2
> >> > from Node 0's.
> >> >
> >> >
> >> > A more detailed example (building on my previous one) is when
> >> > having the GPU connected to a switch:
> >> >
> >> > ----------------------------
> >> > | Node 2     PMEM          |
> >> > ----------------------------
> >> >        ^
> >> >        |
> >> > --------------          --------------
> >> > |   Node 0   |          |   Node 1   |
> >> > |  -------   |          |  -------   |
> >> > | |  DDR  |  |          | |  DDR  |  |
> >> > |  -------   |          |  -------   |
> >> > |    CPU     |          |    GPU     |
> >> > --------------          --------------
> >> >         |                  |
> >> >         v                  v
> >> > ----------------------------
> >> > |         Switch           |
> >> > ----------------------------
> >> >         |
> >> >         v
> >> > --------------------------------------
> >> > | Node 3    Large mem                |
> >> > --------------------------------------
> >> >
> >> > Here, demoting from Node 1 to Node 3 directly would be faster as
> >> > it only has to go through one hub, compared to demoting from
> >> > Node 1 to Node 2, where it goes through two hubs.  I hope that
> >> > example clarifies things a little bit.
> >> >
> >>
> >> Alistair mentioned that we want to consider GPU memory to be
> >> expensive and want to demote from GPU to regular DRAM. In that
> >> case for the above case we should end up with
> >>
> >>
> >> tier 0 - > Node3
> >> tier 1 ->  Node0, Node1
> >> tier 2 ->  Node2
> 
> I'm a little bit confused by the tiering here as I don't think it's
> quite what we want. As pointed out GPU memory is expensive and
> therefore we don't want anything demoting to it. That implies it
> should be in the top tier:
> 
> tier 0 -> Node1
> tier 1 -> Node0, Node3
> tier 2 -> Node2
> 
> Hence:
> 
> node 0: allowed=2
> node 1: allowed=0,3,2
> node 2: allowed=empty
> node 3: allowed=2
> 
> Alternatively Node3 could be put in tier 2 which would prevent
> demotion to PMEM via the switch/CPU:
> 
> tier 0 -> Node1
> tier 1 -> Node0
> tier 2 -> Node2, Node3
> 
> node 0: allowed=2,3
> node 1: allowed=0,3,2
> node 2: allowed=empty
> node 3: allowed=empty
> 
Indeed. The scenario I described here is where the GPU can't/doesn't
demote to PMEM, but the CPU can. In this case it would work fine if we
put the GPU (Node 1) in tier 0 and rely on the fallback order.

> Both of these would be an improvement over the current situation
> upstream, which demotes everything to GPU memory and doesn't support
> demoting from the GPU (meaning reclaim on GPU memory pages everything
> to disk).
> 
> >>
> >> Hence
> >>
> >>   node 0: allowed=2
> >>   node 1: allowed=2
> >>   node 2: allowed = empty
> >>   node 3: allowed = 0-1 , based on fallback order 1, 0
> >
> > If we have 3 tiers as defined above, then we'd better have:
> >
> > node 0: allowed = 2
> > node 1: allowed = 2
> > node 2: allowed = empty
> > node 3: allowed = 0-2, based on fallback order: 1,0,2
> >
> > The firmware should provide the node distance values to reflect that
> > PMEM is slowest and should have the largest distance away from node
> > 3.
> 
> Right. In my above example firmware would have to provide reasonable
> distance values to ensure optimal fallback order.
> 
> >> -aneesh
> >>
> >>



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-11  7:49               ` ying.huang
@ 2022-05-11 17:07                 ` Wei Xu
  2022-05-12  1:42                   ` ying.huang
  0 siblings, 1 reply; 57+ messages in thread
From: Wei Xu @ 2022-05-11 17:07 UTC (permalink / raw)
  To: ying.huang
  Cc: Aneesh Kumar K.V, Alistair Popple, Yang Shi, Andrew Morton,
	Dave Hansen, Dan Williams, Linux MM, Greg Thelen, Jagdish Gediya,
	Linux Kernel Mailing List, Davidlohr Bueso, Michal Hocko,
	Baolin Wang, Brice Goglin, Feng Tang, Jonathan Cameron, Tim Chen

On Wed, May 11, 2022 at 12:49 AM ying.huang@intel.com
<ying.huang@intel.com> wrote:
>
> On Tue, 2022-05-10 at 22:30 -0700, Wei Xu wrote:
> > On Tue, May 10, 2022 at 4:38 AM Aneesh Kumar K.V
> > <aneesh.kumar@linux.ibm.com> wrote:
> > >
> > > Alistair Popple <apopple@nvidia.com> writes:
> > >
> > > > Wei Xu <weixugc@google.com> writes:
> > > >
> > > > > On Thu, May 5, 2022 at 5:19 PM Alistair Popple <apopple@nvidia.com> wrote:
> > > > > >
> > > > > > Wei Xu <weixugc@google.com> writes:
> > > > > >
> > > > > > [...]
> > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Tiering Hierarchy Initialization
> > > > > > > > > `=============================='
> > > > > > > > >
> > > > > > > > > By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
> > > > > > > > >
> > > > > > > > > A device driver can remove its memory nodes from the top tier, e.g.
> > > > > > > > > a dax driver can remove PMEM nodes from the top tier.
> > > > > > > >
> > > > > > > > With the topology built by firmware we should not need this.
> > > > > >
> > > > > > I agree that in an ideal world the hierarchy should be built by firmware based
> > > > > > on something like the HMAT. But I also think being able to override this will be
> > > > > > useful in getting there. Therefore a way of overriding the generated hierarchy
> > > > > > would be good, either via sysfs or kernel boot parameter if we don't want to
> > > > > > commit to a particular user interface now.
> > > > > >
> > > > > > However I'm less sure letting device-drivers override this is a good idea. How
> > > > > > > for example would a GPU driver make sure its node is in the top tier? By moving
> > > > > > every node that the driver does not know about out of N_TOPTIER_MEMORY? That
> > > > > > could get messy if say there were two drivers both of which wanted their node to
> > > > > > be in the top tier.
> > > > >
> > > > > The suggestion is to allow a device driver to opt out its memory
> > > > > devices from the top-tier, not the other way around.
> > > >
> > > > So how would demotion work in the case of accelerators then? In that
> > > > case we would want GPU memory to demote to DRAM, but that won't happen
> > > > if both DRAM and GPU memory are in N_TOPTIER_MEMORY and it seems the
> > > > only override available with this proposal would move GPU memory into a
> > > > lower tier, which is the opposite of what's needed there.
> > >
> > > How about we do 3 tiers now. dax kmem devices can be registered to
> > > tier 3. By default all numa nodes can be registered at tier 2 and HBM or
> > > GPU can be enabled to register at tier 1. ?
> >
> > This makes sense.  I will send an updated RFC based on the discussions so far.
>
> Are these tier numbers fixed?  If so, it appears strange that the
> smallest tier number is 0 on some machines, but 1 on some other
> machines.

When the kernel is configured to allow 3 tiers, we can always show all
the 3 tiers. It is just that some tiers (e.g. tier 0) may be empty on
some machines.

BTW, the userspace should not assume a specific meaning of a
particular tier id because it can change depending on the number of
tiers that the kernel is configured with.  For example, the userspace
should not assume that tier-2 always means PMEM nodes.  In a system
with 4 tiers, PMEM nodes may be in tier-3, not tier-2.

> Best Regards,
> Huang, Ying
>
>


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-11 17:07                 ` Wei Xu
@ 2022-05-12  1:42                   ` ying.huang
  2022-05-12  2:39                     ` Wei Xu
  0 siblings, 1 reply; 57+ messages in thread
From: ying.huang @ 2022-05-12  1:42 UTC (permalink / raw)
  To: Wei Xu
  Cc: Aneesh Kumar K.V, Alistair Popple, Yang Shi, Andrew Morton,
	Dave Hansen, Dan Williams, Linux MM, Greg Thelen, Jagdish Gediya,
	Linux Kernel Mailing List, Davidlohr Bueso, Michal Hocko,
	Baolin Wang, Brice Goglin, Feng Tang, Jonathan Cameron, Tim Chen

On Wed, 2022-05-11 at 10:07 -0700, Wei Xu wrote:
> On Wed, May 11, 2022 at 12:49 AM ying.huang@intel.com
> <ying.huang@intel.com> wrote:
> > 
> > On Tue, 2022-05-10 at 22:30 -0700, Wei Xu wrote:
> > > On Tue, May 10, 2022 at 4:38 AM Aneesh Kumar K.V
> > > <aneesh.kumar@linux.ibm.com> wrote:
> > > > 
> > > > Alistair Popple <apopple@nvidia.com> writes:
> > > > 
> > > > > Wei Xu <weixugc@google.com> writes:
> > > > > 
> > > > > > On Thu, May 5, 2022 at 5:19 PM Alistair Popple <apopple@nvidia.com> wrote:
> > > > > > > 
> > > > > > > Wei Xu <weixugc@google.com> writes:
> > > > > > > 
> > > > > > > [...]
> > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Tiering Hierarchy Initialization
> > > > > > > > > > `=============================='
> > > > > > > > > > 
> > > > > > > > > > By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
> > > > > > > > > > 
> > > > > > > > > > A device driver can remove its memory nodes from the top tier, e.g.
> > > > > > > > > > a dax driver can remove PMEM nodes from the top tier.
> > > > > > > > > 
> > > > > > > > > With the topology built by firmware we should not need this.
> > > > > > > 
> > > > > > > I agree that in an ideal world the hierarchy should be built by firmware based
> > > > > > > on something like the HMAT. But I also think being able to override this will be
> > > > > > > useful in getting there. Therefore a way of overriding the generated hierarchy
> > > > > > > would be good, either via sysfs or kernel boot parameter if we don't want to
> > > > > > > commit to a particular user interface now.
> > > > > > > 
> > > > > > > However I'm less sure letting device-drivers override this is a good idea. How
> > > > > > > for example would a GPU driver make sure its node is in the top tier? By moving
> > > > > > > every node that the driver does not know about out of N_TOPTIER_MEMORY? That
> > > > > > > could get messy if say there were two drivers both of which wanted their node to
> > > > > > > be in the top tier.
> > > > > > 
> > > > > > The suggestion is to allow a device driver to opt out its memory
> > > > > > devices from the top-tier, not the other way around.
> > > > > 
> > > > > So how would demotion work in the case of accelerators then? In that
> > > > > case we would want GPU memory to demote to DRAM, but that won't happen
> > > > > if both DRAM and GPU memory are in N_TOPTIER_MEMORY and it seems the
> > > > > only override available with this proposal would move GPU memory into a
> > > > > lower tier, which is the opposite of what's needed there.
> > > > 
> > > > How about we do 3 tiers now. dax kmem devices can be registered to
> > > > tier 3. By default all numa nodes can be registered at tier 2 and HBM or
> > > > GPU can be enabled to register at tier 1. ?
> > > 
> > > This makes sense.  I will send an updated RFC based on the discussions so far.
> > 
> > Are these tier numbers fixed?  If so, it appears strange that the
> > smallest tier number is 0 on some machines, but 1 on some other
> > machines.
> 
> When the kernel is configured to allow 3 tiers, we can always show all
> the 3 tiers. It is just that some tiers (e.g. tier 0) may be empty on
> some machines.

I still think that it's better to have no empty tiers among the
auto-generated memory tiers from the kernel.  Yes, the tier numbers will
not be absolutely stable, but that only happens during system bootup in
practice, so it's not a big issue IMHO.

And, I still think it's better to make only N-1 tiers writable (or even
readable) out of N total tiers.  Consider the case where "tier0" is
written: how do we deal with nodes that were in "tier0" before but not
after the write?  One possible way is to put them into "tierN".  And
while a user is customizing the tiers, the union of the N tiers may not
be complete.

> BTW, the userspace should not assume a specific meaning of a
> particular tier id because it can change depending on the number of
> tiers that the kernel is configured with.  For example, the userspace
> should not assume that tier-2 always means PMEM nodes.  In a system
> with 4 tiers, PMEM nodes may be in tier-3, not tier-2.

Yes.  This sounds good.

Best Regards,
Huang, Ying



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-12  1:42                   ` ying.huang
@ 2022-05-12  2:39                     ` Wei Xu
  2022-05-12  3:13                       ` ying.huang
  0 siblings, 1 reply; 57+ messages in thread
From: Wei Xu @ 2022-05-12  2:39 UTC (permalink / raw)
  To: ying.huang
  Cc: Aneesh Kumar K.V, Alistair Popple, Yang Shi, Andrew Morton,
	Dave Hansen, Dan Williams, Linux MM, Greg Thelen, Jagdish Gediya,
	Linux Kernel Mailing List, Davidlohr Bueso, Michal Hocko,
	Baolin Wang, Brice Goglin, Feng Tang, Jonathan Cameron, Tim Chen

On Wed, May 11, 2022 at 6:42 PM ying.huang@intel.com
<ying.huang@intel.com> wrote:
>
> On Wed, 2022-05-11 at 10:07 -0700, Wei Xu wrote:
> > On Wed, May 11, 2022 at 12:49 AM ying.huang@intel.com
> > <ying.huang@intel.com> wrote:
> > >
> > > On Tue, 2022-05-10 at 22:30 -0700, Wei Xu wrote:
> > > > On Tue, May 10, 2022 at 4:38 AM Aneesh Kumar K.V
> > > > <aneesh.kumar@linux.ibm.com> wrote:
> > > > >
> > > > > Alistair Popple <apopple@nvidia.com> writes:
> > > > >
> > > > > > Wei Xu <weixugc@google.com> writes:
> > > > > >
> > > > > > > On Thu, May 5, 2022 at 5:19 PM Alistair Popple <apopple@nvidia.com> wrote:
> > > > > > > >
> > > > > > > > Wei Xu <weixugc@google.com> writes:
> > > > > > > >
> > > > > > > > [...]
> > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Tiering Hierarchy Initialization
> > > > > > > > > > > `=============================='
> > > > > > > > > > >
> > > > > > > > > > > By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
> > > > > > > > > > >
> > > > > > > > > > > A device driver can remove its memory nodes from the top tier, e.g.
> > > > > > > > > > > a dax driver can remove PMEM nodes from the top tier.
> > > > > > > > > >
> > > > > > > > > > With the topology built by firmware we should not need this.
> > > > > > > >
> > > > > > > > I agree that in an ideal world the hierarchy should be built by firmware based
> > > > > > > > on something like the HMAT. But I also think being able to override this will be
> > > > > > > > useful in getting there. Therefore a way of overriding the generated hierarchy
> > > > > > > > would be good, either via sysfs or kernel boot parameter if we don't want to
> > > > > > > > commit to a particular user interface now.
> > > > > > > >
> > > > > > > > However I'm less sure letting device-drivers override this is a good idea. How
> > > > > > > > for example would a GPU driver make sure its node is in the top tier? By moving
> > > > > > > > every node that the driver does not know about out of N_TOPTIER_MEMORY? That
> > > > > > > > could get messy if say there were two drivers both of which wanted their node to
> > > > > > > > be in the top tier.
> > > > > > >
> > > > > > > The suggestion is to allow a device driver to opt out its memory
> > > > > > > devices from the top-tier, not the other way around.
> > > > > >
> > > > > > So how would demotion work in the case of accelerators then? In that
> > > > > > case we would want GPU memory to demote to DRAM, but that won't happen
> > > > > > if both DRAM and GPU memory are in N_TOPTIER_MEMORY and it seems the
> > > > > > only override available with this proposal would move GPU memory into a
> > > > > > lower tier, which is the opposite of what's needed there.
> > > > >
> > > > > How about we do 3 tiers now. dax kmem devices can be registered to
> > > > > tier 3. By default all numa nodes can be registered at tier 2 and HBM or
> > > > > GPU can be enabled to register at tier 1. ?
> > > >
> > > > This makes sense.  I will send an updated RFC based on the discussions so far.
> > >
> > > Are these tier numbers fixed?  If so, it appears strange that the
> > > smallest tier number is 0 on some machines, but 1 on some other
> > > machines.
> >
> > When the kernel is configured to allow 3 tiers, we can always show all
> > the 3 tiers. It is just that some tiers (e.g. tier 0) may be empty on
> > some machines.
>
> I still think that it's better to have no empty tiers among the
> auto-generated memory tiers from the kernel.  Yes, the tier numbers will
> not be absolutely stable, but that only happens during system bootup in
> practice, so it's not a big issue IMHO.

It should not be hard to hide empty tiers (e.g. tier-0) if we prefer.
But even if tier-0 is empty, we should still keep this tier in the
kernel and not move DRAM nodes into this tier.  One reason is that a
HBM node might be hot-added into tier-0 at a later time.

> And, I still think it's better to make only N-1 tiers writable (or even
> readable) out of N total tiers.  Consider the case where "tier0" is
> written: how do we deal with nodes that were in "tier0" before but not
> after the write?  One possible way is to put them into "tierN".  And
> while a user is customizing the tiers, the union of the N tiers may not
> be complete.

The sysfs interfaces that I have in mind now are:

* /sys/devices/system/memtier/memtierN/nodelist (N=0, 1, 2)

This is read-only to list the memory nodes for a specific tier.

* /sys/devices/system/node/nodeN/memtier (N=0, 1, ...)

This is a read-write interface. When written, the kernel moves the
node into the user-specified tier.  No other nodes are affected.

This interface should be able to avoid the above issue.
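
For illustration, userspace could consume these proposed files like the
snippet below (an assumption about the eventual interface: the paths and
semantics are only what is proposed in this mail, nothing is merged yet):

#include <stdio.h>

int main(void)
{
        char path[128], buf[256];
        FILE *f;

        /* List the nodes currently in each proposed tier. */
        for (int tier = 0; tier < 3; tier++) {
                snprintf(path, sizeof(path),
                         "/sys/devices/system/memtier/memtier%d/nodelist",
                         tier);
                f = fopen(path, "r");
                if (!f)
                        continue;       /* tier absent or interface missing */
                if (fgets(buf, sizeof(buf), f))
                        printf("memtier%d: %s", tier, buf);
                fclose(f);
        }

        /* Move node 2 into tier 2 via the proposed per-node interface. */
        f = fopen("/sys/devices/system/node/node2/memtier", "w");
        if (f) {
                fprintf(f, "2\n");
                fclose(f);
        }
        return 0;
}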

> > BTW, the userspace should not assume a specific meaning of a
> > particular tier id because it can change depending on the number of
> > tiers that the kernel is configured with.  For example, the userspace
> > should not assume that tier-2 always means PMEM nodes.  In a system
> > with 4 tiers, PMEM nodes may be in tier-3, not tier-2.
>
> Yes.  This sounds good.
>
> Best Regards,
> Huang, Ying
>



* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-11  9:05                   ` Hesham Almatary
@ 2022-05-12  3:02                     ` ying.huang
  0 siblings, 0 replies; 57+ messages in thread
From: ying.huang @ 2022-05-12  3:02 UTC (permalink / raw)
  To: Hesham Almatary, Alistair Popple
  Cc: Wei Xu, Aneesh Kumar K V, Yang Shi, Andrew Morton, Dave Hansen,
	Dan Williams, Linux MM, Greg Thelen, Jagdish Gediya,
	Linux Kernel Mailing List, Davidlohr Bueso, Michal Hocko,
	Baolin Wang, Brice Goglin, Feng Tang, Tim Chen

On Wed, 2022-05-11 at 10:05 +0100, Hesham Almatary wrote:
> On Wed, 11 May 2022 17:12:34 +1000
> Alistair Popple <apopple@nvidia.com> wrote:
> 
> > 
> > Wei Xu <weixugc@google.com> writes:
> > 
> > > On Tue, May 10, 2022 at 5:10 AM Aneesh Kumar K V
> > > <aneesh.kumar@linux.ibm.com> wrote:
> > > > 
> > > > On 5/10/22 3:29 PM, Hesham Almatary wrote:
> > > > > Hello Yang,
> > > > > 
> > > > > On 5/10/2022 4:24 AM, Yang Shi wrote:
> > > > > > On Mon, May 9, 2022 at 7:32 AM Hesham Almatary
> > > > > > <hesham.almatary@huawei.com> wrote:
> > > > 
> > > > 
> > > > ...
> > > > 
> > > > > > > 
> > > > > > > node 0 has a CPU and DDR memory in tier 0, node 1 has GPU and
> > > > > > > DDR memory in tier 0,
> > > > > > > node 2 has NVMM memory in tier 1, node 3 has some sort of
> > > > > > > bigger memory (could be a bigger DDR or something) in tier 2.
> > > > > > > The distances are as follows:
> > > > > > > 
> > > > > > > --------------          --------------
> > > > > > > >   Node 0   |          |   Node 1   |
> > > > > > > >  -------   |          |  -------   |
> > > > > > > > >  DDR  |  |          | |  DDR  |  |
> > > > > > > >  -------   |          |  -------   |
> > > > > > > >            |          |            |
> > > > > > > --------------          --------------
> > > > > > >          | 20               | 120    |
> > > > > > >          v                  v        |
> > > > > > > ----------------------------       |
> > > > > > > > Node 2     PMEM          |       | 100
> > > > > > > ----------------------------       |
> > > > > > >          | 100                       |
> > > > > > >          v                           v
> > > > > > > --------------------------------------
> > > > > > > > Node 3    Large mem                |
> > > > > > > --------------------------------------
> > > > > > > 
> > > > > > > node distances:
> > > > > > > node   0    1    2    3
> > > > > > >      0  10   20   20  120
> > > > > > >      1  20   10  120  100
> > > > > > >      2  20  120   10  100
> > > > > > >      3  120 100  100   10
> > > > > > > 
> > > > > > > /sys/devices/system/node/memory_tiers
> > > > > > > 0-1
> > > > > > > 2
> > > > > > > 3
> > > > > > > 
> > > > > > > N_TOPTIER_MEMORY: 0-1
> > > > > > > 
> > > > > > > 
> > > > > > > In this case, we want to be able to "skip" the demotion path
> > > > > > > from Node 1 to Node 2,
> > > > > > > 
> > > > > > > and make demotion go directly to Node 3 as it is closer,
> > > > > > > distance wise. How can
> > > > > > > 
> > > > > > > we accommodate this scenario (or at least not rule it out as
> > > > > > > future work) with the
> > > > > > > 
> > > > > > > current RFC?
> > > > > > If I remember correctly NUMA distance is hardcoded in SLIT by
> > > > > > the firmware, it is supposed to reflect the latency. So I
> > > > > > suppose it is the firmware's responsibility to have correct
> > > > > > information. And the RFC assumes higher tier memory has better
> > > > > > performance than lower tier memory (latency, bandwidth,
> > > > > > throughput, etc), so it sounds like a buggy firmware to have
> > > > > > lower tier memory with shorter distance than higher tier memory
> > > > > > IMHO.
> > > > > 
> > > > > You are correct if you're assuming the topology is all
> > > > > hierarchically
> > > > > 
> > > > > symmetric, but unfortunately, in real hardware (e.g., my example
> > > > > above)
> > > > > 
> > > > > it is not. The distance/latency between two nodes in the same
> > > > > tier
> > > > > 
> > > > > and a third node, is different. The firmware still provides the
> > > > > correct
> > > > > 
> > > > > latency, but putting a node in a tier is up to the kernel/user,
> > > > > and
> > > > > 
> > > > > is relative: e.g., Node 3 could belong to tier 1 from Node 1's
> > > > > 
> > > > > perspective, but to tier 2 from Node 0's.
> > > > > 
> > > > > 
> > > > > A more detailed example (building on my previous one) is when
> > > > > having
> > > > > 
> > > > > the GPU connected to a switch:
> > > > > 
> > > > > ----------------------------
> > > > > > Node 2     PMEM          |
> > > > > ----------------------------
> > > > >        ^
> > > > >        |
> > > > > --------------          --------------
> > > > > >   Node 0   |          |   Node 1   |
> > > > > >  -------   |          |  -------   |
> > > > > > >  DDR  |  |          | |  DDR  |  |
> > > > > >  -------   |          |  -------   |
> > > > > >    CPU     |          |    GPU     |
> > > > > --------------          --------------
> > > > >         |                  |
> > > > >         v                  v
> > > > > ----------------------------
> > > > > >         Switch           |
> > > > > ----------------------------
> > > > >         |
> > > > >         v
> > > > > --------------------------------------
> > > > > > Node 3    Large mem                |
> > > > > --------------------------------------
> > > > > 
> > > > > Here, demoting from Node 1 to Node 3 directly would be faster as
> > > > > 
> > > > > it only has to go through one hub, compared to demoting from
> > > > > Node 1
> > > > > 
> > > > > to Node 2, where it goes through two hubs. I hope that example
> > > > > 
> > > > > clarifies things a little bit.
> > > > > 
> > > > 
> > > > Alistair mentioned that we want to consider GPU memory to be
> > > > expensive and want to demote from GPU to regular DRAM. In that
> > > > case for the above case we should end up with
> > > > 
> > > > 
> > > > tier 0 - > Node3
> > > > tier 1 ->  Node0, Node1
> > > > tier 2 ->  Node2
> > 
> > I'm a little bit confused by the tiering here as I don't think it's
> > quite what we want. As pointed out GPU memory is expensive and
> > therefore we don't want anything demoting to it. That implies it
> > should be in the top tier:
> > 
> > tier 0 -> Node1
> > tier 1 -> Node0, Node3
> > tier 2 -> Node2
> > 
> > Hence:
> > 
> > node 0: allowed=2
> > node 1: allowed=0,3,2
> > node 2: allowed=empty
> > node 3: allowed=2
> > 
> > Alternatively Node3 could be put in tier 2 which would prevent
> > demotion to PMEM via the switch/CPU:
> > 
> > tier 0 -> Node1
> > tier 1 -> Node0
> > tier 2 -> Node2, Node3
> > 
> > node 0: allowed=2,3
> > node 1: allowed=0,3,2
> > node 2: allowed=empty
> > node 3: allowed=empty
> > 
> Indeed. The scenario I described here is where the GPU can't/don't
> demote to PMEM, but the CPU can. In this case it would work fine if we
> put the GPU (Node 1) in tier 0, and rely on the fallback order.
> 

We can also try to enforce this with NUMA policy and cpusets together
with memory tiers.  We have more than one weapon.
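
For example, a rough sketch (the memtier sysfs layout is only what is
proposed in this thread; numactl --membind and cpusets already exist) of
keeping a job's allocations on the fastest populated tier:

import subprocess
from pathlib import Path

def fastest_populated_tier(base="/sys/devices/system/memtier"):
    """Node list of the lowest-numbered (fastest) non-empty tier."""
    for d in sorted(Path(base).glob("memtier[0-9]*"),
                    key=lambda p: int(p.name[len("memtier"):])):
        nodes = (d / "nodelist").read_text().strip()
        if nodes:
            return nodes
    return None

def run_on_fast_memory(cmd):
    nodes = fastest_populated_tier()
    if nodes is None:
        return subprocess.run(cmd)
    # Bind the job's allocations with numactl; cpuset.mems or
    # set_mempolicy() could be used the same way.
    return subprocess.run(["numactl", f"--membind={nodes}", *cmd])

run_on_fast_memory(["./my_workload"])   # hypothetical workload binary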

Best Regards,
Huang, Ying

> > Both of these would be an improvement over the current situation
> > upstream, which demotes everything to GPU memory and doesn't support
> > demoting from the GPU (meaning reclaim on GPU memory pages everything
> > to disk).
> > 
> > > > 
> > > > Hence
> > > > 
> > > >   node 0: allowed=2
> > > >   node 1: allowed=2
> > > >   node 2: allowed = empty
> > > >   node 3: allowed = 0-1 , based on fallback order 1, 0
> > > 
> > > If we have 3 tiers as defined above, then we'd better to have:
> > > 
> > > node 0: allowed = 2
> > > node 1: allowed = 2
> > > node 2: allowed = empty
> > > node 3: allowed = 0-2, based on fallback order: 1,0,2
> > > 
> > > The firmware should provide the node distance values to reflect that
> > > PMEM is slowest and should have the largest distance away from node
> > > 3.
> > 
> > Right. In my above example firmware would have to provide reasonable
> > distance values to ensure optimal fallback order.
> > 
> > > > -aneesh
> > > > 
> > > > 
> 





* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-12  2:39                     ` Wei Xu
@ 2022-05-12  3:13                       ` ying.huang
  2022-05-12  3:37                         ` Wei Xu
  2022-05-12  6:24                         ` Wei Xu
  0 siblings, 2 replies; 57+ messages in thread
From: ying.huang @ 2022-05-12  3:13 UTC (permalink / raw)
  To: Wei Xu
  Cc: Aneesh Kumar K.V, Alistair Popple, Yang Shi, Andrew Morton,
	Dave Hansen, Dan Williams, Linux MM, Greg Thelen, Jagdish Gediya,
	Linux Kernel Mailing List, Davidlohr Bueso, Michal Hocko,
	Baolin Wang, Brice Goglin, Feng Tang, Jonathan Cameron, Tim Chen

On Wed, 2022-05-11 at 19:39 -0700, Wei Xu wrote:
> On Wed, May 11, 2022 at 6:42 PM ying.huang@intel.com
> <ying.huang@intel.com> wrote:
> > 
> > On Wed, 2022-05-11 at 10:07 -0700, Wei Xu wrote:
> > > On Wed, May 11, 2022 at 12:49 AM ying.huang@intel.com
> > > <ying.huang@intel.com> wrote:
> > > > 
> > > > On Tue, 2022-05-10 at 22:30 -0700, Wei Xu wrote:
> > > > > On Tue, May 10, 2022 at 4:38 AM Aneesh Kumar K.V
> > > > > <aneesh.kumar@linux.ibm.com> wrote:
> > > > > > 
> > > > > > Alistair Popple <apopple@nvidia.com> writes:
> > > > > > 
> > > > > > > Wei Xu <weixugc@google.com> writes:
> > > > > > > 
> > > > > > > > On Thu, May 5, 2022 at 5:19 PM Alistair Popple <apopple@nvidia.com> wrote:
> > > > > > > > > 
> > > > > > > > > Wei Xu <weixugc@google.com> writes:
> > > > > > > > > 
> > > > > > > > > [...]
> > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > Tiering Hierarchy Initialization
> > > > > > > > > > > > `=============================='
> > > > > > > > > > > > 
> > > > > > > > > > > > By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
> > > > > > > > > > > > 
> > > > > > > > > > > > A device driver can remove its memory nodes from the top tier, e.g.
> > > > > > > > > > > > a dax driver can remove PMEM nodes from the top tier.
> > > > > > > > > > > 
> > > > > > > > > > > With the topology built by firmware we should not need this.
> > > > > > > > > 
> > > > > > > > > I agree that in an ideal world the hierarchy should be built by firmware based
> > > > > > > > > on something like the HMAT. But I also think being able to override this will be
> > > > > > > > > useful in getting there. Therefore a way of overriding the generated hierarchy
> > > > > > > > > would be good, either via sysfs or kernel boot parameter if we don't want to
> > > > > > > > > commit to a particular user interface now.
> > > > > > > > > 
> > > > > > > > > However I'm less sure letting device-drivers override this is a good idea. How
> > > > > > > > > > for example would a GPU driver make sure its node is in the top tier? By moving
> > > > > > > > > every node that the driver does not know about out of N_TOPTIER_MEMORY? That
> > > > > > > > > could get messy if say there were two drivers both of which wanted their node to
> > > > > > > > > be in the top tier.
> > > > > > > > 
> > > > > > > > The suggestion is to allow a device driver to opt out its memory
> > > > > > > > devices from the top-tier, not the other way around.
> > > > > > > 
> > > > > > > So how would demotion work in the case of accelerators then? In that
> > > > > > > case we would want GPU memory to demote to DRAM, but that won't happen
> > > > > > > if both DRAM and GPU memory are in N_TOPTIER_MEMORY and it seems the
> > > > > > > only override available with this proposal would move GPU memory into a
> > > > > > > lower tier, which is the opposite of what's needed there.
> > > > > > 
> > > > > > How about we do 3 tiers now. dax kmem devices can be registered to
> > > > > > tier 3. By default all numa nodes can be registered at tier 2 and HBM or
> > > > > > GPU can be enabled to register at tier 1. ?
> > > > > 
> > > > > This makes sense.  I will send an updated RFC based on the discussions so far.
> > > > 
> > > > Are these tier number fixed?  If so, it appears strange that the
> > > > smallest tier number is 0 on some machines, but 1 on some other
> > > > machines.
> > > 
> > > When the kernel is configured to allow 3 tiers, we can always show all
> > > the 3 tiers. It is just that some tiers (e.g. tier 0) may be empty on
> > > some machines.
> > 
> > I still think that it's better to have no empty tiers for auto-generated
> > memory tiers by the kernel.  Yes, the tier number will not be absolutely
> > stable, but that only happens during system bootup in practice, so it's
> > not a big issue IMHO.
> 
> It should not be hard to hide empty tiers (e.g. tier-0) if we prefer.
> But even if tier-0 is empty, we should still keep this tier in the
> kernel and not move DRAM nodes into this tier.  One reason is that a
> HBM node might be hot-added into tier-0 at a later time.
> 

Yes.  The in-kernel representation and the user space interface could be
different.

I have thought about something like the following.  We always make the main
memory (DRAM here, CPU local) tier 0.  Then the slower memory will be
positive, tier 1, 2, 3, ..., and the faster memory will be negative,
tier -1, -2, -3, ....  Then, a GPU driver can register its memory as tier
-1.  And the tier numbers could be more stable.  But I'm not sure whether
users will be happy with negative tier numbers.

> > And, I still think it's better to make only N-1 tiers writable for
> > totally N tiers (or even readable).  Considering "tier0" is written, how
> > to deal with nodes in "tier0" before but not after writing?  One
> > > possible way is to put them into "tierN".  And while a user customizes
> > > the tiers, the union of the "N tiers" may not be complete.
> 
> The sysfs interfaces that I have in mind now are:
> 
> * /sys/devices/system/memtier/memtierN/nodelist (N=0, 1, 2)
> 
> This is read-only to list the memory nodes for a specific tier.
> 
> * /sys/devices/system/node/nodeN/memtier (N=0, 1, ...)
> 
> This is a read-write interface. When written, the kernel moves the
> node into the user-specified tier.  No other nodes are affected.
> 
> This interface should be able to avoid the above issue.

Yes.  This works too.

Best Regards,
Huang, Ying

> > > BTW, the userspace should not assume a specific meaning of a
> > > particular tier id because it can change depending on the number of
> > > tiers that the kernel is configured with.  For example, the userspace
> > > should not assume that tier-2 always means PMEM nodes.  In a system
> > > with 4 tiers, PMEM nodes may be in tier-3, not tier-2.
> > 
> > Yes.  This sounds good.
> > 
> > Best Regards,
> > Huang, Ying
> > 





* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-12  3:13                       ` ying.huang
@ 2022-05-12  3:37                         ` Wei Xu
  2022-05-12  6:24                         ` Wei Xu
  1 sibling, 0 replies; 57+ messages in thread
From: Wei Xu @ 2022-05-12  3:37 UTC (permalink / raw)
  To: ying.huang
  Cc: Aneesh Kumar K.V, Alistair Popple, Yang Shi, Andrew Morton,
	Dave Hansen, Dan Williams, Linux MM, Greg Thelen, Jagdish Gediya,
	Linux Kernel Mailing List, Davidlohr Bueso, Michal Hocko,
	Baolin Wang, Brice Goglin, Feng Tang, Jonathan Cameron, Tim Chen

On Wed, May 11, 2022 at 8:14 PM ying.huang@intel.com
<ying.huang@intel.com> wrote:
>
> On Wed, 2022-05-11 at 19:39 -0700, Wei Xu wrote:
> > On Wed, May 11, 2022 at 6:42 PM ying.huang@intel.com
> > <ying.huang@intel.com> wrote:
> > >
> > > On Wed, 2022-05-11 at 10:07 -0700, Wei Xu wrote:
> > > > On Wed, May 11, 2022 at 12:49 AM ying.huang@intel.com
> > > > <ying.huang@intel.com> wrote:
> > > > >
> > > > > On Tue, 2022-05-10 at 22:30 -0700, Wei Xu wrote:
> > > > > > On Tue, May 10, 2022 at 4:38 AM Aneesh Kumar K.V
> > > > > > <aneesh.kumar@linux.ibm.com> wrote:
> > > > > > >
> > > > > > > Alistair Popple <apopple@nvidia.com> writes:
> > > > > > >
> > > > > > > > Wei Xu <weixugc@google.com> writes:
> > > > > > > >
> > > > > > > > > On Thu, May 5, 2022 at 5:19 PM Alistair Popple <apopple@nvidia.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Wei Xu <weixugc@google.com> writes:
> > > > > > > > > >
> > > > > > > > > > [...]
> > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Tiering Hierarchy Initialization
> > > > > > > > > > > > > `=============================='
> > > > > > > > > > > > >
> > > > > > > > > > > > > By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
> > > > > > > > > > > > >
> > > > > > > > > > > > > A device driver can remove its memory nodes from the top tier, e.g.
> > > > > > > > > > > > > a dax driver can remove PMEM nodes from the top tier.
> > > > > > > > > > > >
> > > > > > > > > > > > With the topology built by firmware we should not need this.
> > > > > > > > > >
> > > > > > > > > > I agree that in an ideal world the hierarchy should be built by firmware based
> > > > > > > > > > on something like the HMAT. But I also think being able to override this will be
> > > > > > > > > > useful in getting there. Therefore a way of overriding the generated hierarchy
> > > > > > > > > > would be good, either via sysfs or kernel boot parameter if we don't want to
> > > > > > > > > > commit to a particular user interface now.
> > > > > > > > > >
> > > > > > > > > > However I'm less sure letting device-drivers override this is a good idea. How
> > > > > > > > > > for example would a GPU driver make sure its node is in the top tier? By moving
> > > > > > > > > > every node that the driver does not know about out of N_TOPTIER_MEMORY? That
> > > > > > > > > > could get messy if say there were two drivers both of which wanted their node to
> > > > > > > > > > be in the top tier.
> > > > > > > > >
> > > > > > > > > The suggestion is to allow a device driver to opt out its memory
> > > > > > > > > devices from the top-tier, not the other way around.
> > > > > > > >
> > > > > > > > So how would demotion work in the case of accelerators then? In that
> > > > > > > > case we would want GPU memory to demote to DRAM, but that won't happen
> > > > > > > > if both DRAM and GPU memory are in N_TOPTIER_MEMORY and it seems the
> > > > > > > > only override available with this proposal would move GPU memory into a
> > > > > > > > lower tier, which is the opposite of what's needed there.
> > > > > > >
> > > > > > > How about we do 3 tiers now. dax kmem devices can be registered to
> > > > > > > tier 3. By default all numa nodes can be registered at tier 2 and HBM or
> > > > > > > GPU can be enabled to register at tier 1. ?
> > > > > >
> > > > > > This makes sense.  I will send an updated RFC based on the discussions so far.
> > > > >
> > > > > Are these tier number fixed?  If so, it appears strange that the
> > > > > smallest tier number is 0 on some machines, but 1 on some other
> > > > > machines.
> > > >
> > > > When the kernel is configured to allow 3 tiers, we can always show all
> > > > the 3 tiers. It is just that some tiers (e.g. tier 0) may be empty on
> > > > some machines.
> > >
> > > I still think that it's better to have no empty tiers for auto-generated
> > > memory tiers by the kernel.  Yes, the tier number will not be absolutely
> > > stable, but that only happens during system bootup in practice, so it's
> > > not a big issue IMHO.
> >
> > It should not be hard to hide empty tiers (e.g. tier-0) if we prefer.
> > But even if tier-0 is empty, we should still keep this tier in the
> > kernel and not move DRAM nodes into this tier.  One reason is that a
> > HBM node might be hot-added into tier-0 at a later time.
> >
>
> Yes.  The in-kernel representation and the user space interface could be
> different.
>
> I have thought something like below.  We always make the main memory
> (DRAM here, CPU local) as tier 0.  Then the slower memory will be
> positive, tier 1, 2, 3, ..., and the faster memory will be negative,
> tier -1, -2, -3, ....  Then, GPU driver can register its memory as tier
> -1.  And the tier number could be more stable.  But I'm not sure whether
> users will be happy with negative tier numbers.

Given that we have agreed that the tier id itself should not carry any
specific meaning to the userspace and what matters is the relative
tier order, I think it is better to avoid negative tier numbers.
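
A sketch of what that means for userspace (assuming the per-node memtier
file proposed earlier in this thread): compare nodes by relative tier
order only, never by a hardcoded tier id.

# Sketch, assuming the proposed /sys/devices/system/node/nodeN/memtier file.
def node_tier(node):
    with open(f"/sys/devices/system/node/node{node}/memtier") as f:
        return int(f.read())

def in_faster_tier(a, b):
    """True if node a sits in a faster tier than node b (a smaller tier
    number means a faster tier, whatever the absolute numbers are)."""
    return node_tier(a) < node_tier(b)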

> > > And, I still think it's better to make only N-1 tiers writable for
> > > totally N tiers (or even readable).  Considering "tier0" is written, how
> > > to deal with nodes in "tier0" before but not after writing?  One
> > > possible way is to put them into "tierN".  And while a user customizes
> > > the tiers, the union of the "N tiers" may not be complete.
> >
> > The sysfs interfaces that I have in mind now are:
> >
> > * /sys/devices/system/memtier/memtierN/nodelist (N=0, 1, 2)
> >
> > This is read-only to list the memory nodes for a specific tier.
> >
> > * /sys/devices/system/node/nodeN/memtier (N=0, 1, ...)
> >
> > This is a read-write interface. When written, the kernel moves the
> > node into the user-specified tier.  No other nodes are affected.
> >
> > This interface should be able to avoid the above issue.
>
> Yes.  This works too.
>
> Best Regards,
> Huang, Ying
>
> > > > BTW, the userspace should not assume a specific meaning of a
> > > > particular tier id because it can change depending on the number of
> > > > tiers that the kernel is configured with.  For example, the userspace
> > > > should not assume that tier-2 always means PMEM nodes.  In a system
> > > > with 4 tiers, PMEM nodes may be in tier-3, not tier-2.
> > >
> > > Yes.  This sounds good.
> > >
> > > Best Regards,
> > > Huang, Ying
> > >
>
>



* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-11  7:12                 ` Alistair Popple
  2022-05-11  9:05                   ` Hesham Almatary
@ 2022-05-12  4:40                   ` Aneesh Kumar K V
  2022-05-12  4:49                     ` Wei Xu
  1 sibling, 1 reply; 57+ messages in thread
From: Aneesh Kumar K V @ 2022-05-12  4:40 UTC (permalink / raw)
  To: Alistair Popple, Wei Xu
  Cc: Hesham Almatary, Yang Shi, Andrew Morton, Dave Hansen,
	Huang Ying, Dan Williams, Linux MM, Greg Thelen, Jagdish Gediya,
	Linux Kernel Mailing List, Davidlohr Bueso, Michal Hocko,
	Baolin Wang, Brice Goglin, Feng Tang, Tim Chen

On 5/11/22 12:42 PM, Alistair Popple wrote:
> 
> Wei Xu <weixugc@google.com> writes:
> 
>> On Tue, May 10, 2022 at 5:10 AM Aneesh Kumar K V
>> <aneesh.kumar@linux.ibm.com> wrote:
>>>
>>> On 5/10/22 3:29 PM, Hesham Almatary wrote:
>>>> Hello Yang,
>>>>
>>>> On 5/10/2022 4:24 AM, Yang Shi wrote:
>>>>> On Mon, May 9, 2022 at 7:32 AM Hesham Almatary
>>>>> <hesham.almatary@huawei.com> wrote:
>>>
>>>
>>> ...
>>>
>>>>>>
>>>>>> node 0 has a CPU and DDR memory in tier 0, node 1 has GPU and DDR memory
>>>>>> in tier 0,
>>>>>> node 2 has NVMM memory in tier 1, node 3 has some sort of bigger memory
>>>>>> (could be a bigger DDR or something) in tier 2. The distances are as
>>>>>> follows:
>>>>>>
>>>>>> --------------          --------------
>>>>>> |   Node 0   |          |   Node 1   |
>>>>>> |  -------   |          |  -------   |
>>>>>> | |  DDR  |  |          | |  DDR  |  |
>>>>>> |  -------   |          |  -------   |
>>>>>> |            |          |            |
>>>>>> --------------          --------------
>>>>>>           | 20               | 120    |
>>>>>>           v                  v        |
>>>>>> ----------------------------       |
>>>>>> | Node 2     PMEM          |       | 100
>>>>>> ----------------------------       |
>>>>>>           | 100                       |
>>>>>>           v                           v
>>>>>> --------------------------------------
>>>>>> | Node 3    Large mem                |
>>>>>> --------------------------------------
>>>>>>
>>>>>> node distances:
>>>>>> node   0    1    2    3
>>>>>>       0  10   20   20  120
>>>>>>       1  20   10  120  100
>>>>>>       2  20  120   10  100
>>>>>>       3  120 100  100   10
>>>>>>
>>>>>> /sys/devices/system/node/memory_tiers
>>>>>> 0-1
>>>>>> 2
>>>>>> 3
>>>>>>
>>>>>> N_TOPTIER_MEMORY: 0-1
>>>>>>
>>>>>>
>>>>>> In this case, we want to be able to "skip" the demotion path from Node 1
>>>>>> to Node 2,
>>>>>>
>>>>>> and make demotion go directly to Node 3 as it is closer, distance wise.
>>>>>> How can
>>>>>>
>>>>>> we accommodate this scenario (or at least not rule it out as future
>>>>>> work) with the
>>>>>>
>>>>>> current RFC?
>>>>> If I remember correctly NUMA distance is hardcoded in SLIT by the
>>>>> firmware, it is supposed to reflect the latency. So I suppose it is
>>>>> the firmware's responsibility to have correct information. And the RFC
>>>>> assumes higher tier memory has better performance than lower tier
>>>>> memory (latency, bandwidth, throughput, etc), so it sounds like a
>>>>> buggy firmware to have lower tier memory with shorter distance than
>>>>> higher tier memory IMHO.
>>>>
>>>> You are correct if you're assuming the topology is all hierarchically
>>>>
>>>> symmetric, but unfortunately, in real hardware (e.g., my example above)
>>>>
>>>> it is not. The distance/latency between two nodes in the same tier
>>>>
>>>> and a third node, is different. The firmware still provides the correct
>>>>
>>>> latency, but putting a node in a tier is up to the kernel/user, and
>>>>
>>>> is relative: e.g., Node 3 could belong to tier 1 from Node 1's
>>>>
>>>> perspective, but to tier 2 from Node 0's.
>>>>
>>>>
>>>> A more detailed example (building on my previous one) is when having
>>>>
>>>> the GPU connected to a switch:
>>>>
>>>> ----------------------------
>>>> | Node 2     PMEM          |
>>>> ----------------------------
>>>>         ^
>>>>         |
>>>> --------------          --------------
>>>> |   Node 0   |          |   Node 1   |
>>>> |  -------   |          |  -------   |
>>>> | |  DDR  |  |          | |  DDR  |  |
>>>> |  -------   |          |  -------   |
>>>> |    CPU     |          |    GPU     |
>>>> --------------          --------------
>>>>          |                  |
>>>>          v                  v
>>>> ----------------------------
>>>> |         Switch           |
>>>> ----------------------------
>>>>          |
>>>>          v
>>>> --------------------------------------
>>>> | Node 3    Large mem                |
>>>> --------------------------------------
>>>>
>>>> Here, demoting from Node 1 to Node 3 directly would be faster as
>>>>
>>>> it only has to go through one hub, compared to demoting from Node 1
>>>>
>>>> to Node 2, where it goes through two hubs. I hope that example
>>>>
>>>> clarifies things a little bit.
>>>>
>>>
>>> Alistair mentioned that we want to consider GPU memory to be expensive
>>> and want to demote from GPU to regular DRAM. In that case for the above
>>> case we should end up with
>>>
>>>
>>> tier 0 - > Node3
>>> tier 1 ->  Node0, Node1
>>> tier 2 ->  Node2
> 
> I'm a little bit confused by the tiering here as I don't think it's
> quite what we want. As pointed out GPU memory is expensive and therefore
> we don't want anything demoting to it. That implies it should be in the
> top tier:
> 


I didn't look closely at the topology and assumed that Node3 is the GPU 
connected to the switch. Hence all the confusion.


> tier 0 -> Node1
> tier 1 -> Node0, Node3
> tier 2 -> Node2
> 
> Hence:
> 
> node 0: allowed=2
> node 1: allowed=0,3,2
> node 2: allowed=empty
> node 3: allowed=2

This looks good as a simple default.

> 
> Alternatively Node3 could be put in tier 2 which would prevent demotion
> to PMEM via the switch/CPU:
> 
> tier 0 -> Node1
> tier 1 -> Node0
> tier 2 -> Node2, Node3
> 
> node 0: allowed=2,3
> node 1: allowed=0,3,2
> node 2: allowed=empty
> node 3: allowed=empty
> 

and this can be configured via userspace?

> Both of these would be an improvement over the current situation
> upstream, which demotes everything to GPU memory and doesn't support
> demoting from the GPU (meaning reclaim on GPU memory pages everything to
> disk).
> 
>>>
>>> Hence
>>>
>>>    node 0: allowed=2
>>>    node 1: allowed=2
>>>    node 2: allowed = empty
>>>    node 3: allowed = 0-1 , based on fallback order 1, 0
>>
>> If we have 3 tiers as defined above, then we'd better to have:
>>
>> node 0: allowed = 2
>> node 1: allowed = 2
>> node 2: allowed = empty
>> node 3: allowed = 0-2, based on fallback order: 1,0,2
>>
>> The firmware should provide the node distance values to reflect that
>> PMEM is slowest and should have the largest distance away from node 3.
> 
> Right. In my above example firmware would have to provide reasonable
> distance values to ensure optimal fallback order.
> 
>>> -aneesh
>>>
>>>




* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-12  4:40                   ` Aneesh Kumar K V
@ 2022-05-12  4:49                     ` Wei Xu
  0 siblings, 0 replies; 57+ messages in thread
From: Wei Xu @ 2022-05-12  4:49 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: Alistair Popple, Hesham Almatary, Yang Shi, Andrew Morton,
	Dave Hansen, Huang Ying, Dan Williams, Linux MM, Greg Thelen,
	Jagdish Gediya, Linux Kernel Mailing List, Davidlohr Bueso,
	Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang, Tim Chen

On Wed, May 11, 2022 at 9:40 PM Aneesh Kumar K V
<aneesh.kumar@linux.ibm.com> wrote:
>
> On 5/11/22 12:42 PM, Alistair Popple wrote:
> >
> > Wei Xu <weixugc@google.com> writes:
> >
> >> On Tue, May 10, 2022 at 5:10 AM Aneesh Kumar K V
> >> <aneesh.kumar@linux.ibm.com> wrote:
> >>>
> >>> On 5/10/22 3:29 PM, Hesham Almatary wrote:
> >>>> Hello Yang,
> >>>>
> >>>> On 5/10/2022 4:24 AM, Yang Shi wrote:
> >>>>> On Mon, May 9, 2022 at 7:32 AM Hesham Almatary
> >>>>> <hesham.almatary@huawei.com> wrote:
> >>>
> >>>
> >>> ...
> >>>
> >>>>>>
> >>>>>> node 0 has a CPU and DDR memory in tier 0, node 1 has GPU and DDR memory
> >>>>>> in tier 0,
> >>>>>> node 2 has NVMM memory in tier 1, node 3 has some sort of bigger memory
> >>>>>> (could be a bigger DDR or something) in tier 2. The distances are as
> >>>>>> follows:
> >>>>>>
> >>>>>> --------------          --------------
> >>>>>> |   Node 0   |          |   Node 1   |
> >>>>>> |  -------   |          |  -------   |
> >>>>>> | |  DDR  |  |          | |  DDR  |  |
> >>>>>> |  -------   |          |  -------   |
> >>>>>> |            |          |            |
> >>>>>> --------------          --------------
> >>>>>>           | 20               | 120    |
> >>>>>>           v                  v        |
> >>>>>> ----------------------------       |
> >>>>>> | Node 2     PMEM          |       | 100
> >>>>>> ----------------------------       |
> >>>>>>           | 100                       |
> >>>>>>           v                           v
> >>>>>> --------------------------------------
> >>>>>> | Node 3    Large mem                |
> >>>>>> --------------------------------------
> >>>>>>
> >>>>>> node distances:
> >>>>>> node   0    1    2    3
> >>>>>>       0  10   20   20  120
> >>>>>>       1  20   10  120  100
> >>>>>>       2  20  120   10  100
> >>>>>>       3  120 100  100   10
> >>>>>>
> >>>>>> /sys/devices/system/node/memory_tiers
> >>>>>> 0-1
> >>>>>> 2
> >>>>>> 3
> >>>>>>
> >>>>>> N_TOPTIER_MEMORY: 0-1
> >>>>>>
> >>>>>>
> >>>>>> In this case, we want to be able to "skip" the demotion path from Node 1
> >>>>>> to Node 2,
> >>>>>>
> >>>>>> and make demotion go directly to Node 3 as it is closer, distance wise.
> >>>>>> How can
> >>>>>>
> >>>>>> we accommodate this scenario (or at least not rule it out as future
> >>>>>> work) with the
> >>>>>>
> >>>>>> current RFC?
> >>>>> If I remember correctly NUMA distance is hardcoded in SLIT by the
> >>>>> firmware, it is supposed to reflect the latency. So I suppose it is
> >>>>> the firmware's responsibility to have correct information. And the RFC
> >>>>> assumes higher tier memory has better performance than lower tier
> >>>>> memory (latency, bandwidth, throughput, etc), so it sounds like a
> >>>>> buggy firmware to have lower tier memory with shorter distance than
> >>>>> higher tier memory IMHO.
> >>>>
> >>>> You are correct if you're assuming the topology is all hierarchically
> >>>>
> >>>> symmetric, but unfortunately, in real hardware (e.g., my example above)
> >>>>
> >>>> it is not. The distance/latency between two nodes in the same tier
> >>>>
> >>>> and a third node, is different. The firmware still provides the correct
> >>>>
> >>>> latency, but putting a node in a tier is up to the kernel/user, and
> >>>>
> >>>> is relative: e.g., Node 3 could belong to tier 1 from Node 1's
> >>>>
> >>>> perspective, but to tier 2 from Node 0's.
> >>>>
> >>>>
> >>>> A more detailed example (building on my previous one) is when having
> >>>>
> >>>> the GPU connected to a switch:
> >>>>
> >>>> ----------------------------
> >>>> | Node 2     PMEM          |
> >>>> ----------------------------
> >>>>         ^
> >>>>         |
> >>>> --------------          --------------
> >>>> |   Node 0   |          |   Node 1   |
> >>>> |  -------   |          |  -------   |
> >>>> | |  DDR  |  |          | |  DDR  |  |
> >>>> |  -------   |          |  -------   |
> >>>> |    CPU     |          |    GPU     |
> >>>> --------------          --------------
> >>>>          |                  |
> >>>>          v                  v
> >>>> ----------------------------
> >>>> |         Switch           |
> >>>> ----------------------------
> >>>>          |
> >>>>          v
> >>>> --------------------------------------
> >>>> | Node 3    Large mem                |
> >>>> --------------------------------------
> >>>>
> >>>> Here, demoting from Node 1 to Node 3 directly would be faster as
> >>>>
> >>>> it only has to go through one hub, compared to demoting from Node 1
> >>>>
> >>>> to Node 2, where it goes through two hubs. I hope that example
> >>>>
> >>>> clarifies things a little bit.
> >>>>
> >>>
> >>> Alistair mentioned that we want to consider GPU memory to be expensive
> >>> and want to demote from GPU to regular DRAM. In that case for the above
> >>> case we should end up with
> >>>
> >>>
> >>> tier 0 - > Node3
> >>> tier 1 ->  Node0, Node1
> >>> tier 2 ->  Node2
> >
> > I'm a little bit confused by the tiering here as I don't think it's
> > quite what we want. As pointed out GPU memory is expensive and therefore
> > we don't want anything demoting to it. That implies it should be in the
> > top tier:
> >
>
>
> I didn't look closely at the topology and assumed that Node3 is the GPU
> connected to the switch. Hence all the confusion.
>
>
> > tier 0 -> Node1
> > tier 1 -> Node0, Node3
> > tier 2 -> Node2
> >
> > Hence:
> >
> > node 0: allowed=2
> > node 1: allowed=0,3,2
> > node 2: allowed=empty
> > node 3: allowed=2
>
> looks good to be default and simple.
>
> >
> > Alternatively Node3 could be put in tier 2 which would prevent demotion
> > to PMEM via the switch/CPU:
> >
> > tier 0 -> Node1
> > tier 1 -> Node0
> > tier 2 -> Node2, Node3
> >
> > node 0: allowed=2,3
> > node 1: allowed=0,3,2
> > node 2: allowed=empty
> > node 3: allowed=empty
> >
>
> and this can be configured via userspace?

The per-node tier customization interface that I just mentioned should
support such reconfigurations.
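
For instance, the alternative layout above (Node 3 in tier 2) would be a
single write through the proposed per-node file; only node 3 is affected
and the kernel is expected to rebuild the demotion targets.  A sketch
(the path is the one proposed in this thread):

# Sketch: move node 3 into tier 2 via the proposed per-node sysfs file.
with open("/sys/devices/system/node/node3/memtier", "w") as f:
    f.write("2\n")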

> > Both of these would be an improvement over the current situation
> > upstream, which demotes everything to GPU memory and doesn't support
> > demoting from the GPU (meaning reclaim on GPU memory pages everything to
> > disk).
> >
> >>>
> >>> Hence
> >>>
> >>>    node 0: allowed=2
> >>>    node 1: allowed=2
> >>>    node 2: allowed = empty
> >>>    node 3: allowed = 0-1 , based on fallback order 1, 0
> >>
> >> If we have 3 tiers as defined above, then we'd better to have:
> >>
> >> node 0: allowed = 2
> >> node 1: allowed = 2
> >> node 2: allowed = empty
> >> node 3: allowed = 0-2, based on fallback order: 1,0,2
> >>
> >> The firmware should provide the node distance values to reflect that
> >> PMEM is slowest and should have the largest distance away from node 3.
> >
> > Right. In my above example firmware would have to provide reasonable
> > distance values to ensure optimal fallback order.
> >
> >>> -aneesh
> >>>
> >>>
>



* Re: RFC: Memory Tiering Kernel Interfaces
  2022-05-12  3:13                       ` ying.huang
  2022-05-12  3:37                         ` Wei Xu
@ 2022-05-12  6:24                         ` Wei Xu
  1 sibling, 0 replies; 57+ messages in thread
From: Wei Xu @ 2022-05-12  6:24 UTC (permalink / raw)
  To: ying.huang
  Cc: Aneesh Kumar K.V, Alistair Popple, Yang Shi, Andrew Morton,
	Dave Hansen, Dan Williams, Linux MM, Greg Thelen, Jagdish Gediya,
	Linux Kernel Mailing List, Davidlohr Bueso, Michal Hocko,
	Baolin Wang, Brice Goglin, Feng Tang, Jonathan Cameron, Tim Chen

On Wed, May 11, 2022 at 8:14 PM ying.huang@intel.com
<ying.huang@intel.com> wrote:
>
> On Wed, 2022-05-11 at 19:39 -0700, Wei Xu wrote:
> > On Wed, May 11, 2022 at 6:42 PM ying.huang@intel.com
> > <ying.huang@intel.com> wrote:
> > >
> > > On Wed, 2022-05-11 at 10:07 -0700, Wei Xu wrote:
> > > > On Wed, May 11, 2022 at 12:49 AM ying.huang@intel.com
> > > > <ying.huang@intel.com> wrote:
> > > > >
> > > > > On Tue, 2022-05-10 at 22:30 -0700, Wei Xu wrote:
> > > > > > On Tue, May 10, 2022 at 4:38 AM Aneesh Kumar K.V
> > > > > > <aneesh.kumar@linux.ibm.com> wrote:
> > > > > > >
> > > > > > > Alistair Popple <apopple@nvidia.com> writes:
> > > > > > >
> > > > > > > > Wei Xu <weixugc@google.com> writes:
> > > > > > > >
> > > > > > > > > On Thu, May 5, 2022 at 5:19 PM Alistair Popple <apopple@nvidia.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Wei Xu <weixugc@google.com> writes:
> > > > > > > > > >
> > > > > > > > > > [...]
> > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Tiering Hierarchy Initialization
> > > > > > > > > > > > > `=============================='
> > > > > > > > > > > > >
> > > > > > > > > > > > > By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
> > > > > > > > > > > > >
> > > > > > > > > > > > > A device driver can remove its memory nodes from the top tier, e.g.
> > > > > > > > > > > > > a dax driver can remove PMEM nodes from the top tier.
> > > > > > > > > > > >
> > > > > > > > > > > > With the topology built by firmware we should not need this.
> > > > > > > > > >
> > > > > > > > > > I agree that in an ideal world the hierarchy should be built by firmware based
> > > > > > > > > > on something like the HMAT. But I also think being able to override this will be
> > > > > > > > > > useful in getting there. Therefore a way of overriding the generated hierarchy
> > > > > > > > > > would be good, either via sysfs or kernel boot parameter if we don't want to
> > > > > > > > > > commit to a particular user interface now.
> > > > > > > > > >
> > > > > > > > > > However I'm less sure letting device-drivers override this is a good idea. How
> > > > > > > > > > for example would a GPU driver make sure its node is in the top tier? By moving
> > > > > > > > > > every node that the driver does not know about out of N_TOPTIER_MEMORY? That
> > > > > > > > > > could get messy if say there were two drivers both of which wanted their node to
> > > > > > > > > > be in the top tier.
> > > > > > > > >
> > > > > > > > > The suggestion is to allow a device driver to opt out its memory
> > > > > > > > > devices from the top-tier, not the other way around.
> > > > > > > >
> > > > > > > > So how would demotion work in the case of accelerators then? In that
> > > > > > > > case we would want GPU memory to demote to DRAM, but that won't happen
> > > > > > > > if both DRAM and GPU memory are in N_TOPTIER_MEMORY and it seems the
> > > > > > > > only override available with this proposal would move GPU memory into a
> > > > > > > > lower tier, which is the opposite of what's needed there.
> > > > > > >
> > > > > > > How about we do 3 tiers now. dax kmem devices can be registered to
> > > > > > > tier 3. By default all numa nodes can be registered at tier 2 and HBM or
> > > > > > > GPU can be enabled to register at tier 1. ?
> > > > > >
> > > > > > This makes sense.  I will send an updated RFC based on the discussions so far.
> > > > >
> > > > > Are these tier number fixed?  If so, it appears strange that the
> > > > > smallest tier number is 0 on some machines, but 1 on some other
> > > > > machines.
> > > >
> > > > When the kernel is configured to allow 3 tiers, we can always show all
> > > > the 3 tiers. It is just that some tiers (e.g. tier 0) may be empty on
> > > > some machines.
> > >
> > > I still think that it's better to have no empty tiers for auto-generated
> > > memory tiers by the kernel.  Yes, the tier number will not be absolutely
> > > stable, but that only happens during system bootup in practice, so it's
> > > not a big issue IMHO.
> >
> > It should not be hard to hide empty tiers (e.g. tier-0) if we prefer.
> > But even if tier-0 is empty, we should still keep this tier in the
> > kernel and not move DRAM nodes into this tier.  One reason is that a
> > HBM node might be hot-added into tier-0 at a later time.
> >
>
> Yes.  The in-kernel representation and the user space interface could be
> different.
>
> I have thought something like below.  We always make the main memory
> (DRAM here, CPU local) as tier 0.  Then the slower memory will be
> positive, tier 1, 2, 3, ..., and the faster memory will be negative,
> tier -1, -2, -3, ....  Then, GPU driver can register its memory as tier
> -1.  And the tier number could be more stable.  But I'm not sure whether
> users will be happy with negative tier numbers.
>
> > > And, I still think it's better to make only N-1 tiers writable for
> > > totally N tiers (or even readable).  Considering "tier0" is written, how
> > > to deal with nodes in "tier0" before but not after writing?  One
> > > possible way is to put them into "tierN".  And while a user customizes
> > > the tiers, the union of the "N tiers" may not be complete.
> >
> > The sysfs interfaces that I have in mind now are:
> >
> > * /sys/devices/system/memtier/memtierN/nodelist (N=0, 1, 2)
> >
> > This is read-only to list the memory nodes for a specific tier.
> >
> > * /sys/devices/system/node/nodeN/memtier (N=0, 1, ...)
> >
> > This is a read-write interface. When written, the kernel moves the
> > node into the user-specified tier.  No other nodes are affected.
> >
> > This interface should be able to avoid the above issue.
>
> Yes.  This works too.

FYI, I have just sent out an updated RFC with the above sysfs interfaces.

> Best Regards,
> Huang, Ying
>
> > > > BTW, the userspace should not assume a specific meaning of a
> > > > particular tier id because it can change depending on the number of
> > > > tiers that the kernel is configured with.  For example, the userspace
> > > > should not assume that tier-2 always means PMEM nodes.  In a system
> > > > with 4 tiers, PMEM nodes may be in tier-3, not tier-2.
> > >
> > > Yes.  This sounds good.
> > >
> > > Best Regards,
> > > Huang, Ying
> > >
>
>



end of thread, other threads:[~2022-05-12  6:25 UTC | newest]

Thread overview: 57+ messages
2022-04-30  2:10 RFC: Memory Tiering Kernel Interfaces Wei Xu
2022-04-30  3:59 ` Yang Shi
2022-04-30  6:37   ` Wei Xu
2022-05-06  0:01     ` Alistair Popple
2022-05-10  4:32       ` Wei Xu
2022-05-10  5:37         ` Alistair Popple
2022-05-10 11:38           ` Aneesh Kumar K.V
2022-05-11  5:30             ` Wei Xu
2022-05-11  7:34               ` Alistair Popple
2022-05-11  7:49               ` ying.huang
2022-05-11 17:07                 ` Wei Xu
2022-05-12  1:42                   ` ying.huang
2022-05-12  2:39                     ` Wei Xu
2022-05-12  3:13                       ` ying.huang
2022-05-12  3:37                         ` Wei Xu
2022-05-12  6:24                         ` Wei Xu
2022-05-06 18:56     ` Yang Shi
2022-05-09 14:32       ` Hesham Almatary
2022-05-10  3:24         ` Yang Shi
2022-05-10  9:59           ` Hesham Almatary
2022-05-10 12:10             ` Aneesh Kumar K V
2022-05-11  5:42               ` Wei Xu
2022-05-11  7:12                 ` Alistair Popple
2022-05-11  9:05                   ` Hesham Almatary
2022-05-12  3:02                     ` ying.huang
2022-05-12  4:40                   ` Aneesh Kumar K V
2022-05-12  4:49                     ` Wei Xu
2022-05-10  4:22         ` Wei Xu
2022-05-10 10:01           ` Hesham Almatary
2022-05-10 11:44           ` Aneesh Kumar K.V
2022-05-01 18:35   ` Dan Williams
2022-05-03  6:36     ` Wei Xu
2022-05-06 19:05     ` Yang Shi
2022-05-07  7:56     ` ying.huang
2022-05-01 17:58 ` Davidlohr Bueso
2022-05-02  1:04   ` David Rientjes
2022-05-02  7:23   ` Aneesh Kumar K.V
2022-05-03  2:07   ` Baolin Wang
2022-05-03  6:06   ` Wei Xu
2022-05-03 17:14   ` Alistair Popple
2022-05-03 17:47     ` Dave Hansen
2022-05-03 22:35       ` Alistair Popple
2022-05-03 23:54         ` Dave Hansen
2022-05-04  1:31           ` Wei Xu
2022-05-04 17:02             ` Dave Hansen
2022-05-05  6:35               ` Wei Xu
2022-05-05 14:24                 ` Dave Hansen
2022-05-10  4:43                   ` Wei Xu
2022-05-02  6:25 ` Aneesh Kumar K.V
2022-05-03  7:02   ` Wei Xu
2022-05-02 15:20 ` Dave Hansen
2022-05-03  7:19   ` Wei Xu
2022-05-03 19:12 ` Tim Chen
2022-05-05  7:02   ` Wei Xu
2022-05-05  8:57 ` ying.huang
2022-05-05 23:57 ` Alistair Popple
2022-05-06  0:25   ` Alistair Popple
